PlaneText is developed to let users analyze structured text with various natural language processing (NLP) tools. The current version provides a tool that converts text structured as XML into plain text sequences that can serve as input to NLP tools. In a future update, we will implement alignment of NLP tool output with the original input documents, so that users can pass the obtained analyses on to a broader range of applications or to higher-layer subsequent analysis.
Input
PlaneText currently assumes that the input is an XML or XML-like tagged document (or a collection of such documents), in which tags give the text structure beyond a plain word sequence. (The GUI tool of ver. 0.0.1 can show previews only for XHTML documents, although tag classification and conversion are available for various XML and XML-like documents. We will remove this limitation in a future update.)
Output
For each input document, three files (*.ann, *.txt, *.xhtml) are generated in the specified output directory. *.ann and *.xhtml are reserved for future updates of PlaneText.
*.txt contains the obtained plain text sequences, which consist of words, with dummy words standing in for Object-tagged regions. The output style of *.txt can be changed through settings in config.yaml. (See also the explanation of tag classification in About PlaneText.)
(A. enable mark_displacement in config.yaml)
You can add markers that explicitly show the location from which each Independent-tagged region was separated, by specifying the following in config.yaml:
mark_displacement: true
For example, assume the input:
The UI is more useful than XYZ<footnote>Notice that … .</footnote> and … .
If the “footnote” tag is classified as “Independent”, the output plain text sequences are as follows:
The UI is more useful than XYZ__FOOTNOTE_1__ and … . (<– The location from which the “footnote” region was separated is represented by the marker __FOOTNOTE_1__.)
__FOOTNOTE_1__ (<– The separated region is output below this marker.)
Notice that … .
By referring to the markers, the original structure of an input document is easy to grasp; however, the sequences cannot be input directly to NLP tools because they contain extra strings such as __FOOTNOTE_1__. Some pre-/postprocessing is therefore required.
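As an illustration of such pre-/postprocessing, the following Python sketch removes markers from the main text while recording where they stood, and collects the separated regions by marker. It is not part of PlaneText. The marker pattern is an assumption inferred from the single example __FOOTNOTE_1__, as is the rule that a line consisting only of a marker introduces the separated region; the names strip_markers and split_output are hypothetical.

```python
import re

# Assumed marker shape, inferred from the example __FOOTNOTE_1__ only:
# "__" + upper-case tag name + "_" + running number + "__".
MARKER = re.compile(r"__([A-Z][A-Z0-9]*)_(\d+)__")


def strip_markers(line):
    """Return (clean_line, markers); markers is a list of (offset, marker).

    Each offset is a position in clean_line, so annotations produced by an
    NLP tool on clean_line can later be related to the separated region.
    """
    parts, markers = [], []
    last = 0        # end of the previous marker in the original line
    clean_len = 0   # length of the cleaned text collected so far
    for m in MARKER.finditer(line):
        parts.append(line[last:m.start()])
        clean_len += m.start() - last
        markers.append((clean_len, m.group(0)))
        last = m.end()
    parts.append(line[last:])
    return "".join(parts), markers


def split_output(lines):
    """Split marker-style output into main lines and separated regions.

    Assumption: a line holding only a marker starts a separated region,
    which runs until the next marker-only line or the end of the input.
    """
    main, regions, current = [], {}, None
    for line in lines:
        stripped = line.strip()
        if MARKER.fullmatch(stripped):
            current = stripped
            regions[current] = []
        elif current is None:
            main.append(line)
        else:
            regions[current].append(line)
    return main, regions
```

Applied to the example above, split_output would put the first line into main and the footnote text under the key __FOOTNOTE_1__; strip_markers would then turn the first line into a marker-free sentence plus the offset at which the footnote was attached.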
(B. disable mark_displacement in config.yaml)
Alternatively, you can simply separate the regions enclosed by “Independent” tags, without inserting markers, by specifying the following in config.yaml:
mark_displacement: false
For example, assume again the following input:
The UI is more useful than XYZ<footnote>Notice that … .</footnote> and … .
If the “footnote” tag is classified as “Independent”, the output plain text sequences are as follows:
The UI is more useful than XYZ and … .
Notice that … .
These sequences can be input directly to NLP tools, but the original structure of the input document can no longer be tracked.
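The difference between settings A and B can be reproduced with a toy model. The Python sketch below is not PlaneText's implementation: the function name displace is hypothetical, it handles only non-nested tags without attributes, and the marker format is assumed from the example __FOOTNOTE_1__.

```python
import re


def displace(text, tag, mark_displacement):
    """Toy model: move <tag>...</tag> regions out of the running text.

    Returns the output lines: the main text first, then each separated
    region, preceded by its marker when mark_displacement is true.
    """
    pattern = re.compile(
        r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL
    )
    displaced = []  # (marker, region text) in order of appearance

    def replace(match):
        marker = "__{}_{}__".format(tag.upper(), len(displaced) + 1)
        displaced.append((marker, match.group(1)))
        # Leave a marker behind only when marking is enabled.
        return marker if mark_displacement else ""

    lines = [pattern.sub(replace, text)]
    for marker, body in displaced:
        if mark_displacement:
            lines.append(marker)
        lines.append(body)
    return lines


if __name__ == "__main__":
    sample = "The UI is more useful than XYZ<footnote>Notice that.</footnote> and so on."
    for flag in (True, False):
        print("mark_displacement:", flag)
        for line in displace(sample, "footnote", flag):
            print("  " + line)
```

With the flag set to true the main line keeps a marker where the footnote stood, and with it set to false the main line and the footnote text are simply output one after the other.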
About settings of config.yaml
You can change the input/output style of PlaneText through config.yaml in the planetext (install) directory. Currently the only notable setting is the output style described above, but future updates will provide various other options for flexibly changing input and output to suit the user's needs.