PlaneText is a framework that helps you apply any NLP tools you like to real-world documents containing structured text. Currently, a tool is being developed for converting XML-tagged text into plain text sequences that can be input directly to NLP tools.

A downloadable tool package will be released in October 2014.

You can try an online demo here.


Basic concept: converting XML documents into plain text based on tag classification

Real-world documents do not simply consist of sequences of sentences; they are “structured”. Some are divided into sections and chapters or carry footnotes. Some embed expressions that cannot be represented by ordinary characters, such as mathematical expressions, figures, or tables, within sentences. Some change the size or color of certain characters. Others embed information that is not displayed but is used for managing the document.

Most natural language processing (NLP) tools, on the other hand, assume that the input text consists of sequential sentences. A user of the tools therefore has to convert every target document into text sequences that the tools can accept, which is tedious and delicate work. Some users may give up on the tools; others may force structured text into them as is and fail to get the performance the tools can deliver.

PlaneText is developed to remove the barriers and problems that arise when people try to apply NLP tools to real-world documents.

The current version assumes that text is structured by XML tags, and provides a framework for converting structured text into plain text sequences that can be input to NLP tools, based on a classification of the observed tags into four types: Independent, Decoration, Object, and Meta-info tags. Although human labor is still required for tag classification, PlaneText eases that labor both with an efficient classification procedure and with two types of interface for the procedure: GUI-based and command line-based.

Four functional types of XML tags

PlaneText assumes that the structure of text is represented by XML tags, and classifies XML tags into four functional types.

  • Independent: regard a tagged region as separate from surrounding text
  • Decoration: change the displayed style of tagged region
  • Object: introduce some non-natural-language structure into text
  • Meta-info: insert some information on text which is not displayed

Take scientific papers as an example. Each section, title, footnote, and so on is a complete piece of text in itself, even though such a region can be embedded within another sentence. “Independent” tags are used to separate such regions properly from the surrounding text.

Changing the size, color, or font of characters in a region, or making a region link to another location, on the other hand, does not separate sentences; it merely decorates or emphasizes the region. The tags that provide such functions, that is, “Decoration” tags, can therefore be ignored. (Such decoration or emphasis can actually suggest word boundaries and the like, which can be useful information for NLP tools. We plan to make use of this information in a future update.)

Mathematical expressions, figures, tables, itemizations, etc. consist not of simple characters but of special structures, symbols, images, and so on, and therefore cannot be analyzed by NLP tools. In PlaneText, such structures are called “Objects”, and the tags introducing them are called “Object” tags. (Object tags can in fact contain text, which will be handled in a future update.)

The above three types of tags represent elements that are displayed. Text sometimes also contains elements that are not displayed, such as bookmarks for building an index or information for managing documents. The tags representing such regions are called “Meta-info” tags.
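A classification of this kind can be recorded as a simple mapping from tag names to types. The following is a minimal sketch in Python, not PlaneText's actual data format: the tag names (sec, title, fn, b, and so on) and the helper unclassified_tags are hypothetical and only mirror the scientific-paper examples above.

```python
from enum import Enum


class TagType(Enum):
    INDEPENDENT = "independent"  # region is separate from the surrounding text
    DECORATION = "decoration"    # only changes the displayed style
    OBJECT = "object"            # introduces a non-natural-language structure
    META_INFO = "meta-info"      # information that is not displayed


# Hypothetical classification for an imaginary paper format.
CLASSIFICATION = {
    "sec": TagType.INDEPENDENT,
    "title": TagType.INDEPENDENT,
    "fn": TagType.INDEPENDENT,   # footnote
    "b": TagType.DECORATION,     # bold
    "a": TagType.DECORATION,     # link
    "math": TagType.OBJECT,
    "table": TagType.OBJECT,
    "idx": TagType.META_INFO,    # index bookmark
}


def unclassified_tags(observed, classification):
    """Return the observed tag names that still need a human decision."""
    return sorted(set(observed) - set(classification))
```

For example, unclassified_tags(["b", "sup", "math", "sup"], CLASSIFICATION) reports that only "sup" remains to be classified.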

In PlaneText, once the tags in the documents are classified into the above four types, structured documents are converted into plain text sequences that can be input directly to NLP tools by “extracting each of the text regions enclosed by Independent tags”, “removing Decoration tags”, “replacing the regions enclosed by Object tags with dummy words”, and “removing the regions enclosed by Meta-info tags”.
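The four conversion rules can be illustrated with a short sketch built on Python's standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not PlaneText's implementation: the function convert, the dummy word "OBJ", the sample document, and its tag names are all hypothetical, and a tag missing from the classification simply raises KeyError.

```python
import xml.etree.ElementTree as ET

DUMMY = "OBJ"  # assumed dummy word; the text does not say which word is used


def _clean(pieces):
    """Join text pieces and collapse runs of whitespace."""
    return " ".join("".join(pieces).split())


def convert(xml_text, rules):
    """Return plain text sequences, one per Independent region."""
    sequences = [""]  # slot 0 holds text outside any Independent region

    def walk(elem, buf):
        kind = rules[elem.tag]  # KeyError if the tag has not been classified
        if kind == "independent":
            slot = len(sequences)  # reserve a slot to keep document order
            sequences.append("")
            own = [elem.text or ""]
            for child in elem:
                walk(child, own)
            sequences[slot] = _clean(own)
        elif kind == "decoration":
            buf.append(elem.text or "")  # drop the tag, keep its text inline
            for child in elem:
                walk(child, buf)
        elif kind == "object":
            buf.append(DUMMY)  # replace the whole region with the dummy word
        elif kind == "meta-info":
            pass  # remove the whole region
        else:
            raise ValueError(f"unknown tag type: {kind}")
        # Text after the closing tag always belongs to the enclosing region.
        buf.append(elem.tail or "")

    top = []
    walk(ET.fromstring(xml_text), top)
    sequences[0] = _clean(top)
    return [s for s in sequences if s]


DOC = ("<p>We use <b>fast</b> solvers<fn>See Sec. 2.</fn> for "
       "<math>x^2</math><idx>solver</idx>.</p>")
RULES = {"p": "independent", "fn": "independent", "b": "decoration",
         "math": "object", "idx": "meta-info"}

print(convert(DOC, RULES))
# ['We use fast solvers for OBJ.', 'See Sec. 2.']
```

The footnote becomes its own sequence instead of interrupting the sentence it was embedded in, the bold tag disappears while its text stays, the formula is replaced by the dummy word, and the index bookmark is dropped.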

In many cases, each organization or publisher issues or holds its documents in a single tag format, so once the tags of that format have been classified, a large number of documents in the same format can be processed.

Reflecting the user’s intention in tag classification

A document can consist of various textual regions separated by XML tags, and which part of the document to analyze with NLP tools (body text with titles excluded, the publications in a bibliography, and so on) can accordingly vary among users.

In PlaneText, users’ differing demands can be reflected by changing the classification of tags. You can ignore titles in your NLP analysis by classifying the tags enclosing titles as Meta-info tags, or you can focus only on a publication list by classifying all tags other than the ones that enclose bibliography sections as Meta-info tags.
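The title example can be sketched as follows. This is a deliberately simplified illustration, assuming a hypothetical helper region_texts that only distinguishes Meta-info tags from all other tags; the sample document and tag names are likewise made up.

```python
import xml.etree.ElementTree as ET


def region_texts(elem, rules):
    """Collect non-empty text pieces, skipping Meta-info regions entirely.

    Simplification: every tag that is not Meta-info is treated alike here.
    """
    if rules.get(elem.tag) == "meta-info":
        return []
    texts = []
    if elem.text and elem.text.strip():
        texts.append(elem.text.strip())
    for child in elem:
        texts.extend(region_texts(child, rules))
        if child.tail and child.tail.strip():
            texts.append(child.tail.strip())
    return texts


DOC = "<sec><title>Methods</title><p>We parse each sentence.</p></sec>"
root = ET.fromstring(DOC)

# One user wants titles analyzed; another reclassifies them as Meta-info.
base = {"sec": "independent", "title": "independent", "p": "independent"}
no_titles = {**base, "title": "meta-info"}

print(region_texts(root, base))       # ['Methods', 'We parse each sentence.']
print(region_texts(root, no_titles))  # ['We parse each sentence.']
```

The document itself is untouched; only the classification changes, which is what lets different users obtain different text from the same source.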

PlaneText thus leaves the final decision on tag classification to the user, while providing a procedure that reduces the user’s labor by suggesting a minimal number of tags to classify. In our experiments, for each of several document formats, the documents could be converted into text sequences suitable for NLP tools by classifying only one-fifth of the tags in the documents [1].