PlaneText provides two implementations: direct execution from the UNIX command line and intuitive execution through a GUI in a web browser. Both use the same format for settings and configuration files, so you can switch between them at your convenience.
Put the directory containing your XML documents anywhere you like, then run the following command in the “planetext” directory:
> ./bin/planetext [filename for saving tag-classification information] [XML documents directory]
PlaneText then checks the XML documents for unclassified tags, using the saved tag-classification information. If the file for saving tag-classification information does not exist, it is generated automatically. After all documents have been checked, the outermost unclassified tags detected in them are reported as follows:
> ./bin/planetext test.yaml data
f000179.xhtml: (1-37893) <xmlns:div id="id52849" class="content">
f000179.xhtml: (37894-37915) <xmlns:div id="id65114" class="footer">
In this example, PlaneText reports that data/f000179.xhtml contains unclassified tags named xmlns:div in the following regions:
- Offset region 1-37893, with attribute “id” set to “id52849” and attribute “class” set to “content”
- Offset region 37894-37915, with attribute “id” set to “id65114” and attribute “class” set to “footer”
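If you want to post-process the report in a script, each reported line can be split into its file name, offset region, tag name and attributes. The following is a minimal sketch that assumes the report lines always follow the layout shown in the example above; `ReportLine` and `parse_report_line` are hypothetical names and are not part of PlaneText.

```python
import re
from typing import Dict, NamedTuple, Optional


class ReportLine(NamedTuple):
    """One reported unclassified tag (hypothetical structure, not a PlaneText type)."""
    filename: str
    start: int
    end: int
    tag: str
    attributes: Dict[str, str]


# Assumed layout: <file>: (<start>-<end>) <tag attr="value" ...>
_LINE = re.compile(
    r'^(?P<file>[^:]+): \((?P<start>\d+)-(?P<end>\d+)\) '
    r'<(?P<tag>[^\s>]+)(?P<attrs>[^>]*)>$'
)
_ATTR = re.compile(r'([\w:-]+)="([^"]*)"')


def parse_report_line(line: str) -> Optional[ReportLine]:
    """Parse one report line; return None if it does not match the assumed layout."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    attributes = dict(_ATTR.findall(match.group("attrs")))
    return ReportLine(
        filename=match.group("file"),
        start=int(match.group("start")),
        end=int(match.group("end")),
        tag=match.group("tag"),
        attributes=attributes,
    )


if __name__ == "__main__":
    sample = 'f000179.xhtml: (1-37893) <xmlns:div id="id52849" class="content">'
    print(parse_report_line(sample))
```

Lines that do not match, such as the echoed command itself, yield `None` and can simply be skipped.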
Based on the report, the user inspects the actual regions in the document f000179.xhtml and decides which of the four types (Independent/Decoration/Object/Meta-info) each tag belongs to. (The user does not need to classify all reported tags and can postpone some decisions. However, to move the process forward, at least one unclassified tag must be classified.)
The chosen classification is added to the configuration file by running the following command, after which PlaneText checks for unclassified tags again:
> ./bin/planetext test.yaml data -i div[class: content]
In this example, the tag div with attribute “class” set to “content” is classified as an independent tag via the option -i (--independent). In this way, you can classify tags in detail, not only by tag name but also by the values of their attributes. For more detail, please refer to
Repeat the above “report on unclassified tags” and “tag classification and recheck” steps until no unclassified tag is reported. After that, each XML document can be converted into plain-text sequences that can be fed to NLP tools with the following command:
> ./bin/planetext test.yaml data -o data-out
In this example, the conversion results are written to the directory “data-out”. For each document, *.ann, *.txt and *.xhtml files are generated, and the plain-text sequences are written to *.txt. The other two files are reserved for future use.
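Once the conversion has finished, the *.txt files can be collected for downstream NLP processing. The sketch below is not part of PlaneText: `load_plain_texts` is a hypothetical helper, and it assumes that the output files are UTF-8 encoded and live somewhere under the output directory (“data-out” in the example).

```python
from pathlib import Path
from typing import Dict


def load_plain_texts(out_dir: str) -> Dict[str, str]:
    """Read every *.txt file under out_dir, keyed by its path relative to out_dir.

    UTF-8 encoding is an assumption; adjust it if your documents differ.
    """
    root = Path(out_dir)
    texts = {}
    # rglob also finds files in sub-directories, in case the output is nested.
    for path in sorted(root.rglob("*.txt")):
        texts[path.relative_to(root).as_posix()] = path.read_text(encoding="utf-8")
    return texts


if __name__ == "__main__":
    for name, text in load_plain_texts("data-out").items():
        print(f"{name}: {len(text)} characters")
```

The *.ann and *.xhtml files are deliberately ignored here, since only *.txt holds the plain-text sequences.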
- You can classify several tags at once by giving the option repeatedly (e.g. “-i html -i body”).
- You can limit the number of documents containing unknown tags that PlaneText reports with the “-l (--limit) [number of documents]” option. This reduces the waiting time before PlaneText updates the unknown tags based on your additional classification. Set the number as needed.
- You can cancel the classification of tags with the -u (--unclassify) option. This requires checking for unclassified tags from scratch and may therefore take some time.
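When PlaneText is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch that uses only the options described above (-i, -l and -o); `build_planetext_args` is a hypothetical helper, not something PlaneText provides. Passing the result to `subprocess.run` as a list avoids shell quoting; typed by hand, a tag specification such as div[class: content] would probably need quotes in most shells, because it contains a space and brackets.

```python
import shlex
from typing import List, Optional, Sequence


def build_planetext_args(
    config: str,
    data_dir: str,
    independent: Sequence[str] = (),
    limit: Optional[int] = None,
    output_dir: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for ./bin/planetext (hypothetical helper)."""
    args = ["./bin/planetext", config, data_dir]
    for spec in independent:
        # One -i option per tag specification, e.g. "-i html -i body".
        args += ["-i", spec]
    if limit is not None:
        args += ["-l", str(limit)]
    if output_dir is not None:
        args += ["-o", output_dir]
    return args


if __name__ == "__main__":
    args = build_planetext_args("test.yaml", "data", independent=["html", "body"], limit=5)
    # shlex.join shows the command as it would have to be typed in a POSIX shell.
    print(shlex.join(args))
    # To actually run it: subprocess.run(args, check=True)
```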
Make a directory named “data” under the PlaneText directory, and put the directory containing the target XML (XHTML) documents under “data”.
> cd planetext
> ls
Compassfile Gemfile Gemfile.lock Guardfile app bin config.ru config.yaml.example lib sessions
> mkdir data
> ls ~/hogehoge/xml_docs/
> cp -r ~/hogehoge/xml_docs data/
To keep PlaneText running permanently, a Rack server such as Passenger is required; otherwise, you can run the GUI version temporarily by executing the following under the PlaneText directory:
> RACK_ENV=production rackup [-p port_number]
After that, open the above port in a web browser (e.g. http://IP address:port_number/) to display the data set (directory) selection menu. If a firewall is set up on the UNIX server, port forwarding is required. In the example in the figure, the directories “xml_docs” and “test” are displayed.
When you choose one of the directories, the GUI tool is displayed in the browser with the target documents already loaded. Tag-classification information is generated automatically for every session and updated as you classify tags.
The upper-left box reports the outermost unclassified tags detected in the target documents, based on the tag classification already given in the configuration file. (For an HTML document, the tag “html” is displayed first.)
Clicking a tag name displays the “Attributes” that accompany the tag. When you choose one of the attributes, the words used in the values of that attribute are listed in “Word”. When you then click one of the words, the observed combinations of words are listed in “Value”. The rightmost box, “Instance”, shows their locations in the documents as file names and offset regions. Clicking one of these locations highlights the region enclosed by the tag in the large lower box, which helps the user judge how to classify the tag.
Tags are classified by dragging and dropping a tag name, attribute or word into the “Independent”/“Decoration”/“Object”/“Meta-info” boxes in the second row, or by typing the i/d/o/m keys. In this way, a user can classify tags in detail, not only by tag name but also by the attributes and values used with the tags. The classification is added to the configuration file, based on which PlaneText rechecks the documents and reports the remaining unclassified tags in the browser.
If the “Autosubmit” button is activated, unclassified tags are rechecked every time the user classifies a tag. If the button is not activated, the user can continue classifying tags without a recheck by PlaneText until the “Submit” button is pushed.
The number specified in the “Docs” box limits the report: PlaneText stops reporting unclassified tags once that many files containing unclassified tags have been found, which shortens the wait for the next report.
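The effect of such a document limit can be illustrated with a short sketch. This is only an illustration of the behaviour described above, not PlaneText's actual implementation; `files_to_report` and the `has_unclassified` callback are hypothetical.

```python
from typing import Callable, Iterable, List


def files_to_report(
    filenames: Iterable[str],
    has_unclassified: Callable[[str], bool],
    limit: int,
) -> List[str]:
    """Return the first `limit` files that contain unclassified tags.

    Scanning stops as soon as the limit is reached, so later files are
    never checked; this is what shortens the waiting time.
    """
    reported: List[str] = []
    for name in filenames:
        if len(reported) >= limit:
            break
        if has_unclassified(name):
            reported.append(name)
    return reported


if __name__ == "__main__":
    unknown = {"b.xhtml", "c.xhtml", "d.xhtml"}
    files = ["a.xhtml", "b.xhtml", "c.xhtml", "d.xhtml"]
    print(files_to_report(files, lambda name: name in unknown, limit=2))
```

With a limit of 2, the last file in this example is never inspected, because two files with unclassified tags have already been found.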
When no unclassified tags are reported, the classification process is finished. Clicking “Date” converts the target documents into plain-text sequences that can be fed to NLP tools and downloads the sequences. The time required grows with the total amount of documents, so this is not suitable for converting a large number of documents.
To convert a large number of documents, first download the configuration file containing the tag-classification information by clicking “Config”. It is safer to load this file into the command-line version and run the conversion there.
- PlaneText checks classified tags from the top of the list down. You can change the order by drag and drop, but PlaneText then rechecks all the tags from scratch, which can take some time.
- Classification of a tag can be canceled by dragging it from “Classified Tags” to “Unknown Tags” or by pressing the ‘u’ key. This requires rechecking the tags from scratch, which can take some time, just as reordering the classification does.
- Tags can be classified by mouse operation alone (click and drag & drop), by keyboard operation alone (cursor keys and the i/d/o/m/u keys), or by a mix of both. Try them and find the way that suits you best.
Combination of command-line and GUI versions
The format of the configuration files that record tag classification is common to the command-line and GUI versions. You can therefore use a (complete or incomplete) tag classification made in the GUI from the command line by downloading the configuration file from the “Config” link.