Input/Output – planetext

PlaneText is being developed to let users analyze structured text with various natural language processing (NLP) tools. The current version provides a tool that converts text structured as XML into plain text sequences that can serve as input for NLP tools. A future update will map the output of NLP tools back to the original input documents, so that users can apply the resulting analyses in a wider range of applications or pass them on to higher-level downstream analysis.

Input

PlaneText currently assumes that the input is an XML or XML-like tagged document (or a collection of such documents) in which tags give the text structure beyond a plain word sequence. (The GUI tool of ver. 0.0.1 can show previews only for XHTML documents, although tag classification and conversion are available for various XML and XML-like documents. We will address this limitation in a future update.)

Output

For each input document, three files (*.ann, *.txt, *.xhtml) are generated in the specified output directory. *.ann and *.xhtml are reserved for future updates of PlaneText.

*.txt contains the resulting plain text sequences, which consist of words, with dummy words standing in for Object-tagged regions. The output style of *.txt can be changed through settings in config.yaml. (Please see also the explanation of tag classification in About PlaneText.)
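As a rough sketch of how the *.txt files might be picked up for a downstream NLP tool, the following collects every *.txt file from an output directory. The function name read_plain_texts and the UTF-8 encoding are assumptions made for illustration; neither is confirmed by PlaneText itself.

```python
from pathlib import Path


def read_plain_texts(output_dir):
    """Return {file name: content} for every *.txt file in output_dir.

    Assumption: the files are UTF-8 encoded. The *.ann and *.xhtml files
    are skipped because they are reserved for future updates.
    """
    texts = {}
    for path in sorted(Path(output_dir).glob("*.txt")):
        texts[path.name] = path.read_text(encoding="utf-8")
    return texts
```

Each returned value is the full content of one *.txt file, ready to be passed to whatever NLP tool you use.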

(A. Enable mark_displacement in config.yaml)

You can add markers that explicitly show where each Independent-tagged region was separated from by specifying the following in config.yaml:

mark_displacement: true

For example, assume the input:

The UI is more useful than XYZ<footnote>Notice that … .</footnote> and … .

If the “footnote” tag is classified as “Independent”, the output plain text sequences are as follows:

The UI is more useful than XYZ__FOOTNOTE_1__ and … . (<-- The marker __FOOTNOTE_1__ shows where the “footnote” region was separated from.)

__FOOTNOTE_1__ (<-- The separated region is output below this marker.)
Notice that … .

The markers make the original structure of the input document easy to follow. However, the sequences cannot be fed directly to NLP tools because they contain extra strings such as __FOOTNOTE_1__, so some pre- or postprocessing is required.
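The pre- or postprocessing mentioned above could look roughly like the sketch below. It assumes that markers always have the shape seen in the single example (__FOOTNOTE_1__: an upper-case tag name followed by a number) and that a line consisting solely of a marker starts the corresponding separated region. Both are inferences from the example, not a documented format, and the function names are hypothetical.

```python
import re

# Assumed marker shape, inferred from the example __FOOTNOTE_1__:
# "__", an upper-case tag name, "_", a number, "__".
MARKER = re.compile(r"__([A-Z][A-Z0-9]*)_(\d+)__")


def split_marked_text(text):
    """Split marked output into (main_text, regions).

    regions maps each marker string to the text that follows the line
    consisting solely of that marker.
    """
    main_lines = []
    regions = {}
    current = None  # marker whose separated region is being collected
    for line in text.splitlines():
        stripped = line.strip()
        if MARKER.fullmatch(stripped):
            current = stripped
            regions[current] = []
        elif current is None:
            main_lines.append(line)
        else:
            regions[current].append(line)
    main_text = "\n".join(main_lines).strip()
    return main_text, {m: "\n".join(ls).strip() for m, ls in regions.items()}


def strip_markers(text):
    """Remove inline markers so the text can be passed to an NLP tool."""
    return MARKER.sub("", text)
```

Keeping the regions in a dictionary keyed by marker preserves the link between each separated region and its original position, which strip_markers alone would discard.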

(B. Disable mark_displacement in config.yaml)

Alternatively, you can simply separate the regions enclosed by “Independent” tags, without adding markers, by specifying the following in config.yaml:

mark_displacement: false

For example, assume again the following input:

The UI is more useful than XYZ<footnote>Notice that … .</footnote> and … .

If the “footnote” tag is classified as “Independent”, the output plain text sequences are as follows:

The UI is more useful than XYZ and … .

Notice that … .

These sequences can be fed directly to NLP tools, but the original structure of the input document can no longer be tracked.
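To make the difference between the two output styles concrete, here is a toy model of the separation step. It is not PlaneText's implementation: it handles only simple, non-nested regions of a single tag without attributes, and the function name displace and its parameters are invented for this illustration.

```python
import re


def displace(text, tag, mark_displacement):
    """Toy model of separating regions enclosed by one Independent tag.

    Handles only simple, non-nested <tag>...</tag> regions without
    attributes. The marker format mimics the __FOOTNOTE_1__ example.
    """
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    regions = []

    def replace(match):
        # Remember the region and leave a marker (or nothing) in its place.
        regions.append(match.group(1))
        if mark_displacement:
            return "__{}_{}__".format(tag.upper(), len(regions))
        return ""

    main = pattern.sub(replace, text)
    blocks = [main]
    for i, region in enumerate(regions, start=1):
        if mark_displacement:
            blocks.append("__{}_{}__\n{}".format(tag.upper(), i, region))
        else:
            blocks.append(region)
    return "\n\n".join(blocks)
```

Calling displace with mark_displacement set to True mirrors style A, and False mirrors style B.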

About settings of config.yaml

You can change the input/output style of PlaneText through settings in config.yaml, located in the planetext (install) directory. Currently the only notable setting is the output style described above, but future updates will provide further options for flexibly changing input and output to suit users' needs.

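PlaneText aims to make it possible to analyze structured text with various natural language processing (NLP) tools. The current version provides a tool that converts documents structured with XML into text sequences that many NLP tools can accept as input. Future updates are planned to implement, among other things, the mapping between analysis results and the original documents, so that users can make more effective use of those results.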

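Input

PlaneText supports documents written in XML or an equivalent format. It assumes that the XML tags give the text structure beyond that of a plain sequence of words (that is, they "structure" it). In the current version (Ver. 0.0.1), however, the GUI tool's display of where tags are used supports only XHTML documents (the classification work itself can also be done on non-XHTML documents). This will be addressed in a future version.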

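Output

For each document, three files (*.ann, *.txt, and *.xhtml) are created in the directory specified as the output destination. (*.ann and *.xhtml are intended for features to be added in future versions and are not described in detail here.) *.txt receives the text sequences made up of words (and substitute dummy words for Object tags). The output format differs as follows depending on the settings in the configuration file config.yaml (see the download instructions). (Please also see the explanation of tag classification in About PlaneText.)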

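(A. When mark_displacement is enabled in config.yaml)

If you specify the following in config.yaml,

mark_displacement: true

markers are added when regions enclosed by Independent tags are extracted, explicitly showing which region was extracted from which position. For example, given the input

The UI is more useful than XYZ<footnote>Notice that … .</footnote> and … .

with footnote set as an Independent tag, the output plain text sequences look like this:

The UI is more useful than XYZ__FOOTNOTE_1__ and … . (<-- __FOOTNOTE_1__ is shown where the footnote tag region was extracted.)

__FOOTNOTE_1__ (<-- The extracted part is written out below this.)
Notice that … .

The structure of the original document is easy to follow by eye, but if the output is fed to an NLP tool as is, the __FOOTNOTE_1__ parts are superfluous, so pre- and postprocessing is required.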

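(B. When mark_displacement is disabled in config.yaml)

If, on the other hand, you specify the following in config.yaml,

mark_displacement: false

regions enclosed by Independent tags are simply separated. For example, given the input

The UI is more useful than XYZ<footnote>Notice that … .</footnote> and … .

with footnote set as an Independent tag, the output plain text sequences look like this:

The UI is more useful than XYZ and … .

Notice that … .

Each text can be fed directly to an NLP tool as is, but the structure of the original document can no longer be recovered.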

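About the configuration file config.yaml

You can change the input/output specifications and related behavior by specifying settings in config.yaml (required), which is placed directly under the planetext directory. At present, no notable settings are available other than the output setting described above, but future versions will add settings that let users customize the input/output specifications more conveniently.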