MODNLP-tec: A corpus management suite

The modnlp corpus management suite, which consists of modnlp-idx (the indexer), modnlp-teccli (the corpus/concordance browser) and modnlp-tecser (the corpus server), was originally developed to allow free access to linguistic material over the Internet.

The indexer, modnlp-idx, allows you to create an index, which you can later access through the TEC* browser (modnlp-teccli). The corpus will usually contain data (e.g. text-oriented XML files, which we refer to as text files) and "meta-data" (perhaps stored as separate XML files, which we refer to as header files). It is not essential that data and meta-data be stored in separate files or encoded in XML. However, if you would like to be able to select sub-corpora, for instance to make the concordancer display only concordances from texts that share certain features (e.g. all texts whose source language is Japanese), you need to create meta-data files describing the features of interest and link them back to the appropriate sections of the text files. The following files illustrate a typical pair of corpus text and header files:
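
Separately from those files, the sub-corpus idea can be sketched in a few lines of Python. This is only an illustration: the element and attribute names (header, section, id, source_language), the file name, and the helper select_sections are all made up for the example and are not the actual modnlp/TEC header format or API. The sketch parses a hypothetical header file and keeps the sections whose source language is Japanese.

```python
import xml.etree.ElementTree as ET

# Hypothetical header (meta-data) file. The element and attribute names
# are assumptions for illustration, NOT the real modnlp/TEC header schema.
HEADER_XML = """
<header filename="example_text.xml">
  <section id="s1" source_language="Japanese"/>
  <section id="s2" source_language="French"/>
  <section id="s3" source_language="Japanese"/>
</header>
"""


def select_sections(header_xml, attribute, value):
    """Return the ids of sections whose meta-data attribute equals value."""
    root = ET.fromstring(header_xml.strip())
    return [
        section.get("id")
        for section in root.iter("section")
        if section.get(attribute) == value
    ]


# Select the sub-corpus of sections translated from Japanese.
selected = select_sections(HEADER_XML, "source_language", "Japanese")
print(selected)  # prints ['s1', 's3']
```

A concordancer restricted to this sub-corpus would then display only matches that fall inside the selected sections of the corresponding text file, which is why the header entries need to be linked back to sections of the text.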

If you would like to test and/or use the software, download the modnlp-teccli module, a binary distribution of the client (which also runs in stand-alone mode, using a local index created by modnlp-idx).

If you want to make corpus data and concordances (though not necessarily your full texts) available to the community over the Internet, you should download the modnlp-tecser module.

In addition, the entire suite is available for download.

A live example of these tools in action is the TEC Corpus Browser, available on the web, where you can try the concordancer prototype and a few of its "plug-ins" via Java Web Start.

More information on the TEC client/server architecture can be found in the following paper:

A useful tutorial on how to use the modnlp/tec suite to create and index a corpus was edited by Sally Marshall and is available here.


*The suite is named after "TEC", the project for which it was originally developed. TEC (short for "Translational English Corpus") is a computerised collection of contemporary translational English text.