MODNLP-tec: A corpus management suite
The modnlp corpus management suite, which consists of modnlp-idx (the indexer), modnlp-teccli (the corpus/concordance browser) and modnlp-tecser (the corpus server), was originally developed to allow free access to linguistic material over the Internet.
The indexer, modnlp-idx, allows you to create an index, which you can later access through the TEC* browser (modnlp-teccli). The corpus will usually contain data (e.g. text-oriented XML files, which we refer to as text files) and "meta-data" (perhaps stored as separate XML files, which we refer to as header files). It is not essential that data and meta-data be stored in separate files or encoded in XML. However, if you would like to be able to select sub-corpora, for instance to make the concordancer display only concordances coming from texts that share certain features (e.g. all texts whose source language is Japanese), you need to create meta-data files describing the features of interest and link them back to the appropriate sections of the text files. The following files illustrate a typical pair of corpus text and header files:
- EN20050110.xml is the text file
(borrowed from the ECPC corpus), though it also
contains meta-data. Things can be arranged so that only the text
between pre-specified tags will be indexed. These tags are specified
in the property file of modnlp-idx. Property settings that
would be suitable for this kind of data are shown
in this property file. The following
lines, for instance, state that only the text within 'speech'
and 'writing' element pairs will be indexed, and that they will be
uniquely identified by the value of their attribute 'ref' (ideally, a
suitable DTD should also guarantee that the elements and attributes
exist and that the uniqueness of 'ref' values is enforced):
subcorpusindexer.element=(speech|writing)
subcorpusindexer.attribute=ref
The following line specifies which text will NOT be indexed. The specification is done through a regular expression matching the names of elements that surround text which one does not wish to index:
tokeniser.ignore.elements=(omit|ignore|header|chair|heading|post|name)
- EN20050110.hed is the header
file. The modnlp tools allow the meta-data encoded in the header files
to be "queried" in XQuery. This is used to select
sub-corpora. The following lines in the idxmgr.properties file
define how sub-corpora are selected for the ECPC files:
xquery.root.filedescription.path=/header
xquery.attribute.chooser.specs=File name;../../header/@filename;Spoken language;speech/@language;Written language;writing/@language;Affiliation;(speaker|writer)/affiliation/@EPparty
The first line specifies the topmost level of the header file (its root element). The second line contains XML "paths" to the features which we would like to use to define sub-corpora, each preceded by a human-readable description. Similarly, the following line specifies part of an XQuery expression which the system uses to present a description of each file in the corpus to the user:
xquery.file.description.return={data($s/@filename)}
{data($s/@language)} {data($s/index/label)}, {data($s/index/place)}, {data($s/index/date)}, {data($s/index/edition)}
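To make the property settings above more concrete, here is a minimal Python sketch of how they could be interpreted. It is not the modnlp-idx implementation (the suite itself is not written in Python). The helper names, the toy document and its `ecpc` root element are all invented for illustration. The sketch also assumes that the element regular expressions must match whole element names and that the chooser specification is a flat, semicolon-separated list of label/path pairs.

```python
import re
import xml.etree.ElementTree as ET

# Property values copied from the examples above.
INDEX_ELEMENTS = re.compile(r"(speech|writing)")
IGNORE_ELEMENTS = re.compile(r"(omit|ignore|header|chair|heading|post|name)")
CHOOSER_SPECS = ("File name;../../header/@filename;"
                 "Spoken language;speech/@language;"
                 "Written language;writing/@language;"
                 "Affiliation;(speaker|writer)/affiliation/@EPparty")


def collect_text(elem):
    """Gather text under elem, skipping children whose names match IGNORE_ELEMENTS."""
    parts = [elem.text or ""]
    for child in elem:
        if not IGNORE_ELEMENTS.fullmatch(child.tag):
            parts.append(collect_text(child))
        # Text after a child's closing tag belongs to the parent, so it is kept.
        parts.append(child.tail or "")
    return "".join(parts)


def indexable_sections(xml_string, id_attribute="ref"):
    """Map each section's identifying attribute to its indexable text (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    sections = {}
    for elem in root.iter():
        # Assumption: the regular expression must match the whole element name.
        if INDEX_ELEMENTS.fullmatch(elem.tag):
            text = " ".join(collect_text(elem).split())  # normalise whitespace
            sections[elem.get(id_attribute)] = text
    return sections


def parse_chooser_specs(specs):
    """Split 'label;path;label;path;...' into (label, path) pairs."""
    fields = specs.split(";")
    return list(zip(fields[0::2], fields[1::2]))


# Toy document, invented for illustration; it is not an excerpt from the ECPC corpus.
SAMPLE = """<ecpc>
  <speech ref="s1"><name>Ms Example</name>Thank you, <omit>[applause]</omit>President.</speech>
  <writing ref="w1">A written statement.</writing>
</ecpc>"""

print(indexable_sections(SAMPLE))
for label, path in parse_chooser_specs(CHOOSER_SPECS):
    print(label, "->", path)
```

In the toy document, the text kept for section `s1` is "Thank you, President.": the content of the `name` and `omit` elements is dropped, while the text that follows them is retained. The chooser specification splits into four label/path pairs, the first being "File name" paired with `../../header/@filename`.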
If you would like to test or use the software, download the modnlp-teccli module, a binary distribution of the client (which also runs in stand-alone mode, using a local index created by modnlp-idx).
If you want to make corpus data and concordances (though not necessarily your full texts) available to the community over the Internet, you should download the modnlp-tecser module.
In addition, the entire suite is available for download.
A live example of these tools in action is the TEC Corpus Browser, available on the web, where you can try the concordancer prototype and a few of its "plug-ins" via Java Web Start.
More information on the TEC client/server architecture can be found in the following paper:
- S. Luz. Web-based corpus software. In A. Kruger, K. Wallmach, and J. Munday, editors, Corpus-based Translation Studies - Research and Applications, chapter 5, pages 124-149. Continuum, 2011.
A useful tutorial on how to use the modnlp/tec suite to create and index a corpus was edited by Sally Marshall and is available here.
*The suite is named after "TEC", the project for which it was originally developed. TEC (short for "Translational English Corpus") is a computerised collection of contemporary translational English text.