MODNLP: Modular Suite of NLP Tools
modnlp aims to provide a modular architecture and tools for natural language processing written (mainly) in Java. These tools are being developed in connection with the Genealogies of Knowledge project.
The following modnlp modules are currently available:
- idx: an API and tools for (inverted)
indexing, storage and retrieval of large amounts of text, with
(XML-based) handling of meta-data.
- tc: an API and tools for text
categorisation, including functionality for XML parsing, term set
reduction (and basic keyword extraction), probabilistic classifier
induction, two sample classification tools, and evaluation modules.
- tec-toolsv2, consisting of tec-server, a corpus indexer and server for corpus access and analysis over the web, and tec-client, a corpus analysis client. Unlike the (now obsolete) version 1 of these tools, originally developed for the TEC project and written in Perl, C (server side) and Java, the version on this site (v2) is written entirely in Java. This new version of the tools forms the basis of software support for text analysis and visualisation in the Genealogies of Knowledge project.
The modnlp/tec tools have also been used by the European Parliamentary Comparable and Parallel Corpora project (ECPC), coordinated by Dr. Calzada Pérez (Universitat Jaume I, Spain), and by the Translational English Corpus. The latter has been collected and maintained under Prof. Mona Baker's supervision at the University of Manchester and is made available on the Internet through the Genealogies of Knowledge project website, in a collaboration between the University of Edinburgh and the University of Manchester.
Also available is the documentation of the modnlp suite (for developers).
Publication:
The design, structure and motivations for the TEC/ECPC tools are described in the following paper:
- S. Luz. Web-based corpus software. In A. Kruger, K. Wallmach, and J. Munday, editors, Corpus-based Translation Studies - Research and Applications, chapter 5, pages 124-149. Continuum, 2011.
Current developers:
- Saturnino Luz
- Shane Sheehan
Past Contributors:
- Michael Davy (contributed to the TC module)
- Daniel Kelleher (contributed to the IDX module)
- Noel Skehan (contributed to an earlier version of the teccli/tecser modules)
See the developers' site at SourceForge.net for downloads and Git repositories.