modnlp.tc.tsr
public class MakeReducedTermSet extends java.lang.Object
MakeReducedTermSet corpus_list stopwdlist aggr tf_method categ parser
SYNOPSIS:
Tokenise each file in corpus_list, remove words in stopwdlist
and reduce the term set by a factor of aggr.
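The reduction step can be pictured with a small sketch. It assumes that "a factor of aggr" means keeping roughly the top |T| / aggr terms once each term has a score, which is the usual reading of aggressiveness in term space reduction. The class `AggressivenessSketch` and its `reduce` method are hypothetical and not part of modnlp.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of modnlp: keeps about |T| / aggr top-scoring terms.
final class AggressivenessSketch {

    /** Returns the highest-scoring terms, keeping scores.size() / aggr of them (at least one). */
    static List<String> reduce(Map<String, Double> scores, int aggr) {
        if (aggr < 1) {
            throw new IllegalArgumentException("aggr must be >= 1");
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        // Sort by descending score; break ties alphabetically so the result is deterministic.
        ranked.sort((a, b) -> {
            int byScore = Double.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        // Integer division; an empty input yields an empty result.
        int keep = Math.min(ranked.size(), Math.max(1, ranked.size() / aggr));
        List<String> kept = new ArrayList<>(keep);
        for (int i = 0; i < keep; i++) {
            kept.add(ranked.get(i).getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("oil", 0.9);
        scores.put("shares", 0.8);
        scores.put("the", 0.1);
        scores.put("acquire", 0.7);
        scores.put("said", 0.2);
        scores.put("bank", 0.5);
        // Six terms with aggr = 3 keeps the two best-scoring terms.
        System.out.println(reduce(scores, 3));
    }
}
```

With aggr = 1 the sketch returns every term in ranked order, so the term set is unchanged.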
ARGUMENTS
tf_method: term filtering method. One of:
'df': document frequency, local,
'dfg': document frequency, global,
'ig': information gain,
'gss': GSS (Galavotti-Sebastiani-Simi) coefficient.
categ: target category (e.g. 'acq') for local term filtering, OR
a method for combining local scores. One of:
'_DFG' (global document frequency),
'_MAX' (maximum local score),
'_SUM' (sum of local scores),
'_WAVG' (sum of local scores weighted by category generality).
parser: parser to be used [default: 'news']. One of:
'lingspam': Androutsopoulos' lingspam corpus,
'news': REUTERS-21578 corpus, XML version.
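As a rough illustration of the 'ig' and 'gss' methods and the '_MAX', '_SUM' and '_WAVG' combiners listed above, the following sketch uses the standard textbook definitions over a 2x2 term/category contingency table. It is not the modnlp code: the class `TermScoringSketch`, its method names, and the count parameters are all hypothetical, and the library's own `TermFilter` classes may differ in detail (smoothing, logarithm base). '_DFG' is left out because it is a global document count, not a combination of local scores.

```java
// Hypothetical sketch, not the modnlp implementation.
final class TermScoringSketch {

    // Counts for one (term, category) pair:
    //   n11 = in category, contains term      n10 = not in category, contains term
    //   n01 = in category, lacks term         n00 = not in category, lacks term

    /** Information gain of a term for a category, in nats (natural logarithm). */
    static double informationGain(long n11, long n10, long n01, long n00) {
        double n = n11 + n10 + n01 + n00;
        double pTerm = (n11 + n10) / n;
        double pCat = (n11 + n01) / n;
        return part(n11 / n, pTerm, pCat)
             + part(n10 / n, pTerm, 1 - pCat)
             + part(n01 / n, 1 - pTerm, pCat)
             + part(n00 / n, 1 - pTerm, 1 - pCat);
    }

    // One summand P(x,y) * log(P(x,y) / (P(x) P(y))), taken as 0 when P(x,y) = 0.
    private static double part(double joint, double px, double py) {
        return joint == 0.0 ? 0.0 : joint * Math.log(joint / (px * py));
    }

    /** GSS coefficient: P(t,c) P(not t, not c) - P(t, not c) P(not t, c). */
    static double gss(long n11, long n10, long n01, long n00) {
        double n = n11 + n10 + n01 + n00;
        return (n11 / n) * (n00 / n) - (n10 / n) * (n01 / n);
    }

    /**
     * Combines per-category scores of one term into a single global score.
     * generality[i] is assumed to be the fraction of documents in category i;
     * it is only read for "_WAVG".
     */
    static double combine(String method, double[] local, double[] generality) {
        double result;
        switch (method) {
            case "_MAX":
                result = Double.NEGATIVE_INFINITY;
                for (double s : local) {
                    result = Math.max(result, s);
                }
                return result;
            case "_SUM":
                result = 0.0;
                for (double s : local) {
                    result += s;
                }
                return result;
            case "_WAVG":
                result = 0.0;
                for (int i = 0; i < local.length; i++) {
                    result += generality[i] * local[i];
                }
                return result;
            default:
                throw new IllegalArgumentException("unknown combiner: " + method);
        }
    }

    public static void main(String[] args) {
        // Made-up counts: a term that occurs in exactly the documents of the category.
        System.out.println("ig  = " + informationGain(50, 0, 0, 50));
        System.out.println("gss = " + gss(50, 0, 0, 50));
        // Made-up local scores for three categories and their generalities.
        double[] local = {0.1, 0.4, 0.2};
        double[] generality = {0.5, 0.25, 0.25};
        System.out.println("max  = " + combine("_MAX", local, generality));
        System.out.println("wavg = " + combine("_WAVG", local, generality));
    }
}
```

Passing a category name such as 'acq' as categ corresponds to using one local score directly; the combiners only matter when a single global ranking over all categories is wanted.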
See Also: BVProbabilityModel, TermFilter, NewsParser

| Constructor and Description |
|---|
| MakeReducedTermSet(java.lang.String cl, java.lang.String sw, java.lang.String aggr) Set up the main user interface items |

| Modifier and Type | Method and Description |
|---|---|
| static void | main(java.lang.String[] args) |
| ParsedCorpus | parse(java.lang.String filename, java.lang.String plugin) parseNews: Set up a REUTERS-21578 news XML parser object, perform parsing, and return a ParsedCorpus |
| WordScorePair[] | rank(java.lang.String method, BVProbabilityModel pm, java.lang.String categ) |
public MakeReducedTermSet(java.lang.String cl, java.lang.String sw, java.lang.String aggr)
public ParsedCorpus parse(java.lang.String filename, java.lang.String plugin) throws java.lang.Exception
Throws: java.lang.Exception
public WordScorePair[] rank(java.lang.String method, BVProbabilityModel pm, java.lang.String categ) throws java.lang.Exception
Throws: java.lang.Exception
public static void main(java.lang.String[] args)