modnlp.tc.tsr
public class MakeReducedTermSet extends java.lang.Object
MakeReducedTermSet corpus_list stopwdlist aggr tf_method categ parser

SYNOPSIS: Tokenise each file in corpus_list, remove the words in stopwdlist, and reduce the term set by a factor of aggr.

ARGUMENTS:

tf_method: term filtering method. One of:
- 'df': document frequency, local
- 'dfg': document frequency, global
- 'ig': information gain
- 'gss': GSS coefficient

categ: target category (e.g. 'acq') for local term filtering, OR a method for combining local scores. One of:
- '_DFG': global document frequency
- '_MAX': maximum local score
- '_SUM': sum of local scores
- '_WAVG': sum of local scores weighted by category generality

parser: parser to be used [default: 'news']. One of:
- 'lingspam': Androutsopoulos' lingspam corpus
- 'news': REUTERS-21578 corpus, XML version
See also: BVProbabilityModel, TermFilter, NewsParser
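The pipeline named in the synopsis (tokenise, remove stop words, reduce the term set by a factor of aggr) can be illustrated with a minimal, self-contained sketch using document frequency as the filtering score. This is not the modnlp implementation: the class `DfReductionSketch`, the method `reduce`, the regex tokeniser, and the reading of aggr as "keep roughly 1/aggr of the terms" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from modnlp.
final class DfReductionSketch {

    // Returns the terms with the highest document frequency, keeping
    // about 1/aggr of the vocabulary (assumed meaning of "aggr").
    static List<String> reduce(List<String> docs, Set<String> stopWords, int aggr) {
        if (aggr < 1) {
            throw new IllegalArgumentException("aggr must be >= 1");
        }
        // Document frequency: number of documents in which a term occurs.
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>();
            // Naive tokeniser (assumption): lower-case, split on non-word characters.
            for (String tok : doc.toLowerCase().split("\\W+")) {
                if (tok.isEmpty() || stopWords.contains(tok)) {
                    continue;
                }
                if (seen.add(tok)) {          // count each term once per document
                    df.merge(tok, 1, Integer::sum);
                }
            }
        }
        List<String> terms = new ArrayList<>(df.keySet());
        // Highest document frequency first; ties broken alphabetically.
        terms.sort((a, b) -> {
            int byDf = Integer.compare(df.get(b), df.get(a));
            return byDf != 0 ? byDf : a.compareTo(b);
        });
        int keep = Math.min(terms.size(), Math.max(1, terms.size() / aggr));
        return new ArrayList<>(terms.subList(0, keep));
    }
}
```

For the three documents "the cat sat", "the cat ran" and "a dog ran fast", with stop words "the" and "a" and aggr set to 2, the sketch keeps the two terms that occur in two documents each: cat and ran.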
Constructor and Description |
---|
MakeReducedTermSet(java.lang.String cl, java.lang.String sw, java.lang.String aggr) Set up the main user interface items |
Modifier and Type | Method and Description |
---|---|
static void | main(java.lang.String[] args) |
ParsedCorpus | parse(java.lang.String filename, java.lang.String plugin) parseNews: Set up a REUTERS-21578 news XML parser object, perform parsing, and return a ParsedCorpus |
WordScorePair[] | rank(java.lang.String method, BVProbabilityModel pm, java.lang.String categ) |
public MakeReducedTermSet(java.lang.String cl, java.lang.String sw, java.lang.String aggr)
public ParsedCorpus parse(java.lang.String filename, java.lang.String plugin) throws java.lang.Exception
Throws: java.lang.Exception
public WordScorePair[] rank(java.lang.String method, BVProbabilityModel pm, java.lang.String categ) throws java.lang.Exception
Throws: java.lang.Exception
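The class description lists three ways of combining per-category (local) scores into one global score for a term: '_MAX', '_SUM' and '_WAVG'. The sketch below shows how those three combinations are conventionally computed. It is an illustration under stated assumptions, not the code behind rank: the class `LocalScoreCombiner`, the method `combine`, and the array-based inputs are hypothetical, and "category generality" is assumed to mean the proportion of documents that belong to each category. '_DFG' is left out because it is a global document frequency rather than a combination of local scores.

```java
// Hypothetical illustration only; none of these names come from modnlp.
final class LocalScoreCombiner {

    // local[i]      : score of a single term with respect to category i
    // generality[i] : assumed to be the fraction of documents in category i
    static double combine(String method, double[] local, double[] generality) {
        double result = 0.0;
        switch (method) {
            case "_MAX":   // maximum local score
                result = Double.NEGATIVE_INFINITY;
                for (double score : local) {
                    result = Math.max(result, score);
                }
                return result;
            case "_SUM":   // sum of local scores
                for (double score : local) {
                    result += score;
                }
                return result;
            case "_WAVG":  // sum of local scores weighted by category generality
                if (generality.length != local.length) {
                    throw new IllegalArgumentException("one generality value per category is required");
                }
                for (int i = 0; i < local.length; i++) {
                    result += generality[i] * local[i];
                }
                return result;
            default:
                throw new IllegalArgumentException("Unknown combination method: " + method);
        }
    }
}
```

With local scores {0.2, 0.5, 0.1} and generalities {0.5, 0.25, 0.25}, '_MAX' yields 0.5, '_SUM' yields approximately 0.8, and '_WAVG' yields approximately 0.25, so a term that scores highly only for a rare category ranks lower under '_WAVG' than under '_MAX'.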
public static void main(java.lang.String[] args)