modnlp.tc.util
public class MakeARFF extends java.lang.Object
MakeARFF corpus_list stopwdlist aggr tf_method categ repr

SYNOPSIS: Tokenise each file in corpus_list, remove words in stopwdlist and reduce the term set by a factor of aggr.

ARGUMENTS

tf_method: term filtering method. One of:
- 'df': document frequency, local
- 'dfg': document frequency, global
- 'ig': information gain

categ: target category (e.g. 'acq'). This will be written into the ARFF file as the last attribute of an instance. If the categ argument denotes a global TSR method (_MAX, _SUM, _WAVG, or _DFG), a document may be represented as several lines in the ARFF file: one for each category the document belongs to.

repr: document representation. One of:
- 'occur': a vector of integers representing the number of times a term occurs in the document
- 'boolean': a vector of Boolean values indicating whether a term occurs in the document or not
- 'pweight': a vector of real values indicating a term's proportional weight, computed as pweight = round ( 10 x (1 + log #_occurs_term_i_in_j / 1 + log #_terms_in_j))
- 'tfidf': term frequency, inverse document frequency: tfidf = no_of_occurrences_of_t_in_d * log ( size_of_corpus / size_of_subcorpus_in_which_t_occurs)

parser: parser to be used [default: 'NewsParser']. One of:
- 'LingspamEmailParser': Androutsopoulos' lingspam corpus
- 'NewsParser': REUTERS-21578 corpus, XML version

(Add your own parser by subclassing modnlp.tc.parser.Parser.)
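The four repr options can be illustrated with a small, self-contained sketch. This is not the modnlp implementation: the class and method names below are hypothetical, natural logarithms are assumed, and the pweight formula is read as round(10 x (1 + log occurrences) / (1 + log terms_in_document)), which is only one plausible reading of the expression above.

```java
public class TermWeightSketch {

    // 'occur': the raw number of times the term occurs in the document.
    public static int occur(int count) {
        return count;
    }

    // 'boolean': whether the term occurs in the document at all.
    public static boolean bool(int count) {
        return count > 0;
    }

    // 'pweight' under the ASSUMED reading
    // round(10 * (1 + ln count) / (1 + ln termsInDoc)).
    // Returns 0 when the term is absent or the document is empty.
    public static long pweight(int count, int termsInDoc) {
        if (count <= 0 || termsInDoc <= 0) {
            return 0L;
        }
        return Math.round(10.0 * (1.0 + Math.log(count)) / (1.0 + Math.log(termsInDoc)));
    }

    // 'tfidf': count * ln(corpusSize / docsWithTerm), as in the formula above.
    // Returns 0 when the term is absent from the document or the corpus.
    public static double tfidf(int count, int corpusSize, int docsWithTerm) {
        if (count <= 0 || docsWithTerm <= 0) {
            return 0.0;
        }
        return count * Math.log((double) corpusSize / docsWithTerm);
    }

    public static void main(String[] args) {
        // A term seen 3 times in a 120-term document; it appears in
        // 10 documents of a 1000-document corpus (made-up numbers).
        System.out.println("occur   = " + occur(3));
        System.out.println("boolean = " + bool(3));
        System.out.println("pweight = " + pweight(3, 120));
        System.out.println("tfidf   = " + tfidf(3, 1000, 10));
    }
}
```

Note that a term occurring in every document gets a tfidf of zero under this formula, since log(1) = 0.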
See Also: TCProbabilityModel, TermFilter, Parser, ARFFUtil, the weka toolkit for data mining.
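To make the class description more concrete, the sketch below builds an ARFF file in which the target category is the last attribute of each instance and a document belonging to two categories appears as two data lines. It is only an illustration of the standard ARFF layout (@relation, @attribute, @data); the class name, the attribute name "class", and the sample terms are assumptions, and the files actually written by MakeARFF may differ in detail.

```java
public class ArffLayoutSketch {

    // Builds an ARFF document: one numeric attribute per term, followed by a
    // nominal class attribute, then one comma-separated data line per row.
    // Term names are assumed to need no quoting (no spaces or commas).
    public static String toArff(String relation, String[] terms, int[][] rows,
                                String[] labels, String[] classValues) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (String term : terms) {
            sb.append("@attribute ").append(term).append(" numeric\n");
        }
        // The category is declared last, so it is the last value of each instance.
        sb.append("@attribute class {")
          .append(String.join(",", classValues))
          .append("}\n\n");
        sb.append("@data\n");
        for (int i = 0; i < rows.length; i++) {
            for (int value : rows[i]) {
                sb.append(value).append(',');
            }
            sb.append(labels[i]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] terms = {"oil", "share"};
        // The first document belongs to both 'acq' and 'crude', so its
        // vector is emitted twice, once per category.
        int[][] rows = {{1, 0}, {1, 0}, {0, 2}};
        String[] labels = {"acq", "crude", "acq"};
        System.out.print(toArff("demo", terms, rows, labels, new String[] {"acq", "crude"}));
    }
}
```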
Constructor and Description |
---|
MakeARFF(java.lang.String clist, java.lang.String swlist, java.lang.String aggr): Set up the main user interface items |
Modifier and Type | Method and Description |
---|---|
WordFrequencyPair[] | filter(java.lang.String method, TCProbabilityModel pm, java.lang.String categ) |
static void | main(java.lang.String[] args) |
ParsedCorpus | parse(java.lang.String filename, java.lang.String plugin): Set up parser object, perform parsing, and print indented contents onto stdout (for test purposes only) |
public MakeARFF(java.lang.String clist, java.lang.String swlist, java.lang.String aggr)
public ParsedCorpus parse(java.lang.String filename, java.lang.String plugin) throws java.lang.Exception
Throws: java.lang.Exception
public WordFrequencyPair[] filter(java.lang.String method, TCProbabilityModel pm, java.lang.String categ) throws java.lang.Exception
Throws: java.lang.Exception
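As a rough illustration of what 'df' filtering combined with an aggressiveness factor could look like, the sketch below ranks terms by document frequency and keeps the top 1/aggr of them. It does not use TCProbabilityModel or WordFrequencyPair and is not the modnlp implementation: the class and method names are hypothetical, and both the "keep size / aggr terms" interpretation and the integer aggr parameter are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class DfFilterSketch {

    // Ranks terms by the number of documents they occur in and keeps the
    // top (termCount / aggr) of them, with a minimum of one term.
    // ASSUMPTION: "reduce by a factor of aggr" means keeping size / aggr terms.
    public static List<String> topByDocumentFrequency(List<Set<String>> docs, int aggr) {
        // TreeMap keeps terms in alphabetical order, which makes ties deterministic.
        Map<String, Integer> df = new TreeMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>(df.keySet());
        // List.sort is stable, so equally frequent terms stay alphabetical.
        terms.sort((a, b) -> Integer.compare(df.get(b), df.get(a)));
        int keep = Math.min(terms.size(), Math.max(1, terms.size() / aggr));
        return new ArrayList<>(terms.subList(0, keep));
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("merger", "share", "bid"),
                Set.of("merger", "share"),
                Set.of("merger", "oil"));
        // Four distinct terms and aggr = 2, so two terms are kept.
        System.out.println(topByDocumentFrequency(docs, 2));
    }
}
```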
public static void main(java.lang.String[] args)