modnlp.tc.util
public class MakeARFF extends java.lang.Object
MakeARFF corpus_list stopwdlist aggr tf_method categ repr
SYNOPSIS:
Tokenise each file in corpus_list, remove words in stopwdlist
and reduce the term set by a factor of aggr.
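The synopsis above can be sketched roughly as follows. This is a minimal illustration, not the MakeARFF implementation: the class TermReductionSketch and its methods are hypothetical, the regex tokeniser stands in for the real parser, document frequency is assumed as the term score, and "reduce by a factor of aggr" is read as keeping about 1/aggr of the terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper class, for illustration only.
class TermReductionSketch {

    // Lower-case the text and split on non-letters (stand-in for the real tokeniser).
    static List<String> tokenise(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Count, for each term that is not a stop word, the number of documents containing it.
    static Map<String, Integer> documentFrequency(List<String> docs, Set<String> stopWords) {
        Map<String, Integer> df = new TreeMap<>();
        for (String doc : docs) {
            Set<String> seen = new HashSet<>(tokenise(doc));
            seen.removeAll(stopWords);
            for (String term : seen) df.merge(term, 1, Integer::sum);
        }
        return df;
    }

    // Assumed reading of "reduce by a factor of aggr": keep the ceil(n / aggr)
    // terms with the highest document frequency, ties broken alphabetically.
    static List<String> reduce(Map<String, Integer> df, int aggr) {
        if (aggr < 1) throw new IllegalArgumentException("aggr must be >= 1");
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(df.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int keep = (int) Math.ceil(entries.size() / (double) aggr);
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < keep; i++) kept.add(entries.get(i).getKey());
        return kept;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Oil prices rise",
                "Oil company acquires rival",
                "The company prices shares");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        Map<String, Integer> df = documentFrequency(docs, stopWords);
        System.out.println(reduce(df, 2)); // prints [company, oil, prices, acquires]
    }
}
```

In this toy corpus seven terms survive stop-word removal, so an aggr of 2 keeps four of them.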
ARGUMENTS
tf_method: term filtering method. One of:
'df': document frequency, local,
'dfg': document frequency, global,
'ig': information gain.
categ: target category (e.g. 'acq'). This will be written into
the ARFF file as the last attribute of an instance.
If the categ argument denotes a global TSR (term space
reduction) method (_MAX, _SUM, _WAVG, or _DFG), a document may
be represented as several lines in the ARFF file: one for each
category the document belongs to.
repr: document representation. One of:
'occur': a vector of integers representing the number of
times a term occurs in the document,
'boolean': a vector of Boolean values indicating whether
a term occurs in the document or not,
'pweight': a vector of real values indicating a term's
proportional weight, computed as
pweight = round(10 x (1 + log #_occurs_term_i_in_j)
/ (1 + log #_terms_in_j)),
'tfidf': Term frequency inverse document frequency:
tfidf = no_of_occurrences_of_t_in_d *
log ( size_of_corpus /
size_of_subcorpus_in_which_t_occurs)
parser: parser to be used [default: 'NewsParser']
'LingspamEmailParser': Androutsopoulos' lingspam corpus,
'NewsParser': REUTERS-21578 corpus, XML version.
(add your own 'parser' by subclassing modnlp.tc.parser.Parser)
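The four repr options can be illustrated with a small sketch. It is not the MakeARFF code: RepresentationSketch and its methods are hypothetical, documents are assumed to be lists of tokens, the logarithm is assumed to be natural, the pweight formula is read as round(10 x (1 + log occurrences) / (1 + log document length)), and absent terms are assumed to get weight 0.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class, for illustration only.
class RepresentationSketch {

    // 'occur': number of times the term occurs in the document.
    static int occur(String term, List<String> doc) {
        return Collections.frequency(doc, term);
    }

    // 'boolean': whether the term occurs in the document at all.
    static boolean bool(String term, List<String> doc) {
        return doc.contains(term);
    }

    // 'pweight': round(10 * (1 + log occurrences) / (1 + log document length)).
    // Returning 0 for an absent term is an assumption of this sketch.
    static long pweight(String term, List<String> doc) {
        int n = occur(term, doc);
        if (n == 0) return 0;
        return Math.round(10.0 * (1 + Math.log(n)) / (1 + Math.log(doc.size())));
    }

    // 'tfidf': occurrences of the term in the document times
    // log(corpus size / number of documents containing the term).
    static double tfidf(String term, List<String> doc, List<List<String>> corpus) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        if (containing == 0) return 0.0;
        return occur(term, doc) * Math.log((double) corpus.size() / containing);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("oil", "prices", "rise", "oil");
        List<String> d2 = Arrays.asList("oil", "company", "acquires", "rival");
        List<String> d3 = Arrays.asList("company", "prices", "shares");
        List<List<String>> corpus = Arrays.asList(d1, d2, d3);

        System.out.println(occur("oil", d1));   // prints 2
        System.out.println(bool("rival", d1));  // prints false
        System.out.println(pweight("oil", d1)); // prints 7
        System.out.println(tfidf("oil", d1, corpus));
    }
}
```

Note that under this reading pweight is bounded by 10, which it reaches when a document consists of repetitions of a single term.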
See Also:
TCProbabilityModel,
TermFilter,
Parser,
ARFFUtil,
the weka toolkit for data mining.

| Constructor and Description |
|---|
| MakeARFF(java.lang.String clist, java.lang.String swlist, java.lang.String aggr) Set up the main user interface items |

| Modifier and Type | Method and Description |
|---|---|
| WordFrequencyPair[] | filter(java.lang.String method, TCProbabilityModel pm, java.lang.String categ) |
| static void | main(java.lang.String[] args) |
| ParsedCorpus | parse(java.lang.String filename, java.lang.String plugin) parseNews: Set up parser object, perform parsing, and print indented contents onto stdout (for test purposes only) |

public MakeARFF(java.lang.String clist, java.lang.String swlist, java.lang.String aggr)
public ParsedCorpus parse(java.lang.String filename, java.lang.String plugin) throws java.lang.Exception
public WordFrequencyPair[] filter(java.lang.String method, TCProbabilityModel pm, java.lang.String categ) throws java.lang.Exception
public static void main(java.lang.String[] args)