modnlp.tc.util
public class ARFFUtil extends PrintUtil
Prints ARFF files for use with the Weka toolkit for data mining.
Constructor and Description |
---|
ARFFUtil() |
Modifier and Type | Method and Description |
---|---|
static void | printBooleanARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out) Converts a TCInvertedIndex into an ARFF file for category (a single category or null, representing all categories) and prints the ARFF file to an output stream. |
static void | printDebug(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out) Prints debug information and every ARFF representation this class handles. |
static void | printOccurARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out) Converts a TCInvertedIndex into an ARFF file for category (a single category or null, representing all categories) and prints the ARFF file to an output stream. |
static void | printPWeightARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out) Converts a TCInvertedIndex into an ARFF file for category (a single category or null, representing all categories) and prints the ARFF file to an output stream. |
static void | printTermByDocMatrixCSV(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out) |
static void | printTermCoOccurARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out) |
static void | printTermCoOccurCSV(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out) |
static void | printTFIDFARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out) Converts a TCInvertedIndex into an ARFF file for category (a single category or null, representing all categories) and prints the ARFF file to an output stream. |
donePrinting, printNoMove, resetCounter, toString
public static void printOccurARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)
Converts a TCInvertedIndex into an ARFF file for category (a single category or null, representing all categories) and prints the ARFF file to an output stream. The WordFrequencyPair array restricts the entries of this ARFF file to those terms that occur in wfp (i.e. the terms selected by term set reduction).
Documents are represented as vectors of integers whose elements indicate the number of occurrences of each term in a document.
Parameters:
ii - a TCInvertedIndex value
wfp - a WordFrequencyPair[] value
category - a String representing a category, or null representing all categories
out - a PrintStream value
public static void printBooleanARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)
Converts a TCInvertedIndex into an ARFF file for category (a single category or null, representing all categories) and prints the ARFF file to an output stream. The WordFrequencyPair array restricts the entries of this ARFF file to those terms that occur in wfp (i.e. the terms selected by term set reduction).
Documents are represented as vectors of Boolean values whose elements indicate the occurrence or non-occurrence of each term in the document.
Parameters:
ii - a TCInvertedIndex value
wfp - a WordFrequencyPair[] value
category - a String representing a category, or null representing all categories
out - a PrintStream value
See Also:
NewsItemAsOccurVector.getBooleanTextArray()
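To make the Boolean representation concrete, the following is a minimal, self-contained sketch of what such an ARFF file might look like. It does not use TCInvertedIndex or WordFrequencyPair: the class BooleanArffSketch, its methods, the plain lists of tokens, and the exact attribute layout are all assumptions made for illustration, not the output format of printBooleanARFF itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of modnlp.
final class BooleanArffSketch {

    // Writes one Boolean attribute per selected term, then one data row per
    // document. A term outside the selected list never appears in the output,
    // which mirrors the restriction to the terms in wfp described above.
    static void printBooleanArff(String relation, List<String> terms,
                                 List<List<String>> docs, PrintStream out) {
        out.println("@relation " + relation);
        for (String term : terms) {
            out.println("@attribute " + term + " {false,true}");
        }
        out.println("@data");
        for (List<String> doc : docs) {
            StringBuilder row = new StringBuilder();
            for (int i = 0; i < terms.size(); i++) {
                if (i > 0) {
                    row.append(',');
                }
                // true if the term occurs at least once in the document
                row.append(doc.contains(terms.get(i)));
            }
            out.println(row);
        }
    }

    // Convenience wrapper that captures the ARFF text in a String.
    static String toArffString(String relation, List<String> terms,
                               List<List<String>> docs) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer);
        printBooleanArff(relation, terms, docs, out);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("oil", "wheat");
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("oil", "price", "oil"),
                Arrays.asList("wheat", "price"));
        printBooleanArff("demo", terms, docs, System.out);
    }
}
```

In this sketch the token "price" is dropped because it is not among the selected terms, and repeated occurrences of "oil" collapse to a single true value.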
public static void printTFIDFARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)
Converts a TCInvertedIndex into an ARFF file for category (a single category or null, representing all categories) and prints the ARFF file to an output stream. The WordFrequencyPair array restricts the entries of this ARFF file to those terms that occur in wfp (i.e. the terms selected by term set reduction).
Documents are represented as vectors of TFIDF values calculated as follows:
tfidf = no_of_occurrences_of_t_in_d * log(size_of_corpus / size_of_subcorpus_in_which_t_occurs)
Parameters:
ii - a TCInvertedIndex value
wfp - a WordFrequencyPair[] value
category - a String representing a category, or null representing all categories
out - a PrintStream value
See Also:
NewsItemAsOccurVector.getTFIDFVector(WordFrequencyPair[],int)
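The TFIDF formula above can be sketched as a small standalone function. This is an illustration under stated assumptions rather than the code behind getTFIDFVector: the class TfidfSketch and its parameter names are hypothetical, the logarithm is taken to be the natural logarithm (the documentation does not state the base), and a term that occurs in no document is given a weight of 0 to avoid division by zero.

```java
// Hypothetical illustration only; not part of modnlp.
final class TfidfSketch {

    // occurrencesInDoc: no_of_occurrences_of_t_in_d
    // corpusSize:       size_of_corpus (number of documents)
    // docsWithTerm:     size_of_subcorpus_in_which_t_occurs
    static double tfidf(int occurrencesInDoc, int corpusSize, int docsWithTerm) {
        if (docsWithTerm <= 0) {
            return 0.0; // assumption: undefined ratio is treated as weight 0
        }
        // Assumption: natural logarithm; the source does not name the base.
        return occurrencesInDoc * Math.log((double) corpusSize / docsWithTerm);
    }

    public static void main(String[] args) {
        // A term occurring twice in a document and in 10 of 100 documents.
        System.out.println(tfidf(2, 100, 10));
        // A term that occurs in every document carries no weight.
        System.out.println(tfidf(3, 100, 100));
    }
}
```

Note that a term present in every document gets a weight of zero regardless of how often it occurs, since the logarithm of 1 is 0.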
public static void printPWeightARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)
Converts a TCInvertedIndex into an ARFF file for category (a single category or null, representing all categories) and prints the ARFF file to an output stream. The WordFrequencyPair array restricts the entries of this ARFF file to those terms that occur in wfp (i.e. the terms selected by term set reduction).
Documents are represented as vectors of proportional term weight values calculated as follows:
pweight = round(10 x (1 + log no_occurs_term_i_in_j) / (1 + log no_terms_in_j))
if no_terms_in_j > 0. Otherwise pweight = 0.
Parameters:
ii - a TCInvertedIndex value
wfp - a WordFrequencyPair[] value
category - a String representing a category, or null representing all categories
out - a PrintStream value
See Also:
NewsItemAsOccurVector.getPWEIGHTVector(int)
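The proportional term weight can likewise be sketched as a standalone function. This is only an illustration of the formula above, not the code behind getPWEIGHTVector: the class PWeightSketch and its parameter names are hypothetical, the logarithm is assumed to be natural, and a term with zero occurrences is assumed to get weight 0 (the documentation only covers the case where the document has no terms).

```java
// Hypothetical illustration only; not part of modnlp.
final class PWeightSketch {

    // occurrences: no_occurs_term_i_in_j (occurrences of term i in document j)
    // termsInDoc:  no_terms_in_j (number of terms in document j)
    static long pweight(int occurrences, int termsInDoc) {
        if (termsInDoc <= 0) {
            return 0; // stated in the documentation: pweight = 0 otherwise
        }
        if (occurrences <= 0) {
            return 0; // assumption: log(0) is undefined, so absent terms weigh 0
        }
        // Assumption: natural logarithm; the source does not name the base.
        double ratio = (1 + Math.log(occurrences)) / (1 + Math.log(termsInDoc));
        return Math.round(10 * ratio);
    }

    public static void main(String[] args) {
        System.out.println(pweight(7, 7));   // a document made up of one repeated term
        System.out.println(pweight(1, 100)); // a single occurrence in a long document
    }
}
```

Under these assumptions the weight is an integer between 0 and 10, reaching 10 only when a single term accounts for every term occurrence in the document.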
public static void printTermCoOccurARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)
public static void printTermCoOccurCSV(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)
public static void printTermByDocMatrixCSV(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)
public static void printDebug(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)