modnlp.tc.util
public class ARFFUtil extends PrintUtil
the weka toolkit for data mining.

| Constructor and Description |
|---|
| ARFFUtil() |

| Modifier and Type | Method and Description |
|---|---|
| static void | `printBooleanARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)` Converts a TCInvertedIndex into an ARFF file for category (a single category or `null`, representing all categories) and prints the ARFF file onto an output stream. |
| static void | `printDebug(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)` Prints debug information and all possible ARFF representations this class handles. |
| static void | `printOccurARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)` Converts a TCInvertedIndex into an ARFF file for category (a single category or `null`, representing all categories) and prints the ARFF file onto an output stream. |
| static void | `printPWeightARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)` Converts a TCInvertedIndex into an ARFF file for category (a single category or `null`, representing all categories) and prints the ARFF file onto an output stream. |
| static void | `printTermByDocMatrixCSV(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)` |
| static void | `printTermCoOccurARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)` |
| static void | `printTermCoOccurCSV(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)` |
| static void | `printTFIDFARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)` Converts a TCInvertedIndex into an ARFF file for category (a single category or `null`, representing all categories) and prints the ARFF file onto an output stream. |
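
For orientation, the general shape of an ARFF file holding occurrence vectors can be sketched as below. This is only an illustration of the ARFF layout (a relation name, one attribute per selected term, then one data row per document); it does not reproduce the actual output of ARFFUtil. The class `OccurArffSketch`, its parameters, and the trailing `class` attribute used to carry the category label are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical sketch: not part of modnlp and not the ARFFUtil implementation.
class OccurArffSketch {

    /**
     * Writes one numeric attribute per term and one row of counts per document.
     * Assumption: the category is encoded as a final nominal attribute named "class".
     * Assumption: term names need no ARFF quoting (no spaces or special characters).
     */
    static void printOccurArff(String relation, String[] terms, int[][] counts,
                               String[] labels, PrintStream out) {
        out.println("@relation " + relation);
        out.println();
        for (String term : terms) {
            out.println("@attribute " + term + " numeric");
        }
        out.println("@attribute class {true,false}");
        out.println();
        out.println("@data");
        for (int d = 0; d < counts.length; d++) {
            StringBuilder row = new StringBuilder();
            for (int count : counts[d]) {
                row.append(count).append(',');
            }
            row.append(labels[d]);
            out.println(row);
        }
    }

    /** Builds a two-document, two-term example and returns the ARFF text. */
    static String demo() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer);
        String[] terms = {"oil", "price"};
        int[][] counts = {{2, 0}, {0, 1}};
        String[] labels = {"true", "false"};
        printOccurArff("demo", terms, counts, labels, out);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

A Boolean representation would follow the same layout with each count replaced by an occurrence flag; how ARFFUtil encodes that flag is not stated in this page.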

Methods inherited from class PrintUtil:
donePrinting, printNoMove, resetCounter, toString, toString, toString, toString, toString, toString, toString, toString, toString

public static void printOccurARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)
Converts a
TCInvertedIndex into an ARFF file for
category (a single category or `null`,
representing all categories) and prints the ARFF file onto an
output stream. The WordFrequencyPair array restricts
the entries of this ARFF file to those terms that occur in wfp
(i.e. the terms selected by term set reduction).
Documents are represented as vectors of integers whose elements
indicate the number of occurrences of terms in a document.

Parameters:
ii - a TCInvertedIndex value
wfp - a WordFrequencyPair[] value
category - a String representing a category, or null representing all categories
out - a PrintStream value

public static void printBooleanARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)
Converts a
TCInvertedIndex into an ARFF file for
category (a single category or `null`,
representing all categories) and prints the ARFF file onto an
output stream. The WordFrequencyPair array restricts
the entries of this ARFF file to those terms that occur in wfp
(i.e. the terms selected by term set reduction).
Documents are represented as vectors of Boolean values whose elements
indicate the occurrence or non-occurrence of terms in the document.

Parameters:
ii - a TCInvertedIndex value
wfp - a WordFrequencyPair[] value
category - a String representing a category, or null representing all categories
out - a PrintStream value

See Also:
NewsItemAsOccurVector.getBooleanTextArray()

public static void printTFIDFARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)
Converts a
TCInvertedIndex into an ARFF file for
category (a single category or `null`,
representing all categories) and prints the ARFF file onto an
output stream. The WordFrequencyPair array restricts
the entries of this ARFF file to those terms that occur in wfp
(i.e. the terms selected by term set reduction).
Documents are represented as vectors of TFIDF values calculated as follows:

$$\mathrm{tfidf} = \mathrm{no\_of\_occurrences\_of\_t\_in\_d} \times \log \frac{\mathrm{size\_of\_corpus}}{\mathrm{size\_of\_subcorpus\_in\_which\_t\_occurs}}$$

Parameters:
ii - a TCInvertedIndex value
wfp - a WordFrequencyPair[] value
category - a String representing a category, or null representing all categories
out - a PrintStream value

See Also:
NewsItemAsOccurVector.getTFIDFVector(WordFrequencyPair[],int)

public static void printPWeightARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.lang.String category, java.io.PrintStream out)
Converts a
TCInvertedIndex into an ARFF file for
category (a single category or `null`,
representing all categories) and prints the ARFF file onto an
output stream. The WordFrequencyPair array restricts
the entries of this ARFF file to those terms that occur in wfp
(i.e. the terms selected by term set reduction).
Documents are represented as vectors of proportional term weight
values calculated as follows:

$$\mathrm{pweight} = \mathrm{round}\left( 10 \times \frac{1 + \log \mathrm{no\_occurs\_term\_i\_in\_j}}{1 + \log \mathrm{no\_terms\_in\_j}} \right)$$

if no_terms_in_j > 0. Otherwise pweight = 0.

Parameters:
ii - a TCInvertedIndex value
wfp - a WordFrequencyPair[] value
category - a String representing a category, or null representing all categories
out - a PrintStream value

See Also:
NewsItemAsOccurVector.getPWEIGHTVector(int)

public static void printTermCoOccurARFF(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)
public static void printTermCoOccurCSV(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)
public static void printTermByDocMatrixCSV(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)
public static void printDebug(TCInvertedIndex ii, WordFrequencyPair[] wfp, java.io.PrintStream out)
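
The TFIDF and proportional term weight formulas documented for printTFIDFARFF and printPWeightARFF can be sketched in plain Java as follows. This is a minimal illustration, not the modnlp implementation: the class `TermWeightSketch` and its method and parameter names are hypothetical, the natural logarithm is assumed because the documentation does not state a base, and a weight of 0 is assumed wherever a logarithm of zero would otherwise arise.

```java
// Hypothetical sketch: these helpers are not part of modnlp.
class TermWeightSketch {

    /**
     * tfidf = occurrencesInDoc * log(corpusSize / docsWithTerm).
     * Assumption: natural logarithm (the base is not stated in the source).
     * Assumption: a term that occurs in no document gets weight 0.
     */
    static double tfidf(int occurrencesInDoc, int corpusSize, int docsWithTerm) {
        if (docsWithTerm <= 0) {
            return 0.0;
        }
        return occurrencesInDoc * Math.log((double) corpusSize / docsWithTerm);
    }

    /**
     * pweight = round(10 * (1 + log occurrencesInDoc) / (1 + log termsInDoc))
     * if termsInDoc > 0, otherwise 0.
     * Assumption: natural logarithm.
     * Assumption: a term absent from the document gets weight 0, since log 0 is undefined.
     */
    static long pweight(int occurrencesInDoc, int termsInDoc) {
        if (termsInDoc <= 0 || occurrencesInDoc <= 0) {
            return 0L;
        }
        double ratio = (1.0 + Math.log(occurrencesInDoc)) / (1.0 + Math.log(termsInDoc));
        return Math.round(10.0 * ratio);
    }

    public static void main(String[] args) {
        // A term seen 3 times in a document, found in 10 of 100 documents.
        System.out.println(tfidf(3, 100, 10));
        // A term seen 5 times in a document of 100 terms.
        System.out.println(pweight(5, 100));
    }
}
```

Because the ratio in pweight depends on the logarithm base, a different base would change the rounded weights; the values produced here hold only under the natural-logarithm assumption.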