modnlp.tc.dstruct
public interface TCInvertedIndex
BVProbabilityModel

| Modifier and Type | Method and Description |
|---|---|
| `void` | `addParsedCorpus(ParsedCorpus pt, StopWordList swlist)`: Index each term (type) of each `ParsedDocument` in `ParsedCorpus`, except those in the stop-word list `swlist`, on this index. |
| `boolean` | `containsTerm(java.lang.String term)`: Check if the index contains `term`. |
| `WordScorePair[]` | `getBlankWordScoreArray()`: Make a new `WordScorePair` array, big enough to store all terms indexed by this index, with scores initialised to zero. |
| `java.util.Set` | `getCategorySet()`: Get all categories in this corpus. |
| `java.util.Vector` | `getCategVector(java.lang.String id)`: Find all categories under which document `id` has been classified. |
| `double` | `getCatGenerality(java.lang.String cat)`: Calculate and return the generality of `cat` for this model. |
| `int[]` | `getCooccurrenceVector(java.lang.String term, java.lang.String[] terms)`: Return a vector giving, for each term in `terms`, the number of documents in which the word `term` co-occurs with it. |
| `int` | `getCorpusSize()`: Size of the corpus on which this model is based. |
| `int` | `getCount(java.lang.String term)`: Return the number of occurrences of `term` in the corpus. |
| `int` | `getCount(java.lang.String id, java.lang.String term)`: Return the number of occurrences of `term` in document `id`. |
| `java.util.Set` | `getDocSet()`: Return the set of documents used in the generation of this index. |
| `int` | `getTermCount(java.lang.String term)`: Get the number of files a term occurs in. |
| `java.util.Set` | `getTermSet()`: Get the set of terms (types) indexed by this index. |
| `int` | `getTermSetSize()`: Get the number of terms (types) indexed by this index. |
| `WordScorePair[]` | `setFreqWordScoreArray(WordScorePair[] wsp)`: Take an initialised `WordScorePair` array and populate it with global term frequencies. |
| `void` | `trimTermSet(java.util.Set rts)`: Delete all entries for terms not in the reduced term set. |

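The counting methods summarised above are easy to confuse: `getCount(term)` counts occurrences, `getTermCount(term)` counts documents, and `getCooccurrenceVector` counts documents shared by two terms. The sketch below illustrates those distinctions, together with category generality, on a toy in-memory index. `ToyIndex`, `ToyIndexDemo` and all their method names are hypothetical stand-ins invented for this illustration; they do not implement `TCInvertedIndex` and say nothing about how modnlp stores its data.

```java
import java.util.*;

// Hypothetical stand-in for illustration; this is not the modnlp implementation.
class ToyIndex {
    // term -> (document id -> occurrences of the term in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();
    // document id -> categories assigned to that document
    private final Map<String, Set<String>> categories = new HashMap<>();

    void addDocument(String id, List<String> tokens, Set<String> cats) {
        categories.put(id, cats);
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
    }

    private Map<String, Integer> docsFor(String term) {
        Map<String, Integer> m = postings.get(term);
        return m == null ? Collections.<String, Integer>emptyMap() : m;
    }

    // Total occurrences of term in the corpus (compare getCount(term)).
    int count(String term) {
        int sum = 0;
        for (int c : docsFor(term).values()) sum += c;
        return sum;
    }

    // Occurrences of term in one document (compare getCount(id, term)).
    int count(String id, String term) {
        return docsFor(term).getOrDefault(id, 0);
    }

    // Number of documents containing term at least once (compare getTermCount(term)).
    int termCount(String term) {
        return docsFor(term).size();
    }

    // For each entry of terms, the number of documents containing both it and term.
    int[] cooccurrence(String term, String[] terms) {
        Set<String> base = docsFor(term).keySet();
        int[] out = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            for (String id : docsFor(terms[i]).keySet()) {
                if (base.contains(id)) out[i]++;
            }
        }
        return out;
    }

    // G_cat = documents classified as cat / documents in the corpus.
    double generality(String cat) {
        int n = 0;
        for (Set<String> cats : categories.values()) {
            if (cats.contains(cat)) n++;
        }
        return categories.isEmpty() ? 0.0 : (double) n / categories.size();
    }
}

class ToyIndexDemo {
    static ToyIndex build() {
        ToyIndex idx = new ToyIndex();
        idx.addDocument("d1", Arrays.asList("wheat", "price", "wheat"), new HashSet<>(Arrays.asList("grain")));
        idx.addDocument("d2", Arrays.asList("wheat", "export"), new HashSet<>(Arrays.asList("grain", "trade")));
        idx.addDocument("d3", Arrays.asList("oil", "price"), new HashSet<>(Arrays.asList("crude")));
        idx.addDocument("d4", Arrays.asList("oil", "export", "oil"), new HashSet<>(Arrays.asList("crude")));
        return idx;
    }

    public static void main(String[] args) {
        ToyIndex idx = build();
        System.out.println("occurrences of wheat: " + idx.count("wheat"));
        System.out.println("documents with wheat: " + idx.termCount("wheat"));
        System.out.println("co-occurrence: "
                + Arrays.toString(idx.cooccurrence("wheat", new String[] {"price", "export", "oil"})));
        System.out.println("generality of grain: " + idx.generality("grain"));
    }
}
```

In this toy corpus "wheat" occurs three times but in only two documents, which is why the two counts differ.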
boolean containsTerm(java.lang.String term)
Check if the index contains `term`.
Parameters:
`term` - a term to be looked up.
Returns:
`true` if the index contains `term`, `false` otherwise

double getCatGenerality(java.lang.String cat)
Calculate and return the generality of `cat` for this model. Generality is given by
G_cat = no_of_docs_classified_as_cat / no_of_docs_in_corpus
i.e. G_cat = p(cat).
Parameters:
`cat` - a `String` representing a category
Returns:
a `double` value

java.util.Set getDocSet()
Return the set of documents used in the generation of this index.
Returns:
a `Set` containing the IDs of all documents indexed in this index

java.util.Vector getCategVector(java.lang.String id)
Find all categories under which document `id` has been classified.
Returns:
a `Vector` of the categories (of type `String`) to which document `id` belongs

java.util.Set getCategorySet()
Get all categories in this corpus.
Returns:
a `Set` containing all categories that occur in the corpus

void addParsedCorpus(ParsedCorpus pt, StopWordList swlist)
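Index each term (type) of each `ParsedDocument` in `ParsedCorpus`, except those in the stop-word list `swlist`, on this index.
Parameters:
`pt` - a `ParsedCorpus` value
`swlist` - a `StopWordList` value

void trimTermSet(java.util.Set rts)
Delete all entries for terms not in the reduced term set.
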
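Indexing with a stop-word list and later trimming to a reduced term set are two separate filtering steps. The sketch below shows one way they could interact. `ToyTermIndex` and its methods are hypothetical; the example uses plain `Map`, `List` and `Set` values in place of modnlp's `ParsedCorpus` and `StopWordList`, whose APIs are not shown in this document.

```java
import java.util.*;

// Hypothetical stand-in; modnlp's ParsedCorpus and StopWordList are not used here.
class ToyTermIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Index every token of every document, skipping stop words.
    void addCorpus(Map<String, List<String>> corpus, Set<String> stopWords) {
        for (Map.Entry<String, List<String>> doc : corpus.entrySet()) {
            for (String token : doc.getValue()) {
                if (stopWords.contains(token)) continue;
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(doc.getKey());
            }
        }
    }

    // Delete all entries for terms that are not in the reduced term set.
    void trim(Set<String> reduced) {
        postings.keySet().retainAll(reduced);
    }

    Set<String> termSet() { return postings.keySet(); }
    int termSetSize() { return postings.size(); }
    boolean containsTerm(String term) { return postings.containsKey(term); }
}

class ToyTermIndexDemo {
    static ToyTermIndex build() {
        Map<String, List<String>> corpus = new LinkedHashMap<>();
        corpus.put("d1", Arrays.asList("the", "wheat", "price"));
        corpus.put("d2", Arrays.asList("the", "oil", "price"));
        ToyTermIndex idx = new ToyTermIndex();
        idx.addCorpus(corpus, new HashSet<>(Arrays.asList("the")));
        return idx;
    }

    public static void main(String[] args) {
        ToyTermIndex idx = build();
        System.out.println("terms after indexing: " + idx.termSet());
        // "gold" was never indexed, so keeping it in the reduced set adds nothing.
        idx.trim(new HashSet<>(Arrays.asList("price", "wheat", "gold")));
        System.out.println("terms after trimming: " + idx.termSet());
    }
}
```

Trimming only removes entries: a term in the reduced set that was never indexed does not appear afterwards.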
int getTermSetSize()
Get the number of terms (types) indexed by this index.
Returns:
an `int` value

java.util.Set getTermSet()
Get the set of terms (types) indexed by this index.
Returns:
a `Set` containing all terms in the index

int getTermCount(java.lang.String term)
Get the number of files a term occurs in.
Parameters:
`term` - word (type) to be looked up
Returns:
an `int` value containing the number of files that contain at least one token of type `term`

int getCount(java.lang.String id, java.lang.String term)
Return the number of occurrences of `term` in document `id`.
Parameters:
`id` - a `String` representing a unique document id
`term` - a `String`

int getCount(java.lang.String term)
Return the number of occurrences of `term` in the corpus.
Parameters:
`term` - a `String`

int[] getCooccurrenceVector(java.lang.String term, java.lang.String[] terms)
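Return a vector giving, for each term in `terms`, the number of documents in which the word `term` co-occurs with it.
Parameters:
`term` - a `String` value
`terms` - a `String[]` value
Returns:
an `int[]` value

int getCorpusSize()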
Size of the corpus on which this model is based. corpusSize represents the number of documents.

WordScorePair[] getBlankWordScoreArray()
Make a new `WordScorePair` array, big enough to store all terms indexed by this index, with scores initialised to zero.
Returns:
a `WordScorePair[]` value

WordScorePair[] setFreqWordScoreArray(WordScorePair[] wsp)
Take an initialised `WordScorePair` array and populate it with global term frequencies.
Parameters:
`wsp` - a `WordScorePair[]` value
Returns:
a `WordScorePair[]` value
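
The last two methods suggest a two-step pattern: allocate a zero-scored array with one slot per indexed term, then fill the scores in. The sketch below illustrates that pattern only. `ToyPair` is a hypothetical stand-in for `WordScorePair`, whose real fields and accessors are not shown in this document, and `ToyScores` is likewise invented. Returning the same array instance from the fill step is an assumption, not documented behaviour.

```java
import java.util.*;

// Hypothetical stand-in for modnlp's WordScorePair; its real API is not shown here.
class ToyPair {
    final String word;
    double score;

    ToyPair(String word, double score) {
        this.word = word;
        this.score = score;
    }
}

// Hypothetical holder of global term frequencies, for illustration only.
class ToyScores {
    // term -> total occurrences in the corpus, kept in sorted term order
    private final Map<String, Integer> freq = new TreeMap<>();

    void add(String term, int occurrences) {
        freq.merge(term, occurrences, Integer::sum);
    }

    // One pair per indexed term, every score initialised to zero.
    ToyPair[] blankArray() {
        ToyPair[] out = new ToyPair[freq.size()];
        int i = 0;
        for (String term : freq.keySet()) {
            out[i++] = new ToyPair(term, 0.0);
        }
        return out;
    }

    // Overwrite each score with the global frequency of its word.
    // Assumption: the same array instance is returned.
    ToyPair[] fillFrequencies(ToyPair[] pairs) {
        for (ToyPair p : pairs) {
            p.score = freq.getOrDefault(p.word, 0);
        }
        return pairs;
    }
}

class ToyScoresDemo {
    static ToyScores build() {
        ToyScores s = new ToyScores();
        s.add("wheat", 3);
        s.add("oil", 2);
        s.add("wheat", 1);
        return s;
    }

    public static void main(String[] args) {
        ToyScores s = build();
        for (ToyPair p : s.fillFrequencies(s.blankArray())) {
            System.out.println(p.word + " " + p.score);
        }
    }
}
```

Keeping allocation and scoring separate lets the same blank array be reused for other scoring functions, which may be the motivation for the two-method design.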