modnlp.tc.dstruct
public interface TCInvertedIndex
BVProbabilityModel
Modifier and Type | Method and Description |
---|---|
`void` | `addParsedCorpus(ParsedCorpus pt, StopWordList swlist)` Index each term (type) of each `ParsedDocument` in `ParsedCorpus`, except those in `swlist`, on this index. |
`boolean` | `containsTerm(java.lang.String term)` Check if the index contains `term`. |
`WordScorePair[]` | `getBlankWordScoreArray()` Make a new `WordScorePair` array, big enough to store all terms indexed by this index, with scores initialised to zero. |
`java.util.Set` | `getCategorySet()` Get all categories in this corpus. |
`java.util.Vector` | `getCategVector(java.lang.String id)` Find all categories under which document `id` has been classified. |
`double` | `getCatGenerality(java.lang.String cat)` Calculate and return the generality of `cat` for this model. |
`int[]` | `getCooccurrenceVector(java.lang.String term, java.lang.String[] terms)` Return a vector containing, for each term in `terms`, the number of documents in which the word `term` co-occurs with it. |
`int` | `getCorpusSize()` Size of the corpus on which this model is based. |
`int` | `getCount(java.lang.String term)` Return the number of occurrences of `term` in the corpus. |
`int` | `getCount(java.lang.String id, java.lang.String term)` Return the number of occurrences of `term` in document `id`. |
`java.util.Set` | `getDocSet()` Return the set of documents used in the generation of this index. |
`int` | `getTermCount(java.lang.String term)` Get the number of files a term occurs in. |
`java.util.Set` | `getTermSet()` Get the set of terms (types) indexed by this index. |
`int` | `getTermSetSize()` Get the number of terms (types) indexed by this index. |
`WordScorePair[]` | `setFreqWordScoreArray(WordScorePair[] wsp)` Take an initialised `WordScorePair` array and populate it with global term frequency. |
`void` | `trimTermSet(java.util.Set rts)` Delete all entries for terms not in the reduced term set. |
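
boolean containsTerm(java.lang.String term)
Check if the index contains `term`.
Parameters:
`term` - a term to be looked up.
Returns:
true if the index contains `term`, false otherwise

double getCatGenerality(java.lang.String cat)
Calculate and return the generality of `cat` for this model. Generality is given by
G_cat = no_of_docs_classified_as_cat / no_of_docs_in_corpus
i.e. G_cat = p(cat)
Parameters:
`cat` - a `String` representing a category
Returns:
a `double` value

java.util.Set getDocSet()
Return the set of documents used in the generation of this index.
Returns:
a `Set` containing the IDs of all documents indexed in this index

java.util.Vector getCategVector(java.lang.String id)
Find all categories under which document `id` has been classified.
Returns:
a `Vector` containing the categories (of type `String`) to which document `id` belongs

java.util.Set getCategorySet()
Get all categories in this corpus.
Returns:
a `Set` containing all categories that occur in the corpus

void addParsedCorpus(ParsedCorpus pt, StopWordList swlist)
Index each term (type) of each `ParsedDocument` in `ParsedCorpus`, except those in `swlist`, on this index.
Parameters:
`pt` - a `ParsedCorpus` value
`swlist` - a `StopWordList` value

void trimTermSet(java.util.Set rts)
Delete all entries for terms not in the reduced term set.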
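
The generality formula given for `getCatGenerality` above can be illustrated with a small stand-alone sketch. It does not use the modnlp classes: `GeneralitySketch`, its `generality` method, and the document-to-categories map are hypothetical names chosen for illustration, and returning 0 for an empty corpus is an assumption rather than documented behaviour.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not part of modnlp: the corpus is modelled as a map
// from document id to the categories that document was classified under.
final class GeneralitySketch {

    // G_cat = no_of_docs_classified_as_cat / no_of_docs_in_corpus, i.e. p(cat)
    static double generality(Map<String, List<String>> docCategories, String cat) {
        if (docCategories.isEmpty()) {
            return 0.0; // assumption: an empty corpus yields generality 0
        }
        long docsInCat = docCategories.values().stream()
                .filter(cats -> cats.contains(cat))
                .count();
        return (double) docsInCat / docCategories.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = Map.of(
                "d1", List.of("acq"),
                "d2", List.of("acq", "earn"),
                "d3", List.of("earn"),
                "d4", List.of("grain"));
        System.out.println(generality(docs, "acq"));   // 2 of 4 documents
        System.out.println(generality(docs, "grain")); // 1 of 4 documents
    }
}
```

A document classified under several categories counts once towards each of them, so the generalities of all categories need not sum to 1.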
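
int getTermSetSize()
Get the number of terms (types) indexed by this index.
Returns:
an `int` value

java.util.Set getTermSet()
Get the set of terms (types) indexed by this index.
Returns:
a `Set` containing all terms in the index

int getTermCount(java.lang.String term)
Get the number of files a term occurs in.
Parameters:
`term` - word (type) to be looked up
Returns:
an `int` value containing the number of files that contain at least one token of type `term`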

int getCount(java.lang.String id, java.lang.String term)
Return the number of occurrences of `term` in document `id`.
Parameters:
`id` - a `String` representing a unique document id
`term` - a `String`

int getCount(java.lang.String term)
Return the number of occurrences of `term` in the corpus.
Parameters:
`term` - a `String`
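
The three counting methods are easy to confuse, so the sketch below contrasts them on a toy index. It is not the modnlp implementation: the class `CountSketch`, its `add` helper, and the nested-map layout (term to document id to occurrences) are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy index, not the modnlp implementation:
// term -> (document id -> number of occurrences in that document).
final class CountSketch {
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Hypothetical helper: index the tokens of one document.
    void add(String id, String... tokens) {
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
    }

    // Analogue of getCount(id, term): occurrences of term in one document.
    int getCount(String id, String term) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(id, 0);
    }

    // Analogue of getCount(term): occurrences of term in the whole corpus.
    int getCount(String term) {
        int total = 0;
        for (int c : postings.getOrDefault(term, Map.of()).values()) {
            total += c;
        }
        return total;
    }

    // Analogue of getTermCount(term): number of documents that contain
    // at least one token of type term.
    int getTermCount(String term) {
        return postings.getOrDefault(term, Map.of()).size();
    }

    public static void main(String[] args) {
        CountSketch index = new CountSketch();
        index.add("d1", "oil", "price", "oil");
        index.add("d2", "oil", "grain");
        System.out.println(index.getCount("d1", "oil")); // occurrences in d1
        System.out.println(index.getCount("oil"));       // occurrences in the corpus
        System.out.println(index.getTermCount("oil"));   // documents containing "oil"
    }
}
```

With the two documents above, "oil" occurs twice in `d1`, three times in the corpus, and in two documents, which shows why `getCount(term)` and `getTermCount(term)` generally differ.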

int[] getCooccurrenceVector(java.lang.String term, java.lang.String[] terms)
Return a vector containing, for each term in `terms`, the number of documents in which the word `term` co-occurs with it.
Parameters:
`term` - a `String` value
`terms` - a `String[]` value
Returns:
an `int[]` value

int getCorpusSize()
Size of the corpus on which this model is based. `corpusSize` represents the number of documents.

WordScorePair[] getBlankWordScoreArray()
Make a new `WordScorePair` array, big enough to store all terms indexed by this index, with scores initialised to zero.
Returns:
a `WordScorePair[]` value

WordScorePair[] setFreqWordScoreArray(WordScorePair[] wsp)
Take an initialised `WordScorePair` array and populate it with global term frequency.
Parameters:
`wsp` - a `WordScorePair[]` value
Returns:
a `WordScorePair[]` value
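
As an illustration of what `getCooccurrenceVector` returns, the following sketch computes a co-occurrence vector over a toy index. It assumes that two terms co-occur when they appear in the same document. The class `CooccurrenceSketch`, its `add` helper, and the term-to-document-set layout are hypothetical and are not taken from modnlp.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy index, not the modnlp implementation:
// term -> ids of the documents that contain it.
final class CooccurrenceSketch {
    private final Map<String, Set<String>> docsByTerm = new HashMap<>();

    // Hypothetical helper: index the tokens of one document.
    void add(String id, String... tokens) {
        for (String t : tokens) {
            docsByTerm.computeIfAbsent(t, k -> new HashSet<>()).add(id);
        }
    }

    // For each entry of terms, count the documents that contain both
    // term and that entry (assumed meaning of "co-occurs").
    int[] cooccurrenceVector(String term, String[] terms) {
        Set<String> base = docsByTerm.getOrDefault(term, Set.of());
        int[] counts = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            Set<String> other = docsByTerm.getOrDefault(terms[i], Set.of());
            for (String id : base) {
                if (other.contains(id)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        CooccurrenceSketch index = new CooccurrenceSketch();
        index.add("d1", "oil", "price");
        index.add("d2", "oil", "grain");
        index.add("d3", "price");
        int[] v = index.cooccurrenceVector("oil", new String[] {"price", "grain", "wheat"});
        System.out.println(java.util.Arrays.toString(v));
    }
}
```

The result has one entry per element of `terms`, in the same order. Under this sketch's definition, a term that is never indexed contributes 0, and passing `term` itself in `terms` yields its document frequency.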