modnlp.tc.dstruct
public class BVProbabilityModel extends TCProbabilityModel implements TCInvertedIndex
TCInvertedIndex given here can also serve as the basis for
other kinds of TCProbabilityModels. (!!ADD THIS TO
TODO LIST!!)

See Also: TCProbabilityModel, TCInvertedIndex, Serialized Form

Inherited fields: invertedIndex, LAPLACE, NOSMOOTHING

| Constructor and Description |
|---|
| BVProbabilityModel(): Creates a new BVProbabilityModel instance and TCInvertedIndex. |
| BVProbabilityModel(ParsedCorpus pt, StopWordList swlist): Creates a new BVProbabilityModel instance and TCInvertedIndex (see above note re. separating these two classes) and initialises them with pt, excluding the terms in swlist. |

| Modifier and Type | Method and Description |
|---|---|
| void | addParsedCorpus(ParsedCorpus pt, StopWordList swlist): Index each term (type) of each ParsedDocument in ParsedCorpus, except those in swlist, on this PM. |
| void | addParsedDocument(ParsedDocument pni, StopWordList swlist) |
| boolean | containsTerm(java.lang.String term): Check if the index contains term. |
| WordScorePair[] | getBlankWordScoreArray(): Make a new WordScorePair array, big enough to store all terms indexed by this PM, with scores initialised to zero. |
| java.util.Set | getCategorySet(): Get all categories in this corpus. |
| int | getCategSetSize() |
| java.util.Vector | getCategVector(java.lang.String id): Find all categories under which document id has been classified. |
| double | getCatGenerality(java.lang.String cat): Calculate and return the generality of cat for this model. |
| int[] | getCooccurrenceVector(java.lang.String term, java.lang.String[] terms): Return a vector containing the number of documents in which term co-occurs with each term in terms. |
| int | getCorpusSize(): Get the size of docSet. |
| int | getCount(java.lang.String term): Return the number of occurrences of term in the corpus. |
| int | getCount(java.lang.String id, java.lang.String term): Return the number of occurrences of term in document id. |
| java.util.Set | getDocSet(): Return the set of documents used in the generation of this index. |
| Probabilities | getProbabilities(java.lang.String term, java.lang.String cat): Get a summary of probabilities associated with term and cat. |
| int | getTermCount(java.lang.String term): Get the number of files a term occurs in. |
| java.util.Set | getTermSet(): Get the set of terms (types) indexed by this index. |
| int | getTermSetSize(): Get the number of terms (types) indexed by this PM. |
| WordScorePair[] | getWordScoreArray() |
| boolean | isIgnoreCase(): Get the value of ignoreCase. |
| boolean | occursInCategory(java.lang.String term, java.lang.String cat) |
| WordScorePair[] | setFreqWordScoreArray(WordScorePair[] wsp): Takes an initialised WordScorePair array and populates it with global term frequency. |
| void | setIgnoreCase(boolean v): Set the value of ignoreCase. |
| void | trimTermSet(java.util.Set rts): Delete all entries for terms not in the reduced term set. |
| void | trimTermSet(WordFrequencyPair[] rts): Delete all entries for terms not in the reduced term set. |

Inherited methods: getCreationInfo, getCreator, getCreatorArgs, getCreatorArgsCSV, getSmoothingType, setCreator, setCreatorArgs, setSmoothingType

public BVProbabilityModel()
Creates a new BVProbabilityModel instance and TCInvertedIndex. (See above note re. separating these two classes)

public BVProbabilityModel(ParsedCorpus pt, StopWordList swlist)
Creates a new BVProbabilityModel instance and TCInvertedIndex (see above note re. separating these two classes) and initialises them with pt, excluding the terms in swlist.
Parameters:
pt - a ParsedCorpus value
swlist - a StopWordList value

public void addParsedCorpus(ParsedCorpus pt, StopWordList swlist)
Index each term (type) of each ParsedDocument in ParsedCorpus, except those in swlist, on this PM.
Specified by: addParsedCorpus in interface TCInvertedIndex
Parameters:
pt - a ParsedCorpus value
swlist - a StopWordList value

public void addParsedDocument(ParsedDocument pni, StopWordList swlist)

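As a rough illustration of what indexing a corpus while skipping stop words involves, here is a minimal sketch built on plain Java collections. The class InvertedIndexSketch, its methods, and the representation of a corpus as a map from document id to tokens are all assumptions made for this example; they are not the modnlp classes or their actual implementation. It needs Java 9 or later for Map.of, List.of and Set.of.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an inverted index; NOT the modnlp implementation.
class InvertedIndexSketch {
    // term -> (document id -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();
    // ids of all documents seen so far
    private final Set<String> docSet = new HashSet<>();

    // Index every token of one document, skipping stop words.
    void addDocument(String id, List<String> tokens, Set<String> stopWords) {
        docSet.add(id);
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(id, 1, Integer::sum);
        }
    }

    // Index a whole corpus, given here as document id -> tokens (an assumption).
    void addCorpus(Map<String, List<String>> corpus, Set<String> stopWords) {
        corpus.forEach((id, tokens) -> addDocument(id, tokens, stopWords));
    }

    boolean containsTerm(String term) { return postings.containsKey(term); }
    int corpusSize() { return docSet.size(); }
    int termSetSize() { return postings.size(); }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        Map<String, List<String>> corpus = Map.of(
            "d1", List.of("the", "price", "of", "oil"),
            "d2", List.of("the", "oil", "market"));
        index.addCorpus(corpus, Set.of("the", "of"));
        System.out.println(index.containsTerm("oil")); // true
        System.out.println(index.containsTerm("the")); // false (stop word)
        System.out.println(index.termSetSize());       // 3
    }
}
```

Keeping per-document counts in the postings map is one possible layout; it makes document-level and corpus-level counts recoverable from the same structure.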
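public java.util.Set getCategorySet()
Get all categories in this corpus.
Specified by: getCategorySet in interface TCInvertedIndex
Returns: a Set containing all categories that occur in the corpus

public java.util.Set getDocSet()
Return the set of documents used in the generation of this index.
Specified by: getDocSet in interface TCInvertedIndex
Returns: a Set containing the IDs of all documents indexed in this index

public int getCorpusSize()
Get the size of docSet.
Specified by: getCorpusSize in interface TCInvertedIndex

public boolean containsTerm(java.lang.String term)
Description copied from interface: TCInvertedIndex
Check if the index contains term.
Specified by: containsTerm in interface TCInvertedIndex
Parameters:
term - a term to be looked up.
Returns: true if the index contains term, false otherwise

public double getCatGenerality(java.lang.String cat)
Calculate and return the generality of cat for this model. Generality is given by
G_cat = no_of_docs_classified_as_cat / no_of_docs_in_corpus
i.e. (G_cat = p(cat))
Specified by: getCatGenerality in interface TCInvertedIndex
Parameters:
cat - a String representing a category
Returns: a double value

public Probabilities getProbabilities(java.lang.String term, java.lang.String cat)
Get a summary of probabilities associated with term and cat.
getProbabilities in class TCProbabilityModel
Parameters:
term -
cat - a String representing a category
Returns: Probabilities

public int getTermSetSize()
Get the number of terms (types) indexed by this PM.
Specified by: getTermSetSize in interface TCInvertedIndex
Returns: an int value

public void trimTermSet(java.util.Set rts)
Delete all entries for terms not in the reduced term set.
Specified by: trimTermSet in interface TCInvertedIndex

public java.util.Set getTermSet()
Description copied from interface: TCInvertedIndex
Get the set of terms (types) indexed by this index.
Specified by: getTermSet in interface TCInvertedIndex
Returns: a Set containing all terms in the index

public void trimTermSet(WordFrequencyPair[] rts)
Delete all entries for terms not in the reduced term set.
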
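The effect described for trimTermSet (deleting all entries for terms not in the reduced term set) can be pictured with a small sketch. It assumes, purely for illustration, that the index is a map keyed by term; the class TrimTermSetSketch and its trim method are hypothetical and say nothing about how the library stores its entries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of trimming an index to a reduced term set.
class TrimTermSetSketch {
    // Remove every entry whose term is not in reducedTermSet.
    // keySet() is a live view, so retainAll also removes the map entries.
    static void trim(Map<String, Integer> termCounts, Set<String> reducedTermSet) {
        termCounts.keySet().retainAll(reducedTermSet);
    }

    public static void main(String[] args) {
        Map<String, Integer> termCounts = new HashMap<>();
        termCounts.put("oil", 12);
        termCounts.put("market", 7);
        termCounts.put("said", 40);
        trim(termCounts, Set.of("oil", "market"));
        System.out.println(termCounts.size());              // 2
        System.out.println(termCounts.containsKey("said")); // false
    }
}
```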
public int getCategSetSize()
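The generality formula given for getCatGenerality, G_cat = no_of_docs_classified_as_cat / no_of_docs_in_corpus, can be worked through on a toy corpus. The sketch below assumes a map from document id to that document's list of categories (the kind of information getCategVector returns per document); GeneralitySketch and its method names are hypothetical, not part of modnlp.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of category generality, G_cat = p(cat).
class GeneralitySketch {
    // G_cat = documents classified as cat / documents in the corpus
    static double generality(Map<String, List<String>> categoriesByDoc, String cat) {
        if (categoriesByDoc.isEmpty()) {
            return 0.0; // avoid dividing by zero on an empty corpus
        }
        long inCat = categoriesByDoc.values().stream()
                .filter(cats -> cats.contains(cat))
                .count();
        return (double) inCat / categoriesByDoc.size();
    }

    // Number of distinct categories used anywhere in the corpus.
    static int categorySetSize(Map<String, List<String>> categoriesByDoc) {
        Set<String> all = new HashSet<>();
        categoriesByDoc.values().forEach(all::addAll);
        return all.size();
    }

    public static void main(String[] args) {
        Map<String, List<String>> categoriesByDoc = Map.of(
            "d1", List.of("crude", "trade"),
            "d2", List.of("crude"),
            "d3", List.of("grain"),
            "d4", List.of("trade"));
        System.out.println(generality(categoriesByDoc, "crude")); // 0.5
        System.out.println(generality(categoriesByDoc, "grain")); // 0.25
        System.out.println(categorySetSize(categoriesByDoc));     // 3
    }
}
```

Because a document may belong to several categories, the generalities of all categories need not sum to 1.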
public java.util.Vector getCategVector(java.lang.String id)
Description copied from interface: TCInvertedIndex
Find all categories under which document id has been classified.
Specified by: getCategVector in interface TCInvertedIndex
Returns: a Vector containing the categories (of type String) to which document id belongs

public WordScorePair[] getBlankWordScoreArray()
Make a new WordScorePair array, big enough to store all terms indexed by this PM, with scores initialised to zero.
Specified by: getBlankWordScoreArray in interface TCInvertedIndex
Returns: a WordScorePair[] value

public WordScorePair[] setFreqWordScoreArray(WordScorePair[] wsp)
Takes an initialised WordScorePair array and populates it with global term frequency.
Specified by: setFreqWordScoreArray in interface TCInvertedIndex
Parameters:
wsp - a WordScorePair[] value
Returns: a WordScorePair[] value

public WordScorePair[] getWordScoreArray()

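The two-step pattern described above (first create a zero-initialised array with one slot per indexed term, then fill it with global term frequencies) might look like the following sketch. The nested TermScore class is a hypothetical stand-in written for this example; it is not modnlp's WordScorePair, whose fields and constructors are not documented on this page.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a blank score array that is later filled
// with corpus-wide term frequencies.
class WordScoreSketch {
    // Stand-in for a (word, score) pair; NOT modnlp's WordScorePair.
    static final class TermScore {
        final String word;
        double score;
        TermScore(String word, double score) {
            this.word = word;
            this.score = score;
        }
    }

    // One entry per indexed term, every score initialised to zero.
    static TermScore[] blankArray(Map<String, Integer> corpusCounts) {
        TermScore[] out = new TermScore[corpusCounts.size()];
        int i = 0;
        for (String term : corpusCounts.keySet()) {
            out[i++] = new TermScore(term, 0.0);
        }
        return out;
    }

    // Overwrite each score with the term's global (corpus-wide) frequency.
    static TermScore[] fillWithFrequencies(TermScore[] pairs,
                                           Map<String, Integer> corpusCounts) {
        for (TermScore p : pairs) {
            p.score = corpusCounts.getOrDefault(p.word, 0);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // TreeMap keeps terms sorted, so the output order is deterministic.
        Map<String, Integer> corpusCounts = new TreeMap<>();
        corpusCounts.put("oil", 12);
        corpusCounts.put("market", 7);
        TermScore[] pairs = fillWithFrequencies(blankArray(corpusCounts), corpusCounts);
        for (TermScore p : pairs) {
            System.out.println(p.word + " " + p.score);
        }
        // prints "market 7.0" then "oil 12.0"
    }
}
```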
public int getTermCount(java.lang.String term)
Get the number of files a term occurs in.
Specified by: getTermCount in interface TCInvertedIndex
Parameters:
term - word (type) to be looked up
Returns: an int value containing the number of files that contain at least one token of type term

public int getCount(java.lang.String id, java.lang.String term)
Return the number of occurrences of term in document id.
Specified by: getCount in interface TCInvertedIndex
Parameters:
id - a String representing a unique document id
term - a String

public int getCount(java.lang.String term)
Return the number of occurrences of term in the corpus.
Specified by: getCount in interface TCInvertedIndex
Parameters:
term - a String

public int[] getCooccurrenceVector(java.lang.String term, java.lang.String[] terms)
Description copied from interface: TCInvertedIndex
Return a vector containing the number of documents in which term co-occurs with each term in terms.
Specified by: getCooccurrenceVector in interface TCInvertedIndex
Parameters:
term - a String value
terms - a String[] value
Returns: an int[] value

public boolean occursInCategory(java.lang.String term, java.lang.String cat)

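The three counts documented above are easy to confuse: getTermCount is a document frequency, getCount(term) is a corpus-wide occurrence count, and getCount(id, term) is a per-document occurrence count. The sketch below illustrates the difference, along with a co-occurrence vector, on a toy index. The class CountsSketch and its term-to-postings layout are assumptions made for this example, not the library's internals; it needs Java 9 or later for Map.of and List.of.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of document frequency, occurrence counts
// and co-occurrence counts over a toy inverted index.
class CountsSketch {
    // term -> (document id -> occurrences in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    void add(String id, List<String> tokens) {
        for (String t : tokens) {
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(id, 1, Integer::sum);
        }
    }

    // Number of documents containing at least one token of the term.
    int termCount(String term) {
        return postings.getOrDefault(term, Map.of()).size();
    }

    // Occurrences of the term in one document.
    int count(String id, String term) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(id, 0);
    }

    // Occurrences of the term in the whole corpus.
    int count(String term) {
        return postings.getOrDefault(term, Map.of()).values().stream()
                .mapToInt(Integer::intValue).sum();
    }

    // result[i] = number of documents containing both term and terms[i]
    int[] cooccurrenceVector(String term, String[] terms) {
        Set<String> docsWithTerm = postings.getOrDefault(term, Map.of()).keySet();
        int[] result = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            Set<String> shared = new HashSet<>(docsWithTerm);
            shared.retainAll(postings.getOrDefault(terms[i], Map.of()).keySet());
            result[i] = shared.size();
        }
        return result;
    }

    public static void main(String[] args) {
        CountsSketch index = new CountsSketch();
        index.add("d1", List.of("oil", "price", "oil"));
        index.add("d2", List.of("oil", "market"));
        index.add("d3", List.of("market"));
        System.out.println(index.termCount("oil"));   // 2 documents
        System.out.println(index.count("oil"));       // 3 occurrences
        System.out.println(index.count("d1", "oil")); // 2 occurrences in d1
        System.out.println(Arrays.toString(
            index.cooccurrenceVector("oil", new String[] {"price", "market"}))); // [1, 1]
    }
}
```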
public boolean isIgnoreCase()
Get the value of ignoreCase.

public void setIgnoreCase(boolean v)
Set the value of ignoreCase.
Parameters:
v - Value to assign to ignoreCase.