modnlp.tc.dstruct
public class BVProbabilityModel extends TCProbabilityModel implements TCInvertedIndex
TCInvertedIndex given here can also serve as a basis for other kinds of TCProbabilityModels. (!!ADD THIS TO TODO LIST!!)
See Also: TCProbabilityModel, TCInvertedIndex, Serialized Form
Fields: invertedIndex, LAPLACE, NOSMOOTHING
Constructor and Description |
---|
BVProbabilityModel() Creates a new BVProbabilityModel instance and TCInvertedIndex. |
BVProbabilityModel(ParsedCorpus pt, StopWordList swlist) Creates a new BVProbabilityModel instance and TCInvertedIndex (see above note re. separating these two classes). |
Modifier and Type | Method and Description |
---|---|
void | addParsedCorpus(ParsedCorpus pt, StopWordList swlist) Index each term (type) of each ParsedDocument in ParsedCorpus, except those in swlist, on this PM. |
void | addParsedDocument(ParsedDocument pni, StopWordList swlist) |
boolean | containsTerm(java.lang.String term) Check if the index contains term. |
WordScorePair[] | getBlankWordScoreArray() Make a new WordScorePair array, big enough to store all terms indexed by this PM, with scores initialised to zero. |
java.util.Set | getCategorySet() Get all categories in this corpus. |
int | getCategSetSize() |
java.util.Vector | getCategVector(java.lang.String id) Find all categories under which document id has been classified. |
double | getCatGenerality(java.lang.String cat) Calculate and return the generality of cat for this model. |
int[] | getCooccurrenceVector(java.lang.String term, java.lang.String[] terms) Return a vector containing the number of documents in which the word term co-occurs with each term in terms. |
int | getCorpusSize() Get the size of docSet. |
int | getCount(java.lang.String term) Return the number of occurrences of term in the corpus. |
int | getCount(java.lang.String id, java.lang.String term) Return the number of occurrences of term in document id. |
java.util.Set | getDocSet() Return the set of documents used in the generation of this index. |
Probabilities | getProbabilities(java.lang.String term, java.lang.String cat) Get a summary of probabilities associated with term and cat. |
int | getTermCount(java.lang.String term) Get the number of files a term occurs in. |
java.util.Set | getTermSet() Get the set of terms (types) indexed by this index. |
int | getTermSetSize() Get the number of terms (types) indexed by this PM. |
WordScorePair[] | getWordScoreArray() |
boolean | isIgnoreCase() Get the value of ignoreCase. |
boolean | occursInCategory(java.lang.String term, java.lang.String cat) |
WordScorePair[] | setFreqWordScoreArray(WordScorePair[] wsp) Gets an initialised WordScorePair array and populates it with global term frequency. |
void | setIgnoreCase(boolean v) Set the value of ignoreCase. |
void | trimTermSet(java.util.Set rts) Delete all entries for terms not in the reduced term set. |
void | trimTermSet(WordFrequencyPair[] rts) Delete all entries for terms not in the reduced term set. |
Methods inherited from class TCProbabilityModel: getCreationInfo, getCreator, getCreatorArgs, getCreatorArgsCSV, getSmoothingType, setCreator, setCreatorArgs, setSmoothingType
public BVProbabilityModel()
Creates a new BVProbabilityModel instance and TCInvertedIndex. (See above note re. separating these two classes.)

public BVProbabilityModel(ParsedCorpus pt, StopWordList swlist)
Creates a new BVProbabilityModel instance and TCInvertedIndex (see above note re. separating these two classes) and initialises them with pt, excluding the terms in swlist.
Parameters:
pt - a ParsedCorpus value
swlist - a StopWordList value

public void addParsedCorpus(ParsedCorpus pt, StopWordList swlist)
Index each term (type) of each ParsedDocument in ParsedCorpus, except those in swlist, on this PM.
Specified by:
addParsedCorpus in interface TCInvertedIndex
Parameters:
pt - a ParsedCorpus value
swlist - a StopWordList value

public void addParsedDocument(ParsedDocument pni, StopWordList swlist)

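The indexing behaviour described for addParsedCorpus can be pictured with a small stand-alone sketch. It does not use the modnlp classes: documents are plain token lists, the stop-word list is a Set of strings, and the names IndexingSketch, addDocument and count are hypothetical, chosen only for illustration. The actual data structures inside BVProbabilityModel may differ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an inverted index; not the modnlp implementation.
final class IndexingSketch {
    // term -> (document id -> number of occurrences in that document)
    private final Map<String, Map<String, Integer>> postings = new HashMap<>();

    // Index every token of one document, skipping stop words.
    void addDocument(String id, List<String> tokens, Set<String> stopWords) {
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                continue;
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(id, 1, Integer::sum);
        }
    }

    // True if at least one indexed document contains the term.
    boolean containsTerm(String term) {
        return postings.containsKey(term);
    }

    // Occurrences of the term in one document (0 if either is unknown).
    int count(String id, String term) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(id, 0);
    }

    public static void main(String[] args) {
        IndexingSketch index = new IndexingSketch();
        Set<String> stop = Set.of("the", "of");
        index.addDocument("d1", List.of("the", "price", "of", "oil", "oil"), stop);
        System.out.println(index.containsTerm("the")); // stop word, so not indexed
        System.out.println(index.count("d1", "oil"));
    }
}
```

Filtering stop words at indexing time, as assumed here, keeps them out of every later count and probability estimate.
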
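public java.util.Set getCategorySet()
Get all categories in this corpus.
Specified by:
getCategorySet in interface TCInvertedIndex
Returns:
a Set containing all categories that occur in the corpus

public java.util.Set getDocSet()
Return the set of documents used in the generation of this index.
Specified by:
getDocSet in interface TCInvertedIndex
Returns:
a Set containing the IDs of all documents indexed in this index

public int getCorpusSize()
Get the size of docSet.
Specified by:
getCorpusSize in interface TCInvertedIndex
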
public boolean containsTerm(java.lang.String term)
Description copied from interface: TCInvertedIndex
Check if the index contains term.
Specified by:
containsTerm in interface TCInvertedIndex
Parameters:
term - a term to be looked up.
Returns:
true if the index contains term, false otherwise

public double getCatGenerality(java.lang.String cat)
Calculate and return the generality of cat for this model. Generality is given by
G_cat = no_of_docs_classified_as_cat / no_of_docs_in_corpus
i.e. G_cat = p(cat).
Specified by:
getCatGenerality in interface TCInvertedIndex
Parameters:
cat - a String representing a category
Returns:
a double value

public Probabilities getProbabilities(java.lang.String term, java.lang.String cat)
Get a summary of probabilities associated with term and cat.
getProbabilities in class TCProbabilityModel
Parameters:
term
cat - a String representing a category
Returns:
Probabilities

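The generality formula given for getCatGenerality can be worked through in a few lines. The sketch below is an illustration only: it assumes the corpus is available as a map from document id to the set of categories assigned to that document, and the name GeneralitySketch and the handling of an empty corpus are assumptions, not part of modnlp.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of G_cat = docs classified as cat / docs in corpus.
final class GeneralitySketch {
    static double generality(Map<String, Set<String>> categoriesByDoc, String cat) {
        if (categoriesByDoc.isEmpty()) {
            return 0.0; // assumption: an empty corpus yields generality 0
        }
        long inCat = categoriesByDoc.values().stream()
                .filter(cats -> cats.contains(cat))
                .count();
        return (double) inCat / categoriesByDoc.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> corpus = Map.of(
                "d1", Set.of("acq"),
                "d2", Set.of("acq", "earn"),
                "d3", Set.of("earn"),
                "d4", Set.of("grain"));
        // "acq" labels 2 of the 4 documents
        System.out.println(generality(corpus, "acq"));
    }
}
```

Because a document may carry several categories, the generalities of all categories need not sum to 1.
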
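public int getTermSetSize()
Get the number of terms (types) indexed by this PM.
Specified by:
getTermSetSize in interface TCInvertedIndex
Returns:
an int value

public void trimTermSet(java.util.Set rts)
Delete all entries for terms not in the reduced term set.
Specified by:
trimTermSet in interface TCInvertedIndex
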
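public java.util.Set getTermSet()
Description copied from interface: TCInvertedIndex
Get the set of terms (types) indexed by this index.
Specified by:
getTermSet in interface TCInvertedIndex
Returns:
a Set containing all terms in the index

public void trimTermSet(WordFrequencyPair[] rts)
Delete all entries for terms not in the reduced term set.
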
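Both trimTermSet variants delete every entry whose term is outside a reduced term set. A minimal sketch of that operation on a plain map follows; it assumes the index is keyed by term, and TrimSketch and trim are hypothetical names rather than modnlp code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of trimming an index to a reduced term set.
final class TrimSketch {
    // Remove every entry whose term is not in reducedTermSet.
    static void trim(Map<String, Integer> termCounts, Set<String> reducedTermSet) {
        // keySet() is a live view, so retainAll removes the map entries too
        termCounts.keySet().retainAll(reducedTermSet);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("oil", 7);
        counts.put("price", 3);
        counts.put("barrel", 1);
        trim(counts, Set.of("oil", "price"));
        System.out.println(counts.keySet());
    }
}
```
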
public int getCategSetSize()

public java.util.Vector getCategVector(java.lang.String id)
Description copied from interface: TCInvertedIndex
Find all categories under which document id has been classified.
Specified by:
getCategVector in interface TCInvertedIndex
Returns:
a Vector containing the categories (of type String) to which document id belongs

public WordScorePair[] getBlankWordScoreArray()
Make a new WordScorePair array, big enough to store all terms indexed by this PM, with scores initialised to zero.
Specified by:
getBlankWordScoreArray in interface TCInvertedIndex
Returns:
a WordScorePair[] value

public WordScorePair[] setFreqWordScoreArray(WordScorePair[] wsp)
Gets an initialised WordScorePair array and populates it with global term frequency.
Specified by:
setFreqWordScoreArray in interface TCInvertedIndex
Parameters:
wsp - a WordScorePair[] value
Returns:
a WordScorePair[] value

public WordScorePair[] getWordScoreArray()

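The relationship between getBlankWordScoreArray and setFreqWordScoreArray can be sketched as a two-step pattern: allocate one zero-scored pair per indexed term, then overwrite the scores with global frequencies. The Pair class below is a hypothetical stand-in for WordScorePair, and the alphabetical ordering and the in-place update are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; Pair is not the modnlp WordScorePair class.
final class WordScoreSketch {
    static final class Pair {
        final String word;
        double score;

        Pair(String word, double score) {
            this.word = word;
            this.score = score;
        }
    }

    // One pair per indexed term, every score initialised to zero.
    // Assumption: terms are listed in alphabetical order.
    static Pair[] blankArray(Map<String, Integer> termCounts) {
        Pair[] pairs = new Pair[termCounts.size()];
        int i = 0;
        for (String term : new TreeMap<>(termCounts).keySet()) {
            pairs[i++] = new Pair(term, 0.0);
        }
        return pairs;
    }

    // Overwrite each score with the term's global frequency; returns the same array.
    static Pair[] fillWithFrequency(Pair[] pairs, Map<String, Integer> termCounts) {
        for (Pair p : pairs) {
            p.score = termCounts.getOrDefault(p.word, 0);
        }
        return pairs;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("oil", 7, "price", 3);
        for (Pair p : fillWithFrequency(blankArray(counts), counts)) {
            System.out.println(p.word + " " + p.score);
        }
    }
}
```
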
public int getTermCount(java.lang.String term)
Get the number of files a term occurs in.
Specified by:
getTermCount in interface TCInvertedIndex
Parameters:
term - word (type) to be looked up
Returns:
an int value containing the number of files that contain at least one token of type term

public int getCount(java.lang.String id, java.lang.String term)
Return the number of occurrences of term in document id.
Specified by:
getCount in interface TCInvertedIndex
Parameters:
id - a String representing a unique document id
term - a String

public int getCount(java.lang.String term)
Return the number of occurrences of term in the corpus.
Specified by:
getCount in interface TCInvertedIndex
Parameters:
term - a String

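getTermCount and the two getCount overloads answer three different questions: how many documents contain a term, how often it occurs in the whole corpus, and how often it occurs in one document. The sketch below separates the three on a plain postings map (term to document id to occurrence count). CountSketch and its method names are hypothetical and the map layout is an assumption.

```java
import java.util.Map;

// Hypothetical illustration of the three counts; not the modnlp implementation.
final class CountSketch {
    // Number of documents containing at least one token of the term.
    static int documentFrequency(Map<String, Map<String, Integer>> postings, String term) {
        return postings.getOrDefault(term, Map.of()).size();
    }

    // Total occurrences of the term across the whole corpus.
    static int corpusCount(Map<String, Map<String, Integer>> postings, String term) {
        return postings.getOrDefault(term, Map.of()).values().stream()
                .mapToInt(Integer::intValue)
                .sum();
    }

    // Occurrences of the term in a single document.
    static int documentCount(Map<String, Map<String, Integer>> postings, String id, String term) {
        return postings.getOrDefault(term, Map.of()).getOrDefault(id, 0);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> postings =
                Map.of("oil", Map.of("d1", 2, "d2", 1));
        System.out.println(documentFrequency(postings, "oil"));
        System.out.println(corpusCount(postings, "oil"));
        System.out.println(documentCount(postings, "d1", "oil"));
    }
}
```
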
public int[] getCooccurrenceVector(java.lang.String term, java.lang.String[] terms)
Description copied from interface: TCInvertedIndex
Return a vector containing the number of documents in which the word term co-occurs with each term in terms.
Specified by:
getCooccurrenceVector in interface TCInvertedIndex
Parameters:
term - a String value
terms - a String[] value
Returns:
an int[] value

public boolean occursInCategory(java.lang.String term, java.lang.String cat)

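The co-occurrence vector returned by getCooccurrenceVector can be read as a list of set intersections: entry i counts the documents that contain both term and terms[i]. The following sketch assumes a map from each term to the set of ids of documents containing it. CooccurrenceSketch is a hypothetical name and the real implementation may compute the counts differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a document-level co-occurrence vector.
final class CooccurrenceSketch {
    // result[i] = number of documents containing both term and terms[i]
    static int[] cooccurrence(Map<String, Set<String>> docsByTerm, String term, String[] terms) {
        Set<String> base = docsByTerm.getOrDefault(term, Set.of());
        int[] result = new int[terms.length];
        for (int i = 0; i < terms.length; i++) {
            Set<String> shared = new HashSet<>(base);
            shared.retainAll(docsByTerm.getOrDefault(terms[i], Set.of()));
            result[i] = shared.size();
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> docsByTerm = Map.of(
                "oil", Set.of("d1", "d2", "d3"),
                "price", Set.of("d2", "d3"),
                "grain", Set.of("d4"));
        int[] vector = cooccurrence(docsByTerm, "oil", new String[] {"price", "grain"});
        System.out.println(Arrays.toString(vector));
    }
}
```
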
public boolean isIgnoreCase()
Get the value of ignoreCase.

public void setIgnoreCase(boolean v)
Set the value of ignoreCase.
Parameters:
v - Value to assign to ignoreCase.