modnlp.idx.inverted
public class TokeniserRegex extends Tokeniser
| Modifier and Type | Field and Description |
|---|---|
| static java.lang.String | DEFAULTWORDREGEXP |
| static java.lang.String | PUNCTUATIONWORDREGEXP |

Fields inherited from class Tokeniser: encoding, indexPuntuation, originalText, SEPTKARR, SEPTOKEN, tagIndexing, tokenMap, verbose

| Constructor and Description |
|---|
| TokeniserRegex(java.io.File t, java.lang.String e) |
| TokeniserRegex(java.lang.String t) |
| TokeniserRegex(java.net.URL t, java.lang.String e) |

| Modifier and Type | Method and Description |
|---|---|
| java.lang.String | getBigWordRegexp() Get the bigWordRegexp value. |
| java.lang.String | getIgnoredElements() Get the IgnoredElements value. |
| TokenIndex | getTokenIndex(java.lang.String str) |
| java.lang.String | getWordRegexp() Gets the value of wordRegexp |
| void | setBigWordRegexp(java.lang.String argBigWordRegexp) Sets the value of bigWordRegexp |
| void | setIgnoredElements(java.lang.String newIgnoredElements) Set the IgnoredElements value. |
| void | setIndexPuntuation(java.lang.Boolean argIndexPuntuation) Sets the value of indexPuntuation |
| void | setWordRegexp(java.lang.String argWordRegexp) Sets the value of wordRegexp |
| java.util.List<java.lang.String> | split(java.lang.String s) |
| void | tokenise() tokenise: Very basic tokenisation; Serious tokenisers must override this method. |

Methods inherited from class Tokeniser: disbar, fixType, getEncoding, getIndexPuntuation, getOriginalText, getTagIndexing, getTokenMap, getVerbose, isBar, setEncoding, setTagIndexing, setTokenMap, setVerbose, splitWordOnly

public static final java.lang.String DEFAULTWORDREGEXP

public static final java.lang.String PUNCTUATIONWORDREGEXP
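
The values of DEFAULTWORDREGEXP and PUNCTUATIONWORDREGEXP are not shown on this page. As a rough illustration of how a word-only pattern and a punctuation-aware pattern could tokenise the same text differently, the following sketch uses java.util.regex directly. The class name RegexSplitSketch, both patterns and the split helper are hypothetical; none of them is taken from modnlp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of modnlp: the patterns below are assumed
// stand-ins, not the actual values of DEFAULTWORDREGEXP or PUNCTUATIONWORDREGEXP.
public class RegexSplitSketch {
    // Assumed word pattern: runs of letters or digits, optionally joined
    // by a single internal apostrophe or hyphen.
    public static final String WORD_ONLY = "[\\p{L}\\p{N}]+(?:['-][\\p{L}\\p{N}]+)*";

    // Assumed punctuation-aware pattern: a word as above, or one punctuation mark.
    public static final String WORD_OR_PUNCT = WORD_ONLY + "|\\p{Punct}";

    // Collects every match of the given regex, in order of appearance.
    public static List<String> split(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Well-known words, aren't they?";
        System.out.println(split(text, WORD_ONLY));
        System.out.println(split(text, WORD_OR_PUNCT));
    }
}
```

With the word-only pattern the comma and question mark are dropped; with the punctuation-aware pattern they come back as separate tokens. This is presumably the kind of behaviour that setIndexPuntuation toggles, although the page does not say so.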

public TokeniserRegex(java.lang.String t)

public TokeniserRegex(java.io.File t, java.lang.String e) throws java.io.IOException
Throws: java.io.IOException

public TokeniserRegex(java.net.URL t, java.lang.String e) throws java.io.IOException
Throws: java.io.IOException

public final java.lang.String getIgnoredElements()
Get the IgnoredElements value.
Returns: String value

public final java.lang.String getBigWordRegexp()
Get the bigWordRegexp value.
Returns: String value

public final void setBigWordRegexp(java.lang.String argBigWordRegexp)
Parameters: argBigWordRegexp - Value to assign to this.bigWordRegexp

public final void setIndexPuntuation(java.lang.Boolean argIndexPuntuation)
setIndexPuntuation in class Tokeniser
Parameters: argIndexPuntuation - Value to assign to this.indexPuntuation

public final java.lang.String getWordRegexp()

public final void setWordRegexp(java.lang.String argWordRegexp)
Parameters: argWordRegexp - Value to assign to this.wordRegexp

public final void setIgnoredElements(java.lang.String newIgnoredElements)
Set the IgnoredElements value.
setIgnoredElements in class Tokeniser
Parameters: newIgnoredElements - The new IgnoredElements value.

public void tokenise() throws java.io.IOException
Description copied from class: Tokeniser
tokenise: Very basic tokenisation; Serious tokenisers must override this method. Note that positions in the tokenMap here correspond to the ORDER in which the token appears in originalText not its actual OFFSET.
tokenise in class Tokeniser
Throws: java.io.IOException
See Also: for a proper implementation.

public java.util.List<java.lang.String> split(java.lang.String s)

public TokenIndex getTokenIndex(java.lang.String str)
getTokenIndex in class Tokeniser
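
The tokenise() description notes that positions in the tokenMap record the order in which tokens appear in originalText, not their character offsets. The difference can be sketched with java.util.regex alone. The class OrderVsOffsetSketch, its word pattern and both helper methods are hypothetical illustrations; they do not reproduce modnlp's token map type or the actual tokenise() implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not modnlp code: contrasts order-based positions
// with character offsets for the same regex matches.
public class OrderVsOffsetSketch {
    // Assumed word pattern (not the library's DEFAULTWORDREGEXP).
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Maps each token to the ordinal positions (0, 1, 2, ...) at which it occurs.
    public static Map<String, List<Integer>> orderPositions(String text) {
        Map<String, List<Integer>> map = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        int order = 0;
        while (m.find()) {
            map.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(order);
            order++;
        }
        return map;
    }

    // Maps each token to the character offsets at which its matches start.
    public static Map<String, List<Integer>> charOffsets(String text) {
        Map<String, List<Integer>> map = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            map.computeIfAbsent(m.group(), k -> new ArrayList<>()).add(m.start());
        }
        return map;
    }

    public static void main(String[] args) {
        String text = "the cat saw the dog";
        System.out.println(orderPositions(text)); // order-based positions
        System.out.println(charOffsets(text));    // character offsets
    }
}
```

For the input "the cat saw the dog", the order-based map places "the" at positions 0 and 3, while the offset-based map places it at characters 0 and 12. Code that needs to locate a token in the original text would therefore need offsets rather than the order-based positions described above.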