Tokeniser

java.lang.Object
- modnlp.util.Tokeniser

Direct Known Subclasses:

TokeniserGNU, TokeniserJP, TokeniserJPLucene, TokeniserRegex
```
public class Tokeniser
extends java.lang.Object
```
Tokenise a chunk of text and record the position of each token.

Field Summary

Fields
| Modifier and Type | Field and Description |
| --- | --- |
| `protected java.lang.String` | `encoding` |
| `protected java.lang.Boolean` | `indexPuntuation` |
| `protected java.lang.String` | `originalText` |
| `static char[]` | `SEPTKARR` |
| `static java.lang.String` | `SEPTOKEN` |
| `protected boolean` | `tagIndexing` |
| `protected TokenMap` | `tokenMap` |
| `protected boolean` | `verbose` |

Constructor Summary

Constructors
Constructor and Description
`Tokeniser(java.io.File file, java.lang.String e)`
`Tokeniser(java.lang.String text)`
`Tokeniser(java.net.URL url, java.lang.String e)`

Method Summary

Methods
| Modifier and Type | Method and Description |
| --- | --- |
| `static java.lang.String` | `disbar(java.lang.String token)` Disbar token |
| `static java.lang.String` | `fixType(java.lang.String type)` Delete dots (e.g. "U.S.A" => "USA"), remove spaces, and clean up any remaining garbage left by Tokenizer. |
| `java.lang.String` | `getEncoding()` |
| `java.lang.Boolean` | `getIndexPuntuation()` Gets the value of indexPuntuation |
| `java.lang.String` | `getOriginalText()` |
| `boolean` | `getTagIndexing()` |
| `TokenIndex` | `getTokenIndex(java.lang.String str)` |
| `TokenMap` | `getTokenMap()` |
| `boolean` | `getVerbose()` |
| `static boolean` | `isBar(java.lang.String token)` Check if token is a negated token (e.g. '-c' in p(t\|-c)) |
| `void` | `setEncoding(java.lang.String v)` |
| `void` | `setIgnoredElements(java.lang.String i)` |
| `void` | `setIndexPuntuation(java.lang.Boolean argIndexPuntuation)` Sets the value of indexPuntuation |
| `void` | `setTagIndexing(boolean v)` |
| `void` | `setTokenMap(TokenMap t)` |
| `void` | `setVerbose(boolean v)` |
| `java.util.List<java.lang.String>` | `split(java.lang.String str)` |
| `java.util.List<java.lang.String>` | `splitWordOnly(java.lang.String str)` |
| `void` | `tokenise()` Very basic tokenisation; serious tokenisers must override this method. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail

tagIndexing
```
protected boolean tagIndexing
```

verbose
```
protected boolean verbose
```

originalText

```
protected java.lang.String originalText
```

tokenMap
```
protected TokenMap tokenMap
```

encoding
```
protected java.lang.String encoding
```

SEPTKARR
```
public static final char[] SEPTKARR
```

SEPTOKEN

```
public static final java.lang.String SEPTOKEN
```

indexPuntuation

```
protected java.lang.Boolean indexPuntuation
```

Constructor Detail

Tokeniser

```
public Tokeniser(java.lang.String text)
```

Tokeniser

```
public Tokeniser(java.net.URL url,
                 java.lang.String e)
          throws java.io.IOException
```

Throws:
java.io.IOException

Tokeniser

```
public Tokeniser(java.io.File file,
                 java.lang.String e)
          throws java.io.IOException
```

Throws:
java.io.IOException

Method Detail

getIndexPuntuation
```
public final java.lang.Boolean getIndexPuntuation()
```
Gets the value of indexPuntuation

Returns:
the value of indexPuntuation

setIndexPuntuation
```
public void setIndexPuntuation(java.lang.Boolean argIndexPuntuation)
```
Sets the value of indexPuntuation

Parameters:
argIndexPuntuation - Value to assign to this.indexPuntuation

setTokenMap
```
public void setTokenMap(TokenMap t)
```

getTagIndexing
```
public boolean getTagIndexing()
```

setTagIndexing

```
public void setTagIndexing(boolean v)
```

getVerbose
```
public boolean getVerbose()
```

setVerbose
```
public void setVerbose(boolean v)
```

setIgnoredElements

```
public void setIgnoredElements(java.lang.String i)
```

getEncoding
```
public java.lang.String getEncoding()
```

setEncoding

```
public void setEncoding(java.lang.String v)
```

getTokenMap
```
public TokenMap getTokenMap()
```

getOriginalText

```
public java.lang.String getOriginalText()
```

tokenise
```
public void tokenise()
              throws java.io.IOException
```
Very basic tokenisation; serious tokenisers must override this method. Note that positions in the tokenMap correspond to the ORDER in which each token appears in originalText, not to its actual character OFFSET.

Throws:
java.io.IOException

See Also:
for a proper implementation.
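
The distinction between ORDER and OFFSET noted for `tokenise()` can be illustrated with a small standalone sketch. It does not use or reproduce the modnlp implementation: `PositionSketch`, `Tok`, the whitespace-only splitting rule, and the 0-based numbering are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT the modnlp Tokeniser: it records, for each
// whitespace-delimited token, both its ordinal position and its character offset.
class PositionSketch {

    static final class Tok {
        final String text;
        final int order;   // 0-based rank of the token within the text (assumed numbering)
        final int offset;  // 0-based index of the token's first character

        Tok(String text, int order, int offset) {
            this.text = text;
            this.order = order;
            this.offset = offset;
        }
    }

    static List<Tok> tokenise(String text) {
        List<Tok> out = new ArrayList<>();
        int order = 0;
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip any run of whitespace before the next token.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            // Consume characters up to the next whitespace.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            out.add(new Tok(text.substring(start, i), order++, start));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two spaces after "the" make order and offset diverge visibly.
        for (Tok t : tokenise("the  cat sat")) {
            System.out.println(t.text + " order=" + t.order + " offset=" + t.offset);
        }
    }
}
```

In this sketch "cat" has order 1 but offset 5. A tokeniser that stores only the order, as the description above says this class does, cannot be used on its own to locate the token's characters in the original text.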

split

```
public java.util.List<java.lang.String> split(java.lang.String str)
```

splitWordOnly

```
public java.util.List<java.lang.String> splitWordOnly(java.lang.String str)
```

getTokenIndex

```
public TokenIndex getTokenIndex(java.lang.String str)
```

fixType
```
public static java.lang.String fixType(java.lang.String type)
```
Delete dots (e.g. "U.S.A" => "USA"), remove spaces, and clean up any remaining garbage left by Tokenizer.

Returns:
type: a 'clean' type
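
As a rough illustration of the first two steps described for `fixType`, the sketch below deletes dots and removes whitespace. It is not the modnlp source: `FixTypeSketch` and `cleanType` are hypothetical names, and the unspecified "remaining garbage" cleanup is deliberately left out because the documentation does not say what it covers.

```java
// Hypothetical approximation of the documented behaviour, NOT the modnlp code.
class FixTypeSketch {

    // Delete every '.' and every run of whitespace from the given type.
    // The extra "garbage" cleanup mentioned in the docs is not reproduced here.
    static String cleanType(String type) {
        return type.replace(".", "").replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(cleanType("U.S.A"));   // prints USA
        System.out.println(cleanType(" e. g. ")); // prints eg
    }
}
```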

isBar
```
public static boolean isBar(java.lang.String token)
```
Check if token is a negated token (e.g. '-c' in p(t|-c))

disbar

```
public static java.lang.String disbar(java.lang.String token)
```

Disbar token
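
The pairing of `isBar` and `disbar` might look roughly like the sketch below. The only detail taken from the documentation is the '-c' example; that the marker is a single leading `-`, that `disbar` simply strips it, and the names `BarSketch`, `isNegated` and `stripNegation` are all assumptions made for illustration.

```java
// Hypothetical sketch, NOT the modnlp code: treats a leading '-' as the negation marker.
class BarSketch {

    static final String BAR = "-"; // assumed marker, inferred from the '-c' example

    // True if the token starts with the marker and has something after it.
    static boolean isNegated(String token) {
        return token.length() > BAR.length() && token.startsWith(BAR);
    }

    // Remove the marker from a negated token; return other tokens unchanged.
    static String stripNegation(String token) {
        return isNegated(token) ? token.substring(BAR.length()) : token;
    }

    public static void main(String[] args) {
        System.out.println(isNegated("-c"));     // prints true
        System.out.println(stripNegation("-c")); // prints c
        System.out.println(stripNegation("c"));  // prints c
    }
}
```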
