Package com.londogard.nlp.tokenizer

Types

public final class CharTokenizer implements Tokenizer

A Character Tokenizer which returns a token for each character in the string.
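As an illustration of what per-character tokenization produces, here is a minimal, self-contained sketch. `CharTokenizerSketch` and its `split` method are hypothetical names used only for this example; they are not the library's `CharTokenizer` API. The sketch also assumes that "character" means a Unicode code point, which the description above does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not the library's CharTokenizer.
final class CharTokenizerSketch {
    // Returns one token per Unicode code point of the input (an assumption:
    // the real class may split on UTF-16 chars instead).
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("hey")); // prints [h, e, y]
    }
}
```

Splitting on code points rather than `char` values keeps characters outside the Basic Multilingual Plane (for example emoji) as single tokens.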

public final class HuggingFaceTokenizerWrapper implements Tokenizer

Wraps a HuggingFace tokenizer. These are subword tokenizers, meaning they return subword tokens. Instantiated by passing the HuggingFace model's name.

public final class SentencePieceTokenizer implements Tokenizer

A SentencePiece tokenizer. This is a subword tokenizer, meaning that it returns subword tokens, e.g. "hey" might end up as "h", "ey".

public final class SentencePieceTokenizerKt
public final class SimpleTokenizer implements Tokenizer

A simple tokenizer which allows you to define your own whitespace to split upon.
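The idea of splitting on a caller-defined separator can be sketched as follows. `SimpleTokenizerSketch`, its constructor parameter, and its `split` method are hypothetical names for this example only; the library's `SimpleTokenizer` may take its separator in a different form (the use of a regular expression here is an assumption).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the library's SimpleTokenizer.
final class SimpleTokenizerSketch {
    private final Pattern separator;

    // separatorRegex is an assumed parameter: a regex describing what counts
    // as "whitespace" between tokens.
    SimpleTokenizerSketch(String separatorRegex) {
        this.separator = Pattern.compile(separatorRegex);
    }

    // Splits the text on the separator and drops empty pieces, which
    // Pattern.split can produce when the text starts with a separator.
    List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : separator.split(text)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(new SimpleTokenizerSketch("\\s+").split("hello  big world"));
        // prints [hello, big, world]
        System.out.println(new SimpleTokenizerSketch("[,;]").split("a,b;c"));
        // prints [a, b, c]
    }
}
```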

public interface Tokenizer

Tokenizes a string into multiple tokens.

public class TokenizerSpecialTokens

Special tokens that are useful in machine learning.

public enum VocabSize extends Enum<VocabSize>