Package com.londogard.nlp.tokenizer
Types
A Character Tokenizer which returns a token for each character in the string.
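As a rough illustration of per-character tokenization, here is a minimal sketch. The function name `charTokenize` is hypothetical and is not part of this package's API; it only shows the behaviour described above.

```kotlin
// Hypothetical helper, not the library's API: one token per character.
// Note: this iterates over Kotlin Chars (UTF-16 code units), so a character
// outside the Basic Multilingual Plane would be split into two tokens.
fun charTokenize(text: String): List<String> =
    text.map { it.toString() }
```

Under these assumptions, calling `charTokenize("hey")` yields the three tokens "h", "e" and "y".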
Wraps a HuggingFace tokenizer. These are subword tokenizers, meaning they return subword tokens. Instantiated by passing the HuggingFace model's name.
A SentencePiece tokenizer. This is a subword tokenizer, meaning that it returns subword tokens; e.g. "hey" might become "h", "ey".
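To make the idea of subword tokens concrete, the sketch below splits a word by greedy longest-match against a small hand-written vocabulary. This is a toy stand-in and not the SentencePiece algorithm, which learns its vocabulary from data and segments differently; the name `greedySubwords` and the vocabulary are assumptions made for illustration.

```kotlin
// Toy illustration only: greedy longest-match against a fixed vocabulary.
// This is NOT how SentencePiece segments text; it only shows what
// "subword tokens" look like. All names here are hypothetical.
fun greedySubwords(word: String, vocab: Set<String>): List<String> {
    val pieces = mutableListOf<String>()
    var start = 0
    while (start < word.length) {
        // Try the longest candidate first; fall back to a single character.
        var end = word.length
        while (end > start + 1 && word.substring(start, end) !in vocab) end--
        pieces += word.substring(start, end)
        start = end
    }
    return pieces
}
```

With the vocabulary `setOf("h", "ey")`, the word "hey" is split into "h" and "ey"; if "hey" itself were in the vocabulary, it would be returned as a single token.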
A simple tokenizer which allows you to define your own whitespace to split upon.
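A minimal sketch of splitting on caller-defined delimiters might look like the following. The function `splitOnDelimiters` and its default delimiter set are assumptions for illustration, not this package's actual signature.

```kotlin
// Hypothetical helper, not the library's API: split on a caller-supplied
// set of delimiter characters and drop the empty strings that appear
// between consecutive delimiters.
fun splitOnDelimiters(
    text: String,
    delimiters: Set<Char> = setOf(' ', '\t', '\n')
): List<String> =
    text.split(*delimiters.toCharArray()).filter { it.isNotEmpty() }
```

For example, `splitOnDelimiters("a,b;c", setOf(',', ';'))` treats commas and semicolons as the "whitespace" and returns "a", "b" and "c".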
Special tokens that are useful for machine learning.