SimpleTokenizer

public final class SimpleTokenizer implements Tokenizer

A simple tokenizer that splits text on a whitespace pattern you define as a regular expression.

Constructors

public SimpleTokenizer(Boolean splitContraction, String whitespaceRegex)

Types

public class Companion

Functions

public List<List<String>> batchSplit(List<String> texts)

Splits several texts in one call. Batching is a more efficient approach for native tokenizers, e.g. HuggingFaceTokenizer.

public List<String> split(String text)
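To illustrate the shape of split and batchSplit, here is a minimal, self-contained sketch. WhitespaceTokenizerSketch is a hypothetical stand-in, not this library's SimpleTokenizer: it assumes that tokens are produced by splitting on the whitespaceRegex pattern with empty tokens dropped, and that batchSplit simply applies split to each text. The splitContraction option is left out because its behavior is not described here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in, NOT the library's SimpleTokenizer. It only mirrors
// the split/batchSplit signatures listed above to show their shape.
public class WhitespaceTokenizerSketch {
    private final Pattern whitespace;

    // whitespaceRegex plays the role of the constructor parameter of the same name.
    public WhitespaceTokenizerSketch(String whitespaceRegex) {
        this.whitespace = Pattern.compile(whitespaceRegex);
    }

    // Splits one text on the configured pattern, dropping empty tokens
    // (assumed behavior, e.g. for leading whitespace).
    public List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : whitespace.split(text)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Assumed default: tokenize each text in turn, preserving input order.
    public List<List<String>> batchSplit(List<String> texts) {
        List<List<String>> result = new ArrayList<>();
        for (String text : texts) {
            result.add(split(text));
        }
        return result;
    }

    public static void main(String[] args) {
        WhitespaceTokenizerSketch tokenizer = new WhitespaceTokenizerSketch("\\s+");
        // prints [Hello, tokenizer, world]
        System.out.println(tokenizer.split("Hello  tokenizer world"));
        // prints [[a, b], [c, d, e]]
        System.out.println(tokenizer.batchSplit(List.of("a b", "c\td e")));
    }
}
```

In a pure-Java sketch like this, batchSplit gains nothing over calling split in a loop. The efficiency note above applies to tokenizers backed by native code, where one batched call can avoid repeated per-call overhead.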