SentencePieceTokenizer

public final class SentencePieceTokenizer implements Tokenizer

A SentencePiece tokenizer. This is a subword tokenizer, meaning that it returns subword tokens, e.g. "hey" might be split into "h", "ey".
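To illustrate what a subword split looks like, here is a toy greedy longest-match tokenizer. This is only a sketch of the general idea, not the actual SentencePiece algorithm (which uses a learned unigram or BPE model); the class name and vocabulary are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy subword tokenizer: at each position, greedily take the longest
// substring that appears in the vocabulary, falling back to a single
// character when nothing matches.
public final class ToySubwordSplit {
    public static List<String> split(String text, Set<String> vocab) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = text.length();
            // Shrink the window until it matches a known subword
            // (or is reduced to a single character).
            while (end > i + 1 && !vocab.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = new HashSet<>(Arrays.asList("h", "ey"));
        System.out.println(split("hey", vocab)); // prints [h, ey]
    }
}
```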

Constructors

public SentencePieceTokenizer(Path modelPath, Path vocabPath)

Types

public class Companion

Functions

public List<List<String>> batchSplit(List<String> texts)

A batched variant of split. More efficient for native tokenizers, e.g. HuggingFaceTokenizer.
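The relationship between split and batchSplit can be sketched as follows. The interface shape and default-method body here are assumptions for illustration, not the library's actual source: a naive batchSplit just maps split over the inputs, while a native-backed tokenizer can override it to cross the JNI boundary once for the whole batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of an assumed Tokenizer-like interface: batchSplit defaults to
// calling split once per text; native implementations may override it
// to process the whole batch in a single native call.
interface SketchTokenizer {
    List<String> split(String text);

    default List<List<String>> batchSplit(List<String> texts) {
        List<List<String>> out = new ArrayList<>(texts.size());
        for (String t : texts) {
            out.add(split(t)); // one call per text in the naive version
        }
        return out;
    }
}

public final class BatchSplitDemo {
    public static void main(String[] args) {
        // Hypothetical whitespace tokenizer standing in for a real implementation.
        SketchTokenizer ws = text -> Arrays.asList(text.split("\\s+"));
        System.out.println(ws.batchSplit(Arrays.asList("a b", "c d e")));
        // prints [[a, b], [c, d, e]]
    }
}
```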

public final Set<String> getVocab()
public List<String> split(String text)

Properties

private final Set<String> vocab