HuggingFaceTokenizerWrapper

public final class HuggingFaceTokenizerWrapper implements Tokenizer

Wraps a HuggingFace tokenizer. HuggingFace tokenizers are subword tokenizers, meaning they may split a word into several subword tokens. Instantiate the wrapper by passing the HuggingFace model's name or an existing HuggingFaceTokenizer instance.
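
To make "subword tokens" concrete, here is a minimal, self-contained sketch of a greedy longest-match subword split over a toy vocabulary. It is not this wrapper's implementation and does not call the HuggingFace library: the class name SubwordSketch, the VOCAB contents, the "##" continuation prefix, and the "[UNK]" fallback are all assumptions chosen for illustration. Real HuggingFace tokenizers load a trained vocabulary and a model-specific algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SubwordSketch {
    // Hypothetical toy vocabulary; "##" marks a piece that continues a word.
    static final Set<String> VOCAB = Set.of("token", "##izer", "##s", "un", "##related");

    // Greedy longest-match split of one word into vocabulary pieces.
    static List<String> splitWord(String word) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is in the vocabulary.
            while (end > start) {
                String candidate = (start == 0 ? "" : "##") + word.substring(start, end);
                if (VOCAB.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                // No piece fits: the whole word becomes a single unknown token.
                return List.of("[UNK]");
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    // Mirrors the shape of split(String): one text in, a flat token list out.
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.addAll(splitWord(word));
            }
        }
        return tokens;
    }

    // Mirrors the shape of batchSplit(List<String>): one token list per text.
    static List<List<String>> batchSplit(List<String> texts) {
        List<List<String>> result = new ArrayList<>();
        for (String text : texts) {
            result.add(split(text));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [token, ##izer, ##s, un, ##related]
        System.out.println(split("tokenizers unrelated"));
    }
}
```

The sketch's batchSplit simply loops over split. A native tokenizer can instead process the whole batch in one call.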

Constructors

public HuggingFaceTokenizerWrapper HuggingFaceTokenizerWrapper(String modelName)
public HuggingFaceTokenizerWrapper HuggingFaceTokenizerWrapper(HuggingFaceTokenizer tokenizer)

Functions

public final Array<Encoding> batchEncode(List<String> texts)
public List<List<String>> batchSplit(List<String> texts)

Splits a batch of texts in one call, which is a more efficient approach for native tokenizers, i.e. HuggingFaceTokenizer.

public final Encoding encode(String text)
public final HuggingFaceTokenizer getTokenizer()
public List<String> split(String text)

Properties

private final HuggingFaceTokenizer tokenizer