Package com.londogard.nlp.embeddings
An embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word's meaning such that words closer in the vector space are expected to be similar in meaning (more on wikipedia.org).
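Closeness in the vector space is usually measured with cosine similarity. The following is a minimal, self-contained Kotlin sketch of that measure (an illustration, not part of this library's API):

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors: 1.0 means identical
// direction (very similar words), values near 0.0 mean unrelated vectors.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}
```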
In com.londogard:nlp, multiple embeddings are supported:
| Embedding | Languages Supported | Details |
|---|---|---|
| LightWordEmbeddings | 157 via fastText (loaded from disk) | Retrieves up to K vectors from disk ad hoc, caching the most-used results in memory |
| WordEmbeddings | 157 via fastText (loaded from disk) | Ordinary word embeddings |
| BpeEmbeddings | 275 via bpemb | Embeds SentencePiece-tokenized data. Retains performance within ±5% while using MBs rather than GBs! |
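All of these boil down to a word-to-vector lookup backed by pretrained files such as fastText's .vec text format (a header line with vocabulary size and dimension, then one word and its vector per line). As a rough illustration of what such a lookup involves (the library handles this internally; loadVecFile and maxWords below are hypothetical names):

```kotlin
import java.io.File

// Minimal reader for a fastText-style .vec text file: the first line holds
// "<vocabSize> <dimension>", each following line holds a word and its vector.
fun loadVecFile(path: String, maxWords: Int = 50_000): Map<String, FloatArray> {
    val embeddings = HashMap<String, FloatArray>()
    File(path).useLines { lines ->
        lines.drop(1)          // skip the header line
            .take(maxWords)    // cap memory usage
            .forEach { line ->
                val parts = line.split(' ')
                val vector = FloatArray(parts.size - 1) { parts[it + 1].toFloat() }
                embeddings[parts[0]] = vector
            }
    }
    return embeddings
}
```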
com.londogard:nlp also supports sentence embeddings, through the Average or USif method.
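The Average method is the element-wise mean of a sentence's word vectors. A hedged sketch of the idea, assuming a plain Map-based vocabulary (not the library's exact API):

```kotlin
// Average sentence embedding: the element-wise mean of the word vectors.
// Words missing from the vocabulary are skipped; no known words yields null.
fun averageSentenceEmbedding(
    tokens: List<String>,
    embeddings: Map<String, FloatArray>,
    dimension: Int
): FloatArray? {
    val vectors = tokens.mapNotNull { embeddings[it] }
    if (vectors.isEmpty()) return null
    val sum = FloatArray(dimension)
    for (v in vectors) for (i in sum.indices) sum[i] += v[i]
    return FloatArray(dimension) { sum[it] / vectors.size }
}
```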
Types
BPEEmbeddings are subword embeddings that embed SentencePieceTokenizer-tokenized data. Studies show that performance is on par with GloVe (±5%) while using only a few MBs of data rather than GBs. Supports 275 languages through bpemb.
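One common strategy with subword embeddings, used for instance in bpemb's own examples, is to derive a word's vector from the vectors of its subword pieces, e.g. by averaging them. A sketch under that assumption (tokenizeToPieces and the averaging step are illustrative, not this library's confirmed internals):

```kotlin
// Subword (BPE) embedding lookup: the word is split into subword pieces,
// and its vector is the element-wise average of the pieces' vectors.
fun bpeWordEmbedding(
    word: String,
    tokenizeToPieces: (String) -> List<String>,  // e.g. a SentencePiece model
    subwordEmbeddings: Map<String, FloatArray>,
    dimension: Int
): FloatArray {
    val sum = FloatArray(dimension)
    var count = 0
    for (piece in tokenizeToPieces(word)) {
        subwordEmbeddings[piece]?.let { vec ->
            for (i in sum.indices) sum[i] += vec[i]
            count++
        }
    }
    // No known pieces: fall back to the zero vector.
    return if (count == 0) sum else FloatArray(dimension) { sum[it] / count }
}
```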
A utility to simplify loading embeddings. Simply call fromLanguageOrNull or fromFile.
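A toy stand-in for that loader pattern; the method names mirror the docs, but the real signatures and return types may differ:

```kotlin
// Hypothetical loader illustrating the fromLanguageOrNull / fromFile split;
// check the library's actual signatures before use.
object ToyEmbeddingLoader {
    private val localFiles = mapOf("en" to "vectors/cc.en.300.vec")

    // Returns null when no vectors are available for the given language.
    fun fromLanguageOrNull(languageCode: String): Map<String, FloatArray>? =
        localFiles[languageCode]?.let(::fromFile)

    fun fromFile(path: String): Map<String, FloatArray> =
        loadVecFile(path)  // reuses the .vec reader sketched earlier
}
```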
Retrieves vectors from disk ad hoc and caches up to maxWordCount results in memory based on the most active cache keys. Works very well in constrained environments like a Raspberry Pi.
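The caching strategy described here is essentially an LRU cache. A minimal Kotlin sketch using java.util.LinkedHashMap's access order (VectorCache and readVectorFromDisk are hypothetical names, not this class's internals):

```kotlin
// LRU cache in the spirit of LightWordEmbeddings: keep at most maxWordCount
// vectors in memory, evicting the least recently used entry on overflow.
class VectorCache(private val maxWordCount: Int) :
    LinkedHashMap<String, FloatArray>(16, 0.75f, /* accessOrder = */ true) {

    override fun removeEldestEntry(
        eldest: MutableMap.MutableEntry<String, FloatArray>
    ): Boolean = size > maxWordCount
}

// Usage: on a cache miss, read the vector from disk and insert it.
// val cache = VectorCache(maxWordCount = 10_000)
// val vector = cache.getOrPut("hello") { readVectorFromDisk("hello") }
```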
Ordinary Word Embeddings.