Package com.londogard.nlp.embeddings

Embedding is a term for the representation of words for text analysis, typically as a real-valued vector that encodes the meaning of a word such that words closer together in the vector space are expected to be similar in meaning (more on wikipedia.org).
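To make the "closer in vector space" intuition concrete, here is a small self-contained Kotlin sketch; the three-dimensional toy vectors are made up for illustration (real embeddings typically have hundreds of dimensions):

```kotlin
import kotlin.math.sqrt

// Cosine similarity: close to 1.0 for vectors pointing the same way, near 0.0 for unrelated ones.
fun cosine(a: FloatArray, b: FloatArray): Double {
    val dot = a.indices.sumOf { (a[it] * b[it]).toDouble() }
    val normA = sqrt(a.sumOf { (it * it).toDouble() })
    val normB = sqrt(b.sumOf { (it * it).toDouble() })
    return dot / (normA * normB)
}

fun main() {
    val king = floatArrayOf(0.9f, 0.7f, 0.1f)    // toy vectors, not real embeddings
    val queen = floatArrayOf(0.85f, 0.75f, 0.15f)
    val banana = floatArrayOf(0.1f, 0.05f, 0.95f)

    println(cosine(king, queen))  // high similarity: related words
    println(cosine(king, banana)) // low similarity: unrelated words
}
```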

In com.londogard:nlp, multiple embedding types are supported:

| Embedding | Languages Supported | Details |
|---|---|---|
| LightWordEmbeddings | 157 via fastText, ∞ from disk | Retrieves up to K vectors from disk ad hoc, caching the most used results in memory |
| WordEmbeddings | 157 via fastText, ∞ from disk | Ordinary word embeddings |
| BpeEmbeddings | 275 via bpemb | Embeds SentencePiece-tokenized data. Retains performance within ±5% while using MBs rather than GBs! |
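As a quick orientation before the per-class details below, loading and querying might look roughly like this. Only EmbeddingLoader.fromLanguageOrNull is named on this page; the LanguageSupport enum location and the vector(...) accessor are assumptions for illustration:

```kotlin
import com.londogard.nlp.embeddings.EmbeddingLoader
import com.londogard.nlp.embeddings.WordEmbeddings
import com.londogard.nlp.utils.LanguageSupport // assumed package for the language enum

fun main() {
    // Fetches (or reuses) English fastText vectors and loads them as WordEmbeddings.
    val embeddings = EmbeddingLoader.fromLanguageOrNull<WordEmbeddings>(LanguageSupport.en)
        ?: error("English embeddings unavailable")

    // `vector(...)` is a hypothetical lookup name, not confirmed API.
    println(embeddings.vector("hello"))
}
```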

nlp also supports sentence embeddings, through the Average or USif methods.
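The Average method is conceptually the element-wise mean of a sentence's word vectors; here is a minimal plain-Kotlin sketch of that idea (USif additionally reweights words by frequency and removes a common component, which is omitted here):

```kotlin
// Average sentence embedding: element-wise mean of the word vectors.
fun averageSentenceEmbedding(wordVectors: List<FloatArray>): FloatArray {
    require(wordVectors.isNotEmpty()) { "Need at least one word vector" }
    val dims = wordVectors.first().size
    val sum = FloatArray(dims)
    for (vec in wordVectors) {
        for (i in 0 until dims) sum[i] += vec[i]
    }
    return FloatArray(dims) { i -> sum[i] / wordVectors.size }
}
```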

Types

public final class BpeEmbeddings implements Embeddings

BpeEmbeddings are subword embeddings that embed SentencePieceTokenizer-tokenized data. Studies show that performance is on par with GloVe (±5%) while using only a few MBs of data rather than GBs. Supports 275 languages through bpemb.
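Subword embeddings stay small because a limited vocabulary of pieces can cover any word. A hedged sketch of the lookup flow; `tokenize` and `vectorOrNull` are hypothetical stand-ins, not this class's actual methods:

```kotlin
// Hypothetical flow: word -> SentencePiece pieces -> averaged piece vectors.
// Neither `tokenize` nor `vectorOrNull` is confirmed API; they only illustrate the idea.
fun subwordVector(
    tokenize: (String) -> List<String>,    // e.g. "unbelievable" -> ["▁un", "belie", "vable"]
    vectorOrNull: (String) -> FloatArray?, // piece -> embedding lookup
    word: String
): FloatArray? {
    val pieceVectors = tokenize(word).mapNotNull(vectorOrNull)
    if (pieceVectors.isEmpty()) return null
    val dims = pieceVectors.first().size
    return FloatArray(dims) { i ->
        (pieceVectors.sumOf { it[i].toDouble() } / pieceVectors.size).toFloat()
    }
}
```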

public class EmbeddingLoader

A utility that simplifies loading embeddings. Simply call fromLanguageOrNull or fromFile.
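Both entry-point names come from the description above; a minimal sketch in which the type parameter, the language enum, and the java.nio.file.Path argument of fromFile are assumptions:

```kotlin
import java.nio.file.Paths

// Pretrained vectors for a language; returns null if the language is unsupported.
val english = EmbeddingLoader.fromLanguageOrNull<WordEmbeddings>(LanguageSupport.en)

// Embeddings already on disk; the Path argument type is an assumption.
val custom = EmbeddingLoader.fromFile<WordEmbeddings>(Paths.get("/path/to/vectors.vec"))
```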

public interface Embeddings
public final class LightWordEmbeddings implements Embeddings

Retrieves vectors from disk ad hoc and caches up to maxWordCount results in memory, keeping the most actively used keys. Works very well in constrained environments like a Raspberry Pi.
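The mechanism is essentially an LRU cache in front of an on-disk lookup; a self-contained sketch of that pattern (illustrative only, not this class's actual implementation):

```kotlin
// LRU cache in front of an on-disk vector lookup (illustrative, not the actual implementation).
class CachedVectorLookup(
    private val maxWordCount: Int,
    private val loadFromDisk: (String) -> FloatArray?
) {
    // accessOrder = true turns LinkedHashMap into an LRU: reads move keys to the back.
    private val cache = object : LinkedHashMap<String, FloatArray>(16, 0.75f, true) {
        override fun removeEldestEntry(eldest: MutableMap.MutableEntry<String, FloatArray>): Boolean =
            size > maxWordCount
    }

    // Serve from memory when possible, otherwise hit the disk and cache the result.
    fun vectorOrNull(word: String): FloatArray? =
        cache[word] ?: loadFromDisk(word)?.also { cache[word] = it }
}
```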

public final class WordEmbeddings implements Embeddings

Ordinary Word Embeddings.