Package com.londogard.nlp.embeddings
An embedding is a representation of a word for text analysis, typically a real-valued vector that encodes the word's meaning such that words closer in the vector space are expected to be similar in meaning (more on wikipedia.org).
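Closeness in the vector space is usually measured with cosine similarity. The following is a minimal, self-contained Kotlin sketch of that measure (an illustration, not part of this library's API):

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors: 1.0 means identical
// direction (very similar words), values near 0.0 mean unrelated vectors.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Double {
    require(a.size == b.size) { "Vectors must have the same dimension" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}
```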
In com.londogard:nlp, multiple embeddings are supported:
| Embedding | Languages Supported | Details |
|---|---|---|
| LightWordEmbeddings | 157 via fastText (loaded from disk) | Retrieves up to K vectors from disk ad hoc, caching the most-used results in memory |
| WordEmbeddings | 157 via fastText (loaded from disk) | Ordinary word embeddings |
| BpeEmbeddings | 275 via bpemb | Embeds SentencePiece-tokenized data. Retains performance within ±5% while using MBs rather than GBs! |
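All of these boil down to a word-to-vector lookup backed by pretrained files such as fastText's .vec text format (a header line with vocabulary size and dimension, then one word and its vector per line). As a rough illustration of what such a lookup involves (the library handles this internally; loadVecFile and maxWords below are hypothetical names):

```kotlin
import java.io.File

// Minimal reader for a fastText-style .vec text file: the first line holds
// "<vocabSize> <dimension>", each following line holds a word and its vector.
fun loadVecFile(path: String, maxWords: Int = 50_000): Map<String, FloatArray> {
    val embeddings = HashMap<String, FloatArray>()
    File(path).useLines { lines ->
        lines.drop(1)          // skip the header line
            .take(maxWords)    // cap memory usage
            .forEach { line ->
                val parts = line.split(' ')
                val vector = FloatArray(parts.size - 1) { parts[it + 1].toFloat() }
                embeddings[parts[0]] = vector
            }
    }
    return embeddings
}
```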
com.londogard:nlp also supports sentence embeddings, through the Average or USif method.
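The Average method is the element-wise mean of a sentence's word vectors. A hedged sketch of the idea, assuming a plain Map-based vocabulary (not the library's exact API):

```kotlin
// Average sentence embedding: the element-wise mean of the word vectors.
// Words missing from the vocabulary are skipped; no known words yields null.
fun averageSentenceEmbedding(
    tokens: List<String>,
    embeddings: Map<String, FloatArray>,
    dimension: Int
): FloatArray? {
    val vectors = tokens.mapNotNull { embeddings[it] }
    if (vectors.isEmpty()) return null
    val sum = FloatArray(dimension)
    for (v in vectors) for (i in sum.indices) sum[i] += v[i]
    return FloatArray(dimension) { sum[it] / vectors.size }
}
```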
Types
BPEEmbeddings are subword embeddings that embed SentencePieceTokenizer-tokenized data. Studies show that performance is on par with GloVe (±5%) while using only a few MBs of data rather than GBs. Supports 275 languages through bpemb.
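One common strategy with subword embeddings, used for instance in bpemb's own examples, is to derive a word's vector from the vectors of its subword pieces, e.g. by averaging them. A sketch under that assumption (tokenizeToPieces and the averaging step are illustrative, not this library's confirmed internals):

```kotlin
// Subword (BPE) embedding lookup: the word is split into subword pieces,
// and its vector is the element-wise average of the pieces' vectors.
fun bpeWordEmbedding(
    word: String,
    tokenizeToPieces: (String) -> List<String>,  // e.g. a SentencePiece model
    subwordEmbeddings: Map<String, FloatArray>,
    dimension: Int
): FloatArray {
    val sum = FloatArray(dimension)
    var count = 0
    for (piece in tokenizeToPieces(word)) {
        subwordEmbeddings[piece]?.let { vec ->
            for (i in sum.indices) sum[i] += vec[i]
            count++
        }
    }
    // No known pieces: fall back to the zero vector.
    return if (count == 0) sum else FloatArray(dimension) { sum[it] / count }
}
```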
A utility to simplify loading embeddings. Simply call fromLanguageOrNull or fromFile.
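A toy stand-in for that loader pattern; the method names mirror the docs, but the real signatures and return types may differ:

```kotlin
// Hypothetical loader illustrating the fromLanguageOrNull / fromFile split;
// check the library's actual signatures before use.
object ToyEmbeddingLoader {
    private val localFiles = mapOf("en" to "vectors/cc.en.300.vec")

    // Returns null when no vectors are available for the given language.
    fun fromLanguageOrNull(languageCode: String): Map<String, FloatArray>? =
        localFiles[languageCode]?.let(::fromFile)

    fun fromFile(path: String): Map<String, FloatArray> =
        loadVecFile(path)  // reuses the .vec reader sketched earlier
}
```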
Retrieves vectors from disk ad hoc and caches up to maxWordCount results in memory based on the most active cache keys. Works very well in constrained environments like a Raspberry Pi.
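The caching strategy described here is essentially an LRU cache. A minimal Kotlin sketch using java.util.LinkedHashMap's access order (VectorCache and readVectorFromDisk are hypothetical names, not this class's internals):

```kotlin
// LRU cache in the spirit of LightWordEmbeddings: keep at most maxWordCount
// vectors in memory, evicting the least recently used entry on overflow.
class VectorCache(private val maxWordCount: Int) :
    LinkedHashMap<String, FloatArray>(16, 0.75f, /* accessOrder = */ true) {

    override fun removeEldestEntry(
        eldest: MutableMap.MutableEntry<String, FloatArray>
    ): Boolean = size > maxWordCount
}

// Usage: on a cache miss, read the vector from disk and insert it.
// val cache = VectorCache(maxWordCount = 10_000)
// val vector = cache.getOrPut("hello") { readVectorFromDisk("hello") }
```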
Ordinary Word Embeddings.