Introduction to Word Embeddings (NLP)
A study of how word embeddings capture the meaning of a word from its neighbouring words, using techniques like Word2Vec & GloVe.
One-hot encoding works in some situations but breaks down when we have a large vocabulary to deal with, because the size of our word representation grows with the number of words. What we need is a way to control the size of our word representation by limiting it to a fixed-size vector. That is where word embeddings come in!
In other words, we want to find an embedding for each word in some vector space, and we want that embedding to exhibit some desired properties.
For example, if two words are similar in meaning, they should lie closer to each other than words that are not. And if two pairs of words have a similar difference in meaning, they should be approximately equally separated in the embedded space.
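The contrast is easy to see in a quick sketch (a toy illustration, not from any particular library): a one-hot vector is as long as the vocabulary, while an embedding lookup table keeps every word at a size we choose.

```python
# Toy illustration: one-hot vectors grow with the vocabulary,
# while embedding vectors stay at a fixed, chosen size.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]           # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

one_hot = np.eye(len(vocab))[word_to_index["cat"]]   # length == vocabulary size
print(one_hot.shape)                                 # (5,) and it grows with the vocab

embedding_dim = 3                                    # fixed size, chosen by us
embeddings = np.random.rand(len(vocab), embedding_dim)
print(embeddings[word_to_index["cat"]].shape)        # (3,) no matter how large the vocab
```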
We could use such a representation for a variety of purposes like finding synonyms and analogies, identifying concepts around which words are clustered, classifying words as positive, negative, neutral, etc. By combining word vectors, we can come up with another way of representing documents as well.
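Here is a hedged example of the kinds of queries such a representation supports, assuming the gensim library and its downloadable "glove-wiki-gigaword-50" pretrained vectors are available (any set of pretrained word vectors would do).

```python
# Assumes gensim is installed; api.load downloads pretrained vectors (~66 MB).
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")

# Words with similar meanings end up close together ...
print(wv.most_similar("happy", topn=3))

# ... and consistent differences in meaning support analogies:
# vector("king") - vector("man") + vector("woman") lands near vector("queen").
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Averaging word vectors gives a crude fixed-size representation of a document.
doc = "the cat sat on the mat".split()
doc_vector = np.mean([wv[w] for w in doc if w in wv], axis=0)
print(doc_vector.shape)
```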
Word2Vec — The General Idea
Word2Vec is perhaps one of the most popular examples of word embeddings used in practice. As the name Word2Vec indicates, it transforms words to vectors. But what the name doesn’t give away is how that transformation is performed.
The core idea behind Word2Vec is this: a model that is able to predict a given word from its neighboring words, or vice versa, predict the neighboring words for a given word, is likely to capture the contextual meanings of words very well. These are, in fact, the two flavours of Word2Vec models: continuous bag of words, where we are given the neighboring words, and skip-gram, where we are given the middle word.
In the skip-gram model, we pick any word from a sentence, convert it into a one-hot encoded vector and feed it into a neural network or some other probabilistic model that is designed to predict a few surrounding words, its context. Using a suitable loss function, the weights or parameters of the model are optimized, and this step is repeated until it learns to predict context words as well as it can.
Now, take an intermediate representation such as a hidden layer of the neural network: the outputs of that layer for a given word become the corresponding word vector. The continuous bag of words variation also uses a similar strategy!
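To make the two flavours concrete, here is a small sketch (not the original Word2Vec code) of how training pairs could be generated from a sentence with a window of +/- 2 words.

```python
# Generate (input, target) training pairs for skip-gram and CBOW.
def training_pairs(tokens, window=2, mode="skip-gram"):
    pairs = []
    for pos, target in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        context = [tokens[p] for p in range(lo, hi) if p != pos]
        if mode == "skip-gram":
            # Given the middle word, predict each surrounding word.
            pairs += [(target, ctx) for ctx in context]
        else:
            # CBOW: given the surrounding words, predict the middle word.
            pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps".split()
print(training_pairs(sentence, mode="skip-gram")[:4])
print(training_pairs(sentence, mode="cbow")[:2])
```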
Properties of Word2Vec:
- A robust, distributed representation.
- The vector size is independent of the vocabulary.
- Train once, then store the vectors in a lookup table.
- Ready for deep learning architectures!
This yields a very robust representation of words, because the meaning of each word is distributed throughout the vector. The size of the word vector is up to us: we choose it to trade off performance against complexity, and it remains constant no matter how many words we train on, unlike the Bag of Words model, for instance, where the size grows with the number of unique words. And once we pre-train a large set of word vectors, we can use them efficiently without having to transform words again and again, simply by storing them in a lookup table. Finally, they are ready to be used in deep learning architectures.
For example, they can be used as the input vectors for recurrent neural nets. It is also possible to use RNNs to learn even better word embeddings. Some other optimizations can further reduce model and training complexity, such as representing the output words using Hierarchical Softmax or computing the loss using Sparse Cross Entropy.
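These properties show up directly in a typical workflow. A minimal sketch, assuming gensim >= 4.0: we fix the vector size ourselves, train once, and keep the resulting vectors as a lookup table.

```python
# Assumes gensim >= 4.0 is installed.
from gensim.models import Word2Vec, KeyedVectors

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects the skip-gram flavour; vector_size is fixed by us.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
print(model.wv["cat"].shape)          # (50,) regardless of vocabulary size

model.wv.save("vectors.kv")           # train once, store ...
wv = KeyedVectors.load("vectors.kv")  # ... then reuse as a simple lookup table
print(wv["dog"][:5])
```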
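As a hedged illustration of the deep-learning side, here is a short Keras sketch (assuming TensorFlow/Keras is available) that feeds an embedding layer into an RNN and uses sparse cross entropy, so the labels can stay as integer class indices instead of one-hot vectors. The layer sizes are arbitrary placeholders.

```python
import tensorflow as tf

vocab_size, embed_dim, num_classes = 10_000, 100, 3

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None,), dtype="int32"),  # variable-length sequences of token ids
    tf.keras.layers.Embedding(vocab_size, embed_dim),     # lookup table of word vectors
    tf.keras.layers.LSTM(64),                             # RNN over the word vectors
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# Sparse cross entropy: targets are integer class indices, not one-hot vectors.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```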
Introduction to GloVe
Word2Vec is just one type of word embedding. Recently, several other approaches have been proposed that are really promising. GloVe, or Global Vectors for Word Representation, is one such approach that tries to directly optimize the vector representation of each word using just co-occurrence statistics, unlike Word2Vec, which sets up an ancillary prediction task.
First, the probability that word "j" appears in the context of word "i", P(j | i), is computed for all word pairs (i, j) in a given corpus. What do we mean by "j" appears in the context of "i"?
Simply that word "j" is present in the vicinity of word "i", either right next to it or a few words away. We count all such occurrences of "i" and "j" in our text collection, and then normalize the counts to get probabilities.
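A small sketch (a toy illustration, not the reference GloVe preprocessing) of counting co-occurrences with a symmetric window and normalizing each word's counts into probabilities:

```python
# Count window-based co-occurrences and normalize them into P(j | i).
from collections import defaultdict

def cooccurrence_probs(sentences, window=2):
    counts = defaultdict(lambda: defaultdict(float))
    for tokens in sentences:
        for pos, target in enumerate(tokens):
            lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    counts[target][tokens[ctx_pos]] += 1.0
    # Normalize each row so that probs[i][j] approximates P(j | i).
    probs = {}
    for target, ctx_counts in counts.items():
        total = sum(ctx_counts.values())
        probs[target] = {ctx: c / total for ctx, c in ctx_counts.items()}
    return probs

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
print(cooccurrence_probs(corpus, window=2)["cat"])   # e.g. P("sat" | "cat")
```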
Then a random vector is initialized for each word, actually two vectors: one for the word when it is acting as context, and one when it is acting as the target. So far, so good. Now, for any pair of words (i, j), we want the dot product of their word vectors, w_i · w_j, to match their co-occurrence statistics (in the full GloVe objective, the dot product plus two bias terms is fit to the logarithm of the co-occurrence count). Using this as our goal and a suitable loss function, we can iteratively optimize these word vectors. The result should be a set of vectors that capture the similarities and differences between individual words.
If we look at it from another point of view, we are essentially factorizing the co-occurrence probability matrix into two smaller matrices. This is the basic idea behind GloVe.
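The factorization view fits in a few lines of NumPy. This is a minimal sketch of the GloVe-style objective, assuming a dense co-occurrence count matrix X; it keeps the weighting function and bias terms from the original paper but uses plain SGD and absorbs constant factors into the learning rate, so it is an illustration rather than the reference implementation.

```python
# Fit word vectors W and context vectors C so that
# W[i] . C[j] + b_w[i] + b_c[j] ~= log X[i, j] for co-occurring pairs.
import numpy as np

def glove_sketch(X, dim=50, epochs=50, lr=0.05, x_max=100, alpha=0.75):
    V = X.shape[0]
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(V, dim))   # target-word vectors
    C = rng.normal(scale=0.1, size=(V, dim))   # context-word vectors
    b_w = np.zeros(V)                          # target biases
    b_c = np.zeros(V)                          # context biases
    ii, jj = np.nonzero(X)                     # only pairs that actually co-occur
    for _ in range(epochs):
        for i, j in zip(ii, jj):
            x = X[i, j]
            weight = min(1.0, (x / x_max) ** alpha)        # down-weights rare pairs
            diff = W[i] @ C[j] + b_w[i] + b_c[j] - np.log(x)
            grad = weight * diff
            gw, gc = grad * C[j], grad * W[i]
            W[i] -= lr * gw
            C[j] -= lr * gc
            b_w[i] -= lr * grad
            b_c[j] -= lr * grad
    return W + C   # summing the two sets of vectors is a common final step
```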
References
1. Hierarchical Softmax, generally known as H-Softmax and proposed by Morin and Bengio, is an approximation of the softmax that organizes the vocabulary in a binary tree. More information can be found in Morin and Bengio's paper, "Hierarchical Probabilistic Neural Network Language Model" (2005).
2. The major difference between Sparse Cross Entropy and Categorical Cross Entropy is the format of the true labels: sparse cross entropy expects integer class indices, while categorical cross entropy expects one-hot encoded labels.
That’s it for Word2Vec & GloVe. Thanks for reading and following along; I hope you had a good time reading and learning!