
Table Of Contents
· Introduction.
· Intuition.
· Why word2vec approach?
· Word2vec Architecture.
1. Continuous Bag Of Words (CBOW).
2. Skip-Gram.
Introduction
I hope we all have a very good understanding of how the N-GRAMS model works by now. If not,
I would suggest reading the previous blog, where I have briefly explained the intuition along
with the implementation. Alright, enough of the bits and bobs, let's get right to it!
Intuition
So the word2vec model is a model in which we project words onto a high-dimensional space (about 300-400 dimensions) as numerical vectors. There is a speciality about these vectors though, i.e. vectors representing words that have similar meanings, or that share some sort of relation, end up closer to each other. Let’s consider a simple example.

Notice that words with similar semantics are closer together, and if we compute King - man + woman (vector addition), we get a vector that is very close to Queen. I have a fun little exercise for ya,
Think in similar terms and see what examples you can come up with.
I will give you a few to get you started:
“Brother – man + woman = Sister”
“Tokyo - Japan + France = Paris”
“Tallest – tall + short = Shortest”
Well, what we must notice while playing around with these word analogies is that we are indirectly learning the context and the different degrees of similarity between the words. For instance,
Tokyo – Japan + France = Paris. It might look like random place names, but just as Tokyo (a capital) is to Japan, the model learns that Paris is the capital of France. So there is no limit to the contexts we can learn (provided the dataset is relevant enough).
This is what our model does in a nutshell, and this conversion of words to vectors is called word embeddings. We then use these word embeddings to either predict a target word from its context or predict the context given a particular word, via neural networks. This technique has proven to be very effective in learning the syntactic and semantic relations between words. The layer in the neural net that does this job of projection is called the projection layer; it is connected to the input layer on one side and to the hidden layer, where the model learns the word embeddings, on the other, which in turn is connected to the output layer that typically uses a softmax to predict the target word.
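If you want to play with these analogies yourself, here is a minimal sketch assuming the gensim library and its pre-trained model downloader are available (the model name and the printed neighbours are illustrative, and the download is large on first use):

import gensim.downloader as api

# Load pre-trained word2vec vectors (assumes gensim-data's
# "word2vec-google-news-300" model; big download on first run).
wv = api.load("word2vec-google-news-300")

# Vector arithmetic: add "king" and "woman", subtract "man",
# then look up the nearest word vectors to the result.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(wv.most_similar(positive=["Tokyo", "France"], negative=["Japan"], topn=3))

If the data is relevant enough, the top neighbours should come out close to "queen" and "Paris" respectively.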
Why word2vec approach?
There are many word embedding schemes. The simplest of them all is one-hot encoding, where the dimension of the created vector is the size of the entire vocabulary, and only the one index representing the word is 1, with the rest being 0.
For example,
have = [1, 0, 0, 0, 0, 0, ... 0]
a    = [0, 1, 0, 0, 0, 0, ... 0]
good = [0, 0, 1, 0, 0, 0, ... 0]
day  = [0, 0, 0, 1, 0, 0, ... 0]
...
But the disadvantage is that words are treated as atomic, independent units, so no relationship exists between them. It also wastes a huge amount of memory when the vocabulary is large.
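To make that concrete, here is a tiny sketch of how such one-hot vectors could be built for a toy vocabulary (the vocabulary and helper name are made up for illustration):

# Toy one-hot encoding: one dimension per vocabulary word,
# with a single 1 at the word's index and 0s everywhere else.
vocab = ["have", "a", "good", "day"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("good"))  # [0, 0, 1, 0]

Notice that every vector is equally far from every other one, so nothing about meaning is captured.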
Another method is bag of words, where instead of 1s and 0s we use the count of each word appearing in the corpus.
The disadvantage here is that we won't be able to differentiate between sentences as long as their word counts are the same, because word order is thrown away.
For example:
“The movie was not good and boring” would mean the same as “the movie was good and not boring” according to the bag of words model.
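You can check this yourself with a quick sketch in plain Python (using whitespace splitting as a stand-in tokenizer):

from collections import Counter

s1 = "the movie was not good and boring"
s2 = "the movie was good and not boring"

# Bag of words ignores order: only the word counts survive.
bow1 = Counter(s1.lower().split())
bow2 = Counter(s2.lower().split())

print(bow1 == bow2)  # True - the two sentences are indistinguishable

Both sentences map to exactly the same count vector, even though they mean opposite things.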
That’s where our word2vec model stands out! Its strength comes from its ability to group together vectors of similar words. Given a large enough dataset, word2vec can make strong estimates about a word’s meaning based on its occurrences in the text. These estimates yield word associations with the other words in the corpus.
Now that we have understood how the word2vec model works, let's look at its architectures.
1) Continuous Bag Of Words (CBOW) Architecture.
Note that this isn’t the same as the classic Bag Of Words. The main difference is that BOW uses the frequency of a particular word as the sole source for learning context, while CBOW represents the context words as continuous vectors instead of discrete counts and averages them to generalize the context. This was the first architecture proposed, and what essentially happens is that the model learns from the dataset and then predicts the best-fitting word given some context. We limit the context by setting a parameter called the ‘window size’.
Example,
For the sentence “She ate her dog’s homework” with a window size of 2, i.e. two words considered for context on either side. Visually,
· For the word "She": ["ate", "her"]
· For the word "ate": ["She", "her", "dog's"]
· For the word "her": ["She", "ate", "dog's", "homework"]
· For the word "dog's": ["ate", "her", "homework"]
· For the word "homework": ["her", "dog's"]
Those tokenized sets of context words are the training data, and the word on the left (the target word) is the expected output. One specific thing about this architecture is that all the words in a given context share the projection layer, i.e. they use the same weights on the projection layer. The vectors thus obtained are averaged, projecting the context to a single generalized position and capturing a more generalized representation of it.
· This reduces model complexity since we don’t use individual weights on the projection layer for each word.
· Improves computational efficiency and also enables us to capture a more generalized context.
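To tie the pieces together, here is a rough sketch of how the (context, target) pairs and the averaging step could look (the window size, embedding size, and variable names are illustrative, and this is not a full training loop):

import numpy as np

sentence = ["She", "ate", "her", "dog's", "homework"]
window = 2
embedding_dim = 8  # tiny, just for illustration (real models use ~300)

# Shared projection (embedding) matrix: one row per vocabulary word.
vocab = {w: i for i, w in enumerate(sentence)}
embeddings = np.random.rand(len(vocab), embedding_dim)

# Build (context words, target word) training pairs for CBOW.
pairs = []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window): i] + sentence[i + 1: i + 1 + window]
    pairs.append((context, target))

# The context vectors share the same projection matrix and are averaged
# into a single "generalized" context representation for the target.
for context, target in pairs:
    avg_context = np.mean([embeddings[vocab[w]] for w in context], axis=0)
    print(target, "<-", context, "-> context vector shape:", avg_context.shape)

In a real model, this averaged context vector would then be fed to the softmax output layer to predict the target word.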
Visually,

2) Continuous Skip-Gram Architecture.
This is another neural-network-based architecture used for learning word embeddings. In skip-gram, the main objective is to predict the context words surrounding a target word, limited by the window size. It is essentially the opposite of what the CBOW architecture does. This model creates what are called positive and negative examples. Positive examples (positive context words) are the actual context words from the dataset, while negative examples are random words drawn from the vocabulary. The objective is to maximize the probability of predicting the positive examples and minimize the probability of predicting the negative examples. That is the gist of how the model works.
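Here is a rough sketch of how those positive and negative examples could be generated (the toy vocabulary, labels, and the simplified uniform negative sampling are all illustrative; real implementations weight negatives by word frequency):

import random

sentence = ["She", "ate", "her", "dog's", "homework"]
vocab = ["She", "ate", "her", "dog's", "homework", "cat", "ran", "book"]
window = 2
num_negatives = 2

examples = []
for i, target in enumerate(sentence):
    context = sentence[max(0, i - window): i] + sentence[i + 1: i + 1 + window]
    for ctx in context:
        # Positive example: (target, actual context word, label 1)
        examples.append((target, ctx, 1))
        # Negative examples: (target, random non-context word, label 0)
        for _ in range(num_negatives):
            noise = random.choice([w for w in vocab if w not in context and w != target])
            examples.append((target, noise, 0))

for ex in examples[:6]:
    print(ex)

The model is then trained to push the target's vector towards its positive context words and away from the randomly sampled negatives.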
Visually,

I hope by now we all have a deep understanding of the intuition behind the word2vec model. This blog is getting rather long hehe, so in my next blog we shall implement the word2vec model hands-on, which is going to be pretty interesting for all my data science enthusiasts out here. Stay tuned!!