1/68 Word2vec

Now you might agree that representing text with vectors is a good idea.

2/68 Word2vec

But how exactly does it work?

3/68 Word2vec

Let’s look at word2vec and walk you through the neural network training process.

4/68 Word2vec

Word2vec was created by a team of Google researchers led by Tomas Mikolov in 2013.

5/68 Word2vec

Let’s start with intuition.

Think about how you learned new vocabulary while reading back in elementary school.

6/68 Word2vec

You read,

7/68 Word2vec

guess,

8/68 Word2vec

and you remember.

9/68 Word2vec

How do you guess the meaning of a new word?

10/68 Word2vec

Normally through the context, meaning the words surrounding that new word.

11/68 Word2vec

For example, look at this sentence. ___

12/68 Word2vec

As an eight-year-old, you don’t recognize the word acumen, but you might guess through the context that this word probably means intelligence and sharp judgment.

13/68 Word2vec

The idea of using context to predict words applies to word2vec.

14/68 Word2vec
15/68 Word2vec
16/68 Word2vec

You’ll explore two major variants of word2vec: continuous bag-of-words (CBOW) and skip-gram.

17/68 Word2vec

Both algorithms rely on the same idea of using relationships between the surrounding context and the center word to make predictions,

18/68 Word2vec

but in opposite directions.

19/68 Word2vec

Continuous bag-of-words predicts the center word given the context words,

20/68 Word2vec

and skip-gram predicts the context words given the center word.

21/68 Word2vec
22/68 Word2vec

Assume you're missing the center word chasing,

23/68 Word2vec

and CBOW tries to use the context words dog, is, a, and person to predict the missing word.

How does it work technically?

24/68 Word2vec

Let’s walk through the two primary steps:

  • preparing the dataset and

  • training the neural network.

25/68 Word2vec

Prepare the training data.

You run a sliding window of size 2K+1

26/68 Word2vec

over the entire text corpus.

You might hear the word corpus often in NLP. It basically means a collection of words or a body of words.

27/68 Word2vec

To put it simply, let's say K equals 2.

28/68 Word2vec

You start with the first word A.

To run a 2K+1 window with K equal to 2,

29/68 Word2vec

you have two words before and two words after the center word A.

There are no words before A, but there are two words after it.

Therefore, the two words after A, dog and is, will be used to predict A, which is sometimes called the label word.

30/68 Word2vec

Then the window shifts to the second word dog as the center word.

31/68 Word2vec

Dog will be predicted by the word A before it and the words is and chasing after it.

32/68 Word2vec

Therefore, the three words a, is, and chasing will be used to predict the label word "dog."

33/68 Word2vec

The same process continues until the last word, person, is the center word.

34/68 Word2vec

Now the training data is ready.
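
To make this concrete, here is a small Python sketch of the sliding-window pairing described above (the tokenization and variable names are illustrative, not part of the original lesson):

```python
# Build CBOW training pairs (context words -> center "label" word)
# from the example sentence, using window parameter K = 2.
sentence = "a dog is chasing a person".split()
K = 2

training_pairs = []
for i, center in enumerate(sentence):
    # Take up to K words before and up to K words after the center word.
    context = sentence[max(0, i - K):i] + sentence[i + 1:i + 1 + K]
    training_pairs.append((context, center))

for context, center in training_pairs:
    print(context, "->", center)
# ['dog', 'is'] -> a
# ['a', 'is', 'chasing'] -> dog
# ...
```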

35/68 Word2vec

Train the neural network,

specifically a narrow neural network, to learn the embedding matrix.

36/68 Word2vec

Narrow means there’s only one hidden layer in this model

37/68 Word2vec

instead of multiple hidden layers, as in a deep neural network.

38/68 Word2vec

In practice, you can simply use a pre-trained word2vec model (for example, from TensorFlow) without needing to know the details.

39/68 Word2vec

In case you’re curious and want to train your own word2vec model instead of using a pre-trained one, here is what happens in the backend.

40/68 Word2vec

The goal is to learn the embedding matrix E_v*d,

41/68 Word2vec

where v is the size of the vocabulary

42/68 Word2vec

and d is the number of dimensions.

43/68 Word2vec

For example, in the sentence "A dog is chasing a person," you have five different words: a, dog, is, chasing, and person.

44/68 Word2vec

Therefore, v equals 5.

45/68 Word2vec

The value of d depends on the number of features that you want the neural network to learn to represent each word.

46/68 Word2vec

It can be anywhere from one to four digits.

Normally, the larger the number, the more refined the model will be, although it costs more computational resources.

47/68 Word2vec

d is also a hyperparameter that you can tune when using Google’s pre-trained word2vec embedding, which means that you can try different numbers to see which one produces the best result.

To make the visual simple,

48/68 Word2vec

let's say d equals 3 here.

49/68 Word2vec

Therefore, the matrix in this example is 5 by 3 and it looks like this.

Each w represents a weight that you want the neural network to learn.
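
For illustration only, here is what that randomly initialized 5-by-3 matrix might look like in NumPy (the actual values are whatever training ends up learning):

```python
import numpy as np

v, d = 5, 3                 # vocabulary size and number of dimensions
E = np.random.rand(v, d)    # each entry is a weight w that training will adjust
print(E.shape)              # (5, 3): one 3-dimensional row per word
```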

50/68 Word2vec

A simplified version to explain the process is to take the vectors of the surrounding 2K context words as input, sum them up, and output a vector to represent the center word.

This version is not very informative.

Let’s discuss this in more detail.

51/68 Word2vec

First is the input layer.

The input layer consists of the 2K context words, each represented by a one-hot encoded vector.

If you recall from the previous lesson, a one-hot encoded vector is a 1-by-v vector, where v is the size of the vocabulary.

You place a 1 in the position corresponding to the word and zeros in all the other positions.

52/68 Word2vec

Then you embed these vectors with the embedding matrix E_v*d, which means you multiply each vector by the embedding matrix E.

53/68 Word2vec

To illustrate this, let's assume you have a one-hot encoded input vector [0, 1, 0, 0, 0].

You multiply it by the embedding matrix, which in this case is the word2vec matrix learned with the CBOW technique, and you get the embedded 1-by-d vector [10, 12, 23].

You can imagine the embedding matrix as a lookup table to turn the original word into a vector that has semantic meaning.
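
A minimal sketch of that lookup, reusing the numbers above (the other rows of E are made-up placeholders):

```python
import numpy as np

# Hypothetical embedding matrix E with v = 5 rows and d = 3 columns.
E = np.array([[ 5.0,  7.0,  9.0],
              [10.0, 12.0, 23.0],   # row for the word at index 1
              [ 1.0,  4.0,  2.0],
              [ 8.0,  3.0,  6.0],
              [ 2.0,  9.0,  5.0]])

one_hot = np.array([0, 1, 0, 0, 0])   # 1-by-v one-hot vector
embedded = one_hot @ E                # the multiplication simply selects row 1
print(embedded)                       # [10. 12. 23.]
```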

54/68 Word2vec

To begin this process, all the values (weights) in the embedding matrix E are randomly assigned.

55/68 Word2vec

Once you get the embedded vector for each context word, sum all the vectors to get the hidden layer H, a 1-by-d vector.

56/68 Word2vec

You multiply H with another matrix E’ and feed the result to a softmax function to get the probability y.

y is a one-by-v vector and each value shows the probability of the word in that position being the center word.

This is the output layer and the predicted result.
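
Putting those steps together, here is a hedged NumPy sketch of a single CBOW forward pass; the word indices and random weights are illustrative only:

```python
import numpy as np

v, d = 5, 3
E = np.random.rand(v, d)          # embedding matrix (input side)
E_prime = np.random.rand(d, v)    # second weight matrix E' (output side)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative vocabulary order: a=0, dog=1, is=2, chasing=3, person=4.
context_ids = [1, 2, 0, 4]                 # context words: dog, is, a, person
context_onehots = np.eye(v)[context_ids]   # the 2K one-hot input vectors

H = (context_onehots @ E).sum(axis=0)      # hidden layer: sum of embedded vectors (1-by-d)
y = softmax(H @ E_prime)                   # 1-by-v probabilities for the center word
print(y.argmax())                          # index of the predicted center word
```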

57/68 Word2vec

However, this is actually the beginning of the iteration.

You must compare the output vector with the actual result,

58/68 Word2vec

and use backpropagation to adjust the weights in the embedding matrices E and E'.

59/68 Word2vec

Iterate this process until the difference between the predicted result and the actual result is minimized.

Now you have E, the word2vec embedding matrix that you aimed to learn at the beginning.

This training process illustrates how a neural network learns.
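
If you would like to see what that training loop can look like in code, here is a hedged Keras sketch of a tiny CBOW-style model; it mirrors the steps above (embed, sum, project, softmax) but is a simplified stand-in, not the exact architecture of Google's released model:

```python
import tensorflow as tf

v, d, K = 5, 3, 2   # vocabulary size, embedding dimensions, window parameter

inputs = tf.keras.Input(shape=(2 * K,), dtype="int32")             # 2K context word indices
x = tf.keras.layers.Embedding(input_dim=v, output_dim=d)(inputs)   # lookup in E
h = tf.keras.layers.Lambda(lambda t: tf.reduce_sum(t, axis=1))(x)  # hidden layer H (sum)
outputs = tf.keras.layers.Dense(v, activation="softmax")(h)        # E' plus softmax

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(context_ids, center_ids, epochs=...)   # backpropagation adjusts E and E'
# E = model.layers[1].get_weights()[0]             # the learned v-by-d embedding matrix
```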

60/68 Word2vec

In contrast to continuous bag-of-words (CBOW), skip-gram uses the center word to predict the context words.

For example, given chasing, what are the probabilities that other words, such as a, dog, and person, occur in the surrounding context?

61/68 Word2vec

The process is similar to CBOW, though.

First, you prepare the training data by running a sliding window of size 2K+1.

For example, let K equal 2.

62/68 Word2vec

You start with the first word A.

63/68 Word2vec

You have no word before A but two words after it.

Therefore, you use A to predict dog and is respectively.

64/68 Word2vec

And then the window shifts to the second word dog as the center word.

You use dog to predict the word a before it and the words is and chasing after it.

65/68 Word2vec

The same process continues until the last word, person, is the center word.

66/68 Word2vec

Now your training data is ready.
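
The data preparation mirrors the earlier CBOW sketch, except that each training pair now maps the center word to a single context word (again, the tokenization is illustrative):

```python
# Build skip-gram training pairs (center word -> one context word), K = 2.
sentence = "a dog is chasing a person".split()
K = 2

pairs = []
for i, center in enumerate(sentence):
    context = sentence[max(0, i - K):i] + sentence[i + 1:i + 1 + K]
    for ctx in context:
        pairs.append((center, ctx))

print(pairs[:5])
# [('a', 'dog'), ('a', 'is'), ('dog', 'a'), ('dog', 'is'), ('dog', 'chasing')]
```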

67/68 Word2vec

Next, you train the neural network to learn the embedding matrix E.

Unlike in CBOW, the E here is learned with the skip-gram technique.

You start with the input layer, which consists of the one-hot encoded vector of the center word.

Embed this vector with the embedding matrix E and feed the embedded vector to one hidden layer.

After another matrix E' and the softmax function, you finally generate the output layer, which consists of vectors to predict the 2K surrounding context words.
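
As a rough sketch under the same assumptions as the CBOW example, one skip-gram forward pass looks like this; the same 1-by-v probability vector is then used to score each of the 2K context positions:

```python
import numpy as np

v, d = 5, 3
E = np.random.rand(v, d)          # embedding matrix (input side)
E_prime = np.random.rand(d, v)    # second weight matrix E' (output side)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

center_onehot = np.eye(v)[3]      # e.g. index 3 for the center word "chasing"
H = center_onehot @ E             # hidden layer: the center word's embedded vector
y = softmax(H @ E_prime)          # probability of each vocabulary word appearing in the context
```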

68/68 Word2vec

Now that you understand how word2vec works, let's look at the coding.

Fortunately, you don't have to train word2vec yourself.

Keras in TensorFlow packages all the details, and you only need to call the pre-trained models and tune the hyperparameters.
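
For example, here is one hedged way to load Google's pre-trained 300-dimensional vectors through the gensim downloader (a large download) and reuse them in a frozen Keras Embedding layer; this is a sketch of one possible workflow, not the only one:

```python
import gensim.downloader as api
import tensorflow as tf

# Download Google's pre-trained word2vec vectors (roughly 1.6 GB).
kv = api.load("word2vec-google-news-300")
print(kv.most_similar("dog", topn=3))   # nearest neighbours in the embedding space

# Reuse the pre-trained vectors in a non-trainable Keras Embedding layer.
embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(kv.index_to_key),
    output_dim=kv.vector_size,
    embeddings_initializer=tf.keras.initializers.Constant(kv.vectors),
    trainable=False,
)
```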