One-hot encoding and bag-of-words

You were introduced to the general steps for preparing text data for model training.

You focused on tokenization, which aims to divide the text into smaller language units such as words.

You also briefly explored the different tasks of preprocessing.

Preprocessing generates a set of cleaned language units, which serve as the input for text representation.

If tokenization is how a computer reads text, text representation solves the problem of how a computer understands text.

This challenge can be further split into two subproblems:

  • how to turn text into numbers that an ML model can accept as input;

  • how to make those numbers capture the meaning of words and the relationships between them.

Morse code and ASCII can be used to turn text into digits, so why can’t you use them in modern NLP? The problem is that they encode individual characters rather than words: the resulting numbers say nothing about what a word means or how it relates to other words.

The first idea is called one-hot encoding.

With this technique, you “one-hot” encode each word in your vocabulary.

Consider the sentence “A dog is chasing a person.”

After tokenization and preprocessing, the sentence can be represented by three words: dog, chase, person.

To turn each word into a vector, you must first create a vector whose length equals the size of the vocabulary.

Assume you have a vocabulary that includes six words:

dog, chase, person, my, cat, and run.

Then, place a “1” in the position that corresponds to the word and “0”s in the rest of the positions.

In the example, the rows on the left represent the words from the sentence and the columns across the top represent the vocabulary.

By conducting one-hot encoding, you converted the sentence “A dog is chasing a person” into a matrix that an ML model can take as input.
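
As a rough sketch of how this could be implemented (the vocabulary and tokens come from the example above; the helper name one_hot_encode and the variable names are only illustrative, not something defined in the lesson):

```python
# Minimal one-hot encoding sketch using the example vocabulary.

vocabulary = ["dog", "chase", "person", "my", "cat", "run"]

def one_hot_encode(word, vocab):
    """Return a vector with a 1 at the word's position and 0s everywhere else."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

# "A dog is chasing a person" after tokenization and preprocessing:
tokens = ["dog", "chase", "person"]

# One row per word, one column per vocabulary entry.
matrix = [one_hot_encode(token, vocabulary) for token in tokens]

for token, row in zip(tokens, matrix):
    print(token, row)
# dog [1, 0, 0, 0, 0, 0]
# chase [0, 1, 0, 0, 0, 0]
# person [0, 0, 1, 0, 0, 0]
```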

It’s intuitive to understand and easy to implement.

However, let’s also acknowledge the disadvantages.

Recalling the two sub-challenges of text representation, one-hot encoding has two main issues, among others.

The dimension of each vector depends on the size of the vocabulary, which can easily be tens of thousands of words.

Also, most of the values of each vector are zeros, which means that this is a super sparse representation.
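
To make this concrete, here is a tiny sketch; the vocabulary size of 10,000 is only an assumed, illustrative figure:

```python
# Illustrative only: with a 10,000-word vocabulary, each one-hot vector has
# 10,000 entries, and all but one of them are zero.

vocab_size = 10_000   # assumed vocabulary size
word_index = 42       # position of some word in the vocabulary

vector = [0] * vocab_size
vector[word_index] = 1

print(len(vector))    # 10000 dimensions per word
print(sum(vector))    # 1 non-zero value; the other 9,999 are zeros
```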

Another method of text representation is called bag-of-words.

You first collect a “bag” of words from the text in the NLP project to build your vocabulary (or dictionary).

For example, you might have the same vocabulary of six words as before: dog, chase, person, my, cat, and run.

To represent the sentence, you must again create a vector whose length is equal to the size of the vocabulary, then place in each position a value that represents how many times the corresponding word appears in the given document.

For a document in which “dog” appears twice, “chase” once, and “my” once, you get the outcome vector [2, 1, 0, 1, 0, 0].
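
A minimal sketch of this counting step might look as follows (the token list is a hypothetical document chosen so that “dog” appears twice, “chase” once, and “my” once; the helper name bag_of_words is only illustrative):

```python
# Count-based bag-of-words over the example vocabulary.

vocabulary = ["dog", "chase", "person", "my", "cat", "run"]

def bag_of_words(tokens, vocab):
    """Count how many times each vocabulary word occurs in the token list."""
    return [tokens.count(word) for word in vocab]

# Hypothetical preprocessed document, e.g. "my dog is chasing a dog".
tokens = ["my", "dog", "chase", "dog"]

print(bag_of_words(tokens, vocabulary))  # [2, 1, 0, 1, 0, 0]
```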

Sometimes you might not care about the frequency, but only the occurrence of the words.

You can simply use 1 and 0 to represent whether each word appears in the text.
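
The occurrence-only variant is the same sketch with counts replaced by a 0/1 check (again, the names are illustrative):

```python
# Binary bag-of-words: 1 if the vocabulary word appears in the document, else 0.

vocabulary = ["dog", "chase", "person", "my", "cat", "run"]

def binary_bag_of_words(tokens, vocab):
    return [1 if word in tokens else 0 for word in vocab]

print(binary_bag_of_words(["my", "dog", "chase", "dog"], vocabulary))
# [1, 1, 0, 1, 0, 0]
```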

By conducting bag-of-words, you can convert the sentence “a dog is chasing a person” into a vector that an ML model can take as input.

Similar to one-hot encoding, it’s intuitive to understand and easy to implement.

Compared to one-hot encoding, it has two improvements:

  • an entire sentence or document is represented by a single vector rather than a matrix of word vectors;

  • it captures some semantic similarity between texts, because documents that share many of the same words end up with similar vectors.

However, bag-of-words still produces high-dimensional and sparse vectors.

The dimension of the vector increases with the size of the vocabulary.

Thus sparsity remains a problem.

Although it captures some semantic similarity between sentences, it still falls far short of capturing the relationships between words.

Bag-of-words does not consider the order of the words, which is why it’s called a “bag” of words.

Basic vectorization methods such as one-hot encoding and bag-of-words are not ideal.

Two major problems remain unsolved:

  • the high-dimensional and sparse vectors;

  • the lack of relationship between words.