Tokenization

You may recall the end-to-end NLP workflow and the three major stages in developing an NLP project:

data preparation,

model training, and

model serving.

In the first stage of data preparation, you must engineer the data for model training.

As you know, a computer only understands numbers, and you can only feed an NLP model with numbers.

Here, you encounter a significant challenge for NLP: how do you represent text as numbers?

To better understand the challenge of text representation in NLP, let’s compare text data with other data types such as tabular, image, and audio.

Tabular data might be the easiest to feed into an ML model because most of it is already numeric, and the columns that aren’t numeric can easily be converted into numbers.
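
For instance, here is a minimal sketch (the DataFrame and its columns are made up for illustration) of turning a categorical column into numeric codes with pandas:

```python
import pandas as pd

# Hypothetical tabular data: one numeric column, one text column.
df = pd.DataFrame({
    "age": [23, 41, 35],
    "city": ["Tokyo", "Paris", "Tokyo"],
})

# Map each category to an integer code so the whole table is numeric.
df["city_code"] = df["city"].astype("category").cat.codes
print(df[["age", "city_code"]])
```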

How about image data?

How can you convert an image into numbers?

You can take advantage of pixels.

Each cell in the matrix of pixels represents the intensity of the corresponding pixel in the image.
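
As a minimal sketch, assuming Pillow and NumPy are installed and dog.png is a hypothetical image file:

```python
import numpy as np
from PIL import Image

# Load a (hypothetical) image and convert it to grayscale.
img = Image.open("dog.png").convert("L")

# Each cell of this 2D array is the intensity (0-255) of one pixel.
pixels = np.asarray(img)
print(pixels.shape)   # (height, width)
print(pixels[0, :5])  # the first five pixel intensities in row 0
```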

How about audio?

How do you convert a song into numbers?

You can use waves.

You can sample the wave and record its amplitude (the height).

The audio can then be represented as an array of amplitudes sampled at fixed time intervals.
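
Here is a minimal sketch with NumPy that generates a pure tone instead of loading a real song, just to show sampling at fixed intervals:

```python
import numpy as np

sample_rate = 16_000  # samples per second
duration = 1.0        # seconds
t = np.linspace(0.0, duration, int(sample_rate * duration), endpoint=False)

# A 440 Hz sine wave stands in for a real recording; each entry is the
# wave's amplitude at one fixed-time sample point.
amplitudes = np.sin(2 * np.pi * 440 * t)
print(amplitudes.shape)  # (16000,) -- one amplitude per sample
```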

How about text?

Can you think of a way to turn a sentence into numbers?

The answer is not that obvious.

Well, let’s divide the feature engineering in NLP into smaller steps.

Please note: you don’t have to implement each of these steps yourself. NLP libraries normally provide the functions you need; you must call the right functions at the right time.

The following explanation uncovers how these libraries work, assuming you might want to build your own NLP applications from scratch.

Assuming you have already loaded the raw text, you’ll then tokenize it, which basically means dividing the text into smaller language units such as words.

This is how a computer reads text.

After that, you’ll preprocess the language units, for example by keeping only the root of each word and removing punctuation.

You’ll then turn the preprocessed language units into numbers that represent some meanings.

This step is often called text representation, and it’s where a computer understands text in addition to reading it.

The output of text representation is normally vectors that can be fed into ML models to solve specific tasks.
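
As a toy illustration only (not any particular library’s scheme), tokens can be mapped to integer IDs through a vocabulary lookup:

```python
# A made-up vocabulary; real vocabularies are learned from a corpus.
vocab = {"a": 1, "dog": 2, "chase": 3, "person": 4}

tokens = ["a", "dog", "chase", "a", "person"]

# 0 is reserved here for out-of-vocabulary tokens.
ids = [vocab.get(token, 0) for token in tokens]
print(ids)  # [1, 2, 3, 1, 4]
```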

Before exploring different techniques for text representation and various NLP models, let’s start with tokenization and explore how a computer reads text.

Tokenization is the first step to prepare text for ML models.

It aims to split the text into smaller language units such as words.

For example, tokenization will split the sentence “a dog is chasing a person” into separate words.

This step is often overlooked and underappreciated, simply because English is easy to tokenize with a delimiter such as whitespace.
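
In plain Python, such a word tokenizer is a one-liner:

```python
sentence = "a dog is chasing a person"

# Splitting on whitespace is enough for simple English text.
tokens = sentence.split()
print(tokens)  # ['a', 'dog', 'is', 'chasing', 'a', 'person']
```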

However, if you take a moment to think about it, you’ll find the problem is not as simple as it looks.

First of all, what about other languages such as Chinese?

In this example, 一条狗在追一个人, which is the Chinese translation of “A dog is chasing a person,” there’s no space between characters, so how do you split the sentence?

To solve this problem, people have developed different tokenization strategies and tools for different languages.
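
For example, the open-source jieba library is one widely used tokenizer for Chinese; a minimal sketch:

```python
import jieba  # pip install jieba

# jieba segments the sentence even though it contains no spaces.
tokens = jieba.lcut("一条狗在追一个人")
print(tokens)  # e.g. ['一条', '狗', '在', '追', '一个', '人']
```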

The smaller language units produced by tokenization are called tokens, and they can exist at different levels.

For example:

Character tokens split the text at the character level; for instance, “dog” is split into d-o-g.

Subword tokens split the text at the root-word level; for example, “chasing” is split into “chase” and “ing.”

Word tokens split the text by whitespace.

Phrase tokens split the text by phrases, for example, “a dog” and “is chasing.”

And finally, sentence tokens split the text by punctuation.
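
A quick sketch of three of these levels in plain Python (subword and phrase tokenization usually require a trained model, so they are omitted here):

```python
text = "A dog is chasing a person. The person is running."

# Character tokens: "dog" -> ['d', 'o', 'g']
char_tokens = list("dog")

# Word tokens: split on whitespace.
word_tokens = text.split()

# Sentence tokens: naive split on ". " (real tokenizers handle
# punctuation such as abbreviations far more carefully).
sentence_tokens = text.split(". ")

print(char_tokens)
print(word_tokens)
print(sentence_tokens)
```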

Word tokenization is the most commonly used algorithm for splitting text; however, each tokenization level has its own advantages and disadvantages.

The choice of tokenization type mainly depends on the NLP libraries and the NLP models you’re using.

After tokenization, you must further prepare the text.

This step is called preprocessing.

You can do different things in this step, for example converting all text to lowercase, keeping only the root of each word, and removing punctuation and stop words.
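
A minimal sketch using Python’s standard library plus NLTK’s PorterStemmer (one possible stemmer among several):

```python
import string
from nltk.stem import PorterStemmer  # pip install nltk

tokens = ["A", "dog", "is", "chasing", "a", "person", "."]
stemmer = PorterStemmer()

cleaned = [
    stemmer.stem(token.lower())         # lowercase, then keep the root
    for token in tokens
    if token not in string.punctuation  # drop punctuation tokens
]
print(cleaned)  # ['a', 'dog', 'is', 'chase', 'a', 'person']
```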

You have various NLP libraries to help you complete these preprocessing tasks automatically.

For example, TensorFlow provides a new text preprocessing layer through the TextVectorization API.

It maps text features to integer sequences, covering functions such as preprocessing, tokenization, and even vectorization.

Using this new API, you can do all the text preparation work in one place.
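
A minimal sketch of the layer (the corpus and parameter values are made up for illustration):

```python
import tensorflow as tf

# By default the layer lowercases and strips punctuation, tokenizes on
# whitespace, and maps each token to an integer ID -- all in one place.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000,           # cap the vocabulary size
    output_mode="int",         # emit integer token IDs
    output_sequence_length=8,  # pad/truncate every sequence to 8 IDs
)

corpus = ["A dog is chasing a person", "The person is running"]
vectorizer.adapt(corpus)  # build the vocabulary from the corpus

print(vectorizer(["a dog is chasing a person"]))
```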