Data extraction, analysis, and preparation


After you define the business use case and establish the success criteria, the process of delivering an ML model to production typically involves several steps, which can be completed manually or by an automated pipeline.

The first three steps deal with data. Data needs to be ingested, which means it’s extracted from a raw data source. With data extraction, you retrieve data from various sources. Those sources can be “streaming”, arriving in real time, or “batch”.

For example, you may extract data from a customer relationship management system, or CRM, to analyze customer behavior. This data may be “structured”, where the file type is a .csv, .txt, .json, or .xml format, or you may have an “unstructured” data source, such as images of your customers or text comments from chat sessions with your customers.

You may have to extract “streaming” data from your company’s transportation vehicles that are equipped with sensors that transmit data in real time.

If the data you want to train your model on or get predictions for is structured, you might retrieve it from a data warehouse such as BigQuery. Or you can use Apache Beam’s I/O module. In this Dataflow example, we’re loading data from BigQuery, calling predict on every record, and then writing the results back into BigQuery.
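To make that concrete, here is a minimal sketch of that kind of batch-prediction pipeline using the Apache Beam Python SDK. The table names, the record_id field, the output schema, and the predict function are hypothetical stand-ins, not the actual course example.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical table names -- substitute your own project, dataset, and tables.
SOURCE_TABLE = "my_project:my_dataset.input_records"
DEST_TABLE = "my_project:my_dataset.predictions"

def predict(record):
    """Stand-in for a real model call: emits a row with an id and a dummy score."""
    # Replace the constant with your model's predict() on the record's features.
    return {"record_id": record.get("record_id"), "prediction": 0.5}

# Defaults run locally; pass runner, project, and temp_location options for Dataflow.
options = PipelineOptions()

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(table=SOURCE_TABLE)
        | "Predict" >> beam.Map(predict)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            DEST_TABLE,
            schema="record_id:INT64,prediction:FLOAT64",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```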

In data analysis, you analyze the data you’ve extracted. For example, you can use exploratory data analysis (or EDA). This involves using graphics and basic sample statistics to explore your data, such as looking for outliers or anomalies, trends, and data distributions.
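As a rough illustration, basic EDA in Python with pandas might look like the following; the file name and column names are hypothetical.

```python
import pandas as pd

# Hypothetical input file and columns -- substitute your own extracted data.
df = pd.read_csv("customer_orders.csv")

# Basic sample statistics: count, mean, std, min/max, and quartiles per numeric column.
print(df.describe())

# Distribution of a categorical feature: how often each value occurs.
print(df["product_number"].value_counts())

# A crude outlier check: flag rows more than 3 standard deviations from the mean.
amount = df["order_amount"]
outliers = df[(amount - amount.mean()).abs() > 3 * amount.std()]
print(f"{len(outliers)} potential outliers out of {len(df)} rows")
```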

It may not be apparent how changes in the distribution of your data could affect your model, so let’s consider a scenario.

In this scenario, an upstream data source encodes a categorical feature using a number, such as a product number. One day, the product numbering convention changes, and now the customer uses a totally different mapping, with some old numbers and some new ones.

How would you know that this had happened?

How would you debug your ML model?

The output of your model would tell you if there’s a drop in performance, but it wouldn’t tell you why.
The raw inputs themselves would appear valid because we’re still getting numbers.
In order to recognize this change, you would need to look at changes in the distribution of your inputs.

In doing so, you might find that while the most commonly occurring value used to be 4, in the new distribution 4 might never occur at all, and the most commonly occurring value might be 10.
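One simple way to surface that kind of shift is to compare the value distribution of a feature between training-time data and the data you are currently seeing; the toy data below is made up purely for illustration.

```python
import pandas as pd

# Hypothetical snapshots of the same feature at training time and at serving time.
train = pd.Series([4, 4, 4, 2, 7, 4, 2], name="product_number")
serving = pd.Series([10, 10, 3, 10, 12, 10, 3], name="product_number")

# Compare normalized value counts side by side: values that disappear or newly
# appear (like 4 vanishing and 10 dominating) stand out immediately.
comparison = pd.concat(
    [train.value_counts(normalize=True), serving.value_counts(normalize=True)],
    axis=1,
    keys=["train", "serving"],
).fillna(0.0)
print(comparison)
```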

Depending on how you implemented your feature columns, these new values might be mapped to one component of a one-hot encoded vector or to many components.

If, for example, you used a categorical column with a hash bucket, the new values would be distributed according to the hash function, and so one hash bucket might now get more and different values than before.
If you used a vocabulary, then the new values would map to out-of-vocabulary (OOV) buckets.

But what’s important is that, for a given tensor, its relationship to the label before, and its relationship to the label now, are likely to be very different.
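For instance, those two options might look roughly like this with TensorFlow’s feature columns; the feature name, vocabulary, and bucket counts are hypothetical, and in newer TensorFlow releases these columns are deprecated in favor of Keras preprocessing layers, but they illustrate the idea.

```python
import tensorflow as tf

# Option 1: hash the raw value into a fixed number of buckets; new product
# numbers land in whichever bucket the hash function sends them to.
hashed = tf.feature_column.categorical_column_with_hash_bucket(
    key="product_number", hash_bucket_size=100, dtype=tf.int64
)

# Option 2: use an explicit vocabulary; values outside the vocabulary fall
# into the out-of-vocabulary (OOV) buckets.
vocab = tf.feature_column.categorical_column_with_vocabulary_list(
    key="product_number",
    vocabulary_list=[1, 2, 3, 4, 5, 6, 7],
    num_oov_buckets=2,
)

# Wrap either column as an indicator (one-hot / multi-hot) column for a model.
one_hot = tf.feature_column.indicator_column(vocab)
```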

So, after you’ve extracted and analyzed your data, the next step in the process is data preparation.

Data preparation includes data transformation and feature engineering, which is the process of changing, or converting, the format, structure, or values of the data you’ve extracted into another format or structure. Most ML models require categorical data to be in a numerical format, but some models work with either numeric or categorical features, while others can handle mixed-type features.
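As a small, hypothetical illustration in pandas, a string-valued categorical feature can be converted to integer codes before training; the column and values are made up.

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"payment_method": ["card", "cash", "card", "voucher"]})

# Ordinal / integer encoding: each distinct category gets a numeric code.
df["payment_method_code"] = df["payment_method"].astype("category").cat.codes
print(df)
```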

For example, here are three types of preprocessing for dates using SQL in BigQuery ML, where we are:

Extracting the parts of the date into different columns: year, month, day, and so on.

Extracting the time period between the current date and each date column, in terms of years, months, days, and so on.

And extracting some specific features from the date: name of the weekday, weekend or not, holiday or not, and so on.

Now, here is an example of the dayofweek and hourofday features extracted using SQL and visualized as a table in Data Studio.
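A rough sketch of that kind of date preprocessing, run through the BigQuery client library for Python against a hypothetical table and timestamp column (the course’s actual query and the Data Studio table aren’t reproduced here), might look like this:

```python
from google.cloud import bigquery

# Hypothetical table and column names -- substitute your own.
QUERY = """
SELECT
  -- 1. Parts of the date as separate columns.
  EXTRACT(YEAR  FROM pickup_ts) AS year,
  EXTRACT(MONTH FROM pickup_ts) AS month,
  EXTRACT(DAY   FROM pickup_ts) AS day,
  -- 2. Time elapsed between the current date and the column.
  DATE_DIFF(CURRENT_DATE(), DATE(pickup_ts), DAY) AS days_ago,
  -- 3. Specific features: day of week, hour of day, weekend flag.
  EXTRACT(DAYOFWEEK FROM pickup_ts) AS dayofweek,
  EXTRACT(HOUR FROM pickup_ts) AS hourofday,
  EXTRACT(DAYOFWEEK FROM pickup_ts) IN (1, 7) AS is_weekend
FROM `my_project.my_dataset.trips`
LIMIT 10
"""

client = bigquery.Client()  # requires Google Cloud credentials
for row in client.query(QUERY).result():
    print(dict(row))
```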

Please note that for all non-numeric columns other than TIMESTAMP, BigQuery ML performs a one-hot encoding transformation. This transformation generates a separate feature for each unique value in the column.
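To illustrate what that one-hot transformation produces, here is a rough pandas equivalent of the same idea with a made-up column; this shows the concept, not BigQuery ML’s internal implementation.

```python
import pandas as pd

# Hypothetical non-numeric column.
df = pd.DataFrame({"weekday": ["Mon", "Tue", "Mon", "Sat"]})

# One-hot encoding: one new 0/1 column per unique value in the original column.
print(pd.get_dummies(df, columns=["weekday"]))
```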