TensorFlow data validation

There are three phases in a pipeline:

  • Data is ingested and validated
  • A model is trained and analyzed
  • The model is then deployed in production

We’ll provide an overview of TensorFlow Data Validation, which is part of the ingest-and-validate-data phase.

To learn more about the train and analyze the model phase, or how to deploy a model in production, please check out our MLOps Course.

TensorFlow Data Validation is a library for analyzing and validating machine learning data.

Two common use cases of TensorFlow Data Validation within TensorFlow Extended pipelines are validation of continuously arriving data and training/serving skew detection.

The pipeline begins with the ExampleGen component.

This component takes raw data as input and generates TensorFlow Examples; it accepts many input formats (e.g. CSV, TFRecord).

It also splits the examples into Train and Eval sets for you.
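
The splitting idea can be pictured with a small sketch in plain Python. This is a toy illustration, not ExampleGen's actual implementation (`assign_split` is a made-up name): hashing each record's id gives a stable, reproducible train/eval assignment, so a record always lands in the same split.

```python
import hashlib

def assign_split(record_id: str, eval_fraction: float = 0.33) -> str:
    """Deterministically assign a record to 'train' or 'eval' by hashing its id."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "eval" if bucket < eval_fraction * 100 else "train"

records = [f"row-{i}" for i in range(1000)]
splits = {"train": [], "eval": []}
for r in records:
    splits[assign_split(r)].append(r)
```

Because the assignment depends only on the record id, re-running the pipeline on the same data reproduces the same split.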

It then passes the result to the StatisticsGen component.

This brings us to the three main components of TensorFlow Data Validation.

The Statistics Generation component - which generates statistics for feature analysis,

the Schema Generation component - which gives you a description of your data,

and the Example Validator component - which allows you to check for anomalies.

We’ll explore those three components in more depth, but first let’s look at our use cases for TensorFlow Data Validation so that we can understand how these components work.
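
To make the three components concrete, here is a toy sketch of the statistics → schema → validation flow in plain Python. The function names mirror the concepts, not the library's API, and the data is invented for illustration.

```python
def generate_statistics(rows):
    """Toy StatisticsGen: per-feature count, missing count, and min/max."""
    stats = {}
    for row in rows:
        for name, value in row.items():
            s = stats.setdefault(name, {"count": 0, "missing": 0, "values": []})
            s["count"] += 1
            if value is None:
                s["missing"] += 1
            else:
                s["values"].append(value)
    for s in stats.values():
        s["min"] = min(s["values"]) if s["values"] else None
        s["max"] = max(s["values"]) if s["values"] else None
    return stats

def infer_schema(stats):
    """Toy SchemaGen: record each feature's type and observed value range."""
    return {name: {"type": type(s["values"][0]).__name__,
                   "min": s["min"], "max": s["max"]}
            for name, s in stats.items() if s["values"]}

def validate_statistics(stats, schema):
    """Toy ExampleValidator: flag missing and out-of-range values as anomalies."""
    anomalies = []
    for name, s in stats.items():
        if name not in schema:
            anomalies.append(f"{name}: unexpected new feature")
            continue
        if s["missing"]:
            anomalies.append(f"{name}: {s['missing']} missing value(s)")
        if s["min"] is not None and s["min"] < schema[name]["min"]:
            anomalies.append(f"{name}: value below expected minimum")
        if s["max"] is not None and s["max"] > schema[name]["max"]:
            anomalies.append(f"{name}: value above expected maximum")
    return anomalies

train = [{"age": 25, "fare": 7.5}, {"age": 40, "fare": 80.0}]
schema = infer_schema(generate_statistics(train))

new = [{"age": None, "fare": 512.0}]  # a missing value and an out-of-range value
anomalies = validate_statistics(generate_statistics(new), schema)
```

In a real pipeline these steps correspond to TFDV calls such as `tfdv.generate_statistics_from_csv` (or `generate_statistics_from_tfrecord`), `tfdv.infer_schema`, and `tfdv.validate_statistics`.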

There are many reasons you may need to analyze and transform your data.

For example, you may be missing data, such as features with empty values, or have labels treated as features, so that your model gets to peek at the right answer during training.

You may also have features with values outside the range you expect, or other data anomalies.

To engineer more effective feature sets, you should identify:

  • especially informative features,
  • redundant features,
  • features that vary so widely in scale that they may slow learning,
  • and features with little or no unique predictive information.
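
The last three checks in the list above can be sketched with simple statistics. This is a hand-rolled illustration (the feature values and thresholds are invented, and `pearson` is written out rather than taken from a library): zero variance signals no unique information, near-perfect correlation signals redundancy, and the per-feature spread exposes scale differences.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation; values near +/-1 suggest redundant features."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs)
             * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom if denom else 0.0

features = {
    "age":        [22, 35, 58, 41, 29],
    "age_months": [264, 420, 696, 492, 348],   # 12 * age: redundant
    "income":     [30_000, 82_000, 120_000, 54_000, 61_000],
    "constant":   [1, 1, 1, 1, 1],             # no predictive information
}

# Features with zero variance carry no unique predictive information.
uninformative = [n for n, v in features.items() if pstdev(v) == 0]

# Highly correlated pairs are candidates for removal as redundant.
names = list(features)
redundant = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
             if abs(pearson(features[a], features[b])) > 0.99]

# Wide differences in spread hint at scale mismatches that may slow learning.
scales = {n: pstdev(v) for n, v in features.items()}
```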

One use case for TensorFlow Data Validation is to validate continuously arriving data.

  • Let’s say on day one you generate statistics based on the day-one data.
  • On day two, you generate statistics based on the day-two data. From there, you can validate the day-two statistics against the day-one statistics and generate a validation report.
  • You can do the same for day three, validating day-three statistics against statistics from both day two and day one.
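
The day-over-day idea can be sketched in a few lines of plain Python. This is a toy version of the comparison (the data, the three-standard-deviations tolerance, and the function names are all illustrative, not the library's mechanism):

```python
from statistics import mean, pstdev

def daily_stats(values):
    """Summarize one day's data for a single numeric feature."""
    return {"mean": mean(values), "stdev": pstdev(values)}

def validate_against(new, baseline, tolerance=3.0):
    """Flag the new day's mean if it drifts more than `tolerance`
    baseline standard deviations from the baseline mean."""
    shift = abs(new["mean"] - baseline["mean"])
    allowed = tolerance * max(baseline["stdev"], 1e-9)
    return [] if shift <= allowed else [
        f"mean shifted by {shift:.2f} (allowed {allowed:.2f})"]

day1 = [10.2, 9.8, 10.5, 10.1, 9.9]
day2 = [10.3, 10.0, 9.7, 10.4, 10.1]   # looks like day 1
day3 = [42.0, 43.5, 41.8, 42.2, 42.9]  # upstream unit change?

s1, s2, s3 = daily_stats(day1), daily_stats(day2), daily_stats(day3)
report_day2 = validate_against(s2, s1)  # empty: no anomaly
report_day3 = validate_against(s3, s1)  # flags the shifted mean
```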

TensorFlow Data Validation can also be used to detect distribution skew between training and serving data.

Training-serving skew occurs when training data is generated differently from how the data used to request predictions is generated.

But what causes distribution skew?

Possible causes include a change in how data is handled in training vs. in production, or a faulty sampling mechanism that trains on only a subsample of the serving data.

For example, suppose a feature is an average value: for training you average over the last 10 days, but when you request predictions you average over the last month.

In general, any difference between how you generate your training data and your serving data (the data you use to generate predictions) should be reviewed to prevent training-serving skew.
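
TFDV can quantify this kind of difference: for categorical features, a skew comparator with an L-infinity distance threshold is one of the checks the library supports. The toy sketch below computes that distance in plain Python (`linf_distance`, the example data, and the 0.1 threshold are illustrative, not the library API):

```python
from collections import Counter

def linf_distance(train_values, serving_values):
    """L-infinity distance between two normalized value distributions:
    the largest absolute difference in any single value's frequency."""
    p, q = Counter(train_values), Counter(serving_values)
    n_p, n_q = len(train_values), len(serving_values)
    return max(abs(p[v] / n_p - q[v] / n_q) for v in set(p) | set(q))

train = ["US"] * 70 + ["UK"] * 20 + ["DE"] * 10
serving = ["US"] * 40 + ["UK"] * 20 + ["DE"] * 40  # DE suddenly dominant

distance = linf_distance(train, serving)
skew_detected = distance > 0.1  # illustrative threshold
```

Here the "DE" frequency jumps from 10% in training to 40% in serving, so the distance is 0.3 and the check fires.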

Training-serving skew can also occur based on the data distribution in your training, validation, and testing data splits.

To summarize, distribution skew occurs when the distribution of feature values for training data is significantly different from that of serving data, and one of its key causes is how data is handled or changed in training vs. in production.