Training on large datasets with tf.data API

It’s time to look at some specifics. The tf.data API introduces a tf.data.Dataset abstraction that represents a sequence of elements, in which each element consists of one or more components. For example, in an image pipeline, an element might be a single training example, with a pair of tensor components representing the image and its label.

There are two distinct ways to create a dataset: a data source constructs a Dataset from data stored in memory or in one or more files, while a data transformation constructs a Dataset from one or more existing tf.data.Dataset objects.
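As a minimal sketch of both approaches (the tensors and the lambda below are illustrative placeholders, not taken from the text), a source builds a Dataset directly from in-memory data, and a transformation derives a new Dataset from an existing one:

```python
import tensorflow as tf

# Data source: construct a Dataset from data already stored in memory.
features = tf.random.uniform([8, 3])                       # placeholder feature tensors
labels = tf.random.uniform([8], maxval=2, dtype=tf.int32)  # placeholder labels
source_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Data transformation: construct a new Dataset from an existing tf.data.Dataset.
transformed_ds = source_ds.map(lambda x, y: (x * 2.0, y)).batch(4)
```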

Large datasets tend to be sharded, or broken apart into multiple files, which can be loaded progressively. Remember that you train on mini-batches of data; you do not need to have the entire dataset in memory. One mini-batch is all you need for one training step.
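For illustration only (the file pattern, batch size, and interleave settings below are assumptions, not from the text), a sharded dataset can be streamed shard by shard so that only one mini-batch plus a small read buffer is materialized at a time:

```python
import tensorflow as tf

# Hypothetical shard pattern; shards are read progressively rather than all at once.
file_ds = tf.data.Dataset.list_files("/data/train-*.tfrecord")

dataset = (
    file_ds
    .interleave(tf.data.TFRecordDataset,      # stream records from several shards
                cycle_length=4,
                num_parallel_calls=tf.data.AUTOTUNE)
    .batch(128)                               # one mini-batch per training step
    .prefetch(tf.data.AUTOTUNE)               # overlap input with training
)
```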

The Dataset API will help you create input functions for your model that load data progressively. There are specialized Dataset classes that can read data from text files such as CSVs, from TensorFlow records, or from fixed-length record files. Datasets can be created from these different file formats: use TextLineDataset to instantiate a Dataset object comprising lines from one or more text files; TFRecordDataset comprises records from one or more TFRecord files; and FixedLengthRecordDataset reads fixed-length records from one or more binary files. For anything else, you can use the generic Dataset class and add your own decoding code.
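A short sketch of the three constructors follows; the file names and record length are placeholders, not taken from the text:

```python
import tensorflow as tf

# Lines from one or more text files, e.g. CSVs (file names are placeholders).
text_ds = tf.data.TextLineDataset(["data-1.csv", "data-2.csv"])

# Records from one or more TFRecord files.
tfrecord_ds = tf.data.TFRecordDataset(["train-0.tfrecord", "train-1.tfrecord"])

# Fixed-length records from one or more binary files (record_bytes is a placeholder).
fixed_ds = tf.data.FixedLengthRecordDataset(["images.bin"], record_bytes=28 * 28)
```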

Let’s walk through an example of TFRecordDataset. At the beginning, the TFRecord op is created and executed, producing a variant tensor that represents a dataset and is stored in the corresponding Python object.

Next, the Shuffle op is executed, using the output of the TFRecord op as its input, connecting the two stages of the input pipeline.

Next, the user-defined function is traced and passed as an attribute to the Map op, along with the Shuffle dataset variant as its input.

Finally, the Batch op is created and executed, creating the final stage of the input pipeline.
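Put together, the four stages traced above correspond to a pipeline like the following; the file name, buffer size, feature spec, and batch size are illustrative assumptions:

```python
import tensorflow as tf

def parse_fn(serialized_example):
    # Hypothetical user-defined function that the Map op traces.
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_raw(features["image"], tf.uint8)
    return image, features["label"]

dataset = tf.data.TFRecordDataset("train.tfrecord")  # TFRecord op -> dataset variant
dataset = dataset.shuffle(buffer_size=1000)          # Shuffle op, consuming that variant
dataset = dataset.map(parse_fn)                      # Map op with the traced function
dataset = dataset.batch(32)                          # Batch op, the final stage
```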

When the for loop mechanism is used to enumerate the elements of the dataset, the __iter__ method is invoked on the dataset, which triggers the creation and execution of two ops. First, an anonymous iterator op is created and executed, which results in the creation of an iterator resource. Subsequently, this resource, along with the Batch dataset variant, is passed into the MakeIterator op, initializing the state of the iterator resource with the dataset.

When the __next__ method is called, it triggers the creation and execution of the IteratorGetNext op, passing in the iterator resource as its input. Note that the IteratorGetNext op is created only once but executed as many times as there are elements in the input pipeline.
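Both forms of iteration below drive the same ops; the small pipeline used here is a placeholder. The for loop calls __iter__ once, and each element then comes from a __next__ call:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(6).batch(2)  # placeholder pipeline

# The for loop invokes __iter__ once (anonymous iterator + MakeIterator),
# then __next__ for each element (IteratorGetNext).
for batch in dataset:
    print(batch.numpy())

# Equivalent explicit form:
it = iter(dataset)       # creates and initializes the iterator resource
first_batch = next(it)   # executes IteratorGetNext with the iterator resource
del it                   # releasing the Python object triggers DeleteIterator
```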

Finally, when the Python iterator object goes out of scope, the DeleteIterator op is executed to make sure the iterator resource is properly disposed of.
To state the obvious, properly disposing of the iterator resource is essential, as it is not uncommon for the iterator resource to allocate hundreds of megabytes to gigabytes of memory because of internal buffering.