Mirrored strategy

The mirrored strategy is the simplest way to get started with distributed training.

It runs on a single machine with multiple GPU devices and creates one replica of the model on each GPU device.

During training, one mini-batch is split into N parts, where N equals the number of GPUs, and each part is fed to one GPU device.

For this setup, the TensorFlow mirrored strategy manages the coordination of data distribution and gradient updates across all of the GPUs.

Let’s look at an image classification example that defines a Keras ResNet model with the functional API.

First, download the Cassava dataset from TensorFlow Datasets.
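As a minimal sketch (the variable names and loading only the training split here are illustrative choices), the dataset can be loaded with TensorFlow Datasets:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Load the Cassava dataset; as_supervised=True yields (image, label)
# pairs, and with_info=True also returns dataset metadata.
data, info = tfds.load('cassava', as_supervised=True, with_info=True)
train_data = data['train']
```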

Then, add a preprocess_data function to scale the images.
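One way to write that function is sketched below; resizing to 224x224 is an assumption made to match the ResNet input size used later:

```python
def preprocess_data(image, label):
    # Resize to the model's expected input size and scale pixel
    # values from [0, 255] down to [0, 1].
    image = tf.image.resize(image, [224, 224])
    return image / 255.0, label
```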

Then, define the model.
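Here is a minimal sketch of a ResNet50-based classifier built with the Keras functional API; the five output classes correspond to the Cassava labels, and training the backbone from scratch (weights=None) is an illustrative choice:

```python
def create_model():
    # ResNet50 backbone without its top classification layer.
    base = tf.keras.applications.ResNet50(
        include_top=False, weights=None, input_shape=(224, 224, 3))

    # Functional API: wire inputs through the backbone, pool, and classify.
    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = base(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(5, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)
```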

Let’s create the strategy object using tf.distribute.MirroredStrategy.
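Creating the strategy object is a single line; printing num_replicas_in_sync is a quick sanity check that all of the GPUs were detected:

```python
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas in sync: {}'.format(strategy.num_replicas_in_sync))
```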

Next, let’s create the model with variables within the strategy scope.

These variables include the model, sparse_categorical_crossentropy for loss, a Keras optimizer, and metrics variables to compute accuracy.
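A sketch of that step, reusing the create_model function defined above; the Adam optimizer shown here is an illustrative choice:

```python
with strategy.scope():
    # Everything created in this scope -- model variables, loss,
    # optimizer, and metrics -- is mirrored across the GPUs.
    model = create_model()
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer=tf.keras.optimizers.Adam(),
        metrics=['accuracy'])
```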

The last change you will want to make is to the batch size.

When you carry out distributed training with the tf.distribute.Strategy API and tf.data, the batch size now refers to the global batch size.

In other words, if you pass a batch size of 64 and you have two GPUs, then each GPU will process 32 examples per step.

In this case, 64 is known as the global batch size, and 32 as the per-replica batch size.

To make the most out of your GPUs, you will want to scale the batch size by the number of replicas, which is two in this case because there is one replica on each GPU.
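In code, that scaling can look like this (64 is the illustrative per-replica batch size from the example above):

```python
BATCH_SIZE_PER_REPLICA = 64
# Scale by the number of replicas so each GPU still sees a full batch
# of 64 examples per step (global batch = 128 with two GPUs).
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
```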

From there, map, shuffle, and prefetch the data.
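A sketch of the resulting input pipeline, reusing preprocess_data and GLOBAL_BATCH_SIZE from above; the shuffle buffer size is an arbitrary illustrative value:

```python
train_data = (train_data
              .map(preprocess_data, num_parallel_calls=tf.data.AUTOTUNE)
              .shuffle(1024)
              .batch(GLOBAL_BATCH_SIZE)
              .prefetch(tf.data.AUTOTUNE))
```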

You then call model.fit on the training data.

Here we are going to run five passes over the entire training dataset.
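In code, those five passes are just the epochs argument:

```python
# Five passes over the entire training dataset.
model.fit(train_data, epochs=5)
```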

Let’s take a brief look at what actually happens when we call model.fit before adding a strategy.

For simplicity, imagine you have a simple linear model instead of the ResNet50 architecture.

In TensorFlow, you can think of this simple model in terms of its computational graph, a directed acyclic graph (DAG).

Here, the matmul op takes in the X and W tensors, which are the training batch and weights respectively.

The resulting tensor is then passed to the add op with the tensor b, which is the model’s bias terms.

The result of this op is ypred, which is the model’s predictions.
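Written out as TensorFlow ops with illustrative shapes, the graph is just a matmul followed by an add:

```python
import tensorflow as tf

X = tf.random.normal([64, 10])              # training batch: 64 examples, 10 features
W = tf.Variable(tf.random.normal([10, 1]))  # model weights
b = tf.Variable(tf.zeros([1]))              # model bias terms

y_pred = tf.matmul(X, W) + b                # matmul op, then add op
```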

Here is an example of data parallelism with two GPUs.

The input batch X is split in half: one slice is sent to GPU 0, and the other to GPU 1.

In this case, each GPU calculates the same ops but on different slices of the data.
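Here is a conceptual sketch of that split, done by hand with tf.split and explicit device placement purely to illustrate what MirroredStrategy automates for you; it continues the linear-model example above and assumes two visible GPUs:

```python
# Split the global batch in half along the batch dimension.
x0, x1 = tf.split(X, num_or_size_splits=2, axis=0)

# Each GPU runs the same matmul and add ops on its own slice.
with tf.device('/GPU:0'):
    y_pred_0 = tf.matmul(x0, W) + b
with tf.device('/GPU:1'):
    y_pred_1 = tf.matmul(x1, W) + b
```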

For more information on making the most of your GPUs, please refer to the guide titled, “Optimize TensorFlow GPU Performance with the TensorFlow Profiler,” found at tensorflow.org/guide/gpu_performance_analysis.