TensorFlow distributed training strategies

Distributed training is particularly useful for very large datasets, where training a model on a single hardware accelerator, such as one GPU, becomes difficult and often unrealistic.

TensorFlow’s distributed strategies make it easier to scale heavy training workloads across multiple hardware accelerators, whether GPUs or TPUs.

But in doing so, you may face challenges.

For example:

How will you distribute the model parameters across the different devices?

How will you accumulate the gradients during backpropagation?

And how will the model parameters be updated?
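
One way to picture these three questions is synchronous data parallelism, where every device keeps a full copy of the parameters. The toy sketch below uses plain Python rather than TensorFlow, and everything in it (the one-weight model y = w * x, the function names, the learning rate) is made up for illustration. It is not how TensorFlow implements any strategy internally.

```python
# Toy synchronous data-parallel training in plain Python (no TensorFlow).
# Model: y = w * x, loss = mean squared error. All names are illustrative.

def local_gradient(w, shard):
    """Gradient of the mean squared error w.r.t. w on one replica's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the same average."""
    return sum(values) / len(values)

def train(data, num_replicas=2, lr=0.05, steps=50):
    # 1. Parameters: every replica holds its own copy (mirror) of w.
    replicas = [0.0] * num_replicas
    # Each replica sees a different slice of the data.
    shards = [data[i::num_replicas] for i in range(num_replicas)]
    for _ in range(steps):
        # 2. Gradients: computed locally, then averaged across replicas.
        grads = [local_gradient(w, s) for w, s in zip(replicas, shards)]
        g = all_reduce_mean(grads)
        # 3. Update: the same averaged gradient is applied to every copy,
        #    so the mirrors stay identical.
        replicas = [w - lr * g for w in replicas]
    return replicas

data = [(float(x), 3.0 * x) for x in range(1, 5)]  # samples of y = 3x
weights = train(data)
print(weights)
```

Both copies of the weight end up at the same value because they start equal and receive the same averaged gradient at every step.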

tf.distribute.Strategy can help with these and other potential challenges. It’s a TensorFlow API for distributing training across multiple GPUs, multiple machines, or TPUs.
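
As a minimal sketch of what using the API can look like, the example below trains a tiny Keras model under tf.distribute.MirroredStrategy. The dataset, layer sizes, and batch size are invented for the example, and it assumes a TensorFlow 2.x installation; on a machine without GPUs the strategy should simply run with a single replica.

```python
import numpy as np
import tensorflow as tf

# MirroredStrategy replicates variables on every visible GPU of one machine;
# with no GPU available it runs with a single replica.
strategy = tf.distribute.MirroredStrategy()

# Scale the global batch with the number of replicas (a common convention).
per_replica_batch = 16  # illustrative value
global_batch = per_replica_batch * strategy.num_replicas_in_sync

# Tiny synthetic regression dataset, made up for this example.
x = np.random.rand(64, 4).astype("float32")
y = x.sum(axis=1, keepdims=True)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch)

# Variables created inside the scope are mirrored across replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="sgd", loss="mse")

# fit() splits each global batch across the replicas and aggregates gradients.
history = model.fit(dataset, epochs=1, verbose=0)
print("replicas in sync:", strategy.num_replicas_in_sync)
```

The only strategy-specific lines are the constructor and the `strategy.scope()` block; the rest is ordinary Keras code.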

TensorFlow offers several distributed training strategies. Four of the main ones are:

The mirrored strategy.

The multi-worker mirrored strategy.

The TPU strategy, and finally

The parameter server strategy.
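
The four strategies listed above correspond to classes in the tf.distribute module. The sketch below shows one hedged way to construct each of them. The helper `make_strategy` and its short string names are hypothetical, not part of TensorFlow, and the TPU and parameter server branches assume a suitably configured environment (a reachable TPU, or a cluster described in the TF_CONFIG environment variable), so only the mirrored branch is exercised here.

```python
import tensorflow as tf

def make_strategy(kind):
    """Build a strategy from a short, made-up name (hypothetical helper)."""
    if kind == "mirrored":
        # One machine, multiple GPUs, synchronous training.
        return tf.distribute.MirroredStrategy()
    if kind == "multi_worker":
        # Multiple machines; reads the cluster layout from TF_CONFIG.
        return tf.distribute.MultiWorkerMirroredStrategy()
    if kind == "tpu":
        # tpu="" is an assumption that suits some hosted environments;
        # other setups need a different address.
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        return tf.distribute.TPUStrategy(resolver)
    if kind == "parameter_server":
        # Workers compute gradients; parameter servers hold the variables.
        resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
        return tf.distribute.experimental.ParameterServerStrategy(resolver)
    raise ValueError(f"unknown strategy kind: {kind!r}")

strategy = make_strategy("mirrored")
print(type(strategy).__name__, strategy.num_replicas_in_sync)
```

Whichever strategy is returned, model-building code would typically go inside `with strategy.scope():`.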

Eduardo Avelar
