Why distributed training is needed

In this module, we’ll explore how to run a distributed training job with TensorFlow.

We’ll begin with understanding why distributed training is needed. After that, we’ll explore distributed training architectures. Then lastly, we’ll provide an overview of TensorFlow distributed training strategies.


Deep learning works because datasets are large.

Notice that the x-axis here is logarithmic.

For every doubling in the size of the data, the error rate falls by a roughly constant amount – a straight line on this logarithmic scale.


A more complex model also helps – that is the jump from the blue line to the orange line – but more data is even more helpful in this situation.

As a consequence of both of these trends – larger datasets and more complex models – the compute required to build state-of-the-art models has grown over time.

This growth is exponential as well.

Each y-axis tick on this graph shows a 10x increase in computational need.

AlexNet, which started the deep learning revolution in 2012, required less than 0.01 petaflop/s-days of compute for training.

By the time you get to Neural Architecture Search, the learn-to-learn model published by Google in 2017, you need about 100 petaflop/s-days, or roughly 10,000x more compute than you needed for AlexNet.
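That ratio is easy to sanity-check with quick arithmetic (a back-of-the-envelope sketch using the approximate figures quoted above, not exact measurements):

```python
# Back-of-the-envelope check of the compute figures quoted above.
# Both numbers are the approximate values cited in this module.
alexnet_compute = 0.01   # petaflop/s-days (upper bound quoted for AlexNet)
nas_compute = 100.0      # petaflop/s-days (figure quoted for Neural Architecture Search)

ratio = nas_compute / alexnet_compute
print(f"NAS used roughly {ratio:,.0f}x the compute of AlexNet")  # 10,000x

# One petaflop/s-day expressed in raw floating-point operations:
# 10**15 operations per second, sustained for one day.
flops_per_pfs_day = 1e15 * 24 * 3600
print(f"1 petaflop/s-day = {flops_per_pfs_day:.2e} FLOPs")
```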

This growth in model complexity and data size means that distributed systems are pretty much a necessity when it comes to machine learning.


Training complex networks with large amounts of data can often take a long time.

This graph shows training time on the x-axis plotted against the accuracy of predictions on the y-axis, when training an image recognition model on a GPU.


As the dotted line shows, it took around 80 hours to reach 75% accuracy.

  • If your training takes a few minutes to a few hours, it will make you productive and happy, and you can try out different ideas fast.
  • If the training takes a few days, you could still deal with that by running a few ideas in parallel.
  • If the training starts to take a week or more, your progress will slow down because you can’t try out new ideas quickly.
  • And if it takes more than a month… well, that’s probably not even worth thinking about!

And this is no exaggeration.

Training deep neural networks such as ResNet50 can take up to a week on one GPU.

A natural question to ask is: how can you make training faster?

  • You can use a more powerful device, such as a GPU or TPU (an accelerator).

  • You can optimize your input pipeline.
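To make that option concrete, here is what an optimized input pipeline might look like with tf.data (a minimal sketch using synthetic in-memory data; a real pipeline would typically read from files such as TFRecords):

```python
import tensorflow as tf

def make_dataset(batch_size=32):
    # Synthetic images and labels stand in for a real dataset here.
    images = tf.random.uniform((256, 28, 28, 1))
    labels = tf.random.uniform((256,), maxval=10, dtype=tf.int32)

    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.shuffle(buffer_size=256)
    # AUTOTUNE lets tf.data pick the degree of parallelism for preprocessing.
    ds = ds.map(lambda x, y: (x / 255.0, y),
                num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)
    # Prefetching overlaps data preparation with model execution,
    # so the accelerator is not left idle waiting for input.
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds
```

The key calls are `num_parallel_calls=tf.data.AUTOTUNE`, which parallelizes preprocessing, and `prefetch`, which keeps the next batch ready while the current one trains.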


Or, you can try out distributed training.
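As a preview of what that looks like in code, here is a minimal sketch using tf.distribute.MirroredStrategy, TensorFlow’s synchronous data-parallel strategy for a single machine with one or more GPUs (the model and data here are hypothetical placeholders, not part of this module’s examples):

```python
import tensorflow as tf

# MirroredStrategy replicates the model on each available GPU and keeps
# the copies in sync. On a machine with no GPU it simply runs with a
# single replica, so this sketch works anywhere.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas in sync:", strategy.num_replicas_in_sync)

# Variables (model weights, optimizer state) must be created inside the
# strategy's scope so they are mirrored across replicas.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Synthetic data stands in for a real image dataset.
images = tf.random.uniform((128, 28, 28, 1))
labels = tf.random.uniform((128,), maxval=10, dtype=tf.int32)

# model.fit splits each batch across the replicas automatically.
model.fit(images, labels, batch_size=32, epochs=1, verbose=0)
```

The rest of this module looks at distributed training architectures and the tf.distribute strategies in more detail.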