Multi-worker mirrored strategy

The MultiWorkerMirroredStrategy is very similar to the MirroredStrategy. It implements synchronous distributed training across multiple workers, each with potentially multiple GPUs. Similar to MirroredStrategy, it creates copies of all variables in the model on each device across all workers.

If you’ve mastered single-host training and are looking to scale training even further, then adding multiple machines to your cluster can help you get an even greater performance boost. You can make use of a cluster of machines that are CPU-only, or that each have one or more GPUs.

Like its single-worker counterpart, MirroredStrategy, MultiWorkerMirroredStrategy is a synchronous data parallelism strategy that can be used with only a few code changes.

However, unlike MirroredStrategy, for a multi-worker setup TensorFlow needs to know which machines are part of the cluster. In most cases, this is specified with the environment variable TF_CONFIG.

In a simple TF_CONFIG example like the one sketched below, the “cluster” key contains a dictionary with the internal IPs and ports of all the machines.
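
As a minimal sketch, a TF_CONFIG for a two-worker cluster might look like the following; the IP addresses, ports, and task assignment here are placeholders, not values from the original example.

import json
import os

# Illustrative TF_CONFIG for a two-worker cluster; the IPs and ports are
# placeholders. Each machine gets the same "cluster" dictionary but a
# different "task" entry identifying its own role and index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["10.128.0.2:2222", "10.128.0.3:2222"]
    },
    "task": {"type": "worker", "index": 0}  # worker 0 also acts as the chief
})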

In MultiWorkerMirroredStrategy, all machines are designated as workers, which are the physical machines on which the replicated computation is executed.

In addition to each machine being a worker, there needs to be one worker that takes on some extra work such as saving checkpoints and writing summary files to TensorBoard.
This machine is known as the chief (or by its deprecated name, master).

Conveniently, when using AI Platform Training, the TF_CONFIG environment variable is set on each machine in your cluster, so there’s no need to worry about this setup!

As with any strategy in the tf.distribute module, step one is to create a strategy object.
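
As a minimal sketch, creating the strategy object might look like this, assuming TF_CONFIG has already been set on each machine:

import tensorflow as tf

# Step one: create the strategy object. TensorFlow reads TF_CONFIG from the
# environment to discover the other workers in the cluster.
strategy = tf.distribute.MultiWorkerMirroredStrategy()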

Step two is to wrap the creation of the model parameters within the scope of the strategy. This is crucial because it tells MultiWorkerMirroredStrategy which variables to mirror across the GPU devices.
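
A minimal sketch of this step, assuming a simple Keras model whose architecture here is just a placeholder:

# Step two: build and compile the model inside the strategy's scope so its
# variables are created as mirrored variables on every worker.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )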

And the third and final step is to scale the batch size by the number of replicas in the cluster. This ensures that each replica processes the same number of examples on each step. Since we’ve already covered training with MirroredStrategy, the previous steps should be familiar. The main difference when moving from synchronous data parallelism on one machine to many is that the gradients at the end of each step now need to be synchronized across all GPUs in a machine and across all machines in the cluster. This additional step of synchronizing across the machines increases the overhead of distribution.
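
Putting the third step into a minimal sketch, where the per-replica batch size of 64 is just an assumed value:

# Step three: scale the batch size by the number of replicas so that each
# replica still sees the same per-replica batch size at every step.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync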

With multi-worker mirrored strategy, the data needs to be sharded, meaning that each worker is assigned a subset of the entire dataset. At each step, a global batch size of non-overlapping dataset elements will therefore be processed across the workers. If autosharding is turned off, each replica processes every example in the dataset, which is not recommended. This sharding happens automatically with tf.data.experimental.AutoShardPolicy. By default, the policy is set to AUTO, which will shard your data depending on whether it is file-based or not.
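
If you ever need to set the policy explicitly, a hedged sketch using tf.data options looks like this; switching to DATA sharding here is purely an example, and the dataset is assumed to have been built earlier:

# Override the default AUTO policy and shard by individual dataset elements.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = (
    tf.data.experimental.AutoShardPolicy.DATA
)
dataset = dataset.with_options(options)  # assumes `dataset` was created earlier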

Saving the model is slightly more complicated in the multi-worker case because there need to be different destinations for each worker. The chief worker will save to the desired model directory, while the other workers will save the model to temporary directories. It’s important that these temporary directories are unique in order to prevent multiple workers from writing to the same location. Saving can contain collective ops, so all workers must save, not just the chief.
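
A sketch of this chief/worker save logic, assuming TF_CONFIG is set on each machine and that the model from the earlier sketch exists; the model directory below is a hypothetical destination:

import json
import os

def _is_chief(task_type, task_id):
    # Treat worker 0 as the chief when no dedicated 'chief' task exists.
    return task_type is None or task_type == "chief" or (
        task_type == "worker" and task_id == 0
    )

task = json.loads(os.environ.get("TF_CONFIG", "{}")).get("task", {})
task_type, task_id = task.get("type"), task.get("index", 0)

model_dir = "gs://your-bucket/saved_model"  # hypothetical destination
if _is_chief(task_type, task_id):
    save_path = model_dir
else:
    # Each non-chief worker writes to its own unique temporary directory.
    save_path = os.path.join("/tmp", f"workertemp_{task_id}")

# Every worker calls save, because saving can contain collective ops.
model.save(save_path)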