Distributed Training in Deep Learning using PyTorch: A Handy Tutorial
Hossein Chaghazardi & Iman Zadeh
Oracle Cloud Infrastructure, Artificial Intelligence Services (OCAS)
At A Glance.
Our intent in writing this short article is to provide:
- A summary of the theory behind distributed training in PyTorch
- Instructions to set up the environment (Python and GCC) needed for multi-node training (multiple machines, each with multiple Graphics Processing Units, GPUs)
- An example of training and evaluating a model in a distributed fashion using the NVIDIA APEX library
Since the introduction of AlexNet in 2012, the number of tunable parameters in deep neural networks has grown steadily. Training such massive networks requires many GPUs, and making use of multiple GPUs means running the training process in a distributed fashion. This tutorial is based on the NVIDIA APEX library, which works with PyTorch for training across multiple GPUs.
Using PyTorch for Distributed Training.
PyTorch has built-in packages that support distributed training. There are two approaches for running distributed training in PyTorch:
1. DataParallel, or 'DP', which is easy to implement, runs as a single CPU process with multiple threads, and can be used to train the model on a single node (one machine with multiple GPUs).
2. DistributedDataParallel, or 'DDP', which relies on third-party dependencies such as the NVIDIA APEX library, runs as multiple CPU processes (one per GPU), and can be used to train the model on multiple nodes (many machines with many GPUs).
DDP trains models faster than DP; however, it requires more changes to the single-GPU code, namely, changes to the model, the optimizer, and the backpropagation step. The good news, based on our experience, is that DDP can save a significant amount of training time by keeping all GPUs across multiple nodes at close to 100% memory utilization.
In the following paragraphs, we elaborate on how to use DP and DDP by providing an example for each method.
DataParallel (DP).
The code change required to implement DP is minimal. DP takes care of distributing both the data and the model across the GPUs. To create a model that supports DP, one simply wraps the model in the DataParallel Python class.
See below:
1. from torch.nn import DataParallel
2. # setting up optimizer, data, etc.
   #
   #
   #
3. model = ResNet50()
4. if torch.cuda.device_count() > 1:
5.     model = DataParallel(model)
6. model.cuda()
7. # training your model
8. #
Line 4 checks whether more than one GPU is available. If this condition passes, the model is wrapped in the DataParallel class imported in line 1. In line 6, the model object is transferred onto the GPUs; DP takes care of copying the model object to all of the available GPUs.
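For readers who want something they can run end to end, below is a minimal sketch of a complete DP training loop; the toy model, optimizer, and random data here are our own illustrative assumptions and are not part of the original example.

import torch
from torch import nn
from torch.nn import DataParallel

# toy model and random data, purely for illustration
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
if torch.cuda.device_count() > 1:
    model = DataParallel(model)  # replicate the model across all visible GPUs
model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for step in range(10):
    inputs = torch.randn(32, 128).cuda()           # DP splits this batch across the GPUs
    targets = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

DP splits each batch along the first dimension and scatters the chunks to the available GPUs, which is why no explicit data sharding is needed in this case.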
DistributedDataParallel (DDP).
Distributed training using DDP is multi-process, i.e., running the Python script that performs the model training spawns more than one process. A single process (the master process) is dedicated to controlling the communication between the GPUs, and each GPU runs its own process.
It should be noted that each GPU must be assigned a unique split of the training data. Splitting the data is what makes the training process faster than training in a non-distributed way.
Requirements.
NVIDIA APEX Library.
In our experience, the APEX library is the best choice among similar libraries for distributed training. In addition to supporting distributed training, APEX can switch the bit precision of computations from 32 to 16 bits. This capability is called Automatic Mixed Precision (AMP).
To install this library, you first have to install GCC. See the snippet below for installing the APEX library.
source venv/bin/activate  # activate your env
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
Make sure to install with the flags --global-option="--cpp_ext" --global-option="--cuda_ext" to take full advantage of the GPUs.
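As a quick sanity check (our own habit, not part of the official installation instructions), you can try importing the modules that we use later in this tutorial; if the command finishes without an import error, the installation succeeded:

(venv) $ python -c "from apex import amp; from apex.parallel import DistributedDataParallel"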
Example of a DDP Process.
Parsing the Local Rank.
The first change to the code for running in DDP fashion is to parse the "local_rank" argument passed on the command line:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", default=0, type=int)
args = parser.parse_args()
Next, the process group for the multiple subprocesses has to be initialized:
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend="nccl", init_method="env://")
Then, the model has to be transferred to the GPUs using '.cuda()' (or '.to()'). After that, we can initialize AMP. Once the model and optimizer are initialized, the DDP module wraps the model object:
from apex import amp
from apex.parallel import DistributedDataParallel

model.to(torch.device("cuda"))
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DistributedDataParallel(model, delay_allreduce=True)
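The snippet above covers the model and optimizer changes. As mentioned earlier, the backpropagation step changes as well once AMP is enabled; below is a minimal sketch of the training-loop change, assuming inputs, targets, and criterion are defined as in a standard loop:

optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, targets)
# scale the loss so that FP16 gradients do not underflow, then backpropagate
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()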
Finally, we run the training script from the command line using the command below:
(venv) $ python -m torch.distributed.launch --nproc_per_node={NUM_OF_GPU} driver.py -c {CONFIG_FILE}
Note: If you plan to run a multi-node training job, you should use the following command on each node:
python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} \
    --nnodes=${NUM_NODES} \
    --node_rank=${THIS_NODE_RANK} \
    --master_addr=${MASTER_IP} \
    --master_port=${MASTER_PORT} \
    driver.py -c ${CONFIG_FILE}
Here, "master_addr" is the address of the node where the "master" process that controls the other processes runs. Although we previously mentioned that the "local_rank" argument has to be parsed, we do not explicitly pass "local_rank" on the command line; "torch.distributed.launch" passes it to each of the processes.
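As a concrete illustration of how the launcher numbers the processes, consider a hypothetical job with 2 nodes and 4 GPUs per node:

# node 0 (--node_rank=0): local_rank 0..3  ->  global RANK 0..3
# node 1 (--node_rank=1): local_rank 0..3  ->  global RANK 4..7
# WORLD_SIZE = nnodes * nproc_per_node = 8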
Sharding (Splitting) Data.
It is important to remember that each GPU must be trained on a different portion of the data. If different GPUs were trained on the same data, there would be no speed gain. The advantage of distributed training is to leverage different GPUs to split the work, i.e., to split the data and the model. To split the data, we can restrict each process to use only a portion of it. For example, the function below shards the data based on the "RANK" environment variable, which is set for each GPU process when it is launched.
def _read_and_shard(self, annotation_path):
    self.shard_id = int(os.environ["RANK"])
    self.total_shards = int(os.environ["WORLD_SIZE"])
    total_df = pd.read_csv(annotation_path)
    start = int(self.shard_id * len(total_df) / self.total_shards)
    end = int((self.shard_id + 1) * len(total_df) / self.total_shards)
    print("shard id: {}, world: {}, start: {}, end: {}".format(
        self.shard_id, self.total_shards, start, end))
    sharded_df = total_df[start:end]
    return sharded_df
'RANK' is the index of the current GPU process, and 'WORLD_SIZE' is the total number of GPU processes. Together, these two variables determine which part of the annotation file a specific GPU should use.
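As an alternative to manual sharding, PyTorch's built-in DistributedSampler achieves the same effect. Below is a minimal sketch, assuming the process group has already been initialized and that train_dataset and num_epochs are defined elsewhere:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# the sampler reads the rank and world size from the process group and hands
# each process a non-overlapping subset of the dataset indices
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle with a different seed every epoch
    for inputs, targets in loader:
        # regular training step here
        ...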
Multiprocessing Advantage.
As discussed above, DDP uses multiprocessing, whereas DP uses multithreading. Using multiprocessing is an advantage for DDP because multithreading is slow in Python due to the Global Interpreter Lock (GIL). Multiprocessing does not suffer from the GIL, since each process has its own interpreter and memory space and therefore does not need a lock for thread safety.
If you are curious about other advantages of DDP over DP and about the other bottlenecks present in DP, we highly recommend reading this post, where the author elaborates on the steps of training with DP.
Summary.
In this short blog post, we discussed the two approaches that PyTorch provides for distributed training. We went over how each module is used and discussed their differences.
We also introduced the NVIDIA APEX library for PyTorch, which speeds up the distributed training.
References
- https://pytorch.org/tutorials/beginner/dist_overview.html
- https://towardsdatascience.com/distributed-neural-network-training-in-pytorch-5e766e2a9e62