A PyTorch DDP Case Study With ImageNet
In this blog post, we will play about with neural networks on a dataset called ImageNet, to give some intuition on how these networks work. We will train them on Apocrita with DistributedDataParallel and show benchmarks to give you a guide on how many GPUs to use. This is a follow-on from a previous blog post, where we explained how to use DistributedDataParallel to speed up your neural network training with multiple GPUs.
ImageNet
ImageNet is a huge dataset containing millions of images, each labelled with what it contains, such as a bee, a bird, a dock, a saxophone or a strawberry. There are 1,000 labels in total, making ImageNet very useful for training neural networks to predict what an image contains. Figure 1 shows some images from the dataset.
To get started with ImageNet, PyTorch provides a wrapper class, ImageNet, which you can use in your PyTorch code. If you are running this locally, you will have to download the dataset yourself. It is free to download for non-commercial research and educational purposes, but do note it is hundreds of GB in size.
Using PyTorch ImageNet on Apocrita
Fortunately, ImageNet is available as a public dataset on Apocrita (see our docs) at /data/PublicDataSets/ImageNet-2012/ILSVRC2012. Both the uncompressed images and the metadata are provided.
To instantiate an ImageNet object in Python, use
dataset = torchvision.datasets.ImageNet("/data/PublicDataSets/ImageNet-2012/ILSVRC2012")
Of course, you may instead assign the path name to a variable and pass it, along with further arguments, to ImageNet().
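As a fuller sketch, assuming you want the validation split wrapped in a DataLoader (the preprocessing sizes and DataLoader settings below are illustrative choices, not anything prescribed by the dataset):

```python
import torch
import torchvision
from torchvision import transforms

# Public copy of ImageNet on Apocrita (see the docs linked above)
root = "/data/PublicDataSets/ImageNet-2012/ILSVRC2012"

# A typical preprocessing pipeline; the resize/crop sizes are illustrative
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# split="val" selects the validation images; use split="train" for training
dataset = torchvision.datasets.ImageNet(root, split="val", transform=preprocess)
print(len(dataset))  # 50,000 validation images

# Iterate over the images in mini-batches
loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4)
```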
Let's Play in the Playground
The neural network AlexNet was one of the early networks trained on ImageNet, back in 2012. It's a suitable network to play with: it's small enough to fit onto a commercial-grade GPU, and it's far enough from the cutting edge to make some interesting mistakes.
PyTorch has its own implementation of AlexNet, which includes pre-trained weights, meaning the network has already been trained for us and can make predictions out of the box.
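As a rough sketch of what this looks like in code, assuming a recent torchvision with the weights API, the snippet below classifies the first image of the validation split and prints the top five labels with their certainties:

```python
import torch
from torchvision import datasets, models

# Load AlexNet with the ImageNet pre-trained weights shipped with torchvision
weights = models.AlexNet_Weights.IMAGENET1K_V1
model = models.alexnet(weights=weights).eval()

# The weights bundle a matching preprocessing transform and the class names
preprocess = weights.transforms()

# Take the first image from the public validation split on Apocrita
dataset = datasets.ImageNet("/data/PublicDataSets/ImageNet-2012/ILSVRC2012", split="val")
image, label = dataset[0]

with torch.no_grad():
    probabilities = model(preprocess(image).unsqueeze(0)).softmax(dim=1)

# Print the five most likely labels and the network's certainty in each
top5 = probabilities.topk(5)
for certainty, index in zip(top5.values[0], top5.indices[0]):
    print(f"{weights.meta['categories'][int(index)]}: {certainty.item():.3f}")
```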
It is important to note that ImageNet is split into a training set and a validation set. The training set is used to train the neural network, whereas the validation set is used to assess the neural network's predictions on unseen images. This means that predictions on images from the training set are usually good, as the network has seen them before, as illustrated in Figure 2.
More interestingly, the predictions on images from the validation set won't be as good but they are more realistic assessments, as illustrated in Figure 3.
Even more interesting is that we can use AlexNet to predict unlabelled photos from my phone, as shown in Figure 4.
As shown in the figure, AlexNet did well labelling the photo of the ruins of Swansea Castle as a castle. However, it does not generalise well to images and labels outside the training set. For example, it struggled to notice that there was a hot dog underneath that cheese blanket, mistook a crocheted bee for a teddy bear and probably misjudged the lines on an archery target as the walls of a maze. These are fair mistakes; for example, most catering staff would melt the cheese on a hot dog rather than slapping it on cold as an afterthought.
We've only shown a sample of images, but in research you would do a large-scale assessment using tens of thousands of images. The most basic metric is the 0-1 loss, which scores each prediction as simply right or wrong; one minus its average over the dataset gives the accuracy, the proportion of images the neural network predicts correctly. However, neural networks are also capable of producing more than one prediction, each with a quantified certainty, so you can use metrics such as the cross-entropy loss to quantify how good the predictions are with respect to those certainties.
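As a minimal sketch of these two metrics in PyTorch, with random tensors standing in for real model outputs and labels:

```python
import torch
import torch.nn.functional as F

def evaluate_batch(logits: torch.Tensor, labels: torch.Tensor):
    # Accuracy (one minus the 0-1 loss): compare the top prediction with the true label
    accuracy = (logits.argmax(dim=1) == labels).float().mean()

    # Cross-entropy penalises confident wrong predictions more than hesitant ones
    cross_entropy = F.cross_entropy(logits, labels)
    return accuracy, cross_entropy

# Toy usage: a batch of 8 random "predictions" over the 1,000 ImageNet classes
logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(evaluate_batch(logits, labels))
```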
With a bit of tinkering, you can also see the filtered images in the middle of the neural network. Figure 5 shows the 64 images after the first convolution layer in AlexNet when giving the network the photo of a crocheted bee. You can see that the network picks out patterns of the crocheted bee, such as its glasses, the stitching pattern and the spherical shape of the bee. These patterns are fed down the network to make its prediction.
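One way to do this tinkering is with a forward hook. The sketch below captures the output of AlexNet's first convolution layer, using a random tensor in place of a real preprocessed photo:

```python
import torch
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

# Store the output of the first convolution layer when the network runs
activations = {}

def save_activation(module, inputs, output):
    activations["first_conv"] = output.detach()

# In torchvision's AlexNet, features[0] is the first Conv2d with 64 filters
model.features[0].register_forward_hook(save_activation)

# A placeholder input; in practice this would be the preprocessed photo
batch = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(batch)

# 64 feature maps, one per filter, which can be plotted as greyscale images
print(activations["first_conv"].shape)  # torch.Size([1, 64, 55, 55])
```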
Benchmarks
In this section, we benchmark training AlexNet and ConvNeXt, a more cutting-edge network, on ImageNet using the GPUs on Apocrita. For AlexNet, we train for 90 epochs with batch size 32; for ConvNeXt, 45 epochs with batch size 64. We chose these relatively low numbers of epochs to ensure training finishes in the order of days.
We used stochastic gradient descent with a decaying learning rate for optimisation (see the sketch after this list for how these settings translate into PyTorch). The specifications are:
- Learning rate: 0.01
- Momentum factor: 0.9
- Weight decay (L2 penalty): 0.0001
- Period of learning rate decay: 30
- Multiplicative factor of learning rate decay: 0.1
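A minimal sketch of how these settings translate into PyTorch, with AlexNet standing in for either model and the DistributedDataParallel wrapping and training loop omitted:

```python
import torch
from torchvision import models

model = models.alexnet()  # or a ConvNeXt model; DDP wrapping is omitted here

# Stochastic gradient descent with the settings listed above
optimiser = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate
    momentum=0.9,       # momentum factor
    weight_decay=1e-4,  # weight decay (L2 penalty)
)

# Decay the learning rate by a factor of 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=30, gamma=0.1)

# for epoch in range(num_epochs):
#     train_one_epoch(model, optimiser, ...)  # hypothetical training loop
#     scheduler.step()
```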
We ran the training on different numbers of A100 and H100 GPUs to investigate
the uplift when using more GPUs with DistributedDataParallel
. For consistency,
we used the A100 cards on the rdg
nodes and the H100 cards on the xdg
nodes. The results for AlexNet and
ConvNeXt are shown in Figures 6 and 7 respectively.
The results show that using more GPUs with DistributedDataParallel does scale linearly. For example, you can halve the time it takes to train a neural network by going from one GPU to two. However, you would need another two GPUs to halve the time again, so each extra GPU saves less wall-clock time than the last, giving diminishing returns. Those extra GPUs may be better off used by someone else, and you may also benefit from shorter queueing times when requesting fewer GPUs.
We noted that, for AlexNet, training is slower on the newer H100 cards compared to the older A100 cards! It's the opposite story for ConvNeXt, where the code is twice as fast on the H100 cards compared to the A100 cards. One possible reason is that AlexNet is a very old model and isn't large enough to make use of modern enterprise-grade cards. This can be observed using nvtop to look at the GPU utilisation during training, as shown in Figure 8. On the A100 cards, AlexNet only reaches about 25% GPU utilisation, whereas ConvNeXt sits at 100% GPU utilisation consistently, pushing the GPU to work hard.
It should be noted that comparisons between GPU models can be unfair because the
rdg
nodes have a different CPU compared to the xdg
nodes. This may affect
the benchmarks. However, for ConvNeXt, most of the computation time is spent on the GPU, so a different CPU would have a smaller effect than changing the GPU.
We also found that ConvNeXt reaches a validation accuracy of around 60%, better than AlexNet's roughly 45%, despite ConvNeXt being trained for fewer epochs. This could well be higher if we continued training ConvNeXt for many more epochs.
Conclusion
Neural networks are not sentient beings, but rather lots of mathematical functions looking for patterns. We can see this by looking at the sorts of mistakes they make and at what happens in the middle of the network.
We've found that DistributedDataParallel does scale with more GPUs on Apocrita, but with diminishing returns: our code runs twice as fast with two GPUs compared to one, but needs another two GPUs for another twofold speed-up. We've also observed a performance uplift from newer cards only when the problem has high enough throughput; in our case, that means larger neural networks.
These findings should help you decide how many GPUs, and which model of GPU, to request for your job. For example, using a quick benchmark with one epoch and a single GPU, you can make a back-of-the-envelope estimate of how long training your neural network would take on one, two, three or four GPUs. Using more GPUs will speed up your code up to some reasonable point, but you may have shorter queueing times if you request fewer GPUs.
It's also important to make sure you get the most out of the GPUs and run them hard; otherwise, your performance may degrade, as we saw when we trained AlexNet on the newer H100 cards. My suggestion is to monitor your GPU jobs using nvtop and see how hard you are pushing the GPUs. If you find your GPU utilisation is low, consider using older cards for shorter queueing times, or modifying your problem to push the GPU harder.
This experiment could be extended to multinode GPU systems but I anticipate the benchmarks would not scale as well. This is because communication between nodes is subject to higher latency and lower bandwidth compared to within a node.
The source code to run the training is available on my GitHub.