A PyTorch DDP Case Study With ImageNet
In this blog post, we will play about with neural networks on a dataset called ImageNet, to give some intuition on how these networks work. We will train them on Apocrita with DistributedDataParallel and show benchmarks to give you a guide on how many GPUs to use. This is a follow-on from a previous blog post, where we explained how to use DistributedDataParallel to speed up your neural network training with multiple GPUs.
ImageNet
ImageNet is a huge dataset containing millions of images, each labelled with what it contains, such as a bee, a bird, a dock, a saxophone or a strawberry. There are 1,000 labels in total, making ImageNet very useful for training neural networks to predict what an image contains. Figure 1 shows some images from the dataset.
To get started with ImageNet, PyTorch provides a wrapper class, ImageNet, which you can use in your PyTorch code. If you are running this locally, you will have to download the dataset yourself. It is free to download for non-commercial research and educational purposes, but do note it is hundreds of GB in size.
Using PyTorch ImageNet on Apocrita
Fortunately, ImageNet is available as a public dataset on Apocrita (see our docs) at /data/PublicDataSets/ImageNet-2012/ILSVRC2012. Both the uncompressed images and the metadata are provided.
To instantiate an ImageNet object in Python, use
dataset = torchvision.datasets.ImageNet("/data/PublicDataSets/ImageNet-2012/ILSVRC2012")
Of course, you may instead assign the path name to a variable and pass it, along with further arguments, to ImageNet().
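As a fuller sketch, assuming you want the validation split wrapped in a DataLoader (the preprocessing sizes and DataLoader settings below are illustrative choices, not anything prescribed by the dataset):

```python
import torch
import torchvision
from torchvision import transforms

# Public copy of ImageNet on Apocrita (see the docs linked above)
root = "/data/PublicDataSets/ImageNet-2012/ILSVRC2012"

# A typical preprocessing pipeline; the resize/crop sizes are illustrative
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# split="val" selects the validation images; use split="train" for training
dataset = torchvision.datasets.ImageNet(root, split="val", transform=preprocess)
print(len(dataset))  # 50,000 validation images

# Iterate over the images in mini-batches
loader = torch.utils.data.DataLoader(dataset, batch_size=32, num_workers=4)
```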
Let's Play in the Playground
The neural network AlexNet was one of the early networks trained on ImageNet, back in 2012. It's a suitable network to play with: it's small enough to fit onto a commercial-grade GPU, and it's far enough from the cutting edge to make some interesting mistakes.
PyTorch has its own implementation of AlexNet, which includes pre-trained weights, meaning the network has already been trained for us and can make predictions out of the box.
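As a rough sketch of what this looks like in code, assuming a recent torchvision with the weights API, the snippet below classifies the first image of the validation split and prints the top five labels with their certainties:

```python
import torch
from torchvision import datasets, models

# Load AlexNet with the ImageNet pre-trained weights shipped with torchvision
weights = models.AlexNet_Weights.IMAGENET1K_V1
model = models.alexnet(weights=weights).eval()

# The weights bundle a matching preprocessing transform and the class names
preprocess = weights.transforms()

# Take the first image from the public validation split on Apocrita
dataset = datasets.ImageNet("/data/PublicDataSets/ImageNet-2012/ILSVRC2012", split="val")
image, label = dataset[0]

with torch.no_grad():
    probabilities = model(preprocess(image).unsqueeze(0)).softmax(dim=1)

# Print the five most likely labels and the network's certainty in each
top5 = probabilities.topk(5)
for certainty, index in zip(top5.values[0], top5.indices[0]):
    print(f"{weights.meta['categories'][int(index)]}: {certainty.item():.3f}")
```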
It is important to note that ImageNet is split into a training set and a validation set. The training set is used to train the neural network, whereas the validation set is used to assess the neural network's predictions on unseen images. This means that predictions on images from the training set are usually good, as the network has seen them before, as illustrated in Figure 2.
More interestingly, the predictions on images from the validation set won't be as good but they are more realistic assessments, as illustrated in Figure 3.
Even more interesting is that we can use AlexNet to predict unlabelled photos from my phone, as shown in Figure 4.
As shown in the figure, AlexNet did well labelling the photo of the ruins of Swansea Castle as a castle. However, it does not generalise well to images and labels outside the training set. For example, it struggled to notice that there was a hot dog underneath that cheese blanket, mistook a crocheted bee for a teddy bear and probably misjudged the lines on an archery target as the walls of a maze. These are fair mistakes; for example, most catering staff would melt the cheese on a hot dog rather than slapping it on cold as an afterthought.
We've only shown a sample of images, but in research you would do a large-scale assessment using tens of thousands of images. The most basic metric is the 0-1 loss, which scores each prediction as simply right or wrong; one minus its average over the dataset gives the accuracy, the proportion of images the neural network predicts correctly. However, neural networks are also capable of producing more than one prediction, each with a quantified certainty, so you can use metrics such as the cross-entropy loss to quantify how good the predictions are with respect to those certainties.
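As a minimal sketch of these two metrics in PyTorch, with random tensors standing in for real model outputs and labels:

```python
import torch
import torch.nn.functional as F

def evaluate_batch(logits: torch.Tensor, labels: torch.Tensor):
    # Accuracy (one minus the 0-1 loss): compare the top prediction with the true label
    accuracy = (logits.argmax(dim=1) == labels).float().mean()

    # Cross-entropy penalises confident wrong predictions more than hesitant ones
    cross_entropy = F.cross_entropy(logits, labels)
    return accuracy, cross_entropy

# Toy usage: a batch of 8 random "predictions" over the 1,000 ImageNet classes
logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(evaluate_batch(logits, labels))
```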
With a bit of tinkering, you can also see the filtered images in the middle of the neural network. Figure 5 shows the 64 images after the first convolution layer in AlexNet when giving the network the photo of a crocheted bee. You can see that the network picks out patterns of the crocheted bee, such as its glasses, the stitching pattern and the spherical shape of the bee. These patterns are fed down the network to make its prediction.
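One way to do this tinkering is with a forward hook. The sketch below captures the output of AlexNet's first convolution layer, using a random tensor in place of a real preprocessed photo:

```python
import torch
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

# Store the output of the first convolution layer when the network runs
activations = {}

def save_activation(module, inputs, output):
    activations["first_conv"] = output.detach()

# In torchvision's AlexNet, features[0] is the first Conv2d with 64 filters
model.features[0].register_forward_hook(save_activation)

# A placeholder input; in practice this would be the preprocessed photo
batch = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    model(batch)

# 64 feature maps, one per filter, which can be plotted as greyscale images
print(activations["first_conv"].shape)  # torch.Size([1, 64, 55, 55])
```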
Benchmarks
In this section, we benchmark training AlexNet and ConvNeXt, a more cutting-edge network, on ImageNet using the GPUs on Apocrita. For AlexNet, we train for 90 epochs with batch size 32; for ConvNeXt, 45 epochs with batch size 64. We chose these relatively low numbers of epochs to ensure training finishes in the order of days.
We used stochastic gradient descent with a decaying learning rate for optimisation (see the sketch after this list for how these settings translate into PyTorch). The specifications are:
- Learning rate: 0.01
- Momentum factor: 0.9
- Weight decay (L2 penalty): 0.0001
- Period of learning rate decay: 30
- Multiplicative factor of learning rate decay: 0.1
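A minimal sketch of how these settings translate into PyTorch, with AlexNet standing in for either model and the DistributedDataParallel wrapping and training loop omitted:

```python
import torch
from torchvision import models

model = models.alexnet()  # or a ConvNeXt model; DDP wrapping is omitted here

# Stochastic gradient descent with the settings listed above
optimiser = torch.optim.SGD(
    model.parameters(),
    lr=0.01,            # learning rate
    momentum=0.9,       # momentum factor
    weight_decay=1e-4,  # weight decay (L2 penalty)
)

# Decay the learning rate by a factor of 0.1 every 30 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimiser, step_size=30, gamma=0.1)

# for epoch in range(num_epochs):
#     train_one_epoch(model, optimiser, ...)  # hypothetical training loop
#     scheduler.step()
```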
We ran the training on different numbers of A100 and H100 GPUs to investigate
the uplift when using more GPUs with DistributedDataParallel
. For consistency,
we used the A100 cards on the rdg
nodes and the H100 cards on the xdg
nodes. The results for AlexNet and
ConvNeXt are shown in Figures 6 and 7 respectively.
The results show that using more GPUs with DistributedDataParallel does scale linearly. For example, you can halve the time it takes to train a neural network by going from one GPU to two. However, you would need another two GPUs to halve the time again, so each extra GPU saves less wall-clock time than the last, giving diminishing returns. Those extra GPUs may be better off used by someone else, and you may also benefit from shorter queueing times when requesting fewer GPUs.
We noted that, for AlexNet, training is slower on the newer H100 cards compared to the older A100 cards! It's the opposite story for ConvNeXt, where the code is twice as fast on the H100 cards compared to the A100 cards. One possible reason is that AlexNet is a very old model and isn't large enough to make use of modern enterprise-grade cards. This can be observed using nvtop to look at the GPU utilisation during training, as shown in Figure 8. On the A100 cards, AlexNet only reaches about 25% GPU utilisation, whereas ConvNeXt sits at 100% GPU utilisation consistently, pushing the GPU to work hard.
It should be noted that comparisons between GPU models can be unfair because the
rdg
nodes have a different CPU compared to the xdg
nodes. This may affect
the benchmarks. However, for ConvNeXt, most of the computation time is spent on the GPU, so a different CPU would have a smaller effect than changing the GPU.
We also found that ConvNeXt reaches a validation accuracy of around 60%, better than AlexNet's roughly 45%, despite ConvNeXt being trained for fewer epochs. This could well be higher if we continued training ConvNeXt for many more epochs.
Conclusion
Neural networks are not sentient beings, but rather lots of mathematical functions looking for patterns. We can see this by looking at the sorts of mistakes they make and at what happens in the middle of the network.
We've found that DistributedDataParallel does scale with more GPUs on Apocrita, but with diminishing returns: our code runs twice as fast with two GPUs compared to one, but needs another two GPUs for another twofold speed-up. We've also observed a performance uplift from newer cards only when the problem has high enough throughput; in our case, that means larger neural networks.
These findings should help you decide how many GPUs, and which model of GPU, to request for your job. For example, using a quick benchmark with one epoch and a single GPU, you can make a back-of-the-envelope estimate of how long training your neural network would take on one, two, three or four GPUs. Using more GPUs will speed up your code up to some reasonable point, but you may have shorter queueing times if you request fewer GPUs.
It's also important to make sure you get the most out of the GPUs and run them hard; otherwise, your performance may degrade, as we saw when we trained AlexNet on the newer H100 cards. My suggestion is to monitor your GPU jobs using nvtop and see how hard you are pushing the GPUs. If you find your GPU utilisation is low, consider using older cards for shorter queueing times, or modifying your problem to push the GPU harder.
This experiment could be extended to multinode GPU systems but I anticipate the benchmarks would not scale as well. This is because communication between nodes is subject to higher latency and lower bandwidth compared to within a node.
The source code to run the training is available on my GitHub.