In this blog post, we will play about with neural networks on a dataset called
ImageNet, to give some intuition on how these neural networks work. We will
train them on Apocrita with
DistributedDataParallel
and show benchmarks to give you a guide on how many GPUs to use. This is a
follow-on from a previous blog post where we explained how to
use DistributedDataParallel to speed up your neural network training with
multiple GPUs.
The delivery of new GPUs for research is continuing, most notably with the new
Isambard-AI cluster at
Bristol. As new cutting-edge GPUs are released, software engineers are
tasked with keeping up with the new architectures and features these
GPUs offer.
The new Grace-Hopper GH200 nodes, as announced in a previous blog
post, consist of a 72-core NVIDIA Grace CPU and an
H100 Tensor Core GPU. One of the key innovations is the NVIDIA NVLink
Chip-2-Chip (C2C) interconnect with unified memory, which allows data to be
transferred between the CPU and GPU quickly, seamlessly and automatically. It
also allows the GPU memory to be oversubscribed, so the GPU can handle data
much larger than it can hold, potentially tackling out-of-GPU-memory problems.
This lets software engineers focus on implementing algorithms without having
to think too much about memory management.
This blog post will demonstrate manual GPU memory management and introduce
managed and unified memory, with simple examples to illustrate their benefits.
We'll try to keep this at an introductory level, but the blog does assume basic
knowledge of C++, CUDA and compiling with nvcc.
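As a taster of the idea, here is a minimal Python sketch using CuPy's
managed-memory allocator as a stand-in; the post itself works through CUDA C++
examples compiled with nvcc, so treat this only as an illustration of what
unified memory buys you.

```python
# Minimal sketch: back CuPy allocations with CUDA managed (unified) memory,
# so arrays are paged between host and device on demand rather than copied
# explicitly with cudaMemcpy-style calls.
import cupy as cp

# Route all subsequent CuPy allocations through cudaMallocManaged via a pool.
pool = cp.cuda.MemoryPool(cp.cuda.malloc_managed)
cp.cuda.set_allocator(pool.malloc)

# This array lives in unified memory: the driver migrates pages as needed.
x = cp.arange(1_000_000, dtype=cp.float32)
y = (x * 2.0).sum()   # computed on the GPU
print(float(y))       # result migrated back transparently
```

Because every array is now backed by managed memory, allocations can exceed
the physical GPU memory and the driver pages data in and out on demand, which
is exactly where the GH200's fast NVLink C2C link pays off.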
In this blog post, we explore what
torchrun and
DistributedDataParallel
are and how they can be used to speed up your neural network training across
multiple GPUs.
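As a taster, here is a minimal sketch of the pattern; the toy model, data and
hyperparameters are placeholders, not the setup used in the posts themselves.

```python
# Minimal sketch of a DistributedDataParallel training loop, intended to be
# launched with torchrun (e.g. `torchrun --nproc_per_node=4 train.py`).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model as a placeholder; wrapping it in DDP makes each backward pass
    # average gradients across all participating GPUs.
    model = torch.nn.Linear(100, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimiser = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 100, device=local_rank)
        target = torch.randint(0, 10, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), target)
        optimiser.zero_grad()
        loss.backward()   # DDP all-reduces the gradients here
        optimiser.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launching this with `torchrun --nproc_per_node=4 train.py` on a four-GPU node
starts one worker process per GPU, each running the same script.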
We still encounter jobs on the HPC cluster
that try to use all the cores on the node on which they're running, regardless
of how many cores they requested, leading to node alarms. Sometimes, jobs try
to use exactly twice or one-and-a-half times the allocated cores, or even that
number squared. This was a little perplexing at first. In your enthusiasm to
parallelise your code, make sure someone else hasn't already done so.
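The usual culprit is a numerical library quietly spawning one thread per
physical core. One pragmatic guard is to cap the common thread-pool
environment variables to the cores you were actually given; the sketch below
assumes a Grid Engine style NSLOTS variable, and other schedulers expose the
allocation differently.

```python
# Sketch: cap implicit thread pools to the cores the scheduler allocated,
# assuming a Grid Engine style NSLOTS variable (adjust for your scheduler).
import os

allocated = int(os.environ.get("NSLOTS", "1"))

# These must be set before importing numpy and friends, because BLAS/OpenMP
# thread pools are sized when the library is first loaded.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ.setdefault(var, str(allocated))

import numpy as np  # now uses at most `allocated` threads for BLAS calls

a = np.random.rand(2000, 2000)
b = a @ a  # runs on the allocated cores only, not every core on the node
```

If the environment-variable route is too blunt, the threadpoolctl package
offers finer, runtime control over the same pools.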
In a previous blog, we discussed how we could use multiprocessing and
mpi4py together to make use of multiple nodes of GPUs. We will cover some
machine learning principles and two examples of pleasingly parallel machine
learning problems. These are also known as embarrassingly parallel problems,
but I prefer to call them pleasingly parallel because there isn't anything
embarrassing about designing your problem to run in parallel. When doing so,
you can launch very similar functions on each GPU and collate their results
when needed.
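As a hedged illustration of that pattern (run_experiment is a made-up
placeholder for whatever each GPU should do), each MPI rank works on its own
piece and the results are gathered at the end:

```python
# Sketch: a pleasingly parallel workload with mpi4py, one rank per task.
# `run_experiment` is a placeholder for your real per-GPU workload.
from mpi4py import MPI

def run_experiment(seed):
    # Placeholder: train a model, run a simulation, score a fold, ...
    # On a GPU node, each rank would typically pick device rank % gpus_per_node.
    return seed ** 2

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank handles one slice of the work, e.g. one random seed per rank.
result = run_experiment(seed=rank)

# Collate the results on rank 0 when they are needed.
results = comm.gather(result, root=0)
if rank == 0:
    print(f"collected {size} results:", results)
```

Launched with something like `mpirun -np 8 python script.py`, this gives eight
independent workers whose results all arrive back on rank 0.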
Using multiple GPUs is one option to speed up your code. On Apocrita, we have
V100, A100 and H100 GPUs available, with up to 4 GPUs per node. On other compute
clusters, JADE2 has 8 V100 GPUs per node and
Sulis has 3 A100 GPUs per node. If your problem
is pleasingly parallel, you can distribute identical or similar tasks to each
GPU on a node, or even on multiple nodes.
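Within a single node, the same idea can be sketched with multiprocessing, one
worker per GPU; process_chunk is again a made-up placeholder for the real
workload.

```python
# Sketch: one worker process per GPU on a single node, using multiprocessing.
# `process_chunk` stands in for your real per-GPU computation.
import multiprocessing as mp
import torch

def process_chunk(gpu_id, chunk):
    device = torch.device(f"cuda:{gpu_id}")
    data = torch.as_tensor(chunk, dtype=torch.float32, device=device)
    return float(data.sum())  # stand-in for the real computation

def main():
    n_gpus = torch.cuda.device_count()
    chunks = [list(range(i * 10, (i + 1) * 10)) for i in range(n_gpus)]
    # "spawn" avoids inheriting the parent's CUDA state in forked children.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=n_gpus) as pool:
        results = pool.starmap(process_chunk, enumerate(chunks))
    print(results)

if __name__ == "__main__":
    main()
```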
Nowadays, there seems to be an R package for anything and everything. While this
makes starting a project in R seem quick and easy, there are some
considerations which, if taken into account, will make your life easier in the
long run.
Living with Machines is a funded project at
The Alan Turing Institute (aka the Turing),
bringing together academics from different disciplines to answer research
questions such as how historical newspapers portrayed the political landscape,
how accidents in factories were reported, how road and settlement names
changed, and how people changed occupations during the Industrial Revolution...
There are many strategies and tools for improving the performance of Python
code; for a comprehensive treatment, see
High Performance Python
by Gorelick and Ozsvald (institutional access
is available to QM staff). However, there are some subtleties when using them
in an HPC environment. More bluntly, requesting processor cores does not
automatically mean your code will use them effectively, and that cannot happen
if it doesn't know how many of them there are!
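For example, os.cpu_count() reports every core on the node rather than your
share of it. A sketch of how a job might discover its real allocation,
assuming a Grid Engine style NSLOTS variable with a fallback to the Linux
affinity mask, looks like this:

```python
# Sketch: find out how many cores this job was actually given, rather than
# how many the node has, and size a worker pool accordingly.
import os
import multiprocessing as mp

def allocated_cores():
    # Scheduler hint first (NSLOTS for Grid Engine, SLURM_CPUS_PER_TASK for
    # Slurm), then the CPU affinity mask, then the whole machine as a fallback.
    for var in ("NSLOTS", "SLURM_CPUS_PER_TASK"):
        if var in os.environ:
            return int(os.environ[var])
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count()

def work(i):
    return i * i  # placeholder task

if __name__ == "__main__":
    n = allocated_cores()
    with mp.Pool(processes=n) as pool:  # not os.cpu_count()!
        print(sum(pool.map(work, range(100))))
```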
As the complexity of HPC applications increases, the management of memory
and threading scopes becomes increasingly important. Tools like Intel
Inspector are crucial in this context, helping to identify and resolve
a wide array of memory errors and thread synchronisation issues.