2024 has been a productive year for HPC outreach and education across the
schools at Queen Mary University of London.
We have formed alliances with managers and PIs from various schools
within the University who understand the value that HPC can add to their
scientific research. We are pleased to share our latest events in 2024:
New GPUs for research continue to be delivered, most notably in the new
Isambard-AI cluster at
Bristol. As cutting-edge GPUs are released, software engineers are
tasked with keeping up with the new architectures and features these
GPUs offer.
The new Grace-Hopper GH200 nodes, as announced in a previous blog
post, consist of a 72-core NVIDIA Grace CPU and an
H100 Tensor Core GPU. One of the key innovations is the NVIDIA NVLink
Chip-2-Chip (C2C) interconnect with unified memory, which allows fast, seamless
and automatic transfer of data between the CPU and GPU. It also allows the GPU
to be oversubscribed, so it can handle datasets much larger than it can host,
potentially tackling out-of-GPU-memory problems. This allows software engineers
to focus on implementing algorithms without having to think too much about
memory management.
This blog post will demonstrate manual GPU memory management and introduce
managed and unified memory with simple examples to illustrate their benefits.
We'll try to keep this at an introductory level, but the post does assume basic
knowledge of C++, CUDA and compiling with nvcc.
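To give a flavour of what managed memory looks like, here is a minimal sketch (not from the post itself): a single `cudaMallocManaged` allocation is touched first by the CPU, then by a toy kernel on the GPU, with no explicit `cudaMemcpy` calls. The `scale` kernel and the array size are illustrative choices, not anything specific to the GH200.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel for illustration: double each element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *data;

    // One allocation visible to both CPU and GPU: no explicit
    // cudaMemcpy is needed, pages migrate on demand.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // touched on the CPU

    scale<<<(n + 255) / 256, 256>>>(data, n);    // touched on the GPU
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);           // read back on the CPU
    cudaFree(data);
    return 0;
}
```

Compiled with `nvcc`, this runs on any CUDA-capable system; on GH200 hardware the C2C interconnect is what makes the on-demand page migration fast.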
Regular expressions, or regex, are patterns used to match strings of text.
They can be very useful for searching, validating, or manipulating text
efficiently. This guide will introduce the basics of regex with easy-to-follow
examples.
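As a taste of the kind of pattern the guide covers, here is a small hedged example using C++'s standard `std::regex` (the guide's patterns apply equally in any regex-capable language). The `is_iso_date` helper and its pattern are our own illustrative choices: `\d{4}-\d{2}-\d{2}` matched against the whole string recognises a simple `YYYY-MM-DD` date.

```cpp
#include <regex>
#include <string>

// Illustrative helper: does the string look like a simple ISO date
// (YYYY-MM-DD)? regex_match requires the pattern to cover the whole
// string, so partial matches such as "x2024-05-01" are rejected.
bool is_iso_date(const std::string &s) {
    static const std::regex pattern(R"(\d{4}-\d{2}-\d{2})");
    return std::regex_match(s, pattern);
}
```

For example, `is_iso_date("2024-05-01")` is true, while `is_iso_date("01/05/2024")` is false because the separators and field order don't match the pattern.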
In this blog post, we explore what
torchrun and
DistributedDataParallel
are and how they can be used to speed up your neural network training by using
multiple GPUs.
We still encounter jobs on the HPC cluster
that try to use all the cores on the node on which they're running, regardless
of how many cores they requested, leading to node alarms. Sometimes, jobs try
to use exactly twice or one-and-a-half times the allocated cores, or even that number
squared. This was a little perplexing at first. In your enthusiasm to parallelize
your code, make sure someone else hasn't already done so.
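The usual culprit is sizing a thread pool from the node's total core count rather than the job's allocation. A hedged sketch of the fix: read the core count the scheduler granted (we assume the Grid Engine-style `NSLOTS` variable here; other schedulers use different names, e.g. `SLURM_CPUS_PER_TASK`) and only fall back to the hardware count outside a job.

```cpp
#include <cstdlib>
#include <string>
#include <thread>

// Return the core count the scheduler granted, not the node total.
// NSLOTS is assumed here as the scheduler's variable; adjust the name
// for your scheduler. Falls back to the hardware count when we are
// not running inside a job.
unsigned allocated_cores() {
    if (const char *slots = std::getenv("NSLOTS")) {
        return static_cast<unsigned>(std::stoul(slots));
    }
    // Whole-node count: fine on a laptop, wrong inside a shared job.
    return std::thread::hardware_concurrency();
}
```

Sizing thread pools, OpenMP teams or worker processes from this value (and checking whether a library you call already parallelises internally) avoids the doubling and squaring effects described above.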
Whilst most Apocrita users will want to use the
R module or
RStudio via OnDemand for R
workflows, it is also possible to use R inside Conda via
Miniforge.
In a previous blog, we discussed ways we could use multiprocessing and
mpi4py together to use multiple nodes of GPUs. We will cover some machine
learning principles and two examples of pleasingly parallel machine learning
problems. Also known as embarrassingly parallel problems, I prefer to call them
pleasingly parallel because there is nothing embarrassing about designing your
problem to run in parallel. When you do so, you can launch very similar
functions on each GPU and collate their results when needed.
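The shape of a pleasingly parallel problem can be sketched without any GPU or MPI machinery at all: workers run the same function on disjoint chunks of the input with no communication, and results are collated at the end. This toy summation (our own example, not from the post) uses C++ threads as stand-in workers; on a GPU cluster each worker would instead be a process driving one GPU.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// One worker's job: an independent computation on its own chunk.
long partial_sum(const std::vector<int> &data, size_t begin, size_t end) {
    return std::accumulate(data.begin() + begin, data.begin() + end, 0L);
}

// Pleasingly parallel dispatch: no communication between workers,
// results collated only at the end.
long parallel_sum(const std::vector<int> &data, unsigned workers) {
    std::vector<long> results(workers, 0);
    std::vector<std::thread> pool;
    size_t chunk = data.size() / workers;
    for (unsigned w = 0; w < workers; ++w) {
        size_t begin = w * chunk;
        size_t end = (w + 1 == workers) ? data.size() : begin + chunk;
        pool.emplace_back([&, w, begin, end] {
            results[w] = partial_sum(data, begin, end);
        });
    }
    for (auto &t : pool) t.join();  // collate when all workers finish
    return std::accumulate(results.begin(), results.end(), 0L);
}
```

Because the workers never talk to each other, scaling this from threads on one machine to processes across nodes of GPUs changes the plumbing but not the structure.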
NVIDIA recently announced the GH200 Grace Hopper Superchip, a
combined CPU+GPU with high memory bandwidth, designed for AI workloads. These
will also feature in the forthcoming Isambard
AI National supercomputer. We were offered the chance to pick up a couple of
these new servers for a very attractive launch price.
The CPU is a 72-core ARM-based Grace processor, connected to an H100
GPU via the NVIDIA chip-2-chip interconnect, which delivers 7x the bandwidth of
the PCIe Gen5 links commonly found in our other GPU nodes. This effectively allows the
GPU to seamlessly access the system memory. This
datasheet
contains further details.
Since this new chip offers a lot of potential for accelerating AI workloads,
particularly for workloads requiring large amounts of GPU RAM or involving a
lot of memory copying between the host and the GPU, we've been running a few
tests to see how this compares with the alternatives.