Skip to content

featured

A look at the Grace Hopper superchip

NVIDIA recently announced the GH200 Grace Hopper Superchip which is a combined CPU+GPU with high memory bandwidth, designed for AI workloads. These will also feature in the forthcoming Isambard AI National supercomputer. We were offered the chance to pick up a couple of these new servers for a very attractive launch price.

The CPU is a 72-core ARM-based Grace processor, which is connected to an H100 GPU via the NVIDIA chip-2-chip interconnect, which delivers 7x the bandwidth of PCIe Gen5, commonly found in our other GPU nodes. This effectively allows the GPU to seamlessly access the system memory. This datasheet contains further details.

Since this new chip offers a lot of potential for accelerating AI workloads, particularly for workloads requiring large amounts of GPU RAM or involving a lot of memory copying between the host and the GPU, we've been running a few tests to see how this compares with the alternatives.

Using job statistics to increase job performance and reduce queueing time

You may wonder why some jobs start immediately but some wait in the queue for hours or days, even if your job is quite simple. If you notice your job has been queueing for a while, you may want to consider adjusting the requested resources to reduce queueing time and reduce any potential resource wastage as the job runs. Below, we outline two useful tools for you to check the resource usage of previous jobs.

Running Machine Learning workloads on Apocrita

In this tutorial we'll be showing you how to run a TensorFlow job using the GPU nodes on the Apocrita HPC cluster. We will expand upon the essentials provided on the QMUL HPC docs site, and provide more explanation of the process. We'll start with software installation before demonstrating a simple task and a more complex real-world example that you can adapt for your own jobs, along with tips on how to check if the GPU is being used.

Productivity tips for Apocrita cluster users

This article presents a selection of useful tips for running successful and well-performing jobs on the QMUL Apocrita cluster.

In the ITS Research team, we spend quite a bit of time monitoring the Apocrita cluster and checking jobs are running correctly, to ensure that this valuable resource is being used effectively. If we notice a problem with your job, and think we can help, we might send you an email with some recommendations on how your job can run more effectively. If you receive such an email, please don't be offended! We realise there are users with a range of experience, and the purpose of this post is to point out some ways to ensure you get your results as quickly and correctly as possible, and to ease the learning curve a little bit.

Sizing your Apocrita jobs for quicker results

At any one time, a typical HPC cluster is usually full. This is not such a bad thing, since it means the substantial investment is working hard for the money, rather than sitting idle. A less ideal situation is having to wait too long to get your research results. However, jobs are constantly starting and finishing, and many new jobs get run shortly after being added to the queue. If your resource requirements are rather niche, or very large, then you will be competing with other researchers for a more scarce resource. In any case, whatever sort of jobs you run, it is important to choose resources optimally, in order to get the best results. Using fewer cores, although increasing the eventual run time, may result in a much shorter queuing time.