Managing Python environments

There are some important things Apocrita Python users need to know, such as how to manage and curate your personal environments, as well as how to tackle some common problems during the process.

For much of the year we have been working on a major project to upgrade Apocrita to a new operating system (Rocky Linux 9, hereafter known as Rocky 9). As part of the project, we have deployed a new package-building tool to help us recompile all of the research applications to work on the new system.
The majority of the cluster has now been upgraded to Rocky 9. The remaining CentOS 7 nodes will be updated in due course.

In this blog post, we will play about with neural networks, on a dataset called
ImageNet, to give some intuition on how these neural networks work. We will
train them on Apocrita with
DistributedDataParallel
and show benchmarks to give you a guide on how many GPUs to use. This is a
follow-on from a previous blog post where we explained how to
use DistributedDataParallel to speed up your neural network training with
multiple GPUs.

2024 has been a productive year for HPC outreach and education across different schools at Queen Mary University of London. We have formed alliances with managers and PIs from various schools within the University who understand the value that HPC can add to their scientific research. We are pleased to share our latest event in 2024:

The delivery of new GPUs for research is continuing, most notably with the new Isambard-AI cluster at Bristol. As new cutting-edge GPUs are released, software engineers need to keep up with the new architectures and features they offer.
The new Grace-Hopper GH200 nodes, as announced in a previous blog post, consist of a 72-core NVIDIA Grace CPU and an H100 Tensor Core GPU. One of the key innovations is the NVIDIA NVLink Chip-2-Chip (C2C) interconnect with unified memory, which allows fast, automatic transfer of data from CPU to GPU. It also allows the GPU to be oversubscribed, so it can handle data much larger than its own memory can hold, potentially avoiding out-of-GPU-memory problems. This allows software engineers to focus on implementing algorithms without having to think too much about memory management.
This blog post will demonstrate manual GPU memory management and introduce
managed and unified memory with simple examples to illustrate its benefits.
We'll try to keep this at an introductory level, but the post does assume basic
knowledge of C++, CUDA and compiling with nvcc.

Regular expressions, or regex, are patterns used to match strings of text. They can be very useful for searching, validating, or manipulating text efficiently. This guide will introduce the basics of regex with easy-to-follow examples.
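
As a quick taste of the three uses mentioned above (searching, validating and manipulating), here is a minimal sketch using Python's built-in re module. The sample log line, the node name and the is_valid_time helper are made up purely for illustration.

```python
import re

# Hypothetical log line, invented for this example
text = "Job 12345 finished on node ddy42 at 14:07"

# Searching: \d+ matches every run of one or more digits
numbers = re.findall(r"\d+", text)
print(numbers)


def is_valid_time(value):
    """Validating: accept only 24-hour HH:MM strings (hypothetical helper)."""
    return re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", value) is not None


print(is_valid_time("14:07"), is_valid_time("25:00"))

# Manipulating: replace every single digit with '#'
masked = re.sub(r"\d", "#", text)
print(masked)
```

Note that findall picks up the digits inside the node name and the time as well as the job number, which is a good reminder that a pattern matches exactly what you ask for and nothing more specific.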

In this blog post, we explore what
torchrun and
DistributedDataParallel
are and how they can be used to speed up your neural network training by using
multiple GPUs.

The Apocrita highmem nodes have just been upgraded so that they contain newer
CPUs with more modern instruction sets.

We still encounter jobs on the HPC cluster that try to use all the cores on the node on which they're running, regardless of how many cores they requested, leading to node alarms. Sometimes, jobs try to use exactly twice or one-and-a-half the allocated cores, or even that number squared. This was a little perplexing at first. In your enthusiasm to parallelize your code, make sure someone else hasn't already done so.
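
One plausible way to end up with the square of your core request is to start one process per core while each process also starts one thread per core in a library that is already parallel. The sketch below shows one way to keep a Python job within its allocation. It assumes the scheduler exports the allocated core count in an environment variable called NSLOTS, as Grid Engine style schedulers do; the allocated_cores and thread_env helpers are hypothetical names for this example, and the three thread variables are the ones commonly read by OpenMP, OpenBLAS and MKL.

```python
import os


def allocated_cores(env=None, default=1):
    """Return the core count granted by the scheduler.

    Assumption: the scheduler sets NSLOTS. Falls back to `default`
    when the variable is missing, non-numeric or not positive.
    """
    env = os.environ if env is None else env
    try:
        cores = int(env.get("NSLOTS", ""))
    except ValueError:
        return default
    return cores if cores > 0 else default


def thread_env(cores):
    """Build thread-limit settings for common numerical libraries."""
    names = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
    return {name: str(cores) for name in names}


cores = allocated_cores()
settings = thread_env(cores)
print(f"Using {cores} core(s): {settings}")

# To apply the limits, set them before importing libraries such as NumPy:
# os.environ.update(settings)
```

The key design choice is to read the allocation rather than call os.cpu_count(), which reports every core on the node regardless of how many were requested.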

In this tutorial we'll be showing you how to create a new Git project within RStudio using either a new or existing GitHub repository.