Cluster Update Summary

Posted on Wed 12 June 2019 in news by Simon Butcher

As part of our commitment to providing stable and manageable systems, here is a round-up of some recent updates we have been working on behind the scenes:

1) Upgrade of all HPC cluster nodes to CentOS 7.6

Over the last couple of weeks, you may have noticed a few nodes in a disabled or maintenance state when running the nodestatus command. We have been rolling out an operating system update from CentOS 7.4 to 7.6, which provides essential security updates and other fixes. We run the update on each node as an exclusive cluster job, followed by benchmarks and functionality tests, before bringing the node back online. This allows us to perform essential updates with minimal disruption.
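
For readers curious about the order of operations, the per-node sequence described above can be modelled roughly as follows. This is only an illustrative sketch, not our actual tooling: the Node class, the rolling_update function, and the state names used as strings are all hypothetical, and the real update runs as a job under the cluster scheduler rather than as a Python loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node; not a real scheduler object."""
    name: str
    os_version: str = "7.4"
    state: str = "online"
    history: List[str] = field(default_factory=list)

    def set_state(self, state: str) -> None:
        # Record every state change so the sequence can be inspected later.
        self.state = state
        self.history.append(state)


def rolling_update(
    nodes: List[Node],
    target_version: str,
    update: Callable[[Node, str], None],
    checks: List[Callable[[Node], bool]],
) -> List[str]:
    """Update nodes one at a time and return the names of nodes that failed checks."""
    failed = []
    for node in nodes:
        # Take the node out of service before touching it.
        node.set_state("maintenance")
        update(node, target_version)
        # Only return the node to service if every check passes.
        if all(check(node) for check in checks):
            node.set_state("online")
        else:
            node.set_state("disabled")
            failed.append(node.name)
    return failed


def apply_update(node: Node, version: str) -> None:
    # Placeholder for the real operating system update.
    node.os_version = version


def version_check(node: Node) -> bool:
    # Placeholder for the benchmarks and functionality tests.
    return node.os_version == "7.6"


if __name__ == "__main__":
    cluster = [Node("node1"), Node("node2")]
    failures = rolling_update(cluster, "7.6", apply_update, [version_check])
    print(failures, [n.state for n in cluster])
```

Because each node is handled on its own and only returns to service after passing its checks, a failed update affects a single node rather than the whole cluster.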

2) GPU additions and CUDA driver updates

We recently purchased an additional 4 Nvidia V100 GPUs to keep up with demand for GPU acceleration. These were added to the sbg nodes, bringing each server to 4 GPUs. Additionally, the CentOS 7.6 update allowed us to update the CUDA drivers to version 10.0, which provides performance improvements and allows use of the latest TensorFlow and MATLAB versions.

3) sdv node firmware updates

Our sdv nodes were quite new on the market when they were purchased and, as a result, have since received quite a few firmware updates addressing a variety of bugs, security issues, and performance and hardware compatibility problems. We will be applying these over the next couple of weeks, as nodes become available.