Queue scheduler announcement¶
Updated March 2026
We have updated this post with answers to some common questions.
For the lifetime of the Apocrita HPC cluster, the queue scheduler software we have been using to allocate jobs to individual compute nodes has been a variant of Grid Engine. Over the past few years, the company that owns this software has changed hands several times, and we feel that development has stagnated in terms of features, while bugs have not been resolved to our satisfaction.
The majority of HPC sites have been using the open-source Slurm scheduler for some time, and we've been impressed by its modern design and substantial feature list; many of these features are increasingly complex (or impossible) to implement in Grid Engine, particularly on a heterogeneous system such as Apocrita.
Moving to Slurm¶
The university's senior management also requires us to implement a clearer HPC charging model and improved auditing for REF, which is a further incentive to migrate Apocrita (and the associated GPU clusters Andrena and Apini) to Slurm by the end of 2025, when our Grid Engine licence terminates.
This is also a long-awaited opportunity to gain powerful new features and fairer queues, simplify our complex setup, and focus on compute quota allocations, in which jobs consume CPU core hours or GPU card hours from a credit allocation provided to each research group, rather than having unconstrained dedicated access to particular compute nodes.
The QMUL Research and Innovation Board (RIB) has requested that a level of free access to HPC resources remains, in order not to stifle research. However, paid-for allocations (e.g. from research grants) will also be supported and prioritised over the free queues, to ensure that funded research can complete in a timely manner. We will make it easier than before to fund compute time via research grants, and as a result we are moving away from dedicated "owned nodes" on research grants.
This approach is common at other institutions, including the EPSRC Tier 2 HPC services, where we provide research groups with budget allocations as part of a consortium.
Changes ahead for end-users¶
From a user perspective, there will be changes to job scripts and to the commands used to submit and monitor jobs, and jobs will either consume credit from your research group's account at higher priority, or use a lower-priority free allocation. Many of you will already be familiar with Slurm commands from using other clusters, since it has become the de facto industry standard. We will also benefit from a lot of community-provided resources instead of having to write in-house scripts, and you may have noticed that a lot of research software documentation targets Slurm users by default.
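To illustrate the kind of change involved, here is a minimal sketch of a Slurm batch script. The #SBATCH directive names are standard Slurm, but the resource values and the job itself are placeholders, not confirmed Apocrita settings; the real queue and account details will be confirmed in our documentation.

```shell
#!/bin/bash
# Minimal illustrative Slurm job script (placeholder values, not
# confirmed Apocrita settings).
#SBATCH --job-name=example
#SBATCH --time=01:00:00        # wall-clock limit (hh:mm:ss)
#SBATCH --ntasks=1             # number of tasks
#SBATCH --cpus-per-task=4      # cores per task
#SBATCH --mem=8G               # memory for the job

module load python             # load software modules as before
python myscript.py
```

Submission and monitoring use `sbatch script.sh`, `squeue -u $USER` and `scancel <jobid>`, broadly replacing Grid Engine's `qsub`, `qstat` and `qdel`.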
We’re consulting with the Slurm developers (SchedMD) on how to best implement the above approach and will be conducting training sessions as well as updating our existing documentation once we have confirmed the details.
We realise there will be quite a few questions and we will be adding answers to frequently asked questions to this blog post in due course.
The following information is correct as of March 2026.
Common questions¶
How do I get access to the pilot?¶
All HPC users have been added to the pilot, so you no longer need to request access to join it.
Will I lose access to Grid Engine by requesting Slurm access?¶
Grid Engine has now been retired on the cluster and is no longer in use.
Is there an ondemand service on Slurm?¶
The usual OnDemand server has been reconfigured to interact only with Slurm, now that Grid Engine has been retired.
Why isn't /data/scratch working any more?¶
The Slurm service uses a new scratch storage system, which we recently purchased. This provides improved performance over the old system, which is now end-of-life. Going forward, please use the /gpfs/scratch/$USER folder to access scratch. This is particularly important because jobs will fail under Slurm if you try to use the old /data/scratch/$USER folder.
The default 3TB free scratch storage remains the same. By design, scratch is a high performance system for working data produced while jobs are running. It is not backed up, and unused files are automatically deleted after 65 days.
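If you have saved job scripts that still reference the retired path, a small helper like the following can rewrite the prefix. This is a minimal sketch: the `fix_scratch_path` function name is ours, and you should check the result before relying on it.

```shell
#!/bin/sh
# fix_scratch_path: rewrite the retired /data/scratch prefix to the new
# /gpfs/scratch one on whatever text is piped in. Wrap it in a loop (or
# use `sed -i` directly on files) for in-place edits of saved scripts.
fix_scratch_path() {
    sed 's|/data/scratch/|/gpfs/scratch/|g'
}

echo 'cd /data/scratch/$USER/run1' | fix_scratch_path
# prints: cd /gpfs/scratch/$USER/run1
```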
The new scratch service is accessible as a Globus collection. If you commonly
need to transfer a large amount of data from another institution and need a
temporary holding place, searching Globus for QMUL Apocrita Scratch (unique ID
28d3101c-1631-499c-9809-a301a645245a) will provide access to the contents of
your /gpfs/scratch/$USER/globus directory.
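For command-line transfers, a sketch using the Globus CLI might look like the following. It assumes you have `globus-cli` installed and have authenticated with `globus login`; the remote collection ID and both paths are placeholders for your own endpoint, while the destination ID is the Apocrita Scratch collection listed above.

```shell
# Placeholder IDs/paths -- substitute your own remote collection and data.
REMOTE_ID=your-remote-collection-id
APOCRITA_ID=28d3101c-1631-499c-9809-a301a645245a

# Recursively copy a dataset from the remote collection into the area
# backed by /gpfs/scratch/$USER/globus on Apocrita.
globus transfer --recursive "$REMOTE_ID:/path/to/dataset" "$APOCRITA_ID:/dataset"
```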
Where can I find more information about using Slurm?¶
We've been working on adding example batch scripts for various apps to the Slurm docs site, which is still a work in progress. The main place to start is here.
This information will soon be moved to the main site.
Who can I contact if I have issues?¶
Questions, bugs and issues can be directed to its-research-support@qmul.ac.uk, or raised via the #slurm channel on our Slack instance.
Image credit: Meizhi Lang on Unsplash
