In this tutorial we’ll be showing you how to run a TensorFlow job using the GPU nodes on the Apocrita HPC cluster. We will expand upon the essentials provided on the QMUL HPC docs site, and provide more explanation of the process. We’ll start with software installation before demonstrating a simple task and a more complex real-world example that you can adapt for your own jobs, along with tips on how to check if the GPU is being used.
GPU cards can provide huge acceleration to certain workloads, particularly in the field of Machine Learning.
The QMUL Apocrita HPC cluster has the following GPU enabled nodes:
- 4 nxg nodes with NVIDIA Kepler K80 (effectively dual K40) cards.
- 3 sbg nodes with 4 x NVIDIA Volta V100 cards each.
- 1 sbg node with 4 x NVIDIA Ampere A100 cards.
- 16 sbg nodes with 4 x NVIDIA Ampere A100 cards (access for DERI Andrena cluster users only).
Using pip and virtualenv
TensorFlow for GPU is provided as a compiled package for the pip and conda
environments, and hence can be installed by the user. For simplicity we will
focus on the pip method. The TensorFlow instructions for conda are also
provided on the Apocrita HPC docs site.
The procedure follows the standard method for virtual environments on a shared system.
Virtual environments allow us to install different collections of python packages without experiencing conflicts, or versioning issues.
Loading applications using the module command
module avail python will show the available python versions; module
load python without the version number will load the default version into the
current session, and will also provide the pip and virtualenv commands. On
Apocrita, the default python module version is a recent python3 version:
    $ module avail python
    ----------- /share/apps/environmentmodules/centos7/general ---------------
    python/2.7.15  python/3.6.3  python/3.8.5(default)
    $ module load python
    $ module list
    Currently Loaded Modulefiles:
     1) python/3.8.5(default)
The Python project announced that Python 2 will not receive any updates, including security updates after Jan 1, 2020, and you should ensure that your code is Python 3 compliant.
Installing TensorFlow GPU package in a virtual environment
We will now demonstrate how to install the TensorFlow GPU package, using the following steps:
- load the python module
- set up a new virtual environment in your home directory (we will use
tensorgpu in this example)
- activate the virtual environment
- install the TensorFlow package into the active environment
    module load python
    virtualenv ~/tensorgpu
    source ~/tensorgpu/bin/activate
    pip install tensorflow
For releases 1.15 and older, CPU and GPU packages are separate:
    pip install tensorflow==1.15      # CPU
    pip install tensorflow-gpu==1.15  # GPU
Any TensorFlow dependencies will be installed at the same time. Notice that the
session prompt becomes prefixed by the name of the currently activated
virtualenv, as a handy visual reminder. You can deactivate the current
virtualenv with the deactivate command.
Now we have a virtual environment which can be loaded again on demand. To do so in a new session, or job script, we load the python module and source our virtualenv. Ensure you load the same python module that was used to create the virtualenv, to benefit from thread optimisation and shared library support.
    module load python
    source ~/tensorgpu/bin/activate
While in an activated environment, running the
pip freeze command will show
installed packages and their version number. It’s good practice to keep a copy
of this output in case you need to re-create this environment in future.
A common mistake is for new users to include the virtualenv creation and
pip install commands in their job script. Once the correct packages have been installed, all that is required to use them is to activate the virtualenv from within your job script.
TensorFlow and CUDA/CUDNN library versions
The GPU version of TensorFlow requires the CUDA and CUDNN modules to be loaded in the environment. Loading the correct CUDNN module will load the accompanying CUDA version as a dependency. Loading a CUDA/CUDNN module that does not match the TensorFlow version causes errors at runtime and a fallback to CPU-only mode.
| TensorFlow version | CUDNN version | CUDA version |
|--------------------|---------------|--------------|
| 2.4 - 2.6          | 8.1           | 11           |
| 2.1 - 2.3          | 7.6           | 10.1         |
| 1.13.1 - 2.0       | 7.4           | 10           |
| 1.5 - 1.12         | 7             | 9            |
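As a rough illustration, the table can be expressed as a small lookup that returns the CUDNN and CUDA versions for a given TensorFlow version. This is only a sketch: the function name `required_libraries` is made up for this example, and the ranges are compared by major and minor version only, so a patch-level boundary such as 1.13.1 is treated as 1.13.

```python
# Version ranges copied from the table: (lowest TF, highest TF, CUDNN, CUDA).
# Comparison uses (major, minor) only, which is a simplification.
COMPATIBILITY = [
    ((2, 4), (2, 6), "8.1", "11"),
    ((2, 1), (2, 3), "7.6", "10.1"),
    ((1, 13), (2, 0), "7.4", "10"),
    ((1, 5), (1, 12), "7", "9"),
]


def required_libraries(tf_version):
    """Return (cudnn, cuda) for a TensorFlow version string such as '2.4.1'."""
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    for lowest, highest, cudnn, cuda in COMPATIBILITY:
        if lowest <= (major, minor) <= highest:
            return cudnn, cuda
    raise ValueError(f"TensorFlow {tf_version} is not covered by the table")


print(required_libraries("2.4.1"))  # ('8.1', '11')
```

Versions outside the table raise a `ValueError` rather than guessing, since a wrong module choice leads to the CPU fallback described above.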
Installing a specific version of a package
Instead of installing the latest package, for compatibility reasons, you may
require a specific version. For example,
pip install tensorflow-gpu==1.15
will install the exact version, if it is available.
Bulk install of packages using a requirements file
A requirements file, in the format produced by
pip freeze, lets you install all listed packages
with the use of
pip install -r requirements.txt, in a rapid and reproducible manner.
For example, given a set of required packages for your job, make a
requirements.txt file containing the packages (and version numbers as
necessary). The following list is just an example of what that might look like:
    Keras==2.4.3
    matplotlib==3.4.1
    pandas==1.2.4
    scikit-learn
    tensorflow==2.4.1
Create a fresh environment (which we will call
myenv) and install the packages:
    module load python
    virtualenv myenv
    source myenv/bin/activate
    pip install -r requirements.txt
Additional dependencies will be pulled in as required. The preferred
approach, however, is to supply the whole output of
pip freeze from a known good virtualenv
you have set up previously, which will also include the dependencies.
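To confirm that an activated environment matches a requirements file, a short check along the following lines may help. It is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8 or later); the helper names `parse_pins` and `check_environment` are invented for this example, and only `name==version` pins and bare package names are handled.

```python
from importlib import metadata


def parse_pins(text):
    """Return {package: version or None} from requirements-style text."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip() or None
    return pins


def check_environment(pins):
    """Return a list of mismatches between the pins and installed packages."""
    problems = []
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if wanted is not None and installed != wanted:
            problems.append(f"{name}: wanted {wanted}, found {installed}")
    return problems


example = """Keras==2.4.3
matplotlib==3.4.1
tensorflow==2.4.1
"""
for problem in check_environment(parse_pins(example)):
    print(problem)
```

Run inside the activated virtualenv, the loop prints nothing when every pinned package is present at the requested version.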
Running a simple job
All work must be submitted via the job scheduler, to ensure optimal and fair use of resources. This basic job will check that you can access a GPU node, load your environment, run TensorFlow and output the TensorFlow version. Before running a GPU job, you need to request addition to the GPU user access list, while providing an example of a typical job script you will be running, so we can avoid situations where a user runs a lot of jobs that request GPU resources but don’t use them.
In a text editor, create the file
basic.qsub. Note that it’s best to create
and edit files using a text editor on the HPC system, such as vim, nano or
emacs, rather than creating them on your local workstation. This avoids
problems with Windows control-characters, and also ensures a more streamlined work-flow.
    #!/bin/bash
    #$ -cwd
    #$ -j y               # Merge output and error files (optional)
    #$ -pe smp 8          # Request cores (8 per GPU)
    #$ -l h_vmem=7.5G     # Request RAM (7.5GB per core)
    #$ -l h_rt=240:0:0    # Request maximum runtime (10 days)
    #$ -l gpu=1           # Request 1 GPU
    #$ -N basicGPU        # Name for the job (optional)

    # Load the necessary modules
    module load python
    module load cudnn/8.1.1-cuda11.2

    # Load the virtualenv containing the tensorflow package
    source ~/tensorgpu/bin/activate

    # Report the TensorFlow version
    python -c 'import tensorflow as tf; print(tf.__version__)'
qsub basic.qsub will tell the scheduler to add the job to the queue.
You can verify this with the
qstat command. Note that, while usually the
rules about resource requests are very strict (request only what you will use),
the convention is to request 8 cores per GPU.
If there are free resources, the job will run immediately and produce an output file a few seconds later containing the results of the job. The QMUL HPC docs site explains job output filenames and gives more detail on using GPU nodes.
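Beyond printing the version, a job can also report whether TensorFlow sees a GPU. The following is a small sketch, assuming TensorFlow 2.1 or later, where `tf.config.list_physical_devices` is available; the function name `visible_gpus` is made up for this example.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not importable; is the virtualenv activated?")
        return []
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

An empty list from inside a GPU job usually suggests that the CUDA/CUDNN modules were not loaded or do not match the installed TensorFlow version.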
Running a real-life job
The prerequisites for this job are a TensorFlow virtualenv and a copy of the
mnist_classify.py code shown below, which is also available on GitHub.
    import tensorflow as tf

    # Load the MNIST digits and scale pixel values to the range [0, 1]
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation='softmax')])

    # The final layer applies softmax, so the loss receives probabilities
    # rather than logits (from_logits=False)
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy'])

    model.fit(x_train, y_train, epochs=10)
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)
    print('\nTest accuracy:', test_acc)
    #!/bin/bash
    #$ -cwd
    #$ -j y               # Merge output and error files (optional)
    #$ -pe smp 8          # Request cores (8 per GPU)
    #$ -l h_vmem=7.5G     # Request RAM (7.5GB per core)
    #$ -l h_rt=240:0:0    # Request maximum runtime (10 days)
    #$ -m bea             # Send email on begin,end,abort (optional)
    #$ -l gpu=1           # Request 1 GPU
    #$ -N mnist_classify  # Name for the job (optional)

    # Load necessary modules
    module load python
    module load cudnn/8.1.1-cuda11.2

    # Load the virtualenv containing the tensorflow package
    source ~/tensorgpu/bin/activate

    # Run the mnist_classify.py code
    python mnist_classify.py
Since GPUs are the primary resource on the nodes, we request that users standardise their CPU and RAM requests on GPU nodes, so that non-GPU resources are shared evenly between GPU devices without too much effort from users. This equates to 8 cores and 7.5GB RAM per core, for each GPU requested.
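The arithmetic behind this convention is simple enough to sketch. The helper names below are hypothetical; the figures (8 cores and 7.5GB per core, per GPU) and the directive syntax are taken from the job scripts in this tutorial.

```python
CORES_PER_GPU = 8       # convention stated in this tutorial
RAM_PER_CORE_GB = 7.5   # convention stated in this tutorial


def gpu_job_resources(gpus):
    """Return the standard core and RAM request for a number of GPUs."""
    if gpus < 1:
        raise ValueError("request at least one GPU")
    cores = CORES_PER_GPU * gpus
    return {
        "gpus": gpus,
        "cores": cores,
        "ram_per_core_gb": RAM_PER_CORE_GB,
        "total_ram_gb": cores * RAM_PER_CORE_GB,
    }


def scheduler_directives(gpus):
    """Format the request as job script directive lines."""
    res = gpu_job_resources(gpus)
    return [
        f"#$ -pe smp {res['cores']}",
        f"#$ -l h_vmem={res['ram_per_core_gb']}G",
        f"#$ -l gpu={res['gpus']}",
    ]


print("\n".join(scheduler_directives(2)))
```

For two GPUs this gives 16 cores at 7.5GB each, which is 120GB of RAM in total; the per-core h_vmem value stays the same regardless of the GPU count.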
Submit the job with
qsub mnist_classify.qsub and check the status of your
queued and running jobs with qstat.
    $ qsub mnist_classify.qsub
    Your job 630581 ("mnist_classify") has been submitted
    $ qstat
    job-ID  prior     name            user    state  submit/start at      queue       slots
    ----------------------------------------------------------------------------------
    630581  15.00646  mnist_classify  abc123  r      03/22/2021 09:57:00  all.q@nxg1  8
We have added
-m bea in the job script to send an email notification when the
job begins, ends or aborts.
Checking the progress of your job
If your job starts immediately, you can ssh to the node and run nvidia-smi
to check the GPU device activity and attached processes.
Your process will be a python process. Note that another user will
likely be using one of the other GPUs, and their process may also be python. The first few
lines of the job output file
mnist_classify.o<jobid> will mention a GPU
device being used (note that it might state GPU 0 even when using another GPU
device, because only the GPUs you have requested are visible to you, starting
at GPU 0).
    +-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           On   | 00000000:83:00.0 Off |                    0 |
    | N/A   46C    P0   113W / 149W |  38403MiB / 11441MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
    | N/A   70C    P0   135W / 149W |    767MiB / 11441MiB |     99%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | GPU   GI   CI        PID   Type   Process name                   GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A     17053      C   python                          38403MiB |
    |    1   N/A  N/A     18998      C   python                            756MiB |
    +-----------------------------------------------------------------------------+
One way to check which of the tasks is yours is to use the ps
command and search for the process IDs attached to each GPU. For
example:
    $ ps -f 17053 18998
    UID        PID  PPID  C STIME TTY      STAT   TIME CMD
    abc123   17053 13708 99 16:16 ?        Sl    55:13 python /data/home/abc123/.../mnist_classify.py
    xyz987   18998 19483 99 10:45 ?        Sl   675:27 python tuning.py
In this case, process id 17053 is owned by user abc123 and is using GPU 0, and at this particular moment, is consuming 38403MiB of GPU RAM, and 100% GPU utilisation. The GPU usage may fluctuate over the course of the job, but consistently low figures may be an indication that some settings could be tweaked, to gain better performance.
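For repeated checks, the same figures can be collected in machine-readable form using nvidia-smi's CSV query mode. The sketch below parses that output; the function names and the 20% threshold are illustrative assumptions, not site policy, and the sample text is made-up data rather than output from a real node.

```python
import subprocess

# nvidia-smi query mode: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse rows of 'index, utilisation %, used MiB, total MiB'."""
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = (int(field) for field in line.split(","))
        rows.append({"index": index, "util_percent": util,
                     "used_mib": used, "total_mib": total})
    return rows


def query_gpus():
    """Run nvidia-smi on the current node (only works on a GPU node)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def underused(rows, threshold=20):
    """Return indices of GPUs below a hypothetical utilisation threshold."""
    return [row["index"] for row in rows if row["util_percent"] < threshold]


# Made-up sample in the same format, so the parser can be tried anywhere.
sample = "0, 100, 5000, 11441\n1, 3, 767, 11441\n"
print(underused(parse_gpu_csv(sample)))  # [1]
```

On a GPU node, `query_gpus()` returns the same structure from live data. A single low reading means little, since usage fluctuates during a job; it is consistently low figures that are worth investigating.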
We have confirmed that the job is using a GPU, and we will now inspect the job output file. The file is created in the directory from which you submitted the job, and its name is built from the job name and the job id number. If there is no job name provided in the job script file, then the file name of the script file is used instead.
In this example, the job runs for around 1-2 minutes and the output file is
mnist_classify.o630581, which we can now inspect.
We’ve truncated parts of the output but it is important to check the CUDA library messages that appear to ensure the code is being run on the GPU. If there are any missing libraries or error messages, the code might not run as expected, or may continue to run on the CPU.
    Variable OMP_NUM_THREADS has been set to 8
    Loading cudnn/8.1.1-cuda11.2
      Loading requirement: cuda/11.2.2
    [cuda library messages]
    Epoch 1/10
      32/1875 [..............................] - ETA: 9s - loss: 1.9411 - accuracy: 0.3788
     149/1875 [=>............................] - ETA: 4s - loss: 1.2637 - accuracy: 0.6238
     300/1875 [===>..........................] - ETA: 2s - loss: 0.9795 - accuracy: 0.7124
     452/1875 [======>.......................] - ETA: 2s - loss: 0.8372 - accuracy: 0.7553
     604/1875 [========>.....................] - ETA: 2s - loss: 0.7483 - accuracy: 0.7818
     753/1875 [===========>..................] - ETA: 1s - loss: 0.6869 - accuracy: 0.7998
     897/1875 [=============>................] - ETA: 1s - loss: 0.6419 - accuracy: 0.8130
    1043/1875 [===============>..............] - ETA: 1s - loss: 0.6055 - accuracy: 0.8237
    1191/1875 [==================>...........] - ETA: 1s - loss: 0.5753 - accuracy: 0.8305
    1341/1875 [====================>.........] - ETA: 0s - loss: 0.5496 - accuracy: 0.8383
    1493/1875 [======================>.......] - ETA: 0s - loss: 0.5274 - accuracy: 0.8451
    1639/1875 [=========================>....] - ETA: 0s - loss: 0.5133 - accuracy: 0.8508
    1786/1875 [===========================>..] - ETA: 0s - loss: 0.4963 - accuracy: 0.8557
    1875/1875 [==============================] - 6s 1ms/step - loss: 0.4833 - accuracy: 0.8595
    ...
    Epoch 10/10
     103/1875 [>.............................] - ETA: 2s - loss: 0.0380 - accuracy: 0.9891
     252/1875 [===>..........................] - ETA: 2s - loss: 0.0377 - accuracy: 0.9891
     402/1875 [=====>........................] - ETA: 2s - loss: 0.0393 - accuracy: 0.9885
     553/1875 [=======>......................] - ETA: 1s - loss: 0.0397 - accuracy: 0.9882
     704/1875 [==========>...................] - ETA: 1s - loss: 0.0403 - accuracy: 0.9879
     855/1875 [============>.................] - ETA: 1s - loss: 0.0406 - accuracy: 0.9877
     990/1875 [==============>...............] - ETA: 1s - loss: 0.0408 - accuracy: 0.9875
    1142/1875 [=================>............] - ETA: 1s - loss: 0.0412 - accuracy: 0.9873
    1290/1875 [===================>..........] - ETA: 0s - loss: 0.0415 - accuracy: 0.9872
    1437/1875 [=====================>........] - ETA: 0s - loss: 0.0417 - accuracy: 0.9870
    1586/1875 [========================>.....] - ETA: 0s - loss: 0.0418 - accuracy: 0.9870
    1732/1875 [==========================>...] - ETA: 0s - loss: 0.0419 - accuracy: 0.9869
    1875/1875 [==============================] - 3s 1ms/step - loss: 0.0420 - accuracy: 0.9868
    313/313 - 0s - loss: 0.0700 - accuracy: 0.9803

    Test accuracy: 0.9803000092506409
We can see that the job initialised with a GPU device, and the job is progressing. It’s important to inspect this information, as a badly configured job may not utilise the GPU at all, resulting in very poor performance and blocking a GPU from use by another researcher.
To check the latest progress of the job, you can use
tail -f <filename> to
show the end of the file, and continue to output data as the file grows.
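Once the job has finished, the per-epoch results can also be pulled out of the output file with a few lines of Python. This is a sketch under the assumption that each progress update sits on its own line, as in the listing in this tutorial; real output files may contain carriage returns from the progress bar, and the function name `summarise` is invented for this example.

```python
import re

# A completed epoch has equal step counts (e.g. 1875/1875) and a full bar.
EPOCH_END = re.compile(
    r"(\d+)/\1 \[=+\] .*?loss: ([\d.]+) - accuracy: ([\d.]+)")
TEST_ACCURACY = re.compile(r"Test accuracy: ([\d.]+)")


def summarise(text):
    """Return ([(loss, accuracy) per finished epoch], test accuracy or None)."""
    epochs = [(float(loss), float(acc))
              for _, loss, acc in EPOCH_END.findall(text)]
    match = TEST_ACCURACY.search(text)
    return epochs, (float(match.group(1)) if match else None)


sample = """Epoch 1/10
 149/1875 [=>............................] - ETA: 4s - loss: 1.2637 - accuracy: 0.6238
1875/1875 [==============================] - 6s 1ms/step - loss: 0.4833 - accuracy: 0.8595
313/313 - 0s - loss: 0.0700 - accuracy: 0.9803

Test accuracy: 0.9803000092506409
"""
print(summarise(sample))
```

To use it on a real job, read the output file into a string first, for example with `open("mnist_classify.o630581").read()`.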
Use of multiple GPUs
If your code supports it, you may request more than one GPU for your job. Note
that requesting 2 GPUs does not automatically mean that both GPUs will be used,
so it’s good practice to check
nvidia-smi each time you try new software.
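One common way to make TensorFlow code use several GPUs is `tf.distribute.MirroredStrategy`, which replicates the model on each visible GPU. The sketch below is one possible approach and not necessarily what your code needs; it reuses the MNIST model from this tutorial, and the function name `build_mirrored_model` is made up for the example.

```python
try:
    import tensorflow as tf
except ImportError:  # e.g. when run outside the TensorFlow virtualenv
    tf = None


def build_mirrored_model():
    """Build the MNIST model under MirroredStrategy.

    Returns (number of replicas, model), or None if TensorFlow is missing.
    """
    if tf is None:
        return None
    # By default the strategy uses every GPU visible to the job.
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        # Variables created inside the scope are mirrored across replicas.
        model = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                      metrics=["accuracy"])
    return strategy.num_replicas_in_sync, model


result = build_mirrored_model()
if result is not None:
    print("Replicas in sync:", result[0])
```

With a job that requested 2 GPUs, the replica count should be 2; a count of 1 suggests that only one device is visible to TensorFlow. Calling `model.fit` afterwards works as in the single-GPU script.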
Other Machine Learning applications
We’ve worked through a detailed approach for running TensorFlow jobs, which can
largely be applied to other frameworks such as
PyTorch, which are also available via
pip and conda. Some packages involve additional dependencies, which may
not be available in the standard python package repositories, and require
installing manually from code repositories. Please
get in touch if you need extra help.
Visualisation with TensorBoard
TensorBoard is a web-based visualisation tool that allows you to analyse the progress of your training. It comes installed with TensorFlow and includes the following features:
- tracking metrics such as loss and accuracy
- displaying image/audio data
- visualising the model graph
To visualise your data using TensorBoard, please see our Using TensorBoard via OnDemand page on our docs site for further information.
 Princeton University GitHub, (2020)
TensorFlow logo image: Licence