
Running Machine Learning workloads on Apocrita

Simon Butcher · Oct 14, 2021 · 16 mins read

In this tutorial we’ll be showing you how to run a TensorFlow job using the GPU nodes on the Apocrita HPC cluster. We will expand upon the essentials provided on the QMUL HPC docs site, and provide more explanation of the process. We’ll start with software installation before demonstrating a simple task and a more complex real-world example that you can adapt for your own jobs, along with tips on how to check if the GPU is being used.

Available hardware

GPU cards can provide huge acceleration to certain workloads, particularly in the field of Machine Learning.

The QMUL Apocrita HPC cluster has the following GPU-enabled nodes:

  • 4 nxg nodes with NVIDIA Kepler K80 (effectively dual K40) cards.
  • 3 sbg nodes with 4 x NVIDIA Volta V100 cards each.
  • 1 sbg node with 4 x NVIDIA Ampere A100 cards.
  • 16 sbg nodes with 4 x NVIDIA Ampere A100 cards (access for DERI Andrena cluster users only).
  • 2 wn nodes running POWER architecture CPUs, and 4 x NVIDIA Volta V100 cards. While the POWER nodes can run TensorFlow very effectively, the installation instructions in this tutorial will differ on these machines, and will be covered in another tutorial in future.

Installation

Using pip and virtualenv

TensorFlow for GPU is provided as a compiled package for the pip and conda environments, and hence can be installed by the user. For simplicity we will focus on the pip method. The TensorFlow instructions for pip and conda are also provided on the Apocrita HPC documentation site.

The procedure follows the standard method for virtual environments on a shared system.

Virtual environments allow us to install different collections of python packages without conflicts or versioning issues.

Loading applications using the module command

Running module avail python will show the available python versions; module load python without the version number will load the default version into the current session, and will also provide the pip and virtualenv commands. On Apocrita, the default python module version is a recent python3 version, shown below:

$ module avail python
----------- /share/apps/environmentmodules/centos7/general ---------------
python/2.7.15  python/3.6.3 python/3.8.5(default)
$ module load python
$ module list
Currently Loaded Modulefiles:
 1) python/3.8.5(default)

Installing TensorFlow GPU package in a virtual environment

We will now demonstrate how to install the TensorFlow GPU package, using the following steps:

  • load the python module
  • set up a new virtual environment in your home directory (we will use tensorgpu in this example)
  • activate the tensorgpu virtual environment
  • install the TensorFlow package into the active environment

module load python
virtualenv ~/tensorgpu
source ~/tensorgpu/bin/activate
pip install tensorflow

For releases 1.15 and older, CPU and GPU packages are separate:

pip install tensorflow==1.15      # CPU
pip install tensorflow-gpu==1.15  # GPU

Any TensorFlow dependencies will be installed at the same time. Notice that the session prompt becomes prefixed by the name of the currently activated virtualenv, as a handy visual reminder. You can deactivate the current virtualenv with the deactivate command.

Now we have a virtual environment which can be loaded again on demand. To do so in a new session, or job script, we load the python module and source our virtualenv. Ensure you load the same python module that was used to create the virtualenv, to benefit from thread optimisation and shared library support.

module load python
source ~/tensorgpu/bin/activate
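
As a quick sanity check after activation, the following minimal sketch reports whether the running interpreter belongs to a virtual environment. The function name in_virtualenv is our own, and the check relies on the standard sys.prefix and sys.base_prefix attributes (plus sys.real_prefix, which older virtualenv releases set instead).

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # venv and recent virtualenv releases point sys.base_prefix at the parent
    # installation; older virtualenv releases set sys.real_prefix instead.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


if __name__ == "__main__":
    print("python executable:", sys.executable)
    print("inside a virtualenv:", in_virtualenv())
```

Running this at the top of a job is a cheap way to confirm the job picked up the environment you intended.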

While in an activated environment, running the pip freeze command will show installed packages and their version number. It’s good practice to keep a copy of this output in case you need to re-create this environment in future.
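
If you would like the package list recorded from inside a Python job rather than from the shell, a rough equivalent can be sketched with the standard library importlib.metadata module (Python 3.8 and later). The function name frozen_requirements is hypothetical, and the output is only an approximation of what pip freeze prints, so pip freeze remains the reference.

```python
from importlib import metadata


def frozen_requirements():
    """Return sorted 'name==version' strings for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip any distribution whose metadata lacks a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


if __name__ == "__main__":
    # Printing the list at the start of a job stores it in the job output file
    print("\n".join(frozen_requirements()))
```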

TensorFlow and CUDA/CUDNN library versions

The GPU version of TensorFlow requires the CUDA and CUDNN modules to be loaded in the environment. Loading the correct CUDNN module also loads the accompanying CUDA version as a dependency. Loading a CUDA/CUDNN module that does not match your TensorFlow version will cause errors at runtime and a fallback to CPU-only mode.

TensorFlow version   CUDNN version   CUDA version
2.4 - 2.6            8.1             11
2.1 - 2.3            7.6             10.1
1.13.1 - 2.0         7.4             10
1.5 - 1.12           7               9
<1.4                 6               8
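
The table can also be encoded as a small lookup, which is handy if you maintain several environments. This is only a sketch of the table as printed above: the names VERSION_TABLE, parse_version and required_libraries are our own, version strings are assumed to be plain dotted numbers, and any version the table does not cover (for example 1.4, or anything newer than 2.6) returns None rather than a guess.

```python
# Rows: (lowest version, highest major.minor, CUDNN version, CUDA version)
VERSION_TABLE = [
    ((2, 4, 0), (2, 6), "8.1", "11"),
    ((2, 1, 0), (2, 3), "7.6", "10.1"),
    ((1, 13, 1), (2, 0), "7.4", "10"),
    ((1, 5, 0), (1, 12), "7", "9"),
]


def parse_version(version):
    """Turn '2.4.1' or '1.15' into a three-part tuple of integers."""
    parts = [int(p) for p in version.split(".")[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def required_libraries(tf_version):
    """Return (cudnn, cuda) for a TensorFlow version, or None if not in the table."""
    v = parse_version(tf_version)
    for low, high, cudnn, cuda in VERSION_TABLE:
        if low <= v and v[:2] <= high:
            return cudnn, cuda
    if v < (1, 4, 0):  # the "<1.4" row of the table
        return "6", "8"
    return None


if __name__ == "__main__":
    for version in ("2.4.1", "1.15", "1.4"):
        print(version, "->", required_libraries(version))
```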

Installing a specific version of a package

Instead of installing the latest package, for compatibility reasons, you may require a specific version. For example, pip install tensorflow-gpu==1.15 will install the exact version, if it is available.

Bulk install of packages using a requirements file

Running pip install -r requirements.txt with a requirements file, in the format produced by pip freeze, installs all listed packages in a rapid and reproducible manner.

For example, given a set of required packages for your job, make a requirements.txt file containing the packages (and version numbers as necessary). The following list is just an example of what that might look like:

Keras==2.4.3
matplotlib==3.4.1
pandas==1.2.4
scikit-learn
tensorflow==2.4.1

Create a fresh environment (which we will call myenv) and install the packages:

module load python
virtualenv myenv
source myenv/bin/activate
pip install -r requirements.txt

Additional dependencies will be pulled in as required. A more reproducible approach is to supply the whole output of pip freeze from a known good virtualenv you have set up previously, since this also lists the dependencies with their versions.
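
To see at a glance which entries in a requirements file are pinned to an exact version and which are not, a short sketch along these lines may help. The function name split_requirements is hypothetical, the example list is modelled on the one above with numpy added as an illustrative unpinned entry, and only the simple name==version form is recognised.

```python
def split_requirements(text):
    """Split requirements text into pinned {name: version} and unpinned names."""
    pinned, unpinned = {}, []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" in line:
            name, version = line.split("==", 1)
            pinned[name.strip()] = version.strip()
        else:
            unpinned.append(line)
    return pinned, unpinned


EXAMPLE = """\
Keras==2.4.3
matplotlib==3.4.1
pandas==1.2.4
numpy
tensorflow==2.4.1
"""

if __name__ == "__main__":
    pinned, unpinned = split_requirements(EXAMPLE)
    print("pinned:", pinned)
    print("unpinned (version may change between installs):", unpinned)
```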

Running a simple job

All work must be submitted via the job scheduler, to ensure optimal and fair use of resources. This basic job will check that you can access a GPU node, load your environment, run TensorFlow and output the TensorFlow version. Before running a GPU job, you need to request addition to the GPU user access list and provide an example of a typical job script you will be running. This lets us avoid situations where a user runs a lot of jobs that request GPU resources but don’t use them.

In a text editor, create the file basic.qsub. Note that it’s best to create and edit files using a text editor on the HPC system, such as vim, nano or emacs, rather than creating them on your local workstation. This avoids a common issue with Windows control characters, and also ensures a more streamlined workflow.
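
If you do end up copying a script from a Windows machine, the sketch below shows one way to detect and repair it. We are assuming here that the control characters in question are the carriage returns in Windows line endings; both function names are our own.

```python
def has_windows_line_endings(data: bytes) -> bool:
    """Return True if the file content contains CRLF line endings."""
    return b"\r\n" in data


def to_unix_line_endings(data: bytes) -> bytes:
    """Replace CRLF line endings with plain LF."""
    return data.replace(b"\r\n", b"\n")


if __name__ == "__main__":
    # Stand-in for the bytes read from a job script with open(path, "rb")
    script = b"#!/bin/bash\r\n#$ -cwd\r\n"
    print("needs fixing:", has_windows_line_endings(script))
    print(to_unix_line_endings(script).decode())
```

Reading and writing the file in binary mode matters, because text mode may translate line endings silently.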

#!/bin/bash
#$ -cwd
#$ -j y             # Merge output and error files (optional)
#$ -pe smp 8        # Request cores (8 per GPU)
#$ -l h_vmem=7.5G   # Request RAM (7.5GB per core)
#$ -l h_rt=240:0:0  # Request maximum runtime (10 days)
#$ -l gpu=1         # Request 1 GPU
#$ -N basicGPU      # Name for the job (optional)

# Load the necessary modules
module load python
module load cudnn/8.1.1-cuda11.2

# Load the virtualenv containing the tensorflow package
source ~/tensorgpu/bin/activate

# Report the TensorFlow version
python -c 'import tensorflow as tf; print(tf.__version__)'

Running qsub basic.qsub will tell the scheduler to add the job to the queue. You can verify this with the qstat command. Note that, while usually the rules about resource requests are very strict (request only what you will use), the convention is to request 8 cores per GPU.

If there are free resources, the job will run immediately and produce an output file a few seconds later containing the results of the job. The QMUL HPC docs site explains job output filenames and gives more detail on using GPU nodes.
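
Printing the version confirms that TensorFlow imports, but not that it can see a GPU. As a slightly fuller check, the following sketch lists the GPU devices TensorFlow detects, using tf.config.list_physical_devices (available in TensorFlow 2.1 and later). The function name visible_gpus is our own, and it returns None when TensorFlow is not installed so that the script can also be run outside the virtualenv.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not installed in this environment")
    elif gpus:
        print("GPUs visible to TensorFlow:", gpus)
    else:
        print("No GPU visible: check the cudnn module and the gpu request")
```

Saved as a file (say check_gpu.py, a name we have made up), it could replace the python -c line in basic.qsub.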

Running a real-life job

The prerequisites for this job are a TensorFlow virtualenv and a copy of the mnist_classify.py code, which is shown below and is also available on GitHub. [1]

import tensorflow as tf

# Load the MNIST digits and scale pixel values from 0-255 to 0-1
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A small fully connected classifier with a softmax output layer
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')])

# The model already outputs probabilities, so from_logits must be False
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

# Train for 10 epochs, then evaluate on the held-out test set
model.fit(x_train, y_train, epochs=10)
test_loss, test_acc = model.evaluate(x_test, y_test, verbose=2)

print('\nTest accuracy:', test_acc)

Prepare mnist_classify.qsub:

#!/bin/bash
#$ -cwd
#$ -j y               # Merge output and error files (optional)
#$ -pe smp 8          # Request cores (8 per GPU)
#$ -l h_vmem=7.5G     # Request RAM (7.5GB per core)
#$ -l h_rt=240:0:0    # Request maximum runtime (10 days)
#$ -m bea             # Send email on begin,end,abort (optional)
#$ -l gpu=1           # Request 1 GPU
#$ -N mnist_classify  # Name for the job (optional)

# Load necessary modules
module load python
module load cudnn/8.1.1-cuda11.2

# Load the virtualenv containing the tensorflow package
source ~/tensorgpu/bin/activate

# Run the mnist_classify.py code
python mnist_classify.py

Since GPUs are the primary resource on these nodes, we ask users to standardise their CPU and RAM requests so that non-GPU resources are shared evenly between GPU devices without too much effort from users. This equates to 8 cores and 7.5GB RAM per core, for each GPU requested.
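
The arithmetic behind this convention is simple enough to sketch. The function name gpu_job_request is our own. Note that h_vmem is requested per core, so it stays at 7.5G however many GPUs you ask for, while the core count and total RAM scale with the GPU count.

```python
CORES_PER_GPU = 8        # convention described above
RAM_PER_CORE_GB = 7.5    # value used for h_vmem


def gpu_job_request(gpus):
    """Return the standard core and RAM figures for a given number of GPUs."""
    if gpus < 1:
        raise ValueError("at least one GPU must be requested")
    cores = CORES_PER_GPU * gpus
    return {
        "gpus": gpus,
        "cores": cores,                            # value for -pe smp
        "h_vmem_gb": RAM_PER_CORE_GB,              # per core, does not scale
        "total_ram_gb": cores * RAM_PER_CORE_GB,   # what the job gets overall
    }


if __name__ == "__main__":
    for n in (1, 2):
        print(gpu_job_request(n))
```

For one GPU this gives 8 cores and 60GB of RAM in total; for two GPUs, 16 cores and 120GB.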

Submit the job with qsub mnist_classify.qsub and check the status of your queued and running jobs with qstat.

$ qsub mnist_classify.qsub
Your job 630581 ("mnist_classify") has been submitted
$ qstat
job-ID prior    name           user    state submit/start at     queue       slots
----------------------------------------------------------------------------------
630581 15.00646 mnist_classify abc123  r     03/22/2021 09:57:00 all.q@nxg1  8
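
The queue column tells you which node the job landed on (nxg1 in this case). As an illustration only, the sketch below pulls the node name out of qstat text laid out like the sample above. The function name running_job_nodes is hypothetical, and real qstat output may differ (for instance, queued jobs have no queue entry, and long names may be truncated), so treat the parsing as an assumption.

```python
SAMPLE = """\
job-ID prior    name           user    state submit/start at     queue       slots
----------------------------------------------------------------------------------
630581 15.00646 mnist_classify abc123  r     03/22/2021 09:57:00 all.q@nxg1  8
"""


def running_job_nodes(qstat_text):
    """Map job id to node name for every job line that shows queue@node."""
    nodes = {}
    for line in qstat_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header and separator lines
        for field in fields:
            if "@" in field:
                nodes[fields[0]] = field.split("@", 1)[1]
                break
    return nodes


if __name__ == "__main__":
    print(running_job_nodes(SAMPLE))
```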

We have added -m bea in the job script to send an email to notify when the job begins/ends/aborts.

Checking the progress of your job

If your job starts immediately, you can ssh to the node and run nvidia-smi to check the GPU device activity and attached processes.

Your process will be a python process. Note that another user will likely be using one of the other GPUs, and their process may also be python. The first few lines of the job output file mnist_classify.o<jobid> will mention a GPU device being used (note that it might state GPU 0 even when using another GPU device, because only the GPUs you have requested are visible to you, starting at GPU 0).

+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:83:00.0 Off |                    0 |
| N/A   46C    P0   113W / 149W |  38403MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   70C    P0   135W / 149W |    767MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17053      C   python                          38403MiB |
|    1   N/A  N/A     18998      C   python                            756MiB |
+-----------------------------------------------------------------------------+

Figure 1: Output of nvidia-smi, showing 2 GPUs in use

One way to check which of the tasks is yours is to use the ps command and search for the process IDs attached to each GPU. For example:

$ ps -f 17053 18998
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
abc123   17053 13708 99 16:16 ?        Sl    55:13 python /data/home/abc123/.../mnist_classify.py
xyz987   18998 19483 99 10:45 ?        Sl   675:27 python tuning.py

In this case, process id 17053 is owned by user abc123 and is using GPU 0. At this particular moment it is consuming 38403MiB of GPU RAM at 100% GPU utilisation. The GPU usage may fluctuate over the course of the job, but consistently low figures may indicate that some settings could be tweaked to gain better performance.
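
If you check GPU usage often, the process table can be parsed rather than read by eye. The sketch below extracts the GPU index, process ID, process name and memory use from text laid out like the process table in Figure 1. The regular expression and the function name gpu_processes are our own and assume that exact column layout, which may differ between driver versions.

```python
import re

# One row of the process table: GPU, GI, CI, PID, Type, Process name, Memory
PROCESS_ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+"
    r"(?P<name>.+?)\s+(?P<mem>\d+)MiB\s+\|$"
)

SAMPLE = """\
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     17053      C   python                          38403MiB |
|    1   N/A  N/A     18998      C   python                            756MiB |
"""


def gpu_processes(table_text):
    """Return (gpu index, pid, process name, memory in MiB) for each process row."""
    rows = []
    for line in table_text.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            rows.append((int(match["gpu"]), int(match["pid"]),
                         match["name"], int(match["mem"])))
    return rows


if __name__ == "__main__":
    for gpu, pid, name, mem in gpu_processes(SAMPLE):
        print(f"GPU {gpu}: pid {pid} ({name}) using {mem}MiB")
```

The process IDs it returns can then be passed to ps, as shown above, to find out which one is yours.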

We have confirmed that the job is using a GPU, and we will now inspect the job output file. The file is created in the directory where you submitted the job, and its name is a concatenation of the job name and the job id number. If no job name is provided in the job script, the file name of the script is used instead.

In this example, the job runs for around 1-2 minutes and the output file is mnist_classify.o630581. We can inspect the file using less mnist_classify.o630581.
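
The naming rule can be written down as a one-line function, which is useful if a wrapper script needs to locate the output of a job it has just submitted. This is a sketch of the rule as described above. The function name job_output_filename is our own, and the job id 12345 in the second example is made up.

```python
def job_output_filename(job_id, job_name=None, script_name=None):
    """Build the output file name: <job name or script name>.o<job id>."""
    stem = job_name if job_name else script_name
    if not stem:
        raise ValueError("either job_name or script_name is required")
    return f"{stem}.o{job_id}"


if __name__ == "__main__":
    # Job name set with -N in the job script
    print(job_output_filename(630581, job_name="mnist_classify"))
    # No -N given, so the script file name is used (hypothetical job id)
    print(job_output_filename(12345, script_name="basic.qsub"))
```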

We’ve truncated parts of the output but it is important to check the CUDA library messages that appear to ensure the code is being run on the GPU. If there are any missing libraries or error messages, the code might not run as expected, or may continue to run on the CPU.

Variable OMP_NUM_THREADS has been set to 8
Loading cudnn/8.1.1-cuda11.2
  Loading requirement: cuda/11.2.2

[cuda library messages]

Epoch 1/10
  32/1875 [..............................] - ETA: 9s - loss: 1.9411 - accuracy: 0.3788
 149/1875 [=>............................] - ETA: 4s - loss: 1.2637 - accuracy: 0.6238
 300/1875 [===>..........................] - ETA: 2s - loss: 0.9795 - accuracy: 0.7124
 452/1875 [======>.......................] - ETA: 2s - loss: 0.8372 - accuracy: 0.7553
 604/1875 [========>.....................] - ETA: 2s - loss: 0.7483 - accuracy: 0.7818
 753/1875 [===========>..................] - ETA: 1s - loss: 0.6869 - accuracy: 0.7998
 897/1875 [=============>................] - ETA: 1s - loss: 0.6419 - accuracy: 0.8130
1043/1875 [===============>..............] - ETA: 1s - loss: 0.6055 - accuracy: 0.8237
1191/1875 [==================>...........] - ETA: 1s - loss: 0.5753 - accuracy: 0.8305
1341/1875 [====================>.........] - ETA: 0s - loss: 0.5496 - accuracy: 0.8383
1493/1875 [======================>.......] - ETA: 0s - loss: 0.5274 - accuracy: 0.8451
1639/1875 [=========================>....] - ETA: 0s - loss: 0.5133 - accuracy: 0.8508
1786/1875 [===========================>..] - ETA: 0s - loss: 0.4963 - accuracy: 0.8557
1875/1875 [==============================] - 6s 1ms/step - loss: 0.4833 - accuracy: 0.8595
...
Epoch 10/10
 103/1875 [>.............................] - ETA: 2s - loss: 0.0380 - accuracy: 0.9891
 252/1875 [===>..........................] - ETA: 2s - loss: 0.0377 - accuracy: 0.9891
 402/1875 [=====>........................] - ETA: 2s - loss: 0.0393 - accuracy: 0.9885
 553/1875 [=======>......................] - ETA: 1s - loss: 0.0397 - accuracy: 0.9882
 704/1875 [==========>...................] - ETA: 1s - loss: 0.0403 - accuracy: 0.9879
 855/1875 [============>.................] - ETA: 1s - loss: 0.0406 - accuracy: 0.9877
 990/1875 [==============>...............] - ETA: 1s - loss: 0.0408 - accuracy: 0.9875
1142/1875 [=================>............] - ETA: 1s - loss: 0.0412 - accuracy: 0.9873
1290/1875 [===================>..........] - ETA: 0s - loss: 0.0415 - accuracy: 0.9872
1437/1875 [=====================>........] - ETA: 0s - loss: 0.0417 - accuracy: 0.9870
1586/1875 [========================>.....] - ETA: 0s - loss: 0.0418 - accuracy: 0.9870
1732/1875 [==========================>...] - ETA: 0s - loss: 0.0419 - accuracy: 0.9869
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0420 - accuracy: 0.9868
313/313 - 0s - loss: 0.0700 - accuracy: 0.9803

Test accuracy: 0.9803000092506409

We can see that the job initialised with a GPU device and is progressing. It’s important to inspect this information, as a badly configured job may not utilise the GPU at all, resulting in very poor performance and blocking a GPU from use by another researcher.
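
Because the progress bar writes many lines per epoch, it can be convenient to reduce the output file to one line per completed epoch. The sketch below does this for text in the format shown above. The regular expressions and the function name completed_epochs are our own, and they assume the Keras progress format of this particular output, which can vary between versions.

```python
import re

EPOCH_HEADER = re.compile(r"^Epoch (\d+)/(\d+)")
PROGRESS = re.compile(
    r"^\s*(\d+)/(\d+) \[.*\] - .*?loss: ([0-9.]+) - accuracy: ([0-9.]+)"
)

SAMPLE = """\
Epoch 1/10
1786/1875 [===========================>..] - ETA: 0s - loss: 0.4963 - accuracy: 0.8557
1875/1875 [==============================] - 6s 1ms/step - loss: 0.4833 - accuracy: 0.8595
"""


def completed_epochs(log_text):
    """Return (epoch, loss, accuracy) for each epoch whose final step is logged."""
    results = []
    epoch = None
    for line in log_text.splitlines():
        header = EPOCH_HEADER.match(line)
        if header:
            epoch = int(header.group(1))
            continue
        progress = PROGRESS.match(line)
        # A step count equal to the total marks the last line of an epoch
        if progress and epoch is not None and progress.group(1) == progress.group(2):
            results.append((epoch, float(progress.group(3)), float(progress.group(4))))
    return results


if __name__ == "__main__":
    for epoch, loss, accuracy in completed_epochs(SAMPLE):
        print(f"epoch {epoch}: loss {loss}, accuracy {accuracy}")
```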

To check the latest progress of the job, you can use tail -f <filename> to show the end of the file, and continue to output data as the file grows.

Use of multiple GPUs

If your code supports it, you may request more than one GPU for your job. Note that requesting 2 GPUs does not automatically mean that both GPUs will be used, so it’s good practice to check nvidia-smi each time you try new software.
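
For TensorFlow, one common way to make code use several GPUs is tf.distribute.MirroredStrategy, which replicates the model on every visible GPU. The following is a minimal sketch of that technique and not part of the original mnist_classify.py. The function name build_replicated_model is our own, and the model is simply the MNIST classifier from earlier, built inside the strategy scope.

```python
def build_replicated_model():
    """Build the MNIST model under MirroredStrategy; None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None

    # Uses every GPU visible to the job, or falls back to a single replica
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        # Variables created inside the scope are mirrored across replicas
        model = tf.keras.models.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                      metrics=["accuracy"])
    return model, strategy.num_replicas_in_sync


if __name__ == "__main__":
    result = build_replicated_model()
    if result is None:
        print("TensorFlow is not installed in this environment")
    else:
        print("replicas in sync:", result[1])
```

With a 2 GPU request, the replica count printed should be 2. If it is 1, the second GPU is not being used, which nvidia-smi will confirm.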

Other Machine Learning applications

We’ve worked through a detailed approach for running TensorFlow jobs. Much of it applies to other frameworks such as PyTorch, which are also available via pip and conda. Some packages involve additional dependencies which may not be available in the standard python package repositories, and which need to be installed manually from code repositories. Please get in touch if you need extra assistance.

Visualisation with TensorBoard

TensorBoard is a web-based visualisation tool that lets you analyse the progress of your training. It comes installed with TensorFlow and includes the following features:

  • tracking metrics such as loss and accuracy
  • displaying image/audio data
  • visualising the model graph

To visualise your data using TensorBoard, please see our Using TensorBoard via OnDemand page on our docs site for further information.

References

Written by Simon Butcher
Head of Research Applications. He likes open source software, maths and problem-solving.