Running Machine Learning workloads on Apocrita

Simon Butcher · Mar 22, 2019 · 15 mins read

In this tutorial we’ll be showing you how to run a TensorFlow job using the GPU nodes on the Apocrita HPC cluster. We will expand upon the essentials provided on the QMUL HPC docs site, and provide more explanation of the process. We’ll start with installation, and run through some simple tasks and benchmarks, along with tips on how to check if the GPU is being used effectively. Finally we’ll demonstrate a more complex real-world example that you can adapt for your own jobs.

Available hardware

GPU cards can provide huge acceleration to certain workloads, particularly in the field of Machine Learning.

The QMUL Apocrita HPC cluster has the following GPU-enabled nodes:

  • 4 nxg nodes with NVIDIA Kepler K80 (effectively dual K40) cards
  • 3 sbg nodes with 4 x NVIDIA Volta V100 cards each.
  • 2 wn nodes running POWER architecture CPUs, and 4 x NVIDIA Volta V100 cards. While the POWER nodes can run TensorFlow very effectively, the installation instructions in this tutorial will differ on these machines, and will be covered in another tutorial in future.


Using pip and virtualenv

TensorFlow for GPU is provided as a compiled package for the pip and conda environments, and hence can be installed by the user. For simplicity we will focus on the pip method. The TensorFlow instructions for pip and conda are also provided on the Apocrita HPC documentation site.

The procedure follows the standard method for virtual environments on a shared system.

Virtual environments allow us to install different collections of Python packages without experiencing conflicts or versioning issues.

Loading applications using the module command

Running module avail python will show the available python versions; module load python without the version number will load the default version into the current session, and will also provide the pip and virtualenv commands. On Apocrita, the default python module version is a recent python3 version, shown below:

$ module avail python
----------- /share/apps/environmentmodules/centos7/general ---------------
python/2.7.15  python/3.6.3(default)
$ module load python
$ module list
Currently Loaded Modulefiles:
 1) python/3.6.3(default)

Installing TensorFlow GPU package in a virtual environment

We will now demonstrate how to install the TensorFlow GPU package, using the following steps:

  • load the module
  • set up a new virtual environment, called tensorgpu
  • activate the virtual environment
  • install the TensorFlow GPU package into the active environment

module load python
virtualenv tensorgpu
source tensorgpu/bin/activate
pip install tensorflow-gpu

Any TensorFlow dependencies will be installed at the same time. Notice that the session prompt becomes prefixed by the name of the currently activated virtualenv, as a handy visual reminder. You can deactivate the current virtualenv with the deactivate command.

Now we have a virtual environment which can be loaded again on demand. To do so in a new session, or job script, we load the python module and source our virtualenv. Ensure you load the same python module that was used to create the virtualenv, to benefit from thread optimisation and shared library support.

module load python
source tensorgpu/bin/activate
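
As a quick sanity check in a new session or job script, the following minimal sketch reports whether the running Python interpreter belongs to a virtual environment. The function name in_virtualenv is our own choice for illustration, and the check relies on standard attributes of the sys module rather than anything specific to Apocrita.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix when an environment is active.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.base_prefix differ from sys.prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

If the tensorgpu environment has been sourced, the reported prefix should point inside the tensorgpu directory.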

While in an activated environment, running the pip freeze command will show installed packages and their version number. It’s good practice to keep a copy of this output in case you need to re-create this environment in future.

TensorFlow and CUDA/CUDNN library versions

The GPU version of TensorFlow requires the CUDA and CUDNN modules to be loaded in the environment. Loading the correct CUDNN module will load the accompanying CUDA version as a dependency. Loading the wrong CUDA/CUDNN module for your TensorFlow version will cause errors at runtime and a fallback to CPU-only mode.

TensorFlow version   CUDNN version   CUDA version
2.1                  7.6             10.1
1.13.1 - 2.0         7.4             10
1.5 - 1.12           7               9
<1.4                 6               8
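
The table can also be expressed as a small lookup, which may be handy when deciding which module to load. This is only a sketch of the table as printed: the function names are hypothetical, only plain numeric version strings are handled, and versions that fall between the listed ranges (such as 1.4 or 1.13.0) return None rather than a guess.

```python
def parse_version(text):
    """Turn a version string such as '1.15.0' into a tuple like (1, 15, 0)."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)  # pad '2.1' to (2, 1, 0)
    return tuple(parts[:3])


def required_libraries(tf_version):
    """Return (cudnn_version, cuda_version) from the table, or None if not listed."""
    version = parse_version(tf_version)
    major_minor = version[:2]
    if major_minor == (2, 1):
        return ("7.6", "10.1")
    if version >= (1, 13, 1) and major_minor <= (2, 0):
        return ("7.4", "10")
    if (1, 5) <= major_minor <= (1, 12):
        return ("7", "9")
    if major_minor < (1, 4):
        return ("6", "8")
    return None  # the table does not cover this version


if __name__ == "__main__":
    for v in ("1.15.0", "2.1.0", "1.12.3", "1.4.0"):
        print(v, "->", required_libraries(v))
```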

Installing a specific version of a package

Instead of installing the latest package, for compatibility reasons, you may require a specific version. For example, pip install tensorflow-gpu==1.15.0 will install the exact version, if it is available.

Bulk install of packages using a requirements file

Given a requirements file in the format produced by pip freeze, running pip install -r requirements.txt will install all listed packages in a rapid and reproducible manner.

For example, given a set of required packages for your job, make a requirements.txt file containing the packages (and version numbers as necessary). The following list is just an example of what that might look like:


Create a fresh environment (which we will call myenv) and install the packages:

module load python
virtualenv myenv
source myenv/bin/activate
pip install -r requirements.txt

Additional dependencies will be pulled in as required. The preferred approach, however, is to supply the whole output of pip freeze from a known good virtualenv you have set up previously, which will also include the dependencies.
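
To illustrate the difference between a loosely specified requirements file and full pip freeze output, the sketch below lists the entries that are not pinned to an exact version with ==. The function name and the sample entries (numpy, scipy) are hypothetical and chosen only for illustration; the comment handling is deliberately simplistic.

```python
def unpinned_requirements(lines):
    """Return the requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options such as -r
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    example = [
        "tensorflow-gpu==1.15.0",
        "numpy",            # hypothetical entry with no version given
        "# a comment line",
        "scipy>=1.0",       # hypothetical entry with a minimum version only
    ]
    print(unpinned_requirements(example))
```

Output from pip freeze uses the name==version form for packages installed from the package index, so a file produced that way should normally yield an empty list here.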

Running a simple job

All work must be submitted via the job scheduler, to ensure optimal and fair use of resources. This basic job will check that you can access a GPU node, load your environment, run TensorFlow and output the TensorFlow version. Before running a GPU job, you need to request addition to the GPU user access list and provide an example of a typical job script you will be running. This helps us avoid situations where a user runs a lot of jobs that request GPU resources but don’t use them.

In a text editor, create the file basic.qsub. Note that it’s best to create and edit files using a text editor on the HPC system, such as vim, nano or emacs, rather than creating them on your local workstation. This avoids a common issue with Windows control characters, and also ensures a more streamlined workflow.

#$ -cwd
#$ -j y
#$ -pe smp 8        # Request cores (8 per GPU)
#$ -l h_vmem=7.5G   # Request RAM (7.5GB per core)
#$ -l h_rt=1:0:0    # Max 1hr runtime (can request up to 240hr)
#$ -l gpu=1         # Request 1 GPU
#$ -N basicGPU      # Name for the job (optional)

# Load the necessary modules
module load python
module load cudnn/7.5-cuda-10.0

# Load the virtualenv containing the tensorflow-gpu package
source ~/tensorgpu/bin/activate

# Report the TensorFlow version
python -c 'import tensorflow as tf; print(tf.__version__)'

Running qsub basic.qsub will tell the scheduler to add the job to the queue. You can verify this with the qstat command. Note that, while usually the rules about resource requests are very strict (request only what you will use), the convention is to request 8 cores per GPU.

If there are free resources, the job will run immediately and produce an output file a few seconds later containing the results of the job. The QMUL HPC docs site explains job output filenames and gives more detail on using GPU nodes.
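
The basic job only reports the TensorFlow version. A slightly more informative check is to ask TensorFlow which GPU devices it can see. The sketch below is an illustration rather than part of the original job script: the function name visible_gpus is our own, it uses tf.config.list_physical_devices where the installed release provides it, falls back to the tf.config.experimental variant found in some earlier releases, and returns None when neither is available or TensorFlow is not installed.

```python
def visible_gpus():
    """Return the names of GPUs visible to TensorFlow, or None if this cannot be determined."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow is not installed in this environment

    config = getattr(tf, "config", None)
    lister = getattr(config, "list_physical_devices", None)
    if lister is None:
        # Some earlier releases only expose the function under tf.config.experimental.
        experimental = getattr(config, "experimental", None)
        lister = getattr(experimental, "list_physical_devices", None)
    if lister is None:
        return None  # this TensorFlow release predates the device listing API
    return [device.name for device in lister("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("Could not query TensorFlow for GPU devices")
    else:
        print("GPUs visible to TensorFlow:", gpus)
```

An empty list from a job that requested a GPU suggests that TensorFlow has fallen back to CPU-only mode, for example because of a mismatched CUDA/CUDNN module.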

Running a benchmark job

Benchmark jobs get the GPU to do real work and allow us to check and compare output against expected results. The prerequisite for this job is a TensorFlow virtualenv and a copy of the benchmark, which we will obtain from GitHub. In your working directory, run the following to clone the repository.

module load git
git clone https://github.com/tensorflow/models
# Check out the r1.13.0 branch if testing against version 1.13 - 1.15
cd models
git checkout r1.13.0

Prepare bench.qsub:

#$ -cwd
#$ -j y            # Merge output and error files (optional)
#$ -pe smp 8       # Request cores (8 per GPU)
#$ -l h_vmem=7.5G  # Request RAM (7.5GB per core)
#$ -l h_rt=1:0:0   # Request 1hr runtime (max is 240hr)
#$ -m bea          # Send email on begin,end,abort
#$ -l gpu=1        # Request one GPU
#$ -N cnnbench     # Name of job (optional)

# Load necessary modules
module load python
module load cudnn/7.5-cuda-10.0

# Activate the virtualenv containing the required packages
source ~/tensorgpu/bin/activate

# Run our code
python ~/models/tutorials/image/mnist/convolutional.py

Since GPUs are the primary resource on these nodes, we ask users to standardise their CPU and RAM requests on GPU nodes, so that non-GPU resources are shared evenly between GPU devices without too much effort from users. This equates to 8 cores and 7.5GB RAM per core, for each GPU requested.
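
The arithmetic behind this convention can be sketched as follows. The function names are hypothetical, and the directive for more than one GPU (gpu=2 and so on) is an assumption extrapolated from the single-GPU scripts in this tutorial.

```python
CORES_PER_GPU = 8      # cores requested for each GPU
RAM_PER_CORE_GB = 7.5  # RAM requested per core, in GB


def gpu_job_request(num_gpus):
    """Return the standard core and RAM figures for a job using num_gpus GPUs."""
    cores = CORES_PER_GPU * num_gpus
    return {
        "gpus": num_gpus,
        "cores": cores,
        "ram_per_core_gb": RAM_PER_CORE_GB,
        "total_ram_gb": cores * RAM_PER_CORE_GB,
    }


def directives(num_gpus):
    """Format the request as scheduler directive lines for a job script."""
    request = gpu_job_request(num_gpus)
    return [
        "#$ -pe smp {}".format(request["cores"]),
        "#$ -l h_vmem={}G".format(request["ram_per_core_gb"]),
        "#$ -l gpu={}".format(request["gpus"]),
    ]


if __name__ == "__main__":
    print(gpu_job_request(1))  # 8 cores, 60.0 GB RAM in total
    print("\n".join(directives(2)))
```

Note that the h_vmem value stays at 7.5G regardless of the number of GPUs, because it is requested per core; only the core count scales.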

Submit the job with qsub bench.qsub and check the status of your queued and running jobs with qstat.

$ qsub bench.qsub
Your job 630581 ("cnnbench") has been submitted
$ qstat
job-ID prior    name     user    state submit/start at     queue       slots ja-task-ID
630581 15.00646 cnnbench abc123  r     03/22/2019 09:57:00 all.q@nxg1  8

We have added -m bea in the job script to send an email to notify when the job begins/ends/aborts.

Checking the progress of your job

If your job starts immediately, you can ssh to the node and run nvidia-smi to check the GPU device activity and attached processes.

Your process will be a python process. Note that another user will likely be using one of the other GPUs, and their process may also be python. The first few lines of the job output file cnnbench.o<jobid> will mention a GPU device being used (note that it might state GPU 0 even when using another GPU device, because only the GPUs you have requested are visible to you, starting at GPU 0).

| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           On   | 00000000:83:00.0 Off |                    0 |
| N/A   46C    P0   113W / 149W |    767MiB / 11441MiB |     74%      Default |
|   1  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   70C    P0   135W / 149W |    767MiB / 11441MiB |     99%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0     17053      C   python                                       756MiB |
|    1     18998      C   python                                       756MiB |

Figure 1: Output of nvidia-smi, showing 2 GPUs in use

One way to check which of the tasks is yours is to use the ps command and search for the process IDs attached to each GPU. For example:

ps -f 17053 18998
abc123   17053 13708 99 16:16 ?        Sl    55:13 python ./20190320_2222.py
xyz987   18998 19483 99 10:45 ?        Sl   675:27 python tuning.py

In this case, process ID 17053 is owned by user abc123 and is using GPU 0. At this particular moment, GPU 0 shows 767MiB of GPU RAM in use and 74% GPU utilisation. The GPU usage may fluctuate over the course of the job, but consistently low figures may be an indication that some settings could be tweaked to gain better performance.
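
The manual steps above (read the process IDs from nvidia-smi, then pass them to ps) can be sketched in Python. This example only parses text in the layout shown in Figure 1; other driver versions may print the process table differently, and the function names are our own.

```python
import re

# Matches process rows in the layout shown in Figure 1, for example:
# |    0     17053      C   python                                       756MiB |
PROCESS_ROW = re.compile(r"\|\s+(\d+)\s+(\d+)\s+(\w+)\s+(\S.*?)\s+(\d+)MiB\s+\|$")

SAMPLE = """\
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0     17053      C   python                                       756MiB |
|    1     18998      C   python                                       756MiB |
"""


def parse_processes(smi_output):
    """Extract GPU index, PID, process name and memory use from nvidia-smi text."""
    rows = []
    for line in smi_output.splitlines():
        match = PROCESS_ROW.match(line.strip())
        if match:
            gpu, pid, _type, name, memory = match.groups()
            rows.append({"gpu": int(gpu), "pid": int(pid),
                         "name": name, "memory_mib": int(memory)})
    return rows


def ps_command(rows):
    """Build the ps command that shows the owner of each listed process."""
    return "ps -f " + " ".join(str(row["pid"]) for row in rows)


if __name__ == "__main__":
    processes = parse_processes(SAMPLE)
    print(processes)
    print(ps_command(processes))
```

For the sample text this builds the same ps -f 17053 18998 command used above.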

We have confirmed that the job is using a GPU, and we will now inspect the job output file. The file is created in the directory from which you submitted the job, and its name combines the job name and the job ID number. If no job name is provided in the job script, the file name of the script is used instead.

In this example, the output file is cnnbench.o630581, which we can inspect using less cnnbench.o630581:

Loading cudnn/7.5-cuda-10.0
  Loading requirement: cuda/10.0.130
2019-03-22 09:57:21.383876: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-22 09:57:21.652428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:83:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-03-22 09:57:21.652506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-22 09:57:26.776585: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-22 09:57:26.776700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-03-22 09:57:26.776745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-03-22 09:57:26.780454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:83:00.0, compute capability: 3.7)
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Step 0 (epoch 0.00), 102.0 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.6%
Step 100 (epoch 0.12), 10.1 ms
Minibatch loss: 3.248, learning rate: 0.010000
Minibatch error: 7.8%
Validation error: 7.6%
Step 200 (epoch 0.23), 8.4 ms

We can see that the job initialised with a GPU device, and the job is progressing. It’s important to inspect this information, as a badly configured job may not utilise the GPU at all, resulting in very poor performance while blocking a GPU from use by another researcher.

To check the latest progress of the job, you can use tail -f <filename> to show the end of the file, and continue to output data as the file grows.
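
The same inspection can be done programmatically. The sketch below scans job output in the format shown above, records whether a GPU device was created, and collects the step timings and validation errors. The patterns match the log lines from this particular benchmark and TensorFlow version; other versions may word their messages differently, and the function name is hypothetical.

```python
import re

STEP = re.compile(r"^Step (\d+) \(epoch ([\d.]+)\), ([\d.]+) ms")
VALIDATION = re.compile(r"^Validation error: ([\d.]+)%")

SAMPLE_LOG = """\
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory)
Step 0 (epoch 0.00), 102.0 ms
Validation error: 84.6%
Step 100 (epoch 0.12), 10.1 ms
Validation error: 7.6%
Step 200 (epoch 0.23), 8.4 ms
"""


def summarise_log(text):
    """Summarise GPU initialisation and training progress from job output text."""
    summary = {"gpu_device_created": False, "steps": [], "validation_errors": []}
    for line in text.splitlines():
        # The device creation message names the GPU device that TensorFlow will use.
        if "Created TensorFlow device" in line and "device:GPU:" in line:
            summary["gpu_device_created"] = True
        step = STEP.match(line)
        if step:
            summary["steps"].append(
                (int(step.group(1)), float(step.group(2)), float(step.group(3))))
        validation = VALIDATION.match(line)
        if validation:
            summary["validation_errors"].append(float(validation.group(1)))
    return summary


if __name__ == "__main__":
    print(summarise_log(SAMPLE_LOG))
```

To use it on a real job, read the output file into a string and pass it to summarise_log in place of SAMPLE_LOG.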

Use of multiple GPUs

If your code supports it, you may request more than one GPU for your job. Note that requesting 2 GPUs does not automatically mean that both GPUs will be used, so it’s good practice to check nvidia-smi each time you try new software.

Other Machine Learning applications

We’ve worked through a detailed approach for running TensorFlow jobs, which can largely be applied to other frameworks such as PyTorch, which are also available via pip and conda. Some packages involve additional dependencies, which may not be available in the standard Python package repositories, and require installing manually from code repositories. Please get in touch if you need extra assistance.

Visualisation with TensorBoard

TensorBoard is a web-based visualisation tool that allows you to analyse the progress of your training. It comes installed with TensorFlow and can be invoked with tensorboard --logdir=/path/to/directory --bind_all. If invoked within a submitted job, or via an interactive session, this will start an interactive web interface on the compute node for the duration of the job. By default, TensorBoard uses port 6006; this may be changed by passing the --port=PORT switch to the tensorboard command, substituting an integer port number for PORT.

An SSH tunnel will be required to access the web interface because compute nodes are not directly accessible from outside the cluster. For example, if TensorBoard is running on node nxg1 on port 12345, you can forward this through an SSH tunnel to access it on your desktop via a web browser:

ssh -L 8888:nxg1:12345 abc123@login.hpc.qmul.ac.uk

This will open a login session to Apocrita for username abc123, and request a password as usual. At the same time, it will establish a tunnel that will serve up the contents of nxg1:12345 to http://localhost:8888.
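
The tunnel described here follows the general form ssh -L local_port:node:remote_port user@login_host. As a small illustration, the sketch below assembles that command from its parts; the function name is hypothetical and the default login host is the one used in this tutorial.

```python
def tunnel_command(user, node, remote_port, local_port,
                   login_host="login.hpc.qmul.ac.uk"):
    """Build the ssh argument list that forwards local_port to node:remote_port."""
    forward = "{}:{}:{}".format(local_port, node, remote_port)
    return ["ssh", "-L", forward, "{}@{}".format(user, login_host)]


if __name__ == "__main__":
    # TensorBoard on nxg1 port 12345, viewed locally at http://localhost:8888
    print(" ".join(tunnel_command("abc123", "nxg1", 12345, 8888)))
```

For the values in the example above, this prints ssh -L 8888:nxg1:12345 abc123@login.hpc.qmul.ac.uk.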

Figure 2: TensorBoard interface.

Written by Simon Butcher
Head of Research Applications. He likes open source software, maths and problem-solving.