Running Machine Learning workloads on Apocrita

Posted on Fri 22 March 2019 in tutorial by Simon Butcher

In this tutorial we'll be showing you how to run a TensorFlow job using the GPU nodes on the Apocrita HPC cluster. We will expand upon the essentials provided on the QMUL HPC docs site, and provide more explanation of the process. We'll start with installation, and run through some simple tasks and benchmarks, along with tips on how to check if the GPU is being used effectively. Finally we'll demonstrate a more complex real-world example that you can adapt for your own jobs.

Available hardware

GPU cards can provide huge acceleration to certain workloads, particularly in the field of Machine Learning.

The QMUL Apocrita HPC cluster has the following GPU-enabled nodes:

  • 4 nxg nodes with NVIDIA Kepler K80 (effectively dual K40) cards.
  • 2 sbg nodes with 2 x NVIDIA Volta V100 cards.
  • 2 wn nodes with POWER architecture CPUs and 4 x NVIDIA Volta V100 cards. While the POWER nodes can run TensorFlow very effectively, the installation instructions in this tutorial will differ on those machines, and will be covered in a future tutorial.

Installation

Using pip and virtualenv

TensorFlow for GPU is provided as a compiled package for the pip and conda environments, and hence can be installed by the user. For simplicity we will focus on the pip method. The TensorFlow instructions for pip and conda are also provided on the Apocrita HPC documentation site.

The procedure follows the standard method for virtual environments on a shared system.

Virtual environments allow us to install different collections of Python packages without conflicts or versioning issues. An outstanding issue with virtualenv requires --include-lib to be added to the virtualenv creation command.

Loading applications using the module command

Running module avail python will show the available python versions; module load python without the version number will load the default version into the current session, and will also provide the pip and virtualenv commands. On Apocrita, the default python module version is a recent python3 version, shown below:

$ module avail python
----------- /share/apps/environmentmodules/centos7/general ---------------
python/2.7.15  python/3.6.3(default)
$ module load python
$ module list
Currently Loaded Modulefiles:
 1) python/3.6.3(default)

Use Python 3 instead of Python 2

The Python project has announced that Python 2 will not receive any updates, including security updates, after Jan 1, 2020, so you should ensure that your code is Python 3 compliant.
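You can confirm that the loaded python module provides a Python 3 interpreter:

$ python --version
Python 3.6.3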

Installing TensorFlow GPU package in a virtual environment

We will now demonstrate how to install the TensorFlow GPU package, using the following steps:

  • load the module
  • set up a new virtual environment, called tensorgpu
  • activate the virtual environment
  • install the TensorFlow GPU package into the active environment

module load python
virtualenv --include-lib tensorgpu
source tensorgpu/bin/activate
pip install tensorflow-gpu

TensorFlow and CUDA library versions

Starting with tensorflow-gpu version 1.13, the CUDA library version used to build the pip package changed from 9.0.176 to 10.0. Loading the incorrect CUDA/CUDNN module for the relevant TensorFlow version will result in errors at runtime.
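If you are unsure which TensorFlow version is installed in your virtualenv, pip can report it:

pip show tensorflow-gpu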

Any TensorFlow dependencies will be installed at the same time. Notice that the session prompt becomes prefixed by the name of the currently activated virtualenv, as a handy visual reminder. You can deactivate the virtualenv with the deactivate command.
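For example, the prompt changes as follows (illustrative):

(tensorgpu) $ deactivate
$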

Now we have a virtual environment which can be loaded again on demand. To do so in a new session or job script, we load the python module and source our virtualenv.

module load python
source tensorgpu/bin/activate

While in an activated environment, running the pip freeze command will show the installed packages and their version numbers. It's good practice to keep a copy of this output in case you need to re-create the environment in future.
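For example, to keep a copy alongside your code:

pip freeze > requirements.txt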

Installing packages in a virtualenv only needs to be done once

A common mistake new users make is to include the virtualenv creation and pip install commands in their job script. However, after the correct packages have been installed, all that is required to use them is to activate the virtualenv (from within your job script, etc).

Installing a specific version of a package

For compatibility reasons, you may require a specific version of a package rather than the latest one. For example, pip install tensorflow-gpu==1.12.0 will install that exact version, if it is available.

Bulk install of packages using a requirements file

A requirements file, in the format produced by pip freeze, allows all listed packages to be installed rapidly and reproducibly with pip install -r requirements.txt.

For example, given a set of required packages for your job, make a requirements.txt file containing the packages (and version numbers as necessary).

Keras==2.2.4
matplotlib==3.0.3
pandas==0.24.2
sklearn
tensorflow-gpu

Create a fresh environment (which we will call myenv) and install the packages:

module load python
virtualenv --include-lib myenv
source myenv/bin/activate
pip install -r requirements.txt

Additional dependencies will be pulled in as required. A preferred approach is to supply the whole output of pip freeze from a known good virtualenv you have set up previously, since this also pins the dependencies.

Running a simple job

All work must be submitted via the job scheduler, to ensure optimal and fair use of resources. This basic job will check that you can access a GPU node, load your environment, run TensorFlow and output the TensorFlow version. Before running a GPU job, you need to request addition to the GPU user access list, providing an example of a typical job script you will be running; this helps us avoid situations where a user runs many jobs that request GPU resources but do not use them.

In a text editor, create the file basic.qsub. Note that it's best to create and edit files using a text editor on the HPC system, such as vim, nano or emacs, rather than creating them on your local workstation. This avoids a common issue with Windows control characters, and also ensures a more streamlined workflow.

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8        # Request cores (8 per GPU)
#$ -l h_vmem=7.5G   # Request RAM (7.5GB per core)
#$ -l h_rt=1:0:0    # Max 1hr runtime (can request up to 240hr)
#$ -l gpu=1         # Request 1 GPU
#$ -N basicGPU      # Name for the job (optional)
# Assign the correct GPU card
export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}
# Load the necessary modules
module load python
module load cudnn/7.5-cuda-10.0
# Load the virtualenv
source ~/tensorgpu/bin/activate
# Report the TensorFlow version
python -c 'import tensorflow as tf; print(tf.__version__)'

Running qsub basic.qsub will tell the scheduler to add the job to the queue. You can verify this with the qstat command. Note that, while the rules about resource requests are usually very strict (request only what you will use), the convention on GPU nodes is to request 8 cores per GPU.

If there are free resources, the job will run immediately and produce an output file a few seconds later containing the results of the job. See the Apocrita HPC docs site for an explanation of job output filenames, and for more detail on using GPU nodes.

CUDA/CUDNN

Use of TensorFlow on an NVIDIA GPU requires a driver plus the CUDA and CUDNN libraries that the package was built with. The tensorflow-gpu package will install without them, but will not run unless the libraries are present and loaded as a module inside the job script. At the time of writing, tensorflow-gpu is built using the CUDA 10.0 library. Loading the cudnn/7.5-cuda-10.0 module will load both the required CUDNN 7.5 and CUDA 10.0 libraries. TensorFlow versions prior to 1.13 require the cudnn/7.4-cuda-9.0 module instead.
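To see which CUDNN/CUDA modules are currently available on the cluster (exact versions will change over time), and to load the appropriate one:

module avail cudnn
module load cudnn/7.5-cuda-10.0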

Running a benchmark job

Benchmark jobs get the GPU to do real work and allow us to check and compare output against expected results. The prerequisites for this job are a TensorFlow virtualenv and a copy of the benchmark code, which we will obtain from GitHub. In your working directory, run the following to clone the repository.

module load git
git clone https://github.com/tensorflow/models

Prepare bench.qsub:

#!/bin/bash
#$ -cwd
#$ -j y            # Merge output and error files (optional)
#$ -pe smp 8       # Request cores (8 per GPU)
#$ -l h_vmem=7.5G  # Request RAM (7.5GB per core)
#$ -l h_rt=1:0:0   # Request 1hr runtime (max is 240hr)
#$ -m bea          # Send email on begin,end,abort
#$ -l gpu=1        # Request one GPU
#$ -N cnnbench     # Name of job (optional)
# Only expose the requested GPUs to our job
export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}
# Load necessary modules
module load python
module load cudnn/7.5-cuda-10.0
# Activate the virtualenv containing the required packages
source ~/tensorgpu/bin/activate
# Run our code
python ~/models/tutorials/image/mnist/convolutional.py

Since GPUs are the primary resource on these nodes, we ask users to standardise their CPU and RAM requests on GPU nodes, so that non-GPU resources are shared evenly between GPU devices. This equates to 8 cores, each with 7.5GB RAM, for every GPU requested.

Submit the job with qsub bench.qsub and check the status of your queued and running jobs with qstat.

$ qsub bench.qsub
Your job 630581 ("cnnbench") has been submitted
$ qstat
job-ID prior    name     user    state submit/start at     queue       slots ja-task-ID
---------------------------------------------------------------------------------------
630581 15.00646 cnnbench abc123  r     03/22/2019 09:57:00 all.q@nxg1  8

We have added -m bea in the job script to send notification emails when the job begins, ends or aborts.

Checking the progress of your job

If your job starts immediately, you can ssh to the node and run nvidia-smi to check GPU device activity and attached processes.
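For example, taking the node name from the queue column of the qstat output (nxg1 in this case):

ssh nxg1
nvidia-smi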

Your process will appear as a python process. Note that another user will likely be using one of the other GPUs, possibly also with a python process. The first few lines of the job output file cnnbench.o<jobid> will mention the GPU device being used (note that it might state GPU 0 even when running on another physical GPU, because CUDA_VISIBLE_DEVICES only exposes the GPUs you have requested, numbered from 0).

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:83:00.0 Off |                    0 |
| N/A   46C    P0   113W / 149W |    767MiB / 11441MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:84:00.0 Off |                    0 |
| N/A   70C    P0   135W / 149W |    767MiB / 11441MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17053      C   python                                       756MiB |
|    1     18998      C   python                                       756MiB |
+-----------------------------------------------------------------------------+

Figure 1: Output of nvidia-smi, showing 2 GPUs in use.

One way to check which of the tasks is yours is to use the ps command and search for the process IDs attached to each GPU. For example:

ps -f 17053 18998
UID        PID  PPID  C STIME TTY      STAT   TIME CMD
abc123   17053 13708 99 16:16 ?        Sl    55:13 python ./20190320_2222.py
xyz987   18998 19483 99 10:45 ?        Sl   675:27 python tuning.py

In this case, process ID 17053 is owned by user abc123 and is using GPU 0, which at this particular moment is consuming 767MiB of GPU RAM at 74% GPU utilisation (the python process itself accounts for 756MiB). The GPU usage may fluctuate over the course of the job, but consistently low figures may be an indication that some settings could be tweaked to gain better performance.

We have confirmed that the job is using a GPU, and we will now inspect the job output file. The file is created in the directory from which you submitted the job, and its name is a concatenation of the job name and the job ID number. If no job name is provided in the job script, the name of the script file is used instead.

In this example, the output file is cnnbench.o630581, and we can inspect it using less cnnbench.o630581:

Loading cudnn/7.5-cuda-10.0
  Loading requirement: cuda/10.0.130
2019-03-22 09:57:21.383876: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-22 09:57:21.652428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:83:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-03-22 09:57:21.652506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-03-22 09:57:26.776585: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-22 09:57:26.776700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-03-22 09:57:26.776745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-03-22 09:57:26.780454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB m
emory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:83:00.0, compute capability: 3.7)
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
Initialized!
Step 0 (epoch 0.00), 102.0 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.6%
Step 100 (epoch 0.12), 10.1 ms
Minibatch loss: 3.248, learning rate: 0.010000
Minibatch error: 7.8%
Validation error: 7.6%
Step 200 (epoch 0.23), 8.4 ms
...

We can see that the job initialised with a GPU device, and the job is progressing.

To check the latest progress of the job, you can use tail -f <filename> to show the end of the file and continue to output data as the file grows.
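For example, for the benchmark job above:

tail -f cnnbench.o630581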

Use of multiple GPUs

If your code supports it, you may request more than one GPU for your job. Note that requesting 2 GPUs does not automatically mean that both GPUs will be used, so it's good practice to check nvidia-smi each time you try new software. Be aware that some codes do not properly respect the CUDA_VISIBLE_DEVICES environment variable. The next version of the Univa Grid Engine scheduler supports better confinement of GPU tasks, reducing the potential impact of another user's badly behaved code.
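As a sketch, a two-GPU job scales the resource request lines according to the convention above, and a one-liner using TensorFlow's device_lib is one way to confirm that both devices are visible:

#$ -pe smp 16       # Request cores (8 per GPU)
#$ -l h_vmem=7.5G   # Request RAM (7.5GB per core)
#$ -l gpu=2         # Request 2 GPUs
# Only expose the requested GPUs to our job
export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}
# List the devices TensorFlow can see; both GPUs should appear
python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'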

Other Machine Learning applications

We've worked through a detailed approach for running TensorFlow jobs, and this can largely be applied to other frameworks such as PyTorch, which are also available via pip and conda. Some packages involve additional dependencies that may not be available in the standard Python package repositories, and require installing manually from code repositories. Please get in touch if you need extra assistance.

Visualisation with TensorBoard

TensorBoard is a web-based visualisation tool that allows you to analyse the progress of your training. It comes installed with TensorFlow and can be invoked with tensorboard --logdir=/path/to/directory. If invoked within your job, this will start an interactive web interface on the compute node for the duration of the job; an ssh tunnel is required to access the web interface, since compute nodes are not directly accessible from outside the cluster.
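As a minimal sketch (the logs directory, port number and training script name are illustrative), the relevant lines of a job script might look like:

# Start TensorBoard in the background, reading summaries from logs/
tensorboard --logdir=logs --port=12345 &
# Run the (hypothetical) training script that writes summaries to logs/
python train.py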

For example, if TensorBoard is running on node nxg1 on port 12345, you can forward it through an ssh tunnel and access it from your desktop on a user-defined port.

ssh -L 127.0.0.1:8888:nxg1.apocrita:12345 abc123@login.hpc.qmul.ac.uk

This will open a login session to Apocrita for username abc123, and request a password as usual. At the same time, it will establish a tunnel that serves up the contents of nxg1:12345 at http://localhost:8888.

Figure 2: TensorBoard interface.