Using Apocrita's GPUs with OpenMP

On Apocrita we can use OpenMP to execute code on GPU devices. This post looks at how to compile such programs and submit them to run on the GPU nodes. The post assumes that you have code, already developed and tested, which is ready for deployment, and that you have been granted access to the GPU nodes.

If you do not yet have suitable code, or do not have access to the GPU nodes, please see below to contact us.

OpenMP target offload

In this post we'll look at OpenMP programs which use the target construct to offload execution to a non-host device. On Apocrita, the non-host devices are GPUs. We assume you already have your own code ready to deploy on the GPU nodes, but for this post we'll use the following C and Fortran examples, saved as test-gpu-openmp.c and test-gpu-openmp.f90.


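#include <math.h>
#include <stdio.h>

int main () {
  float x, y;
  int i, j;

  x = 0.0f;  /* assumed starting value: not shown in the original listing */
#pragma omp target map(tofrom:x)
#pragma omp parallel for private(j,y) reduction(+:x)
  for (i=0; i<1000000; i++) {
    y = 1.0f;  /* assumed starting value for each outer iteration */
    for (j=0; j<1000; j++) y=(y+sqrtf(y)/2);
    x += y;    /* assumed accumulation into the reduction variable */
  }

  printf("Our final result is: %f\n", x);
  return 0;
}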


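program test_gpu_openmp
  implicit none

  integer :: i, j
  real :: x, y

  x = 0.  ! assumed starting value: not shown in the original listing
!$omp target map(tofrom:x)
!$omp parallel do private(j,y) reduction(+:x)
  do i=1,1000000
     y = 1.  ! assumed starting value for each outer iteration
     do j=1,1000
        y = y + sqrt(y)/2
     end do
     x = x + y  ! assumed accumulation into the reduction variable
  end do
!$omp end target

  print '("Our final result is: ",G0.5)', x
end program test_gpu_openmp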
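The result printed by the examples is not easy to check by hand. As a minimal sketch, assuming the same target and parallel for directives, the following function uses a reduction whose answer is known in advance: the sum of the integers 0 to n-1 is n*(n-1)/2. The function name offload_sum is hypothetical and is not part of the examples in this post.

```c
/* Sum the integers 0..n-1 inside a target region.
   offload_sum is a hypothetical name used only for this sketch.
   The expected value is n*(n-1)/2, which makes the result easy to verify. */
double offload_sum(int n) {
  double total = 0.0;

  /* total is copied to the device before the region and back afterwards */
#pragma omp target map(tofrom:total)
#pragma omp parallel for reduction(+:total)
  for (int i = 0; i < n; i++) {
    total += (double)i;
  }

  return total;
}
```

Calling offload_sum(1000) from a main function should give 499500 whether the region runs on a GPU, on the host, or in a build without OpenMP, because the directives are then simply ignored.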

We have on Apocrita a number of GPU node types. For the examples in this post we'll first use the short interactive queues to access a node. To access one of our usual nodes we can use the qlogin command:

qlogin -pe smp 8 -l gpu=1

For more complete detail of how to request GPU resources for your jobs please see our documentation pages linked below.

To compile your code, and the examples, you will need to use a suitable compiler. Currently, the only suitable compiler on Apocrita, for all nodes, is GCC (version 10.2.0 or later). Other compilers, and older versions of GCC, may fail to understand the OpenMP offload syntax or may fail to generate offload code. For example, versions of GCC before 10.2.0 on Apocrita do not have offload support, and the Intel compilers capable of producing offload code do not target the GPU device types we have.

Compiling with GCC

Once we have a session on one of Apocrita's GPU nodes we can move to a directory with the source code example files above. To compile with GCC we first load a module which provides offload support, in this case version 10.2.0:

module load gcc/10.2.0

To check that this installation of GCC supports the desired offload we can ask the gcc command:

$ gcc -v

The output should report the offload target nvptx-none (typically on a line beginning OFFLOAD_TARGET_NAMES). This is the required target for our NVIDIA devices: if gcc -v doesn't report this target then code will not be offloaded to our GPUs, which may mean that you haven't loaded an appropriate compiler module.

To compile our C and Fortran examples with offloading we need to enable OpenMP in the compiler with the -fopenmp compile option. For our Fortran example we can compile like:

gfortran -fopenmp test-gpu-openmp.f90 -o test-fortran

In our C example, we use the function sqrtf so we need to link against the maths library. We need to link against this library for both the host code (-lm) and the device code (-foffload=-lm):

gcc -fopenmp test-gpu-openmp.c -o test-c -lm -foffload=-lm

Our Fortran example doesn't require a comparable -foffload=... linker option, but some Fortran code may require the GFortran runtime library on the device. In that case -foffload=-lgfortran should be specified. The host-side option -lgfortran does not need to be given because it is implied when linking with gfortran directly:

gfortran -fopenmp offload-rtl.f90 -foffload=-lgfortran -o test-fortran

With the compilation complete we should now have two compiled programs, test-c and test-fortran, ready to execute.

Running the compiled programs

Once we have the compiled program and a session on a suitable GPU node we can run the program and watch it execute on the GPU device. We can continue to use the same qlogin session that we had when compiling, but we can also create a new session. Either way, we must be sure to have loaded the same module that we used when compiling our code. For our examples that would be gcc/10.2.0. We advise further that you use the same GPU selection type (-l gpu_type=volta for example) for maximum portability if you are compiling and running in separate sessions.

On every node, the compiled program can be run like any other, not requiring any special options to enable offloading:

$ ./test-c &
[1] 129036
$ nvidia-smi pmon
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0          -     -     -     -     -     -   -
    0     129036     C    39     0     0     0   test-c
    0     129036     C    92     0     0     0   test-c
    0     129036     C    99     0     0     0   test-c
    0     129036     C    99     0     0     0   test-c

In this case we ran the program test-c in the background (using the & after the command name). The shell immediately returns the process ID of the running program (here 129036) and then we run nvidia-smi to query the state of the GPU to which the process is assigned.

After a short delay while the process attaches and starts up, we see the process, with its matching command name, reporting high utilisation in the sm column. Our example code does not need much memory, as the mem column shows.

If, when watching the process over time, we do not see it attach to the device, or utilisation stays low (particularly at 0%), then the offload may have failed.
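Alongside nvidia-smi, a program can report for itself where a target region ran. The following is a minimal sketch, assuming the standard OpenMP runtime routines omp_get_num_devices and omp_is_initial_device; the function names visible_devices, target_ran_on_host and report_offload are hypothetical and used only for illustration. The runtime calls are guarded with the _OPENMP macro so that the source also builds without -fopenmp.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of offload devices the OpenMP runtime can see (0 without OpenMP). */
int visible_devices(void) {
#ifdef _OPENMP
  return omp_get_num_devices();
#else
  return 0;
#endif
}

/* Returns 1 if a target region executes on the host, 0 if it ran on a device. */
int target_ran_on_host(void) {
  int on_host = 1;
#ifdef _OPENMP
#pragma omp target map(tofrom:on_host)
  {
    /* Evaluated wherever the region actually executes */
    on_host = omp_is_initial_device();
  }
#endif
  return on_host;
}

/* Print a short summary; intended to be called once, early in main. */
void report_offload(void) {
  printf("Offload devices visible: %d\n", visible_devices());
  printf("Target region ran on: %s\n",
         target_ran_on_host() ? "host" : "device");
}
```

If report_offload prints "host" on a GPU node, that suggests the program was built without offload support or that the wrong compiler module is loaded.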

The running examples will quickly return a final result.

Running the program in batch mode

Until now we've been using interactive jobs in the short queues. As with other programs we want to run on the cluster, once we're satisfied that the jobs are set up correctly we should submit them to the scheduler to run in batch as resources become available.

For this example, we've seen that our program has compiled correctly and has been running on the GPU devices as we'd expect. If we've compiled the program using GCC 10.2.0 targeting Volta nodes, for example, we can submit a script like:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l gpu=1
#$ -l gpu_type=volta
#$ -l h_rt=240:0:0

module load gcc/10.2.0

# Run the program compiled earlier (assumed here to be test-c)
./test-c

Using target constructs without the GPUs

We've seen how target constructs can be used to offload execution to GPU devices. However, we can also use source code which has these constructs without using the GPU nodes, for testing or other reasons.

With the GCC compilers we can enable OpenMP using -fopenmp but disable offloading with -foffload=disable:

gcc -fopenmp -foffload=disable test-gpu-openmp.c -o test-c -lm
gfortran -fopenmp -foffload=disable test-gpu-openmp.f90 -o test-fortran

Equally, we could disable OpenMP compilation completely by using -fno-openmp or not giving -fopenmp.
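When OpenMP is disabled the compiler ignores the directives, but calls to OpenMP runtime routines and the omp.h header are still ordinary C and must be guarded. The following is a minimal sketch of that guard, assuming the standard _OPENMP macro and the runtime routine omp_get_max_threads; the function names openmp_version and host_thread_limit are hypothetical.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Value of the _OPENMP macro (a yyyymm date identifying the supported
   specification), or 0 when the source is built without OpenMP. */
long openmp_version(void) {
#ifdef _OPENMP
  return _OPENMP;
#else
  return 0L;
#endif
}

/* Upper bound on host threads for a parallel region, or 1 in a serial build. */
int host_thread_limit(void) {
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}
```

A build with -fopenmp reports a non-zero version, while a build with -fno-openmp reports 0 and a thread limit of 1, so the same source file works in both cases.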

Finally, code compiled with offload support can also run where no GPU device is available: when execution reaches the target construct and no offload device (GPU) is detected, the region runs on the host (CPU) instead. For extreme testing cases, offloading that was enabled at compile time can also be disabled at run time, even when a GPU is available, by setting an environment variable. This should generally be avoided, however, as jobs requesting GPU nodes are expected to make good use of the GPU devices.

Further documentation for Apocrita

For more detail on compiling on Apocrita, including using the GPUs, please see our documentation page on compiling and our introductory post about compilation. We also have more detailed documentation about how to submit jobs to run on the GPU nodes.

Contact us

The GPU nodes on Apocrita are restricted to approved users. Contact us, with details of your requirements and example code, to request access to these nodes.

Also contact us if you would like support in porting your existing OpenMP code to use GPUs or want an assessment of the possible value in using GPUs.