
Unification of Memory on the Grace Hopper Nodes

New GPUs for research continue to be delivered, most notably in the new Isambard-AI cluster at Bristol. As cutting-edge GPUs are released, software engineers need to keep up with the new architectures and features they offer.

The new Grace Hopper GH200 nodes, as announced in a previous blog post, each consist of a 72-core NVIDIA Grace CPU and an H100 Tensor Core GPU. One of the key innovations is the NVIDIA NVLink Chip-to-Chip (C2C) interconnect together with unified memory, which allows data to be transferred between CPU and GPU quickly and automatically. It also allows the GPU memory to be oversubscribed, so the GPU can handle data sets much larger than it can hold, potentially tackling out-of-GPU-memory problems. This lets software engineers focus on implementing algorithms without having to think too much about memory management.

This blog post will demonstrate manual GPU memory management and introduce managed and unified memory with simple examples to illustrate their benefits. We'll try to keep this at an introductory level, but the blog does assume basic knowledge of C++, CUDA and compiling with nvcc.

Experienced users should check the CUDA C programming guide as well as NVIDIA's technical blog on memory management.

CPU and GPU Architecture

It's helpful to recap how the components of a computer interact. Both the CPU and RAM are mounted on the motherboard. The data your software is currently processing resides in RAM. When the CPU needs that data, it's quickly fetched from RAM over a high-speed bus. Most programmers working in high-level languages don't think about data fetching, as it is abstracted away by the language.

On consumer and gaming computers, a graphics card houses the GPU and VRAM. VRAM is dedicated memory the GPU can access via a high-speed bus on the graphics card. This card is typically installed in a PCIe slot. This means that transferring data between RAM and VRAM happens over the PCIe bus, which is slower compared to the internal buses.

Figure 1: A diagram showing a graphics card, which houses the GPU and VRAM, connected to the motherboard via PCIe. The motherboard also houses the CPU and RAM. The CPU and RAM transfer data between each other via fast internal buses and similarly between the GPU and VRAM.

Figure 2: A photograph of a V100 card slotted into a PCIe slot. This PCIe slot is part of a riser cable which connects the card to a motherboard.

When using GPUs as a programmer, you typically have to write code to explicitly manage memory and transfer data from RAM to VRAM. Even when using GPU-aware Python libraries such as CuPy, PyTorch and numba, it's important to minimise the amount of transfer between RAM and VRAM and to manage memory carefully so as not to exhaust it.

The Grace Hopper nodes have a slightly different architecture. Instead of connecting the GPU over PCIe, they use NVIDIA NVLink C2C, which is 7 times faster than PCIe Gen 5. This allows fast transfer of data between RAM and VRAM. It is also possible to connect multiple Grace Hopper nodes with NVLink and InfiniBand, but this is beyond the scope of this blog.

Do note that, quite commonly, the entire graphics card is referred to as the GPU. It is also common for the terms CPU and GPU to be used instead of RAM and VRAM when referring to memory. For example, some programmers may say that data is transferred from CPU to GPU over PCIe rather than from RAM to VRAM. This blog will use these technically imprecise, but widely accepted, terms.

Manual Memory Management

We will give an example of different ways to allocate memory on the GPU with a toy program. Beginner-level C++ and CUDA are assumed but important techniques and terms will be recapped. The toy program will require the following C++ standard libraries

#include <cmath>
#include <iostream>
#include <limits>
#include <random>
#include <string>
#include <vector>

Suppose we declare the following variables

float* x_data;
float* x_data_host;
std::vector<float> x_vector;

where x_data and x_data_host are pointers to arrays of floats and x_vector is such an array. Suppose n_data is a user-defined int; we can allocate an array of n_data floats and get the corresponding pointer with

x_vector.resize(n_data);
x_data_host = x_vector.data();

We can fill that vector with data, for example exponentially distributed random numbers. We can create a function which does this

// Fills the first n elements at ptr with exponentially distributed random numbers.
void GenerateRandomNumbers(float* ptr, int n) {
  std::exponential_distribution<float> prob_dist;
  std::default_random_engine rng;
  for (int i = 0; i < n; ++i) {
    ptr[i] = prob_dist(rng);
  }
}

and use it by passing the pointer to the array and the length of it

GenerateRandomNumbers(x_data_host, n_data);

Keep in mind that everything is done on the CPU so far. To transfer this data to the GPU, we first have to allocate memory on the GPU. This can be done by using the function cudaMalloc()

cudaError_t error_code;
error_code = cudaMalloc(&x_data, n_data * sizeof(float));
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code << " - Unable to cudaMalloc"
            << std::endl;
    return 1;
}

The function cudaMalloc(), like many other CUDA functions, returns an error code which the programmer should handle when something goes wrong. This can happen, for example, if we try to allocate more memory than is available on the GPU.
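Since almost every CUDA runtime call returns a cudaError_t, it can be tidier to wrap this pattern in a small helper. The macro below is a minimal sketch of our own (CHECK_CUDA is not part of the CUDA API); it uses cudaGetErrorString() to print a human-readable message.

// Hypothetical helper macro; only valid inside a function that can return an int.
#define CHECK_CUDA(call)                                            \
  do {                                                              \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      std::cout << "CUDA error " << err << " - "                    \
                << cudaGetErrorString(err) << std::endl;            \
      return 1;                                                     \
    }                                                               \
  } while (0)

// Usage:
// CHECK_CUDA(cudaMalloc(&x_data, n_data * sizeof(float)));

We will keep the explicit error handling in the snippets below so each step is visible.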

To transfer data from CPU to GPU, we use the function cudaMemcpy()

error_code = cudaMemcpy(x_data, x_data_host, n_data * sizeof(float),
                            cudaMemcpyHostToDevice);
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code
            << " - Unable to cudaMemcpyHostToDevice" << std::endl;
    return 1;
}

In this snippet, we copy the n_data floats located at x_data_host on the CPU into the GPU memory that cudaMalloc() allocated at x_data. This means that x_data and x_data_host point to copies of the same data, on the GPU and CPU respectively.

To give a toy example of running GPU code, we write a kernel which calculates the golden ratio using the iterative formula

\[ \varphi_{i+1} = \sqrt{1 + \varphi_i}\]

We treat the array x_vector as n_data lots of initial values \(\varphi_0\) . For each element in the array, we can calculate the golden ratio using the iterative formula. Thus we can verify the correctness of the result by comparing all resulting elements to

\[ \varphi =\dfrac{1+\sqrt{5}}{2}\]

This is not realistic research code, as memory is accessed sequentially and processed element-wise, but it is a simple toy example which can keep the GPU busy when given enough data, and we can verify the correctness of the results.
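For reference, a purely CPU version of this computation is just a nested loop. The helper below is a hypothetical host-side sketch, not part of the original program; the kernel parallelises the outer loop over the array elements.

// Hypothetical CPU reference: apply phi <- sqrt(1 + phi) n_iter times to each element.
void GoldenRatioCpu(float* x, int n, int n_iter) {
  for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n_iter; ++j) {
      x[i] = std::sqrt(1.0f + x[i]);
    }
  }
}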

The kernel can be written with a grid-stride loop so that it works with an arbitrary number of blocks and an arbitrary number of threads per block

__global__ void GoldenRatio(float* d_x_vector, std::size_t n_data, int n_iter) {
  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  int n_total_thread = blockDim.x * gridDim.x;

  // Number of grid-stride passes needed to cover all n_data elements.
  int n_stride = n_data / n_total_thread + (n_data % n_total_thread != 0);

  int index;
  float* x;
  for (int i_stride = 0; i_stride < n_stride; ++i_stride) {
    index = idx + i_stride * n_total_thread;
    if (index < n_data) {
      x = d_x_vector + index;
      // Repeatedly apply phi <- sqrt(1 + phi) to this element.
      for (int i = 0; i < n_iter; ++i) {
        *x = sqrtf(1.0f + *x);
      }
    }
    __syncthreads();
  }
}

where n_iter is the number of iterations to do. We set this to be n_iter = 10000.
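The launch configuration is up to the caller; because the kernel strides over the array, a modest fixed grid is enough. The values below are illustrative choices of our own, not taken from the original program.

int n_thread_per_block = 256;  // a common choice; any multiple of the warp size (32) works
int n_block = 80;              // a fixed grid; the grid-stride loop covers the whole array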

We call the kernel by passing the pointer to the GPU memory, x_data, to the kernel, as well as other parameters and the block specifications

GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);
error_code = cudaGetLastError();
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code << " - Unable to call CUDA kernel"
                << std::endl;
    return 1;
}

where n_block is the number of blocks and n_thread_per_block is the number of threads per block. We use cudaDeviceSynchronize() to tell the CPU to wait for the GPU to finish its calculations

error_code = cudaDeviceSynchronize();
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code
              << " - Unable to cudaDeviceSynchronize" << std::endl;
    return 1;
}

Lastly, we have to transfer the computed data from the GPU back to the CPU

error_code = cudaMemcpy(x_data_host, x_data, n_data * sizeof(float),
                            cudaMemcpyDeviceToHost);
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code
            << " - Unable to cudaMemcpyDeviceToHost" << std::endl;
    return 1;
}

We can test the correctness of the content of the array using

float epsilon = std::numeric_limits<float>::epsilon();
bool is_pass = true;
float golden_ratio = (1.0f + std::sqrt(5.0f)) / 2.0f;
for (int i = 0; i < n_data; ++i) {
  // Compare against the data copied back to the CPU in x_data_host.
  if (std::abs(x_data_host[i] - golden_ratio) > epsilon) {
    std::cout << "Incorrect value of " << x_data_host[i] << " detected"
              << std::endl;
    is_pass = false;
  }
}
if (is_pass) {
  std::cout << "PASS!" << std::endl;
} else {
  std::cout << "FAIL!" << std::endl;
}

At the end of the program, you should free any GPU memory you allocated

cudaFree(x_data);

While this is only a toy example, it does illustrate how a programmer has to manage memory on both the CPU and the GPU, and how long and tedious this can be.

On my computer (32 GB of RAM and an NVIDIA RTX 3060 graphics card with 12 GB of VRAM), the toy program works if the array is smaller than 12 GB (n_data \(\approx 3 \times 10^9\) ). Any larger and it fails at cudaMalloc() because it tries to allocate more memory than is available on the GPU.

CUDA error 2 - Unable to cudaMalloc

In such a scenario, the programmer should send and process the data on the GPU in batches. For this toy program, where the data is processed element-wise, this is quite simple to do. But for more coupled or random data access patterns, it may be more involved.
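As a rough sketch of what such batching could look like for this element-wise problem (the batch size and loop structure below are our own illustration, with error handling omitted for brevity):

// Process the array in fixed-size batches; here x_data is assumed to have been
// allocated on the GPU with cudaMalloc() for batch_size floats.
std::size_t batch_size = 100000000;  // roughly 400 MB of floats per batch
for (std::size_t offset = 0; offset < n_data; offset += batch_size) {
  std::size_t n_batch =
      (n_data - offset < batch_size) ? (n_data - offset) : batch_size;
  cudaMemcpy(x_data, x_data_host + offset, n_batch * sizeof(float),
             cudaMemcpyHostToDevice);
  GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_batch, n_iter);
  cudaDeviceSynchronize();
  cudaMemcpy(x_data_host + offset, x_data, n_batch * sizeof(float),
             cudaMemcpyDeviceToHost);
}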

Automatically Managed Memory

On most modern GPUs, you can use cudaMallocManaged() instead of cudaMalloc(). With cudaMallocManaged(), the allocated memory is managed by CUDA, which will transfer data back and forth between CPU and GPU for you without you having to use cudaMemcpy().

Because cudaMallocManaged() manages the memory for us, we only need one pointer

float* x_data;

We allocate n_data floats and assign the address of the data to x_data using

error_code = cudaMallocManaged(&x_data, n_data * sizeof(float));
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code
            << " - Unable to cudaMallocManaged" << std::endl;
    return 1;
}

We can treat the pointer x_data like any other pointer, for example, we can fill the array with random numbers using the CPU

GenerateRandomNumbers(x_data, n_data);

To run the kernel, call it as usual, using the pointer x_data

GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);
error_code = cudaGetLastError();
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code << " - Unable to call CUDA kernel"
                << std::endl;
    return 1;
}

After calling the kernel and cudaDeviceSynchronize(), you can also check the contents of x_data without having to use cudaMemcpy(). In terms of programming, this is convenient as both CPU functions and GPU kernels can read and write memory with the same pointer without having to explicitly transfer data.
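For example, the correctness check from earlier can read the managed array directly, with no copy back to the host:

// No cudaMemcpy needed: the CPU reads the managed allocation through x_data.
for (int i = 0; i < n_data; ++i) {
  if (std::abs(x_data[i] - golden_ratio) > epsilon) {
    is_pass = false;
  }
}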

You should still free the memory allocated by cudaMallocManaged(), using cudaFree(), at the end of the program.

Remarkably, this toy program works even if I set n_data so that the array is larger than the 12 GB of VRAM (n_data \(\approx 3 \times 10^9\) or more). Behind the scenes, CUDA transfers data between CPU and GPU for us when the program needs it, whilst avoiding exhausting GPU memory.

Managed memory is a way to use CUDA kernels on data larger than the GPU can hold without having to batch your data; CUDA will manage the memory for you. In terms of programming, this makes life easier as memory management and transfers are abstracted away.
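If profiling shows that on-demand page migration is slowing the kernel down, the managed allocation can optionally be prefetched to the GPU before the launch with cudaMemPrefetchAsync(). This is an optional hint rather than something the toy program needs, and the exact signature can vary between CUDA versions; a minimal sketch:

// Optionally hint that the managed array should be resident on GPU 0
// before the kernel runs (the default stream, 0, is used here).
int device = 0;
cudaMemPrefetchAsync(x_data, n_data * sizeof(float), device, 0);
GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);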

Admittedly, this toy program is limited in scope because it doesn't explore situations where memory access can be coupled or random. However, it demonstrates on a surface level where managed memory can be useful and how it works.

Unified Memory

The Grace Hopper nodes have the new NVIDIA NVLink C2C, which enables very fast data transfer between CPU and GPU; so fast that CPU and GPU threads can practically access both CPU and GPU memory concurrently. CUDA has a feature called unified memory which caters for this: you can pass arbitrary pointers to data to a GPU kernel, and CUDA will handle the memory management and transfers for you without you having to use cudaMallocManaged(). This means you can pass data allocated by external libraries or software directly to the GPU.

For example, we can use std::vector from the C++ standard library to allocate an array, fill it with random numbers using the CPU and pass it directly to the GPU kernel.

float* x_data;
std::vector<float> x_vector;
x_vector.resize(n_data);
x_data = x_vector.data();
GenerateRandomNumbers(x_data, n_data);
GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);

You will notice the lack of CUDA memory functions, as memory allocation and transfer have been abstracted away from us. You still have to call cudaDeviceSynchronize() to tell the CPU to wait for the GPU to finish its calculations, and you should still do the error handling.
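For completeness, the remaining boilerplate around the kernel launch in this version looks much like before:

// The launch is asynchronous: check for launch errors, then wait for the GPU
// to finish before reading x_vector on the CPU.
error_code = cudaGetLastError();
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code << " - Unable to call CUDA kernel"
              << std::endl;
    return 1;
}
error_code = cudaDeviceSynchronize();
if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code
              << " - Unable to cudaDeviceSynchronize" << std::endl;
    return 1;
}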

We do not need to use cudaFree() because the responsibility for freeing the memory lies with std::vector. The memory is freed automatically when x_vector goes out of scope.

On a Grace Hopper node, using the naive manual memory management approach, we can allocate and process an array of up to about 100 GB (n_data \(\approx 2.5 \times 10^{10}\)) before it fails due to running out of memory. However, with unified memory, we found that it can process an array of 500 GB (n_data \(\approx 1.2 \times 10^{11}\)) and perhaps more! This demonstrates that, with unified memory, both the CPU and GPU can read and write the same piece of data. As a result, the programmer can pass arbitrary data to the GPU with ease, without having to think too much about memory management.

Summary

With new technology, software engineers need to keep learning in order to make use of it. Unified memory allows programmers to focus on writing GPU code without having to worry too much about memory management. With the Grace Hopper nodes having hundreds of GB of memory, it can potentially make high-performance GPU programs easier to write and faster to run.

This blog post only scratches the surface, as we have not done any benchmarks or looked at distributed programming across Grace Hopper nodes. There is still plenty to explore with the new Grace Hopper nodes.

If you think your GPU application could make use of the new Grace Hopper nodes and need some help, do contact the ITSR team.

Acknowledgement

The illustrations of the graphics card and motherboard were obtained from vecteezy.com.