Unification of Memory on the Grace Hopper Nodes¶
New GPUs for research continue to be delivered, most notably with the new Isambard-AI cluster at Bristol. As cutting-edge GPUs are released, software engineers need to keep up with the new architectures and features they offer.
The new Grace Hopper GH200 nodes, as announced in a previous blog post, consist of a 72-core NVIDIA Grace CPU and an H100 Tensor Core GPU. One of the key innovations is the NVIDIA NVLink Chip-to-Chip (C2C) interconnect together with unified memory, which allows data to be transferred between CPU and GPU quickly and automatically. It also allows the GPU memory to be oversubscribed, so the GPU can work on datasets much larger than it can hold, potentially tackling out-of-GPU-memory problems. This lets software engineers focus on implementing algorithms without having to think too much about memory management.
This blog post will demonstrate manual GPU memory management and introduce
managed and unified memory with simple examples to illustrate its benefits.
We'll try to keep this at an introductory level, but the blog does assume basic
knowledge of C++, CUDA and compiling with nvcc.
Experienced users should check the CUDA C programming guide as well as NVIDIA's technical blog on memory management.
CPU and GPU Architecture¶
It's helpful to recap how the components of a computer interact. Both the CPU and RAM are mounted on the motherboard. The data your software is currently processing resides in RAM. When the CPU needs that data, it's quickly fetched from RAM over a high-speed bus. Most modern high-level programmers don't think about data fetching as this is abstracted away in most programming languages.
On consumer and gaming computers, a graphics card houses the GPU and VRAM. VRAM is dedicated memory the GPU can access via a high-speed bus on the graphics card. This card is typically installed in a PCIe slot. This means that transferring data between RAM and VRAM happens over the PCIe bus, which is slower compared to the internal buses.
When using GPUs as a programmer, you typically have to write code to explicitly manage memory and instruct the transfer of data from RAM to VRAM. Even when using GPU-aware Python libraries such as CuPy, PyTorch and Numba, it's important to minimise the amount of data transferred between RAM and VRAM and to manage memory carefully so as not to exhaust it.
The Grace Hopper nodes have a slightly different architecture. Instead of a PCIe slot, they use NVIDIA NVLink C2C, which is 7 times faster than PCIe Gen5. This allows fast transfer of data between RAM and VRAM. It is also possible to connect multiple Grace Hopper nodes with NVLink and InfiniBand, but this is beyond the scope of this blog.
Do note that quite commonly, the entire graphics card is referred to as the GPU. It is also common that the terms CPU and GPU may be used instead of RAM and VRAM when referring to memory. For example, some programmers may say that data is transferred from CPU to GPU over PCIe rather than RAM to VRAM. This blog will use the technically incorrect, but widely accepted, terms.
Manual Managed Memory¶
We will give an example of different ways to allocate memory on the GPU with a toy program. Beginner-level C++ and CUDA are assumed, but important techniques and terms will be recapped. The toy program requires the following C++ standard library headers
#include <cmath>
#include <iostream>
#include <limits>
#include <random>
#include <string>
#include <vector>
Suppose we declare the following variables
float* x_data;
float* x_data_host;
std::vector<float> x_vector;
where x_data and x_data_host are pointers to arrays of floats and x_vector is the vector that will hold the host array. Suppose n_data is a user-defined integer holding the array length (the kernel later takes it as a std::size_t); we can allocate an array of n_data floats and get the corresponding pointer with
x_vector.resize(n_data);
x_data_host = x_vector.data();
We can fill that vector with data, for example exponentially distributed random numbers, by creating a function which does this
// Fill ptr[0..n-1] with exponentially distributed random numbers
void GenerateRandomNumbers(float* ptr, int n) {
  std::exponential_distribution<float> prob_dist;
  std::default_random_engine rng;
  for (int i = 0; i < n; ++i) {
    ptr[i] = prob_dist(rng);
  }
}
and use it by passing the pointer to the array and the length of it
GenerateRandomNumbers(x_data_host, n_data);
Keep in mind that everything is done on the CPU so far. To transfer this data to
the GPU, we first have to allocate memory on the GPU. This can be done by using
the function cudaMalloc()
cudaError_t error_code;
error_code = cudaMalloc(&x_data, n_data * sizeof(float));
if (error_code != cudaSuccess) {
std::cout << "CUDA error " << error_code << " - Unable to cudaMalloc"
<< std::endl;
return 1;
}
The function cudaMalloc(), like many other CUDA functions, returns an error
code which the programmer should handle when something goes wrong. This can
happen, for example, if we try to allocate more memory than is available on the GPU.
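Checking every CUDA call like this quickly becomes repetitive. A common convenience, not used in the listings here, is to wrap the check in a small helper macro; the macro name CUDA_CHECK below is our own choice, not part of the original program.
// Minimal error-checking helper (illustrative; requires <cstdlib> for std::exit)
#define CUDA_CHECK(call)                                                    \
  do {                                                                      \
    cudaError_t err = (call);                                               \
    if (err != cudaSuccess) {                                               \
      std::cout << "CUDA error " << err << " - " << cudaGetErrorString(err) \
                << std::endl;                                               \
      std::exit(1);                                                         \
    }                                                                       \
  } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&x_data, n_data * sizeof(float)));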
To transfer data from CPU to GPU, we use the function cudaMemcpy()
error_code = cudaMemcpy(x_data, x_data_host, n_data * sizeof(float),
cudaMemcpyHostToDevice);
if (error_code != cudaSuccess) {
std::cout << "CUDA error " << error_code
<< " - Unable to cudaMemcpyHostToDevice" << std::endl;
return 1;
}
In this snippet, we copy n_data floats located at x_data_host on the CPU into the GPU memory we allocated earlier, whose address is held in x_data. This means that both x_data and x_data_host now point to copies of the same data, on the GPU and CPU respectively.
To give a toy example of running GPU code, we write a kernel which calculates the golden ratio using the iterative formula \(\varphi_{n+1} = \sqrt{1 + \varphi_n}\).
We treat the array x_vector as n_data independent initial values \(\varphi_0\). For each element in the array, we apply the iterative formula repeatedly, so every element converges to the golden ratio. We can therefore verify the correctness of the result by comparing all resulting elements to \(\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.618\).
This is not realistic research code, as memory is accessed sequentially and processed element-wise, but it is a simple toy example which can push the GPU to work hard when given enough data, and it lets us verify the correctness of the results.
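To make the iteration concrete, this is what it looks like on the CPU for a single starting value (a plain C++ sketch for illustration; it is not part of the toy program itself):
// Repeatedly apply phi <- sqrt(1 + phi); this converges to the golden ratio
// for any starting value phi_0 >= -1.
float GoldenRatioOnCpu(float phi_0, int n_iter) {
  float phi = phi_0;
  for (int i = 0; i < n_iter; ++i) {
    phi = std::sqrt(1.0f + phi);
  }
  return phi;  // approximately 1.6180339f
}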
The kernel can be written such that it can accept an arbitrary number of blocks with arbitrary dimensions
__global__ void GoldenRatio(float* d_x_vector, std::size_t n_data, int n_iter) {
  int idx = blockDim.x * blockIdx.x + threadIdx.x;
  int n_total_thread = blockDim.x * gridDim.x;
  // Number of grid-stride passes needed to cover all n_data elements
  int n_stride = n_data / n_total_thread + (n_data % n_total_thread != 0);
  int index;
  float* x;
  for (int i_stride = 0; i_stride < n_stride; ++i_stride) {
    index = idx + i_stride * n_total_thread;
    if (index < n_data) {
      x = d_x_vector + index;
      // Apply the iteration phi <- sqrt(1 + phi) n_iter times
      for (int i = 0; i < n_iter; ++i) {
        *x = sqrtf(1.0f + *x);
      }
    }
    __syncthreads();
  }
}
where n_iter
is the number of iterations to do. We set this to be
n_iter = 10000
.
We call the kernel by passing it the pointer to the GPU memory, x_data, as well as the other parameters and the block specification
GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);
error_code = cudaGetLastError();
if (error_code != cudaSuccess) {
std::cout << "CUDA error " << error_code << " - Unable to call CUDA kernel"
<< std::endl;
return 1;
}
where n_block
is the number of blocks and n_thread_per_block
is the number
of threads per block. We use cudaDeviceSynchronize()
to tell the CPU to wait
for the GPU to finish its calculations
error_code = cudaDeviceSynchronize();
if (error_code != cudaSuccess) {
std::cout << "CUDA error " << error_code
<< " - Unable to cudaDeviceSynchronize" << std::endl;
return 1;
}
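For reference, n_block and n_thread_per_block could be chosen along the following lines; the specific values here are our own choice rather than something prescribed by the original program, and the grid-stride loop inside the kernel handles any elements beyond the grid.
// One possible launch configuration (illustrative values only)
int n_thread_per_block = 256;  // a common default block size
std::size_t n_block_needed =
    (n_data + n_thread_per_block - 1) / n_thread_per_block;
// Cap the grid size; the grid-stride loop in the kernel covers the rest
int n_block = (n_block_needed > 65535) ? 65535 : static_cast<int>(n_block_needed);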
Lastly, we have to transfer the computed data from the GPU back to the CPU
error_code = cudaMemcpy(x_data_host, x_data, n_data * sizeof(float),
cudaMemcpyDeviceToHost);
if (error_code != cudaSuccess) {
std::cout << "CUDA error " << error_code
<< " - Unable to cudaMemcpyDeviceToHost" << std::endl;
return 1;
}
We can test the correctness of the content of the array using
// Check the host copy of the data; std::sqrt and std::abs come from <cmath>
float epsilon = std::numeric_limits<float>::epsilon();
bool is_pass = true;
float golden_ratio = (1.0f + std::sqrt(5.0f)) / 2.0f;
for (std::size_t i = 0; i < n_data; ++i) {
  if (std::abs(x_data_host[i] - golden_ratio) > epsilon) {
    std::cout << "Incorrect value of " << x_data_host[i] << " detected"
              << std::endl;
    is_pass = false;
  }
}
if (is_pass) {
  std::cout << "PASS!" << std::endl;
} else {
  std::cout << "FAIL!" << std::endl;
}
At the end of the program, you should free any GPU memory you allocated
cudaFree(x_data);
While this is only a toy example, it does illustrate how a programmer has to manage memory both on the CPU and GPU and how it can be long and tedious.
On my computer (32 GB of RAM, NVIDIA RTX 3060 graphics card with 12 GB of VRAM),
the toy program works if the array is smaller than 12 GB (n_data \(\approx 3
\times 10^9\)). Any larger and it fails at cudaMalloc() because it tries to
allocate more memory than is available on the GPU.
CUDA error 2 - Unable to cudaMalloc
In such a scenario, the programmer should send and process data on the GPU in batches. For this toy program where the data is processed element-wise, it is quite simple to do. But for more coupled or random data accesses, it may be more involved.
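For this toy program, batching could look something like the following sketch, which reuses a single fixed-size device buffer and streams the array through it in chunks. The batch size and overall structure here are our own illustrative choices, and error handling is omitted for brevity.
// Sketch of batched processing: reuse one modest device buffer and stream
// the array through it one chunk at a time.
std::size_t batch_size = 1 << 28;  // 2^28 floats = 1 GB per batch (arbitrary choice)
float* d_batch;
error_code = cudaMalloc(&d_batch, batch_size * sizeof(float));
// ... error handling as before ...
for (std::size_t offset = 0; offset < n_data; offset += batch_size) {
  std::size_t n_batch =
      (n_data - offset < batch_size) ? (n_data - offset) : batch_size;
  cudaMemcpy(d_batch, x_data_host + offset, n_batch * sizeof(float),
             cudaMemcpyHostToDevice);
  GoldenRatio<<<n_block, n_thread_per_block>>>(d_batch, n_batch, n_iter);
  cudaMemcpy(x_data_host + offset, d_batch, n_batch * sizeof(float),
             cudaMemcpyDeviceToHost);
}
cudaFree(d_batch);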
Automatic Managed Memory¶
On most modern GPUs, you can use cudaMallocManaged() instead of cudaMalloc().
With cudaMallocManaged(), the allocated memory is managed by CUDA, which
transfers data back and forth between CPU and GPU for you without you having
to use cudaMemcpy().
Because cudaMallocManaged()
manages the memory for us, we only need one
pointer
float* x_data;
We allocate n_data floats and assign the address of that memory to x_data using
error_code = cudaMallocManaged(&x_data, n_data * sizeof(float));
if (error_code != cudaSuccess) {
std::cout << "CUDA error " << error_code
<< " - Unable to cudaMallocManaged" << std::endl;
return 1;
}
We can treat the pointer x_data like any other pointer; for example, we can
fill the array with random numbers using the CPU
GenerateRandomNumbers(x_data, n_data);
To run the kernel, call it as usual, using the pointer x_data
GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);
error_code = cudaGetLastError();
if (error_code != cudaSuccess) {
std::cout << "CUDA error " << error_code << " - Unable to call CUDA kernel"
<< std::endl;
return 1;
}
After calling the kernel and cudaDeviceSynchronize()
, you can also check the
contents of x_data
without having to use cudaMemcpy()
. In terms of
programming, this is convenient as both CPU functions and GPU kernels can read
and write memory with the same pointer without having to explicitly transfer
data.
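For instance, straight after the synchronisation the host can read a result through the very same pointer, with no copy (a one-line illustration of the point above, not a separate listing in the original):
// Host code reading a GPU-computed result directly from managed memory
std::cout << "First element after the kernel: " << x_data[0] << std::endl;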
You should still free the memory allocated by cudaMallocManaged(), using
cudaFree(), at the end of the program.
Remarkably, this toy program works even if I set n_data so that the array is
larger than 12 GB (n_data \(\approx 3 \times 10^9\) or more). Behind the scenes,
CUDA transfers data between CPU and GPU for us when the program needs it,
whilst also avoiding exhausting GPU memory.
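Managed memory also accepts optional hints. For example, you could prefetch the array to the GPU before the kernel launch; this is an optional optimisation, and not something the toy program needs.
// Optionally ask CUDA to migrate the managed array to the current GPU before
// the kernel runs; this is a hint, not a requirement.
int device_id = 0;
cudaGetDevice(&device_id);
cudaMemPrefetchAsync(x_data, n_data * sizeof(float), device_id, 0);
GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);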
Using managed memory is a way to run CUDA kernels on data larger than the GPU can hold, without having to batch the data yourself: CUDA manages the memory for you. In terms of programming, this makes things easier as it abstracts away the memory management and transfers.
Admittedly, this toy program is limited in scope because it doesn't explore situations where memory access can be coupled or random. However, it demonstrates on a surface level where managed memory can be useful and how it works.
Unified Memory¶
The Grace Hopper nodes have a new NVIDIA NVLink C2C interconnect which enables
very fast data transfer between CPU and GPU, so fast that CPU and GPU threads
can practically access both CPU and GPU data concurrently. CUDA has a feature
called unified memory which caters for this. You can pass arbitrary pointers to
data to the GPU kernel, and CUDA will sort out the memory management and
transfers for you without you having to use cudaMallocManaged(). This means you
can pass data allocated by external libraries or software directly to the GPU.
For example, we can use the standard C++ library std::vector
to allocate an
array, fill it with random numbers using the CPU and pass it directly to the GPU
kernel.
float* x_data;
std::vector<float> x_vector;
x_vector.resize(n_data);
x_data = x_vector.data();
GenerateRandomNumbers(x_data, n_data);
GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);
You will notice the lack of CUDA memory functions, as memory allocation and
transfer have been abstracted away from us. You still have to call
cudaDeviceSynchronize() to tell the CPU to wait for the GPU to finish its
calculations, and you should still do the error handling.
We do not need to use cudaFree() because the responsibility for freeing the
memory is left to std::vector. The memory should be freed automatically when
x_vector goes out of scope.
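To put the unified-memory version together, a minimal complete program might look like the sketch below. This is our own assembly of the pieces above; the command-line handling and launch configuration are illustrative assumptions rather than the original post's exact code, and it relies on hardware (such as the Grace Hopper nodes) where kernels can access system-allocated memory.
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

// GenerateRandomNumbers() and the GoldenRatio kernel are as defined earlier.

int main(int argc, char** argv) {
  // Array size from the command line, e.g. ./golden_ratio 1000000
  std::size_t n_data =
      (argc > 1) ? std::strtoull(argv[1], nullptr, 10) : 1000000;
  int n_iter = 10000;
  int n_thread_per_block = 256;
  int n_block = 1024;  // arbitrary; the grid-stride loop covers the rest

  // Ordinary host allocation; no cudaMalloc or cudaMallocManaged needed
  std::vector<float> x_vector(n_data);
  float* x_data = x_vector.data();
  // Note: GenerateRandomNumbers() takes an int length as defined earlier, so
  // very large arrays would need a wider length type there.
  GenerateRandomNumbers(x_data, static_cast<int>(n_data));

  GoldenRatio<<<n_block, n_thread_per_block>>>(x_data, n_data, n_iter);
  cudaError_t error_code = cudaGetLastError();
  if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code << std::endl;
    return 1;
  }
  error_code = cudaDeviceSynchronize();
  if (error_code != cudaSuccess) {
    std::cout << "CUDA error " << error_code << std::endl;
    return 1;
  }

  // Verify on the CPU by reading the same memory the GPU just wrote to
  float golden_ratio = (1.0f + std::sqrt(5.0f)) / 2.0f;
  float epsilon = std::numeric_limits<float>::epsilon();
  bool is_pass = true;
  for (std::size_t i = 0; i < n_data; ++i) {
    if (std::abs(x_data[i] - golden_ratio) > epsilon) is_pass = false;
  }
  std::cout << (is_pass ? "PASS!" : "FAIL!") << std::endl;
  return 0;  // x_vector frees its memory automatically at end of scope
}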
On the Grace Hopper node, using the naive manual memory management approach,
we can allocate and process an array of up to 100 GB (n_data \(\approx 2.5
\times 10^{10}\)) before it fails due to running out of memory. However, with
unified memory, we found that it can process an array of 500 GB (n_data
\(\approx 1.2 \times 10^{11}\)) and perhaps more! This demonstrates that with
unified memory, both CPU and GPU can read and write the same piece of data. As
a result, the programmer can pass arbitrary data to the GPU with ease, without
having to think too much about memory management.
Summary¶
With new technology, software engineers need to keep learning in order to use it. Unified memory allows programmers to focus on writing GPU code without having to worry too much about memory management. With the Grace Hopper nodes having hundreds of GB of memory, this can potentially make high-performance GPU code easier to write and faster to run.
This blog post only covers the surface as we have not done any benchmarks or looked at distributed programming with Grace Hopper nodes. There is still plenty to explore with the new Grace Hopper nodes.
If you think your GPU application could make use of the new Grace Hopper nodes and need some help, do contact the ITSR team.
Acknowledgement¶
The illustrations of the graphics card and motherboard were obtained from vecteezy.com