# Some Pleasingly Parallel GPU Case Studies in Machine Learning¶

In a previous blog, we discussed ways we could use `multiprocessing`

and
`mpi4py`

together to use multiple nodes of GPUs. We will cover some machine
learning principles and two examples of pleasingly parallel machine learning
problems. Also known as embarrassingly parallel problems, I rather call them
pleasingly because there isn't anything embarrassing when you design your
problem to be run in parallel. When doing so, you could launch very similar
functions to each GPU and collate their results when needed.

## Supervised Learning¶

The objective of supervised learning is to make predictions. On Amazon, the website will recommend products to you but it isn't necessary to know the reasoning behind it, a contrast to the natural sciences. Recommender systems, or models, are engineered to fulfil the task of making predictions. As a result, some of these models can be extremely complicated, to the point engineers cannot give causal explanations on how predictions are made.

Inspired by CGP Grey's YouTube video on AI, these models can be viewed as a black box with lots of dials. The more dials it has, the more complex it is. This black box accepts an input and gives an output, for example, an input could be handwriting and an output a prediction of what that handwriting represents. To train this black box to make predictions, you give it data, called the training set, and an optimiser fiddles about with the dials until it makes correct predictions on the given data.

Afterwards, we can then assess the model on its ability to make correct predictions by testing it on unseen data, called the test set.

There are a few special dials, called tuning or hyperparameters, which need to be set before any optimisation can begin, for example, one can set how many dials there are on the black box and another characterises how the optimiser fiddles with the dials during training. Choosing or setting the tuning parameter is difficult. A common way is to do cross-validation, where you split the training set into a smaller training set and a validation set. You try out different tuning parameters by training your model on the smaller training set and choosing the tuning parameter which predicts the best on the validation set.

Overfitting

Cross-validation seems very involved and heuristic but these procedures aim to avoid overfitting. This is where the model predicts the training data so well, that it picks up all the unnecessary details and is unable to generalise well outside the training set. An analogy would be a student reciting text from a book, word to word, to answer a question. We do not want this.

## MNIST Dataset¶

A nice and simple dataset to play with is the MNIST dataset. It's a collection of 60 000 training images and 10 000 test images. Each image is a 28×28 pixel greyscale scan of the handwriting of a digit. The possible digits are integers from zero to nine. The goal would be, given the handwriting of a digit, to predict the digit the handwriting represents. This is an example of a classification problem where the possible predictions are one of the set of ten numbers.

It is available through `torchvision`

.

## SVMs and Grid Search¶

The Python package scikit-learn offers many classical machine learning algorithms out of the box. The support vector machine (SVM) is one example. Extremely briefly, it aims to put a hyperplane to separate a digit from the rest in feature space. Together with the kernel trick, non-linear decision boundaries can be made.

In scikit-learn, the class
`sklearn.svm.SVC`

is an SVM for classification problems. It has a very basic interface where you
instantiate a `sklearn.svm.SVC`

object `model = sklearn.svm.SVC()`

and call the
methods `model.train()`

and `model.predict()`

to train the model and make
predictions respectively.

The simplicity comes at a cost however. The constructor for `sklearn.svm.SVC`

has conveniently set a lot of defaults, for example, `C=1.0`

and `gamma='scale'`

where both `C`

and `gamma`

can be a positive number. These are tuning
parameters, much like the special dials discussed in the previous section. In my
opinion, they should be tuned to give the best possible performance, but
instead, they arbitrarily default to some number. These defaults implemented in
scikit-learn are heavily debated, such as in the Tweet
below.

By default, logistic regression in scikit-learn runs w L2 regularization on and defaulting to magic number C=1.0. How many millions of ML/stats/data-mining papers have been written by authors who didn't report (& honestly didn't think they were) using regularization?

— Zachary Lipton (@zacharylipton) August 30, 2019

Default values

Think about any API you come across with default values. Do you need to adjust them or tune them? Do you need to report them in your research?

Fortunately, scikit-learn provides tools to do tuning, such as k-fold
cross-validation
`sklearn.model_selection.KFold`

and grid search
`sklearn.model_selection.GridSearchCV`

.
Grid search is where you try out and cross-validate many combinations of tuning
parameters, for example, `C`

and `gamma`

. The value of `C`

and `gamma`

which
gives the best results on the validation set is used and assessed on the test
set.

There is no GPU support for scikit-learn as stated in
their FAQ.
However, RAPIDS' `cuML`

offers GPU
implementations for a lot
of scikit-learn's models
with matching API, including
`cuml.svm.SVC`

at the time of writing on v23.08. This offers an easy way to speed up your
scikit-learn with a GPU.

RAPIDs vs scikit-learn

RAPIDS aims to port the scikit-learn library with GPU compatibility. However, the entirety of the scikit-learn library has not been ported. There are still some tools such as cross validation which still need to be implemented, as discussed on their GitHub. In such a situation, experienced programmers or practitioners may want to look elsewhere for existing code or implement their own cross-validation.

### Splitting up the Search Space¶

Grid search can be parallelised by splitting the search space equally between
MPI processes. Each MPI process can work on its allocated search space to work
on. Afterwards, the results from each MPI process are gathered and collated to
find the best `C`

and `gamma`

pair which produces the best validation result.

The search space is two-dimensional where both `C`

and `gamma`

can vary. We can
start by defining the boundaries of our search space where `penalty_parameters`

and `kernel_parameters`

are the boundaries for `C`

and `gamma`

respectively.

```
import numpy as np
log_penalty_parameters = np.linspace(-3, 6, n_tuning)
log_kernel_parameters = np.linspace(-6, -2, n_tuning)
penalty_parameters = np.power(10, log_penalty_parameters)
kernel_parameters = np.power(10, log_kernel_parameters)
```

The parameter `n_tuning`

is the number of points to look on the boundary. Given
the boundary, we can use `np.meshgrid()`

to obtain all possible combinations of
pairs of tuning parameters, or in other words, our grid.

```
penalty_grid, kernel_grid = np.meshgrid(penalty_parameters, kernel_parameters)
```

We can get the MPI rank and size from `mpi4py`

```
from mpi4py import MPI
COMM = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
```

Using that, we could, as evenly as possible, distribute grid elements to the MPI processes. A way to do that is to flatten the grid into an array, which is easier to divide up.

```
penalty_grid = penalty_grid.flatten()
kernel_grid = kernel_grid.flatten()
parameter_grid = zip(penalty_grid, kernel_grid)
```

Then we split the array into `size`

sections. We can do that by working out the
indices, `parameter_split_index`

, for each section.

```
parameter_split_index = np.round(
np.linspace(0, len(penalty_parameters)*len(kernel_parameters), size+1))
parameter_split_index = parameter_split_index.astype(np.int32)
```

Then finally, given a rank, we get the `rank`

th section of grid elements.

```
import itertools
rank_parameter_grid = itertools.islice(parameter_grid,
parameter_split_index[rank],
parameter_split_index[rank+1])
rank_parameter_grid = list(rank_parameter_grid)
```

Each MPI process works on its `rank_parameter_grid`

. This is one way to
distribute the search space between GPUs within and between nodes.

### Working Together on the Same Search Space¶

We could also use `multiprocessing`

to use multiple GPUs on a node. They can
work together on the same search space using `multiprocessing.Queue`

. This is
useful when some parameters are faster to fit and validate compared to others.
For example, when one GPU finishes with one parameter early, it can work on the
next parameter on the queue without waiting for other GPUs to finish. This
avoids GPUs sitting there doing nothing if they finish earlier than expected.

Here is some pseudo-code to demonstrate. Suppose we have a list of parameters in
a search space in `tuning_grid`

. We can place all of those in a
`multiprocessing.Queue`

called `tuning_queue`

.

```
tuning_queue = multiprocessing.Queue()
for i, param in enumerate(tuning_grid):
tuning_queue.put([i, param])
```

The queue `tuning_queue`

can be passed to instances of
`multiprocessing.Process`

, each process can `put()`

and `get()`

items into and
from the same queue. It is designed to be thread-safe, meaning the processes
won't be fighting over access to the queue.

The pseudo-code below shows passing `tuning_queue`

to the constructor of
`Trainer`

, a subclass of `multiprocessing.Process`

. Each process will use a
different GPU by passing a different value of `gpu_id`

to each `Trainer`

.
Multiple processes are instantiated and then launched in parallel by calling
`start()`

.

```
trainer_list = [Trainer(rank, gpu_id, data, tuning_queue) for gpu_id in gpu_ids]
for trainer in trainer_list:
trainer.start()
```

The class structure of `Trainer`

is shown below. In the method `run()`

, the
process repeatedly gets a tuning parameter from `tuning_queue`

and processes it
until the `tuning_queue`

is empty. By using a queue, we encourage all GPUs to
work continuously without having to wait for results from other GPUs.

```
from cuml import svm
class Trainer(multiprocessing.Process):
def __init__(self, rank, gpu_id, data, tuning_queue):
super().__init__()
self._rank = rank
self._gpu_id = gpu_id
self._data_ = data
self._tuning_queue = tuning_queue
def run(self):
with cuda.Device(self._gpu_id):
while True:
try:
index, tuning = self._tuning_queue.get(False)
except queue.Empty:
break
model = svm.SVC(C=tuning[0],
gamma=tuning[1],
multiclass_strategy="ovr")
# Validate the model
# Finalise and transfer results here
```

We provide the `False`

value in the method `get()`

. This so that a `queue.Empty`

exception is thrown when the queue is empty. In such a situation, there are no
more parameters in the queue and we can start finalising results.

### Results and Benchmarks¶

To demonstrate the use of multiple GPUs, we implemented grid search with 5-fold
cross-validation to find the best value of `C`

and `gamma`

on the MNIST dataset.
We opted to use a combination of `mpi4py`

and `multiprocessing`

in a hybrid
approach. We can distribute the search space between nodes, one MPI process per
node. All GPUs on a node can work together on the node's allocated piece of the
search space. With this approach, we should see better performance for a higher
number of GPUs per node.

Running out of GPU memory when using RAPIDS

We have observed that repeated instantiation of `cuml.svm.SVC`

causes GPU
memory usage to increase, even if the instantiated object goes out of scope.
To get around it, we limit the number of `cuml.svm.SVC`

instantiation for
each `multiprocessing.Process`

can do. Afterwards, we close it, to guarantee
the GPU memory is cleared. If there are still more tuning parameters in the
queue, we can instantiate another `multiprocessing.Process`

and continue.

We ran our grid search, of size 24×24, on Sulis with different formations:

- 1 A100 GPU on a node
- 2 A100 GPUs on a node
- 3 A100 GPUs on a node
- 3 nodes of 1 A100 GPU (total 3 GPUs)
- 2 nodes of 3 A100 GPUs (total 6 GPUs)
- 3 nodes of 3 A100 GPUs (total 9 GPUs)
- 4 nodes of 3 A100 GPUs (total 12 GPUs)

These different formations are chosen to investigate any speed up from increasing the number of GPUs and the number of GPUs per node.

To illustrate grid search and 5-fold cross-validation, the figure below shows a
box plot of validation errors when we vary `C`

for a fixed `gamma`

. We used a
box plot to capture the 5 validation errors, one from each fold. Indeed from the
plot, it can be verified that there is a value of `C`

which minimises the
validation error. We report a respectable 1.3% error rate on the test set.

The benchmarks for the different formations are shown in the figure below.

In the single-node case, the more GPUs there are, the faster it is. It also scales, so three GPUs take about a third of the time one GPU takes. This is as expected as the GPUs on the node are working together on the entire search space.

When using a total of three GPUs, we see a performance degradation when using three nodes of one GPU. In this case, the search space was split, orderly, between the three nodes. However, we have observed that some nodes would finish faster than others, waiting for the other one to finish. This is a bottleneck and is caused by some parameters being faster to fit and validate than others.

In the multi-node case, the more nodes there are, the faster it is. However, the performance uplift is not as much as compared to increasing the number of GPUs in the single node case. The cause is similar, some nodes are finishing their grid search faster than others.

This demonstrates that for this particular problem and implementation, higher GPUs per node are preferred. The main cause is that when splitting the search space between nodes, some nodes find their allocated search space faster to fit and validate compared to others. This could be improved if the search space was split randomly, to mask the effect that some parameters are faster to fit than others. This could also be improved if one could implement sharing the search space between nodes in MPI, avoiding situations where nodes are waiting for other nodes to finish their grid search.

## Neural Networks and Bagging¶

Neural networks are another machine learning model and are available in Python packages as such PyTorch. A lot of people like to see them as models inspired by the human brain. Mathematically, they are a composite function mapping an input to an output.

Suppose \(\eta_1(\cdot)\) and \(\eta_2(\cdot)\) are linear functions and \(g_1(\cdot)\) and \(g_2(\cdot)\) are non-linear functions, otherwise known as activation functions. We can create a single-layer neural network \(f_1(\cdot)\)

or a single hidden-layer neural network \(f_2(\cdot)\)

We can keep adding layers hidden layers if desired. Neural networks, theoretically, are known to approximate a wide range of functions, making them very versatile in machine learning.

We can choose the output activation function, \(g_2(\cdot)\), so that \(f_2(\cdot)\) maps our image, the input, to a vector containing numbers between 0 and 1, the output, representing the certainty of each digit prediction. However, this leaves the practitioner to select the remaining activation functions and how many parameters to use for each linear function or layer. There's also a question of how many hidden layers to use. Too many parameters and layers will cause the neural network to be so complicated, it will start to capture noise and irrelevant information in the data, causing it to overfit.

There are some tuning parameters, such as the step size and momentum, which characterise the optimisation procedure, such as its sensitivity when moving the dials. I would also argue the number of parameters at each layer are tuning parameters too. As a result, grid search becomes unfeasible because there are too many dimensions on our grid. A random search, where you randomly try out different parameters and validate them, is a computationally easier alternative to grid search.

Once we have selected the architect of our neural network and tuning parameters, we can fit the network onto the data by optimising the parameters of the linear functions, analogous to adjusting the many dials on our black box.

### PyTorch¶

You can build a neural network for our MNIST example in an object-oriented
fashion using PyTorch. This can be done by making a subclass of
`torch.nn.Module`

, an example is shown below.

```
import torch
class Net(torch.nn.Module):
def __init__(self, n_conv_layer, kernel_size, n_hidden_layer):
super().__init__()
self._conv_layer = torch.nn.Conv2d(1, n_conv_layer, kernel_size)
self._hidden_layer = torch.nn.Linear(
n_conv_layer * (28-kernel_size+1)**2,
n_hidden_layer
)
self._output_layer = torch.nn.Linear(n_hidden_layer, 10)
self._activation = torch.nn.ReLU()
def forward(self, x):
x = self._conv_layer(x)
x = torch.flatten(x, 1)
x = self._activation(x)
x = self._hidden_layer(x)
x = self._activation(x)
x = self._output_layer(x)
return x
```

In the constructor, we define the functions in our neural network as attributes
and use them in our prediction in the method `forward()`

. Those functions are
applied one after the other to create the composite function. In particular:

`self._conv_layer`

is a 2D convolution and it's also a linear function. The number of layers is defined by`n_conv_layer`

and the size of the kernel is defined by`kernel_size`

`self._activation`

is the rectified linear unit activation function`self._hidden_layer`

is a linear function. The number of parameters is defined by`n_hidden_layer`

`self._output_layer`

is a linear function and outputs 10 numbers, representing the certainty of a prediction for each digit. You have to use the softmax function to transform them into probabilities.

Activation functions and number of layers

The choices of what and how many linear functions and activation functions to use are arbitrary and left up to the user. Experienced practitioners can make good guesses about which one works well depending on the dataset and the domain it comes from. For example, 2D convolutions are commonly used in images. It is still an active area of research.

Optimisation tools, such as stochastic gradient descent, are provided in PyTorch. The calculations of gradients are done automatically in PyTorch, making optimisation using gradient methods easier.

### Bagging¶

To demonstrate parallelising using multiple GPUs, we can use bagging to reduce the variance in our predictions and hopefully tackle overfitting which neural networks tend to do.

The problem with random search is that we only have finite computational resources to search and validate a high-dimensional space. Also, if someone else repeats the random search, we will get a different answer.

Bagging aims to embrace such uncertainty using bootstrap samples of the data. A bootstrap sample is a variation of the data, created by resampling the data with replacement. This introduces noise to the original data which simulates what would happen if someone else obtained the same dataset again. We do a random search to get the best validated model for each bootstrap sample. This gives us multiple different answers on what the best model could be.

With multiple different models, we can aggregate them, for example using majority vote, when using them to predict a digit. This reduces variance by eliminating the odd confident incorrect predictions.

Another advantage of bagging is that fewer computation results are wasted. In a singular grid search, only one model is used and the rest are discarded. However, with bagging, we keep the various models which contribute to our predictions.

An analogy for bagging

Consider predicting the number of sweets in a jar. Do you get a better prediction from one professor or take an average from numerous students' answers? Bagging aims to do the latter.

Resampling within folds

When we do k-fold cross-validation to validate, we have to ensure data is
not 'leaked' between folds. So to combine k-fold cross-validation with
bagging, we bootstrap by resampling with replacement *within* folds, rather
than within the entire dataset.

### Parallelising Bagging¶

Bagging is very parallelisable. We can distribute bootstrap samples between GPUs or nodes to do a random search on each one in parallel. Or even better, to reproduce results, we can distribute unique seeds to seed a random number generator (RNG), one for each bootstrap sample. The RNG can then

- Obtain a bootstrap sample by resampling within folds
- Produce more seeds to seed any additional RNGs

Each bootstrap sample has its own RNG so that processes can work in parallel. If processes share the same RNG, they may fight over access to it.

Below is an example code on how we can distribute `n_bootstrap`

seeds between
MPI processes given one, and only one, `SEED`

.

```
from mpi4py import MPI
import numpy as np
from numpy import random
COMM = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
seeds = random.SeedSequence(SEED).spawn(n_bootstrap)
seed_index_rank = np.round(np.linspace(0, n_bootstrap, size+1))
seed_index_rank = seed_index_rank.astype(np.int32)
seeds_rank = seeds[seed_index_rank[rank]: seed_index_rank[rank+1]]
```

Each MPI process has a list of seeds called `seeds_rank`

to work with. Then
for each `seed`

in `seeds_rank`

we can instantiate and seed an RNG

```
rng = random.default_rng(seed)
```

and produce more random seeds to seed RNGs used in PyTorch

```
import torch
torch.backends.cudnn.deterministic = True
torch.manual_seed(
int.from_bytes(rng.bytes(8), byteorder='big', signed=True))
torch.cuda.manual_seed(
int.from_bytes(rng.bytes(8), byteorder='big', signed=True))
```

The RNGs can be used to generate random numbers for the random search and other required random numbers PyTorch needs.

To instruct PyTorch to use the GPU with `gpu_id`

, use the function
`torch.device()`

to set the device. Then for any neural network or tensor
object, use the method `to()`

to transfer it to the GPU memory.

```
device = torch.device(f"cuda:{gpu_id}")
net = mnist_nn.model.Net(n_conv_layer, kernel_size, n_hidden_layer)
net.to(device)
```

### Results¶

We performed bagging using 12 bootstrap samples. For each bootstrap sample, we tried out 12 random tuning parameters and validated them using 5-fold cross-validation. The tuning parameters are:

- The number of layers in
`self._conv_layer`

- The kernel size in
`self._conv_layer`

- The number of parameters in
`self._hidden_layer`

- The learning rate of the stochastic gradient descent
- The momentum of the stochastic gradient descent

We've used a batch size of 5 for the stochastic gradient descent.

The table below shows each of our 12 models with their best tuning parameters along with their test errors.

Model number | `n_conv_layer` |
`kernel_size` |
`n_hidden_layer` |
Learning Rate | Momentum | Test error |
---|---|---|---|---|---|---|

0 | 707 | 6 | 1764 | 0.0027 | 0.32 | 0.0152 |

1 | 237 | 10 | 1215 | 0.0047 | 0.11 | 0.0162 |

2 | 277 | 4 | 1350 | 0.011 | 0.083 | 0.0144 |

3 | 293 | 9 | 1298 | 0.0014 | 0.63 | 0.0146 |

4 | 637 | 8 | 1768 | 0.014 | 0.023 | 0.0131 |

5 | 13 | 8 | 227 | 0.054 | 0.040 | 0.0160 |

6 | 789 | 7 | 495 | 0.0020 | 0.65 | 0.0155 |

7 | 253 | 4 | 1769 | 0.0029 | 0.36 | 0.0119 |

8 | 230 | 10 | 1502 | 0.028 | 0.12 | 0.0161 |

9 | 678 | 9 | 1945 | 0.024 | 0.086 | 0.0145 |

10 | 588 | 4 | 1557 | 0.0044 | 0.71 | 0.0129 |

11 | 558 | 10 | 1870 | 0.0033 | 0.66 | 0.0139 |

We can see that, indeed, each of our 12 models found different tuning parameters due to the randomness of the random search and also the parameters are suited for their specific bootstrap sample. Each of the models performs respectably with a test error of about ~1.5%.

Combining the 12 models using a majority vote to predict, we found the test error to be 1.114%, which outperforms all singular models we've found.

Bagging becomes very powerful when it comes to uncertainty quantification. We can plot a box plot of the certainty of each model's prediction of each digit. In the example below, the digit 7 is predicted correctly by all models with high certainty.

A harder handwriting example is shown below. Each model's prediction varied, either predicting the digit 5 or 6. I would expect a human to perform similarly too. Bagging allows us to create a box plot which quantifies the uncertainty of the certainty of each prediction.

Some practitioners may find quantifying uncertainty excessive but this is incredibly important if instead of predicting digits, you're diagnosing diseases. By quantifying uncertainty for predictions, practitioners can make a risk assessment when using such predictions for decision-making.

In the last example below, we tested the model to predict a digit on a photo of a bear.

From the box plot, you can see some models predicted either the digits 2, 5 or 8 with quite some certainty. By using bagging, we can see the variation in each model's prediction and certainty. With such uncertainty in the predicted digits, it should hint to the user, that perhaps, this data is erroneous and doesn't represent a digit.

### Benchmarks¶

We ran our code on the A100 GPUs on Sulis and the H100 GPUs on Apocrita with the following formations:

- 2 nodes of 3 A100 GPUs (total 6 GPUs)
- 3 nodes of 2 A100 GPUs (total 6 GPUs)
- 3 nodes of 3 A100 GPUs (total 9 GPUs)
- 4 nodes of 3 A100 GPUs (total 12 GPUs)
- 1 node of 4 H100 GPUs (total 4 GPUs)

The benchmarks are shown below.

When using 6 A100 GPUs, both the 2-node and 3-node cases have similar benchmarks. Each GPU was happily working independently without having to wait for the others to finish.

We've observed some, but not significant, speed-up when going from 6 GPUs to 9 GPUs. In this scenario, 9 GPUs are working together to validate 12 models in parallel. Because 12 is not divisible by 9, you get 3 GPUs validating 2 models and the remaining 6 validating one model. Those 6 GPUs will be sitting there doing nothing while the remaining GPUs are working on their second model.

When going to 12 GPUs, we've observed a speed-up. In this formation, each GPU works on one model independently. Each GPU should finish their job at around the same time, making most of the time allocated to them.

We've included benchmarks using the faster H100 GPUs. This is to demonstrate that 4 H100 GPUs perform almost as well as 9 A100 GPUs. Higher profile GPUs, in this case, do speed up code.

## Conclusion¶

We've studied some pleasingly parallel machine-learning problems and implemented them using multiple nodes of GPUs.

Using more nodes, more GPUs or better GPUs can improve the performance of your code, but sometimes it doesn't. It heavily depends on your problem and how it is implemented. We've also found that the number of GPUs per node can have an effect too.

There are many more packages which do parallelisation and some do abstract it away from the user. It may be a good idea to study how your package parallelises your problem using multiple nodes and multiple GPUs. This may give a hint on how many nodes and GPUs you should use for your job. Benchmarking can also achieve this as well. It may be an idea to do a short benchmark of your code with different formations to choose the resources you would like to request.

With that in mind, we hope this should give you some ideas on how to speed up your code using multiple GPUs.

The code to run these experiments is available on my GitHub.