
Some Pleasingly Parallel GPU Case Studies in Machine Learning

In a previous blog, we discussed how multiprocessing and mpi4py can be used together to make use of multiple nodes of GPUs. Here we will cover some machine learning principles and two examples of pleasingly parallel machine learning problems. These are also known as embarrassingly parallel problems, but I prefer to call them pleasingly parallel because there is nothing embarrassing about designing your problem to run in parallel. When doing so, you can launch very similar functions on each GPU and collate their results when needed.

Supervised Learning

The objective of supervised learning is to make predictions. Amazon, for example, will recommend products to you without needing to explain the reasoning behind the recommendation, in contrast to the natural sciences. Recommender systems, or models, are engineered to fulfil the task of making predictions. As a result, some of these models can be extremely complicated, to the point where engineers cannot give causal explanations of how their predictions are made.

Inspired by CGP Grey's YouTube video on AI, these models can be viewed as a black box with lots of dials. The more dials it has, the more complex it is. This black box accepts an input and gives an output, for example, an input could be handwriting and an output a prediction of what that handwriting represents. To train this black box to make predictions, you give it data, called the training set, and an optimiser fiddles about with the dials until it makes correct predictions on the given data.

Afterwards, we can then assess the model on its ability to make correct predictions by testing it on unseen data, called the test set.

There are a few special dials, called tuning or hyperparameters, which need to be set before any optimisation can begin, for example, one can set how many dials there are on the black box and another characterises how the optimiser fiddles with the dials during training. Choosing or setting the tuning parameter is difficult. A common way is to do cross-validation, where you split the training set into a smaller training set and a validation set. You try out different tuning parameters by training your model on the smaller training set and choosing the tuning parameter which predicts the best on the validation set.

Figure 1: A representation of supervised learning. An input is fed into a black box which produces an output. In this example, the input is handwriting and the output is the predicted symbol of the handwriting. The black box, or machine learning model, has lots of parameters, represented as dials. An optimiser can adjust all of these dials, given data, for the best prediction. The optimiser also has tuning parameters which need to be set, or tuned, for the best performance.

Overfitting

Cross-validation seems very involved and heuristic, but these procedures aim to avoid overfitting. Overfitting is where the model predicts the training data so well that it picks up all the unnecessary details and is unable to generalise outside the training set. An analogy would be a student reciting text from a book, word for word, to answer a question. We do not want this.

MNIST Dataset

A nice and simple dataset to play with is the MNIST dataset. It's a collection of 60 000 training images and 10 000 test images. Each image is a 28×28 pixel greyscale scan of a handwritten digit. The possible digits are integers from zero to nine. The goal is, given the handwriting of a digit, to predict the digit the handwriting represents. This is an example of a classification problem, where each prediction must be one of a set of ten classes.

It is available through torchvision.
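
As a minimal sketch of loading it (assuming torchvision is installed and you are happy to download the data to a local data directory):

import torchvision

# Download MNIST to ./data if needed and load the images as tensors
transform = torchvision.transforms.ToTensor()
train_set = torchvision.datasets.MNIST(
    root="data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.MNIST(
    root="data", train=False, download=True, transform=transform)

image, label = train_set[0]  # a 1x28x28 tensor and its digit label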

Figure 2: A sample of handwritten digits from the MNIST dataset. From left to right, they are labelled as 1, 2 and 7

The Python package scikit-learn offers many classical machine learning algorithms out of the box. The support vector machine (SVM) is one example. Extremely briefly, it aims to place a hyperplane in feature space that separates one digit from the rest. Together with the kernel trick, non-linear decision boundaries can be made.

Figure 3: In an SVM, a hyperplane separates two different classes, blue and green. In this illustration, because the diagram is in 2D, the hyperplane is a straight line, shown in red.

In scikit-learn, the class sklearn.svm.SVC is an SVM for classification problems. It has a very basic interface: you instantiate a sklearn.svm.SVC object, model = sklearn.svm.SVC(), and call the methods model.fit() and model.predict() to train the model and make predictions respectively.
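
For instance, a minimal sketch, assuming the images have been flattened into a feature matrix x_train with labels y_train (and x_test for the test set):

import sklearn.svm

model = sklearn.svm.SVC()             # uses the default tuning parameters
model.fit(x_train, y_train)           # train on the flattened 28x28 images
predictions = model.predict(x_test)   # predict a digit for each unseen image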

The simplicity comes at a cost, however. The constructor for sklearn.svm.SVC has conveniently set a lot of defaults, for example, C=1.0 and gamma='scale', where both C and gamma can be any positive number. These are tuning parameters, much like the special dials discussed in the previous section. In my opinion, they should be tuned to give the best possible performance, but instead, they arbitrarily default to some value. The defaults implemented in scikit-learn have been heavily debated by practitioners.

Default values

Think about any API you come across with default values. Do you need to adjust them or tune them? Do you need to report them in your research?

Fortunately, scikit-learn provides tools to do tuning, such as k-fold cross-validation sklearn.model_selection.KFold and grid search sklearn.model_selection.GridSearchCV. Grid search is where you try out and cross-validate many combinations of tuning parameters, for example, C and gamma. The value of C and gamma which gives the best results on the validation set is used and assessed on the test set.
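
A sketch of how this could look, reusing the flattened training data assumed above (the grid of candidate values here is purely illustrative):

import numpy as np
import sklearn.model_selection
import sklearn.svm

param_grid = {
    "C": np.power(10.0, np.linspace(-3, 6, 5)),
    "gamma": np.power(10.0, np.linspace(-6, -2, 5)),
}
search = sklearn.model_selection.GridSearchCV(
    sklearn.svm.SVC(),
    param_grid,
    cv=sklearn.model_selection.KFold(n_splits=5),
)
search.fit(x_train, y_train)
print(search.best_params_)  # the C and gamma which validated best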

There is no GPU support for scikit-learn, as stated in their FAQ. However, RAPIDS' cuML offers GPU implementations of many of scikit-learn's models with a matching API, including cuml.svm.SVC at the time of writing (v23.08). This offers an easy way to speed up your scikit-learn code with a GPU.
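
Because the API matches, the swap can be as small as changing the import; a sketch, assuming cuML is installed and the data sits in memory as before:

from cuml.svm import SVC   # instead of from sklearn.svm import SVC

model = SVC(C=10.0, gamma=0.01)       # example values, not tuned
model.fit(x_train, y_train)           # training now runs on the GPU
predictions = model.predict(x_test)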

RAPIDS vs scikit-learn

RAPIDS aims to port the scikit-learn library with GPU compatibility. However, the entirety of the scikit-learn library has not been ported. Some tools, such as cross-validation, still need to be implemented, as discussed on their GitHub. In such a situation, experienced programmers or practitioners may want to look elsewhere for existing code or implement their own cross-validation.

Splitting up the Search Space

Grid search can be parallelised by splitting the search space equally between MPI processes. Each MPI process works on its allocated portion of the search space. Afterwards, the results from each MPI process are gathered and collated to find the C and gamma pair which produces the best validation result.

The search space is two-dimensional where both C and gamma can vary. We can start by defining the boundaries of our search space where penalty_parameters and kernel_parameters are the boundaries for C and gamma respectively.

import numpy as np
n_tuning = 24  # number of points along each boundary, giving the 24x24 grid used later
log_penalty_parameters = np.linspace(-3, 6, n_tuning)
log_kernel_parameters = np.linspace(-6, -2, n_tuning)
penalty_parameters = np.power(10, log_penalty_parameters)
kernel_parameters = np.power(10, log_kernel_parameters)

The parameter n_tuning is the number of points to try along each boundary. Given the boundaries, we can use np.meshgrid() to obtain all possible combinations of pairs of tuning parameters, or in other words, our grid.

penalty_grid, kernel_grid = np.meshgrid(penalty_parameters, kernel_parameters)

We can get the MPI rank and size from mpi4py.

from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

Using that, we could, as evenly as possible, distribute grid elements to the MPI processes. A way to do that is to flatten the grid into an array, which is easier to divide up.

penalty_grid = penalty_grid.flatten()
kernel_grid = kernel_grid.flatten()
parameter_grid = zip(penalty_grid, kernel_grid)

Then we split the array into size sections. We can do that by working out the indices, parameter_split_index, for each section.

parameter_split_index = np.round(
    np.linspace(0, len(penalty_parameters)*len(kernel_parameters), size+1))
parameter_split_index = parameter_split_index.astype(np.int32)

Then finally, given a rank, we get the rankth section of grid elements.

import itertools
rank_parameter_grid = itertools.islice(parameter_grid,
                                       parameter_split_index[rank],
                                       parameter_split_index[rank+1])
rank_parameter_grid = list(rank_parameter_grid)

Each MPI process works on its rank_parameter_grid. This is one way to distribute the search space between GPUs within and between nodes.
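
To complete the picture, each rank can validate its share of the grid and send its results back to rank 0 to be collated. A sketch of that, where fit_and_validate() is a hypothetical helper which returns a validation accuracy for a given C and gamma:

results = []
for penalty, kernel in rank_parameter_grid:
    score = fit_and_validate(penalty, kernel)   # hypothetical: cross-validate on the GPU
    results.append((score, penalty, kernel))

all_results = comm.gather(results, root=0)      # collect every rank's results on rank 0
if rank == 0:
    all_results = sum(all_results, [])          # flatten the list of lists
    best_score, best_penalty, best_kernel = max(all_results)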

Working Together on the Same Search Space

We could also use multiprocessing to use multiple GPUs on a node. They can work together on the same search space using multiprocessing.Queue. This is useful when some parameters are faster to fit and validate compared to others. For example, when one GPU finishes with one parameter early, it can work on the next parameter on the queue without waiting for other GPUs to finish. This avoids GPUs sitting there doing nothing if they finish earlier than expected.

Here is some pseudo-code to demonstrate. Suppose we have a list of parameters in a search space in tuning_grid. We can place all of those in a multiprocessing.Queue called tuning_queue.

import multiprocessing

tuning_queue = multiprocessing.Queue()
for i, param in enumerate(tuning_grid):
    tuning_queue.put([i, param])

The queue tuning_queue can be passed to instances of multiprocessing.Process, and each process can put() and get() items into and from the same queue. It is designed to be thread- and process-safe, meaning the processes won't fight over access to the queue.

The pseudo-code below shows passing tuning_queue to the constructor of Trainer, a subclass of multiprocessing.Process. Each process will use a different GPU by passing a different value of gpu_id to each Trainer. Multiple processes are instantiated and then launched in parallel by calling start().

trainer_list = [Trainer(rank, gpu_id, data, tuning_queue) for gpu_id in gpu_ids]
for trainer in trainer_list:
    trainer.start()

The class structure of Trainer is shown below. In the method run(), the process repeatedly gets a tuning parameter from tuning_queue and processes it until the tuning_queue is empty. By using a queue, we encourage all GPUs to work continuously without having to wait for results from other GPUs.

import multiprocessing
import queue

from cupy import cuda   # assuming cupy's Device context manager to select the GPU
from cuml import svm

class Trainer(multiprocessing.Process):

    def __init__(self, rank, gpu_id, data, tuning_queue):
        super().__init__()
        self._rank = rank
        self._gpu_id = gpu_id
        self._data = data
        self._tuning_queue = tuning_queue

    def run(self):
        # Bind this process to its allocated GPU
        with cuda.Device(self._gpu_id):
            while True:
                try:
                    index, tuning = self._tuning_queue.get(False)
                except queue.Empty:
                    break
                model = svm.SVC(C=tuning[0],
                                gamma=tuning[1],
                                multiclass_strategy="ovr")
                # Validate the model

            # Finalise and transfer results here

We pass False to the method get() so that it does not block and a queue.Empty exception is raised when the queue is empty. In that situation, there are no more parameters in the queue and we can start finalising results.

Results and Benchmarks

To demonstrate the use of multiple GPUs, we implemented grid search with 5-fold cross-validation to find the best value of C and gamma on the MNIST dataset. We opted to use a combination of mpi4py and multiprocessing in a hybrid approach. We can distribute the search space between nodes, one MPI process per node. All GPUs on a node can work together on the node's allocated piece of the search space. With this approach, we should see better performance for a higher number of GPUs per node.

Running out of GPU memory when using RAPIDS

We have observed that repeated instantiation of cuml.svm.SVC causes GPU memory usage to increase, even when the instantiated object goes out of scope. To get around this, we limit the number of cuml.svm.SVC instantiations each multiprocessing.Process can do. Afterwards, we close the process to guarantee the GPU memory is cleared. If there are still more tuning parameters in the queue, we can instantiate another multiprocessing.Process and continue.

We ran our grid search, of size 24×24, on Sulis with different formations:

  • 1 A100 GPU on a node
  • 2 A100 GPUs on a node
  • 3 A100 GPUs on a node
  • 3 nodes of 1 A100 GPU (total 3 GPUs)
  • 2 nodes of 3 A100 GPUs (total 6 GPUs)
  • 3 nodes of 3 A100 GPUs (total 9 GPUs)
  • 4 nodes of 3 A100 GPUs (total 12 GPUs)

These different formations are chosen to investigate any speed up from increasing the number of GPUs and the number of GPUs per node.

To illustrate grid search and 5-fold cross-validation, the figure below shows a box plot of validation errors when we vary C for a fixed gamma. We used a box plot to capture the 5 validation errors, one from each fold. Indeed from the plot, it can be verified that there is a value of C which minimises the validation error. We report a respectable 1.3% error rate on the test set.

Figure 4: A box plot of validation errors for a fixed gamma when fitting a cuml.svm.SVC on the MNIST dataset. The box plot captures the 5 validation errors from 5-fold cross-validation. The minimum is somewhere around C=math.log(1.3, 10).

The benchmarks for the different formations are shown in the figure below.

Figure 5: Benchmarks when doing a 24×24 grid search to find the best C and gamma for cuml.svm.SVC when fitting on the MNIST dataset. Different numbers of nodes and GPUs were used, labelled on the x-axis respectively along with the GPU model. For example, 2×3×A100 means 2 nodes, each using 3 A100 GPUs.

In the single-node case, the more GPUs there are, the faster it is. It also scales, so three GPUs take about a third of the time one GPU takes. This is as expected as the GPUs on the node are working together on the entire search space.

When using a total of three GPUs, we see a performance degradation when using three nodes of one GPU each. In this case, the search space was split, in order, between the three nodes. However, we observed that some nodes finished faster than others and then sat waiting for the rest to finish. This bottleneck is caused by some parameters being faster to fit and validate than others.

In the multi-node case, the more nodes there are, the faster it is. However, the performance uplift is not as large as when increasing the number of GPUs in the single-node case. The cause is similar: some nodes finish their grid search faster than others.

This demonstrates that for this particular problem and implementation, more GPUs per node are preferred. The main cause is that when splitting the search space between nodes, some nodes find their allocated search space faster to fit and validate than others. This could be improved if the search space were split randomly, to mask the effect that some parameters are faster to fit than others; a small sketch of that idea is shown below. It could also be improved by implementing the sharing of the search space between nodes in MPI, avoiding situations where nodes wait for other nodes to finish their grid search.
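
As a sketch of the first idea, a seeded shuffle of the flattened grid, applied on every rank before zipping and splitting, would spread the slow and fast parameters more evenly:

# After flattening penalty_grid and kernel_grid, but before zipping them together
rng = np.random.default_rng(0)   # the same seed on every rank so they agree on the order
shuffled_index = rng.permutation(len(penalty_grid))
penalty_grid = penalty_grid[shuffled_index]
kernel_grid = kernel_grid[shuffled_index]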

Neural Networks and Bagging

Neural networks are another machine learning model and are available in Python packages such as PyTorch. A lot of people like to see them as models inspired by the human brain. Mathematically, they are a composite function mapping an input to an output.

Suppose \(\eta_1(\cdot)\) and \(\eta_2(\cdot)\) are linear functions and \(g_1(\cdot)\) and \(g_2(\cdot)\) are non-linear functions, otherwise known as activation functions. We can create a single-layer neural network \(f_1(\cdot)\)

\[ f_1(\cdot) = g_1(\eta_1(\cdot))\]

or a single hidden-layer neural network \(f_2(\cdot)\)

\[ f_2(\cdot) = g_2(\eta_2(f_1(\cdot))) \]

We can keep adding hidden layers if desired. Neural networks, theoretically, are known to approximate a wide range of functions, making them very versatile in machine learning.

We can choose the output activation function, \(g_2(\cdot)\), so that \(f_2(\cdot)\) maps our image, the input, to a vector containing numbers between 0 and 1, the output, representing the certainty of each digit prediction. However, this leaves the practitioner to select the remaining activation functions and how many parameters to use for each linear function or layer. There's also a question of how many hidden layers to use. Too many parameters and layers will cause the neural network to be so complicated, it will start to capture noise and irrelevant information in the data, causing it to overfit.

There are some tuning parameters, such as the step size and momentum, which characterise the optimisation procedure, such as its sensitivity when moving the dials. I would also argue that the number of parameters in each layer is a tuning parameter too. As a result, grid search becomes infeasible because there are too many dimensions on our grid. A random search, where you randomly try out different parameters and validate them, is a computationally easier alternative to grid search.

Once we have selected the architecture of our neural network and the tuning parameters, we can fit the network to the data by optimising the parameters of the linear functions, analogous to adjusting the many dials on our black box.

PyTorch

You can build a neural network for our MNIST example in an object-oriented fashion using PyTorch. This can be done by making a subclass of torch.nn.Module, an example is shown below.

import torch

class Net(torch.nn.Module):

    def __init__(self, n_conv_layer, kernel_size, n_hidden_layer):
        super().__init__()
        self._conv_layer = torch.nn.Conv2d(1, n_conv_layer, kernel_size)
        self._hidden_layer = torch.nn.Linear(
            n_conv_layer * (28-kernel_size+1)**2,
            n_hidden_layer
        )
        self._output_layer = torch.nn.Linear(n_hidden_layer, 10)
        self._activation = torch.nn.ReLU()

    def forward(self, x):
        x = self._conv_layer(x)
        x = torch.flatten(x, 1)
        x = self._activation(x)
        x = self._hidden_layer(x)
        x = self._activation(x)
        x = self._output_layer(x)
        return x

In the constructor, we define the functions in our neural network as attributes and use them in our prediction in the method forward(). Those functions are applied one after the other to create the composite function. In particular:

  • self._conv_layer is a 2D convolution and it's also a linear function. The number of output channels is defined by n_conv_layer and the size of the kernel is defined by kernel_size
  • self._activation is the rectified linear unit activation function
  • self._hidden_layer is a linear function. The number of parameters is defined by n_hidden_layer
  • self._output_layer is a linear function and outputs 10 numbers, representing the certainty of a prediction for each digit. You have to use the softmax function to transform them into probabilities, as sketched below.
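
As a small usage sketch, with arbitrary example sizes and assuming images is a batch of 1×28×28 tensors:

net = Net(n_conv_layer=32, kernel_size=5, n_hidden_layer=128)  # arbitrary example sizes
logits = net(images)                                           # 10 numbers per image
probabilities = torch.nn.functional.softmax(logits, dim=1)     # one probability per digit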

Activation functions and number of layers

The choices of what and how many linear functions and activation functions to use are arbitrary and left up to the user. Experienced practitioners can make good guesses about which one works well depending on the dataset and the domain it comes from. For example, 2D convolutions are commonly used in images. It is still an active area of research.

Optimisation tools, such as stochastic gradient descent, are provided in PyTorch. The calculations of gradients are done automatically in PyTorch, making optimisation using gradient methods easier.
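
A condensed training-loop sketch, assuming a torch.utils.data.DataLoader called train_loader which yields batches of images and labels (the learning rate and momentum here are placeholders):

optimiser = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
loss_function = torch.nn.CrossEntropyLoss()

for images, labels in train_loader:
    optimiser.zero_grad()                      # reset gradients from the previous step
    loss = loss_function(net(images), labels)  # how wrong the current predictions are
    loss.backward()                            # gradients are calculated automatically
    optimiser.step()                           # adjust the parameters, or dials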

Bagging

To demonstrate parallelising using multiple GPUs, we can use bagging to reduce the variance in our predictions and hopefully tackle the overfitting which neural networks are prone to.

The problem with random search is that we only have finite computational resources to search and validate a high-dimensional space. Also, if someone else repeats the random search, they will get a different answer.

Bagging aims to embrace such uncertainty using bootstrap samples of the data. A bootstrap sample is a variation of the data, created by resampling the data with replacement. This introduces noise to the original data which simulates what would happen if someone else obtained the same dataset again. We do a random search to get the best validated model for each bootstrap sample. This gives us multiple different answers on what the best model could be.

Figure 6: A flow chart showing obtaining n bootstrap samples from the original data set. A model is trained and validated for each bootstrap sample. Each fitted model will be different.

With multiple different models, we can aggregate their predictions, for example using a majority vote, when predicting a digit. This reduces variance by smoothing out the occasional confidently incorrect prediction.
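
A minimal sketch of a majority vote, assuming predictions is a NumPy array of shape (n_models, n_images) containing each model's predicted digit for each image:

import numpy as np

def majority_vote(predictions):
    # Count, for every image, how many models voted for each of the ten digits
    n_models, n_images = predictions.shape
    votes = np.zeros((10, n_images), dtype=int)
    for model_predictions in predictions:
        votes[model_predictions, np.arange(n_images)] += 1
    # Return, for every image, the digit with the most votes
    return votes.argmax(axis=0)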

Another advantage of bagging is that fewer computational results are wasted. In a single grid search, only one model is kept and the rest are discarded. With bagging, however, we keep the various models, all of which contribute to our predictions.

Figure 7: A flow chart demonstrating bagging. Each model makes a prediction and they may be different. This variation in prediction can be used to quantify the uncertainty in our prediction. The predictions can also be aggregated to form a single prediction if needed.

An analogy for bagging

Consider predicting the number of sweets in a jar. Do you get a better prediction from one professor or take an average from numerous students' answers? Bagging aims to do the latter.

Figure 8: A jar of sweets

Resampling within folds

When we do k-fold cross-validation to validate, we have to ensure data is not 'leaked' between folds. So to combine k-fold cross-validation with bagging, we bootstrap by resampling with replacement within folds, rather than within the entire dataset.
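
A sketch of that idea, where fold_indices is a list of index arrays, one per fold, and rng is a seeded NumPy generator:

def bootstrap_within_folds(fold_indices, rng):
    # Resample with replacement inside each fold, so no data leaks across folds
    return [rng.choice(indices, size=len(indices), replace=True)
            for indices in fold_indices]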

Parallelising Bagging

Bagging is very parallelisable. We can distribute bootstrap samples between GPUs or nodes to do a random search on each one in parallel. Or even better, to reproduce results, we can distribute unique seeds to seed a random number generator (RNG), one for each bootstrap sample. The RNG can then

  • Obtain a bootstrap sample by resampling within folds
  • Produce more seeds to seed any additional RNGs

Each bootstrap sample has its own RNG so that processes can work in parallel. If processes share the same RNG, they may fight over access to it.

Below is example code showing how we can distribute n_bootstrap seeds between MPI processes, given one, and only one, SEED.

from mpi4py import MPI
import numpy as np
from numpy import random

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

seeds = random.SeedSequence(SEED).spawn(n_bootstrap)

seed_index_rank = np.round(np.linspace(0, n_bootstrap, size+1))
seed_index_rank = seed_index_rank.astype(np.int32)
seeds_rank = seeds[seed_index_rank[rank]: seed_index_rank[rank+1]]

Each MPI process has a list of seeds called seeds_rank to work with. Then for each seed in seeds_rank we can instantiate and seed an RNG

rng = random.default_rng(seed)

and produce more random seeds to seed RNGs used in PyTorch

import torch

torch.backends.cudnn.deterministic = True
torch.manual_seed(
    int.from_bytes(rng.bytes(8), byteorder='big', signed=True))
torch.cuda.manual_seed(
    int.from_bytes(rng.bytes(8), byteorder='big', signed=True))

The RNGs can be used to generate random numbers for the random search, as well as any other random numbers PyTorch needs.
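
For example, the tuning parameters searched over later could be drawn from the RNG like this (the ranges here are illustrative guesses, not the ones used in the experiments):

def sample_tuning_parameters(rng):
    # Hypothetical ranges; choosing them is itself a design decision
    return {
        "n_conv_layer": int(rng.integers(10, 800)),
        "kernel_size": int(rng.integers(3, 11)),
        "n_hidden_layer": int(rng.integers(100, 2000)),
        "learning_rate": float(10.0 ** rng.uniform(-3, -1)),
        "momentum": float(rng.uniform(0, 1)),
    }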

To instruct PyTorch to use the GPU with gpu_id, use the function torch.device() to set the device. Then for any neural network or tensor object, use the method to() to transfer it to the GPU memory.

device = torch.device(f"cuda:{gpu_id}")
net = mnist_nn.model.Net(n_conv_layer, kernel_size, n_hidden_layer)
net.to(device)

Results

We performed bagging using 12 bootstrap samples. For each bootstrap sample, we tried out 12 random tuning parameters and validated them using 5-fold cross-validation. The tuning parameters are:

  • The number of output channels in self._conv_layer
  • The kernel size in self._conv_layer
  • The number of parameters in self._hidden_layer
  • The learning rate of the stochastic gradient descent
  • The momentum of the stochastic gradient descent

We've used a batch size of 5 for the stochastic gradient descent.

The table below shows each of our 12 models with their best tuning parameters along with their test errors.

Model number  n_conv_layer  kernel_size  n_hidden_layer  Learning Rate  Momentum  Test error
0             707           6            1764            0.0027         0.32      0.0152
1             237           10           1215            0.0047         0.11      0.0162
2             277           4            1350            0.011          0.083     0.0144
3             293           9            1298            0.0014         0.63      0.0146
4             637           8            1768            0.014          0.023     0.0131
5             13            8            227             0.054          0.040     0.0160
6             789           7            495             0.0020         0.65      0.0155
7             253           4            1769            0.0029         0.36      0.0119
8             230           10           1502            0.028          0.12      0.0161
9             678           9            1945            0.024          0.086     0.0145
10            588           4            1557            0.0044         0.71      0.0129
11            558           10           1870            0.0033         0.66      0.0139

Table 1: The tuning parameters and test error for each of our 12 models. The learning rate and momentum are rounded to two significant figures.

We can see that, indeed, each of our 12 models found different tuning parameters, due to the randomness of the random search and also because the parameters are suited to their specific bootstrap sample. Each of the models performs respectably with a test error of about 1.5%.

Combining the 12 models using a majority vote to predict, we found the test error to be 1.114%, which outperforms every individual model we found.

Bagging becomes very powerful when it comes to uncertainty quantification. We can plot a box plot of the certainty of each model's prediction of each digit. In the example below, the digit 7 is predicted correctly by all models with high certainty.

Figure 9: Box plot of the certainty of each model's prediction of each digit. The handwriting to predict is shown on the top left.

A harder handwriting example is shown below. Each model's prediction varied, predicting either the digit 5 or 6. I would expect a human to perform similarly too. Bagging allows us to create a box plot which quantifies the spread in each model's certainty for each prediction.

Some practitioners may find quantifying uncertainty excessive but this is incredibly important if instead of predicting digits, you're diagnosing diseases. By quantifying uncertainty for predictions, practitioners can make a risk assessment when using such predictions for decision-making.

Figure 10: Box plot of the certainty of each model's prediction of each digit. The handwriting to predict is shown on the top left.

In the last example below, we asked the models to predict a digit from a photo of a bear.

Figure 11: Box plot of the certainty of each model's prediction of each digit. The handwriting to predict is shown on the top left but it's a photo of a bear.

From the box plot, you can see some models predicted the digits 2, 5 or 8 with some certainty. By using bagging, we can see the variation in each model's prediction and certainty. With such uncertainty in the predicted digits, it should hint to the user that perhaps this data is erroneous and doesn't represent a digit.

Benchmarks

We ran our code on the A100 GPUs on Sulis and the H100 GPUs on Apocrita with the following formations:

  • 2 nodes of 3 A100 GPUs (total 6 GPUs)
  • 3 nodes of 2 A100 GPUs (total 6 GPUs)
  • 3 nodes of 3 A100 GPUs (total 9 GPUs)
  • 4 nodes of 3 A100 GPUs (total 12 GPUs)
  • 1 node of 4 H100 GPUs (total 4 GPUs)

The benchmarks are shown below.

Figure 12: Benchmarks when bagging with 12 bootstrap samples, each validating 12 random tuning parameters with 5-fold cross-validation. Different numbers of nodes and GPUs were used, labelled on the x-axis respectively along with the GPU model. For example, 2×3×A100 means 2 nodes, each using 3 A100 GPUs.

When using 6 A100 GPUs, both the 2-node and 3-node cases have similar benchmarks. Each GPU was happily working independently without having to wait for the others to finish.

We've observed some, but not significant, speed-up when going from 6 GPUs to 9 GPUs. In this scenario, 9 GPUs work together to validate 12 models in parallel. Because 12 is not divisible by 9, you get 3 GPUs validating 2 models each and the remaining 6 validating one model each. Those 6 GPUs will be sitting idle while the other 3 work on their second model.

When going to 12 GPUs, we've observed a speed-up. In this formation, each GPU works on one model independently. Each GPU should finish its job at around the same time, making the most of the time allocated to them.

We've included benchmarks using the faster H100 GPUs. These demonstrate that 4 H100 GPUs perform almost as well as 9 A100 GPUs. Higher-end GPUs, in this case, do speed up the code.

Conclusion

We've studied some pleasingly parallel machine-learning problems and implemented them using multiple nodes of GPUs.

Using more nodes, more GPUs or better GPUs can improve the performance of your code, but sometimes it doesn't. It heavily depends on your problem and how it is implemented. We've also found that the number of GPUs per node can have an effect too.

There are many more packages which do parallelisation, and some abstract it away from the user. It may be a good idea to study how your package parallelises your problem across multiple nodes and multiple GPUs. This may give a hint on how many nodes and GPUs you should use for your job. Benchmarking can also achieve this. It may be worth running a short benchmark of your code with different formations to choose the resources you would like to request.

With that in mind, we hope this should give you some ideas on how to speed up your code using multiple GPUs.

The code to run these experiments is available on my GitHub.