LLM Multi-Machine Training Solutions
MRUGANK MANOJ RAUT
Posted on May 28, 2024
Scaling LLMs with Distributed Training
To maximize resource utilization and reduce training cost, practitioners use distributed computing techniques for multi-GPU or multi-machine training. These techniques are known as distributed data parallelism and distributed model parallelism. They make efficient use of resources and also enable horizontal scaling, fault tolerance, and parallel processing.
Applying Data Parallelism Techniques
Data parallelism is used when the dataset is too large to fit on, or be processed efficiently by, a single device such as a GPU. With data parallelism, the dataset is split across multiple devices, each of which holds a full copy of the model. At the start of each step, a mini-batch of the dataset is distributed equally and exclusively across all model copies. These copies are then trained in parallel, and the model parameters are coordinated across all devices. Collective communication algorithms and high-performance computing (HPC) networking frameworks are used to perform this parameter synchronization.
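As an illustration (not part of the original post), here is a minimal PyTorch sketch of this pattern using DistributedDataParallel. The `model` and `dataset` arguments and the loss function are placeholders, and it assumes one process per GPU launched via torchrun:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=3):
    # Launch with `torchrun --nproc_per_node=<gpus> script.py`;
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    device = int(os.environ["LOCAL_RANK"])

    # Every process holds a full replica of the model.
    model = DDP(model.to(device), device_ids=[device])

    # DistributedSampler gives each replica a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            loss = F.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()  # DDP all-reduces gradients across replicas here
            optimizer.step()
```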
Common approaches to data parallelism are as follows:
1. AllReduce
The AllReduce approach relies on direct communication between devices to iteratively exchange model gradients and parameters. It aggregates the data from all devices and redistributes the aggregated result back to each of them.
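A minimal sketch of the collective itself, assuming a process group has already been initialized; `local_gradient` is a hypothetical stand-in for each rank's locally computed gradient:

```python
import torch
import torch.distributed as dist

# Each rank holds the gradient it computed locally for the same parameter.
grad = local_gradient()  # hypothetical helper returning a CUDA tensor

# AllReduce sums the tensors from all ranks; the result lands on every rank.
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= dist.get_world_size()  # average, so updates match single-device training
```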
2. Parameter-Server
Local model copies are synchronized through a set of parameter servers, which hold the most up-to-date copy of the model and perform the weight-averaging step. Synchronization can happen at the end of each training step (synchronous), or asynchronously, where model copies pull parameters and push gradients independently. To improve the performance of the parameter-server approach, HPC infrastructure components are used. A toy sketch of the pull/push contract appears below.
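This is my illustration, not a real library API: a single-process toy that shows the pull/push contract. A production setup would run servers and workers as separate processes communicating over RPC:

```python
import torch

class ToyParameterServer:
    """Single-process sketch of the parameter-server pull/push contract."""

    def __init__(self, params, lr=0.01):
        # The server holds the authoritative, most up-to-date weights.
        self.params = [p.detach().clone() for p in params]
        self.lr = lr

    def pull(self):
        # Workers fetch the latest weights before computing gradients.
        return [p.clone() for p in self.params]

    def push(self, grads):
        # Workers send gradients; the server applies them. In asynchronous
        # mode, pushes from different workers may interleave in any order.
        for p, g in zip(self.params, grads):
            p.sub_(self.lr * g)
```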
Applying Model Parallelism Techniques
When the neural network is too big to fit on a single device such as a GPU, model parallelism is the natural solution; it also makes the training process less memory intensive per device. In model parallelism, the model is partitioned across multiple devices to effectively utilize the combined memory of the training cluster, storing the entire model in a memory-efficient fashion.
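As a minimal sketch, assuming a machine with two GPUs, here is the simplest form of model partitioning in PyTorch, with the layers split across the two devices (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class TwoDeviceModel(nn.Module):
    """Naive model parallelism: the layers are split across two GPUs."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 1024).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        # The activation is copied across the partition boundary.
        return self.part2(h.to("cuda:1"))
```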
Common approaches to model parallelism are as follows:
1. Pipeline parallelism
It partitions the set of model layers across several devices and divides each training mini-batch into micro-batches. These micro-batches are scheduled through a pipeline so that forward and backward computations on different devices overlap, which reduces device idle time.
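The micro-batch splitting can be sketched as below (my illustration; `stage1` and `stage2` are placeholders for layer partitions like the ones in the previous example). Note that this sketch runs the stages back to back, whereas a real pipeline engine overlaps them across devices:

```python
import torch

def pipelined_forward(stage1, stage2, mini_batch, n_micro=4):
    # Split the mini-batch into micro-batches.
    micro_batches = torch.chunk(mini_batch, n_micro)
    hidden, outputs = [], []
    # Sequential here for clarity; a real pipeline scheduler lets stage 1
    # start micro-batch i+1 while stage 2 is still processing micro-batch i.
    for mb in micro_batches:
        hidden.append(stage1(mb.to("cuda:0")))
    for h in hidden:
        outputs.append(stage2(h.to("cuda:1")))
    return torch.cat(outputs)
```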
2. Tensor parallelism
Whereas pipeline parallelism partitions the model at the level of whole layers, tensor parallelism splits individual weight tensors across multiple devices. Tensor parallelism is required when a single parameter tensor consumes most of a GPU's memory. Big models like GPT need to be divided this way and run on many devices at the same time to handle all the computation.
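Here is a column-parallel split of a single weight matrix, sketched on two GPUs (the sizes are arbitrary, and the final concatenation stands in for the all-gather a real framework would perform):

```python
import torch

# A weight matrix too large for one device, split column-wise across two GPUs.
W = torch.randn(1024, 4096)
W0 = W[:, :2048].to("cuda:0")  # first half of the output columns
W1 = W[:, 2048:].to("cuda:1")  # second half

x = torch.randn(8, 1024)
# Each device multiplies the input by its shard of the weights...
y0 = x.to("cuda:0") @ W0
y1 = x.to("cuda:1") @ W1
# ...and the partial results are concatenated (an all-gather in a real setup).
y = torch.cat([y0.cpu(), y1.cpu()], dim=1)  # shape (8, 4096)
```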
On AWS, Amazon SageMaker offers both data and model parallelism libraries. Other options include DeepSpeed from Microsoft and Megatron-LM from NVIDIA.
Thank You