Porting LLM trainer to run on Frontier

What's interesting is the amount of porting work that had to be done to create ROCm versions of common tooling. I hope they shared that work, either as open source or back to AMD.

Leveraging Megatron-DeepSpeed would have been relatively straightforward had Frontier been built with Nvidia GPUs, but it wasn't. Getting it to run on AMD hardware required working with AMD developers to port the project to ROCm.

Needless to say, this wasn't as simple as running HIPIFY to convert the code to AMD's Heterogeneous-compute Interface for Portability (HIP) runtime. No matter how many times chipmakers claim they can seamlessly convert CUDA code to some vendor-agnostic format, at these scales it's rarely that simple, though the situation is getting better.

Among the headaches researchers ran up against was that DeepSpeed's custom operations are normally compiled just-in-time, when the training pipeline is first executed. Unfortunately, this particular nuance doesn't play nicely with ROCm, requiring the researchers to disable just-in-time compilation and prebuild the operations instead.
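For context, DeepSpeed's documented way to do this is to set its DS_BUILD_* environment variables at install time, so the ops are compiled once up front rather than on first use. Here's a minimal sketch driven from Python; which individual ops ORNL had to prebuild or skip on ROCm isn't spelled out in the article, so the second variable below is purely a hypothetical example.

```python
import os
import subprocess

# Prebuild DeepSpeed's custom ops at install time instead of letting them
# JIT-compile during the first training step.
env = dict(os.environ)
env["DS_BUILD_OPS"] = "1"          # DeepSpeed's documented "build all compatible ops now" switch
env["DS_BUILD_SPARSE_ATTN"] = "0"  # hypothetical: skip an op that won't build on this stack

# DeepSpeed ships as a source distribution, so the env vars are picked up during the build.
subprocess.run(["pip", "install", "deepspeed", "--no-cache-dir"], env=env, check=True)
```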

Even then, the researchers needed help from AMD developers to fill in gaps in the ROCm stack. Namely, ROCm equivalents of certain essential CUDA packages had to be built. These included the APEX library, which Megatron-DeepSpeed uses for mixed-precision computation.
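To make the dependency concrete, this is the kind of mixed-precision path Megatron-DeepSpeed leans on APEX for. It's a minimal sketch assuming the classic apex.amp O2 recipe; the article doesn't say exactly which APEX components ORNL rebuilt for ROCm, and the model here is a toy.

```python
import torch
from apex import amp  # the APEX port that had to exist on ROCm for Frontier

# Toy model and optimizer; sizes are placeholders, not Frontier's actual config.
model = torch.nn.Linear(4096, 4096).to("cuda")  # on ROCm builds of PyTorch, "cuda" maps to the HIP device
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

# O2-style mixed precision: FP32 master weights, most math in FP16.
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

loss = model(torch.randn(8, 4096, device="cuda")).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:  # loss scaling guards against FP16 underflow
    scaled_loss.backward()
optimizer.step()
```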

The team also adapted ROCm’s implementation of FlashAttention and FlashAttention2 for use with the compilers available on Frontier. The latter, it seems, was a smart play, as the lab credited FlashAttention2 for a 30 percent improvement in throughput.
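For reference, the call that attention layers typically swap in when FlashAttention2 is available looks roughly like this. It assumes the upstream flash_attn Python API; the ROCm/Frontier port the team adapted may expose it slightly differently.

```python
import torch
from flash_attn import flash_attn_func  # FlashAttention2's fused attention entry point

# Shapes are (batch, seq_len, num_heads, head_dim); all values here are placeholders.
q = torch.randn(2, 2048, 16, 128, device="cuda", dtype=torch.float16)
k = torch.randn(2, 2048, 16, 128, device="cuda", dtype=torch.float16)
v = torch.randn(2, 2048, 16, 128, device="cuda", dtype=torch.float16)

# Fused, memory-efficient attention: the full seq_len x seq_len score matrix is never
# materialized, which is where the reported throughput gains come from.
out = flash_attn_func(q, k, v, causal=True)
```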

As for tensor parallelism, ORNL found that trying to scale it across nodes resulted in latency bottlenecks due to the sheer number of "AllReduce" operations being called. The best results were achieved by limiting tensor parallelism to a single node's eight GPUs – four MI250X modules, each presenting as two devices. Remember, each "Aldebaran" MI250X is really two GPU chiplets fused together, each with 64 GB of HBM2e. It looks like two GPUs to the software, which is what matters here. (The follow-on "Antares" MI300X, by the way, presents as one GPU rather than eight, even though it has eight compute chiplets, because their interconnect and caches are more tightly coupled.)
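To see how that constraint plays out in a Megatron-style 3D-parallel launch, here's a rough back-of-the-envelope sketch. The tensor-parallel degree of 8 (one node's worth of GPUs) and the 3,072-GPU count come from the article; the pipeline-parallel degree is an assumption purely for illustration.

```python
# How the parallelism degrees multiply out for the 1-trillion-parameter run.
world_size = 3072        # total GPUs used for the 1T test, per the article
tensor_parallel = 8      # kept within one node so AllReduce stays on fast intra-node links
pipeline_parallel = 16   # assumption: illustrative value, not reported in the article
data_parallel = world_size // (tensor_parallel * pipeline_parallel)

print(f"{data_parallel} data-parallel replicas, "
      f"each model instance sharded across {tensor_parallel * pipeline_parallel} GPUs")
# -> 24 data-parallel replicas, each model instance sharded across 128 GPUs
```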

Finally, the team employed the ZeRO-1 optimizer to reduce memory overheads, plus Amazon Web Services' plug-in for the ROCm Collective Communication Library (RCCL) – originally built to let EC2 instances use libfabric as a network provider – to improve communication stability between Frontier's nodes.
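A minimal sketch of what enabling ZeRO stage 1 looks like in a DeepSpeed config; the batch-size and precision values are placeholders, not the settings ORNL actually used.

```python
import json

# ZeRO stage 1 shards only the optimizer state across data-parallel ranks,
# cutting per-GPU memory without the extra communication of stages 2 and 3.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "gradient_accumulation_steps": 1,     # placeholder
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": True},            # assumption about the precision mode
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```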

In terms of efficiency, the team found that for a fixed problem size per processor (otherwise known as weak scaling), data-parallel training was 100 percent efficient. In other words, the more GPUs you throw at the problem, the bigger the problem you can train in roughly the same amount of time.
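Spelled out as a formula (my phrasing, not the paper's): weak-scaling efficiency just asks how much per-GPU throughput survives as GPUs are added while the work per GPU stays constant.

```python
def weak_scaling_efficiency(per_gpu_throughput_base: float,
                            per_gpu_throughput_scaled: float) -> float:
    """Work per GPU is held fixed, so efficiency is the fraction of per-GPU
    throughput retained at the larger GPU count; 1.0 means 100 percent scaling."""
    return per_gpu_throughput_scaled / per_gpu_throughput_base

# Hypothetical numbers: identical tokens/sec per GPU at both scales -> 100 percent.
print(weak_scaling_efficiency(1500.0, 1500.0))  # 1.0
```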

Where ORNL found diminishing returns was in scaling against a fixed problem size, known as strong scaling. Intuitively, you would think that if 500 GPUs can train a model in a given time, then 1,000 GPUs would do it in half that time. In reality, scaling up incurs all kinds of bottlenecks, and that bore out in ORNL's testing.

By fixing the global batch size at 8,000 and varying the number of processors, the team was able to achieve 89.9 percent strong-scaling efficiency for the 175-billion-parameter model on 1,024 GPUs, and 87.05 percent for the 1-trillion-parameter model on 3,072 GPUs.
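For completeness, strong-scaling efficiency is the achieved speedup divided by the ideal linear speedup at a fixed global problem size. The timings below are hypothetical; only the idea of the metric comes from the article.

```python
def strong_scaling_efficiency(time_base: float, gpus_base: int,
                              time_scaled: float, gpus_scaled: int) -> float:
    """Achieved speedup over the baseline run divided by the ideal linear speedup,
    with the global problem (here, a fixed global batch size) held constant."""
    achieved = time_base / time_scaled
    ideal = gpus_scaled / gpus_base
    return achieved / ideal

# Hypothetical timings: doubling the GPUs only cuts time from 100 to 57.5 units.
print(round(strong_scaling_efficiency(100.0, 1024, 57.5, 2048), 3))  # ~0.87
```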


This gives some pretty good visibility into why AMD will have a hard time taking significant market share from NVDA. As mentioned, the software continues to improve, but it is still a lot of work.
Alan
