The YouTube channel WelchLabs has great content on the concepts behind AI. On 9/13/2024 it posted a video discussing how the latest LLM releases continue to trace out an outer bound of “efficiency” when model performance is graphed as a function of the compute power burned to train each model.
At 10:39 in the video, a pair of graphs shows the performance jump from GPT-3 to GPT-4. Next to the two graphs is a legend with an astounding statistic: training GPT-3 burned 3,640 petaflop-days of compute, while training GPT-4 took 200,000 petaflop-days.
To put that in perspective…
A FLOP is a “FLOating Point operation”: a single arithmetic operation, such as a multiply or an add, on numbers that carry a decimal point. It’s heavier work for a CPU than a simple integer ADD to a local register, but a complex dot-product of two arrays with 1024 dimensions takes thousands of them. A PetaFLOP is one thousand trillion floating point operations.
1 petaflop = 1,000,000,000,000,000
An Intel i9 processor of the kind found in a high-end gaming desktop has 24 cores: 8 performance cores running at up to 6 GHz and 16 efficiency cores running at up to 4 GHz. That CPU is rated at roughly 1,228 gigaflops, or
1,228,000,000,000 operations/second
It would take 1,000,000 / 1,228, or about 814, of those desktops to provide 1 petaflop of computing power.
That means the GPT-3 training run of 3,640 petaflop-days would require 3,640 x 814, or 2,962,960, individual desktop computers, each running flat out for a full day, to match that computing power. GPT-4, at 200,000 petaflop-days, would require 200,000 x 814, or 162,800,000, desktop PC equivalents.
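For anyone who wants to check my arithmetic, here is a minimal back-of-envelope sketch in Python. It only restates the figures quoted above; the variable names are mine, not anything official.

```python
# Back-of-envelope: desktops per petaflop, and desktop-days to match the
# published training budgets. All figures are the ones quoted in the text.

I9_GIGAFLOPS = 1_228               # quoted rating for the Intel i9
PETAFLOP_IN_GIGAFLOPS = 1_000_000  # 1 petaflop = 1,000,000 gigaflops

desktops_per_petaflop = PETAFLOP_IN_GIGAFLOPS / I9_GIGAFLOPS  # ~814

GPT3_PETAFLOP_DAYS = 3_640
GPT4_PETAFLOP_DAYS = 200_000

# Each petaflop-day needs ~814 desktops running flat out for a full day.
gpt3_desktop_days = GPT3_PETAFLOP_DAYS * desktops_per_petaflop  # ~2.96 million
gpt4_desktop_days = GPT4_PETAFLOP_DAYS * desktops_per_petaflop  # ~162.8 million

print(f"{desktops_per_petaflop:,.0f} desktops per petaflop")
print(f"GPT-3: {gpt3_desktop_days:,.0f} desktop-days")
print(f"GPT-4: {gpt4_desktop_days:,.0f} desktop-days")
```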
Obviously, this is a bit of an exaggeration. The computers used for this training were equipped with the latest GPU (Graphics Processing Unit) boards, which are optimized for matrix math. A top-of-the-line consumer grade GPU, NVIDIA’s RTX 4090, can perform about 82.6 teraflops of standard (FP32) arithmetic, so the training loads would be
GPT-3 training = 3,640 petaflops = 3,640,000 teraflops / 82.6 = 44,068 RTX 4090 GPUs
GPT-4 training = 200,000 petaflops = 200,000,000 teraflops / 82.6 = 2,421,308 RTX 4090 GPUs
NVIDIA’s data-center-oriented A100 GPU is rated at 312 teraflops for the lower-precision tensor math used in AI training, so the same training loads would equate to these counts of A100 processors, which cost about $23,000 each (both GPU calculations are sketched in code after this list):
GPT-3 training = 3,640 petaflops = 3,640,000 teraflops / 312 = 11,667 A100 GPUs
GPT-4 training = 200,000 petaflops = 200,000,000 teraflops / 312 = 641,025 A100 GPUs
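The same arithmetic as a small Python sketch; the helper function is mine, and the per-card teraflop ratings are simply the ones quoted above.

```python
# GPU-days needed to deliver a given number of petaflop-days at a given
# per-card teraflop rating (FP32 for the RTX 4090, tensor math for the A100).

def gpu_days(petaflop_days: float, teraflops_per_gpu: float) -> float:
    teraflop_days = petaflop_days * 1_000  # 1 petaflop = 1,000 teraflops
    return teraflop_days / teraflops_per_gpu

for model, pf_days in [("GPT-3", 3_640), ("GPT-4", 200_000)]:
    print(f"{model}: {gpu_days(pf_days, 82.6):,.0f} RTX 4090-days, "
          f"{gpu_days(pf_days, 312):,.0f} A100-days")
```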
Note that the original statistics were not in petaflops but petaflop-days. A petaflop-day is one petaflop of processing sustained for an entire day: one thousand trillion operations per second for the 86,400 seconds in a day, or 86,400,000,000,000,000,000 total floating point operations. So each of the device counts above really means that many machines running for a full 24 hours. Here’s the power consumption of each of those hardware examples:
- Intel i9 14900K processor = 360 watts
- NVIDIA RTX 4090 consumer grade GPU card = 450 watts
- NVIDIA A100 data center grade GPU card = 400 watts
The energy each would consume over a full day of operation, in kilowatt hours, works out as follows (the conversion is sketched in code after this list):
- Intel i9 14900K processor = 0.360 kilowatts x 24 hours = 8.64 kilowatt hours
- NVIDIA RTX 4090 consumer grade GPU card = 0.450 kilowatts x 24 hours = 10.8 kilowatt hours
- NVIDIA A100 data center grade GPU card = 0.400 kilowatts x 24 hours = 9.6 kilowatt hours
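The watts-to-kilowatt-hours conversion, again as a quick Python sketch using the power figures listed above.

```python
# Energy used by one device running flat out for 24 hours:
# kilowatt-hours = (watts / 1,000) * hours.

DEVICE_WATTS = {
    "Intel i9 14900K": 360,
    "NVIDIA RTX 4090": 450,
    "NVIDIA A100": 400,
}

HOURS_PER_DAY = 24

for name, watts in DEVICE_WATTS.items():
    kwh_per_day = (watts / 1_000) * HOURS_PER_DAY
    print(f"{name}: {kwh_per_day:.2f} kWh per day")  # 8.64, 10.80, 9.60
```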
Mapping those consumption levels onto the 200,000 petaflop-days of GPT-4 training is jaw-dropping (the arithmetic is sketched in code after this list):
- GPT-4 training on desktop equivalents = 8.64 kWh x 162,800,000 = 1,406,592,000 kilowatt hours
- GPT-4 training on RTX 4090 GPUs = 10.8 kWh x 2,421,308 = 26,150,126 kilowatt hours
- GPT-4 training on A100 GPUs = 9.6 kWh x 641,025 = 6,153,840 kilowatt hours
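Putting the two pieces together: total energy is just device-days times kilowatt hours per device-day. The counts and per-day figures below are carried over from the sketches above.

```python
# Total GPT-4 training energy = (device-days needed) * (kWh per device per day).

SCENARIOS = {
    # name: (device-days for GPT-4, kWh per device per day)
    "desktop i9": (162_800_000, 8.64),
    "RTX 4090":   (2_421_308, 10.80),
    "A100":       (641_025, 9.60),
}

for name, (device_days, kwh_per_day) in SCENARIOS.items():
    total_kwh = device_days * kwh_per_day
    print(f"GPT-4 on {name}: {total_kwh:,.0f} kWh")
```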
For comparison, the largest single power plant complex in the United States is Grand Coulee Dam, which is rated at a capacity of 7,097 megawatts, or 7,097,000 kilowatts. If I haven’t scrambled a units conversion somewhere, the energy consumed by the GPT-4 training (in the A100 case, the most efficient of the three) would have absorbed the entire output of Grand Coulee Dam for about 52 minutes.
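That last comparison, sketched out, assuming the dam could somehow be dedicated to nothing but this job.

```python
# How long Grand Coulee Dam's full rated output would take to supply the
# A100-based training energy. Capacity figure is the one quoted above.

GRAND_COULEE_KW = 7_097_000    # rated capacity in kilowatts
A100_TRAINING_KWH = 6_153_840  # GPT-4 on A100s, from the previous sketch

hours = A100_TRAINING_KWH / GRAND_COULEE_KW
print(f"{hours * 60:.0f} minutes of Grand Coulee's full output")  # ~52 minutes
```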
And that’s just the training compute. I haven’t seen a good summary anywhere of how the trained model is then scaled out for interactive use. That inference workload depends on the same extremely large matrix operations, so I would presume the computing investment to run the finished model is comparable to what it took to train it, especially once millions of people are using it daily.
It seems obvious at this point that an accurate cost/benefit analysis of AI has not been attempted. The assumption seems to have been that this compute is just lying around doing SOMETHING anyway, so let’s have it do THIS and see if anything interesting results. It’s only when you see announcements of one hundred billion dollars in new equipment being planned that it becomes apparent there are environmental impacts to AI that need to be considered as public policy, not solely as private investment decisions that treat power and water as freely available resources.
WTH