Over the last year, as a person with a hardware background, I have heard a lot of complaints about Nvidia’s dominance of the machine learning market and whether I can build chips to make the situation better. While the amount of money I would expect it to take is less than $7 trillion, hardware accelerating this wave of AI will be a very tough problem–much tougher than the last wave focused on CNNs–and there is a good reason that Nvidia has become the leader in this field with few competitors. While the inference of CNNs used to be a math problem, the inference of large language models has actually become a computer architecture problem involving figuring out how to coordinate memory and I/O with compute to get the best performance out of the system.
The Basic Operations Used by ML Models
I am going to start with a brief overview of each of the operations used in AI systems here, but each has far more depth to go into for its implementation specifics. The largest operations in terms of compute time are the linear operations: matrix multiplications and convolutions, as well as the matrix multiplications that are within an attention calculation. Accumulations and other nonlinear functions consume significantly less time, but create mathematical “barriers” in the calculation that prevent you from combining the computations of multiple matrix multiplications together. The combination of these linear and nonlinear ops is what gives neural networks power from a statistical perspective: a layer of a neural network is essentially a combination of a matrix-multiplication-like linear operation and a nonlinear operation.
Matrix multiplication is a basic operation used in neural network calculations, frequently involving multiplying a weight matrix by a vector of internal state. A matrix-vector product combined with a nonlinear function is the basic operation of a fully-connected layer. Matrix multiplication is one of the most well-studied algorithms in computer science, and any company building an accelerator for graphics or neural networks will have an efficient solution. There are a lot of clever ways to do matrix multiplication, but the most efficient one is often brute force: using a lot of multipliers in straightforward arrays, and being clever about how you access memory.
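To make the fully-connected layer concrete, here is a minimal sketch in plain Python. The function names, the bias term, and the choice of ReLU as the nonlinearity are my own illustrative assumptions; a real implementation would run on wide multiplier arrays rather than nested loops.

```python
def matvec(weights, x):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    """Elementwise nonlinearity: clamp negative values to zero."""
    return [max(0.0, a) for a in v]


def fully_connected(weights, bias, x):
    """One layer: a matrix-vector product followed by a nonlinear function."""
    pre_activation = [p + b for p, b in zip(matvec(weights, x), bias)]
    return relu(pre_activation)


# Hypothetical 3x2 weight matrix and bias, chosen only for illustration.
W = [[1.0, -2.0], [0.5, 0.5], [-1.0, 1.0]]
b = [0.0, 1.0, -0.5]
y = fully_connected(W, b, [2.0, 1.0])
print(y)  # [0.0, 2.5, 0.0]
```

Every weight is read exactly once and used for exactly one multiply-add, which is the property that matters later when the weights no longer fit on the chip.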
Depending on the model, the characteristics of these matrices can be different: image-based models will have a lot of matrix-matrix products, while LLMs will have matrix-vector products.
While not present in LLMs, convolution operations are a common feature of computer vision models, and there is some chance that they will find new usage. A convolution involves “sliding” a small matrix over a much bigger matrix, and calculating the correlation between the shifted versions of the small matrix and the big matrix. This becomes a small version of a matrix-matrix product with a lot of data sharing between computations. Convolutions have a lot of tricks for efficient computation due to the data sharing and the fact that each of these small matrix-matrix multiplications shares one operand. The hardware for matrix multiplication is also relatively easy to share with convolution.
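The "sliding" description can be sketched directly. This is a naive version with no padding or stride (both simplifying assumptions), and the function name is mine; it is meant to show the data sharing, not to be fast.

```python
def correlate2d(image, kernel):
    """Slide `kernel` over `image` and take the sum of products at each
    position. No padding and a stride of 1 are assumed."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            # The same kernel values are reused at every (i, j), and
            # neighboring positions re-read overlapping image pixels.
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
result = correlate2d(image, kernel)
print(result)  # [[-4.0, -4.0], [-4.0, -4.0]]
```

The small kernel is the shared operand: it is loaded once and reused at every output position, which is why convolutions get so much arithmetic out of so little memory traffic.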
Nonlinear functions like exponentials, sigmoids, arctans, and a family of functions that look like ReLU are common operations inside neural networks. Given the low precision of the number systems used (16-bit floating point at most), these are relatively easy to compute with fast approximations or lookup table methods. Each neural network often also uses relatively few different kinds of nonlinear functions, so it is possible to precompute parameters to make this computation faster. While these are generally not computationally intensive, their main function is to make the matrix multiplications in each layer irreducible.
Accumulation operations involve scanning a vector and computing a function over every element in that vector. For neural networks, the most common functions used are the maximum and the sum: softmax calculations use both, and layer normalizations use sums to compute a mean and a variance. While these are not computationally intensive, the result of these computations will not be available until the entire vector has been scanned. This forms a computational barrier: anything that depends on the result of this operation has to wait for it to be produced. It is common to put a normalization between layers of neural networks, preventing number magnitudes from getting out of hand.
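Softmax is a good illustration of the barrier. In the sketch below (the usual numerically stable formulation, written in plain Python), no output element can be produced until two full scans of the input have finished.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of floats."""
    m = max(v)                            # scan 1: needs the whole vector
    exps = [math.exp(a - m) for a in v]   # elementwise, but depends on m
    total = sum(exps)                     # scan 2: needs every exponential
    return [e / total for e in exps]      # only now can outputs be emitted


probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

Subtracting the maximum first keeps the exponentials from overflowing, which matters a lot with 16-bit floats.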
Attention is the “new kid on the block” in terms of operations used for neural networks, and is the defining feature of transformer models. Attention is supposed to represent a key-value lookup of a vector of queries, but that is a pretty loose intuition of what happens in my opinion. In terms of the operations done, attention is a calculation that combines a few matrix-vector products and a softmax-based normalization function. Thanks to the linear nature of the matrix-vector products, there are a lot of clever ways to rearrange this calculation to avoid unnecessary memory loads even when large matrices are used: the Flash Attention paper goes through the most famous rearrangement. However, there have been other attempts to do fast attention calculation, including numerical approximations of the calculation and the use of “landmark tokens” to turn the large version of the calculation into several smaller versions. These calculations all translate to hardware relatively well, and build on top of the units we build for matrix multiplication and softmax.
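As a rough illustration of the kind of rearrangement involved, the sketch below computes attention for a single query twice: once the straightforward way, and once in a single streaming pass that keeps only a running maximum, a running sum, and a running output. This is the "online softmax" idea that Flash Attention builds on, not the Flash Attention kernel itself, which also tiles the work across blocks of queries and keys. All function names are my own.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q, keys, values):
    """Materialize every score, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_streamed(q, keys, values):
    """Same result in one pass over the keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = dot(q, k) * scale
        new_max = max(running_max, s)
        # Rescale what has been accumulated so far to the new maximum.
        # On the first step this is exp(-inf), which is 0.0.
        correction = math.exp(running_max - new_max)
        w = math.exp(s - new_max)
        running_sum = running_sum * correction + w
        acc = [a * correction + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
naive = attention_naive(q, keys, values)
streamed = attention_streamed(q, keys, values)
print(naive, streamed)
```

The streamed version never holds the full score vector, so each key and value only has to be loaded once.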
How Computer Architects Think about Matrix Math Operations
Performing arithmetic on a chip breaks down into two kinds of work: getting the data into and out of the chip, and doing the operations themselves. There is a well-known ratio here called “arithmetic intensity”: the number of arithmetic operations done per byte of memory traffic. Problems with high arithmetic intensity are compute-bound, and are limited by your ability to put arithmetic units onto a chip that can perform the relevant operations. Problems with low arithmetic intensity are memory bandwidth bound: when you are spending a lot of time loading and storing data, your chip’s compute units are idling.
Every chip can be modeled as having a “roofline” on performance based on arithmetic intensity depending on its available compute FLOPs and its memory bandwidth. This roofline has a knee at a certain point, where operations go from being memory bandwidth bound to compute bound. Chips are generally architected to support the software that they will be used to run (with about 5 years of lag, since it takes a long time to make a chip), and this knee is generally chosen to be at a relatively good point for most users. However, it is one-size-fits-all.
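The roofline itself is a one-line formula. Taking arithmetic intensity in operations per byte, attainable throughput is the smaller of the compute peak and bandwidth times intensity, and the knee sits where the two are equal. The chip numbers below are round, hypothetical values for illustration, not the specifications of any real part.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline model: throughput is capped by compute or by memory."""
    return min(peak_flops, mem_bandwidth * intensity)


def knee(peak_flops, mem_bandwidth):
    """Intensity (ops per byte) where the chip becomes compute-bound."""
    return peak_flops / mem_bandwidth


PEAK = 100e12  # hypothetical: 100 TFLOP/s of compute
BW = 1e12      # hypothetical: 1 TB/s of memory bandwidth

print(knee(PEAK, BW))  # 100.0 ops per byte
for intensity in (2, 50, 100, 200):
    flops = attainable_flops(PEAK, BW, intensity)
    print(intensity, flops / PEAK)  # fraction of peak actually usable
```

On this made-up chip, a workload at 2 operations per byte can use at most 2% of the arithmetic units, no matter how good the circuits are.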
Google performed this analysis for the TPUv4, and found that the neural networks of interest had an arithmetic intensity relatively close to the knee of an A100 GPU. The H100, by the way, has its knee at almost the same place. However, Google’s analysis here does not describe what they are doing in each cited model, so we are left to guess.
To apply this to the operations we discussed earlier, the matrix operations are the ones that consume memory bandwidth. Accumulation and nonlinear operations can generally be attached to the previous block of computation and add arithmetic operations. However, the matrix operations have the disadvantage that they operate on blocks of data, meaning that they add a relatively large amount of memory management per mathematical operation.
The Working Sets of Models
Thanks to the regularity of the computation of a model, the working set of models is relatively well-defined. The working set contains:
- The model weights
- The input data
- The intermediate state of the model
- During training, gradients (and learning rates) for each weight
These items can be small or large depending on the model. For models like LLMs, the weights are huge, sometimes too big for a single GPU, while the input data and the intermediate state are a few thousand numbers. For computer vision models, the weights are a lot smaller and the input data and intermediate state are a lot bigger. Due to the size of the weights, training for LLMs takes about 3x as much memory as inference, but otherwise looks somewhat similar to inference.
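A back-of-the-envelope sizing makes the weight term concrete. The 7-billion-parameter model here is hypothetical, and the training figure simply applies the rough 3x rule of thumb above rather than modeling any particular optimizer.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to hold the weights at a given precision."""
    return num_params * bits_per_weight // 8


NUM_PARAMS = 7_000_000_000  # hypothetical model size

inference_fp16 = weight_bytes(NUM_PARAMS, 16)
inference_4bit = weight_bytes(NUM_PARAMS, 4)
training_fp16 = 3 * inference_fp16  # rough 3x rule of thumb from the text

for label, size in [("fp16 inference", inference_fp16),
                    ("4-bit inference", inference_4bit),
                    ("fp16 training", training_fp16)]:
    print(f"{label}: {size / 1e9:.1f} GB")
```

Compare those gigabytes with an intermediate state of a few thousand numbers and it is clear which part of the working set dominates for an LLM.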
The size of the working set isn’t the only thing that matters, however. The ability to put parts of that working set on the chip (i.e. in cache) has a significant effect on the arithmetic intensity of the computation of a model. Parts of the working set that are in nearby caches and registers allow you to access those parts without hitting memory, so the heat distribution of the working set dictates how often you are going to hit memory rather than loading from a cache. The bandwidth of caches, for most purposes, is effectively unlimited.
This leads to a rough breakpoint in model size that dictates how the model is computed: small models that can fit in the cache of a GPU (or the on-chip memory of an ASIC) get to stream data through the weights, while larger models have to stream the weights through the data. This is not a perfect model, since “medium-sized” models can hide some of the effects of streaming the weights through the data, but it is the difference between a smaller computer vision or audio model and a large language model. Not only is the LLM bigger, it also has to be computed less efficiently.
Small Models: Streaming Data through Weights
The previous AI wave of 2016 involved models with thousands or millions of parameters. These models are small enough to fit inside the cache and compute units of a reasonably-sized chip. This means that it is possible to keep the entire model on the chip used for inference or training. In this case, many GPU-like systems with cache hierarchies will find it easy to keep the model in cache, and systems with a scratchpad or with special internal memory dedicated for storing models are possible to build cost-efficiently.
This configuration has the benefit that it is the “natural” way to compute over data: you can load the data from a disk once, perform a calculation, and then store or use the results. Also, this method of computing lets you take advantage of incredibly high arithmetic intensity during inference. Each time the model runs, the speed of the whole run depends only on how quickly the input can be loaded. However, if the input is large, needs a lot of preprocessing, or has to be scanned many times, this can result in a drop in arithmetic intensity, particularly for systems that rely on caching: processing the input may end up evicting the model from cache, causing extra loads.
Large Models: Streaming Weights through Data
With larger models of several billions of parameters, it is no longer possible to keep the weights on a single chip (note: HBM does not count as “on the chip”). However, the use of linear algebra means that there is still a regular pattern of computation to take advantage of: every model goes layer by layer. The intermediate state of large models is still relatively small: the largest of the original LLaMA models uses an internal state of only 8k floating point numbers, and a context window of 2k input tokens.
This means that if you keep only the intermediate products of the model on the chip, the on-chip working set is still relatively small. To compute these models, the typical approach is to stream the model through the data. One batch of data gets loaded onto the chip, and the internal state of the model stays on the chip as inference progresses layer by layer.
In this configuration, the data stored off-chip is going to be accessed repeatedly, so loading model weights from disk (or even RAM on a CPU or another remote site) is generally considered too slow to do. This is why a lot of LLMs have high VRAM requirements for GPUs: you have to store the entire model (plus a few other things, like the K-V cache used to speed up inference) inside the GPU’s VRAM.
Due to the fact that weights are streamed into the chip, the arithmetic intensity with this model of computing is comparatively very low, and is driven by your ability to reuse weights as they get loaded. While many models have operations that allow you to take advantage of reuse, LLMs have very limited reuse thanks to the prevalence of matrix-vector products. GPT-2 on a GPU notably has an arithmetic intensity of about 2 operations per byte loaded. However, due to weight reuse, BERT models get up to around 100-200 ops per byte.
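One way to see what this costs: at batch size 1, every weight has to cross the memory bus once per generated token, and each weight is used for roughly one multiply and one add. The sketch below compares the two resulting lower bounds on time per token. The model size and chip figures are round, hypothetical numbers, and the estimate ignores the K-V cache and everything else besides the weights.

```python
def seconds_per_token(num_params, bytes_per_weight, mem_bandwidth, peak_flops):
    """Lower bounds on per-token latency at batch size 1."""
    memory_time = num_params * bytes_per_weight / mem_bandwidth
    compute_time = 2 * num_params / peak_flops  # one multiply + one add per weight
    return memory_time, compute_time


memory_time, compute_time = seconds_per_token(
    num_params=7e9,       # hypothetical 7B-parameter model
    bytes_per_weight=2,   # 16-bit weights
    mem_bandwidth=1e12,   # hypothetical 1 TB/s
    peak_flops=100e12,    # hypothetical 100 TFLOP/s
)
print(f"memory-bound:  {memory_time * 1e3:.2f} ms per token")
print(f"compute-bound: {compute_time * 1e3:.2f} ms per token")
print(f"upper bound on tokens per second: {1 / memory_time:.0f}")
```

With these numbers, streaming the weights takes about a hundred times longer than the math, so the arithmetic units spend almost all of their time waiting.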
The Software-Level Tricks: Batching, Quantization, and Sparsity
It is possible to increase the compute intensity of almost any compute application, but particularly machine learning models, with batching. Batching means performing several parallel computations of the same type at the same time, like multiple chats with an LLM or several images with a computer vision model. However, batching requires that you have the parallel work available: a local LLM probably cannot benefit from batching nearly as much as OpenAI’s centralized chatbot system with thousands or millions of simultaneous users. A system built for larger batch sizes can generally use a higher compute to memory ratio.
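The effect of batching on a single weight matrix can be estimated with a simple count. The sketch below assumes 16-bit values, assumes the weights are loaded once per batch, and counts only the weights, inputs, and outputs as memory traffic; the 4096x4096 layer size is a hypothetical example.

```python
def matmul_intensity(rows, cols, batch, bytes_per_value=2):
    """Operations per byte for a (rows x cols) weight matrix applied to
    `batch` input vectors, with the weights loaded once per batch."""
    flops = 2 * rows * cols * batch  # one multiply and one add per weight per input
    bytes_moved = bytes_per_value * (
        rows * cols      # weights, shared by the whole batch
        + batch * cols   # input vectors
        + batch * rows   # output vectors
    )
    return flops / bytes_moved


for batch in (1, 8, 64, 512):
    print(batch, round(matmul_intensity(4096, 4096, batch), 1))
```

At batch size 1 this comes out just under 1 operation per byte, and it grows almost linearly with batch size until the input and output traffic starts to matter.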
Quantization involves reducing the size of the working set by using smaller numbers. This is a very effective strategy for inference, where you have the full, pretrained model and can now do a bunch of math to figure out the best way to reduce the precision of the weights so you store effectively 2-4 bits per weight with comparatively low loss in model accuracy. Quantized training, however, is somewhat harder: you don’t know the weights ahead of time, so the normal mathematical tricks of quantized inference do not work. A lot of training processes need 8-bit floating point numbers today, and this is already considered “quantized” compared to 2020, when 16-bit floats were the norm for training. Quantization does add some extra ops to convert from the quantized form to a form that can be used for computing (the internal state of models is still FP8 or FP16), but those ops are not the limiting factor.
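The simplest possible version of this is round-to-nearest quantization with one shared scale per group of weights, sketched below. This is a deliberately naive baseline, not one of the smarter methods that get down to 2-4 bits with low accuracy loss; the function names and the sample weights are made up.

```python
def quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit values
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero group: any scale works
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """The extra op needed before the weights can be used for math."""
    return [v * scale for v in q]


weights = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize(weights, bits=4)
restored = dequantize(q, scale)
print(q)         # small integers, 4 bits each
print(restored)  # approximately the original weights
```

Each weight now costs 4 bits plus a share of the scale, at the price of one multiply per weight when it is loaded.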
Sparsity can be thought of as a kind of quantization. Sparsifying models involves finding the weights that don’t matter for the outcome of a neural network, and treating those weights as though they are equal to 0, dropping them from computation. This reduces both the working set size and the number of arithmetic ops required to compute the output. Sparsity is more effective when you can treat a whole block of weights as 0, but can also be applied at a fine-grained level when matrix multiplication units are designed for it: Nvidia’s are. However, sparsity is basically impossible to take advantage of in training: you don’t know which weights are going to matter before you train the model.
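As an example of the fine-grained case, Nvidia's sparse tensor cores use a "2:4" pattern: in every group of four weights, at most two are nonzero. The sketch below prunes by magnitude to fit that pattern. Picking the two largest magnitudes is the simplest selection rule and is my assumption here; real pruning flows usually fine-tune the model afterward.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.8, 0.2, 0.3, -0.4, 0.01]
sparse = prune_2_of_4(weights)
print(sparse)  # [0.9, 0.0, 0.0, -0.8, 0.0, 0.3, -0.4, 0.0]
```

Because the pattern is regular, the hardware only needs to store the two surviving values per group plus a couple of index bits, halving both the weight traffic and the multiplies.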
All of these are knobs that can tune the exact operating point of model inference, and help people building systems pull the most out of their hardware.
The “$7 Trillion” Question
The last wave of AI accelerators got the benefit of focusing on the “small model” case. This is a circuit designer’s dream: the state you need to compute over can generally fit entirely on a chip, so it is within your control to design an ASIC that can compute it efficiently. You are free to design new circuits and systems that implement the “new” math of these systems to push performance without worrying so much about what goes on outside of the borders of your silicon. This is part of why we saw so many interesting new architectures at this time, from analog matrix multiplications to interesting new kinds of arithmetic to single-chip massive SIMT systems. These promised huge gains in energy efficiency and speed, and achieved those gains, but found themselves pushing the knee of their roofline models far to the right: many are only able to obtain that efficiency when they can avoid memory operations altogether.
Tomorrow’s AI accelerators will need to focus on memory to contend with the case of streaming weights through data. The primary goal of these chips will be to find a way to “feed the beast” and get the working set closer to the compute. This is what Nvidia, who designed the best general-purpose parallel compute engine they could, has figured out how to do well. They had the advantage that getting data close to compute is a common problem in HPC systems, which have a large working set but tend to have higher arithmetic intensity than LLMs. While everyone else is scrambling to do this now, Nvidia has been working on this problem for almost a decade, as their Tesla accelerators have eaten the HPC world.
Some startups, like Groq and Cerebras, have found that they can somewhat adapt to this new environment, although both in relatively niche ways. Future technologies in this area could be new kinds of in-memory compute, new ways to attach memory to compute, or new multi-chip networks that allow you to get a lot more RAM and compute next to each other. Getting the working set closer to the compute can also mean compressing the working set, which generally happens through quantization. However, future accelerators may be able to find novel ways to compress the working set, which has a very predictable access pattern, that aren’t necessarily subject to the same working set limitations, like using traditional compression methods on LLM weights (although there are a lot of reasons why gzip-style compression won’t work well). Block floating point is one easy option here, which compresses floating-point numbers by sharing one exponent between many numbers, but there is probably something more clever to do.
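Block floating point is simple enough to sketch in a few lines. The version below shares one power-of-two exponent across a block and stores a small signed integer mantissa per value. The 8-bit mantissa width, the block contents, and the function names are illustrative assumptions, not any particular hardware format.

```python
import math


def bfp_encode(block, mantissa_bits=8):
    """Encode a block of floats as one shared exponent plus integer mantissas."""
    largest = max(abs(x) for x in block)
    if largest == 0.0:
        return 0, [0] * len(block)
    # frexp gives largest = m * 2**exponent with 0.5 <= m < 1.
    _, exponent = math.frexp(largest)
    shift = mantissa_bits - 1   # one bit is spent on the sign
    limit = 2 ** shift - 1      # largest storable magnitude
    mantissas = [
        max(-limit, min(limit, round(x * 2.0 ** (shift - exponent))))
        for x in block
    ]
    return exponent, mantissas


def bfp_decode(exponent, mantissas, mantissa_bits=8):
    """Rebuild approximate floats from the shared exponent and mantissas."""
    shift = mantissa_bits - 1
    return [m * 2.0 ** (exponent - shift) for m in mantissas]


block = [0.5, -0.25, 0.125, 0.75]
exponent, mantissas = bfp_encode(block)
print(exponent, mantissas)              # 0 [64, -32, 16, 96]
print(bfp_decode(exponent, mantissas))  # [0.5, -0.25, 0.125, 0.75]
```

The values in this example happen to round-trip exactly; in general, the small values in a block lose precision because they share the exponent of the largest one.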
It does appear that there is an unserved niche here in local LLMs (if that market exists): large memory capacity and high memory bandwidth combined with a relatively underpowered compute core seems to be something that nobody is doing yet. However, this is a problem that lives at the interface between memory and compute. Conversely, there may be another opportunity here in super-large batch sizes for companies like Google and OpenAI who probably do have the work available, and the solution to use in that case will look very different than the solution for low batch size. These systems will probably look a lot more like custom supercomputers than what we have today.
The advances required to make LLMs orders of magnitude faster will likely be more system-level than circuit-level, but the GPU technologies that exist today for most “AI” aren’t necessarily off the mark. Nvidia/AMD’s manufacturing process edge may be able to more than make up for the efficiencies you can gain by focusing only on the computations used for inference. For startups, which already have a hard time raising the $10-20 million you would need to spin a chip, raising the hundreds of millions it may take to design these completely new systems may be beyond reach. With the pace of advancement in model architectures, there is also no guarantee that this research will be relevant by the time it arrives: the models of 2030 may be as different from today’s models as today’s models are from the models of 2016. In particular, as new architectures show up for models, the trend of incredibly low arithmetic intensity may not continue. Still, I am looking forward to seeing what the startups in this area can cook up.