Micro-Optimization

The Computer Architecture of AI (in 2024)

Over the last year, as a person with a hardware background, I have heard a lot of complaints about Nvidia’s dominance of the machine learning market, along with questions about whether I could build chips to make the situation better. While I expect it would take less than $7 trillion, hardware acceleration for this wave of AI is a much tougher problem than the last wave focused on CNNs, and there is a good reason that Nvidia has become the leader in this field with few competitors. Whereas CNN inference was largely a math problem, large language model inference has become a computer architecture problem: figuring out how to coordinate memory, I/O, and compute to get the best performance out of the system.

Introduction to Micro-Optimization

A modern CPU is an incredible machine. It can execute many instructions at the same time, it can re-order instructions so that memory accesses and dependency chains don’t impact performance too much, it renames a small set of architectural registers onto hundreds of physical ones, and it has huge areas of silicon devoted to predicting which branches your code will take. However, if you have a tight loop and you are interested in optimizing the hell out of it, the same mechanisms that make your code run fast can make your job very difficult. They add a lot of complexity that can make it hard to figure out how to optimize a function, and they can also create local optima that trap you in a less efficient solution.
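
To make the dependency-chain point concrete, here is a small sketch (an illustrative example I am adding, not code from the original): summing an array the obvious way serializes every addition through a single accumulator, while splitting the work across a few independent accumulators lets the out-of-order machinery keep several additions in flight at once.

    #include <stddef.h>

    /* Naive sum: each addition depends on the previous one through `sum`,
     * so the loop is limited by the latency of a single chain of adds,
     * no matter how wide the core is. */
    double sum_naive(const double *x, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            sum += x[i];
        }
        return sum;
    }

    /* Same work with four independent accumulators: the chain is broken,
     * so several floating-point adds can execute in parallel.
     * (This reassociates the additions, so results can differ slightly
     * in the last bits for floating-point data.) */
    double sum_unrolled(const double *x, size_t n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        for (; i < n; i++) {
            s0 += x[i];   /* remaining elements */
        }
        return (s0 + s1) + (s2 + s3);
    }

Which version wins, and by how much, depends on the exact core, and the compiler may already perform this transformation for you, which is part of why reasoning about these optimizations by hand is so tricky.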