A lot of the time, when engineers think of performance work, we think about looking at benchmarks and making the numbers smaller. We anticipate that we are benchmarking the right pieces of code, and we take it for granted that reducing some of those numbers is a benefit, but also “the root of all evil” if done prematurely. If you are a performance-focused software engineer, or you are working with performance engineers, it can help to understand the value proposition of performance and when to work on it.
There are three important components of performance for computer systems: latency, throughput, and efficiency. Broadly speaking, latency is how quickly you can do something, throughput is how many of those things you can do per second, and efficiency is how much you pay to do that thing. Each of these dimensions can be a little bit abstract and hard to work with directly, so we often think about proxies, like CPU time, but it helps to keep your eye on what is actually impactful. These dimensions all have interesting relationships with each other, and reasoning about them directly can help you figure out how impactful performance work can be for you.
Throughput is either defined as the load on a system or the system’s maximum capacity to sustain load. Throughput is measured in terms of queries per second or bytes per second (or both at the same time). Often, systems are designed with specific throughput goals, or are designed to achieve maximum throughput under a certian set of constraints. In many cases, throughput is the performance dimension that dictates budgets and hardware requirements. A table of common throughputs can be found here.
Throughput often has a relationship to efficiency. When you have a certain number of servers and you are running a simple process on those servers, reducing the amount of compute spent on each query can often provide a corresponding in throughput. It is common to hear about QPS/core or QPS/VM for a service that scales, and for simple services and systems, throughput and efficiency often go hand-in-hand. Throughput can also have a relationship with latency: If you have a limit on the number of operations outstanding in a system, the latency of the operations can cause a throughput limit. Consider a database that can process only 1 write operation at a time where a write operation takes 1 ms: That database cannot process more than 1000 writes per second no matter how efficient it is.
Additionally, the throughput of a process is limited by its lowest-throughput step. For example, if you have a system that calls an API that can only take 1000 QPS, your system cannot exceed that 1000 QPS limit. Similarly, a super-fast storage system backed by a hard drive will never exceed ~150-200 MB/second (depending on the hard drive). Focusing on any other part of the system will not improve its throughput at all.
There are many ways to improve throughput of a system, including adding parallelism, sharding, using caches, and making efficiency improvements, but thanks to the principle that the lowest-throughput component determines throughput, you have to reason about the throughput of entire systems if you want to make meaningful improvements. In our hypothetical case of a hard-drive-backed storage system, adding a caching SSD would raise its throughput a lot because the hard drive is the limiting factor, and in the hypothetical write-limited database system, sharding the application-level database would alleviate throughput problems caused by the latency of write operations.
Today, we often hear about “infinite-scale” or “planet-scale” distributed systems, and these systems are often designed for infinite throughput scalability. These systems are very academically interesting, but they often make trades in efficiency or latency for their ability to scale without limits. If you are concerned about the other dimensions of performance, it may be better to eschew planet scale for a system that works at the size you want.
Compute efficiency is about dollars spent per dollar earned. For a straightforward example, if you have a serverless backend and you charge $1.00 to customers per serverless function call, reducing the CPU time on that serverless function is meaningful efficiency work - you will spend less per dollar you make.
It is important to make sure that you can tie efficiency gains to dollars. For example, if you need to use a certain VM shape for a given reason (because it has SSDs or a GPU for example), and your service fits in that VM shape comfortably, improving its memory usage or CPU usage can be useless for efficiency - you have to spend money for that VM anyway! However, if the improvements you make allow you to spend less VM time, that is worth something.
There is one exception to the rule of tying compute efficiency to money savings, and where compute efficiency can contribute to profits: the client side. Having a more efficient client means that users with cheaper devices can use your software. Some time ago, when YouTube decreased the size of their video client, they found that average loading times went up instead of down! People with slower internet connections were now able to load the YouTube player, so YouTube’s efficiency gain resulted in more revenue.
Efficiency work can often be done at the level of subsystems, but some end-to-end thinking can help.
It can be worth it to make subsystems less efficient if they offer APIs or interfaces that improve system-level
efficiency. Google recently published a
on an interesting case of this: using a slower
malloc algorithm resulted in better compute efficiency
since the slower algorithm provided better memory locality for programs that used it.
Efficiency work is usually the easiest performance work to price, since it is solely related to cost reductions, so managers and companies often think (close to) rationally about efficiency work. However, since efficiency work is a cost reduction, many companies are loath to support it, believing that features are more important. Some of those companies believe that performance and efficiency are the same, and dismiss performance work entirely as “premature optimization.” However, a little efficiency work can end up saving you a lot, and even startups can benefit if some optimization gives them a longer runway.
Latency is how long it takes for your system to respond to a request, measured in seconds. If a user sees high latency, they will tell you that your system is “slow,” and they won’t care about how your system can scale out to serve billions of people at once. In some industries, like ad tech and high-frequency trading, latency is the key metric that needs to be optimized. In other situations, like training ML models, overall latency is relatively unimportant. A table of common latencies can be found here.
Many people wouldn’t care about latency except that the latency of a subsystem has a habit of placing limits on throughput. This occurs in every closed-loop system, where you wait for completion of one request before processing another one. If your system has locks that must be exclusively held per request, or in the case of the write-limited database we looked at before, the latency of the locked section or the database becomes critical. Supercomputers often have low-latency networks because high-performance computing programs usually involve feedback loops between machines, so having a low-latency network results in having higher system-level throughput.
Latency is usually measured by looking at distributions of request times. These distributions are usually generated by two methods: probing and tracing. Probers generate artifical requests to see how long it takes to get a response, and tracers generally attach a tag to requests which asks each server along the path of the request to measure how long each step took. Both of these methods have problems, and analysis should be done with this in mind: Tracing adds overhead to each traced request, which can affect throughput and makes traced requests slightly unrepresentative of untraced requests. Probers tend to test only a few types of requests, so an over-reliance on probers can result in overfitting. With the data from latency measurement tools, different parameters of the distribution can tell you different things:
- The median of the latency distribution tells you about a “normal” request, and helps you understand the happy path.
- “Tail latency” usually refers to the 90th, 99th, 99.9th, or 99.99th percentile of the distribution, and can tell you a lot about how your system performs when it is under load or otherwise operating in a degraded mode.
- Average latency can be important for components of closed-loop systems: if a system does 100 serialized reads from a database to produce a result, the latency of that system is probably close to 100 times the average latency of the database.
Queueing and batching often result in latency, even though they may help on the other dimensions: long queues of requests mean that a system is running at high throughput (good) but also that each request experiences high latency (bad). Latency associated with queues is called “queueing delay,” and frequently shows up at the tails - the 99.9th percentile request probably experiences significant queueing delay. Queues are not always explicit: congested network routes often add significant queueing delay, and operating systems have run queues, so if you are running more than one thread per core, you will experience queueing delay related to scheduling. Batching will always hurt minimum latency, but batching can benefit average latency and tail latency if batching means doing less work per request.
“Thread per core” systems, like high-performance databases (eg ScyllaDB or Aerospike) try to make all of these queues explicit to achieve 90th percentile latencies below 1 millisecond. This usually means using an asynchronous programming model and giving up a lot of the conveniences that an operating system provides. Attack of the Killer Microseconds has some more detail on this regime. High-frequency trading systems go further to get median latencies in the single-digit microseconds, using spinning cores and disabling turbo on CPUs to avoid hardware-induced latency.
Improving latency can be done in many ways, but most boil down to the simple rule of: “Find something you’re doing in the critical path, and don’t do it.” Depending on the system, improving latency can mean many different things. For example: removing communication, removng RPCs/machine hops, removing (or adding) queueing/batching, or improving prioritization can all improve latency in a system. Latency profiles from traces will tell you what to work on.
Conversely, if you add something to the critical path of a system, expect it to hurt latency. This is a natural reason why feature-rich systems tend to become slow: adding even rarely-used options can have a little bit of latency overhead. One new feature won’t cost much, but the small effect of each new feature adds up. The one exception is caches, which will usually help latency on cache hits, but will hurt your latency on cache misses.
Latency can be hard to price, but very often has tremendous value. In the mid-2000s, Google started studying the effects of latency on users, and found that even 100 ms of extra latency had a noticeable impact on the number of searches that each user conducts, published in their blog. They have also publicly announced that page loading latency is a factor in Google search rankings. None of this is a surprise to UX folks and front-end web developers, but the further down the stack you go, it is easy to forget that your system may be in the critical path for a user-facing service.
The value of latency is usually nonlinear. The fastest trader on the stock market will make a lot more money than the second fastest trader, who makes a lot more than the third fastest, and so on. The 100th fastest doesn’t make much more than number 101. The difference in speed between the top three may only be a few nanoseconds, but each one of those nanoseconds has tremendous value! Conversely, improving user interface latency from 200 ms to 150 ms has a lot more value than improving it from 60 ms to 10 ms: the former takes a UI from “sluggish” to “workable” while users are unlikely to notice the latter improvement (60 ms is below most humans’ reaction time).
Performance is Money
By now, you have probably gotten the theme here: performance work is about incremental improvements in a system’s profitability, and the three dimensions of performance are about product and money:
- Throughput is about how much traffic you can accept
- Efficiency is about how cheaply you can accept it
- Latency is about how quickly you respond to each request
With the exception of latency, which I would argue is a valuable product feature, compute performance work directly relates to your company’s bottom line. If you make sure that you care about the right things, it can have immense value.