As you build a computer system, little things start to show up: maybe that database query is awkward for the feature you are building, or you find your server getting bogged down transferring gigabytes of data in hexadecimal ASCII, or your app translates itself to Japanese on the fly for hundreds of thousands of separate users. These are places where your abstractions are misaligned - your app would be quantitatively better if it had a better DB schema, a way to transfer binary data, or native internationalization for your Japanese users. Each of these misalignments carries a cost.
For many computer systems, abstraction misalignment is where we spend the majority of our resources, in terms of both engineering cost and compute time. Building apps at a high level pays dividends in terms of getting them set up, but eventually the bill comes due in the form of tech debt, slow performance, or both. Conversely, systems where abstractions are well-aligned throughout the tech stack, like high-frequency trading systems and (ironically) chat apps, are capable of amazing feats of engineering.
Most small-scale web services that do normal things don’t pay much for abstraction misalignment, but large-scale systems and systems that do odd things pay huge costs. Misalignments can also show up as systems age and things change - migrations are difficult, and you would rather add a feature than do a refactor (okay, maybe not you, but your manager definitely would).
When I left Google, I was working on a new storage system that leveraged several cool technologies to go fast. The real power we had, however, was that we could design the stack so that every abstraction was well-aligned with the layers above and below it, eliminating the shims that sucked up lines of code and compute cycles. I saw this kind of system in high-frequency trading, too. I didn’t know how good I had it…
Assumptions, Values, and Requirements
Every project is built with a set of assumptions, requirements, and values. These will come to define the constraints under which the system is built, and ultimately the characteristics of the system. Requirements are simple: they are what you absolutely need for your product. For example, if you are building a NoSQL database, you require a key-value interface of some sort and a storage medium of some sort. A project’s values define what you want to aim for. Perhaps you are building a performance-focused NoSQL database: In that case, you require a NoSQL database interface, and you value performance. Bryan Cantrill has done a good talk on values. Assumptions are the final pillar: these tell you about the invisible pseudo-requirements you build your system under. For example, most of us assume certain things about programming languages and computing environments: at the very least, most systems are built on the assumption that computers will run them. Breaking that assumption can result in great outcomes (and a lot of work).
Assumptions, values, and requirements tend to determine the final characteristics of a system. A typical startup CRUD app might have the following characteristics:
- Require that we do our startup’s CRUD
- Value time to market
- Value the ability to run cheaply until you get product-market fit
- Assume that we run in a hosted provider or a cloud
A high frequency trading system is built with different constraints:
- Require that we execute a trading strategy
- Value trading profitability
- Value speed
- Value safety/compliance with regulations
- Assume that you control all of your hardware inside a co-located datacenter
The former set of constraints gives you millisecond response times, features, and comparative affordability. The latter set of constraints gives you 10-100 nanosecond response times, exotic hardware, comparatively high visibility, and MASSIVE startup costs.
Comparing Two Databases
Let’s look at two easily comparable examples. A product like ScyllaDB might have the following characteristics (disclaimer: I don’t work with ScyllaDB, so these are not their words):
- Required: Distributed NoSQL database
- Value speed
- Value scalability
- Value compatibility with existing NoSQL DB ecosystem (Cassandra)
- Assume NVMe flash and modern network cards on Linux machines
A product like this drives most of its design decisions from its values, within the space of its requirements and assumptions. However, products with different assumptions and requirements turn out very different. Contrast ScyllaDB with a project that has these assumptions, values, and requirements:
- Required: Distributed SQL database
- Value scalability
- Value global consistency and durability
- Value compatibility with existing SQL DB ecosystem (PostgreSQL)
- Assume that you run on Linux machines
These are very similar requirements, values, and assumptions, but they drive completely different products. The second set is an approximation of the values of CockroachDB or Google’s Spanner database (same disclaimer as ScyllaDB applies), which are both about two orders of magnitude slower than ScyllaDB to execute a transaction on the same hardware, but offer a SQL interface and global consistency.
Ideally, you would like all of the abstractions you use to have goals that align with your system’s. If you can buy a dependency that aligns with your goals, that’s great. If not, you will likely have to “massage” your dependencies to be able to do what you want. This is the first time an abstraction costs you. If you use the wrong database schema (or the wrong technology), you may find yourself scanning database tables when a different schema would do a single lookup. For a non-database example, if you make an Electron-based computer game, it will likely be unplayably slow (but you will be able to build it in record time!).
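To make the "scan versus single lookup" cost concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, its columns, and the `users_by_email` index are hypothetical names chosen for illustration; the point is only that the same query goes from a full table scan to an index search once the schema matches the access pattern. The exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
# Hypothetical table: the feature needs lookups by email,
# but the schema only keys on an integer id.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(10_000)],
)

lookup = "SELECT name FROM users WHERE email = ?"
arg = ("user9999@example.com",)

# Misaligned schema: SQLite has to walk the whole table.
plan_before = query_plan(conn, lookup, arg)

# Aligning the schema with the access pattern: index the column we query by.
conn.execute("CREATE UNIQUE INDEX users_by_email ON users (email)")
plan_after = query_plan(conn, lookup, arg)

print("before:", plan_before)
print("after: ", plan_after)
```

The fix here is cheap because the misalignment was caught at the schema level. The same mismatch buried under an ORM or a service boundary is usually much harder to see and to repair.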
Going back to the CRUD app, let’s pick a database. Is a ScyllaDB cluster a good choice? What about a CockroachDB cluster? We probably don’t mind if our database doesn’t scale the best or isn’t the fastest, but we do mind the expense of running a cluster, so maybe we should look for an alternative.
Compared to our hypothetical cases of ScyllaDB and CockroachDB, SQLite has some different assumptions, requirements, and values:
- Required: Embeddable SQL database
- Value ease of use
- Value reliability
- Value cross-platform compatibility
- Assume that you run on some sort of computer
Which of these aligns better with a CRUD app? Probably SQLite, at least until product-market fit, because it will be easier and cheaper to run. DynamoDB or another hosted database (or CockroachDB’s serverless offering) might align even better with what you want. After all, you probably don’t care very much about cross-platform compatibility if you are using a cloud, and the database is literally free if you keep it small - and hopefully you will be making money when it starts getting expensive.
Most companies don’t build their own database because there is a wealth of available options that can help you with any kind of project. However, you don’t have a wealth of options for many other abstractions: often, you have only one or two to choose from, and those abstractions were built without much thought to your use case.
Every Abstraction Counts
It’s easy to see how using the right database schema or picking the right programming language can help you with both CPU time and engineer time, but the abstraction tax hits us up and down the stack.
As an extreme example, consider TCP (yes, that TCP). Most of us take HTTP/TCP as a given for applications and run with the kernel’s TCP driver, and it would be complete folly for most projects to do something different. Not for Google (disclaimer: I worked with the folks who published this paper). Storage and search folks needed faster, more efficient networking for RPCs, and cloud computing needed it to be possible to develop virtualization features. The result was Snap, a userspace networking driver, and Pony Express, a transport protocol designed for the demands of Google’s big users. By eschewing the unused features and swapping from “reliable bytestream” to “reliable messaging,” Pony Express ended up being 3-4x faster than TCP.
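The difference between "reliable bytestream" and "reliable messaging" can be sketched without any real networking. The code below is not how Snap or Pony Express work; it is a hypothetical length-prefix framing layer (the 4-byte header, `frame`, and `Deframer` are all assumptions for illustration) showing the kind of shim an RPC system has to build on top of a bytestream when what it really wants is whole messages.

```python
import struct
from typing import List

# Assumed framing for illustration: a 4-byte big-endian length prefix.
HEADER = struct.Struct("!I")

def frame(message: bytes) -> bytes:
    """Prefix a message with its length so its boundary survives a bytestream."""
    return HEADER.pack(len(message)) + message

class Deframer:
    """Reassembles whole messages from arbitrarily split stream chunks."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> List[bytes]:
        self._buf += chunk
        messages = []
        while len(self._buf) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buf)
            end = HEADER.size + length
            if len(self._buf) < end:
                break  # the message body has not fully arrived yet
            messages.append(bytes(self._buf[HEADER.size:end]))
            del self._buf[:end]
        return messages

rpcs = [b"get key=1", b"put key=2 value=hello", b""]
stream = b"".join(frame(m) for m in rpcs)

# A bytestream may deliver bytes in pieces unrelated to message boundaries;
# simulate that by cutting the stream every 5 bytes.
deframer = Deframer()
received = []
for i in range(0, len(stream), 5):
    received.extend(deframer.feed(stream[i:i + 5]))

print(received)
```

Every byte of buffering and copying in `Deframer` is work that exists only because the transport's abstraction (bytes) does not match the application's (messages). A message-oriented transport can hand over whole messages and skip this layer entirely.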
Another example from Google is tcmalloc. Many large companies have designed their own memory allocators, but the tcmalloc paper is the best, in my biased opinion, at describing how a memory allocator can impact the performance of applications. By explicitly aligning the goal of the memory allocator with the goals of the fleet, they found that by increasing the time spent in the allocator, they could improve allocation efficiency enough that the end application was much faster due to better locality and reductions in cycles spent walking page tables (TLB misses).
Also, even though I have been focusing on performance here, it’s not always easier to work with the off-the-shelf abstraction either: part of the motivation for the Snap networking system was programmability and extensibility.
Everything Changes over Time
Even abstractions that are well-aligned at the outset of a project can see themselves becoming the wrong choice. Usually, the change comes from either the underlying assumptions changing or your values changing. Returning to databases, a lot of successful apps tend to outgrow a single server using SQLite or PostgreSQL. There are many solutions to this, but the fundamental change is that “embedded” becomes untenable, the value of “easy to use” starts to be diminished, and instead we start to value scalability, availability, and speed. The other alternatives we have thought about here, ScyllaDB and CockroachDB, start to become much more attractive. The migration is costly and difficult, and you have to deal with a few bugs at first, but the database scales.
Of course, the alternative to a migration is to put in a compatibility layer so that you can keep the old database in production while you put new entries in a new database. This also leads to slowness and bugs. This is not an uncommon pattern - it is often too costly or risky to do a database migration. Of course, running two database systems can further misalign your abstractions.
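A compatibility layer of this kind might look something like the sketch below. Plain dictionaries stand in for the two databases, and `CompatStore` and its methods are hypothetical names; the point is to show where the extra lookups and the subtle bugs come from, not to suggest a production design.

```python
class CompatStore:
    """Hypothetical shim: new writes go to `new`, reads fall back to `old`."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.fallback_reads = 0  # counts the extra lookups the shim costs

    def put(self, key, value):
        # New entries only ever land in the new database.
        self.new[key] = value

    def get(self, key, default=None):
        if key in self.new:
            return self.new[key]
        # Miss in the new database: pay for a second lookup in the old one.
        self.fallback_reads += 1
        return self.old.get(key, default)

    def delete(self, key):
        # Must delete from both sides, or a stale old entry "comes back".
        self.new.pop(key, None)
        self.old.pop(key, None)

old_db = {"alice": "v1", "bob": "v1"}  # stand-in for the legacy database
new_db = {}                            # stand-in for the new database
store = CompatStore(old_db, new_db)

store.put("alice", "v2")  # shadows the old entry
store.put("carol", "v1")  # exists only in the new database

print(store.get("alice"), store.get("bob"), store.fallback_reads)
```

Even in this toy version, every read of un-migrated data costs two lookups, and forgetting the second `pop` in `delete` would silently resurrect old values. Both are small instances of the slowness and bugs described above.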
We also saw this in the case of Google’s user-space networking paper. What changed for Google wasn’t their values, but their assumptions: modern datacenter networks are very fast, CPUs can crunch a ton of data, and NVMe flash is 2-3 orders of magnitude faster than spinning disks. Saturating a 100 Gbps network card with TCP is expensive - taking 16 cores per the Snap paper - while saturating it with a custom protocol is 4x cheaper, and you can saturate a NIC with a single stream. In Google’s case, the bandwidth expansion of network cards caused the TCP abstraction to become stale.
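As a back-of-the-envelope check, the figures quoted above can be turned into per-core numbers. This takes the numbers in the text at face value and assumes that "4x cheaper" means one quarter of the cores; it is arithmetic on those figures, not a measurement.

```python
LINK_GBPS = 100  # NIC bandwidth quoted in the text
TCP_CORES = 16   # cores needed to saturate it with TCP, as quoted in the text
SPEEDUP = 4      # "4x cheaper", interpreted here as 1/4 of the cores (assumption)

custom_cores = TCP_CORES / SPEEDUP
tcp_gbps_per_core = LINK_GBPS / TCP_CORES
custom_gbps_per_core = LINK_GBPS / custom_cores
cores_freed = TCP_CORES - custom_cores

print(f"TCP:    {TCP_CORES} cores, {tcp_gbps_per_core} Gbps per core")
print(f"Custom: {custom_cores:g} cores, {custom_gbps_per_core} Gbps per core")
print(f"Cores freed per saturated NIC: {cores_freed:g}")
```

Under those assumptions, each saturated NIC frees up a dozen cores, which is the kind of number that adds up quickly across a fleet.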
Even if nothing about your system changes, your abstractions can still go bad. Your software environment can change, your users can change their usage patterns, or your dependencies can simply get updated in a way that you don’t like. The wear-out of abstractions that used to work well is also commonly known as “technical debt.” However, if you choose your abstractions well and define clean boundaries, the abstractions you use can far outlive your system.
You can’t avoid abstractions as a software engineer - software itself is an abstraction. In a way, software engineers are professional abstraction wranglers. The only thing we can do is stay on top of our abstractions, the underlying assumptions they make, and their implications. Focusing only on your “core business need” and your “unique value add” doesn’t build a successful business on its own - if the abstractions you use to get there aren’t well-aligned to your goals, you will have achieved a Pyrrhic victory at best, and your focus and dedication to the bottom line may have cost you the chance to scale up or run profitably.
At companies with huge engineering forces, abstraction management is what a lot of engineers spend their time on. Often, these engineers are actually the most “productive” in terms of money saved - infrastructure projects tend to result in 8-9 figure savings or unique capabilities, and performance engineering (another form of abstraction alignment) frequently has 8 figure returns per engineer. Another large group of engineers is in charge of making sure that the old abstractions don’t break and crash the entire system.
Conversely, this is where startups can develop technical advantages over big tech despite having much smaller engineering teams, and where bootstrapped companies can out-engineer series-D companies. Given the freedom to align your abstractions to your goals, amazing things are possible.
If you liked this topic, another blogger, Dan Luu, whose articles I like a lot, has written adjacent to this topic before: A year ago, he wrote about the value of in-house expertise, and he has written in the past on why companies tend to have a lot of engineers for easy problems.