<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Speculative Branches</title>
    <link>https://specbranch.com/</link>
    <description>Recent content on Speculative Branches</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 28 Nov 2021 23:00:22 +0000</lastBuildDate><atom:link href="https://specbranch.com/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>The Computer Architecture of AI (in 2024)</title>
      <link>https://specbranch.com/posts/ai-infra/</link>
      <pubDate>Sat, 10 Feb 2024 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/ai-infra/</guid>
      <description>Over the last year, as a person with a hardware background, I have heard a lot of complaints about Nvidia&amp;rsquo;s dominance of the machine learning market and whether I can build chips to make the situation better. While the amount of money I would expect it to take is less than $7 trillion, hardware accelerating this wave of AI will be a very tough problem&amp;ndash;much tougher than the last wave focused on CNNs&amp;ndash;and there is a good reason that Nvidia has become the leader in this field with few competitors.</description>
      <content:encoded>&lt;p&gt;Over the last year, as a person with a hardware background, I have heard a lot of complaints about Nvidia&amp;rsquo;s dominance
of the machine learning market and whether I can build chips to make the situation better.  While the amount of money
I would expect it to take is less than
&lt;a href=&#34;https://www.tomshardware.com/tech-industry/artificial-intelligence/openai-ceo-sam-altman-seeks-dollar5-to-dollar7-trillion-to-build-a-network-of-fabs-for-ai-chips&#34;&gt;$7 trillion&lt;/a&gt;,
hardware accelerating this wave of AI will be a very tough problem&amp;ndash;much tougher than the last wave focused on CNNs&amp;ndash;and
there is a good reason that Nvidia has become the leader in this field with few competitors.  While the
inference of CNNs used to be a math problem, the inference of large language models has actually become a computer
architecture problem involving figuring out how to coordinate memory and I/O with compute to get the best performance
out of the system.&lt;/p&gt;
&lt;h2 id=&#34;the-basic-operations-used-by-ml-models&#34;&gt;The Basic Operations Used by ML Models&lt;/h2&gt;
&lt;p&gt;I am going to start with a brief overview of each of the operations used in AI systems here, but each has far more
depth to go into for its implementation specifics.  The largest operations in terms of compute time are the
linear operations: matrix multiplications and convolutions, as well as the matrix multiplications that are within
an attention calculation.  Accumulations and nonlinear functions consume significantly less time, but create
mathematical &amp;ldquo;barriers&amp;rdquo; in the calculation that prevent you from combining the computations of multiple matrix
multiplications together.  The combination of these linear and nonlinear ops is what gives neural networks power from a
statistical perspective: a layer of a neural network is essentially a combination of a matrix-multiplication-like
linear operation and a nonlinear operation.&lt;/p&gt;
&lt;h4 id=&#34;matrix-multiplication&#34;&gt;Matrix Multiplication&lt;/h4&gt;
&lt;p&gt;Matrix multiplication is a basic operation used in neural network calculations, frequently involving multiplying
a weight matrix by a vector of internal state.  A matrix-vector product combined with a nonlinear function is the
basic operation of a fully-connected layer.  Matrix multiplication is one of the most well-studied algorithms in
computer science, and any company building an accelerator for graphics or neural networks will have an efficient
solution.  There are a lot of clever ways to do matrix multiplication, but the most efficient one is often brute
force: using a lot of multipliers in straightforward arrays, and being clever about how you access memory.&lt;/p&gt;
&lt;p&gt;Depending on the model, the characteristics of these matrices can be different: image-based models will have a lot
of matrix-matrix products, while LLMs will have matrix-vector products.&lt;/p&gt;
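&lt;p&gt;As a toy sketch of the point above, a fully-connected layer is just a matrix-vector product followed by a
nonlinearity; the names and values here are illustrative, not from any real accelerator:&lt;/p&gt;

```python
# Toy sketch: a fully-connected layer as matrix-vector product + ReLU.
def matvec(W, x):
    # Each output element is the dot product of one weight row with x;
    # this is the operation accelerators tile across arrays of multipliers.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, e) for e in v]

W = [[1.0, -2.0], [0.5, 0.5]]   # weight matrix
x = [3.0, 1.0]                  # internal state vector
print(relu(matvec(W, x)))       # [1.0, 2.0]
```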
&lt;h4 id=&#34;convolution&#34;&gt;Convolution&lt;/h4&gt;
&lt;p&gt;While not present in LLMs, convolution operations are a common feature of computer vision models, and there is
some chance that they will find new usage.  A convolution involves &amp;ldquo;sliding&amp;rdquo; a small matrix over a much bigger matrix,
and calculating the correlation between the shifted versions of the small matrix and the big matrix.  This becomes a
small version of a matrix-matrix product with a lot of data sharing between computations.  Convolutions have a lot of
tricks for efficient computation due to the data sharing and the fact that each of these small matrix-matrix
multiplications shares one operand.  The hardware for matrix multiplication is also relatively easy to share with
convolution.&lt;/p&gt;
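&lt;p&gt;A minimal one-dimensional sketch of the &amp;ldquo;sliding&amp;rdquo; computation (the two-dimensional case used in vision
models works the same way, just with more indices):&lt;/p&gt;

```python
# Toy 1-D "valid" convolution (really a correlation, as in ML usage):
# the kernel slides over the signal, and adjacent positions share data.
def conv1d_valid(signal, kernel):
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

print(conv1d_valid([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, -1.0]))  # [-2.0, -2.0]
```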
&lt;h4 id=&#34;nonlinear-operations&#34;&gt;Nonlinear Operations&lt;/h4&gt;
&lt;p&gt;Nonlinear functions like exponentials, sigmoids, arctans, and a family of functions that look like ReLU are common
operations inside neural networks.  Given the low precision of the number systems used (16-bit floating point at most),
these are relatively easy to compute with fast approximations or lookup table methods.  Each neural network often
also uses relatively few different kinds of nonlinear functions, so it is possible to precompute parameters to make
this computation faster.  While these are generally not computationally intensive, their main function is to make the
matrix multiplications in each layer irreducible.&lt;/p&gt;
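&lt;p&gt;As a sketch of the lookup-table approach, here is a coarse sigmoid table evaluated by nearest-entry lookup, the
way a low-precision hardware unit might. The table range and resolution are arbitrary illustrative choices:&lt;/p&gt;

```python
import math

# Precompute the table once; evaluation is then a clamp, a scale, and an index.
TABLE_MIN, TABLE_MAX, STEPS = -8.0, 8.0, 512
STEP = (TABLE_MAX - TABLE_MIN) / STEPS
TABLE = [1.0 / (1.0 + math.exp(-(TABLE_MIN + i * STEP))) for i in range(STEPS + 1)]

def sigmoid_lut(x):
    x = min(max(x, TABLE_MIN), TABLE_MAX)      # clamp to the table range
    return TABLE[round((x - TABLE_MIN) / STEP)]

print(sigmoid_lut(0.0))  # 0.5
```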
&lt;h4 id=&#34;accumulation-operations&#34;&gt;Accumulation Operations&lt;/h4&gt;
&lt;p&gt;Accumulation operations involve scanning a vector and computing a function over every element in that vector.  For
neural networks, the most common functions are the &amp;ldquo;maximum&amp;rdquo; and the sum: softmax calculations use a maximum
for numerical stability, and layer normalizations use sums for the mean and variance.  While not computationally intensive, the result of these computations will not
be available until the entire vector is computed.  This forms a computational barrier: anything that depends on the
result of this operation has to wait for it to be produced.  It is common to put a normalization between layers of
neural networks, preventing number magnitudes from getting out of hand.&lt;/p&gt;
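&lt;p&gt;A numerically stable softmax makes the barrier concrete: nothing downstream can be finalized until the scan over
the whole vector is done. This sketch is the textbook formulation, not any particular accelerator&amp;rsquo;s kernel:&lt;/p&gt;

```python
import math

def softmax(v):
    m = max(v)                            # pass 1: the barrier described above
    exps = [math.exp(e - m) for e in v]   # subtracting m keeps exp() in range
    s = sum(exps)                         # pass 2: a second accumulation
    return [e / s for e in exps]

print(softmax([1.0, 2.0, 3.0]))  # sums to 1.0; the largest input gets the largest weight
```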
&lt;h4 id=&#34;attention&#34;&gt;Attention&lt;/h4&gt;
&lt;p&gt;Attention is the &amp;ldquo;new kid on the block&amp;rdquo; in terms of operations used for neural networks, and is the defining
feature of transformer models.  Attention is supposed to represent a key-value lookup of a vector of queries, but
that is a pretty loose intuition of what happens in my opinion.  In terms of the operations done, attention is a
calculation that combines a few matrix-vector products and a softmax-based normalization function.  Thanks to the
linear nature of the matrix-vector products, there are a lot of clever ways to rearrange this calculation to avoid
unnecessary memory loads even when large matrices are used: the
&lt;a href=&#34;https://arxiv.org/pdf/2205.14135.pdf&#34;&gt;Flash Attention&lt;/a&gt; paper goes through the most famous
rearrangement.  However, there have been other attempts to do fast attention calculation, including
&lt;a href=&#34;https://arxiv.org/pdf/2102.03902.pdf&#34;&gt;numerical approximations&lt;/a&gt; of the calculation and the
&lt;a href=&#34;https://arxiv.org/pdf/2305.16300.pdf&#34;&gt;use of &amp;ldquo;landmark tokens&amp;rdquo;&lt;/a&gt; to turn the large version of the calculation into
several smaller versions.  These calculations all translate to hardware relatively well, and build on top of the units we
build for matrix multiplication and softmax.&lt;/p&gt;
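&lt;p&gt;For a single query, the whole calculation fits in a few lines: dot products against the keys, a softmax, and a
weighted sum of the values. This is the plain formulation, without any Flash-Attention-style rearrangement:&lt;/p&gt;

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(e - m) for e in v]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(q, keys, values):
    # Scaled dot-product attention for one query vector.
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# A query that matches the first key pulls the output toward the first value:
out = attention([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
print(out)
```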
&lt;h2 id=&#34;how-computer-architects-think-about-matrix-math-operations&#34;&gt;How Computer Architects Think about Matrix Math Operations&lt;/h2&gt;
&lt;p&gt;Doing arithmetic on a chip involves two kinds of work: getting the data into and out of the chip, and performing
the operations.  There is a well-known ratio here called &amp;ldquo;arithmetic intensity&amp;rdquo;: the ratio of arithmetic operations done to
memory operations (bytes moved).  Problems with high arithmetic intensity are compute-bound, and are limited by your ability to put
arithmetic units onto a chip that can perform the relevant operations.  Problems with low arithmetic intensity are
memory bandwidth bound: when you are spending a lot of time loading and storing data, your chip&amp;rsquo;s compute units are
idling.&lt;/p&gt;
&lt;p&gt;Every chip can be modeled as having a &amp;ldquo;roofline&amp;rdquo; on performance based on arithmetic intensity depending on its available
compute FLOPs and its memory bandwidth.  This roofline has a knee at a certain point, where operations go from being
memory bandwidth bound to compute bound. Chips are generally architected to support the software that they will
be used to run (with about 5 years of lag&amp;ndash;it takes a long time to make a chip), and this knee is generally chosen to be
at a relatively good point for most users.  However, it is one-size-fits-all.&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/google-intensity.png#center&#34; width=&#34;70%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;Google performed this analysis &lt;a href=&#34;https://arxiv.org/pdf/2304.01433.pdf&#34;&gt;for the TPUv4&lt;/a&gt;, and found that the neural networks
of interest had an arithmetic intensity relatively close to the knee of an A100 GPU.  The H100, by the way, has its knee
at almost the same place.  However, Google&amp;rsquo;s analysis here does not describe what they are doing in each cited model, so
we are left to guess.&lt;/p&gt;
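&lt;p&gt;The roofline model itself is just the minimum of two lines. The peak-FLOPs and bandwidth numbers below are rough,
publicly quoted A100-class figures used purely for illustration:&lt;/p&gt;

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    # Performance is capped either by compute or by memory traffic.
    return min(peak_flops, mem_bandwidth * intensity)

peak = 312e12   # rough dense FP16 FLOP/s for an A100-class GPU (illustrative)
bw = 2.0e12     # rough HBM bandwidth in bytes/s (illustrative)

knee = peak / bw  # intensity (FLOPs/byte) where the roofline bends
print(knee)                               # 156.0
print(attainable_flops(peak, bw, 2.0))    # memory-bound regime
print(attainable_flops(peak, bw, 500.0))  # compute-bound: capped at peak
```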
&lt;p&gt;To apply this to the operations we discussed earlier, the matrix operations are the ones that consume memory bandwidth.
Accumulation and nonlinear operations can generally be attached to the previous block of computation and add arithmetic
operations.  However, the matrix operations have the disadvantage that they operate on blocks of data, meaning that they
add a relatively large amount of memory traffic per mathematical operation.&lt;/p&gt;
&lt;h2 id=&#34;the-working-sets-of-models&#34;&gt;The Working Sets of Models&lt;/h2&gt;
&lt;p&gt;Thanks to the regularity of the computation of a model, the working set of models is relatively well-defined.  The working
set contains:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The model weights&lt;/li&gt;
&lt;li&gt;The input data&lt;/li&gt;
&lt;li&gt;The intermediate state of the model&lt;/li&gt;
&lt;li&gt;During training, gradients (and learning rates) for each weight&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These items can be small or large depending on the model.  For models like LLMs, the weights are huge, sometimes too big
for a single GPU, while the input data and the intermediate state are a few thousand numbers.  For computer vision
models, the weights are a lot smaller and the input data and intermediate state are a lot bigger.  Because gradients
and optimizer state are stored alongside every weight, training for LLMs takes about 3x as much memory as inference, but otherwise looks somewhat similar to inference.&lt;/p&gt;
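&lt;p&gt;A back-of-the-envelope version of this, using the rough 3x rule of thumb from above (the parameter count and
precision are the only inputs; the helper names are ours):&lt;/p&gt;

```python
def inference_gb(params, bytes_per_weight=2.0):
    # Weights only, e.g. FP16 at 2 bytes per weight.
    return params * bytes_per_weight / 1e9

def training_gb(params, bytes_per_weight=2.0):
    # Weights + gradients + per-weight optimizer state: roughly 3x.
    return 3 * inference_gb(params, bytes_per_weight)

print(inference_gb(7e9))  # 14.0 -- a 7B-parameter model at FP16
print(training_gb(7e9))   # 42.0
```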
&lt;p&gt;The size of the working set isn&amp;rsquo;t the only thing that matters, however.  The ability to put parts of that working set on
the chip (&lt;em&gt;ie&lt;/em&gt; in cache) has a significant effect on the arithmetic intensity of the computation of a model.  Parts of the
working set that are in nearby caches and registers allow you to access those parts without hitting memory, so the
heat distribution of the working set dictates how often you are going to hit memory rather than loading from a cache.
The bandwidth of caches, for most purposes, is effectively unlimited.&lt;/p&gt;
&lt;p&gt;This leads to a rough breakpoint in model size that dictates how the model is computed:
small models that can fit in the cache of a GPU (or the on-chip memory of an ASIC) get to stream data through the
weights, while larger models have to stream the weights through the data.  This is not a perfect model, since
&amp;ldquo;medium-sized&amp;rdquo; models can hide some of the effects of streaming the weights through the data, but it is the difference
between a smaller computer vision or audio model and a large language model.  Not only is the LLM bigger, it also has
to be computed less efficiently.&lt;/p&gt;
&lt;h4 id=&#34;small-models-streaming-data-through-weights&#34;&gt;Small Models: Streaming Data through Weights&lt;/h4&gt;
&lt;p&gt;The previous AI wave of 2016 involved models with thousands or millions of parameters.  These models are small
enough to fit inside the cache and compute units of a reasonably-sized chip.  This means that it is possible to
keep the entire model on the chip used for inference or training.  In this case, many GPU-like systems with cache
hierarchies will find it easy to keep the model in cache, and systems with a scratchpad or with special internal memory
dedicated for storing models are possible to build cost-efficiently.&lt;/p&gt;
&lt;p&gt;This configuration has the benefit that it is the &amp;ldquo;natural&amp;rdquo; way to compute over data: you can load the data
from a disk once, perform a calculation, and then store or use the results.  Also, this method of computing lets
you take advantage of incredibly high arithmetic intensity during inference.  Each time the model is inferenced, the
only memory loads the run depends on are the loads of the input.  However, if the input is large, needs a lot of
preprocessing, or has to be scanned many times, this can result in a drop in arithmetic intensity, particularly for
systems that rely on caching: processing the input may end up evicting the model from cache, causing extra loads.&lt;/p&gt;
&lt;h4 id=&#34;large-models-streaming-weights-through-data&#34;&gt;Large Models: Streaming Weights through Data&lt;/h4&gt;
&lt;p&gt;With larger models of several billions of parameters, it is no longer possible to keep the weights on
a single chip (note: HBM memory does not count as &amp;ldquo;on the chip&amp;rdquo;).  However, the use of linear algebra means that
there is still a regular pattern of computation to take advantage of: every model goes layer by layer.  The
intermediate state of large models is still relatively small: the largest of the original LLaMA models uses an
internal state of only 8k floating point numbers, and a context window of 2k input tokens.&lt;/p&gt;
&lt;p&gt;If you keep only the intermediate products of the model on the chip, that part of the working set is still relatively small.  To
compute these models, the typical general approach is to stream the model through the data.  One batch of data gets
loaded onto the chip, and the internal state of the model continues to stay on the chip as inference progresses
layer by layer.&lt;/p&gt;
&lt;p&gt;In this configuration, the data stored off-chip is going to be accessed repeatedly, so loading model weights from
disk (or even RAM on a CPU or another remote site) is generally considered too slow to do.  This is why a lot of
LLMs have high VRAM requirements for GPUs: you have to store the entire model (plus a few other things, like the
K-V cache used to speed up inference) inside the GPU&amp;rsquo;s VRAM.&lt;/p&gt;
&lt;p&gt;Due to the fact that weights are streamed into the chip, the arithmetic intensity with this model of computing is
comparatively very low, and is driven by your ability to reuse weights as they get loaded.  While many models have
operations that allow you to take advantage of reuse, LLMs have very limited reuse thanks to the prevalence of
matrix-vector products.
&lt;a href=&#34;https://arxiv.org/pdf/2302.14017.pdf&#34;&gt;GPT-2 on a GPU notably has an arithmetic intensity of about 2 operations per byte loaded&lt;/a&gt;.
However, due to weight reuse, BERT models get up to around 100-200 ops per byte.&lt;/p&gt;
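&lt;p&gt;The low intensity of matrix-vector products falls straight out of the arithmetic: every weight is loaded once and
used for exactly one multiply-add. A quick sketch, assuming FP16 weights (the numbers line up in spirit with the
figures cited above, though the cited measurements count real kernels, not this idealization):&lt;/p&gt;

```python
def matvec_intensity(n, bytes_per_weight=2):
    # An n x n matrix-vector product: 2*n*n FLOPs (multiply + add),
    # with the n*n weights loaded exactly once.
    flops = 2 * n * n
    bytes_loaded = n * n * bytes_per_weight
    return flops / bytes_loaded

print(matvec_intensity(4096))                      # 1.0 FLOP per byte at FP16
print(matvec_intensity(4096, bytes_per_weight=1))  # 2.0 with 8-bit weights
```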
&lt;h4 id=&#34;the-software-level-tricks-batching-quantization-and-sparsity&#34;&gt;The Software-Level Tricks: Batching, Quantization, and Sparsity&lt;/h4&gt;
&lt;p&gt;It is possible to increase the compute intensity of almost any compute application, but particularly machine
learning models, with batching.  Batching means performing several parallel computations of the same type at the
same time, like multiple chats with an LLM or several images with a computer vision model.
However, batching requires that you have the parallel work available: a local LLM probably cannot benefit from
batching nearly as much as OpenAI&amp;rsquo;s centralized chatbot system with thousands or millions of simultaneous users.
A system built for larger batch sizes can generally use a higher compute to memory ratio.&lt;/p&gt;
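&lt;p&gt;Batching raises arithmetic intensity because one weight load is reused across every item in the batch, so the
effect is linear in batch size until the compute roof is hit. A sketch under the same FP16 assumption as before:&lt;/p&gt;

```python
def batched_intensity(n, batch, bytes_per_weight=2):
    # One n x n weight load amortized over `batch` matrix-vector products.
    flops = 2 * n * n * batch
    bytes_loaded = n * n * bytes_per_weight
    return flops / bytes_loaded

print(batched_intensity(4096, batch=1))   # 1.0 FLOP/byte
print(batched_intensity(4096, batch=64))  # 64.0 -- much closer to a GPU's knee
```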
&lt;p&gt;Quantization involves reducing the size of the working set by using smaller numbers.  This is a very effective
strategy for inference, where you have the full, pretrained model and can now do a bunch of math to figure out
the best way to reduce the precision of the weights so you store effectively 2-4 bits per weight with
comparatively low loss in model accuracy.  Quantized training, however, is somewhat harder: you don&amp;rsquo;t know the
weights ahead of time, so the normal mathematical tricks of quantized inference do not work.  A lot of training
processes need 8-bit floating point numbers today, and this is already considered &amp;ldquo;quantized&amp;rdquo; compared to 2020,
where 16-bit floats were the norm for training.  Quantization does add some extra ops to convert from the
quantized form to a form that can be used for computing (the internal state of models is still FP8 or FP16),
but those ops are not the limiting factor.&lt;/p&gt;
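&lt;p&gt;A minimal sketch of symmetric weight quantization (real schemes are far more sophisticated; this only shows the
store-small, compute-big round trip described above):&lt;/p&gt;

```python
def quantize(weights, bits=4):
    # Map each weight to a signed integer on one shared linear scale.
    levels = 2 ** (bits - 1) - 1           # 7 levels per side for 4 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    # Convert back to floats right before computing: these are the
    # extra (non-limiting) ops mentioned above.
    return [qi * scale for qi in q]

w = [0.31, -0.52, 0.07, 0.89]
q, s = quantize(w)
print(dequantize(q, s))  # close to w, with 4 bits of storage per weight
```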
&lt;p&gt;Sparsity can be thought of as a kind of quantization.  Sparsifying models involves finding the weights that don&amp;rsquo;t
matter for the outcome of a neural network, and treating those weights as though they are equal to 0, dropping
them from computation.  This both reduces the working set size and the number of arithmetic ops required to compute
the output.  Sparsity is more effective when you can treat a whole block of weights as 0, but can also be applied
at a fine-grained level when matrix multiplication units are designed for it: Nvidia&amp;rsquo;s are.  However, sparsity is
basically impossible to take advantage of in training: you don&amp;rsquo;t know which weights are going to matter before you
train the model.&lt;/p&gt;
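&lt;p&gt;The simplest form of this is magnitude pruning: keep the largest weights and zero the rest. (Hardware support like
Nvidia&amp;rsquo;s structured sparsity constrains &lt;em&gt;which&lt;/em&gt; weights may be zeroed; this sketch ignores that.)&lt;/p&gt;

```python
def magnitude_prune(weights, keep_fraction=0.5):
    # Zero every weight below the magnitude threshold that keeps
    # roughly `keep_fraction` of the entries.
    k = max(1, int(len(weights) * keep_fraction))
    threshold = sorted((abs(w) for w in weights), reverse=True)[k - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

w = [0.9, -0.01, 0.4, 0.02, -0.7, 0.05]
print(magnitude_prune(w))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```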
&lt;p&gt;All of these are knobs that can tune the exact operating point of model inference, and help people building systems
pull the most out of their hardware.&lt;/p&gt;
&lt;h2 id=&#34;the-7-trillion-question&#34;&gt;The &amp;ldquo;$7 Trillion&amp;rdquo; Question&lt;/h2&gt;
&lt;p&gt;The last wave of AI accelerators got the benefit of focusing on the &amp;ldquo;small model&amp;rdquo; case.  This is a circuit
designer&amp;rsquo;s dream: the state you need to compute over can generally fit entirely on a chip, so it is
within your control to design an ASIC that can compute it efficiently.  You are free to design new circuits and
systems that implement the &amp;ldquo;new&amp;rdquo; math of these systems to push performance without worrying so much about what goes
on outside of the borders of your silicon.  This is part of why we saw so many interesting new architectures
at this time, from analog matrix multiplications to interesting new kinds of arithmetic to single-chip massive SIMT
systems. These promised huge gains in energy efficiency and speed, and achieved those gains, but found themselves pushing
the knee of their roofline models far to the right: many are only able to obtain that efficiency when they can avoid
memory operations altogether.&lt;/p&gt;
&lt;p&gt;Tomorrow&amp;rsquo;s AI accelerators will need to focus on memory to contend with the case of streaming weights through data.
The primary goal of these chips will be to find a way to &amp;ldquo;feed the beast&amp;rdquo; and get the working set closer to the compute.
This is what Nvidia, who designed the best general-purpose parallel compute engine they could, has figured out how to do
well.  They had the advantage that getting data close to compute is a common problem in HPC systems, which have a large
working set but tend to have higher arithmetic intensity than LLMs.  While everyone else is scrambling to do this now,
Nvidia has been working on this problem for almost a decade, as their Tesla accelerators have eaten the HPC world.&lt;/p&gt;
&lt;p&gt;Some startups, like Groq and Cerebras, have found that they can somewhat adapt to this new environment, although both
in relatively niche ways.  Future technologies in this area could be new kinds of in-memory compute, new
ways to attach memory to compute, or new multi-chip networks that allow you to get a lot more RAM and compute next
to each other. Getting the working set closer to the compute can also mean compressing the working set, which generally
happens through quantization.  However, future accelerators may find novel ways to compress the working set, which has
a very predictable access pattern: for example, using traditional compression methods on LLM weights (although there
are a lot of reasons why gzip-style compression won&amp;rsquo;t work well).  Block floating point is one easy option here: it
compresses floating-point numbers by sharing one exponent among many of them, but there is probably something more
clever to do.&lt;/p&gt;
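&lt;p&gt;Block floating point is simple enough to sketch directly: pick one exponent for a whole block and store small
integer mantissas. The 8-bit-ish mantissa width here is an arbitrary illustrative choice:&lt;/p&gt;

```python
import math

def bfp_encode(block, mantissa_max=127):
    # One shared exponent for the block; each value becomes a small
    # signed integer mantissa.
    exp = math.frexp(max(abs(x) for x in block))[1]  # scales values into [-1, 1)
    scale = 2.0 ** exp
    return exp, [round(x / scale * mantissa_max) for x in block]

def bfp_decode(exp, mantissas, mantissa_max=127):
    scale = 2.0 ** exp
    return [m / mantissa_max * scale for m in mantissas]

block = [0.5, -0.25, 0.1, 0.45]
exp, mants = bfp_encode(block)
print(bfp_decode(exp, mants))  # close to the original block
```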
&lt;p&gt;It does appear that there is an unserved niche here in local LLMs (if that market exists): large memory capacity and high
memory bandwidth combined with a relatively underpowered compute core seems to be something that nobody is doing yet.
However, this is a problem that lives at the interface between memory and compute.  Conversely, there may be another
opportunity here in super-large batch sizes for companies like Google and OpenAI who probably do have the work available,
and the solution to use in that case will look very different than the solution for low batch size.  These systems will
probably look a lot more like custom supercomputers than what we have today.&lt;/p&gt;
&lt;h2 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;The advances required to make LLMs orders of magnitude faster will likely be more system-level than
circuit-level, but the GPU technologies that exist today for most &amp;ldquo;AI&amp;rdquo; aren&amp;rsquo;t necessarily off the mark.  Nvidia/AMD&amp;rsquo;s
manufacturing process edge may be able to more than make up for the efficiencies you can gain by focusing only on the
computations used for inference.  For startups, which already have a hard time raising the $10-20 million you would need
to spin a chip, raising the hundreds of millions it may take to design these completely new systems may be beyond reach.
With the pace of advancement in model architectures, there is also no guarantee that this research will be relevant by
the time it arrives: the models of 2030 may be as different from today&amp;rsquo;s models as today&amp;rsquo;s models are from the models of
2016.  In particular, as new architectures show up for models, the trend of incredibly low arithmetic intensity may not
continue. Still, I am looking forward to seeing what the startups in this area can cook up.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>The Knight Capital Disaster</title>
      <link>https://specbranch.com/posts/knight-capital/</link>
      <pubDate>Wed, 22 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/knight-capital/</guid>
      <description>This account comes from several publicly available sources as well as accounts from insiders who worked at Knight Capital Group at the time of the issue. I am telling it second- or third-hand.
On August 1, 2012, Knight Capital fell on its sword. It experienced a software glitch that literally bankrupted the company. Between 9:30 am and 10:15 am EST, the employees of Knight capital watched in disbelief and scrambled to figure out what went wrong as the company acquired massive long and short positions, largely concentrated in 154 stocks, totaling 397 million shares and $7.</description>
      <content:encoded>&lt;p&gt;&lt;em&gt;This account comes from several publicly available sources as well as accounts from insiders who worked at
Knight Capital Group at the time of the issue. I am telling it second- or third-hand.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;On August 1, 2012, Knight Capital fell on its sword. It experienced a software glitch that literally bankrupted the
company.  Between 9:30 am and 10:15 am EST, the employees of Knight Capital watched in disbelief and scrambled to
figure out what went wrong as the company acquired massive long and short positions, largely concentrated in 154
stocks, totaling 397 million shares and $7.65 billion.  At 10:15, the kill switch was flipped, stopping the
company&amp;rsquo;s trading operations for the day.  By early afternoon, many of Knight Capital&amp;rsquo;s employees had already
sent out resumes, expecting to be unemployed by the end of the week.&lt;/p&gt;
&lt;p&gt;The root cause of the failure? A comedy of errors in several parts of their development and ops processes.&lt;/p&gt;
&lt;h3 id=&#34;knights-order-entry-software&#34;&gt;Knight&amp;rsquo;s Order-entry Software&lt;/h3&gt;
&lt;p&gt;Knight Capital had an application called &amp;ldquo;SMARS&amp;rdquo; which sends orders to the stock exchange. It contained some logic
to break up large orders into much smaller orders that would have less of an effect on the market. The SMARS
software accepted orders from trading strategies using a binary protocol, and contained logic to make sure that
those orders got filled at a desired price. The protocol contained fields for price, desired size, time in force,
etc. The protocol also had a flag field, which set options for a given order. Nanoseconds counted, so protobufs
and JSON were too heavy here: these commands were sent over the wire as serialized structs.&lt;/p&gt;
&lt;p&gt;One such option, from the very early 2000&amp;rsquo;s, was called &amp;ldquo;power peg.&amp;rdquo; Power peg was an order type used for manual
market making: a power peg order would stay open at a given price, effectively &amp;ldquo;pegging&amp;rdquo; the stock to a given
price. If a power peg order was filled, SMARS would refresh it at the same price. It kept a count of how many
shares were filled for a power peg order, and when a certain (very large) cumulative number was hit, the power peg
order would be automatically canceled.  The intended user flow was for a market maker to open a power peg order,
get it filled as many times as needed, and then cancel the order when the market was about to move.&lt;/p&gt;
&lt;p&gt;In 2003, Knight Capital deprecated the power peg option. They followed almost all the steps for flag deprecation
that you would expect from a disciplined engineering department: they marked the flag as deprecated, they switched
users away from using it, and they defaulted the clients to prevent use of the option. However, they never took
the last step: removing the server code. SMARS had been written somewhat hastily, some of the code for Power
Peg was deeply entwined with other code in the server, and as long as the tests kept working, leaving it in place
didn&amp;rsquo;t seem like a big deal. During a refactor two years later, the tests for power peg were breaking, so they
were deleted. Nobody was using the long-deprecated option, so there seemed to be no need to check its correctness.&lt;/p&gt;
&lt;h3 id=&#34;adding-a-new-feature&#34;&gt;Adding a New Feature&lt;/h3&gt;
&lt;p&gt;In July 2012, Knight Capital needed a flag for orders from their new Retail Liquidity Program (RLP).
Knight had opened a new line of business: buying order flow from retail brokerage and executing those orders.
Retail orders needed special handling, so SMARS needed a new flag. However, the flag word had no unused bits
left, so an engineer reused the bit from a deprecated flag: the power peg flag. The remaining power peg code
in SMARS was disconnected from the flag, and new RLP code was added. The code went through review successfully,
and passed a battery of automated tests.&lt;/p&gt;
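&lt;p&gt;The hazard in reusing a flag bit is easy to see in miniature. This sketch is entirely hypothetical (the real SMARS
protocol and its bit assignments are not public), and it ignores all wire-format details:&lt;/p&gt;

```python
# Hypothetical sketch of flag-bit reuse. Bit positions are made up.
POWER_PEG = 1   # deprecated order type (illustrative bit value)
RLP = 1         # new feature assigned the same, supposedly freed, bit

def bit_set(flags, bit):
    # Check a single power-of-two flag bit.
    return (flags // bit) % 2 == 1

def new_server(flags):
    return "RLP order" if bit_set(flags, RLP) else "normal order"

def old_server(flags):
    # A machine still running pre-deployment code interprets the
    # reused bit as the old option.
    return "power peg order" if bit_set(flags, POWER_PEG) else "normal order"

flags = RLP
print(new_server(flags))  # RLP order
print(old_server(flags))  # power peg order -- the stale server's view
```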
&lt;p&gt;The new RLP code was deployed to the SMARS system on July 27, 2012. Knight Capital officially ran a manual
deployment process: the person assigned to deploy the code would SSH into each SMARS machine, rsync the new
binary to that machine, and update some configuration to set it up to run instead of the old binary. Knight&amp;rsquo;s
operations team had seen the danger of this: to avoid missing a machine, they set up a script to perform the
process for each machine. On July 27, 2012, one member of the operations team ran the deployment script for the
new version of SMARS.&lt;/p&gt;
&lt;p&gt;Unbeknownst to the team, the deployment script had a small bug: when it failed to open an SSH connection to a
machine, it would fail silently, continue to update the other machines, and report success. It was never tested
or checked in like a piece of software because it&amp;rsquo;s a script that one person wrote for convenience.&lt;/p&gt;
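&lt;p&gt;The script&amp;rsquo;s failure mode is a classic one: an error on one host neither stops the loop nor changes the
reported result. A hypothetical sketch (host names and the copy function are invented for illustration), showing the
per-host failure tracking the real script lacked:&lt;/p&gt;

```python
def deploy(hosts, copy_binary):
    # Record per-host failures and surface them, instead of
    # unconditionally reporting success as the real script did.
    failed = []
    for host in hosts:
        try:
            copy_binary(host)
        except OSError as err:
            failed.append((host, str(err)))
    return failed

def flaky_copy(host):
    # Stand-in for rsync-over-SSH; one host is down for maintenance.
    if host == "smars-07":
        raise OSError("connection refused")

result = deploy([f"smars-{i:02d}" for i in range(1, 11)], flaky_copy)
print(result)  # [('smars-07', 'connection refused')] -- not silence
```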
&lt;p&gt;That day, one of ten SMARS machines was down for maintenance during the software upgrade, and rejected
an SSH connection. After its planned maintenance, the server came back up with an old version of SMARS.&lt;/p&gt;
&lt;p&gt;Knight allowed the new SMARS binary to soak for three days before turning on RLP trades, which were to begin on
August 1, and caught no errors during the soak. They also did a limited test of RLP orders on one of the production
SMARS servers to make sure that the new logic was working correctly. The server they tested happened to have received
the new software version, and the test was successful.&lt;/p&gt;
&lt;h3 id=&#34;august-1-2012&#34;&gt;August 1, 2012&lt;/h3&gt;
&lt;p&gt;Beginning at 8:01 EST on August 1, Knight began receiving retail orders through the RLP.  Things were going
well, and the internal servers handling RLP orders were working exactly as they had in testing the prior days.&lt;/p&gt;
&lt;p&gt;At 9:30, the market opened.  Initially, trading in about 150 stocks looked like it was going wrong.  Engineers
and quants were called to figure out what the problem was.  New and experimental trading algorithms were shut
off.  Quantitative researchers, not known for their programming prowess, were thought to have created the bug.
The RLP, now past its experimental soaking period, was allowed to continue operating.&lt;/p&gt;
&lt;p&gt;From debug logs, engineers later narrowed down the problem to a bug in SMARS: orders were leaving trading
servers correctly, but somehow the firm was starting to accrue large positions on these orders, filling them
many times over.  Noticing the flaw, engineers decided to roll back SMARS to its previous version, hoping to
continue trading with a known-good version.&lt;/p&gt;
&lt;p&gt;After the rollback, the abnormal behavior accelerated and spread to seemingly every stock on the market. The
losses accelerated, and the SMARS software kept acquiring massive positions that were not allocated to any
trading strategy. Trading algorithms also continued to be rolled back, as bugs in those algorithms could have
caused the same issue, but none of this helped. Unknown to the operations department, they hadn&amp;rsquo;t rolled
back to a good version of SMARS&amp;mdash;they had rolled back to the same bad version that had been the cause of
their problems.&lt;/p&gt;
&lt;p&gt;At 10:15, the call was made to shut down trading for the day.  Knight had been losing money and accruing
positions so quickly that the computers took a while to figure out exactly how bad it was.&lt;/p&gt;
&lt;p&gt;Knight&amp;rsquo;s executive team now needed to figure out how to cover these positions.  Some positions could be closed
manually, and some of these trades even made money.  However, most of Knight&amp;rsquo;s positions were too large for this
approach. Talks began with banks and other trading partners to figure out how to get out of the hole. Exchanges
were asked if they could reverse the trades. It appeared that Knight would have a $1 billion loss on their hands,
and not anywhere near enough cash to cover it.&lt;/p&gt;
&lt;p&gt;Line employees caught wind of the trouble, and many started to answer the emails from recruiters that they had
long ignored. In the afternoon of August 1, more Knight employees were working on resumes than anything else.&lt;/p&gt;
&lt;h3 id=&#34;the-final-fatal-flaws&#34;&gt;The Final Fatal Flaws&lt;/h3&gt;
&lt;p&gt;During the 2005 refactor of SMARS, the code for reporting Power Peg positions back to trading strategies had
broken, which was what caused the test failures (and the subsequent deletion of tests). If not for this
breakage, the strategies would have been allocated correct positions, and the correct trading algorithms could
have been shut down.&lt;/p&gt;
&lt;p&gt;Finally, SMARS was built to be fast, and did not conduct a lot of pre-trade risk checks.  That was the job for the
trading servers, and they were very accurate at it.  SMARS simply accepted orders and executed them, regardless
of whether the strategy (or the firm) had the requisite capital.  Since Knight was a broker-dealer and had a
direct connection to the exchange, the exchange didn&amp;rsquo;t know whether Knight had the money for their trades either,
and continued accepting orders.  This type of check is the responsibility of the broker, and Knight was its own broker.
Trading strategies, whose risk management code had an inaccurate view of their own positions, continued to send
orders like nothing was wrong.  Nobody at Knight had built any infrastructure to manage the financial risks
related to rogue order entry servers.&lt;/p&gt;
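&lt;p&gt;As an illustration, here is a minimal sketch of a last-stage risk check of the kind Knight lacked.  It is written in Python with made-up limits and order fields&amp;mdash;nothing here reflects Knight&amp;rsquo;s actual systems:&lt;/p&gt;

```python
# Minimal sketch of a last-stage pre-trade risk check (illustrative only).
# The limits and order fields are hypothetical, not any real firm's design.

MAX_POSITION_SHARES = 100_000   # per-symbol position limit
MAX_NOTIONAL = 5_000_000.0      # total open exposure limit

positions = {}                  # symbol -> signed share count
notional = 0.0                  # total notional exposure sent so far

def check_and_send(symbol, side, qty, price):
    """Reject an order at the final hop if it would breach a limit."""
    global notional
    signed = qty if side == "buy" else -qty
    new_pos = positions.get(symbol, 0) + signed
    new_notional = notional + qty * price
    if abs(new_pos) > MAX_POSITION_SHARES or new_notional > MAX_NOTIONAL:
        return False            # block instead of forwarding to the exchange
    positions[symbol] = new_pos
    notional = new_notional
    return True                 # within limits; safe to send
```

&lt;p&gt;Because a check like this sits at the last stage before the exchange, it caps the damage no matter which upstream server or strategy is misbehaving.&lt;/p&gt;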
&lt;h3 id=&#34;aftermath&#34;&gt;Aftermath&lt;/h3&gt;
&lt;p&gt;When the dust settled, Knight was able to close its positions at a $440 million loss. On August 5, 2012, Knight
received $400 million of rescue financing that allowed them to continue operations. They rebranded as &amp;ldquo;KCG,&amp;rdquo;
and were acquired in 2013 by GETCO, another algorithmic trading company, to form KCG Holdings. They were later
acquired by Virtu Financial in 2017.&lt;/p&gt;
&lt;p&gt;The story of Knight Capital prompted other trading firms to review their processes and adopt new layers of risk
checks as well as modern DevOps practices to protect themselves from being the next Knight. Some of them quietly
admitted that they were lucky: their practices were similar to the ones that brought down their competitor.
Adding risk checks to the &lt;em&gt;last&lt;/em&gt; stage of an order&amp;rsquo;s life became universal in the industry, and testing and
deployment practices were largely brought into the modern era.&lt;/p&gt;
&lt;p&gt;Knight Capital, the SEC, the exchanges, and FINRA conducted thorough postmortem reviews of what happened during
this incident. There was a lot of blame to go around at Knight, and they ended up paying an additional $12 million
of fines for failing to hold up their responsibilities as a broker. A lot of Knight&amp;rsquo;s development practices were
changed. As of 2016, the engineer who did the update still worked at KCG. His entire management chain had been
replaced (resigned or fired) in light of this incident, all the way up to the CTO.&lt;/p&gt;
&lt;p&gt;The story of Knight Capital today serves as a cautionary tale for trading firms who ask &amp;ldquo;what&amp;rsquo;s the worst that
could happen?&amp;rdquo;&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Abstraction is Expensive</title>
      <link>https://specbranch.com/posts/expensive-abstraction/</link>
      <pubDate>Wed, 07 Dec 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/expensive-abstraction/</guid>
      <description>As you build a computer system, little things start to show up: maybe that database query is awkward for the feature you are building, or you find your server getting bogged down transferring gigabytes of data in hexadecimal ASCII, or your app translates itself to Japanese on the fly for hundreds of thousands of separate users. These are places where your abstractions are misaligned - your app would be quantitatively better if it had a better DB schema, a way to transfer binary data, or native internationalization for your Japanese users.</description>
      <content:encoded>&lt;p&gt;As you build a computer system, little things start to show up: maybe that database query is awkward
for the feature you are building, or you find your server getting bogged down transferring gigabytes
of data in hexadecimal ASCII, or your app translates itself to Japanese on the fly for hundreds of
thousands of separate users.  These are places where your abstractions are misaligned - your app
would be quantitatively better if it had a better DB schema, a way to transfer binary data, or
native internationalization for your Japanese users.  Each of these misalignments carries a cost.&lt;/p&gt;
&lt;p&gt;For many computer systems, abstraction misalignment is where we spend the majority of our resources:
both in terms of engineering costs and compute time.  Building apps at a high level pays dividends
in terms of getting them set up, but eventually the bill comes due, either in the form of tech debt,
slow performance, or both.  Conversely, systems where abstractions are well-aligned throughout the
tech stack, like high-frequency trading systems and (ironically) chat apps, are capable of amazing
feats of engineering.&lt;/p&gt;
&lt;p&gt;Most small-scale web services that do normal things don&amp;rsquo;t pay much for
abstraction misalignment, but large-scale systems and systems that do odd things pay huge costs.
Misalignments can also show up as systems age and things change - migrations are difficult, and you
would rather add a feature than do a refactor (okay, maybe not you, but your manager definitely would).&lt;/p&gt;
&lt;p&gt;When I left Google, I was working on a new storage system that leveraged several cool technologies
to go fast.  The real power we had, however, was that we could design the stack so that every
abstraction was well-aligned with the layers above and below it, eliminating the shims that
sucked up lines of code and compute cycles.  I saw this kind of system in high-frequency trading,
too.  I didn&amp;rsquo;t know how good I had it&amp;hellip;&lt;/p&gt;
&lt;h2 id=&#34;assumptions-values-and-requirements&#34;&gt;Assumptions, Values, and Requirements&lt;/h2&gt;
&lt;p&gt;Every project is built with a set of assumptions, requirements, and values.  These will come to
define the constraints under which the system is built, and ultimately the characteristics
of the system.  Requirements are simple: they are what you &lt;em&gt;absolutely need&lt;/em&gt; for your product.  For
example, if you are building a NoSQL database, you require a key-value interface of some sort and
a storage medium of some sort.  A project&amp;rsquo;s values define what you want to aim for.  Perhaps you
are building a performance-focused NoSQL database: In that case, you require a NoSQL database
interface, and you value performance.  Bryan Cantrill has done a good talk on
&lt;a href=&#34;https://www.youtube.com/watch?v=9QMGAtxUlAc&#34;&gt;values&lt;/a&gt;.  Assumptions are the final pillar: these
tell you about the invisible pseudo-requirements you build your system under.  For example, most of
us assume certain things about programming languages and computing environments: at the very least,
most systems are built on the assumption that computers will run them.  Breaking that assumption
can result in &lt;a href=&#34;https://dzone.com/articles/when-databases-meet-fpga-achieving-1-million-tps-w&#34;&gt;great outcomes&lt;/a&gt;
(and a lot of work).&lt;/p&gt;
&lt;p&gt;Assumptions, values, and requirements tend to determine the final characteristics of a system.  A
typical startup CRUD app might have the following characteristics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Require that we do our startup&amp;rsquo;s CRUD&lt;/li&gt;
&lt;li&gt;Value time to market&lt;/li&gt;
&lt;li&gt;Value the ability to run cheaply until you get product-market fit&lt;/li&gt;
&lt;li&gt;Assume that we run in a hosted provider or a cloud&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A high frequency trading system is built with different constraints:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Require that we execute a trading strategy&lt;/li&gt;
&lt;li&gt;Value trading profitability&lt;/li&gt;
&lt;li&gt;Value speed&lt;/li&gt;
&lt;li&gt;Value safety/compliance with regulations&lt;/li&gt;
&lt;li&gt;Assume that you control all of your hardware inside a co-located datacenter&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The former set of constraints gives you millisecond response times, features, and comparative
affordability.  The latter set of constraints gives you 10-100 nanosecond response times, exotic
hardware, comparatively high visibility, and MASSIVE startup costs.&lt;/p&gt;
&lt;h4 id=&#34;comparing-two-databases&#34;&gt;Comparing Two Databases&lt;/h4&gt;
&lt;p&gt;Let&amp;rsquo;s look at two easily comparable examples. A product like ScyllaDB might have the following
characteristics (disclaimer: I don&amp;rsquo;t work with ScyllaDB, so these are not their words):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Required: Distributed NoSQL database&lt;/li&gt;
&lt;li&gt;Value speed&lt;/li&gt;
&lt;li&gt;Value scalability&lt;/li&gt;
&lt;li&gt;Value compatibility with existing NoSQL DB ecosystem (Cassandra)&lt;/li&gt;
&lt;li&gt;Assume NVMe flash and modern network cards on Linux machines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A product like this drives most of its design decisions from its values, within the space of its
requirements and assumptions.  However, products with different assumptions and requirements turn out
very different.  Contrast ScyllaDB with a project built under these assumptions, values, and requirements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Required: Distributed SQL database&lt;/li&gt;
&lt;li&gt;Value scalability&lt;/li&gt;
&lt;li&gt;Value global consistency and durability&lt;/li&gt;
&lt;li&gt;Value compatibility with existing SQL DB ecosystem (PostgreSQL)&lt;/li&gt;
&lt;li&gt;Assume that you run on Linux machines&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are very similar requirements, values, and assumptions, but they drive completely different
products.  The second set is an approximation of the values of CockroachDB or Google&amp;rsquo;s Spanner database
(same disclaimer as ScyllaDB applies), which are both about two orders of magnitude slower than ScyllaDB
to execute a transaction on the same hardware, but offer a SQL interface and global consistency.&lt;/p&gt;
&lt;h2 id=&#34;aligning-abstractions&#34;&gt;Aligning Abstractions&lt;/h2&gt;
&lt;p&gt;Ideally, you would like all of the abstractions you use to have aligned goals with your system.  If
you can buy a dependency that aligns with your goals, that&amp;rsquo;s great.  If not, you will likely have to
&amp;ldquo;massage&amp;rdquo; your dependencies to be able to do what you want.  This is the first time an abstraction
costs you.  If you use the wrong database schema (or the wrong technology), you may find yourself
scanning database tables when a different schema would do a single lookup.  For a non-database
example, if you make an Electron-based computer game, it will likely be unplayably slow (but you
will be able to build it in record time!).&lt;/p&gt;
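&lt;p&gt;The schema-scan cost is easy to see concretely.  This sketch uses Python&amp;rsquo;s built-in SQLite purely for illustration: the same query is a full table scan or a single index lookup depending on the schema:&lt;/p&gt;

```python
# Illustration: the same query is a full scan or an index lookup
# depending on the schema.  Uses Python's built-in sqlite3 module.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, email TEXT)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [(i, f"user{i}@example.com") for i in range(1000)])

def plan(query):
    # EXPLAIN QUERY PLAN describes each step, e.g. "SCAN users" (full
    # table scan) or "SEARCH users USING INDEX ..." (index lookup)
    rows = db.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(row[-1] for row in rows)

query = "SELECT id FROM users WHERE email = 'user500@example.com'"
plan_before = plan(query)                  # full table scan
db.execute("CREATE INDEX idx_email ON users (email)")
plan_after = plan(query)                   # single index lookup
print(plan_before)
print(plan_after)
```

&lt;p&gt;On a thousand rows nobody notices; on a billion rows, that scan is the abstraction tax.&lt;/p&gt;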
&lt;p&gt;Going back to the CRUD app, let&amp;rsquo;s pick a database.  Is a ScyllaDB cluster a good choice?  What about
a CockroachDB cluster?  We probably don&amp;rsquo;t mind if our database doesn&amp;rsquo;t scale the best or isn&amp;rsquo;t the
fastest, but we do mind the expense of running a cluster, so maybe we should look for an alternative.&lt;/p&gt;
&lt;p&gt;Compared to our hypothetical cases of ScyllaDB and CockroachDB, SQLite has some different assumptions,
requirements, and values:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Required: Embeddable SQL database&lt;/li&gt;
&lt;li&gt;Value ease of use&lt;/li&gt;
&lt;li&gt;Value reliability&lt;/li&gt;
&lt;li&gt;Value cross-platform compatibility&lt;/li&gt;
&lt;li&gt;Assume that you run on some sort of computer&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Which of these aligns better with a CRUD app?  Probably SQLite, at least until product-market fit,
because it will be easier and cheaper to run. DynamoDB or another hosted database (or CockroachDB&amp;rsquo;s
serverless offering) might align even better with what you want.  After all, you probably don&amp;rsquo;t care
very much about cross-platform compatibility if you are using a cloud, and the database is literally
free if you keep it small - and hopefully you will be making money when it starts getting expensive.&lt;/p&gt;
&lt;p&gt;Most companies don&amp;rsquo;t build their own database because there is a wealth of available options that
can help you with any kind of project.  However, you don&amp;rsquo;t have a wealth of options for many other
abstractions: often, you have only one or two to choose from, and those abstractions were built
without much thought to your use case.&lt;/p&gt;
&lt;h4 id=&#34;every-abstraction-counts&#34;&gt;Every Abstraction Counts&lt;/h4&gt;
&lt;p&gt;It&amp;rsquo;s easy to see how using the right database schema or picking the right programming language
can help you with both CPU time and engineer time, but the abstraction tax hits us up and down
the stack.&lt;/p&gt;
&lt;p&gt;As an extreme example, consider TCP (yes, that TCP).  Most of us take HTTP/TCP as a given for
applications and run with the kernel&amp;rsquo;s TCP driver, and it would be complete folly for most projects
to do something different.
&lt;a href=&#34;https://courses.grainger.illinois.edu/CS598HPN/fa2020/papers/snap.pdf&#34;&gt;Not for Google&lt;/a&gt;
(disclaimer: I worked with the folks who published this paper). Storage and search folks needed
faster, more efficient networking for RPCs, and cloud computing needed userspace networking to
develop virtualization features.  The result was Snap, a userspace networking driver, and Pony
Express, a transport protocol designed for the demands of Google&amp;rsquo;s big users.  By eschewing the
unused features and swapping from &amp;ldquo;reliable bytestream&amp;rdquo; to &amp;ldquo;reliable messaging,&amp;rdquo; Pony Express
ended up being 3-4x faster than TCP.&lt;/p&gt;
&lt;p&gt;Another example from Google is &lt;a href=&#34;https://storage.googleapis.com/pub-tools-public-publication-data/pdf/cebd5a9f6e300184fd762f190ffd8978b724e0c8.pdf&#34;&gt;tcmalloc&lt;/a&gt;.
Many large companies have designed their own memory allocators, but the tcmalloc paper is the best,
in my biased opinion, at describing how a memory allocator can impact the performance of
applications.  By explicitly aligning the goal of the memory allocator with the goals of the fleet,
they found that by &lt;em&gt;increasing&lt;/em&gt; the time spent in the allocator, they could improve allocation
efficiency enough that the end application was much faster due to better locality and reductions
in cycles spent walking page tables (TLB misses).&lt;/p&gt;
&lt;p&gt;It turns out that similar gains are also available at the level of
&lt;a href=&#34;https://www.usenix.org/system/files/nsdi19-ousterhout.pdf&#34;&gt;thread schedulers&lt;/a&gt;,
&lt;a href=&#34;https://www.phoronix.com/news/Linux-5.14-File-Systems&#34;&gt;filesystems&lt;/a&gt;,
&lt;a href=&#34;https://sschakraborty.github.io/benchmark/index.html&#34;&gt;programming languages&lt;/a&gt; and others.
Some of them turn out to be quite expensive!&lt;/p&gt;
&lt;p&gt;Also, even though I have been focusing on performance here, it&amp;rsquo;s not always easier to work with the
off-the-shelf abstraction either: part of the motivation for the Snap networking system was
programmability and extensibility.&lt;/p&gt;
&lt;h2 id=&#34;everything-changes-over-time&#34;&gt;Everything Changes over Time&lt;/h2&gt;
&lt;p&gt;Even abstractions that are well-aligned at the outset of a project can see themselves becoming the
wrong choice.  Usually, the change comes from either the underlying assumptions changing or your
values changing.  Returning to databases, a lot of successful apps tend to outgrow a single server
using SQLite or PostgreSQL.  There are many solutions to this, but the fundamental change is that
&amp;ldquo;embedded&amp;rdquo; becomes untenable, the value of &amp;ldquo;easy to use&amp;rdquo; starts to be diminished, and instead we
start to value scalability, availability, and speed.  The other alternatives we have thought about
here, ScyllaDB and CockroachDB, start to become much more attractive.  The migration is costly and
difficult, and you have to deal with a few bugs at first, but the database scales.&lt;/p&gt;
&lt;p&gt;Of course, the alternative to a migration is to put in a compatibility layer so that you can keep
the old database in production while you put new entries in a new database.  This also leads to
slowness and bugs.  This is not an uncommon pattern - it is often too costly or risky to do a
database migration.  Of course, running two database systems can further misalign your abstractions.&lt;/p&gt;
&lt;p&gt;We also saw this in the case of Google&amp;rsquo;s user-space networking paper. What changed for Google wasn&amp;rsquo;t
their values, but their assumptions: modern datacenter networks are very fast, CPUs can crunch a ton
of data, and NVMe flash is 2-3 orders of magnitude faster than spinning disks.  Saturating a 100
Gbps network card with TCP is expensive - taking 16 cores per the Snap paper - while saturating it
with a custom protocol is 4x cheaper, and you can saturate a NIC &lt;em&gt;with a single stream&lt;/em&gt;. In Google&amp;rsquo;s
case, the bandwidth expansion of network cards caused the TCP abstraction to become stale.&lt;/p&gt;
&lt;p&gt;Even if nothing about your system changes, your abstractions can still go bad.  Your software
environment can change, your users can change their usage patterns, or your dependencies can
simply get updated in a way that you don&amp;rsquo;t like.  The wear-out of abstractions that used to work
well is also commonly known as &amp;ldquo;technical debt.&amp;rdquo;  However, if you choose your abstractions well
and define clean boundaries, the abstractions you use can far outlive your system.&lt;/p&gt;
&lt;h2 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;You can&amp;rsquo;t avoid abstractions as a software engineer - software itself is an abstraction.  In a way,
software engineers are professional abstraction wranglers.  The only thing we can do is stay on
top of our abstractions, the underlying assumptions they make, and their implications.  Focusing
only on your &amp;ldquo;core business need&amp;rdquo; and your &amp;ldquo;unique value add&amp;rdquo; doesn&amp;rsquo;t build a successful business
alone - if the abstractions you use to get there aren&amp;rsquo;t well-aligned to your goals, you will have
achieved a pyrrhic victory at best, and your focus and dedication to the bottom line may have cost
you the chance to scale up or run profitably.&lt;/p&gt;
&lt;p&gt;At companies with huge engineering forces, a lot of engineers spend their time on abstraction
management.  Often, these engineers are actually the most &amp;ldquo;productive&amp;rdquo; in terms of money saved -
infrastructure projects tend to result in 8-9 figure savings or unique capabilities, and
performance engineering (another form of abstraction alignment) frequently has 8 figure returns
per engineer.  Another large group of engineers is in charge of making sure that the old
abstractions don&amp;rsquo;t break and crash the entire system.&lt;/p&gt;
&lt;p&gt;Conversely, this is where startups can develop technical advantages on big tech despite having much
smaller engineering teams, and where bootstrapped companies can out-engineer series-D companies.
Given the freedom to align your abstractions to your goals, amazing things are possible.&lt;/p&gt;
&lt;h4 id=&#34;further-reading&#34;&gt;Further Reading&lt;/h4&gt;
&lt;p&gt;If you liked this topic, Dan Luu, a blogger whose articles I like a lot, has written on
adjacent topics before:
A year ago, he wrote about the &lt;a href=&#34;https://danluu.com/in-house/&#34;&gt;value of in-house expertise&lt;/a&gt;, and
he has written in the past on
&lt;a href=&#34;https://danluu.com/sounds-easy/&#34;&gt;why companies tend to have a lot of engineers for easy problems&lt;/a&gt;.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Contemplating Randomness</title>
      <link>https://specbranch.com/posts/random-nums/</link>
      <pubDate>Thu, 27 Oct 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/random-nums/</guid>
      <description>I have recently been immersed in the theory and practice of random number generation while working on Arbitrand, a new high-quality true random number generation service hosted in AWS. Because of that, I am starting a sequence of blog posts on randomness and random number generators. This post is the first of the sequence, and focuses what random number generators are and how to test them.
Formally, random number generators are systems that produce a stream of bits (or numbers) with two properties:</description>
      <content:encoded>&lt;p&gt;I have recently been immersed in the theory and practice of random number generation while working
on &lt;a href=&#34;https://arbitrand.com&#34;&gt;Arbitrand&lt;/a&gt;, a new high-quality true random number generation service
hosted in AWS.  Because of that, I am starting a sequence of blog posts on randomness and
random number generators.  This post is the first of the sequence, and focuses on what random
number generators are and how to test them.&lt;/p&gt;
&lt;p&gt;Formally, random number generators are systems that produce a stream of bits (or numbers) with
two properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Uniformity:&lt;/strong&gt; The stream is uniformly distributed. Each possible output is equally probable.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Independence:&lt;/strong&gt; Elements of the stream are completely independent of each other.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;These two properties are enough to both ensure that the stream of random numbers has maximal
entropy (1 bit of entropy per bit), and to make it very tricky to generate good random
numbers!&lt;/p&gt;
&lt;p&gt;Before we get into the math, there are a lot of philosophical debates about whether we can
actually produce random numbers, or whether there is some hidden determinism in random number
generation.  As far as I know, it is almost impossible to produce numbers that are perfectly
random.  Even the best quantum random number generators can be potentially biased by external
factors, like electric fields or gravity.  However, it appears that we can get imperceptibly close.&lt;/p&gt;
&lt;p&gt;Randomness tests exist to determine how close any random number generator is to truly random
number generation.  Treating a perfectly fair coin (another system that
&lt;a href=&#34;https://statweb.stanford.edu/~cgates/PERSI/papers/dyn_coin_07.pdf&#34;&gt;doesn&amp;rsquo;t exist&lt;/a&gt;) as the gold
standard for randomness, we can actually test randomness pretty well, but we can also create
deterministic algorithms that pretend to be random, too.&lt;/p&gt;
&lt;h2 id=&#34;defining-randomness-mathematically&#34;&gt;Defining Randomness Mathematically&lt;/h2&gt;
&lt;p&gt;In order to measure uniformity and independence, we use mathematical properties that we expect from
a stream of random numbers.  We can then use statistical tests over a number of samples from a
random stream to determine if the stream has those properties.&lt;/p&gt;
&lt;p&gt;For uniformity, assuming a one-bit stream, we would like the stream of bits to behave
like a perfect coin flip: a Bernoulli distribution with $p = 0.5$.  For example, we expect that over
a number of trials (where a trial is a large sample from the random number generator), the average
trial will have the following characteristics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Mean of $0.5$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Variance of $0.25$&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Skewness (and higher statistical moments) of $0$&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
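&lt;p&gt;These uniformity checks are easy to sketch.  Here, Python&amp;rsquo;s standard &lt;code&gt;random&lt;/code&gt; module stands in for the generator under test:&lt;/p&gt;

```python
# Sketch: estimate the mean and variance of a bit stream and compare
# against the Bernoulli(p=0.5) ideal (mean 0.5, variance 0.25).
import random

random.seed(1234)                       # deterministic for reproducibility
n = 100_000
bits = [random.getrandbits(1) for _ in range(n)]

mean = sum(bits) / n
var = sum((b - mean) ** 2 for b in bits) / n

print(f"mean = {mean:.3f} (ideal 0.500)")
print(f"variance = {var:.3f} (ideal 0.250)")
```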
&lt;p&gt;For independence, we want the stream to have the property that the probability of any bit
being one does not depend on previous bits.  $P(R_n = 1 | R_m = 1) = 0.5$ for all $n$ and $m$
where $n \neq m$.  To measure independence, we can look at sequences of bits.  For example,
we expect the average trial to have the following properties:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Toggles from 1 to 0 and back $(n - 1) / 2$ times.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Has a number of runs of ones and zeros of a given size determined by a
&lt;a href=&#34;https://mathworld.wolfram.com/Run.html&#34;&gt;known formula&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We can also measure the independence property using the autocorrelation of the stream.  We
would like each bit of the stream to be uncorrelated to the history of the stream.  In an
equation, mapping bits from $\{0, 1\}$ to $\{-1, +1\}$, we expect that over several samples of $n$
bits, the following holds on average for all $k \neq 0$:&lt;/p&gt;
&lt;p&gt;$$ 0 \approx \sum\limits_{i=k}^{n} R_i R_{i-k} $$&lt;/p&gt;
&lt;p&gt;Practically, testing for uniformity is easy.  Testing for independence is both subtle and
computationally difficult.&lt;/p&gt;
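&lt;p&gt;As a sketch of the independence side, here is an autocorrelation estimate in Python, mapping bits to $\pm 1$ so that an unbiased, independent stream averages to zero at every nonzero lag (the standard library generator again stands in for the stream under test):&lt;/p&gt;

```python
# Sketch: estimate autocorrelation of a bit stream at several lags.
# Bits are mapped from {0,1} to {-1,+1}; for an independent stream the
# normalized sum at every nonzero lag should be near zero.
import random

random.seed(42)
n = 100_000
r = [2 * random.getrandbits(1) - 1 for _ in range(n)]  # -1/+1 stream

def autocorr(stream, k):
    total = sum(stream[i] * stream[i - k] for i in range(k, len(stream)))
    return total / (len(stream) - k)   # normalize to [-1, 1]

for k in (1, 2, 7):
    print(f"lag {k}: {autocorr(r, k):+.4f}")
```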
&lt;h2 id=&#34;pseudorandomness&#34;&gt;Pseudorandomness&lt;/h2&gt;
&lt;p&gt;Most computer random number generators use pseudorandom algorithms that generate a fixed sequence
of bits, rather than true random number generators.  In a way, pseudorandom number generators are
compressed representations of extremely long sequences.&lt;/p&gt;
&lt;p&gt;All pseudorandom number generators have the property that over some length of output, the sequence
generated will eventually repeat.  The period can be huge, but it always exists.&lt;/p&gt;
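&lt;p&gt;You can watch the repetition happen with a toy generator.  This sketch uses a deliberately tiny linear congruential generator&amp;mdash;not a serious PRNG&amp;mdash;so the period is small enough to find by brute force:&lt;/p&gt;

```python
# A deliberately tiny linear congruential generator (LCG), far too small
# to be a real PRNG, but small enough to watch the output repeat.
def lcg(seed, a=5, c=3, m=16):
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

seen = {}
period = None
for step, value in enumerate(lcg(seed=1)):
    if value in seen:
        period = step - seen[value]   # the sequence has started repeating
        break
    seen[value] = step
print(f"period = {period}")           # 16: this LCG has full period mod 16
```

&lt;p&gt;A real generator like the Mersenne Twister does the same thing; its period is just astronomically longer.&lt;/p&gt;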
&lt;p&gt;You might have noticed that pseudorandom number generators do not actually have the &amp;ldquo;independence&amp;rdquo;
property, since the streams they produce are deterministic.  However, they can fake independence
by replacing it with the following properties:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;A long period before the output stream repeats.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Equidistributed&lt;/strong&gt; output.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Treating the output of the random number generator as numbers between $0$ and $1$ (e.g. mapping
linearly from integers), equidistribution means that given a number $x \in [0, 1]$ it is expected
that a sample of $n$ outputs from a pseudorandom number generator contains about $nx$ outputs that
are less than $x$.  In other words, the fraction of sampled outputs that is less than $x$ is
proportional to $x$ for any $x$.&lt;/p&gt;
&lt;p&gt;Equidistribution also has a concept of dimensionality: higher-dimensional equidistribution involves
sparse sampling of the output stream, which can defeat simple pseudorandom number generators that
do not have a very long period.&lt;/p&gt;
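&lt;p&gt;A rough equidistribution check is just counting: the fraction of sampled outputs below $x$ should be close to $x$.  A sketch, with &lt;code&gt;random.random()&lt;/code&gt; standing in for the generator under test:&lt;/p&gt;

```python
# Sketch: check that the fraction of outputs below x is close to x,
# for several thresholds x.  random.random() stands in for the PRNG.
import random

random.seed(7)
n = 100_000
outputs = [random.random() for _ in range(n)]

fracs = {x: sum(1 for u in outputs if u < x) / n for x in (0.1, 0.5, 0.9)}
for x, frac in fracs.items():
    print(f"x = {x}: fraction below = {frac:.3f} (expected about {x})")
```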
&lt;h2 id=&#34;imperfect-random-number-streams&#34;&gt;Imperfect Random Number Streams&lt;/h2&gt;
&lt;p&gt;Most often, when dealing with true random numbers, the entropy source that is used to create the
random numbers is imperfect.  However, random number generators are expected to output uniform
random numbers because uniform random numbers maximize the amount of entropy available per bit,
so they are the &amp;ldquo;maximally compressed&amp;rdquo; way to deliver entropy.&lt;/p&gt;
&lt;p&gt;There are a lot of techniques available to extract entropy from imperfect random number streams and
transform them into seemingly-perfect random number streams.  The most common and heaviest
technique involves cryptographic functions, but there are alternatives (some of which have
questionable quality).&lt;/p&gt;
&lt;h4 id=&#34;non-uniform-randomness&#34;&gt;Non-Uniform Randomness&lt;/h4&gt;
&lt;p&gt;So far, I have only focused on uniformly distributed random numbers.  You can generate random
numbers that are not uniformly random - many quantum RNGs natively output exponentially distributed
random numbers - and then convert the result to uniform randomness.  The drawback of non-uniform
random number generators is that they are effectively biased: when presented as a stream of bits,
each bit carries less than 1 bit of entropy.&lt;/p&gt;
&lt;p&gt;Non-uniform random numbers can be converted to uniform random numbers at a minor loss of entropy
by quantizing the numbers with bins of equal probability mass (but unequal size).  This comes at a
loss of either bandwidth or quality.  Transforming uniform random numbers into samples from
non-uniform distributions is easier to do without losing much entropy.&lt;/p&gt;
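&lt;p&gt;Here is a sketch of the equal-probability-mass idea for an exponential source: quantizing each sample by its CDF value makes every bin equally likely, even though the bins cover very different ranges.  (Python and an ideal exponential distribution are used purely for illustration.)&lt;/p&gt;

```python
# Sketch: turn exponentially distributed samples into uniform k-bit values
# by binning on equal probability mass (equivalently, via the CDF).
import math
import random

random.seed(99)
K = 3                                   # bits per output -> 8 bins
samples = [random.expovariate(1.0) for _ in range(80_000)]

def to_uniform_bits(x, k=K):
    # The CDF of Exp(1) is 1 - e^(-x); it maps each sample to [0, 1)
    # uniformly, so 2^k equal slices of [0, 1) are equal-mass bins.
    u = 1.0 - math.exp(-x)
    return min(int(u * (1 << k)), (1 << k) - 1)

counts = [0] * (1 << K)
for x in samples:
    counts[to_uniform_bits(x)] += 1
print(counts)    # each bin should hold roughly 80000 / 8 = 10000 samples
```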
&lt;h4 id=&#34;biased-random-number-generators&#34;&gt;Biased Random Number Generators&lt;/h4&gt;
&lt;p&gt;Another common class of slightly-imperfect random number generators are random number generators
that have a bias.  These are like unfair coins: 1&amp;rsquo;s or 0&amp;rsquo;s will be more likely to show up.  Most
hardware true random number generators are biased in one way or another.&lt;/p&gt;
&lt;p&gt;Bias can show up in hardware true random number generators from a lot of sources: imbalances in
transistor drive strength can manifest as bias in many common TRNG circuits, but biases can also
show up due to the way circuits are designed, external factors, or non-ideal operating conditions.
TRNGs that use background radio noise are particularly sensitive to bias due to external factors.&lt;/p&gt;
&lt;p&gt;There are several different types of algorithms to simply de-bias a stream of random numbers, but
these algorithms can often weaken the independence property.  If you know the bias of your random
number generator, you can attempt to &amp;ldquo;swallow&amp;rdquo; ones or zeros every once in a while to fix the bias,
but this has to be done very carefully to avoid independence problems.&lt;/p&gt;
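&lt;p&gt;The classic example of such a de-biasing algorithm is von Neumann&amp;rsquo;s extractor: take input bits in pairs, emit 0 for a 01, 1 for a 10, and discard 00 and 11.  It removes a fixed bias exactly, but only if the input bits really are independent, and it throws away most of the stream.  A sketch:&lt;/p&gt;

```python
# Von Neumann extractor: removes a fixed bias from a stream of
# independent bits, at the cost of discarding most of the input.
import random

def von_neumann(bits):
    out = []
    for a, b in zip(bits[0::2], bits[1::2]):  # consume two bits at a time
        if a != b:
            out.append(a)    # 01 -> 0, 10 -> 1; 00 and 11 are discarded
    return out

random.seed(5)
# A heavily biased source: P(1) = 0.8
biased = [1 if random.random() < 0.8 else 0 for _ in range(200_000)]
unbiased = von_neumann(biased)

print(f"input mean:  {sum(biased) / len(biased):.3f}")     # near 0.80
print(f"output mean: {sum(unbiased) / len(unbiased):.3f}") # near 0.50
```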
&lt;p&gt;The most common way to fix bias problems in random number generation is with cryptographic
post-processing.&lt;/p&gt;
&lt;h4 id=&#34;cryptographic-post-processing&#34;&gt;Cryptographic Post-Processing&lt;/h4&gt;
&lt;p&gt;Cryptographic post-processing is used by &lt;code&gt;/dev/random&lt;/code&gt; in Linux as well as most hardware true
random number generators (but not quantum random number generators or the Arbitrand TRNG) as
insurance against biased inputs.  This type of post-processing involves:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Collecting entropy from an entropy source by XOR-ing its output with the previous output from
the TRNG.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Estimating how much entropy you have collected, and waiting to provide an output until you
have collected enough entropy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Producing a cryptographic hash of the contents of the entropy pool once you have enough
entropy.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
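&lt;p&gt;The three steps above can be sketched as follows, with SHA-256 standing in for whatever hash function a real implementation uses, and a toy &amp;ldquo;entropy source&amp;rdquo; made up for illustration:&lt;/p&gt;

```python
# Sketch of the XOR-pool / estimate / hash pipeline described above.
# SHA-256 and the toy entropy source are stand-ins, not any real design.
import hashlib
import random

POOL_BYTES = 32
BITS_PER_SAMPLE = 2          # conservative estimate of entropy per sample

pool = bytearray(POOL_BYTES)
entropy_bits = 0
samples_mixed = 0
random.seed(0)               # toy "physical" source, for illustration only

def collect_sample():
    """Step 1: XOR a (possibly biased) sample into the pool."""
    global entropy_bits, samples_mixed
    sample = random.getrandbits(8)          # stand-in for a noise source
    pool[samples_mixed % POOL_BYTES] ^= sample
    samples_mixed += 1
    entropy_bits += BITS_PER_SAMPLE         # step 2: credit estimated entropy

def read_random():
    """Step 3: only hash and emit once enough entropy has accumulated."""
    global entropy_bits
    while entropy_bits < 256:               # wait for a full pool's worth
        collect_sample()
    entropy_bits = 0
    return hashlib.sha256(bytes(pool)).digest()

out = read_random()
print(out.hex())             # 32 bytes of conditioned output
```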
&lt;p&gt;If you are correct (or conservative) about how much entropy is in the entropy pool, the resulting
stream is essentially a perfect random stream.  However, there may be some cause for concern if
the cryptographic post-processing method used is broken, which could be as simple as cracking a
secret key used in the hashing process.&lt;/p&gt;
&lt;h2 id=&#34;rolling-my-own-random-number-generator&#34;&gt;&amp;ldquo;Rolling&amp;rdquo; My Own Random Number Generator&lt;/h2&gt;
&lt;p&gt;Designing new pseudorandom number generation algorithms is very difficult.  I would
suggest avoiding it if you can.  Pseudorandom number generators, like cryptographic algorithms,
have a lot of non-obvious pitfalls that can bias them or otherwise render them useless.  True
random number generators may actually be easier to design - there are a number of circuit and
system techniques involved, and you don&amp;rsquo;t have to deal with the &amp;ldquo;equidistribution&amp;rdquo; property,
which can be quite hard to achieve.  However, in both cases, the devil is in the details.&lt;/p&gt;
&lt;p&gt;For &lt;a href=&#34;https://arbitrand.com&#34;&gt;Arbitrand&lt;/a&gt;, I have tried to take a slightly different approach to
randomness: instead of trying to get one perfect entropy source, we combine many uncorrelated
entropy sources that are known to be slightly imperfect.  By characterizing these sources and
understanding their pathologies, we can effectively cover imperfections in one entropy source
with another source.  This strategy wastes entropy, but covers the TRNG across temporal and
environmental variations in operating conditions, allowing us to operate on cloud FPGAs.&lt;/p&gt;
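One way to see why combining independent imperfect sources helps is the piling-up lemma for XOR. This is an illustration of the principle, not necessarily the actual Arbitrand construction: XOR-ing two independent biased bits yields a bit whose bias is the product of the input biases.

```python
def xor_bias(eps1, eps2):
    """Piling-up lemma: if two independent bits have biases eps1 and
    eps2 (where bias = P(bit == 0) - 1/2), their XOR has bias
    2 * eps1 * eps2 -- smaller in magnitude than either input bias
    whenever both inputs have some randomness.
    """
    return 2.0 * eps1 * eps2
```

Two sources each 10% away from fair (eps = 0.1) combine into a bit only 2% away from fair, at the cost of spending two input bits per output bit. The catch is the same as above: the guarantee requires the sources to be genuinely independent.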
&lt;p&gt;This strategy has worked well so far.  The Arbitrand TRNG today produces almost 5 Gbps and passes
the most stringent randomness tests available.  The throughput number will likely go a lot higher
soon, and I am also investigating how to scale down the throughput to around 100 Mbps with a tiny
circuit.&lt;/p&gt;
&lt;h2 id=&#34;final-thoughts&#34;&gt;Final Thoughts&lt;/h2&gt;
&lt;p&gt;I hope you&amp;rsquo;re interested in hearing a lot more about randomness, because developing a TRNG has
been exciting for me so far, and I am going to be doing a lot more blogs along these lines while
I try to bring cheap true random numbers to everyone who needs them. I will be back to
micro-optimization soon, too.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Introduction to Micro-Optimization</title>
      <link>https://specbranch.com/posts/intro-to-micro-optimization/</link>
      <pubDate>Sun, 11 Sep 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/intro-to-micro-optimization/</guid>
      <description>A modern CPU is an incredible machine. It can execute many instructions at the same time, it can re-order instructions to ensure that memory accesses and dependency chains don&amp;rsquo;t impact performance too much, it contains hundreds of registers, and it has huge areas of silicon devoted to predicting which branches your code will take. However, if you have a tight loop and you are interested in optimizing the hell out of it, the same mechanisms that make your code run fast can make your job very difficult.</description>
      <content:encoded>&lt;p&gt;A modern CPU is an incredible machine.  It can execute many instructions at the same time, it can
re-order instructions to ensure that memory accesses and dependency chains don&amp;rsquo;t impact performance
too much, it contains hundreds of registers, and it has huge areas of silicon devoted to predicting
which branches your code will take.  However, if you have a tight loop and you are interested in
optimizing the hell out of it, the same mechanisms that make your code run fast can make your job
very difficult.  They add a lot of complexity that can make it hard to figure out how to optimize a
function, and they can also create local optima that trap you into a less efficient solution.&lt;/p&gt;
&lt;p&gt;There are going to be very few times in your career when you actually have to pull the last few
drops of throughput out of a function.  However, there are going to be many more times when you
have a performance-sensitive function to implement.  The only difference between producing a
heavily-optimized function and a good function for a performance-sensitive environment is one of
degree: the same principles apply to both.  If we can understand how to get peak performance out of
a computer, it is a lot easier to write code that merely has good performance.&lt;/p&gt;
&lt;p&gt;Over the last several years of my career, I have learned a lot about how to work with this machine
to pull out the last drops of performance that it can offer, and applied these techniques on
systems that require nanosecond-level latencies and terabit per second throughput.  This guide is
an attempt to record and systematize the thought process I have used for micro-optimization, and
the insights it allows you to apply to everyday computer performance.&lt;/p&gt;
&lt;h2 id=&#34;what-is-micro-optimization&#34;&gt;What is Micro-Optimization?&lt;/h2&gt;
&lt;p&gt;There are two definitions I have run across for micro-optimization, and they are very different.
The first and most common definition is &amp;ldquo;optimization at the assembly level.&amp;rdquo;  This is the
definition I use.  The other definition is &amp;ldquo;optimization underneath the level of a function.&amp;rdquo;  I
usually just call this &amp;ldquo;optimization.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Micro-optimization is the last step of any performance optimization process.  After you
micro-optimize a piece of code, there is &lt;em&gt;no&lt;/em&gt; performance left to gain from further optimizations,
except re-specializing your code for a new architecture.  By design, micro-optimized code tends to
overfit to specific aspects of the environment in which your code runs.  In some cases, that can be
as specific as a single CPU model, but it is usually a CPU family, or a class of CPUs with a given
characteristic (&lt;em&gt;eg&lt;/em&gt; CPUs with the AVX-512 instruction set).&lt;/p&gt;
&lt;p&gt;Micro-optimization applies specifically to code, but it does not apply specifically to &lt;em&gt;CPUs&lt;/em&gt;.  You
can also micro-optimize GPU code, although the tools available to do so are not as good.  Most of
what I am about to put in this guide applies to CPUs, but GPUs follow the same principles: with an
understanding of the architecture and enough visibility into how it executes your code, you can
make it sing.&lt;/p&gt;
&lt;h4 id=&#34;micro-optimization-and-assembly-language&#34;&gt;Micro-Optimization and Assembly Language&lt;/h4&gt;
&lt;p&gt;By definition, micro-optimization involves interacting with assembly language, but today it rarely
involves writing assembly.  You do not need to know how to write assembly to micro-optimize code:
it can be done in any compiled language.  However, you do need to know how to read assembly, you
need to know your language well, and you need to know your compiler reasonably well.  You can even
micro-optimize Java or Javascript code if you know the interpreter well, too.&lt;/p&gt;
&lt;p&gt;Also, if you are working on a team with other engineers, most of them would prefer that you avoid
writing too much assembly.  Assembly language is very hard to read, and it is very hard to edit, so
if you can keep micro-optimized functions in higher-level languages, such as C++, Rust, or Go, your
teammates will thank you.&lt;/p&gt;
&lt;h2 id=&#34;when-to-micro-optimize&#34;&gt;When to Micro-Optimize&lt;/h2&gt;
&lt;p&gt;Whenever you talk about optimization, people pull out the quote &amp;ldquo;premature optimization is the root
of all evil.&amp;rdquo;  That is probably correct, but optimization becomes worthwhile a lot earlier than
many people think.  When you are planning an optimization project, it is important to make sure
that you can justify the cost, based on the gain that you expect.  Improving the speed of a
program or a server can
&lt;a href=&#34;https://specbranch.com/posts/performance-dimensions/&#34;&gt;mean many things and help in many ways&lt;/a&gt;, so
if you have a good reason to believe that you should optimize, quantitative arguments often work
to show that speed is a feature.&lt;/p&gt;
&lt;p&gt;Also, look out for times when optimization can save you engineering effort.  If you are
thinking about horizontal scaling, but you know that optimizing an inner loop can allow you
to serve several times more users, you should probably do it before figuring out how to scale
horizontally.&lt;/p&gt;
&lt;p&gt;Finally, and most importantly, look at a CPU profile of your application before embarking on any
optimization effort. You need to know that the thing you want to optimize is actually taking CPU
time, and your measurements can give you a lot of hints as to what can improve.&lt;/p&gt;
&lt;h2 id=&#34;how-hardware-executes-instructions&#34;&gt;How Hardware Executes Instructions&lt;/h2&gt;
&lt;p&gt;Almost every CPU, from the smallest microcontrollers to the largest server CPUs, has a pipelined
architecture.  This means that in each of these CPUs, multiple instructions are in flight at any
single time.  Pipelining allows a CPU to raise its clock frequency without spending multiple clock
cycles per instruction.  Larger CPUs tend to have deeper pipelines: microcontroller cores usually
have 2-5 cycle pipelines, application processors tend to have pipelines around 10 cycles long, and
server CPU pipelines are over 15 cycles long.&lt;/p&gt;
&lt;p&gt;Additionally, CPUs larger than microcontroller cores will go a step further, and have logic to
execute multiple instructions at the same time.  Mobile processors tend to be able to execute
2-4 instructions per cycle, while desktop and server cores can execute up to 8 instructions in a
single clock cycle (as of 2022).&lt;/p&gt;
&lt;p&gt;The largest cores contain logic to re-order instructions.  This allows you to issue instructions
whose arguments are not yet available without blocking the CPU.  This is useful when you have
instructions with variable latency, such as memory loads with a cache hierarchy or instructions
like &lt;code&gt;DIV&lt;/code&gt; which execute a basic operation that has no fast hardware implementation.&lt;/p&gt;
&lt;p&gt;A modern server CPU has a high-throughput pipeline, capable of working on hundreds of instructions
at the same time, and churning through 4 or more per clock cycle.  Because of the pipelined and
out-of-order processing of instructions, there are a lot of tricks used to hide the true processing
time of instructions from the programmer and make sure that you see a reasonable computing
abstraction.  Register renaming, complex instruction schedulers, branch predictors, and pipeline
bypassing multiplexers are all parts of this.&lt;/p&gt;
&lt;p&gt;Here is an AMD Zen 2 core, from WikiChip, showing off all of its major units:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://en.wikichip.org/w/images/f/f2/zen_2_core_diagram.svg#center&#34; width=&#34;100%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;In addition, due to the latency of memory, CPUs have complex cache hierarchies to ensure that most
memory reads are completed quickly, while providing an &amp;ldquo;infinite memory&amp;rdquo; abstraction to programs
with virtual memory. This system of caches and virtualization accelerators (called Translation
Lookaside Buffers) is involved in memory reads and writes, and can introduce latency to both.&lt;/p&gt;
&lt;h4 id=&#34;parts-of-a-cpu-core&#34;&gt;Parts of a CPU Core&lt;/h4&gt;
&lt;p&gt;Broadly speaking, CPU cores are split into three parts: the frontend, the backend, and the memory
subsystem.  The frontend is responsible for fetching and decoding instructions that need to be
executed, and the backend does the execution.  The memory subsystem handles interactions with the
CPU&amp;rsquo;s bus matrix, facilitating access to I/O controllers, chip-wide caches, and main memory.  Each
component interacts with the others at well-defined interfaces: the frontend passes CPU-specific
commands called micro-ops to the backend, and both the frontend and backend issue memory accesses
to the memory subsystem.&lt;/p&gt;
&lt;p&gt;When you are micro-optimizing, most of your energy will likely be focused on the backend.  The
frontend tends to establish limits on how fast you can go, and memory access problems have
likely been fixed long before you start thinking about cycle-by-cycle performance.  Nevertheless,
functional units inside each part of the core can be the limiting factor.&lt;/p&gt;
&lt;p&gt;For one final note about the parts of the CPU core: Most CPUs of a similar performance class have
a similar architecture.  An x86 server CPU will have more in common with an ARM (or even RISC-V)
server CPU than it does with an embedded x86 CPU.  Everything I will discuss below applies to
desktop and server CPUs rather than embedded CPUs.  Embedded CPUs are a lot easier to optimize for,
since they contain many fewer parts.&lt;/p&gt;
&lt;h4 id=&#34;mental-models-of-hardware-execution&#34;&gt;Mental Models of Hardware Execution&lt;/h4&gt;
&lt;p&gt;Having a good mental model of how the CPU works helps you manage the complexity of a modern
CPU.  In subsequent parts of this guide, we will be thinking about
the different parts of the CPU core, and constructing a simple, useful mental model for its
behavior.  In particular, we would like the parts of the CPU to be modeled as reasonably tame
directed graphs.  Thankfully, CPU manufacturers offer some help here: they make instruction
retirement so wide that it is almost never the bottleneck.&lt;/p&gt;
&lt;p&gt;This means that your mental model can usually consider only the forward flow of instructions
through the pipeline, without worrying about feedback loops causing delay or odd performance
pathologies.  The one exception to this is branches, although the branch predictors on modern CPUs
are very efficient at mitigating the performance impact of the dependency cycles that can come
from branches.&lt;/p&gt;
&lt;p&gt;In the subsequent parts of this guide, we are going to go in depth into the parts of a CPU to build
mental models and intuition for how those components work.&lt;/p&gt;
&lt;h5 id=&#34;the-frontend&#34;&gt;The Frontend&lt;/h5&gt;
&lt;p&gt;Here is my overview of what happens in the frontend and how it attaches to the backend:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/uOpt/frontend.png#center&#34; width=&#34;60%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;The frontend begins with the branch predictor, which determines the program counter value from
which to fetch instructions.  This is by far the largest piece of silicon inside the core,
accounting for 25% or more of it.  If you have a truly unpredictable branch, you will need to
consider the impact of branch mispredictions: each misprediction requires the pipeline to drain
before you can do useful work, costing around 15 cycles on a server CPU (about 60-100 instruction
slots).  Removing these unpredictable branches can improve code performance significantly.  In most
micro-optimization cases, however, you are working with a tight loop that runs many iterations
(often a fixed number of iterations), and the branch predictor&amp;rsquo;s behavior is nearly perfect.&lt;/p&gt;
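A back-of-envelope model of that cost (the numbers are illustrative round figures, matching the ~15-cycle drain quoted above):

```python
def effective_ipc(base_ipc=4.0, drain_cycles=15,
                  mispredicts_per_1k_insns=10.0):
    """Estimate instructions per cycle after branch-misprediction
    stalls: every misprediction adds a full pipeline drain on top of
    the base execution time.  All parameters are illustrative."""
    insns = 1000.0
    cycles = insns / base_ipc + mispredicts_per_1k_insns * drain_cycles
    return insns / cycles
```

With these numbers, just 10 mispredictions per 1,000 instructions drag a core sustaining 4 IPC down to an effective 2.5 IPC, which is why removing unpredictable branches pays off so well.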
&lt;p&gt;The branch predictor emits an address to read from one of two caches: either the instruction cache
or the micro-op cache.  The micro-op cache, used for loops, contains a small number of decoded
micro-ops that are ready to issue to the backend.  If the address to fetch is not in the micro-op
cache, ops are fetched from the instruction cache (or from further down the cache hierarchy if
needed), and then decoded.  At this stage, multiple instructions may also be fused together into
one micro-op.  Once the instruction is decoded into micro-ops, it is issued to the backend for
execution, and eventually retired.&lt;/p&gt;
&lt;p&gt;Re-ordering can happen at many points inside the frontend, and we will go into more depth
about when this happens, but we can consolidate the re-ordering step into the final issue step
in our mental model for now, while assuming that the front-end does what it needs to do to keep
itself and the backend saturated with usable instructions.  This is usually almost true in
practice.&lt;/p&gt;
&lt;h5 id=&#34;the-backend&#34;&gt;The Backend&lt;/h5&gt;
&lt;p&gt;An example of a CPU backend, modeled after an Intel Skylake, is below:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/uOpt/exec-intel.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;The backend has a comparatively simple job: it contains many execution units and uses them to
execute micro-ops that it receives from the frontend.  It uses several tricks baked into the
renaming and scheduling steps to keep instructions flowing and keep the individual execution units
full.&lt;/p&gt;
&lt;p&gt;During the rename step, the backend re-names registers in the assembly code to make sure that there
are no false dependencies.  CPUs with 16 or 32 registers specified by their architecture often have
over 150 physical registers, and will rename the architectural registers each time they are
reassigned.  The scheduler issues instructions to ports when their operands are ready.  This way,
long dependency chains of instructions can be kept out of the way of other instructions that do not
depend on those results, and the units can be kept full.&lt;/p&gt;
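A toy sketch of the renaming idea (the instruction format here is hypothetical, and real renamers also recycle physical registers through a free list at retirement):

```python
def rename(instructions, num_arch_regs=16):
    """Map architectural registers to fresh physical registers.

    `instructions` is a list of (dest, [sources]) pairs using names
    like "r0".."r15".  Each write gets a brand-new physical register,
    so re-using an architectural register creates no false
    (write-after-write or write-after-read) dependency.
    """
    mapping = {"r%d" % i: "p%d" % i for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    renamed = []
    for dest, sources in instructions:
        phys_sources = [mapping[s] for s in sources]  # read current names
        mapping[dest] = "p%d" % next_phys             # allocate a fresh one
        next_phys += 1
        renamed.append((mapping[dest], phys_sources))
    return renamed
```

Two back-to-back writes to the same architectural register land in different physical registers, so the second write can execute without waiting for readers of the first.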
&lt;p&gt;The execution units themselves are a mixture of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Arithmetic/Logic Units (ALUs) for integer work, some of which can also execute branches&lt;/li&gt;
&lt;li&gt;ALUs for vector work&lt;/li&gt;
&lt;li&gt;Units for loading data from memory and storing data to memory&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each unit has specific micro-ops that it can accept, and each micro-op has a specific latency
and throughput.  For example, micro-ops for addition usually have a 1-cycle latency, while
multiplication micro-ops have a 3-cycle latency.  Due to pipelining, both micro-ops still have a
throughput of 1 per cycle, but some other micro-ops, such as the ones for memory
barriers, have worse throughput.  Many instructions can be satisfied by several different units,
but some are restricted to one unit: for example, additions can generally be satisfied by any ALU,
but instructions like CRC32C and even multiplications are more restricted.&lt;/p&gt;
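The latency/throughput distinction is worth a tiny model, using the illustrative multiply numbers from the paragraph above:

```python
def dependent_chain_cycles(n_ops, latency):
    """n ops where each consumes the previous result: pipelining
    cannot help, so the chain takes n * latency cycles."""
    return n_ops * latency

def independent_stream_cycles(n_ops, throughput=1):
    """n independent ops on one fully pipelined unit: a new op can
    start every cycle, so roughly n / throughput cycles (ignoring
    the final op's latency)."""
    return n_ops / throughput
```

100 dependent multiplies cost about 300 cycles at 3 cycles of latency each, while 100 independent multiplies cost only about 100 cycles, which is why breaking up dependency chains is one of the highest-leverage micro-optimizations.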
&lt;p&gt;After execution, the results from each pipeline are written back to registers or forwarded to
other execution units for future use.  Finally, the instruction is retired, advancing the state of
the CPU.  Due to the complexity of out-of-order operation, CPUs have a limited window of
instructions that can actually be executed out of order, and the retire stage is the final point
of serialization.  Pretty much everything that happens between instruction fetch and retirement
happens out of order.&lt;/p&gt;
&lt;p&gt;CPU backends are one of the areas with the most variation: the exact units used and their
organization vary greatly between vendors.  While the frontend of one core may simply be wider or
narrower in certain areas than another&amp;rsquo;s, their backends may have very different compositions.&lt;/p&gt;
&lt;h5 id=&#34;the-memory-subsystem&#34;&gt;The Memory Subsystem&lt;/h5&gt;
&lt;p&gt;Here is a model for the units involved in how a CPU interacts with memory:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/uOpt/memory.png#center&#34; width=&#34;65%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;The memory subsystem is simultaneously the part of the core that is the most arcane and the most
well-known to software engineers.  The caches are the most important and well-known parts of the
memory system.  However, surrounding the caches, there are many buffers that handle out-of-order
access to memory, allow writes and reads of partial cache lines to retire without fetching the full
cache line, and allow the CPU to re-order memory accesses.  Additionally, there are several
translation lookaside buffers (TLBs) that are used to accelerate virtual memory accesses.  Each of
these buffers, caches, and TLBs accelerates a particular aspect of memory access, and each can
be a bottleneck.&lt;/p&gt;
&lt;p&gt;Finally, the diagram above addresses only the parts of the memory hierarchy that are attached to
individual cores (or small groups of cores in some architectures).  The remaining memory hierarchy
contains many more complex features, including a bus matrix that can be locked to implement atomic
operations; inter-socket cache coherency systems; and large, heterogeneous system-wide caches.
These can all affect the performance of individual cores, but tend to fall outside the scope of
&amp;ldquo;micro-optimization.&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;micro-optimization-tools&#34;&gt;Micro-Optimization Tools&lt;/h2&gt;
&lt;p&gt;Starting out, you need to thoroughly measure the code that you are trying to optimize.  There are
many tools that can be used for this purpose depending on what you are trying to measure.
However, some of the most useful ones for micro-optimization are purpose-built and specific.&lt;/p&gt;
&lt;h4 id=&#34;performance-counters&#34;&gt;Performance Counters&lt;/h4&gt;
&lt;p&gt;CPUs contain many performance counters that you can use to figure out what is happening inside a
given core.  These are vendor-specific and architecture-specific, but they often provide enough
visibility to get a clue as to what functional units are bottlenecking a particular piece of code.&lt;/p&gt;
&lt;p&gt;The usual tool for accessing performance counters is Linux &lt;code&gt;perf&lt;/code&gt;, but vendor-specific tools, such
as Intel&amp;rsquo;s &lt;code&gt;vtune&lt;/code&gt;, can also be used for this purpose.  GPUs often also have performance counters
available, usually through vendor-specific tools.&lt;/p&gt;
&lt;p&gt;Measurements of performance counters are usually your first clue as to why you need to
micro-optimize a particular section of code, and they are used throughout the process to measure
your progress.&lt;/p&gt;
&lt;h4 id=&#34;benchmarks&#34;&gt;Benchmarks&lt;/h4&gt;
&lt;p&gt;Micro-optimization makes heavy use of benchmarks, but benchmarks tend to have long iteration
cycles and often don&amp;rsquo;t offer precise information about the nature of the code you are running.
This means that they are not always the first choice of tool for micro-optimization.
To get visibility into the CPU performance counters that tell you about its behavior,
specialized microbenchmarking frameworks, such as Agner Fog&amp;rsquo;s
&lt;a href=&#34;https://www.agner.org/optimize/#testp&#34;&gt;&lt;code&gt;TestP&lt;/code&gt;&lt;/a&gt;, are very useful.&lt;/p&gt;
&lt;p&gt;One cautionary note on benchmarks: they often over-estimate how fast your code runs, since they
allow you to measure behavior when a piece of code has a monopoly on the resources of a machine.
This has the potential to distort your results, which becomes particularly noticeable when caching
effects are involved.&lt;/p&gt;
&lt;h4 id=&#34;microarchitectural-code-analyzers&#34;&gt;Microarchitectural Code Analyzers&lt;/h4&gt;
&lt;p&gt;Code analyzers such as Intel&amp;rsquo;s (sadly deprecated) architecture code analyzer (IACA), as well as
similar tools from &lt;a href=&#34;https://llvm.org/docs/CommandGuide/llvm-mca.html&#34;&gt;LLVM&lt;/a&gt; and
&lt;a href=&#34;https://uica.uops.info/&#34;&gt;uops.info&lt;/a&gt;, are among the most useful tools for micro-optimization.
These tools provide you the most precise observability into your code&amp;rsquo;s execution by simulating
how it runs on a core.  They use detailed knowledge of the microarchitecture of a CPU to determine
how a CPU should execute your code, and show you when and how instructions are scheduled and
executed.&lt;/p&gt;
&lt;p&gt;Code analyzers are usually moderately wrong about the performance of your code - more so than
benchmarks - but they are very good at pinpointing the factors that bottleneck execution.  In some
cases, a simplifying assumption in the analyzer can cause major differences between the analyzer&amp;rsquo;s
predicted performance and your actual performance.  However, in exchange for being slightly wrong,
they provide unparalleled visibility into the likely problems with your code.&lt;/p&gt;
&lt;p&gt;Microarchitectural analyzers are best used in tandem with a benchmark: the benchmark will allow you
to test hypotheses about your code&amp;rsquo;s performance and improvements you make.&lt;/p&gt;
&lt;p&gt;These analyzers give you nearly complete visibility into how your code runs, and have the fastest
iteration cycle of all of the tools listed here.  For this reason, I spend the most time using
code analyzers when micro-optimizing, but keep benchmarks ready to test the accuracy of the
analyzer&amp;rsquo;s predictions.&lt;/p&gt;
&lt;h2 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Micro-optimization is the application of computer architecture to software performance.  By
thinking about the architecture of the CPU and modeling how it runs our code, we can understand
how to make the hardware sing, and pull the last few drops of speed out of the system.&lt;/p&gt;
&lt;p&gt;Finally, by going into our code&amp;rsquo;s performance to this level of depth, and thinking about how to
make code run extremely fast, we can gain insight into how to make ordinary code run well,
and how to understand system performance more generally.&lt;/p&gt;
&lt;p&gt;I am planning to go into much more depth on each part of the core and the tools and processes to
optimize them in later parts, starting with the backend in part 2.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Rest in Peace, Optane</title>
      <link>https://specbranch.com/posts/rip-optane/</link>
      <pubDate>Fri, 12 Aug 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/rip-optane/</guid>
      <description>Intel&amp;rsquo;s Optane memory modules launched with a lot of fanfare in 2015, and were recently discontinued, in 2022, with similar fanfare. It was a sad day for me, a lover of abstraction-breaking technologies, but it was foreseeable and understandable.
At the time of Optane&amp;rsquo;s launch, a lot of us were excited about the idea of having a new storage tier, sitting between DRAM and flash. It was announced as having DRAM endurance and speed with the persistence and size of flash.</description>
      <content:encoded>&lt;p&gt;Intel&amp;rsquo;s Optane memory modules launched with a lot of fanfare in 2015, and were recently
discontinued, in 2022, with similar fanfare.  It was a sad day for me, a lover of
abstraction-breaking technologies, but it was foreseeable and understandable.&lt;/p&gt;
&lt;p&gt;At the time of Optane&amp;rsquo;s launch, a lot of us were excited about the idea of having a new storage
tier, sitting between DRAM and flash.  It was announced as having DRAM endurance and speed with
the persistence and size of flash.  It was a futuristic memory technology, but the technology of
the future met the full force of Wright&amp;rsquo;s Law.&lt;/p&gt;
&lt;h2 id=&#34;wrights-law&#34;&gt;Wright&amp;rsquo;s Law&lt;/h2&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Each doubling of production volume corresponds to a 20% decrease in cost&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This is the simple statement of Wright&amp;rsquo;s Law.  It was originally developed by a man named Theodore
Wright in the 1930&amp;rsquo;s when looking at the production of airplane parts.  Since then, it has been
verified to hold in many different industries, although the cost reduction seems to vary between
10 and 25% depending on the industry.  Despite this variation, the history of
semiconductor technology has actually followed Wright&amp;rsquo;s Law more closely than Moore&amp;rsquo;s Law: the
density increases predicted by Moore&amp;rsquo;s Law tend to speed up and slow down with demand growth,
exactly as predicted by Wright&amp;rsquo;s Law.&lt;/p&gt;
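In formula form, the law says unit cost scales as a power of cumulative volume. A quick sketch, assuming the 20% learning rate stated above:

```python
import math

def wright_cost_multiplier(volume_ratio, learning_rate=0.20):
    """Wright's Law: each doubling of cumulative volume multiplies
    unit cost by (1 - learning_rate), so for an arbitrary growth in
    volume, cost scales as volume_ratio ** log2(1 - learning_rate)."""
    return volume_ratio ** math.log2(1.0 - learning_rate)
```

A doubling gives exactly the 20% reduction (a multiplier of 0.8), and a 10x growth in volume, roughly the DRAM production growth cited below, gives a multiplier of about 0.48: cost per bit roughly halves.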
&lt;p&gt;During the last 7 years, DRAM and Flash have both experienced massive increases in density and
production volume, and in turn, Wright&amp;rsquo;s Law has given us cheaper, faster, bigger devices.&lt;/p&gt;
&lt;h4 id=&#34;dram-production-2015-2022&#34;&gt;DRAM Production 2015-2022&lt;/h4&gt;
&lt;p&gt;The lifetime of DDR4 memory roughly matches the lifetime of Optane.  DDR4 was introduced a few
years before 2015, and became cost-effective compared to DDR3 around 2015-2016.  Today, DDR5 RAM
and the CPUs that can use it have just recently been introduced to the market.  Over this time,
RAM price per bit has roughly halved, and global RAM production has grown by a factor of about
10-15.  Meanwhile, due to competition between AMD and Intel, the number of memory channels per CPU
has also roughly doubled.  Thus, a server in 2022 holds about 4x as much RAM as a server in 2015,
and memory cost as a share of system cost has increased by a factor of 2.  Memory bandwidth has
increased by a factor of about 3 due to a doubling of memory channels and the channels getting
faster.&lt;/p&gt;
&lt;p&gt;This means that the capacity argument for Optane (&amp;ldquo;a system with Optane DIMMs can hold more
memory&amp;rdquo;) has diminished over time, while the performance gap between Optane and DDR memory has
grown.  A two-socket server populated with RDIMMs can now hold several TB of memory, so the value
proposition of Optane for memory capacity has been eliminated.&lt;/p&gt;
&lt;h4 id=&#34;flash-is-a-spectrum&#34;&gt;Flash is a Spectrum&lt;/h4&gt;
&lt;p&gt;Flash memory comes in four flavors, based on the number of bits stored in each cell.  Single-level
cell (SLC) flash stores one bit per cell.  It has the lowest density, but the longest endurance and
highest speeds.  MLC (multi-level cell) holds two bits per cell, and is worse on speed and
endurance than SLC, but better than TLC (triple-level cell).  Most SSDs today use TLC flash: it
offers a good balance between price and endurance, and most workloads are read-heavy, so there
isn&amp;rsquo;t a lot of stress on the memory arrays.  QLC (quad-level cell) has recently been introduced to
the market as a capacity SSD technology: QLC has even lower endurance and speed than TLC, but
increases the capacity a lot.  PLC (penta-level) is on the horizon, promising to continue the
trend.&lt;/p&gt;
&lt;p&gt;On the back of this distinction, we can see a natural tiering of flash: using TLC or QLC for
capacity (in place of disk for all but the largest datasets), and using SLC for caching and
write-heavy workloads.  This means that SLC flash is a direct competitor to Optane, but it
has the advantages of being much cheaper per bit without being a lot slower.&lt;/p&gt;
&lt;h4 id=&#34;flash-production-2015-2022&#34;&gt;Flash Production 2015-2022&lt;/h4&gt;
&lt;p&gt;After Optane&amp;rsquo;s launch, Flash memory technology advanced quickly.  In 2015, NVMe drives had been out
for about a year or two.  NVMe drives and the controller chips that ran them were still working out
the kinks.  Understandably, Intel also created NVMe drives with Optane, offering much better read
and write latencies than NVMe drives with flash.  Not only did Optane offer faster access than
flash, but it had the advantage of being a simpler, more reliable form of memory that didn&amp;rsquo;t need
complicated controller algorithms.&lt;/p&gt;
&lt;p&gt;As time went on, companies quickly improved the speed, reliability, and density of each flash
technology.  In 2015, TLC was roughly in the position QLC occupies today: mostly for read-only and
read-heavy workloads.  Now, TLC is the workhorse flash technology used for general-purpose drives,
as its speed and endurance have improved.  Similarly, SSD controller chips have also improved over
time, allowing them to work faster and make better use of the flash chips on an SSD.  Lots of money
goes into R&amp;amp;D work for flash, and it shows.&lt;/p&gt;
&lt;p&gt;In the last 7 years, flash has grown to be the dominant storage technology: going from almost 20x
more expensive per bit than hard drives to just 5x more expensive.  In addition, SSD latency has dropped by
a factor of 2 due to controller chips getting more powerful and algorithms getting better.  Flash
bandwidth has increased to the point where an SSD can saturate a PCIe gen 4 x4 link.  A modern
SLC SSD can be found with read latency under 30 microseconds, and even TLC SSDs can get close to
100 microsecond latencies.&lt;/p&gt;
&lt;p&gt;Flash has advanced so much over the last 7 years that if the trend continues, I may be eulogizing
the magnetic hard drive in 2030.  Flash is eating the storage world for good reason: it can offer
high speeds and high capacity, and while it had some rough edges, we have learned how to work with
it very well.&lt;/p&gt;
&lt;h4 id=&#34;optane-and-wrights-law&#34;&gt;Optane and Wright&amp;rsquo;s Law&lt;/h4&gt;
&lt;p&gt;Clearly, as time has advanced, Optane&amp;rsquo;s position in the memory hierarchy has been getting crushed
between expanding DDR4 DRAM and flash that is getting faster and faster.  Both of these
technologies have had much more innovation and R&amp;amp;D over the last 7 years than Optane, and it shows.
In retrospect, Intel was fighting an uphill battle if they wanted to treat Optane simply as a point
on the memory hierarchy.  To make the project successful, they needed it to be a unique value add.&lt;/p&gt;
&lt;h2 id=&#34;the-value-proposition-of-optane&#34;&gt;The Value Proposition of Optane&lt;/h2&gt;
&lt;p&gt;After the dust settled, Optane modules only ended up having a small price difference with DRAM
modules of the same size, but were almost as slow as SLC flash, and only had 3-5x the endurance of
flash memory.  The NVMe drives were faster, but not that much faster, than SLC flash.  The only
value-add that was left was the abstraction difference with normal memory: Optane offered
&lt;strong&gt;persistence on the memory bus&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;If you have worked on a database or a storage system, you will know how valuable this can be: lots
of code in high-throughput storage systems is there to make sure that transactions persist as
quickly as possible.  Conceivably, if the normal memory writes involved in that transaction can
make it persistent, you can delete a ton of code.  You could even conceivably run a small database
on an &lt;code&gt;mmap&lt;/code&gt;-ed Optane file, and persistence would be given to you for free. Oh wait&amp;hellip; a lot of
databases &lt;code&gt;mmap&lt;/code&gt; their backing files already.&lt;/p&gt;
&lt;h4 id=&#34;persistence-on-the-memory-abstraction&#34;&gt;Persistence on the Memory Abstraction&lt;/h4&gt;
&lt;p&gt;It turns out that what developers mostly would have wanted from Optane they got from BSD and Linux
in the form of &lt;code&gt;mmap&lt;/code&gt;.  &lt;code&gt;mmap&lt;/code&gt;-ing a file allows you to treat the contents of the file as though
they are in memory, and allows the filesystem and a cache to handle the rest.  The parts of the
file you have accessed recently are in memory, while other parts of the file are fetched from disk
when you demand them - your access triggers a page fault, which in turn triggers a read from the
filesystem.  Writes to the file are handled in the background. This is not the fastest way to
access files, but it is a very convenient one, and it works well enough for many high-performance
key-value databases like LevelDB, LMDB, SQLite, QuestDB, RavenDB, and more.&lt;/p&gt;
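&lt;p&gt;As a small sketch of the idea (the file path and sizes here are purely illustrative), Python&amp;rsquo;s
&lt;code&gt;mmap&lt;/code&gt; module shows the abstraction: slice assignments are plain memory writes, and a
flush (&lt;code&gt;msync&lt;/code&gt; underneath) forces dirty pages back to the file:&lt;/p&gt;

```python
import mmap
import os
import tempfile

# Create a small file to back the mapping (illustrative path).
path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 4096)
    mm[0:5] = b"hello"   # a plain memory write; paging is the OS's job
    mm.flush()           # msync: push the dirty page back to durable storage
    mm.close()

with open(path, "rb") as f:
    assert f.read(5) == b"hello"   # the write persisted through the file
```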
&lt;p&gt;&lt;code&gt;mmap&lt;/code&gt; had one more secret weapon: the filesystem.  If the media backing the filesystem wore out,
the filesystem (and the layers underneath it) could detect and correct issues.  Optane had no
such protection.&lt;/p&gt;
&lt;p&gt;When Intel thought they were enabling a whole new class of high-performance databases by offering
persistence on the memory bus, they were really offering a performance boost compared to persistent
databases using &lt;code&gt;mmap&lt;/code&gt;.  It might not even have been a performance boost: Optane slowed down your
median read and write operations in exchange for avoiding those page faults, so if your working set
was small or mostly fit inside RAM, &lt;code&gt;mmap&lt;/code&gt; and NVMe flash was actually faster.&lt;/p&gt;
&lt;h4 id=&#34;persistent-memory-and-caches&#34;&gt;Persistent Memory and Caches&lt;/h4&gt;
&lt;p&gt;It turns out that offering persistence on the memory bus runs into abstraction problems with the
current memory controllers on CPUs.  The memory controllers and cache hierarchies are designed
assuming that memory is dumb and volatile, and they can indefinitely delay writes to save
bandwidth.&lt;/p&gt;
&lt;p&gt;The initial solution that Intel came up with was an instruction, &lt;code&gt;CLFLUSH&lt;/code&gt;, that flushed a cache
line to memory.  However, &lt;code&gt;CLFLUSH&lt;/code&gt; was a serializing instruction, like &lt;code&gt;CPUID&lt;/code&gt;, so it ended up
flushing the CPU pipeline as well as the cache hierarchy when it was issued.  Worse, flushing the
cache line would invalidate it, so if you wanted to read the value back after writing it, you would
incur a cache miss. It was later supplemented by &lt;code&gt;CLFLUSHOPT&lt;/code&gt; and &lt;code&gt;CLWB&lt;/code&gt; which also could be used
to flush and write back cache lines without incurring the same performance penalty.&lt;/p&gt;
&lt;p&gt;However, when Optane memory started to come out, &lt;code&gt;CLFLUSH&lt;/code&gt; was the only instruction from this suite
that was available, meaning that initial performance tests suffered from both the speed difference
between Optane and DRAM and the substantial overhead of the &lt;code&gt;CLFLUSH&lt;/code&gt; instruction.  Intel would
probably have had a much easier time selling Optane if their core architecture were ready.&lt;/p&gt;
&lt;h4 id=&#34;alternative-methods-of-persistence&#34;&gt;Alternative Methods of Persistence&lt;/h4&gt;
&lt;p&gt;Also between 2015 and 2022, several companies were offering a different kind of persistent DIMM.
Instead of using a special type of memory, this kind of persistent DIMM used normal DRAM, but added
some flash memory and a small controller circuit on the back that saved the contents of the DRAM
to flash every time the power started to dip.&lt;/p&gt;
&lt;p&gt;To make sure that the contents of the DIMM were safe, these persistent DIMMs had a supercapacitor
or small lithium ion battery (usually kept in a 2.5 inch drive bay and connected to the DIMM
through a cable) that kept the DIMM powered while the rest of the system was going down.  On
power-up, the DIMM would restore the memory contents.&lt;/p&gt;
&lt;p&gt;These were later specified by JEDEC as &amp;ldquo;NVDIMM-N&amp;rdquo; modules (standing for Non-volatile Dual Inline
Memory Module - NAND flash type).&lt;/p&gt;
&lt;p&gt;Still, these alternatives have some problems: they are a lot thicker than normal memory DIMMs, and
they can&amp;rsquo;t offer the capacity that Optane modules could offer.  However, they don&amp;rsquo;t have the same
durability problems that Optane and flash do, since the flash is rarely written, and they operate
at the same speed as the rest of the system&amp;rsquo;s memory.&lt;/p&gt;
&lt;h2 id=&#34;thank-you-optane&#34;&gt;Thank You, Optane&lt;/h2&gt;
&lt;p&gt;So many technologies have become successful around Optane and the promises it held.  Unfortunately,
Optane was not one of them.&lt;/p&gt;
&lt;p&gt;SSDs using SLC flash offer blazing fast performance with a block abstraction, and enterprises and
database developers learned how to take advantage of the differences between SSDs that were small
and fast, using SLC flash, and SSDs that were slower and larger, with TLC and QLC flash.  Some SSD
manufacturers also saw the idea of a caching drive and competed by adding SLC caching to their TLC
SSDs.  Most high-end consumer SSDs today do this for you.&lt;/p&gt;
&lt;p&gt;For the few customers who needed persistence on the memory bus, the JEDEC NVDIMM standards emerged,
with the flagship NVDIMM-N modules allowing you to have a DRAM module that was both persistent
&lt;em&gt;and&lt;/em&gt; fast, and an additional standard to cover future persistent memory technologies.  Intel&amp;rsquo;s
new instructions allow users to take advantage of these new modules, adding a fundamental
capability to CPUs.&lt;/p&gt;
&lt;p&gt;Optane has helped us learn to build computing systems that take advantage of the spectrum of
&amp;ldquo;legacy&amp;rdquo; storage and memory technologies.  I, for one, am sad to see it go, but happy that I won&amp;rsquo;t
miss it.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Use One Big Server</title>
      <link>https://specbranch.com/posts/one-big-server/</link>
      <pubDate>Wed, 27 Jul 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/one-big-server/</guid>
      <description>A lot of ink is spent on the &amp;ldquo;monoliths vs. microservices&amp;rdquo; debate, but the real issue behind this debate is about whether distributed system architecture is worth the developer time and cost overheads. By thinking about the real operational considerations of our systems, we can get some insight into whether we actually need distributed systems for most things.
We have all gotten so familiar with virtualization and abstractions between our software and the servers that run it.</description>
      <content:encoded>&lt;p&gt;A lot of ink is spent on the &amp;ldquo;monoliths vs. microservices&amp;rdquo; debate, but the real issue behind
this debate is about whether distributed system architecture is worth the developer time and
cost overheads.  By thinking about the real operational considerations of our systems, we can
get some insight into whether we actually need distributed systems for most things.&lt;/p&gt;
&lt;p&gt;We have all gotten very familiar with virtualization and abstractions between our software
and the servers that run it.  These days, &amp;ldquo;serverless&amp;rdquo; computing is all the rage, and even
&amp;ldquo;bare metal&amp;rdquo; is a class of virtual machine.  However, every piece of software runs on a
server.  Since we now live in a world of virtualization, most of these servers are a lot
bigger and a lot cheaper than we actually think.&lt;/p&gt;
&lt;h2 id=&#34;meet-your-server&#34;&gt;Meet Your Server&lt;/h2&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://www.servethehome.com/wp-content/uploads/2021/03/Microsoft-Azure-HPC-HBv3-Hosting-Node-2.jpg#center&#34; width=&#34;70%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;This is a picture of a server used by Microsoft Azure with AMD CPUs.  Starting from the left,
the big metal fixture on the left (with the copper tubes) is a heatsink, and the metal boxes
that the copper tubes are attached to are heat exchangers on each CPU.  The CPUs are AMD&amp;rsquo;s
third generation server CPU, each of which has the following specifications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;64 cores&lt;/li&gt;
&lt;li&gt;128 threads&lt;/li&gt;
&lt;li&gt;~2-2.5 GHz clock&lt;/li&gt;
&lt;li&gt;Cores capable of 4-6 instructions per clock cycle&lt;/li&gt;
&lt;li&gt;256 MB of L3 cache&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In total, this server has 128 cores with 256 simultaneous threads.  With all of the cores working
together, this server is capable of 4 TFLOPS of peak double-precision computing performance.  This
server would have sat at the top of the TOP500 supercomputer list in early 2000, and it would have
taken until 2007 to fall off the list.  Each CPU core is substantially more powerful than a
single core from 10 years ago, and boasts a much wider computation pipeline.&lt;/p&gt;
&lt;p&gt;Above and below each CPU is the memory: 16 slots of DDR4-3200 RAM per socket.  The largest
capacity &amp;ldquo;cost effective&amp;rdquo; DIMMs today are 64 GB.  Populated cost-efficiently, this server can hold
&lt;strong&gt;1 TB&lt;/strong&gt; of memory.  Populated with specialized high-capacity DIMMs (which are generally slower
than the smaller DIMMs), this server supports up to &lt;strong&gt;8 TB&lt;/strong&gt; of memory total.  At DDR4-3200, with
a total of 16 memory channels, this server will likely see ~200 GB/s of memory throughput across
all of its cores.&lt;/p&gt;
&lt;p&gt;In terms of I/O, each CPU offers 64 PCIe gen 4 lanes.  With 128 PCIe lanes total, this server is
capable of supporting 30 NVMe SSDs plus a network card.  Typical configurations you can buy will
offer slots for around 16 SSDs or disks. The final thing I wanted to point out in this picture is
in the top right, the network card.  This server is likely equipped with a 50-100 Gbps network
connection.&lt;/p&gt;
&lt;h4 id=&#34;the-capabilities-of-one-server&#34;&gt;The Capabilities of One Server&lt;/h4&gt;
&lt;p&gt;One server today is capable of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://people.freebsd.org/~gallatin/talks/euro2021.pdf&#34;&gt;Serving video files at 400 Gbps&lt;/a&gt; (now &lt;a href=&#34;http://nabstreamingsummit.com/wp-content/uploads/2022/05/2022-Streaming-Summit-Netflix.pdf&#34;&gt;800 Gbps&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.scylladb.com/2017/05/10/faster-and-better-what-to-expect-running-scylla-on-aws-i3-instances/&#34;&gt;1 million IOPS on a NoSQL database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.enterprisedb.com/blog/pgbench-performance-benchmark-postgresql-12-and-edb-advanced-server-12&#34;&gt;70k IOPS in PostgreSQL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://openbenchmarking.org/test/pts/nginx&#34;&gt;500k requests per second to nginx&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://openbenchmarking.org/test/pts/build-linux-kernel-1.14.0&#34;&gt;Compiling the linux kernel in 20 seconds&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://openbenchmarking.org/test/pts/x264-2.7.0&#34;&gt;Rendering 4k video with x264 at 75 FPS&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Among other things.  There are a lot of public benchmarks these days, and if you know how your
service behaves, you can probably find a similar benchmark.&lt;/p&gt;
&lt;h4 id=&#34;the-cost-of-one-server&#34;&gt;The Cost of One Server&lt;/h4&gt;
&lt;p&gt;In a large hosting provider, OVHCloud, you can rent an HGR-HCI-6 server with similar specifications
to the above, with 128 physical cores (256 threads), 512 GB of memory, and 50 Gbps of bandwidth
for $1,318/month.&lt;/p&gt;
&lt;p&gt;Moving to the popular budget option, Hetzner, you can rent a smaller server with 32 physical cores
and 128 GB of RAM for about €140.00/month.  This is a smaller server than the one from OVHCloud
(1/4 the size), but it gives you some idea of the price spread between hosting providers.&lt;/p&gt;
&lt;p&gt;In AWS, one of the largest servers you can rent is the m6a.metal server. It offers 50 Gbps
of network bandwidth, 192 vCPUs (96 physical cores), and 768 GB of memory, and costs $8.2944/hour
in the US East region.  This comes out to $6,055/month.  The cloud premium is real!&lt;/p&gt;
&lt;p&gt;A similar server, with 128 physical cores and 512 GB of memory (as well as appropriate NICs,
SSDs, and support contracts), can be purchased from the Dell website for about $40,000.  However,
if you are going to spend this much on a server, you should probably chat with a salesperson to
make sure you are getting the best deal you can.  You will also need to pay to host this server
and connect it to a network, though.&lt;/p&gt;
&lt;p&gt;In comparison, buying servers takes about 8 months to break even compared to using cloud servers,
and 30 months to break even compared to renting.  Of course, buying servers has a lot of drawbacks,
and so does renting, so going forward, we will think a little bit about the &amp;ldquo;cloud premium&amp;rdquo; and
whether you should be willing to pay it (spoiler alert: the answer is &amp;ldquo;yes, but not as much as the
cloud companies want you to pay&amp;rdquo;).&lt;/p&gt;
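&lt;p&gt;As a rough check of that break-even math (purchase price only; hosting, power, and operations
for an owned server push the cloud break-even out toward the 8-month figure):&lt;/p&gt;

```python
# Prices quoted above: Dell purchase, AWS m6a.metal, OVHCloud HGR-HCI-6.
purchase = 40_000
cloud_monthly = 6_055
rental_monthly = 1_318

months_vs_cloud = purchase / cloud_monthly     # about 6.6 months on hardware alone
months_vs_rental = purchase / rental_monthly   # about 30 months
```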
&lt;h2 id=&#34;thinking-about-the-cloud&#34;&gt;Thinking about the Cloud&lt;/h2&gt;
&lt;p&gt;The &amp;ldquo;cloud era&amp;rdquo; began in earnest around 2010.  At the time, the state of the art CPU was an
8-core Intel Nehalem CPU.  Hyperthreading had just begun, so that 8-core CPU offered a
whopping 16 threads.  Hardware acceleration was about to arrive for AES encryption, and
vectors were 128 bits wide.  The largest CPUs had 24 MB of cache, and your server could fit a
whopping 256 GB of DDR3-1066 memory. If you wanted to store data, Seagate had just begun to
offer a 3 TB hard drive.  Each core offered 4 FLOPs per cycle, meaning that your 8-core
server running at 2.5 GHz offered a blazing fast 80 GFLOPs.&lt;/p&gt;
&lt;p&gt;The boom in distributed computing rode on this wave: if you wanted to do anything that
involved retrieval of data, you needed a lot of disks to get the storage throughput you want.
If you wanted to do large computations, you generally needed a lot of CPUs. This meant that
you needed to coordinate between a lot of CPUs to get most things done.&lt;/p&gt;
&lt;p&gt;Since that time began, the size of servers has increased a lot, and SSDs have increased available
IOPS by a factor of at least 100, but the size of mainstream VMs and containers hasn&amp;rsquo;t increased
much, and we still use virtualized drives that perform more like hard drives than SSDs (although
this gap is closing).&lt;/p&gt;
&lt;h4 id=&#34;one-server-plus-a-backup-is-usually-plenty&#34;&gt;One Server (Plus a Backup) is Usually Plenty&lt;/h4&gt;
&lt;p&gt;For most web services doing anything short of video streaming, with under 10k QPS, one server
will generally be fine.  For really simple services, one server could
even make it to a million QPS or so.  Very few web services get this much traffic - if you
have one, you know about it.  Even if you&amp;rsquo;re serving video, running only one server for your
control plane is very reasonable.  A benchmark can help you determine where you are.
Alternatively, you can use common benchmarks of similar applications, or
&lt;a href=&#34;https://specbranch.com/posts/common-perf-numbers/&#34;&gt;tables of common performance numbers&lt;/a&gt; to estimate how big of a
machine you might need.&lt;/p&gt;
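&lt;p&gt;A back-of-the-envelope version of that sizing estimate (every number below is a hypothetical
placeholder, not a measurement of any real service):&lt;/p&gt;

```python
import math

peak_qps = 8_000                      # your service's peak load (assumed)
benchmark_qps_per_server = 500_000    # e.g., a published nginx benchmark figure
headroom = 2                          # slack for spikes and failover

servers = max(1, math.ceil(headroom * peak_qps / benchmark_qps_per_server))
# With these placeholder numbers, one server is plenty.
```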
&lt;h4 id=&#34;tall-is-better-than-wide&#34;&gt;Tall is Better than Wide&lt;/h4&gt;
&lt;p&gt;When you need a cluster of computers, if one server is not enough, using fewer larger servers
will often be better than using a large fleet of small machines.  There is non-zero overhead
to coordinate a cluster, and that overhead is frequently O(n) on each server.  To reduce this
overhead, you should generally prefer to use a few large servers than to use many small servers.
In the case of things like serverless computing, where you allocate tiny short-lived containers,
this overhead accounts for a large fraction of the cost of use.  At the other extreme,
coordinating a cluster of one computer is trivial.&lt;/p&gt;
&lt;h4 id=&#34;big-servers-and-availability&#34;&gt;Big Servers and Availability&lt;/h4&gt;
&lt;p&gt;The big drawback of using a single big server is availability.  Your server is going to need
downtime, and it is going to break.  Running a primary and a backup server, kept in different
datacenters, is usually enough.  A 2x2 configuration should appease the truly paranoid: two
servers in a primary datacenter (or cloud provider) and two servers in a backup datacenter will
give you a lot of redundancy.  If you want a third backup deployment, you can often make that
smaller than your primary and secondary.&lt;/p&gt;
&lt;p&gt;However, you may still have to be concerned about &lt;em&gt;correlated&lt;/em&gt; hardware failures.  Hard drives
(and now SSDs) have been known to occasionally have correlated failures: if you see one disk
fail, you are a lot more likely to see a second failure before getting back up if your disks
are from the same manufacturing batch.  Services like Backblaze overcome this by using many
different models of disks from multiple manufacturers.  Hacker News learned this the hard way
recently when its primary and backup servers went down at the same time.&lt;/p&gt;
&lt;p&gt;If you are using a hosting provider which rents pre-built servers, it is prudent to rent two
different types of servers in each of your primary and backup datacenters.  This should avoid
almost every failure mode present in modern systems.&lt;/p&gt;
&lt;h2 id=&#34;use-the-cloud-but-dont-be-too-cloudy&#34;&gt;Use the Cloud, but don&amp;rsquo;t be too Cloudy&lt;/h2&gt;
&lt;p&gt;A combination of availability and ease of use is one of the big reasons why I (and most other
engineers) like cloud computers.  Yes, you pay a significant premium to rent the machines, but
your cloud provider has so much experience building servers that you don&amp;rsquo;t even see most failures,
and for the other failures, you can get back up and running really quickly by renting a new
machine in their nearly-limitless pool of compute.  It is their job to make sure that you don&amp;rsquo;t
experience downtime, and while they don&amp;rsquo;t always do it perfectly, they are pretty good at it.&lt;/p&gt;
&lt;p&gt;Hosting providers who are willing to rent you a server are a cheaper alternative to cloud
providers, but these providers can sometimes have poor quality and some of them don&amp;rsquo;t understand
things like network provisioning and correlated hardware failures. Also, moving from one rented
server to a larger one is a lot more annoying than resizing a cloud VM. Cloud servers have a
price premium for a good reason.&lt;/p&gt;
&lt;p&gt;However, when you deal with clouds, your salespeople will generally push you towards
&amp;ldquo;cloud-native&amp;rdquo; architecture.  These are things like microservices in auto-scaling VM groups with
legions of load balancers between them, and vendor-lock-in-enhancing products like serverless
computing and managed high-availability databases.  There is a good reason that cloud
salespeople are the ones pushing &amp;ldquo;cloud architecture&amp;rdquo; - it&amp;rsquo;s better for them!&lt;/p&gt;
&lt;p&gt;The conventional wisdom is that using cloud architecture is good because it lets you scale up
effortlessly. There are good reasons to use cloud-native architecture, but serving lots of people
is not one of them: most services can serve millions of people at a time with one server, and
will never give you a surprise five-figure bill.&lt;/p&gt;
&lt;h4 id=&#34;why-should-i-pay-for-peak-load&#34;&gt;Why Should I Pay for Peak Load?&lt;/h4&gt;
&lt;p&gt;One common criticism of the &amp;ldquo;one big server&amp;rdquo; approach is that you now have to pay for your peak
usage instead of paying as you go for what you use.  Thus, serverless computing or fleets of
microservice VMs more closely align your costs with your profit.&lt;/p&gt;
&lt;p&gt;Unfortunately, since all of your services run on servers (whether you like it or not), someone
in that supply chain is charging you based on their peak load.  Part of the &amp;ldquo;cloud premium&amp;rdquo; for
load balancers, serverless computing, and small VMs is based on how much extra capacity your
cloud provider needs to build in order to handle &lt;em&gt;their&lt;/em&gt; peak load.  You&amp;rsquo;re paying for someone&amp;rsquo;s
peak load anyway!&lt;/p&gt;
&lt;p&gt;This means that if your workload is exceptionally bursty - like a simulation that needs
to run once and then turn off forever - you should prefer to reach for &amp;ldquo;cloudy&amp;rdquo; solutions, but if
your workload is not so bursty, you will often have a cheaper system (and an easier time building
it) if you go for few large servers.  If your cloud provider&amp;rsquo;s usage is more bursty than yours,
you are going to pay that premium for no benefit.&lt;/p&gt;
&lt;p&gt;This premium applies to VMs, too, not just cloud services. However, if you are running a cloud VM
24/7, you can avoid paying the &amp;ldquo;peak load premium&amp;rdquo; by using 1-year contracts or negotiating with
a salesperson if you are big enough.&lt;/p&gt;
&lt;p&gt;Generally, the burstier your workload is, the more cloudy your architecture should be.&lt;/p&gt;
&lt;h4 id=&#34;how-much-does-it-cost-to-be-cloudy&#34;&gt;How Much Does it Cost to be Cloudy?&lt;/h4&gt;
&lt;p&gt;Being cloudy is expensive.  Generally, I would anticipate a 5-30x price premium depending on what
you buy from a cloud company, and depending on the baseline. &lt;em&gt;Not 5-30%, a factor of between 5 and
30.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Here is the pricing of AWS Lambda: $0.20 per 1M requests + $0.0000166667 per GB-second of RAM.  I
am using pricing for an x86 CPU here to keep parity with the m6a.metal instance we saw above.
Large ARM servers and serverless ARM compute are both cheaper.&lt;/p&gt;
&lt;p&gt;Assuming your server costs $8.2944/hour, and is capable of 1k QPS with 768 GB of RAM:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;1k QPS is 60k queries per minute, or 3.6M queries per hour&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Each query here gets 0.768 GB-seconds of RAM (amortized)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Replacing this server would cost about $46/hour using serverless computing&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The price premium for serverless computing over the instance is a factor of 5.5.  If you can keep
that server over 20% utilization, using the server will be cheaper than using serverless computing.
This is before any form of savings plan you can apply to that server - if you can rent those big
servers from the spot market or if you compare to the price you can get with a 1-year contract,
the price premium is even higher.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;If you compare to the OVHCloud rental price for the same server, the price premium of buying your
compute through AWS Lambda is a factor of 25.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;If you are considering renting a server from a low-cost hosting provider or using AWS Lambda, you
should prefer the hosting provider if you can keep the server above even 5% utilization!&lt;/p&gt;
&lt;p&gt;Also, note that the actual QPS number doesn&amp;rsquo;t matter: if the $8.2944/hour server is capable of 100k
QPS, the query would use 100x less memory-time, meaning that you would arrive at the same 5.5x
(or 25x) premium. Of course, you should scale the size of the server to fit your application.&lt;/p&gt;
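&lt;p&gt;The arithmetic above is easy to reproduce (prices exactly as quoted; this is a sketch of the
comparison, not a billing model):&lt;/p&gt;

```python
# AWS Lambda x86 pricing and the m6a.metal rate quoted above.
lambda_per_1m_requests = 0.20
lambda_per_gb_second = 0.0000166667
server_per_hour = 8.2944
ovh_per_hour = 1_318 / (30 * 24)   # OVHCloud monthly rate spread over a month

qps = 1_000
ram_gb = 768

request_cost = qps * 3600 / 1e6 * lambda_per_1m_requests   # $0.72/hour
ram_cost = ram_gb * 3600 * lambda_per_gb_second            # about $46/hour
lambda_hourly = request_cost + ram_cost

premium_vs_aws = lambda_hourly / server_per_hour   # about 5.6x (5.5x before request costs)
premium_vs_ovh = lambda_hourly / ovh_per_hour      # about 25x
breakeven_utilization = server_per_hour / lambda_hourly   # about 18%
```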
&lt;h2 id=&#34;common-objections-to-one-big-server&#34;&gt;Common Objections to One Big Server&lt;/h2&gt;
&lt;p&gt;If you propose using the one big server approach, you will often get pushback from people who are
more comfortable with the cloud, prefer to be fashionable, or have legitimate concerns.  Use your
judgment when you think about it, but most people vastly underestimate how much &amp;ldquo;cloud
architecture&amp;rdquo; actually costs compared to the underlying compute.  Here are some common objections.&lt;/p&gt;
&lt;h4 id=&#34;but-if-i-use-cloud-architecture-i-dont-have-to-hire-sysadmins&#34;&gt;But if I use Cloud Architecture, I Don&amp;rsquo;t Have to Hire Sysadmins&lt;/h4&gt;
&lt;p&gt;Yes you do.  They are just now called &amp;ldquo;Cloud Ops&amp;rdquo; and are under a different manager. Also, their
ability to read the arcane documentation that comes from cloud companies and keep up with the
corresponding torrents of updates and deprecations makes them 5x more expensive than system
administrators.&lt;/p&gt;
&lt;h4 id=&#34;but-if-i-use-cloud-architecture-i-dont-have-to-do-security-updates&#34;&gt;But if I use Cloud Architecture, I Don&amp;rsquo;t Have to Do Security Updates&lt;/h4&gt;
&lt;p&gt;Yes you do.  You may have to do fewer of them, but the ones you don&amp;rsquo;t have to do are the easy ones
to automate.  You are still going to share in the pain of auditing libraries you use, and making
sure that all of your configurations are secure.&lt;/p&gt;
&lt;h4 id=&#34;but-if-i-use-cloud-architecture-i-dont-have-to-worry-about-it-going-down&#34;&gt;But if I use Cloud Architecture, I Don&amp;rsquo;t Have to Worry About it Going Down&lt;/h4&gt;
&lt;p&gt;The &amp;ldquo;high availability&amp;rdquo; architectures you get from using cloudy constructs and microservices just
about make up for the fragility they add due to complexity.  At this point, if you use two
different cloud regions or two cloud providers, you can generally assume that is good enough to
avoid your service going down.  However, cloud providers have often had global outages in the past,
and there is no reason to assume that cloud datacenters will be down any less often than your
individual servers.&lt;/p&gt;
&lt;p&gt;Remember that we are trying to prevent &lt;em&gt;correlated&lt;/em&gt; failures.  Cloud datacenters have a lot of
parts that can fail in correlated ways.  Hosting providers have many fewer of these parts.
Similarly, complex cloud services, like managed databases, have more failure modes than simple
ones (VMs).&lt;/p&gt;
&lt;h4 id=&#34;but-i-can-develop-more-quickly-if-i-use-cloud-architecture&#34;&gt;But I can Develop More Quickly if I use Cloud Architecture&lt;/h4&gt;
&lt;p&gt;Then do it, and just keep an eye on the bill and think about when it&amp;rsquo;s worth it to switch.  This
is probably the strongest argument in favor of using cloudy constructs.  However, if you don&amp;rsquo;t
think about it as you grow, you will likely end up burning a lot of money on your cloudy
architecture long past the time to switch to something more boring.&lt;/p&gt;
&lt;h4 id=&#34;my-workload-is-really-bursty&#34;&gt;My Workload is Really Bursty&lt;/h4&gt;
&lt;p&gt;Cloud away.  That is a great reason to use things like serverless computing.  One of the big
benefits of cloud architecture constructs is that they &lt;em&gt;scale down&lt;/em&gt; really well.  If your workload
goes through long periods of idleness punctuated with large unpredictable bursts of activity, cloud
architecture probably works really well for you.&lt;/p&gt;
&lt;h4 id=&#34;what-about-cdns&#34;&gt;What about CDNs?&lt;/h4&gt;
&lt;p&gt;It&amp;rsquo;s impossible to get the benefits of a CDN, both in latency improvements and bandwidth savings,
with one big server.  This is also true of other systems that need to be distributed, like backups.
Thankfully CDNs and backups are competitive markets, and relatively cheap. These are the kind of
thing to buy rather than build.&lt;/p&gt;
&lt;h2 id=&#34;a-note-on-microservices-and-monoliths&#34;&gt;A Note On Microservices and Monoliths&lt;/h2&gt;
&lt;p&gt;Thinking about &amp;ldquo;one big server&amp;rdquo; naturally lines up with thinking about monolithic architectures.
However, you don&amp;rsquo;t need to use a monolith to use one server.  You can run many containers on one
big server, with one microservice per container.  However, microservice architectures in general
add a lot of overhead to a system for dubious gain when you are running on one big server.&lt;/p&gt;
&lt;h2 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;When you experience growing pains, and get close to the limits of your current servers, today&amp;rsquo;s
conventional wisdom is to go for sharding and horizontal scaling, or to use a cloud architecture
that gives you horizontal scaling &amp;ldquo;for free.&amp;rdquo;  It is often easier and more efficient to scale
vertically instead.  Using one big server is comparatively cheap, keeps your overheads at a
minimum, and actually has a pretty good availability story if you are careful to prevent correlated
hardware failures.  It&amp;rsquo;s not glamorous and it won&amp;rsquo;t help your resume, but one big server will serve
you well.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>The Most Useful Statistical Test You Didn&#39;t Learn in School</title>
      <link>https://specbranch.com/posts/kolmogorov-smirnov/</link>
      <pubDate>Mon, 04 Jul 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/kolmogorov-smirnov/</guid>
      <description>In performance work, you will often find many distributions that are weirdly shaped: fat-tailed distributions, distributions with a hard lower bound at a non-zero number, and distributions that are just plain odd. Particularly when you look at latency distributions, it is extremely common for the 99th percentile to be a lot further from the mean than the 1st percentile. These sorts of asymmetric fat-tailed distributions come with the business.
Oftentimes, when performance engineers need to be scientific about their work, they will take samples of these distributions and put them into a $t$-test to get a $p$-value for the significance of their improvements.</description>
      <content:encoded>&lt;p&gt;In performance work, you will often find many distributions that are weirdly shaped: fat-tailed
distributions, distributions with a hard lower bound at a non-zero number, and distributions
that are just plain odd.  Particularly when you look at latency distributions, it is extremely
common for the 99th percentile to be a lot further from the mean than the 1st percentile.  These
sorts of asymmetric fat-tailed distributions come with the business.&lt;/p&gt;
&lt;p&gt;Oftentimes, when performance engineers need to be scientific about their work, they will take
samples of these distributions and put them into a $t$-test to get a $p$-value for the
significance of their improvements.  That is what you learned in a basic statistics or lab science
class, so why not?  Unfortunately, the world of computers is more complicated than the beer
quality experiments for which the $t$-test was invented, and violates one of its core assumptions:
that the sample means are normally distributed.  When you have a lot of samples, this can hold,
but it often doesn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;Alternatively, engineers may run several trials of thousands of samples and then use $t$-tests
on the descriptive statistics produced by those trials.  While this is statistically valid, you
may need to expend a lot of resources to run those trials, and the required number of trials can
limit what you can test.&lt;/p&gt;
&lt;p&gt;Instead, when comparing two samples that are very oddly distributed, my preferred test is the
&lt;strong&gt;Kolmogorov-Smirnov test&lt;/strong&gt;.  It is a fairly simple statistical test that can be used to compare
any two samples (or a sample with a distribution), and directly answers the question of whether
the two samples could have come from the same distribution without making any other assumptions.&lt;/p&gt;
&lt;h2 id=&#34;the-kolmogorov-smirnov-test&#34;&gt;The Kolmogorov-Smirnov Test&lt;/h2&gt;
&lt;p&gt;The Kolmogorov-Smirnov test is simple to execute.  Consider two samples, $A$, and $B$ with sizes
$m$ and $n$:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Compute the cumulative distribution function (CDF) of each of the samples.  For a sample,
the CDF is defined as the fraction of the sample that is less than or equal to $x$ for any given $x$.
For a continuous distribution, the CDF is the probability that a random value drawn from that
distribution is less than or equal to $x$.&lt;/li&gt;
&lt;li&gt;Find the maximum distance between the CDFs, and call it $D$.&lt;/li&gt;
&lt;/ol&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/ks-examplegraphic.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;In math terms:&lt;/p&gt;
&lt;p&gt;$$ D = \max_x \left| CDF_A(x) - CDF_B(x) \right| $$&lt;/p&gt;
&lt;ol start=&#34;3&#34;&gt;
&lt;li&gt;Reject the null hypothesis at the $\alpha$ level if the following inequality holds:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;$$ D &amp;gt; \sqrt{-\ln\left( \frac{\alpha}{2} \right) \cdot \frac{1 + \frac{n}{m}}{2n}} $$&lt;/p&gt;
&lt;p&gt;Alternatively, you can re-arrange this equation to determine a $p$-value for a given $D$:&lt;/p&gt;
&lt;p&gt;$$ p = 2 \exp \left[ -D^2 \frac{2n}{1 + \frac{n}{m}} \right] $$&lt;/p&gt;
&lt;p&gt;In the example we have above, each of the two samples has 30 points.  $D = 0.267$, and the
critical value of $D$ at the $0.05$ level is $0.35$.  The null hypothesis would not be
rejected ($p = 0.24$).&lt;/p&gt;
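&lt;p&gt;To make the arithmetic concrete, here is a quick sketch of the critical value and $p$-value formulas above (the function names are my own, not from any library):&lt;/p&gt;

```cpp
#include <cassert>
#include <cmath>

// Asymptotic two-sample critical value of D at significance level alpha,
// for sample sizes n and m (the right-hand side of the inequality above)
double ks_critical(double alpha, double n, double m) {
    return std::sqrt(-std::log(alpha / 2.0) * (1.0 + n / m) / (2.0 * n));
}

// Asymptotic p-value for an observed maximum CDF distance d
double ks_p_value(double d, double n, double m) {
    return 2.0 * std::exp(-d * d * 2.0 * n / (1.0 + n / m));
}
```

&lt;p&gt;With $m = n = 30$, &lt;code&gt;ks_critical(0.05, 30, 30)&lt;/code&gt; gives roughly $0.35$ and &lt;code&gt;ks_p_value(0.267, 30, 30)&lt;/code&gt; roughly $0.24$, matching the example above.&lt;/p&gt;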
&lt;p&gt;Here is another example, with two samples of 30 points, where the null hypothesis is rejected:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/ks-reject.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;In this case, $D = 0.43$, and $D_{crit, 0.05} = 0.35$.  Therefore, with $p = 0.00715$, the
null hypothesis is rejected, and the Kolmogorov-Smirnov test concludes that the samples are
drawn from different distributions.  That is a good thing, because they are drawn from different
distributions!  In both cases, the yellow sample was drawn from a beta distribution with
$\alpha = 2$ and $\beta = 4$, while the blue sample was drawn from a beta distribution with
$\alpha = 2$ and $\beta = 5$.&lt;/p&gt;
&lt;p&gt;By the way, there is a way to do a multivariate Kolmogorov-Smirnov test, but I have never done
one.  The algorithms were developed fairly recently,
&lt;a href=&#34;https://www.sciencedirect.com/science/article/abs/pii/S0167715297000205&#34;&gt;one in 1997 for a sample and a distribution&lt;/a&gt;,
&lt;a href=&#34;https://www.sciencedirect.com/science/article/pii/S0047259X03000794?via%3Dihub&#34;&gt;and a two-sample one in 2004&lt;/a&gt;.
I would suggest a different test for a multivariate statistical test.&lt;/p&gt;
&lt;h4 id=&#34;testing-against-a-known-distribution&#34;&gt;Testing Against a Known Distribution&lt;/h4&gt;
&lt;p&gt;The other use of the Kolmogorov-Smirnov test is to compare data from an experiment to an expected
distribution.  In this case, the critical value and $p$-value calculations are given by taking the limit
as $m \rightarrow \infty$:&lt;/p&gt;
&lt;p&gt;$$ D_{crit} = \lim_{m \rightarrow \infty} \left( \sqrt{-\ln\left( \frac{\alpha}{2} \right) \cdot
\frac{1 + \frac{n}{m}}{2n}} \right) = \sqrt{-\frac{1}{2n} \ln\left( \frac{\alpha}{2} \right)} $$&lt;/p&gt;
&lt;p&gt;The calculation to find $D$ is the same as before: we look for the point of maximum difference between
the two distributions.&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/ks-continuous.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;So we are now testing for:&lt;/p&gt;
&lt;p&gt;$$ \max_x \left| CDF_A(x) - CDF_B(x) \right| &amp;gt; \sqrt{-\frac{1}{2n} \ln\left( \frac{\alpha}{2} \right)} $$&lt;/p&gt;
&lt;p&gt;Where $n$ is the size of the dataset being tested, and $\alpha$ is the desired significance level.  Similarly,
the $p$-value calculation for the dataset becomes:&lt;/p&gt;
&lt;p&gt;$$ p = 2 e^{-2nD^2} $$&lt;/p&gt;
&lt;p&gt;This can be useful as a test of normality or a test of the goodness of fit of a regression.  It is also
sometimes used for testing random number generators against uniform distributions, and checking for
financial and election fraud using &lt;a href=&#34;https://en.wikipedia.org/wiki/Benford%27s_law&#34;&gt;Benford&amp;rsquo;s Law&lt;/a&gt;.&lt;/p&gt;
&lt;h4 id=&#34;as-an-algorithm&#34;&gt;As an Algorithm&lt;/h4&gt;
&lt;p&gt;SciPy, R, Excel&amp;rsquo;s statistics package, and most mathematical computing environments have pre-written
implementations of the Kolmogorov-Smirnov test.  However, it is a pretty easy algorithm to implement.  Because of the
monotonic nature of CDFs, the maximum will occur at one of the &amp;ldquo;steps&amp;rdquo; of one of the CDFs, which happen
at the points in each respective set.  Thus, we only have to test a very limited number of points to be
the possible maximum instead of doing any real math.  Here is the two-sample test in (unoptimized) C++:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;background-color:#f0f0f0;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-C++&#34; data-lang=&#34;C++&#34;&gt;&lt;span style=&#34;color:#007020&#34;&gt;#include&lt;/span&gt; &lt;span style=&#34;color:#007020&#34;&gt;&amp;lt;cmath&amp;gt;&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;#include&lt;/span&gt; &lt;span style=&#34;color:#007020&#34;&gt;&amp;lt;set&amp;gt;&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;&lt;/span&gt;
&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Kolmogorov-Smirnov test of two ordered sets of doubles
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; &lt;span style=&#34;color:#06287e&#34;&gt;two_sample_ks&lt;/span&gt; (&lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;const&lt;/span&gt; std&lt;span style=&#34;color:#666&#34;&gt;::&lt;/span&gt;set&lt;span style=&#34;color:#666&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&amp;amp;&lt;/span&gt; set_a,
                      &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;const&lt;/span&gt; std&lt;span style=&#34;color:#666&#34;&gt;::&lt;/span&gt;set&lt;span style=&#34;color:#666&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt;&lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&amp;amp;&lt;/span&gt; set_b) {
    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Doubles of set size and CDFs
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; n &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; set_a.size();
    &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; m &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; set_b.size();
    &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; cdf_a &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;0.0&lt;/span&gt;;
    &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; cdf_b &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;0.0&lt;/span&gt;;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Track the maximum delta between the CDFs
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; max_delta &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;0.00&lt;/span&gt;;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Iterators through the sets
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;auto&lt;/span&gt; walk_a &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; set_a.begin();
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;auto&lt;/span&gt; walk_b &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; set_b.begin();

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Walk across the X axis tracking the delta between the CDFs of A and B
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// The only relevant points to check are places where the CDF changes
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;while&lt;/span&gt; (walk_a &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; set_a.end() &lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&amp;amp;&lt;/span&gt; walk_b &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; set_b.end()) {
        &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Compute the CDFs and the delta between them, tracking the maximum
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;        &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; delta &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; std&lt;span style=&#34;color:#666&#34;&gt;::&lt;/span&gt;abs(cdf_a &lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt; cdf_b);
        &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;if&lt;/span&gt; (delta &lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&lt;/span&gt; max_delta) max_delta &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; delta;

        &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Move to the next relevant position on the X axis and update the CDFs
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;        &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; a_value &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;walk_a;
        &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;if&lt;/span&gt; (&lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;walk_a &lt;span style=&#34;color:#666&#34;&gt;&amp;lt;=&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;walk_b) {
            cdf_a &lt;span style=&#34;color:#666&#34;&gt;+=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;1.0&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;/&lt;/span&gt; n;
            &lt;span style=&#34;color:#666&#34;&gt;++&lt;/span&gt;walk_a;
        }
        &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;if&lt;/span&gt; (&lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt;walk_b &lt;span style=&#34;color:#666&#34;&gt;&amp;lt;=&lt;/span&gt; a_value) {
            cdf_b &lt;span style=&#34;color:#666&#34;&gt;+=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;1.0&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;/&lt;/span&gt; m;
            &lt;span style=&#34;color:#666&#34;&gt;++&lt;/span&gt;walk_b;
        }
    }

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Check the delta one final time: when the loop exits, one CDF has reached 1.0&lt;/span&gt;
    &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; tail_delta &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; std&lt;span style=&#34;color:#666&#34;&gt;::&lt;/span&gt;abs(cdf_a &lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt; cdf_b);
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;if&lt;/span&gt; (tail_delta &lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&lt;/span&gt; max_delta) max_delta &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; tail_delta;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Compute the p-value of the test
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; size_factor &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;2.0&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; n &lt;span style=&#34;color:#666&#34;&gt;/&lt;/span&gt; (&lt;span style=&#34;color:#40a070&#34;&gt;1.0&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;+&lt;/span&gt; n &lt;span style=&#34;color:#666&#34;&gt;/&lt;/span&gt; m);
    &lt;span style=&#34;color:#902000&#34;&gt;double&lt;/span&gt; p_value &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;2.0&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; std&lt;span style=&#34;color:#666&#34;&gt;::&lt;/span&gt;exp(&lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt;max_delta &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; max_delta &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; size_factor);
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;return&lt;/span&gt; p_value;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you wanted to do this with a vector or an unordered set, you would have to sort it first.  The
one-sample test is similarly constrained: the maximum delta will occur either right before or right
after a step, so there are again very few points to search to find the maximum delta.  Running a
Kolmogorov-Smirnov test is just a single pass over the data, and doesn&amp;rsquo;t need any fancy math.&lt;/p&gt;
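&lt;p&gt;As a sketch of the one-sample variant (my own code, under the assumptions above, not a library routine): because the empirical CDF only moves at the data points, checking just before and just after each step finds $D$ in a single pass over sorted data:&lt;/p&gt;

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// One-sample Kolmogorov-Smirnov test of sorted data against a known
// continuous CDF.  Checks the empirical CDF just before and just after
// each step, then applies p = 2 exp(-2 n D^2).  The result is capped at
// 1.0, since the asymptotic formula can exceed it for small D.
double one_sample_ks(const std::vector<double>& data, double (*cdf)(double)) {
    double n = data.size();
    double max_delta = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        double f = cdf(data[i]);
        double before = std::abs(f - i / n);        // empirical CDF below the step
        double after = std::abs(f - (i + 1) / n);   // empirical CDF above the step
        if (before > max_delta) max_delta = before;
        if (after > max_delta) max_delta = after;
    }
    return std::min(1.0, 2.0 * std::exp(-2.0 * n * max_delta * max_delta));
}

// Example reference distribution: uniform on [0, 1]
double uniform_cdf(double x) {
    if (x < 0.0) return 0.0;
    if (x > 1.0) return 1.0;
    return x;
}
```

&lt;p&gt;For example, &lt;code&gt;one_sample_ks({0.9, 0.91, 0.92, 0.93, 0.94}, uniform_cdf)&lt;/code&gt; returns a $p$-value well below $0.01$: five points bunched near $0.9$ are a poor fit for a uniform distribution.&lt;/p&gt;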
&lt;h2 id=&#34;comparison-to-t-tests&#34;&gt;Comparison to $t$-tests&lt;/h2&gt;
&lt;p&gt;The Kolmogorov-Smirnov test is a non-parametric statistical test: it makes no assumptions about
the underlying data.  However, like most non-parametric tests, it tends to have higher $p$-values
than tests that make assumptions about the underlying data.  The strength of the K-S test is that
you don&amp;rsquo;t need very many samples to make sure that the test is valid, and you don&amp;rsquo;t need to check
the distributions of the data to ensure that the sample means are normal.&lt;/p&gt;
&lt;p&gt;As I alluded to above, if you have a very large sample where you can be reasonably sure that sample
means are normally distributed, you will probably get a better $p$-value from a $t$-test.  Compared
to a $t$-test, you can usually run a valid Kolmogorov-Smirnov test with fewer data points, and if
the distributions are oddly shaped (e.g. if they have the same mean, but very different behavior at
the tails), the K-S test will often tell you that they are different when a $t$-test might say that
the distributions are the same.  Remember, a $t$-test tests whether the two samples have different
&lt;strong&gt;means&lt;/strong&gt; while a Kolmogorov-Smirnov test determines whether the samples are drawn from different
&lt;strong&gt;distributions&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Finally, a Kolmogorov-Smirnov test needs at least $O(n \log n)$ time on unsorted data because you
need to sort your data to find its CDF.  $t$-tests and other similar tests which use descriptive
statistics can be computed in $O(n)$ time without sorting the data.  However, if you already have
sorted data, like database data, you can compute a Kolmogorov-Smirnov test in one pass over the
data.  A $t$-test similarly needs one pass.&lt;/p&gt;
&lt;h4 id=&#34;when-the-t-test-doesnt-help&#34;&gt;When the $t$-test Doesn&amp;rsquo;t Help&lt;/h4&gt;
&lt;p&gt;In performance testing, we often see optimizations that help tails of distributions, and keep
means relatively unchanged.  $t$-tests will miss these effects, while a Kolmogorov-Smirnov
test will show an improvement.  Here is an example of two datasets with 50 points each:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/ks-vs-t.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;In this case, the two datasets have very similar means: $0.55$ (blue) and $0.50$ (yellow).
However, the standard deviations of the distributions are very different: $0.21$ for the blue
sample, and $0.11$ for the yellow sample.  A $t$-test yields a $p$-value of $0.133$ when comparing
these samples, while a Kolmogorov-Smirnov test yields a $p$-value of $0.012$!  Visually, these
samples are very different, and the Kolmogorov-Smirnov test helps you argue that.&lt;/p&gt;
&lt;h4 id=&#34;replacing-t-tests-with-kolmogorov-smirnov-tests&#34;&gt;Replacing $t$-Tests with Kolmogorov-Smirnov Tests&lt;/h4&gt;
&lt;p&gt;Returning to our example of testing a performance improvement, the usual way to solve this problem with
$t$-tests is to run a massive number of samples and split the sample pool into several trials.  For
example, a set of 100,000 runs would be split into 50 &amp;ldquo;trials&amp;rdquo; of 2,000 runs.  On each trial, you
produce descriptive statistics, such as 90th and 99th percentiles, relying on the size of each trial
to make sure that the central limit theorem kicks in. These are then compared using $t$-tests, one for
the set of 90th percentiles, one for the set of means, etc.&lt;/p&gt;
&lt;p&gt;It is a lot easier to use one Kolmogorov-Smirnov test than to split your sample artificially
so you can use $t$-tests, and you can use it to detect much subtler effects!  Alternatively, a
Kolmogorov-Smirnov test on just 50 samples gives you sensitivity similar to what you would get
from $t$-testing 50 trials of 2,000 samples each.  In both cases, $n = 50$.&lt;/p&gt;
&lt;p&gt;When you actually want to compare two means, the $t$-test is the right thing to use.  When you want
to compare distributions, it is a lot better to use Kolmogorov-Smirnov tests.&lt;/p&gt;
&lt;h2 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;The Kolmogorov-Smirnov test is a powerful statistical test that lets you avoid some of the problems
of traditional statistical tests.  A Kolmogorov-Smirnov test lets you compare an entire distribution
against another distribution, so you can detect changes at the tails of a distribution that do not
affect the means.  It carries no assumptions about the datasets or distributions used.  Finally, the
Kolmogorov-Smirnov test allows you to accurately scale down these questions to tens of samples.
It is a very useful statistical technique to add to your arsenal as a performance engineer or data
scientist.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>What Happened with FPGA Acceleration?</title>
      <link>https://specbranch.com/posts/fpgas-what-happened/</link>
      <pubDate>Wed, 01 Jun 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/fpgas-what-happened/</guid>
      <description>In 2018, I took the jump from being primarily an FPGA hardware engineer to being primarily a software engineer. At the time, things were looking great for FPGA acceleration, with AWS and later Azure bringing in VMs with FPGAs and the two big FPGA vendors setting their sights on application acceleration. Almost 5 years later, I am working on another project with FPGAs, this time a cloud-oriented one. That has inspired me to write a retrospective on the last 5 years of what we thought would be an FPGA acceleration boom.</description>
      <content:encoded>&lt;p&gt;In 2018, I took the jump from being primarily an FPGA hardware engineer to being primarily a software
engineer.  At the time, things were looking great for FPGA acceleration, with AWS and later Azure
bringing in VMs with FPGAs and the two big FPGA vendors setting their sights on application
acceleration. Almost 5 years later, I am working on another project with FPGAs, this time a
cloud-oriented one. That has inspired me to write a retrospective on the last 5 years of what we
thought would be an FPGA acceleration boom.&lt;/p&gt;
&lt;p&gt;In 2016-2018, the major FPGA manufacturers made great overtures about
the applicability of FPGAs to compute acceleration, setting their sights on GPUs.  Finally, the FPGA
would be today&amp;rsquo;s compute accelerator, not tomorrow&amp;rsquo;s.  Reality has worked out a bit differently.
In 2022, FPGAs are still considered niche devices, though they have made great strides since then.
To make the jump from niche to mainstream, there is still a lot of work to be done.&lt;/p&gt;
&lt;h2 id=&#34;background-what-is-an-fpga&#34;&gt;Background: What is an FPGA?&lt;/h2&gt;
&lt;p&gt;A &lt;strong&gt;F&lt;/strong&gt;ield &lt;strong&gt;P&lt;/strong&gt;rogrammable &lt;strong&gt;G&lt;/strong&gt;ate &lt;strong&gt;A&lt;/strong&gt;rray (FPGA) is a device that can emulate digital circuits.
Instead of having large processing units, they have millions of small lookup tables, which can compute
simple logical functions.  The lookup tables are combined together using a routing network on the FPGA
to create larger circuits within the logic fabric of the FPGA.  Registers within the FPGA fabric allow
those circuits to be pipelined, and most FPGAs today also contain hardware multipliers and RAM blocks,
since those are hard to emulate with lookup tables.  FPGAs also offer clocking circuits to support
these applications, and both programmable I/O pins (for things like blinking lights) and fixed-function
I/O units (for protocols like PCIe or DDR4) to allow the FPGA to interact with the outside world.&lt;/p&gt;
&lt;p&gt;An example diagram of an Intel FPGA is below:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://www.intel.co.uk/content/dam/www/program/design/us/en/images/16x9/psg-migration-test-images/arria10-architecture-16x9.png&#34; width=&#34;100%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;The Arria 10 device shown is typical of a modern FPGA in its structure.  It has all the bells and
whistles to implement a high-performance application.&lt;/p&gt;
&lt;p&gt;Keep in mind that these FPGAs are big devices.  In fact, they are generally
the biggest logic devices that you can buy.  As of today, the largest FPGA available, the Xilinx Versal
VP1802, has 92 billion transistors, while the largest GPU, the Nvidia GH100, has 80 billion.
Currently, Apple&amp;rsquo;s M1 Ultra, which combines a huge CPU and a huge GPU, is the only computing chip
(CPU, GPU, or FPGA) that beats the FPGAs for transistor count, at 114 billion
(technically, the Cerebras Wafer Scale Engine is also a chip with more transistors).&lt;/p&gt;
&lt;p&gt;However, FPGAs are completely proprietary.  Despite the similarity between architectures from  FPGA
vendors, they do not open up anything about their architectures.  Currently, there are two major FPGA
vendors, Intel and AMD, but the situation is nothing like x86.  There are no inter-operable standards.&lt;/p&gt;
&lt;h4 id=&#34;fpga-programming&#34;&gt;FPGA Programming&lt;/h4&gt;
&lt;p&gt;Throughout the history of FPGAs, the primary programming languages used for them have been hardware
description languages (HDLs), like Verilog and VHDL.  Hardware description languages are designed to describe
hardware with a great deal of precision first, with as much expressiveness as can come afterward.  Hardware
description languages are used to produce &lt;em&gt;configurations&lt;/em&gt; for FPGAs using a process that is called
&lt;em&gt;synthesis&lt;/em&gt;.  Instead of an executable or .elf file, the configuration is described by a proprietary
&lt;em&gt;bitstream&lt;/em&gt;.  Everything about the process is proprietary to the vendors.&lt;/p&gt;
&lt;p&gt;Hardware description languages are generally fairly primitive from the perspective of programming
language theory. It is possible to write valid things in a hardware description language that cannot
synthesize to hardware, and there is very little possibility of using programming concepts like
inheritance in hardware description languages.  Even recursion did not make it into Verilog hardware
descriptions until 2008.&lt;/p&gt;
&lt;p&gt;FPGA synthesis is several orders of magnitude more complicated than a traditional compilation process,
and involves approximating many NP-hard problems with millions of elements. As a result, synthesis of
an FPGA design can take several hours on a large server.&lt;/p&gt;
&lt;h2 id=&#34;the-promise-of-accessible-fpgas&#34;&gt;The Promise of Accessible FPGAs&lt;/h2&gt;
&lt;p&gt;Around 2015, the two FPGA giants, Altera (now Intel) and Xilinx (now owned by AMD) invested heavily in
the idea of computational acceleration.  At the time, things looked great for FPGAs as computing
devices.  Despite the expense, the largest FPGAs at the time could out-FLOP the largest
GPUs in some applications, despite not having the benefit of hard floating point units.  On integer
tasks, there was no contest: a small FPGA could out-perform the largest GPUs on string parsing and
other tasks that were not floating-point-specific.  On top of that, the FPGA would consume less than
1/5th the power of a GPU or CPU performing a comparable task.&lt;/p&gt;
&lt;p&gt;To support their acceleration ambitions, Xilinx and Intel started selling FPGA cards designed for
application acceleration.  These cards were cheaper than FPGA cards for other purposes (about in line
with a server GPU, not 10x more expensive) and were designed so that you didn&amp;rsquo;t need to be a hardware
engineering expert to understand how to use them.  Amazon and Microsoft also noticed the promise of
FPGAs for compute acceleration. Microsoft in particular used an FPGA system to accelerate
&lt;a href=&#34;https://www.microsoft.com/en-us/research/project/project-catapult/&#34;&gt;Bing&lt;/a&gt;,
and reportedly uses FPGAs in Azure servers to
&lt;a href=&#34;https://www.microsoft.com/en-us/research/uploads/prod/2018/03/Azure_SmartNIC_NSDI_2018.pdf&#34;&gt;accelerate networking&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Both Amazon and Microsoft offer instances with FPGA cards to allow cloud customers to use FPGA
acceleration.  With new cards aimed at server deployment and support from the top two cloud companies,
FPGA acceleration looked like it was in a good place to take off.&lt;/p&gt;
&lt;h4 id=&#34;just-one-problem&#34;&gt;Just One Problem&amp;hellip;&lt;/h4&gt;
&lt;p&gt;Meanwhile, FPGAs had a huge problem: the way you design a custom circuit is very different than the way you
program a processor.  Exotic DSPs, one-penny microcontrollers, massive GPUs, and server CPUs all have
similar, mature toolchains.  FPGAs don&amp;rsquo;t.  The steps to program an exotic, proprietary DSP are almost
the same as the steps to program for an x86 CPU.  Not so for an FPGA.  The toolchains for GPUs and
CPUs use many common open-source components.  FPGA toolchains don&amp;rsquo;t have anything in common with each
other.&lt;/p&gt;
&lt;p&gt;It seems only natural that around that time, Altera and Xilinx would invest heavily in technologies
that closed the gap between software code and FPGA hardware.  The technology that they mainly chose
was &lt;em&gt;high-level synthesis&lt;/em&gt; (HLS), which would add a few more steps to the synthesis flow to allow
hardware to be described in C or C++.  C++ code describing a data flow would be synthesized to a
series of hardware blocks with standardized interfaces between them.&lt;/p&gt;
&lt;p&gt;High-level synthesis has been around for a while: Since the early 2000&amp;rsquo;s, Matlab has offered HLS
for signal processing applications. However, before the late 2010&amp;rsquo;s, there weren&amp;rsquo;t any successful
attempts to make a &lt;em&gt;general-purpose&lt;/em&gt; HLS system.  This was the challenge that Altera and Xilinx
now faced.  Software engineers would find many different applications, and the new HLS system
would need to work for all of them.&lt;/p&gt;
&lt;h4 id=&#34;pursuing-general-purpose-high-level-synthesis&#34;&gt;Pursuing General-purpose High-level Synthesis&lt;/h4&gt;
&lt;p&gt;Altera began its general-purpose HLS efforts by supporting OpenCL on its FPGAs.  The OpenCL framework
allowed FPGAs to work with a common acceleration framework that could also be used for GPUs.  However,
it also coerced the FPGA application to use an architecture that looked like a GPU.  This is fine for
most acceleration tasks, but takes a lot of the magic of FPGAs off the table.  To compensate for this,
Altera and later Intel added FPGA-specific OpenCL features.  For example, streaming data from one hardware
block to another (instead of through memory) required using &amp;ldquo;streams&amp;rdquo; which were Intel-specific.&lt;/p&gt;
&lt;p&gt;Xilinx&amp;rsquo;s first major HLS effort was built inside its Vivado design suite.  Vivado HLS didn&amp;rsquo;t use
a framework like OpenCL, and instead allowed you to create hardware modules with C++ code.  Chip
architecture still had to be done like hardware, although Vivado allows you to do it with block diagrams.&lt;/p&gt;
&lt;p&gt;Both general-purpose HLS systems have a problem: &lt;code&gt;#pragma&lt;/code&gt; statements.  HLS
code tends to feature almost as many lines of pragmas as lines of code.  Worse, the pragmas control
things like insertion of pipeline stages, loop unrolling, and other tasks that expose the nature of
the hardware to the engineer.  Combined with the FPGA-specific constructs, pragmas have pretty much
squashed the dream that FPGA programming might be accessible to software engineers.&lt;/p&gt;
&lt;p&gt;Each of the two major vendors later copied each other&amp;rsquo;s approach: Intel introduced an HLS compiler
to compete with Xilinx, and Xilinx added an OpenCL environment. At best, these HLS systems have
become productivity tools for hardware engineers rather than allowing software engineers to access
FPGAs. They are not the &amp;ldquo;CUDA&amp;rdquo; of FPGAs.&lt;/p&gt;
&lt;h4 id=&#34;an-early-foothold-neural-networks&#34;&gt;An Early Foothold: Neural Networks&lt;/h4&gt;
&lt;p&gt;Neural networks also offered another foothold through which FPGAs could become more mainstream.
FPGAs particularly promised benefits on neural network inference, where they could offer huge SRAM
bandwidth and thousands of integer ops per clock cycle.  The FPGA companies and their partners
recognized this, and began working on compilers for neural network inference on FPGAs.  These efforts
were wildly successful.&lt;/p&gt;
&lt;p&gt;However, FPGAs have faced and continue to face stiff competition in machine learning inference.
Several companies during the same time frame have successfully developed inference ASICs, Nvidia released
inference-focused GPUs, and the AVX-512 and VNNI instruction sets from Intel have helped to improve
neural network inference performance on CPUs.  FPGAs in machine learning applications now have to
justify their cost compared to both dedicated hardware and devices that are more general-purpose.&lt;/p&gt;
&lt;h4 id=&#34;moving-past-general-purpose-hls&#34;&gt;Moving Past General-Purpose HLS&lt;/h4&gt;
&lt;p&gt;In 2019, Xilinx released Vitis, a platform for HLS and development of hardware accelerators.  Vitis
introduced the approach of hardware-accelerating several common software libraries, including FFmpeg,
TensorFlow, BLAS, and OpenCV. Additionally, Vitis can be used for HLS of C++ and Matlab, along with
synthesis of P4 packet processing programs.&lt;/p&gt;
&lt;p&gt;The Vitis suite appears to be a concession to the idea that a general-purpose HLS system wouldn&amp;rsquo;t
work for FPGA accelerators.  Instead, Vitis is a suite of special-purpose compilers.  This approach
appears to be working: it has resulted in FPGAs expanding to many important niches.&lt;/p&gt;
&lt;h2 id=&#34;fpga-acceleration-hits&#34;&gt;FPGA Acceleration Hits&lt;/h2&gt;
&lt;p&gt;Beyond machine learning, FPGA acceleration has seen many successes that leverage the unique
strengths of FPGAs.  However, these accelerators haven&amp;rsquo;t necessarily benefitted
from high-level synthesis the way that the FPGA companies hoped.&lt;/p&gt;
&lt;h4 id=&#34;high-frequency-trading&#34;&gt;High-frequency Trading&lt;/h4&gt;
&lt;p&gt;FPGA acceleration in high-frequency trading began long before 2016.  FPGAs offered a unique value
proposition for HFTs, since they allow you to respond to an incoming message with extremely low
latency.  Software packet processing relies on doing one step at a time: receiving a packet in full,
parsing it, and generating a response.  FPGAs allowed the HFT industry to do that in a streaming
fashion.  As a result, today&amp;rsquo;s HFTs can execute trades in tens of nanoseconds, 100x better than the
fastest software solutions, which needed microseconds.&lt;/p&gt;
&lt;p&gt;Since HFT happens in nanoseconds, there isn&amp;rsquo;t really much room for HLS systems.  Reportedly, some
HFT companies have invested in domain-specific HLS systems, but they have kept very quiet on any
successes.&lt;/p&gt;
&lt;p&gt;The HFT companies have one major advantage that I have not seen anywhere else in the FPGA industry:
they treat FPGA hardware like software.  It is software: you are not spending $10 million making
a run of chips, you are just reprogramming something.  Applying software engineering
methodologies and a software engineering mindset has allowed these companies to be more agile with
FPGAs than everyone else.  Traditional FPGA companies can likely learn a lot from this process.&lt;/p&gt;
&lt;h4 id=&#34;smartnics&#34;&gt;SmartNICs&lt;/h4&gt;
&lt;p&gt;FPGA-enabled SmartNICs have helped Microsoft build Azure, and Intel and Altera are both investing
heavily in SmartNICs.  Reportedly, some of the Chinese clouds also use a similar approach to Azure,
using an FPGA to accelerate virtualized networking.&lt;/p&gt;
&lt;p&gt;In the case of SmartNICs and network acceleration, the throughput offered by building a wide pipeline
on an FPGA is hard to match.  The FPGA is also programmable, and is not constrained by a limited
instruction set the way that other packet-processing accelerators are.  A traditional CPU just cannot
reach 100 Gbps at the same latency that it gets at 1 Gbps, while an FPGA can do that (comparatively)
easily.&lt;/p&gt;
&lt;p&gt;SmartNICs benefit from P4 acceleration pipelines, and may be a platform for high-level synthesis
of other domain-specific packet processing paths.  I am not currently aware of an HLS system for
eBPF programs, but that may be a promising solution for more flexible systems.&lt;/p&gt;
&lt;h4 id=&#34;computational-genomics&#34;&gt;Computational Genomics&lt;/h4&gt;
&lt;p&gt;Computational genomics is one application where FPGAs appear to have won.  Companies like Gemalto
and Deneb Genetics offer solutions for genome processing on FPGAs.  To my non-expert eye, computational
genomics appears to be driven by the ability to make thousands of comparisons at the same time.
This lends itself nicely to FPGAs, which can be turned into large banks of comparators.&lt;/p&gt;
&lt;p&gt;As far as I am aware, many computational genomics systems don&amp;rsquo;t use HLS any more, but may have
started in OpenCL.&lt;/p&gt;
&lt;h2 id=&#34;the-best-is-hopefully-yet-to-come&#34;&gt;The Best is (Hopefully) Yet to Come&lt;/h2&gt;
&lt;p&gt;Domain-specific acceleration offers a lot of promise for FPGAs, but will never have a flashy
&amp;ldquo;CUDA&amp;rdquo; moment.  As of today, the FPGA companies appear to have made their peace with that.  Still,
there is tremendous promise for FPGAs, particularly accelerating network and storage applications
in the coming years.&lt;/p&gt;
&lt;h4 id=&#34;a-new-development-paradigm&#34;&gt;A New Development Paradigm&lt;/h4&gt;
&lt;p&gt;Languages like HardCaml, Chisel, and Bluespec promise to close the productivity gap between
high-level languages and HDLs.  Contrary to what we thought
in 2016, the future of FPGA acceleration may not be in the hands of software engineers.  It may come
from hardware engineers, in which case it will come from closing the productivity gaps between
hardware and software development, and learning to treat FPGA programming with the same mindset
that we use for software code.&lt;/p&gt;
&lt;p&gt;This is the same paradigm shift that happened to software in the 1980-90s. It took a long time: the
first processors came out in the 1950s, and it took 30-40 years to invent the &amp;ldquo;software mindset.&amp;rdquo;
FPGAs are a more recent invention, with the first ones appearing in the mid-1980s, and there are still
a lot of kinks to work out.&lt;/p&gt;
&lt;h4 id=&#34;the-elephant-in-the-room-access&#34;&gt;The Elephant in the Room: Access&lt;/h4&gt;
&lt;p&gt;FPGAs have one big problem still compared to GPUs and CPUs, and this is a problem that has not
gotten better over their history: &lt;em&gt;FPGAs are expensive and hard to access.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;FPGA accelerator cards cost thousands of dollars.  I first wrote CUDA using my $200 home
GPU.  Many of us did our first programming on devices that we use every day.  A Xilinx Alveo
card costs over $3,000 today.  It needs a $3,000 tool suite to program.  This takes FPGA
acceleration out of the range where an everyday hobbyist can experiment with it.  Most FPGA
programmers today learned how to program an FPGA by doing research or taking a course in college, run
by the electrical engineering department. There is none of the same organic experimentation
that we get in software.&lt;/p&gt;
&lt;p&gt;Further, cheap FPGA boards and small FPGAs today are aimed at hardware engineers who want to build
a thing, not software engineers who want to try acceleration.  This is a totally different problem
for a totally different group of people!&lt;/p&gt;
&lt;p&gt;A final obstacle is that high-quality open-source code for FPGAs is almost non-existent,
and there isn&amp;rsquo;t a &amp;ldquo;Linux&amp;rdquo; of hardware or a standard library. Components like Ethernet I/O cores
cost tens of thousands of dollars and often contain bugs. The lack of these libraries and examples
means that there isn&amp;rsquo;t a good way to do FPGA accelerator development without significant investment
in &amp;ldquo;boilerplate&amp;rdquo; technologies.&lt;/p&gt;
&lt;p&gt;Open source and standardization would likely help on all fronts here.  The economists
who are in control at Xilinx and Intel are probably right that standardization and open sourcing
key technologies will hurt them in the short term.  Despite this, it may be key to the long-term
success of the FPGA market in computing.&lt;/p&gt;
&lt;p&gt;Until FPGA vendors can improve access to acceleration-ready devices and the toolchains and libraries
that can be used to program them, they may be stuck as &amp;ldquo;the accelerators of tomorrow.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;epilogue-my-own-use-of-hls&#34;&gt;Epilogue: My Own Use of HLS&lt;/h3&gt;
&lt;p&gt;In 2015, I spent a few weeks looking at Altera&amp;rsquo;s OpenCL system.  I was very bullish on the technology,
but it wasn&amp;rsquo;t quite mature enough at the time and didn&amp;rsquo;t match the use case.&lt;/p&gt;
&lt;p&gt;More recently, I have been trying to figure out Xilinx HLS for my current FPGA project, hoping it
will save some work and help me generate good hardware quickly, but I am going back to SystemVerilog
in defeat.  Learning the toolchain and the language of &lt;code&gt;pragmas&lt;/code&gt; just wasn&amp;rsquo;t worth it.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Teach Your Kids Bridge</title>
      <link>https://specbranch.com/posts/teach-bridge/</link>
      <pubDate>Sat, 21 May 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/teach-bridge/</guid>
      <description>A post recently made the rounds on hacker news claiming that you should teach your kids poker, not chess. The comments on that post go through a lot of the reasons why poker is a bad game to teach your children, but I felt that I was well suited to opine on this topic, and explain why duplicate bridge is the best game for practicing the life skills involved in business and programming, compared to all of the alternatives.</description>
<content:encoded>&lt;p&gt;A post recently made the rounds on &lt;a href=&#34;https://news.ycombinator.com/news&#34;&gt;Hacker News&lt;/a&gt; claiming that
&lt;a href=&#34;https://momentofdeep.substack.com/p/teach-your-kids-poker-not-chess?s=r&#34;&gt;you should teach your kids poker, not chess&lt;/a&gt;.
The comments on that post go through a lot of the reasons why poker is a bad game to teach your
children, but I felt that I was well suited to opine on this topic, and explain why duplicate
bridge is the best game for practicing the life skills involved in business and programming,
compared to all of the alternatives.&lt;/p&gt;
&lt;p&gt;I am a passionate player of strategy games, but never a professional player at any.  I played chess
as a child, ultimately studying to the point where I can play at ~1500 play strength today.
In high school, I was a passionate Magic: the Gathering player.  I played a lot at the competitive level
and also became a level 2 Magic Judge for a brief time.  In college, I started to seriously play bridge
and poker.  When I turned 21, I made money playing Texas Hold&amp;rsquo;em and Pot Limit Omaha (the latter of which
I enjoyed more, but had fewer players). In college, I played bridge with the college club twice a week,
and became decent at the game.  Today, I am mostly a bridge player, but I still play chess, Magic, and
poker once in a while.  I believe that bridge is the best game for learning life skills, and that it is
a great game to play as an amateur.&lt;/p&gt;
&lt;h2 id=&#34;the-game-of-bridge&#34;&gt;The Game of Bridge&lt;/h2&gt;
&lt;p&gt;Bridge is unfortunately associated with an aging population of players and is a lot less &amp;ldquo;cool&amp;rdquo; than
chess or poker.  However, there is a small community of young players that are both friendly and
competitive, hopefully the beginning of a renaissance for the game.&lt;/p&gt;
&lt;p&gt;Bridge is played with four players, two teams of two, using a standard deck of cards.  Each player sits
across from their teammate.  On each hand, the entire deck is dealt out, and then the game goes through
two phases: bidding and playing.&lt;/p&gt;
&lt;p&gt;During the bidding phase, players call out public bids to determine the
&lt;em&gt;contract&lt;/em&gt;, exchanging information about their hands through the limited information of a number and a
suit (&amp;ldquo;one spade&amp;rdquo; is a bridge bid).  Information exchange occurs during the bidding phase using agreements
about the meaning of each bid called bidding conventions.  Partnerships can come up with their own agreement
or use one of a few common bidding conventions, like Standard American, 2-over-1, and Acol.  Learning bidding
conventions is like learning chess openings: you need to spend time learning one of them.&lt;/p&gt;
&lt;p&gt;During the playing phase, one player from the partnership that won the bidding places their hand face up on
the table, and the other player from that partnership plays a
&lt;a href=&#34;https://en.wikipedia.org/wiki/Trick-taking_game&#34;&gt;trick-taking game&lt;/a&gt; (similar to hearts or spades) against
the opposing pair to try to take as many tricks as possible.  The person playing the contract is called
the &amp;ldquo;declarer,&amp;rdquo; while the other two players are &amp;ldquo;defenders&amp;rdquo; (the player whose hand is face-up is the &amp;ldquo;dummy&amp;rdquo;).
The playing phase is then scored based on whether the declarer has made the number of tricks promised in the
contract.  Taking as many tricks as promised (or more) is called &amp;ldquo;making the contract,&amp;rdquo; and offers rewards
that depend on the level of the contract and the number of tricks taken, while taking fewer tricks is &amp;ldquo;going
down.&amp;rdquo; The defending side is rewarded for making the declarer go down. Playing the hand is often subtle and challenging,
regardless of whether you are defending or declaring, and every trick matters regardless of whether you are
expecting to take 10 tricks or 2.&lt;/p&gt;
&lt;h2 id=&#34;the-benefits-of-bridge&#34;&gt;The Benefits of Bridge&lt;/h2&gt;
&lt;h4 id=&#34;managing-information&#34;&gt;Managing Information&lt;/h4&gt;
&lt;p&gt;Information management is central to bridge.  During the bidding phase, you and your partner determine what is
important to share with each other to determine a contract.  During the playing phase, you have to work with the
fact that two other hands are not publicly known.
&lt;a href=&#34;https://en.wikipedia.org/wiki/Contract_bridge_probabilities&#34;&gt;Probability&lt;/a&gt; comes into play here: understanding
distributions of cards and possible hands that opponents can have will allow you to figure out how to play for
the maximum number of tricks.  The bidding and playing phases are all about deriving the maximum value you can
from the limited information you have, much like life.&lt;/p&gt;
&lt;h4 id=&#34;planning-a-line&#34;&gt;Planning a Line&lt;/h4&gt;
&lt;p&gt;Like chess, bridge play often involves thinking several tricks ahead and planning your line of play.  However,
you have to plan for contingencies and plan to mitigate the effects of the unknown information.  Because you
know that the entire deck is dealt out, there is often a tractable amount of missing information.  There
are 13 tricks in a hand, which is long enough to plan a line based on a few critical decisions, but short enough
that you can reasonably work with the probability space.&lt;/p&gt;
&lt;p&gt;There is a tremendous amount of depth to bridge play strategy: For example,
&lt;a href=&#34;https://en.wikipedia.org/wiki/Squeeze_play_(bridge)&#34;&gt;squeezes&lt;/a&gt; and &lt;a href=&#34;https://en.wikipedia.org/wiki/Endplay&#34;&gt;endplays&lt;/a&gt;
are examples of ways to play a bridge hand based on giving an opponent options for how they want to give up a trick
to you. To execute an endplay or a squeeze, you have to plan for a lot of possible hands and outcomes.&lt;/p&gt;
&lt;p&gt;Additionally, there is a lot of strategy on the defensive side, including strategy on how to make an opening lead
and strategy around how to give information to your partner.&lt;/p&gt;
&lt;h4 id=&#34;asymmetry-not-randmoness&#34;&gt;Asymmetry, not Randomness&lt;/h4&gt;
&lt;p&gt;The cards that are distributed around the table are not distributed fairly.  The game of bridge teaches you to
deal with the cards you are dealt and maximize the number of tricks your partnership takes in every circumstance.
Due to duplication and the contract scoring system, one trick often matters a lot, and playing weak hands well
matters as much as playing strong hands well.&lt;/p&gt;
&lt;p&gt;Bridge games are set up so that the aspect of &amp;ldquo;who has the good cards&amp;rdquo; is completely mitigated. Bridge
is often played &amp;ldquo;duplicated&amp;rdquo;: many tables play with the same sets of hands, and your result is compared against
other pairs (at other tables) with the same hands, &lt;em&gt;not&lt;/em&gt; the other pair at the same table.  In this way, playing
weak hands well is rewarded, while playing strong hands poorly is punished.  The randomness of &amp;ldquo;who gets the cards&amp;rdquo;
is mitigated in the scoring.  You can still lose a hand to someone who takes a different (possibly worse) line of
play and happens to catch the cards in the right places, but the skill difference tends to average out over 20-30
hands, which is normal for a bridge tournament.&lt;/p&gt;
&lt;h4 id=&#34;teamwork&#34;&gt;Teamwork&lt;/h4&gt;
&lt;p&gt;Bridge is a partnership game, and you are going to have to learn to work with your partner.  This means learning
your partner&amp;rsquo;s tendencies, learning how to work with others, and learning to become a good communicator.  Like a
team game, your individual skill is only half of your team&amp;rsquo;s success.  Bridge will also involve miscommunications
and mistakes, and so you will also learn to ask for forgiveness and to forgive others for their mistakes.&lt;/p&gt;
&lt;p&gt;Partners and teams will often have a postmortem after a bridge game where they review the hands and learn from
them.  The postmortem will help you learn to give and take criticism constructively (although postmortems are not
always constructive).&lt;/p&gt;
&lt;h4 id=&#34;preparation&#34;&gt;Preparation&lt;/h4&gt;
&lt;p&gt;Learning to work with bidding conventions and learning to recognize certain types of tactics during the play
of the hand are important for being a good bridge player. This kind of preparation rewards you the same way that
learning chess openings rewards you. Crafting new bidding conventions or modifying the ones you use with a
partner are also important for maximizing your chances of winning. Many different kinds of bidding conventions
are out there, and you can learn a lot from learning about them and thinking about them.&lt;/p&gt;
&lt;h2 id=&#34;comparison-to-the-alternatives&#34;&gt;Comparison to the Alternatives&lt;/h2&gt;
&lt;p&gt;In comparison to bridge, there are several alternative games on offer, including poker, chess, trading card games,
and other board games.  I am going to be categorizing these on a few axes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Imperfect vs perfect information: In a perfect information game, both players have the same information about
the state of the game.&lt;/li&gt;
&lt;li&gt;Asymmetric vs symmetric: In a symmetric game, both players are playing with the same objectives and set of pieces.
Asymmetric games have different objectives or different pieces.&lt;/li&gt;
&lt;li&gt;Luck vs skill: In a luck game, some information about the game state is hidden from all players (&lt;em&gt;eg&lt;/em&gt; with
dice rolls or facedown cards).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;By this categorization, bridge is an &lt;strong&gt;imperfect information asymmetric skill game&lt;/strong&gt;.  It is very unusual to have
imperfect information but no luck in a game, since imperfect information is often derived from luck.  Bridge
derives all of its information gaming from asymmetry, and mitigates luck through the way it is scored.  In my
opinion, it is hard to have a richer strategic space or a better learning environment than you can find in an
imperfect information skill game.&lt;/p&gt;
&lt;p&gt;Because we are talking about games for kids, I will also be talking a little bit about the communities around
each game, as I have observed them.  One point against bridge is that the bridge community as a whole is not the
friendliest or the most welcoming.  However, you can find players, particularly younger players, who are happy to
play with new players who are learning the game.&lt;/p&gt;
&lt;h4 id=&#34;poker&#34;&gt;Poker&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Imperfect information symmetric luck game&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Poker is a game of information, but poker is also primarily a gambling game: money and gambling are inseparable
from poker.  People will play poker differently when they are playing for large amounts of money than when
they are playing for small amounts of money: they are a lot more conservative when a lot of cash is on the line.
Due to the nature of pot sizes in poker, people in small games often find themselves making big-money decisions
and vice versa.  However, if you are playing for &lt;em&gt;no&lt;/em&gt; money, that dynamic disappears and the game changes a lot.
Unless you want to introduce gambling to your children, this fact alone disqualifies poker.&lt;/p&gt;
&lt;p&gt;A lot of people think that high-level poker is a lot more about &amp;ldquo;tells&amp;rdquo; and &amp;ldquo;reading your opponent&amp;rdquo; than doing
math.  This is not the case.  Playing pot odds poker (straight math) will often earn you money at low-stakes
games because people look too hard for tells.  High-level poker players will talk and think a lot about &amp;ldquo;ranges&amp;rdquo;
and probabilities.  A &amp;ldquo;range&amp;rdquo; is a set of hands that would act in a particular way, and high-level players
understand that betting at a poker table is more about your range and your opponents&amp;rsquo; ranges, and how you
stack up.&lt;/p&gt;
&lt;p&gt;Since poker involves luck, when you lose or win a hand, you have the task of separating out your skill from your
luck.  In all luck games, people are biased in favor of attributing your wins to skill and your losses to luck,
and it takes a lot of discipline to avoid this fallacy.  However, poker does involve a lot more skill than most
luck games.&lt;/p&gt;
&lt;h4 id=&#34;chess&#34;&gt;Chess&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Perfect information symmetric skill game&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you want to become a professional at a strategy game as a child, I would personally suggest chess.  There
is a tremendous amount of strategy to study around chess, and at the high levels it actually has a lot of the
&amp;ldquo;information gaming&amp;rdquo; aspects of poker or bridge.  High-level chess involves a lot of learning and preparation,
and you can gain an advantage against another player by playing an opening or aiming for a position that you
have studied and they have not.&lt;/p&gt;
&lt;p&gt;However, at lower levels, chess games are mostly determined by who makes fewer blunders.  There isn&amp;rsquo;t much
of an information game when you sometimes leave pieces open for your opponent to capture using basic tactics
and combinations.  It takes a tremendous amount of thought and attention to play the game to even this level.
High-level chess takes on aspects of an imperfect information bluffing game, but lower-level chess is about avoiding mistakes
with perfect information.&lt;/p&gt;
&lt;p&gt;Finally, there is a vibrant community of serious chess players everywhere, and there are a tremendous
number of resources for people who want to go from &amp;ldquo;serious casual&amp;rdquo; to &amp;ldquo;professional.&amp;rdquo;&lt;/p&gt;
&lt;h4 id=&#34;go&#34;&gt;Go&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Perfect information symmetric skill game&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know much about how to play go, but I understand that it is similar to chess: the low-level game is
about avoiding mistakes, while the high-level game has components of strategy and preparation.&lt;/p&gt;
&lt;h4 id=&#34;trading-card-games&#34;&gt;Trading Card Games&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Imperfect information asymmetric luck games&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Trading card games like Magic: the Gathering and Pokemon are often marketed as being &amp;ldquo;combinations of poker
and chess.&amp;rdquo;  I have generally found that not to be the case.  In my experience, randomness determines the
outcomes of games more than skill or information asymmetry.  Generally, players start with a hand of cards
and draw one card per turn from the top of their deck.  This trickle of resources both creates luck and
limits the asymmetry of information: you only get one new card per turn to bluff with, and you can find
yourself seeing a good mix of resources or being starved of one resource or another.&lt;/p&gt;
&lt;p&gt;Good magic players have some ability to limit the effects of randomness, but much less than in other luck games
on this list.  The top ranked players by the &lt;a href=&#34;http://www.mtgeloproject.net/leaders.php&#34;&gt;MTG Elo Project&lt;/a&gt;
have win rates below 66%, and their &amp;ldquo;GP records&amp;rdquo; (games played in &amp;ldquo;Grand Prix&amp;rdquo; tournaments,
where anyone can enter) generally have a win rate below 70%.  These highly skilled players cannot crack a 75%
win rate over random players!  Furthermore, these are match wins: their game win rate would be closer to 50%
than to 75%, since best-of-three matches have the effect of decreasing the effects of luck.&lt;/p&gt;
&lt;p&gt;Furthermore, trading card games are expensive to play, with a single competitive Magic deck costing
$300-2000.  In an environment where &amp;ldquo;the best&amp;rdquo; deck changes every 3 months, there is a tremendous amount of
money required to play seriously.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Limited&amp;rdquo; format tournaments, where the cards are provided for the tournament, are also fairly expensive because
buying packs of cards is expensive, and packs don&amp;rsquo;t hold their value once they are opened.  For example, a
sealed Grand Prix usually has a $100 entry fee, and on average you will walk out with &amp;lt;$10 of new cards.  A
casual draft may have a $25 or less entry fee, though, and I find these to be the best way to play magic.&lt;/p&gt;
&lt;p&gt;One final note: competitive Magic players are often a lot less friendly than competitive players of other games.
I don&amp;rsquo;t know why this is.  Depending on who you ask, they may also be less honest: competitive Magic has had
a lot of cheating scandals in its recent history.&lt;/p&gt;
&lt;h4 id=&#34;bonus-scrabble&#34;&gt;Bonus: Scrabble&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Imperfect information symmetric luck game&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Scrabble is a great casual game that can teach your child a tremendous amount of vocabulary, particularly
two- and seven-letter words.  I learned such words as &amp;ldquo;aa,&amp;rdquo; &amp;ldquo;cwm,&amp;rdquo; and &amp;ldquo;aorists&amp;rdquo; playing Scrabble
with my grandmother.&lt;/p&gt;
&lt;p&gt;Like trading card games, Scrabble is an imperfect information game played between two players with an aspect
of randomness.  However, high-level Scrabble players are really good at removing variance from the game by
opening up fewer spaces where new words can be played.  The fact that the dictionary of words can be
completely memorized also helps.  Like poker, high-level Scrabble players know how to mitigate luck to the
point where it is almost a skill game.&lt;/p&gt;
&lt;p&gt;I have never played competitive scrabble, but I have heard that the competitive scene is the most welcoming
of any game.&lt;/p&gt;
&lt;h2 id=&#34;give-bridge-a-try&#34;&gt;Give Bridge a Try&lt;/h2&gt;
&lt;p&gt;Compared to the alternatives, bridge offers a rewarding experience that combines asymmetric gameplay and
imperfect information while avoiding the effects of luck.  This makes bridge a perfect game to teach a
growing young mind.  Your children will learn to deal with the asymmetric circumstances of life and with
others who hold hidden information, while also learning about planning lines of play and the value of teamwork.&lt;/p&gt;
&lt;p&gt;A lot of the alternatives are great games, but if you are interested in teaching and learning useful life
skills through the platform of a game, there are few alternatives that offer the richness of contract bridge.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Fixed Point Arithmetic</title>
      <link>https://specbranch.com/posts/fixed-point/</link>
      <pubDate>Wed, 18 May 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/fixed-point/</guid>
      <description>When we think of how to represent fractional numbers in code, we reach for double and float, and almost never reach for anything else. There are several alternatives, including constructive real numbers that are used in calculators, and rational numbers. One alternative predates all of these, including floating point, and actually allows you to compute faster than when you use floating point numbers. That alternative is fixed point: a primitive form of decimal that does not offer any of the conveniences of float, but allows you to do decimal computations more quickly and efficiently.</description>
      <content:encoded>&lt;p&gt;When we think of how to represent fractional numbers in code, we reach for &lt;code&gt;double&lt;/code&gt; and &lt;code&gt;float&lt;/code&gt;,
and almost never reach for anything else.  There are several alternatives, including
&lt;a href=&#34;https://dl.acm.org/doi/pdf/10.1145/2911981&#34;&gt;constructive real numbers&lt;/a&gt; that are used in calculators,
and &lt;a href=&#34;https://docs.python.org/3/library/fractions.html&#34;&gt;rational numbers&lt;/a&gt;.  One alternative predates
all of these, including floating point, and actually allows you to compute faster than when you
use floating point numbers.  That alternative is fixed point: a primitive form of decimal that does
not offer any of the conveniences of &lt;code&gt;float&lt;/code&gt;, but allows you to do decimal computations more quickly
and efficiently. Fixed point is still used in some situations today, and it can be a potent tool
in your arsenal as a programmer if you find yourself working with math at high speed.&lt;/p&gt;
&lt;h2 id=&#34;introduction-to-fixed-point-numbers&#34;&gt;Introduction to Fixed Point Numbers&lt;/h2&gt;
&lt;p&gt;Fixed point numbers are conceptually simple: they are like integers, except that the decimal
point is somewhere other than after the rightmost bit.  Fixed-point numbers are stored in identical
formats to integers.  Mathematically, fixed point numbers have the following value:&lt;/p&gt;
&lt;p&gt;$$ N = I * 2^{-k} $$&lt;/p&gt;
&lt;p&gt;where $I$ is the integer value of the number and $k$ is a constant number of bits that are to the
right of the decimal point.  This allows you to represent non-integers (positive $k$) and large
numbers (negative $k$) with the same precision, dynamic range, and storage format as an integer.&lt;/p&gt;
&lt;p&gt;In comparison, floating point numbers are much more complicated.  Here is how a double-precision
floating point number relates to the bits that are stored:&lt;/p&gt;
&lt;p&gt;$$ N = (-1)^S * 2^{E - 1023} * M $$&lt;/p&gt;
&lt;p&gt;Where $S$ is a sign bit, $E$ is an 11-bit exponent, and $M$ is a mantissa, which is a 53 bit number
comprised of a 52-bit fraction stored in the number and an implied 1 or 0 bit depending on the
value of the exponent. Fixed point is much simpler!&lt;/p&gt;
&lt;h2 id=&#34;examples-of-fixed-point-numbers&#34;&gt;Examples of Fixed Point Numbers&lt;/h2&gt;
&lt;p&gt;Some examples are here, showing numbers represented in 16-bit unsigned fixed point:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;Number&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Fractional Bits ($k$)&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Hex&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Hex with Decimal Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0003&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0001&lt;/code&gt;$.$&lt;code&gt;1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0018&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;001&lt;/code&gt;$.$&lt;code&gt;8&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0180&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$0.75$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0003&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0000&lt;/code&gt;$.$&lt;code&gt;3&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$0.75$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;00C0&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;00&lt;/code&gt;$.$&lt;code&gt;C0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$0.75$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;C000&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.$&lt;code&gt;C000&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$2^{-16}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0001&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.$&lt;code&gt;0001&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;You can also use fixed point numbers to imply a number of leading or trailing zeros by using
large values of $k$ (to get leading zeros) or negative values of $k$ (to get trailing zeros):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;Number&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Fractional Bits ($k$)&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Hex&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Hex with Decimal Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$2^{-32}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;32&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0001&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.0000$&lt;code&gt;0001&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$2^{-16}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;32&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0100&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.0000$&lt;code&gt;0100&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$2^{16}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0001&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0001&lt;/code&gt;$0000.0$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$2^{16}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;-8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0100&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0100&lt;/code&gt;$00.0$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here are a few numbers that can&amp;rsquo;t be exactly represented in 16-bit unsigned fixed point with a given
number of fractional bits:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;Number&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Fractional Bits ($k$)&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.25$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Between &lt;code&gt;0002&lt;/code&gt; $(1.0)$ and &lt;code&gt;0003&lt;/code&gt; $(1.5)$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Overflow, greater than &lt;code&gt;FFFF&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$2^{-32}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;16&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Between &lt;code&gt;0000&lt;/code&gt; $(0)$ and &lt;code&gt;0001&lt;/code&gt; $(2^{-16})$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Fixed point numbers can also be signed using two&amp;rsquo;s complement, like signed integers.  For example:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;Number&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Fractional Bits ($k$)&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Hex&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Hex with Decimal Point&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0018&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;001&lt;/code&gt;$.$&lt;code&gt;8&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$-1.5$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;FFE8&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;FFE&lt;/code&gt;$.$&lt;code&gt;8&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.75$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;01C0&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;C0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$-1.75$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;FE40&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;FE&lt;/code&gt;$.$&lt;code&gt;40&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$0.75$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;6000&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;0&lt;/code&gt;$.$&lt;code&gt;600&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$-0.75$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;a000&lt;/code&gt;&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;&lt;code&gt;1&lt;/code&gt;$.$&lt;code&gt;200&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2 id=&#34;fixed-point-operations&#34;&gt;Fixed Point Operations&lt;/h2&gt;
&lt;p&gt;Fixed point numbers can be used to efficiently compute every usual arithmetic operation using no more
than one integer operation and one fixed-offset bit shift.  However, unlike floating point operations,
fixed point operations are not hardware accelerated on modern server CPUs, and no batteries are included!&lt;/p&gt;
&lt;h4 id=&#34;addition-and-subtraction&#34;&gt;Addition and Subtraction&lt;/h4&gt;
&lt;p&gt;Fixed point addition and subtraction are exactly like integer addition and subtraction if the
numbers have the same number of fractional bits, but a shift is required to add and subtract numbers
with different $k$.  Here is an example with the same number of fractional bits (using subscripts
to show the number of fractional bits):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer ADD&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;40&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;02&lt;/code&gt;$.$&lt;code&gt;C0&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$+$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$1.25_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$2.75_8$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Here is an example of addition without justification to align the decimal point:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer ADD&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;001&lt;/code&gt;$.$&lt;code&gt;4&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;94&lt;/code&gt; / &lt;code&gt;019&lt;/code&gt;$.$&lt;code&gt;4&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$+$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$1.25_4$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\neq$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$1.578125_8$ or $25.25_4$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The result of the addition is not correct without shifting to justify.  The correct process looks like:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Right shift 4&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;001&lt;/code&gt;$.$&lt;code&gt;8&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer ADD&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;001&lt;/code&gt;$.$&lt;code&gt;4&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;002&lt;/code&gt;$.$&lt;code&gt;C&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Convert to $x_4$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_4$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$+$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$1.25_4$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$2.75_4$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;This means that adding two fixed point numbers is either just a simple integer addition or a constant
shift and an add.  That is much faster than a floating point addition.  It also means that when the
numbers have the same number of fractional bits, addition is associative, so operations can be re-ordered
freely without affecting the result.&lt;/p&gt;
&lt;p&gt;Subtraction is much the same as addition. The two operands need to be aligned to have the same number of
fractional bits, but once that occurs, you can use an integer subtraction operation. Also, negative numbers
work just fine:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer SUB&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;02&lt;/code&gt;$.$&lt;code&gt;40&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;FF&lt;/code&gt;$.$&lt;code&gt;40&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$-$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$2.25_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$-0.75_8$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
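<p>The alignment rule above can be sketched in C. This is an illustrative helper of my own, not the article's code, and it punts on a real concern: the aligned operands can still overflow.</p>

```c
#include <stdint.h>

/* Add two two's-complement fixed point values whose fractional bit counts
   k1 and k2 may differ; the result carries the smaller count. The right
   shift of a negative int16_t is arithmetic on mainstream compilers. */
static int16_t fixed_add(int16_t a, int k1, int16_t b, int k2) {
    if (k1 > k2)
        return (int16_t)((a >> (k1 - k2)) + b); /* align a down to k2 bits */
    return (int16_t)(a + (b >> (k2 - k1)));     /* align b down to k1 bits */
}
```

<p>Subtraction is identical except for the final integer operation; the negative-number example in the table works through the same path.</p>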
&lt;h4 id=&#34;pitfalls-when-justifying-fixed-point-numbers&#34;&gt;Pitfalls when Justifying Fixed Point Numbers&lt;/h4&gt;
&lt;p&gt;Shifting right to remove fractional bits can result in loss of precision:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;88&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Right shift 4&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;001&lt;/code&gt;$.$&lt;code&gt;8&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.53125_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Inexact conversion to $x_4$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_4$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Additionally, a left shift to add fractional bits can result in an overflow:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;100&lt;/code&gt;$.$&lt;code&gt;8&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Left shift 4&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;00&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$256.5_4$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;Overflow converting to $x_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.5_8$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The loss of precision on a right shift is similar to what happens when adding floating point numbers.
However, floating point numbers do not overflow during alignment: the exponent adjusts automatically.
Using fixed point numbers requires careful analysis of the ranges and required precisions of the
numbers involved to avoid the risk of overflow.&lt;/p&gt;
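<p>Both hazards from the tables above fit in two lines of C (hypothetical helper names; conversions between $k=8$ and $k=4$ on a 16-bit register):</p>

```c
#include <stdint.h>

/* Right-shifting drops fractional bits (precision loss); left-shifting can
   push significant bits off the top of the 16-bit register (overflow). */
static uint16_t k8_to_k4(uint16_t a) { return (uint16_t)(a >> 4); } /* may lose precision */
static uint16_t k4_to_k8(uint16_t a) { return (uint16_t)(a << 4); } /* may overflow */
```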
&lt;h4 id=&#34;multiplication&#34;&gt;Multiplication&lt;/h4&gt;
&lt;p&gt;Multiplication and division are different: when two fixed point numbers are multiplied or divided,
the integer components go through the usual integer operation, but the result has a different $k$
value than the inputs.  For multiplication, the result has $k = k_1 + k_2$, and is computed using an
integer multiplication instruction.  For example, if you multiply two numbers with 4 fractional bits
each, the result will have 8 fractional bits, and it must be shifted back if you would like a result
with 4 fractional bits.&lt;/p&gt;
&lt;p&gt;This means that multiplying a fixed point number by an integer results in a fixed point number with
the same number of fractional bits:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer MUL&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;0005&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;07&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\times$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$5_{INT}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$7.5_8$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The full result of an integer multiplication has twice as many bits as each input.
ARM allows you to access the high part of the multiplication using the &lt;code&gt;SMULH&lt;/code&gt; and &lt;code&gt;UMULH&lt;/code&gt; instructions,
and x86 provides a double-width result from the &lt;code&gt;MUL&lt;/code&gt; and &lt;code&gt;MULX&lt;/code&gt; instructions using a pair of result
registers.  Here is an example where the high bits of the multiplication carry the important
information of the result:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;$0.$&lt;code&gt;C000&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer MUL&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;$0.$&lt;code&gt;C000&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;$0.$&lt;code&gt;9000&lt;/code&gt; &lt;code&gt;0000&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$0.75_{16}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\times$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.75_{16}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.5625_{32}$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These instructions allow you to use register-width fixed point arithmetic.  Without access to the
high part of the multiplication result, you would be effectively limited to half-width fixed point
(a 64-bit CPU would only be able to multiply 32-bit fixed point numbers).&lt;/p&gt;
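<p>In C there is no portable operator for the high half, but GCC and Clang expose it through the non-standard <code>unsigned __int128</code> type, which compiles down to a single <code>MULX</code>/<code>UMULH</code>-style instruction. A sketch under that assumption:</p>

```c
#include <stdint.h>

/* High 64 bits of a 64x64-bit product, the job UMULH and MULX do in
   hardware. Requires GCC/Clang's unsigned __int128 extension. */
static uint64_t mul_high(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```

<p>Feeding it the $0.75_{64} \times 0.75_{64}$ example from the table returns the fixed point bits of $0.5625$.</p>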
&lt;p&gt;In this example, we can see that the information can be split between the high and low halves of the
result:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer MUL&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;0002&lt;/code&gt;$.$&lt;code&gt;4000&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_{8}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\times$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_{8}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$2.25_{16}$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In cases like this, recovering a number with 8 fractional bits would involve several assembly
instructions to shift the registers into place and combine them.  This would be slower than a
floating-point multiplication: floating point multiplication is actually relatively
easy to compute!  That said, some DSPs and embedded processors have a shifter after their multiplier
to allow native fixed point multiplication without any register justification problems.&lt;/p&gt;
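<p>For 16-bit operands the double-width product fits in a 32-bit integer, so the multiply-then-shift-back pattern can be sketched directly (illustrative helper, truncating rather than rounding):</p>

```c
#include <stdint.h>

/* Multiply two 16-bit fixed point numbers that each carry k fractional
   bits. The 32-bit intermediate plays the role of the CPU's double-width
   product; shifting right by k returns the result to k fractional bits. */
static uint16_t fixed_mul(uint16_t a, uint16_t b, int k) {
    uint32_t wide = (uint32_t)a * (uint32_t)b; /* result has 2k fractional bits */
    return (uint16_t)(wide >> k);              /* truncate back to k bits */
}
```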
&lt;h4 id=&#34;division&#34;&gt;Division&lt;/h4&gt;
&lt;p&gt;For division, the result has $k = k_1 - k_2$ fractional bits when dividing $N_1$ by $N_2$.  Again,
dividing by an integer yields a result that has the same number of fractional bits as the input.  This
is largely similar to multiplication, and again you can use the integer division operation to divide
fixed point numbers.&lt;/p&gt;
&lt;p&gt;Three examples are below.  First, dividing by an integer:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer DIV&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;0003&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;00&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\div$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$3_{INT}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.5_8$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dividing by a number of the same precision:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer DIV&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;00&lt;/code&gt;$.$&lt;code&gt;40&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;0006&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\div$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.25_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$6_{INT}$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Dividing by a fixed point number with a different number of fractional bits:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer DIV&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;000&lt;/code&gt;$.$&lt;code&gt;8&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;003&lt;/code&gt;$.$&lt;code&gt;0&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\div$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.5_4$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$3_4$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Because integer division truncates, division of fixed point numbers can be dangerous, particularly
when the divisor has more fractional bits than the dividend.  Here is one case where an integer
division will not produce a correct result:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;01&lt;/code&gt;$.$&lt;code&gt;80&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer DIV&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;0&lt;/code&gt;$.$&lt;code&gt;400&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;0000&lt;/code&gt;$0.0$&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$1.5_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\div$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.25_{12}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\neq$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0_{-4}$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In this case, the result should have been 6 but the result register cannot represent any number
that isn&amp;rsquo;t a multiple of 16.  If the result would be a multiple of 16, you actually could get a
correct division:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;08&lt;/code&gt;$.$&lt;code&gt;00&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Integer DIV&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;0&lt;/code&gt;$.$&lt;code&gt;400&lt;/code&gt;&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Result:&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;&lt;code&gt;0002&lt;/code&gt;$0.0$&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;center&#34;&gt;$8_8$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$\div$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$0.25_{12}$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$=$&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;$32_{-4}$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;If you find yourself working with fixed point numbers, take great care around divisions.  Integer
division has nonlinearity that occurs due to rounding, which can accidentally end up in your code
if you are not careful to make sure that the dividend has more precision than the divisor.&lt;/p&gt;
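<p>The usual defense is to pre-shift the dividend through a wider register so the quotient keeps the precision you want. A sketch of my own (assumes $k + k_2 \ge k_1$ and a nonzero divisor):</p>

```c
#include <stdint.h>

/* Divide fixed point a (k1 fractional bits) by b (k2 fractional bits),
   widening the dividend to 32 bits first so the quotient comes out with
   k fractional bits instead of k1 - k2. */
static uint16_t fixed_div(uint16_t a, int k1, uint16_t b, int k2, int k) {
    uint32_t num = (uint32_t)a << (k + k2 - k1); /* raise dividend precision */
    return (uint16_t)(num / b);                  /* quotient has k fractional bits */
}
```

<p>With this, the problem case above ($1.5_8 \div 0.25_{12}$) produces $6.0_8$ instead of a truncated zero.</p>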
&lt;h4 id=&#34;operation-chains&#34;&gt;Operation Chains&lt;/h4&gt;
&lt;p&gt;Successful use of fixed point arithmetic often depends on constructing chains of operations that
fit together in sequence, and tracking $k$ and the dynamic range of the values in the operation chain
to avoid overflows and unnecessary shift operations.&lt;/p&gt;
&lt;p&gt;Extremely long operation chains can be found in hardware accelerators and DSP pipelines for RF and
audio systems, where they compute functions like Fourier transforms, filters, and more complicated
functions like &lt;a href=&#34;https://dl.acm.org/doi/pdf/10.1145/3458817.3487397&#34;&gt;atomic interactions&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Unlike floating point, $k$ is a property of a variable, not a stored quantity, so it can take a lot
of developer work to keep track of $k$ for each variable and add shifts when needed.
&lt;a href=&#34;https://www.mathworks.com/discovery/high-level-synthesis.html&#34;&gt;Mathematical computing software&lt;/a&gt;
can also be used to convert high-level code into fixed point operation pipelines.&lt;/p&gt;
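<p>As a tiny illustration of tracking $k$ by hand, here is a hypothetical three-step chain in C that exploits matching fractional bit counts to skip a shift entirely:</p>

```c
#include <stdint.h>

/* Compute y = a*b + c, where a and b carry 8 fractional bits and c carries
   16. The product a*b naturally has k = 8 + 8 = 16, so c adds with no
   shift at all; only the final result is shifted back to 8 bits. */
static uint16_t chain(uint16_t a8, uint16_t b8, uint32_t c16) {
    uint32_t prod16 = (uint32_t)a8 * (uint32_t)b8; /* k = 16 */
    uint32_t sum16  = prod16 + c16;                /* same k: plain add */
    return (uint16_t)(sum16 >> 8);                 /* back to k = 8 */
}
```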
&lt;h2 id=&#34;fixed-point-in-the-wild&#34;&gt;Fixed Point in the Wild&lt;/h2&gt;
&lt;p&gt;Thanks to the proliferation of floating point hardware, fixed point numbers have been relegated
to a few niches, including embedded systems, signal processing, and trading systems.  However,
fixed point numbers can have broad utility if you are interested in computing things as quickly
as possible. Fixed point computations use the integer side of a modern CPU, which allows them to
take advantage of faster instructions and more available instruction ports.&lt;/p&gt;
&lt;p&gt;Financial systems often use a form of decimal fixed point numbers to represent prices.  Prices
are often manipulated as dollars with two decimal places (integer number of cents) or dollars with
four decimal places.  Some bond markets use 8 fractional bits, with the minimum increment of price
being 1/256th of a dollar.  This allows exchanges to have fixed precision across the range of
prices: a one-dollar stock has the same price precision as expensive stocks like Google or
Berkshire Hathaway.  Floating point prices do not have this property: the minimum increment of
a floating point number depends on its magnitude.  For example,
&lt;a href=&#34;https://www.wsj.com/articles/berkshire-hathaways-stock-price-is-too-much-for-computers-11620168548&#34;&gt;the NASDAQ was recently caught using four base 10 decimal places with 32-bit prices&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Fixed point is also commonly used in hardware accelerators and custom hardware, since it has smaller
silicon area than an equivalent floating point calculation and uses less power.  Hardware
accelerators also have the benefit of custom word sizes, and shifting numbers or adding zeros to a
number is completely free in hardware as long as the shift is a fixed size (you can just connect the
wires differently). Arguably, neural networks have rediscovered fixed point: neural network
inference now often uses 8- or 32-bit integer arithmetic.&lt;/p&gt;
&lt;p&gt;Fixed point can often be found in places where resources are tightly constrained.
For example, my &lt;a href=&#34;https://specbranch.com/posts/faster-div8/&#34;&gt;division&lt;/a&gt; experiment used fixed point numbers to represent
quantities between 0 and 1, where we used 16 and 32 bits behind the decimal place.  Audio and DSP
systems frequently use fixed point math for calculations.  Many embedded DSPs don&amp;rsquo;t even have
floating point hardware accelerators because fixed-point math is so effective in their domain.
Interestingly, fixed point is the highest-precision way to represent decimals on a computer and
keep the benefit of reasonably fast math: you get 64 bits of precision compared to 53 for &lt;code&gt;double&lt;/code&gt;.&lt;/p&gt;
&lt;h4 id=&#34;fixed-point-in-compilers&#34;&gt;Fixed Point in Compilers&lt;/h4&gt;
&lt;p&gt;When you add the following function to a C program, you will get an optimization that is enabled
by the math of fixed point arithmetic:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;background-color:#f0f0f0;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-C&#34; data-lang=&#34;C&#34;&gt;uint64_t &lt;span style=&#34;color:#06287e&#34;&gt;div_three&lt;/span&gt;(uint64_t x) {
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;return&lt;/span&gt; x &lt;span style=&#34;color:#666&#34;&gt;/&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;3&lt;/span&gt;;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You would expect that this would compile to a division.  However, we can compute it a different way,
by multiplying by fixed point $\frac{1}{3}$.  The compiler does this!  Its assembly output (with
&lt;code&gt;-O3 -march=skylake&lt;/code&gt;) is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;background-color:#f0f0f0;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-asm&#34; data-lang=&#34;asm&#34;&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;movabs&lt;/span&gt;  &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;0xAAAAAAAAAAAAAAAB&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;mov&lt;/span&gt;     &lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rdi&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;mulx&lt;/span&gt;    &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;shr&lt;/span&gt;     &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;ret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The first line of the assembly stores fixed point $ \left(\frac{2}{3}\right)_{64} $ (rounded up)
in the &lt;code&gt;RAX&lt;/code&gt; register.  The third instruction is the multiplication (the second instruction,
&lt;code&gt;mov rdx, rdi&lt;/code&gt;, sets up an implied operand of the &lt;code&gt;mulx&lt;/code&gt; instruction).  The fourth instruction
is a shift that divides by 2, so the result is equal to $\frac{x}{3}$.&lt;/p&gt;
&lt;p&gt;The multiplication is by $\frac{2}{3}$ instead of $\frac{1}{3}$ so that the result in &lt;code&gt;RAX&lt;/code&gt; has
maximum precision.  $\frac{1}{3}$ can&amp;rsquo;t be exactly represented in fixed point, so the slightly
roundabout method of multiplying by $\frac{2}{3}$  and dividing by 2 is required to avoid
an error in the least significant bit when a large integer is passed in.&lt;/p&gt;
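<p>The same trick can be written out in C, assuming GCC/Clang's non-standard <code>unsigned __int128</code> type for the double-width product:</p>

```c
#include <stdint.h>

/* Divide by 3 via a fixed point reciprocal, mirroring the compiler's
   output: 0xAAAAAAAAAAAAAAAB is (2/3) with k = 64, rounded up; taking the
   high half of the product and halving it yields x / 3 exactly. */
static uint64_t div_three_fixed(uint64_t x) {
    unsigned __int128 product = (unsigned __int128)x * 0xAAAAAAAAAAAAAAABull;
    return (uint64_t)(product >> 64) >> 1; /* high half, then divide by 2 */
}
```

<p>The rounded-up constant is what keeps the result exact for every 64-bit input, including the largest ones.</p>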
&lt;p&gt;Compiler optimizations like this give us the speed benefit of fixed point arithmetic every day.
Using an integer division would be several times slower.&lt;/p&gt;
&lt;h2 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;Fixed point numbers have some speed benefits on server-scale systems and allow you to do
non-integer math on systems without floating point hardware.  However, these benefits cost a lot
of developer time and effort, so sticking with &lt;code&gt;float&lt;/code&gt; and &lt;code&gt;double&lt;/code&gt; is usually the way to go.
Fixed point has its niches, like prices for financial trading and audio processing, and it is a
good tool when you want to compute as fast as possible and can give up some dynamic range.  There
are a few places where fixed point arithmetic can be an important tool, and in those places, it is
invaluable.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>You (Probably) Shouldn&#39;t use a Lookup Table</title>
      <link>https://specbranch.com/posts/lookup-tables/</link>
      <pubDate>Wed, 04 May 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/lookup-tables/</guid>
      <description>I have been working on another post recently, also related to division, but I wanted to address a comment I got from several people on the previous division article. This comment invariably follows a lot of articles on using math to do things with chars and shorts. It is: &amp;ldquo;why are you doing all of this when you can just use a lookup table?&amp;rdquo;
Even worse, a stubborn and clever commenter may show you a benchmark where your carefully-crafted algorithm performs worse than their hamfisted lookup table.</description>
      <content:encoded>&lt;p&gt;I have been working on another post recently, also related to division, but I wanted to address
a comment I got from several people on the previous division article.  This comment invariably
follows a lot of articles on using math to do things with &lt;code&gt;chars&lt;/code&gt; and &lt;code&gt;shorts&lt;/code&gt;.  It is: &amp;ldquo;why
are you doing all of this when you can just use a lookup table?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Even worse, a stubborn and clever commenter may show you a benchmark where your carefully-crafted
algorithm performs worse than their hamfisted lookup table.  Surely you have made a mistake and
you should just use a lookup table.  Just look at the benchmark!&lt;/p&gt;
&lt;p&gt;When you have only 8 or 16 bits of arguments, it is tempting to precompute all of the answers
and throw them in a lookup table.  After all, a table of 256 8-bit numbers is only 256 bytes,
and even 65536 16-bit numbers is well under a megabyte.  Both are a rounding error compared to
the memory and code footprints of software today.  A lookup table is really easy to generate, and
it&amp;rsquo;s just one memory access to find the answer, right?&lt;/p&gt;
&lt;p&gt;The correct response to the lookup table question is the following: &lt;em&gt;If you care about performance,
you should almost never use a lookup table algorithm, despite what a microbenchmark might say.&lt;/em&gt;&lt;/p&gt;
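&lt;p&gt;As a concrete (and hypothetical) example of the kind of function this argument is usually about, consider computing the parity of a byte.  You can precompute all 256 answers, or you can fold the bits together with a few shifts:&lt;/p&gt;

```python
# Option 1: a 256-entry lookup table -- four cache lines of data.
PARITY_LUT = bytes(bin(i).count("1") % 2 for i in range(256))

# Option 2: a branch-free computation that touches no table memory.
def parity(x):
    """XOR of all 8 bits of a byte, folded down with shifts."""
    x ^= x >> 4
    x ^= x >> 2
    x ^= x >> 1
    return x % 2  # bit 0 now holds the parity

# Both give the same answers; the question is which is faster in situ.
assert all(parity(i) == PARITY_LUT[i] for i in range(256))
```

A microbenchmark will often favor the table; the rest of this post is about why that result can be misleading.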
&lt;h3 id=&#34;cpu-cache-hierarchy&#34;&gt;CPU Cache Hierarchy&lt;/h3&gt;
&lt;p&gt;Unfortunately, with many performance-related topics, the devil is in the details.  The pertinent
detail, for lookup tables, is hidden in the &amp;ldquo;one memory access.&amp;rdquo;  How long does that memory access
take?&lt;/p&gt;
&lt;p&gt;It turns out that on a modern machine, it depends on many things, but it is usually
much longer than our mental models predict. Let&amp;rsquo;s take the example of a common CPU: an Intel
Skylake server CPU.  On a Skylake CPU, that memory access takes somewhere
between 4 and &lt;strong&gt;250&lt;/strong&gt; CPU cycles!  Most other modern CPUs have a similar latency range.&lt;/p&gt;
&lt;p&gt;The variance in access latency is driven by the cache hierarchy of the CPU:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each core has 64 kB of dedicated L1 cache, split equally between code and data (32 kB each)&lt;/li&gt;
&lt;li&gt;Each core has 1 MB of dedicated L2 cache&lt;/li&gt;
&lt;li&gt;A CPU has 1.375 MB of shared L3 cache per core (up to about 40 MB for a 28-core CPU)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These caches have the following access times:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4-5 cycle access latency for L1 data cache&lt;/li&gt;
&lt;li&gt;14 cycle access time for L2 cache&lt;/li&gt;
&lt;li&gt;50-70 cycle access time for L3 cache&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Main memory has an access latency of about 50-70 ns (100-210 cycles for a Skylake server chip)
on top of the L3 access time.  Dual-socket machines, as are used in cloud computing environments,
have an additional ~10% memory access time penalty to keep the caches coherent between the two
sockets.&lt;/p&gt;
&lt;p&gt;Here, we are disregarding throughput, but suffice it to say that L1 and L2 caches have nearly
unlimited throughput, the L3 cache has enough to do most things you would want to do with lookup
tables, and main memory never has enough.&lt;/p&gt;
&lt;h5 id=&#34;a-brief-introduction-to-caches&#34;&gt;A Brief Introduction to Caches&lt;/h5&gt;
&lt;p&gt;Caches are designed to help with common memory access patterns, generally meaning linear scans
of arrays, while also being constrained by hardware engineering constraints.  To that end, caches
don&amp;rsquo;t operate on arbitrary-sized variables, only 64-byte cache lines.  When a memory access
occurs, a small memory access for a single &lt;code&gt;char&lt;/code&gt; is translated into a memory access for a 64-byte
cache line.  Then, the L1 cache allows the CPU to do operations smaller than 64 bytes.&lt;/p&gt;
&lt;p&gt;A memory access to a cache line that is in a cache is called a cache hit, and an access to a
cache line that must be fetched from memory (or from another cache) is a cache miss.  Caches can
only hold so many cache lines, so when a cache needs room for a new line, a line that has not
been accessed recently is evicted, at which point the next access to that cache line
will be a cache miss.&lt;/p&gt;
&lt;p&gt;To make the hardware smaller, caches are also generally split up into sets.  Each memory address
is associated with only one set in a cache, so instead of looking through the entire cache to see
if you have a cache hit, you only have to look within a set.  The sets tend to be small: the L1
caches on a Skylake CPU have 8 slots (or &amp;ldquo;ways&amp;rdquo;).  Thus, the Skylake L1 cache is an &amp;ldquo;8-way set
associative cache&amp;rdquo; because it has 8 ways.  A cache that has only one way is called &amp;ldquo;direct
mapped&amp;rdquo; and a cache without set associativity (where a cache line can reside anywhere in the
cache) is called &amp;ldquo;fully associative.&amp;rdquo;  Fully associative caches take a lot of power and area,
and direct-mapped caches take the least. The L2 cache on a Skylake is 16-way set associative,
and the L3 cache is 11-way set associative.&lt;/p&gt;
&lt;p&gt;The set size of a cache influences how long a cache line is likely to stay in the cache: the more
associativity the cache has (the more ways there are), the longer it will take for a cache line
to be evicted.  For example, an LRU fully associative cache that holds 100 cache lines will only evict
a line after accesses to 100 other distinct lines.  On the other hand, a direct mapped cache can evict a cache line
on the next memory access, if that access happens to map to the same set as the previous access.&lt;/p&gt;
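&lt;p&gt;The set an address maps to is typically computed from the address bits just above the cache line offset.  Here is a small sketch for a Skylake-style 32 kB, 8-way L1 data cache (the power-of-two mapping shown is the usual scheme, assumed here for illustration):&lt;/p&gt;

```python
LINE = 64                            # bytes per cache line
WAYS = 8                             # L1 data cache associativity
SETS = 32 * 1024 // (LINE * WAYS)    # 64 sets

def l1_set(addr):
    """Set index: drop the 6 line-offset bits, keep the next 6 bits."""
    return (addr // LINE) % SETS

# Addresses 4 kB apart collide in the same set and compete for its 8 ways.
assert l1_set(0x0000) == l1_set(0x1000)
# Neighboring cache lines go to different sets.
assert l1_set(0x0000) != l1_set(0x0040)
```

This is why a direct-mapped cache can thrash on two addresses that happen to be a multiple of the set stride apart.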
&lt;p&gt;In theory, if you are performing a linear scan of a large array, you are going to scan through
the entire line, so the cache doesn&amp;rsquo;t waste bandwidth operating this way.  If you aren&amp;rsquo;t scanning
an array, you may be accessing a local variable, and if you are, you will likely access another
local variable nearby, within the same cache line.  In turn, compilers and performance engineers
take cache locality into account, improving the effectiveness of the cache.&lt;/p&gt;
&lt;p&gt;In addition, the caches have prefetching logic which can predict when a memory access is going
to occur, and fetch the appropriate cache line.  Thus, if you are doing a linear scan, the
prefetch logic can help you avoid taking any cache misses at all! You can also influence the
prefetch logic on an x86 core with instructions like &lt;code&gt;PREFETCHT0&lt;/code&gt;, but these are only requests,
not direct instructions.&lt;/p&gt;
&lt;p&gt;Processors can have different sized cache lines, but many CPUs (including all x86 CPUs) use 64
bytes.  That is also the same size as a DDR4 transfer, so it is convenient for memory systems
to work with 64 byte units and nothing smaller.&lt;/p&gt;
&lt;h3 id=&#34;lookup-tables-and-caches&#34;&gt;Lookup Tables and Caches&lt;/h3&gt;
&lt;p&gt;When comparing a lookup table to the alternative, it is important to think about where the lookup
table needs to be situated in order to have a speed advantage.  In most cases where a lookup table
will be worth using, you will want the lookup table to have an advantage if it is in the L3
cache or above and you want the function to be called enough that the table stays there.  To take
an absurd example, consider a function that outputs the result of the subset-sum problem on a
particular set for a given target.  It is much faster to precompute all of the subset-sum results
and then put them in a lookup table than to try to compute each solution on the fly.&lt;/p&gt;
&lt;h5 id=&#34;simulating-lookup-tables-with-a-cache-hierarchy&#34;&gt;Simulating Lookup Tables with a Cache Hierarchy&lt;/h5&gt;
&lt;p&gt;To prove my point, I made a simple simulation.  The simulation tracks the data accesses a program
makes on a single core, and tracks the population of the cache lines.  We are assuming that the memory footprint
of the program is 1 GB in total, and that it accesses memory addresses in an exponentially distributed
way when it is not using the lookup table.&lt;/p&gt;
&lt;p&gt;As a simplifying assumption, we are ignoring code access and assuming that there are 0 bytes of code
outside the L1 instruction cache.  We also assume that the L3 cache is strictly partitioned so that
the one core gets 1.375 MB of it, and we are simulating only one core (assuming that other cores are
doing other work).  Finally, we are assuming that the cache follows an LRU eviction policy, and will
always evict the least recently used cache line in a set.  We are also assuming the following
latencies for a lookup:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;4 cycles for L1 cache&lt;/li&gt;
&lt;li&gt;14 cycles for L2 cache&lt;/li&gt;
&lt;li&gt;50 cycles for L3 cache&lt;/li&gt;
&lt;li&gt;200 cycles for main memory&lt;/li&gt;
&lt;/ul&gt;
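&lt;p&gt;A minimal sketch of this kind of simulator is below.  It models only a single set-associative LRU level (the full simulation does the same bookkeeping for the L2 and L3 and adds up the latencies above), so it is a simplified reconstruction rather than the exact code behind the plots:&lt;/p&gt;

```python
import random
from collections import OrderedDict

LINE = 64  # bytes per cache line

class Cache:
    """One level of a set-associative cache with LRU eviction."""
    def __init__(self, size_bytes, ways):
        self.ways = ways
        self.nsets = size_bytes // (LINE * ways)
        self.sets = [OrderedDict() for _ in range(self.nsets)]

    def access(self, addr):
        """Return True on a hit; on a miss, insert the line, evicting LRU."""
        line = addr // LINE
        s = self.sets[line % self.nsets]
        if line in s:
            s.move_to_end(line)      # mark as most recently used
            return True
        if len(s) == self.ways:
            s.popitem(last=False)    # evict the least recently used line
        s[line] = True
        return False

def hit_rate(table_lines, intervening, trials):
    """Hit rate of uniform lookups into a small table, with `intervening`
    random accesses into a 1 GB working set between table lookups."""
    l1 = Cache(32 * 1024, 8)
    hits = 0
    for _ in range(trials):
        for _ in range(intervening):
            l1.access(random.randrange(2**30))
        hits += l1.access(LINE * random.randrange(table_lines))
    return hits / trials
```

For example, a 4-line table with no intervening accesses hits almost every time, while a few hundred intervening accesses per lookup push its hit rate down sharply.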
&lt;p&gt;The simulator will let us simulate several different types of access patterns, but the two access
patterns we will use are the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Even distribution of intervening memory accesses - this is an academic exercise that helps us
understand the impact of caching on performance of lookup tables.&lt;/li&gt;
&lt;li&gt;Bursty distribution of intervening memory accesses, modeled with an exponential distribution -
this is a lot closer to how a real program will look: short bursts of function calls will be
interspersed with longer gaps where no use of the lookup table occurs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For both categories, we will be simulating points in the range between 0 and 500 expected intervening
memory accesses between lookup table accesses.  Here are a few example guidelines to think about,
although the actual meaning of &amp;ldquo;intervening memory accesses&amp;rdquo; is program-dependent:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0-5 intervening memory accesses: this is basically a microbenchmark.  Likely all you are doing is reading
values, using the lookup table, and writing values back.&lt;/li&gt;
&lt;li&gt;~20 intervening memory accesses: the lookup table function is one of the core functions in a program.&lt;/li&gt;
&lt;li&gt;~100 intervening memory accesses: the lookup table function is one of the major functions used, but not
at the top of the CPU profile.&lt;/li&gt;
&lt;li&gt;~500 intervening memory accesses: the lookup table function is warm, but not particularly hot.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Distilling down to a few regimes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;0-10 - microbenchmarks and benchmarks&lt;/li&gt;
&lt;li&gt;10-100 - hot functions, usually the region of interest for optimization&lt;/li&gt;
&lt;li&gt;100-500 - warm functions that may or may not be of interest&lt;/li&gt;
&lt;/ul&gt;
&lt;h5 id=&#34;example-1-a-256-byte-lookup-table&#34;&gt;Example 1: A 256-byte Lookup Table&lt;/h5&gt;
&lt;p&gt;Let&amp;rsquo;s take the example of a small lookup table, a 256-byte table, and let&amp;rsquo;s assume that it
is accessed uniformly randomly.  The table is 4 cache lines long.  Here&amp;rsquo;s the access latency in cycles
that we get from our simulation, plotted against the period of access on a log scale (1 = every memory
access goes to the lookup table, and 100 = 1% of accesses go to the lookup table):&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut4_latency.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;Here is the cache hit rate for accesses to our lookup table against log-scaled number of intervening
memory accesses:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut4_hitrate.png#center&#34; width=&#34;65%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;If we are in the &amp;ldquo;extremely hot function&amp;rdquo; range, with less than 20 intervening memory accesses, the
lookup table function looks great.  If not, it doesn&amp;rsquo;t look so great.&lt;/p&gt;
&lt;p&gt;It seems to defy expectations that such a small lookup table would be getting evicted from a cache so
quickly.  However, let&amp;rsquo;s think about the access pattern for the lookup table: &lt;em&gt;on each access, we are
only touching one of its cache lines&lt;/em&gt;.  Since there are 4 cache lines in the lookup table, each cache
line effectively sees 4 times as many intervening accesses as the entire lookup table sees.
This means that we should actually expect the lookup table to be much further down the cache
hierarchy than we would think given its small size.&lt;/p&gt;
&lt;p&gt;This is not analogous to the case of 256 bytes of code or 256 bytes of local variables: we expect to touch
&lt;em&gt;every&lt;/em&gt; cache line in the code and &lt;em&gt;every&lt;/em&gt; cache line of local variables each time the function is called.
We could attempt to rectify this by adding prefetching instructions for all of the cache lines, but
this will not be a panacea either, and most importantly, it will hurt our latency floor when the function
is called very often.&lt;/p&gt;
&lt;p&gt;Here are our performance graphs on bursty (non-uniform) accesses to the lookup table:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut4_poisson_latency.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;The ceiling of the chart here is far better, around 160 cycle latency on average instead of 200 cycle
latency, and as we can see in the caching plot below, the cache hierarchy helps a lot with occasional
burst accesses to the lookup table:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut4_poisson_hitrate.png#center&#34; width=&#34;65%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;Even with an average of 500 memory accesses between each lookup table call, 30% of accesses to the
lookup table are cached!  However, in the useful window between 10 and 100 intervening memory accesses,
we get very similar performance to the uniform access simulations.  This should not be surprising:
most of those accesses are already coming out of caches, and even though we shift more of them to the
L1 cache, there are also longer gaps between bursts during which the lookup table is evicted down the
cache hierarchy.&lt;/p&gt;
&lt;h5 id=&#34;example-2-a-65536-byte-lookup-table&#34;&gt;Example 2: A 65536-byte Lookup Table&lt;/h5&gt;
&lt;p&gt;Let&amp;rsquo;s take the example of a large lookup table corresponding to 16 input bits.  Again, we will assume
that it is accessed uniformly randomly.  Our latency profile (again, in clock cycles) looks like:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut1024_latency.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;That&amp;rsquo;s not so good - it looks like we are almost working out of main memory to start!  Note that this
graph cuts off at 100 intervening accesses, because after that point, the average latency is almost
equal to the memory latency.  Here is the cache hit rate, confirming our suspicions:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut1024_hitrate.png#center&#34; width=&#34;65%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;This is an extreme example, but the conclusion we get here is pretty stark: if you want to use a
lookup table of several KB, expect it to take a long time.  Each cache line of this lookup table sees
1024 times more intervening memory accesses than the entire table sees!&lt;/p&gt;
&lt;p&gt;For a bursty access pattern, we actually see worse performance results than our uniform access pattern:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut1024_poisson_latency.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;You need a very long burst to cache the entire lookup table, and each cache line of your table is
an antagonist for the other cache lines of the table.  We get the following cache hit rates from
the bursty access pattern:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut1024_poisson_hitrate.png#center&#34; width=&#34;65%&#34;/&gt; 
&lt;/figure&gt;

&lt;h5 id=&#34;example-3-a-64-byte-lookup-table&#34;&gt;Example 3: A 64-byte Lookup Table&lt;/h5&gt;
&lt;p&gt;Finally, let&amp;rsquo;s look at a lookup table that is a single cache line, and see how well it performs.
You can&amp;rsquo;t look up many values in this table, but it illustrates how the effects of the cache
hierarchy are amplified by the size of the lookup table.  Here&amp;rsquo;s how the 64-byte lookup table
looks with a realistic, bursty access pattern:&lt;/p&gt;
&lt;p&gt;&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut1_poisson_latency.png#center&#34; width=&#34;50%&#34;/&gt; 
&lt;/figure&gt;

&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut1_poisson_hitrate.png#center&#34; width=&#34;65%&#34;/&gt; 
&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;Note that the latency graph tops out at around 90 cycles, unlike previous charts, and 60% of
accesses are cached.  The &amp;ldquo;critical region&amp;rdquo; between 10 and 100 intervening accesses is served
mostly from the L1 cache, and cached 100% of the time.&lt;/p&gt;
&lt;p&gt;This is good.  A single cache line lookup table has great performance for a warm function.
If the function is not warm, we don&amp;rsquo;t care much either, since it&amp;rsquo;s not a significant performance
driver of the overall system.  For caching purposes, a lookup table that fits in a single cache line
behaves like a constant or a local variable: It is going to be accessed on every invocation of the
function, so it is going to be in the cache if the function was called recently, and its performance
is not data-dependent.&lt;/p&gt;
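&lt;p&gt;A familiar example of a table in this class (chosen here purely for illustration) is hex encoding, whose entire table is 16 bytes, a quarter of one cache line:&lt;/p&gt;

```python
# The whole lookup table is 16 bytes: a quarter of a single cache line,
# and it is touched on essentially every call.
HEX_DIGITS = b"0123456789abcdef"

def byte_to_hex(b):
    """Encode one byte as two lowercase hex digits."""
    return bytes([HEX_DIGITS[b // 16], HEX_DIGITS[b % 16]])

assert byte_to_hex(0xAB) == b"ab"
assert byte_to_hex(0x07) == b"07"
```

For caching purposes this table is indistinguishable from a constant, which is exactly the behavior the single-line simulation shows.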
&lt;h3 id=&#34;bringing-it-all-together&#34;&gt;Bringing it All Together&lt;/h3&gt;
&lt;p&gt;A comparison of the performance of lookup tables in all of our simulations is here, alongside a
simulation for a 2 kB lookup table:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/lut_comparison.png#center&#34; width=&#34;65%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;These simulations show us the problem of lookup tables and microbenchmarks: the cache hierarchy
completely destroys the results of the benchmark by operating on the far left side of the graph.
Most bits of code tend to perform slightly better in microbenchmarks than in situ thanks to the
cache hierarchy, but lookup tables get an extreme version of this benefit. Once your lookup table
gets evicted down the cache hierarchy, it starts to slow down significantly.&lt;/p&gt;
&lt;p&gt;Unfortunately, the range between &amp;ldquo;10&amp;rdquo; and &amp;ldquo;100&amp;rdquo; on this graph is the critical zone for performance-
sensitive functions, and that is the region where even small lookup tables start to get slow and
start to see a lot of variance in their performance.  This is when the caching architecture of
modern CPUs tends to stop favoring you.&lt;/p&gt;
&lt;h5 id=&#34;non-uniform-access-patterns&#34;&gt;Non-uniform Access Patterns&lt;/h5&gt;
&lt;p&gt;Up to this point, we have been assuming a uniform random access to the lookup table.  Sometimes this
is not true.  If we have nonuniform access patterns, the commonly-accessed parts of the lookup table are
more likely to be closer to the core, but the uncommonly accessed parts are likely to be farther.&lt;/p&gt;
&lt;p&gt;To look at this case, I simulated access to the lookup table using an exponentially distributed pattern
to simulate a non-uniform access pattern. In one simulation, 63% of accesses (1 standard deviation)
go to the first cache line.  In the next simulation, 95% of accesses (2 standard deviations) go to the
first cache line. The results of the simulation, with the 64B table and a uniform access pattern for
comparison are below:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/lut/nonuniform_lut4.png#center&#34; width=&#34;65%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;The one sigma case may have worse performance than the uniform case!  The cache misses incurred on the
other cache lines of the table are expensive, and more than make up for the performance gain you get
from one cache line being a lot hotter than in the uniform case.   As the function gets colder,
non-uniform access to the lookup table starts to help a lot - I cut this graph off at 100 intervening
accesses to show the interesting region better, but the trends you see at 100 continue.&lt;/p&gt;
&lt;p&gt;As expected, the two sigma case performs very similarly to the case of a lookup table that is a single
cache line in the tail, but the performance of the 256 byte table whose first 64 bytes see 95% of
accesses is similar to that of the uniformly accessed table on the left side of the graph:
the rare cache misses on other cache lines hurt a lot.&lt;/p&gt;
&lt;h5 id=&#34;what-about-the-cost-of-code&#34;&gt;What about the cost of code?&lt;/h5&gt;
&lt;p&gt;An easy argument against what we have said so far is: &amp;ldquo;Fine, the lookup table is not very cache
efficient, but neither is the extra kilobyte of code that you use to avoid a lookup table!
That has to be fetched from memory too!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s true that the code does have to be fetched from memory, and a larger code footprint increases
the chance of an instruction cache miss.  However, there are three significant mitigating factors
in terms of the cost of cache misses on code:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &amp;ldquo;heat&amp;rdquo; of code tends to be much more unevenly distributed than data&lt;/li&gt;
&lt;li&gt;Code tends to be easy to prefetch unless it is branchy&lt;/li&gt;
&lt;li&gt;Code is usually touched on every invocation of a function unless it branches&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Combining all of these factors gives you a low instruction cache miss rate for most programs:
The L1 instruction cache can supply 64 bytes per cycle, but the instruction fetch unit only consumes
16 bytes per cycle.  As a result, code can actually be perfectly prefetched through a few branches
(not including loops, which are already in the cache), and adding a branch can actually be more
costly than adding bytes.  Every byte being touched on every invocation helps a lot, too. Finally,
the fact that code tends to have a few hot parts and a large cold region is a double-edged
sword: raising the hot path over a certain threshold can have a high cost, but until the hot path
is a certain size, you can add code without too much of a caching cost.&lt;/p&gt;
&lt;p&gt;This is why specific-length-optimized SIMD algorithms for the &lt;code&gt;mem*&lt;/code&gt; functions and a lot of other
similar tricks work well: up to a certain program size, the extra code actually doesn&amp;rsquo;t cost much,
especially when that code is either looping or going in a straight line. Of course, there are limits.&lt;/p&gt;
&lt;h3 id=&#34;when-to-use-lookup-tables&#34;&gt;When to use Lookup Tables&lt;/h3&gt;
&lt;p&gt;Lookup table algorithms in software engineering projects are justified in four situations:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;where the lookup table is one cache line or less&lt;/li&gt;
&lt;li&gt;where a &lt;em&gt;full-system&lt;/em&gt; benchmark shows that the lookup table algorithm is faster&lt;/li&gt;
&lt;li&gt;where you don&amp;rsquo;t care about performance and it is the easiest algorithm to understand&lt;/li&gt;
&lt;li&gt;where you have complete control of the cache hierarchy (eg on a DSP, not a server CPU)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We have covered the first case, so I won&amp;rsquo;t go into it here.&lt;/p&gt;
&lt;p&gt;The second case is straightforward, but easy to mess up.  Lookup tables are liars, and can&amp;rsquo;t be
trusted unless you have a significant quantity of code around them.  If a full-system benchmark
shows that you should use a lookup table algorithm, then you should use it.  A microbenchmark
or a benchmark of a small part of the system does not apply!  It will usually be the case that a
full-system benchmark suggests using a lookup table when your function is called often enough
that the table stays warm, and a smaller table will work out more often than a larger one.&lt;/p&gt;
&lt;p&gt;The third case is the easiest to consider.  Write something readable. If the lookup table is the
most readable thing to do, do it, but please make sure you describe how the table was generated.&lt;/p&gt;
&lt;p&gt;The fourth case does not apply to general-purpose CPUs. DSPs and special-purpose processors often have
scratchpads instead of (or in addition to) caches, so you actually can guarantee that your lookup table
will be quickly accessible as long as you are willing to spend some of your limited scratchpad RAM on
the table.  GPUs have a region of &amp;ldquo;constant&amp;rdquo; memory which can be accessed quickly and can be used
for lookup tables. If you are working with a microcontroller that has only RAM and flash, there is
no cache hierarchy, so you have complete control.  These are all places where lookup tables can work
well.  While modern desktop and server CPUs do allow you to perform some limited cache manipulation
and provide hints to the prefetch unit, there is currently no way to force a cache line to stay in a
cache. The &lt;code&gt;PREFETCH*&lt;/code&gt; instructions are just hints!&lt;/p&gt;
&lt;h3 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;Lookup tables have a time and a place, but it&amp;rsquo;s not all the time and it&amp;rsquo;s definitely not everywhere.
In performance software engineering, arguments in favor of lookup tables generally have to be very
subtle and well-thought-out if you want to make sure that they are actually correct.  Even a small
lookup table can be a lot worse than the alternatives unless it is used extremely frequently, and a
large lookup table should almost never be used.&lt;/p&gt;
&lt;p&gt;Further, unlike with other algorithms, a microbenchmark is not a representative measure of performance
for a lookup table algorithm unless it is all you are doing, since the distorting effects of the cache
hierarchy are magnified for lookup table algorithms.  Lookup tables also make performance
data-dependent, so they are not suitable for algorithms where you need deterministic timing, like in
cryptography.&lt;/p&gt;
&lt;p&gt;When we think about performance, it is important to make sure that we consider the full picture.
Algorithms that are so dependent on their use characteristics and the structure of the hardware can
make it very hard to do that.  Memory speed hasn&amp;rsquo;t kept up with the speed of CPUs, and so a lot of
silicon on a modern processor is made to cope with that fact.  As a result, many other algorithms are
built around the idea of reducing reliance on memory speed.  Lookup tables depend on it to a fault.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Consulting</title>
      <link>https://specbranch.com/consulting/</link>
      <pubDate>Fri, 01 Apr 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/consulting/</guid>
      <description>At this time, I am not open to consulting projects.</description>
      <content:encoded>&lt;p&gt;At this time, I am not open to consulting projects.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Who Controls a DAO?</title>
      <link>https://specbranch.com/posts/who-controls-a-dao/</link>
      <pubDate>Fri, 01 Apr 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/who-controls-a-dao/</guid>
      <description>In honor of April Fools&amp;rsquo; Day, I decided to write about a blockchain topic. The crypto economy is in the process of speedrunning its way from zero to a modern economy, and when you move that fast, a few things have to break along the way. One of those things is corporate governance.
Matt Levine&amp;rsquo;s &amp;ldquo;Money Stuff&amp;rdquo; is a financial newsletter that I can&amp;rsquo;t recommend enough. If you are at all interested in finance, stocks, and markets, it is a funny and informative read.</description>
      <content:encoded>&lt;p&gt;In honor of April Fools&amp;rsquo; Day, I decided to write about a blockchain topic. The crypto economy is in
the process of speedrunning its way from zero to a modern economy, and when you move that fast,
a few things have to break along the way.  One of those things is corporate governance.&lt;/p&gt;
&lt;p&gt;Matt Levine&amp;rsquo;s &lt;a href=&#34;https://www.bloomberg.com/opinion/authors/ARbTQlRLRjE/matthew-s-levine&#34;&gt;&amp;ldquo;Money Stuff&amp;rdquo;&lt;/a&gt;
is a financial newsletter that I can&amp;rsquo;t recommend enough.  If you are at all interested in finance,
stocks, and markets, it is a funny and informative read.  One of the recurring topics of Money Stuff
is &amp;ldquo;who controls a company?&amp;rdquo;  Quoting a bit of the
&lt;a href=&#34;https://www.bloomberg.com/opinion/articles/2018-07-24/papa-john-s-poison-pilled-papa-john?sref=1kJVNqnU&#34;&gt;newsletter&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Who controls a company? It’s a question we talk about from time to time, and the shareholders,
the board of directors, the chief executive officer, and whoever has the keys to the front door
all have good arguments that they are really in control.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You wouldn&amp;rsquo;t expect this to be a problem for most companies, but it comes up a lot in disputes
between the various parties who have ownership claims.  CEOs go against the wishes of boards all the
time&amp;ndash;one notably refused to be fired and kept filing documents with the SEC as though he was still
running the place&amp;ndash;and Arm recently learned that &amp;ldquo;whoever has the keys to the front door&amp;rdquo; (or rather,
the corporate seal) actually has a good ownership claim on Arm China.&lt;/p&gt;
&lt;p&gt;The crypto version of this question is &amp;ldquo;who controls a Distributed Autonomous Organization (DAO)?&amp;rdquo;
As with many other financial topics, adding a little bit of cryptography and a dose of &amp;ldquo;code is law&amp;rdquo;
turns this problem up to 11, and a recent exploit against a DAO has started me thinking about the
parallels.&lt;/p&gt;
&lt;h3 id=&#34;what-is-a-dao&#34;&gt;What is a DAO?&lt;/h3&gt;
&lt;p&gt;For the uninitiated, a DAO is a little bit like the crypto version of a company.  DAOs are
organizations set up for a specific purpose whose membership (and associated voting power) is defined
by ownership of a &amp;ldquo;governance token.&amp;rdquo;  A governance token is like a normal cryptocurrency, but it
comes with the ability to vote on proposals at the DAO.  The founders of the DAO often hold large
amounts of the governance token and are usually given management roles, which come with the ability to
spend crypto tokens from the DAO&amp;rsquo;s wallet, ostensibly to make more money for the DAO.  All of these
rules are defined by a smart contract on a blockchain.&lt;/p&gt;
&lt;p&gt;Generally, people interested in the purpose of the DAO fund the DAO by buying tokens, the same way
interested parties buy shares in corporations.  The founders are given some number of tokens, similar to
founders of companies, and the remaining tokens are held by the DAO itself.  Much like companies, which
can issue shares, DAOs can also mint more governance tokens, but often need a vote to issue more than
a certain cap.&lt;/p&gt;
&lt;p&gt;Voting and proposals are generally set up such that any token holder can make proposals to the members
of the DAO, and then all of the token holders get the chance to vote.  For some DAOs, voting costs
governance tokens, while other DAOs have free voting with voting power proportional to the number of
governance tokens held by the voter.  The latter case is similar to shareholder votes in companies.&lt;/p&gt;
&lt;p&gt;Occasionally, DAOs give out some of their money (in the form of Ethereum or other coins) to holders
of governance tokens, like a dividend, and other DAOs will offer to allow people to redeem their
governance tokens for a fraction of the money held by the DAO, similar to a buyback.&lt;/p&gt;
&lt;p&gt;DAOs have been used for many interesting things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Tornado Cash is a DAO that runs an anonymizer for Ethereum.&lt;/li&gt;
&lt;li&gt;Venture DAO provides crypto investment to other DAOs.&lt;/li&gt;
&lt;li&gt;ConstitutionDAO was founded to buy a copy of the US constitution.  When it lost a bidding war against
hedge fund manager Ken Griffin, the managers started returning ~$40 million in crypto to token holders.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But with the good come the bad (and the dumb):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The first DAO, simply called &amp;ldquo;The DAO,&amp;rdquo; had a bug in its code that allowed a thief to walk away
with $60 million (at the time) worth of Ethereum.  This hack resulted in a hard fork of the Ethereum
blockchain to reverse the transaction.&lt;/li&gt;
&lt;li&gt;A DAO called &amp;ldquo;SpiceDAO&amp;rdquo; paid $3 million for a copy of a script of an early Dune movie thinking they
could make the movie.  They overpaid by a factor of 100, and they neglected to buy movie rights.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;pulling-the-rug&#34;&gt;Pulling the Rug&lt;/h3&gt;
&lt;p&gt;&amp;ldquo;Rug pulls&amp;rdquo; are a fact of life for DAO investors.  A rug pull happens when the founder of a DAO
disappears from social media and exercises their power as a manager to steal all of the cryptocurrency
from the DAO&amp;rsquo;s coffers, leaving it broke.  This is usually accompanied by the DAO&amp;rsquo;s website, social media
accounts, and Discord server (all of which are usually controlled by the founder) being shut down.  Rug
pulls are so common that
&lt;a href=&#34;https://www.coindesk.com/markets/2021/12/17/defi-rug-pull-scams-pulled-in-28b-this-year-chainalysis/&#34;&gt;$2.8 billion of crypto tokens were rug pulled in 2021&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Moreover, rug pulls appear to be illegal,
&lt;a href=&#34;https://www.zdnet.com/article/frosties-nft-operators-arrested-for-1-million-rug-pull-scheme/&#34;&gt;at least according to the US government, which arrested Ethan Nguyen and Andre Llacuna for pulling one&lt;/a&gt;.
The thesis used to justify this arrest appears to be based on fraud: Nguyen and Llacuna accepted money
promising to build a game, and instead they tried to disappear with it.  The two are charged with conspiracy
to commit fraud and conspiracy to commit money laundering.&lt;/p&gt;
&lt;p&gt;A rug pull can generally only be done by the people who started the DAO, lending some credence to the
&amp;ldquo;manager primacy&amp;rdquo; theory.  In a normal company, there is also a pesky little thing called &amp;ldquo;fiduciary duty&amp;rdquo;:
the managers of a company ostensibly have a duty to act in the interest of shareholders.  Looting a company&amp;rsquo;s
assets is a pretty straightforward breach of fiduciary duty.&lt;/p&gt;
&lt;h3 id=&#34;the-hostile-takeover-of-build-fianance&#34;&gt;The Hostile Takeover of BUILD Finance&lt;/h3&gt;
&lt;p&gt;What if the person who pulls the rug isn&amp;rsquo;t the person who puts the rug there?  What if they have made
no promises at all?  What if the token holders of the DAO have elected them to be the new manager in a
legitimate election (after all, when code is law, no election can be illegitimate)?  There is still the
matter of fiduciary duty if you believe that DAOs and corporations should be governed by the same law,
but at least there&amp;rsquo;s no fraud!&lt;/p&gt;
&lt;p&gt;I usually stay away from crypto news, but there was a very interesting exploit of a DAO about a month
and a half ago that caught my attention.  BUILD Finance is a DAO that attempted to create a crypto
investment fund that financed projects that use their token (also called BUILD).  In February 2022, the
BUILD Finance DAO was subject to a hostile takeover by a user going by &amp;ldquo;suho.eth,&amp;rdquo; who promptly drained
the DAO&amp;rsquo;s accounts of all tokens, walking away with $500,000.  BUILD Finance issued 130,000 tokens
initially, and only 5,000 of them voted on the proposed change of manager: 5,000 in favor, and 0 opposed.&lt;/p&gt;
&lt;p&gt;A few days before this exploit, suho.eth tried to play the same game with 2,000 BUILD tokens.  Luckily
for the DAO, an automated Discord bot was set up to notify users of votes, and one person voted against
suho.eth on his first try.  A few days later, suho.eth transferred some BUILD tokens to a new account and
tried again.  This time, the Discord bot broke and didn&amp;rsquo;t send out a notification of the election.
Elections for BUILD Finance ran for 24 hours, and during the following day, nobody noticed that there was
an open proposal.  The Discord chat was silent, and nobody voted against the proposal. With less than 5%
of the outstanding BUILD tokens voting, suho.eth was made the manager of BUILD Finance.&lt;/p&gt;
&lt;p&gt;Some time later, the BUILD Finance Twitter account, presumably run by the founders of the DAO, announced
that they had been subject to a &amp;ldquo;hostile governance takeover,&amp;rdquo; including a step-by-step description of
how the exploit occurred.  The Twitter thread included such things as:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Team members have made direct contact with the attacker but there seems to be no appetite for a dialogue,
much less any reparations. 15/18&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We would welcome a discussion in the discord with community members about the way to move forward from
this but it is difficult to see a future for BUILD with only its brand recognition and IP assets, and
no liquid treasury. 16/18&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;From one perspective, it appears that the BUILD Finance DAO was hacked and ruthlessly stripped of its assets.
Another perspective, however, is that the token holders elected new management and the new managers decided
to fire the old employees, shut down operations, and spend the company cash on bonuses.&lt;/p&gt;
&lt;p&gt;The top response to the tweet thread was:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;You need to give this Twitter account over to the person who controls the DAO, you&amp;rsquo;re misrepresenting
yourself on here by not doing so&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It seems like many people prefer the second view.  As of today, the &lt;a href=&#34;https://build.finance&#34;&gt;BUILD Finance website&lt;/a&gt;
is down, and it appears that the DAO has ceased operating. BUILD tokens are now worthless. The only thing
protecting BUILD Finance from this outcome was a discord bot.&lt;/p&gt;
&lt;p&gt;When we consider the sequence of events that happened to BUILD Finance, the step that is actually &lt;em&gt;criminal&lt;/em&gt;
under the fiduciary duty theory is the last step, spending the company cash on bonuses.  The rest of the
steps seem like a traditional hostile corporate takeover (I am not a lawyer and this is not legal advice).&lt;/p&gt;
&lt;p&gt;What if the new management had instead decided to dissolve the DAO and return the assets to token holders?
What if instead of taking all of the money, the new management had instead set up new voting rules and
started operating the DAO in good faith?  Then it &lt;strong&gt;would&lt;/strong&gt; have been a hostile takeover.  Whoever runs the
DAO&amp;rsquo;s Twitter account (and website) might not be so happy about this, but at that point, they wouldn&amp;rsquo;t
really be able to claim that they are representatives of the DAO.  They were the old management, and the
token holders have replaced them.  As with other corporate takeovers, the new management of the DAO could
presumably claim a handsome bonus&amp;ndash;perhaps 10% of the DAO&amp;rsquo;s assets&amp;ndash;for their work.&lt;/p&gt;
&lt;h3 id=&#34;why-doesnt-this-happen-with-companies&#34;&gt;Why Doesn&amp;rsquo;t This Happen with Companies?&lt;/h3&gt;
&lt;p&gt;A company has several safeguards against this sort of takeover.  The first, as I already mentioned, is
the law and the fiduciary duty requirement.  It&amp;rsquo;s hard to use the law as a deterrent in countries that
don&amp;rsquo;t extradite to your preferred venue.&lt;/p&gt;
&lt;p&gt;The second safeguard is something that DAOs can actually implement: in a normal company, abstaining from
a shareholder vote is counted as a vote in favor of things not changing.  This logic could easily be built
into smart contracts.  Thus, a change of management would require 51% of the token holders to actively vote
for the new manager.  This sounds like vote rigging, but it has reasonable logic: most people who aren&amp;rsquo;t
voting are passive shareholders who are happy with the status quo.  Overcoming their implied preference
should have a high bar.&lt;/p&gt;
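&lt;p&gt;As a minimal sketch (with hypothetical names, not any real DAO&amp;rsquo;s contract code), the rule is simple to express: a proposal passes only when the votes in favor exceed half of &lt;em&gt;all&lt;/em&gt; outstanding tokens, so every abstention implicitly counts for the status quo.&lt;/p&gt;

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of "abstention favors the status quo": a
   proposal needs "yes" votes from a majority of ALL outstanding
   tokens, not just a majority of the tokens that showed up to vote. */
static bool proposal_passes(uint64_t total_supply, uint64_t votes_for) {
    return votes_for * 2 > total_supply;
}
```

&lt;p&gt;Under this rule, the 5,000 favorable votes out of BUILD&amp;rsquo;s 130,000 outstanding tokens would have fallen far short.&lt;/p&gt;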
&lt;p&gt;Companies also usually have specific voting windows (some DAOs do today), requiring proposals to be made
during annual or quarterly shareholder meetings, and shareholder votes often run for weeks.  In addition,
there is a layer of indirection at typical companies.  Shareholders don&amp;rsquo;t vote for &lt;em&gt;managers&lt;/em&gt;, they vote
for &lt;em&gt;board members&lt;/em&gt;.  The board then runs a private selection process for managers.  Typically, only part
of the board is up for election at any given time.  Installing hostile management at a traditional
company involves spending a year or more getting enough board members elected, then going through the
process of a board vote to change the managers.  It is a long and slow process, and it is hard to fly
under the radar.&lt;/p&gt;
&lt;p&gt;One final point is that voting on a proposal at a DAO is actually not free: you have to pay a gas fee
to do anything on the blockchain, including voting on a DAO.  Participating in shareholder votes is free,
so a lot of shareholders generally do it.&lt;/p&gt;
&lt;p&gt;Eventually, DAOs will likely discover similar rules to corporations, when they realize that they can&amp;rsquo;t
avoid them.  Some have adopted multi-signature wallets, which need approvals from multiple people to
move the money, others have restricted voting windows, and still others have eschewed decentralization and
forced every proposal to get managerial approval before going to a vote.  Personally, I think that the
&amp;ldquo;abstention is a vote for the managers&amp;rsquo; position&amp;rdquo; solution is an elegant balance&amp;ndash;shareholders have
&lt;em&gt;some&lt;/em&gt; recourse against bad management, but only if a lot of them agree.  Running every proposal
through the managers doesn&amp;rsquo;t sound &amp;ldquo;decentralized&amp;rdquo; at all.&lt;/p&gt;
&lt;p&gt;When crypto enthusiasts look at the long lists of arcane rules that govern the modern economy, it seems
easy to do better.  However, a lot of those rules were the result of learning from past economic hacks
and exploits.  Building your own system from the ground up, however fun it is, is usually a bad idea!
I expect to see more DAOs learning the same lessons in the future.  For now, though, my opinion of
crypto topics will continue to be the following:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Have some fun in the economic sandbox, but be ready to lose your investment.&lt;/em&gt;&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Python is Like Assembly</title>
      <link>https://specbranch.com/posts/python-and-asm/</link>
      <pubDate>Sun, 06 Mar 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/python-and-asm/</guid>
      <description>Python and Assembly have one thing in common: as a professional software engineer, they are both languages that you probably should know how to read, but be terrified to write. These languages seem to be (and are) at opposite ends of the spectrum: One is almost machine code, and the other is almost a scripting language. One is beginner-friendly and the other is seen as hostile to experts. One is viciously versatile with tons of libraries and ports, and the other is ridiculously limited in its capabilities.</description>
      <content:encoded>&lt;p&gt;Python and Assembly have one thing in common: as a professional software engineer, they are both
languages that you probably should know how to read, but be terrified to write. These languages seem
to be (and are) at opposite ends of the spectrum: One is almost machine code, and the other is almost a
scripting language. One is beginner-friendly and the other is seen as hostile to experts. One is
viciously versatile with tons of libraries and ports, and the other is ridiculously limited in its
capabilities. However, when you are creating production software, both are the wrong tool for the
job.&lt;/p&gt;
&lt;h3 id=&#34;python-easy-to-write-hard-to-read&#34;&gt;Python: Easy to Write, Hard to Read&lt;/h3&gt;
&lt;p&gt;You probably agree with me that you shouldn&amp;rsquo;t be writing assembly code, but Python is much more
contentious. Many people argue that Python is bad to use in production because it is slow, and that
you will get better performance using something else. However, in most of the cases where I have seen
Python make it to production, Python is not slow: the Python code is often gluing together some NumPy
or TensorFlow functions, which are written in optimized C. Python users who care a lot about speed
and need to run actual Python can reach for Cython or Pyston to solve their speed problems. The
problem with Python isn&amp;rsquo;t its speed.&lt;/p&gt;
&lt;p&gt;The problem is that Python is a language that is optimized for writeability. Dynamic typing, playing it
fast and loose with scoping, and syntactic whitespace are all nice features when you want to churn out a
lot of code quickly. Having every error show up at runtime is not a big deal if you are expecting to be
the person running the code you wrote. For that reason, it is a very popular language: you can feel very
&amp;ldquo;productive&amp;rdquo; when you are churning out Python. However, if you are working on a large codebase or an
old codebase, working with Python is a huge headache, especially when you need the code to run 24/7
at thousands of QPS.&lt;/p&gt;
&lt;p&gt;Type information is immensely helpful to readers and maintainers, and guarantees about what type
a variable can hold solve a lot of headaches. Many Python programs are not written with a lot of type
discipline: variables can often hold many different types, and when you modify a program, you need to
consider all possible types. Also, it can often be unclear how to access certain variables due to the
odd scoping. This is actually pretty similar to assembly code: a Python variable is like an architectural
register - it&amp;rsquo;s hard to know what&amp;rsquo;s inside without reading a &lt;em&gt;lot&lt;/em&gt; of code.&lt;/p&gt;
&lt;p&gt;An added consequence of all of the dynamism is that programs don&amp;rsquo;t &lt;em&gt;really&lt;/em&gt; know what a variable is and
how to manipulate it until runtime, making it very hard to write helpful IDEs and analyzers similar to
what you get with C++, Rust, Go, Java, or almost any other language. Those runtime errors also have a
nasty habit of showing up only once code gets to production even when they would ordinarily be caught
by a tool in another language.&lt;/p&gt;
&lt;p&gt;For small code snippets, one-offs, and pseudocode, Python is an invaluable tool. For production,
reach for something else.&lt;/p&gt;
&lt;h3 id=&#34;assembly-code-doesnt-do-its-job&#34;&gt;Assembly Code doesn&amp;rsquo;t do its Job&lt;/h3&gt;
&lt;p&gt;When people think of high-performance code, they often think of hand-optimized assembly. In limited
circumstances, this is true. Nobody has ever argued that assembly code is readable or easy to maintain,
so for our purposes, I am going to focus on the performance arguments, and why you should almost never
reach for assembly when performance is critical.&lt;/p&gt;
&lt;p&gt;More often than not, writing assembly language (and even C code) can cause you to miss algorithmic
solutions to speed problems in favor of brute micro-optimization. Here are a few examples to try in C
and assembly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Write a function that produces the sum of every number from 1 to 10000&lt;/li&gt;
&lt;li&gt;Write a function that divides a number by 57&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For the first example, you probably wrote a loop in both languages. However, if you plug your C code
into &lt;a href=&#34;https://godbolt.org&#34;&gt;compiler explorer&lt;/a&gt;, you will find that clang will give you a closed-form
solution when you compile it in C, resulting in a huge speedup. For the second example, you may have
used a DIV instruction, or you may have known that you can do this with a MUL and a magic number. If
you used a DIV instruction, clang will beat you again: it can compute the magic number and turn that
into a multiplication. If you used a MUL instruction, congratulations on knowing the trick, but you
had to compute the magic number yourself, didn&amp;rsquo;t you?  More subtly, hand tuning will often result in
bad register allocation and code that doesn&amp;rsquo;t play well with other code.&lt;/p&gt;
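&lt;p&gt;To make both examples concrete, here is a C sketch of what the compiler reaches for: a closed form for the sum, and a multiply-and-shift for the division.  The constants below (a magic number of 144 with a 13-bit shift) are one valid choice for 8-bit inputs, not necessarily what any particular compiler emits:&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* The closed form a compiler can substitute for a summation loop. */
static uint32_t sum_to(uint32_t n) {
    return n * (n + 1) / 2;                 /* 1 + 2 + ... + n */
}

/* Division by 57 as a multiply and a shift.  144 = ceil(2^13 / 57);
   this particular magic number is only valid for 8-bit numerators. */
static uint8_t div57(uint8_t x) {
    return (uint8_t)(((uint32_t)x * 144u) >> 13);
}
```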
&lt;p&gt;Also, there are a few huge readability problems with assembly code. One of these problems is that variables
tend to exist as registers and memory locations, which can be ephemeral. The contents of a register are not
immediately obvious without considering the previous code. Macros can go a long way toward code readability,
but they do not take you all the way to the point where you have variable and function names.  This
rhymes with one of the reasons why Python is hard to maintain: it&amp;rsquo;s hard to know what your variables
actually could be, and there is no verification of their state.&lt;/p&gt;
&lt;p&gt;This is why you don&amp;rsquo;t want to write assembly code directly: doing the work of a compiler yourself is
hard, and the compiler has a good global view. In cases where you might want to reach for assembly to do
something that is not native to a higher-level language, like computing a CRC32C or a carryless
multiplication, there is almost always an
&lt;a href=&#34;https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html&#34;&gt;intrinsic function you can use&lt;/a&gt;,
so you never have to accept writing assembly.&lt;/p&gt;
&lt;p&gt;Fully optimized assembly is UGLY, as it should be in order to go fast, but ugly code is hard to
work with. Instead of writing (slow) human-readable assembly, let the compiler do the work. For
critical sections, read the output of the assembler and tweak with intrinsics and hints when
the compiler doesn&amp;rsquo;t find the right code. There is almost nothing that needs fully handwritten
assembly these days, but there is still a lot of assembly that needs to be read.&lt;/p&gt;
&lt;h3 id=&#34;readability-is-paramount&#34;&gt;Readability is Paramount&lt;/h3&gt;
&lt;p&gt;All long-running software projects have one thing in common: the code is read a lot more times than it
is written. When you decide what language to use on a project, readability should be a key factor in any
piece of code that is more than a one-off. In other words, the main audience for the code you write is not
a computer, but other programmers.&lt;/p&gt;
&lt;p&gt;Python is a fantastic scripting language and a decent language for prototyping new applications, but
unfortunately, it is a language designed to be &lt;em&gt;written&lt;/em&gt;, not &lt;em&gt;read&lt;/em&gt;. Like assembly languages, Python is a
conversation between a programmer and a computer, and is ill-equipped for communication with other
programmers at scale and over time.&lt;/p&gt;
&lt;p&gt;If you must use Python in production, there are lots of attempts to improve the readability of Python by
adding static typing and type annotations. These tools will likely be valuable, but they do not get you all
the way. For now, I will continue to use Python as a scripting language and a tool for small one-offs, but
I recommend that projects planning for longevity or significant scale start in something else.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Racing the Hardware: 8-bit Division</title>
      <link>https://specbranch.com/posts/faster-div8/</link>
      <pubDate>Tue, 22 Feb 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/faster-div8/</guid>
      <description>Occasionally, I like to peruse uops.info. It is a great resource for micro-optimization: benchmark every x86 instruction on every architecture, and compile the results. Every time I look at this table, there is one thing that sticks out to me: the DIV instruction. On a Coffee Lake CPU, an 8-bit DIV takes a long time: 25 cycles. Cannon Lake and Ice Lake do a lot better, and so does AMD. We know that divider architecture is different between architectures, and aggregating all of the performance numbers for an 8-bit DIV, we see:</description>
      <content:encoded>&lt;p&gt;Occasionally, I like to peruse &lt;a href=&#34;https://uops.info&#34;&gt;uops.info&lt;/a&gt;.  It is a great resource for micro-optimization:
benchmark every x86 instruction on every architecture, and compile the results.  Every time I look at this table,
there is one thing that sticks out to me: the &lt;code&gt;DIV&lt;/code&gt; instruction. On a Coffee Lake CPU, an 8-bit &lt;code&gt;DIV&lt;/code&gt; takes
a long time: 25 cycles.  Cannon Lake and Ice Lake do a lot better, and so does AMD. We know that divider
architecture is different between architectures, and aggregating all of the performance numbers for an
8-bit &lt;code&gt;DIV&lt;/code&gt;, we see:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;left&#34;&gt;Microarchitecture&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Micro-Ops&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Latency (Cycles)&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Reciprocal Throughput (Cycles)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Haswell/Broadwell&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;9&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;21-24&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Skylake/Coffee Lake&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Cannon Lake/Rocket Lake/Tiger Lake/Ice Lake&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Alder Lake-P&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;17&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Alder Lake-E&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;9-12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Zen+&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;9-12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Zen 2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;9-12&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Zen 3&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Intel improved DIV performance significantly with Cannon Lake.  AMD also improved performance
between Zen 2 and Zen 3, but was doing a lot better than Intel to begin with. We know that most
of these processors have hardware dividers, but it seems like there should be a lot of room to
go faster here, especially given the performance gap between Skylake and Cannon Lake.&lt;/p&gt;
&lt;p&gt;Before you check out thinking that we can&amp;rsquo;t even get close to the performance of the hardware
accelerated division units in modern CPUs, take a look at the benchmark results (lower is better on both
graphs):&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/division/benchmark_results.png#center&#34; width=&#34;100%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;And it does not benefit &lt;em&gt;at all&lt;/em&gt; from hardware acceleration. We see slightly better latency for division
alone and better throughput on the Intel chips, and much better throughput on Zen 1 and Zen 2. That is
not to say that this is a pure win compared to the DIV instruction on those machines, since we are spending
extra micro-ops and crowding the other parts of the CPU, but we get some nice results and we have an
algorithm that prepares us for a vectorized implementation, where we can get significantly better throughput.&lt;/p&gt;
&lt;h2 id=&#34;what-we-know-about-intel-and-amds-solution&#34;&gt;What we Know about Intel and AMD&amp;rsquo;s Solution&lt;/h2&gt;
&lt;p&gt;The x86 &lt;code&gt;DIV&lt;/code&gt; instruction has an impedance mismatch with how it is usually used: the &lt;code&gt;DIV&lt;/code&gt; instruction accepts
a numerator that is twice the width of the denominator, and produces a division result and a remainder.  Usually,
when you compute a division result with &lt;code&gt;DIV&lt;/code&gt;, you have a &lt;code&gt;size_t&lt;/code&gt; (or &lt;code&gt;unsigned char&lt;/code&gt;) that you want to divide by
another &lt;code&gt;size_t&lt;/code&gt; (or &lt;code&gt;unsigned char&lt;/code&gt;).  As a result, if you want a 64-bit denominator, the x86 &lt;code&gt;DIV&lt;/code&gt; instruction
forces you to accept a 128-bit numerator spread between two registers. Similarly, for an 8-bit division, the &lt;code&gt;DIV&lt;/code&gt;
instruction has a 16-bit numerator. As such, if you only want the quotient (and not the remainder), or if you want
a numerator that is the same width as the denominator, you don&amp;rsquo;t get the instruction you want.&lt;/p&gt;
&lt;p&gt;This kind of abstraction mismatch is important to watch for in performance work: the wrong abstraction can be
incredibly expensive, and aligning abstractions with the problem you want to solve is often the main consideration
when improving the speed of a computer system. For today&amp;rsquo;s software, &lt;code&gt;DIV&lt;/code&gt; is the wrong abstraction, but since
it is part of the instruction set, we are stuck with it.&lt;/p&gt;
&lt;h5 id=&#34;internal-details-of-div&#34;&gt;Internal Details of &lt;code&gt;DIV&lt;/code&gt;&lt;/h5&gt;
&lt;p&gt;Computer division generally uses iterative approximation algorithms based on an initial guess. However,
effective use of these methods requires that you have a decent initial guess: in each iteration
these methods square their error, doubling the number of correct bits in the guess, but that isn&amp;rsquo;t
very efficient when you start with only one correct bit. As a result, most CPUs have a unit that computes
an initial guess.&lt;/p&gt;
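&lt;p&gt;The canonical refinement here is a Newton&amp;ndash;Raphson step: given a guess $r$ for $\frac{1}{d}$, the update $r \leftarrow r(2 - dr)$ squares the relative error.  A floating-point sketch (the starting values below are arbitrary, chosen only to show the convergence rate):&lt;/p&gt;

```c
#include <assert.h>
#include <math.h>

/* One Newton-Raphson refinement step for 1/d: if r has relative
   error e, the result has relative error e^2 (the number of correct
   bits doubles with each step). */
static double refine(double d, double r) {
    return r * (2.0 - d * r);
}
```

&lt;p&gt;Starting from $r = 1$ for $d = 0.75$ (a 25% error), successive steps give errors of 6.25% and then 0.39%: one correct bit becomes two, then four, then eight.&lt;/p&gt;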
&lt;p&gt;We have some insight into how Intel and AMD do this: the &lt;code&gt;RCPPS&lt;/code&gt; and &lt;code&gt;RCPSS&lt;/code&gt; instructions are
hardware-accelerated reciprocal approximations
with a guarantee that the result is within $1.5 \times 2^{-12}$ of the correct reciprocal. Intel recently
released the &lt;code&gt;VRCP14SD/PD&lt;/code&gt; instructions in AVX-512, which give an approximation within $2^{-14}$ (and they
had &lt;code&gt;VRCP28PD&lt;/code&gt;, but only on the Xeon Phi). Both &lt;code&gt;RCPSS&lt;/code&gt; and &lt;code&gt;VRCPSS&lt;/code&gt; are pretty efficient: they perform
similarly to the &lt;code&gt;MUL&lt;/code&gt; instruction.&lt;/p&gt;
&lt;p&gt;Intel has released some code to simulate the &lt;code&gt;VRCP14SD&lt;/code&gt; instruction: it uses a piecewise linear function to
approximate the reciprocal of the fractional part of the floating-point number, and directly computes the
power of 2 for the reciprocal.&lt;/p&gt;
&lt;p&gt;Using the clues we have, we can guess that for integer division, Intel and AMD probably use something like
the following algorithm:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Prepare the denominator for a guess by left-justifying it (making it similar to a floating-point number)&lt;/li&gt;
&lt;li&gt;Use the hardware behind &lt;code&gt;RCPSS&lt;/code&gt; or &lt;code&gt;VRCP14SD&lt;/code&gt; to get within a tight bound of the answer&lt;/li&gt;
&lt;li&gt;Round the denominator or go through one iteration of the refinement algorithm chosen&lt;/li&gt;
&lt;li&gt;Multiply by the numerator and then shift to get the final division result&lt;/li&gt;
&lt;li&gt;Multiply the division result by the denominator and subtract from the numerator to get the modulus
(perhaps using a fused-multiply-add instruction of some kind)&lt;/li&gt;
&lt;/ol&gt;
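&lt;p&gt;The steps above can be rendered in software.  The sketch below is my own rough reconstruction for 8-bit operands, not Intel&amp;rsquo;s actual datapath: it uses the classic $\frac{48}{17} - \frac{32}{17}D$ linear seed in place of the &lt;code&gt;RCPSS&lt;/code&gt;-class hardware guess, two Newton steps in fixed point, and a multiply-back fix-up at the end:&lt;/p&gt;

```c
#include <assert.h>
#include <stdint.h>

/* Rough software rendition of the guessed hardware recipe for 8-bit
   n / d.  The seed and two Newton steps are illustrative choices; the
   final fix-up mirrors the multiply-back-and-subtract step and
   guarantees an exact quotient regardless of approximation quality. */
static uint8_t soft_div8(uint8_t n, uint8_t d) {
    assert(d != 0);
    /* 1. Left-justify d so it behaves like a fraction D in [0.5, 1). */
    uint32_t dn = d, shift = 0;
    while (dn < 128) { dn <<= 1; shift++; }
    /* 2. Linear seed for 2^16/dn: 48/17 - (32/17)*D, scaled to Q8. */
    uint32_t r = (12288u - 32u * dn) / 17u;
    /* 3. Two Newton steps: r <- r * (2 - dn*r/2^16), in fixed point. */
    r = (r * (131072u - dn * r)) >> 16;
    r = (r * (131072u - dn * r)) >> 16;
    /* 4. Multiply by the numerator and shift to get the quotient. */
    uint32_t q = ((uint32_t)n * r) >> (16 - shift);
    /* 5. Multiply back and compare to fix any remaining small error. */
    while ((q + 1) * d <= n) q++;
    while (q * d > n) q--;
    return (uint8_t)q;
}
```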
&lt;p&gt;And this is all done by a small state machine inside the division acceleration unit, although it probably
used to be done with some assistance from microcode in the Haswell and Skylake generations of Intel CPUs.&lt;/p&gt;
&lt;h2 id=&#34;function-estimation&#34;&gt;Function Estimation&lt;/h2&gt;
&lt;p&gt;It is tempting to use lookup tables of magic numbers, and that is certainly an approach that will
microbenchmark well, but in practice, lookup table algorithms tend to perform poorly: one cache
miss costs you tens or hundreds of clock cycles. Intel and AMD can use lookup tables without this
problem, but we don&amp;rsquo;t have that luxury.&lt;/p&gt;
&lt;p&gt;As a first step to DIYing it, let&amp;rsquo;s look at two numerical techniques for generating polynomial
approximations: Chebyshev Approximations and the Remez Exchange algorithm. They deserve their own
blog post, which is on my list, but here is a very quick overview.&lt;/p&gt;
&lt;h5 id=&#34;constraining-the-domain&#34;&gt;Constraining the Domain&lt;/h5&gt;
&lt;p&gt;Approximation algorithms work best when we are approximating over a small domain. For division, the domain
can be $0.5 &amp;lt; D \leq 1$, meaning that we can constrain $\frac{1}{D}$ pretty tightly as well:
$1 \leq \frac{1}{D} &amp;lt; 2$.  We will do the other part of the division by bit shifting, since
bit shifts are fast and allow us to divide by any power of 2.  In order to compute an 8-bit DIV, we would
like to estimate the reciprocal $\frac{1}{D}$ such that we are overestimating, but never by more than
$2^{-8}$.  We can then multiply this reciprocal by the numerator and shift, and we get a division.&lt;/p&gt;
&lt;h5 id=&#34;chebyshev-approximation&#34;&gt;Chebyshev Approximation&lt;/h5&gt;
&lt;p&gt;Chebyshev approximation is a method for generating a polynomial approximation of a function within a
window.  Most people have heard of using a Taylor series to do this, but a Chebyshev approximation
produces a result that is literally an order of magnitude better than a Taylor series.  Instead of
approximating a function as a series of derivatives, we approximate the function as a sum of
Chebyshev polynomials. The Chebyshev polynomials are special because they are level: The nth
Chebyshev polynomial oscillates between +1 and -1 within the interval [-1, 1] and crosses 0 n times.
If you are familiar with Fourier series decomposition, Chebyshev approximation uses a similar process,
but with the Chebyshev polynomials instead of sines and cosines (for the mathematically inclined,
the Chebyshev polynomials are an orthogonal basis).  The first 4 Chebyshev polynomials look like:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/division/chebyshevplots.png#center&#34; width=&#34;80%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;When we do a Chebyshev approximation, we get a much more accurate approximation than a Taylor series:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/division/taylor_vs_cheb.png#center&#34; width=&#34;60%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;The Chebyshev approximation is almost exactly the same as the function!  Instead of error increasing
over the input like a Taylor series, the error of a Chebyshev approximation oscillates and the error stays
near 0:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/division/taylor_cheb_error.png#center&#34; width=&#34;100%&#34;/&gt; 
&lt;/figure&gt;

&lt;h5 id=&#34;the-remez-algorithm&#34;&gt;The Remez Algorithm&lt;/h5&gt;
&lt;p&gt;The Remez Exchange algorithm allows you to refine an existing Chebyshev approximation, improving its
maximum error by a little bit. Looking at the error of the Chebyshev approximation from above, the
absolute value of the peaks decreases as x increases.  The Remez algorithm adjusts the approximation
to equalize the error over the domain.  Since we are trying to minimize the maximum error, this helps
the approximation a little bit.  After applying one iteration of the Remez algorithm, the error
changes from the figure on the left to the figure on the right:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/division/chebyshev_vs_remez.png#center&#34; width=&#34;100%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;To compare the two approximations, we get the following approximate formulas from Chebyshev and Remez:&lt;/p&gt;
&lt;p&gt;$$ Chebyshev(x) = 7.07107 - 19.6967x + 27.0235x^2 - 18.2694x^3 + 4.87184x^4 $$&lt;/p&gt;
&lt;p&gt;$$ Remez(x) = 7.15723 - 20.1811x + 28.024x^2 - 19.17x^3 + 5.17028x^4 $$&lt;/p&gt;
&lt;h2 id=&#34;improving-our-approximation&#34;&gt;Improving our Approximation&lt;/h2&gt;
&lt;p&gt;We want to do two things to improve our approximation to make it useful: (1) we want it to compute faster, and (2) we
want it to never underestimate so we can round by truncation (round down).  If we shift up the Remez formula a bit,
we get something that solves the second criterion:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/division/shifted_remez_error.png#center&#34; width=&#34;60%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;$$ RemezShifted(x) = 5.17028 (1.96866 - 2.6834x + x^2)(0.703212 - 1.02432x + x^2) $$&lt;/p&gt;
&lt;p&gt;But not the first.  To compute this, we still need 5 multiplication instructions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;We have to produce $x^2$&lt;/li&gt;
&lt;li&gt;We have to multiply the linear terms in each of the sets of parentheses&lt;/li&gt;
&lt;li&gt;We have to multiply the two parts together&lt;/li&gt;
&lt;li&gt;We need to multiply the whole thing by 5.17028&lt;/li&gt;
&lt;/ol&gt;
&lt;h5 id=&#34;eliminating-multiplications&#34;&gt;Eliminating Multiplications&lt;/h5&gt;
&lt;p&gt;Not all multiplications are equal.  Multiplying by 5 is much easier than multiplying by 5.17028.  This is because
multiplying a number by 5 is equivalent to multiplying it by 4 (shifting left by 2) and then adding the result to
itself.  If you want to multiply a register by 5 on an x86 machine, you can do it without a &lt;code&gt;MUL&lt;/code&gt; instruction:
&lt;code&gt;LEA $result_reg, [$input_reg + $input_reg * 4]&lt;/code&gt; performs this calculation. The &lt;code&gt;LEA&lt;/code&gt; instruction is
one of the most effective mathematical instructions in the x86 instruction set: it allows you to do several types of
addition and multiply by 2, 3, 4, 5, 8, or 9 in one instruction.&lt;/p&gt;
&lt;p&gt;Intel CPUs have a set of units that accelerate array indexing which allow you to compute an address for an array
element. The accelerators compute &lt;code&gt;base + constant + index * size&lt;/code&gt; while loading or storing to memory, with &lt;code&gt;size&lt;/code&gt;
restricted to a word length (1, 2, 4, or 8 on a 64-bit machine). Instead of loading from memory, &lt;code&gt;LEA&lt;/code&gt; directly
gives us the result of the array indexing calculation. This allows us to do simple multiplication, 3-operand addition,
and other similar functions in one instruction (although 3-operand &lt;code&gt;LEA&lt;/code&gt;s tend to carry a performance penalty compared
to 2-operand &lt;code&gt;LEA&lt;/code&gt; instructions).&lt;/p&gt;
&lt;p&gt;The coefficients of the formula we derived with Remez are pretty close to some coefficients that allow us to
use bit shifts and &lt;code&gt;LEA&lt;/code&gt;s: $5 \approx 5.17028$, $2.6834 \approx 2.6875$, and $1.02432 \approx 1.03125$.&lt;/p&gt;
&lt;p&gt;$$ 1.03125 x = x + (x \gg 5) $$&lt;/p&gt;
&lt;p&gt;$$ 2.6875 x = 3 x - 5 (x \gg 4) $$&lt;/p&gt;
&lt;p&gt;That gives us the following function:&lt;/p&gt;
&lt;p&gt;$$ ApproxRecip(x) = 5(c_1 - 2.6875x + x^2)(c_2 - 1.03125x + x^2) $$&lt;/p&gt;
&lt;p&gt;That gives us both coefficients using two &lt;code&gt;SHR&lt;/code&gt; instructions, two &lt;code&gt;LEA&lt;/code&gt; instructions, one &lt;code&gt;ADD&lt;/code&gt; (which
could also be an &lt;code&gt;LEA&lt;/code&gt;), and one &lt;code&gt;SUB&lt;/code&gt;. Our overall computation now only has two &lt;code&gt;MUL&lt;/code&gt; instructions:
one to compute $x^2$ and one to multiply the results from the parentheses.  The final multiplication by 5
is another &lt;code&gt;LEA&lt;/code&gt; instruction instead of a &lt;code&gt;MUL&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;We can use a minimax algorithm to determine $c_1 = 1.97955322265625$ and $c_2 = 0.717529296875$.
Needless to say, the function we get is no longer optimal. The error (comparing to $1/x$) looks like this:&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/division/approx_error.png#center&#34; width=&#34;60%&#34;/&gt; 
&lt;/figure&gt;

&lt;p&gt;This is not anywhere near as good as our shifted Remez approximation, but it is still good enough:
the maximum of the error is a little over $0.0022$, which means we get within 8.8 bits of the correct
result of the division. That&amp;rsquo;s close enough for an 8-bit division! Since we overestimate, we can just
truncate to 8 bits, rounding down, and we will get the correct result.&lt;/p&gt;
&lt;h2 id=&#34;mapping-to-code&#34;&gt;Mapping to Code&lt;/h2&gt;
&lt;p&gt;Mapping this algorithm to code is going to require a healthy dose of fixed-point arithmetic. We have
talked a lot about numbers between 0.5 and 1, but in order to map this to efficient code, we are not
going to use floating point. Instead, we will use fixed-point: we will treat a certain number of bits
of an integer as &amp;ldquo;behind the decimal point&amp;rdquo; and the rest as &amp;ldquo;in front of the decimal point.&amp;rdquo;  At first,
we will be using 16 bits behind the decimal point, and then later 32 bits. We will refer to them as &lt;code&gt;m.n&lt;/code&gt;
numbers, with m bits ahead of the decimal place, and n bits behind. Since we are working on a 64-bit
machine, we are using &lt;code&gt;48.16&lt;/code&gt; and &lt;code&gt;32.32&lt;/code&gt; numbers.&lt;/p&gt;
&lt;p&gt;Adding fixed-point numbers works like integer addition as long as each fixed-point number has the same
order of magnitude.  If not, the numbers have to be shifted to be aligned: when you add a &lt;code&gt;32.32&lt;/code&gt; number
to a &lt;code&gt;48.16&lt;/code&gt; number, one needs to be shifted by 16 bits to align it with the other.&lt;/p&gt;
&lt;p&gt;Multiplying fixed-point numbers is a bit weirder. When you multiply two 32-bit integers, you get a
result that fits in 64 bits. Often, we discard the top 32 bits of that result to truncate the result
to the same size as the inputs. With fixed-point &lt;code&gt;m.n&lt;/code&gt; numbers, you add the number of bits before and
after the decimal place: when you multiply a &lt;code&gt;48.16&lt;/code&gt; number with a &lt;code&gt;32.32&lt;/code&gt; number, the result is an &lt;code&gt;80.48&lt;/code&gt;
number, and must be shifted and truncated to select the bits you want.  We are going to try to avoid
any shifting if we can, and just use the bottom 64 bits of the result so we can use a single &lt;code&gt;MUL&lt;/code&gt;
instruction for multiplication.&lt;/p&gt;
&lt;p&gt;One final caveat: for these algorithms, we will be ignoring division by 0 for now.  Division by 0 can
be handled slowly (it is for the &lt;code&gt;DIV&lt;/code&gt; instruction: Intel raises a divide-error exception) as long as it doesn&amp;rsquo;t
impact other results.  Thankfully, we have a tool to handle this: branches and the branch predictor.
Around each of these blocks of code, we can place a wrapper that checks for division by 0 and gives a
specific result (or throws an exception) if you have a division by 0.  Thanks to branch prediction, this
branch will be free if we are not frequently dividing by 0, and if we are dividing by 0 frequently, it
will be a lot better on average than the exception!&lt;/p&gt;
&lt;h5 id=&#34;doing-it-in-c&#34;&gt;Doing it in C&lt;/h5&gt;
&lt;p&gt;I attempted to write this in C for microbenchmarks, fully expecting the compiler to produce a result that
was as good (in a standalone context) as handwritten assembly.  Here is the C code:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;background-color:#f0f0f0;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-C&#34; data-lang=&#34;C&#34;&gt;&lt;span style=&#34;color:#007020&#34;&gt;#include&lt;/span&gt; &lt;span style=&#34;color:#007020&#34;&gt;&amp;lt;stdint.h&amp;gt;&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;#include&lt;/span&gt; &lt;span style=&#34;color:#007020&#34;&gt;&amp;lt;immintrin.h&amp;gt;&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;&lt;/span&gt;
&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Constants computed with minimax
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;static&lt;/span&gt; &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;const&lt;/span&gt; &lt;span style=&#34;color:#902000&#34;&gt;int&lt;/span&gt; C1 &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;0x1fac5&lt;/span&gt;;
&lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;static&lt;/span&gt; &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;const&lt;/span&gt; &lt;span style=&#34;color:#902000&#34;&gt;int&lt;/span&gt; C2 &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;0xb7b0&lt;/span&gt;;

int64_t &lt;span style=&#34;color:#06287e&#34;&gt;approx_recip16&lt;/span&gt;(int64_t divisor) {
    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Square the divisor, shifting to get a 48.16 fixed point solution
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    int64_t div_squared &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; (divisor &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; divisor) &lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;16&lt;/span&gt;;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Compute the factors of the approximation
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    int64_t f1 &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; C1 &lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;3&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; divisor &lt;span style=&#34;color:#666&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;5&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; (divisor &lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;4&lt;/span&gt;);
    int64_t f2 &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; C2 &lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt; divisor &lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt; (divisor &lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;5&lt;/span&gt;);
    f1 &lt;span style=&#34;color:#666&#34;&gt;+=&lt;/span&gt; div_squared;
    f2 &lt;span style=&#34;color:#666&#34;&gt;+=&lt;/span&gt; div_squared;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Combine the factors and multiply by 5, giving a 32.32 result
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    int64_t combined &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; f1 &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; f2;
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;5&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; combined;
}

uint8_t &lt;span style=&#34;color:#06287e&#34;&gt;div8&lt;/span&gt;(uint8_t num, uint8_t div) {
    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Map to 48.16 fixed-point
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    int64_t divisor_extended &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; div;
    uint16_t shift &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; __lzcnt16(divisor_extended);

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Shift out the bits of the extended divisor in front of the decimal
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    divisor_extended &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; divisor_extended &lt;span style=&#34;color:#666&#34;&gt;&amp;lt;&amp;lt;&lt;/span&gt; shift;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Compute the reciprocal
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    int64_t reciprocal &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; approx_recip16(divisor_extended);

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Multiply the reciprocal by the numerator to get a preliminary result
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// in 32.32 fixed-point
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    int64_t result &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; num &lt;span style=&#34;color:#666&#34;&gt;*&lt;/span&gt; reciprocal;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// Shift the result down to an integer and finish the division by dividing
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;// by the appropriate power of 2
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    uint32_t result_shift &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;48&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt; shift;
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;return&lt;/span&gt; result &lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&amp;gt;&lt;/span&gt; result_shift;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;For a remainder calculation, we just have to do a multiplication and subtraction afterwards.&lt;/p&gt;
&lt;h5 id=&#34;hand-tuned-assembly&#34;&gt;Hand-tuned Assembly&lt;/h5&gt;
&lt;p&gt;Since we are racing against hardware and we have a simple function, we want to give ourselves every
advantage we can.  Let&amp;rsquo;s see how we do with handwritten assembly:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;background-color:#f0f0f0;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-asm&#34; data-lang=&#34;asm&#34;&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Divide 8-bit numbers assuming use of the System-V ABI
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Arguments:
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;#   - RDI = numerator (already 0-extended)
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;#   - RSI = denominator (already 0-extended)
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Return value:
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;#   - RAX = division result
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Clobbers:
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;#   RCX, RDX, R8, R9, R10, flags
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;
&lt;span style=&#34;color:#4070a0&#34;&gt;.text&lt;/span&gt;
&lt;span style=&#34;color:#002070;font-weight:bold&#34;&gt;div8:&lt;/span&gt;
    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Count leading zeroes of the denominator, preparing for fixed point
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;lzcnt&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;cx&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;si&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Put denominator inside [0.5, 1) as a 48.16 number, creating 2 copies
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# We do two of these instead of using a MOV to improve latency
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;shlx&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rsi&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rcx&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;shlx&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r8&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rsi&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rcx&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Square one of the copies
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;imul&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r8&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;r8&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Make denominator * 3 to begin the first parenthetical factor
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;lea&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r10d&lt;/span&gt;, [&lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt; &lt;span style=&#34;&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt; * &lt;span style=&#34;color:#40a070&#34;&gt;2&lt;/span&gt;]

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Begin producing negative of the second factor by subtracting
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# C2 from the denominator
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;lea&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;eax&lt;/span&gt;, [&lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt; - &lt;span style=&#34;color:#40a070&#34;&gt;0xb7b0&lt;/span&gt;]

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Copy the denominator
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;mov&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r9d&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;edx&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Negate denominator * 3
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;neg&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r10d&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Shift our copies of the denominator
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;shr&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;edx&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;4&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;shr&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r9d&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;5&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Compute 5 * (denominator &amp;gt;&amp;gt; 4) in EDX
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;lea&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;edx&lt;/span&gt;, [&lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt; &lt;span style=&#34;&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt; * &lt;span style=&#34;color:#40a070&#34;&gt;4&lt;/span&gt;]

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Add C1 to the first factor
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;add&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r10d&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;0x1fac5&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Make denominator + (denominator &amp;gt;&amp;gt; 5) - C2
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;add&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;eax&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;r9d&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Make C1 - 3 * denominator + 5 * (denominator &amp;gt;&amp;gt; 4)
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;add&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;edx&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;r10d&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Shift denominator^2 right by 16, going from 32.32 to 48.16, to
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# prepare for addition to a 48.16 number
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;shr&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r8d&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;16&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Add the square to each factor, since one is negative we subtract
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;add&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;edx&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;r8d&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;sub&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;r8d&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;eax&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Multiply the factors and then multiply by 5
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# After this step, we have a 32.32 approximate reciprocal
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;imul&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;r8&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;lea&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;, [&lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt; &lt;span style=&#34;&#34;&gt;+&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;rdx&lt;/span&gt; * &lt;span style=&#34;color:#40a070&#34;&gt;4&lt;/span&gt;]

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Multiply the numerator by the approximate reciprocal to get the quotient
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;imul&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rdi&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Shift out bits to divide by the remaining powers of 2 and
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# undo the shift we did to put the denominator inside [0.5, 1)
&lt;/span&gt;&lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;&lt;/span&gt;    &lt;span style=&#34;color:#06287e&#34;&gt;neg&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;ecx&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;add&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;ecx&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;48&lt;/span&gt;
    &lt;span style=&#34;color:#06287e&#34;&gt;shrx&lt;/span&gt; &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rax&lt;/span&gt;, &lt;span style=&#34;color:#60add5&#34;&gt;rcx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;A remainder calculation is also a matter of a straightforward multiplication and subtraction after the
division, which amounts to an &lt;code&gt;IMUL&lt;/code&gt; and a &lt;code&gt;SUB&lt;/code&gt;.&lt;/p&gt;
&lt;h5 id=&#34;comparison-between-c-and-assembly&#34;&gt;Comparison between C and Assembly&lt;/h5&gt;
&lt;p&gt;Using &lt;a href=&#34;https://godbolt.org&#34;&gt;godbolt.org&lt;/a&gt; to inspect the compiled result of the C code, a few observations stand out:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Clang did significantly better than gcc. With code that is arithmetic-heavy, this is not surprising.&lt;/li&gt;
&lt;li&gt;C compilers (both gcc and clang) used fewer registers than my handwritten solution.&lt;/li&gt;
&lt;li&gt;The best compiled solution (&lt;code&gt;clang -O3 -march=haswell&lt;/code&gt;) involved 25 micro-ops issued, but since 3 of them
were &lt;code&gt;MOV&lt;/code&gt; instructions that copy one register to another, most machines will only have to retire 22 micro-ops
due to &lt;code&gt;MOV&lt;/code&gt; elision.&lt;/li&gt;
&lt;li&gt;Presumably as a result of register allocation, the C had worse throughput and latency.&lt;/li&gt;
&lt;li&gt;Architecture specializations don&amp;rsquo;t appear to change the compiler output significantly.&lt;/li&gt;
&lt;li&gt;Computing the negative of the second factor and using an additional &lt;code&gt;SHRX&lt;/code&gt; instruction instead of a &lt;code&gt;MOV&lt;/code&gt;
was important for latency in the assembly implementation.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Outside of a microbenchmark and in real software, the compiler probably wins due to the register allocation
and the reduction in number of retired micro-ops. However, if I were considering a microcoded division
implementation for a new CPU, I would likely prefer to spend registers in exchange for speed: a modern x86 CPU
has an abundance of registers available to microcode programs (Skylake has 180 integer registers), but only
exposes 16 of them at a time to software.&lt;/p&gt;
&lt;h5 id=&#34;working-with-non-x86-machines&#34;&gt;Working with Non-x86 Machines&lt;/h5&gt;
&lt;p&gt;On a non-x86 machine, you can do something very similar: ARM allows similar instructions to Intel&amp;rsquo;s
&lt;code&gt;LEA&lt;/code&gt; instruction with its barrel-shifted operands.  RISC-V with the Bitmanip extensions is similar.
However, without Bitmanip, we don&amp;rsquo;t get the &lt;code&gt;LEA&lt;/code&gt; trick, so RISC-V needs several more instructions
for multiplication by 5 and 3.  Several of the choices of bit widths in this algorithm are based on which
x86 instructions are the fastest: an ARM or RISC-V solution would need similar care.&lt;/p&gt;
&lt;p&gt;Embedded systems these days often have a single-cycle multiplier.  In systems like that, the &lt;code&gt;MUL&lt;/code&gt;
elimination steps we took are useless: you can do something that is a lot simpler.  Using a formula
similar to the $RemezShifted(x)$ formula, you could likely get great division performance on an embedded
system.&lt;/p&gt;
&lt;p&gt;On narrower machines, you can do the same calculation with some bit shifting or some implied bits
to get similar results without using 64 bit words.  You need at least a 16 bit word to do this algorithm,
though: with 8 bit words you lose too much precision to quantization.  For a vectorized version, we will be
using 16-bit words.&lt;/p&gt;
&lt;h2 id=&#34;measurements&#34;&gt;Measurements&lt;/h2&gt;
&lt;p&gt;In order to produce the data tables above, I used &lt;code&gt;llvm-mca&lt;/code&gt; to determine the reciprocal throughput of each
solution, and computed latency by hand. The latency difference between Intel and AMD comes down to the &lt;code&gt;lzcnt&lt;/code&gt;
instruction: it takes 3 cycles on Intel and 1 cycle on AMD. The assembly code I have proposed here is fairly
portable and has similar characteristics between AMD and Intel cores.&lt;/p&gt;
&lt;h5 id=&#34;benchmarks&#34;&gt;Benchmarks&lt;/h5&gt;
&lt;p&gt;To verify the theoretical numbers, I ran a few benchmarks using Agner Fog&amp;rsquo;s
&lt;a href=&#34;https://www.agner.org/optimize/#testp&#34;&gt;PMCTest&lt;/a&gt; program on several machines, including some control
benchmarks on &lt;code&gt;DIV&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Here is the control benchmark for &lt;code&gt;DIV&lt;/code&gt;:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;left&#34;&gt;Machine&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;DIV Latency&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;DIV Recip Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;AMD Ryzen 1700 Desktop&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Intel Cascade Lake (DigitalOcean VM)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;AMD Zen 2 (DigitalOcean VM)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;11&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;These are the same as the measurements from uops.info. And the results for our division function are:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;left&#34;&gt;Machine&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Our Latency&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Our Recip Throughput&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Our Micro-Ops&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;AMD Ryzen 1700 Desktop&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15 (+15%)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;7.7 (-40%)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;26.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Intel Cascade Lake (DigitalOcean VM)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;18 (-28%)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.7 (+12%)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;AMD Zen 2 (DigitalOcean VM)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13 (+18%)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5.5 (-50%)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;23&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Our benchmark numbers line up with our estimates, and suggest that we have produced something that is in line
with the performance of a hardware-accelerated division function.  We have 40-50% better throughput than
DIV on AMD Zen 1/2 machines, with a minor latency penalty.  Compared to a recent Intel machine, we have 28% better
latency with a 12% throughput penalty.&lt;/p&gt;
&lt;p&gt;Once again, in graphical form, here is the comparison between DIV and our function (lower is better on both
charts):&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;https://specbranch.com/division/benchmark_results.png#center&#34; width=&#34;100%&#34;/&gt; 
&lt;/figure&gt;

&lt;h2 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h2&gt;
&lt;p&gt;It looks like you can produce a CPU with microcoded division that gets close in performance to a CPU with
a hardware accelerated unit.  That is not to say that Intel or AMD should get rid of their division accelerators:
they are comparatively small and the new ones offer better performance than this algorithm.  However, you can
get very similar performance to the older accelerators with pure software.  If you are producing a
smaller RISC-V core, though, an algorithm like this looks like a viable alternative to adding a division
accelerator.  Additionally, applications that do a lot of division on Skylake and older CPUs may benefit from
using this approach rather than dividing with DIV.&lt;/p&gt;
&lt;p&gt;For server-class CPUs, the real power of this kind of formula is not in scalar division, but vector division,
where hardware acceleration is a lot more expensive.  The algorithm will look slightly different: we will
probably want to use a 16-bit word length rather than a 64-bit word length.&lt;/p&gt;
&lt;p&gt;We also have another loose end: wider divisions.  64 bit division is a lot more useful than 8 bit division,
and some machines have very long computation times for 64 bit DIVs (Haswell and Skylake in particular).  If
we can improve 64 bit division, there may be some places where we can race hardware there too.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>The Meaning of Speed</title>
      <link>https://specbranch.com/posts/performance-dimensions/</link>
      <pubDate>Sun, 13 Feb 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/performance-dimensions/</guid>
      <description>A lot of the time, when engineers think of performance work, we think about looking at benchmarks and making the numbers smaller. We anticipate that we are benchmarking the right pieces of code, and we take it for granted that reducing some of those numbers is a benefit, but also &amp;ldquo;the root of all evil&amp;rdquo; if done prematurely. If you are a performance-focused software engineer, or you are working with performance engineers, it can help to understand the value proposition of performance and when to work on it.</description>
      <content:encoded>&lt;p&gt;A lot of the time, when engineers think of performance work, we think about looking at benchmarks and
making the numbers smaller.  We anticipate that we are benchmarking the right pieces of code, and we take
it for granted that reducing some of those numbers is a benefit, but also &amp;ldquo;the root of all evil&amp;rdquo; if done
prematurely.  If you are a performance-focused software engineer, or you are working with performance
engineers, it can help to understand the value proposition of performance and when to work on it.&lt;/p&gt;
&lt;p&gt;There are three important components of performance for computer systems: latency, throughput, and
efficiency.  Broadly speaking, latency is how quickly you can do something, throughput is how many
of those things you can do per second, and efficiency is how much you pay to do that thing. Each of these
dimensions can be a little bit abstract and hard to work with directly, so we often think about proxies,
like CPU time, but it helps to keep your eye on what is actually impactful. These dimensions all have
interesting relationships with each other, and reasoning about them directly can help you figure out
how impactful performance work can be for you.&lt;/p&gt;
&lt;h3 id=&#34;throughput&#34;&gt;Throughput&lt;/h3&gt;
&lt;p&gt;Throughput is either defined as the load on a system or the system&amp;rsquo;s maximum capacity to sustain load.
Throughput is measured in terms of queries per second or bytes per second (or both at the same time). Often,
systems are designed with specific throughput goals, or are designed to achieve maximum throughput under
a certain set of constraints. In many cases, throughput is the performance dimension that dictates budgets
and hardware requirements.  A table of common throughputs can be found
&lt;a href=&#34;https://specbranch.com/posts/common-perf-numbers/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Throughput often has a relationship to efficiency. When you have a certain number of servers and you are
running a simple process on those servers, reducing the amount of compute spent on each query can often
provide a corresponding increase in throughput. It is common to hear about QPS/core or QPS/VM for a service that
scales, and for simple services and systems, throughput and efficiency often go hand-in-hand.
Throughput can also have a relationship with latency: If you have a limit on the
number of operations outstanding in a system, the latency of the operations can cause a throughput limit.
Consider a database that can process only 1 write operation at a time where a write operation takes 1 ms:
That database cannot process more than 1000 writes per second no matter how efficient it is.&lt;/p&gt;
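&lt;p&gt;This closed-loop throughput ceiling is just arithmetic (a special case of Little&amp;rsquo;s law); here is a minimal sketch, using the hypothetical numbers from the database example:&lt;/p&gt;

```python
# Closed-loop throughput limit: with k operations in flight and a latency
# of L seconds per operation, throughput cannot exceed k / L (Little's law).
def max_throughput(concurrency, latency_s):
    return concurrency / latency_s

# The single-writer database from the text: 1 op in flight, 1 ms per write.
print(max_throughput(1, 0.001))   # 1000 writes per second, however efficient it is
# Allowing 8 concurrent writes raises the ceiling to 8000 writes per second.
print(max_throughput(8, 0.001))
```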
&lt;p&gt;Additionally, the throughput of a process is limited by its lowest-throughput step. For example, if you
have a system that calls an API that can only take 1000 QPS, your system cannot exceed that 1000 QPS limit.
Similarly, a super-fast storage system backed by a hard drive will never exceed ~150-200 MB/second
(depending on the hard drive).  Focusing on any other part of the system will not improve its throughput at all.&lt;/p&gt;
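&lt;p&gt;The bottleneck rule above amounts to taking a minimum over stages; as a sketch, with made-up stage throughputs:&lt;/p&gt;

```python
# A pipeline's throughput is the throughput of its slowest stage.
def system_throughput(stage_qps):
    return min(stage_qps)

# Hypothetical system: 50k QPS frontend, 20k QPS app tier, and a
# downstream API capped at 1000 QPS.
stages = [50_000, 20_000, 1_000]
print(system_throughput(stages))  # 1000 -- speeding up the other stages changes nothing
```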
&lt;p&gt;There are many ways to improve throughput of a system, including adding parallelism, sharding, using caches, and
making efficiency improvements, but thanks to the principle that the lowest-throughput component determines
throughput, you have to reason about the throughput of entire systems if you want to make meaningful
improvements.  In our hypothetical case of a hard-drive-backed storage system, adding a caching SSD would
raise its throughput a lot because the hard drive is the limiting factor, and in the hypothetical write-limited
database system, sharding the application-level database would alleviate throughput problems caused by the
latency of write operations.&lt;/p&gt;
&lt;p&gt;Today, we often hear about &amp;ldquo;infinite-scale&amp;rdquo; or &amp;ldquo;planet-scale&amp;rdquo; distributed systems, and these systems are
often designed for infinite &lt;em&gt;throughput&lt;/em&gt; scalability. These systems are very academically interesting, but they
often make trades in efficiency or latency for their ability to scale without limits. If you are concerned
about the other dimensions of performance, it may be better to eschew planet scale for a system that works at
the size you want.&lt;/p&gt;
&lt;h3 id=&#34;efficiency&#34;&gt;Efficiency&lt;/h3&gt;
&lt;p&gt;Compute efficiency is about dollars spent per dollar earned. For a straightforward example, if you have a
serverless backend and you charge $1.00 to customers per serverless function call, reducing the CPU time on
that serverless function is meaningful efficiency work - you will spend less per dollar you make.&lt;/p&gt;
&lt;p&gt;It is important to make sure that you can tie efficiency gains to dollars. For example, if you
need to use a certain VM shape for a given reason (because it has SSDs or a GPU for example), and your
service fits in that VM shape comfortably, improving its memory usage or CPU usage can be useless
for efficiency - you have to spend money for that VM anyway! However, if the improvements you make allow
you to spend less VM time, that is worth something.&lt;/p&gt;
&lt;p&gt;There is one exception to the rule of tying compute efficiency to money savings, a place where compute
efficiency can contribute to profits: the client side. Having a more efficient client means that users
with cheaper devices can use your software. Some time ago, when YouTube decreased the size of their video
client, they found that average loading times went up instead of down! People with slower internet
connections were now able to load the YouTube player, so YouTube&amp;rsquo;s efficiency gain resulted in more revenue.&lt;/p&gt;
&lt;p&gt;Efficiency work can often be done at the level of subsystems, but some end-to-end thinking can help.
It can be worth it to make subsystems less efficient if they offer APIs or interfaces that improve system-level
efficiency. Google recently published a
&lt;a href=&#34;https://cloud.google.com/blog/topics/systems/trading-off-malloc-costs-and-fleet-efficiency&#34;&gt;blog post&lt;/a&gt;
on an interesting case of this: using a slower &lt;code&gt;malloc&lt;/code&gt; algorithm resulted in &lt;em&gt;better&lt;/em&gt; compute efficiency
since the slower algorithm provided better memory locality for programs that used it.&lt;/p&gt;
&lt;p&gt;Efficiency work is usually the easiest performance work to price, since it is solely related to cost
reductions, so managers and companies often think (close to) rationally about efficiency work. However,
since efficiency work is a cost reduction, many companies are loath to support it, believing that features
are more important. Some of those companies believe that performance and efficiency are the same, and
dismiss performance work entirely as &amp;ldquo;premature optimization.&amp;rdquo; However, a little efficiency work can end
up saving you a lot, and even startups can benefit if some optimization gives them a longer runway.&lt;/p&gt;
&lt;h3 id=&#34;latency&#34;&gt;Latency&lt;/h3&gt;
&lt;p&gt;Latency is how long it takes for your system to respond to a request, measured in seconds. If a user sees
high latency, they will tell you that your system is &amp;ldquo;slow,&amp;rdquo; and they won&amp;rsquo;t care about how your system can
scale out to serve billions of people at once. In some industries, like ad tech and high-frequency trading,
latency is the key metric that needs to be optimized.  In other situations, like training ML models,
overall latency is relatively unimportant.  A table of common latencies can be found
&lt;a href=&#34;https://specbranch.com/posts/common-perf-numbers/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Many people wouldn&amp;rsquo;t care about latency except that the latency of a subsystem has a habit of placing limits on
throughput. This occurs in every closed-loop system, where you wait for completion of one request
before processing another one. If your system has locks that must be exclusively held per request, or in the
case of the write-limited database we looked at before, the latency of the locked section or the database
becomes critical. Supercomputers often have low-latency networks because high-performance computing programs
usually involve feedback loops between machines, so having a low-latency network results in having higher
system-level throughput.&lt;/p&gt;
&lt;p&gt;Latency is usually measured by looking at distributions of request times. These distributions are usually
generated by two methods: probing and tracing. Probers generate artificial requests to see how long it takes to
get a response, and tracers generally attach a tag to requests which asks each server along the path of the
request to measure how long each step took. Both of these methods have problems, and analysis should be done
with this in mind: Tracing adds overhead to each traced request, which can affect throughput and makes traced
requests slightly unrepresentative of untraced requests. Probers tend to test only a few types of requests, so
an over-reliance on probers can result in overfitting. With the data from latency measurement tools, different
parameters of the distribution can tell you different things:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The median of the latency distribution tells you about a &amp;ldquo;normal&amp;rdquo; request, and helps you understand the happy path.&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Tail latency&amp;rdquo; usually refers to the 90th, 99th, 99.9th, or 99.99th percentile of the distribution, and can
tell you a lot about how your system performs when it is under load or otherwise operating in a degraded mode.&lt;/li&gt;
&lt;li&gt;Average latency can be important for components of closed-loop systems: if a system does 100 serialized reads from
a database to produce a result, the latency of that system is probably close to 100 times the average latency of the
database.&lt;/li&gt;
&lt;/ul&gt;
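&lt;p&gt;As a quick illustration of how differently these parameters can behave, here is a sketch that pulls the median, mean, and 99th percentile from a set of latency samples (synthetic, exponentially distributed data standing in for real trace or prober output):&lt;/p&gt;

```python
import random
import statistics

random.seed(0)
# Synthetic latency samples (ms): exponentially distributed around a 5 ms
# mean, standing in for measurements from a tracer or prober.
samples = [random.expovariate(1 / 5) for _ in range(100_000)]

samples.sort()
median = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
mean = statistics.fmean(samples)

# For a heavy-tailed distribution like this one, p99 lands far above the
# median, which is why "normal" and "tail" latency tell different stories.
print(f"median={median:.1f}ms  mean={mean:.1f}ms  p99={p99:.1f}ms")
```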
&lt;p&gt;Queueing and batching often result in latency, even though they may help on the other dimensions: long queues of
requests mean that a system is running at high throughput (good) but also that each request experiences
high latency (bad). Latency associated with queues is called &amp;ldquo;queueing delay,&amp;rdquo; and frequently shows up at the tails -
the 99.9th percentile request probably experiences significant queueing delay. Queues are not always explicit:
congested network routes often add significant queueing delay, and operating systems have run queues, so if you are
running more than one thread per core, you will experience queueing delay related to scheduling. Batching will always
hurt minimum latency, but batching can benefit average latency and tail latency if batching means doing less work per
request.&lt;/p&gt;
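&lt;p&gt;To see how sharply queueing delay grows with load, here is a sketch using the textbook M/M/1 queueing model (a simplification: random arrivals, a single server, and hypothetical rates), where the mean time in the system is one over the spare capacity:&lt;/p&gt;

```python
# M/M/1 queue: mean time in system = 1 / (service_rate - arrival_rate).
# As arrival_rate approaches service_rate, queueing delay dominates.
def mean_time_in_system(arrival_rate, service_rate):
    assert service_rate > arrival_rate, "queue is unstable"
    return 1.0 / (service_rate - arrival_rate)

service = 1000.0  # server handles 1000 QPS, so bare service time is 1 ms
for load in (0.5, 0.9, 0.99):
    t = mean_time_in_system(load * service, service)
    print(f"{load:.0%} utilization: {t * 1000:.0f} ms per request")
# 2 ms at 50% load, 10 ms at 90%, 100 ms at 99% -- almost all of the
# extra time is queueing delay, which is what the tail percentiles see.
```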
&lt;p&gt;&amp;ldquo;Thread per core&amp;rdquo; systems, like high-performance databases (&lt;em&gt;eg&lt;/em&gt; ScyllaDB or Aerospike)
try to make all of these queues explicit to achieve 90th percentile latencies below 1 millisecond. This usually means using
an asynchronous programming model and giving up a lot of the conveniences that an operating system provides.
&lt;a href=&#34;https://www.barroso.org/publications/AttackoftheKillerMicroseconds.pdf&#34;&gt;Attack of the Killer Microseconds&lt;/a&gt; has some
more detail on this regime. High-frequency trading systems go further to get median latencies in the single-digit
microseconds, using spinning cores and disabling turbo on CPUs to avoid hardware-induced latency.&lt;/p&gt;
&lt;p&gt;Improving latency can be done in many ways, but most boil down to the simple rule of: &amp;ldquo;Find something you&amp;rsquo;re
doing in the critical path, and don&amp;rsquo;t do it.&amp;rdquo; Depending on the system, improving latency can mean many different things.
For example: removing communication, removing RPCs/machine hops, removing (or adding) queueing/batching, or improving prioritization can all improve latency in a system. Latency profiles from traces will tell you what to work on.&lt;/p&gt;
&lt;p&gt;Conversely, if you add something to the critical path of a system, expect it to hurt latency. This is a natural
reason why feature-rich systems tend to become slow: adding even rarely-used options can have a little bit of
latency overhead. One new feature won&amp;rsquo;t cost much, but the small effect of each new feature adds up. The one
exception is caches, which will usually help latency on cache hits, but will hurt your latency on cache misses.&lt;/p&gt;
&lt;p&gt;Latency can be hard to price, but very often has tremendous value. In the mid-2000s, Google started studying
the effects of latency on users, and found that even 100 ms of extra latency had a noticeable impact on the
number of searches that each user conducts, published in their
&lt;a href=&#34;https://ai.googleblog.com/2009/06/speed-matters.html&#34;&gt;blog&lt;/a&gt;. They have also
&lt;a href=&#34;https://developers.google.com/web/updates/2018/07/search-ads-speed&#34;&gt;publicly announced&lt;/a&gt; that page loading
latency is a factor in Google search rankings. None of this is a surprise to UX folks and front-end web
developers, but the further down the stack you go, the easier it is to forget that your system may be in the critical path
for a user-facing service.&lt;/p&gt;
&lt;p&gt;The value of latency is usually nonlinear. The fastest trader on the stock market will make a lot more money
than the second fastest trader, who makes a lot more than the third fastest, and so on. The 100th fastest doesn&amp;rsquo;t
make much more than number 101. The difference in speed between the top three may only be a few nanoseconds,
but each one of those nanoseconds has tremendous value! Conversely, improving user interface latency from 200 ms
to 150 ms has a lot more value than improving it from 60 ms to 10 ms: the former takes a UI from &amp;ldquo;sluggish&amp;rdquo; to
&amp;ldquo;workable&amp;rdquo; while users are unlikely to notice the latter improvement (60 ms is below most humans&amp;rsquo; reaction time).&lt;/p&gt;
&lt;h3 id=&#34;performance-is-money&#34;&gt;Performance is Money&lt;/h3&gt;
&lt;p&gt;By now, you have probably gotten the theme here: performance work is about incremental improvements in
a system&amp;rsquo;s profitability, and the three dimensions of performance are about product and money:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Throughput is about how much traffic you can accept&lt;/li&gt;
&lt;li&gt;Efficiency is about how cheaply you can accept it&lt;/li&gt;
&lt;li&gt;Latency is about how quickly you respond to each request&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With the exception of latency, which I would argue is a valuable product feature, compute performance work
directly relates to your company&amp;rsquo;s bottom line. If you make sure that you care about the right things,
it can have immense value.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Performance Numbers Worth Knowing</title>
      <link>https://specbranch.com/posts/common-perf-numbers/</link>
      <pubDate>Mon, 31 Jan 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/common-perf-numbers/</guid>
      <description>When you design software to achieve a particular level of performance, it can be a good idea to be familiar with the general speed regimes you are working with: fundamental limitations like storage devices and networks can drive software architecture. Here are a set of common benchmark numbers that can help you anchor performance conversations and think about the components that your software will interact with. As with all guidelines, these numbers are all slightly wrong, but still useful.</description>
      <content:encoded>&lt;p&gt;When you design software to achieve a particular level of performance, it can be a good idea to be familiar with
the general speed regimes you are working with: fundamental limitations like storage devices and networks can drive
software architecture. Here are a set of common benchmark numbers that can help you anchor performance conversations
and think about the components that your software will interact with. As with all guidelines, these numbers are all
slightly wrong, but still useful.&lt;/p&gt;
&lt;h3 id=&#34;throughputs&#34;&gt;Throughputs&lt;/h3&gt;
&lt;p&gt;Some common byte-level throughputs are shown in the table below. All of the computing functions (eg compression,
memcpy) are for one core of a modern server CPU.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;left&#34;&gt;Type&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Component&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Throughput&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Time for 1 MB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Average US Cable Internet Upload&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.25 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;800 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Slow WiFi (802.11g)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;6.75 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;150 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Tight Compression (gzip -9)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Average US Cable Internet Download&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12.5 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;80 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Compression (gzip -1)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;64 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;16 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;WiFi (802.11n)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;75 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;13 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Gigabit Ethernet&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;125 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Storage&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Hard Drive&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;150-200 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5-7 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Decompression (gzip)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;300 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3.3 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Fast WiFi (802.11ax)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;440 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.3 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Fast Compression (lz4)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;500 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SHA-512&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;600 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.6 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Storage&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SATA 3.0 SSD&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;750 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.3 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SHA-1&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;900 MB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;I/O&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PCIe gen 3 x1 (WiFi Card)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;10 Gigabit Ethernet&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.25 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;800 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;AES-GCM Encryption&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;500 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;I/O&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PCIe gen 4 x1 (WiFi Card)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;500 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;JSON parsing with simdjson&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;330 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Fast Decompression (lz4)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;3 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;330 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;I/O&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PCIe gen 3 x4 (NVMe SSD)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;250 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;I/O&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PCIe gen 4 x4 (NVMe SSD)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;8 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;125 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Memory&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;DDR4-3200 DRAM Channel Actual (x64)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;~12 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;83 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;100 Gigabit Ethernet&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12.5 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;80 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;I/O&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PCIe gen 3 x16 (GPU or accelerator)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;16 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;63 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Memory&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;DDR5-4800 DRAM Channel Actual (x64)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;~20 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CRC32C Checksum&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;40 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Memory&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;DDR4-3200 DRAM Channel Theoretical Max (x64)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;25.6 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;40 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;I/O&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PCIe gen 4 x16 (GPU or accelerator)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;32 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;32 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Memory&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;DDR5-4800 DRAM Channel Theoretical Max (x64)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;38.4 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;26 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Algorithm&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;memcpy&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50 GB/s&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;20 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
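&lt;p&gt;The &amp;ldquo;Time for 1 MB&amp;rdquo; column above is simply size divided by throughput, which makes it easy to extend the table to other sizes or devices; as a sketch:&lt;/p&gt;

```python
# Transfer (or processing) time is just size / throughput.
def time_for_bytes(size_bytes, throughput_bytes_per_s):
    return size_bytes / throughput_bytes_per_s

MB = 1e6
# Gigabit Ethernet at 125 MB/s: 8 ms for 1 MB, as in the table.
print(time_for_bytes(MB, 125e6) * 1e3, "ms")
# gzip -9 at 10 MB/s: 100 ms for 1 MB.
print(time_for_bytes(MB, 10e6) * 1e3, "ms")
```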
&lt;h3 id=&#34;latencies&#34;&gt;Latencies&lt;/h3&gt;
&lt;p&gt;Some common latencies are shown in the table below. Most of these are fundamental, but several are the product of
the design of protocols and systems. All of these are shown assuming that they are uncongested (so there is no
queueing delay), and network delays are shown as 1/2 of round-trip time, representing one-way latency.&lt;/p&gt;
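&lt;p&gt;The store-and-forward entries are pure serialization delay: a switch or NIC must receive the whole frame before it can forward it, so the delay is just frame size divided by link speed. As a sketch:&lt;/p&gt;

```python
# Store-and-forward (serialization) delay in microseconds:
# the frame's bits divided by the link's bit rate.
def serialization_delay_us(frame_bytes, link_bits_per_s):
    return frame_bytes * 8 / link_bits_per_s * 1e6

# The 1.5 kB frame entries in the table:
print(serialization_delay_us(1500, 10e9))  # 10 Gigabit Ethernet: 1.2 us
print(serialization_delay_us(1500, 1e9))   # Gigabit Ethernet: 12 us
print(serialization_delay_us(1500, 10e6))  # 10 Megabit Ethernet: 1200 us (1.2 ms)
```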
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th align=&#34;left&#34;&gt;Type&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Component/Process&lt;/th&gt;
&lt;th align=&#34;center&#34;&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;CPU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;CPU Instruction (1 cycle)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;400 ps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;CPU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;L1 Cache Access&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.2 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;CPU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Branch Misprediction&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;CPU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;L2 Cache Access&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;CPU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Atomic Instruction&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;~10 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;CPU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;L3 Cache Access&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;15-20 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;CPU&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;DRAM Access (Cache Miss)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50-100 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;LAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Cut-through Switch&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Device&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Low-latency Network Card&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;500 ns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Device&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;PCIe Accelerator/GPU Access&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Serialization&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.5 kB Store-and-forward Delay on 10 Gigabit Ethernet&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.2 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Device&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Datacenter Network Card (SFP, QSFP, etc.)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.5 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;LAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Datacenter Switch&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;LAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Medium-sized Datacenter Network&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Device&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Gigabit Ethernet Network Card (1GBase-T)&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Serialization&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.5 kB Store-and-forward Delay on Gigabit Ethernet&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;12 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;LAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Copper Gigabit Ethernet Switch&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;20 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Cloud&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Intra-zone Cloud Network&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;20 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Storage&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SSD Read&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50-100 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Cloud&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Inter-zone Cloud Network&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;500 µs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Edge Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Low-interference WiFi Connection&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Serialization&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;1.5 kB Store-and-forward Delay on 10 Megabit Ethernet&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;1.2 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Cloud&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Inter-region Cloud Network&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;2.5 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Storage&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;7200 RPM Hard Drive Rotation&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;4.2 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Edge Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;High-interference WiFi Connection&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Edge Network&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;DOCSIS 3.0 Cable Modem&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;5 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;Storage&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Hard Drive Seek&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;10 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;WAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;US East Coast to West Coast&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;20 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;WAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;US East Coast to UK&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;30 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;WAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;US West Coast to Chile&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;50 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;WAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;UK to India&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;70 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;WAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;US West Coast to Hong Kong&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;WAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;US to India&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;150 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td align=&#34;left&#34;&gt;WAN&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;UK to Hong Kong&lt;/td&gt;
&lt;td align=&#34;center&#34;&gt;150 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Constant-time Fibonacci</title>
      <link>https://specbranch.com/posts/const-fib/</link>
      <pubDate>Sat, 22 Jan 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/const-fib/</guid>
      <description>This is the second part in a 2-part series on the &amp;ldquo;Fibonacci&amp;rdquo; interview problem. We are building off of a previous post, so take a look at Part I if you haven&amp;rsquo;t seen it.
Previously, we examined the problem and constructed a logarithmic-time solution based on computing the power of a matrix. Now we will derive a constant time solution using some more linear algebra. If you had trouble with the linear algebra in part I, it may help to read up on matrices, matrix multiplication, and special matrix operations (specifically determinants and inverses) before moving on.</description>
      <content:encoded>&lt;p&gt;This is the second part in a 2-part series on the &amp;ldquo;Fibonacci&amp;rdquo; interview problem.
We are building off of a previous post, so &lt;a href=&#34;https://specbranch.com/posts/fibonacci/&#34;&gt;take a look at Part I&lt;/a&gt; if you haven&amp;rsquo;t seen it.&lt;/p&gt;
&lt;p&gt;Previously, we examined the problem and constructed a logarithmic-time solution based on computing the power
of a matrix.  Now we will derive a constant time solution using some more linear algebra. If you had
trouble with the linear algebra in part I, it may help to read up on
&lt;a href=&#34;https://www.mathsisfun.com/algebra/matrix-introduction.html&#34;&gt;matrices&lt;/a&gt;,
&lt;a href=&#34;https://www.mathsisfun.com/algebra/matrix-multiplying.html&#34;&gt;matrix multiplication&lt;/a&gt;, and special matrix
operations (specifically &lt;a href=&#34;https://www.mathsisfun.com/algebra/matrix-determinant.html&#34;&gt;determinants&lt;/a&gt;
and &lt;a href=&#34;https://www.mathsisfun.com/algebra/matrix-inverse.html&#34;&gt;inverses&lt;/a&gt;) before moving on.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s time to pull out the &lt;em&gt;eigenvalues.&lt;/em&gt;&lt;/p&gt;
&lt;h3 id=&#34;eigenvalues-and-eigenvectors&#34;&gt;Eigenvalues and Eigenvectors&lt;/h3&gt;
&lt;p&gt;Eigenvalues and eigenvectors are a bit of a magical concept in linear algebra.  The definition of eigenvalues
is: &amp;ldquo;A matrix times an eigenvector equals an eigenvalue times an eigenvector.&amp;rdquo; As an equation,
the eigenvalues ($\lambda$) and eigenvectors ($\textbf{v}$) of a matrix ($\textbf{A}$) can be computed by
solving:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}\textbf{v} = \textbf{v}\lambda $$&lt;/p&gt;
&lt;p&gt;In more intuitive terms, each matrix has a set of vectors (its eigenvectors) which are scaled by a constant
factor (the eigenvalues) when multiplied by the matrix. If we think of matrix multiplication as a transformation
of a shape in space, the eigenvectors are the directions along which the matrix stretches the shape, and
the eigenvalues are the scaling factors.&lt;/p&gt;
&lt;p&gt;A square $m \times m$ matrix can have up to $m$ eigenvalues. In the case of recurrence relations, the
matrices we are manipulating should always have $m$ eigenvalues. As long as the matrix has enough
eigenvalues, we can make a matrix out of all of the eigenvectors, which we will call $\textbf{Q}$, and
we can make a matrix which has the eigenvalues along the diagonal, $\Lambda$. Now our equation looks like:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}\textbf{Q} = \textbf{Q}\Lambda $$&lt;/p&gt;
&lt;p&gt;We can multiply both sides by the inverse of the $\textbf{Q}$ matrix, and we will get:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A} = \textbf{Q}\Lambda\textbf{Q}^{-1} $$&lt;/p&gt;
&lt;p&gt;Now we can do something interesting:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}^2 = \textbf{A} \cdot \textbf{A} = \textbf{Q}\Lambda\textbf{Q}^{-1} \cdot \textbf{Q}\Lambda\textbf{Q}^{-1} =
\textbf{Q}\Lambda^2\textbf{Q}^{-1} $$&lt;/p&gt;
&lt;p&gt;The multiplication of $\textbf{Q}^{-1}$ and $\textbf{Q}$ in the middle allows us to cancel them out, so
we can reduce the problem of squaring a matrix to the problem of squaring its eigenvalues.  There is no need
for us to stop at squaring. If we multiply by $\textbf{A}$ again, we get another pair of $\textbf{Q}^{-1}$
and $\textbf{Q}$ that cancel:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}^3 = \textbf{A}^2 \cdot \textbf{A} = \textbf{Q}\Lambda^2\textbf{Q}^{-1} \cdot \textbf{Q}\Lambda\textbf{Q}^{-1} =
\textbf{Q}\Lambda^3\textbf{Q}^{-1} $$&lt;/p&gt;
&lt;p&gt;More generally, for every power of the matrix, we can compute it by taking a power of its eigenvalue matrix:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}^n = \textbf{Q} \Lambda^n \textbf{Q}^{-1} $$&lt;/p&gt;
&lt;p&gt;The &amp;ldquo;stretching&amp;rdquo; idea hints at this: stretching a shape multiple times using the same transformation should
stretch it more along the same axes. However, this formula won&amp;rsquo;t apply unless $\textbf{A}$ (which is an
$m \times m$ matrix) has exactly $m$ eigenvalues.&lt;/p&gt;
&lt;p&gt;At first glance, it&amp;rsquo;s not exactly clear why this helps: we have reduced computing the power of the
$\textbf{A}$ matrix to the power of a different matrix.  However, $\Lambda$ is a diagonal matrix, so
computing its power is equivalent to computing the power of each of the elements along the diagonal:&lt;/p&gt;
&lt;p&gt;$$ \Lambda^n = \begin{bmatrix} \lambda_1 &amp;amp; 0 \\ 0 &amp;amp; \lambda_2 \end{bmatrix}^n =
\begin{bmatrix} \lambda_1^n &amp;amp; 0 \\ 0 &amp;amp; \lambda_2^n \end{bmatrix} $$&lt;/p&gt;
&lt;p&gt;Through eigendecomposition, we have reduced the problem of an $m \times m$ matrix power to computing $m$
scalar powers. The scalar powers can be computed in constant time (using the venerable &lt;code&gt;pow&lt;/code&gt; function),
whereas the matrix power cannot otherwise be computed in constant time.  Now we just have to apply this
to the Fibonacci problem.&lt;/p&gt;
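&lt;p&gt;Before moving on, the identity is easy to sanity-check numerically. Here is a quick sketch (not part of the final solution) comparing $\textbf{Q} \Lambda^3 \textbf{Q}^{-1}$ against $\textbf{A}^3$ computed by plain multiplication:&lt;/p&gt;

```python
import numpy as np
from numpy import linalg as LA

# Sanity check of A^n = Q Lambda^n Q^{-1} for the Fibonacci matrix, n = 3
a = np.array([[1.0, 1.0],
              [1.0, 0.0]])
eigenvalues, q = LA.eig(a)              # columns of q are the eigenvectors
lambda_cubed = np.diag(eigenvalues ** 3)
reconstructed = q @ lambda_cubed @ LA.inv(q)

direct = a @ a @ a                      # A^3 the slow way
assert np.allclose(reconstructed, direct)
```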
&lt;h3 id=&#34;computing-our-eigenvalues-and-eigenvectors&#34;&gt;Computing our Eigenvalues and Eigenvectors&lt;/h3&gt;
&lt;p&gt;A nice interviewer might let you use Wolfram Alpha to do an eigendecomposition, but just in case, let&amp;rsquo;s do it
by hand.  We determine the eigenvalues using the characteristic polynomial of the matrix.  The polynomial is:&lt;/p&gt;
&lt;p&gt;$$ p(\lambda) = \det (\textbf{A} - \lambda \textbf{I}) $$&lt;/p&gt;
&lt;p&gt;Then, for each eigenvalue ($\lambda_i$), we solve the eigenvalue equation to get the corresponding
eigenvector ($\textbf{v}_i$):&lt;/p&gt;
&lt;p&gt;$$ (\textbf{A} - \lambda_i \textbf{I}) \textbf{v}_i = 0 $$&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s do it for the Fibonacci recurrence relation.  We will start with the eigenvalues:&lt;/p&gt;
&lt;p&gt;$$ 0 = \begin{vmatrix} 1 - \lambda &amp;amp; 1 \\ 1 &amp;amp; -\lambda \end{vmatrix} = -\lambda (1 - \lambda) - 1 = \lambda^2 - \lambda - 1 $$&lt;/p&gt;
&lt;p&gt;Solving this with the quadratic formula gives:&lt;/p&gt;
&lt;p&gt;$$ \lambda = \frac{1 \pm \sqrt{5}}{2} $$&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s cool. That&amp;rsquo;s the golden ratio (and its conjugate)! Our eigenvalues are the golden ratio and its
conjugate.  We will use $\phi = \frac{1 + \sqrt{5}}{2}$ for the golden ratio from now on, and
$\psi = \frac{1 - \sqrt{5}}{2}$ for its conjugate.  Also, there is an interesting property of the golden
ratio that helps us simplify the algebra:&lt;/p&gt;
&lt;p&gt;$$ -\frac{1}{\phi} = 1 - \phi = \psi $$&lt;/p&gt;
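&lt;p&gt;A small numerical sketch (my own check, not part of the derivation) confirms this identity, plus the fact that the eigenvalues&amp;rsquo; sum and product equal the trace and determinant of $\textbf{A}$:&lt;/p&gt;

```python
import math

# Quick check of the golden ratio identities used in the derivation
phi = (1 + math.sqrt(5)) / 2   # golden ratio
psi = (1 - math.sqrt(5)) / 2   # its conjugate
assert math.isclose(-1 / phi, psi)
assert math.isclose(1 - phi, psi)
# Sum and product of the eigenvalues match the trace (1) and
# determinant (-1) of the A matrix
assert math.isclose(phi + psi, 1.0)
assert math.isclose(phi * psi, -1.0)
```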
&lt;p&gt;Let&amp;rsquo;s solve for the eigenvectors, starting with the eigenvector for $\phi$:&lt;/p&gt;
&lt;p&gt;$$ \begin{bmatrix} 1 - \phi &amp;amp; 1 \\ 1 &amp;amp; -\phi \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = 0 $$&lt;/p&gt;
&lt;p&gt;This simplifies, using some properties of the golden ratio, to the equations:&lt;/p&gt;
&lt;p&gt;$$ 0 = \psi x + y $$
$$ 0 = x - \phi y $$&lt;/p&gt;
&lt;p&gt;So we can use $x = \phi$ and $y = 1$ for this eigenvector.  Note that any constant factor of these
numbers also solves this set of equations, so $x = 2\phi$ and $y = 2$ also works.  Conventionally,
eigenvectors are usually scaled so that their magnitude is 1, but we can use any scaling factor that
makes the math easy.  With that one out of the way, let&amp;rsquo;s do the other eigenvector (for the
eigenvalue $\psi$):&lt;/p&gt;
&lt;p&gt;$$ \begin{bmatrix} 1 - \psi &amp;amp; 1 \\ 1 &amp;amp; -\psi \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = 0 $$&lt;/p&gt;
&lt;p&gt;Which gives $x = \psi$ and $y = 1$ as its solution.  The resulting $\textbf{Q}$ matrix, which has the
eigenvectors as its columns, is:&lt;/p&gt;
&lt;p&gt;$$ \textbf{Q} = \begin{bmatrix} \phi &amp;amp; \psi \\ 1 &amp;amp; 1 \end{bmatrix} $$&lt;/p&gt;
&lt;h3 id=&#34;finishing-constant-time-fibonacci&#34;&gt;Finishing Constant-time Fibonacci&lt;/h3&gt;
&lt;p&gt;So we have $\textbf{Q}$ and $\Lambda$, and now we just need the inverse matrix of $\textbf{Q}$.  There is a
pretty straightforward formula for the inverse of a $2 \times 2$ matrix, and we will apply it to get the
inverse matrix: negate the off-diagonal elements (bottom left and top right), swap the diagonal elements,
and divide by the determinant. For a $3 \times 3$ matrix, you are better off using a computer than trying to
memorize a formula.  Fortunately, linear algebra packages (including LAPACK and NumPy) can do this.
The inverse of the eigenvector matrix is:&lt;/p&gt;
&lt;p&gt;$$ \textbf{Q}^{-1} = \frac{1}{\phi - \psi}\begin{bmatrix} 1 &amp;amp; -\psi \\ -1 &amp;amp; \phi \end{bmatrix} = \frac{1}{\sqrt{5}}\begin{bmatrix} 1 &amp;amp; -\psi \\ -1 &amp;amp; \phi \end{bmatrix} $$&lt;/p&gt;
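&lt;p&gt;The $2 \times 2$ inverse recipe is short enough to sketch directly (the helper name is made up); applying it to $\textbf{Q}$ reproduces the matrix above:&lt;/p&gt;

```python
def inverse_2x2(m):
    # Swap the diagonal, negate the off-diagonal, divide by the determinant
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det],
            [-c / det, a / det]]

# Applied to Q = [[phi, psi], [1, 1]]; det = phi - psi = sqrt(5)
phi = (1 + 5 ** 0.5) / 2
psi = (1 - 5 ** 0.5) / 2
q_inv = inverse_2x2([[phi, psi], [1, 1]])
```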
&lt;p&gt;Putting it all together, we have our formula for $\textbf{A}^n$, based on constant-time calculations:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}^n = \frac{1}{\sqrt{5}} \begin{bmatrix} \phi &amp;amp; \psi \\ 1 &amp;amp; 1 \end{bmatrix} \begin{bmatrix} \phi^n &amp;amp; 0 \\ 0 &amp;amp; \psi^n \end{bmatrix} \begin{bmatrix} 1 &amp;amp; -\psi \\ -1 &amp;amp; \phi \end{bmatrix} $$&lt;/p&gt;
&lt;p&gt;$$ = \frac{1}{\sqrt{5}}\begin{bmatrix} \phi^{n+1}-\psi^{n+1} &amp;amp; -\phi^{n + 1}\psi + \psi^{n+1}\phi \\ \phi^n-\psi^n &amp;amp; -\phi^n\psi + \psi^n\phi \end{bmatrix} $$&lt;/p&gt;
&lt;p&gt;Simplifying based on our earlier identity that $\psi = -\frac{1}{\phi}$:&lt;/p&gt;
&lt;p&gt;$$ A^n = \frac{1}{\sqrt{5}}\begin{bmatrix} \phi^{n+1} - \psi^{n+1} &amp;amp; \phi^n - \psi^{n} \\ \phi^n - \psi^{n} &amp;amp; \phi^{n-1} - \psi^{n-1} \end{bmatrix} $$&lt;/p&gt;
&lt;p&gt;The last step is to multiply by the initial conditions:&lt;/p&gt;
&lt;p&gt;$$ \begin{bmatrix} F_{n+1} \\ F_{n} \end{bmatrix} = \frac{1}{\sqrt{5}}\begin{bmatrix} \phi^{n+1} - \psi^{n+1} &amp;amp; \phi^n - \psi^{n} \\ \phi^n - \psi^{n} &amp;amp; \phi^{n-1} - \psi^{n-1} \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} $$&lt;/p&gt;
&lt;p&gt;Simplifying, and only taking the bottom row since that is the formula for $F_n$, we get:&lt;/p&gt;
&lt;p&gt;$$ F_n = \frac{\phi^n - \psi^n}{\sqrt{5}} $$&lt;/p&gt;
&lt;p&gt;For the Fibonacci recurrence relation, this formula is fairly well known under the name &amp;ldquo;Binet&amp;rsquo;s formula.&amp;rdquo;&lt;/p&gt;
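&lt;p&gt;As a sketch, Binet&amp;rsquo;s formula translates almost directly into code (floating-point, so only exact for moderate $n$):&lt;/p&gt;

```python
import math

SQRT5 = math.sqrt(5)
PHI = (1 + SQRT5) / 2
PSI = (1 - SQRT5) / 2

def binet(n):
    # F_n = (phi^n - psi^n) / sqrt(5), rounded to the nearest integer;
    # exact only while F_n fits comfortably in a double
    return round((PHI ** n - PSI ** n) / SQRT5)
```

&lt;p&gt;For example, &lt;code&gt;binet(10)&lt;/code&gt; returns 55.&lt;/p&gt;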
&lt;h3 id=&#34;writing-the-code&#34;&gt;Writing the Code&lt;/h3&gt;
&lt;p&gt;In theory, if this is an interview, we are going to be expected to produce some code.  We are now going to
pretend that we don&amp;rsquo;t know the formula we just derived in the previous section, and we are going to
write a more generic solution rather than attempting to re-derive the formula during a high-pressure
situation. I don&amp;rsquo;t know about you, but I&amp;rsquo;m pretty bad at doing algebra by hand, so it would be better to
have the computer do everything. With that in mind, let&amp;rsquo;s do this in Python this time&amp;ndash;Python is generally
not my choice of language, but NumPy is very user-friendly and I would recommend it for linear algebra like
this.  Honestly, if I had to use C or C++ (without LAPACK), I would probably do the logarithmic-time method
from part I rather than risking the algebra to derive a closed-form $O(1)$ solution.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;background-color:#f0f0f0;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;import&lt;/span&gt; &lt;span style=&#34;color:#0e84b5;font-weight:bold&#34;&gt;numpy&lt;/span&gt; &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;as&lt;/span&gt; &lt;span style=&#34;color:#0e84b5;font-weight:bold&#34;&gt;np&lt;/span&gt;
&lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;from&lt;/span&gt; &lt;span style=&#34;color:#0e84b5;font-weight:bold&#34;&gt;numpy&lt;/span&gt; &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;import&lt;/span&gt; linalg &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;as&lt;/span&gt; LA

&lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#06287e&#34;&gt;constant_fibonacci&lt;/span&gt;(n):
    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Set up the A matrix and the initial conditions&lt;/span&gt;
    a &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; np&lt;span style=&#34;color:#666&#34;&gt;.&lt;/span&gt;array([[&lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;],
                  [&lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;0&lt;/span&gt;]])
    initial_conditions &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; np&lt;span style=&#34;color:#666&#34;&gt;.&lt;/span&gt;array([&lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;0&lt;/span&gt;])

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Do eigendecomposition and invert the eigenvector matrix&lt;/span&gt;
    eigenvalues, eigenvectors &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; LA&lt;span style=&#34;color:#666&#34;&gt;.&lt;/span&gt;eig(a)
    vector_inverse &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; LA&lt;span style=&#34;color:#666&#34;&gt;.&lt;/span&gt;inv(eigenvectors)

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Compute the power of the eigenvalues, done elementwise&lt;/span&gt;
    lambda_pow &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; np&lt;span style=&#34;color:#666&#34;&gt;.&lt;/span&gt;diag(eigenvalues &lt;span style=&#34;color:#666&#34;&gt;**&lt;/span&gt; (n &lt;span style=&#34;color:#666&#34;&gt;-&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;))

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Put it all together - compute A^n and apply initial conditions&lt;/span&gt;
    a_pow &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; eigenvectors &lt;span style=&#34;&#34;&gt;@&lt;/span&gt; lambda_pow &lt;span style=&#34;&#34;&gt;@&lt;/span&gt; vector_inverse
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#007020&#34;&gt;round&lt;/span&gt;((a_pow &lt;span style=&#34;&#34;&gt;@&lt;/span&gt; initial_conditions)[&lt;span style=&#34;color:#40a070&#34;&gt;0&lt;/span&gt;])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Note the rounding here at the end: we are working with irrational numbers as intermediate values, so we
cannot avoid predominantly using floating-point math for this form of solution.  Additionally, eigenvalues
and eigenvectors may end up being computed using numerical methods, and they may not be exact.  Even with
rounding, the use of floats will cause divergence from the actual Fibonacci numbers when $F_n$ is big:
Double-precision floating point numbers have 53 bits of precision, and once you are past $2^{53}$, they cannot
represent integers exactly.  Unsigned 64-bit integers have the same problem at $2^{64}$. This happens pretty fast:
the Fibonacci numbers grow exponentially, so the numpy code above (which uses doubles as its underlying
numeric type) has its first error at &lt;code&gt;n = 71&lt;/code&gt;.  Using a bignum or arbitrary-precision numeric library will
bring this solution back to at least $O(\log n)$, if not $O(n)$, depending on the algorithms used for
exponentiation and division.&lt;/p&gt;
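&lt;p&gt;One way to see the divergence concretely is to compare the floating-point formula against exact integer arithmetic; a sketch (the exact crossover point can vary slightly with the formulation):&lt;/p&gt;

```python
import math

def exact_fib(n):
    # Exact iterative Fibonacci using Python's arbitrary-precision ints
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def float_fib(n):
    # Binet's formula in double precision
    phi = (1 + math.sqrt(5)) / 2
    psi = (1 - math.sqrt(5)) / 2
    return round((phi ** n - psi ** n) / math.sqrt(5))

# First n where the rounded floating-point result is wrong
first_error = next(n for n in range(200) if float_fib(n) != exact_fib(n))
print(first_error)
```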
&lt;p&gt;In response to my last post on reddit, a number of commenters suggested that you just use NumPy&amp;rsquo;s matrix
power function. Note that &lt;code&gt;numpy.linalg.matrix_power&lt;/code&gt; actually uses repeated squaring, like the solution from part I, rather than eigendecomposition.&lt;/p&gt;
&lt;h3 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;Over the last two posts, we have discussed how to do better than O(n) on Fibonacci-style problems.  We
have an easier logarithmic solution and a tricky constant-time solution straight out of a differential
equations course.&lt;/p&gt;
&lt;p&gt;The Fibonacci problem is a nice example of the improvements you can get from recursion with memoization
and one-dimensional dynamic programming, but can&amp;rsquo;t we find another nice example where there isn&amp;rsquo;t a
better solution than recursion? Teaching students that they should be using recursion to find Fibonacci
numbers teaches them to develop a blind spot - instead, they should be learning to examine the problem
carefully for a better solution rather than reaching into a bag of algorithmic tricks.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>Less-than-linear Fibonacci</title>
      <link>https://specbranch.com/posts/fibonacci/</link>
      <pubDate>Fri, 14 Jan 2022 00:00:00 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/fibonacci/</guid>
      <description>Few interview problems are as notorious as the &amp;ldquo;Fibonacci&amp;rdquo; interview question. At first glance, it seems good: Most people know something about the problem, and there are several clever ways to achieve a linear time solution. Usually, in interviews, the linear time solution is the expected solution. However, the Fibonacci problem is unique among interview problems in that the expected solution is not the optimal solution. There is an $O(1)$ solution, and to get there, we need a little bit of linear algebra.</description>
      <content:encoded>&lt;p&gt;Few interview problems are as notorious as the &amp;ldquo;Fibonacci&amp;rdquo; interview question. At first glance,
it seems good: Most people know something about the problem, and there are several
clever ways to achieve a linear time solution. Usually, in interviews, the linear time solution
is the expected solution. However, the Fibonacci problem is unique among interview problems in that
the expected solution is &lt;em&gt;not&lt;/em&gt; the optimal solution. There is an $O(1)$ solution, and to get there,
we need a little bit of linear algebra.&lt;/p&gt;
&lt;p&gt;In this part, we are going to use basic linear algebra to get to $O(\log(n))$ time complexity.
In the next part, we build on this with some more advanced math to get to a generic $O(1)$ solution
for Fibonacci-like problems.&lt;/p&gt;
&lt;h3 id=&#34;the-fibonacci-interview-problem&#34;&gt;The Fibonacci Interview Problem&lt;/h3&gt;
&lt;p&gt;The Fibonacci numbers are the sequence of numbers constructed such that each number is the sum of the
two previous numbers in the sequence, starting with 0 and 1:
$$ F_0 = 0 \\ F_1 = 1 \\ F_n = F_{n-1} + F_{n-2} $$
The resulting sequence is:
$$ 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, &amp;hellip; $$
&lt;a href=&#34;https://oeis.org/A000045&#34;&gt;And so on.&lt;/a&gt; You can generalize the formula by changing
the initial conditions or the recursive formula (which mathematicians call a
recurrence relation). Everything we are discussing applies to the generalized problem.&lt;/p&gt;
&lt;p&gt;The interview problem is simple: given a number n, compute the nth Fibonacci number.&lt;/p&gt;
&lt;p&gt;According to tradition, the optimal solution to the Fibonacci interview problem is to compute the
entire sequence up to that point,
&lt;a href=&#34;https://www.baeldung.com/cs/fibonacci-top-down-vs-bottom-up-dynamic-programming&#34;&gt;using either recursion with memoization or one-dimensional dynamic programming.&lt;/a&gt;
Both result in a solution that uses linear time and space.&lt;/p&gt;
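&lt;p&gt;For reference, that &amp;ldquo;traditional&amp;rdquo; answer is a few lines (a sketch of the memoized version):&lt;/p&gt;

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def memo_fib(n):
    # Linear time and space: each value from 0 to n is computed once
    if n == 0 or n == 1:
        return n
    return memo_fib(n - 1) + memo_fib(n - 2)
```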
&lt;h3 id=&#34;recurrence-relations-and-systems-of-equations&#34;&gt;Recurrence Relations and Systems of Equations&lt;/h3&gt;
&lt;p&gt;To compute the nth Fibonacci in constant time, we need a closed-form solution. There are Googleable
solutions for the Fibonacci recurrence relation (specifically Binet&amp;rsquo;s Formula), but we
would like to create a general solution for this class of problem. It turns out that there is a
general method to derive a closed-form solution for any recurrence relation, and it is better to
learn it once than to look for a Fibonacci-specific answer.&lt;/p&gt;
&lt;p&gt;We start by creating a system of equations. The recurrence relation is one of those equations,
and the other equations are simple equalities. For the Fibonacci sequence, our system is:
$$ F_{n + 1} = F_{n} + F_{n - 1} \\ F_n = F_n $$
We can represent the system as a matrix equation:
$$ \begin{bmatrix} F_{n+1} \\ F_{n} \end{bmatrix} = \begin{bmatrix} 1 &amp;amp; 1 \\ 1 &amp;amp; 0 \end{bmatrix}
\begin{bmatrix} F_{n} \\ F_{n-1} \end{bmatrix} $$
This is a restatement of the equation above.
The top row corresponds to $F_{n + 1} = F_{n} + F_{n - 1}$ and the bottom row is $F_n = F_n$.  The left
side is the result of the equations, and the right side is the product of the coefficient matrix and the
variables that feed into the equations.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s call the big matrix $\textbf{A}$:
$$ \textbf{A} = \begin{bmatrix} 1 &amp;amp; 1 \\ 1 &amp;amp; 0 \end{bmatrix} $$
If the recurrence relation changes, $\textbf{A}$ changes. Using this form, if we want $F_2$, the formula is:
$$ \begin{bmatrix} F_{2} \\ F_{1} \end{bmatrix} =
\textbf{A} \begin{bmatrix} F_{1} = 1 \\ F_{0} = 0 \end{bmatrix} $$
If we want the third Fibonacci number, we need to do a little more work:
$$ \begin{bmatrix} F_{3} \\ F_{2} \end{bmatrix} =
\textbf{A} \begin{bmatrix} F_{2} \\ F_{1} \end{bmatrix} = \textbf{A}^2 \begin{bmatrix} 1 \\ 0 \end{bmatrix} $$
We have an interesting relationship here:  By computing powers of $\textbf{A}$,
we can directly calculate Fibonacci numbers.  Our task is now to compute the power of the $\textbf{A}$ matrix
quickly. Even the simplest algorithm for this, multiplying $\textbf{A}$ by itself $n$ times, takes $O(n)$ time and
$O(1)$ space. This is already better than the &amp;ldquo;optimal&amp;rdquo; solution that you can find in Cracking the Coding Interview.&lt;/p&gt;
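&lt;p&gt;That simplest $O(n)$ approach is a short sketch in plain Python (using Python&amp;rsquo;s exact integers rather than NumPy to avoid overflow; the function names are made up):&lt;/p&gt;

```python
def mat_mult(x, y):
    # 2x2 matrix product with exact Python ints
    return [[x[0][0] * y[0][0] + x[0][1] * y[1][0], x[0][0] * y[0][1] + x[0][1] * y[1][1]],
            [x[1][0] * y[0][0] + x[1][1] * y[1][0], x[1][0] * y[0][1] + x[1][1] * y[1][1]]]

def linear_matrix_fibonacci(n):
    if n == 0:
        return 0
    a = [[1, 1], [1, 0]]
    result = [[1, 0], [0, 1]]      # identity
    # Multiply A into the result n - 1 times: O(n) time, O(1) space
    for _ in range(n - 1):
        result = mat_mult(result, a)
    # A^{n-1} applied to [F_1, F_0] = [1, 0] picks out the first column;
    # the top entry is F_n
    return result[0][0]
```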
&lt;h5 id=&#34;generalizing-beyond-fibonacci&#34;&gt;Generalizing Beyond Fibonacci&lt;/h5&gt;
&lt;p&gt;If we have a different recurrence relation, we can still do this trick.
For example, with $X_{n+1} = 2 X_n + 3 X_{n-1}$, $ \textbf{A} $ changes to:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}_X = \begin{bmatrix} 2 &amp;amp; 3 \\ 1 &amp;amp; 0 \end{bmatrix} $$&lt;/p&gt;
&lt;p&gt;And for $Y_{n+1} = Y_n + 2 Y_{n-1} + 3 Y_{n-2}$, we can add another equality to the system of equations:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}_Y = \begin{bmatrix} 1 &amp;amp; 2 &amp;amp; 3 \\ 1 &amp;amp; 0 &amp;amp; 0 \\ 0 &amp;amp; 1 &amp;amp; 0 \end{bmatrix} $$&lt;/p&gt;
&lt;p&gt;The full equation is:&lt;/p&gt;
&lt;p&gt;$$ \begin{bmatrix} Y_{n+1} \\ Y_{n} \\ Y_{n-1} \end{bmatrix} = \textbf{A}_Y \begin{bmatrix} Y_{n} \\ Y_{n-1} \\ Y_{n-2} \end{bmatrix} $$&lt;/p&gt;
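&lt;p&gt;The pattern behind these matrices is mechanical: the recurrence coefficients fill the top row, and a shifted identity below moves the state down one slot. A sketch of a builder for it (the function name is made up):&lt;/p&gt;

```python
def companion_matrix(coefficients):
    # Build the A matrix for x_{n+1} = c0*x_n + c1*x_{n-1} + ...:
    # coefficients on the top row, a shifted identity below
    m = len(coefficients)
    a = [[0] * m for _ in range(m)]
    a[0] = list(coefficients)
    for i in range(1, m):
        a[i][i - 1] = 1
    return a
```

&lt;p&gt;&lt;code&gt;companion_matrix([1, 2, 3])&lt;/code&gt; reproduces the $3 \times 3$ matrix above, and &lt;code&gt;companion_matrix([1, 1])&lt;/code&gt; gives the Fibonacci matrix.&lt;/p&gt;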
&lt;h3 id=&#34;getting-logarithmic&#34;&gt;Getting Logarithmic&lt;/h3&gt;
&lt;p&gt;We have reduced the problem to efficiently computing $\textbf{A}^n$ to get the nth Fibonacci number.
Matrix multiplication is associative, so we can re-order the computation of the matrix products for
efficiency.  A convenient way to do this is to compute the powers of two of the matrix by repeated squaring,
and then decomposing the exponentiation into a product of powers of two. For example:&lt;/p&gt;
&lt;p&gt;$$ \textbf{A}^{53} = \textbf{A}^{32} \cdot \textbf{A}^{16} \cdot \textbf{A}^4 \cdot \textbf{A}^1 $$&lt;/p&gt;
&lt;p&gt;This mirrors the binary representation of $n$, which decomposes it into a sum of powers of two, so we can
use the bits of $n$ to decide which powers to multiply to make $\textbf{A}^n$: when a bit is one, multiply the
corresponding power of $\textbf{A}$ into the result; when it is zero, skip it, and stop when you reach the most
significant one bit in $n$.&lt;/p&gt;
&lt;p&gt;Note that we do either one or two matrix multiplications (constant time) per
bit until we reach the most significant bit of $n$, and then we are done.  Starting from the first bit
in the ones position, the MSB of $n$ is bit number $\lfloor \log_2(n) \rfloor$, so we have $O(\log_2(n))$ time
complexity (with constant space).&lt;/p&gt;
&lt;p&gt;Working in Python, we get something like the following code. Python, Rust, and many other languages have
well-supported, easy-to-use matrix multiplication routines; BLAS is also
available for languages like C, C++, and Fortran.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;background-color:#f0f0f0;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;import&lt;/span&gt; &lt;span style=&#34;color:#0e84b5;font-weight:bold&#34;&gt;numpy&lt;/span&gt; &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;as&lt;/span&gt; &lt;span style=&#34;color:#0e84b5;font-weight:bold&#34;&gt;np&lt;/span&gt;

&lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#06287e&#34;&gt;log_fibonacci&lt;/span&gt;(n):
    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Initialize the A matrix and the matrix that represents A^n&lt;/span&gt;
    a &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; np&lt;span style=&#34;color:#666&#34;&gt;.&lt;/span&gt;array([[&lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;],
                  [&lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;0&lt;/span&gt;]])
    result &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; np&lt;span style=&#34;color:#666&#34;&gt;.&lt;/span&gt;identity(&lt;span style=&#34;color:#40a070&#34;&gt;2&lt;/span&gt;, dtype&lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#007020&#34;&gt;object&lt;/span&gt;)

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Computing A^n gives us Fibonacci number n + 1, so decrement n&lt;/span&gt;
    n &lt;span style=&#34;color:#666&#34;&gt;-=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;while&lt;/span&gt; n &lt;span style=&#34;color:#666&#34;&gt;!=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;0&lt;/span&gt;:
        &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Multiply by the A matrix power if we have a one bit&lt;/span&gt;
        &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;if&lt;/span&gt; n &lt;span style=&#34;color:#666&#34;&gt;&amp;amp;&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt; &lt;span style=&#34;color:#666&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;:
            result &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; result &lt;span style=&#34;&#34;&gt;@&lt;/span&gt; a
        
        &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Square the a matrix and move to the next bit&lt;/span&gt;
        a &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; a &lt;span style=&#34;&#34;&gt;@&lt;/span&gt; a
        n &lt;span style=&#34;color:#666&#34;&gt;&amp;gt;&amp;gt;=&lt;/span&gt; &lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;

    &lt;span style=&#34;color:#60a0b0;font-style:italic&#34;&gt;# Apply the initial conditions&lt;/span&gt;
    initial_conditions &lt;span style=&#34;color:#666&#34;&gt;=&lt;/span&gt; np&lt;span style=&#34;color:#666&#34;&gt;.&lt;/span&gt;array([&lt;span style=&#34;color:#40a070&#34;&gt;1&lt;/span&gt;, &lt;span style=&#34;color:#40a070&#34;&gt;0&lt;/span&gt;])
    &lt;span style=&#34;color:#007020;font-weight:bold&#34;&gt;return&lt;/span&gt; (result &lt;span style=&#34;&#34;&gt;@&lt;/span&gt; initial_conditions)[&lt;span style=&#34;color:#40a070&#34;&gt;0&lt;/span&gt;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;conclusions&#34;&gt;Conclusions&lt;/h3&gt;
&lt;p&gt;By pairing a little bit of traditional mathematics with algorithms, we have a solution that needs
only O(log n) matrix multiplications, a lot better than the O(n) additions of the traditional
&amp;ldquo;algorithms&amp;rdquo; approach to the problem of computing Fibonacci numbers.
In the next part, we are going to apply more math to do even better.&lt;/p&gt;
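&lt;p&gt;As a self-contained sanity check, the matrix-power idea can be compared against a plain
iterative loop. This is a sketch of my own, not code from the post: the names
&lt;code&gt;fib_matrix&lt;/code&gt; and &lt;code&gt;fib_iterative&lt;/code&gt; are assumptions, and it expects n to be
at least 1.&lt;/p&gt;

```python
import numpy as np

def fib_iterative(n):
    # Plain O(n) loop, used here only as a reference implementation.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fib_matrix(n):
    # Binary exponentiation of the matrix [[1, 1], [1, 0]], as in the
    # post. Requires n to be at least 1.
    a = np.array([[1, 1], [1, 0]], dtype=np.int64)
    result = np.identity(2, dtype=np.int64)
    n -= 1  # A^n gives Fibonacci number n + 1, so decrement n
    while n != 0:
        if n % 2 == 1:
            result = result @ a
        a = a @ a
        n //= 2
    return int((result @ np.array([1, 0]))[0])

# The two approaches should agree; int64 is safe up to Fibonacci number 92.
for k in range(1, 30):
    assert fib_matrix(k) == fib_iterative(k)
```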
&lt;p&gt;&lt;a href=&#34;https://specbranch.com/posts/const-fib/&#34;&gt;Go on to part II.&lt;/a&gt;&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>About Me</title>
      <link>https://specbranch.com/about/</link>
      <pubDate>Sun, 28 Nov 2021 23:00:16 +0000</pubDate>
      
      <guid>https://specbranch.com/about/</guid>
      <description>My name is Nima Badizadegan, a software and hardware developer in the Northeastern United States. I am passionate about exploring the limits of computer system performance, and I examine the underlying assumptions of computer systems, from hardware to numerics.
I am currently working on a startup, arbitrand, that aims to make true random number generation more secure and widely available.
Most recently, I was at Google working on improving the performance of Google&amp;rsquo;s exabyte-scale filesystem (Colossus) for the NVMe flash world.</description>
      <content:encoded>&lt;p&gt;My name is Nima Badizadegan, a software and hardware developer in the Northeastern United States.
I am passionate about exploring the limits of computer system performance, and I examine the
underlying assumptions of computer systems, from hardware to numerics.&lt;/p&gt;
&lt;p&gt;I am currently working on a startup, &lt;a href=&#34;https://arbitrand.com&#34;&gt;arbitrand&lt;/a&gt;, that aims to make
true random number generation more secure and widely available.&lt;/p&gt;
&lt;p&gt;Most recently, I was at Google working on improving the performance of Google&amp;rsquo;s exabyte-scale
filesystem (Colossus) for the NVMe flash world.  In the past, I have worked on
hardware-accelerated trading systems, disease models, e-reader hardware, and hardware and
software for satellites.&lt;/p&gt;
&lt;h3 id=&#34;interests-and-blog-topics&#34;&gt;Interests and Blog Topics&lt;/h3&gt;
&lt;p&gt;This blog should generally focus on technical topics, based on my interests. Topics you should
expect to see represented include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;System-level software engineering&lt;/li&gt;
&lt;li&gt;FPGAs and design of hardware-accelerated systems&lt;/li&gt;
&lt;li&gt;Computer system performance&lt;/li&gt;
&lt;li&gt;Micro-optimization of algorithms&lt;/li&gt;
&lt;li&gt;Data structure and algorithm internals, and hardware implementation of algorithms&lt;/li&gt;
&lt;li&gt;Time synchronization&lt;/li&gt;
&lt;li&gt;Numerical algorithms&lt;/li&gt;
&lt;li&gt;High-performance networking technologies&lt;/li&gt;
&lt;li&gt;Working with new storage technologies&lt;/li&gt;
&lt;li&gt;Fundamental physical limits of computer systems&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I am also fond of several less-technical topics, including the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The culture of hardware and software engineering&lt;/li&gt;
&lt;li&gt;Organizational psychology and corporate structure&lt;/li&gt;
&lt;li&gt;The goings-on of the financial markets&lt;/li&gt;
&lt;li&gt;Intellectual property law&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;how-to-get-in-touch&#34;&gt;How to Get in Touch&lt;/h3&gt;
&lt;p&gt;The best way to get in touch with me is through email, but you can use any of the social links
on the right.  If you are reaching out about W-2 employment, I am not interested at this time.
Otherwise, feel free to shoot me an email about anything you find interesting at
&lt;a href=&#34;mailto:specbranch@fastmail.com&#34;&gt;specbranch@fastmail.com&lt;/a&gt;.  If you would like to receive an email when a new post goes live on
this blog, you can sign up for my email list below.&lt;/p&gt;
</content:encoded>
    </item>
    
    <item>
      <title>First Post</title>
      <link>https://specbranch.com/posts/first-post/</link>
      <pubDate>Sat, 27 Nov 2021 18:28:52 +0000</pubDate>
      
      <guid>https://specbranch.com/posts/first-post/</guid>
      <description>Hello everyone, and welcome to my blog.
I am an ex-Google senior engineer focused on systems programming and performance optimization. My past experience includes hardware engineering, numerical analysis, and high-frequency trading.
Here, we will be talking about software engineering, performance, computer systems and foundations, interesting math concepts, electrical engineering, hardware acceleration, FPGAs, and more. We may also branch out to non-technical topics including companies, organizational psychology, and the stock market.</description>
      <content:encoded>&lt;p&gt;Hello everyone, and welcome to my blog.&lt;/p&gt;
&lt;p&gt;I am an ex-Google senior engineer focused on systems programming and performance optimization. My
past experience includes hardware engineering, numerical analysis, and high-frequency trading.&lt;/p&gt;
&lt;p&gt;Here, we will be talking about software engineering, performance, computer systems and foundations,
interesting math concepts, electrical engineering, hardware acceleration, FPGAs, and more. We may
also branch out to non-technical topics including companies, organizational psychology, and the
stock market.&lt;/p&gt;
&lt;p&gt;With all that in mind, thanks for reading and I hope you find something that interests you here!&lt;/p&gt;
</content:encoded>
    </item>
    
  </channel>
</rss>