Contemplating Randomness

I have recently been immersed in the theory and practice of random number generation while working on Arbitrand, a new high-quality true random number generation service hosted in AWS. Because of that, I am starting a sequence of blog posts on randomness and random number generators. This post is the first of the sequence, and focuses on what random number generators are and how to test them.

Formally, random number generators are systems that produce a stream of bits (or numbers) with two properties:

  1. Uniformity: The stream is uniformly distributed. Each possible output is equally probable.

  2. Independence: Elements of the stream are completely independent of each other.

These two properties are enough both to ensure that the stream of random numbers has maximal entropy (1 bit of entropy per bit), and to make it very tricky to generate good random numbers!

Before we get into the math: there are long-running philosophical debates about whether we can actually produce random numbers, or whether there is some hidden determinism in random number generation. As far as I know, it is almost impossible to produce numbers that are perfectly random. Even the best quantum random number generators can potentially be biased by external factors, like electric fields or gravity. However, it appears that we can get imperceptibly close.

Randomness tests exist to determine how close any random number generator is to truly random number generation. Treating a perfectly fair coin (another system that doesn’t exist) as the gold standard for randomness, we can actually test randomness pretty well - but we can also create deterministic algorithms that pretend to be random.

Defining Randomness Mathematically

In order to measure uniformity and independence, we use mathematical properties that we expect from a stream of random numbers. We can then use statistical tests over a number of samples from a random stream to determine if the stream has those properties.

For uniformity, assuming a one-bit stream, we would like the stream of bits to behave like a perfect coin flip: a Bernoulli distribution with $p = 0.5$. For example, we expect that over a number of trials (where a trial is a large sample from the random number generator), the average trial will have the following characteristics:

  • Mean of $0.5$

  • Variance of $0.25$

  • Skewness (and all higher odd central moments) of $0$
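
As a quick illustration, here is a minimal Python sketch of this check, using Python’s built-in `random` module as a stand-in for the generator under test:

```python
import random

def sample_moments(n):
    """Estimate the mean, variance, and skewness of n random bits."""
    bits = [random.getrandbits(1) for _ in range(n)]
    mean = sum(bits) / n
    var = sum((b - mean) ** 2 for b in bits) / n
    # Skewness is the third central moment over variance^(3/2).
    skew = sum((b - mean) ** 3 for b in bits) / (n * var ** 1.5)
    return mean, var, skew

# A good generator should give roughly (0.5, 0.25, 0.0).
print(sample_moments(1_000_000))
```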

For independence, we want the stream to have the property that the probability of any bit being one does not depend on previous bits: $P(R_n = 1 \mid R_m = 1) = 0.5$ for all $n$ and $m$ where $n \neq m$. To measure independence, we can look at sequences of bits. For example, we expect the average trial to have the following properties:

  • Toggles from 1 to 0 and back about $(n - 1) / 2$ times, where $n$ is the number of bits in the trial.

  • Has a number of runs of ones and zeros of each length given by a known formula (approximately $n / 2^{k+1}$ maximal runs of length $k$).
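
Both checks are easy to sketch in Python (again using the built-in generator as a stand-in):

```python
import random
from collections import Counter

def toggles_and_runs(n):
    """Count toggles and maximal run lengths in n random bits."""
    bits = [random.getrandbits(1) for _ in range(n)]
    toggles = sum(bits[i] != bits[i - 1] for i in range(1, n))
    runs = Counter()
    length = 1
    for i in range(1, n):
        if bits[i] == bits[i - 1]:
            length += 1
        else:
            runs[length] += 1  # a maximal run just ended
            length = 1
    runs[length] += 1  # close out the final run
    return toggles, runs

n = 1_000_000
toggles, runs = toggles_and_runs(n)
print(f"toggles: {toggles} (expected ~{(n - 1) / 2:.0f})")
for k in range(1, 6):
    print(f"runs of length {k}: {runs[k]} (expected ~{n / 2 ** (k + 1):.0f})")
```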

We can also measure the independence property using the autocorrelation of the stream. We would like each bit of the stream to be uncorrelated with the history of the stream. Mapping bits to $\pm 1$, we expect that over several samples of $n$ bits, the following holds approximately for every lag $k > 0$:

$$ 0 \approx \sum\limits_{i=k}^{n} R_i R_{i-k} $$
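
A sketch of this check (for an ideal generator, each sum should be near zero, with fluctuations on the order of $\sqrt{n}$):

```python
import random

def autocorrelation(n, k):
    """Sum of lag-k products for n random bits mapped to +/-1."""
    r = [2 * random.getrandbits(1) - 1 for _ in range(n)]
    return sum(r[i] * r[i - k] for i in range(k, n))

n = 1_000_000
for k in (1, 2, 7, 64):
    # For independent bits, each sum should be 0 +/- O(sqrt(n)).
    print(f"lag {k:3}: {autocorrelation(n, k):+}")
```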

Practically, testing for uniformity is easy. Testing for independence is both subtle and computationally difficult.

Pseudorandomness

Most computer random number generators use pseudorandom algorithms that generate a fixed sequence of bits, rather than true random number generators. In a way, pseudorandom number generators are compressed representations of extremely long sequences.

All pseudorandom number generators have the property that after some length of output, the generated sequence repeats. This period can be huge, but it always exists.
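
To make that concrete, here is a deliberately tiny linear congruential generator (parameters chosen purely for illustration) whose period can be measured directly:

```python
def lcg(seed, a=5, c=3, m=16):
    """A deliberately tiny LCG: x -> (a*x + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

# Walk the generator until a state repeats. With these parameters the
# period is the full modulus, 16; real generators use m >= 2**64.
seen = {}
for step, x in enumerate(lcg(1)):
    if x in seen:
        print(f"state {x} seen again: period = {step - seen[x]}")
        break
    seen[x] = step
```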

You might have noticed that pseudorandom number generators do not actually have the “independence” property, since the streams they produce are deterministic. However, they can fake independence by replacing it with the following properties:

  1. A long period before the output stream repeats.

  2. Equidistributed output.

Treating the output of the random number generator as numbers between $0$ and $1$ (e.g., mapping linearly from integers), equidistribution means that given a number $x \in [0, 1]$, a sample of $n$ outputs from a pseudorandom number generator is expected to contain about $nx$ outputs that are less than $x$; over the generator’s full period $T$, exactly $xT$ outputs fall below $x$. In other words, the fraction of sampled outputs that is less than $x$ is $x$ itself, for any $x$.
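
A quick sketch of that check, using Python’s Mersenne Twister (whose period of $2^{19937} - 1$ dwarfs any feasible sample size):

```python
import random

def fraction_below(n, x):
    """Fraction of n outputs in [0, 1) that fall below x."""
    return sum(random.random() < x for _ in range(n)) / n

# For an equidistributed generator, each fraction should be close to x.
for x in (0.1, 0.25, 0.5, 0.9):
    print(f"x = {x}: fraction below = {fraction_below(1_000_000, x):.4f}")
```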

Equidistribution also has a concept of dimensionality: higher-dimensional equidistribution involves sparse sampling of the output stream, which can defeat simple pseudorandom number generators that do not have a very long period.

Imperfect Random Number Streams

Most often, when dealing with true random numbers, the entropy source used to create them is imperfect. Nevertheless, random number generators are expected to output uniform random numbers, because uniform random numbers maximize the amount of entropy available per bit: they are the “maximally compressed” way to deliver entropy.

There are a lot of techniques available to extract entropy from imperfect random number streams and transform them into seemingly-perfect random number streams. The most common and heaviest technique involves cryptographic functions, but there are alternatives (some of which have questionable quality).
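
One lightweight alternative (with its own quality limits) is XOR-ing independent biased streams together. By the piling-up lemma, XOR-ing two independent bits with bias magnitudes $\epsilon_1$ and $\epsilon_2$ (where $P(1) = \frac{1}{2} + \epsilon$) produces a bit with bias magnitude $2\epsilon_1\epsilon_2$. A toy demonstration:

```python
import random

def biased_bit(p_one=0.6):
    """One bit from a biased source with P(1) = p_one (bias +0.1)."""
    return 1 if random.random() < p_one else 0

def measured_bias(source, n):
    return sum(source() for _ in range(n)) / n - 0.5

n = 1_000_000
print(f"single source: {measured_bias(biased_bit, n):+.4f}")  # magnitude ~0.10
# XOR of two independent draws: bias magnitude 2 * 0.1 * 0.1 = 0.02.
print(f"xor of two:    {measured_bias(lambda: biased_bit() ^ biased_bit(), n):+.4f}")
```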

Non-Uniform Randomness

So far, I have only focused on uniformly distributed random numbers. You can generate random numbers that are not uniformly random - many quantum RNGs natively output exponentially distributed random numbers - and then convert the result to uniform randomness. The drawback of non-uniform random number generators is that they are effectively biased: when presented as a stream of bits, each bit carries less than 1 bit of entropy.

Non-uniform random numbers can be converted to uniform random numbers at a minor loss of entropy by quantizing the numbers with bins of equal probability mass (but unequal size); the cost shows up as either bandwidth or quality. Transforming uniform random numbers into samples from non-uniform distributions is easier to do without losing much entropy.
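
As an illustration (simulating a hypothetical exponential entropy source with Python’s `random.expovariate`), the exponential CDF maps each sample onto $[0, 1)$, and equal-probability-mass bins then yield roughly uniform bits:

```python
import math
import random

def exponential_to_bits(rate=1.0, k=8):
    """Turn one exponential sample into k roughly uniform bits.

    The exponential CDF u = 1 - exp(-rate * x) is uniform on [0, 1)
    for an ideal source; quantizing u into 2**k bins of equal
    probability mass (but unequal width) yields k uniform bits.
    """
    x = random.expovariate(rate)  # stand-in for a real entropy source
    u = 1.0 - math.exp(-rate * x)
    return int(u * (1 << k)) & ((1 << k) - 1)  # mask guards u == 1.0

# The mean of 100,000 8-bit outputs should be close to 127.5.
samples = [exponential_to_bits() for _ in range(100_000)]
print(sum(samples) / len(samples))
```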

Biased Random Number Generators

Another common class of slightly-imperfect random number generators is the biased generator. These are like unfair coins: 1’s or 0’s will be more likely to show up. Most hardware true random number generators are biased in one way or another.

Bias can show up in hardware true random number generators from a lot of sources: imbalances in transistor drive strength can manifest as bias in many common TRNG circuits, but biases can also show up due to the way circuits are designed, external factors, or non-ideal operating conditions. TRNGs that use background radio noise are particularly sensitive to bias due to external factors.

There are several algorithms that simply de-bias a stream of random numbers, but these algorithms can often weaken the independence property. If you know the bias of your random number generator, you can attempt to “swallow” ones or zeros every once in a while to fix the bias, but this has to be done very carefully to avoid independence problems.
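
The classic example of such an algorithm is the von Neumann extractor: consume input bits in pairs, emit the first bit of each `01` or `10` pair, and discard `00` and `11` pairs. It removes bias exactly, but only if the input bits really are independent, and it throws away most of the input:

```python
import random

def von_neumann(bits):
    """De-bias a bit stream by keeping only unequal adjacent pairs."""
    out = []
    for i in range(0, len(bits) - 1, 2):
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)  # 01 -> 0, 10 -> 1; 00 and 11 are dropped
    return out

# A biased but independent source with P(1) = 0.7.
raw = [1 if random.random() < 0.7 else 0 for _ in range(1_000_000)]
out = von_neumann(raw)
print(f"input mean:  {sum(raw) / len(raw):.4f}")  # ~0.70
print(f"output mean: {sum(out) / len(out):.4f}")  # ~0.50
print(f"yield:       {len(out) / len(raw):.2%}")  # ~21% = p * (1 - p)
```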

The most common way to fix bias problems in random number generation is with cryptographic post-processing.

Cryptographic Post-Processing

Cryptographic post-processing is used by /dev/random in Linux as well as most hardware true random number generators (but not quantum random number generators or the Arbitrand TRNG) as insurance against biased inputs. This type of post-processing involves:

  1. Collecting entropy from an entropy source by XOR-ing its output with the previous output from the TRNG.

  2. Estimating how much entropy you have collected, and waiting to provide an output until you have collected enough entropy.

  3. Producing a cryptographic hash of the contents of the entropy pool once you have enough entropy.
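
A heavily simplified sketch of the pattern (this is not the actual /dev/random implementation; the pool size, entropy accounting, and choice of hash are all placeholder assumptions):

```python
import hashlib
import os

class EntropyPool:
    """Toy entropy pool: XOR-mix inputs, hash out once 'full'."""

    POOL_BYTES = 64
    REQUIRED_BITS = 256  # refuse to emit output before this much entropy

    def __init__(self):
        self.pool = bytearray(self.POOL_BYTES)
        self.entropy_bits = 0.0

    def mix(self, data, est_bits_per_byte):
        # Step 1: XOR the raw source output into the pool.
        for i, byte in enumerate(data):
            self.pool[i % self.POOL_BYTES] ^= byte
        # Step 2: credit a (hopefully conservative) entropy estimate.
        self.entropy_bits += est_bits_per_byte * len(data)

    def read(self):
        # Step 3: only hash the pool out once enough entropy is banked.
        if self.entropy_bits < self.REQUIRED_BITS:
            raise RuntimeError("not enough entropy collected yet")
        self.entropy_bits = 0.0
        return hashlib.sha256(bytes(self.pool)).digest()

# os.urandom stands in for a raw entropy source claiming 4 bits/byte.
pool = EntropyPool()
pool.mix(os.urandom(64), est_bits_per_byte=4.0)
pool.mix(os.urandom(64), est_bits_per_byte=4.0)
print(pool.read().hex())
```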

If you are correct (or conservative) about how much entropy is in the entropy pool, the resulting stream is essentially a perfect random stream. However, there may be some cause for concern if the cryptographic post-processing method is ever broken - and a break could be as simple as recovering a secret key used in the hashing process.

“Rolling” My Own Random Number Generator

Designing new pseudorandom number generation algorithms is very difficult. I would suggest avoiding it if you can. Pseudorandom number generators, like cryptographic algorithms, have a lot of non-obvious pitfalls that can bias them or otherwise render them useless. True random number generators may actually be easier to design - there are a number of circuit and system techniques involved, and you don’t have to deal with the “equidistribution” property, which can be quite hard to achieve. However, in both cases, the devil is in the details.

For Arbitrand, I have tried to take a slightly different approach to randomness: instead of trying to get one perfect entropy source, we combine many uncorrelated entropy sources that are known to be slightly imperfect. By characterizing these sources and understanding their pathologies, we can effectively cover imperfections in one entropy source with another source. This strategy wastes entropy, but covers the TRNG across temporal and environmental variations in operating conditions, allowing us to operate on cloud FPGAs.

This strategy has worked well so far. The Arbitrand TRNG today produces almost 5 Gbps and passes the most stringent randomness tests available. The throughput number will likely go a lot higher soon, and I am also investigating how to scale down the throughput to around 100 Mbps with a tiny circuit.

Final Thoughts

I hope you’re interested in hearing a lot more about randomness, because developing a TRNG has been exciting for me so far, and I am going to be doing a lot more blogs along these lines while I try to bring cheap true random numbers to everyone who needs them. I will be back to micro-optimization soon, too.
