Randomness Testing Guide

Autocorrelation Test

A linear congruential generator is defined by the following formula.

Xn+1=(aXn+c)modmX_{n+1} = (aX_n + c) \bmod m

In this formula, XnX_n is the current value in the sequence, aa is the multiplier, cc is the increment, and mm is the modulus. X0X_0 is a seed that can be arbitrarily chosen.

Let's say we have the following binary string generated from a linear congruential generator.

s=00110111001011011011101101011101001010101011011010s = 00110111001011011011101101011101001010101011011010

The weakness we can exploit here is that each bit is a linear function of the previous bits. If we can successfully predict the next bit using the previous bits, then the sequence isn't random.

Therefore we want to calculate the following probabilities.

  • Pr(next bit = 0 | current bit = 0)\text{Pr(next bit = 0 | current bit = 0)}
  • Pr(next bit = 1 | current bit = 0)\text{Pr(next bit = 1 | current bit = 0)}
  • Pr(next bit = 0 | current bit = 1)\text{Pr(next bit = 0 | current bit = 1)}
  • Pr(next bit = 1 | current bit = 1)\text{Pr(next bit = 1 | current bit = 1)}

However, this is exactly the same thing as calculating the frequencies of 00, 01, 10, and 11 in the binary string. This is extremely similar to what the word frequency test does, but with one key difference: the word frequency test uses non-overlapping words, whereas here we are looking at overlapping words.

s=s = \text{\color{lightcoral}{0} \color{orange}{0} \color{yellow}{1} \color{lightgreen}{1} \color{lightblue}{0} \color{orchid}{1} \color{white}{1 \ldots}}
s=00 01 11 10 01 11s' = \text{\color{lightcoral}{00} \color{orange}{01} \color{yellow}{11} \color{lightgreen}{10} \color{lightblue}{01} \color{orchid}{11}} \color{white}{\ldots}

The advantage of using overlapping words is that the sample size becomes much larger. For words of length 2 (such as the above example), the sample is 2 times larger. This is important because when using the chi-squared test, if every category has an equal expected value, then too many categories leads to an extremely sparse expected distribution and causes the test to give invalid results. ("Sparse distribution" means that many of the entries are equal to zero, which violates the sample size assumption of the chi-squared test.)

Unfortunately, using overlapping words violates the assumption that all observations are independent, which means chi-squared test won't give correct results. We can still catch non-random strings using this test, but the binary string length needs to be extremely large.

This test, known as the autocorrelation test, is a significantly more difficult to pass than the words test. Let's try and use it (with a word size of 2) on a 100-bit string from a linear congruential generator.

Small LCG String0011011100101101101110110101110100101010101101101011010010101111101111100101100010001011100111000000

There are two problems here. First is that 100 bits is not enough to detect any pattern. Second is that that word size is way too small and cannot be increased due to the small size of the binary string. However, if we use a 700,000-bit binary string with a word size of 8, we will be able to detect that the string was not randomly generated.

Large LCG String(test string too long to display: 700000 bits)

By comparison, a random string generated using AES (Advanced Encryption Standard) will not be failed by this test.

AES Generator(test string too long to display: 700000 bits)

Notes

The autocorrelation test, as implemented in this guide, is an overly-simplified version of the "serial test" from the NIST paper A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications. In their (much better) version, they re-run this "autocorrelation test" for m and m-1 length words, and compute the difference between their statistics. This difference is akin to taking the "derivative" and eliminates the dependence between observations.