A common exercise in freshman courses on statistics and probability is to divide the students into two groups, let’s call them A and B.
Each student in group A is instructed to flip a coin 100 times and record the resulting sequence of heads and tails. Each student in group B is instructed to merely pretend to have done so, and write down a fictional sequence. The sequences are submitted anonymously to the professor, but invariably the professor correctly determines which group each one belongs to.
Take a look at the example at right. If you were the professor would you assume it comes from group A or from group B? Would it strike you as suspicious that the first six tosses alternate between T and H with perfect regularity, or that starting with toss 12, there are five H’s in a row?
Actually, if you were the professor, and if your students were naive in such matters, as freshmen presumably would be, you wouldn’t hesitate to assign this example to group A. The reason has to do with a common misconception that strings of consecutive T’s, consecutive H’s, or any repeating pattern such as THTHTH, are extremely unlikely to occur in the real world, and therefore to be avoided like the plague when falsifying data. In other words, unsophisticated fraudsters would make their data overly homogeneous, and it simply would not look like this example.
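In fact, long runs are the norm, not the exception, in genuine coin flips. Here's a quick simulation (my own illustration in Python, not part of the demo file) that estimates how often a sequence of 100 fair flips contains a run of five or more identical outcomes:

```python
import random

def longest_run(seq):
    """Length of the longest run of identical outcomes in seq."""
    best = cur = 1
    for a, b in zip(seq, seq[1:]):
        cur = cur + 1 if a == b else 1
        best = max(best, cur)
    return best

random.seed(42)
trials = 10_000
hits = sum(
    longest_run([random.choice("HT") for _ in range(100)]) >= 5
    for _ in range(trials)
)
print(f"P(run of 5+ in 100 flips) ≈ {hits / trials:.2f}")
```

Run it and you'll find that the overwhelming majority of genuine 100-flip sequences contain at least one run of five, which is exactly why five H's in a row is evidence for, not against, a real sequence.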
Here’s a different, but related exercise: divide the students into groups A and B, but this time the students in group A will research and record 1,000 actual receipt totals from the campus business office, and, you guessed it, group B will fake 1,000 entries of the same.
Again the professor will have no trouble differentiating the real list from the fake one, because our hypothetical freshmen are extremely unlikely to be familiar with Benford’s Law (a.k.a. the “First-Digit Law”), which states that for many real-world data sets, the left-most digit will be 1 approximately 30.1% of the time, 2 approximately 17.6% of the time, and so on through the digits until 9, which will appear as the first digit about 4.6% of the time.
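Those percentages aren't arbitrary; they come from the law's formula, which says the probability of leading digit d is log10(1 + 1/d). A few lines of Python (my own illustration, not from the demo file) reproduce the figures above:

```python
import math

# Benford's Law: P(d) = log10(1 + 1/d) for leading digit d
for d in range(1, 10):
    p = math.log10(1 + 1 / d)
    print(f"{d}: {p:.1%}")
# prints 1: 30.1%, 2: 17.6%, ... 9: 4.6%
```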
The students are also unlikely to be familiar with today’s demo file, Benfords-Law, which is based on one presented by John Allen of Allen & Allen Semiotics at an FM-DiSC meeting earlier this year. Thank you John for permission to share it here (and come to think of it, for introducing me to Benford’s Law in the first place).
The file contains a number of different data sets, and you can import your own as well, to see the extent to which their first digits do or do not conform to Benford’s Law. To quote from the Wikipedia entry on Benford’s Law,
This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.
The Wikipedia entry also includes examples of distributions that would not be expected to obey Benford’s Law, including:
- Where numbers are assigned: e.g., check numbers, invoice numbers
- Where numbers are influenced by human thought: e.g., prices set by psychological thresholds ($1.99)
- Entries with a built-in minimum or maximum
The demo file also includes a couple of examples of data sets that, as common sense would suggest, do not conform to Benford’s Law: the complete set of integers from 1 to 99,999, and a set of 99,999 random numbers drawn between 1 (the lowest possible value) and 99,999 (the highest possible value).
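You can try a similar comparison outside the demo file. The sketch below (my own, using Python's random module; a lognormal distribution stands in for "data spread across multiple orders of magnitude") contrasts a uniform random set like the one in the demo file against a multi-order-of-magnitude set, measuring each one's total deviation from the Benford percentages:

```python
import math
import random
from collections import Counter

def first_digit(x):
    """Leading nonzero digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def digit_freqs(values):
    """Observed frequency of each leading digit 1-9."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return [counts[d] / n for d in range(1, 10)]

random.seed(1)
n = 99_999
uniform = [random.randint(1, 99_999) for _ in range(n)]
spread = [random.lognormvariate(0, 3) for _ in range(n)]  # spans many orders of magnitude

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
u_err = sum(abs(f - b) for f, b in zip(digit_freqs(uniform), benford))
l_err = sum(abs(f - b) for f, b in zip(digit_freqs(spread), benford))
print(f"uniform   total deviation from Benford: {u_err:.3f}")
print(f"lognormal total deviation from Benford: {l_err:.3f}")
```

As expected, the uniform set's leading digits come out roughly equally often (about 1/9 each), far from Benford, while the lognormal set tracks the law closely.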
If you’d like to learn more about Benford’s Law, there’s a wonderful, visually rich article at the Data Genetics Blog…
And you can play with different data sets at a site called Testing Benford’s Law.