A common exercise in freshman courses on statistics and probability is to divide the students into two groups, let’s call them A and B.

Each student in group A is instructed to flip a coin 100 times and record the resulting sequence of heads and tails. Each student in group B is instructed to merely pretend to have done so, and write down the fictional sequence. The sequences are submitted anonymously to the professor, but invariably the professor correctly determines which group they belong to.

Take a look at the example at right. If you were the professor, would you assume it comes from group A or from group B? Would it strike you as suspicious that the first six tosses alternate between T and H with perfect regularity, or that starting with toss 12, there are five H’s in a row?

Actually, if you were the professor, and if your students were naive in such matters, as freshmen presumably would be, you wouldn’t hesitate to assign this example to group A. The reason has to do with a common misconception that strings of consecutive T’s, consecutive H’s, or any repeating pattern such as THTHTH, are extremely unlikely to occur in the real world, and are therefore to be avoided like the plague when falsifying data. In other words, unsophisticated fraudsters would make their data overly homogeneous, and it simply would not look like this example.
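To put a number on that intuition, here is a minimal Python sketch (my own illustration, not part of the classroom exercise or the demo file) that computes the exact probability that a sequence of fair coin tosses contains a streak of heads of a given length. The function name `prob_run_of_heads` is hypothetical, and the coin is assumed to be fair with independent tosses.

```python
from fractions import Fraction


def prob_run_of_heads(n_flips: int, run_len: int) -> Fraction:
    """Exact probability that n_flips fair tosses contain at least
    run_len consecutive heads somewhere in the sequence."""
    # state[i] = probability that the current streak of heads is exactly i
    # and no streak of run_len heads has occurred so far.
    state = [Fraction(0)] * run_len
    state[0] = Fraction(1)
    half = Fraction(1, 2)

    for _ in range(n_flips):
        nxt = [Fraction(0)] * run_len
        for streak, p in enumerate(state):
            nxt[0] += p * half  # tails resets the streak to zero
            if streak + 1 < run_len:
                nxt[streak + 1] += p * half  # heads extends the streak
            # Otherwise heads completes the run, and that probability
            # mass leaves the "no run yet" table for good.
        state = nxt

    # Whatever is left in the table never saw a full run.
    return 1 - sum(state)


if __name__ == "__main__":
    p = prob_run_of_heads(100, 5)
    print(f"P(at least five heads in a row in 100 tosses) = {float(p):.3f}")
```

Running it for 100 tosses and a streak of five shows the probability is comfortably above one half, so a genuine sequence with five H’s in a row is the norm rather than a red flag.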

Here’s a different but related exercise: divide the students into groups A and B, but this time the students in group A will research and record 1,000 actual receipt totals from the campus business office, and, you guessed it, group B will fake 1,000 entries of the same.

Again the professor will have no trouble differentiating the real list from the fake one, because our hypothetical freshmen are extremely unlikely to be familiar with Benford’s Law (a.k.a., the “First-Digit Law”), which states that for many real-world data sets, the left-most digit will be 1 approximately 30.1% of the time, 2 approximately 17.6% of the time, and so on through the digits until 9, which will appear as the first digit about 4.6% of the time.
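The percentages quoted above follow from the standard statement of Benford’s Law, under which the leading digit d is expected with probability log10(1 + 1/d). Here is a short Python sketch of that formula (the function name `benford_expected` is my own and has nothing to do with the demo file):

```python
import math


def benford_expected(digit: int) -> float:
    """Expected share of values whose first digit is `digit` (1-9)
    under Benford's Law: log10(1 + 1/d)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


if __name__ == "__main__":
    for d in range(1, 10):
        print(d, f"{benford_expected(d):.1%}")
```

The nine shares sum to 1, and the values for 1, 2 and 9 round to the 30.1%, 17.6% and 4.6% mentioned above.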

The students are also unlikely to be familiar with today’s demo file, Benfords-Law, which is based on one presented by John Allen of Allen & Allen Semiotics at an FM-DiSC meeting earlier this year. Thank you, John, for permission to share it here (and come to think of it, for introducing me to Benford’s Law in the first place).

The file contains a number of different data sets, and you can import your own as well, to see the extent to which their first digits do or do not conform to Benford’s Law. To quote from the above-cited Wikipedia entry,

This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.

The Wikipedia entry also includes examples of distributions that would not be expected to obey Benford’s Law, including:

- Where numbers are assigned: e.g., check numbers, invoice numbers
- Where numbers are influenced by human thought: e.g., prices set by psychological thresholds ($1.99)
- Entries with a built-in minimum or maximum

The demo file also includes a couple of examples of data sets that, as common sense would suggest, do not conform to Benford’s Law, including the complete set of all whole numbers between 1 and 99,999 and a random set of 99,999 numbers defined with 1 as the lowest possible value and 99,999 as the highest possible value.
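To see why the complete 1 to 99,999 set fails, here is a rough Python sketch (an illustration of the idea, not how the demo file does it) that tallies leading digits and sets them beside the Benford expectation. The helper names `leading_digit` and `first_digit_shares` are hypothetical, and the input is assumed to be integers.

```python
import math
from collections import Counter


def leading_digit(n: int) -> int:
    """First digit of a nonzero integer, ignoring its sign."""
    return int(str(abs(n))[0])


def first_digit_shares(values) -> dict:
    """Share of values starting with each digit 1-9 (zeros are skipped)."""
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}


if __name__ == "__main__":
    shares = first_digit_shares(range(1, 100_000))  # 1 through 99,999
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)
        print(d, f"observed {shares[d]:.1%}", f"Benford {expected:.1%}")
```

Each digit leads exactly 11,111 of the 99,999 numbers, so every observed share is one ninth (about 11.1%), a flat line rather than the steep Benford curve. The same `first_digit_shares` function could be pointed at a list of receipt totals to see how closely they follow the law.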

If you’d like to learn more about Benford’s Law, there’s a wonderful, visually-rich article at the Data Genetics Blog…

And you can play with different data sets at a site called Testing Benford’s Law.

Interesting! Thanks for that intro, Kevin.

Benford’s Law obviously falls apart when numbers starting with zeroes are allowed into the mix, for example if my invoice numbers are serial numbers that are actually text, beginning with “0000001”.

Hi Lorne, thanks for taking the time to comment. You could of course use the Abs function to strip off those leading zeroes, but still, invoice numbers or any other assigned numbers aren’t good candidates for Benford analysis. By contrast the line item amounts, or the grand totals, would be ideal candidates.

Hi Kevin,

Did this DevCon make you write all this?

Very interesting theory, can’t wait to put this to a test!

Kind regards, and hope to see you this week!

Andries

The number of DevCon attendees is 1,200 — the leading digit is 1 which of course would come as no surprise to students of Benford’s Law.

P.S. I’m smiling as I type this because I’m sitting about 15 feet from you at the Tuesday morning DevCon session (and of course 15 has a leading digit of 1…).

Kevin,

Very interesting piece. Thanks for posting it.