A common exercise in freshman courses on statistics and probability is to divide the students into two groups, let’s call them A and B.
Take a look at the example at right. If you were the professor, would you assume it comes from group A or from group B? Would it strike you as suspicious that the first six tosses alternate between T and H with perfect regularity, or that starting with toss 12, there are five H’s in a row?
Actually, if you were the professor, and if your students were naive in such matters, as freshmen presumably would be, you wouldn’t hesitate to assign this example to group A. The reason has to do with a common misconception that strings of consecutive T’s, consecutive H’s, or any repeating pattern such as THTHTH, are extremely unlikely to occur in the real world, and are therefore to be avoided like the plague when falsifying data. In other words, unsophisticated fraudsters would make their data overly homogeneous, and it simply would not look like this example.
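To see how ordinary such streaks are, here is a minimal Python sketch that simulates fair coin tosses and estimates how often a run of a given length turns up. The sequence length of 20, the run length of 5, the trial count, the seed, and the function names are all illustrative assumptions; none of them come from the classroom exercise itself.

```python
import random

def longest_run(tosses):
    """Return the length of the longest run of identical consecutive symbols."""
    best = current = 0
    previous = None
    for toss in tosses:
        # Extend the current run if the symbol repeats, otherwise start a new run.
        current = current + 1 if toss == previous else 1
        best = max(best, current)
        previous = toss
    return best

def estimate_run_probability(n_tosses=20, run_length=5, trials=10_000, seed=0):
    """Estimate the chance that n_tosses fair tosses contain a run of at least
    run_length identical results (heads or tails). All defaults are assumptions."""
    rng = random.Random(seed)  # seeded so repeated calls give the same estimate
    hits = 0
    for _ in range(trials):
        tosses = [rng.choice("HT") for _ in range(n_tosses)]
        if longest_run(tosses) >= run_length:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(estimate_run_probability())
```

Raising `n_tosses` makes long runs more likely still, which is why genuinely random sequences tend to look streakier than people expect.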
Here’s a different but related exercise: divide the students into groups A and B, but this time the students in group A will research and record 1,000 actual receipt totals from the campus business office, and, you guessed it, group B will fake 1,000 entries of the same.
The students are also unlikely to be familiar with today’s demo file, Benfords-Law, which is based on one presented by John Allen of Allen & Allen Semiotics at an FM-DiSC meeting earlier this year. Thank you, John, for permission to share it here (and come to think of it, for introducing me to Benford’s Law in the first place).
The file contains a number of different data sets, and you can import your own as well, to see the extent to which their first digits do or do not conform to Benford’s Law. To quote from the above-cited Wikipedia entry,
This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.
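For reference, Benford’s Law gives the expected share of leading digit d as log10(1 + 1/d), so about 30.1% of values should start with 1 and only about 4.6% with 9. The Python sketch below compares those expected shares with the leading digits of the first 1,000 powers of 2, a sequence chosen here purely to illustrate values that span many orders of magnitude. It is not one of the data sets in the demo file, and the function names are hypothetical.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_shares(values):
    """Observed share of each leading digit among positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

if __name__ == "__main__":
    expected = benford_expected()
    # Illustrative data only: 2**0 through 2**999.
    observed = leading_digit_shares(2 ** n for n in range(1000))
    for d in range(1, 10):
        print(f"{d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```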
The Wikipedia entry also includes examples of distributions that would not be expected to obey Benford’s Law, including:
- Where numbers are assigned: e.g., check numbers, invoice numbers
- Where numbers are influenced by human thought: e.g., prices set by psychological thresholds ($1.99)
- Entries with a built-in minimum or maximum
The demo file also includes a couple of examples of data sets that, as common sense would suggest, do not conform to Benford’s Law, including the complete set of all whole numbers from 1 to 99,999 and a random set of 99,999 numbers defined with 1 as the lowest possible value and 99,999 as the highest possible value.
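Why these two sets fail is easy to check: among the whole numbers from 1 to 99,999, each leading digit from 1 to 9 occurs exactly 11,111 times, so every digit has a share of one ninth rather than the roughly 30% that Benford’s Law predicts for 1. Here is a minimal Python sketch, assuming a seeded pseudo-random generator as a stand-in for the demo file’s random set (the function name and seed are illustrative, not taken from the file):

```python
import random
from collections import Counter

def leading_digit_counts(values):
    """Count how many positive integers start with each digit 1-9."""
    counts = Counter(int(str(v)[0]) for v in values)
    return {d: counts.get(d, 0) for d in range(1, 10)}

if __name__ == "__main__":
    # Every whole number from 1 to 99,999.
    complete = leading_digit_counts(range(1, 100_000))

    # 99,999 uniformly random integers in the same range (seed is an assumption).
    rng = random.Random(0)
    sample = leading_digit_counts(rng.randint(1, 99_999) for _ in range(99_999))

    for d in range(1, 10):
        print(f"{d}: complete {complete[d]}, random {sample[d]}")
```

Both tallies come out essentially flat because the range stops just short of a power of ten, so every leading digit gets an equal slice of it.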
If you’d like to learn more about Benford’s Law, there’s a wonderful, visually rich article at the Data Genetics Blog…
And you can play with different data sets at a site called Testing Benford’s Law.