Ideas for Business
Use Benford's Law & Charts in Excel to Improve Business Planning
Benford's Law reveals an amazing characteristic of data. Not only does it help to identify fraud, it could help you to improve budgets and forecasts.
Unless you're a public accountant, you probably haven't experimented
with Benford's Law.
Auditors sometimes use this fascinating statistical insight to
uncover fraudulent accounting data. But it might
reveal a useful strategy for investing in the stock market. And it might
help you to improve the accuracy of your budgets and forecasts.
This article will explain Benford's Law, show you how to calculate
it with Excel, and suggest ways that you could put it to good use.
From a hands-on-Excel point of view, the article describes new uses
for the SUMPRODUCT function and discusses the use of local and global
Benford's Law Explained
Also called the "first-digit law", Benford's Law says that for many real-life
sources of data, the first digit will be "1" about 30% of the time and
that small first-digits will occur more frequently than large first-digits.
The law is named after Dr. Frank Benford, a physicist working for
the General Electric Company. In 1938 he noticed that the pages of
logarithms for low numbers were dirtier and more worn than pages for
high numbers, and that pages for numbers starting with 1 were the most
heavily used of all.
Intrigued, he analyzed thousands of sets of numbers. These included
lists of the areas of rivers, street addresses in a phone book, baseball
statistics, population statistics, and numbers used in magazine
articles. In all cases, the smaller that the first digit of the number
was, the more likely was that first digit to appear.
derived a formula that states the expected frequency of each beginning
digit. In Excel, we would express this formula as...
...where FirstDigit can be a whole number 1 through 9.
This table shows the results of the formula. To illustrate, the
probability of 17.6% in the table for a first digit of 2 was calculated using this formula:
Testing Benford's Law With Stock Data
I tested Benford's Law using two sets of stock data from the Wall Street
This page gave me data for the New York Stock Exchange and
this page gave me data for the Nasdaq Capital Market, formerly
called the Nasdaq SmallCap Market. In both cases, the data was for July
To test Benford, I found the first-digit totals for a variety of data
for all stocks in the two tables, then used charts to compare how
frequently each digit appeared with the frequency that the Benford
To get started, I used links provided in the two WSJ pages to download a worksheet
dataset. I moved the two worksheets into one workbook; saved the
workbook under a new name; named one worksheet NYSE and the other
Nasdaq; deleted the columns I didn't care about
in each sheet; and formatted the table in each worksheet. To illustrate, here's
my formatted table for the NYSE worksheet:
To show the table in this figure, I hid rows 7 through 2100. To do so, I selected those rows and chose
Home, Cells, Format, Hide & Unhide, Hide Rows.
An Aside: Assigning the Range Names
So that you can understand the formulas that follow, I need to interrupt
this description to explain how I handled my range names.
To make my formulas easier to use and understand, I decided to assign
the column titles as range names for the columns of data. And because I saved the NYSE and Nasdaq tables in two sheets of the same
workbook, I needed to assign names that were local to each worksheet.
Most range names are workbook-global. That is, if you assign the name Close to a range in
one worksheet, you can refer to the name in any worksheet in that
workbook. To illustrate, if I assigned the name Close to a column of
data, I could use that name in any formula in the workbook, like this:
But in this workbook, I wanted names that were local to each worksheet. With
worksheets named NYSE and Nasdaq, I wanted to be able to specify which
range I wanted, using references like these:
The time-consuming way to assign local names is with the New Name dialog. To
illustrate, I could select the range C3:C2104 in the NYSE worksheet
choose Formulas, Defined Names, Define Name, Define Name...; enter Close in
the Names edit box; choose NYSE as the scope, and then choose OK. Then I would need to
repeat that process for each of the other columns I wanted to name.
The quick way to assign these names is to use the Create Names From
This dialog assigns a name globally if a global version hasn't already
been defined; otherwise, it assigns the name locally. Therefore, in
cases like this, I always set up a block of Reserved Names that I define
globally, but plan never to use; then I assign the real names locally.
To make sure that I don't accidentally use the global versions, I enter
=NA() in the named cells. That way, my formula returns the #N/A error
value if I use the global version by mistake.
To set up the global versions of the names, I first added a new worksheet to the workbook and
named it Control. Then I copied the row of titles shown in the table
above to a column in the Control worksheet.
do so, I copied the range A2:H2 above, selected cell A3 in the Control
sheet, shown here; chose Home, Clipboard, Paste, Paste Special; chose Values and Transpose
in the Paste Special dialog; then I chose OK.
I entered the formula =NA() in cell B3 and copied it down the column as
shown here. I added some minor formatting, then assigned the text in
column A as names in column B. To do so, I selected the range A3:B10;
chose Formulas, Defined Names, Create From Selection; made sure that only Left Column was selected
in the Create Names From Selection dialog; then chose OK.
In the NYSE worksheet, illustrated above, I selected the range A2:H2104,
chose Formulas, Defined Names, Create From Selection,; made sure that only Top Row was selected
in the Create Names From Selection dialog; then chose OK. I used a similar process
to name the table of data in the Nasdaq worksheet.
Testing Benford's Law, Resumed
After I named the ranges of data for the two exchanges, I summarized it
in two tables. Here's the table with the NYSE data:
Column E contains the numbers 1 through 9.
The formulas in columns A and B count the number of cells that have
numbers beginning with each of the nine digits. The first two formulas
return the closing price and the volume, respectively:
In cell A4, the LEFT function returns the left-most digit as text and
the VALUE function converts the text to a number. Then the formula compares
that left-most digit to the number in cell E4. Because SUMPRODUCT works with
arrays, the formula performs this calculation on each cell in the Close
range, and stores TRUE or FALSE in a temporary list in memory. Adding zero to
those values returns a list of 1 and 0 values, respectively. Finally,
the SUMPRODUCT sums those ones and zeros, returning a count of all
numbers in the NYSE's Close range that begin with the value 1.
The first time I entered these formulas, Excel returned a #VALUE! error.
As I looked through the data I saw that it used the text "..." to show
missing values. Rather than creating a complex formula to deal with this
problem, I used Excel's Find and Replace dialog to replace all such text
with 0 (zero). Later, I saw that the PE column used "dd" to indicate a
loss in the prior four quarters; so I also replaced "dd" with 0.
The formula in cell B4 performs the same calculation for the NYSE's
Columns C and D contain dollar and percentage changes in stock price.
Because this data consists of decimal fractions that can be less than
zero, we must use a slightly different formula.
function finds the absolute value of each number, removing any minus
signs. And multiplying each number by 100 moves the first digit to the
left of the decimal point. The rest of these formulas is like those for
columns A and B.
I copied the range A4:D4 down the column as shown. Then in the range
A13:D13 I used SUM functions to return the total of each column. (These
columns show slightly different values because data is missing from some
of the columns.)
Now that I had the summary data, I could write the formulas for the
This formula is Excel's version of Benford's formula, as I described
Copy it down the column as shown above.
This formula finds the percentage of values in the Close column that
begin with the digit "1". When you copy this formula to the range G4:J12
it will adapt to all nine digits for each of the four categories of NYSE
Finally, I looked for a way to score how closely the actual distribution
of first digits matches the Benford pattern. I settled on this
formula, which I copied to the range H13 through J13:
The SUMXMY2 function subtracts each value in one column from the
corresponding value in the other column, squares each difference, and
then sums the results. The function's name is derived from: SUM of X
Multiplied by Y, each quantity squared.
Dividing the result by 9 returns an average value for each of the nine
squared deviations. And finding the square root of this average value returns an
average deviation between the two columns of nine numbers.
I call this final score the Average Deviation, or AD. The closer the AD
is to zero, the more closely our actual data matches Benford's number pattern.
Honestly, statisticians probably would object to this
calculation. But it's simple to use and makes sense to
non-statisticians: It merely represents the average amount that our data
deviates from the Benford pattern.
To give you an idea of how a chart of your data might correspond with a
specific AD score, the figure below illustrates six data series, in
green, and the Benford numbers in orange. You can see how the green line
strays farther from the orange line as the AD grows from .9% to 12.4%.
Near the end of this article I'll explain how you might use the AD for
Charting Benford Data
After I created the NYSE table, I made another one for NASDAQ data. Then I created this
display of the charts of both data sets:
What do these charts tell us?
In both charts, Benford's calculation
is plotted as a thick orange line. But in both charts, the line is
barely visible because plots of the actual data match Benford's
calculation so precisely...with
one exception, the green line showing the closing price.
Notice that the NYSE has fewer stocks than predicted with closing prices that
begin with "1" and more stocks with closing prices that begin with "3", "4", and
Further, compare the Average Deviation for the the Close column in
cell G13 of the table above with the other ADs shown in that figure. The
AD for the Close column is 4.1%. This value is notably higher than the
other three ADs.
I suspect this pattern for closing prices arises because the average NYSE
stock is expensive, averaging more than $96 per share. As a consequence, companies often split their stock when
its price gets above $100. This makes "round lots" of the
stock (100 shares) more affordable for small investors.
We see an opposite pattern in the small-cap Nasdaq stocks, which have an average price of less than $10 per share. In
those firms, the chart shows that many Nasdaq
companies apparently prefer their stock prices to begin with "1".
After all, a $1.49 stock seems more important than a stock costing $.49,
and a $12.95 stock seems more important than a stock priced at $3.95.
The NYSE pattern got me wondering whether stock splitting made financial sense for
companies and their investors.
The plot of the PE ratio in this chart seems to show that NYSE investors
valued stocks with prices beginning with "4" significantly above other
stocks. Also, in general, it appears that the larger the first digit, the more
expensive a stock seemed to investors, and the less they valued it.
Now consider the year-to-date change in the stock price. People who
invested in stocks priced with larger first-digits appeared to have
higher returns than people who invested in stocks priced with smaller
That is, there's some indication that investors would benefit by buying
stocks with prices beginning with digits 5 through 9.
Keep in mind that this is a quick-and-dirty analysis based on only one
dataset. And the calculation ignores dividends. A more complete analysis
could produce different results. But at the minimum, this chart
seems to show that it might be worthwhile for serious investors to
explore whether using Benford's Law could help to improve investment
Where Benford's Law Works Best
Benford's first-digit number pattern is typically found in real-life
data independent of human or categorical influence. Here are some more
The data should vary by at least an order of magnitude. The more
variation, the better.
built-in maximum, minimum, or often-repeated numbers.
The Benford pattern wouldn't be found in a measure of the height of adults or
in the ages
of the members of a local senior center.
A record of purchases would be distorted by the frequent purchase of
one item for one price. Similarly, in a list of company expense reports, if
company policy states that expenses of $25 or more
must include a copy of the receipt, expect many expenses to be listed at
$24.95. This would distort the first-digit pattern, of course.
Large sample size.
You want enough numbers to give the Benford number pattern a chance
to show itself. Unfortunately, I haven't found a source that offers
well-reasoned advice about how small a sample size can be for
This chart illustrates a non-mathematical way to address this issue.
It shows the average deviation for sample sizes of 10 through 610
numbers for both the closing prices and volumes for the NYSE data.
As you can see, the AD started at about 10% for a sample size of 10
for both series and dropped quickly to about 5% for a sample size of 60.
From there, the green line for the closing price quickly approached its
true value of about 4.1%. And the brown line for the volume continued
downward, coming close to its actual value of 0.5%.
Based on this limited test, it looks that about 50 numbers will give
you some idea about the degree to which the Benford pattern applies to
your data. But 500 numbers should give you an excellent idea.
This assumes, of course, that your data is in no particular order. If
it's sorted by the column you're testing, you would need to test the
entire list, no matter how long it is.
Results of standard transactions or calculations.
Numbers generated in the normal course of business tend to follow
In contrast, lists of check numbers or purchase order numbers would not
work. These are numeric labels and not true values.
And particularly, numbers made up out of thin air would not
follow Benford's pattern. This is why Benford's Law is useful for auditors
searching for fraud, and why it might also be useful for improving
Using Benford's Law to Improve Business Planning
Back when I was a CFO for small companies, I quickly learned to distrust
the budgets and forecasts that certain managers gave me. These people
would open my budget forms and quickly enter their guesstimates. I
variously called these either No-Thought Plans or Five-Minute Plans.
My challenge was to know which plans were the result of real thought and
which were the result of no-thought guesses. I first assumed
that any number ending with several zeros was a no-thought guess. But
managers quickly learned to add randomly chosen numbers to their
guesses, making their plans seem more precise.
Now, years later, I realize that Benford's Law might have helped me to
find the no-thought plans.
To see how this might work, consider two sales managers.
One manager spends five minutes with his sales forecast. He takes a wild guess about how each
product, region, or sales person will perform; he reduces
his guesses by a fudge factor; and then he quickly enters his "forecast"
Because this data comes mostly from the manager's imagination, you
wouldn't expect it to follow Benford's Law.
The other manager is much more precise. She carefully considers past
sales performance by region and month. She estimates the
business she expects to do with each of her largest customers. She looks at
price quotes outstanding and the likelihood that each quote will produce
a purchase order. She considers the likely returns from planned
marketing projects. She gets detailed estimates from her sales staff.
And she might even use some form of statistical time-series forecasting.
Because this forecast is more closely linked to real-world data, you
would expect it to follow the Benford distribution more closely than the
other manager's guesstimate.
The same logic could apply to budgeting expenses.
Therefore, here's how you probably could improve your business plans:
(1) Calculate the Benford distribution for each plan that your managers
submit, and then calculate an Average Deviation score for each plan.
(2) Sort the results so that the largest AD is at the top of the list.
(3) Work your way down the list,
quizzing managers about the methods they used to create their
plans. And request revised plans as necessary.
(4) Continue down the list until you either find managers who truly have thought about their plans, or until you run out of
time...whichever comes first. If you do run out of time, you'll know
that you probably have improved the worst of the plans.
Alternatively, if you know you'll run out of time, you could
isolate the plans that have the highest ADs and sort them by the total
amount of the plans. That way, you can work down a list of the largest of the
Honestly, I've never used this method on the job. But it has every
reason to work. If you decide to give it a try, please use the Comments
link at the top of this article to let me know how it went.
Tags: #excel, #benford's law, #NYSE, #SUMPRODUCT, #LOG, #VALUE, #LEFT, #SQRT, #SUMXMYS, #range names,