When you take the time to explore a new worksheet function, it’s amazing what you can discover!
And I won’t tell you about the really amazing thing I realized until the last few paragraphs.
Several days ago, I finished a project for a company that wanted an Excel dashboard that would chart annual trends for 15 measures, for any of 6800 companies. In Excel, the Table for that data has 55,000 rows and 29 columns.
At the last minute, they wanted me to add an additional line to several charts, a line that would let them compare the specified company’s performance to two different percentiles calculated from all 6800 companies.
This figure shows a significantly simplified and modified view of their Excel Table.
For one line, I needed to calculate the 30% percentile of Sales by year by product, and for the other line, I needed the 25% percentile of Profits.
My first thought was that it sure would be nice if Microsoft could give us a PERCENTILEIFS function, like we have for AVERAGEIFS and SUMIFS. But they haven’t yet. So how could I fake a PERCENTILEIFS function?
I could have used an array formula to filter the Table. To do so, I’d calculate a column of values that contained FALSE, instead of numbers, for the values I wanted the PERCENTILE function to ignore. But that would consume a lot of calculation time and computer memory.
And that’s when I thought about Excel’s relatively new FILTER function, which uses this syntax:
FILTER( array, include, [if_empty] )
Logically, I could use FILTER to filter out the data I didn’t need, and then find the percentile of the data that remained.
Because I’m not yet a FILTER expert, I quickly checked Excel’s help topic about FILTER. Because all the examples showed FILTER being used with the entire Table, I assumed that’s how the function works. So I first filtered my Table for the year, using a formula like this…
G2: =FILTER( Sales, Sales[Year]=G1 )
…and got these results.
(Because my version of Excel 365 implements dynamic arrays, that one formula “spilled” its results to the entire range of values in this figure.)
Notice FILTER’s include argument (the second argument) in the formula for cell G2 above. Generally speaking, Microsoft filters data with its functions in two ways.
First, we have filters like the ones that SUMIFS and AVERAGEIFS use, which require a pair of arguments for each filter. This design can’t reference arrays in its calculations, which means that its filter can’t rely on values calculated within the function. The user advantage of this design is that it calculates quickly and uses hardly any computer memory.
On the other hand, the FILTER function works somewhat like array formulas and the SUMPRODUCT function do. That is, its one include argument can reference arrays and perform nearly any type of filtering calculation. But those calculations must return a column—in memory—of TRUE or FALSE values, and then FILTER returns the values for all rows where the filtering column is TRUE.
That was a start, but what I really needed was to calculate the percentile by year and by product. And to do that, my one include argument needed to perform two tests.
To filter for rows where the year was 2020 AND where the product was games, I multiplied each test in the FILTER’s include argument, as you see here…
G12: =FILTER( Sales, ( Sales[Year]=G11 ) * ( Sales[Product]=H11 ) )
…where Sales table is the blue Table above.
This test works because:
- TRUE * TRUE = TRUE
- TRUE * FALSE = FALSE
- FALSE * FALSE = FALSE
On the other hand, if I had wanted to test for an OR condition, I would have used a plus sign rather than a multiplication sign. This test would work because:
- TRUE + TRUE = 2, which Excel treats as TRUE
- FALSE + TRUE = 1, which Excel treats as TRUE
- FALSE + FALSE =0, which Excel treats as FALSE
We can easily combine these tests to create logic like this: Show me all rows for 2020 for the Acme or Gast companies.
To do that, we combine the tests like this:
G18: =FILTER( Sales,
(Sales[Year]=G17) * ( (Sales[Company]=H17) + (Sales[Company]=I17) ) )
But none of those calculations put me any closer to creating a single column that I could use to return a percentile value.
So, looking at any of the gray figures above, how could I return one column of data that I could pass to a percentile function?
I could use the INDEX function, of course!
That is, I could use this approach to give me one column of data:
=INDEX( FILTER(whatever) , , MATCH(column name, table header, 0) )
The INDEX function, when it’s used with arrays, has this syntax:
=INDEX( array, row_num, [column_num] )
And when the row_num is either zero or missing, and we include a column_num, INDEX returns all rows in the specified column. Therefore, I could use the following formula to return the column of values beginning in cell M4, as shown in this figure.
M4: =INDEX( FILTER( Sales, ( Sales[Year]=M1) * (Sales[Product]=M2) ),,
MATCH( M3, Sales[#Headers], 0 ) )
And then, I could surround that formula with the PERCENTILE.INC function to return the value I was looking for. That function uses this syntax…
=PERCENTILE.INC( array, k )
…where array is like the array of numbers shown in the previous figure and k is the percentile value we want, from 0 to 100 percent.
(Excel gives us two choices for percentile functions, PERCENTILE.INC and PERCENTILE.EXC. The INC version (and also the deprecated PERCENTILE version) includes all k values from 0 to 1. The EXC version excludes k percent-ages below 1/(N+1) and above N/(N+1).)
The first percentile formula in the figure is…
…where the crosshatch (pound sign) tells Excel to use the full range of the spilled array that begins in cell M4, and where .25 says that we’re asking for the 25% percentile.
The second formula is this long one:
INDEX( FILTER(Sales, (Sales[Year]=M1) * (Sales[Product]=M2) ),,
MATCH( M3, Sales[#Headers],0) ),
That is, it wraps the PERCENTILE.INC function around the INDEX-FILTER-MATCH section of the formula…as shown above for cell M4.
That formula gave me the result I was looking for. But I wasn’t satisfied. Here’s why:
Although Excel’s help topic about the FILTER function shows it referencing the entire Table—which in my case was 55,000 rows by 29 columns—I saw no technical reason for that requirement. It just made no sense.
So as a test, I set up this formula for the figure at the right:
M11: =FILTER( Sales[Profits], (Sales[Year]=M9) * (Sales[Product]=M10) )
As you can see, I got the same result as I did with the longer formulas above. And that meant I could wrap the PERCENTILE.INC function around that formula to find any percentile value I want.
This is great news! It means that FILTER resembles XLOOKUP and INDEX-MATCH in this one regard: None of these functions care where each of the ranges we reference are. They even could be in different workbooks. It doesn’t matter. The only limitation is an obvious one: The ranges or arrays must have an equal number of rows.
However, my initial tests using the full Table as the array argument for the FILTER function weren’t a waste of time. That’s because I would have needed a similar approach if I ever wanted to set up one formula that I could copy to the range P19:Q22 in this figure.
INDEX(Sales,, MATCH( $N19, Sales[#Headers], 0) ),
(Sales[Year]=$M19) * (Sales[Product]=P$18 ) ),
In the formula for cell N3 above, the formula filtered the entire Sales table, and then used INDEX-MATCH to return one column from that filtered result. That approach caused Excel to filter a needlessly large amount of data.
But here, in cell P19, the formula uses INDEX-MATCH against the Sales table to pass just one column of data to the FILTER function’s array argument. And then, after that one column has been filtered, PERCENTILE.INC finds the percentile.
Notice that cell P21 in the previous figure shows the same result for the percentile as shown in cells N2 and N3 in a preceding figure. That’s because all three formulas use different methods to calculate the 25% percentile of the profits for the games product in 2020.
The Amazing Thing
It’s easy to overlook the really amazing thing about these formula patterns. To see what I mean, think again about the formula above in cell M11:
=FILTER( Sales[Profits], (Sales[Year]=M9) * (Sales[Product]=M10) )
As illustrated here, we can use the FILTER function to return data filtered in any way we want. And so, if we wrap that formula with SUM, we’d get the same answer that we’d get with SUMIFS—assuming we could specify the same filters with SUMIFS.
And that means we can fake just about any “XIFS” type formula we want. So, for example, we now can calculate results like this:
STDEVIFS ➜ STDEV( FILTER( whatever ) )
GEOMEANIFS ➜ GEOMEAN( FILTER( whatever ) )
MEDIANIFS ➜ MEDIAN( FILTER( whatever ) )
And by adding SORTBY to the pattern, we could have the equivalent of TRENDIFS, NPVIFS, FORECASTIFS, and so on. And even CORRELIFS, if we filter two columns of values.
And finally, it also means that even for SUMIFS, AVERAGEIFS, MINIFS, and so on, when we want to apply more complex filtering than those “XIFS” functions can provide, we can wrap SUM, AVERAGE, MIN, and so on around a FILTER function, as well.
And that gives us even more power!