This entry is not much about OLAP or MDX but gives some background on statistical tests that may be very useful in designing meaningful KPIs based on statistical hypothesis tests. In KPIs we usually compare actuals with goals and visualize the results as a traffic light. In most cases both the actual value and the target can be read from the cube. It is very common to have a dimension "Scenario" with members Actuals and Forecast or something like that.

But when dealing with statistical results, things are getting a little bit more difficult. For example, we’re testing the quality of products in a manufacturing line. If the quality test is expensive we have to rely on a spot sample. Let’s assume we want a quality of 99%. This means that only 1 of 100 produced items should fail the quality test. So for our spot test let’s say we pick 500 out of 10000 produced items of a charge and perform the quality check on them. Depending on the result of the test we decide if we are going to keep the remaining 9500 items or not.

In test theory we say our hypothesis (aka null hypothesis, H_{0}) for a statistical hypothesis test is that the quality of our charge is 99%. The alternative hypothesis (H_{1}) would be that our hypothesis is wrong, meaning the quality is below 99%. So the question is, based on the result of our spot sample, do we keep the hypothesis or not. More precisely, our null hypothesis is, that the probability for a damaged items follows a binomial distribution with a single probability of p=1%. The binomial distribution function B(n,k,p) gives the probability that exactly k out of n events happen when the single probability for an event is p. In our example, B(500,5,1%) ~ 17.6% gives the probability that we observe exactly 5 damaged items in our 500 item spot sample, assuming that the probability for a damaged item is 1%.

Without going too deep into statistics, it’s important to know that there are two kinds of error we have to face:

- type I error (aka alpha error or false positive)
- type II error (aka beta error or false negative)

A type I error means that we discard the null hypothesis (the whole charge would be thrown away) although it actually meets our quality criteria while a type II error means that we don’t discard our hypotheses (we deliver the whole charge although it doesn’t meet the criteria). It’s impossible to minimize both kind of errors (see http://en.wikipedia.org/wiki/Type_I_and_type_II_errors for details) at the same time.

For our test let’s take a look at the probabilities for the binomial distribution function (we’ll cover the calculation of these probabilities in the next post but in the meanwhile you can use the Excel function BINOMDIST to compute these values):

number of items failing the test |
single probability |
cumulated probability |

0 | 0.7% | 0.7% |

1 | 3.3% | 4.0% |

2 | 8.4% | 12.3% |

3 | 14.0% | 26.4% |

4 | 17.6% | 44.0% |

5 | 17.6% | 61.6% |

6 | 14.7% | 76.3% |

7 | 10.5% | 86.8% |

8 | 6.5% | 93.3% |

9 | 3.6% | 96.9% |

10 | 1.8% | 98.7% |

11 | 0.8% | 99.5% |

12 | 0.3% | 99.8% |

13 | 0.1% | 99.9% |

14 | 0.0% | 100.0% |

When plotting the probability we clearly see the peak at about 5 items which corresponds to our hypothesis that the probability of a damaged item is 1% as 1% of 500 items gives 5 items.

If we find no damaged items in our spot-test we can be pretty sure that our charge meets the quality criteria. And if we find 14 damaged items? Well, obviously the charge shouldn’t be delivered. But what do we do with maybe 8 damaged items? We have to define a threshold c. When X, the number of damaged items is less or equal than c we keep the charge, if X is greater than c we don’t keep the charge. This definition makes our test-procedure: "take 500 out of 10000 items for the test. If more than c fail the test, discard the whole charge". But how to find c?

Usually we would set a confidence level alpha for the type I error. Then we could compute the proper value of c that matches this confidence level (i.e. the smallest value for c for which the type I error is below our confidence level alpha). In mathematical terms this would read

If c is zero we only keep charges that have no damaged items in the spot test. Let’s compute the type I error in this case. We have to compute the probability that our hypothesis is abandoned although it is true. In mathematical terms this would read

The value can simply be taken from the table above as P(X>0|p=1%) = 1-P(X=0|p=1%) = 1-0.007=0.993

If we decide for an alpha of 5% we would choose c=9, meaning that we discard charges having more than 9 damaged items in the spot tests as this has a probability of about 3.1% (the sum of the single probability values from the table above for c=10, 11, 12, …) which is below 5%.

On the opposite, let’s compute the type II error for this case. Here we have to compute the probability that our hypothesis is kept although it isn’t true. Our hypothesis is wrong if the probability for a damage is not 1%. For the calculation we assume a higher probability, for example 2%. So in mathematical terms this would read

The complete table for the type I and type II error probabilities looks like this:

c | alpha |
beta (2%) |

0 | 99.3% | 0.0% |

1 | 96.0% | 0.0% |

2 | 87.7% | 0.3% |

3 | 73.6% | 1.0% |

4 | 56.0% | 2.8% |

5 | 38.4% | 6.5% |

6 | 23.7% | 12.8% |

7 | 13.2% | 21.7% |

8 | 6.7% | 33.1% |

9 | 3.1% | 45.7% |

10 | 1.3% | 58.3% |

11 | 0.5% | 69.8% |

12 | 0.2% | 79.3% |

13 | 0.1% | 86.7% |

14 | 0.0% | 91.9% |

So, if we set c=0 (very strict) it is very likely that we discard charges that are ok and very unlikely that we don’t discard a charge although it is damaged. On the other hand, if we set c=14 it’s very unlikely that we discard charges that are ok and very likely that we don’t discard charges that are damaged.

Of course, test theory is much more complex. Besides alpha and beta you can also consider the so called power (1-beta) in order to decide if your test setup is significant. The power gives the probability for a statistical test, to decide for the alternative hypothesis in case the alternative hypothesis is correct. In general terms keeping the hypothesis while the power is low (i.e. the type II error probability is high) doesn’t make much sense. Alpha may also be considered as the risk of the manufacturer while beta may be considered as the risk of the buyer (in case there is no warranty). The proper choose of alpha and beta determines much of the quality of the test. For example, Cohan suggests beta being less or equal to four times alpha (α≤4β, see Cohan, J. 1969, "Statistical Power Analysis for the Behavior Science") for medicial hypothesis tests. So if alpha is 5%, beta should be less or equal to 20% (power greater or equal to 80%). Of course you can increase the significance of your test (and reduce the error probability) by taking a bigger spot sample but on the other hand this is usually more expensive, so you have to find an optimal compromise between costs and significance of the test.

Now, finally back to our well known KPIs. Instead of using a test setup with only one threshold we could use a KPI with three different areas, for example like this:

green | probability of accepting a damaged charge is low (type II error) |

red | probability of not accepting a correct charge is low (type I error) |

yellow | hard to decide area in between |

This might lead to the following KPI color definition:

color |
condition |
values |

green | if beta is below 20% | 0 … 6 |

red | if alpha is below 5% | 9 … 500 |

yellow | everything else | 7 .. 8 |

Using this approach resembles the conflict of minimizing both errors in a good way.

Although the probabilities can be computed using the Excel function BINOMDIST, I didn’t manage to get this called from MDX (you can call many, but not all Excel functions from within MDX using the Excel!functionname-syntax). So the next post will be more about how to compute this distribution function in MDX using an SSAS stored procedure.