Imagine being in a data mining project where everyone is enthusiastic about finding hidden patterns and doing some data magic. You set up and train your model, and it produces some results (for example, scores for the test dataset). At that point you might feel a little lost. Are the results sensible and reliable? Did I really choose a good approach, the right model and algorithm for the problem? The fact is that most mining tools give you “results” even if your modeling approach has severe flaws, in which case the results simply have no meaning. So how can you be sure that your model works fine?
The situation reminds me a little of Douglas Adams’ Hitchhiker’s Guide to the Galaxy: you finally have an answer, but you are no longer sure what it means. What options do you have to be assured (and to assure your stakeholders) that business decisions can be based on the results of that mining model?
Victoria Garment from Software Advice, a website that researches business intelligence systems, gathered methods used by data mining professionals Karl Rexer, Dean Abbott and John Elder to test and validate the accuracy of a mining model. You can find the full report here:
What makes this article a must-read for me is that it not only covers methods for accuracy testing (lift charts, decile tables, target shuffling, bootstrap sampling, cross validation) but also contains many practical examples and touches on various side aspects that are important for data mining. For example, John Elder talks about recognizing false patterns like the Redskins Rule, and Dean Abbott shows how easily models can be overfit and what methods you have to correct them. I especially like one quotation by John Elder: “Statistics is not persuasive to most people—it’s just too complex”. This is true: in my own experience it is very important not only to design a good prediction model but also to assure the business decision makers that they can trust the results and base their decisions on them. Target shuffling, as described in the article, may be one promising approach to make business people trust the results of a predictive model without needing a master's degree in statistics.
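To give a rough idea of what target shuffling does, here is a minimal sketch in Python. It is not taken from the article; it only illustrates the general idea of shuffling the target values many times, re-scoring each time, and checking how often pure chance matches the real result. The function names (`abs_correlation`, `target_shuffle`) and the synthetic data are my own assumptions, and the absolute correlation is just a stand-in for "train a model and measure its performance".

```python
import random


def abs_correlation(xs, ys):
    """Stand-in for 'fit a model and score it' (assumption for this sketch):
    the absolute Pearson correlation between one feature and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5


def target_shuffle(xs, ys, score_fn, n_shuffles=1000, seed=0):
    """Return the real score and the share of shuffled runs that do as well."""
    rng = random.Random(seed)
    real_score = score_fn(xs, ys)
    shuffled = list(ys)  # copy, so the caller's target list stays untouched
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between feature and target
        if score_fn(xs, shuffled) >= real_score:
            hits += 1
    # The +1 counts the real ordering as one of the possible arrangements.
    p_value = (hits + 1) / (n_shuffles + 1)
    return real_score, p_value


# Synthetic demo data (hypothetical): one target depends on the feature,
# the other is pure noise.
data_rng = random.Random(42)
feature = [data_rng.gauss(0, 1) for _ in range(200)]
signal_target = [x + data_rng.gauss(0, 1) for x in feature]
noise_target = [data_rng.gauss(0, 1) for _ in range(200)]

signal_score, signal_p = target_shuffle(feature, signal_target, abs_correlation)
noise_score, noise_p = target_shuffle(feature, noise_target, abs_correlation)

print(f"signal: score={signal_score:.3f}, chance share={signal_p:.4f}")
print(f"noise:  score={noise_score:.3f}, chance share={noise_p:.4f}")
```

The "chance share" is the kind of number that is easy to explain to decision makers: it says how often random luck alone produced a result at least as good as the real one. In a real project you would replace `abs_correlation` with your actual training and scoring step.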
Again, the full article can be found here: