Powered by InformationWeek Business Technology Network
College Hoops Challenge : Fritz Nelson's Instigator
Predictive Analytics Applied To March Madness
Last night, watching all of the experts reach the same boring conclusions (only Bobby Knight colored outside the bracket lines, picking Pittsburgh to win it all), it became clear that only by sheer luck or insanity could you pick a Marquette or a George Mason (or Pitt) to get to the Final Four. There are so many factors to consider, but it always comes back to the number one seeds. But we live in an Information Age, so why not use technology? That's just what two professors have done, creating Dance Card (to predict at-large tournament berths) and Score Card (to predict the tournament winners).
Jay Coleman and Allen Lynch were both professors at the University of North Florida (Lynch is now at Mercer University) when they married leisure pursuits and professional expertise (they teach quantitative analysis) to form Dance Card. Coleman had been searching the Web in 1998 and found collegerpi.com (home to the infamous rating percentage index, or RPI), where much of the data work had already been done. The two analyzed the data from 1994 to 1999 and found that only six pieces of information really appeared to have anything to do with getting a bid, so they converted those into a statistical formula that predicts which teams will get in. (Note: this is just for at-large teams, those that don't automatically get a bid by winning their conference or their end-of-year conference tournament.)
Coleman says that the NCAA selection committee gets a report called "the nitty-gritty report," with information on the teams conceivably in the running. That report includes how teams do against others ranked in the top 25, conference ranking, conference RPI, and so on. This year (2008) the formula was 88% accurate (meaning it got four teams wrong), down from its 94% average accuracy.
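For anyone who wants to check the math on that 88%, here is a minimal sketch. It assumes the 2008 field had 34 at-large bids (a 65-team field minus 31 automatic qualifiers), a number that does not come from Coleman; the function name is just an illustrative one.

```python
# Assumption (not from Coleman): the 2008 field had 34 at-large bids,
# i.e. 65 teams minus 31 automatic qualifiers.
AT_LARGE_BIDS_2008 = 34


def selection_accuracy(total_bids, missed):
    """Share of at-large bids the formula predicted correctly."""
    return (total_bids - missed) / total_bids


# Four misses, as reported for 2008.
accuracy = selection_accuracy(AT_LARGE_BIDS_2008, missed=4)
print(f"{accuracy:.0%}")  # prints 88%
```

Under that assumption, 30 correct picks out of 34 rounds to the 88% figure quoted above.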
Coleman says that the formula can only be as accurate as the selection committee is consistent from year to year, and that because the formula is now nine years old, he and Lynch are taking a second look at it. Specifically, the formula missed Kansas State, and they knew that would likely happen. "Any model is an abstraction," Coleman said, and it is "rarely perfect." Kansas State has some star power in the form of Michael Beasley, a candidate for player of the year. Other teams the formula didn't predict: Oregon, St. Joseph's, and Villanova. Teams it predicted that didn't get in: University of Massachusetts, Illinois State, Dayton, and Ohio State.
The pair began working on Score Card (for predicting the winner) in 2004 and launched it in 2005. They consider that formula to be in the developmental stage (Dance Card was published in an academic journal of INFORMS, the leading professional society for operations research). Its average accuracy so far is 72%, and they have no real target in mind except to just get better.
[Note: Please fill out the InformationWeek Hoops Challenge brackets by going to CBS Sportsline here, and then signing up for the InformationWeek Bracket here using the password biztech.]
The Score Card formula has only four factors (out of more than 50 that Coleman and Lynch examined): RPI value (not the ranking, but all of the factors that go into RPI), strength and rank of conference (teams from power conferences do tend to win more games in the tournament), whether the team won its regular-season championship (the conference tournament is just a flash in time and, it turns out, not a good predictor of NCAA tournament performance), and the number of wins in the last 10 games. This last one is a big one, because so many analysts talk about the "hotness" of a team, even as a reason for getting into the tournament. Coleman says it's not a factor in Dance Card because it doesn't predict making the tournament at all. It is, however, a determinant of winning.
But there are certain factors that cannot be put into the formula. UCLA is coming in as a No. 1 seed, but with lingering injury issues for key players (Luc Richard Mbah a Moute and Kevin Love). Jay Bilas talked about the guard play of Cornell and how it would make them a tough match for Stanford's guards, thus hinting at an upset. Coleman says this latter item is reflected to some degree in the overall team performance statistics and is therefore factored in at the macro level.
I asked about things like the propensity of a team to drop out early: Arizona, Tennessee, and to some degree Kansas have each been bounced out early in tournaments, even though each has also at times progressed far. While the formula doesn't account for this, it could, Coleman said. It also doesn't account for coaches and teams who have been deep in the tournament before. Whether or not this is a key factor would have to be explored. But that's what makes all of this so much fun.
Lynch and Coleman are using some beefy, enterprise-level software to do this, namely SAS predictive analytics. Coleman says it's so powerful and simple that he just has to hypothesize the factors, tell the software what he is trying to predict, and feed it the data; it crunches the numbers and spits out the results. Coleman says he started with SAS software in grad school and has found that it has always done what he needs. For example, he uses the probit function to do complex regression analysis in just a few quick statements. He also uses time-series cross-sectional regression (I just thought I'd stick that in there to impress you).
In some ways, this is the Moneyball (Michael Lewis's seminal book on how statistics can be a better predictor than years of scouting knowledge and raw expertise) of March Madness. "It's fun for us to see if conventional wisdom is true," Coleman said.
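For the curious, here is roughly what a probit model does once it has been fitted: it adds up weighted factors and runs the total through the standard normal curve to get a probability. This is a minimal sketch in Python rather than SAS, and it is not Coleman and Lynch's formula. The factor names, weights, and intercept are all hypothetical values I made up for illustration; the real coefficients would come from fitting the model to past seasons.

```python
from statistics import NormalDist


def probit_probability(features, weights, intercept=0.0):
    """Return P(outcome = 1) under a probit model: Phi(intercept + sum(w * x))."""
    score = intercept + sum(weights[name] * value for name, value in features.items())
    return NormalDist().cdf(score)  # standard normal cumulative distribution


# Hypothetical weights and intercept, invented for illustration only.
# They are NOT the values estimated by Coleman and Lynch.
weights = {"rpi": 8.0, "won_regular_season": 0.6, "wins_last_10": 0.15}
intercept = -5.5

# A made-up team profile: RPI value, regular-season title (1 = yes), recent wins.
team = {"rpi": 0.60, "won_regular_season": 1, "wins_last_10": 7}

probability = probit_probability(team, weights, intercept)
print(f"Estimated probability: {probability:.2f}")
```

The point of the probit form is that the output always lands between 0 and 1, and a team that improves on any positively weighted factor moves closer to 1.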
And let's face it, if I may make the obvious stretch (though I don't think it's much of one): this is the way people need to think about predicting business behavior. For Coleman and Lynch, this is a way of taking things that people think matter and proving whether they actually do. To that end, they are providing some significant value. Coleman is a big Clemson fan, and he says the fact that his model has Clemson getting to the Sweet 16 isn't bias, a spurious statement if I ever heard one, but I don't have the chops to challenge it.
You can use the picks in your own bracket, of course. They won't help you pick the next George Mason. And while it's a frequent occurrence for a 12 seed to beat a five seed, the challenge is picking which one. Score Card's upsets: Baylor over Purdue (an 11 over a six), which means that Baylor's profile would, on average, have beaten Purdue's profile in past seasons; and Kent State in a minor upset as a nine seed. But if the higher seeds win in the first round, Score Card has three of the number five seeds (Notre Dame, Clemson, and Drake) beating their counterparts. It also has Marquette beating Stanford in the second round.