The InformationWeek -- Blogs
InformationWeek's College Hoops Challenge



Predictive Analytics Applied To March Madness


Posted by Fritz Nelson, Mar 17, 2008 02:27 PM

Last night, as I watched all of the experts reach the same boring conclusions (only Bobby Knight colored outside the bracket lines, picking Pittsburgh to win it all), it became clear that only by sheer luck or insanity could you pick a Marquette or a George Mason (or Pitt) to get to the Final Four. There are so many factors to consider, but it always comes back to the number one seeds. But we live in an Information Age, so why not use technology? That's just what two professors have done in creating Dance Card (to predict at-large tournament berths) and Score Card (to predict the tournament winners).


Jay Coleman and Allen Lynch were both professors at the University of North Florida (Lynch is now at Mercer University) when they married leisure pursuits and professional expertise (they teach quantitative analysis) to form Dance Card. Coleman had been searching the Web in 1998 and found collegerpi.com (home to the infamous rating percentage index, or RPI), where much of the data work had already been done. The two analyzed the data from 1994 to 1999 and found that only six pieces of information really appeared to have anything to do with getting a bid, so they converted that information into a statistical formula that would predict which teams would get in. (Note: this applies only to at-large teams, or those that don't automatically get a bid by winning their conference or their end-of-year conference tournament.)

Coleman says that the NCAA selection committee gets a report called "the nitty-gritty report" with information on those teams conceivably in the running. That report includes how teams do against others ranked in the top 25, conference ranking, conference RPI, and so on. This year (2008) their formula produced results that were 88% accurate (meaning they got four wrong), which is down from their 94% average accuracy. Coleman claims that the formula can only be as accurate as the selection committee is consistent from year to year, and that because the formula is now nine years old, they are taking a second look at it.
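The arithmetic behind that 88% figure can be checked in a few lines. This is only a sketch: it assumes the 2008 field had 34 at-large bids, a number that is not stated in this post, and the function name is made up for illustration.

```python
def selection_accuracy(at_large_bids: int, misses: int) -> float:
    """Share of at-large bids the formula called correctly."""
    return (at_large_bids - misses) / at_large_bids


# Assumption: 34 at-large bids in 2008 (not given in the post); 4 misses is from the post.
accuracy_2008 = selection_accuracy(at_large_bids=34, misses=4)
print(f"{accuracy_2008:.1%}")  # 88.2%
```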

Specifically, they missed Kansas State, and they knew that would likely happen. "Any model is an abstraction," Coleman said, and it is "rarely perfect." In the case of Kansas State, it has some star power in the form of Michael Beasley, a candidate for player of the year. Other teams the formula didn't predict: Oregon, St. Joseph's, and Villanova. Teams it predicted that didn't get in: University of Massachusetts, Illinois State, Dayton, and Ohio State.

The team began working on ScoreCard (for predicting the winner) in 2004 and launched it in 2005. The pair consider the formula to be in the developmental stage (DanceCard, by contrast, was published in an academic journal of INFORMS, the leading professional society for operations research). The average accuracy so far is 72%, and they have no real target in mind except to get better.

[Note: Please fill out the InformationWeek Hoops Challenge brackets by going to CBS Sportsline here, and then signing up for the InformationWeek Bracket here using the password biztech.]

The ScoreCard formula has only four factors (out of more than 50 that Coleman and Lynch examined): RPI value (not the ranking, but the value itself, with all of the factors that go into RPI); strength and rank of conference (power conferences do tend to win more games in the tournament); whether the team won its regular-season conference championship (the conference tournament is just a flash in time and not, it turns out, a good predictor of NCAA tournament performance); and the number of wins in the last 10 games. This last one is a big one, because so many of the analysts talk about the "hotness" of a team, even for getting into the tournament. Coleman says that it's not a factor in DanceCard because it's not a predictor of making the tournament at all. It is, however, a determinant in winning.
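To make the idea concrete, here is a minimal sketch of how four factors like these could be turned into a win probability. The post names the factors but not the coefficients or the exact form of the formula, so everything numeric below is invented: the weights, the team profiles, and the choice to score the difference between two profiles through a normal curve are all assumptions, not the professors' actual model.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, the curve a probit-style model uses to turn a score into a probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical weights, invented for illustration only (not ScoreCard's coefficients).
WEIGHTS = {
    "rpi": 8.0,
    "conference_strength": 1.5,
    "regular_season_champ": 0.3,
    "wins_last_10": 0.1,
}


def win_probability(team_a: dict, team_b: dict) -> float:
    """Probability that team_a beats team_b, from the weighted gap between their profiles."""
    z = sum(w * (team_a[k] - team_b[k]) for k, w in WEIGHTS.items())
    return normal_cdf(z)


# Made-up profiles for two fictional teams.
team_a = {"rpi": 0.62, "conference_strength": 0.58, "regular_season_champ": 1, "wins_last_10": 8}
team_b = {"rpi": 0.60, "conference_strength": 0.57, "regular_season_champ": 0, "wins_last_10": 6}

p = win_probability(team_a, team_b)
print(f"Team A beats Team B with probability {p:.2f}")
```

Scoring the gap between two profiles means the two teams' probabilities always add up to one, and identical profiles come out as a coin flip.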

But there are certain factors that cannot be put into the formula. UCLA is coming in as a No. 1 seed, but with lingering injury issues for key players (Luc Richard Mbah a Moute and Kevin Love). Jay Bilas talked about the guard play of Cornell and how it would make them a tough match for Stanford's guards, thus hinting at an upset. Coleman says this latter item is reflected in the overall team performance statistics to some degree and therefore is factored in at the macro level.

I asked about things like the propensity of a team to drop out early: Arizona, Tennessee, and to some degree Kansas have each been bounced out early in tournaments, even though each has also at times progressed far. While the formula doesn't account for this, it could, Coleman said. It also doesn't account for coaches and teams who've been deep in the tournament before. Whether or not this is a key factor would have to be explored. But that's what makes all of this so much fun.

Lynch and Coleman are using some beefy, enterprise-level software to do this, namely SAS predictive analytics. Coleman says it's so powerful and simple that he just has to hypothesize the factors, tell the software what they are trying to predict, and feed it the data; it crunches the numbers and spits out the results. Coleman says he started with SAS software in grad school and has found it has always done what he needs. For example, he uses the probit function to do complex regression analysis in just a few quick statements. He also uses time-series cross-sectional regression (I just thought I'd stick that in there to impress you).
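For readers curious what a probit regression actually does under the hood, here is a rough sketch of fitting one by maximum likelihood. It is not SAS code and not the professors' model: it is plain Python with a single made-up predictor, toy data invented for the example, and a simple gradient-ascent fit chosen only because it fits in a few lines.

```python
import math


def normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clipped_prob(z: float) -> float:
    """Probit probability, kept away from 0 and 1 so logs and divisions stay finite."""
    return min(max(normal_cdf(z), 1e-12), 1.0 - 1e-12)


def log_likelihood(b0: float, b1: float, xs: list, ys: list) -> float:
    total = 0.0
    for x, y in zip(xs, ys):
        p = clipped_prob(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def fit_probit(xs: list, ys: list, step: float = 0.1, iterations: int = 2000) -> tuple:
    """Fit P(y=1) = Phi(b0 + b1*x) by gradient ascent on the log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(iterations):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            z = b0 + b1 * x
            p = clipped_prob(z)
            score = (y - p) * normal_pdf(z) / (p * (1.0 - p))
            g0 += score
            g1 += score * x
        b0 += step * g0 / n
        b1 += step * g1 / n
    return b0, b1


# Toy data, invented for illustration: x is a made-up "profile gap", y is 1 for a win.
XS = [-2.0, -1.5, -1.0, -0.5, -0.5, 0.0, 0.0, 0.5, 0.5, 1.0, 1.5, 2.0]
YS = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1]

intercept, slope = fit_probit(XS, YS)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

A positive fitted slope says that a bigger gap goes with a higher chance of winning, which is the kind of "does this factor actually matter" question the professors put to their data.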

In some ways, this is the Moneyball (Michael Lewis's seminal book on how statistics can be a better predictor than years of scouting knowledge and raw expertise) for March Madness. "It's fun for us to see if conventional wisdom is true," Coleman said. And, if I may make the obvious stretch (though I don't think it's much of one), this is the way people need to think about predicting business behavior. For Coleman and Lynch, this is a way of taking things that people think matter and testing whether they actually do. To that end, they are providing some significant value.

Coleman is a big Clemson fan, and he says his model's pick of Clemson to reach the Sweet 16 isn't biased, a spurious statement if I ever heard one, but I don't have the chops to challenge it. You can use the picks in your own bracket, of course. It won't help you pick the next George Mason. And while it's a frequent occurrence for a 12 seed to beat a five seed, the challenge is picking which one. ScoreCard's upsets: Baylor over Purdue (an 11 over a six), which means that Baylor's profile on average would have beaten Purdue's profile in past seasons; and Kent State in a minor upset as a nine seed. But if the higher seeds win in the first round, ScoreCard has three of the number five seeds (Notre Dame, Clemson, and Drake) beating their counterparts. It also has Marquette beating Stanford in the second round.
