"Some of Fan's algorithms were seemingly dumb, but they worked on a limited domain of the question," Baker explains. "You can populate the machine with lots of algorithms, each with its own specialty; if each one can bring back a small percentage of correct answers, then you can create a whole ecosystem in which a bunch of algorithms deliver a bunch of answers, and an analytical system can then determine which should be trusted."
In the resulting bake-off, held in March 2007, the Piquant-based system answered only 30 percent of the clues correctly. But Fan's system did nearly as well.
"The bake-off proved that Piquant was not up to snuff, and Ferrucci concluded they were going to have to build an entirely new and much more ambitious platform if they were going to succeed," says Baker.
Adopting elements of Fan's approach, the team achieved its first breakthrough by combining many algorithms and then correlating and scoring the confidence in myriad answers. This "ensemble" idea is not entirely new; it has cropped up elsewhere in recent years. For example, ensemble analysis was used by many of the leading contestants in the 2009 Netflix Prize competition.
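To make the ensemble idea concrete, here is a minimal Python sketch of how answers from several narrow algorithms might be merged and scored. It is an illustration only, not IBM's code: the algorithm names, trust weights, scores, candidate answers and the 0.5 threshold are all invented for the example, and a simple weighted sum stands in for whatever scoring Watson actually used.

```python
from collections import defaultdict


def rank_candidates(proposals, trust):
    """Merge candidate answers from several algorithms into one ranked list.

    proposals: dict mapping algorithm name -> list of (answer, score) pairs,
               each score in [0, 1].
    trust:     dict mapping algorithm name -> weight for how much that
               algorithm is trusted (hypothetical values, not from IBM).
    Returns a list of (answer, confidence) pairs, highest confidence first,
    with confidences normalized to sum to 1.
    """
    totals = defaultdict(float)
    for name, candidates in proposals.items():
        for answer, score in candidates:
            # An algorithm missing from `trust` contributes nothing.
            totals[answer] += trust.get(name, 0.0) * score

    total_weight = sum(totals.values())
    if total_weight == 0:
        return []

    ranked = [(answer, value / total_weight) for answer, value in totals.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def decide(ranked, threshold=0.5):
    """Return the top answer only if its confidence clears the threshold."""
    if ranked and ranked[0][1] >= threshold:
        return ranked[0][0]
    return None  # not confident enough to buzz in


# Invented example data: three narrow "specialists" propose answers.
proposals = {
    "keyword_match": [("Toronto", 0.4), ("Chicago", 0.6)],
    "date_lookup": [("Chicago", 0.9)],
    "pun_detector": [("Toronto", 0.2)],
}
trust = {"keyword_match": 0.5, "date_lookup": 0.8, "pun_detector": 0.1}

ranked = rank_candidates(proposals, trust)
best = decide(ranked)
```

The point of the design is that no single algorithm has to be right very often. An answer backed by several specialists, or by one that is highly trusted, rises to the top, and the threshold lets the system stay silent when nothing stands out.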
There's much more to the tale of the technology's development (as explored in this story and covered in great detail in Baker's book). In short, IBM Research spent much of 2008 and 2009 adding millions of lines of new code to Watson's analysis and scoring software. Another breakthrough was the addition of a feedback loop that enabled Watson to learn from correct and incorrect answers supplied by both humans and the computer itself.
The Show Agreement Challenge
The IBM Research team gained confidence as Watson's Jeopardy performance steadily improved, but it took two demonstrations and a bit of hard negotiation with Jeopardy's producers to hammer out the details of the competition. After an initial agreement was reached in early 2010, Jeopardy's producers introduced new requirements.
"Jeopardy wanted the computer to have a physical finger so it would have to press a button just like the humans," Baker explains. "IBM had done all its testing with Watson buzzing in purely electronically, so they were upset and felt that Jeopardy was trying to graft human limitations onto a computer."
From IBM's perspective, its developers were building a brain, but Jeopardy's producers were trying to turn it into a robot. In the interest of perceived fairness, according to Baker, IBM ultimately acquiesced and gave Watson an electromechanical actuator -- an equivalent to the buzzers that human contestants use.
In another disagreement, Ferrucci's team wanted assurances that Jeopardy's writers could not bias the clues in favor of the human contestants.
"Conceivably, writers could fill the game with all kinds of puns and riddles designed to foil a machine," Baker says. "IBM feared the writers would make it a Turing test so that instead of a game of Jeopardy it would become a test to see if the machine could pass for a human."
Jeopardy assured IBM that the clues for an entire season's worth of episodes had already been written, but the show's producers granted the extra precaution of setting aside 30 sets of clues and using a third-party company to randomly choose those to be used in the human-vs.-machine episodes.