For example, each question had to be decomposed into subject, verb, and object, with relationships determined among them. This front-end digestion was accomplished through IBM's Unstructured Information Management Architecture (UIMA), said Anjul Bhambhri, IBM's VP of product development for big data, in an interview.
This phase yielded text and keyword leads that were fed into Hadoop. As a distributed system, Hadoop could map the leads to processors or, more likely, to dozens of virtual machines, each located close to the relevant data in the server cluster, and get results back quickly. IBM has built a system on top of Hadoop, InfoSphere BigInsights, that reduces the many results to a set of weighted answers.
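The map-then-weigh pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hadoop's API or IBM's code: the shards, the documents in them, and the function names `map_shard`, `reduce_hits`, and `search` are all hypothetical, and a thread pool stands in for the cluster of virtual machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each shard stands in for documents stored near one worker.
SHARDS = [
    ["hedgehog quills are made of keratin", "porcupine quills are modified hair"],
    ["keratin is the protein in hair and nails", "fur covers most mammals"],
]

def map_shard(shard, leads):
    """Map step: count how many keyword leads each local document matches."""
    hits = Counter()
    for doc in shard:
        words = set(doc.split())
        matched = sum(1 for lead in leads if lead in words)
        if matched:
            hits[doc] += matched
    return hits

def reduce_hits(partials):
    """Reduce step: merge per-shard counts and normalize them into weights."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    grand = sum(total.values())
    if not grand:
        return {}
    return {doc: count / grand for doc, count in total.items()}

def search(leads, shards=SHARDS):
    """Send the same leads to every shard in parallel, then combine results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: map_shard(shard, leads), shards))
    return reduce_hits(partials)

if __name__ == "__main__":
    for doc, weight in search(["hedgehog", "quills", "keratin"]).items():
        print(f"{weight:.2f}  {doc}")
```

The point of the design is that each worker only touches the data it sits next to; only the small per-shard tallies travel back to be merged and weighted.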
The most relevant results were then processed to become a logical set of possible answers, using increasingly fine-grained algorithms. Three candidate answers emerged, and Watson rated one of them as its top choice. It's a little eerie that when it came up with the wrong answer, Toronto as a U.S. city, it also took the unusual step of placing five question marks after it, as if severely second-guessing its own logic. It's also telling that, when asked what material the quills of a hedgehog are made of, it came up with keratin, porcupine_2, and fur as possible answers and chose keratin, the correct one.
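The final narrowing step, three candidates with one rated on top and a visible hedge when confidence is low, might look something like the sketch below. Everything numeric here is an assumption made for illustration: the raw scores, the 0.5 doubt threshold, and the names `rank_candidates` and `display` are hypothetical, and the article does not say how Watson actually computed its confidence.

```python
def rank_candidates(scores, keep=3):
    """Return the top `keep` answers as (answer, share of total score) pairs."""
    total = sum(scores.values())
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(answer, score / total) for answer, score in ranked[:keep]]

def display(ranked, doubt_threshold=0.5):
    """Show the top answer, flagged with question marks when confidence is low."""
    answer, confidence = ranked[0]
    if confidence < doubt_threshold:
        return answer + "?????"
    return answer

if __name__ == "__main__":
    # Hypothetical scores; only the candidate names come from the article.
    hedgehog = rank_candidates({"keratin": 8.0, "porcupine_2": 1.0, "fur": 1.0})
    print(display(hedgehog))

    # Made-up scores for a case where no candidate clearly dominates.
    city = rank_candidates({"Toronto": 3.0, "other_city_1": 2.5, "other_city_2": 2.0})
    print(display(city))
```

With these invented numbers, the clear winner is printed plainly, while the narrow leader is printed with the five question marks attached.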
The whole process was called DeepQA, and it ended with a single result being selected, which in most cases became Watson's human-beating answer.
Indeed, this DeepQA process was based on software whose major parts are now publicly available. Hadoop is an Apache open source project. UIMA, a framework for analysis of unstructured content, is also open source: IBM donated the code, and it is available under an Apache license. Getting them to work together to yield a correct answer on Jeopardy, I suspect, is still a stretch for the uninitiated.
IBM has taken some of the things it's learned from data warehousing and analytics based on DB2 and applied them to BigInsights, Bhambhri said.
"I was not surprised at the outcome," said Bhambhri. "Our research had indicated that [the Watson team] had built something that could withstand the test of a Jeopardy contest."
In this sense, Watson didn't think like a human. He did something that a human couldn't do -- process terabytes of information across thousands of virtual machines in milliseconds, sorting results down to one answer. Humans, after hearing a question, often need a second or two to let old associations reactivate old memories and information, in this fashion often working their way to marvelously remote, correct answers. The process probably would look untidy next to a diagram of how Watson narrowed down his candidate results.
What I think Watson represents better than anything else is not a machine surpassing human intelligence so much as how humans will use computers to attack all sorts of problems, backed by the power to process masses of recently acquired, unstructured information. Big data, parallel processing, and the human mind are together engaged in a new era of data exploration.
It's not the human quiz show contestant who's in jeopardy. Rather, the target most likely to yield to this new power is the timeless problem that has resisted solution by hiding in an amount of data formerly too large to grasp.
Charles Babcock is an editor-at-large for InformationWeek.