Semantics is hot, but only in a geeky sort of way. Contrast that with search, which long ago shed its geeky image to become the Web's #1 utility. Search and semantics have similar goals and rely on similar technologies. Both apply data-structuring techniques to make information more findable and usable. Join the two and you get semantic search: in essence, search made smarter, search that seeks to boost accuracy by taming ambiguity through an understanding of context.
Semantic search is still in a definitional phase, "on its way!" as claimant Hakia puts it. Yet Hakia's own site, still in beta, only confuses with its challenge to "Compare with Google." I compared, using a term Hakia suggested, carrots. Results look pretty similar, no? So what, exactly, are the ingredients of semantic search?
Semantics (in an IT setting) is meaningful computing: the application of natural language processing (NLP) to support information retrieval, analytics, and data integration that encompass both numerical and "unstructured" information. The ever-emerging Semantic Web is, for many, the poster child. Semantic computing is advancing rapidly, even while some of the folks who push semantic technologies seem unable to explain clearly and convincingly what business value they deliver. Another case in point: Microsoft Bing, which is alleged to deliver semantic search because, in response to certain queries, it offers you "related searches" and Wikipedia reference look-ups. Those are semantic elements, reliant on or related to the meaning of the search terms and results, given that meaning is what semantics is all about. But there must be more to semantic search than that?! There is.
Two + Nine Views of Semantic Search
- Related searches/queries. The engine proposes searches that are in some way similar to the entered search. For example, in response to a search on "carrots," Hakia offers, "You can also search for: Zucchini, Eggplant." Yup, they're all vegetables.
I'd place query correction in this same category, the "Did you mean:" response that leading search engines provide if they detect a likely search-term misspelling.
- Reference results. Here the search engine responds with materials that define the search terms, either simply, via a dictionary look-up, or elaborately, by pulling Wikipedia pages (as Bing does) or the like. Note that this isn't search in the traditional sense. It's question answering, built on the presumption that the user is probably looking for practical information (like maps, movie reviews and show times, or stock quotes) rather than document hit lists.
- Semantically annotated results. Here you're returned pages or documents with highlighting of text features, especially named or pattern-defined entities, that are semantically related to the search terms. This capability is only just emerging for non-textual media... on the Web. Digital cameras can do it: When in portrait mode, they will detect and outline faces.
- Full-text similarity search. A block of text ranging from a phrase to a full document, rather than a few keywords, is submitted. While the matching techniques rely on statistical or vector-space similarity measures rather than conventional meaning, the results do fit the semantic label.
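To make the vector-space idea concrete, here is a minimal sketch of my own, not any engine's actual implementation. It assumes raw term counts as the vector weights and cosine similarity as the measure; production systems typically add weighting schemes and stemming. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_text, documents):
    """Return (score, document) pairs, most similar document first."""
    query_vec = term_vector(query_text)
    scored = [(cosine_similarity(query_vec, term_vector(doc)), doc)
              for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Submit a whole paragraph as `query_text` and the documents that share the most vocabulary with it float to the top. No meaning is consulted, only word overlap.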
- Search on semantic/syntactic annotations. The user would tag a search term to indicate the syntactic role the term plays -- for instance, the part-of-speech (noun, verb, etc.) -- or its semantic meaning -- whether it's a company name, location, or event. An IBM page shows how this works.
A key-word search on "center" would likely produce way too many documents because "center" is a common and ambiguous term. Our semantic search engine supports a query language called XML Fragments. This query language is designed to exploit UIMA’s CAS annotations entered in the search engine’s index. The XML Fragment query, for example,
<organization> center </organization>
will produce first only documents that contain "center" where it appears as part of a phrase annotated as an organization by a named-entity recognizer.
This capability extends the search on document-level metadata and tags you can do with mainstream systems such as Google, where you can currently enter filetype:pdf (for example) or would enter terms in a fielded search interface such as the one offered by Google patent advanced search.
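The annotation-scoped query in the IBM example can be approximated with a toy sketch. This is not the XML Fragments language or the UIMA API. The `Annotation` and `Document` classes and both functions are hypothetical, and I assume the annotations were already produced by some named-entity recognizer. The sketch handles only a single element wrapping a single word.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    label: str  # e.g. "organization", as a named-entity recognizer might assign
    text: str   # the annotated phrase as it appears in the document


@dataclass
class Document:
    body: str
    annotations: list  # list of Annotation objects


def parse_fragment(query):
    """Split a query like '<organization> center </organization>' into (label, term)."""
    match = re.fullmatch(r"\s*<(\w+)>\s*(\w+)\s*</\1>\s*", query)
    if match is None:
        raise ValueError(f"not a single-element, single-word fragment: {query!r}")
    return match.group(1).lower(), match.group(2).lower()


def annotated_search(query, documents):
    """Return documents where the term occurs inside a phrase carrying the label."""
    label, term = parse_fragment(query)
    hits = []
    for doc in documents:
        for ann in doc.annotations:
            if ann.label.lower() == label and term in ann.text.lower().split():
                hits.append(doc)
                break  # one matching annotation is enough for this document
    return hits
```

A document that mentions "Lincoln Center" with that phrase annotated as an organization matches. A document that tells you to stand in the center of the room does not, because nothing there is annotated.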
- Concept search. I enter "Ford films" and among the hits I get are documents that contain the word "movies" even if not the word "films." Conceptual relationships could be specified by a taxonomy or they could be less deterministically inferred by statistical co-occurrence or other, similar techniques.
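The taxonomy route to concept search can be sketched as simple query expansion. This is my own illustration under loud assumptions: the `CONCEPTS` table is a hypothetical, hand-built stand-in for a real taxonomy, and matching is on whole words only. A statistical co-occurrence approach would build those groupings from data instead.

```python
import re

# Hypothetical concept table: every term in a set names the same concept.
CONCEPTS = [
    {"film", "films", "movie", "movies"},
    {"car", "cars", "automobile", "automobiles"},
]


def expand_term(term):
    """Return the term plus any terms that share its concept."""
    term = term.lower()
    for concept in CONCEPTS:
        if term in concept:
            return set(concept)
    return {term}  # unknown terms only match themselves


def concept_search(query, documents):
    """Return documents containing, for every query term, that term or a concept-mate."""
    term_groups = [expand_term(t) for t in query.split()]
    hits = []
    for doc in documents:
        words = set(re.findall(r"[a-z0-9]+", doc.lower()))
        if all(group & words for group in term_groups):
            hits.append(doc)
    return hits
```

With this table, a search on "Ford films" finds a document about Ford and "movies" even though the word "films" never appears in it.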
- Ontology-based search. Here the engine not only understands hierarchical relationships of entities and concepts as in a taxonomy, but also more complex inter-entity relationships. "What does a dog chase?" Ontology-based search would bring up results about cars, cats, and tails, as they relate to dogs of course.
- Semantic Web search. The Semantic Web seeks to capture data relationships and make the resulting "Web of data" queryable. This lofty and worthy goal is years from practical usability, but you can get a feel for it via Sindice Data Web services.
- Faceted search. Faceted search provides a means of exploring results according to a set of predefined, high-level categories called facets. While I will send you to Daniel Tunkelang's The Noisy Channel blog to learn more, I will observe that faceted search is often verticalized, that is, limited or targeted to a particular information domain. Epicurious, a site "for people who love to eat," provides a great example. Try a search on "brownies" and observe the facets listed under "refine this search by..." Semedico, a search tool for biomedical literature, uses facets in its query-completion suggestions and for results delivery, where it offers three tabs of facets on the left side of the results screen. In these examples, inferred meaning is the basis for assigning search results to facets.
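The mechanics of the "refine this search by..." panel are easy to sketch once facet values exist. The sketch below is mine and assumes each result already carries its facet values as dictionary keys. The hard, semantic part (inferring which facet value a result belongs to) is deliberately left out, and the facet names in the usage note are hypothetical, not those of any site mentioned above.

```python
from collections import Counter, defaultdict


def facet_counts(results, facets):
    """Count how many results carry each value of each requested facet."""
    counts = defaultdict(Counter)
    for item in results:
        for facet in facets:
            value = item.get(facet)
            if value is not None:
                counts[facet][value] += 1
    return {facet: dict(counter) for facet, counter in counts.items()}


def refine(results, **selected):
    """Keep only results whose facet values match every selection."""
    return [item for item in results
            if all(item.get(facet) == value for facet, value in selected.items())]
```

Given recipe results tagged with, say, a `course` and a `main_ingredient`, `facet_counts(results, ["course", "main_ingredient"])` yields the numbers shown beside each refinement link, and `refine(results, main_ingredient="chocolate")` narrows the hit list when the user clicks one.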
- Clustered search. Clustered search is like faceted search, but without the predefined categories. Check out Carrot2, which organizes search results into topics, as does Clusty from Vivisimo. Here, meaning is inferred from topics extracted from the content of search results.
- Natural language search. I first tried out a natural-language query tool around 20 years ago. The goal was to translate a question such as "How much did our inventory of widgets cost us?" into an SQL query against a conventional relational database. The technology, available from companies such as EasyAsk, creates a semantic representation of the user's query, but it has yet to catch on. Given that we're now habituated to two-to-three-word searches, I wonder if it ever will.
A quick announcement of a conference I'm organizing: The 2010 Sentiment Analysis Symposium will take place April 13 in New York, looking at solutions that discover business value in opinions and attitudes in social media, news, and enterprise feedback. Follow me on Twitter, or follow the symposium, for program updates.