Commentary

Fritz Nelson
 

ReviewCam: OpenCalais -- "Semantic Plumbing"

When we look back, years from now, this decade's Internet Search may seem prehistoric. Perhaps that's why so many companies (including the usual suspects) have begun working toward an enlightened future. Thomson Reuters is one such company, and its OpenCalais project provides what the company likes to call semantic plumbing; you can see what that means in our ReviewCam.

For a deeper look, view the video below. It shows some result sets from a simple testing tool, and a few examples from some of Thomson Reuters' customers.

While it seems like we've been successfully using search for a long time, there is much work to be done in getting to the right results more quickly. Many a user would admit to an unspoken, angry command aimed at a badly returned results page: Don't find what I typed, find what I meant! Behind the well-conceived user interface of Microsoft's Bing is a better result; behind the high-horsepower drive train Google is purported to be putting into its Caffeine project is a better result. Companies like Blinkx and AOL (Truveo) want to help you find the right video. You can search images and tweets.

There's more, and then there's better. A few months ago, I profiled so-called semantic search company Truevert, which relied not on a better ontology, but upon the body of work that end-users actually do when searching for data. Here's the key paragraph from that blog:

A true semantic-based approach trusts a context, rather than a categorization. OrcaTec started Truevert with a more vertical approach, namely "green." So everything gets searched through that filter. It uses Yahoo BOSS to gather a Web search, but it then re-ranks the results based on its own language model derived from understanding the association and context of words from 6,000 green-tagged documents in Delicious (which it can do on a mere laptop in less than 15 minutes). Google's terms of service, Roitblat says, don't allow re-ranking of pages the way Truevert does it.
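To make the re-ranking idea concrete, here is a minimal sketch, not Truevert's actual method. It assumes the "language model" is nothing more than smoothed word frequencies drawn from a handful of topic-tagged documents, and it scores each search result by how well its words fit that model. The function names, the sample documents and the sample results are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_language_model(documents):
    """Count word frequencies across the topic-tagged reference documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts

def context_score(text, model):
    """Average log-probability of the text's words under the model.

    Add-one smoothing keeps unseen words from producing log(0).
    """
    tokens = tokenize(text)
    if not tokens:
        return float("-inf")
    total = sum(model.values())
    vocab = len(model) + 1  # +1 reserves probability mass for unseen words
    log_prob = sum(math.log((model[t] + 1) / (total + vocab)) for t in tokens)
    return log_prob / len(tokens)

def rerank(results, model):
    """Return results ordered from most to least in-context."""
    return sorted(results, key=lambda r: context_score(r, model), reverse=True)

# Hypothetical stand-ins for the green-tagged documents and a raw result list.
green_docs = [
    "solar panels cut household energy use and carbon emissions",
    "recycling and composting reduce landfill waste",
    "wind energy and solar energy are renewable",
]
raw_results = [
    "celebrity gossip and red carpet fashion",
    "renewable solar energy cuts carbon emissions",
    "used car prices this weekend",
]

model = build_language_model(green_docs)
ranked = rerank(raw_results, model)
print(ranked[0])
```

The point of the sketch is that nothing here categorizes a result; the ordering falls out of how closely each result's vocabulary matches the context the reference documents establish.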

One mistake I made when I wrote that piece was calling Thomson Reuters' OpenCalais a competitor. Instead, Calais is a web service. A publisher, a developer or a site can submit its information to the open source Calais service, which works its magic behind the scenes to give the search engines a more contextually relevant result set.

Thomson Reuters acquired this technology a couple of years ago from ClearForest, and its goal is natural language processing. When I talked with the company months ago, it explained that most natural language processing efforts are based on RDF (Resource Description Framework), a data model, often serialized as XML, whose mission is to publish data rather than web pages. There's a query language for RDF called SPARQL, and an ontology to describe the data.
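For readers who haven't seen RDF, the core idea is that every fact is a subject-predicate-object triple, and a SPARQL query is a pattern with variables that gets matched against those triples. The sketch below imitates that with plain Python tuples. It is not a SPARQL engine, and the "ex:" names are hypothetical placeholders rather than a real vocabulary; a real project would more likely reach for an RDF library.

```python
# Each fact is a (subject, predicate, object) triple, the basic unit of RDF.
# The "ex:" names are hypothetical placeholders, not a real vocabulary.
TRIPLES = [
    ("ex:ThomsonReuters", "ex:acquired", "ex:ClearForest"),
    ("ex:ThomsonReuters", "ex:runs", "ex:OpenCalais"),
    ("ex:OpenCalais", "rdf:type", "ex:WebService"),
]

def match(pattern, triples):
    """Yield variable bindings for one triple pattern.

    Terms starting with "?" are variables, as in SPARQL; anything else
    must equal the corresponding term of the triple exactly.
    """
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in the pattern must bind to one value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # Reached only when no term mismatched.
            yield bindings

# Roughly: SELECT ?how ?what WHERE { ex:ThomsonReuters ?how ?what }
answers = list(match(("ex:ThomsonReuters", "?how", "?what"), TRIPLES))
for row in answers:
    print(row["?how"], row["?what"])
```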

Using the Calais web services API, you can take unstructured text, run it through the natural language engine, and it spits out information in RDF. In theory, then, you get better information and you get it extremely fast. Lots of other technology is needed to take this further, like the linked data standard that Tim Berners-Lee is working on -- essentially the building block to create linkages between all of this richer metadata. An example of this kind of work is DBpedia, which is an effort to create links between information on the Web and Wikipedia content.
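The shape of that text-in, RDF-out step can be illustrated with a toy. The sketch below does not call the Calais API and makes no claim about how its engine works; it assumes a simple lookup table of known names (a gazetteer) in place of real natural language processing, and every example.org URI in it is hypothetical. What it does show is unstructured text going in and machine-readable statements, written in N-Triples style, coming out.

```python
import re

# Hypothetical gazetteer: surface form -> (entity URI, entity type URI).
GAZETTEER = {
    "Thomson Reuters": ("http://example.org/entity/ThomsonReuters",
                        "http://example.org/type/Company"),
    "ClearForest": ("http://example.org/entity/ClearForest",
                    "http://example.org/type/Company"),
    "Tim Berners-Lee": ("http://example.org/entity/TimBernersLee",
                        "http://example.org/type/Person"),
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
MENTIONS = "http://example.org/vocab/mentions"  # hypothetical predicate

def text_to_triples(doc_uri, text):
    """Emit one 'mentions' and one 'type' triple per entity found in text."""
    triples = []
    for name, (entity, entity_type) in GAZETTEER.items():
        # Whole-phrase match so a name does not fire inside a longer word.
        if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", text):
            triples.append((doc_uri, MENTIONS, entity))
            triples.append((entity, RDF_TYPE, entity_type))
    return triples

def to_ntriples(triples):
    """Serialize URI-only triples in N-Triples style, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

sample = "Thomson Reuters acquired this technology from ClearForest."
triples = text_to_triples("http://example.org/doc/1", sample)
print(to_ntriples(triples))
```

Once the output is in this form, each entity URI becomes something other data sets can link to, which is where the linked data work comes in.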

With OpenCalais and other similar technologies, you can envision publishers (or anyone) being able to create very specific information streams on unique web pages, or in widgets on existing pages (or tag clouds), based on a very well-crafted set of criteria; that is, not just finding information when you know you need it, but finding relationships between information in a predictable way to yield unpredictable results -- better, fresher results, surprising results. And doing it automatically, and, as Thomson Reuters likes to point out, in ways that humans think.

Clearly, then, it's in Thomson Reuters' interest to give this natural language engine away. The more people who use it, the richer it can make its own information and the more frequently its massive information database can get in the hands of more people (or on more sites). What it's losing is the control of how that information is presented, and how it's linked to other data -- on its own sites it can manage the data and the flow and the presentation, but it's also a very manual and insular process.

The company noted that the amount and type of content is exploding (user generated content, Twitter, etc.) and "we can act like AOL and pretend it's not happening, or acknowledge that it is and embrace it through interoperability." And being a trusted source of information against which to bounce all of that "wild content" is where Thomson Reuters wins. "Hedge traders want to pay attention to Twitter and blogs, but they need to bounce that against content they can trust."

Fritz Nelson is an Executive Editor at InformationWeek and the Executive Producer of TechWebTV. Fritz writes about startups and established companies alike, and likes to work multiple forms of media into his writing.
