4/14/2011
05:07 PM

Lexalytics Analyzes Wikipedia To Understand How Humans Think

Concepts extracted from the community-created encyclopedia can be used to improve analysis of documents and sentiment in social media.

Academics may frown on citations from Wikipedia because of its social media origins, but for the text-mining and sentiment-analysis firm Lexalytics, the sprawling community-created encyclopedia was the perfect reference for teaching software how to understand the world.

At its user conference this week in New York, Lexalytics announced that the Salience 5.0 release of its software, due out this summer, will be better able to understand concepts and relationships between concepts, thanks to a close reading of the entire content of Wikipedia. Because the Web encyclopedia's content is openly licensed, Lexalytics was able to index it freely. A footnote to the press release cautions that no endorsement by the Wikimedia Foundation is implied.

"Wikipedia represents a very, very large corpus of information, and, importantly, it's human edited--which means it shows the way humans think about information," CEO Jeff Catlin said. "We used it as a source for how people think about the organization of information and for perspective on how bits of information are related to each other."

Lexalytics is best known for technology that produces automated summaries of documents, as well as sentiment analysis capabilities that can be used for social media monitoring. Catlin said his firm's technology is used "behind the scenes" by companies like Radian6 (recently acquired by Salesforce.com) and also licensed directly by some websites, such as TripAdvisor. But the core Lexalytics technology is general purpose--like a search engine that can be adapted to search specialized types of content.

The "concept matrix" Lexalytics created on the basis of its Wikipedia analysis may factor into improved sentiment analysis, but it's broader than that, Catlin said. In some ways, the effort was more similar to the work that went into creating IBM's computerized Jeopardy champ, Watson, which also had to be fed large volumes of news articles and reference sources. One thing the Watson team had in its favor was that answering trivia questions is a very specific task, focused on the kind of "myopic detail" that computers are good at handling. "So if they can figure out the question, there is a good chance they are going to have the right answer," he said.

Just as the process of building Watson's knowledge base started long before Alex Trebek stepped on stage, the compilation of the Lexalytics concept matrix was a distributed computing analytics job run across many servers--many of them procured through Amazon's cloud services. "We basically did boil the ocean, so this required a lot of hardware behind the scenes and a lot of Amazon computing time," he said. But by the end of the process, his team had boiled it down to a summary of concepts that fits on a laptop or a modest-sized server.

The result is a piece of computer software that "understands that a rose and a daisy are both flowers, which up until now has been a really tough model," Catlin said. "If someone writes that a device runs for three days without a recharge, the system can figure out that 'runs for three days without a recharge' is a battery event," even though the word "battery" was never mentioned. Using this technology, a marketing application could read through hundreds of news articles about a company to see how many of the key messages from its latest press release made their way into that coverage--even though each of the news writers used different words and phrases to tell the story.
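To make the idea concrete, here is a toy sketch in Python of mapping a phrase to a concept it never names. It is not Lexalytics' method: the tiny hand-written `CONCEPT_TERMS` table and the function names are assumptions made up for illustration, standing in for associations that a real system would mine from a corpus the size of Wikipedia.

```python
# Toy illustration only. CONCEPT_TERMS is a hypothetical, hand-written stand-in
# for term-to-concept associations learned from a large corpus.
CONCEPT_TERMS = {
    "flower": {"rose", "daisy", "tulip", "petal"},
    "battery": {"recharge", "charge", "charger", "power", "runs"},
}


def concept_scores(text):
    """Score each concept by the fraction of its terms that appear in the text."""
    words = {w.strip(".,!?'\"").lower() for w in text.split()}
    return {
        concept: len(words & terms) / len(terms)
        for concept, terms in CONCEPT_TERMS.items()
    }


def best_concept(text):
    """Return the highest-scoring concept, or None if no term matched."""
    scores = concept_scores(text)
    concept = max(scores, key=scores.get)
    return concept if scores[concept] > 0 else None


print(best_concept("The device runs for three days without a recharge"))
print(best_concept("She picked a rose and a daisy"))
```

Simple word overlap like this breaks down quickly (the verb "rose" would count as a flower, for example), which hints at why a much larger, human-edited source of associations is useful.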

Sentiment analysis is a relatively mature branch of text analytics, but automated systems still get confused by things like sarcasm and double meanings. One improvement Lexalytics is making in this upcoming release of its software is a filter for subjective versus objective understanding, or direct versus second-hand knowledge. For example, "I heard that movie was great"--a comment from someone who hasn't actually seen the movie--could be scored differently from "That movie was great!" even though both are positive sentiments.
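A minimal Python sketch of that distinction follows. It is an assumption-laden illustration rather than the Salience implementation: the cue phrases, the word lists, and the `second_hand_weight` parameter are all hypothetical, chosen only to show how a second-hand remark could be scored lower than a first-hand one.

```python
import re

# Hypothetical cue phrases suggesting the writer is relaying someone else's opinion.
SECOND_HAND_CUES = re.compile(
    r"\b(i heard|i hear|they say|people say|apparently|supposedly|i was told)\b",
    re.IGNORECASE,
)

# Tiny illustrative sentiment lexicons (assumptions, not a real lexicon).
POSITIVE_WORDS = {"great", "good", "excellent", "wonderful"}
NEGATIVE_WORDS = {"bad", "terrible", "awful", "boring"}


def score_sentiment(text, second_hand_weight=0.5):
    """Return a sentiment score, discounted when the opinion looks second-hand."""
    words = re.findall(r"[a-z']+", text.lower())
    raw = sum(w in POSITIVE_WORDS for w in words) - sum(
        w in NEGATIVE_WORDS for w in words
    )
    second_hand = bool(SECOND_HAND_CUES.search(text))
    weight = second_hand_weight if second_hand else 1.0
    return {"score": raw * weight, "second_hand": second_hand}


print(score_sentiment("I heard that movie was great"))
print(score_sentiment("That movie was great!"))
```

Both sentences come out positive, but the first is flagged as second-hand and receives half the weight, mirroring the article's movie example.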

The Wikipedia concept matrix "is just one more piece we're using to try to crank up the accuracy of these things, and it's wonderful because it's so good for general knowledge and gives us a broad and relatively deep look at the world," Catlin said.
