Commentary
12/16/2008
09:07 AM
Seth Grimes

Spock.com Taps Text Analytics

Spock is a people-search engine, currently in beta release. The company uses "a combination of search-engine technologies and user edits to aggregate the world's people information and make it searchable." Think Google meets LinkedIn: Web search with accuracy boosted by allowing individuals to claim, augment, and correct information about themselves. (See the screenshot below, right.)

[Screenshot: Spock interface allowing a registered individual to amend his or her profile]

Spock competes with sites such as ZoomInfo and Spoke.

Andrew Borthwick is Principal Scientist at Spock Networks. I "met" him on the GATE e-mail list. (GATE is open-source text-mining software.) Andrew kindly fielded a number of questions I had about Spock's use of text analytics to build its search database and about Spock's data-quality efforts.

Seth: Start general: How is Spock using text analytics?

Andrew: Spock does extensive extraction of information from pages on the general Web and from various specific sites such as Wikipedia, IMDB, and many social networks. We particularly target the retrieval of key biographical facts from these sites, such as a person's education, employment history, and place of residence. We also try to find paragraphs of text describing the person, people with whom they are associated, images, and other useful nuggets of information we can derive from the page.

Seth: Are there special tricks to identifying that material from multiple sources is or is not about a given individual? In particular, I did a Spock search on my buddy Neil Raden (http://www.spock.com/q/neil-raden) and I see four unconsolidated records for him. Can you explain why the records weren't consolidated?

Andrew: This is a very challenging task when working with data on the scale that we are (over 100 million profiles). For instance, for a relatively common name like "Jason Johnson," we currently have over 30,000 profiles. The general idea in this task is to try to infer that two records refer to the same person by looking at factors besides the name. For instance, I would agree that we should have caught the two instances of "Neil Raden" that refer to the company "Hired Brains." However, for another pair of these four records, we have very little tying them together besides their astrological sign (Sagittarius). "Hired Brains" is a rare term, so we should have caught these two records. (I'll check on this case.) On the other hand, Sagittarius is obviously very common, so there would be too much of a chance of a false positive if we consolidated based on this little information.
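The rarity argument in Andrew's answer can be illustrated with a minimal sketch. This is not Spock's or ChoiceMaker's actual matching logic, which the interview does not describe; it simply assumes an IDF-style weight, so that a shared value counts for more the fewer profiles carry it. The function names, the toy records, the `employer:`/`sign:` value labels, and the threshold are all hypothetical.

```python
import math
from collections import Counter


def rarity_weights(profiles):
    """Weight each attribute value by its rarity across all profiles (IDF-style)."""
    total = len(profiles)
    counts = Counter(value for p in profiles for value in p["attributes"])
    return {value: math.log(total / count) for value, count in counts.items()}


def match_score(a, b, weights):
    """Sum the rarity weights of the attribute values two profiles share."""
    return sum(weights[value] for value in a["attributes"] & b["attributes"])


def should_merge(a, b, weights, threshold):
    """Merge only same-name profiles whose shared evidence is rare enough."""
    return a["name"] == b["name"] and match_score(a, b, weights) >= threshold


# Made-up toy records; names, values, and the threshold are illustrative only.
profiles = [
    {"name": "Pat Example", "attributes": {"employer:Hired Brains", "sign:Sagittarius"}},
    {"name": "Pat Example", "attributes": {"employer:Hired Brains"}},
    {"name": "Pat Example", "attributes": {"sign:Sagittarius"}},
] + [
    {"name": f"Other Person {i}", "attributes": {"sign:Sagittarius"}} for i in range(5)
]

weights = rarity_weights(profiles)
THRESHOLD = 1.0  # assumed cutoff; a real system would tune this on labeled pairs

# True: the shared employer appears in only 2 of 8 profiles, so it is strong evidence.
print(should_merge(profiles[0], profiles[1], weights, THRESHOLD))
# False: the shared sign appears in 7 of 8 profiles, so it is weak evidence.
print(should_merge(profiles[0], profiles[2], weights, THRESHOLD))
```

Raising the threshold trades missed merges for fewer false positives, which is the trade-off Andrew describes for very common values.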

Seth: Spock provides an application programming interface, similar I suppose to the APIs provided by Yahoo! and other search providers. Who's using the Spock API and how, and who do you expect to use it as Spock evolves?

Andrew: Spock's API is being used quite broadly, as shown by the 2 million hits per month that we are receiving right now. Usage of our API is currently growing at 25% per month, so we imagine that there will be an evolution in how it is used, but it's difficult to predict how that will develop.

Seth: I see in your own Spock record that you have a background in record matching. Did Spock acquire ChoiceMaker Technologies, a company you co-founded that focused on data cleansing and deduplication?

Andrew: My Ph.D. was in information extraction, specifically in using a machine-learning technology to identify proper names in text. While still in graduate school I saw the opportunity to start a company based on the idea of applying this same technology to the problem of matching people in databases.

By 2007, after having done record matching with ChoiceMaker for 9 years, grown the company to 14 people, and received three patents, I was ready for a new challenge and leaped at the opportunity to become Principal Scientist at Spock where I could work on both the problem of extracting information about people from Web pages (similar to my thesis) and then bringing those records together into a single consolidated profile (similar to what I did at ChoiceMaker). Spock did not acquire ChoiceMaker, though, and we are working on a somewhat harder problem here. ChoiceMaker and its competitors primarily focus on matching database records. Matching records derived from a web page is much harder.

Seth: The quality of information Spock presents seems pretty good. Do you have anything to relate regarding maintaining data quality?

Andrew: Maintaining data quality is a constant challenge requiring frequent recrawls of sites and continual upgrading of the information extraction and person matching logic. Spock also benefits from the ability of people to edit and fix up their own profiles and from our staff of data quality specialists who monitor the accuracy of our data.

Seth: What can you tell me about Spock directions -- technology or market positioning? And when will the site emerge from beta?

Andrew: We are continuing to work with the GATE toolkit, along with a range of other tools, on the problem, and we are planning to continue to contribute to the GATE project. For instance, I will be checking in a new enhancement to GATE's within-page person matching engine next week, and I made a significant improvement to the speed of GATE's pronominal coreference engine (i.e., matching "he" or "she" with its antecedent) last summer.

We are focused on building the best technology possible. We will only take ourselves out of beta once we feel we have decent person coverage both in terms of the depth of information about each person and the scale of people indexed.

Thanks, Andrew!
