Spock is a people-search engine, currently in beta release. The company uses "a combination of search-engine technologies and user edits to aggregate the world's people information and make it searchable." Andrew Borthwick, Principal Scientist at Spock Networks, kindly fielded a number of questions about Spock's use of text analytics and Spock's data-quality efforts.
Think Google meets LinkedIn: Web search with accuracy boosted by allowing individuals to claim, augment, and correct information about themselves. (See the screenshot below, right.)
Spock interface allowing a registered individual to amend his or her profile
Andrew Borthwick is Principal Scientist at Spock Networks. I "met" him on the GATE e-mail list. (GATE is open-source text-mining software.) Andrew kindly fielded a number of questions I had about Spock's use of text analytics to build its search database and about Spock's data-quality efforts.
Seth: Start general: How is Spock using text analytics?
Andrew: Spock does extensive extraction of information from pages on the general web and various specific sites such as Wikipedia, IMDB, and many social networks. We particularly target the retrieval of key biographical facts from these sites such as a person's education, employment history, and place of residence. We also try to find paragraphs of text describing the person, people with whom they are associated, images, and other useful nuggets of information we can derive from the page.
Seth: Are there special tricks to identifying that material from multiple sources is or is not about a given individual? In particular, I did a Spock search on my buddy Neil Raden (http://www.spock.com/q/neil-raden) and I see four unconsolidated records for him. Can you explain why the records weren't consolidated?
Andrew: This is a very challenging task when working with data on the scale that we are (over 100 million profiles). For instance, for a relatively common name like "Jason Johnson," we currently have over 30,000 profiles. The general idea in this task is to try to infer that two records refer to the same person by looking at factors besides the name. For instance, I would agree that we should have caught the two instances of "Neil Raden" that refer to the company "Hired Brains." However, for another pair of these four records, we have very little tying them together besides their astrological sign (Sagittarius). "Hired Brains" is a rare term, so we should have caught these two records. (I'll check on this case.) On the other hand, Sagittarius is obviously very common, so there would be too much of a chance of a false positive if we consolidated based on this little information.
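Andrew's point about rare versus common attributes can be made concrete with a small sketch. This is not Spock's matching logic, which the interview does not describe in detail. It is a generic rarity-weighting scheme (similar in spirit to inverse document frequency), and the profile layout, the function names, the threshold, and the toy data are all assumptions made up for illustration.

```python
import math
from collections import Counter

def rarity_weights(profiles):
    """Weight each (field, value) pair by how rare it is: log(N / count)."""
    n = len(profiles)
    counts = Counter()
    for profile in profiles:
        counts.update(set(profile["attrs"].items()))
    return {pair: math.log(n / seen) for pair, seen in counts.items()}

def match_score(a, b, weights):
    """Sum the rarity weights of the attributes two profiles share."""
    shared = set(a["attrs"].items()) & set(b["attrs"].items())
    return sum(weights[pair] for pair in shared)

def should_merge(a, b, weights, threshold=1.0):
    """Merge only same-name profiles whose shared evidence is rare enough.

    The threshold is an arbitrary value chosen for this toy data set.
    """
    return a["name"] == b["name"] and match_score(a, b, weights) >= threshold

# Invented toy data: Sagittarius is common, "Hired Brains" is rare.
background = [
    {"name": f"Person {i}",
     "attrs": {"sign": "Sagittarius" if i % 2 == 0 else "Leo"}}
    for i in range(8)
]
raden = [
    {"name": "Neil Raden", "attrs": {"company": "Hired Brains"}},
    {"name": "Neil Raden", "attrs": {"company": "Hired Brains"}},
    {"name": "Neil Raden", "attrs": {"sign": "Sagittarius"}},
    {"name": "Neil Raden", "attrs": {"sign": "Sagittarius"}},
]
profiles = background + raden
weights = rarity_weights(profiles)

print(should_merge(raden[0], raden[1], weights))  # shared rare company
print(should_merge(raden[2], raden[3], weights))  # shared common sign only
```

In this toy data the two "Hired Brains" records clear the threshold and the two Sagittarius-only records do not, which mirrors the trade-off Andrew describes: a shared rare term is strong evidence, while a shared common one risks a false positive.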
Seth: Spock provides an application programming interface, similar I suppose to the APIs provided by Yahoo! and other search providers. Who's using the Spock API and how, and who do you expect to use it as Spock evolves?
Andrew: Spock's API is being used quite broadly, as shown by the 2 million hits per month that we are receiving right now. Usage of our API is currently growing at 25% per month, so we imagine that there will be an evolution in how it is used, but it's difficult to predict how that will develop.
Seth: I see in your own Spock record that you have a background in record matching. Did Spock acquire ChoiceMaker Technologies, a company you co-founded that focused on data cleansing and deduplication?
Andrew: My Ph.D. was in information extraction, specifically in using a machine-learning technology to identify proper names in text. While still in graduate school I saw the opportunity to start a company based on the idea of applying this same technology to the problem of matching people in databases.
By 2007, after having done record matching with ChoiceMaker for 9 years, grown the company to 14 people, and received three patents, I was ready for a new challenge and leaped at the opportunity to become Principal Scientist at Spock where I could work on both the problem of extracting information about people from Web pages (similar to my thesis) and then bringing those records together into a single consolidated profile (similar to what I did at ChoiceMaker). Spock did not acquire ChoiceMaker, though, and we are working on a somewhat harder problem here. ChoiceMaker and its competitors primarily focus on matching database records. Matching records derived from a web page is much harder.
Seth: The quality of information Spock presents seems pretty good. Do you have anything to relate regarding maintaining data quality?
Andrew: Maintaining data quality is a constant challenge requiring frequent recrawls of sites and continual upgrading of the information extraction and person matching logic. Spock also benefits from the ability of people to edit and fix up their own profiles and from our staff of data quality specialists who monitor the accuracy of our data.
Seth: What can you tell me about Spock directions -- technology or market positioning? And when will the site emerge from beta?
Andrew: We are continuing to work with the GATE toolkit along with a range of other tools on the problem and we are planning to continue to contribute to the GATE project. For instance, I will be checking in a new enhancement to GATE's within-page person matching engine next week and I made a significant improvement to the speed of GATE's pronominal coreference engine (i.e. matching "he" or "she" with its antecedent) last summer.
We are focused on building the best technology possible. We will only take ourselves out of beta once we feel we have decent person coverage both in terms of the depth of information about each person and the scale of people indexed.
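For readers unfamiliar with the pronominal coreference task Andrew mentions (matching "he" or "she" with its antecedent), here is a deliberately naive sketch. It is not GATE's algorithm or API. It simply links each pronoun to the most recent preceding person name of matching gender, and the function name, the gender lookup table, and the example sentence are hypothetical.

```python
# Hypothetical pronoun table; a real system would cover far more forms.
PRONOUN_GENDER = {
    "he": "m", "him": "m", "his": "m",
    "she": "f", "her": "f", "hers": "f",
}

def resolve_pronouns(tokens, person_gender):
    """Map the index of each pronoun to the nearest earlier matching name.

    tokens: list of whitespace-separated words.
    person_gender: dict of known person names to "m" or "f" (an assumption;
    real systems infer this from context or name lists).
    """
    resolved = {}
    seen = []  # person names in order of appearance
    for i, token in enumerate(tokens):
        word = token.strip(".,;:!?")
        if word in person_gender:
            seen.append(word)
        elif word.lower() in PRONOUN_GENDER:
            gender = PRONOUN_GENDER[word.lower()]
            # Walk backwards to find the closest compatible antecedent.
            for name in reversed(seen):
                if person_gender[name] == gender:
                    resolved[i] = name
                    break
    return resolved

tokens = "Alice met Bob. She thanked him.".split()
links = resolve_pronouns(tokens, {"Alice": "f", "Bob": "m"})
print(links)  # {3: 'Alice', 5: 'Bob'}
```

A production engine has to cope with ambiguity, nested clauses, and names of unknown gender, which is part of why speed and accuracy improvements of the kind Andrew describes take real engineering effort.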
Thanks, Andrew!