DataSift Turns Back Clock On Twitter
Social data processing specialist launches searchable archive of tweets going back to January 2010.
The perishable nature of the status post may have started to change on Facebook, with its new Timeline profiles, but history is still hard to come by on Twitter, where search results typically don't reach back more than days or weeks. Individual users may value Twitter precisely for how well it lives in the moment, but researchers seeking to analyze changing attitudes and trends often need to look backward in order to tell what has changed.
They will be able to get that history starting Tuesday through DataSift's social data processing service. DataSift is a specialist in the heavy lifting of sorting through billions of tweets and social posts, then indexing them for rapid search and retrieval. DataSift enriches that index through partnerships with Lexalytics for sentiment analysis and Klout for influence scores. It acts as a data processing back end to other social media analytics services, as well as some enterprise applications such as those of large news organizations.
Anyone can test a basic version of the service at the DataSift website, composing queries like "interaction.content contains 'obama' AND klout.score > 50" in the firm's curated stream definition language. More data-intensive queries and deeper integration with the service require a commercial license, and historical queries by definition fall into that commercial category.
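To make the sample query concrete, here is a minimal Python sketch of the condition it expresses, applied to a few made-up posts. This is not DataSift's API or its query engine: the nested record layout, the sample data, the `matches` helper, and the choice to ignore letter case are all assumptions made for illustration.

```python
# Hypothetical sample posts. The nested layout mirrors the field names in
# the query (interaction.content, klout.score) but is an assumption, not
# DataSift's actual output format.
posts = [
    {"interaction": {"content": "Obama speaks on the economy"}, "klout": {"score": 62}},
    {"interaction": {"content": "Lunch was great today"}, "klout": {"score": 71}},
    {"interaction": {"content": "Watching the Obama speech"}, "klout": {"score": 38}},
]


def matches(post):
    """Return True when the post mentions 'obama' and its Klout score exceeds 50.

    Ignoring letter case is an assumption made for this sketch.
    """
    content = post["interaction"]["content"].lower()
    return "obama" in content and post["klout"]["score"] > 50


selected = [post for post in posts if matches(post)]

for post in selected:
    print(post["interaction"]["content"], post["klout"]["score"])
```

Only the first sample post satisfies both halves of the condition: the second never mentions the keyword, and the third falls below the score threshold.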
Making historical search work well is the challenge, CEO Rob Bailey said in an interview. "Companies are overwhelmed handling even the realtime nature of social--it has completely overwhelmed even the social media monitoring companies. Our team has been working on this feverishly over the last few months, and we've now got something like 100 billion tweets stored."
Founder Nick Halstead, who previously founded TweetMeme, said his organization is up to the task, even though "filtering this data is a massive technical task" and means tackling all the complexities of Big Data management technologies like Hadoop and related tools like Pig and Hive. "A lot of these things require data scientists to get involved" in making sense of the data and processing it efficiently at scale, Halstead said.
To keep things simple for the user, or the front-end application developer, DataSift uses the same query language for historical data as for realtime queries, Halstead said. The major difference is that you have to weigh how much history your analysis needs against the data processing cost of querying it, he said.
Twitter generated about 85 billion posts in 2011 alone, and DataSift has been working with Twitter to extract data from its own archives, going back to the beginning of 2010. To respect the rights of users, DataSift has also had to make sure that tweets deleted from the live stream are removed from the copy of the archives made available for analysis, Bailey said.
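The idea of honoring deletions can be sketched in a few lines of Python. The article does not describe how DataSift actually implements this, so everything here is hypothetical: the record layout, the IDs, and the `purge_deleted` helper are assumptions for illustration only.

```python
# Hypothetical archive records and deletion notices. The IDs and the
# record layout are made up for illustration.
archive = [
    {"id": 101, "text": "First post"},
    {"id": 102, "text": "Second post"},
    {"id": 103, "text": "Third post"},
]
deleted_ids = {102}  # IDs of tweets removed from the live stream


def purge_deleted(records, deleted):
    """Return only the records whose ID is not in the deleted set."""
    return [record for record in records if record["id"] not in deleted]


clean_archive = purge_deleted(archive, deleted_ids)
print([record["id"] for record in clean_archive])
```

Keeping the deleted IDs in a set makes each membership check cheap, which would matter if the same pass were run over billions of records rather than three.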
To make the data easier to interpret, DataSift has also worked in some correlation with events in the news that function as "mileposts" in your analytic slog through history. For example, the news of the resignations of the Research in Motion co-CEOs is displayed in the DataSift application as a reference point you can see relative to tweets about that company.