1/15/2016
12:05 PM

Yahoo Releases Massive Data Set To Academic Institutions

The clicks and searches of 20 million anonymous Yahoo users could help researchers in a number of different academic institutions expand the boundaries of machine learning and deep learning.


Yahoo is releasing to the academic research community a massive machine learning dataset that contains the surfing and search habits of 20 million anonymous users.

The dataset, which will only be made available to academic institutions, can be used by researchers for context-aware learning, large-scale learning algorithms, user-behavior modeling, and content enrichment. It can also validate recommender systems.

The collection is based on a sample of user interactions on the news feeds of several Yahoo properties. As it stands, it's a massive 110 billion lines of data charting the interaction of users with news items.

The dataset includes information gathered from the Yahoo home page, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. It was collected by recording the user-news item interaction of about 20 million users from February 2015 to May 2015.

(Image: leezsnow/iStockphoto)


"Many academic researchers and data scientists don't have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies," Suju Rajan, director of research at Yahoo Labs, wrote in a Jan. 14 statement announcing the release of the data set. "We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state of the art in machine learning and recommender systems."

The dataset is available as part of the Yahoo Labs Webscope data-sharing program, a reference library of datasets composed of anonymous user data for non-commercial use.

Yahoo is also releasing the title, summary, and key phrases of the pertinent news articles included in the data set, and providing demographic information such as age segment and gender.

Other information, including the location in which the user is based, will also be provided. The interaction data is time-stamped with the user's local time and contains partial information about the device on which the user accessed the news feeds.

"The release of this large Yahoo News Feed dataset will be a tremendous asset for the academic research community, and for us at UMass particularly, given our major research activities in natural language processing, information retrieval, databases and computational social science," wrote Andrew McCallum, director of the UMass Amherst Center for Data Science and a professor in the College of Information and Computer Sciences, in Yahoo's statement.


Yahoo's announcement is indicative of a recent trend towards the advancement of machine learning and deep learning, in which computers use massive reams of data to make predictions or better understand population sets.

In December, Facebook announced that it would open source its latest artificial intelligence (AI) server designs. Codenamed Big Sur, the server is designed specifically to train the newest class of AI algorithms that mimic the neural pathways found in the human brain. These algorithms are collectively called deep learning.

"No matter how much talent you have, there is always more on a manager's bucket list," Andrew Moore, Dean of the School of Computer Science at Carnegie Mellon University, told The Wall Street Journal. "No one in these big technology companies feels like they have enough people to do the things they want to do."

Nathan Eddy is a freelance writer for InformationWeek. He has written for Popular Mechanics, Sales & Marketing Management Magazine, FierceMarkets, and CRN, among others. In 2012 he made his first documentary film, The Absent Column. He currently lives in Berlin.

Comments
Technocrati
User Rank: Ninja
1/18/2016 | 8:32:49 PM
Re: Yahoo!

@jastroff I agree with your skepticism regarding large datasets. I suspect you are correct: by the time anything useful can be gleaned, it will be outdated and useless.

I am sure the data is encrypted, but there is always the possibility it can be unencrypted, and then what do we have?

Sounds to me like a data breach.

jastroff
User Rank: Ninja
1/16/2016 | 5:37:18 PM
Re: Yahoo!
Any academic would like to see Amazon's business plan and tax statements. We can do much more with that than large datasets. But I'm just jaded...sorry
jastroff
User Rank: Ninja
1/16/2016 | 5:35:44 PM
Re: Yahoo!
Well, ok. But as one of those researchers who had access to large datasets from various places in a past life, I can say it took a long time to get any results, and mostly they were grant funded and not many people cared. They could be conference papers or dissertations, and they were out of date before they hit the page.

This may be different with Google, etc., but I'm not seeing anything that says "hey, this is why more men than women or more kids than adults do x and why" or "people in Georgia don't access online news..."

And if I did see it, the world is changing so fast, and the data changes with it, that I'm not sure it would make a difference. Do you, Brian?

Now I would be interested in how governments and other organizations are using the data -- they might get something out of it when added to everything else they know. But maybe not.

>> Google is great in this regard. Any individual can utilize Google's real-time and stored data to research the keywords that users are typing into their search engine. The research can be split into geographic and demographic data, and Google does not mind whether the data is being utilized for commercial or academic concerns.
danielcawrey
User Rank: Ninja
1/16/2016 | 4:29:42 PM
Re: Yahoo!
I had never previously considered the challenges researchers have accessing large datasets. I'm sure this release of data from Yahoo is going to make a number of academics really happy. Most big data is understandably kept locked up by the owner, but hopefully we're going to see more massive datasets released for research. It might lead to a better understanding of what all of this data we generate really means. 
Brian.Dean
User Rank: Ninja
1/16/2016 | 3:35:37 PM
Re: Yahoo!
Google is great in this regard. Any individual can utilize Google's real-time and stored data to research the keywords that users are typing into their search engine. The research can be split into geographic and demographic data, and Google does not mind whether the data is being utilized for commercial or academic concerns.
jastroff
User Rank: Ninja
1/16/2016 | 3:08:08 PM
Re: Yahoo!
By the time the new theories are created, Yahoo will be gone, and the users a memory.

I suspect you can't get a dataset out of Amazon or Facebook.
Brian.Dean
User Rank: Ninja
1/15/2016 | 1:21:57 PM
Yahoo!
This is a great move by Yahoo, as the data could be utilized to explain and/or build new theories in the social sciences, etc. I wonder whether limiting the data set to a quarter of a year's data will prevent academic institutions from researching seasonal changes in user interaction, or whether the data is already large enough that academic institutions will spend years processing it.