World's largest music service switches Hadoop distributions to take advantage of Hortonworks Hive improvements, support services.
Spotify, the 24-million-user-strong music service based in Stockholm and London, announced Monday that it's migrating its massive, 690-node Hadoop cluster from Cloudera's software distribution to the Hortonworks Data Platform (HDP) and Hortonworks enterprise support.
Among the largest Hadoop implementations in Europe, Spotify's cluster is used to develop analytics that drive the company's personalized services, such as Spotify Radio. It also supports data-driven analyses for advertisers and partners. For example, Spotify can segment listeners to help advertisers place ads. It can also run geospatial analyses of listening patterns to help record labels and artists determine optimal concert locations.
"[Hortonworks'] true open source approach and the work they have done to improve the Apache Hive data warehouse system aligns well with our needs," said Wouter de Bie, team lead for data infrastructure at Spotify, in a statement. "We use Hive extensively for ad-hoc queries and for the analysis of large data sets."
Most Hadoop software distributors have supported the so-called SQL-on-Hadoop movement this year -- Cloudera with Impala, IBM with Big SQL, MapR with Drill, and Pivotal with HAWQ -- but Hortonworks is alone in doing so by focusing on improving Hadoop's existing Hive interface through its Stinger initiative.
Hive relies on behind-the-scenes MapReduce processing, which has a reputation for being slow, but Hortonworks executives insist that the company's design improvements will drive a 100X performance improvement that will yield ad-hoc query results within "a handful of seconds."
"Spotify is undertaking some really innovative work in the data analytics field and realized the need for a deep level of open source Apache Hadoop domain experience and expertise," commented Herb Cunitz, president of Hortonworks, in a statement.
Spotify launched in 2008 and soon thereafter set up a 30-node cluster on Amazon Web Services. The company switched to an on-premises 60-node cluster less than two years ago and quickly scaled it out to today's 690 nodes. It collects more than 200 gigabytes of compressed user activity data per day and has more than 4 petabytes of capacity in its cluster.
Spotify could not be reached in time to comment on whether it's simply using Cloudera's distribution of open source software or also employing its commercial management software and support services. Spotify is said to have a highly skilled, 12-plus-engineer internal Hadoop team that would seem quite capable of running Hadoop independently. That team developed Luigi, a Python framework for batch data processing, dependency resolution and monitoring on Hadoop, which Spotify has since released as open source.
"The cultural fit was an important factor in our selection and we have appreciated Hortonworks' relaxed, helpful and open approach," said Wouter de Bie. "We were looking for a true partner relationship and the team at Hortonworks [is] committed to enabling the overall ecosystem."