It's Apache Spark time in our Big Data Roundup for the week ending June 12. At the Spark Summit West 2016, vendors big and small made announcements supporting the real-time big data analytics platform. Microsoft is getting behind Spark with several of its products. Distribution company Databricks revealed general availability of its Community Edition. Intel declared Spark to be at the center of the big data revolution.
Let's start with Microsoft. This week the company announced general availability of Apache Spark 1.6.1 for Azure HDInsight, and Power BI support for Spark Streaming. Azure HDInsight is Microsoft's answer to Hadoop in the Azure cloud. Based on the Hortonworks Data Platform Hadoop distribution, the service deploys and provisions managed Apache Hadoop clusters in the Azure cloud, providing a framework designed to process, analyze, and report on big data.
Now the company is adding Spark to HDInsight. Microsoft says the service is popular, with Spark adopted in 50% of all new HDInsight clusters deployed.
"With GA, we are revealing improvements we've made to the service to make Spark hardened for the enterprise and easy for your users," wrote Oliver Chiu, a senior product marketing manager for big data and data warehousing at Microsoft, in a blog post. "This includes improvements to the availability, scalability, and productivity of our managed Spark service."
Microsoft also said it worked with Hortonworks to add capabilities to the YARN resource manager. In addition, Redmond co-led Project Livy with Cloudera and other organizations to create an open source, Apache-licensed REST web service for managing long-running Spark contexts and submitting Spark jobs.
Microsoft said it will offer an integration between Spark and the Azure Data Lake Store to enable Spark to store and process data of any size. Microsoft plans to enable role-based data access at the storage level through integration of Spark and the Data Lake Store.
For data scientists specifically, Microsoft also introduced out-of-the-box integration with Jupyter data science notebooks.
Microsoft had something for business intelligence professionals and analysts as well. The company will offer integration with Power BI and other BI tools such as Tableau, SAP Lumira, and QlikView.
"This lets you build interactive visualizations over data of any size," Chiu wrote. "In addition to the traditional dashboards, Power BI offers a streaming connector that has integration with Spark allowing you to publish real-time events from Spark Streaming directly to Power BI."
Databricks is the chief commercial distribution company behind Apache Spark. This week the company announced general availability of its Databricks Community Edition, a free version of its just-in-time data platform built on top of Apache Spark.
The company said in a statement that DCE is accessible to all users, making it easy and quick to learn Apache Spark without the need to deal with infrastructure concerns.
"This year we've seen explosive growth for the Apache Spark project and all signs indicate the pace will only accelerate as the community expands even more," said Matei Zaharia, cofounder and chief technology officer at Databricks, in the statement. "Databricks Community Edition has created an ideal environment for learning Apache Spark. Developers of all backgrounds can now use Databricks Community Edition to learn Spark and mitigate the acute Spark skills gap."
In her presentation at the Spark Summit, Ziya Ma, Intel's director of big data technologies, shared a statement made by the company's No. 3 employee, Andy Grove, about how analytics would be the No. 1 workload in the data center by the year 2020. "Analytics enriches people's lives," she said.
"We believe Spark is at the center of the analytics revolution," Ma told attendees at the Summit.
In keeping with her company's commitment to big data and analytics, Ma provided an update on Intel's Trusted Analytics Platform. She also discussed a new chip announcement, the Xeon processor E7 v4 family, which she said provides a sevenfold performance improvement for Spark workloads compared with the previous generation of Intel hardware.