Amazon CTO Werner Vogels kicked off the annual Amazon Web Services Summit series in New York last week with a vintage cloud-will-be-king presentation that made a strong case for big data computing in the cloud. And Vogels offered a few predictions for what will drive cloud-based data analytics.
First, he predicted that demand for big data analysis will spur interest in real-time analysis, and that companies will have to respond with unlimited capacity as needed.
Second, he said we can expect that infrastructure like Hadoop (delivered by Amazon as the Elastic MapReduce (EMR) service) will in the future "become invisible" behind analytic layers built on top of it. He described today's big data analysis tools as "rather crude."
Third, he said that this layer of big data analytics will include big-data-powered industry-specific applications.
This slick new layer of next-era tools doesn't exist yet, but to prove the industry-focused point, Vogels introduced executives from Bristol-Myers Squibb, GE and big data analytics startup Mortar Data (among other companies) to detail AWS-powered big data applications.
-- Bristol-Myers Squibb IT executive Russell Towell described how the drug giant is using AWS to run computer simulations that optimize large-scale drug trials before they are actually conducted with patients. The company uses AWS security provisions including private connections to Amazon data centers, Amazon Virtual Private Cloud services and encryption of all data, Towell said. Bristol-Myers Squibb researchers can spin up scores and even hundreds of Linux server instances within five minutes and preconfigured Oracle Database instances within 12 minutes, he said.
Workloads that would have taken 60 hours to provision and complete on-premises using the company's old approach (and requiring huge investments in server capacity) now take 1.2 hours on AWS with service fees of $336, he said. As a result, the company can quickly do "thousands rather than hundreds" of simulations in the same amount of time, Towell said.
The big payoff for each simulation is money saved on live clinical trials. Having completed simulations, the company can reduce the number of patients required for a trial while being certain of valid results. Trial costs that averaged $750,000 have been cut to $250,000, according to Towell.
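As a rough check on the figures Towell cited, the arithmetic can be sketched as follows. The function and variable names are illustrative only, and the calculation assumes the quoted numbers are directly comparable (same workload, same trial scope), which the presentation did not spell out.

```python
# Back-of-envelope arithmetic on the figures quoted by Towell.
# All names here are illustrative; only the numbers come from the article.

def speedup(old_hours: float, new_hours: float) -> float:
    """How many times faster the new approach is."""
    return old_hours / new_hours


def fractional_saving(old_cost: float, new_cost: float) -> float:
    """Share of the old cost that is no longer spent."""
    return (old_cost - new_cost) / old_cost


ON_PREM_HOURS = 60.0          # provision and complete on-premises
AWS_HOURS = 1.2               # same workload on AWS
AWS_FEE_USD = 336.0           # service fees for that run
TRIAL_COST_BEFORE_USD = 750_000
TRIAL_COST_AFTER_USD = 250_000

run_speedup = speedup(ON_PREM_HOURS, AWS_HOURS)
fee_per_hour = AWS_FEE_USD / AWS_HOURS
trial_saving = fractional_saving(TRIAL_COST_BEFORE_USD, TRIAL_COST_AFTER_USD)

if __name__ == "__main__":
    print(f"Speedup: about {run_speedup:.0f}x")
    print(f"AWS fee per elapsed hour: about ${fee_per_hour:.0f}")
    print(f"Trial cost reduction: about {trial_saving:.0%}")
```

On these numbers the simulation run is roughly 50 times faster, and the average trial cost falls by about two-thirds.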
-- General Electric executive Joe Salvo, manager of the manufacturer's Business Integration Technologies Laboratory, touted a collaborative platform that GE built on AWS. It is intended to help manufacturers and suppliers bring together expertise, materials data, and modeling and simulation capabilities to speed part and component development by as much as a factor of five. GE calls it a CEED -- crowd-driven ecosystem for evolutionary design.
"It's a flexible, elastic environment on [Amazon] EC2 that supports both rapid prototyping, simulation and, ultimately, building real parts that go into complex products and systems," Salvo said. "The teams come together quickly, they exchange their data and models [securely] ... and it holds the promise of transforming the whole manufacturing paradigm."
-- Mortar Data CEO K Young cited elastic capacity as the key to the 2011 startup's ability to grow quickly and provide Hadoop-as-a-service capacity without having to buy and set up servers. Mortar has raised $1.8 million in capital, and in 2012 the company spent some $500,000 on AWS services, using some 1,000 servers on demand. Provisioning that much capacity in a conventional on-premises data center would have cost $7 million and taken eight months to bring online, Young said.
"We're able to serve new customers without delay and without upfront costs, and we can start bringing in new revenue, and we're able to do it using about a quarter of what we would have had to raise otherwise," Young said.
Services To Come
Vogels' point about companies needing to tap into capacity on demand is an obvious selling point for AWS. His predictions about real-time analysis and the inevitability of an analytic layer on top of invisible Hadoop infrastructure could well be a tease for coming AWS services announcements.
For example, Amazon has yet to join the SQL-on-Hadoop trend that is driving multiple projects and initiatives aimed at delivering faster and more extensive SQL querying capabilities on top of Hadoop than are currently supported by Hive. Lead Hadoop distributor Cloudera, for example, is promoting project Impala, while competitors EMC (Pivotal HD), Hortonworks (Stinger), IBM (Big SQL), MapR (Apache Drill) and Teradata (Teradata SQL-H) each have their own SQL-on-Hadoop initiatives in the works.
As for an analytics layer, Amazon has heretofore partnered with BI and analytics vendors including Actuate, Birst, GoodData, Karmasphere, Pentaho and others. It would be interesting (and not terribly surprising) to see Amazon acquire or invest in BI and analytics technologies for Hadoop and other platforms. In the database arena, Amazon took a large equity stake in ParAccel, for example, to gain licensing rights to the high-scale, massively parallel-processing database now behind the Amazon Redshift data warehousing service. This could be the model for an analytics play.
Amazon did make two notable database-related announcements at the AWS Summit, one aimed at incumbent-database customers and one aimed at moving them to Amazon's big data services. In the first case, Vogels announced that encrypted data storage and network data flow are now available for Amazon Relational Database Service (RDS) for Oracle Database and will soon be available for Amazon RDS for Microsoft SQL Server. Amazon still has to allay corporate concerns about putting data in the cloud, so this announcement is aimed at companies using incumbent platforms.
As for those looking toward new platforms, Vogels announced that Amazon's DynamoDB NoSQL database has gained an important new analytical capability through a feature called Local Secondary Indexes.
"This allows you to perform queries on any attribute in your data model, so now you have all the power of querying that you're used to with relational databases available to you on DynamoDB," Vogels said.
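For readers who want to see what a Local Secondary Index looks like in practice, here is a minimal sketch of a DynamoDB CreateTable request that declares one. The table name, attribute names and index name (Orders, CustomerId, OrderId, OrderDate, OrdersByDate) are hypothetical and do not come from Vogels' talk. The sketch only builds the request and checks it locally; it does not call AWS. It reflects two properties of the feature: a local secondary index reuses the table's hash key with an alternate range key, and it is declared when the table is created.

```python
from typing import Any, Dict, List

KeySchema = List[Dict[str, str]]


def build_create_table_request() -> Dict[str, Any]:
    """Build CreateTable parameters for a hypothetical Orders table.

    The table is keyed on (CustomerId, OrderId). The local secondary
    index keeps CustomerId as the hash key but sorts on OrderDate.
    """
    return {
        "TableName": "Orders",  # hypothetical name
        "AttributeDefinitions": [
            {"AttributeName": "CustomerId", "AttributeType": "S"},
            {"AttributeName": "OrderId", "AttributeType": "S"},
            {"AttributeName": "OrderDate", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "CustomerId", "KeyType": "HASH"},
            {"AttributeName": "OrderId", "KeyType": "RANGE"},
        ],
        "LocalSecondaryIndexes": [
            {
                "IndexName": "OrdersByDate",  # hypothetical name
                "KeySchema": [
                    {"AttributeName": "CustomerId", "KeyType": "HASH"},
                    {"AttributeName": "OrderDate", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        ],
        "ProvisionedThroughput": {
            "ReadCapacityUnits": 5,
            "WriteCapacityUnits": 5,
        },
    }


def hash_key_name(key_schema: KeySchema) -> str:
    """Return the attribute name that serves as the hash key."""
    return next(k["AttributeName"] for k in key_schema if k["KeyType"] == "HASH")


def indexes_share_table_hash_key(request: Dict[str, Any]) -> bool:
    """Check that every local secondary index reuses the table's hash key."""
    table_hash = hash_key_name(request["KeySchema"])
    return all(
        hash_key_name(index["KeySchema"]) == table_hash
        for index in request.get("LocalSecondaryIndexes", [])
    )


if __name__ == "__main__":
    request = build_create_table_request()
    print(indexes_share_table_hash_key(request))
    # With AWS credentials configured, the request could be sent with
    # boto3.client("dynamodb").create_table(**request); not done here.
```

With an index like this, an application could ask for one customer's orders sorted or filtered by date rather than by order ID.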
The announcements fit a pattern for Amazon in which it offers familiar tools (like Oracle Database and Microsoft SQL Server) while also pioneering and promoting new platforms (like DynamoDB, Hadoop and Redshift). As always, the cloud is the place to do it all.