5 Top Wishes For Big Data Deployments
If you've even experimented with building big-data applications or analyses, you're probably acutely aware that the domain has its share of missing ingredients. We've boiled it down to five top wants on the big-data wish list, starting with SQL (or at least SQL-like) analysis options and shortcuts to deployment and advanced analytics and finishing with real-time and network analysis options.
The good news is that people and, in some cases, entire communities, are working on these problems. There are armies of data-management and data-analysis professionals who are familiar with SQL, for example, so organizations naturally want to take advantage of knowledge of that query language to make sense of data in Hadoop clusters and NoSQL databases -- the latter is no paradox, as the "No" in "NoSQL" stands for "not only" SQL. It's not a surprise that every distributor of Apache Hadoop software has proposed, is testing, or has released (or will soon release) an option for SQL or SQL-like analysis of data residing on Hadoop clusters. That group includes Cloudera, EMC, Hortonworks, IBM, MapR and Teradata, among others. In the NoSQL camp, 10Gen has improved on the analytics capabilities within MongoDB, and commercial vendor Acunu does the same for Cassandra.
Deploying and managing Hadoop clusters and NoSQL databases is a new experience for most IT organizations, but it seems that each and every software update brings new deployment and management features expressly designed to make life easier. There are also a number of appliances -- available or planned by the likes of EMC, HP, IBM, Oracle and Teradata -- aimed at fast deployment of Hadoop. Other vendors are focusing on particularly tricky aspects of working with Hadoop framework components. WibiData, for example, provides open-source libraries, models and tools designed to make it easier to work with HBase, Hadoop's high-scale NoSQL database.
The whole point of gathering up and making use of big data is to come up with predictions and other advanced analytics that can trigger better-informed business decisions. But with the shortage of data-savvy talent in the world, companies are looking for an easier way to support sophisticated analyses. Machine learning is one technique that many vendors and companies are investigating because it relies on data and compute power, rather than human expertise, to spot customer behaviors and other patterns hidden in data.
One of the key "Vs" of big data (along with volume and variety) is velocity, but you'd be hard pressed to apply the phrase "real-time" to Hadoop, with its batchy MapReduce analysis approach. Alternative software distributor MapR and analytics vendor HStreaming are among a small group of firms bringing real-time analysis of data in Hadoop. It's an essential step that other vendors -- particularly event-stream processing vendors -- are likely to follow.
Last among the top five wishes for big data is easier network analysis. Here, corporate-friendly graph-analysis databases and tools are emerging that employ some of the same techniques Facebook uses at truly massive scale. Keep in mind that few of the tools and technologies described here have had 30 or more years to mature, as relational databases and SQL query tools have. But there are clear signs that the pain points of big-data management and big-data analysis are rapidly being addressed.
Wish 1: SQL Analysis At Big-Data Scale
You could compile a massive data set just by gathering all the stories and reports that have been written about the shortage of big-data talent. The most acute need is for data scientist types who know data and who also know how to write custom code, MapReduce jobs, and algorithms to gain insights from big data. But what if SQL-savvy professionals schooled in relational databases and business intelligence (BI) and analytics tools could do more of the heavy lifting? There are many more SQL professionals out there than there are data scientists, and most SQL pros would be eager to expand their career potential.
There's a big push to deliver SQL-analysis capabilities on top of Hadoop, and the talent shortage is just one reason. The second reason for the trend is that Apache Hive, Hadoop's incumbent data warehousing infrastructure, offers a limited subset of SQL-like query capabilities and suffers from slow performance tied to behind-the-scenes MapReduce processing.
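To make the contrast concrete, here is a minimal Python sketch; it is not Hive or Hadoop code. It computes the same per-region sales total twice: once with hand-written map, shuffle and reduce steps, and once with a single SQL statement. Python's built-in sqlite3 module stands in for a SQL-on-Hadoop engine, and the table name, column names and sample rows are all made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample records: (region, sale amount)
SALES = [("east", 100), ("west", 250), ("east", 50), ("north", 75), ("west", 25)]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def totals_with_mapreduce(records):
    """The MapReduce-style route: three explicit, hand-coded steps."""
    return reduce_phase(shuffle_phase(map_phase(records)))


def totals_with_sql(records):
    """The SQL route: one declarative GROUP BY query does the same work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(totals_with_mapreduce(SALES))
    print(totals_with_sql(SALES))
```

Both functions return the same totals. The difference is who does the work: the SQL version is one statement that a BI professional already knows how to write, while the MapReduce version requires custom code for every new question.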
Answering the call for broader, faster SQL querying on Hadoop are projects and initiatives including Cloudera Impala, EMC's HAWQ query feature on the Pivotal HD distribution, Hortonworks Stinger, IBM Big SQL, MapR-supported Apache Drill, and Teradata SQL-H.
Even the NoSQL camp wants better, SQL-like querying. Last year 10Gen added a real-time data aggregation framework to its popular MongoDB NoSQL database. The aggregation framework lets users directly query data within MongoDB without resorting to writing and running complicated, batch-oriented MapReduce jobs. More evidence is Acunu, which has developed a SQL-like AQL language to support querying on top of Cassandra.
The development of SQL querying capabilities is only the beginning. BI and analytics tools and systems native to big-data platforms are emerging. Examples include Datameer, Hadapt, Karmasphere and Platfora, and they're offering distinguishing query, analysis, data-visualization and monitoring capabilities on top of Hadoop.
Teradata Joins SQL-On-Hadoop Bandwagon
Wish 2: Simplified Deployment And Management
There's no shortage of efforts to simplify the deployment and management of big-data platforms including Hadoop and NoSQL databases. It seems each and every software update brings new management features and new built-in capabilities. 10Gen, for example, added built-in text search capabilities and on-premises monitoring capabilities with the latest release of MongoDB. And Hortonworks' distribution of Hadoop for Microsoft Windows ties into Active Directory, Microsoft's System Center, and Microsoft virtualization technologies to simplify deployment and management.
We haven't heard a lot of complaining about the hardware-related challenges of building out Hadoop clusters. Nonetheless, EMC, IBM, Oracle and Teradata insist their released and pending Hadoop appliances make deployment faster and easier than the build-it-yourself approach. The cost of commodity hardware might be alluring, but Oracle, for one, says its appliance costs less than build-it-yourself deployments when taking into account the price of individual components, time saved on provisioning and tuning the system, and support and upgrade efforts. Oracle's appliance includes pre-configured, ready-to-run versions of Cloudera software and Oracle's NoSQL database.
The real messiness and complication of managing Hadoop usually involve the software, not hardware configuration. HBase, for example, is the Hadoop framework's increasingly important NoSQL database, but many practitioners have found it hard to model and analyze data on the database. Vendor WibiData provides open-source libraries, models and tools that make it easier to store, extract and analyze data on HBase. The idea is to make the hard, technical parts of running HBase repeatable so you need fewer engineers and data scientists when trying to solve business problems. That's a formula that should and will be applied across many big-data platforms.
Wish 3: Easier Paths To Advanced Analytics
Developing algorithms and predictive models is work that has to be carried out by hard-to-find, expensive data scientists. Or is it? Scarcity of talent is one reason big-data, analytics and business intelligence vendors are developing machine-learning approaches. Proven in applications including optical character recognition, spam filtering and computer security threat detection, machine learning uses learning algorithms that are trained by the data itself. If you show the algorithm thousands or tens of thousands of examples of scanned text characters, unsolicited email messages, or virus bots and malware, it can reliably find more examples.
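As a rough illustration of what "trained by the data itself" means, here is a minimal Python sketch of a naive Bayes text classifier, one common technique for spam filtering. It is a toy under stated assumptions, not any vendor's implementation: the six training messages and their labels are invented, and real systems learn from thousands of examples or more.

```python
import math
from collections import Counter

# Hypothetical labeled examples; a real filter would train on far more.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project status and agenda", "ham"),
]


def tokenize(text):
    """Split a message into lowercase words."""
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts


def classify(text, model):
    """Return the label with the highest log-probability for the message."""
    word_counts, label_counts = model
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_messages = sum(label_counts.values())

    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # Start from the prior: how common this label is overall.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAINING)
    print(classify("free cash prize", model))
    print(classify("monday project meeting", model))
```

Nobody writes a rule saying "prize" signals spam. The word counts in the training examples supply that knowledge, which is why more examples generally mean better results.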
The same approach can be applied to spotting customers who are ready to churn or jet engines that are about to fail. With machine learning, trained models also can continue to learn from new data. Amazon.com and Netflix, for example, use algorithms to spot patterns in customer transactions so they can recommend other books or movies. When a new book or movie comes out, these companies can start recommending it as soon as their algorithms discern the preference pattern in the data.
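One simple way to spot such patterns is to count which items turn up together in the same transactions. The Python sketch below does that; it illustrates the general co-occurrence idea only and makes no claim about how Amazon.com or Netflix build their recommendations. The baskets and item names are invented.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical purchase histories, one list per customer.
BASKETS = [
    ["book_a", "book_b"],
    ["book_a", "book_b", "book_c"],
    ["book_a", "book_c"],
    ["book_b", "book_d"],
    ["book_a", "book_b", "new_release"],
]


def build_cooccurrence(baskets):
    """For every item, count how often each other item shares a basket with it."""
    cooccurrence = defaultdict(Counter)
    for basket in baskets:
        for item, other in permutations(set(basket), 2):
            cooccurrence[item][other] += 1
    return cooccurrence


def recommend(owned, cooccurrence, top_n=2):
    """Rank items the customer lacks by how often they co-occur with owned items."""
    scores = Counter()
    for item in owned:
        for other, count in cooccurrence.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Sort by score (highest first), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


if __name__ == "__main__":
    table = build_cooccurrence(BASKETS)
    print(recommend({"book_a"}, table))
    print(recommend({"book_a"}, table, top_n=3))
```

Note that "new_release" becomes a candidate as soon as a single basket contains it alongside "book_a". No one has to update a rule; the new transaction data changes the counts.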
Apache Mahout is the leading route to deploying machine-learning-based clustering, classification and collaborative filtering algorithms on Hadoop, but these techniques are also supported by the R statistical programming language. Commercial vendors supporting or embedding machine-learning techniques include Alpine Data Labs, Birst, Causata, Lionsolver, Revolution Analytics and a growing list of others.
Wish 4: Real-Time Analysis Options
Another item on the big-data analytics wish list is real-time performance. Two startup vendors going after this opportunity are marketing analytics vendor Causata and real-time Hadoop-analysis vendor HStreaming.
For Causata, "real time" means making decisions in under 50 milliseconds. You need that kind of speed to change content, banner ads and marketing offers while your customers are still active on websites and mobile devices. Causata uses Hadoop's HBase NoSQL database for storage of marketing-related data that might include clickstreams, campaign-response data and CRM records. HBase isn't good at real-time querying, however, so Causata runs Java-based algorithms on a proprietary query engine to improve performance.
As its name hints, HStreaming relies on stream-processing technology that's similar to the event-processing engines used by financial trading operations and offered by IBM (InfoSphere Streams), Progress Software (Apama), SAP (Sybase Aleri), Tibco (Complex Event Processing) and others. HStreaming takes data directly from always-on sources such as video surveillance cameras, cell towers and sensors, and spots patterns in that data while it's still in flight. The technology also provides a form of extract, transform, load (ETL) for then storing the data onto Hadoop for later analysis. HStreaming cites video surveillance, network optimization and mobile advertising as its top applications. In all three cases, real-time insight and action are a must.
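The general pattern of analyzing data in flight and then landing it for later batch analysis can be sketched in a few lines of Python. This is a generic sliding-window example, not HStreaming's product or API: the event format, the window size, the alert threshold and the in-memory list standing in for Hadoop storage are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical events: (timestamp in seconds, payload), in arrival order.
EVENTS = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (11, "e"), (12, "f"), (13, "g")]


def process_stream(events, window_seconds, threshold, sink):
    """Yield (timestamp, count) whenever `threshold` or more events fall
    within the trailing window; append every event to `sink` for later analysis.

    Assumes events arrive in ascending timestamp order.
    """
    window = deque()
    for timestamp, payload in events:
        # Add the new event, then evict anything older than the window.
        window.append(timestamp)
        while window and window[0] <= timestamp - window_seconds:
            window.popleft()
        burst_detected = len(window) >= threshold

        # The "load" step: keep the raw event for later batch analysis.
        sink.append((timestamp, payload))

        # The in-flight step: raise the alert without waiting for a batch job.
        if burst_detected:
            yield timestamp, len(window)


if __name__ == "__main__":
    stored = []
    for alert in process_stream(EVENTS, window_seconds=5, threshold=4, sink=stored):
        print("burst detected:", alert)
    print(len(stored), "events stored")
```

The function only ever holds the last few seconds of data in memory, which is what lets this style of processing react immediately instead of waiting for a complete data set.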
Taking a different tack, Hadoop software and support vendor MapR has announced a partnership with Informatica through which it claims it will become the first and only Hadoop software distributor capable of delivering near-real-time data streaming on the big-data platform. MapR's Hadoop distribution features a lockless storage services layer that works hand-in-hand with Informatica messaging software to continuously stream massive amounts of data into Hadoop. Couple this capability with a coming SQL-on-Hadoop option such as MapR-favored Drill, and you'll have yet another option for fast big-data analysis.
Wish 5: Network Insight
Social networks are contributing to the scale and variability of big data. The social networks themselves use graph databases and analysis tools to uncover the web of user relationships by studying "nodes" -- representing people, companies, locations and so on -- and edges, the often-complex relationships among those nodes.
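For readers new to the vocabulary, here is a minimal Python sketch of nodes and edges, along with one classic graph question: how many hops separate one node from every other? It uses a plain dictionary rather than a graph database, and the names and relationships are invented.

```python
from collections import defaultdict, deque

# Hypothetical edges; each pair is a relationship between two nodes.
EDGES = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "acme_corp"),
    ("alice", "dave"),
]


def build_graph(edges):
    """Store the graph as a mapping from each node to the set of its neighbors."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)  # relationships are treated as two-way here
    return graph


def hops_from(graph, start):
    """Breadth-first search: shortest number of edges from `start` to each reachable node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(hops_from(graph, "alice"))
```

Questions like this are awkward to express in SQL, where every extra hop typically means another self-join. That is a large part of the appeal of purpose-built graph tools.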
Mutual fund company American Century Investments uses graph analysis to predict the performance of the companies its funds invest in. The company used the open source R statistical programming language and its igraph package, with software and support from Revolution Analytics, to build a graph-analysis application that tracks revenue flows among manufacturers and their suppliers.
Apple, for example, has suppliers of chips and screens just as car manufacturers have suppliers of components and parts. American Century combines public and proprietary data on those buying relationships, and it applies graph analyses to get a clearer understanding of the likely performance of suppliers. These forecasts are more accurate than those based on quarters-old public financial reports, according to American Century.
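The intuition behind this kind of analysis can be shown with a deliberately simplified Python sketch. It is not American Century's model, which the article does not describe in detail, and it uses plain Python rather than R. The company names, spending figures and growth rates are all hypothetical. Each buying relationship is a weighted edge from buyer to supplier, and a supplier's outlook is estimated from the outlook of the buyers it depends on.

```python
from collections import defaultdict

# Hypothetical buying relationships: (buyer, supplier, annual spend in $M).
PURCHASES = [
    ("phone_maker", "chip_supplier", 60),
    ("car_maker", "chip_supplier", 40),
    ("phone_maker", "screen_supplier", 90),
    ("car_maker", "parts_supplier", 30),
]


def revenue_exposure(purchases):
    """For each supplier, the share of its known revenue that comes from each buyer."""
    inflow = defaultdict(dict)
    for buyer, supplier, amount in purchases:
        inflow[supplier][buyer] = inflow[supplier].get(buyer, 0) + amount
    exposure = {}
    for supplier, by_buyer in inflow.items():
        total = sum(by_buyer.values())
        exposure[supplier] = {buyer: amount / total for buyer, amount in by_buyer.items()}
    return exposure


def projected_change(exposure, buyer_growth):
    """Weight each buyer's expected growth by the supplier's exposure to that buyer."""
    return {
        supplier: sum(share * buyer_growth.get(buyer, 0.0) for buyer, share in shares.items())
        for supplier, shares in exposure.items()
    }


if __name__ == "__main__":
    exposure = revenue_exposure(PURCHASES)
    # Assumed outlook: phone sales up 10%, car sales down 5%.
    outlook = projected_change(exposure, {"phone_maker": 0.10, "car_maker": -0.05})
    for supplier, change in sorted(outlook.items()):
        print(f"{supplier}: {change:+.1%}")
```

Even in this toy version, the supplier that sells to both buyers gets a blended forecast that neither buyer's results would reveal on their own.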
Other open-source technologies supporting graph analysis include Neo4j, a graph database developed and supported by Neo Technology. Neo4j is used in IT and telecom network scenarios to resolve secure-access challenges, in master data management applications to see changing relationships among data, and in recommendation-engine apps to figure out what people want based on the behaviors of friends and connections. Other open source graph-analysis projects include Pregel (from Google) and Apache Giraph. It's not the stampede of solutions you see around Hadoop, but there's clearly growing interest in graph analysis.