With all the hubbub about SQL-on-Hadoop options over the last year, what seems to be lost on many is the fact that SQL access won't exactly open up big data to the masses. If the arcane languages of MapReduce, Pig, and so on are akin to ancient Latin, adding SQL to the language list is a bit like suggesting the use of modern Greek.
Enter Datameer, which offers a spreadsheet-on-Hadoop approach to exploring and analyzing data on Hadoop. Datameer has been among the pioneers in supporting ad-hoc analysis, reporting, and analytics directly on data in Hadoop, and with a 4.0 version released last week, it has enhanced the product's visual data preparation and data analysis capabilities.
Research by InformationWeek and plenty of other organizations has long confirmed that data quality is one of the biggest obstacles to developing valid and valuable BI and analytics. That challenge hasn't gone away in the big data era, and that's the key reason Datameer 4.0 includes a new Visual Data Profiling feature. Toggling from a spreadsheet view to the "flip side" Visual Data Profiling view, you see details on data type, count, max, min, uniqueness, and mean.
The data profile also reveals data quality problems, such as having more than two categories for customer gender or having a great big gap in the age range of customers. This is where Datameer's data correction, filtering, transformation, and enrichment capabilities come into play. Once you've addressed data quality problems, you toggle between the spreadsheet and data-profile views to visually inspect and validate that you're working with reliable data.
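To make the profiling idea concrete, here is a minimal Python sketch of the kind of per-column summary and quality checks described above. It is not Datameer's implementation; the function names `profile_column` and `largest_gap` and the sample data are hypothetical, chosen only to illustrate the two problems mentioned (too many gender categories and a big gap in customer ages).

```python
from statistics import mean

def profile_column(values):
    """Summarize one column: counts, distinct values, and numeric stats if applicable."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(present),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    # Only numeric columns get min, max, and mean.
    if present and all(isinstance(v, (int, float)) for v in present):
        profile.update(min=min(present), max=max(present), mean=mean(present))
    return profile

def largest_gap(values):
    """Return the (low, high) pair bounding the widest gap between consecutive distinct values."""
    distinct = sorted(set(v for v in values if v is not None))
    if len(distinct) < 2:
        return None
    return max(zip(distinct, distinct[1:]), key=lambda pair: pair[1] - pair[0])

# Made-up sample data for illustration only.
gender = ["F", "M", "f", "M", "F", None, "male"]
age = [22, 25, 27, 31, 58, 61, 64]

gender_profile = profile_column(gender)  # more than two distinct values hints at inconsistent coding
age_profile = profile_column(age)
age_gap = largest_gap(age)               # a wide gap hints at missing or skewed records
```

In this sketch, a `unique` count above two for the gender column and a wide `age_gap` are the signals an analyst would then correct or filter before trusting the data.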
The second big advance in Datameer 4.0 is a new visual view of Datameer's clustering, column-dependency, decision tree, and recommendation algorithms. Once again, you toggle from the spreadsheet view to a flip-side visual analysis that helps business analysts understand how and why the algorithms produced their results.
Instead of just seeing that there are 12,500 customers in cluster A and 28,200 in cluster B, for example, the visual view might reveal that cluster A is made up of 20- to 30-year-old customers who spent an average of $48 last month, whereas cluster B is made up of 40- to 50-year-olds who spent an average of $96. You could get to these details in Datameer 3.0, but they weren't immediately exposed in a visual view.
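The difference between a bare cluster count and a descriptive cluster profile can be sketched in a few lines of Python. This is an illustrative assumption about what such a summary computes, not Datameer's algorithm; `summarize_clusters`, the record fields, and the four sample customers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

def summarize_clusters(customers):
    """Group customers by cluster label and report size, age range, and average spend."""
    groups = defaultdict(list)
    for customer in customers:
        groups[customer["cluster"]].append(customer)

    summary = {}
    for label, members in groups.items():
        ages = [m["age"] for m in members]
        summary[label] = {
            "size": len(members),
            "age_range": (min(ages), max(ages)),
            "avg_spend": mean(m["spend"] for m in members),
        }
    return summary

# Made-up records; cluster labels are assumed to come from an earlier clustering step.
customers = [
    {"cluster": "A", "age": 22, "spend": 40},
    {"cluster": "A", "age": 28, "spend": 56},
    {"cluster": "B", "age": 41, "spend": 90},
    {"cluster": "B", "age": 49, "spend": 102},
]

summary = summarize_clusters(customers)
```

The `size` field alone corresponds to the "12,500 versus 28,200" style of result; the age range and average spend are the kind of detail the visual view is said to surface.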
Many traditional BI tools can now peer into Hadoop by way of SQL-on-Hadoop interfaces and connectors, but Datameer's spreadsheet-on-Hadoop approach is much more user friendly, even for those who know SQL, according to Karen Hsu, senior director of product marketing at Datameer.
"IT types might use SQL to segment data, but then they'll jump out and use Datameer for grouping, joining, partitioning, unioning data, and identifying outliers," Hsu said. "You can do all that with SQL, but it's just much easier to do in a spreadsheet."
SQL-on-Hadoop tools will certainly have their place. But given that SQL was a nonstarter in democratizing small data analysis, the emphasis on Hive, Impala, and other SQL-driven tools seems like a half step toward what organizations really want from big data. With its 4.0 release, Datameer is banking on spreadsheets and visualizations as the way to give business analysts a hands-on way to conduct data analysis on top of Hadoop.