When Big Data Questions Can't Wait For Data Scientists
Alteryx and other vendors are pushing tools that aim to make big data accessible to business-side teams and other non-specialists.
The problem with big data--the reason more companies haven't moved it into production roles--isn't just nonexistent budgets, a lack of skills to manage big data properly, or the lack of a demonstrable business case, according to business-intelligence vendor Alteryx.
The big problem, the one that's holding up progress on all the other fronts but especially the development of business cases, is that the tools to analyze and manage big data projects are as rare, complicated, and specialized as the high-level statistics and data-integration requirements in a data scientist's job description.
Companies such as Alteryx say their products humanize data and make it simple enough for non-data-specialists to use. Alteryx's Strategic Analytics product includes data-cleansing and management tools designed to give non-specialists the ability to extract data to work with themselves, rather than waiting for a data scientist to have time.
There are good reasons to make complex analytics more accessible to analysts working in business units rather than in IT, according to Shalini Das, research director for the Washington, D.C.-based CIO Executive Board. "According to the research we've done on big data during the past couple of years, about 82% of employees in an average company are knowledge workers who need some analytic skills and access to information to do their jobs," Das said. "Saying we need to restrict access to big data or new analytics to a set of specialists creates a bottleneck where most knowledge workers have to put in a request and then wait for specialists with specialized tools to do their jobs."
The whole point of big data is to make it possible for business people to find answers where they previously found only data. That can't happen if tools available to handle big-data analyses are so complex only data scientists can use them, she said.
Data scientists are an absolute necessity for big companies faced with mountains of raw data they don't know what to do with, according to Mike Boyarski, director of product marketing for business-intelligence/big-data software vendor Jaspersoft, which recently published a survey on the topic.
Parsing, cleaning, de-duplicating and preparing raw text or machine-to-machine data to be analyzed as if it consisted of numbers slotted into cells in a relational-database table is not a job for the faint of heart or shallow of understanding, Boyarski said. Whipping big data into decent shape requires someone very good with statistics, with a deep understanding of how business works and what business units actually need to know. Without that understanding, whoever chooses the sources of a developing big-data set and decides how to use it will tend to skew the project toward the needs of data scientists, not end users who ultimately need the answers, Boyarski said.
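As a rough illustration of the preparation work Boyarski describes, the sketch below parses raw machine-generated text, cleans it, removes duplicates, and produces rows that could be loaded into a relational table. The log format, the field names, and the `to_rows` helper are assumptions made up for this example; they do not come from Jaspersoft or any product mentioned here.

```python
import re

# Hypothetical machine-to-machine log lines (invented for illustration).
RAW_LINES = [
    "2012-08-01 10:00:01 sensor=A7 temp=21.5",
    "2012-08-01 10:00:01 sensor=A7 temp=21.5",     # exact duplicate
    "  2012-08-01 10:00:02 SENSOR=b3 temp=19.0 ",  # stray whitespace, mixed case
    "corrupted ### line",                          # cannot be parsed
]

# Assumed line layout: date, time, sensor id, temperature reading.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"sensor=(?P<sensor>\w+) temp=(?P<temp>-?\d+(?:\.\d+)?)$",
    re.IGNORECASE,
)


def to_rows(lines):
    """Parse, clean, and de-duplicate raw lines into table-ready tuples."""
    rows, seen = [], set()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())  # clean: trim whitespace
        if match is None:
            continue  # drop lines that do not fit the expected layout
        row = (
            match["date"],
            match["time"],
            match["sensor"].upper(),  # clean: normalize sensor ids
            float(match["temp"]),
        )
        if row in seen:
            continue  # de-duplicate identical readings
        seen.add(row)
        rows.append(row)
    return rows


rows = to_rows(RAW_LINES)
for row in rows:
    print(row)
```

Even in this toy case, deciding which lines to discard and which fields to normalize is a judgment call about what the business needs, which is the point Boyarski makes about who should be doing the work.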
Most companies have access to huge amounts of data, but risk creating a system that takes garbage in and puts garbage out if they don't filter, manage, and process both structured and unstructured data to make it work effectively with existing analytics, according to a report from Ventana Research.
According to Jaspersoft's survey of 600 members of its Hadoop open-source big-data analytics community, analytics able to deliver solid information on the experience and attitudes of customers is the No. 1 user requirement for big data projects.
Customer-experience analytics are simply one more tool to give corporate planners some insight into the plans, requirements, and attitudes of their customers--exactly the kind of tool that could enhance and eventually replace spreadsheets as the go-to data tool for corporate planners, Das said. The rest of the top five requirements fall into the same category: customer segmentation and churn analyses; marketing campaign optimization; financial risk analysis; and marketing competitive analysis.
Sixty percent of respondents are using relational databases as their primary big-data store, which makes complex analytics more difficult than with specialized tools, Boyarski said. Even among members of the forums at Jaspersoft, which uses Hadoop as its main big-data filing system, only 18% of respondents use either Hadoop or the big-data-managing MongoDB as their big-data stores, the survey showed.
Of those who responded to the survey, only 6% had business-unit titles. The others were application developers, report developers, or BI system administrators. That mix shows how little the business units are often involved in big-data projects, even though they are supposedly the audience for big data's data-driven revelations, Das said.
"About 85% of the data in corporate environments can't be analyzed with the usual tools available to people in those businesses," Das said. "So people are on board with the need to make decisions using more than 15% of the available information; the tools available and their knowledge of what to do with them are still somewhat lacking, however."
Venture-capital firm Ascent Partners has created a metric based on public discussions about new technologies, and of the three hype-burdened technologies it has measured so far, big data attracted by far the most mentions from April through July. That could mean there is far more interest in big data in general than in BYOD or cloud security, the other two areas Ascent measured, said Ascent blogger Matt Fates.
Most of the discussions were about the scalability of big data and how to parse and analyze increasingly large data sets very quickly, Fates wrote. The acquisition of former social-networking market leader Digg by news aggregator Betaworks in July was at least partially due to Digg's failure to keep the MySQL database that stored user-entered data under control, he wrote.
That failure slowed down the whole service, which put off users who wanted to recommend sites to their friends or find recommended sites, leading to a drop in Digg's estimated value from $160 million to $500,000 by the time Betaworks acquired it, according to the Washington Post. Betaworks said it would combine Digg with its own News.me to produce a news discovery and sharing site.
The Alteryx tools are designed to allow a business analyst to ask a question, then guide him or her through the process of identifying potential sources for relevant data, assembling the data into a single data store, cleaning and enhancing the results with metadata to add context, and then passing the results to analytics and workflow modules.
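The sequence described above (identify sources, assemble them into one store, clean and enrich the results with metadata, then hand off to analytics) can be sketched in a few lines of Python. This is only an illustration of the general workflow under invented assumptions: the sample records, the function names, and the counting step are hypothetical and are not Alteryx's API or implementation.

```python
from collections import Counter

# Step 1: hypothetical data sources an analyst might identify.
SOURCES = {
    "crm": [
        {"customer": "Acme ", "region": "east"},
        {"customer": "Binder", "region": "west"},
    ],
    "support": [
        {"customer": "acme", "region": "East"},
        {"customer": "", "region": "west"},  # missing customer name
    ],
}


def assemble(sources):
    """Step 2: merge every source into one store, keeping each record's origin."""
    return [
        {**record, "source": name}
        for name, records in sources.items()
        for record in records
    ]


def clean(records):
    """Step 3a: normalize text fields and drop records with no customer name."""
    cleaned = []
    for record in records:
        customer = record["customer"].strip().title()
        if not customer:
            continue
        region = record["region"].strip().lower()
        cleaned.append({**record, "customer": customer, "region": region})
    return cleaned


def enrich(records, question):
    """Step 3b: attach metadata that gives each record context."""
    return [{**record, "question": question} for record in records]


def analyze(records):
    """Step 4: stand-in for an analytics module; counts records per region."""
    return Counter(record["region"] for record in records)


question = "Which regions generate the most customer contacts?"
records = enrich(clean(assemble(SOURCES)), question)
contacts_by_region = analyze(records)
print(contacts_by_region)
```

Each stage takes plain records in and returns plain records out, so an analyst could in principle swap in a different source or analysis without touching the other steps.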
That kind of functionality is rare, and its availability will be a critical element in the success or failure of individual big-data projects, or at least in the projects' ability to do the job business analysts want them to do, Das said.
"The question right now is the level of maturity in the market for tools. We are still early in the early-adopter segment of the adoption bell curve, so it's not surprising that tools aren't widely available that make it easier and that encourage the later adopters to use something in greater numbers," she said. "There should be a range of tools with either basic, intermediate or advanced levels of functionality… We're looking for a variety of options, but for the most part, even the basics are not yet in place."