Hoarding useless data makes analytics harder. Companies like Paxata say their brand of analytics lets non-data experts turn data landfills into useful info.
Companies of all sorts are now in the garbage business. Without even thinking about it, companies collect so much data that they have data garbage dumps, filled up with bad data.
The big difference between data dumps and real landfills is the smell; bad data doesn't have the same odor. That's probably one reason companies keep collecting data they don't need. Another is that keeping data is cheap, and it has gotten cheaper in the last few years. But cheap storage encourages hoarding, and the more a company hoards, the harder it becomes to find and compare the data that matters.
"There's so much data from different places and in different formats. It's very difficult to treat that data," says Jon Oltsik, an analyst at Enterprise Strategy Group in Milford, Mass.
The rise of post-relational database tools such as Hadoop, MongoDB, and Cassandra has lowered data storage costs, says Nenshad D. Bardoliwalla, cofounder and vice president of product at Paxata, a startup that uses machine learning and analytics to automate and accelerate the data preparation stage of big data. Companies no longer need to think about what they're storing.
"Companies have flipped their mentality to just store it all, rather than just the data they really want," he says.
Bardoliwalla was at Hyperion in an earlier era of data warehousing, and others involved in founding Paxata were at SAP, Tibco, and Guidewire.
Paxata's founders think they've used analytics to help turn big data landfills into compost. They argue that the real problem companies face is preparing data, which is time-consuming and costly. Bardoliwalla says data preparation today happens either through arduous hand coding, with specialists using tools like Informatica and Trillium, or through attempts to scrub data in Excel.
Paxata applies analytic techniques to data sources to determine, for instance, whether Michael Fitzgerald, Mike Fitzgerald, and M Fitzgerald in different databases might all be the same person. Its software works the answer out on its own, so a user doesn't have to inspect each record manually. For very large data sets, that promises huge time savings.
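Paxata doesn't disclose how its matching works, but the general idea behind this kind of duplicate detection can be sketched with fuzzy string matching. Here is a minimal illustration using Python's standard-library difflib; the threshold of 0.6 is an arbitrary assumption for the example, not a value from Paxata:

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    # Case-insensitive character-level similarity, from 0.0 to 1.0.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def probable_duplicates(names, threshold=0.6):
    # Return every pair of names whose similarity meets the threshold,
    # flagging them as likely referring to the same person.
    return [(a, b) for a, b in combinations(names, 2)
            if similarity(a, b) >= threshold]

records = ["Michael Fitzgerald", "Mike Fitzgerald", "M Fitzgerald"]
print(probable_duplicates(records))
# All three pairs score well above 0.6, so all are flagged as matches.
```

A real system would go far beyond this sketch, weighing other fields (email, address, company) and learning from corrections, but the core move is the same: score candidate pairs and surface the likely matches so no human has to eyeball every record.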
"The value there is exactly as they say," Oltsik said. He has no ties to Paxata and has not looked at its product.
Paxata's target user is someone like the company's vice president of marketing, an experienced user of Excel, but not a "super jock." She needs information from disparate sources, and needs to know things such as whether a sales lead is a duplicate, and if information about it is correct. Providing that context to data sets is one of the things that costs analysts precious time.
The rule of thumb is that data preparation takes up 80% to 90% of the time people spend on data, leaving a small fraction of time for actual analysis. "People pour things into the data landfill. They don't even know it's there," Bardoliwalla says. "There's a huge discoverability problem that needs intelligent algorithmic techniques and visualization techniques to allow computers to do the heavy lifting."
Bardoliwalla wants to flip the ratio of time that analysts spend on data, so they can spend 80% of their time analyzing data sets. There is value in data, but getting to the value might be more expensive than the data is worth, like ore buried too deeply in a mine.
Paxata says it has about a dozen customers, including data storage firm Box; Dannon, the American unit of French yogurt maker Groupe Danone; and the big Swiss financial firm UBS. It is also not alone in the market: just today I received an email offering a pre-briefing on a similar product from another data company.
Perhaps some day soon companies will spend their time making hay from their data.
Michael Fitzgerald writes about the power of ideas and the people who bring them to bear on business, technology and culture.