Big Data: The Early Days Are Over

Netezza, Teradata, Aster Data, Datameer and IBM data warehousing customers share their stories.
At least five [Update: make that six as of late Friday] innovative data warehousing practitioners have stepped up to share their stories for our "mastering big data" feature article planned for August 9. Their accounts show that application-specific needs are diverse, making generic speed-and-feed and TPC-H benchmark claims all the more irrelevant.

I'll get to the list of my latest customer interviews in a moment, but first a refresher. As I detailed in this column, the big-data era isn't new. Despite claims that the market is suddenly red-hot (now that the big vendors have finally responded), data volumes have been steadily growing for years.

Image Gallery: 11 Leading Data Warehousing Appliances
Pioneering independent vendors have led the way toward highly scalable, performance-oriented approaches including massively parallel processing (MPP), column-store databases, in-database analysis and, more recently, NoSQL. Going back to the well of stories published in recent years, consider the examples of Sweden's TradeDoubler and India's Reliance Communications, shared in this story posted in June 2008.

TradeDoubler is a pan-European new-media marketing firm that needed faster load speed and analytic performance than it could achieve in an existing Oracle deployment. The company chose Infobright, which offers a column-store database that runs on commodity symmetric multiprocessor (SMP) hardware -- TradeDoubler chose a $12,500 Dell server that's probably much cheaper today.

In June '08, TradeDoubler had more than 125,000 Web sites in its network and was tracking 20 billion ad impressions, 265 million unique visitors and 12 million leads per month. The mart retains only three days' worth of clickstream data and 60 days' worth of aggregated online order data, so it was actually less than a terabyte in size. But with rapid data turnover, TradeDoubler was loading 2 billion rows of data per day, and it was hitting a wall.

"We had one person working with the data full time, but depending on the complexity of the queries, it took anywhere from half a day to two days to get the data out," explained CTO Ola Uden.

TradeDoubler was able to load, rebuild and query the Infobright database all within the same day. The gains were due partly to the column-store compression (said to be 30 times that of a row-oriented relational database) and partly to the fact that Infobright auto-indexes and doesn't need the partitioning and tuning required to make relational databases perform. (Infobright says its database requires up to 90% less admin work than Oracle, Microsoft SQL Server or IBM DB2 and is half the cost in terms of licensing and storage requirements.)
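Infobright's engine is proprietary, but the compression advantage of column orientation is easy to sketch. In the toy example below (not Infobright's actual algorithm -- a generic run-length-encoding illustration), storing each column contiguously groups repeated values into long runs that compress well, whereas the same values interleaved row by row produce no runs at all:

```python
# Illustrative sketch of why column-oriented storage compresses well:
# repeated values within one column form long runs, so run-length
# encoding (RLE) shrinks them; row-interleaved data has no such runs.

def run_length_encode(values):
    """Compress a sequence into (value, run_length) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

# Toy clickstream table: (country, ad_id) rows, sorted by country.
rows = [("SE", 101), ("SE", 102), ("SE", 101), ("DE", 101), ("DE", 103)]

# Row store: fields interleaved, so adjacent values rarely repeat.
row_stream = [field for row in rows for field in row]

# Column store: each column stored contiguously, so runs are long.
country_col = [r[0] for r in rows]
ad_col = [r[1] for r in rows]

print(len(run_length_encode(row_stream)))   # 10 entries -- no compression
print(len(run_length_encode(country_col)) +
      len(run_length_encode(ad_col)))       # 6 entries -- columns compress
```

Real column stores layer dictionary encoding, bit-packing and block-level metadata on top of this idea, which is also what lets them skip most of the manual indexing and partitioning a row store needs.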

TradeDoubler's example is one of big-data loading and turnover rather than sheer scale, and it's a common workload requirement in Web clickstream analysis. TradeDoubler could have built a larger-scale, higher-performance Oracle-based warehouse (even before the fall 2009 introduction of Oracle Exadata V2), but Uden said the costs would have been much higher than its Infobright investment.

Reliance Communications is one of India's largest and fastest-growing telcos. Back in 2008 it was adding some 1.5 million customers per month. The resulting flood of data was maxing out a 50-terabyte Oracle data warehouse, so in early 2007, the company decided to offload a call-data-record (CDR) data mart application. It chose a 60-terabyte Greenplum MPP database deployment. In 2008, Reliance added a 120-terabyte configuration for a total of 180 terabytes of capacity.

Indeed, storage was more of a priority than speed for this particular application. Reliance needed to retain CDRs for compliance reasons. In a police investigation, for instance, law enforcement officials might ask Reliance for a complete record of all calls a particular subscriber made or received during a certain time period. The Indian government requires CDRs to be retained for 13 months, and with nearly one billion new calls made each day, the demands were massive.

"Access to CDRs is not very frequent, but we needed fast loading and fast retrieval for large amounts of data," said Raj Joshi, vice president of decision support systems.

Speed wasn't really the point of the deployment, but queries that previously took two to three hours were returned in 30 minutes on the Greenplum platform. Joshi said the cost savings over a conventional data warehouse were also "substantial."

Storage-hungry customers such as Reliance surely figured in EMC's recently announced plan to acquire Greenplum. Competitors have also taken note of the niche; in late 2008, Teradata added the Teradata 1550 Extreme Data Appliance, aimed at telco CDRs, Web clickstream and other extreme-scale applications involving up to 50 petabytes of information. This isn't an application you can affordably address with a one-size-fits-all box.

These are just two examples of diverse needs that had customers looking for a better way back when independents offered the only alternatives to conventional data warehouse deployments. In fact, if you're looking to upgrade, you should consider at least six dimensions of scalability: data size, number of users, data complexity, query volume, data latency and query complexity.

All the better if you can read about or talk to customers who tackled a deployment that's similar to the one you are contemplating. Which brings me to the list of real-world case examples I plan to share in my upcoming "mastering big data" article:

  • Catalina Marketing. I summarized this massive Netezza/SAS deployment in this article, and at 2.5 petabytes, I expect it will hold up as the biggest big-data example in my story. Interestingly, Catalina built its own MPP warehouse before it switched to Netezza in 2003.
  • Cabela's. This hunting and fishing direct marketer and retailer was a pioneer of in-database analysis even before its data warehousing and analytics vendors, Teradata and SAS, respectively, got together to productize the combination.
  • Hutchison 3G. A major mobile phone provider in the UK, Hutchison deployed IBM's Smart Analytic System late last year. It's already using the in-database analytics built into the system's data-mining capabilities. Pilot testing is underway on IBM SPSS predictive analytics that will model customer churn.
  • Barnes & Noble. This multi-channel retailer needed to consolidate dozens of terabytes spread across nine separate Oracle data warehouses. It implemented Aster Data's platform this spring and it's using Map/Reduce techniques as well as in-database analytics.
  • McAfee. McAfee is using Datameer's Katta search solution and pilot testing its Datameer Analytics Solution for analyzing Internet traffic on Hadoop for global threat detection. Very cutting edge.
  • Adknowledge. This digital marketing firm was an early adopter of both Netezza and Hadoop. It still handles the truly big-data loads in Hadoop, including instances in Amazon's cloud. Early this year it replaced Netezza with a bigger Greenplum deployment because it was outgrowing the aging Netezza box and the company "wanted to switch to commodity hardware." (My interview subject said Netezza was moving to its commodity TwinFin platform at the time, and the company "gave it a look," but went ahead with Greenplum.) Adknowledge is now moving many tasks from Hadoop into Greenplum because it can handle the scale (where the dated Netezza box couldn't) and analysts can use plain old SQL.

I have yet to hear from Oracle about a customer willing to talk about a successful data warehousing deployment of Exadata V2. Oracle had a lot to say to Bob Evans about customers, performance and Exadata V2's unique ability to address both transactional (OLTP) and analytic (data warehousing) needs. By one Gartner estimate, OLTP accounts for 60% to 70% of database license revenue, but more than 75% of the growth is attributable to data warehousing. That's because warehouses are where data is retained for analysis rather than regularly purged for keep-the-lights-on processing.

I don't doubt that there are successful Exadata customers out there. I've seen headlines noting customer wins and I've talked to analysts who have interviewed customers. But as analyst Merv Adrian commented in response to my "Seven Questions for Oracle" column, "I'm not surprised production references are still scarce after just two or three quarters of selling a product like this."

That's really my point. The early days for Teradata were in the 1990s, and then it got a competitive wake-up call in the middle of this decade. The early days for Netezza and Greenplum were in 2004 and 2005. The early days for Aster Data were in 2007. Right now we're seeing the early days of things like Hadoop and NoSQL alternatives that may change the data-analysis market even more dramatically than what we've seen over the last decade.

There's an old saying from the times of the Wild West that the pioneers get the arrows and the settlers get the land. Maybe that analogy will ultimately apply to Oracle and Microsoft, both now entering the scale-out data warehousing space (as well as scale-out OLTP, in Oracle's case). For now, I'm looking for proven production deployments that will show you what's possible within your enterprise.
