Zones Of Analysis
IW: So that's the velocity side of the big data story, but how are architectures changing to handle the onslaught of volume and variety?
Picciano: In the five use cases I described, we've seen the emergence of what we call analytic zones. The data warehouse is one of the zones inside a big data architecture. Then there's an exploration zone, which is typically Hadoop. Sometimes Hadoop is fronted with a data-discovery capability for indexing and categorization. In our case that's Vivisimo.
Real-time analytics is another zone. That's the stream processing we just talked about, and we see that as an important part of any organization's big data architecture. All of the companies that we're working with, whether it's General Motors or major telcos, have a need to look at information in real time for a variety of problems.
IW: IBM recently announced BLU Acceleration for DB2, which in some ways seemed like a throwback to 2009-2010, when IBM acquired Netezza. Is that still a problem that companies need to solve?
Picciano: It's still a red-hot problem. Most data warehouses are in the under-10-terabyte range and there are a lot of data marts out there... One thing that's been underemphasized about BLU is that it's an in-memory, columnar table type inside of the DB2 database management system. That means we can give anyone who's running transactional applications the best of both worlds by implementing BLU on a highly proven, highly scalable, resilient row-store database [in DB2]. As workloads need the analytical and reporting aspect, you can utilize BLU tables for the ultimate in performance.
IW: So what's the use case for BLU versus PureData For Analytics, formerly known as Netezza?
Picciano: Netezza can handle extraordinarily large collections, and it has been tuned, over the years, for very specific workloads such as retail data and call data records in telco operations. We're talking about petabyte-size collections whereas BLU runs inside of a single system, so it's for collections under 50 terabytes.
IW: That's kind of confusing because DB2 is the enterprise data warehouse product aimed at the ultimate in scale. How does BLU work within DB2?
Picciano: BLU doesn't cluster, but the rest of DB2 does. So inside of a DB2 instance you would have a BLU table. BLU is especially helpful for reporting because one of the things that the in-memory, columnar technology does is perform extraordinarily well -- even if you write some very bad SQL.
For tools like Cognos, BusinessObjects or MicroStrategy, where there are line-of-business users who aren't up on their SQL skills, the database administrators can just create the table, organize it by column and load the information. The tool will generate the SQL and you'll see tremendous performance. You don't have to worry about whether you're going to do a star schema or a snowflake schema or whether you're going to implement multi-dimensional clusters and range partitioning. With BLU, all that goes away. It's like loading up a spreadsheet but it performs like a supercomputer.
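The appeal of column organization for report-style queries can be sketched outside of DB2: a column store only touches the columns a query actually references, so a scan over one or two attributes skips the rest of each row. The following toy example (hypothetical data, not BLU's storage engine) contrasts the two layouts:

```python
# Row-store vs. column-store layout: a toy comparison.
# Hypothetical table; illustrates why analytic scans favor columns.

rows = [
    {"order_id": 1, "region": "East", "amount": 120.0},
    {"order_id": 2, "region": "West", "amount": 75.5},
    {"order_id": 3, "region": "East", "amount": 200.0},
]

# Column-organized copy of the same table: one list per column.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}

# SUM(amount) WHERE region = 'East' reads only two columns;
# the order_id column is never touched, unlike a row-by-row scan
# that must pull every field of every record through memory.
total = sum(amt for reg, amt in zip(columns["region"], columns["amount"])
            if reg == "East")
print(total)  # -> 320.0
```

At real scale the win comes from scanning far less data per query, which is one reason even tool-generated or poorly written SQL can still perform well against a columnar table.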
IW: IBM has drawn competitive comparisons between BLU and SAP Hana, but if Hana is running entirely in RAM and BLU uses a mix of memory and disk, how could it perform any better?
Picciano: That comes down to our actionable compression. With BLU, you don't have to decompress data [to determine whether it fits a query], so you're moving smaller amounts of information in and around memory and applying a [query] predicate inside of that compressed information.
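The idea behind "actionable compression" can be illustrated with order-preserving dictionary encoding, a common column-store technique: because the codes preserve the sort order of the original values, a comparison predicate can be evaluated on the compressed codes directly. This is a minimal sketch of the general approach, not DB2 BLU's internals:

```python
# Predicate evaluation on dictionary-compressed data (general
# technique sketch -- not a description of BLU's implementation).

def build_dictionary(values):
    """Order-preserving dictionary: sorted distinct values -> small codes."""
    distinct = sorted(set(values))
    return {v: code for code, v in enumerate(distinct)}

def encode(values, dictionary):
    return [dictionary[v] for v in values]

def filter_gt(codes, dictionary, threshold):
    """Apply 'value > threshold' using codes alone.

    Because the dictionary preserves order, comparing codes is
    equivalent to comparing the original values, so the column
    never has to be decompressed to evaluate the predicate."""
    # Smallest code whose decoded value exceeds the threshold.
    boundary = min((c for v, c in dictionary.items() if v > threshold),
                   default=len(dictionary))
    return [i for i, c in enumerate(codes) if c >= boundary]

cities = ["Austin", "Boston", "Austin", "Denver", "Boston", "Denver"]
d = build_dictionary(cities)          # {'Austin': 0, 'Boston': 1, 'Denver': 2}
codes = encode(cities, d)             # [0, 1, 0, 2, 1, 2]
hits = filter_gt(codes, d, "Austin")  # row indices where city > 'Austin'
print(hits)                           # -> [1, 3, 4, 5]
```

Beyond skipping decompression, the encoded column is much smaller than the raw values, which is the "moving smaller amounts of information in and around memory" part of the answer.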
IW: I take it that also assumes that the BLU workload is running entirely in memory?
Picciano: In the comparisons that we've run it has been an in-memory-to-in-memory comparison because that's their environment. But remember that when Hana runs out of memory, it's useless. That's a big bet for your company when you're, maybe, trying to go through a year-end close or the quarterly close and you find out that Hana was misconfigured. When you look at the price difference, SAP Hana has to be heavily provisioned with memory, which makes it very expensive. We compare very well, on a price-performance basis and on sheer performance, because of our ability to manage I/O effectively.
IW: Another hot database topic is NoSQL. We noted that IBM recently announced mobile and Web app collaboration with 10Gen, the company behind MongoDB. Why the partnership?
IW: Being a polyglot doesn't necessarily make DB2 the agile development choice people associate with MongoDB or the highly scalable choice people associate with Cassandra. Is IBM going to get into the NoSQL market with a dedicated NoSQL database?
Picciano: If you look at what those databases are used for and where they're used in the application tiers, I think that will be a low-end, very low-margin market for entry-tier "SQL lite." Then there will be more capable systems [like DB2] that can speak that dialect but that have security protocols, maintainability, backups and recovery, whereas Mongo doesn't have any of those capabilities today. We think we can perform very well with Mongo workloads, plus we provide all the things that somebody who is really writing an enterprise app would require.
[Author's note: The MongoDB site offers several pages of documentation on replication functionality aimed at ensuring redundancy and high availability. 10Gen also has a managed backup service in beta.]
IW: What about Cassandra, with its ability to support globally distributed databases? Can IBM address that need?
Picciano: Cassandra is not highly deployed in enterprise configurations. Out of all the engagements that I've worked on, I've run into Cassandra once.
[Author's note: DataStax, which offers Cassandra support, says it has more than 300 customers, including 20 Fortune 100 companies.]