7/9/2013 03:48 PM

IBM And Big Data Disruption: Insider's View

IBM's Bob Picciano, general manager of Information Management, talks up five big data use cases and Hadoop-driven change; slams SAP Hana and NoSQL databases.

Zones Of Analysis

IW: So that's the velocity side of the big data story, but how are architectures changing to handle the onslaught of volume and variety?

Picciano: In the five use cases I described, we've seen the emergence of what we call analytic zones. The data warehouse is one of the zones inside a big data architecture. Then there's an exploration zone, which is typically Hadoop. Sometimes Hadoop is fronted with a data-discovery capability for indexing and categorization. In our case that's Vivisimo.

Real-time analytics is another zone. That's the stream processing we just talked about, and we see that as an important part of any organization's big data architecture. All of the companies that we're working with, whether it's General Motors or major telcos, have a need to look at information in real time for a variety of problems.
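To make the zone layout concrete, here is a small, purely illustrative Python sketch; the zone names and routing rules are assumptions for the example, not IBM's architecture.

```python
# Illustrative only: a toy dispatcher for the "analytic zones" described above.
# Zone names and routing rules are assumptions for this sketch, not an IBM design.
from dataclasses import dataclass, field

@dataclass
class Zones:
    realtime: list = field(default_factory=list)     # stream processing
    exploration: list = field(default_factory=list)  # raw landing area (e.g., Hadoop)
    warehouse: list = field(default_factory=list)    # curated, structured data

def route(event: dict, zones: Zones) -> None:
    """Land every event in the exploration zone; copy urgent events to the
    real-time zone and curated records to the warehouse zone."""
    zones.exploration.append(event)
    if event.get("urgent"):
        zones.realtime.append(event)
    if event.get("schema") == "curated":
        zones.warehouse.append(event)

zones = Zones()
route({"id": 1, "urgent": True, "schema": "raw"}, zones)
route({"id": 2, "urgent": False, "schema": "curated"}, zones)
print(len(zones.realtime), len(zones.exploration), len(zones.warehouse))  # 1 2 1
```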

IW: IBM recently announced BLU Acceleration for DB2, which in some ways seemed like a throwback to 2009-2010, when IBM acquired Netezza. Is that still a problem that companies need to solve?

Picciano: It's still a red-hot problem. Most data warehouses are in the under-10-terabyte range and there are a lot of data marts out there... One thing that's been underemphasized about BLU is that it's an in-memory, columnar table type inside of the DB2 database management system. That means we can give anyone who's running transactional applications the best of both worlds by implementing BLU on a highly proven, highly scalable resilient row-store database [in DB2]. As workloads need the analytical and reporting aspect, you can utilize BLU tables for the ultimate in performance.
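To make the "best of both worlds" point concrete, here is a minimal sketch using IBM's ibm_db Python driver; the connection string, table names, and columns are placeholders. The DDL simply contrasts a row-organized transactional table with a column-organized (BLU) reporting table in the same DB2 10.5 database.

```python
# A minimal sketch (connection string, table names and columns are placeholders)
# showing a row-organized OLTP table and a column-organized BLU table living
# side by side in one DB2 10.5 database, via IBM's ibm_db driver.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=SAMPLE;HOSTNAME=db2host;PORT=50000;UID=db2inst1;PWD=secret;", "", "")

# Transactional table: the proven, scalable row store.
ibm_db.exec_immediate(conn, """
    CREATE TABLE orders (
        order_id   INTEGER NOT NULL,
        customer   INTEGER,
        amount     DECIMAL(12,2),
        order_date DATE
    ) ORGANIZE BY ROW""")

# Reporting table: an in-memory, columnar BLU table in the same database.
ibm_db.exec_immediate(conn, """
    CREATE TABLE orders_reporting (
        order_id   INTEGER,
        customer   INTEGER,
        amount     DECIMAL(12,2),
        order_date DATE
    ) ORGANIZE BY COLUMN""")

ibm_db.close(conn)
```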

IW: So what's the use case for BLU versus PureData For Analytics, formerly known as Netezza?

Picciano: Netezza can handle extraordinarily large collections, and it has been tuned, over the years, for very specific workloads such as retail data and call data records in telco operations. We're talking about petabyte-size collections whereas BLU runs inside of a single system, so it's for collections under 50 terabytes.

IW: That's kind of confusing because DB2 is the enterprise data warehouse product aimed at the ultimate in scale. How does BLU work within DB2?

Picciano: BLU doesn't cluster, but the rest of DB2 does. So inside of a DB2 instance you would have a BLU table. BLU is especially helpful for reporting because one of the things that the in-memory, columnar technology does is perform extraordinarily well -- even if you write some very bad SQL.

For tools like Cognos, BusinessObjects or MicroStrategy, where there are line-of-business users who aren't up on their SQL skills, the database administrators can just create the table, organize it by column and load the information. The tool will generate the SQL and you'll see tremendous performance. You don't have to worry about whether you're going to do a star schema or a snowflake schema or whether you're going to implement multi-dimensional clusters and range partitioning. With BLU, all that goes away. It's like loading up a spreadsheet but it performs like a supercomputer.
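Continuing that sketch (same assumed connection and placeholder names), the fragment below loads the column-organized table and runs the kind of plain aggregate a BI tool might generate; note that no star schema, multi-dimensional clustering, or range partitioning appears anywhere.

```python
# Continues the previous sketch: load the column-organized table and query it
# with a plain, tool-generated-style aggregate. No star/snowflake schema,
# multi-dimensional clustering, or range partitioning is declared anywhere.
import ibm_db

conn = ibm_db.connect(
    "DATABASE=SAMPLE;HOSTNAME=db2host;PORT=50000;UID=db2inst1;PWD=secret;", "", "")

stmt = ibm_db.prepare(conn, "INSERT INTO orders_reporting VALUES (?, ?, ?, ?)")
ibm_db.execute(stmt, (1, 42, 199.90, "2013-07-01"))
ibm_db.execute(stmt, (2, 42, 15.00, "2013-07-02"))

result = ibm_db.exec_immediate(conn, """
    SELECT customer, SUM(amount) AS total
    FROM orders_reporting
    GROUP BY customer
    ORDER BY total DESC""")

row = ibm_db.fetch_assoc(result)
while row:
    print(row["CUSTOMER"], row["TOTAL"])
    row = ibm_db.fetch_assoc(result)

ibm_db.close(conn)
```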

IW: IBM has drawn competitive comparisons between BLU and SAP Hana, but if Hana is running entirely in RAM and BLU uses a mix of memory and disk, how could it perform any better?

Picciano: That comes down to our actionable compression. With BLU, you don't have to decompress data [to determine whether it fits a query], so you're moving smaller amounts of information in and around memory and applying a [query] predicate inside of that compressed information.
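The "actionable compression" idea can be sketched in a few lines of Python. The order-preserving dictionary encoding below is a simplification, not BLU's actual format, but it shows why a range predicate can be evaluated against encoded values without decompressing them.

```python
# Illustrative sketch of evaluating a predicate on encoded (compressed) data.
# Order-preserving dictionary encoding is a simplification of BLU's actual
# compression, used here only to show why no decompression step is needed.

values = ["apple", "banana", "cherry", "banana", "plum", "apple", "plum"]

# Build an order-preserving dictionary: code order mirrors value order.
dictionary = {v: code for code, v in enumerate(sorted(set(values)))}
encoded = [dictionary[v] for v in values]          # the "compressed" column

# Predicate: value >= 'cherry'. Encode the literal once...
threshold = dictionary["cherry"]

# ...then compare codes directly; the column itself is never decoded.
matching_rows = [i for i, code in enumerate(encoded) if code >= threshold]
print(matching_rows)   # row positions where value >= 'cherry'
```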

IW: I take it that also assumes that the BLU workload is running entirely in memory?

Picciano: The comparisons we've run have been in-memory-to-in-memory comparisons, because that's their environment. But remember that when Hana runs out of memory, it's useless. That's a big bet for your company when you're, maybe, trying to go through a year-end close or the quarterly close and you find out that Hana was misconfigured. When you look at the price difference, SAP Hana builds in a lot of buffer on the amount of memory required, which makes it very expensive. We compare very well, on a price-performance basis and on sheer performance, because of our ability to manage I/O effectively.

IW: Another hot database topic is NoSQL. We noted that IBM recently announced a mobile and Web app collaboration with 10Gen, the company behind MongoDB. Why the partnership?

Picciano: The 10Gen work is about JavaScript and reporting, but we now support JSON natively inside DB2. DB2 is really becoming a polyglot. DB2 has an Oracle compatibility feature, for example, so you can put an Oracle application directly on DB2 without having to migrate. DB2 also speaks RDF for graph data. You see this polyglot theme in our big data initiatives as well. We've put Big SQL [IBM's SQL-on-Hadoop answer to Cloudera Impala] inside of IBM BigInsights 2.0. That opens up the available skills to work with the data in Hadoop.
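As a rough illustration of how existing SQL skills carry over to Hadoop through Big SQL, the sketch below assumes a Big SQL ODBC data source (the DSN, credentials, table, and columns are all placeholders) and issues an ordinary SQL aggregate against a Hadoop-resident table via pyodbc.

```python
# A hedged sketch of the "SQL skills carry over to Hadoop" point: query a
# Hadoop-resident table through Big SQL with ordinary SQL. The ODBC DSN,
# credentials, table name and columns are placeholders, not a documented setup.
import pyodbc

conn = pyodbc.connect("DSN=BIGSQL;UID=biadmin;PWD=secret")
cursor = conn.cursor()

# Plain SQL against data that physically lives in Hadoop (HDFS/Hive tables).
cursor.execute("""
    SELECT region, COUNT(*) AS events
    FROM weblogs
    GROUP BY region
    ORDER BY events DESC""")

for region, events in cursor.fetchall():
    print(region, events)

conn.close()
```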

IW: Being a polyglot doesn't necessarily make DB2 the agile development choice people associate with MongoDB or the highly scalable choice people associate with Cassandra. Is IBM going to get into the NoSQL market with a dedicated NoSQL database?

Picciano: If you look at what those databases are used for and where they're used in the application tiers, I think that will be a low-end, very low-margin market for entry-tier, SQL-light databases. Then there will be more capable systems [like DB2] that can speak that dialect but that have security protocols, maintainability, backups and recovery, none of which Mongo has today. We think we can perform very well with Mongo workloads, plus we provide all the things that somebody who is really writing an enterprise app would require.

[Author's note: The MongoDB site offers several pages of documentation on replication functionality aimed at ensuring redundancy and high availability. 10Gen also has a managed backup service in beta.]
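For context on that note, here is a minimal pymongo sketch (host names and the replica-set name are placeholders, and a reasonably recent driver is assumed) showing how an application connects to a MongoDB replica set and rides through a primary failover.

```python
# A minimal sketch of connecting to a MongoDB replica set from an application.
# Host names and the replica-set name ("rs0") are placeholders; assumes a
# reasonably recent pymongo driver.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0")

events = client.appdb.events
events.insert_one({"user": "alice", "action": "login"})   # written to the primary

# Reads go to the primary by default; the driver rediscovers the topology and
# fails over to a newly elected primary if the current one becomes unavailable.
print(events.count_documents({"user": "alice"}))
```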

IW: What about Cassandra, with its ability to support globally distributed databases? Can IBM address that need?

Picciano: Cassandra is not highly deployed in enterprise configurations. Out of all the engagements that I've worked on, I've run into Cassandra once.

[Author's note: DataStax, which offers Cassandra support, says it has more than 300 customers, including 20 Fortune 100 companies.]

Comments
D. Henschen, User Rank: Author
7/10/2013 | 10:02:01 PM
re: IBM And Big Data Disruption: Insider's View
I was surprised by Picciano's dismissive take on MongoDB and Cassandra. Oracle seems to be taking NoSQL more seriously, but then, they had Berkeley DB IP to draw from when they developed the Oracle NoSQL database. I'd note that MySQL has offered NoSQL data-access options for some time, but that hasn't curbed the rapid growth of NoSQL databases including Cassandra, Couchbase, MongoDB, Riak and others. DB2 may have NoSQL access, but cost, development speed and, frankly, developer interest in using it for Web and mobile apps just isn't the same as what we're seeing with new-era options.

I was also surprised by the idea of running Hadoop on the mainframe, but then, Cray recently put Hadoop on one of its supercomputers. That's not exactly cheap, commodity hardware.
DAVIDINIL, User Rank: Strategist
7/11/2013 | 5:49:13 PM
re: IBM And Big Data Disruption: Insider's View
Good piece, Doug.
LoriV01, User Rank: Apprentice
7/11/2013 | 6:35:59 PM
re: IBM And Big Data Disruption: Insider's View
Thank you, Doug, for your post. For clarification, SAP HANA does not need to decompress data in order to determine whether or not it fits a query. SAP HANA can select and run operations on compressed data. When data needs to be decompressed, that does not happen until it is already in the CPU cache. Also, if an SAP HANA system should run low on memory, columns (selected by LRU mechanisms) are unloaded from memory down to the data volume (HANA-organized disks) in a manner that leverages database know-how, thus avoiding the OS's usual brute-force swapping. Of course, SAP offers scale-out capabilities with the SAP HANA platform so that customers can grow their deployments to multiple nodes, supporting multi-terabyte data sets.
Rob Klopp, User Rank: Apprentice
7/12/2013 | 7:36:37 PM
re: IBM And Big Data Disruption: Insider's View
Here is a description of how HANA utilizes memory (http://wp.me/p1a7GL-lo) to better inform Mr. Picciano. This information is available to IBM via the HANA Blue Book and other resources, as they are one of SAP's best partners and very active in the HANA community.

BTW: The surprise to me was that Netezza is the preferred solution for petabyte-sized collections... but not below 50 TB. I do not believe that they have a large footprint in the space above a petabyte... and Hadoop plays somewhere in that petabyte place?
paulzikopoulos, User Rank: Apprentice
7/22/2013 | 3:43:42 PM
re: IBM And Big Data Disruption: Insider's View
@rklopp, I think Mr. Picciano's understanding of memory usage is EXACTLY in line with the blog posting you point to. In fact, that blog posting clearly states, "in other words where there is not enough memory to fit all of the vectors in memory even after flushing everything else out… the query fails." That's EXACTLY what Mr. Picciano points out when he talks about how a client might have issues at a Qtr-end close when they start to really stress the system. From what I can tell, and DO correct me (my wife always does, swiftly I may add) if I've read the paper you sent us to wrong, SAP HANA resorts to an entire column partition as the smallest unit of memory replacement in its LRU algorithm. All other vendors that I know of (including columnar ones that I've looked at) work on a much better block/page-level memory replacement algorithm. In today's Big Data world, I just find it unacceptable to require a client to have to fit all their active data into memory; I talk to enough of them that this just doesn't seem to be reality.
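The granularity argument is easier to see in a toy example. The Python sketch below is purely illustrative and reflects neither vendor's actual implementation; it only contrasts evicting a whole column at a time with evicting fixed-size pages under the same memory budget.

```python
# Purely illustrative: contrast whole-column eviction with page-level eviction
# under the same memory budget. Neither variant reflects an actual vendor
# implementation; the point is that a coarser eviction unit frees memory in
# bigger, blunter steps.
from collections import OrderedDict

PAGE_ROWS = 2  # rows per page in the page-level variant

def evict_whole_columns(columns, budget_rows):
    """Evict the oldest (a stand-in for least-recently-used) whole columns
    until the in-memory row count fits the budget."""
    while sum(len(v) for v in columns.values()) > budget_rows and columns:
        name, _ = columns.popitem(last=False)
        print(f"evicted entire column {name!r}")

def evict_pages(pages, budget_rows):
    """Evict the oldest pages (small slices of a column) instead."""
    while sum(len(v) for v in pages.values()) > budget_rows and pages:
        key, _ = pages.popitem(last=False)
        print(f"evicted page {key}")

cols = OrderedDict(amount=[1, 2, 3, 4], region=[5, 6, 7, 8])
evict_whole_columns(cols, budget_rows=6)   # frees 4 rows in one blunt step

pages = OrderedDict()
for col, vals in [("amount", [1, 2, 3, 4]), ("region", [5, 6, 7, 8])]:
    for start in range(0, len(vals), PAGE_ROWS):
        pages[(col, start // PAGE_ROWS)] = vals[start:start + PAGE_ROWS]
evict_pages(pages, budget_rows=6)          # frees only 2 rows, then stops
```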
paulzikopoulos, User Rank: Apprentice
7/22/2013 | 3:46:37 PM
re: IBM And Big Data Disruption: Insider's View
@Lori Vanourek, please see my response to rklopp894 regarding the inefficient column-partition replacement LRU algorithm that Mr. Picciano was referring to. With respect to decompression, you actually call out the difference Mr. Picciano is stating. You say that decompression "is not done until it is already in the CPU cache." And THAT IS the issue: you have to decompress the data when loading it into registers from cache so that you can evaluate the query. DB2 with BLU Acceleration doesn't decompress the data. In fact, the data stays compressed and encoded in the registers for predicate evaluation (including range predicates, not just equality) as well as join and aggregate processing. That's the clear advantage that Mr. Picciano is pointing out for DB2.
paulzikopoulos, User Rank: Apprentice
7/22/2013 | 4:00:05 PM
re: IBM And Big Data Disruption: Insider's View
Sorry @rklopp894, I just realized that I didn't respond to your BTW comment. Mr. Picciano did not say that Netezza can't do under 50 TB at all; in fact, there are loads of PureData for Analytics systems (which many will know through the Netezza name) that are below 50 TB. Hadoop indeed plays in that petabyte space as well (and below, for that matter), and there is tight integration between Netezza and Hadoop (not to mention IBM has its own non-forked distribution called BigInsights, which you get a limited-use license for free with Netezza). What's more, Netezza lets you execute in-database MapReduce programs, which can really bridge the gap for the right applications and provide a unified programming method across the tiers (Netezza and Hadoop).