IBM And Big Data Disruption: Insider's View

IBM's Bob Picciano, general manager of Information Management, talks up five big data use cases, Hadoop-driven change; slams SAP Hana, NoSQL databases.

Doug Henschen, Executive Editor, Enterprise Apps

July 9, 2013

Bob Picciano, General Manager, IBM Information Management

What's IBM's take on Hadoop as the new enterprise data warehouse and disruptor of data-integration and mainframe workloads? Bob Picciano, appointed in February as general manager of IBM's Information Management Software Division, says there's no doubt that Hadoop will displace certain workloads, but he's more dismissive about NoSQL and upstart databases including SAP Hana.

A veteran of multiple IBM software units, including a prior stint in Information Management, Picciano is tasked with revitalizing a business that has been "a little flat for the last year or two," according to Gartner Analyst Merv Adrian. "No doubt their feistiness is going to be more evident," Adrian says.

Feisty is a fitting description for Picciano, who in this in-depth interview with InformationWeek is at turns effusive and dismissive. He talks at length about IBM's vision for five big data use cases while ceding nothing to database competitors SAP, MongoDB and Cassandra. Read on for the big data perspective from inside IBM.

InformationWeek: There's a growing view that companies will use Hadoop as a reservoir for big data. Everyone agrees that conventional databases will still have a role, but some see big enterprise data warehouses being displaced. What's your view?

Bob Picciano: Sometimes people drastically overuse the term big data. We've done more than 3,000 engagements with customers around our big data portfolio, and almost all of them have fallen into one of five predominant use cases: developing a 360-degree view of the customer; understanding operational analytics; addressing threat, fraud and security; analyzing information that you didn't think was useable before; and offloading and augmenting data warehouses.

Some of these use cases are more Hadoop-oriented than others. If you think about exploring new data types including a high degree of unstructured data, for example, it doesn't make sense to transform that data into structured information and put it into a data warehouse. You'd use Hadoop for that. We have an offering called Data Explorer, which is based on our Vivisimo acquisition, that helps index and categorize unstructured information so you can navigate, visualize, understand and correlate it with other things.

Operational analytics is another use case involving Hadoop. There we just delivered a new offering with our Smart Cloud and Smarter Infrastructure that focuses on helping clients to pull in and analyze log information to spot events that could be used to help improve the resiliency of operational systems.

In the case of developing a 360-degree view of customers, maybe you have a system of master data [like CRM], so you have customer data files, but how do you also include information from public or social domains?... And how do you sew together interactions on Web pages? That's very much a Hadoop data workload.

IW: IBM has a Hadoop offering (with IBM BigInsights), but so, too, do Microsoft, Oracle, Pivotal, Teradata, Cloudera and others. How does IBM stand out in the big data world?

Picciano: One of the use cases that's unique to IBM is streaming analytics. In a big data world, sometimes the best thing to do is persist your question and have the data run through that question continuously rather than finding a better place to persist the data. Hadoop is, in many ways, just like a different kind of big database. That may be insufficient to differentiate company performance on a variety of different workloads.

Data is becoming a commodity, information is becoming a commodity and even insight is becoming a commodity. What's going to become a differentiator is how fast you can develop that insight. If you have to pour data into a big data lake on Hadoop and then interrogate that information, then you have to figure out, "is this the right day to ask that question?" With streaming analytics you can ask important questions continuously.
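To make "persist your question" concrete, here is a minimal Python sketch of a standing query: the question (a rolling-average threshold) stays fixed while readings flow through it one at a time. It is an illustration only, not IBM InfoSphere Streams code; the function name `standing_query`, the window size and the limit are all assumptions chosen for the example.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def standing_query(
    readings: Iterable[float], window: int, limit: float
) -> Iterator[Tuple[int, float]]:
    """Yield (position, rolling mean) whenever a full window's mean exceeds the limit."""
    recent = deque(maxlen=window)  # holds only the most recent `window` readings
    for position, value in enumerate(readings):
        recent.append(value)
        mean = sum(recent) / len(recent)
        # The question is evaluated on every arrival; nothing is stored for later.
        if len(recent) == window and mean > limit:
            yield position, mean


# Hypothetical readings; in practice this would be an unbounded feed.
stream = [10.0, 12.0, 11.0, 30.0, 40.0, 12.0]
alerts = list(standing_query(stream, window=3, limit=20.0))
```

Because the function is a generator, it never needs the whole data set in one place, which is the contrast Picciano draws with loading everything into a data lake first and querying it afterward.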

IW: Aspirations around the Internet of Things seem to be reinvigorating the complex event processing market. Is this the kind of analysis you're talking about?

Picciano: Yes. If you think about machine-to-machine data and areas like health care and life sciences, we've done some great work with amazing institutions like UCLA and the Hospital for Sick Children in Toronto by analyzing data in motion with IBM InfoSphere Streams. When you look at neonatal care, for example, a nurse typically comes by once an hour and writes down vital signs. That's one chart point, and they'll come back in another hour and so on. But there's so much volatility around blood oxygen levels, heart rates and respiratory rates. By streaming that information and analyzing on a constant basis, you can spot when infants are having what they call spells, which increase their susceptibility to life-threatening infections. In some instances they can also over-oxygenate babies, and when that happens they can go blind.

IW: You hear a lot of talk about real-time applications, but there seem to be far fewer real-world examples. Is real-time really in high demand?

Picciano: There are many other examples. In the telco space, providers are constantly trying to analyze call quality and spot potentially fraudulent activity. They typically do that based on call data records that they load into a warehouse on a daily basis. We're doing it in real time so there's a whole different degree of remediation for customer experience management. We can identify dropped calls and whether they were related to call quality. You can look at the profile of callers, particularly pre-paid callers, and see if they're trying to burn up their minutes. That means they're likely to churn to another carrier, but we've found that there are ways to intercede in those cases and prevent churn.

Zones Of Analysis

IW: So that's the velocity side of the big data story, but how are architectures changing to handle the onslaught of volume and variety?

Picciano: In the five use cases I described, we've seen the emergence of what we call analytic zones. The data warehouse is one of the zones inside a big data architecture. Then there's an exploration zone, which is typically Hadoop. Sometimes Hadoop is fronted with a data-discovery capability for indexing and categorization. In our case that's Vivisimo.

Real-time analytics is another zone. That's the stream processing we just talked about, and we see that as an important part of any organization's big data architecture. All of the companies that we're working with, whether it's General Motors or major telcos, have a need to look at information in real time for a variety of problems.

IW: IBM recently announced BLU Acceleration for DB2, which in some ways seemed like a throwback to 2009-2010, when IBM acquired Netezza. Is that still a problem that companies need to solve?

Picciano: It's still a red-hot problem. Most data warehouses are in the under-10-terabyte range and there are a lot of data marts out there... One thing that's been underemphasized about BLU is that it's an in-memory, columnar table type inside of the DB2 database management system. That means we can give anyone who's running transactional applications the best of both worlds by implementing BLU on a highly proven, highly scalable resilient row-store database [in DB2]. As workloads need the analytical and reporting aspect, you can utilize BLU tables for the ultimate in performance.

IW: So what's the use case for BLU versus PureData For Analytics, formerly known as Netezza?

Picciano: Netezza can handle extraordinarily large collections, and it has been tuned, over the years, for very specific workloads such as retail data and call data records in telco operations. We're talking about petabyte-size collections whereas BLU runs inside of a single system, so it's for collections under 50 terabytes.

IW: That's kind of confusing because DB2 is the enterprise data warehouse product aimed at the ultimate in scale. How does BLU work within DB2?

Picciano: BLU doesn't cluster, but the rest of DB2 does. So inside of a DB2 instance you would have a BLU table. BLU is especially helpful for reporting because one of the things that the in-memory, columnar technology does is perform extraordinarily well -- even if you write some very bad SQL.

For tools like Cognos, BusinessObjects or MicroStrategy, where there are line-of-business users who aren't up on their SQL skills, the database administrators can just create the table, organize it by column and load the information. The tool will generate the SQL and you'll see tremendous performance. You don't have to worry about whether you're going to do a star schema or a snowflake schema or whether you're going to implement multi-dimensional clusters and range partitioning. With BLU, all that goes away. It's like loading up a spreadsheet but it performs like a supercomputer.

IW: IBM has drawn competitive comparisons between BLU and SAP Hana, but if Hana is running entirely in RAM and BLU uses a mix of memory and disk, how could it perform any better?

Picciano: That comes down to our actionable compression. With BLU, you don't have to decompress data [to determine whether it fits a query], so you're moving smaller amounts of information in and around memory and applying a [query] predicate inside of that compressed information.
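The idea of applying a predicate to compressed data can be sketched with a generic technique, order-preserving dictionary encoding. The interview does not describe BLU's internals, so this Python example is an assumption-laden illustration rather than IBM's method; the function names and the sample column are hypothetical.

```python
from bisect import bisect_right


def encode_column(values):
    """Dictionary-encode a column so that code order matches value order."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def rows_at_most(dictionary, codes, bound):
    """Return row positions where value <= bound, comparing codes only."""
    # Translate the predicate constant into code space once, up front.
    max_code = bisect_right(dictionary, bound) - 1
    # The scan touches only the small integer codes, never the original values.
    return [row for row, code in enumerate(codes) if code <= max_code]


# Hypothetical column of prices.
prices = [300, 120, 990, 120, 450, 300]
dictionary, codes = encode_column(prices)
matches = rows_at_most(dictionary, codes, 300)
```

Because the codes preserve sort order, a range comparison on the codes gives the same answer as a comparison on the original values, so the column never has to be decoded during the scan.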

IW: I take it that also assumes that the BLU workload is running entirely in memory?

Picciano: In the comparisons that we've run it has been an in-memory-to-in-memory comparison because that's their environment. But remember that when Hana runs out of memory, it's useless. That's a big bet for your company when you're, maybe, trying to go through a year-end close or the quarterly close and you find out that Hana was misconfigured. When you look at the price difference, SAP Hana is very buffered on the amount of memory required, which makes it very expensive. We compare very well, on a price-performance basis and on sheer performance, because of our ability to manage I/O effectively.

IW: Another hot database topic is NoSQL. We noted that IBM recently announced mobile and Web app collaboration with 10Gen, the company behind MongoDB. Why the partnership?

Picciano: The 10Gen work is about JavaScript and reporting, but we now support JSON natively inside DB2. DB2 is really becoming a polyglot. DB2 has an Oracle compatibility feature, for example, so you can put an Oracle application directly on DB2 without having to migrate. DB2 also speaks RDF for graph data. You see this polyglot theme in our big data initiatives as well. We've put Big SQL [IBM's SQL-on-Hadoop answer to Cloudera Impala] inside of IBM BigInsights 2.0. That opens up the available skills to work with the data in Hadoop.

IW: Being a polyglot doesn't necessarily make DB2 the agile development choice people associate with MongoDB or the highly scalable choice people associate with Cassandra. Is IBM going to get into the NoSQL market with a dedicated NoSQL database?

Picciano: If you look at what those databases are used for and where they're used in the application tiers, I think that will be a low-end, very low-margin market for entry-tier SQL light. Then there will be more capable systems [like DB2] that can speak that dialect but that have security protocols, maintainability, backups and recovery where Mongo doesn't have any of those capabilities today. We think we can perform very well with Mongo workloads, plus we provide all the things that somebody who is really writing an enterprise app would require.

[Author's note: The MongoDB site offers several pages of documentation on replication functionality aimed at ensuring redundancy and high availability. 10Gen also has a managed backup service in beta]

IW: What about Cassandra, with its ability to support globally distributed databases? Can IBM address that need?

Picciano: Cassandra is not highly deployed in enterprise configurations. Out of all the engagements that I've worked on, I've run into Cassandra once.

[Author's note: DataStax, which offers Cassandra support, says it has more than 300 customers, including 20 Fortune 100 companies.]

Disrupting Legacy

IW: You do acknowledge that Hadoop is emerging, but is IBM committed to bringing that platform to enterprises even if it might displace legacy data warehouse workloads?

Picciano: Hadoop will displace not just some aspects of data warehouse work, it will create disruption in the field of ETL as well.

IW: And also mainframe processing. So is IBM really going to champion Hadoop if it might displace data warehousing, ETL and mainframe workloads?

Picciano: Yes, although I would be careful to define the legacy businesses. One of the biggest businesses around the Z mainframe is around Linux and workload consolidation. As we run Hadoop on Linux, there's an opportunity to have that workload in a Z environment. In fact, we've announced the ability to put our BigInsights engine on ZBX, which are Z blades inside of a Z enterprise cluster.

IW: What's the advantage of that approach? Isn't one of the most notable benefits of Hadoop taking advantage of low-cost commodity hardware?

Picciano: It's about handling a diversity of workloads in one environment. If you consider that Z is the system of record in most institutions, why wouldn't they also want to be able to get faster, real-time analytic views into that information? Right now companies have to move that data, on average, 16 times to get it inside a tier where they can do analysis work. We're giving them an option to shorten the synapse between transaction and insight with our IBM DB2 Analytics Accelerator (IDAA).

It makes perfect sense to do that with a data warehouse, and we're having great success where organizations are looking at their Teradata environments in comparison to the efficiency of putting an IDAA on Z. They're saying, why am I sending that data all the way over there to that expensive Teradata system? When you send queries to DB2 when the IDAA is attached, it figures out whether it's more effective to run the query with Z MIPS or whether to run it on the IDAA box.
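The routing decision described here can be pictured as a simple cost-based choice. The interview gives no detail on how DB2 actually decides, so the following Python sketch is purely hypothetical: the `QueryProfile` fields, the threshold and the engine labels are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    """Hypothetical summary of a query, as an optimizer might estimate it."""
    estimated_rows_scanned: int
    is_aggregate: bool


def choose_engine(profile: QueryProfile, scan_threshold: int = 1_000_000) -> str:
    """Route large analytical scans to the accelerator, everything else natively."""
    if profile.is_aggregate and profile.estimated_rows_scanned >= scan_threshold:
        return "accelerator"
    return "native"


# A big reporting query versus a single-row transactional lookup.
report = QueryProfile(estimated_rows_scanned=50_000_000, is_aggregate=True)
lookup = QueryProfile(estimated_rows_scanned=1, is_aggregate=False)
```

A real optimizer would weigh far more than row counts, but the shape of the decision is the same: the application sends one query to one endpoint, and the database picks where it runs.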

IW: So you're talking about running Hadoop on mainframe, but is that evidence that IBM is willing to disrupt existing business and be an agent of change?

Picciano: If you look at our company's history, especially in the information management space, we started with hierarchical databases but we were the agent of our own change by introducing relational systems. We introduced XML-based systems and object-relational systems. Some of them had more traction than others and some of them fizzled out and never really produced much.

We think there's real value for our clients around Hadoop and data in motion. In some ways that disrupts the data warehousing market in a new way in that you're analyzing in real-time, not in a warehouse. That's very threatening to storage players because you're intelligently determining what patterns are interesting in real time as opposed to just trying to build a bigger repository. We're doing this not because we think it's intellectually stimulating but because it's valuable to customers.

IW: Is there a poster-child customer where mainframe or ETL or DB2 workloads have dramatically changed because IBM is helping them reengineer?

Picciano: General Motors is an example where CIO Randy Mott is transforming and bringing IT back into the company. He's doing that utilizing a Teradata enterprise data warehouse and a new generation of extract-load-transform capabilities using Hadoop as the transformation engine. IBM BigInsights is the Hadoop engine and we're taking our DataStage [data transformation] patterns into Hadoop.

IW: Upstarts are making claims about how big data is changing enterprise architectures. It makes you wonder who's driving the trends.

Picciano: I think IBM is driving, and the reason is this architecture that I've talked about where you have different analytical zones that are really effective at certain aspects of the big data problem. You can't look at it through a purist lens and say, "Hadoop will be able to do all these things," because it just cannot do all those things.

IW: There's clearly a role for multiple technologies. The question is how technology investments will change and how quickly they'll change.

Picciano: Customer value has to be in the center of everyone's cross hairs. It's not a technology experiment or a science project. The use cases that I talked about are where customers are getting additive value because they're analyzing operational data that they couldn't analyze before. They're getting a different view of their clients that wouldn't have been economical to build and so on... When you look at what's required in each of those zones, IBM has a leadership stake in all of those areas and we're putting vigorous investment even into areas that may appear to be most disruptive, like Hadoop.

About the Author(s)

Doug Henschen

Executive Editor, Enterprise Apps

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of Transform Magazine, and Executive Editor at DM News. He has covered IT and data-driven marketing for more than 15 years.
