With Big Blue behind Hadoop, companies with Big Data problems may find the open source technology is available in more manageable forms.
IBM, the originator of the SQL data access language, has recognized the NoSQL movement has a point. Some data management problems don't lend themselves to being solved by IBM's DB2 or other relational database systems.
That's why it has started offering consulting services, based on Apache's open source Hadoop, for managing large volumes of data. It has a package of services and Hadoop-based analytics, which it calls BigInsights Core, to enable companies to take the plunge into Internet-scale data volumes. It's also offering its own large-volume data management software, IBM BigSheets, built on a large-scale spreadsheet paradigm.
"Hadoop opens up a broader technology domain -- Big Data," said Bernie Spang, IBM director of information management product strategy, referring to the common appellation for masses of website, customer or RSS feed or Twitter message data, all of value to the business.
Hadoop makes no pretense of running transactions or functioning like a transaction-processing database system, with its stringent requirements for a two-phase commit. Rather, it specializes in filtering, sorting and managing either structured or unstructured data on a very large scale. After Hadoop has done its work, data warehouses, business analytics systems and relational databases can work with a more manageable result set.
IBM announced the Hadoop services at its Information on Demand conference in Rome on May 19. The services come not from consultants at IBM Global Information Services but from advisors in IBM labs and engineering. IBM is exploring how to help customers get a handle on information flows that are measured in petabytes, as opposed to mere megabytes, gigabytes and terabytes.
Hadoop combines two distributed systems meant to filter and manage data on a large server cluster. One part is Map/Reduce, a system that knows where data is stored on disks throughout the cluster and which processor is nearest to it. When it comes time to sort or filter the data, it can give the orders to read the data from disk in large chunks of 64 or 128 megabytes and move it to nearby processors. The second part is HDFS, the Hadoop Distributed File System, which knows how to distribute the data across the cluster in the first place.
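For readers who want a feel for the two parts described above, the following is a minimal sketch in plain Python. It does not use Hadoop's API. It runs on one machine and in memory, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`, `block_count`) are hypothetical, chosen only for illustration. The first part counts words in the map, group and reduce style; the second estimates how many 64 or 128 megabyte chunks a file of a given size would occupy, assuming a megabyte of 1,024 x 1,024 bytes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, standing in for the work a cluster
    framework does between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all counts for one word into a total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    # Sorting the keys mimics the sorted order in which reducers see them.
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


def block_count(file_size_bytes, block_size_mb=64):
    """Number of fixed-size chunks needed to hold a file (illustrative only)."""
    block_size = block_size_mb * 1024 * 1024
    return -(-file_size_bytes // block_size)  # ceiling division


if __name__ == "__main__":
    print(run_job(["big data", "Big Blue"]))
    print(block_count(1024 ** 3), block_count(1024 ** 3, block_size_mb=128))
```

On a real cluster, the map and reduce steps would run on many machines at once, close to the disks holding each chunk. The sketch only shows the shape of the computation.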
Likewise, IBM said BigSheets, a browser-based data extraction, annotation and visualization system, was available May 19 in technology preview form. There is no date yet for when it will be a generally available, finished product, Spang said in an interview.
BigSheets was first announced Feb. 25. It is based on several open source components, including Hadoop; Nutch, a Web search and search indexing engine; and Pig, a high-level language under development at Yahoo! for composing work that will be executed by Hadoop.
A key Hadoop brain trust, the start-up firm Cloudera, says IBM's entry into the Big Data field will help get Hadoop adopted in the enterprise. "At Cloudera, we've seen incredible Hadoop uptake in mainstream enterprises… I see no end to the number of applications of this new technology. IBM's entry means more open source contributors will help expand the horizons for Hadoop," said Doug Cutting, Cloudera software architect, who was the original author of Hadoop while at Yahoo! He made the comments in an email message.
"We're confident the time is right for Hadoop to move into established IT infrastructure. IBM's contributions should accelerate this movement," added Mike Olson, CEO of Cloudera, in a message.