Hadoop Meets Near Real-Time Data - InformationWeek

3/5/2012
08:46 AM

Hadoop Meets Near Real-Time Data

Sidestepping HDFS and using Informatica's messaging software, MapR says it will be the first to stream data into the big data platform.

Hadoop software and support vendor MapR on Monday announced a partnership with Informatica that it said will make it the first and only Hadoop software distributor capable of delivering near-real-time data streaming on the big data platform.

Hadoop is best known for its massive scalability, low cost, and ability to handle mixed data types, including structured and unstructured sources. A speed demon it's typically not. Most deployments of open source Apache Hadoop software rely on comparatively slow, batch-oriented data loading into the Hadoop Distributed File System (HDFS).

MapR has differentiated its distribution of Hadoop as a high-performance alternative that eliminates HDFS and replaces it with a derivative of the Unix-based Network File System (NFS). In addition, the MapR distribution features a lockless storage services layer that will work hand in hand with Informatica messaging software to continuously stream massive amounts of data into Hadoop. There are plenty of potential use cases, according to John Haddad, a director of product marketing at Informatica.

"We're seeing strong interest from financial services that have streaming sources, such as social media and transactional flows, and that want to do ongoing analysis for anything from sentiment analysis to fraud detection," Haddad told InformationWeek. Telcos and ad-targeting firms doing customer profiling are among the other types of firms interested in streaming data into Hadoop, Haddad said.

Informatica is a neutral party that is ready to adapt its software for use by any of the major Hadoop distributors, Haddad said. Indeed, Cloudera and Hortonworks already have partnerships with Informatica and integrations with its data-integration platform, but they're not making use of the vendor's Ultra Messaging technology, which Informatica gained through its 2010 acquisition of 29West. That's because HDFS can't handle streaming data flows, according to Jack Norris, MapR's VP of marketing.

"The underlying storage layer of HDFS is an append-only system, so you can't do continuous streaming," Norris told InformationWeek.

Informatica's integration work with MapR has yet to be completed (and Haddad could not be pinned down on a precise release date, month, or quarter), but once finished, it will apply to MapR's M3 (community) and M5 (enterprise) distributions as well as to EMC's Greenplum MR Hadoop distribution, which is based on M5.

The MapR integration will also support real-time and snapshot replication using Informatica Data Replication, designed for a variety of relational data sources, and Informatica FastClone, which is designed specifically for Oracle databases. This will enable firms to replicate multiple terabytes of transactional data into Hadoop per hour, effectively providing another method of fast data loading.
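
To put "multiple terabytes per hour" in perspective, the short sketch below converts an hourly replication rate into the sustained throughput it implies. The sample rates of 1, 2, and 5 TB per hour are illustrative assumptions, not figures from MapR or Informatica, and the function name is hypothetical.

```python
def tb_per_hour_to_mb_per_sec(tb_per_hour: float) -> float:
    """Convert a rate in terabytes/hour to megabytes/second (decimal units)."""
    mb_per_tb = 1_000_000      # 1 TB = 10^6 MB in decimal (SI) units
    seconds_per_hour = 3600
    return tb_per_hour * mb_per_tb / seconds_per_hour


# Illustrative rates only; the article says "multiple terabytes" without a number.
for rate in (1, 2, 5):
    mb_s = tb_per_hour_to_mb_per_sec(rate)
    print(f"{rate} TB/hour is about {mb_s:.0f} MB/s sustained")
```

Even at the low end, a rate like this means moving several hundred megabytes every second without pause, which is why replication is pitched here as a fast-loading alternative to periodic batch jobs.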

"Typical use cases for replication are around predictive analytics, pricing optimization, supply chain optimization, and areas where you want more data or a mix of data to build better models for more accurate insights," Haddad said.

In addition, MapR will be able to archive data from data warehouses and enterprise business applications from the likes of SAP, Oracle, Infor, and Epicor by way of Informatica's Information Lifecycle Management products. This will enable companies to take advantage of Hadoop as a low-cost yet searchable storage platform.

As part of the new partnership, MapR will immediately make Informatica's HParser software available for free with its software distributions and as a separate download. HParser brings Informatica parsing capabilities (a form of data filtering and transformation) directly into the Hadoop distributed processing environment where they can be applied to complex data sources such as server logs, call data records, and text-oriented social streams at high scale.

Other Hadoop providers have ways to bring streaming data into the Hadoop environment, according to Forrester analyst James Kobielus. IBM, for one, recently added a connector to integrate its InfoSphere Streams and InfoSphere BigInsights Hadoop offerings. In addition, vendor HStreaming supports streaming data with other distributors.

Both IBM and HStreaming are using HBase, a NoSQL database that's part of the Hadoop framework but that some critics say has yet to be hardened for production environments. DataStax supports streaming through Cassandra and can quickly replicate that data onto separate Hadoop nodes.

For now, MapR has taken the lead in providing a low-latency option for streaming big data directly into Hadoop's core MapReduce processing environment, and that counts as an edge over rival distributors.
