Hadoop is best known for its massive scalability, low cost, and ability to handle mixed data types, including structured and unstructured sources. A speed demon it's typically not. Most deployments of open source Apache Hadoop software rely on comparatively slow, batch-oriented data loading into the Hadoop Distributed File System (HDFS).
MapR has differentiated its distribution of Hadoop as a high-performance alternative that eliminates HDFS and replaces it with a derivative of the Unix-based Network File System (NFS). In addition, the MapR distribution features a lockless storage services layer that will work hand in hand with Informatica messaging software to continuously stream massive amounts of data into Hadoop. There are plenty of potential use cases, according to John Haddad, a director of product marketing at Informatica.
"We're seeing strong interest from financial services that have streaming sources, such as social media and transactional flows, and that want to do ongoing analysis for anything from sentiment analysis to fraud detection," Haddad told InformationWeek. Telcos and ad-targeting firms doing customer profiling are among the other types of firms interested in streaming data into Hadoop, Haddad said.
Informatica is a neutral party that is ready to adapt its software for use by any of the major Hadoop distributors, Haddad said. Indeed, Cloudera and Hortonworks already have partnerships with Informatica and integrations with its data-integration platform, but they're not making use of the vendor's Ultra Messaging technology, which Informatica gained when it acquired 29West in 2010. That's because HDFS can't handle streaming data flows, according to Jack Norris, MapR's VP of marketing.
"The underlying storage layer of HDFS is an append-only system, so you can't do continuous streaming," Norris told InformationWeek.
Informatica's integration work with MapR has yet to be completed (and Haddad could not be pinned down on a precise release date, month, or quarter), but once finished, it will apply to MapR's M3 (community) and M5 (enterprise) distributions as well as to EMC's Greenplum MR Hadoop distribution, which is based on M5.
The MapR integration will also support real-time and snapshot replication using Informatica Data Replication, designed for a variety of relational data sources, and Informatica FastClone, which is designed specifically for Oracle databases. This will enable firms to replicate multiple terabytes of transactional data into Hadoop per hour--effectively providing another method of fast data loading.
"Typical use cases for replication are around predictive analytics, pricing optimization, supply chain optimization, and areas where you want more data or a mix of data to build better models for more accurate insights," Haddad said.
In addition, MapR will be able to archive data from data warehouses and enterprise business applications from the likes of SAP, Oracle, Infor, and Epicor by way of Informatica's Information Lifecycle Management products. This will enable companies to take advantage of Hadoop as a low-cost-yet-searchable storage platform.
As part of the new partnership, MapR will immediately make Informatica's HParser software available for free with its software distributions and as a separate download. HParser brings Informatica parsing capabilities (a form of data filtering and transformation) directly into the Hadoop distributed processing environment where they can be applied to complex data sources such as server logs, call data records, and text-oriented social streams at high scale.
Other Hadoop providers have ways to bring streaming data into the Hadoop environment, according to Forrester analyst James Kobielus. IBM, for one, recently added a connector to integrate its InfoSphere Streams and InfoSphere BigInsights Hadoop offerings. In addition, vendor HStreaming supports streaming data with other distributors.
Both IBM and HStreaming are using HBase, a NoSQL database that's part of the Hadoop framework but that some critics say has yet to be hardened for production environments. DataStax supports streaming through Cassandra and can quickly replicate that data onto separate Hadoop nodes.
For now, MapR has taken the lead in providing a low-latency option for streaming big data directly into Hadoop's core MapReduce processing environment, and that counts as an edge over rival distributors.