eHarmony Matches Informatica HParser With Hadoop - InformationWeek

Informatica's distributed parsing software saves eHarmony time preparing data in Hadoop for loading into a data warehouse.

This is a story about JSON and Ruby. They were spending too much time together in an unrewarding relationship, so sooner or later it had to end.

JSON (JavaScript Object Notation) is what eHarmony uses to capture and move data from its various customer-facing Web sites to its back-end systems. When customers seeking love fill out questionnaires about the dating site's advertised "29 dimensions of compatibility," for example, JSON encapsulates that data and sends it off wherever it's needed. One destination is Voldemort, the highly scalable, distributed NoSQL data store. Another is Solr, the Apache open-source search platform.

A third destination is Hadoop. That's where eHarmony's matching algorithms do the work of bringing together compatible customer records. And that's where Ruby comes in. You see, eHarmony can't just load JSON-encapsulated data into its SQL-based IBM Netezza data warehouse. It has to transform the object-encapsulated data into nicely structured information that can be loaded into the appropriate columns and rows in Netezza. For more than two years, eHarmony has been using scripts written in Ruby, the popular object-oriented programming language, to process the JSON data and move it into the data warehouse.
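To make the transformation step concrete, here is a minimal Python sketch of flattening a nested JSON record into a fixed-width, delimited row of the kind a SQL warehouse loader typically expects. It is not eHarmony's Ruby script or anything from Informatica; the field names, the column list, and the pipe delimiter are all assumptions made up for illustration.

```python
import csv
import io
import json

# Hypothetical column order for a warehouse table (not from the article).
COLUMNS = ["user_id", "submitted_at", "answers.openness", "answers.humor"]


def flatten(obj, prefix=""):
    """Turn nested dicts into a flat {dotted.path: value} mapping."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=path + "."))
        else:
            flat[path] = value
    return flat


def json_line_to_row(line, columns=COLUMNS):
    """Parse one JSON document and return its values in column order."""
    flat = flatten(json.loads(line))
    # Missing fields become empty strings so every row has the same width.
    return [flat.get(col, "") for col in columns]


def rows_to_delimited(lines, columns=COLUMNS):
    """Convert an iterable of JSON lines into pipe-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    for line in lines:
        writer.writerow(json_line_to_row(line, columns))
    return buffer.getvalue()
```

Each JSON document becomes one row whose values line up with the target table's columns, which is the essence of the work the article describes, whatever tool performs it.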

Never mind that writing scripts was time-consuming. Each hourly job also took as long as 40 minutes because it had to run on a conventional server rather than in Hadoop's distributed processing environment. eHarmony had people who knew Ruby, so let's just say it was a "you'll do for now" relationship.

But then eHarmony started getting serious about its long-term data warehousing prospects. Operations were destined to get bigger, according to Grant Parsamyan, director of business intelligence and data warehousing. Enter Informatica and its PowerCenter data-integration platform, which eHarmony was already using to load as much as seven terabytes per day into Netezza from conventional SQL data sources. Ruby was processing roughly 300 gigabytes per day from Hadoop, but Parsamyan says he expects that volume to get four to five times larger. It was clear the Ruby approach could not scale, he says.

Fortunately, Informatica last fall introduced HParser, a product that moves PowerCenter data-parsing capabilities into the Hadoop distributed processing environment. There, the many processors that work together can handle transformation jobs quickly, just as they do with massive MapReduce computations.
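Parsing spreads well across many processors because each record can be handled without reference to any other. The Python sketch below illustrates that general idea only; it is not how HParser or Hadoop is implemented. Local threads stand in for cluster nodes, and the `user_id` and `event` fields are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_split(lines):
    """Map step: parse each JSON line in one split, independent of other splits."""
    out = []
    for line in lines:
        record = json.loads(line)
        # Hypothetical fields, chosen only for illustration.
        out.append((record.get("user_id"), record.get("event")))
    return out


def make_splits(lines, n_splits):
    """Cut the input into contiguous chunks, roughly as a framework splits a file."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parse_distributed(lines, n_splits=4):
    """Parse every split in its own worker and concatenate results in order."""
    splits = make_splits(lines, n_splits)
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        results = pool.map(parse_split, splits)  # map preserves input order
    return [row for part in results for row in part]
```

Because no split depends on another, the split-and-merge version returns the same rows as a single serial pass, and adding workers changes only how the work is divided.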

Informatica's HParser community edition handles JSON, XML, Omniture (Web analytics data), and log files. Commercial editions are available for documents (Word, Excel, PDF, etc.) and industry-standard file formats (SWIFT, NACHA, HIPAA, HL7, ACORD, EDI X12, and so on). The package also includes a visual, point-and-click studio that eliminates coding. Once the processing is done, PowerCenter can be used to extract the data from Hadoop and move it into the target destination.

In tests completed in November, eHarmony proved the advantages of the HParser approach. "Using a small Hadoop cluster, jobs that took 40 minutes in Ruby can be completed in about 10 minutes," Parsamyan says. "More importantly, as data volumes grow, we can just throw more Hadoop nodes at the problem and scale it up as much as we need to."

Once the HParser approach is in full production, Parsamyan expects to start loading as much as 1 terabyte per day into the data warehouse in short order, and that will enable more analytic measurement of eHarmony's success. The marketing department uses the data warehouse to measure response to its email and banner advertising campaigns. Product development teams use it to study the success of new site features. And the operations team uses the warehouse to study the health of the business, including membership and revenue trends.

With data volumes, velocity, and complexity on the rise, practitioners are turning to highly scalable platforms such as Hadoop. HParser is an early example of the type of new tools they'll need to work with the latest Big Data platforms.
