Informatica's distributed parsing software saves eHarmony time preparing data in Hadoop for loading into a data warehouse.
This is a story about JSON and Ruby. They were spending too much time together in an unrewarding relationship, so sooner or later it had to end.
JSON (JavaScript Object Notation) is what eHarmony uses to capture and move data from its various customer-facing Web sites to its back-end systems. When customers seeking love fill out questionnaires about the dating site's advertised "29 dimensions of compatibility," for example, JSON encapsulates that data and sends it off wherever it's needed. One destination is Voldemort, the highly scalable, distributed NoSQL data store. Another is Solr, the Apache open-source search platform.
A third destination is Hadoop. That's where eHarmony's matching algorithms do the work of bringing together compatible customer records. And that's where Ruby comes in. You see, eHarmony can't just load JSON-encapsulated data into its SQL-based IBM Netezza data warehouse. It has to transform the object-encapsulated data into nicely structured information that can be loaded into the appropriate columns and rows in Netezza. For more than two years, eHarmony has been using scripts written in Ruby, the popular object-oriented programming language, to process the JSON data and move it into the data warehouse.
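To make that transformation step concrete, here is a minimal sketch in Python of flattening nested JSON records into fixed columns that a SQL loader could accept. It is not eHarmony's Ruby code and not how HParser works internally; the field names (`user_id`, `answers`, `humor`, `ambition`) and the sample records are invented for illustration.

```python
import csv
import io
import json


def flatten(record, parent_key="", sep="_"):
    """Collapse nested JSON objects into a single-level dict of column -> value."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse so {"answers": {"humor": 4}} becomes {"answers_humor": 4}.
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def json_lines_to_csv(lines, columns):
    """Turn one-JSON-object-per-line input into CSV text with a fixed column order."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in lines:
        if line.strip():
            # Columns missing from a record are written as empty fields.
            writer.writerow(flatten(json.loads(line)))
    return out.getvalue()


# Hypothetical questionnaire records, invented for this example.
sample = [
    '{"user_id": 101, "answers": {"humor": 4, "ambition": 5}}',
    '{"user_id": 102, "answers": {"humor": 2}}',
]
csv_text = json_lines_to_csv(sample, ["user_id", "answers_humor", "answers_ambition"])
print(csv_text)
```

Run on a single server, a script of this shape processes records one after another, which is the scaling limit the article goes on to describe.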
Never mind that writing the scripts was time-consuming; each hourly job also took as long as 40 minutes because it had to run on a conventional server rather than in Hadoop's distributed processing environment. eHarmony had people who knew Ruby, so let's just say it was a "you'll do for now" relationship.
But then eHarmony started getting serious about its long-term data warehousing prospects. Operations were destined to get bigger, according to Grant Parsamyan, director of business intelligence and data warehousing. Enter Informatica and its PowerCenter data-integration platform, which eHarmony was already using to load as much as seven terabytes per day into Netezza from conventional SQL data sources. Ruby was processing roughly 300 gigabytes per day from Hadoop, but Parsamyan says he expects that volume to get four to five times larger. It was clear the Ruby approach could not scale, he says.
Fortunately, Informatica last fall introduced HParser, a product that moves PowerCenter data-parsing capabilities into the Hadoop distributed processing environment. There, the many processors that work together can handle transformation jobs quickly, just as they do with massive MapReduce computations.
Informatica's HParser community edition handles JSON, XML, Omniture (Web analytics data), and log files. Commercial editions are available for documents (Word, Excel, PDF, etc.) and industry-standard file formats (SWIFT, NACHA, HIPAA, HL7, ACORD, EDI X12, and so on). The package also includes a visual, point-and-click studio that eliminates coding. Once the processing is done, PowerCenter can be used to extract the data from Hadoop and move it into the target destination.
In tests completed in November, eHarmony proved the advantages of the HParser approach. "Using a small Hadoop cluster, jobs that took 40 minutes in Ruby can be completed in about 10 minutes," Parsamyan says. "More importantly, as data volumes grow, we can just throw more Hadoop nodes at the problem and scale it up as much as we need to."
Once the HParser approach is in full production, Parsamyan expects to start loading as much as 1 terabyte per day into the data warehouse in short order, and that will enable more analytic measurement of eHarmony's success. The marketing department uses the data warehouse to measure response to its email and banner advertising campaigns. Product development teams use it to study the success of new site features. And the operations team uses the warehouse to study the health of the business, including membership and revenue trends.
With data volumes, velocity, and complexity on the rise, practitioners are turning to highly scalable platforms such as Hadoop. HParser is an early example of the type of new tools they'll need to work with the latest Big Data platforms.