
How Ancestry.com Manages Generations Of Big Data

Over the past year, the genealogy site's repository of family historical data has more than doubled in size. Here's how Ancestry managed its growth.

Businesses often use -- or overuse -- the term "big data" to describe all sorts of data-related products and services, but the buzzword certainly applies in the case of Ancestry.com, a popular genealogy service that helps people dig up their family roots.

A little over a year ago, Ancestry was managing about 4 petabytes of data, including more than 40,000 record collections with birth, census, death, immigration, and military documents, as well as photos, DNA test results, and other info. Today that catalog has quintupled to more than 200,000 collections, and Ancestry's data stockpile has soared from 4 petabytes to 10 petabytes.

According to Bill Yetman, senior director of engineering at Ancestry.com, the big data explosion led to growing pains. "We measured every step in our process pipeline," said Yetman in a phone interview with InformationWeek. "We started with academic algorithms that people are using at universities, and they work great at smaller scales."
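
Per-step pipeline timing of the kind Yetman describes can be as simple as wrapping each stage in a timer. Below is an illustrative sketch, not Ancestry's actual tooling; the StepTimer class and the stage name are hypothetical:

```java
import java.util.function.Supplier;

// Hypothetical helper for timing each stage of a processing pipeline.
public final class StepTimer {
    public static <T> T timed(String step, Supplier<T> body) {
        long t0 = System.nanoTime();
        try {
            return body.get();
        } finally {
            // Log the elapsed wall-clock time for this pipeline step.
            System.out.printf("%s took %d ms%n", step,
                    (System.nanoTime() - t0) / 1_000_000);
        }
    }

    public static void main(String[] args) {
        // Example: time a stand-in for one pipeline stage.
        int matches = timed("segment-matching", () -> runMatching());
        System.out.println("candidate matches: " + matches);
    }

    private static int runMatching() { return 42; } // placeholder stage
}
```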

[How can K-12 education help train a new generation of data scientists? Read How Educators Can Narrow Big Data Skills Gap.]

But, he added, those algorithms began breaking down as the database grew larger and larger. "There's a very specific algorithm we use in matching [DNA]. It's called Germline, and it was created by some very, very bright people at Columbia University," Yetman told us.

To analyze its growing stockpile of DNA data, Ancestry had to re-implement Germline using Hadoop and HBase. The process involved storing the data in HBase and then running the comparisons in parallel across two MapReduce steps. "There are two MapReduce steps we use, and then we use HBase to hold the results, which makes it easy for us to do the [DNA] comparisons," Yetman said. "If we couldn't run these things in parallel, we couldn't get it done nearly as fast."
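
Ancestry hasn't published its production code, but the first of those two MapReduce steps might look roughly like the sketch below: hash fixed-width genotype "words" so that samples sharing an identical word in the same window land on the same reducer, which then emits candidate match pairs. All class and field names here are illustrative assumptions, as is the input format (sampleId, windowId, genotypeWord per line):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SegmentMatchStep {

    // Input line: "<sampleId>\t<windowId>\t<genotypeWord>"
    public static class WordHashMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String sampleId = parts[0], windowId = parts[1], word = parts[2];
            // Samples with identical words in the same window share a key.
            ctx.write(new Text(windowId + ":" + word), new Text(sampleId));
        }
    }

    // For each shared word, emit every pair of samples as a candidate match.
    public static class PairReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            List<String> samples = new ArrayList<>();
            for (Text v : values) samples.add(v.toString());
            for (int i = 0; i < samples.size(); i++)
                for (int j = i + 1; j < samples.size(); j++)
                    ctx.write(new Text(samples.get(i) + "," + samples.get(j)),
                              new Text(key.toString()));
        }
    }
}
```

A second MapReduce step would then group these candidate pairs, stitch adjacent windows into longer shared segments, and write the results into an HBase table keyed by sample pair, which is what makes the later comparisons cheap.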

Hadoop's vaunted expandability also helped Ancestry manage its growth. "If I need to improve my [performance] times, I can scale horizontally," said Yetman. "Just add more nodes to the cluster, and we can handle the growth."

Future growth, however, will require more innovation to keep things flowing smoothly. "You can't just say, 'OK, I've gotten over this 200,000 hump, and I can make it to 5 million.' I know there are going to be challenges all along the way, and I'm going to be looking for them."

Obviously, hardware performance must be monitored closely. "We've got to watch the memory in each node, how we're using memory, and how we're using the CPU." Ancestry.com is also in the process of optimizing its Germline implementation to greatly reduce its memory usage. And it may team up with a cloud provider to boost its processing capacity.
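
As a minimal illustration of that kind of node-level watching (not Ancestry's monitoring stack), the JVM exposes heap figures through its standard management beans:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal sketch: sample JVM heap usage on a worker node, e.g. to log
// alongside task metrics. Thresholds and log destinations are omitted.
public class HeapSampler {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d MB committed=%d MB max=%d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }
}
```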

The cloud option gained credence when Ancestry.com recently updated the algorithm for its ethnicity test. "We had to go back to those 200,000 people to rerun their ethnicity," said Yetman. "We did that using machines in our datacenter." But local hardware won't be enough as the number of users climbs to 500,000 -- or 1 million. Ancestry.com is currently evaluating several cloud providers, but Yetman acknowledges that privacy issues add a degree of complexity to the move. "It gets really tricky because DNA data is so sensitive. That's one of the things that we as a company are careful with."

One potential solution: "I'm looking at bursting to the cloud… to do these calculations," Yetman said. But rather than leaving the data in the cloud, he might "pull it all back" to local storage to alleviate customers' privacy concerns.
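
In code, that "burst, then pull it all back" pattern might reduce to staging encrypted inputs in cloud object storage, running the job there, then copying the results home and deleting the cloud copies. The sketch below uses the AWS SDK's S3 client purely as a stand-in for whichever provider Ancestry might choose; the bucket and key names are hypothetical:

```java
import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;

public class BurstAndPullBack {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "dna-burst-scratch"; // hypothetical scratch bucket

        // 1. Stage the (already encrypted) input batch for the cloud cluster.
        s3.putObject(bucket, "input/batch-001.enc", new File("batch-001.enc"));

        // 2. ...the cloud cluster runs the recalculation here...

        // 3. Pull the results back on-premises, then delete both cloud
        //    copies so no DNA data stays resident with the provider.
        s3.getObject(new GetObjectRequest(bucket, "output/batch-001.enc"),
                new File("results/batch-001.enc"));
        s3.deleteObject(bucket, "input/batch-001.enc");
        s3.deleteObject(bucket, "output/batch-001.enc");
    }
}
```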

Emerging software tools now make analytics feasible -- and cost-effective -- for most companies. Also in the Brave The Big Data Wave issue of InformationWeek: Have doubts about NoSQL consistency? Meet Kyle Kingsbury's Call Me Maybe project. (Free registration required.)
