12/9/2013 11:55 AM

How Ancestry.com Manages Generations Of Big Data

Over the past year, the genealogy site's repository of family historical data has more than doubled in size. Here's how Ancestry managed its growth.

Businesses often use -- or overuse -- the term "big data" to describe all sorts of data-related products and services, but the buzzword certainly applies in the case of Ancestry.com, a popular genealogy service that helps people dig up their family roots.

A little over a year ago, Ancestry was managing about 4 petabytes of data, including more than 40,000 record collections with birth, census, death, immigration, and military documents, as well as photos, DNA test results, and other info. Today the catalog has quintupled to more than 200,000 record collections, and Ancestry's data stockpile has soared from 4 petabytes to 10 petabytes.

According to Bill Yetman, senior director of engineering at Ancestry.com, the big data explosion led to growing pains. "We measured every step in our process pipeline," said Yetman in a phone interview with InformationWeek. "We started with academic algorithms that people are using at universities, and they work great at smaller scales."

[How can K-12 education help train a new generation of data scientists? Read How Educators Can Narrow Big Data Skills Gap.]

But, he added, these algorithms broke down as the database kept growing. "There's a very specific algorithm we use in matching [DNA]. It's called Germline, and it was created by some very, very bright people at Columbia University," Yetman told us.

To analyze its growing stockpile of DNA data, Ancestry had to re-implement Germline using Hadoop and HBase. This process involved storing the data in HBase, then running two MapReduce jobs to perform the comparisons in parallel. "There are two MapReduce steps we use, and then we use HBase to hold the results, which makes it easy for us to do the [DNA] comparisons. If we couldn't run these things in parallel, we couldn't get it done nearly as fast."
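The group-and-pair pattern behind that kind of MapReduce port can be sketched in miniature. This is a toy illustration, not Ancestry's actual code: the sample genotype "words", the two-step structure modeled with plain dictionaries, and the two-shared-words match threshold are all invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: sample ID -> list of fixed-length genotype "words".
samples = {
    "s1": ["AC", "GT", "TT"],
    "s2": ["AC", "GG", "TT"],
    "s3": ["CC", "GT", "TT"],
}

# Step 1 (map + reduce): group samples by (position, word), so samples with an
# identical word at the same position land in the same group. In Hadoop this
# grouping is what the shuffle between mapper and reducer does for free.
word_index = defaultdict(set)
for sample, words in samples.items():
    for pos, word in enumerate(words):
        word_index[(pos, word)].add(sample)

# Step 2 (map + reduce): emit every candidate pair within a group, then count
# shared words per pair. Only pairs that already share a word are ever
# compared, which is what makes the parallel version tractable.
pair_counts = defaultdict(int)
for group in word_index.values():
    for a, b in combinations(sorted(group), 2):
        pair_counts[(a, b)] += 1

# A dict stands in for the HBase results table; call pairs sharing
# at least 2 words a "match" (threshold invented for the toy example).
matches = {pair for pair, n in pair_counts.items() if n >= 2}
print(sorted(matches))  # [('s1', 's2'), ('s1', 's3')]
```

The key property the article describes survives even in this sketch: each (position, word) group can be processed independently, so adding nodes lets the pairing work scale horizontally.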

Hadoop's vaunted expandability also helped Ancestry manage its growth. "If I need to improve my [performance] times, I can scale horizontally," said Yetman. "Just add more nodes to the cluster, and we can handle the growth."

Future growth, however, will require more innovation to keep things flowing smoothly. "You can't just say, 'OK, I've gotten over this 200,000 hump, and I can make it to 5 million.' I know there are going to be challenges all along the way, and I'm going to be looking for them."

Obviously, hardware performance must be monitored closely. "We've got to watch the memory in each node, how we're using memory, and how we're using the CPU."

Ancestry.com is also in the process of optimizing its Germline implementation to greatly reduce its memory usage. And it may team up with a cloud provider to boost its processing capacity.

The cloud option gained credence when Ancestry.com recently updated its algorithm for its ethnicity test. "We had to go back to those 200,000 people to rerun their ethnicity," said Yetman. "We did that using machines in our datacenter." But local hardware won't be enough as the number of users climbs to 500,000 -- or 1 million.

Ancestry.com is currently evaluating several cloud providers, but Yetman acknowledges that privacy issues add a degree of complexity to the move. "It gets really tricky because DNA data is so sensitive. That's one of the things that we as a company are careful with."

One potential solution: "I'm looking at bursting to the cloud… to do these calculations," Yetman said. But rather than leaving the data in the cloud, he might "pull it all back" to local storage to alleviate customers' privacy concerns.
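That burst-then-retrieve workflow can be sketched as a simple lifecycle: upload a batch, compute remotely, fetch the result, and delete the remote copy so no DNA data lingers off-premises. The CloudStore class below is a hypothetical stand-in for a real provider SDK, not any API Ancestry named.

```python
# Minimal sketch of "bursting to the cloud" and then "pulling it all back".
# CloudStore is an in-memory stand-in for a provider's object store + compute.

class CloudStore:
    def __init__(self):
        self._objects = {}

    def upload(self, key, data):
        self._objects[key] = data

    def compute(self, key, fn):
        return fn(self._objects[key])  # heavy work happens remotely

    def delete(self, key):
        del self._objects[key]

    def __contains__(self, key):
        return key in self._objects


def burst_job(cloud, key, data, fn):
    """Upload, compute, and always remove the remote copy afterward."""
    cloud.upload(key, data)
    try:
        return cloud.compute(key, fn)
    finally:
        cloud.delete(key)  # "pull it all back": nothing stays in the cloud


cloud = CloudStore()
result = burst_job(cloud, "batch-1", [1, 2, 3], lambda xs: sum(xs))
print(result, "batch-1" in cloud)  # 6 False
```

The try/finally is the point of the sketch: even if the remote computation fails, the sensitive batch is removed from cloud storage.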


Comments
Laurianne
User Rank: Author
12/9/2013 | 2:47:37 PM
Cloud bursting and privacy
Interesting discussion of cloud bursting. We tend to discuss bursting in terms of capacity problems -- more power on a busy shopping day, for example -- but the privacy angle deserves examination. What do you think, cloud community? Is this approach advantageous for privacy?
Ulf Mattsson
User Rank: Strategist
12/9/2013 | 3:14:26 PM
Privacy, cloud and big data
I agree that "It gets really tricky because DNA data is so sensitive" and that the hard part is to "alleviate customers' privacy concerns".

Many organizations are looking to the cloud and to outsourcing for massive processing, but international privacy laws are tightening, and organizations are searching for effective ways to comply with these stringent new regulations. Europe and the US lead with some of the strictest privacy laws.

I studied one interesting project that addressed the challenge of protecting sensitive information about individuals in a way that could satisfy European Cross Border Data Security requirements. This included incoming source data from various European banking entities, and existing data within those systems, which would be consolidated in one European country. The project achieved targeted compliance with EU Cross Border Data Security laws, Datenschutzgesetz 2000 - DSG 2000 in Austria, and Bundesdatenschutzgesetz in Germany by using a data tokenization approach.
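The core idea of vault-style tokenization can be shown in a few lines. This is my own minimal illustration of the general technique, not any vendor's product: sensitive values are swapped for random tokens, and the token-to-value mapping stays in an on-premises vault, so only meaningless tokens ever leave the building.

```python
import secrets

class TokenVault:
    """Toy vault: random tokens out, original values recoverable only here."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        # Reuse the existing token so the same value always maps consistently.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random: reveals nothing
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        return self._token_to_value[token]


vault = TokenVault()
t = vault.tokenize("dna-sample-owner-42")
assert t != "dna-sample-owner-42"                    # token leaks nothing
assert vault.detokenize(t) == "dna-sample-owner-42"  # only the vault can map back
assert vault.tokenize("dna-sample-owner-42") == t    # stable per value
```

Unlike encryption, there is no key that decrypts the token; without access to the vault's table, a token is just a random string.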

I recently read an interesting report from the Aberdeen Group that revealed that "Over the last 12 months, tokenization users had 50% fewer security-related incidents (e.g., unauthorized access, data loss or data exposure) than tokenization non-users". Nearly half of the respondents (47%) are currently using tokenization for something other than cardholder data. The study, released a few months ago, is titled "Tokenization Gets Traction".

Aberdeen has also seen "a steady increase in enterprise use of tokenization as an alternative to encryption for protecting sensitive data".

Ulf Mattsson, CTO Protegrity
Li Tan
User Rank: Ninja
12/10/2013 | 3:00:00 AM
Re: Privacy, cloud and big data
Privacy is a concern not only for Ancestry.com but for all big enterprises. Companies are seeking ways to improve their IT capability and efficiency, and moving to the cloud is an obvious step, but privacy and other security issues remain a real concern. Starting with a private cloud sounds promising, but in practice you just start to create data silos, which is not good in the long run.
cbabcock
User Rank: Strategist
12/10/2013 | 3:17:12 PM
Ancestry.com, cloud burster
Ancestry.com has the ideal problem for cloud bursting -- if there is such a thing as an ideal problem. PCI-compliant transaction handlers send the data into the cloud but retain the identifier, the name. As results come back, they can match names to transactions on-premises. Couldn't Ancestry do something like that?
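The pattern cbabcock describes splits cleanly into two steps: strip the identifying name before records go to the cloud, keep the name-to-job-ID mapping on-premises, and rejoin results as they come back. A minimal sketch, with invented record contents and a trivial stand-in for the remote computation:

```python
import uuid

def split_for_cloud(records):
    """records: {name: payload}. Returns (on_prem_index, cloud_batch).

    The on-prem index (job_id -> name) never leaves the datacenter;
    only anonymous (job_id, payload) pairs are sent to the cloud.
    """
    on_prem_index, cloud_batch = {}, {}
    for name, payload in records.items():
        job_id = str(uuid.uuid4())
        on_prem_index[job_id] = name   # stays local
        cloud_batch[job_id] = payload  # goes out, carries no identity
    return on_prem_index, cloud_batch


def rejoin(on_prem_index, cloud_results):
    """Match anonymous results back to names, on-premises."""
    return {on_prem_index[job_id]: result
            for job_id, result in cloud_results.items()}


index, batch = split_for_cloud({"alice": "ACGTACGT", "bob": "TTGATTGA"})

# Pretend the cloud computed something per anonymous job:
cloud_results = {job_id: len(payload) for job_id, payload in batch.items()}

print(rejoin(index, cloud_results))  # {'alice': 8, 'bob': 8}
```

The cloud provider only ever sees opaque job IDs and payloads, which is what makes the approach attractive for sensitive DNA data.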
Brian.Dean
User Rank: Ninja
12/14/2013 | 1:27:20 AM
Re: Cloud bursting and privacy
Excellent question. I think if privacy issues are going to cause harm to the customer, then moving to the cloud to gain access to more efficient hardware and frameworks will be difficult. Having said that, if DNA analysis can pre-flag, for example, lactose intolerance or a greater chance of going into shock from something as minor as a bee sting, then customers will rethink their definition of privacy.

And I feel there is already a middle ground that cloud security and privacy measures can handle, even if privacy attitudes do not change.