Businesses often use -- or overuse -- the term "big data" to describe all sorts of data-related products and services, but the buzzword certainly applies in the case of Ancestry.com, a popular genealogy service that helps people dig up their family roots.
A little over a year ago, Ancestry was managing about 4 petabytes of data, including more than 40,000 record collections with birth, census, death, immigration, and military documents, as well as photos, DNA test results, and other info. Today the collection has quintupled to more than 200,000 records, and Ancestry's data stockpile has soared from 4 petabytes to 10 petabytes.
According to Bill Yetman, senior director of engineering at Ancestry.com, the big data explosion led to growing pains. "We measured every step in our process pipeline," said Yetman in a phone interview with InformationWeek. "We started with academic algorithms that people are using at universities, and they work great at smaller scales."
But, he added, these algorithms began breaking down as the database kept growing. "There's a very specific algorithm we use in matching [DNA]. It's called Germline, and it was created by some very, very bright people at Columbia University," Yetman told us.
To analyze its growing stockpile of DNA data, Ancestry had to re-implement Germline using Hadoop and HBase. This process involved storing the data in HBase, and then using two MapReduce steps to run comparisons in parallel. "There are two MapReduce steps we use, and then we use HBase to hold the results, which makes it easy for us to do the [DNA] comparisons," Yetman said. "If we couldn't run these things in parallel, we couldn't get it done nearly as fast."
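Yetman did not spell out what each of the two MapReduce steps does, so the following is only a toy sketch of how a two-step matching job of this general shape might look, not Ancestry's implementation or the Germline algorithm itself. It assumes a simplified scheme in which each genotype is cut into fixed-length windows, people who share an identical window become candidate pairs (step one), and consecutive shared windows for a pair are merged into segments (step two). The window length, the function names, and the in-memory `run_job` helper standing in for Hadoop and HBase are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

WINDOW = 4  # hypothetical window length, in markers


def map_windows(person_id, genotype):
    """Step 1 map: emit ((window_index, window_string), person_id)."""
    for start in range(0, len(genotype) - WINDOW + 1, WINDOW):
        yield (start // WINDOW, genotype[start:start + WINDOW]), person_id


def reduce_to_pairs(key, person_ids):
    """Step 1 reduce: people sharing an identical window become candidate pairs."""
    window_index, _ = key
    for a, b in combinations(sorted(person_ids), 2):
        yield (a, b), window_index


def map_identity(pair, window_index):
    """Step 2 map: pass each (pair, window_index) hit through unchanged."""
    yield pair, window_index


def reduce_to_segments(pair, window_indexes):
    """Step 2 reduce: merge consecutive shared windows into (first, last) runs."""
    runs = []
    for w in sorted(set(window_indexes)):
        if runs and w == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], w)
        else:
            runs.append((w, w))
    yield pair, runs


def run_job(records, mapper, reducer):
    """In-memory stand-in for one MapReduce step: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(*record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output


def find_shared_segments(genotypes):
    """Chain the two steps; a dict stands in for the results table."""
    hits = run_job(genotypes.items(), map_windows, reduce_to_pairs)
    return dict(run_job(hits, map_identity, reduce_to_segments))
```

Calling `find_shared_segments({"p1": "AACCGGTTAACC", "p2": "AACCGGTTGGGG"})` would report that the pair shares windows 0 through 1. Because each key is reduced independently, the work can be spread across nodes, which is the property Yetman credits for the speedup. A real matcher would also have to tolerate genotyping errors and extend matches beyond exact windows, which this sketch leaves out.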
Hadoop's vaunted expandability also helped Ancestry manage its growth. "If I need to improve my [performance] times, I can scale horizontally," said Yetman. "Just add more nodes to the cluster, and we can handle the growth."
Future growth, however, will require more innovation to keep things flowing smoothly. "You can't just say, 'OK, I've gotten over this 200,000 hump, and I can make it to 5 million.' I know there are going to be challenges all along the way, and I'm going to be looking for them."
Obviously, hardware performance must be monitored closely. "We've got to watch the memory in each node, how we're using memory, and how we're using the CPU."
Ancestry.com is also in the process of optimizing its Germline implementation to greatly reduce its memory usage. And it may team up with a cloud provider to boost its processing capacity.
The cloud option gained credence when Ancestry.com recently updated its algorithm for its ethnicity test. "We had to go back to those 200,000 people to rerun their ethnicity," said Yetman. "We did that using machines in our datacenter." But local hardware won't be enough as the number of users climbs to 500,000 -- or 1 million.
Ancestry.com is currently evaluating several cloud providers, but Yetman acknowledges that privacy issues add a degree of complexity to the move. "It gets really tricky because DNA data is so sensitive. That's one of the things that we as a company are careful with."
One potential solution: "I'm looking at bursting to the cloud… to do these calculations," Yetman said. But rather than leaving the data in the cloud, he might "pull it all back" to local storage to alleviate customers' privacy concerns.