Outbrain Outgrows Initial Big Data Infrastructure, Migrates

Content discovery platform Outbrain launched its first Hadoop pilot about 5 years ago, on the free version of the technology, giving it the flexibility to experiment. But the company recently made some changes to update its big data program.

Jessica Davis, Senior Editor

March 1, 2018


When you are planning to implement Hadoop, there's always a question of whether to stay with the open source implementation or go with one of the companies that provide a commercially supported version of the software, in this case Cloudera, Hortonworks, or MapR.

On the one hand, the free open source version costs nothing in terms of procurement. That no-cost model gives organizations quite a bit of freedom to play with a new and unfamiliar technology and figure out just what it can and can't do. There's little commitment involved.

But once you play with that technology for a while and gain the confidence to put it into production, the game sometimes changes. Maybe you are running major projects on this technology, but you spend a lot of expensive engineering time patching them. It can be tricky to add new technologies to the stack, too, because those technologies may not have been tested on the version you have.

It was a dilemma familiar to Outbrain, a company that describes itself as a content "discovery platform." When you visit a website, say CNN, and scroll to the bottom, you'll see a section called "Paid Content." That section contains a number of links to stories on other sites, and the placement of those links is paid for by the owners of the sites where those articles run. If you click through, you go to a site that has paid for placement on CNN. Publishers pay Outbrain to be placed in those locations and served to the viewing public. Outbrain's recommendation engine decides what content is most likely to appeal to you, and serves that content to you. Outbrain's algorithms are doing a good job if they show you the content that is most interesting to you, the kind that makes you want to click through.

This business is a big enterprise for the nearly 11-year-old company, which has its business headquarters in New York and many of its executives and its R&D headquarters in Netanya, Israel. The company has a monthly global audience of 557 million consumers and serves 275 billion recommendations per month. Its entire technology infrastructure operates in three data centers with more than 6,000 physical servers.

Experimenting with Hadoop

Outbrain first implemented Hadoop 5 or 6 years ago, according to Orit Yaron, the company's VP of Cloud Platform, who has overseen a number of complex migration projects for the company, including migrating its New York data center to an entirely new facility in a month's time. The Hadoop update is just the most recent project in her management of the infrastructure. Yaron told InformationWeek about the Hadoop project in an interview.

"It started with a small cluster several years ago," Yaron said. "We used to run an open source type of Hadoop, and we started small just to get started with the technology."

Yaron said that trying out new technology is intrinsic to Outbrain's culture, and that the initial Hadoop pilot was driven by the desire "to see if we could better analyze the data we had."

The upgrade

But 5 years down the road, Outbrain was bumping into some limitations that it associated with using the free version of Hadoop. For instance, the engineering team wanted to add new technologies to the stack, such as Spark and ORC, but that process had become complex and difficult in terms of version control and interoperability. That wasn't how the company preferred to work.

"If a researcher or algorithm developer wants to introduce something new, we want to say 'let's try it, let's go for it,'" Yaron said. But it had become too difficult to do that with the 5-year-old Hadoop cluster, which had grown to 330 nodes.

"It resulted in bad performance for our algorithms when the Hadoop cluster wasn't stable enough," Yaron said. "At some point it became a source of frustration."

Yaron's team decided to rebuild its Hadoop infrastructure on new physical servers and to standardize on a MapR implementation of Hadoop.

"We are a company that runs a lot of open source, and we try to contribute to the open source community as well," Yaron told me. "However, there are cases where we feel there is value to enterprise technologies."

The new physical servers changed the ratio between disk space, RAM, and CPU, Yaron said, and the combined hardware and software upgrade enabled Outbrain to reduce the footprint of Hadoop servers in the data center to one-third of what it had been before. The change also reduced energy and cooling costs. Today the Hadoop infrastructure runs on 2 clusters of 110 nodes each.

While the workforce numbers change from time to time, Outbrain currently has 5 engineers supporting the cluster and a few dozen people using it for data analysis, according to Yaron.

"It improves the algorithms we are serving," she said. "We end up with happier users at the end of the day."

About the Author(s)

Jessica Davis is a Senior Editor at InformationWeek. She covers enterprise IT leadership, careers, artificial intelligence, data and analytics, and enterprise software. She has spent a career covering the intersection of business and technology. Follow her on twitter: @jessicadavis.
