2 Lessons Learned Managing Big Data In Cloud - InformationWeek
Software // Information Management
01:11 PM
Connect Directly

2 Lessons Learned Managing Big Data In Cloud

Wordnik, world's fastest-updating dictionary, explains its move from MySQL to NoSQL and a new architecture designed to run on Amazon EC2.

12 Top Big Data Analytics Players
12 Top Big Data Analytics Players
(click image for larger view and for slideshow)
Human language constantly changes. In fact, it changes so quickly that Wordnik has had to push the limits of database and cloud technology just to keep pace. Here are the two key lessons Wordnik learned managing big data with MongoDB and running on Amazon's EC2 cloud.

Wordnik is an online dictionary, thesaurus, and language resource launched in 2009. It started with existing dictionaries including the American Heritage Dictionary, WordNet, and Wiktionary. But the idea was to apply natural-language processing to newly published content to discover new words, concepts, and meanings. Sources include proprietary feeds from the likes of The Wall Street Journal, Forbes, and other publishers, as well as open-source images from Flickr, and audio from sources including the Macmillan online dictionary.

Wordnik's corpus is now reportedly six times bigger than the Oxford English Dictionary--and it adds new citations at a rate of 5,000 to 8,000 words per second. It wasn't easy reaching that speed and scale. Wordnik launched using the LAMP stack running on Amazon EC2, but from the very beginning, when there were fewer than 1 million records and the engine was adding 50 records per second, running on MySQL was an operational nightmare, said Tony Tam, Wordnik's CTO and technical co-founder. Adding too much data too quickly locked tables for seconds, preventing updates and degrading site performance.

[ What about Amazon's NoSQL option? Read Amazon DynamoDB: Big Data's Big Cloud Moment. ]

Wordnik's first lesson was that it needed the speed, flexibility, and scalability of a NoSQL database. Wordnik started experimenting with 10Gen's MongoDB NoSQL database in 2009. "What jumped out at me was that within the first day, I was able to write all the software and migrate 50 million records," said Tam.

Tam was seduced by MongoDB's raw speed, which immediately boosted Wordnik engine update speeds to 1,000 records per second, up from 50 on MySQL. He said he was also "blown away" by the simplicity of development. Where multiple schemas required all sorts of conditional logic coding in MySQL, multiple schemas could easily coexist on MongoDB, helping to reduce coding requirements by 75% compared to MySQL.

Unfortunately, with the performance increases of MongoDB, Wordnik was soon outstripping the performance constraints of Amazon EC2 virtual servers. So in May 2010, Wordnik made what turned out to be a temporary move to a conventional hosted data center. The upside was that Wordnik could specify high-powered servers with 72GB of RAM, more cores of processing power, and big, fast hard drives. Database read and write capacity grew by five times and query latency dropped from 400 milliseconds down to 100 milliseconds.

The downside of the move to a physical data center was that Wordnik could no longer scale at will. Lead times to add more capacity were six weeks, and the company had to pay for capacity whether it ended up needing it or not. Wordnik also does lots of offline data processing on Hadoop, and to add a data center to support those requirements would have more than doubled the company's costs.

By 2011, Wordnik determined it wanted to move back to the cloud, having learned a second key lesson: that "you can't run a big application in the cloud without making changes," Tam said.

The key architectural change was a move to a micro services architecture, whereby Wordnik split its application into smaller chunks, each with its own data tier. By distributing the workload across many more servers, Wordnik could take advantage of lower-cost virtual machines.

A key enabler of the new approach was MongoDB replica sets, a feature that wasn't yet in production when Wordnik was forced to move to a conventional data center. Replica sets make it easy to add physical or virtual servers, and when you add them, they can sync from the master or any one of the servers already in use. The new machine notifies clients that it's ready for use, and if you want to take a machine out of service, you simply remove it from the set and there's no outage.

"It all comes down to the durability of the replica-set infrastructure," said Tam. "We can add and take away servers easily, and I don't have to go through crazy MySQL replication snapshots."

With the cloud-friendly ways of replica sets, Wordnik was able to move off the last of its physical servers and back to the comparatively low-cost virtualized Amazon EC2 cloud. As for the embrace of NoSQL databases, Tam said the options are getting mature. Cassandra, he said, is very good choice if you need to replicate data centers spread across regions or continents. He called MongoDB "the race car of the NoSQL market," offering both speedy performance and speedy development.

With all the tuning Worknik has done in its software layer, it can now update as many as 35,000 records per second. "We just can't get any faster than the way MongoDB handles our data," Tam concluded.

For more detail on Wordnik's move from MySQL to MongoDB and the use of micro services architecture, view Tony Tam's presentation.

The pay-as-you go nature of the cloud makes ROI calculation seem easy. It’s not. Also in the new, all-digital Cloud Calculations InformationWeek supplement: Why infrastructure-as-a-service is a bad deal. (Free registration required.)

Comment  | 
Print  | 
More Insights
Oldest First  |  Newest First  |  Threaded View
User Rank: Apprentice
1/16/2014 | 11:12:31 AM
Another article about MongoDB from us with findings
We created a small article about our learnings and findings with MongoDB and it's setup in a large cluster.
That might be another interesting read for you: http://www.webpageanalyse.com/blog/findings-deploying-a-large-mongodb-sharded-cluster-with-replication-part-1
How Enterprises Are Attacking the IT Security Enterprise
How Enterprises Are Attacking the IT Security Enterprise
To learn more about what organizations are doing to tackle attacks and threats we surveyed a group of 300 IT and infosec professionals to find out what their biggest IT security challenges are and what they're doing to defend against today's threats. Download the report to see what they're saying.
Register for InformationWeek Newsletters
White Papers
Current Issue
2017 State of the Cloud Report
As the use of public cloud becomes a given, IT leaders must navigate the transition and advocate for management tools or architectures that allow them to realize the benefits they seek. Download this report to explore the issues and how to best leverage the cloud moving forward.
Twitter Feed
InformationWeek Radio
Archived InformationWeek Radio
Join us for a roundup of the top stories on InformationWeek.com for the week of November 6, 2016. We'll be talking with the InformationWeek.com editors and correspondents who brought you the top stories of the week to get the "story behind the story."
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.
Flash Poll