News | 4/17/2012 01:11 PM

2 Lessons Learned Managing Big Data In Cloud

Wordnik, the world's fastest-updating dictionary, explains its move from MySQL to NoSQL and a new architecture designed to run on Amazon EC2.

Human language constantly changes. In fact, it changes so quickly that Wordnik has had to push the limits of database and cloud technology just to keep pace. Here are the two key lessons Wordnik learned managing big data with MongoDB and running on Amazon's EC2 cloud.

Wordnik is an online dictionary, thesaurus, and language resource launched in 2009. It started with existing dictionaries, including the American Heritage Dictionary, WordNet, and Wiktionary. But the idea was to apply natural-language processing to newly published content to discover new words, concepts, and meanings. Sources include proprietary feeds from the likes of The Wall Street Journal, Forbes, and other publishers, as well as openly licensed images from Flickr and audio from sources including the Macmillan online dictionary.

Wordnik's corpus is now reportedly six times bigger than the Oxford English Dictionary--and it adds new citations at a rate of 5,000 to 8,000 words per second. It wasn't easy reaching that speed and scale. Wordnik launched using the LAMP stack running on Amazon EC2, but from the very beginning, when there were fewer than 1 million records and the engine was adding 50 records per second, running on MySQL was an operational nightmare, said Tony Tam, Wordnik's CTO and technical co-founder. Adding too much data too quickly locked tables for seconds, preventing updates and degrading site performance.
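
To make that bottleneck concrete, here is a minimal sketch, not Wordnik's actual code, of the one-row-at-a-time write pattern that stalls under load. It assumes a local MySQL server, a table-locking storage engine such as MyISAM (the MySQL default until version 5.5), and a hypothetical citation feed:

```python
# Minimal sketch (not Wordnik's code): on a table-locking engine such
# as MyISAM, every single-row INSERT takes a whole-table lock, so at
# 50+ records per second readers and writers queue behind each other.
import mysql.connector

def stream_of_citations():
    # Hypothetical stand-in for Wordnik's real-time citation feed.
    yield ("retweet", "wsj")
    yield ("crowdfund", "forbes")

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="wordnik_demo")
cur = conn.cursor()

for word, source in stream_of_citations():
    # Each statement contends for the same table lock.
    cur.execute("INSERT INTO words (word, source) VALUES (%s, %s)",
                (word, source))
```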

[ What about Amazon's NoSQL option? Read Amazon DynamoDB: Big Data's Big Cloud Moment. ]

Wordnik's first lesson was that it needed the speed, flexibility, and scalability of a NoSQL database. Wordnik started experimenting with 10Gen's MongoDB NoSQL database in 2009. "What jumped out at me was that within the first day, I was able to write all the software and migrate 50 million records," said Tam.

Tam was seduced by MongoDB's raw speed, which immediately boosted Wordnik's update rate to 1,000 records per second, up from 50 on MySQL. He said he was also "blown away" by the simplicity of development: where supporting multiple schemas in MySQL required all sorts of conditional logic in application code, schemas could easily coexist in MongoDB, helping to cut coding requirements by 75% compared with MySQL.
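
To illustrate the schema point, here is a minimal pymongo sketch (illustrative, not Wordnik's actual data model): documents with different shapes sit side by side in one collection, where MySQL would need conditional columns or extra tables.

```python
# Illustrative pymongo sketch (not Wordnik's schema): differently
# shaped documents coexist in one collection, so a new record type
# needs no ALTER TABLE and no conditional column logic.
from pymongo import MongoClient

words = MongoClient("mongodb://localhost:27017").wordnik_demo.words

words.insert_many([
    # A classic dictionary entry...
    {"word": "serendipity",
     "definitions": ["finding good things by chance"],
     "source": "american_heritage"},
    # ...and a freshly discovered citation, with different fields.
    {"word": "retweet",
     "citation": "a sentence seen in a publisher feed",
     "feed": "wsj", "first_seen": "2012-04-17"},
])

# Queries simply ignore fields a given document doesn't have.
print(words.find_one({"word": "retweet"}))
```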

Unfortunately, with the performance increases of MongoDB, Wordnik soon ran up against the performance limits of Amazon EC2 virtual servers. So in May 2010, Wordnik made what turned out to be a temporary move to a conventional hosted data center. The upside was that Wordnik could specify high-powered servers with 72GB of RAM, more processing cores, and big, fast hard drives. Database read and write capacity grew fivefold, and query latency dropped from 400 milliseconds to 100 milliseconds.

The downside of the move to a physical data center was that Wordnik could no longer scale at will. Lead times to add more capacity were six weeks, and the company had to pay for capacity whether it ended up needing it or not. Wordnik also does lots of offline data processing on Hadoop, and to add a data center to support those requirements would have more than doubled the company's costs.

By 2011, Wordnik determined it wanted to move back to the cloud, having learned a second key lesson: that "you can't run a big application in the cloud without making changes," Tam said.

The key architectural change was a move to a microservices architecture, whereby Wordnik split its application into smaller chunks, each with its own data tier. By distributing the workload across many more servers, Wordnik could take advantage of lower-cost virtual machines.
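
A rough sketch of that decomposition, with hypothetical service names and hosts rather than Wordnik's real ones: each service owns its own data tier, so each piece can run on small, low-cost virtual machines and be scaled independently.

```python
# Hypothetical sketch of a microservices split: each service owns its
# own data tier and connects only to its own database host, so each
# tier can be sized and scaled independently on cheap VMs.
from pymongo import MongoClient

class WordService:
    """Serves dictionary lookups; owns the words database."""
    def __init__(self, uri="mongodb://words-db:27017"):
        self.entries = MongoClient(uri).words.entries

    def lookup(self, word):
        return self.entries.find_one({"word": word})

class CitationService:
    """Ingests new citations; owns the citations database."""
    def __init__(self, uri="mongodb://citations-db:27017"):
        self.raw = MongoClient(uri).citations.raw

    def record(self, word, sentence):
        self.raw.insert_one({"word": word, "sentence": sentence})
```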

A key enabler of the new approach was MongoDB replica sets, a feature that wasn't yet in production when Wordnik was forced to move to a conventional data center. Replica sets make it easy to add physical or virtual servers: a newly added machine can sync from the primary or from any other member already in the set, and it signals when it's ready to serve traffic. To take a machine out of service, you simply remove it from the set, with no outage.

"It all comes down to the durability of the replica-set infrastructure," said Tam. "We can add and take away servers easily, and I don't have to go through crazy MySQL replication snapshots."

With the cloud-friendly ways of replica sets, Wordnik was able to move off the last of its physical servers and back to the comparatively low-cost virtualized Amazon EC2 cloud. As for the embrace of NoSQL databases, Tam said the options are maturing. Cassandra, he said, is a very good choice if you need to replicate across data centers spread over regions or continents. He called MongoDB "the race car of the NoSQL market," offering both speedy performance and speedy development.

With all the tuning Wordnik has done in its software layer, it can now update as many as 35,000 records per second. "We just can't get any faster than the way MongoDB handles our data," Tam concluded.
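
The article doesn't spell out the tuning, but batched, unordered bulk writes are one standard software-layer change that raises insert throughput; a hedged pymongo sketch:

```python
# Hedged sketch (the tuning details are not in the article): batching
# inserts and passing ordered=False lets the server apply writes in
# parallel instead of one at a time, raising throughput.
from pymongo import MongoClient

words = MongoClient("mongodb://localhost:27017").wordnik_demo.words

def bulk_load(records, batch_size=1000):
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == batch_size:
            words.insert_many(batch, ordered=False)
            batch = []
    if batch:  # flush the final partial batch
        words.insert_many(batch, ordered=False)
```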

For more detail on Wordnik's move from MySQL to MongoDB and its use of a microservices architecture, view Tony Tam's presentation.

The pay-as-you-go nature of the cloud makes ROI calculation seem easy. It's not. Also in the new, all-digital Cloud Calculations InformationWeek supplement: Why infrastructure-as-a-service is a bad deal. (Free registration required.)
