2 Lessons Learned Managing Big Data In Cloud
Wordnik, world's fastest-updating dictionary, explains its move from MySQL to NoSQL and a new architecture designed to run on Amazon EC2.
Wordnik is an online dictionary, thesaurus, and language resource launched in 2009. It started with existing dictionaries including the American Heritage Dictionary, WordNet, and Wiktionary. But the idea was to apply natural-language processing to newly published content to discover new words, concepts, and meanings. Sources include proprietary feeds from the likes of The Wall Street Journal, Forbes, and other publishers, as well as open-source images from Flickr, and audio from sources including the Macmillan online dictionary.
More Software Insights
- Using InfoSphere Information Server to Integrate and Manage Big Data
- Why is Information Governance So Important for Modern Analytics?
White PapersMore >>
Wordnik's corpus is now reportedly six times bigger than the Oxford English Dictionary--and it adds new citations at a rate of 5,000 to 8,000 words per second. It wasn't easy reaching that speed and scale. Wordnik launched using the LAMP stack running on Amazon EC2, but from the very beginning, when there were fewer than 1 million records and the engine was adding 50 records per second, running on MySQL was an operational nightmare, said Tony Tam, Wordnik's CTO and technical co-founder. Adding too much data too quickly locked tables for seconds, preventing updates and degrading site performance.
[ What about Amazon's NoSQL option? Read Amazon DynamoDB: Big Data's Big Cloud Moment. ]
Wordnik's first lesson was that it needed the speed, flexibility, and scalability of a NoSQL database. Wordnik started experimenting with 10Gen's MongoDB NoSQL database in 2009. "What jumped out at me was that within the first day, I was able to write all the software and migrate 50 million records," said Tam.
Tam was seduced by MongoDB's raw speed, which immediately boosted Wordnik engine update speeds to 1,000 records per second, up from 50 on MySQL. He said he was also "blown away" by the simplicity of development. Where multiple schemas required all sorts of conditional logic coding in MySQL, multiple schemas could easily coexist on MongoDB, helping to reduce coding requirements by 75% compared to MySQL.
Unfortunately, with the performance increases of MongoDB, Wordnik was soon outstripping the performance constraints of Amazon EC2 virtual servers. So in May 2010, Wordnik made what turned out to be a temporary move to a conventional hosted data center. The upside was that Wordnik could specify high-powered servers with 72GB of RAM, more cores of processing power, and big, fast hard drives. Database read and write capacity grew by five times and query latency dropped from 400 milliseconds down to 100 milliseconds.
The downside of the move to a physical data center was that Wordnik could no longer scale at will. Lead times to add more capacity were six weeks, and the company had to pay for capacity whether it ended up needing it or not. Wordnik also does lots of offline data processing on Hadoop, and to add a data center to support those requirements would have more than doubled the company's costs.
By 2011, Wordnik determined it wanted to move back to the cloud, having learned a second key lesson: that "you can't run a big application in the cloud without making changes," Tam said.
The key architectural change was a move to a micro services architecture, whereby Wordnik split its application into smaller chunks, each with its own data tier. By distributing the workload across many more servers, Wordnik could take advantage of lower-cost virtual machines.
A key enabler of the new approach was MongoDB replica sets, a feature that wasn't yet in production when Wordnik was forced to move to a conventional data center. Replica sets make it easy to add physical or virtual servers, and when you add them, they can sync from the master or any one of the servers already in use. The new machine notifies clients that it's ready for use, and if you want to take a machine out of service, you simply remove it from the set and there's no outage.
"It all comes down to the durability of the replica-set infrastructure," said Tam. "We can add and take away servers easily, and I don't have to go through crazy MySQL replication snapshots."
With the cloud-friendly ways of replica sets, Wordnik was able to move off the last of its physical servers and back to the comparatively low-cost virtualized Amazon EC2 cloud. As for the embrace of NoSQL databases, Tam said the options are getting mature. Cassandra, he said, is very good choice if you need to replicate data centers spread across regions or continents. He called MongoDB "the race car of the NoSQL market," offering both speedy performance and speedy development.
With all the tuning Worknik has done in its software layer, it can now update as many as 35,000 records per second. "We just can't get any faster than the way MongoDB handles our data," Tam concluded.
For more detail on Wordnik's move from MySQL to MongoDB and the use of micro services architecture, view Tony Tam's presentation.
The pay-as-you go nature of the cloud makes ROI calculation seem easy. It’s not. Also in the new, all-digital Cloud Calculations InformationWeek supplement: Why infrastructure-as-a-service is a bad deal. (Free registration required.)