Latest release of 10gen's database sidesteps complicated MapReduce processing with a new data-aggregation framework. That distances MongoDB from NoSQL rivals including Cassandra, HBase, and Riak, the company says.
10gen, the company behind the fast-growing MongoDB database, on Wednesday announced the general availability of a highly anticipated upgrade that promises easier analytic querying of a NoSQL database best known for speedy transactional performance.
The new release, MongoDB 2.2, is the production-ready result of a 2.1 developers' preview that has been beta tested by the MongoDB community since January. Key upgrades include a new real-time aggregation framework, new sharding and replication features for multi-data-center deployments, and improved performance and database concurrency for high-scale deployments.
The biggest news in the upgrade is clearly the new real-time data aggregation framework, which lets users directly query data within MongoDB without resorting to writing and running complicated, batch-oriented MapReduce jobs within the database.
"MapReduce works well when it's a complex analysis that you need to handle with batch processing, but if you're trying to do something simple like compute the average of a list of numbers, it's overkill," explained Jared Rosoff, director of product marketing at 10gen, in an interview with InformationWeek.
What was missing before 2.2, and indeed in most NoSQL databases, according to Rosoff, is routine query functionality for the kind of data-filtering and data-analysis tasks you would handle with SQL if you were using a relational database. That's exactly what the data aggregation framework provides: a collection of data operators that can handle 80% of the tasks that MongoDB developers used to handle with MapReduce, according to 10gen.
The MongoDB query language is not SQL, but 10gen describes it as a simple, expressive language with a straightforward syntax for efficient querying. Examples of simple query statements include "sum," "min," "max," and "average" (written $sum, $min, $max, and $avg in MongoDB's syntax). These sorts of operators would be familiar to any database veteran or analyst, and they're applied in a real-time data-processing pipeline that delivers sub-second performance, according to 10gen.
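To make the operators concrete, here is a minimal Python sketch. The `pipeline` value shows how a grouping stage using $sum, $min, $max, and $avg might be written; the `orders` documents, their field names, and the `group_stats` helper are hypothetical and exist only for illustration. The helper is a plain in-memory equivalent, not MongoDB itself, so the example runs without a server.

```python
from collections import defaultdict

# Hypothetical sample documents; the field names are illustrative only.
orders = [
    {"category": "books", "price": 12.0},
    {"category": "books", "price": 18.0},
    {"category": "shoes", "price": 60.0},
    {"category": "shoes", "price": 90.0},
    {"category": "shoes", "price": 30.0},
]

# A grouping stage as it might be written for the aggregation framework.
# With a running server and PyMongo, a list like this would be passed to
# collection.aggregate(pipeline); here it is only data and is never executed.
pipeline = [
    {"$group": {
        "_id": "$category",
        "total": {"$sum": "$price"},
        "lowest": {"$min": "$price"},
        "highest": {"$max": "$price"},
        "average": {"$avg": "$price"},
    }}
]


def group_stats(docs, key, field):
    """In-memory stand-in for the grouping stage above (an assumption-laden
    sketch, not MongoDB's implementation)."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[key]].append(doc[field])
    return {
        group: {
            "total": sum(values),
            "lowest": min(values),
            "highest": max(values),
            "average": sum(values) / len(values),
        }
        for group, values in buckets.items()
    }


stats = group_stats(orders, "category", "price")
print(stats["books"])
```

The point of the comparison is that the whole computation is declared in one small stage rather than in separate map and reduce functions.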
Other available query statements include "project," which is used to select desired attributes and ignore everything else. "Group" collects documents that share a given attribute so their values can be combined. "Match" is a filter that can be used to eliminate documents from a query. "Limit," "skip," and "sort" are statements used in much the same way they're used in SQL: to limit a query to a desired number of results, to skip over a given number of results, and to sort results alphabetically, numerically, or by some other value.
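The way these statements chain into a pipeline can be sketched in a few lines of Python. The stage names ($match, $project, $sort, $skip, $limit) follow MongoDB's syntax, but `run_pipeline` is a hypothetical in-memory emulator written for this example. It supports only equality matches and simple field inclusion, and it leaves out grouping to stay short, so treat it as an illustration of stage ordering rather than of the database's actual behavior.

```python
def run_pipeline(docs, pipeline):
    """Apply a small subset of aggregation-style stages to a list of dicts.

    Hypothetical emulator for illustration only; MongoDB supports far more.
    """
    for stage in pipeline:
        (op, spec), = stage.items()  # each stage holds exactly one operator
        if op == "$match":
            # Keep documents whose fields equal every value in the spec.
            docs = [d for d in docs
                    if all(d.get(k) == v for k, v in spec.items())]
        elif op == "$project":
            # Keep only the fields flagged with 1.
            docs = [{k: d[k] for k in spec if spec[k] and k in d}
                    for d in docs]
        elif op == "$sort":
            docs = list(docs)
            # Sort by the last key first so earlier keys take precedence.
            for field, direction in reversed(list(spec.items())):
                docs.sort(key=lambda d: d[field], reverse=(direction == -1))
        elif op == "$skip":
            docs = docs[spec:]
        elif op == "$limit":
            docs = docs[:spec]
        else:
            raise ValueError(f"unsupported stage: {op}")
    return docs


# Hypothetical documents; names and fields are made up for the example.
products = [
    {"name": "novel", "type": "book", "price": 12, "in_stock": True},
    {"name": "atlas", "type": "book", "price": 40, "in_stock": True},
    {"name": "cookbook", "type": "book", "price": 25, "in_stock": False},
    {"name": "sneaker", "type": "shoe", "price": 60, "in_stock": True},
    {"name": "textbook", "type": "book", "price": 80, "in_stock": True},
]

pipeline = [
    {"$match": {"type": "book", "in_stock": True}},  # filter documents
    {"$project": {"name": 1, "price": 1}},           # keep two attributes
    {"$sort": {"price": -1}},                        # most expensive first
    {"$skip": 1},                                    # drop the top result
    {"$limit": 2},                                   # return at most two
]

result = run_pipeline(products, pipeline)
print(result)
```

Each stage consumes the output of the one before it, which is why filtering early (before sorting or projecting) keeps the later stages small.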
SQL veterans might ask, "Why not just use a relational database?" Rosoff says MongoDB is displacing products like Oracle Database and Microsoft SQL Server because of its scalability and flexibility. MongoDB runs on low-cost, highly distributed nodes of commodity hardware much like Hadoop, but unlike that data-processing platform, it's a database that can run applications.
Like other NoSQL databases, MongoDB gives users the flexibility to store and recall any type of data without the rigid constraints of a fixed data model--something that relational databases demand. New data types including complex data and loosely structured textual information can be added without first conforming the data to a predefined schema.
"Customers frequently tell us they've spent as long as a year trying to model complicated schemas in relational databases but they just couldn't make it work or perform," Rosoff said. "People are adopting Mongo because every document stored in the database can have slightly different fields, and documents can have more structure than rows in a relational database."
A good use case for NoSQL is modeling a product catalog for an e-commerce site. If that site sells books, shoes, furniture, and MP3s, the catalog will require many different fields to cover diverse product attributes, but at the same time, all of those products have product IDs, prices, and descriptions. That's hard to structure in a relational database, but "you can model that type of data much more simply in Mongo," Rosoff said.
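A rough sketch of what such a catalog might look like as documents, written here as plain Python dictionaries. Every product, field name, and value below is invented for illustration; the point is only that the shared fields appear on every document while type-specific fields appear where they apply.

```python
# Hypothetical catalog documents: all share product_id, price, description,
# and each carries its own type-specific attributes.
catalog = [
    {"product_id": 1, "price": 14.99, "description": "Paperback novel",
     "author": "A. Writer", "pages": 320},
    {"product_id": 2, "price": 79.00, "description": "Running shoes",
     "size": 10, "color": "blue"},
    {"product_id": 3, "price": 450.00, "description": "Oak desk",
     "dimensions_cm": {"width": 140, "depth": 70, "height": 75}},
    {"product_id": 4, "price": 0.99, "description": "MP3 single",
     "artist": "Some Band", "duration_sec": 215},
]

COMMON_FIELDS = {"product_id", "price", "description"}

# Shared fields can be queried uniformly across every product type.
under_100 = [p["product_id"] for p in catalog if p["price"] < 100]

# Type-specific fields are simply absent where they do not apply, so no
# table needs a nullable column for every possible attribute.
extra_fields = {p["product_id"]: sorted(set(p) - COMMON_FIELDS)
                for p in catalog}

print(under_100)
print(extra_fields)
```

Note that the desk's dimensions are a nested structure inside a single document, which is the "more structure than rows" that Rosoff describes.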
The new aggregation framework promises to fill the need for fast, simple querying in MongoDB, but more complex analyses can still be handled with MapReduce processing within the database. And for really complex data processing and analyses, there's a MongoDB-Hadoop connector that lets users handle those tasks on separate Hadoop clusters.
New multi-data-center support features included in the 2.2 release give administrators tighter control over data location to meet compliance demands. For example, certain privacy regulations in Europe demand that customer data be stored within the country or continent. Tag-aware database sharding and replication features in 2.2 support location-based storage and retention. In addition, different types of data can be assigned to content-appropriate hardware, as in fast storage for frequently accessed data and low-cost options for archival information.
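The idea behind tag-aware placement can be sketched as follows. This is a toy Python model, not MongoDB's balancer: the shard names, the tags, the compound (region, customer id) shard key, and the `eligible_shards` helper are all assumptions made for the example. It assumes that a range is inclusive of its lower bound and exclusive of its upper bound, and that keys outside every tagged range may be placed on any shard.

```python
# Hypothetical cluster: each shard is labeled with one or more tags.
shard_tags = {
    "shard-eu-1": {"EU"},
    "shard-eu-2": {"EU"},
    "shard-us-1": {"US"},
}

# Hypothetical tag ranges over a compound shard key (region, customer_id).
# Each entry is (lower bound inclusive, upper bound exclusive, tag).
tag_ranges = [
    (("EU", 0), ("EU", float("inf")), "EU"),
    (("US", 0), ("US", float("inf")), "US"),
]


def eligible_shards(shard_key):
    """Return the shards allowed to hold a document with this shard key."""
    for low, high, tag in tag_ranges:
        if low <= shard_key < high:  # tuples compare field by field
            return sorted(s for s, tags in shard_tags.items() if tag in tags)
    # Assumption: keys outside every tagged range may live on any shard.
    return sorted(shard_tags)


print(eligible_shards(("EU", 1042)))
print(eligible_shards(("US", 7)))
print(eligible_shards(("AP", 3)))
```

The same mechanism could express the hardware example: tagging shards as, say, "fast" and "archive" and mapping key ranges for recent and old data to each.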
MongoDB 2.2 performance and concurrency are said to be improved by a new locking architecture that 10gen says handles frequent database reads and writes. Locking protects data integrity by ensuring that one transaction is completed before another can update the same information. By using a more fine-grained locking approach and detecting when data is on disk rather than in RAM, 10gen says MongoDB 2.2 handles more disk input and output demands under load without degrading database performance.
The performance gains and multi-data-center support features are table stakes for big data deployments that 10gen had to deliver. The data aggregation framework distances MongoDB from NoSQL competitors including Cassandra, HBase, and Riak, according to Rosoff. Gartner analyst Merv Adrian told InformationWeek he's cautiously optimistic that 10gen will deliver what's promised.
"Time will tell if 10gen's '80% of the use cases' assertion proves out, but there is no doubt that grouping and aggregation functions do make up a lot of the intended [analytic] work in their customer and prospect base," Adrian said.