When data grows into the tens or even hundreds of terabytes, you need a special technology to quickly make sense of it all. From Hadoop to Teradata, check out the top platform options.
Organizations around the globe and across industries have learned that the smartest business decisions are based on fact, not gut feel. That means they're based on analysis of data, and it goes way beyond the historical information held in internal transaction systems. Internet clickstreams, sensor data, log files, mobile data rich with geospatial information, and social-network comments are among the many forms of information now pushing information stores into the big-data league above 10 terabytes.
Trouble is, conventional data warehousing deployments can't scale to crunch terabytes of data or support advanced in-database analytics. Over the last decade, massively parallel processing (MPP) platforms and column-store databases have started a revolution in data analysis. But technology keeps moving, and we're starting to see upgrades that are blurring the boundaries of known architectures. What's more, a whole movement has emerged around NoSQL (not only SQL) platforms that take on semi-structured and unstructured information.
This image gallery presents a 2011 update on what's available, with options including EMC's Greenplum appliance, Hadoop and MapReduce, HP's recently acquired Vertica platform, IBM's separate DB2-based Smart Analytic System and Netezza offerings, and Microsoft's Parallel Data Warehouse. Smaller, niche database players include Infobright, Kognitio and ParAccel. Teradata reigns at the top of the market, picking off high-end defectors from industry giant Oracle. SAP's Sybase unit continues to evolve Sybase IQ, the original column-store database. In short, there's a platform for every scale level and analytic focus, so click on to see and read more about your options.
Modular EMC Appliance Handles Multiple Data Types
EMC acquired Greenplum in 2010, and it wasted no time in developing an EMC Greenplum Data Computing Appliance (DCA) combining EMC storage hardware and replication and recovery options with Greenplum's massively parallel processing (MPP) database. EMC's Data Computing Division is expanding on Greenplum's deep support for in-database analytics with partners including SAS and MapR.
EMC introduced its own distribution of Hadoop software in May, and a Modular DCA set for release this fall promises to support the Greenplum SQL/relational database as well as Hadoop deployments on the same appliance. With Hadoop, EMC addresses analysis of truly big data like clickstreams and unstructured data such as social-network comments. The Modular DCA will also support high-capacity storage modules on the same appliance for long-term retention of records to meet regulatory mandates.
Hadoop And MapReduce Boil Down Really Big Data
Hadoop is a collection of open-source distributed data-processing components for storing and processing structured, semi-structured, or unstructured data at truly high scale (as in tens or hundreds of terabytes or even petabytes). Clickstream and social-media analysis applications are driving much of the demand, and of particular interest is MapReduce, a technique supported by Hadoop (and a few other environments) that is ideal for processing big data sets. MapReduce breaks a big data problem into sub-problems, distributes those onto dozens, hundreds, or even thousands of processing nodes, and then combines the results into a smaller data set that's easier to analyze.
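The split, distribute, and combine flow described above can be sketched in a few lines of Python. This is a single-machine illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`), the sample clickstream, and the use of a thread pool to stand in for processing nodes are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (page, 1) pair for every click record in one chunk."""
    return [(page, 1) for page in chunk]


def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def map_reduce(records, n_chunks=3):
    # Break the input into sub-problems, one per simulated node.
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    # Worker threads stand in for the separate machines of a real cluster.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        mapped = list(pool.map(map_phase, chunks))
    # Combine the partial results into one small summary data set.
    return reduce_phase(shuffle(mapped))


# Hypothetical clickstream: one requested page per record.
clicks = ["/home", "/cart", "/home", "/search", "/home", "/cart"]
page_counts = map_reduce(clicks)
print(page_counts)
```

In a real cluster the chunks would live on different machines and the shuffle step would move data over the network, but the shape of the computation is the same: many independent map tasks followed by a reduce step that produces a much smaller result.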
Hadoop runs on low-cost commodity hardware and it scales up at a fraction of the cost of commercial storage and data-processing alternatives. That has made it a staple at Internet giants including AOL, eHarmony, eBay, Facebook, Twitter, and Netflix. But even more traditional firms coping with big data, like JPMorgan Chase, are embracing the platform.
HP Vertica Scales Up for E-Commerce Analysis
Acquired by HP in February, Vertica is a column-store database that delivers high data compression for efficient storage and fast querying in analytic applications. It's also low maintenance, able to get up and running quickly and maintain performance with less tuning than required by conventional relational databases. The database also supports massively parallel processing on commodity hardware, and HP introduced an HP Vertica appliance running on X86 hardware soon after the acquisition. MPP scalability has helped Vertica score high-end digital marketing and e-commerce customers, such as AOL, Twitter, and Groupon, that are scaling into the petabytes.
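To see why a column-store layout tends to compress well and answer analytic queries quickly, consider the minimal sketch below. It uses run-length encoding, one common column-compression technique; it is an illustration under assumed data and helper names (`to_columns`, `run_length_encode`, `count_matching`), not a description of how Vertica itself stores or encodes data.

```python
from itertools import groupby

# Hypothetical order records stored row by row: (order_id, country, status).
rows = [
    (1, "US", "shipped"),
    (2, "US", "shipped"),
    (3, "US", "pending"),
    (4, "DE", "pending"),
    (5, "DE", "pending"),
    (6, "DE", "shipped"),
]


def to_columns(rows, names):
    """Pivot row tuples into one list of values per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def count_matching(encoded, target):
    """Count rows equal to target using only the encoded runs."""
    return sum(length for value, length in encoded if value == target)


columns = to_columns(rows, ["order_id", "country", "status"])

# Values within a single column repeat often, so they collapse into few runs.
encoded_country = run_length_encode(columns["country"])
encoded_status = run_length_encode(columns["status"])

print(encoded_country)
print(count_matching(encoded_status, "pending"))
```

Two effects are visible even at this toy scale: the six country values shrink to two runs, and the count query touches only the status column rather than every field of every row.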
Before its acquisition by HP, Vertica introduced innovative options including in-memory and flash-memory analysis for faster querying. It was among the first to add Hadoop connectivity to support customers managing non-relational data in that environment, and it was also among the first to venture into cloud-based deployment options. Vertica now supports HP's Cloud Service Automation solution.
IBM Addresses Operational And Analytic Data Warehousing
IBM introduced the DB2-based Smart Analytic System (at left) last year, so why did it also acquire the separate Netezza appliance platform? The former is a platform for high-scale enterprise data warehouses capable of supporting many thousands of users and operational applications. Call centers, for example, often have multitudes of employees seeking fast recall of customer histories. The Smart Analytic System combines the DB2 database with information-integration and Cognos BI software modules preinstalled and tuned to work together and perform on the same IBM Power System (RISC or x86) platform.
Netezza is all about supporting high-scale analytic applications at digital marketing firms, telcos, and other firms mining tens or hundreds of terabytes or even petabytes of data. IBM Netezza TwinFin appliances support massively parallel processing and can be deployed in one day, according to IBM. Netezza supports deep "i-Class" in-database analytics in various languages and approaches, including Java, C, C++, Python, and MapReduce. i-Class also supports matrix-manipulation approaches, such as those used by SAS, IBM SPSS, and the R programming language. IBM Netezza recently added a high-capacity appliance for long-term archival storage needed to meet regulatory requirements.
Infobright Cuts DBA Labor And Query Times
The Infobright column-store database is aimed at analysis of moderate data volumes ranging from hundreds of gigabytes up to tens of terabytes. This is also the core market for Oracle and Microsoft SQL Server, but Infobright says its alternative database, which is built on MySQL and designed for analytic applications, delivers higher performance at lower cost with much less database administrative work. The column-store database creates indexes automatically, and there's no data partitioning and minimal ongoing DBA tuning required. The company claims customers are doing 90% less work than required for conventional databases while incurring half the cost in terms of database licensing and storage, thanks to high data compression.
Infobright's recent 4.0 release added a DomainExpert feature that lets companies ignore repeating patterns of data that don't change, such as email addresses, URLs, and IP addresses. Companies can add their own patterns as well, whether they're related to call data records, financial trading, or geospatial information. The Knowledge Grid query engine then has the brains to ignore this static data and explore only changing data. That saves query time because irrelevant data doesn't have to be decompressed and interrogated.
Kognitio Offers Three Appliance Speeds And Virtual Cubes
Kognitio is a database vendor that doesn't have its own hardware, but bowing to customer interest in rapid deployment, it offers Lakes, Rivers, and Rapids appliances with its WX2 database preinstalled on HP or IBM hardware. The Lakes configuration delivers high-capacity storage at low cost, with 10 terabytes of storage and 48 compute cores per module. Telcos or financial services might use this configuration to scan vast stores of records retained for compliance reasons. The Rivers configuration balances speed and capacity, with 2.5 terabytes of storage and 48 cores per module. For the ultimate in query performance, the Rapids configuration offers 96 cores of processing power against just 1.5 terabytes per module. This appliance is aimed at financial firms doing algorithmic trading or facing other high-performance demands.
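The capacity-versus-speed trade-off among the three configurations is easier to compare as storage per compute core. The short calculation below uses only the per-module figures quoted above; the assumptions are that data is spread evenly across cores and that a terabyte is counted as 1,000 gigabytes.

```python
# Per-module figures as quoted above: (storage in terabytes, compute cores).
CONFIGS = {
    "Lakes": (10.0, 48),
    "Rivers": (2.5, 48),
    "Rapids": (1.5, 96),
}


def gb_per_core(storage_tb, cores, gb_per_tb=1000):
    """Gigabytes each core is responsible for, assuming an even spread."""
    return storage_tb * gb_per_tb / cores


ratios = {name: gb_per_core(tb, cores) for name, (tb, cores) in CONFIGS.items()}

for name, gb in ratios.items():
    print(f"{name}: {gb:.1f} GB per core")
```

Under those assumptions, each Lakes core covers roughly 208 GB, each Rivers core roughly 52 GB, and each Rapids core under 16 GB, which is why the configurations line up from cheapest capacity to fastest queries.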
This year Kognitio added a virtual-OLAP-style "Pablo" analysis engine that offers flexible, what-if analysis by business users. This optional extension to WX2 builds virtualized cubes on the fly. Thus, any dimension of data in a WX2 database can be used for rapid-fire analysis from a cube held entirely in memory. The front-end interface for this analysis is Microsoft Excel by way of A La Carte, a Pablo feature that lets users of this familiar spreadsheet interface tap into the data in WX2.
Microsoft Scales Out SQL Server With PDW
Two and a half years in development and more than six months in preview release, the Microsoft SQL Server 2008 R2 Parallel Data Warehouse (PDW) was released in early 2011 to enable customers to scale up into deployments analyzing hundreds of terabytes. The appliance is offered on hardware from partners including Hewlett-Packard. At launch, PDW pricing was just over $13,000 per terabyte of user-accessible data, including hardware, though Microsoft shops can expect discounting. It remains to be seen how deep street-price discounts will go.
PDW, like many products, uses massively parallel processing to support high scalability, but Microsoft was late to the market and lags behind market leaders on in-database analytics and in-memory analysis. Microsoft is counting on the appeal of its total database platform as a differentiator. That means everything from its data lineage and budding master data management capabilities to its widely used Information Integration, Analysis and Reporting services, all of which are built-in components of the SQL Server database.
Microsoft announced October 12 that it will get into the big data, non-relational data world with a Windows-focused release of Apache Hadoop and a related SQL Azure Hadoop service. The Azure service will debut by the end of 2011 while the on-premises software is expected in the first half of 2012. No word on whether Microsoft will work with hardware partners on a related big data appliance.
Oracle Adds To Its Engineered Systems Story
Oracle says Exadata (shown at left) is its most successful product launch ever, landing more than 1,000 customers since it was introduced in 2008. This "engineered system" puts Oracle's 11g Database on supporting X86-based processing and disk-based storage tiers with flash cache available for ultra-fast querying. It can be used for either transactional environments or data warehousing (though not both simultaneously). Exadata's Hybrid Columnar Compression offers some of the storage efficiencies of column-store databases, delivering up to a 10-to-1 compression ratio, versus an average of 4-to-1 for most row-store databases.
Oracle expanded its engineered systems family in September by announcing the Oracle SuperCluster (shown at right), a new product due out this fall based on the new Sun Sparc T-4 chip. SuperCluster is available in full-rack or half-rack configurations, and you can add capacity in half-rack increments. The full-rack includes 1,200 CPU threads, 4 TB of DRAM, between 97 TB and 198 TB of hard disk, and 8.66 TB of flash memory.
Oracle claims transactional performance for SuperCluster will be 10 times faster and data warehousing performance 50 times faster than conventional server architectures. But as a proprietary Unix machine, SuperCluster will be swimming against the tide of data warehousing deployments moving toward scale-out architectures on commodity X86 hardware. Oracle Exadata and Exalogic are both X86-powered machines that run on Linux.
In the latest news, Oracle announced at Oracle OpenWorld in early October that it will add a distribution of Apache Hadoop software and a related big-data appliance. Oracle is also planning a separate NoSQL transactional DBMS based on the open-source BerkeleyDB product acquired with Sleepycat Software in 2006. And in another embrace of open-source software, Oracle says it will offer a distribution of the open-source R statistical environment for in-database analysis within the Oracle 11g database. Oracle has yet to announce ship dates for the Hadoop, NoSQL, and R products.
ParAccel Combines Column-Store, MPP And In-Database Analytics
ParAccel is the developer of the ParAccel Analytic Database (PADB), a database that combines the fast, selective-querying and compression advantages of a column-store database with the scale-out capabilities of massively parallel processing. The vendor says its platform supports a range of analyses, from reporting to complex advanced-analytics workloads. Built-in analytics enable analysts to perform advanced mathematical, statistical, and data-mining functions, and an open API extends in-database processing capabilities to third-party analytic applications. Table functions are used to feed data to, and receive results from, third-party and custom algorithms written in languages such as C and C++. ParAccel has partnered with Fuzzy Logix, a vendor that offers an extensive library of descriptive statistics, Monte Carlo simulations, and pattern-recognition functions. The table functions approach also supports MapReduce techniques and more than 700 analyses commonly used by financial services.
Sybase Evolves The IQ Column-Store Database
Sybase IQ from SAP's Sybase business unit was the very first column-store database management system and it remains the top seller with more than 2,000 customers. Sybase says IQ version 15.3, released this summer, can handle more data, more data types, more queries, and more users thanks to a new PlexQ massively parallel processing option. PlexQ is a grid of CPU processing power--deployed on third-party, commodity hardware--that's said to deliver 12 times faster performance compared with existing IQ deployments.
To support diverse analyses, 15.3 adds distributed processing capabilities to execute queries across multiple CPUs in the PlexQ grid. To ensure fastest querying, PlexQ also includes a Logical Servers feature that lets administrators assign virtual clusters of server capacity to specific users, departments, or queries.
A key difference between Sybase IQ and most other MPP-capable products is that it remains a shared-everything, rather than shared-nothing environment. The downside of shared everything is that processors can compete to access the shared pool of storage (usually a storage-area network), and that contention can degrade query performance. But Sybase insists shared-everything is more flexible in terms of query optimization because all the CPUs have access to all the data, so you can direct as much or as little processing power as required toward specific queries.
Teradata Moves From EDWs To Analytic Family
Once purist preachers of the enterprise data warehouse (EDW) approach, Teradata has loosened up in recent years and come out with an extended family of offerings built around the Teradata database. The company's high-performance and high-capacity products have been widely copied, as have many of the company's workload management features, including virtualized OLAP (cube-style) analysis.
Teradata has been pushing the envelope on in-database analytics, but it did not have a footing in blended analysis of structured data, semi-structured data, and largely unstructured data. That's why it bought Aster Data, which offers a SQL-MapReduce framework. MapReduce processing is in big demand because it's useful in crunching massive quantities of Internet clickstream data, sensor data, and social-media content.
Teradata recently announced plans for an Aster Data MapReduce appliance to be built on the same hardware as the Teradata appliance. It also added two-way integration between the Teradata and Aster Data databases. By buying Aster Data, Teradata has broadened what is widely regarded as the broadest, deepest, and most scalable family of products available in the data warehousing industry.
1010data Offers Big Data As A Service
1010data provides a cloud-based big data analytics platform. Many database platform vendors offer cloud-based sandbox test-and-development environments, but 1010data's managed database service is aimed at moving your entire workload into the cloud. The service supports a "rich and sophisticated array of built-in analytical functions" including predictive analytics. A key selling point is that the service includes the data modeling and design, information integration, and data transformation. Customers include hedge funds, global banks, securities exchanges, retailers, and packaged goods companies, and 1010data claims "higher performance at a fraction of the cost of other data management approaches."