Our first InformationWeek Analytics State of Database Technology Survey reveals fault lines beneath the critically important enterprise database and data warehouse markets. For years, market leaders IBM, Microsoft, and Oracle have delivered stability and a steady flow of new capabilities in exchange for hefty license fees. But at least 10 other vendors, most of them startups and specialists, are vying to break the three-way hold on the enterprise database market by offering companies innovative technologies in areas including analytics, incorporation of new architectures, and manageability for ever-larger and more-complex systems. For example, Teradata, with annual revenue marching toward $2 billion, has a strong presence in data warehousing--the fastest-growing major market segment, and one that's already been shaken up by the appliance strategy of Netezza, now a public company growing at 40% per year. Open source products are changing market dynamics as well.
Meantime, our survey of more than 750 business technology professionals shows discontent with steep licensing and upgrade fees. Fifty-two percent of survey respondents characterize licensing costs for their primary databases as either somewhat overpriced or outright highway robbery, and they're not targeting only Oracle. That resentment is nothing new, but it's being exacerbated by workloads and data volumes that are multiplying at a staggering rate, sending costs even higher.
"Oracle pricing is off the chain," says a database administrator for Northrop Grumman. "I appreciate how excellent their product is, but enough is enough. They are so preoccupied with acquiring all of these companies that they have lost sight of their core technology. They are ripe for some disruptive technology to come along."
Like what? Try Google BigTable, Hadoop, and NoSQL.
In our full State of Enterprise Database Technology report, we analyze survey responses and the current market reality in three key areas: operational applications, including transaction processing and some forms of reporting and inquiry, typically against recent data; analytical applications, also called data warehouses or data marts, that involve query, reporting, analysis, and data mining, typically against both recent and historical data; and the new area of extreme analytics, which involves rapid analysis of extremely large volumes of data, sometimes using innovative approaches that differ from what's used in the typical data warehouse. In this story, we'll focus on the vendor landscape, the emergence of extreme analytics, database selection and management essentials, and the all-important security aspect.
A Dynamic Landscape
IBM, Microsoft, and Oracle play in just about every segment of the enterprise database market, and they're not sitting still: IBM's Smart Analytic Systems, Microsoft's Parallel Data Warehouse, and Oracle's Exadata are high-profile examples of their innovations. But our survey respondents are interested in branching out and keeping their options open. Fortunately, each of the three market segments we're focusing on has vendors worth watching. Aster Data, Greenplum (recently acquired by EMC), Infobright, Netezza, ParAccel, Sybase (recently acquired by SAP), Teradata, and Vertica specialize in data warehousing. InterSystems provides a product, Caché, for high-performance transaction processing. Sybase offers ASE for transaction processing and a separate product, Sybase IQ, for data warehousing. Hewlett-Packard is in the game with its NonStop software for transaction processing and HP Neoview for data warehousing.
Meantime, the NoSQL movement is disrupting some areas of operational database management as commercial software vendors, including MarkLogic, embrace the NoSQL concept (see related story). There are at least three open source relational database management products on the market backed by commercial vendors: Ingres, MySQL (acquired by Oracle), and PostgreSQL. XtremeData has introduced a product for extreme analytics. New and established vendors, from Cloudera to IBM, are offering enhanced distributions of Hadoop.
Our point is that if you haven't investigated alternatives for a few years, you'll be pleasantly surprised by the variety. As you consider emerging systems, assess them in five key areas.
Usability: Data can be easily located, maintained, and retrieved. Data from different sources is readily integrated and shared among end users and applications. The timeliness and consistency of data can be managed to meet business needs. Various users can have data presented in the formats, structures, and representations that work best for them. A wide range of query types and access patterns are readily supported. Specifics of how to access and retrieve data--which may change as the database evolves--don't have to be programmed into applications.
Performance and scalability: Vendors have invested heavily in this area in the last several years, but data volumes, workloads, and complexity have grown, and continue to grow, beyond expectations. Performance and scalability still vary drastically by application, database, and platform. Investigate how much work you'll have to do to keep the system performing well, and how much pain and disruption you'll face each time the system or database size must increase by 50% or 100%.
Security: Data can be protected against loss, unauthorized access, theft, vandalism, and disasters. We've produced a range of reports on this topic, and its importance can't be overstated. Yet just 24% of the respondents to our survey say they're very satisfied with the security of their current database environments.
Back-end operations: Known data relationships are defined and enforced by the database system rather than in many separate applications. Semantic relationships among data values not recognized in advance are, once discovered, readily exploited. The meaning and acceptable values of data can be easily and consistently understood across the user community and maintained over time. Database systems--and the business activities that rely on them--generally perform to service-level objectives, are manageable, and are expandable without unreasonable disruption or expense.
Business consistency considerations: Application development can be completed at lower cost because necessary data can be readily identified, located, and used--and it has defined and predictable quality. Business decisions are more likely to be correct because they can be based on shared, timely facts that are consistent across the enterprise. New sources of business data are readily incorporated into databases and integrated with existing data. Applications run more reliably because they aren't disrupted by unavailable, incorrect, or unexpected data. Databases and the applications that depend on them can evolve to satisfy new business requirements and are easily maintained over time.
Together, these criteria don't just comprise a purchase evaluation checklist. They're also a best-practices blueprint for a long-term database management program--and without that, data integration will suffer. Thought that problem was solved? Not yet. Of our survey respondents, just 25% maintain a single data warehouse repository, and in many enterprises, critical information remains fragmented over hundreds or even thousands of separate database platforms--the notorious data silos.
Vendors are working to help IT reduce compartmentalization, and they're often succeeding. One survey respondent has standardized on Oracle databases across various customer service systems because they allow for easy cross-relational linking of disparate yet common tables. Major advances in recent years include the appliance model; broader acceptance of parallel architectures; better data compression; exploitation of rising processor power to store, read, and write data more efficiently; use of solid state or flash disk to increase storage bandwidth and lower storage latency; introduction of sophisticated data partitioning and clustering methods; and more systems management automation, particularly in such areas as mixed workloads, performance management, and troubleshooting.
While some commercial products, such as those from Teradata and Netezza, are aimed directly at analytics, data warehouse vendors have generally responded to the growing demand for this capability by integrating analytic technologies into their data warehouse engines, commonly using a separate SAS server or infrastructure maintained for this purpose. The big problem is this: If information already resides in the data warehouse, and the volume of data to be analyzed is considerable, then just moving the data from the data warehouse to the analytical server can give rise to a performance nightmare. In addition, if the analytical tool isn't highly parallel, analysis can take a long time. One solution is to perform the analysis in place on the data warehouse, exploiting its highly parallel architecture. Teradata has delivered capabilities for performing SAS routines within the data warehouse, while Netezza has announced such capabilities as part of Release 6 of its software.
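The trade-off between moving data to a separate analytical server and analyzing it in place can be illustrated with a small sketch. This is not how Teradata or Netezza implement in-database analytics; it assumes an in-memory SQLite table as a stand-in for the warehouse, and the table name `sales`, its columns, and both function names are hypothetical. The first function pulls every row out before computing an average, while the second lets the database compute the summary so that only the result leaves it.

```python
import sqlite3

# Hypothetical stand-in for a data warehouse: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 300.0), ("west", 50.0), ("west", 150.0)],
)


def average_by_moving_data(conn):
    """Pull every row out of the database, then analyze it externally."""
    totals, counts = {}, {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
        counts[region] = counts.get(region, 0) + 1
    return {region: totals[region] / counts[region] for region in totals}


def average_in_place(conn):
    """Ask the database to do the analysis; only summary rows leave it."""
    query = "SELECT region, AVG(amount) FROM sales GROUP BY region"
    return dict(conn.execute(query))


moved = average_by_moving_data(conn)
in_place = average_in_place(conn)
```

Both functions return the same per-region averages. The difference is how much data crosses the boundary: every row in the first case, one row per region in the second. At warehouse scale, that gap is what turns data movement into the bottleneck described above.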
Dawn Of Extreme Analytics
We see a certain set of analytic problems falling into a separate category--call it "extreme analytics." What makes it extreme? To start, very large data volumes: hundreds of terabytes to petabytes. Analysis is intensive and often clumsy or impractical to perform entirely in SQL, requiring routines or functions written in a procedural language such as Java. In our survey, 48% of respondents say they view "analytic databases" as a separate category from data warehouses or data marts. Further, 67% say that they have analytic databases and applications that are independent of their data warehouse/data mart environment.
Traditionally, data warehousing is about leveraging data over multiple uses, often over a long period of time. IT tends to make an up-front investment in carefully defining, modeling, cleansing, and integrating information so that it can serve a variety of different purposes, typically over a period of years. Those interested in analytic databases, in contrast, prioritize speed: faster development, higher throughput, and rapid alignment with technology trends. Huge volumes of data must be analyzed quickly; results are used by a small number of people for a short period of time.
Open source software also plays a role in extreme analytics. Partly in response to rapidly growing analytical requirements, technologies have emerged from Google, Yahoo, and other very-large-scale Web businesses. These technologies have been embodied in Apache Hadoop. Key elements of Hadoop are:
HBase: A scalable, distributed database that supports structured data storage for large tables;
HDFS: A distributed file system that provides high-throughput access to application data;
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying; and
MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Yahoo runs Hadoop on more than 36,000 computers, with the largest cluster containing 4,000 nodes; it's used for research on advertising systems and Web search.
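The MapReduce programming model listed above can be sketched in a single process. The following is a minimal illustration in plain Python, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and the word-count task is just a conventional example. A map step emits key-value pairs, the framework groups them by key, and a reduce step combines the values for each key.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a real cluster, each map_phase call could run on a different node;
    # here everything runs sequentially in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data warehouse"])
```

Because each map call looks at only one document and each reduce call at only one key, the work can be spread across many machines without the steps needing to coordinate, which is what makes the model suit very large clusters.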
Another key requirement in extreme analytics is economically analyzing these enormous volumes of data using enormous amounts of computer power. IBM has a major research initiative in this area. It's offering its own enhanced distribution of Hadoop and has launched a services initiative called BigInsights, all aimed at helping customers use Hadoop-related capabilities for large-scale analysis. Aster Data, Greenplum and others have introduced ways to support MapReduce processing inside their databases, and Greenplum, Teradata, and Vertica also offer integration with separate MapReduce environments.
Companies in a range of industries are finding that they must store and analyze enormous volumes of data that isn't in a structured, tabular form, says Hamid Pirahesh, an IBM fellow and a leader of IBM's research program in extreme analytics. Common examples include records structured as key-value pairs; documents, blogs, and e-mails; and data with a graphical structure. In response, IBM is working on higher-level languages that can be used by analysts to access and analyze data in Hadoop. Pirahesh says many companies are pursuing the use of large-scale Hadoop environments for analytics but also place a high priority on making such data and capabilities available to existing users of business intelligence, the data warehouse, and commercial analytic environments.
The Easy Button
IT leadership at all levels is under continued pressure to control costs, even as data and workload volumes, and user expectations around performance and data availability, increase. This combination of forces is causing CIOs to cast around for systems that are easier to implement, manage--and pay for.
Established vendors are reacting with more attractive systems and competitive pricing; we see this in the Teradata appliance line and the IBM Smart Analytic Systems. Oracle has expanded its architecture with Exadata, increasing I/O parallelism and building intelligence into the storage layer, and Microsoft's SQL Server 2008 R2 Parallel Data Warehouse has similar objectives. And we've never seen more database startups.
With such a dynamic picture, IT decision makers have to play it smart. We recommend that enterprises continue to bite the licensing bullet and use proven systems while piloting innovative technologies and approaches. Devise meaningful ways to test products before relying too much on them. With databases particularly, validation is always tricky. Because problems tend to surface later, quick benchmarks won't indicate whether a system will work for you in the long term. Above all, avoid the trap of the superficial demo. Yes, it's time-consuming to think through your database requirements in depth, but doing so means you can test at realistic levels of complexity and scale before making decisions. When you check out reference sites, find some that are solving the problems you expect to face in the next few years. Keep your eye on security, and manage carefully as you scale up.
Follow these basic principles and you'll benefit from both the established players and the upstarts in the database ecosystem. Our survey respondents are keeping open minds, and you should, too.
"We love Oracle, but with current CPUs having eight or 12 cores, we need four or six licenses to bring a single processor box on line--and typical servers do best with at least two," says one survey respondent. "Because of this, we recently acquired some SQL server licenses--the first time in almost 10 years. It's not so bad, a lot better than it used to be."
That statement just might sum up the overall mood among our respondents, many of whom say they're taking time to run some pilots. At worst, you spend a few bucks and find out you can't do without your current primary platform. At best, you find a new partner that proves to be as good a fit or better for your needs, at a lower cost.