If you ever want to see what's wrong with the growing world of corporate data, check out Winter Corp.'s "2005 Top Ten Program." Winter has been publishing a survey of the biggest and baddest databases for several years, and the results are as much a triumph of database technology as they are an indictment of the more-data-is-better mentality that seems to pervade IT departments and executive suites alike.
Just because we can build huge databases, and put every bit of data on line, doesn't mean we should. One simple reason we shouldn't is that big is not necessarily better — logic that should resonate with business and IT managers.
Big often means too much money spent on hardware, software and administration, and too much time sorting through what is often a landfill's worth of data. Big can mean lousy throughput and even worse analysis. And big also can indicate a lack of understanding of the real goal of hanging on to historical data, and how sampling and other Statistics 101 techniques make it possible to analyze only the data you need, instead of all the data you have.
So every time I see a multi-terabyte database, I begin to wonder if the company that has built such a monstrosity really understands the content or value of its data. Or is it building a "just-in-case" database, a kind of cover-your-analysis solution that ought to be an order of magnitude smaller to be as useful as possible? Unfortunately, "just in case" and "CYA" seem to be the order of the day.
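For readers who want to see the Statistics 101 point in action, here is a minimal sketch of estimating an average from a simple random sample instead of scanning every row. Everything in it is an assumption made for illustration: the function name `estimate_mean`, the synthetic "order amounts" and the sample size are hypothetical and have nothing to do with the databases in the Winter survey.

```python
import random
import statistics


def estimate_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns (estimate, margin), where margin is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    # Draw sample_size distinct items without replacement.
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the mean, from the sample standard deviation.
    std_err = statistics.stdev(sample) / sample_size ** 0.5
    return mean, 1.96 * std_err


if __name__ == "__main__":
    # Synthetic stand-in for one column of a large table (hypothetical data).
    data_rng = random.Random(42)
    population = [data_rng.expovariate(1 / 50.0) for _ in range(200_000)]

    estimate, margin = estimate_mean(population, sample_size=10_000)
    full_scan = statistics.fmean(population)

    print(f"sample estimate: {estimate:.2f} +/- {margin:.2f}")
    print(f"full-scan mean:  {full_scan:.2f}")
```

The sketch reads 5 percent of the rows and reports how far off the estimate is likely to be, which is the trade a sampling approach makes. Whether that trade is acceptable depends on the question being asked: an average or a trend tolerates sampling well, while a search for one specific record does not.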
The Big Ones
Against this backdrop of too big, too slow and too clueless, the Winter data is amazing. Once you're done being impressed by the sheer bulk of the Top Ten, you really should think about whether your company would want to find itself in the winner's circle. In other words: Do you really want to aspire to running a 20-plus terabyte transaction database and a 100-TB data warehouse?
To be fair, there's some justification for the size of some of these monsters: Yahoo's 100-TB data warehouse may have a meaningful raison d'etre. And maybe Amazon's two mega-data warehouses, coming in at 24 TB and 19 TB respectively, make some sense too. There may even be a smart business case for the U.S. Patent and Trademark Office's 16-TB transaction database. But for the rest of us, there has to be a better way.
One of the better ways is to archive the data, though most of these solutions slow access to a crawl. Running an historical report means locating a tape, mounting it, indexing it, loading an operational data store and running the report against what you hope is the right data set. That usually takes many minutes. Of course, most archiving solutions don't really change the amount of data you're trying to store, just whether it's on line or not. Archiving can improve the throughput of your on-line data significantly, but at the cost of gumming up the analysis of your off-line data.
One archiving vendor, SAND Technology, can create what it calls a "near-line" archive that can be queried without a complex restore process. SAND's compression technology also reduces the overall data footprint by an order of magnitude. This means that SAND can solve the cost, throughput and data storage problems, thus giving archiving a much-needed image upgrade.
It's about time, because these mega-terabyte databases desperately need to be put on a diet. They're too big, too costly and too inefficient, despite the prevalence of cheaper hardware and faster software. And, fundamentally, these mega-terabyte databases are evidence of a lack of strategic thinking about the most strategic asset in the company: data.
That's the biggest problem of all. Technology tends to reward sloppy thinking and sloppy actions with trouble, and trouble with the corporate database is trouble at the heart of a company. You may not qualify for the Winter Top Ten, but if you're getting close, you may want to rethink your database strategy before it's too late.
Joshua Greenbaum is a principal at Enterprise Applications Consulting. Write to him at [email protected].