Commentary
6/22/2010
09:35 AM
Curt Monash

Exploring Netezza's Coming 6.0 DBMS

Netezza is having its user conference, Enzee Universe, in Boston Monday-Wednesday, June 21-23, and naturally will be announcing new products there, and otherwise providing hooks and inducements to get itself written about. (The preliminary count is seven press releases in all.) To get a head start, I stopped by Netezza last Thursday for meetings that included a three-and-a-half hour session with 10 or so senior engineers, and have exchanged some clarifying emails since. It might be best to start with some Netezza product introduction and naming housekeeping:

  • Netezza isn't changing the hardware on any of its existing systems at this time. Rather, Netezza's product upgrades are contained in a software-only release...
  • ...except that it isn't a "release," but rather a "wave." There are three points to that terminological distinction:
    • The advanced analytics part doesn't depend on the new database platform software.
    • Individual functions in the advanced analytics part don't necessarily depend on advances in the analytics platform.
    • It plays on the surfboard-centric naming of Netezza's appliances.
  • Netezza has wisely scrapped its prior plan to make its advanced-analytics capabilities a chargeable add-on to its core appliance products. Rather, Netezza is going to offer advanced analytics as part of its core product. Part of the reason is that the interest in these capabilities is broader than Netezza first anticipated. The name for this is something like i-Class advanced analytics capabilities.
  • There is a "release" in all this too, namely NPS 6.0 (Netezza Performance Software). That's the core DBMS technology.
  • It's all to be shipped in Q3.
Highlights of our NPS 6.0 conversation include:
  • As promised, Netezza has improved its compression significantly. Because this was anticipated, this upgrade was planned for in the design of the systems Netezza started introducing last summer. Consequently, the reduction in I/O produced by compression translates almost directly into better performance -- the silicon is now more fully loaded than it was before, but few if any actual silicon bottlenecks have been introduced by the I/O improvement.
  • Netezza's other big performance enhancement is the introduction of clustered base tables, which it says can reduce I/O by an order of magnitude or better.
  • Netezza says that there are individual queries in which the enhancements take query performance up 30-40X. (Presumably, those would be ones for which clustered base tables are a big win.)
  • More interestingly, Netezza says that overall performance is improved by >2X. That's queries, load, backup, and everything else all blended together.
  • Underpinning all this, Netezza went from 125 MHz to a blend of 125 and 250 MHz in its FPGA clock speeds. Also, the width of the FPGA onboard data path went from 16 to 32 bits. Netezza suggests that the naive calculation which says this could increase FPGA throughput 4X isn't entirely misleading.
  • Netezza is pretty content with its workload management capabilities for queries, but nonetheless keeps adding features. Workload management has not yet been extended to cover all the non-query parts of the analytic functionality.
  • Netezza continues to enhance its cost-based optimizer and query planner.
  • Netezza has long used an internal networking approach that's rather different from TCP/IP. Netezza views TCP/IP's strength as recovering gracefully if there's congestion. However, Netezza would rather do whatever it takes to preclude congestion in the first place, except perhaps in rare edge cases. I'm not aware of what enhancements, if any, have been made to Netezza's internal networking specifically in NPS 6.0.
The basic idea of clustered base tables ("base tables" are ones that are not, for example, materialized views) is to range partition in multiple dimensions at once. Then you rule out (as in don't retrieve) all those blocks that fail a match in any one of the cluster dimensions. Netezza says its customers were doing a lot of work to simulate this benefit by multiple sorts; Netezza's implementation will now handle that much more automatically. Netezza says that talking to customers revealed that 4-5 cluster dimensions was almost always the most somebody would need; they will ship support for 4. That makes sense. In most cases, you'd want to cluster on the answers to "W" questions -- Who, What, Where, When (but probably not Why), in one dimension each. However, Netezza does call out as an ideal use case geospatial, precisely because 2 (or more rarely 3) dimensions each have "equal weight."
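
To make the "rule out blocks" idea concrete, here is a minimal sketch of multi-dimensional block pruning. It assumes each block carries a (min, max) range per cluster dimension; the `Block` class, the dimension names, and the sample ranges are all hypothetical illustrations, not Netezza's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One storage block with a (min, max) range per cluster dimension (hypothetical)."""
    block_id: int
    ranges: dict  # dimension name -> (min_value, max_value)


def overlaps(block_range, query_range):
    """True if two closed intervals share at least one value."""
    lo, hi = block_range
    qlo, qhi = query_range
    return lo <= qhi and qlo <= hi


def blocks_to_read(blocks, predicate):
    """Keep a block only if every constrained dimension could match.

    A block is ruled out as soon as it fails in any one dimension.
    """
    return [
        b.block_id
        for b in blocks
        if all(overlaps(b.ranges[dim], rng) for dim, rng in predicate.items())
    ]


# Made-up example: four blocks clustered on "who" (customer) and "when" (day).
blocks = [
    Block(0, {"customer": (1, 100), "day": (1, 10)}),
    Block(1, {"customer": (1, 100), "day": (11, 20)}),
    Block(2, {"customer": (101, 200), "day": (1, 10)}),
    Block(3, {"customer": (101, 200), "day": (11, 20)}),
]
query = {"customer": (150, 160), "day": (12, 14)}
survivors = blocks_to_read(blocks, query)
print(survivors)
```

In this toy layout, a query constrained on both dimensions touches one block of four, while a query constrained on only one dimension still skips half of them. That is the effect customers were previously approximating with multiple sorts.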

I don't know how other vendors implement clustered base tables, but in Netezza's case it's via a space-filling curve. (Actually, they called it a "Hilbert space-filling curve," but I oppose that phrasing, as it's apt to lead to extremely incorrect use of the term "Hilbert space.") I.e., data is mapped to 4-tuples (say) in line with the dimensions, which are then sorted in a linear order in line with a space-filling curve. Happily, Netezza hasn't experienced problems clustering columns that have particularly challenging cardinality or skew.
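
For readers who haven't met space-filling curves, the sketch below maps 2-D grid cells to a single linear position along a Hilbert curve, using the widely published textbook conversion. It is only an illustration in two dimensions; how Netezza handles up to four dimensions and non-grid values was not described to me, and the function name `hilbert_index` is my own.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve over an n-by-n grid.

    n must be a power of two; x and y must lie in range(n).
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect so the next-smaller sub-curve has a consistent orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Sort every cell of an 8x8 grid into curve order, as a clustered table might
# order its rows before writing them to blocks.
n = 8
cells = [(x, y) for x in range(n) for y in range(n)]
ordered = sorted(cells, key=lambda c: hilbert_index(n, *c))
```

The property that matters for clustering is locality: cells that are consecutive in `ordered` are always neighbors on the grid, so rows that are close in every dimension tend to land in the same or nearby blocks.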

If I understood correctly, you can only zone map (and presumably cluster) on integers and dates right now, but that will change soon. (Edit: In blog comments and email, Tim Greenwood of Netezza explained to me that the NPS 6.0 workarounds to that were much more robust than I realized.)

Netezza put a lot of work for NPS 6 into something it calls "table grooming," which amounts to recopying tables in more beneficial form. Uses for table grooming -- which is a manually initiated process -- include but probably aren't limited to:

  • Clustering tables and, as needed, reclustering them.
  • Getting rid of data that was deleted. (Netezza has Postgres-style multiversion concurrency control -- MVCC -- but no time-travel, so keeping around deleted data is a waste of space.)
  • Recompressing data from Compress Engine 1 to Compress Engine 2.
  • Alter Table
The core ideas of table grooming include:
  • The Netezza NPS software copies rows from one place to another.
  • Netezza NPS then updates the appropriate metadata.
  • Metadata updates are transactional, even though the actual data movement is not.
This can be done part of a table at a time. Reads and loads are unaffected by the process, or at least not blocked. Delete commits are indeed blocked during a reorg, but Netezza guesses that the block holds for a few minutes during the grooming of a clustered base table, 10-15 seconds if space is being reclaimed, and something similar for an Alter Table.
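
The copy-then-swap pattern behind grooming can be sketched as follows. This is a toy in-memory model under my own assumptions (a `Table` class, rows as `(key, deleted)` pairs, a lock standing in for the transactional metadata update); it is not Netezza's implementation, and it ignores the partial-table case.

```python
import threading


class Table:
    """Toy table: a published list of (key, deleted_flag) rows (hypothetical model)."""

    def __init__(self, rows):
        self._lock = threading.Lock()
        self._rows = list(rows)

    def scan(self):
        # Readers grab whichever row list is currently published, then read it
        # without holding the lock, so grooming never blocks them for long.
        with self._lock:
            current = self._rows
        return [key for key, deleted in current if not deleted]

    def groom(self):
        """Recopy live rows in sorted order, then publish the new copy."""
        with self._lock:
            old = self._rows
        # Copy step: plain data movement, not transactional. Readers still see `old`.
        new = sorted((row for row in old if not row[1]), key=lambda row: row[0])
        # Metadata step: a single atomic switch to the new copy.
        with self._lock:
            self._rows = new
        return len(old) - len(new)  # number of deleted rows reclaimed


table = Table([(3, False), (1, True), (2, False)])
before = table.scan()
reclaimed = table.groom()
after = table.scan()
print(before, reclaimed, after)
```

Note that this sketch has no protection against rows changing between the copy and the switch, which is presumably why delete commits have to be blocked while a real groom is running.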

And finally, here are some notes on Netezza's query optimization and planning.

  • Netezza has a traditional cost-based optimizer, in which all operations have estimated costs, measured in microseconds, irrespective of which parts of the system (CPU, I/O, network, whatever) they most stress. (I have trouble imagining how a cost-based optimizer could work differently from that without incurring huge computational costs.)
  • Netezza's bottleneck is almost always disk I/O.
  • Netezza's optimizer is not/no longer based on the PostgreSQL optimizer.
  • Netezza does a lot of query transformation. Key points include:
    • Netezza joins are usually very cheap.
    • Filtered scans are cheap too.
    • More expensive in Netezza are data redistribution (duh), sorts, and unfiltered scans.
    • Most expensive of all are intermediate result sets that don't fit into memory.
  • Specific examples of Netezza query transformation include:
    • Pushing predicates out to nodes.
    • Flattening query trees and eliminating subqueries.
    • Rewriting windowed aggregates to be joins + grouped aggregates.
    • (New in 6.0) Transforming outer joins into other kinds.
  • Netezza does real-time sampling to help with query planning. (But this is only worth doing for queries that are estimated to be expensive.) Zone maps (and clustering too?) are invoked as part of deciding where to sample. Sampling was for scans only prior to NPS 6.0, and will now be done for joins as well.
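
As an illustration of the single-unit cost model described above, here is a minimal sketch in which every operation carries an estimated cost in microseconds and the planner simply picks the plan with the smallest total, whichever resource each step stresses. The `Op` class, the plan names, and every number are invented for the example; they are not Netezza estimates.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    resource: str       # "cpu", "io" or "network"; recorded but not used for costing
    est_micros: float   # estimated elapsed time in microseconds (made-up values)


def plan_cost(plan):
    """Total estimated cost: one common unit, regardless of resource."""
    return sum(op.est_micros for op in plan)


def cheapest(plans):
    """Name of the candidate plan with the lowest total estimate."""
    return min(plans, key=lambda name: plan_cost(plans[name]))


# Two hypothetical candidate plans for the same query.
plans = {
    "redistribute_then_join": [
        Op("filtered scan", "io", 2_000),
        Op("redistribute", "network", 50_000),
        Op("join", "cpu", 1_000),
    ],
    "colocated_join": [
        Op("filtered scan", "io", 2_000),
        Op("unfiltered scan", "io", 30_000),
        Op("join", "cpu", 1_000),
    ],
}
best = cheapest(plans)
print(best, plan_cost(plans[best]))
```

The invented numbers follow the relative ordering Netezza described (joins and filtered scans cheap, redistribution and unfiltered scans expensive), but the absolute values mean nothing.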
