eBay's Enormous Data Warehouses Detailed - InformationWeek

Commentary
5/1/2009
09:38 AM
Curt Monash

eBay's Enormous Data Warehouses Detailed

A few weeks ago, I had the chance to visit eBay and meet with executive Oliver Ratzesberger and his team... Now I'm finally writing about the core of what we discussed, which is two of the very largest data warehouses in the world.

A few weeks ago, I had the chance to visit eBay, meet briefly with Oliver Ratzesberger and his team, and then catch up later with Oliver for dinner. I've already alluded to those discussions in a couple of posts, specifically on MapReduce (which eBay doesn't like) and the astonishingly great difference between high- and low-end disk drives (to which eBay clued me in). Now I'm finally getting around to writing about the core of what we discussed, which is two of the very largest data warehouses in the world.

Metrics on eBay's main Teradata data warehouse include:

  • >2 petabytes of user data
  • 10s of 1000s of users
  • Millions of queries per day
  • 72 nodes
  • >140 GB/sec of I/O, or about 2 GB/node/sec, although that may be a peak reached when the workload is scan-heavy
  • 100s of production databases being fed in

Metrics on eBay's Greenplum data warehouse (or, if you like, data mart) include:

  • 6 1/2 petabytes of user data
  • 17 trillion records
  • 150 billion new records/day, which seems to suggest an ingest rate well over 50 terabytes/day
  • 96 nodes
  • 200 MB/node/sec of I/O (that's the order of magnitude difference that triggered my post on disk drives)
  • 4.5 petabytes of storage
  • 70% compression
  • A small number of concurrent users
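
The two sets of metrics above can be cross-checked with a little arithmetic. The sketch below is only a back-of-envelope calculation: it assumes decimal units (1 GB = 10^9 bytes), treats the quoted lower bounds as exact values, and derives an average record size from the quoted totals, which is an inference rather than a number eBay supplied.

```python
# Back-of-envelope checks on the quoted metrics.
# Assumption: decimal units (1 GB = 1e9 bytes, 1 TB = 1e12, 1 PB = 1e15).
GB, TB, PB = 1e9, 1e12, 1e15

# Teradata: >140 GB/sec of I/O across 72 nodes (lower bound used as-is).
teradata_io_per_node = 140 * GB / 72      # bytes/sec per node

# Greenplum: 200 MB/node/sec of I/O.
greenplum_io_per_node = 200e6             # bytes/sec per node

# The "order of magnitude" difference between the two systems.
io_ratio = teradata_io_per_node / greenplum_io_per_node

# Greenplum: 17 trillion records, growing by 150 billion records/day.
total_records = 17e12
records_per_day = 150e9
days_of_data = total_records / records_per_day

# Average uncompressed record size implied by 6 1/2 PB of user data
# (an inferred figure, not one quoted by eBay).
bytes_per_record = 6.5 * PB / total_records
ingest_per_day = records_per_day * bytes_per_record

print(f"Teradata I/O per node:  {teradata_io_per_node / GB:.2f} GB/sec")
print(f"Teradata vs. Greenplum: {io_ratio:.1f}x per node")
print(f"Days of data held:      {days_of_data:.0f}")
print(f"Implied record size:    {bytes_per_record:.0f} bytes")
print(f"Implied daily ingest:   {ingest_per_day / TB:.0f} TB/day")
```

Under these assumptions the per-node I/O gap comes out at a little under 10X, and the implied ingest rate lands in the high 50s of terabytes per day, consistent with "well over 50 terabytes/day."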

eBay's Teradata installation is a full enterprise data warehouse. Besides size and scope, it is most notable for its implementation of Oliver's misleadingly named analytics-as-a-service vision. In essence, eBay spins out dozens of virtual data marts, which:

  • Combine views and aggregations on the central data warehouse with (optionally) additional "private" data the data mart user loads in.
  • Are usually <5 terabytes in size, and indeed often <500 gigabytes.
  • Can be created "instantaneously" by setting permissions, resource quotas, and the like.

The whole scheme relies heavily on Teradata's workload management software to deliver with assurance on many SLAs (Service-Level Agreements) at once. Resource partitions are a key concept in all this.

So far as I can tell, eBay uses Greenplum to manage one kind of data -- Web and network event logs. These seem to be managed primarily at two levels of detail -- Oliver said that the 17 trillion event detail records reduce to 1 trillion real event records. When I asked where the 17:1 ratio comes from, Oliver explained that a single web page click -- which is what is memorialized in an event record -- resulted in 50-150 details. That leaves a missing factor of 3-8X, but perhaps other less complex kinds of events are also mixed in. The Greenplum metrics I quoted above represent over 100 days of data. Ultimately, eBay expects to keep 90-180 days of ultimate detail, and >1 year of event data. The 6 1/2 petabyte figure comes from dividing 2 petabytes of compressed data by (100%-70%). Since that all fits on a 4 1/2 petabyte system, I presume there's only one level of mirroring (duh), not much temp space, and even less in the way of indexes.
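
The ratios and the compression arithmetic in the paragraph above can be worked through explicitly. This is a rough sketch, not eBay's own calculation: it assumes roughly 2 petabytes of compressed data (the amount consistent with the 6 1/2 petabyte total at 70% compression) and reads "one level of mirroring" as two full copies of the compressed data.

```python
# Detail-to-event ratio: 17 trillion detail records vs. 1 trillion events.
detail_records = 17e12
event_records = 1e12
observed_ratio = detail_records / event_records

# A single page click is said to produce 50-150 details, so compare
# that range with the observed ratio to get the "missing factor".
details_per_click_low, details_per_click_high = 50, 150
missing_low = details_per_click_low / observed_ratio
missing_high = details_per_click_high / observed_ratio

# Compression: uncompressed size = compressed size / (100% - 70%).
compressed_pb = 2.0    # assumption: about 2 PB of compressed data
compression = 0.70
user_data_pb = compressed_pb / (1 - compression)

# Assumption: one level of mirroring means two copies on disk.
mirrored_pb = compressed_pb * 2
storage_pb = 4.5
headroom_pb = storage_pb - mirrored_pb

print(f"Observed ratio:    {observed_ratio:.0f}:1")
print(f"Missing factor:    {missing_low:.1f}X to {missing_high:.1f}X")
print(f"Uncompressed data: {user_data_pb:.2f} PB")
print(f"Mirrored on disk:  {mirrored_pb:.1f} PB, headroom {headroom_pb:.1f} PB")
```

With these inputs the mirrored data takes about 4 of the 4 1/2 petabytes, which is why little room appears to be left for temp space or indexes.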

Two uses of eBay's Greenplum database are disclosed -- whittling down from detailed to click-level event data, and sessionization. The latter seems to be done in batch runs and to take 30 minutes per day. A couple of other uses are undisclosed. I assume eBay is doing something that requires UDFs (User-Defined Functions), because Oliver remarked that he likes the language choices offered by Greenplum's Postgres-based UDF capability. But basically eBay's Greenplum database is used for, and evidently does very nicely at:

  • Data ingest -- it's the first place log data goes
  • Feeding the Teradata database
  • A small number of big queries

eBay's Teradata database handles the rest.
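
For readers unfamiliar with sessionization, one of the two disclosed Greenplum uses mentioned above, here is a minimal sketch of the general technique. It is not eBay's implementation, which was not described. The `sessionize` function, the event layout, and the 30-minute inactivity gap are all assumptions chosen for illustration; the gap is a common convention and is unrelated to the 30 minutes per day the batch run is said to take.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a session ends after 30 minutes of inactivity.
# This is a common convention, not a figure disclosed by eBay.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events, gap=SESSION_GAP):
    """Group (user_id, timestamp) events into per-user sessions.

    Returns a dict mapping user_id to a list of sessions, where each
    session is a list of timestamps in ascending order.
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            # Start a new session when the pause since the previous
            # event for this user exceeds the inactivity gap.
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])
            else:
                user_sessions[-1].append(ts)
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical click events, invented for the example.
events = [
    ("alice", datetime(2009, 5, 1, 9, 0)),
    ("alice", datetime(2009, 5, 1, 9, 10)),
    ("bob", datetime(2009, 5, 1, 9, 5)),
    ("alice", datetime(2009, 5, 1, 10, 30)),
]
result = sessionize(events)
for user_id, user_sessions in sorted(result.items()):
    print(user_id, [len(s) for s in user_sessions])
```

In a database such as Greenplum the same grouping would more likely be expressed in SQL or a UDF and run as a daily batch; the Python version only shows the logic.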
