A young startup in San Francisco is an industry-leading example of what happens when big data meets the graphics processing unit (GPU): data processing at both very high speed and high volume. MapD, maker of the Core database system and Immerse data visualization system, is a database company built around the special power of GPUs.
MapD is a sign that what used to be an add-on accelerator is becoming a mainstream part of the data management systems that IT managers will have in their arsenal.
GPUs got their start as the hardware foundation for the display of video games, with their insatiable appetite for graphical processing. Because GPUs departed from the design of "Intel inside" servers, they had special capabilities when it came to big data jobs that could be processed in parallel. GPUs are good at seizing very large chunks of data and processing them in parallel. Intel's x86 instruction set was originally geared to the single-thread, step-by-step processing of personal computing programs.
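The contrast between step-by-step and parallel processing can be sketched in a few lines of Python. This is only an illustration of the split, scan and combine pattern. It runs on ordinary CPU threads rather than a GPU, so it shows the shape of the work, not the speedup, and the function names and the threshold filter are hypothetical rather than anything taken from MapD.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches_sequential(values, threshold):
    """Step-by-step scan: one worker visits every value in order."""
    count = 0
    for v in values:
        if v > threshold:
            count += 1
    return count


def count_matches_chunked(values, threshold, workers=4):
    """Data-parallel shape: split into chunks, scan each one, combine the results."""
    # Ceiling division so that every value lands in exactly one chunk.
    chunk_size = max(1, (len(values) + workers - 1) // workers)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is scanned independently; no chunk needs another's result.
        partial_counts = pool.map(
            lambda chunk: count_matches_sequential(chunk, threshold), chunks
        )
        return sum(partial_counts)


# Hypothetical data: both approaches must agree on the answer.
data = list(range(100_000))
print(count_matches_sequential(data, 50_000) == count_matches_chunked(data, 50_000))
```

Because no chunk depends on another, the same job can in principle be spread across thousands of GPU cores instead of four threads.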
GPUs are getting their due as they are added to cloud servers or NoSQL data systems as accelerators. MapD skipped the accelerator phase and went straight to GPUs as the foundation for its data systems.
Todd Mostak, MapD's founder and CEO, developed a prototype for a GPU-based database system while waiting on the outcome of his queries on hundreds of millions of tweets as he pursued research on the Arab Spring while at MIT. Each query took hours or days to complete and he wished he had his own cluster on which to perform them. He got the idea of combining a cluster of off-the-shelf video game cards with a new design for a parallel-processing database.
Mostak pursued the possibilities of in-memory, GPU-based database processing under MIT computer science professor Sam Madden, head of the Intel Science and Technology Center for Big Data and BigData@CSAIL, a cross-discipline group of big data users at the MIT Computer Science and Artificial Intelligence Lab. The result, Mostak said in an interview, was a system that was 75 times to 3,500 times faster than a traditional CPU-bound database. (Madden developed the column-based Vertica system with Michael Stonebraker, relational database guru and a principal developer of Ingres at the former Relational Technology Inc.)
Mostak demonstrated such a system at an academic/industry event at MIT in 2013 and was asked by enterprise executives whether he could produce a model that they could use. MapD was born that year.
The following year, MapD was one of 12 companies that entered the Early Stage Challenge at the Emerging Companies Summit in San Jose, and on March 20, 2014, Mostak walked away with the $100,000 grand prize.
A GPU-powered server contains cards packed with 12-30 GPUs per card. Amazon Web Services recently announced the option of using virtualized GPU servers with up to 16 GPU processors. Mostak said the Core in-memory system could work at large scale with access to a minimum of two GPUs, thanks to the wide bandwidth built into GPU processors. A full table can be scanned by two GPUs thanks to the faster memory used with GPUs and the faster data interface, NVLink in the case of Nvidia GPUs. "It's two to four times faster than the PCI bus" used in Intel servers, he said.
Intel and AMD CPUs can move data at an optimum rate of 50-75 GB per second. GPUs from Nvidia and other makers can move it at 300-750 GB per second, he said.
"GPUs have a ton of computational bandwidth," processing discrete units of data in parallel, combined with faster memory, leading to the improved performance of a GPU-based system, Mostak said. "We can scan a multi-billion row data set in seconds," he said.
"Most major organizations are building up data sets that are too big to be handled efficiently" by CPU-based systems, he said. The GPU hardware beneath the Core system can scan 6-8 TB per second.
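To put the quoted bandwidth figures side by side, here is a back-of-the-envelope calculation in Python. It is a rough sketch that divides a table size by each quoted rate and ignores every other cost, such as query planning, decompression and result transfer. The 1 TB table size is a hypothetical chosen for illustration and does not come from Mostak.

```python
def scan_seconds(data_gb, bandwidth_gb_per_s):
    """Time to stream data_gb once at the given bandwidth, ignoring all other costs."""
    return data_gb / bandwidth_gb_per_s


TABLE_GB = 1000.0  # hypothetical 1 TB table (decimal units)

# (low, high) bandwidth in GB per second, as quoted in the article.
RATES_GB_PER_S = {
    "CPU (50-75 GB/s)": (50.0, 75.0),
    "GPU (300-750 GB/s)": (300.0, 750.0),
    "GPU hardware beneath Core (6-8 TB/s)": (6000.0, 8000.0),
}


def scan_time_ranges(table_gb=TABLE_GB):
    """Return {label: (fastest_seconds, slowest_seconds)} for each quoted rate."""
    ranges = {}
    for label, (low, high) in RATES_GB_PER_S.items():
        # The highest bandwidth gives the shortest time, the lowest gives the longest.
        ranges[label] = (scan_seconds(table_gb, high), scan_seconds(table_gb, low))
    return ranges


for label, (fastest, slowest) in scan_time_ranges().items():
    print(f"{label}: {fastest:.2f} to {slowest:.2f} seconds")
```

Under these assumptions a single pass over the table takes roughly 13 to 20 seconds at the quoted CPU rates and well under one second at 6-8 TB per second.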
One customer is Verizon, which uses a MapD system to do log analysis on the servers that constantly update the SIM cards of mobile phone users. The system could process enough data quickly enough to spot anomalies, such as some phones being updated more frequently than they needed to be, which led to overprovisioning of the servers devoted to the task, Mostak said in the interview.
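The article does not say how the Verizon analysis was actually built. As a purely illustrative sketch, spotting over-updated phones can be framed as counting update events per device and flagging the outliers. The log layout of (device ID, timestamp) pairs, the function name and the rule of three times the median are all assumptions made for this example.

```python
from collections import Counter
from statistics import median


def over_updated_devices(update_log, factor=3.0):
    """Return device IDs whose update count exceeds factor times the median count."""
    # Count how many update events each device received.
    counts = Counter(device_id for device_id, _timestamp in update_log)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(d for d, n in counts.items() if n > factor * typical)


# Hypothetical log entries: (device_id, timestamp).
log = [("phone-a", 1), ("phone-a", 2), ("phone-b", 3), ("phone-c", 4), ("phone-c", 5)]
log += [("phone-d", t) for t in range(6, 16)]  # phone-d is updated 10 times

print(over_updated_devices(log))
```

In a database such as Core, the same idea would more likely be expressed as a grouped count in SQL. The point made in the article is that the system was fast enough to run that kind of scan over a very large log.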
But the era of big data is just beginning. Mostak's firm believes GPU systems will have to be part of it. His 30-employee firm is located in San Francisco and produces only database and data visualization software, leaving Nvidia and other hardware suppliers to produce the hardware.
But the hardware is becoming more widely available. IBM was an early installer of GPU servers and GPU-based instances in its SoftLayer Cloud in July 2015.
Microsoft added GPU servers in a preview offering on Azure, its N series, in August, based on Nvidia's Tesla K80s, each containing 4,992 processing cores.
In September this year, AWS announced the P2 instance, powered by up to 16 Nvidia Tesla K80 GPUs, the largest virtual GPU server from a public cloud provider, it said.
On Nov. 15, Google followed suit with its selection of GPU servers.
So Mostak is banking on the prospect that many machine learning systems, deep learning systems and autonomous vehicles will be running MapD software on GPU hardware either on-premises or in the cloud. And what was previously a novelty, an accelerator accessory, will become a mainstay in the new digital economy.