San Francisco-based online-traffic analysis vendor Quantcast announced Thursday that it has released to the open-source community a big-data management engine designed to outperform the industry-dominating Hadoop Distributed File System.
Quantcast touts its Quantcast File System 1.0 as using half the disk space of the Hadoop Distributed File System (HDFS) while outperforming the more widely known big-data system in batch processing of data and in the speed of input/output of data between servers and back-end storage.
Quantcast File System (QFS) is designed as an alternative to Hadoop, which dominates the big-data market to such an extent that it will become the de-facto industry standard for big-data management and generate as much as $2.2 billion per year by 2018, according to a July report from MarketAnalysis.
QFS, on the other hand, is an internal app Quantcast uses to collect and analyze more than 500 billion data records per month, processing in excess of 20 petabytes per day, according to the company.
QFS is an enhanced version of the Kosmos Distributed File System (KFS), an open-source file system modeled on Google's internally designed Google File System, the data-management software that drives Google's search engine and other products.
KFS' main advantage, according to Google, is that it improves the performance of backend storage hardware for compute- and data-intensive applications such as search engines and data-mining projects.
KFS was designed to use two separate backend components: One to manage reads, writes, and searches of huge piles of data broken into chunks, and another to supply metadata defining the data's meaning and source.
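The two-component split can be pictured with a toy model. The sketch below is illustrative only and is not the KFS or QFS API: the class names, the round-robin placement, and the tiny chunk size are all assumptions, and the metadata here is reduced to a simple file-to-chunk layout.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 8  # bytes; hypothetical and tiny on purpose so the example is easy to follow


@dataclass
class ChunkServer:
    """Hypothetical storage component: holds raw chunk bytes, knows nothing about files."""
    chunks: dict = field(default_factory=dict)

    def write(self, chunk_id, data):
        self.chunks[chunk_id] = data

    def read(self, chunk_id):
        return self.chunks[chunk_id]


@dataclass
class MetaServer:
    """Hypothetical metadata component: records which chunks make up each file."""
    layout: dict = field(default_factory=dict)  # path -> [(server index, chunk id)]


def put(meta, servers, path, data):
    """Split data into chunks, spread them round-robin, and record the layout."""
    entries = []
    for n, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        server_index = n % len(servers)
        chunk_id = f"{path}#{n}"
        servers[server_index].write(chunk_id, data[start:start + CHUNK_SIZE])
        entries.append((server_index, chunk_id))
    meta.layout[path] = entries


def get(meta, servers, path):
    """Look up the layout in the metadata component, then fetch each chunk in order."""
    return b"".join(servers[i].read(cid) for i, cid in meta.layout[path])


meta = MetaServer()
servers = [ChunkServer() for _ in range(3)]
original = b"a sample record stream"
put(meta, servers, "/logs/day1", original)
restored = get(meta, servers, "/logs/day1")
```

Keeping the layout in one place means the storage components only ever deal with opaque chunks, which is the division of labor the design describes.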
Quantcast began using KFS internally for its own data management, as an alternative to Hadoop, when the app was open sourced in 2007. At the time, however, KFS was "fundamentally experimental and insufficiently stable for production usage," according to Quantcast.
Quantcast, which now uses QFS as the primary data-management app for its production applications, chose the software for load-balancing abilities that are more flexible and timely than Hadoop's, according to Schubert Zhang, a VP at Hanborq, a Hadoop performance-optimization provider based in China.
According to Quantcast, QFS outperforms Hadoop because its client software is written in C++ rather than the slower Java, and its core services are written in C++ rather than the mix of C and Java that Hadoop uses. QFS also encodes data using Reed-Solomon coding, the same error-correction technique used on DVDs, and lays the data out in nine stripes, each of which is written to a different physical disk and could be written to an entirely separate storage rack in the cluster.
Hadoop, by contrast, simply makes three copies of each data set and stashes them in different corners of the cluster so it can get to them using high-speed cluster interchanges rather than network connections.
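The "half the disk space" claim follows from simple arithmetic once the two layouts are compared. The sketch below assumes the nine stripes are split into six data stripes and three parity stripes, which is consistent with the "six or nine" drives mentioned in this article but is not stated outright; the function names and the 100 TB figure are made up for illustration.

```python
def raw_bytes_replicated(logical_bytes, copies=3):
    """Raw disk needed when every byte is stored as `copies` full copies."""
    return logical_bytes * copies


def raw_bytes_erasure_coded(logical_bytes, data_stripes=6, parity_stripes=3):
    """Raw disk needed when data is striped with extra parity stripes.

    The 6 + 3 split is an assumption; only the total of nine stripes
    comes from the article.
    """
    return logical_bytes * (data_stripes + parity_stripes) / data_stripes


logical = 100 * 10**12                       # 100 TB of user data (illustrative)
replicated = raw_bytes_replicated(logical)   # three full copies
encoded = raw_bytes_erasure_coded(logical)   # nine stripes for every six of data
ratio = encoded / replicated                 # fraction of the replicated footprint
```

Under these assumptions the striped layout needs 1.5 bytes of raw disk per byte stored, against 3 bytes for triple replication, so `ratio` comes out to one half.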
Having to go through a server's PCI bus and a comparatively slow network could make QFS slower than Hadoop. But because every read and write is parallelized across six or nine different drives, the performance of QFS rises quickly if the cluster uses 10 Gigabit Ethernet or InfiniBand rather than the more-standard gigabit Ethernet, according to Quantcast.
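A rough way to see why the network matters is to treat a striped read as limited by whichever is smaller: the combined throughput of the drives or the network link. The model below is a deliberate simplification, and the per-drive figure of 50 MB/s is a hypothetical number chosen only for illustration; it ignores protocol overhead, the PCI bus, and contention.

```python
def striped_read_rate(drives, per_drive_mb_s, network_mb_s):
    """Estimated read rate in MB/s: aggregate drive speed capped by the network link."""
    return min(drives * per_drive_mb_s, network_mb_s)


GIGABIT_MB_S = 125        # 1 Gbit/s is about 125 MB/s before overhead
TEN_GIGABIT_MB_S = 1250   # 10 Gbit/s is about 1,250 MB/s before overhead

# Six drives at a hypothetical 50 MB/s each can supply 300 MB/s in aggregate.
on_gigabit = striped_read_rate(6, 50, GIGABIT_MB_S)
on_ten_gigabit = striped_read_rate(6, 50, TEN_GIGABIT_MB_S)
```

In this toy model the gigabit link caps the read at 125 MB/s, while the faster link lets all six drives contribute their full 300 MB/s.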
To keep its performance high, QFS also includes automatic file replication; fixed-footprint management of memory; data-storage location based on space and workload rather than static tables; and direct I/O from disk. A separate module is designed to integrate data and queries across both QFS and Hadoop to make the two compatible as well, the company said.
"In our Big Data future, file systems such as QFS will underpin cost-effective critical infrastructure for commerce and government," Quantcast CEO Konrad Feldman said in a statement. "Quantcast makes use of open source software and by making our own contribution with QFS we're hopeful that others will benefit as we have."
Quantcast, a startup that launched in 2006 using Hadoop to process its data, started using KFS in 2008 and began almost immediately to enhance and expand the software to suit its own needs.
Other big-data management vendors have also put out their own alternatives to Hadoop, most notably DataStax and MapR, both of which use proprietary enhancements in combination with open-source software including Hadoop.
Binaries and source code for QFS, as well as deployment and administrator's guides, are available at no cost.