When your relational database takes longer to process your data than to collect it, it's time to call in big data technology, said panelists at Interop.
Not everyone is sure whether they have big data or not, or whether they need a NoSQL system to handle it. One way to find out, said one adopter of a NoSQL approach, is to ask yourself whether it is taking you longer to process your data than it did to collect it.
The Enterprise Cloud Summit Monday at Interop 2011 in Las Vegas, a UBM TechWeb event, called on Jeremy Edberg, senior product developer at Reddit.com, and Bradford Stephens, founder and CEO of Drawn to Scale, a big data consulting firm, to address the confusion.
Reddit.com is the social news site where anyone may submit a post of either self-created content or linked content and let other viewers vote on it. With enough positive votes versus negative, a blog, news story, or other item gets positioned on Reddit.com's front page.
Reddit.com collects so much information and records so many user interactions that Edberg realized at one point its relational database system was taking nearly as long to process the data as the site spent collecting it. He started tracking the processing time and later found that it was taking 25 hours to process data collected over 24 hours.
He concluded that the situation was untenable. If the time the database system took to extract, transform, and load the data was growing longer than the collection phase, "pretty soon we were going to be in the infinite pit of despair."
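Edberg's rule of thumb can be put in numbers. The sketch below is only an illustration, not anything Reddit.com described running: the function names are hypothetical, and it assumes each day's batch takes a fixed number of hours to process. With 25 hours of processing for every 24 hours of collection, the backlog grows by an hour a day and never recovers.

```python
def is_falling_behind(collect_hours: float, process_hours: float) -> bool:
    """True when processing a batch takes longer than collecting it did."""
    return process_hours > collect_hours


def backlog_after(days: int, collect_hours: float = 24.0,
                  process_hours: float = 25.0) -> float:
    """Hours of unprocessed work left after `days` batches.

    Assumes (for illustration) that every batch costs the same
    processing time and that spare capacity can only drain an
    existing backlog; it cannot go below zero.
    """
    backlog = 0.0
    for _ in range(days):
        backlog = max(0.0, backlog + process_hours - collect_hours)
    return backlog


if __name__ == "__main__":
    # The figures Edberg cited: 25 hours to process 24 hours of data.
    print(is_falling_behind(24.0, 25.0))
    print(backlog_after(30))
```

Any positive gap between the two figures produces the same open-ended growth; only the slope changes.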
Stephens said his experience as lead platform engineer at Visible Technologies, a firm producing business intelligence for social media, was similar to Edberg's. The main problem is that relational databases function most effectively when they sit on one large server. Relational systems do not easily distribute data across a cluster without introducing latencies into the database's operations.
Stephens said he tried to solve the problem through sharding, or distributing subsets of data around a cluster, each with its own database system to manage it as a discrete unit, "but we still couldn't get reads fast enough."
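Sharding as Stephens describes it can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Visible Technologies' implementation: the `ShardedStore` class and `shard_for` function are hypothetical names, each shard is stood in for by an in-memory dictionary rather than a separate database server, and a hash of the key decides which shard holds a record.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number with a stable hash.

    hashlib is used instead of the built-in hash() so the mapping
    stays the same across processes and restarts.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy key-value store that splits records across several shards.

    Each dict stands in for an independent database managing its own
    subset of the data as a discrete unit.
    """

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value: object) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str) -> object:
        return self.shards[shard_for(key, len(self.shards))].get(key)


if __name__ == "__main__":
    store = ShardedStore(num_shards=4)
    store.put("user:1", {"votes": 12})
    print(store.get("user:1"))
```

A lookup by key touches exactly one shard, but any query that spans keys has to visit every shard and merge the results, which is one plausible reason reads can stay slow even after the data is split.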
"You know you have a big data problem when your hardware budget is growing exponentially," he said.
Edberg agreed. It's a problem, he said, "when you have so much data in the database that you keep hiring consultants and operations guys to mitigate the effects of periodic slowdowns." They can only do so much. They will postpone the next recurrence of the problem, not eliminate it, he said.
The litany of signs continued. You know you have a big data problem when "developers want to produce new features but they're spending more time maintaining the systems than working on them ... the engineers can't seem to deliver the system's potential."
Stephens said Visible Technologies combined database triggers and Python commands in its database system, and the two conflicted with each other. Waiting for the system to sort out the conflicts before it could respond to SQL queries caused long delays, he said. Also, triggers embedded in relational database systems don't scale well to handle big data.
Stephens and Edberg are experienced at using HBase, a NoSQL system based on Hadoop open source code, and Cassandra, another open source NoSQL system, and said they can be made to scale easily. Edberg referred to both Oracle and SQL Server as having issues with "scaling out" over many servers.
"Keep a pile of commodity hardware in the corner of the data center and call it up when you need it," he said. In effect, NoSQL systems "scale out" by adding server nodes and load balancing across them. Cassandra is designed to use many nodes, and can continue operating if the server in a node fails.
HBase and Cassandra "are fantastic systems" but usually lack the ability to build indexes the way relational databases do. On the other hand, because a NoSQL database spreads its data over many nodes in a cluster, applications built on top of it can process immense amounts of data by subdividing the work among the nodes.
Edberg said HBase is fast on reads and slower on writes, which suits social networking sites that need to respond quickly to site visitors. The tradeoff is a momentary lag in updating the database with the information collected from the most recent visitors.
That lag doesn't matter much when site visitors want to see the aggregate trends among those participating. "Being fast on reads is great for our (Reddit.com) customers," said Edberg. They want to see what everybody else thinks; visitors already know what they think.