When it comes to big data, how many V's are enough?
Analyst Doug Laney used three -- volume, velocity and variety -- in defining big data back in 2001. In recent years, revisionists have blown out the count to a too-many seven or eight. "Embrace and extend" is alive and well, it seems, expanding the market space but also creating confusion.
When a concept resonates, as big data has, vendors, pundits and gurus -- the revisionists -- spin it for their own ends. Big data revisionists would elevate value, veracity, variability/variance, viability and even victory (the last being a notion so obscure that I won't mention it further) to canonical V status. Each of the various new V's has its champions. Joining them are the contrarians who have given us the "small data" countertrend.
In my opinion, the wanna-V backers and the contrarians mistake interpretive, derived qualities for essential attributes.
The original 3 V's do a fine job of capturing essential big data attributes, but they do have shortcomings, specifically related to usefulness. As Forrester analyst Mike Gualtieri puts it, the original 3 V's are not "actionable." Gualtieri poses three pragmatic questions. The first relates to big data capture. The others relate to data processing and use: "Can you cleanse, enrich and analyze the data?" and "Can you retrieve, search, integrate and visualize the data?"
As for "small data": The concept is a misframing of the data challenge. Small data is nothing more or less than a filtered and reduced topical subset of the big data motherlode, again the product of analytics. Fortunately, attention to this bit of big data backlash seems to have ebbed, which lets us get back to the big picture.
3 V's and Beyond
The big picture is that the original 3 V's work well. I won't explain them; instead, I will refer you to "Big Data 3 V's: Volume, Variety, Velocity," an infographic posted by Gil Press. You'll see that the infographic posits viability -- essentially, can the data be analyzed in a way that makes it decision-relevant? -- as "the missing V." The concluding line: "Many data scientists believe that perfecting as few as 5% of the relevant variables will get a business 95% of the same benefit. The trick is identifying that viable 5%, and extracting the most value from it." Hmm... It seems to me that the missing V could equally well have been Value.
Neil Biehn, writing in Wired, sees viability and value as distinct missing V's. Biehn's take on viability is similar to Press's. "We want to carefully select the attributes and factors that are most likely to predict outcomes that matter most to businesses," Biehn says. I agree, but note that the selection process is purpose-driven and external to the data.
"The secret is uncovering the latent, hidden relationships among these variables," Biehn continues. Again, I agree, but how do you determine predictive viability, generated by those latent relationships among variables? Professor Gary King of Harvard University read my mind when he stated, at a conference I attended in June, "Big data isn't about the data. It's about analytics." Viability isn't a big data property. It's a quality that you determine via big data analytics.
"We define prescriptive, needle-moving actions and behaviors and start to tap into the fifth V from big data: Value," Biehn asserts. Again, how do you determine prescriptive value, which Biehn notes is derived from, and hence is not an intrinsic quality of, big data? Analytics.
Analytics verifies not only the accuracy of predictions, but also the effectiveness of outcomes in achieving goals. Analytics ascertains the validity of the methods and the ROI impact of the overall data-centered initiative. ROI quantifies value, complementing the qualitative measure, validity. Both V's are external to the data itself.
Compounding the Confusion
Variability and veracity are similarly analytics-derived qualities that relate more to data uses than to the data itself.
Variability is particularly confusing. "Many options or variable interpretations confound analysis," observed Forrester analysts Brian Hopkins and Boris Evelson back in 2011. Sure, and you can use a stapler to bang in a nail (I have), but that doesn't make it any less a stapler.
"For example, natural language search requires interpretation of complex and highly variable grammar," Hopkins and Evelson wrote. Put aside that grammar doesn't vary so much; rather, it's usage that is highly variable. Natural-language processing (NLP) techniques, as implemented in search and text-analytics systems, deal with variable usage by modeling language. NLP facilitates entity and information extraction, applied for particular business purposes.
(An entity is a uniquely identifiable thing or object; for instance, the name of a person, place, product or pattern, such as an e-mail address or Social Security number. Extractable information may include attributes of entities, relationships among entities, and constructs such as events -- "Michelle LaVaughn Robinson Obama, born January 17, 1964, an American lawyer and writer, is the wife of the 44th and current President of the United States" -- that we recognize as facts.)
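To make pattern-type entities concrete, here is a minimal Python sketch that pulls e-mail addresses and Social Security-style numbers out of free text. The two regular expressions and the extract_entities helper are hypothetical and deliberately simplistic. Production text-analytics systems rely on language models and far more robust rules, and nothing here reflects any particular vendor's method.

```python
import re

# Hypothetical, simplified patterns for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def extract_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in (("email", EMAIL), ("ssn", SSN)):
        for match in pattern.finditer(text):
            # Keep the start offset so results can be sorted by position.
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


sample = "Contact jane.doe@example.com about account 123-45-6789."
print(extract_entities(sample))
```

Note that the patterns, and the decision about which entities matter, are supplied from outside the text. The extraction is a product of analysis applied for a purpose, not a property of the data.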