As the big data trend continues to grow due to positive media buzz, the ongoing proliferation of data generation and collection, and financial success stories, a nasty pest nibbles like a sand gnat at the practitioners. This pest is big data terminology -- or rather, the lack of a coherent and unified vocabulary of terms used in the big data arena.
The essential problem: Many concepts are floating around, and different folks understand these concepts according to their own technical backgrounds, disciplines, and work environments. Although one can presumably operate within one's own isolated framework, the terminology issue inhibits sharing of methodologies or recognizing that solutions may exist elsewhere. The terminology mess affects both business and academia.
As one example, consider the two technical areas of neural networks and non-linear regression. Both statisticians and computer scientists/engineers are heavy users of these tools. The following pairs of terms are equivalent:
Statistical Term        Neural Network Term
parameter estimation    training
steepest descent        back-propagation
intercept               bias term
derived predictor       hidden node
penalty function        weight decay
The mathematical formulations in nonlinear regression and neural networks are essentially equivalent, but the terminology is entirely different (the above correspondence is found in Applied Linear Statistical Models, 2005). The translation table shown above provides some hope for navigating the statistical and computer science literature related to this methodology, but working away from one's home discipline can still be uncomfortable or unproductive.
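To make the correspondence concrete, here is a minimal sketch in Python of a single model described in both vocabularies at once. Everything in it is an illustrative assumption rather than anything taken from the textbook: the toy data, the tanh form with one hidden node, the function names, and the step size are all made up for the example. The comments pair each statistical term with its neural network counterpart.

```python
import math

# Hypothetical toy data (x, y), invented purely for illustration.
DATA = [(-2.0, -0.9), (-1.0, -0.7), (0.0, 0.1), (1.0, 0.8), (2.0, 1.0)]
START = [0.1, 0.5, 0.0, 0.5]  # arbitrary starting values for a0, a1, b0, b1


def predict(params, x):
    # Nonlinear regression model: y_hat = b0 + b1 * tanh(a0 + a1 * x)
    a0, a1, b0, b1 = params
    hidden = math.tanh(a0 + a1 * x)  # "derived predictor" = "hidden node"
    return b0 + b1 * hidden          # b0: "intercept" = "bias term"


def penalized_loss(params, data, decay):
    # Mean squared error plus a "penalty function" = "weight decay" term.
    a0, a1, b0, b1 = params
    mse = sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)
    return mse + decay * (a1 ** 2 + b1 ** 2)


def gradient(params, data, decay):
    # Chain-rule gradient of the penalized loss; the neural network
    # literature calls this backward pass "back-propagation".
    a0, a1, b0, b1 = params
    g = [0.0, 0.0, 0.0, 0.0]
    n = len(data)
    for x, y in data:
        h = math.tanh(a0 + a1 * x)
        err = 2.0 * (b0 + b1 * h - y) / n   # d(loss)/d(y_hat)
        back = err * b1 * (1.0 - h * h)     # error passed back through tanh
        g[0] += back
        g[1] += back * x
        g[2] += err
        g[3] += err * h
    g[1] += 2.0 * decay * a1
    g[3] += 2.0 * decay * b1
    return g


def fit(data, decay=0.01, step=0.1, iters=500):
    # "Parameter estimation" = "training", done here by steepest descent.
    params = list(START)
    for _ in range(iters):
        g = gradient(params, data, decay)
        params = [p - step * gi for p, gi in zip(params, g)]
    return params


if __name__ == "__main__":
    fitted = fit(DATA)
    print("loss before:", penalized_loss(START, DATA, 0.01))
    print("loss after: ", penalized_loss(fitted, DATA, 0.01))
```

A statistician would read this as penalized nonlinear least squares; a computer scientist would read it as training a tiny network with weight decay. The code is the same either way, which is the point of the translation table.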
These issues do not reside only with these two academic departments. Specialty areas such as machine learning, artificial intelligence, Bayesian reasoning, graphical models, probabilistic networks, and pattern recognition all have their own flavors of terms, with some overlap and some contradictions.
Collecting the relevant terms, figuring out the underlying concepts, diagramming their relationships (subordinate and superordinate), and ultimately arriving at coherent and technically correct definitions is a monumental task.
This terminology situation in big data closely resembles the one that arises in international standards, so there is the possibility of learning from the standards community. The International Organization for Standardization (ISO) produces standards that provide requirements, specifications, guidelines or characteristics that can be used consistently to ensure that materials, products, processes, and services are fit for their purpose. Each technical committee of ISO in turn has a subcommittee (SC1) that deals with terminology and definitions.
Thus, there are mechanisms in place to deal with the big data terminology mess. ISO TC69 (Applications of Statistical Methods), under the lead of its terminology subcommittee (I am Chair of TC69/SC1), has for the past 15 years produced core terminology documents on general statistical terms and terms used in probability (ISO 3534-1), applied statistics (ISO 3534-2), design of experiments (ISO 3534-3), and survey sampling (ISO 3534-4).
The process used to develop these documents could be used to develop a coherent vocabulary system for predictive analytics. (I'm using this more scientific phrase rather than big data terminology since predictive analytics has a superior je ne sais quoi -- we are dealing with international standards here!)
The gloomy state of affairs noted at the outset of this piece could be addressed under the aegis of TC69/SC1. The group is currently preparing for a new work item on predictive analytics. The subcommittee has experts from numerous countries but could benefit from additional expert participants from the US. I encourage those interested to contact the American Society for Quality for possible inclusion in the process. The technical work requires the consent and support of the expert's employer. As a volunteer effort (at least from the US participation side), the process is a bit glacial, but it could eventually reach an international consensus on a terminology document.
Ultimately, the efforts could also lead to technical standards on select methodologies in big data analysis -- but first and foremost, the pesky terminology problem needs to be tackled.
Which frequently misused big data terms bug you? Tell us in the comments section.