Big Data Terminology Mess Needs Cleanup

Big data needs a coherent and unified vocabulary of terms -- or we can't share solutions to problems across disciplines.

Mark E. Johnson, Contributor

November 18, 2013

4 Min Read
InformationWeek logo in a gray background | InformationWeek

As the big data trend continues to grow due to positive media buzz, the ongoing proliferation of data generation and collection, and financial success stories, a nasty pest nibbles like a sand gnat at the practitioners. This pest is big data terminology -- or rather, the lack of a coherent and unified vocabulary of terms used in the big data arena.

The essential problem: Many concepts are floating around, and different folks understand these concepts according to their own technical backgrounds, disciplines, and work environments. Although one can presumably operate within one's own isolated framework, the terminology issue inhibits sharing of methodologies or recognizing that solutions may exist elsewhere. The terminology mess affects both business and academia.

As one example, consider the two technical areas of neural networks and non-linear regression. Both statisticians and computer scientists/engineers are heavy users of these tools. The following pairs of terms are equivalent:

                        Statistical Term                   Neural Network Term

                        coefficient                             weight

                        observation                           exemplar

                        parameter estimation            training

                        steepest descent                  back-propagation

                        intercept                               bias term

                        derived predictor                  hidden node

                        penalty function                    weight decay

The mathematical formulations in nonlinear regression and neural networks are essentially equivalent, but the terminology is entirely different (the above correspondence is found in Applied Linear Statistical Models (2005). The translation table shown above provides some hope for navigating the statistical and computer science literatures related to this methodology, but it can be uncomfortable or unproductive being away from the home discipline.

These issues do not reside only with these two academic departments. Specialty areas such as machine learning, artificial intelligence, Bayesian reasoning, graphical models, probabilistic networks, and pattern recognition all have their own flavors of terms, with some overlap and some contradictions.

Collecting the relevant terms, figuring out the underlying concepts, diagramming their relationships (subordinate and superordinate), and ultimately arriving at coherent and technically correct definitions is a monumental task.

This terminology situation in big data is precisely like the one that occurs in international standards, so there is the possibility of learning from the standards community. The International Standards Organization (ISO) produces standards that provide requirements, specifications, guidelines or characteristics that can be used consistently to ensure that materials, products, processes, and services are fit for their purpose. Each technical committee of ISO in turn has a subcommittee (SC1) that deals with terminology and definitions.

Thus, there are mechanisms in place to deal with the big data terminology mess. ISO TC69 (Applications of Statistical Methods), under the lead of its terminology subcommittee (I am Chair of TC69/SC1), has for the past 15 years produced core terminology documents on general statistical terms and terms used in probability (ISO 3534-1), applied statistics (ISO 3534-2), design of experiments (ISO 3534-3), and survey sampling (ISO 3534-4).

The process used to develop these documents could be used to develop a coherent vocabulary system for predictive analytics. (I'm using this more scientific phrase rather than big data terminology since predictive analytics has a superior je ne sais quoi -- we are dealing with international standards here!)

The gloomy state of affairs noted at the outset of this piece could be addressed under the aegis of TC69/SC1. The group is currently preparing for a new work item on predictive analytics. The subcommittee has experts from numerous countries but could benefit from additional expert participants from the US. I encourage those interested to contact the American Society of Quality for possible inclusion in the process. The technical work requires the consent and support of the expert's employer. As a volunteer effort (at least from the US participation side), the process is a bit glacial, but it could eventually reach an international consensus on a terminology document.

Ultimately, the efforts could also lead to technical standards on select methodologies in big data analysis -- but first and foremost, the pesky terminology problem needs to be tackled.

Which frequently misused big data terms bug you? Tell us in the comments section.

Emerging software tools now make analytics feasible -- and cost-effective -- for most companies. Also in the Brave The Big Data Wave issue of InformationWeek: Have doubts about NoSQL consistency? Meet Kyle Kingsbury's Call Me Maybe project. (Free registration required.)

About the Author

Mark E. Johnson

Contributor

Dr. Mark E. Johnson is Professor of Statistics at the University of Central Florida in Orlando. He is a Fellow of the American Statistical Association, an elected member of the International Statistical Institute, and a Chartered Statistician with the Royal Statistical Society. He is the author of Multivariate Statistical Simulation (Wiley Applied Probability and Statistics Series) and has published in such journals as Bulletin of the American Meteorological Society, Technometrics, Biometrics, JASA, The American Statistician, J. of Statistical Planning and Inference, and Risk Analysis. Mark does extensive consulting in the area of catastrophic risks (especially hurricanes) and regularly is retained as an expert witness in legal cases.

Never Miss a Beat: Get a snapshot of the issues affecting the IT industry straight to your inbox.

You May Also Like


More Insights