Commentary
11/18/2013
09:06 AM
Mark E. Johnson

Big Data Terminology Mess Needs Cleanup

Big data needs a coherent and unified vocabulary of terms -- or we can't share solutions to problems across disciplines.

As the big data trend continues to grow due to positive media buzz, the ongoing proliferation of data generation and collection, and financial success stories, a nasty pest nibbles like a sand gnat at the practitioners. This pest is big data terminology -- or rather, the lack of a coherent and unified vocabulary of terms used in the big data arena.

The essential problem: Many concepts are floating around, and different folks understand these concepts according to their own technical backgrounds, disciplines, and work environments. Although one can presumably operate within one's own isolated framework, the terminology issue inhibits sharing of methodologies or recognizing that solutions may exist elsewhere. The terminology mess affects both business and academia.

As one example, consider the two technical areas of neural networks and non-linear regression. Both statisticians and computer scientists/engineers are heavy users of these tools. The following pairs of terms are equivalent:

                        Statistical Term        Neural Network Term
                        ----------------        -------------------
                        coefficient             weight
                        observation             exemplar
                        parameter estimation    training
                        steepest descent        back-propagation
                        intercept               bias term
                        derived predictor       hidden node
                        penalty function        weight decay

The mathematical formulations in nonlinear regression and neural networks are essentially equivalent, but the terminology is entirely different (the above correspondence is found in Applied Linear Statistical Models, 2005). The translation table shown above provides some hope for navigating the statistical and computer science literatures related to this methodology, but it can be uncomfortable or unproductive to work away from one's home discipline.
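To make the correspondence concrete, here is a minimal sketch of one small model annotated with both vocabularies. It is an illustration only, not taken from the cited text: the function names (predict, penalized_loss, fit), the tanh form of the derived predictors, and all default settings are assumptions chosen for brevity.

```python
import math
import random


def predict(x, params):
    """Nonlinear regression function, i.e. a one-hidden-layer network."""
    # Derived predictors (statistics) = hidden nodes (neural networks).
    derived = [math.tanh(a + w * x) for a, w in zip(params["a"], params["w"])]
    # Intercept = bias term; coefficients = weights.
    y_hat = params["b0"] + sum(c * h for c, h in zip(params["c"], derived))
    return y_hat, derived


def penalized_loss(data, params, lam):
    """Mean squared error plus a penalty function (weight decay)."""
    mse = sum((predict(x, params)[0] - y) ** 2 for x, y in data) / len(data)
    penalty = lam * sum(v * v for v in params["c"] + params["w"])
    return mse + penalty


def fit(data, n_hidden=3, lam=1e-4, step=0.02, n_iter=2000, seed=0):
    """Parameter estimation (training) by steepest descent."""
    rng = random.Random(seed)
    params = {
        "b0": 0.0,
        "a": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "w": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "c": [rng.uniform(-1, 1) for _ in range(n_hidden)],
    }
    n = len(data)
    for _ in range(n_iter):
        g_b0 = 0.0
        g_a = [0.0] * n_hidden
        g_w = [0.0] * n_hidden
        g_c = [0.0] * n_hidden
        for x, y in data:  # each observation = exemplar
            y_hat, derived = predict(x, params)
            r = 2.0 * (y_hat - y) / n
            g_b0 += r
            for j, h in enumerate(derived):
                g_c[j] += r * h
                # Chain rule through tanh: the back-propagation step.
                back = r * params["c"][j] * (1.0 - h * h)
                g_a[j] += back
                g_w[j] += back * x
        for j in range(n_hidden):
            # Gradient of the penalty function (weight decay).
            g_c[j] += 2.0 * lam * params["c"][j]
            g_w[j] += 2.0 * lam * params["w"][j]
            params["a"][j] -= step * g_a[j]
            params["w"][j] -= step * g_w[j]
            params["c"][j] -= step * g_c[j]
        params["b0"] -= step * g_b0
    return params
```

A statistician would read fit as estimating the parameters of a penalized nonlinear regression; a neural network practitioner would read the same loop as training a network with weight decay. Calling fit on a list of (x, y) pairs returns the estimated coefficients, or trained weights, depending on which literature one comes from.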

These issues do not reside only with these two academic departments. Specialty areas such as machine learning, artificial intelligence, Bayesian reasoning, graphical models, probabilistic networks, and pattern recognition all have their own flavors of terms, with some overlap and some contradictions.

Collecting the relevant terms, figuring out the underlying concepts, diagramming their relationships (subordinate and superordinate), and ultimately arriving at coherent and technically correct definitions is a monumental task.

This terminology situation in big data is precisely like the one that occurs in international standards, so there is the possibility of learning from the standards community. The International Organization for Standardization (ISO) produces standards that provide requirements, specifications, guidelines, or characteristics that can be used consistently to ensure that materials, products, processes, and services are fit for their purpose. Each technical committee of ISO in turn has a subcommittee (SC1) that deals with terminology and definitions.

Thus, there are mechanisms in place to deal with the big data terminology mess. ISO TC69 (Applications of Statistical Methods), under the lead of its terminology subcommittee (I am Chair of TC69/SC1), has for the past 15 years produced core terminology documents on general statistical terms and terms used in probability (ISO 3534-1), applied statistics (ISO 3534-2), design of experiments (ISO 3534-3), and survey sampling (ISO 3534-4).

The process used to develop these documents could be used to develop a coherent vocabulary system for predictive analytics. (I'm using this more scientific phrase rather than big data terminology since predictive analytics has a superior je ne sais quoi -- we are dealing with international standards here!)

The gloomy state of affairs noted at the outset of this piece could be addressed under the aegis of TC69/SC1. The group is currently preparing for a new work item on predictive analytics. The subcommittee has experts from numerous countries but could benefit from additional expert participants from the US. I encourage those interested to contact the American Society for Quality for possible inclusion in the process. The technical work requires the consent and support of the expert's employer. As a volunteer effort (at least from the US participation side), the process is a bit glacial, but it could eventually reach an international consensus on a terminology document.

Ultimately, the efforts could also lead to technical standards on select methodologies in big data analysis -- but first and foremost, the pesky terminology problem needs to be tackled.

Which frequently misused big data terms bug you? Tell us in the comments section.


Comments
Alex Kane Rudansky,
User Rank: Author
11/18/2013 | 10:24:56 AM
Healthcare
I've seen the same issue arise in healthcare. Doctors have different terms for the same illnesses and medications, making big data analytics a headache. For example: Hypertension and high blood pressure. Same illness, different name. Until a standardized nomenclature is put in place, it will be hard to accurately and effectively mine electronic health record data.
Laurianne,
User Rank: Author
11/18/2013 | 9:46:28 AM
Big Data Terminology Mess
Vocabulary will be a big deal as data scientists try to communicate with people on the business side and even with people in other IT disciplines. As we just went through a large dev project here, I heard myself say several times, "I think we're just not speaking the same language." We each knew what we wanted, but the lexicon was completely different. You have an opportunity at this time, big data gurus, to set the tone in vocabulary. Which terms should we ban early?