A Good Workflow to Avoid Bad Data - InformationWeek
Commentary
4/8/2019 02:00 PM
Pierre DeBois


Data is important, yet processing data sometimes can make teams feel like they're chasing their own tails to unlock its value.

If you feel uncertain about what steps to take in creating data models and using resources effectively, you are not alone. Data is esoteric enough in its own right that analysts sometimes count simply having the data in the first place as a personal victory.

But that victory is fleeting. Data is always linked to the assumptions behind a model, and it is often unclear how best to proceed in setting up a good data workflow.

The best way to establish an effective workflow is to see how objectives can be described as a hypothesis: a supposition made on limited evidence. A hypothesis implies a null hypothesis, the idea that there is no significant difference between your default condition and the condition with the supposition in place.

Image: Shutterstock

Creating a hypothesis sounds very scientific for a business deliverable or nonprofit objective. But the limited evidence behind a hypothesis is data, the stuff that has become overwhelmingly available these days. Combine that abundance with the increased capabilities of analysis software, and you have a better way to ask questions that incorporate real-world conditions. Spatial analysis, for example, grew from the availability of GPS data, allowing professionals to import the data into models created in R and Python and do sophisticated mapping of resources, such as noting supply problems of a perishable product across regions. Even civic organizations have begun to use spatial analysis to deploy public services more effectively. As managers across many industries bring more data into key decisions, good data science precepts become great operational guidelines for data-driven organizations.

To determine a good hypothesis, analysts should frame the context of an objective against data that can potentially explain the model output. This makes the data for a model easier to understand against performance statistics, like accuracy and precision, and consequently better relates a data model to a business need. These stats may not be the KPIs that you readily report to your management colleagues, but they do indicate whether model performance is really addressing a KPI-related objective.
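As a minimal sketch of the performance statistics mentioned above: accuracy is the share of predictions a model gets right, while precision is the share of positive predictions that are actually positive. The labels below are hypothetical, just to show the arithmetic.

```python
# Sketch: accuracy and precision for a binary classifier,
# computed from hypothetical (actual, predicted) label pairs.
def accuracy_and_precision(pairs):
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)   # true positives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)   # false positives
    correct = sum(1 for a, p in pairs if a == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

# Ten hypothetical predictions
pairs = [(1, 1), (1, 1), (0, 1), (0, 0), (1, 0),
         (0, 0), (1, 1), (0, 0), (0, 1), (1, 1)]
acc, prec = accuracy_and_precision(pairs)
```

A model can score high on one statistic and poorly on the other, which is why framing them against the business objective matters.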

Let’s take an overview of a marketing mix model to see how hypothesis creation can work. A marketing mix model is a sales or market share prediction analysis based on the marketing channels used to advertise a given product or service. It compares how one channel versus another influences sales or market share: the channels used to market the product or service are the independent variables, while sales or market share is the dependent variable.
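In its simplest form, the relationship between a channel and sales can be fit with ordinary least squares. The sketch below uses one hypothetical channel (a real marketing mix model would include several) and made-up spend and sales figures.

```python
# Minimal sketch: ordinary least squares of sales (dependent variable)
# on spend in a single marketing channel (independent variable).
def ols_fit(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical monthly channel spend (thousands) and unit sales
spend = [10, 12, 15, 18, 20, 24]
sales = [110, 118, 133, 149, 158, 178]
slope, intercept = ols_fit(spend, sales)
predicted = intercept + slope * 30   # expected sales at a higher spend
```

The slope estimates how many additional units each extra thousand dollars of spend is associated with, which is exactly the kind of quantity a hypothesis about a channel can interrogate.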

Thus, a hypothesis can determine if, say, a new strategy on a social media channel significantly influenced sales in a marketing mix model. The null hypothesis in this instance is that there is no significant difference in sales performance attributable to the new social media strategy. This approach positions the analysis on assessing the degree of improvement a channel may provide, implying an answer on the return on investment expected from dedicating further budget to the channel.
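One simple, assumption-light way to test such a null hypothesis is a permutation test: if the new strategy made no difference, the "before" and "after" labels are interchangeable, so shuffling them should produce differences as large as the observed one fairly often. The weekly sales figures below are hypothetical.

```python
# Sketch: permutation test of the null hypothesis that a new social
# media strategy made no difference to weekly sales.
import random

def permutation_p_value(before, after, trials=10_000, seed=42):
    rng = random.Random(seed)
    observed = sum(after) / len(after) - sum(before) / len(before)
    pooled = before + after
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # relabel under the null hypothesis
        b, a = pooled[:len(before)], pooled[len(before):]
        diff = sum(a) / len(a) - sum(b) / len(b)
        if diff >= observed:
            extreme += 1
    return extreme / trials

before = [100, 96, 104, 99, 101, 98]     # weekly sales, old strategy
after = [108, 112, 105, 110, 107, 111]   # weekly sales, new strategy
p = permutation_p_value(before, after)
```

A small p-value (commonly below 0.05) is evidence against the null hypothesis, i.e., evidence that the strategy change is associated with a real shift in sales.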

Building a model based on your hypothesis helps you better plan how to treat the observations in the data. For example, some machine learning models cannot handle N/A values in the observations, so you have to look at a source, see whether any columns contain them, and then decide how best to address those values: Are there only a few missing values? Does the pattern seem systematic? Does your model still represent performance in a real-world condition without the fields in question?

These choices can filter down to guidance on your workflow resources. You can better approach data quality within the context of processes like CRISP-DM (Cross Industry Standard Process for Data Mining), a set of deliverables meant to allow teams to collaborate on data exploration. A hypothesis raises the question of what kind of data is needed to provide answers. Because of that, CRISP-DM deliverables, designed to explain why a given task is relevant to a business objective, can be framed to allow teams such as IT to understand what they can do to support model development. (Jessica Davis examines IT’s role in supporting a machine learning initiative in her recent post.)

Finally, you can use the latest publishing features to keep files and data together to match that workflow. Professionals should get acquainted with Markdown, a lightweight markup language that makes it convenient to publish supporting documents alongside a data model in a number of formats. Collaborators can then better understand the programming and the assumptions being applied to the data. Many integrated development environments (IDEs), like RStudio or Microsoft Visual Studio Code, include features to create and distribute Markdown documents.

Having a ton of data creates a lot of esoteric detail that can confuse the analyst and create extra work. But a good data workflow can keep one bad analysis after another from spiraling into a prolonged workday.


Pierre DeBois is the founder of Zimana, a small business analytics consultancy that reviews data from Web analytics and social media dashboard solutions, then provides recommendations and Web development action that improves marketing strategy and business profitability.