A Good Workflow to Avoid Bad Data
Data is important, yet processing it can sometimes make teams feel like they're chasing their own tails to unlock its value.
April 8, 2019
If you feel uncertain about which steps to take when creating data models and using resources effectively, you are not alone. Data is esoteric in its own right, so much so that analysts sometimes count simply having the data in the first place as a personal victory.
But that victory is fleeting. Data is always linked to a model's assumptions, yet it can be unclear how best to proceed in setting up a good data workflow.
The best way to establish an effective workflow is to see how objectives can be described as a hypothesis, a supposition made on limited evidence. A hypothesis implies a null hypothesis, the concept that there is no significant difference between your default condition and the condition with the supposition in place.
Creating a hypothesis sounds very scientific for a business deliverable or nonprofit objective. But the limited evidence behind a hypothesis is data, the stuff that has become overwhelmingly available these days. Link that fact to the increased capabilities of analysis software, and you have a better way to ask questions that incorporate real-world conditions. Spatial analysis, for example, grew from the availability of GPS data, allowing professionals to import the data into models created in R and Python and do sophisticated mapping of resources, such as noting supply problems of a perishable product across regions. Even civic organizations have begun to use spatial analysis to deploy public services more effectively. As managers across many industries bring more data into key decisions, good data science precepts become great operational guidelines for data-driven organizations.
To determine a good hypothesis, analysts should frame the context of an objective against data that can potentially explain the model output. This can make the data for a model easier to understand against performance statistics, like accuracy and precision, and consequently better relate a data model to a business need. The stats may not be the KPIs that you readily report to your management colleagues, but they do indicate if model performance is really addressing a KPI-related objective.
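As a small illustration of those performance statistics (the labels and predictions below are invented purely for this sketch), accuracy and precision can be computed directly from counts of correct and incorrect predictions:

```python
# Hypothetical binary classifier output -- all values invented for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# True positives: predicted positive and actually positive.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
# False positives: predicted positive but actually negative.
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)   # share of all predictions that were right
precision = tp / (tp + fp)         # share of positive calls that were right
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}")
```

Numbers like these rarely appear on a management dashboard, but they tell you whether the model's answers are trustworthy enough to move a KPI.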
Let’s take an overview of a marketing mix model to see how hypothesis creation can work. A marketing mix model is a sales or market share prediction analysis based on the marketing channels used to advertise a given product or service. It compares how one channel versus another influences sales or market share. The independent variables are the channels used to market a product or service, while sales or market share is the dependent variable.
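One common way to express that structure (not the only one) is an ordinary least squares regression of sales on per-channel spend. The sketch below uses entirely synthetic data, generated from invented coefficients, just to show the shape of such a model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_weeks = 52

# Hypothetical weekly spend per channel (all figures invented).
tv = rng.uniform(5, 15, n_weeks)
social = rng.uniform(1, 5, n_weeks)
search = rng.uniform(2, 8, n_weeks)

# Design matrix: intercept column plus one column per channel.
X = np.column_stack([np.ones(n_weeks), tv, social, search])

# Synthetic "true" relationship plus noise, standing in for observed sales.
sales = 20 + 3.0 * tv + 5.0 * social + 1.5 * search + rng.normal(0, 2, n_weeks)

# Fit the marketing mix model by least squares.
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
print("intercept, tv, social, search:", np.round(coef, 2))
```

Each fitted coefficient estimates the incremental sales per unit of spend on that channel, which is exactly the comparison across channels the model is meant to make.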
Thus, a hypothesis can determine if, say, a new strategy on a social media channel really influenced sales significantly in a marketing mix model. The null hypothesis in this instance is that there is no significant difference in sales performance attributed to the new social media strategy in place. This approach positions the analysis on assessing the degree of improvement a channel may provide, implying an answer on the return on investment expected from dedicating further budget to the channel.
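A rough sketch of how that null hypothesis might be tested is a two-sample comparison of weekly sales before and after the new strategy. The figures below are invented, and comparing |t| to roughly 2 is only an informal stand-in for a proper significance threshold:

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical weekly sales (units) before and after the new social media
# strategy -- all numbers invented for illustration.
before = [100, 98, 105, 102, 97, 101, 99, 103]
after = [108, 112, 106, 110, 109, 107, 111, 105]

# Welch's two-sample t-statistic: how large is the difference in means
# relative to its standard error?
n1, n2 = len(before), len(after)
se = sqrt(variance(before) / n1 + variance(after) / n2)
t = (mean(after) - mean(before)) / se

# |t| well above ~2 suggests rejecting the null hypothesis of no difference.
print(f"t = {t:.2f}")
```

If the null hypothesis survives, the honest conclusion is that the extra budget bought no measurable lift, which is itself a valuable ROI answer.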
Building a model based on your hypothesis helps you better plan how to treat the observations in the data. For example, some machine learning models cannot handle N/A values in the observations, so you have to inspect a source, see which columns contain them, and then decide how best to address those values: Are there just a few missing values? Does the gap seem more systematic? Does your model still represent performance in a real-world condition without the fields in question?
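As an illustration of that first inspection step, here is one way such a check might look in pandas, using a small invented table:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly marketing data with gaps -- all values invented.
df = pd.DataFrame({
    "week": [1, 2, 3, 4, 5],
    "tv_spend": [10.0, 12.0, np.nan, 9.0, 11.0],
    "social_spend": [2.0, np.nan, 3.0, np.nan, 4.0],
    "sales": [50.0, 55.0, 48.0, 52.0, 58.0],
})

# First question: how many values are missing, and in which columns?
missing = df.isna().sum()
print(missing)

# A few scattered gaps: filling with the column median is one option.
# A column that is mostly missing may be better dropped entirely.
filled = df.fillna(df.median())
```

Whether to impute, drop rows, or drop whole columns depends on the hypothesis: the question is always whether the cleaned data still represents the real-world condition you are modeling.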
These choices can filter down to guidance on your workflow resources. You can better approach data quality within the context of processes like CRISP-DM (Cross Industry Standard Process for Data Mining), a set of deliverables meant to allow teams to collaborate on data exploration. A hypothesis raises the question of what kind of data is needed to provide answers. Because of that, CRISP-DM deliverables, designed to explain why a given task is relevant to a business objective, can be framed to help teams such as IT understand what they can do to support model development. (Jessica Davis examines IT’s role in supporting a machine learning initiative in her recent post.)
Finally, you can use the latest publishing features to keep files and data together to match that workflow. Professionals should get acquainted with Markdown, a lightweight markup language meant to conveniently publish supporting documents alongside a data model in a number of formats. That way, collaborators can better understand the programming and assumptions being applied to the data. Many integrated development environments (IDEs), like RStudio or Microsoft Visual Studio Code, have features to create and distribute Markdown documents.
Having a ton of data creates a lot of esoteric details that can confuse the analyst, creating extra work. But a good data workflow can keep that data from spiraling into one prolonged workday of bad analysis after another.