Planning the Right Sample Size for Data Analysis - InformationWeek

Data Management
8/2/2019
08:00 AM
Commentary

# Planning the Right Sample Size for Data Analysis

Sample size affects the statistical results for correlations, regressions, and other models. See how to compare data to ensure an apples-to-apples comparison.

In my post A Good Workflow to Avoid Bad Data, I noted that you are not alone if you feel uncertain about the first steps in creating a data model. I also suggested starting from a hypothesis as a way to cut through the esoteric nature of data and establish a good workflow.

Another hurdle on the road to a good workflow is deciding on the right sample size of observations for a model. You’d think that with how much ink and how many pixels have been spilled on data, companies would have the right amount of data immediately at hand.

Well, think again. Companies store a lot of data, but differences in data types and other structural issues can leave the number of suitable observations short for a given model. So the best question to ask is: what options are available when it is doubtful that you can gather the minimum amount of data needed to make a model operational?

Most likely, when deciding on the sample size, you have come across the minimum sample size formula, a statistical calculation based on the standard deviation, a confidence level (typically 95%), and the margin of error that you specify. The concern is that, in some instances, managers may not be confident of obtaining enough data to meet the number of observations mathematically required. The standard deviation of a dataset can add further complications: as the standard deviation of a set of observations increases, the required sample size increases with it.
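As a rough illustration of that calculation, here is a minimal Python sketch. It assumes the common textbook form for estimating a mean, n = (z × standard deviation / margin of error)², which the article does not spell out. The function name `min_sample_size` and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist


def min_sample_size(std_dev: float, margin_of_error: float,
                    confidence: float = 0.95) -> int:
    """Smallest n for estimating a mean within +/- margin_of_error.

    Assumes the textbook formula n = (z * std_dev / margin_of_error) ** 2,
    where z is the two-sided critical value of the standard normal
    distribution for the chosen confidence level.
    """
    if std_dev <= 0 or margin_of_error <= 0:
        raise ValueError("std_dev and margin_of_error must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # For 95% confidence this is the familiar z of roughly 1.96.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Round up: a fractional observation cannot be collected.
    return math.ceil((z * std_dev / margin_of_error) ** 2)


# Hypothetical example: doubling the standard deviation roughly
# quadruples the number of observations required.
print(min_sample_size(std_dev=15, margin_of_error=2))  # 217
print(min_sample_size(std_dev=30, margin_of_error=2))  # 865
```

The second call shows why a noisy dataset is a budgeting problem: the margin of error is the same, but the required number of observations is about four times larger.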

But you may have some limitations. For starters, ask yourself whether gaining observations carries a real-world cost. A tight budget, for example, may limit the number of physical sources for data. The need to obtain observations is a particular problem for machine learning datasets, which need a lot of data to train a model accurately.

Another limit can come from within the model itself, in the form of the curse of dimensionality: the amount of data needed for accuracy increases exponentially as the number of dimensions grows. These dimensions are the potentially correlated variables in a regression, a clustering, or a more advanced machine learning model, and the data are the observations recorded along those dimensions. Each added dimension increases the work, and the data, needed to reach model accuracy. If you are examining how product features are correlated to a purchase decision, for example, you will need more observations as you add features to investigate in a regression that answers the purchase question.
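One common way to picture that exponential growth is a grid argument: split each dimension into a fixed number of ranges and ask how many observations it takes to put at least one in every cell. The sketch below assumes that simplified picture. It is not a sizing rule for any particular model, and the function name and the default of 10 ranges per dimension are hypothetical.

```python
def observations_to_cover_grid(dimensions: int,
                               bins_per_dimension: int = 10,
                               per_cell: int = 1) -> int:
    """Observations needed to place `per_cell` points in every grid cell.

    Illustrative assumption: each dimension is split into
    `bins_per_dimension` ranges, so the grid has
    bins_per_dimension ** dimensions cells.
    """
    if dimensions < 1 or bins_per_dimension < 1 or per_cell < 1:
        raise ValueError("all arguments must be at least 1")
    return per_cell * bins_per_dimension ** dimensions


# Each added feature multiplies the requirement by the number of bins.
for d in (1, 2, 3, 6):
    print(d, observations_to_cover_grid(d))
```

With 10 ranges per dimension, one feature needs 10 observations for full coverage, three features need 1,000, and six need 1,000,000.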

When features are added to a model indiscriminately, you will reach a point beyond which the amount of data needed for training becomes infeasible. Data represents real-world products, services, or operations, so a real-world cap exists on the amount of data possible, given the business resources available.

So, what should managers do to better manage data decisions?

One step is to assess which of the dimensions used in a model are most relevant to answering the business need. That assessment can help shape decisions on which dimensions and supporting data should be modeled. The techniques that counter the curse of dimensionality are called dimensionality reduction. Analysts reduce the dimensions to a minimum number of influencers, using their understanding of what the variables represent. A correlation matrix can be applied to identify dimensions that can potentially be removed, such as variables that are highly correlated with one another. This can eliminate some data needs, if the removed dimensions are the very ones for which you were questioning the sample size.
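The correlation check can be sketched in a few lines of Python. This is a minimal illustration, assuming Pearson correlation and an arbitrary cutoff of 0.9. The feature names and values are invented, and the flagged pairs are only candidates for removal that an analyst would still review.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (spread_x * spread_y)


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical product features; weight_lb is weight_kg in other units.
features = {
    "price": [10, 12, 9, 14, 11],
    "weight_kg": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight_lb": [2.2, 4.4, 6.6, 8.8, 11.0],
}

print(redundant_pairs(features))
```

Here only the two weight columns are flagged, so one of them could be dropped without losing information, which in turn lowers the number of observations the model needs.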

Another technique, imputation, can help fill data gaps where values are either in short supply or unavailable because of data type concerns. An imputation is the substitution of a proxy value for missing data. Usually the selected imputation value is either the median or the mean. The reason for applying imputation is to eliminate missing values by assigning a value consistent with the rest of the dataset being examined. Missing values decrease the accuracy of predictive models, particularly machine learning models, and can lead to misleading conclusions about the relationships in the data. Imputation helps avoid that result, giving a reasonable dataset that can then be deployed in a regression or machine learning model.
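A minimal sketch of mean or median imputation follows, assuming missing values are represented as `None`. The function name `impute` and the sample values are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "median":
        fill = median(observed)
    elif strategy == "mean":
        fill = mean(observed)
    else:
        raise ValueError("strategy must be 'median' or 'mean'")
    return [fill if v is None else v for v in values]


# Hypothetical order values with one missing entry and one outlier.
orders = [10, None, 30, 200]
print(impute(orders, "median"))  # [10, 30, 30, 200]
print(impute(orders, "mean"))    # [10, 80, 30, 200]
```

The outlier of 200 pulls the mean fill value up to 80, while the median fill stays at 30, which is one reason the median is often preferred for skewed data.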

Having a ton of data may sound heaven-sent, but managers can still face a shortage of usable observations because of the kinds of data types involved. However, it is possible to make intuitive assumptions about a given dataset to make good use of the data available. The creativity involved does not require heavy math or programming. Instead, you need a few ideas on how the data will be used to ensure that an analytic model is accurate and useful while working around real-world and mathematical constraints.

Pierre DeBois is the founder of Zimana, a small business analytics consultancy that reviews data from Web analytics and social media dashboard solutions, then provides recommendations and Web development action that improves marketing strategy and business profitability.