Planning the Right Sample Size for Data Analysis - InformationWeek


Pierre DeBois

Planning the Right Sample Size for Data Analysis

Sample size affects the statistical results for correlations, regressions, and other models. See how to compare data to ensure an apples-to-apples comparison.

In my post A Good Workflow to Avoid Bad Data, I noted that you are not alone if you are uncertain about the initial steps in creating data models. I also suggested using a hypothesis to work around the esoteric nature of data and set a good workflow.

Another obstacle on the road to a good workflow is deciding on the right sample size of observations for a model. You’d think that with how much ink and how many pixels have been spilled on data, companies would have the right amount of data immediately at hand.

Image: Kurt Kleemann -

Well, think again. Because data comes in different types, a company can store a lot of data and still, because of structural issues, fall short of the number of suitable observations for a given model. So the best question to ask is: What options are available when it is doubtful that you have the minimum amount of data needed to make a model operational?

Most likely, when deciding on the sample size you have come across the minimum sample size formula, a statistical calculation based on the standard deviation, a confidence level (typically 95%), and the margin of error that you specify. The concern is that in some instances, managers may not be confident of obtaining enough data to meet the number of observations mathematically required. The standard deviation of a dataset can add further complications: as the standard deviation of a set of observations increases, the required sample size increases.
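As a rough illustration, the sketch below uses one common form of the minimum sample size calculation for estimating a mean, n = (z × standard deviation / margin of error)², rounded up. The function name and the example numbers are hypothetical, and the formula assumes a simple random sample with a known or well-estimated standard deviation; other study designs call for other formulas.

```python
import math
from statistics import NormalDist


def min_sample_size(std_dev: float, margin_of_error: float,
                    confidence: float = 0.95) -> int:
    """Smallest n for which z * std_dev / sqrt(n) <= margin_of_error."""
    if std_dev <= 0 or margin_of_error <= 0:
        raise ValueError("std_dev and margin_of_error must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_dev / margin_of_error) ** 2)


# Hypothetical numbers: same margin of error, two different spreads.
print(min_sample_size(std_dev=15, margin_of_error=2))  # 217
print(min_sample_size(std_dev=30, margin_of_error=2))  # 865
```

In this example, doubling the standard deviation roughly quadruples the number of observations required, which is why a highly variable dataset can push the requirement beyond what is practical to collect.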

But you may have some limitations. For starters, ask yourself whether gaining observations carries a real-world cost. You may have a tight budget, for example, that limits the number of physical sources for data. The need to obtain observations is a particular problem for machine learning datasets, which need a lot of data to train a model accurately.

Another limit can come from within the model itself. That limit comes in the form of the curse of dimensionality, a concept in which the amount of data needed for accuracy increases exponentially as the number of dimensions grows. These dimensions represent potentially correlated variables in a regression, a clustering, or a more advanced machine learning model. The data consists of the observations recorded along those dimensions. Each added dimension enlarges the space the observations must cover, so more data is needed to reach the same accuracy. If you are examining how product features are correlated to a purchase decision, for example, you will need more observations as you add features to investigate in a regression that answers the purchase question.
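One simplified way to see the exponential growth is to imagine splitting every variable into a fixed number of ranges and asking for at least one observation in every combination of ranges. The sketch below makes exactly that assumption; the function name and the choice of 10 ranges per variable are hypothetical, and real models rarely need full grid coverage, so treat the numbers as an illustration of the growth rate rather than a sizing rule.

```python
def observations_needed(dimensions: int, bins_per_dimension: int = 10,
                        per_cell: int = 1) -> int:
    """Observations needed to place `per_cell` points in every grid cell."""
    if dimensions < 1 or bins_per_dimension < 1 or per_cell < 1:
        raise ValueError("all arguments must be at least 1")
    # The number of cells multiplies by bins_per_dimension with each new dimension.
    return per_cell * bins_per_dimension ** dimensions


for d in (1, 2, 3, 5, 10):
    print(f"{d:>2} dimensions -> {observations_needed(d):,} observations")
```

Under these assumptions, going from 3 to 5 product features raises the requirement from 1,000 to 100,000 observations.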

When features are indiscriminately added to a model, you will reach a limit beyond which the amount of data needed for training becomes infeasible. Data represents real-world products, services, or operations, so a real-world cap exists on the amount of data possible, given the business resources available.

So, what should managers do to better manage data decisions?

One step is to assess which dimensions in a model are most relevant to answering a business need. That assessment can help shape decisions on which dimensions and supporting data should be modeled. A set of techniques called dimensionality reduction counters the curse of dimensionality. Analysts reduce the dimensions to a minimum number of influencers, using their understanding of what the variables represent. A correlation matrix can be applied to identify dimensions that could potentially be removed. This can eliminate some data needs if the removed dimensions are the ones whose sample size you are questioning.
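The correlation check can be sketched in a few lines. The example below computes pairwise Pearson correlations and flags pairs above a cutoff as candidates for removal. The feature names, the data, and the 0.9 threshold are all hypothetical; which variable of a flagged pair to drop remains a judgment call based on what the variables represent.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # A constant column has no variance; treat it as uncorrelated here.
    return cov / denom if denom else 0.0


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| at or above threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Hypothetical product features; price_eur is just price_usd converted.
features = {
    "price_usd": [10, 12, 15, 18, 20, 25],
    "price_eur": [9.2, 11.0, 13.8, 16.6, 18.4, 23.0],
    "battery_hours": [8, 5, 9, 4, 7, 6],
}
print(redundant_pairs(features))
```

Here the two price columns carry the same information, so one of them could be dropped without losing anything the model can use.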

Another technique, imputation, can help fill data gaps where values are either in short supply or unavailable because of data type concerns. Imputation is the substitution of a proxy value for missing data. Usually the selected imputation value is either the median or the mean. The reason for applying imputation is to eliminate missing values by assigning a value that is consistent with the rest of the dataset being examined. Missing values decrease the accuracy of predictive models, particularly machine learning models, and can lead to misleading conclusions about the relationships in the data. Imputation helps avoid that result, giving a reasonable dataset that can then be deployed in a regression or machine learning model.
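A minimal sketch of mean or median imputation follows. It assumes missing values are represented as `None`; the function name and the sample order values are hypothetical.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "median":
        fill = median(observed)
    elif strategy == "mean":
        fill = mean(observed)
    else:
        raise ValueError("strategy must be 'median' or 'mean'")
    return [fill if v is None else v for v in values]


# Hypothetical order values with two gaps and one unusually large order.
order_values = [120, None, 95, 110, None, 400]
print(impute(order_values, "median"))  # [120, 115.0, 95, 110, 115.0, 400]
print(impute(order_values, "mean"))    # [120, 181.25, 95, 110, 181.25, 400]
```

In this example the single large order pulls the mean well above most of the observed values, while the median stays close to them, which is one reason to compare both before choosing a fill value.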

Having a ton of data may sound heaven-sent, but managers can still face a shortage of usable observations because of the kinds of data types involved. However, it is possible to make intuitive assumptions about a given dataset to make good use of the data available. The creativity involved does not require heavy math or programming. Instead, you need a few ideas on how the data will be used to ensure that an analytic model is accurate and useful while working around real-world and mathematical constraints.

Pierre DeBois is the founder of Zimana, a small business analytics consultancy that reviews data from Web analytics and social media dashboard solutions, then provides recommendations and Web development action that improves marketing strategy and business profitability.