In my post A Good Workflow to Avoid Bad Data, I noted that you are not alone if you feel uncertain about the initial steps in creating data models. I also suggested starting from a hypothesis as a way to work around the esoteric nature of data and establish a good workflow.
Another hurdle on the road to a good workflow is deciding on the right sample size of observations for a model. You’d think that, with how much ink and how many pixels have been spilled on data, companies would have the right amount of data immediately at hand.
Well, think again. Different data types mean that, even when a lot of data is stored, structural issues can leave too few suitable observations for a given model. So the best question to ask is: what options are available when it is doubtful that you have the minimum amount of data needed to make a model operational?
Most likely, when deciding on the sample size, you have come across the minimum sample size formula, a statistical calculation based on the standard deviation, a confidence level (typically 95%), and the margin of error that you specify. The concern is that in some instances managers may not be confident of obtaining enough data to meet the number of observations the formula requires. The standard deviation of a dataset can add further complications. As the standard deviation of a set of observations increases, the required sample size increases with its square, so doubling the standard deviation roughly quadruples the number of observations needed.
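To make the formula concrete, here is a minimal sketch of the usual calculation for estimating a mean, n = (z × standard deviation / margin of error)², rounded up. It assumes a roughly normal sampling distribution and that you already have an estimate of the standard deviation. The function name and the example numbers are illustrative, not taken from a real dataset.

```python
import math
from statistics import NormalDist


def min_sample_size(std_dev, margin_of_error, confidence=0.95):
    """Minimum observations needed to estimate a mean within the margin of error.

    Assumes a normal sampling distribution and a known (or estimated) std_dev.
    """
    # Two-sided z-score for the chosen confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up: a fraction of an observation cannot be collected.
    return math.ceil((z * std_dev / margin_of_error) ** 2)


# Illustrative values only: same margin of error, two different spreads.
print(min_sample_size(std_dev=15, margin_of_error=2))
print(min_sample_size(std_dev=30, margin_of_error=2))
```

Running both calls side by side shows how quickly a noisier dataset raises the bar: the second call asks for about four times as many observations as the first.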
But you may face some limitations. For starters, ask yourself whether gaining observations carries a real-world cost. A tight budget, for example, may limit the number of physical sources for data. The need to obtain observations is a particular problem for machine learning datasets, which need a lot of data to train a model accurately.
Another limit can come from within the model itself. That limit comes in the form of the curse of dimensionality, a concept in which the amount of data needed for accuracy increases exponentially as the number of dimensions grows. These dimensions are the potentially correlated variables in a regression, a clustering, or a more advanced machine learning model, and the data are the observations recorded along those dimensions. Each added dimension enlarges the space the observations must cover before the model can be accurate. If you are examining how product features are correlated with a purchase decision, for example, you will need more observations for each feature you add to the regression that answers the purchase question.
When features are added to a model indiscriminately, you will eventually reach a point where the amount of data needed for training becomes infeasible. Data represents real-world products, services or operations, so there is a real-world cap on the amount of data you can obtain, given the business resources available.
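A rough way to see that exponential growth is to count how many distinct combinations of feature values the data would have to cover. The sketch below assumes, purely for illustration, that every feature is split into the same number of levels and that you want a fixed number of observations for each combination. Real models rarely need full coverage, so treat the numbers as an upper-bound intuition rather than a requirement.

```python
def observations_for_coverage(levels_per_feature, n_features, per_combination=1):
    """Observations needed to see every combination of feature levels.

    Hypothetical simplification: every feature has the same number of levels.
    """
    # Combinations multiply with each added feature: levels ** features.
    return per_combination * levels_per_feature ** n_features


# Ten levels per feature, growing the number of features one at a time.
for n_features in range(1, 7):
    print(n_features, observations_for_coverage(10, n_features))
```

Each extra feature multiplies the requirement by the number of levels, which is why a handful of casually added features can push data needs past what a budget allows.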
So, what should managers do to better manage data decisions?
One step is to assess which of the dimensions used in a model are most relevant to answering the business need. That assessment can help shape decisions on which dimensions and supporting data should be modeled. A set of techniques called dimensionality reduction counters the curse of dimensionality. Analysts reduce the dimensions to a minimum number of influencers, using their understanding of what the variables represent. A correlation matrix can be applied to identify dimensions that are strongly correlated with one another and so can potentially be removed. This can eliminate some data needs, if the removed dimensions are the very ones for which you were questioning the sample size.
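The correlation-matrix idea can be sketched in a few lines. The example below computes pairwise Pearson correlations and keeps a dimension only if it is not strongly correlated with one already kept. The 0.9 threshold, the function names, and the sample columns are assumptions made for illustration. In practice, the choice of which dimension in a correlated pair to drop should come from what the variables represent.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    # A constant column has no variance; treat it as uncorrelated here.
    return cov / denom if denom else 0.0


def drop_correlated(features, threshold=0.9):
    """Return the names of dimensions to keep, in their original order.

    A dimension is dropped when its absolute correlation with any dimension
    already kept is at or above the (assumed) threshold.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical product data: price_with_tax is just price scaled by 1.1.
features = {
    "price": [1.0, 2.0, 3.0, 4.0, 5.0],
    "price_with_tax": [1.1, 2.2, 3.3, 4.4, 5.5],
    "rating": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(drop_correlated(features))
```

Here the tax-inclusive price carries no information beyond the price itself, so it is the one removed, and the model needs observations for two dimensions instead of three.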
Another technique, imputation, can help fill data gaps where values are either in short supply or unavailable because of data type concerns. Imputation is the substitution of a proxy value for missing data. Usually the selected imputation value is either the median or the mean. The reason for applying imputation is to eliminate missing values by assigning a value that is consistent with the rest of the dataset being examined. Missing values decrease the accuracy of predictive models, particularly machine learning models, and can lead to misleading conclusions about the relationships in the data. Imputation helps avoid that result, giving a reasonable dataset that can then be deployed in a regression or machine learning model.
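As a minimal sketch of median imputation, the function below replaces each missing entry in a single column with the median of the observed values. It assumes missing values are represented as `None` and that at least one value is present. The function name and the sample column are illustrative.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values.

    Assumes at least one observed value; otherwise median() raises an error.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two gaps.
weekly_orders = [10, None, 30, 20, None]
print(impute_median(weekly_orders))
```

The median is often preferred over the mean when a column has outliers, since a few extreme values will not drag the proxy value away from what is typical.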
Having a ton of data may sound heaven-sent, but many managers still face a shortage of usable observations because of the kinds of data types behind them. However, it is possible to make intuitive assumptions about a given dataset and so make good use of the data available. The creativity involved does not require heavy math or programming. Instead, you need a few ideas about how the data will be used, so that an analytic model is accurate and useful while working around real-world and mathematical constraints.