Datasets are inherently messy, and with such disorder IT professionals must inspect datasets to maintain data quality. Increasingly, models power business operations, so IT teams are protecting machine learning models from running with imbalanced data.
An imbalanced dataset is one in which the classes a predictive classification model must distinguish are unevenly represented, so the model tends to misidentify minority-class observations. This occurs when observations are assigned to a classification as designed by the model, but the data includes so few observations of one class that the model operates with a skewed prediction accuracy.
To illustrate, think of a company that examines data from 100 samples of a product. Let's say a model built on that data predicted that 90 would meet a desired quality threshold score, and 10 would not. That model would have a 90% accuracy for selecting products that meet that score. That accuracy, however, treats that ratio of conditions as a sure bet, firmly held for the next dataset on which the model is applied.
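To make the 90/10 illustration concrete, the following is a minimal sketch rather than a real quality model. It assumes hypothetical "pass" and "fail" labels and a stand-in model that always predicts the majority class. The helper names `accuracy` and `recall` are illustrative, not taken from any library.

```python
# Hypothetical labels: 90 products meet the quality threshold, 10 do not.
labels = ["pass"] * 90 + ["fail"] * 10

# A stand-in "model" that always predicts the majority class.
predictions = ["pass"] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of all observations that were predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, target):
    """Fraction of the observations in class `target` that were predicted as `target`."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == target]
    return sum(t == p for t, p in relevant) / len(relevant)


overall = accuracy(labels, predictions)
minority_recall = recall(labels, predictions, "fail")

print(f"accuracy: {overall:.2f}")
print(f"recall on 'fail': {minority_recall:.2f}")
```

Under these assumptions, the headline accuracy looks strong while the model never catches a single failing product, which is the false sense of confidence that imbalance can create.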
The consequence of that "sure bet" is a biased model with a false sense of data identification. The model misidentifies observations from a larger dataset and, given the dataset size, scales the misidentification.
The condition gets worse with high-dimensional datasets. These datasets contain multiple variables, with the number of variables exceeding the number of observations in some instances. That layout of data -- a wide table of variables with few observations -- is shaped similarly to that in the 90/10 example, with the significant difference of more features (variables). High dimensionality can influence a model to bias toward the majority class.
Such bias can have societal consequences, such as facial recognition systems that do not identify Black faces well in images. These systems have been criticized for perpetuating discrimination and racism because their biases could lead to illegal arrests and false criminal accusations by authorities.
Retail operations offer real-world examples of common business impacts from imbalanced data. A customer database in which a minority class of customers unsubscribe from a service can affect how a model detects customer churn for products and services. Fraudulent purchases or returns are additional examples where minority classes can be too small for detection.
The most straightforward solution to imbalanced datasets is to collect more data, but additional data collection is not a choice in every instance. The observations that create the dataset may be limited due to an event or other practical consideration. An unexpected cut in product production -- like those experienced last year due to COVID-19 -- is a good example.
A different solution is to use imputation. Imputation is a process of assigning a value to missing data by inference. The imputation process has a few variations. One imputation option is data resampling. In resampling, analysts can do one of two tasks:
- Add copies of the underrepresented class, called oversampling.
- Delete observations of the overrepresented class, called undersampling.
Either choice is meant to even out the class proportions the model sees during training, minimizing bias in the model.
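The two resampling tasks can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production routine: the function names `random_oversample` and `random_undersample`, the single-feature rows and the "pass"/"fail" labels are all hypothetical, and rows are picked at random with a fixed seed so the result is repeatable.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Add random copies of smaller classes until each matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical data: 100 single-feature rows, 90 "pass" and 10 "fail".
rows = [[float(i)] for i in range(100)]
labels = ["pass"] * 90 + ["fail"] * 10

_, over_labels = random_oversample(rows, labels)
_, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Oversampling keeps every original observation but repeats minority rows, while undersampling discards majority rows. Which trade-off is acceptable depends on how much data the team can afford to lose.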
An advanced imputation technique is the synthetic minority oversampling technique (SMOTE). SMOTE creates synthetic samples calculated from the minority class instead of the duplication or deletion used in resampling. It provides more observations without adding features that can negatively inform the model. SMOTE applies a nearest neighbor vector calculation on a pair of minority class observations, then creates the additional observation from that calculation. The oversampling process repeats until all the observation pairs have been assessed with a nearest neighbor calculation.
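The core interpolation idea behind SMOTE can be sketched with the Python standard library. This is a simplified illustration, not the reference algorithm or any library's implementation: the function name `smote_sketch`, the neighbor count `k`, the random choice of base points and the sample coordinates are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Pick one of the k nearest neighbors at random.
        neighbor = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class observations with two features each.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5], [1.8, 2.9]]
new_points = smote_sketch(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because each synthetic point lies on a line segment between two real minority observations, it stays within the range of values the minority class already covers. For real projects, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package for Python is likely a better starting point than a hand-rolled version.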
There are libraries in R and packages for Python designed to apply SMOTE within a program. No matter which programming language you decide to use, there is a general approach that can be taken to examine datasets for possible imbalances. First, select the observations that are in the training set for the model. Next, create a summary line in the program to confirm that the example classes were created. The final step is a quality assurance step, creating a scatterplot to see if the classes make intuitive sense.
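The summary step might look like the following minimal sketch, which counts each class in a training set and reports its share. The names `class_summary`, `imbalance_ratio` and `train_labels` are hypothetical, and the labels reuse the 90/10 split from the earlier illustration. The scatterplot step is left out because it would need a plotting library.

```python
from collections import Counter


def class_summary(labels):
    """Return {class: (count, share of total)}, largest class first."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, n / total) for cls, n in counts.most_common()}


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical training labels selected for the model.
train_labels = ["pass"] * 90 + ["fail"] * 10

summary = class_summary(train_labels)
for cls, (n, share) in summary.items():
    print(f"{cls}: {n} observations ({share:.0%})")

ratio = imbalance_ratio(train_labels)
print(f"majority-to-minority ratio: {ratio:.1f}")
```

Running the same summary before and after resampling gives a quick check that the class proportions changed as intended.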
There are other approaches for inspecting class imbalance in data through examining the results of machine learning models. Analysts can look at the performance of a model or compare the output of several models on the same data to note which model best classifies and treats the minority class in production. One technique, called penalized models, imposes a cost on the model for making mistakes on the classes. This helps reveal which models' mistakes would do the most damage to a decision.
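One way to picture the cost idea is to weight each mistake by how rare its class is. The sketch below is an assumption-laden illustration, not a specific penalized-model algorithm: it uses a common inverse-frequency heuristic (total observations divided by the number of classes times the class count), and the names `balanced_class_weights` and `weighted_error` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by total / (number of classes * class count)."""
    counts = Counter(labels)
    total, n_classes = len(labels), len(counts)
    return {cls: total / (n_classes * n) for cls, n in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Share of total class weight that falls on misclassified observations."""
    total_weight = sum(weights[t] for t in y_true)
    wrong_weight = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong_weight / total_weight


# Hypothetical labels and a stand-in model that always predicts "pass".
labels = ["pass"] * 90 + ["fail"] * 10
always_pass = ["pass"] * len(labels)

weights = balanced_class_weights(labels)
plain_error = sum(t != p for t, p in zip(labels, always_pass)) / len(labels)
penalized = weighted_error(labels, always_pass, weights)

print(f"class weights: {weights}")
print(f"unweighted error: {plain_error:.2f}")
print(f"penalized error: {penalized:.2f}")
```

With these weights, a model that ignores the minority class no longer looks nearly perfect, so comparing several models on the penalized score makes it easier to see which one handles the minority class best.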
The main point is to develop a comparison of the dataset before and after the imputation process. Data analysts and IT teams will have to rely on their familiarity with the data selected to know when the classification makes sense.
Correcting imbalanced data is a gift for a team charged with keeping a machine learning model in production.