When A Good Machine Learning Model Is So Bad
IT teams must work with managers who oversee data scientists, data engineers, and analysts to develop points of intervention that complement model ensemble techniques.
Most managers feel euphoria when implementing a technology meant to enhance the workflow of a team or an organization. But they often overlook the details that help implement the technology successfully. The same can happen with managers who oversee data scientists, data engineers, and analysts working on machine learning initiatives.
Every organization seems to be in love with machine learning. Because love is blind, so to speak, IT teams become the first line of defense in protecting that euphoric feeling. They can start that protection by helping managers appreciate how models fit observations from data sources. Appreciating the statistical balance in data models is essential for establishing management practices that minimize the errors that lead to very poor real-world decisions. Overfitting and underfitting are a key part of that discussion.
Overfitting and underfitting describe how the performance of a model or machine learning algorithm on training data compares with its performance on production data. An analyst can see good performance on the training data but then get results that generalize poorly to a new data sample or, even worse, to production.
So how does all of this work in practice? Overfitting means the model treats noise in the training data as a reliable indicator, when in reality the noise distorts the underlying pattern. The model then makes poor predictions on any new dataset that does not contain the same noise, namely the production data. From a statistics standpoint, overfitting occurs if the model or algorithm shows low bias but high variance.
Underfitting introduces a different model performance issue. Intuitively, underfitting implies that the model or the algorithm does not fit the data well enough to capture the statistical relationships within it. From a statistics perspective, underfitting occurs if the model or algorithm shows low variance but high bias.
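A small sketch can make the contrast concrete. The Python below is an illustration only, not a method from any particular project: the sine-shaped signal, the noise level, and the helper names `make_data`, `knn_predict`, and `mse` are all assumptions chosen for the example. It fits a nearest-neighbor average with three settings of k. With k=1 the model memorizes the training set, noise included (low bias, high variance). With k equal to the training set size it predicts the overall average everywhere (high bias, low variance).

```python
import math
import random


def make_data(n, rng):
    """Hypothetical data: a sine-shaped signal plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 6)
        data.append((x, math.sin(x) + rng.gauss(0, 0.3)))
    return data


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-nearest-neighbor model on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)       # fixed seed so the run is repeatable
train = make_data(60, rng)   # stands in for the training data
test = make_data(60, rng)    # stands in for new or production data

results = {}
for k in (1, 7, len(train)):
    results[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:>2}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

The k=1 row shows a training error of zero next to a clearly nonzero test error, which is the signature of overfitting. The row where k equals the training set size shows large errors on both sets, the signature of underfitting. A middle value such as k=7 will typically generalize better than either extreme.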
Both conditions degrade generalization and lead to poor decisions. Generalization is the capacity of a machine learning model to make accurate predictions on unseen data. Getting generalization right is at the heart of establishing a good machine learning model.
One avenue for analysts is to examine the training data to determine whether additional observations are available, while taking care not to feed unbalanced datasets to models. I explained unbalanced datasets in a previous post.
But there are limits to adding observations or adding features. There are phenomena in which adding more data yields no further performance improvements. One example is the Hughes phenomenon: as the number of features increases, a classifier’s performance improves up to an optimal number of features, then declines as more features are added while the training set stays the same size. The Hughes phenomenon should certainly remind data professionals of the curse of dimensionality. The number of possible unique rows grows exponentially as dimensions are added, as in high-dimensional models. The variance increases with the additional features as well. The result is a model with more opportunities to overfit, making accurate generalization harder to establish and making development less efficient.
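To see why a fixed training set thins out so quickly, consider a simplified, back-of-the-envelope case in which every feature is binary. The sketch below is not a demonstration of the Hughes phenomenon itself. It only counts how many distinct rows are possible and what fraction of them a training set could cover at best. The function name `max_coverage` and the 1,000-row training set are assumptions for illustration.

```python
def max_coverage(n_rows, n_binary_features):
    """Upper bound on the share of distinct feature combinations that
    n_rows observations can cover when every feature is binary."""
    possible_rows = 2 ** n_binary_features
    return min(1.0, n_rows / possible_rows)


n_rows = 1000  # hypothetical training set size, held fixed
for d in (5, 10, 20, 30):
    print(f"{d:>2} features: {2 ** d:>13,} possible rows, "
          f"at most {max_coverage(n_rows, d):.6%} covered")
```

At 10 binary features, 1,000 rows can cover at most about 98% of the 1,024 possible combinations. At 20 features the ceiling falls below 0.1%, so most of what the model meets in production would be combinations it never saw in training.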
Thus, the most likely efforts will involve finding a balance between bias and variance. Having both low bias and low variance is the desired objective but is usually impractical or impossible to achieve. Analysts should focus on cross-validation techniques, such as k-fold cross-validation, alongside ensemble methods like gradient boosting, to minimize the likelihood of implementing a poor model.
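As a minimal sketch of the cross-validation idea, the Python below splits a dataset into k folds, trains on k-1 of them, and scores on the fold held out, so every observation is used for validation exactly once. The synthetic straight-line data, the two candidate models, and the helper names (`k_fold_splits`, `fit_mean`, `fit_line`, `cross_val_mse`) are assumptions for illustration. A production project would more likely rely on an established library.

```python
import random
import statistics


def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


def fit_mean(xs, ys):
    """Deliberately simple candidate: always predict the training average."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)


def cross_val_mse(fit, xs, ys, k=5):
    """Average held-out mean squared error across the k folds."""
    fold_scores = []
    for train, val in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(model(xs[i]) - ys[i]) ** 2 for i in val]
        fold_scores.append(statistics.fmean(errors))
    return statistics.fmean(fold_scores)


rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]  # hypothetical noisy trend

for name, fit in (("mean", fit_mean), ("line", fit_line)):
    print(f"{name}: cross-validated MSE = {cross_val_mse(fit, xs, ys):.3f}")
```

The candidate with the lower cross-validated error is the one more likely to generalize. The same loop can compare any set of candidate models or settings, because each score comes from data the model did not see during fitting.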
IT teams must work with managers who oversee data scientists, data engineers, and analysts to develop points of intervention that complement model ensemble techniques. The interaction can also lead to forming robust management processes like observability for incident detection and root-cause reporting. The result is a system that minimizes operational downtime related to data issues. It also produces a process point for managing a balance of bias and variance that protects model accuracy and yields fair outcomes.
Separating signal from noise does not by itself make an outcome ethical. Good judgment is what ensures that it is. Such outcomes are certainly worth a euphoric feeling.