Overfitting And Underfitting
Overfitting occurs when a model is complex enough to fit the noise in the training data along with the signal. Underfitting is the opposite problem: the model is too simple to capture the underlying pattern. Either one skews results.
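The contrast is easiest to see with a toy experiment. The sketch below is illustrative only -- the data, the polynomial degrees, and the use of NumPy are assumptions chosen to make the point, not anyone's production setup. It fits noisy samples of a smooth curve with polynomials of increasing degree and compares training error against error on held-out points.

```python
import numpy as np

# Illustrative sketch (assumed data): fit noisy samples of a smooth curve
# with polynomials of increasing degree, then compare training error
# against error on held-out points.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0, 1, 200)
true_fn = lambda x: np.sin(2 * np.pi * x)
y_train = true_fn(x_train) + rng.normal(scale=0.2, size=x_train.shape)
y_test = true_fn(x_test)

for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Degree 1 typically underfits (high error on both sets); degree 10
    # typically overfits (small training error, larger held-out error).
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The low-degree fit misses the curve entirely, while the high-degree fit chases individual noisy points: good numbers on the data it was fitted to, worse numbers anywhere else.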
"[Overfitting] is one of the most common (and worrisome) biases. It comes about from checking lots of different hypotheses in data. If each hypothesis you check has, say, a 1 in 20 chance of being a false positive, then if you check 20 different hypotheses, you're very likely to have a false positive occur at least once," Greenberg said.
Greenberg tested the effect of various behavioral interventions on site participants. At first, one intervention appeared to outperform the control. But once a correction was applied to adjust for the number of hypotheses tested, the statistical significance vanished.
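The article does not say which correction Greenberg used; one common choice is the Bonferroni adjustment, which scales each p-value by the number of hypotheses tested. A minimal sketch with made-up p-values, not figures from his experiment:

```python
# Hypothetical Bonferroni adjustment: the raw p-values below are invented
# for illustration, not results from the experiment described above.
raw_p_values = [0.012, 0.048, 0.21, 0.35]  # one per intervention tested
m = 20        # total number of hypotheses examined
alpha = 0.05  # nominal significance level

for p in raw_p_values:
    adjusted = min(p * m, 1.0)  # Bonferroni-adjusted p-value
    verdict = "significant" if adjusted < alpha else "not significant"
    # A raw p-value of 0.048 clears the 0.05 bar on its own, but after
    # adjusting for 20 comparisons the apparent effect vanishes.
    print(f"raw={p:.3f}  adjusted={adjusted:.3f}  {verdict}")
```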
"If you're using a high-dimensionality, non-linear predictive algorithm that has lots of degrees of freedom that allow you to sift data to the tee, you can essentially take any function and map it point-by-point so that the model does a tremendous job at looking at the data that you fitted on it. It's excellent on that, but if you go beyond that realm, it does awfully in terms of predicting data points that are outside the spectrum you looked up," said CenturyLink's Schleicher. "We inevitably split up our data sets into training data sets and testing data sets, and do cross validation across such multiple sets to make sure we don't overfit."
(Image: PeteLinforth via Pixabay)