Some Models Don’t Tell the Truth – Using Statistical Models Realistically

09.07.2018

“All models are wrong, but some are useful.”

This quote, well-known to statisticians, comes from one of the most influential figures in statistics, George Box. The point that Box was making is one that is misunderstood by many data analysts, from novices to experienced researchers: every statistical model is an approximation to reality. To understand how an approximation to reality can still be useful, consider the two models visualized in the following figure.

Knowing that both of these models come from the same data set, a data analyst would want to determine which is “better”. If I gave these two figures to the students in my Statistics 318 class at the University of Nebraska along with the coefficient of determination (R-squared value) for each model (0.65 for the first model and 0.70 for the second), most students would tell me that the second model is better. Given the available information, that is an understandable choice. Unfortunately, it’s also wrong.

How do I know that the second model is not “better” than the first model? These data points represent speed and braking distance measurements taken on fifty different cars in the 1920s. The x-axis represents the speed the car was going in miles per hour, and the y-axis represents the distance (in feet) it took the car to stop once the brakes were applied. How does knowing that help? Look at the second model again. Do you notice anything strange? Why would the braking distance be almost 40 feet at five miles per hour, then drop sharply as the speed increases? Why would there be a slight downturn followed by a sharp increase around 20 miles per hour? Worst of all, if an analyst tried to use this model to predict the braking distance at 50 miles per hour, she would find that predicted value to be less than -10,000,000 feet. If this doesn’t make any sense to you, that’s good! This model (along with any other possible model) has no idea what the data represent; it just sees numbers, does some math, and spits out the results. While the first model also has its shortcomings, it does a much better job of describing reality.
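To see how a higher R-squared can go along with a less believable model, here is a minimal sketch in Python. It does not use the 1920s measurements discussed above: the fifty speed and distance pairs are synthetic, generated from a made-up formula plus random noise, and the degree-7 polynomial is an arbitrary stand-in for an overly flexible model. The function names (fit_poly, predict, r_squared) are hypothetical helpers written for this illustration.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients c[0..degree] for the basis 1, x, x^2, ...
    """
    n = degree + 1
    # Build A^T A and A^T y, where A is the Vandermonde matrix of xs.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    m = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = s / m[i][i]
    return coef

def predict(coef, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(coef))

def r_squared(xs, ys, coef):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(coef, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def scale(speed):
    """Rescale speed (mph) to roughly [-1, 1] to keep the fit numerically stable."""
    return (speed - 15.0) / 10.0

# ASSUMPTION: synthetic data, not the real measurements from the article.
rng = random.Random(0)
speeds = [rng.uniform(4, 25) for _ in range(50)]
dists = [max(1.0, 1.2 * s + 0.08 * s * s + rng.gauss(0, 0.6 * s)) for s in speeds]

us = [scale(s) for s in speeds]
line = fit_poly(us, dists, 1)   # simple straight-line model
poly = fit_poly(us, dists, 7)   # overly flexible model (degree chosen arbitrarily)

r2_line = r_squared(us, dists, line)
r2_poly = r_squared(us, dists, poly)
print("R-squared, line:      ", round(r2_line, 3))
print("R-squared, polynomial:", round(r2_poly, 3))

# Extrapolate well beyond the observed speeds (4 to 25 mph).
print("Line prediction at 50 mph:      ", round(predict(line, scale(50)), 1))
print("Polynomial prediction at 50 mph:", round(predict(poly, scale(50)), 1))
```

Because the straight line is a special case of the polynomial, the polynomial's R-squared on the fitted data can never be lower than the line's. That is why R-squared alone cannot tell you which model describes reality better. Comparing the two predictions at 50 miles per hour, far outside the observed range, is one quick check of whether a model behaves sensibly.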

Regrettably, some analysts approach modeling without thinking about how their models compare to reality. They see the picture and some measure of model fit (like the coefficient of determination) and make modeling decisions without any information about what the data represent. The best analyses will often be collaborations between experts in a field and experienced data analysts who have familiarized themselves with the best methods suited for their goals.

NCS is an excellent example of this approach. Between us, there is experience in banking, accounting, finance, economics, math, statistics, physics, sociology, and ecology, just to name a few. With each of us bringing experience from this wide array of fields, the whole is greater than the sum of its parts: we combine our skills to analyze our data and find real insights in it.

All models may be wrong, but some can be useful if we are smart about our approach to building and using them.

Author Jason Adams
