The NCAA Bracket: Checking Our Work


Our NCAA tournament forecasts are probabilistic. You could say that FiveThirtyEight is “calling” for Michigan State to defeat Virginia on Friday, but it isn’t much of a call. Our model gives Michigan State a 50.7 percent chance of winning, and Virginia a 49.3 percent chance. For all intents and purposes, it’s a toss-up.

Other times, of course, one team has a much clearer edge. Duke was a 92.9 percent favorite against Mercer last week, but Mercer won.

Still, upsets like these are supposed to happen some of the time. The question in evaluating a probabilistic forecast is whether the underdogs are winning substantially more or less often than expected. The technical term for this is calibration. But you might think of it more as truth in advertising. Over the long run, out of all the times when we say a team is a 75 percent favorite, is it really winning about 75 percent of the time?
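
Concretely, the check needs nothing more than the quoted probabilities and the outcomes. Here is a minimal sketch in Python; the data format and the numbers are toy stand-ins, not our actual predictions:

```python
# Minimal calibration check: among games where the quoted probability was
# near 75 percent, did the favorite win about 75 percent of the time?
# `games` is a hypothetical list of (quoted probability, favorite won) pairs;
# these values are toy data, not FiveThirtyEight's.
games = [(0.75, True), (0.76, True), (0.74, False), (0.72, True)]

near_75 = [(p, won) for p, won in games if 0.70 <= p < 0.80]
quoted = sum(p for p, _ in near_75) / len(near_75)
actual = sum(won for _, won in near_75) / len(near_75)
print(f"quoted {quoted:.0%} on average; won {actual:.0%} of {len(near_75)} games")
```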

FiveThirtyEight’s NCAA tournament projections have been published each year since 2011. The formula has changed very little over that period. (The only substantive change has been adding a fifth computer power rating, ESPN’s Basketball Power Index, this season.) That gives us a reasonable baseline for evaluation: a total of 254 games, counting the 52 played so far this year. (These totals include “play-in” games.)

You can find a file containing our past predictions here. It’s important to emphasize that these predictions were published, in advance of each game, at the New York Times or ESPN versions of FiveThirtyEight; these are actual predictions, not hindcasts.

Overall, FiveThirtyEight’s favorite won 70.0 percent of the time. How often was the favorite supposed to win? According to our model, 72.1 percent of the time. So at a macro level, the forecasts have been pretty well calibrated (70.0 percent is well within the 95 percent confidence interval drawn by simulating the results 100,000 times).
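
That parenthetical is doing some work, so here is the idea in code: replay all 254 games 100,000 times, let each favorite win with exactly its model probability, and see what range of overall favorite win percentages the model itself considers plausible. This is a sketch only, with toy probabilities standing in for the published ones (which are in the predictions file linked above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the 254 published win probabilities.
probs = rng.uniform(0.5, 0.997, size=254)

n_sims = 100_000
# In each simulated history, every favorite wins independently with its
# model probability.
wins = rng.random((n_sims, probs.size)) < probs
win_rates = wins.mean(axis=1)

low, high = np.percentile(win_rates, [2.5, 97.5])
print(f"model's expected favorite win rate: {probs.mean():.1%}")
print(f"95% interval across simulations: {low:.1%} to {high:.1%}")
```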

But that doesn’t tell us everything. It’s also important that the heavier favorites, like Duke in its game against Mercer, win more often than the slim favorites, like Michigan State against Virginia.

To evaluate this, we can break the 254 games down into bins. The most intuitive way is to use five bins based on the first digit of the favorite’s win probability. For example, one bin contains the 40 games in which the favorite’s win probability was somewhere in the 90s (that is, anywhere from 90.0 percent to 99.99999 percent; in practice, the heaviest favorite in our database is Ohio State, which had a 99.7 percent chance of beating the University of Texas at San Antonio in 2011).
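
In code, the grouping comes down to the first digit. Again, a sketch only: the simulated (probability, outcome) pairs below stand in for the real predictions file, and each bin’s “expected” win rate is simply the average of its probabilities.

```python
import random
from collections import defaultdict

random.seed(0)

# Toy stand-in: 254 (probability, favorite-won) pairs.
games = []
for _ in range(254):
    p = random.uniform(0.5, 0.997)          # favorite's model win probability
    games.append((p, random.random() < p))  # simulated outcome, for the demo

bins = defaultdict(list)
for p, won in games:
    decade = int(p * 10) * 10               # first digit: 0.507 -> 50, 0.929 -> 90
    bins[decade].append((p, won))

for decade in sorted(bins):
    group = bins[decade]
    expected = sum(p for p, _ in group) / len(group)
    observed = sum(won for _, won in group) / len(group)
    print(f"{decade}s: {len(group)} games, expected {expected:.1%}, won {observed:.1%}")
```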

We’d expect the favorite to win about 95 percent of these games. In fact, the favorite won 38 out of 40, or exactly 95 percent of the time. The exceptions were Duke this year and Missouri in 2012, which lost to Norfolk State despite being a 97.2 percent favorite.

[Table: How often favorites won, by win-probability bin]

Not all the numbers work out so neatly, however. Another bin contains all those favorites with win probabilities in the 50s (anywhere from 50.0 percent to 59.9 percent). These teams were supposed to win about 55 percent of the time. In fact, they’ve won 38 of 63 games, or 60.3 percent of the time. So the teams performed a little better than expected in these games.

By contrast, teams with win probabilities in the 60s (from 60.0 percent to 69.9 percent) have won 35 of their 60 games, or 58.3 percent. That’s a bit worse than expected.

What’s going on here? How are the somewhat heavier favorites, with win probabilities in the 60s, performing worse than those teams whose win probabilities were in the 50s? Are the heavier favorites getting cocky? Is there something wrong with the model?

Probably not. Instead, these differences are well within the ranges that might result from random chance. This is easier to explain visually, as in the following graphic, which portrays the results of games from each bin along with their confidence intervals.

[Chart: Favorites’ actual win percentage in each bin, with 95 percent confidence intervals]
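
Those per-bin intervals can be simulated the same way as the overall one. Here is a sketch for the 60s bin, once more with toy probabilities standing in for the published ones; with only 60 games in the bin, the interval spans roughly a dozen percentage points in either direction, easily wide enough to contain the observed 58.3 percent.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_interval(probs, n_sims=100_000):
    """95% interval for the favorites' win rate in one bin, generated by
    replaying the bin's games with the model's own probabilities."""
    probs = np.asarray(probs)
    wins = rng.random((n_sims, probs.size)) < probs
    return np.percentile(wins.mean(axis=1), [2.5, 97.5])

# Toy stand-in for the 60 games whose favorites had probabilities in the 60s.
sixties = rng.uniform(0.60, 0.699, size=60)
low, high = simulated_interval(sixties)
print(f"observed 58.3% vs. a simulated 95% interval of {low:.1%} to {high:.1%}")
```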

Each result is within its respective confidence interval. The calibration is not perfect. But the deviations from perfect calibration are not statistically significant. We encourage you to check our work, and let us know if you think there’s a better way to judge our results. But overall, the tournament’s results have been reasonably true to the FiveThirtyEight probabilities. Just don’t tell that to your buddy who had Duke winning it all this year.

Correction: Speaking of checking our work: a reader, D.K., discovered that our database incorrectly listed Kentucky as the winner of its 2011 national semifinal against Connecticut. The chart, table and text have now been changed to reflect the correct result. We appreciate D.K.’s assistance and apologize for the error.