How Our New NFL Model Did This Season


Quarterbacks ruled the 2019 NFL season, with Patrick Mahomes bringing the Lombardi Trophy to Kansas City and Lamar Jackson emerging as the league MVP. Quarterbacks were in control of the FiveThirtyEight prediction model, too, as a key factor of the new version of our Elo rating system, which adjusted for the performance of every starting QB. Now that the season is over, in the spirit of checking our work, we wanted to look back at the 2019 season and see how well the new system did — and whether it improved on our old, simple Elo system from years past.

One simple way to judge prediction accuracy is to look at how close the predicted point spread came to the actual score differential of each game (squaring the errors to give a larger penalty to bad misses). And in that department, new Elo beat old Elo this season, albeit by a smaller margin than we might have expected based on the preceding five seasons.
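To make that first yardstick concrete, here is a minimal sketch of the squared-error calculation. The function name and the numbers are made up for illustration and are not our actual forecasts. Spreads and margins are both expressed from the same team's point of view.

```python
def mean_squared_spread_error(predicted_spreads, actual_margins):
    """Average squared gap between predicted and actual point margins."""
    if len(predicted_spreads) != len(actual_margins):
        raise ValueError("inputs must have the same length")
    if not predicted_spreads:
        raise ValueError("need at least one game")
    squared_errors = [
        (predicted - actual) ** 2
        for predicted, actual in zip(predicted_spreads, actual_margins)
    ]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical games: predicted margin vs. what actually happened.
predicted = [3.0, -2.5, 7.0]
actual = [10, -3, -7]
mse = mean_squared_spread_error(predicted, actual)
```

In this toy example the third game, a 14-point miss, contributes 196 of the 245.25 total squared points, which is the "larger penalty for bad misses" at work.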

But our preferred way to judge the accuracy of a forecast is using Brier scores, which are essentially the average squared error between a probabilistic forecast and what actually happened. (Lower Brier scores are better because they mean your prediction was closer to being correct.) And by that standard, our new Elo ratings basically performed as expected. It was a bit of an unpredictable NFL season according to either system, particularly during the playoffs, but the improvement in Brier score from the old version of Elo (0.224) to the new Elo (0.219) by the end of the 2019 season ended up being almost exactly what it had been when it was backtested over the previous five seasons, on average.
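For readers who want to see the arithmetic, a Brier score for win/loss forecasts can be sketched as below. The function name and the sample forecasts are illustrative assumptions, not values from our model.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between win probabilities and 0/1 results."""
    if len(forecasts) != len(outcomes):
        raise ValueError("inputs must have the same length")
    if not forecasts:
        raise ValueError("need at least one forecast")
    total = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes))
    return total / len(forecasts)


# Hypothetical forecasts: probability that a given team wins,
# paired with 1 if that team won and 0 if it lost.
forecasts = [0.80, 0.55, 0.30]
outcomes = [1, 0, 1]
score = brier_score(forecasts, outcomes)

# A forecaster who always says 50 percent scores exactly 0.25.
coin_flip = brier_score([0.5, 0.5, 0.5], outcomes)
```

The coin-flip baseline of 0.25 is a useful reference point when reading scores like 0.224 and 0.219: any forecast that scores below it is doing better than guessing.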

Using Brier scores, let’s look at how the model’s accuracy evolved over time. Very early in the season, new Elo had an edge, perhaps because it was accounting for the many quarterback injuries that beset teams during the first few weeks. Then things in the league got weird. And the old system — which didn’t adjust for QBs, travel distance or rest days — was actually handling the weirdness better for most of the first half of the year. The new model didn’t pull ahead for good in terms of seasonlong Brier score until Week 11, at which point it maintained a lead and even expanded it, with injuries and teams resting starters in the closing weeks of the schedule.
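A season-to-date comparison like the one described above can be sketched as a running average of squared errors. This version updates after every game rather than every week, and all names and numbers in it are assumptions for illustration only.

```python
from itertools import accumulate


def running_brier(forecasts, outcomes):
    """Season-to-date Brier score after each game, in schedule order."""
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    totals = accumulate(squared_errors)
    return [total / games for games, total in enumerate(totals, start=1)]


def first_lasting_lead(model_a, model_b):
    """Index from which model_a's running score stays below model_b's.

    Returns None if model_a is not ahead at the end of the series.
    """
    lead_start = None
    for i, (a, b) in enumerate(zip(model_a, model_b)):
        if a < b:
            if lead_start is None:
                lead_start = i
        else:
            lead_start = None
    return lead_start


# Hypothetical four-game stretch; 1 means the forecasted team won.
outcomes = [1, 1, 0, 1]
new_running = running_brier([0.6, 0.3, 0.4, 0.8], outcomes)
old_running = running_brier([0.5, 0.5, 0.5, 0.5], outcomes)
lead_index = first_lasting_lead(new_running, old_running)
```

Here the hypothetical new model leads after the first game, falls behind for two games and only pulls ahead for good at the fourth, the same shape as the early edge, midseason slump and late lead described above.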

The playoffs were a bit rough for the new model, primarily because of two games: Seattle at Philadelphia in the wild-card round (where new Elo’s Brier was 0.480, compared with 0.380 for the old model) and Tennessee at Baltimore in the divisional round (new Elo’s Brier was 0.755 — really bad! — compared with 0.582 for the old system). Our backtesting suggested that there are real predictive effects to late-season QB hot and cold streaks, and that favorites tend to play better in the postseason, but both of those factors ended up haunting the new model in that pair of upsets. Overall in the playoffs, new Elo had a worse Brier score (0.272) than the old model did (0.261) — although, as we mentioned earlier, that didn’t really cause it to do worse than expected for the entire season overall. And, of course, it also helped that the new system did much better in the conference championships and the Super Bowl.

Finally, just for fun, let’s look at the games in which the new model had its best and worst picks of the season, relative to the old system:

QB-adjusted Elo’s greatest hits (and misses) of 2019

Highest and lowest Brier score differentials between FiveThirtyEight’s old and new QB-adjusted NFL Elo models by game, 2019 season

Hits

| Date | Winner | Winner QB | Loser | Loser QB | Winner’s W% (old Elo) | Winner’s W% (new Elo) | Brier Diff. |
|---|---|---|---|---|---|---|---|
| 10/27/19 | GB | Rodgers | KC | Moore | 32 | 58 | -0.296 |
| 12/29/19 | TEN | Tannehill | HOU | McCarron | 35 | 59 | -0.249 |
| 10/6/19 | OAK | Carr | CHI | Daniel | 27 | 47 | -0.242 |
| 12/29/19 | CHI | Trubisky | MIN | Mannion | 28 | 45 | -0.207 |
| 9/22/19 | SF | Garoppolo | PIT | Rudolph | 53 | 79 | -0.177 |
| 9/5/19 | GB | Rodgers | CHI | Trubisky | 24 | 36 | -0.165 |
| 10/6/19 | BAL | Jackson | PIT | Rudolph | 42 | 58 | -0.165 |
| 10/13/19 | NYJ | Darnold | DAL | Prescott | 30 | 42 | -0.159 |
| 12/22/19 | NYJ | Darnold | PIT | Hodges | 34 | 46 | -0.142 |
| 11/17/19 | DAL | Prescott | DET | Driskel | 55 | 75 | -0.140 |

Misses

| Date | Winner | Winner QB | Loser | Loser QB | Winner’s W% (old Elo) | Winner’s W% (new Elo) | Brier Diff. |
|---|---|---|---|---|---|---|---|
| 10/13/19 | PIT | Hodges | LAC | Rivers | 39 | 20 | 0.266 |
| 12/1/19 | CIN | Dalton | NYJ | Darnold | 42 | 24 | 0.238 |
| 9/22/19 | CAR | K. Allen | ARI | Murray | 54 | 35 | 0.219 |
| 9/29/19 | CAR | K. Allen | HOU | Watson | 35 | 22 | 0.189 |
| 11/3/19 | KC | Moore | MIN | Cousins | 56 | 39 | 0.178 |
| 1/11/20 | TEN | Tannehill | BAL | Jackson | 24 | 13 | 0.174 |
| 11/3/19 | DEN | B. Allen | CLE | Mayfield | 58 | 41 | 0.170 |
| 11/10/19 | PIT | Rudolph | LAR | Goff | 56 | 40 | 0.158 |
| 9/22/19 | NO | Bridgewater | SEA | Wilson | 42 | 30 | 0.153 |
| 9/29/19 | NO | Bridgewater | DAL | Prescott | 63 | 47 | 0.147 |
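
The Brier Diff. column can be roughly reproduced from the two win-probability columns. The sketch below assumes the differential is the new model's Brier score minus the old model's, which matches the signs in the table (negative means new Elo did better). Because the table's percentages are rounded, the results only approximate the published figures.

```python
def brier_diff(old_win_pct, new_win_pct):
    """New-minus-old Brier score for one game.

    Both inputs are each model's percent chance for the team that
    actually won, so the squared error is (1 - probability) ** 2.
    """
    old_brier = (1 - old_win_pct / 100) ** 2
    new_brier = (1 - new_win_pct / 100) ** 2
    return new_brier - old_brier


# Rounded percentages from the 10/27/19 row (GB over KC)
# and the 10/13/19 row (PIT over LAC).
hit = brier_diff(32, 58)
miss = brier_diff(39, 20)
```

With these rounded inputs the sketch gives about -0.286 and 0.268, compared with the table's -0.296 and 0.266, which were presumably computed from unrounded probabilities.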

Unsurprisingly, most of these examples revolved around backup quarterbacks, for good or bad — either because the regular starter was knocked out (which old Elo didn’t know about) or because he was returning after a long absence. Sometimes adjusting for this resulted in an overcorrection, such as when Pittsburgh was down to third-string QB Devlin Hodges in Week 6 yet somehow managed to still win. But more often it helped, such as when Mahomes went down and Kansas City lost with Matt Moore at the helm in Week 8.

So overall, we think new Elo had a solid rookie season, and the new changes helped the model’s predictions. Although there are a few areas of improvement to potentially investigate over the offseason, it was encouraging that the new system outperformed the old system by almost precisely what we expected based on our backtesting. It was also a good sign that the model was able to consistently outpredict the average reader in our forecast game, “winning” all but two weeks of the season and continuing the old system’s pattern of dominance over the field from previous seasons:

Our new Elo had a pretty good season vs. the field

Weekly average differences between points won by our new QB-adjusted Elo and by readers in FiveThirtyEight’s NFL prediction game, 2019 regular season and playoffs

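| Week | Games | Avg. Net Pts |
|---|---|---|
| 1 | 16 | +7.9 |
| 2 | 16 | +13.6 |
| 3 | 16 | -1.0 |
| 4 | 15 | -2.7 |
| 5 | 15 | +19.8 |
| 6 | 14 | +24.6 |
| 7 | 14 | +8.1 |
| 8 | 15 | +26.8 |
| 9 | 14 | +45.9 |
| 10 | 13 | +71.4 |
| 11 | 14 | +27.8 |
| 12 | 14 | +54.0 |
| 13 | 16 | +35.6 |
| 14 | 16 | +57.7 |
| 15 | 16 | +3.4 |
| 16 | 16 | +14.1 |
| 17 | 16 | +72.2 |
| Playoffs | 11 | +69.4 |
| Season total | 267 | +548.6 |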

The scoring system is nonlinear, based on the accuracy of the FiveThirtyEight model relative to readers’ probabilistic forecasts, with particularly strong punishments for overconfident incorrect picks by either side.
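
To illustrate what a nonlinear, Brier-style scoring rule looks like, here is a sketch under an assumed formula: 25 points minus 100 times the squared error of the pick. This article does not spell out the game's exact rule, so treat the formula, the function names and the sample picks as assumptions made for illustration.

```python
def game_points(win_prob, team_won):
    """Points under an ASSUMED rule: 25 - 100 * squared error."""
    outcome = 1.0 if team_won else 0.0
    return 25.0 - 100.0 * (win_prob - outcome) ** 2


def net_points(model_prob, reader_prob, team_won):
    """Model's points minus a reader's points for the same game."""
    return game_points(model_prob, team_won) - game_points(reader_prob, team_won)


# Hypothetical picks on a team that ends up losing:
# the model gave it 60 percent, a reader gave it 90 percent.
best_case = game_points(1.0, True)
worst_case = game_points(1.0, False)
toss_up = game_points(0.5, True)
net = net_points(0.6, 0.9, team_won=False)
```

Under this assumed rule a 50-50 pick is worth nothing either way, while a fully confident pick gains 25 points if right and loses 75 if wrong. That asymmetry is the kind of strong punishment for overconfident misses described above.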

Speaking of which, congrats to Jordan Sweeney, who led all readers in the postseason with 275 points, and to Griffin Colaizzi, who used the Super Bowl to pull ahead and win the full-season contest with 1,126.2 points. And a big thanks to everyone who played all season! We can’t wait to fire up the model again in about six months and try to get that Brier score even lower next year.