The Ravens Should Have Gone For It


Stop me if you’ve heard this one: An NFL head coach decides to go for it on fourth down in a big, closely fought game. The team fails the attempt and goes on to lose the game. Fans are outraged, and media members are contemptuous.

It’s a scene we’ve seen play out dozens of times over the past decade, and on Sunday it unfolded at M&T Bank Stadium with the Baltimore Ravens and coach John Harbaugh. A high-stakes bet was made, then lost, and an obscene amount of second-guessing ensued. 

Even after years of explanations of the logic behind fourth-down decision-making in the NFL, that reasoning is routinely ignored by those who follow and analyze the game. Which leads us to the question: Does evidence matter in the NFL? If so, why do some people reject it in favor of flawed ways of thinking? And what can be done about it?

First, let’s recap the situation Baltimore found itself in on Sunday against the Buffalo Bills. On fourth-and-goal from Buffalo’s 2-yard line, with the game tied at 20 and 4:15 remaining on the clock, the Ravens made a decision that is, on its face, counter-intuitive: They attempted a play that the average NFL team had converted 46 percent of the time from 2016 to 2021, rather than one that had been converted around 99 percent of the time.

Put in those terms, it might seem ridiculous to choose the play with less than a coin-flip’s chance of succeeding. But of course the story can’t end there, because the analysis is incomplete. We need to factor in the possible points gained for each play and figure out how those points impact the chance that a team wins.

Enter win probability models and decision guides. There are three major public-facing win probability models (ESPN’s model, the NFL’s Next Gen Stats decision guide and the 4th Down Decision Bot by Ben Baldwin), and they crunch the numbers in a similar way. They take a given game state (score, down and distance, time remaining, number of timeouts left for each team, relative team strengths and other factors) and calculate how often a team in a similar situation goes on to win if it decides to run or pass, kick a field goal or punt.

Since the Ravens were on the 2-yard line, punting was out of the question. That means the models had to calculate the percentages for just two outcomes: going for the touchdown or kicking a field goal. All three agreed that going for it maximized the chances of the Ravens winning the game.

Because some folks (perhaps wisely) mistrust an algorithm they don’t fully understand, authors of the models have taken pains to explain how the win probabilities are calculated. Of the three, Baldwin’s model does the best job of peeling back the layers of the analysis and helping us grasp where each component of win probability comes from.

The bot is telling us that Baltimore’s win probability skyrockets to 83 percent if the Ravens score a touchdown (which is estimated to have a 47 percent likelihood), and it falls to around a coin flip if they fail. Meanwhile, if kicker Justin Tucker makes a near-certain field goal, the Ravens’ chances would be 63 percent, 20 percentage points lower than after a successful touchdown. (This discrepancy exists in large part because Bills quarterback Josh Allen would have plenty of time to drive downfield and erase Baltimore’s 3-point edge.) And finally, in the extremely unlikely event of a Tucker miss, the Ravens would have just a 43 percent chance of victory.

All of those differences end up being important. The 20-percentage point increase in win probability if the Ravens score a TD makes up for the play’s much lower chance of success, as touchdowns are worth twice as much as field goals (and extra points are fairly routine). The final win probability for each choice represents the average of all outcomes if the Ravens go for it (65 percent) versus if they kick the field goal (63), and going for it netted a positive expectation of 2.1 percentage points.
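For readers who want to check the averaging themselves, here is a minimal sketch in Python. It uses only the rounded figures quoted in this article, takes the "around a coin flip" failure case as 49 percent (the figure cited later in the piece) and assumes the roughly 99 percent league-wide conversion rate for the field goal. The function name is made up for illustration, and this is not the bot's actual code.

```python
def expected_win_prob(p_success, wp_if_success, wp_if_failure):
    """Weight the win probability of each outcome by how likely that outcome is."""
    return p_success * wp_if_success + (1 - p_success) * wp_if_failure

# Rounded inputs quoted in the article (assumptions, not the model's raw values).
go_for_it = expected_win_prob(0.47, 0.83, 0.49)   # touchdown try
field_goal = expected_win_prob(0.99, 0.63, 0.43)  # field goal try
edge = go_for_it - field_goal

print(f"Go for it:  {go_for_it:.1%}")
print(f"Field goal: {field_goal:.1%}")
print(f"Edge:       {edge * 100:.1f} percentage points")
```

With these rounded inputs the edge comes out near 2.2 percentage points, slightly different from the 2.1 reported above, presumably because the bot works from unrounded values.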

That positive expectation guided the Ravens’ decision-making. What Baltimore didn’t expect is that quarterback Lamar Jackson would throw his second interception of the day to Jordan Poyer. An interception from the 2 is rare, happening just 15 times in 1,097 pass attempts from the 2 from 2016 to 2021. Since teams threw a touchdown 281 times in those same 1,097 attempts, the Ravens were 18.7 times more likely to throw a touchdown than an interception. The pick was quite unlucky — about as likely as Tucker missing the field goal. It was a bad beat that was incorporated into the model but weighted lightly due to its rarity.
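The arithmetic behind those rates is easy to reproduce. The sketch below uses the counts quoted above; the 1 percent miss rate for the field goal is an assumption taken from the roughly 99 percent conversion rate mentioned earlier, and the variable names are illustrative.

```python
# Pass attempts from the 2-yard line, 2016 to 2021 (counts quoted in the article).
attempts = 1097
interceptions = 15
touchdowns = 281

int_rate = interceptions / attempts        # share of passes that were picked off
td_rate = touchdowns / attempts            # share of passes that scored
td_to_int_ratio = touchdowns / interceptions

# Assumed miss rate, derived from the roughly 99 percent field goal conversion rate.
fg_miss_rate = 1 - 0.99

print(f"Interception rate: {int_rate:.1%}")
print(f"Touchdown rate:    {td_rate:.1%}")
print(f"TD-to-INT ratio:   {td_to_int_ratio:.1f}")
print(f"FG miss rate:      {fg_miss_rate:.1%}")
```

The interception rate works out to a little over 1 percent, which is why the pick is described as about as likely as a missed kick.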

The cost of that misfortune was large, though. After Poyer’s pick, the Bills got the ball at the 20 instead of the 2, so the Ravens’ chance of winning dropped to 43 percent instead of 49 percent. That is the same probability as after a field goal miss.

In other words, the math knew the Ravens were in trouble at that point. Allen immediately did what an elite quarterback does, promptly leading his team into field goal range. Buffalo drove all the way to the Baltimore 1-yard line, kneeled twice to run the clock down and kicked a field goal to win. 

To some pundits, the 3-point margin of defeat confirmed that going for it on fourth down was a mistake. Some even made ponderous, overconfident claims that the Ravens would have won the game if they’d just taken the points. But that is far from certain. We all saw how methodically the Bills moved downfield after picking off Jackson; their drive after a hypothetical Ravens field goal would have been even easier. Assuming Tucker kicks a touchback after the field goal, the Bills’ ensuing drive would have started at the 25 instead of the 20, and Buffalo would have been incentivized to punch the ball into the end zone at the 1 rather than take two knees. 

Plus, as Harbaugh said after the game: “If you kick a field goal there, it’s not a three-down game anymore, it’s a four-down game.” When a team suddenly has 33 percent more plays at its disposal, defending it becomes much harder — especially when that team also has Allen at QB. 

Yet even with all the careful explanations and mathematical logic, a significant portion of football people remain unconvinced. When the evidence is so strong, the question is: Why do they resist change? It turns out there are lots of reasons why people persistently fail to embrace change, but a lack of convincing evidence is not at the top of the list. And this is especially true when the recommended change is coming from something as alien as an algorithm. 

Academics have even named this behavior: “algorithm aversion.” A 2015 paper from Berkeley Dietvorst, Joseph Simmons and Cade Massey found that people will reject a statistical algorithm in favor of a human forecaster when they see the algorithm miss on a prediction — even if they know the algorithm has a better overall track record. Control — or the illusion of it — matters more to people than being correct (and enjoying the better outcomes that come with it). 

The same researchers later found in a 2016 paper that they could attenuate the effect of algorithm aversion if they let people have control over the output, even if control led to worse results. It’s a little like refusing to trust your car until you’ve opened the hood and messed around with the ordering of the wires on the distributor cap. You might get lucky and not foul things up too badly, but a misfire or worse is likely in the near future. To combat this potential hazard, the authors recommended letting people adjust an algorithm’s output only slightly, enough to sate their need to fiddle while minimizing the damage introduced by meddling.

If the academics are correct, tables and charts likely aren’t enough when it comes to NFL win probability models. Perhaps evidence doesn’t matter. Instead of pedantic explanations, maybe a deeper, more human desire needs to be addressed before people will trust a win probability algorithm: The need to exercise agency. Maybe people have to be empowered before they’ll be comfortable listening to a machine.

Or — and hear me out — maybe model designers need to take their cue from fidget spinners. Just slap some twisty knobs and clicky buttons onto the algorithmic engines, and we’ll painlessly usher in a new age of evidence-based consensus.
