How to Predict MLB Records From Early Results


Announcers and commentators often say a baseball team is “on pace” to win or lose a certain number of games, simply by extrapolating the team’s current winning percentage over 162 games. Those statements may be technically true, but in a randomness-filled reality, they’re meaningless.

Sabermetrics constantly struggles with randomness, an unavoidable fact of sporting life and the reason there’s almost always a difference between a team’s observed performance and its actual talent level. Moreover, the smaller the sample of games, the less confident we can be that what we’re seeing is skill and not luck. We’re fewer than 10 games into Major League Baseball’s season.

This is why it’s necessary to regress observed statistics to the mean. Extreme results, baseball stats included, tend to drift back toward the average over time. But how much do we need to regress? And which mean should we regress to?

The simplest prior to use is the league average (in the case of team records, a .500 winning percentage). How much we need to regress depends on the ratio of skill to luck in the results we’ve observed. Sabermetricians typically set the amount of regression equal to the number of games at which half the variance in team records is due to talent and half to chance.

In that case, we need to add about 67 games of .500 baseball (33½ wins, 33½ losses) to a team’s record, based on seasonal data since MLB last expanded in 1998. (Here are a couple of mathematical proofs explaining this method as it relates to Bayes’ theorem.)
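
For readers curious where a figure like 67 comes from, here is a rough sketch. It assumes the usual variance decomposition: the observed variance in team winning percentages equals talent variance plus the binomial variance produced by chance alone. The function name and the input value are illustrative assumptions, not the data or code behind this article.

```python
def regression_games(observed_var, season_games=162, mean_pct=0.5):
    """Games of league-average ball at which talent and luck variance are equal.

    observed_var: variance of full-season team winning percentages
    (a hypothetical input here; no real league data is included).
    """
    # Variance in winning percentage expected from chance alone over a season.
    luck_var = mean_pct * (1 - mean_pct) / season_games
    # Whatever variance is left over is attributed to talent.
    talent_var = observed_var - luck_var
    if talent_var <= 0:
        raise ValueError("observed variance is no larger than luck alone would produce")
    # Number of games at which luck variance shrinks to match talent variance.
    return mean_pct * (1 - mean_pct) / talent_var


# Round-trip check with a made-up variance constructed to correspond to 67 games.
hypothetical_var = 0.25 / 67 + 0.25 / 162
print(round(regression_games(hypothetical_var), 1))  # 67.0
```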

Regressed winning percentage = (wins + 33.5) / (games played + 67)

So while the Washington Nationals and Milwaukee Brewers share MLB’s best record at 6-2, a .750 winning percentage, we’d really only expect each of them to play .527 ball from now on: (6 + 33.5) / (8 + 67) = .527. That estimate is based on the information we have relative to our prior (the population of MLB teams from which the Brewers and Nationals are selected).
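
The “add 67 games of .500 ball” calculation is simple enough to sketch in a few lines. The function and parameter names below are our own illustrative choices; the constants come from the text.

```python
def regressed_win_pct(wins, losses, prior_games=67.0, prior_pct=0.5):
    """Blend a team's record with prior_games of prior_pct baseball."""
    prior_wins = prior_games * prior_pct  # 33.5 wins for the .500 prior
    return (wins + prior_wins) / (wins + losses + prior_games)


# A 6-2 start (.750) regresses to roughly .527.
print(round(regressed_win_pct(6, 2), 3))  # 0.527
```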

We can do this for all MLB clubs:

[Table: regressed records for all MLB clubs, using a .500 prior]

Regression to the mean lets us get a better sense of a team’s pace by giving us a realistic estimate of its future winning percentage. That’s why a 1-0 team isn’t on pace to win 162 games; in the absence of other information about the quality of the team, it’s really only on pace to win 82.7 games.
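
To make the comparison concrete, here is a minimal sketch of both kinds of pace. It assumes the regressed pace is banked wins plus the remaining games played at the regressed winning percentage; the function names are illustrative.

```python
def naive_pace(wins, losses, season_games=162):
    """Extrapolate the current winning percentage over a full season."""
    return wins / (wins + losses) * season_games


def regressed_pace(wins, losses, season_games=162, prior_games=67.0, prior_pct=0.5):
    """Wins already banked plus remaining games at the regressed winning percentage."""
    games_played = wins + losses
    future_pct = (wins + prior_games * prior_pct) / (games_played + prior_games)
    return wins + (season_games - games_played) * future_pct


print(naive_pace(1, 0))                # 162.0
print(round(regressed_pace(1, 0), 1))  # 82.7
```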

Another great thing about this procedure is that the “add 67 games of .500 ball” trick works no matter how far into the season a team is. It’s just as valid now as it will be in late July. The difference between now and then will be the amount of weight that a team’s observed results take in the formula. In April, the .500 prior dominates any team’s projection. By the time 67 games roll around, precisely half of a team’s regressed record will be made up of its observed results, and the other half will be the prior.
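
One way to see how that weight shifts is to compute the share of the regressed record that comes from observed games. This is a small sketch under the same 67-game assumption; the function name and the sample game counts are ours.

```python
def observed_weight(games_played, prior_games=67.0):
    """Fraction of the regressed record contributed by actual results."""
    return games_played / (games_played + prior_games)


for games in (8, 67, 100, 162):
    print(games, round(observed_weight(games), 3))
# 8 0.107
# 67 0.5
# 100 0.599
# 162 0.707
```

Even after a full 162-game season, the prior still accounts for about 29 percent of the estimate.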

Of course, we don’t have to limit ourselves to a prior winning percentage of .500 for every team, either. We know that sources such as Las Vegas over/unders and computer projections do a better job of setting preseason expectations than simply expecting every team to finish 81-81. If we plug in an aggregation of Vegas and computer models from before the season as priors (using a standard deviation of nine wins for those predictions), we come to the following regressed-to-the-mean records:
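
The article does not spell out how the nine-win standard deviation is turned into a prior, so the following is only one plausible sketch. It assumes the standard deviation is converted into an equivalent number of prior games the same way as before (binomial variance divided by the prior’s variance), which works out to about 81 games for a .500 projection. The function names and the example team are hypothetical.

```python
def prior_games_from_sd(projected_pct, sd_wins=9.0, season_games=162):
    """ASSUMPTION: treat the projection's error as a prior worth this many games."""
    sd_pct = sd_wins / season_games  # nine wins is about .056 in winning percentage
    return projected_pct * (1 - projected_pct) / sd_pct ** 2


def regressed_win_pct_with_projection(wins, losses, projected_wins,
                                      sd_wins=9.0, season_games=162):
    """Regress a record toward a preseason projection instead of .500."""
    projected_pct = projected_wins / season_games
    prior_games = prior_games_from_sd(projected_pct, sd_wins, season_games)
    prior_wins = prior_games * projected_pct
    return (wins + prior_wins) / (wins + losses + prior_games)


# Hypothetical team: projected for 90 wins before the season, now 2-6.
print(round(regressed_win_pct_with_projection(2, 6, 90), 3))
```

Under these assumptions, a strong preseason projection outweighs a bad first week, which is the point of using a better-informed prior.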

[Table: regressed records for all MLB clubs, using preseason Vegas and computer projections as priors (corrected)]

Whichever method we use, it’s important to note how long it takes for observed records from the current season to start to have an impact on an assessment of a team’s ability level. At this stage of the season, any adjustments we make should be extremely small relative to the expectations we had for each team a few weeks ago.

Correction (April 12, 11:25 p.m.): The second table in an earlier version of this article miscalculated the true win and true loss paces. The correct table appears above.