Where Sabermetrics and the ‘Eye Test’ Disagree


Two weeks ago, ESPN released its “Baseball Tonight” (BBTN) 100, a player ranking based on votes from a panel of 40 experts. The panelists graded each of a group of 277 players on a 0-100 scale and then ranked them accordingly. For instance, Mike Trout of the Los Angeles Angels topped the list with a score of 98, while his most valuable player foil from recent seasons, Miguel Cabrera, came in second at 96.

It’s just one set of ratings from one group of writers, former players and ex-pats of the game. But it’s a useful proxy for understanding the difference between subjective valuations and empirical ones. I was interested in seeing what these experts’ preferences tell us about how they view the game. Because the BBTN panelists were grading players subjectively, some of their implicit criteria would surely differ from more empirical measures. This isn’t to say that the experts didn’t look at statistics at all when voting, or that most of them aren’t knowledgeable about baseball’s growing culture of numeracy. Many of them are. But purely subjective votes like this bring out the emotional decisions inherent in evaluating players. I wanted to know where those emotions led the panel when, right or wrong, they strayed from the path of pure sabermetric orthodoxy.

To that end, I broke down what caused players’ BBTN scores to differ from what would have been predicted from their wins above replacement (WAR) marks, an advanced metric designed to statistically measure a player’s on-field contributions in a logical, structured way. Think of this as an investigation into where sabermetrics and the “eye test” disagree.

For the 149 position players who logged at least 600 plate appearances (PAs) from 2011 to 2013, I converted per-plate-appearance WAR rates to the same scale as the BBTN ratings. For instance, a rate of 6 WAR per 600 PAs, as Jose Bautista of the Toronto Blue Jays produced, would typically lead to a BBTN score of 78, precisely the mark the panel gave Bautista. But not every player’s BBTN score lined up so perfectly with his WAR numbers.

[Chart: WAR vs. BBTN rating]

The Texas Rangers’ Prince Fielder generated just 3.3 WAR/600, which my calculations predict would lead to a BBTN rating of 62; instead, the voters deemed Fielder worthy of a 79, one of the most divergent ratings in the data set. Meanwhile, Craig Gentry of the Oakland Athletics created 6.7 WAR/600 — normally good for a BBTN score of 82 — but was rated a 45 by the panel.

These divergences are a proxy for over- and underratedness, where, for the purposes of this exercise, a player’s “accurate” rating is simply the one his WAR rate implies. An overrated player’s BBTN score is higher than his WAR implies it should be; an underrated player’s is lower.
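To make the arithmetic concrete, here is a minimal sketch of the WAR-to-BBTN conversion and the resulting divergences. It assumes the conversion is a straight line, which the article does not state; the line is fitted only to the four (WAR/600, implied score) pairs quoted in this piece, so the slope and intercept are rough reconstructions, not the actual scaling used for the full data set. The function names are made up for illustration.

```python
# (WAR per 600 PA, WAR-implied BBTN score) pairs quoted in the article:
# Bautista, Fielder, Gentry, Simmons.
QUOTED = [(6.0, 78), (3.3, 62), (6.7, 82), (7.0, 84)]


def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# ASSUMPTION: a linear scaling; the real conversion may differ.
SLOPE, INTERCEPT = fit_line(QUOTED)


def implied_bbtn(war_per_600):
    """BBTN score that a given WAR rate would 'typically' earn."""
    return round(SLOPE * war_per_600 + INTERCEPT)


def divergence(actual_bbtn, war_per_600):
    """Positive = overrated by the panel, negative = underrated."""
    return actual_bbtn - implied_bbtn(war_per_600)


if __name__ == "__main__":
    panel_scores = {
        "Jose Bautista": (78, 6.0),
        "Prince Fielder": (79, 3.3),
        "Craig Gentry": (45, 6.7),
        "Andrelton Simmons": (76, 7.0),
    }
    for name, (score, war) in panel_scores.items():
        print(f"{name}: {divergence(score, war):+d}")
```

Under this fit each additional win per 600 PAs is worth roughly six BBTN points, and the four quoted players come out at 0, +17, -37 and -8, matching the gaps described in the text.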

But not every overrated player is overrated for the same reason. To understand if the experts are snookered by certain skills more than others, I gathered a bunch of numbers (including scouting-style defensive opinions) from Baseball-Reference and Fangraphs, looking to see which were significant predictors of how much a player’s BBTN rating diverged from his WAR.

Seven factors turned up as having a real effect on how a player was regarded by voters, relative to his sabermetric output.

  • Isolated Power: All else being equal, players with great power were significantly overvalued by voters. If a player somehow increased his isolated power by 11 points while keeping his overall value equal, he would have been rated one point higher by the BBTN panel. To wit: Chris Davis of the Baltimore Orioles is a fantastic power hitter (his .269 ISO since 2011 tied the Detroit Tigers’ two-time MVP Miguel Cabrera for third in our data set), but he rates as a below-average baserunner and a weak fielder at a non-premium defensive position. His three-year WAR should be equivalent to a 62 BBTN rating, but the panel gave him an 81.

  • Defensive Runs Saved: Despite “Baseball Tonight’s” penchant for highlighting Web Gems, great fielders appear to be given short shrift in the ratings. Regardless of position, they were systematically underrated by the BBTN 100 panel. After controlling for other characteristics and overall WAR, a player who invests in his defense to the point where he saves 2.5 runs per 600 PAs gets dinged by one point in BBTN’s ratings. A great example is the Atlanta Braves’ Andrelton Simmons, who saved an astounding 40 runs per 1,200 innings in the field. His 7 WAR/600 suggested a BBTN score of 84; instead, the panel rated him a mere 76.

  • Positional Scarcity: The voters tended to judge players’ offensive numbers without regard to the position where they were produced. It’s a lot easier to find a great hitter physically capable of playing first base than it is to find the same hitter who can also play competently at shortstop, but the BBTN rankings don’t reflect that. This sits in stark contrast to the sabermetric idea of positional scarcity, which Bill James gave voice to in the 1980s and which Keith Woolner popularized, more rigorously, in the 1990s with the development of VORP. The effect is most evident with designated hitters, whose WAR totals are limited because they provide literally no defensive value. DHs like Billy Butler of the Kansas City Royals and David Ortiz of the Boston Red Sox rated 10 and 8 points higher, respectively, than WAR says they should have.

  • Arm Strength: Sabermetrics is indifferent to the flair with which a player plays — its only concern is production. Arm strength is one of those pieces of flair most associated with raw athleticism (or “tools,” in scouting parlance) that the eyes appreciate even if the numbers are indifferent. That’s speculation, of course, but the regression is detecting some kind of real effect. It tells us that between two equally valuable players rated six points apart in arm strength, as measured by Tom Tango’s Fans Scouting Report, the more rifle-armed of the pair would score one point higher in the BBTN 100.

  • Batting Average on Balls in Play: There are two possible explanations for the value placed by the BBTN 100 voters on BABIP. One is that the panel doesn’t agree with the standard sabermetric view that much of a batter’s BABIP is driven by chance. (Remember, this isn’t to say success on balls in play is all luck, but it does take nearly two and a half years’ worth of plate appearances for an individual hitter’s BABIP to stabilize.) The other is that they strongly value those select batters whose playing style lends itself to a higher-than-normal BABIP (think speedy, ground-ball hitters like Ichiro Suzuki in his prime). With WAR held constant, a player would need a BABIP 14 points higher to see a one-point boost to his BBTN score.

  • Contact Rate: Perhaps unsurprisingly, BBTN 100 panelists have a bias toward guys who put the bat on the ball when they swing. All else being equal, a 1.9-point increase in contact percentage leads to a player being rated one point higher in the voting. The Toronto Blue Jays’ Jose Reyes connected on 88.5 percent of swings, which, coupled with a .321 BABIP, helped give him a .306 batting average and a BBTN rating of 71, seven points higher than his WAR would typically warrant.

  • Clutch Hitting: Success in the Win Probability Added “clutch” metric — which measures whether a player hits better in high-leverage situations — may follow from playing style more than mental fortitude. But whatever the reason, the voters enjoy a player who raises his game in crucial situations. A player whose clutch play added 0.45 wins per 600 PA more than another (despite equal WAR) receives a one-point boost to his BBTN 100 rating. The New York Yankees’ Jacoby Ellsbury might be the poster child here after leading baseball in clutch wins above average over the 2011-13 span; Ellsbury was rated eight points higher by BBTN than WAR suggests he should have been.
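The “one point” trade-offs quoted in the list above can be read as the reciprocals of regression coefficients. The sketch below makes that conversion explicit. It assumes the effects are linear and additive, which the article implies but does not spell out, and it takes “points” of ISO and BABIP in the usual baseball sense of thousandths. The dictionary keys and function names are invented for illustration, and positional scarcity is left out because no per-unit trade-off is quoted for it.

```python
# Change in each stat that, with WAR held constant, moves the BBTN rating
# by +1 point according to the article. A negative value means that MORE
# of the trait LOWERS the rating (defense was underrated).
ONE_POINT_TRADEOFFS = {
    "isolated_power": 0.011,               # 11 points of ISO
    "defensive_runs_saved_per_600": -2.5,  # 2.5 runs saved costs a point
    "arm_strength_fsr": 6.0,               # Fans Scouting Report points
    "babip": 0.014,                        # 14 points of BABIP
    "contact_pct": 1.9,                    # percentage points of contact
    "clutch_wins_per_600": 0.45,           # WPA "clutch" wins per 600 PA
}


def implied_coefficients(tradeoffs):
    """Rating points per one unit of each stat (reciprocal of the trade-off)."""
    return {name: 1.0 / delta for name, delta in tradeoffs.items()}


def rating_shift(stat_deltas, tradeoffs=ONE_POINT_TRADEOFFS):
    """ASSUMPTION: effects add linearly, as in an ordinary regression."""
    coefs = implied_coefficients(tradeoffs)
    return sum(coefs[name] * delta for name, delta in stat_deltas.items())


if __name__ == "__main__":
    for name, coef in implied_coefficients(ONE_POINT_TRADEOFFS).items():
        print(f"{name}: {coef:+.2f} rating points per unit")
    # Hypothetical player: 22 more points of ISO, 5 more runs saved.
    print(rating_shift({"isolated_power": 0.022,
                        "defensive_runs_saved_per_600": 5.0}))
```

In the hypothetical at the bottom, the extra power adds two points and the extra defense subtracts two, so the two shifts cancel even though both changes make the player more valuable by WAR-style accounting.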

All told, these results aren’t incredibly shocking. They show the voters favoring many of the traditional “tools” of scouting: hitting for average and power, speed, fielding and throwing. The voters’ bias may not be conscious, but it is real. It’s indicative of all the factors that add up to our impression of a player, rather than his empirical value.

Sabermetrics isn’t automatically the “correct” answer in these comparisons, but it does offer a rigorous, systematic way of valuing players. Examining where human biases conflict with the statistics is a useful way to determine where our eyes’ prejudices lie.