With Selection Sunday only a few days away, every college basketball fan is dying to know whether their team will make the NCAA tourney this March. If your team is undefeated like Kentucky, you can kick back with a mint julep and not sweat the result, but if you are a fan of one of the five Big South teams with 10+ conference wins, you might need to drink something a bit stronger to deal with the stress of the impending conference tourney.
Fortunately, North Florida Professor of Business Analytics Jay Coleman has done all the hard work for you by employing analytics to predict which teams will receive at-large bids from the NCAA Selection Committee. Working with Mike DuMond of Economists Incorporated and Allen Lynch of Mercer University, Professor Coleman helped develop a “Dance Card” formula that is constantly updated.
“I originally stumbled across a data set online (at CollegeRPI.com) back in 1999 that contained a large amount of performance information about every team that had received a bid or was likely to get a bid in each year going back to 1994,” said Coleman. “A data set like that is a gold mine to those of us who do predictive analytics. As someone who is also a college basketball fan, I immediately thought that someone ought to use that data to build a formula that could predict future bids…and that someone might as well be us! We ultimately used SAS software to do exactly that and have been making predictions ever since 2000.”
The magic formula has been near-perfect over the past three years (correctly predicting 108 of the 110 at-large bids) for several reasons: the team performance data they use is highly related to whether a team gets a bid, the models they have built using SAS capture the patterns in those decisions very accurately, and the selection committees have followed very consistent patterns from year to year. That “team performance” data is a treasure trove of factors that takes the guessing out of the game: RPI ranking, wins against top-25 teams, overall record against teams ranked 51-100, and so on.
According to Coleman, “The SAS analytics we use place differing weights on those factors based on how they have related to past selections,” which allows him to plug in the current values for two otherwise comparable bubble teams and discover which comes out ahead at that point in time. For those without the time to do a deep dive into the data, Coleman notes that RPI ranking is the single strongest predictor of which team will receive a bid. However, Coleman claims that the “Dance Card” is the only tournament prognosticator that recognizes that the RPI was tweaked in 2005 and uses the “old” RPI in its projections.
The formula also computes a “chance of bid” value for each team, representing the chance that a team with the same profile would have received a bid to past tourneys; Coleman can thus state whether teams with the same factor values have secured a bid 100% of the time in the past, 0% of the time, or some number in between. As proud as he is of his terrific track record, Coleman cannot guarantee perfection because “the model only identifies patterns in past committee decisions, and assumes those patterns will repeat themselves in the future. If they don’t, then the model can err.”
Additionally, while the formula is great for predicting bids, it is not designed to predict the specific seed that a team will receive because the factors related to getting a bid are not necessarily the same as those related to seeding. Coleman says this is possibly due to “the committee being more careful or egalitarian when selecting teams…or a desire to reward past success (actual wins and losses) over the course of the season when choosing teams combined with a desire to assign seeds based on which teams are more apt to win future (i.e., NCAA Tournament) games.”
The only miss on last year’s “Dance Card” was picking Cal instead of NC State, a miss that stemmed from the inclusion of a historical bias found to favor Pac-12 teams. For those of you who think Stanford or UCLA might make the tourney this year as the third Pac-12 team after Arizona/Utah, Coleman confirms that a favorable Pac-12 bias may be the sole remaining bias in the process that has persisted into recent years.
However, he warns that, “the fact that Cal did not get picked last year despite having a profile that put it in the realm of consideration points to a possible disappearance of even that bias in the process,” which is why they decided not to include any bias factors in this year’s predictions. So, out with the bias, in with analytics, and good luck to all 68 teams who secure an invitation to the greatest tournament in sports!
Full Interview with Jay Coleman:
Jon Teitel: How did you come up with the “Dance Card” formula (which is designed to predict which teams will receive at-large tournament bids from the NCAA Tournament Selection Committee)?
Jay Coleman: Originally, I stumbled across a data set online (at CollegeRPI.com) back in 1999 that contained a large amount of team performance information about every team that had gotten a bid, and every team that was likely in the running to get a bid, to the tournament in each year going back to 1994. A data set like that is a gold mine to those of us who do predictive analytics. As someone who is also a sports nut and college basketball fan, I immediately thought that someone ought to use that data to build a formula that could predict future bids, and that someone might as well be us. We ultimately used SAS software to do exactly that, and have been making predictions ever since 2000. The Dance Card formula is now in its third incarnation: after the original model’s development in 1999, we updated it in 2008, and updated it again last year. In each version, we’ve added additional pieces of information and/or used more recent data to build it.
JT: You have correctly predicted 108 of the 110 at-large bids over the past 3 years (98%): what makes your formula so effective?
JC: The formula is effective because (1) the team performance data we use is indeed highly related to whether a team gets a bid from the selection committee, (2) the models we’ve built using SAS capture the patterns in those decisions very accurately, and (3) the selection committees follow very consistent patterns in their selections year-to-year. If any of those were not true, our predictions wouldn’t be very good.
JT: I do not need a formula to see that a team like Kentucky has a better chance of making the tourney than a team like Oregon, but how can you separate bubble teams to know whether Oregon has a better chance of making the tourney than Old Dominion (aka “The Bubble Bursts Here” divider)?
JC: The current version of the Dance Card has identified a relative handful of factors that are highly related to getting a bid: things such as the RPI ranking, wins against the top 25, wins and overall record against teams ranked 26-50, overall record against teams ranked 51-100, etc. The SAS analytics we use place differing weights on those factors based on how they have related to past selections. When we plug Oregon’s current values for each of those things into the formula, and then plug in the current values for Old Dominion, Oregon comes out ahead at the moment.
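The weighting scheme Coleman describes behaves like a logistic-regression-style score: each factor gets a weight fitted from past committee decisions, and the team with the higher score comes out ahead. The sketch below illustrates that mechanic in Python; the factor names, weights, intercept, and both team profiles are invented for illustration and are not the actual Dance Card coefficients (which were fit in SAS and are not public).

```python
import math

# Illustrative weights only -- the real Dance Card coefficients are not public.
# RPI rank gets a negative weight: a lower (better) rank raises the score.
WEIGHTS = {
    "rpi_rank": -0.08,
    "wins_vs_top_25": 0.40,
    "wins_vs_26_50": 0.25,
    "win_pct_vs_51_100": 1.50,
}
INTERCEPT = 2.0  # also illustrative

def bid_score(profile):
    """Weighted linear score from a team's performance profile."""
    return INTERCEPT + sum(WEIGHTS[k] * profile[k] for k in WEIGHTS)

def bid_probability(profile):
    """Logistic transform of the score into a 0-1 'chance of bid'."""
    return 1.0 / (1.0 + math.exp(-bid_score(profile)))

# Hypothetical bubble-team profiles (made-up numbers, not real stats).
oregon = {"rpi_rank": 40, "wins_vs_top_25": 2,
          "wins_vs_26_50": 3, "win_pct_vs_51_100": 0.80}
old_dominion = {"rpi_rank": 48, "wins_vs_top_25": 1,
                "wins_vs_26_50": 2, "win_pct_vs_51_100": 0.75}

# Under these made-up weights and profiles, Oregon comes out ahead.
print(bid_probability(oregon) > bid_probability(old_dominion))
```

The key design point is that no single factor decides the comparison: a worse RPI rank can be offset by better results against ranked opponents, exactly the kind of trade-off a fitted model resolves that eyeballing the profiles cannot.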
JT: Your “Dance Card” values are somewhat correlative to RPI but not exactly: how do the 2 rankings differ?
JC: The RPI is the strongest predictor of who gets a bid, but it’s not the only predictor. There are other things (such as the items listed in the response to the previous question) that are also related to the committee’s decisions. When we factor in the effects of those other factors on top of the RPI ranking, they cause the Dance Card ranking to be different from (although still somewhat correlated with) the RPI ranking.
JT: What does it mean when you say that a team’s chance of getting a bid is 100%, and what does it mean when you say that a team’s chance of getting a bid is 0%?
JC: The “chance of bid” values represent the chance that a team with the same team performance profile would have gotten a bid into the tournament in past years. So, if a team’s chance is 100%, it means that a team with the same factor values would have gotten a bid every time in the past years we analyzed. Similarly, if the team’s chance is 0%, it means that a team with the same profile would have never gotten a bid in past years.
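The “chance of bid” value Coleman describes is essentially an empirical frequency: among past teams with a matching profile, what share received a bid? A minimal sketch of that idea, assuming a toy bucketed history (the bucket labels and records below are invented; the real model works from team-level performance data going back to 1994):

```python
# Toy historical records: (profile_bucket, got_bid). Both the bucketing and
# the data are invented for illustration.
HISTORY = [
    ("strong", True), ("strong", True), ("strong", True),
    ("bubble", True), ("bubble", False), ("bubble", True), ("bubble", False),
    ("weak", False), ("weak", False),
]

def chance_of_bid(bucket):
    """Share of past teams with the same profile that received a bid."""
    matches = [got_bid for b, got_bid in HISTORY if b == bucket]
    return sum(matches) / len(matches) if matches else None

print(chance_of_bid("strong"))  # 1.0: every comparable past team got in
print(chance_of_bid("weak"))    # 0.0: none did
print(chance_of_bid("bubble"))  # 0.5: a true coin flip historically
```

This also shows why a 100% or 0% value is a statement about history, not a guarantee: a future committee can always break the pattern, as Coleman notes below.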
JT: Have you found that any one factor is more important than the others (RPI rankings, wins, conference record, etc.)?
JC: Yes: the RPI ranking is easily the most important factor. However, it’s curious that it’s the “old” version of the RPI, which was in use prior to 2005, which is more highly related to getting a bid than the “new” version of the RPI that’s been in use since that time. Interestingly, the Dance Card appears to be the only tournament prognosticator (or at least the only one we know of) that recognizes that fact and uses the old RPI in projections.
JT: Is it possible to create a model that will always get every at-large prediction correct, and if not, why not?
JC: No, because there’s no guarantee that future selection committees will always follow the same patterns as those followed by past committees. The model only identifies patterns in past committee decisions, and assumes those patterns will repeat themselves in the future. If they don’t, then the model can err.
JT: What is a “conference-related” bias, and how has that affected your formula over the years?
JC: We’ve assessed three forms of conference-related bias: (1) when a team from a so-called major conference (e.g., the ACC, SEC, Big 10, Big 12, and Pac 12) is more apt to receive a bid than a team with the same statistical profile from a “mid-major” or “minor” conference; (2) when a team with representation on the committee—having its athletic director, its conference commissioner, or an athletic director from a fellow conference member on the committee—is more apt to receive a bid; and (3) when the position of the team in its conference standings is related to getting a bid. The original version of the Dance Card in 1999 did not include any assessment of these kinds of potential biases. When we updated the model in 2008, we looked into those types of factors, and we did indeed find that some of them were related to whether a team received a bid, even after accounting for each team’s performance profile. The good news, however, is that when we re-examined things recently, we found that such biases have largely if not completely gone away.
JT: Can the “Dance Card” values also be used to predict which seed a team will receive, or would that require a separate computation?
JC: While you could use it for that purpose, it’s not designed to predict the seeding. In fact, our research has suggested that the factors and/or factor weights that are related to getting a bid are not necessarily the same as those that are related to seeding. This is possibly due to the committee being more careful or perhaps even more egalitarian when selecting teams; after all, whether you get into the tournament in the first place is more important than where you get seeded, and it gets the most attention from fans, teams, the media, etc. It’s also possibly due to a desire to reward past success (actual wins and losses) over the course of the season when choosing teams, combined with a desire to assign seeds based on which teams are more apt to win future (i.e., NCAA Tournament) games. Those objectives are related but not necessarily the same. Moreover, whereas the committees seem to follow very consistent patterns in selecting teams, they seem to follow somewhat less consistent patterns when assigning seeds. This is surely in part due to various seeding rules that the committee is required to follow, but such rules do not necessarily account for all of the extra variation. For these reasons, using the Dance Card to predict seeds is likely not the best approach, and it’s why we don’t include seed projections on the web site.
JT: Why did you pick Cal instead of NC State last season (your only miss), and are you doing anything different this year in your quest for perfection?
JC: The Dance Card picked Cal last year in part because last year we included an historical bias that we’ve found in favor of Pac-12 teams. Without that adjustment for such bias, Cal would have been below our bubble line. A favorable Pac-12 bias may be the sole remaining bias in the process that has persisted into recent years. However, the fact that Cal didn’t get picked last year, despite having a profile that put it in the realm of consideration, points to a possible disappearance of even that bias in the process. Because of that, we’ve decided not to include any bias factors in this year’s predictions. NC State was below our bubble line last year, and would have been below the bubble line even without any bias considerations. (Southern Miss would have been projected by the Dance Card to get the Cal bid if the Pac-12 bias factor had been removed.) So why did the Dance Card miss on the Pack? The choice of N.C. State seemed simply to be a deviation from the committee’s past patterns – not a huge one, but enough of one to cause the model to miss.