Following the recent mock committee exercises on Twitter got me thinking, once again, about the most reviled rating method this side of Richard Billingsley's monstrosity: the RPI. While individual Selection Committee members can - and, by all accounts, at least some do - look at other rating systems, the team information sheets generated by their software organize all of the data around the RPI. And I can't think of a worse method to use.
Before I continue, I do want to emphasize that, on the whole, the committee does fairly well with a difficult job. And unlike the BCS, the tournament is large enough that the marginal decisions here aren't leaving out teams like 2011 Oklahoma State football, who had a fair claim to being the best team in the country, and minor seeding errors aren't absolutely critical (they can make a difference, but there's no such thing as an easy path to the title). But that they do as well as they do is despite their use of the RPI, not because of it.
Quite a few of the anti-RPI rants you'll see around the internet suggest that margin-based metrics like the Sagarin or Pomeroy ratings should replace the RPI in the committee's deliberations. While I am certainly a fan of such systems (particularly Pomeroy's), I'm not sure I agree with that particular use of them. There are two different goals a rating method might have - to predict future achievement, or to reward past achievement - and the two do not always coincide. A team that loses a lot of close games and wins big when they do win might be expected to win a lot of their remaining games, but they haven't achieved much so far. For selection and seeding, what you have accomplished this season should matter more than what you would be expected to do later. The problem with the RPI is not that it ignores margin and is thus a lousy predictor of future results; that's not what it's designed to do. The problem is that the formula has so many flaws that it's not very good at what it is supposedly designed to do.
Simple Tests with Pathological Results
The basic formula for the RPI is simple: a weighted average of your win percentage (25%), the average of your opponents' win percentage excluding games against you (50%), and the average of your opponents' opponents' win percentage excluding games against that opponent (25%). Where did those numbers come from? Who knows. In hockey, they've played around with the balance between the factors fairly frequently (this year it's actually 25%-21%-54%, and no, I have no idea how they came up with those numbers either), but for basketball it's been relatively stable.
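For reference, the whole formula fits in one line of code; a minimal sketch using the basketball weights:

```python
def rpi(wp, owp, oowp):
    """Basic RPI: 25% own win pct, 50% opponents' win pct,
    25% opponents' opponents' win pct."""
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp
```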
This formula can lead to some very odd results. Consider a case with 10 teams, split into two groups of five. Each team in Group A plays each team in Group B, but teams do not play others within their own group. Let's say the teams in Group A win every game except for A5's loss to B1. The relative order of those two teams could be argued, but any sane system would have to put both of them ahead of the rest of Group B (who lost every game) and behind the rest of Group A (who won every game), right? Some quick calculation yields:
A1-A4: Win percentage 1.000, opponents' win percentage .050 (B1 is 1-3 other than the game against this team; the rest are 0-4), opponents' opponents' win percentage .960 (1.000 for B1, .950 for B2-B5) = RPI .515
A5: WP .800, OWP .000 (B1's win over them doesn't count toward this), OOWP .960 = RPI .440
B1: WP .200, OWP 1.000, OOWP .040 = RPI .560
B2-B5: WP .000, OWP .950, OOWP .040 = RPI .485
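Those numbers are easy to verify. Here's a small sketch that builds the ten-team scenario above and computes the (unweighted) RPI from scratch; the helper names are mine, not anything from the NCAA's software:

```python
from itertools import product

# Group A sweeps Group B, except that A5 loses to B1.
teams_a = ["A1", "A2", "A3", "A4", "A5"]
teams_b = ["B1", "B2", "B3", "B4", "B5"]
games = [(b, a) if (a, b) == ("A5", "B1") else (a, b)   # (winner, loser)
         for a, b in product(teams_a, teams_b)]

def opponents(team):
    return [w if l == team else l for w, l in games if team in (w, l)]

def win_pct(team, excluding=None):
    """Win percentage, ignoring any games against the excluded opponent."""
    results = [w == team for w, l in games
               if team in (w, l) and excluding not in (w, l)]
    return sum(results) / len(results) if results else 0.0

def rpi(team):
    opps = opponents(team)
    wp = win_pct(team)
    owp = sum(win_pct(o, excluding=team) for o in opps) / len(opps)
    oowp = sum(sum(win_pct(oo, excluding=o) for oo in opponents(o)) / len(opponents(o))
               for o in opps) / len(opps)
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

for t in teams_a + teams_b:
    print(t, round(rpi(t), 3))
# A1-A4 (5-0): .515   A5 (4-1): .440   B1 (1-4): .560   B2-B5 (0-5): .485
```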
Yes, in this scenario the RPI thinks that a 1-4 team should be ahead of the four teams that beat it even though none of them has lost a game. Obviously, this is a contrived setup (this would be like the ACC and Big Ten replacing their conference seasons with an extended ACC-B1G Challenge), but it shouldn't be possible to break the system entirely this easily. If the RPI can't be relied upon to produce sane answers to simple scenarios, why should anyone trust that it makes sense in more complex ones?
Rewarding Road Wins, or Penalizing Home Losses?
Several years back, the NCAA added an adjustment to the formula to attempt to reward road wins. When calculating your win percentage, home wins and road losses only count as 0.6 wins or losses, and home losses and road wins count as 1.4. (Strength of schedule components don't use this weighting at all - every game counts as 1.) Although the numbers again seem to be pulled out of thin air, this sounds reasonable enough - except that it doesn't actually reward road wins at all.
Let's consider a team with a record of 20-12. Assume they played two neutral-court games and split them. If the rest of the wins are all at home and the rest of the losses are all away, their RPI-calculation record is (0.6*19+1)=12.4 wins and (0.6*11+1)=7.6 losses, a win percentage of .620. Now let's switch it around to give them three road wins and three home losses. That gives a record of (0.6*16+1+3*1.4)=14.8 wins and (0.6*8+1+3*1.4)=10.0 losses, for a win percentage of .597. Same record, more road wins, worse RPI (assuming equal schedules).
What happened? Switching things up to add road wins and home losses adds the same amount to both the win total and the loss total, which pulls your win percentage toward .500. So a change intended to reward bubble teams for road wins actually punishes them more for home losses. (For teams with losing records, this tweak actually does what it was meant to do, but those aren't the ones that matter come Selection Sunday.)
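To put numbers on it, here's a quick sketch of that weighted-record calculation (the 0.6/1.4/1.0 weights are the NCAA's; the two game breakdowns are the ones from the example above):

```python
def weighted_win_pct(home_w, road_w, neutral_w, home_l, road_l, neutral_l):
    """RPI's adjusted win pct: home wins and road losses count 0.6,
    road wins and home losses count 1.4, neutral-court games count 1.0."""
    wins = 0.6 * home_w + 1.4 * road_w + 1.0 * neutral_w
    losses = 1.4 * home_l + 0.6 * road_l + 1.0 * neutral_l
    return wins / (wins + losses)

# Both teams are 20-12 with a neutral-court split.
print(weighted_win_pct(19, 0, 1, 0, 11, 1))  # all other wins at home -> .620
print(weighted_win_pct(16, 3, 1, 3, 8, 1))   # three road wins, three home losses -> ~.597
```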
How Not to Calculate Strength of Schedule
SOS (and non-conference SOS in particular) gets a lot of attention around Selection Sunday. Every year there is at least one team whose exclusion makes the talking heads go berserk, and the committee's stated reason is almost always an atrocious strength of schedule. As with so much else here, this sounds perfectly reasonable, but the devil is in the details.
The way strength of schedule is calculated for the RPI is very straightforward: calculate all of your opponents' win percentages (other than in games against you), and average them all together. It's simple, obvious ... and yet spectacularly wrong. To demonstrate this, I'm going to use the log5 method for estimating win probabilities; this works with any rating system for which a win percentage against an average team can be calculated for each team (Pomeroy's method and Bradley-Terry are two examples, but there are many others). If teams A and B have win percentages against average of WA and WB, then the probability A defeats B is (WA*(1-WB)) / (WA*(1-WB) + (1-WA)*WB).
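In code, that's one line; a sketch:

```python
def log5(wa, wb):
    """Probability that a team with rating wa (win pct against an average
    opponent) beats a team with rating wb, per the log5 formula."""
    return wa * (1 - wb) / (wa * (1 - wb) + (1 - wa) * wb)

print(log5(0.8, 0.5))  # ~0.8 -- against an average opponent you get your rating back
```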
Let's look at two possible schedules:
1) 10 games against a team with rating .5000 (after adjusting for home court)
2) five games each against teams with ratings .9600 (a national championship contender on a neutral court) and .0400 (one of the three or so worst teams in Division 1)
The naive measure of SOS used alongside the RPI suggests that both schedules are about equal. And for an average team, they are. Both work out to 5 wins on average, although the distribution is very different (against Schedule 1, our average team will go exactly 5-5 just under 25% of the time; against Schedule 2, the probability is almost 70%). But for a team with a rating of .8000 (around 45th by Pomeroy's numbers, which would put them on or near the bubble):
Schedule 1: 10x (.8*.5 / (.8*.5 + .2*.5)) = 8 wins
Schedule 2: 5x (.8*.04 / (.8*.04 + .2*.96)) + 5x (.8*.96 / (.8*.96 + .2*.04)) = 5.66 wins
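Running both schedules through the log5 function above (repeated here so the snippet stands alone) reproduces those numbers:

```python
def log5(wa, wb):
    return wa * (1 - wb) / (wa * (1 - wb) + (1 - wa) * wb)

def expected_wins(rating, opponent_ratings):
    """Expected win total against a list of opponent ratings, via log5."""
    return sum(log5(rating, opp) for opp in opponent_ratings)

schedule_1 = [0.50] * 10               # ten average opponents
schedule_2 = [0.96] * 5 + [0.04] * 5   # five contenders, five doormats

for rating in (0.50, 0.80):            # an average team, then a bubble team
    print(rating,
          round(expected_wins(rating, schedule_1), 2),
          round(expected_wins(rating, schedule_2), 2))
# 0.5: 5.0 and 5.0  -- identical for an average team
# 0.8: 8.0 and 5.66 -- very different for a bubble team
```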
Clearly, these two schedules are anything but equivalent for the teams that have the most to gain or lose if the committee gets it wrong. Once you get sufficiently far away from your own level, in terms of expected wins it really doesn't make much difference just how badly overmatched the other team is (or how badly overmatched you are). Dropping an opponent from average to abominable gives a bubble team an extra 0.19 wins on average, while raising that opponent to championship-caliber instead costs 0.66 wins - and yet the SOS measure provided to the committee considers these to be the same. Here's a plot of rating impact for the RPI versus a system which makes use of log5:
(The log5 impact of a win is, essentially, the probability that you would have lost the game - that's the difference between your expected number of wins and actual number of wins after the game. For RPI, I had to get creative; I estimated this by getting the difference between 0.58, which is around the 50th place in the RPI, and a rescaled game value running linearly from 0.4 (.25 for the win, half of .1 for opponent's record, 1/4 of .4 for opponent's SOS) to 0.85 (.25 + .5*.9 + .25*.6), then doubled it to get approximately the same gap between top and bottom. The X axis for RPI is not the team's actual RPI, as those tend not to cover the entire range, but a rough estimate of what their log5-style rating would look like for a given RPI. The endpoints probably correspond to an RPI of about 0.25 and 0.75.)
There are two key things to notice about the chart:
1) The RPI impact plot is linear and remains so no matter what your RPI is (it shifts up and down, that's all). Sliding one opponent up and one down by the same amount, regardless of where they are relative to you, does nothing. Log5, on the other hand, is most sensitive around your own rating (the steepness of the curve at the high end is an artifact of the compression that occurs at both ends when transforming the odds ratio to the 0-1 rating).
2) The RPI impact plot goes negative, meaning a win can actually hurt you. When it comes to the RPI and the SWAC (annually the worst conference in Division 1), the only winning move is not to play them. This is the root of the "RPI anchor" phenomenon that has been behind most of the dubious decisions the committee has made. A win over a bad team certainly shouldn't help much, but if you're not considering scores, a win should never hurt. (Systems that do include the score can reasonably drop you for a closer-than-expected win.)
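To put a number on that second point, here's a toy example (the bubble team's record and schedule-strength figures are made up purely for illustration, and I'm ignoring the home/road weighting and the small knock-on change to OOWP) showing a win over a terrible team lowering a team's RPI:

```python
def rpi(wp, owp, oowp):
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp

# Hypothetical bubble team: 20-5, opponents' win pct .550, opp-opponents' .530.
games, wins, owp, oowp = 25, 20, 0.550, 0.530
print(round(rpi(wins / games, owp, oowp), 4))   # 0.6075

# Add a win over a 2-25 team (win pct 2/27, about .074). Your own win pct
# ticks up, but the new opponent drags down the 50%-weighted OWP, so the
# overall RPI drops.
new_wp = (wins + 1) / (games + 1)
new_owp = (owp * games + 2 / 27) / (games + 1)
print(round(rpi(new_wp, new_owp, oowp), 4))     # ~0.6003 -- lower than before the win
```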
On top of this, the standard SOS measure does not account for home court advantage. If two teams played the same opponents but one had several more road games, their SOS ranks would still be the same. (Their RPIs would not be, because the additional road games either count for less if they are losses or count for more if they are wins.) This is clearly absurd. Even if the number of road games is the same, which games are away from home can have a significant impact. Take the sample bubble team (.8000 rating) from above again, and let's look at the graph of log5 rating impact with home court advantage factored in:
You can see clearly that near your own rating (slightly below if going on the road, slightly above if playing at home) is where the gap between the curves is the largest. Home court makes relatively little difference in the odds when playing someone well below your level or well above. For this hypothetical bubble team, playing three .8000 opponents at home and three .9600 opponents away is easier than the reverse. Switching the .8000 opponents from home to road costs you about 0.31 wins per game on average; doing the opposite for the .9600 opponents gains you about 0.15 wins per game.
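I haven't spelled out exactly how home court gets folded in for that graph, but one reasonable way to sketch it is to scale the log5 odds ratio by an assumed home-court multiplier (the 1.5 factor below is purely illustrative, not a value taken from the graph):

```python
def log5_with_venue(wa, wb, venue="neutral", home_edge=1.5):
    """P(team A beats team B) via log5, with an assumed multiplicative
    home-court bump applied to the odds ratio (home_edge is a guess)."""
    odds = (wa / (1 - wa)) / (wb / (1 - wb))   # neutral-court odds of A winning
    if venue == "home":
        odds *= home_edge
    elif venue == "away":
        odds /= home_edge
    return odds / (1 + odds)

# Venue matters most against an opponent near your own level...
print(round(log5_with_venue(0.80, 0.80, "home"), 3))   # 0.600
print(round(log5_with_venue(0.80, 0.80, "away"), 3))   # 0.400
# ...and barely at all against one far below it.
print(round(log5_with_venue(0.80, 0.04, "home"), 3))   # 0.993
print(round(log5_with_venue(0.80, 0.04, "away"), 3))   # 0.985
```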
The result of all of this is that teams try to game the RPI in scheduling. Pile up mediocre opponents (ranked 125-200) and sprinkle in one or two top-quality opponents. Play the toughest games on the road (where the loss won't hurt as badly) and the easiest ones too (where you can pick up a cheap "road win" bonus), and get as many of your comparable opponents at home as you can. These tricks can make your schedule look harder without actually making it much harder, and they can be the difference between a comfortable 7 seed and sweating out Selection Sunday. (Minnesota in particular seems to have mastered this; last year they played only two non-conference games against top-70 opposition and lost 13 games, yet still managed to rank #30 in the RPI.)
With the game-by-game data available for the committee to analyze, it's certainly possible for them to recognize and correct for these effects, but when they are operating under massive time pressure, it's very easy to go back to the summary data that's provided, and the summary data is flat wrong. There is no reason not to give the committee a tool that properly handles all of these effects instead of forcing them to try to eyeball the adjustments needed.
The "Quality Win" Cliff
One point defenders of the RPI tend to make is that the committee generally doesn't use a team's RPI rank itself in discussing the merits of their case, but instead primarily uses the RPI as a tool to group opponents when looking at the schedule. This is arguably worse, as it creates an effect similar to the infamous "TUC cliff" in the Pairwise Rankings used for selecting and seeding the NCAA hockey tournament.
(For those unaware: until this year, one component of the Pairwise was your record against "teams under consideration", which meant either top 25 RPI or .500+ RPI (which version they used varied from year to year). If one of your conference-mates was hovering around the cut line, their movement could swing your TUC record by several games, potentially moving you up or down several places in the rankings without you playing another game. One particularly bizarre example: in 2007, going into the CCHA third-place game a win or loss would guarantee a tournament berth for MSU, but some scenarios involving a tie would have knocked us out. A loss would have allowed Lake Superior State to climb over the TUC cliff and give us 3.5 extra TUC wins, enough to swing a few comparisons that we could not have won with a tie.)
Because of how this is used in organizing the team sheets, you no longer have to worry only about gaming the RPI yourself; you need your opponents to do so too. As an example, in 2011 the entire Big Ten bubble got in with surprisingly high seeds and most of the ACC bubble teams missed out, in large part because all of the Big Ten bubble teams were just inside the top 50 of the RPI and therefore got to count games against each other as quality wins rather than questionable losses. Had Illinois dropped below #50, for instance, MSU would have gotten credit for one fewer top-50 win and one more sub-50 loss. Same if Penn State had dropped out; the two combined might have been enough to keep MSU out of the tournament entirely.
Worse, the "quality win" criteria does not adjust for home or away; beating #51 on the road is significantly harder than beating #50 at home, but the latter is the one that will count toward your top-50 record. In effect, using the RPI to group teams combines the worst of both worlds: grouping teams this way is inherently sensitive to noise in the data, and the RPI is a very noisy rating system. As with SOS, the committee has the game-by-game data to recognize these effects and compensate, but when it's crunch time, the summary data sitting at the top of the sheet lists the record against the top 50 as though all of those games are equal.
How to Make Team Sheets That Don't Suck
For handling teams' resumes for the committee, what we need is a ranking method that:
1) has a sound mathematical basis instead of arbitrary weights
2) does not display pathological behavior in corner cases
3) gives due credit for road wins without overreacting to home losses
4) recognizes that, for a tournament-quality team, the difference between a truly atrocious opponent and a merely bad one is not nearly as important as the top end and mid-level games when it comes to strength of schedule
5) recognizes that it matters who you play at home or away, not just how many of your games are home or away
We also need a different way to visualize strength of schedule instead of just grouping games against various tiers together. The sparklines on Crashing the Dance are a good example, although sorting on game value as well as chronologically might be useful. The hockey committee dumped the TUC cliff this year, now instead awarding bonuses for quality wins on a sliding scale; perhaps a similar measure could be added here.
It may not even be necessary (or wise) that all the data comes from one rating method; with schedules as widely varied as they are, no single method is going to be perfect. Margin-based numbers may have a place too (particularly for evaluating strength of schedule, as that gives a better idea of the true quality of your opponents to measure your achievement against them), and there are some factors that can't easily be captured in a simple rating system: key players out who are or are not returning for the tournament, controversial calls on game-deciding plays, late bubble battles that might be given extra weight if it comes down to those teams for the last spot. But the RPI should not be a part of the Selection Committee's arsenal any more. It fails too many sanity checks. I would say it has outlived its usefulness, but that presumes that it ever had any usefulness in the first place.