How Good Are Preseason Rankings? (A Deep Dive)

Around this time of year, the various preseason college football publications start appearing on the shelves. For some time now, I have wondered whether there is a good way to evaluate how good or bad the various preseason rankings really are. This year, I decided to try to figure it out. Now, it would be straightforward to simply compare the various preseason rankings to the post-season CFB Playoff ranking, AP ranking, or coaches poll. But that only tells the story for about a third of Division 1, and I was looking for something a bit more comprehensive.

From time to time, I have discussed and posted data based on an algorithm that I have developed to generate my own power rankings. Since my method assigns a ranking to all 128 Div. 1 teams, is typically a reasonable predictor of Vegas spreads (more on that later), and since I also tabulate preseason predictions from various sources to support my annual preseason analysis (coming soon to a message board near you), it occurred to me that I had all the data that I needed to make this comparison. So, I went back over the data from the last 10 years or so, compared various full 128-team preseason rankings (from sources such as Phil Steele, Athlon's, Lindy's, ESPN (FPI), and SP+), and tabulated the average absolute difference between their rankings and my algorithm's post-season rankings for all Division 1 teams. The results are shown below:
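For clarity, the metric here is just the mean absolute difference in rank, averaged over every team that appears in both lists. A minimal sketch in Python (the team names and rankings below are hypothetical placeholders; the real inputs would be the full 128-team lists):

```python
def mean_abs_rank_error(preseason, postseason):
    """Average |preseason rank - postseason rank| over teams in both dicts."""
    common = preseason.keys() & postseason.keys()
    return sum(abs(preseason[t] - postseason[t]) for t in common) / len(common)

# Hypothetical example: three teams, preseason vs. post-season ranks.
preseason = {"Team A": 1, "Team B": 40, "Team C": 100}
postseason = {"Team A": 10, "Team B": 25, "Team C": 90}
print(mean_abs_rank_error(preseason, postseason))  # (9 + 15 + 10) / 3
```

A single number like this hides a lot (as the histograms later show), but it makes the publications directly comparable year to year.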


Now, as you can see, I do not have a perfect data set to work with. I only have rankings from multiple sources for the last 5 years, and you also must trust that my algorithm is a reasonable approximation of the relative strength of teams. In any event, there are several interesting observations from this table:

First, for the limited data that I have, Phil Steele's publication gives the smallest error between the preseason and my simulated post-season rankings most consistently. I have his data as the best in 4 of the 5 years where I have rankings from multiple sources. He always advertises that his rankings are the most accurate, and I cannot dispute that with this analysis. Second, that said, there is not a huge difference between the publications, so there is no strong reason to rush out and buy any one of them over another based on the rankings alone (I will comment a little more on this later). Third, none of the publications gets that close to the final rankings. The average deviations are all in the range of 15-20 slots, which on a 128-team scale is an average error of roughly 15%. That does not seem great to me.

I wanted to dive a little deeper into the third point. As the table indicates, I have the most historical data on Phil Steele's rankings, so I decided to go back ten years and compare all of his preseason rankings to all of my post-season rankings. There are several ways to look at this data, but I find the most informative to be a histogram of the deviations, a scatter plot, and a plot of the average post-season ranking as a function of the initial Phil Steele ranking (basically the scatter plot data, where the y-axis instead shows the average and standard deviation / error bars for each rank rather than each individual data point). Once again, there are several conclusions we can draw from this data.




First, the histogram gives us an idea of the distribution of the deviations. It is fairly bell shaped, with 24% of the picks falling within +/- 5 slots of the final ranking and 41% falling within +/- 10 slots. But the tails of the distribution are also fairly long: 23% of all of Steele's picks are not within 30 slots of the final ranking.

The scatter plot tells a very similar story, and in this case we can see that the correlation (R-squared = 0.66) is OK, but not that great. The scatter plot also tends to highlight the real misses, like when Steele ranks a team in his top 20 (like Illinois in 2009) but that team winds up 3-9 with a ranking in the 80s by my algorithm, or when teams like Utah St. and San Jose St. in 2012 are ranked around 100 by Steele but wind up in the top 25 by my algorithm and the national polls.

The plot of average ranking vs. initial ranking shows the Phil Steele data in perhaps the best light. It shows that for any given ranking, on average, Steele is pretty close, but the spread around that average is still quite large. Notably, the deviation is much smaller for teams in Phil Steele's ~Top 5. Historically, those teams do usually wind up having great seasons, but there are exceptions (like the 2007 Louisville team, which started ranked #4 but ended 6-6). That said, the same trend is also found at the bottom end of the chart, so it might have more to do with the fact that teams ranked high (or low) only really have one direction to go: down (or up). That fact is best illustrated by a plot of the standard deviation of the post-season ranking as a function of the preseason ranking (basically, the error bars plotted on their own), which is shown here with a clear parabolic trend.
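The bucket percentages quoted above (within +/- 5, within +/- 10, more than 30 off) are simple to tabulate. A sketch with made-up ranks, not my actual data pipeline:

```python
def deviation_buckets(preseason, postseason):
    """Fractions of teams within +/-5 slots, within +/-10 slots,
    and more than 30 slots away from their final ranking."""
    devs = [abs(preseason[t] - postseason[t])
            for t in preseason.keys() & postseason.keys()]
    n = len(devs)
    return (sum(d <= 5 for d in devs) / n,
            sum(d <= 10 for d in devs) / n,
            sum(d > 30 for d in devs) / n)

# Hypothetical four-team example: deviations of 2, 12, 45, and 1 slots.
pre = {"A": 1, "B": 10, "C": 50, "D": 60}
post = {"A": 3, "B": 22, "C": 95, "D": 61}
print(deviation_buckets(pre, post))  # (0.5, 0.5, 0.25)
```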


What is perhaps the most interesting aspect of all of this to me harkens back to my second observation from the first table above: the deviations from the different publications are all basically the same for a given year. To visualize this, I took the predictions from the two publications for which I have the most data tabulated (Phil Steele and Athlon's) and plotted them against each other in a scatter plot, which is shown here:


Not surprisingly, the correlation between the two sets of predictions is rather high (R-squared = 0.91) and much higher than the correlation to reality, so to speak. So, as my first conclusion, I think we can say that preseason predictions are OK, but not great (they are certainly not destiny), and they agree with each other far more than they agree with the actual results on the field.

This analysis led me to think about another interesting topic which is related to the first. Now that we have looked at the robustness of preseason rankings, what about in-season predictions? More specifically, what about metrics such as ESPN's vaunted FPI? In the 2016 season, I decided to put the FPI to the test alongside my own algorithm to see how they performed. As it turns out, this is a tricky question, because defining "performance" in this context is not as easy as you might think. A big part of the reason why is that there is generally a very poor correlation between any predicted margin of victory and the actual result. The best predictor, not surprisingly, is the Vegas spread, and a scatter plot of the actual game margins vs. the opening Vegas spreads for the entire 2016 season is shown here. As you can see, the R-squared is a pathetic 0.214. But this is better than the FPI, which only mustered an R-squared of 0.196, and, sadly, my algorithm, which only mustered an R-squared of 0.167. I won't bother to show you those plots, as they both look like shotgun blasts.
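For reference, the R-squared values quoted throughout this piece are just the squared Pearson correlation between two series (here, predicted margins vs. actual margins). A self-contained sketch, with no real game data included:

```python
def r_squared(x, y):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

# A perfectly linear relationship gives R-squared = 1.0 ...
print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))  # 1.0
# ... while real margin-vs-spread data lands closer to the 0.2 quoted above.
```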


Last year, as I pored over the FPI data, I noticed something odd: it was quite rare for the FPI to predict a Vegas upset. I counted only 37 predicted upsets out of over 750 games (5%), which is interesting because historically about 25% of all college games wind up as upsets relative to the Vegas line; 2016 alone saw over 200 upsets. My algorithm picked over 80 upsets for the season. Granted, it was right on those upset picks only 37% of the time (which is below my algorithm's historical average of 40%), while the FPI got 46% of its far fewer upset picks correct. When I plotted the full-year projected margins from the FPI against the Vegas spread (see below), the correlation is quite good (R-squared = 0.86). By comparison, my algorithm did not do quite as well, but it is still fairly highly correlated (R-squared = 0.72).
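Counting "predicted upsets" amounts to flagging games where a model favors a different team than Vegas does, then checking which side actually won. A sketch under an assumed sign convention (all margins are home team minus away team, so opposite signs mean the model and Vegas favor different teams); real data feeds use varying conventions, so this is illustrative only:

```python
def upset_pick_stats(model_margins, vegas_spreads, actual_margins):
    """Count games where the model picks against the Vegas favorite,
    and how many of those picks were correct. All margins are
    (home score - away score); Vegas spreads use the same sign convention."""
    picks = hits = 0
    for m, v, a in zip(model_margins, vegas_spreads, actual_margins):
        if m * v < 0:          # model and Vegas favor different teams
            picks += 1
            if m * a > 0:      # the model's chosen team actually won
                hits += 1
    return picks, hits

# Hypothetical four-game slate: two upset picks, one of which lands.
print(upset_pick_stats([3, -2, 7, -1], [5, 4, -6, -3], [10, -7, -2, 4]))
```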



From all of this, I come to my second main conclusion: in-season algorithms don't do a good job of predicting the outcomes of actual games, but they can do a good job of predicting the Vegas spread. In this regard, the FPI (and to a lesser extent, my algorithm) does have value in doing things such as projecting point spreads 2-3 weeks in advance. That type of analysis appears to be fairly robust. I also must concede that the FPI does a better job of predicting these spreads than my algorithm does (which I would expect, considering they most likely have more than one dude working on it in his spare time). But you could argue that the FPI is so good at predicting the spread that it doesn't add much to the discussion; it is, on some level, too conservative. At least my algorithm takes some chances and will make more than 1-2 upset picks a week. But at the end of the day, the gold standard is the Vegas spread, which honestly makes sense. After all, if there were a computer program out there that could beat Vegas, somebody would be very rich, and they would certainly not tell the rest of us about it.

So, with this knowledge, perhaps the most useful figure that I can leave you with is the following: the 5-point boxcar-averaged plot of the probability of the favored team winning as a function of the opening Vegas spread for all college games back to 2009. As you can see, once the data is smoothed, it forms a nice quadratic curve from a 50-50 toss-up to a virtual sure thing once the spread reaches around 30. (In reality, there have been a total of 2 upsets in games where the spread exceeded 30 since 2009, so the upset frequency there is under 1%.) The fit is not perfect, but the equation on the chart is simple and easy to remember. I would imagine the curve should asymptotically approach 100% but never actually reach it, because in college football, I believe the underdog always has a chance.
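A 5-point boxcar average is just a centered moving average with a 5-slot window. A minimal sketch of the smoothing step (the actual win-probability values are not reproduced here):

```python
def boxcar_smooth(values, width=5):
    """Centered moving average ('boxcar') of the given window width,
    shrinking the window at the edges of the sequence."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

# A lone spike gets spread across the 5-point window:
print(boxcar_smooth([0, 0, 10, 0, 0]))
```

Smoothing like this is what turns the noisy raw probabilities (one value per spread) into the clean quadratic-looking curve on the chart.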


This brings us to my final conclusion for this piece: college football is unpredictable, and that is why we love it.

This is a FanPost, written by a member of the TOC community. It does not represent the official positions of The Only Colors, Inc.--largely because we have no official positions.