Wednesday, October 01, 2008

Finding Correlations

I make a statistically based, non-subjective “Power Poll” each year. This year I am trying to refine my methods to reflect which statistics relate to actual success.

I’m sorting through all the NCAA statistics for the past several years, looking for correlations. To do so I’m using linear regressions. Here is an example.

What we have here is the linear regression of all 119 NCAA teams last year, with the y-axis being each team's win-loss percentage for the year and the x-axis the average points scored per game. In the upper right are Hawaii and Kansas at 12-1 each, with 43.4 ppg and 42.9 ppg respectively, while in the lower left we find 1-11 FIU at 15 ppg.

I have also labeled Florida, LSU and Ohio State on the graph. The line running through the center of the graph is the regression line (the line of best fit), which shows the win-loss percentage one would expect for a given points-per-game average once all the teams are fitted together. Florida, located beneath the line, did worse on a win-loss basis than their points-per-game of 42.5 would suggest.
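For readers who want to see the mechanics, here is a minimal sketch of an ordinary least-squares fit in plain Python. It uses only the three teams whose numbers are quoted above, so the fitted line is purely illustrative and is not the line from the chart, which used all 119 teams. The helper names (`win_pct`, `fit_line`, `expected_win_pct`) are made up for this example.

```python
# Illustrative only: the three teams whose records and points per game
# are quoted in the post. The real chart was fitted on all 119 teams.
teams = {
    "Hawaii": (43.4, 12, 1),   # (points per game, wins, losses)
    "Kansas": (42.9, 12, 1),
    "FIU": (15.0, 1, 11),
}


def win_pct(wins, losses):
    """Win-loss percentage as a fraction between 0 and 1."""
    return wins / (wins + losses)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


ppg = [t[0] for t in teams.values()]
pct = [win_pct(t[1], t[2]) for t in teams.values()]
slope, intercept = fit_line(ppg, pct)


def expected_win_pct(points_per_game):
    """Win percentage the fitted line predicts for a scoring average."""
    return slope * points_per_game + intercept


# What this toy three-team line would predict for a 42.5 ppg team.
florida_expected = expected_win_pct(42.5)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(f"expected win pct at 42.5 ppg: {florida_expected:.3f}")
```

A team sits beneath the line when its actual win percentage is lower than `expected_win_pct` of its scoring average, which is the situation described for Florida.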

The correlation between points per game and record is 0.704616406330361. A correlation of 1 (or -1) shows perfect correlation, while 0 shows no linear correlation. The 0.7 we found shows a high degree of correlation between points per game and winning.
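As a rough sketch of how such a number is computed, the Pearson correlation coefficient can be written in a few lines of plain Python. The function name `pearson_r` is made up for this example, and the only data used is the three teams quoted earlier in the post. Three points are far too few to mean anything, so the value printed here will not match the 0.70 figure from the full 119-team data set.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


# Illustrative only: Hawaii, Kansas and FIU as quoted in the post.
points_per_game = [43.4, 42.9, 15.0]
win_pct = [12 / 13, 12 / 13, 1 / 12]

r_three_teams = pearson_r(points_per_game, win_pct)
print(f"r for the three quoted teams: {r_three_teams:.4f}")

# Sanity checks on the two extremes described in the text.
r_perfect_positive = pearson_r([1, 2, 3], [2, 4, 6])
r_perfect_negative = pearson_r([1, 2, 3], [6, 4, 2])
```

Note that a statistic where lower is better, such as points allowed, would normally produce a negative coefficient against winning percentage, so the strength of the relationship is read from the absolute value.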

Looking at the correlation between record and defensive points allowed for 2007, the result was 0.731562154315535, a slightly stronger correlation between a good defense and a winning record.
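For a simple linear regression with a single predictor and an intercept, the r-squared of the model is just the square of the correlation coefficient. A small sketch using the two correlations reported above (the variable names are made up for this example):

```python
# Correlations reported in the post for the 2007 season.
r_points_scored = 0.704616406330361
r_points_allowed = 0.731562154315535

# With one predictor and an intercept, R-squared equals r squared:
# the share of the variance in win percentage that the line accounts for.
r2_points_scored = r_points_scored ** 2
r2_points_allowed = r_points_allowed ** 2

print(f"r-squared, points scored:  {r2_points_scored:.3f}")
print(f"r-squared, points allowed: {r2_points_allowed:.3f}")
```

Read this way, each statistic on its own accounts for roughly half of the variation in win percentage, with the other half left to everything else.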

I plan on doing this for a variety of statistics for the past several years, and hopefully incorporating the results into my Power Poll.

Credit for the use of a powerful statistical software tool (free, at that) goes to -

Wessa, P. (2008), Free Statistics Software, Office for Research Development and Education, version 1.1.23-r2, URL


UgaMatt said...

Very cool Mergz. What's the r-squared of your model?

Anonymous said...

The problem with designing a correlation table in college football is the VAST variation in the ability level of the potential opposition pool among different schools and leagues.

A particular stat may show a high level of correlation to winning % when you are comparing teams who play 'similar' schedules. But the HUGE disparity in the level of play, from league-to-league or region-to-region, would seem likely to skew almost any statistically-based analysis. Ex: how can you account for situations where some teams use 2nd, 3rd, or 4th team players for substantial periods in some games while some coaches leave their 1st-teamers in longer (and potentially pad their stats v. lesser opposition)?

I know that your focus is on college football, but I believe your analytical method would be more substantiatable when used to study the NFL, where there is not nearly as great a disparity in ability/play level.

Except maybe the Rams...


Anonymous said...

An OBVIOUS example of the case of misleading stats is UF's Columbus campus--AN ohio state university.

For the last several years they have CONSISTENTLY piled up huge numbers v. their 'normal' schedule; but when faced with stronger opposition from outside their regional comfort zone, they have failed EQUALLY consistently.

There is OBVIOUSLY a direct correlation between WHO Aosu is playing and their winning %:

Since 2005:
Record v. Big 10,Notre Dame, and unranked out-of-conference teams: 36-2 (94.7 %)

Record v. ranked teams OUTSIDE the Big(wecanonlycountto)10: 1-4 (20%)


PhilipVU94 said...

I just skimmed this post, but seeing that you're in that relatively small community of college football bloggers who care about "doing stats" well...

I need some help thinking about how to approach yardage stats. In particular, I'm wondering if anyone's researched how sustainable a season like the one Vanderbilt's having so far is, with poor yardage stats and lots of hidden yards.

Any ideas are appreciated.

Anonymous said...

Brilliant stuff, but I like to weigh in on the side of chaos theory and/or the butterfly effect when it comes to putting stats and predictions together in football. Think about it: weather is about as predictable as NCAA football. So many small factors can have such an impact on events. And those small factors, sometimes indeterminable, can seem to compound themselves over time. Too many to name. I think the ability to identify those small factors, sometimes invisible to the naked eye, is the key, and it is what Vegas goes after. The only problem is that they are most likely always changing.