Tuesday, November 15, 2011

k-NN Prediction

"Andywocky" commented not too long ago on my Prediction by Similarity posting asking whether I'd looked at k-nearest neighbors (k-NN) algorithms.  At the time I made the original posting I hadn't, but shortly thereafter I had a "D'oh" moment and realized that what I was doing was re-creating k-NN.  So I re-created some of the work I'd done using RapidMiner's k-NN operator.

The basic idea behind k-NN is that we predict the outcome of a new game by finding some number of similar past games, and then use those (say by averaging) to create a prediction for the new game.  The "k" in "k-NN" refers to the "some number" of similar past games -- k might be 5 or 50, indicating that we were using the five most similar, or 50 most similar past games.  "Nearest Neighbor" is just another way of saying similar.  If we think of the games living in a multi-dimensional space -- say a dimension for each statistical value for the game (e.g., rebounds per minute, free throw percentage, etc.) -- then the most similar games are the ones that are the nearest neighbors in this multidimensional space.

There are some subtleties in how this works.  For example, team free throw percentage might vary from (say) 50% to 100%, while rebounds per minute might vary from 0.00 to 0.056.  If we don't normalize those dimensions, the one with the larger numeric range will dominate the distance calculation and largely determine which games count as nearest neighbors.  But a reasonable starting approach is to characterize each game with as many statistical properties as we have, normalize those to similar scales, and then predict MOV (margin of victory) by averaging the MOVs of the k nearest neighbors.
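As a rough illustration of the normalize-then-average idea, here is a minimal Python sketch.  It is not the RapidMiner process behind the results in this post; the min-max scaling, the Euclidean distance, the function names, and the four sample games are all assumptions made for the example.

```python
import math

def min_max_normalize(rows):
    """Rescale each column of `rows` to [0, 1]; also return the scaling function."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]

    def scale(row):
        # A constant column carries no information, so map it to 0.0.
        return [(v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, lows, highs)]

    return [scale(r) for r in rows], scale

def knn_predict(train_x, train_mov, query, k):
    """Average the MOVs of the k training games closest to `query`."""
    ranked = sorted(zip(train_x, train_mov),
                    key=lambda pair: math.dist(pair[0], query))
    nearest = ranked[:k]
    return sum(mov for _, mov in nearest) / len(nearest)

# Hypothetical past games: (free throw %, rebounds per minute) and their MOVs.
games = [(0.60, 0.030), (0.62, 0.032), (0.90, 0.050), (0.88, 0.052)]
movs = [-8.0, -6.0, 10.0, 12.0]

scaled_games, scale = min_max_normalize(games)
prediction = knn_predict(scaled_games, movs, scale((0.89, 0.051)), k=2)
print(prediction)
```

The new game is scaled with the same minimums and maximums as the past games, so that distances are measured in one consistent space.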

Here's the result of doing that with k=10.  For comparison, I include the performance of the best linear regression predictor based upon the same statistical properties.

Predictor                       % Correct    MOV Error
Best Statistical Predictor      72.3%        11.04
k-NN, k=10                      59.7%        11.65

This isn't tremendous performance, but we have a few tweaks we can perform.  First, we can try varying k to see if some different number of neighbors provides better performance.  Some searching around produces the best performance in this case when k=41:

Predictor                       % Correct    MOV Error
Best Statistical Predictor      72.3%        11.04
k-NN, k=10                      59.7%        11.65
k-NN, k=41                      71.2%        11.44

Interestingly, this shows a lot of improvement in games correct with only modest improvement in MOV error.

Another tweak we can look at is weighting our results.  Instead of doing a flat average of the 41 nearest neighbors, we can weight each neighbor's contribution to the answer by how close it is to the new game.  We can also try eliminating some of our dimensions to see if accuracy improves.  This provides some further improvement:

Predictor                       % Correct    MOV Error
Best Statistical Predictor      72.3%        11.04
k-NN, k=10                      59.7%        11.65
k-NN, k=41                      71.2%        11.44
k-NN, k=44, weighted subset     72.4%        11.17

With this tweak k-NN is competitive with the best linear regression.  (Although they both trail the best predictors.)
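The weighting tweak can be sketched as follows.  This is only an illustration: inverse-distance weights are one common choice and are an assumption here, since the post does not say which weighting scheme was used.  The function name and the sample games are likewise hypothetical.

```python
import math

def weighted_knn_predict(train_x, train_mov, query, k, eps=1e-9):
    """Distance-weighted average MOV of the k nearest training games."""
    ranked = sorted(
        ((math.dist(x, query), mov) for x, mov in zip(train_x, train_mov)),
        key=lambda pair: pair[0],
    )[:k]
    # Closer games get larger weights; eps avoids division by zero
    # when a past game coincides exactly with the query.
    weights = [1.0 / (d + eps) for d, _ in ranked]
    return sum(w * mov for w, (_, mov) in zip(weights, ranked)) / sum(weights)

# Hypothetical, already-normalized games and their MOVs.
games = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
movs = [10.0, 0.0, -20.0]

flat = sum(movs[:2]) / 2  # flat average of the two nearest games
weighted = weighted_knn_predict(games, movs, query=(0.25, 0.0), k=2)
print(flat, round(weighted, 3))
```

In this toy case the query sits three times closer to the first game than to the second, so the weighted prediction is pulled toward the first game's MOV instead of landing halfway between the two.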

I'm inclined to draw a couple of conclusions from these experiments.  First, 40+ neighbors is a large number, suggesting that while games between statistically similar teams may be broadly comparable, there's not a strong relationship.  Conversely, the improvement gained by using weighting suggests that closer is still better.  It would seem that good performance with this approach requires a moderate amount of generalization to help "wash out" the random component in game outcomes.

Football Predictions (11/15/11)

I hope to have some time in the next day or so for some postings, so here are the predictions for this week in NCAA football.  I've been tracking this performance for a contest, and for the past three weeks I'm 56% against the line (and 71% winners).  That might be anomalously good performance, but I've been positive against the line every week, so take that for what it is worth.

As always, heed the Net Prophet Disclaimer.

NCAA Football Predictions (11/15/11)
Home Team                 Away Team                 MOV
Air Force                 Nevada-Las Vegas          16.1
Arizona State             Arizona                   18
Arkansas                  Mississippi State         13.1
Baylor                    Oklahoma                  -14.2
Bowling Green State       Ohio                      -6.7
Brigham Young             New Mexico State          15.7
Buffalo                   Akron                     12.3
Central Michigan          Toledo                    -14.9
Connecticut               Louisville                0.4
Duke                      Georgia Tech              -6.9
East Carolina             Central Florida           -8
Florida State             Virginia                  16.8
Georgia                   Kentucky                  24.5
Hawaii                    Fresno State              9
Houston                   Southern Methodist        18.3
Idaho                     Utah State                -8.3
Illinois                  Wisconsin                 -9.9
Iowa State                Oklahoma State            -17
Kent                      Eastern Michigan          6
Louisiana-Monroe          Florida International     -3.3
Memphis                   Marshall                  -15.4
Miami (Ohio)              Western Michigan          -2.4
Michigan                  Nebraska                  10.2
Michigan State            Indiana                   20.9
Middle Tennessee State    Arkansas State            -11.5
Mississippi               Louisiana State           -26.3
Missouri                  Texas Tech                15.2
Nevada                    Louisiana Tech            1.8
North Carolina State      Clemson                   -9.1
North Texas               Western Kentucky          2
Northern Illinois         Ball State                11.5
Northwestern              Minnesota                 14.3
Notre Dame                Boston College            22.3
Ohio State                Penn State                1
Oregon                    Southern California       14.9
Oregon State              Washington                -0.4
Purdue                    Iowa                      -1.6
Rice                      Tulane                    15
Rutgers                   Cincinnati                0.8
San Diego State           Boise State               -13.6
San Jose State            Navy                      -1.8
South Florida             Miami (Florida)           3.9
Stanford                  California                20.3
Temple                    Army                      15.8
Tennessee                 Vanderbilt                2.6
Texas                     Kansas State              3
Texas A&M                 Kansas                    23.8
Texas Christian           Colorado State            24.4
Texas-El Paso             Tulsa                     -10.4
Troy                      Florida Atlantic          11.2
UAB                       Southern Mississippi      -26.2
UCLA                      Colorado                  12.2
Virginia Tech             North Carolina            6
Wake Forest               Maryland                  11.1
Washington State          Utah                      -2.4
Wyoming                   New Mexico                20.4

Tuesday, November 8, 2011

One Bad (Good) Game

As mentioned in my previous posting, I recently looked at the effect of dropping a football team's best game (highest Margin of Victory) and their worst game (lowest MOV).  The intuitive notion is that everybody has bad days, where everything goes wrong, and good days, where everything goes right, and maybe those days don't tell us anything useful about the real strength of a team.  If that's so, then dropping those games might give us ratings that are more accurate.

To test this hypothesis I implemented this "drop the worst score" grading system for a couple of the rating systems I use for football and measured performance in the usual way.  Here are the results for one of the rating systems:

Predictor                       % Correct    MOV Error
BGD Baseline                    73.7%        16.52
BGD w/o blowouts or lowouts     72.6%        16.77
BGD w/o lowouts                 72.9%        16.69
BGD w/o blowouts                73.6%        16.62

Here I'm using the whimsical "lowout" to indicate the worst loss for a team.

As this shows, eliminating the blowouts/lowouts hurts predictive performance.  For what it's worth, the losses seem to be more important than the wins.  (I saw the same effect in basketball when I looked at this last year.)
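For concreteness, the dropping step might look like the Python sketch below.  It only shows how a team's best and worst games could be removed before computing a simple average MOV; the BGD rating system itself is not described in this post, and the function names and the sample season are hypothetical.

```python
def trimmed_movs(movs, drop_best=True, drop_worst=True):
    """Return a team's MOVs (sorted) with its best and/or worst game removed."""
    kept = sorted(movs)
    if drop_worst and len(kept) > 1:
        kept = kept[1:]    # remove the lowest MOV (the "lowout")
    if drop_best and len(kept) > 1:
        kept = kept[:-1]   # remove the highest MOV (the blowout)
    return kept

def mean(values):
    return sum(values) / len(values)

# Hypothetical season of margins of victory for one team.
season = [3, -7, 14, 45, -31, 10]

baseline = mean(season)
no_extremes = mean(trimmed_movs(season))
no_blowouts = mean(trimmed_movs(season, drop_worst=False))
no_lowouts = mean(trimmed_movs(season, drop_best=False))
print(baseline, no_extremes, no_blowouts, no_lowouts)
```

Note that the helper returns the games in sorted order, so it suits order-independent summaries like an average but not ratings that depend on when each game was played.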

Friday, November 4, 2011

The Impact of MOV Cutoffs in Football Ratings

I was prompted to start my football predictions by a discussion on an email list of the value of MOV cutoffs in rating systems.  Roger Dendy believed that capping the MOV in blowout victories improved his rating system.  My testing of MOV cutoffs in basketball has shown just the opposite -- that no matter how big the blowout, there's always information in the margin of victory.  Capping MOV at any level (in both blowouts and nailbiters) always reduces the prediction value of a rating.

Of course, just because that's true in basketball doesn't mean it's true in football.  I was pretty sure it was true, but I believe in "trust but verify."  So I put together the football predictor and tested a couple of different rating systems both with and without MOV caps.

I have many rating systems that use MOV, so I picked one and measured its performance with a 100-fold cross-validation across my archive of college football scores from 2005 to date.  It had an RMS error of 16.78 and predicted 71% of the games correctly.
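The two metrics quoted here can be computed along the following lines.  This sketch assumes that "RMS" is the root-mean-square difference between predicted and actual MOV, and that a game counts as "correct" when the predicted and actual MOV have the same sign; the sample numbers are made up.

```python
import math

def rms_error(predicted, actual):
    """Root-mean-square difference between predicted and actual MOV."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def pct_correct(predicted, actual):
    """Percentage of games where the predicted winner matches the actual winner."""
    hits = sum(1 for p, a in zip(predicted, actual) if p * a > 0)
    return 100.0 * hits / len(actual)

# Hypothetical predicted and actual margins (positive = home team wins).
predicted = [7.0, -3.0, 10.0, 1.0]
actual = [3.0, -10.0, -4.0, 14.0]

rms = rms_error(predicted, actual)
correct = pct_correct(predicted, actual)
print(round(rms, 2), correct)
```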

Then I experimented with adding a cutoff to the MOV.  I set the cutoff to 32 points, so that any game where the MOV exceeded 32 was treated as a 32-point win.  I picked 32 arbitrarily as a good figure for a blowout win.  The performance degraded to RMS=17.19 and 69%.  I then bumped up the cutoff to 48 points, and the performance was RMS=17.01 and 70%.
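A cutoff like this is a one-line clamp.  The sketch below assumes the cap is applied symmetrically, so that a 41-point loss is treated as a 32-point loss in the same way a 41-point win is treated as a 32-point win; the function name and sample margins are hypothetical.

```python
def cap_mov(mov, cutoff):
    """Clamp a margin of victory to the range [-cutoff, +cutoff]."""
    return max(-cutoff, min(cutoff, mov))

# Hypothetical margins from one team's point of view.
movs = [52, 35, 7, -3, -41]

capped_32 = [cap_mov(m, 32) for m in movs]
capped_48 = [cap_mov(m, 48) for m in movs]
print(capped_32)
print(capped_48)
```

With the 32-point cutoff, three of the five sample games are altered; with the 48-point cutoff, only the 52-point win is.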

The other rating system showed a similar pattern of performance.

What this shows -- at least for the two rating systems I tested and these performance metrics -- is that even huge margins of victory have value in assessing future performance.  People argue intuitively that there's "no difference between winning by 48 and winning by 52" but that appears not to be true.

Recently I got to wondering if it might not make more sense to drop a blowout victory entirely.  This would be like "drop your lowest score" grading in high school.  The intuitive notion here is that sometimes teams just have a bad day -- a few unlucky bounces and worse goes to worse.  Or lucky bounces and better goes to better, from the other side of the coin.  More on that notion next time.

Wednesday, November 2, 2011

Football Predictions

Shown below are predictions for this week's upcoming football games.  I did a little tweaking and developed a new algorithm for this week's predictions so you'll see two predictions below.

The first is the new algorithm, the second is an ensemble of three algorithms. The new algorithm is similar to the "Homemade Sagarin Ratings" described here (although I do not use Excel Solver to calculate my ratings).  The Sagarin ratings do very well at the Prediction Tracker, so I wanted to implement something similar and see how it did in comparison to my other predictors.  Much to my surprise, it equals or surpasses my best football predictors.  In past tests on the basketball data, this type of predictor did not perform well, but in the course of implementing it for football I found several problems, so I intend to retest this on the basketball data and look at some possible improvements if warranted.

If anyone knows a better description of the Sagarin PREDICTOR algorithm, please let me know.

As always when viewing my predictions, heed the Disclaimer.

+-------------------+----------------+---------------+---------------+
|Hname              |Aname           |prediction(1)  |prediction(2)  |
+-------------------+----------------+---------------+---------------+
|wisconsin          |purdue          |23.4           |19.2           |
+-------------------+----------------+---------------+---------------+
|west virginia      |louisville      |11.6           |9.3            |
+-------------------+----------------+---------------+---------------+
|maryland           |virginia        |4.1            |1.5            |
+-------------------+----------------+---------------+---------------+
|rice               |texas-el paso   |-.3            |1.0            |
+-------------------+----------------+---------------+---------------+
|texas              |texas tech      |11.1           |11.1           |
+-------------------+----------------+---------------+---------------+
|wyoming            |texas christian |-16.7          |-15.5          |
+-------------------+----------------+---------------+---------------+
|tennessee          |middle          |22.3           |20.5           |
|                   |tennessee state |               |               |
+-------------------+----------------+---------------+---------------+
|oregon state       |stanford        |-18.7          |-20.2          |
+-------------------+----------------+---------------+---------------+
|east carolina      |southern        |-13.0          |-10.4          |
|                   |mississippi     |               |               |
+-------------------+----------------+---------------+---------------+
|southern           |tulane          |19.0           |18.1           |
|methodist          |                |               |               |
+-------------------+----------------+---------------+---------------+
|san jose state     |idaho           |10.7           |9.9            |
+-------------------+----------------+---------------+---------------+
|san diego state    |new mexico      |32.9           |29.0           |
+-------------------+----------------+---------------+---------------+
|rutgers            |south florida   |2.1            |1.4            |
+-------------------+----------------+---------------+---------------+
|washington         |oregon          |-13.7          |-12.8          |
+-------------------+----------------+---------------+---------------+
|oklahoma state     |kansas state    |15.3           |14.2           |
+-------------------+----------------+---------------+---------------+
|oklahoma           |texas a&m       |17.7           |17.9           |
+-------------------+----------------+---------------+---------------+
|ohio state         |indiana         |21.7           |21.2           |
+-------------------+----------------+---------------+---------------+
|wake forest        |notre dame      |-10.0          |-10.5          |
+-------------------+----------------+---------------+---------------+
|north carolina     |north carolina  |-4.9           |-6.4           |
|state              |                |               |               |
+-------------------+----------------+---------------+---------------+
|nebraska           |northwestern    |12.2           |13.1           |
+-------------------+----------------+---------------+---------------+
|navy               |troy            |7.4            |6.5            |
+-------------------+----------------+---------------+---------------+
|baylor             |missouri        |2.0            |4.4            |
+-------------------+----------------+---------------+---------------+
|michigan state     |minnesota       |22.9           |22.8           |
+-------------------+----------------+---------------+---------------+
|iowa               |michigan        |-11.3          |-14.1          |
+-------------------+----------------+---------------+---------------+
|miami (florida)    |duke            |8.7            |8.4            |
+-------------------+----------------+---------------+---------------+
|fresno state       |louisiana tech  |-3.2           |-4.9           |
+-------------------+----------------+---------------+---------------+
|louisiana-lafayette|louisiana-monroe|11.3           |13.6           |
+-------------------+----------------+---------------+---------------+
|kentucky           |mississippi     |-2.1           |-.8            |
+-------------------+----------------+---------------+---------------+
|iowa state         |kansas          |19.3           |19.7           |
+-------------------+----------------+---------------+---------------+
|alabama-birmingham |houston         |-31.4          |-30.6          |
+-------------------+----------------+---------------+---------------+
|hawaii             |utah state      |1.2            |.7             |
+-------------------+----------------+---------------+---------------+
|georgia            |new mexico state|24.8           |25.2           |
+-------------------+----------------+---------------+---------------+
|western kentucky   |florida         |-3.3           |-3.9           |
|                   |international   |               |               |
+-------------------+----------------+---------------+---------------+
|florida            |vanderbilt      |8.6            |9.9            |
+-------------------+----------------+---------------+---------------+
|eastern michigan   |ball state      |4.2            |4.7            |
+-------------------+----------------+---------------+---------------+
|connecticut        |syracuse        |-1.6           |-3.3           |
+-------------------+----------------+---------------+---------------+
|pittsburgh         |cincinnati      |-1.3           |-2.0           |
+-------------------+----------------+---------------+---------------+
|california         |washington state|3.1            |3.5            |
+-------------------+----------------+---------------+---------------+
|nevada-las vegas   |boise state     |-30.6          |-28.2          |
+-------------------+----------------+---------------+---------------+
|florida atlantic   |arkansas state  |-12.7          |-15.1          |
+-------------------+----------------+---------------+---------------+
|arkansas           |south carolina  |.6             |1.0            |
+-------------------+----------------+---------------+---------------+
|ucla               |arizona state   |-7.5           |-7.2           |
+-------------------+----------------+---------------+---------------+
|arizona            |utah            |1.3            |-1.7           |
+-------------------+----------------+---------------+---------------+
|alabama            |louisiana state |3.9            |5.9            |
+-------------------+----------------+---------------+---------------+
|air force          |army            |11.1           |12.8           |
+-------------------+----------------+---------------+---------------+
|colorado           |southern        |-13.6          |-16.1          |
|                   |california      |               |               |
+-------------------+----------------+---------------+---------------+
|kent               |central michigan|5.5            |8.4            |
+-------------------+----------------+---------------+---------------+
|central florida    |tulsa           |-2.5           |1.0            |
+-------------------+----------------+---------------+---------------+
|miami (ohio)       |akron           |13.8           |13.3           |
+-------------------+----------------+---------------+---------------+
|boston college     |florida state   |-13.5          |-13.3          |
+-------------------+----------------+---------------+---------------+

NCAA Basketball Schedule Data

I have provided on this page a link to a file containing the currently published schedule of games for the upcoming basketball season.  I scraped this today from Yahoo Sports so it may be missing some games that have not yet been scheduled, tournament games, etc.  The format is self-explanatory and designed for easy ingest by Lisp, but should be easily translated to CSV or other format.  All fields are enclosed with quotes for easy parsing.

At the same page I've also provided a listing of conferences and team names.  The team names correspond to the names used in the schedule and on Yahoo Sports.  This is the same conference file I used last year -- I don't believe there have been any conference changes, but if so let me know and I'll update the file.

Friday, October 28, 2011

Football Predictions

Here are college football predictions for this week.  I discovered a couple of different bugs in my input data since last week's predictions; these should be somewhat better.  Apologies as always for the old-school formatting, and heed my Disclaimer as well.

+--------------------+--------------------+--------------------+
|Hname               |Aname               |prediction(mov)     |
+--------------------+--------------------+--------------------+
|ohio state          |wisconsin           |-11.7               |
+--------------------+--------------------+--------------------+
|western michigan    |ball state          |15.5                |
+--------------------+--------------------+--------------------+
|washington          |arizona             |7.4                 |
+--------------------+--------------------+--------------------+
|duke                |virginia tech       |-4.3                |
+--------------------+--------------------+--------------------+
|utah                |oregon state        |1.5                 |
+--------------------+--------------------+--------------------+
|ucla                |california          |1.7                 |
+--------------------+--------------------+--------------------+
|central florida     |memphis             |26.0                |
+--------------------+--------------------+--------------------+
|tulsa               |southern methodist  |5.4                 |
+--------------------+--------------------+--------------------+
|texas tech          |iowa state          |18.5                |
+--------------------+--------------------+--------------------+
|texas a&m           |missouri            |16.1                |
+--------------------+--------------------+--------------------+
|texas               |kansas              |22.7                |
+--------------------+--------------------+--------------------+
|southern california |stanford            |-10.0               |
+--------------------+--------------------+--------------------+
|texas-el paso       |southern mississippi|-8.1                |
+--------------------+--------------------+--------------------+
|tennessee           |south carolina      |-.1                 |
+--------------------+--------------------+--------------------+
|san diego state     |wyoming             |20.3                |
+--------------------+--------------------+--------------------+
|rutgers             |west virginia       |2.5                 |
+--------------------+--------------------+--------------------+
|penn state          |illinois            |5.0                 |
+--------------------+--------------------+--------------------+
|oregon              |washington state    |24.9                |
+--------------------+--------------------+--------------------+
|oklahoma state      |baylor              |7.3                 |
+--------------------+--------------------+--------------------+
|notre dame          |navy                |17.6                |
+--------------------+--------------------+--------------------+
|indiana             |northwestern        |-5.7                |
+--------------------+--------------------+--------------------+
|north carolina      |wake forest         |5.3                 |
+--------------------+--------------------+--------------------+
|new mexico state    |nevada              |-6.2                |
+--------------------+--------------------+--------------------+
|nebraska            |michigan state      |-6.6                |
+--------------------+--------------------+--------------------+
|kentucky            |mississippi state   |-11.1               |
+--------------------+--------------------+--------------------+
|michigan            |purdue              |22.1                |
+--------------------+--------------------+--------------------+
|miami (ohio)        |buffalo             |1.0                 |
+--------------------+--------------------+--------------------+
|maryland            |boston college      |6.8                 |
+--------------------+--------------------+--------------------+
|marshall            |alabama-birmingham  |14.2                |
+--------------------+--------------------+--------------------+
|louisville          |syracuse            |-5.3                |
+--------------------+--------------------+--------------------+
|louisiana tech      |san jose state      |11.6                |
+--------------------+--------------------+--------------------+
|louisiana-monroe    |western kentucky    |-6.2                |
+--------------------+--------------------+--------------------+
|middle tennessee    |louisiana-lafayette |4.7                 |
|state               |                    |                    |
+--------------------+--------------------+--------------------+
|kansas state        |oklahoma            |-4.0                |
+--------------------+--------------------+--------------------+
|minnesota           |iowa                |-13.5               |
+--------------------+--------------------+--------------------+
|idaho               |hawaii              |-8.0                |
+--------------------+--------------------+--------------------+
|florida             |georgia             |2.4                 |
+--------------------+--------------------+--------------------+
|florida state       |north carolina state|13.2                |
+--------------------+--------------------+--------------------+
|east carolina       |tulane              |9.9                 |
+--------------------+--------------------+--------------------+
|nevada-las vegas    |colorado state      |1.3                 |
+--------------------+--------------------+--------------------+
|georgia tech        |clemson             |-4.4                |
+--------------------+--------------------+--------------------+
|akron               |central michigan    |-1.8                |
+--------------------+--------------------+--------------------+
|kent                |bowling green state |-5.2                |
+--------------------+--------------------+--------------------+
|auburn              |mississippi         |11.5                |
+--------------------+--------------------+--------------------+
|arkansas state      |north texas         |15.2                |
+--------------------+--------------------+--------------------+
|vanderbilt          |arkansas            |-5.3                |
+--------------------+--------------------+--------------------+
|arizona state       |colorado            |26.3                |
+--------------------+--------------------+--------------------+
|new mexico          |air force           |-16.8               |
+--------------------+--------------------+--------------------+
|florida             |troy                |11.7                |
|international       |                    |                    |
+--------------------+--------------------+--------------------+
|pittsburgh          |connecticut         |11.4                |
+--------------------+--------------------+--------------------+
|brigham young       |texas christian     |-10.1               |
+--------------------+--------------------+--------------------+
|miami (florida)     |virginia            |15.8                |
+--------------------+--------------------+--------------------+
|houston             |rice                |21.5                |
+--------------------+--------------------+--------------------+

Thursday, October 20, 2011

Predicting the Oblong Ball

I was recently challenged by some friends to predict NCAA college football, so I gathered up some historical data from this archive and adapted some of the better rating systems I've investigated to create a predictor.  It's hard to judge the performance.  It does not perform as well as the systems reported here according to my standard cross-validation testing, but my implementation of Sagarin's ELO also underperforms the reported performance.  Since my implementation of ELO tracks the Sagarin performance very well in basketball, I suspect there's a systemic difference in how performance is measured.

At any rate, I don't intend to spend a lot of time on this, but just for amusement, here are the predictions for this week's games:

alabama over tennessee by 10.6
arkansas over mississippi by 14.8
ball state over central michigan by 1.8
boise state over air force by 25.7
california over utah by -11.4
central florida over alabama-birmingham by 21.2
clemson over north carolina by 5.1
florida atlantic over middle tennessee state by -12.4
florida state over maryland by 5.9
hawaii over new mexico state by .5
houston over marshall by 12.9
illinois over purdue by 11.0
iowa over indiana by 6.5
kansas state over kansas by 18.7
louisiana state over auburn by 14.4
louisiana-lafayette over western kentucky by 6.5
miami (florida) over georgia tech by -15.4
navy over east carolina by 5.2
nebraska over minnesota by 14.4
nevada over fresno state by 7.8
north texas over louisiana-monroe by 2.7
northern illinois over buffalo by 3.1
notre dame over southern california by 2.5
ohio over akron by 21.1
oklahoma state over missouri by 9.9
oklahoma over texas tech by 9.4
oregon over colorado by 22.8
penn state over northwestern by 10.9
rutgers over louisville by 12.0
south florida over cincinnati by -6.9
southern mississippi over southern methodist by -4.4
stanford over washington by 17.7
temple over bowling green state by 17.2
texas a&m over iowa state by 14.5
texas christian over new mexico by 21.9
texas-el paso over colorado state by -1.8
toledo over miami (ohio) by 16.4
tulane over memphis by 10.2
tulsa over rice by 4.4
ucla over arizona by 1.6
utah state over louisiana tech by -2.9
vanderbilt over army by 5.1
virginia tech over boston college by 15.6
virginia over north carolina state by -4.4
wake forest over duke by -2.3
washington state over oregon state by 5.2
west virginia over syracuse by 3.2
western michigan over eastern michigan by 13.7
wisconsin over michigan state by 7.0

Apologies for the awful formatting -- I put this together in 3 days and didn't put much effort into making it pretty.

The Usual Disclaimers apply:  Use this information at your own risk; it is not intended for gambling purposes and the Net Prophet does not encourage or recommend gambling on sports events.

Wednesday, October 12, 2011

More on Statistical Prediction

I am continuing to explore statistical prediction.  In particular, after implementing the Four Factors as described here, I became interested in examining other statistics generated from the base set of statistics.  A subset of these generated statistics are ratios of the base statistics, like the "Offensive Balance" statistic I defined in my earlier post:
Offensive Balance = (# 3 Pt Attempts) / (# FG Attempts)
You can probably come up with a few sensible statistics like these off the top of your head.  But since I've seen time and again the value of exploring all options -- even the ones that make no "sense" -- I decided to calculate and test all of these sorts of ratios to see which of them (if any) have predictive value.

That's a more difficult job than you might imagine.  In my data sets there are 13 base statistics per team per game (FG Made, FG Attempted, 3PT Made, 3PT Attempted, FT Made, FT Attempted, Offensive Rebounds, Total Rebounds, Assists, Turnovers, Steals, Fouls, Score, and MOV).  For predictive purposes, we want to use the average of these over a team's previous games [1] and we can average by either game or possession, so that's 26 base statistics per team.  There are 26*25 = 650 possible ratios of those statistics.  But we also want to consider ratios not only of a team with itself but also of the team with its opponent, e.g., the ratio of the team's average number of 3 PT attempts in past games to its opponent's average number of 3 PT attempts in past games.  That adds another 26*26 = 676 possible ratios.  Finally, we also want to consider the statistics for a team's past opponents, e.g., the average number of 3 PT attempts in past games of a team's opponents in those games.  Adding those in creates a lot more ratios.  Multiply all that by the 12K games in my training data, and it's a lot of data.
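The counts above are easy to check by enumeration.  The short Python sketch below uses placeholder statistic names (they are not the real column names) and only reproduces the arithmetic: ordered pairs of distinct statistics for a team's own ratios, and all ordered pairs for team-versus-opponent ratios.

```python
from itertools import permutations, product

N_BASE = 13  # number of base statistics quoted in the post

# Each base statistic is averaged two ways: per game and per possession.
averaged = [f"s{i}_{mode}"
            for i in range(N_BASE)
            for mode in ("per_game", "per_poss")]

# Ratios of a team's own statistics: numerator and denominator must differ.
own_ratios = list(permutations(averaged, 2))

# Ratios of a team's statistic to its opponent's: any pairing is allowed,
# including the same statistic on both sides.
vs_opponent = list(product(averaged, averaged))

print(len(averaged), len(own_ratios), len(vs_opponent))
```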

My approach is to generate a subset of the possible ratios and test them for predictive value.  For various reasons I settled on generating all the ratios with a particular numerator, e.g.,
(FG Made) / (# Fouls)
(FG Made) / (Opponent's # Fouls)
(FG Made) / (# Fouls by Opponents in Past Games)
etc.
This ends up adding about 96 new statistics to every game in the database.  I can then take this expanded data and pump it through the usual linear regressions, etc., to find the statistics that have predictive value.  But this is a slow process -- for each numerator, it takes hours to generate all the statistics and run them through iterations of the predictive model.  (This has the disadvantage that I may miss some combination of generated statistics with different numerators that are only valuable in combination.)
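Generating the ratios for one numerator might look something like this Python sketch.  The grouping into "team", "opponent", and "past_opponents", the feature naming, and the sample values are all assumptions for illustration; the toy example produces far fewer than the roughly 96 statistics mentioned above.

```python
def ratios_for_numerator(numerator, stat_groups):
    """Build {name: value} ratios of one team statistic over every other statistic."""
    num_value = stat_groups["team"][numerator]
    features = {}
    for group, stats in stat_groups.items():
        for name, value in stats.items():
            if group == "team" and name == numerator:
                continue  # skip the trivial x / x ratio
            if value == 0:
                continue  # skip undefined ratios
            features[f"{numerator} / {group}:{name}"] = num_value / value
    return features

# Hypothetical averaged statistics for one game.
game = {
    "team": {"FG Made": 25.0, "Fouls": 20.0},
    "opponent": {"FG Made": 22.0, "Fouls": 16.0},
    "past_opponents": {"FG Made": 24.0, "Fouls": 18.0},
}

features = ratios_for_numerator("FG Made", game)
for name, value in features.items():
    print(f"{name} = {value:.3f}")
```

Each resulting feature would then be added as a new column for the game before running the regression.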

So far, I haven't identified any ratios that result in significantly better predictions.  But I have been surprised that (at least so far) the models have selected a number of unexpected ratios as being of value.  For example:
(Away team's Average FG Made) / (Away team's Average 3PTs Attempted)
(Away team's Average FG Made) / (Away team's Average 3PTs Made)
These ratios seem to be capturing something about the Away team's offensive balance between inside and outside play.  Interestingly, both the ratio with 3 PTs Attempted and 3 PTs Made are significant -- it may be that the first captures the "offensive strategy" (whether a team plays outside first or inside first) and the second captures something about how effective they are at executing that strategy.  It's also interesting that these ratios are only significant for the Away team -- apparently the home team's performance doesn't depend strongly on what sort of offensive strategy it uses.

Another interesting statistic:
(Home team's Average FG Made) / (Home team's Past Opponents' Average Offensive Rebounds)
It takes a moment's thought to grasp this statistic.  It compares the average number of FGs made by a team to the offensive rebounding of the opponents the team faced.  If we take Offensive Rebounds as an indicator of how strongly teams are contesting inside play, then this ratio would seem to say something about how effective the home team's inside play has been relative to its opponents.

Hopefully working through all the ratio statistics will turn up a set of statistics that provide significantly better predictive value.

[1] Averaging isn't the only option here, and there are other possibilities for generated statistics that might be useful, but I feel that ratios are a reasonably fertile area for exploration.

Tuesday, September 27, 2011

Blogger seems to be rolling out some new template options.  You can view Net Prophet in the new templates here.

Wednesday, September 21, 2011

Statistical Prediction: Pace-Adjusted Statistics & The Four Factors

There is much talk in sports statistics circles about pace-adjusted statistics.  As Wikipedia puts it:
A key tenet for many modern basketball analysts is that basketball is best evaluated at the level of possessions.
The notion here is that because teams play at different paces, game-level statistics can be misleading.  A team that averages 95 points per game is not necessarily better than one that averages 78 points per game.  The higher-scoring team may simply be playing at a much faster pace.  We can account for this by measuring statistics per possession rather than per game.
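To make that concrete, here is a small Python sketch.  The possession formula is a commonly used box-score approximation (field goal attempts minus offensive rebounds plus turnovers plus a fraction of free throw attempts); the 0.475 weight and all of the box-score numbers below are assumptions for illustration, not values from my data.

```python
def estimate_possessions(fga, off_reb, turnovers, fta, ft_weight=0.475):
    """Approximate a team's possessions in a game from box-score totals."""
    return fga - off_reb + turnovers + ft_weight * fta


def per_possession(stat_total, possessions):
    """Convert a per-game total into a per-possession rate."""
    return stat_total / possessions


# Two made-up teams: one fast and high-scoring, one slow and lower-scoring.
fast_poss = estimate_possessions(fga=70, off_reb=12, turnovers=14, fta=20)
slow_poss = estimate_possessions(fga=55, off_reb=9, turnovers=11, fta=20)

fast_eff = per_possession(95, fast_poss)  # points per possession
slow_eff = per_possession(78, slow_poss)

print(f"fast team: {fast_poss:.1f} possessions, {fast_eff:.3f} points each")
print(f"slow team: {slow_poss:.1f} possessions, {slow_eff:.3f} points each")
```

With these invented numbers the 78-point team comes out slightly more efficient per possession than the 95-point team, which is exactly the situation pace adjustment is meant to expose.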

While this makes a lot of intuitive sense, I always like to test my intuitions.  So I took the same set of statistics used in this posting and re-calculated them as per-possession statistics.  (See here for how to estimate the number of possessions in a game.)  Then I ran the prediction model using the per-possession statistics.  (Obviously some statistics, like "Field Goal Shooting Percentage" are not calculated on a per-game basis, so those don't get pace-adjusted.)  Here is the performance comparison:

Predictor                                        % Correct    MOV Error
Govan + Averaging                                73.5%        10.80
Statistical prediction (per-game stats)          72.2%        11.09
Statistical prediction (per-possession stats)    72.2%        11.10

As you can see, the two approaches were indistinguishable.  Not only was performance nearly identical, but they both selected the same statistics for the prediction model.  So at least for this case, it doesn't appear that adjusting for pace improves performance.

My guess is that the relative unimportance of pace is due to the shot clock and the copycat nature of coaching.  There probably isn't enough pace variation across teams to make it a significant factor.

If you search around for "pace-adjusted statistics" you'll eventually stumble across Ken Pomeroy's Four Factors page.  The four factors are derived statistics that are intended to give additional insight into how teams play.  The factors are:
  • Effective field goal percentage
  • Turnover percentage
  • Offensive rebounding percentage
  • Free throw rate   
(Definitions can be found on Ken Pomeroy's page.)
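For reference, the four factors can be computed from box-score totals roughly as follows.  This is a sketch of the definitions as I understand them; the function and argument names are my own, the input numbers are invented, and the opponent's defensive rebounds would have to be derived (e.g., opponent total rebounds minus opponent offensive rebounds).

```python
def four_factors(fgm, fga, three_made, fta, turnovers, off_reb,
                 opp_def_reb, possessions):
    """Return the four factors for one team as a dictionary of rates."""
    return {
        # A made three counts as 1.5 made field goals.
        "efg_pct": (fgm + 0.5 * three_made) / fga,
        # Turnovers per possession.
        "to_pct": turnovers / possessions,
        # Share of available offensive rebounds that the team collected.
        "or_pct": off_reb / (off_reb + opp_def_reb),
        # Free throw attempts per field goal attempt.
        "ft_rate": fta / fga,
    }


# Illustrative numbers only.
factors = four_factors(fgm=25, fga=60, three_made=6, fta=20, turnovers=12,
                       off_reb=10, opp_def_reb=25, possessions=70)
for name, value in factors.items():
    print(f"{name}: {value:.3f}")
```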

"Effective FG%" is not of interest to me because the linear regression can adjust the relative importance of field goals versus three-point attempts.  "Turnover %" is turnovers per possession; that's one of the statistics I calculated as part of the per-possession statistics experiment above.  (It had no value in the predictor, fwiw.)  "Offensive rebounding %" is a more interesting statistic, and since offensive rebounds are used by the statistical prediction model, this seems like a worthwhile statistic to investigate.  "Free throw rate" seems to capture some notion about how often a team draws a foul.  I think that's already captured, but it isn't difficult to generate this statistic.

If I generate these two new statistics and run the prediction model, I find that performance remains the same, but the "Offensive rebounding %" statistics replace the per-game or per-possession offensive rebounding statistics.  ("Free throw rate" has no predictive value and is eliminated in the linear regression.)

Since three-point shooting percentages are used in the predictor, I decided to define a new statistic to capture how much a team relies on the three-point shot (and impacts its opponents' use of the three-point shot).  I defined this as:
Offensive Balance = (# 3 Pt Attempts) / (# FG Attempts)
and re-ran the predictor.  The new statistic has no predictive value.  An alternative formulation is to look at the made 3 pointers versus the made field goals:

Offensive Balance = (3 * # 3 Pt Made) / (2 * # FG Attempts)
but again, this statistic has no predictive value.

I'm open to suggestions if anyone out there has any thoughts on similar "derived statistics" that might be of value in prediction.

Monday, September 19, 2011

Statistical Prediction: Normalizing Inputs

One thing we want to consider in doing statistical prediction (or any sort of prediction where we have a variety of dissimilar inputs) is to normalize our inputs.  The purpose of this is to be able to compare inputs that have different scales.  For example, in my data set, home team scoring average varies from 43 to 102, while "steals by the away team" varies from 0 to 13, so it's hard to compare those two numbers directly.  And we don't want our prediction model to favor one input over another just because it has a bigger absolute value.  To address this we can "normalize" our data to similar scales.

I mentioned here that Brady West normalizes all the input data to his model by subtracting the mean and dividing by the standard deviation -- this is called the "standard score."  Instead of knowing that the home team scored 108 points, you'd know that they scored 2.38 standard deviations above the mean.  That sounds like a fine approach to me, and RapidMiner (the tool I'm using to do the predictive models) offers it under the name "z-transformation": the data is transformed so that it has a mean of zero and a standard deviation of 1.  If we apply that to all of our inputs, we'll have more of an apples-to-apples comparison.  For example, the home scoring average ends up ranging from -9.96 to 3.99, while the away team's FT percentage varies from -14.34 to 4.87 -- giving you some sense that FT shooting percentage has the more extreme outliers.
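The transformation itself is only a couple of lines.  Here is a minimal Python sketch; the sample values are invented, and I use the population standard deviation here, which may or may not match what RapidMiner does internally.

```python
from statistics import mean, pstdev


def z_transform(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no signal
    return [(v - mu) / sigma for v in values]


# Two inputs on very different scales (illustrative values only).
scoring_avg = [61.0, 68.5, 72.0, 75.5, 83.0]
steals = [3.0, 5.0, 6.0, 7.0, 9.0]

z_scoring = z_transform(scoring_avg)
z_steals = z_transform(steals)

print([round(z, 2) for z in z_scoring])
print([round(z, 2) for z in z_steals])
```

After the transformation both columns are expressed in standard deviations from their own mean, so neither dominates just because its raw numbers are bigger.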

If we apply the z-transformation to our inputs, there is no change in performance for the model that takes only scoring averages.  That's reasonable, since the scoring averages are all basically on the same scale anyway.  But when we throw in a second data point with a different scale, the difference becomes apparent:

Predictor                                     % Correct    MOV Error
Govan + Averaging                             73.5%        10.80
Scoring averages                              72.1%        11.18
Scoring + 3 pt % -- Without normalization     72.1%        11.18
Scoring + 3 pt % -- With normalization        72.1%        11.09

So as a matter of course I'll perform a normalization step as part of the prediction workflow.  (In this case, it doesn't improve our best performance by much.)

It's also interesting to compare the coefficients in our linear regression.  This is what we see if we look at the coefficients for the various scoring averages:

Datum                                   Coefficient
Home Team Scoring Average                5.886
Away Team's Opponent Scoring Average    -4.447
Away Team Scoring Average               -5.686
Home Team's Opponent Scoring Average     4.793

Naively, you might want to predict a team's score as exactly halfway between what the team usually scores (offense) and what the other team usually gives up (defense); but what this shows is that the best estimate actually weights offense slightly more -- 57% for the home team, 54% for the away team.
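One way to arrive at those percentages is to compare the size of each team's "offense" coefficient with the size of the matching "defense" coefficient from the table.  The pairing and the variable names below are my reading of the table, shown as a quick Python check.

```python
# Coefficient magnitudes from the table above.
home_offense = 5.886  # Home Team Scoring Average
away_defense = 4.447  # Away Team's Opponent Scoring Average
away_offense = 5.686  # Away Team Scoring Average
home_defense = 4.793  # Home Team's Opponent Scoring Average

# Share of the weight that goes to the scoring team's own offense.
home_weight = home_offense / (home_offense + away_defense)
away_weight = away_offense / (away_offense + home_defense)

print(f"home: {home_weight:.0%}, away: {away_weight:.0%}")  # home: 57%, away: 54%
```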


Friday, September 16, 2011

Statistical Prediction

With this post, I'm going to start taking a look at predicting game outcomes based upon team-level statistical measures other than won-loss or MOV, i.e., measures like "team scoring average," "average number of offensive rebounds per game," etc.

There are a number of ways to slice & dice these statistics, but the most straightforward approach is to use season-to-date averages.  So, when I'm trying to predict the Illinois-Purdue game on 2/15, I'll be looking at the statistics for those two teams averaged over all the games for that season before 2/15.  And I also want to include average statistics for a team's opponents.  So I want to know both Purdue's scoring average for all of its previous games, and also the scoring average of its opponents in those games.  For every game, I'll typically have four values for a statistic: the home team's average, the home team's opponents' average, the away team's average, and the away team's opponents' average.
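The important detail is that each game may only use games played before it.  Here is a minimal Python sketch of that bookkeeping for the scoring average; the record layout, team names, and scores are invented for illustration.

```python
from collections import defaultdict


def season_to_date_rows(games):
    """For each game, return the four scoring averages from earlier games only.

    `games` must be sorted chronologically.  Games where either team has no
    earlier games are skipped because there is nothing to average yet.
    """
    totals = defaultdict(lambda: {"games": 0, "scored": 0, "allowed": 0})
    rows = []
    for g in games:
        home, away = totals[g["home"]], totals[g["away"]]
        if home["games"] > 0 and away["games"] > 0:
            rows.append({
                "date": g["date"],
                "home_avg": home["scored"] / home["games"],
                "home_opp_avg": home["allowed"] / home["games"],
                "away_avg": away["scored"] / away["games"],
                "away_opp_avg": away["allowed"] / away["games"],
            })
        # Only now fold this game's result into the running totals,
        # so a game never contributes to its own inputs.
        home["games"] += 1
        home["scored"] += g["home_score"]
        home["allowed"] += g["away_score"]
        away["games"] += 1
        away["scored"] += g["away_score"]
        away["allowed"] += g["home_score"]
    return rows


games = [
    {"date": "02-01", "home": "Team A", "away": "Team C", "home_score": 70, "away_score": 60},
    {"date": "02-05", "home": "Team D", "away": "Team B", "home_score": 65, "away_score": 75},
    {"date": "02-08", "home": "Team C", "away": "Team A", "home_score": 66, "away_score": 74},
    {"date": "02-15", "home": "Team A", "away": "Team B", "home_score": 71, "away_score": 69},
]
rows = season_to_date_rows(games)
print(rows[-1])
```

For the 02-15 game, Team A's averages come from its two earlier games (72.0 scored, 63.0 allowed) and Team B's from its one earlier game (75.0 scored, 65.0 allowed).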

To begin with, let's look at how well we can predict games using the most obvious statistic: the scoring average.  Using just the (four) scoring average statistics, and the usual methodology, here's our performance:

Predictor            % Correct    MOV Error
Govan + Averaging    73.5%        10.80
Scoring averages     72.1%        11.18

That's pretty encouraging.  Just using the scoring averages delivers performance comparable with some of our better W-L and MOV-based predictors.  The bad news is that its predictions are still highly correlated with those of our other best predictors (around 96%), meaning that it probably can't be used in an ensemble to improve our overall predictive performance.
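A check like this can be done by correlating the predicted MOVs of two predictors over the same games.  The sketch below assumes a plain Pearson correlation; the prediction values are invented for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Predicted MOVs from two hypothetical predictors for the same five games.
predictor_a = [6.5, -3.0, 12.0, 1.5, -8.0]
predictor_b = [7.0, -2.0, 10.5, 2.5, -9.5]

r = pearson(predictor_a, predictor_b)
print(f"correlation: {r:.3f}")
```

The closer the value is to 1, the less an ensemble of the two predictors has to gain, because they are making nearly the same mistakes on the same games.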

If we look at adding other statistics we find (as would be expected from the literature) that they offer little improvement.  The best combination I could find (in order of importance) was (1) scoring, (2) 3 pt percentage, and (3) opponent's average offensive rebounding:

Predictor                                       % Correct    MOV Error
Govan + Averaging                               73.5%        10.80
Scoring averages                                72.1%        11.18
Scoring + 3 pt % + Opponent's off rebounding    72.2%        11.09

As you can see, the improvement was not huge.  The inclusion of "average number of offensive rebounds by opponents" is interesting because it is not scoring-related.  That statistic would seem to capture some aspect of a team's defensive performance -- a team that gives up a lot of offensive rebounds to its opponents is probably doing something wrong at the defensive end of the court.  That suggests that we might want to think about a better measure of defensive performance -- for example, we might want to look at offensive rebounding percentage rather than just the raw total.