Friday, April 8, 2016

2016 Machine Madness Winner

I've been a little slow in getting around to this, but I want to congratulate "SDSU Fan" on winning the 2016 Machine Madness contest!  In real life, SDSU Fan is Peter Calhoun, a graduate student in Statistics at (no surprise) San Diego State University.  We had a very large pool of entrants this year (40!) so Peter deserves some congratulations for beating the masses.  Peter was trailing by a significant amount after the Round of 32, but strong performances in the later rounds (and especially the Final Four) resulted in a big lead by the end.

Peter's model modified the Logistic Regression/Markov Chain (LRMC) approach proposed by Kvam and Sokol to use random forests.  Peter also finished fiftieth on Kaggle -- a very strong performance all around.

Despite the large number of entries, nobody had Villanova winning it all.  I think that makes the Villanova win a "true upset".  I know in my model, Villanova played considerably better than predicted.

Speaking of my model, it follows a strategy in pool-based contests of picking some "likely" upsets to try to maximize the chance of winning.  (This is probably more important in a larger pool.)  This year, it picked Purdue to make it to the Championship Game.  Not only didn't that happen, Purdue was upset in the first round by #12 Little Rock.  I'm adding a special "Purdue Rule" to the Net Prophet model so that mistake is never again repeated.  :-)

Congratulations again to Peter on a great performance!

Paper Reviews

These papers have been added to the paper archive available through the Papers link on the sidebar.  Links are also provided for direct download of the papers.

Dubbs, Alexander, "Statistics-Free Sports Prediction", arXiv.org
The author builds logistic regression models for MLB, NBA, NFL, and NHL games that use only the teams and scores.  This works best for basketball, and the author concludes that "in basketball, most statistics are subsumed by the scores of the games, whereas in baseball, football, and hockey, further study of game and player statistics is necessary to predict games as well as can be done."

COMMENT: I'm not sure the results of this paper say anything deeper than "Compared to the other major sports, NBA has a long season and the teams don't change much from year to year." 
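
To make the "teams and scores only" idea concrete, here is a minimal sketch of a logistic model whose only inputs are who played and who won.  This is a simplification (a Bradley-Terry style fit that ignores margin of victory), not the model from the paper, and the team names, results, and function names are all made up for illustration.

```python
import math

def fit_team_ratings(games, epochs=2000, lr=0.1, l2=0.01):
    """Fit one rating per team from (winner, loser) pairs.

    Assumed model: P(a beats b) = sigmoid(rating[a] - rating[b]).
    """
    teams = sorted({team for game in games for team in game})
    rating = {team: 0.0 for team in teams}
    for _ in range(epochs):
        # Small L2 penalty keeps ratings finite for unbeaten teams.
        grad = {team: -l2 * rating[team] for team in teams}
        for winner, loser in games:
            p = 1.0 / (1.0 + math.exp(-(rating[winner] - rating[loser])))
            # Gradient of the log-likelihood for one observed win.
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for team in teams:
            rating[team] += lr * grad[team]
    return rating

def win_probability(rating, a, b):
    """Predicted probability that team a beats team b."""
    return 1.0 / (1.0 + math.exp(-(rating[a] - rating[b])))

# Hypothetical results as (winner, loser) pairs; not real data.
games = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
ratings = fit_team_ratings(games)
print(sorted(ratings, key=ratings.get, reverse=True))
print(win_probability(ratings, "A", "B"))
```

The more games each team plays, the more data a model like this has to pin down each rating.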

Clay, Daniel, "Geospatial Determinants of Game Outcomes in NCAA Men’s Basketball," International Journal of Sport and Society 02/2015; 4(4):71-81.
The authors build a logistic regression model for 1,648 NCAA Tournament games that includes features for distance traveled, time zones crossed, direction of travel, altitude, and temperature.  They conclude "We found that traveling east reduces the odds of winning more than does traveling west, and this finding holds when controlling for strength of team, home region advantage and other covariates. Traveling longer distances (>150 miles) also has a dramatic negative effect on game outcomes..."

COMMENT: This paper shows that travel distance and direction have a statistically significant impact upon game results in the NCAA Tournament, but I want to add a few caveats to this conclusion.  First, it isn't clear that the authors understand and control for the fact that there are many more basketball programs (and arguably stronger basketball programs) on the East Coast than elsewhere in the nation.  For this reason, it's likely that teams moving west to play in the Tournament are stronger than teams moving east.  Since the authors don't control for the strength of teams, it's impossible to say whether the claimed impact of direction of travel means anything.  Second, the magnitude of these effects may not be huge.  I don't understand how the authors calculate their "Odds Ratio" but factors like strength of team are several orders of magnitude more significant in determining outcome.  Third, the authors are measuring strength of team by seed, which has several problems.  It's a very coarse measure, it doesn't distinguish between teams with the same seed, and it's often poorly correlated with the actual team strength (i.e., teams are commonly mis-seeded).  In my experience, many factors with low significance vanish when team strength is more accurately estimated.  I think distance and direction of travel probably do have an impact on Tournament games, but I suspect the true effect is smaller than this paper would indicate.
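
For reference, the textbook odds ratio for a feature in a logistic regression is the exponential of its fitted coefficient; whether that is what the paper reports is an assumption.  A minimal sketch, using a made-up coefficient and hypothetical function names, shows how such a ratio moves a win probability:

```python
import math

def odds_ratio(coefficient):
    """Odds ratio for a one-unit increase in a logistic regression feature."""
    return math.exp(coefficient)

def shift_probability(p, ratio):
    """Apply an odds ratio to a baseline win probability p."""
    odds = p / (1.0 - p)
    new_odds = odds * ratio
    return new_odds / (1.0 + new_odds)

# Hypothetical coefficient for "traveled east"; not taken from the paper.
beta_travel_east = -0.2
ratio = odds_ratio(beta_travel_east)
print(ratio)                          # below 1: the odds of winning shrink
print(shift_probability(0.5, ratio))  # a coin-flip game after the shift
```

Because the ratio multiplies odds rather than probabilities, the same ratio changes a toss-up game more than it changes a lopsided one.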

Clay, Daniel, "Player Rotation, On-court Performance and Game Outcomes in NCAA Men's Basketball", International Journal of Performance Analysis in Sport, August 2014
The authors look at the relationship between the size of rotation (how many players play at least 10 minutes in a game) and statistics such as rebounding, shooting percentage, etc.  The authors conclude that teams with deep rotation tend to rebound better, particularly on the offensive end. They also have more steals. By contrast, smaller rotation teams tend to shoot the ball better, both field goals and free throws, and they are more effective at taking care of the ball, resulting in fewer turnovers.  In general, a larger rotation improves the chance of winning.

COMMENT: There's quite a bit of interesting material in this paper, and I recommend reading it and drawing your own conclusions.  I have reservations about some of the conclusions in this paper because the authors have not controlled for number of possessions in the game for many of the statistics.  Since I'd expect (for example) both the number of offensive rebounds and the depth of rotation to increase with more possessions, I'm not sure I immediately accept that teams with deeper rotations rebound better.  The authors do control for possessions in two of the statistics (offensive and defensive rating) and those conclusions are more convincing.  However, as far as I can tell the authors did nothing to control for overtime games, and that may also be affecting the results.
From the specific viewpoint of predicting game outcomes, the authors don't make use of any kind of strength rating, so it isn't clear whether depth of rotation has any predictive value that wouldn't already be covered by a good strength metric.
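
As an illustration of what controlling for possessions can look like, here is a small sketch that converts raw counts into rates.  The possession formula is a commonly used box-score estimate (the 0.475 free throw weight is an assumption, and other weights are in use), the box scores are made up, and none of this is taken from the paper.

```python
def estimate_possessions(fga, orb, turnovers, fta, ft_weight=0.475):
    """Rough possession count from box-score totals (assumed formula)."""
    return fga - orb + turnovers + ft_weight * fta

def offensive_rebound_pct(orb, opp_drb):
    """Share of available offensive rebounds that the team collected."""
    return orb / (orb + opp_drb)

def per_100_possessions(count, possessions):
    """Scale a raw count to a per-100-possession rate."""
    return 100.0 * count / possessions

# Hypothetical box scores: one fast-paced game, one slow-paced game.
fast = {"fga": 70, "orb": 14, "turnovers": 15, "fta": 20, "opp_drb": 28}
slow = {"fga": 52, "orb": 11, "turnovers": 11, "fta": 16, "opp_drb": 19}

for name, box in (("fast", fast), ("slow", slow)):
    poss = estimate_possessions(box["fga"], box["orb"], box["turnovers"], box["fta"])
    orb_pct = offensive_rebound_pct(box["orb"], box["opp_drb"])
    print(name, box["orb"], poss, orb_pct, per_100_possessions(box["turnovers"], poss))
```

In this made-up example the fast-paced team collects more offensive rebounds in raw terms but a smaller share of the rebounds available to it, which is the kind of reversal that raw counts can hide.  Overtime games could be handled the same way, by scaling to a fixed number of minutes.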