Thursday, October 24, 2013

Local Regression

The Prediction Machine uses a linear regression to form its predictions.  A linear regression works by calculating a straight line equation (hence "linear") that best fits the observed historical data.  That looks something like this picture:
Given a new X we use the blue line to predict a value for Y.  The Prediction Machine isn't two-dimensional like this illustration -- it has dozens of inputs rather than just one -- but this gives you the general idea of how it works.
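
For readers who want to see the idea with more than one input, here is a minimal sketch of a least-squares linear fit using NumPy.  The function names (`fit_linear`, `predict_linear`) and the tiny data set are made up for illustration; this is not the Prediction Machine's actual code or data.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y ~ intercept + X @ coefs (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta  # beta[0] is the intercept, beta[1:] are the slopes

def predict_linear(beta, X):
    """Apply the fitted equation to one or more new rows of inputs."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return beta[0] + X @ beta[1:]

# Made-up data with two inputs that follows y = 3 + 2*x1 - x2 exactly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 3 + 2 * X[:, 0] - X[:, 1]

beta = fit_linear(X, y)
print(beta)                          # intercept and two slopes
print(predict_linear(beta, [2, 2]))  # prediction for a new example
```

With dozens of inputs the only change is the number of columns in `X`; the "line" becomes a flat surface in a higher-dimensional space.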

It turns out that a linear regression works pretty well for the Prediction Machine.  But I've wondered whether there are "special cases" hidden in the data that the single best-fit equation handles poorly.  For example, you might think that teams that are very good at getting offensive rebounds could be predicted more accurately with a slightly different equation.  If you could pick out those cases and use a different linear regression for them, overall accuracy would improve.

There are a number of different approaches to doing this.  One is to use a more complex regression, so that the "blue line" can bend more flexibly in different regions of the prediction space.  For example, you can use a polynomial regression:

But a polynomial regression still bends "smoothly" and is limited in how many times it can bend.
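
As a rough illustration of that limit, the sketch below fits a straight line and a cubic to made-up curved data using NumPy's `polyfit`.  The data and the helper `sse` are assumptions for the example only.  A polynomial of degree d can change direction at most d - 1 times, so the degree you choose caps how much the curve can bend.

```python
import numpy as np

# Made-up one-input data that follows a cubic curve exactly.
x = np.linspace(-2.0, 2.0, 21)
y_curve = x**3 - 2.0 * x

# Degree 1 is an ordinary straight-line fit; degree 3 can bend twice.
line_coefs = np.polyfit(x, y_curve, deg=1)
cubic_coefs = np.polyfit(x, y_curve, deg=3)

def sse(coefs):
    """Sum of squared errors of a fitted polynomial on the sample data."""
    return float(np.sum((np.polyval(coefs, x) - y_curve) ** 2))

print("straight line SSE:", sse(line_coefs))
print("cubic SSE:        ", sse(cubic_coefs))
```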

Another approach is to predict a game's outcome based upon its nearest neighbors, as I talked about here.  The shortcoming with this is that the prediction is based upon the average of all the nearby neighbors -- which might not be the right estimation.  A more sophisticated model (such as a linear regression) might work better.
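
Here is a minimal sketch of nearest-neighbor prediction by averaging, assuming plain Euclidean distance and an unweighted mean.  The function name `knn_average` and the data are made up.  The last call shows the shortcoming: the data rises steadily, but the average of the neighbors ignores that trend.

```python
import numpy as np

def knn_average(X_train, y_train, x_new, k):
    """Predict by averaging the outcomes of the k nearest training examples."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the new example to every training example.
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    return float(y_train[nearest].mean())

# Made-up data where y is exactly 2 * x.
X_knn = np.array([[0.0], [1.0], [2.0], [10.0]])
y_knn = np.array([0.0, 2.0, 4.0, 20.0])

# The three nearest neighbors of x = 2 are x = 0, 1 and 2, so the
# prediction is their average outcome rather than the trend value 4.
print(knn_average(X_knn, y_knn, [2.0], k=3))
```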

Local regression is a modeling technique that combines nearest neighbors with regression.  It works by finding the nearest neighbors to the example you want to predict, fitting a linear regression to just those neighbors, and then using that regression to predict the example.  If your data really has "neighborhoods" that act differently, this should do a better job of prediction.
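
The three steps can be sketched in a few lines of NumPy.  This is a simplified version under stated assumptions: Euclidean distance, every neighbor weighted equally, and a made-up function name and data set.  Many local regression implementations also weight neighbors by distance, and RapidMiner's operator may well differ from this sketch.

```python
import numpy as np

def local_regression_predict(X_train, y_train, x_new, k):
    """Fit a linear regression to the k nearest neighbors of x_new, then predict."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # Step 1: find the k nearest neighbors of the example.
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]

    # Step 2: fit a linear regression (with intercept) to just those neighbors.
    A = np.column_stack([np.ones(len(nearest)), X_train[nearest]])
    beta, *_ = np.linalg.lstsq(A, y_train[nearest], rcond=None)

    # Step 3: predict the example with the local equation.
    return float(beta[0] + x_new @ beta[1:])

# Made-up data with two "neighborhoods" that behave differently:
# y = x on the left, y = 80 - 3x on the right.
X_loc = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_loc = np.array([0.0, 1.0, 2.0, 3.0, 50.0, 47.0, 44.0, 41.0])

print(local_regression_predict(X_loc, y_loc, [1.5], k=4))
print(local_regression_predict(X_loc, y_loc, [11.5], k=4))
```

A single straight line through all eight points would fit neither group well, which is the situation where local regression should help.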

Local regression was recently added to RapidMiner so I took the opportunity to apply it to the Prediction Machine to see if it would improve performance.

The results were disappointing and/or enlightening, depending upon your perspective.  Performance of localized regression was much poorer than a linear regression for small numbers of neighbors.  It wasn't until the number of neighbors was greater than 2000 that its performance started to approach the performance of the linear regression.
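
The kind of comparison described here can be sketched as a sweep over the number of neighbors.  Everything below is an assumption for illustration: the data is synthetic (generated from one global linear equation plus noise), the sizes and neighbor counts are arbitrary, and the helper names are made up.  It does not use the Prediction Machine's data or RapidMiner, and it is not meant to reproduce the numbers from my experiment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed; all data below is synthetic

n_train, n_test, n_inputs = 400, 100, 5
true_coefs = rng.normal(size=n_inputs)
X_tr = rng.normal(size=(n_train, n_inputs))
X_te = rng.normal(size=(n_test, n_inputs))
# One global linear relationship plus noise, so there are no real "neighborhoods".
y_tr = X_tr @ true_coefs + rng.normal(scale=1.0, size=n_train)
y_te = X_te @ true_coefs + rng.normal(scale=1.0, size=n_test)

def fit_predict(X_fit, y_fit, X_new):
    """Fit a linear regression with intercept on (X_fit, y_fit); predict X_new."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    beta, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return beta[0] + X_new @ beta[1:]

def local_predict(x_new, k):
    """Unweighted local regression using the k nearest training rows."""
    nearest = np.argsort(np.linalg.norm(X_tr - x_new, axis=1))[:k]
    return fit_predict(X_tr[nearest], y_tr[nearest], x_new[None, :])[0]

def rmse(pred):
    """Root-mean-square error against the held-out outcomes."""
    return float(np.sqrt(np.mean((pred - y_te) ** 2)))

global_rmse = rmse(fit_predict(X_tr, y_tr, X_te))
local_rmse = {
    k: rmse(np.array([local_predict(x, k) for x in X_te]))
    for k in (10, 50, 200, n_train)
}

for k, err in local_rmse.items():
    print(f"k={k:4d}  local RMSE={err:.3f}  global RMSE={global_rmse:.3f}")
```

When every training row is used as a neighbor, the local fit is the same as the global linear regression, so the last row of the sweep matches the global error.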

This confirms earlier experiments suggesting that there aren't localized "neighborhoods" within the NCAA basketball data where we can improve performance by treating them differently.  The factors that predict performance seem to apply equally across the whole spectrum of college teams.

Wednesday, October 2, 2013

2013-2014 Schedule Available

The Prophet isn't fully awake yet from his off-season hibernation, but is rousing long enough to mention that ESPN recently posted the first schedule of games for the season.  They don't seem to have locations yet for the neutral-site games, and some games are probably missing.  I've posted the scraped file to the Data page.

Looking back, I see the Prediction Machine got the participants in the final games of both the NIT and the NCAA tournaments correct, and predicted all of the Final Four games correctly.