Tuesday, December 9, 2014

Other Meta-Classifier Approaches

In previous posts, I've talked about hyper-parameter tuning and ensemble learning.  The appeal of these techniques is to get something for nothing.  Without any real effort, they can (in theory) turn your inaccurate models into more accurate models.  Hyper-parameter tuning and ensemble learning aren't the only approaches to improve the accuracy of a base classifier.  Another commonly used approach is called "boosting."

Imagine that you have a model that predicts the final margin of victory (MOV) of a college basketball game.  If you ran this model on a bunch of training examples, you could calculate the error for each example (the actual MOV minus the predicted MOV).  You could then try to train a second model to predict that error.  And if you could do that, your combined prediction (the predicted MOV from the first model + the predicted error from the second model) would be better than your first model.  This idea is known as gradient boosting (or additive regression).
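
To make the idea concrete, here's a minimal sketch of that two-stage scheme in Python with scikit-learn.  The feature matrix X and the margins y are hypothetical stand-ins for whatever inputs an MOV model actually uses, and the tree in the second stage is just one possible choice of error model.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical data: X holds per-game features, y holds the actual MOV.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 10))
    y = 3 * X[:, 0] + 5 * np.sin(X[:, 1]) + rng.normal(size=1000)

    # Stage 1: the base model predicts MOV.
    base = LinearRegression().fit(X, y)
    residuals = y - base.predict(X)

    # Stage 2: a second model tries to predict the errors the base model left behind.
    booster = DecisionTreeRegressor(max_depth=3).fit(X, residuals)

    # Combined prediction = predicted MOV + predicted error.
    combined = base.predict(X) + booster.predict(X)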

However, to be useful, there has to be some structure left in the errors from the first model for the second model to find.  To understand why this matters, consider using gradient boosting with linear regression.  Ordinary linear regression already produces the linear fit with the smallest (squared) error on the training data -- no other linear model will yield smaller errors.  So if you try to fit a second linear regression to the errors that are left behind, you get no improvement over the original model, because no further linear improvement is possible.

There are two ways to overcome this limitation.  The first is to use a weaker base model.  That may seem non-intuitive, but it has the nice property of automatically turning a weak model into a stronger one.  (And in fact gradient boosting is usually applied to tree models for just this reason.)  The second approach is to use a different type of model for the second model.  For example, we could build a linear regression in the first step, and then a tree model in the second step.  The tree model could pick up some non-linear information in the errors that the linear regression would miss.
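
For what it's worth, scikit-learn packages this repeated fit-the-residuals loop over weak tree models as GradientBoostingRegressor.  A minimal sketch, reusing the hypothetical X and y from above:

    from sklearn.ensemble import GradientBoostingRegressor

    # Many shallow trees, each one fit to the residuals of the ensemble built so far.
    gbr = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
    gbr.fit(X, y)
    boosted_predictions = gbr.predict(X)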

Gradient boosting is just one form of boosting.  In general, "boosting" methods try to improve accuracy by increasing the importance of the training examples the current model gets wrong.  One approach is to weight the training examples: train a base classifier, find the training examples where it is wrong, increase the weight of those examples, and train a new classifier on the re-weighted data.  The idea is that by forcing the base classifier to pay more attention to the incorrect examples, it will become more accurate on those examples.  A popular version of this approach is called adaptive boosting, or AdaBoost, and it has proven to be very effective for a wide range of problems.
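
scikit-learn's AdaBoostRegressor is one implementation of this reweighting scheme.  A minimal sketch, again on the hypothetical X and y from above (its default weak learner is a shallow decision tree):

    from sklearn.ensemble import AdaBoostRegressor

    # Each round up-weights the training examples the ensemble so far handles poorly.
    ada = AdaBoostRegressor(n_estimators=100)
    ada.fit(X, y)
    ada_predictions = ada.predict(X)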

Machine learning toolkits like WEKA and RapidMiner have various forms of boosting built in, including both additive regression and AdaBoost, but the idea is straightforward enough that it's not difficult to implement yourself if you need more control over the approach.  In the case of the Pain Machine, I haven't found a boosting approach that improves the accuracy of the base linear regression, but I have found that some tree approaches can be boosted to an accuracy close to that of the linear regression.  I suspect that boosting is less useful for problem domains with large amounts of noisy data, or for models that are already close to the best possible performance, but that's just speculation on my part.

Friday, December 5, 2014

Ensemble Classifiers

(Review:  In machine learning, a classifier is a model for classifying/predicting.  The Pain Machine uses a linear regression as its classifier.  In this series of posts, I'm discussing other potential approaches.  Previously I discussed using AutoWEKA to automatically find a good classifier.)

An ensemble classifier is one that takes a number of other classifiers (called the "base classifiers") and aggregates their results to get a (hopefully) more accurate prediction.  Ensemble classifiers are popular because -- in certain cases -- the approach provides an easy way to improve accuracy.

To see why this might be so, imagine that we're trying to predict the winner of an NCAA basketball game.  We have three base classifiers, and they're each 70% accurate.  So if we use any one of these classifiers, we'll predict 30% of the games incorrectly.  Instead, let's make a prediction by taking a majority vote amongst the classifiers.  Now we'll only make the wrong prediction if at least two of the three classifiers are wrong.  If the classifiers' errors are independent, that works out to 3(0.7^2)(0.3) + 0.7^3 = 0.784, so the combined (ensemble) classifier is about 78% accurate.
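
If you want to double-check that arithmetic (or see how the effect grows with more voters), here's a small sketch that computes the majority-vote accuracy for n independent classifiers, each correct with probability p:

    from math import comb

    def majority_vote_accuracy(p, n):
        # Probability that at least a majority of n independent classifiers,
        # each correct with probability p, are correct (n odd).
        needed = n // 2 + 1
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(needed, n + 1))

    print(majority_vote_accuracy(0.7, 3))   # ~0.784
    print(majority_vote_accuracy(0.7, 5))   # ~0.837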

The figure below (taken from this article at Scholarpedia) illustrates this idea.


Each of the three classifiers across the top has mis-classified some examples -- the ones with the dark borders.  But since each classifier has made different errors, they all "wash out" in the final ensemble prediction.

If you're mathematically inclined, you might have already realized the "catch": for this to work, the errors made by the base classifiers need to be independent.  If all the classifiers make the same mistakes, nothing is gained by collecting them into an ensemble.  Many approaches have been suggested for creating independent classifiers.  Common ones include subsetting the training data and giving a different subset to each classifier, or subsetting the attributes and giving different attributes to each classifier.  You can also try creating a large number of classifiers and then searching amongst them for independent ones.  (I will have more to say on that idea later.)  Unfortunately, many of these methods trade off base classifier accuracy for independence, and the end result is often that the ensemble does no better than the best base classifier.
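
As an illustration of the attribute-subsetting idea, here's a minimal sketch that trains each base model on a different random subset of the columns and averages their predictions.  The tree model and the subset sizes are arbitrary choices for illustration, not anything the Pain Machine actually uses:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)

    def train_subspace_ensemble(X, y, n_models=10, n_features=4):
        # Each base model sees only a random subset of the attributes.
        models = []
        for _ in range(n_models):
            cols = rng.choice(X.shape[1], size=n_features, replace=False)
            models.append((cols, DecisionTreeRegressor().fit(X[:, cols], y)))
        return models

    def subspace_predict(models, X):
        # Averaging is the numeric analogue of majority voting.
        return np.mean([model.predict(X[:, cols]) for cols, model in models], axis=0)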

At any rate, the experiment with Auto-WEKA provided me with a pool of good base classifiers, so I thought it would be worth a little effort to try combining them into an ensemble classifier.

There are a couple of options for how to combine the base classifiers.  The most straightforward is the voting scheme outlined above (for numeric predictions, voting becomes averaging).  This often works well, but in many cases weighting the base classifiers can produce better predictions.  For example, instead of averaging the predictions of the base classifiers, we might want to take 80% of Classifier A, 15% of Classifier B and 5% of Classifier C.  How do we figure that out?  Well, we just create a new classifier -- one that takes the outputs of the base classifiers as its inputs and produces a final prediction.  We can then pick an appropriate model for it (say, a linear regression) and train it for the best results.

This brings up an easily-overlooked complexity, namely, what data shall we use to train and test this new ensemble classifier?  For best results, we need training & test data for the ensemble classifier that is independent of the data we used to train and test the base classifiers.  If we re-use any of that data, we will probably overfit the ensemble classifier to that data.  This will often give us poor results on new data.  So to pursue an ensemble classifier, we need to partition our data into four sets:  training data for the base classifiers, test data for the base classifiers, training data for the ensemble classifier, and test data for the ensemble classifier.
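
Here's a minimal sketch of the whole arrangement, using the four-way split just described.  The base models and features are placeholders rather than the actual Pain Machine configuration; assume X and y are your feature matrix and MOV labels:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.svm import SVR
    from sklearn.tree import DecisionTreeRegressor

    # Four-way partition: train/test for the base classifiers, train/test for the ensemble.
    X_base, X_ens, y_base, y_ens = train_test_split(X, y, test_size=0.5, random_state=1)
    Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(X_base, y_base, test_size=0.3, random_state=1)
    Xe_tr, Xe_te, ye_tr, ye_te = train_test_split(X_ens, y_ens, test_size=0.3, random_state=1)

    # Train the base classifiers on their own training data.
    bases = [LinearRegression(), SVR(), DecisionTreeRegressor(max_depth=5)]
    for b in bases:
        b.fit(Xb_tr, yb_tr)

    # Their outputs become the inputs of the ensemble classifier.
    def base_outputs(X):
        return np.column_stack([b.predict(X) for b in bases])

    # Train the ensemble (a linear regression, i.e. learned weights) and
    # evaluate it on data that no model has seen before.
    ens = LinearRegression().fit(base_outputs(Xe_tr), ye_tr)
    rmse = np.sqrt(np.mean((ens.predict(base_outputs(Xe_te)) - ye_te) ** 2))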

The best classifiers found during the Auto-WEKA experiment were a linear regression, a Support Vector Machine (SVM), an M5P tree-based classifier and a PLS-based classifier.  So I combined these four into an ensemble and trained a linear regression as the ensemble classifier, with this result:
mov =
      0.1855 * weka.classifiers.functions.SMOreg-1 +
      0.2447 * weka.classifiers.functions.LinearRegression-2 +
      0.4786 * weka.classifiers.trees.M5P +
      0.1147 * weka.classifiers.functions.PLSClassifier +
     -0.1154
So in this case, the final classifier uses about 19% of the SVM prediction, about 24% of the linear regression, and so on.

The ensemble classifier did perform better than the base classifiers -- but only by a tiny amount which was not statistically significant.  So at least in this case, there was no appreciable benefit.

WEKA also provides some built-in methods for creating ensembles.  It has meta-classifiers called "Random Subspace" and "Rotation Forest" that create ensembles of base classifiers by splitting up the attributes into independent subspaces (explained in more detail here and here).  These are somewhat more restrictive approaches because they build ensembles of only a single type of base classifier, but they're very easy to use.  In this case, using these meta-classifiers can improve the performance of the M5P and PLS classifiers to be about as good as the mixed ensemble.

Although these ensemble approaches didn't provide a significant benefit for the Pain Machine model, ensemble learning is a worthwhile concept to keep in your toolkit.  If nothing else, you should keep in mind that prediction diversity can be converted into accuracy, and be on the lookout for potential sources of diversity.

Thursday, December 4, 2014

Least Frightening Div I Team Names & Other Diversions

In alphabetical order:
Centenary Gentlemen
UMKC Kangaroos
Pennsylvania Quakers
Presbyterian Blue Hose
South Dakota State Jackrabbits
St. Peter's Peacocks
UC Irvine Anteaters
Vanderbilt Commodores (feat. Lionel Richie)
Youngstown St. Penguins
It's always a blood bath when the Quakers take on the Gentlemen.

BONUS CONTENT:  First round matchups we'd like to see
The Campbell Fighting Camels vs. the Delaware Fighting Blue Hens
The Kent St. Golden Flashes vs. the St. Francis(PA) Red Flash
The Furman Paladins vs. the Northwestern St. Demons
The Loyola-Maryland Greyhounds vs. the Marist Red Foxes
The McNeese St. Cowboys vs. the Marshall Thundering Herd
The North Dakota Fighting Sioux vs. the North Dakota State Bison
The Northern Arizona Lumberjacks vs. the Indiana St. Sycamores
That is all.

Saturday, November 29, 2014

Auto-WEKA

As I've mentioned before, the Pain Machine uses a linear regression for prediction.  Is that the best model for this problem?  One of the challenges of machine learning is that there are a number of different machine learning algorithms, and each typically has several parameters that control its performance.  For example, if you're trying to predict Margin of Victory (MOV), you could use a linear regression, a polynomial regression, a local linear regression, a Support Vector Machine (SVM), and so on.  There's no straightforward way to know which of these will be best, and likewise no easy approach to tuning each algorithm to find its best performance.

Auto-WEKA is a software package from some smart folks at the University of British Columbia that tries to address this problem.  It applies something called "Sequential Model-based Algorithm Configuration" to a machine learning problem to try to find the best Weka algorithm for solving that problem.   (Weka is a popular open-source suite of machine learning software written in Java at the University of Waikato, New Zealand.)  Auto-WEKA tries to be smart about picking parameters for the Weka algorithms it tries, so that it can home in on the "best" parameters relatively quickly.

Regardless of how efficient Auto-WEKA is at finding the best parameters, it's very handy simply because it automates the work of trying a bunch of different algorithms on a problem.  At the simplest, you just point it at a data file and let it work, but I'll illustrate a more complex example.

First of all, you need to get Auto-WEKA from here and unpack it on your system.  It's self-contained and written in Java, so as long as you have Java installed you should be able to run it on just about any sort of machine.

Second, you need to get your data into a format that Auto-WEKA (and Weka) likes.  This is the Attribute-Relation File Format (ARFF). For me, the easiest way to do this was using RapidMiner.  RapidMiner has operators to read and write data in various formats, so I put together a simple process to read my data (in CSV format) and write it out in ARFF.  One caveat is that Weka expects the label (that is, the value we're trying to predict) to be the last column in the ARFF data.  This is something of a standard, so RapidMiner takes care of it when writing out the ARFF format.  Here's what my RapidMiner process looks like:


The Pre-Process operator is responsible for reading in and labeling my data.  The Sample operator is there because my data set is very large; I down-sample it so that Auto-WEKA doesn't take forever to make progress.  (I used about 5000 games.)
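
If you don't have RapidMiner handy, a small script will do the conversion too.  Here's a minimal sketch that writes a numeric-only CSV out as ARFF with the label column moved to the end; the file and column names are hypothetical:

    import csv

    def csv_to_arff(csv_path, arff_path, label_col, relation="games"):
        # Convert a CSV of numeric columns to ARFF, moving the label column last
        # (which is where Weka expects to find it).
        with open(csv_path, newline="") as f:
            rows = list(csv.reader(f))
        header, data = rows[0], rows[1:]
        label_idx = header.index(label_col)
        order = [i for i in range(len(header)) if i != label_idx] + [label_idx]
        with open(arff_path, "w") as out:
            out.write("@relation %s\n\n" % relation)
            for i in order:
                out.write("@attribute %s numeric\n" % header[i])
            out.write("\n@data\n")
            for row in data:
                out.write(",".join(row[i] for i in order) + "\n")

    # Hypothetical usage:
    # csv_to_arff("games.csv", "games.arff", label_col="mov")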

With the data in hand, we start Auto-WEKA up and are presented with the startup screen:



The "Wizard" option does about what you'd expect -- you point it at your data file and it does the rest.  For more control, you can use the Experiment Builder.  This opens a window with three sequential tabs.  In the first tab, you open the data file.  In the second tab, you pick the classifiers (and meta-classifiers) you'd like to try in your experiment:


Here I've picked some tree-based classifiers as well as a few of the meta-classifiers.  By default, Auto-WEKA will use all the classifiers suitable for your data set.  The third tab allows you to control the experiment settings, such as how long to run:


The defaults here are reasonable, but depending upon your dataset you might want to allocate more memory and time.

Having done all that, your experiment is constructed and you can run it.


(Note that you supply a random seed for the run.  You can do multiple runs in parallel, using different random seeds, to speed the overall search.)

Now you sit back and wait ...  for days.  On a large dataset like mine, training classifiers can take many hours.  And Auto-WEKA will try different classifiers and different combinations of parameters, so it will make many runs.  In truth, you can spend about as much CPU time as you can find.

The good news is that with about a week of CPU time, Auto-WEKA was able to identify several algorithms that performed as well as linear regression.  The combinations it found involved base classifiers combined with meta-classifiers, using parameter settings quite different than the defaults.  It's unlikely that I could have found any of these experimenting on my own.  The bad news is that none of these significantly outperformed linear regression, so my overall performance did not improve.

Overall, Auto-WEKA is a tool that could be very useful in the early stages of building a predictor by quickly identifying promising classifier configurations.  And certainly you have nothing to lose by simply letting it loose on the data to see what it can find.

Thursday, November 20, 2014

A New Season

Well, the new college basketball season is upon us.  My early predictions indicate that the Kentucky Blue Platoon is going to play the Kentucky White Platoon in the Championship game, but that prediction might evolve as we see more games.

In the meantime, I have been working on the Net Prophet model (for new readers of the blog - ha, as if! - sometimes known as the Pain Machine or PM).  Traditionally the PM has used a linear regression as its predictive model, but lately I have been looking at other possibilities.  In coming blog posts I'll detail some of these things.

I'm also hoping to get my act together enough to get the PM listed on the Prediction Tracker page for basketball predictions.  I meant to do this last year but never quite got around to doing it.  But this year for sure :-).

Friday, October 10, 2014

Day of the Week Effect && Polynominal Variables in Linear Regression

Motivated partly by recent discussion of Thursday Night Football, I began to wonder if the day of the week has any impact upon college basketball games.  This is a little bit of a tricky topic, because conferences play games on different nights (e.g., the Ivy League plays on Friday nights) so there's some conference bias mixed into any discussion of the impact of day of the week.  But I decided to ignore that for the moment and just look at the straightforward question.

This is a little trickier than you might expect, because my prediction model uses linear regression.  Linear regression works fine when we're looking for the relationship between two numerical variables (e.g., how does rebounds/game affect score) but it doesn't work so well with polynominal (not polynomial!) variables.  A polynominal variable is one that takes on a number of discrete, non-numeric values.  In this case, day of the week can be Monday, Tuesday, Wednesday and so on.

To use a polynominal variable in linear regression, we turn it into a number of binominal variables.  In this case, we create a new variable called "DOW = Monday" and give it a 1 or 0 value depending upon whether or not the day of the game is Monday.  We do this for each possible value of the polynominal variable, so in this case we end up with seven new variables.  We can then use these as input to our linear regression.
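
Here's a minimal sketch of that expansion using pandas.  The column names and values are hypothetical; each new DOW_* column is one of the binominal 0/1 variables that feeds the regression:

    import pandas as pd

    # Hypothetical game data; "dow" is the polynominal day-of-week attribute.
    games = pd.DataFrame({
        "dow": ["Mon", "Fri", "Sat", "Sat", "Sun"],
        "home_mov": [5, -2, 11, 3, -7],
    })

    # Expand the single polynominal column into one 0/1 binominal column per day.
    dummies = pd.get_dummies(games["dow"], prefix="DOW").astype(int)
    X = dummies                   # regression inputs
    y = games["home_mov"]         # label
    print(pd.concat([X, y], axis=1))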

When I do so, I find that only one of the new variables has any importance in the regression:

      0.6636 * DOW = 4=false

Translating, this says the home team is at a small disadvantage in Friday games.  I leave it up to the reader to explain why that might be true.  (Ivy League effect?)


We can also look at whether predictions are more or less accurate on some days.  When I do that for my model, I find that the predictions are most accurate for Saturday games -- more accurate than on any other day of the week -- and least accurate for Sunday games.  The difference in RMSE between the two is about 6/10 of a point, so it's not an entirely trivial difference.
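
Computing that per-day comparison is straightforward.  A minimal sketch, assuming a pandas DataFrame called results with hypothetical columns pred_mov, actual_mov, and dow:

    import numpy as np

    # results has one row per game; the column names here are hypothetical.
    errors = results["pred_mov"] - results["actual_mov"]
    rmse_by_day = errors.groupby(results["dow"]).apply(lambda e: np.sqrt((e ** 2).mean()))
    print(rmse_by_day.sort_values())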

Monday, September 8, 2014

A Few More Papers

As usual, all these papers are available in the Papers archive.

[Trono 2007] Trono, John A., "An Effective Nonlinear Rewards-Based Ranking System," Journal of Quantitative Analysis in Sports, Volume 3, Issue 2, 2007.
Trono is very concerned with the NCAA football polls and with formulating a rating system that will closely match those polls.  I'm not exactly sure what utility that provides -- surely if I want to know what the polls say I can just look at them?  That issue aside, his description of his ranking system is vague and confusing -- I came away with no good understanding of how it worked or how to implement it.

[Minton 1992] Minton, R. "A mathematical rating system." UMAP Journal 13.4 (1992): 313-334.
This is a teaching module for undergraduate mathematics that illustrates basic linear algebra through application to sports rating.  The rating systems developed are simple systems of linear equations based upon wins, MOV, etc., but this is a clear and detailed introduction to some basic concepts.

[Redmond 2003] Redmond, Charles. "A natural generalization of the win-loss rating system." Mathematics magazine (2003): 119-126.
Redmond presents a rating system based upon MOV that includes a first-generation strength of schedule factor. It isn't extremely sophisticated, but makes a nice follow-on to [Minton 1992].

[Gleich 2014] Gleich, David. "PageRank Beyond the Web," http://arxiv.org/abs/1407.5107.
This is a thorough and well-written survey of the use of the PageRank algorithm.  Gleich provides clear, non-formal descriptions of the subject but also delves into the mathematical details at a level that will require some knowledge to understand.  There is a section on PageRank applied to sports rankings, and Gleich also shows that the Colley rating is equivalent to a PageRank.  Required reading for anyone interested in applying PageRank-type algorithms.

[Massey 1997] Massey, Kenneth. "Statistical models applied to the rating of sports teams." Bluefield College (1997).
Kenneth Massey's undergraduate thesis is required reading for anyone interested in sports rating systems.  He covers the least-squares and maximum-likelihood ratings that form the basis of the Massey rating system.