Tuesday, March 31, 2015

Final Four Predictions

I've been busy enjoying the Tournament and the discussions over at Kaggle, but I thought I'd take some time to run predictions for the Final Four.

#7 MSU vs. #1 Duke:  Duke by 3.5 points
MSU has had (another) amazing Tournament run, but my predictor still considers them the weakest team in the field by a substantial margin.
#1 Wisconsin vs. #1 Kentucky:  UK by 1.5 points
The predictor agrees with most pundits that Wisconsin is the second best team, and the most likely to knock off UK.  1.5 points is basically a toss-up.  Wisconsin's ratings rose slightly following a good win over Arizona, and UK's dropped after a relatively poor performance against Notre Dame.
#1 Duke vs. #1 Kentucky: UK by 4.0 points
This is not a toss-up -- UK has a solid edge in this game.
The other possible final game matchups:
#7 MSU vs. #1 Kentucky:  UK by 8 points
#1 Duke vs. #1 Wisconsin: Wisconsin by 1 point
#7 MSU vs. #1 Wisconsin: Wisconsin by 5 points
I don't think MSU has much of a chance against Kentucky, but all the other matchups should be pretty even.

Monday, March 23, 2015

Machine Madness Update

I'm just back from watching UCLA win two games in Louisville and am not yet caught up, but here's a quick update from the Machine Madness side of the competition.

Perhaps unsurprisingly, Monte McNair leads the competition with 56 points, and I suspect he will win if Kentucky wins out.  (Monte's in the Top Ten in the Kaggle competition right now.)   BlueFool is in second place just a point behind Monte and is the only competitor with Duke as champion, so she'll likely win if that happens.  Jason Sumpter is in third place but also has Kentucky as champion, so he'll need some breaks to beat out Monte -- specifically, I think he needs Xavier to beat Wisconsin.

Nothing But Neural Net (great name, btw) is the only competitor with Wisconsin as champion.  Likewise I'm the only competitor with Arizona, so obviously we'll be rooting for those teams to win out. 

Thursday, March 19, 2015

Good Luck!

I'm off to Louisville to watch the first round games, so updating the blog will be difficult, but I wanted to wish good luck to everyone in both the Kaggle and the Machine Madness contests.  Enjoy the games!

Monday, March 9, 2015

The Silliness of Simulation

When the NCAA Tournament rolls around there's an inevitable flurry of blog posts and news articles about some fellow or another who has predicted the Tournament outcome by running a Tournament simulation a million times!  Now that's impressive!

Or maybe not.

These simulations are nothing more than taking someone's win probabilities (usually Pomeroy or Sagarin, since these are available with little effort) and then rolling a die against those probabilities for each of the 63 games.  On a modern computer you can do this a million times in a second with no real strain.
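The whole dice-rolling exercise fits in a few lines of Python.  Here's a minimal sketch; the teams, the toy ratings, and the log5-style win probability formula are all made up for illustration (a real simulation would plug in Pomeroy or Sagarin numbers and a 64-team bracket):

```python
import random

def simulate_bracket(teams, win_prob, rng=random.random):
    """Play out one single-elimination bracket.

    teams    -- team names in bracket order (length a power of 2)
    win_prob -- function (a, b) -> probability that a beats b
    Returns the simulated champion.
    """
    round_teams = list(teams)
    while len(round_teams) > 1:
        next_round = []
        for a, b in zip(round_teams[::2], round_teams[1::2]):
            # Roll a die against the win probability for this game.
            next_round.append(a if rng() < win_prob(a, b) else b)
        round_teams = next_round
    return round_teams[0]

# Toy example with made-up ratings (not anyone's real numbers):
ratings = {"Kentucky": 0.95, "Duke": 0.90, "Wisconsin": 0.89, "Furman": 0.30}

def win_prob(a, b):
    # Hypothetical log5-style formula built from the toy ratings.
    pa, pb = ratings[a], ratings[b]
    return (pa * (1 - pb)) / (pa * (1 - pb) + pb * (1 - pa))

random.seed(0)
champs = [simulate_bracket(list(ratings), win_prob) for _ in range(100_000)]
print(max(set(champs), key=champs.count))  # the most frequent champion
```

That's the entire trick: 100,000 "simulated tournaments" is just 100,000 passes through that inner loop.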

More importantly, though, does running this sort of simulation a million times actually reveal anything interesting?

Imagine that we decided to do this for just the title game.  In our little thought experiment, the title game this year has (most improbably) come down to Duke versus Furman, thanks in no small part to Furman's huge upset of the University of Kentucky in their opening round game.

(Furman -- one of the worst teams in the nation, with only 5 wins in the lowly Southern Conference -- has somehow won through to the conference title game and actually does have a chance to get to the Tournament.  If this happens, they'll undoubtedly be the worst 16 seed and matched up against UK in Louisville.  So this is totally a plausible scenario.)

We look up the probability of Duke beating Furman in our table of Jeff Sagarin's strengths (or Ken Pomeroy's, whoever it was) and we see that Duke is favored to win that game 87% of the time.  So now we're ready to run our simulation.

We run our simulation a million times.  No, wait.  We want to be as accurate as possible for the Championship game, so we run it ten million times.

(We have plenty of time to do this while Jim Nantz narrates a twenty minute piece on the unlikely Furman Paladins and their quixotic quest to win the National Championship.  This includes a long interview with a frankly baffled Coach Calipari.)

We anxiously watch the results tally as our simulation progresses.  (Or rather we don't, because the whole thing finishes before we can blink, but I'm using some dramatic license here.)  Finally our simulation is complete, and we proudly announce that in ten million simulated games, Duke won 8,700,012 of the games!  Whoo hoo!

But wait.

The sharp-eyed amongst you might have noticed that Duke's 8,700,012 wins out of 10,000,000 is almost exactly the same percentage as the original winning probability we borrowed from Ken Pomeroy.  (Or Jeff Sagarin, whoever it was.)  Well, no kidding.  It had better be, or our random number generator is seriously broken.

Welcome to the Law of Large Numbers.  To quote Wikipedia:  "[T]he average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed."  The more times we run this "simulation" the closer we'll get to exactly 87%.
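You can watch the Law of Large Numbers do its thing in a few lines (0.87 here is just the borrowed win probability from the example):

```python
import random

random.seed(42)
p = 0.87  # Duke's win probability, borrowed from someone's ratings

# Roll the same die more and more times; the win fraction creeps toward p.
results = {}
for n in (10, 1_000, 1_000_000):
    wins = sum(random.random() < p for _ in range(n))
    results[n] = wins / n
    print(f"{n:>9,} trials: Duke win fraction {results[n]:.4f}")
```

The more trials, the closer the fraction hugs 0.87 -- which is exactly the number we started with.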

This is why the whole notion of "simulating" the tournament this way is silly.  The point of doing a large number of trials (simulations) is to reveal the expected value.  But we already know the expected value:  it's the winning probability we stole from Jeff Sagarin.  (Or Ken Pomeroy, whoever it was.)  It's just a waste of perfectly good random numbers to get us back to the place we started.

To be fair, there's one reason that it makes some sense to do this for the entire Tournament.  If for some reason you want to know before the Tournament the chances of a particular team winning the whole thing, then this sort of simulation is a feasible way to calculate that result.  (Or if you're Ed Feng you create this thing.)  And if that's your goal, I give you a pass.

On the other hand, if you're doing all this simulation to fill out a bracket for (say) the Machine Madness competition, then it makes more sense to run your simulation for a small number of trials.  The number of trials is essentially a sliding control between Very Random (1 trial) at one end and Very Boring (1 billion trials) at the other.  Arguably it is good meta-strategy in pool competitions not to predict the favorite in every game, so by lowering the number of trials you can inject some randomness into your entry.  (I don't think this is necessarily a good approach, but at least it is rational.)

Now I'm off to root for Furman in the Southern Conference title game.

Wednesday, March 4, 2015

So What About Me?

I've recently put up a few posts about the Kaggle competition including one about reasonable limits to performance in the contest.  So it's natural to wonder how I'm doing / have done in the Kaggle competition.

Fair enough.

Last year, my entry finished 60th on the Kaggle leaderboard, with a score of 0.57.  At one point that was exactly at the median benchmark, but apparently the post-contest cleanup of DQed entries changed that slightly.  2014 wasn't a particularly good year for my predictor.   Here are the scores for the other seasons since 2009:


2014 was my worst year since 2011.  (2011 was the Chinese Year of the Upset, with a Final Four of a #3, #4, #8 and #11 seed.)  Ironically, I won the Machine Madness contest in 2011 because my strategy in that contest includes predicting some upsets; this led to correctly predicting Connecticut as the champion.

My predictor isn't intended specifically for the Tournament.  It's optimized for predicting Margin of Victory (MOV) for all NCAA games.  This includes the Tournament, but those games are such a small fraction of the overall set of games that they don't particularly influence the model.  There are some things I could do to (hypothetically) improve the performance of my predictor for the Kaggle competition.  For one thing, I could build a model that tries to predict win percentages directly, rather than translating from predicted MOV to win percentage.  For another, since my underlying model is a linear regression, I implicitly optimize RMSE.  I think it's likely that a model that optimizes on mean absolute error would do better,1 but I haven't yet found a machine learning approach that can create a model optimized on mean absolute error with performance equaling linear regression.
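The MOV-to-win-percentage translation can be sketched by assuming actual margins are roughly normal around the predicted MOV.  The 11-point standard deviation below is a common ballpark figure, not a value fitted from my data:

```python
import math

def mov_to_win_prob(predicted_mov, sigma=11.0):
    """Convert a predicted margin of victory into a win probability.

    Assumes the actual margin is normally distributed around the
    prediction; sigma=11 points is an assumed spread, not a fit.
    """
    # P(margin > 0) under N(predicted_mov, sigma^2), via the normal CDF.
    return 0.5 * (1.0 + math.erf(predicted_mov / (sigma * math.sqrt(2))))

print(mov_to_win_prob(0.0))   # an even game
print(mov_to_win_prob(4.0))   # e.g., a 4-point favorite
```

A dead-even game maps to 50%, and a modest 4-point edge maps to only a bit better than 60% -- which is why close Tournament games are so hard to score well on.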

I haven't put much effort into building a "Tournament optimized" predictor because (as I have pointed out previously) there is a large random element to the Tournament performance.  Any small gains I might make by building a Tournament-specific model would be swamped by the random fluctuations in the actual outcomes.

1 I say this because RMSE weights outliers more heavily.  Although there are a few matchups in the Tournament between teams with very different strengths (i.e., the 1-16 and 2-15 matchups in particular) in general you might suppose that there are fewer matchups of this sort than in the regular season, and that being slightly more wrong on these matchups won't hurt you much if you're also slightly more correct on the rest of the Tournament games.  That's just speculation on my part, though.

Monday, March 2, 2015

Reasonable Kaggle Performance

The first stage of the Kaggle competition involves Kagglers testing out their models against data from the past few basketball seasons, and these scores appear on the first stage leaderboard.  Invariably new Kagglers make some fundamental mistakes and end up submitting entries with unreasonably good performance.  The administrators of the contest have taken to removing these entries to avoid discouraging other competitors.  The line for removing entries is somewhat fuzzy, and it begs the question1 "What is a reasonable long-term performance for a Tournament predictor?" There are probably many ways to answer this question,2 but here's one approach that I think is reasonable:  Calculate the performance of the best possible predictor over an infinite number of Tournaments.

I am reminded at this point of an old joke.
A man is sitting in a bar complaining to his friend -- who happens to be a physicist -- about his awful luck at the racing track, and wishing he had some better way to know what horse was going to win each race.  

"Well, that strikes me as a rather simple physics problem," his friend says.  "I'm sure I could build a model to predict the outcome."

"Really?" says the man, visibly excited.  "That's fantastic.  We'll both get rich!"

So the physicist goes off to build his model.  After a week, the man has still heard nothing, so he calls his friend.  "How are you doing on the model?" he asks.

"Well," says the physicist.  "I admit that it is turning out to be a bit more complicated than I imagined.  But I'm very close."

"Great," says the man.  "I can't wait!"

But another week goes by and the man hears nothing, so he calls again.

"Don't bother me," snaps the physicist.  "I've been working on this day and night.  I'm very close to a breakthrough!"

So the man leaves his friend alone.  Weeks pass, when suddenly the man is awakened in the middle of the night by a furious pounding on his front door.  He opens the door and sees his friend the physicist.  He looks terrible -- gaunt and strained, his hair a mess -- and he is clutching a sheaf of crumpled papers.  "I have it!" he shouts as the door opens.  "With this model we can predict the winner of any horse race!"

The man's face lights up.  "I can't believe you did it," he says.  "Tell me how it works."

"First of all," says the physicist, "we assume that the horses are perfect spheres racing in a vacuum..."
Like the physicist, we face a couple of difficulties.  For one thing, we don't have the best possible predictor.  For another, we don't have an infinite set of Tournaments.  No matter, we shall push on.

We don't have the best possible predictor (or even know what its performance would be) but we do have some data from the best known predictors and we can use that as a substitute.  The Vegas opening line is generally acknowledged to be the best known predictor (although a few predictors do manage to consistently beat the closing line, albeit by small margins).  The Vegas opening line predicts around 74% of the games correctly "straight up" (which is what the Kaggle contest requires). I'm personally dubious that anyone can improve upon this figure significantly3 but for the sake of this analysis let's assume that the best possible predictor can predict an average game4 correctly 80% of the time.

We also don't have an infinite number of Tournaments to predict, but we can assume that the average score on an infinite number of Tournament games will tend towards the score on an average Tournament game.  For the log-loss scoring function, the best score in the long run comes from predicting our actual confidence (the 80% from above).  If we predict an infinite number of games at 80% and get 80% of those games correct, our score is:

`-(0.80*log(0.80) + (1-0.80)*log(1-0.80))`

which turns out (fortuitously) to be just about 0.50.  (If we use a performance of 74%, the score is about 0.57.)
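In Python (using the natural log, as the Kaggle scoring does):

```python
import math

def expected_log_loss(p):
    """Long-run log-loss when we predict p and are right a fraction p of the time."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

print(expected_log_loss(0.80))  # ~0.50
print(expected_log_loss(0.74))  # ~0.57
```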

This analysis suggests that the theoretical best score we can expect predicting a large number of Tournament games is around 0.50 (and probably closer to 0.57).  This agrees well with last year's results -- the winner had a score of about 0.52 and the median score was about 0.58.

As far as "administrative removal" goes, there are 252 scored games in the Kaggle stage one test set.  That's not an infinite set of games, but it is enough to exert a strong regression towards the mean.  The Kaggle administrators are probably justified in removing any entry with a score below 0.45.
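As a quick sanity check on how much 252 games tamps down the luck, we can simulate many 252-game test sets under the 80% assumption and look at the spread of scores:

```python
import math
import random

random.seed(7)
P, GAMES, TRIALS = 0.80, 252, 5_000

def mean_log_loss():
    # Average log-loss over 252 games, all predicted at 80% confidence.
    total = sum(-math.log(P if random.random() < P else 1 - P)
                for _ in range(GAMES))
    return total / GAMES

scores = [mean_log_loss() for _ in range(TRIALS)]
mu = sum(scores) / TRIALS
sd = (sum((s - mu) ** 2 for s in scores) / TRIALS) ** 0.5
print(f"mean {mu:.3f}, std dev {sd:.3f} over {GAMES}-game test sets")
```

The scores cluster tightly around 0.50, with a standard deviation of only a few hundredths -- that's the regression towards the mean at work.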

On a practical level, if your predictor is performing significantly better than about 0.55 for the test data, it strongly suggests that you have a problem.  The most likely problems are that you are leaking information into your solution or that you are overfitting your model to the test data.

Or, you know, you could be a genius.  That's always a possibility.

1 Yes, I know I'm misusing  "beg the question". 
2 I suspect that a better approach treats the games within the Tournament as a normal distribution and sums over the distribution to find the average performance, but that's rather too much work for me to attempt.
3 If for no other reason than Vegas has a huge financial incentive to improve this number if they could.  
4 The performance of the Vegas line is an average over many games.  Some games (like huge mismatches) the Vegas line predicts better than 74%; some (like very close matchups) it predicts closer to 50%.  I'm making the simplifying assumption here that the average over all the games corresponds to the performance on an average game.  Later on I make the implicit assumption that the distribution of Tournament games is the same as the distribution of games for which we have a Vegas line.  You can quarrel with either of these assumptions if you'd like.  A quick analysis of the Tournament games since 2006 shows that the Vegas line is only right 68% of the time, suggesting that Tournament games may be harder to predict than the average game.

Friday, February 27, 2015

Five Mistakes Kaggle Competitors Should Avoid

#1 -- Don't think you're going to win the competition.

One of the results that came out of the analysis of last year's contest is that the winner was essentially random:  at least the top 200 entries could have credibly won the contest.  Why?  Evidence from the best predictors suggests that there is about 8 points or so of unpredictability in college basketball games.  That's a lot of randomness.  Last year, 32 of the 64 games in the Tournament were decided by 8 points or less.  So even if you have the most accurate predictor in the contest, you're almost certain to be beaten by someone who made a worse prediction and got lucky when it came true.  It's the same reason why the ESPN pool is usually won by an octopus or someone who picked based on mascot fashions. On the other hand, maybe this year you'll be the guy who gets lucky.  It could happen.

#2 -- Don't use the data from before the 2008-2009 season.

Isn't it nice of the Kaggle administrators to provide data back to 1985?

If you're not familiar with college basketball, you might not realize that the college game underwent a radical change at the beginning of the 2008-2009 season, when the NCAA moved the three-point line back a foot to a consistent distance of 20' 9".  The longer line changed game strategies, and data from before that season is probably not easily applicable to today's game.

#3 -- The Tournament data is not enough for training or testing.


At 64 games a year, the Tournament just doesn't provide enough data for training or even testing a predictor with any reliability.  You may think you're being smart to build your model specifically for the Tournament -- imagine the advantage you'll have over all the other competitors that don't understand how different the Tournament is from the regular season.  Ha!

But actually you're just overfitting your model.   My own predictor needs about 15,000 training examples for best performance.  Your mileage may vary -- maybe you only need 14,000 training examples -- but there just isn't enough information in the Tournament games alone to do accurate prediction.  Particularly since you shouldn't use the games from before the 2008-2009 season (see #2).  Of course, you can do all that and you might still win the contest (see #1).

#4 -- Beware of leakage!

Guess what?  It turns out that you can do a really good job of predicting the Tournament if you know the results ahead of time.  Who knew?

Now that's not a big problem in the real contest because (short of psychic powers) no one knows the results ahead of time.  But if the forums from last year and this year are any indication, it's a big problem for many Kagglers as they build and test their predictors.  Knowledge from the games they're testing creeps into the model and results in unrealistically good performance.

There are three major ways this happens.

The first and most obvious way this happens is that a model is trained and tested on the same data.  In some cases you can get away with doing this -- particularly if you have a lot of data and a model without many degrees of freedom.  But that isn't the case for most of the Kaggle models.  If you train your model on the Tournament data and then test it on the same data (or a subset of the data), it's probably going to perform unreasonably well.  You address this by setting aside the test data so that it's not part of the training data.  For example, you could train on the Tournament data from 2008 to 2013 and then test on the 2014 Tournament.  (Although see #3 above about using just the Tournament data.)  Cross-validation is another, more robust approach to avoiding this problem.
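Here's a minimal sketch of that season-based split.  The game records are hypothetical placeholders; the only point is that the test season never leaks into the training set:

```python
# Hypothetical game records: six seasons of 64 Tournament games each.
games = [{"season": s, "game": g} for s in range(2009, 2015) for g in range(64)]

TEST_SEASON = 2014
train = [g for g in games if g["season"] != TEST_SEASON]
test = [g for g in games if g["season"] == TEST_SEASON]

# The train and test sets share no seasons, so no result in the test
# set can influence the model.
assert not {g["season"] for g in train} & {g["season"] for g in test}
print(len(train), len(test))  # 320 training games, 64 test games
```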

The second way this often happens is that you unwittingly use input data that contains information about the test games.  A lot of Kagglers use data like Sagarin's ratings without understanding how these statistics are created.  (I'm looking at you, Team Harvard.)  Unless you are careful this can result in information about the test games leaking back into your model.  The most common error is using ratings or statistics from the end of the season to train a model for games earlier in the season.  For example, Sagarin's final ratings are based upon all the games played that season -- including the Tournament games -- so if you use those ratings, they already include information about the Tournament games.  But there are more subtle leaks as well, particularly if you're calculating your own statistics.

The third and least obvious way this happens is when you tune your model.  Imagine that you are building your model, taking care to separate out your test data and avoid using tainted ratings.  You test your model on the 2014 Tournament and get mediocre results.  So you tweak one of your model parameters and test your model again, and your results have improved.  That's great!  Or is it?  In fact, what you've done is leak information about the 2014 Tournament back into your model.  (This can also be seen as a type of overfitting to your test data.)  This problem is more difficult to avoid, because tuning is an important part of the model building process.  One hedge is to use robust cross-validation rather than a single test set.  This helps keep your tuning more general.
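One way to structure that cross-validation is to hold out one whole season at a time.  The `fit` and `score` functions below are toy stand-ins for a real model (here "fitting" is just averaging margins), purely to show the shape of the loop:

```python
def leave_one_season_out(games, seasons, fit, score):
    """Hold out one whole season at a time, so tuning decisions are
    never driven by any single test Tournament."""
    results = {}
    for held_out in seasons:
        train = [g for g in games if g["season"] != held_out]
        test = [g for g in games if g["season"] == held_out]
        results[held_out] = score(fit(train), test)
    return results

# Toy stand-ins for a real model (purely illustrative):
games = [{"season": s, "margin": m}
         for s in range(2009, 2015) for m in (3, 7, -2)]
fit = lambda train: sum(g["margin"] for g in train) / len(train)  # "model" = mean margin
score = lambda model, test: sum(abs(g["margin"] - model) for g in test) / len(test)

cv = leave_one_season_out(games, range(2009, 2015), fit, score)
print(cv)  # one mean-absolute-error score per held-out season
```

Averaging the per-season scores gives a tuning signal that's much harder to accidentally overfit than a single test Tournament.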

How can you tell when you're suffering from leakage?  Your performance can provide an indicator.  Last year's winner had a log-loss score of 0.52, and the median score was around 0.58.  If your predictor is getting performance significantly better than those numbers, then you're either (1) a genius, or (2) have a problem.  It's up to you to decide which.

#5 -- A Miscellany of Important Notes
  • College basketball has a significant home court advantage (HCA).  (And yes, there may be a home court advantage in Tournament games!) Your model needs to account for the HCA and how it differs for neutral court and Tournament games.  If your model doesn't distinguish home and away, you've got a problem.
  • College teams change significantly from season to season.  You can't use a team's performance in one season to predict its performance in another season.  (This seems obvious, but last year's Harvard team seems to have made this mistake.  On the other hand, they got a journal publication out of it, so if you're an academic this might work for you too.)
  • Entering your best predictions might not be the best way to win the contest.  Since the contest has a large random element (see #1 above) your best strategy might be to skew your predictions in some way to distinguish yourself from similar entries, i.e., you should think about meta-strategy.
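One simple way to skew an entry away from the consensus is to sharpen (or flatten) each predicted probability.  This is purely illustrative -- the exponent is an arbitrary knob, not a recommendation:

```python
def skew(p, a=1.5):
    """Push a probability away from 0.5 (a > 1) or toward it (a < 1).

    a is an arbitrary exponent controlling how contrarian or bold the
    resulting entry is; a = 1 leaves the prediction unchanged.
    """
    return p ** a / (p ** a + (1 - p) ** a)

print(skew(0.60))         # a bolder version of a 60% pick
print(skew(0.60, a=0.5))  # a more cautious version of the same pick
```

Sharpening (a > 1) scores better when your favorites come through and worse when they don't -- which is exactly the kind of variance a meta-strategy might want.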