Saturday, February 6, 2016

Kaggle Competition is Back for 2016

I've been remiss about posting to the blog, but I thought I'd share that a little birdie hinted to me that the Kaggle Competition will be back again this year, with perhaps some new twists.  So keep your predictors warmed up.

I'm undecided whether I'm going to provide "Steal My Entry" again this year, but I might be interested in a private collaborative effort. In particular, my thought is to merge an entry from my predictor -- which mostly focuses on regular-season games -- with a predictor that has specifically been trained on tournament games.  I'll provide my model's game predictions for all the tournament games back to 2009, and then you train a tournament-specific model using my predictions along with any other information you think is valuable (e.g., team seedings or game locations).  Contact me if that sounds interesting -- and this isn't an exclusive offer; I'm happy to collaborate with multiple folks either individually or as part of a larger group.

Wednesday, December 9, 2015

Sports Information API

Tonight I stumbled across Sportradar.us, which seems to be the former SportsData.  Interestingly, they have APIs to deliver all sorts of sports information, including comprehensive NCAA men's basketball coverage -- with play-by-play data and even location data, i.e., where on the court a shot was taken (!).

The bad news is that the lowest pricing tier is $500/month.  So not something I'll be buying for Christmas.  But interesting.

Working Overtime

Overtime is one of the interesting quirks of basketball.  In some sports -- particularly low-scoring sports like soccer and hockey -- a game may end in a tie.  But in college basketball, teams play additional periods -- as many as needed -- until a winner is determined.

Overtime games skew team statistics.  ESPN and other sites typically have pages of statistics such as "Points Per Game".  But if one game is 40 minutes long and another is 226 minutes long, it's not really an apples-to-apples comparison.  This is one reason analysts are fond of "per possession" statistics -- not only does it correct for pace of play, but it also corrects for overtime games.
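As a rough illustration of the overtime correction, here is a minimal sketch that rescales a team's points to a regulation-length game.  It assumes you know how many overtimes were played; the function names are made up for this example, and it only corrects for game length, not for pace of play the way per-possession numbers do.

```python
REGULATION_MINUTES = 40  # two 20-minute halves
OVERTIME_MINUTES = 5     # each college overtime period


def game_minutes(num_overtimes):
    """Game-clock length of a college game, in minutes."""
    return REGULATION_MINUTES + OVERTIME_MINUTES * num_overtimes


def points_per_40(points, num_overtimes):
    """Scale a team's points to a regulation-length (40-minute) game."""
    return points * REGULATION_MINUTES / game_minutes(num_overtimes)
```

So 105 points scored across six overtimes counts the same as 60 points in a regulation game.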

Clearly the statistics you feed into a predictor need to be corrected somehow for overtime games.  But there's another interesting overtime issue to consider:  What's the final score of an overtime game?

One choice is to use the score at the end of the overtime(s).  The other is to treat the game as a tie.  There are intuitive arguments in favor of both choices.  The fact that Syracuse beat Connecticut suggests that Syracuse is a better team, regardless of how many minutes that took, so we should treat the game as a win for Syracuse.  On the other hand, the teams were deadlocked for six overtimes, which suggests that they're about as equal as it is possible to be, regardless of whether one team or the other managed to win the game in the wee hours of the morning.

Or maybe the game should be treated as a tie for some statistics and not for others.

As longtime readers of this blog are aware, I'm a believer in doing whatever works best.  So in this case, I made two runs of my predictor, once treating overtime games as ties and once using the actual final scores.  In my case, the predictor performed better treating overtime games as ties.

Another possibility is to treat the final score of an overtime game as 1 or -1 (or 0.1 and -0.1 if your predictor can handle that), depending upon which team wins the overtime period(s).  This retains the won/loss information, but otherwise treats the game as (nearly) a tie.

For those of you who also have predictors, I encourage you to try the same experiment and report back which choice (if either) works better for you.
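If you want to try the experiment, the three treatments discussed above can be sketched as a single function that decides what margin of victory to feed the predictor.  This is only an illustration, not the code from my predictor; the function name, the `mode` strings, and the `nudge` parameter are all invented for the example.

```python
def training_margin(home_score, away_score, num_overtimes,
                    mode="tie", nudge=1.0):
    """Margin of victory (home minus away) to use for training.

    mode="actual": use the real final margin.
    mode="tie":    treat any overtime game as a 0-point margin.
    mode="nudge":  treat an overtime game as +/- nudge, which keeps
                   the winner but otherwise calls it (nearly) a tie.
    """
    margin = home_score - away_score
    if num_overtimes == 0 or mode == "actual":
        return float(margin)
    if mode == "tie":
        return 0.0
    if mode == "nudge":
        return nudge if margin > 0 else -nudge
    raise ValueError("unknown mode: %r" % mode)
```

Run your predictor once per mode over the same set of games and compare the error; games that ended in regulation are unaffected by the choice.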

Sunday, December 6, 2015

A Few Funny Things

When I logged in to work on this post, I noticed that my blog had 100,000 page views.  Since I have an audience of like six people, you guys must be checking my pages a lot.  Good job!  Anyway, I've been spending some time lately getting my data scraping working, and that always involves a few trips through the bowels of data validation.

First stop is this game.  I was running the predictor when it warned me about an unusual event:  a conference game in early November.  Unusual, but it happens (often a Big5 game).  What was more surprising was that it was a team playing itself.  According to the predictor, UNC Greensboro had come up with the clever notion of scheduling a home game against itself.  Or maybe it was on the road. 

One of the challenges of predicting NCAA basketball is that every data source uses different names for teams.  To try to match them up I have lists of alternate team names:

St. Francis (NY)
1383
St. Francis BRK
St. Francis (N.Y.)
St. Francis-NY
St Francis NY
St Francis(NY)
St Francis (NY)
St. Francis Brooklyn
St. Francis NY
St. Francis-NY Terriers
St Francis (BKN)
st.-francis-(NY)-terriers
St. Francis (BKN)
(That weird-looking "1383" is the name for St. Francis (NY) in the Kaggle contest.  Because it's run by data scientists, so why use a human-readable name when you can use an arbitrary and completely useless number?)
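To give a feel for how a list like this gets used, here is a minimal sketch of an alias lookup.  It is not my actual code: the dictionary layout and the function names are assumptions made for the example, and only a few of the aliases above are included.  The idea is to squash each name down to lowercase letters and digits so that punctuation and spacing differences stop mattering.

```python
import re

# Canonical name -> alternate spellings seen in various data sources.
ALIASES = {
    "St. Francis (NY)": [
        "1383",
        "St. Francis BRK",
        "St. Francis (N.Y.)",
        "St. Francis Brooklyn",
        "St. Francis-NY Terriers",
        "St Francis (BKN)",
    ],
}


def normalize(name):
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_lookup(aliases):
    """Map every normalized spelling to its canonical team name."""
    lookup = {}
    for canonical, names in aliases.items():
        for name in [canonical, *names]:
            lookup[normalize(name)] = canonical
    return lookup


LOOKUP = build_lookup(ALIASES)


def canonical_name(raw):
    """Return the canonical team name, or None if the name is unknown."""
    return LOOKUP.get(normalize(raw))
```

With this normalization, variants such as "St Francis(NY)" and "st.-francis-(NY)-terriers" resolve without needing their own entries, while an unknown name returns `None` rather than a guess.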

In this case the predictor too aggressively (although reasonably) determined that Div III Greensboro College was a nickname for UNC Greensboro.  (By the way, my list of nicknames and the Python code that goes with it is available for the asking.  But you're on your own dealing with Greensboro vs. Greensboro.)
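A cheap sanity check catches this sort of thing: flag any game where both team names resolve to the same team.  The sketch below is a hypothetical version of that check, with a toy alias table that reproduces the Greensboro mistake; `find_self_games` and `TOY_ALIASES` are names invented for the example.

```python
def find_self_games(games, resolve):
    """Return the (home, away) pairs whose names resolve to the same team.

    `resolve` maps a raw team name to a canonical name, or None if unknown.
    """
    suspicious = []
    for home, away in games:
        home_team, away_team = resolve(home), resolve(away)
        if home_team is not None and home_team == away_team:
            suspicious.append((home, away))
    return suspicious


# Toy alias table that is too aggressive, as described above.
TOY_ALIASES = {
    "UNC Greensboro": "UNC Greensboro",
    "Greensboro": "UNC Greensboro",  # wrong: this is Div III Greensboro College
    "Elon": "Elon",
}

games = [("UNC Greensboro", "Greensboro"), ("UNC Greensboro", "Elon")]
flagged = find_self_games(games, TOY_ALIASES.get)
```

Here `flagged` contains only the Greensboro vs. Greensboro game, which is the cue to go fix the alias table.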

Next up is this game.  Looks like a perfectly reasonable WAC game.  Problem is, one of those teams was not in the WAC.  Actually, one of those teams didn't even exist.

You see, last year the University of Texas decided to merge two campuses -- the University of Texas Brownsville and the University of Texas Pan American -- to form a brand new campus, the University of Texas Rio Grande Valley.  Brownsville didn't have sports, but UT-PA was a Division I team in the WAC, so the new campus stayed in the WAC and became the "Vaqueros."

(Trivia Question:  Name the other four NCAA Division I basketball nicknames that are Spanish words.)

Well, ESPN decided the easiest way to deal with this whole business was to just go into their database and replace every instance of "University of Texas Pan American Broncs" with "UTRGV Vaqueros."  Hence the mysterious 2013 game involving a university that wouldn't exist for several more years.

Wednesday, November 18, 2015

Why I Hate Rainbows

The University of Hawaii Rainbow Warriors play their home games at a five-hour offset from the East Coast.  I don't begrudge them living in Paradise, but the peculiarity of the time zones means that ESPN often reports their games as happening the day after they were actually scheduled.

This annoyance I don't need.
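One way to work around it, assuming the scraper can get a tip-off time in UTC (an assumption; not every scoreboard page provides one), is to convert that time to the venue's zone before taking the date.  The sketch below uses a fixed UTC-10 offset, which is safe for Hawaii because it doesn't observe daylight saving time; `local_game_date` is a name made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Hawaii stays on UTC-10 all year (no daylight saving time).
HAWAII = timezone(timedelta(hours=-10), "HST")


def local_game_date(tipoff_utc, venue_tz=HAWAII):
    """Calendar date of the game as seen at the venue."""
    return tipoff_utc.astimezone(venue_tz).date()


# A 7:00 PM tip-off in Honolulu on November 13 is already
# midnight, November 14, on the East Coast (05:00 UTC).
tipoff = datetime(2015, 11, 14, 5, 0, tzinfo=timezone.utc)
game_date = local_game_date(tipoff)
```

Here `game_date` comes out as November 13, the date the game was actually scheduled in Hawaii.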

Tuesday, November 17, 2015

Kind of Amazing


That's an animation of NBA movement data, which apparently you can get via a free API.  Savvas Tjortjoglou goes into detail here.  Who knows what sort of predictive model you could build exploiting this data.  Thankfully the NCAA doesn't have anything of the sort, or I'd have to quit my day job.

Saturday, November 14, 2015

Really, ESPN?

With the first day's games done I fired up my ESPN score scraper to gather up the data and get the season started.

It crashed.

It seems like ESPN chose the first day of the season to roll out a new format (and URL scheme) for their basketball scoreboard page.

To be fair, I wrote my current (Python Scrapy-based) scraper with the help of Brandon Harris (*) and he warned me when I started down this road that ESPN was busy mucking up all their scoreboard pages.  "Oh no," said I, "it looks fine, I'm sure they won't change it at the last second."  So I have no one to blame but myself.

(*) And by help I mean he basically gave me working code.

ESPN went all Web 3.0 in their page redesign, which means that rather than send a web page, they send a bunch of JavaScript and raw data and make your web browser build the page.  (Which probably saves them millions of dollars a year in server costs, so who can blame them?)  This breaks the whole scraper paradigm, which is to XPath through the HTML to find the bits you need.  There's no HTML left to parse.  The good news in this case is that ESPN was nice enough to include the entire URL I need in the data portion of the new format, so it is very easy to do a regular expression search and pull out the good parts.  Otherwise you get into some kludgy solutions like using a headless web browser to execute the JavaScript and build the actual HTML page.  Or trying to find the mobile version of the page and hoping that's more parseable.
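For the curious, the regular expression approach looks roughly like the sketch below.  The sample page, the `example.com` URLs, and the `boxscore?gameId=` pattern are all made up for illustration; the real page embeds its data differently, so treat this as the general shape of the trick rather than something that works against ESPN as-is.

```python
import re

# Hypothetical page: no HTML to walk, just data embedded in a script tag.
SAMPLE_PAGE = """
<script>window.scoreboardData = {"events":[
  {"links":[{"href":"http://example.com/boxscore?gameId=1001"}]},
  {"links":[{"href":"http://example.com/boxscore?gameId=1002"}]}
]};</script>
"""

# Capture any quoted "href" value that looks like a box score URL.
BOXSCORE_URL = re.compile(r'"href"\s*:\s*"([^"]*boxscore\?gameId=\d+)"')


def extract_boxscore_urls(page_text):
    """Return every box score URL found in the raw page text, in order."""
    return BOXSCORE_URL.findall(page_text)


urls = extract_boxscore_urls(SAMPLE_PAGE)
```

The regex never parses the JavaScript at all; it just fishes the URLs out of the raw text, which is why it survives a redesign that leaves nothing for XPath to grab.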

I don't do anything much with the model until after a few weeks of games, so I have some time to fix my code.  And I suppose that if you want to scrape data from the Web, you'd better be prepared to deal with change.