20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN. Super Bowl Record longest interception return yards. Penalty on ARZ-E.Brown Face Mask (15 Yards) declined. The Replay Assistant challenged the runner broke the plane ruling and the play was Upheld.,7,10,2008
They forgot: for(i=0;i<92;i++){yell('edw519','GO!')}
Seriously, I had plans for the next 4 days, but I just scrapped them. Funny how jazzed I get when it's data that I can really relate to...
I've already structured my data warehouse and started the loads. (I'll probably need a whole day just to parse the text in Field 10.) Then I'm going to build a Business Intelligence system on top of it. I will finally have the proof I need that I, not the offensive coordinator, should be texting each play to Coach Tomlin.
See you guys on Monday.
EDIT: OK, I'm back, but not for long. I'm having way too much fun with this...
fleaflicker: Cool website & domain name. Thanks for the tips. I expect shortcomings in the data, but it looks like it's in a lot better shape than the usual free form enterprise quality/vendor/customer comments I usually have to parse. We'll see...
MattSayer & sjs382: I don't plan to do any analysis. I prefer to build an app that enables others to do their own analyses, answering questions that nobody else is asking. Like "Which Steeler makes the most tackles on opposing runs of more than 5 yards when it's 3rd down and longer than 8 yards to go, the temperature is below 38, and edw519 is twirling his Terrible Towel clockwise?"
jerf: Nice thought. I've spent years trying to earn enough money to buy the Pittsburgh Steelers just to fire the slackers and fumblers and win the Super Bowl every year. Maybe I should just take an easier route and solve that problem like any self-respecting hacker should: with data & logic. No Steeler game this weekend; I may have found my destiny </sarcasm>
You'll find that the text descriptions aren't consistently formatted. It's tough to extract structured data from all play descriptions.
For example, first initial plus last name does not uniquely identify a player. You'll need accurate roster data first, and even then there are clashes.
We store play data by its structured components (players involved, play type, player roles, etc) and then derive the text description. This allows us to reassemble pbp data from different pro games to show a "feed" for your fantasy team.
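A minimal sketch of that approach, with invented field names (the actual schema isn't described in the comment): store the structured components and generate the description text on demand, rather than parsing it back out later.

```python
from dataclasses import dataclass

@dataclass
class Play:
    """One play stored by its structured components (hypothetical schema)."""
    play_type: str      # e.g. "pass", "run", "interception"
    passer: str
    target: str
    defender: str
    spot: str           # field position, e.g. "NYG 29"

    def description(self) -> str:
        # Derive the text description from the components, not vice versa.
        if self.play_type == "interception":
            return (f"{self.passer} pass intended for {self.target} "
                    f"INTERCEPTED by {self.defender} at {self.spot}.")
        raise NotImplementedError(self.play_type)

play = Play("interception", "K.Collins", "T.Barber", "T.Parrish", "NYG 29")
print(play.description())
```

Because the description is derived, plays from different games can be reassembled into a single fantasy-team "feed" without any text parsing.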
Baseball has a smaller set of play outcomes/transitions, so it's easier to model this way. As your example from the Steelers' Super Bowl shows, football plays can be very complex.
"It's tough to extract structured data from all play descriptions."
Which means you can treat it a bit like a text mining program. NASA had a text mining contest in 2007 as part of the SIAM conference on data mining which was really similar - instead of football plays it was textual descriptions of aeronautics incident reports and their classification. There were several papers that came out of that (I was with a group that did one of them, using an approximate nonnegative matrix classification approach - got beat out by some ensemble approaches).
Anyway - if you'd like to do something with unstructured football play descriptions, text mining might be able to empower you to some extent without going through a full manual analysis, and those papers could be a good starting point. I think some of them ended up in a volume titled _Survey of Text Mining II_.
Incredibly interested in your work here. For small-dimensional problems (or problems with features that can be engineered to be small-dimensional), ensemble methods through random forests and bagging and the like are incredibly useful.
But for high-dimensional text problems that're pure classification, I tend to rely simply on 1NN classifiers (against a single centroid of training data of a target category, of which there tend to be many). I've spent a lot of time with NMF, for its potential as an incredibly interesting data-exploration tool ("There's a pronoun cluster! There's a Spanish cluster! There's a 404 Error axis!") or low-dimension projection step. I've even spent a good amount of time on implementing the algorithm in a number of memory-efficient ways.
Could you expand a bit on how you used NMF for these problems in practice (similar to how a sparse autoencoder captures reduced-dimensional features en route to supervised learning), or how others used ensemble methods?
Afraid it's been a while, and I wasn't really at the core of the project design - if you're REALLY interested, look up _Anomaly Detection Using Nonnegative Matrix Factorization_ and contact Michael W. Berry (who, I assume, still teaches at the University of Tennessee, Knoxville).
The main idea, though, is to generate a term-by-document matrix (count words, maybe throw out stopwords, normalize counts), then do Math to factor your matrix (approximately) into two: term-by-feature and feature-by-document. When you want to classify a new document, you can use its contents (more terms) to calculate a feature vector.
(The math seems to typically involve random initialization followed by iterative improvements. Other work in the field discusses the specifics.)
The matrices are "nonnegative" because, conceptually, features are a _positive_ thing, and you can't say that a certain term makes something less a member of a feature cluster (only more).
The tricky part is figuring out how to map features to things which are semantically interesting to your application, and I don't want to comment too much on the state of that because it's been five years and I honestly forgot what exactly we did there, and it was all done in Matlab (which I'd never used before), and there's probably more recent work in the field. But if you fiddle with it manually, you can come up with your matrices and essentially have a nice little classifier.
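The factorization described above can be sketched in a few lines of NumPy. This is a toy version using the classic multiplicative-update rules (random nonnegative initialization, then iterative improvement); the matrix sizes and counts here are made up for illustration.

```python
import numpy as np

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximately factor a nonnegative term-by-document matrix V (m x n)
    into term-by-feature W (m x k) and feature-by-document H (k x n),
    using multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps   # random nonnegative initialization
    H = rng.random((k, n)) + eps
    for _ in range(iters):          # iterative improvement
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Tiny term-by-document count matrix: 4 terms x 3 documents.
V = np.array([[3., 0., 1.],
              [2., 0., 0.],
              [0., 4., 1.],
              [0., 3., 0.]])
W, H = nmf(V, k=2)
print(np.linalg.norm(V - W @ H))  # Frobenius reconstruction error
```

To classify a new document, project its term counts onto the learned term-by-feature matrix W to get a feature vector, then compare it to the feature vectors of labeled documents.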
Also, I went to compare this PBP to both ESPN and CBS and found that both have the exact same PBP data, which is interesting because it seems that they got this data directly from the NFL (or from the same source, at least). I guess this makes sense, but it's something I hadn't considered.
It looks like the same format they use for NFL Game Rewind too. I would guess that there is an official syntax and the data is provided by the NFL, because if not you would have all types of formats and opinions about the game baked into each team's data. I would also guess that the same office that keeps records (game, individual, all-time, etc) is the one that keeps the play-by-play too.
Overall this is neat but it's hard to find real life context within this data. Was the QB pressured, was a coverage blown, was there a pre-snap audible or motion or change by the defense, what was the formation, how much sleep did the players get the night before, etc etc.
@edw519, Awesome! Thank you for sharing what you're doing here with the data. I was thinking of playing with this dataset too, and since I'm new to this field (data), I look forward to learning from you if you post more info in the future!
One of the "new" ways that Burke (the data creator) et al. are using this type of data is finding the Expected Points Added for each play. EPA allows one to determine how valuable players are to a team's performance.
I've been trying to work at the college football level with this same strategy, but I'm still trying to figure out how it's calculated. It seems trivial, but it takes a lot of data organizing.
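The core arithmetic of EPA is simple once you have an expected-points model: EPA = EP(state after the play) - EP(state before the play). The sketch below uses a stand-in lookup table; real EP values come from fitting historical drive outcomes to (down, distance, field position), which is where the data organizing comes in.

```python
# Hypothetical expected-points table, keyed by
# (down, yards_to_go, yards_from_opponent_goal). Real values are fit
# from thousands of historical drives; these numbers are illustrative.
EP_TABLE = {
    (1, 10, 80): 0.5,   # 1st-and-10 at your own 20
    (1, 10, 50): 2.0,   # 1st-and-10 at midfield
    (1, 10, 20): 4.0,   # 1st-and-10 at the opponent's 20
}

def epa(before, after):
    """Expected Points Added for one play, given before/after states."""
    return EP_TABLE[after] - EP_TABLE[before]

# A 30-yard completion from your own 20 to midfield:
print(epa((1, 10, 80), (1, 10, 50)))  # 1.5
```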
Looking through the 2002 season, there's an oddity around touchdowns and extra points. It seems that the 6 points for the touchdown are bundled with the extra point, and the score is not updated until the extra point is complete.
It seems this might result in bugs, as in the Oct 20, 2002 game between Dallas and Arizona. In the third quarter, with a score of Arizona 6 - Dallas 0, Dallas scored a touchdown (row 13900) but "aborted" the extra point (row 13901). The 6 points for the Cowboys are not recorded in the data.
The game eventually went to overtime, with the Cardinals kicking a winning field goal in OT for a final score of Arizona 9 - Dallas 6, but the data here records it as Arizona 6 - Dallas 0.
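One way to hunt for more instances of this bug is to scan for touchdowns whose six points never materialize in the team's later score column. This is a rough sketch with a made-up row layout of (team, description, team_score); the real CSV columns differ.

```python
def flag_missing_touchdowns(rows):
    """Return indexes of touchdown rows where the scoring team's recorded
    score never rises by at least 6 afterwards (hypothetical row layout)."""
    flagged = []
    for i, (team, desc, score) in enumerate(rows):
        if "TOUCHDOWN" not in desc:
            continue
        later_scores = [s for t, d, s in rows[i + 1:] if t == team]
        if not any(s >= score + 6 for s in later_scores):
            flagged.append(i)
    return flagged

# Toy reconstruction of the Dallas sequence described above:
rows = [
    ("DAL", "E.Smith run TOUCHDOWN.", 0),
    ("DAL", "Extra point ABORTED.", 0),   # score never updated
    ("ARI", "Kickoff return.", 0),
]
print(flag_missing_touchdowns(rows))  # [0]
```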
Most of my team data comes from open online sources such as espn.com, nfl.com, myway.com, and yahoo.com. It's easy for anyone to grab whatever they're interested in from those sites.
My play-by-play data comes from a source that's not publicly available, and at this time I regret that I cannot share it. However, I am working hard to develop a way to spread the wealth. One of my biggest goals is to help create a larger, more open, and more collaborative community for football research.
----
There are no real terms of service, so I'm curious about the constraints on using this for commercial purposes. I most definitely want to use this for teaching purposes (how to text-mine, how to build a web app from data, etc) but want to know under what terms the data can be redistributed.
IANAL, but it has been ruled that NFL player names and statistics are protected by the First Amendment, i.e., no one "owns" them and anyone is free to use them for any purpose.
However, you do have to get the data, and unauthorized access of computers (which constitutes trespassing) can be a legal gray area. I'd love to hear a lawyer weigh in on the legality of scraping the data directly from espn.com.
Last time I checked, the play by play data on espn.com was pretty error-ridden. This was three or so years ago, so it might have changed, and I was hypothetically interested in the score columns, so it may not matter depending on other hypothetical uses. But I'd hypothetically avoid scraping ESPN for that reason alone.
This seems like as good a time as any to share something I've been working on which uses the same source data, even though it's pretty rough at the moment (slow, bad data, only currently goes through week 8 of 2012, etc.):
I'm frankly surprised that this information is allowed to be distributed. I spent a while in the financial services industry, and while it was really easy to obtain "public" information like stock quote data, I recall that we weren't allowed to simply scrape data from public sites... we had to pay a license fee to get a feed of the data if we were planning on repackaging & distributing it.
It seems to me that the NFL would want to have exclusive rights to distribute this data and charge people a fee for access to it. Clearly I'm no expert in these legal affairs though.
Generally speaking, stats are public domain because they describe a public event that occurred. That a sports league may disagree with this position doesn't make it untrue. However, it's entirely possible to violate a given site's TOU by scraping the data; that doesn't mean the data itself can't be compiled or distributed.
IANAL, but I worked at ESPN and founded Fanvibe (YC S'10), and worked quite a bit with the leagues and lawyers on rights-related topics.
IANAL, but I asked one about this a while ago; let's see if I can remember: It's complicated. The NFL broadcasts are copyrighted, and come with a statement that (among other things) distributing descriptions of the game is not allowed. Such descriptions could be considered a derivative work.
On the other hand, a live performance is generally not protected by copyright, so if you attend a live game to collect the data, you may be in the clear.
The data isn't owned by the NFL, but all recordings of the games are, and so any data obtained by watching recordings of the games could potentially be controlled by the NFL.
It might not even violate the NFL's copyright if extracted from tapes. For one thing, something is only a "derivative work" for copyright purposes if it's a "creative work" subject to copyright at all, and in the U.S., data sets comprising factual information aren't typically considered "creative". For another, it's not clear whether data about a recording is derivative of the recording for copyright purposes. For example, a re-edit or mash-up of a film is clearly a derivative work, but is a count of how many minutes each character speaks? Is a Spotify-style algorithmic analysis of a song's musical style a derivative work of the song?
I wouldn't want to put a large bet on where exactly those lines are drawn, though.
IANAL too, but the NFL lost a relatively recent court case regarding fantasy football that, as far as I am aware, made the league's statistics public knowledge as long as you compile the information yourself. Therefore the main legal issue with this data would be the source.
IANAL, but there was a landmark case where the NBA sued Motorola and STATS Inc. for distributing live game statistics. The ruling ended up in favor of STATS, where the decision was pure facts could not be copyrighted.
The NFL is not nearly as strict as MLB (note how you never see an MLB highlight on YouTube?), yet MLB allows http://retrosheet.org/ to exist. I don't believe it's technically "dissemination".
MLB is extremely strict, but it's an exaggeration to say "you never see highlights on YouTube"; I've watched them there many times. Likely either before they got taken down or because they were too small to care about, but there are still plenty on there. A five-second search brought up https://www.youtube.com/watch?v=OZW7448mh94 right away as a quick example.
Fun fact: In the UK even the FA Premier League fixture list is copyrighted, and websites or publications wishing to publish all or part of the fixture list for a given season need to pay a licence fee.
I can't even give you a list of football fixtures coming up this weekend without breaking copyright law.
Here is an idea: build a predictive model of an offensive coach that predicts the play he will call, given a game situation (and based on that, build a predictiveness quotient for a coach).
It doesn't work like that in practice. Football is very dependent on matchups. Coaches will vary gameplans from week-to-week to exploit weaknesses they see on film.
Matchup would be a part of the model. My experience with predictive modeling in various domains has taught me that people tend to underestimate how predictive they are (NFL offensive/defensive coaches are no exception).
I'm interested in doing some predictive modeling for a couple of project ideas I've been kicking around. Are there any specific resources you would recommend as good starter material?
If only each play call had a single potential outcome, and that outcome were always realized. Using these stats for predictions would seem extremely difficult beyond answering, "will it be a run or a pass?"
If you were the defensive coordinator on the opposing team, knowing the answer to "run or pass" with a high degree of certainty would give you a pretty large advantage.
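Even a crude baseline illustrates how you'd start on the "run or pass" question: bucket each play by situation and predict the offense's most frequent call in that situation. This is a minimal pure-Python sketch, not a serious model; real versions would add matchup, personnel, score, and clock features.

```python
from collections import Counter, defaultdict

class PlayCallModel:
    """Frequency-based play-call predictor over (down, distance) buckets."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def situation(self, down, to_go):
        # Coarse bucketing: short (<= 3 yards to go) vs long.
        return (down, "short" if to_go <= 3 else "long")

    def observe(self, down, to_go, call):
        self.counts[self.situation(down, to_go)][call] += 1

    def predict(self, down, to_go):
        c = self.counts[self.situation(down, to_go)]
        # Default to "pass" for unseen situations.
        return c.most_common(1)[0][0] if c else "pass"

model = PlayCallModel()
for call in ["run", "run", "pass"]:
    model.observe(3, 2, call)       # three observed 3rd-and-short plays
print(model.predict(3, 1))          # "run"
```

Measuring how often such a model beats a coin flip for a given coach would itself be a decent "predictiveness quotient".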
In his early Stanford days, Bill Walsh had already cracked the code on how un-random football coaches (and almost all people) are. From "Controlling the Ball with the Passing Game":
"We know that if they don't blitz one down, they're going to blitz the next down. Automatically. When you get down in there, every other play. They'll seldom blitz twice in a row, but they'll blitz every other down. If we go a series where there haven't been blitzes on the first two downs, here comes the safety blitz on third down."
Most NFL offenses tend to alternate rather than randomize. Walsh knew defenses were just as predictable decades ago.
There are too many missing variables, the most obvious being who made the actual play call: the head coach, the offensive coordinator, or the quarterback.
The CSV file format is nice, but if you're looking for a Python API to play with NFL stats without having to parse play-data fields, check out nflgame [1]. I've written up a quick primer. [2] It also includes the ability to get play-by-play statistics live.
That sounds like the dictionary definition of the gambler's fallacy. If anything, the odds of an interception likely increase after a previous interception as it would be a sign of a defensive advantage over the QB. If only there was somewhere we could get the data to figure out for sure...
Here's some soccer data, doesn't include play-by-play though (soccer generally isn't suited to that kind of breakdown, although Opta Sports do track it).
The comments on that are awesome too - great advice for parsing, categorizing, and such. I couldn't download 2010 though - "Sorry, we are unable to generate a view of the document at this time. Please try again later."
Amazing!!! Thanks to www.advancednflstats.com for doing all the leg-work. Highly recommend their site too. Their in-game win probability statistics are always a must-have for me on game-day ^_^
Could just be my imagination, but I felt like this season I saw many more teams go for it on 4th down (maybe some of this data can prove or disprove that?). Perhaps the impact of the above, or similar, analyses is finally being seen.
This looks like great fun...Judging by some of the sample entries, it will also be an instructive example of the limitations of CSV and why serious analysts who want to work with unstructured data need to know a scripting language, or at least regexes.
Sample description field:
> 20020905_SF@NYG,1,59,20,NYG,SF,3,11,81,(14:20) (Shotgun) K.Collins pass intended for T.Barber INTERCEPTED by T.Parrish (M.Rumph) at NYG 29. T.Parrish to NYG 23 for 6 yards (T.Barber).,0,0,2002
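For a field like that, one regex per common sentence pattern gets you surprisingly far. Here's a hedged sketch handling the interception pattern from the sample above; the real data needs many more patterns, and (as noted elsewhere in the thread) the phrasing drifts across seasons.

```python
import re

# One pattern among many: "X pass [short middle] intended for Y
# INTERCEPTED by Z (...) at TEAM NN". Names are matched as "K.Collins"-style
# initial-dot-surname tokens, which (as noted above) are not unique IDs.
INT_RE = re.compile(
    r"(?P<passer>\w\.\w+) pass (?:\w+ \w+ )?intended for (?P<target>\w\.\w+) "
    r"INTERCEPTED by (?P<picker>\w\.\w+).* at (?P<spot>[A-Z]+ \d+)"
)

desc = ("(14:20) (Shotgun) K.Collins pass intended for T.Barber "
        "INTERCEPTED by T.Parrish (M.Rumph) at NYG 29.")
m = INT_RE.search(desc)
print(m.groupdict())
```

Expect a long tail of edge cases (laterals, penalties, replay reviews); the regexes are a starting point, not a parser.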
In the comments section of the OP, someone posted this sample Excel function:
The Excel function looks ridiculous, but it probably didn't take more than 10 minutes to make, tops. Nested conditionals are easy.
At any rate, what would you recommend most to accomplish the task? I'm learning Python and know R a bit, so I was just wondering how I was going to go about combing through the data.
I think the point is less "can this be done in X minutes?" and more "is there an easier way to do this?" This, at least, is what frustrates me about people who are Excel fanatics but refuse to learn a couple days' worth of Python, Ruby, VBA, whatever. Nested conditionals might be easy, but a case statement is vastly easier to write and understand. There are lots of people who would say "oh I could never program" and then write Excel functions too complex for me to understand.
Python or Ruby is fine...the main trick is to be able to process those fields with regular expressions...which, IIRC, requires throwing in VBscript if you were to handle it solely in Excel.
Python and Ruby would also allow for more elegant-looking -- i.e. more maintainable -- functions to handle that field.
R is fine too: it has regular expressions and probably excels if you plan to do statistics using all of that data. Python seems to have reasonable statistics functionality as well (with pandas, etc) but I haven't used it personally.
Thanks for the note. I've not really looked into how one would do that with R (doing it in Python seems more clear), but am checking it out now. If anyone else is looking I'm finding this PDF helpful: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenR...
God this is such interesting stuff. How do we still not have a fully featured open source NFL stats-rosters-game charting API? Who wouldn't want to contribute to that project?
Other than cool data visualization stuff, the obvious implication is the potential to devise a profitable system to pick games against the spread. The guys at Football Outsiders have done a decent job at it and made a proprietary algorithm that picked games at 58% this year (which is over the threshold you need to be profitable in Vegas). But even those guys are still having some trouble getting access to and aggregating the data in a usable format.
I really want to sit down and start playing around with some of this data so I appreciate you putting this together for everyone. The NFL needs an open source API and this is definitely a step in the right direction.
I believe my library, nflgame [1], would fit the bill. Features all play-by-play data back to 2009, and includes the ability to track play-by-play data live.
A bit OT, but I thought this might be a good opportunity to mention the upcoming SportsHackDay in Seattle from Feb 1-3 which culminates in a group viewing of the SuperBowl.
http://sportshackday.com/
Need a bounds check or two in there. Tried setting the number of yards you need to 1 and the yards away from the endzone to 99 and it threw up a nice exception. Cool calculator though!
I've used data from Brian Burke's site before. I think it's the exact PBP data the NFL has, but you'll find that the structure and common phrasings change over the years. I had to write a lot of regular expressions and I was still catching edge cases for weeks.
btw, pro-football-reference has pbp data now too, and it probably goes back a lot further, but I think they discourage mass scraping of their site.
There is a lot to have fun with here. I would imagine though that in a lot of NFL coaching rooms there has to be a balance between coaching and analysis.
As a Frenchman not interested in sports at all, this would have made no sense to me before I watched the TV series The League [1]. Now I kind of enjoy the fact that these stats exist and are available in an open format, even if I don't really care myself.
If you live in the vicinity of Seattle, there is a sports-themed hackathon going on Superbowl weekend. Google, ESPN and a bunch of tech companies are sponsoring. The grand prize will be passes to the Sloan Sports Conference. More details to come:
It would be interesting to take this data and build an app around it for fantasy football. If you have all the tendencies, and how players like your player have played against certain teams, you could make better guesses on who to play.
I did something like that for an AI class in college. We used FuzzyCLIPS to write an expert system for drafting fantasy football teams. A ruby script pulled the CSV data from some other site that had the previous three years of NFL data, and then converted the CSV to fact files which the system then read in.
When all was said and done it worked, but made some pretty crappy draft picks! I should find that code....