20090201_PIT@ARI,2,30,18,ARI,PIT,1,1,1,(:18) (Shotgun) K.Warner pass short middle intended for A.Boldin INTERCEPTED by J.Harrison at PIT 0. J.Harrison for 100 yards TOUCHDOWN. Super Bowl Record longest interception return yards. Penalty on ARZ-E.Brown Face Mask (15 Yards) declined. The Replay Assistant challenged the runner broke the plane ruling and the play was Upheld.,7,10,2008
They forgot: for(i=0;i<92;i++){yell('edw519','GO!')}
Seriously, I had plans for the next 4 days, but I just scrapped them. Funny how jazzed I get when it's data that I can really relate to...
I've already structured my data warehouse and started the loads. (I'll probably need a whole day just to parse the text in Field 10.) Then I'm going to build a Business Intelligence system on top of it. I will finally have the proof I need that I, not the offensive coordinator, should be texting each play to Coach Tomlin.
See you guys on Monday.
EDIT: OK, I'm back, but not for long. I'm having way too much fun with this...
fleaflicker: Cool website & domain name. Thanks for the tips. I expect shortcomings in the data, but it looks like it's in a lot better shape than the usual free form enterprise quality/vendor/customer comments I usually have to parse. We'll see...
MattSayer & sjs382: I don't plan to do any analysis. I prefer to build an app that enables others to do their own analyses, answering questions that nobody else is asking. Like "Which Steeler makes the most tackles on opposing runs of more than 5 yards when it's 3rd down and longer than 8 yards to go, the temperature is below 38, and edw519 is twirling his Terrible Towel clockwise?"
jerf: Nice thought. I've spent years trying to earn enough money to buy the Pittsburgh Steelers just to fire the slackers and fumblers and win the Super Bowl every year. Maybe I should just take an easier route and solve that problem like any self-respecting hacker should: with data & logic. No Steeler game this weekend; I may have found my destiny </sarcasm>
You'll find that the text descriptions aren't consistently formatted. It's tough to extract structured data from all play descriptions.
For example, first initial plus last name does not uniquely identify a player. You'll need accurate roster data first, and even then there are clashes.
We store play data by its structured components (players involved, play type, player roles, etc) and then derive the text description. This allows us to reassemble pbp data from different pro games to show a "feed" for your fantasy team.
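A minimal sketch of that approach, with invented field names (the actual schema isn't described in the comment): store the structured components and generate the description text on demand, rather than parsing it back out later.

```python
from dataclasses import dataclass

@dataclass
class Play:
    """One play stored by its structured components (hypothetical schema)."""
    play_type: str      # e.g. "pass", "run", "interception"
    passer: str
    target: str
    defender: str
    spot: str           # field position, e.g. "NYG 29"

    def description(self) -> str:
        # Derive the text description from the components, not vice versa.
        if self.play_type == "interception":
            return (f"{self.passer} pass intended for {self.target} "
                    f"INTERCEPTED by {self.defender} at {self.spot}.")
        raise NotImplementedError(self.play_type)

play = Play("interception", "K.Collins", "T.Barber", "T.Parrish", "NYG 29")
print(play.description())
```

Because the description is derived, plays from different games can be reassembled into a single fantasy-team "feed" without any text parsing.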
Baseball has a smaller set of play outcomes/transitions, so it's easier to model this way. As your example from the Steelers' Super Bowl shows, football plays can be very complex.
"It's tough to extract structured data from all play descriptions."
Which means you can treat it a bit like a text mining program. NASA had a text mining contest in 2007 as part of the SIAM conference on data mining which was really similar - instead of football plays it was textual descriptions of aeronautics incident reports and their classification. There were several papers that came out of that (I was with a group that did one of them, using an approximate nonnegative matrix classification approach - got beat out by some ensemble approaches).
Anyway - if you'd like to do something with unstructured football play descriptions, text mining might be able to empower you to some extent without going through a full manual analysis, and those papers could be a good starting point. I think some of them ended up in a volume titled _Survey of Text Mining II_.
Incredibly interested in your work here. For small-dimensional problems (or problems with features that can be engineered to be small-dimensional), ensemble methods through random forests and bagging and the like are incredibly useful.
But for high-dimensional text problems that're pure classification, I tend to rely simply on 1NN classifiers (against a single centroid of training data of a target category, of which there tend to be many). I've spent a lot of time with NMF, for its potential as an incredibly interesting data-exploration tool ("There's a pronoun cluster! There's a Spanish cluster! There's a 404 Error axis!") or low-dimension projection step. I've even spent a good amount of time on implementing the algorithm in a number of memory-efficient ways.
Could you expand a bit on how you used NMF for these problems in practice (similar to how a sparse autoencoder captures reduced-dimensional features en route to supervised learning), or how others used ensemble methods?
Afraid it's been a while, and I wasn't really at the core of the project design - if you're REALLY interested, look up _Anomaly Detection Using Nonnegative Matrix Factorization_ and contact Michael W. Berry (who, I assume, still teaches at the University of Tennessee, Knoxville).
The main idea, though, is to generate a term-by-document matrix (count words, maybe throw out stopwords, normalize counts), then do Math to factor your matrix (approximately) into two: term-by-feature and feature-by-document. When you want to classify a new document, you can use its contents (more terms) to calculate a feature vector.
(The math seems to typically involve random initialization followed by iterative improvements. Other work in the field discusses the specifics.)
The matrices are "nonnegative" because, conceptually, features are a _positive_ thing, and you can't say that a certain term makes something less a member of a feature cluster (only more).
The tricky part is figuring out how to map features to things which are semantically interesting to your application, and I don't want to comment too much on the state of that because it's been five years and I honestly forgot what exactly we did there, and it was all done in Matlab (which I'd never used before), and there's probably more recent work in the field. But if you fiddle with it manually, you can come up with your matrices and essentially have a nice little classifier.
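The factorization described above can be sketched in a few lines of NumPy. This is a toy version using the classic multiplicative-update rules (random nonnegative initialization, then iterative improvement); the matrix sizes and counts here are made up for illustration.

```python
import numpy as np

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximately factor a nonnegative term-by-document matrix V (m x n)
    into term-by-feature W (m x k) and feature-by-document H (k x n),
    using multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps   # random nonnegative initialization
    H = rng.random((k, n)) + eps
    for _ in range(iters):          # iterative improvement
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Tiny term-by-document count matrix: 4 terms x 3 documents.
V = np.array([[3., 0., 1.],
              [2., 0., 0.],
              [0., 4., 1.],
              [0., 3., 0.]])
W, H = nmf(V, k=2)
print(np.linalg.norm(V - W @ H))  # Frobenius reconstruction error
```

To classify a new document, project its term counts onto the learned term-by-feature matrix W to get a feature vector, then compare it to the feature vectors of labeled documents.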
Also, I went to compare this PBP to both ESPN and CBS and found that both have the exact same PBP data, which is interesting because it seems that they got this data directly from the NFL (or from the same source, at least). I guess this makes sense, but it's something I hadn't considered.
It looks like the same format they use for NFL Game Rewind too. I would guess that there is an official syntax and the data is provided by the NFL, because if not you would have all types of formats and opinions about the game baked into each team's data. I would also guess that the same office that keeps records (game, individual, all-time, etc) is the one that keeps the play-by-play too.
Overall this is neat but it's hard to find real life context within this data. Was the QB pressured, was a coverage blown, was there a pre-snap audible or motion or change by the defense, what was the formation, how much sleep did the players get the night before, etc etc.
@edw519, Awesome! Thank you for sharing what you're doing here with the data. I was thinking of playing with this dataset too, and since I'm new to this field (data), I look forward to learning from you if you post more info in the future!
One of the "new" ways that Burke (the data creator) et al. are using this type of data is finding the Expected Points Added for each play. EPA allows one to determine how valuable players are to a team's performance.
I've been trying to work at the college football level with this same strategy, but I'm still trying to figure out how it's calculated. It seems trivial, but it takes a lot of data organizing.
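The core arithmetic of EPA is simple once you have an expected-points model: EPA = EP(state after the play) - EP(state before the play). The sketch below uses a stand-in lookup table; real EP values come from fitting historical drive outcomes to (down, distance, field position), which is where the data organizing comes in.

```python
# Hypothetical expected-points table, keyed by
# (down, yards_to_go, yards_from_opponent_goal). Real values are fit
# from thousands of historical drives; these numbers are illustrative.
EP_TABLE = {
    (1, 10, 80): 0.5,   # 1st-and-10 at your own 20
    (1, 10, 50): 2.0,   # 1st-and-10 at midfield
    (1, 10, 20): 4.0,   # 1st-and-10 at the opponent's 20
}

def epa(before, after):
    """Expected Points Added for one play, given before/after states."""
    return EP_TABLE[after] - EP_TABLE[before]

# A 30-yard completion from your own 20 to midfield:
print(epa((1, 10, 80), (1, 10, 50)))  # 1.5
```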
Looking through the 2002 season, there's an oddity around touchdowns and extra points. It seems that the 6 points for the touchdown are bundled with the extra point, and the score is not updated until the extra point is complete.
It seems this might result in bugs, as in the Oct 20, 2002 game between Dallas and Arizona. In the third quarter, with a score of Arizona 6 - Dallas 0, Dallas scored a touchdown (row 13900) but "aborted" the extra point (row 13901). The 6 points for the Cowboys are not recorded in the data.
The game eventually went to overtime, with the Cardinals kicking a winning field goal in OT for a final score of Arizona 9 - Dallas 6, but the data here records it as Arizona 6 - Dallas 0.
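One way to hunt for more instances of this bug is to scan for touchdowns whose six points never materialize in the team's later score column. This is a rough sketch with a made-up row layout of (team, description, team_score); the real CSV columns differ.

```python
def flag_missing_touchdowns(rows):
    """Return indexes of touchdown rows where the scoring team's recorded
    score never rises by at least 6 afterwards (hypothetical row layout)."""
    flagged = []
    for i, (team, desc, score) in enumerate(rows):
        if "TOUCHDOWN" not in desc:
            continue
        later_scores = [s for t, d, s in rows[i + 1:] if t == team]
        if not any(s >= score + 6 for s in later_scores):
            flagged.append(i)
    return flagged

# Toy reconstruction of the Dallas sequence described above:
rows = [
    ("DAL", "E.Smith run TOUCHDOWN.", 0),
    ("DAL", "Extra point ABORTED.", 0),   # score never updated
    ("ARI", "Kickoff return.", 0),
]
print(flag_missing_touchdowns(rows))  # [0]
```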
Most of my team data comes from open online sources such as espn.com, nfl.com, myway.com, and yahoo.com. It's easy for anyone to grab whatever they're interested in from those sites.
My play-by-play data comes from a source that's not publicly available, and at this time I regret that I cannot share it. However, I am working hard to develop a way to spread the wealth. One of my biggest goals is to help create a larger, more open, and more collaborative community for football research.
----
There are no real terms of service, so I'm curious about the constraints on using this for commercial purposes. I most definitely want to use this for teaching purposes (how to text-mine, how to build a web app from data, etc) but want to know under what terms the data can be redistributed.
IANAL, but it has been ruled that NFL player names and statistics are protected by the First Amendment, i.e., no one "owns" them and anyone is free to use them for any purpose.
However, you do have to get the data, and unauthorized access of computers (which constitutes trespassing) can be a legal gray area. I'd love to hear a lawyer weigh in on the legality of scraping the data directly from espn.com.
Last time I checked, the play by play data on espn.com was pretty error-ridden. This was three or so years ago, so it might have changed, and I was hypothetically interested in the score columns, so it may not matter depending on other hypothetical uses. But I'd hypothetically avoid scraping ESPN for that reason alone.
This seems like as good a time as any to share something I've been working on which uses the same source data, even though it's pretty rough at the moment (slow, bad data, only currently goes through week 8 of 2012, etc.):
I'm frankly surprised that this information is allowed to be distributed. I spent a while in the financial services industry, and while it was really easy to obtain "public" information like stock quote data, I recall that we weren't allowed to simply scrape data from public sites... we had to pay a license fee to get a feed of the data if we were planning on repackaging & distributing it.
It seems to me that the NFL would want to have exclusive rights to distribute this data and charge people a fee for access to it. Clearly I'm no expert in these legal affairs though.
Generally speaking, stats are public domain because they describe a public event that occurred. That a sports league may disagree with this position doesn't make it untrue. However, it's entirely possible to violate a given site's TOU by scraping the data; that doesn't mean the data itself can't be compiled or distributed.
IANAL, but I worked at ESPN and founded Fanvibe (YC S'10), and worked quite a bit with the leagues and lawyers on rights-related topics.
IANAL, but I asked one about this a while ago; let's see if I can remember: It's complicated. The NFL broadcasts are copyrighted, and come with a statement that (among other things) distributing descriptions of the game is not allowed. Such descriptions could be considered a derivative work.
On the other hand, a live performance is generally not protected by copyright, so if you attend a live game to collect the data, you may be in the clear.
The data isn't owned by the NFL, but all recordings of the games are, and so any data obtained by watching recordings of the games could potentially be controlled by the NFL.
It might not even violate the NFL's copyright if extracted from tapes. For one thing, something is only a "derivative work" for copyright purposes if it's a "creative work" subject to copyright at all, and in the U.S., data sets comprising factual information aren't typically considered "creative". For another, it's not clear whether data about a recording is derivative of the recording for copyright purposes. For example, a re-edit or mash-up of a film is clearly a derivative work, but is a count of how many minutes each character speaks? Is a Spotify-style algorithmic analysis of a song's musical style a derivative work of the song?
I wouldn't want to put a large bet on where exactly those lines are drawn, though.
IANAL too, but the NFL lost a relatively recent court case regarding fantasy football that, as far as I am aware, made the league's statistics public knowledge as long as you compile the information yourself. Therefore the main legal issue with this data would be the source.
IANAL, but there was a landmark case where the NBA sued Motorola and STATS Inc. for distributing live game statistics. The ruling ended up in favor of STATS, where the decision was pure facts could not be copyrighted.
The NFL is not nearly as strict as MLB (note how you never see an MLB highlight on YouTube?), yet MLB allows http://retrosheet.org/ to exist. I don't believe it's technically "dissemination".
MLB is extremely strict, but it's an exaggeration to say "you never see highlights on YouTube"; I've watched them there many times. Likely either before they got taken down or because they were too small to care about, but there are still plenty on there. A five-second search brought up https://www.youtube.com/watch?v=OZW7448mh94 right away as a quick example.
Fun fact: In the UK even the FA Premier League fixture list is copyrighted, and websites or publications wishing to publish all or part of the fixture list for a given season need to pay a licence fee.
I can't even give you a list of football fixtures coming up this weekend without breaking copyright law.
Here is an idea: build a predictive model of an offensive coach that predicts the play he will call, given a game situation (and based on that, build a predictiveness quotient for a coach).
It doesn't work like that in practice. Football is very dependent on matchups. Coaches will vary gameplans from week-to-week to exploit weaknesses they see on film.
Matchup would be a part of the model. My experience with predictive modeling in various domains has taught me that people tend to underestimate how predictive they are (NFL offensive/defensive coaches are no exception).
I'm interested in doing some predictive modeling for a couple of project ideas I've been kicking around. Are there any specific resources you would recommend as good starter material?
If only each play call had a single potential outcome, and that outcome were always realized. Using these stats for predictions would seem extremely difficult beyond answering, "will it be a run or a pass?"
If you were the defensive coordinator on the opposing team, knowing the answer to "run or pass" with a high degree of certainty would give you a pretty large advantage.
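Even a crude baseline illustrates how you'd start on the "run or pass" question: bucket each play by situation and predict the offense's most frequent call in that situation. This is a minimal pure-Python sketch, not a serious model; real versions would add matchup, personnel, score, and clock features.

```python
from collections import Counter, defaultdict

class PlayCallModel:
    """Frequency-based play-call predictor over (down, distance) buckets."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def situation(self, down, to_go):
        # Coarse bucketing: short (<= 3 yards to go) vs long.
        return (down, "short" if to_go <= 3 else "long")

    def observe(self, down, to_go, call):
        self.counts[self.situation(down, to_go)][call] += 1

    def predict(self, down, to_go):
        c = self.counts[self.situation(down, to_go)]
        # Default to "pass" for unseen situations.
        return c.most_common(1)[0][0] if c else "pass"

model = PlayCallModel()
for call in ["run", "run", "pass"]:
    model.observe(3, 2, call)       # three observed 3rd-and-short plays
print(model.predict(3, 1))          # "run"
```

Measuring how often such a model beats a coin flip for a given coach would itself be a decent "predictiveness quotient".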
In his early Stanford days, Bill Walsh had already cracked the code on how un-random football coaches (and almost all people) are. From "Controlling the Ball with the Passing Game":
"We know that if they don't blitz one down, they're going to blitz the next down. Automatically. When you get down in there, every other play. They'll seldom blitz twice in a row, but they'll blitz every other down. If we go a series where there haven't been blitzes on the first two downs, here comes the safety blitz on third down."
Most NFL offenses tend to alternate rather than randomize. Walsh knew defenses were just as predictable decades ago.
There are too many missing variables, the most obvious being who made the actual play call: the head coach, the offensive coordinator, or the quarterback.
The CSV file format is nice, but if you're looking for a Python API to play with NFL stats without having to parse play-data fields, check out nflgame [1]. I've written up a quick primer. [2] It also includes the ability to get play-by-play statistics live.
That sounds like the dictionary definition of the gambler's fallacy. If anything, the odds of an interception likely increase after a previous interception as it would be a sign of a defensive advantage over the QB. If only there was somewhere we could get the data to figure out for sure...
Here's some soccer data, doesn't include play-by-play though (soccer generally isn't suited to that kind of breakdown, although Opta Sports do track it).
The comments on that are awesome too - great advice for parsing, categorizing, and such. I couldn't download 2010 though - "Sorry, we are unable to generate a view of the document at this time. Please try again later."
Amazing!!! Thanks to www.advancednflstats.com for doing all the leg-work. Highly recommend their site too. Their in-game win probability statistics are always a must-have for me on game-day ^_^
Could just be my imagination, but I felt like this season I saw many more teams go for it on 4th down (maybe some of this data can prove or disprove that?). Perhaps the impact of the above, or similar, analyses is finally being seen.
This looks like great fun...Judging by some of the sample entries, it will also be an instructive example of the limitations of CSV and why serious analysts who want to work with unstructured data need to know a scripting language, or at least regexes.
Sample description field:
> 20020905_SF@NYG,1,59,20,NYG,SF,3,11,81,(14:20) (Shotgun) K.Collins pass intended for T.Barber INTERCEPTED by T.Parrish (M.Rumph) at NYG 29. T.Parrish to NYG 23 for 6 yards (T.Barber).,0,0,2002
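For a field like that, one regex per common sentence pattern gets you surprisingly far. Here's a hedged sketch handling the interception pattern from the sample above; the real data needs many more patterns, and (as noted elsewhere in the thread) the phrasing drifts across seasons.

```python
import re

# One pattern among many: "X pass [short middle] intended for Y
# INTERCEPTED by Z (...) at TEAM NN". Names are matched as "K.Collins"-style
# initial-dot-surname tokens, which (as noted above) are not unique IDs.
INT_RE = re.compile(
    r"(?P<passer>\w\.\w+) pass (?:\w+ \w+ )?intended for (?P<target>\w\.\w+) "
    r"INTERCEPTED by (?P<picker>\w\.\w+).* at (?P<spot>[A-Z]+ \d+)"
)

desc = ("(14:20) (Shotgun) K.Collins pass intended for T.Barber "
        "INTERCEPTED by T.Parrish (M.Rumph) at NYG 29.")
m = INT_RE.search(desc)
print(m.groupdict())
```

Expect a long tail of edge cases (laterals, penalties, replay reviews); the regexes are a starting point, not a parser.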
In the comments section of the OP, someone posted this sample Excel function:
The Excel function looks ridiculous, but it probably didn't take more than 10 minutes to make, tops. Nested conditionals are easy.
At any rate, what would you recommend most to accomplish the task? I'm learning Python and know R a bit, so I was just wondering how I was going to go about combing through the data.
I think the point is less "can this be done in X minutes?" and more "is there an easier way to do this?" This, at least, is what frustrates me about people who are Excel fanatics but refuse to learn a couple days' worth of Python, Ruby, VBA, whatever. Nested conditionals might be easy, but a case statement is vastly easier to write and understand. There are lots of people who would say "oh I could never program" and then write Excel functions too complex for me to understand.
Python or Ruby is fine...the main trick is to be able to process those fields with regular expressions...which, IIRC, requires throwing in VBscript if you were to handle it solely in Excel.
Python and Ruby would also allow for more elegant-looking -- i.e. more maintainable -- functions to handle that field.
R is fine too: it has regular expressions and probably excels if you plan to do statistics using all of that data. Python seems to have reasonable statistics functionality as well (with pandas, etc) but I haven't used it personally.
Thanks for the note. I've not really looked into how one would do that with R (doing it in Python seems more clear), but am checking it out now. If anyone else is looking I'm finding this PDF helpful: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenR...
God this is such interesting stuff. How do we still not have a fully featured open source NFL stats-rosters-game charting API? Who wouldn't want to contribute to that project?
Other than cool data visualization stuff, the obvious implication is the potential to devise a profitable system to pick games against the spread. The guys at Football Outsiders have done a decent job at it and made a proprietary algorithm that picked games at 58% this year (which is over the threshold you need to be profitable in Vegas). But even those guys are still having some trouble getting access to and aggregating the data in a usable format.
I really want to sit down and start playing around with some of this data so I appreciate you putting this together for everyone. The NFL needs an open source API and this is definitely a step in the right direction.
I believe my library, nflgame [1], would fit the bill. Features all play-by-play data back to 2009, and includes the ability to track play-by-play data live.
A bit OT, but I thought this might be a good opportunity to mention the upcoming SportsHackDay in Seattle from Feb 1-3 which culminates in a group viewing of the SuperBowl.
http://sportshackday.com/
Need a bounds check or two in there. Tried setting the number of yards you need to 1 and the yards away from the endzone to 99 and it threw up a nice exception. Cool calculator though!
I've used data from Brian Burke's site before. I think it's the exact PBP data the NFL has, but you'll find that the structure and common phrasings change over the years. I had to write a lot of regular expressions and I was still catching edge cases for weeks.
btw, pro-football-reference has pbp data now too, and it probably goes back a lot further, but I think they discourage mass scraping of their site.
There is a lot to have fun with here. I would imagine though that in a lot of NFL coaching rooms there has to be a balance between coaching and analysis.
As a Frenchman not interested in sports at all, this would have made no sense to me before I watched the TV series The League [1]. Now I kind of enjoy the fact that these stats exist and are available in an open format, even if I don't really care myself.
If you live in the vicinity of Seattle, there is a sports-themed hackathon going on Superbowl weekend. Google, ESPN and a bunch of tech companies are sponsoring. The grand prize will be passes to the Sloan Sports Conference. More details to come:
It would be interesting to take this data and build an app around it for fantasy football. If you have all the tendencies, and how players like your player have played against certain teams, you could make better guesses on who to play.
I did something like that for an AI class in college. We used FuzzyCLIPS to write an expert system for drafting fantasy football teams. A ruby script pulled the CSV data from some other site that had the previous three years of NFL data, and then converted the CSV to fact files which the system then read in.
When all was said and done it worked, but made some pretty crappy draft picks! I should find that code....