Interesting, thanks for the report there. Coincidentally, this misspelling is also present in all the subtitles that Simpsons World uses!
We chose the season 15 cutoff pretty much arbitrarily. We're not necessarily opposed to later seasons, but we'd like to have some better season/episode filtering in place before expanding more.
Makes sense with the seasons. It certainly helps when you've got issues like season 11 being off like it is. There are other reports of the search functionality being literal with punctuation, and other things to work out. It'd be really neat with more polish and possibly more TV shows. I imagine the data needs are pretty substantial, but I wonder if there might be a good way to deal with that by having the server generate the JPGs on the fly from the better-compressed video files. That might really be a big win if BPG or other formats actually take off and begin replacing JPG.
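On the punctuation point, here is a rough sketch of one way to make matching less literal: normalize both the query and the subtitle text before comparing. This is only an illustration, not how the site actually searches; the `normalize` and `matches` names are made up for the example.

```python
import string
import unicodedata

# Replace most ASCII punctuation with a space so "Hoyvin-Mayvin" splits into
# two words, but delete apostrophes so "Don't" becomes "dont".
_PUNCT_TABLE = {ord(ch): " " for ch in string.punctuation}
_PUNCT_TABLE[ord("'")] = None


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = no_accents.lower().translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


def matches(query, subtitle):
    """Hypothetical helper: substring match on the normalized forms."""
    return normalize(query) in normalize(subtitle)


print(matches("hoyvin mayvin", "tell us about Operation Hoyvin-Mayvin."))
```

A plain substring test like this says nothing about ranking; it only shows that the normalization step is cheap to apply on both sides.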
"We also parse subtitle files and correlate each subtitle line's timecode with the timecode of the screenshot. Finally, the frinkiac binary can upload the data set to frinkiac-server."
Could you elaborate on this parsing of the subtitle files? I've seen the "open source" Star Wars gifs file with the dialog and time codes[1], but I'm not sure how they pulled the text from the closed captioning. (edit: someone else asked something similar...sorry).
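For anyone wondering what that correlation step could look like, here is a minimal sketch. It assumes the subtitles are in SRT format (the writeup only says "subtitle files") and that each screenshot has a known timestamp in milliseconds. The function names and the sample text are made up for illustration; this is not the frinkiac code.

```python
import bisect
import re

TIMECODE = re.compile(r"(\d+):(\d+):(\d+)[,.](\d{3})")


def to_ms(timecode):
    """Convert an 'HH:MM:SS,mmm' timecode to integer milliseconds."""
    h, m, s, ms = (int(g) for g in TIMECODE.match(timecode).groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms


def parse_srt(text):
    """Return a sorted list of (start_ms, end_ms, content) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # Expect: sequence number, timing line, then one or more text lines.
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = (to_ms(part.strip()) for part in lines[1].split("-->"))
        cues.append((start, end, " ".join(lines[2:])))
    return sorted(cues)


def subtitle_at(cues, frame_ms):
    """Find the cue whose [start, end) interval contains frame_ms, if any."""
    starts = [cue[0] for cue in cues]
    i = bisect.bisect_right(starts, frame_ms) - 1
    if i >= 0 and frame_ms < cues[i][1]:
        return cues[i][2]
    return None


# Illustrative sample only; the sequence numbers are invented.
SAMPLE = """1
00:13:14,266 --> 00:13:16,533
( gavel pounding ) So, Professor,

2
00:13:16,533 --> 00:13:19,834
tell us about Operation Hoyvin-Mayvin.
"""

cues = parse_srt(SAMPLE)
print(subtitle_at(cues, 797000))
```

Treating the end time as exclusive means a frame that lands exactly on a boundary shared by two cues is assigned to the later one.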
Also, aside from the two-character search index you describe, how are you searching the quotes with Postgres? Are you using Postgres's full-text search[2] or something else?
Thanks, I love the Simpsons and this is really cromulent[3] and cool.
Was this done using the DVDs? I'm curious about any potential licensing issues with the screen caps and subtitles. Did you have to get permission/sign something - or does this fall under fair use?
No way this is licensed (no copyright notice, even; not even a mention of Fox), and no way it is fair use. It has frame-by-frame, full resolution images and full transcripts of every episode up for browsing. This is textbook mass copyright infringement. Short of offering unlicensed video downloads for a fee, it could hardly be more clear-cut.
Yeah, it's cool, I get it, but you can't just steal and redistribute content en masse for your cool project. Well, he did, but I expect he'll be hearing from Fox's lawyers soon.
It is arguably fair use in the U.S. I don't think there is enough case law to be sure. It's hard to predict how it would go in litigation. I think you're right that the defendants wouldn't have a particularly strong case, but they wouldn't have the weakest.
The courts have generally treated significant "transformation" of the source material as a powerful factor in determining fair use. I think that would be in their favor. It could also be argued that this has very little effect on the market for the original copyrighted material, which would be in their favor as well. Of course, the copyright holder would see and argue it differently if they chose to sue. And "the amount and substantiality of the portion taken" would not look good for the defendants. But even though a common belief focuses on this factor almost exclusively (as long as you copy only 10 pages or whatever you're good, and if you copy more you're definitely not), that's not how it works. It's just one factor, and one that the courts in the past couple of decades have somewhat de-emphasized.
But I don't think we can say "no way it is fair use", or "it could hardly be more clear cut." It could go either way. Fair use in the U.S. for novel things, not already well established as fair use or not, almost always looks like this.
Counterpoint: copying every single page of every book and making it searchable can be fair use. It just takes 10 years of litigation and appeals to determine that. See Authors Guild v. Google.
https://www.eff.org/document/ruling-appeals-court
Point is, the law is hardly clear cut and never is with new technologies. Without someone willing to take a risk and develop a potentially infringing technology we would never have had VCRs, MP3 players, YouTube.... I applaud the creators for making an incredibly useful resource and I hope if they do face legal threats they get a zealous pro-bono defense from someone like the EFF or Larry Lessig.
This is impressive. It found everything I tried. If the author is reading, showing GIFs or a small video clip instead of a static image would be preferable.
They may or may not (be allowed to) have a sense of humor about it. Our 24 Hours of LeMons car's publicity was sent to Matt Groening by a friend, and he passed it around the office. Apparently he asked their publicity folks if they could invite us up to show off the car about the same time that legal asked about sending us a cease and desist.
In the case of the car, it's probably fair use and the only issue was likely that we have non-Fox-approved sponsorship on it, but they probably decided their advertisers wouldn't complain about it because it's not exactly big bucks changing hands here.
So yeah, we got to meet Matt Groening and David X Cohen and Al Jean and a lot of the writers. It was definitely a cool experience.
They could definitely use a better text indexing/relevancy ranking implementation. I had mixed success. I'd recommend Lucene or something based on Lucene (Solr, Elasticsearch).
If I could use this to get subtitled gifs of the scene in question, not just screenshots, it would go from amazing to godlike. On the roadmap for v2, hopefully?
For those of us who grew up having conversations in Simpsons dialog, this will help those, like my wife, who don't have such habits develop them :)
This is great! My only complaint is that it comes up with lots of near duplicates. The images look like they are different frames, but the quotes they reference are the same:
"Subtitles":[{"Id":138914,"Episode":"S13E08","StartTimestamp":794266,"EndTimestamp":796533,"Content":" ( gavel pounding ) So, Professor,"},{"Id":138915,"Episode":"S13E08","StartTimestamp":796533,"EndTimestamp":799834,"Content":"tell us about Operation Hoyvin-Mayvin."}]
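One possible way to cut down on the near duplicates would be to collapse results that point at the same subtitle Id and keep a single representative frame. The sketch below is a guess at how that could work, not the site's code. The frame dictionaries and their `SubtitleId`, `Timestamp`, `Start`, and `End` keys are a made-up shape for illustration, loosely modeled on the JSON above.

```python
def collapse_by_subtitle(frames):
    """Keep one frame per subtitle: the one nearest the cue's midpoint."""
    best = {}
    for frame in frames:
        midpoint = (frame["Start"] + frame["End"]) / 2
        distance = abs(frame["Timestamp"] - midpoint)
        key = frame["SubtitleId"]
        # Keep the first frame seen for a subtitle unless a closer one appears.
        if key not in best or distance < best[key][0]:
            best[key] = (distance, frame)
    # Dicts preserve insertion order, so results stay in first-seen order.
    return [frame for _, frame in best.values()]


# Hypothetical search results: three frames for one cue, one for the next.
results = [
    {"SubtitleId": 138914, "Timestamp": 794500, "Start": 794266, "End": 796533},
    {"SubtitleId": 138914, "Timestamp": 795400, "Start": 794266, "End": 796533},
    {"SubtitleId": 138914, "Timestamp": 796000, "Start": 794266, "End": 796533},
    {"SubtitleId": 138915, "Timestamp": 798000, "Start": 796533, "End": 799834},
]

for frame in collapse_by_subtitle(results):
    print(frame["SubtitleId"], frame["Timestamp"])
```

Picking the frame nearest the midpoint is an arbitrary choice; the first or last frame of the cue would work just as well for deduplication.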
On the legal/lawyer talk tip: there have been a few other notable Simpsons screencap repositories (like Lardlad) that have remained online for years. I wonder if there's some leeway, or if they can't chase after a single frame (as opposed to video with picture and sound, which they are notoriously strict about on YouTube etc.).
Also, why didn't this get picked up by HN's duplicate post algo? For the blog writeup, @reaperhulk, you should have put 'Show HN' in your original post to get more traction or something.
Getting the same results from my laptop at the moment as well. Everything is returning a 'Nothing Found' error. It was working earlier today. (Can you tell it is Friday?)