Show HN: BestSFBooks, Mashup of the Best SF/Fantasy Books

bgraves · on Oct 5, 2011

"As always, data acquisition and cleanup was the hardest part."

I'm most interested in this piece of the project. What were your particular tools and methodologies? How long did it take you, once you identified your data sources? Any interesting stumbling blocks or problems that were solved along the way?

gurgeous · on Oct 5, 2011

I've written a lot of data tools as part of Urbanspoon and subsequent startups. I like to collect publicly available data, clean it up, normalize it, and then release it in a more useful way.

Hot tips for crawling data:

  - Cache pages locally while you work on the indexing
  - Nokogiri is awesome
  - Don't be afraid to use regular expressions
  - Initially, put data into a spreadsheet (not the db).
    That way it can be checked in and diffed.
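
A minimal sketch of the first tip (caching pages locally), not the code used for the site. The `cached_fetch` helper, the `page_cache` directory, and the SHA1-of-URL file naming are all assumptions made for illustration; the actual download is passed in as a block so the cache logic stays independent of the HTTP library.

```ruby
require "digest/sha1"
require "fileutils"

# Hypothetical helper: return the page body for `url`, reading it from a
# local cache directory when present and calling the block only on a miss.
def cached_fetch(url, cache_dir: "page_cache")
  path = File.join(cache_dir, Digest::SHA1.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)

  body = yield(url)            # the real download happens here, once
  FileUtils.mkdir_p(cache_dir)
  File.write(path, body)
  body
end

# Usage (the network is only touched on a cache miss):
#   require "net/http"
#   html = cached_fetch("http://example.com/") { |u| Net::HTTP.get(URI(u)) }
```

The cached HTML can then be handed to a parser (for example `Nokogiri::HTML(html)`) as many times as needed while the extraction code is being tuned, without hitting the source again.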

I also have a lot of subtle tricks for cleaning up messy data. For example, to see if two similar authors refer to the same person, I have a method that converts an author name to an author key. The key is just like the name, only it's been uppercased, apostrophes removed, etc. Plus weird stuff like this:

  # collapse each run of vowels (Y included) into a single E
  s = s.gsub(/[AEIOUY]+/, "E")

It's little things like this hack that make a big difference in data quality.
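
Putting the described steps together, a rough sketch of such a key function might look like the following. Only the uppercasing, the apostrophe removal, and the vowel line come from the description above; the method name `author_key` and the whitespace cleanup are assumptions, and whatever the "etc." covers is not reproduced here.

```ruby
# Hypothetical author-key function: two names that produce the same key
# are candidates for being the same person.
def author_key(name)
  s = name.upcase
  s = s.delete("'")                 # apostrophes removed
  s = s.gsub(/[AEIOUY]+/, "E")      # each run of vowels becomes one E
  s.strip.squeeze(" ")              # assumption: normalize spacing too
end

puts author_key("Neil Gaiman")   # NEL GEMEN
puts author_key("Neal Gaiman")   # NEL GEMEN
puts author_key("Tim O'Brien")   # TEM EBREN
```

A key this aggressive will also merge some genuinely different names, so it is probably best treated as a way to flag likely duplicates for review rather than as proof of identity.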

(edit: formatting)

bgraves · on Oct 5, 2011

Thanks! I've been doing some scraping projects lately and really like it a lot. There's a pretty steep learning curve, but it gets easier and easier as you go along, I think.

1. Caching pages is definitely a great idea while debugging. Especially if the data source has a request limit :)

2. I've never heard of Nokogiri, but it looks like BeautifulSoup for Ruby. I've found that Python has worked for everything I need so far, but thanks for the reference.

3. I suck so bad at regex, but using it more will help me climb that mountain.

4. One tip I've used is writing out the "INSERT INTO TABLE..." statements along with the scraped results. I definitely use CSV (and Google Refine) for general clean up and spot checking.

5. You should write a 'Data Scraping One-liners Explained' ebook :)
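
For tip 4, a small sketch in Ruby (to match the snippet upthread) of emitting INSERT statements alongside the scraped rows. The `books` table, its columns, and both helper names are made up for illustration; the only SQL-specific detail is doubling single quotes inside string literals.

```ruby
# Hypothetical helpers for turning scraped rows into a reviewable SQL file.
def sql_quote(value)
  return "NULL" if value.nil?
  return value.to_s if value.is_a?(Numeric)
  "'" + value.to_s.gsub("'", "''") + "'"   # escape ' as '' per standard SQL
end

def insert_statement(table, row)
  columns = row.keys.join(", ")
  values  = row.values.map { |v| sql_quote(v) }.join(", ")
  "INSERT INTO #{table} (#{columns}) VALUES (#{values});"
end

# Illustrative row only; real rows would come from the scraper.
rows = [{ title: "Ender's Game", author: "Orson Scott Card", year: 1985 }]
rows.each { |row| puts insert_statement("books", row) }
```

This suits a one-off load file that gets eyeballed and diffed before it is run. With a live database connection, bound parameters would be the safer route.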

semanticist · on Oct 6, 2011

I spent the first half of this year writing scrapers for every newspaper in the UK. My Top Regex Tip is http://rubular.com/ - this thing saved me HOURS of my life.

berntb · on Oct 5, 2011

>>I suck so bad at regex

Then you are the opposite of where I used to be -- I thought I understood and could use regexps. Hell, I do Perl for fun. :-)

Something like when I first sat down with Photoshop -- "Hey, I know how to program Macs [before Mac OS X]. This is just using a Macintosh program, so I should have no problems"... :-)

Read "Mastering Regular Expressions". It made me feel embarrassed about my previous stupidity [Edit: Embarrassment, your name is Dunning–Kruger :-) ]. Just the first few chapters are enough to change your world.

Edit: I might add, I still can't use Photoshop.

phzbOx · on Oct 5, 2011

May I add: Usually, scraping is a "one time only" job, so feel free to use all the hacks you want to get the job done faster. For instance, wget/grep into a file, use vim to clean it up a bit, mix in some awk, perl or bash. The goal is just to get the data, not to write production-quality code that will be maintained for years.

For tricky pages, it can be a good idea to write tests, though. Be sure to cache all downloaded pages so the tests can run uber-fast. It's a good way to perfect that regex. (Yes, there are programs for that, but sometimes HTML has weird newlines or characters that screw everything up.)
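
As a sketch of what such a test could look like, in Ruby to match the snippet upthread: the fixture below stands in for a cached page and deliberately mixes `\r\n` line endings with a tag and a title broken across lines. The markup, the `/b/<id>` link shape, and the names `CACHED_PAGE`, `LINK_RE` and `extract_books` are all invented for the example.

```ruby
# Stand-in for a page read from the local cache (normally File.read(path)).
CACHED_PAGE = "<ul>\r\n" \
  "<li><a href=\"/b/1\">Dune</a></li>\r\n" \
  "<li><a\n   href=\"/b/2\">The Left Hand\n of Darkness</a></li>\r\n" \
  "</ul>"

# \s+ tolerates a newline inside the tag; /m lets . match newlines in titles.
LINK_RE = %r{<a\s+href="(/b/\d+)"\s*>(.*?)</a>}m

def extract_books(html)
  html.scan(LINK_RE).map do |href, title|
    { href: href, title: title.gsub(/\s+/, " ").strip }
  end
end

expected = [
  { href: "/b/1", title: "Dune" },
  { href: "/b/2", title: "The Left Hand of Darkness" }
]
raise "regex regressed" unless extract_books(CACHED_PAGE) == expected
puts "ok"
```

Because nothing here touches the network, the check can be rerun after every tweak to the regex.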

Also, I think it's important to emphasize not to "code everything". Say you've got 10 links to get on the first page, and behind each of those links you've got hundreds more. Obviously, you won't go to each page and manually copy everything; however, for the first 10 links, it's useless to write a script that crawls them. Just copy them and clean them up from the source. Gee, use a macro in emacs or your favorite editor if you really can't repeat a task 10 times.

phil · on Oct 5, 2011

Yes, repleceng ell vewels weth the letter E clerle empreves dete qelete ;)

natbro · on Oct 5, 2011

OK, I have to admit at first I thought there was something wrong with the site because I didn't recognize enough books... but after reading some excerpts of books off the lists, I'm psyched -- turns out I just haven't been finding the good books for a long time, so now I can.

Suggestions:

* show me excerpts on-page (if possible from amzn?)

* allow community +1/-1 on books and generate lists based on top-rated by site users

* commenting, Facebook or Disqus, on each book

(little issue: Facebook "like" button on main page and book pages doesn't seem to be working -- dunno if that's Facebook's problem, not yours)

gurgeous · on Oct 5, 2011

Thanks Nat - I fixed the like button. Turns out that you have to specify the href param when using the iframe, unlike twitter.

phrotoma · on Oct 5, 2011

This fairly reeks of awesome. If the ones I don't recognize are as great as the ones I do, I'd say you have a damn fine site there!

sammyo · on Oct 5, 2011

A hook into Google book library reference 'available at your local library' would be awesome.

wtf242 · on Oct 5, 2011

Looks awesome! I wrote a similar site that aggregates general fiction and non-fiction based on how many awards and lists the books appear on. http://thegreatestbooks.org

I built it on Rails as well

gurgeous · on Oct 5, 2011

Update - GeekWire picked it up:

http://www.geekwire.com/2011/hunger-science-fiction-books-sp...

Also, thanks to webwright for the title suggestion.

100k · on Oct 5, 2011

Cool! Very nice.

I'm sort of slowly working my way through the double winners of the Hugo and Nebula (kind of a lifetime goal, I guess).

It took me a while to find the "Hall of Fame" for books, but that is what I immediately wanted from a site like this (I don't care so much about the year-by-year rankings). Maybe make it more prominent?

EDIT: also, I think there's a big difference between nominations and winning. Would be cool to sort based on actually won awards, not just nominations.

javanix · on Oct 5, 2011

This is awesome - I've been searching for somewhere to find good new SF/Fantasy for a while that doesn't just entail blindly searching Amazon.

Urgo · on Oct 5, 2011

I like it. One suggestion, or really a request: can you add a link to Audible in addition to Amazon & the Kindle for us audiobook fans?

adamzochowski · on Oct 5, 2011

Awesome job. It is similar to existing https://www.worldswithoutend.com/

2mur · on Oct 5, 2011

This is really, really fantastic. Been a huge sf/f reader my whole life (collect 1sts of lots of favorite authors... hovering around 6000 now). I basically used to do this manually: end of the year check Locus, check SF Site lists, check Hugos... read them all.

md1515 · on Oct 5, 2011

This is fantastic. I think you should follow natbro's lead on a few things - allow Disqus commenting and show excerpts of the book.

Also I have a series to add: Harry Turtledove's "Darkness" series. It is like 5 books, if I remember correctly. I'm at work...

pasbesoin · on Oct 5, 2011

Yay!

http://www.bestsfbooks.com/b/3119/The-Children-of-the-Sky

The Children of the Sky Series: Zones of Thought #3 by Vernor Vinge (Tor, Oct-2011)

dougws · on Oct 5, 2011

There seems to be a really heavy bias towards relatively new SF; is this because there are more awards now than there used to be, or because you only have data going so far back?

gigawatt · on Oct 5, 2011

This is fantastic! Perfectly simple and useful.

Only glitch I've run into so far is that the search function doesn't work. As far as I can tell, there's no type="submit" for the form.

jogrimst · on Oct 7, 2011

Hello. That is because it is an auto-complete search field. If you search for something that does not exist, nothing will happen. If you type, for example, "Krak", you will get an auto-complete suggestion for "Kraken".

Maybe it would be an idea to show a message like "No results..." if what the user types does not return any matches in this search function?

wccrawford · on Oct 5, 2011

You're going to cause me to spend far too much money. Dang you.

peapicker · on Oct 5, 2011

Well done, you! I like the site a lot, bookmarked.