Show HN: BestSFBooks, Mashup of the Best SF/Fantasy Books
78 points by gurgeous on Oct 5, 2011 | 25 comments
I'm releasing BestSFBooks today:

http://www.bestsfbooks.com

I was inspired by this HN post from a few weeks ago:

http://news.ycombinator.com/item?id=2978027

BestSFBooks ranks science fiction/fantasy books according to how many awards they've won or been nominated for. I included all the big ones (Hugo, Nebula, etc.) and also more obscure awards that I like such as SF Site Editor's choice. Once I had all the awards in there it was easy to start creating a New Book list based on award winning authors.

I could tell the app was working as soon as I saw the two book lists on the home page. They're excellent!

The stack is virtually identical to the stuff I used to build PickHealthInsurance. BestSFBooks is built with Rails 3.1.1 (HAML, Sass, CoffeeScript). It's hosted on Heroku and MongoHQ. I used the Twitter Bootstrap CSS toolkit, which I continue to find hugely innovative and useful. As always, data acquisition and cleanup was the hardest part.

I've wanted to build something like this for quite a while. As my time becomes more valuable, I'm becoming less tolerant of bad books. For a laugh, check out the "prototype" that I created ten years ago - http://www.gurge.com/amd/top100

Please send feedback!




"As always, data acquisition and cleanup was the hardest part."

I'm most interested in this piece of the project. What were your particular tools and methodologies? How long did it take you, once you identified your data sources? Any interesting stumbling blocks or problems that were solved along the way?


I've written a lot of data tools as part of Urbanspoon and subsequent startups. I like to collect publicly available data, clean it up, normalize it, and then release it in a more useful way.

Hot tips for crawling data:

  - Cache pages locally while you work on the indexing
  - Nokogiri is awesome
  - Don't be afraid to use regular expressions
  - Initially, put data into a spreadsheet (not the db).
    That way it can be checked in and diffed.
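
A rough sketch of the first tip (caching pages locally), using only the Ruby standard library. The helper name `cached_page` and the `cache` directory are made up for illustration; the block passed in does the real download, so any HTTP client can be plugged in.

```ruby
require "digest/sha1"
require "fileutils"

# Hypothetical helper: return the body for url, reading it from a local
# cache directory when a copy already exists.
def cached_page(url, cache_dir: "cache")
  FileUtils.mkdir_p(cache_dir)
  path = File.join(cache_dir, Digest::SHA1.hexdigest(url) + ".html")
  return File.read(path) if File.exist?(path)

  body = yield(url)       # only runs on a cache miss
  File.write(path, body)
  body
end

# Possible usage (placeholder URL, needs the network on the first call):
#   require "net/http"
#   html = cached_page("http://example.com/awards") { |u| Net::HTTP.get(URI(u)) }
```

Keying the file name on a hash of the URL avoids having to sanitize slashes and query strings. Deleting the cache directory forces a fresh crawl.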

I also have a lot of subtle tricks for cleaning up messy data. For example, to see if two similar author names refer to the same person, I have a method that converts an author name to an author key. The key is just like the name, only it's been uppercased, apostrophes removed, etc. Plus weird stuff like this:

  # replace each run of vowels with a single letter E
  s = s.gsub(/[AEIOUY]+/, "E")

It's little things like this hack that make a big difference in data quality.
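
Putting the pieces described above together, an author key method might look roughly like this. Only the uppercasing, the apostrophe removal and the vowel trick come from the comment; the punctuation and whitespace steps are guesses at what the "etc." covers, and the name `author_key` is hypothetical.

```ruby
# Hypothetical author_key: a sketch, not the site's actual method.
def author_key(name)
  s = name.upcase
  s = s.delete("'")               # O'Brien -> OBRIEN
  s = s.gsub(/[^A-Z ]/, " ")      # assumed: drop other punctuation and digits
  s = s.gsub(/[AEIOUY]+/, "E")    # each run of vowels becomes one E
  s.split.join(" ")               # assumed: squeeze whitespace
end

author_key("Neal Stephenson")  # => "NEL STEPHENSEN"
author_key("Neil Stephenson")  # => "NEL STEPHENSEN"
```

Two spellings that differ only in their vowels now share a key. The flip side is that genuinely different names can collide, so a key match is probably best treated as a candidate for manual review rather than proof.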

(edit: formatting)


Thanks! I've been doing some scraping projects lately and really like it a lot. There's a pretty steep learning curve, but it gets easier and easier as you go along, I think.

1. Caching pages is definitely a great idea while debugging. Especially if the data source has a request limit :)

2. I've never heard of Nokogiri, but it looks like BeautifulSoup for Ruby. I've found that Python has worked for everything I need so far, but thanks for the reference.

3. I suck so bad at regex, but using it more will help me climb that mountain.

4. One tip I've used is writing out the "INSERT INTO TABLE..." statements along with the scraped results. I definitely use CSV (and Google Refine) for general clean up and spot checking.

5. You should write a 'Data Scraping One-liners Explained' ebook :)
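
For tip 4, a minimal sketch of writing out INSERT statements next to the scraped rows. The table name, column names and sample row are placeholders, and the quoting here (doubling single quotes) is only good enough for a throwaway import script, not for untrusted input.

```ruby
# Hypothetical helper: build one INSERT statement from a hash of
# column => value. Numbers are left bare, everything else is quoted.
def insert_statement(table, row)
  cols = row.keys.join(", ")
  vals = row.values.map do |v|
    v.is_a?(Numeric) ? v.to_s : "'" + v.to_s.gsub("'", "''") + "'"
  end
  "INSERT INTO #{table} (#{cols}) VALUES (#{vals.join(', ')});"
end

rows = [
  { title: "Ender's Game", author: "Orson Scott Card", year: 1985 },
]
rows.each { |r| puts insert_statement("books", r) }
# prints: INSERT INTO books (title, author, year) VALUES ('Ender''s Game', 'Orson Scott Card', 1985);
```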


I spent the first half of this year writing scrapers for every newspaper in the UK. My Top Regex Tip is http://rubular.com/ - this thing saved me HOURS of my life.


>>I suck so bad at regex

Then you are the opposite of where I used to be -- I thought I understood and could use regexps. Hell, I do Perl for fun. :-)

Something like when I first sat down with Photoshop -- "Hey, I know how to program Macs [before Mac OS X]. This is just using a Macintosh program, so I should have no problems"... :-)

Read "Mastering Regular Expressions". It made me feel embarrassed about my previous stupidity [Edit: Embarrassment, your name is Dunning–Kruger :-) ]. Just the first few chapters are enough to change your world.

Edit: I might add, I still can't use Photoshop.


May I add: scraping is usually a "one time only" job, so feel free to use all the hacks you want to get the job done faster. For instance, wget/grep into a file, use vim to clean it up a bit, mix in some awk, perl or bash. The goal is just to get the data, not to write production-quality code that will be maintained for years.

For tricky pages, it can be a good idea to write tests, though. Be sure to cache all downloaded pages so the tests can run uber-fast. It's a good way to perfect that regex. (Yes, there are programs for that, but sometimes HTML has some weird newlines or characters that screw everything up.)

Also, I think it's important to emphasize not to "code everything". Say you've got 10 links to follow on the first page, and each of those pages has hundreds more. Obviously, you won't go to each page and manually copy everything; however, for the first 10 links, it's useless to write a script that crawls them. Just copy them and clean them up from the source. Gee, use a macro in emacs or your favorite editor if you really can't repeat a task 10 times.
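
A small sketch of that kind of test. The HTML fragment is made up to stand in for a cached page (in practice it would be read from the cache directory), and it deliberately has line breaks inside a tag and inside a title.

```ruby
# Made-up fixture standing in for a cached page.
PAGE = "<td class=\"title\">The Children\n of the Sky</td>\n" \
       "<td\n class=\"title\">Kraken</td>"

# \s+ tolerates a newline between the tag name and the attribute;
# [^<]+ matches title text that spans several lines.
TITLE_RE = /<td\s+class="title">([^<]+)<\/td>/

def titles(html)
  html.scan(TITLE_RE).flatten.map { |t| t.split.join(" ") }
end

titles(PAGE)  # => ["The Children of the Sky", "Kraken"]
```

Because the fixture is a plain string, the check runs instantly and can be repeated after every tweak to the regex.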


Yes, repleceng ell vewels weth the letter E clerle empreves dete qelete ;)


OK, I have to admit at first I thought there was something wrong with the site because I didn't recognize enough books... but after reading some excerpts of books off the lists, I'm psyched -- turns out I just haven't been finding the good books for a long time, so now I can.

Suggestions:

* show me excerpts on-page (if possible from amzn?)

* allow community +1/-1 on books and generate lists based on top-rated by site users

* commenting, Facebook or Disqus, on each book

(little issue: Facebook "like" button on main page and book pages doesn't seem to be working -- dunno if that's facebook's problem not yours)


Thanks Nat - I fixed the like button. Turns out that you have to specify the href param when using the iframe, unlike twitter.


This fairly reeks of awesome. If the ones I don't recognize are as great as the ones I do, I'd say you have a damn fine site there!


A hook into Google book library reference 'available at your local library' would be awesome.


looks awesome! I wrote a similar site that aggregates the general fiction and non-fiction based on how many awards and lists they are on. http://thegreatestbooks.org

I built it on Rails as well


Update - GeekWire picked it up:

http://www.geekwire.com/2011/hunger-science-fiction-books-sp...

Also, thanks to webwright for the title suggestion.


Cool! Very nice.

I'm sort of slowly working my way through the double winners of the Hugo and Nebula (kind of a lifetime goal, I guess).

It took me a while to find the "Hall of Fame" for books, but that is what I immediately wanted from a site like this (I don't care so much about the year-by-year rankings). Maybe make it more prominent?

EDIT: also, I think there's a big difference between nominations and winning. Would be cool to sort based on actually won awards, not just nominations.
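
One way the suggested ordering could work, as a sketch only: sort on wins first and let nominations break ties. The `Book` struct, the titles and the counts are placeholders, and `nominations` is assumed to mean nominations that did not win; none of this is taken from the site's actual data model.

```ruby
Book = Struct.new(:title, :wins, :nominations)

# Winners first; nominations only break ties; title keeps the order stable.
def rank(books)
  books.sort_by { |b| [-b.wins, -b.nominations, b.title] }
end

books = [
  Book.new("Book A", 1, 5),
  Book.new("Book B", 2, 0),
  Book.new("Book C", 1, 2),
]
rank(books).map(&:title)  # => ["Book B", "Book A", "Book C"]
```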


This is awesome - I've been searching for somewhere to find good new SF/Fantasy for a while that doesn't just entail blindly searching Amazon.


I like it. One suggestion, or request really: can you add a link to Audible in addition to Amazon & the Kindle for us audiobook fans?


Awesome job. It is similar to the existing https://www.worldswithoutend.com/


This is really, really fantastic. Been a huge sf/f reader my whole life (collect 1sts of lots of favorite authors... hovering around 6000 now). I basically used to do this manually: end of the year check Locus, check SF Site lists, check Hugos... read them all.


This is fantastic. I think you should follow natbro's lead on a few things - allow Disqus commenting and show excerpts of the book.

Also I have a series to add: Harry Turtledove's "Darkness" series. It is like 5 books, if I remember correctly. I'm at work...


Yay!

http://www.bestsfbooks.com/b/3119/The-Children-of-the-Sky

The Children of the Sky Series: Zones of Thought #3 by Vernor Vinge (Tor, Oct-2011)


There seems to be a really heavy bias towards relatively new SF; is this because there are more awards now than there used to be, or because you only have data going so far back?


This is fantastic! Perfectly simple and useful.

Only glitch I've run into so far is that the search function doesn't work. As far as I can tell, there's no type="submit" for the form.


Hello. That is because it is an auto-complete search field. If you search for something that does not exist, nothing will happen. If you type, for example, "Krak", you will get an auto-complete suggestion for "Kraken".

Maybe it would be an idea to show a message like "No results..." if what the user types does not return any matches in this search function?
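
A sketch of how that could behave on the server side. The title list is sample data and `suggest` is a hypothetical name; how the site's auto-complete is really wired up is not known from this thread.

```ruby
TITLES = ["Kraken", "Embassytown", "The Children of the Sky"]  # sample data

# Prefix match, case-insensitive. Returns a single "No results" entry
# instead of an empty list so the dropdown always shows something.
def suggest(query, titles = TITLES)
  q = query.strip.downcase
  return [] if q.empty?

  hits = titles.select { |t| t.downcase.start_with?(q) }
  hits.empty? ? ["No results for \"#{query.strip}\""] : hits
end

suggest("Krak")  # => ["Kraken"]
suggest("zzz")   # => ["No results for \"zzz\""]
```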


You're going to cause me to spend far too much money. Dang you.


Well done, you! I like the site a lot, bookmarked.



