
Keep in mind that if you crawl this site your IP will be banned. At least mine was when I was playing around with a web crawler I built.



There is an unofficial API: http://www.hnsearch.com/api (Provided by the very search engine referred to in the OP, haha!)


Unfortunately there is no API for getting access to personal information on HN (i.e. comments I have made, or stories I've upvoted). You're relegated to scraping if you want that information.


Now there is:

Here is how you can pull a specific username's submissions, and you can add filters: http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...

And here is how you can pull the comments for a specific thread/discussion/id: http://api.thriftdb.com/api.hnsearch.com/items/_search?filte...

You can now grab a lot of data. They also enlarged the site's RSS feed in hopes of slowing down a few of the scrapers.

There are a few items missing, but they added a lot: http://www.hnsearch.com/api

Btw, that includes a user bio now, as well as things you've upvoted, etc. It's all just done via filters.

They also boosted the RSS feed to help slow down the scrapers.


You can always get around this by throttling your web crawler. It will take a much longer time, but at least you'll be able to read HN in the meantime.
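For what it's worth, throttling can be sketched roughly like this. Everything here is an assumption for illustration: the `Throttle` and `crawl` names are made up, `fetch` is whatever download function you already have, and the 30-second interval is a guess, not a rate HN is known to tolerate.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl(urls, fetch, throttle):
    """Fetch each URL in order, pausing between requests."""
    pages = []
    for url in urls:
        throttle.wait()
        pages.append(fetch(url))
    return pages
```

You'd call it as `crawl(urls, my_fetch, Throttle(30.0))`. The first request goes out immediately and every later one waits until the interval has passed, so the crawl is slow but steady.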


The tricky thing when doing this is knowing what rate to settle on without getting permanently banned. I built an Android Market crawler two summers ago, and luckily Google only issues temporary bans (in my experience), so that might be an easier project without any risk.


Respecting robots.txt is probably the best plan.
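A minimal sketch of checking robots.txt with Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS` and the `example-crawler` agent name are made up for illustration and are not the site's actual robots.txt; in a real crawler you would point the parser at the live file with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only, not any real site's robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed(parser, url, agent="example-crawler"):
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)
```

Check `allowed(parser, url)` before each request, and if `parser.crawl_delay(agent)` returns a number, treat it as the minimum pause between requests.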


Use disposable IPs.


Why will you be banned? Do you know the reason?


To prevent people from (unintentionally) DDoSing the site.



