
The problem with many sites (and LinkedIn in particular) is that they whitelist a handful of specific crawlers in their robots.txt, presumably based on their business interests, and disallow everyone else. You should either allow all scrapers that respect certain load requirements or allow none. Anything that Google is allowed to see and include in its search results should be fair game.

Here's the end of LinkedIn's robots.txt:

    User-agent: *
    Disallow: /

    # Notice: If you would like to crawl LinkedIn,
    # please email whitelist-crawl@linkedin.com to apply
    # for white listing.
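
A quick way to see what that catch-all rule means for any crawler not on the whitelist is the standard-library robots.txt parser. This is just a sketch: the rules are a local copy of the snippet above rather than a live fetch, and "SomeUnlistedBot" is a placeholder user agent, not a real crawler name.

    from urllib import robotparser

    # The catch-all rules quoted above, fed straight into the stdlib parser
    # (parsing a local copy avoids fetching https://www.linkedin.com/robots.txt).
    rules = """
    User-agent: *
    Disallow: /
    """.splitlines()

    parser = robotparser.RobotFileParser()
    parser.parse(line.strip() for line in rules)

    # "SomeUnlistedBot" stands in for any crawler without its own allow section.
    print(parser.can_fetch("SomeUnlistedBot", "https://www.linkedin.com/in/someone"))
    # -> False: every path is disallowed for agents not explicitly whitelisted.
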
And this is what the HiQ case hinged on. LinkedIn was essentially applying the Computer Fraud and Abuse Act selectively, based on its own business interests, and that was never going to sit well with judges.


Btw, LinkedIn does have an API for things like Sales Navigator. You have to go through a somewhat weird partnership program (SNAP) to get access, and it starts at (I think) $1,500/year per user. Still pretty cheap though; I think you'd get the value out of that quite quickly at a >300-person company.



