
Rate limiting can be a double-edged sword: you can be better off giving a scraper full bandwidth so it's gone sooner. Otherwise, something like making a zip or other compiled archive of the site available for download may be an option.
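As a rough illustration of the "let them finish fast" idea, here's a minimal per-IP token-bucket sketch in Python with a deliberately generous rate. The RATE/BURST values and the allow() helper are made up for the example, not anything specific to a real site:

    # Per-IP token bucket: refill tokens over elapsed time, cap at BURST,
    # spend one token per request. Generous limits mean a well-behaved
    # scraper finishes in minutes instead of camping on the site for days.
    import time
    from collections import defaultdict

    RATE = 50.0    # requests refilled per second -- generous on purpose
    BURST = 200.0  # bucket capacity

    _buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

    def allow(ip: str) -> bool:
        b = _buckets[ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
        b["ts"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False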

Just what kind of scraper you have is a concern:

- does the scraper just want a bunch of stock images;

- does the scraper have FOMO on web trinkets;

- or does the scraper want to mirror/impersonate your site?

The last case is the most concerning, because then either:

- the scraper is mirroring because your site is cool and a local UI/UX is wanted;

- or the scraper is phishing, smishing, or otherwise duping your users.




Yeah, good points to consider. I think the sites that would be scraped the most are those where the data is regularly and reliably up-to-date, and there's a large volume of it at that - so it's not just one scraper: many different parties may try to scrape every page on a daily or weekly basis.

I feel that ruling should have a caveat: if a fairly priced paid API exists for getting the publicly listed data (say, no more than 5% above the CPU/bandwidth/etc. cost of the equivalent scraping), then scrapers must legally use it. Ideally there would also be a rule that, at minimum, there is a delay before they can republish that data without your permission, so at least you as the platform/source/reason for the data being up-to-date aren't harmed too - otherwise regular visitors may start going to the competitor publishing the data, which could kill the source platform over time.
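One way the platform itself could enforce that delay is an embargoed bulk endpoint. A sketch of the idea, assuming Flask and an in-memory records store (both purely for illustration, as are the route and field names): API consumers only see records past the embargo window, so the source site always stays the freshest place to look.

    # Hypothetical embargoed bulk endpoint: the API only returns records
    # older than EMBARGO, so republishers always lag the source site.
    from datetime import datetime, timedelta, timezone
    from flask import Flask, jsonify

    app = Flask(__name__)
    EMBARGO = timedelta(hours=24)

    records = []  # each item: {"id": ..., "published_at": <aware datetime>, ...}

    @app.route("/api/v1/listings")
    def listings():
        cutoff = datetime.now(timezone.utc) - EMBARGO
        visible = [r for r in records if r["published_at"] <= cutoff]
        return jsonify([
            {**r, "published_at": r["published_at"].isoformat()}
            for r in visible
        ])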



