Sure. At earlier jobs, I often got paged because some bot was crawling our web pages out of control. Some pages are expensive to serve and we don't expect them to be hit often, but crawling them indiscriminately (even accidentally) can bring the site down. There are also pages whose underlying resources are billed on a pay-as-you-use basis; once again, heavy bot traffic ran up our bills.

Robots.txt allows site owners to keep bots away from such pages. Services that help people circumvent those restrictions are being rude, to say the least. Many crawling services also use farms of proxies that spoof their real identity with fake user agents to get around rate limiting and the like. All of these "strategies" go far beyond basic automation and are, in reality, quite shady.
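
For illustration, a minimal robots.txt along these lines (the paths here are made-up placeholders, not from any real site) is enough to tell well-behaved bots to stay off the expensive endpoints:

  # Hypothetical example: keep bots off endpoints that are expensive to serve
  User-agent: *
  Disallow: /search        # fans out into many backend queries per hit
  Disallow: /export/       # backed by a pay-per-use third-party API
  Crawl-delay: 10          # non-standard; some crawlers honor it, others ignore it

Of course that only helps against crawlers that actually read the file and respect it; the proxy-farm crowd simply doesn't.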

There's actually a difference between crawling and web scraping. Crawling discovers pages loosely: it follows every link, digests each page, and produces more crawl tasks. Web scraping, on the other hand, is a more controlled process where the rules are pretty strict, e.g. scrape `product-<product id>.html` links for product data, so scrapers are very unlikely to stumble onto some random page.
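
A rough sketch of the difference, assuming requests and BeautifulSoup are available and a made-up `product-<id>.html` URL pattern:

  import re
  from urllib.parse import urljoin

  import requests
  from bs4 import BeautifulSoup

  def crawl(start_url):
      # Crawler: follows every link it discovers, so it can wander onto
      # heavy pages (search, exports, etc.) purely by accident.
      seen, queue = set(), [start_url]
      while queue:
          url = queue.pop()
          if url in seen:
              continue
          seen.add(url)
          html = requests.get(url, timeout=10).text
          for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
              queue.append(urljoin(url, a["href"]))
      return seen

  def scrape_products(listing_url):
      # Scraper: only fetches URLs matching a known pattern and never strays off it.
      html = requests.get(listing_url, timeout=10).text
      links = BeautifulSoup(html, "html.parser").find_all("a", href=True)
      product_urls = [urljoin(listing_url, a["href"])
                      for a in links
                      if re.fullmatch(r"product-\d+\.html", a["href"])]
      return {u: requests.get(u, timeout=10).text for u in product_urls}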

Also, unfortunately, robots.txt is rarely used to mark non-crawlable endpoints these days; instead it's used as a way to withhold public data. Just take any random big website and look at its robots.txt file:

  User-agent: Googlebot
  Allow: /

  User-agent: *
  Disallow: /
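
You can confirm what that policy means for anyone who isn't Google with the standard library's robotparser; a minimal check (the bot name other than Googlebot is made up):

  from urllib import robotparser

  rp = robotparser.RobotFileParser()
  # The "Googlebot-only" policy quoted above.
  rp.parse([
      "User-agent: Googlebot",
      "Allow: /",
      "",
      "User-agent: *",
      "Disallow: /",
  ])
  print(rp.can_fetch("Googlebot", "/products/123"))      # True
  print(rp.can_fetch("SomeOtherBot", "/products/123"))   # False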
