Hacker News

I wrote a reply but you edited out the chunk of text that I quoted, so here's a new reply.

> After all, the user's device will never even see the final DOM now; instead it's getting fetched, parsed, and processed on a third device, which is objectively a robot.

Sure, but why does it matter if the machine that I ask to fetch, parse, and process the DOM lives on my computer or on someone else's? I, the human being, will never see the DOM either way.

This distinction between my computer and a third-party computer quickly falls apart when you push at it.

If I issue a curl request from a server that I'm renting, is that a robot request? What about if I'm using Firefox on a remote desktop? What about if I self-host a client like Perplexity on a local server?

We live in an era where many developers run their IDE backend in the cloud. The line between "my device" and "cloud device" has been blurred almost entirely, so making that the line between "robot" and "not robot" is entirely irrational in 2024.

The only definition of "robot" or "crawler" that makes any kind of sense is the one provided by robotstxt.org [0], and it's one that unequivocally would incorporate Perplexity on the "not robot" side:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced. ... Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

Or the MDN definition [1]:

> A web crawler is a program, often called a bot or robot, which systematically browses the Web to collect data from webpages. Typically search engines (e.g. Google, Bing, etc.) use crawlers to build indexes.

Perplexity issues one web request per human interaction and does not fetch referenced pages. It cannot be considered a "crawler" by either of these definitions, and the definition you've come up with just doesn't work in the era of cloud software.

[0] https://www.robotstxt.org/faq/what.html

[1] https://developer.mozilla.org/en-US/docs/Glossary/Crawler
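For what it's worth, the robotstxt.org rules quoted above are exactly what Python's standard-library `urllib.robotparser` implements. A minimal sketch of how a well-behaved fetcher checks those rules before requesting a page (the `ExampleBot` user-agent and the robots.txt body here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the user-agent name and paths
# are invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A fetcher identifying as ExampleBot is barred from /private/,
# but any other user-agent is allowed everywhere.
print(rp.can_fetch("ExampleBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("ExampleBot", "https://example.com/public/page"))   # True
```

Note that `can_fetch` only answers "may this user-agent request this URL" — nothing in the protocol itself distinguishes a recursive crawler from a one-shot fetch.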




I'm honestly confused here; if anything, aren't your quotes literally confirming my point?

It's triggering an automation which fetches data. This is a crawl, even if the crawl has a very limited scope. (It's also not limited to a single request; that's just the scope that's used by default. But even if it were programmatically limited to only ever request a single resource, that would still be a crawl. While recursion is the norm to build indexes, it's not necessary for all use cases that utilize crawlers.)

Have you ever actually built anything that uses crawlers to gather information you want? You might be surprised to learn that triggering an ad hoc fetch of a single resource is actually pretty common for keeping data up-to-date.
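That kind of single-resource automation is trivial to write: one request per trigger, links parsed but never followed. A minimal sketch, with a pluggable fetcher so the "exactly one request" behavior is explicit (the function names here are hypothetical, not from any real crawler):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records <a href> targets without fetching them."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_once(url, fetcher):
    """Issue exactly one fetch for `url`; parse out referenced links
    but deliberately never request them (no recursion)."""
    html = fetcher(url)
    collector = LinkCollector()
    collector.feed(html)
    return html, collector.links
```

Whether you call that a "crawl" or not, the automation exists either way — the scope is just pinned to a single document.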

> If I issue a curl request from a server that I'm renting, is that a robot request? What about if I'm using Firefox on a remote desktop? What about if I self-host a client like Perplexity on a local server?

Yes, anything on a third device is effectively a robot that's acting on behalf of the actor.



