Do we know they didn't download the DB? Maybe the new traffic is the LLM reading the site on demand (not training on it)?
I don't know that LLMs read sites. I only know that when I use one, it tells me it's checking sites X, Y, Z, thinking about the results, checking sites A, B, C, etc. I assumed it was actually reading the sites on my behalf and not just referring to its internal training knowledge.
Like, how are people training LLMs, and how often does each one scrape? From the outside, it feels like the big ones (ChatGPT, Gemini, Claude, etc.) scrape only a few times a year at most.
I would guess site operators can tell the difference between an exhaustive crawl and the targeted, specific traffic I'd expect to see from an LLM checking sources on demand. For one thing, the latter would show time-of-day patterns tied to waking hours in the relevant parts of the world, whereas exhaustive crawl traffic would probably be pretty constant all day and night.
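If a site operator wanted to eyeball this in their own logs, a minimal sketch along these lines might do it. The log path, the combined-log-format regex, and the bot user-agent substrings are all assumptions to swap out for your own setup; a flat hourly histogram suggests an exhaustive crawl, a diurnal one suggests on-demand fetches.

    # Rough sketch: bucket requests by hour of day to see whether traffic
    # from a given class of user agents is flat around the clock (crawler-like)
    # or follows waking hours (on-demand-fetch-like). The log path, log format
    # (nginx/Apache combined), and UA substrings below are all assumptions.
    import re
    from collections import Counter, defaultdict

    LOG_PATH = "/var/log/nginx/access.log"  # placeholder path

    # Timestamp like [10/Oct/2025:13:55:36 +0000]; UA is the last quoted field.
    LINE_RE = re.compile(
        r'\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2}[^\]]*\].*"([^"]*)"\s*$'
    )

    # Crawler UA substrings to look for -- adjust to whatever shows up in your logs.
    BOT_HINTS = ("GPTBot", "ClaudeBot", "CCBot", "Bytespider")

    hourly = defaultdict(Counter)  # traffic class -> Counter(hour -> count)

    with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if not m:
                continue
            hour, ua = int(m.group(1)), m.group(2)
            klass = "crawler" if any(h in ua for h in BOT_HINTS) else "other"
            hourly[klass][hour] += 1

    # Print an hour-of-day histogram per traffic class.
    for klass, counts in sorted(hourly.items()):
        total = sum(counts.values()) or 1
        print(f"\n{klass} ({total} requests)")
        for h in range(24):
            share = counts[h] / total
            print(f"  {h:02d}:00  {share:6.1%}  {'#' * round(share * 100)}")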
Also, to be clear, I doubt those big players are doing these crawls. I assume it's small startups who think they're going to build a big dataset to sell, or to train their own model.