> Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
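That routing can be pretty small. A rough bash sketch of the idea, just to make it concrete (the worker count, the FIFO paths, and frontier.txt are all made-up names here):

    #!/usr/bin/env bash
    # Four workers, each reading URLs from its own FIFO and handing them
    # to wget one at a time.
    N=4
    for i in 0 1 2 3; do
        mkfifo "/tmp/crawler-$i.fifo"
        while read -r url; do
            wget -P "crawl-$i/" "$url"
        done < "/tmp/crawler-$i.fifo" &
    done

    # Keep a write descriptor open on every FIFO so the workers don't
    # see EOF between URLs.
    exec 10>"/tmp/crawler-0.fifo" 11>"/tmp/crawler-1.fifo" \
         12>"/tmp/crawler-2.fifo" 13>"/tmp/crawler-3.fifo"

    # Dispatcher: hash the host part of each URL so that every URL with
    # the same host always lands in the same worker's FIFO.
    while read -r url; do
        host=${url#*://}; host=${host%%/*}
        slot=$(( $(printf '%s' "$host" | cksum | cut -d' ' -f1) % N ))
        printf '%s\n' "$url" >&$((10 + slot))
    done < frontier.txt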
I wrote this in a reply to myself a moment after you posted your comment so I'll just move it here:
Regarding the last two issues I mentioned, you could sort the list of URLs by domain, then split the list whenever the current chunk has reached >= n URLs and the domain on the current line differs from the domain on the previous line. As long as wget can at least honor robots.txt directives between consecutive requests to a domain, it should all work out fine.
It looks like an easily solvable problem however you go about it.
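To make that concrete, a sort + awk sketch of the splitting step (frontier.txt, the chunk size, and the chunk.* output names are placeholders):

    # Sort the frontier so all URLs for a host sit next to each other
    # (field 3 of "http://host/path" split on "/" is the host).
    sort -t/ -k3,3 frontier.txt > frontier.sorted

    # Split into chunks of at least n URLs, only breaking between hosts,
    # so no host ends up spread across two wget runs.
    awk -v n=1000 'BEGIN { chunk = 0 }
    {
        split($0, p, "/"); host = p[3]
        if (count >= n && host != prev) { chunk++; count = 0 }
        print > ("chunk." chunk)
        count++; prev = host
    }' frontier.sorted

Each chunk file can then be fed to its own wget -i run, with -w and --random-wait for basic politeness.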
> It really depends what you're trying to do here.
I was thinking about HTTP requests that respond with 4xx and 5xx errors. It would need to be possible to either remove those from the frontier and store them in a separate list, or mark them with the error code so they can be checked at some point before being passed on to wget.
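Something in this direction might be enough (a rough wrapper; frontier.txt, fetched/, done.txt, and failed.txt are invented names):

    # Fetch each URL in the frontier; anything wget can't retrieve
    # (4xx/5xx, DNS failures, etc.) goes to a separate failure list
    # instead of staying in the frontier.
    while read -r url; do
        if wget -P fetched/ "$url"; then
            echo "$url" >> done.txt
        else
            # wget exits non-zero on errors (8 for a server-issued
            # error response in recent GNU wget).
            echo "$url" >> failed.txt
        fi
    done < frontier.txt

Capturing the specific status code would mean parsing wget's log (-o plus --server-response), but a plain failure list already keeps the frontier clean.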
Open file on disk. See that it's 404. Delete file. Re-run crawler.
You'd turn that into code by running grep -R 404 . (or grepping for whatever the actual unique error string is) and deleting any file that contains the match. (You'd be careful not to run that recursive delete on any unexpected data.)
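Something like this, with the usual caution (the error string and the crawl directory are placeholders; check the match list before wiring in the delete):

    # Dry run: list the files whose contents match the site's 404 text.
    grep -rl "404 Not Found" ./crawl

    # Then delete them; -Z and -0 keep odd filenames from breaking the
    # pipeline.
    grep -rlZ "404 Not Found" ./crawl | xargs -0 rm -v --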
Really, these problems are pretty easy. It's easy to overthink it.
This isn't 1995 anymore. When you hit a 404 error, you no longer get Apache's default 404 page. You really can't count on there being any consistency between 404 pages on different sites.
If wget somehow stored the response header info to disk (e.g. "FILENAME.header-info"), you could whip something up to do what you are suggesting, though.
Yeah, wget can store response info to disk. Besides, even if it couldn't, you could still visit a 404 page of the website and figure out a unique string of text to search for.
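The relevant knobs, as far as I know, are --save-headers (prepend the HTTP response headers to each saved file) plus --content-on-error (keep the body of error responses too). With those, the status line itself ends up on disk and greppable; crawl/ and frontier.txt are made-up names here:

    # Keep error pages and prepend the response headers to every file.
    wget --save-headers --content-on-error -P crawl/ -i frontier.txt

    # Each saved file now begins with its status line, e.g.
    # "HTTP/1.1 404 Not Found", so failed fetches can be found without
    # guessing at page content. Check the match list before deleting.
    grep -rlZ "^HTTP/1\.[01] 4[0-9][0-9]" crawl/ | xargs -0 rm -v --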