> Give each process a FIFO to read URLs from. Then you choose which FIFO to add a URL to based on the address so that all URLs with the same address are assigned to the same process.
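That routing can be pretty small. A rough bash sketch of the idea, just to make it concrete (the worker count, the FIFO paths, and frontier.txt are all made-up names here):

    #!/usr/bin/env bash
    # Four workers, each reading URLs from its own FIFO and handing them
    # to wget one at a time.
    N=4
    for i in 0 1 2 3; do
        mkfifo "/tmp/crawler-$i.fifo"
        while read -r url; do
            wget -P "crawl-$i/" "$url"
        done < "/tmp/crawler-$i.fifo" &
    done

    # Keep a write descriptor open on every FIFO so the workers don't
    # see EOF between URLs.
    exec 10>"/tmp/crawler-0.fifo" 11>"/tmp/crawler-1.fifo" \
         12>"/tmp/crawler-2.fifo" 13>"/tmp/crawler-3.fifo"

    # Dispatcher: hash the host part of each URL so that every URL with
    # the same host always lands in the same worker's FIFO.
    while read -r url; do
        host=${url#*://}; host=${host%%/*}
        slot=$(( $(printf '%s' "$host" | cksum | cut -d' ' -f1) % N ))
        printf '%s\n' "$url" >&$((10 + slot))
    done < frontier.txt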
I wrote this in a reply to myself a moment after you posted your comment so I'll just move it here:
Regarding the last two issues I mentioned, you could sort the list of URLs by domain, then split the list whenever the current chunk has reached >= n URLs and the domain on the current line differs from the domain on the previous line. As long as wget can at least honor robots.txt directives between consecutive requests to a domain, it should all work out fine.
It looks like an easily solvable problem however you go about it.
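To make that concrete, a sort + awk sketch of the splitting step (frontier.txt, the chunk size, and the chunk.* output names are placeholders):

    # Sort the frontier so all URLs for a host sit next to each other
    # (field 3 of "http://host/path" split on "/" is the host).
    sort -t/ -k3,3 frontier.txt > frontier.sorted

    # Split into chunks of at least n URLs, only breaking between hosts,
    # so no host ends up spread across two wget runs.
    awk -v n=1000 'BEGIN { chunk = 0 }
    {
        split($0, p, "/"); host = p[3]
        if (count >= n && host != prev) { chunk++; count = 0 }
        print > ("chunk." chunk)
        count++; prev = host
    }' frontier.sorted

Each chunk file can then be fed to its own wget -i run, with -w and --random-wait for basic politeness.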
> It really depends what you're trying to do here.
I was thinking about HTTP requests that respond with 4xx and 5xx errors. It would need to be possible to either remove those from the frontier and store them in a separate list, or mark them with the error code so they can be checked at some point before being passed on to wget.
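Something in this direction might be enough (a rough wrapper; frontier.txt, fetched/, done.txt, and failed.txt are invented names):

    # Fetch each URL in the frontier; anything wget can't retrieve
    # (4xx/5xx, DNS failures, etc.) goes to a separate failure list
    # instead of staying in the frontier.
    while read -r url; do
        if wget -P fetched/ "$url"; then
            echo "$url" >> done.txt
        else
            # wget exits non-zero on errors (8 for a server-issued
            # error response in recent GNU wget).
            echo "$url" >> failed.txt
        fi
    done < frontier.txt

Capturing the specific status code would mean parsing wget's log (-o plus --server-response), but a plain failure list already keeps the frontier clean.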
Open file on disk. See that it's 404. Delete file. Re-run crawler.
You'd turn that into code by running grep -R 404 . (or grepping for whatever the actual unique error string is) and deleting any file that contains the match. (You'd be careful not to run that recursive delete on any unexpected data.)
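Something like this, with the usual caution (the error string and the crawl directory are placeholders; check the match list before wiring in the delete):

    # Dry run: list the files whose contents match the site's 404 text.
    grep -rl "404 Not Found" ./crawl

    # Then delete them; -Z and -0 keep odd filenames from breaking the
    # pipeline.
    grep -rlZ "404 Not Found" ./crawl | xargs -0 rm -v --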
Really, these problems are pretty easy. It's easy to overthink it.
This isn't 1995 anymore. When you hit a 404 error, you no longer get Apache's default 404 page. You really can't count on there being any consistency between 404 pages on different sites.
If wget somehow stored the response header info to disk (e.g. "FILENAME.header-info"), you could whip something up to do what you are suggesting, though.
Yeah, wget can store response info to disk. Besides, even if it couldn't, you could still visit a 404 page of the website and figure out a unique string of text to search for.
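The relevant knobs, as far as I know, are --save-headers (prepend the HTTP response headers to each saved file) plus --content-on-error (keep the body of error responses too). With those, the status line itself ends up on disk and greppable; crawl/ and frontier.txt are made-up names here:

    # Keep error pages and prepend the response headers to every file.
    wget --save-headers --content-on-error -P crawl/ -i frontier.txt

    # Each saved file now begins with its status line, e.g.
    # "HTTP/1.1 404 Not Found", so failed fetches can be found without
    # guessing at page content. Check the match list before deleting.
    grep -rlZ "^HTTP/1\.[01] 4[0-9][0-9]" crawl/ | xargs -0 rm -v --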