Why? Java is extremely well suited to crawling and parsing. It's also extremely well suited to backend tasks running on servers for months on end without crashing.
It's blisteringly fast, has low CPU usage, and would suit the core task of crawling websites extremely well. The async NIO libs are fantastic for network I/O.
What would you use and why? (And why would you not use Java?)
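For readers who haven't used NIO, here is a minimal sketch of selector-based non-blocking I/O. It is an illustration under assumptions, not anyone's production crawler: `NioSketch`, `readAll`, and `demo` are made-up names, the loopback server in `demo` exists only to make the example self-contained, and the lambda and `bind` call assume Java 8 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

class NioSketch {

    /** Reads everything the peer sends until it closes, never blocking on connect or read. */
    static String readAll(InetSocketAddress target) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Selector selector = Selector.open();
             SocketChannel ch = SocketChannel.open()) {
            ch.configureBlocking(false);
            // connect() returns at once; the selector reports when the connection is ready
            boolean connected = ch.connect(target);
            ch.register(selector, connected ? SelectionKey.OP_READ : SelectionKey.OP_CONNECT);
            ByteBuffer buf = ByteBuffer.allocate(4096);
            boolean done = false;
            while (!done) {
                // Treat "nothing ready within 5 s" as a timeout (a simplification)
                if (selector.select(5000) == 0) {
                    throw new IOException("timed out waiting for " + target);
                }
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isConnectable() && ch.finishConnect()) {
                        key.interestOps(SelectionKey.OP_READ); // connected: now wait for data
                    } else if (key.isReadable()) {
                        buf.clear();
                        int n = ch.read(buf);
                        if (n < 0) {
                            done = true;                        // peer closed the connection
                        } else {
                            out.write(buf.array(), 0, n);
                        }
                    }
                }
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Hypothetical demo: a throwaway loopback server sends "hello" and closes. */
    static String demo() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0 = any free port
            Thread t = new Thread(() -> {
                try (SocketChannel c = server.accept()) {
                    c.write(ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8)));
                } catch (IOException ignored) {
                    // server closed before a client connected
                }
            });
            t.start();
            String result = readAll((InetSocketAddress) server.getLocalAddress());
            t.join();
            return result;
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A real crawler would register many channels with the same selector, so that one thread can service hundreds of connections; the single-channel version above only shows the mechanics.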
We've found Java's regular expression capabilities to be a bit frustrating at times. Although they're easy to use, they can be very slow for certain types of regexes. Does anyone know of a "fast regex" class?
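Not a separate class, but one common cause of this kind of slowness is backtracking on nested quantifiers. The sketch below compares two patterns that accept the same strings; `RegexSketch` and both patterns are made up for illustration, and the actual timings depend on the JVM version.

```java
import java.util.regex.Pattern;

final class RegexSketch {
    // Compile once and reuse: Pattern objects are immutable and thread-safe.

    // Nested quantifiers: when no 'b' follows, a backtracking engine may try
    // a very large number of ways to split the run of 'a's before giving up.
    static final Pattern NESTED = Pattern.compile("(a+)+b");

    // Possessive a++ never gives characters back, so a failed match is
    // rejected after a single pass over the input.
    static final Pattern POSSESSIVE = Pattern.compile("a++b");

    /** Returns how long one full-match attempt takes, in nanoseconds. */
    static long timeNanos(Pattern p, String input) {
        long start = System.nanoTime();
        p.matcher(input).matches();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // 22 'a's and no 'b': a non-matching input, the worst case for NESTED
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 22; i++) {
            sb.append('a');
        }
        String input = sb.toString();
        System.out.println("nested:     " + timeNanos(NESTED, input) + " ns");
        System.out.println("possessive: " + timeNanos(POSSESSIVE, input) + " ns");
    }
}
```

If a regex is slow only on inputs that fail to match, rewriting it with possessive quantifiers or atomic groups (`(?>...)`) is usually the first thing to try.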
Perhaps you're talking about Java SE 6, not the Java runtime of 10 years ago?
> Extremely well suited to backend tasks running on servers for months on end without crashing.
Running for months without crashing - maybe. But by the end of the first month (well, week really) it will be so slow (due to memory leaks, ironically) that your only option will be auto-restarting it every now and then...
AFAIK Larry and Sergey chose Perl at the beginning. Now it should be mostly Python and C.
I've been using Java for backend/net crawl tasks since about 2001. It definitely improved drastically with the addition of NIO. There were some irritating segfault issues a few years ago, but nothing that a rollback to an earlier JVM didn't fix (until Sun fixed it).
You can certainly run for months without issue (memory/crash/speed) as long as you don't have any leaks in your own code.
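To make "leaks in your own code" concrete: one common culprit in long-running Java processes is a map or cache that only ever grows. Below is a minimal sketch of a size-bounded LRU cache built on `LinkedHashMap`; the class name `BoundedCache` and the idea that a crawler would keep such a cache are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts
    // the least recently used entry, so the map can never outgrow maxEntries.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Usage is the same as any `Map`: `new BoundedCache<String, byte[]>(10_000)` holds at most 10,000 entries no matter how long the process runs. Note that `LinkedHashMap` is not thread-safe, so a multi-threaded crawler would need to synchronize access to it.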
I'm pretty sure Java is still widely used at Google.
If I were writing the Google crawler from scratch today, I'd certainly start with Java, then probably use Perl/Python for less critical scripting glue, and maybe rewrite any CPU-intensive stuff in C/asm.
The Perl runtime is far more stable than Java. I've never seen Perl crash, whereas Java simply gives up the ghost and goes belly up from time to time (not very often, though).