Why? Java is extremely well suited to crawling and parsing. Extremely well suited to backend tasks running on servers for months on end without crashing.
It's blisteringly fast, has low CPU usage, and would suit the core task of crawling websites extremely well. The async NIO libs are fantastic for network I/O.
What would you use and why? (And why would you not use Java?)
We've found Java's regular expression capabilities to be a bit frustrating at times. Although they're easy to use, they can be very slow for certain types of regexes. Does anyone know of a "fast regex" class?
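For what it's worth, a frequent cause of slow java.util.regex patterns is backtracking over nested quantifiers. Below is a minimal sketch, assuming that is the kind of regex causing trouble here. The class and method names are made up for illustration. Possessive quantifiers such as a++ tell the engine not to backtrack into what they already matched, which can help on long inputs that end up not matching.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from any library.
final class PossessiveRegexSketch {
    // Nested greedy quantifiers: prone to heavy backtracking on a long run of 'a' with no 'b'.
    static final Pattern GREEDY = Pattern.compile("(?:a+)+b");
    // Accepts the same strings, but a++ is possessive: it never gives back the 'a's it consumed.
    static final Pattern POSSESSIVE = Pattern.compile("(?:a++)+b");

    static boolean matchesGreedy(String s) {
        return GREEDY.matcher(s).matches();
    }

    static boolean matchesPossessive(String s) {
        return POSSESSIVE.matcher(s).matches();
    }

    // Builds a string of n copies of 'a' followed by the given tail.
    static String runOfA(int n, String tail) {
        StringBuilder sb = new StringBuilder(n + tail.length());
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.append(tail).toString();
    }

    public static void main(String[] args) {
        // Short inputs: both variants agree.
        System.out.println(matchesGreedy("aaab") + " " + matchesPossessive("aaab"));
        // Long non-matching input: only the possessive variant is tried here,
        // since the greedy one may take a very long time on some JVMs.
        System.out.println(matchesPossessive(runOfA(5000, "c")));
    }
}
```

This is not a "fast regex" class, just a way to rewrite a pattern so the stock engine has less work to do. Whether it applies depends on the pattern; possessive quantifiers change what matches if a later part of the regex needs characters the quantifier already consumed.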
Perhaps you're talking about Java SE 6, not Java runtime of 10 years ago?
> Extremely well suited to backend tasks running on servers for months on end without crashing.
Running for months without crashing: maybe. But by the end of that month (well, week really) it will be so slow (due to memory leaks, ironically) that your only option will be auto-restarting it every now and then...
AFAIK Larry and Sergey chose Perl at the beginning. Now it should be mostly Python and C.
I've been using Java for backend/net crawl tasks since about 2001. It definitely improved drastically with the addition of NIO, and there were some irritating segfault issues a few years ago, but nothing a rollback to an earlier JVM didn't fix (until Sun fixed it).
You can certainly run for months without issue (memory/crash/speed) as long as you don't have any leaks in your own code.
I'm pretty sure Java is still widely used at Google.
If I were writing the Google crawler from scratch today, I'd certainly start with Java, then probably use Perl/Python for less critical scripting glue, and maybe rewrite any CPU-intensive stuff in C/asm.
The Perl runtime is far more stable than Java. I've never seen Perl crash, whereas Java simply gives up the ghost and goes belly up from time to time (not very often, though).
Interesting excerpt on their view of advertising back then:
Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is "The Effect of Cellular Phone Use Upon Driver Attention", a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.
Looks like they solved the problem by turning it on its head.
For me the first step in figuring out a solution is to clearly understand the problem. It always seems like the solution is far easier than that first part. Well... then there's implementation :)
The hilarious bit is that they dropped Java for Python, probably due to the massive levels of frustration encountered when trying to do simple things like this.
It is reasonable to recall that Java was being promoted in 1996 as the "net" language, yet lacked basic mature libraries for being so. Java only became a decent language with mature libs after many years of front running this position. Some of us who used Java early on, mainly because business drivers forced it on us (Sun/IBM/BEA wouldn't lie to corporate America?), don't have fond memories.
No idea. I'm not even sure that in 1996 I had heard of Python. Not sure why folks on this thread want to keep bringing Python into the mix, especially in a 1996 context. Others have stated that Python isn't and never has been the core of Google's search engine.
Did URLConnection.setRequestProperty exist back in JDK 1.0?
The closest I could find were the docs for JDK 1.1.8 in a downloadable zip file, and yes, setRequestProperty existed back in JDK 1.1.8 at least.
Looking at the actual response and the JDK 1.1.8 docs, he would probably have been using HttpURLConnection (I could not find HttpClient anywhere in the JDK 1.1.8 docs). Even on the HttpURLConnection page for JDK 1.1.8, I could not find the string 'agent' anywhere.
So yeah, if the settings were there, they were buried and not readily accessible in the documentation of the time.
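For reference, here is roughly how setting the agent string through setRequestProperty looks on a current JDK. This is only a sketch: the class name, the agent string, and the example.com URL are placeholders, and it says nothing about how the 1996 code was actually written or how JDK 1.0 behaved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical demo class; names and values are illustrative only.
final class UserAgentSketch {
    // Made-up agent string, not the one any real crawler used.
    static final String AGENT = "ExampleCrawler/0.1";

    static URLConnection open(String url) {
        try {
            // openConnection() only creates the connection object; nothing is sent yet.
            URLConnection conn = URI.create(url).toURL().openConnection();
            // Request headers have to be set before the connection is actually made.
            conn.setRequestProperty("User-Agent", AGENT);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection conn = open("http://example.com/");
        // Reads back the header that would be sent; still no network traffic.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

The point being that "User-Agent" is just a header name passed as a string, so searching the class docs for 'agent' would not turn anything up.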