Google's Indexing Javascript more than we thought (distilled.net)
112 points by dohertyjf on Dec 2, 2011 | 30 comments



It occurs to me that if GoogleBot is executing client javascript you could take advantage of Google's resources for computational tasks.

For instance, let me introduce you to SETI@GoogleBot. SETI@GoogleBot is much like SETI@home, except it takes advantage of GoogleBot's recently discovered capabilities. Including the SETI@GoogleBot script in your web pages causes the page (after the load event) to fetch a chunk of data from the SETI servers via an ajax request and process that data in JavaScript. Once the data has been processed, it is posted back to the SETI servers (via another ajax request) and the cycle repeats. Thus, for the small cost of a page load, you have GoogleBot process your SETI data and enhance your SETI@home score.
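A rough sketch of what the included script might look like; the /seti/chunk and /seti/result endpoints and the processChunk stub are, of course, made up:

    // Hypothetical endpoints and a stub worker; the real SETI servers
    // don't expose anything like this.
    function processChunk(chunk) {
      // stand-in for the actual number crunching
      return { id: chunk.id, answer: 42 };
    }

    function crunchNextChunk() {
      var get = new XMLHttpRequest();
      get.open('GET', '/seti/chunk', true);
      get.onload = function () {
        var result = processChunk(JSON.parse(get.responseText));
        var post = new XMLHttpRequest();
        post.open('POST', '/seti/result', true);
        post.setRequestHeader('Content-Type', 'application/json');
        post.onload = crunchNextChunk; // and repeat the cycle
        post.send(JSON.stringify(result));
      };
      get.send();
    }

    window.addEventListener('load', crunchNextChunk, false);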

Obviously, this isn't a new idea (using page loads to process data via JavaScript) but it is an interesting application to exploit GoogleBot's likely vast resources.


One would assume that they are clever enough to have built-in safeguards to prevent anything from running too long or using too much processing power.


Not just that, I would also assume that PageRank will penalise your site if your JS takes too long to execute.


Looking at Google Webmaster Tools, I see a significant decline in my reported site performance starting in September, even though my site's speed has improved significantly since then by my own measures. Assuming this is due to our 'next' feature that AJAXes content in, I'm going to disallow the 'next' URLs in robots.txt and cross my fingers.
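Something along these lines, where /next/ stands in for whatever the 'next' URLs actually look like:

    User-agent: Googlebot
    Disallow: /next/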


If you see significant positive results, write them up and let us know.


The Google Webmaster Tools site performance is measured through multiple data points, which can include:

- People on dial-up (yes, these still exist)
- People in other countries; if you have not taken care of caching or a CDN, your website might load slower for faraway visitors

This represents an average across all the data points being measured.

The data is captured through:

- Google Toolbar
- Google Chrome
- Google DNS services

Your own observations of speed and improvement won't always match the aggregated data Google has access to.

I'm not sure what you are trying to accomplish by disallowing the 'next' URLs in robots.txt. Can you explain what your hypothesis is in this test, and how you would measure success?


I guess disallowing won't work if what you say is correct, i.e. Google doesn't measure site performance with Googlebot. However, that wouldn't explain the slowdown since September unless changes were also made to include JS execution time in the site performance measured by the services you mentioned.

Perhaps a solution then would be to trigger the AJAX on mouseover (see the sketch below), but that seems kludgy. In my case, I need to make the AJAXed content part of the initial page load anyway, for the sake of user experience. But I can see cases where Google should not be counting AJAX as part of the page load time; God forbid somebody uses long polling, for example. Maybe Google is doing this in a smart way, looking at the changes after the AJAX and determining whether they should count as part of the page load.
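The mouseover version would be something like this, with #next and /next-page as placeholders:

    // Fetch the next page only once a user actually shows intent.
    $('#next').one('mouseover', function () {
      $.get('/next-page', function (html) {
        $('#results').append(html);
      });
    });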


Whether you need to worry about what Google reports for page load time really depends on what you are trying to accomplish. In general, it's always good to pay attention to page load times, regardless of what Google makes of them!

You can experiment with asynchronous calls, or lazily loaded jQuery scripts that kick off after the headers and HTML framework have already been loaded.
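For instance, a minimal sketch (the script URL and the init function are placeholders):

    // Defer a heavy script until after the window load event has fired.
    $(window).load(function () {
      $.getScript('/js/heavy-widget.js', function () {
        initHeavyWidget(); // hypothetical init function defined by that script
      });
    });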

Overall, I would not worry about the reports in Google WMC that much; just try to get faster in general.

If you are serious about delivering an ultra-speedy service online, there are services which can test your site or application from multiple locations, on different OSes and connection speeds, or using different browsers. But these services are pricey, trust me on that one!


Better to say "that Google will penalize your site". PageRank is a calculation on the link graph of the web, and nothing else.


If enough people did that where it would matter, Google would probably notice and patch the code to stop executing it.


I imagine they run ephemeral, sandboxed Chrome instances in a cgroup that limits resources like memory and CPU time. The Chrome comic, back when the browser first launched, announced it had headless capabilities, and they've gotten better at sandboxing since.


Is there any example of a site having its dynamically generated* Disqus comments indexed by Google? Disqus is probably one of the most common forms of ajax-generated content on the web, so if Googlebot really were actively indexing dynamic content like this, I would expect to see Disqus supported.

* Disqus has an API that allows you to display comments server-side, so some Disqus implementations (I think Mashable is one) will have comments indexed without the aid of JavaScript.


I don't buy this argument. Wanting to have a more complete rendering engine for their crawler might have been a factor in designing Chrome, but I can't imagine it was in any way the driving force. The costs of developing a browser that runs well on millions of different computers and configurations are far beyond what it would take to make a really great headless version of WebKit for your crawler.


Yeah, I took that part of the original article (the one this one is citing) to be a dumbed-down explanation of Googlebot's new behaviors. That article's audience was SEO people, not "engineers". Unfortunately it's misleading enough that we get articles like this one once people try to extrapolate from it.


I have been saying this for years, but most people have refused to believe me.

Like most people, I long held the belief that robots were just dumb scripts. However, I learnt that this is not the case when I had to trap said robots for a previous employer.

At the time I was working for one of the many online travel sites. Most people are probably not aware that there is quite a bit of money to be made in knowing airline costs. The thing is, getting this information is not actually cheap: most of the GDS (Global Distribution System) providers are big mainframe shops that require all sorts of cunning to emulate a green-screen session for the purposes of booking a flight.

The availability search (I forget the exact codenames for this) is done first. This search gives you the potential flights (after working through the byzantine rules of travel) and a costing or fare quote for your trip. This information is reliable about ~95% of the time. Each search costs a small amount against a pre-determined budget, and slightly more once you go over the limit (kind of like how commercial bandwidth is sold); if my memory serves, it was 0.001 euro cents per search.

During the booking phase (known by the GDS code FXP) the price is actually settled. The booking is a weird form of two-phase commit where first you get a concrete fare quote. This quote "ringfences" the fare, essentially ensuring that the seat cannot be booked by anyone else for roughly 15 minutes. In practice there are a load more technicalities around this part of the system, and as such it is possible for double bookings and overbookings to happen, but let's keep it simple for the sake of this story. These prebookings are roughly 99.5% accurate on price but cost something like 0.75 cents (there is a _lot_ that happens when you start booking a flight).

So with that in mind, if you are in the business of reselling flights, it can be to your advantage to avoid the GDS costs and scrape one of the online travel companies. You also want the prebook version of the fare, as it's more likely to be accurate; the travel sites mind less about people scraping the lookup search.

Thus began the saga of our bot-elimination projects. First we banned all IPs that smashed the site thousands of times; this was easy and killed 45% of the bots dead. Next we set up a proper robots.txt and other ways to discourage Googlebot and the more "honest" robots; that got us up to dealing with 80% of the bots. Next we blocked Chinese, Russian, etc. IP addresses; we found that these often had the most fraudulent bookings anyhow, so no big loss, and that took us up to 90% of the bots.

Killing the last 10% was never done. Every time we tried something new (captchas, JS nonce values, weird redirect patterns, bot traps and pixels, user-agent sniffs, etc.) the bots seemed to immediately work around it. I remember watching the access logs where we had one IP that never, ever bought products, just looked for really expensive flights. I distinctly remember seeing it hit a bot trap, notice the page was bad, and then out of nowhere the same user session appeared on a brand new IP address with a new user agent, one that essentially said "Netscape Navigator 4.0 on X11" (this was in the Firefox 1-2 days, so seeing Unix Netscape Navigator was a rare sight). It was clear the bot had gone and executed the JavaScript nonce with a full browser, and then went back to fast scraping.
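For the curious, the JS nonce trick was roughly this shape; the element id, the toy hash, and the cookie name here are all made up, and the real thing was more involved:

    // The server embeds a per-session challenge in the page; a client that
    // never runs JS never sends the answer back, which flags it as a likely bot.
    var challenge = document.getElementById('nonce').getAttribute('data-value');
    var answer = 0;
    for (var i = 0; i < challenge.length; i++) {
      answer = (answer * 31 + challenge.charCodeAt(i)) % 997;
    }
    document.cookie = 'js_nonce=' + answer + '; path=/';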

A few years later, at the same company but for very different reasons, I wrote a tool to replace a product known as Gomez with an in-house system. The idea of Gomez and similar products like Site Confidence is to run your website as the user sees it, from random IPs across the world, and then report on it. I wrote this tool with XULRunner, which is a stripped-down version of Firefox. Now, admittedly, I had insider knowledge of where the bot traps were, but I was amazed at how easy it was to side-step all of our bot detection in only a few days. I also had unit tests for the system that ran it against sites like Amazon and Google, and even there it was shocking how easily I was able to side-step bot traps (I am sure they have got better since, but it surprised me how easy it was).

I am not saying all the bots are smart, but my mantra since then has been that "if there is value for the bots to be smart, they can get very smart". I guess it's all about the cost payoff for those writing the bots. Is it a good idea to run JS all the time as a spider? Probably not. Does it make sense if it saves you 0.75 cents of cost per search? Very much so!


> I am not saying all the bots are smart, but my mantra since then has been that "if there is value for the bots to be smart, they can get very smart".

I was actually once on the other side of the fence from you, around 5-6 years ago (in a different industry, though). You're right: if there's value to be gained by scraping other people's pages, there's almost always a way around the obstacles.

I remember the day my boss presented me with a link to a strangely named FF extension called Chickenfoot (http://groups.csail.mit.edu/uid/chickenfoot/faq.html). It allowed one to very easily write FF extensions that would programmatically click on whatever links you wanted scraped, all from inside the browser, like a normal user would have done. I used to run FF with this extension installed on a dedicated cheap PC, saving the data to our servers, and from time to time automatically restarting FF because the machine was running out of memory. Fun times :)


Sure, automated browser testing is a whole industry, and I think we all know those tools aren't always used for testing sites you control.

Take a look at Selenium and Watir.


And phantom.js
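A few lines gets you a scriptable WebKit that runs the page's JavaScript; the URL is just a placeholder:

    // save as scrape.js and run with: phantomjs scrape.js
    var page = require('webpage').create();
    page.open('http://example.com/', function (status) {
      if (status === 'success') {
        // evaluate() runs in the page context, after its JavaScript has executed
        var title = page.evaluate(function () {
          return document.title;
        });
        console.log(title);
      }
      phantom.exit();
    });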


I wonder if it also means we don't need to implement _escaped_fragment_ anymore: http://code.google.com/web/ajaxcrawling/docs/getting-started...
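For reference, the scheme maps hashbang URLs to a query parameter that the crawler fetches instead (example.com is just a placeholder):

    http://example.com/page#!state=photos
    is fetched by the crawler as
    http://example.com/page?_escaped_fragment_=state=photos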


Just because GoogleBot can crawl and execute/index JavaScript doesn't mean that it will on your site. The best bet would be to keep them. Or take them off and see what happens; if you don't see negative effects, then you will have discovered something interesting.


We sure saw a hell of a lot of improvement when we moved a lot of client-side rendering to the server. Before that, Google wasn't indexing any of our content that was rendered in JS.

We know googlebot executes JS. But this could be primarily for things like validating that a site isn't cheating by dynamically hiding search keywords and so on. It is also for generating the page preview.

It's good to see that they're starting to index JS-rendered content too, as seen with the Facebook comments widget, but it does not mean we're free to ignore these issues just yet. As it stands, client-side rendering in general means a huge hit to search ranking and experience (e.g. your search results won't have anything meaningful to say, and will probably use irrelevant text like your static footer/copyright notice as the description).


Yeah, I'd definitely say we should continue to follow our established best practices until G gets better at this, but Josh's evidence and our continued testing on this subject are very compelling.


I'd argue that best practice in web development is not requiring JavaScript to load a page, but I'm sure that issue has been done to death in the past.


I agree on this one... just because Google CAN do it doesn't mean they will.

On that note, I am personally of the belief that the fragments are part of Google's learning/training process for their spiders.

If they sniff the XHR traffic on every domain where they encounter a hashbang, they can learn a lot about the use of AJAX and the types of content being exposed via AJAX.


Why the assumption that "GoogleBot" is a single thing? Of course we know that Google has a headless browser running, we see its output in the instant previews, but I'm sure they still do plenty of standard crawling (and probably some halfway/partial JS execution and/or heuristics too).


> My personal favorite example of this is Google Translate, which is one of the most accurate machine translating tools on the planet. Google almost sacked it because it was not profitable, and had it not been for public outcry we may have lost access to this technology altogether.

I kind of missed this "public outcry"; when did it happen? And if Google listens to public outcry, why did we lose Google Code Search?


When the shutdown of Google Translate API was announced, a few months ago. (Just the API, note. Not the tool itself).

It was saved because people care enough about the translation API that they're willing to pay for using it.


You might be able to check what Googlebot executes by adding JavaScript to your site and checking the thumbnail.
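For example, stamp a marker into the page from JS after load and see whether it shows up in the Instant Preview; the element id here is made up:

    // If this marker shows up in the preview thumbnail, the bot ran the script.
    window.addEventListener('load', function () {
      var marker = document.createElement('div');
      marker.id = 'js-marker';
      marker.appendChild(document.createTextNode(
        'Rendered by JavaScript at ' + new Date().toString()));
      document.body.appendChild(marker);
    }, false);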

EDIT: Removed comment about the bot's user-agent. The article links to a Google FAQ which answers the question.


They execute absolutely everything you put in JavaScript; it looks exactly as it does in Chrome. And it looks like they take the snapshot after all the initial processing is done.

Javascript-heavy site with perfect snapshots: http://goo.gl/xNUIM

But they also index and take a snapshot of the no-javascript version: http://goo.gl/eP84M


I think they have it backwards. What if Chrome is GoogleBot? You get quality measurement on pages based on user behavior on the page. Crowdsourcing beats crawling!



