
This was actually my primary role at Google from 2006 to 2010.

One of my first test cases was a certain date range of the Wall Street Journal's archives of their Chinese language pages, where all of the actual text was in a JavaScript string literal, and before my changes, Google thought all of these pages had identical content... just the navigation boilerplate. Since the WSJ didn't do this for its English language pages, my best guess is that they weren't trying to hide content from search engines, but rather trying to work around some old browser bug that incorrectly rendered (or made ugly) Chinese text, but somehow rendering text via JavaScript avoided the bug.
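
Roughly, the pattern looked something like this (a hypothetical reconstruction in TypeScript-flavored DOM code, not the WSJ's actual markup): the article text exists only inside a JavaScript string literal, so a crawler that doesn't execute scripts sees nothing but the surrounding navigation boilerplate.

    // Hypothetical reconstruction; the element id and variable name are invented.
    const articleBody = "\u4e2d\u6587\u6587\u7ae0\u5185\u5bb9..."; // the actual Chinese article text
    document.getElementById("article")!.textContent = articleBody;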

The really interesting parts were (1) trying to make sure that rendering was deterministic (so that identical pages always looked identical to Google for duplicate elimination purposes), (2) detecting when we deviated significantly from real browser behavior (so we didn't generate too many nonsense URLs for the crawler or too many bogus redirects), and (3) making the emulated browser look a bit like IE and Firefox (and later Chrome) at the same time, so we didn't get tons of pages that said "come back using IE" or "please download Firefox".

I ended up modifying SpiderMonkey's bytecode dispatch to help detect when the simulated browser had gone off into the weeds and was likely generating nonsense.
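
One way to do that kind of detection is an instruction budget checked at dispatch time. A minimal sketch of the general idea (in TypeScript with invented names, not SpiderMonkey's actual C++ dispatch loop):

    // Count every bytecode dispatched and bail out once a budget is exceeded,
    // treating the page's JS-derived links and redirects as untrustworthy.
    type Opcode = { run: (vm: VM) => void };

    class BudgetExceededError extends Error {}

    class VM {
      dispatched = 0;
      constructor(private budget: number) {}

      execute(program: Opcode[]): void {
        for (const op of program) {
          if (++this.dispatched > this.budget) {
            throw new BudgetExceededError(`exceeded ${this.budget} bytecodes`);
          }
          op.run(this);
        }
      }
    }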

I went to a lot of trouble figuring out the order in which different JavaScript events fired in IE, Firefox, and Chrome. It turns out that some pages actually fire events in different orders between a freshly loaded page and a page reloaded via the refresh button. (This is when I learned about holding down shift while hitting the browser's reload button to make it act like it was a fresh page fetch.)

At some point, some SEO figured out that random() was always returning 0.5. I'm not sure if anyone figured out that JavaScript always saw the date as sometime in the Summer of 2006, but I presume that has changed. I hope they now set the random seed and the date using a keyed cryptographic hash of all of the loaded javascript and page text, so it's deterministic but very difficult to game. (You can make the date deterministic for a month and have the dates of different pages jump forward at different times by adding an HMAC of the page content (mod the number of seconds in a month) to the current time, rounding that time down to a month boundary, and then subtracting back the value you added earlier. This prevents excessive index churn from switching all dates at once, and yet gives each page a unique date.)
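
A minimal sketch of that date trick in TypeScript, assuming a SHA-256 HMAC and a 30-day "month" (the key and function names are illustrative, not any actual implementation):

    import { createHmac } from "node:crypto";

    const SECONDS_PER_MONTH = 30 * 24 * 60 * 60;

    function deterministicDate(pageContent: string, key: string, nowSeconds: number): number {
      // Per-page offset: keyed hash of the page content, reduced mod one month.
      const digest = createHmac("sha256", key).update(pageContent).digest();
      const offset = digest.readUInt32BE(0) % SECONDS_PER_MONTH;

      // Shift by the offset, round down to a month boundary, then shift back.
      // The result is constant for roughly a month per page, and different
      // pages roll over to a new date at different moments, so the index
      // never sees every page's date change at once.
      const shifted = nowSeconds + offset;
      const rounded = Math.floor(shifted / SECONDS_PER_MONTH) * SECONDS_PER_MONTH;
      return rounded - offset;
    }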




> (This is when I learned about holding down shift while hitting the browser's reload button to make it act like it was a fresh page fetch.)

Most useful aside of all time.


I used to use this a lot. My experience is that for some reason, a couple years ago it stopped working reliably as a fresh page fetch. Some items were still coming up cached. Now I use incognito or private browsing windows instead.


If you're running Chrom(e|ium) with developer tools open then you can right-click the refresh button and it gives you a few refresh options (e.g. clear cache and reload).

That tends to be my fallback whenever I'm specifically fussed about the "freshness" of a page. That, or curl.


Thanks, never tried right-clicking that before. There's also a checkbox in dev tools settings to "Disable cache while dev tools is open."


I used to go through a lot of head-scratching when doing manual testing, before discovering the joys of cmd-shift-r.


> At some point, some SEO figured out that random() was always returning 0.5. I'm not sure if anyone figured out that JavaScript always saw the date as sometime in the Summer of 2006, but I presume that has changed. I hope they now set the random seed and the date using a keyed cryptographic hash of all of the loaded javascript and page text, so it's deterministic but very difficult to game.

I don't get why the rendering had to be deterministic. Server-side rendered HTML documents can also contain random data and it doesn't seem to prevent Google from doing "duplicate elimination".


Byte-for-byte de-duping of search results is perfect and fairly cheap. Fuzzy de-duping is more expensive and imperfect. Users get really annoyed when a single query gives them several results that seem like near copies of the same page.

Tons of pages have minor modifications made by JavaScript, but only a very small percentage have JavaScript modifications that, once analyzed, actually improve search results.

So, if JavaScript analysis isn't deterministic, it has a small negative effect on the search results of many pages that offsets the positive effect it has on a small number of pages.
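
A minimal sketch (not Google's pipeline) of the cheap byte-for-byte check: exact de-duplication keys on a hash of the rendered bytes, so a single random() or current-date value leaking into the output gives each copy a different hash and splits true duplicates apart.

    import { createHash } from "node:crypto";

    // Group URLs whose rendered output is byte-identical.
    function dedupeRendered(pages: { url: string; renderedHtml: string }[]): Map<string, string[]> {
      const groups = new Map<string, string[]>();
      for (const page of pages) {
        const key = createHash("sha256").update(page.renderedHtml).digest("hex");
        const urls = groups.get(key) ?? [];
        urls.push(page.url);
        groups.set(key, urls);
      }
      return groups; // every group with more than one URL is an exact duplicate set
    }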


Great thread. I'm one of those people that was poking around and trying to figure this out a few years back.

Obviously, there's a lot I know you can't say, but I'd love to know your general thoughts on how far off we were: http://ipullrank.com/googlebot-is-chrome https://moz.com/blog/just-how-smart-are-search-robots


" Since the WSJ didn't do this for its English language pages, my best guess is that they weren't trying to hide content from search engines, but rather trying to work around some old browser bug that incorrectly rendered (or made ugly) Chinese text, but somehow rendering text via JavaScript avoided the bug."

Or maybe they were trying to get past the great firewall of China?


> Or maybe they were trying to get past the great firewall of China?

Possible, but at that time the only affected pages were for a certain date range in their archives, not the most recent pages. I also think the Great Firewall of China did simple context-free regex searches that would have caught the text in the JavaScript literals.


Did you load in Ajax content? I've got a client that runs a site that loads its HTML in separately. They've been paying for a third-party service to run PhantomJS and save HTML snapshots to serve to Googlebot - is that no longer needed?

(I'm not thrilled about rendering this way, but it makes development a lot easier.)


In practice, and from experience... content changes driven by JS tend to lag a few days behind changes made via direct output... If you're doing client-side rendering, couldn't you refactor to use Node, or something similar, for your output rendering?

If you aren't heavily reliant on conversions from search traffic, you can probably get away with being JS driven. I'd suggest sticking with anchor tags for direct navigation with JS overrides, assuming you are supporting full URL changes... otherwise you need to support the hashbang alternate paths, which is/was a pain when I did it 3-4 years ago.
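
A minimal sketch of the "anchor tags with JS overrides" pattern (the data-spa-link attribute and renderClientSide function are invented for illustration):

    // Hypothetical placeholder for the app's client-side renderer.
    declare function renderClientSide(url: string): void;

    // Links carry real hrefs that crawlers (and users without JS) can follow;
    // script intercepts clicks and takes over navigation client-side.
    document.querySelectorAll<HTMLAnchorElement>("a[data-spa-link]").forEach((link) => {
      link.addEventListener("click", (event) => {
        event.preventDefault();                // skip the full page load
        history.pushState({}, "", link.href);  // keep a real, crawlable URL
        renderClientSide(link.href);
      });
    });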


As an aside, did you work on the indexing team at Google? I was on the indexing team from 2005-2007, and I remember that Javascript execution was being worked on then, but I don't remember who was doing it (was a long time ago ;) ). My name is my username.


I was always in the New York office (before and after the move from Times Square to Chelsea), on the Rich Content Team sub-team of Indexing. My username is the same as my old Google username.

I was working on the lightweight, high-performance JavaScript interpretation system that sandboxed pretty much just a JS engine and a DOM implementation, which we could run on every web page in the index. Most of my work was trying to improve the fidelity of the system. My code analyzed every web page in the index.

Towards the end of my time there, there was someone in Mountain View working on a heavier, higher-fidelity system that sandboxed much more of a browser, and they were trying to improve performance so they could use it on a higher percentage of the index.


Ah, okay, cool. Never visited the NY office. That's probably why I just remember the general idea that "JS execution was being worked on."



