Google search indexes itself (google.com)
182 points by franze on Sept 10, 2014 | 87 comments



Google's robots.txt http://www.google.com/robots.txt disallows /search but not //search.

However, if you search site:http://www.google.com/search and show omitted search results, you get a bunch of results (all 404s).

If you do this there are some strange results on the last couple pages.

For example: Obama won't salute the flag | Phallectomy | horse+mating+video | feral+horses+induced+abortion | Lactating+dog+images | animal+mating+video | mating+mpg+-beastiality+-...

So, Half Life 3 confirmed.


I thought you were joking about those search keywords, but indeed: http://i.marceldegraaf.net/sitehttpwww.google.comsearch_-_Go... (screenshot)


ODF files! Those sick sick people.


why is it "GooooooooooG" and not "Goooooooooogle" at the bottom?


A better example url is https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...

Note that switching //search to /search eliminates the phenomenon.

Note too that all the results on page 1 and page 10 are related to hostgator and coupon codes. I expect that there is some site which contains some text or links that cause these results.

Note also that the `site:` search operator isn't supposed to include anything but a domain or subdomain: no http:// nor /search should be included.

Finally, note that the results are actually google search pages, though! So I do think this is some kind of bug.

But NOT an instance of Google indexing its result pages. Please change the title to 'This one weird google bug will make you scratch your head!' :)

Edit: andybalholm suggests (on this page) that the double slash is in fact causing the googlebot to visit those search results page and indeed index them. Hm, sounds true.

Has anybody visited the spamfodder pages and found instances of malformed yet operative links to google search? (I don't feel like visiting those sites on this machine on this network.)



>Note also that the `site:` search operator isn't supposed to include anything but a domain or subdomain: no http:// nor /search should be included.

google recommends the site:example.com/path shortcut itself https://support.google.com/webmasters/answer/35256?hl=en

and it's ok to use, as site:example.com inurl:path could mean example.com/hudriwudri/path, too


I said that `site:` doesn't take full urls, just domains and subdomains.

Correction: this works as you (or a muggle) might expect: https://www.google.com/search?q=site:https:%2F%2Fgithub.com%...

...Though logically the operator should be named `page:` now. :)


>But NOT an instance of Google indexing its result pages.

That's what it looks like to me. Could you explain the difference?


I changed my tune at some point via seeing comments here. I posted a comment to that effect.

In hindsight, your comment alone would have changed my tune: nope, I can't explain the difference between a page appearing in search results and a page being indexed. Thanks for the illumination. :)


>This one weird google bug will make you scratch your head!

But that's clickbait! :)


This demonstrates the dangers of loose path resolution rules.

Traditionally, consecutive slashes in a path name are treated as equivalent to a single slash, presumably to simplify apps that need to join two path fragments -- they can safely just concatenate rather than call a library function like path.join().

Unfortunately, this makes it much harder to write code that blacklists certain paths, as robots.txt is designed to do. Clearly, Google's implementation of robots.txt filtering does not canonicalize double-slashes, and so it thinks //search is different from /search and only /search is blacklisted.
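
To make the failure mode concrete, here is a minimal sketch of a prefix-based blacklist check with and without slash canonicalization. This is not Google's actual robots.txt code (nobody in this thread has seen that); the function names and the one-rule list are made up for illustration.

```python
import re

# The one rule quoted from Google's robots.txt in this thread.
DISALLOWED_PREFIXES = ["/search"]


def is_blocked_naive(path):
    """Plain prefix match on the raw path, with no canonicalization."""
    return any(path.startswith(prefix) for prefix in DISALLOWED_PREFIXES)


def collapse_slashes(path):
    """Fold every run of consecutive slashes into a single slash."""
    return re.sub(r"/{2,}", "/", path)


def is_blocked_canonical(path):
    """Canonicalize first, then apply the same prefix match."""
    return is_blocked_naive(collapse_slashes(path))


print(is_blocked_naive("/search"))       # True
print(is_blocked_naive("//search"))      # False: the rule is bypassed
print(is_blocked_canonical("//search"))  # True
```

Whether canonicalizing is the right call depends on whether the server being crawled treats the two paths as the same resource, which is exactly what the replies below argue about.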

My wacky opinion: Path strings are an abomination. We should be passing around string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz". You can use something like slashes as an easy way for users to input a path, but the parsing should happen at point of input, and then all code beyond that should be dealing with lists of strings. Then a lot of these bugs kind of go away, and a lot of path manipulation code becomes much easier to write.


We should be passing around string lists, e.g. ["foo", "bar", "baz"] instead of "foo/bar/baz".

But that doesn't in and of itself solve the problem, because "foo/bar//baz" would map to ["foo", "bar", "", "baz"] without any additional convention.

This is actually not that unusual. This site does not treat two consecutive slashes as a single slash. There are likely other implementation differences.

Certainly in posix consecutive slashes count as one for file paths, but URL query strings are not file paths.
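
A small sketch of the point about conventions, assuming Python and hypothetical helper names (`split_raw`, `parse_path`, `is_under`): a bare split keeps the empty segment, so the "list of strings" idea only helps once the parser also decides what to do with empty segments.

```python
def split_raw(path):
    """Split on slashes with no extra convention; empty segments survive."""
    return path.split("/")


def parse_path(path):
    """Parse once at the point of input, dropping empty segments by convention."""
    return [segment for segment in path.split("/") if segment]


def is_under(segments, prefix):
    """True if the segment list starts with the given prefix list."""
    return segments[:len(prefix)] == prefix


print(split_raw("foo/bar//baz"))   # ['foo', 'bar', '', 'baz']
print(parse_path("foo/bar//baz"))  # ['foo', 'bar', 'baz']
print(is_under(parse_path("//search"), ["search"]))  # True
```

With the dropping convention, a blacklist written against `["search"]` covers both `/search` and `//search`. Without it, the two stay distinct, which may be what a particular server wants.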


... "foo/bar//baz" would map to ["foo", "bar", "", "baz"] ...

No, I think it'd be more like proto://host/thing?foo&bar&baz (put an =1 on each of those if you like).

Yeah, I'm employing a convention, but so too is the concept of a list of strings that the commenter invoked.


Does the HTTP standard or robots.txt specification mandate the collapse of consecutive slashes, though? I agree that it is common, but if it is server-side implementation detail, then a correct implementation of robots.txt should not collapse them, as they might mean different things to a particular server.


I agree. If there's a bug here, it's in the server which collapses slashes seen in request paths, not in the indexer's interpretation of robots.txt.
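
For what it's worth, a generic URL parser keeps the empty segment, so to a client these are two different paths. A quick check with Python's standard library, using the two URL forms discussed in this thread (the `q=foobar` query is just a placeholder):

```python
from urllib.parse import urlsplit

single = urlsplit("http://www.google.com/search?q=foobar")
double = urlsplit("http://www.google.com//search?q=foobar")

print(single.path)                 # /search
print(double.path)                 # //search
print(single.path == double.path)  # False
```

This only shows what the parser does. Whether both paths serve the same content is up to the server, so a crawler cannot know that from the URL alone.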


Funny thing, Google indexes itself, indexing itself, indexing others .... All results lead to google search, which lead to google search results ...

https://www.google.ca/search?q=site%3Ahttp%3A%2F%2Fwww.googl...


We must go deeper



hi OP here, i did not consider this to go front-page, just thought it was a funny meta bug.

and no, it's not clickbait and i'm not affiliated with hostgator or any of that other crap.

a few strange points i would like to point out:

the indexed result pages are http:// not https:// - but to my knowledge google forces https:// everywhere.

the double slash issue is probably the reason why googlebot does indeed index this. robots.txt is a shitty protocol, i once tried to understand it in detail and coded https://www.npmjs.org/package/robotstxt and yes, there are a shitload of cases you just can't cover with a sane robots.txt file.

as there are no https://www.google.com/search (with "s" like secure) URLs indexed google(bot) probably has some failsafes to not index itself, but the old http:// URLs somehow slipped through.

but now let's go meta: consider the implications! the day google indexes itself is the day google becomes self aware. google is a big machine trying to understand the internet. now it's indexing itself, trying to understand itself - and it will succeed. the "build more data centers" algorithm will kick in as google - which basically indexed the whole internet - is now indexing itself recursively! the "hire more engineers to figure out how to deal with all this data" algorithm will kick in (yeah, recursively every developer will become a google dev, free vegan food!), too.

i think it's awesome.

by the way, a few years ago somebody wrote a similar story http://www.wattpad.com/3697657-google-ai-what-if-google-beca... funnily enough the date for self awareness is "December 7, 2014, at 05:47 a.m" [update: oops, sorry, seems to be the wrong story, but i'm sure the "google indexes itself becomes self aware" short story is out there, i just can't find it right now ... strange coincidence?]


> the indexed result pages are http:// not https:// - but to my knowledge google forces https:// everywhere.

Google only forces HTTPS for certain User-Agent strings. I just tried fetching http://www.google.com with the Googlebot User-Agent string and Google did not redirect to HTTPS.
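
For anyone who wants to repeat this, a rough sketch using only Python's standard library. The helper names (`build_request`, `fetch_status`, `NoRedirect`) are made up here, and the result depends on how Google's servers behave on the day you run it, so no expected output is shown.

```python
import urllib.error
import urllib.request

# User-Agent string that Googlebot identifies itself with.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so a 3xx response stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Build a GET request carrying the chosen User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url, user_agent=GOOGLEBOT_UA):
    """Return (status code, Location header or None) without following redirects."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(build_request(url, user_agent), timeout=10) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as error:
        # With redirects disabled, 3xx responses surface here as HTTPError.
        return error.code, error.headers.get("Location")
```

Calling `fetch_status("http://www.google.com/")` once with the default and once with a browser User-Agent string should show whether the redirect to HTTPS depends on the User-Agent.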


It's a bug in the indexing system, exploited by hostgator for (I'm guessing) SEO purposes. There are other people doing the same thing, and they're all spammy (viagra sales etc.)

I reckon this will be fixed in a matter of days, judging by how quickly the latin lorem ipsum google translate thing was sorted out.


And fixed.


(I work with the search team at Google) This was a bug on our side, and should be resolved now.


(I'm one of your users) This was a lot of fun, and it's ruined now.

Seriously, why don't you let people do this?


Handling URLs with multiple slashes in them is tricky, lots of websites silently fold them into one and return the same content, so this seems like something we should handle in the same way in search.


Does this explain why Google search results have degraded the last 6 months? I am not trolling -seriously- for me googling first is hardly worthwhile nowadays. A user from the Netherlands. If there was a way to still use the 2009 search index, i would!!


If you want to send me specific queries (the more general, the better) and what went wrong in the search results for them, I'm more than happy to forward them to the team that works on that. I'm [this-user-name] AT google.com


There actually is a way to use the pre-2012 search index!

Just use http://www.google.com/custom - I use either DuckDuckGo or this site all the time, and I'd probably switch to DuckDuckGo completely if this search went down.


Lovely but doesn't use an old index. Just searched for the name of an album released in 2013. Usual results.


Nooooooo! I'm going to be curious forever now.


And web archive indexes its internal IP addresses and a... live printer: http://web.archive.org/web/*/http://printer


Which from the snapshot, shows an IP that's... still online: http://208.70.27.164/hp/device/this.LCDispatcher


Yep. A whois confirms it's their IP address.

Which is not wrong on its own, as long as it's protected by a good password and doesn't fall to the likes of thc-hydra.

They also had some ancient snapshots from 192.xxx range


And it appears to have been jammed since 2009.


Ouch. Stuff like this is just a confused deputy security vulnerability waiting to happen. Whenever I write code to fetch a resource based on user input (and a crawler following a link is a form of user input) I check to make sure I'm not going to fetch something on an internal network.
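
A minimal sketch of that kind of check, assuming Python's standard `ipaddress` and `socket` modules; `ip_is_internal` and `url_is_safe_to_fetch` are hypothetical names, not code from any real crawler.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def ip_is_internal(ip_text):
    """True for addresses a public crawler should never fetch from."""
    ip = ipaddress.ip_address(ip_text.split("%")[0])  # drop any IPv6 zone id
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def url_is_safe_to_fetch(url):
    """Resolve the host and reject the URL if any address is internal."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False  # unresolvable names are rejected too
    return all(not ip_is_internal(info[4][0]) for info in infos)


print(ip_is_internal("192.168.0.1"))    # True
print(ip_is_internal("208.70.27.164"))  # False
```

Checking before fetching is not airtight on its own: the name can resolve differently between the check and the fetch, so a careful crawler would connect to the address it actually validated.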



They fixed the OP issue by now, but this still works..


Looks like it got fixed, because I cannot see any results.


All of the results are HostGator coupons, anyone else seeing the same?


Yes. Look at the query:

    search?q=site%3Ahttp%3A%2F%2Fwww.google.com 
    %2F%2Fsearch%3Fq%3Dproranktracker.com%2B%2B 
    %2BHostgator%2BCoupon%2BCode%3ACOUPON333&pws=0& 
    hl=en#pws=0&hl=en&q=site:http:%2F%2Fwww.google.com 
    %2F%2Fsearch
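
Decoding the part of that query before the `#` makes the embedded coupon text easier to see. A small sketch with Python's `urllib.parse`, assuming the line breaks above are only wrapping and the pieces join back into one string:

```python
from urllib.parse import parse_qs

# Query string from the comment above, rejoined (everything before the '#').
query = (
    "q=site%3Ahttp%3A%2F%2Fwww.google.com"
    "%2F%2Fsearch%3Fq%3Dproranktracker.com%2B%2B"
    "%2BHostgator%2BCoupon%2BCode%3ACOUPON333&pws=0&hl=en"
)

params = parse_qs(query)
print(params["q"][0])
# site:http://www.google.com//search?q=proranktracker.com+++Hostgator+Coupon+Code:COUPON333
```

So the linked search is not just `site:http://www.google.com//search`: it carries a nested `?q=` with the HostGator coupon text inside the `site:` value.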


Even if you take the search string site:http://www.google.com//search and put it into a fresh Google search, it only returns HostGator coupons. Maybe someone from Google can explain it.


add -hostgator to the search query and you'll find best-seller-watches.com dominating the list. Add that one to your query and things get really strange.

https://www.google.com/webhp?gws_rd=ssl#safe=off&q=site:goog...


Ah, I didn't notice! Interesting.


It is obviously the most relevant content on this site. PageRank is always right.



It's for good measure, in case it's down.


"Your search - site:http://www.google.com//search - did not match any documents."


The goal is to have an explicit Google search result which expresses the equivalent of "this Google search cannot be found via Google".

This will help construct a proof of Göogdel's Incompleteness Theorem.

Without being able to find anything in Google, including Google searches, and including that search for Google searches itself, Google is not a completely powerful search engine; however, it cannot be complete and consistent at the same time. There are searches which cannot be shown to be conclusively either in the index, or not in the index.


Made my day!


I wonder if it's somehow possible to exploit this to pass pagerank from google.com to your own website. Or if there's even people already doing it.


Well, let's look at the results - coupons, watches, ... - yup some blackhat SEO is probably cursing whoever publicised this issue.


I think it might not be that they "index themselves" but that they index links to google that others post on forums. It's common for people to link to "lmgtfy", so they probably index those links too. I don't see google "googling" itself while indexing its own searches. Unless Skynet.


Results for http://www.google.com///search as well.

But not http://www.google.com////search because that's just crazy, come on.


Very strange:

https://www.google.com/search?q=site:http://www.google.com/s...

I got some searches like:

www.google.com/search@q=tetris+sorry+henk

https://www.google.com/search=pupuk+cair+alami

www.google.com/search&q=strobe+trigger+schematic

www.google.com/search@q=transvestites+used+in+rituals (!!!!)

Edit: roland-s found it first :) , and yes, the last pages of results are pretty weird.

https://news.ycombinator.com/item?id=8298239


Funny thing... It works only with [0]

    site:http://www.google.com//search

but not with [1]

    site:http://www.google.com/search

[0] https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...

[1] https://www.google.com/search?q=site%3Ahttp%3A%2F%2Fwww.goog...


Try it like this. site:www.google.com (About 34,000,000 results) or site:http://www.google.com inurl:search (About 185,000 results)


10,800 results for "site:http://www.google.com///search"


Seems like they could easily fix this with robots.txt or something similar, so I really doubt it's an oversight on their part.

Any ideas why they're doing this?


I assume that some site has hostgator-related links with two slashes instead of one. Due to the two slashes, the GoogleBot doesn't realize that it's indexing their own results pages.


They disallow /search

    User-agent: *
    Disallow: /search

but maybe //search slipped through?


This is just a bug, sorry.


They probably want to index some pages on google.com, but not search results. To exclude search results, someone wrote something to exclude URLs that start with /search, and forgot that //search works the same way.


It works with

    site:http://www.google.com/search

but all the results are considered duplicates and omitted. Hit the button.


Nice! These results must represent all the hrefs people have posted that point to google search...


Just checked this link again... It appears that Google has fixed the //search issue as it returns no results now.


Thank you. I was wondering what all the fuss was.



Fun. I expect a cheeky onebox to come out of this at some point along the lines of the recursion search.


why the heck did google ever make it possible to hit the search url with more than one slash there...


Interestingly, if you add a slash to this page's URL, the result is different:

https://news.ycombinator.com//item?id=8297241

In all the other cases I tested, it isn't.

Is this server-specific, or is it configurable?

http://url.spec.whatwg.org//#concept-url-path

http://www.nytimes.com///pages//politics//index.html

http://www.bing.com////search?q=site%3Abing.com%2Fsearch%3Fq...

https://www.cloudflare.com///index


Many frameworks allow you to route URLs to actions instead of mapping to a file. I just tested it in one of my Symfony projects, and I was able to route /login and //login to two separate controllers.

Furthermore, it's pretty common to rewrite URLs, doing things like adding/removing trailing slashes, whatever. So it wouldn't be too difficult to have it condense multiple slashes into just one.

For example, this link works fine: google.com//////////////////////////////////search?q=foobar

Google search tries to cover a lot of typos or be pretty user-friendly for people who don't understand tech. I wouldn't be surprised if there's a grandma out there who thinks http://google.com//search is the correct method.
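
A toy illustration of both behaviours described above, in Python rather than Symfony. The route table, handler names, and `collapse` flag are all invented for this sketch and do not mirror any real framework's API.

```python
import re


def login():
    return "login page"


def other():
    return "some other page"


# Exact-match routing: "/login" and "//login" are simply different keys.
ROUTES = {"/login": login, "//login": other}


def dispatch(path, collapse=False):
    """Look up a handler, optionally folding repeated slashes first."""
    if collapse:
        path = re.sub(r"/{2,}", "/", path)
    handler = ROUTES.get(path)
    return handler() if handler else "404"


print(dispatch("//login"))                  # some other page
print(dispatch("//login", collapse=True))   # login page
print(dispatch("///login"))                 # 404
print(dispatch("///login", collapse=True))  # login page
```

Both modes are reasonable for a server, which is why an outside crawler cannot assume either one.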


no one will ever know.


Tested with Jetty + Spring 3 with close to out-of-the-box settings: more than one slash results in a not-found error.


Isn't the head of web spam at Google a HNer (Matt I think?)?


I believe Matt Cutts went on sabbatical: https://www.mattcutts.com/blog/on-leave/


But does it index the results of the search of the index?


Now google will index searches of its own searches.


Ouroboros


So meta.


Can someone nuke the link on this post? It's clearly clickbait and we're just driving traffic to it. :(


You should mention the arbitrary data in the query section; it's not visible at first look.


That's an artifact of google's weird link stuffing, if you search 'site:http://www.google.com//search' by hand it still works


Perhaps this "works" because all the pagerank stuff has been altered by all the sudden traffic related to hostgator coupons.


Googleception! (sorry for the useless comment, but I had to)


Eigengoogle.


Does nobody here understand robots.txt? It's pretty easy to figure out what's going on if you do. I assumed most users here work with web technologies, but maybe the readership doesn't skew that way as much as I thought.



