Hacker News

https://www.google.com/search?q=jack+ma&tbs=cdr%3A1%2Ccd_min...

First result for me is https://www.scb.co.th/en/personal-banking/stories/business-m... which Google thinks is from 2003-03-15, except it mentions COVID-19 so it obviously isn't.

Second result is https://www.instagram.com/jack_overpower/feed/ which Google thinks is from 2001-01-02, except Instagram didn't exist at that time. It might have pictures from 2001 though.

Third result is http://pacificpower.foreignpolicy.com/15-jack-ma/ which Google thinks is from 1999-02-15, except it mentions Alibaba's 2014 IPO so it obviously isn't.

Fourth result is https://www.facebook.com/story.php/?story_fbid=5041357966634... which Google doesn't show a date for, but it's a Facebook post from 2018.

...

I don't doubt that some of those results are from 1998 to 2005, but the millions of results number specifically is meaningless.




The "custom range" feature simultaneously feels broken, gamed by spammers, and intentionally being scrubbed. I'm surprised they haven't completely removed it yet.


In general I suppose, but per my comment above, in this particular case of the scb.co.th article (which mentions the SARS crisis), the article was actually published in 2021, not 2003. There was no gaming going on; Google's date-inference code simply got the Siam Commercial Bank article totally wrong.

I don't want to see Google remove the search-by-date-range feature; it has tons of 100% serious, legitimate uses (quote attribution, journalism, historical research, debunking internet rumors and fresh reposts of old fake virals). But Google could estimate the error bars on date ranges and hit counts, and provide a disclaimer and feedback box to encourage users to flag/retag gross errors like this.

If anyone dug deeper into why the date inferencing is now breaking, I'd speculate they'd find that Google is nowadays getting confused by site publishers and advertising networks changing or removing items that contain dates; and that, in turn, is presumably a response to Google changes which downrank or uprank legit older content.

(I can't find the recent HN article here by an SEO expert with a bullet-point list advising website maintainers to remove all date information, among other things)


Google has perfect vision of the past (didn't the latest leak confirm they keep everything crawled indefinitely and have extensive historical records for all domains?) but zero incentive to redirect you to old websites with no advertisements.


Many old forums are only sporadically indexed by Google even if you do verbatim text searches using the site:... modifier.


>except it mentions COVID-19 so it obviously isn't.

Perhaps it was just updated?

I generally ignore (or get annoyed by) articles that don't have a published or updated date in the byline.


Sometimes you can find the date embedded inside the source asset files.
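One way this can look in practice: some sites put the upload year and month into asset paths (e.g. `/uploads/2021/09/...`). A rough sketch of scanning a page's HTML for such hints follows. The path pattern, the `AssetDateHints` class and the `date_hints` helper are all illustrative assumptions, not something any particular site or search engine is known to use, and a hit is only a hint, not proof of the publication date.

```python
import re
from html.parser import HTMLParser

# Assumed convention: a /YYYY/MM/ segment somewhere in an asset URL.
DATE_IN_PATH = re.compile(r"/((?:19|20)\d{2})/(0[1-9]|1[0-2])/")


class AssetDateHints(HTMLParser):
    """Collects YYYY-MM hints from src/href/content attribute values."""

    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href", "content") and value:
                for match in DATE_IN_PATH.finditer(value):
                    self.hints.append(f"{match.group(1)}-{match.group(2)}")


def date_hints(html):
    """Return YYYY-MM hints found in asset URLs, in document order."""
    parser = AssetDateHints()
    parser.feed(html)
    parser.close()
    return parser.hints


if __name__ == "__main__":
    sample = '<img src="/uploads/2021/09/photo.jpg"><link href="/css/site.css">'
    print(date_hints(sample))
```

Several assets agreeing on the same month is a stronger signal than a single match, since shared assets (logos, stylesheets) often predate the article.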


> First result [scb.co.th] ... Google thinks is from 2003-03-15, except it mentions COVID-19 so it obviously isn't.

Interesting catch, seems Google grossly mistagged its date. IA confirms it was actually published 2021-09-06 [0], but that isn't tagged or referenced anywhere in the article text or HTML. I'm assuming Google misinferred the date as "2003-03-15" because the first two paragraphs talk about the SARS crisis, which was declared by the WHO around 2003-03.

> I don't doubt that some of those results are from 1998 to 2005, but the millions of results number specifically is meaningless.

Yes, it seems there's not much QC on Google's date-inferencing of "old" articles. Hence the date-range filter is hit-and-miss, and so are search hit counts (which Google is eliminating anyway). I mean, if anyone wanted to QC it, they could just search the "old" internet for telltale terms like "COVID", "Nicki Minaj", "President Zelenskyy" etc. that should hardly generate any hits.

[0]: https://web.archive.org/web/20210601000000*/https://www.scb....
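The telltale-term QC idea above can be sketched in a few lines. This is only an illustration under assumptions: the `EARLIEST_PLAUSIBLE` table and the `anachronisms` function are made up for the example, the cutoff dates are approximate (the WHO naming of COVID-19 in February 2020, Instagram's launch in October 2010), and naive substring matching will produce false positives on real text.

```python
from datetime import date

# Hypothetical lookup: term -> earliest date it could plausibly appear.
# Dates are approximate and for illustration only.
EARLIEST_PLAUSIBLE = {
    "covid-19": date(2020, 2, 11),   # name announced by the WHO
    "instagram": date(2010, 10, 6),  # service launch
}


def anachronisms(text, claimed):
    """Return terms in `text` that should not exist yet on the `claimed` date."""
    lowered = text.lower()
    return sorted(
        term
        for term, earliest in EARLIEST_PLAUSIBLE.items()
        if term in lowered and claimed < earliest
    )


if __name__ == "__main__":
    # A page dated 2003-03-15 that mentions COVID-19 gets flagged.
    print(anachronisms("Lessons from SARS and COVID-19", date(2003, 3, 15)))
```

Any non-empty result means the inferred date is suspect; an empty result proves nothing, since a misdated page need not contain a listed term.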


Yep; there may be a lack of incentive to preserve old sites, but what's worse are the ranking algorithms that prevent their discoverability in the first place.


Both the Internet Archive and Common Crawl have tools that reveal actual crawl dates. Search engines are not really intended to be archives, so it's no surprise that they aren't very good archives.


Is it, though? I think you have to define what your search engine is searching to make a claim like that. Internet Archive and Common Crawl (which I will say has its own incentives discouraging the discoverability of old sites through its methodology and limitations of its web crawling) are search engines in their own right.

What are you doing when you use their services? Searching.


Not really prevented; the big factor is HTTP sites being heavily downranked by Google.

But they are still there. Do a specific enough search and they'll be at the top of the search results.



