
Google used to have a policy requiring sites to show Google the same page that they serve to users. It seems that has been eroded.

I'm not sure how that serves Google's interests, except perhaps that it keeps them out of legal hot water vs the news industry?


It's called cloaking and it's still looked down upon from an SEO perspective. That said, there's a huge gray area. Purposefully cloaking a result to trick a search engine would get penalized. "Updating" a page with newer content periodically is harder to assess.

There's also "dynamic rendering," in which you serve Google/crawlers "similar" content to, in theory, avoid JS-related SEO issues. However, it can just be a way to do what the parent commenter dislikes: render a mini-blurb unfound on the actual page.

Shoot, even a meta description qualifies for that - thankfully Google uses them less and less.


Google will reliably index dynamic sites rendered using JS. And other search engines do the same. There's really no good reason to do this if you want to be indexed on search engines.

Agreed. Yet whether it should be done is different from whether it is done. Google was recommending it in 2018, and only downgraded it to a "workaround" two years ago. Sites still do it, SaaS products still tout its benefits, and Google does not penalize sites for it. GP's gripe about missing SERP blurbs is still very much alive and blessed by Google.
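
For concreteness, a minimal sketch of what dynamic rendering typically looks like (Express-style; the bot regex and the prerendered page are made-up placeholders, not a recommendation):

    // Sketch: serve prerendered HTML to known crawlers, the normal JS app to everyone else.
    import express from "express";

    const app = express();
    const BOT_UA = /googlebot|bingbot|duckduckbot/i; // illustrative bot list

    app.get("*", (req, res) => {
      if (BOT_UA.test(req.get("user-agent") ?? "")) {
        // What the crawler sees can drift from what users see -- hence the gray area.
        res.send("<html><body><h1>Prerendered content</h1></body></html>");
      } else {
        res.sendFile("index.html", { root: "dist" }); // the client-rendered app shell
      }
    });

    app.listen(3000);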

People on HN have repeatedly stated that Google is "stealing their content" from their websites so it seems like this is a natural extension to that widespread opinion.

Isn't this the web we want? One where big corporations don't steal our websites? Right?


> People on HN have repeatedly stated

It is definitely an area where there is no single "people of HN" - opinion varies widely and is often more nuanced than a binary for/against⁰ matter. From an end user PoV it is (was) a very useful feature, one that kept me using Google by default¹, and I think that many like me used it as a backup when content was down at source.

The key problem with giving access to cached copies like this is when it effectively becomes the default view, holding users in the search provider's garden instead of the content provider being acknowledged, never mind visited, while the search service makes money from that through adverts and related stalking.

I have sympathy for honest sites when their content is used this way, though those that give search engines full text but paywall most of it when I look, then complain about the search engine showing that fuller text, can do one. Likewise those who do the "turn off your stalker blocker or we'll not show you anything" thing.

----

[0] Or ternary for/indifferent/against one.

[1] I'm now finally moving away, currently experimenting with Kagi, as a number of little things that kept me there are no longer true and more and more irritations² keep appearing.

[2] Like most of the first screen full of a result being adverts and an AI summary that I don't want, just give me the relevant links please…


Cached was only a fallback option when the original site was broken. When the original site works, nearly everyone clicks on it.

People on HN will always find a way to see a good side of Google's terrible product and engineering.

So it's great that there's one less of their products; it was terrible anyway.

Ownership over websites does not work the way people expect or want.

I'm tired of saying this, yelling at clouds


I most commonly run into this issue when the search keyword was found in some dynamically retrieved non-body text - maybe it was on a site's "most recent comments" panel or something.

That policy was never actually enforced that way, however. They'd go after you if you had entirely different content for Google vs. for users, but large scientific publishers already had "full PDF to Google, HTML abstract + paywall to users" 20 years ago and it was never an issue.

It makes some sense, too, because the edges are blurry. If a user from France receives a French version on the same URL where a US user would receive an English version, is that already different content? What if (as usually happens) one language gets prioritized and the other only receives updates once in a while?

And while Google recommends treating them like you'd treat any other user when it comes to e.g. geo-targeting, in reality that's not possible if you do anything that requires compliance and isn't available in California. They do smartphone and desktop crawling, but they don't do any state- or even country-level crawling. Which is understandable as well: few sites really need or want to do that, it would require _a lot_ more crawling (e.g. in the US you'd need to hit each URL once per state), and there's no protocol to indicate it (and there probably won't be one, because it's too rare).


> It makes some sense, too, because the edges are blurry. If a user from France receives a French version on the same URL where a US user would receive an English version, is that already different content?

The recommended (or rather, correct) way to do this is to have multiple language-scoped URLs, whether a path fragment or entirely different (sub)domains. Then you cross-link them with <link> tags using rel="alternate" and hreflang (for SEO purposes) and give the user some affordance to switch between them (only if they want to do so).

https://developers.google.com/search/docs/specialty/internat...
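
A rough sketch of what that cross-linking boils down to (the locale list and URLs are made up for illustration):

    // Sketch: emit rel="alternate" hreflang <link> tags for each language-scoped URL.
    const alternates: Record<string, string> = {
      "en": "https://example.com/en/",
      "fr": "https://example.com/fr/",
      "x-default": "https://example.com/en/", // fallback for unmatched languages
    };

    function hreflangLinks(map: Record<string, string>): string {
      return Object.entries(map)
        .map(([lang, url]) => `<link rel="alternate" hreflang="${lang}" href="${url}" />`)
        .join("\n");
    }

    console.log(hreflangLinks(alternates)); // goes into each language page's <head>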

Public URLs should never show different content depending on anything other than the URL and current server state. If you really need to do this, 302-redirect to a different URL.

But really, don't do that.

If the URL is language-qualified but it doesn't match whatever language/region you guessed for the user (which might very well be wrong and/or conflicting, e.g. my language and IP's country don't match, people travel, etc.) just let the user know they can switch URLs manually if they want to do so.

You're just going to annoy me if you redirect me away to a language I don't want just because you tried being too smart.


> Public URLs should never show different content depending on anything other than the URL and current server state.

As a real-world example: you're providing some service that is regulated differently in multiple US states. Set up /ca/, /ny/, etc., let them be indexed, and you'll have plenty of duplicate content and all the trouble that comes with it. Instead you'll geofence like everyone else (including Google's SERPs), and a single URL now has content that depends on the perceived IP location, because both SEO and legal will be happy with that solution, and neither will be entirely happy with the state-based URLs.
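
For what it's worth, a minimal sketch of that kind of geofencing (using the geoip-lite package; the states and response bodies are placeholders, and this isn't an endorsement of the approach):

    // Sketch: one URL whose body depends on the visitor's perceived state.
    import express from "express";
    import geoip from "geoip-lite";

    const app = express();

    app.get("/service", (req, res) => {
      const geo = geoip.lookup(req.ip ?? "");
      const state = geo?.region; // e.g. "CA" or "NY"; often undefined for crawlers
      if (state === "CA") res.send("California-specific terms...");
      else if (state === "NY") res.send("New York-specific terms...");
      else res.send("Default terms...");
    });

    app.listen(3000);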


> You're just going to annoy me if you redirect me away to a language I don't want just because you tried being too smart.

So what do you propose that such a site shows on its root URL? It's possible to pick a default language (e.g. English), but that's not a very good experience when the browser has already told you that the user prefers a different language, right? It's possible to show a language picker, but that's not a very good experience for all users either, as their browser has already told you which language they prefer.


See sibling comments.

What about Accept headers?

Quoting MDN https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Ac...

> This header serves as a hint when the server cannot determine the target content language otherwise (for example, use a specific URL that depends on an explicit user decision). The server should never override an explicit user language choice. The content of Accept-Language is often out of a user's control (when traveling, for instance). A user may also want to visit a page in a language different from the user interface language.

So basically: don't try to be too smart. I'm bitten by this more often than not, as someone whose browser is configured in English but who often wants to visit pages in their native language. My government's websites do this and it's infuriating, often showing me broken English pages.

The only acceptable use would be if you have a canonical language-less URL that you might want to redirect to the language-scoped URL (e.g. visiting www.example.com and redirecting to example.com/en or example.com/fr) while still allowing the user to manually choose what language to land in.
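
A minimal sketch of that pattern (the cookie name, supported locales, and paths are assumptions): an explicit prior choice wins, Accept-Language is only a fallback, and language-scoped URLs never redirect on their own:

    // Sketch: redirect only the language-less root; honor an explicit choice first.
    import express from "express";
    import cookieParser from "cookie-parser";

    const app = express();
    app.use(cookieParser());

    app.get("/", (req, res) => {
      const chosen = req.cookies["lang"];                // set when the user picks manually
      const guessed = req.acceptsLanguages("en", "fr");  // best Accept-Language match, or false
      const lang = chosen === "en" || chosen === "fr" ? chosen : guessed || "en";
      res.redirect(302, `/${lang}/`);
    });

    // Language-scoped pages never redirect based on headers; they just offer a switcher link.
    app.get("/:lang/", (req, res) => {
      res.send(`Content in ${req.params.lang}`);
    });

    app.listen(3000);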

If I arrive through Google with English search terms, believe it or not, I don't want to visit your French page unless I explicitly choose to do so. Same when I send some English webpage to my French colleague. This often happens with documentation sites and it's terrible UX.


"Accept" specifies a MIME type preference.

You said Accept headers, plural, and since the thread was about localization I assumed you meant Accept-Language.

To answer your comment: yes, you should return the same content (resource) from that URL (note the R in URL). If you want to (and can), you can honor the Accept header to return it in another representation, but the content should be the same.

So /posts should return the same list of posts whether in HTML, JSON or XML representation.

But in practice content negotiation isn't used that often, and people just scope APIs under their own subpath (e.g. /posts and /api/posts), since it doesn't matter much for SEO (Google mostly cares about crawling HTML; JSON is not going to be counted as duplicate content).
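
For completeness, a small sketch of what honoring the Accept header looks like when you do bother (Express's res.format; the post list is made up):

    // Sketch: one resource, several representations, negotiated via the Accept header.
    import express from "express";

    const app = express();
    const posts = [{ id: 1, title: "Hello" }, { id: 2, title: "World" }];

    app.get("/posts", (req, res) => {
      res.format({
        "text/html": () =>
          res.send(`<ul>${posts.map(p => `<li>${p.title}</li>`).join("")}</ul>`),
        "application/json": () => res.json(posts),
        default: () => res.status(406).send("Not Acceptable"),
      });
    });

    app.listen(3000);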


Why are XML and JSON alternates of the same resource but French and German are two different resources?

Because the world is imperfect, and having separate URLs instead of using content negotiation makes for a far better user and SEO experience, so that's what we do in practice.

IOW pragmatism.


Giving the content in the user agent's configured language preference also seems pragmatic.

In what way is ignoring this pragmatic?

https://developers.google.com/search/docs/specialty/internat...

> If your site has locale-adaptive pages (that is, your site returns different content based on the perceived country or preferred language of the visitor), Google might not crawl, index, or rank all your content for different locales. This is because the default IP addresses of the Googlebot crawler appear to be based in the USA. In addition, the crawler sends HTTP requests without setting Accept-Language in the request header.

> Important: We recommend using separate locale URL configurations and annotating them with rel="alternate" hreflang annotations.


Content delivery is becoming so dynamic and targeted that there is no way that can work effectively now, even for a first impression, as one or more MVTs (multivariate tests) may be in place.

Roads are 24x7 and outside every building.

Two things:

1. If you judge new technology by the absolute worst way that it can be used, you are going to spend your life worrying about a million things that will never happen. Focus on stuff that's happening instead of hypotheticals.

2. In a really dystopian setup, you can be restricted from leaving "your zone" regardless of autonomous vehicles. There's plenty of existing technology that can help with that. (number plates, cellular networks, facial recognition, etc.)


“a million things that will never happen”

Like cameras and AI used to enforce the “not allowed to sing while driving” policy for Amazon drivers?

The only way to make sure things never happen is by worrying about the worst possible misuse of the technology. Because someone will misuse it that way if it gets him money or power.


People worried about surveillance but it didn't stop surveillance from happening.

I'm not saying to be naive to possibilities, especially as they begin to move from the "possibilities" bucket to the "realities" bucket, but isn't it better to focus on making progress against stuff that's already happening? If Amazon are preventing drivers from singing, I think those drivers might be able to change that.


Not only is #2 possible today, it is in fact a lived reality for people in some parts of the world.

> Focus on stuff that's happening instead of hypotheticals.

How do you prevent "hypotheticals" from becoming "stuff that's happening" if you never pay any attention to them in the first place and let them happen?


Because hypotheticals have a process of moving into "stuff that's happening" on a timescale of years or decades.

Once it starts happening, speak; if speaking doesn't work, fight; if fighting doesn't work, move. This works.


The problem is that for many things once something starts happening it is very hard or even impossible to stop it. It is way better to prevent something from happening than try to stop it after the fact - especially if someone with power can benefit from that something happening.

On 1, technology sometimes brings outcomes worse than humanity could even anticipate.

The compatibility gap on WebP is already quite small. Every significant web browser now supports it. Almost all image tools & viewers do as well.

Lossy WebP comes out a lot smaller than JPEG. It's definitely worth taking the saving.
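
If you want to sanity-check that on your own images, here's a rough sketch using the sharp library (the quality numbers are arbitrary, and the two codecs' quality scales aren't directly comparable -- see the thread below):

    // Sketch: encode the same image as JPEG and WebP and compare output sizes.
    import sharp from "sharp";

    async function compare(inputPath: string): Promise<void> {
      const jpeg = await sharp(inputPath).jpeg({ quality: 80, mozjpeg: true }).toBuffer();
      const webp = await sharp(inputPath).webp({ quality: 80 }).toBuffer();
      console.log(`JPEG: ${jpeg.length} bytes, WebP: ${webp.length} bytes`);
      console.log(`WebP is ${((webp.length / jpeg.length) * 100).toFixed(0)}% of the JPEG size`);
    }

    compare("photo.jpg"); // hypothetical input file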


I work on Batch Compress (https://batchcompress.com/en) and recently added WebP support, then made it the default soon after.

As far as I know, it was already making the smallest JPEGs out of any of the web compression tools, but WebP was coming out only ~50% of the size of the JPEGs. It was an easy decision to make WebP the default not too long after adding support for it.

Quite a lot of people use the site, so I was anticipating some complaints after making WebP the default, but it's been about a month and so far there has been only one complaint/enquiry about WebP. It seems that almost all tools & browsers now handle it well; I've only encountered one website recently where uploading a WebP image wasn't handled correctly and blocked the next step.


Whenever WebP gives you file size savings bigger than 15%-20% compared to a JPEG, the savings are coming from quality degradation, not from improved compression. If you compress and optimize JPEG well, it shouldn't be far behind WebP.

You can always reduce the file size of a JPEG by making a WebP that looks almost the same, but you can also do that by recompressing a JPEG to a JPEG that looks almost the same. That's just a property of all lossy codecs, plus the fact that file size grows exponentially with quality, so people are always surprised by how even tiny, almost invisible quality degradation can change file sizes substantially.


> but WebP was coming out only ~50% of the size of the JPEGs

Based on which quality comparison metric? WebP has a history of atrocious defaults that murder detail in dark areas.


Nothing technically objective, just the size that a typical photo can be reduced to without looking bad.

It really depends on what you're after, right? If preserving every detail matters to you, lossless is what you want. That's not going to create a good web experience for most users, though.


"Odeo up for sale"

Became (or spun out) Twitter, by the way.

https://dailyfly.com/on-this-day-in-2006-twitter-launches/



I wonder if there's some price at which Boeing could buy a license to fork the current SpaceX technology. From there they could take it in different directions if they wanted, or pay some ongoing license fee for updates & training from SpaceX. Seems like it would be very beneficial for them to start with a platform that is known to work fairly well.


50 MB might be fine for desktops on effectively unlimited & high speed connections, but consider the case of a mobile user with a few GB of data per month. Might be unacceptable for them. Not sure how common that case is in the US, but certainly possible outside the US.


Anyone can easily do an online/offline binary check for web apps like these:

1. Load the page

2. Disconnect from the internet

3. Try to use the app without reconnecting


Well, my question is about where it lies within the gray area between fully online and fully offline, so that wouldn't work.

Edit: Good call! It's fully offline - I disabled the network in Chrome and it worked. Says it's 176MB. I think it must be downloading part of the model, all at once, but that's just a guess.

The 176MB is in storage, which makes me think my browser will hold onto it for a while. That's quite a lot. My browser really should provide a disk-clearing tool that's more like OmniDiskSweeper than Clear History. If, for instance, it showed just the sites using over 20MB and my profile was using 1GB, there would be at most 50 entries, a manageable number to go through and clear the ones I don't need.
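
In the meantime, a page (or you, from the devtools console) can at least see the origin's usage via the Storage API; a quick sketch:

    // Sketch: report how much storage this origin is using, where the browser exposes it.
    async function reportStorage(): Promise<void> {
      if (!navigator.storage?.estimate) {
        console.log("StorageManager.estimate() not supported here");
        return;
      }
      const { usage, quota } = await navigator.storage.estimate();
      console.log(`Using ${((usage ?? 0) / 1e6).toFixed(1)} MB of ~${((quota ?? 0) / 1e6).toFixed(0)} MB quota`);
    }

    reportStorage();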


Yeah, this is why I think browsers need to start bundling some foundational models for websites to use. It doesn't scale if many websites each start trying to store a significantly sized model.

Google has started addressing this. I hope it becomes part of web standards soon.

https://developer.chrome.com/docs/ai/built-in

"Since these models aren't shared across websites, each site has to download them on page load. This is an impractical solution for developers and users"

The browser bundles might become quite large, but at least websites won't be.


As long as there’s a way to disable it. I don’t want my disk space wasted by a browser with AI stuff I won’t use.


A lot of these AI licenses are a lot more restrictive than old school open source licenses were.

My company runs a bunch of similar web-based services and plans to do a background remover at some stage, but as far as I know there are no current models with a sufficiently permissive license that can also feasibly be downloaded & run in browsers.


Meta's second Segment Anything Model (SAM2) has an Apache license. It only does segmenting, and needs additional elbow grease to distill it for browsers, so it's not turnkey, but it's freely licensed.
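
If anyone takes that on, the in-browser side would presumably look something like this (onnxruntime-web; the model file name, input name, and tensor shape are all hypothetical placeholders, not SAM2's real interface):

    // Sketch: run a distilled segmentation model in the browser with onnxruntime-web.
    // "distilled-sam2.onnx", the "image" input name, and the 1x3x1024x1024 shape are
    // placeholders -- a real distilled model defines its own names and shapes.
    import * as ort from "onnxruntime-web";

    async function segment(pixels: Float32Array) {
      const session = await ort.InferenceSession.create("distilled-sam2.onnx");
      const input = new ort.Tensor("float32", pixels, [1, 3, 1024, 1024]);
      return session.run({ image: input });
    }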


Yeah, that one seems to be the closest so far. Not sure if it would be easier to create a background removal model from scratch (since that's a simpler operation than segmentation) or distill it.


I got pretty far down that path during Covid for a feature of my SaaS, but limited to specific product categories on solid-ish backgrounds. Like with a lot of things, it's easy to get good and takes forever to get great.

