Cloudflare doesn't obey the Accept header on anything other than image content types.
This means if you have an endpoint that returns HTML or JSON depending on the requested content type and you try to serve it from behind Cloudflare you risk serving cached JSON to HTML user agents or vice-versa.
I dropped the idea of supporting content negotiation from my Datasette project because of this.
(And because I personally don't like that kind of endpoint - I want to be able to know if a specific URL is going to return JSON or HTML).
A wizened old web developer named Rik told me that, in a wiki he wrote long ago, he performed his own cache-busting form of content negotiation by having URLs end with a file type —
If /fred returns HTML, then the URL is perfectly fine, since the only real consumer of fred.md or fred.json are automated systems/API clients, and they couldn't care less what the URL looks like, only that it's predictable.
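Rik's scheme is simple enough to sketch. Here's a minimal, hypothetical router (the extension map and the `resolve` helper are my own names, not from his wiki): the format lives in the URL itself, so every variant gets its own cache key and no Vary header is needed.

```python
# Map URL extensions to content types; the bare URL serves HTML.
EXTENSIONS = {
    ".json": "application/json",
    ".md": "text/markdown",
}

def resolve(path):
    """Map a request path to (page_name, content_type)."""
    for ext, content_type in EXTENSIONS.items():
        if path.endswith(ext):
            return path[: -len(ext)], content_type
    # Bare URL: the human-readable page.
    return path, "text/html"

assert resolve("/fred") == ("/fred", "text/html")
assert resolve("/fred.json") == ("/fred", "application/json")
assert resolve("/fred.md") == ("/fred", "text/markdown")
```

Every variant is a distinct URL, so even a cache that ignores Vary entirely can never serve the wrong representation.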
Depends. Including it in the location part of the URL could make it harder to understand if you have multiple parameters, instead of just one. Using query parameters makes it very explicit, which can be good or bad depending on your use case, while sticking it into the URL makes it implicit.
Along with that, you can also put it in the `hash` part of the URL if you want to keep it a secret from the server, since the server doesn't receive that part: `/fred/#format=md`. In this case it doesn't make much sense, but it's useful to know for other things, like keys and whatnot.
I don't think that hash method would work, since that part of the URL isn't sent to the server. It's strictly used by the client to decide which part of the response to show.
I think if this URL is behind any sort of "api" prefix (subdomain or subdirectory), I don't see why not? It's always worked out pretty darned fantastic.
It also works quite well with the browser save feature. Curl and wget will also save to a nice name by default. It may be nice if content-type tracking was universal but unfortunately file-extensions are probably the most robust way of tagging some data with a type (other than magic bytes). You can save that JSON, email it, archive it, upload it to a fileserver, someone downloads it again, decompresses it and uploads it in an HTTP form and the result will probably still be identified as JSON and highlighted by default in your editor.
Are all the same page as far as the browser is concerned, and if you move from `/#!/` to `/#!/users/` and reload the page, you still load the same page (i.e. `/`).
But that was too ugly for URLs, so modern websites now use browser history APIs just so they can remove 3 characters from the URL and do stuff like:
Sure, if you load `/` and move to `/about/` you don't load a new page because of browser history APIs; but if you then refresh, now you load a different, uncached page (`/about/`, instead of `/`) even if the HTML in the response is exactly the same as in `/`.
Sure, the difference is not much, but to me it still seems like a waste when the response could have been cached already like in the first example.
The other poster is wrong; the entire reason the history API exists is that when these SPA frameworks first came onto the scene, one of the things they broke was history.
So the history APIs were invented so SPAs could stop breaking some of the user expectations they were violating.
The real reason for moving (back) from fragment URLs to real URLs is that the latter can serve different content on the initial load - specifically, they can act as actual webpages that use JavaScript for progressive enhancement and work immediately, whereas your / needs to display a spinner while you send off additional network requests to retrieve the content and build the page.
This way user-agents know that the Accept header is involved in the final form of the response. (As another example, Firefox also doesn't take Accept into consideration when caching locally by default, but it does with Vary: Accept.)
After looking some more into it, it seems that parent was right, and the examples I had in mind are not cached at all, that being the reason why it looked like it was working.
Their documentation specifically mentions Vary as non-functional[1] outside a paid plan, and even then only for images[2].
"HTTP/1.1 is a delightfully simple protocol, if you ignore most of it".
A Vary field containing a list of field names has two purposes:
1. To inform cache recipients that they MUST NOT use this response to satisfy a later request unless the later request has the same values for the listed header fields as the original request (Section 4.1 of [CACHING]) or reuse of the response has been validated by the origin server. In other words, Vary expands the cache key required to match a new request to the stored cache entry.
[...]
When a cache receives a request that can be satisfied by a stored response and that stored response contains a Vary header field (Section 12.5.5 of [HTTP]), the cache MUST NOT use that stored response without revalidation unless all the presented request header fields nominated by that Vary field value match those fields in the original request (i.e., the request that caused the cached response to be stored).
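The quoted MUST NOT rule amounts to a small comparison. Here's a sketch of it in Python (the function name and dict shapes are illustrative, not any real cache's internals):

```python
def may_reuse(cached_vary, original_request_headers, new_request_headers):
    """A cached response may only be reused if every header named in its
    Vary value matches between the original request and the new one."""
    if cached_vary.strip() == "*":
        return False  # Vary: * can never be satisfied by field matching
    for name in (n.strip().lower() for n in cached_vary.split(",") if n.strip()):
        if original_request_headers.get(name) != new_request_headers.get(name):
            return False
    return True

assert may_reuse("Accept", {"accept": "text/html"}, {"accept": "text/html"})
assert not may_reuse("Accept", {"accept": "text/html"}, {"accept": "application/json"})
```

A cache that skips this check is exactly what produces the cached-JSON-to-HTML-browsers failure described at the top of the thread.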
So, Cloudflare doesn't actually implement HTTP/1.1: you can't just decide to ignore parts of the standard and still claim you implement it; that's only allowed for the "SHOULD [NOT]" and "MAY" parts, not for the "MUST [NOT]" ones.
Notice that this is the HTTP "Semantics" RFC, not the "core" one (Message Syntax and Routing), which is the venerated RFC-7230 [1], and which is indeed quite simple for such a widely used protocol.
This RFC only defines a handful of "header fields", almost all of which are necessary for actually being able to "frame" (find the beginning and end of, applying decoding if specified) the HTTP message: https://www.rfc-editor.org/rfc/rfc7230#section-8.1
See, the caching mechanism in HTTP/1.1 is actually quite nicely designed in the sense that you can completely ignore it: don't look at any caching-related fields, always pass the requests up and the responses down — and it will be correct behaviour. But when you start to implement caching, well, you have to implement it, even the inconvenient but mandatory parts.
Yes, I know (and have personally experienced) that most clients forget to include the "Vary" field, or don't know that it exists (or that the problem this header solves exists), but when they do include it, they actually mean it and rely on it being honoured, one way or another.
PS. By the way, 7230 is obsoleted, you're supposed to use 9112 now.
Surely you mean RFC-2616? That was around for 15 years, including all of the 2000s which were the most formative Web 2.0 years; the RFC-7230 came by around 2014 and lasted only 8 years OTOH.
> vary — Cloudflare does not consider vary values in caching decisions. Nevertheless, vary values are respected when Vary for images is configured and when the vary header is vary: accept-encoding.
That's even with the proper Vary header, right? Seems like a Cloudflare bug if that's true. Maybe even a security bug if you find a service that supports text/plain.
I haven't dealt with Cloudflare specifically, but I did deal with a number of big CDNs for large amounts of traffic. They were pretty adamant about NOT supporting arbitrary Vary header values. It broke some logic on a few of our systems and we eventually just decided to work around it instead of pushing our case.
Interestingly, one of the big CDN providers did have controls in their UI for explicitly allowing/disallowing Vary header entries but they disabled it for us at some point (e.g. it was still in the UI but greyed out). I assumed once we hit a certain level of traffic it was too computationally expensive? Ever since, I've avoided any kind of fancy header/response variance in APIs just in case I end up in the same situation. It is rarely a necessity. IIRC, the only thing they continued to support variance wise was gzip (e.g. content-encoding).
It's also worth noting they were extremely conservative with query parameters too. Also to reiterate, this was very high traffic and high volume with expectations of low latency, so probably not applicable to most people using CDNs for static website assets.
> I assumed once we hit a certain level of traffic it was too computationally expensive?
Seems strange; AFAIK in e.g. Varnish, Vary just means you get more "stuff" tacked onto the buffer that gets built from the request and then hashed to create the cache key.
And actually, come to think of it, if memory for N concurrent in-flight requests is the concern, then you don't even need an actual (dynamically allocated) buffer, either; presuming you're using a streaming hash, you can feed each constituent field directly into the hasher, with only the hasher's (probably stack-allocated) internal static buffer for blockwise hashing required. (Which you're gonna need regardless of whether you're doing any Vary-ing.)
So it's really just a question of how many CPU cycles are being spent hashing. And it's likely just going to be a difference between hashing 300 bytes (base request — hostname, path, headers that are always implicitly Varied upon) and 350 bytes (those things, plus whatever you explicitly Varied) per request. Doesn't seem like too much of a win... (especially when hardware-accelerated hashing ops operate on blocks anyway, such that you only get stepwise cost increases for every e.g. 128 bytes.) I wonder why they bothered?
Respecting vary headers is not this simple. Given a request, how do you calculate a cache key that includes only the Vary headers? You only get that list in actual responses from the server, so you need to actually look at some information derived from previous responses to determine what to hash on each request. This is called "partial match retrieval", and is much more complicated (and computationally intensive) than cases where you can calculate a hash key as a pure function of the request.
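A sketch of that two-step lookup, just to make the "partial match" point concrete (all names and data structures here are illustrative): the cache can't hash the right headers until it has found a stored entry telling it which headers matter, so it has to scan the variants stored under a URL.

```python
# url -> list of (vary_names, request_headers_snapshot, response)
cache = {}

def store(url, request_headers, vary_header, response):
    """Remember which headers the origin said this response varies on,
    along with the values those headers had in the triggering request."""
    names = [n.strip().lower() for n in vary_header.split(",") if n.strip()]
    snapshot = {h: request_headers.get(h) for h in names}
    cache.setdefault(url, []).append((names, snapshot, response))

def lookup(url, request_headers):
    """Scan stored variants: not a pure hash of the request."""
    for vary_names, snapshot, response in cache.get(url, []):
        if all(request_headers.get(h) == snapshot.get(h) for h in vary_names):
            return response
    return None  # miss: forward to origin, then store() the result

store("/fred", {"accept": "text/html"}, "Accept", "<html>...</html>")
assert lookup("/fred", {"accept": "text/html"}) == "<html>...</html>"
assert lookup("/fred", {"accept": "application/json"}) is None
```

The linear scan is the crux: a Vary-less cache is one hash lookup, while honouring Vary means inspecting per-resource metadata that only arrives in responses.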
This isn't something I considered but it totally makes sense. Given that the Vary header is a per-resource value you would have to propagate that through the network. For millions of resources that might become an issue. And since in a worse-case scenario the server could be changing the Vary header for a single resource across multiple requests you have the additional problem of trying to keep it consistent across datacenters.
I think that is probably why some CDNs have a single configuration for any HTTP headers you want to vary on (e.g. CloudFront allows you to specify a global configuration for a distribution that takes specific headers into account). This avoids the problems of both per-resource and inter-datacenter consistency that relying on the Vary header might cause.
It now occurs to me that even what you're describing wouldn't be enough, because, as MDN says [emphasis mine]:
> The Vary HTTP response header describes the parts of the request message aside from the method and URL that influenced the content of the response it occurs in.
In other words, if the server backend has a resource with representations that Vary on header values {A,B,C,D}; and one client sends req headers {A,B} — then by the standard they should only be told `Vary: A, B`; while if another client sends req headers {C,D}, then they should only be told `Vary: C, D`. The client should not be told in the Vary response header, about request headers they didn't send.
So it's not just that you can wait for the backend to send a `Vary` response header, and then medium-term cache the value of that header in the cache-policy metadata for the cache key. Instead, on each response, you need to:
1. collect any additional Vary fields from the response and add them to your cache-policy Vary set; and
2. have some idea of what the "default header value" would be, to use as a fallback value when computing the cache key, for each header that isn't sent, when it's part of the active Vary set, so that you can dedup requests that explicitly send the header with value X, with request that don't send the header at all but where the default value would be X.
3. Also, ideally, have a library of normalization transforms for the value of each header used in Vary, to decrease cardinality (this approach takes up the majority of the page space in the Varnish docs for Vary: https://varnish-cache.org/docs/3.0/tutorial/vary.html)
And the knowledge required to do all this correctly is really... not knowledge that a middlebox has any good way of acquiring.
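For what it's worth, a normalization transform in the spirit of step 3 might look like this sketch, loosely modeled on the Accept-Encoding example in the Varnish docs (the exact rules here are illustrative): collapse the header's many raw spellings into a handful of canonical values before using it in the cache key.

```python
def normalize_accept_encoding(value):
    """Reduce arbitrary Accept-Encoding values to a tiny canonical set,
    so the cache key's cardinality stays low."""
    value = (value or "").lower()
    if "br" in value:
        return "br"
    if "gzip" in value:
        return "gzip"
    return ""  # treat everything else as identity

# All of these raw values map to the same cache-key component:
assert normalize_accept_encoding("gzip, deflate") == "gzip"
assert normalize_accept_encoding("GZIP") == "gzip"
assert normalize_accept_encoding(None) == ""
```

Without this, every distinct raw header string a browser emits would spawn its own cache entry for the same compressed bytes.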
This is starting to feel like a design smell in HTTP. Maybe zero-RTT content negotiation is misguided?
What if we instead did content negotiation like this (which — correct me if I'm wrong — would be a mostly ecosystem-backward-compatible change):
- if a resource negotiates, then by default, the server will send a 406 error response for all attempts at retrieving the resource. It sends this because the client itself needs to prove it knows what fields the resource varies on — and, of course, it doesn't know (yet), because nobody's told it yet. This 406 response contains a novel "Should-Vary" response header, informing the client of what it should be sending.
- to actually fetch a resource representation, the client is then expected to make the same request again, but this time, sending an Expect-Vary request header, the value of which matches the Should-Vary header value it saw from the server. Note that unlike with the Vary response header, this Expect-Vary request header should include header names that aren't part of the set of headers it's sending. (And/or, this list should force the client to emit explicit headers with its choice of implicit-default values for any headers listed in its own Expect-Vary header.)
- Upon receiving a request for a resource that negotiates, where the request has the Expect-Vary header set, the server will first verify that the Expect-Vary header value matches the Should-Vary value it would return for the resource, and either matches or is a superset of the Vary value it would compute as the response header given 1. the resource and 2. the rest of the received request. If this verification fails, that's a 406 again, sending Should-Vary again. If the verification passes, and the rest of the HTTP state workflow goes through, then you get a 2XX response. This 2xx response has the old Vary header as part of the response — but it now only exists for ecosystem back-compat.
- If a client thinks it knows the right Expect-Vary header to send, it can try sending it as a request header in the initial request. After all, the worst that can happen is the same 406 error it'd get otherwise. As well, the observed Should-Vary response header value of a resource can be cached basically indefinitely by the browser in its Expect-Vary cache, since the next time it changes for a resource, the browser will try its cached value for Expect-Vary in the request, and get a 406 response that tells it the new Expect-Vary value it should be using instead.
- Optionally, for efficiency, there could be introduced an Others-Should-Vary response header with the value being a path pattern (similar to a Set-Cookie Path field), which specifies other path prefixes for the host that should all be assumed by default to have the same Should-Vary header value as the response does. Potentially, a Should-Vary response header could also be sent in OPTIONS responses, to set a fallback assumed Vary value for the HTTP origin as a whole. (Clients are already requesting OPTIONS for CORS anyway; may as well give them some more useful information while we've got them on the line.)
With this design, middleboxes could safely trust the client's Expect-Vary header and use it to build the cache key — as long as 406 responses aren't cached.
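A rough sketch of the server side of this hypothetical handshake (Expect-Vary and Should-Vary are invented headers from the proposal above, and the equality check below skips the allowed superset case for brevity):

```python
def handle(resource_vary, request_headers):
    """Return (status, response_headers) for a negotiating resource."""
    expect = request_headers.get("expect-vary")
    if expect is None or set(map(str.strip, expect.split(","))) != set(resource_vary):
        # Client hasn't proven it knows the vary set: teach it with a 406.
        return 406, {"Should-Vary": ", ".join(resource_vary)}
    # Verified; Vary is sent only for ecosystem back-compat.
    return 200, {"Vary": ", ".join(resource_vary)}

# First request fails and learns; the retry with Expect-Vary succeeds.
status, headers = handle(["accept", "accept-language"], {})
assert status == 406
status, _ = handle(["accept", "accept-language"],
                   {"expect-vary": headers["Should-Vary"]})
assert status == 200
```

The cacheability property falls out of the verification step: once a 2xx arrives, the middlebox knows the client's Expect-Vary was accurate and can key on exactly those headers.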
To be clear, I'm not trying to make their argument for them since we spent probably 1 day working around it. I'm just passing along an anecdote. One day, Vary header stopped working on one CDN and we had to fix it. When I spoke to our account rep (I literally had a weekly call with them due to our usage) he said they were phasing it out for performance reasons. Not long after we got notice from another CDN asking for similar consideration. I have no inside knowledge as to their infrastructure or systems that made this a requirement. I very much doubt it was the cost to hash, maybe more likely something to do with their network topology and how requests were routed from origin to regional tiers to PoPs? I'm totally speculating here.
If this had been a necessity then I would have probably dug into the request more deeply. It was a "pick your battles" kind of thing. Extremely low cost on our side to change, no reason to bother if they claimed it would decrease problems on their side.
The cost of Vary headers is usually not in hashing the keys but in storing multiple entries per URL for an arbitrarily large combination of headers. I can imagine CDNs not wanting the hassle, though I don't love the outcome.
I'm not sure that tracks. If those variants are used, then eliminating support for Vary means they'll just be recreated as new endpoints that return the same thing, so the total number of cache entries remains unchanged.
It's worth noting that the cost of storage wasn't the issue in this case. They already had a system that allowed you to determine which headers in the Vary list would be respected and so you could calculate a worst-case storage load. I mean, it definitely was an issue in general and we were careful about avoiding the same content being stored multiple times but it wasn't the reasoning they communicated behind the change in the anecdote I related.
I think the best suggestion was in another thread by @johncolanduoni where he pointed out the difficulty of storing, distributing and retrieving the metadata per-resource that would be necessary for each PoP to correctly determine the Vary requirements at request time.
The problem with Vary is that it massively expands footprints and reduces cache efficiency when overused. In a CDN this can create noisy neighbor like issues.
Yes - this can be a big problem if it means that users get a cached response which isn’t compatible with their browser. It can also mean that you lose cache efficiency if you don’t get a ton of traffic - one site I worked on would’ve gotten much slower if they used WebP because it would have increased your odds of not getting a CDN-cached response, and the ~10-15% byte size savings just wasn’t worth that.
I'm undecided about whether the Semantic Web is generally useful or not, but in certain domains it does seem to have some worth.
When building a public scientific database I really want the URL identifying the item in the database to return the page for that item when I enter it in a browser but to return the appropriately structured data for that item when requested with "Accept: application/ld+json" or "application/rdf+xml" by a linked data library.
So it's unfortunate that there's no good way to support this with common CDNs.
Of course I always make it so that appending "?type=json" or "?type=xml" gets you the appropriate document.
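That double-barrelled setup might look like this sketch (the `choose_format` helper and its names are mine, not from any real project): honour the query-string override first, since it's safe behind any CDN, and fall back to the Accept header for linked-data clients.

```python
FORMATS = {
    "json": "application/ld+json",
    "xml": "application/rdf+xml",
}

def choose_format(query_type, accept_header):
    """Pick a response content type from ?type= or the Accept header."""
    if query_type in FORMATS:
        return FORMATS[query_type]  # explicit, cache-safe override
    for content_type in FORMATS.values():
        if content_type in (accept_header or ""):
            return content_type  # negotiated for linked-data libraries
    return "text/html"  # default: the human-readable item page

assert choose_format("json", "") == "application/ld+json"
assert choose_format(None, "application/rdf+xml") == "application/rdf+xml"
assert choose_format(None, "text/html") == "text/html"
```

Because the query parameter is part of the URL, the CDN caches each format separately even if it ignores Vary; the Accept path is then just a convenience for well-behaved clients hitting the origin.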
Are you certain that it is actually not supported? I suspect the CDN cache was not configured to vary cached responses via the Vary header. For an endpoint like this, that header will usually at minimum look like `Vary: Accept`.
This sounds like a misconfiguration. I haven't used Cloudflare, but I've used other CDNs, and they need to generate a cache key with as few header values as possible to maximize the cache hit rate. The headers used in generating the cache key are configurable, so if you want to use the Accept header in your application, then you're free to do so, but you need to tell Cloudflare that it's an important part of your application.
Meh. Too many ways to solve the same problem results in bugs and security issues. There is no case where you need to use the Accept header; you can just put that info in the URL instead.
Some effort to clean up the useless and duplicate features would be good.
Well as you're quite interested in APIs: the accept header is a fantastic way to version a REST API, and even has a built in way to handle clients that can talk to multiple versions of the same API.
Your app could request just 'application/vnd.foo.v1+json' from an API and if it's a centralised service that may be fine.
If the API your app talks to is something that's deployed to customers, or rolls out to regions incrementally or whatever, and thus can be at different versions, you need a way to handle that: the Accept header has you covered.
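As a sketch of how that negotiation could work (the vnd.foo media types follow the example above; the parser below is a simplified q-value reader, not a full Accept parser): a client that speaks several API versions lists them with q-values, and the server picks the highest-quality one it actually supports.

```python
def pick_version(accept_header, supported):
    """Return the supported media type with the highest q-value."""
    candidates = []
    for part in accept_header.split(","):
        media, _, params = part.strip().partition(";")
        q = 1.0  # q defaults to 1 when absent
        for p in params.split(";"):
            key, _, value = p.strip().partition("=")
            if key == "q":
                q = float(value)
        if media in supported:
            candidates.append((q, media))
    return max(candidates)[1] if candidates else None

accept = "application/vnd.foo.v2+json, application/vnd.foo.v1+json;q=0.5"
# An old deployment that only speaks v1 still gets a usable answer:
assert pick_version(accept, {"application/vnd.foo.v1+json"}) == "application/vnd.foo.v1+json"
# A current deployment serves the client's preferred v2:
assert pick_version(accept, {"application/vnd.foo.v1+json",
                             "application/vnd.foo.v2+json"}) == "application/vnd.foo.v2+json"
```

The same request works unchanged against every deployed version, which is the part that `/v1/` in the path can't give you without client-side fallback logic.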
Everyone does this by putting /v1/ in the api url. It's massively more visible, gets logged properly, and isn't annoying to request from tools like curl.
> Everyone does this by putting /v1/ in the api url
URLs are opaque identifiers that only have meaning to the server that generated them. This makes any meaning implicit, where the accept header is an explicit and documented part of the interface. Maybe not much difference in practice for code you've interacted with, but that doesn't mean no difference at all.
If curl has trouble setting headers, that's a curl problem.
You saying "no one needs that" doesn't mean it doesn't do more than some arbitrary url parameter or path segment. It means you're choosing not to use it.
An API can return JSON, XML, HTML microformats, maybe even a binary encoding of some kind, possibly more down the road. All of these are serialized formats for objects.
Of course you can have a unique URL for each format and avoid the accept header, just like you don't technically need more than GET and POST methods.
The only time you'd ever want to do this is for image and video formats. And HTML has built in support for this in the picture and video tags. Otherwise this sounds nonsensical.
What kind of situation would you ever have a request like "Uh I want json the most but XML could work". Either the backend serves it in a format or not, just directly request what you want.
XML, JSON, Protobuf can all be used as object serialization formats. It's not nonsensical at all for an endpoint to offer choice as to what serialization format a client may want to use. It's not common, but that doesn't mean nonsensical.
URLs are much longer, harder to remember, and harder to communicate than accept header values. It also unnecessarily complicates the API, and it relies on out-of-band/non-standardized information to know which URL returns which format, whereas the Accept header is in-band, standardized information for specifying the return format. Seems like there are multiple advantages.