> at least not until the IANA get around to officially assigning them a media type.
This is the wrong characterisation. IANA does not take such initiative; their role is administrative rather than regulatory or active. It’s up to an interested party to register media types.
For Parquet, that’s easy: the developers can fill out https://www.iana.org/form/media-types in probably less than ten minutes, likely choosing the media type application/vnd.apache.parquet. It’ll be processed quickly.
For JSON Lines/NDJSON, it’s messier, calling for standards-tree registration, which generally means taking a proper specification through some relevant IETF working group. (There are a few media types in customary use presently, all bad: application/x-ndjson, application/x-jsonlines, application/jsonlines; all are named as though they sit in the standards tree despite never being registered, and two include the long-obsolete x- prefix.) Such an adventurer will doubtless encounter at least some resistance due to the existing JSON Text Sequences (application/json-seq, defined in RFC 7464, https://www.rfc-editor.org/rfc/rfc7464), which is functionally equivalent, mildly harder to work with, and technically superior, due to being unambiguously not-just-JSON, using a ␞ (U+001E RECORD SEPARATOR) prefix on every record. But given the definite popularity of JSON Lines/NDJSON, an Internet Draft will easily be enough for provisional registration.
> and technically superior, due to being unambiguously not-just-JSON, using a ␞ (U+001E RECORD SEPARATOR) prefix on every record
Where would you see its superiority? I've mostly worked with jsonlines so far, but I found it very convenient to use, as it's almost the natural input/output format for Jq, grep and all kinds of other line-based tools.
I get that jsonseq would be easier to parse in theory, but this goes away when you ensure that no individual json segment contains a newline. And ensuring this is basically a jq -c call.
Because json is whitespace agnostic, there is also no situation where you need a newline to represent the data.
The only advantage of jsonseq I see is that in files which contain exactly one item you unambiguously know it's not json. The advantage goes away for files with zero items though - and in most situations where you'd have to make that distinction, I'd assume you'd use the content type anyway.
I find a surprising amount of value in in-band media type signalling. Not everything sees the declared media type, and media types are regularly calculated from file contents (magic numbers) rather than anything else, also. So here, the very first byte lets you know you’re dealing with a JSON text sequence rather than JSON or concatenated JSON or newline-delimited JSON or whatever else. It really comes down to just that. That line feeds are permitted (though largely not recommended) elsewhere is nice for human writing, but not particularly relevant, as these formats are not generally intended for human writing (you’d just use normal JSON if you wanted that).
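To make the wire-level difference concrete, here is a minimal Python sketch (standard library only, illustrative rather than a reference implementation) of the two formats being compared: the only difference on the wire is the U+001E prefix that RFC 7464 puts in front of each record.
####
import json

records = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

# JSON Lines / NDJSON: one compact JSON document per line.
jsonl = "\n".join(json.dumps(r) for r in records) + "\n"

# JSON Text Sequences (RFC 7464): each record is prefixed with U+001E
# (RECORD SEPARATOR) and followed by a line feed, so the very first byte
# already signals the format in-band.
RS = "\x1e"
json_seq = "".join(RS + json.dumps(r) + "\n" for r in records)

# Parsing the sequence: split on the separator, ignore empty chunks.
parsed = [json.loads(chunk) for chunk in json_seq.split(RS) if chunk.strip()]
assert parsed == records
####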
Good article, but this is not a “trick”, it’s a core part of the HTTP protocol. Worthy of an article nonetheless, judging by how misunderstood the topic is among the commenters here.
Content negotiation itself is not a trick, but this usage of it is fair to characterize as a sort of trick. Most uses of content negotiation are about serving the same content in different formats, like different image formats; in this case though, it's actually negotiating a different thing entirely, where one is a machine-readable CSV and the other a human-readable hypertext document. The fact that it works is unexpected to the uninitiated, because even just reading the RFC will not necessarily make it apparent why cURL and the browser would return such different results. There are a lot of ways a server could detect cURL, like the user agent, and it might come as a surprise to some that cURL and other user agents that are not web browsers often send */* for the accept header. It is not a terribly new way to use content negotiation, but arguably still a clever one. It's a deliberate behavior.
> serving the same content in different formats, like different image formats; in this case though, it's actually negotiating a different thing entirely, where one is a machine-readable CSV and the other a human-readable hypertext document
Huh? No, it's also serving the same content, just in different formats: one is a machine-readable CSV and the other a human-readable hypertext document.
Disagree: the table inside the HTML is the "content" that comprises the entirety of the CSV. Related but absolutely not "same". (It'd be more arguable if all the HTML had was the table, but it's actually just a normal web page with a table.)
Arguably the resource is the dataset of stock exchanges, and the CSV representation is forced to omit all the metadata but the HTML representation isn't.
I understand what people are getting at, as it's not really that big of a logical leap. I think the fact that it is somewhat of a stretch, but still "in the lines," is exactly why it is a "trick": it isn't doing anything particularly invalid or hacky, it's just not necessarily what you'd imagine when reading the RFC. Content negotiation to me is more about serving the optimal content to a given agent, not really about selecting modalities for different use cases based on different types of user agents.
I think both cases are "valid" although I think it is inherently less tricky if the document talking about and previewing the dataset is referenced via a separate URL from the dataset itself. (Which, of course, entirely mitigates problems like Apache Spark having HTML in the Accept header.)
Actually, that's a pretty good point. The thing is that the URL itself refers to some conceptual resource, and the response is ideally a representation of that resource, potentially one of multiple. If you take the same source image and encode it multiple ways, although the two resulting images are different from each other, they are representations of the same underlying image/resource. But if you were to provide a different image, or alter the image in other ways, I think this would be pretty tricky actually, even if the modification was something trivial. You can imagine a simple use case like a WebP image with the text "Your browser supports WebP" and a PNG image with the text "Your browser does not support WebP." The point is that content negotiation being used to present different logical resources that are not necessarily interchangeable representations of the same data feels tricky. I think that is still compatible with the fact that it's all within specifications.
This. lol. I opened the article and was ready for a cool trick and then they showed it returning csv over Curl and html over a browser and I was like "So the accept header? Maybe there is another trick...nope, just the accept header." Our API already does this at my company, where it returns an HTML formatted, human readable response over a browser nav but returns application/json over anything requesting only that.
My impression is that many power users (not web developers) want to download datasets. Sometimes they’ll use a browser to grab a csv or json file, then stuff it into excel/R/python/… to do some sort of analysis work. This kind of knowledge is helpful for that crowd. They often don’t know about the non-url parts of a request.
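For that crowd, a small illustration of what the article describes - asking csvbase for CSV explicitly rather than relying on whatever Accept header your client sends by default. This is a sketch assuming the requests and pandas libraries, and that the server honours text/csv as the article says it does:
####
from io import StringIO

import pandas as pd
import requests

url = "https://csvbase.com/meripaterson/stock-exchanges"

# A client asking for text/html gets the web page; asking for text/csv
# gets the raw table, regardless of the client's default Accept header.
res = requests.get(url, headers={"Accept": "text/csv"})
res.raise_for_status()
df = pd.read_csv(StringIO(res.text))
print(df.head())
####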
1) The discovery of the different response formats. How do I know that I can get csv files from that URL, other than by hoping that the website documents this somewhere?
There's nothing in the underlying HTTP response from https://csvbase.com/meripaterson/stock-exchanges that tells me I can get an HTML or CSV version. Is there a JSON version available? What other variants exist? How do I know that this URL will deliver different responses?
2) Will the website always default to csv files or will my app break when they decide that XML is superior? (Well, obviously not, especially for a site called csvbase!)
But if your program expects CSV data, it is probably best to always request that, and a URL that ends in .csv gives you far more certainty that the data is going to be in that format.
> There's nothing in the underlying HTTP response from https://csvbase.com/meripaterson/stock-exchanges that tells me I can get an HTML or CSV version. Is there a JSON version available? What other variants exist? How do I know that this URL will deliver different responses?
Well, you can send a HEAD request with a given accept: header to find out what you’ll get without actually fetching the data. But it’s true that it would be nice to have the full set of possible responses advertised somehow.
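Something along those lines - a HEAD probe per candidate Accept value - is about the best you can do today. A rough sketch, assuming the server answers HEAD consistently with GET:
####
import requests

url = "https://csvbase.com/meripaterson/stock-exchanges"

# HEAD returns only the headers, so each probe is cheap: the Content-Type
# in the response tells you which representation that Accept value earns.
for accept in ("text/csv", "text/html", "application/json"):
    res = requests.head(url, headers={"Accept": accept})
    print(f"{accept:24} -> {res.status_code} {res.headers.get('Content-Type')}")
####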
I should have mentioned the HTTP 'Vary' response header, that servers can use to inform the client that its response was based upon some of the headers that it sent.
A 'Vary: accept' response header gives a hint that it could have supplied a different response had you given a different Accept: header. But I don't think that there's a way to actually list the variants available in the HTTP spec?
My reading of the HTTP spec suggests that csvbase.com is behaving incorrectly by not setting the Vary header properly (it sends: 'Vary: Accept-Encoding', but it should also list 'Accept' in there too). Potentially, a proxy server could decide to cache the CSV or HTML response, and then serve that version back to another client instead of the 'right' one, because the server didn't correctly report that the response varies based upon 'Accept'. In practice, this isn't likely to happen unless you've got a caching proxy that is also unwrapping the encryption between itself and your HTTP client.
> My reading of the HTTP spec suggests that csvbase.com is behaving incorrectly by not setting the Vary header properly
This is a good idea, I will certainly look at this. There are some planned features WRT caching coming up.
To address the comments you made in GP:
> How do I know that I can get csv files from that URL
That is a good question. The web UI could be better, of course. But programmatically, how do you advertise alternate representations? I'm not sure. Suggestions appreciated.
> Will the website always default to csv files or will my app break when they decide that XML is superior? (Well, obviously not, especially for a site called csvbase!)
As you say: csvbase won't change :)
But the other thing is that the HTTP client you use could decide to change its default Accept header. If curl changed to "application/json, */*;q=0.9" then suddenly you'd get json (I didn't mention in the blog post but that is also implemented)!
Oh dear. Perhaps a good idea to include the file extension or explicit Accept header when you're coding something that needs to last. But I do think it's nice to be able to copy and paste into pandas. That's my main usability case and I wanted that to be as smooth as possible.
> This is a good idea, I will certainly look at this. There are some planned features WRT caching coming up.
While it is probably a bug, it's probably not a serious one that many people would run into nowadays. Now that https is ubiquitous, there aren't many caching proxies around to cause grief. Probably the only proxies people will experience are where they are behind a paranoid company's firewall, one that is configured to decrypt (and then re-encrypt) all their web traffic. And in those situations, they don't tend to do caching much now. (Because even though you can cache HTTP, you'll hit problems with misconfigured sites and users will blame your proxy for it.)
> But programmatically, how do you advertise alternate representations? I'm not sure. Suggestions appreciated.
Sorry, I don't have a good answer for this. I only nit-pick problems in web comments :)
You could set a HTTP header to list the available variants, but there isn't a standard AFAIK so it would only help developers who spotted the header.
> But the other thing is that the HTTP client you use could decide to change its default Accept header. If curl changed to "application/json, */*;q=0.9" then suddenly you'd get json (I didn't mention in the blog post but that is also implemented)!
That's cool! Aeons ago, I was involved in developing a web server, where we added support for properly handling all kinds of content negotiation (Accept-Encoding, Accept-Language, etc), so that you could configure it to deliver the right file based on the user's language, file type preference, etc. It was a large chunk of code, but in the end, nobody really used it. In theory, web browsers and sites could co-operate to deliver the right page in the right language for all their users automatically. In practice though, it never works. No one sets up their web browser to pick the language properly (who even knows how to change it?). As a result, multi-lingual sites offer to switch languages by clicking on a link, and if they choose a default language, they mostly do it based on IP address (and assumed location).
> That's my main usability case and I wanted that to be as smooth as possible.
I think it's the right choice for csvbase, my original comment reads as far too critical in retrospect, it's neat that if you curl a URL, you get the csv. But if I was writing code to scrape some csv data, I would still always prefer to download URLs with a .csv extension, because you know what you are getting 100% of the time, and you avoid any unpleasant surprises if some 3rd-party library or tool changes its behaviour.
> Now that https is ubiquitous, there aren't many caching proxies around to cause grief.
Well, there are still CDNs. csvbase is designed to sit behind a public cache for some pages. I haven't done much on this except for the blog pages, which use the CDN a lot.
I also have vague plans for client libraries that include a caching forward proxy as my experience is that most people export the same tables repeatedly. Likely that will be based on etags though so that the cache is always validated.
The designers of HTTP 1.1 clearly thought a lot about a lot of things, including caches.
Thanks for your thoughts. :) Keep in touch via email if you like (same goes for anyone else reading this): cal@calpaterson.com
I vaguely remember a web server set up for language content negotiation failing to determine which version to send and giving me a list of links to the individual language versions instead.
I think it was Apache and the negotiable resource was called X.html while the individual linked versions had names like X.en.html etc.
Without sending "Vary: Accept" the server might have its response mis-cached by a proxy. A request from a browser could populate the cache with HTML, which could then serve HTML in response to a request that wants CSV. Any time you vary your response based on a request header, the spec says you should list it in your Vary response header.
In practice, with the move to HTTPS this rarely comes up anymore outside the sending company's internal infrastructure. Basically no one is running client-side caches that are shared between multiple consumers.
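For anyone implementing this pattern themselves, the fix is one header on the negotiated response. A minimal Flask-flavoured sketch - purely illustrative, not csvbase's actual code; the route and response bodies are made up:
####
from flask import Flask, Response, request

app = Flask(__name__)

CSV_BODY = "name,mic\nNew York Stock Exchange,XNYS\n"
HTML_BODY = "<html><body><table><tr><td>...</td></tr></table></body></html>"

@app.route("/table")
def table():
    # Pick a representation from the client's Accept header; text/csv is
    # listed first so an indifferent */* client (like curl) tends to get CSV.
    best = request.accept_mimetypes.best_match(["text/csv", "text/html"])
    if best == "text/html":
        resp = Response(HTML_BODY, mimetype="text/html")
    else:
        resp = Response(CSV_BODY, mimetype="text/csv")
    # ...and tell any shared cache that the answer depended on Accept.
    resp.headers["Vary"] = "Accept"
    return resp
####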
This is a supported use of "Accept" headers, but I kind of miss the pre-SEO web where it was okay for URLs to just carry a file extension - having "example.com/dataset/foo.html" and "example.com/dataset/foo.csv" is pretty simple and less ambiguous too.
I don't think it's SEO so much as the evolving complexity of websites. When there's a .html extension, you're usually pointing at an actual static file sitting on the server. On the page where I'm typing this (/reply) there isn't a simple "reply.html" that's just hiding its file extension, but a completely dynamic page being rendered.
You can configure web servers to behave like that if you wanted. The only reason we don’t is because file extensions are ugly and noisy for non-technical people.
It’s more about having “friendly” URLs than it is a limiting of any technology.
In case the author sees this:
Thank you for enabling CORS so that it's possible to plot examples from other sites. It would be awesome if the Content-Range header was allowed as well
Thank you for csvbase, today is the first time I've seen it.
I believe PapaParse, a JS library for parsing CSV files, uses Content-Range to stream large CSV files in chunks.
https://csvplot.com uses PapaParse under the hood, I saw a warning in the dev console and posted here. I'm not sure why it seemingly works fine anyway.
For a random python/pandas trick, I have come across web APIs that cannot be directly read into pandas using the URL (I imagine folks on here can comment better on the differences in web serving tech), but you can read in the IO object and pass that to pandas. Blog post, https://andrewpwheeler.com/2022/11/02/using-io-objects-in-py..., but can just put simple example in comment:
####
import pandas as pd
from io import StringIO
import requests

# endpoint that (per the blog post) can't be read directly with pd.read_csv(url)
url = ('https://data.townofcary.org/explore/dataset/cpd-incidents/download/'
       '?format=csv&timezone=America/New_York&lang=en&use_labels_for_header=true'
       '&csv_separator=%2C')
res = requests.get(url)
# wrap the response text in a file-like object so pandas can parse it
df = pd.read_csv(StringIO(res.text))
####
For what it's worth, if the requests module works fine you could probably set "stream=True" in the request and read the `res.raw` file object directly. That way you avoid loading the data into memory first [1]. You'll probably want to set `res.raw.decode_content = True` to ensure you get the actual content bytes, and not some zipped stream.
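A sketch of that streaming variant, reusing the URL from the snippet above (untested against that particular endpoint, so treat it as illustrative):
####
import pandas as pd
import requests

url = ('https://data.townofcary.org/explore/dataset/cpd-incidents/download/'
       '?format=csv&timezone=America/New_York&lang=en&use_labels_for_header=true'
       '&csv_separator=%2C')

# stream=True defers reading the body; res.raw is a file-like object that
# pandas can consume directly, so the CSV text is never buffered twice.
res = requests.get(url, stream=True)
res.raw.decode_content = True  # transparently undo any gzip/deflate encoding
df = pd.read_csv(res.raw)
####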
I'm kind of shocked by how poorly understood basic HTTP stuff like this is for HN audience based on the comments and article itself. My filter bubble must be tuned to "web."
HTTP is great. Another common "How does it know" is resuming downloads: that's done by the Range header. Curl supports it via `--continue-at -` (the trailing dash means "figure out where it stopped"; you can also give an explicit offset, or use `--range` for an arbitrary byte range).
Except resumed downloads are broken when used in conjunction with compression, because it's not clear whether the byte range refers to the compressed or uncompressed resource.
The HTTP spec solved this problem elegantly: it had the concept of the identity of a resource, and gave two headers to declare compression: Content-Encoding (=the resource is always compressed, like a tar.gz, byte range refers to compressed) and Transfer-Encoding (=the resource is compressed only for transfer, the uncompressed is the real thing).
As of 2023, this has not been implemented and the Content-Encoding header is used for both semantics. So resuming downloads over a proxy has a good chance of corrupting your file; I also had source tarballs being decompressed on the fly and failing their checksums.
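A rough Python equivalent of curl's resume behaviour, sending Accept-Encoding: identity to sidestep the compressed-vs-uncompressed ambiguity described above (the URL and filename are placeholders):
####
import os
import requests

url = "https://example.com/big-download.bin"  # placeholder
path = "big-download.bin"

# Ask only for the bytes we don't already have.
have = os.path.getsize(path) if os.path.exists(path) else 0
headers = {"Range": f"bytes={have}-", "Accept-Encoding": "identity"}

with requests.get(url, headers=headers, stream=True) as res:
    # 206 means the server honoured the Range header; anything else and we
    # start over rather than append garbage to a partial file.
    mode = "ab" if res.status_code == 206 else "wb"
    with open(path, mode) as f:
        for chunk in res.iter_content(chunk_size=64 * 1024):
            f.write(chunk)
####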
If you came here to point out how content negotiation isn't a "trick" but rather a simple basic part of the core protocol: think first on the fact that there was once a time when you didn't know that.
I mean, I also once didn’t know how to program, doesn’t make programming a trick. As another comment pointed out, it’s worthy of an article. But a trick is a weird description for it. It’s like saying it’s a trick that you can do `console.log` and it will output it in the browser console.
If we can all agree it's an interesting topic of discussion, we can set aside debating the semantics of whether it is a "trick" and let the author express themselves using the words of their choosing. Policing the definition of trick is surely not curious conversation.
I disagree that this is a clickbait title. I also disagree that those complaints are worth discussing. The guidelines specifically discourage us from complaining about things "too common to be interesting." And the articles that I'd describe as clickbait are simply too low quality to be posted anyhow. ("Clickbait", in my mind, is when the article uses provocative language to build expectations that it doesn't or can't deliver upon. "Trick" in this sense pretty clearly means, "here's a tool you can use to serve different content types to different user agents," and it delivers on that.)
These are really stylistic complaints; the author expressed themselves in a way that's not to your taste. Your tastes are valid, but there should be no expectation that every or any article will cater to them, and their not doing so isn't a criticism of the article and isn't something we can really have a productive discussion about in this medium. I find people use the term "clickbait" to try and reframe their tastes as something more objective.
Being old enough that I probably studied the http protocol before actually using it, by the time I encountered this functionality in the wild I already knew how it worked. So, no, there was never such a time.
I've long thought that this concept could/should be applied to user-targeted content. Blogs could be capable of delivering content as HTML but possibly Markdown, plaintext, Gemini, PDF, etc. as well. SQLite tables, CSV, JSON for sharing data. Downloading of a directory with archives. What is missing for adoption is proper readers for alternative formats in the big web browsers though; I wonder if that would be accepted upstream.
Others note that git supports http clone, but if you do want to serve SSH and HTTP on the same port, that's something haproxy makes pretty easy to do. It's not something terribly fancy, these days.
I wrote one for personal use. If you (by the book) reject invalid feeds you lose more than half. Attempting to properly negotiate the content would just result in failure for each feed (could be many), but you could try it if all else fails, I suppose (without much result, but if you try 100 such things you will get to brag about rare successes).
I think (only) if a popular service required it, it could be a thing.
There's no reason to have a separate /archives resource and a /feed.xml (with or without content negotiation). You can just specify some external XSLT with an xml-stylesheet processing instruction in your feed XML that will cause the feed to be rendered nicely when it's opened in the browser...
Probably only a few, but there are web servers out there where it would be minimal effort to implement - though sadly those are only a few too.
It always confuses me how a Feed Reader looks like a normal desktop browser when it loads a feed, e.g. Accept: text/html instead of application/rss+xml first.
Most feed readers don't even tell the server which format they handle at all, most just send an Accept: text/html, */*
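For contrast, this is roughly what a well-behaved feed reader could send - preferring feed media types and only falling back to HTML (the feed URL is a placeholder):
####
import requests

headers = {
    "Accept": "application/rss+xml, application/atom+xml;q=0.9, "
              "text/html;q=0.2, */*;q=0.1"
}
res = requests.get("https://example.com/feed", headers=headers)  # placeholder URL
print(res.headers.get("Content-Type"))
####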
This is one of the most popular beginner datasets around. It's often used in tutorials and when experimenting. Does everyone who wants to use that have to write some custom parser? Why? https://csvbase.com/calpaterson/boston.csv is just so much easier
As stupid as it may sound, "csv" has become a generic term for any single-character delimited line-based file. We deal regularly with a "csv" that uses the pipe character instead of commas, for example.
yeah that's pretty strange, does the file format documentation call it a CSV? In my experience those files are always referred to in their documentation as "pipe-delimited file"
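If it helps anyone else stuck with such files: pandas will happily treat any single-character delimiter as "CSV" (data.psv here is a hypothetical local file):
####
import pandas as pd

# "CSV" in name only - a pipe-delimited file parses fine once you say so.
df = pd.read_csv("data.psv", sep="|")
####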
Except when you want to be sure you get what they intended. If it’s just some little throwaway project and you don’t care if the results don’t match up, use csv.
That's actually when I'd rather use something like pickle or feather, maybe even json depending on the usecase, which is easier to parse back.
I've encountered CSV mostly for data transfer among different programs. There sometimes are better options. Sometimes there isn't, maybe even just because CSV is easier and cheaper to implement on both (or multiple) ends.
Yeah as long as you don’t mind Excel changing a few values here and there and pretend translation and internationalization doesn’t exist. But hey, that’s ‘good enough’ so the developers don’t need to bother using a format developed in this century.
That’s completely backwards. Content negotiation was part of HTTP early on (check RFC 1945, HTTP/1.0). Fielding’s REST thesis came after HTTP 1.1, and used HTTP as an example of REST’s concept of representations.
Fielding's REST thesis also goes into great detail and emphasis on using HTTP's content negotiation definitions. This is probably the least used part of Fielding's definition of REST, but he greatly encourages defining custom media types for application objects, and he believes those are a much more fundamental part of REST design than the HTTP verbs, for example.
Generally speaking, implicit magic is cool but can also be frustrating if it's not working as expected.
Very OT: Apple has this attitude of 'it just works'. Really great. Unless it is not working while there are no settings and no info whatsoever on what the requirements are to make it work.
It does not depend on client settings, nor is it magic. The HTTP headers are as much part of the input of a request as the URL itself. The only difference is that most people aren't aware of the HTTP headers and they aren't shown by default in the browser. But not displaying information that isn't relevant to this context is not the same as magic.
This isn't a problem either - if you are a regular internet user, then stuff just works and you don't need to know about HTTP headers at all. If you are a (web) developer, you really should have a general idea about HTTP headers and what kinds of things they are useful for - and then it's not magic anymore.
It doesn’t become “implicit magic” just because the latest batch of “I’m not sure what my browser does” developers hasn’t organically come across it. See also: the utter miscategorisation of CORS as an annoyance instead of an utter blessing.