> at least not until the IANA get around to officially assigning them a media type.
This is the wrong characterisation. IANA does not take such initiative; their role is administrative rather than regulatory or active. It’s up to an interested party to register media types.
For Parquet, that’s easy: the developers can fill out https://www.iana.org/form/media-types in probably less than ten minutes, likely choosing the media type application/vnd.apache.parquet. It’ll be processed quickly.
For JSON Lines/NDJSON, it’s messier, calling for standards-tree registration, which generally means taking a proper specification through some relevant IETF working group. (There are a few media types in customary use presently, all bad: application/x-ndjson, application/x-jsonlines, application/jsonlines; all are named as though they sit in the standards tree despite never being registered, and two include the long-obsolete x- prefix.) Such an adventurer will doubtless encounter at least some resistance due to the existing JSON Text Sequences (application/json-seq, defined in RFC 7464, https://www.rfc-editor.org/rfc/rfc7464), which is functionally equivalent, mildly harder to work with, and technically superior, due to being unambiguously not-just-JSON, using a ␞ (U+001E RECORD SEPARATOR) prefix on every record. But given the definite popularity of JSON Lines/NDJSON, an Internet Draft will easily be enough for provisional registration.
> and technically superior, due to being unambiguously not-just-JSON, using a ␞ (U+001E RECORD SEPARATOR) prefix on every record
Where would you see its superiority? I've mostly worked with jsonlines so far, but I found it very convenient to use, as it's almost the natural input/output format for Jq, grep and all kinds of other line-based tools.
I get that jsonseq would be easier to parse in theory, but this goes away when you ensure that no individual json segment contains a newline. And ensuring this is basically a jq -c call.
Because json is whitespace agnostic, there is also no situation where you need a newline to represent the data.
The only advantage of jsonseq I see is that in files which contain exactly one item you unambiguously know it's not json. The advantage goes away for files with zero items though - and in most situations where you'd have to make that distinction, I'd assume you'd use the content type anyway.
I find a surprising amount of value in in-band media type signalling. Not everything sees the declared media type, and media types are regularly calculated from file contents (magic numbers) rather than anything else, also. So here, the very first byte lets you know you’re dealing with a JSON text sequence rather than JSON or concatenated JSON or newline-delimited JSON or whatever else. It really comes down to just that. That line feeds are permitted (though largely not recommended) elsewhere is nice for human writing, but not particularly relevant, as these formats are not generally intended for human writing (you’d just use normal JSON if you wanted that).
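To make the wire-level difference concrete, here is a minimal Python sketch (standard library only, illustrative rather than a reference implementation) of the two formats being compared: the only difference on the wire is the U+001E prefix that RFC 7464 puts in front of each record.
####
import json

records = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]

# JSON Lines / NDJSON: one compact JSON document per line.
jsonl = "\n".join(json.dumps(r) for r in records) + "\n"

# JSON Text Sequences (RFC 7464): each record is prefixed with U+001E
# (RECORD SEPARATOR) and followed by a line feed, so the very first byte
# already signals the format in-band.
RS = "\x1e"
json_seq = "".join(RS + json.dumps(r) + "\n" for r in records)

# Parsing the sequence: split on the separator, ignore empty chunks.
parsed = [json.loads(chunk) for chunk in json_seq.split(RS) if chunk.strip()]
assert parsed == records
####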
Good article, but this is not a “trick”, it’s a core part of the HTTP protocol. Worthy of an article nonetheless, judging by how misunderstood the topic is among the commenters here.
Content negotiation itself is not a trick, but this usage of it is fair to characterize as a sort of trick. Most uses of content negotiation are about serving the same content in different formats, like different image formats; in this case though, it's actually negotiating a different thing entirely, where one is a machine-readable CSV and the other a human-readable hypertext document. The fact that it works is unexpected to the uninitiated, because even just reading the RFC will not necessarily make it apparent why cURL and the browser would return such different results. There are a lot of ways a server could detect cURL, like the user agent, and it might come as a surprise to some that cURL and other user agents that are not web browsers often send */* for the accept header. It is not a terribly new way to use content negotiation, but arguably still a clever one. It's a deliberate behavior.
> serving the same content in different formats, like different image formats; in this case though, it's actually negotiating a different thing entirely, where one is a machine-readable CSV and the other a human-readable hypertext document
Huh? No, it's also serving the same content, just in different formats: one is a machine-readable CSV and the other a human-readable hypertext document.
Disagree: the table inside the HTML is the "content" that comprises the entirety of the CSV. Related but absolutely not "same". (It'd be more arguable if all the HTML had was the table, but it's actually just a normal web page with a table.)
Arguably the resource is the dataset of stock exchanges, and the CSV representation is forced to omit all the metadata but the HTML representation isn't.
I understand what people are getting at, as it's not really that big of a logical leap. I think the fact that it is somewhat of a stretch, but still "in the lines," is exactly why it is a "trick": it isn't doing anything particularly invalid or hacky, it's just not necessarily what you'd imagine when reading the RFC. Content negotiation to me is more about serving the optimal content to a given agent, not really about selecting modalities for different use cases based on different types of user agents.
I think both cases are "valid" although I think it is inherently less tricky if the document talking about and previewing the dataset is referenced via a separate URL from the dataset itself. (Which, of course, entirely mitigates problems like Apache Spark having HTML in the Accept header.)
Actually, that's a pretty good point. The thing is that the URL itself refers to some conceptual resource, and the response is ideally a representation of that resource, potentially one of multiple. If you take the same source image and encode it multiple ways, although the two resulting images are different from each other, they are representations of the same underlying image/resource. But if you were to provide a different image, or alter the image in other ways, I think this would be pretty tricky actually, even if the modification was something trivial. You can imagine a simple use case like a WebP image with the text "Your browser supports WebP" and a PNG image with the text "Your browser does not support WebP." The point is that content negotiation being used to present different logical resources that are not necessarily interchangeable representations of the same data feels tricky. I think that is still compatible with the fact that it's all within specifications.
This. lol. I opened the article and was ready for a cool trick and then they showed it returning csv over Curl and html over a browser and I was like "So the accept header? Maybe there is another trick...nope, just the accept header." Our API already does this at my company, where it returns an HTML formatted, human readable response over a browser nav but returns application/json over anything requesting only that.
My impression is that many power users (not web developers) want to download datasets. Sometimes they’ll use a browser to grab a csv or json file, then stuff it into excel/R/python/… to do some sort of analysis work. This kind of knowledge is helpful for that crowd. They often don’t know about the non-url parts of a request.
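For that crowd, a small illustration of what the article describes - asking csvbase for CSV explicitly rather than relying on whatever Accept header your client sends by default. This is a sketch assuming the requests and pandas libraries, and that the server honours text/csv as the article says it does:
####
from io import StringIO

import pandas as pd
import requests

url = "https://csvbase.com/meripaterson/stock-exchanges"

# A client asking for text/html gets the web page; asking for text/csv
# gets the raw table, regardless of the client's default Accept header.
res = requests.get(url, headers={"Accept": "text/csv"})
res.raise_for_status()
df = pd.read_csv(StringIO(res.text))
print(df.head())
####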
1) The discovery of the different response formats. How do I know that I can get csv files from that URL, other than by hoping that the website documents this somewhere?
There's nothing in the underlying HTTP response from https://csvbase.com/meripaterson/stock-exchanges that tells me I can get an HTML or CSV version. Is there a JSON version available? What other variants exist? How do I know that this URL will deliver different responses?
2) Will the website always default to csv files or will my app break when they decide that XML is superior? (Well, obviously not, especially for a site called csvbase!)
But if your program expects CSV data, it is probably best to always request that, and a URL that ends in .csv gives you far more certainty that the data is going to be in that format.
> There's nothing in the underlying HTTP response from https://csvbase.com/meripaterson/stock-exchanges that tells me I can get an HTML or CSV version. Is there a JSON version available? What other variants exist? How do I know that this URL will deliver different responses?
Well, you can send a HEAD request with a given accept: header to find out what you’ll get without actually fetching the data. But it’s true that it would be nice to have the full set of possible responses advertised somehow.
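Something along those lines - a HEAD probe per candidate Accept value - is about the best you can do today. A rough sketch, assuming the server answers HEAD consistently with GET:
####
import requests

url = "https://csvbase.com/meripaterson/stock-exchanges"

# HEAD returns only the headers, so each probe is cheap: the Content-Type
# in the response tells you which representation that Accept value earns.
for accept in ("text/csv", "text/html", "application/json"):
    res = requests.head(url, headers={"Accept": accept})
    print(f"{accept:24} -> {res.status_code} {res.headers.get('Content-Type')}")
####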
I should have mentioned the HTTP 'Vary' response header, that servers can use to inform the client that its response was based upon some of the headers that it sent.
A 'Vary: accept' response header gives a hint that it could have supplied a different response had you given a different Accept: header. But I don't think that there's a way to actually list the variants available in the HTTP spec?
My reading of the HTTP spec suggests that csvbase.com is behaving incorrectly by not setting the Vary header properly (it sends: 'Vary: Accept-Encoding', but it should also list 'Accept' in there too). Potentially, a proxy server could decide to cache the CSV or HTML response, and then serve that version back to another client instead of the 'right' one, because the server didn't correctly report that the response varies based upon 'Accept'. In practice, this isn't likely to happen unless you've got a caching proxy that is also unwrapping the encryption between itself and your HTTP client.
> My reading of the HTTP spec suggests that csvbase.com is behaving incorrectly by not setting the Vary header properly
This is a good idea, I will certainly look at this. There are some planned features WRT caching coming up.
To address the comments you made in GP:
> How do I know that I can get csv files from that URL
That is a good question. The web UI could be better, of course. But programmatically, how do you advertise alternate representations? I'm not sure. Suggestions appreciated.
> Will the website always default to csv files or will my app break when they decide that XML is superior? (Well, obviously not, especially for a site called csvbase!)
As you say: csvbase won't change :)
But the other thing is that the HTTP client you use could decide to change its default Accept header. If curl changed to "application/json, */*;q=0.9" then suddenly you'd get json (I didn't mention in the blog post but that is also implemented)!
Oh dear. Perhaps a good idea to include the file extension or explicit Accept header when you're coding something that needs to last. But I do think it's nice to be able to copy and paste into pandas. That's my main usability case and I wanted that to be as smooth as possible.
> This is a good idea, I will certainly look at this. There are some planned features WRT caching coming up.
While it is probably a bug, it's probably not a serious one that many people would run into nowadays. Now that https is ubiquitous, there aren't many caching proxies around to cause grief. Probably the only proxies people will experience are where they are behind a paranoid company's firewall, one that is configured to decrypt (and then re-encrypt) all their web traffic. And in those situations, they don't tend to do caching much now. (Because even though you can cache HTTP, you'll hit problems with misconfigured sites and users will blame your proxy for it.)
> But programmatically, how do you advertise alternate representations? I'm not sure. Suggestions appreciated.
Sorry, I don't have a good answer for this. I only nit-pick problems in web comments :)
You could set a HTTP header to list the available variants, but there isn't a standard AFAIK so it would only help developers who spotted the header.
> But the other thing is that the HTTP client you use could decide to change its default Accept header. If curl changed to "application/json, */*;q=0.9" then suddenly you'd get json (I didn't mention in the blog post but that is also implemented)!
That's cool! Aeons ago, I was involved in developing a web server, where we added support for properly handling all kinds of content negotiation (Accept-Encoding, Accept-Language, etc), so that you could configure it to deliver the right file based on the user's language, file type preference, etc. It was a large chunk of code, but in the end, nobody really used it. In theory, web browsers and sites could co-operate to deliver the right page in the right language for all their users automatically. In practice though, it never works. No one sets up their web browser to pick the language properly (who even knows how to change it?). As a result, multi-lingual sites offer to switch languages by clicking on a link, and if they choose a default language, they mostly do it based on IP address (and assumed location).
> That's my main usability case and I wanted that to be as smooth as possible.
I think it's the right choice for csvbase, my original comment reads as far too critical in retrospect, it's neat that if you curl a URL, you get the csv. But if I was writing code to scrape some csv data, I would still always prefer to download URLs with a .csv extension, because you know what you are getting 100% of the time, and you avoid any unpleasant surprises if some 3rd-party library or tool changes its behaviour.
> Now that https is ubiquitous, there aren't many caching proxies around to cause grief.
Well, there are still CDNs. csvbase is designed to sit behind a public cache for some pages. I haven't done much on this except for the blog pages, which use the CDN a lot.
I also have vague plans for client libraries that include a caching forward proxy as my experience is that most people export the same tables repeatedly. Likely that will be based on etags though so that the cache is always validated.
The designers of HTTP 1.1 clearly thought a lot about a lot of things, including caches.
Thanks for your thoughts. :) Keep in touch via email if you like (same goes for anyone else reading this): cal@calpaterson.com
I vaguely remember a web server set up for language content negotiation failing to determine which version to send and giving me a list of links to the individual language versions instead.
I think it was Apache and the negotiable resource was called X.html while the individual linked versions had names like X.en.html etc.
Without sending "Vary: Accept" the server might have its response mis-cached by a proxy. A request from a browser could populate the cache with HTML, which could then serve HTML in response to a request that wants CSV. Any time you vary your response based on a request header, the spec says you should list it in your Vary response header.
In practice, with the move to HTTPS this rarely comes up anymore outside the sending company's internal infrastructure. Basically no one is running client-side caches that are shared between multiple consumers.
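For anyone implementing this pattern themselves, the fix is one header on the negotiated response. A minimal Flask-flavoured sketch - purely illustrative, not csvbase's actual code; the route and response bodies are made up:
####
from flask import Flask, Response, request

app = Flask(__name__)

CSV_BODY = "name,mic\nNew York Stock Exchange,XNYS\n"
HTML_BODY = "<html><body><table><tr><td>...</td></tr></table></body></html>"

@app.route("/table")
def table():
    # Pick a representation from the client's Accept header; text/csv is
    # listed first so an indifferent */* client (like curl) tends to get CSV.
    best = request.accept_mimetypes.best_match(["text/csv", "text/html"])
    if best == "text/html":
        resp = Response(HTML_BODY, mimetype="text/html")
    else:
        resp = Response(CSV_BODY, mimetype="text/csv")
    # ...and tell any shared cache that the answer depended on Accept.
    resp.headers["Vary"] = "Accept"
    return resp
####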
This is a supported use of "Accept" headers, but I kind of miss the pre-SEO web where it was okay for URLs to just carry a file extension - having "example.com/dataset/foo.html" and "example.com/dataset/foo.csv" is pretty simple and less ambiguous too.
I don't think it's SEO so much as the evolving complexity of websites. When there's a .html extension, you're usually pointing at an actual static file sitting on the server. On the page where I'm typing this (/reply) there isn't a simple "reply.html" that's just hiding its file extension, but a completely dynamic page being rendered.
You can configure web servers to behave like that if you wanted. The only reason we don’t is because file extensions are ugly and noisy for non-technical people.
It’s more about having “friendly” URLs than it is a limiting of any technology.
In case the author sees this:
Thank you for enabling CORS so that it's possible to plot examples from other sites. It would be awesome if the Content-Range header was allowed as well
Thank you for csvbase, today is the first time I've seen it.
I believe PapaParse, a JS library for parsing CSV files, uses Content-Range to stream large CSV files in chunks.
https://csvplot.com uses PapaParse under the hood, I saw a warning in the dev console and posted here. I'm not sure why it seemingly works fine anyway.
For a random python/pandas trick, I have come across web APIs that cannot be directly read into pandas using the URL (I imagine folks on here can comment better on the differences in web serving tech), but you can read in the IO object and pass that to pandas. Blog post, https://andrewpwheeler.com/2022/11/02/using-io-objects-in-py..., but can just put simple example in comment:
####
import pandas as pd
from io import StringIO
import requests

# endpoint that (per the blog post) can't be read directly with pd.read_csv(url)
url = ('https://data.townofcary.org/explore/dataset/cpd-incidents/download/'
       '?format=csv&timezone=America/New_York&lang=en&use_labels_for_header=true'
       '&csv_separator=%2C')
res = requests.get(url)
# wrap the response text in a file-like object so pandas can parse it
df = pd.read_csv(StringIO(res.text))
####
For what it's worth, if the requests module works fine you could probably set "stream=True" in the request and read the `res.raw` file object directly. That way you avoid loading the data into memory first [1]. You'll probably want to set `res.raw.decode_content = True` to ensure you get the actual content bytes, and not some zipped stream.
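A sketch of that streaming variant, reusing the URL from the snippet above (untested against that particular endpoint, so treat it as illustrative):
####
import pandas as pd
import requests

url = ('https://data.townofcary.org/explore/dataset/cpd-incidents/download/'
       '?format=csv&timezone=America/New_York&lang=en&use_labels_for_header=true'
       '&csv_separator=%2C')

# stream=True defers reading the body; res.raw is a file-like object that
# pandas can consume directly, so the CSV text is never buffered twice.
res = requests.get(url, stream=True)
res.raw.decode_content = True  # transparently undo any gzip/deflate encoding
df = pd.read_csv(res.raw)
####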
I'm kind of shocked by how poorly understood basic HTTP stuff like this is for HN audience based on the comments and article itself. My filter bubble must be tuned to "web."
HTTP is great. Another common "How does it know" is resuming downloads: that's done by the Range header. Curl supports it via `--continue-at -` (the trailing dash means "figure out where it stopped"; you can also give an explicit offset, or use `--range` for an arbitrary byte range).
Except resumed downloads are broken when used in conjunction with compression, because it's not clear whether the byte range refers to the compressed or uncompressed resource.
The HTTP spec solved this problem elegantly: it had the concept of the identity of a resource, and gave two headers to declare compression: Content-Encoding (=the resource is always compressed, like a tar.gz, byte range refers to compressed) and Transfer-Encoding (=the resource is compressed only for transfer, the uncompressed is the real thing).
As of 2023, this has not been implemented and the Content-Encoding header is used for both semantics. So resuming downloads over a proxy has a good chance of corrupting your file; I also had source tarballs being decompressed on the fly and failing their checksums.
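A rough Python equivalent of curl's resume behaviour, sending Accept-Encoding: identity to sidestep the compressed-vs-uncompressed ambiguity described above (the URL and filename are placeholders):
####
import os
import requests

url = "https://example.com/big-download.bin"  # placeholder
path = "big-download.bin"

# Ask only for the bytes we don't already have.
have = os.path.getsize(path) if os.path.exists(path) else 0
headers = {"Range": f"bytes={have}-", "Accept-Encoding": "identity"}

with requests.get(url, headers=headers, stream=True) as res:
    # 206 means the server honoured the Range header; anything else and we
    # start over rather than append garbage to a partial file.
    mode = "ab" if res.status_code == 206 else "wb"
    with open(path, mode) as f:
        for chunk in res.iter_content(chunk_size=64 * 1024):
            f.write(chunk)
####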
If you came here to point out how content negotiation isn't a "trick" but rather a simple basic part of the core protocol: think first on the fact that there was once a time when you didn't know that.
I mean, I also once didn’t know how to program, doesn’t make programming a trick. As another comment pointed out, it’s worthy of an article. But a trick is a weird description for it. It’s like saying it’s a trick that you can do `console.log` and it will output it in the browser console.
If we can all agree it's an interesting topic of discussion, we can set aside debating the semantics of whether it is a "trick" and let the author express themselves using the words of their choosing. Policing the definition of trick is surely not curious conversation.
I disagree that this is a clickbait title. I also disagree that those complaints are worth discussing. The guidelines specifically discourage us from complaining about things "too common to be interesting." And the articles that I'd describe as clickbait are simply too low quality to be posted anyhow. ("Clickbait", in my mind, is when the article uses provocative language to build expectations that it doesn't or can't deliver upon. "Trick" in this sense pretty clearly means, "here's a tool you can use to serve different content types to different user agents," and it delivers on that.)
These are really stylistic complaints; the author expressed themselves in a way that's not to your taste. Your tastes are valid, but there should be no expectation that every or any article will cater to them, and their not doing so isn't a criticism of the article and isn't something we can really have a productive discussion about in this medium. I find people use the term "clickbait" to try and reframe their tastes as something more objective.
Being old enough that I probably studied the http protocol before actually using it, by the time I encountered this functionality in the wild I already knew how it worked. So, no, there was never such a time.
I've long thought that this concept could/should be applied to user-targeted content. Blogs could be capable of delivering content as HTML but possibly Markdown, plaintext, Gemini, PDF, etc. as well. SQLite tables, CSV, JSON for sharing data. Downloading of a directory with archives. What is missing for adoption is proper readers for alternative formats in the big web browsers though; I wonder if that would be accepted upstream.
Others note that git supports http clone, but if you do want to serve SSH and HTTP on the same port, that's something haproxy makes pretty easy to do. It's not something terribly fancy, these days.
I wrote one for personal use. If you (by the book) reject invalid feeds you lose more than half. Attempting to properly negotiate the content would just result in failure for each feed (could be many), but you could try it if all else fails, I suppose (without much result, but if you try 100 such things you will get to brag about rare successes).
I think (only) if a popular service required it, it could be a thing.
There's no reason to have a separate /archives resource and a /feed.xml (with or without content negotiation). You can just specify some external XSLT with an xml-stylesheet processing instruction in your feed XML that will cause the feed to be rendered nicely when it's opened in the browser...
Probably only a few, but there are web servers out there where it would be minimal effort to implement - though sadly those are only a few too.
It always confuses me how a Feed Reader looks like a normal desktop browser when it loads a feed, e.g. Accept: text/html instead of application/rss+xml first.
Most feed readers don't even tell the server which format they handle at all, most just send an Accept: text/html, */*
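For contrast, this is roughly what a well-behaved feed reader could send - preferring feed media types and only falling back to HTML (the feed URL is a placeholder):
####
import requests

headers = {
    "Accept": "application/rss+xml, application/atom+xml;q=0.9, "
              "text/html;q=0.2, */*;q=0.1"
}
res = requests.get("https://example.com/feed", headers=headers)  # placeholder URL
print(res.headers.get("Content-Type"))
####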
This is one of the most popular beginner datasets around. It's often used in tutorials and when experimenting. Does everyone who wants to use that have to write some custom parser? Why? https://csvbase.com/calpaterson/boston.csv is just so much easier
As stupid as it may sound, "csv" has become a generic term for any single-character delimited line-based file. We deal regularly with a "csv" that uses the pipe character instead of commas, for example.
yeah that's pretty strange, does the file format documentation call it a CSV? In my experience those files are always referred to in their documentation as "pipe-delimited file"
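If it helps anyone else stuck with such files: pandas will happily treat any single-character delimiter as "CSV" (data.psv here is a hypothetical local file):
####
import pandas as pd

# "CSV" in name only - a pipe-delimited file parses fine once you say so.
df = pd.read_csv("data.psv", sep="|")
####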
Except when you want to be sure you get what they intended. If it’s just some little throwaway project and you don’t care if the results don’t match up, use csv.
That's actually when I'd rather use something like pickle or feather, maybe even json depending on the usecase, which is easier to parse back.
I've encountered CSV mostly for data transfer among different programs. There sometimes are better options. Sometimes there isn't, maybe even just because CSV is easier and cheaper to implement on both (or multiple) ends.
Yeah as long as you don’t mind Excel changing a few values here and there and pretend translation and internationalization doesn’t exist. But hey, that’s ‘good enough’ so the developers don’t need to bother using a format developed in this century.
That’s completely backwards. Content negotiation was part of HTTP early on (check RFC 1945, HTTP/1.0). Fielding’s REST thesis came after HTTP 1.1, and used HTTP as an example of REST’s concept of representations.
Fielding's REST thesis also goes into great detail and emphasis on using HTTP's content negotiation definitions. This is probably the least used part of Fielding's definition of REST, but he greatly encourages defining custom media types for application objects, and he believes those are a much more fundamental part of REST design than the HTTP verbs, for example.
Generally speaking, implicit magic is cool but can also be frustrating if it's not working as expected.
Very OT: Apple has this attitude of 'it just works'. Really great. Unless it is not working while there are no settings and no info whatsoever on what the requirements are to make it work.
It does not depend on client settings, nor is it magic. The HTTP headers are as much part of the input of a request as the URL itself. The only difference is that most people aren't aware of the HTTP headers and they aren't shown by default in the browser. But not displaying information that isn't relevant to this context is not the same as magic.
This isn't a problem either - if you are a regular internet user, then stuff just works and you don't need to know about HTTP headers at all. If you are a (web) developer, you really should have a general idea about HTTP headers and what kinds of things they are useful for - and then it's not magic anymore.
It doesn’t become “implicit magic” just because the latest batch of “I’m not sure what my browser does” developers hasn’t organically come across it. See also: the utter miscategorisation of CORS as an annoyance instead of an utter blessing.