pandas.read_csv() says hi - https://pandas.pydata.org/docs/reference/api/pandas....

nonethewiser · on Feb 11, 2023

Yeah, pandas seems to flout a lot of conventions about not grouping lots of different control flow into a single function.

But at the same time I wonder how it would look refactored. How many read-from-CSV functions would we be left with?

masklinn · on Feb 11, 2023

> How many read from csv functions would we be left with?

It probably couldn't be split that way, because many of the parameters build on one another. Some are deprecated and others are clearly incompatible, but out of 50 parameters you could plausibly end up calling this with 20 if the environment and the CSV you're ingesting are wonky enough.

I think feasible refactorings would be:

- rationalise currently separate parameters into meatier objects, e.g. there are at least half a dozen parameters which deal with date parsing, a dozen which configure the low-level CSV parsing, etc., that could probably be coalesced into configuration objects

- a builder-type API, but you'd end up at the same result using intermediate steps instead of a function, which is not really useful unless you leverage the first option and each builder step configures a non-trivial amount of the system, so rather than 50 parameters you'd have maybe 10 builders, each with 0~10 knobs

- or you'd build the thing as a bunch of composable transformers on top of a base parser

Of note: the latter at least might be undesirable from the Pandas POV, as it would imply layers of recursive Python calls, which might be much slower than whatever Pandas currently does (I've no idea).
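As a rough illustration of the first option (grouping related parameters into configuration objects), here is a minimal sketch using only the standard library. The names `ParseOptions`, `DateOptions`, and `read_table` are hypothetical and are not part of pandas; the point is only that each object owns one concern and carries its own defaults.

```python
import csv
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ParseOptions:
    """Hypothetical group for low-level CSV parsing knobs."""
    delimiter: str = ","
    quotechar: str = '"'
    skip_rows: int = 0


@dataclass(frozen=True)
class DateOptions:
    """Hypothetical group for date-parsing knobs."""
    columns: tuple = ()
    fmt: str = "%Y-%m-%d"


def read_table(text, parse=ParseOptions(), dates=DateOptions()):
    """Read CSV text into a list of dicts; each option object owns one concern."""
    # Drop leading lines (comments, banners) before the header row.
    lines = text.splitlines()[parse.skip_rows:]
    reader = csv.DictReader(
        lines, delimiter=parse.delimiter, quotechar=parse.quotechar
    )
    rows = []
    for row in reader:
        # Convert only the columns the caller declared as dates.
        for col in dates.columns:
            row[col] = datetime.strptime(row[col], dates.fmt).date()
        rows.append(row)
    return rows


rows = read_table(
    "# export\nname;born\nAda;1815-12-10\n",
    parse=ParseOptions(delimiter=";", skip_rows=1),
    dates=DateOptions(columns=("born",)),
)
```

The call site only mentions the groups it actually overrides, so the top-level signature stays at a handful of arguments even if each group grows more fields.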

disgruntledphd2 · on Feb 11, 2023

I think that this style (such as it is) comes from R, and scientific computing more generally. I grew up with R and never realised how terrible long argument functions are until relatively recently.

hfbff · on Feb 11, 2023

`pyarrow`'s `read_csv` function[0] has just four optional arguments (all defaulting to None): three option objects and one memory pool.

```
pyarrow.csv.read_csv(input_file, read_options=None, parse_options=None, convert_options=None, MemoryPool memory_pool=None)
```

You can then pass a `ReadOptions`[1] object if needed.

For example:

```
import io
from pyarrow import csv

# `s` is a str holding the CSV text
read_options = csv.ReadOptions(
    column_names=["animals", "n_legs", "entry"], skip_rows=1)
csv.read_csv(io.BytesIO(s.encode()), read_options=read_options)
```

You can see how ReadOptions is written at this link [2]. It's interesting that they use a Cython `cdef class` for this.

This doesn't solve all issues (the ReadOptions object and the others will inevitably have a bunch of default arguments) but I do think it's safer and it's easier to have a mental map of the things you need to decide and what's decided for you.

[0] https://arrow.apache.org/docs/python/generated/pyarrow.csv.r...
[1] https://arrow.apache.org/docs/python/generated/pyarrow.csv.R...
[2] https://github.com/apache/arrow/blob/master/python/pyarrow/_...

pindab0ter · on Feb 11, 2023

This is where you could use a builder pattern where you specify everything that diverges from the default using chained method calls.
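A minimal sketch of what that could look like, using only the standard library. `CsvReaderBuilder` and its methods are hypothetical names invented for illustration, not an existing pandas or pyarrow API; each method overrides one default and returns the builder so calls can be chained.

```python
import csv


class CsvReaderBuilder:
    """Hypothetical builder: each method overrides one default and returns self."""

    def __init__(self):
        self._delimiter = ","
        self._skip_rows = 0
        self._columns = None  # None means "take names from the first row"

    def delimiter(self, value):
        self._delimiter = value
        return self

    def skip_rows(self, count):
        self._skip_rows = count
        return self

    def columns(self, names):
        self._columns = list(names)
        return self

    def read(self, text):
        """Terminal step: parse the text with whatever was configured."""
        lines = text.splitlines()[self._skip_rows:]
        reader = csv.DictReader(
            lines, fieldnames=self._columns, delimiter=self._delimiter
        )
        return list(reader)


# Only the settings that diverge from the defaults are spelled out.
rows = (
    CsvReaderBuilder()
    .delimiter(";")
    .skip_rows(1)
    .columns(["animal", "n_legs"])
    .read("animal;legs\nFlamingo;2\nHorse;4\n")
)
```

In Python the chain has to be wrapped in parentheses (or use backslashes) to span several lines.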

masklinn · on Feb 11, 2023

So you end up at the same point, but now you need additional intermediate structures and infrastructure which do nothing to help. And for Python specifically it's also a pain in the ass to format due to the whitespace sensitivity.

rmorey · on Feb 11, 2023

proving the comment's point - this is a library function! exactly the right case for default args

roflyear · on Feb 11, 2023

Yet, I've never had an issue using that function!

japanman425 · on Feb 11, 2023

matplotlib has entered the chat.