Python Pandas: Tricks and Features

closed · on Aug 29, 2018

Woah, I had no idea the testing module existed. Two things I've found useful in pandas are the DataFrame .query and .eval methods [1]. They're nice for cutting out tons of lambda functions in pipes.

E.g.

  df.somemethod() \
    .loc[lambda df: df.x < 2]

becomes

  df.somemethod() \
    .query("x < 2")

One issue I've noticed is a frustrating bug [2] that causes many queries to raise an error before evaluating, but you can work around it by changing the engine argument:

  df.query("x.str.contains('a')", engine="python")

1: https://pandas.pydata.org/pandas-docs/stable/generated/panda...

2: https://github.com/pandas-dev/pandas/issues/22435
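For anyone who wants to try this, here is a minimal sketch of the same idea on a toy frame. The column names and values are made up for illustration, and the string query assumes the python engine is requested as in the workaround above.

```python
import pandas as pd

# Toy data; the columns "x" and "name" are illustrative only.
df = pd.DataFrame(
    {"x": [0, 1, 2, 3], "name": ["apple", "berry", "banana", "kiwi"]}
)

# The lambda-based filter and the query string select the same rows.
via_loc = df.loc[lambda d: d.x < 2]
via_query = df.query("x < 2")
pd.testing.assert_frame_equal(via_loc, via_query)

# String methods inside a query; engine="python" sidesteps the
# numexpr-related error described above.
contains_a = df.query("name.str.contains('a')", engine="python")

# .eval can add a derived column without a lambda; it returns a new frame.
doubled = df.eval("y = x * 2")
```

The assert_frame_equal call is the testing module mentioned at the top, so the sketch doubles as a quick check that both spellings agree.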

mlthoughts2018 · on Aug 29, 2018

I wish more teams considered it important to expose the tests as a module or subpackage that is included in distribution, such as what numpy does with numpy.test('full') [0].

When you are knee-deep in some long-running Docker container with some data analysis going on in an interactive console and get hit by a weird bug, it can be so, so helpful to easily run unit tests post-installation to verify everything is set up correctly.

It can also be a good step in CI if you build minimal Docker containers that should house an installation of the package at the given commit, and have e.g. Jenkins build the container with the package installed from that commit and then launch the container with a simple command like

python -c "import mymodule; mymodule.test()"

[0] https://stackoverflow.com/questions/9200727/is-there-a-test-...
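A rough sketch of what such an entry point could look like, using only the standard library. Everything here is hypothetical: the test case, the function name run_tests, and the idea that a package would re-export it as mymodule.test are assumptions for illustration, not how numpy implements numpy.test.

```python
import unittest


class _SmokeChecks(unittest.TestCase):
    """Stand-in for the package's real test suite (hypothetical)."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


def run_tests(verbosity=1):
    """Run the bundled checks and return True when they all pass."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(_SmokeChecks)
    result = unittest.TextTestRunner(verbosity=verbosity).run(suite)
    return result.wasSuccessful()
```

Returning a boolean makes it easy for a CI step to turn the outcome into an exit code, for example by passing `not run_tests()` to sys.exit.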

closed · on Aug 29, 2018

That's a good point--I used to keep tests outside the package, but it seems like some projects make good use of having people who open issues run the unit tests beforehand.

abakker · on Aug 29, 2018

I recently spent a bunch of time trying to restructure an SPSS dataset that had a sub-optimal structure. After failures with Excel macros and SPSS syntax, I ended up with about 100 lines of Python using a pandas MultiIndex on the columns and stack(). stack/unstack is so fantastic for preparing data for Tableau that I recommend everyone learn to use it.
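To give a flavour of the reshaping, here is a small sketch with invented survey data. The level names "question" and "wave" and the respondent labels are assumptions, not the actual SPSS layout.

```python
import pandas as pd

# Wide layout: one row per respondent, two-level column header.
columns = pd.MultiIndex.from_product(
    [["q1", "q2"], ["pre", "post"]], names=["question", "wave"]
)
wide = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    index=pd.Index(["r1", "r2"], name="respondent"),
    columns=columns,
)

# Move the "wave" level from the columns into the row index, giving
# one row per (respondent, wave) and one column per question.
tidy = wide.stack("wave")

# Flatten the index into ordinary columns for export.
flat = tidy.reset_index()

# unstack reverses the move, although the column order may differ.
roundtrip = tidy.unstack("wave")
```

The flat frame is the long shape that tools like Tableau tend to prefer, and it can be written out with to_csv.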

gcmac · on Aug 29, 2018

Thanks for posting - totally worth the read to learn there's a pd.read_clipboard() function.

world2vec · on Aug 29, 2018

Came here to say that. How come no other tutorial or MOOC on pandas mentions that? It's so useful.

joelschw · on Aug 30, 2018

Whilst it has its uses, I think we should encourage people to do things in a reproducible way.
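One possible middle ground, sketched under the assumption that the copied data is tab-separated (as it typically is when copied from a spreadsheet): paste the text into the script once and parse it with read_csv, so the input lives alongside the code. The sample rows are invented.

```python
import io

import pandas as pd

# Text pasted into the script once, instead of read from the clipboard
# on every run. The "\t" escapes become real tab characters.
pasted = "name\tscore\nalice\t3\nbob\t5\n"

df = pd.read_csv(io.StringIO(pasted), sep="\t")
```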

danmg · on Aug 29, 2018

Sadly, it doesn't look like there's a way of setting the xwin clipboard selection