An Introduction to Scientific Python – Pandas (datadependence.com)
136 points by Jmoir on July 3, 2016 | 42 comments



Pandas is certainly excellent -- be aware of its NA type promotion behavior before you start designing data analysis programs, however. I learned this the hard way:

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#nan...


Another gotcha is variable type inference. Reading csv files can often produce varying column types. This can be a pain for any consistent data pipeline.
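For what it's worth, one way to sidestep the inference is to pin the column types up front; a minimal sketch (the file and column names here are guesses based on the article's example):

    import pandas as pd

    # Explicit dtypes keep the columns stable across files instead of
    # letting read_csv guess them from whatever data it happens to see.
    df = pd.read_csv('rainfall.csv',
                     dtype={'water_year': str, 'rain_octsep': float})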


That's a good point, I've run into problems with that before too. Thanks.


As an R user I noticed a couple of oddities. First,

  len(df)
returns the number of rows rather than the number of columns. This strikes me as a bad idea, because data-frames are better thought of as a collection of columns. Typically you want to loop over the columns of a data-frame, not over its rows, which is much more costly performance-wise.

Second, the apply method seems totally redundant. Why call a method that calls a function when you can simply call the function directly?

  df['year'] = base_year(df.water_year)
Probably I'm missing something here.


> This strikes me as a bad idea, because data-frames are better thought of as a collection of columns

The dataframe is a collection of records, so the len operator tells you how big the dataset you're dealing with is. You also have len(df.columns) and df.shape

> Second, the apply method seems totally redundant

df.water_year refers to a column. You can certainly use the syntax you wrote, provided you crafted a function that manipulates a column in some way. E.g. if you had a function that returns the first 2 elements of what was given, passing a column to that function would return a view into that column with only the first 2 rows. Passing the same function into apply would process every element in the (string) column and return the first 2 letters, finally returning a brand new column where each row is the first 2 letters of the corresponding row of the input.
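Roughly, and assuming a string column like the article's water_year, the difference looks like this:

    import pandas as pd

    df = pd.DataFrame({'water_year': ['2000/01', '2001/02', '2002/03']})

    def first_two(x):
        return x[:2]

    # Called directly on the column: slicing a Series gives the first 2 rows.
    first_two(df['water_year'])          # '2000/01', '2001/02'

    # Passed to apply: the function runs on each element (a string),
    # so you get a new column of the first 2 characters of every row.
    df['water_year'].apply(first_two)    # '20', '20', '20'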

Both of these behaviours make perfect sense if you think about them in terms of what you'd expect from Python and NumPy, which Pandas is built on.


> Both of these behaviours make perfect sense if you think about them in terms of what you'd expect from Python and NumPy, which Pandas is built on.

My PyData London presentation "Pandas from the Inside" [1, 2] explains in detail how pandas gets its speed from numpy, with benchmarks comparing slow vs fast ways to do common operations. Column-wise operations can be three orders of magnitude faster than iterating by row.

[1] https://www.youtube.com/watch?v=Dr3Hv7aUkmU

[2] https://github.com/SteveSimmons/PyData-PandasFromTheInside
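The flavour of those comparisons, as a rough sketch (not the actual benchmark from the talk; sizes and column names made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': np.random.rand(1_000_000),
                       'b': np.random.rand(1_000_000)})

    # Slow: a Python-level loop over rows via iterrows().
    total = 0.0
    for _, row in df.iterrows():
        total += row['a'] * row['b']

    # Fast: the same reduction done column-wise, so numpy loops in C.
    total = (df['a'] * df['b']).sum()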


Thanks for clearing that up, now it does make sense. In R most functions handle vectors as well as scalars without distinction, so normally one would call the function directly, whereas if you wanted to process each element of a vector individually you'd use apply(). So it works the other way around from Pandas.


Well that's because R doesn't have scalars, just vectors containing a single value.


As someone who is, uh, fluent in R (begrudgingly), allow me to retort:

While you're right that in R a data frame is essentially a list of columns, this strikes me as a flaw in R. Others coming to R expect to be able to loop over the observations in a data frame, or get the number of observations by taking the length of the data structure. Indeed for most of my real world work that's what I actually want to do: iterate over customers or units that have multiple observations, stored as rows in the df with variables describing characteristics regarding that observation. I assure you, for everyone else coming to R, that is a genuine "WTF" moment when they loop across a data frame and find themselves iterating across variables rather than observations, or that they accidentally took the length of the data frame to be the number of observations rather than the number of variables: and I've got a glorious real world story of a bug caused by that on a 1 x 0 dimension data frame being returned by consultants' code...

I have no idea if that's how it's actually implemented in pandas though...

As for the apply thing: I'm guessing that has to do with Python syntax and the nature of functions/methods/data frames, but I agree with you it's a bit kludgy to me too. But I guess that's because what you're actually doing is applying a scalar function across a sequence of values, not actually calling a function that takes a sequence as an argument. In your example there, which is very R-y because the function application would be automatically vectorised, in Python there's no such (necessary) thing. The reason this "kind of" works "naturally" in R is actually because R is weird and takes an efficiency hit by not having unboxed scalar values at all: even single numbers are actually vectors, as is the result of the returned operations/functions on them, so you actually have no scalar operations at all (but for many applications you don't actually notice: [1] + [1] = [2] is effectively the same as 1 + 1 = 2 in an unvectorised language, barring the R resource hit, which is insignificant in smaller examples/problems).


Iterating over variables may seem counter-intuitive but it actually is the right thing to do when you have a data-frame.

The reason is that data-frames are intended for dealing with heterogeneous data. The proper way to loop over observations is to convert the variables to a common data type, e.g. logical or numeric, then you have a matrix and then you can loop over rows.

If I recall correctly pandas uses a dictionary to implement data-frames, therefore iterating over rows in pandas has the same performance hit as in R.


> The reason is that data-frames are intended for dealing with heterogeneous data. The proper way to loop over observations is to convert the variables to a common data type, e.g. logical or numeric, then you have a matrix and then you can loop over rows.

Pandas saves its users the 'proper' step of 'converting the variables to a common data type', and lets me iterate over rows to get the observations. That seems like a win to me, no?
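Something along these lines, as a sketch (column names invented):

    import pandas as pd

    # Mixed-type columns living in one frame; no coercion to a common type needed.
    df = pd.DataFrame({'label': ['a', 'b'],
                       'value': [1.5, 2.5],
                       'flag':  [True, False]})

    # Each row comes back as a namedtuple with per-column types preserved.
    for row in df.itertuples(index=False):
        print(row.label, row.value, row.flag)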


Ah, it's starting to come back to me...

Does this mean that pandas effectively implements the dataframe as a simple hash on columns vs R which does it as a list? Because if so, yes, that means that they'll probably be relatively comparable in practice.

But I don't think it's right to say there's a "right way" to do things with "datasets" though (I'm calling them that as a general concept for these rectangular data structures across languages and platforms, though I appreciate there are differences between their implementations). I do think there's an aesthetic and real effect drawn from the choices of each though, and I can speak loosely about preferences, style, pluses and minuses.

If pandas' underlying implementation does follow a column-based philosophy, then yes, I agree it's an interesting/weird choice to go with the row-based notions mentioned earlier in spite of this.

That being said, I think there's reasonable grounds to critique your notion that if you want to iterate over observations you should have to split things out into matrices of different types. It's true, of course, that it might be more efficient to do so given how R chose to implement dataframes, but I would argue that the point of bringing disparate types of data together (in R or elsewhere) into a rectangular data structure that mixes types across the members of an observation is because you likely want to do operations on observations that involve mixed data.

It seems curious to me, therefore, that this is relatively inefficient and the preference is given to columns in R. And I've met enough people who were also caught out by this to think it's not just me.

SAS, for instance, for all its failures and quirks, effectively does this: pulls together basic mixed data types into a rectangular data structure for relatively efficient, compiled, row-based iterative operations across mixed data types. It's in this one area of analysis and arbitrary row-based data munging where SAS, I think, wipes the floor with R and the R data frame.

Now, I speak SAS and R quite fluently, as well as Lisp, from which the R implementation evolved, and when I look at the R data frame, I don't see beautiful design for observation based mixed data-type munging or analysis, I see a linked list of vectors. The R data structure philosophy of course plays to its strengths when you're doing modelling and things on finite columns of fixed variable types in data sets, but its weakness is in row based mixed-type data munging and analysis on messy data of mixed types (which is, also, I think R's and the data frame's dirty little insecurity).

It's an insecurity specifically because a lot of the real world data experience of what many people face and how many people think about data, and the reason they bring data into a rectangular mixed-type data asset...is because that's what they want to do...which could explain why pandas went that particular way: observations are often the general subject of analysis.

(or they might have done it with no particular thought, I don't know.)


Yes, internally Pandas stores the data as a series of homogeneous arrays, which correspond to one or more columns in the data-frame. Details here: http://www.jeffreytratner.com/slides/pandas-under-the-hood-p...

I agree with what you say except that I consider data-frames one of R's strengths. What makes R data-frames great is that the language is designed around these data structures, thus allowing most of their inherent limitations to be overcome by following "good practices". The problem with porting data-frames to other environments, as in the case of pandas, is in my opinion precisely the lack of language support, which makes the whole thing feel a little stitched together.


If you are fluent in R, why are you looping over a data frame?


I'm not saying I'm doing it (although sometimes I will for readability, small problems that can't be naively vectorised, and where I have to make code readable for non-R people).

But not everything is naively vectorisable or best expressed as a vector operation, which is an idea that offends some R programmers.

The truth is a lot of real world analysis is done where the observation is the unit of natural analysis, and not the variable, and lots of people from other languages think in rows vs columns.

Common Lisp realised this, and there you've got a language that allows for efficient expression of scalar, compiled loops, vectors and vectorisation/functional application, so I think this shows it's not entirely an either/or dichotomy in practice and is more about design/implementation choices and trade-offs.

My point is not that R gets it wrong, it's that you can't say the R way is the "right way".


Iterating over the rows is much more intuitive to me, just like rows in a database. In their example dataframe each row is a year, and columns represent different information about that year. So, if I wanted to compare rain from oct-sep on a yearly basis, I would iterate over the years (rows) and then grab that column by name.


It's inconsistent though, as iterating over a dataframe like

    for c in df:
will return the column labels. I expect `len(obj)` to return the same as `len([i for i in obj])`.
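To make the mismatch concrete (a toy frame, just as a sketch):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

    len(df)               # 3 -- the number of rows
    [c for c in df]       # ['a', 'b'] -- iteration yields the column labels
    len([c for c in df])  # 2 -- not the same as len(df)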


Between dplyr, ifelse, and apply family functions, I don't think I've ever had to iterate over a data frame in R.


As a heavy pandas user, I don't agree with the len() comment. Can you give an example?


In R a data-frame is a list of vectors (in Python parlance, a dictionary of arrays). Therefore the length of a data-frame is the number of columns and an iteration over a data-frame iterates over its columns. Iterating over the rows can be done but it's generally better avoided because it's highly inefficient. The reason is that since the columns have different types each row has to be represented as a list. This is also true in Python, as far as I know.
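In Python terms, the R mental model is roughly this (a sketch of the idea, not how pandas is actually implemented):

    # A data-frame as a dict of equal-length columns.
    r_style_df = {'name': ['a', 'b', 'c'], 'value': [1, 2, 3]}

    len(r_style_df)    # 2 -- the number of columns, like length(df) in R
    list(r_style_df)   # ['name', 'value'] -- iteration gives the columns

    # Rows have to be materialised as mixed-type records, hence the cost.
    rows = [dict(zip(r_style_df, rec)) for rec in zip(*r_style_df.values())]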


Sure if your column data is completely independent and you don't need more than one column at a time in a given algorithm, it is natural to iterate over columns instead of rows. However if you need multiple columns (or data properties) at each iteration, which is more likely the case in my experience, then you end up iterating over the rows.


That's what Pandas encourages you to do! In my experience iterating is rarely needed at all if you have functions that operate on arrays.
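For example, even an "iterate over observations using several columns" step usually collapses into whole-column expressions (toy columns assumed):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [1, 0, 2]})

    # Uses multiple columns per observation with no explicit row loop.
    df['revenue'] = df['price'] * df['qty']
    df['big_sale'] = np.where(df['revenue'] > 25, 'yes', 'no')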


> returns the number of rows rather than the number of columns. This strikes me as a bad idea

I don't know. In my eyes, "rows" is a name that refers to the first dimension of a possibly high-dimensional array. "Columns" would refer to the next dimension (and then I don't have any more names).

0. rows

1. columns

2. …

3. …
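Which is also how numpy treats it, as a quick sketch:

    import numpy as np

    arr = np.zeros((4, 3, 2))
    arr.shape   # (4, 3, 2): axis 0 is "rows", axis 1 "columns", the rest unnamed
    len(arr)    # 4 -- the size of the first dimension, consistent with len(df)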


I usually do this kind of processing with Linux pipes, head, tail, cut, sort, uniq, and inline Perl. It is kind of similar to using monads, but you have to handle the formatting to and from text. A couple of my own creation are a tool for counting and a tool for generating histograms in text. I often chain 5 or 10 of these commands together. My basic data type is similar to CSV, but using "|" instead of a comma as the separator because it tends not to appear in text as much. On the other hand, not being in a binary format, my data is very accessible.


It's really too bad that the ASCII codes 29, 30, and 31 (Group, Record, and Unit separators) never took off, as this is exactly what they were designed for.

When implemented, they'd let you include commas, line feeds/carriage returns, etc within your data records.
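A sketch of what that would look like in Python (purely illustrative):

    # Record (30) and Unit (31) separators instead of newlines and commas.
    RS, US = chr(30), chr(31)

    rows = [["a, with comma", "a line\nbreak is fine"], ["x", "y"]]

    blob = RS.join(US.join(fields) for fields in rows)
    parsed = [record.split(US) for record in blob.split(RS)]
    assert parsed == rows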


> they'd let you include commas, line feeds/carriage returns, etc within your data records

And there would also be less ambiguity as to what separator to use. I understand the popularity of CSV, but it's really not so nice to share data with. German customers want semicolons as a separator, the US ones claim they are right 'because after all it is called comma-seperated and else I cannot import it in Excel' (sic). Etc.


> but using "|" instead of comma as separator because it tends not to appear in text as much.

I do this as well. Using a comma to separate values seems silly to me; commas appear so frequently in text.
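Pandas handles this fine either way, for what it's worth; the separator is just a parameter (file names made up):

    import pandas as pd

    df_pipe = pd.read_csv('data.psv', sep='|')   # pipe-separated
    df_de   = pd.read_csv('data.csv', sep=';')   # the semicolon variant mentioned above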


My blog post about the most popular pandas methods: https://kozikow.wordpress.com/2016/07/01/top-pandas-function....

Pandas is a big library and it's hard to distinguish between necessary and nice-to-have methods. I have written thousands of lines in pandas and I have been working around some things rather than using the proper API call.


I don't trust your data. scipy.org is not a function


I will explain the methodology better. My goal was to avoid false negatives.

See the methodology description: https://kozikow.wordpress.com/2016/07/01/top-pandas-function... .


pandas is very good for scientific computing and data analysis, but beware, the documentation quite frankly sucks. Stack Overflow seems to be the best way to learn things.


The scientific Python environment has very erratic documentation. Matplotlib for example has pages and pages of completely disorganized and often hard to decipher documentation. Examples are very sparse.


Been using Pandas for a few weeks and I...kind of agree. The 10 minute tutorial etc is fine but as soon as you start doing more complicated stuff, you need the API docs. And they leave much to be desired.


I also use Pandas for some of my data analysis and I found that it took me a long time to learn how to use it. Unlike numpy, I just couldn't remember how to do things and had to keep looking things up. Maybe this is just because Pandas has a lot of functionality. But I might waste half an hour trying to write one line of code, even though that line would do most of my analysis.


Pandas are a reinvention (be it a conscious one or not) of PAW "ntuples" which have been around for at least a quarter of a century. ROOT has further evolved them into "trees" which allow structure beyond simple tables. Both provide a selection language akin to Pandas' filtering.

I have nothing against Pandas, but the ebullience that always comes with blog posts about them seems ignorant of existing systems that have been used every day in scientific data analysis for the past few decades.


Whether they are a reinvention or not, can't I be ebullient about them? I love new technology for example, and I get pretty ebullient about whatever new things there are. It doesn't mean that I'm ignorant of the past that has led up to them and it certainly shouldn't affect my thoughts on them either.


Curious what this will give me over a relational database.


This is typical of what passes for 'exploratory data analysis' these days:

> You can also see that UK’s rainfall is significantly less than Japan’s, and people say UK rains a lot!

Well duh, Japan is about 50% larger than the UK.


The rainfall is given in 'mm', right - this means area has been controlled for?

I.e. it's the amount of rain, in millimetres, that would fall in a given area. This is how rainfall is normally measured.

Unless you are trying to say something more subtle, or making a joke?


Yeah it's 1.5 times larger. It was a joke though because Japanese people always say it never stops raining in the UK, it's like the first thing they always say.


That's what everyone seems to say about the UK.


That's true haha. It rains a fair amount but it's definitely over exaggerated.



