As someone who is, uh, fluent in R (begrudgingly), allow me to retort: While you...

lottin · on July 4, 2016

Iterating over variables may seem counter-intuitive but it actually is the right thing to do when you have a data-frame.

The reason is that data-frames are intended for dealing with heterogeneous data. The proper way to loop over observations is to convert the variables to a common data type, e.g. logical or numeric, then you have a matrix and then you can loop over rows.

If recall correctly pandas uses a dictionary to implement data-frames, therefore iterating over rows in pandas has the same performance hit as in R.

nzjrs · on July 4, 2016

> The reason is that data-frames are intended for dealing > with heterogeneous data. The proper way to loop over > observations is to convert the variables to a common data > type, e.g. logical or numeric, then you have a matrix and > then you can loop over rows.

Pandas saves its users the 'proper' step of 'converting the variables to a common data type', and lets me iterate over rows to get the observations. That seems like a win to me, no?

ACow_Adonis · on July 4, 2016

Ah, its starting to come back to me...

Does this mean that pandas effectively implements the dataframe as a simple hash on columns vs R which does it as a list? Because if so, yes, that means that they'll probably be relatively comparable in practice.

But I don't think its right to say there's a "right way" to do things with "datasets" though (i'm calling them that as a general concept for these rectangular data structures across languages and platforms, though I appreciate there are differences between their implementations). I do think there's an aesthetic and real effect drawn from the choices of each though, and I can speak loosely about preferences, style, pluses and minuses.

If pandas does have its implementation underlying as a column based philosophy, then yes, I agree its an interesting weird/choice to go with the row-based notions mentioned earlier in spite of this.

That being said, I think there's reasonable grounds to critique your notion that if you want to iterate over observations that you should have to split things out into matrices of different types. Its true, of course, that it might be more efficient to do so given how R chose to implement dataframes, but I would argue that the point of bringing disparate types of data together (in R or elsewhere) into a rectangular data structure that mixes types across the members of an observation is because you likely want to do operations on observations that involve mixed data.

Its seems curious to me, therefore, that this is relatively inefficient and the preference is given to columns in R. And I've met enough people who were also caught out by this to think its not just me.

SAS, for instance, for all its failures and quirks, effectively does this: pulls together basic mixed data types into a rectangular data structure for a relatively efficient, compiled, row-based iterative operations across mixed data types. Its in this one area of analysis and arbitrary row based data munging where SAS, I think, wipes the floor with R and the R data frame.

Now, I speak SAS and R quite fluently, as well as Lisp, from which the R implementation evolved, and when I look at the R data frame, I don't see beautiful design for observation based mixed data-type munging or analysis, I see a linked list of vectors. The R data structure philosophy of course plays to its strengths when you're doing modelling and things on finite columns of fixed variable types in data sets, but its weakness is in row based mixed-type data munging and analysis on messy data of mixed types (which is, also, I think R's and the data frame's dirty little insecurity).

Its an insecurity specifically because a lot of the real world data experience of what many people face and how many people think about data, and the reason they bring data into a rectangular mixed-type data asset...is because that's what they want to do...which could explain why pandas went that particular way: observations are often the general subject of analysis.

(or they might have done it with no particular thought, I don't know.)

lottin · on July 4, 2016

Yes, internally Pandas stores the data as a series of homogeneous arrays, which correspond to one more columns in the data-frame. Details here: http://www.jeffreytratner.com/slides/pandas-under-the-hood-p...

I agree with what you say except that I consider data-frames one of R's strengths. What makes R data-frames great is that the language is designed around these data structures, thus allowing most of their inherent limitations to be overcome by following "good practices". The problem of porting data-frames to other environments as in the case of pandas in my opinion is precisely a lack of language support, which makes the whole thing feel a little stitched together.

sin7 · on July 4, 2016

If you are fluent in R, why are you looping over a data frame?

ACow_Adonis · on July 4, 2016

I'm not saying I'm doing it (although sometimes I will for readability, small problems that can't be naively vectorised, and where I have to make code readable for non-R people).

But not everything is naively vectorisable or best expressed as a vector operation, which is an idea that offends some R programmers.

The truth is a lot of real world analysis is done where the observation is the unit of natural analysis, and not the variable, and lots of people from other languages think in rows vs columns.

Common lisp realised this, and you've got there a language that allows for efficient expression of scalar, compiled loops, vectors and vectorisation/functional application, so I think this shows it's not entirely an either/or dichotomy in practice and is more about design/implementation choices and trade offs.

My point is not that R gets it wrong, it's that you can't say the R way is the "right way".