Ah, it's starting to come back to me...

Does this mean that pandas effectively implements the dataframe as a simple hash of columns, versus R, which implements it as a list? If so, then yes, the two will probably be fairly comparable in practice.

But I don't think it's right to say there's a "right way" to do things with "datasets" (I'm calling them that as a general concept for these rectangular data structures across languages and platforms, though I appreciate there are differences between their implementations). I do think each choice has an aesthetic and a real effect, though, and I can speak loosely about preferences, style, pluses and minuses.

If pandas is column-based in its underlying implementation, then yes, I agree it's an interesting/weird choice to go with the row-based notions mentioned earlier in spite of this.

That being said, I think there are reasonable grounds to critique your notion that if you want to iterate over observations, you should have to split things out into matrices of different types. It's true, of course, that it might be more efficient to do so given how R chose to implement dataframes, but I would argue that the point of bringing disparate types of data together (in R or elsewhere) into a rectangular data structure that mixes types across the members of an observation is that you likely want to do operations on observations that involve mixed data.

It seems curious to me, therefore, that this is relatively inefficient in R and that preference is given to columns. And I've met enough people who were also caught out by this to think it's not just me.
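
To make the trade-off concrete, here's a toy sketch in Python. It is not how R or pandas actually implement anything; the names `columns`, `column_mean`, `iter_rows` and `describe` are made up for illustration. It only assumes a column-oriented layout: one homogeneous list per column.

```python
# Toy column-oriented "data frame": one homogeneous list per column.
# Hypothetical structure for illustration only, not R's or pandas' internals.
columns = {
    "name": ["ann", "bob", "cy"],
    "age": [34, 51, 27],
    "score": [0.5, 0.25, 1.0],
}


def column_mean(cols, key):
    """Column-wise work touches a single homogeneous list."""
    values = cols[key]
    return sum(values) / len(values)


def iter_rows(cols):
    """Row-wise work pulls one element out of every column per observation."""
    keys = list(cols)
    for values in zip(*(cols[k] for k in keys)):
        yield dict(zip(keys, values))


def describe(row):
    """A per-observation operation that mixes string, int and float fields."""
    return f"{row['name']}:{row['age'] * row['score']}"


labels = [describe(row) for row in iter_rows(columns)]
```

The column mean reads one list straight through, while every observation has to be reassembled from all of the columns before `describe` can see it. That reassembly step is the overhead I'm complaining about, at least in this simplified picture.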

SAS, for instance, for all its failures and quirks, effectively does this: it pulls together basic mixed data types into a rectangular data structure for relatively efficient, compiled, row-based iterative operations across mixed data types. It's in this one area of analysis and arbitrary row-based data munging where SAS, I think, wipes the floor with R and the R data frame.

Now, I speak SAS and R quite fluently, as well as Lisp, from which the R implementation evolved, and when I look at the R data frame, I don't see beautiful design for observation-based, mixed-type data munging or analysis; I see a linked list of vectors. The R data structure philosophy of course plays to its strengths when you're doing modelling and the like on a finite set of columns of fixed variable types, but its weakness is in row-based munging and analysis of messy, mixed-type data (which is also, I think, R's and the data frame's dirty little insecurity).

It's an insecurity specifically because this is the real-world data experience many people face and how many people think about data: the reason they bring data into a rectangular mixed-type data asset is that row-wise, mixed-type work is what they want to do. That could explain why pandas went that particular way: observations are often the general subject of analysis.

(Or they might have done it with no particular thought; I don't know.)




Yes, internally pandas stores the data as a series of homogeneous arrays, each of which corresponds to one or more columns in the data-frame. Details here: http://www.jeffreytratner.com/slides/pandas-under-the-hood-p...
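
As a rough picture of what "homogeneous arrays covering one or more columns" means, here's a pure-Python toy. To be clear, this is not pandas code: pandas uses NumPy arrays and its own internals, and `consolidate`, `get_column`, `blocks` and `placement` are hypothetical names I made up for the sketch.

```python
from collections import defaultdict


def consolidate(columns):
    """Group same-typed columns into blocks and remember where each one went."""
    blocks = defaultdict(list)  # type name -> list of column value lists
    placement = {}              # column name -> (type name, index within block)
    for name, values in columns.items():
        # Assumption: each column is non-empty and homogeneous,
        # so its first value determines the block it belongs to.
        type_name = type(values[0]).__name__
        placement[name] = (type_name, len(blocks[type_name]))
        blocks[type_name].append(list(values))
    return dict(blocks), placement


def get_column(blocks, placement, name):
    """Look a column up through its block instead of by name directly."""
    type_name, position = placement[name]
    return blocks[type_name][position]


frame = {
    "age": [34, 51],
    "height": [1.7, 1.8],
    "weight": [60.5, 82.0],
    "name": ["ann", "bob"],
}
blocks, placement = consolidate(frame)
```

Here the two float columns end up in the same block, so column-wise numeric work stays within one homogeneous structure, while a single row is spread across all three blocks.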

I agree with what you say, except that I consider data-frames one of R's strengths. What makes R data-frames great is that the language is designed around these data structures, which allows most of their inherent limitations to be overcome by following "good practices". The problem with porting data-frames to other environments, as in the case of pandas, is in my opinion precisely a lack of language support, which makes the whole thing feel a little stitched together.



