
Again, I completely see where you are coming from. However, the difference between a Masters and a PhD is huge, far bigger than the gap between any other two levels of education.

In a PhD, essentially everything you learn is to master a particular topic, or solve some kind of problem. This can often involve programming (it did for me) and almost certainly involves statistics (again, it did for me). The most important characteristic of a PhD is that you learn all this yourself (I certainly did). For instance, I was the only person in my department to learn R (although there were some old-time Fortran and C programmers in my department), and then I ended up learning some Python and Java along with bash to deal with data manipulation problems and to administer psychological measures over the internet. These are the kinds of skills that have given me some of what is needed to be a data scientist, and with some experience in the private sector, I'll get there.

Bear in mind that I (almost) have a Psychology PhD, and this would all have been far easier for me if I had worked in physics, chemistry or any of the harder sciences. So from my perspective, I can see that this is where the data scientists of the future are going to come from.

Note that I looked up the job market, and made a conscious decision to train myself in these kinds of skills throughout my PhD, but if you are not capable of performing this kind of analysis then you probably shouldn't be doing a PhD anyway.

I really don't see how programming upends all that grad students learn (though I would be delighted to hear your thoughts), as to me it just seemed like the application of logic with the aid of computers. I'm not that good a programmer though, certainly not outside the application of stats, but within the next few years I will be.



> In a PhD, essentially everything you learn is to master a particular topic, or solve some kind of problem. This can often involve programming (it did for me) and almost certainly involves statistics (again, it did for me).

Yes, but the original point was that more or less any quantitative PhD would be expected to have these skills.

> For instance, I was the only person in my department to learn R

Case in point - and I can tell you, from the PhD students I know, that if I found someone who knew how to program in any language not used primarily for statistical computation (R, Stata, Matlab, SAS etc.), I would consider them the exception, not the norm.

The exact opposite is true about a data scientist.

> but if you are not capable of performing this kind of analysis then you probably shouldn't be doing a PhD anyway.

Or you just don't care about those types of jobs - and apparently there are plenty of those, because many, if not the majority, of PhD students I can think of aren't looking for data science jobs.

> I really don't see how programming upends all that grad students learn (though I would be delighted to hear your thoughts), as to me it just seemed like the application of logic with the aid of computers.

It's not programming per se, but the computational power that it brings makes certain techniques feasible, and other concepts and methods aren't obsolete - just no longer optimal. This is really a comment about statistics specifically. Most job postings for data science positions mention some form of the phrase 'machine learning' - and if they don't, they often have that in mind. Unfortunately, while demand for machine learning dominates the job market, in the grand scheme of things, it's just one branch in the field of statistics, and its 'parent' branch was relatively obscure until very recently. To this day, if a PhD student finished their program having next to none of the required academic background for machine learning, I doubt most academics would bat an eyelid. It's just not considered important from an academic standpoint. It's unfortunate that we have such a disconnect between academic interest and industry demand, but it's very much the case.

A basic example that I often cite about how computational power has fundamentally changed statistics from how it was for the previous few decades is in our selection of estimators. (I often cite this because anybody who's ever taken a statistics class probably had this experience.) In every introductory statistics class (and in many non-intro classes as well), when studying inference, you spend 90% of your time talking about estimators whose error has a first moment of zero (that is, unbiased estimators), and the other 10% is a 'last resort' for when no 'better' estimator exists. Who decided that the first moment was the most important? What about the second? Third?

Well, it turns out that the first moment is easier to calculate, and, by coincidence, it happens to be the most relevant when your dataset is small (say, between 30 and 100). But once you're talking about datasets with observations which number in the thousands (which is still 'small' by some standards today!), you'd be insane to throw out any estimator that converges at a linear rate (rather than at the rate of \sqrt{n}) just because it introduces a small bias.

But we do - and that's reflected in the sheer amount of academic research and literature that discusses the former, and the sheer lack of it on the latter. In many cases, the theory exists, but it was developed in an era in which it could never feasibly be applied.
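
A minimal sketch of that trade-off, under an assumed toy setup that is not from the thread: estimating the upper bound theta of a Uniform(0, theta) distribution. Twice the sample mean is unbiased but its error shrinks like 1/sqrt(n); the sample maximum is biased (its expectation is n*theta/(n+1)) but its error shrinks like 1/n. The function name, trial count and seed are hypothetical choices for illustration.

```python
import random


def simulate_mse(n, theta=1.0, trials=500, seed=0):
    """Monte Carlo mean squared error of two estimators of theta,
    using samples of size n drawn from Uniform(0, theta)."""
    rng = random.Random(seed)
    se_unbiased = 0.0  # accumulated squared error of 2 * sample mean
    se_biased = 0.0    # accumulated squared error of the sample maximum
    for _ in range(trials):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        unbiased = 2.0 * sum(sample) / n   # unbiased, sqrt(n) convergence
        biased = max(sample)               # slightly biased, rate-n convergence
        se_unbiased += (unbiased - theta) ** 2
        se_biased += (biased - theta) ** 2
    return se_unbiased / trials, se_biased / trials


if __name__ == "__main__":
    for n in (30, 100, 1000):
        mse_unbiased, mse_biased = simulate_mse(n)
        print(f"n={n:5d}  MSE(2*mean)={mse_unbiased:.2e}  MSE(max)={mse_biased:.2e}")
```

In this particular setup the theoretical mean squared errors are theta^2/(3n) for the unbiased estimator and 2*theta^2/((n+1)(n+2)) for the biased one, so the advantage of the biased estimator grows with the sample size.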

Vestiges of this era are visible even in many statistical software packages - for another basic example, regressions by default assume homoskedasticity in errors, even though this is almost never valid in real life. Why? Because in a previous era, everyone imposed this assumption, because while the theory behind the alternative had been developed, it was expensive to carry out in practice (it involves several extra matrix multiplications).
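
To show what those extra matrix multiplications look like, here is a rough sketch in plain Python of a one-regressor regression with an intercept. It computes the classical standard errors, s^2 (X'X)^-1, next to the White (HC0) heteroskedasticity-robust ones, (X'X)^-1 X' diag(e^2) X (X'X)^-1. The helper names and the simulated data are assumptions made for illustration; they do not reproduce any particular package's defaults.

```python
import math
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def inv2(m):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def ols_with_standard_errors(x, y):
    """Return (coefficients, classical SEs, HC0 robust SEs) for y = b0 + b1 * x."""
    n = len(x)
    X = [[1.0, xi] for xi in x]              # design matrix with intercept
    Xt = transpose(X)
    bread = inv2(matmul(Xt, X))              # (X'X)^-1
    beta = matmul(bread, matmul(Xt, [[yi] for yi in y]))
    resid = [yi - (beta[0][0] + beta[1][0] * xi) for xi, yi in zip(x, y)]

    # Classical covariance: assumes one common error variance s^2.
    s2 = sum(e * e for e in resid) / (n - 2)
    classical = [math.sqrt(s2 * bread[i][i]) for i in range(2)]

    # HC0 covariance: each observation keeps its own squared residual.
    weighted_X = [[e * e * v for v in row] for row, e in zip(X, resid)]
    meat = matmul(Xt, weighted_X)            # X' diag(e^2) X
    sandwich = matmul(matmul(bread, meat), bread)
    robust = [math.sqrt(sandwich[i][i]) for i in range(2)]
    return [beta[0][0], beta[1][0]], classical, robust


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(1.0, 10.0) for _ in range(500)]
    # Simulated data whose noise standard deviation grows with x.
    y = [2.0 + 3.0 * xi + rng.gauss(0.0, xi) for xi in x]
    beta, classical, robust = ols_with_standard_errors(x, y)
    print("coefficients:      ", beta)
    print("classical std errs:", classical)
    print("HC0 robust std errs:", robust)
```

The coefficients are identical either way; only the standard errors change, and the additional cost is the "meat" product plus the two multiplications by (X'X)^-1.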

I'm painting with a broad brush, but the general picture still very much holds.


I completely see your point on estimators and the nature of many if not most statistics courses. I suppose that I was lucky enough to study non-parametric statistics in my first year of undergrad, as there's a subset within psychology that's very suspicious of all the assumptions required for the traditional estimators.

That being said, I think you're missing my major point, which is that a PhD should be a journey of independent intellectual activity, so the courses one takes should be of little relevance and can therefore be downweighted in considering what PhD students actually learn. I accept that this is an idealistic viewpoint (FWIW, the best thing that ever happened to my PhD in this context was my supervisor taking maternity leave twice during my studies, which forced me to go to the wider world for more information about statistics).

I accept your point about machine learning not being a major focus of academia (well, except for machine learning researchers), and I think it's awful. It's very sad that the better methods for dealing with large-scale data and complex relationships are only used by private companies. It's not that surprising, but it is sad.

That being said, I firmly believe that any halfway skilled quantitative PhD can understand machine learning, most of which is based on older statistical methods. It may not be taught (yet), but it's not that much of a mindblowing experience. I do remember that when I first heard about cross-validation I got shivers down my spine, but that may just be a sad reflection on my interests rather than a more general point.



