I completely take most of your points, but I think that pretty much all quantitative PhDs are going to be close to "data scientists". Given that stats and explaining your research are requirements, all that's left is to train them to program, which a lot of people are already doing. As a matter of fact, since I heard about this big data stuff I've been honing my skills in the area, in case the hype actually manifests.
> I think that pretty much all quantitative PhDs are going to be close to "data scientists"
Having taken all but one of the core requirements for a master's degree in statistics at a university with a well-respected statistics department, I can tell you that's very much not true.
The true challenges in data science have almost nothing to do with what you spend 90% of your time as a graduate student studying (whether you're getting an MA or a PhD, this applies the same). You may happen to end up a qualified data scientist, but that's not by design of the program.
The big problems in data science are almost disjoint from the big problems in statistics (at least the solved ones), because the problems that are tractable from a theoretical/mathematical perspective are very different from the ones we hope to solve in the workforce. We're just starting to bridge this gap in recent years (particularly now that computing power is cheap), but that's a very, very nascent trend.
This isn't unique to my university, either - most schools simply aren't teaching the type of skills that a data scientist - not a statistician, but a data scientist - would need to be competitive in the workforce. Those who do have these skills mostly picked them up by chance - either because they branched into statistics from another discipline, because they were forced to learn them on the job, or because they took the time to learn them themselves.
All three of those are pretty rare - I recently took a class in applied data mining and Bayesian statistics. Except for a few undergraduates majoring in comp sci, the class was mostly graduate students in statistics, and those who knew how to program were a stark minority (and were very popular when we were picking project groups!).
> all that's left is to train them to program
And to turn everything that they've learned and studied for the past two, four, or more years on its head so that they can actually put it to use. Okay, not everything, but at least 80% of it. Seriously, studying statistics at a high level is incredibly valuable, but it's not sufficient - it's not even going to get you half of the way there.
I'm in the same boat, and one of the funnier professors in math stat loves to talk about the students he hasn't "ruined" because they manage to learn programming and practical finite-sample wisdom and go on to be successful in industry.
And then he talks about his other students, with great love, who just like proving theorems.
Again, I completely see where you are coming from. However, the difference between a master's and a PhD is huge - far bigger than the gap between any other two forms of education.
In a PhD, essentially everything you learn is in service of mastering a particular topic or solving some kind of problem. This often involves programming (it did for me) and almost certainly involves statistics (again, it did for me). The most important characteristic of a PhD is that you learn all of this yourself (I certainly did). For instance, I was the only person in my department to learn R (although there were some old-time Fortran and C programmers in my department), and I then ended up learning some Python and Java, along with bash, to deal with data manipulation problems and to administer psychological measures over the internet. These are the kinds of experiences that led to me possessing some of the skills needed to be a data scientist, and with some experience in the private sector, I'll get there.
Bear in mind that I (almost) have a Psychology PhD, and this would all have been far easier for me if I had worked in physics, chemistry or any of the harder sciences. So from my perspective, I can see that this is where the data scientists of the future are going to come from.
Note that I looked at the job market and made a conscious decision to train myself in these kinds of skills throughout my PhD, but if you are not capable of performing this kind of analysis then you probably shouldn't be doing a PhD anyway.
I really don't see how programming upends everything grad students learn (though I would be delighted to hear your thoughts), as to me it just seemed like the application of logic with the aid of computers. I'm not that good a programmer, though - certainly not outside the application of stats - but within the next few years I will be.
> In a PhD, essentially everything you learn is to master a particular topic, or solve some kind of problem. This can often involve programming (it did for me) and almost certainly involves statistics (again, it did for me).
Yes, but the original point was that more or less any quantitative PhD would be expected to have these skills.
> For instance, I was the only person in my department to learn R
Case in point - and I can tell you, from knowing the PhD students that I do, that if I found someone who knew how to program in any language not used primarily for statistical computation (R, Stata, Matlab, SAS, etc.), I would consider them the exception, not the norm.
The exact opposite is true about a data scientist.
> but if you are not capable of performing this kind of analysis then you probably shouldn't be doing a PhD anyway.
Or you just don't care about those types of jobs - and apparently there are plenty of people who don't, because many, if not most, of the PhD students I can think of aren't looking for data science jobs.
> I really don't see how programming upends all that grad students learn (though I would be delighted to hear your thoughts), as to me it just seemed like the application of logic with the aid of computers.
It's not programming per se, but the computational power it brings: it makes certain techniques feasible, and it renders other concepts and methods not obsolete, just no longer optimal. This is really a comment about statistics specifically. Most job postings for data science positions mention some form of the phrase 'machine learning' - and if they don't, they often have it in mind. Unfortunately, while demand for machine learning dominates the job market, in the grand scheme of things it's just one branch of statistics, and its 'parent' branch was relatively obscure until very recently. To this day, if a PhD student finished their program with next to none of the academic background required for machine learning, I doubt most academics would bat an eyelid. It's just not considered important from an academic standpoint. It's unfortunate that we have such a disconnect between academic interest and industry demand, but it's very much the case.
A basic example that I often cite of how computational power has fundamentally changed statistics from the previous few decades is in our selection of estimators. (I often cite this because anybody who's ever taken a statistics class has probably had this experience.) In every introductory statistics class (and in many non-intro classes as well), when studying inference, you spend 90% of your time talking about estimators for which the first moment of the error has an expectation of zero - that is, unbiased estimators - and the other 10% on a 'last resort' when no unbiased estimator exists. Who decided that the first moment was the most important? What about the second? Third?
Well, it turns out that the first moment is easier to calculate, and, by coincidence, it happens to be the most relevant when your dataset is small (say, between 30 and 100 observations). But once you're talking about datasets with observations numbering in the thousands (which is still 'small' by some standards today!), you'd be insane to throw out an estimator that converges at rate n (rather than at rate \sqrt{n}) just because it introduces a small bias.
But we do - and that's reflected in the sheer amount of academic research and literature that discusses the former, and the sheer lack of work on the latter. In many cases the theory exists, but it was developed in an era in which it could never feasibly be applied.
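The bias/MSE trade-off described above is easy to demonstrate with a small simulation (a toy sketch in Python rather than R, purely for illustration): the maximum-likelihood variance estimator divides by n and is biased downwards, yet it beats the textbook unbiased n-1 estimator on mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, true_var = 10, 20000, 4.0

# Simulate many small samples from N(0, true_var)
data = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))
xbar = data.mean(axis=1, keepdims=True)
ss = ((data - xbar) ** 2).sum(axis=1)

unbiased = ss / (n - 1)  # the textbook unbiased estimator
biased = ss / n          # the MLE: biased downwards, but less variable

mse_unbiased = np.mean((unbiased - true_var) ** 2)
mse_biased = np.mean((biased - true_var) ** 2)

# The biased estimator has the lower mean squared error
print(mse_biased < mse_unbiased)
```

Unbiasedness buys you nothing here: shrinking the estimate slightly trades a little bias for a larger reduction in variance.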
Vestiges of this era are visible even in many statistical software packages - for another basic example, regressions by default assume homoskedastic errors, even though this is almost never valid in real life. Why? Because in a previous era, everyone imposed this assumption: while the theory behind the alternative had been developed, it was expensive to carry out in practice (it involves several extra matrix multiplications).
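As a sketch of what that alternative looks like (a toy numpy illustration of White's HC0 'sandwich' covariance, not any particular package's implementation), the robust version replaces the single sigma^2 with a term built from the squared residuals - those are the extra matrix multiplications mentioned above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 3.0 * x + rng.normal(0, x)  # noise scale grows with x: heteroskedastic

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical SEs assume a single error variance sigma^2
sigma2 = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# White (HC0) robust SEs: sandwich the squared residuals between (X'X)^-1 terms
meat = X.T @ (X * resid[:, None] ** 2)
cov_robust = XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(cov_robust))
```

A handful of extra matrix products - trivial today, but a real cost when this theory was first worked out.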
I'm painting with a broad brush, but the general picture still very much holds.
I completely see your point on estimators and the nature of many if not most statistics courses. I suppose that I was lucky enough to study non-parametric statistics in my first year of undergrad, as there's a subset within psychology that's very suspicious of all the assumptions required for the traditional estimators.
That being said, I think you're missing my major point which is that a PhD should be a journey of independent intellectual activity, so the courses one takes should be of little relevance, and so can therefore be downweighted in considering what PhD students actually learn. I accept that this is an idealistic viewpoint (FWIW, the best thing that ever happened to my PhD in this context was my supervisor taking maternity leave twice during my studies, which forced me to go to the wider world for more information about statistics).
I accept your point about machine learning not being a major focus of academia (well, except for machine learning researchers), and I think it's awful. It's very sad that the better methods for dealing with large-scale data and complex relationships are only used by private companies. It's not that surprising, but it is sad.
That being said, I firmly believe that any halfway skilled quantitative PhD can understand machine learning, most of which is based on older statistical methods. It may not be taught (yet), but it's not that much of a mind-blowing experience. I do remember that when I first heard about cross-validation I got shivers down my spine, but that may just be a sad reflection on my interests rather than a more general point.
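For what it's worth, the core idea of cross-validation fits in a few lines (a toy numpy sketch, not production code): hold out part of the data, fit on the rest, and score on what was held out, so an overfit or underfit model reveals itself.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean held-out MSE of a polynomial fit across k folds."""
    idx = np.random.default_rng(seed).permutation(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)              # everything not held out
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((y[fold] - np.polyval(coeffs, x[fold])) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 120)
y = x ** 2 + rng.normal(0, 1, 120)  # the true relationship is quadratic

scores = {d: kfold_mse(x, y, d) for d in (1, 2, 8)}
# Held-out error exposes the underfit linear model
print(scores[2] < scores[1])
```

No deep theory is needed to see why this works, which is rather the point: it's a resampling idea any quantitative PhD can pick up in an afternoon.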
_delirium's post acknowledges your point but is looking for an even rarer person:
"There could be more of them in the future, but someone who is top-notch at all of statistics, programming, and data-presentation has long been less common than someone who's good at one or two of those".
Someone who can program, understands statistics, and can present data in an appealing manner without losing significant fidelity. Many people underestimate the difficulty and skill required to present data in a way that makes sense and also actually says something.
There is a significant gap between presenting data that is satisfactory to a research advisor and something that a business person with barely enough time to think can grasp without misconception.
Again, I completely see the difference (and am actually in the process of moving full-time to the private sector from academia, so will probably understand a lot more in six months), but visualising data well is not that hard.
Step 1: learn R
Step 2: Learn PCA
Step 3: Learn ggplot2
Step 4: play with the different geoms until you understand them (seriously though, everyone's eyes are optimised to find patterns, and if you can apply significance testing to these then you should be good)
Step 5: profit!? Note that I am being somewhat facetious here, but I suspect that the mathematical knowledge and the ability to apply it to business problems will be the real limiting factors, as good practices in data analysis, programming and visualisation can be learned. Granted, that will take a long time, and there will be individual differences, but it's doable.
Whether or not it will be done at all though is another matter.
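The R-centric steps above translate readily to other stacks. As a rough sketch of steps 2-4 in Python with numpy (a toy example on made-up data, not a recipe): PCA is just an SVD of the centred data matrix, and the component scores are what you would hand to ggplot2 (or matplotlib) to eyeball for patterns.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: 200 observations of 5 correlated features driven by 2 latent factors
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 5))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 5))

# PCA via SVD of the centred data matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S ** 2 / (S ** 2).sum()  # variance share per component

# Scores on the first two components: these are the points you would plot
pc_scores = Xc @ Vt[:2].T

# With two latent factors, the first two PCs capture nearly all the variance
print(round(float(explained[:2].sum()), 3))
```

Plotting `pc_scores` and colouring by whatever grouping variable you have is the "play with the geoms" step; the stats come in when you test whether the patterns your eyes find are real.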
Again, delirium's point is trivially true if one requires these people to know all of statistics, programming and data presentation, as I don't think there's anyone who knows all of any one of these subjects.
I suppose it somewhat depends on what the skill levels for each of these areas need to be, and that varies from person to person as well as from application to application.
Allow a short vignette from a former academic and now management consultant.
We spent six months at a major pharmaceuticals client examining their reimbursement data. Poring over many millions of rows of transaction data and thousands of payment codes (which, of course, were unique across sales geographies), we determined the ten regions at highest risk of reimbursement collapse. R was used, maps were created, beers all around.
But almost none of it was used for the executive presentation. In fact, the only part that was included was that we had ten regions that needed fixing, and our suggestions on how to fix it. You see, the CEO was dyslexic, the chairman of the board was colorblind, and the COO was a white-boarding kind of gal, so given this audience the nuts and bolts of our advanced statistical analysis were simply irrelevant.
This is hardly surprising. If we are having so much trouble hiring people who are fluent in Big Data, how can we expect business leaders to be even conversant? With only slight exaggeration, the way you do your analysis and the visualizations that you create are not important.
Companies are demanding Big Data scientists because they suddenly have lots of data and see the term Data Scientist in the news. But what they really want is not Data Scientists, it's business insights and implications from Big Data. The customer needs 1/4" holes, but we're all arguing over which brand of laser powered diamond drill they should buy.
I still remember my first business presentation. I had a slide talking about how I did a statistics study. I was told to take the word "study" out because it had bad connotations for the target audience (middle managers at Bristol-Myers Squibb if you're curious).
The comment was probably right. But I was horrified.
I agree, and it's all the more true if you consider that "presenting" data may actually be more like creating an interactive environment to explore data.
I believe that data analysis yields the best results when perusing the data and tuning the models are closely connected tasks.
"I think that pretty much all quantitative PhD's are going to be close to "data scientists"."
Exactly. Fundamentally, "data science" is known far more widely by its other name: science. Yet we've reached the bizarro-world place where there are huge numbers of un- or under-employed scientists looking for work, while companies are freaking out about hiring the "rare" computer programmer who happens to know some statistics and self-identifies as a "data scientist". It's rather absurd.
Anyone who can earn a PhD can learn to program, but being good at engineering the very complex processes needed for effective machine learning applications is a skill that is not so easily acquired.
I've worked with quite a number of quantitative PhD level people in my career and most often the quality of their code leaves much to be desired.