The Open Source Data Science Masters (github.com/datasciencemasters)
125 points by olalonde on Sept 10, 2016 | 21 comments


I don't like this curriculum very much - I think it is way too heavy on the data engineering side and covers way, way too little of the actual mechanics of the data science bit.

For example, the words "validation" (as in cross validation) and "overfitting" aren't mentioned anywhere on that page, and yet things like data scraping are mentioned multiple times.

With all due respect, I can find lots of people to do scraping, but it is much harder to find someone to explain a good strategy for cross validation on time series data (for example).
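To make that concrete, here is a rough sketch of what a sensible forward-chaining cross-validation scheme for time series might look like, using scikit-learn's TimeSeriesSplit; the data and model are made up purely for illustration:

    # Forward-chaining CV: each fold trains only on the past and tests on the future,
    # unlike shuffled k-fold, which leaks future information into training.
    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                       # 200 time-ordered observations (toy data)
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])     # fit on the past only
        scores.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

    print("per-fold MSE:", scores)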

And yes, "data scientist" is a vague term that can mean pretty much anything.

Having said that, if you did everything on this list you'd be a pretty good data scientist.

(I run a data science team, and I'm involved in building a data science competency framework so it's something I think about a fair bit)


Good point. That said, it is easy enough to augment an expert data scientist with 4:1 data engineering support, whereas a data scientist working solo will spend 80% of their time engineering data. With all the hype and inflated expectations, IMO much easier to hire aspiring data scientists than talented engineers who are satisfied with the data prep and admin aspects. MSPA programs are realistic in expecting that the bulk of their graduates will spend much of their time as data janitors.


IMO much easier to hire aspiring data scientists than talented engineers who are satisfied with the data prep and admin aspects

As someone who hires both, I can guarantee this is incorrect. Well, maybe hiring "aspiring" data scientists is ok, but aspirations alone will get me models that do exactly the wrong thing. So that isn't useful.


The machine learning coursera course listed on the page covers bias/variance and validation.


Yeah, I did see that, and that is a great course. I get the impression (based on the lack of a description) that they see it as being of equal importance to all the other many, many courses they tell you to do.

Put that first, and it would be a big improvement. Would be better if it wasn't in Octave though!


"how can you afford not to take advantage of an open source education?"

But there is a time cost to learning. For example, if a Masters degree takes two years, does the author have an estimate of how long it would take to complete her list?

And since the list is a little old (in terms of the rate at which this field is progressing), I would add Apache Spark to the list. I watched a video recently about how Scala took over the big data world (probably not true) [1], but the presenter made an interesting point about how Spark subsumes a lot of different things (streaming, machine learning, built in support for SQL) and it is good enough at those things even if not the best tool. Not surprisingly, that actually makes it a good candidate for adoption in the enterprise.
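To make the "subsumes a lot of different things" point concrete, here is a rough sketch in PySpark (with made-up data) of SQL and MLlib running against the same session; this is just an illustration, not anything from the talk:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("spark-as-platform").getOrCreate()
    df = spark.createDataFrame(
        [(1.0, 2.0, 5.1), (2.0, 0.5, 3.9), (3.0, 1.5, 7.2)],
        ["x1", "x2", "y"])

    # Built-in SQL over the same data
    df.createOrReplaceTempView("obs")
    spark.sql("SELECT avg(y) AS mean_y FROM obs").show()

    # MLlib on the same DataFrame
    features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)
    model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
    print(model.coefficients)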

And then add the recent availability of things like the Databricks Community Edition (and similar offerings from competitors), and I could also make the case that you can start from a completely different entry point - learn Spark first, and then go deeper based on what you are interested in. But most of all, using a platform like Databricks takes away a major pain point - the often tedious process of setting things up before you can start your work. [2]

My last point is not aimed at this particular resource, but is a general feeling I have when I see lists that take significant time for readers to fully pursue. I would like such lists to include a brief statement about some popular things that were deliberately omitted and why - simply because that gives some added confidence in the effort that went into the curation process.

[1] https://www.youtube.com/watch?v=AHB6aJyhDSQ&t=10m30s

[2] I am not associated in any way with Databricks. Also, obviously Databricks is a commercial entity, so in some sense you are not just within the "all free and open" domain.


I watched a video recently about how Scala took over the big data world

I think you mean Spark? I use Spark heavily at work and know little Scala (although I do have an engineering team who do work in it sometimes).

I would add Apache Spark to the list. I watched a video recently about how Scala took over the big data world (probably not true) [1], but the presenter made an interesting point about how Spark subsumes a lot of different things (streaming, machine learning, built in support for SQL) and it is good enough at those things even if not the best tool. Not surprisingly, that actually makes it a good candidate for adoption in the enterprise.

So Spark is great, but it isn't the only thing. For example, it does let people use Python and R on the same platform pretty easily, and with the potential for good performance.

However, you really need to know what you are doing to get the best out of it (what a surprise, hey!). For example, Databricks likes to show how DataFrames/Datasets give huge performance advantages over the old RDD programming model.

This is true, but you need to understand why in order to make sure you see the same benefits. Basically, numerous primitive functions have been implemented as native operations on the DataFrame classes, and if you use them they perform well. If, however, you use Python UDFs, you won't.
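As a rough illustration of the difference (a sketch, not a benchmark; the data is made up), the same transformation can be written with a built-in column function, which stays inside the JVM and the Catalyst optimizer, or with a Python UDF, which ships every row out to Python workers:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("native-vs-udf").getOrCreate()
    df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

    # Native column expression: optimised end-to-end, no Python round trip
    fast = df.withColumn("name_upper", F.upper(F.col("name")))

    # Python UDF: same result, but rows are serialised out to Python and back
    to_upper = F.udf(lambda s: s.upper() if s else None, StringType())
    slow = df.withColumn("name_upper", to_upper(F.col("name")))

    fast.show()
    slow.show()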


I don't think Spark would be core data science. The way I see it is that data scientists usually fall somewhere in the range of [software engineers who are good at math, statisticians who are good at coding], and the projects done at either end are pretty different.


I use Spark at work, it's really good when you need to do some large scale analysis with your own custom code. Everything else it does is just ok. It certainly doesn't subsume ML tools.


Great to hear from someone who uses Spark at work. I see what you mean - "subsumed" was a bad word choice and I should not have tried to paraphrase. What do you think of the phrase the presenter uses - "One tool that fits everything"? (I am new to data science)


I think it's marketing nonsense. Databricks in particular has good "proximity to data" IMO: once you've figured out how to use it with your data sources, you can just fire up a web browser, connect to them, and wire them into your code easily.

The problem with seeing Spark as your one tool for everything is that this is only true if it's trivial to integrate your code with Spark. Viz tools like Plotly/Bokeh don't integrate well with Databricks' notebook, and Deep Learning tools are not really supported yet unless you're running your own clusters and special libraries to wire things together.

I think Spark is a good workhorse for big data; it can do repetitive things well at large scale, but it's less good when you want to use more niche tools, since most of the data science community is not focused on Spark. PySpark exists and will probably be good enough, but only if your data fits in memory on a single machine anyway.

And if you're not dealing with big data, Spark is overkill. It's usually simpler to just get a box with more RAM.


Spark's MLlib is pretty weak compared to other options.


Spark is a topic covered in the intro to data science UW videos, the first resource mentioned on the curriculum.


All these curricula seem a bit too complex. IMHO, there's one thing that should be prioritized above everything else: the concept of probability, computable probability distributions, and Bayesian inference.

It's the one thing that provides a unifying umbrella for all modes of reasoning under uncertainty. https://probmods.org/ and http://forestdb.org/ seem to be the best resources for this at the moment.
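probmods.org works in Church/WebPPL, but the core idea fits in a few lines of anything; here is a minimal (and deliberately toy) Python sketch of Bayesian updating for a coin's bias over a discrete grid:

    import numpy as np

    theta = np.linspace(0, 1, 101)              # candidate values for P(heads)
    prior = np.ones_like(theta) / theta.size    # uniform prior over the grid

    heads, flips = 7, 10                        # hypothetical data
    likelihood = theta**heads * (1 - theta)**(flips - heads)

    posterior = prior * likelihood
    posterior /= posterior.sum()                # normalise

    print("posterior mean:", (theta * posterior).sum())   # ~0.67, shrunk toward the prior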

Besides, I dislike the term "Data Science", which seems to be a new buzzword. "Data Engineering" would be more acceptable, as I think people working in companies are building stuff rather than developing new theories. But I don't like that term much either.


I agree Bayesian techniques are important, and satisfying intellectually.

The problem is that it is entirely possible to build perfectly good models without ever touching anything Bayesian (excluding naive Bayes classifiers, perhaps!), and adding Bayesian techniques will rarely improve the accuracy in any way.

But I'm happy to admit my understanding of Bayesian techniques is incomplete. It's something I'm working on (https://probmods.org/ is great), but I just haven't found anywhere to use it in anger yet.


Something that no one usually admits is that there are three types of reasoning about uncertainty (frequentist, Bayesian, and nonparametric), and each of them has its pros and cons and circumstances where it should be used.

With frequentist statistics, it is really easy to reason about what the correct estimator should be (it is almost always the obvious one). For example, with functional time series (where each data point is a function rather than a real value), it is straightforward to find an MLE - it is just the average function. But defining a prior on the space of twice-differentiable functions isn't as easy.
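To illustrate the "obvious estimator" point, here is a toy sketch (assuming Gaussian noise, so that the pointwise average really is the maximum likelihood estimate of the mean function):

    import numpy as np

    grid = np.linspace(0, 1, 50)                # common evaluation grid
    rng = np.random.default_rng(1)
    # 30 noisy curves around a sine: each row is one functional observation
    curves = np.sin(2 * np.pi * grid) + rng.normal(scale=0.3, size=(30, grid.size))

    mean_function = curves.mean(axis=0)         # the obvious (and, here, maximum likelihood) estimator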


So far, the downside of this is that many companies seem to require a master's degree.


What about showing a portfolio of data analyses? Does anybody have experience with this?


Someone was hired onto a data science team I worked on with no master's degree, partly because they scraped some of our company's data from our website and showed the interviewers some cool data-science-like work built on that scraped data. That said, this may have been a successful tactic there because the team would sometimes hire people without a master's or PhD if they showed aptitude and 'passion' for this type of work. We had PhDs who were really not equipped to do well at the business level, and people with Bachelor's degrees who made major contributions (and vice versa). Unfortunately, there are still companies and teams out there that think a higher degree is a good filter...


Yes, I got a data science job advertised as masters-only by showing past work and painting a picture of how I could do something similar for their industry-specific problems. I think credentials are often more a factor in getting the interview, which can be hacked through connections and good networking; once you're in the interview, you can often communicate competency through past work rather than a stamped piece of paper. That said, I've since done an MSc in Statistics and it's definitely easier with one. I think it helps, but it isn't strictly required in all cases (some companies will insist, but especially now, with such a shortage of skills, I think you can get away with just a solid portfolio and good networking skills).


It might not be sufficient, but I would no longer pay for education and do it "on the record" without first taking the equivalent free classes.

From what we know about human memory, the practice of cramming a topic for 3 months is not helpful and is just a result of resource costs.



