A well-honed skill set in 'ML' isn't really about the algorithms or data structures or any of that. If you want to be competitive in ML, study and think about experimental design, and what it takes to confidently state the accuracy, precision, and reproducibility of a given analysis when you apply it to real-world data.
The biggest issue/liability with most 'AI'/'ML' is unsubstantiated, irreproducible results that come from not being rigorous with one's experimental design.
I think the first step is to develop a skeptic's mindset. As a scientist, your job isn't to believe; it's to address evidence and evaluate whether it supports or conflicts with a given hypothesis. A profound influence on my development in the sciences was an older physical chemist (he was 55+ at the time) I shared an office with, 'John'. He was arguably one of the greatest critical thinkers I have ever had the privilege of working with. Many other people at that office would avoid him entirely when discussing their work, because if John heard about what they were doing, he would always 'challenge' them (politely and professionally) on it: how they had decided to set up their experiment; how they had decided to sample; how they intended to analyse their data; how they (physically and literally) planned on getting their data.
John did this because he was constantly challenging his own assumptions in his own work. As a result, John 'appeared' to be far less productive; however, his results were orders of magnitude more robust than those of our other colleagues at the time.
By incorporating a skeptic's mindset, less becomes much more. A skeptic doesn't believe the internal error rates produced by a model run; they take a long time to become convinced that a 'thing' is the truth. Don't believe, measure.
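To make "don't believe, measure" concrete, here's a minimal sketch (mine, not from any particular paper) using scikit-learn's permutation_test_score to compare a model's apparent accuracy against a shuffled-label null. The data here is deliberately pure noise:

    # "Don't believe, measure": instead of trusting a model's internal error
    # rate, compare it against a null distribution built by shuffling labels.
    # All data here is synthetic/illustrative.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import permutation_test_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))        # 200 samples, 10 features of pure noise
    y = rng.integers(0, 2, size=200)      # labels carry no real signal

    score, perm_scores, p_value = permutation_test_score(
        LogisticRegression(max_iter=1000), X, y,
        cv=5, n_permutations=200, random_state=0,
    )
    print(f"accuracy={score:.2f}, null mean={perm_scores.mean():.2f}, p={p_value:.2f}")
    # On noise, accuracy should be indistinguishable from the shuffled-label null.

If your real model's score doesn't clearly beat the permuted null, the skeptic's position is the right one.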
Because the usual train/validation/test design, on its own, often fails to generate a useful ML model or pipeline. Without some serious consideration of the nitty-gritty of the 'experimental design' (see above for what I mean by that), we get 'all hat, no cattle' results.
Let's take a small toy example that came up a few days ago: the model that could supposedly predict heart disease from one heartbeat. Their dataset came from two different sources: the 'disease state' and 'null state' patients had their cardiograms recorded on different instruments. They did some statistical resampling to get the data to 'match', so a skeptic's flags should already be raised. Second major issue: without resampling, they had an effective n of 30, so they made the decision to slice everyone's cardiogram into thousands of examples. The RNN (I think it was an RNN) obviously needs thousands of examples to train. They then randomly sampled (according to their train/validation/test split) from this distribution of 'beats' to train their model.
So they didn't do anything 'wrong' according to what I, and I'd assume you, were told when we did this or that course in ML. But from an experimental design POV, this is clearly going to overfit. Even disregarding the resampling of the original data (itself a very suspect issue), the slicing alone is enough to realize 'Ah. This is horse crap'. Think about it like this. Say it's an 80:10:10 split, drawn at random from windows 3 heartbeats long, sliced out of ~100k heartbeats. What is the probability that a heartbeat in the validation/test set is not very close (in time) to a heartbeat in the training set?
I don't have the time to work it out on paper, but almost anything in the validation/test set will be temporally adjacent to something in the training set. (Roughly: with an 80:10:10 split, each immediate neighbor of a held-out window lands in the training set with probability 0.8, so the chance that neither neighbor was trained on is about 0.2^2 = 4%.) The probability of anything in the validation/test set being genuinely far from everything in the training set (say, 3-4 beat-sets away in either direction) is very, very unlikely. The vast majority of the validation/test data will have two direct neighbors that were both used in training; the next most populated class will have one such neighbor; and almost nothing will be separated from the training data by any real distance (even 2-3 beats).
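If you don't trust the back-of-envelope number, you can simulate it. A rough sketch (the window count and split proportions are my assumptions, not the paper's):

    # Rough simulation: treat each sliced ~3-beat window as a position in the
    # original recording, split them at random 80:10:10, and measure how often
    # a held-out window sits right next to a training window.
    # All numbers are illustrative assumptions, not from the original paper.
    import numpy as np

    rng = np.random.default_rng(42)
    n_windows = 100_000
    labels = rng.choice(["train", "val", "test"], size=n_windows, p=[0.8, 0.1, 0.1])

    is_train = labels == "train"
    holdout = np.flatnonzero(~is_train)   # val + test windows

    # For each held-out window, is an immediate temporal neighbor in training?
    left = np.zeros(n_windows, bool)
    left[1:] = is_train[:-1]
    right = np.zeros(n_windows, bool)
    right[:-1] = is_train[1:]

    adjacent = (left | right)[holdout].mean()
    print(f"held-out windows with a training window next door: {adjacent:.1%}")
    # Expect ~96% (1 - 0.2**2): near-duplicates of the test data were trained on.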
The central issue here is the lack of independence in the experimental design. They've created quasi-independence in their sampling methodology, but at the end of the day they've still only got an n of 30 (not 300,000-3,000,000). I get it: in most cases, one almost always has to create a condition of quasi-independence in one's data, because to get these algorithms to work, you need lots of data.
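For what it's worth, one standard way to keep the evaluation honest under that constraint is to split on the independent unit, the patient, rather than on the sliced beat. A minimal sketch with scikit-learn (shapes and names are illustrative):

    # Split at the level of the independent unit (the patient), not the
    # sliced beat, so no patient's beats leak across the split.
    # Shapes and names here are illustrative, not from the paper.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    rng = np.random.default_rng(0)
    n_patients, beats_per_patient = 30, 1000
    X = rng.normal(size=(n_patients * beats_per_patient, 16))  # beat features
    y = np.repeat(rng.integers(0, 2, n_patients), beats_per_patient)
    patient_id = np.repeat(np.arange(n_patients), beats_per_patient)

    gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(gss.split(X, y, groups=patient_id))

    # No patient contributes beats to both sides, so the effective n is honest.
    assert not set(patient_id[train_idx]) & set(patient_id[test_idx])

It won't magically make n=30 into n=30,000, but at least the test score then measures generalization to unseen patients rather than to adjacent heartbeats.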
Knowledge of what ML is, how it works under the hood (a bit), and how to implement it (more important, imo) all matter in this space. What matters more (imo) is the mindset that can be developed by doing science as an intellectual exercise. It's good science to remain skeptical and adhere to the evidence rather than to our assumptions.
Read "Data Analysis: A Bayesian Tutorial," by D.S. Sivia for an excellent introduction to experimental design for scientists. This gives you a quantitative approach to the design of experiments, with tools to properly evaluate your results.
I would also recommend the Bayesian statistics book with the dogs on the cover. I can't remember its name offhand, but I got handed a copy once and it's a great foundation.