The OP wrote a great post some time ago (it was posted on HN and was probably how I discovered him): "The Forgotten Job of a Data Scientist: Editing" http://www.john-foreman.com/blog/the-forgotten-job-of-a-data... The theme of that post is similar to the one here:
> In a business, predictive modeling and accuracy is a means, not an end. What’s better: A simple model that’s used, updated, and kept running? Or a complex model that works when you babysit it but the moment you move on to another problem no one knows what the hell it’s doing? Robert Holte argued simplicity versus accuracy in his rather interesting and thoroughly-titled paper “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets.” In that paper, the "air quotes" AI model he used was a single decision stump – one rule. In code, it’d basically be an IF statement. He simply looked through a training dataset and found the single best feature for splitting the data in reference to the response variable and used that to form his rule. The scandalous part of Holte’s approach: his results were pretty good!
The OP has a book (the link is on his site) that I've bought...it purports to teach non-technical people how to do machine learning via Excel...I haven't gone through all of the exercises, mostly because I shudder at having to open Excel to do any kind of data work, but the mental framework that he builds, demonstrating via examples in Excel, is pretty remarkable...I think it's admirable that a data scientist is doing such significant work in demystifying (and maybe even devaluing) his own profession.
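For anyone curious what the "one rule" idea from the quote looks like in code, here is a minimal sketch. It is not Holte's exact procedure (his paper also deals with things like numeric features, which this ignores); it assumes purely categorical features, and the toy weather data and the function names are made up for illustration.

```python
from collections import Counter, defaultdict

def fit_one_rule(rows, labels):
    """Pick the single feature whose value -> majority-label rule
    makes the fewest errors on the training data."""
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    best = None  # (errors, feature_index, rule)
    for j in range(len(rows[0])):
        # Count labels per value of feature j.
        counts = defaultdict(Counter)
        for row, y in zip(rows, labels):
            counts[row[j]][y] += 1
        # The rule maps each observed value to its most common label.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(y != rule[row[j]] for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    _, feature, rule = best
    return feature, rule, default

def predict(model, row):
    feature, rule, default = model
    return rule.get(row[feature], default)

# Hypothetical toy data: (outlook, windy) -> play outside?
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"),
        ("rainy", "yes"), ("overcast", "no"), ("overcast", "yes")]
labels = ["yes", "yes", "no", "no", "yes", "yes"]

model = fit_one_rule(rows, labels)
print(model[0], model[1])              # chosen feature index and its rule
print(predict(model, ("rainy", "no")))
```

The fitted model really is just a lookup on one column, which is why it is so easy to hand over to whoever maintains it next.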
I'll just step in and say that, as a non-coder who nevertheless has to understand how code works, Foreman's book "Data Smart" is an invaluable primer on the theory and practice of machine learning.
Certainly I'm no machine learning pro, but as a discipline it's a whole lot less mysterious to me now than it was before I read that book.
I feel like these are all reasonably obvious gigantic landmarks taught to anyone with a stats or pattern recognition background. It's sort of just the lay of the land with statistics, modeling, optimization.
That doesn't mean at all that more people shouldn't be aware of them and think critically about them. It also doesn't mean at all that the implications of these features of ML are understood by the entirety of the organization which is affected by them.
But I sort of feel like it's the (perhaps unstated) primary job of a data scientist/modeler/mathematician/statistician/what-have-you employed at any company investing in this kind of technology. It feels about akin to a company building their software without using version control. Sure some people do that, but... c'mon!
If you're failing to do it then you're probably not listening to your experts enough. If they're not screaming it's because they're wrapping blankets around their head trying to block out the noise.
After having skimmed the paper, I think it has a subtly different focus which is vastly more important to consider. In particular:
> At a system-level, a machine learning model may subtly erode abstraction boundaries. It may be tempting to re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems. Machine learning packages may often be treated as black boxes, resulting in large masses of “glue code” or calibration layers that can lock in assumptions. Changes in the external world may make models or input signals change behavior in unintended ways, ratcheting up maintenance cost and the burden of any debt. Even monitoring that the system as a whole is operating as intended may be difficult without careful design.
In other words, due to either unavoidable or marketing-based complexities, ML projects have a tendency to blow up abstraction boundaries and induce loops leading to unexpected tight coupling. You can interpret this less as a feature of ML as a task and more as a feature of how ML algorithms tend to operate, especially when operated in "black box" mode.
ML may, when practiced in some ways, be antithetical to good software design.
Again, quoting the paper:
> Indeed, arguably the most important reason for using a machine learning system is precisely that the desired behavior cannot be effectively implemented in software logic without dependency on external data. (Emphasis mine)
This rings very true in my experience building ML systems. It's an unexpected challenge in software terms. You're used to many kinds of behavior being undefined until runtime (e.g. needing user input before they can be understood), but typically the function of a program is something you fully internalize statically; this is the cornerstone of understandable code.
OTOH, ML says that even the very behavior of the system must be deferred until "runtime" information can propagate back. This is especially pronounced in online decision-making tools, but plays out over much longer time periods even with batch algorithm design. This "phase distinction" problem is something you will feel immediately in building ML-based code.
It's probably the case that black box ML algorithms exacerbate this. I don't have a lot of experience there, though.
Definitely agree with this. I do see more folks out there (like myself) with one foot in both the data science and production engineering worlds, and my personal approach to the problem of black boxes would be to have more of us! Rather than some 'sucker dev' implementing a statistician's hack, a full-stack mindset means the data science is done with engineering considerations all along.
But even then, half the joy of data science is not really knowing what you're doing at first as you research and experiment, just as your first attempt at a production system for something complex will also not necessarily be the final thing. If you have both of these at once, you can end up with a 'production' system that looks pretty well-designed and, indeed, works well enough at first -- until the product outgrows it, leading to a somewhat uncomfortable transition later on.
My personal anecdote starts with my PhD research, which initially used a combination of Perl scripts, WEKA and god-knows-what-else on fixed pre-formatted corpora, and became a Python engine powering a Django web-app using the Twitter API. Not exactly rocket surgery at Google scale, but I found the whole shift in mindset from research to execution extremely useful, and I'd totally recommend that any data scientists who have focused solely on research try the other side of the fence, at least a little.
Stripping things down and simplifying, getting that 80/20 balance right, that's also some of the investigative fun of data science, and we shouldn't get so carried away with our own cleverness that we ignore the poor sods who have to use the models we build years from now. I was consulting on a project which had very complex visions of a layered AI system using machine learning to make key decisions, but given the stage of the project and the depth of ML knowledge of the team, it was way easier for the initial stages to use what was effectively a switch statement on pre-defined values instead. You can always add complexity!
I see lots of common points with the "lean" method here (at the expense of technical debt), but can't forget Knuth's advice that premature optimization is the root of all evil.
To me, the most important aspect to take into account is the "key" of the business. I had a similar story with my PhD research (AI+NLP): I wanted to focus it on adding value to commercial products. I worked on the core of the implementation, which I built in Java with a configurable pipeline, plus an online app with servlets. I did manage to get noticed by some companies, but when it came to agreeing on future lines of work with them, my adviser told me this was not what he expected from me, and I was forced to go back to the research of academia, which is not centred on useful goals. I am still glad I did what I did, regardless of how useless it eventually was; my knowledge grew, and I learned how important it is to know your business model before you do anything at all. I learned how unbalanced the academic market is, always relying on public funds to survive instead of worrying about building useful, appealing stuff. I realized where I wanted to be, so I dropped out and joined private industry, where I now feel very fulfilled.
My present approach goes from small to big, little by little, getting as much feedback as I can so I can fix mistakes asap and prevent them from getting bigger and more difficult to manage. I agree with what you wrote.
Do not use more data sources than necessary. You can incur technical debt. [1]
Metaphor about dead bodies at the top of the hill.
Be cognizant of the fact that in some use cases by making use of the information output by the predictive model you can reduce the effectiveness of the original data model. This is called a feedback loop.
The goal of ML is business. Not some new ML algorithm/technique or programming language paradigm.
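The feedback loop point is easier to see with numbers. Below is a toy sketch, not taken from the post: the churn rule, the usage figures, and the assumption that a retention offer always works are all hypothetical and chosen only to make the effect visible.

```python
def churn_rule(usage_hours):
    # Hypothetical one-rule model learned from historical data:
    # customers with low usage are predicted to churn.
    return usage_hours < 3

def accuracy(usages, churned):
    hits = sum(churn_rule(u) == c for u, c in zip(usages, churned))
    return hits / len(usages)

usages = [1, 2, 2, 5, 6, 8, 9, 10]

# Before anyone acts on the model, low-usage customers really do churn.
churned_before = [u < 3 for u in usages]

# Then the business sends every flagged customer a retention offer.
# Assumption for illustration only: the offer always works.
churned_after = [c and not churn_rule(u)
                 for u, c in zip(usages, churned_before)]

acc_before = accuracy(usages, churned_before)
acc_after = accuracy(usages, churned_after)
print(acc_before, acc_after)  # 1.0 0.625
```

The model did its job, yet the data collected afterwards makes the same rule look wrong, and anything retrained on that data would learn that low usage no longer signals churn.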
Hey, author here. It does sound kinda bigoted, doesn't it? Lemme clarify. I would really love for more people to learn this stuff. Truly. That's why I wrote a book to teach it to folks in Excel. What freaks me out a bit is that in order to enlarge the ML pie, a lot of vendors are trying to democratize machine learning not by teaching it to more people but by putting the gun in the hands of folks who don't actually understand how supervised machine learning works and where it can go wrong.
I don't think people need PhDs to learn or do this stuff. But I think they need more than a "machine learning made easy!" app and a gung-ho attitude.
I feel that I'm halfway between main street and an ivory tower...not sure where that puts me. The upstairs bar at a pizza parlor?
Speaking as one of the people that phrase is probably pointed at, I don't begrudge the choice of words at all.
Lately there's been a big thrust in my industry to do just what you're worried about, and now that I've started to get an inkling of what's going on under the hood with predictive modeling tools over the past couple years it's starting to worry me.
And I'm starting to find their sales pitches to be even more condescending. "Don't worry about how this works or what it's really doing. Just pretend it's magic." It doesn't speak well for the vendors' opinion of their customers, and it leaves a lot of room open for legitimate technology to morph into silicon snake oil.
I just recently posted in an HN machine learning thread asking for beginner resources. This sounds right up my alley.
I'm teaching myself Ruby (and other stuff), but consider myself pretty advanced with Excel and web analytics in general. This seems like a great way for me to get my feet wet in the deeper science of things with tools I'm already very familiar with (more so than Ruby at least).
John, can you clarify a bit on how much background is needed in various areas of math to get the most out of this book? Or do you feel you do a solid job of teaching that as one progresses through the chapters?
A semester of linear algebra (or just a willingness to Wikipedia a few things) plus Excel experience is all you need.
That said, the book does require a lot of effort, because the techniques are worked through step by step.
But once you learn all the guts of the algorithms, you never have to implement them again! The last chapter moves the reader into R package land with the confidence that you now know what those packages are basically doing and what to watch out for.
If you've got a college semester of linear algebra under your belt (or equivalent) and are pretty good with Excel, then the book is a good fit. Even the algebra can be optional if you're willing to use Wikipedia liberally. I don't take for granted that the reader has a lot of background.
That said, there are parts in the book that are really quite hard. Hard in that they just take time to work through, because the book is about learning all the steps that go into training models and doing analyses from scratch. But once you do it all from scratch once, you don't necessarily have to ever, ever do it again.
It's taught in Excel for learning purposes, and then the last chapter moves you into R. Literally, the Holt-Winters forecasting chapter of the book is 50 pages, while in R it's the forecast package plus 3 lines of code.
I'm curious about the real-world implementation risk, and whether anyone has a methodology to proactively deal with external factors affecting model performance. For example, if a feature is highly predictive of a certain outcome, is there a framework to measure its volatility based on information outside of the dataset, i.e. product changes, marketing campaigns, etc.?