> Nobody knew or even cared what the difference was between good and bad data science work. Meaning you could absolutely suck at your job or be incredible at it and you’d get nearly the same regards in either case.
In my experience it's even a little bit worse than that. Approaches that are wrong from a statistics point of view are more likely to generate impressive-seeming results. But the flaws are often subtle.
A common one I've seen many times is people using a flawed validation strategy (e.g. one which rewards the model for using data "leaked" from the future), or relying too heavily on in-sample results in other ways.
Because these issues are subtle, management will often not pick up on them or not be aware that this kind of thing can go wrong. With a short-term focus they also won't really care, because they can still put these results in marketing materials and impress most outsiders as well.
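To make the leak concrete, here's a minimal sketch (entirely synthetic data and made-up feature names, just the shape of the failure): a centered rolling-mean feature quietly averages in future values, so it shows "skill" on pure noise even on a chronologically held-out test set, while an honest past-only feature correctly shows none.

```python
# Hypothetical sketch of future leakage, not anyone's real pipeline.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
change = pd.Series(rng.normal(size=2000))  # day-over-day change: pure noise

leaky = change.rolling(5, center=True).mean()   # includes t, t+1, t+2: leaks
honest = change.rolling(5).mean().shift(1)      # only values strictly before t

df = pd.DataFrame({"y": change, "leaky": leaky, "honest": honest}).dropna()
train, test = df.iloc[:1500], df.iloc[1500:]    # chronological split

for col in ("leaky", "honest"):
    model = LinearRegression().fit(train[[col]], train["y"])
    print(col, round(model.score(test[[col]], test["y"]), 3))
# leaky scores R^2 ~ 0.2 on unpredictable noise; honest correctly scores ~ 0.0
```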
>Meaning you could absolutely suck at your job or be incredible at it and you’d get nearly the same regards in either case.
One of the things I don't like about statements like this, said in a Data Science context, is that they are true outside of Data Science as well. Executives make big decisions, managers make smaller decisions, nobody can evaluate how good/bad they really were for months or years. Engineers build something amazing, or build a house of cards, nobody cares as long as the money people are happy, even if the business use case turns out to be wrong in the long run.
>With a short-term focus they also won't really care, because they can still put these results in marketing materials and impress most outsiders as well.
Forget Data Science, you see this in KPIs as well. Say a crappy metric has to be moved by Q2 next year and people will destroy the company to move it.
I feel like Data Science is just one of those areas where you are exposed to a wider range of people and get to feel the full crapola of the insanity of working in a corporation. For lots of roles (e.g. Engineering) you get to hide in a hole behind layers of people and not see some of this insanity.
Not to get too off topic, but as a 35 year old engineer it seems the world in general has far fewer consequences than I was raised to expect. Everything from businesses with bullshit ideas flourishing at a loss, to January 6 even being possible (politics aside, I expected the Capitol Police to crack a lot more skulls than they did once people started smashing windows), to the whole FTX situation and the tepid response in the media/government, to petty crime being outright tolerated. Even in my own career, I've at times burned through enough money, badly enough (albeit with good intentions), that I thought I was going to be fired, only to be told in a performance review I was doing a good job (grateful to stay employed, but WTF, I would have fired or at least demoted me). Importantly, this lack of consequence doesn't seem to stem from a desire for forgiveness or positive reinforcement or any mechanism that might make things better.
It seems like there's a general apathy/nihilism growing in society, whereas my entire education from childhood up I was held to strict standards and reliably punished when I failed to meet them, and this was in US public schools (albeit a highly ranked school district) and a public university. That, or I was just raised in a bubble, and the historical examples I referenced growing up and reference to this day are just a case of survivorship bias, and all the bullshit that existed alongside them back in the day has simply been forgotten. I'm not sure, but it is disappointing how little people at large seem to give a shit. Maybe it's a side effect of the obesity epidemic and people just have less energy or something.
Parenting & the public education system is a very artificially constructed bubble designed to reinforce and reward "good" behavior, where "good" is usually defined as "that which makes life easier for my caregivers". That gives kids a falsely inflated sense of how much everything matters: your caregivers want you to mind your behavior, because then they don't have to, even if you would've been perfectly fine playing with mud or swearing in school or watching TV all day.
In real life there's basically one absolute goal, and that's survival. And that's largely assured in developed western countries these days, unless you do something really stupid. Everything else is socially constructed, and pretty arbitrary. There are some decisions that are fairly consequential for what your life will look like (where & whether to go to college, what field to go in, what metro area to move to, which employers to work for, who to marry, whether & when & with whom to have kids), but you will still have a life regardless, it just might be a slightly smaller house or a spouse that you click with worse or less disposable income for travel.
That's also instructive for what decisions actually do matter. Don't do drugs. Wear your seatbelt. Don't get pregnant unless you mean to. Don't play with loaded guns. If you're staying away from major causes of death you're generally doing pretty well.
This is the kind of mindset that felt obvious to me when I was young and resented anyone else trying to influence what I chose to do for myself.
But after growing up and having kids of my own as well as watching others' kids grow up with varying degrees of parental involvement, I have a whole new appreciation for adult caregivers who get involved and help shape healthy behaviors and habits in kids.
> your caregivers want you to mind your behavior, because then they don't have to, even if you would've been perfectly fine playing with mud or swearing in school or watching TV all day.
You've got it backwards. The easy way of caregiving is to just not care. Let kids watch TV all day, swear in inappropriate social situations, and whatever else they feel like doing. You don't have to get involved if you just don't care what they're doing.
But anyone who has worked with kids in an education setting can tell you that this doesn't actually produce good outcomes for the kids. There are occasional exception stories where students with minimal parental involvement lean heavily into becoming successful in life, but the more common outcome is that hands-off or absentee parenting styles lead to poor outcomes for the children, including social and personal issues. It's not just about getting good grades for their own sake. It's about learning how to operate and function within a civilized society, as well as how to balance their own emotions, impulses, desires, and other behaviors as they grow up.
> It so happens that "good" behaviour that we seek to embed in our children is generally the same as behaviour that is good for society.
The behaviour that most of the school system seeks to embed in children is primarily "obey and do as you're told, don't question", which is far from good.
> In real life there's basically one absolute goal, and that's survival. And that's largely assured in developed western countries these days, unless you do something really stupid.
Or just get unlucky: no need to do anything stupid. One can easily die of cancer at 30 and leave a toddler behind.
Yes, easily. My partner died of cancer at 30 despite exercising, avoiding alcohol, going for the screening, and generally trying her best.
Chances are it won't happen to you and your close ones. Perhaps try being grateful rather than dismissive?
[Edit: perhaps we have a misunderstanding as to the word "easily". I'm not saying it's likely, I'm saying it can and does happen without any warning signs and no amount of planning/preparation can save you.]
Nah, not even. I have a mutation called CDH1 that happens to be pathogenic and predisposes me to a greater than 40% chance of stomach cancer. It's a dominant gene, which means there's a 50% chance I've passed it on to my daughter as well.
That mutation is what's known as a Hereditary Diffuse Gastric Cancer (HDGC) gene. It just so happens that the E-cadherin control that suppresses those cancer cells is not processed properly. The diffuse part is what makes it particularly tricky. It's on the surface of the stomach epithelial cells and progresses from there. The only solution is a total gastrectomy (prophylactic if you do it early). No carcinogen necessary. It's found in populations all over the world and pathogenic lines don't even have to be related. The mutation can occur independently in the germline and is passed on. As long as you reproduce before it kills you, nature really doesn't care.
Fun side fact. It also predisposes carriers to 70% chance of breast cancer. As a result many of those diagnosed are women who then find out they need to also have their stomachs removed.
Ouch, that’s a raw deal. Very, very sorry for you. If you don’t mind me asking, what are the consequences and mitigations necessary to live with a total stomach removal?
I would say that there is indeed a problem with modern tech companies where things matter even less than they should. It's not a problem of being raised in a context where everything matters, but that many companies, especially in tech, are totally carefree with their VC money, and we can see that changing in the last months of downturn.
Raising children to care is good and takes lots of effort.
Raising children to leave parents alone usually means the children end up not caring or worse.
Consequences often catch up slowly. It took years for Elizabeth Holmes to be sentenced because it takes time to collect evidence, build an airtight case, and give people their due process.
As I get older, I'm actually noticing more and more consequences catching up with people, albeit slowly. The people I knew who drank heavily through their 20s and 30s are in much worse shape than basically anyone who made an effort to stay healthy. People with poor diets and low physical activity are visibly worse off than others who paid attention to their inputs. I knew several people who got into recreational drugs in their 20s thinking they were safe because they educated themselves before hand, yet who ended up losing jobs, relationships, wealth, and a few who even lost their lives.
I've also noticed more people's career reputations catching up with them. It's not uncommon to interview someone only to later discover that they left a very negative reputation at a previous company where I happen to know someone.
I was very jealous of one of my peers who job-hopped his way up the salary ladder, joining companies and then immediately focusing on nothing other than interviewing for his next salary increase. He rotated through several of the big companies here until his reputation for demanding high salaries and then delivering nothing at all finally locked him out of any company with well-networked people who knew about him. He literally had to leave the state and go somewhere new to escape his past network and get new jobs after 10 years of this.
Consequences do catch up to people most times, but it's not immediately obvious. If you expect immediate justice or for people like SBF to go straight to jail the moment the headlines break, you're only seeing the beginning of the story.
I can't match up your anecdata with mine. I can think of numerous people who have done all the things you have mentioned and have suffered no ill effects. In fact, many have prospered from lying or cheating the system. From substance abuse to habitual lying, there were no consequences, and in some cases great wealth was accrued. A great many awful people have a very fine life out of it, and there is no greater cosmic justice to address this.
Also, one could argue another interpretation of what you are advising is never take a risk, because it will have consequences. Well, in real life, it doesn't always. You can get away with a lot, and people do.
The problem is that time value is extremely relevant. If it takes 10-20 years for consequences to catch up, the person is likely to have already built up an unassailable lead that the consequence barely dents.
> He literally had to leave the state and go somewhere new to escape his past network and get new jobs after 10 years of this.
That's not even that bad of a consequence. It sounds like his strategy was worth it tbh.
Personally I hate this kind of behaviour, but from a maximization POV (Especially in regards to career) it seems like the best move. There is likely some risk of ruin, but the upside appears to be much greater.
> The problem is that time value is extremely relevant. If it takes 10-20 years for consequences to catch up, the person is likely to have already built up an unassailable lead that the consequence barely dents.
I largely agree with you: the big names attached to the resume, the pay, and the effort spent on interviewing skills likely offset the negatives of the reputation (though I also intuitively don't like it because the strategy is rather self-centred).
However, the consequence is rather significant if he has roots. It's harder to pack up and move if one has a romantic partner who is settled into a job at a particular place, and you could also be leaving family and friends. Sometimes one has to move, but typically one has the option to come back, which wouldn't be practical for the person in question. It's still plausibly worth it if he didn't have roots and collected a lot of compensation, but especially when one is older (the commenter mentioned 10 years of work experience), moves can be tougher.
To add another piece of evidence: while the previous poster noted that the initial response by police officers on January 6 seemed less violent than it could have been (though even then, one person was shot and killed), the US Department of Justice is continuing to publish press releases about charges against people involved in the January 6 Capitol attack (at https://www.justice.gov/news , with full records with dates at https://www.justice.gov/usao-dc/capitol-breach-cases). The charges for many of the people involved caught up with them eventually, though it took time.
Separately, to put a positive spin on this, it often takes time for positive habits to pay off. When picking up a positive habit (e.g. exercise and especially learning a new technical skill such as a language), oftentimes much of the reward doesn't come until far later. This is important to keep in mind, especially if one has self-doubts or even a lack of encouragement for trying to adopt a new positive habit in one's life.
I think both you and the GP are correct. I do think that consequences are being detached from actions, at least in the last decade of free money and rapid growth. But I also agree with you on the slow burning nature of small bad decisions made many times over years on end.
With the economy contracting and inflation skyrocketing, consequences should be back in fashion relatively soon. We're already seeing it in mass layoffs and other areas of business.
I think this hits the nail on the head. People are learning the meritocracy they were taught growing up isn't real, so why would they work 80-hour weeks for a 25% bonus instead of a 10-15% one? The calculus gets even worse when the bonus is insignificant compared to the gains from switching jobs often.
For what you specifically experienced, in my opinion, the bigger the organization, the more inevitable this seems to become. To make things worse, the size of the organization isn't limited to just a company or non-profit but extends to the size of all groups involved, i.e. a small charity or non-profit that's part of a huge government program is similar to a small engineering team in a huge tech company. They could do huge things or be completely worthless, and so long as they pass along positive messages up the chain and the org or company as a whole is doing well, then yay, no consequences.
We're (hopefully) at the beginning of a cycle where companies realize they are causing apathy amongst the majority of the employed, and hopefully experiment (and succeed) in providing meaningful pay raises to the lower echelons, which will come at the short-term cost of profits but is justified by long-term productivity. Or we'll just keep devolving into a dystopia.
I think you hit the nail on the head there with the survivorship bias and the raised in a bubble comments. Most people are raised in a bubble because children generally can't cope with how messy and complicated the world is. And systems and companies that last a long time can point to how successful they were because of their good decisions while ignoring their equally bad decisions that really should have undone them had they not been lucky.
The older I get, the more I realize how fragile a lot of human systems really are, but I suspect it has always been this way and it won't change significantly any time in my lifetime.
Your comment itself sounds somewhat nihilistic, so I hope you're doing well mentally!
>The older I get, the more I realize how fragile a lot of human systems really are, but I suspect it has always been this way and it won't change significantly any time in my lifetime
I agree that human systems have always been fragile, but have long been papered-over by things like "decency", "tradition" and "doing the right thing" and in extreme cases, mobs with pitch-forks.
I disagree that it won't change in our lifetime(s) - the extreme polarization and tribal politics will get worse and people will let systems break - or intentionally break systems just so that their team will gain a short-term win. I have no idea what new horror it will take to remind people to be decent to each other again, but looking back at how divisive COVID-19 was, I'm not hopeful.
I took the prior post as meaning "the fact that they are fragile won't change", not that the systems themselves won't change. And I would agree with that: I see it as yet another expression of the human condition. We may try to build order over chaos to make society, but we also keep loopholes and wiggle room for our psyches. I think the fragility of human systems emerges from that contradiction.
Students of history and the arts can get an earlier exposure to this worldview. I think we engineering types can get too focused on technology and imagine everything is innovation and progress. You have to work uphill against your default interests to expose yourself to a longer view and consider that fundamentally modern people with modern minds lived for (many) thousands of years doing almost all the same cognitive things as us, just with different physical props.
Our lungs are constantly in flux as we breathe. But at the same time, we're just breathing and that doesn't really change until our end. I'd say human social systems are much like that.
Thanks for the concern, but I'm all right, I have the privilege of living near the top of Maslow's hierarchy and actually pondering these questions. :) If I'm a nihilist I'm at the "creating your own value system" part. The world is generally a giant blob of apathetic flavorless jello, I can at least inject some sugar and food coloring wherever I'm at. There's also some freedom in that, when people don't care they also tend to give way pretty easily. It's just disappointing, except for when you encounter that rare person that also gives a shit. Part of the reason I spend a lot more time on HN than reddit. :)
My own thought (I know there is a great deal of room for disagreement) is that the J6 crowd saw no consequences for the attempts to obstruct the Brett Kavanaugh confirmation and thought the rules had changed. One of them was shot in the neck, many others are still incarcerated two years later despite a clear constitutional right to a speedy trial. I'd rather the Capitol Police had just cracked heads at this point. You might think one protest was more justified than another, but the differential in response works to dissolve confidence in the fair application of the law. At any rate, the participants in J6 have been broken, so you're not likely to see that again...yet I feel we could get another riot season provoked by police brutality at any time.
There are garbage and tent encampments throughout much of my city, and I am told that nothing can be done about it. I've been invited to engrave something on my catalytic converter. I wonder what good that would do.
> You might think one protest was more justified than another, but the differential in response works to dissolve confidence in the fair application of the law.
In 2018, the Capitol was open to the public and no one broke in. 78 people were arrested in the Capitol on Oct 5, 2018 and charged with Crowding, Obstructing, or Incommoding [1].
In 2021, the Capitol was closed to the public and people broke in. 12 people were arrested on Jan 6, 2021 and charged with Unlawful Entry or Assaulting a Police Officer [2].
Assaulting a Police Officer is a felony; Crowding, Obstructing, or Incommoding is a misdemeanor. Seems there was a differential in severity of breaking the law as well.
I must have missed the part where the Kavanaugh protesters showed up in body armor with zip ties and plans to hold members of the Senate Judiciary Committee hostage.
The J6 crowd broke through windows and members of congress were literally barricading themselves into rooms for protection. Some brought zip ties for the purposes of apprehending individuals. They beat a Capitol Police officer to death with a fire extinguisher.
I’m not sure how one can fail to see a difference between that group and the one protesting Brett Kavanaugh’s confirmation.
This is a more recent phenomenon - I would say the last 10 years or so - money became cheap, tech saw an infusion of billions of dollars.
I would also say "tech folk" are the biggest beneficiaries of this largesse.
My friends who are lawyers and doctors, not so much, they bust their ass for a lot longer and for a lot less.
I agree with you though - there is a malaise in society, folks in power don't even get a slap on the wrist, working hard and being sincere does not get you anywhere, deceit and fraud are the currency of our times.
If you are only 35 you aren't old enough to remember when interest rates were correctly pricing risk. You should see a lot more consequences show up (probably painfully for all involved) as the low-interest BS gives way to people actually having to show a high likelihood of generating positive returns to get funding, and to otherwise-profitable companies retrenching to pay off the overspend from the zero-interest years.
I think you are also seeing the effect of the oligopolization of the world, stemming from the bad rework of antitrust law that relaxed enforcement significantly from the 1970s through now. Any sort of market power is really conducive to this kind of behavior, because almost no one wants to rock the boat if they don't have to, and when you have an oligopoly/monopoly you can abuse, you can often hide this stuff in slightly lower but still excessive profits.
The owner at a boutique engineering firm I worked for told me that in a large corporation the best thing you can do is massively fuck up at the beginning. Everybody would learn about you and then eventually forget what you did wrong. The extra bit of notoriety would help with name recognition and people would think you're a "good guy".
Enforcing consequences is difficult as laws and bureaucracies become ever more complex.
This gives plenty of space for opportunists and tricksters to hide.
You don’t ever have to fear being beheaded by the people whose life savings you stole and you don’t have to face consequences if you have a good lawyer.
To do well in today's world, learn all the rules and where the loopholes lie. Violating the spirit of the law is fine as long as you can lawyer around the letter of it.
I feel your pain. It often seems to me like Quality is on the decline, on many different fronts. Hard to say if it's just my perception. It does make me more fully appreciate it when I do encounter true craftsmanship or excellence -- which though it might be increasingly rare, is still relatively easily found.
> as a 35 year old engineer it seems the world in general has far fewer consequences than I was raised to expect.
I wouldn't say fewer uniformly, but certainly very noisy. Some have their lives destroyed for minor or non-existent misdeeds, others get away with egregious crimes.
Your post mentions how you are surprised you can skate by at work without facing huge consequences for your actions. Most everyone is like you. We are self-centered and worried about our own security, over-analyzing our own problems and barely being aware of others. I don't know if this is a new problem or one as old as humanity.
The Jan 6 riots were possible because, again, the Capitol Police weren't ready to lay their careers and lives on the line "cracking skulls" to defend an old building. Most of them were probably taking in the spectacle and thinking about how exciting it would be to recount it to their friends/family later.
No; warehouse workers, nurses, and other people with extreme work-hour requirements and tons of metrics are constantly being fired for failing to meet quota or whatever.
its the "the higher the pay the easier the job" paradox.
While what you're saying appeals to my biases, I think it's somewhat ahistorical. Not long ago we had Nixon. We had JFK's, MLK's, and RFK's assassinations. Plus Reagan's attempted assassination. We had the Vietnam War. And so forth. If I were an adult during that era, I imagine it would have felt like consequences were slow to come.
After reading your comment, I think you have captured some of my own thoughts about consequences and deserts (i.e., worthiness or entitlement to reward or punishment). I agree with the other comment that replied to you that says that thinking like this is a product of being raised in a bubble.
I am not sure if apathy/nihilism is growing in the larger society. I think that things have always been like this because people have always struggled to find meaning in life. After taking an intro psychology class, I was exposed to the idea that society wants an individual to police him/herself. The "super-ego" that makes one feel guilty for breaking rules and want to aim for perfection.
It might seem that way, but it's hard to say with certainty; there is also probably sampling bias or declinism at work, in that you are more likely to hear about negative events while normal or positive events are filtered out. And like PragmaticPulp said, it can take time for things to catch up, but they often do, and people often eventually get what was coming to them.
I would not agree with general apathy. I see it more as realism. Strict standards are BS. If you get up sober in the morning and go to work, that's like 80% of what is expected from an adult.
> One of the things I don't like about statements like this said in a Data Science context, is that they are true outside of Data Science as well. Executives make big decisions, managers make smaller decisions, nobody can evaluate how good/bad they really were for months or years. Engineers build something amazing, or build a house of cards, nobody cares as long as the money people are happy, even if the business use case turns out to be wrong in the long run.
This is purely anecdata, but I have found that this is more pronounced in a data science context. Managers and executives are (in my experience) more willing to admit they don't understand engineering work product and seek input from technical advisors, and executives and managers deal with decision making on a daily basis and understand that it can be nuanced. But since almost everyone reads financial reports or has to make a chart in Excel every now and then, they know enough to read someone else's analysis but not enough to recognize their knowledge gaps (particularly wrt advanced statistics).
IMO the reason behind this is that a lot of "data science" driven decisions are short term decisions. So you can look at something on a PowerPoint, not really care if it's wrong unless you personally will get fired if it turns out to be wrong, and back out of it a quarter later when it turns out to be wrong. IME there's no shortage of justifications or pivoting when it comes to a decision you made a quarter ago. The consequences are relatively small, so the caring is only bravado, not really caring.
When it comes to disastrous long term decisions, there's plenty of time to get input from multiple stakeholders. I always remember the armies of companies who went chasing after Hadoop because Big Data was going to transform something or other. All the stakeholders were on board, from the CEO and CTO to IT and Engineering management. How much money and time got flushed down the toilet trying to implement and extract value from data with Hadoop. The only people who paid the consequences were the employees at Hadoop companies who thought their stock options would be worth something.
About 10 years ago, I worked at a company that really wanted to use Hadoop for some reason, so I was forced to use it for a project. The amount of data we were processing was minuscule (a few hundred megabytes per run). It could've been done with a simple script on a single EC2 instance for the entire duration of the project without any scalability issues. Instead, I had to provision Hadoop clusters (dev, staging, production), fit the script into the map-reduce paradigm, write another script to kick off the job and process the results, etc. At least we were using Hadoop.
Relying on your data science or marketing department to tell you how good your data science or marketing department is doing, with their own metrics and their own evaluation methods that you don't understand, can only really lead to one outcome.
I've seen this a LOT in my professional group. Many people (who often have PhDs!!) I interview for data science positions seem to know absolutely nothing about the algorithms they use professionally, or how to optimize them, or why they are a good fit for their use case, etc etc etc. I usually see through LinkedIn that these same people are now in impressive-sounding positions at other companies.
I had one candidate who was in charge of a multi-armed-bandit project at their current company. I asked them how it worked, and how they settled on that. Their response was "you know, I'm not really sure, the code was set up when I got there". They had been there for over a year, and could tell me nothing!
> A common one I've seen quite many times is people using a flawed validation strategy (e.g. one which rewards the model for using data "leaked" from the future), or to rely on in-sample results too much in other ways.
It's funny you mention this; we have a direct competitor who does this and advertises flawed metrics to clients. Oftentimes our clients will come back to us saying "XYZ says they can get better performance", the performance in this case being something which is simply impossible without data leakage or some flawed validation strategy.
Where are these jobs where you can interview this badly and still get hired? Because in my experience DS interviews are extremely hard and often expect people to have very high stats skills as well as data structures/algorithms skills at FAANG level.
I think the issue here is that "data science" encompasses two very distinct branches of work. One answers to business needs and the other produces data-based solutions for the product itself, i.e. you might have a data scientist who A/B tests your website design so you minimize your churn rate, versus the team at Uber Eats who maintain the recommendation engine. While the distinction might not always be that sharp, the former makes up the bulk of data scientists in the market (and I suspect the OP is in that boat) with comparably simple interviews, while the rest get the 5-step interview process with the HackerRank test you are more familiar with.
I think the distinction is not so much about the domain/application. Rather, it's that many organisations decided to jump on the data-science wagon and don't quite know yet what qualities to look for during hiring. And to second order, as long as the predictive model is not included in a business process, the overfitting is not as easily visible to layperson stakeholders (and junior data scientists).
These days if you have a company selling cat food or rivets for aerospace or providing taxi service to a random city, or whatever, they might have a few data scientists helping them make "optimized" business choices. Obviously they won't have a very advanced recruiting process for that.
The ML interviews at FAANG are absurdly simple. Design YouTube recommendations, for which canned answers are readily available.
A simple stats question. If I double the number of samples, how much will the confidence interval change? Most FAANG ML engineers can't answer this question.
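(For the record, the textbook answer: the standard error, and hence the interval width, scales as 1/sqrt(n), so doubling the sample only shrinks the interval by a factor of 1/sqrt(2), about 29% narrower. A quick sanity check, assuming a normal-approximation 95% CI for a mean:)

```python
# Doubling n shrinks the 95% CI half-width by ~1/sqrt(2) ~= 0.71.
import numpy as np

rng = np.random.default_rng(42)
for n in (1_000, 2_000):
    x = rng.normal(size=n)
    half_width = 1.96 * x.std(ddof=1) / np.sqrt(n)  # 95% CI half-width
    print(n, round(half_width, 4))  # second value is ~0.71x the first
```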
The Dunning-Kruger effect is strong here. "What I know is what makes me the expert. What I don't know is irrelevant".
The definition of Standard deviation is in chapter 1 of Stats 101.
https://www.google.com/search?q=standard+deviation&tbm=isch
Apparently, asking a Stats 101, chapter 1 question of a so-called "Data Scientist" is too much of an irrelevant question!
> expect people to have very high Stats skills
Or as you have made apparent, expect people to have ZERO stats skills!
Some of the innumerate activities I have observed in "expert" data scientists and ML engineers who have years of experience without once thinking about sample sizes:
1. Using A/B tests to accept the null hypothesis instead of rejecting it (see the sketch below)
2. Squandering $30M in annual revenue because they wanted to avoid a situation/meeting in which they might look like they don't understand statistics. This is hilarious because they simply nodded their heads as if they understood all the calculations, then dropped any other meetings or follow-ups and left $30M on the table
3. Not refreshing a key revenue-generating model for 18 months because they were "trying to figure out" why the AUC was improving when the performance on "golden set data" was dropping
4. Using thresholding and aggregation to produce poor-quality, distorted training data out of rich, perfectly sampled data
5. Trying to use A/B tests to estimate impact even when the control and variant are not independent
All of the above at FAANGs! My coworkers at a non-FAANG company were much more sophisticated. These are the kind of candidates a "build recommendations for YouTube" interview selects: template appliers.
The list of stupidities goes on and on! But yeah, none of them think that a basic understanding of statistics is necessary for the work. The good thing about JavaScript engineers is that they don't have an understanding of statistics and are aware of it. The DS/MLEs, however, are unskilled and unaware of it.
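To spell out point 1 with a hedged sketch (made-up conversion numbers and a plain two-proportion z-test, not anyone's actual experiment): an underpowered A/B test fails to reject the null almost every time even when the variant is genuinely better, so "accepting the null" from it is exactly the innumeracy in question.

```python
# Sketch of mistake #1: an underpowered A/B test almost never rejects the
# null, even though variant B genuinely converts better.
import numpy as np
from scipy.stats import norm

p_a, p_b, n = 0.100, 0.105, 1000   # real 5% relative lift, 1000 users per arm

def p_value(conv_a, conv_b, n):
    p_pool = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (conv_b - conv_a) / n / se
    return 2 * norm.sf(abs(z))      # two-sided p-value

rng = np.random.default_rng(3)
rejections = sum(
    p_value(rng.binomial(n, p_a), rng.binomial(n, p_b), n) < 0.05
    for _ in range(2000)
)
print(f"power ~ {rejections / 2000:.0%}")  # single digits: the real lift
                                           # is missed >90% of the time
```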
> clients will come back to us saying "XYZ says they can get better performance"
Oh yes, good old marketing.
Along with buying off "Industry Awards" – hey, we're objectively the "Best cybersecurity company of 2022!" With a matching "platinum/gold badge" to go on our website! Or buying a place in the "10 Best Products for X" and "Independent X-vs-Y Comparison", another classic.
Because it works. Are your customers not sophisticated? Are they unable (or unwilling) to follow up on defects and outright lies? Or reality simply doesn't matter all that much to them? Humans LOVE a good story more than reality, after all.
Then your contribution as an engineer to your company's success, and hence its longevity and your job security, is strictly inferior to that of marketing. Not everything is the work of evil marketers – a lot of the supplied BS is in response to an existing demand for BS.
I manage, for a client, an application which is the actual leader (furthest top-right, and by far) in the Gartner Magic Quadrant for its category, and in all my years I have never seen a product this bad, where the implementers and support staff are clueless about their own product. And obviously it's buggy as hell.
The people who make the decisions don't use the product. That's almost always the root cause of this stuff. I worked on a system for my state - another vendor came in and 'took over' all the functionality my system handled. Supposedly. 7 years later, my system powers the exception to the mandate to 'use system X', because... they refuse to provide the functionality that they sold the state. Contractually, "we provide feature ABC", but the reality is.. they don't. I even provided them our code to use - it was paid for with public money, they should just integrate it and then sell it to other people to make their product better. They can't even be bothered to take the code and integrate it... they prefer to continually lie and say "we provide feature ABC" when... they don't. It's beyond insane. A large majority of the people on the ground know it's bad/lacking/broken, but ... they have 0 voice in the matter.
Can you do your analysis both ways? Give your customers both, then tell them your method is more modern, but if they want outdated methods you have those too.
Is this the US? I'm concerned about the extremely low bar to get a PhD in Europe... and I'm wondering if that is a global problem, or only Europe.
The problem is that nobody actually wants data science. They want data pseudoscience.
And for the same reason that people tend to want pseudoscience instead of science in any other domain, too. Science is slow, tentative, and messy, and usually responds to questions with even more questions rather than with answers.
Pseudoscience tends to be much more concerned with exuding confidence and providing clean-cut answers. It's what happens when a desire for science meets a need for instant gratification. Along the way, things like blinding and controls and watching for bias and validating assumptions tend to get dropped when they're inconvenient or difficult to explain. And they're always inconvenient and difficult to explain.
> The problem is that nobody actually wants data science. They want data pseudoscience.
Technically, I think investors & owners would want the company to use real data science to improve products & maximize profits.
Everybody in the middle just wants to use data to lie to get promoted faster - because you don't get promoted for actually doing a good job - you get promoted for convincing people you did a good job, and lying is a VERY useful / effective tool.
> I think investors & owners would want the company to use real data science to improve products & maximize profits.
This is based on the assumption that companies are focused on long term profits and stability, and I’m not sure why anyone believes that to be the case anymore. The vast majority of companies are run based on next quarter’s stock price or growth metrics.
I worked on a newly formed data science team coming out of grad school that was tasked with taking some predictive initiatives the company had relied on external consultants to produce and implementing them in-house. The external team’s results always looked exactly like what the business wanted to hear, but they rarely played out in practice. This was in part because the underlying data quality was terrible, and the company wasn’t executing in a way that allowed anyone to actually answer the questions being asked. The consultants would just torture the data until they could come up with a report that would ensure the company would come back the following year. So we spent a lot of time pouring cold water on the business groups who saw data science as a magic wand that would conjure up more money at no cost. But we were never able to convince them to invest in anything that would take longer than a year. Anything that would require a change in their marketing or strategy execution that wouldn’t immediately deliver increased results was just a non-starter. But actual data science requires that kind of investment for long-term payoffs. So the data science team became figureheads, never given the buy-in to actually make an impact on the business, but kept around so teams and leaders could tout being “data-driven” and throw “AI” and “machine learning” into PR and marketing materials.
You aren’t wrong that middle management is looking to get promoted faster. But every single individual, from the employee looking for a promotion to the executive suite to the investors, is addicted to incentive windows no longer than 6-12 months.
LARGE INTERNET DATA COMPANIES. They want the real data science.
For them, data science actually allows them to perform a core business function (target their customers) in a profitable way (one way, asynchronous relationship. Note the complete lack of any "talking to a human being" in your relationship with big tech).
For everyone who isn't a large internet data company with an asynchronous relationship with their customers... what's the point?
Usually, they have only a handful of technical projects that benefit from data science.
In my experience, my multi-billion dollar organization got by with a shockingly small number of "real" data scientists.
I become wary any time someone utters the phrase "show me the data" or any variation thereof. There is a specific type of leader who thinks that within the data lurks a magical solution just waiting to be discovered. There is also the leader who uses data as a trump card to win arguments, and these folks are perhaps even worse. This is not new. The origination of the phrase "lies, damned lies, and statistics" can be traced to the 1800s. I propose the following update:
There are three kinds of lies: Lies, damned lies, and data
I am being glib; of course I do not think all data is inconsequential. Rather, because it is so often used from a place of ignorance or ill intent, it is rendered, on the whole, useless.
It's BS because the people asking for the data do not have the sophistication to actually do a reasonable _analysis_ of the data. Or criticize an existing analysis.
Unfortunately, as many posters here are pointing out, there's plenty of ways to do a correct-looking analysis of the data to get evidence to support your agenda.
Maybe your agenda is right and maybe it's not, but I'd love to hear a story of someone standing up and saying "your consultant submitted a report with glaring flaws, they should not be paid and you should reconsider X." It's more likely the little company just goes out of business or the big company buries the failure.
The person above wasn’t complaining about data science as a whole, they were complaining about data science theater. The scenario where, as long as you put some numbers in your boss's face, they couldn't care less what the real implications are. In cases where you’re doing thorough analysis, you should look for the data yourself, rather than ask someone to market it to you.
> I have only heard “show me the data” when someone wants someone else to support a claim.
I've heard it a lot in situations where somebody is demanding a level of rigor that they themselves do not live up to. This is usually soon after they have framed the conversation around a solution that they want to pursue that also lacks any supporting data. That is to say, being data driven is on net good but it can also just be a thinly veiled appeal to status quo bias (which is itself not a terrible heuristic) or "highest-paid-person-in-the-room" bias.
I am not talking about in the instance of claim verification. I have seen a number of instances where a leader just wants to see data. Not any specific data, just all of the data. There is a belief that data can solve problems if only they had enough of it.
The issue is more that the kind of executive who says "show me the data" is often not numerate enough to understand the limitations of the data set in front of them. (Maybe the solution they see in the data has a tiny effect size or is too likely to be statistical noise; maybe the intuitive conclusion is demonstrably wrong once you apply a Bayesian approach.)
A typical VP will have an MBA and maybe took statistics in high school.
Yep, these are the same people who backtest their portfolio and go "see, if you'd held this exact portfolio I put together through trial and error, you'd have turned one dollar into a million without any additional contributions!"
Not a data scientist, but it seems like a lot of people in business refuse to accept the fact that reality is generally boring, best practices are often "best" for a reason, and meaningful progress is hard. Of course it is possible to be too conservative, but 95% of ideas to improve a business or product are ego-stroking bullshit. Everyone wants the V10 engine to go down the highway at 65 mph, while towing a trailer, and there's only budget for an oil change every 15000 miles; don't look at the transmission fluid, just don't look.
This is the defining pain point for data science, in my experience. There’s no simple ground truth to test competence against.
If someone tells you that the data says their work is good, the only real way to know if they’re right or wrong is to look at what the data says yourself. If 99% of the work is building and 1% is checking something like latency, then you’re likely to have more than one set of eyeballs on that 1%. But if 99% of the work is putting the data together and doing the analysis, then you’re unlikely to have more than one person ever look at that part.
So incompetence goes unchecked (or worse, it is rewarded).
That's the same for many tech jobs. Competence is often only a local thing, subject to politics, reputation, and appearances. There's also no ground truth because the ground changes so fast. No one knows if the technologies mentioned in the OP will be popular 5-10 years from now.
> For establishing competence, you still have to dig in to see what caused the slowness.
Not as management. You just have to see that other people's similar sites are not slow with the same resources, therefore it is possible for your site not to be slow. You don't have to know why you're failing to know that the totality of the people you hired were not as good as the people those others hired.
This is of course barring management failure; but if you're failing at management, that's about the same as saying that your engineers were under-resourced.
Engineering competence is largely composed of the skills to figure out what is causing problems e.g. slowness. If you can't figure out what is causing the slowness, your engineers aren't good enough to figure out what is causing the slowness, qed.
Software Engineering is one of the few knowledge-work areas where you can actually test the result in various ways as a layman. To a large extent you can flush the toilet before paying the plumber, and you can hire a counter-team called QA. QA themselves are tested by future production bugs.
In other disciplines it is way more fuzzy. If you are in the conclusion business and there isn’t a clear path to test your conclusion in the short term you can bullshit away!
Unfortunately, I haven't worked at a company with dedicated QA in the past 5+ years, maybe longer. QA is often seen as a side job for engineers and product teams.
Oh! Dedicated QA makes a big difference, especially when they take leadership and are willing to get involved in, let's call it, QA-ops: improving automated testing and the like.
"Managers will say they want to make data-driven decisions, but they really want decision-driven data. If you strayed from this role– e.g. by warning people not to pursue stupid ideas– your reward was their disdain, then they’d do it anyway, then it wouldn’t work (what a shocker). The only way to win is to become a stooge."
In science, a good scientific result can be bad for business. There is often little appreciation for the "science" in data science.
>There is often little appreciation for the "science" in data science.
It feels like even Google falls prey to this at times: they keep redoing the same A/B test until it comes up in favor of the change (or the designer whose pet project it is runs out of political capital, presumably).
I've been pitched by many "data-driven" vendors offering predictions. They often have very impressive accuracy metrics (RMSE, R2, etc). When I dive into the details these metrics are often reported using in-sample predictions.
I see this pointing to any of the following:
a) DS teams overpromising the accuracy of their approaches
b) marketing driving the narrative and DS getting pulled along
c) incompetence from the DS team
That’s the problem: these metrics often come from overfit models or in-sample evaluation, and are completely unrealistic as estimates of expected generalization performance.
I’m at the point where I never trust performance metrics anymore. Or rather, the worse they are, the more I trust them!
I feel like you might be conflating a couple of things, though I'm not a DS so could be off base here.
My reading of the OP's description is that the vendors were offering interpolative predictions, but did not use a test/train split of data. This is in contrast to extrapolative predictions which I would call out-of-sample.
Thus, by not using a test/train split, they achieved extremely good accuracy because they were testing on the same data they trained on. Even when the task is "in-sample" interpolation, you can't use the same data for testing and training.
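A toy illustration of that failure mode, assuming synthetic data (ballpark numbers, not a benchmark): a flexible model scored on its own training data looks spectacular even when there is almost nothing to learn.

```python
# In-sample metrics flatter a flexible model; a held-out split deflates them.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = X[:, 0] + rng.normal(scale=3.0, size=500)   # weak signal, heavy noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
model = RandomForestRegressor(random_state=1).fit(X_tr, y_tr)

print("in-sample R2:", round(r2_score(y_tr, model.predict(X_tr)), 2))
# looks great (often ~0.9 here)
print("held-out R2:", round(r2_score(y_te, model.predict(X_te)), 2))
# close to zero: the apparent "skill" was mostly memorization
```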
This gave me a chuckle. If you read the featured article, you understand that this is also because management wants "decision-driven data." They have an idea and use DS to provide charts and tables to support it. The harder the idea is to support, the greater the value data science is able to provide.
I guess data science is inferior to research in this way. People care about research methods, rigor, etc… Maybe data scientists should adopt stricter standards, like actual scientists.
I did read the article - some of the problems with judgements of work quality also come up with (hypothetical) well-intentioned truth-seeking non-political long-term-optimizing managers who just don't happen to be stats experts.
Sorry, wasn’t trying to imply you didn’t, and I fully agree. Even managers that know stats can be busy, or buy into hype about ML or other shiny new things that they don’t have time or resources to deconstruct. This is another big problem with data science: “black box” systems and cargo cults. It’s easy to think “LLMs will change the world! We should use them, the competition will.”
> Approaches that are wrong from a statistics point of view
When OP talked about "the main bottleneck to my work" in terms of areas he would need to learn more about -- I was expecting him to talk about facility with statistical methods and using them appropriately!
I'm not sure what to take from the fact that he never did! I would like to ask him what he thinks about that!
I’ve always disliked how data science is positioned within companies as well: it’s outside the critical path of product and engineering, which means it becomes a mere abstraction to management (e.g. “throw that problem to the data science team and see what they come up with”), resulting in very vague and abstract requirements and, hence, deliverables. I think there is huge value in the discipline and technologies, but it unfairly gets relegated when not integrated into the whole product/engineering process. Hence the title/concept of Data Engineer seems like a much better fit for this role within many companies.
Yeah, as a Data Science manager I've experienced this pain a lot (not part of the critical path). I am now an Engineering Manager that works with a cross-functional team including FE/BE/DS/Devops and it's the most power I ever had to put Data Science in front of our clients in a meaningful way.
People don't pay attention to production metrics, and a noisy problem (like marketing or whatnot) can often be pretty bad for a looooonnnnnggg time before anyone notices.
Not being mean to you, just showing how typically the goal posts are moved.
To give you an example from physics, if you find just one experiment that goes against your model, you immediately invalidate the model. You don’t just make grand claims that the model in general works.
> Holist underdetermination ensures, Duhem argues, that there cannot be any such thing as a “crucial experiment”: a single experiment whose outcome is predicted differently by two competing theories and which therefore serves to definitively confirm one and refute the other.
Could be wrong here, but in physics and most natural sciences, you don’t throw away your model if you have one experiment against it.
Usually isn’t it looking for an experiment that proves it and is repeatable?
If I discover a new element in one experiment, the results are published.
After publication, many labs will try to repeat it, and it's not taken away if one can't do it. Only if all can't, and it casts doubt on whether I did it in the first place.
Scientific method? Models are disproved, not proved. Do data scientists not know about science? People usually understand how science works here on HN, but not in this thread.
Example of a test that invalidated our old theory of gravity and validated Einstein's claims:
Models can also be knowingly/intentionally incomplete, which necessarily introduces noise that you are not controlling for. Meaning you have to use statistics, and there isn't really a concept of prove or disprove in the true sense of those words.
Maybe very far in the future there will be models of human biology that are as robust as classical physics, but right now there is such a large amount that is not understood, it's simply not feasible. A drug could work for one person and not another for reasons beyond the realistic scope of the original development hypothesis. It requires a probabilistic view to make any sort of statement about the efficacy then.
I suppose you could argue these models are just wrong and thus trivially disproven, but I don't think that's a productive framing. I doubt any biologist or doctor would claim they have anywhere near a complete model of how their specialty works. That doesn't mean a particular model isn't useful or isn't the best we currently have to work with.
Plus maybe the third best model will actually turn out to explain a separate puzzle piece in an eventual better model. Mechanistic models in biology aren't always well done in practice, but it's certainly not binary either.
The goal posts are only moved if they were in the incorrect position in the first place. Statistics isn’t the science of prediction, it’s the science of uncertainty management. There is always uncertainty, and how well you’ve measured uncertainty only can be accurately assessed over a large enough time frame over a large enough number of events.
It’s like when people got upset about Trump winning when 538 only gave him a 30% chance. That one event tells us nothing. But if all predictions 538 says have a 30% chance of occurring happen 30% of the time, then they are spot on. That’s not apparent with a single event though.
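And calibration like that is checkable in aggregate. A minimal sketch, assuming you've logged stated probabilities and outcomes (simulated here with a well-calibrated forecaster): bucket forecasts by stated probability and compare against observed frequency.

```python
# Bucket forecasts by stated probability; compare with observed frequency.
import numpy as np

rng = np.random.default_rng(7)
stated = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=10_000)
happened = rng.random(10_000) < stated  # simulate a calibrated forecaster

for p in np.unique(stated):
    rate = happened[stated == p].mean()
    print(f"said {p:.0%} -> happened {rate:.1%} of the time")
```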
The problem is that most managers, companies, and people (including yourself, apparently) are statistically illiterate enough to not understand this, and jump head first into data science initiatives expecting immediate results, which is usually doomed to fail, at which point they blame others and not their poorly formed expectations.
There’s plenty of bad data science out there, but most failed data science initiatives are doomed before anyone ever builds a model or analyzes any data.
In a recent past life, I was an HPC (high performance computing) administrator for a mid-size company (just barely S&P 400) in the transportation industry, so I had a lot of interactions with the "data science" team, and it was just a fascinating delusion to watch.
Our CTO did the "Quick, this is the future! I'll be fired if I don't hop on this trend" panic thing and picked up a handful of recent grads and gave them an obscene budget by our company's standard.
The main problem they were expected to solve - forecasting future sales - was functionally equivalent to "predict the next 20 years of ~25% of the world economy". Somehow these 4 guys with a handful of GPUs were expected to out-predict the entirety of the financial sector.
The amazing part was they knew it was crap. All of their stakeholders knew it was crap. Everyone else who heard about it knew it was crap. But our CTO kept paying them a fortune and giving them more hardware every year with almost no expectation of results or performance. It was a common joke (behind the scenes) that if they actually got it right, we'd shut down our original business and become the world's largest bank overnight.
At least it finally gave the physics modelers access to some decent GPUs which led to some breakthrough products, as they finally were able to sneak onto some modern hardware.
My ex worked at a startup where she was hired as the second or third data scientist. Their entire Postgres database dump was 20 MB. And they had three people working full time on analyzing ... that 20 MB.
It’s a great skill to walk into a job and say “hey, I’m the expert; that’s not a reasonable proposal, here’s the problem we can solve and here’s what we’ll do”. Much more value to the company, but hard to do.
It’s also not rewarded in modern companies. People are rewarded more for worthless garbage produced than worthless garbage avoided. You don’t have much to show for yourself when you talk a company down from making a plunge into a foolhardy, doomed initiative. Pretty soon the bean-counters might wonder why you are being paid when you don’t have as much to show for your work as others. See Elon’s “lines of code” decision making.
Yeah I feel a lot of companies could do with running their problems past a consultant first.
Also, w.r.t hiring in cases like these, I think often the experienced candidates can smell that this won't be a good gig so don't apply, while the less experienced (or desperate) ones apply. This means the workers get stuck with an intractable problem, and the company gets stuck with workers who are too inexperienced to know better.
> even the need, for ____ ________ yet try and hire them anyway.
I think that the vast majority of human organizational structures, from individuals to large corporations and countries, have no clue what they are doing. The most successful apply science and just barely keep their heads above water by avoiding utter failure. Most people would describe the Olympics as a competition to find out who is the best amateur in a given sport. No, the Olympics is a competition to see who can make the fewest mistakes.
If you are going to make bold dumb moves, you need a whole lot of margin.
Not the first HN comment I have seen where $real_useful_department borrows resources off overfunded $bullshit_department to get the job done in spite of management.
In retrospect, maybe they made the right call for themselves when the money was pouring in. Probably everyone involved was paid very well for the charade. The ethics can be questionable, but maybe it's some kind of wealth redistribution; after all, the people with money are trusted to make the calls, and them falling for this maybe simply means the money is better off somewhere else.
Unfortunately it seemed pretty clear from the start that this is what data science would turn into. Data science effectively rebranded statistics but removed the requirement of deep statistical knowledge to allow people to get by with a cursory understanding of how to get some python library to spit out a result. For research and analysis, data scientists must have a strong understanding of underlying statistical theory and at least a decent ability to write passable code. With regard to engineering ability, certainly people exist with both skill sets, but it's an awfully high bar. It is similar in my field (quant finance): the number of people that understand financial theory, valuation, etc. and have the ability to design and implement robust production systems is small, and you need to pay them. I don't see data science openings paying anywhere near what you would need to pay a "unicorn", so you can't really expect the folks that fill those roles to perform at that level.
I worked adjacent to the data science field when it was in its infancy. As in I remember people who are now household names in the field debating what it should be called.
At the time I considered going down that path, but decided I did not have anywhere near the statistics & math knowledge to get very far. So I stuck with the path I had been on. Over time I saw a lot of acquaintances jumping into the data science game. I couldn't figure out how they were learning this stuff so fast. At some point I realized that most of them knew less than I did when I decided I didn't know enough to even begin that journey.
Of course, I was comparing myself against the giants of the field and not the long tail of foot soldiers. But it made for a great example to me of how with just about everything there's a small handful of people who are the primary movers, and then everybody else.
> Data science effectively rebranded statistics but removed the requirement of deep statistical knowledge to allow people to get by with a cursory understanding of how to get some python library to spit out a result.
I don't know anything about Data Science, but as a bystander with a mathematical background that's what I assumed was going on, so it's kind of interesting to see it spelt out like that. Like you've put words to a preconception that I didn't even know I had.
>Data science effectively rebranded statistics but removed the requirement of deep statistical knowledge
An important thing people miss is that shallow statistical knowledge can cause subtle failures, but shallow software engineering knowledge can cause subtle failures too.
A junior frontend developer will write buggy code, notice that the UI is glitched, and fix the bug. A junior data analyst will write buggy code, fix any bugs which cause the results to be obviously way off, but bugs which cause subtler problems will go unfixed.
Writing correct code without the benefit of knowing when there is a bug is challenging enough for senior developers. I don't trust newbie devs to do it at all.
Context here is I used to work in email marketing and at one point I was reading some SQL that one of the data scientists wrote and observed that it was triple-counting our conversions from marketing email. Triple-counting conversions means the numbers were way off, but not so far off as to be utterly absurd. If I hadn't happened to do a careful read of that code, we would've just kept believing that our email marketing was 3x as effective as it actually was.
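For anyone who hasn't been bitten by this class of bug: the usual culprit is a join that silently fans out. A minimal sketch with hypothetical data (not the actual query):

    # One real conversion, three marketing emails sent to the same user.
    conversions = [{ user: 1, order: "A" }]
    email_sends = [{ user: 1 }, { user: 1 }, { user: 1 }]
    # Joining conversions to sends on user produces one row per matching send:
    attributed = conversions.flat_map do |c|
      email_sends.select { |e| e[:user] == c[:user] }.map { c }
    end
    puts attributed.size # => 3: a naive count now reports 3x the true conversions

The inflated number is still plausible-looking, which is exactly why nobody notices without a careful read of the query.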
So, it's impossible to know how much of a problem this is. But there is every reason to believe it is a significant problem, and lots of code written by data scientists is plagued by bugs which undermine the analysis. (When's the last time you wrote a program which ran correctly on the first try?) Any serious data science effort would enforce stern practices around code review, assertions, TDD, etc. to make the analysis as correct as possible -- but my impression is it is much more common for data analysis to be low-quality throwaway code.
This is an important point. I used to work in adtech. It's amazing how terrible the modeling is in that space. You can generate a model that identifies a given target audience and simply assert that it works without any real validation.
On the flip side you used to have statisticians writing code that is frankly unusable in a Production environment. You would weep at the R code I've seen and had to turn into something to actually produce business value.
There is a bit of a joke that a data scientist is someone who can do better stats than the average SWE and write better code than the average statistician.
Both of those are relatively low bars to clear though
The way I heard the joke was "a data scientist is someone who's not good enough at math to be a statistician, and not good enough at programming to be a software engineer."
This is exactly my point. Let subject matter experts in their respective disciplines handle what they know and communicate through the lingua franca of R. Most data scientists/statisticians probably shouldn't be writing production code, I think that's ok. It's a failing of management to think that coding is coding and not understand the value of true engineering ability.
My first job basically consisted of taking code in FORTRAN and translating it into C++ with robust testing and engineering, and then frontending that code into a ton of spreadsheet packages. So you had quants doing quant work, software engineers doing software engineering, and analysts and traders being analysts and traders, instead of having quants fail at all three, which is more or less what data science is.
When the R/stats guy quits and you have to figure out which of his 7 notebooks to run in which order, which local files need to be in which local directories, which versions of each package are now broken, and which code you need to rewrite to fix it, you start to realize the value he produced was clicking a lot of buttons in the right order, and that overall this doesn't scale at all.
Yeah, but I meant that because the business value is in the stats, and there is such low quality of stats in the field to begin with, it’s borked no matter what.
There’s no point in fixing it. You can just pretend like you did. But if the stat work is quality, then it’s worth the effort to optimize.
> Data science effectively rebranded statistics but removed the requirement of deep statistical knowledge to allow people to get by with a cursory understanding of how to get some python library to spit out a result.
That's a good way of putting it. I remember in my first calculus-based probability+statistics class in college, I felt incredibly challenged by the theory. I wondered why there are so many probability distributions out there, why the standard stats formulas look like they do, what "kernel density estimation" even is, etc.
On the other hand, my data science course did include some theory, but a big part of it was also learning how to type the right commands in R to perform the "featured analysis of the week" on a sample data set. Something about these lab exercises felt off; it felt more like training than education. The professor expressed something along the lines of: if we wanted to go far with this in the future, he would expect us to design the algorithms behind the function calls. I think the analogy he used was "baking a cake from scratch rather than buying a ready made one at the store."
> But there’s also a part of me that’s just like, how can you not be curious? How can you write Python for 5 years of your life and never look at a bit of source code and try to understand how it works, why it was designed a certain way, and why a particular file in the repo is there? How can you fit a dozen regressions and not try to understand where those coefficients come from and the linear algebra behind it? I dunno, man.
This is true everywhere. As a professor, every semester I’m baffled by students who aren’t curious. But I’ve come to terms that there is a difference between those who will graduate and go on to be readers of hacker news and write this kind of article, and those who won’t.
To counter your professor opinion: the amount of extra time I had as a student to pursue things of interest was in the negative. All academic time was spent getting course content accomplished.
I am a naturally curious individual, but time limitations prevent further exploration in most circumstances. Additionally, there is a relevancy factor weighed on top of it. If something looks curious I have to pre-determine whether I think the time spent pursuing that rabbit hole has any value to it. Granted, you never know the outcome - it is always a gamble.
Well good luck then, in my experience the most free time I've ever had in my life was during college. I squandered massive amounts of that time doing things completely unrelated to education, and I definitely don't regret doing that. College isn't just about book learning after all. But still, BY FAR, college is the time of my life when I had the most free time to do whatever I wanted.
Yeah I hardcore disagree with this. Partly my fault for saying yes too much, partly my work schedule, partly being in a weed out program that really worked you to the bone.
Some semesters I was doing like 70-80 hours a week on average, split between managing clubs, homework, attending class, working part time jobs, and studying. One week I remember being busy from 7am to 2am for 6 days straight. A few semesters I had a lot of free time, like second semester of senior year and first semester of freshman year, but mainly it was the gaps - after midterms, during breaks - where I had obscene amounts of free time.
Interesting. I had a very different experience. Double major, working in two labs simultaneously, active member of local ACM, interned with local startup during the school year, volunteered at a local soup kitchen. All that together was about 50 hours/week. Academics (including homework, studying, etc) was only 25 hrs/week on average. But I was very fortunate to have the advantage of not needing to work, which gave me the freedom to scale back my hours on a particularly busy week.
I learned a lot from my CS classes, but I actually felt like most of the value from the degree came from overhearing random chitchat between professors or other students and the reading more about those ideas and experimenting with them in my free time.
I had quite a lot of free time when in college, but I still feel like I had less time to pursue my interests. Reason being that the course itself was intellectually demanding while also being quite prescribed about what you had to learn. Meaning I ended up using all my mental capacity grinding through a bunch of stuff that my professors wanted me to learn, leaving me with much less time to go off and learn what interested me.
Both before and since, I've had more free capacity to pursue learning for its own sake.
Middle and high school is where a lot of students learn to stop being curious due to a lack of time. College demands far fewer hours per day, but it can be hard to forget what was taught previously.
I have to agree with you. So many of my professors have been vocally disappointed with their students for their lack of intellectual curiosity after it had been beaten out of them through the overstuffed schedules and pointless busy work of K–12.
If you are someone who is on the cusp of a better grade at university, then any curiosity time is better invested in restudying the past exam papers. I think a PhD has more of a curiosity culture, at least in the first year, but I never did one.
Also, hard subjects at uni - there is only so much deep thinking you can do per day.
It depends on your courseload that semester. When I was taking organic chemistry I would spend a good 8 hours in the library a day, Monday through Thursday, on top of class, which would open the weekend up for partying. Wake up at 10 for class at 11, then straight to the library with the occasional break for meals or other classes until 11pm or so, whenever I got too tired to continue. By my senior year, when I was just taking interesting electives, I was totally coasting, probably throwing in 2 hours a week in the library in total.
My sense is that your program at school had a light workload - so a difference in experience. My peak workload so far in my life was at college - I had over 40 hours of class time a week, on top of which you have to add homework, projects and exams. It was a grind.
Since then workload has been intense of course but never comparable. I've had much more time to be able to explore personal interests since college.
Yeah, I wish I'd had "free time" in college. 60-70 hour work weeks were normal - 20 hours a week in class and then a full-time load of courseworks / readings / labs etc. I couldn't afford to take time off on weekends for the first 3.5 years. It was horrendous.
Once I started full-time work it was like a revelation - finally I don't have to work on evenings and weekends! I actually get free time to myself! I can have hobbies!
Where did you study? I was working full time while doing university... It was HARD... but I was nowhere near 40 hours of class a week... unless you did the whole degree in 2 years?!
That’s impressive. Between work (30 hours a week) and classes (full time credit load), I’ve never had less free time than when I was in college. And I’m speaking now as someone with 2 young kids and a full time job. Something tells me your experience is not commensurate with the standard college experience. Perhaps you didn’t have a full time job or only took part time credits?
> Something tells me your experience is not commensurate with the standard college experience.
I know very few university students with significant work commitments.
In the US, the stereotypical college student is not also holding down any kind of job. Maybe 5-7 hours of "work study" (light work running the reference desk at the library or working in the dining hall).
Frankly, I doubt the majority could learn a lot and also work a significant number of job hours.
At a community college, it would be very different - most students also holding down jobs, I would guess. At a flagship state university, I would be very surprised.
Evidence in [1]... about 30% of full time students are working 20+ hours/week. Also apparently I was wrong about the low hours being typical; less than 10% are working but < 10 hours/week.
Yeah, this. I don't have any data to support this, but when I was in school, MOST people didn't have close to full-time jobs. I had a job where I probably worked 10 hours during the week at night and some full 8 hour shifts on the weekends. Most of the people I went to school with (and I would assume, maybe wrongly, that most people in better schools than I went to) didn't work AT ALL while they were in school, it was just those of us less than wealthy folk who actually had to work to have spending money and money to pay for books etc. I don't think my work load was overly demanding, but I was a Comp Sci major, fwiw.
Sounds like you weren’t in a competitive program that constantly tried to get people to drop from college altogether.
I was. Didn’t have anywhere near the free time and the lack of stress I do post-college. Helps that I also make a good chunk of change rather than living off a relatively small stipend in one of the most expensive cities in the world.
This was true for my undergrad, but my graduate program demands almost all of my free time, including weekends. Although, this may mostly be due to a drastic change in field of study from the two (social science to computer science) where I probably have to dedicate more time than those that already have knowledge/experience in this field.
I worked 10-15 hours a week (and somewhat more in the Summer) for about three years of college and can confirm, still the most free time and lowest stress ever. Worst, by a mile, was high school, and I even had a pretty good experience there. Far worse than working a full time job while having multiple young kids, even. Worse than before we had kids but when we made very little money and struggled to pay the bills every month. High school is terrible.
Did you happen to attend a prestigious school? I find that the level of rigor (and corresponding freedom) varies tremendously from program to program.
I did my undergrad at a state school with a middling engineering program, where I had ample free time to explore topics in depth, pursue extracurriculars that taught me far more than my classes, and have a thriving social life.
Contrast that experience to what I saw as a teaching assistant at Georgia Tech: undergrads who are so full of classwork that they're punting on the least-valuable graded assignments, never mind extracurriculars. The level of rigor in courses is much higher, but it presses out freedom to explore independently.
Another datapoint: I competed against GT extracurricular teams during my undergrad years, and we beat them handily almost every time because their students couldn't justify high effort for work that wasn't graded. I once saw a GT team arrive a day late to a competition, work on a robot for three hours at the adjacent table, realize their robot did not work, and drive home without competing.
Nope, nope and nope again. I refute this utterly, as a teaching academic.
Contact hours at most universities are around 2-4 hours per week per 15-credit module. To gain a degree, you have to take 120 credits a year, typically two terms of 4 x 15 credit modules, or 8-16 hours of contact per week maximum with the entire summer off.
You therefore have at least 24 hours a week to study on your own to bring your working week up to 40 hours. Maybe you're working, fair enough. But if you don't have time to study subjects in depth then you need to reduce your working hours. If you can't, then by definition you are not a full-time student.
This is not a personal attack on you. Perhaps you were genuinely studious and spent all your time poring over the coursework. It is a commentary on the whole academic sector where we repeatedly see students do nothing for most of the time and spend the last 2 weeks cramming and putting in substandard assessments, then blame the course material/their lecturers/their anxiety etc. for their poor results. And of course the leadership teams lap it up and tell us to make our courses easier.
No personal attack taken but your experience and points fail to win me over.
The difference probably lies in the rigor of the program. It sounds like you are working in a non-engineering based program. In our engineering programs we had 40 hours of class time + lab time per week.
I had a concurrent arts degree at the same time, which was, in comparison, an incredibly light workload - though it did take time away.
The only time that I will say was much lighter was in the final year of undergrad - the course load finally lightened up.
N.B. this whole conversation clearly excludes summer.
> The difference probably lies in the rigor of the program.
This is anecdata of course, but my experience with a top-3 US undergrad aerospace engineering program in the late-90s, early 2000s was around 15-16 hours of class time per week, sometimes increasing to 18-19 or so with labs. Work outside of class was 3x this or maybe 4x around midterms or finals.
You posted this elsewhere in the thread, where I replied that this is not normal in the U.S. Can I ask what university and degree program it is where students have 40 hours of class and lab time per week?
They mentioned engineering programs in their comment.
40 hours isn't normal even for engineering programs. Every engineering program I've looked at has higher course hours and credits required. Obviously I haven't looked at every single engineering program at every engineering school, so there probably exists some counterexample showing it's no different than arts or science.
Where I studied, we had one semester with 40.5 hours of lecture, lab, and tutorials. One other semester was around 38 or 39 hours. The rest were in the mid-twenties for lecture, lab, tutorial. My program wasn't the typical engineering program, but all of the other engineering schools where I went (Western Canada) did require more credits and more class hours than science and arts and business programs. There may have been some exceptions with honors programs (meaning they have to take 132 credits vs 120 credits and write a thesis) in arts and science that put them closer to engineering programs, but these have limited enrollment.
Where do you teach? Where I have gone, 1 credit meant 1 hr of lecture and an expected 2 hr of study outside of lecture. Therefore 15 credits means 45 hours a week of study before you get curious about your field.
For example here is Purdue's handbook on credit guidelines:
I see lots of concurring and dissenting opinions here, and will add one more:
For context, I double majored in two adjacent subjects, physics and math. I went to a state school that has a very strong physics program. I also worked in a physics lab for the last ~2 years, and graduated a semester early. While I did OK academically, I had no desire to run the gauntlet again in grad school, and left to work in tech.
I have never, ever, been as a busy as I was in college, nor do I ever want to be. I think that's a good thing! I have much more time to explore things that don't pan out, to do things I know are not "productive" (i.e, play video games), and am generally happier.
Apart from quality of life improvements, I think there are additional financial and intellectual benefits to not being overly burdened -- the time to explore topics that were not immediately adjacent to my field of study results in extremely useful skill development and better cross-pollination of ideas.
> The amount of extra time I had as a student to pursue things of interest was in the negative. All academic time was spent getting course content accomplished.
Did you go to a "good" school?
I went to a mediocre one for undergrad and a top school for grad. The one glaring difference I saw between the two: The top school's undergrad program gave students way, way too much busy work. All that work didn't give any insights, and was merely used to artificially distinguish students for grades. Their grad program was nothing like this.
Really glad I went to a mediocre school. Still learned everything, but had plenty of time to explore.
You must have been a very sincere and disciplined student. :)
In my case, when I was studying, I had all the time in the world. During that time I did try to learn many things, but I didn't go deep or stay consistent. I mostly wasted time goofing around. Looking back, the time I wasted during my college years has become the biggest pain of my current life, now that I have neither the skills nor the time to learn them.
I think this inclination to be curious can still be apparent even when someone doesn't have the time to pursue that inclination. It will be more subtle, but I think it's something rather fundamental that applies in broad ways across our lives.
> But there’s also a part of me that’s just like, how can you not be curious? How can you write Python for 5 years of your life and never look at a bit of source code and try to understand how it works, why it was designed a certain way, and why a particular file in the repo is there? How can you fit a dozen regressions and not try to understand where those coefficients come from and the linear algebra behind it? I dunno, man.
Because there's a lot of things out there which are also interesting, and you don't have time to do all of them, so you choose. And different people choose differently.
I’m surprised I had to scroll so far to find this response.
It was a revelation to me when I realized that, no, it’s not that “most people” lack intellectual curiosity. Their interests are just different than mine.
I'm pretty curious, but I wonder whether I would have come across that way to my college professors. I felt like college stifled my curiosity. Undergraduate courses rarely care about original or creative work, or about students pursuing their individual interests. They more or less want students to learn what the authorities in the field think.
I did student representation while I was at college, so I had quite a bit of contact with teaching staff around discussing the learning process. There were a lot of complaints from their side that students weren't engaging with the course and were rote learning answers for exams.
My perspective was that most of the courses were badly taught (students were given little guidance and struggled to learn the basics) AND badly examined (you had to guess at what the professor wanted in order to score well - it wasn't actually assessing learning accurately). The courses where you found truly curious students were the ones that taught the basics in a way that other professors would consider hand holding (which meant students could get past the basics and on to more advanced material), and gave clear advice on what was expected in the exam and how to approach it (so that students didn't have to worry about that and could focus on learning and their interests).
You'll always get some students who just aren't interested (perhaps they picked the wrong course, or simply aren't that academic), but you'll also find that the same students respond dramatically differently to different environments.
Curiosity is good, but there is so much stuff to learn out there that for many fields learning things deeply is much less important than learning a lot at 25-35% depth.
Not everyone is wired that way. Personally, I have taken apart and reassembled most of the tech stuff I have at home simply because it interests me how things work (and broke and repaired a non-negligible amount of them in the process, to add), I've dabbled in repairing cars, gas boilers, do my own electricity work... but in my social circle, I'm pretty much the only one. And as I grew older, managed to land myself an s/o, I kind of get why - two other parts come into play:
The first issue is that a lot of acquiring broad-spectrum knowledge involves risking quite a lot of money. A good DSLR cam can easily rack up a few thousand euros; a fully spec'd Mac Pro or larger drones cross into five digits without blinking. Messing around with gas and electricity can kill you, messing with water pipes can cause immense water damage. It takes a lot of ... let's say recklessness to even think about dealing with this if you're not a professional, and you have to have the resources in the first place.
But the real issue is time. Students, at least here in Europe, don't have the luxury of taking six or seven years for their basic diploma - "thanks" to the Bologna reforms, you're fucked if you can't make it in the designed timeframe as you won't be eligible for most kinds of financial aid. That means you simply cannot afford "wasting" a week to get that deep level of knowledge, you simply are happy enough if it runs well enough to get a passing grade. And once you've entered the workforce, it becomes even harder to have actual hobbies. It's one thing if you live alone, no one will bat an eye if you pull an all-nighter on a weekend with just yourself, a crate of beer and a laptop, and that's assuming you're not completely drained from your average 40 hours work week, 10 hours of getting to the workplace, and another 10 hours on domestic chores. When you live together with another person, the game completely changes: they also want time and attention from you - bonus points if your s/o has roughly the same interests that you have (which is why I suspect so many people meet their s/o at work). And with children... forget about hobbies of any kind if you don't have enough resources for either yourself or your s/o to be a stay-at-home parent.
This is why I so strongly advocate for a four-day and six-hour work week, a proper minimum wage and government-subsidized affordable housing for everyone. Just imagine what useful things people could run as side projects if they actually had the time to pull them off, not to mention the obvious physical and mental health benefits of not having to struggle with survival every single day. Add to that the elimination of "bullshit jobs" and an end of wasting the best minds of the world on financial bullshit (i.e. HFT, "quant investment funds") or advertising... or getting rid of racism and other discrimination. We as humanity could make so much more progress if we were not so hell-bent on exploiting each other.
Sadly, for those that depend on BAföG (the government assistance), this is the reality. If you can't manage to finish your study in the expected time ("Regelstudienzeit"), you're out from assistance [1]. If you take double the expected time plus another year, you're forcibly evicted [2], which was quite an issue for some people who were admitted under the pre-Bologna rules [3]. You're also kicked out as well if you fail one course three times [4], which is what hit me as I'm a better hacker than a mathematician.
In the end, it's clear that the Bologna reforms were not just about creating a common European academic standard - which is a good thing - but also about turning universities into factories with no room left anymore for the eccentrics who wanted to spend more time on their education than the system wants, those who need more time for specific subjects or those who need a job to survive and can't keep up the same workload as rich students can.
Having relatives and friends across Europe (UK, DE, IT, ES, FR, PT) I cannot understand the parent comment (not having enough time for curiosity). I see only "PARTYYY!" in the universities... plenty of free time -- and typically more than enough money. Especially with the "Orgasmus" ehhh sorry! "Erasmus" program. Do all my friends have an IQ of 160? I really doubt it.
Rather, you have friends with money. My university circle was not "rich kids", but those who either had to fend for themselves with government assistance or with joint work-study programs ("Duales Studium").
Seems pretentious to me. I’ve never bothered to look through many things I use. I look extensively at how to use them and what the API offers. I have a good intuition for how most models work. I don’t really care about the specifics of the implementations.
I have more important things to do. The hacker mentality, imo, is about identifying what’s useful for you to explore to accomplish whatever you need. Often that’s a lot of glue between things that other people built. Other times it’s tweaking the internals to do something a bit different.
If you think it’s about implementation details, you’re misunderstanding. It’s about understanding the principles behind it.
As an example, it’s more about understanding the statistics and linear algebra around estimating uncertainty in GLM regression estimates, than about reading the code for how the statsmodels library implements that.
This is not about the hacker mentality. This is a researcher mentality from a daily life perspective. Some people just aren’t curious. I like to understand the math and the computing models behind many things I use. That doesn’t mean I want to know what’s happening in Windows internals or something just because I use Windows everyday. But if I’m creating an app connecting to Office DLLs, I want to know what it does beyond “here’s a bunch of methods and constants you can use”.
I’d further argue that the nature of a hacker / power user is to break things apart once you want to get deep enough. If I need to know where in the cluster my instance of some software got lost, I should be able to investigate with all the tools I have available to somehow find it. Not just give up and say some garbage collector will get it for me.
I see what you're saying, but the post above seems to indicate that understanding those models SHOULD be important to you if your job is to run those models and explain results and make corporate decisions based off your forecasts. You shouldn't be just passing data through and thinking that it's not your job to actually understand things. The subtlety matters a lot as the software hitting some edge case could completely skew the results. There is a vague line drawn somewhere that tells you what is necessary to learn and what is superfluous. Finding the line isn't easy, but those that label too much as superfluous will likely get more erroneous results and that is a problem.
With regards to your API statement, I'm just as guilty regarding reading the code, but I do run some manual tests to ensure that my script calling the database actually does what I think it should. Is that good enough? Who knows :)
That's all fine, as long as you still understand the underlying assumptions and pitfalls. Many people who skim documentation and throw things together haphazardly do not.
I agree with you. Super powers are seeing value in doing something, and then finding the easiest and most efficient path to get there.
That said, sometimes I do like to read the code in libraries I use but often this is more for enjoyment with occasionally learning something interesting.
It depends. Sometimes for work or personal reasons I just need to get something done and API docs, etc. is all I need for that. And sometimes I dig deep. I have over 50 US patents and enjoy technical work, so sometimes I do dive deep.
> Meaning you could absolutely suck at your job or be incredible at it and you’d get nearly the same regards in either case.
Story time.
There was once a junior data scientist at Shopify who had learned Python and SQL and was tasked with figuring out how to fix their "broken app store recommendation engine", but since they didn't know Ruby, they asked for my help in figuring out what was going on.
Well somewhere in the soup of math was a fuzz factor at the very top. Think of it like:

    factor = 0.something # Not 100% sure what the decimal portion was.
    some_complicated_math_that_maxed_out_at_one_pt_zero() + rand(factor)
Now the thing about Ruby is that rand is basically broken for floats. The docs even warn:

> Negative or floating point values for max are allowed, but may give surprising results.
So basically what they thought they were doing was introducing a bit of randomness that would hinder others from reverse engineering their algorithm. What they actually did was make the recommendation algorithm fifty percent total noise. Yes, it's true. On every load half the recommended app scores were noise.
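For anyone curious about the mechanics: per the docs quoted above, Kernel#rand truncates a Float max to an integer, and rand(0) means "give me a float in [0, 1.0)". A minimal sketch with a hypothetical factor (the real decimal is lost to history):

    factor = 0.3                    # hypothetical; see the "0.something" above
    p Array.new(5) { rand(factor) } # rand(0.3) behaves like rand(0): floats anywhere in [0, 1.0)
    p rand * factor                 # what was probably intended: a float in [0, 0.3)

Since the handcrafted score maxed out at 1.0 and the noise term also ranged up to 1.0, half of every displayed score could be noise.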
They fixed the bug and I'm sure a ton of balance sheets for businesses around the world are markedly different now because of it, but I never heard of it again.
This is one of the core problems with data science.
Having a subset of totally random recommendations wouldn't be a totally terrible idea---especially if you know which they were! It could help push the system out of local minima and it's the obvious benchmark to beat.
I have to agree with a lot of this - I started my career as a data scientist right out of a STEM PhD, back when the term just started coming into existence. At the time, anyone who wanted to get hired as a Data Scientist needed to be trained as a professional scientist, i.e. have a PhD. At first my expectation was that the purpose of my job was to apply the scientific method to solve business problems by leveraging the company's own data as the empirical evidence - whether I did this using machine learning, Excel tables or a chalkboard didn't matter. ML was a barely used term at the time; the first version of Tensorflow wasn't released until later that year.
But over time, the higher up I climbed the more I realized the job had marginal business impact. Usually a big company would hire a bunch of PhDs with fancy degrees, stick them in some "Advance Analysis" department and leverage them as internal consultants, which just meant creating some models, writing a PowerPoint deck, and getting a pat on the back from the execs - not a single model would ever see the light of day. I got all the way up to Director this way, before calling it quits this January; at the end I had basically nothing to do except work on "corporate AI strategy", which meant writing presentations and white papers for upper management.
It was a comparatively easy job; one could coast their entire life in some of these corporations - especially in government-sanctioned oligopolies like banking.
> the higher up I climbed the more I realized the job had marginal business impact
Do you have any observations why? I'm a pretty lowly business analyst, but my observation is if you don't own the decision making (usually by having profit and loss responsibility), you can't have much impact. Possibly it's the companies and industries I've worked at, but at the end of the day if the results don't meet expectations, it's the business owner that gets fired and not the people providing the recommendations.
For the same reason science takes hundreds (or thousands) of years to develop.
All the "intelligence" takes place in the humans who design experiments to collect unambiguous data. "Data" absent a profoundly intelligent (and expensive, fraught, ...) experimental design is basically useless.
Here is an example: target metrics are heavily manipulated and people don't really want to know what's going on. At my first job the Director of Product would change the way a target KPI was measured every few months but would not back-propagate the changes; the end result was that to upper management the product always looked good, because the product owner would just redefine the metric in a way that made the numbers go up. This was at a multi-billion market-cap company in the S&P 500, and this particular person was promoted two levels to managing vice president in 1.5 years.
Basically, like some other people have already said, companies are inherently political - they do not want data-driven decisions they want their decisions to be data-validated. If their view of reality aligns with the data that is all the better, but if it doesn't, their alignment takes priority. Moving up as a DS then involves delivering "evidence" that fits whatever narrative your boss and senior management want. Sometimes that evidence will be rock solid, other times there is no evidence. That's why I suspect in the beginning they loved hiring STEM PhDs from "elite" universities. If your degree is from Harvard Astronomy Dept, people will borrow your credentials to further their agenda - because you got a golden halo.
TLDR: science is not gospel, it's just a method of thinking to deduce natural laws. If you keep digging you can find your initial assumptions proven wrong, sometimes completely wrong - and in business and politics, if you dig too hard, you start finding things that nobody wants to hear.
Regarding your point about owning profit and loss, that is very true as well. In my second job I was in a center of excellence team and it was extremely hard to get any traction because we didn't own any sources of revenue, so we were a cost center like HR or Accounting. Teams that owned LOBs want to hire their own analytics rather than "outsource" to a COE team, as a way to retain control and expand their own power base.
Would I ever do it again? Who knows, maybe. I still believe it's possible to do good scientific work outside of academia (not to say good science always gets done in academia either). I am living off investments and savings right now and working on hobby projects that may or may not pan out. People always take less-than-ideal jobs for want of reality.
I think there is real value in scientific analysis in business but it's closer to operations research where you solve complex optimization problems that are directly pertinent to the core business (like traffic routing or container packing) than in busting out the latest DNN techniques.
So true. This article accurately describes DS at many companies.
The preceding sentence is a hilariously cynical zinger:
“Those who have seen my Twitter posts know that I believe the role of the data scientist in a scenario of insane management is not to provide real, honest consultation, but to launder these insane ideas as having some sort of basis in objective reality even if they don’t.”
>> "Managers will say they want to make data-driven decisions, but they really want decision-driven data"
Ooofff. This is too true. How often is it the case that data is collected to test hypotheses rather than to confirm priors?
Find me some evidence of WMDs in Iraq! Yessss Sir!
I've found this to be the rule, not the exception. Pointing out extremely obvious (to me? Maybe I'm just unusually good at it? I don't even have much formal science training, and hell, barely any math training by the standards of HN folks, though) damning errors in experimental construction that should invalidate the whole thing won't earn you any friends, even if you do it before the work is undertaken, and even if you're telling the person who's claiming to want good data and useful results. Everyone seems to just want a veneer of science to what they're doing, not actually good efforts at it. As long as you have a paper-thin layer of justification that falls apart if anyone looks at it long enough, that's considered good enough and people will sit around in meetings nodding along.
Of course, in many situations the business totally lacks what it needs to correctly do the "data-driven" stuff they want to, and it'd take a good deal of up-front effort by competent people to get it, amounting to entire new projects or deep modification of existing projects.
So, given the choice between: (1) going without that stuff and acknowledging that a lot of what they're doing is guesswork and gut decision making, or simply arbitrary; (2) putting a smaller but still-large amount of work into finding out what they can glean from what's available; (3) spending the time and money to collect what they need, the right way, to do the data-driven decision making they claim to want to do; and (4) insisting they're doing things "data driven" while having all their data hopelessly ruined by e.g. selection bias and comically bad experimental construction that can't possibly be yielding reliable results, so they can cheap out and get no actual "data-driven" benefits aside from falsely claiming that's what they're doing—they tend to go with that last option, nearly every time!
That one stood out to me as well, but, to be fair, this predated current 'fashionable trend' for data driven decisions. It is, sadly, not a new development, but something to still be overcome.
This especially sucks if you are the middle manager. You know that what you are asked to do is complete BS, but you have to somehow communicate it to your underlings (who see through the BS) without using sarcasm or snarky remarks.
Rather than wanting to confirm priors, I believe this usually is a problem with neither the PM nor the data scientist ensuring that the problem formulation is good enough before diving in. I.e., what data would be needed to actually test the hypothesis? Do we have that data or not? Is the hypothesis even formulated in a way to be falsified in theory?
I've seen so many analysis tasks where data scientists, without questioning, went away for a few weeks to crunch data and came back with some random graphs and statistics that are completely useless as decision support.
You're overthinking it. Executives and managers quite literally want to see data that confirms their existing convictions and beliefs so they can act on those beliefs under the guise of it being "data-driven".
I made this same transition from data science to data engineering about 18 months ago and I've never looked back.
I hated working with bad code and dealing with arrogant PhDs who don't value good code. I've seen so many terrible Jupyter notebooks just copied and pasted into VS Code, with the data scientist washing their hands of it and calling it "production ready." Here's a conversation I've had multiple times:
Me: have you ever considered not making every variable global scope
Them: that's just software engineering. We do machine learning
Me: if it's just software engineering, then why can't you do it?
Meanwhile, automated data science tools are getting halfway decent. If you know what algorithm to pick and you don't need to run millions of records through the model every minute, your standard business analyst could probably get a solid model going--at least as well as most data scientists for all the reasons the article mentions.
And I like that I know I can do data engineering. With data science you can never really know if you can hit your target metrics given the data you have. So data scientists end up encouraged to fudge their results or make sloppy decisions. With data engineering I can say "yes this is doable or no that's not" and people believe me.
My prediction: there's value in the massive volume of data but most of it can be had through standard dashboards, some summary statistics, a graph network, or maybe a linear/logistic regression. Most data science is BS and companies aren't getting the return they need to pay for these guys. (And good God, you almost certainly don't need a neural network.) Meanwhile, data engineering will get integrated into software development, and machine learning—by virtue of its proliferation through academia—will just become another tool for software developers. Data scientists won't get laid off en masse, but they will go the way of the webmaster: either pick up new skills and evolve or move on til they end up with new titles.
This resonates with me so much, I stumbled into data science out of University a decade ago. Left it to do SWE and came back to it in the last 3 years.
So many data scientists are full of themselves thinking they are magicians and software developers are blacksmiths who are beneath them.
Incrementally, at my company, the SWEs have automated so much of the data scientists' workflow that they end up just as you describe: using the tooling and being relegated to becoming analysts.
After 3 years coming back to this field, I see the writing on the wall: In the 90's most models were created by software developers, in the 2030's most models will be created by software developers.
Correct me if I'm wrong because I'm on the receiving end of such models, but I feel that many times a couple of linear regressions, surveys and qualitative work with customers could land much better results.
I say so because I've had time to read some of the reports that DS teams produce to drive decisions in my BIGCORP, and they make very little sense most of the time.
And we suffer from it because we have direct contact with clients, but nobody cares about my department's opinion; they would rather believe in some model with insane dispersion in the data points when they're plotted in reports - conclusions by people who clearly have zero understanding of our business.
I'm forced to make decisions on how to treat certain customers, by entering data into some software and being given an output I can't challenge, which produces lots of insane and unfair situations.
Also, IDK how they clean and treat their data, but if they're relying on our ERP's data, good luck. Our CRM is full of BS because most employees rush to enter whatever the system will accept so they can move on, as they need to keep up with KPIs; they aren't trying to write careful notes and check that everything is ok.
Actual, human based decisions will almost always win out.
Data is only helpful when it is directly and clearly tied to the problem.
* Good: "Our customers are complaining of random drop-outs. We've noticed X% of requests to Y service take longer than Z time. We believe that's the problem".
* Bad: "Companies who are most successful on our platform upload X things in their first Z days. We must find a way for everyone to upload X things in Z days".
I have no axe to grind w.r.t. data science/engineering, as I have no experience in either.
However, it seems this person's biggest gripe is with good old crap management; the bane of business for hundreds of years.
This line stood out:
> Companies all over were consistently pursuing things that could be reasoned about a priori as being insane ideas– ideas any decently smart person should know wouldn’t work before they’re tried.
That pretty much summarizes why I have been told that today's companies want only young people. Us "olds," are "negative naysayers," who say things like "You know that the laws of physics forbid this, right?" or "I tried that, a couple of years ago. It didn't work out, and here's why...".
Apparently, young people are able to do the impossible, because they haven't been told it's impossible, and mixing "olds" with them spoils the soup, by telling them it's impossible (or maybe a lot more difficult than they imagine).
As someone who also wants to move away from data science, data engineering is the last thing I would want to do. I think DE comes with many of the same problems and it's also a very ill-defined career track; I wouldn't recommend it to anyone. ML engineer or backend developer seem like much more appealing job profiles.
My title is still software engineer, but I effectively do data engineering, and I work closely with data scientists.
I love a lot of it, but there's still plenty of bullshit to deal with. Just on the technical side, dealing with Python is a perpetual gong show, and most of my team's work seems to revolve around configuration of secrets and K8s.
I'm fortunate to be the guy that nerds out about performant code, so when something inevitably turns out to be a perf bottleneck, I can turn back into a regular old software engineer who trades in big data. Which I think is a better title/charge than data engineer, anyway.
I've talked with plenty of ML engineers, and they seem to immensely enjoy what they do. It seems that the periphery of data engineering is great; the core of it, not so much.
I think "The Gong Show" was an old tv show about amateur talents. Sometimes good, most of the time terrible and hilariously unaware. Not sure if that was what was intended here.
So is the GP criticizing Python? If yes, I am curious to know why. No, I am not here to defend Python. The constant runtime exceptions due to typing mistakes are so tiring.
I went from data science to data engineering and have been very happy. You may have trouble moving into backend development directly from data science because data scientists don't have a reputation for writing solid maintainable code, so data engineering could be a nice intermediate step since it's heavy on Python.
Would you highlight some of the biggest differences between ML engineering and data engineering? I believe they're sometimes used interchangeably, especially if "data" means "datasets" for ML.
Data Engineers are the people who take raw data (e.g. what lands in S3) and put that into data systems that can be used by other systems (e.g. Dashboards) and people (e.g. Analyst, Data Scientists, BI people). Data Engineers clean data, but they are really looking at cleaning out systemic issues (e.g. some data that is missing in one field is in another field, and that needs to be consolidated) and not the scrutinized row-by-row cleaning that Data Scientists end up doing. Data Engineers also do the data steps (e.g. creating a performant stored query) required to support things like business KPIs and reporting.
ML Engineering has a lot more variety based on the company and org, but generally it's about building an automated pipeline that includes ML. In smaller orgs you do everything - build a data pipeline, train a model, deploy that model, score new data, etc. In larger orgs, ML Engineers take a model built by somebody else and make it run at scale while meeting certain SLAs (e.g. making recommendations on a social media website).
Data engineers don't work with machine learning at all. In fact, one of the reasons the title developed over time was specifically to differentiate the people who work with data but don't do any statistics or ML. If a DE who is doing "datasets for ML" decides to call themselves an ML engineer, they're just getting a bit too creative with the job titles (maybe they want a career change, more money, they think it sounds better, all sorts of reasons).
As an ML engineer you might need to do some data engineering work as part of your job but not the other way around.
I haven't come across that. Unless the job title is being polluted like DS was (to include all aspects of data management), DE is specifically about data pipelines and not the models generating those clusters, predictions, or classifications.
Data scientist is much more catch-all from what I've seen. But a lot of that varies a lot by geography too (for example in the US people very often use DS very differently from how the title is used in the UK).
In the US it's more common for data scientist to be similar to a product analyst or a data analyst perhaps with better technical skills. In the UK data scientist is more likely to be someone who is doing applied ML work (other titles for this are ML engineer or applied data scientist).
Obviously it's not a perfectly clean separation but it's a trend, and people sometimes end up really talking past each other. You can see on r/datascience which is very US-heavy how people often recommend to beginners not to bother with advanced ML, stick to SQL, basic Python and analytics, and in the UK data science job market that's outright bad advice (it's fine advice for the UK analytics market which is a separate thing).
This. "Data Engineering" is pretty far from having a standard definition in the wild. If someone is describing a role as "data engineering" about the only thing you can count on being true is that it involves data.
Somewhat surprised that there's a separate job category for what sounds like large-scale data cleaning and aggregation work (which IMO is 90%+ of the effort involved with data science).
Anyway, I'm going to go back to my 5K+ lines of code for an upcoming conference submission - almost all of which involve data cleaning and aggregation - and think about how I could be making 2x more than I am now.
Depends. If their favourite data engineer says "Oh hey, I can write tensorflow too", then guess who gets the job of "productionizing" their crappy data science notebooks?
1. The person who developed the notebook is responsible for productionizing it. (No, it's not all crappy notebooks and some data scientists can indeed write high quality code).
2. You have someone like an ML engineer whose job it is to do this.
What you're describing seems like the least likely option; at least on the teams I've worked on "I can write tensorflow" would get you nowhere if that's not already a part of your job description.
Not sure why this is confusing, sounds like people think they are the same because they have never worked in the area. It's always been very well defined at Corps I've been at.
This blog post is directed at me, personally. Thanks W.D. This isn't just Data Science, I'd say that the gripes of the author are valid for about 60% of activity in tech companies. Not saying we can just eliminate 60% of it, but a lot of it supplies non-quantitative value that is driven by fashion (subset of politics) and direct politics.
There are a lot of naked emperors walking around with lots of folks standing as close as they can to shield them from the cool winter wind.
Not just tech companies either. Go bigger. It's possibly 60% of activity in US white collar economy. (I can't speak for those who work with their bodies, could be or not).
So that is maybe different. I think the "60%" we're talking about above, to me, is people doing useless work, or work whose only use is internal "political".
My $DAYJOB has a very high match to this. In fact, I used to do data and analytics full-time for a long while (a lot of Spark, basically, mostly cutting through the usual lambda and kappa architecture hype, building data lakes and <omg> "lakehouses") in several F500 companies.
I could not agree more with the overall sentiment that data science is overblown in terms of reproducible results because the people doing it just don't have an actual process or good leadership focus (which is not just a startup problem...).
So much so that I stayed staunchly on the "data engineering" track, because it was much more concrete in terms of technology, performance drivers, and business outcomes than the pipedreams, sold by other folk, of magical AI models that would provide amazing analytics overnight.
Turns out that if you can't get at, scrub and actually _use_ the data, figuring out trends or training models doesn't happen, so I focused on making at least that 50% of the project happen and leave nice, tidy infrastructure, workflows and schemas for the data science folk to go through.
I also had the good fortune to work with some very organized, knowledgeable ML folk who actually understood how things worked, but some partners and customers had... incredibly disorganized "data scientists" that would leave stuff scattered all over the place (including private copies of datasets on their laptops when we had nice, secure remote sandboxes for them that even did data masking to avoid leaking sensitive data).
Personally, I blame a lot of this on lack of certifications or professional training that emphasises _process_. Otherwise it's exactly the same problem we've had for the past 20 years in BI departments: People doing their own Excel sheets because "SQL is hard" and nobody can do ETL properly.
(Full disclosure: I am an MS FTE, spent something like 10 years doing analytics almost full time, and have presented on how to do Data Science at scale a few times: https://carmo.io/talks)
// it was often personally unfulfilling (e.g. tuning a parameter to make the business extra money).
He lost me here. Something I've always loved about being an engineer (and now in product) is that something small we do/tweak can have big impact.
If you tuned a parameter and that actually had tangible impact on the business, that's like the best case scenario and should be celebrated (vs doing some cool rocket science stuff that ends up unused and doesn't matter)
And all that extra profit is hoovered up by the people above you who had nothing to do with it. Validated engineering cost savings should be treated like sales: the engineer gets a percentage.
> Validated engineering cost savings should be treated like sales, the engineer gets a percentage.
If you want to really follow the same compensation structure, we would then give engineers a really low base salary and make 80% of their compensation performance dependent.
Be careful what you wish for :)
Besides - this would drive some strange incentive structures. If you incentivise people based on cloud savings for instance, it will really only be the teams with unnecessarily large cloud spend in the first place that ‘get’ that bonus. If you incentivise on sales, engineers doing great work on back office tools don’t get any cake. Etc.
Yep :) Big performance related bonuses sound great until you realise it also means you are heavily performance managed and most of your livelihood depends on your pipeline and actually closing sales. You are constantly looking off a cliff.
Maybe you are on $150k per year today, but in three months time you are back to $45k per year because you didn’t make some minimum sales threshold. Might be fine for some people, but depending on your mortgage…
Un-ironically: the pride of a job well done? Most people in software on this site are very well paid and well treated; the least we can do is do our job right.
Please speak for yourself. A comfortable cage does not inspire me to go above and beyond.
I very much doubt my "going the extra mile" will really affect anyone at all in any major way. It may make some made up numbers go up -- or down -- but realistically it will have no major effect on anyone at all, except myself (and negatively).
Whatever effect it elicits in another will be short-lived, and forgotten next quarter -- least of all recompensed sufficiently for the sacrifices made.
1. Fact: The reality of most ML modeling is that models don't need to be giant interaction machines with 10^80 features. A simple non-parametric model like Naive Bayes, or adding a few Boolean indicator variables to your regression to make it non-linear, will go a long way toward an 80% F1 score (a minimal sketch of the indicator trick follows after point 2). These are all practical problems, and they require ML skills, not just pure DS skills.
2. Fact: You need to deal with bigger problems in ML, like dataset class imbalance and calibrating responses to the right scalar range (figuring out what that range even is in domain terms). This is not taught in schools, just like writing software is not taught in schools. One needs to be in the field to learn these skills, and an ML Engineer can pick them up as readily as a DS can.
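Here is the indicator-variable trick from point 1 as a minimal, hedged sketch; the data, threshold, and column names are all invented for illustration, and pandas/scikit-learn are assumed:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: the outcome depends non-linearly on `age` (a bump in the middle),
# plus some label noise so the classes aren't perfectly separable.
df = pd.DataFrame({"age": rng.uniform(18, 80, 1000)})
signal = df["age"].between(40, 60)
noise = rng.random(1000) < 0.1
df["bought"] = (signal ^ noise).astype(int)

# A plain linear model can't represent the bump...
X_linear = df[["age"]]

# ...but a hand-crafted Boolean indicator gives it a piecewise response.
df["age_40_60"] = signal.astype(int)
X_indicator = df[["age", "age_40_60"]]

for name, X in [("linear only", X_linear), ("with indicator", X_indicator)]:
    acc = LogisticRegression().fit(X, df["bought"]).score(X, df["bought"])
    print(f"{name}: in-sample accuracy {acc:.2f}")
```

In-sample accuracy is used only to keep the toy short; the leakage and validation caveats discussed elsewhere in this thread apply in full.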
"Data Science" was always a vague term, purposefully so. Useful mostly as a vendor / consultant battle-cry and hype term to "encourage" a number of new business domains to adopt digitization and automated information processing / decision support.
Various older information-intensive fields (medicine, insurance, finance etc.) knew the benefits and pitfalls long ago. These examples also show the survival strategy for the generic "data scientist": specialization. The role of the human in the loop is to blow some context and relevance into an otherwise dead body of data. You can only do that if you really know your domain.
> The median data scientist is horrible at coding and engineering in general. The few who are remotely decent at coding are often not good at engineering in the sense that they tend to over-engineer solutions, have a sense of self-grandeur, and want to waste time building their own platform stuff (folks, do not do this).
I can relate to this so much. I've worked in multiple projects with great people who shifted away from solving the problem at hand to instead constructing some sort of generic problem-solving platform. In one project this actually happened twice: after refactoring the beef out of our SpecificProblemSolvingService into GenericProblemSolvingService, the generic problem-solving platform was then rewritten with one extra level of abstraction, so it could run any models designed to solve any task. As far as I know, neither service was ever used to solve any problem other than the SpecificProblem we were solving in the first place.
Tangentially, I also think the term data scientist has been so abused as to almost be meaningless at this point.
When I was applying for jobs it could range from anything from "knows how to use MS Excel" to "Can train large language models at scale".
Personally I went for ML Engineering. My company at some point hired people as data scientists (some of my more senior colleagues still have the title, despite doing the same work I do), but started hiring people as ML engineers, i.e. people who can do half-decent SW engineering and also do ML. Just a filtering thing I guess.
I have a suspicion the term will start to fall out of fashion as things become more specialised.
I've run a "data science consultancy" in some form or fashion for three years now.
When people say "data science" they mean one of three things:
(1) MLE
(2) Data Management
(3) Data Analysis or Business Intelligence (applications of the same skillsets).
(1) has a lot of ongoing innovation, be it in MLOps, autoML, mapping frontier ML to business cases, etc. Innovation is expensive if the investment strategy is unprincipled. (2) is a critical and essential part of making data a usable asset. Management is expensive if it exists solely as a control process and gatekeeps access and use. (3) is core and will never get away from the adhocs and the standard flows, but the inferences are often dubious or not logically justifiable, and doing it well requires a depth of statistical knowledge (rare) -- and the courage to call out BS.
Very few people have the depth to do all three. What I have found is that many businesses hope for capacity in all three, plus some basic SWE, in the hope that they can decrease labor expenses. Not an irrational hope, to be frank, but ultimately the iron law of business holds: you can have it good, fast, or cheap -- pick two and be happy with one.
My core observation (and one I see validated by client interest and experience) is that this is not new and has happened before -- it is the hype cycle in action. The digitization process (including moving to digital and then moving to the Web) had a similar cycle. When you treat "data science" like it's a silver bullet, it will generally fail to do anything but suck budget. When you embed it with your technology teams and treat it as an iterative add, as useful as devops, etc., you have a better chance of adding value.
I've found three core customer sets that helped us define a sustainable business:
(1) government agencies (which tend to put most expenditure under labor categories, so they hire a lot of long-term consultants and contractors)
(2) mid-to-small sized non-technology firms that want better data science strategy or want to build data-driven features into applications/products (especially in novel ways)
(3) smaller technology companies that don't have the MLE and data management system capabilities.
My career has been in heavily regulated industries, so our customers often have an appreciation for the management and governance portion after experiencing negative data science outcomes from maverick types.
Wow. The way you frame the work and client expectations makes it seem like working for you would be a breath of fresh air in the DS world. Let me know if you ever need a contractor, or let me know how to reach you with a resume :D
There are good research jobs in industry which are serious and mathematical. However, they also require you to be serious and mathematical. I'd venture to say that at this stage most "data scientists" either segued, self-taught, from adjacent fields or have a shallow relevant background.
The serious places don’t want you… so you end up at the place that can’t tell the difference, and the self fulfilling prophecy begins.
Look up the technical presenters at your favorite math or stats conference and look at the orgs they work for. That's how you can tell which orgs are serious about solving research problems and spreading solutions.
You could also look into the successful quant trading shops, if you're alright with not sharing results outside the firm. The nice thing about them is success or failure of your work is apparent much more quickly than the timescale of science, and it's harder to spin than other businesses. So the data science that gets done is generally pretty technically legit.
I've been a 'data scientist' for years, and I probably will be again at some point, as it is the biggest item on my CV. It was at a company where data science was not the bread and butter, but just something extra to show to the clients.
For me, therefore, data science is the epitome of Graeber's 'bullshit job' -- if the position didn't exist, the company would go on just the same.
I see a lot of pessimism here, and I've heard a lot of the same stuff from grumpy engineers over my career: "nobody knows how to do statistics properly", "scientists are bad at coding" and "managers don't care about quality and rigour". I'm old enough to say it all pre-dates data science as a term. It all comes across as a conspiracy, as if everyone were bad on purpose.
There is a null hypothesis here, that the average person in role x is just average at that role. It is an extraordinary person who has high level skills across multiple domains like maths/science and coding, maybe so extraordinary that they wouldn't be working with you...
I'll admit that the article rings true, but I think there is an implied intentionality that I don't agree with. We are all just plodding along, doing our best with limited information and skills.
> there is an implied intentionality that I don't agree with. We are all just plodding along, doing our best with limited information and skills.
Never got the feeling the author blamed the data scientists (he was one himself), but rather management, and not their bad intentions but their incompetence.
As someone who recently switched from DS to platform engineering, this post really captures why I also jumped ship.
> Shitty code & shitty data science
In my opinion the bar should be higher for code quality, and also for general engineering know-how, in data science. You'd be surprised how many are uncomfortable with git, the command line, interacting with APIs, managing environments, etc. Being able to work only within a Jupyter notebook is not good enough, at all. Otherwise, you end up with people whose entire job is to productionize and deploy the code, which is a waste of time and effort.
> Poor mentorship
There is either a lack of quality leadership and mentorship, or an inability on upper management's part to see the value in hiring for it. What ends up happening is you have a team that doesn't know how to grow, scale, or work together. They focus on building models and learning statistics when they should be focusing on building systems and processes that help the business scale analytics, and building models only when appropriate.
I enjoyed data science but found that it also doesn't matter in the way everyone thinks it should. Data science isn't just building ML models. In most companies, in my opinion, it is actually about scaling analytics: being able to reach further up into data engineering, get raw data, explore it, shape it, give it back to DE to automate, and then automate the delivery of data to upper management and guide them through using it. If the team thinks their job is to just build models, everyone is going to have a miserable, miserable time.
I'm reading a lot of sour grapes here, but I'd like to offer a more optimistic future snapshot (if you're a data scientist and happen to be reading).
I agree that many companies hire data scientists with only a vague idea about how to utilize them, but the same is true of software people in general. "Software is eating the world" and so is the practice of extracting value from data.
The margin on software is high - often more than 95% - so there's a lot of room to screw up and "figure it out" as a business. I think that's why there's a low bar for software and data management compared to, say, an automotive manufacturing line manager.
But that's where the opportunity is, if you're a budding data scientist:
- The business might not know how to effectively use/manage/train/mentor you.
- Upper management might have 20+ years of line of business experience, but will need your help to understand how your team can impact the business.
- You're going to need to seek out ways to impact the bottom line of the business.
All of the above is a recipe for leaping forward in your career. Since data science is a relatively new field, the demand for senior leadership FAR outstrips the available supply.
If you can learn how to effectively manage yourself, your team mates, and your function within the business - you have a ton of negotiating leverage and can name your price.
Source: I'm a data person who "retired" in their early 30s. Now I do all the research and hard science I want. ;)
> the demand for senior leadership FAR outstrips the available supply.
This is so very, very true. Most of the "bad" data science orgs I've spent time with, are bad because leadership is either bluffers or a data engineer/BI type person. It's generally hard for these types to run effective DS orgs as the skills needed are very, very different.
I feel like in the near future there will be a more formal hybrid role between data engineering and data science - like devops or full stack developers. The best data scientists I have worked with (ML mostly) have been incredible data engineers as well - some of them former sysadmins, backend developers or DBAs themselves. They know where to get the data, how to set up pipelines and jobs, how to make sure they run properly, best practices for reading and writing to databases so they don't fall over, error reporting and logging, hosting their inference models on APIs, security by design... The amount of back-and-forth that gets cut out to go from raw data to product is huge when you compare it to a traditional siloed setup of business analysis/stakeholder, data science, infrastructure/security and systems engineering.
I know some people cringe (mostly infra) when they think of data scientists having direct access to databases and infrastructure but honestly you should have a level of understanding and responsibility to get there.
The data scientists that do data engineering are usually much more valuable to the company and definitely earn more.
I agree, up to a point. I feel that companies up to a few billion dollars in stable revenues in a non data intensive business don't need teams of Data Scientists, but would benefit from some people having Data Science skills (both Data Engineers and Business Analysts).
One of the things that always sort of annoys me about complaints that "management doesn't listen to data (science)" is the lack of self-awareness behind them.
It turns out that data work is limited by all the same things every other part of the business is limited by: the need to make quick decisions, institutional imperative, the beliefs of decision makers, the ability to communicate well/influence, and so on.
Having better access or skill with data doesn't give you a pass on these things, despite the suggestions otherwise from laments such as this.
Read this yesterday and absolutely loved it. I especially feel the pain regarding working with management. I think there's an accountability that comes with evaluating management decisions with data that nobody really wants. I still have a lot of half-formed thoughts/opinions about this, but it really feels like being data-driven requires strict discipline, and the data people who would be accountable for that discipline both 1. aren't empowered to wield it and 2. probably don't want to wield it anyway.
Also agree about the simple tools but it's really hard from a career perspective. If I deploy XGBoost in production and put it on my resume, I'm making double my salary next year. If I can find a simple ruleset or linear regression that performs 90%+ as well as the XGBoost and put it in production then nobody cares even though it feels like distilling the complex down to the simple is really where the value is.
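To make the comparison concrete, here is a minimal sketch of that kind of bake-off, using scikit-learn's built-in gradient boosting as a stand-in for XGBoost on a bundled dataset (purely illustrative, not anyone's actual pipeline):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

models = {
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "logistic regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
}

# If the simple model lands within a few percent of the fancy one,
# the case for shipping the simple one is strong (even if the resume suffers).
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy {score:.3f}")
```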
The thing is that, these days, anything short of a deep net is no harder to implement than a linear regression, and is often easier given how the tooling handles nulls and categoricals.
Also, modern tooling makes a lot of these models more than explainable enough for a lot of cases… 10% is a lot
Just as an FWIW, I've been interviewing data people for about a decade now, and I would definitely be far more impressed with the simpler approach, but I realise that people like me are a minority in the field.
I have never understood what a good ML engineer couldn't do that a Data Scientist could, in the _majority_ of situations. When you need a decision made based on data, it's just common-sense risk analysis combined with basic statistics.
I feel some good field training in statistics (look up Andrew Gelman) and a couple of good courses on linear and Bayesian regression are all you need; the rest is just engineering skill.
The dichotomy between ML Engg and Data Science is as stupid as the one between Systems Engg and Application Engg before DevOps came along.
I think the qualifying term here is "good". I've worked with a surprising number of MLEs that don't really understand gradient descent or how most models really work under the hood. They certainly couldn't implement most things from scratch if they needed to (neither could most data scientists).
I used to think an MLE was a solid engineer who also had a strong quantitative and numerical computing background. The kind of engineer that always has a copy of Numerical Recipes handy, and if needed, could reimplement core components of statsmodels and sklearn in javascript.
I think after this current contraction in tech is over we'll see that most of the remaining "data scientists/MLEs" will be the type of engineer I imagine an MLE to be.
> I used to think an MLE was a solid engineer who also had a strong quantitative and numerical computing background.
Applying computational stats/ML models is not all that hard to learn, but it's essential. I think we need a fundamental rethink of how applied stats/ML is taught to engineers to make them effective. Here are a few things I can think of:
1. Getting a solid understanding of actually coming up with a simple enough model to do the job
2. Do power analysis to figure out how many samples we need. Create datasets with hard negatives and overcome sampling bias. (A sketch of the power-analysis step follows after this list.)
3. Use things like multiple regression to do EDA, i.e. use models as a tool to understand a problem space rather than as the end goal.
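For point 2, a minimal sketch of the power-analysis step using statsmodels; the effect size, alpha, and power targets are placeholders, not recommendations:

```python
from statsmodels.stats.power import TTestIndPower

# Samples per group needed to detect a small effect (Cohen's d = 0.2)
# at alpha = 0.05 with 80% power, for a two-sample t-test.
n = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.8)
print(f"~{n:.0f} samples per group")  # on the order of 400
```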
I'm clearly talking about quantitative modeling tools.
That said, Linux and Chromium are massive projects, each with years of development and thousands of engineers behind them, so of course it would be ridiculous to expect a single engineer to build such a thing. Likewise, I wouldn't expect an MLE to build sklearn, entirely as-is, from scratch on their own.
However, I do certainly hope most CS folks could implement an OS or Web browser from scratch.
IMO a data scientist should also be a domain expert, in the same way analysts are.
But of course, too many view DS as some abstract skill where domain knowledge is not needed, and where the methodology will solve all problems / provide insight.
I agree completely, but if a data scientist should be a domain expert, surely we should just focus more on programming and quantitative skills in these fields?
People who make important decisions get paid a lot and have a lot of power. Why on earth would they then delegate that decision-making to data-peeps and make themselves entirely redundant? Data scientists are always going to be funnelled into make-work projects just so companies can claim they're doing well on the data-side of things.
The same is true of automation, incidentally. There are lots of big companies doing a lot of easily automatable work, but the guy who manages all those people doing easily automatable work is hardly going to scrap his own area by calling in some SWEs.
On a completely unrelated note, the author's data eng sounds like nothing I've seen. Hell, veto power over code? I'm not even sure all of them can code. They're just glorified sysadmins who can now provision some cloud infra. Somehow data engs are probably even more incompetent on average than data scientists, and the reason you move upstream is that it's an easier job: you don't need to spend your weekends grinding through maths problems or learning new languages, while probably still having higher value-add.
This hit all the same high notes I was feeling when I quit Data Science to become a software engineer. It's an infinitely better gig and I encourage all my colleagues with enough chops to make the same switch.
yep, exact same feeling here. I had several years as a "data scientist" and it was an almost totally bullshit job. the org bought into the hype and hired a cohort of us straight out of university, but then couldn't find anything data-science-y for us to actually do. what I actually ended up doing 95% of the time was taping together dodgy excel-based workflows using python scripts. it gave me a visceral appreciation for Conway's Law. the other 5% was when I got to do some genuinely interesting mathematical work, but that wasn't "data science" either, it was more like operations research. I lived for that stuff, but there wasn't enough of it.
so I jumped ship and became a software engineer. better pay and more interesting problems.
Could you elaborate on what kind of "software engineering" you now do? For someone who also would like to get out of data science, mentions of "I became a software engineer" don't really help to clarify what kind of SWE is feasible for a data scientist with decent programming chops to get into.
Another comment here mentioned the on-the-job steps you can take, and that mirrors my experience. I also enrolled in Georgia Tech's OMSCS after a year or so of self-study. About a year in, I took a role using Python for network topology analysis software. I went from there to using Go and C to develop a distributed database product. It's been incremental steps lower in the stack and towards more "pure" dev work. I'm now where I wanted to be and will keep doing this kind of work for as long as I can get away with it.
I don't know how replicable my success is. but, depending on where you work, it may be that as a data scientist you can provide more value to your business purely using your software skills than with any kind of stats knowledge, by figuring out how to unfuck existing crufty bureaucratic workflows. this can be more directly useful than any amount of hyperparameter twiddling on some ridiculous neural network chimera. at my last job I could see so much of people's time wasted on fucking idiocy and my mind rebelled against it, I had this drive to rip it all out and Do It The Right Way(tm). and in doing that, I learned a lot about software development, tooling, version control, documentation, and so on. one thing led to another, and I had turned myself into a software engineer.
nowadays I do -- well, fudging slightly but you could describe it as "industrial automation control". writing libraries to provide convenient abstractions for controlling industrial equipment, writing robust scripts to drive that equipment, run physical tests on the $widgets we make, aggregate the experimental data, store it, etc. in the interviews they liked how (in my DS job) I had taken existing inefficient excel based workflows that had human-in-the-loop, and automated them, made unit tests, wrote docs, considered failure modes that nobody had considered before, things like that. and I just read about a fuckton of different stuff. for example in the interviews they wanted to know if I had worked with concurrency, I said I hadn't because it just didn't come up in the work I did. but I knew a little about it because I read voraciously, then I was able to answer all the theoretical questions they posed about locks and threads and async and so on. obviously that didn't mean I really knew about concurrency (that's a kind of deep metis that can only be acquired by practical experience and I'm still only scratching the surface of it), but it demonstrated that I had curiosity to learn about the field outside of the immediate things I worked on day to day.
during that job hunt I also had a strong offer from a company that wrote software for the visual effects industry and they wanted someone to improve their automated testing and continuous deployment frameworks. I didn't know much about CI but I knew about testing (pytest and hypothesis and things like that). they liked me talking about that kind of thing.
I guess the lesson is, if you are right now a data scientist and you want to be a software engineer, you can just decide to be that right now. be proactive and find a software problem to solve, and solve it. you don't have to ask permission to do this .. what are they going to do, tell you to stop being useful? note what you did, then figure out how to do the next thing better based on what you learned. your pay stub will say you're a data scientist, but you should just think of it as clandestine self-directed on-the-job training for your next job, so you can talk about it in the interviews. does that make sense?
Be a strong software engineer? It's not hard to extend existing data science duties to include more engineering work. There's always a demand in any company for more strong engineers so it's not hard to find ways you can contribute more seriously to the data engineering/MLE part of your team's work.
I've been a data scientist for quite awhile now at many different places, but every time I start interviewing again I always make sure to include a few pure software engineer roles in the positions I'm interviewing for. Even for some pretty elite teams, I'm still able to get to the final rounds but so far have always realized I still personally prefer the DS roles I'm looking at.
Any data scientist who wants to keep working on quantitative problems in the future should aim to be a solid software engineer.
> Nobody knew or even cared what the difference was between good and bad data science work. Meaning you could absolutely suck at your job or be incredible at it and you’d get nearly the same regards in either case.
This seems to be a problem with the industry as a whole. I'm speaking as a SWE, but I've observed similar things with PMs. I don't think it's impossible or even very hard to appreciate the right things, it just requires a bit of thought and the correct value system. Both of those seem to be a bit too far of a reach though.
I think a big part of the problem is that most PMs are non-technical, and at some point up the chain so are most managers. Data science, when done thoughtfully, requires you to appreciate minute issues from data collection and management all the way through feature engineering, model selection/design, and validation.
The biggest, hardest bridge to cross was an appreciation of the importance of metrics. For some reason, getting a business person to grok something as simple as precision/recall/F-beta is a near-impossible task. You can do multiple presentations on it (after having honed those presentations over years with multiple manager audiences), and it never sticks. It's always "what's the accuracy?" It's impossible to do good work when your bosses insist on measuring and therefore optimizing for the wrong thing (which in my experience consulting for multiple Fortune 500 businesses, they always do).
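A contrived illustration of the point (not from the article): on a 1%-positive problem, a model that never flags anything gets 99% accuracy while being useless by every metric that matters:

```python
from sklearn.metrics import (accuracy_score, fbeta_score,
                             precision_score, recall_score)

# 1% positives, fraud-detection style; a "model" that never flags anything.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000

print("accuracy :", accuracy_score(y_true, y_pred))                         # 0.99
print("precision:", precision_score(y_true, y_pred, zero_division=0))      # 0.0
print("recall   :", recall_score(y_true, y_pred))                           # 0.0
print("F2       :", fbeta_score(y_true, y_pred, beta=2, zero_division=0))  # 0.0
```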
Even worse, many organizations have such broken politics/cultures that the managers can't even tell you the big picture of what the project is trying to accomplish. Once you finally piece it together from the people who know their roles in-depth, it becomes clear that what they're trying to do is totally infeasible. At least that was my experience more than half the time.
While the tone is a bit too negative (expected of someone leaving one place for “greener grass”), there are terrific points here that completely resonate with my own experience in data science - albeit not at a “median” place, but at a large non-Tech corp.
The biggest point I’d emphasize:
“there is a general industry-wide need for people who are good at both data science and coding to oversee firms’ data science practices in a technical capacity.”
Even more, most of the data tech leaders I hear strongly suggest this is not possible without C-suite representation of a data/engineer expert (I’m talking about non-tech companies)
Finally, it’s actually not uncommon to see people shift from data science to data engineering due to similar motivations. It’s actually the technical leadership that is sometimes surprised at the shift. You hear about this in podcasts from DS/DE people.
Coles Notes:
Data Engineer - more money, more clout, less analysis / interesting projects, more job security, more infra style work
Data Science - less money, a lot of random projects (sometimes ones you're totally overqualified for), more analytical, less clout and more confusion, since lots of people don't actually understand the capabilities.
50k+ lines of R, 10k+ lines of Julia, 5k+ in Python, C, and who knows what else. Most of it for what is, essentially, data engineering work.
Where do researchers with social science degrees fall on this scale? Less money, less clout. The projects are certainly interesting though (which is why I do what I do).
I think this is true for >general< data science houses and firms that offer Data Science across any domain. I'm generally wary of anyone who writes about Data Science as a concept, rather than about the use of Data Science to solve a particular problem in a particular domain; the rise of the Machine Learning bros is very real.
Having worked as a data engineer before and now working as a data scientist I cannot agree with that article. Data Engineering was really boring. The most complex math I ever used was computing an average value of something. Most of the time your work is only ETL, just data in/data out. The most complex technology I ever used was a search engine.
Everyone in IT who likes to do some math and statistics at their workplace, even if it's simply a linear regression or some histograms, should go for data science. Also, instead of nonsense discussions about what is agile and what isn't, I enjoy talking with my colleagues about the newest papers in ML, even if nobody understands the details.
In my experience Data Science is based either on optimizing short term easily measurable KPIs or producing impressive looking BS. So if you're joining a new team and they can't explain in one sentence what they're optimizing for you're probably going to be tasked with producing impressive looking BS.
It's so buzzword-heavy. I had a manager who wanted me to solve a problem using the Monte Carlo method when in fact the problem had a closed-form solution...
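A hypothetical reconstruction of that kind of situation (not the actual problem): estimating P(X + Y > 3) for independent normals by simulation, when the sum of normals has a textbook closed form:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Monte Carlo: X ~ N(1, 1), Y ~ N(1, 4); count how often X + Y exceeds 3.
x = rng.normal(1, 1, 1_000_000)
y = rng.normal(1, 2, 1_000_000)  # numpy/scipy take the std dev, not the variance
mc = np.mean(x + y > 3)

# Closed form: X + Y ~ N(2, 5), so the answer is one survival-function call.
exact = norm.sf(3, loc=2, scale=np.sqrt(5))

print(f"Monte Carlo: {mc:.4f}   closed form: {exact:.4f}")
```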
Why bother to understand how either software development has solved a problem, or how maths+stats has solved a problem when you could just ignore operational practices and “train a neural network to do it”?
I've written about this in other comments relevant to ad tech topic, but what the article says about twisting the data to support the pre-made decision (versus making decisions based on data-borne insights) is so true.
Regarding the business impact of data science related work, as a working DS and team lead I have a few thoughts.
There are some sectors and activities that have been very data-science-like, to use a blunt anachronism, for a long time, and that have a lot of teachings on how to develop and measure the business impact of the automated decision systems and/or analytical work they build. Some examples are:
- demand forecasting for supply chain processes
- credit scoring for loan approval
- portfolio risk analysis in financial settings
- some misc optimization work in operations
- maybe even six sigma can be listed here
Those have always had a data-science-like feel to them. The problems I see when companies try to stand up a data science team are:
1. pushing out subject-matter expertise and requirements just for the "freedom";
2. stakeholders who are out of touch with the solutions;
3. too little focus on putting stuff "in production", tracking, and being able to experiment, in whatever sense those have for the company;
4. too much focus on numbers that can come out of the DS's computer, treating operations-related numbers as an afterthought;
5. no basic knowledge of simple/common/classic solutions for their problem at hand.
So yeah, making business impact is way harder than it sounds, and too far outside the skill set of the typical 23-year-old STEM graduate. It's also too buzzwordy and impressive-sounding for the typical decision maker. I mean, I've heard countless times things like: "if an AI knows how to tell a cat from a dog by using one of those neural nets logarithms, surely it can know how to partition my marketing budget, optimize my coupon-giving logic and determine the strategy we should have in order to achieve a very obscurely constructed OKR".
(yes, I have worked for people that used the word logarithm to mean algorithm).
I relate to this so much. I have experienced the same things, and I also find myself moving across to data engineering.
In my case, I was in organisations that wanted data science but had no capability or interest in supporting the role, so a lot of my time and effort has gone into putting down the data science tools and learning devops, software development and data engineering, just so that I can get back to the point where I can do my data science work.
I’ve also become frustrated with my data science peers’ lack of knowledge about the surrounding fields. I get that being a top-tier software dev isn’t the primary responsibility of a DS, but it would certainly make their life, and the life of everyone around them, a lot easier if they made an effort. There’s a sense of “learned helplessness” in parts of data science (and parts of data engineering too), in which “if some third-party tool can’t do it for us, we just can’t do it”, and imagination is limited to the features of the latest framework du jour.
I have to agree with the author on all of the points he makes here. My projects with the most impact typically had a stronger data engineering and software engineering contribution from my team than data science. Data scientists today are what surgeons were in the 1700s. Hard for people to tell the difference between the good ones, the bad ones and the charlatans.
physics-based modeling with some supplemental data science/ML will always be superior. Every data scientist I've worked with has so little domain knowledge anything they do is already useless. They also have little to no presentation skills especially in a business setting.
obviously there are some examples like computer vision that require ML.
> Like bro, you want to do stuff with “diffusion models”? You don’t even know how to add two normal distributions together! You ain’t diffusing shit!
Made me LOL. First time I did that while reading a tech post/blog in years. Also neatly describes half the HN audience fawning over the latest AI thing.
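For the record, the fact behind the joke: for independent normals, the means add and the variances (not the standard deviations) add:

```latex
X \sim \mathcal{N}(\mu_1, \sigma_1^2),\quad
Y \sim \mathcal{N}(\mu_2, \sigma_2^2),\quad X \perp Y
\;\implies\;
X + Y \sim \mathcal{N}\!\left(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2\right)
```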
Lots of people cite bad management as the most annoying thing in tech. I wonder two things:
1) To recognize bad management, there needs to be awareness of what good management is. What is the common source of that awareness? How does one tell a good manager from a bad one?
2) If good management is so crucial to a tech company's success, why is there no worldwide push to help engineering managers become better? There's an awful lot of courses, bootcamps, learning videos, tutorials, git repos and so on for those who want to be software engineers, but all I see for managers is self-help-style books and articles centered on the typical situation: "oh shit, they appointed you to a managerial position, how do you cope with that?"
Many things in this article, especially about the problems with the Data Science role, resonate with me (low-value work with low expectations for quality). The funny thing is, I have never worked in Data Science; rather, I've worked in Software Development. The summary at the bottom about data engineering seems like the dream job to me, but I don't think it's because I'm interested in doing data engineering specifically. I think it's because doing things that actually have a day-to-day impact is fulfilling. My last job totally lost me after it ignored security problems in favor of surface-level things like updating CTA labels. Have other Software Devs had this experience?
> 23 year-old data scientists should probably not work in start-ups, frankly; they should be working at companies that have actual capacity to on-board and delegate work to data folks fresh out of college.
If a 23 year old manages to get a data science job at a startup, and then actually delivers the results that the start-up expected of them, the learning experience there is infinitely more valuable than going to Google and using a bunch of tools that don't exist in the real world, on unrealistic timelines because you're not on the ads team and don't need to make money.
You can go learn "best practices" later, but working in a startup is an exercise in pragmatism. You deliver results, or you die.
For years I was low-key obsessed with the idea that I should ditch regular programming and become a data scientist. My thought process was very crude: data science was mathsy and paid more, therefore it was a great career move!
After a few abortive attempts to learn statistics and linear algebra in isolation, I decided to sign up to that famous Andrew Ng online course.
I was bored out of my mind. I just did not find it at all interesting.
Not sure what my point is - maybe that a really easy way to learn if a seemingly lucrative and interesting thing is for you or not is to go ahead and learn the very basics of it directly, and see if you are at all motivated. I was not. And after that it was out of my mind.
Calling it Data Science was a tell. Have you noticed how non-scientific things add "science" to the name to make it sound like it has scientific rigor?
Data Science, Political Science, Social Science, Scientology
Should we call deep learning “statistics”? It’s mostly empirical. Should recommendation systems be called “statistics”? What about multi objective non convex optimization?
Not everyone is doing regression and classification all day.
The above were studied in the Computer Science curriculum at my university. I work with a statistician, like someone with a degree in mathematical statistics. They know nothing about any of those.
> Not everyone is doing regression and classification all day.
Yeah, some of them are doing unsupervised learning (recommender systems) too!
I dunno, I personally think we'd all have been better off if we'd called it statistics as at least then people would realise that the field wasn't created yesterday.
I realise the article was written for a specific audience for which this may be obvious, but what is the difference between data scientist and data engineer (in terms of what their job is)?
"Data engineering" means building systems that can manipulate data (e.g. storing, retrieving, and delivering it). There are usually fairly well-defined functional requirements about what the system is supposed to do, plus goals about performance and reliability that might be slightly more nebulous.
"Data science" means building systems that can draw conclusions from data. The functional requirement is usually some form of "accuracy", as measured somehow against some kind of human evaluation of the same conclusion.
Concretely: a data engineer might be asked to build a system that can ingest every tweet posted to Twitter, and return the 10 most widely-used hashtags in the last hour. A data scientist might be asked to build a system that looks at a tweet and figures out what language it's written in, or whether it's spam, or whether an attached image is pornographic.
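A toy sketch of the data-engineering half of that example; a real version would sit on a stream processor with windowed state, but the core aggregation is roughly this (function name and input layout are invented for illustration):

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def top_hashtags(tweets, n=10, window=timedelta(hours=1)):
    """tweets: iterable of (utc_timestamp, text) pairs; toy stand-in for a stream job."""
    cutoff = datetime.now(timezone.utc) - window
    counts = Counter(
        word.lower()
        for ts, text in tweets
        if ts >= cutoff
        for word in text.split()
        if word.startswith("#")
    )
    return counts.most_common(n)
```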
Data scientists actually cook up and run the statistical/ML models on data and write reports about their "findings".
However, the data that data scientists want to use is often messy and comes from varied sources. Hence, data engineers do supporting infra work like cleaning/loading data from different databases, etc.
> The median data scientist is horrible at coding and engineering in general. The few who are remotely decent at coding are often not good at engineering in the sense that they tend to over-engineer solutions, have a sense of self-grandeur, and want to waste time building their own platform stuff (folks, do not do this).
> It was obvious that there is a general industry-wide need for people who are good at both data science and coding to oversee firms’ data science practices in a technical capacity.
The job of overseeing a crowd of stubborn self-important over-engineerers sounds pretty thankless.
> 23 year-old data scientists should probably not work in start-ups, frankly; they should be working at companies that have actual capacity to on-board and delegate work to data folks fresh out of college. So many careers are being ruined before they’ve even started because data science kids went straight from undergrad to being the third data science hire at a series C company where the first two hires either provide no mentorship, or provide shitty mentorship because they too started their careers in the same way.
Startups are a low-paid job with a lottery ticket for a little dash of excitement. You get what you pay for.
> ...I live in constant anxiety that someone will pop quiz me with questions like “what is the formula for an F-statistic,” and that by failing to get it right I will vanish in a puff of smoke. So my brain tells me that I must always refresh myself on the basics.
Focusing on the basics is better than pretending to understand fancy things, but even this level of "continuous professional training" or whatever you want to call it is, to me, a bit off the mark. We can look up formulas whenever we want these days. We need more meaningful ways to test our understanding of things.
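Case in point, here is the looked-up version, in one common form (the nested-model comparison for linear regression, where the restricted model drops q of the full model's p parameters, fit on n observations):

```latex
F \;=\; \frac{(\mathrm{RSS}_{\text{restricted}} - \mathrm{RSS}_{\text{full}})\,/\,q}
             {\mathrm{RSS}_{\text{full}}\,/\,(n - p)}
\;\sim\; F_{q,\;n-p} \quad \text{under } H_0
```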
> 23 year-old data scientists should probably not work in start-ups, frankly; they should be working at companies that have actual capacity to on-board and delegate work to data folks fresh out of college.
Ageism is disgusting and I cannot believe such blatant discriminatory language is seen as OK for a link posted to hackernews. How would you all say if he wrote that 40+ year old programmers should xx?
Eh, there's some stereotyping going on here that I don't 100% agree with, but I'm not offended by the notion that people fresh out of college are generally inexperienced in the working world and lack skills that are needed in industry more than in academic work.
I don't see how this personal take generalizes to a trend specific to data science (as much as the term looks weird to me). In other words, I don't see how creating useless Python notebooks is different from over-engineering or creating useless features.
To me the proper context of the story is "cheap money has been flooding the market for _decades_".
> Personally, I’ve benefited a ton from reading the first couple chapters out of advanced textbooks (while ignoring the last 75% of the textbook)
Anecdotally, this is something I've thought about many times, that an awful lot of books cover like 80% of the important stuff in the first say 4-5 chapters, and then are just full of filler.
I find this extremely frustrating as I'm at the same time fearing to miss out on some important insight in the latter chapters, while reading the full length of most books is simply super hard to do in any sensible quantity.
Otherwise, great post. Going through a similar transition, for some of the same reasons :)
> So many careers are being ruined before they’ve even started because data science kids went straight from undergrad to being the third data science hire at a series C company
This is something that bothers me about the industry in general, but thankfully seems rare inside tech giants.
My perception is that VC companies in pursuit of the next unicorn idea believe it can only come from a young mind, and so pile money and opportunities onto young people who aren't ready for that responsibility. Either they crash and burn, or they are carried over the finish line by multiple investment rounds then swan off, thinking they did it all themselves, to go through the experience again as an angel investor.
there are different flavors of DS. there are people doing diffusion models, doing top stuff that may or may not yield anything, but they do it because they know the math well and enough code to put new maths stuff into new products. deep knowledge. so-called ML engineers. maybe they are even good at coding at the lowest level, but from their point of view, why bother? these people are at huge companies that have vast resources downstream; they have people like OP do the actual production work for them..
then there are T-shaped folks (unicorns, everybody wants one even if politically not ready [most aren't]), who know some concepts of many topics, the so-called full-stack DS, which i consider myself to be... i wouldn't be able to read thru most scientific papers, but I can put stuff together from start to finish, including deploying it as an API that scales to top performance thanks to the cloud. i go back to basics often and I think it's only natural! it's like being a pilot: why not check the basics that, if forgotten, will take everything down lol... and those are what you will use the most as well!
i think also many people who are too deep into one thing, math, code, whatever it is, start to call non-basic things that are basic to them, well --- basic... BUT THERE IS NOTHING BASIC about multiple linear regression and how to set it all up properly, and how humanity spent thousands of years getting to this point..
there is also the DS that's like a data analyst on steroids, who knows the basic and middle-tier algos and stats well and can deliver mad value with a bit of business knowledge. hell, they could even use excel for their stuff; proper understanding of the question at hand will most likely allow you to downgrade to lower, simpler tools. and simple is awesome! people often mistake complicated for advanced, which is not the case whatsoever.
once you know the land you accept your weak points and strong points and points you need to know enough to put stuff together. at the end if you know how to make sure stuff works and it works, hey, it works. and the only thing at that point between messing around and science, is "writing it down"... ;) push that code up , make it reproducible end to end.
I applied as a SWE in 2018 and was asked whether I had ever thought about doing Data Science instead. I said I had thought about it but it didn't appeal to me, for reasons similar to the article's.
Still, I got hired as a Data Scientist. The money was good, but I didn't believe in the field and was kinda ashamed of the title (though it did seem to impress people; I think rural Europe is a few years behind).
After 3 years I left for a SWE/SRE role and my only regret is not doing it earlier.
Decent grasp of SQL and a bit of handwavy knowledge about stats comes in handier than expected.
I found the section on "Poor self-directed education" to be interesting. I've had similar feelings as a full stack developer, where in many ways I'm just learning APIs and other narrowly useful trivia. The only things I've had meaningful improvement in are stuff like soft skills, giving a good code review, or managing a complex project. Wondering what kind of skills someone like me should focus on to avoid the "embarrassing" skills/resume gap.
I've never met a data scientist who could do anything more than basic statistics combined with the Python skills of a fifth grader (that's probably insulting to today's fifth graders tho). I honestly have no idea what they're supposed to be doing or why they're paid so much money. Pay a high school junior for the same and get better work. And where's the scientific method? Where's the experiments and rigor?
> Managers will say they want to make data-driven decisions, but they really want decision-driven data.
Has been my experience as ML engineer too. Decision making being intuition- and not data-driven was one of the largest shocks to me when I went from academia into industry.
How upper management and the board determined the course of the company was based more on emotion than anything else.
I've had similar experiences to the author and fellow HNers in this thread.
I just wanted to add that my personal transition from DS to DE has allowed me to work with a wider variety of data. DS is mostly tabular; data out in the wild comes in all shapes and sizes, and learning different techniques has been interesting.
Reading this and other similar posts, I feel so lucky to have found 1) a great boss who really knows his stuff, thanks to the fact that he started working on this before it was cool, 2) a great company that believes data will be the future, and 3) a good team that follows me without too much friction.
Can state from personal experience that there exists at least one company where each of the article's leading bullet-points are simultaneously false.
If your DS employer isn't making real use of the capabilities of a skilled data-scientist and that makes you sad, consider looking for a company that will.
> Rather the main bottlenecks I’ve faced were always crappy infrastructure and lacking (quality) data, so it has always felt natural to focus my efforts toward learning that stuff to unblock myself.
Right. What can you learn to overcome crappy infrastructure?
Feels like data science outside of deep learning is slowly edging towards the trough of disillusionment. Time to rebrand your CVs folks in line with the new hype, while waiting it out to get to the plateau of productivity ...
I moved from Systems software to DE/ML Engineer and then back to Systems software. There is no more fun involved in setting up pipelines, writing Spark jobs and productionizing models.
As someone who has just moved from Data Science to Software Engineering I feel very much the same way, liberated.
I worked in DS for 5 years and had varying degrees of success in working at companies that understood the proper use and application of Data Science. What killed my passion for it was a few things:
1. Data Science is a dubious field - Data Science can certainly be applied correctly, but I and others have applied the underlying statistical methods gung-ho at times. Part of this comes down to something W.D. said: Data Scientists are generally early in their careers. We have been captivated by the shiny new field and want to use it as quickly as possible, without fully understanding it. Throughout my career I've been met with varying degrees of scepticism about my profession, because Data Science promises more than it can deliver.
2. Data Science professional development is poorly understood/completely neglected - If you look for resources to grow your Data Science skills online, you are invariably drowned in the sheer volume of crappy "Intro to Data Science" courses. As far as I can find, there are very few advanced Data Science professional-development resources out there. Compounding the problem, Data Science teams are invariably managed by people who aren't native to the field, so managers let Data Scientists self-direct their learning, which runs straight into the problem just mentioned.
3. Support for MLOps is non-existent - I think this will change over the next couple of years, but Data Science has had to go through cycles of being integrated into businesses. The first "wave" of Data Science ended with companies realising that Data Scientists couldn't magic money out of the poorly maintained data they kept. This caused a huge increase in Data Engineers (and not just out of Data Science). Now we have Data Scientists who have access to nice data (thanks, Data Engineers!) and can build some interesting models, but how do they get them deployed? This second "wave" is seeing the rise of MLOps tools, engineers, etc., but Data Scientists currently don't have the know-how to get their own models into production. In my experience, that inability is incredibly demoralizing.
4. Educating fellow Data Scientists is too difficult - Unfortunately, the perception given to people coming into Data Science is that you can just do model engineering and call it a day. Bootcamps, courses and tutorials are all geared towards getting people good at building models, not at considering how those models fit into the bigger picture. There is little to no knowledge of good programming practices, source control (a lot of Data Scientists I worked with only knew git as a swear word) or deployment strategies. You could argue that a Data Scientist should only be concerned with building models; I would agree, but the reality is that companies will hire a team of Data Scientists and likely not provide the complementary teams needed to get models into production. Trying to upskill others on my team has been an uphill battle: either people don't care because they just want to build models, or they have come from an adjacent field with no software engineering experience.
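To make point 3 concrete, here is a minimal sketch of what the "last mile" can look like in its simplest form: a trained model behind an HTTP endpoint. Everything here is an assumption for illustration, not any particular company's stack: a scikit-learn model pickled to "model.pkl", a JSON payload shaped like {"features": [[...]]}, and Flask as the web framework.

    # Minimal sketch: serving a pickled scikit-learn model over HTTP.
    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    with open("model.pkl", "rb") as f:  # hypothetical artifact path
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(force=True)
        # scikit-learn estimators accept a 2-D list of feature rows
        prediction = model.predict(payload["features"])
        return jsonify({"prediction": prediction.tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)

Even this toy version hints at where the real gap is: the wrapper itself is trivial, while monitoring, retraining and rollback (the actual MLOps work) are not.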
Apologies for the stream of consciousness, but it feels good to get it off my chest. My move to Software Engineering started in my last role, where I was the Lead for a Data Science team. Thankfully my boss (the head of Data) understood the need to develop a whole data system, from good Data Engineering all the way through to MLOps for deployments. I was very fortunate to move into being the Lead MLOps Engineer and to build out our capability to deploy models with CI/CD mechanisms on AWS. That gave me the taste for building systems rather than models. I really do think Data Science has a place and can provide great value, but it's still a long way off. If we can make it so that Data Science teams can deploy to production quickly and safely, we can really start to reap the rewards.
For Software Engineers looking at getting into Data Science, I would suggest looking at MLOps first. You get to combine existing experience with tackling new problems (how do we keep models live and continuously learning? how do we ensure experiments are tracked?) and you will have a tremendous impact.
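On the experiment-tracking question, a minimal sketch using MLflow (one common choice, not the only one); the dataset, experiment name and parameters below are all illustrative:

    # Minimal sketch of experiment tracking with MLflow.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    mlflow.set_experiment("demo-experiment")  # hypothetical name

    with mlflow.start_run():
        n_estimators = 200
        model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
        model.fit(X_train, y_train)

        # Params, metrics and the model artifact all land in one place,
        # queryable later from the MLflow UI or API.
        mlflow.log_param("n_estimators", n_estimators)
        mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
        mlflow.sklearn.log_model(model, "model")

The point is simply that parameters, metrics and the resulting model end up somewhere comparable across runs, instead of scattered across someone's notebooks.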
This is how I feel about software development. The more people get into it, the more it seems you're just a code monkey pushing out shitty code.
Having worked as a data engineer for over 7 years, I can add my two cents:
Mainstream data scientists don't do data science; they are either ML engineers or data analysts who use ML Python libraries and got promoted to data scientist with a bigger paycheck. With machine learning and AI being the trend, the job is sexy, and it is a chance for the company to market itself as using AI and cutting-edge tech. I worked for a data solutions company, and while I worked together with data scientists to propose designs to our clients, it was the AI part, narrated by the data scientists, that made clients ready to throw money at us, even when the requirements seemed impossible, or at least very difficult, to achieve. These projects often fail because the data scientists can't reach the accuracy goal written in the contract, and the project ends up in the trash. These data scientists eventually leave the company or get fired, only to find another job within a short time, with an even bigger salary.
Now for the data engineering part: I wish the OP all the best with his career, but he is still in the honeymoon period and hasn't yet witnessed the misery of being a data engineer.
- The job can get very repetitive very quickly. Unless he works on infrastructure, his tasks will mainly be maintaining existing ETL pipelines or building new ones, and both are grunt work. You'll end up a data plumber who makes sure data goes from A to B and then to C, and that's it. You'll seldom find something new or revolutionary to work on, and you'll keep using the same tools as long as you're at the same company. Even when moving to a new job, you'll pretty much be hired for your knowledge of the same data warehouse or ETL tool you used in the previous job.
- Data engineering is an underappreciated job. If things go right, nobody pats you on the shoulder. When shit breaks loose, you'll be the one to clean up the mess. What makes it worse is that you can't leave the mess for long, because data snowballs: leave it unprocessed and you'll end up with more data clogging your pipelines, and what could have been fixed in an hour quickly takes long nights or even days to resolve. And do you know what that means? Managers won't have their fancy dashboards updated, and they'll start panicking.
- Data engineering is not rewarding. Again, you're just doing your job. No one cares.
- Data engineering has nothing to show for itself. Yes, data is being crunched and processed and baked, but it's the data analysts who build the fancy dashboards for managers, the data scientists who create the fancy graphs for managers, and the ML engineers who create the fancy products for managers. You're just a plumber who, instead of fixing toilets and sinks, fixes data pipelines.
- Only recently have data engineering salaries started rising, thanks to low supply and higher demand driven by better awareness among CTOs and heads of data of the importance of the role; until a few years ago, data engineers were paid less than software engineers.