Furthermore, as computers get faster and cheaper in every dimension, the threshold at which "Big Data" tooling and efforts make economic sense keeps rising with them. The limits of single nodes 15 years ago were pretty serious, but most problems businesses have, even in the so-called enterprise, can currently fit comfortably on a workstation costing maybe $5k and be crunched through in a couple of hours or maybe minutes - a lot easier to deal with than multiple Spark or HANA nodes. Operationalizing the analysis for more than a single group of users or a single problem is where things get more interesting, but I've seen very, very few companies with business needs that necessitate all this stuff at scale - most business leaders still seem to treat analytics results in discrete blocks via monthly / weekly reports and seem quite content with reports and findings that take hours to run. Usually when some crunching takes days to run, it's not because the processing itself takes a lot of CPU, but because some ancient system never intended for that scale is the bottleneck, or manual processes are still required, so the critical path isn't being touched at all by investing more in modern tools.
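To make the single-node point concrete, here's a minimal sketch of the kind of crunching I mean, assuming a hypothetical multi-GB `sales.csv` with `region` and `amount` columns - plain stdlib Python streaming through it on one machine, no cluster involved:

```python
import csv
from collections import defaultdict

# Stream a large CSV once, keeping only running totals in memory.
# A modern workstation chews through tens of GB of this in minutes.
totals = defaultdict(float)
with open("sales.csv", newline="") as f:  # hypothetical file
    for row in csv.DictReader(f):
        totals[row["region"]] += float(row["amount"])

for region, amount in sorted(totals.items()):
    print(f"{region}\t{amount:,.2f}")
```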
I can support “misguided” Big Data projects from a political perspective if they help fund fixing the fundamental problems that plague an organization (similar to Agile consultants), but most consultants are not going to do very well by suggesting the client go back and fix something unrelated to their core value proposition. For example, if you hire a bunch of machine learning engineers and they all say “we need to spend months or even years cleaning up and tagging your completely unstructured data slop, because nothing we have can work without clean data”, that'll probably frustrate the people paying them $1MM+ / year each to get results ASAP. The basics are missing by default, and that's why the non-tech companies are falling further and further behind despite massive investments in technology - technology is not a silver bullet for crippling organizational and business problems (this is pretty much the TL;DR of 15+ years of “devops”, for me at least).
That is precisely what the projects I'm usually involved in do. A client might want "buzzword technology", but at the heart of it, what they really need are stable, scalable, and consolidated data pipelines into e.g. Hadoop or AWS that give "Data Scientists" a baseline to work with (and anyone else needing information, really - it was just called "Business Intelligence" a couple of years ago).
In the end it doesn't matter if you wind up with a multi-TB copy of some large database or a handful of small XML files - it's all in one place, it gets updated, there are usable ACLs in place, and it can be accessed and worked with. That's the point where you start thinking about running a Spark job or the above AWK magic.
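Once the data is consolidated like that, the job itself is often the easy part. A hedged sketch with PySpark, assuming a hypothetical Parquet dump at `/data/lake/orders` with a `customer_id` column:

```python
from pyspark.sql import SparkSession

# Spin up a local Spark session; the same code runs unchanged on a cluster.
spark = SparkSession.builder.appName("orders-rollup").getOrCreate()

# Hypothetical consolidated landing zone produced by the pipeline above.
orders = spark.read.parquet("/data/lake/orders")

# A typical "Data Scientist baseline" query: order counts per customer.
per_customer = orders.groupBy("customer_id").count()
per_customer.show(20)

spark.stop()
```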
> most business leaders still seem to treat analytics results in discrete blocks via monthly / weekly reports and seem quite content with reports and findings that take hours to run.
I would go further and call long, or at least non-instant, report generation a perceived feature. Similar to flight and hotel booking sites that show a loading screen even when they could return search results instantly, the duration of the generation itself seems to add trust to the reports.
Absolutely. I really want to see advanced AI/ML tools developed to address THIS problem. Don't make me fix the data before I use ML - give me ML to fix my data!
That's hard, though, because data chaos is unbounded and computers are still dumb. Still, I think there's tons of room for improvement.
I watched a talk nearly 8 years ago by someone in the intelligence community space about the data dirt that most companies and spy agencies are combing through, and the kind of basic research that will be necessary to turn it into something consumable by all the stuff the private sector seems to be selling and hyping. So I think the old-guard big data folks collecting yottabytes of crap across the world and trying to make sense of it are well aware, and may actually get to it sometime soon.

My unsubstantiated fear is that we can't attack the data quality problem at any kind of scale, because it would take a massive revolution that no VC will fund and that nobody will try to tackle because it's too hard / not sexy - government funding is super bad and brain drain is a serious problem. In academia, who the heck gets a doctorate for advancements in cleaning up arbitrary data to feed into ML models, when pumping out more incremental model and hyperparameter improvements gives you a better chance of getting papers through or getting hired? I'm sure plenty of companies would rather pay decent money for lower-cost labor to clean up their data than have their highly paid ML scientists do it, so I'm completely mystified that we're not seeing massive investments here across disciplines and sectors. Is it like the climate change political problem of computing?
I was asking somewhat rhetorically, but am glad to see there are some serious efforts going into weak supervision. At the risk of moving the goalposts, I'm curious who besides those at the cutting edge in the Bay Area is working on this pervasive problem? My more substantive point is that, given the massive data quality problem in the ML community, I would expect these researchers to be superhero class - so why aren't they?
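For anyone unfamiliar, the core weak supervision idea (as popularized by systems like Snorkel) is small enough to sketch in plain Python: write several noisy, heuristic labeling functions and combine their votes into training labels instead of hand-labeling everything. Everything below (the example records and rules) is made up for illustration:

```python
from collections import Counter

ABSTAIN, SPAM, HAM = None, 1, 0

# Noisy heuristic "labeling functions" - each may be wrong or abstain.
def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_all_caps(text):
    return SPAM if text.isupper() else ABSTAIN

def lf_short_greeting(text):
    return HAM if len(text.split()) < 5 and "hi" in text.lower() else ABSTAIN

LFS = [lf_contains_link, lf_all_caps, lf_short_greeting]

def weak_label(text):
    """Majority vote over non-abstaining LFs.

    Real systems model LF accuracies and correlations instead of a raw vote.
    """
    votes = [lf(text) for lf in LFS if lf(text) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

# Made-up unlabeled examples:
for text in ["hi there bob", "BUY NOW http://spam.example", "quarterly numbers attached"]:
    print(repr(text), "->", weak_label(text))
```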
There are a lot of people tackling bits and pieces of the problem. Tom Mitchell's NELL project was an early one, using the web in all its messy glory: http://rtw.ml.cmu.edu/rtw/
Lots of other folks here (CMU) too, particularly if you add in active learning. It's a hard, messy problem that crosses databases and ML.
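Since active learning came up: the simplest variant, uncertainty sampling, is just "ask a human to label the examples the current model is least sure about". A toy sketch with scikit-learn, all data synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

labeled = list(rng.choice(len(X), size=10, replace=False))  # tiny seed set
pool = [i for i in range(len(X)) if i not in labeled]

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Uncertainty = predicted probability closest to 0.5.
    proba = model.predict_proba(X[pool])[:, 1]
    most_uncertain = pool[int(np.argmin(np.abs(proba - 0.5)))]
    # In real life a human labels this example; here we just reveal y.
    labeled.append(most_uncertain)
    pool.remove(most_uncertain)
    print(f"round {round_}: queried #{most_uncertain}, "
          f"accuracy {model.score(X, y):.3f}")
```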