Furthermore, as computers get faster and cheaper in every dimension, the threshold at which "Big Data" tooling and efforts make economic sense keeps rising with them. The limits of single nodes 15 years ago were pretty serious, but most problems businesses have, even in the so-called enterprise, can currently fit comfortably on a workstation costing maybe $5k and be crunched through in a couple of hours or maybe minutes - a lot easier to deal with than multiple Spark or HANA nodes. Operationalizing the analysis for more than a single group of users or a single problem is where things get more interesting, but I've seen very, very few companies with business needs that necessitate all this stuff at scale - most business leaders still seem to treat analytics results in discrete blocks via monthly / weekly reports and seem quite content with reports and findings that take hours to run. Usually when some crunching takes days to run, it's not because the processing itself takes a lot of CPU, but because some ancient system never intended for that scale is the bottleneck, or manual processes are still required, so the critical path isn't being touched at all by investing more in modern tools.
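To make the single-node point concrete, here's a minimal sketch of the kind of crunching I mean, assuming a hypothetical multi-GB `sales.csv` with `region` and `amount` columns - plain stdlib Python streaming through it on one machine, no cluster involved:

```python
import csv
from collections import defaultdict

# Stream a large CSV once, keeping only running totals in memory.
# A modern workstation chews through tens of GB of this in minutes.
totals = defaultdict(float)
with open("sales.csv", newline="") as f:  # hypothetical file
    for row in csv.DictReader(f):
        totals[row["region"]] += float(row["amount"])

for region, amount in sorted(totals.items()):
    print(f"{region}\t{amount:,.2f}")
```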
I can support “misguided” Big Data projects from a political perspective if they help fund fixing the fundamental problems that plague an organization (similar to Agile consultants), but most consultants are not going to do very well by suggesting the client go back and fix something unrelated to their core value proposition. For example, if you hire a bunch of machine learning engineers and they all say “we need to spend months or even years cleaning up and tagging your completely unstructured data slop, because nothing we have can work without clean data”, that'll probably frustrate the people paying them $1MM+ / year each to get results ASAP. The basics are missing by default, and that's why the non-tech companies are falling further and further behind despite massive investments in technology - technology is not a silver bullet for crippling organizational and business problems (this is pretty much the TL;DR of 15+ years of “devops”, for me at least).
That is precisely what the projects I'm usually involved in do. A client might want "buzzword technology", but at the heart of it, what they really need are stable, scalable, and consolidated data pipelines into e.g. Hadoop or AWS that give "Data Scientists" a baseline to work with (and anyone else needing information, really - it was just called "Business Intelligence" a couple of years ago).
In the end it doesn't matter if you wind up with a multi-TB copy of some large database or a handful of small XML files - it's all in one place, it gets updated, there are usable ACLs in place, and it can be accessed and worked with. That's the point where you start thinking about running a Spark job or the above AWK magic.
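Once the data is consolidated like that, the job itself is often the easy part. A hedged sketch with PySpark, assuming a hypothetical Parquet dump at `/data/lake/orders` with a `customer_id` column:

```python
from pyspark.sql import SparkSession

# Spin up a local Spark session; the same code runs unchanged on a cluster.
spark = SparkSession.builder.appName("orders-rollup").getOrCreate()

# Hypothetical consolidated landing zone produced by the pipeline above.
orders = spark.read.parquet("/data/lake/orders")

# A typical "Data Scientist baseline" query: order counts per customer.
per_customer = orders.groupBy("customer_id").count()
per_customer.show(20)

spark.stop()
```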
> most business leaders still seem to treat analytics results in discrete blocks via monthly / weekly reports and seem quite content with reports and findings that take hours to run.
I would go further and call long, or at least non-instant, report generation a perceived feature. Similar to flight and hotel booking sites that show a loading screen even when they could return search results instantly, the duration of the generation itself seems to add trust to the reports.
Absolutely. I really want to see advanced AI/ML tools developed to address THIS problem. Don't make me fix the data before I use ML - give me ML to fix my data!
That's hard, though, because data chaos is unbounded and computers are still dumb. Still, I think there's tons of room for improvement.
I watched a talk nearly 8 years ago by someone in the intelligence community space about the data dirt that most companies and spy agencies are combing through, and the kind of basic research that will be necessary to turn it into something consumable by all the stuff the private sector seems to be selling and hyping. So I think the old-guard big data folks collecting yottabytes of crap across the world and trying to make sense of it are well aware, and may actually get to it sometime soon.

My unsubstantiated fear is that we can't attack the data quality problem at any kind of scale, because it would take a massive revolution that no VC will fund and that nobody will try to tackle because it's too hard / not sexy - government funding is super bad and brain drain is a serious problem. In academia, who the heck gets a doctorate for advancements in cleaning up arbitrary data to feed into ML models, when pumping out more incremental model and hyperparameter improvements gives you a better chance of getting papers through or getting hired? I'm sure plenty of companies would rather pay decent money for lower-cost labor to clean up their data than have their highly paid ML scientists do it, so I'm completely mystified that we're not seeing massive investments here across disciplines and sectors. Is it like the climate change political problem of computing?
I was asking somewhat rhetorically, but am glad to see there are some serious efforts going into weak supervision. At the risk of moving the goalposts, I'm curious who besides those at the cutting edge in the Bay Area is working on this pervasive problem? My more substantive point is that, given the massive data quality problem in the ML community, I would expect these researchers to be superhero class - so why aren't they?
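For anyone unfamiliar, the core weak supervision idea (as popularized by systems like Snorkel) is small enough to sketch in plain Python: write several noisy, heuristic labeling functions and combine their votes into training labels instead of hand-labeling everything. Everything below (the example records and rules) is made up for illustration:

```python
from collections import Counter

ABSTAIN, SPAM, HAM = None, 1, 0

# Noisy heuristic "labeling functions" - each may be wrong or abstain.
def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_all_caps(text):
    return SPAM if text.isupper() else ABSTAIN

def lf_short_greeting(text):
    return HAM if len(text.split()) < 5 and "hi" in text.lower() else ABSTAIN

LFS = [lf_contains_link, lf_all_caps, lf_short_greeting]

def weak_label(text):
    """Majority vote over non-abstaining LFs.

    Real systems model LF accuracies and correlations instead of a raw vote.
    """
    votes = [lf(text) for lf in LFS if lf(text) is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

# Made-up unlabeled examples:
for text in ["hi there bob", "BUY NOW http://spam.example", "quarterly numbers attached"]:
    print(repr(text), "->", weak_label(text))
```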
There are a lot of people tackling bits and pieces of the problem. Tom Mitchell's NELL project was an early one, using the web in all its messy glory: http://rtw.ml.cmu.edu/rtw/
Lots of other folks here (CMU) too, particularly if you add in active learning. It's a hard, messy problem that crosses databases and ML.
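Since active learning came up: the simplest variant, uncertainty sampling, is just "ask a human to label the examples the current model is least sure about". A toy sketch with scikit-learn, all data synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

labeled = list(rng.choice(len(X), size=10, replace=False))  # tiny seed set
pool = [i for i in range(len(X)) if i not in labeled]

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Uncertainty = predicted probability closest to 0.5.
    proba = model.predict_proba(X[pool])[:, 1]
    most_uncertain = pool[int(np.argmin(np.abs(proba - 0.5)))]
    # In real life a human labels this example; here we just reveal y.
    labeled.append(most_uncertain)
    pool.remove(most_uncertain)
    print(f"round {round_}: queried #{most_uncertain}, "
          f"accuracy {model.score(X, y):.3f}")
```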