It's incredible. The last company I worked for before going it alone (I was the front-end engineer and moved away from ML-based business models) was trying to automate statistical analysis. I came from an academic background in economics, and when I tried to propose modelling to them in simple terms they jumped to, "it sounds like you're talking about forecasting, let's go with that." Then they started trying to implement Python models from academic papers with highly limited training data, and my reaction was generally, WTF?! You can't even start to forecast without a reliable base model. But they went ahead trying to sell stuff like churn prediction to companies, with zero understanding of how these models work at the most basic level.
And yeah, Google has thrown its hat into the ring with Analytics 360 and an enormously larger training base. Amazon's another major player.
Weirdly enough though, people do still blindly pay my previous employer to figure stuff out, because easy answers are always actionable, even if they're wrong. It's just crazy, because the CEO explained to me that lying about the service to potential customers and investors was necessary, since "faking it til you make it" was a sound business principle in his mind, as if 1980s Michael J. Fox were his primary source of business advice.
Long story short, don't waste your time with these little companies purporting to offer ML holy grails. They're probably just lying to you, whether intentionally or not. ML is a game for the big boys with access to market-level aggregates. The models that last company came up with were wildly inaccurate.
I only partially agree. Building good ML models and even outperforming the ML services of the big players is absolutely feasible. Take a look, for example, at this talk from PyCon DE (in English: https://www.youtube.com/watch?v=XniwzOCWi2c), which shows how a small team built a machine vision system to read car registration numbers from official documents. The system was built and trained with an extremely small dataset (I think around 60 scanned documents, with some data augmentation) and easily beat the Google Cloud ML algorithm by an impressive margin (Google ML had an intolerably high error rate for this seemingly simple problem).
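For anyone curious how far ~60 scans can be stretched, here's a minimal augmentation sketch in Python (Keras). The talk doesn't publish its code, so the transforms and the dummy data below are purely illustrative:

    import numpy as np
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Stand-in for the ~60 scanned documents; the real data isn't public.
    train_images = np.random.rand(60, 256, 256, 3)
    train_labels = np.arange(60) % 10

    augmenter = ImageDataGenerator(
        rotation_range=3,              # scans are rarely perfectly straight
        width_shift_range=0.05,        # small translations
        height_shift_range=0.05,
        zoom_range=0.05,
        brightness_range=(0.8, 1.2),   # scanner exposure varies
        fill_mode="nearest",
    )

    batches = augmenter.flow(train_images, train_labels, batch_size=16)
    # model.fit(batches, ...)  # every epoch sees freshly perturbed scans

Even a crude pipeline like this multiplies the effective dataset many times over, which is presumably part of how such a small team got usable accuracy.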
So I'd say if you have a very specific area that you're investigating, you have a very good chance of beating larger players that don't specialize as much as you can. Of course competing against Google in self-driving cars or machine translation might be a bad idea, but even in those areas there are small startups producing impressive results (e.g. DeepL: https://www.deepl.com/en/translator). Also, big companies regularly exaggerate their capabilities (sometimes more than startups do); just look at how IBM markets its Watson AI/ML solutions versus what it delivers in reality.
So personally I'd say it has never been easier to build relevant and interesting ML/AI-based solutions as a small team, and it is possible to beat large players if you have the right approach and the right (very narrow) problem.
DeepL is a very promising thing. I was very sceptical about the future of automatic translation, seeing as Google Translate seems to have stagnated for the last two years or so, but I tried DeepL on a German newspaper article a couple of days ago and it did a very good job. Granted, I don't know German (hence why I used DeepL), but the English translation DeepL produced seemed more polished than what Google Translate usually does.
I've used it a fair amount, and continue to be amazed with the quality it puts out. There are still some issues with formal pronouns, subject-matter-specific contractions etc, but otherwise it does a great job with both EN->DE and DE->EN
Oh yes. I've seen this plenty of times not with just ML, but even basic statistics (and by seeing I mean working next to people doing it). You don't need to understand statistics at all, as long as your customers don't understand it either, and you sound confident enough. If it's hard to verify whether a model works, you can keep the customer happy and yourself paid while not providing much, if any, value to them.
I currently believe this is how most ad tech runs internally. Scammers scamming scammers.
That's obvious to anyone with any sort of basic idea of how machine learning works. Feed bots data and test the bots' predictions - the more data you have, the better it'll obviously be. If you have one picture of a bee to work from, you'd only ever get one very specific shape for a bee from any AI; if you have thousands, you get a much better representation. It's pretty simple.
Data quantity and quality are key. Both, though. This is why it's foolish to go up against an ML product from Google, Facebook, or (maybe) Microsoft. You just can't compete with the volume and quality of data they can access.
In sectors like automotive, where every brand is competing to try and build the best predictors using only their own data, there is a huge opportunity available for the first two companies to share data with each other. Doubling your data quantity brings a significant improvement to any model, and would put them ahead of any competition. That advantage only grows the more players you add to the sharing pool.
I believe that if humanity is really going to harness machine learning, the concept of a bulk data commons is an inevitable requirement.
This makes decentralized personal data control, homomorphic encryption, and similar technologies incredibly important.
Yes. I found that sentence very odd if one accepts the idea that getting the data you need means sacrificing all those values... otherwise, just ask around... hence maybe China and its firms will win, since they don't need to make that trade-off.
> This is why it's foolish to go up against an ML product from Google, Facebook, or (maybe) Microsoft.
I was sure someone would say this. I think it only applies to advertising, commerce and insurance. When it comes to training ML models on images, text and games, big corporations don't have a unique trove of data. They only have the data advantage when it comes to personal data.
The more important advantage big corporations have is hiring the best, and hiring more, ML scientists and engineers. Demand for them far outstrips supply.
Large tech firms absolutely have access to unique image, text and game(play) data. I'm surprised you think they don't. How many tech startups do you suppose have access to every single multiplayer game of Destiny (Microsoft), every user's profile photo and interest metadata (Facebook), or every user's search history and email (Google)?
But wait, there's more! How many startups have ready access to usage statistics for every mobile app (Google, Apple)? How many startups have virtually the entire corpus of available ebooks on hand for mining (Google, Amazon)? What about traffic patterns (Google), or virtual reality game(play) data (Facebook)? Telemetry data from the most widely used operating systems and web browsers (Google, Apple, Microsoft)? Video content with official subtitles (Netflix, Amazon, Google, Apple, ...). And don't forget login and activity data through social login (Facebook, Google) and image training via reCAPTCHA (Google).
I could go on and on here. You should reconsider how large the set of training material is that can be acquired by mining massive amounts of "personal" data. Of course, this is completely orthogonal to the other major advantage large companies have; namely, they have vast sums of money to throw at compute and storage resources.
ImageNet has 14 million images. How many images does Google have available to them? They've got about 1000+ of my personal photos backed up on Google Photos, so if each of their 200M monthly active users is like me, that's 200 billion images from Google Photos alone. Then add every image crawled from the web for image search (that was several tens of billions when I was there, each tagged with text & structured data from its page of origin), every Street View photo (that was thousands of hard drives' worth, measuring in petabytes total), and high-res satellite images of the entire earth through their SkyBox/TerraBella acquisition.
The Big Tech companies like to understate the size of their data advantage because it tempts competitors into doing stupid things that won't work in the market. Don't be fooled though - the majority of useful data is locked up inside proprietary silos.
You can have access to larger open datasets of images if you want, but it's expensive to train on such a large dataset. So the real advantage is having money for training models.
Lots of companies have _vast_ amounts of unique data, often reaching back many, many decades.
For example, one company I worked with manufactured, installed, and did maintenance on certain electrical/power and mechanical products. They had archives full of error reports and the like, dating back many decades.
Luckily for them, the equipment problems of 50 years ago were no different from problems today, so the data was still highly relevant. The company went on to digitize all this data, structure it in a usable way, and then build (i.e. hire in ML consultants to build) models on that data - which they used for predictive work: anomaly detection, early warning systems, etc.
The net results were happier customers, fewer faults, lower warranty expenses, etc., which in turn made them more competitive.
This is just one example, but there are many others. Lots of data that the big tech giants do not have access to, but which is (or can be) of great value.
But to do this successfully you have to hit the bullseye, grow extremely quickly, and have a strong moat. Otherwise, one of the big companies will either clone you (or buy out a startup who has cloned you) and leverage their dominating market position into serving your niche somewhat better than completely sucking at it, as they had before.
The article is decent, but I see two mistakes: the author assumes that AI/ML can’t produce unique insights from public data; and she conflates AI/ML with automation. While it’s true that you won’t find anything that your competitors haven’t if you use the same data and the same AI/ML techniques, there’s nothing stopping companies from differentiating on the techniques in addition to (or even instead of) the data. If you just use plug-and-play AI, then sure, you’ll need a unique data set if you want unique results.
The section about finding faster, less error-prone ways to apply existing insights sounds more like automation than AI. There’s certainly overlap, but they’re two different things.
It's a data race because we've run up against another wall on the algorithms side. Find a technique that works better than GBDT for the same type of problem. Other than some minor tweaks described in the academic literature, it's been a while since something really advanced the state of the art.
Small datasets still have massive predictive potential; we just need better algorithms. (As an extreme example, suppose I give you the first 30 digits of pi or e and ask you to predict what comes next. Despite being a small amount of data of low algorithmic complexity, machine learning cannot currently handle this type of problem.)
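To make the GBDT point above concrete, the baseline that's so hard to beat is only a few lines these days; a rough sketch with scikit-learn (the dataset here is just a stand-in, not anything from the thread):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    # Any tabular dataset would do; this one ships with scikit-learn.
    X, y = load_breast_cancer(return_X_y=True)

    gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
    print(cross_val_score(gbdt, X, y, cv=5).mean())  # strong accuracy with zero tuning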
The pi and e example seems more complicated than it looks. If you ask a human who doesn’t know about pi or e, how much effort would it take for them to figure out the next digits? Seems like they’d have to rediscover the math first (or I suppose, perform a google search)
Yes, it would be a hugely complicated undertaking and probably impossible for most humans with little academic mathematical knowledge. But the point is that it would be possible, which indicates that the problem does not necessarily lie in the amount of data but in the algorithmic approach itself.
ML is a great tool that is creating very real and tangible value, but it still has a ways to go. Just adding more computational capability and more data will only bring marginal improvements.
I was just saying to our partner (as well as to my wife!) how lucky we are to work on healthcare solutions. We have access to data about medications, opioids, and patient and physician behavior that very few others (with any clue about data analytics) have.
The realization that you're sitting on a goldmine of impossible-to-access data, plus the capability to develop cutting-edge analytics solutions that could change the world, is the best place to be.
The sample efficiency of ML systems is increasing rapidly in the theoretical realm, though the available data are growing orders of magnitude faster than the progress in ML sample efficiency. The article has a valid point: it's far cheaper to hoard data than to invent a more efficient ML system, thus we will see people race towards data rather than technical complexity.
Sure. I don't have any holistic survey to prove my point, but an example of recent progress in terms of sample efficiency is this paper[0]. Derivatives of this paper have been used to solve Sudoku[1], Starcraft II[2] and more [3]. This paper enabled more efficient use of data by creating a probabilistic graphical model between logical sets.
I like the 3D datasets in these papers since, like a scientist in a lab, you can set up the experiment and explore the domain. Adding time would be cool too (e.g. are the blue and red balls going to collide in 10 seconds?).
It also helps to be able to show you can answer some of these questions in principle with your model. It gives you hope that it might be able to cover real world images.
It's not just quality but it's also the available features in the data. Quality is just a signal to noise problem. If you have enough data, you generally can reasonably segment it to get higher quality. Obtaining features not present in other data sets is probably the most significant factor.
For example, let's say you want to build a speech rec engine. You need 15K hours of data to build/validate a model. How would you get that? You could farm it out to some people on mechanical turk and get 15K hours of audio transcribed. With enough money, you could duplicate the transcriptions enough times to actually be pretty sure about the quality of the data set. If you're clever and have a large enough dataset, segmentation generally gets you decent quality. The big gains come in when you have features not present. For example, google realized when you build a speech rec engine, you can include video data and image processing to actually use the way people move their mouths to significantly increase the quality of an automated transcription.
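The "duplicate the transcriptions" trick the parent mentions can be sketched in a few lines; the agreement metric and threshold below are made up for illustration:

    from difflib import SequenceMatcher

    def agreement(a, b):
        # crude word-level similarity between two transcriptions
        return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

    def keep_reliable_clips(transcripts_per_clip, min_agreement=0.9):
        """transcripts_per_clip: {clip_id: [independent Turk transcriptions]}.
        Keep a clip only if every pair of transcriptions largely agrees."""
        kept = {}
        for clip_id, texts in transcripts_per_clip.items():
            pairs = [(a, b) for i, a in enumerate(texts) for b in texts[i + 1:]]
            if pairs and min(agreement(a, b) for a, b in pairs) >= min_agreement:
                kept[clip_id] = texts[0]
        return kept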
I think creative new ways of harvesting data will continue to be profitable.
That's fine with today's mindset, but over time it seems like we'll want to go beyond big data and think about how it is that a child can become quite capable without seeing millions of instances of people crossing a road, etc... We have to make our machines make models and 'want' to gather data to test them. By having the machines do the work of today's data scientist, they may well put themselves out of business, and quickly everyone else.
I think you are saying in essence, data quality trumps data quantity, which in turn trumps modeling and algorithmics.
This seems to hold fairly well, with a few caveats. For example, without enough data you are wasting your time with some techniques. Also, "data quality" most often needs to include quality labeling in practice (but you may have meant that inclusively).
As a data scientist, I can tell you it has always been about the data. It has always been a data race. Google has known that, which is why they spent billions trying to protect their search moat.
This. All things being equal - and assuming similar data quality among competitors - the one with the most data has a stronger strategic position. This is because, all things being equal, more data means a larger number of available choices and insights (I'm making some simplifying assumptions here).
Business strategy is fundamentally about trade-offs, though, so we do need a caveat. Naturally, more choices and more insights can, at times, be a weakness. You always have to prioritize, and as data volume increases, the ability to prioritize well doesn't necessarily grow in parallel.
Mark my words - organizations like Amazon, Google, etc. will soon start offering a Data Marketplace (intended for both buyers and sellers of all sorts of "alternative" data - everything from small-business metrics to enterprise/B2B and anything in between). The next logical step would be to offer insights/models as a service as a layer built on top of this.
I've been studying their moves carefully and have no doubt this is where they're headed. While I don't work for Amazon I think AWS in particular is uniquely positioned for these next couple of moves to a level that may make it challenging for others to compete. Interesting times...
Indeed, the clever thing would be to have an API flexible enough to let a user bring their own ML architecture and training process while still keeping that user from accessing the raw data.
The ideal approach would somehow allow this anonymized access to multiple large databases simultaneously. I don't know how you'd do that, but if you claim Ethereum would help, you'd reach buzzword nirvana even if you were wrong.
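Purely as a thought experiment, such an API could look something like this in Python; every name here is hypothetical, not a real AWS/Google product:

    from typing import Any, Callable

    class RemoteDataset:
        """Hypothetical handle to data held by the marketplace; raw rows never leave it."""

        def __init__(self, dataset_id: str, schema: dict):
            self.dataset_id = dataset_id
            self.schema = schema  # column names/types only, no values

        def fit(self, build_model: Callable[[], Any], epochs: int = 10) -> Any:
            """Ship the user's model definition to the data rather than the data to the user.
            Training would run inside the provider's infrastructure; only the weights
            (possibly noised for privacy) would come back."""
            model = build_model()
            # ... provider-side training against the raw data would happen here ...
            return model

    # usage sketch (all identifiers invented):
    # ds = RemoteDataset("smb-churn-2018", schema={"tenure": "int", "churned": "bool"})
    # trained = ds.fit(build_model=my_model_factory, epochs=20)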
This year brought to us huge improvements in handling text (BERT), speech (Google TTS, especially the Allo demo), images (ProGAN, StyleGAN, BigGAN) and activities such as games (Alpha Zero), and robotics. Even music composition is improving a lot. I don't feel it is slowing down yet. And when it will eventually slow down, it will be all for the better - we will have time to reopen all those different approaches that have been more or less ignored because of the DL hype.
I think the key fields in AI will become simulator-based learning (RL) and graph-processing neural nets, because graphs can express any kind of high-dimensional data and are useful for reasoning tasks. They marry the symbolic and connectionist approaches. These two subdomains have evolved rapidly over the last couple of years. They also solve the data problem - in simulation you can produce as much data as you want, and graphs have combinatorial generalisation, so they work on new configurations without retraining.
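For the graph part, the core trick behind that combinatorial generalisation is that one set of weights is shared across all edges; a toy message-passing step in plain numpy (a sketch of the idea, not any specific paper's architecture):

    import numpy as np

    def message_passing_step(node_feats, edges, w_msg, w_update):
        """node_feats: (N, d) array; edges: list of (src, dst) pairs.
        Each node sums messages from its neighbours, then updates its own state."""
        aggregated = np.zeros_like(node_feats)
        for src, dst in edges:                           # the same w_msg is used on every edge,
            aggregated[dst] += node_feats[src] @ w_msg   # so any graph shape/size works
        return np.tanh(node_feats @ w_update + aggregated)

    # toy usage: 4 nodes in a chain; works unchanged for any node/edge count
    feats = np.random.rand(4, 8)
    w1, w2 = np.random.rand(8, 8) * 0.1, np.random.rand(8, 8) * 0.1
    out = message_passing_step(feats, [(0, 1), (1, 2), (2, 3)], w1, w2)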
I could show someone a single photograph of an animal they have never seen before and they will recognize it forever, from different angles, in different lighting, in black and white, probably even from a silhouette.
The more I dig in to ML the more it seems like it's cheating, like a mathematical trick. At least the way we are using it.
I can't help but feel like it's a game of Pachinko with pixels instead of balls, and neural weights instead of pins and holes.
I'm no expert though so take my opinion with a grain of salt.
What are models? They're a way to describe the data. So this is where a load of philosophical stuff like Occam's Razor comes in and favours things like having fewer degrees of freedom, lower errors, etc.
What is data? It's what makes one model a more likely explanation than another.
You can make up any number of ways to describe some phenomenon, but without data there's no way to tell which of them is better. Or rather, you will fall back on some model with fewer specifics, because of those considerations we mentioned earlier.
So getting smarter (ie more complex) with models can't help on its own.
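One way to see this concretely: with data in hand, you can actually score competing descriptions and penalise degrees of freedom. A small sketch using BIC on polynomial fits (the data and the degrees here are arbitrary):

    import numpy as np

    def bic(y, y_pred, n_params):
        # Bayesian information criterion: fit error plus a penalty per parameter
        n = len(y)
        rss = np.sum((y - y_pred) ** 2)
        return n * np.log(rss / n) + n_params * np.log(n)

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 40)
    y = 2 * x + rng.normal(0, 0.1, size=x.shape)   # true process: linear plus noise

    for degree in (1, 3, 9):                       # competing "models" of the data
        coeffs = np.polyfit(x, y, degree)
        print(degree, bic(y, np.polyval(coeffs, x), degree + 1))
    # the simplest adequate model scores best even though the complex ones fit closer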
I'm not sure I'm understanding you, or I'm not sure I agree. Humans, for instance, can learn from substantially less data than (most) machine learning models, if the data is presented in a way that we're optimized for (e.g. human facial recognition).
If we can devise ML algorithms capable of generalizing from substantially smaller datasets, while data will still be relevant, it will be a less important bottleneck in the process than it is now.
Human beings can be smart in using a fairly small set of data. Humans have an enormous store of data to work with but when confronted with a bit of new data on a new subject, they can sometimes do very well.
This is a result of the ML paradigm essentially starting fresh with whatever data set it is trying to "solve", but overall, the paradigm doesn't have to be that way.
IMHO pretty much every example I've seen for how humans can get good results with using a fairly small set of data actually involves successful generalization / transfer of "generic life experience" or "generic audiovisual processing" to that new problem instead of actual learning from limited data. We know how to do that in ML, in general - we can do transfer learning from few examples quite well if the underlying generic data is good enough. However, we currently don't have good enough underlying generic data to match the years of generic life data that any human kid has accumulated.
A human learning to play an Atari game that involves an agent jumping over a pit has to learn only the mechanics of that game, and that can be done in minutes. Learning that game from scratch, on the other hand, requires also learning interpreting vision and the whole concept that the world has objects that may move around - which takes months of learning for human brain. So comparing sample efficiency is an apples to elephants comparison if we disregard the ability to reuse/transfer knowledge from related tasks that all humans learn during childhood.
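This is also why the standard few-example recipe in practice is transfer learning; a hedged PyTorch sketch (the two-class task and the few_shot_loader are assumptions, not anything from the thread):

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(pretrained=True)        # stands in for "years of generic vision"
    for p in model.parameters():
        p.requires_grad = False                     # keep the generic features frozen
    model.fc = nn.Linear(model.fc.in_features, 2)   # new head for the new, tiny task

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # few_shot_loader is assumed: a DataLoader over just a handful of labelled images
    # for images, labels in few_shot_loader:
    #     loss = loss_fn(model(images), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()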
Plus, investing in data is much more predictable: the outcome is always going to get better, even if with diminishing margins, and better is better.
Investing in modeling is not: hiring 100 machine learning 'experts' will not solve the problem 100 times better than 10 of them would. On the other hand, 100 labelers will reliably provide 10 times the throughput, given the same scaling.
True, of course. We can only go as high as 100% for accuracy, right?
However, the labelers are scalable in terms of throughput and coverage of the data: you can always find bad examples or holes in your current data plane, and that is when the labelers, not the scientists, are going to rescue you.
Yes. On average the human brain needs just dozens or hundreds of examples (turns, exercises, you name it) to "learn" something. A decent machine learning model needs hundreds of thousands to millions of samples to gain good confidence, and can still be fooled easily afterwards with subtle changes.
It might be worth considering the extremely large amount of sensory information humans are hooked up with. Computers generally have only a very limited amount of sensory input, not to mention the inferiority of electronic sensors compared to the highly sophisticated biological ones.
I'd like to see this put to the test in a situation where sensory information is of no benefit: strategy games.
Alpha Zero got to be so strong at chess through millions of games of self-play. I would like to see how it would fare with only one hundred games, against a human chess beginner with one hundred games under his/her belt.
Actually, I don't think this is strictly speaking accurate. In order to train from those dozens/hundreds of samples, human brain needs to first be developed enough by accumulating experience from billions of samples in related domains. This "pre-training" process literally takes years, and it still does not sufficiently prepare some people for some tasks.
I just think machines are good at some things, and people are good at others. If it really took billions of samples in related domains we wouldn't develop nearly as fast as we do after being born.
I'm not sure I'd call it "fast" either. For the first three months babies can barely see anything, and for at least nine they can't form anything even remotely resembling speech and can't walk. What's amazing is that all this learning is very sparsely supervised and all "subsystems" train at the same time.
This is a human peculiarity. Many animal babies are born with fully developed abilities. I am sure we have all seen the NatGeo videos of antelope babies stumbling 2-3 times and then immediately start walking and even running.
Human babies have a bigger head-to-body ratio than all other species because our brains are bigger. Our babies have to be born earlier, otherwise they cannot make it out of the mother alive.
Outside of that, we develop pretty quickly. As you pointed out, everything in us is developing in parallel which is quite impressive.
I think it's important to note that when people talk about machines vs. people, they always compare the ideal machine with the ideal person - a rigorously educated, athletic genius who can infer links between things with a couple of hours of study at most and infer the future results of actions with a couple of seconds of thought.
Most people aren't like this, but on average the engineers who are truly thinking innovatively about these problems and creating solutions are. It is just the way it is.
Your wording is interesting here. Do you mean humans need less data to feel confident, or less data to make accurate predictions, compared to an ML model? Sorry if I'm parsing your words too closely; just genuinely curious.
The general answer is yes - it's the "one shot learning" problem. But humans also have access to a large (if not very well defined) amount of data via background knowledge, which plausibly enables them to apply stronger, more effective priors to any one task.
So try it: Teach a computer about animals in general.
Then show it a single example of a new animal it has never seen, and see how well it would do at telling you if challenge images are or are not that animal.
It's not even going to be close to what a human would manage.
It's not because of the data, it's because a human understands what he's seeing.
Then you're comparing a machine that's never seen anything but animals against a human that has no separation between different kinds of data and can't start on a blank slate.
I may have all sorts of images in my mind, allowing me to make the distinction?
This doesn't sound too different from face recognition systems where you train a network to be able to tell if two faces are the same or different. I don't know what the state of the art in animal recognition is, though.
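That verification framing can be sketched directly: embed both images with a generic pretrained network and compare; the threshold and the choice of backbone below are arbitrary:

    import torch
    import torch.nn.functional as F
    from torchvision import models

    embedder = models.resnet18(pretrained=True)
    embedder.fc = torch.nn.Identity()   # use the pooled features as an embedding
    embedder.eval()

    def looks_like_same_animal(img_a, img_b, threshold=0.7):
        """img_a, img_b: preprocessed (1, 3, 224, 224) tensors.
        Returns a guess at 'same kind of animal?' from embedding similarity."""
        with torch.no_grad():
            emb_a, emb_b = embedder(img_a), embedder(img_b)
        return F.cosine_similarity(emb_a, emb_b).item() > threshold

A single reference image is enough at inference time, but the embedder still leans on huge generic pretraining, which is the parent's point about where the advantage really comes from.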
I don't understand how this is a race at all. There is no way to finish it, and "going fast" is likely to cause major issues in developing sound technologies. Maybe shifting the article's attitude to a more scientifically sound position would increase its relevance.
I have just started working with Ocean Protocol (https://oceanprotocol.com/) and I have set up a local meetup in January for local entrepreneurs looking for access to machine learning data.
There are reliable ways to generate unbiased, synthetic and even big data for almost any industrial domain out there, plus a lot of applied research fields in the STEM curricula. The most important issue nowadays is the accountability of results (i.e. ablation studies), following too many false positives and a number of recent, blatant scams.
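As one example of what generating your own data can mean in the tabular case, here is a minimal scikit-learn sketch; the parameters are arbitrary:

    from sklearn.datasets import make_classification

    # Labelled synthetic data from a known generating process; "big" is cheap
    # when the generator is yours, and there is no hidden labelling bias.
    X, y = make_classification(
        n_samples=100_000,
        n_features=30,
        n_informative=10,
        class_sep=1.0,
        random_state=42,
    )
    # Because ground truth is known, ablation studies can be checked against it.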