It's incredible. The last company I worked for before going it alone (I was the front-end engineer and moved away from ML-based business models) was trying to automate statistical analysis. I came from an academic background in economics, and when I tried to propose modelling to them in simple terms they jumped to, "it sounds like you're talking about forecasting, let's go with that." Then they started trying to implement Python models from academic papers with highly limited training data, and my reaction was generally, WTF?! You can't even start to forecast without a reliable base model. But they went ahead trying to sell stuff like churn prediction to companies, with zero understanding of how these models work at the most basic level.
And yeah, Google has thrown its hat into the ring with Analytics 360 and an enormously larger training base. Amazon's another major player.
Weirdly enough though, people do still blindly pay my previous employer to figure stuff out, because easy answers are always actionable, even if they're wrong. It's just crazy, because the CEO explained to me that lying about the service to potential customers and investors was necessary, since "faking it til you make it" was a sound business principle in his mind, as if 1980s Michael J. Fox were his primary source of business advice.
Long story short, don't waste your time with these little companies purporting to offer ML holy grails. They're probably just lying to you, whether intentionally or not. ML is a game for the big boys with access to market-level aggregates. The models that last company came up with were wildly inaccurate.
I only partially agree. Building good ML models and even outperforming the ML services of the big players is absolutely feasible. Take a look, for example, at this talk from PyCon DE (in English: https://www.youtube.com/watch?v=XniwzOCWi2c), which shows how a small team built a machine vision system to read car registration numbers from official documents. The system was built and trained with an extremely small dataset (I think around 60 scanned documents, with some data augmentation) and easily beat the Google Cloud ML algorithm by an impressive margin (Google ML had an intolerably high error rate for this seemingly simple problem).
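For anyone curious how far ~60 scans can be stretched, here's a minimal augmentation sketch in Python (Keras). The talk doesn't publish its code, so the transforms and the dummy data below are purely illustrative:

    import numpy as np
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Stand-in for the ~60 scanned documents; the real data isn't public.
    train_images = np.random.rand(60, 256, 256, 3)
    train_labels = np.arange(60) % 10

    augmenter = ImageDataGenerator(
        rotation_range=3,              # scans are rarely perfectly straight
        width_shift_range=0.05,        # small translations
        height_shift_range=0.05,
        zoom_range=0.05,
        brightness_range=(0.8, 1.2),   # scanner exposure varies
        fill_mode="nearest",
    )

    batches = augmenter.flow(train_images, train_labels, batch_size=16)
    # model.fit(batches, ...)  # every epoch sees freshly perturbed scans

Even a crude pipeline like this multiplies the effective dataset many times over, which is presumably part of how such a small team got usable accuracy.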
So I'd say if you have a very specific area that you're investigating, you have a very good chance of beating larger players that don't specialize as much as you can. Of course competing against Google in self-driving cars or machine translation might be a bad idea, but even in those areas there are small startups producing impressive results (e.g. DeepL: https://www.deepl.com/en/translator). Also, big companies regularly exaggerate their capabilities (sometimes more than startups do); just look at how IBM markets its Watson AI/ML solutions versus what it delivers in reality.
So personally I'd say it has never been easier to build relevant and interesting ML/AI-based solutions as a small team, and it is possible to beat large players if you have the right approach and the right (very narrow) problem.
DeepL is a very promising thing. I was very sceptical about the future of automatic translation, seeing as Google Translate seems to have stagnated for the last two years or so, but I tried DeepL on a German newspaper article a couple of days ago and it did a very good job. Granted, I don't know German (hence why I used DeepL), but the English translation DeepL produced seemed more polished than what Google Translate usually does.
I've used it a fair amount, and continue to be amazed with the quality it puts out. There are still some issues with formal pronouns, subject-matter-specific contractions etc, but otherwise it does a great job with both EN->DE and DE->EN
Oh yes. I've seen this plenty of times not with just ML, but even basic statistics (and by seeing I mean working next to people doing it). You don't need to understand statistics at all, as long as your customers don't understand it either, and you sound confident enough. If it's hard to verify whether a model works, you can keep the customer happy and yourself paid while not providing much, if any, value to them.
I currently believe this is how most ad tech runs internally. Scammers scamming scammers.
That's obvious to anyone with any sort of basic idea of how machine learning works. Feed bots data and test the bots' predictions - the more data you have, the better it'll obviously be. If you have one picture of a bee to work from, you'd only ever get one very specific shape for a bee from any AI; if you have thousands, you get a much better representation. It's pretty simple.
Data quantity and quality are key. Both, though. This is why it's foolish to go up against an ML product from Google, Facebook, or (maybe) Microsoft. You just can't compete with the volume and quality of data they can access.
In sectors like automotive, where every brand is competing to try and build the best predictors using only their own data, there is a huge opportunity available for the first two companies to share data with each other. Doubling your data quantity brings a significant improvement to any model, and would put them ahead of any competition. That advantage only grows the more players you add to the sharing pool.
I believe that if humanity is really going to harness machine learning, the concept of a bulk data commons is an inevitable requirement.
This makes decentralized personal data control, homomorphic encryption, and similar technologies incredibly important.
Yes. I found that sentence very odd if one accepts the idea that getting the data you need means sacrificing all those values... otherwise, just ask around... hence maybe China and its firms will win, since they don't need to make that trade-off.
> This is why it's foolish to go up against an ML product from Google, Facebook, or (maybe) Microsoft.
I was sure someone would say this. I think it only applies to advertising, commerce and insurance. When it comes to training ML models on images, text and games, big corporations don't have a unique trove of data. They only have the data advantage when it comes to personal data.
The more important advantage big corporations have is hiring the best, and hiring more, ML scientists and engineers. Demand for them far outstrips supply.
Large tech firms absolutely have access to unique image, text and game(play) data. I'm surprised you think they don't. How many tech startups do you suppose have access to every single multiplayer game of Destiny (Microsoft), every user's profile photo and interest metadata (Facebook), or every user's search history and email (Google)?
But wait, there's more! How many startups have ready access to usage statistics for every mobile app (Google, Apple)? How many startups have virtually the entire corpus of available ebooks on hand for mining (Google, Amazon)? What about traffic patterns (Google), or virtual reality game(play) data (Facebook)? Telemetry data from the most widely used operating systems and web browsers (Google, Apple, Microsoft)? Video content with official subtitles (Netflix, Amazon, Google, Apple, ...). And don't forget login and activity data through social login (Facebook, Google) and image training via reCAPTCHA (Google).
I could go on and on here. You should reconsider how large the set of training material is that can be acquired by mining massive amounts of "personal" data. Of course, this is completely orthogonal to the other major advantage large companies have; namely, they have vast sums of money to throw at compute and storage resources.
ImageNet has 14 million images. How many images does Google have available to them? They've got about 1000+ of my personal photos backed up on Google Photos, so if each of their 200M monthly active users is like me, that's 200 billion images from Google Photos alone. Then add every image crawled from the web for image search (that was several tens of billions when I was there, each tagged with text & structured data from its page of origin), every Street View photo (that was thousands of hard drives' worth, measuring in petabytes total), and high-res satellite images of the entire earth through their SkyBox/TerraBella acquisition.
The Big Tech companies like to understate the size of their data advantage because it tempts competitors into doing stupid things that won't work in the market. Don't be fooled though - the majority of useful data is locked up inside proprietary silos.
You can have access to larger open datasets of images if you want, but it's expensive to train on such a large dataset. So the real advantage is having money for training models.
Lots of companies have _vast_ amounts of unique data, often reaching back many, many decades.
For example, one company I worked with manufactured, installed, and did maintenance on certain electrical/power and mechanical products. They had archives full of error reports and the like, dating back many decades.
Luckily for them, the equipment problems of 50 years ago were no different from problems today, so the data was still highly relevant. The company went on to digitize all this data, structure it in a usable way, and then build (i.e. hire in ML consultants to build) models on that data - which they used for predictive work: anomaly detection, early warning systems, etc.
The net results were happier customers, fewer faults, lower warranty expenses, etc., which in turn made them more competitive.
This is just one example, but there are many others. Lots of data that the big tech giants do not have access to, but which is (or can be) of great value.
But to do this successfully you have to hit the bullseye, grow extremely quickly, and have a strong moat. Otherwise, one of the big companies will either clone you (or buy out a startup who has cloned you) and leverage their dominating market position into serving your niche somewhat better than completely sucking at it, as they had before.
The article is decent, but I see two mistakes: the author assumes that AI/ML can’t produce unique insights from public data; and she conflates AI/ML with automation. While it’s true that you won’t find anything that your competitors haven’t if you use the same data and the same AI/ML techniques, there’s nothing stopping companies from differentiating on the techniques in addition to (or even instead of) the data. If you just use plug-and-play AI, then sure, you’ll need a unique data set if you want unique results.
The section about finding faster, less error-prone ways to apply existing insights sounds more like automation than AI. There’s certainly overlap, but they’re two different things.
It's a data race because we've run up against another wall on the algorithms side. Find a technique that works better than GBDT for the same type of problem. Other than some minor tweaks described in the academic literature, it's been a while since something really advanced the state of the art.
Small datasets still have massive predictive potential; we just need better algorithms. (As an extreme example, suppose I give you the first 30 digits of pi or e and ask you to predict what comes next. Despite being a small amount of data of low algorithmic complexity, machine learning cannot currently handle this type of problem.)
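To make the GBDT point above concrete, the baseline that's so hard to beat is only a few lines these days; a rough sketch with scikit-learn (the dataset here is just a stand-in, not anything from the thread):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    # Any tabular dataset would do; this one ships with scikit-learn.
    X, y = load_breast_cancer(return_X_y=True)

    gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
    print(cross_val_score(gbdt, X, y, cv=5).mean())  # strong accuracy with zero tuning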
The pi and e example seems more complicated than it looks. If you ask a human who doesn’t know about pi or e, how much effort would it take for them to figure out the next digits? Seems like they’d have to rediscover the math first (or I suppose, perform a google search)
Yes, it would be a hugely complicated undertaking and probably impossible for most humans with little academic mathematical knowledge. But the point is that it would be possible, which indicates that the problem does not necessarily lie in the amount of data but in the algorithmic approach itself.
ML is a great tool that is creating very real and tangible value, but it still has a ways to go. Just adding more computational capability and more data will only bring marginal improvements.
I was just saying to our partner (as well as to my wife!) how lucky we are to work on healthcare solutions. We have access to data about medications, opioids, and patient and physician behavior that very few others (with any clue about data analytics) have.
The realization that you're sitting on a goldmine of impossible-to-access data, plus the capability to develop cutting-edge analytics solutions that could change the world, is the best place to be.
The sample efficiency of ML systems is increasing rapidly in the theoretical realm, though the available data are growing orders of magnitude faster than the progress in ML sample efficiency. The article has a valid point: it's far cheaper to hoard data than to invent a more efficient ML system, thus we will see people race towards data rather than technical complexity.
Sure. I don't have any holistic survey to prove my point, but an example of recent progress in terms of sample efficiency is this paper[0]. Derivatives of this paper have been used to solve Sudoku[1], Starcraft II[2] and more [3]. This paper enabled more efficient use of data by creating a probabilistic graphical model between logical sets.
I like the 3D datasets in these papers since, like a scientist in a lab, you can set up the experiment and explore the domain. Adding time would be cool too (e.g. are the blue and red balls going to collide in 10 seconds?).
It also helps to be able to show you can answer some of these questions in principle with your model. It gives you hope that it might be able to cover real world images.
It's not just quality but it's also the available features in the data. Quality is just a signal to noise problem. If you have enough data, you generally can reasonably segment it to get higher quality. Obtaining features not present in other data sets is probably the most significant factor.
For example, let's say you want to build a speech rec engine. You need 15K hours of data to build/validate a model. How would you get that? You could farm it out to some people on mechanical turk and get 15K hours of audio transcribed. With enough money, you could duplicate the transcriptions enough times to actually be pretty sure about the quality of the data set. If you're clever and have a large enough dataset, segmentation generally gets you decent quality. The big gains come in when you have features not present. For example, google realized when you build a speech rec engine, you can include video data and image processing to actually use the way people move their mouths to significantly increase the quality of an automated transcription.
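The "duplicate the transcriptions" trick the parent mentions can be sketched in a few lines; the agreement metric and threshold below are made up for illustration:

    from difflib import SequenceMatcher

    def agreement(a, b):
        # crude word-level similarity between two transcriptions
        return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

    def keep_reliable_clips(transcripts_per_clip, min_agreement=0.9):
        """transcripts_per_clip: {clip_id: [independent Turk transcriptions]}.
        Keep a clip only if every pair of transcriptions largely agrees."""
        kept = {}
        for clip_id, texts in transcripts_per_clip.items():
            pairs = [(a, b) for i, a in enumerate(texts) for b in texts[i + 1:]]
            if pairs and min(agreement(a, b) for a, b in pairs) >= min_agreement:
                kept[clip_id] = texts[0]
        return kept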
I think creative new ways of harvesting data will continue to be profitable.
That's fine with today's mindset, but over time it seems like we'll want to go beyond big data and think about how it is that a child can become quite capable without seeing millions of instances of people crossing a road, etc... We have to make our machines make models and 'want' to gather data to test them. By having the machines do the work of today's data scientist, they may well put themselves out of business, and quickly everyone else.
I think you are saying in essence, data quality trumps data quantity, which in turn trumps modeling and algorithmics.
This seems to hold fairly well, with a few caveats. For example, without enough data you are wasting your time with some techniques. Also, "data quality" most often needs to include quality labeling in practice (but you may have meant that inclusively).
As a data scientist, I can tell you it has always been about the data. It has always been a data race. Google has known that, which is why they spent billions trying to protect their search moat.
This. All things being equal - and assuming similar data quality among competitors - the one with the most data has a stronger strategic position. This is because, all things being equal, more data means a larger number of available choices and insights (I'm making some simplifying assumptions here).
Business strategy is fundamentally about trade-offs, though, so we do need a caveat. Naturally, more choices and more insights can, at times, be a weakness. You always have to prioritize, and as data volume increases, the ability to prioritize well doesn't necessarily grow in parallel.
Mark my words - organizations like Amazon, Google, etc. will soon start offering a Data Marketplace (intended for both buyers and sellers of all sorts of "alternative" data - everything from small-business metrics to enterprise/B2B and anything in between). The next logical step would be to offer insights/models as a service as a layer built on top of this.
I've been studying their moves carefully and have no doubt this is where they're headed. While I don't work for Amazon I think AWS in particular is uniquely positioned for these next couple of moves to a level that may make it challenging for others to compete. Interesting times...
Indeed, the clever thing would be to have an API flexible enough to let a user bring their own ML architecture and training process while still keeping that user from accessing the raw data.
The ideal approach would somehow allow this anonymized access to multiple large databases simultaneously. I don't know how you'd do that, but if you claim Ethereum would help, you'd reach buzzword nirvana even if you were wrong.
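Purely as a thought experiment, such an API could look something like this in Python; every name here is hypothetical, not a real AWS/Google product:

    from typing import Any, Callable

    class RemoteDataset:
        """Hypothetical handle to data held by the marketplace; raw rows never leave it."""

        def __init__(self, dataset_id: str, schema: dict):
            self.dataset_id = dataset_id
            self.schema = schema  # column names/types only, no values

        def fit(self, build_model: Callable[[], Any], epochs: int = 10) -> Any:
            """Ship the user's model definition to the data rather than the data to the user.
            Training would run inside the provider's infrastructure; only the weights
            (possibly noised for privacy) would come back."""
            model = build_model()
            # ... provider-side training against the raw data would happen here ...
            return model

    # usage sketch (all identifiers invented):
    # ds = RemoteDataset("smb-churn-2018", schema={"tenure": "int", "churned": "bool"})
    # trained = ds.fit(build_model=my_model_factory, epochs=20)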
This year brought to us huge improvements in handling text (BERT), speech (Google TTS, especially the Allo demo), images (ProGAN, StyleGAN, BigGAN) and activities such as games (Alpha Zero), and robotics. Even music composition is improving a lot. I don't feel it is slowing down yet. And when it will eventually slow down, it will be all for the better - we will have time to reopen all those different approaches that have been more or less ignored because of the DL hype.
I think the key fields in AI will become simulator-based learning (RL) and graph-processing neural nets, because graphs can express any kind of high-dimensional data and are useful for reasoning tasks. They marry the symbolic and connectionist approaches. These two subdomains have evolved rapidly over the last couple of years. They also solve the data problem - in simulation you can produce as much data as you want, and graphs have combinatorial generalisation, so they work on new configurations without retraining.
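For the graph part, the core trick behind that combinatorial generalisation is that one set of weights is shared across all edges; a toy message-passing step in plain numpy (a sketch of the idea, not any specific paper's architecture):

    import numpy as np

    def message_passing_step(node_feats, edges, w_msg, w_update):
        """node_feats: (N, d) array; edges: list of (src, dst) pairs.
        Each node sums messages from its neighbours, then updates its own state."""
        aggregated = np.zeros_like(node_feats)
        for src, dst in edges:                           # the same w_msg is used on every edge,
            aggregated[dst] += node_feats[src] @ w_msg   # so any graph shape/size works
        return np.tanh(node_feats @ w_update + aggregated)

    # toy usage: 4 nodes in a chain; works unchanged for any node/edge count
    feats = np.random.rand(4, 8)
    w1, w2 = np.random.rand(8, 8) * 0.1, np.random.rand(8, 8) * 0.1
    out = message_passing_step(feats, [(0, 1), (1, 2), (2, 3)], w1, w2)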
I could show someone a single photograph of an animal they have never seen before and they will recognize it forever, from different angles, in different lighting, in black and white, probably even from a silhouette.
The more I dig in to ML the more it seems like it's cheating, like a mathematical trick. At least the way we are using it.
I can't help but feel like it's a game of Pachinko with pixels instead of balls, and neural weights instead of pins and holes.
I'm no expert though so take my opinion with a grain of salt.
What are models? They're a way to describe the data. So this is where a load of philosophical stuff like Occam's Razor comes in and favours things like having fewer degrees of freedom, lower errors, etc.
What is data? It's what makes one model a more likely explanation than another.
You can make up any number of ways to describe some phenomenon, but without data there's no way to tell which of them is better. Or rather, you will fall back on some model with fewer specifics, because of those considerations we mentioned earlier.
So getting smarter (ie more complex) with models can't help on its own.
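One way to see this concretely: with data in hand, you can actually score competing descriptions and penalise degrees of freedom. A small sketch using BIC on polynomial fits (the data and the degrees here are arbitrary):

    import numpy as np

    def bic(y, y_pred, n_params):
        # Bayesian information criterion: fit error plus a penalty per parameter
        n = len(y)
        rss = np.sum((y - y_pred) ** 2)
        return n * np.log(rss / n) + n_params * np.log(n)

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 40)
    y = 2 * x + rng.normal(0, 0.1, size=x.shape)   # true process: linear plus noise

    for degree in (1, 3, 9):                       # competing "models" of the data
        coeffs = np.polyfit(x, y, degree)
        print(degree, bic(y, np.polyval(coeffs, x), degree + 1))
    # the simplest adequate model scores best even though the complex ones fit closer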
I'm not sure I'm understanding you, or I'm not sure I agree. Humans, for instance, can learn from substantially less data than (most) machine learning models, if the data is presented in a way that we're optimized for (e.g. human facial recognition).
If we can devise ML algorithms capable of generalizing from substantially smaller datasets, while data will still be relevant, it will be a less important bottleneck in the process than it is now.
Human beings can be smart in using a fairly small set of data. Humans have an enormous store of data to work with but when confronted with a bit of new data on a new subject, they can sometimes do very well.
This is a result of the ML paradigm essentially starting fresh with whatever data set it is trying to "solve", but overall, the paradigm doesn't have to be that way.
IMHO pretty much every example I've seen for how humans can get good results with using a fairly small set of data actually involves successful generalization / transfer of "generic life experience" or "generic audiovisual processing" to that new problem instead of actual learning from limited data. We know how to do that in ML, in general - we can do transfer learning from few examples quite well if the underlying generic data is good enough. However, we currently don't have good enough underlying generic data to match the years of generic life data that any human kid has accumulated.
A human learning to play an Atari game that involves an agent jumping over a pit has to learn only the mechanics of that game, and that can be done in minutes. Learning that game from scratch, on the other hand, requires also learning interpreting vision and the whole concept that the world has objects that may move around - which takes months of learning for human brain. So comparing sample efficiency is an apples to elephants comparison if we disregard the ability to reuse/transfer knowledge from related tasks that all humans learn during childhood.
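This is also why the standard few-example recipe in practice is transfer learning; a hedged PyTorch sketch (the two-class task and the few_shot_loader are assumptions, not anything from the thread):

    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(pretrained=True)        # stands in for "years of generic vision"
    for p in model.parameters():
        p.requires_grad = False                     # keep the generic features frozen
    model.fc = nn.Linear(model.fc.in_features, 2)   # new head for the new, tiny task

    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # few_shot_loader is assumed: a DataLoader over just a handful of labelled images
    # for images, labels in few_shot_loader:
    #     loss = loss_fn(model(images), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()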
Plus, investing in data is much more predictable: the outcome is always going to get better, even if with diminishing margins, and better is better.
Investing in modeling is not: hiring 100 machine learning 'experts' will not solve the problem 100 times better than 10 of them would. On the other hand, 100 labelers will reliably provide 10 times the throughput, given the same scaling.
True, of course. We can only go as high as 100% for accuracy, right?
However, the labelers are scalable in terms of throughput and coverage of the data: you can always find bad examples or holes in your current data plane, and that is when the labelers, not the scientists, are going to rescue you.
Yes. On average the human brain needs just dozens or hundreds of examples (turns, exercises, you name it) to "learn" something. A decent machine learning model needs hundreds of thousands to millions of samples to gain good confidence, and can still be fooled easily afterwards with subtle changes.
It might be worth considering the extremely large amount of sensory information humans are hooked up with. Computers generally have only a very limited amount of sensory input, not to mention the inferiority of electronic sensors compared to the highly sophisticated biological ones.
I'd like to see this put to the test in a situation where sensory information is of no benefit: strategy games.
Alpha Zero got to be so strong at chess through millions of games of self-play. I would like to see how it would fare with only one hundred games, against a human chess beginner with one hundred games under his/her belt.
Actually, I don't think this is strictly speaking accurate. In order to train from those dozens/hundreds of samples, human brain needs to first be developed enough by accumulating experience from billions of samples in related domains. This "pre-training" process literally takes years, and it still does not sufficiently prepare some people for some tasks.
I just think machines are good at some things, and people are good at others. If it really took billions of samples in related domains we wouldn't develop nearly as fast as we do after being born.
I'm not sure I'd call it "fast" either. For the first three months babies can barely see anything, and for at least nine they can't form anything even remotely resembling speech and can't walk. What's amazing is that all this learning is very sparsely supervised and all "subsystems" train at the same time.
This is a human peculiarity. Many animal babies are born with fully developed abilities. I am sure we have all seen the NatGeo videos of antelope babies stumbling 2-3 times and then immediately start walking and even running.
Human babies have a bigger head-to-body ratio than all other species because our brains are bigger. Our babies have to be born earlier, otherwise they cannot make it out of the mother alive.
Outside of that, we develop pretty quickly. As you pointed out, everything in us is developing in parallel which is quite impressive.
I think it's important to note that when people talk about machines vs. people, they always compare the ideal machine with the ideal person - a rigorously educated, athletic genius who can infer links between things with a couple of hours of study at most and infer the future results of actions with a couple of seconds of thought.
Most people aren't like this, but on average the engineers who are truly thinking innovatively about these problems and creating solutions are. It is just the way it is.
Your wording is interesting here. Do you mean humans need less data to feel confident, or less data to make accurate predictions, compared to an ML model? Sorry if I'm parsing your words too closely; just genuinely curious.
The general answer is yes - it's the "one shot learning" problem. But humans also have access to a large (if not very well defined) amount of data via background knowledge, which plausibly enables them to apply stronger, more effective priors to any one task.
So try it: Teach a computer about animals in general.
Then show it a single example of a new animal it has never seen, and see how well it would do at telling you if challenge images are or are not that animal.
It's not even going to be close to what a human would manage.
It's not because of the data, it's because a human understands what he's seeing.
Then you're comparing a machine that's never seen anything but animals against a human that has no separation between different kinds of data and can't start on a blank slate.
I may have all sorts of images in my mind, allowing me to make the distinction?
This doesn't sound too different from face recognition systems where you train a network to be able to tell if two faces are the same or different. I don't know what the state of the art in animal recognition is, though.
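That verification framing can be sketched directly: embed both images with a generic pretrained network and compare; the threshold and the choice of backbone below are arbitrary:

    import torch
    import torch.nn.functional as F
    from torchvision import models

    embedder = models.resnet18(pretrained=True)
    embedder.fc = torch.nn.Identity()   # use the pooled features as an embedding
    embedder.eval()

    def looks_like_same_animal(img_a, img_b, threshold=0.7):
        """img_a, img_b: preprocessed (1, 3, 224, 224) tensors.
        Returns a guess at 'same kind of animal?' from embedding similarity."""
        with torch.no_grad():
            emb_a, emb_b = embedder(img_a), embedder(img_b)
        return F.cosine_similarity(emb_a, emb_b).item() > threshold

A single reference image is enough at inference time, but the embedder still leans on huge generic pretraining, which is the parent's point about where the advantage really comes from.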
I don't understand how this is a race at all. There is no way to finish it, and "going fast" is likely to cause major issues in developing sound technologies. Maybe shifting the article's attitude to a more scientifically sound position would increase its relevance.
I have just started working with Ocean Protocol (https://oceanprotocol.com/) and I have set up a local meetup in January for local entrepreneurs looking for access to machine learning data.
There are reliable ways to generate unbiased, synthetic and even big data for almost any industrial domain out there, plus a lot of applied research fields in the STEM curricula. The most important issue nowadays is the accountability of results (i.e. ablation studies), following too many false positives and a number of recent, blatant scams.
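As one example of what generating your own data can mean in the tabular case, here is a minimal scikit-learn sketch; the parameters are arbitrary:

    from sklearn.datasets import make_classification

    # Labelled synthetic data from a known generating process; "big" is cheap
    # when the generator is yours, and there is no hidden labelling bias.
    X, y = make_classification(
        n_samples=100_000,
        n_features=30,
        n_informative=10,
        class_sep=1.0,
        random_state=42,
    )
    # Because ground truth is known, ablation studies can be checked against it.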