How to create an AI startup – convince some humans to be your training set

AznHisoka · on March 30, 2016

"It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them."

Well, I would certainly hope any employee helps create value for the company they work for... even if they get laid off eventually.

exolymph · on March 30, 2016

The key word here is "legal". Without having a contract to that effect, no employee or contractor can just appropriate equity, regardless of how much sweat they put into building the company. I'm not sure why OP thinks there might be a legal claim.

danblick · on March 31, 2016

I think he's really missing the point about the importance of self-play in AlphaGo. Human play provided a seed for training the system, but the thing that made it work was that the computer could play an unlimited number of games against itself; because Go is a game with clear rules, it was possible to label a huge number of board positions without any human-derived training data at all. The human-derived training set isn't nearly enough for this.
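
A minimal sketch of the "clear rules let the game label its own positions" idea. This is a toy Nim variant picked purely for illustration, not anything AlphaGo uses, and the names `self_play_game` and `generate_labels` are hypothetical. A random policy plays against itself, and every position it visits is labeled with the final outcome from the point of view of the player to move:

```python
import random

def self_play_game(rng, start_stones=10, max_take=3):
    """Play one random game; whoever takes the last stone wins.

    Returns (stones_left, outcome) pairs, where outcome is +1 if the
    player to move at that position went on to win, and -1 otherwise.
    """
    stones = start_stones
    player = 0
    history = []  # (stones_left, player_to_move) for every position visited
    while stones > 0:
        history.append((stones, player))
        take = rng.randint(1, min(max_take, stones))  # random legal move
        stones -= take
        if stones == 0:
            winner = player  # this player took the last stone
        player = 1 - player
    # The rules alone decide the labels; no human annotation is involved.
    return [(s, 1 if p == winner else -1) for s, p in history]

def generate_labels(num_games, seed=0):
    """Collect labeled positions from many self-play games."""
    rng = random.Random(seed)
    labels = []
    for _ in range(num_games):
        labels.extend(self_play_game(rng))
    return labels

labels = generate_labels(1000)
print(len(labels), "labeled positions from 1000 self-played games")
```

The number of labeled positions grows with the number of games played, which is the part a fixed human-derived data set cannot match.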

nazka · on March 31, 2016

Do you have first-hand sources about this? I have now been hearing all sorts of things about what makes AlphaGo so great... First it was the hardware, then it was the use of Monte Carlo tree search with NNs... And even more just one day ago: https://news.ycombinator.com/item?id=11382954

danblick · on March 31, 2016

"the AlphaGo algorithm, this is something we’re going to try in the next few months — we think we could get rid of the supervised learning starting point and just do it completely from self-play, literally starting from nothing."

http://www.theverge.com/2016/3/10/11192774/demis-hassabis-in...

(This is why I take exception to the claim in this blog post that the supervised training data was critical to success...)

nazka · on March 31, 2016

Ok thank you for your answer. So many things are claimed that it is hard to track what is real and what is just hype.

I agree that it's something big. Training AlphaGo against itself means something bigger than "just" optimizing a statistical model on human data. I think recognizing logical elements and planning strategy are part of what will bring ML really close to AI (along with memory and the cleverness to learn), and they are the next big steps.

morganK · on March 30, 2016

Would have liked to hear at least one concrete example of a startup actually doing that. It seems a bit theoretical at the moment, as big companies don't need to do that thanks to existing datasets, and I've never heard of any startup using dozens (hundreds?) of contractors for this kind of job.

tariqali34 · on March 30, 2016

Netflix used humans to tag movies for their recommendation system.

Source: http://www.theatlantic.com/technology/archive/2014/01/how-ne...

LunaSea · on March 30, 2016

Netflix is not a startup.

true_religion · on March 30, 2016

At one point, Netflix was a startup.

LunaSea · on March 30, 2016

Yes but it wasn't in 2014 or 2012.

HillRat · on March 30, 2016

CrowdFlower does AI and ML-focused microtasking, though I have no experience with them. Even large companies need plenty of preprocessing done on their datasets, so it's common to use offshored services companies or divisions to do annotation and cleanup work on corpora before using them as training sets.

johndavi · on March 30, 2016

In very broad strokes this is how we power many of our API features at Diffbot. We have hundreds of thousands of human-trained web pages amounting to millions of individual elements that have helped to train our system.

RobertoG · on March 30, 2016

Not a start-up and not deep learning (until now, I suppose), but this has been done for years in the translation industry.

They feed their automatic systems with the output of the human translators. Every input means less and less manual work that needs to be done in the future.

globba22 · on March 30, 2016

The post office used humans for many years to train OCR models, e.g. zip code readers.

I visited a postal routing facility once in the 90s and saw a long metal row staffed by about 20 people, 10 to each side. Envelopes passed through on a sort of pneumatic tube-like conveyor and paused in front of a human operator, who read a single digit of a zip code, keyed it in and sent the envelope on to be read by the next person.

nl · on March 30, 2016

Many, many startups use Amazon Mechanical Turk and/or CrowdFlower for this exact thing.

See http://blog.echen.me/2012/04/25/making-the-most-of-mechanica... for some examples.

klochner · on March 30, 2016

hunch

lifeisstillgood · on March 30, 2016

This does hit at one of the most basic debates of the next decade - how much of my actions and behaviours do I own? Creating a link from one page to another, thus providing PageRank with value - do I get a cut of that value? Purchasing a book or a film, thus making profit for the reseller's recommendation engine? Driving around populating maps with my GPS co-ordinates? Just generally leaving digital footprints makes someone a training set somewhere - and yet instead of this being a public good it's private profit - the term bandied around after 2008 was "socialising risk, privatising profits". The same debate should be happening here - but I only occasionally hear about something like it.

Or am I listening in the wrong places?

thinkingkong · on March 30, 2016

It won't work this way in the short term.

Any company doing "AI" will get there over a long period of time by employing people to do actual work and then slowly automating that work away. If you wait for a huge dataset or some new technique there will be tons of competition.

zodPod · on March 30, 2016

>It will also be interesting if there is a legal claim for the gig workers at these companies to make that their labor helped “create the value” at the companies that replace them.

I'd assume that they'd be waiving any legal claim they might have when they sign the ToS or w/e. I mean, in all fairness, they are getting paid to perform these actions and be recorded. What more would they have any claim for anyway? A percentage based on the times their anonymized playthroughs were used?

"Well, we've got 1,000 people and each played 100 games of Go. We took those 100,000 games and trained a single model to play against itself." User is 1 player of 1,000. Company makes $20,000,000 and sets aside 25% (magically) for paying back the original people. Each of those people now gets $5,000. That $5,000 is cool but it's not life-changing.

EDIT: It occurs to me that my numbers could be skewed. This could be significant if they only used 100 people or so, I guess. My point wasn't necessarily to shoot down the notion, just to discuss it. What would the person have a claim to, be it legal or otherwise?
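
For anyone who wants to play with the numbers, here is the back-of-the-envelope math as a tiny sketch. It assumes an even split of a fixed share of revenue, which is just the hypothetical from above, and `payout_per_worker` is a made-up helper name:

```python
def payout_per_worker(revenue, share, num_workers):
    """Split a fixed share of revenue evenly among the workers."""
    return revenue * share / num_workers

# 1,000 workers, $20,000,000 revenue, 25% set aside
print(payout_per_worker(20_000_000, 0.25, 1_000))  # 5000.0

# Same pot, but only 100 workers
print(payout_per_worker(20_000_000, 0.25, 100))    # 50000.0
```

So the per-person figure scales inversely with the size of the training pool, which is why the 100-person case looks so different.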

pbkhrv · on March 30, 2016

Microsoft, perhaps inadvertently, did that. Tay's stream of consciousness can now be used as a training set for an abusive content monitoring AI.

bliti · on March 30, 2016

You could crawl 4chan and get a bigger dataset of abusive content. But that could lead to terminators showing up on my lawn.

tariqali34 · on March 30, 2016

The interesting question is what would you call these humans who are serving as your training set. Do you call them "Machine Therapists" (trying to coax the AI to proper behavior)? "AI Educators" (providing the material that is used to teach the AI)? "Data Scientists" (they are curating data and handing it off to the machine)?

pdkl95 · on March 30, 2016

Hopefully they call them "people who gave their informed consent to use their data in this specific AI project".

stcredzero · on March 30, 2016

Searle would have us interpret this as the company taking the intelligence of the humans, refining and repackaging it.

https://www.youtube.com/watch?v=rHKwIYsPXLg

nxzero · on March 30, 2016

Unclear how this is new. Even Google, Amazon, etc. have been doing this internally, offering it as a service, been susceptible to man-in-the-middle exploits that mine real-world data for training sets, released data, etc.

awinter-py · on March 30, 2016

Spot on. One recipe to become a tech acquisition target is to collect a 'new kind' of user data -- all big companies are hungry for this.

This phenomenon is not at all new; data has been informing investment models forever and access to that data comes from having the right customers, and is closely hoarded once gotten.

Some of the largest companies in the late Middle Ages were wool buyers -- they weren't permitted to trade internationally, but they used locally owned franchises and market knowledge to corner the market anyway. And many of the largest ag commodities futures traders in this century also own substantial farm acreage. Those Capital One guys who were SEC'd for trading options on credit card receipts were leveraging customer activity.

Point being -- you've always needed data to train a good model.

graycat · on March 30, 2016

With so many parameters, the normal equations will become large. In that case, one can consider solving the equations with the old iterative Gauss-Seidel method.
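
A minimal sketch of what that could look like, assuming "normal equations" here means those of linear least squares, (X^T X) w = X^T y. The function names and the tiny data set are made up for illustration. Gauss-Seidel is guaranteed to converge for a symmetric positive definite matrix, which X^T X is whenever X has full column rank:

```python
def normal_equations(X, y):
    """Build A = X^T X and b = X^T y for a least-squares fit."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)]
         for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return A, b

def gauss_seidel(A, b, iterations=500, tol=1e-12):
    """Solve A x = b by Gauss-Seidel sweeps, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        max_change = 0.0
        for i in range(n):
            # Use the newest values of x[j] as soon as they are available.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - s) / A[i][i]
            max_change = max(max_change, abs(new_xi - x[i]))
            x[i] = new_xi
        if max_change < tol:
            break  # the sweep barely changed anything, so stop early
    return x

# Toy data lying exactly on y = 1 + 2*t, with an intercept column of ones.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
A, b = normal_equations(X, y)
coeffs = gauss_seidel(A, b)
print(coeffs)  # close to [1.0, 2.0]
```

Each sweep only touches one row of A at a time, so the matrix never has to be factored or inverted, which is the appeal when the system is large.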

verbify · on March 30, 2016

I've only a little experience in NN, but getting trainers is rarely the bottleneck - it's usually in programming the NN.

tmaly · on March 30, 2016

I plan to do just that, but my end goal is to provide a free service that has tons of value for my users.

graycat · on March 30, 2016

There's a chance that some Web site ad targeting is being done this way.

madelinecameron · on March 30, 2016

This is kind of "no duh".

Not really an article that adds much value or understanding, especially for a blog seemingly being targeted to a technical audience.