> Academic research accelerates innovation, but it requires costly data that is out of reach for most academic teams.
This is true of pretty much any AI research. Look at Puffer[0], which was just on HN a couple of days ago. They're running a free streaming service just to get enough data to train their algorithms, and in fact mention in their FAQ that they would love to use commercial data if they could get it.
Unfortunately, academic and commercial incentives don't really align here. Most commercial entities don't want to share their data because it's valuable to them, and if they let researchers in, they want the output of the research to remain proprietary to their commercial enterprise.
I wonder if there isn't some sort of governance solution to this. Like give companies big tax breaks for sharing their data with researchers, or something like that. Essentially subsidize academia indirectly.
I've seen semiconductor industry companies collaborate on grant-funding fundamental condensed-matter physics research. If it is a question of interest to all parties, and the work is too blue-sky to be immediately profitable, sometimes they'll fund the work.
> Unfortunately, academic and commercial incentives don't really align here. Most commercial entities don't want to share their data because it's valuable to them, and if they let researchers in, they want the output of the research to remain proprietary to their commercial enterprise... I wonder if there isn't some sort of governance solution to this.
You're commenting on an article in which a commercial entity is sharing their data despite it being valuable to them. Maybe they are the outlier but I've seen plenty of companies share data, especially in the ML space. Here are some datasets[0]. Maybe you would prefer more, but compared to other fields there is a lot of sharing. A "governance solution" could make things worse. If there was some mandate that companies that collect this data have to share it in a costly way, then it would discourage collection.
Likely they see self-driving as a complement to their core business of vehicle-passenger matching rather than something they hope to profit from directly in a meaningful manner, in which case "commoditize your complement" applies.
"A classic pattern in technology economics, identified by Joel Spolsky, is layers of the stack attempting to become monopolies while turning other layers into perfectly-competitive markets which are commoditized, in order to harvest most of the consumer surplus."
Your ideas pair well with what Austhrow743 said above. Self driving as a complementary endeavor so Lyft's core isn't jeopardized by releasing the data.
> I wonder if there isn't some sort of governance solution to this. Like give companies big tax breaks for sharing their data with researchers, or something like that. Essentially subsidize academia indirectly.
I think that's a really great idea. Not sure how many would take advantage of it but if it could be made to work then it would be really awesome.
It would also be extremely prone to abuse, though. Patenting is already an art of pretending to explain in clear terms what you are doing, while actually describing something as broadly and vaguely as possible. It would be pretty easy for a TON of releases to leave out some key details, making the information impossible to use or unhelpful to have.
You could form industry-specific regulations or even an active agency to prosecute abuses like that, but it would be immediately overwhelmed. The patent office is already heavily gamed by patent trolls, who bank on long odds for small judgements. Now imagine if millions or billions of dollars of taxes were on the line, and major companies were investing significant resources to open source while protecting their IP.
Even if that were all figured out, how would you value open sourcing stuff, even something as simple as data? Do you give breaks by size, importance, proportion of profit or future profit? Cost of the research? How do you guard against overvaluations and abuse of accounting? Even if you had perfectly accurate, annually-updated solutions for all that, companies can still game the system. Lyft has decided this dataset is what they need; if they could get a bigger break by collecting more data, they'd do that. Plus, Facebook and Google release tons of open source stuff. Do they deserve more than, say, pharmaceutical research?[1]
Similar (IIRC Nixon) tax breaks already exist for R&D, and they are a notoriously abused loophole. Simplified but illustrative example: you build your R&D lab in the shape of a factory, do your research for a while and then suddenly scale back and replace it with machinery- well, the original building was still deducted from taxes.
Pharma is actually a perfect example. It's a well known fact that R&D only accounts for 22% of pharma industry revenue (almost equal to advertising at 19%), but only ~30% of that actually goes to new drugs. The rest takes advantage of marketing and the patent system to re-release drugs that are essentially the same. Two thirds of their research is obvious changes that are only protected because they owned the original patent- those shouldn't be getting the benefit of incentives.
Slightly tangential but might this be another argument for people "owning" their data while companies "own" the processing procedures of it. If people "owned" their data it would presumably be much easier for them to give it out for research purposes
* The Lyft sensor suite has bumper-mounted lidar, which is absent from other existing datasets. Point cloud data in these areas is critical for pedestrians, bikes, and various road hazards. So this dataset alone is useful for validating work trained through other means.
* The current Lyft Level 5 release has no explicit test / validation set, which is crucial for properly measuring performance of any experiment one might do with the data. In nuscenes and Argoverse, there's a small snippet dataset that helps you prepare your pipeline. Feels like Lyft might have rushed things a little here-- they could have posted a "teaser" and then the full train and test/validation set a couple weeks later.
Great to see more public data (especially from a more modern sensor suite), plus investment into a contest with prizes.
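For anyone who wants a holdout set in the meantime, here is a minimal sketch of one way to carve out your own validation split. It assumes each recording has a stable string ID (called `scene_id` here, a hypothetical name, not something taken from the Lyft SDK) and assigns whole scenes to a split by hashing that ID, so frames from the same drive never end up on both sides.

```python
import hashlib


def assign_split(scene_id: str, val_fraction: float = 0.2) -> str:
    """Deterministically assign a whole scene to 'train' or 'val'.

    scene_id is a hypothetical stable identifier for one recording.
    """
    digest = hashlib.sha256(scene_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "val" if bucket < val_fraction else "train"


def split_scenes(scene_ids, val_fraction: float = 0.2):
    """Group scene IDs into train and validation lists."""
    splits = {"train": [], "val": []}
    for scene_id in scene_ids:
        splits[assign_split(scene_id, val_fraction)].append(scene_id)
    return splits


if __name__ == "__main__":
    # Made-up IDs, purely for illustration.
    example_ids = [f"scene-{i:04d}" for i in range(10)]
    result = split_scenes(example_ids)
    print(len(result["train"]), "train scenes,", len(result["val"]), "val scenes")
```

Hashing the ID rather than shuffling means the split stays the same across machines and reruns, and scenes added later do not move existing ones between splits. It is no substitute for an official test set, since everyone would still be choosing their own holdout.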
Hmm, this blog post and the website don't mention that this dataset was mostly annotated by Scale (scale.ai), as part of a partnership with Lyft ... We're going to publish a blog post about this soon, but if anyone at Lyft is reading this, please figure out how to reasonably credit Scale since I doubt leaving out Scale completely from the announcement is in the spirit of the agreement. Scale should probably also be added to the bibliography and website in some form.
Contrast this with the nuScenes website, which was also annotated by Scale, and whose data format set the standard for this dataset: they credit Scale pretty reasonably
This comment does not represent the company's viewpoint, and cardigan is not speaking on behalf of Scale.
We are very excited to have been able to work with Lyft in open-sourcing this dataset and advancing the research community. We are also very grateful to Lyft for choosing to leverage our point cloud viewer and have credited the annotations to us on their launch page.
Hopefully not. Cardigan obviously has the company's interest in mind, even if perhaps the execution is a little flawed. Cardigan has just learnt a lot about PR and also gave a lot of free airtime to Scale.
Also hopefully Scale will use this opportunity to educate team members about situations like this.
That’s a strange take on this. Not all staff are authorized to speak on the company’s behalf. That’s true almost anywhere I’ve worked. Your efforts cannot always be recognized externally. NDAs and various other types of contracts commonly outline that.
I would be surprised if many people here really just assumed that a pseudonymous user chatting with us in the HN comments was speaking on behalf of the company in an official capacity. I mean, obviously there are legal niceties to be observed and he should have appended the usual disclaimers, blah blah blah, but we do have common sense here right?
No, people don't have common sense. People should not post publicly on behalf of their employer without running it by a manager. This is lesson one at every major corporate introduction and I now understand why, because people don't have common sense.
I didn't say anything about whether he should or should not have spoken out about the deal. And I specifically said that common sense doesn't cut the mustard legally. But I am asserting that the damage from people supposedly assuming that he was speaking officially is speculative and likely zero.
This isn’t really about employee recognition. The whole comment was about attribution for the company and marketing the partnership with Scale. That is pretty standard in some business arrangements, but it wasn’t the case here; the employee wasn’t aware of that, and it turned out not to be that big of a deal for Scale. Plenty of companies work in the background supporting other businesses and don’t always need attribution.
OP is trying to pigeonhole this into some sort of anti capitalist diatribe by trying to make it about individual employees wanting recognition and some big evil company is treating them like invisible cogs in the machine... which doesn’t make much sense since he asked for the company itself to be attributed, not individuals. It’s up to the company to reward and recognize employee contributions, not in some 3rd party partners announcements.
Plus he was always free to comment how he helped work on it or letting people know Scale had a role in helping make it (which are both common on HN). Only if the parent company tried to suppress that would this argument make any sense. But I don’t know why I’m bothering to counter such a position.
that's a strange response. i'm already clearly critiquing the dominant paradigm, capitalism. why would you just itemise a bunch of conventions from this paradigm, which i likely disagree with?
do you need hn to be an agreeable echo chamber for you?
capitalism depends on people not thinking thoroughly about the "deal" they are being drawn into. i'm here to harm this situation.
Quick tip: voice concerns about partners in private. Lyft will probably be happy to credit Scale more - it was likely an honest mistake. But now you dragged them through the mud publicly, which is going to make big companies less likely to work with Scale in the future.
Yeah, this comment to me really gives off a bad impression of Scale as a whole. My immediate reaction, assuming this is how Scale normally deals with PR, is that this company is still far too immature to be properly handling any sort of legitimate partnership.
FYI, you’re unlikely to solve any of your problems airing grievances on a public forum instead of just directly emailing the people involved.
If you are an officer of scale you should take this offline. If not then you’re probably not authorized to speak on behalf of scale. Check your NDAs and service agreements. This is in such poor taste, anyone who would consider scale’s service now has to consider this sort of public commentary.
Out of curiosity, isn't Lyft just a customer that pays Scale for annotation services? Or is there a reason for this to be more of a partnership and less of a customer-client relationship?
yeah, i'd say so. what if lyft's lidar manufacturers said that they should be included in this release because it's their lidars? I agree, it doesn't really need mentioning
OP here, just waking up (I'm remote) - I can't edit my original comments so let me modify them here:
I wasn't involved in our communications with Lyft, so I was talking about something I didn't know much about. My audience was just the anonymous commentariat: turns out a lot of people whose opinion makes a material difference to Lyft/Scale read these comments too. Sorry for not realizing that; I probably wouldn't have posted an uninformed personal opinion had I realized that.
I was being way too aggressive - genuinely sorry to anyone at Lyft who felt maligned by these comments. I woke up to 20 messages from coworkers who told me I was being an ass - genuinely sorry :(
Also I really should've clarified I was not speaking on behalf of the company: this was just a personal, uninformed opinion.
I cannot go into the details I learned about Scale's agreement with Lyft since it's confidential
I seriously suggest you add "Disclaimer" to every comment you leave on threads related to your company (see boulos' comments for example here on HN).
Also, and I really mean this in the friendliest way possible, take a pause from commenting here. You're not doing yourself any favors. I suggest you talk to your PR/marketing department before exposing more internal details, or run comments by them before posting.
Word of advice. If someone is paying you for a service (Lyft in this case), you really should think twice, three times, four times, before you disparage them in public. Live and learn. Good luck.
Also, the viewer packaged with nuScenes was built by Steven Hao from Scale, and while it was packaged as part of nuScenes it should probably be called Scale's viewer instead of nuScenes' viewer. The original viewer in the nuscenes SDK has the Scale logo, but it looks like Lyft removed that in the fork. Maybe a bit of public shaming will fix that...
Dear Lyft marketing person who wrote this: we are a data labeling company, and you may think that means we have a bunch of useless bozos working here like most other data labeling companies, but that's not true - e.g, Steven is one of the smartest people in the world - https://stats.ioinformatics.org/people/3113 - he learns ridiculously quickly - e.g, gets to number one on random video games in a few weeks and learned to boulder L10 in a few months from scratch (normally takes years/decades and most climbers never get there)
At first I was kind of on your side against the other comments telling you to delete your other comment. I think it's important to set the record straight if you can as early as possible. A small retraction/correction isn't guaranteed to make the frontpage of HN again.
But then this comment took a weird turn with how fast Steven can learn rock climbing (seriously, I am still kind of unsure if we're talking about rock climbing because it's so random and unrelated).
There is no such thing. They have to have meant v10, and climbing v10 "in a few months" is an incredibly dubious claim.
Without some kind of gymnast background I honestly don't think it's possible. Even 0 to v10 in a year is hard to believe. I've heard of phenoms doing it in 2 or 3 years, and don't doubt a year is possible... but a few months? Kind of like going from couch to sub 5-minute mile in a few months.
I've had it with you Scale.AI people always trying to take credit for Lyft's work. We've been working weekends and nights for years, even the hourly workers have had to take unpaid overtime. All that time, I've never seen Scale.AI do extra work to help us before a big deadline.
Such competitions do not usually result in a comprehensive "solution" by themselves - pushing the state-of-the-art is more common. Also the value is not going to be derived solely from the algorithm but more from its deployment to real world applications and the surrounding infrastructure to make it possible.
But do not forget there will be 10s (if not 100s) of people working on this for 30 days. The man-hours this competition will use are highly disproportionate to the amount offered overall.
“There will be $25,000 in prizes, and we’ll be flying the top researchers to the NeurIPS Conference in December, as well as allowing the winners to interview with our team.”
I guess it’s a decent opportunity if you’re trying to break into DL?
I unofficially got first place after finding a bug in their test set (allowing me to blow the competition away). I reported the problem directly - they decided not to fix it and asked me to take my submission down. They said they'd still offer an interview.
However - this interview wasn't even for their DL team. They offered an interview with the web tools support team because they felt I didn't have enough experience...
> they decided not to fix it and asked me to take my submission down.
Imagine if it was a bug in their app leaking millions of users' data and they went "We won't fix it. You can only use Uber now" and "Also, come for an interview if you want to fix our web tools instead".
>> allowing the winners to interview with our team
This was pretentious AF. People who win such competitions _allow companies_ to interview them sometimes, not the other way around. It's not like working at Lyft is some amazing privilege.
I agree that the wording of "allow you to interview" is pretentious, but the end result is commonplace. Successful Kaggle competitors score job interviews all the time because of their performance. Getting an interview is a two-way street. The company needs to give you an invite and you need to accept. Doing well in the competition will get you an invite.
You might not think that working at Lyft is an amazing privilege, but it could be a good gig for someone that needs a paycheck and is looking for a job in the self-driving field.
The post indicates there is a competition and prizes but I'm not seeing any discussion of what sort of license the data is being made available under (or the competition for that matter). Hopefully it's there and I'm just not seeing it.
Hmmm. I wonder how much fun lawyers will have arguing about whether that "NC" clause means a model trained on this data cannot be used commercially by the researcher who built it?
I would guess (I'm not a lawyer) that a commercial model would be in the clear as long as the company doesn't release anything including the data itself. Model weights that are derived from the data are not the data.
I would make an analogy where the training data is like a textbook. If I read in a textbook about how to design/build a bridge, I don't have to give royalties to the textbook author when my civil engineering and construction firm gets paid to build a bridge. The copyright/license of the textbook can't prevent me from using the knowledge gained from the book to do a commercial job. In a similar vein, the knowledge gained from a public data set is probably fair game for whatever you want, just as long as you aren't repackaging the data itself. There's probably a good boundary requiring someone to stop at a point where the model weights can't be used to reconstruct the original data.
Of course, other people can disagree. I would look forward to an actual legal opinion to clear this up.
I disagree. The license specifies that "using the material for commercial purposes" is prohibited. The act of training a commercial model is obviously a commercial purpose. Whether or not the data is somehow incorporated into the resulting model is irrelevant.
You're confusing this CC license with open source licenses that do not restrict use but require derived works to be created/distributed under certain conditions. This CC license restricts use, in that you are not allowed to use the data for any commercial purpose - this is totally different from open source, and more like the "academic use only" licenses which used to be more common.
I think this is a pretty good argument, and more correct than my previous comment. It's too late for me to edit that one, but it's a pretty clear argument.
I doubt that any company would be willing to take the risk of creating a commercial self-driving car using this data to get a head start. I don't think I'll be seeing this particular license fought over in court.
CC is rarely a good license to choose other than to make yourself feel good. It does a terrible job of dealing with the actual matters of importance to a license outside of photo sharing (and even there it's not a good choice, as many photographers have found, because it explicitly grants rights to others that the photographer may not be able to grant to them, leaving the photographer open to lawsuits, as has happened).
In this case, imagine a student working on a class assignment. They use this data for purely academic purposes with no commercial intent in mind. After they train their system, they realize "wow, I could use this trained system and get rich." There was arguably no commercial use during the training. The use of the data was purely academic, like a person learning math or French. What you do after running the learning is a separate matter, just as using a CC licensed textbook to learn math doesn't prevent you from getting a job as a statistician.
Again, the tl;dr is instead of trying to divine how a court will deal with a poorly specified problem, it's much better to just not license your stuff using a CC license. There are almost always much better licenses to choose from.
I'm curious about what you (or anyone else) would recommend as better license choices for datasets that might be used in machine learning model training?
(With, I suppose, hints about what restrictions you might be wanting to grant or prohibit by particular license options?)
IANAL but my sense is that it will be a number of years before we really know what works with regards to this kind of licensing of DNN/ML training data sets. It's almost certainly going to be decided on the basis of what's called "case law", which really just means "a bunch of random judges who don't know anything about ML made decisions on a bunch of random lawsuits that probably were not good samples to pick and which were probably taken to court by pairs of parties with wildly different abilities to pay lawyers and now we are stuck with those decisions as precedent for future cases."
If that sounds like a crappy legal footing for the next 20 years of software development, yeah, it is. It's also why Mitch Kapor and a few others founded and funded the EFF to try to encourage better case law decisions around the early days of electronic privacy law. We definitely need an EFF like effort around ML/DNN/etc., but I'm not holding my breath.
I wish I had a better answer, I'm mostly hoping someone else here does.
This will go against the grain here on HN. I don't know how anyone can imagine self driving even succeeding in the real world, let alone in the so called third world, unless all vehicles are self driven and operate in a controlled environment. There are some really important things pending, like accurate NLP and computer vision, but no, we need something shiny and useless. I think some smart computer scientists are getting rich by dangling a carrot in front of some gullible billionaire investors. Good for them. I hope some of the really useful stuff piggybacks on this rather lofty endeavor.
I'm with you; I think self-driving is an aspiration rather than a concrete goal. It literally means solving the quintessential problem of robotics, which is a very very hard problem. We'll probably have human level NLP before that.
A more concrete goal for transportation would be to reduce driving times even more by adopting remote work. That is reachable within the decade. In the meanwhile, car safety features should be ramped up, but autonomous driving so far doesn't seem very safe.
> Self-driving is too big — and too important — an endeavor for any one team to solve alone. Transportation serves all of us, and we should all be invested in the next step of its evolution.
Imagine how much better things could be if everyone working on maps felt the same.
It's looking more and more like everyone is just going to have to licence Tesla's FSD when it's finished.
They are the only ones with a broad real world data source and seem to have wisely taken the right path by not adopting LIDAR, focusing purely on passive vision.
Tesla decided not to include lidar because they couldn't find a manufacturer that would make one cheaply enough for them, or fell out over terms. It's not a statement of vision. It's exactly the same kind of decision as Apple dropping Flash support for the iPhone: the processor and RAM were too limited to support it, Adobe refused to make compromises, and it was too late to change before launch.
Firstly, Tesla is not focussing purely on passive vision, they are using radar as well. But because radar is nowhere near high resolution enough, they need vision to provide categorization.
Now Musk makes a lot of noise about avoiding lidar; that's mostly because he knows it's a massive gamble. Yes, he bleats on about its power budget and cost, but using pure AI costs a whole lot more in R&D, plus a boatload of latency. Not to mention the massive power budget needed to run the custom silicon.
_eventually_ vision + radar will be more than enough to provide life critical level 5 autonomy. However Tesla barely provide more than level 2.
They have a number of problems to overcome, rain/bug occlusion of vision sensor, low light performance, sunrise/sunset, fog, reliable realtime depth estimation, etc.
I suspect that CCD based time of flight depth sensors will become cheaper, low power, and small (They are almost certainly going to end up in mobile phones soon) before pure vision realtime life critical depth estimation is a thing.
Anytime you do something new it's a gamble and things will go wrong.
Tesla has never been shy of using expensive components for their products. Their entire strategy relies on bringing down battery costs.
Musk makes a very good argument for his choice to not use LIDAR - nothing alive on this planet uses it. Every creature navigates our world on vision, even dolphins and bats use sonar as a secondary system. The proof of concept is everywhere, Tesla is just adopting it.
And yeah, they're purely level 2 at the moment. I don't agree with their marketing as it does lead some to use it as something more before it's ready.
> Musk makes a very good argument for his choice to not use LIDAR - nothing alive on this planet uses it. Every creature navigates our world on vision, even dolphins and bats use sonar as a secondary system. The proof of concept is everywhere, Tesla is just adopting it.
Nothing alive uses wheels, apart from us. It's a null argument.
I don't mind a gamble; what I mind is patent bullshit. If he'd been honest and said "Lidar is great, but it's too expensive, and the manufacturers are not willing to compromise" and left it at that, it'd be OK. But he didn't. He dressed it up in semi-prophetic _wank_. Now we have legions of "experts" blindly parroting that ToF sensors are dead and will never be used in the future.
I'm willing to bet that a lidar-like time-of-flight sensor will be shoved into a smartphone in the next 4 years. Why? Because it makes SLAM so much easier/more immersive/better, which makes AR better and more useful.
[0] https://puffer.stanford.edu/player/