Hacker News

> But what's proprietary here? That's what I'm not getting from the other person. You have the algorithm. Hell, they even provided the model in pytorch/python. They just didn't provide training parameters and data. But that's not necessary to use or modify the software just like it isn't necessary for nearly any other open sourced project.

It's necessary if you want to rebuild the weights/factors/whatever the current terminology is, which are a major part of what they're shipping. If they found a major bug in this release, the fix might involve re-running the training process, and currently that's something that they can do and we users can't.

> I mean we aren't calling PyTorch "not open source" because they didn't provide source code for vim and VS code.

You can build the exact same PyTorch by using emacs, or notepad, or what have you, and those are standard tools that you can find all over the place and use for all sorts of things. If you want to fix a bug in PyTorch, you can edit it with any editor you like, re-run the build process, and be confident that the only thing that changed is the thing you changed.

You can't rebuild this model without their training parameters and data. Like maybe you could run the same process with an off-the-shelf training dataset, but you'd get a very different result from the thing that they've released - the whole point of the thing they've released is that it has the weights that they've "compiled" through this training process. If you've built a system on top of this model, and you want to fix a bug in it, that's not going to be good enough - without having access to the same training dataset, there's no way for you to produce "this model, but with this particular problem fixed".

(And sure, maybe you could try to work around with finetuning, or manually patch the binary weights, but that's similar to how people will patch binaries to fix bugs in proprietary software - yes it's possible, but the point of open source is to make it easier)
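To make that weight-patching analogy concrete, here's a toy sketch in plain Python (no real model; the layer name and numbers are invented for illustration): you can overwrite an entry you've diagnosed as bad, but there's no equivalent of fixing the source and rebuilding, because the surrounding weights can't be regenerated without the training data.

```python
# Toy sketch of patching released weights in place, like hex-editing a
# binary. The layer name and values are invented for illustration.

weights = {
    "layer0.weight": [[0.13, -0.42],
                      [0.07,  0.99]],
}

def patch(weights, name, i, j, value):
    """Overwrite one entry directly; nothing else gets 'recompiled'."""
    weights[name][i][j] = value

# Suppose we've diagnosed weights["layer0.weight"][0][1] as the problem:
patch(weights, "layer0.weight", 0, 1, 0.0)
```

The overwrite is the whole "fix"; everything around it stays as-shipped.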




Is this code "open source"?[0] It's under an MIT license, has the training scripts, and all the data is widely available. But to the best of my knowledge no one has reproduced their results. These people sure couldn't[1], and I'm not aware of any existing work that did. This is honestly quite common in ML and is quite frustrating as a researcher, especially when you get a round of reviewers who think benchmarks are the only thing that matters. (I literally had a work rejected twice with a major complaint being that my implementation didn't beat [0], despite it beating [1]. My paper wasn't even about architecture, so we weren't even trying to improve the SOTA.)

As a researcher I want to know the hyperparameters and datasets used, but honestly they aren't that important for usage. You're right that one way to "debug" a model would be to retrain it from scratch. But more likely you'd do fine-tuning, reinforcement learning, or use a LoRA. Even the company's own engineers would look at those routes before they looked at retraining from scratch.

Most of the NLP research world is using pretrained models these days (I don't like this, tbh, but that's a different discussion altogether). Only a handful of companies are actually training models. And I mean companies, not academics. Academics don't have the resources (unless partnering), and without digressing too much, the benchmarkism is severely limiting the ability of academics to be academics. Models are insanely hard to evaluate, especially after being RLHF'd to all hell.
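For anyone unfamiliar with the LoRA route, it can be sketched in a few lines of plain Python (framework-free; the shapes and values here are made up): the pretrained weight W stays frozen, and you train only a low-rank update B @ A, so the effective weight becomes W + B @ A.

```python
# Minimal LoRA sketch, assuming a single linear layer y = W x.
# W is frozen; only the low-rank factors B and A would be trained.
# All shapes and values are illustrative.

def matmul(a, b):
    # naive matrix multiply: a is m x k, b is k x n
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen 2x2 pretrained weight we cannot rebuild from scratch.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Rank-1 adapter: B is 2x1, A is 1x2. For large models these factors
# have far fewer parameters than W, which is why this beats retraining.
B = [[0.5], [0.0]]
A = [[0.0, 1.0]]

W_eff = add(W, matmul(B, A))  # effective weight: W + B @ A
```

The adapter is cheap to train and ship separately, but it only nudges the released weights; it can't reproduce them.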

> (And sure, maybe you could try to work around with finetuning, or manually patch the binary weights, but that's similar to how people will patch binaries to fix bugs in proprietary software - yes it's possible, but the point of open source is to make it easier)

The truth is that this is how most ML refinement is happening these days. If you want better refinement, we have to have that other discussion.

[0] https://github.com/openai/glow

[1] https://arxiv.org/abs/1901.11137


> Is this code "open source"?[0] It's under an MIT license, has the training scripts, and all the data is widely available. But to the best of my knowledge no one has reproduced their results. These people sure couldn't[1], and I'm not aware of any existing work that did.

I don't know about ML specifically, but I've seen a number of projects where people publish supposedly "the source" for something and it doesn't actually build. IMO if they're doing it wilfully, that makes it not open source, whereas if it's just good-faith, legitimate incompetence then it can still be.

(My litmus test would be: are they giving you all the stuff they'd give to a new hire/assistant working with them? If they've got a "how to build" on their internal wiki with a bunch of steps they're keeping secret, then it's not open-source. But if the process for a new hire is to hand over a code dump and say "huh, it works on my machine, I don't remember what I did to set it up", then at that point I'd consider it open source. I think this aligns with the "preferred form for making modifications" idea in the licenses).

> But more likely is doing tuning, reinforcement learning, or using a LoRA. Even the company engineers would look at those routes before they looked at retraining from scratch.

Sure. But they'd have that capability in their back pocket for if they needed it. It's a similar story for e.g. parts of the Linux kernel code that are generated via a perl script based on the architecture documentation - you only actually re-run that perl script once in a blue moon, but it's important that they publish the perl script and not just the C that was generated by it.
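A toy version of that generated-source analogy (the table and names here are invented, not the actual kernel script): the generator is short and rarely re-run, but if you published only its output, users couldn't regenerate the code after fixing the underlying table.

```python
# Toy stand-in for a "generate C from a spec" build script. The syscall
# table is invented; the point is that the generator, not just its
# output, belongs in the published source.

SYSCALLS = [("read", 0), ("write", 1), ("open", 2)]

def generate_header(entries):
    lines = ["/* generated -- do not edit by hand */"]
    for name, num in entries:
        lines.append(f"#define SYS_{name} {num}")
    return "\n".join(lines)

header = generate_header(SYSCALLS)
```

With only the emitted header, a fix to the table means hand-editing generated code; with the script, you fix the table and re-run it.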


Build scripts are not required for open source. Usually they are provided, because nobody actually wants to maintain them separately, but they are not actually part of the project itself. Often it's just a few GNU scripts; sometimes parts are missing (because they're reused from another project, or they have secrets in them that the maintainer can't be bothered to remove, or other reasons); rarely, the build script is an entire project in itself; and even more rarely there's nothing there at all except a single file of source code that can't be built alone (I've seen this in particular in several old Go projects, and it's incredibly annoying).


"Source in the preferred form for making modifications" is required. That includes build scripts if the maintainer is using them, IMO.



