Hacker News

> But what's proprietary here? That's what I'm not getting from the other person. You have the algorithm. Hell, they even provided the model in pytorch/python. They just didn't provide training parameters and data. But that's not necessary to use or modify the software just like it isn't necessary for nearly any other open sourced project.

It's necessary if you want to rebuild the weights/factors/whatever the current terminology is, which are a major part of what they're shipping. If they found a major bug in this release, the fix might involve re-running the training process, and currently that's something that they can do and we users can't.

> I mean we aren't calling PyTorch "not open source" because they didn't provide source code for vim and VS code.

You can build the exact same PyTorch by using emacs, or notepad, or what have you, and those are standard tools that you can find all over the place and use for all sorts of things. If you want to fix a bug in PyTorch, you can edit it with any editor you like, re-run the build process, and be confident that the only thing that changed is the thing you changed.

You can't rebuild this model without their training parameters and data. Like maybe you could run the same process with an off-the-shelf training dataset, but you'd get a very different result from the thing that they've released - the whole point of the thing they've released is that it has the weights that they've "compiled" through this training process. If you've built a system on top of this model, and you want to fix a bug in it, that's not going to be good enough - without having access to the same training dataset, there's no way for you to produce "this model, but with this particular problem fixed".

(And sure, maybe you could try to work around with finetuning, or manually patch the binary weights, but that's similar to how people will patch binaries to fix bugs in proprietary software - yes it's possible, but the point of open source is to make it easier)
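To make that weight-patching analogy concrete, here's a toy sketch in plain Python (no real model; the layer name and numbers are invented for illustration): you can overwrite an entry you've diagnosed as bad, but there's no equivalent of fixing the source and rebuilding, because the surrounding weights can't be regenerated without the training data.

```python
# Toy sketch of patching released weights in place, like hex-editing a
# binary. The layer name and values are invented for illustration.

weights = {
    "layer0.weight": [[0.13, -0.42],
                      [0.07,  0.99]],
}

def patch(weights, name, i, j, value):
    """Overwrite one entry directly; nothing else gets 'recompiled'."""
    weights[name][i][j] = value

# Suppose we've diagnosed weights["layer0.weight"][0][1] as the problem:
patch(weights, "layer0.weight", 0, 1, 0.0)
```

The overwrite is the whole "fix"; everything around it stays as-shipped.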




Is this code "open source"?[0] It's under an MIT license, has the training scripts, and all the data is widely available. But to the best of my knowledge no one has reproduced their results. These people sure couldn't[1], and I'm not aware of any existing work that did. This is honestly quite common in ML and is quite frustrating as a researcher, especially when you get a round of reviewers who think benchmarks are the only thing that matters. (I literally had a work rejected twice with a major complaint being that my implementation didn't beat [0], despite it beating [1]. My paper wasn't even about architecture, so we weren't even trying to improve the SOTA.)

As a researcher I want to know the hyperparameters and datasets used, but honestly they aren't that important for usage. You're right that one way to "debug" a model would be to retrain it from scratch. But more likely you'd do fine-tuning, reinforcement learning, or use a LoRA. Even the company's own engineers would look at those routes before they looked at retraining from scratch.

Most of the NLP research world is using pretrained models these days (I don't like this, tbh, but that's a different discussion altogether). Only a handful of companies are actually training models. And I mean companies, not academics. Academics don't have the resources (unless partnering), and without digressing too much, the benchmarkism is severely limiting the ability of academics to be academics. Models are insanely hard to evaluate, especially after being RLHF'd to all hell.
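For anyone unfamiliar with the LoRA route, it can be sketched in a few lines of plain Python (framework-free; the shapes and values here are made up): the pretrained weight W stays frozen, and you train only a low-rank update B @ A, so the effective weight becomes W + B @ A.

```python
# Minimal LoRA sketch, assuming a single linear layer y = W x.
# W is frozen; only the low-rank factors B and A would be trained.
# All shapes and values are illustrative.

def matmul(a, b):
    # naive matrix multiply: a is m x k, b is k x n
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen 2x2 pretrained weight we cannot rebuild from scratch.
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Rank-1 adapter: B is 2x1, A is 1x2. For large models these factors
# have far fewer parameters than W, which is why this beats retraining.
B = [[0.5], [0.0]]
A = [[0.0, 1.0]]

W_eff = add(W, matmul(B, A))  # effective weight: W + B @ A
```

The adapter is cheap to train and ship separately, but it only nudges the released weights; it can't reproduce them.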

> (And sure, maybe you could try to work around with finetuning, or manually patch the binary weights, but that's similar to how people will patch binaries to fix bugs in proprietary software - yes it's possible, but the point of open source is to make it easier)

The truth is that this is how most ML refinement is happening these days. If you want better refinement, we have to have that other discussion.

[0] https://github.com/openai/glow

[1] https://arxiv.org/abs/1901.11137


> Is this code "open source"?[0] It's under an MIT license, has the training scripts, and all the data is widely available. But to the best of my knowledge no one has reproduced their results. These people sure couldn't[1], and I'm not aware of any existing work that did.

I don't know about ML specifically, but I've seen a number of projects where people publish supposedly "the source" for something and it doesn't actually build. IMO if they're doing it wilfully, that makes it not open source, whereas if it's just good-faith, legitimate incompetence then it can still be.

(My litmus test would be: are they giving you all the stuff they'd give to a new hire/assistant working with them? If they've got a "how to build" on their internal wiki with a bunch of steps they're keeping secret, then it's not open-source. But if the process for a new hire is to hand over a code dump and say "huh, it works on my machine, I don't remember what I did to set it up", then at that point I'd consider it open source. I think this aligns with the "preferred form for making modifications" idea in the licenses).

> But more likely is doing tuning, reinforcement learning, or using a LoRA. Even the company engineers would look at those routes before they looked at retraining from scratch.

Sure. But they'd have that capability in their back pocket for if they needed it. It's a similar story for e.g. parts of the Linux kernel code that are generated via a perl script based on the architecture documentation - you only actually re-run that perl script once in a blue moon, but it's important that they publish the perl script and not just the C that was generated by it.
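A toy version of that generated-source analogy (the table and names here are invented, not the actual kernel script): the generator is short and rarely re-run, but if you published only its output, users couldn't regenerate the code after fixing the underlying table.

```python
# Toy stand-in for a "generate C from a spec" build script. The syscall
# table is invented; the point is that the generator, not just its
# output, belongs in the published source.

SYSCALLS = [("read", 0), ("write", 1), ("open", 2)]

def generate_header(entries):
    lines = ["/* generated -- do not edit by hand */"]
    for name, num in entries:
        lines.append(f"#define SYS_{name} {num}")
    return "\n".join(lines)

header = generate_header(SYSCALLS)
```

With only the emitted header, a fix to the table means hand-editing generated code; with the script, you fix the table and re-run it.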


Build scripts are not required for open source. Usually they are provided, because nobody actually wants to maintain them separately, but they are not actually part of the project itself. Often it's just a few GNU scripts; sometimes parts are missing (because they're reused from another project, or they have secrets in them that the maintainer can't be bothered to remove, or other reasons); rarely, the build script is an entire project in itself; and even more rarely there's nothing there at all except a single file of source code that can't be built alone (I've seen this in particular in several old Go projects, and it's incredibly annoying).


"Source in the preferred form for making modifications" is required. That includes build scripts if the maintainer is using them, IMO.



