It seems like a lot of innovation is around training, no? GGML (the library that reads the GGUF format) only accepts a fixed list of values for the required 'general.architecture' key, enumerated in the GGUF spec and in the reader source.
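If you want to see what a given file actually declares, the metadata header is simple enough to parse by hand. A rough sketch (untested; assumes a little-endian GGUF v2/v3 file and the metadata value types listed in the spec) that dumps the architecture key:

    import struct
    import sys

    # Scalar metadata value types from the GGUF spec -> struct formats.
    SCALARS = {
        0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
        6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d",
    }
    STRING, ARRAY = 8, 9

    def read_string(f):
        (n,) = struct.unpack("<Q", f.read(8))
        return f.read(n).decode("utf-8")

    def read_value(f, vtype):
        if vtype in SCALARS:
            fmt = SCALARS[vtype]
            (v,) = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
            return v
        if vtype == STRING:
            return read_string(f)
        if vtype == ARRAY:
            etype, count = struct.unpack("<IQ", f.read(12))
            return [read_value(f, etype) for _ in range(count)]
        raise ValueError(f"unknown metadata value type {vtype}")

    def read_metadata(path):
        with open(path, "rb") as f:
            magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", f.read(24))
            assert magic == b"GGUF", "not a GGUF file"
            meta = {}
            for _ in range(n_kv):
                key = read_string(f)
                (vtype,) = struct.unpack("<I", f.read(4))
                meta[key] = read_value(f, vtype)
            return version, meta

    version, meta = read_metadata(sys.argv[1])
    print("GGUF version:", version)
    print("architecture:", meta.get("general.architecture"))

The architecture *name* is in the file; the architecture itself (graph structure, attention variant, etc.) is not.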
I've also been trying to figure out GGUF and the other model formats going around. I'm horrified to see there are no model architecture details in the file! As you say, it seems they hard-code the supported architectures as constants. If a hot new model comes out, one would need to update the reader code to a version that has the new model arch implemented. Am I understanding this right? Concretely, I imagine the reader side ends up looking something like the sketch below.
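(Python pseudo-code of the pattern only, not llama.cpp's actual code; the builder names are made up.)

    # Illustrative pattern only -- build_llama / build_falcon are made-up names.
    def build_llama(meta, tensors):
        ...  # wire up the llama compute graph from known tensor names

    def build_falcon(meta, tensors):
        ...  # same idea, different layer layout

    # The set of supported architectures is a constant of the *reader*,
    # not something described inside the GGUF file itself.
    GRAPH_BUILDERS = {
        "llama": build_llama,
        "falcon": build_falcon,
    }

    def load_model(meta, tensors):
        arch = meta["general.architecture"]
        if arch not in GRAPH_BUILDERS:
            # A brand-new model family means updating this table,
            # i.e. shipping a new version of the reader.
            raise ValueError(f"unknown architecture: {arch}")
        return GRAPH_BUILDERS[arch](meta, tensors)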
I'm also a bit confused by the quantization aspect; it's a pretty complex topic. GGML seems to use 16-bit as per the article. If I was pushing it to 8-bit, I reckon I'd see no size improvement in the GGML file? The article says they encode quantization versions in that file. Where are those defined?
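For the size part, a back-of-envelope check (assuming a hypothetical 7B-parameter model, fp16 storage vs. ggml's q8_0 scheme, which stores each block of 32 weights as 32 int8 values plus one fp16 scale):

    # Back-of-envelope sizes, ignoring metadata and the few tensors
    # (e.g. norms) that typically stay in higher precision.
    n_params = 7_000_000_000                    # hypothetical 7B model

    fp16_bytes = n_params * 2                   # 2 bytes per weight
    q8_0_bytes = n_params // 32 * (32 + 2)      # per block: 32 int8 + 1 fp16 scale

    print(f"fp16: {fp16_bytes / 2**30:.1f} GiB")  # ~13.0 GiB
    print(f"q8_0: {q8_0_bytes / 2**30:.1f} GiB")  # ~6.9 GiB

So I'd actually expect roughly a 2x reduction going from 16-bit to 8-bit, unless I'm missing something. And from skimming the spec, the quantization version seems to live under the 'general.quantization_version' metadata key, with the meaning of each version number defined in the ggml source rather than in the file.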
In designing software, there's often a trade-off between (i) generality/configurability and (ii) performance.
llama.cpp is built for inference, not for training or model architecture research. It seems reasonable to optimize for performance, which is what ~100% of llama.cpp users care about.
GGUF files seem to be proliferating. I think some folks (like myself) make the incorrect assumption that the format has more portability/generalizability than it actually does. Hence, the horror!