There is no doubt in my mind that Galactica fine-tuned on these specific datasets will outperform all these previous models. But yeah, someone should definitely do that and perform the benchmarks.
I’ve been vaguely following all the AI news on text-to-image models and the text that comes out of prompts. But I have no idea how a benchmark for text would work. Is benchmarking subjective? Is it based on accuracy of information? How do you actually measure a benchmark for something like this?
Different benchmarks are performed for different tasks. As there are a lot of things you can use language models for, there are a lot of benchmarks.
With respect to subjectivity it really depends on the task - some tasks are quite amenable to objective classification. One common task for science language models is citation prediction: do these two papers share a citation link? Obviously that's a really simple accuracy metric to report.
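To make that concrete, here's a rough sketch of what scoring that kind of task looks like. The model, its predict_link function, and the labeled pairs are all made-up stand-ins, not any real benchmark or library:

```python
# Rough sketch: scoring citation-link prediction as plain accuracy.
# `model.predict_link` and the labeled pairs are hypothetical stand-ins.

labeled_pairs = [
    # (paper_a_id, paper_b_id, do_they_share_a_citation_link)
    ("paper_001", "paper_042", True),
    ("paper_001", "paper_313", False),
    ("paper_017", "paper_042", True),
]

def evaluate_accuracy(model, pairs):
    correct = 0
    for paper_a, paper_b, gold_label in pairs:
        prediction = model.predict_link(paper_a, paper_b)  # True/False guess
        if prediction == gold_label:
            correct += 1
    return correct / len(pairs)

# accuracy = evaluate_accuracy(my_model, labeled_pairs)
# print(f"accuracy: {accuracy:.2%}")
```

The point is there's no judgment call in the scoring itself: either the two papers share a citation link or they don't, so you just count how often the model agrees with the ground truth.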
Often things are not so simple. An example might be keyphrase extraction - standard practice there is to have grad students sit down with a highlighter and use the terms multiple students agree on (a simplification, but not by much). From there it just gets messier. Are you reporting accuracy over all keywords identified, or over all sentences correctly processed? What about sentences with multiple keywords? What about sentences with no keywords? Very messy; the appropriate metrics can be a real topic of debate.
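As a rough illustration of why the metric choice matters, here's a toy precision/recall calculation against an annotator-agreed gold set. All the phrases and numbers are made up, and the per-sentence comment at the end is just one way such a score might be defined:

```python
# Toy example: scoring keyphrase extraction per keyphrase.
# "Gold" = the set the annotators agreed on; "predicted" = the model's output.

gold = {"language model", "citation prediction", "benchmark"}
predicted = {"language model", "benchmark", "grad students"}

true_positives = len(gold & predicted)
precision = true_positives / len(predicted)   # how many predicted phrases were right
recall = true_positives / len(gold)           # how many gold phrases were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.67 recall=0.67 f1=0.67
#
# A per-sentence score would come out differently: a sentence with no gold
# keyphrases counts as "correct" whenever the model predicts nothing for it,
# which can inflate the number compared to the per-keyphrase view above.
```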