Med-PaLM has been tested on a number of public benchmarks, with the results published. A useful comparison would have been to run GPT on that same set of benchmarks. Yet the paper we're discussing offers just one data point by way of numerical comparison, literally in a footnote.
Basically, the comparison needs to be a lot more comprehensive to be useful.
Late to the thread here, but the paper announcing Med-PaLM (https://arxiv.org/abs/2212.13138) does not actually report many benchmark results for Med-PaLM itself; it is mostly about Flan-PaLM 540B (which is what this paper compares against). I'm curious whether any other Med-PaLM benchmark results have been published, but I don't believe any further comparison against Med-PaLM is currently possible: the model is not public, and the original Med-PaLM paper reports no other open benchmark results.