A fair comparison (on a technical level) to GPT-4 RAG would be a doctor in a relevant field who also has internet access. I think that would indeed be an interesting comparison for assessing the resulting quality of care, so to speak!

(The other models being only partially able to source good references is unsurprising/"unfair" on a technical level, but that's not relevant for assessing their safety.)