While I think that's a valid point, for the purposes of the paper a competent prompt writer would be needed to make a fair comparison.
I think the researchers would know how to get the best from their own AI bot, so that same level of competency should be extended to the bots it's compared against; otherwise user competency becomes a source of bias. I do feel you're correct in your concerns, though. The systems shouldn't need experts to use them, nor should they need the user to already know the right answer, which leads me to my next point:
When it comes to real-world expectations, perhaps we instead need a large group of random people (with no prior experience) working with each bot to complete a set of tasks, in order to determine how it truly performs - something that could be enhanced if the answers weren't easy to check.