My colleagues and I discussed this paper when it was published. Like a lot of AI research in this area, it compares against a human strawman, which makes the comparison sort of misleading. The human ratings are poorly designed and substandard in many ways.
It's still interesting, but there are lots of issues being swept under the rug in this area.