I see the trick - the way they query for citations is just to append a [put a reference here] tag to your text, and then see what the model predicts. So it figures that whatever immediately follows "MS-COCO dataset" should, of course, be the citation for MS-COCO. With that in mind, you can structure your prompt to get the thing you want:
"Real time object-detection on the MS-COCO dataset was demonstrated by" gives a correct result (YOLO).
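To make the mechanism concrete, here is a rough sketch of how such a query might be assembled. The tag string and the function name `build_citation_query` are my own placeholders, not the tool's actual internals, and the "naive" sentence is a made-up example for contrast; the point is only that the model predicts a reference for whatever text sits directly before the tag.

```python
# Hypothetical placeholder tag; the real system's tag is not known here.
CITATION_TAG = "[put a reference here]"


def build_citation_query(text: str, tag: str = CITATION_TAG) -> str:
    """Append the citation tag so the model predicts a reference
    for whatever immediately precedes it."""
    return f"{text.rstrip()} {tag}"


# Tag lands right after the dataset name, so the likely prediction
# is the citation for the dataset itself.
naive = build_citation_query("We evaluate on the MS-COCO dataset")

# Tag lands after "demonstrated by", so the likely prediction
# is the work that did the demonstrating.
steered = build_citation_query(
    "Real time object-detection on the MS-COCO dataset was demonstrated by"
)

print(naive)
print(steered)
```

Both queries mention MS-COCO, but only the second ends in a position where the natural continuation is the paper you are actually looking for.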