I think the clearest evidence is Microsoft's paper, where they show abilities at various stages during training [1]... But in a talk [2], they give more details... The unicorn gets worse during the fine-tuning process.
Newbie follow-up question: should we put any trust in “Sparks of intelligence”? I thought it was regarded as a Microsoft marketing piece, not a serious paper.
[1]: https://arxiv.org/abs/2303.12712
[2]: https://www.youtube.com/watch?v=qbIk7-JPB2c&t=1392s