
I recorded myself saying a few sentences from that transcript, then fed it through different Whisper models. "small.en" and "large-v1" both generated "chat GPT", "large-v2" generated "chat-gpt", but somehow "medium.en" correctly generated "ChatGPT".

This was the same audio sample fed through each of those four models, with no "prompting" as you're discussing.

If I add "--initial_prompt ChatGPT", then all four models are able to get the spelling correct.
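For anyone wanting to reproduce this, the invocation with the standard openai-whisper CLI looks roughly like the following (the audio file name is just a placeholder):

```shell
# Transcribe with a vocabulary hint so the model spells "ChatGPT" correctly.
# "sample.wav" is a placeholder for your own recording.
whisper sample.wav --model medium.en --initial_prompt "ChatGPT"
```

The same option is available in the Python API as the `initial_prompt` keyword argument to `transcribe()`.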

Regardless, I don't think "chat GPT" versus "ChatGPT" is a huge deal. There will always be some level of uncertainty and ambiguity in a transcript, and even books written by humans have a few typos that slip past multiple stages of copy editing. Perfection is virtually unachievable, but you can always scroll through the transcript and make some edits after the fact, if desired. Maybe some future model will magically eliminate all typos.



Yeah, it wasn't a big problem for me - I had to do a bunch of other tidy-ups on the transcript anyway, like adding the name of whoever was speaking.

I cleaned that bit up with a bulk replace of "chat GPT" with "ChatGPT" in VS Code.



