I'm really curious how well they perform with a long chat history. I find that Gemini often gets confused once the context is long enough and starts responding to prior prompts, whether in the CLI or its Gem chat window.
From my experience, Gemini is REALLY bad about context blending. It can't keep track of what I said and what it said, even in conversations under 200K tokens. It mixes up concepts and statements, then refers to some fabricated hybrid fact or comment.
Gemini has done this in ways I haven't seen in the recent or current-generation models from OpenAI or Anthropic.
It really surprised me that Gemini performs so well in multi-turn benchmarks, given that tendency.
I haven't experimented with the recent models for this, but older Gemini models were awful: they'd lie about what I'd said or what was in their system prompt, even in short conversations.