
This doesn't replicate using gpt-4o-mini, which always picks Flight B even when Flight A is made somewhat more attractive.

Source: just ran it with 0-20 newlines, 100 trials apiece, raising the temperature and varying the random seed per trial to rule out any prompt caching.
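
If anyone wants to try the same thing, a loop like this works (the flight wording below is a placeholder, not the paper's exact prompt; substitute the real one):

    # pip install openai
    from collections import Counter
    from openai import OpenAI

    client = OpenAI()
    QUESTION = ("Flight A: $400, one stop. Flight B: $450, nonstop. "
                "Which do you book? Answer with exactly one letter, A or B.")

    for n_newlines in range(0, 21):
        prompt = "\n" * n_newlines + QUESTION
        votes = Counter()
        for trial in range(100):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=1.0,  # raised so sampling isn't near-deterministic
                seed=trial,       # different seed per trial
                max_tokens=1,
            )
            votes[resp.choices[0].message.content.strip()] += 1
        print(n_newlines, dict(votes))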



The newline thing is the motivating example in the introduction, using Llama 3 8B Instruct with up to 200 newlines before the question. If you want to reproduce this example with another model, you might have to increase the number of newlines all the way to the context limit. (If you ask the API to give you logprobs, at least you won't have to run multiple trials to get the exact probability.)
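
With the OpenAI API, something like this gives you the answer distribution from a single call (prompt wording is again a placeholder):

    import math
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": "\n" * 20 + "Which flight, A or B? Answer with one letter."}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,  # also return the runner-up tokens
    )
    # Probability of each candidate first token, no repeated sampling needed
    for cand in resp.choices[0].logprobs.content[0].top_logprobs:
        print(cand.token, math.exp(cand.logprob))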

But the meat of the paper is the Shapley value estimation algorithm in appendix A4. And in A5 you can see that different models giving different results is to be expected.
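
For intuition only (this is the textbook Monte Carlo estimator, not necessarily the paper's A4 algorithm): sample random orderings of the prompt tokens and average each token's marginal contribution, where the value function is something like P(model picks Flight A) with only those tokens kept:

    import random

    def shapley_estimate(n_players, value_fn, n_samples=200):
        # Monte Carlo Shapley: average each player's marginal
        # contribution over random orderings of all players.
        # value_fn maps a frozenset of player indices to a payoff,
        # e.g. P(answer == "A") when only those tokens are kept.
        phi = [0.0] * n_players
        for _ in range(n_samples):
            order = random.sample(range(n_players), n_players)
            coalition = set()
            v_prev = value_fn(frozenset(coalition))
            for i in order:
                coalition.add(i)
                v_new = value_fn(frozenset(coalition))
                phi[i] += v_new - v_prev
                v_prev = v_new
        return [p / n_samples for p in phi]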



