There's nothing about Markov chains that says the transition probabilities have to come from brute-force counting of previously observed frequencies. The point is that the exact behavior of these LLMs could also be modeled as a Markov chain with a sufficiently massive state space.

Obviously that's impractical and not how LLMs actually work - they compute the transition probabilities for a state on the fly from the input, rather than having them pre-baked - but against the claim that 'these are more sophisticated than a Markov chain', strictly speaking they aren't: they are in effect a lossy compression of an enormous Markov model.
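
Roughly, as a toy sketch of that equivalence (Python; next_token_distribution is a hypothetical stand-in for the LLM's forward pass, and the window size is just an illustrative bound):

    from typing import Dict, Tuple

    State = Tuple[str, ...]      # the last N tokens, treated as a single Markov state
    CONTEXT_WINDOW = 2048        # any finite bound makes the state space finite (if huge)

    def next_token_distribution(state: State) -> Dict[str, float]:
        """Hypothetical stand-in for the LLM: P(next token | state)."""
        raise NotImplementedError

    def markov_transitions(state: State) -> Dict[State, float]:
        """The equivalent Markov chain: each successor state is the old
        window shifted by one position with the new token appended."""
        out: Dict[State, float] = {}
        for token, prob in next_token_distribution(state).items():
            successor = (state + (token,))[-CONTEXT_WINDOW:]
            out[successor] = out.get(successor, 0.0) + prob
        return out

The chain is well-defined in principle; it's just that the transition table is astronomically large, so the network computes rows of it on demand instead of storing them.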




But it seems like the attention mechanism fundamentally isn't Markov-like, in that at a given position it can pool information from all other positions. In the simplest case, when trained with masked language modeling, the prediction of the mask in "Capital of [MASK] is Paris" can depend bidirectionally on all the surrounding context. I guess it's true that in the case where the mask is at the end (i.e. next-token completion), you could consider this a Markov model with each state being the whole attention window (2048 tokens, I think?), but that's like saying all real-world computers are FSMs: technically true, but not the best model for actually understanding the behavior.
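
For what it's worth, a minimal numpy sketch of unmasked (bidirectional) attention makes the "pools from all positions" point concrete (shapes and names are illustrative, not tied to any particular model):

    import numpy as np

    def attention(Q, K, V):
        """Q, K, V: (seq_len, d) arrays. Each output row is a weighted mix of
        ALL value rows, with weights computed against ALL key rows."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                          # (seq_len, seq_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)         # softmax over positions
        return weights @ V

    # With no causal mask, the row for the [MASK] position mixes in tokens
    # both before and after it, which is what makes the prediction bidirectional.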

Since for most inputs, which are shorter than the max context length, you never actually use the Markov-ness, calling it a Markov model just seems like another way of saying it's a function that gives a probability distribution over the next token given the previous tokens. Which only pushes the question back onto how that function is defined.
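
In other words, the truncation that would make it "Markov" only kicks in once the input exceeds the window; for shorter prompts it's just conditioning on everything so far. Something like (reusing the hypothetical next_token_distribution from the sketch above):

    import random

    def generate(tokens, steps, window=2048):
        for _ in range(steps):
            state = tuple(tokens[-window:])        # a no-op for prompts shorter than the window
            dist = next_token_distribution(state)  # P(next token | previous tokens)
            words, probs = zip(*dist.items())
            tokens.append(random.choices(words, weights=probs)[0])
        return tokens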


Could you not use two Markov chains for masked language modeling? One running forward from the beginning up to [MASK], and one running backwards from the end up to [MASK], then setting [MASK] to the average of the two chains' predictions. If no direct average can be found, assume it's a multi-word expression and keep generating words from both chains until they meet.
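
A toy version of that with bigram chains might look like this (the averaging rule is just one way to combine the two directions; the multi-word fallback is left out):

    from collections import Counter, defaultdict

    def train_bigrams(corpus):
        """corpus: list of token lists. Returns forward and backward bigram counts."""
        fwd, bwd = defaultdict(Counter), defaultdict(Counter)
        for sent in corpus:
            for prev, nxt in zip(sent, sent[1:]):
                fwd[prev][nxt] += 1   # forward chain: P(w | previous word)
                bwd[nxt][prev] += 1   # backward chain: P(w | following word)
        return fwd, bwd

    def fill_mask(left, right, fwd, bwd):
        """Average the forward distribution given the word left of [MASK]
        with the backward distribution given the word right of it."""
        f, b = fwd[left], bwd[right]
        f_tot, b_tot = sum(f.values()) or 1, sum(b.values()) or 1
        scores = {w: 0.5 * f[w] / f_tot + 0.5 * b[w] / b_tot
                  for w in set(f) | set(b)}
        return max(scores, key=scores.get) if scores else None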


That seems closer to a BiLSTM?



