Given the config parameters posted, it's 2 experts per token, so the computation cost per token should be the cost of the component that selects the experts plus 2× the cost of a 7B model.
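Quick back-of-envelope sketch to make that concrete (the ~7B-per-expert figure and the negligible router cost are assumptions, not confirmed numbers):

```python
# Naive per-token active compute for top-2 routing.
TOP_K = 2
PARAMS_PER_EXPERT_PATH = 7e9   # assumed: each selected expert costs a full ~7B path
ROUTER_PARAMS = 1e7            # assumed: the gating network is tiny by comparison

active_params = ROUTER_PARAMS + TOP_K * PARAMS_PER_EXPERT_PATH
print(f"~{active_params / 1e9:.2f}B params active per token")  # ~14.01B
```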
Ah, good catch. On even closer examination, the attention layers (~2B params) are shared across experts, so in theory you would need 2B for the attention weights plus 5B for each of the two selected experts in RAM.
That's 12B total, meaning it should run on the same hardware as 13B models, with some loading time between generations.
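Same sketch with the shared attention counted only once (the ~2B/~5B split is the rough estimate from above, not an official breakdown):

```python
# Resident params per forward pass once attention weights are deduplicated.
SHARED_ATTENTION = 2e9   # assumed: attention params shared across all experts
PER_EXPERT_FFN = 5e9     # assumed: expert-specific feed-forward params
TOP_K = 2

resident_params = SHARED_ATTENTION + TOP_K * PER_EXPERT_FFN
print(f"~{resident_params / 1e9:.0f}B params resident per pass")  # ~12B
```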