Maybe my assumption of how MoE would/could work is wrong, but I had assumed that it means getting different models to generate different bits of text and then stitching them together. For example, if you ask it to write a short bit of code where every comment is poetry, the instruction would be split (by a top-level "manager" model?) so that one model is given the task "write this code" and another "write a poem that explains what the code does". There therefore wouldn't be any maths combining numbers from the different experts; only their outputs (text) would be merged.
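In case the prose is ambiguous, here's a rough sketch of the kind of pipeline I was picturing (all the class and function names below are made up purely to illustrate my assumption, not anything from a real library or how MoE actually works):

    # My assumed "manager splits tasks, experts write text, manager stitches text" pipeline
    class Expert:
        def __init__(self, speciality):
            self.speciality = speciality

        def generate(self, task):
            # In my mental model, each expert just produces ordinary text
            return f"[{self.speciality} output for: {task}]"

    class Manager:
        def split_into_subtasks(self, prompt):
            # A top-level "manager" splits the request into sub-tasks
            # (hardcoded here to match the example in my question)
            return ["write this code",
                    "write a poem that explains what the code does"]

        def pick_expert(self, task, experts):
            # Route each sub-task to whichever expert seems best suited
            return experts["poetry"] if "poem" in task else experts["coding"]

        def stitch_together(self, pieces):
            # Outputs merged as text; no maths combining numbers from the experts
            return "\n".join(pieces)

    def answer(prompt, manager, experts):
        subtasks = manager.split_into_subtasks(prompt)
        pieces = [manager.pick_expert(t, experts).generate(t) for t in subtasks]
        return manager.stitch_together(pieces)

    print(answer("write a short bit of code where every comment is poetry",
                 Manager(),
                 {"coding": Expert("coding"), "poetry": Expert("poetry")}))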
Have I completely misunderstood? Does Mixture of Experts somehow involve the different experts actually collaborating on the raw computation together?
Could anyone share a recommendation for what to read to learn more about MoE generally? (Ideally something understandable by someone like me who isn't an expert in LLMs/ML/etc.)