EMO: Mixture-of-Experts That Actually Behaves Like One

DEV Community·Aamer Mihaysi·19 days ago

#ai #machinelearning #nlp #software #experts #model

Most MoE models are just big transformers with a traffic cop attached. The router directs tokens to different experts, sure, but ask for just the code experts and the whole thing falls apart. That's not modularity. That's sharding with extra steps.

The problem isn't that MoE doesn't work. It's that the experts don't specialize where it matters. Open up a standard MoE and you'll find one expert handling prepositions, another managing punctuation, a third dealing with numbers. The specialization is lexical, not semantic. When you try to extract just the "math" capability, every token still needs access to most of the experts anyway. The promise of selective deployment remains theoretical.

EMO changes this by making modularity a first-class training objective rather than a hoped-for emergent property. The insight is simple: tokens from the same document usually belong to the same domain. So EMO constrains all tokens in a document to route through a shared pool of experts.…
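To make the contrast concrete, here is a minimal sketch in plain Python of per-token routing next to document-constrained routing. It is an illustration under stated assumptions, not EMO's actual method: the excerpt does not say how the shared pool is chosen or how the constraint enters training, so the "highest mean router score" rule, the function names, and the toy logits below are all hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple


def top_k_indices(scores: List[float], k: int,
                  allowed: Optional[Iterable[int]] = None) -> List[int]:
    """Return the indices of the k highest scores, optionally limited to `allowed`."""
    candidates = range(len(scores)) if allowed is None else sorted(allowed)
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]


def route_per_token(logits: List[List[float]], k: int) -> List[List[int]]:
    """Standard MoE routing: each token independently picks its own top-k experts."""
    return [top_k_indices(token_scores, k) for token_scores in logits]


def route_per_document(logits: List[List[float]], k: int,
                       pool_size: int) -> Tuple[List[List[int]], Set[int]]:
    """Document-constrained routing: every token picks top-k from one shared pool.

    ASSUMPTION: the pool is the `pool_size` experts with the highest mean router
    score over the document. The source does not specify the selection rule.
    """
    num_experts = len(logits[0])
    mean_scores = [
        sum(token_scores[e] for token_scores in logits) / len(logits)
        for e in range(num_experts)
    ]
    pool = set(top_k_indices(mean_scores, pool_size))
    assignments = [top_k_indices(token_scores, k, allowed=pool) for token_scores in logits]
    return assignments, pool


def experts_used(assignments: List[List[int]]) -> Set[int]:
    """Collect every expert touched by at least one token."""
    return {expert for token_experts in assignments for expert in token_experts}


# Toy router logits for one document: 4 tokens x 6 experts (made-up numbers).
doc_logits = [
    [2.0, 0.1, 1.5, 0.0, 0.3, 0.2],
    [0.2, 1.9, 1.6, 0.1, 0.0, 0.3],
    [1.8, 0.2, 0.1, 0.0, 2.1, 0.4],
    [0.1, 0.3, 1.7, 2.2, 0.0, 0.2],
]

per_token = route_per_token(doc_logits, k=2)
per_doc, pool = route_per_document(doc_logits, k=2, pool_size=3)

print("per-token routing touches experts:", sorted(experts_used(per_token)))
print("document routing touches experts: ", sorted(experts_used(per_doc)))
```

In this toy case, per-token routing spreads four tokens across five of the six experts, while the document-level constraint keeps them inside a pool of three. That is the property selective deployment would need: if a domain's documents reliably land in a small pool, loading only that pool becomes plausible.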
