Chain-of-Experts: Unlocking the Communication Power of MoEs
Introduction
We propose Chain-of-Experts (CoE), which fundamentally changes how sparse Large Language Models (LLMs) process information by introducing sequential communication among intra-layer experts in Mixture-of-Experts (MoE) models.

In existing MoE models, experts process tokens independently and in parallel, and the models carry high memory requirements. CoE instead introduces an iterative mechanism that lets experts "communicate": each expert processes tokens on top of the outputs produced by other experts.
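To make the iterative mechanism concrete, below is a minimal PyTorch sketch of the idea: tokens are routed and processed over several iterations within one layer, with each iteration's experts operating on the previous iteration's output. The names (`CoELayer`, `Expert`, `num_iterations`, `top_k`) and details such as re-routing every iteration and the residual connection are illustrative assumptions, not the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn


class Expert(nn.Module):
    """A standard feed-forward expert (illustrative)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class CoELayer(nn.Module):
    """Sketch of a Chain-of-Experts layer: instead of mixing independently
    routed expert outputs once, tokens pass through several routing/processing
    iterations, so later experts see what earlier experts produced."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int,
                 top_k: int = 2, num_iterations: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            Expert(d_model, d_hidden) for _ in range(num_experts)
        )
        self.router = nn.Linear(d_model, num_experts)
        self.top_k = top_k
        self.num_iterations = num_iterations

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        h = x
        for _ in range(self.num_iterations):
            logits = self.router(h)                      # re-route each iteration
            weights, idx = logits.topk(self.top_k, dim=-1)
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(h)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e                # tokens sent to expert e
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(h[mask])
            h = out + h  # residual: next iteration's experts build on this output
        return h
```

With `num_iterations = 1` this reduces to an ordinary top-k MoE layer; setting it to 2 corresponds to the "2x iterations" configuration reported below.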
Experiments show that CoE significantly outperforms previous MoE models in multiple aspects:
- Performance: CoE with 2x iterations reduces Math validation loss from 1.20 to 1.12
- Scaling: 2x iterations match the performance of 3x expert selections and outperform layer scaling
- Efficiency: 17.6-42% lower memory usage at equivalent performance
- Flexibility: 823x increase in expert combinations, improving utilization, communication, and specialization
These advantages constitute a "free lunch" effect, enabling efficient scaling of LLMs.
English Blog:
Chinese Blog: