Artificial Intelligence thread

tokenanalyst

Brigadier
Registered Member

Chain-of-Experts: Unlocking the Communication Power of MoEs


Introduction

We propose Chain-of-Experts (CoE), which fundamentally changes sparse Large Language Model (LLM) processing by implementing sequential communication between intra-layer experts within Mixture-of-Experts (MoE) models.

In standard Mixture-of-Experts (MoE) models, experts process tokens independently and in parallel, with no communication between them, and the models carry high memory requirements. CoE introduces an iterative mechanism that lets experts "communicate": experts process tokens on top of the outputs produced by other experts.
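
To make the idea of experts "processing tokens on top of other experts' outputs" concrete, here is a minimal PyTorch-style sketch of such an iterative routing loop. It illustrates the general technique under assumed details and is not the authors' implementation: the names (ChainOfExpertsLayer, num_iterations, top_k), the router being re-used at every iteration, the residual connection, and the renormalized gating weights are all assumptions made for this sketch.

```python
# Minimal sketch of the iterative ("chained") expert processing described above.
# NOT the authors' released code; class/parameter names and details such as the
# shared router, residual connection, and renormalized gating are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small feed-forward expert, as in a standard MoE layer."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x):
        return self.ff(x)


class ChainOfExpertsLayer(nn.Module):
    """Routes each token several times; each iteration's experts process the previous iteration's output."""
    def __init__(self, d_model=256, d_hidden=512, num_experts=8, top_k=2, num_iterations=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)  # re-applied at every iteration (assumed shared)
        self.top_k = top_k
        self.num_iterations = num_iterations

    def forward(self, x):                                          # x: (num_tokens, d_model)
        h = x
        for _ in range(self.num_iterations):
            # Routing depends on the *current* hidden state, so later iterations can
            # choose different experts than earlier ones (the "communication" step).
            scores = F.softmax(self.router(h), dim=-1)             # (tokens, num_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)         # (tokens, top_k)
            weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over selected experts

            out = torch.zeros_like(h)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e                          # tokens whose k-th choice is expert e
                    if mask.any():
                        out[mask] += weights[mask, k].unsqueeze(-1) * expert(h[mask])

            h = h + out  # residual: the next iteration works on top of this combined output
        return h


if __name__ == "__main__":
    layer = ChainOfExpertsLayer()
    tokens = torch.randn(10, 256)
    print(layer(tokens).shape)  # torch.Size([10, 256])
```

A production implementation would batch the per-expert computation and add load-balancing terms; the loop above only shows the control flow of routing the same token through experts repeatedly.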

Experiments show that CoE significantly outperforms previous MoE models in multiple aspects:
  • Performance: CoE with 2x iterations reduces Math validation loss from 1.20 to 1.12
  • Scaling: 2x iterations matches performance of 3x expert selections, outperforming layer scaling
  • Efficiency: 17.6-42% lower memory usage with equivalent performance
  • Flexibility: 823x increase in expert combinations, improving utilization, communication, and specialization (see the counting sketch after this list)
These advantages constitute a "free lunch" effect, enabling efficient scaling of LLMs.
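
As a rough intuition for the flexibility bullet, the toy script below counts distinct expert combinations for repeated top-k routing versus a single parallel selection. The parameters N, k, and t are placeholders chosen only for illustration; the 823x figure quoted above comes from the paper's own configuration, which this sketch does not attempt to reproduce.

```python
# Toy counting sketch for the "flexibility" bullet. Parameters are placeholders
# for illustration only; the 823x figure quoted in the post comes from the
# paper's own configuration and is not reproduced here.
from math import comb

N = 64  # total experts in the layer (assumed)
k = 4   # experts selected per iteration (assumed)
t = 2   # sequential iterations (assumed)

single_pass = comb(N, t * k)   # one parallel selection of t*k experts
chained = comb(N, k) ** t      # an independent top-k selection at each of t iterations

print(f"single pass: {single_pass:.3e} expert combinations")
print(f"chained    : {chained:.3e} expert combinations ({chained / single_pass:.0f}x more)")
```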

English Blog:

Chinese Blog:



 

Legume7

New Member
Registered Member


This was made by Wang Zihan, a former DeepSeek intern who is now a PhD student at Northwestern.
 