Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

In modern deep neural networks, increasing both training data scale and model size has led to significantly better prediction accuracy. However, when the entire model is activated for every example, training costs grow roughly quadratically, since both the model size and the number of training examples increase. Advances in computing speed and distributed computation have not kept up with this demand.

Various forms of conditional computation have been proposed to increase model capacity without a proportional increase in computational cost. In these schemes, large parts of the network are active or inactive on a per-example basis, with the activation decisions learned by reinforcement learning, backpropagation, or other training methods. In practice, this approach faces several challenges:

  • Branching Calculation: Modern computing devices, especially GPUs, are much faster at arithmetic than at the branching that conditional computation requires.

  • Conditional Batch Sizing: Large batch sizes are critical for computational efficiency, but conditional computation shrinks the effective batch size seen by each conditionally active part of the network.

  • Network Bandwidth Bottleneck: A GPU cluster's computational power can be thousands of times greater than its aggregate inter-device network bandwidth.

  • New Loss Term: The training process may require additional loss terms to control sparsity levels across batches and individual examples.

Sparsely-Gated Mixture of Experts is one such approach that has achieved more than 1000Ɨ improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.

Introduction

As an approach to conditional computation, this paper introduces a new type of general-purpose neural network component called the "Sparsely-Gated Mixture-of-Experts Layer" (MoE). The MoE consists of multiple experts, each a simple feed-forward neural network, and a trainable gating network that selects a sparse combination of experts to process each input. All parts of the network are trained jointly by backpropagation.

Formulation

The MoE layer consists of a set of $n$ expert networks and a gating network whose output is sparse. This sparsity is what saves computation: wherever $G(x)_i = 0$, we need not compute $E_i(x)$. The layer's output is the gate-weighted sum of the expert outputs (a minimal sketch of this sparse combination follows the definitions below):

$$y = \sum_{i=1}^{n} G(x)_i \cdot E_i(x)$$
  • $E_i$ denotes the $i$-th expert network: $E = \{E_1, E_2, \dots, E_n\}$.

  • $G$ is a "gating network" whose output is a sparse $n$-dimensional vector.

  • When $k > 1$, the gate values for the top $k$ experts have non-zero derivatives with respect to the gating network weights, while all others have zero derivatives.
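To make the sparsity concrete, here is a minimal NumPy sketch of the combination above. The function name `moe_forward`, the toy experts, and the hand-written gate vector are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def moe_forward(x, experts, gates):
    """Compute y = sum_i G(x)_i * E_i(x), evaluating E_i only where the gate is non-zero."""
    y = None
    for i, g in enumerate(gates):
        if g != 0.0:                          # sparsity of G(x): skip inactive experts entirely
            out = g * experts[i](x)
            y = out if y is None else y + out
    return y

# Toy usage: two linear "experts" and a hand-written sparse gate vector.
experts = [lambda x, W=W: W @ x for W in (np.eye(4), 2.0 * np.eye(4))]
x = np.ones(4)
gates = np.array([0.7, 0.0])                  # expert 1 is never evaluated
y = moe_forward(x, experts, gates)            # -> 0.7 * x
```

Because only the experts with non-zero gates are evaluated, the cost per example scales with the number of active experts rather than with the total number of experts.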

$$G_\sigma(x) = \text{softmax}(x \cdot W_g)$$

A gating network varies based on how it computes the gate values. The simplest form is softmax gating, which multiplies the input by a trainable weight matrix $W_g$ and applies the softmax function (sketched after the list below). However, it faces the following challenges:

  • Unbalanced Loading: The gradient tends to push the gating network toward routing most examples to the same few experts, and the imbalance is self-reinforcing because the favored experts are trained more and selected even more often.

  • Expert Collapse: The model ends up actively using only a handful of experts, leaving the rest undertrained.
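For contrast with the noisy variant introduced next, here is a minimal NumPy sketch of plain softmax gating; the function names are illustrative. Note that its output is dense, so on its own it saves no computation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())        # subtract the max for numerical stability
    return e / e.sum()

def softmax_gating(x, W_g):
    """G_sigma(x) = softmax(x . W_g): one gate value per expert, all non-zero."""
    return softmax(x @ W_g)
```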

Noisy Top-K Gating adds two components to the softmax gating network: sparsity, which saves computation, and noise, which helps with load balancing. Before applying the softmax, it adds tunable Gaussian noise to the logits, then keeps only the top $k$ values and sets the rest to $-\infty$ (so their gates become exactly zero after the softmax). Without the Gaussian noise, the network tends to converge to using the same few experts, making all the others useless. A minimal sketch follows the definitions below.

$$G(x) = \text{softmax}(\text{KeepTopK}(H(x), k))$$
  • $H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{softplus}\big((x \cdot W_{\text{noise}})_i\big)$

  • $\text{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$
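Putting the pieces together, here is a minimal NumPy sketch of Noisy Top-K Gating under the definitions above; the function and parameter names are illustrative.

```python
import numpy as np

def noisy_top_k_gating(x, W_g, W_noise, k, rng=None):
    """G(x) = softmax(KeepTopK(H(x), k)) with tunable Gaussian noise on the logits."""
    rng = rng or np.random.default_rng()
    clean = x @ W_g                                              # (x . W_g)_i
    noise_scale = np.logaddexp(0.0, x @ W_noise)                 # softplus((x . W_noise)_i)
    h = clean + rng.standard_normal(clean.shape) * noise_scale   # H(x)
    kept = np.full_like(h, -np.inf)                              # KeepTopK: non-top-k logits -> -inf
    top_k = np.argsort(h)[-k:]
    kept[top_k] = h[top_k]
    e = np.exp(kept - kept[top_k].max())                         # exp(-inf) = 0, so the gates are sparse
    return e / e.sum()
```

A fresh noise sample per example keeps the top-$k$ selection stochastic, which is what spreads load across experts instead of locking onto the same few.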

For very large numbers of experts, the branching factor can be reduced by using a two-level hierarchical MoE. In this setup, a primary gating network chooses a sparse weighted combination of experts, where each expert is itself a secondary MoE with its own gating network.
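As an illustration of the two-level setup, the sketch below reuses the `moe_forward` helper from earlier and treats each primary "expert" as a pair of (secondary experts, secondary gating function); the names are illustrative.

```python
def hierarchical_moe_forward(x, primary_gates, sub_moes):
    """Two-level MoE: each primary expert is itself a secondary MoE with its own gating."""
    y = None
    for i, g in enumerate(primary_gates):
        if g != 0.0:                                   # only active secondary MoEs are evaluated
            sub_experts, sub_gating_fn = sub_moes[i]
            out = g * moe_forward(x, sub_experts, sub_gating_fn(x))
            y = out if y is None else y + out
    return y
```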

Balancing Expert Utilization: An Additional Importance Loss

To balance expert utilization, an importance score measures how much each expert is used over a batch of training examples. An additional loss term $\mathcal{L}_{\text{Importance}}$ is added to the overall model loss to encourage balanced training (a sketch of the computation follows the definitions below).

$$\text{Importance}(X) = \sum_{x \in X} G(x)$$

$$\mathcal{L}_{\text{Importance}}(X) = w_{\text{Importance}} \cdot \text{CV}\big(\text{Importance}(X)\big)^2, \quad \text{where } \text{CV} = \frac{\sigma}{\mu}$$
  • $\mathcal{L}_{\text{Importance}}(X)$ denotes the importance loss computed over the batch $X$.

  • $w_{\text{Importance}}$ is a hand-tuned scaling coefficient.

  • $\text{CV}^2$ is the square of the coefficient of variation of $\text{Importance}(X)$ over the batch $X$.

    • $\sigma$ and $\mu$ are the standard deviation and the mean of $\text{Importance}(X)$, respectively.

  • The gradient used for backpropagation, $\mathcal{L}'_{\text{Importance}}(X)$ with respect to the $j$-th expert's importance, is $\frac{2 I_j}{N \cdot \mu_I^2} - \frac{2 \sigma_I^2}{N \cdot \mu_I^3}$, where $I = \text{Importance}(X)$ and $N$ is the number of experts.
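The following is a minimal NumPy sketch of the importance loss under the formulas above. The small epsilon in the denominator is an addition for numerical stability, not part of the formula, and the names are illustrative.

```python
import numpy as np

def importance_loss(gate_outputs, w_importance):
    """L_Importance(X) = w_Importance * CV(Importance(X))^2.

    gate_outputs: array of shape (batch_size, n_experts) holding G(x) for every x in the batch X.
    """
    importance = gate_outputs.sum(axis=0)                               # Importance(X), one value per expert
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)    # CV^2 = sigma^2 / mu^2
    return w_importance * cv_squared
```

In training, this term is simply added to the model's main objective, scaled by $w_{\text{Importance}}$, so gradients flow back into the gating network and discourage it from concentrating on a few experts.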
