Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

In modern deep neural networks, increasing both training data scale and model size has led to significantly better prediction accuracy. However, when the entire model is activated for every example, training costs grow roughly quadratically, since both the model size and the number of training examples increase. Advances in computing speed and distributed computation have not kept up with this demand.

Various forms of conditional computation have been proposed to increase model capacity without a proportional increase in computational cost. In these schemes, large parts of the network are active or inactive on a per-example basis, with the activation decisions learned by reinforcement learning, backpropagation, or other training methods. In practice, this approach faces several challenges:

  • Branching Calculation: Modern computing devices, especially GPUs, are much faster at arithmetic than at the branching that conditional computation requires.

  • Conditional Batch Sizing: Large batch sizes are critical for computational efficiency, but conditional computation shrinks the effective batch size seen by each conditionally active part of the network.

  • Network Bandwidth Bottleneck: A GPU cluster's computational power can be thousands of times greater than its aggregate inter-device network bandwidth.

  • New Loss Term: The training process may require additional loss terms to control sparsity levels across batches and individual examples.

Sparsely-Gated Mixture of Experts is one such approach that has achieved more than 1000Ɨ improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters.

Introduction

As an approach to conditional computation, this paper introduces a new type of general-purpose neural network component called the "Sparsely-Gated Mixture-of-Experts Layer" (MoE). The MoE consists of multiple experts, each a simple feed-forward neural network, and a trainable gating network that selects a sparse combination of experts to process each input. All parts of the network are trained jointly by backpropagation.

Formulation

The MoE layer consists of a set of $n$ expert networks and a gating network whose output is sparse. This sparsity is what saves computation: wherever $G(x)_i = 0$, we need not compute $E_i(x)$. The layer's output is the gate-weighted sum of the expert outputs (a minimal sketch of this sparse combination follows the definitions below):

$$y = \sum_{i=1}^{n} G(x)_i \cdot E_i(x)$$
  • $E_i$ denotes the $i$-th expert network: $E = \{E_1, E_2, \dots, E_n\}$.

  • $G$ is a "gating network" whose output is a sparse $n$-dimensional vector.

  • When $k > 1$, the gate values for the top $k$ experts have non-zero derivatives with respect to the gating network weights, while all others have zero derivatives.
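To make the sparsity concrete, here is a minimal NumPy sketch of the combination above. The function name `moe_forward`, the toy experts, and the hand-written gate vector are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def moe_forward(x, experts, gates):
    """Compute y = sum_i G(x)_i * E_i(x), evaluating E_i only where the gate is non-zero."""
    y = None
    for i, g in enumerate(gates):
        if g != 0.0:                          # sparsity of G(x): skip inactive experts entirely
            out = g * experts[i](x)
            y = out if y is None else y + out
    return y

# Toy usage: two linear "experts" and a hand-written sparse gate vector.
experts = [lambda x, W=W: W @ x for W in (np.eye(4), 2.0 * np.eye(4))]
x = np.ones(4)
gates = np.array([0.7, 0.0])                  # expert 1 is never evaluated
y = moe_forward(x, experts, gates)            # -> 0.7 * x
```

Because only the experts with non-zero gates are evaluated, the cost per example scales with the number of active experts rather than with the total number of experts.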

$$G_\sigma(x) = \text{softmax}(x \cdot W_g)$$

A gating network varies based on how it computes the gate values. The simplest form is softmax gating, which multiplies the input by a trainable weight matrix $W_g$ and applies the softmax function (sketched after the list below). However, it faces the following challenges:

  • Unbalanced Loading: The gradient tends to push the gating network toward routing most examples to the same few experts, and the imbalance is self-reinforcing because the favored experts are trained more and selected even more often.

  • Expert Collapse: The model ends up actively using only a handful of experts, leaving the rest undertrained.
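For contrast with the noisy variant introduced next, here is a minimal NumPy sketch of plain softmax gating; the function names are illustrative. Note that its output is dense, so on its own it saves no computation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())        # subtract the max for numerical stability
    return e / e.sum()

def softmax_gating(x, W_g):
    """G_sigma(x) = softmax(x . W_g): one gate value per expert, all non-zero."""
    return softmax(x @ W_g)
```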

Noisy Top-K Gating adds two components to the softmax gating network: sparsity, which saves computation, and noise, which helps with load balancing. Before applying the softmax, it adds tunable Gaussian noise to the logits, then keeps only the top $k$ values and sets the rest to $-\infty$ (so their gates become exactly zero after the softmax). Without the Gaussian noise, the network tends to converge to using the same few experts, making all the others useless. A minimal sketch follows the definitions below.

$$G(x) = \text{softmax}(\text{KeepTopK}(H(x), k))$$
  • $H(x)_i = (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{softplus}\big((x \cdot W_{\text{noise}})_i\big)$

  • $\text{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$
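Putting the pieces together, here is a minimal NumPy sketch of Noisy Top-K Gating under the definitions above; the function and parameter names are illustrative.

```python
import numpy as np

def noisy_top_k_gating(x, W_g, W_noise, k, rng=None):
    """G(x) = softmax(KeepTopK(H(x), k)) with tunable Gaussian noise on the logits."""
    rng = rng or np.random.default_rng()
    clean = x @ W_g                                              # (x . W_g)_i
    noise_scale = np.logaddexp(0.0, x @ W_noise)                 # softplus((x . W_noise)_i)
    h = clean + rng.standard_normal(clean.shape) * noise_scale   # H(x)
    kept = np.full_like(h, -np.inf)                              # KeepTopK: non-top-k logits -> -inf
    top_k = np.argsort(h)[-k:]
    kept[top_k] = h[top_k]
    e = np.exp(kept - kept[top_k].max())                         # exp(-inf) = 0, so the gates are sparse
    return e / e.sum()
```

A fresh noise sample per example keeps the top-$k$ selection stochastic, which is what spreads load across experts instead of locking onto the same few.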

For very large numbers of experts, the branching factor can be reduced by using a two-level hierarchical MoE. In this setup, a primary gating network chooses a sparse weighted combination of experts, where each expert is itself a secondary MoE with its own gating network.
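As an illustration of the two-level setup, the sketch below reuses the `moe_forward` helper from earlier and treats each primary "expert" as a pair of (secondary experts, secondary gating function); the names are illustrative.

```python
def hierarchical_moe_forward(x, primary_gates, sub_moes):
    """Two-level MoE: each primary expert is itself a secondary MoE with its own gating."""
    y = None
    for i, g in enumerate(primary_gates):
        if g != 0.0:                                   # only active secondary MoEs are evaluated
            sub_experts, sub_gating_fn = sub_moes[i]
            out = g * moe_forward(x, sub_experts, sub_gating_fn(x))
            y = out if y is None else y + out
    return y
```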

Balancing Expert Utilization: An Additional Importance Loss

To balance expert utilization, an importance score measures how much each expert is used over a batch of training examples. An additional loss term $\mathcal{L}_{\text{Importance}}$ is added to the overall model loss to encourage balanced training (a sketch of the computation follows the definitions below).

$$\text{Importance}(X) = \sum_{x \in X} G(x)$$

$$\mathcal{L}_{\text{Importance}}(X) = w_{\text{Importance}} \cdot \text{CV}\big(\text{Importance}(X)\big)^2, \quad \text{where } \text{CV} = \frac{\sigma}{\mu}$$
  • $\mathcal{L}_{\text{Importance}}(X)$ denotes the importance loss computed over the batch $X$.

  • $w_{\text{Importance}}$ is a hand-tuned scaling coefficient.

  • $\text{CV}^2$ is the square of the coefficient of variation of $\text{Importance}(X)$ over the batch $X$.

    • $\sigma$ and $\mu$ are the standard deviation and the mean of $\text{Importance}(X)$, respectively.

  • The gradient used for backpropagation, $\mathcal{L}'_{\text{Importance}}(X)$ with respect to the $j$-th expert's importance, is $\frac{2 I_j}{N \cdot \mu_I^2} - \frac{2 \sigma_I^2}{N \cdot \mu_I^3}$, where $I = \text{Importance}(X)$ and $N$ is the number of experts.
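The following is a minimal NumPy sketch of the importance loss under the formulas above. The small epsilon in the denominator is an addition for numerical stability, not part of the formula, and the names are illustrative.

```python
import numpy as np

def importance_loss(gate_outputs, w_importance):
    """L_Importance(X) = w_Importance * CV(Importance(X))^2.

    gate_outputs: array of shape (batch_size, n_experts) holding G(x) for every x in the batch X.
    """
    importance = gate_outputs.sum(axis=0)                               # Importance(X), one value per expert
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-10)    # CV^2 = sigma^2 / mu^2
    return w_importance * cv_squared
```

In training, this term is simply added to the model's main objective, scaled by $w_{\text{Importance}}$, so gradients flow back into the gating network and discourage it from concentrating on a few experts.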
