Challenges in Training Deep Neural Networks and the Latest Solutions

There are several challenges in training deep neural networks due to their multiple hidden layers. These issues can cause training to stall or the model to perform poorly.

  • Vanishing Gradient is a problem where gradient values become extremely small, preventing the early layers from being trained properly.

  • Exploding Gradient is a problem where gradient values become extremely large, making training unstable or impossible.

  • The Choice of Activation Function is also a crucial matter in deep learning. Finding optimal activation functions through proper testing and research remains an ongoing challenge.

  • Weight Initialization is a technique that helps prevent exploding/vanishing gradients and improves training speed and performance.

  • Overfitting occurs when a model fits its training dataset too closely, leaving it unable to generalize to new data; Regularization techniques are used to counteract it.

  • Momentum-based methods and adaptive learning rates can be applied to optimize a model's cost function more efficiently.

Modern deep learning has evolved by solving these problems, though some challenges still persist today.

Vanishing/Exploding Gradient Problem

The vanishing gradient problem occurs when gradients become extremely small or zero during backpropagation, while the exploding gradient problem happens when gradients become too large. Both prevent proper weight updates from occurring. To understand these problems, let's examine how backpropagation works:

$$\frac{\partial \mathcal{L}}{\partial W^{l}} = \frac{\partial \mathcal{L}}{\partial a^{L}} \cdot \prod^{L-1}_{k=l}{\frac{\partial a^{k+1}}{\partial a^{k}}} \cdot \frac{\partial a^{l}}{\partial W^{l}}$$

In the above expression, the product term $\prod^{L-1}_{k=l}{\frac{\partial a^{k+1}}{\partial a^{k}}}$ involves many gradient multiplications. When the individual factors are small, the product can approach zero, making the updates to the earlier layers increasingly smaller. This phenomenon is called Vanishing Gradient.

  • The output range of an activation function's derivative also affects this problem. For example, the sigmoid function's derivative is bounded by $(0, 0.25]$, while the derivative of the hyperbolic tangent (tanh) is bounded by $(0, 1]$.

  • Initializing weights with values that are too small can also cause the vanishing gradient problem.

Conversely, when the per-layer gradient factors are larger than one, the product can grow explosively, making weight updates unstable. This is called the Exploding Gradient problem, and it is particularly common in RNNs and other models that process sequential data.
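To make the effect of the repeated multiplication concrete, here is a minimal Python sketch (the depth and the per-layer factors 0.25 and 1.5 are arbitrary choices for illustration): it multiplies per-layer gradient factors across a deep network and shows how quickly the product either shrinks toward zero or blows up.

```python
# Minimal illustration of how repeated multiplication of per-layer gradient
# factors leads to vanishing (factor < 1) or exploding (factor > 1) gradients.
depth = 50  # number of layers the gradient must flow through

for factor in (0.25, 1.5):  # hypothetical |da^{k+1}/da^k| values
    grad = 1.0
    for _ in range(depth):
        grad *= factor
    print(f"factor={factor}: gradient scale after {depth} layers = {grad:.3e}")

# factor=0.25 collapses toward 0 (vanishing); factor=1.5 blows up (exploding).
```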

Several techniques can be applied to models to address these problems, though the problems themselves remain active areas of research in deep learning:

  • Using ReLU or Swish as the activation function helps prevent gradient problems.

  • Implementing Batch Normalization helps stabilize the learning process by normalizing layer inputs.

  • Weight Initialization techniques ensure proper initial conditions for training.

  • Residual Connections or Skip Connections allow gradients to flow more easily through deep networks.

  • Gradient Clipping prevents exploding gradients by limiting gradient values, while Weight Regularization helps control the model's complexity: $\text{grad} = \text{grad} \cdot \min\left(1, \frac{\text{threshold}}{\lVert \text{grad} \rVert}\right)$ (see the sketch after this list).

  • For sequential data, specialized architectures like LSTMs and GRUs are particularly effective at handling these challenges.
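To make the clipping rule from the list concrete, here is a minimal PyTorch sketch that rescales a gradient tensor by $\min(1, \frac{\text{threshold}}{\lVert\text{grad}\rVert})$. The tensor values and the threshold are arbitrary; in practice, the built-in helper `torch.nn.utils.clip_grad_norm_` applies the same idea across all of a model's parameters.

```python
import torch

def clip_gradient(grad: torch.Tensor, threshold: float) -> torch.Tensor:
    """Rescale grad so that its L2 norm never exceeds threshold."""
    norm = grad.norm(p=2).item()
    scale = min(1.0, threshold / (norm + 1e-12))  # 1e-12 avoids division by zero
    return grad * scale

grad = torch.randn(10) * 100.0                    # an artificially large gradient
clipped = clip_gradient(grad, threshold=5.0)
print(grad.norm().item(), clipped.norm().item())  # the clipped norm is at most 5
```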

Due to these problems, modern deep learning models use optimized initialization methods and activation functions specifically designed to prevent them.

How to Spot Exploding/Vanishing Gradient

Vanishing and exploding gradients are among the most common issues in deep learning. While monitoring training, several warning signs can help spot them:

  • The model converges very slowly, and weight changes in the early layers are minimal.

  • The loss function value remains stuck at the same level.

  • The loss suddenly becomes very large or NaN and the weights grow rapidly, which points to exploding gradients.
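One practical way to watch for these warning signs is to log per-layer gradient norms during training. The following is a minimal PyTorch sketch with an arbitrary toy model and random data: after a backward pass it prints the gradient norm of every parameter, where values near zero in the early layers hint at vanishing gradients and very large or NaN values hint at exploding gradients.

```python
import torch
import torch.nn as nn

# A toy deep sigmoid network and random data, purely for illustration.
layers = []
for _ in range(10):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]  # sigmoid stack tends to shrink gradients
layers.append(nn.Linear(32, 1))
model = nn.Sequential(*layers)

x, y = torch.randn(64, 32), torch.randn(64, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Tiny norms in the first layers suggest vanishing gradients;
# huge or NaN norms suggest exploding gradients.
for name, param in model.named_parameters():
    print(f"{name:20s} grad norm = {param.grad.norm().item():.3e}")
```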

The Choice of Activation Function: Swish and ReLU

Choosing the right activation function is crucial for model performance. Since modern DNNs use ReLU (Rectified Linear Unit) or Swish most of the time, these two functions are examined in this section:

$$\text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}$$

  • It introduces non-linearity to the model, enabling it to learn complex patterns while maintaining sparsity.

  • Unlike sigmoid and tanh functions, it does not involve exponential operations, making it more computationally efficient.

  • It mitigates the vanishing gradient problem because its derivative is exactly $1$ for positive inputs (and $0$ otherwise), so gradients are not shrunk as they pass through active units.

$$\text{Swish}(x) = x \cdot \sigma(\beta \cdot x)$$

where $\sigma(z) = \frac{1}{1+e^{-z}}$ is the sigmoid function and $\beta$ is a learnable or fixed parameter.

  • It improves upon ReLU by being smooth and non-monotonic, allowing small negative outputs when $x < 0$ instead of cutting them off at zero.

  • The sigmoid acts as an automatic ā€œsoft gateā€ for the input.

  • It often matches or outperforms ReLU, particularly in deep models for natural language processing and computer vision tasks.

  • The derivative is: $\frac{d}{dx}\text{Swish}(x) = \beta \cdot \text{Swish}(x) + \sigma(\beta \cdot x) \cdot (1 - \beta \cdot \text{Swish}(x))$, which reduces to $\text{Swish}(x) + \sigma(x) \cdot (1 - \text{Swish}(x))$ when $\beta = 1$.
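As a reference, both activations can be written in a few lines. The sketch below is a minimal PyTorch version with $\beta$ treated as a plain constant; note that PyTorch also ships `torch.nn.SiLU`, which is Swish with $\beta = 1$.

```python
import torch

def relu(x: torch.Tensor) -> torch.Tensor:
    # max(0, x), applied element-wise
    return torch.clamp(x, min=0.0)

def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # x * sigmoid(beta * x); beta = 1 recovers SiLU
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-3.0, 3.0, 7)
print(relu(x))   # negatives are clipped to 0
print(swish(x))  # small negative outputs survive for x < 0
```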

Normalization and Internal Covariate Shift

The distribution of each layer's inputs changes during training as the weights of the preceding layers are updated, a phenomenon called Internal Covariate Shift. This makes training slower because each layer must continuously adapt to a new input distribution.

  • It slows down training and forces models to use lower learning rates.

  • It makes maintaining a smooth gradient landscape difficult, which can lead to vanishing or exploding gradients.

Normalization techniques, described below, counteract this shift and bring additional benefits:

  • They make models less sensitive to weight initialization.

  • They act as a form of regularization and may reduce overfitting.

$$\text{BatchNorm}: \quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}$$

Batch Normalization standardizes the inputs of a layer for each mini-batch, reducing internal covariate shift and stabilizing training. It is typically applied before the activation function:

  1. For a given mini-batch $B = \{x_1, x_2, \ldots, x_m\}$, the normalization performs the following steps.

  2. Compute the mean: $\mu_B = \frac{1}{m}\sum^{m}_{i=1}{x_i}$ (Batch Mean).

  3. Compute the variance: $\sigma^{2}_{B} = \frac{1}{m}\sum^{m}_{i=1}{(x_i - \mu_B)^2}$ (Batch Variance).

  4. Normalize the batch: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}$, where $\epsilon$ is a small constant for numerical stability.

  5. Scale and shift using an affine transformation: $y_i = \gamma \cdot \hat{x}_i + \beta$, where $\gamma$ is called the scale and $\beta$ the shift.

    1. $\gamma$ and $\beta$ are also learned via backpropagation.
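The five steps above map directly onto a few lines of code. The following is a minimal PyTorch sketch in which $\gamma$ and $\beta$ are shown as plain tensors rather than learned parameters; in practice `torch.nn.BatchNorm1d` performs the same computation and additionally tracks running statistics for inference.

```python
import torch

def batch_norm(x: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    """Batch-normalize x of shape (batch, features) following the steps above."""
    mu = x.mean(dim=0)                        # step 2: batch mean
    var = x.var(dim=0, unbiased=False)        # step 3: batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)  # step 4: normalize
    return gamma * x_hat + beta               # step 5: scale and shift

x = torch.randn(64, 16) * 3 + 2               # arbitrary mini-batch
gamma, beta = torch.ones(16), torch.zeros(16)
y = batch_norm(x, gamma, beta)
print(y.mean(dim=0)[:3], y.var(dim=0, unbiased=False)[:3])  # roughly 0 and 1
```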

$$\text{LayerNorm}: \quad \hat{x} = \frac{x - \mu_L}{\sqrt{\sigma^2_L + \epsilon}}$$

Unlike batch normalization, which normalizes each feature or channel across the batch, Layer Normalization normalizes across the features within a single sample. It is well-suited for models like RNNs and Transformers that process sequential data, though it is generally less effective than batch normalization in CNNs.

  • $\mu_L$ and $\sigma^2_L$ are the mean and variance calculated across all features within a single sample.
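A minimal sketch of the built-in module, assuming a simple (batch, features) input with arbitrary sizes:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16)         # (batch, features), arbitrary sizes
layer_norm = nn.LayerNorm(16)  # normalizes across the 16 features of each sample
y = layer_norm(x)

# Each sample now has mean roughly 0 and variance roughly 1 across its features.
print(y.mean(dim=-1)[:3], y.var(dim=-1, unbiased=False)[:3])
```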

$$\text{InstanceNorm}: \quad \hat{x} = \frac{x - \mu_I}{\sqrt{\sigma^2_I + \epsilon}}$$

Instance normalization operates on each sample and channel independently. It is commonly used in GANs, style transfer, and image generation models.

$$\text{GroupNorm}: \quad \hat{x} = \frac{x - \mu_G}{\sqrt{\sigma^2_G + \epsilon}}$$

Group Normalization divides channels into groups and normalizes per sample per group. It can be thought of as a midpoint between batch and instance normalization. It is most commonly used in CNNs with small batches, such as in object detection.
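To see the difference between these two variants, the sketch below applies the built-in modules to the same image-shaped tensor; the tensor shape and the number of groups are arbitrary.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 32, 28, 28)  # (batch, channels, height, width), arbitrary shape

instance_norm = nn.InstanceNorm2d(32)                     # per sample, per channel
group_norm = nn.GroupNorm(num_groups=8, num_channels=32)  # per sample, per group of 4 channels

print(instance_norm(x).shape, group_norm(x).shape)  # shapes are unchanged
```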

$$W \leftarrow \frac{W}{\sigma(W)}$$

Spectral Normalization constrains the spectral norm of weight matrices. It is primarily used in GAN discriminators, where it helps prevent mode collapse and stabilizes training.

  • $\sigma(W)$ represents the largest singular value of $W$.
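PyTorch exposes this as a wrapper that divides a layer's weight by an estimate of its largest singular value on every forward pass; a minimal sketch with arbitrary layer sizes:

```python
import torch
import torch.nn as nn

# Wrap a linear layer so its weight is divided by an estimate of its largest
# singular value (obtained via power iteration) before each forward pass.
layer = nn.utils.spectral_norm(nn.Linear(64, 64))

x = torch.randn(8, 64)
y = layer(x)

# The largest singular value of the effective weight is now close to 1.
print(torch.linalg.svdvals(layer.weight)[0].item())
```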

Scale and Shift (or Affine Transformation) are crucial mechanisms in batch normalization. Normalizing all activations to have mean $0$ and variance $1$ might be too restrictive, since different layers may need different distributions for optimal performance. The affine transformation allows the model to learn how to scale and shift the normalized values, or even undo the normalization entirely if needed.

Some models may use Adaptive Normalization, which dynamically adjusts normalization parameters based on input. While more flexible than fixed normalization, it requires significantly more computational resources.

Batch Normalization was first introduced in the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" by Sergey Ioffe and Christian Szegedy.

Weight Initialization

Weight Initialization is crucial for training modern DNNs. Poor initialization can lead to vanishing/exploding gradients and can leave the model stuck in unfavorable regions of the loss surface, such as saddle points.

  • Xavier/Glorot Initialization is optimized for sigmoid and tanh activation functions.

  • He/Kaiming Initialization is optimized for ReLU, its variants, and Swish activation functions.
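Both schemes are available as built-in initializers in PyTorch; below is a minimal sketch with arbitrary layer sizes.

```python
import torch.nn as nn

tanh_layer = nn.Linear(256, 256)
relu_layer = nn.Linear(256, 256)

# Xavier/Glorot: keeps the activation variance roughly constant for sigmoid/tanh.
nn.init.xavier_uniform_(tanh_layer.weight)
nn.init.zeros_(tanh_layer.bias)

# He/Kaiming: compensates for ReLU zeroing out roughly half of its inputs.
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')
nn.init.zeros_(relu_layer.bias)

print(tanh_layer.weight.std().item(), relu_layer.weight.std().item())
```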
