Challenges in Training Deep Neural Networks and the Latest Solutions
There are several challenges in training deep neural networks due to their multiple hidden layers. These issues can cause the model to stop learning or to perform poorly.
Vanishing Gradient is a problem where gradient values remain extremely small, preventing the early layers from being properly trained.
Exploding Gradient is a problem where gradient values become extremely large, making training unstable or impossible.
The Choice of Activation Function is also a crucial matter in deep learning. Finding optimal activation functions through proper testing and research remains an ongoing challenge.
Weight Initialization is an optimization technique that helps prevent exploding/vanishing gradients and enhances training speed and performance.
Overfitting occurs when a model fits its training dataset too closely, leaving it unable to generalize to new data; Regularization techniques are used to counteract this.
Momentum-based methods and adaptive learning rates can be applied to optimize a model's cost more efficiently.
Vanishing/Exploding Gradient Problem
The vanishing gradient problem occurs when gradients become extremely small or zero during backpropagation, while the exploding gradient problem happens when gradients become too large. Both prevent proper weight updates from occurring. To understand these problems, let's examine how backpropagation works. By the chain rule, the gradient of the loss $L$ with respect to a weight in an early layer is a product of per-layer factors:

$$\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial a^{(N)}} \cdot \frac{\partial a^{(N)}}{\partial a^{(N-1)}} \cdots \frac{\partial a^{(2)}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial W^{(1)}}$$

In the above expression, the repeated product of the factors $\frac{\partial a^{(k)}}{\partial a^{(k-1)}}$, which involves multiple gradient multiplications, can cause the gradient to approach zero, leaving less and less signal for the earlier layers; this phenomenon is called the Vanishing Gradient.
The output range of an activation function's derivative also affects this problem. For example, the sigmoid function's derivative only takes values in $(0, 0.25]$, while the hyperbolic tangent (tanh) function's derivative only takes values in $(0, 1]$.
Initializing weights with values that are too small can also cause the vanishing gradient problem.
Conversely, when gradients become larger than one, they can grow explosively, making weight updates unstable or causing numerical overflow; this is called the Exploding Gradient. This problem is particularly common in RNNs and other models that process sequential data.
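To see the scale of the effect, here is a minimal NumPy sketch (not from the original text; the depth and the individual factors are arbitrary choices) that multiplies a chain of per-layer derivative factors, as backpropagation does:

```python
# Minimal NumPy sketch: a long chain of per-layer derivative factors either
# collapses the gradient toward zero or blows it up. Depth and factors are
# arbitrary illustrative choices.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # never exceeds 0.25

rng = np.random.default_rng(0)
depth = 50

# Vanishing case: every factor is at most 0.25, so the product collapses.
grad = 1.0
for _ in range(depth):
    grad *= sigmoid_derivative(rng.normal())
print(f"after {depth} sigmoid-like layers: {grad:.3e}")  # effectively zero

# Exploding case: factors slightly above 1 grow without bound.
grad = 1.0
for _ in range(depth):
    grad *= 1.5
print(f"after {depth} layers with factor 1.5: {grad:.3e}")  # roughly 6e8
```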
Several techniques can be applied to models to address these problems, although the underlying issues remain active research topics in deep learning:
Using ReLU or Swish as the activation function helps prevent gradient problems.
Implementing Batch Normalization helps stabilize the learning process by normalizing layer inputs.
Weight Initialization techniques ensure proper initial conditions for training.
Residual Connections or Skip Connections allow gradients to flow more easily through deep networks.
Gradient Clipping prevents exploding gradients by limiting their values, while Weight Regularization helps control the model's complexity by adding a penalty term, e.g. the L2 penalty $L_{\text{reg}} = L + \lambda \sum_i w_i^2$ (see the sketch after this list).
For sequential data, specialized architectures like LSTMs and GRUs are particularly effective at handling these challenges.
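To make two of these remedies concrete, here is a hedged PyTorch sketch combining gradient clipping with L2 weight regularization (PyTorch exposes the L2 penalty through the optimizer's weight_decay argument); the model, data, and hyperparameters are illustrative placeholders:

```python
# Hedged PyTorch sketch: gradient clipping plus L2 weight regularization.
# The model, data, and hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
# weight_decay adds an L2 penalty on the weights to the update rule.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
loss_fn = nn.MSELoss()

x, y = torch.randn(16, 10), torch.randn(16, 1)  # dummy mini-batch

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale the global gradient norm so one bad batch cannot explode the update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```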
The Choice of Activation Function: Swish and ReLU
Choosing the right activation function is crucial for model performance. Since modern DNNs use ReLU ($\mathrm{ReLU}(x) = \max(0, x)$, or Rectified Linear Unit) or Swish most of the time, these two functions will be examined in this section:
It introduces non-linearity to the model, enabling it to learn complex patterns while maintaining sparsity.
Unlike sigmoid and tanh functions, it does not involve exponential operations, making it more computationally efficient.
It prevents vanishing gradients because its derivative only takes values in $\{0, 1\}$.
Swish, defined as $f(x) = x \cdot \sigma(x)$ where $\sigma$ is the sigmoid, improves upon ReLU by introducing smooth, non-monotonic behavior through a small negative slope when $x < 0$.
The sigmoid acts as an automatic "soft gate" for the input.
It often matches or outperforms ReLU, particularly in deep models for natural language processing and computer vision tasks.
The derivative is $f'(x) = f(x) + \sigma(x)\,\big(1 - f(x)\big)$ (checked numerically in the sketch below).
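As a small illustration (the function names are my own, not from the original), the following NumPy sketch implements ReLU and Swish with $\beta = 1$ and checks the derivative formula above against a finite-difference estimate:

```python
# Illustrative NumPy sketch of ReLU and Swish (beta = 1, i.e. SiLU), plus a
# numerical check of the Swish derivative f'(x) = f(x) + sigmoid(x) * (1 - f(x)).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)  # derivative is 0 for x < 0 and 1 for x > 0

def swish(x):
    return x * sigmoid(x)

def swish_derivative(x):
    f = swish(x)
    return f + sigmoid(x) * (1.0 - f)

x = np.linspace(-4.0, 4.0, 9)
finite_diff = (swish(x + 1e-5) - swish(x - 1e-5)) / 2e-5  # central difference
print(np.allclose(finite_diff, swish_derivative(x), atol=1e-6))  # True
```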
Normalization and Internal Covariate Shift
The distribution of layer inputs changes during training as the weights get updated, a phenomenon called Internal Covariate Shift. This makes training slower because each layer must continuously adapt to a new input distribution.
Internal covariate shift slows down training and forces models to use lower learning rates.
It makes maintaining a smooth gradient landscape difficult, which can lead to vanishing or exploding gradients.
Normalization, by contrast, makes models less sensitive to weight initialization.
It also acts as a form of regularization and may reduce overfitting.
Batch Normalization standardizes the inputs of a layer for each mini-batch, reducing internal covariate shift and stabilizing training. It is typically applied before the activation function:
For a given mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$, the normalization performs the following steps.
Compute the mean: $\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i$.
Compute the variance: $\sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$.
Normalize the batch: $\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}$, where $\epsilon$ is a small constant for numerical stability.
Scale and shift using an affine transformation: $y_i = \gamma \hat{x}_i + \beta$, where $\gamma$ is called the scale and $\beta$ the shift.
$\gamma$ and $\beta$ are also learned via backpropagation (a sketch of these steps follows).
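Here is a minimal NumPy sketch of these four steps for a mini-batch of feature vectors; the values of $\gamma$, $\beta$, and $\epsilon$ are illustrative (in practice $\gamma$ and $\beta$ are learned, and frameworks also track running statistics for use at inference time):

```python
# Minimal NumPy sketch of the batch-normalization forward pass described above.
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x has shape (batch_size, num_features); statistics are per feature."""
    mu = x.mean(axis=0)                    # step 1: mini-batch mean
    var = x.var(axis=0)                    # step 2: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    return gamma * x_hat + beta            # step 4: scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # dummy mini-batch
gamma, beta = np.ones(4), np.zeros(4)             # learned parameters in practice

y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # ~0 mean, ~1 std per feature
```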
Unlike batch normalization, which normalizes each channel across the batch, Layer Normalization works per sample across features. It is well-suited for models like RNNs and Transformers that process sequential data, though it is less effective than batch normalization in CNNs.
Here, $\mu$ and $\sigma^2$ are the mean and variance calculated across all features within a single sample.
Instance normalization operates on each sample and channel independently. It is commonly used in GANs, style transfer, and image generation models.
Group Normalization divides channels into groups and normalizes per sample per group. It can be thought of as a midpoint between batch and instance normalization. It is most commonly used in CNNs with small batches, such as in object detection.
Spectral Normalization constrains the spectral norm of weight matrices. It is primarily used in GANs, where it is crucial for stabilizing training and preventing mode collapse.
The weight matrix is rescaled as $W_{\text{SN}} = \frac{W}{\sigma(W)}$, where $\sigma(W)$ represents the largest singular value of $W$.
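For reference, here is a hedged PyTorch sketch showing how the normalization variants above are typically instantiated; the tensor shape and channel counts are arbitrary examples:

```python
# Hedged PyTorch sketch of the normalization variants above; shapes are arbitrary.
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)  # (batch, channels, height, width)

batch_norm    = nn.BatchNorm2d(16)                           # per channel, across the batch
layer_norm    = nn.LayerNorm([16, 32, 32])                   # per sample, across all features
instance_norm = nn.InstanceNorm2d(16)                        # per sample and per channel
group_norm    = nn.GroupNorm(num_groups=4, num_channels=16)  # per sample, per channel group

for layer in (batch_norm, layer_norm, instance_norm, group_norm):
    print(type(layer).__name__, layer(x).shape)

# Spectral normalization wraps a layer and divides its weight by the largest
# singular value on every forward pass.
spectral_linear = torch.nn.utils.spectral_norm(nn.Linear(16, 16))
print(spectral_linear(torch.randn(8, 16)).shape)
```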
Weight Initialization
Weight Initialization is crucial for training modern DNNs. Poor initialization can lead to vanishing/exploding gradients and can leave the optimizer stuck in unfavorable regions of the loss surface, such as poor local optima or saddle points.
Xavier/Glorot Initialization is optimized for sigmoid and tanh activation functions.
He/Kaiming Initialization is optimized for ReLU, its variants, and Swish activation functions (a short sketch of both schemes follows).
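A short, hedged PyTorch sketch of both schemes; the layer sizes are arbitrary examples:

```python
# Hedged PyTorch sketch of Xavier/Glorot and He/Kaiming initialization;
# the layer sizes are arbitrary examples.
import torch.nn as nn

tanh_layer = nn.Linear(256, 128)  # to be followed by tanh or sigmoid
relu_layer = nn.Linear(256, 128)  # to be followed by ReLU or Swish

# Xavier/Glorot: weight variance scaled by (fan_in + fan_out).
nn.init.xavier_uniform_(tanh_layer.weight)
nn.init.zeros_(tanh_layer.bias)

# He/Kaiming: weight variance scaled by fan_in, compensating for ReLU zeroing
# roughly half of its inputs.
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')
nn.init.zeros_(relu_layer.bias)
```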