Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

This paper introduces MAML (Model-Agnostic Meta-Learning), an algorithm compatible with any model trained by gradient descent, including models for classification, regression, and reinforcement learning. It aims to solve few-shot learning problems by enabling quick adaptation to new tasks — "fast adaptation".

The learned structure reflects the overall distribution of tasks through the base-level learner. The survey "Meta-Learning in Neural Networks: A Survey" places this optimization-based approach in a broader context.

Introduction

Human intelligence can perform a new task after learning from just a few examples; MAML brings this capability to deep learning. However, MAML takes a different approach from traditional meta-learning, which focuses on learning the optimization procedure itself — update functions, learning rates, and hyperparameters.

Meta Learning Problem Set-Up

Few-shot learning (FSL) trains models from a minimal number of examples. MAML supports this by meta-training a parameter initialization before the model faces a new task, so that only a few gradient steps are needed at adaptation time — this is "fast adaptation." This section covers the formulation and setup of this process.

$\mathcal{T} = \{\, \mathcal{L}(x_1, a_1, \ldots, x_H, a_H),\; q(x_1),\; q(x_{t+1} \mid x_t, a_t),\; H \,\}$

  • $\mathcal{T}$ is a task that the model should be able to solve; it contains the elements below.

  • The model is expressed as a function $f$ that maps observations $x$ to outputs $a$.

  • $\mathcal{L}$ is the loss function, reflecting the distribution $q(x_1)$.

    • $q(x_1)$ is the distribution over the initial observation.

    • The loss value $\mathcal{L}(x_1, a_1, \ldots, x_H, a_H) \rightarrow \mathbb{R}$ is the task-specific feedback.

  • $q(x_{t+1} \mid x_t, a_t)$ is the transition distribution from $x_t$ under action $a_t$.

  • $H$ is the episode length: the total number of time steps during which the model must take actions.

In this meta-learning scenario, the model must adapt to tasks drawn from a distribution $p(\mathcal{T})$. In the K-shot setting, the model learns a new task $T_i \sim p(\mathcal{T})$ from only $K$ training samples and the feedback $\mathcal{L}_{T_i}$ they generate.

Algorithm

MAML performs two types of learning in each meta-training iteration. This is commonly explained through an inner loop over task-specific (local) parameters and an outer loop over meta parameters.

  • In the inner loop, the model updates task-specific parameters $\theta_i'$ for task $T_i$ using the inner learning rate $\alpha$.

  • In the outer loop that follows, the model updates the meta parameters $\theta$, shared across all tasks, using the outer learning rate $\beta$.

    • The task-specific and meta parameters have identical shapes.

  • $p(\mathcal{T})$ is the distribution over tasks $\mathcal{T}$.
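The two loops together optimize a single meta-objective: the meta parameters $\theta$ are chosen so that the task losses are small *after* one inner-loop step. In the paper's notation:

$$\min_\theta \sum_{T_i \sim p(\mathcal{T})} \mathcal{L}_{T_i}\big(f_{\theta_i'}\big) = \sum_{T_i \sim p(\mathcal{T})} \mathcal{L}_{T_i}\big(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta)}\big)$$

Note that the outer gradient differentiates through the inner update, which is why $\theta_i'$ and $\theta$ must share the same shape.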

  1. Randomly initialize $\theta$.

  2. Repeat until done:

    1. Sample a batch of tasks $T_i \sim p(\mathcal{T})$.

    2. For each $T_i$:

      1. Evaluate the gradient $\nabla_\theta \mathcal{L}_{T_i}(f_\theta)$ with respect to $K$ samples and compute the adapted parameters: $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta)$

    3. Update $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_i \sim p(\mathcal{T})} \mathcal{L}_{T_i}(f_{\theta_i'})$ — note the meta-update evaluates each task loss at the adapted parameters $\theta_i'$, not at $\theta$.

The experimental section of the paper demonstrates that running meta-level and base-level learning together benefits the optimization of both learners. The analysis of these results shows that MAML can converge in fewer steps because it avoids overfitting to any single task and captures the distribution over, and shared representation across, tasks.

Check out the code I made!
