Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

This paper introduces MAML (Model-Agnostic Meta-Learning), an algorithm compatible with any model trained by gradient descent, including models for classification, regression, and reinforcement learning. It aims to solve few-shot learning problems by enabling quick adaptation to new tasks — "fast adaptation".

The learned structure reflects the overall distribution of tasks through the base-level learner. The survey "Meta-Learning in Neural Networks: A Survey" places this optimization-based approach in a broader context.

Introduction

Human intelligence can perform a new task after learning from just a few examples; MAML brings this capability to deep learning. However, MAML takes a different approach from traditional meta-learning, which focuses on learning the optimization procedure itself — update functions, learning rates, and hyperparameters.

Meta Learning Problem Set-Up

Few-shot learning (FSL) trains models from a minimal number of examples. MAML supports this by meta-training a parameter initialization before the model faces a new task, so that only a few gradient steps are needed at adaptation time — this is "fast adaptation." This section covers the formulation and setup of this process.

$\mathcal{T} = \{\, \mathcal{L}(x_1, a_1, \ldots, x_H, a_H),\; q(x_1),\; q(x_{t+1} \mid x_t, a_t),\; H \,\}$

  • $\mathcal{T}$ is a task that the model should be able to solve; it contains the elements below.

  • The model is expressed as a function $f$ that maps observations $x$ to outputs $a$.

  • $\mathcal{L}$ is the loss function, reflecting the distribution $q(x_1)$.

    • $q(x_1)$ is the distribution over the initial observation.

    • The loss value $\mathcal{L}(x_1, a_1, \ldots, x_H, a_H) \rightarrow \mathbb{R}$ is the task-specific feedback.

  • $q(x_{t+1} \mid x_t, a_t)$ is the transition distribution from $x_t$ under action $a_t$.

  • $H$ is the episode length: the total number of time steps during which the model must take actions.

In this meta-learning scenario, the model must adapt to tasks drawn from a distribution $p(\mathcal{T})$. In the K-shot setting, the model learns a new task $T_i \sim p(\mathcal{T})$ from only $K$ training samples and the feedback $\mathcal{L}_{T_i}$ they generate.

Algorithm

MAML performs two types of learning in each meta-training iteration. This is commonly explained through an inner loop over task-specific (local) parameters and an outer loop over meta parameters.

  • In the inner loop, the model updates task-specific parameters $\theta_i'$ for task $T_i$ using the inner learning rate $\alpha$.

  • In the outer loop that follows, the model updates the meta parameters $\theta$, shared across all tasks, using the outer learning rate $\beta$.

    • The task-specific and meta parameters have identical shapes.

  • $p(\mathcal{T})$ is the distribution over tasks $\mathcal{T}$.
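The two loops together optimize a single meta-objective: the meta parameters $\theta$ are chosen so that the task losses are small *after* one inner-loop step. In the paper's notation:

$$\min_\theta \sum_{T_i \sim p(\mathcal{T})} \mathcal{L}_{T_i}\big(f_{\theta_i'}\big) = \sum_{T_i \sim p(\mathcal{T})} \mathcal{L}_{T_i}\big(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta)}\big)$$

Note that the outer gradient differentiates through the inner update, which is why $\theta_i'$ and $\theta$ must share the same shape.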

  1. Randomly initialize $\theta$.

  2. Repeat until done:

    1. Sample a batch of tasks $T_i \sim p(\mathcal{T})$.

    2. For each $T_i$:

      1. Evaluate the gradient $\nabla_\theta \mathcal{L}_{T_i}(f_\theta)$ with respect to $K$ samples and compute the adapted parameters: $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta)$

    3. Update $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{T_i \sim p(\mathcal{T})} \mathcal{L}_{T_i}(f_{\theta_i'})$ — note the meta-update evaluates each task loss at the adapted parameters $\theta_i'$, not at $\theta$.

The experimental section of the paper demonstrates that running meta-level and base-level learning together benefits the optimization of both learners. The analysis of these results shows that MAML can converge in fewer steps because it avoids overfitting to any single task and captures the distribution over, and shared representation across, tasks.

Check out the code I made!
