Prototypical Networks for Few-shot Learning

The Prototypical Network is an algorithm for few-shot learning, which enables effective learning from limited data. It specializes in image classification and has shown strong performance in few-shot scenarios.

Introduction

It addresses the overfitting problem in few-shot learning scenarios. When working with extremely limited data, a classifier needs a strong inductive bias. The network provides one by clustering the examples of each class in an embedding space around a single "prototype" that represents the class.

Formulation and Problem Set-up

  • The support set $S_k$ for each class $k$ can be represented as $S_k = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where:

    • $x_i \in \mathbb{R}^D$

    • $y_i \in \{1, \dots, K\}$

    • $k$ is the index of a class

  1. The Prototypical Network computes a prototype $c_k \in \mathbb{R}^M$ for each class in an $M$-dimensional embedding space. An embedding function $f_{\theta}: \mathbb{R}^D \rightarrow \mathbb{R}^M$ maps each support point into this space, and the prototype is the mean of the embedded support points: $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_{\theta}(x_i)$.

  2. A query point $x$ is classified by its distance to each class prototype, using a distance function $d: \mathbb{R}^M \times \mathbb{R}^M \rightarrow [0, +\infty)$ such as the Euclidean distance $d(x, y) = \sqrt{\sum_{i=1}^{M} (x_i - y_i)^2}$, followed by a softmax over the negative distances: $p_{\theta}(y = k \mid x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}$.

  3. Learning minimizes the negative log-probability of the true class, $J(\theta) = -\log p_{\theta}(y = k \mid x)$, via SGD. The dataset for one episode consists of a randomly selected subset of classes and a few random samples from each class as the support set; the remaining samples are used as query points. These steps are sketched in code below.
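To make the formulation concrete, here is a minimal sketch in PyTorch (the framework choice is an assumption; the write-up does not fix one). Random tensors stand in for the embeddings $f_{\theta}(x)$, and the sizes $K$, $N$, $Q$, $M$ are made up for illustration.

```python
# Minimal sketch of the formulation above: prototypes, distance-based softmax,
# and the negative log-probability loss. Embeddings are random stand-ins.
import torch
import torch.nn.functional as F

K, N, Q, M = 5, 5, 15, 64                      # classes, shots, queries, embedding dim

support = torch.randn(K, N, M)                 # f_theta(x_i) for each support set S_k
queries = torch.randn(K * Q, M)                # f_theta(x) for the query points
labels = torch.arange(K).repeat_interleave(Q)  # assumed true class k of each query

# Prototype c_k = mean of the embedded support points of class k
prototypes = support.mean(dim=1)               # (K, M)

# Euclidean distances d(f_theta(x), c_k) between every query and every prototype
dists = torch.cdist(queries, prototypes)       # (K*Q, K)

# Softmax over negative distances gives log p_theta(y = k | x)
log_p = F.log_softmax(-dists, dim=1)

# J(theta) = -log p_theta(y = k | x), averaged over the query points
loss = F.nll_loss(log_p, labels)
print(loss.item())
```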

Algorithm

The training algorithm for the model $f_{\theta}(x)$ can be described in two learning phases. Phase 1 involves calculating the prototype $c_k$ using the support set. In Phase 2, the loss $J$ is computed using the query set, and the weights $\theta$ of the embedding network are updated.
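As a point of reference, here is a hedged sketch of what the embedding network $f_{\theta}$ might look like. Nothing above fixes an architecture; the small convolutional encoder below (the name `EmbeddingNet`, the four conv blocks, and the 28×28 grayscale input size) is an assumption chosen only for illustration.

```python
# Hypothetical embedding network f_theta: R^D -> R^M. Any differentiable
# encoder that outputs an M-dimensional vector would work here.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class EmbeddingNet(nn.Module):
    """Maps an input x in R^D to an embedding f_theta(x) in R^M."""
    def __init__(self, in_channels: int = 1, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(in_channels, hidden),
            conv_block(hidden, hidden),
            conv_block(hidden, hidden),
            conv_block(hidden, hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x).flatten(start_dim=1)  # (batch, M)

# e.g. 28x28 grayscale inputs are pooled down to 1x1, so M = 64
f_theta = EmbeddingNet()
print(f_theta(torch.randn(8, 1, 28, 28)).shape)  # torch.Size([8, 64])
```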

  • Softmax: $p_{\theta}(y = k \mid x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}$

  • Negative log-probability: $J(\theta) = -\log p_{\theta}(y = k \mid x)$

  1. Create the support set and calculate the prototype $c_k$.

    1. $S = \text{RandomSample}(D, N_k)$

    2. $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_{\theta}(x_i)$

  2. Create the query set and update the weights based on the distance function $d(\cdot, \cdot)$; both phases are put together in the code sketch after this list.

    1. $Q = \text{RandomSample}(D \setminus S, N_q)$

    2. for each query point $(x, k)$ in $Q$:

      1. $p_{\theta}(y = k \mid x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}$

        • $J(\theta) = -\log p_{\theta}(y = k \mid x)$

        • $\theta \leftarrow \theta - \alpha \nabla_{\theta} J(\theta)$
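Putting the two phases together, here is a sketch of one training episode in PyTorch. The in-memory dataset `data`, the MLP stand-in for $f_{\theta}$, and the episode sizes `n_way`, `n_support`, `n_query` are assumptions made up for illustration; only the prototype, softmax, loss, and SGD update steps come from the algorithm above.

```python
# Sketch of one training episode: Phase 1 (prototypes from the support set)
# and Phase 2 (loss on the query set, gradient step on theta).
import torch
import torch.nn.functional as F

D, M = 784, 64
data = torch.randn(20, 30, D)                   # hypothetical dataset: (classes, examples, D)
embed = torch.nn.Sequential(                    # stand-in for f_theta
    torch.nn.Linear(D, 256), torch.nn.ReLU(), torch.nn.Linear(256, M)
)
optimizer = torch.optim.SGD(embed.parameters(), lr=1e-2)

def train_episode(n_way=5, n_support=5, n_query=15):
    # RandomSample: pick n_way classes, then split their examples into S and Q
    classes = torch.randperm(data.size(0))[:n_way]
    perm = torch.randperm(data.size(1))
    support = data[classes][:, perm[:n_support]]                 # (n_way, n_support, D)
    query = data[classes][:, perm[n_support:n_support + n_query]]

    # Phase 1: prototypes c_k = mean of the embedded support points
    prototypes = embed(support.reshape(-1, D)).reshape(n_way, n_support, M).mean(dim=1)

    # Phase 2: J(theta) over the query set, then an SGD step on theta
    q_emb = embed(query.reshape(-1, D))                          # (n_way * n_query, M)
    labels = torch.arange(n_way).repeat_interleave(n_query)
    log_p = F.log_softmax(-torch.cdist(q_emb, prototypes), dim=1)
    loss = F.nll_loss(log_p, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                             # theta <- theta - lr * grad J
    return loss.item()

for episode in range(5):
    print(train_episode())
```

Using `log_softmax` over the negative distances followed by `nll_loss` computes exactly $J(\theta) = -\log p_{\theta}(y = k \mid x)$, averaged over the query points, in a numerically stable way.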

The experimental section of the paper shows that performing meta-level and base-level learning together has a positive effect on the optimization of the two learners. The analysis of these results also suggests that MAML can converge in fewer steps because it avoids overfitting and takes the distribution and representation across tasks into account.

Check out the code I made!
