Prototypical Networks for Few-shot Learning

The Prototypical Network is an algorithm for few-shot learning, which enables effective learning from limited data. It specializes in image classification and has shown strong performance in few-shot scenarios.

Introduction

It addresses the overfitting problem in few-shot learning scenarios. When working with extremely limited data, a classifier needs a strong inductive bias. The network provides one by clustering the examples of each class in an embedding space around a single "prototype" that represents the class.

Formulation and Problem Set-up

  • The support set $S_k$ for each class $k$ can be represented as $S_k = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where:

    • $x_i \in \mathbb{R}^D$

    • $y_i \in \{1, \dots, K\}$

    • $k$ is the index of a class

  1. The Prototypical Network computes a prototype $c_k \in \mathbb{R}^M$ for each class in an $M$-dimensional embedding space. An embedding function $f_{\theta}: \mathbb{R}^D \rightarrow \mathbb{R}^M$ maps each support point into this space, and the prototype is the mean of the embedded support points: $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_{\theta}(x_i)$.

  2. A query point $x$ is classified by its distance to each class prototype, using a distance function $d: \mathbb{R}^M \times \mathbb{R}^M \rightarrow [0, +\infty)$ such as the Euclidean distance $d(x, y) = \sqrt{\sum_{i=1}^{M} (x_i - y_i)^2}$, followed by a softmax over the negative distances: $p_{\theta}(y = k \mid x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}$.

  3. Learning minimizes the negative log-probability of the true class, $J(\theta) = -\log p_{\theta}(y = k \mid x)$, via SGD. The dataset for one episode consists of a randomly selected subset of classes and a few random samples from each class as the support set; the remaining samples are used as query points. These steps are sketched in code below.
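To make the formulation concrete, here is a minimal sketch in PyTorch (the framework choice is an assumption; the write-up does not fix one). Random tensors stand in for the embeddings $f_{\theta}(x)$, and the sizes $K$, $N$, $Q$, $M$ are made up for illustration.

```python
# Minimal sketch of the formulation above: prototypes, distance-based softmax,
# and the negative log-probability loss. Embeddings are random stand-ins.
import torch
import torch.nn.functional as F

K, N, Q, M = 5, 5, 15, 64                      # classes, shots, queries, embedding dim

support = torch.randn(K, N, M)                 # f_theta(x_i) for each support set S_k
queries = torch.randn(K * Q, M)                # f_theta(x) for the query points
labels = torch.arange(K).repeat_interleave(Q)  # assumed true class k of each query

# Prototype c_k = mean of the embedded support points of class k
prototypes = support.mean(dim=1)               # (K, M)

# Euclidean distances d(f_theta(x), c_k) between every query and every prototype
dists = torch.cdist(queries, prototypes)       # (K*Q, K)

# Softmax over negative distances gives log p_theta(y = k | x)
log_p = F.log_softmax(-dists, dim=1)

# J(theta) = -log p_theta(y = k | x), averaged over the query points
loss = F.nll_loss(log_p, labels)
print(loss.item())
```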

Algorithm

The training algorithm for the model $f_{\theta}(x)$ can be described in two learning phases. Phase 1 involves calculating the prototype $c_k$ using the support set. In Phase 2, the loss $J$ is computed using the query set, and the weights $\theta$ of the embedding network are updated.
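As a point of reference, here is a hedged sketch of what the embedding network $f_{\theta}$ might look like. Nothing above fixes an architecture; the small convolutional encoder below (the name `EmbeddingNet`, the four conv blocks, and the 28×28 grayscale input size) is an assumption chosen only for illustration.

```python
# Hypothetical embedding network f_theta: R^D -> R^M. Any differentiable
# encoder that outputs an M-dimensional vector would work here.
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class EmbeddingNet(nn.Module):
    """Maps an input x in R^D to an embedding f_theta(x) in R^M."""
    def __init__(self, in_channels: int = 1, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            conv_block(in_channels, hidden),
            conv_block(hidden, hidden),
            conv_block(hidden, hidden),
            conv_block(hidden, hidden),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x).flatten(start_dim=1)  # (batch, M)

# e.g. 28x28 grayscale inputs are pooled down to 1x1, so M = 64
f_theta = EmbeddingNet()
print(f_theta(torch.randn(8, 1, 28, 28)).shape)  # torch.Size([8, 64])
```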

  • Softmax: $p_{\theta}(y = k \mid x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}$

  • Negative log-probability: $J(\theta) = -\log p_{\theta}(y = k \mid x)$

  1. Create the support set and calculate the prototype $c_k$.

    1. $S = \text{RandomSample}(D, N_k)$

    2. $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_{\theta}(x_i)$

  2. Create the query set and update the weights based on the distance function $d(\cdot, \cdot)$; both phases are put together in the code sketch after this list.

    1. $Q = \text{RandomSample}(D \setminus S, N_q)$

    2. for each query point $(x, k)$ in $Q$:

      1. $p_{\theta}(y = k \mid x) = \frac{\exp(-d(f_{\theta}(x), c_k))}{\sum_{k'} \exp(-d(f_{\theta}(x), c_{k'}))}$

        • $J(\theta) = -\log p_{\theta}(y = k \mid x)$

        • $\theta \leftarrow \theta - \alpha \nabla_{\theta} J(\theta)$
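Putting the two phases together, here is a sketch of one training episode in PyTorch. The in-memory dataset `data`, the MLP stand-in for $f_{\theta}$, and the episode sizes `n_way`, `n_support`, `n_query` are assumptions made up for illustration; only the prototype, softmax, loss, and SGD update steps come from the algorithm above.

```python
# Sketch of one training episode: Phase 1 (prototypes from the support set)
# and Phase 2 (loss on the query set, gradient step on theta).
import torch
import torch.nn.functional as F

D, M = 784, 64
data = torch.randn(20, 30, D)                   # hypothetical dataset: (classes, examples, D)
embed = torch.nn.Sequential(                    # stand-in for f_theta
    torch.nn.Linear(D, 256), torch.nn.ReLU(), torch.nn.Linear(256, M)
)
optimizer = torch.optim.SGD(embed.parameters(), lr=1e-2)

def train_episode(n_way=5, n_support=5, n_query=15):
    # RandomSample: pick n_way classes, then split their examples into S and Q
    classes = torch.randperm(data.size(0))[:n_way]
    perm = torch.randperm(data.size(1))
    support = data[classes][:, perm[:n_support]]                 # (n_way, n_support, D)
    query = data[classes][:, perm[n_support:n_support + n_query]]

    # Phase 1: prototypes c_k = mean of the embedded support points
    prototypes = embed(support.reshape(-1, D)).reshape(n_way, n_support, M).mean(dim=1)

    # Phase 2: J(theta) over the query set, then an SGD step on theta
    q_emb = embed(query.reshape(-1, D))                          # (n_way * n_query, M)
    labels = torch.arange(n_way).repeat_interleave(n_query)
    log_p = F.log_softmax(-torch.cdist(q_emb, prototypes), dim=1)
    loss = F.nll_loss(log_p, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                             # theta <- theta - lr * grad J
    return loss.item()

for episode in range(5):
    print(train_episode())
```

Using `log_softmax` over the negative distances followed by `nll_loss` computes exactly $J(\theta) = -\log p_{\theta}(y = k \mid x)$, averaged over the query points, in a numerically stable way.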

The experimental section of the paper shows that performing meta-level and base-level learning together has a positive effect on the optimization of the two learners. The analysis of these results also suggests that MAML can converge in fewer steps because it avoids overfitting and takes the distribution and representation across tasks into account.

Check out the code I made!
