Understanding Neural Network Foundations: Perceptron, ADALINE and MLP

The perceptron and ADALINE are often considered prototypes of deep learning; understanding them is important for understanding the architecture of neural networks.

Perceptron

The perceptron (or binary classifier) is an algorithm that classifies data into two categories by determining which class an input belongs to. Unlike regression models, which focus on prediction and analysis, the perceptron specializes in classifying inputs into binary classes represented as $0$ and $1$.

History of Perceptron

The perceptron builds on the artificial neuron model proposed by Warren McCulloch and Walter Pitts in 1943 and was invented by Frank Rosenblatt in 1957. It was first implemented in software on the IBM 704 and later as dedicated hardware, the Mark I Perceptron, a machine designed to classify images.

Formulation and Learning Rule

The perceptron learning rule does not use partial derivatives; instead, it relies on the data being linearly separable.

$$f(z) = f[w(t) \cdot x_j]$$
  1. Initialize the weights to $0$ or small random values.

  2. Calculate the outputs: $y_j(t) = f[w(t) \cdot x_j]$

  3. Update the weights: $w_i(t + 1) = w_i(t) + r \cdot (d_j - y_j(t)) \cdot x_{j,i}$, where $r$ is the learning rate and $d_j$ is the desired output (a minimal implementation sketch follows this list).
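
A minimal NumPy sketch of this learning rule, assuming a step activation and a bias folded into the weight vector via a constant input of $1$; the function name and hyperparameters are illustrative, not from the original text:

```python
import numpy as np

def train_perceptron(X, d, r=0.1, epochs=50):
    """X: (n_samples, n_features) inputs; d: (n_samples,) desired outputs in {0, 1}."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # constant 1 so the bias is learned as w[0]
    w = np.zeros(X.shape[1])                       # step 1: initialize weights to 0

    for _ in range(epochs):
        for x_j, d_j in zip(X, d):
            y_j = 1 if np.dot(w, x_j) >= 0 else 0  # step 2: threshold (step) activation f
            w += r * (d_j - y_j) * x_j             # step 3: perceptron update rule
    return w

# Example: the AND function, which is linearly separable, so the rule converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
print(train_perceptron(X, d))
```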

Comparing Perceptron and Logistic Regression

The perceptron is often confused with logistic regression. Though they share similarities, they are entirely different concepts.

  • Perceptron outputs a hard class label based on a threshold, while logistic regression outputs a probability using the sigmoid function (compare the two output functions in the snippet after this list).

  • Logistic regression is a probabilistic model, while the perceptron is deterministic.

  • The perceptron uses its error-driven update rule (equivalent to minimizing a hinge-like loss with zero margin), while logistic regression minimizes cross-entropy with gradient descent.

  • The perceptron converges only when data is linearly separable, while logistic regression always converges.
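
To make the first difference concrete, here is a small illustrative comparison of the two output functions; the weight vector `w` and input `x` are assumed to be NumPy arrays of the same length:

```python
import numpy as np

def perceptron_output(w, x):
    # Hard class label in {0, 1}, decided by a threshold on the weighted sum.
    return 1 if np.dot(w, x) >= 0 else 0

def logistic_output(w, x):
    # Probability in (0, 1), produced by the sigmoid of the weighted sum.
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))
```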

ADALINE

ADALINE (Adaptive Linear Neuron, or Adaptive Linear Element) is an enhanced version of the perceptron. Unlike the perceptron, it updates its weights from the linear (pre-threshold) output rather than the thresholded class label. This structure serves as a prototype for artificial neural networks.

Formulation and Learning Rule

ADALINE takes multiple inputs and produces a single output, computed as a weighted sum of the inputs plus a bias:

$$y = \sum_{j=0}^{n} x_j w_j + \theta$$
Term Definitions
  • $x$ represents the input vector, while $x_0 = 1$ is the bias input.

  • $y$ represents the model's output.

  • $w$ represents the weights, where $w_0 = 0$ is used for the local bias.

  • $n$ is the number of inputs in the dataset.

  • $\theta$ represents the global bias constant.

  • The least mean square error is calculated as $E = (o - y)^2$.

  • Update the weights: $w \leftarrow w + \eta(o - y) \cdot x$ (a minimal training sketch follows this list).

    • $\eta$ represents the learning rate.

    • $o$ represents the target output value.
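
A minimal NumPy sketch of this least-mean-square (Widrow–Hoff) update under the notation above; for simplicity the global bias $\theta$ is folded into $w_0$ via the constant input $x_0 = 1$, and the function name and defaults are illustrative:

```python
import numpy as np

def train_adaline(X, o, eta=0.01, epochs=100):
    """X: (n_samples, n_features) inputs; o: (n_samples,) target output values."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # x_0 = 1, so w[0] acts as the bias
    w = np.zeros(X.shape[1])

    for _ in range(epochs):
        for x_j, o_j in zip(X, o):
            y = np.dot(w, x_j)             # linear output y = sum_j x_j * w_j
            w += eta * (o_j - y) * x_j     # LMS update: w <- w + eta * (o - y) * x
    return w
```

Unlike the perceptron, the update uses the linear output before any thresholding, which is what makes the rule a minimization of the squared error $E = (o - y)^2$.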

MADALINE (Many ADALINE), a variant of ADALINE, connects ADALINE units in a three-layer structure. It is similar to modern neural networks but differs in that each layer uses a different, non-differentiable function, making backpropagation impossible.

MLP, Multi-Layer Perceptron

A Multi-Layer Perceptron (MLP, or fully connected artificial neural network) consists of at least three layers (input, hidden, and output) of perceptron-like units with non-linear activation functions. Its primary purpose is to classify inputs that are not linearly separable.

  • Every perceptron in an MLP is fully connected to the next layer of perceptrons, giving each perceptron multidimensional weights $w_{i,j}$.

  • These perceptrons function as signal processing units, similar to neurons in the human brain, which is why they are called neurons.

  • Every neuron has an activation function that maps its scalar response to a non-linear output range. This is a crucial concept in MLPs that enables their functionality and improves their performance.

    • In most implementations, MLPs use the hyperbolic tangent or sigmoid function as their activation function; their standard definitions are given below.
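
For reference, these two activation functions are commonly defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The sigmoid maps responses to $(0, 1)$, while the hyperbolic tangent maps them to $(-1, 1)$.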

It is considered a direct prototype of modern neural networks and is sometimes called a “vanilla neural network”; unlike the single perceptron, it propagates its inputs forward through multiple layers.

Formulation and Learning Rule

The learning rule is based on the concept of a neuron: each neuron's weights are updated by calculating partial derivatives of the cost/loss function with respect to them. This is a fundamental mechanism of modern deep learning (backpropagation).

$$v_j^{(l)}(n) = \sum_{i} w_{ji}^{(l)}(n) \cdot y_i^{(l-1)}(n) + b_j^{(l)}(n), \qquad y_j^{(l)}(n) = \phi\left(v_j^{(l)}(n)\right)$$
where $v_j^{(l)}(n)$ is the weighted input to neuron $j$ in layer $l$ and $y_j^{(l)}(n)$ is its output.
  • Calculate the error at the output layer using $e_j(n) = d_j(n) - y_j(n)$, where $d_j(n)$ represents the desired output of output neuron $j$.

    • $\therefore \epsilon(n) = \frac{1}{2} \sum_j e_j^2(n)$

  • Update the weights using the gradient descent rule: $\Delta w_{ji}(n) = -\eta \cdot \frac{\partial \epsilon(n)}{\partial v_j(n)} \cdot y_i(n)$.

    • $e_j(n)$ represents the error at output neuron $j$, as defined above.

    • $y_i(n)$ represents the output of neuron $i$ in the previous layer.

    • $w_{ji}$ represents the weight connecting neuron $i$ to neuron $j$, while $\eta$ is the learning rate.

    • $\frac{\partial \epsilon(n)}{\partial v_j(n)}$ is the partial derivative of $\epsilon$ with respect to the weighted input $v_j$. For an output neuron, $-\frac{\partial \epsilon(n)}{\partial v_j(n)} = e_j(n) \cdot \phi'(v_j(n))$; for a hidden neuron, $-\frac{\partial \epsilon(n)}{\partial v_j(n)} = \phi'(v_j(n)) \cdot \sum_{k} -\frac{\partial \epsilon(n)}{\partial v_k(n)} \cdot w_{kj}(n)$, where $\phi$ represents the activation function and $k$ ranges over the neurons in the next layer.

Most commonly, the mean squared error serves as $\epsilon$, the cost/loss function. A minimal end-to-end sketch of these equations follows.
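
The sketch below trains a one-hidden-layer MLP on XOR with the sigmoid activation and mean squared error, following the forward and backward equations above; the shapes, function names, and hyperparameters are illustrative assumptions and may need tuning:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def train_mlp(X, D, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    """X: (n_samples, n_in) inputs; D: (n_samples, n_out) desired outputs."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], D.shape[1]
    W1 = rng.normal(scale=0.5, size=(n_hidden, n_in));  b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_out, n_hidden)); b2 = np.zeros(n_out)

    for _ in range(epochs):
        for x, d in zip(X, D):
            # Forward pass: v = W y_prev + b, y = phi(v)
            v1 = W1 @ x + b1;  y1 = sigmoid(v1)
            v2 = W2 @ y1 + b2; y2 = sigmoid(v2)

            # Backward pass: local gradients delta_j = -d eps(n) / d v_j(n)
            e = d - y2                                    # e_j(n) = d_j(n) - y_j(n)
            delta2 = e * sigmoid_prime(v2)                # output neurons
            delta1 = sigmoid_prime(v1) * (W2.T @ delta2)  # hidden neurons

            # Gradient descent updates: Delta w_ji = eta * delta_j * y_i
            W2 += eta * np.outer(delta2, y1);  b2 += eta * delta2
            W1 += eta * np.outer(delta1, x);   b1 += eta * delta1
    return W1, b1, W2, b2

# Usage: XOR is not linearly separable, so a single perceptron cannot learn it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_mlp(X, D)
for x in X:
    print(x, sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))
```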
