Understanding Neural Network Foundations: Perceptron, ADALINE and MLP

The perceptron and ADALINE are often considered prototypes of deep learning; understanding them is important for understanding the architecture of neural networks.

Perceptron

The perceptron (or binary classifier) is an algorithm that classifies data into two categories by determining which class an input belongs to. Unlike regression models, which focus on prediction and analysis, the perceptron specializes in classifying inputs into binary classes represented as $0$ and $1$.

History of Perceptron

The perceptron builds on the artificial neuron model proposed by Warren McCulloch and Walter Pitts in 1943 and was invented by Frank Rosenblatt in 1957. It was first implemented in software on the IBM 704 and later as dedicated hardware, the Mark I Perceptron, a machine designed to classify images.

Formulation and Learning Rule

The perceptron learning rule does not use partial derivatives; instead, it relies on the data being linearly separable.

$$f(z) = f[w(t) \cdot x_j]$$
  1. Initialize the weights to $0$ or small random values.

  2. Calculate the outputs: $y_j(t) = f[w(t) \cdot x_j]$

  3. Update the weights: $w_i(t + 1) = w_i(t) + r \cdot (d_j - y_j(t)) \cdot x_{j,i}$, where $r$ is the learning rate and $d_j$ is the desired output (a minimal implementation sketch follows this list).
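
A minimal NumPy sketch of this learning rule, assuming a step activation and a bias folded into the weight vector via a constant input of $1$; the function name and hyperparameters are illustrative, not from the original text:

```python
import numpy as np

def train_perceptron(X, d, r=0.1, epochs=50):
    """X: (n_samples, n_features) inputs; d: (n_samples,) desired outputs in {0, 1}."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # constant 1 so the bias is learned as w[0]
    w = np.zeros(X.shape[1])                       # step 1: initialize weights to 0

    for _ in range(epochs):
        for x_j, d_j in zip(X, d):
            y_j = 1 if np.dot(w, x_j) >= 0 else 0  # step 2: threshold (step) activation f
            w += r * (d_j - y_j) * x_j             # step 3: perceptron update rule
    return w

# Example: the AND function, which is linearly separable, so the rule converges.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
d = np.array([0, 0, 0, 1])
print(train_perceptron(X, d))
```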

Comparing Perceptron and Logistic Regression

The perceptron is often confused with logistic regression. Though they share similarities, they are entirely different concepts.

  • Perceptron outputs a hard class label based on a threshold, while logistic regression outputs a probability using the sigmoid function (compare the two output functions in the snippet after this list).

  • Logistic regression is a probabilistic model, while the perceptron is deterministic.

  • The perceptron uses its error-driven update rule (equivalent to minimizing a hinge-like loss with zero margin), while logistic regression minimizes cross-entropy with gradient descent.

  • The perceptron converges only when data is linearly separable, while logistic regression always converges.
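
To make the first difference concrete, here is a small illustrative comparison of the two output functions; the weight vector `w` and input `x` are assumed to be NumPy arrays of the same length:

```python
import numpy as np

def perceptron_output(w, x):
    # Hard class label in {0, 1}, decided by a threshold on the weighted sum.
    return 1 if np.dot(w, x) >= 0 else 0

def logistic_output(w, x):
    # Probability in (0, 1), produced by the sigmoid of the weighted sum.
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))
```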

ADALINE

ADALINE (Adaptive Linear Neuron, or Adaptive Linear Element) is an enhanced version of the perceptron. Unlike the perceptron, it updates its weights from the linear (pre-threshold) output rather than the thresholded class label. This structure serves as a prototype for artificial neural networks.

Formulation and Learning Rule

ADALINE takes multiple inputs and produces a single output, computed as a weighted sum of the inputs plus a bias:

$$y = \sum_{j=0}^{n} x_j w_j + \theta$$
Term Definitions
  • $x$ represents the input vector, while $x_0 = 1$ is the bias input.

  • $y$ represents the model's output.

  • $w$ represents the weights, where $w_0 = 0$ is used for the local bias.

  • $n$ is the number of inputs in the dataset.

  • $\theta$ represents the global bias constant.

  • The least mean square error is calculated as $E = (o - y)^2$.

  • Update the weights: $w \leftarrow w + \eta(o - y) \cdot x$ (a minimal training sketch follows this list).

    • $\eta$ represents the learning rate.

    • $o$ represents the target output value.
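
A minimal NumPy sketch of this least-mean-square (Widrow–Hoff) update under the notation above; for simplicity the global bias $\theta$ is folded into $w_0$ via the constant input $x_0 = 1$, and the function name and defaults are illustrative:

```python
import numpy as np

def train_adaline(X, o, eta=0.01, epochs=100):
    """X: (n_samples, n_features) inputs; o: (n_samples,) target output values."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # x_0 = 1, so w[0] acts as the bias
    w = np.zeros(X.shape[1])

    for _ in range(epochs):
        for x_j, o_j in zip(X, o):
            y = np.dot(w, x_j)             # linear output y = sum_j x_j * w_j
            w += eta * (o_j - y) * x_j     # LMS update: w <- w + eta * (o - y) * x
    return w
```

Unlike the perceptron, the update uses the linear output before any thresholding, which is what makes the rule a minimization of the squared error $E = (o - y)^2$.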

MADALINE (Many ADALINE), a variant of ADALINE, connects ADALINE units in a three-layer structure. It is similar to modern neural networks but differs in that each layer uses a different, non-differentiable function, making backpropagation impossible.

MLP, Multi-Layer Perceptron

A Multi-Layer Perceptron (MLP, or fully connected artificial neural network) consists of at least three layers (input, hidden, and output) of perceptron-like units with non-linear activation functions. Its primary purpose is to classify inputs that are not linearly separable.

  • Every perceptron in an MLP is fully connected to the next layer of perceptrons, giving each perceptron multidimensional weights $w_{i,j}$.

  • These perceptrons function as signal processing units, similar to neurons in the human brain, which is why they are called neurons.

  • Every neuron has an activation function that maps its scalar response to a non-linear output range. This is a crucial concept in MLPs that enables their functionality and improves their performance.

    • In most implementations, MLPs use the hyperbolic tangent or sigmoid function as their activation function; their standard definitions are given below.
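
For reference, these two activation functions are commonly defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The sigmoid maps responses to $(0, 1)$, while the hyperbolic tangent maps them to $(-1, 1)$.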

It is considered a direct prototype of modern neural networks and is sometimes called a “vanilla neural network”; unlike the single perceptron, it propagates its inputs forward through multiple layers.

Formulation and Learning Rule

The learning rule is based on the concept of a neuron: each neuron's weights are updated by calculating partial derivatives of the cost/loss function with respect to them. This is a fundamental mechanism of modern deep learning (backpropagation).

$$v_j^{(l)}(n) = \sum_{i} w_{ji}^{(l)}(n) \cdot y_i^{(l-1)}(n) + b_j^{(l)}(n), \qquad y_j^{(l)}(n) = \phi\left(v_j^{(l)}(n)\right)$$
where $v_j^{(l)}(n)$ is the weighted input to neuron $j$ in layer $l$ and $y_j^{(l)}(n)$ is its output.
  • Calculate the error at the output layer using $e_j(n) = d_j(n) - y_j(n)$, where $d_j(n)$ represents the desired output of output neuron $j$.

    • $\therefore \epsilon(n) = \frac{1}{2} \sum_j e_j^2(n)$

  • Update the weights using the gradient descent rule: $\Delta w_{ji}(n) = -\eta \cdot \frac{\partial \epsilon(n)}{\partial v_j(n)} \cdot y_i(n)$.

    • $e_j(n)$ represents the error at output neuron $j$, as defined above.

    • $y_i(n)$ represents the output of neuron $i$ in the previous layer.

    • $w_{ji}$ represents the weight connecting neuron $i$ to neuron $j$, while $\eta$ is the learning rate.

    • $\frac{\partial \epsilon(n)}{\partial v_j(n)}$ is the partial derivative of $\epsilon$ with respect to the weighted input $v_j$. For an output neuron, $-\frac{\partial \epsilon(n)}{\partial v_j(n)} = e_j(n) \cdot \phi'(v_j(n))$; for a hidden neuron, $-\frac{\partial \epsilon(n)}{\partial v_j(n)} = \phi'(v_j(n)) \cdot \sum_{k} -\frac{\partial \epsilon(n)}{\partial v_k(n)} \cdot w_{kj}(n)$, where $\phi$ represents the activation function and $k$ ranges over the neurons in the next layer.

Most commonly, the mean squared error serves as $\epsilon$, the cost/loss function. A minimal end-to-end sketch of these equations follows.
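
The sketch below trains a one-hidden-layer MLP on XOR with the sigmoid activation and mean squared error, following the forward and backward equations above; the shapes, function names, and hyperparameters are illustrative assumptions and may need tuning:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def train_mlp(X, D, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    """X: (n_samples, n_in) inputs; D: (n_samples, n_out) desired outputs."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], D.shape[1]
    W1 = rng.normal(scale=0.5, size=(n_hidden, n_in));  b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_out, n_hidden)); b2 = np.zeros(n_out)

    for _ in range(epochs):
        for x, d in zip(X, D):
            # Forward pass: v = W y_prev + b, y = phi(v)
            v1 = W1 @ x + b1;  y1 = sigmoid(v1)
            v2 = W2 @ y1 + b2; y2 = sigmoid(v2)

            # Backward pass: local gradients delta_j = -d eps(n) / d v_j(n)
            e = d - y2                                    # e_j(n) = d_j(n) - y_j(n)
            delta2 = e * sigmoid_prime(v2)                # output neurons
            delta1 = sigmoid_prime(v1) * (W2.T @ delta2)  # hidden neurons

            # Gradient descent updates: Delta w_ji = eta * delta_j * y_i
            W2 += eta * np.outer(delta2, y1);  b2 += eta * delta2
            W1 += eta * np.outer(delta1, x);   b1 += eta * delta1
    return W1, b1, W2, b2

# Usage: XOR is not linearly separable, so a single perceptron cannot learn it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_mlp(X, D)
for x in X:
    print(x, sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))
```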
