Understanding Linear and Logistic Regression: Core Machine Learning Concepts

Linear Regression and Logistic Regression are fundamental concepts in machine learning that serve as baseline solutions for regression and classification problems. These concepts, along with techniques like GDR (Gradient Descent Rule) and non-linear weight optimization, now form the foundation of deep learning.

These simple machine learning algorithms remain standard components of complex deep learning architectures and continue to be relevant today.

Regression is Mathematical Modeling

A mathematical model is an approach to explain real-world phenomena mathematically. It helps humans understand patterns, predict future outcomes, and make informed decisions. You can think of it like a machine: input $x$ represents the domain of the function, while $y$ represents the output.

Linear Regression

Linear regression is a type of modeling that describes the relationship between explanatory variables and a scalar response using a linear approach called a "linear model". A key restriction is that the conditional mean of the response must be expressible as an affine function of the model parameters. The most common fitting algorithms for linear regression are least squares and Newton's method.

In situations where the model does not properly fit the data, we call it "lack of fit" (LOF), which has motivated many optimization techniques and much research.

The Variants of Linear Regression

Linear regression models fall into two distinct categories based on their purpose:

  • If the model is used for understanding and analyzing the relationship between explanatory variables and dependent variables, it is called regression analysis.

  • If the model is used for prediction and forecasting, it is called a predictive model.

These models can also be classified by their mathematical attributes:

  • Simple Linear Regression: A model with a single explanatory variable.

  • Multiple Linear Regression: A model with two or more explanatory variables.

  • Multivariate Linear Regression: A model with multiple dependent variables.

Formulation

$$y_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \epsilon_i$$

  • $x, y$ represent vectors of observations, which can be multi-dimensional matrices.

  • $\beta$ represents the model parameters, which have dimension $p + 1$.

  • $\epsilon$ represents the possible error.
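
As a concrete illustration, here is a minimal sketch of fitting such a model with ordinary least squares, assuming NumPy and a small synthetic dataset (the coefficients and variable names are illustrative, not from the text):

```python
import numpy as np

# Synthetic data following y_i = b0 + b1*x_{i,1} + b2*x_{i,2} + eps_i (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                  # explanatory variables
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Prepend a column of ones so beta holds all p + 1 parameters (beta_0 is the intercept)
X_design = np.hstack([np.ones((100, 1)), X])

# Ordinary least squares minimizes the squared error epsilon
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # approximately [2, 3, -1]
```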

Key Concepts and Limitations

While deep learning and other advanced machine learning methods have largely superseded linear regression, it remains more cost-effective in certain cases.

  • Exogeneity means that the explanatory variables are not correlated with the model's error.

    • Strict Exogeneity: the model maintains exogeneity over an extended period.

    • Weak Exogeneity: the model only maintains exogeneity over the current period.

    • Deterministic: the model maintains exogeneity for past periods but not for current and future periods.

  • Linearity means the relationship between the parameters and the explanatory variables can be expressed as a linear combination.

  • Constant Variance means the spread of the model's error stays the same regardless of the predicted value. For example, if the model predicts an individual's income as 1000, their actual income might range from 800~1200, and that spread should stay similar even for much larger predictions.

  • Independence of Errors means that errors are not correlated with each other. This is one of the major limitations of linear regression, though it can be addressed through data regularization or Bayesian linear regression.

Understanding the Learning Rule to Fit the Model Using GDR (Gradient Descent Rule)

GDR (Gradient Descent Rule) is a learning rule and optimization technique for linear regression that helps fit the model to the problem. It minimizes the Cost Function by updating weights. This approach has become the fundamental workflow for optimization in modern machine learning and deep learning.

  • Initialize the weights $\theta$ to $0$ or to random values.

  • Measure how well the model matches real-world observations using the cost function $J(\theta)$.

  • Until $J(\theta)$ is minimized, keep computing $w' = w - \alpha \cdot \nabla J(w)$, where $w'$ is the newly updated weight and $w$ is the previous weight.
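
A minimal sketch of this loop for linear regression with a mean-squared-error cost, assuming NumPy (the function and variable names are illustrative):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Fit linear-regression weights by repeating w' = w - alpha * grad J(w)."""
    m, n = X.shape
    w = np.zeros(n)                      # initialize weights to 0 (random values also work)
    for _ in range(n_iters):
        error = X @ w - y                # difference between model and real observations
        grad = (X.T @ error) / m         # gradient of the MSE cost J(w)
        w = w - alpha * grad             # update rule: w' = w - alpha * grad J(w)
    return w
```

In practice the loop usually stops early once $J(\theta)$ changes by less than a small tolerance, rather than running a fixed number of iterations.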

Newton’s Method, the Legacy Optimization Technique

Newton's Method is an optimization technique based on the idea of the tangent line. Although modern ML rarely uses it anymore, some statistical systems still use it to fit a cost function $f(x)$:

  • The slope of the tangent is $f'(x_n) = \frac{f(x_n) - 0}{x_n - x_{n+1}}$, where $x_{n+1}$ is the point at which the tangent line at $x_n$ intersects the $x$-axis.

    • $\therefore x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$

  • Keep calculating $f'(x_n)$ and moving to $x_{n+1}$ until the cost function is minimized.
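
A short sketch of this iteration, assuming the derivative is supplied directly (the names and example function are illustrative). For minimization, the same iteration is applied to the derivative of the cost, i.e. we look for the point where it crosses zero:

```python
def newtons_method(f, f_prime, x0, tol=1e-8, max_iter=100):
    """Follow tangent lines x_{n+1} = x_n - f(x_n)/f'(x_n) until f(x) is (near) zero."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / f_prime(x)         # tangent-line step
        x = x - step
        if abs(step) < tol:              # stop once the updates become negligible
            break
    return x

# Example: minimize g(x) = (x - 3)^2 by finding the root of g'(x) = 2(x - 3)
print(newtons_method(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0))  # ~3.0
```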

Logistic Regression

A Logistic Model (or Logit Model) is a statistical model that predicts the log-odds of an event as a linear combination of variables. The most common loss measurement is Cross-Entropy Loss (or Log Loss), which differs from linear least squares but plays the same role that ordinary least squares plays in linear regression.

Formulation

$$h(x_i) = \sigma(z), \quad z = \theta_0 + \theta_1 x_{i,1} + \dots + \theta_n x_{i,n} + \epsilon_i, \quad \text{where } \sigma(z) = \frac{1}{1 + e^{-z}}$$

  • The input $x$ is called the feature vector, while the output $h(x)$ is called the label.

  • $z$ represents the linear combination of inputs and weights.

  • While $z$ can be any real number, $\sigma$ (the Sigmoid Function) maps it into the probability space $(0, 1)$.

The Sigmoid/Logistic Function as an Activation Function

An Activation Function is a mathematical function applied to a model's output. Its main purposes are adding non-linearity to the model and constraining the output range to help make better decisions; most image recognition and NLP models cannot work without one.

$$\text{sigmoid function: } \sigma(x) = \frac{1}{1 + e^{-x}}$$

While it introduces non-linearity into the model, an activation function must also be differentiable so that gradients can be calculated.
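
A small illustrative sketch of the sigmoid and its derivative (the closed-form derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ is what keeps gradient calculation cheap):

```python
import numpy as np

def sigmoid(x):
    """Map any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x)), used when computing gradients."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))   # 0.5
print(sigmoid(10.0))  # close to 1
```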

Decision Boundary is where the model changes its prediction. There are several types:

  • A point for a single feature $x$

  • A line for two features $x$

  • A hyperplane for higher-dimensional $x$
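
For example, with a single feature the boundary is the point where $z = \theta_0 + \theta_1 x = 0$, i.e. where $\sigma(z) = 0.5$. A hypothetical sketch (the parameter values are made up):

```python
# Hypothetical single-feature model: z = theta0 + theta1 * x
theta0, theta1 = -2.0, 0.5

# The decision boundary is where z = 0, i.e. sigma(z) = 0.5
boundary = -theta0 / theta1
print(boundary)  # 4.0: predict class 1 for x > 4, class 0 for x < 4
```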

Cross-entropy/Log Loss

Cross-Entropy Loss fits or evaluates the parameters $\theta$ via the log-likelihood, which differs slightly from least squares. It ensures convexity during gradient descent and penalizes wrong predictions more heavily when the model is "confident but wrong".

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h(x^{(i)}) + (1 - y^{(i)})\log\big(1 - h(x^{(i)})\big)\Big]$$

  • To minimize $J(\theta)$, update the weights using the gradient: $\theta_j := \theta_j - \alpha\frac{\partial J(\theta)}{\partial \theta_j}$

    • Where the gradient is: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$

  • Vectorized update rule from the above: $\theta := \theta - \frac{\alpha}{m} \cdot X^{\top}\big(h(X) - y\big)$, where $X$ stacks the feature vectors $x^{(i)}$ as rows (see the sketch below).
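
Putting the pieces together, a minimal sketch of logistic regression trained with the vectorized update above, assuming NumPy and an illustrative synthetic dataset (all names and values are made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(h, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = np.clip(h, 1e-12, 1 - 1e-12)           # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def fit_logistic(X, y, alpha=0.1, n_iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                  # h(x) = sigma(theta^T x)
        theta -= (alpha / m) * (X.T @ (h - y))  # theta := theta - (alpha/m) * X^T (h - y)
    return theta

# Illustrative data: label is 1 when the single feature is positive
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 1))
X = np.hstack([np.ones((200, 1)), x])           # prepend a bias column for theta_0
y = (x[:, 0] > 0).astype(float)

theta = fit_logistic(X, y)
print(theta, cross_entropy(sigmoid(X @ theta), y))
```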
