
Classification with Perceptron

Perceptron as a Linear Classifier

A perceptron, initially introduced for linear regression, can be adapted for binary classification problems. This adaptation involves modifying the activation function to transform the perceptron's output into a form suitable for classification.

Mathematical Formulation

The weighted sum $z$ in a perceptron with inputs $x_1$ through $x_n$, corresponding weights $w_1$ through $w_n$, and a bias term $b$ is:

$$z = \sum_{i=1}^{n} w_i x_i + b$$

This equation represents the linear combination of inputs and their respective weights, with the bias term added to account for offsets. For classification, the sigmoid function, $\sigma(z)$, is used as the activation function, transforming $z$ into a probability between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This function outputs a value in the range $(0, 1)$, making it suitable for binary classification tasks.
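
As a concrete illustration, here is a minimal NumPy sketch of this forward pass. The input values, weights, and bias below are arbitrary placeholders chosen for the example, not values from the text.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued input to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    """Weighted sum of inputs plus bias, passed through the sigmoid."""
    z = np.dot(w, x) + b          # z = sum_i w_i * x_i + b
    return sigmoid(z)             # probability of the positive class

# Example with arbitrary placeholder values
x = np.array([0.5, -1.2, 3.0])    # inputs x_1 ... x_n
w = np.array([0.8, 0.1, -0.4])    # weights w_1 ... w_n
b = 0.2                           # bias
print(perceptron_forward(x, w, b))  # approximately 0.33
```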

Sigmoid Function

The sigmoid function, denoted as $\sigma(z)$, plays a crucial role in machine learning, especially in logistic regression and neural networks, due to its ability to map any real-valued number into the $(0, 1)$ interval. This property is particularly useful for modeling probabilities.

Definition

The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z \in \mathbb{R}$ is the input to the function.

Properties

  • Domain and Range: The function maps the domain of all real numbers $\mathbb{R}$ to the range $(0, 1)$.
  • Asymptotes: It has horizontal asymptotes at $y = 0$ and $y = 1$, implying that $\sigma(z)$ approaches $0$ as $z \to -\infty$ and $1$ as $z \to \infty$.
  • Symmetry: At $z = 0$, $\sigma(0) = \frac{1}{2}$, and since $\sigma(-z) = 1 - \sigma(z)$, the curve is symmetric about the point $(0, \frac{1}{2})$.
  • Sigmoid of Large Positive and Negative Values: For large positive values of $z$, $\sigma(z)$ approaches $1$, and for large negative values, $\sigma(z)$ approaches $0$, as the numerical check below illustrates.
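
The following short sketch evaluates the sigmoid at a few points to confirm these properties numerically; the sample values are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigma(0) = 0.5
print(sigmoid(0.0))                      # 0.5

# Approaches the asymptotes for large |z|
print(sigmoid(20.0), sigmoid(-20.0))     # ~1.0, ~0.0

# Symmetry about (0, 1/2): sigma(-z) = 1 - sigma(z)
z = 1.7
print(np.isclose(sigmoid(-z), 1.0 - sigmoid(z)))  # True
```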

Derivative of the Sigmoid Function

The derivative of the sigmoid function is significant in machine learning algorithms, particularly in the optimization process. It can be computed using the chain rule of calculus and exhibits a simple form that is computationally efficient.

Calculation

Let's denote the derivative of $\sigma(z)$ with respect to $z$ as $\sigma'(z)$. The calculation proceeds as follows:

  1. Start with the definition $\sigma(z) = (1 + e^{-z})^{-1}$.
  2. Applying the chain rule, we get:
$$\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = -(1 + e^{-z})^{-2} \cdot \left(-e^{-z}\right)$$
  3. Simplifying, we obtain:
$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}$$
  4. By adding and subtracting $1$ in the numerator and rearranging, we find:
$$\sigma'(z) = \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right)$$
  5. Finally, recognizing that the first factor is $\sigma(z)$ and the parenthesized term is $1 - \sigma(z)$, we arrive at the elegant result (verified numerically in the sketch below):
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
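
As a sanity check, the sketch below compares this closed-form derivative against a central finite-difference approximation; the test point and step size are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Closed form: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Central finite-difference approximation of the derivative
z, h = 0.8, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

print(sigmoid_derivative(z))                        # ~0.2139
print(np.isclose(numeric, sigmoid_derivative(z)))   # True
```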

Gradient Descent for Perceptron Training

Gradient descent is employed to minimize the error between the predicted and actual classifications. It adjusts the weights and bias to reduce the loss function, calculated using the log loss for classification:

$$L(y, \hat{y}) = -\left[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right]$$

where $y$ is the actual label, and $\hat{y}$ is the predicted probability (the output of the sigmoid function). The loss function $L(y, \hat{y})$ quantifies the difference between the actual outputs $y$ and the predicted outputs $\hat{y}$. The optimization's goal is to minimize $L$ by adjusting the model parameters, specifically the weights ($w$) and bias ($b$).
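
A minimal implementation of this loss is sketched below. The clipping of predictions slightly away from 0 and 1 is a common numerical safeguard against taking the log of zero, not something mandated by the formula itself.

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy (log loss) for a single example or an array."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # keep log() finite
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

print(log_loss(1, 0.9))   # small loss: confident, correct prediction (~0.105)
print(log_loss(1, 0.1))   # large loss: confident, wrong prediction (~2.303)
```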

To understand how changes in $w$ and $b$ affect $L$, we calculate the partial derivatives of $L$ with respect to these parameters. This involves understanding how $L$ is influenced by $\hat{y}$ and, in turn, how $\hat{y}$ depends on each parameter.

Chain Rule Application

The calculation of $\frac{\partial L}{\partial w_i}$ and $\frac{\partial L}{\partial b}$ involves the application of the chain rule of calculus, expressed as:

  • $\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_i}$
  • $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b}$

The term $\frac{\partial L}{\partial \hat{y}}$ is common across these expressions and is crucial for understanding the gradient's direction and magnitude.

Derivative Calculations

Derivative of $L$ with respect to $\hat{y}$

Given the log loss function, the derivative of $L$ with respect to $\hat{y}$ is calculated as:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$

This expression represents how the loss function gradient depends on the difference between actual and predicted values.
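
This result can be checked by differentiating each term of the log loss with respect to $\hat{y}$, using $\frac{d}{du}\log(u) = \frac{1}{u}$ and the chain rule for the $\log(1 - \hat{y})$ term:

$$\frac{\partial L}{\partial \hat{y}} = -y \cdot \frac{1}{\hat{y}} - (1 - y) \cdot \frac{-1}{1 - \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$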

Derivative of $\hat{y}$ with respect to $w$ and $b$

The predicted output $\hat{y}$ in a classification perceptron is computed by applying the sigmoid activation function to the linear combination of inputs and weights plus the bias, i.e. $\hat{y} = \sigma(z)$ with $z = \sum_{i=1}^{n} w_i x_i + b$. The derivatives of $\hat{y}$ with respect to $w_i$ and $b$ follow from the derivative of the sigmoid function, combined with $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial z}{\partial b} = 1$:

  • $\frac{\partial \hat{y}}{\partial w_i} = \hat{y}(1 - \hat{y})x_i$
  • $\frac{\partial \hat{y}}{\partial b} = \hat{y}(1 - \hat{y})$

Final Gradient Expressions

The final expressions for the partial derivatives of the loss function with respect to the parameters, obtained by multiplying the two chain-rule factors and simplifying, are (the substitution is worked out just after the list):

  • $\frac{\partial L}{\partial w_i} = -(y - \hat{y})x_i$
  • $\frac{\partial L}{\partial b} = -(y - \hat{y})$
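
To see where the simplification comes from, substitute $\frac{\partial L}{\partial \hat{y}}$ and $\frac{\partial \hat{y}}{\partial w_i}$ into the chain rule and expand (the case for $b$ is identical except for the factor $x_i$):

$$\frac{\partial L}{\partial w_i} = \left(-\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}\right)\hat{y}(1 - \hat{y})x_i = \left[-y(1 - \hat{y}) + (1 - y)\hat{y}\right]x_i = (\hat{y} - y)x_i = -(y - \hat{y})x_i$$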

These gradients guide the update steps in the gradient descent algorithm, indicating the direction and magnitude by which the parameters should be adjusted to reduce the loss.

Gradient Descent Update Rule

The gradient descent update rules for the weights and bias are as follows, where $\alpha$ is the learning rate:

  • $w_i := w_i - \alpha \frac{\partial L}{\partial w_i}$
  • $b := b - \alpha \frac{\partial L}{\partial b}$

Iteratively applying these updates moves the parameters toward the values that minimize the loss function, aiming to find the best-fitting model for the given dataset.
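
Putting the pieces together, here is a compact training-loop sketch that applies these update rules. The toy dataset, learning rate, and epoch count are arbitrary choices for the example, and the gradients are averaged over the whole batch of examples (a common variant of the single-example updates described above).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(X, y, alpha=0.1, epochs=1000):
    """Train a sigmoid perceptron with batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)         # forward pass: predicted probabilities
        error = y_hat - y                  # (y_hat - y) = -(y - y_hat)
        grad_w = X.T @ error / n_samples   # dL/dw_i, averaged over the batch
        grad_b = error.mean()              # dL/db, averaged over the batch
        w -= alpha * grad_w                # w_i := w_i - alpha * dL/dw_i
        b -= alpha * grad_b                # b   := b   - alpha * dL/db
    return w, b

# Toy, linearly separable data: label is 1 when x1 + x2 > 1 (illustrative only)
X = np.array([[0.0, 0.0], [0.2, 0.4], [0.9, 0.8],
              [1.0, 1.5], [0.3, 0.2], [1.2, 0.7]])
y = np.array([0, 0, 1, 1, 0, 1])

w, b = train_perceptron(X, y, alpha=0.5, epochs=5000)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(preds)     # expected to match y on this separable toy set
print(w, b)
```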

Conclusion

The perceptron model can be effectively adapted for binary classification by using the sigmoid function as an activation function. Gradient descent, facilitated by the sigmoid function's derivative properties, enables efficient training of the perceptron, optimizing its parameters to classify input data accurately.