
Classification with Perceptron

Perceptron as a Linear Classifier

A perceptron, initially introduced for linear regression, can be adapted for binary classification problems. This adaptation involves modifying the activation function to transform the perceptron's output into a form suitable for classification.

Mathematical Formulation

The weighted sum $z$ in a perceptron with inputs $x_1$ through $x_n$, corresponding weights $w_1$ through $w_n$, and a bias term $b$ is:

$$z = \sum_{i=1}^{n} w_i x_i + b$$

This equation represents the linear combination of inputs and their respective weights, with the bias term added to account for offsets. For classification, the sigmoid function, $\sigma(z)$, is used as the activation function, transforming $z$ into a probability between 0 and 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This function outputs a value in the range $(0, 1)$, making it suitable for binary classification tasks.
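
As a concrete illustration, here is a minimal NumPy sketch of this forward pass. The input values, weights, and bias below are arbitrary placeholders chosen for the example, not values from the text.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued input to the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    """Weighted sum of inputs plus bias, passed through the sigmoid."""
    z = np.dot(w, x) + b          # z = sum_i w_i * x_i + b
    return sigmoid(z)             # probability of the positive class

# Example with arbitrary placeholder values
x = np.array([0.5, -1.2, 3.0])    # inputs x_1 ... x_n
w = np.array([0.8, 0.1, -0.4])    # weights w_1 ... w_n
b = 0.2                           # bias
print(perceptron_forward(x, w, b))  # approximately 0.33
```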

Sigmoid Function

The sigmoid function, denoted as $\sigma(z)$, plays a crucial role in machine learning, especially in logistic regression and neural networks, due to its ability to map any real-valued number into the $(0, 1)$ interval. This property is particularly useful for modeling probabilities.

Definition

The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z \in \mathbb{R}$ is the input to the function.

Properties

  • Domain and Range: The function maps the domain of all real numbers $\mathbb{R}$ to the range $(0, 1)$.
  • Asymptotes: It has horizontal asymptotes at $y = 0$ and $y = 1$, implying that $\sigma(z)$ approaches $0$ as $z \to -\infty$ and $1$ as $z \to \infty$.
  • Symmetry: At $z = 0$, $\sigma(0) = \frac{1}{2}$, and since $\sigma(-z) = 1 - \sigma(z)$, the curve is symmetric about the point $(0, \frac{1}{2})$.
  • Sigmoid of Large Positive and Negative Values: For large positive values of $z$, $\sigma(z)$ approaches $1$, and for large negative values, $\sigma(z)$ approaches $0$, as the numerical check below illustrates.
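
The following short sketch evaluates the sigmoid at a few points to confirm these properties numerically; the sample values are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigma(0) = 0.5
print(sigmoid(0.0))                      # 0.5

# Approaches the asymptotes for large |z|
print(sigmoid(20.0), sigmoid(-20.0))     # ~1.0, ~0.0

# Symmetry about (0, 1/2): sigma(-z) = 1 - sigma(z)
z = 1.7
print(np.isclose(sigmoid(-z), 1.0 - sigmoid(z)))  # True
```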

Derivative of the Sigmoid Function

The derivative of the sigmoid function is significant in machine learning algorithms, particularly in the optimization process. It can be computed using the chain rule of calculus and exhibits a simple form that is computationally efficient.

Calculation

Let's denote the derivative of $\sigma(z)$ with respect to $z$ as $\sigma'(z)$. The calculation proceeds as follows:

  1. Start with the definition $\sigma(z) = (1 + e^{-z})^{-1}$.
  2. Applying the chain rule, we get:
$$\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = -(1 + e^{-z})^{-2} \cdot \left(-e^{-z}\right)$$
  3. Simplifying, we obtain:
$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}$$
  4. By adding and subtracting $1$ in the numerator and rearranging, we find:
$$\sigma'(z) = \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right)$$
  5. Finally, recognizing that the first factor is $\sigma(z)$ and the parenthesized term is $1 - \sigma(z)$, we arrive at the elegant result (verified numerically in the sketch below):
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
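
As a sanity check, the sketch below compares this closed-form derivative against a central finite-difference approximation; the test point and step size are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Closed form: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Central finite-difference approximation of the derivative
z, h = 0.8, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)

print(sigmoid_derivative(z))                        # ~0.2139
print(np.isclose(numeric, sigmoid_derivative(z)))   # True
```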

Gradient Descent for Perceptron Training

Gradient descent is employed to minimize the error between the predicted and actual classifications. It adjusts the weights and bias to reduce the loss function, calculated using the log loss for classification:

$$L(y, \hat{y}) = -\left[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\right]$$

where $y$ is the actual label, and $\hat{y}$ is the predicted probability (the output of the sigmoid function). The loss function $L(y, \hat{y})$ quantifies the difference between the actual outputs $y$ and the predicted outputs $\hat{y}$. The optimization's goal is to minimize $L$ by adjusting the model parameters, specifically the weights ($w$) and bias ($b$).
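
A minimal implementation of this loss is sketched below. The clipping of predictions slightly away from 0 and 1 is a common numerical safeguard against taking the log of zero, not something mandated by the formula itself.

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy (log loss) for a single example or an array."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # keep log() finite
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

print(log_loss(1, 0.9))   # small loss: confident, correct prediction (~0.105)
print(log_loss(1, 0.1))   # large loss: confident, wrong prediction (~2.303)
```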

To understand how changes in $w$ and $b$ affect $L$, we calculate the partial derivatives of $L$ with respect to these parameters. This involves understanding how $L$ is influenced by $\hat{y}$ and, in turn, how $\hat{y}$ depends on each parameter.

Chain Rule Application

The calculation of $\frac{\partial L}{\partial w_i}$ and $\frac{\partial L}{\partial b}$ involves the application of the chain rule of calculus, expressed as:

  • $\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_i}$
  • $\frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b}$

The term $\frac{\partial L}{\partial \hat{y}}$ is common across these expressions and is crucial for understanding the gradient's direction and magnitude.

Derivative Calculations

Derivative of $L$ with respect to $\hat{y}$

Given the log loss function, the derivative of $L$ with respect to $\hat{y}$ is calculated as:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$

This expression represents how the loss function gradient depends on the difference between actual and predicted values.
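
This result can be checked by differentiating each term of the log loss with respect to $\hat{y}$, using $\frac{d}{du}\log(u) = \frac{1}{u}$ and the chain rule for the $\log(1 - \hat{y})$ term:

$$\frac{\partial L}{\partial \hat{y}} = -y \cdot \frac{1}{\hat{y}} - (1 - y) \cdot \frac{-1}{1 - \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$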

Derivative of $\hat{y}$ with respect to $w$ and $b$

The predicted output $\hat{y}$ in a classification perceptron is computed by applying the sigmoid activation function to the linear combination of inputs and weights plus the bias, i.e. $\hat{y} = \sigma(z)$ with $z = \sum_{i=1}^{n} w_i x_i + b$. The derivatives of $\hat{y}$ with respect to $w_i$ and $b$ follow from the derivative of the sigmoid function, combined with $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial z}{\partial b} = 1$:

  • $\frac{\partial \hat{y}}{\partial w_i} = \hat{y}(1 - \hat{y})x_i$
  • $\frac{\partial \hat{y}}{\partial b} = \hat{y}(1 - \hat{y})$

Final Gradient Expressions

The final expressions for the partial derivatives of the loss function with respect to the parameters, obtained by multiplying the two chain-rule factors and simplifying, are (the substitution is worked out just after the list):

  • $\frac{\partial L}{\partial w_i} = -(y - \hat{y})x_i$
  • $\frac{\partial L}{\partial b} = -(y - \hat{y})$
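
To see where the simplification comes from, substitute $\frac{\partial L}{\partial \hat{y}}$ and $\frac{\partial \hat{y}}{\partial w_i}$ into the chain rule and expand (the case for $b$ is identical except for the factor $x_i$):

$$\frac{\partial L}{\partial w_i} = \left(-\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}\right)\hat{y}(1 - \hat{y})x_i = \left[-y(1 - \hat{y}) + (1 - y)\hat{y}\right]x_i = (\hat{y} - y)x_i = -(y - \hat{y})x_i$$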

These gradients guide the update steps in the gradient descent algorithm, indicating the direction and magnitude by which the parameters should be adjusted to reduce the loss.

Gradient Descent Update Rule

The gradient descent update rules for the weights and bias are as follows, where $\alpha$ is the learning rate:

  • $w_i := w_i - \alpha \frac{\partial L}{\partial w_i}$
  • $b := b - \alpha \frac{\partial L}{\partial b}$

Iteratively applying these updates moves the parameters toward the values that minimize the loss function, aiming to find the best-fitting model for the given dataset.
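
Putting the pieces together, here is a compact training-loop sketch that applies these update rules. The toy dataset, learning rate, and epoch count are arbitrary choices for the example, and the gradients are averaged over the whole batch of examples (a common variant of the single-example updates described above).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(X, y, alpha=0.1, epochs=1000):
    """Train a sigmoid perceptron with batch gradient descent on the log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)         # forward pass: predicted probabilities
        error = y_hat - y                  # (y_hat - y) = -(y - y_hat)
        grad_w = X.T @ error / n_samples   # dL/dw_i, averaged over the batch
        grad_b = error.mean()              # dL/db, averaged over the batch
        w -= alpha * grad_w                # w_i := w_i - alpha * dL/dw_i
        b -= alpha * grad_b                # b   := b   - alpha * dL/db
    return w, b

# Toy, linearly separable data: label is 1 when x1 + x2 > 1 (illustrative only)
X = np.array([[0.0, 0.0], [0.2, 0.4], [0.9, 0.8],
              [1.0, 1.5], [0.3, 0.2], [1.2, 0.7]])
y = np.array([0, 0, 1, 1, 0, 1])

w, b = train_perceptron(X, y, alpha=0.5, epochs=5000)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(preds)     # expected to match y on this separable toy set
print(w, b)
```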

Conclusion

The perceptron model can be effectively adapted for binary classification by using the sigmoid function as an activation function. Gradient descent, facilitated by the sigmoid function's derivative properties, enables efficient training of the perceptron, optimizing its parameters to classify input data accurately.