Classification with Perceptron
Perceptron as a Linear Classifier
A perceptron, initially introduced for linear regression, can be adapted for binary classification problems. This adaptation involves modifying the activation function to transform the perceptron's output into a form suitable for classification.
Mathematical Formulation
The perceptron computes a weighted sum $z$ of its inputs $x_1$ through $x_n$, with corresponding weights $w_1$ through $w_n$, plus a bias term $b$:

$$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$

This equation represents the linear combination of inputs and their respective weights, with the bias term added to account for offsets. For classification, the sigmoid function $\sigma$ is used as the activation function, transforming $z$ into a probability between 0 and 1:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$

This function outputs a value in the range $(0, 1)$, making it suitable for binary classification tasks.
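As a concrete illustration, here is a minimal NumPy sketch of this forward pass; the input vector `x`, weights `w`, and bias `b` are made-up example values, not taken from the text:

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    """Compute the weighted sum z = w . x + b and apply the sigmoid activation."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Made-up example values
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
print(perceptron_forward(x, w, b))  # a probability strictly between 0 and 1
```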
Sigmoid Function
The sigmoid function, denoted as $\sigma$, plays a crucial role in machine learning, especially in logistic regression and neural networks, due to its ability to map any real-valued number into the interval $(0, 1)$. This property is particularly useful for modeling probabilities.
Definition
The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the input to the function.
Properties
- Domain and Range: The function maps the domain of all real numbers to the range $(0, 1)$.
- Asymptotes: It has horizontal asymptotes at $y = 0$ and $y = 1$, implying that $\sigma(z)$ approaches $1$ as $z \to \infty$ and $0$ as $z \to -\infty$.
- Symmetry: At $z = 0$, $\sigma(0) = 0.5$, and since $\sigma(-z) = 1 - \sigma(z)$, the shifted function $\sigma(z) - 0.5$ is symmetric about the origin.
- Sigmoid of Large Positive and Negative Values: For large positive values of $z$, $\sigma(z)$ approaches $1$, and for large negative values, $\sigma(z)$ approaches $0$ (these behaviors are checked numerically in the sketch after this list).
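A minimal NumPy sketch that checks these properties numerically; the sample points are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10.0, 10.0, 5)
s = sigmoid(z)

print(np.all((s > 0) & (s < 1)))        # True: all outputs lie strictly in (0, 1)
print(sigmoid(0.0))                      # 0.5 at z = 0
print(np.allclose(sigmoid(-z), 1 - s))   # True: sigma(-z) = 1 - sigma(z)
print(sigmoid(50.0), sigmoid(-50.0))     # ~1 for large positive z, ~0 for large negative z
```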
Derivative of the Sigmoid Function
The derivative of the sigmoid function is significant in machine learning algorithms, particularly in the optimization process. It can be computed using the chain rule of calculus and exhibits a simple form that is computationally efficient.
Calculation
Let's denote the derivative of $\sigma(z)$ with respect to $z$ as $\sigma'(z)$. The calculation proceeds as follows:
- Start with the definition $\sigma(z) = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$.
- Applying the chain rule, we get $\sigma'(z) = -(1 + e^{-z})^{-2} \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$.
- Simplifying, we obtain $\sigma'(z) = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}$.
- By adding and subtracting $1$ in the numerator of the second factor and rearranging, we find $\sigma'(z) = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right)$.
- Finally, recognizing that these two factors are $\sigma(z)$ and $1 - \sigma(z)$ respectively, we arrive at the elegant result:

$$\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)$$
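A quick way to sanity-check this result is to compare it against a central finite-difference approximation; the sketch below does so with NumPy (the step size `eps` and sample points are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # sigma'(z) = sigma(z) * (1 - sigma(z))

# Central finite-difference approximation of the derivative
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.allclose(sigmoid_derivative(z), numeric))  # True
```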
Gradient Descent for Perceptron Training
Gradient descent is employed to minimize the error between the predicted and actual classifications. It adjusts the weights and bias to reduce the loss function, which for classification is the log loss:

$$L = -\bigl[\, y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \,\bigr]$$

where $y$ is the actual label and $\hat{y}$ is the predicted label (the output of the sigmoid function). The loss function quantifies the difference between the actual outputs $y$ and the predicted outputs $\hat{y}$. The optimization's goal is to minimize $L$ by adjusting the model parameters, specifically the weights ($w_i$) and bias ($b$).
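As a hedged sketch, the log loss for a single example can be computed as below; the clipping with `eps` is an added numerical safeguard to avoid $\log(0)$, not part of the formula above:

```python
import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy for one example: -[y*log(y_hat) + (1-y)*log(1-y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(log_loss(1, 0.9))  # small loss: confident and correct prediction
print(log_loss(1, 0.1))  # large loss: confident but wrong prediction
```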
To understand how changes in $w_i$ and $b$ affect $L$, we calculate the partial derivatives of $L$ with respect to these parameters. This involves understanding how $L$ is influenced by $\hat{y}$ and, in turn, how $\hat{y}$ depends on each parameter.
Chain Rule Application
The calculation of $\frac{\partial L}{\partial w_i}$ and $\frac{\partial L}{\partial b}$ involves the application of the chain rule of calculus, expressed as:

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_i}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial b}$$

The term $\frac{\partial L}{\partial \hat{y}}$ is common across these expressions and is crucial for understanding the gradient's direction and magnitude.
Derivative Calculations
Derivative of $L$ with respect to $\hat{y}$
Given the log loss function, the derivative of $L$ with respect to $\hat{y}$ is calculated as:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$$
This expression represents how the loss function gradient depends on the difference between actual and predicted values.
Derivative of $\hat{y}$ with respect to $w_i$ and $b$
The predicted output $\hat{y}$ in a classification perceptron is computed by applying the sigmoid activation to a linear combination of inputs and weights plus the bias, $\hat{y} = \sigma(z)$ with $z = \sum_i w_i x_i + b$. The derivatives of $\hat{y}$ with respect to $w_i$ and $b$ therefore follow from the derivative of the sigmoid function:

$$\frac{\partial \hat{y}}{\partial w_i} = \sigma'(z)\, x_i = \hat{y}(1 - \hat{y})\, x_i, \qquad \frac{\partial \hat{y}}{\partial b} = \sigma'(z) = \hat{y}(1 - \hat{y})$$
Final Gradient Expressions
Multiplying the two factors together, the $\hat{y}(1 - \hat{y})$ terms cancel, and the final expressions for the partial derivatives of the loss function with respect to the parameters are:

$$\frac{\partial L}{\partial w_i} = (\hat{y} - y)\, x_i, \qquad \frac{\partial L}{\partial b} = \hat{y} - y$$
These gradients guide the update steps in the gradient descent algorithm, indicating the direction and magnitude by which the parameters should be adjusted to reduce the loss.
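To make these expressions concrete, here is a minimal sketch that evaluates them for a single made-up example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(x, y, w, b):
    """dL/dw_i = (y_hat - y) * x_i and dL/db = y_hat - y for the log loss."""
    y_hat = sigmoid(np.dot(w, x) + b)
    error = y_hat - y
    return error * x, error

# Made-up example values
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b, y = 0.2, 1.0
grad_w, grad_b = gradients(x, y, w, b)
print(grad_w, grad_b)
```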
Gradient Descent Update Rule
The gradient descent update rules for the weights and bias are as follows, where $\alpha$ is the learning rate:

$$w_i \leftarrow w_i - \alpha\, (\hat{y} - y)\, x_i, \qquad b \leftarrow b - \alpha\, (\hat{y} - y)$$
Iteratively applying these updates moves the parameters toward the values that minimize the loss function, aiming to find the best-fitting model for the given dataset.
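Putting the pieces together, the sketch below is one possible training loop under simple assumptions: per-example updates, a fixed learning rate, and a small made-up, linearly separable dataset chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(X, y, lr=0.1, epochs=1000):
    """Train a sigmoid perceptron by per-example gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = sigmoid(np.dot(w, x_i) + b)
            error = y_hat - y_i      # (y_hat - y)
            w -= lr * error * x_i    # w_i <- w_i - alpha * (y_hat - y) * x_i
            b -= lr * error          # b   <- b   - alpha * (y_hat - y)
    return w, b

# Made-up, linearly separable toy data: label 1 roughly when x1 + x2 > 1
X = np.array([[0.0, 0.0], [0.2, 0.3], [1.0, 0.8], [0.9, 1.2]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b = train_perceptron(X, y)
predictions = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(predictions)  # expected to match the labels on this toy set
```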
Conclusion
The perceptron model can be effectively adapted for binary classification by using the sigmoid function as an activation function. Gradient descent, facilitated by the sigmoid function's derivative properties, enables efficient training of the perceptron, optimizing its parameters to classify input data accurately.