Classification with Perceptron
Perceptron as a Linear Classifier
A perceptron, initially introduced for linear regression, can be adapted for binary classification problems. This adaptation involves modifying the activation function to transform the perceptron's output into a form suitable for classification.
Mathematical Formulation
The weighted sum in a perceptron combines inputs $x_1$ to $x_n$ with corresponding weights $w_1$ to $w_n$, plus a bias term $b$:

$$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b = \sum_{i=1}^{n} w_i x_i + b$$
This equation represents the linear combination of inputs and their respective weights, with the bias term added to account for offsets. For classification, the sigmoid function, $\sigma$, is used as the activation function, transforming $z$ into a probability between 0 and 1:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$

This function outputs a value in the range $(0, 1)$, making it suitable for binary classification tasks.
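As a concrete illustration, the weighted sum followed by the sigmoid activation can be sketched in a few lines of Python. The function names and example values below are illustrative, not part of the original formulation:

```python
import math

def sigmoid(z):
    """Map a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_predict(x, w, b):
    """Forward pass: weighted sum plus bias, then sigmoid.

    x and w are equal-length lists of inputs and weights; b is the bias.
    Returns a probability in (0, 1); threshold at 0.5 for a class label.
    """
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Illustrative inputs: z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1
p = perceptron_predict([1.0, 2.0], [0.5, -0.25], 0.1)
label = 1 if p >= 0.5 else 0
```

Thresholding the probability at 0.5 turns the continuous sigmoid output into a hard binary decision.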
Sigmoid Function
The sigmoid function, denoted as $\sigma$, plays a crucial role in machine learning, especially in logistic regression and neural networks, due to its ability to map any real-valued number into the $(0, 1)$ interval. This property is particularly useful for modeling probabilities.
Definition
The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the input to the function.
Properties
- Domain and Range: The function maps the domain of all real numbers to the range $(0, 1)$.
- Asymptotes: It has horizontal asymptotes at $\sigma(z) = 0$ and $\sigma(z) = 1$, implying that $\sigma(z)$ approaches $1$ as $z \to \infty$ and $0$ as $z \to -\infty$.
- Symmetry: At $z = 0$, $\sigma(0) = \tfrac{1}{2}$, and $\sigma(-z) = 1 - \sigma(z)$ for all $z$, so the shifted function $\sigma(z) - \tfrac{1}{2}$ is odd, i.e., the curve is point-symmetric about $(0, \tfrac{1}{2})$.
- Sigmoid of Large Positive and Negative Values: For large positive values of $z$, $\sigma(z)$ approaches $1$, and for large negative values, $\sigma(z)$ approaches $0$.
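These properties are straightforward to confirm numerically. The short sketch below (helper names are illustrative) checks the midpoint value, saturation at the tails, and the symmetry identity $\sigma(-z) = 1 - \sigma(z)$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Midpoint: sigma(0) = 1/2
assert sigmoid(0.0) == 0.5

# Saturation at the tails: values approach 1 and 0
assert sigmoid(30.0) > 0.999
assert sigmoid(-30.0) < 0.001

# Point symmetry: sigma(-z) = 1 - sigma(z) for any z
for z in [-2.0, -0.5, 1.0, 3.0]:
    assert abs(sigmoid(-z) - (1.0 - sigmoid(z))) < 1e-12
```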
Derivative of the Sigmoid Function
The derivative of the sigmoid function is significant in machine learning algorithms, particularly in the optimization process. It can be computed using the chain rule of calculus and exhibits a simple form that is computationally efficient.
Calculation
Let's denote the derivative of $\sigma(z)$ with respect to $z$ as $\sigma'(z)$. The calculation proceeds as follows:
- Start with the definition $\sigma(z) = (1 + e^{-z})^{-1}$.
- Applying the chain rule, we get:

$$\sigma'(z) = -(1 + e^{-z})^{-2} \cdot \frac{d}{dz}\left(1 + e^{-z}\right) = -(1 + e^{-z})^{-2} \cdot \left(-e^{-z}\right)$$

- Simplifying, we obtain:

$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}$$

- By adding and subtracting $1$ in the numerator and rearranging, we find:

$$\sigma'(z) = \frac{(1 + e^{-z}) - 1}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \left(1 - \frac{1}{1 + e^{-z}}\right)$$

- Finally, recognizing that the two factors represent $\sigma(z)$ and $1 - \sigma(z)$ respectively, we arrive at the elegant result:

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$
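The closed form $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ can be sanity-checked against a numerical derivative. This sketch (helper names are illustrative) compares it with a central finite difference at a few points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Closed form derived above: sigma'(z) = sigma(z) * (1 - sigma(z)).
    # Note it reuses sigmoid(z) itself, which is why it is cheap to compute.
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite difference approximation
h = 1e-6
for z in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
    assert abs(numeric - sigmoid_prime(z)) < 1e-8
```

The fact that the derivative is expressible in terms of the already-computed activation is what makes it so efficient during backpropagation.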
Gradient Descent for Perceptron Training
Gradient descent is employed to minimize the error between the predicted and actual classifications. It adjusts the weights and bias to reduce the loss function, calculated using the log loss for classification:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]$$

where $y_i$ is the actual label, and $\hat{y}_i$ is the predicted label (output of the sigmoid function). The loss function quantifies the difference between the actual outputs $y_i$ and the predicted outputs $\hat{y}_i$. The optimization's goal is to minimize $L$ by adjusting the model parameters, specifically the weights ($w_i$) and bias ($b$).
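A minimal sketch of the log loss, assuming labels in $\{0, 1\}$ and sigmoid outputs as predictions; the clipping constant `eps` is an implementation detail added here to avoid evaluating $\log 0$:

```python
import math

def log_loss(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy (log loss) over a batch.

    y_true: actual labels (0 or 1); y_pred: sigmoid outputs in (0, 1).
    eps clips predictions away from exactly 0 and 1.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# Confident correct predictions give a low loss; confident wrong ones a high loss
good = log_loss([1, 0], [0.9, 0.1])
bad = log_loss([1, 0], [0.1, 0.9])
```

Note how the two terms split the work: for $y_i = 1$ only $\log \hat{y}_i$ contributes, and for $y_i = 0$ only $\log(1 - \hat{y}_i)$ does.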
To understand how changes in $w_i$ and $b$ affect $L$, we calculate the partial derivatives of $L$ with respect to these parameters. This involves understanding how $L$ is influenced by $\hat{y}$ and, in turn, how $\hat{y}$ depends on each parameter.
Chain Rule Application
The calculation of $\frac{\partial L}{\partial w_i}$ and $\frac{\partial L}{\partial b}$ involves the application of the chain rule of calculus, expressed as:

$$\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w_i}, \qquad \frac{\partial L}{\partial b} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial b}$$
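Putting the pieces together, a gradient-descent training loop for a single sigmoid unit might look like the sketch below. It relies on the standard simplification that, for the sigmoid activation combined with log loss, the chain-rule product collapses to $(\hat{y} - y)\,x_i$ for each weight and $(\hat{y} - y)$ for the bias; the toy data, learning rate, and epoch count are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_perceptron(X, y, lr=0.5, epochs=2000):
    """Stochastic gradient descent on log loss for one sigmoid unit.

    For sigmoid + log loss, the per-example gradients simplify to
    dL/dw_i = (y_hat - y) * x_i and dL/db = (y_hat - y).
    """
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = y_hat - yi                       # (y_hat - y)
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Toy linearly separable data: label 1 roughly when x0 + x1 > 1
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 0.5]]
y = [0, 0, 0, 1, 1]
w, b = train_perceptron(X, y)
preds = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
         for xi in X]
```

After training, thresholding the sigmoid output at 0.5 recovers the original labels on this separable toy set.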