Regression with a perceptron

Neural networks are computational models that mimic the human brain's structure to process information. They consist of units called neurons or perceptrons, which are the fundamental building blocks of neural networks. The training of these networks involves adjusting weights and biases to minimize the error in predictions, a process achieved through algorithms like gradient descent and Newton's method.

Perceptron: The Basic Unit of Neural Networks

A perceptron models a single neuron in a neural network and was originally designed for binary classification. It computes a weighted sum of its inputs and passes this sum through an activation function to produce an output. The same structure extends naturally to linear regression, where the perceptron (with an identity activation) represents a simple linear model.

Mathematical Representation

Given inputs $x_1, x_2, \ldots, x_n$ with corresponding weights $w_1, w_2, \ldots, w_n$ and a bias term $b$, the output $\hat{y}$ of a perceptron is given by:

$$\hat{y} = \sum_{i=1}^{n} w_i x_i + b$$

This output can be used for predictions in linear regression problems, where $\hat{y}$ might represent the predicted value of a dependent variable, such as the price of a house.
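
As a minimal sketch (not part of the original), the linear perceptron output can be computed with NumPy; the feature values and parameters below are illustrative placeholders:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Linear perceptron output: y_hat = sum_i w_i * x_i + b (identity activation)."""
    return np.dot(w, x) + b

# Illustrative values: two features (e.g., house size and number of rooms)
x = np.array([2.0, 3.0])
w = np.array([0.5, -0.2])
b = 1.0
print(perceptron_predict(x, w, b))  # 0.5*2.0 + (-0.2)*3.0 + 1.0 = 1.4
```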

Loss Function

A common choice for the loss function in regression problems is the Mean Squared Error (MSE), defined as:

$$L(y, \hat{y}) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $N$ is the number of samples.
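
A short sketch of this loss in NumPy (the $\frac{1}{2N}$ scaling matches the formula above; the array values are made up for illustration):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error with the 1/(2N) factor used above."""
    return np.mean((y - y_hat) ** 2) / 2

y = np.array([3.0, 5.0, 4.0])
y_hat = np.array([2.5, 5.5, 4.0])
print(mse_loss(y, y_hat))  # ((0.5)^2 + (0.5)^2 + 0) / (2*3) ≈ 0.0833
```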

Gradient Descent

Gradient Descent Algorithm

To minimize the loss function, gradient descent updates the parameters as follows:

  • $w_i^{(new)} = w_i^{(old)} - \alpha \frac{\partial L}{\partial w_i}$
  • $b^{(new)} = b^{(old)} - \alpha \frac{\partial L}{\partial b}$

where $\alpha$ is the learning rate, a hyperparameter that controls the step size during the optimization process.
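
As a sketch, a single update step looks like this in Python (the gradient values passed in are placeholders; how they are computed is derived below):

```python
def gradient_step(w, b, dL_dw, dL_db, alpha=0.01):
    """One gradient descent update for the weights and bias."""
    w_new = [wi - alpha * gi for wi, gi in zip(w, dL_dw)]
    b_new = b - alpha * dL_db
    return w_new, b_new

# Placeholder gradients for illustration only
w, b = [0.5, -0.2], 1.0
w, b = gradient_step(w, b, dL_dw=[0.3, -0.1], dL_db=0.2, alpha=0.1)
print(w, b)  # [0.47, -0.19] 0.98
```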

Derivatives Calculation

The update rules rely on the derivatives of the loss function with respect to each parameter, which are obtained using the chain rule for differentiation. For a single sample with the quadratic loss $L = \frac{1}{2}(y - \hat{y})^2$, the derivatives are as follows:

Loss Function Derivative with Respect to Predictions

$$\frac{dL}{d\hat{y}} = \frac{d}{d\hat{y}}\left[\frac{1}{2}(y - \hat{y})^2\right] = -(y - \hat{y})$$

Partial Derivatives of Predictions

  • With respect to bias ($b$):
$\frac{d\hat{y}}{db} = 1$
  • With respect to weight ($w_1$):
$\frac{d\hat{y}}{dw_1} = x_1$
  • With respect to weight ($w_2$):
$\frac{d\hat{y}}{dw_2} = x_2$

Chain Rule Application

The chain rule is applied to compute the gradient of the loss function with respect to each parameter:

  • For bias ($b$):
$\frac{dL}{db} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{db} = -(y - \hat{y})$
  • For weight ($w_1$):
$\frac{dL}{dw_1} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dw_1} = -(y - \hat{y}) \cdot x_1$
  • For weight ($w_2$):
$\frac{dL}{dw_2} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dw_2} = -(y - \hat{y}) \cdot x_2$
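
A minimal sketch of these per-sample gradients in Python (the feature and parameter values are illustrative only):

```python
def gradients(x1, x2, y, w1, w2, b):
    """Per-sample gradients of L = 0.5*(y - y_hat)^2 for a two-feature linear perceptron."""
    y_hat = w1 * x1 + w2 * x2 + b
    error = -(y - y_hat)          # dL/dy_hat
    dL_dw1 = error * x1           # chain rule: dL/dw1 = dL/dy_hat * x1
    dL_dw2 = error * x2           # chain rule: dL/dw2 = dL/dy_hat * x2
    dL_db = error                 # chain rule: dL/db  = dL/dy_hat * 1
    return dL_dw1, dL_dw2, dL_db

print(gradients(x1=2.0, x2=3.0, y=4.0, w1=0.5, w2=-0.2, b=1.0))
# y_hat = 1.4, error = -2.6 -> (-5.2, -7.8, -2.6)
```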

Update Rules

Integrating the derivatives back into the gradient descent formula, the parameters are updated iteratively:

  • $w_1^{(new)} = w_1^{(old)} - \alpha \cdot \left[ -(y - \hat{y}) \cdot x_1 \right]$
  • $w_2^{(new)} = w_2^{(old)} - \alpha \cdot \left[ -(y - \hat{y}) \cdot x_2 \right]$
  • $b^{(new)} = b^{(old)} - \alpha \cdot \left[ -(y - \hat{y}) \right]$

Through repeated application of these updates, gradient descent aims to converge to values of $w_1$, $w_2$, and $b$ that minimize the loss function, leading to a model with minimized prediction error.
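
Putting the pieces together, here is a compact training-loop sketch; the toy data, learning rate, and iteration count are all illustrative assumptions rather than values from the original:

```python
import numpy as np

def train_perceptron(X, y, alpha=0.01, epochs=5000):
    """Fit y_hat = w1*x1 + w2*x2 + b by per-sample gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = np.dot(w, xi) + b
            error = -(yi - y_hat)        # dL/dy_hat for L = 0.5*(yi - y_hat)^2
            w -= alpha * error * xi      # dL/dw = error * x
            b -= alpha * error           # dL/db = error
    return w, b

# Toy data generated from y = 2*x1 + 3*x2 + 1
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1
w, b = train_perceptron(X, y)
print(w, b)  # should approach [2, 3] and 1
```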

Conclusion

Neural networks leverage the perceptron model and optimization algorithms like gradient descent to learn from data and make predictions. By minimizing the loss function, neural networks can be trained to model complex relationships between inputs and outputs, making them powerful tools for tasks ranging from regression to classification in various domains.