Regression with a perceptron

Neural networks are computational models that mimic the human brain's structure to process information. They consist of units called neurons or perceptrons, which are the fundamental building blocks of neural networks. The training of these networks involves adjusting weights and biases to minimize the error in predictions, a process achieved through algorithms like gradient descent and Newton's method.

Perceptron: The Basic Unit of Neural Networks

A perceptron models a single neuron in a neural network and was originally designed for binary classification. It computes a weighted sum of its inputs and passes this sum through an activation function to produce an output. The same structure extends naturally to linear regression, where the perceptron (with an identity activation) represents a simple linear model.

Mathematical Representation

Given inputs $x_1, x_2, \ldots, x_n$ with corresponding weights $w_1, w_2, \ldots, w_n$ and a bias term $b$, the output $\hat{y}$ of a perceptron is given by:

$$\hat{y} = \sum_{i=1}^{n} w_i x_i + b$$

This output can be used for predictions in linear regression problems, where $\hat{y}$ might represent the predicted value of a dependent variable, such as the price of a house.
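
As a minimal sketch (not part of the original), the linear perceptron output can be computed with NumPy; the feature values and parameters below are illustrative placeholders:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Linear perceptron output: y_hat = sum_i w_i * x_i + b (identity activation)."""
    return np.dot(w, x) + b

# Illustrative values: two features (e.g., house size and number of rooms)
x = np.array([2.0, 3.0])
w = np.array([0.5, -0.2])
b = 1.0
print(perceptron_predict(x, w, b))  # 0.5*2.0 + (-0.2)*3.0 + 1.0 = 1.4
```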

Loss Function

A common choice for the loss function in regression problems is the Mean Squared Error (MSE), defined as:

$$L(y, \hat{y}) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $N$ is the number of samples.
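
A short sketch of this loss in NumPy (the $\frac{1}{2N}$ scaling matches the formula above; the array values are made up for illustration):

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error with the 1/(2N) factor used above."""
    return np.mean((y - y_hat) ** 2) / 2

y = np.array([3.0, 5.0, 4.0])
y_hat = np.array([2.5, 5.5, 4.0])
print(mse_loss(y, y_hat))  # ((0.5)^2 + (0.5)^2 + 0) / (2*3) ≈ 0.0833
```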

Gradient Descent

Gradient Descent Algorithm

To minimize the loss function, gradient descent updates the parameters as follows:

  • $w_i^{(new)} = w_i^{(old)} - \alpha \frac{\partial L}{\partial w_i}$
  • $b^{(new)} = b^{(old)} - \alpha \frac{\partial L}{\partial b}$

where $\alpha$ is the learning rate, a hyperparameter that controls the step size during the optimization process.
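
As a sketch, a single update step looks like this in Python (the gradient values passed in are placeholders; how they are computed is derived below):

```python
def gradient_step(w, b, dL_dw, dL_db, alpha=0.01):
    """One gradient descent update for the weights and bias."""
    w_new = [wi - alpha * gi for wi, gi in zip(w, dL_dw)]
    b_new = b - alpha * dL_db
    return w_new, b_new

# Placeholder gradients for illustration only
w, b = [0.5, -0.2], 1.0
w, b = gradient_step(w, b, dL_dw=[0.3, -0.1], dL_db=0.2, alpha=0.1)
print(w, b)  # [0.47, -0.19] 0.98
```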

Derivatives Calculation

The update rules rely on the derivatives of the loss function with respect to each parameter, which are obtained using the chain rule for differentiation. For a single sample with the quadratic loss $L = \frac{1}{2}(y - \hat{y})^2$, the derivatives are as follows:

Loss Function Derivative with Respect to Predictions

$$\frac{dL}{d\hat{y}} = \frac{d}{d\hat{y}}\left[\frac{1}{2}(y - \hat{y})^2\right] = -(y - \hat{y})$$

Partial Derivatives of Predictions

  • With respect to bias ($b$):
$\frac{d\hat{y}}{db} = 1$
  • With respect to weight ($w_1$):
$\frac{d\hat{y}}{dw_1} = x_1$
  • With respect to weight ($w_2$):
$\frac{d\hat{y}}{dw_2} = x_2$

Chain Rule Application

The chain rule is applied to compute the gradient of the loss function with respect to each parameter:

  • For bias ($b$):
$\frac{dL}{db} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{db} = -(y - \hat{y})$
  • For weight ($w_1$):
$\frac{dL}{dw_1} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dw_1} = -(y - \hat{y}) \cdot x_1$
  • For weight ($w_2$):
$\frac{dL}{dw_2} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dw_2} = -(y - \hat{y}) \cdot x_2$
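
A minimal sketch of these per-sample gradients in Python (the feature and parameter values are illustrative only):

```python
def gradients(x1, x2, y, w1, w2, b):
    """Per-sample gradients of L = 0.5*(y - y_hat)^2 for a two-feature linear perceptron."""
    y_hat = w1 * x1 + w2 * x2 + b
    error = -(y - y_hat)          # dL/dy_hat
    dL_dw1 = error * x1           # chain rule: dL/dw1 = dL/dy_hat * x1
    dL_dw2 = error * x2           # chain rule: dL/dw2 = dL/dy_hat * x2
    dL_db = error                 # chain rule: dL/db  = dL/dy_hat * 1
    return dL_dw1, dL_dw2, dL_db

print(gradients(x1=2.0, x2=3.0, y=4.0, w1=0.5, w2=-0.2, b=1.0))
# y_hat = 1.4, error = -2.6 -> (-5.2, -7.8, -2.6)
```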

Update Rules

Integrating the derivatives back into the gradient descent formula, the parameters are updated iteratively:

  • $w_1^{(new)} = w_1^{(old)} - \alpha \cdot \left[ -(y - \hat{y}) \cdot x_1 \right]$
  • $w_2^{(new)} = w_2^{(old)} - \alpha \cdot \left[ -(y - \hat{y}) \cdot x_2 \right]$
  • $b^{(new)} = b^{(old)} - \alpha \cdot \left[ -(y - \hat{y}) \right]$

Through repeated application of these updates, gradient descent aims to converge to values of $w_1$, $w_2$, and $b$ that minimize the loss function, leading to a model with minimized prediction error.
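
Putting the pieces together, here is a compact training-loop sketch; the toy data, learning rate, and iteration count are all illustrative assumptions rather than values from the original:

```python
import numpy as np

def train_perceptron(X, y, alpha=0.01, epochs=5000):
    """Fit y_hat = w1*x1 + w2*x2 + b by per-sample gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = np.dot(w, xi) + b
            error = -(yi - y_hat)        # dL/dy_hat for L = 0.5*(yi - y_hat)^2
            w -= alpha * error * xi      # dL/dw = error * x
            b -= alpha * error           # dL/db = error
    return w, b

# Toy data generated from y = 2*x1 + 3*x2 + 1
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1
w, b = train_perceptron(X, y)
print(w, b)  # should approach [2, 3] and 1
```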

Conclusion

Neural networks leverage the perceptron model and optimization algorithms like gradient descent to learn from data and make predictions. By minimizing the loss function, neural networks can be trained to model complex relationships between inputs and outputs, making them powerful tools for tasks ranging from regression to classification in various domains.