
General Concepts

In machine learning, evaluating a model is a crucial step in understanding its performance. Two key concepts in this process are the hypothesis and the loss function.

Hypothesis

The hypothesis, typically denoted h_\theta, represents the model chosen to predict outputs given certain input data. For input x^{(i)}, the model prediction is h_\theta(x^{(i)}).
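As a minimal sketch, a linear hypothesis h_\theta(x) = \theta^T x can be written as follows (the specific parameter values are illustrative assumptions, not from the text):

```python
import numpy as np

# Hypothetical linear hypothesis h_theta(x) = theta^T x.
def h(theta, x):
    """Predict an output from input features x using parameters theta."""
    return float(np.dot(theta, x))

theta = np.array([0.5, -1.0])  # example parameters (assumed)
x_i = np.array([2.0, 1.0])     # one training input x^(i)
prediction = h(theta, x_i)     # h_theta(x^(i)) = 0.5*2 - 1.0*1 = 0.0
```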

Loss Functions

Loss functions measure the difference between the actual values and predictions. They are essential for training a model, providing feedback on its performance. Common loss functions include:

  • Least Squared Error (for Linear Regression):

    \frac{1}{2}(y - z)^2
  • Logistic Loss (for Logistic Regression):

    \log(1 + \exp(-yz))
  • Hinge Loss (for Support Vector Machine - SVM):

    \max(0, 1 - yz)
  • Cross-Entropy Loss (for Neural Networks):

    -[y\log(z) + (1 - y)\log(1 - z)]

The graphs associated with each loss function show how the error changes with respect to the predicted value z for different actual values y.
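The four losses above can be sketched directly from their formulas. This follows the conventions in the text: y is the label and z the prediction; note that the logistic and hinge losses assume y ∈ {−1, +1}, while cross-entropy assumes y ∈ {0, 1} and z ∈ (0, 1).

```python
import numpy as np

def least_squares(y, z):
    """Least squared error: (1/2)(y - z)^2."""
    return 0.5 * (y - z) ** 2

def logistic_loss(y, z):
    """Logistic loss, y in {-1, +1}: log(1 + exp(-yz))."""
    return np.log(1 + np.exp(-y * z))

def hinge_loss(y, z):
    """Hinge loss, y in {-1, +1}: max(0, 1 - yz)."""
    return max(0.0, 1 - y * z)

def cross_entropy(y, z):
    """Cross-entropy, y in {0, 1}, z in (0, 1)."""
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))

least_squares(1.0, 0.5)   # 0.125
hinge_loss(1, 0.5)        # 0.5
```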

Cost Function

The cost function, denoted J, aggregates the losses across all training examples and is used to assess the performance of the model. It is defined as the sum of individual loss function values for all m training examples:

J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})

where L is the chosen loss function, h_\theta(x^{(i)}) is the hypothesis for the i^{th} example, and y^{(i)} is the actual value.

This framework allows for the optimization of the model parameters \theta through training, often using algorithms like gradient descent, with the goal of minimizing the cost function J(\theta).
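Putting the pieces together, the cost J(\theta) is just a sum of per-example losses. A minimal sketch, assuming a linear hypothesis and the least-squares loss:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta): sum of per-example losses over the training set."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        pred = np.dot(theta, x_i)          # h_theta(x^(i))
        total += 0.5 * (y_i - pred) ** 2   # L(h_theta(x^(i)), y^(i))
    return total

# Toy data (assumed for illustration): theta = [0, 1] fits it exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0])
cost(np.array([0.0, 1.0]), X, y)   # perfect fit gives J = 0.0
```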

Optimization Algorithms in Machine Learning

Optimization algorithms are essential for finding the best parameters for machine learning models. These algorithms aim to minimize the cost function, which measures the prediction error of a model.

Gradient Descent

Gradient Descent is a foundational optimization method used to minimize the cost function J(\theta) by updating the parameters \theta in the opposite direction of the gradient of the cost function, \nabla J(\theta).

  • Update Rule:

    \theta \leftarrow \theta - \alpha \nabla J(\theta)
  • \alpha: Learning rate, a positive scalar determining the step size.

  • \nabla J(\theta): Gradient of the cost function with respect to the parameters.

The graphical representation shows concentric contours of the cost function with the gradient pointing towards the direction of steepest ascent. Gradient descent moves in the opposite direction to reach the minimum.

  • Stochastic Gradient Descent (SGD): Updates the parameters using one training example at a time.
  • Batch Gradient Descent: Updates the parameters using the entire training set at each step.
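The update rule above can be sketched for least-squares linear regression, where the gradient of J(\theta) over the full batch has the closed form X^T(X\theta - y). The data and hyperparameters below are illustrative assumptions:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, iterations=2000):
    """Batch gradient descent for least-squares linear regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        grad = X.T @ (X @ theta - y)   # gradient of J(theta) over the full batch
        theta -= alpha * grad          # step opposite the gradient
    return theta

# Toy data generated by theta = [1, 1] (first column is the intercept).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = gradient_descent(X, y)   # converges toward [1.0, 1.0]
```

With too large a learning rate \alpha the iterates diverge instead of converging, which is why \alpha is typically tuned or decayed during training.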

Likelihood

The likelihood function L(\theta) measures how probable the observed data is, given a set of parameters \theta.

  • Optimization Goal:

    \theta_{opt} = \arg \max_{\theta} L(\theta)
  • In practice, the log-likelihood \ell(\theta) = \log(L(\theta)) is optimized since it is easier to work with, especially when dealing with products of probabilities.
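As an illustrative example (not from the text), consider Bernoulli data such as coin flips: the log turns a product of probabilities into a sum of logs, and maximizing \ell(\theta) over a grid recovers the familiar maximum-likelihood estimate, the sample mean.

```python
import numpy as np

def log_likelihood(p, data):
    """Log-likelihood of Bernoulli observations under success probability p."""
    data = np.asarray(data, dtype=float)
    # Sum of logs replaces the product of per-observation probabilities.
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

data = [1, 1, 1, 0]                       # observed coin flips (assumed)
candidates = np.linspace(0.01, 0.99, 99)  # grid of candidate parameters
best = max(candidates, key=lambda p: log_likelihood(p, data))  # the MLE, 3/4
```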

Newton's Algorithm

Newton's algorithm, also known as the Newton-Raphson method, is an optimization technique that finds the parameters \theta by solving \ell'(\theta) = 0, where \ell(\theta) is typically a loss or likelihood function.

  • Update Rule (Scalar Case):

    \theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}
  • Update Rule (Multidimensional Generalization):

    \theta \leftarrow \theta - (\nabla^2\ell(\theta))^{-1} \nabla\ell(\theta)

Here, \nabla^2\ell(\theta) is the Hessian matrix of second-order partial derivatives. This method takes into account the curvature of \ell(\theta), which can lead to faster convergence compared to gradient descent, especially in well-behaved quadratic problems.
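The scalar update rule can be sketched on a toy concave quadratic \ell(\theta) = -(\theta - 3)^2, whose maximum is at \theta = 3 (the objective and starting point are illustrative assumptions). Because the objective is quadratic, Newton's method lands on the optimum in a single step:

```python
def newton(l_prime, l_double_prime, theta0, steps=10):
    """Scalar Newton's method: theta <- theta - l'(theta) / l''(theta)."""
    theta = theta0
    for _ in range(steps):
        theta -= l_prime(theta) / l_double_prime(theta)
    return theta

# Derivatives of the toy objective l(theta) = -(theta - 3)^2:
l_prime = lambda t: -2 * (t - 3)    # l'(theta)
l_double_prime = lambda t: -2.0     # l''(theta), constant curvature
theta = newton(l_prime, l_double_prime, theta0=0.0)  # reaches 3.0 in one step
```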