
General Concepts

In machine learning, evaluating a model is a crucial step in understanding its performance. Two key concepts in this process are the hypothesis and the loss function.

Hypothesis

The hypothesis, typically denoted h_\theta, represents the model chosen to predict outputs given certain input data. For input x^{(i)}, the model prediction is h_\theta(x^{(i)}).
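As a minimal sketch, a linear hypothesis h_\theta(x) = \theta^T x can be written as follows (the specific parameter values are illustrative assumptions, not from the text):

```python
import numpy as np

# Hypothetical linear hypothesis h_theta(x) = theta^T x.
def h(theta, x):
    """Predict an output from input features x using parameters theta."""
    return float(np.dot(theta, x))

theta = np.array([0.5, -1.0])  # example parameters (assumed)
x_i = np.array([2.0, 1.0])     # one training input x^(i)
prediction = h(theta, x_i)     # h_theta(x^(i)) = 0.5*2 - 1.0*1 = 0.0
```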

Loss Functions

Loss functions measure the difference between the actual values and predictions. They are essential for training a model, providing feedback on its performance. Common loss functions include:

  • Least Squared Error (for Linear Regression):

    \frac{1}{2}(y - z)^2
  • Logistic Loss (for Logistic Regression):

    \log(1 + \exp(-yz))
  • Hinge Loss (for Support Vector Machine - SVM):

    \max(0, 1 - yz)
  • Cross-Entropy Loss (for Neural Networks):

    -[y\log(z) + (1 - y)\log(1 - z)]

The graphs associated with each loss function show how the error changes with respect to the predicted value z for different actual values y.
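The four losses above can be sketched directly from their formulas. This follows the conventions in the text: y is the label and z the prediction; note that the logistic and hinge losses assume y ∈ {−1, +1}, while cross-entropy assumes y ∈ {0, 1} and z ∈ (0, 1).

```python
import numpy as np

def least_squares(y, z):
    """Least squared error: (1/2)(y - z)^2."""
    return 0.5 * (y - z) ** 2

def logistic_loss(y, z):
    """Logistic loss, y in {-1, +1}: log(1 + exp(-yz))."""
    return np.log(1 + np.exp(-y * z))

def hinge_loss(y, z):
    """Hinge loss, y in {-1, +1}: max(0, 1 - yz)."""
    return max(0.0, 1 - y * z)

def cross_entropy(y, z):
    """Cross-entropy, y in {0, 1}, z in (0, 1)."""
    return -(y * np.log(z) + (1 - y) * np.log(1 - z))

least_squares(1.0, 0.5)   # 0.125
hinge_loss(1, 0.5)        # 0.5
```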

Cost Function

The cost function, denoted J, aggregates the losses across all training examples and is used to assess the performance of the model. It is defined as the sum of individual loss function values for all m training examples:

J(\theta) = \sum_{i=1}^{m} L(h_\theta(x^{(i)}), y^{(i)})

where L is the chosen loss function, h_\theta(x^{(i)}) is the hypothesis for the i^{th} example, and y^{(i)} is the actual value.

This framework allows for the optimization of the model parameters \theta through training, often using algorithms like gradient descent, with the goal of minimizing the cost function J(\theta).
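Putting the pieces together, the cost J(\theta) is just a sum of per-example losses. A minimal sketch, assuming a linear hypothesis and the least-squares loss:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta): sum of per-example losses over the training set."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        pred = np.dot(theta, x_i)          # h_theta(x^(i))
        total += 0.5 * (y_i - pred) ** 2   # L(h_theta(x^(i)), y^(i))
    return total

# Toy data (assumed for illustration): theta = [0, 1] fits it exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0])
cost(np.array([0.0, 1.0]), X, y)   # perfect fit gives J = 0.0
```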

Optimization Algorithms in Machine Learning

Optimization algorithms are essential for finding the best parameters for machine learning models. These algorithms aim to minimize the cost function, which measures the prediction error of a model.

Gradient Descent

Gradient Descent is a foundational optimization method used to minimize the cost function J(\theta) by updating the parameters \theta in the opposite direction of the gradient of the cost function, \nabla J(\theta).

  • Update Rule:

    \theta \leftarrow \theta - \alpha \nabla J(\theta)
  • \alpha: Learning rate, a positive scalar determining the step size.

  • \nabla J(\theta): Gradient of the cost function with respect to the parameters.

The graphical representation shows concentric contours of the cost function with the gradient pointing towards the direction of steepest ascent. Gradient descent moves in the opposite direction to reach the minimum.

  • Stochastic Gradient Descent (SGD): Updates the parameters using one training example at a time.
  • Batch Gradient Descent: Updates the parameters using the entire training set at each step.
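The update rule above can be sketched for least-squares linear regression, where the gradient of J(\theta) over the full batch has the closed form X^T(X\theta - y). The data and hyperparameters below are illustrative assumptions:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, iterations=2000):
    """Batch gradient descent for least-squares linear regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        grad = X.T @ (X @ theta - y)   # gradient of J(theta) over the full batch
        theta -= alpha * grad          # step opposite the gradient
    return theta

# Toy data generated by theta = [1, 1] (first column is the intercept).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = gradient_descent(X, y)   # converges toward [1.0, 1.0]
```

With too large a learning rate \alpha the iterates diverge instead of converging, which is why \alpha is typically tuned or decayed during training.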

Likelihood

The likelihood function L(\theta) measures how probable the observed data is, given a set of parameters \theta.

  • Optimization Goal:

    \theta_{opt} = \arg \max_{\theta} L(\theta)
  • In practice, the log-likelihood \ell(\theta) = \log(L(\theta)) is optimized since it is easier to work with, especially when dealing with products of probabilities.
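As an illustrative example (not from the text), consider Bernoulli data such as coin flips: the log turns a product of probabilities into a sum of logs, and maximizing \ell(\theta) over a grid recovers the familiar maximum-likelihood estimate, the sample mean.

```python
import numpy as np

def log_likelihood(p, data):
    """Log-likelihood of Bernoulli observations under success probability p."""
    data = np.asarray(data, dtype=float)
    # Sum of logs replaces the product of per-observation probabilities.
    return np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

data = [1, 1, 1, 0]                       # observed coin flips (assumed)
candidates = np.linspace(0.01, 0.99, 99)  # grid of candidate parameters
best = max(candidates, key=lambda p: log_likelihood(p, data))  # the MLE, 3/4
```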

Newton's Algorithm

Newton's algorithm, also known as the Newton-Raphson method, is an optimization technique that finds the parameters \theta by solving \ell'(\theta) = 0, where \ell(\theta) is typically a loss or likelihood function.

  • Update Rule (Scalar Case):

    \theta \leftarrow \theta - \frac{\ell'(\theta)}{\ell''(\theta)}
  • Update Rule (Multidimensional Generalization):

    \theta \leftarrow \theta - (\nabla^2\ell(\theta))^{-1} \nabla\ell(\theta)

Here, \nabla^2\ell(\theta) is the Hessian matrix of second-order partial derivatives. This method takes into account the curvature of \ell(\theta), which can lead to faster convergence compared to gradient descent, especially in well-behaved quadratic problems.
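The scalar update rule can be sketched on a toy concave quadratic \ell(\theta) = -(\theta - 3)^2, whose maximum is at \theta = 3 (the objective and starting point are illustrative assumptions). Because the objective is quadratic, Newton's method lands on the optimum in a single step:

```python
def newton(l_prime, l_double_prime, theta0, steps=10):
    """Scalar Newton's method: theta <- theta - l'(theta) / l''(theta)."""
    theta = theta0
    for _ in range(steps):
        theta -= l_prime(theta) / l_double_prime(theta)
    return theta

# Derivatives of the toy objective l(theta) = -(theta - 3)^2:
l_prime = lambda t: -2 * (t - 3)    # l'(theta)
l_double_prime = lambda t: -2.0     # l''(theta), constant curvature
theta = newton(l_prime, l_double_prime, theta0=0.0)  # reaches 3.0 in one step
```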