Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an algorithm will work in practice (the true "risk") because we don't know the true distribution of data that the algorithm will work on, but we can instead measure its performance on a known set of training data (the "empirical" risk).
Background
Consider the following situation, which is a general setting of many supervised learning problems. We have two spaces of objects $X$ and $Y$ and would like to learn a function $h: X \to Y$ (often called a hypothesis) which outputs an object $y \in Y$, given $x \in X$. To do so, we have at our disposal a training set of $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in X$ is an input and $y_i \in Y$ is the corresponding response that we wish to get from $h(x_i)$.
To put it more formally, we assume that there is a joint probability distribution $P(x, y)$ over $X$ and $Y$, and that the training set consists of $n$ instances $(x_1, y_1), \ldots, (x_n, y_n)$ drawn i.i.d. from $P(x, y)$. Note that the assumption of a joint probability distribution allows us to model uncertainty in predictions (e.g. from noise in the data) because $y$ is not a deterministic function of $x$, but rather a random variable with conditional distribution $P(y \mid x)$ for a fixed $x$.
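To make this setting concrete, the following sketch draws a small training set i.i.d. from a hypothetical joint distribution in which the label is a noisy function of the input; the particular distribution (a linear signal plus Gaussian noise) is an illustrative assumption, not part of the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n):
    """Draw n i.i.d. pairs (x_i, y_i) from a hypothetical joint distribution P(x, y).

    Here x ~ Uniform(-1, 1) and the label is the sign of 2*x plus Gaussian noise,
    so y is a random variable given x rather than a deterministic function of x.
    """
    x = rng.uniform(-1.0, 1.0, size=n)
    noise = rng.normal(scale=0.5, size=n)
    y = np.where(2.0 * x + noise >= 0, 1, -1)   # labels in {-1, +1}
    return x, y

x_train, y_train = sample_training_set(100)
```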
We also assume that we are given a non-negative real-valued loss function $L(\hat{y}, y)$ which measures how different the prediction $\hat{y}$ of a hypothesis is from the true outcome $y$. The risk associated with hypothesis $h(x)$ is then defined as the expectation of the loss function:

$$R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y)\, dP(x, y).$$

A loss function commonly used in theory is the 0-1 loss function: $L(\hat{y}, y) = I(\hat{y} \neq y)$, where $I(\cdot)$ is the indicator function.
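With the 0-1 loss, the risk is the probability of misclassification, an expectation over $P(x, y)$. The sketch below illustrates this by approximating $R(h)$ for a fixed hypothesis via Monte Carlo on a synthetic distribution that is known only because we constructed it ourselves; the distribution and the threshold hypothesis are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction differs from the true label, else 0."""
    return float(y_pred != y_true)

def h(x):
    """A fixed example hypothesis: predict the label from the sign of x."""
    return 1 if x >= 0 else -1

# Synthetic joint distribution (same toy construction as above). Because it is
# known here, the risk R(h) = E[L(h(x), y)] can be approximated by averaging
# the loss over a large fresh sample; with real data, P(x, y) is unknown.
x = rng.uniform(-1.0, 1.0, size=100_000)
y = np.where(2.0 * x + rng.normal(scale=0.5, size=100_000) >= 0, 1, -1)
risk_estimate = np.mean([zero_one_loss(h(xi), yi) for xi, yi in zip(x, y)])
print(f"Monte Carlo estimate of R(h): {risk_estimate:.3f}")
```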
The ultimate goal of a learning algorithm is to find a hypothesis $h^{*}$ among a fixed class of functions $\mathcal{H}$ for which the risk $R(h)$ is minimal:

$$h^{*} = \arg\min_{h \in \mathcal{H}} R(h).$$
Empirical risk minimization
In general, the risk $R(h)$ cannot be computed because the distribution $P(x, y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called empirical risk, by averaging the loss function on the training set:

$$R_{\mathrm{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i).$$
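The definition translates directly into code. In the sketch below, the 0-1 loss, the threshold hypothesis, and the five training points are illustrative assumptions chosen only to make the computation concrete.

```python
import numpy as np

def empirical_risk(h, loss, x_train, y_train):
    """Average loss of hypothesis h over the n training examples."""
    return np.mean([loss(h(xi), yi) for xi, yi in zip(x_train, y_train)])

# Illustrative ingredients: 0-1 loss and a simple sign/threshold hypothesis.
zero_one_loss = lambda y_pred, y_true: float(y_pred != y_true)
h = lambda x: 1 if x >= 0 else -1

x_train = np.array([-0.8, -0.3, 0.1, 0.4, 0.9])
y_train = np.array([-1, 1, -1, 1, 1])
print(empirical_risk(h, zero_one_loss, x_train, y_train))  # 0.4 (2 of 5 points misclassified)
```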
The empirical risk minimization principle[1] states that the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk:

$$\hat{h} = \arg\min_{h \in \mathcal{H}} R_{\mathrm{emp}}(h).$$
Thus the learning algorithm defined by the ERM principle consists in solving the above optimization problem.
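When the hypothesis class $\mathcal{H}$ is a small finite set, this optimization problem can be solved by direct enumeration, as in the sketch below; the class of threshold classifiers and the training data are illustrative assumptions, and realistic hypothesis classes generally require more sophisticated optimization.

```python
import numpy as np

def empirical_risk(h, loss, x_train, y_train):
    """Average loss of hypothesis h over the training set."""
    return np.mean([loss(h(xi), yi) for xi, yi in zip(x_train, y_train)])

zero_one_loss = lambda y_pred, y_true: float(y_pred != y_true)

# Illustrative finite hypothesis class: threshold classifiers x -> sign(x - t).
def make_threshold_classifier(t):
    return lambda x: 1 if x >= t else -1

hypothesis_class = [make_threshold_classifier(t) for t in np.linspace(-1.0, 1.0, 21)]

x_train = np.array([-0.8, -0.3, 0.1, 0.4, 0.9])
y_train = np.array([-1, -1, 1, 1, 1])

# ERM: choose the hypothesis in the class with minimal empirical risk.
h_hat = min(hypothesis_class,
            key=lambda h: empirical_risk(h, zero_one_loss, x_train, y_train))
```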
Properties
Computational complexity
Empirical risk minimization for a classification problem with the 0-1 loss function is known to be an NP-hard problem, even for such a relatively simple class of functions as linear classifiers.[2] It can, however, be solved efficiently when the minimal empirical risk is zero, i.e. when the data are linearly separable.
In practice, machine learning algorithms cope with this either by employing a convex approximation to the 0-1 loss function (such as the hinge loss for SVM), which is easier to optimize, or by imposing assumptions on the distribution $P(x, y)$ (and thus ceasing to be agnostic learning algorithms, to which the above result applies).
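As a sketch of the convex-surrogate idea: the hinge loss $\max(0, 1 - y \cdot f(x))$ is a convex upper bound on the 0-1 loss for labels $y \in \{-1, +1\}$. The linear score function and the sample point below are illustrative assumptions.

```python
import numpy as np

def zero_one_loss(score, y):
    """0-1 loss expressed via a real-valued score: 1 if sign(score) != y."""
    return float(np.sign(score) != y)

def hinge_loss(score, y):
    """Hinge loss max(0, 1 - y * score): a convex upper bound on the 0-1 loss."""
    return max(0.0, 1.0 - y * score)

# Compare the two losses for a linear classifier f(x) = w * x + b
# at a single labelled point (illustrative numbers).
w, b = 1.5, -0.2
x, y = 0.4, -1                      # true label is -1, but f(x) = 0.4 > 0
score = w * x + b
print(zero_one_loss(score, y))      # 1.0 (the point is misclassified)
print(hinge_loss(score, y))         # 1.4 (the surrogate penalty)
```

Because the hinge loss is convex in the parameters $(w, b)$, minimizing its average over the training set is a tractable convex problem, unlike minimizing the empirical 0-1 risk directly.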
See also
References