Sicheng Liu bio photo

Email

LinkedIn

Github

Week 3 - Hypothesis Complexity and Generalization

Predefined Hypothesis Class

Assume the target function is $c$, which fits the data best. \(c=\underset{h}{\operatorname{\argmin}}R(h)\) where $c$ can be inside or outside the predefined hypothesis.

Risk

Empirical risk. is defined as: \(R_S(h)=\frac{1}{n}\sum^n_{i=1}l(X_i,Y_i,h)\) where $S$ is the training sample.

Expected risk. is defined as: \(R(h)=\mathbb{E}[R_S(h)]=\mathbb{E}(l(X,Y,h))\)

The optimal hypothesis in the predefined hypothesis class: \(h^*=\underset{h \in H}{\operatorname{\argmin}}R(h)\)

The hypothesis we can learn from data: \(h_S=\underset{h \in H}{\operatorname{\argmin}}R_S(h)=\underset{h \in H}{\operatorname{\argmin}}\frac{1}{n}\sum^n_{i=1}l(X_i,Y_i,h)\)

Approximation error. is the difference between $h^*$ and $c$.

Estimation error. is the difference between $h_S$ and $h^*$.

By enlarging the predefined hypothesis class may be able to reduce the Approximation error, but large hypothesis class is hard to learn leading to a large Estimation error.

Probably Approximately Correct Learning (PAC Learning)

PAC learning explains how many training examples are needed to learn the best hypothesis in the predefined class.

It provides a theoretical foundation for understanding the generalization performance (difference between test performance and training performance) of learning algorithms.

It focuses on:

  • Size of the training dataset
  • Complexity of the hypothesis class
  • Generalization error of the learned model

Definition

\(p\{R(h_S)-\underset{h \in H}{\operatorname{\min}}R(h) \le \epsilon\} \ge 1-\delta\) where $\epsilon$ is acceptable error and $1-\delta$ is desired confidence level.

VC Dimension: Binary Case

We can group the classifiers. \(H=\{(h^1_1,...,h^1_{n_1}),...,(h^G_1,...,h^G_{n_G})\}\) Let $h^1,…,h^G$ be the representatives of each group \(H'=\{h^1,...,h^G\}\)

If VC dimension exist, $H$ is PAC learnabel.

Growth Function

Counts the number of distinct labeling of a hypothesis class produced on any subset of data points. \(\forall n\in \mathbb{N}, \prod_H(n)=\underset{X_1,...,X_n}{\operatorname{\max}}|\{h(X_1),...,h(X_n):h \in H\}|\)

Shattering

A hypothesis class shatters a set of points if it can assign all possible binary labels (0 or 1) to those points.

If data points can be shattered by a hypothesis class $H$, then \(\prod_H(n)=2^n\)

VC Dimension

The VC dimension of a hypothesis class $H$ is the size of the largest set that can be fully shattered by $H$: \(VC dimension(H)=\underset{n}{\operatorname{\max}}\{n:\prod_H (n)=2^n\}\)