Week 3 - Hypothesis Complexity and Generalization
Predefined Hypothesis Class
Assume the target function is $c$, which fits the data best. \(c=\underset{h}{\operatorname{\argmin}}R(h)\) where $c$ can be inside or outside the predefined hypothesis.
Risk
Empirical risk. is defined as: \(R_S(h)=\frac{1}{n}\sum^n_{i=1}l(X_i,Y_i,h)\) where $S$ is the training sample.
Expected risk. is defined as: \(R(h)=\mathbb{E}[R_S(h)]=\mathbb{E}(l(X,Y,h))\)
The optimal hypothesis in the predefined hypothesis class: \(h^*=\underset{h \in H}{\operatorname{\argmin}}R(h)\)
The hypothesis we can learn from data: \(h_S=\underset{h \in H}{\operatorname{\argmin}}R_S(h)=\underset{h \in H}{\operatorname{\argmin}}\frac{1}{n}\sum^n_{i=1}l(X_i,Y_i,h)\)
Approximation error. is the difference between $h^*$ and $c$.
Estimation error. is the difference between $h_S$ and $h^*$.
By enlarging the predefined hypothesis class may be able to reduce the Approximation error, but large hypothesis class is hard to learn leading to a large Estimation error.
Probably Approximately Correct Learning (PAC Learning)
PAC learning explains how many training examples are needed to learn the best hypothesis in the predefined class.
It provides a theoretical foundation for understanding the generalization performance (difference between test performance and training performance) of learning algorithms.
It focuses on:
- Size of the training dataset
- Complexity of the hypothesis class
- Generalization error of the learned model
Definition
\(p\{R(h_S)-\underset{h \in H}{\operatorname{\min}}R(h) \le \epsilon\} \ge 1-\delta\) where $\epsilon$ is acceptable error and $1-\delta$ is desired confidence level.
VC Dimension: Binary Case
We can group the classifiers. \(H=\{(h^1_1,...,h^1_{n_1}),...,(h^G_1,...,h^G_{n_G})\}\) Let $h^1,…,h^G$ be the representatives of each group \(H'=\{h^1,...,h^G\}\)
If VC dimension exist, $H$ is PAC learnabel.
Growth Function
Counts the number of distinct labeling of a hypothesis class produced on any subset of data points. \(\forall n\in \mathbb{N}, \prod_H(n)=\underset{X_1,...,X_n}{\operatorname{\max}}|\{h(X_1),...,h(X_n):h \in H\}|\)
Shattering
A hypothesis class shatters a set of points if it can assign all possible binary labels (0 or 1) to those points.
If data points can be shattered by a hypothesis class $H$, then \(\prod_H(n)=2^n\)
VC Dimension
The VC dimension of a hypothesis class $H$ is the size of the largest set that can be fully shattered by $H$: \(VC dimension(H)=\underset{n}{\operatorname{\max}}\{n:\prod_H (n)=2^n\}\)