Week 7 - Learning with Noisy Data: On the Robustness of Surrogate Loss Functions
Probabilistic Perspective of Noise
Target: $y=h(x)+\epsilon$
Noise: $\epsilon \sim \mathcal{N}(0, \beta^{-1})$
What we learn: $y \mid x \sim \mathcal{N}(h(x), \beta^{-1})$ (we want the observed $y$ to have high probability under this model; see the sketch below).
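A minimal simulation sketch of this noise model (the linear choice of $h$ and the value of $\beta$ below are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target function h(x); any choice works for illustration.
def h(x):
    return 2.0 * x + 1.0

beta = 4.0                        # precision; the noise variance is 1/beta
x = rng.uniform(-1.0, 1.0, 100)   # inputs
eps = rng.normal(0.0, np.sqrt(1.0 / beta), size=x.shape)  # eps ~ N(0, beta^{-1})
y = h(x) + eps                    # noisy observations: y | x ~ N(h(x), beta^{-1})
```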
Bayes’ Rule
\(p(\theta|S)=\frac{p(S|\theta)p(\theta)}{p(S)}\) where $S$ is the training sample, $\theta$ is the model parameter, $p(\theta)$ is the prior, and $p(S)$ is the evidence, i.e. the marginal probability of the training sample.
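A small numerical illustration of Bayes' rule, assuming a toy discrete parameter (a coin bias) and a binomial observation model; both are illustrative choices, not from the notes:

```python
import numpy as np
from scipy.stats import binom

# Hypothetical toy example: theta is a coin bias taking one of three values,
# and S is the observation "7 heads out of 10 tosses".
thetas = np.array([0.3, 0.5, 0.7])
prior = np.full(3, 1/3)                    # p(theta): uniform prior
likelihood = binom.pmf(7, 10, thetas)      # p(S | theta) for each candidate theta

evidence = np.sum(likelihood * prior)      # p(S) = sum_theta p(S | theta) p(theta)
posterior = likelihood * prior / evidence  # p(theta | S), Bayes' rule
print(posterior)                           # most mass on theta = 0.7
```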
Maximum Likelihood Estimation (MLE)
We want to find the value of $\theta$ that maximises the likelihood $p(S|\theta)$. Assuming the samples are drawn i.i.d., \(p(S|\theta)=\prod_{i=1}^n p(x_i,y_i|\theta)\).
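A minimal MLE sketch, assuming a Gaussian model with unknown mean and a simple grid search (both are illustrative choices): maximising the log-likelihood recovers the sample mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=200)     # samples drawn from N(3, 1)

def log_likelihood(theta):
    # ln p(S | theta) = sum_i ln p(x_i | theta), Gaussian with known unit variance
    return np.sum(norm.logpdf(data, loc=theta, scale=1.0))

# Grid search over candidate means; the MLE coincides with the sample mean.
grid = np.linspace(0.0, 6.0, 601)
theta_mle = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_mle, data.mean())
```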
Modelling Noisy Observations
For all training samples: \(p(S \mid X,h,\beta^{-1})=\prod_{i=1}^n \mathcal{N}(y_i \mid h(x_i),\beta^{-1})=\prod_{i=1}^n \sqrt{\frac{\beta}{2\pi}}\exp\Bigl(-\frac{\beta(y_i-h(x_i))^2}{2}\Bigr)\). Taking $-\ln(\cdot)$ of both sides, \(-\ln p (S \mid X,h,\beta^{-1}) = -\frac{n}{2}\ln \beta+\frac{n}{2}\ln(2\pi)+\frac{\beta}{2} \sum_{i=1}^n (y_i-h(x_i))^2 = -\frac{n}{2}\ln \beta+\frac{n}{2}\ln(2\pi)+\frac{\beta}{2} R_S(h)\), where \(R_S(h)=\sum_{i=1}^n (y_i-h(x_i))^2\) is the empirical risk.
Therefore, maximising the conditional likelihood $p(S \mid X,h,\beta^{-1})$ is equivalent to minimising the empirical risk $R_S(h)$.
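A small numerical check of this equivalence, assuming synthetic data and an arbitrary linear candidate $h$ (both illustrative): the Gaussian negative log-likelihood equals the constant terms plus $\frac{\beta}{2}$ times the sum of squared errors.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
beta = 4.0
x = rng.uniform(-1.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, np.sqrt(1.0 / beta), size=x.shape)

def h(x, w, b):            # a hypothetical linear model
    return w * x + b

w, b = 1.5, 0.5            # arbitrary candidate parameters
pred = h(x, w, b)
n = len(x)

# Negative log-likelihood under y_i | x_i ~ N(h(x_i), beta^{-1}) ...
nll = -np.sum(norm.logpdf(y, loc=pred, scale=np.sqrt(1.0 / beta)))
# ... equals the constant terms plus (beta/2) times the sum of squared errors.
sse = np.sum((y - pred) ** 2)
nll_from_risk = -0.5 * n * np.log(beta) + 0.5 * n * np.log(2 * np.pi) + 0.5 * beta * sse
print(np.allclose(nll, nll_from_risk))   # True
```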
Bias and Variance
Underfitting - High bias
Overfitting - High variance
Here $\lambda$ is the regularisation parameter: a large $\lambda$ leads to high bias (a heavily constrained, simpler model), while a small $\lambda$ leads to high variance (a more complex model).
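As an illustrative sketch of this trade-off (ridge regression on polynomial features is an assumed example, not from the notes), the effect of $\lambda$ on the fitted weights can be seen as follows:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical ridge regression on degree-9 polynomial features.
x = rng.uniform(-1.0, 1.0, 30)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)
X = np.vander(x, N=10, increasing=True)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge_fit(X, y, lam=1e-8)   # small lambda: flexible fit, high variance
w_large = ridge_fit(X, y, lam=100.0)  # large lambda: heavily shrunk, high bias
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```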
Robustness of Surrogate Loss Functions
Let \(f(h) = \frac{1}{n}\sum_{i=1}^n \ell(X_i,Y_i,h)\). Optimality is achieved at an $h$ such that \(\nabla f(h)=0\).
Let \(g(t)=f(th)\) with \(t \in \mathbb{R}\) and \(h \in \mathbb{R}^d\). Then \(g'(t)=\nabla f(th)^\top h\). If \(\nabla f(h)=0\), then \(g'(1)=0\), which means we need to find an $h$ such that $g'(1)=0$ (see the sketch below).
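A minimal numerical sketch of this condition, assuming squared loss on a linear model (an illustrative choice): at the minimiser of $f$, the directional derivative $g'(1)=\nabla f(h)^\top h$ vanishes.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical setup: squared loss on a linear model, f(h) = (1/n) sum_i (y_i - X_i^T h)^2.
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=n)

def f(h):
    return np.mean((y - X @ h) ** 2)

def grad_f(h):
    return -2.0 / n * X.T @ (y - X @ h)

# Minimiser of f: ordinary least squares.
h_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# g(t) = f(t * h_star), so g'(1) = grad_f(h_star)^T h_star; it vanishes at the optimum.
g_prime_1 = grad_f(h_star) @ h_star
print(g_prime_1)   # numerically close to 0
```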