Week 9 - Learning with Noisy Data: Label Noise
Problem setup:
Observation: $X \in \mathcal{X} \subset \mathbb{R}^d$.
Clean but unobservable label: $Y \in \mathcal{Y}=\{-1,+1\}$.
Observable but noisy label: $\widetilde{Y} \in \mathcal{Y}$
Clean distribution: $D(X,Y)$
Noisy distribution: $D_{\rho}(X,\widetilde{Y})$
Our goal is to learn a discriminant function $f_n:\mathcal{X} \rightarrow \mathbb{R}$ such that the induced classifier predicts the correct clean label $y$ given an observation $x$ (i.e., learn the clean labels from the noisy data).
A probabilistic model: \(\rho_Y(X)=P(\widetilde{Y}=-Y|Y,X)\), where $\rho_Y(X)$ is called the flipping probability (flip rate).
Random Classification Noise (RCN)
The flipping probability is independent of $X$ and $Y$. \(\rho_{+1}(X)=\rho_{-1}(X)=\rho\)
A symmetric loss function is robust to RCN when the function class $\mathcal{F}_{lin}$ is extended to the universal function space (any kind of hypothesis), i.e., the functions in it can be of any form.
A symmetric loss function satisfies \(L(f(X),+1)+L(f(X),-1)=C\), where $C$ is a constant. In that case \(\underset{f}{\operatorname{arg\,min}}\,R_{D,L}(f)=\underset{f}{\operatorname{arg\,min}}\,R_{D_\rho,L}(f)\).
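To see why (a short check, assuming RCN with a constant flip rate $\rho<1/2$): \(R_{D_\rho,L}(f)=\mathbb{E}_{(X,Y)\sim D}[(1-\rho)L(f(X),Y)+\rho L(f(X),-Y)]=(1-2\rho)R_{D,L}(f)+\rho C\), so the noisy risk is an increasing affine function of the clean risk and the two share the same minimizers.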
Class-dependent Label Noise: Binary
The flipping probability depends on the class: \(\rho_{+1}(X)=\rho_{+1},\quad \rho_{-1}(X)=\rho_{-1}\).
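A minimal sketch of how such noisy labels could be simulated (my own helper, assuming labels in $\{-1,+1\}$; RCN is the special case `rho_pos == rho_neg`):

```python
import numpy as np

def flip_labels(y, rho_pos, rho_neg, rng=None):
    """Flip binary labels in {-1, +1} with class-dependent rates.

    rho_pos = P(Y_tilde = -1 | Y = +1), rho_neg = P(Y_tilde = +1 | Y = -1).
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    flip_prob = np.where(y == +1, rho_pos, rho_neg)  # per-example flip rate
    flips = rng.random(y.shape) < flip_prob          # which labels get flipped
    return np.where(flips, -y, y)

# Example: 30% of positives and 10% of negatives are mislabeled.
y_clean = np.array([+1, +1, -1, -1, +1, -1])
y_noisy = flip_labels(y_clean, rho_pos=0.3, rho_neg=0.1, rng=0)
```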
Importance Reweighting
View the noisy data and the clean data as samples drawn from two different domains (distributions).
\(R_{D,L}(f)=\mathbb{E}_{(X,Y) \sim D}[L(f(X),Y)]=\int P_D(X,Y)L(f(X),Y)\,dX\,dY\\ =\mathbb{E}_{(X,\widetilde{Y}) \sim D_\rho}\left[\frac{P_D(X,\widetilde{Y})}{P_{D_\rho}(X,\widetilde{Y})}L(f(X),\widetilde{Y})\right]\\ =\mathbb{E}_{(X,\widetilde{Y}) \sim D_\rho}[\beta(X,\widetilde{Y})L(f(X),\widetilde{Y})]\) where $\beta(x,y)=\frac{P_D(X=x,Y=y)}{P_{D_\rho}(X=x,\widetilde{Y}=y)}$.
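In practice the reweighted risk is estimated on the noisy sample by weighting each example's loss; a minimal sketch (the helper is hypothetical, and the per-example losses are assumed to be computed on the noisy labels):

```python
import numpy as np

def reweighted_empirical_risk(losses, beta):
    """Importance-reweighted empirical risk on the noisy sample.

    losses : per-example losses L(f(x_i), y_tilde_i)
    beta   : per-example importance weights beta(x_i, y_tilde_i)
    """
    losses = np.asarray(losses, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return np.mean(beta * losses)  # estimates the clean risk R_{D,L}(f)
```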
Under class-conditional noise (with RCN as the special case $\rho_{+1}=\rho_{-1}$), the noisy posterior satisfies: \(P_{D_\rho}(\widetilde{Y}=y|X=x)=(1-\rho_{+1}-\rho_{-1})P_D(Y=y|X=x)+\rho_{-y}\)
Then we can calculate the value of $\beta$ if we know the flip rates $\rho_{+1},\rho_{-1}$ (label noise does not change the marginal $P(X)$, so it cancels from the ratio): \(\beta(x,y)=\frac{P_D(X=x,Y=y)}{P_{D_\rho}(X=x,\widetilde{Y}=y)} \\ =\frac{P_D(Y=y|X=x)}{P_{D_\rho}(\widetilde{Y}=y|X=x)} \\ =\frac{P_{D_\rho}(\widetilde{Y}=y|X=x)-\rho_{-y}}{(1-\rho_{+1}-\rho_{-1})\,P_{D_\rho}(\widetilde{Y}=y|X=x)}\)
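A minimal sketch of this computation (assuming $\mathcal{Y}=\{-1,+1\}$, that the noisy posterior has been estimated from the noisy data, e.g. by a probabilistic classifier, and that the flip rates are known; the function name and the clipping are my own choices):

```python
import numpy as np

def importance_weights(p_noisy_pos, y_noisy, rho_pos, rho_neg, eps=1e-12):
    """beta(x, y) from the formula above, for binary labels in {-1, +1}.

    p_noisy_pos : estimated noisy posterior P_{D_rho}(Y_tilde = +1 | x) per example
    y_noisy     : observed noisy labels
    rho_pos, rho_neg : flip rates rho_{+1}, rho_{-1}
    """
    p_noisy_pos = np.asarray(p_noisy_pos, dtype=float)
    y_noisy = np.asarray(y_noisy)
    # P_{D_rho}(Y_tilde = y | x) for the observed noisy label y
    p_y = np.where(y_noisy == +1, p_noisy_pos, 1.0 - p_noisy_pos)
    # rho_{-y}: flip rate of the opposite class
    rho_other = np.where(y_noisy == +1, rho_neg, rho_pos)
    beta = (p_y - rho_other) / ((1.0 - rho_pos - rho_neg) * np.maximum(p_y, eps))
    return np.clip(beta, 0.0, None)  # negative values only arise from estimation error
```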
Class-dependent Label Noise: Multi-class
In the multi-class case, we have: \(\begin{bmatrix} P(\widetilde{Y}=1|x) \\ \vdots \\ P(\widetilde{Y}=C|x) \end{bmatrix}=\begin{bmatrix} P(\widetilde{Y}=1|Y=1,x) & \cdots &P(\widetilde{Y}=1|Y=C,x) \\ \vdots &\ddots&\vdots\\ P(\widetilde{Y}=C|Y=1,x) &\cdots& P(\widetilde{Y}=C|Y=C,x)\end{bmatrix}\begin{bmatrix} P(Y=1|x) \\ \vdots \\ P(Y=C|x) \end{bmatrix}\) where $C$ is the number of classes. The matrix of flip probabilities is called the transition matrix, denoted $T$, with $T_{ij}=P(\widetilde{Y}=i|Y=j,x)$.
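For example, with $C=3$ classes and an instance-independent $T$, the noisy posterior is just a matrix-vector product (a toy illustration; the numbers are made up):

```python
import numpy as np

# T[i, j] = P(Y_tilde = i | Y = j): each column sums to 1.
T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.7]])

p_clean = np.array([0.6, 0.3, 0.1])  # P(Y = c | x)
p_noisy = T @ p_clean                # P(Y_tilde = c | x)
```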
Forward correction: push the model's predicted clean posterior through the transition matrix and compute the loss against the noisy label, i.e. use $L(T\,\hat{P}(Y|x),\widetilde{y})$.
Backward correction: correct the loss itself with the inverse transition, i.e. use the corrected loss vector $(T^{\top})^{-1}L(\hat{P}(Y|x),\cdot)$, whose expectation over the noisy label equals the clean loss.
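A minimal sketch of both corrections under the convention above ($T_{ij}=P(\widetilde{Y}=i|Y=j)$, instance-independent), using cross-entropy as the base loss; this follows the standard loss-correction recipe rather than any specific implementation from the lecture:

```python
import numpy as np

def forward_corrected_ce(p_clean, y_noisy, T, eps=1e-12):
    """Forward correction: cross-entropy between the noisy label and T @ p_clean."""
    p_noisy = T @ p_clean                    # predicted noisy posterior
    return -np.log(p_noisy[y_noisy] + eps)

def backward_corrected_ce(p_clean, y_noisy, T, eps=1e-12):
    """Backward correction: apply (T^T)^{-1} to the vector of per-class losses."""
    losses = -np.log(p_clean + eps)          # losses[c] = CE if the clean label were c
    corrected = np.linalg.inv(T.T) @ losses  # corrected loss vector
    return corrected[y_noisy]

# Toy usage with the 3-class T from the example above.
T = np.array([[0.8, 0.1, 0.2],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.7]])
p_clean = np.array([0.6, 0.3, 0.1])  # model's predicted clean posterior for one example
print(forward_corrected_ce(p_clean, y_noisy=0, T=T))
print(backward_corrected_ce(p_clean, y_noisy=0, T=T))
```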