Week 8 - Domain Adaptation and Transfer Learning
Transfer Learning
The data in the source domain: $\{(x_1^S,y_1^S),\dots,(x_{n_S}^S,y_{n_S}^S)\}$. The data in the target domain: $\{(x_1^T,y_1^T),\dots,(x_{n_T}^T,y_{n_T}^T)\}$.
Importance Reweighting
The expected risk in the source domain is: \(R^S(h)=\mathbb{E}_{(X,Y)\sim p_s(X,Y)}[l(X,Y,h)]\)
Similarly, the expected risk in the target domain is: \(R^T(h)=\mathbb{E}_{(X,Y)\sim p_t(X,Y)}[l(X,Y,h)]\\\\ =\int_{(X,Y)}l(X,Y,h)p_t(X,Y)dXdY\\\\ =\int_{(X,Y)}l(X,Y,h)\frac{p_t(X,Y)}{p_s(X,Y)}p_s(X,Y)dXdY\\\\ =\mathbb{E}_{(X,Y)\sim p_s(X,Y)}[\frac{p_t(X,Y)}{p_s(X,Y)}l(X,Y,h)]\\\\ =\mathbb{E}_{(X,Y)\sim p_s(X,Y)}[\beta (X,Y)l(X,Y,h)]\)
where $\beta(X,Y)=\frac{p_t(X,Y)}{p_s(X,Y)}$ represents the change between domains (this requires $p_s(X,Y)>0$ wherever $p_t(X,Y)>0$). If we know the value of $\beta(X,Y)$, we can use source domain data to approximate the expected risk in the target domain.
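The identity above can be checked numerically. The following is a minimal sketch, assuming a toy setting that is not part of these notes: the source is $N(0,1)$, the target is $N(1,1)$, both densities are known exactly, and the loss is simply $l(x)=x^2$ (so the true target risk is $2$). The names `gaussian_pdf` and `importance_weighted_risk` are illustrative helpers.

```python
import numpy as np

def gaussian_pdf(x, mean, std=1.0):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def importance_weighted_risk(losses, beta):
    """Estimate the target risk as the source-sample mean of beta * loss."""
    return float(np.mean(beta * losses))

# Assumed toy domains: source N(0, 1), target N(1, 1).
rng = np.random.default_rng(0)
x_source = rng.normal(loc=0.0, scale=1.0, size=200_000)

# beta(x) = p_t(x) / p_s(x), computable here only because both densities are known.
beta = gaussian_pdf(x_source, 1.0) / gaussian_pdf(x_source, 0.0)

# Toy loss l(x) = x^2; its expectation is 1 under the source and 2 under the target.
losses = x_source ** 2
source_risk_estimate = float(np.mean(losses))
target_risk_estimate = importance_weighted_risk(losses, beta)
```

Only source samples are used, yet the weighted average should approach the target risk, while the unweighted average approaches the source risk.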
Domain Adaptation
In machine learning, a specific domain can be regarded as a specific joint distribution $p(X,Y)$.
Kernel Mean Matching
A kernel function measures the similarity of two vectors.
A feature mapping function $\phi$: \(\phi:X\rightarrow \mathcal{H}\) where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space (RKHS).
The kernel function: \(K(x_1,x_2)=\langle \phi(x_1),\phi(x_2) \rangle\) where $x_1$ and $x_2$ are two vectors. The same idea extends to distributions: if each distribution is represented by an element of $\mathcal{H}$, inner products and distances in $\mathcal{H}$ measure how similar two distributions are.
Let \(\mu(p(X))=\mathbb{E}_{X \sim p(X)}[\phi(X)]\) where $p(X)$ is a marginal distribution (only the distribution of $X$ is considered) on the feature space $X$.
The mean embedding $\mu$ is injective (if $\mu(p_1)=\mu(p_2)$, then $p_1=p_2$) if $K$ is a universal kernel.
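Because $\mu$ maps distributions into $\mathcal{H}$, the squared distance $\|\mu(p)-\mu(q)\|^2$ can be estimated from samples using only kernel evaluations. The sketch below assumes a Gaussian (RBF) kernel with bandwidth 1 and toy Gaussian samples; `rbf_kernel` and `squared_mmd` are illustrative names, and the estimate is the simple biased one that averages full kernel matrices.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def squared_mmd(x, y, bandwidth=1.0):
    """Biased estimate of ||mu(p) - mu(q)||^2 from samples x ~ p and y ~ q."""
    return float(
        rbf_kernel(x, x, bandwidth).mean()
        - 2.0 * rbf_kernel(x, y, bandwidth).mean()
        + rbf_kernel(y, y, bandwidth).mean()
    )

rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=(300, 1))  # drawn from N(0, 1)
sample_b = rng.normal(0.0, 1.0, size=(300, 1))  # same distribution as sample_a
sample_c = rng.normal(2.0, 1.0, size=(300, 1))  # shifted distribution N(2, 1)

mmd_same = squared_mmd(sample_a, sample_b)
mmd_shifted = squared_mmd(sample_a, sample_c)
```

Two samples from the same distribution should give a value near zero, while the shifted sample should give a clearly larger value.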
Then we have: \(\mu(p_t(X))=\mathbb{E}_{X\sim p_s}[\beta(X)\phi(X)]=\mu(\beta(X)p_s(X))\\\\ \text{s.t. } \beta(X)\ge 0,\ \mathbb{E}_{X\sim p_s}[\beta(X)]=1\) where the constraints ensure that $\beta(X)p_s(X)$ is non-negative and integrates to one, i.e. that it is a valid probability density.
We want to minimize the difference between the target mean embedding and the reweighted source mean embedding: \(\min_{\beta}\left\|\mu(p_t(X))-\mathbb{E}_{X \sim p_s(X)}[\beta(X)\phi(X)]\right\|^2 \\\\ \text{s.t. } \beta(X)\ge 0,\ \mathbb{E}_{X\sim p_s}[\beta(X)]=1\) Solving this problem gives $\beta(X)$.
Since the expectations cannot be computed exactly, we replace them with empirical averages over the source and target samples.
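With empirical averages and one weight $\beta_i$ per source sample, the objective becomes a quadratic in $\beta$: \(\left\|\frac{1}{n_T}\sum_{j}\phi(x_j^T)-\frac{1}{n_S}\sum_{i}\beta_i\phi(x_i^S)\right\|^2=\frac{1}{n_S^2}\beta^\top K\beta-\frac{2}{n_S n_T}\kappa^\top\beta+\text{const}\) with $K_{ij}=K(x_i^S,x_j^S)$ and $\kappa_i=\sum_j K(x_i^S,x_j^T)$. The sketch below is one possible way to solve it, not a reference implementation: it assumes a Gaussian kernel and uses plain projected gradient descent (a quadratic programming solver would be the more usual choice). All function names are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def project_to_scaled_simplex(v, total):
    """Euclidean projection of v onto {b : b >= 0, sum(b) = total}."""
    u = np.sort(v)[::-1]
    cumulative = np.cumsum(u) - total
    ranks = np.arange(1, v.size + 1)
    rho = np.nonzero(u - cumulative / ranks > 0)[0][-1]
    theta = cumulative[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def kmm_objective(beta, K, kappa, n_t):
    """Empirical KMM objective, up to a constant that does not depend on beta."""
    n_s = beta.size
    return float(beta @ K @ beta / n_s ** 2 - 2.0 * kappa @ beta / (n_s * n_t))

def kernel_mean_matching(x_source, x_target, bandwidth=1.0, n_iters=500):
    """Projected gradient descent on the empirical KMM objective."""
    n_s, n_t = len(x_source), len(x_target)
    K = rbf_kernel(x_source, x_source, bandwidth)
    kappa = rbf_kernel(x_source, x_target, bandwidth).sum(axis=1)
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    step = n_s ** 2 / (2.0 * np.linalg.eigvalsh(K)[-1])
    beta = np.ones(n_s)  # uniform weights satisfy both constraints
    for _ in range(n_iters):
        grad = 2.0 * K @ beta / n_s ** 2 - 2.0 * kappa / (n_s * n_t)
        # Enforce beta >= 0 and mean(beta) = 1 after each gradient step.
        beta = project_to_scaled_simplex(beta - step * grad, total=n_s)
    return beta, K, kappa

# Assumed toy data: source N(0, 1), target N(1, 1).
rng = np.random.default_rng(0)
x_source = rng.normal(0.0, 1.0, size=(200, 1))
x_target = rng.normal(1.0, 1.0, size=(200, 1))
beta, K, kappa = kernel_mean_matching(x_source, x_target)
weighted_source_mean = float(np.mean(beta * x_source[:, 0]))
```

The learned weights should pull the weighted source mean toward the target mean, without using any density estimate.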
Transfer Learning Models
Covariate Shift Model
We assume that $p_t(Y|X)=p_s(Y|X)$ and that $p_t(X) \neq p_s(X)$, which means only the input distribution differs between domains.
Then \(\beta(X,Y)=\frac{p_t(X,Y)}{p_s(X,Y)}=\frac{p_t(Y|X)p_t(X)}{p_s(Y|X)p_s(X)}=\frac{p_t(X)}{p_s(X)}=\beta(X)\)
$\beta(X)$ can be learned by kernel mean matching.
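To illustrate how $\beta(X)$ is used once it is available, the sketch below fits a linear model by weighted least squares on source data. The setting is an assumed toy one: the true relation is $y=x^2$ plus noise in both domains (so $p(Y|X)$ is shared), the source inputs follow $N(0,1)$, the target inputs follow $N(1,0.5^2)$, and $\beta(X)$ is computed from the known densities rather than estimated. Helper names are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def weighted_least_squares(x, y, weights):
    """Fit y ~ a*x + b by minimizing sum_i weights_i * (y_i - a*x_i - b)^2."""
    design = np.column_stack([x, np.ones_like(x)])
    sqrt_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef  # [slope, intercept]

rng = np.random.default_rng(0)
x_source = rng.normal(0.0, 1.0, size=5000)
y_source = x_source ** 2 + rng.normal(0.0, 0.1, size=5000)
x_target = rng.normal(1.0, 0.5, size=5000)
y_target = x_target ** 2 + rng.normal(0.0, 0.1, size=5000)  # used for evaluation only

# beta(x) = p_t(x) / p_s(x); known here by construction of the toy example.
beta = gaussian_pdf(x_source, 1.0, 0.5) / gaussian_pdf(x_source, 0.0, 1.0)

plain_fit = weighted_least_squares(x_source, y_source, np.ones_like(x_source))
reweighted_fit = weighted_least_squares(x_source, y_source, beta)

def target_mse(coef):
    """Mean squared error of the fitted line on the target sample."""
    prediction = coef[0] * x_target + coef[1]
    return float(np.mean((y_target - prediction) ** 2))

mse_plain = target_mse(plain_fit)
mse_reweighted = target_mse(reweighted_fit)
```

Because the linear model is misspecified, the unweighted fit is tuned to the region where source data is dense; the reweighted fit should achieve a lower error on the target sample.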
Target Shift Model
We assume that $p_t(X|Y)=p_s(X|Y)$ and that $p_t(Y) \neq p_s(Y)$, which means only the label distribution is different.
Then \(\beta(X,Y)=\frac{p_t(X,Y)}{p_s(X,Y)}=\frac{p_t(X|Y)p_t(Y)}{p_s(X|Y)p_s(Y)}=\frac{p_t(Y)}{p_s(Y)}=\beta(Y)\)
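For discrete labels, $\beta(Y)$ is just a ratio of class priors. The sketch below assumes, purely for illustration, that the target prior is given as a dictionary `target_prior`; the source prior is estimated from source label frequencies. The function name `label_shift_weights` is hypothetical.

```python
import numpy as np

def label_shift_weights(y_source, target_prior):
    """Return beta(y_i) = p_t(y_i) / p_s(y_i) for every source label y_i."""
    classes, counts = np.unique(y_source, return_counts=True)
    source_prior = counts / counts.sum()
    ratio = {c: target_prior[c] / p for c, p in zip(classes.tolist(), source_prior)}
    return np.array([ratio[label] for label in y_source.tolist()])

# Assumed toy labels: the source is 80% class 0, the target is balanced.
y_source = np.array([0] * 80 + [1] * 20)
target_prior = {0: 0.5, 1: 0.5}  # assumed known for this illustration
beta = label_shift_weights(y_source, target_prior)
```

Here class 0 is down-weighted ($0.5/0.8$) and class 1 is up-weighted ($0.5/0.2$), and the weights average to one over the source sample.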
However, most of the time we don’t know the label distribution in the target domain, so $\beta(Y)$ cannot be computed directly.