Transfer Learning


I’ve run out of ideas, so it’s time to learn new things. I’ll start with transfer learning. After that, I’ll probably continue with online learning, lifelong learning and meta-learning.

For transfer learning, there are two important concepts:

  • Domain: A domain consists of data and the distribution that generates the data. Transfer learning is concerned with two domains: the source domain ($D _ s$) and the target domain ($D _ t$).
  • Task: A task is the goal of learning. It consists of two parts: the label space ($Y$) and the corresponding learning function ($f(\cdot)$).

Now, we give a formal definition of transfer learning:

Given a labeled source domain $D _ s = [\mathbf{x} _ {i}, y _ {i}] _ {i=1} ^ {n}$ and an unlabeled target domain $D _ t = [\mathbf{x} _ {j}] _ {j=1} ^ {m}$ whose data distributions differ, i.e. $P(\mathbf{x} _ {s}) \neq P(\mathbf{x} _ {t})$, the goal of transfer learning is to use knowledge learned from the source domain to predict the labels in the target domain.

The core of transfer learning is to find the knowledge shared between two domains and apply it properly. Knowledge is learned in the source domain and then applied in the target domain. In short, it is about finding what stays invariant (or similar) across changing domains and then exploiting it.


The next question is how to measure similarity. We need a metric, but what is a good one? The bad news is that no single metric works for all transfer learning problems. Different metrics are useful in different ways and for different problems. The good news is that we have many metrics in the arsenal:

  • Distance:

    • Euclidean distance: $d _ {Euclidean} = \sqrt {(\mathbf{x} - \mathbf{y}) ^ {\top} (\mathbf{x} - \mathbf{y})}$
    • Minkowski distance: $d _ {Minkowski} = ( \sum _ {k} | x _ {k} - y _ {k} | ^ { p } ) ^ { 1 / p }$, where $x _ {k}$ and $y _ {k}$ are the components of $\mathbf{x}$ and $\mathbf{y}$. When $p=1$, it’s Manhattan distance; when $p=2$, it’s Euclidean distance.
    • Mahalanobis distance: $d _ {Mahalanobis} = \sqrt { ( \mathbf { x } - \mathbf { y } ) ^ { \top } \Sigma ^ { - 1 } ( \mathbf { x } - \mathbf { y } ) }$, where $\Sigma$ is the covariance matrix of the distribution. When $\Sigma = \mathbf{I}$, it’s Euclidean distance.
  • Similarity:

    • Cosine similarity: $\cos ( \mathbf { x } , \mathbf { y } ) = \frac { \mathbf { x } \cdot \mathbf { y } } { || \mathbf { x } || \cdot || \mathbf { y } || } \in [-1, 1]$. The range narrows to $[0, 1]$ when all components of $\mathbf{x}$ and $\mathbf{y}$ are nonnegative.
    • Mutual information: the mutual information of two discrete random variables $X$ and $Y$ is defined as: $I ( X ; Y ) = \sum _ { x \in X } \sum _ { y \in Y } p ( x , y ) \log \frac { p ( x , y ) } { p ( x ) p ( y ) }$. For continuous random variables, we have $I ( X ; Y ) = \int _ {Y} \int _ {X} p ( x , y ) \log ( \frac { p ( x , y ) } { p ( x ) p ( y ) }) dx dy$.
    • Pearson coefficient: For two random variables $X$ and $Y$, $\rho _ {X, Y} = \frac{\operatorname{Cov}(X, Y)} {\sigma _ {X} \sigma _ {Y}} \in [-1, 1]$.
    • Jaccard coefficient: For two sets $X$ and $Y$, the Jaccard coefficient is defined as: $J = \frac { | X \cap Y | } { | X \cup Y | }$, where $| \cdot |$ denotes the number of elements in a set. Furthermore, Jaccard distance = $1 - J$.
  • Divergence:

    • Kullback–Leibler (KL) divergence: For two distributions $P(x)$ and $Q(x)$, $D _ {KL}(P || Q)=\sum _ {x \in X} P(x) \log \frac{P(x)}{Q(x)}$. A continuous version: $D _ {KL}(P || Q) = \int _ {- \infty} ^ {\infty} p ( x ) \log (\frac {p(x)}{q(x)}) dx$. Notice that KL divergence is not symmetric: in general $D _ {KL}(P || Q) \neq D _ {KL}(Q || P)$.
    • Jensen–Shannon divergence: Denote $M = \frac { 1 } { 2 } ( P + Q )$, $JSD(P || Q)=\frac{1}{2} D _ {KL}(P || M)+\frac{1}{2}D _ {KL}(Q || M)$. Unlike KL divergence, it is symmetric in $P$ and $Q$.
  • Maximum mean discrepancy (MMD):
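    $$MMD(X , Y) = \sqrt{ || \frac { 1 } { n _ { 1 } } \sum _ { i = 1 } ^ { n _ { 1 } } \phi ( \mathbf { x } _ { i } ) - \frac { 1 } { n _ { 2 } } \sum _ { j = 1 } ^ { n _ { 2 } } \phi ( \mathbf { y } _ { j } ) || _ { \mathcal { H } } ^ { 2 } }$$ where $\phi(\cdot)$ is a mapping from the original vector space to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and $n _ {1}$ and $n _ {2}$ are the sizes of the two samples. In other words, MMD is the distance between the mean embeddings of the two samples.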

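  • A-distance: We first train a classifier $h$ to distinguish whether an instance comes from the source domain or the target domain. We then define the (proxy) A-distance to be: $$\mathcal{A}(\mathcal{D} _ {s},\mathcal{D} _ {t})=2(1 - 2 err(h))$$ where $err(h)$ is the error of the classifier $h$, often measured in practice by the hinge loss of a linear classifier. If the two domains are indistinguishable, $err(h) \approx 0.5$ and the distance is close to 0; if they are perfectly separable, $err(h) \approx 0$ and the distance is close to 2.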

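  • Hilbert-Schmidt Independence Criterion (HSIC): It can be used to check the dependence between two sets of data. Up to a normalization constant, $$HSIC (X, Y) = \operatorname{trace}(HXHY)$$ where $X$ and $Y$ are the $n \times n$ kernel matrices of the two datasets and $H = \mathbf{I} - \frac{1}{n} \mathbf{1} \mathbf{1} ^ {\top}$ is the centering matrix.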

  • Wasserstein distance: Let $(M, d)$ be a metric space for which every probability measure on $M$ is a Radon measure (a so-called Radon space). For $p\geq 1$, let $P _ {p}(M)$ denote the collection of all probability measures $\mu$ on $M$ with finite $p ^ {\text{th}}$ moment, that is, for some $x _ {0}$ in $M$, $$\int _ { M } d(x, x _ {0}) ^ { p } \mathrm { d } \mu ( x ) < + \infty$$ Then the $p ^ {\text{th}}$ Wasserstein distance between two probability measures $\mu$ and $\nu$ in $P _ {p}(M)$ is defined as: $$W _ { p } ( \mu , \nu ) : = ( \inf _ { \gamma \in \Gamma ( \mu , \nu ) } \int _ { M \times M } d ( x , y ) ^ { p } \mathrm { d } \gamma ( x , y ) ) ^ { 1 / p }$$ where $\Gamma ( \mu , \nu )$ denotes the collection of all measures on $M \times M$ with marginals $\mu$ and $\nu$ on the first and second factors respectively. The Wasserstein metric may be equivalently defined by: $$W _ { p } ( \mu , \nu ) ^ { p } = \inf \mathbb{E} [ d ( X , Y ) ^ { p }]$$ where $\mathbb{E}[Z]$ denotes the expected value of a random variable $Z$ and the infimum is taken over all joint distributions of the random variables $X$ and $Y$ with marginals $\mu$ and $\nu$ respectively.
    It seems that Wasserstein distance is quite popular these days, especially in GANs and domain adaptation.
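
To make a few of the measures above concrete, here is a minimal sketch in plain Python (standard library only, to stay dependency-free). It covers the discrete KL and Jensen–Shannon divergences and a biased estimate of the squared MMD. The RBF kernel, its `gamma` value, the function names and the toy Gaussian data are all assumptions made for illustration; nothing here is tied to a particular library or paper.

```python
import math
import random


def kl_divergence(p, q):
    """Discrete D_KL(P || Q). Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence through the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rbf_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2); gamma is an illustrative choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y)."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def mmd_squared(xs, ys, kernel=rbf_kernel):
    """Biased estimate of MMD^2, using the kernel trick instead of phi:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    return (mean_kernel(xs, xs, kernel)
            + mean_kernel(ys, ys, kernel)
            - 2 * mean_kernel(xs, ys, kernel))


# KL is asymmetric, JS is symmetric.
p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print("KL(P||Q):", kl_divergence(p, q), "KL(Q||P):", kl_divergence(q, p))
print("JS(P||Q):", js_divergence(p, q), "JS(Q||P):", js_divergence(q, p))

# MMD is small for samples from the same distribution, larger after a shift.
rng = random.Random(0)
source = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
same = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
shifted = [[rng.gauss(2.0, 1.0)] for _ in range(200)]
print("MMD^2 same distribution:", mmd_squared(source, same))
print("MMD^2 shifted distribution:", mmd_squared(source, shifted))
```

Expanding the squared RKHS norm turns every inner product $\phi(\mathbf{x}) \cdot \phi(\mathbf{y})$ into a kernel evaluation, which is why the sketch never needs the mapping $\phi$ explicitly.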


  • Instance-based Transfer Learning:

    • By reusing samples from the source domain and weighting them properly, we can transfer the learned knowledge from the source domain to the target domain. A naive choice of weights is $\frac{P(\mathbf{x} _ {t})}{P(\mathbf{x} _ {s})}$, which is analogous to the importance sampling ratio in RL. TrAdaBoost brings the idea of AdaBoost to transfer learning: it increases the weights of samples that improve transfer performance and decreases the weights of samples that harm it. Can we apply similar ideas to the importance sampling ratio? This needs further study.
    • Although instance-based transfer learning has good theoretical guarantees, it only applies to problems where the difference between $P(\mathbf{x} _ {s})$ and $P(\mathbf{x} _ {t})$ is small. The knowledge transferred by this method is not abstract enough.
  • Feature-based Transfer Learning: This method assumes that some features are shared by the source domain and the target domain. Through feature transformation, it minimizes the distance between the two sets of features or maps all features into the same feature space. The core question is how to do the feature transformation, that is, how to learn the mapping.

  • Parameter/Model-based Transfer Learning: This method assumes that some model parameters can be shared by the source domain and the target domain. Through parameter sharing, knowledge learned in one domain can be transferred to another. Most algorithms developed under this approach are closely tied to neural networks.

  • Relation-based Transfer Learning: In this method, logic is applied to learn the relations between objects in the source domain. These relations are then reused in the target domain. This may be the most abstract approach to transfer learning. It is also hard, so there are not many papers on it.
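
The instance-based idea above can be illustrated with a small importance-weighting sketch in plain Python. It assumes, purely for illustration, that both densities are known one-dimensional Gaussians (source $N(0, 1)$, target $N(1, 1)$); in a real problem the ratio $\frac{P(\mathbf{x} _ {t})}{P(\mathbf{x} _ {s})}$ has to be estimated. The function names are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std=1.0):
    """Density of a 1-D Gaussian, used here as a stand-in for P(x)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def importance_weights(xs, source_mean, target_mean):
    """w(x) = P_t(x) / P_s(x) for each source sample (densities assumed known)."""
    return [gaussian_pdf(x, target_mean) / gaussian_pdf(x, source_mean)
            for x in xs]


def weighted_mean(values, weights):
    """Self-normalized importance-weighted average."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


rng = random.Random(0)
source_x = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
weights = importance_weights(source_x, source_mean=0.0, target_mean=1.0)

# The plain average reflects the source domain (mean 0), while the reweighted
# average of the same source samples approximates the target mean (1).
print("unweighted mean:", sum(source_x) / len(source_x))
print("reweighted mean:", weighted_mean(source_x, weights))
```

The sketch also hints at why the approach needs the two distributions to be close: the further the target mean moves from the source mean, the more the total weight concentrates on a handful of samples and the noisier the estimate becomes.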

Deep transfer learning

Deep neural networks can learn features from raw data end-to-end, including both general and task-specific features. The next question is how to decide which layers' features to transfer. There is no theoretical answer to this question yet. However, experiments show that:

  • Features represented by the weights in the first few layers are more general.
  • Fine-tuning the neural network can improve performance significantly.
  • Transferred weights work better than randomly initialized weights.
  • Transferring the weights of some layers can accelerate learning.

Fine-tuning can accelerate learning and save training time. However, it cannot by itself overcome a difference in distribution between the training data and the test data. Adding adaptation layers can reduce this gap to some extent, and an additional loss term is added to the objective to account for the domain discrepancy.
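
As a sketch, the combined objective is commonly written in the following form. The symbols $\ell _ {c}$, $\ell _ {A}$ and $\lambda$ are notation assumed here for illustration rather than taken from a specific paper:

$$\ell = \ell _ {c} ( D _ {s} , \mathbf{y} _ {s} ) + \lambda \ell _ {A} ( D _ {s} , D _ {t} )$$

where $\ell _ {c}$ is the usual classification loss on the labeled source data, $\ell _ {A}$ measures the discrepancy between source and target features at the adaptation layers (MMD is one possible choice), and $\lambda \geq 0$ trades off the two terms.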


  • Artificial intelligence and human knowledge: Over our long history, we human beings have accumulated a large amount of knowledge. How do we transfer this knowledge to agents? How do we encode it into an agent? We can always find a way to encode some particular piece of knowledge, but the final goal is a general way to encode all human knowledge, and this is really hard. The most difficult part is finding a knowledge representation that is understandable to human beings as well as to intelligent agents.
  • Transitive transfer learning: Although there may be only minor similarity between two domains, everything in this world is connected in some way. If we can find a chain of similar intermediate domains that connects two different domains, we may be able to transfer knowledge from one end to the other along this chain. This is the basic idea of transitive transfer learning. Surprisingly, it works!
  • Learning to Transfer: The goal of learning to transfer is to learn when to transfer, what to transfer and how to transfer. The general method has two parts: learn transfer learning experience from previous cases, then apply it to new problems. Formally, we define a transfer learning experience as: $$E = (S,T,a,l)$$ where $S$ and $T$ are the source and target domains respectively, $a$ is a transfer learning algorithm, and $l$ is the performance improvement compared to learning without transfer. What is a useful transfer learning experience then? Everything that helps improve performance!
  • Online transfer learning: There has not been much work in this area yet.
  • Transfer reinforcement learning: It is a combination of transfer learning and reinforcement learning.