Some existing distances: let $\mathcal{X}$ be a compact metric set (the space in which the random variables take values), and let $\mathrm{Prob}(\mathcal{X})$ be the space of all probability distributions defined on $\mathcal{X}$. For two distributions $\mathbb{P}_r, \mathbb{P}_g \in \mathrm{Prob}(\mathcal{X})$, the paper considers four distances: the Total Variation (TV) distance, the Kullback–Leibler (KL) divergence, the Jensen–Shannon (JS) divergence, and the Earth-Mover (EM) distance, also called the Wasserstein-1 distance.
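Their definitions, following the WGAN paper (Arjovsky et al., 2017), are as below, where $\Sigma$ denotes the Borel subsets of $\mathcal{X}$, $P_r, P_g$ are densities (assumed to exist for the KL case), and $\Pi(\mathbb{P}_r, \mathbb{P}_g)$ is the set of all joint distributions whose marginals are $\mathbb{P}_r$ and $\mathbb{P}_g$:

$$
\delta(\mathbb{P}_r, \mathbb{P}_g) = \sup_{A \in \Sigma} \, |\mathbb{P}_r(A) - \mathbb{P}_g(A)|
$$

$$
KL(\mathbb{P}_r \,\|\, \mathbb{P}_g) = \int \log\!\left(\frac{P_r(x)}{P_g(x)}\right) P_r(x)\, d\mu(x)
$$

$$
JS(\mathbb{P}_r, \mathbb{P}_g) = KL(\mathbb{P}_r \,\|\, \mathbb{P}_m) + KL(\mathbb{P}_g \,\|\, \mathbb{P}_m), \qquad \mathbb{P}_m = \tfrac{1}{2}(\mathbb{P}_r + \mathbb{P}_g)
$$

$$
W(\mathbb{P}_r, \mathbb{P}_g) = \inf_{\gamma \in \Pi(\mathbb{P}_r, \mathbb{P}_g)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\, \|x - y\| \,\big]
$$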
The author uses an example to illustrate the superiority of the EM distance in terms of continuity:
Let $Z \sim U[0,1]$, let $\mathbb{P}_0$ be the distribution of the two-dimensional random variable $(0, Z)$, and let $\{\mathbb{P}_\theta\}$ be the family of distributions of the two-dimensional random variable $(\theta, Z)$, where $\theta$ is the (hyper)parameter of the family.
It can be seen that $\mathbb{P}_\theta$ and $\mathbb{P}_0$ are the same distribution if and only if $\theta = 0$, while for $\theta \neq 0$ they are two distributions with no intersection at all (their supports are disjoint parallel line segments). We can calculate the four distances in these two situations:
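For this example (Example 1 in the WGAN paper) the four distances evaluate to:

$$
W(\mathbb{P}_0, \mathbb{P}_\theta) = |\theta|, \qquad
JS(\mathbb{P}_0, \mathbb{P}_\theta) =
\begin{cases}
\log 2 & \theta \neq 0 \\
0 & \theta = 0
\end{cases}
$$

$$
KL(\mathbb{P}_\theta \,\|\, \mathbb{P}_0) = KL(\mathbb{P}_0 \,\|\, \mathbb{P}_\theta) =
\begin{cases}
+\infty & \theta \neq 0 \\
0 & \theta = 0
\end{cases}, \qquad
\delta(\mathbb{P}_0, \mathbb{P}_\theta) =
\begin{cases}
1 & \theta \neq 0 \\
0 & \theta = 0
\end{cases}
$$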
Comparing these four distances, only the EM distance is continuous with respect to $\theta$, and only under the EM distance does the distribution family $\mathbb{P}_\theta$ converge to $\mathbb{P}_0$ as $\theta \to 0$. When the two distributions are completely disjoint, the other distances are constant in $\theta$, so their derivatives with respect to $\theta$ are all 0, which makes gradient descent unable to learn.
It is very difficult to compute the EM distance directly, since the infimum over all joint distributions is intractable. The author therefore uses the Kantorovich–Rubinstein duality to rewrite the distance in another form:
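The dual form, as given in the WGAN paper, is:

$$
W(\mathbb{P}_r, \mathbb{P}_\theta) = \sup_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{x \sim \mathbb{P}_\theta}[f(x)]
$$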
The formula above takes the supremum over all functions $f$ satisfying the 1-Lipschitz condition $\|f\|_L \le 1$.
If the 1-Lipschitz condition is replaced by a K-Lipschitz condition (where $K$ is an arbitrary constant), the supremum yields $K \cdot W(\mathbb{P}_r, \mathbb{P}_\theta)$ instead. So if we have a parameterized family of functions $\{f_w\}_{w \in \mathcal{W}}$ that all satisfy the K-Lipschitz condition for some $K$, the problem becomes one of finding the optimal $w$:
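With such a parameterized family, the paper considers solving

$$
\max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(g_\theta(z))]
$$

which estimates the EM distance up to the multiplicative constant $K$.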
Here we can use a neural network, a universal approximator of functions, as $f$, substitute it into the formula above, and finally obtain the optimization objective of WGAN as follows:
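Written in the usual generator $G$ / critic $D$ notation, the objective is:

$$
\min_{G} \; \max_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] - \mathbb{E}_{z \sim p(z)}[D(G(z))]
$$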
where $\mathcal{D}$ denotes the family of functions satisfying the 1-Lipschitz condition.
The training process of WGAN is as follows:
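What follows is a minimal PyTorch-style sketch of the training loop (Algorithm 1 in the paper), assuming toy fully connected networks, a hypothetical `sample_real` data sampler, and the hyperparameters suggested in the paper (n_critic = 5, clip value c = 0.01, RMSProp with lr = 5e-5). It is an illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Toy generator and critic; the real architectures depend on the data.
latent_dim, data_dim = 64, 2
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.ReLU(), nn.Linear(128, 1))  # no sigmoid: the critic outputs a score, not a probability

# The paper recommends RMSProp (no momentum) with a small learning rate.
opt_G = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_D = torch.optim.RMSprop(D.parameters(), lr=5e-5)
clip_value, n_critic, batch_size = 0.01, 5, 64

def sample_real(n):
    # Placeholder for the real data loader (a shifted 2-D Gaussian here).
    return torch.randn(n, data_dim) + 2.0

for step in range(10_000):
    # 1) Update the critic n_critic times per generator update.
    for _ in range(n_critic):
        x_real = sample_real(batch_size)
        x_fake = G(torch.randn(batch_size, latent_dim)).detach()
        # Critic maximizes E[D(real)] - E[D(fake)], i.e. minimizes its negative.
        loss_D = -(D(x_real).mean() - D(x_fake).mean())
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()
        # Weight clipping: a crude way to keep D (approximately) K-Lipschitz.
        with torch.no_grad():
            for p in D.parameters():
                p.clamp_(-clip_value, clip_value)

    # 2) Update the generator: minimize -E[D(G(z))] (no log in the loss).
    loss_G = -D(G(torch.randn(batch_size, latent_dim))).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```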
It is not difficult to see that the better the critic D is trained, the more accurately it reflects the true Wasserstein distance. Therefore, the author also proposes that the value of this loss function can be used as an approximation of the Wasserstein distance to measure the learning quality of WGAN.
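As an illustration (a hypothetical helper, not from the paper), the estimate can be read off a trained critic as the gap between its mean scores on real and generated samples:

```python
import torch

def wasserstein_estimate(D, x_real, x_fake):
    # Estimate of W(P_r, P_g) from a trained critic D, up to the Lipschitz
    # constant enforced by weight clipping; useful as a training-progress curve.
    with torch.no_grad():
        return (D(x_real).mean() - D(x_fake).mean()).item()
```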
The main points of the training procedure above are summarized as follows: the critic's last layer has no sigmoid, so its output is an unbounded score rather than a probability; neither the generator loss nor the critic loss takes a log; the critic's weights are clipped to a fixed range $[-c, c]$ after every update to enforce the Lipschitz constraint; and a non-momentum optimizer such as RMSProp is recommended.
A few notes from experience:
WGAN is easier to train than the original GAN. As for mode collapse, the author only mentions that this phenomenon was not observed in the experiments.
Definition of Lipschitz condition:
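For a constant $K \ge 0$, a function $f$ satisfies the Lipschitz condition if

$$
|f(x_1) - f(x_2)| \le K\,|x_1 - x_2| \quad \text{for all } x_1, x_2 .
$$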
Intuitively speaking, the slope of the line connecting any two points on the graph of the function has absolute value at most $K$.
Functions that satisfy the above condition are called Lipschitz continuous. Compared with an ordinary continuous function, a Lipschitz continuous function is smoother: its change over any interval can grow at most linearly with the length of the interval, and the slope of that linear bound is at most the Lipschitz constant $K$.
In nonconvex optimization, the Lipschitz condition provides a kind of bound on how fast a function can change.
This article is written based on my own understanding, so ambiguities, mistakes, or self-invented (but hopefully easy-to-understand) terms are inevitable. Please point out any mistakes.