The main structure of a GAN consists of a generator G and a discriminator D.
Let us work through a hands-on example to take a closer look at the structure of a GAN.
Suppose we have a large dataset of handwritten digits, and we hope to use a GAN to generate new handwritten-digit images. The discussion consists of the following two parts:
1. understanding the objective function;
2. the derivation and origin of the objective function.
The objective function of a GAN is:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 - D(G(z)))]

The task of the discriminator D is to maximize the expression on the right, and the task of the generator G is to minimize it.
First, let us decompose the terms appearing in this formula, chiefly D(x) and 1 - D(G(z)).
D(x) is the probability that the discriminator D believes a sample comes from the real distribution, and D(G(z)) is the probability that D mistakenly judges a fake sample produced by the generator G to be real. The task of D is therefore to maximize D(x) and to minimize D(G(z)), i.e. to maximize 1 - D(G(z)); combining the two, D maximizes the product D(x)(1 - D(G(z))). Taking logarithms does not change where the maximum lies and is more convenient, giving log D(x) + log(1 - D(G(z))).
The generator G, on the contrary, wants its samples to look real enough, i.e. it wants D(G(z)) to be as large as possible. Since log D(x) has no dependence on G at all, the objective of G can simply be min log(1 - D(G(z))); adding the constant log D(x) (or any other constant) to it does not change the optimization.
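To make the two roles concrete, here is a minimal NumPy sketch that estimates the value function from a batch of samples. The discriminator outputs D_x and D_Gz below are made-up numbers standing in for a real network's predictions:

```python
import numpy as np

# Hypothetical discriminator outputs on a batch of real samples x and a
# batch of generated samples G(z); in a real model these would come from
# a neural network, here they are illustrative numbers.
D_x = np.array([0.9, 0.8, 0.95, 0.7])    # D(x): confidence that real samples are real
D_Gz = np.array([0.2, 0.1, 0.3, 0.25])   # D(G(z)): confidence that fake samples are real

# Monte Carlo estimate of the value function
# V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]
V = np.mean(np.log(D_x)) + np.mean(np.log(1.0 - D_Gz))
print(V)  # D wants this to be large, G wants it to be small
```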
Derivation and origin of the objective function
The discriminator here is a classifier used to distinguish real samples from fake ones, and cross entropy is the usual way to measure the similarity between distributions. The cross-entropy formula is as follows:

H(p, q) = -Σ_x p(x) log q(x)
In the formula, p and q are the true sample distribution and the distribution produced by the generator, respectively. More background on cross entropy itself can be found in the reference links at the end of this article.
In the current model the discriminator solves a binary classification problem, so the basic cross entropy can be expanded more concretely as follows:

H((x, y), D) = -y log D(x) - (1 - y) log(1 - D(x))
Here y is the label of the sample: y = 1 for samples drawn from the correct (real) distribution, and y = 0 (so 1 - y = 1) for generated samples. D(x) denotes the probability that the discriminator judges the sample to be real, and 1 - D(x) is correspondingly the probability of judging it to be fake.
Extending the formula above to N samples and summing over them gives:

H((x_i, y_i)_{i=1}^N, D) = -Σ_{i=1}^N [y_i log D(x_i) + (1 - y_i) log(1 - D(x_i))]
So far this is still ordinary binary classification. Let us now add the features specific to GANs.
A sample point in a GAN has one of two sources: either it is a real sample x ~ P_data(x), or it is a sample produced by the generator, x ~ P_G(x) (the generator itself being driven by noise z ~ p_z(z) fed into it).
Samples from the real distribution should be judged as correct (y = 1), while generated samples should be judged as wrong (y = 0). Writing the sum in the expectation form of the probability distributions (which corresponds to the case of infinitely many samples), letting each source occur with probability 1/2, and representing the generated samples through G(z), we obtain:

H = -(1/2) E_{x~P_data(x)}[log D(x)] - (1/2) E_{z~p_z(z)}[log(1 - D(G(z)))]
Apart from the constant factor 1/2 and the overall sign, this is exactly the value function V(D, G) given at the beginning: minimizing the discriminator's cross-entropy loss is the same as maximizing V(D, G).
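As a quick sanity check (a sketch with made-up discriminator outputs and equal numbers of real and fake samples), the ordinary binary cross-entropy over the mixed batch coincides with the expectation form just derived:

```python
import numpy as np

def binary_cross_entropy(y, d):
    """Cross entropy -y*log(d) - (1-y)*log(1-d), averaged over samples."""
    return -np.mean(y * np.log(d) + (1 - y) * np.log(1 - d))

# Illustrative discriminator outputs: first half on real samples (label 1),
# second half on generated samples (label 0).
D_x  = np.array([0.9, 0.8, 0.95, 0.7])   # D(x) on real samples
D_Gz = np.array([0.2, 0.1, 0.3, 0.25])   # D(G(z)) on fake samples

d = np.concatenate([D_x, D_Gz])
y = np.concatenate([np.ones_like(D_x), np.zeros_like(D_Gz)])

bce = binary_cross_entropy(y, d)
# The expectation form derived above, with each source weighted by 1/2:
H = -0.5 * np.mean(np.log(D_x)) - 0.5 * np.mean(np.log(1 - D_Gz))
print(bce, H)  # identical: the discriminator loss is -V(D, G) / 2
```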
Given the real data distribution P_data(x) and the generated distribution P_G(x; θ), a GAN hopes to find a set of parameters θ that minimizes the distance between P_G and P_data, i.e. a set of generator parameters that lets the generator produce very realistic pictures.
We can now draw a set of real pictures from the training set to train the parameter θ of the distribution P_G(x; θ), making it approach the real distribution. So we draw m real samples {x^(1), x^(2), ..., x^(m)} from P_data(x), and for each real sample x^(i) we can compute P_G(x^(i); θ), the probability that this sample occurs under the generated distribution determined by θ. This lets us construct the likelihood function:

L = ∏_{i=1}^m P_G(x^(i); θ)
The likelihood L is thus the joint probability, under the distribution P_G, of all the real samples drawn from P_data. If the distributions P_G and P_data are similar, the real data points are likely to occur under P_G, so this probability will be high.
Next, we maximize the likelihood function L to find the generated distribution closest to the real distribution (i.e. the optimal parameters θ*):

θ* = arg max_θ ∏_{i=1}^m P_G(x^(i); θ)
To maximize the likelihood L, we take its logarithm, turning the product ∏ into a sum Σ; this does not change the optimization result. The maximum likelihood estimate can then be approximated as the θ that maximizes an expectation, and the expectation can be expanded into an integral over x. Because the optimization is over θ, a term that does not contain θ can be added without affecting the result; adding the integral -∫_x P_data(x) log P_data(x) dx lets us combine the two integrals into a form that is exactly a KL divergence. The process is as follows:

θ* = arg max_θ Σ_{i=1}^m log P_G(x^(i); θ)
   ≈ arg max_θ E_{x~P_data}[log P_G(x; θ)]
   = arg max_θ [∫_x P_data(x) log P_G(x; θ) dx - ∫_x P_data(x) log P_data(x) dx]
   = arg min_θ ∫_x P_data(x) log [P_data(x) / P_G(x; θ)] dx
The final integral is exactly the integral form of the KL divergence KL(P_data || P_G). Therefore, to find the parameters θ that make the generated distribution as close as possible to the real distribution, we only need the θ that minimizes the KL divergence. With these optimal parameters, the images produced by the generator will appear very real.
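A small numeric illustration of this equivalence, using a toy discrete P_data and a hypothetical one-parameter family P_G(θ): the θ that maximizes the expected log-likelihood under P_data is exactly the θ that minimizes KL(P_data || P_G):

```python
import numpy as np

p_data = np.array([0.2, 0.5, 0.3])  # toy "real" distribution

def p_g(theta):
    # Hypothetical generated distribution parameterized by theta.
    w = np.array([1.0, np.exp(theta), np.exp(2 * theta)])
    return w / w.sum()

def expected_loglik(theta):
    return np.sum(p_data * np.log(p_g(theta)))

def kl(p, q):
    return np.sum(p * np.log(p / q))

thetas = np.linspace(-2, 2, 2001)
best_mle = thetas[np.argmax([expected_loglik(t) for t in thetas])]
best_kl  = thetas[np.argmin([kl(p_data, p_g(t)) for t in thetas])]
print(best_mle, best_kl)  # the same theta maximizes one and minimizes the other
```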
Next, it is necessary to prove that the optimization problem has a unique solution G*, and that this unique solution satisfies P_G = P_data. However, before deriving the optimal discriminator and the optimal generator, we should mention Scott Rome's view on the derivation in the original paper: he thinks the original paper ignores the invertibility condition, so the derivation of the optimal solution is not airtight.
One idea in the original GAN paper differs from many other approaches: the generator G does not need to satisfy any invertibility condition. Scott Rome considers this important because, in practice, G is indeed not invertible. Yet many proof notes ignore this point and incorrectly apply the change-of-variables (integral substitution) formula in their proofs, and integral substitution relies precisely on G being invertible. Scott argues that the proof can only be based on the following equality:

E_{z~p_z(z)}[log(1 - D(G(z)))] = E_{x~P_G(x)}[log(1 - D(x))]
This equality comes from the Radon-Nikodym theorem of measure theory; it is used in Proposition 1 of the original paper and expressed as the following equation:

∫_z p_z(z) log(1 - D(G(z))) dz = ∫_x P_G(x) log(1 - D(x)) dx
Many lecture notes, by contrast, use the change-of-variables formula; but a change of variables requires computing G^(-1), and the inverse of G is never assumed to exist (and, for neural networks in practice, it does not exist). Perhaps this step is so common in the machine learning and statistics literature that it simply gets overlooked.
In the first step of the minimax game, given the generator G, the optimal discriminator D is obtained by maximizing V(D, G). Maximizing V(D, G) amounts to computing the difference, or distance, between P_G and P_data. In the original paper, the value function is written as an integral over x, i.e. the mathematical expectations are expanded into integral form:

V(G, D) = ∫_x [P_data(x) log D(x) + P_G(x) log(1 - D(x))] dx
Finding the maximum of the integral can be reduced to finding the maximum of the integrand. The purpose of maximizing the integrand is to find the optimal discriminator D, so any term that does not involve the discriminator can be treated as a constant. Since P_data(x) and P_G(x) are both scalars at each fixed x, the integrand can be expressed as a log D(x) + b log(1 - D(x)).
Writing y for the discriminator output D(x), the integrand becomes:

f(y) = a log y + b log(1 - y)
To find the extreme point, assuming a + b ≠ 0, we set the first derivative to zero:

f'(y) = a/y - b/(1 - y) = 0  ⇒  y* = a/(a + b)
Checking the second derivative of f(y) at the stationary point:

f''(y*) = -a/(y*)^2 - b/(1 - y*)^2 < 0
where a, b ∈ (0, 1). Since the first derivative is zero and the second derivative is negative, y* = a/(a + b) is a maximum. Substituting a = P_data(x) and b = P_G(x) into this extremum gives the optimal discriminator D*(x) = P_data(x)/(P_data(x) + P_G(x)).
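The closed-form optimum is easy to verify numerically; the following sketch (with arbitrary values for a and b) compares a brute-force grid search against a/(a + b):

```python
import numpy as np

# Verify numerically that f(y) = a*log(y) + b*log(1-y) is maximized at
# y* = a / (a + b), as derived above. a and b play the roles of
# P_data(x) and P_G(x) at a fixed point x.
a, b = 0.7, 0.4  # arbitrary positive densities at some x

y = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
f = a * np.log(y) + b * np.log(1 - y)

y_numeric = y[np.argmax(f)]
y_closed  = a / (a + b)
print(y_numeric, y_closed)  # both approximately 0.6364
```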
Finally, we can write the value function at the optimal discriminator as:

C(G) = max_D V(G, D)
     = E_{x~P_data}[log D*(x)] + E_{x~P_G}[log(1 - D*(x))]
     = ∫_x P_data(x) log [P_data(x)/(P_data(x) + P_G(x))] + P_G(x) log [P_G(x)/(P_data(x) + P_G(x))] dx
Setting D(x) = P_data(x)/(P_data(x) + P_G(x)) therefore maximizes the value function V(G, D). Since f(y) has a unique maximum on its domain, the optimal D is also unique; no other D attains the maximum.
In practice the optimal D is not computable, but it is mathematically important. We do not know P_data(x) a priori, so we can never use it exactly during training. On the other hand, its existence lets us prove that an optimal G exists, and during training we only need to approximate D*.
Of course, the goal of the GAN training process is to make P_G = P_data. What does this mean for the optimal D? Substituting this equality into the expression for D*_G gives:

D*_G(x) = P_data(x)/(P_data(x) + P_G(x)) = 1/2
This means the discriminator is completely confused: it cannot distinguish P_data from P_G at all, and judges a sample to come from either source with probability 1/2. Building on this observation, the GAN authors proved that this G is the solution of the minimax game. The theorem is as follows:
"If and only if P_G=P_data, the global minimum of the training standard C(G)=maxV(G, d) can be achieved. 」
The above theorem is the second step of the minimax game: finding the generator G that minimizes C(G) = V(G, D*), where D* is the optimal discriminator from the first step. The value function is minimized exactly when P_G(x) = P_data(x), because the JS divergence JSD(P_data(x) || P_G(x)) of the two distributions then equals zero. This process is explained in detail below.
The theorem in the original paper is an "if and only if" statement, so it must be proved in both directions: first approach the value of C(G) from the reverse direction, then use what is learned there to prove the forward direction. "Reverse" means assuming the optimality condition in advance and deriving from it: let P_G = P_data, and we can deduce:

C(G) = E_{x~P_data}[log(1/2)] + E_{x~P_G}[log(1/2)] = -log 4
This value is a candidate for the global minimum, since it arises only when P_G = P_data. We now need to prove, in the forward direction, that this value is always the minimum, so that both the "if" and the "only if" conditions hold. Dropping the assumption P_G = P_data, for an arbitrary G we substitute the optimal discriminator D* obtained in the previous step into C(G) = max_D V(G, D):

C(G) = ∫_x P_data(x) log [P_data(x)/(P_data(x) + P_G(x))] + P_G(x) log [P_G(x)/(P_data(x) + P_G(x))] dx
Because -log 4 is the known candidate for the global minimum, we want to make log 2 appear in the equation. So we add and subtract log 2 inside each integral, multiplied by the corresponding probability density. This is a very common proof technique and does not change the equality, because in essence we are just adding 0:

C(G) = ∫_x (log 2 - log 2) P_data(x) + P_data(x) log [P_data(x)/(P_data(x) + P_G(x))]
         + (log 2 - log 2) P_G(x) + P_G(x) log [P_G(x)/(P_data(x) + P_G(x))] dx
The main purpose of this technique is to construct a form containing both log 2 and the JS divergence. Simplifying the formula above yields:

C(G) = -log 4 + ∫_x P_data(x) (log 2 + log [P_data(x)/(P_data(x) + P_G(x))]) dx
               + ∫_x P_G(x) (log 2 + log [P_G(x)/(P_data(x) + P_G(x))]) dx
This simplification uses the fact that, by the definition of a probability density, P_data and P_G each integrate to 1 over their domain:

∫_x P_data(x) dx = ∫_x P_G(x) dx = 1

so the two -log 2 terms together contribute -log 2 - log 2 = -log 4.
In addition, by the rules of logarithms we have:

log 2 + log [P_data(x)/(P_data(x) + P_G(x))] = log [P_data(x)/((P_data(x) + P_G(x))/2)]
Substituting this back, C(G) can be written as:

C(G) = -log 4 + ∫_x P_data(x) log [P_data(x)/((P_data(x) + P_G(x))/2)] dx
               + ∫_x P_G(x) log [P_G(x)/((P_data(x) + P_G(x))/2)] dx
Readers familiar with the KL divergence will now recognize that each integral is exactly a KL divergence:

C(G) = -log 4 + KL(P_data || (P_data + P_G)/2) + KL(P_G || (P_data + P_G)/2)
KL divergence is non-negative, so we can immediately see that -log4 is the global minimum of C(G).
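The decomposition can be checked numerically with toy discrete distributions standing in for P_data and P_G; the C(G) computed from the optimal discriminator matches -log 4 plus the two KL terms:

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

# Toy discrete distributions standing in for P_data and P_G.
p_data = np.array([0.1, 0.4, 0.5])
p_g    = np.array([0.3, 0.3, 0.4])
m = (p_data + p_g) / 2

# C(G) with the optimal discriminator D*(x) = p_data / (p_data + p_g):
d_star = p_data / (p_data + p_g)
C = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1 - d_star))

# The decomposition derived above:
C_decomposed = -np.log(4) + kl(p_data, m) + kl(p_g, m)
print(C, C_decomposed)  # equal, and both >= -log(4)
```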
It remains to prove that only one G attains this value: since C(G) = -log 4 holds only at P_G = P_data, that point is unique, and the whole proof is then complete.
As seen in the previous article, the KL divergence is asymmetric, so the term KL(P_data || (P_data + P_G)/2) in C(G) on its own treats the two distributions asymmetrically; but paired with the second term KL(P_G || (P_data + P_G)/2), the sum becomes symmetric. The sum of these two KL divergences can be expressed via the JS divergence (Jensen-Shannon divergence):

JSD(P || Q) = (1/2) KL(P || M) + (1/2) KL(Q || M),  where M = (P + Q)/2
In words: given two distributions P and Q with average distribution M = (P + Q)/2, the JS divergence between P and Q is half the KL divergence between P and M plus half the KL divergence between Q and M.
The JS divergence ranges from 0 to log 2: it reaches its maximum log 2 when the two distributions are completely disjoint, and its minimum 0 when the two distributions are identical.
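These two bounds are easy to confirm numerically for discrete distributions (a sketch; the kl helper uses the standard convention 0·log 0 = 0):

```python
import numpy as np

def kl(p, q):
    # Treat 0 * log(0/q) as 0, the standard convention.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])
q_same     = np.array([0.5, 0.5, 0.0, 0.0])  # identical to p
q_disjoint = np.array([0.0, 0.0, 0.5, 0.5])  # completely disjoint support

print(jsd(p, q_same))      # 0.0: identical distributions
print(jsd(p, q_disjoint))  # log(2) ~= 0.6931: disjoint distributions
```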
Therefore, by the definition of the JS divergence, C(G) can be rewritten as:

C(G) = -log 4 + 2 · JSD(P_data || P_G)
This divergence is in fact the square of the Jensen-Shannon distance metric. By its properties, JSD(P_data || P_G) = 0 when P_G = P_data. In summary, the optimal generator is obtained if and only if the generated distribution equals the real data distribution.
We have thus proved that P_G = P_data is the optimum of the minimax game V(G, D). In addition, the original paper also proves that, given enough training data and the right setting, the training process will converge to the optimal G.
The proof goes as follows: regard V(G, D) = U(P_G, D) as a function of P_G; then U is convex in P_G, and the subderivatives of the supremum of a family of convex functions include the derivative of the function at the point where the maximum is attained. Therefore, given D, when P_G is updated by gradient descent to optimize G, P_G will converge to its optimum. Since it was proved above that the objective function has a unique global optimum, P_G converges to P_data.
In practice, however, optimizing G updates the parameters θ_G rather than P_G itself.
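To close the loop with the handwritten-digit example from the beginning, here is a minimal PyTorch sketch of the alternating training procedure. It is not the referenced project's implementation; the network sizes and hyperparameters are illustrative, and the generator step uses the common non-saturating variant (maximize log D(G(z)) instead of minimizing log(1 - D(G(z)))):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Simple fully connected generator and discriminator; sizes are illustrative.
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()  # binary cross entropy, as in the derivation above

data = DataLoader(
    datasets.MNIST("data", train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.5,), (0.5,))])),
    batch_size=128, shuffle=True)

for epoch in range(5):
    for x, _ in data:
        x = x.view(x.size(0), -1)            # flatten 28x28 images
        real = torch.ones(x.size(0), 1)
        fake = torch.zeros(x.size(0), 1)

        # Discriminator step: maximize log D(x) + log(1 - D(G(z))),
        # i.e. minimize the BCE with labels real=1, fake=0.
        z = torch.randn(x.size(0), 100)
        loss_D = bce(D(x), real) + bce(D(G(z).detach()), fake)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator step: the non-saturating variant maximizes
        # log D(G(z)), implemented as BCE against the "real" label.
        z = torch.randn(x.size(0), 100)
        loss_G = bce(D(G(z)), real)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```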
Reference links:
Generative Adversarial Networks (the original GAN paper)
An article on an intuitive understanding of generative adversarial networks
Machine Heart GitHub project: GAN's complete theoretical derivation and implementation
Paper reading: Generative Adversarial Networks