Overview: BEGAN does not directly estimate the distance between the generated distribution pg and the real distribution px; instead, it estimates the distance between the distributions of reconstruction errors.
If the error distributions are similar, pg and px can also be considered similar.
BEGAN designs the discriminator D as an autoencoder; the reconstruction errors it produces are what the losses are built from.
The distance between the error distributions is measured with the Wasserstein lower bound derived below.
BEGAN also puts forward an equilibrium concept for balancing the training of G and D, so that the GAN trains well even with a very simple network and without tricks such as batch normalization (BN) or minibatch discrimination.
It also proposes a hyperparameter that trades off sample diversity against quality, and a way to measure how far the model has converged.
Experiments show that BEGAN converges quickly and keeps the training of G and D balanced, but choosing the hyperparameter takes some experience.
BEGAN's main contributions are:
1. A standard training procedure that gives the GAN fast and stable convergence.
2. An equilibrium concept that balances the power of the discriminator and the generator.
3. A new way to control the trade-off between image diversity and visual quality.
4. An approximate measure of convergence.
Taking an autoencoder as the discriminator, the autoencoder loss distributions are matched using a loss derived from the Wasserstein distance (like KL divergence it measures the distance between distributions, but it additionally satisfies positive definiteness, symmetry and the triangle inequality).
L : R^{Nx} → R^+ is the loss for training a pixel-wise autoencoder, L(v) = |v − D(v)|^η, where:
BEGAN's idea is to use an autoencoder as the discriminator D and to match the distributions of reconstruction errors as closely as possible, rather than matching the sample distributions directly. If the error distributions are close enough, the distributions of real and generated samples will also be close.
D : R^{Nx} → R^{Nx} is the autoencoder function;
η ∈ {1, 2} is the target norm;
v ∈ R^{Nx} is a sample of dimension Nx.
μ1 and μ2 are two distributions of autoencoder losses;
Γ(μ1, μ2) is the set of all couplings of μ1 and μ2;
m1, m2 ∈ R are their respective means.
The Wasserstein-1 distance between them is:
W1(μ1, μ2) = inf over γ ∈ Γ(μ1, μ2) of E[|x1 − x2|],
where (x1, x2) are drawn from a coupling γ.
Using Jensen's inequality, a lower bound of W1(μ1, μ2) can be derived (Equation 1):
W1(μ1, μ2) ≥ |m1 − m2|.
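For completeness, a short version of that derivation (Jensen's inequality applied to the convex function |·|, then linearity of expectation; the inner expectation no longer depends on γ because every coupling has the same marginals):

```latex
\begin{aligned}
W_1(\mu_1, \mu_2)
  &= \inf_{\gamma \in \Gamma(\mu_1, \mu_2)}
       \mathbb{E}_{(x_1, x_2) \sim \gamma}\bigl[\, |x_1 - x_2| \,\bigr] \\
  &\ge \inf_{\gamma \in \Gamma(\mu_1, \mu_2)}
       \bigl|\, \mathbb{E}_{(x_1, x_2) \sim \gamma}[\, x_1 - x_2 \,] \,\bigr|
       \qquad \text{(Jensen's inequality)} \\
  &= |m_1 - m_2|
       \qquad \text{(fixed marginals: } \mathbb{E}[x_1] = m_1,\ \mathbb{E}[x_2] = m_2\text{)}
\end{aligned}
```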
The goal is to optimize a lower bound of the Wasserstein distance between the autoencoder loss distributions, not between the sample distributions directly.
The discriminator is therefore designed to maximize |m1 − m2|, the lower bound in Equation 1.
Let μ1 be the distribution of the loss L(x), where x are real samples.
Let μ2 be the distribution of the loss L(G(z)),
where G : R^{Nz} → R^{Nx} is the generator function,
and z ∈ [−1, 1]^{Nz} is a uniform random sample of dimension Nz.
Since m1, m2 ∈ R+, maximizing |m1 − m2| has only two possible solutions:
(a) W1(μ1, μ2) ≥ m1 − m2, with m1 → ∞ and m2 → 0; or
(b) W1(μ1, μ2) ≥ m2 − m1, with m1 → 0 and m2 → ∞.
We choose solution (b) as our objective, because minimizing m1 naturally leads to auto-encoding the real images.
Given the discriminator and generator parameters θD and θG, each is updated by minimizing its own loss, LD and LG respectively.
Expressed as a GAN objective, where zD and zG are samples of z:
LD = L(x; θD) − L(G(zD; θG); θD), minimized for θD;
LG = −LD, minimized for θG.
These are the basic losses of D and G before the equilibrium term is introduced.
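A minimal sketch of these pre-equilibrium losses in PyTorch-style Python (the pixel_loss helper, the η = 1 choice and the D/G callables are illustrative assumptions, not the authors' code):

```python
def pixel_loss(v, v_rec, eta=1):
    # L(v) = |v - D(v)|^eta averaged over the batch; eta in {1, 2} is the target norm
    # v and v_rec are PyTorch tensors of the same shape
    return (v - v_rec).abs().pow(eta).mean()

def began_basic_losses(D, G, x, z_D):
    """Pre-equilibrium BEGAN losses: L_D = L(x) - L(G(z_D)), L_G = -L_D.
    In a real loop, D and G are updated from separate passes (detaching G's output
    when updating D); that bookkeeping is omitted here for brevity."""
    L_x = pixel_loss(x, D(x))      # reconstruction loss on real samples
    g = G(z_D)
    L_g = pixel_loss(g, D(g))      # reconstruction loss on generated samples
    loss_D = L_x - L_g
    loss_G = -loss_D
    return loss_D, loss_G
```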
2. Equilibrium:
When the losses of the generator and the discriminator are balanced, the following condition holds:
E[L(x)] = E[L(G(z))].
If the discriminator cannot distinguish generated samples from real ones, their error distributions, including the expected errors, should be the same; this is the equilibrium.
To balance these two roles, the equilibrium is relaxed with a hyperparameter γ defined by γ = E[L(G(z))] / E[L(x)]. A balanced D can then both auto-encode real images and discriminate its inputs correctly. When γ is small, the numerator E[L(G(z))] is kept small relative to the denominator E[L(x)], so the model concentrates on reconstructing real images accurately, and G only needs to generate images that fool D, i.e. images that D reconstructs well.
The discriminator has two competing goals:
1. Auto-encode real images.
2. Discriminate real images from generated images.
The γ term allows us to balance these two goals.
Lower values of γ lead to lower image diversity, because the discriminator then focuses more heavily on auto-encoding real images.
γ is called the diversity ratio. There is a natural boundary within which images are sharp and detailed.
For θD, θG and each training step t, the BEGAN objective is:
LD = L(x) − kt · L(G(zD)), minimized for θD;
LG = L(G(zG)), minimized for θG;
kt+1 = kt + λk (γ L(x) − L(G(zG))), updated at each step t.
γ ∈ [0, 1] is a hyperparameter; the smaller its value, the lower the diversity of the generated samples.
λk is the step size for updating kt to kt+1, and kt expresses how much weight is put on the discrimination term L(G(zD)) at step t.
Proportional control theory is used to maintain this equilibrium.
It is implemented with a variable kt ∈ [0, 1] that controls how much emphasis is placed on L(G(zD)) during gradient descent.
We initialize k0 = 0. λk is the proportional gain of k;
in machine learning terms, it is the learning rate of k.
We used λk = 0.001 in the experiments.
Essentially, this can be seen as a form of closed-loop feedback control, in which kt is adjusted at each step to maintain Equation 4.
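A sketch of one training step with this proportional control, in PyTorch-style Python (the η = 1 pixel loss, the clamping of k to [0, 1], and all function names are illustrative assumptions; the LD, LG and kt+1 updates above are what the code follows):

```python
def began_train_step(D, G, opt_D, opt_G, x, z_D, z_G, k, gamma=0.5, lambda_k=0.001):
    """One BEGAN step: update D with L_D = L(x) - k_t * L(G(z_D)), update G with
    L_G = L(G(z_G)), then adjust k_t by proportional control. D is an autoencoder,
    G a generator, opt_D/opt_G their independent optimizers, k the scalar k_t."""
    def pixel_loss(v, v_rec):
        return (v - v_rec).abs().mean()           # L(v) = |v - D(v)| with eta = 1

    # Discriminator update: minimize L(x) - k_t * L(G(z_D))
    L_x = pixel_loss(x, D(x))
    fake_D = G(z_D).detach()                      # no gradient into G here
    loss_D = L_x - k * pixel_loss(fake_D, D(fake_D))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator update: minimize L(G(z_G))
    fake_G = G(z_G)
    L_g = pixel_loss(fake_G, D(fake_G))
    opt_G.zero_grad(); L_g.backward(); opt_G.step()

    # Proportional control: k_{t+1} = k_t + lambda_k * (gamma * L(x) - L(G(z_G)))
    k = k + lambda_k * (gamma * L_x - L_g).item()
    k = min(max(k, 0.0), 1.0)                     # keep k_t in [0, 1]
    return k, L_x.item(), L_g.item()
```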
In the early stage of training, G tends to generate data that the autoencoder can reconstruct easily, because its outputs are still close to zero and the real data distribution has not yet been learned accurately.
At this point L(x) > L(G(z)).
In contrast to traditional training, which requires alternating updates of D and G or pre-training D, BEGAN needs neither of these to train stably.
Adam with default hyperparameters is used for training.
θD and θG are updated independently by separate Adam optimizers according to their respective losses.
We typically use a batch size of n = 16.
3. Global convergence indicators:
Determining the convergence of GANs is usually difficult, because the original formulation is defined as a zero-sum game:
as a result, one loss goes up as the other goes down.
Using the equilibrium concept, a global measure of convergence can be derived: the convergence process can be framed as finding the lowest reconstruction loss L(x) together with the lowest absolute value of the instantaneous process error of the proportional control, |γL(x) − L(G(zG))|. The measure is the sum of these two terms:
M_global = L(x) + |γL(x) − L(G(zG))|.
This metric can be used to determine when the network has reached its final state or whether the model has collapsed, i.e., whether training has converged.
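A one-line helper for this measure, assuming the two batch-averaged losses are already available (names are illustrative):

```python
def convergence_measure(L_x, L_g, gamma):
    # M_global = L(x) + |gamma * L(x) - L(G(z_G))|; lower means closer to convergence
    return L_x + abs(gamma * L_x - L_g)
```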
4. Model architecture
The discriminator D : R^{Nx} → R^{Nx} is a convolutional deep neural network whose architecture is an autoencoder.
Nx = H × W × C is shorthand for the dimensionality of x,
where H, W and C are the height, width and number of color channels.
The autoencoder uses both a deep encoder and a deep decoder; the intent is to be as simple as possible and to avoid typical GAN tricks.
The structure is shown in Figure 1. We use 3 × 3 convolutions with exponential linear units (ELUs) applied at their outputs.
Each layer is repeated a number of times (typically 2). We observed that more repetitions gave better visual results.
The number of convolution filters increases linearly with each downsampling.
Downsampling is implemented as sub-sampling with stride 2, and upsampling is done by nearest-neighbor interpolation.
At the boundary between the encoder and the decoder, the tensor of processed data is mapped to and from an embedding state h ∈ R^{Nh} via fully connected layers, without any non-linearities, where Nh is the dimension of the autoencoder's hidden state.
The generator G : R^{Nz} → R^{Nx} uses the same architecture as the discriminator's decoder (with different weights).
We made this choice only for simplicity. The input state z ∈ [−1, 1]^{Nz} is sampled uniformly.
This simple architecture achieves high-quality results and demonstrates the robustness of the technique.
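To make the description concrete, below is a minimal PyTorch sketch of such an autoencoder discriminator and a decoder-style generator. The 32 × 32 image size, the base filter count n = 64, and Nh = Nz = 64 are illustrative assumptions rather than the paper's settings, and the optional refinements described next (vanishing residuals and skip connections) are omitted.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride=1):
    # 3x3 convolution followed by an ELU, as described in the text
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1), nn.ELU())

class Encoder(nn.Module):
    def __init__(self, n=64, n_h=64):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(3, n),                    # 32x32
            conv_block(n, n),
            conv_block(n, 2 * n, stride=2),      # 16x16, filter count grows linearly
            conv_block(2 * n, 2 * n),
            conv_block(2 * n, 3 * n, stride=2),  # 8x8
            conv_block(3 * n, 3 * n),
        )
        self.fc = nn.Linear(3 * n * 8 * 8, n_h)  # embedding h, no non-linearity

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))

class Decoder(nn.Module):
    def __init__(self, n=64, n_h=64):
        super().__init__()
        self.n = n
        self.fc = nn.Linear(n_h, n * 8 * 8)      # project h to the first 8x8xn tensor
        self.net = nn.Sequential(
            conv_block(n, n), conv_block(n, n),
            nn.Upsample(scale_factor=2, mode="nearest"),   # 16x16
            conv_block(n, n), conv_block(n, n),
            nn.Upsample(scale_factor=2, mode="nearest"),   # 32x32
            conv_block(n, n), conv_block(n, n),
            nn.Conv2d(n, 3, 3, padding=1),                 # back to image channels
        )

    def forward(self, h):
        return self.net(self.fc(h).view(-1, self.n, 8, 8))

class Discriminator(nn.Module):
    """Autoencoder discriminator D: reconstructs its input."""
    def __init__(self):
        super().__init__()
        self.enc, self.dec = Encoder(), Decoder()

    def forward(self, x):
        return self.dec(self.enc(x))

class Generator(nn.Module):
    """Generator reusing the decoder architecture, with z in [-1, 1]^{Nz} as input."""
    def __init__(self, n_z=64):
        super().__init__()
        self.dec = Decoder(n_h=n_z)

    def forward(self, z):
        return self.dec(z)
```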
In addition, two optional refinements help gradient propagation and produce sharper images. Inspired by deep residual networks [8], the network is initialized with vanishing residuals: for successive layers of the same size, the input of each layer is combined with its output: in_{x+1} = carry × in_x + (1 − carry) × out_x.
In the experiments, carry started at 1 and was gradually reduced to 0 over 16,000 steps.
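A sketch of this vanishing-residual combination; the linear annealing schedule is an assumption (the text only says carry goes from 1 to 0 over 16,000 steps):

```python
def carry_at(step, total_steps=16000):
    # carry decays from 1 to 0 over the first 16,000 steps (linear schedule assumed)
    return max(0.0, 1.0 - step / total_steps)

def vanishing_residual(layer, x, carry):
    # in_{x+1} = carry * in_x + (1 - carry) * out_x, for successive same-sized layers
    return carry * x + (1.0 - carry) * layer(x)
```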
We also introduce skip connections [8, 17, 9] to help gradient propagation. The first decoder tensor h0 is obtained by projecting h onto an 8 × 8 × n tensor. After each upsampling step, the output is concatenated with h0 upsampled to the same dimensions.
This creates a skip connection between the hidden state and each successive upsampling layer of the decoder.
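A sketch of that hidden-state skip connection (nearest-neighbor upsampling of h0 followed by channel-wise concatenation; the stage and tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def decoder_stage_with_skip(conv_stage, x, h0):
    """Concatenate h0 (the first 8x8xn decoder tensor), upsampled to x's spatial
    size, with the current feature map x before applying the next conv stage."""
    h0_up = F.interpolate(h0, size=x.shape[-2:], mode="nearest")
    return conv_stage(torch.cat([x, h0_up], dim=1))
```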
We have not explored other techniques commonly used in GANs, such as batch normalization, dropout, transposed convolutions, or exponential growth of the number of convolution filters, although they might further improve these results.
5. Understanding through experiments
When γ is varied, the diversity and quality of the generated results compare as follows:
the smaller the γ value, the sharper the generated images, but the more similar they look to one another;
the larger the γ value, the higher the diversity, but the image quality also declines.
BEGAN's spatial continuity is superior to that of other GANs.
As the model converges, image quality improves as well.
In summary, BEGAN makes substantial progress on the long-standing GAN problems: unstable training, poor control over the diversity of generated samples, and the difficulty of keeping the discriminator and generator balanced so that training converges.
Reference: Berthelot D., Schumm T., Metz L. BEGAN: Boundary Equilibrium Generative Adversarial Networks. arXiv:1703.10717, 2017.