I have stumbled every step of the way since I started researching deep learning with autoencoders.
In the last article, I used a convolutional autoencoder to denoise seismic signals (see "New Exploration of Earthquake Denoising: Unsupervised Convolutional Neural Networks in Practice"), and the result was dismal. As shown in the figure below, the noisy data is on top and the denoised output is on the bottom:
Judging from the result, only fragments of the effective signal were recovered, which is totally unacceptable.
Aren't convolutional neural networks supposed to be better at learning fine feature details and deliver better performance? Why were my results so miserable?
The previous settings were: 10,000 training patches of size 28×28; 5 training epochs; learning rate 0.001; optimizer tf.train.AdamOptimizer(learning_rate).minimize(cost); loss function tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits_), with cost = tf.reduce_mean(loss).
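For the record, the core of that setup looks roughly like this (a minimal TF 1.x sketch; the single stand-in conv layer for the network body and the placeholder shapes are my assumptions, everything else follows the settings above):

import tensorflow as tf

# Placeholders for the noisy 28x28 input patches and the clean targets.
inputs_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='inputs')
targets_ = tf.placeholder(tf.float32, (None, 28, 28, 1), name='targets')

# Stand-in for the encoder/decoder stack; the real network ends in a conv
# layer without activation, so the raw logits can feed the sigmoid loss.
logits_ = tf.layers.conv2d(inputs_, 1, (3, 3), padding='same', activation=None)

loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets_, logits=logits_)
cost = tf.reduce_mean(loss)
opt = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)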
The network structure diagram is as follows:
Training loss curve:
1. Optimizing the normalization
The terrible training loss caught my attention. I took the convergence failure to the internet to look for answers, and several experts said it is a symptom of normalization done badly.
So I tried two fixes first:
First, restrict the training samples to the range (-1, 1) by dividing the original values by the maximum absolute value, like this:
noise_imgs = noise_imgs / abs(noise_imgs).max()
Second, add batch normalization (BN) after every convolution in the network, like this:
conv1 = tf.layers.conv2d(inputs_, 64, (3, 3), padding='same', activation=tf.nn.relu)
conv1 = tf.layers.batch_normalization(conv1, training=True)
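One caveat I learned about later (whether it contributed here is my guess, but the mechanics are real): in TF 1.x, tf.layers.batch_normalization only updates its moving mean and variance if the update ops are explicitly attached to the training step, like this:

# BN's moving statistics live in tf.GraphKeys.UPDATE_OPS and must be run
# together with the optimizer, otherwise inference uses stale statistics.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    opt = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)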
After retraining, the improvement was negligible; the loss still failed to converge.
Many normalization schemes target (0, 1) instead, using min-max scaling:
imgs = (imgs - imgs.min()) / (imgs.max() - imgs.min())  # normalize to [0, 1]
With that scaling, the seismic data did not train at all, and the loss curve looked like this:
2. Adjusting the optimizer and loss function
When one plan fails, draw up another.
This time I wanted to modify the optimizer and the loss function.
In neural network training, the loss function measures the difference between the network's predicted output and the actual target values; it is the key function that makes learning possible. Common loss functions include the least-squares (MSE) loss, the cross-entropy loss, and the smooth L1 loss used in regression.
The optimization function propagates the loss at the network's output back toward the earlier layers, realizing backpropagation; it is the key to letting the network keep learning and converge. The basic gradient descent family includes stochastic gradient descent, batch gradient descent, momentum gradient descent, Adagrad, Adadelta, Adam, and so on.
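In TF 1.x both are one-line swaps (a sketch reusing logits_, targets_ and cost from above; the optimizer lines are alternatives, pick one):

# Least-squares (MSE) loss as an alternative to sigmoid cross-entropy:
cost = tf.reduce_mean(tf.square(logits_ - targets_))

# A few interchangeable optimizers, all minimizing the same cost:
opt = tf.train.GradientDescentOptimizer(0.001).minimize(cost)  # plain SGD
opt = tf.train.MomentumOptimizer(0.001, momentum=0.9).minimize(cost)
opt = tf.train.AdagradOptimizer(0.001).minimize(cost)
opt = tf.train.AdadeltaOptimizer(0.001).minimize(cost)
opt = tf.train.AdamOptimizer(0.001).minimize(cost)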
I started with the optimizer.
Since the learning rate of 0.001 would not converge, I tried 0.0001. This time the loss really did converge, as shown below:
But what about the predictions? They were a mess; the network failed to learn even the basic features, as shown below:
What was going on? My understanding is that when the learning rate is too low, the network learns overly fine-grained features and loses the ones we actually want. It is like studying a person: we normally judge by facial features and body shape, but studying someone at the level of cells cannot recover what they look like.
Setting it to 0.0005 was not much better, either.
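If you want to sweep learning rates without rebuilding the graph each time, feeding the rate through a placeholder is a handy pattern (a sketch, not what I originally ran):

# Feed the learning rate at run time so one graph serves the whole sweep.
lr_ = tf.placeholder(tf.float32, name='learning_rate')
opt = tf.train.AdamOptimizer(learning_rate=lr_).minimize(cost)
# Then train with e.g. sess.run(opt, feed_dict={inputs_: x, targets_: y, lr_: 1e-4}).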
Could changing the loss function help?
For example, switch to softmax_cross_entropy_with_logits, like this:
loss = tf.nn.softmax_cross_entropy_with_logits(labels=targets_, logits=logits_)
The result: the network learned nothing, as shown below:
3. Other attempts
Two rounds of fixes had passed with no sign of improvement. I didn't give up. I began to ask why the original program trains so well on MNIST, yet fails as soon as I switch it to seismic data.
Suspecting a problem with the training sample data itself, I tried the following:
First, adjust the training patch size: 128×128, 40×40, 32×32, 28×28, and so on.
Second, truncate the sample data: seismic data is full of outliers and large deviations, so I kept the central 90% of the value distribution, clipped everything outside it, and then normalized; this makes the distribution much more uniform (see the sketch after this list).
Third, broaden the data sources and sample from several different datasets: would richer data improve training?
……
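The truncation in the second experiment, for instance, looked roughly like this (a numpy sketch; reading "the central 90%" as the 5th to 95th percentile range is my interpretation):

import numpy as np

def clip_and_normalize(imgs):
    # Keep the central 90% of the value distribution, clip the outliers...
    lo, hi = np.percentile(imgs, 5), np.percentile(imgs, 95)
    imgs = np.clip(imgs, lo, hi)
    # ...then scale to (-1, 1) as before.
    return imgs / np.abs(imgs).max()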
You can imagine how tedious and fiddly these experiments were, but reality was merciless: the final outcome was failure, without a single decent result, let alone a clear one.
"Without water, there is no road to doubt, and there is another village."
After being dragged across the ground by reality for days on end, I finally asked the painful question: where does the solution lie?
In the network as it stood, lowering the learning rate made the loss converge but the network could not learn effective features, while raising it let the network pick up features but the loss would not converge; in other words, it could not keep optimizing. The whole thing had become a contradiction.
Staring at the ugly prediction chart, I realized the problem might be the network structure itself: the architecture probably works for learning image data, but not for learning seismic data.
After reading other researchers' papers, I gradually zeroed in on one part of the structure: the decoder. My program implements this part as nearest-neighbor upsampling followed by convolution, like this:
conv4 = tf.image.resize_nearest_neighbor(conv3, (8, 8))
conv4 = tf.layers.conv2d(conv4, 32, (3, 3), padding='same', activation=tf.nn.relu)
But the networks in the seismic papers all contain a structure mine lacks: deconvolution (transposed convolution).
What if I used deconvolution too, even in the simplest autoencoder built from convolutions and deconvolutions? A structure like this:
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = Conv2DTranspose(32, (3, 3), padding='same', activation='relu', kernel_initializer='glorot_normal')(x)  # deconvolution
x = Conv2DTranspose(32, (3, 3), padding='same', activation='relu', kernel_initializer='glorot_normal')(x)
decoded = Conv2DTranspose(1, (1, 1), padding='same', activation='tanh', kernel_initializer='glorot_normal')(x)
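Wired up end to end, the whole model is tiny (a minimal Keras sketch; the 28×28 patch shape, the mse loss, and the fit call are my assumptions, while the layer stack matches the snippet above):

from tensorflow.keras.layers import Input, Conv2D, Conv2DTranspose
from tensorflow.keras.models import Model

input_img = Input(shape=(28, 28, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
x = Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = Conv2DTranspose(32, (3, 3), padding='same', activation='relu', kernel_initializer='glorot_normal')(x)
x = Conv2DTranspose(32, (3, 3), padding='same', activation='relu', kernel_initializer='glorot_normal')(x)
# tanh keeps the output in (-1, 1), matching the input normalization.
decoded = Conv2DTranspose(1, (1, 1), padding='same', activation='tanh', kernel_initializer='glorot_normal')(x)

autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(noise_imgs, clean_imgs, epochs=5, batch_size=128)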
The result was amazing. The figure below shows the loss curve: it converges almost immediately:
The trained results are far better, too. Below are the original data, the noisy data, and the denoised output, respectively:
As you can see, the noise all but drowned out the effective signal, yet after only five training iterations the effective signal is separated cleanly.
"Since you have chosen a distant place, you will only care about the hardships."
It seems deconvolution is a key to making neural networks learn seismic data. Next I will investigate why deconvolution adapts so well to seismic processing, keep optimizing and innovating, and run comparative experiments against other algorithms to push for better results.
If you liked this article, please give it a thumbs-up. If you are interested in the code, contact me and I will share it.