The clearest explanation of how LSTM networks work
Humans don't start their thinking from scratch every second. As you read this article, you understand each word based on your understanding of the previous words. You don't throw everything away and start thinking from scratch again. Your thoughts have persistence.

Traditional neural networks can't do this, which seems like a major shortcoming. For example, imagine you want to classify the kind of event happening at every point in a movie. It's unclear how a traditional neural network could use its reasoning about earlier events in the film to inform later ones.

Recurrent neural networks address this issue. They are networks with loops in them, which allow information to persist.

One appeal of RNNs is that they might be able to connect previous information to the present task. For example, using previous video frames might help in understanding the current frame. If RNNs could do this, they would be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in "the clouds are in the sky," we don't need any further context: it's obvious the next word will be "sky." In such cases, where the gap between the relevant information and the place where it's needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text "I grew up in France ... I speak fluent French." Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It is entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

Long Short-Term Memory networks, usually called LSTMs, are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter and Schmidhuber (1997), and were refined and popularized by many people in subsequent work. They perform remarkably well on a large variety of problems and are now widely used.

LSTMs are explicitly designed to avoid the long-term dependency problem. Remembering information for long periods of time is practically their default behavior, not something they struggle to learn.

All recurrent neural networks have the form of a chain of repeating neural network modules. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.
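As a rough sketch, the repeating module of a standard RNN can be written in a few lines of NumPy. The function name `rnn_step` and the toy dimensions are illustrative, not from the original text:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the repeating module is a single tanh layer."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: 3-dim input, 4-dim hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(4, 4))
b_h = np.zeros(4)

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):  # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # the same module repeats at each step
```

The same weights are reused at every time step; only the hidden state `h` carries information forward along the chain.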

The key to LSTMs is the cell state, the horizontal line running along the top of the diagram.

The cell state is a bit like a conveyor belt. It runs straight down the entire chain, with only a few minor linear interactions. Information flows along it easily.

The LSTM has the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They consist of a sigmoid neural network layer and a pointwise multiplication operation.

The first step of the LSTM is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate. It looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each number in the cell state C_{t-1}: 1 means "completely keep this," and 0 means "completely forget this."
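The forget gate described above can be sketched as follows; the helper name `forget_gate` and the toy dimensions are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f):
    one value in (0, 1) per entry of the cell state C_{t-1}."""
    return sigmoid(np.concatenate([h_prev, x_t]) @ W_f + b_f)

rng = np.random.default_rng(1)
h_prev = rng.normal(size=4)        # previous hidden state
x_t = rng.normal(size=3)           # current input
W_f = rng.normal(size=(7, 4))      # (hidden + input) -> cell-state size
b_f = np.zeros(4)

f_t = forget_gate(h_prev, x_t, W_f, b_f)
```

Each entry of `f_t` lies strictly between 0 and 1, so pointwise multiplication by `f_t` smoothly scales each cell-state entry between "forget" and "keep."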

Let's go back to our language model example, which tries to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be predicted. When we see a new subject, we want to forget the gender of the old subject.

The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we will update. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the following step, we combine these two to create an update to the state.
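The two parts of this step, and the state update that combines them with the forget gate, can be sketched like this (variable names and toy sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
h_prev, x_t = rng.normal(size=4), rng.normal(size=3)
hx = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
C_prev = rng.normal(size=4)         # old cell state C_{t-1}

W_f, W_i, W_c = (rng.normal(size=(7, 4)) for _ in range(3))
b = np.zeros(4)

f_t = sigmoid(hx @ W_f + b)         # forget gate (from the previous step)
i_t = sigmoid(hx @ W_i + b)         # input gate: which entries to update
C_tilde = np.tanh(hx @ W_c + b)     # candidate values, each in (-1, 1)

# Combine: drop what the forget gate says to drop, add the scaled candidates.
C_t = f_t * C_prev + i_t * C_tilde
```

Note that the update to the cell state is purely pointwise: a multiplication and an addition, which is why information flows along the cell state so easily.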

In our language model example, this is where we would add the gender of the new subject to the cell state, to replace the old one we're forgetting.

Finally, we need to decide what to output. This output will be based on our cell state, but will be a filtered version of it. First, we run a sigmoid layer that decides which parts of the cell state to output. Then we put the cell state through tanh (pushing the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
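The output step can be sketched in the same style (again, names and toy dimensions are assumptions, not from the original text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
h_prev, x_t = rng.normal(size=4), rng.normal(size=3)
hx = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
C_t = rng.normal(size=4)            # updated cell state from the previous step

W_o = rng.normal(size=(7, 4))
b_o = np.zeros(4)

o_t = sigmoid(hx @ W_o + b_o)       # which parts of the state to expose
h_t = o_t * np.tanh(C_t)            # filtered cell state, squashed to (-1, 1)
```

The new hidden state `h_t` is what gets passed to the next time step (and to any output layer), while the full cell state `C_t` continues along the conveyor belt unfiltered.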

For the language model example, since it has just seen a subject, it might want to output information relevant to a verb, in case that is what comes next. For example, it might output whether the subject is singular or plural, so we know what form a verb should take if that's what follows.

So far, what I have described is a pretty normal LSTM. But not all LSTMs are the same as the above. In fact, almost every paper involving LSTMs seems to use a slightly different version. The differences are minor, but some are worth mentioning.

One popular LSTM variant, introduced by Gers & Schmidhuber (2000), adds "peephole connections." This means that we let the gate layers look at the cell state.
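In the sketch below, the only change from the plain forget gate is that the cell state C_{t-1} is appended to the gate's input; the variable names and sizes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
h_prev = rng.normal(size=4)
x_t = rng.normal(size=3)
C_prev = rng.normal(size=4)

# Peephole variant: the gate's input includes the cell state C_{t-1}.
hxc = np.concatenate([h_prev, x_t, C_prev])
W_f = rng.normal(size=(11, 4))      # (hidden + input + cell) -> cell-state size
f_t = sigmoid(hxc @ W_f + np.zeros(4))
```

Each gate simply gains extra weight columns for the cell-state entries; the rest of the LSTM step is unchanged.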

Another variation is to use coupled forget and input gates. Instead of separately deciding what to forget and what new information to add, we make those decisions together. We only forget when we are going to input something in its place, and we only input new values to the state when we forget something older.
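With coupled gates, the input gate is simply 1 minus the forget gate, so the state update becomes C_t = f_t * C_{t-1} + (1 - f_t) * C̃_t. A tiny numerical sketch (with made-up example values):

```python
import numpy as np

f_t = np.array([0.9, 0.1])       # example forget-gate activations
C_prev = np.array([1.0, -2.0])   # old cell state
C_tilde = np.array([0.5, 0.5])   # candidate values

# Coupled update: the input gate is tied to (1 - f_t).
C_t = f_t * C_prev + (1.0 - f_t) * C_tilde
# entry 0 mostly keeps its old value; entry 1 is mostly replaced by the candidate
```

This removes one gate's worth of parameters while guaranteeing that every forgotten fraction of the state is replaced by new content.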

Which variant is best? Do the differences matter? Greff et al. (2015) did a nice comparison of popular variants, finding that they are all about the same. Jozefowicz et al. (2015) tested more than ten thousand RNN architectures, finding some that worked better than LSTMs on certain tasks.

Earlier, I mentioned the remarkable results people are achieving with RNNs. Essentially all of these are achieved using LSTMs. They really do work much better for most tasks!

Written down as a set of equations, LSTMs look intimidating. I hope that walking through them step by step in this article has made them more approachable.
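For reference, the steps walked through above can be collected into one set of equations (σ denotes the sigmoid, ⊙ pointwise multiplication, and [h_{t-1}, x_t] concatenation):

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) &&\text{forget gate} \\
i_t &= \sigma\!\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) &&\text{input gate} \\
\tilde{C}_t &= \tanh\!\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) &&\text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t &&\text{cell-state update} \\
o_t &= \sigma\!\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) &&\text{output gate} \\
h_t &= o_t \odot \tanh(C_t) &&\text{new hidden state}
\end{aligned}
```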

LSTMs were a big step in what we can accomplish with RNNs. It's natural to wonder: is there another big step? A common opinion among researchers is: "Yes! There is a next step, and it's attention!" The idea is to let every step of an RNN pick information to look at from some larger collection of information. For example, if you are using an RNN to create a caption describing an image, it might pick a part of the image to look at for every word it outputs. In fact, Xu et al. (2015) do exactly this; it might be a fun starting point if you want to explore attention! There have been a number of really exciting results using attention, and it seems like many more are around the corner...

Attention isn't the only exciting thread in RNN research. For example, the Grid LSTMs of Kalchbrenner et al. (2015) seem extremely promising. Work using RNNs in generative models, such as Gregor et al. (2015), Chung et al. (2015), or Bayer & Osendorfer (2015), also seems very interesting. The last few years have been an exciting time for recurrent neural networks, and the coming ones promise to be even more so!