How to evaluate the newly proposed RNN variant SRU
First, the motivation:

I believe friends working on deep learning and NLP have all run into the following problems.

RNN training is really too slow! Anyone who has run both RNNs and CNNs understands this firsthand; I don't need to say more.

Applicability of the model and reproducibility of the experiments. A model's results on task A or B may look spectacular, but what about other tasks? Is the model open source? Is experimental code provided so that the results can be reproduced?

Interpretability of the network structure. When looking at a new network/model, have you ever wondered, "why does this thing work?" To give an example, the picture below shows NASCell, the recurrent network cell that Google found with reinforcement learning. Doesn't it leave you with a question-mark face?

By contrast, many simple designs are not only effective but also more interpretable, such as residual connections and attention mechanisms.

Based on the above three points, SRU and its related work aim to propose and explore a "simple, fast, and more interpretable recurrent neural network". We tested SRU extensively and, at the same time, open-sourced all the code. I hope it gets more testing, and that it may even help in finding a more effective model.

A supplement on interpretability: my understanding of the RNNs that are widely used at present is that they encode the similarity structure of sequences into the hidden state (I'm not sure of the best way to phrase this), which is what lets them generalize well. My talks and recent work touch on this. For these reasons, I personally disagree with the claim that SRU is just a Quasi-RNN (QRNN). The reasons are as follows:

(1) Network structure: the core of the Quasi-RNN is adaptive gating built on top of a k-gram CNN (text convolution). In discussions of k-gram convolution, k = 1, i.e. a window size of 1, is usually not treated as a typical setting; this is reflected in many papers, including the QRNN paper itself. Although the matrix transformation in SRU can be viewed as the k = 1 case (see the sketch below), calling SRU a QRNN on that basis is essentially no different from saying "all feed-forward networks are k = 1 convolutions" or "VGG and GoogLeNet are just AlexNet with 3×3 convolutions and more depth".
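To make the k = 1 point concrete, here is a minimal PyTorch sketch (my own illustration, not code from either paper) showing that a convolution with window size 1 is exactly a position-wise matrix transformation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, T = 4, 5, 7
x = torch.randn(1, T, d_in)                    # (batch, time, features)

linear = nn.Linear(d_in, d_out, bias=False)    # position-wise matrix transform
conv = nn.Conv1d(d_in, d_out, kernel_size=1, bias=False)
conv.weight.data.copy_(linear.weight.data.unsqueeze(-1))  # share the weights

y_linear = linear(x)                                       # (1, T, d_out)
y_conv = conv(x.transpose(1, 2)).transpose(1, 2)           # Conv1d wants (B, C, T)
print(torch.allclose(y_linear, y_conv, atol=1e-6))         # True
```

In other words, a k = 1 "convolution" never looks at neighboring positions, which is why it is normally not considered a convolutional setting at all.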

In addition, many recurrent/convolutional network variants converge on the same ideas. Take "CNN + adaptive gating/weighting" as an example: as far as I know it includes genCNN from Huawei's lab in 2015, the Gated ConvNet [4] proposed by FAIR, and so on. The QRNN authors themselves admit that it is hard to claim who was "first", and that what matters is realizing and validating the idea:

These architectures are mathematically so simple that they have been reinvented several times, and it is difficult to determine who tried them first; but as authors, our job is to try to do so (we certainly didn't include enough discussion of PixelCNN in our arXiv version, but we added it in the camera-ready). — bradbury

(2) Acceleration techniques: the techniques for accelerating RNNs, including batched GEMM, fusion of element-wise operations, and so on, were first proposed and open-sourced by NVIDIA researchers [10, 11]. This is stated clearly in our paper; these techniques were not, as some readers believe, introduced by the QRNN. Because the QRNN uses convolutions (conv2d), it accelerates better than a traditional RNN. In SRU, because the matrix computations no longer depend on h[t-1], we obtain the same parallel acceleration across the time dimension (i.e. input positions) as conv2d/QRNN, as the sketch below illustrates.
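Here is a minimal plain-PyTorch sketch of that structure, following the SRU equations in the paper (the tensor layout and function name are my own; the released implementation additionally fuses the element-wise loop into a single CUDA kernel):

```python
import torch

def sru_layer(x, W, Wf, bf, Wr, br, c0):
    """x: (T, B, d) input sequence; c0: (B, d) initial state.

    The three matrix products below do not depend on h[t-1], so they are
    computed for all time steps at once as batched GEMMs; only cheap
    element-wise operations remain inside the sequential loop.
    """
    x_tilde = x @ W                        # candidate values, all steps at once
    f = torch.sigmoid(x @ Wf + bf)         # forget gates, all steps at once
    r = torch.sigmoid(x @ Wr + br)         # reset gates, all steps at once
    c, outputs = c0, []
    for t in range(x.shape[0]):            # element-wise recurrence only
        c = f[t] * c + (1 - f[t]) * x_tilde[t]
        h = r[t] * torch.tanh(c) + (1 - r[t]) * x[t]  # highway-style output
        outputs.append(h)
    return torch.stack(outputs), c
```

Replacing the per-step matrix multiplications of an LSTM with these three sequence-level GEMMs is where the speedup comes from; the loop that remains does only element-wise work at each step.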

(3) Highway/residual connections: in our ICML-17 work [6], a highway connection is added on top of a simplified RNN structure, and the structural variant used in SRU already appears in the first part of the PTB experiments there ([6], Figure 3). This work, including the experiments, theoretical explanation, and related proofs, was actually completed before 2017 and was included in the dissertation [12] after the 2017 graduation defense. Building on that, SRU and its acceleration aim to improve practical usability and to test the model's effectiveness on more tasks and datasets.
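For reference, the highway-style output of SRU (notation follows the SRU paper: $r_t$ is the reset gate, $c_t$ the internal state, $g$ the activation) combines the recurrent state with the raw input $x_t$:

$$h_t = r_t \odot g(c_t) + (1 - r_t) \odot x_t$$

When $r_t \to 0$ the layer passes its input straight through, which is exactly the highway/residual behavior discussed above.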