Although recursive neural networks are a good demonstration of PyTorch's flexibility, PyTorch is also a full-featured framework for all kinds of deep learning, with particularly strong support for computer vision. It is the work of developers at Facebook AI Research and several other labs. The framework combines the efficient and flexible GPU-accelerated backend libraries of Torch7 with an intuitive Python frontend, and it focuses on rapid prototyping, readable code, and support for the widest possible variety of deep learning models.
Spinning up
The article at the link (/jekbradbury/examples/tree/spinn/snli) describes in detail a PyTorch implementation of a recursive neural network with a recurrent tracker and TreeLSTM nodes, also known as SPINN. SPINN is an example of a deep learning model from natural language processing that is difficult to build in many popular frameworks. The implementation here is also batched, so it can take advantage of GPU acceleration and run significantly faster than versions that do not use batching.
SPINN (Stack-augmented Parser-Interpreter Neural Network) was introduced by Bowman et al. in 2016 as a method for solving the task of natural language inference. This article uses the SNLI dataset from Stanford University.
The task is to classify pairs of sentences into three categories: assuming that sentence one is an accurate caption for an unseen image, is sentence two (a) definitely, (b) possibly, or (c) definitely not also an accurate caption? (These categories are called entailment, neutral, and contradiction, respectively.) For example, suppose sentence one is "two dogs are running across a field". Then a sentence that makes the pair an entailment might be "there are animals outdoors", one that makes it neutral might be "some puppies are running and trying to catch a stick", and one that makes it a contradiction might be "the pets are sitting on a sofa".
In particular, the original goal of the research behind SPINN was to encode each sentence into a fixed-length vector representation before determining the relationship between the sentences (there are other approaches, such as attention models that compare individual parts of each sentence with each other using a kind of soft focus).
The dataset comes with machine-generated syntactic parse trees, which group the words in each sentence into phrases and clauses that each have an independent meaning and are each composed of two words or sub-phrases. Many linguists believe that humans understand language by combining meanings in a hierarchical way, as described by trees like these, so it is worth trying to build a neural network that works the same way. The following example is a sentence from the dataset, with its parse tree represented by nested parentheses:
( ( The church ) ( ( has ( cracks ( in ( the ceiling ) ) ) ) . ) )
One way to encode this sentence with a neural network that takes the parse tree into account is to build a neural network layer, Reduce, that combines pairs of words (represented by word embeddings such as GloVe) and/or phrases, and then to apply this layer (function) recursively, taking the result of the last Reduce operation as the encoding of the sentence:
X = Reduce("the", "ceiling")
Y = Reduce("in", X)
... and so on.
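To make the recursion concrete, here is a minimal sketch in plain Python. The names `toy_reduce` and `encode` are hypothetical, and `toy_reduce` only joins strings; in the real model Reduce is a trainable layer operating on embedding vectors.

```python
def toy_reduce(left, right):
    # Hypothetical stand-in for the learned Reduce layer: it only records
    # which two pieces were combined, instead of computing a vector.
    return f"({left} {right})"

def encode(tree):
    """Recursively encode a binary parse tree given as nested 2-tuples."""
    if isinstance(tree, str):
        # A leaf is a single word; a real model would look up its embedding.
        return tree
    left, right = tree
    return toy_reduce(encode(left), encode(right))

tree = (("The", "church"),
        (("has", ("cracks", ("in", ("the", "ceiling")))), "."))
encoding = encode(tree)
print(encoding)
```

The deepest pair ("the", "ceiling") is combined first, and each result becomes an input to the Reduce call one level up.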
But what if I want the network to work in a more human-like way, reading from left to right and maintaining sentence context while still combining phrases using the parse tree? Or, what if I want to train a network to construct its own parse tree as it reads the sentence, based on the words it sees? Here is the same parse tree written in a slightly different way:
The church ) has cracks in the ceiling ) ) ) ) . ) )
Or a third way, again equivalent:
Words:  The church   has cracks in the ceiling         .
Parses: S   S      R S   S      S  S   S       R R R R S R R
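This third form can be derived mechanically from the bracketed one. The sketch below assumes the parse is a whitespace-separated string of words and parentheses; the function name `to_transitions` is hypothetical.

```python
def to_transitions(parse):
    """Turn a bracketed binary parse into (words, transitions)."""
    words, transitions = [], []
    for token in parse.split():
        if token == "(":
            continue                  # opening brackets carry no instruction
        if token == ")":
            transitions.append("R")   # closing bracket: reduce
        else:
            words.append(token)
            transitions.append("S")   # word: shift
    return words, transitions

parse = "( ( The church ) ( ( has ( cracks ( in ( the ceiling ) ) ) ) . ) )"
words, transitions = to_transitions(parse)
print(" ".join(words))
print(" ".join(transitions))
```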
All I did was remove the opening parentheses, then tag the words with "S" for "shift" and replace the closing parentheses with "R" for "reduce". But now the information can be read from left to right as a set of instructions for manipulating a stack and a stack-like buffer, with exactly the same result as the recursive method described above:
1. Place the words into the buffer.
2. Pop "The" from the front of the buffer and push it onto the stack, followed by "church".
3. Pop the top two stack values, apply Reduce, then push the result back onto the stack.
4. Pop "has" from the buffer and push it onto the stack, then "cracks", then "in", then "the", then "ceiling".
5. Repeat four times: pop the top two stack values, apply Reduce, then push the result.
6. Pop "." from the buffer and push it onto the stack.
7. Repeat two times: pop the top two stack values, apply Reduce, then push the result.
8. Pop the remaining stack value and return it as the sentence encoding.
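The steps above amount to a small stack machine. Here is a minimal sketch of it in plain Python; `run_transitions` is a hypothetical name, and the `reduce_fn` passed in simply builds nested tuples rather than computing a learned representation.

```python
def run_transitions(words, transitions, reduce_fn):
    """Execute shift ("S") and reduce ("R") instructions on one sentence."""
    buffer = list(reversed(words))    # reversed so pop() reads left to right
    stack = []
    for t in transitions:
        if t == "S":
            stack.append(buffer.pop())
        elif t == "R":
            right = stack.pop()       # the top of the stack is the right child
            left = stack.pop()
            stack.append(reduce_fn(left, right))
        else:
            raise ValueError(f"unknown transition {t!r}")
    if buffer or len(stack) != 1:
        raise ValueError("transitions do not describe a complete binary tree")
    return stack.pop()

words = "The church has cracks in the ceiling .".split()
transitions = "S S R S S S S S R R R R S R R".split()
result = run_transitions(words, transitions, lambda left, right: (left, right))
print(result)
```

The nested tuples that come out have the same shape as the original bracketed parse, which is the sense in which the two formulations are equivalent.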
I also want to maintain sentence context, so that information about the parts of the sentence the system has already read is taken into account when Reduce is applied to later parts of the sentence. So I will replace the two-argument Reduce function with a three-argument function that takes a left child phrase, a right child phrase, and the current sentence context state. This state is created by a second neural network layer, a recurrent unit called the Tracker. Given the current sentence context state, the top entry b in the buffer, and the top two entries s1 and s2 in the stack, the Tracker produces a new state at every step of the stack manipulation (i.e., after reading each word or closing parenthesis):
context[t+1] = Tracker(context[t], b, s1, s2)
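As a rough sketch of how that update slots into the stack loop, the toy below uses a step counter as the "context" and a string-building three-argument Reduce. Both `toy_tracker` and `toy_reduce` are hypothetical stand-ins for the trainable layers, and using None for an empty buffer or stack slot is an assumption of this sketch.

```python
def toy_tracker(context, b, s1, s2):
    # Hypothetical stand-in for the recurrent Tracker: the "context" here is
    # just the number of steps taken so far, and the inputs are ignored.
    return context + 1

def toy_reduce(left, right, context):
    # Three-argument Reduce: the new phrase is tagged with the context state.
    return f"({left} {right})@{context}"

def encode(words, transitions):
    buffer = list(reversed(words))
    stack = []
    context = 0
    for t in transitions:
        b = buffer[-1] if buffer else None
        s1 = stack[-1] if stack else None
        s2 = stack[-2] if len(stack) > 1 else None
        context = toy_tracker(context, b, s1, s2)   # context[t+1]
        if t == "S":
            stack.append(buffer.pop())
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(toy_reduce(left, right, context))
    return stack.pop()

encoded = encode(["in", "the", "ceiling"], ["S", "S", "S", "R", "R"])
print(encoded)
```

The tracker runs on every step, whether the step shifts or reduces, so each Reduce call sees a state that reflects everything read so far.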
It is easy to imagine writing code in your favorite programming language to do these things. For each sentence to be processed, it would load the next word from the buffer, run the Tracker, check whether to push the word onto the stack or apply the Reduce function, perform that operation, and then repeat until the whole sentence has been processed. Applied to a single sentence, this process constitutes a large and complex deep neural network whose two trainable layers are applied over and over in ways determined by the stack manipulation. However, if you are familiar with traditional deep learning frameworks such as TensorFlow or Theano, you know that it is difficult to implement this kind of dynamic process in them. It is worth taking some time to step back and explore what PyTorch does differently.
Graph theory
Figure 1: A function represented as a graph structure
A deep neural network is, in essence, a complicated function with a large number of parameters. The goal of deep learning is to optimize these parameters by computing their partial derivatives (gradients) with respect to a loss metric. If the function is represented as a graph structure of computations (Figure 1), then traversing the graph backwards makes it possible to compute these gradients without redundant work. Every modern deep learning framework is based on this concept of backpropagation, so every framework needs a way to represent computation graphs.
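To illustrate the idea of traversing a graph backwards, here is a minimal reverse-mode differentiation sketch in plain Python. The `Node`, `add`, `mul`, and `backward` names are hypothetical and are not how any particular framework implements this; the sketch only supports scalar addition and multiplication.

```python
class Node:
    """A value in a computation graph, with links back to its inputs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents    # pairs of (input node, local derivative)
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def backward(output):
    # Collect nodes so that every node comes after all of its inputs.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    # Walk the graph backwards, applying the chain rule once per edge.
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

x, w, b = Node(3.0), Node(2.0), Node(1.0)
loss = add(mul(w, x), b)          # loss = w * x + b
backward(loss)
print(loss.value, w.grad, x.grad, b.grad)
```

Each edge of the graph is visited exactly once on the backward pass, which is the "no redundant work" property the paragraph above refers to.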
In many popular frameworks, including TensorFlow, Theano, and Keras, as well as Torch7's nngraph library, the computation graph is a static object that is built ahead of time. The graph is defined using code that looks like mathematical expressions, but whose variables are actually placeholders that do not yet hold any numerical values. This graph of placeholder variables is compiled once into a function, which can then be run repeatedly on batches of training data to produce outputs and gradients.
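The define-then-run pattern can be imitated in a few lines of plain Python. This is a toy illustration only: the `Placeholder`, `Add`, `Mul`, and `compile_graph` names are hypothetical and do not correspond to the API of TensorFlow, Theano, or any other framework.

```python
class Placeholder:
    """A named slot with no value until the graph is run."""
    def __init__(self, name):
        self.name = name
    def evaluate(self, feed):
        return feed[self.name]

class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b
    def evaluate(self, feed):
        return self.a.evaluate(feed) + self.b.evaluate(feed)

class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b
    def evaluate(self, feed):
        return self.a.evaluate(feed) * self.b.evaluate(feed)

def compile_graph(node):
    # "Compilation" here just wraps the graph in a reusable function.
    return lambda **feed: node.evaluate(feed)

x, w = Placeholder("x"), Placeholder("w")
graph = Add(Mul(w, x), x)         # built once; no numbers are involved yet
f = compile_graph(graph)
outputs = [f(w=2.0, x=value) for value in (1.0, 2.0, 3.0)]
print(outputs)
```

The line that builds `graph` runs only once, so any Python `for` or `if` used while building it is frozen into a single fixed structure.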
This kind of static computation graph approach works well for convolutional networks, whose structure is typically fixed. But in many other applications, it is useful for the graph structure of a neural network to vary depending on the data. In natural language processing, researchers usually want to unroll recurrent neural networks over as many time steps as there are words in the input. The stack manipulation in the SPINN model described above relies heavily on control flow (such as for and if statements) to define the graph structure of computation for a particular sentence. In even more complex cases, you might want to build models whose structure depends on the output of subnetworks within the model itself.
Some (though not all) of these ideas can be shoehorned into static graph systems, but almost always at the cost of reduced transparency and confusing code. The framework has to add special nodes to its computation graphs that represent programming primitives such as loops and conditionals, and users have to learn and use these nodes rather than just the for and if statements of the language they are writing their code in. This is because any control flow statement the programmer uses runs only once, when the graph is built, hard-coding a single computation path.
For example, running a recurrent neural network unit (rnn_unit) over the vectors in words (starting from the initial state h0) requires a special control flow node in TensorFlow, tf.while_loop. An additional special node is needed to obtain the length of words at run time, since it is only a placeholder at the time the code is run.
# TensorFlow
# (this code runs once, during model initialization)
# "words" is not a real list (it's a placeholder variable) so
# I can't use "len"
cond = lambda i, h: i < tf.shape(words)[0]
# the loop body has to return the updated (i, h) pair
cell = lambda i, h: (i + 1, rnn_unit(words[i], h))
i = 0
_, h = tf.while_loop(cond, cell, (i, h0))
An approach based on dynamic computation graphs is fundamentally different from this. It has decades of academic research behind it, including Harvard's Kayak, autograd, and the research-centric frameworks Chainer and DyNet. In such a framework (also known as define-by-run), the computation graph is built and rebuilt at run time, and the same code that performs the computations of the forward pass also creates the data structures needed for backpropagation. This approach produces more straightforward code, because control flow can be written using standard for and if statements. It also makes debugging easier, because a run-time breakpoint or stack trace leads to the code you actually wrote rather than to a compiled function in an execution engine. The same variable-length recurrent neural network can be implemented in a dynamic framework with a simple Python for loop.
# PyTorch (also works in Chainer)
# (this code runs on every forward pass of the model)
# "words" is a Python list with actual values in it
h = h0
for word in words:
    h = rnn_unit(word, h)
PyTorch is the first define-by-run deep learning framework that matches the capabilities and performance of static graph frameworks such as TensorFlow, which makes it a good fit for everything from standard convolutional networks to the wildest reinforcement learning ideas. So let's look at the SPINN implementation.
Code
Before I start building the network, I need to set up a data loader. In deep learning it is common for models to operate on batches of data examples, both to speed up training through parallelism and to get a smoother gradient at each step. I would like to be able to do that here (I will explain later how the stack manipulation process described above can be batched). The following Python code loads the data using a system built into the PyTorch text library that automatically produces batches by joining together examples of similar length. After running this code, train_iter, dev_iter, and test_iter contain iterators that cycle through batches of the training, validation, and test splits of SNLI.
from torchtext import data, datasets
TEXT = datasets.snli.ParsedTextField(lower=True)
TRANSITIONS = datasets.snli.ShiftReduceField()
LABELS = data.Field(sequential=False)
train, dev, test = datasets.SNLI.splits(
    TEXT, TRANSITIONS, LABELS, wv_type='glove.42B')
TEXT.build_vocab(train, dev, test)
train_iter, dev_iter, test_iter = data.BucketIterator.splits(
    (train, dev, test), batch_size=64)
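The idea of batching examples of similar length can be illustrated without torchtext. The sketch below is a simplification written for this explanation, with a hypothetical `bucket_batches` function; a real iterator would typically also shuffle the batches and pad the sentences in each one to a common length.

```python
def bucket_batches(examples, batch_size):
    """Group examples of similar length into batches."""
    ordered = sorted(examples, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

sentences = [s.split() for s in [
    "a dog runs",
    "two dogs are running through a field",
    "pets sit",
    "some puppies are running to catch a stick",
    "the church has cracks",
    "animals outdoors",
]]
batches = bucket_batches(sentences, batch_size=2)
lengths = [[len(s) for s in batch] for batch in batches]
print(lengths)
```

Because neighbours in each batch have nearly the same length, little padding is needed and little computation is wasted on it.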
You can find the rest of the code for setting up things such as the training loop and the accuracy metrics in train.py. Let's move on to the model. As described above, a SPINN encoder contains a parameterized Reduce layer and an optional recurrent Tracker that keeps track of sentence context by updating a hidden state every time the network reads a word or applies Reduce. The following code shows that creating a SPINN simply means creating these two submodules (we will see their code soon) and putting them in a container to be used later.
import itertools
import torch
from torch import nn
# subclass the Module class from PyTorch's neural network package
class SPINN(nn.Module):
    def __init__(self, config):
        super(SPINN, self).__init__()
        self.config = config
        self.reduce = Reduce(config.d_hidden, config.d_tracker)
        if config.d_tracker is not None:
            self.tracker = Tracker(config.d_hidden, config.d_tracker)
SPINN.__init__ is called once, when the model is created. It allocates and initializes parameters, but it does not perform any neural network operations or build any kind of computation graph. The code that runs on each new batch of data is defined in the SPINN.forward method, which is the standard PyTorch name for the user-implemented method that defines a model's forward pass. It is effectively just an implementation of the stack-manipulation algorithm described above, in ordinary Python, operating on a batch of buffers and stacks, one of each for every example. I iterate over the set of "shift" and "reduce" operations contained in transitions, running the Tracker if it exists and going through each example in the batch to apply the "shift" operation if requested or to add the example to a list of examples that need the "reduce" operation. Then I run the Reduce layer on all the examples in that list and push the results back onto their respective stacks.
    # (a method of SPINN; relies on itertools and on the SHIFT and REDUCE
    # constants being available at module level)
    def forward(self, buffers, transitions):
        # The input comes in as a single tensor of word embeddings;
        # I need it to be a list of stacks, one for each example in
        # the batch, that we can pop from independently. The words in
        # each example have already been reversed, so that they can
        # be read from left to right by popping from the end of each
        # list; they have also been prefixed with a null value.
        buffers = [list(torch.split(b.squeeze(1), 1, 0))
                   for b in torch.split(buffers, 1, 1)]
        # We also need two null values at the bottom of each stack,
        # so we can copy from the nulls in the input; these nulls
        # are all needed so that the tracker can run even if the
        # buffer or stack is empty
        stacks = [[buf[0], buf[0]] for buf in buffers]
        if hasattr(self, 'tracker'):
            self.tracker.reset_state()
        for trans_batch in transitions:
            if hasattr(self, 'tracker'):
                # I described the Tracker earlier as taking 4
                # arguments (context_t, b, s1, s2), but here I
                # provide the stack contents as a single argument
                # while storing the context inside the Tracker
                # object itself.
                tracker_states, _ = self.tracker(buffers, stacks)
            else:
                tracker_states = itertools.repeat(None)
            lefts, rights, trackings = [], [], []
            batch = zip(trans_batch, buffers, stacks, tracker_states)
            for transition, buf, stack, tracking in batch:
                if transition == SHIFT:
                    stack.append(buf.pop())
                elif transition == REDUCE:
                    rights.append(stack.pop())
                    lefts.append(stack.pop())
                    trackings.append(tracking)
            if rights:
                reduced = iter(self.reduce(lefts, rights, trackings))
                for transition, stack in zip(trans_batch, stacks):
                    if transition == REDUCE:
                        stack.append(next(reduced))
        return [stack.pop() for stack in stacks]
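The batching pattern in this method (collect the operands of every example that reduces at a given step, make one call, then hand the results back in order) can be tried out without PyTorch. The sketch below is a simplified toy: it has no Tracker and no null padding, `batched_reduce` is a hypothetical string-building stand-in for the Reduce layer, and all examples are assumed to have transition sequences of the same length.

```python
SHIFT, REDUCE = "S", "R"

def batched_reduce(lefts, rights):
    # Stand-in for the batched Reduce layer: one call handles many pairs.
    return [f"({left} {right})" for left, right in zip(lefts, rights)]

def encode_batch(sentences, transitions):
    """transitions is indexed as [time step][example]."""
    buffers = [list(reversed(s)) for s in sentences]
    stacks = [[] for _ in sentences]
    reduce_calls = 0
    for trans_batch in transitions:
        lefts, rights = [], []
        for transition, buf, stack in zip(trans_batch, buffers, stacks):
            if transition == SHIFT:
                stack.append(buf.pop())
            elif transition == REDUCE:
                rights.append(stack.pop())
                lefts.append(stack.pop())
        if rights:
            reduce_calls += 1
            reduced = iter(batched_reduce(lefts, rights))
            # Hand results back in the same order the operands were collected.
            for transition, stack in zip(trans_batch, stacks):
                if transition == REDUCE:
                    stack.append(next(reduced))
    return [stack.pop() for stack in stacks], reduce_calls

sentences = [["the", "old", "church"], ["a", "small", "dog"]]
# Example 0 is right-branching, example 1 is left-branching.
transitions = list(zip("SSSRR", "SSRSR"))
encodings, reduce_calls = encode_batch(sentences, transitions)
print(encodings, reduce_calls)
```

Four reduce operations are performed here with three calls to the reduce function, because both examples reduce on the final step and share one call.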
Calling self.tracker or self.reduce runs the forward method of the Tracker or Reduce submodule, respectively, which takes a list of examples on which to apply the operation. It makes sense to operate independently on the different examples here in the main forward method, keeping a separate buffer and stack for each example in the batch, because all of the math-heavy, GPU-accelerated operations that benefit from batched execution take place in Tracker and Reduce. To write those functions more cleanly, I will use some helpers (defined later) that turn these lists of examples into batched tensors and vice versa.
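As a rough picture of what such helpers do, the sketch below converts between a list of per-example (h, c) state pairs and a pair of batched collections, using plain Python lists in place of tensors. The names `bundle` and `unbundle` and the list-based representation are assumptions of this sketch; a PyTorch version would concatenate and split tensors instead.

```python
def bundle(states):
    """List of per-example (h, c) pairs -> one (H, C) pair of batched rows."""
    hs, cs = zip(*states)
    return list(hs), list(cs)

def unbundle(batched):
    """Batched (H, C) pair -> list of per-example (h, c) pairs."""
    hs, cs = batched
    return list(zip(hs, cs))

states = [([0.1, 0.2], [1.0, 2.0]),
          ([0.3, 0.4], [3.0, 4.0])]
H, C = bundle(states)
print(H, C)
print(unbundle((H, C)) == states)
```

With helpers like these, the stack code can keep working with one small object per example while the layers see a single batch.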
I want the Reduce module to batch its arguments automatically to speed up computation, and then to unbatch them so that they can be pushed and popped independently later. The actual composition function used to combine the representations of each pair of left and right sub-phrases into the representation of the parent phrase is a TreeLSTM, a variation of the common recurrent neural network unit called an LSTM. This composition function requires that the state of each child actually consist of two tensors, a hidden state h and a memory cell state c. The function is defined using two linear layers (nn.Linear) that operate on the children's hidden states and a nonlinear combination function, tree_lstm, that combines the result of the linear layers with the children's memory cell states. In SPINN, this is extended by adding a third linear layer that operates on the Tracker's hidden state.
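To show the shape of that combination step, here is a sketch of a TreeLSTM-style cell update in plain Python, with lists of floats in place of tensors. It assumes that the summed output of the linear layers, `lstm_in`, is laid out as five equal chunks in the order candidate, input gate, left forget gate, right forget gate, output gate; that ordering is an assumption of this sketch rather than something stated above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tree_lstm(c_left, c_right, lstm_in):
    """Combine two children's memory cells using pre-computed gate inputs."""
    d = len(c_left)
    # Split lstm_in into five chunks of size d (assumed ordering, see above).
    a, i, f1, f2, o = (lstm_in[k * d:(k + 1) * d] for k in range(5))
    # New memory cell: gated candidate plus each child's gated memory.
    c = [math.tanh(a[j]) * sigmoid(i[j])
         + sigmoid(f1[j]) * c_left[j]
         + sigmoid(f2[j]) * c_right[j]
         for j in range(d)]
    # New hidden state: output gate applied to the squashed memory cell.
    h = [sigmoid(o[j]) * math.tanh(c[j]) for j in range(d)]
    return h, c

# With all gate inputs at zero, every gate is 0.5 and the candidate is 0,
# so the new cell is the average of the two children's cells.
h, c = tree_lstm([1.0], [3.0], [0.0] * 5)
print(h, c)
```

Having one forget gate per child is what lets the parent keep different amounts of memory from its left and right sub-phrases.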
Figure 2: A TreeLSTM composition function augmented with a third input (x, in this case the Tracker state). In the PyTorch implementation shown below, the five groups of three linear transformations (represented by the triplets of blue, black, and red arrows) have been combined into three nn.Linear modules, while the tree_lstm function performs all of the computations located inside the box. Figure from Chen et al. (2016).