For each character c in the sentence, look up in the dictionary D all possible words ws that contain that character, as shown in Figure 3:
The final generated character-word pair sequence is:
s_cw = {(c_1, ws_1), (c_2, ws_2), ..., (c_n, ws_n)}
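As a rough illustration of this lookup step, the sketch below enumerates, for each character, every dictionary word that covers it. It assumes the lexicon is a plain Python set (a trie would normally be used for efficiency); the function name and parameters are illustrative only, not taken from the paper's code.

```python
from typing import Dict, List, Set, Tuple

def build_char_word_pairs(sentence: str,
                          lexicon: Set[str],
                          max_word_len: int = 5) -> List[Tuple[str, List[str]]]:
    """For each character c_i, collect all lexicon words ws_i that contain it."""
    # words matched to each character position
    matched: Dict[int, List[str]] = {i: [] for i in range(len(sentence))}
    for start in range(len(sentence)):
        for end in range(start + 1, min(start + max_word_len, len(sentence)) + 1):
            span = sentence[start:end]
            if span in lexicon:
                # every character covered by this word records it
                for pos in range(start, end):
                    matched[pos].append(span)
    # final sequence s_cw = {(c_1, ws_1), ..., (c_n, ws_n)}
    return [(sentence[i], matched[i]) for i in range(len(sentence))]


if __name__ == "__main__":
    lexicon = {"美国", "美国人", "国人", "人民"}
    print(build_char_word_pairs("美国人民", lexicon))
```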
The character-word pair information is then incorporated into the BERT network through a Lexicon Adapter layer, as shown in Figure 4:
The Lexicon Adapter layer takes two inputs, the characters and the matched word pairs, i.e. H and X in the figure above, where h_i is the character vector output by the previous Transformer layer, x_i is the set of word embeddings of the M words matched to that character, and x_{ij} denotes the j-th of these M words:
Here e^w is a pre-trained word embedding lookup table, so each matched word is mapped to its vector as x_{ij} = e^w(w_{ij}).
Since the word vectors and character vectors have different dimensions, a nonlinear transformation is applied to align them: v_{ij} = W_2 (tanh(W_1 x_{ij} + b_1)) + b_2,
where W_1 is a d_c × d_w matrix, W_2 is a d_c × d_c matrix, b_1 and b_2 are bias terms, d_w is the dimension of the word vectors, and d_c is the dimension of the hidden (character) vectors.
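A minimal PyTorch sketch of this transformation, assuming d_w and d_c follow the definitions above (the module and parameter names are illustrative, not from the paper's released code):

```python
import torch
import torch.nn as nn

class WordVectorAlign(nn.Module):
    """Nonlinear transform v_ij = W2 * tanh(W1 * x_ij + b1) + b2.

    Maps word embeddings of size d_w into the character hidden size d_c.
    Names are illustrative and do not follow the paper's released code.
    """
    def __init__(self, d_w: int, d_c: int):
        super().__init__()
        self.w1 = nn.Linear(d_w, d_c)   # W1: d_c x d_w, with bias b1
        self.w2 = nn.Linear(d_c, d_c)   # W2: d_c x d_c, with bias b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, m, d_w) -> (batch, seq_len, m, d_c)
        return self.w2(torch.tanh(self.w1(x)))
```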
As can be seen from Figure 3, one character may correspond to multiple words, and the best-matching word may differ from task to task.
Specifically, let V_i = (v_{i1}, ..., v_{im}) denote all the transformed word vectors corresponding to the i-th character, where m is the number of words the character may correspond to; the attention weights are then computed as a_i = softmax(h_i W_attn V_i^T), where W_attn is the attention weight matrix.
Each word vector is then multiplied by its attention weight and the results are summed to obtain the word representation for position i: z_i = Σ_j a_{ij} v_{ij}.
Finally, the lexicon information is added to the character vector, giving the new vector at this position: h̃_i = h_i + z_i.
The result is then passed through a dropout layer and a layer-normalization layer for further processing.
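Putting the attention, the weighted sum, the residual addition, dropout, and layer normalization together, the adapter's fusion step might be sketched as follows. A padding mask over the m word slots is assumed, since different characters match different numbers of words; all names here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LexiconFusion(nn.Module):
    """Fuses matched-word vectors into character vectors via bilinear attention.

    Sketch of: a_i = softmax(h_i W_attn V_i^T), z_i = sum_j a_ij v_ij,
    h~_i = LayerNorm(Dropout(h_i + z_i)).
    """
    def __init__(self, d_c: int, dropout: float = 0.1):
        super().__init__()
        self.w_attn = nn.Parameter(torch.empty(d_c, d_c))
        nn.init.xavier_uniform_(self.w_attn)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(d_c)

    def forward(self, h, v, word_mask):
        # h: (batch, seq_len, d_c)        character vectors from the Transformer
        # v: (batch, seq_len, m, d_c)     transformed word vectors
        # word_mask: (batch, seq_len, m)  1 for real words, 0 for padding slots
        scores = torch.einsum("bsd,de,bsme->bsm", h, self.w_attn, v)
        # NOTE: positions with no matched word at all should be handled upstream,
        # otherwise the all -inf row below would produce NaN after softmax.
        scores = scores.masked_fill(word_mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)          # a_i over the m words
        z = torch.einsum("bsm,bsmd->bsd", attn, v)    # weighted word summary z_i
        return self.layer_norm(self.dropout(h + z))   # h~_i with dropout + LN
```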
The input characters are fed into the embedding layer, where token, segment, and position embeddings are added together; the embeddings output by this layer are then fed into the Transformer layers:
where H^l is the output of the l-th hidden layer, LN is layer normalization, HMAttn is the multi-head attention mechanism, FFN is a two-layer feed-forward network, and ReLU is the activation function.
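The layer equations referred to here are not reproduced in the text; in the standard form used by BERT-style encoders (restated here, not copied from the paper) they read:

G^l = LN(H^{l-1} + HMAttn(H^{l-1}))
H^l = LN(G^l + FFN(G^l)), with FFN(x) = W_2 ReLU(W_1 x + b_1) + b_2.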
The lexicon information is injected between the k-th and (k+1)-th Transformer layers.
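The placement of the adapter can be sketched as a loop over the encoder layers, applying the lexicon fusion only after layer k. The `transformer_layers`, `word_align`, and `fusion` modules are the hypothetical pieces sketched above, and the layer call signature is schematic rather than tied to any specific BERT implementation.

```python
import torch.nn as nn

class LexiconEnhancedEncoder(nn.Module):
    """Runs BERT-style Transformer layers, injecting lexicon information
    between layer k and layer k+1. Illustrative wiring only."""
    def __init__(self, transformer_layers: nn.ModuleList,
                 word_align: nn.Module, fusion: nn.Module, k: int):
        super().__init__()
        self.layers = transformer_layers  # existing BERT encoder layers
        self.word_align = word_align      # WordVectorAlign from above
        self.fusion = fusion              # LexiconFusion from above
        self.k = k

    def forward(self, h, attention_mask, word_embeddings, word_mask):
        for idx, layer in enumerate(self.layers, start=1):
            h = layer(h, attention_mask)  # schematic layer call
            if idx == self.k:
                # inject lexicon features between layer k and layer k+1
                v = self.word_align(word_embeddings)
                h = self.fusion(h, v, word_mask)
        return h
```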
To take the context of neighboring tags into account, a CRF layer is used to predict the final tags; the output of the last hidden layer H is used as input to compute the output layer O:
The output layer is then fed into the CRF model to compute the probability p of a label sequence y.
During training, given the sentence s and the gold labels y, the negative log-likelihood of the whole sentence is used as the loss.
During decoding, the Viterbi algorithm is used to find the label sequence with the highest score.
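A sketch of such a tagging head is given below, using the third-party pytorch-crf package (assumed available) for the CRF log-likelihood and Viterbi decoding; the linear projection and all names are illustrative, and the paper's actual implementation may differ.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party "pytorch-crf" package; assumed available

class CRFTagger(nn.Module):
    """Projects the last hidden layer to per-tag scores O, then scores whole
    label sequences with a CRF. Illustrative sketch only."""
    def __init__(self, d_c: int, num_tags: int):
        super().__init__()
        self.output = nn.Linear(d_c, num_tags)      # O = H W_o + b_o
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, h_last, tags, mask):
        # negative log-likelihood of the gold label sequence y given sentence s
        emissions = self.output(h_last)
        return -self.crf(emissions, tags, mask=mask, reduction="mean")

    def decode(self, h_last, mask):
        # Viterbi decoding: highest-scoring label sequence per sentence
        emissions = self.output(h_last)
        return self.crf.decode(emissions, mask=mask)
```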
Experiments are conducted on named entity recognition (NER), Chinese word segmentation (CWS), and part-of-speech (POS) tagging; the datasets, which are commonly used in Chinese natural language processing, are listed in Table 1.
Figure 5 shows the error reduction of the model compared with BERT and recent BERT-based models.
Besides comparing with other models, the paper also contrasts the LEBERT method with the BERT+Word method, which differ in how word information is fused into the model.