First of all, if the approach is not limited to neural network methods, this can be solved with the classic BOW + tf-idf + LSI/LDA pipeline, which is built on one-hot representations.
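The BOW + tf-idf + LSI pipeline above can be sketched with plain numpy (a toy corpus and a truncated SVD stand in for a real vectorizer and topic model; in practice you would use a library):

```python
import numpy as np

# Toy corpus: rows = documents, columns = vocabulary terms.
docs = [
    "cat sat on the mat",
    "dog sat on the log",
    "cats and dogs",
]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Bag-of-words count matrix (docs x terms).
counts = np.zeros((len(docs), len(vocab)))
for r, d in enumerate(docs):
    for w in d.split():
        counts[r, idx[w]] += 1

# tf-idf: term frequency scaled by inverse document frequency.
tf = counts / counts.sum(axis=1, keepdims=True)
df = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / df)
tfidf = tf * idf

# LSI: truncated SVD of the tf-idf matrix; keep k latent topics.
k = 2
U, S, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_vectors = U[:, :k] * S[:k]  # k-dimensional document vectors
print(doc_vectors.shape)        # (3, 2)
```

LDA would replace the SVD step with a probabilistic topic model, but the input (the BOW/tf-idf matrix) and the output (a low-dimensional vector per document) play the same roles.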
Secondly, if the OP requires the currently popular neural approach, commonly known as word embeddings, then word2vec is of course the first choice (even though it is not a deep network). Once the word2vec word vectors are obtained, a document vector can be built by combining them with simple uniform weighting, tag-based weighting, or tf-idf weighting. That is one route. Of course, before weighting, stop words are usually removed first, and the words may also be clustered.
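The tf-idf-weighted-average route can be sketched as follows (the embeddings and tf-idf weights here are made-up placeholders; in practice the vectors come from a trained word2vec model and the weights from the corpus, with stop words already removed):

```python
import numpy as np

# Hypothetical pre-trained word vectors (in practice, from word2vec).
rng = np.random.default_rng(0)
dim = 50
embeddings = {w: rng.standard_normal(dim)
              for w in ["neural", "network", "document", "vector"]}

# Hypothetical tf-idf weights, computed beforehand over the corpus.
tfidf = {"neural": 0.8, "network": 0.6, "document": 0.3, "vector": 0.4}

doc_words = ["neural", "network", "vector"]
weights = np.array([tfidf[w] for w in doc_words])
vecs = np.stack([embeddings[w] for w in doc_words])

# Document vector = tf-idf weighted average of its word vectors.
doc_vec = (weights[:, None] * vecs).sum(axis=0) / weights.sum()
print(doc_vec.shape)  # (50,)
```

Uniform weighting is the special case where all weights are 1, i.e. a plain average of the word vectors.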
The paragraph vector in doc2vec is also a method that yields a doc vector directly. Its key feature is modifying the CBOW and skip-gram models of word2vec; see the paper "Distributed Representations of Sentences and Documents" (ICML 2014).
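The core modification (the PV-DM variant) is that the averaged context fed into the CBOW predictor also includes a trainable per-paragraph vector. A minimal forward-pass sketch, with made-up sizes and random untrained weights:

```python
import numpy as np

# PV-DM forward pass: like CBOW, but the averaged context includes a
# trainable paragraph vector alongside the context word vectors.
rng = np.random.default_rng(1)
vocab_size, dim = 20, 8
W_in = rng.standard_normal((vocab_size, dim)) * 0.1   # word embeddings
D = rng.standard_normal((1, dim)) * 0.1               # one paragraph vector
W_out = rng.standard_normal((dim, vocab_size)) * 0.1  # output weights

context_ids = [3, 7, 11]  # ids of the surrounding words
h = (W_in[context_ids].sum(axis=0) + D[0]) / (len(context_ids) + 1)

# Softmax over the vocabulary to predict the center word.
logits = h @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```

Training updates both `W_in` and `D` by backpropagation; after training, the rows of `D` are the document vectors.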
There is also an approach that composes vectors recursively along the syntax tree, proposed at ICML 2011; see the paper "Parsing Natural Scenes and Natural Language with Recursive Neural Networks", which has several follow-up versions.
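The recursive composition can be sketched like this: a shared weight matrix maps two child vectors to their parent, applied bottom-up until the root gives the sentence vector (the tree and weights here are toy placeholders, not the trained model from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 10
W = rng.standard_normal((dim, 2 * dim)) * 0.1  # shared composition matrix
b = np.zeros(dim)

def compose(left, right):
    """Parent vector from two child vectors: tanh(W [left; right] + b)."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Tiny binary parse tree: ((the cat) sat)
the, cat, sat = (rng.standard_normal(dim) for _ in range(3))
np_phrase = compose(the, cat)       # node for "the cat"
sentence = compose(np_phrase, sat)  # root vector = sentence representation
print(sentence.shape)  # (10,)
```

Because the same `compose` is reused at every node, the model handles trees of any shape; the root vector serves as the document/sentence representation.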
Of course, word2vec is not the only way to obtain word vectors: RNNLM and GloVe can also produce the legendary high-quality word vectors.