How does word2vec get the word vector?
Word2vec is a tool that converts words into vectors. With it, the processing of text can be reduced to vector operations in a vector space, and the semantic similarity of texts can be expressed by computing similarity between vectors in that space.
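As a concrete illustration, here is a minimal usage sketch with the gensim library (assuming gensim 4.x); the toy sentences and parameter values are assumptions for demonstration only, not part of the original text.

```python
# A minimal word2vec usage sketch with gensim (assuming gensim 4.x).
from gensim.models import Word2Vec

sentences = [
    ["the", "microphone", "picks", "up", "sound"],
    ["mike", "speaks", "into", "the", "microphone"],
    ["the", "dog", "barks", "at", "the", "cat"],
]

# Train a small skip-gram model (sg=1); vector_size is the word-vector dimension.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vec = model.wv["microphone"]                       # the 50-dimensional word vector
print(model.wv.similarity("microphone", "mike"))   # cosine similarity in [-1, 1]
```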

First, a theoretical overview:

1. What is a word vector? If we want to turn natural language understanding into a machine learning problem, the first step is to find a way to mathematize the symbols. The most intuitive and, so far, most commonly used word representation in NLP is the one-hot representation, which represents each word as a long vector. The dimension of this vector equals the vocabulary size; most elements are 0, and only one dimension is 1, and that dimension identifies the current word.

For example:

"Microphone" means [00010000000000 ...]

"Mike" means [0000000 1000000...]

Each word is a single 1 in a vast ocean of 0s. If this one-hot representation is stored sparsely, it becomes very concise: simply assign a numeric ID to each word. In the example above, "microphone" is recorded as 3 and "Mike" as 8 (assuming numbering starts from 0). In a program, a hash table mapping each word to a number is enough. This concise representation, combined with algorithms such as maximum entropy, SVM, and CRF, has successfully handled various mainstream tasks in NLP. Of course, this representation has an important problem, the "lexical gap": any two words are completely isolated from each other. From two such vectors alone we cannot tell whether the words are related; even synonyms like "microphone" and "Mike" are not spared. Are the word vectors commonly used in deep learning the long one-hot vectors just described? No; deep learning generally uses a distributed representation (the translation of this term is debatable), a low-dimensional real-valued vector, somewhat like representing a word by its topics in LDA. Such a vector generally looks like [0.792, -0.177, -0.107, 0.109, -0.542, ...], and dimensions of 50 or 100 are common.
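To make the "lexical gap" concrete, the sketch below shows that any two distinct one-hot vectors have zero cosine similarity, while low-dimensional dense vectors can express relatedness. The first dense vector echoes the values quoted above; the second is made up purely for illustration.

```python
# Lexical gap: distinct one-hot vectors are orthogonal; dense vectors need not be.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

mic_onehot = np.zeros(15); mic_onehot[3] = 1.0
mike_onehot = np.zeros(15); mike_onehot[8] = 1.0
print(cosine(mic_onehot, mike_onehot))   # 0.0 -- carries no similarity information

mic_dense = np.array([0.792, -0.177, -0.107, 0.109, -0.542])
mike_dense = np.array([0.801, -0.150, -0.090, 0.120, -0.530])  # hypothetical values
print(cosine(mic_dense, mike_dense))     # close to 1.0 -- relatedness is expressible
```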

2. What is the origin of word vectors? Distributed representation was first proposed by Hinton in the 1986 paper "Learning Distributed Representations of Concepts". Although that paper did not insist that words be represented this way, the forward-looking idea planted a seed in people's minds at the time, and it began to receive more and more attention after 2000.

3. How are word vectors trained? To explain how word vectors are trained, we have to mention language models. All of the training methods I have seen so far train a language model and obtain the word vectors as a by-product. This is easy to understand: to learn anything from unlabeled natural text, one can do little more than count word frequencies, word co-occurrences, and word collocations, and among tasks built on such statistics, constructing a language model is arguably the one with the most demanding requirements (though better and more useful approaches may well be invented later). Since building a language model is such a demanding task, it requires finer statistics and analysis of the language, as well as better models and more data to support it, so it is not hard to understand why the best word vectors currently come from it. There are three classic lines of work on training word vectors: C&W 2008, M&H 2008, and Mikolov 2010. Before discussing them, however, one has to introduce Bengio's classic work in this series.
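As an illustration of getting word vectors "as a by-product" of training a predictive model, here is a minimal skip-gram sketch in NumPy over a toy corpus. It uses a full softmax for clarity, whereas real word2vec implementations use hierarchical softmax or negative sampling; the corpus, dimensions, and hyperparameters are arbitrary assumptions.

```python
# A minimal skip-gram sketch: predict context words from the center word.
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 10, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input (center-word) vectors
W_out = rng.normal(scale=0.1, size=(V, D))  # output (context-word) vectors

# Build (center, context) training pairs from a sliding window.
pairs = [(word2id[corpus[i]], word2id[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - window), min(len(corpus), i + window + 1))
         if i != j]

for epoch in range(200):
    for c, o in pairs:
        v = W_in[c].copy()                   # center-word vector
        scores = W_out @ v                   # unnormalized scores for all words
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                 # softmax over the vocabulary
        grad = probs.copy()
        grad[o] -= 1.0                       # d(loss)/d(scores) for cross-entropy
        W_in[c] -= lr * (W_out.T @ grad)     # update the center vector
        W_out -= lr * np.outer(grad, v)      # update all output vectors

# After training, each row of W_in serves as the learned word vector.
print(W_in[word2id["fox"]])
```

The prediction task itself is only the means; the rows of W_in are the word vectors that the text describes as a by-product of training.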

4. Evaluation of word vectors: Generally speaking, word vectors can be evaluated in two ways. The first is to plug the word vectors into an existing system and measure how much the system's performance improves; the second is to analyze the word vectors directly from a linguistic point of view, for example via word similarity or semantic offset (analogy) tasks.
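For the second, linguistic style of evaluation, a sketch using gensim's query API might look like the following; it assumes a trained or pretrained gensim 4.x model bound to the name `model`, and the specific word pairs are illustrative only.

```python
# Intrinsic evaluation sketches (assuming a gensim 4.x model is available as `model`).
print(model.wv.similarity("microphone", "mike"))   # word-similarity check

# Semantic offset / analogy check: vector("king") - vector("man") + vector("woman")
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```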