GloVe's full name is Global Vectors for Word Representation. GloVe is a word representation tool based on global word-frequency statistics (count-based, overall statistics).
Like word2vec, it represents each word as a vector of real numbers, and the vector captures some semantic relationships between words, such as similarity and analogy. Through vector operations, such as Euclidean distance or cosine similarity, the semantic similarity between two words can be computed.
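As a minimal illustration of such vector operations (the vectors below are made-up toy values, not real GloVe embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "word vectors" (real GloVe vectors are e.g. 50-300 dims).
king = np.array([0.8, 0.3, 0.1, 0.6])
queen = np.array([0.7, 0.4, 0.2, 0.6])

print(cosine_similarity(king, queen))  # close to 1 -> semantically similar
print(np.linalg.norm(king - queen))    # Euclidean distance: small -> similar
```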
3. Build the loss function:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $X_{ij}$ is the co-occurrence count, $w_i$ and $\tilde{w}_j$ are the word and context vectors, and $b_i$, $\tilde{b}_j$ are bias terms. This loss function is the simplest mean-square loss, except that a weight function $f(X_{ij})$ is added on top. Its role is this: words that often appear together in the corpus (frequent co-occurrences) should carry more weight than words that rarely co-occur, but the weight must not grow without bound; it stops increasing once the co-occurrence count passes a threshold. In addition, $f(0) = 0$, so pairs that never co-occur contribute nothing to the loss.
In the paper, the authors adopt a piecewise function that satisfies the above conditions:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

In all the experiments in the paper, $x_{\max} = 100$ and $\alpha = 0.75$.
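A small sketch of this weight function and the resulting weighted loss for a single co-occurrence pair (NumPy, with the paper's $x_{\max} = 100$ and $\alpha = 0.75$; the variable names are mine):

```python
import numpy as np

X_MAX, ALPHA = 100.0, 0.75

def f(x):
    """GloVe weight function: grows as (x / x_max)^alpha, capped at 1."""
    return (x / X_MAX) ** ALPHA if x < X_MAX else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error for one nonzero co-occurrence count x_ij."""
    inner = np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)
    return f(x_ij) * inner ** 2

# Example: two random 50-d vectors and a co-occurrence count of 20.
rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=50), rng.normal(size=50)
print(pair_loss(w_i, w_j, 0.0, 0.0, 20.0))
```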
Although many people describe GloVe as an unsupervised learning method, i.e. one that needs no manually labeled data, it does in fact have labels: the label is $\log X_{ij}$, and the vectors $w$ and $\tilde{w}$ are the parameters being continually updated during learning. So in essence its training procedure is no different from supervised learning, and it is based on gradient descent.
The specific training method is: use the AdaGrad gradient descent algorithm, randomly sample all non-zero elements of the matrix $X$, set the learning rate to 0.05, and iterate 50 times when the vector size is less than 300 (100 times for other sizes), until convergence.
Because $X$ is symmetric, the two resulting word vectors $w$ and $\tilde{w}$ should in principle also be symmetric and equivalent; their final values differ only because of different random initializations. To improve robustness, the sum $w + \tilde{w}$ is chosen as the final word vector (different initializations are equivalent to adding different random noise, so summing the two improves robustness).
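A rough NumPy sketch of this training scheme (AdaGrad over the nonzero entries of $X$, learning rate 0.05, and $w + \tilde{w}$ as the final vector). This is my simplified reconstruction under those assumptions, not the authors' released code:

```python
import numpy as np

def train_glove(cooc, vocab_size, dim=50, lr=0.05, epochs=50,
                x_max=100.0, alpha=0.75):
    """cooc: dict mapping (i, j) -> co-occurrence count X_ij (nonzero only)."""
    rng = np.random.default_rng(0)
    W = rng.uniform(-0.5, 0.5, (vocab_size, dim))    # word vectors w
    Wt = rng.uniform(-0.5, 0.5, (vocab_size, dim))   # context vectors w~
    b, bt = np.zeros(vocab_size), np.zeros(vocab_size)
    # AdaGrad accumulators, initialized to 1 so the first steps stay bounded.
    gW, gWt = np.ones_like(W), np.ones_like(Wt)
    gb, gbt = np.ones_like(b), np.ones_like(bt)

    pairs = list(cooc.items())
    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):      # random order over nonzero X_ij
            (i, j), x = pairs[idx]
            weight = (x / x_max) ** alpha if x < x_max else 1.0
            inner = W[i] @ Wt[j] + b[i] + bt[j] - np.log(x)
            g = 2.0 * weight * inner                 # gradient of the squared term
            dW, dWt = g * Wt[j], g * W[i]
            # AdaGrad: scale each update by the accumulated squared gradients.
            W[i] -= lr * dW / np.sqrt(gW[i])
            gW[i] += dW ** 2
            Wt[j] -= lr * dWt / np.sqrt(gWt[j])
            gWt[j] += dWt ** 2
            b[i] -= lr * g / np.sqrt(gb[i])
            gb[i] += g ** 2
            bt[j] -= lr * g / np.sqrt(gbt[j])
            gbt[j] += g ** 2
    return W + Wt                                    # final vectors: the sum w + w~

# Toy usage: a 3-word vocabulary with a few nonzero co-occurrence counts.
vectors = train_glove({(0, 1): 4.0, (1, 2): 2.0, (0, 2): 1.0}, vocab_size=3, dim=8)
print(vectors.shape)  # (3, 8)
```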
Figure 1 uses three metrics: semantic accuracy, syntactic accuracy, and overall accuracy. From it we can easily see that performance is best at a vector dimension of 300, with the context window size roughly between 6 and 10.
If the corpus itself is relatively small so that fine-tuning has little effect, or you lack the computing power to train from scratch, directly using GloVe word vectors pre-trained on large corpora also gives good results.
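For example, pre-trained vectors can be loaded from Stanford's released text files (the file name glove.6B.300d.txt below assumes the official glove.6B download has been unzipped locally; the path is an assumption):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file: each line is a word followed by its vector values."""
    vectors = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

# Hypothetical local path to the unzipped pre-trained file.
glove = load_glove("glove.6B.300d.txt")
print(glove["king"].shape)  # (300,)
```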