Key background knowledge that may be involved in this article, and its sources:
Word2vec: Efficient Estimation of Word Representations in Vector Space
What it addresses: learning and training representations without task supervision, finding the most suitable representation and exploring the internal relationships it captures.
One of the earliest applications of neural network embedding is word2vec.
Word2Vec is a model that learns semantic knowledge from large text corpora in an unsupervised way, and it is widely used in natural language processing (NLP). So how does it help us with natural language processing? Word2Vec expresses the semantic information of words by learning from text, that is, it places semantically similar words close together in an embedding space. An embedding is essentially a mapping that takes words from their original space into a new multidimensional space, i.e., it embeds the original space into a new one.
For example:
There are four words: man, woman, king and queen. We usually hope that the embeddings of these four words preserve the expected (distance) relationships: the distance between man and woman is smaller than the distance between man and queen; likewise, the distance between man and king is roughly the same as the distance between woman and queen, and both are smaller than the distances between man and queen and between woman and king.
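As a quick illustration, here is a minimal sketch of checking these relationships with gensim, assuming the pretrained "word2vec-google-news-300" vectors available through gensim's downloader (any pretrained word2vec model would do):

```python
# Sketch: checking the man/woman/king/queen relationships with pretrained vectors.
# Assumes the "word2vec-google-news-300" model is available via gensim's downloader.
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # returns KeyedVectors

# Vector arithmetic: king - man + woman should land near "queen".
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Distances mirror the intuition above: man is closer to woman than to queen.
print(model.similarity("man", "woman"), model.similarity("man", "queen"))
```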
In Word2Vec there are two main model architectures: Skip-Gram and CBOW. Intuitively speaking, Skip-Gram predicts the context given the input word, while CBOW predicts the input (center) word given its context. As shown in the figure below.
The following figure shows the network structure of CBOW. The input consists of multiple context words, whose vectors are usually summed and then averaged; the loss function remains unchanged.
The following figure shows the network structure of Skip-Gram: input a word and predict the surrounding words.
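For reference, here is a minimal sketch of training both architectures with gensim; the toy corpus and parameter values are purely illustrative:

```python
# Minimal sketch: CBOW vs. Skip-Gram in gensim (toy corpus, illustrative parameters).
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "fox", "is", "quick", "and", "the", "dog", "is", "lazy"],
]

# sg=0 -> CBOW: the averaged context predicts the center word.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-Gram: the center word predicts each surrounding word.
# negative=5 enables negative sampling with 5 negative words per positive pair.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

print(cbow.wv["fox"].shape, skipgram.wv["fox"].shape)  # (50,) (50,)
```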
We can see that the computational cost of training such a model directly is very high, so there are two speed-up methods: hierarchical softmax and negative sampling. Briefly, the essence of hierarchical softmax is to turn an N-way classification problem into about log(N) binary classification problems by building a tree model. Our focus today is negative sampling.
The idea of negative sampling is simpler and more direct: instead of updating all of the output vectors in every iteration, which is far too expensive, we update only a small sample of them.
This is the core of the idea, and it is realized as follows:
When we train our neural network on the sample pair (input word: "fox", output word: "fast"), both "fox" and "fast" are one-hot encoded. If the vocabulary size is 10,000, then in the output layer we expect the neuron corresponding to the word "fast" to output 1 and the other 9,999 neurons to output 0. The words corresponding to those 9,999 neurons whose expected output is 0 are called "negative" words. With negative sampling, we randomly select a small number of negative words (say, 5 of them) and update only their weights; we also update the weights of our "positive" word (in the example above, the word "fast").
Now let's welcome today's number one protagonist, the negative sampling objective function:
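Restated here for reference (the standard form from the word2vec paper, where $w_I$ is the input word, $w_O$ the observed output word, and $k$ negative words $w_i$ are drawn from a noise distribution $P_n(w)$):

$$\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right)\right]$$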
Therefore, this objective function can be understood as imposing two constraints: push up the score of the observed (positive) word pair, and push down the scores of the k sampled negative pairs.
Word2vec turns words into vectors; as the name implies, node2vec turns the nodes of a complex network into vectors. Its core idea is: generate random walks, sample the walks to obtain (node, context) pairs, and then model these pairs the same way word vectors are trained, obtaining a representation for each network node.
DeepWalk and node2vec are highly consistent in their thinking. Compared with DeepWalk, node2vec innovates in how the random walks are generated. Instead of comparing the two in depth here, let me jump straight to the conclusion and invite today's second protagonist, the core structure of this family of encoding methods, which I personally think of as an "upper and lower" structure.
Upper part: walk among a node's neighbors and collect them in order; the specific walk strategy depends on what information you want to capture.
Lower part: treat the collected sequences as text; the subsequent steps are essentially the same as training word vectors (see the sketch below).
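To make this "upper and lower" structure concrete, here is a minimal DeepWalk-style sketch: uniform random walks fed into gensim's Word2Vec. The example graph, walk counts, and parameters are illustrative only:

```python
# Sketch of the "upper and lower" structure:
# upper = generate random-walk sequences on the graph,
# lower = train Skip-Gram on those sequences as if they were sentences.
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()  # small example graph

def random_walk(graph, start, length=10):
    walk = [start]
    while len(walk) < length:
        neighbors = list(graph.neighbors(walk[-1]))
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return [str(n) for n in walk]  # Word2Vec expects string "tokens"

# Upper part: collect the sequences.
walks = [random_walk(G, node) for node in G.nodes() for _ in range(10)]

# Lower part: treat the walks as sentences, train Skip-Gram with negative sampling.
model = Word2Vec(walks, vector_size=64, window=5, min_count=1, sg=1, negative=5)
print(model.wv["0"].shape)  # embedding of node 0
```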
Let's take node2vec as an example to briefly introduce this process.
The purpose of generating these sequences is simple: how you design the walk order reflects what kind of information you want. As the original paper notes, choosing breadth-first or depth-first sampling essentially expresses which kind of accumulated information you consider important. BFS tends to wander around the initial node and so reflects the microscopic characteristics of a node's neighborhood; DFS generally moves farther and farther away from the initial node and so reflects the macroscopic characteristics of a node's neighborhood. So! So! So! The sequence strategy is a direct reflection of which part of the information the practitioner cares about more! (The second protagonist reappears.)
The random walk strategy proposed in the original node2vec paper is in fact a strategy that combines BFS and DFS. Let's take a closer look.
The picture above shows that the walk has just moved from t to v, so who should be the next lucky node? The original authors give the following transition probabilities:
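Restating the transition bias from the node2vec paper for reference: if the walk has just moved from $t$ to $v$ and is considering a neighbor $x$ of $v$, with $d_{tx}$ the shortest-path distance between $t$ and $x$, then

$$\alpha_{pq}(t, x) = \begin{cases} 1/p & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ 1/q & \text{if } d_{tx} = 2 \end{cases}$$

and the unnormalized transition probability is $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $w_{vx}$ is the edge weight.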
Let's unpack this distribution. The meanings of the parameters p and q are as follows:
Return parameter p: controls how likely the walk is to immediately revisit the previous node t; the larger p is, the less likely the walk is to backtrack.
In-out parameter q: controls whether the walk stays close to t (q > 1 biases toward BFS-like exploration) or moves farther away from t (q < 1 biases toward DFS-like exploration).
When p = 1 and q = 1, this walk is equivalent to the uniform random walk of DeepWalk.
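Below is a minimal sketch of a single biased step of this walk, assuming an unweighted networkx graph; the function name `biased_step` and the example parameters are illustrative only:

```python
# Sketch: one step of the node2vec biased walk on an unweighted networkx graph.
# prev = t (the node we came from), curr = v (the node we are at now).
import random
import networkx as nx

def biased_step(graph, prev, curr, p=1.0, q=1.0):
    neighbors = list(graph.neighbors(curr))
    weights = []
    for x in neighbors:
        if x == prev:                     # d_tx = 0: return to t
            weights.append(1.0 / p)
        elif graph.has_edge(prev, x):     # d_tx = 1: stay near t (BFS-like)
            weights.append(1.0)
        else:                             # d_tx = 2: move away from t (DFS-like)
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]

G = nx.karate_club_graph()
print(biased_step(G, prev=0, curr=1, p=0.25, q=4.0))  # BFS-biased example
```

With p = 1 and q = 1 all weights are equal, which is exactly the uniform DeepWalk behavior mentioned above.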
Once again: the sequence strategy directly reflects which part of the information the practitioner cares about more! (The second protagonist reappears!)
In this part, we only emphasize that the original authors define the objective function by extending Skip-Gram:
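For reference, the extended Skip-Gram objective from the node2vec paper, where $f(u)$ is the embedding of node $u$ and $N_S(u)$ is the neighborhood of $u$ generated by the sampling strategy $S$:

$$\max_{f} \sum_{u \in V} \log \Pr\!\big(N_S(u) \mid f(u)\big)$$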
So far, we have basically understood the structural characteristics of this family of embedding methods (the so-called "upper and lower" structure). However, all of the methods discussed so far deal with homogeneous networks; metapath2vec is one of the methods for dealing with heterogeneous networks.
Metapath2vec is a vertex embedding method for heterogeneous information networks proposed by Dong et al. in 2017. Metapath2vec uses metapath-based random walks to construct the heterogeneous neighborhood of each vertex and then uses the Skip-Gram model to embed the vertices. On top of metapath2vec, the authors also propose metapath2vec++, which models structural and semantic correlations in heterogeneous networks simultaneously.
The following are the contributions of metapath2vec:
Let's take a look at how metapath2vec encodes heterogeneous networks.
For a heterogeneous network, the goal is to learn a d-dimensional representation for each vertex, where d is much smaller than the dimension of the adjacency matrix (the number of vertices), while preserving the structural information and semantic relationships of the graph.
The key point here is that although the vertex types differ, the representation vectors of all vertex types are mapped into the same dimensional space. Because of this network heterogeneity, traditional vertex embedding methods designed for homogeneous networks are difficult to apply directly and effectively to heterogeneous networks.
The metapath2vec method focuses on improving how the sequences are mined; its changes to the training process are minor.
Random Walk Based on Metapath
This random walk scheme captures the semantic and structural relationships between different types of vertices at the same time, which is what allows the heterogeneous network structure to be fed into metapath2vec's Skip-Gram model.
One hint: in general, a metapath is defined to be symmetric. For example, a metapath scheme can be "o-a-p-v-p-a-o".
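Below is a minimal sketch of a metapath-guided walk, assuming a networkx graph whose nodes carry a "type" attribute (e.g. "a" for author, "p" for paper, "v" for venue); the function name, the toy graph, and the scheme handling are illustrative only:

```python
# Sketch: a random walk constrained by a symmetric metapath scheme such as a-p-v-p-a.
# Assumes every node has a "type" attribute ("a", "p", "v", ...).
import random
import networkx as nx

def metapath_walk(graph, start, scheme=("a", "p", "v", "p", "a"), length=20):
    assert graph.nodes[start]["type"] == scheme[0]
    walk = [start]
    step = 0
    while len(walk) < length:
        step += 1
        # The scheme is symmetric (first type == last type), so it simply repeats.
        next_type = scheme[step % (len(scheme) - 1)]
        candidates = [x for x in graph.neighbors(walk[-1])
                      if graph.nodes[x]["type"] == next_type]
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk

# Toy heterogeneous graph: authors (a), papers (p), venues (v).
G = nx.Graph()
G.add_nodes_from(["a1", "a2"], type="a")
G.add_nodes_from(["p1", "p2"], type="p")
G.add_nodes_from(["v1"], type="v")
G.add_edges_from([("a1", "p1"), ("a2", "p2"), ("p1", "v1"), ("p2", "v1")])
print(metapath_walk(G, "a1"))
```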
At this point, we can compare the contents of this section with the sequence strategy in Section 4.1 above to see the first core contribution of metapath2vec: the sequence strategy.
As the name implies, the vertex representations of the heterogeneous network are learned by maximizing the conditional probability of a vertex's heterogeneous context.
Now, please welcome our number one protagonist once again, the objective function under the original Skip-Gram negative sampling:
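Restated in node notation for reference (with $X_v$ the embedding of node $v$, $c_t$ the observed context node of type $t$, and $M$ negatives $u^m$ drawn from a distribution $P(u)$ over all vertices):

$$\log \sigma(X_{c_t} \cdot X_v) + \sum_{m=1}^{M} \mathbb{E}_{u^m \sim P(u)}\left[\log \sigma(-X_{u^m} \cdot X_v)\right]$$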
Did you spot the difference? The essential difference is very subtle; for the negative-sampling objective itself there is hardly any difference at all. So the main contribution of this part is the upgrade of the "sequence".
Above, we saw metapath2vec's upgrade of the "upper" part. Now let's look at how metapath2vec++ upgrades the "lower" part.
Two main points:
First, the softmax function is normalized within the vertex set of each context vertex type, namely:
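Restated from the metapath2vec++ paper, where $V_t$ is the set of vertices of type $t$:

$$p(c_t \mid v; \theta) = \frac{e^{X_{c_t} \cdot X_v}}{\sum_{u_t \in V_t} e^{X_{u_t} \cdot X_v}}$$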
The objective function here is not essentially different from our number one protagonist's. But! But! But! The heterogeneous information of the heterogeneous network is now reflected not only in the sequences but also in the objective function. Putting the objective function of metapath2vec and the objective function of metapath2vec++ side by side for comparison:
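In the notation used above, the comparison essentially comes down to where the negative samples are drawn from:

metapath2vec: negatives $u^m \sim P(u)$ are drawn from all vertices, exactly as in the objective restated earlier.

metapath2vec++: negatives are drawn only from vertices of the same type as the context node $c_t$, i.e. $u_t^m \sim P_t(u_t)$ over $V_t$:

$$\log \sigma(X_{c_t} \cdot X_v) + \sum_{m=1}^{M} \mathbb{E}_{u_t^m \sim P_t(u_t)}\left[\log \sigma(-X_{u_t^m} \cdot X_v)\right]$$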
This is what gives metapath2vec++ its upgrade of the "training" objective.
Experimental results: the figure below is a screenshot from the original paper, showing category clustering accuracy: