A Detailed Introduction to Graph Convolutional Networks (GCN)
In this article, we take a close look at a well-known graph neural network called GCN. We first build an intuition for how it works, and then dive into the mathematics behind it.


Original English title: Graph Convolutional Networks (GCN)


Many problems are graphs by nature. In our world, we see a lot of data that are graphs, such as molecules, social networks, and paper citation networks.

Examples of graphs. (Image from [1])

In a graph, we have node features (the data of the nodes) and the structure of the graph (how the nodes are connected).

For the nodes, we can easily obtain each node's data. But when it comes to the structure of the graph, extracting useful information from it is not easy. For example, if two nodes are close to each other, should we treat them differently from other pairs of nodes? How do we handle high-degree and low-degree nodes? In fact, for each specific task, just the feature engineering, i.e. converting the graph structure into usable features, would consume a lot of time and effort.

Feature engineering on graphs. (Image from [1])

It would be much better if we could somehow take both the node features and the structural information of the graph as input, and let the machine figure out for itself which information is useful.

This is why we need graph representation learning.

We want the graph to learn the "feature engineering" by itself. (Image from [1])

Paper: Semi-Supervised Classification with Graph Convolutional Networks (2017) [3]

GCN is a type of convolutional neural network that can work directly on graphs and take advantage of their structural information.

It solves the problem of classifying nodes (such as documents) in a graph (such as a citation network), where labels are only available for a small subset of nodes (semi-supervised learning).

An example of semi-supervised learning on a graph. Some nodes have no labels (unknown nodes).

As the name "convolutional" suggests, the idea originated from images and was later brought to graphs. However, while images have a fixed, regular structure, graphs are much more complex.

The idea of convolution, from images to graphs. (Image from [1])

The basic idea of GCN: for each node, we take the feature information from all of its neighbors, as well as its own features. Suppose we use the average() function. We do the same for all nodes. Finally, we feed these averaged values into a neural network.

In the figure below, we have a simple example with a citation network. Each node represents a research paper, and each edge represents a citation. We have a preprocessing step here: instead of using the raw text of each paper as its features, we convert the paper into a vector (by using an NLP embedding such as tf-idf).

Let's consider the green node. First, we take the feature values of all its neighbors, including the node itself, and compute their average. Finally, this average is passed through a neural network, which returns a result vector as the final output.

The main idea of GCN. Consider the green node. First, we take the average of all its neighbors, including the node itself. Then, the average value is passed through a neural network. Note that in GCN, we use only a single fully connected layer. In this example, we get a two-dimensional vector as the output (two nodes in the fully connected layer).
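To make this step concrete, here is a minimal numpy sketch of the idea for one node, assuming a small toy graph; the adjacency matrix, feature values, and weights below are invented for illustration and are not taken from the paper or its code:

```python
import numpy as np

# Toy citation graph: 4 papers, adjacency[i, j] = 1 if papers i and j cite each other.
adjacency = np.array([[0, 1, 1, 0],
                      [1, 0, 0, 1],
                      [1, 0, 0, 1],
                      [0, 1, 1, 0]], dtype=float)

# Toy node features, e.g. 3-dimensional tf-idf-like vectors.
features = np.random.rand(4, 3)

# Weights of a single fully connected layer mapping 3 features -> 2 outputs.
W = np.random.rand(3, 2)

green = 0                                          # the node we are updating
neighbors = np.flatnonzero(adjacency[green])       # indices of its neighbors
group = np.concatenate(([green], neighbors))       # include the node itself
avg = features[group].mean(axis=0)                 # average the feature vectors

output = avg @ W                                   # pass through the fully connected layer
print(output)                                      # a 2-dimensional vector, as in the figure
```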

In practice, we can use aggregation functions that are more sophisticated than the average function. We can also stack more layers on top of each other to get a deeper GCN, where the output of each layer is the input of the next layer.

Example of a two-layer GCN: the output of the first layer is the input of the second layer. Again, note that the neural network in GCN is simply a fully connected layer. (Image from [2])
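A rough sketch of such a two-layer forward pass is shown below. Here A_hat stands for the (normalized) adjacency matrix that is derived in the math section later in the article, and all of the names are illustrative rather than taken from the paper's code:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def gcn_two_layer_forward(A_hat, X, W0, W1):
    """Each layer aggregates neighbor features (A_hat @ H) and applies one fully connected layer (@ W)."""
    H1 = relu(A_hat @ X @ W0)   # layer 1
    H2 = A_hat @ H1 @ W1        # layer 2: its input is the output of layer 1
    return H2

# Hypothetical shapes: 5 nodes, 4 input features, 8 hidden units, 2 output dimensions.
A_hat = np.eye(5)                                      # placeholder normalized adjacency matrix
X = np.random.rand(5, 4)
W0, W1 = np.random.rand(4, 8), np.random.rand(8, 2)
print(gcn_two_layer_forward(A_hat, X, W0, W1).shape)   # (5, 2)
```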

Let's take a closer look at how it works from a mathematical point of view.

First, we need some notations.

Let's consider the graph G, shown below.

From the graph G, we have an adjacency matrix A and a degree matrix D. We also have a feature matrix X.

So how do we get the feature values of each node from its neighboring nodes? The solution lies in the multiplication of A and X.

Looking at the first row of the adjacency matrix, we see that node A has a connection to node E. The first row of the resulting matrix is therefore the feature vector of E, the node that A connects to (see the figure below). Similarly, the second row of the resulting matrix is the sum of the feature vectors of D and E. In this way, we obtain the sum of the feature vectors of all neighboring nodes.

Computing the first row of the sum-vector matrix AX.
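To see this concretely, here is a small numpy example. The adjacency matrix and feature values below are a made-up stand-in for the graph in the figure, not the exact numbers used there:

```python
import numpy as np

# Nodes in the order A, B, C, D, E (a hypothetical graph for illustration).
A = np.array([[0, 0, 0, 0, 1],   # A is connected to E
              [0, 0, 0, 1, 1],   # B is connected to D and E
              [0, 0, 0, 0, 1],   # C is connected to E
              [0, 1, 0, 0, 1],   # D is connected to B and E
              [1, 1, 1, 1, 0]],  # E is connected to A, B, C, D
             dtype=float)

X = np.array([[ 0.1, -1.0],      # feature vector of A
              [ 0.5,  2.0],      # feature vector of B
              [ 1.0,  0.0],      # feature vector of C
              [-0.5,  1.5],      # feature vector of D
              [ 2.0,  0.3]])     # feature vector of E

AX = A @ X
print(AX[0])   # equals X[4]: the feature vector of E, A's only neighbor
print(AX[1])   # equals X[3] + X[4]: the sum of D's and E's feature vectors
```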

There are two problems with this simple sum. (1) We have not taken into account the features of the node itself: the diagonal of A is zero, so each node's own feature vector is missing from its aggregation. (2) Instead of an average, AX gives a sum, so high-degree nodes end up with large aggregated vectors and low-degree nodes with small ones. For problem (1), we can fix it by adding an identity matrix I to A, obtaining a new adjacency matrix Ã.

Let λ = 1 (i.e., the node's own features are just as important as those of its neighbors); then Ã = A + I. Note that we could treat λ as a trainable parameter, but for now we simply set it to 1; even in the paper, λ is just set to 1.

By adding a self-loop to each node, we obtain the new adjacency matrix Ã.
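In code, adding the self-loops is just adding a (scaled) identity matrix; this tiny sketch reuses the hypothetical adjacency matrix from the previous example:

```python
import numpy as np

# Hypothetical adjacency matrix from the previous sketch (nodes A, B, C, D, E).
A = np.array([[0, 0, 0, 0, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [1, 1, 1, 1, 0]], dtype=float)

lam = 1.0                              # lambda = 1, as in the paper
A_tilde = A + lam * np.eye(len(A))     # add a self-loop to every node
print(np.diag(A_tilde))                # every diagonal entry is now 1
```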

For problem (2): for matrix scaling, we usually multiply the matrix by a diagonal matrix. In our case, we want to take the average of the aggregated features, or, mathematically, to scale the aggregated vector matrix ÃX according to the node degrees. Intuition tells us that the diagonal matrix used for scaling here should be something related to the degree matrix D̃ (why D̃ and not D? Because we are now considering the degree matrix D̃ of the new adjacency matrix Ã, no longer that of A).

The question now becomes: how do we scale/normalize these sum vectors? In other words, how do we pass the neighbors' information on to a particular node? Let's start with our old friend, the average. In this case, D̃^(-1) (the inverse of D̃) does the job. Basically, each element of D̃^(-1) is the reciprocal of the corresponding entry of the diagonal degree matrix D̃.

For example, if node A has degree 2, we multiply the aggregated vector of node A by 1/2; node E has degree 5, so we multiply the aggregated vector of E by 1/5; and so on.

Therefore, by multiplying D̃^(-1) with ÃX, we can take the average of the feature vectors of all neighboring nodes (including the node itself).
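Continuing the same hypothetical example, the averaging step D̃^(-1)ÃX might look like this in numpy:

```python
import numpy as np

# Hypothetical graph (nodes A, B, C, D, E) from the earlier sketches, with random features.
A = np.array([[0, 0, 0, 0, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [1, 1, 1, 1, 0]], dtype=float)
X = np.random.rand(5, 2)

A_tilde = A + np.eye(5)            # adjacency matrix with self-loops
deg = A_tilde.sum(axis=1)          # degrees including the self-loop (A has 2, E has 5)
D_tilde_inv = np.diag(1.0 / deg)   # reciprocals of the degrees on the diagonal

avg = D_tilde_inv @ A_tilde @ X    # row i = average feature vector of node i and its neighbors
print(avg)
```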

So far, so good. But you may ask: what about a weighted average()? Intuitively, it should be even better if we treat high-degree and low-degree nodes differently.

The problem is that we are only scaling by rows, while ignoring the corresponding columns (dashed boxes).

Adding a new scaler for the columns.

The new scaling gives us a "weighted" average. What we are doing here is putting more weight on nodes with low degree and reducing the impact of high-degree nodes. The idea behind this weighted average is that we assume low-degree nodes have a larger influence on their neighbors, whereas high-degree nodes have a smaller influence because their influence is spread across too many neighbors.

When aggregating the features of neighboring nodes at node B, we assign the largest weight to node B itself (degree 3) and the smallest weight to node E (degree 5).

Because we normalize twice (once over the rows and once over the columns), we change the exponent from "-1" to "-1/2": the aggregation becomes D̃^(-1/2) Ã D̃^(-1/2) X.
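Here is the symmetric normalization written out in numpy, again on the same made-up example graph:

```python
import numpy as np

A = np.array([[0, 0, 0, 0, 1],
              [0, 0, 0, 1, 1],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 1],
              [1, 1, 1, 1, 0]], dtype=float)
X = np.random.rand(5, 2)

A_tilde = A + np.eye(5)
deg = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)           # D_tilde^(-1/2)

A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # normalize over both rows and columns
out = A_hat @ X                             # weighted average of each node's neighborhood
print(out)
```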

For example, suppose we have a multi-class classification problem with 10 classes; then F, the output dimension of layer 2, is set to 10. After obtaining the 10-dimensional vectors at layer 2, we pass these vectors through a softmax function for the prediction.

The loss function is computed very simply as the cross-entropy error over all labeled examples, where Y_l is the set of labeled node indices.
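Here is a minimal sketch of this prediction and loss computation, assuming plain numpy and made-up labels; only the labeled nodes contribute to the loss:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def masked_cross_entropy(layer2_output, labels_onehot, labeled_idx):
    """Cross-entropy error computed only over the labeled nodes Y_l."""
    probs = softmax(layer2_output)
    per_node = -(labels_onehot * np.log(probs + 1e-12)).sum(axis=1)
    return per_node[labeled_idx].mean()

# Hypothetical usage: 5 nodes, 10 classes, only nodes 0 and 3 are labeled.
layer2_output = np.random.randn(5, 10)     # 10-dimensional vectors from layer 2
labels = np.eye(10)[[2, 0, 0, 7, 0]]       # one-hot labels (only rows 0 and 3 are used)
print(masked_cross_entropy(layer2_output, labels, labeled_idx=[0, 3]))
```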

The number of layers is the farthest distance that a node's features can travel. For example, with a 1-layer GCN, each node can only get information from its direct neighbors. The gathering of information happens independently for each node, and for all nodes at the same time.

When stacking another layer on top of the first one, we repeat the gathering process, but this time the neighbors already contain information about their own neighbors (from the previous step). This makes the number of layers the maximum number of hops a node's information can travel. So, depending on how far we think a node should gather information from the network, we can choose an appropriate number for #layers. But again, on graphs we usually do not want to go too far: with 6-7 hops we reach almost the entire graph, which makes the aggregation less meaningful.

Example: the process of gathering information with two layers for the target node i.
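As a quick illustration of this "layers = hops" idea, here is a hypothetical sketch on a tiny path graph (the graph and the indicator features are made up for the example): one propagation step mixes in 1-hop neighbors, two steps mix in 2-hop neighbors.

```python
import numpy as np

# A hypothetical path graph 0-1-2-3-4 with self-loops already added.
A_hat = np.eye(5)
for i in range(4):
    A_hat[i, i + 1] = A_hat[i + 1, i] = 1.0

X = np.eye(5)                      # indicator features: column j marks node j
one_layer = A_hat @ X              # after 1 layer, node 0 has mixed in nodes 0 and 1
two_layers = A_hat @ one_layer     # after 2 layers, node 0 has also mixed in node 2

print(np.flatnonzero(one_layer[0]))    # [0 1] -> 1-hop neighborhood
print(np.flatnonzero(two_layers[0]))   # [0 1 2] -> 2-hop neighborhood
```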

The paper also conducted experiments with shallow and deep GCNs. In the figure below, we can see that the best results are obtained with a 2- or 3-layer model. Moreover, with deep GCNs (more than 7 layers), performance tends to get worse (dashed blue line). One solution is to use residual connections between the hidden layers (purple lines).

Performance for different numbers of layers. (Image from the paper [3])

An explanation from the authors of the paper

This framework is currently limited to undirected graphs (weighted or unweighted). However, it is possible to handle both directed edges and edge features by representing the original directed graph as an undirected bipartite graph, with additional nodes that represent the edges in the original graph.

With GCN, it seems we can make use of both the node features and the structure of the graph at the same time. But what if the edges in the graph are of different types? Should we treat each relationship differently? How do we aggregate neighboring nodes in that case? What are the recent advanced approaches?

In the next article on the graph topic, we will look into some more sophisticated methods.

How do we deal with different kinds of relationships on the edges (brother, friend, ...)?

[1] Jure Leskovec's excellent slides on graph representation learning:

[5] Demo using the StellarGraph library: -node-classification.html.
