This article is one of two Distill publications about graph neural networks.
Take a look at
A Gentle Introduction to Graph Neural Networks
for a companion view on many things graph and neural network related.
Many systems and interactions – social networks, molecules, organizations, citations, physical models, transactions – can be represented quite naturally as graphs.
How can we reason about and make predictions within these systems?
One idea is to look at tools that have worked well in other domains: neural networks have shown immense predictive power in a variety of learning tasks.
However, neural networks have traditionally been used to operate on fixed-size and/or regular-structured inputs (such as sentences, images and video).
This makes them unable to elegantly process graph-structured data.
Graph neural networks (GNNs) are a family of neural networks that can operate naturally on graph-structured data.
By extracting and utilizing features from the underlying graph,
GNNs can make more informed predictions about entities in these interactions,
as compared to models that consider individual entities in isolation.
GNNs are not the only tools available to model graph-structured data:
graph kernels
and random-walk methods
were some of the most popular ones.
Today, however, GNNs have largely replaced these techniques
because of their inherent flexibility to model the underlying systems
better.
In this article, we will illustrate
the challenges of computing over graphs,
describe the origin and design of graph neural networks,
and explore the most popular GNN variants in recent times.
Particularly, we will see that many of these variants
are composed of similar building blocks.
First, let’s discuss some of the complications that graphs come with.
The Challenges of Computation on Graphs
Lack of Consistent Structure
Graphs are extremely flexible mathematical models; but this means they lack consistent structure across instances.
Consider the task of predicting whether a given chemical molecule is toxic.
Looking at a few examples, the following issues quickly become apparent:
- Molecules may have different numbers of atoms.
- The atoms in a molecule may be of different types.
- Each of these atoms may have a different number of connections.
- These connections can have different strengths.
Representing graphs in a format that can be computed over is nontrivial,
and the final representation chosen often depends significantly on the actual problem.
Node-Order Equivariance
Extending the point above: graphs often have no inherent ordering present amongst the nodes.
Compare this to images, where every pixel is uniquely determined by its absolute position within the image!
But what do we do when the nodes have no inherent order?
Above:
The same graph labelled in two different ways. The letters indicate the ordering of the nodes.
As a result, we would like our algorithms to be node-order equivariant:
they should not depend on the ordering of the nodes of the graph.
If we permute the nodes in some way, the resulting representations of
the nodes as computed by our algorithms should also be permuted in the same way.
Scalability
Graphs can be really large! Think about social networks like Facebook and Twitter, which have over a billion users.
Operating on data this large is not easy.
Luckily, most naturally occurring graphs are ‘sparse’:
they tend to have their number of edges linear in their number of vertices.
We will see that this allows the use of clever methods
to efficiently compute representations of nodes within the graph.
Further, the methods that we look at here will have significantly fewer parameters
in comparison to the size of the graphs they operate on.
Problem Setting and Notation
There are many useful problems that can be formulated over graphs:
- Node Classification: Classifying individual nodes.
- Graph Classification: Classifying entire graphs.
- Node Clustering: Grouping together similar nodes based on connectivity.
- Link Prediction: Predicting missing links.
- Influence Maximization: Identifying influential nodes.
This list is not exhaustive!
A common precursor in solving many of these problems is node representation learning:
learning to map individual nodes to fixed-size real-valued vectors (called ‘representations’ or ‘embeddings’).
In Learning GNN Parameters, we will see how the learnt embeddings can be used for these tasks.
Different GNN variants are distinguished by the way these representations are computed.
Generally, however, GNNs compute node representations in an iterative process.
We will use the notation $h_v^{(k)}$ to denote the representation of node $v$
after the $k^{\text{th}}$ iteration.
We will define a graph $G$ as a set of nodes $V$, with a set of edges $E$ connecting them.
Nodes can have individual features as part of the input: we will denote by
$x_v$ the individual feature for node $v \in V$.
For example, the ‘node features’ for a pixel in a color image
would be the red, green and blue channel (RGB) values at that pixel.
For ease of exposition, we will assume
$G$ is undirected.
Many of the same ideas we will see here
apply to other kinds of graphs:
we will discuss this later in Different Kinds of Graphs.
Sometimes we will need to denote a graph property by a matrix
$M$, where each row $M_v$ represents a quantity corresponding to a particular node $v$.
Extending Convolutions to Graphs
Convolutional Neural Networks have proven to be quite powerful in extracting features from images.
However, images themselves can be seen as graphs with a very regular grid-like structure,
where the individual pixels are nodes, and the RGB channel values at each pixel are the node features.
A natural idea, then, is to consider generalizing convolutions to arbitrary graphs. Recall, however, the challenges
listed out in the previous section: in particular, ordinary convolutions are not node-order invariant, because
they depend on the absolute positions of pixels.
It is initially unclear how to generalize convolutions over grids to convolutions over general graphs,
where the neighbourhood structure differs from node to node.
The curious reader may wonder if performing some sort of padding and ordering
could be done to ensure the consistency of neighbourhood structure across nodes.
This has been attempted with some success,
but the techniques we will look at here are more general and powerful.
Neighbours participating in the convolution at the center pixel are highlighted in gray.
Hover over a node to see its immediate neighbourhood highlighted on the left.
The structure of this neighbourhood changes from node to node.
We begin by introducing the idea of constructing polynomial filters over node neighbourhoods,
much like how CNNs compute localized filters over neighbouring pixels.
Then, we will see how more recent approaches extend on this idea with more powerful mechanisms.
Finally, we will discuss alternative methods
that can use ‘global’ graph-level information for computing node representations.
Polynomial Filters on Graphs
The Graph Laplacian
Given a graph $G$, let us fix an arbitrary ordering of the $n$ nodes of $G$.
We denote the $n \times n$ adjacency matrix of $G$ by $A$.
Using $A$, we can construct the diagonal degree matrix $D$ of $G$ as:
$$D_v = \sum_u A_{vu},$$
where $A_{vu}$ denotes the entry in the row corresponding to $v$
and the column corresponding to $u$ in the matrix $A$.
We will use this notation throughout this section.
Then, the graph Laplacian $L$ is the square $n \times n$ matrix defined as:
$$L = D - A.$$
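To make this definition concrete, here is a minimal sketch (our own, not from the article) of constructing $L$ with numpy for a small toy graph:

```python
import numpy as np

# Build L = D - A for a small undirected graph on 4 nodes,
# under an arbitrary but fixed node ordering.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

n = 4
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0  # undirected: symmetric adjacency

D = np.diag(A.sum(axis=1))   # D_v = sum_u A_vu, the node degrees
L = D - A                    # the (unnormalized) graph Laplacian

# Rows of L sum to zero, since each degree cancels its row of A:
print(L @ np.ones(n))        # -> [0. 0. 0. 0.]
```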
The graph Laplacian gets its name from being the discrete analog of the
Laplacian operator
from calculus.
Although it encodes precisely the same information as the adjacency matrix
$A$
(in the sense that given either of the matrices $A$ or $L$,
the other can be constructed directly),
the graph Laplacian has many interesting properties of its own.
The graph Laplacian shows up in many mathematical problems involving graphs:
random walks,
spectral clustering
and
diffusion, to name a few.
We will see some of these properties
in a later section,
and will point readers to
this tutorial
for greater insight into the graph Laplacian.
Polynomials of the Laplacian
Now that we have understood what the graph Laplacian is,
we can build polynomials of the form:
$$p_w(L) = w_0 I_n + w_1 L + w_2 L^2 + \ldots + w_d L^d = \sum_{i = 0}^d w_i L^i.$$
Each polynomial of this form can alternately be represented by
its vector of coefficients $w = [w_0, \ldots, w_d]$.
Note that for every $w$, $p_w(L)$ is an
$n \times n$ matrix, just like $L$.
These polynomials can be thought of as the equivalent of ‘filters’ in CNNs,
and the coefficients $w$
as the weights of the ‘filters’.
For ease of exposition, we will focus on the case where nodes have one-dimensional features:
each of the $x_v$ for
$v \in V$
is just a real number.
The same ideas hold when each of the $x_v$ are higher-dimensional vectors.
Using the previously chosen ordering of the nodes,
we can stack all of the node features to get a vector
$x \in \mathbb{R}^n$.
Once we have constructed the feature vector
$x$,
we can define its convolution with a polynomial filter $p_w$ as:
$$x' = p_w(L) \, x.$$
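The convolution above can be sketched directly in numpy. The helper name below is our own, not from the article; we use a path graph as a running example:

```python
import numpy as np

# Compute x' = p_w(L) x for a path graph on 5 nodes: 0-1-2-3-4.
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
L = np.diag(A.sum(axis=1)) - A

def poly_filter(w, L):
    """p_w(L) = sum_i w_i L^i, an n x n matrix just like L."""
    n = L.shape[0]
    result = np.zeros_like(L)
    power = np.eye(n)          # L^0
    for w_i in w:
        result += w_i * power
        power = power @ L      # next power of L
    return result

x = np.arange(5, dtype=float)              # one-dimensional node features
x_prime = poly_filter([0.0, 1.0], L) @ x   # w_0 = 0, w_1 = 1: x' = L x
print(np.allclose(x_prime, L @ x))         # -> True
```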
To understand how the coefficients $w$ affect the convolution,
let us begin by considering the ‘simplest’ polynomial:
when $w_0 = 1$ and all of the other coefficients are
$0$.
In this case, $x'$ is just
$x$:
$$x' = p_w(L) x = \sum_{i = 0}^d w_i L^i x = w_0 I_n x = x.$$
Now suppose we increase the degree, and instead set $w_1 = 1$
with all of the other coefficients
$0$.
Then, $x'$ is just
$Lx$, and so:
$$\begin{aligned}
x'_v = (Lx)_v &= L_v x \\
&= \sum_{u \in G} L_{vu} x_u \\
&= \sum_{u \in G} (D_{vu} - A_{vu}) x_u \\
&= D_v x_v - \sum_{u \in \mathcal{N}(v)} x_u
\end{aligned}$$
We see that the features at each node $v$
are combined with the features of its immediate neighbours.
For readers familiar with
Laplacian filtering of images,
this is the exact same idea: when
$x$ is an image, $x' = Lx$ is exactly the result of applying a ‘Laplacian filter’ to it.
At this point, a natural question to ask is:
how does the degree $d$ of the polynomial influence the behaviour of the convolution?
Indeed, it is not too hard to show that:
$$\text{dist}_G(v, u) > i \quad \Longrightarrow \quad L_{vu}^i = 0.$$
This implies that when we convolve
$x$ with a polynomial $p_w(L)$
of degree $d$
to get $x'$:
$$\begin{aligned}
x'_v = (p_w(L)x)_v &= (p_w(L))_v x \\
&= \sum_{i = 0}^d w_i L_v^i x \\
&= \sum_{i = 0}^d w_i \sum_{u \in G} L_{vu}^i x_u \\
&= \sum_{i = 0}^d w_i \sum_{\substack{u \in G \\ \text{dist}_G(v, u) \leq i}} L_{vu}^i x_u.
\end{aligned}$$
Effectively, the convolution at node
$v$ occurs only with nodes $u$
which are not more than $d$ hops away.
Thus, these polynomial filters are localized. The degree of the localization is governed completely by
$d$.
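This localization property is easy to verify numerically. Below is a quick check (our own, not from the article) on a path graph, where the distance from node $0$ to node $u$ is simply $u$:

```python
import numpy as np

# Check that dist_G(v, u) > i implies (L^i)_{vu} = 0,
# on the path graph 0-1-2-3-4-5.
n = 6
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A

L2 = np.linalg.matrix_power(L, 2)
L3 = np.linalg.matrix_power(L, 3)

# Nodes 3 and 4 are more than 2 hops from node 0, so L^2 is zero there:
print(L2[0, 3], L2[0, 4])  # -> 0.0 0.0
# Node 3 is exactly 3 hops from node 0, so L^3 can be nonzero there:
print(L3[0, 3] != 0)       # -> True
```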
To help you understand these ‘polynomial-based’ convolutions better, we have created the visualization below.
Vary the polynomial coefficients and the input grid
$x$
to see how the result $x'$ of the convolution changes.
Hover over a pixel in the input grid (left, representing
$x$)
to highlight it and see the equivalent convolutional kernel
for that pixel under the arrow.
Click on the input grid to toggle pixel values between
$0$ (white) and $1$;
to randomize the input grid, press ‘Randomize Grid’.
Use the sliders at the bottom to change the coefficients $w_i$;
to reset all coefficients
to $0$,
press ‘Reset Coefficients.’
ChebNet
ChebNet refines this idea of polynomial filters by considering polynomials of the form:
$$p_w(L) = \sum_{i = 1}^d w_i T_i(\tilde{L})$$
where $T_i$ is the degree-$i$
Chebyshev polynomial of the first kind and
$\tilde{L}$
is the normalized Laplacian, defined using the largest eigenvalue of $L$
(we discuss the eigenvalues of the Laplacian in more depth in a later section):
$$\tilde{L} = \frac{2L}{\lambda_{max}(L)} - I_n.$$
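A sketch of this filter in numpy, using the standard three-term Chebyshev recurrence $T_0 = I$, $T_1 = \tilde{L}$, $T_i = 2\tilde{L}T_{i-1} - T_{i-2}$ (the helper name is ours, not from the article):

```python
import numpy as np

# Rescale L so its eigenvalues lie in [-1, 1], then evaluate
# p_w(L) = sum_{i=1}^d w_i T_i(L~) via the Chebyshev recurrence.
A = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)  # path on 4 nodes
L = np.diag(A.sum(axis=1)) - A
n = L.shape[0]

lam_max = np.linalg.eigvalsh(L).max()
L_tilde = 2.0 * L / lam_max - np.eye(n)

def cheb_filter(w, L_tilde):
    """p_w(L) = sum_{i=1}^d w_i T_i(L~), with w = [w_1, ..., w_d]."""
    T_prev, T_curr = np.eye(n), L_tilde   # T_0 and T_1
    result = np.zeros_like(L_tilde)
    for w_i in w:
        result += w_i * T_curr
        T_prev, T_curr = T_curr, 2.0 * L_tilde @ T_curr - T_prev
    return result

# Eigenvalues of L~ stay within [-1, 1], so its powers do not blow up:
print(np.abs(np.linalg.eigvalsh(L_tilde)).max() <= 1 + 1e-9)  # -> True
```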
What is the motivation behind these choices?
- $L$ is positive semi-definite: all of its eigenvalues are at least $0$. Rescaling by $\lambda_{max}(L)$ and shifting by $I_n$ guarantees that the eigenvalues of $\tilde{L}$ lie in $[-1, 1]$, preventing the entries of powers of $\tilde{L}$ from blowing up.
- The Chebyshev polynomials have certain interesting properties that make interpolation more numerically stable.
We won’t talk about this in more depth here,
but will advise interested readers to consult a definitive resource on Chebyshev polynomials and approximation theory.
Polynomial Filters are Node-Order Equivariant
The polynomial filters we considered here are actually independent of the ordering of the nodes.
This is particularly easy to see when the degree of the polynomial is
$1$:
each node’s feature is then aggregated with the sum of its neighbours’ features.
Clearly, this sum does not depend on the order of the neighbours.
A similar proof follows for higher-degree polynomials:
the entries in the powers of $L$
are equivariant to the ordering of the nodes.
As above, let’s assume an arbitrary node-order over the
$n$ nodes of our graph.
Any other node-order can be thought of as a permutation of this original node-order.
We can represent any permutation by a
permutation matrix $P$,
which will always be an orthogonal $0$-$1$ matrix:
$$PP^T = P^TP = I_n.$$
Then, we call a function $f$
node-order equivariant iff for all permutations $P$:
$$f(Px) = P f(x).$$
When switching to the new node-order using the permutation
$P$,
the quantities below transform in the following way:
$$\begin{aligned}
x &\to Px \\
L &\to PLP^T \\
L^i &\to PL^iP^T
\end{aligned}$$
and so, for the case of polynomial filters where $f(x) = p_w(L)\,x$, we can see:
$$\begin{aligned}
f(Px) & = \sum_{i = 0}^d w_i (PL^iP^T) (Px) \\
& = P \sum_{i = 0}^d w_i L^i x \\
& = P f(x).
\end{aligned}$$
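This equivariance can also be checked numerically. A small sketch (ours, not from the article) with a random graph and a random permutation:

```python
import numpy as np

# Check f(Px) = P f(x) for a polynomial filter f under a random permutation.
rng = np.random.default_rng(0)
n = 5
A = rng.integers(0, 2, (n, n))
A = np.triu(A, 1)
A = A + A.T                                # random symmetric adjacency
L = np.diag(A.sum(axis=1).astype(float)) - A
x = rng.standard_normal(n)
w = [0.5, -1.0, 0.25]                      # arbitrary coefficients w_0, w_1, w_2

def f(L, x):
    return sum(w_i * np.linalg.matrix_power(L, i) @ x
               for i, w_i in enumerate(w))

P = np.eye(n)[rng.permutation(n)]          # a permutation matrix
L_perm = P @ L @ P.T                       # the Laplacian under the new order
print(np.allclose(f(L_perm, P @ x), P @ f(L, x)))  # -> True
```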
Embedding Computation
We now describe how we can build a graph neural network
by stacking ChebNet (or any polynomial filter) layers
one after the other with nonlinearities,
much like a standard CNN.
In particular, if we have
$K$ such layers,
the $k^{\text{th}}$ layer has its own learnable filter weights $w^{(k)}$.
Note that these networks
reuse the same filter weights across different nodes,
exactly mimicking weightsharing in Convolutional Neural Networks (CNNs)
which reuse weights for convolutional filters across a grid.
Modern Graph Neural Networks
ChebNet was a breakthrough in learning localized filters over graphs,
and it motivated many to think of graph convolutions from a different perspective.
We return to the result of convolving
$x$
by the polynomial kernel $p_w(L) = L$,
focussing on a particular node $v$:
$$\begin{aligned}
(Lx)_v &= L_v x \\
&= \sum_{u \in G} L_{vu} x_u \\
&= \sum_{u \in G} (D_{vu} - A_{vu}) x_u \\
&= D_v x_v - \sum_{u \in \mathcal{N}(v)} x_u
\end{aligned}$$
As we noted before, this is a
$1$-hop localized convolution.
But more importantly, we can think of this convolution as arising from two steps:
- Aggregating over immediate neighbour features $x_u$.
- Combining with the node’s own feature $x_v$.
Key Idea:
What if we consider different kinds of ‘aggregation’ and ‘combination’ steps,
beyond what is possible using polynomial filters?
By ensuring that the aggregation is nodeorder equivariant,
the overall convolution becomes nodeorder equivariant.
These convolutions can be thought of as ‘message-passing’ between adjacent nodes:
after each step, every node receives some ‘information’ from its neighbours.
By iteratively repeating the
$1$-hop localized convolutions $K$ times,
the receptive field of the convolution effectively includes all nodes up to
$K$ hops away.
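The aggregate-then-combine pattern above can be sketched in a few lines. This is a minimal illustration with scalar features and hand-picked scalar weights (the function name and weights are ours, not any particular GNN variant):

```python
import numpy as np

# One node-order-equivariant aggregate/combine step with scalar features.
def message_passing_step(A, h, w_self=0.5, w_neigh=0.5):
    messages = A @ h                 # sum of immediate neighbours' features
    return w_self * h + w_neigh * messages

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph 0-1-2
h = np.array([1.0, 0.0, 0.0])            # 'information' starts at node 0

for _ in range(2):                       # K = 2 steps: 2-hop receptive field
    h = message_passing_step(A, h)

print(h[2] != 0)   # -> True: node 2, two hops away, has received information
```

After a single step, node 2 would still have feature $0$; it takes $K = 2$ steps for information from node 0 to reach it.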
Embedding Computation
Message-passing forms the backbone of many GNN architectures today.
We describe the most popular ones in depth below:
- Graph Convolutional Networks (GCN)
- Graph Attention Networks (GAT)
- Graph Sample and Aggregate (GraphSAGE)
- Graph Isomorphism Network (GIN)
Thoughts
An interesting point is to assess different aggregation functions: are some better than others?
It turns out that aggregation functions can be compared in terms of whether
they can uniquely preserve node neighbourhood features;
we recommend the interested reader take a look at the detailed theoretical analysis accompanying the GIN model.
Here, we’ve talked about GNNs where the computation only occurs at the nodes.
More recent GNN models
such as Message-Passing Neural Networks
and Graph Networks
perform computation over the edges as well;
they compute edge embeddings together with node embeddings.
This is an even more general framework,
but the same ‘message passing’ ideas from this section apply.
Interactive Graph Neural Networks
Below is an interactive visualization of these GNN models on small graphs.
For clarity, the node features are just real numbers here, shown inside the squares next to each node,
but the same equations hold when the node features are vectors.
Use the sliders on the left to change the weights for the current iteration, and watch how the update equation changes.
In practice, each iteration above is generally thought of as a single ‘neural network layer’.
This ideology is followed by many popular Graph Neural Network libraries
(for example, PyTorch Geometric
and StellarGraph),
allowing one to compose different types of graph convolutions in the same model.
From Local to Global Convolutions
The methods we’ve seen so far perform ‘local’ convolutions:
every node’s feature is updated using a function of its local neighbours’ features.
While performing enough steps of message-passing will eventually ensure that
information from all nodes in the graph is passed,
one may wonder if there are more direct ways to perform ‘global’ convolutions.
The answer is yes; we will now describe an approach that was actually first put forward
in the context of neural networks,
much before any of the GNN models we looked at above.
Spectral Convolutions
As before, we will focus on the case where nodes have one-dimensional features.
After choosing an arbitrary node-order, we can stack all of the node features to get a
‘feature vector’ $x \in \mathbb{R}^n$.
Key Idea:
Given a feature vector
$x$,
the Laplacian $L$
allows us to quantify how smooth $x$
is, with respect to $G$.
How?
After normalizing
$x$ such that $\sum_{i = 1}^n x_i^2 = 1$,
if we look at the following quantity involving
$L$:
$$R_L(x) = \frac{x^T L x}{x^T x} = \frac{\sum_{(i, j) \in E} (x_i - x_j)^2}{\sum_i x_i^2} = \sum_{(i, j) \in E} (x_i - x_j)^2,$$
we immediately see that feature vectors $x$ that assign similar values to
adjacent nodes in $G$
(hence, are smooth) would have smaller values of $R_L(x)$.
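A quick numerical check of this smoothness measure (ours, not from the article), comparing a smooth and a rough signal on a path graph:

```python
import numpy as np

# R_L(x) = x^T L x (after normalization) equals the sum of squared
# differences across edges, so smoother vectors score lower.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

def R_L(x):
    x = x / np.linalg.norm(x)            # normalize so sum_i x_i^2 = 1
    return x @ L @ x

smooth = np.array([1.0, 1.1, 1.2, 1.3])  # similar values on adjacent nodes
rough = np.array([1.0, -1.0, 1.0, -1.0]) # alternating values
print(R_L(smooth) < R_L(rough))          # -> True
```

A constant vector is maximally smooth: it attains $R_L(x) = 0$, matching the smallest Laplacian eigenvalue.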
$L$
is a real, symmetric matrix, which means it has all real eigenvalues $\lambda_1 \leq \ldots \leq \lambda_{n}$.
An eigenvalue $\lambda$
of a matrix $A$
is a value satisfying the equation $Au = \lambda u$
for a certain vector $u$, called an eigenvector.
For a friendly introduction to eigenvectors,
please see this tutorial.
Further, the corresponding eigenvectors $u_1, \ldots, u_n$ can always be taken to be orthonormal, that is:
$$u_{k_1}^T u_{k_2} =
\begin{cases}
1 \quad \text{if } {k_1} = {k_2}, \\
0 \quad \text{if } {k_1} \neq {k_2}.
\end{cases}$$
It turns out that these eigenvectors of $L$
are successively less smooth, as $R_L$ indicates:
$$\underset{x, \ x \perp \{u_1, \ldots, u_{i - 1}\}}{\text{argmin}} \ R_L(x) = u_i,
\qquad
\min_{x, \ x \perp \{u_1, \ldots, u_{i - 1}\}} R_L(x) = \lambda_i.$$
The set of eigenvalues of $L$ are called its ‘spectrum’, hence the name!
We denote the ‘spectral’ decomposition of $L$ as:
$$L = U \Lambda U^T,$$
where $\Lambda$ is the diagonal matrix of sorted eigenvalues,
and $U$ is the matrix of eigenvectors (sorted in order of increasing eigenvalues):
$$\Lambda = \begin{bmatrix}
\lambda_{1} & & \\
& \ddots & \\
& & \lambda_{n}
\end{bmatrix},
\qquad
U = \begin{bmatrix} u_1 & \cdots & u_n \end{bmatrix}.$$
The orthonormality condition between eigenvectors gives us that $U^T U = I$, the identity matrix.
As these $n$
eigenvectors form a basis for $\mathbb{R}^n$,
any feature vector $x$ can be represented as a linear combination of these eigenvectors:
$$x = \sum_{i = 1}^n \hat{x}_i u_i = U \hat{x},$$
where $\hat{x}$ is the vector of coefficients
$[\hat{x}_1, \ldots, \hat{x}_n]$.
We call $\hat{x}$
the spectral representation of the feature vector $x$.
The orthonormality condition allows us to state:
$$x = U \hat{x} \quad \Longleftrightarrow \quad U^T x = \hat{x}.$$
This pair of equations allows us to interconvert between the ‘natural’ representation $x$
and the ‘spectral’ representation $\hat{x}$
of any vector $x \in \mathbb{R}^n$.
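This interconversion is a direct application of a symmetric eigensolver. A small sketch (ours, not from the article):

```python
import numpy as np

# Convert between the natural representation x and the spectral
# representation x_hat = U^T x, for a small undirected graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

lam, U = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
x = np.array([3.0, -1.0, 2.0, 0.5])

x_hat = U.T @ x              # natural -> spectral
x_back = U @ x_hat           # spectral -> natural
print(np.allclose(x_back, x))           # -> True
print(np.allclose(U.T @ U, np.eye(4)))  # -> True: orthonormal eigenvectors
```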
Spectral Representations of Natural Images
As discussed before, we can consider any image as a grid graph, where each pixel is a node,
connected by edges to adjacent pixels.
Thus, a pixel can have either $3$, $5$
or $8$ neighbours, depending on its location within the image grid.
Each pixel gets a value as part of the image. If the image is grayscale, each value will be a single
real number indicating how dark the pixel is. If the image is colored, each value will be a
vector, indicating the values for the red, green and blue (RGB) channels.
This construction allows us to compute the graph Laplacian and the eigenvector matrix
$U$.
Given an image, we can then investigate what its spectral representation looks like.
To shed some light on what the spectral representation actually encodes,
we perform the following experiment over each channel of the image independently:
- We first collect all pixel values across a channel into a feature vector $x$.
- Then, we obtain its spectral representation $\hat{x}$.
- We truncate this to the first $m$ components, zeroing out the remaining entries.
- Then, we convert this truncated representation $\hat{x}_m$ back to the natural basis to obtain $x_m$.
Finally, we stack the resulting channels back together to get back an image.
We can now see how the resulting image changes with different choices of $m$.
Note that when $m = n$, the resulting image is identical to the original image,
as we can reconstruct each channel exactly.
Each of these images has been taken from the ImageNet
dataset and downsampled to $50 \times 50$ pixels.
As
$m$ decreases, we see that the output image $x_m$ gets blurrier.
If we decrease $m$
all the way to $1$,
the output image $x_m$ is essentially constant throughout.
We see that we do not need to keep all $n$ components;
we can retain a lot of the information in the image with significantly fewer components.
We can relate this to the Fourier decomposition of images:
the more eigenvectors we use, the higher frequencies we can represent on the grid.
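The truncation experiment above can be sketched in a few lines. We use a path graph as a one-dimensional stand-in for the image grid (the graph size and values of $m$ below are our own choices):

```python
import numpy as np

# Keep only the first m spectral coefficients of a signal on a path graph
# and reconstruct; the result approximates the original signal.
n = 32
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)                 # eigenvectors, smoothest first

rng = np.random.default_rng(0)
x = rng.standard_normal(n)                 # stand-in for one image channel

def truncate(x, m):
    x_hat = U.T @ x                        # natural -> spectral
    x_hat[m:] = 0.0                        # zero out all but the first m
    return U @ x_hat                       # spectral -> natural

# Keeping all n components reconstructs x exactly;
# keeping fewer gives a smoothed approximation:
print(np.allclose(truncate(x, n), x))      # -> True
print(np.linalg.norm(truncate(x, 8) - x) > 0)  # -> True
```

Since the discarded coefficient sets are nested, the reconstruction error can only shrink as $m$ grows.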
To complement the visualization above,
we additionally visualize the first few eigenvectors on a smaller $8 \times 8$ grid.
We change the coefficients of the first few
out of $64$ eigenvectors
in the spectral representation $\hat{x}$
and see how the resulting image changes:
These visualizations should convince you that the first eigenvectors are indeed smooth,
and the smoothness correspondingly decreases as we consider later eigenvectors.
For any image
$x$,
we can think of the initial entries of the spectral representation $\hat{x}$ as capturing the smooth, ‘global’ trends of the image.
Embedding Computation
We now have the background to understand spectral convolutions
and how they can be used to compute embeddings/feature representations of nodes.
As before, the model we describe below has
$K$ layers:
each layer $k$
has learnable parameters $\hat{w}^{(k)}$,
with one weight for each of the $m$
eigenvectors used to compute the spectral representations.
We had shown in the previous section that we can take $m \ll n$
and still not lose out on significant amounts of information.
Thus, convolution in the spectral domain enables the use of significantly fewer parameters
than just direct convolution in the natural domain.
Further, by virtue of the smoothness of the Laplacian eigenvectors across the graph,
using spectral representations automatically enforces an inductive bias for
neighbouring nodes to get similar representations.
Assuming one-dimensional node features for now,
the output of each layer is a vector of node representations $h^{(k)}$.
We fix an ordering of the nodes in
$G$. This gives us the adjacency matrix $A$
and the graph Laplacian $L$,
allowing us to compute the eigenvector matrix $U_m$.
The method above generalizes easily to the case where each
$h^{(k)} \in \mathbb{R}^{d_k}$, as well.
With the insights from the previous section, we see that convolution in the spectral domain of graphs
can be thought of as the generalization of convolution in the frequency domain of images.
Spectral Convolutions are Node-Order Equivariant
We can show spectral convolutions are nodeorder equivariant using a similar approach
as for Laplacian polynomial filters.
Details for the Interested Reader
As in our proof before,
let’s fix an arbitrary node-order.
Then, any other node-order can be represented by a
permutation of this original node-order.
We can associate this permutation with its permutation matrix
$P$.
Under this new node-order,
the quantities below transform in the following way:
$$\begin{aligned}
x &\to Px \\
A &\to PAP^T \\
L &\to PLP^T \\
U_m &\to PU_m
\end{aligned}$$
This means that, in the embedding computation:
$$\begin{aligned}
\hat{x} &\to \left(PU_m\right)^T (Px) = U_m^T x = \hat{x} \\
\hat{w} &\to \left(PU_m\right)^T (Pw) = U_m^T w = \hat{w} \\
\hat{g} &\to \hat{g} \\
g &\to (PU_m)\hat{g} = P(U_m\hat{g}) = Pg
\end{aligned}$$
Finally, as $\sigma$ is applied elementwise:
$$f(Px) = \sigma(Pg) = P \sigma(g) = P f(x),$$
as required.
Further, we see that the spectral quantities
$\hat{x}, \hat{w}$ and $\hat{g}$ are unchanged by permutations of the nodes.
The theory of spectral convolutions is mathematically well-grounded;
however, there are some key disadvantages that we must talk about:
- We need to compute the eigenvector matrix $U_m$ from $L$. For large graphs, this becomes quite infeasible.
- Even if we can compute $U_m$, global convolutions themselves are inefficient to compute, as they require repeated multiplications with $U_m$ and $U_m^T$.
- The learned filters are specific to the input graphs,
as they are represented in terms
of the spectral decomposition of the input graph Laplacian $L$.
While spectral convolutions have largely been superseded by
‘local’ convolutions for the reasons discussed above,
there is still much merit to understanding the ideas behind them.
Indeed, a recently proposed GNN model called Directional Graph Networks
actually uses the Laplacian eigenvectors
and their mathematical properties
extensively.
Global Propagation via Graph Embeddings
A simpler way to incorporate graph-level information
is to compute embeddings of the entire graph by pooling node
(and possibly edge) embeddings,
and then using the graph embedding to update node embeddings,
following an iterative scheme similar to what we have looked at here.
This is an approach used by Graph Networks.
We will briefly discuss how graph-level embeddings
can be constructed in Pooling.
However, such approaches tend to ignore the underlying
topology of the graph that spectral convolutions can capture.
Learning GNN Parameters
All of the embedding computations we’ve described here, whether spectral or spatial, are completely differentiable.
This allows GNNs to be trained in an end-to-end fashion, just like a standard neural network,
once a suitable loss function
is defined:

- Node Classification: By minimizing any of the standard losses for classification tasks,
such as categorical cross-entropy when multiple classes are present:
$$\mathcal{L}(y_v, \hat{y}_v) = -\sum_{c} y_{vc} \log{\hat{y}_{vc}}.$$
- Graph Classification: By aggregating node representations,
one can construct a vector representation of the entire graph.
This graph representation can be used for any graph-level task, even beyond classification.
See Pooling for how representations of graphs can be constructed.
- Link Prediction: By sampling pairs of adjacent and non-adjacent nodes,
and using these vector pairs as inputs to predict the presence/absence of an edge.
For a concrete example, by minimizing the following ‘logistic regression’-like loss:
$$\begin{aligned}
\mathcal{L}(y_v, y_u, e_{vu}) &= -e_{vu} \log(p_{vu}) - (1 - e_{vu}) \log(1 - p_{vu}) \\
p_{vu} &= \sigma(y_v^T y_u)
\end{aligned}$$
- Node Clustering: By simply clustering the learned node representations.
The broad success of pre-training for natural language processing models
such as ELMo
has sparked interest in similar techniques for GNNs.
The key idea in each of these papers is to train GNNs to predict
local (e.g. node degrees, clustering coefficient, masked node attributes)
and/or global graph properties (e.g. pairwise distances, masked global attributes).
Another self-supervised technique is to enforce that neighbouring nodes get similar embeddings,
mimicking random-walk approaches such as node2vec:
$$L_{G} = -\sum_{v} \sum_{u \in N_R(v)} \log\frac{\exp{z_v^T z_u}}{\sum\limits_{u'} \exp{z_{u'}^T z_u}},$$
where
$N_R(v)$ is a multi-set of nodes visited when random walks are started from $v$.
For large graphs, where computing the sum over all nodes may be computationally expensive,
techniques such as Noise Contrastive Estimation are especially useful.
Conclusion and Further Reading
While we have looked at many techniques and ideas in this article,
the field of Graph Neural Networks is extremely vast.
We have been forced to restrict our discussion to a small subset of the entire literature,
while still communicating the key ideas and design principles behind GNNs.
We recommend the interested reader take a look at comprehensive surveys of the field.
We end with pointers and references for additional concepts readers might be interested in:
GNNs in Practice
It turns out that accommodating the different structures of graphs is often hard to do efficiently,
but we can still represent many GNN update equations
as sparse matrix-vector products (since generally, the adjacency matrix is sparse for most real-world graph datasets).
For example, the GCN variant discussed here can be represented as:
$$h^{(k)} = D^{-1} A \cdot h^{(k - 1)} {W^{(k)}}^T + h^{(k - 1)} {B^{(k)}}^T.$$
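This update can be sketched with scipy's sparse matrices. The graph and the weights below are random stand-ins of our own, not learned parameters:

```python
import numpy as np
import scipy.sparse as sp

# One GCN-style update: h_k = D^{-1} A h_{k-1} W^T + h_{k-1} B^T,
# with A stored as a sparse matrix.
n, d_in, d_out = 5, 3, 4
rng = np.random.default_rng(0)

# A 5-cycle, stored in CSR format (both directions of each edge):
rows = np.array([0, 1, 1, 2, 2, 3, 3, 4, 0, 4])
cols = np.array([1, 0, 2, 1, 3, 2, 4, 3, 4, 0])
A = sp.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
D_inv = sp.diags(1.0 / np.asarray(A.sum(axis=1)).ravel())

h = rng.standard_normal((n, d_in))       # previous layer's node features
W = rng.standard_normal((d_out, d_in))   # neighbour-transform weights
B = rng.standard_normal((d_out, d_in))   # self-transform weights

h_next = D_inv @ A @ h @ W.T + h @ B.T   # one sparse update step
print(h_next.shape)                      # -> (5, 4)
```

The sparse products cost time proportional to the number of edges, rather than $n^2$.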
Regularization techniques for standard neural networks,
such as Dropout,
can be applied in a straightforward manner to the parameters
(for example, zero out entire rows of $W^{(k)}$ above).
Different Kinds of Graphs
Here, we have focused on undirected graphs, to avoid going into too many unnecessary details.
However, there are some simple variants of spatial convolutions for:
- Directed graphs: Aggregate across in-neighbourhood and/or out-neighbourhood features.
- Temporal graphs: Aggregate across previous and/or future node features.
- Heterogeneous graphs: Learn different aggregation functions for each node/edge type.
There do exist more sophisticated techniques that can take advantage of the different structures of these graphs;
see the relevant literature for each setting.
Pooling
This article discusses how GNNs compute useful representations of nodes.
But what if we wanted to compute representations of graphs for graph-level tasks (for example, predicting the toxicity of a molecule)?
A simple solution is to just aggregate the final node embeddings and pass them through another neural network
$\text{PREDICT}_G$:
$$h_G = \text{PREDICT}_G \Big( \text{AGG}_{v \in G}\left(\{ h_v \} \right) \Big)$$
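A sketch of this aggregate-then-predict readout (the pooling choice and the stand-in random network are ours, not from the article):

```python
import numpy as np

# Mean-pool the node embeddings, then feed the result to a small network.
rng = np.random.default_rng(0)
h_nodes = rng.standard_normal((6, 8))     # final embeddings for 6 nodes

def aggregate(h):
    return h.mean(axis=0)                 # node-order invariant pooling

W1 = rng.standard_normal((16, 8))         # random stand-in weights
W2 = rng.standard_normal((1, 16))

def predict_graph(h):
    z = np.maximum(W1 @ aggregate(h), 0)  # one hidden ReLU layer
    return W2 @ z                         # e.g. a toxicity score

# Because the pooling is node-order invariant, permuting the nodes
# does not change the graph-level prediction:
perm = rng.permutation(6)
print(np.allclose(predict_graph(h_nodes), predict_graph(h_nodes[perm])))  # -> True
```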
- SortPool: Sort vertices of the graph to get a fixed-size node-order invariant representation of the graph, and then apply any standard neural network architecture.
- DiffPool: Learn to cluster vertices, build a coarser graph over clusters instead of nodes, then apply a GNN over the coarser graph. Repeat until only one cluster is left.
- SAGPool: Apply a GNN to learn node scores, then keep only the nodes with the top scores, throwing away the rest. Repeat until only one node is left.
Supplementary Material
Reproducing Experiments
The experiments from
Spectral Representations of Natural Images
can be reproduced using the following
Colab notebook:
Spectral Representations of Natural Images.
Recreating Visualizations
To aid in the creation of future interactive articles,
we have created ObservableHQ
notebooks for each of the interactive visualizations here: