Neural Network vs. Kernels#

Several recent papers have begun to dive into the implicit connections between neural networks and kernel methods. They claim that understanding kernel methods will be critical in order to understand and overcome some of the theoretical roadblocks that have cropped up at the forefront of neural network research.

Note

The connection between these two competing techniques has been studied extensively, so here we describe only the basic case of a single-layer neural network.

Feature Maps#

You may have heard the term “feature” in learning literature, as in, machine learning attempts to learn data features (which can potentially be described as learning the underlying patterns in the data).

In practice, this often amounts to learning whether the data fits a set of pre-defined functions \(\phi_{i}\), \(i = 1, 2, \ldots\), and we combine these features using some sort of linear combination,

\[f(x) = \sum_{i=1}^{M} \alpha_{i} \phi_{i}(x)\]

In kernel methods, we have a very simple way of defining features via the kernel. In effect, the features are simply the kernel function fixed at a number of pre-specified points, i.e. \(\phi_{i}(\cdot) = k(x_{i}, \cdot)\).

\[f(x) = \sum_{i=1}^{M} \alpha_{i} k(x_{i}, x)\]

There are a number of well-known kernels that have been found, and are typically used in kernel methods, such as the constant kernel \(k(x, x') = a\), linear kernel \(k(x, x') = \langle x, x' \rangle\), polynomial kernel \(k(x, x') = (\langle x, x' \rangle + c)^{d}\), and the Gaussian (RBF) kernel \(k(x, x') = \exp(-\lVert x - x' \rVert^{2}/2 \sigma^{2})\).

We can also combine these kernels as needed, for example by adding them together, scaling them, or multiplying two kernels, in order to create new valid kernels. This allows us to define the “features” easily via the kernel.

Hint

What this also means is, a kernel function can itself be a linear combination of kernels.

Approximation via Features#

We can describe the learning problem in terms of features, meaning for a set of pre-defined features \(\lbrace \phi_{i} \rbrace_{i=1}^{M}\), we seek to learn the coefficients \(\alpha_{i}\) of a linear combination of feature maps,

\[f(x) = \alpha_{1} \phi_{1}(x) + \cdots + \alpha_{M} \phi_{M}(x)\]

This is exactly the same as the kernel learning problem that we have described before. In short, whether we call the kernels “features” or vice versa, we are essentially using the same techniques that we have already described in the first part of the tutorial.

However, it is important to note that not all feature maps are valid kernels. Some features are complex, especially in the case of deep neural networks or convolutional networks, which means not all neural networks are directly relatable to kernel methods. Nevertheless, in certain cases, these two techniques are virtually identical.

Single-Layer Neural Networks#

In a single-layer neural network, we take an input vector \(x\), and compute a linear transformation of the data, \(W x + b\), where \(W\) is usually described as a matrix of “weights”, and \(b\) is called a “bias” vector.

We then pass the result through a nonlinear transformation \(g\), which is called the “activation” function. At the final step, we may then compute some linear combination of the “activations” in order to compute the predicted value of the function approximation.

\[f(x) = \sum_{i=1}^{M} \beta_{i} g_{i}(W x + b)\]

We specify the activation functions explicitly, and the neural network learning problem is defined as learning the weights \(W\), the bias \(b\), and the linear combination coefficients \(\beta_{i}\). Thanks to a technique called backpropagation, we can directly compute the gradient of these values with respect to the error. This means we can use an iterative training scheme called gradient descent, where the gradient is used to update the weights, biases, and coefficients of the neural network in order to iteratively improve the neural network approximation.

However, it is important to note that this procedure does not guarantee that we will converge to a perfect solution. In practice, if we do not take sufficiently small steps when we update the neural network parameters, we may fail to minimize the error of the neural network approximator.

In order to describe a single-layer neural network in terms of features, we define the feature map to be the activations of the neural network.

\[\varphi_{i}(x) = g_{i}(W x + b)\]

Then, it is easy to see that the neural network follows the same scheme as the kernel learning problem when viewed from the last layer.

\[f(x) = \beta_{1} \varphi_{1}(x) + \cdots + \beta_{M} \varphi_{M}(x)\]

This is an important insight, since it gives us a way to relate the kernel features \(\phi_{i}\) and the neural network features \(\varphi_{i}\). In certain cases, these feature maps may be directly related, meaning what we are doing in neural network learning problems is effectively “learning the kernel”.

Note

Note that deep neural networks and convolutional networks are different, and the feature maps for these networks are significantly more complex. These networks typically do not have a simple analog to kernel methods.

Practical Differences#

There are also some practical differences between neural networks and kernel methods that we outline below.

Input Dimensionality#

Note that the dimensionality of the input vector plays a role in the complexity of a neural network. If the dimensionality of the input vector \(x\) is large, the weight matrix \(W\) will also have very high dimensionality. This can make maintaining and updating the weights challenging, since we need to explicitly store and compute the gradient updates for a high-dimensional matrix.

Kernel methods do not suffer from this problem, since the data is projected into a high-dimensional function space, and we can use the kernel trick. This means we only need to evaluate the kernel over high-dimensional data, which is typically much easier since it boils down to an inner product of vectors.

Sequential Data#

Neural networks are particularly adept at handling data which is provided sequentially, meaning we receive observations given one at a time from the true function. Since we can update the parameters of the network as data becomes available, we do not need to store all of the data that has been passed previously.

The solution for kernel methods, however, requires that we store a matrix that scales with the amount of data. This is a major hurdle for kernel methods, since practically it requires us to first collect a large sample, which we can then use to easily compute the “best fit” solution.

In short, what this means is that we are presented with a tradeoff. Neural networks take a long time to train, but are relatively cheap to store in memory once the parameters have been learned. Kernel methods, on the other hand, face large memory storage requirements, but are trained in a single calculation. Despite this, neural networks are typically better calibrated for sequential data scenarios when a large amount of data is involved.

Theoretical Guarantees#

One major disadvantage of neural networks is the lack of theoretical guarantees related to convergence of the solution and what is called “explainability”, i.e. why the network has a particular output in response to a given input. This is a deep topic, so we will only cover it here briefly.

Basically, the training procedure for neural networks is not conducive to “introspection”. Tracking an output backwards through the network leads to a complex web of interconnected weights and activations, so determining the reasons why a network may output a poor prediction is an area of open research.

In addition, the training for neural networks is typically iterative, and the solution can jump around or get caught in a local solution without ever converging to the true, global solution. Mathematically, this is because the neural network learning problem is non-convex.

Kernel learning, on the other hand, is a convex problem, meaning kernel-based solutions do not suffer from local minima, and are guaranteed to find the “best fit” for a given data set. In addition, kernel methods are typically “shallow”, meaning they offer better explainability in their outputs. In other words, we can always find which features and training data contributed to a certain prediction.