We have seen that a covariance function is the crucial ingredient in a Gaussian process predictor, as it encodes our assumptions about the function which we wish to learn. From a slightly different viewpoint it is clear that in supervised learning the notion of similarity between data points is crucial; it is a basic assumption that points with inputs x which are close are likely to have similar target values y, and thus training points that are near to a test point should be informative about the prediction at that point. Under the Gaussian process view it is the covariance function that defines nearness or similarity.

An arbitrary function of input pairs x and x′ will not, in general, be a valid covariance function.¹ The purpose of this chapter is to give examples of some commonly-used covariance functions and to examine their properties. Section 4.1 defines a number of basic terms relating to covariance functions. Section 4.2 gives examples of stationary, dot-product, and other non-stationary covariance functions, and also gives some ways to make new ones from old. Section 4.3 introduces the important topic of eigenfunction analysis of covariance functions, and states Mercer's theorem, which allows us to express the covariance function (under certain conditions) in terms of its eigenfunctions and eigenvalues. The covariance functions given in section 4.2 are valid when the input domain X is a subset of R^D. In section 4.4 we describe ways to define covariance functions when the input domain is over structured objects such as strings and trees.
A stationary covariance function is a function of x − x′. Thus it is invariant to translations in the input space.² For example the squared exponential covariance function given in equation 2.16 is stationary. If further the covariance function is a function only of |x − x′| then it is called isotropic; it is thus invariant to all rigid motions. For example the squared exponential covariance function given in equation 2.16 is isotropic. As k is now only a function of r = |x − x′| these are also known as radial basis functions (RBFs).

¹ To be a valid covariance function it must be positive semidefinite, see eq. (4.2).
² In stochastic process theory a process which has constant mean and whose covariance function is invariant to translations is called weakly stationary. A process is strictly stationary if all of its finite dimensional distributions are invariant to translations [Papoulis, 1991, sec. 10.1].

C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006, ISBN 026218253X. © 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml

If a covariance function depends on x and x′ only through x · x′ we call it a dot product covariance function. A simple example is the covariance function k(x, x′) = σ₀² + x · x′, which can be obtained from linear regression by putting N(0, 1) priors on the coefficients of x_d (d = 1, . . . , D) and a prior of N(0, σ₀²) on the bias (or constant function) 1, see eq. (2.15). Another important example is the inhomogeneous polynomial kernel k(x, x′) = (σ₀² + x · x′)^p where p is a positive integer. Dot product covariance functions are invariant to a rotation of the coordinates about the origin, but not to translations.

A general name for a function k of two arguments mapping a pair of inputs x ∈ X, x′ ∈ X into R is a kernel. This term arises in the theory of integral operators, where the operator T_k is defined as (T_k f)(x) =
∫_X k(x, x′) f(x′) dµ(x′),
where µ denotes a measure; see section A.7 for further...
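As a numerical illustration of the operator T_k (our own sketch, not from the book), we can take µ to be the uniform measure on [0, 1], discretize it on an n-point grid with quadrature weight 1/n, and eigen-decompose the resulting matrix; the eigenvalues and eigenvectors of w·K approximate the eigenvalues and eigenfunctions that appear in Mercer's theorem of section 4.3. The grid size, length-scale, and variable names here are arbitrary choices for the sketch:

```python
import numpy as np

# Discretize (T_k f)(x) = integral of k(x, x') f(x') dmu(x') on [0, 1],
# with mu taken as the uniform measure on an n-point grid.
n = 200
x = np.linspace(0.0, 1.0, n)
w = 1.0 / n  # quadrature weight for the uniform measure

# Gram matrix of the squared exponential covariance (eq. 2.16) on the grid;
# the length-scale ell = 0.2 is an arbitrary choice for this illustration.
ell = 0.2
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell**2))

# Eigen-decomposition of the discretized operator w*K.
eigvals, eigvecs = np.linalg.eigh(w * K)
eigvals = eigvals[::-1]  # eigh returns ascending order; sort descending

# A valid covariance function is positive semidefinite (footnote 1), so all
# eigenvalues are nonnegative up to numerical round-off.
assert eigvals.min() > -1e-8

# For a smooth kernel the eigenvalues decay rapidly, so a few terms of the
# Mercer expansion already capture most of the operator.
print(eigvals[:5])
```

Note that the eigenvalues of w·K sum to the trace w·n = 1 here, which gives a quick sanity check on the discretization.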
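The invariances that distinguish the covariance function classes above are easy to check numerically. The following sketch (our own illustration; the function names are invented, not from the book) evaluates the squared exponential and dot product covariance functions and verifies translation invariance for the former and rotation (but not translation) invariance for the latter:

```python
import numpy as np

def squared_exponential(x, xp, ell=1.0):
    """Squared exponential covariance (eq. 2.16): depends only on |x - x'|."""
    r = np.linalg.norm(x - xp)
    return np.exp(-r**2 / (2 * ell**2))

def dot_product(x, xp, sigma0=1.0):
    """Dot product covariance k(x, x') = sigma_0^2 + x . x'."""
    return sigma0**2 + np.dot(x, xp)

def inhomogeneous_poly(x, xp, sigma0=1.0, p=2):
    """Inhomogeneous polynomial kernel k(x, x') = (sigma_0^2 + x . x')^p."""
    return (sigma0**2 + np.dot(x, xp)) ** p

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(3), rng.standard_normal(3)
t = rng.standard_normal(3)  # a translation vector

# Stationarity: the squared exponential is unchanged when both inputs
# are translated by the same vector t.
assert np.isclose(squared_exponential(x, xp),
                  squared_exponential(x + t, xp + t))

# Dot product kernels are invariant to a rotation Q about the origin
# (Q^T Q = I implies (Qx).(Qx') = x.x'), but not to translations.
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
assert np.isclose(dot_product(x, xp), dot_product(q @ x, q @ xp))
assert not np.isclose(dot_product(x, xp), dot_product(x + t, xp + t))
```

The same checks apply to the inhomogeneous polynomial kernel, since it depends on its inputs only through x · x′.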