# 7 The Backpropagation Algorithm

We saw in the last chapter that multilayered networks are capable of computing a wider range of Boolean functions than networks with a single layer of computing units. However, the computational effort needed for finding the correct combination of weights increases substantially when more parameters and more complicated topologies are considered. In this chapter we discuss a popular learning method capable of handling such large learning problems: the backpropagation algorithm. This numerical method was used by different research communities in different contexts, was discovered and rediscovered, until in 1985 it found its way into connectionist AI mainly through the work of the PDP group [382]. It has been one of the most studied and used algorithms for neural network learning ever since.

In this chapter we present a proof of the backpropagation algorithm based on a graphical approach in which the algorithm reduces to a graph labeling problem. This method is not only more general than the usual analytical derivations, which handle only the case of special network topologies, but also much easier to follow. It also shows how the algorithm can be efficiently implemented in computing systems in which only local information can be transported through the network.

## 7.1.1 Differentiable activation functions

The backpropagation algorithm looks for the minimum of the error function in weight space using the method of gradient descent. The combination of weights which minimizes the error function is considered to be a solution of the learning problem. Since this method requires computation of the gradient of the error function at each iteration step, we must guarantee the continuity and differentiability of the error function. Obviously we have to use a kind of activation function other than the step function used in perceptrons,
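The gradient-descent idea described above can be sketched with a toy one-dimensional error function. The quadratic error, learning rate, and step count below are illustrative choices, not taken from the chapter:

```python
# Minimal gradient-descent sketch: minimize the toy error function
# E(w) = (w - 3)^2 by repeatedly stepping against its gradient.

def error(w):
    return (w - 3.0) ** 2

def gradient(w):
    # dE/dw for the toy error function above
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

w_min = gradient_descent(w0=0.0)
print(round(w_min, 6))  # approaches the minimum at w = 3
```

The same update rule, applied to the full error function of a network, is what backpropagation makes computable: the algorithm's job is precisely to obtain the gradient with respect to every weight.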

R. Rojas: Neural Networks, Springer-Verlag, Berlin, 1996


because the composite function produced by interconnected perceptrons is discontinuous, and therefore so is the error function. One of the more popular activation functions for backpropagation networks is the sigmoid, a real function $s_c : \mathbb{R} \to (0, 1)$ defined by the expression

$$s_c(x) = \frac{1}{1 + e^{-cx}}.$$

The constant c can be selected arbitrarily and its reciprocal 1/c is called the temperature parameter in stochastic neural networks. The shape of the sigmoid changes according to the value of c, as can be seen in Figure 7.1. The graph shows the shape of the sigmoid for c = 1, c = 2 and c = 3. Higher values of c bring the shape of the sigmoid closer to that of the step function, and in the limit c → ∞ the sigmoid converges to a step function at the origin. In order to simplify all expressions derived in this chapter we set c = 1, but after going through this material the reader should be able to generalize all the expressions for a variable c. In the following we write the sigmoid $s_1(x)$ simply as $s(x)$.
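The effect of the constant c can be checked directly; this is a minimal sketch of the definition above, with sample evaluation points chosen for illustration:

```python
import math

def sigmoid(x, c=1.0):
    """Sigmoid s_c(x) = 1 / (1 + exp(-c*x)).

    The reciprocal 1/c is the 'temperature' parameter mentioned
    in the text; larger c pushes s_c toward the step function.
    """
    return 1.0 / (1.0 + math.exp(-c * x))

# For a fixed x > 0, increasing c drives the output toward 1:
for c in (1, 2, 3, 50):
    print(c, round(sigmoid(1.0, c), 4))
```

At x = 0 the sigmoid equals 1/2 for every c, which is why the limiting step function sits at the origin.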
*Fig. 7.1. Three sigmoids (for c = 1, c = 2 and c = 3)*

The derivative of the sigmoid with respect to x, needed later on in this chapter, is

$$\frac{d}{dx}\,s(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = s(x)(1 - s(x)).$$

We have already shown that, in the case of perceptrons, a symmetrical activation function has some advantages for learning. An alternative to the sigmoid is the symmetrical sigmoid S(x) defined as

$$S(x) = 2s(x) - 1 = \frac{1 - e^{-x}}{1 + e^{-x}}.$$
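The identity s'(x) = s(x)(1 − s(x)) can be verified numerically against a central finite difference; the test points and step size below are illustrative:

```python
import math

def s(x):
    # sigmoid with c = 1, as in the text
    return 1.0 / (1.0 + math.exp(-x))

def s_prime_analytic(x):
    # closed form derived in the text: s'(x) = s(x)(1 - s(x))
    return s(x) * (1.0 - s(x))

def s_prime_numeric(x, h=1e-6):
    # central finite-difference approximation of s'(x)
    return (s(x + h) - s(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 1.5):
    print(x, s_prime_analytic(x), s_prime_numeric(x))
```

This identity is what makes the sigmoid convenient for backpropagation: the derivative can be computed from the unit's output alone, with no extra evaluation of the exponential.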

This is nothing but the hyperbolic tangent for the argument x/2, whose shape is shown in Figure 7.2 (upper right). The figure shows four types of continuous "squashing" functions. The ramp function (lower right) can also be used in learning algorithms, taking care to avoid the two points where the derivative is undefined.
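The relation S(x) = tanh(x/2) can be checked directly, and the ramp can be written as a clipped linear function. The clipping range [−1, 1] used for the ramp below is an illustrative choice, not taken from the figure:

```python
import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

def S(x):
    # symmetric sigmoid: S(x) = 2 s(x) - 1
    return 2.0 * s(x) - 1.0

def ramp(x):
    # ramp "squashing" function; the derivative is undefined at
    # the two kink points where the clipping sets in
    return max(-1.0, min(1.0, x))

# S(x) coincides with tanh(x/2) everywhere:
for x in (-3.0, 0.5, 4.0):
    assert abs(S(x) - math.tanh(x / 2.0)) < 1e-12
print("S(x) == tanh(x/2) verified")
```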
*Fig. 7.2. Four types of continuous "squashing" functions*
