Recurrent Neural Networks for Prediction
Authored by Danilo P. Mandic, Jonathon A. Chambers
Copyright © 2001 John Wiley & Sons Ltd
ISBNs: 0-471-49517-4 (Hardback); 0-470-84535-X (Electronic)

5 Recurrent Neural Networks Architectures

5.1 Perspective

In this chapter, the use of neural networks, in particular recurrent neural networks, in system identification, signal processing and forecasting is considered. The ability of neural networks to model nonlinear dynamical systems is demonstrated, and the correspondence between neural networks and block-stochastic models is established. Finally, further discussion of recurrent neural network architectures is provided.

5.2 Introduction

There are numerous situations in which the use of linear filters and models is limited. For instance, when trying to identify a saturation type nonlinearity, linear models will inevitably fail. This is also the case when separating signals with overlapping spectral components.

Most real-world signals are generated, to a certain extent, by a nonlinear mechanism, and therefore in many applications the choice of a nonlinear model may be necessary to achieve acceptable performance from an adaptive predictor. Communications channels, for instance, often need nonlinear equalisers to achieve acceptable performance. The choice of model is of crucial importance[1] and practical applications have shown that nonlinear models can offer better prediction performance than their linear counterparts. They also reveal rich dynamical behaviour, such as limit cycles, bifurcations and fixed points, that cannot be captured by linear models (Gershenfeld and Weigend 1993).

By system we consider the actual underlying physics[2] that generate the data, whereas by model we consider a mathematical description of the system. Many variations of mathematical models can be postulated on the basis of datasets collected from observations of a system, and their suitability assessed by various performance metrics. Since it is not possible to characterise nonlinear systems by their impulse response, one has to resort to less general models, such as homomorphic filters, morphological filters and polynomial filters. Some of the most frequently used polynomial filters are based upon Volterra series (Mathews 1991), a nonlinear analogue of the linear impulse response, threshold autoregressive (TAR) models (Priestley 1991) and Hammerstein and Wiener models. The latter two represent structures that consist of a linear dynamical model and a static zero-memory nonlinearity. An overview of these models can be found in Haber and Unbehauen (1990). Notice that for nonlinear systems, the ordering of the modules within a modular structure[3] plays an important role.

To illustrate some important features associated with nonlinear neurons, let us consider a squashing nonlinear activation function of a neuron, shown in Figure 5.1. For two identical mixed sinusoidal inputs with different offsets, passed through this nonlinearity, the output behaviour varies from amplifying and slightly distorting the input signal (solid line in Figure 5.1) to attenuating and considerably nonlinearly distorting the input signal (broken line in Figure 5.1).

[Figure 5.1: Effects of the y = tanh(v) nonlinearity in a neuron model upon two example inputs; panels show the input signals and the corresponding neuron outputs.]

[1] System identification, for instance, consists of choice of the model, model parameter estimation and model validation.
[2] Technically, the notions of system and process are equivalent (Pearson 1995; Sjöberg et al. 1995).
[3] To depict this, for two modules performing the nonlinear functions H_1(x) = sin(x) and H_2(x) = e^x, we have H_1(H_2(x)) ≠ H_2(H_1(x)), since sin(e^x) ≠ e^{sin(x)}. This is the reason to use the term nesting rather than cascading in modular neural networks.
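To make the effect in Figure 5.1 concrete, the following minimal sketch (our illustration, not code from the book; the amplitudes, frequencies and offset are assumptions) passes two identical mixed sinusoids with different DC offsets through y = tanh(v). The centred input emerges roughly linearly amplified, while the offset input is pushed into the saturation region and strongly distorted.

```python
import numpy as np

k = np.arange(500)
# Mixed sinusoidal excitation (example parameters)
v = np.sin(2 * np.pi * 0.01 * k) + 0.5 * np.sin(2 * np.pi * 0.03 * k)

y_centred = np.tanh(v)        # operates near the origin: mild distortion
y_offset = np.tanh(v + 3.0)   # DC offset drives the neuron into saturation

def linearity(y, v):
    """Correlation between output and input as a crude distortion measure."""
    y0, v0 = y - y.mean(), v - v.mean()
    return np.corrcoef(y0, v0)[0, 1]

print(f"correlation, centred input: {linearity(y_centred, v):.3f}")  # close to 1
print(f"correlation, offset input:  {linearity(y_offset, v):.3f}")   # visibly lower
```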
From the viewpoint of system theory, neural networks represent nonlinear maps, mapping one metric space to another.

Nonlinear system modelling has traditionally focused on Volterra–Wiener analysis. These models are nonparametric and computationally extremely demanding. For the representation of a causal system, the Volterra series expansion is given by

y(k) = h_0 + \sum_{i=0}^{N} h_1(i) x(k-i) + \sum_{i=0}^{N} \sum_{j=0}^{N} h_2(i,j) x(k-i) x(k-j) + \cdots.   (5.1)

A nonlinear system represented by a Volterra series is completely characterised by its Volterra kernels h_i, i = 0, 1, 2, .... Volterra modelling of a nonlinear system requires a great deal of computation, and mostly second- or third-order Volterra systems are used in practice. Since the Volterra series expansion is a Taylor series expansion with memory, both fail when describing a system with discontinuities, such as

y(k) = A sgn(x(k)),   (5.2)

where sgn(·) is the signum function. To overcome this difficulty, nonlinear parametric models of nonlinear systems, termed NARMAX, which are described by nonlinear difference equations, have been introduced (Billings 1980; Chon and Cohen 1997; Chon et al. 1999; Connor 1994). Unlike the Volterra–Wiener representation, the NARMAX representation of nonlinear systems is compact.

The NARMAX model describes a system by using a nonlinear functional dependence between lagged inputs, outputs and/or prediction errors. A polynomial expansion of the transfer function of a NARMAX neural network does not comprise delayed versions of the input and output of order higher than those presented to the network. Therefore, an input vector of insufficient order will result in undermodelling, which complies with Takens' embedding theorem (Takens 1981).
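Returning to the Volterra expansion (5.1), the sketch below (ours, with arbitrary example kernels) implements a truncated second-order Volterra filter. The quadratic kernel alone costs O(N²) multiply–accumulate operations per output sample, which illustrates why mostly second- or third-order systems are used in practice.

```python
import numpy as np

def volterra2(x, h0, h1, h2):
    """Truncated second-order Volterra filter, cf. equation (5.1).

    h1 : linear kernel, shape (N+1,)
    h2 : quadratic kernel, shape (N+1, N+1)
    """
    N = len(h1) - 1
    y = np.full(len(x), h0, dtype=float)
    for k in range(len(x)):
        # Delayed input vector [x(k), x(k-1), ..., x(k-N)], zero-padded at the start
        xv = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(N + 1)])
        y[k] += h1 @ xv + xv @ h2 @ xv   # O(N) linear + O(N^2) quadratic terms
    return y

# Arbitrary example kernels, for illustration only
rng = np.random.default_rng(0)
x = rng.standard_normal(200)
h1 = np.array([0.5, 0.25, 0.1])
h2 = 0.05 * np.outer(h1, h1)
y = volterra2(x, h0=0.1, h1=h1, h2=h2)
print(y[:5])
```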
Applications of neural networks in forecasting, signal processing and control require treatment of the dynamics associated with the input signal. Feedforward networks for the processing of dynamical systems tend to capture the dynamics by including past inputs in the input vector. However, for dynamical modelling of complex systems, there is a need to involve feedback, i.e. to use recurrent neural networks. There are various configurations of recurrent neural networks, which are used by Jordan (1986) for control of robots, by Elman (1990) for problems in linguistics and by Williams and Zipser (1989a) for nonlinear adaptive filtering and pattern recognition. In Jordan's network, past values of the network outputs are fed back into the hidden units; in Elman's network, past values of the outputs of the hidden units are fed back into themselves; whereas in the Williams–Zipser architecture, the network is fully connected, having one hidden layer (a minimal code sketch of the Elman recurrence is given below).

There are numerous modular and hybrid architectures combining linear adaptive filters and neural networks. These include the pipelined recurrent neural network (PRNN) and networks combining recurrent networks and FIR adaptive filters. The main idea here is that the linear filter captures the linear 'portion' of the input process, whereas the neural network captures the nonlinear dynamics associated with the process.
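As a concrete illustration of the Elman feedback mentioned above, here is a minimal sketch (ours, not from the book; the layer sizes, weight scaling and tanh activation are assumptions) of an Elman-style forward pass, in which the hidden-unit outputs are fed back into the hidden layer through a context vector:

```python
import numpy as np

class ElmanRNN:
    """Elman (1990)-style network: past hidden outputs fed back as context."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = 0.1 * rng.standard_normal((n_hidden, n_in))
        self.W_ctx = 0.1 * rng.standard_normal((n_hidden, n_hidden))  # feedback weights
        self.W_out = 0.1 * rng.standard_normal((n_out, n_hidden))
        self.h = np.zeros(n_hidden)  # context: previous hidden outputs

    def step(self, x):
        # Hidden outputs depend on the current input and on their own past values
        self.h = np.tanh(self.W_in @ x + self.W_ctx @ self.h)
        return self.W_out @ self.h

net = ElmanRNN(n_in=1, n_hidden=8, n_out=1)
outputs = [net.step(np.array([u])) for u in np.sin(0.1 * np.arange(100))]
```

In a Jordan network the feedback term would instead be driven by the past network output, while a Williams–Zipser network would be fully connected.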
5.3 Overview

The basic modes of modelling, such as parametric, nonparametric, white box, black box and grey box modelling, are introduced first. Afterwards, the dynamical richness of neural models is addressed, and feedforward and recurrent modelling for noisy time series are compared. Block-stochastic models are introduced, and neural networks are shown to be able to represent these models. The chapter concludes with an overview of recurrent neural network architectures and recurrent neural networks for NARMAX modelling.

5.4 Basic Modes of Modelling

The notions of parametric, nonparametric, black box, grey box and white box modelling are explained. These can be used to categorise neural network algorithms, such as direct gradient computation, a posteriori and normalised algorithms. The basic idea behind these approaches to modelling is not to estimate what is already known. One should, therefore, utilise prior knowledge and knowledge about the physics of the system when selecting the neural network model, prior to parameter estimation.

5.4.1 Parametric versus Nonparametric Modelling

A review of nonlinear input–output modelling techniques is given in Pearson (1995). Three classes of input–output models are parametric, nonparametric and semiparametric models. We next briefly address them.

• Parametric modelling assumes a fixed structure for the model. The model identification problem then simplifies to estimating a finite set of parameters of this fixed model. The estimation is based upon the prediction of real input data, so as to best match the input data dynamics. An example of this technique is the broad class of ARIMA/NARMA models: for a given structure of the model (NARMA, for instance), we recursively estimate the parameters of the chosen model.

• Nonparametric modelling seeks a particular model structure from the input data. The actual model is not known beforehand. An example taken from nonparametric regression is that we look for a model in the form y(k) = f(x(k)) without knowing the function f(·) (Pearson 1995).

• Semiparametric modelling is a combination of the above. Part of the model structure is completely specified and known beforehand, whereas the other part of the model is either not known or only loosely specified.

Neural networks, especially recurrent neural networks, can be employed within estimators of all of the above classes of models. Closely related to the above concepts are white, grey and black box modelling techniques.

5.4.2 White, Grey and Black Box Modelling

To understand and analyse real-world physical phenomena, various mathematical models have been developed. Depending on the a priori knowledge about the process, the data and the model, we differentiate between three fairly general modes of modelling. The idea is to distinguish between three levels of prior knowledge, which have been 'colour-coded'. An overview of white, grey and black box modelling techniques can be found in Aguirre (2000) and Sjöberg et al. (1995).

Given data gathered from planet movements, Kepler's laws might well provide the initial framework in building a mathematical model of the process. This mode of modelling is referred to as white box modelling (Aguirre 2000), underlining its fairly deterministic nature. Static data are used to calculate the parameters, and to do that the underlying physical process has to be understood. It is therefore possible to build a white box model entirely from physical insight and prior knowledge. However, the underlying physics are generally not completely known, or are too complicated, and one often has to resort to other types of modelling.

The exact form of the input–output relationship that describes a real-world system is most commonly unknown, and therefore modelling is based upon a chosen set of known functions. In addition, if the model is to approximate the system with arbitrary accuracy, the set of chosen nonlinear continuous functions must be dense. This is the case with polynomials. In this light, neural networks can be viewed as another mode of functional representation.

Black box modelling therefore assumes no previous knowledge about the system that produces the data. However, the chosen network structure belongs to architectures that are known to be flexible and to have performed satisfactorily on similar problems. The aim hereby is to find a function F that approximates the process y based on previous observations of the process, y_PAST, and the input u, as

y = F(y_PAST, u).   (5.3)

This 'black box' establishes a functional dependence between the input and output, which can be either linear or nonlinear. The downside is that it is generally not possible to learn about the true physical process that generates the data, especially if a linear model is used. Once the training process is complete, a neural network represents a black box, nonparametric process model. Knowledge about the process is embedded in the values of the network parameters (i.e. the synaptic weights).

A natural compromise between the two previous modes is so-called grey box modelling. It is obtained from black box modelling if some information about the system is known a priori. This can be a probability density function, general statistics of the process data, an impulse response or attractor geometry. In Sjöberg et al. (1995), two subclasses of grey box models are considered: physical modelling, where a model structure is built upon an understanding of the underlying physics, as for instance the state-space model structure; and semiphysical modelling, where, based upon physical insight, certain nonlinear combinations of data structures are suggested and then estimated by black box methodology.

[Figure 5.2: Nonlinear prediction configuration using a neural network model. Tap-delay lines z^{-1}, ..., z^{-M} and z^{-1}, ..., z^{-N} feed delayed inputs u(k) and outputs into the neural network model; a switch with positions I and II selects the feedback signal, and the error e(k) is formed at the output.]

5.5 NARMAX Models and Embedding Dimension

For neural networks, the number of input nodes specifies the dimension of the network input. In practice, the true state of the system is not observable and the mathematical model of the system that generates the dynamics is not known. The question arises: is the sequence of measurements {y(k)} sufficient to reconstruct the nonlinear system dynamics? Under some regularity conditions, the embedding theorems of Takens (1981) and Mañé (1981) establish this connection. To ensure that the dynamics of a nonlinear process estimated by a neural network are fully recovered, it is convenient to use Takens' embedding theorem (Takens 1981), which states that to obtain a faithful reconstruction of the system dynamics, the embedding dimension d must satisfy

d \geq 2D + 1,   (5.4)

where D is the dimension of the system attractor.

Takens' embedding theorem (Takens 1981; Wan 1993) establishes a diffeomorphism between a finite window of the time series,

[y(k-1), y(k-2), ..., y(k-N)],   (5.5)

and the underlying state of the dynamic system which generates the time series. This implies that a nonlinear regression

y(k) = g[y(k-1), y(k-2), ..., y(k-N)]   (5.6)

can model the nonlinear time series. An important feature of the delay-embedding theorem due to Takens (1981) is that it is physically implemented by delay lines.
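The regression (5.6) is realised by sliding a delay line along the measured series. The sketch below (ours; the embedding dimension N = 3 and the logistic-map series are assumptions chosen for illustration) builds the delay-embedded regressors of (5.5) and the corresponding targets of (5.6):

```python
import numpy as np

def delay_embed(y, N):
    """Regressors [y(k-1), ..., y(k-N)] and targets y(k), cf. (5.5) and (5.6)."""
    X = np.column_stack([y[N - d : len(y) - d] for d in range(1, N + 1)])
    t = y[N:]
    return X, t

# Example series: the chaotic logistic map, a standard toy nonlinear system
y = np.empty(500)
y[0] = 0.3
for k in range(1, 500):
    y[k] = 4.0 * y[k - 1] * (1.0 - y[k - 1])

X, t = delay_embed(y, N=3)
print(X.shape, t.shape)   # (497, 3) (497,)
```

Any regressor g(·), a neural network included, can then be fitted to map the rows of X onto t.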
There is a deep connection between time-lagged vectors and the underlying dynamics. Delay vectors are not just a representation of a state of the system; their length is the key to recovering the full dynamical structure of a nonlinear system. A general starting point would be to use a network for which the input vector comprises delayed inputs and outputs, as shown in Figure 5.2. For the network in Figure 5.2, both the input and the output are passed through delay lines, hence indicating the NARMAX character of this network. The switch in this figure indicates two possible modes of learning, which will be explained in Chapter 6.

5.6 How Dynamically Rich are Nonlinear Neural Models?

To make an initial step toward comparing neural and other nonlinear models, we perform a Taylor series expansion of the sigmoidal nonlinear activation function of a single neuron model as (Billings et al. 1992)

\Phi(v(k)) = \frac{1}{1 + e^{-\beta v(k)}} = \frac{1}{2} + \frac{\beta}{4} v(k) - \frac{\beta^3}{48} v^3(k) + \frac{\beta^5}{480} v^5(k) - \frac{17\beta^7}{80\,640} v^7(k) + \cdots.   (5.7)

Depending on the steepness β and the activation potential v(k), the polynomial representation (5.7) of the transfer function of a neuron exhibits complex nonlinear behaviour. Let us now consider a NARMAX recurrent perceptron with p = 1 and q = 1, as shown in Figure 5.3, a simple example of a recurrent neural network.

[Figure 5.3: A NARMAX recurrent perceptron with p = 1 and q = 1; the bias input 1, the input x(k-1) and the fed-back output y(k-1) are weighted by w_0, w_1 and w_2, respectively, to form y(k).]

Its mathematical description is given by

y(k) = \Phi(w_1 x(k-1) + w_2 y(k-1) + w_0).   (5.8)

Expanding (5.8) using (5.7), with β = 1, yields

y(k) = \frac{1}{2} + \frac{1}{4}[w_1 x(k-1) + w_2 y(k-1) + w_0] - \frac{1}{48}[w_1 x(k-1) + w_2 y(k-1) + w_0]^3 + \cdots.   (5.9)

Expression (5.9) illustrates the dynamical richness of squashing activation functions: the associated dynamics, when represented in terms of polynomials, are quite complex. Networks with more neurons and hidden layers will produce even more complicated dynamics than those in (5.9). Following the same approach, for a general recurrent neural network we obtain (Billings et al. 1992)

y(k) = c_0 + c_1 x(k-1) + c_2 y(k-1) + c_3 x^2(k-1) + c_4 y^2(k-1) + c_5 x(k-1) y(k-1) + c_6 x^3(k-1) + c_7 y^3(k-1) + c_8 x^2(k-1) y(k-1) + \cdots.   (5.10)

Equation (5.10) does not comprise delayed versions of input and output samples of order higher than those presented to the network. If the input vector were of insufficient order, undermodelling would result, which complies with Takens' embedding theorem. Therefore, when modelling an unknown dynamical system or tracking unknown dynamics, it is important to concentrate on the embedding dimension of the network. Representation (5.10) also models an offset (mean value) c_0 of the input signal.
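To connect (5.8) with its polynomial view (5.9), the sketch below (ours; the weights are arbitrary example values) runs the recurrent perceptron once with the exact logistic nonlinearity and once with the truncated cubic expansion, and reports how far the two trajectories drift apart:

```python
import numpy as np

def sigmoid(v, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * v))

def sigmoid_taylor3(v):
    # Truncated expansion (5.7) with beta = 1, up to the cubic term
    return 0.5 + v / 4.0 - v**3 / 48.0

w0, w1, w2 = 0.1, 0.8, 0.4          # arbitrary example weights
x = np.sin(0.05 * np.arange(300))   # example input sequence

y_exact = np.zeros(len(x))
y_poly = np.zeros(len(x))
for k in range(1, len(x)):
    v_exact = w1 * x[k - 1] + w2 * y_exact[k - 1] + w0
    v_poly = w1 * x[k - 1] + w2 * y_poly[k - 1] + w0
    y_exact[k] = sigmoid(v_exact)         # recurrent perceptron (5.8)
    y_poly[k] = sigmoid_taylor3(v_poly)   # polynomial approximation (5.9)

print(f"max deviation, exact vs cubic model: {np.abs(y_exact - y_poly).max():.4f}")
```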
5.6.1 Feedforward versus Recurrent Networks for Nonlinear Modelling

The choice of which neural network to employ to represent a nonlinear physical process depends on the dynamics and complexity of the network that is best for representing the problem in hand. For instance, due to feedback, recurrent networks may suffer from instability and sensitivity to noise. Feedforward networks, on the other hand, might not be powerful enough to capture the dynamics of the underlying nonlinear dynamical system. To illustrate this problem, we resort to a simple IIR (ARMA) linear system described by the first-order difference equation

z(k) = 0.5 z(k-1) + 0.1 x(k-1).   (5.11)

The system (5.11) is stable, since the pole of its transfer function is at 0.5, i.e. within the unit circle in the z-plane. In a noisy environment, however, the output z(k) is corrupted by noise e(k), so that the noisy output y(k) of system (5.11) becomes

y(k) = z(k) + e(k),   (5.12)

which will affect the quality of estimation based on this model. This happens because the noise terms accumulate during the recursion[4] of (5.11), as

y(k) = 0.5 y(k-1) + 0.1 x(k-1) + e(k) - 0.5 e(k-1).   (5.13)

An equivalent FIR (MA) representation of the same filter (5.11), obtained by the method of long division, gives

z(k) = 0.1 x(k-1) + 0.05 x(k-2) + 0.025 x(k-3) + 0.0125 x(k-4) + \cdots,   (5.14)

and the representation of the noisy system now becomes

y(k) = 0.1 x(k-1) + 0.05 x(k-2) + 0.025 x(k-3) + 0.0125 x(k-4) + \cdots + e(k).   (5.15)

Clearly, the noise in (5.15) is not correlated with the previous outputs and the estimates are unbiased.[5] The price to pay, however, is the infinite length of the exact representation of (5.11). A similar principle applies to neural networks. In Chapter 6 we address the modes of learning in neural networks and discuss the bias/variance dilemma for recurrent neural networks.

[4] Notice that if the noise e(k) is zero mean and white, it appears coloured in (5.13), i.e. correlated with previous outputs, which leads to biased estimates.
[5] Under the usual assumption that the external additive noise e(k) is not correlated with the input signal x(k).
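A quick numerical check (our sketch, not from the book) of the argument in (5.13) and (5.15): with white noise e(k), the recursive-model regressor y(k-1) is correlated with the coloured equation noise e(k) - 0.5 e(k-1), whereas the FIR regressor x(k-1) is not:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 20000
x = rng.standard_normal(K)
e = 0.1 * rng.standard_normal(K)   # zero-mean white noise

# Simulate z(k) = 0.5 z(k-1) + 0.1 x(k-1), eq. (5.11), then add noise, eq. (5.12)
z = np.zeros(K)
for k in range(1, K):
    z[k] = 0.5 * z[k - 1] + 0.1 * x[k - 1]
y = z + e

# Equation noise of the recursive form (5.13): e(k) - 0.5 e(k-1)
eq_noise = e[1:] - 0.5 * e[:-1]

corr_recursive = np.corrcoef(y[:-1], eq_noise)[0, 1]   # regressor y(k-1) vs noise
corr_fir = np.corrcoef(x[:-1], e[1:])[0, 1]            # regressor x(k-1) vs noise

print(f"corr(y(k-1), coloured noise): {corr_recursive:+.3f}  (nonzero -> bias)")
print(f"corr(x(k-1), white noise):    {corr_fir:+.3f}  (near zero -> unbiased)")
```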
5.7 Wiener and Hammerstein Models and Dynamical Neural Networks

Under relatively mild conditions,[6] the output signal of a nonlinear model can be considered as a combination of outputs from suitable submodels. Structure identification, model validation and parameter estimation based upon these submodels are more convenient than those of the whole model. Block-oriented stochastic models consist of static nonlinear and dynamical linear modules. Such models often occur in practice, examples of which are

• the Hammerstein model, where a zero-memory nonlinearity is followed by a linear dynamical system characterised by its transfer function H(z) = N(z)/D(z);

• the Wiener model, where a linear dynamical system is followed by a zero-memory nonlinearity.

[6] A finite degree polynomial steady-state characteristic.

5.7.1 Overview of Block-Stochastic Models

The definitions of certain block-stochastic models are given below.

1. The Wiener system

   y(k) = g(H(z^{-1}) u(k)),   (5.16)

   where u(k) is the input to the system, y(k) is the output, H(z^{-1}) = C(z^{-1})/D(z^{-1}) is the z-domain transfer function of the linear component of the system and g(·) is a nonlinear function.

2. The Hammerstein system

   y(k) = H(z^{-1}) g(u(k)).   (5.17)

3. The Uryson system, defined by

   y(k) = \sum_{i=1}^{M} H_i(z^{-1}) g_i(u(k)).   (5.18)

[Figure 5.4: Nonlinear stochastic models used in control and signal processing. (a) The Hammerstein stochastic model: u(k) → nonlinear function → N(z)/D(z) → y(k). (b) The Wiener stochastic model: u(k) → N(z)/D(z) → nonlinear function → y(k).]

Theoretically, there are finite-size neural systems with dynamic synapses which can represent all of the above. Moreover, some modular neural architectures, such as the PRNN (Haykin and Li 1995), are able to represent block-cascaded Wiener–Hammerstein systems described by (Mandic and Chambers 1999c)

y(k) = \Phi_N(H_N(z^{-1}) \Phi_{N-1}(H_{N-1}(z^{-1}) \cdots \Phi_1(H_1(z^{-1}) u(k))))   (5.19)

and

y(k) = H_N(z^{-1}) \Phi_N(H_{N-1}(z^{-1}) \Phi_{N-1}(\cdots \Phi_1(H_1(z^{-1}) u(k))))   (5.20)

under certain constraints relating the size of the networks and the order of the block-stochastic models. Due to its parallel nature, however, a general Uryson model is not guaranteed to be representable this way.

5.7.2 Connection Between Block-Stochastic Models and Neural Networks

Block diagrams of the Wiener and Hammerstein systems are shown in Figure 5.4. The nonlinear function in Figure 5.4(a) can generally be assumed to be a polynomial, i.e.

v(k) = \sum_{i=0}^{M} \lambda_i u^i(k).   (5.21)

The Hammerstein model is a conventional parametric model, usually used to represent processes with nonlinearities involving the process inputs, as shown in Figure 5.4(a). The equation describing the output of a SISO Hammerstein system corrupted with additive output noise ν(k) is

y(k) = \Phi[u(k-1)] + \sum_{i=2}^{\infty} h_i \Phi[u(k-i)] + \nu(k),   (5.22)

where Φ is a continuous nonlinear function. A further requirement is that the linear dynamical subsystem be stable. This network is shown in Figure 5.5.
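A compact sketch (ours; the first-order linear block and the cubic polynomial nonlinearity are arbitrary example choices) of the two block structures in Figure 5.4, composing the same g(·) and H(z^{-1}) in the two opposite orders of (5.16) and (5.17):

```python
import numpy as np

def H(x, a=0.5, b=0.1):
    """Example linear dynamical block: w(k) = a*w(k-1) + b*x(k-1)."""
    w = np.zeros(len(x))
    for k in range(1, len(x)):
        w[k] = a * w[k - 1] + b * x[k - 1]
    return w

def g(x):
    """Static zero-memory nonlinearity, a polynomial as in (5.21)."""
    return x + 0.5 * x**3

rng = np.random.default_rng(2)
u = rng.standard_normal(1000)

y_wiener = g(H(u))   # Wiener: linear dynamics, then nonlinearity, cf. (5.16)
y_hammer = H(g(u))   # Hammerstein: nonlinearity, then linear dynamics, cf. (5.17)

# The modules do not commute (nesting rather than cascading), so the outputs differ
print(f"mean |y_wiener - y_hammer|: {np.abs(y_wiener - y_hammer).mean():.4f}")
```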
