Document: Reliability of Computer Systems and Networks, Part 4 (PDF)


Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design. Martin L. Shooman. Copyright © 2002 John Wiley & Sons, Inc. ISBNs: 0-471-29342-3 (Hardback); 0-471-22460-X (Electronic)

4 N-MODULAR REDUNDANCY

4.1 INTRODUCTION

In the previous chapter, parallel and standby systems were discussed as means of introducing redundancy and ways to improve system reliability. After the concepts were introduced, we saw that one of the complicating design features was that of the coupler in a parallel system and that of the decision unit and switch in a standby system. These complications are present in the design of analog systems as well as digital systems. However, a technique known as voting redundancy eliminates some of these problems by taking advantage of the digital nature of the output of digital elements. The concept is simple to explain if we view the output of a digital circuit as a string of bits. Without loss of generality, we can view the output as a parallel byte (8 bits long). (The concept generalizes to serial or parallel outputs n bits long.)
Assume that we apply the same input to two identical digital elements and compare the outputs. If each bit agrees, then either they are both working properly (likely) or they have both failed in an identical manner (unlikely). Using the concepts of coding theory, we can describe this as an error-detection, not an error-correction, method. If we detect a difference between the two outputs, then there is an error, although we cannot tell which element is in error. Suppose we add a third element and compare all three. If all three outputs agree bitwise, then either all three are working properly (most likely) or all three have failed in the same manner (most unlikely). If two of the element outputs (say, one and three) agree, then most likely element two has failed and we can rely on the output of elements one and three. Thus with three elements, we are able to correct one error. If two errors have occurred, it is very possible that they will fail in the same manner, and the comparison will agree (vote along) with the majority. The bitwise comparison of the outputs (which are 1s or 0s) can be done easily with simple digital logic. The next section references some early works that led to the development of this concept, now called N-modular redundancy.

This chapter and Chapter 3 are linked in many ways. For example, the technique of voting reliability joins the parallel and standby system reliability of the previous chapter as the three most common techniques for fault tolerance. (Also, the analytical techniques involving binomial probabilities and Markov models are used in both chapters.)
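The three-element comparison described above can be sketched in a few lines. This is a minimal illustration, not circuitry from the text: the function names `majority_vote` and `disagreeing_elements`, the 8-bit example word, and the single flipped bit are all assumptions chosen for the example. The voter is the usual bitwise two-out-of-three majority function, AB + AC + BC.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three equal-width words (AB + AC + BC)."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_elements(a: int, b: int, c: int) -> list[str]:
    """Names of the elements whose output differs from the voted word."""
    voted = majority_vote(a, b, c)
    return [name for name, word in (("A", a), ("B", b), ("C", c)) if word != voted]


# Hypothetical 8-bit output; element B is assumed to have one flipped bit.
correct = 0b10110010
faulty = correct ^ 0b00001000

print(f"voted word: {majority_vote(correct, faulty, correct):08b}")
print("outvoted elements:", disagreeing_elements(correct, faulty, correct))

# Two elements failing in the same manner outvote the single good element.
print(f"two identical failures: {majority_vote(faulty, faulty, correct):08b}")
```

With one faulty element the voted word equals the correct word, which is the single-error correction described above; with two identical failures the voter follows the faulty majority.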
Thus many of the analyses in this chapter that are aimed at comparing the three techniques constitute a continuation of the analyses that were begun in the previous chapter. The reader not familiar with the binomial distribution discussed in Sections A5.3 and B2.4 or the concepts of Markov modeling in Sections A8 and B7 should read the material in these appendix sections first. Also, the introductory material on digital logic in Appendix C is used in this chapter for discussing voter circuitry.

4.2 THE HISTORY OF N-MODULAR REDUNDANCY

The history of majority voting begins with the work of some of the most illustrious mathematicians of the 20th century, as outlined by Pierce [1965, pp. 2–7]. There were underlying currents of thought (linked together by theoreticians) that focused on the following:

1. How to use automata theory (logic gates and state machines) to model digital circuit and digital computer operation
2. A model of the human nervous system based on an interconnection of logic elements
3. A means of making reliable computing machines from unreliable components

The third topic was driven by the maintenance problems of the early computers related to relay and vacuum tube failures. A study of the Univac computer that was undertaken by Bell and Newell [1971, pp. 157–169] yields insight into these problems. The first Univac system passed its acceptance tests and was put into operation by the Bureau of the Census in March 1951. This machine was designed to operate 24 hours per day, 7 days per week (168 hours), except for approximately 32 hours of regularly scheduled preventative maintenance per week. Thus the availability would be 136/168 (81%) if there were no failures. In the 7-month period from June to December 1951, the computer experienced about 22 hours per week of nonscheduled engineering time (repair time due to failures), which reduced availability to 114/168 (68%). Some of the stated causes of troubles were uniservo failures, noise, long time constants,
and tube failures occurring at a rate of about […] per week. It is therefore clear that reliability was a compelling issue.

Moore and Shannon of Bell Labs in a classic article [1956] developed methods for making reliable relay circuits by various series and parallel connections of relay contacts. (The relay was the active element of its time in the switching networks of the telephone company as well as many elevator control systems and many early computers built at Bell Labs starting in 1937. See Randell [1975, Chapter VI] and Shooman [1990, pp. 310–320] for more information.) The classic paper on majority logic was written by John von Neumann (published in the work of Moore and Shannon [1956]), who developed the basic idea of majority voting into a sophisticated scheme with many NAND elements in parallel. Each input to the NAND element is supplied by a bundle of N identical inputs, and the 2N inputs are cross-coupled so that each NAND element has one input from each bundle. One of von Neumann's elements was called a restoring organ, since erroneous data that entered at the input was compared with the correct input data, producing the correct output and restoring the data.

4.3 TRIPLE MODULAR REDUNDANCY

4.3.1 Introduction

The basic modular redundancy circuit is triple modular redundancy (often called TMR). The system shown in Fig. 4.1 consists of three parallel digital circuits (A, B, and C), all with the same input. The outputs of the three circuits are compared by the voter, which sides with the majority and gives the majority opinion as the system output. If all three circuits are operating properly, all outputs agree; thus the system output is correct. However, if one element has failed so that it has produced an incorrect output, the voter chooses the output of the two good elements as the system output because they both agree; thus the system output is correct. If two elements have failed, the voter agrees with the majority (the two that have failed); thus the system output is incorrect.
The system output is also incorrect if all three circuits have failed. All the foregoing conclusions assume that a circuit fault is such that it always yields the complement of the correct output. A slightly different failure model is often used that assumes the digital circuit to have a fault that makes it stuck-at-one (s-a-1) or stuck-at-zero (s-a-0). Assuming that rapidly changing signals are exciting the circuit, a failure occurs within fractions of a microsecond of the fault occurrence regardless of the failure model assumed. Therefore, for reliability purposes, the two models are essentially equivalent; however, the error-rate computation differs from that discussed in Section 4.3.3. For further discussion of fault models, see Siewiorek [1982, pp. 17; 105–107] and [1992, pp. 22; 32; 35; 37; 357; 804].

[Figure 4.1: Triple modular redundancy. The system inputs (0,1) feed digital circuits A, B, and C; a voter combines their outputs into the system output (0,1).]

4.3.2 System Reliability

To apply TMR, all circuits (A, B, and C) must have equivalent logic and must have the same truth tables. In most cases, they are three replications of the same design and are identical. Using this assumption, and assuming that the voter does not fail, the system reliability is given by

$$R = P(AB + AC + BC) \tag{4.1}$$

If all the digital circuits are independent and identical with probability of success $p$, then this equation can be rewritten as follows in terms of the binomial theorem:

$$R = B(3:3) + B(2:3) = \binom{3}{3} p^3 (1-p)^0 + \binom{3}{2} p^2 (1-p) = 3p^2 - 2p^3 = p^2(3 - 2p) \tag{4.2}$$

This is, of course, the reliability expression for a two-out-of-three system. The assumption that the digital elements fail so that they produce the complement of the correct output may not be valid. (It is, however, a worst-case type of result and should yield a lower bound, i.e., a pessimistic answer.)
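As a quick check of Eq. (4.2), the closed form can be compared against a direct enumeration of the eight good/failed states of circuits A, B, and C. This is only a sketch under the same assumptions as the text (independent, identical circuits and a perfect voter); the function names and the sample values of p are chosen for illustration.

```python
from itertools import product


def tmr_reliability(p: float) -> float:
    """Eq. (4.2): R = 3p^2 - 2p^3, assuming a perfect voter."""
    return 3 * p**2 - 2 * p**3


def tmr_reliability_enumerated(p: float) -> float:
    """Add up the probability of every state with at least two good circuits."""
    total = 0.0
    for state in product((True, False), repeat=3):  # (A, B, C) good or failed
        good = sum(state)
        if good >= 2:
            total += p**good * (1 - p) ** (3 - good)
    return total


for p in (0.5, 0.75, 0.9, 0.99):
    print(f"p={p}: closed form={tmr_reliability(p):.5f}, "
          f"enumerated={tmr_reliability_enumerated(p):.5f}")
```

The two columns agree for every p, and the sample points bracket p = 0.5, where the TMR reliability equals that of a single element.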
4.3.3 System Error Rate

The probability model derived in the previous section enabled us to compute the system reliability, that is, the probability of no failures. In many problems, this is the primary measure of interest; however, there are also a number of applications in which another approach is important. In a digital communications system, for example, we are interested not only in the probability that the system makes no errors but also in the error rate. In other words, we assume that errors from temporary equipment malfunction or noise are not catastrophic if they occur only rarely, and we wish to compute the probability of such occurrence. Similarly, in digital computer processing of non-safety-critical data, we could occasionally tolerate an error without shutting down the operation for repair. A third, less clear-cut example is that of an inertial guidance computer for a rocket. At every computation cycle, the computer generates a course change and directs the missile control system accordingly. An error in one computation will direct the missile off course. If the error is large, the time between computations moderately long, the missile control system and dynamics quick to respond, and the flight near its end, the target may be missed, from which a catastrophic failure occurs. If these factors are reversed, however, a small error will temporarily steer the missile off course, much as a wind gust does. As long as the error has cleared in one or two computation cycles, the missile will rapidly return to its proper course. A model for computing transmission-error probabilities is discussed below.

To construct the type of failure model discussed previously, we assume that one good state and two failed states exist:

$A_1$ = element A gives a one output regardless of input (stuck-at-one, or s-a-1)
$A_0$ = element A gives a zero output regardless of input (stuck-at-zero, or s-a-0)

To work with this three-state model, we shall change our definition
of reliability to “the probability that the digital circuit gives the correct output to any given input.” Thus, for the circuits of Fig. 4.1, if the correct output is to be a one, the probability expression is

$$P_1 = 1 - P(A_0 B_0 + A_0 C_0 + B_0 C_0) \tag{4.3a}$$

Equation (4.3a) states that the probability of correctly indicating a one output is given by unity minus the probability of two or more “zero failures.” Similarly, the probability of correctly indicating a zero output is given by Eq. (4.3b):

$$P_0 = 1 - P(A_1 B_1 + A_1 C_1 + B_1 C_1) \tag{4.3b}$$

If we assume that a one output and a zero output have equal probability of occurrence, 1/2, on any particular transmission, then the system reliability is the average of Eqs. (4.3a) and (4.3b). If we let

$$P(A) = P(B) = P(C) = p \tag{4.4a}$$
$$P(A_1) = P(B_1) = P(C_1) = q_1 \tag{4.4b}$$
$$P(A_0) = P(B_0) = P(C_0) = q_0 \tag{4.4c}$$

and assume that all states and all elements fail independently, keeping in mind that the expansion of the second term in Eq. (4.3a) has seven terms, then substitution of Eqs. (4.4a–c) in Eq. (4.3a) yields the following equations:

$$P_1 = 1 - P(A_0 B_0) - P(A_0 C_0) - P(B_0 C_0) + 2P(A_0 B_0 C_0) \tag{4.5a}$$
$$P_1 = 1 - 3q_0^2 + 2q_0^3 \tag{4.5b}$$

Similarly, Eq. (4.3b) becomes

$$P_0 = 1 - P(A_1 B_1) - P(A_1 C_1) - P(B_1 C_1) + 2P(A_1 B_1 C_1) \tag{4.6a}$$
$$P_0 = 1 - 3q_1^2 + 2q_1^3 \tag{4.6b}$$

Averaging Eq. (4.5a) and Eq. (4.6a) gives

$$P = \frac{P_0 + P_1}{2} \tag{4.7a}$$
$$P = 1 - \frac{1}{2}\left(3q_0^2 + 3q_1^2 - 2q_0^3 - 2q_1^3\right) \tag{4.7b}$$

To compare Eq. (4.7b) with Eq. (4.2), we choose the same probability for both failure modes, $q_0 = q_1 = q$; therefore, $p + q_0 + q_1 = p + q + q = 1$, and $q = (1 - p)/2$. Substitution in Eq. (4.7b) yields

$$P = \frac{1}{2} + \frac{3}{4}p - \frac{1}{4}p^3 \tag{4.8}$$

The two probabilities, Eq. (4.2) and Eq. (4.8), are compared in Fig. 4.2. To interpret the results, it is assumed that the digital circuit in Fig. 4.1 is turned on at $t = 0$ and that initially the probability of each digital circuit being successful is $p = 1.00$. Thus both the reliability and probability of successful transmission are unity. If after 1 year of continuous operation p drops to 0.750, the system
reliability becomes 0.844; however, the probability that any one message is successfully transmitted is 0.957. To put the result another way, if 1,000 such digital circuits were operated for 1 year, on average 156 would not be operating properly at that time. However, the mistakes made by these machines would amount to 43 mistakes per 1,000 on the average. Thus, for the entire group, the error rate would be 4.3% after 1 year.

[Figure 4.2: Comparison of probability of successful transmission with the reliability. Horizontal axis: element reliability, p. Vertical axis: probability of success. Curves: any one transmission correct; all transmissions correct; reliability of a single element.]

4.3.4 TMR Options

Systems with N-modular redundancy can be designed to behave in different ways in practice [Toy, 1987; Arsenault, 1980, p. 137]. Let us examine in more detail the way a TMR system works. As previously described, the TMR system functions properly if there are no system failures or one system failure. The reliability expression was previously derived in terms of the probability of element success, p, as

$$R = 3p^2 - 2p^3 \tag{4.9}$$

If we assume a constant-failure rate $\lambda$, then each component has a reliability $p = e^{-\lambda t}$, and substitution into Eq. (4.9) yields

$$R(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t} \tag{4.10}$$

We can compute the MTTF for this system by integrating the reliability function, which yields

$$\text{MTTF} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda} \tag{4.11}$$

Toy calls this a TMR 3–2 system because the system succeeds if 3 or 2 units are good. Thus when a second failure occurs, the voter does not know which of the systems has failed and cannot determine which is the good system. In some cases, additional information is available by such means as observation (from a human operator or an automated system) of the two remaining units after the first failure occurs. For agreement in the event of failure, if one of the two remaining units has behaved strangely or erratically, the
“strange” system would be locked out (i.e., disconnected) and the other unit would be assumed to operate properly. In such a case, the TMR system really becomes a 1:3 system with a voter, which Toy calls a TMR 3–2–1 system. Equation (4.9) will change, and we must add the binomial probability of 1:3 to the equation, that is, $B(1:3) = 3p(1 - p)^2$, yielding

$$R = 3p^2 - 2p^3 + 3p(1 - p)^2 = p^3 - 3p^2 + 3p \tag{4.12a}$$

Substitution of $p = e^{-\lambda t}$ gives

$$R(t) = e^{-3\lambda t} - 3e^{-2\lambda t} + 3e^{-\lambda t} \tag{4.12b}$$

and an MTTF calculation yields

$$\text{MTTF} = \frac{1}{3\lambda} - \frac{3}{2\lambda} + \frac{3}{\lambda} = \frac{11}{6\lambda} \tag{4.13}$$

If we compare these results with those given in Table 3.4, we see that on the basis of MTTF, the TMR 3–2 system is slightly worse than a system with two standby elements. However, if we make a series expansion of the two functions and compare them in the high-reliability region, the TMR 3–2 system is superior. In the case of the TMR 3–2–1 system, it has an MTTF that is nearly the same as two standby elements. Again, a series expansion of the two functions and comparison in the high-reliability region is instructive.

For a single element, the truncated expansion of the reliability function $e^{-\lambda t}$ is

$$R_s \approx 1 - \lambda t \tag{4.14}$$

For a TMR 3–2 system, the truncated expansion of the reliability function, Eq. (4.9), is

$$R_{\mathrm{TMR}(3\text{–}2)} = e^{-2\lambda t}\left(3 - 2e^{-\lambda t}\right) \approx \left[1 - 2\lambda t + (2\lambda t)^2/2\right]\left[3 - 2\left(1 - \lambda t + (\lambda t)^2/2\right)\right] \approx 1 - 3(\lambda t)^2 \tag{4.15}$$

For a TMR 3–2–1 system, the truncated expansion of the reliability function, Eq. (4.12b), is

$$R_{\mathrm{TMR}(3\text{–}2\text{–}1)} = e^{-3\lambda t} - 3e^{-2\lambda t} + 3e^{-\lambda t} \approx \left[1 - 3\lambda t + (3\lambda t)^2/2 - (3\lambda t)^3/6\right] - 3\left[1 - 2\lambda t + (2\lambda t)^2/2 - (2\lambda t)^3/6\right] + 3\left[1 - \lambda t + (\lambda t)^2/2 - (\lambda t)^3/6\right] = 1 - (\lambda t)^3 \tag{4.16}$$

Equations (4.14), (4.15), and (4.16) are plotted in Fig. 4.3, showing the superiority of the TMR systems in the high-reliability region. Note that the TMR (3–2) system reliability decreases to about the same value as a single element when $\lambda t$ increases from about 0.3 to 0.35. Thus, the TMR is of most use for $\lambda t < 0.2$, whereas TMR (3–2–1) is of greater benefit and provides a considerably higher reliability for $\lambda t < 0.5$.

[Figure 4.3: Comparison of the reliability functions of a single system, a TMR 3–2 system, and a TMR 3–2–1 system in the high-reliability region. Horizontal axis: normalized time, $\lambda t$ (0 to 0.35). Vertical axis: reliability (0 to 1.0).]

For further comparisons of MTTF and reliability for N-modular systems, refer to the problems at the end of the chapter.

4.4 N-MODULAR REDUNDANCY

4.4.1 Introduction

The preceding section introduced TMR as a majority voting scheme for improving the reliability of digital systems and components. Of course, this is the most common implementation of majority logic because of the increased cost of replicating systems. However, with the reduction in cost of digital systems from integrated circuit advances, it is practical to discuss N-version voting or, as it is now more popularly called, N-modular redundancy. In general, N is an odd integer; however, if we have additional information on which systems are malfunctioning and also the ability to lock out malfunctioning systems, it is feasible to let N be an even integer. (Compare advanced voting techniques in Section 4.11 and the Space Shuttle control system example in Section 5.9.3.)
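Before going further, the TMR results of Section 4.3.4 can be spot-checked numerically. The sketch below evaluates the reliability functions of Eqs. (4.10) and (4.12b) in terms of the normalized time x = λt, compares them with the truncated expansions of Eqs. (4.14) to (4.16), and estimates the MTTF values of Eqs. (4.11) and (4.13) by trapezoidal integration. The function names, the integration limit, and the step count are assumptions made for this example only.

```python
import math


def r_single(x: float) -> float:
    """Single element: R = exp(-x), with x = lambda * t."""
    return math.exp(-x)


def r_tmr_32(x: float) -> float:
    """TMR 3-2 system, Eq. (4.10)."""
    return 3 * math.exp(-2 * x) - 2 * math.exp(-3 * x)


def r_tmr_321(x: float) -> float:
    """TMR 3-2-1 system, Eq. (4.12b)."""
    return math.exp(-3 * x) - 3 * math.exp(-2 * x) + 3 * math.exp(-x)


def mttf_normalized(reliability, upper: float = 50.0, steps: int = 200_000) -> float:
    """Trapezoidal estimate of the integral of R(x) from 0 to `upper`.

    The result is the MTTF in units of 1/lambda; `upper` and `steps`
    are arbitrary choices that make the truncation error negligible.
    """
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h


print("MTTF * lambda, TMR 3-2  :", round(mttf_normalized(r_tmr_32), 4), "vs 5/6")
print("MTTF * lambda, TMR 3-2-1:", round(mttf_normalized(r_tmr_321), 4), "vs 11/6")

# Exact values next to the truncated expansions in the high-reliability region.
for x in (0.05, 0.1, 0.2):
    print(f"x={x}: single {r_single(x):.4f} ~ {1 - x:.4f}, "
          f"TMR 3-2 {r_tmr_32(x):.4f} ~ {1 - 3 * x**2:.4f}, "
          f"TMR 3-2-1 {r_tmr_321(x):.4f} ~ {1 - x**3:.4f}")
```

The truncated expansions track the exact functions closely for small λt and drift apart as λt grows, which is why the comparison in the text is restricted to the high-reliability region.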
The reader should note there is a pitfall to be skirted if we contemplate the design of, say, a 5-level majority logic circuit on a chip. If the five digital circuits plus the voter are all on the same chip, and if only input and output signals are accessible, there would be no way to test the chip, for which reason additional test outputs would be needed. This subject is discussed further in Sections 4.6.2 and 4.7.4.

In addition, if we contemplate using N-modular redundancy for a digital system composed of the three subsystems A, B, and C, the question arises: Do we use N-modular redundancy on three systems ($A_1 B_1 C_1$, $A_2 B_2 C_2$, and $A_3 B_3 C_3$) with one voter, or do we apply voting on a lower level, with one voter comparing $A_1 A_2 A_3$, a second comparing $B_1 B_2 B_3$, and a third comparing $C_1 C_2 C_3$? If we apply the principles of Section 3.3, we will expect that voting on a component level is superior and that the reliability of the voter must be considered. This section explores such models.

4.4.2 System Voting

A general treatment of N-modular redundancy was developed in the 1960s [Knox-Seith, 1953; Pierce, 1961]. If one considers a system of $2n + 1$ parallel digital elements (note that this is an odd number) and a single perfect voter, the reliability expression is given by

$$R = \sum_{i=n+1}^{2n+1} B(i : 2n+1) = \sum_{i=n+1}^{2n+1} \binom{2n+1}{i} p^i (1 - p)^{2n+1-i} \tag{4.17}$$

The preceding expression is plotted in Fig. 4.4 for the case of one, three, five, and nine elements, assuming $p = e^{-\lambda t}$. Note that as $n \to \infty$, the MTTF of the system $\to 0.69/\lambda$. The limiting behavior of Eq. (4.17) as $n \to \infty$ is discussed in Shooman [1990, p. 302]; the reliability function approaches the three straight lines shown in Fig. 4.4. Further study of this figure reveals another important principle: N-modular redundancy is only superior to a single system in the high-reliability region. To be more specific, N-modular redundancy is superior to a single element for $\lambda t < 0.69$; thus, in system
design, one must carefully evaluate the values of reliability obtained over the range $0 < t <$ maximum mission time for various values of $n$ and $\lambda$.

Note that in the foregoing analysis, we assumed a perfect voter, that is, one with a reliability equal to unity. Shortly, we will discard this assumption and assign a more realistic reliability to voting elements. However, before we investigate the effect of the voter, it is germane to study the benefits of partitioning the original system into subsystems and using voting techniques on the subsystem level.

4.4.3 Subsystem Level Voting

Assume that a digital system is composed of m series subsystems, each having a constant-failure rate $\lambda$, and that voting is to be applied at the subsystem level. The majority voting circuit is shown in Fig. 4.5. Since this configuration is composed of just the m-independent series groups of the same configuration ...

... expression $p_c(3 - 2p_c) = 3p_c - 2p_c^2$. Differentiating with respect to $p_c$ and equating to zero yields $p_c = 3/4$, which agrees with Fig. 4.7. Substituting this value of $p_c$ into [$p_v p_c(3 - 2p_c)$ = ...
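The perfect-voter expression of Eq. (4.17) is easy to evaluate directly, which makes the crossover near λt = 0.69 visible. The following is a minimal sketch assuming independent, identical elements with p = e^(-λt) and a perfect voter, as in Section 4.4.2; the function name `nmr_reliability` and the sample values of λt are illustrative choices, not taken from the text.

```python
import math


def nmr_reliability(p: float, n: int) -> float:
    """Eq. (4.17): probability that at least n + 1 of 2n + 1 elements are good."""
    elements = 2 * n + 1
    return sum(
        math.comb(elements, i) * p**i * (1 - p) ** (elements - i)
        for i in range(n + 1, elements + 1)
    )


# n = 0, 1, 2, 4 gives one, three, five, and nine elements.
for lam_t in (0.3, math.log(2), 1.0):
    p = math.exp(-lam_t)
    row = ", ".join(f"N={2 * n + 1}: {nmr_reliability(p, n):.4f}" for n in (0, 1, 2, 4))
    print(f"lambda*t={lam_t:.2f} -> {row}")
```

At λt = ln 2 ≈ 0.69 the element reliability is p = 0.5 and every odd N gives R = 0.5; below that point adding elements raises the reliability, and above it adding elements lowers it.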

Date posted: 15/12/2013, 08:15
