CHAPTER 7
Computing in Presence of Faults

7.1 INTRODUCTION

In all previous chapters, with few exceptions, we have assumed total reliability, that is, the system is failure free. Unfortunately, total reliability is practically nonexistent in real systems. In this chapter we will examine how to compute, if possible, when failures can and do occur.

7.1.1 Faults and Failures

We speak of a failure (or fault) whenever something happens in the system that deviates from the expected correct behavior. In distributed environments, failures and their causes can be very different in nature. In fact, a malfunction could be caused by a design error, a manufacturing error, a programming error, physical damage, deterioration in the course of time, harsh environmental conditions, unexpected inputs, operator error, cosmic radiation, and so forth. Not all faults lead (immediately) to computational errors (i.e., to incorrect results of the protocol), but some do. So the goal is to achieve fault-tolerant computations; that is, our aim is to design protocols that will proceed correctly in spite of the failures.

The unpredictability of the occurrence and nature of a fault and the possibility of multiple faults render the design of fault-tolerant distributed algorithms very difficult and complex, if at all possible. In particular, the more components (i.e., entities, links) are present in the system, the greater is the chance of one or more of them being or becoming faulty.

Depending on their cause, faults can be grouped into three general classes:

- execution failures, that is, faults occurring during the execution of the protocol by an entity; examples of execution failures are computational errors occurring when performing an action, as well as execution of the incorrect rule;

- transmission failures, due to the incorrect functioning of the transmission subsystem; examples of transmission faults are the loss or corruption of a transmitted message as well as the delivery of a message to the wrong neighbor;

- component failures, such as the deactivation of a communication link between two neighbors, the shutdown of a processor (and thus of the corresponding entity), and so forth.

Note that the same fault can occur because of different causes, and hence be classified differently. Consider, for example, a message that an entity x is supposed to send (according to the protocol) to a neighbor y but that never arrives. This fault could have been caused by x failing to execute the "send" operation in the protocol (an execution error); by the loss of the message by the transmission subsystem (a transmission error); or by the link (x, y) going down (a component failure).

Depending on their duration, faults are classified as transient or permanent.

- A transient fault occurs and then disappears of its own accord, usually within a short period of time. A bird flying through the beam of a microwave transmitter may cause lost bits on some network. A transient fault happens once in a while; it may or may not reoccur. If it continues to reoccur (not necessarily at regular intervals), the fault is said to be intermittent. A loose contact on a connector will often cause an intermittent fault. Intermittent faults are difficult to diagnose.

- A permanent failure is one that continues to exist until the fault is repaired. Burnt-out chips, software bugs, and disk head crashes often cause permanent faults.
Depending on their geographical "spread," faults are classified as localized or ubiquitous.

- Localized faults always occur in the same region of the system, that is, only a fixed (although a priori unknown) set of entities/links will exhibit a faulty behavior.

- Ubiquitous faults can occur anywhere in the system, that is, all entities/links will exhibit, at some point or another, a faulty behavior.

Note that usually transient failures are ubiquitous, while intermittent and permanent failures tend to be localized.

Clearly no protocol can be resilient to an arbitrary number of faults. In particular, if the entire system collapses, no protocol can be correct. Hence, the goal is to design protocols that are able to withstand up to a certain amount of faults of a given type.

Another fact to consider is that not all faults are equally dangerous. The danger of a fault lies not necessarily in the severity of the fault itself but rather in the consequences that its occurrence might have on the correct functioning of the system. In particular, danger for the system is intrinsically related to the notion of detectability. In general, if a fault is easily detected, a remedial action can be taken to limit or circumvent the damage; if a fault is hard or impossible to detect, the effects of the initial fault may spread throughout the network, creating possibly irreversible damage. For example, the permanent fault of a link going down forever is obviously more severe than a merely transient failure of that link. On the other hand, the permanent failure of the link might be more easily detectable, and thus can be taken care of, than the occasional malfunctioning of the link. In this example, the less severe fault (the transient one) is potentially more dangerous for the system.

With this in mind, when we talk about fault-tolerant protocols and fault-resilient computations, we must always qualify the statements and clearly specify the type and number of faults that can be tolerated. To do so, we must first understand what are the limits to the fault tolerance of a distributed computing environment, expressed in terms of the nature and number of faults that make a nontrivial computation (im)possible.

7.1.2 Modeling Faults

Given the properties of the system and the types of faults assumed to occur, one would like to know the maximum number of faults that can be tolerated. This number is called the resiliency. To establish the resiliency, we need to be more precise about the types of faults that can occur. In particular, we need to develop a model to describe the failures in the system. Faults, as mentioned before, can be due to execution errors, transmission errors, or component failures; the same fault could be caused by any of those three causes and hence could fall in any of these three categories. There are several failure models, each differing on what is the factor "blamed" for a failure.

IMPORTANT. Each failure model offers a way of describing (some of the) faults that can occur in the system. A model is not reality, only an attempt to describe it.

Component Failure Models

The most common and best known models employed to discuss and study fault tolerance are the component failure models. In all the component failure models, the blame for any fault occurring in the system must be put on a component, that is, only components can fail, and if something goes wrong, it is because one of the involved components is faulty.
Depending on which components are blamed, there are three types of component failure models: the entity, link, and hybrid failure models.

- In the entity failure (EF) model, only nodes can fail. For example, if a node crashes, for whatever reason, that node will be declared faulty. In this model, a link going down is modeled by declaring one of the two incident nodes to be faulty and to lose all messages to and from its neighbor. Similarly, the corruption of a message during transmission must be blamed on one of the two incident nodes, which will be declared faulty.

- In the link failure (LF) model, only links can fail. For example, the loss of a message over a link will lead to that link being declared faulty. In this model, the crash of a node is modeled by the crash of all its incident links. The event of an entity computing some incorrect information (because of an execution error) and sending it to a neighbor will be modeled by blaming the link connecting the entity to the neighbor; in particular, the link will be declared responsible for corrupting the content of the message.

- In the hybrid failure (HF) model, both links and nodes can be faulty. Although more realistic, this model is little known and seldom used.

NOTE. In all three component failure models, the status "faulty" is permanent and is not changed, even if the faulty behavior attributed to that component is never repeated. In other words, once a component is marked as faulty, that mark is never removed; so, for example, in the link failure model, if a message is lost on a link, that link will be considered faulty forever, even if no other message is ever lost there.

Let us concentrate first on the entity failure model. That is, we focus on systems where (only) entities can fail. Within this environment, the nature of the failures of the entities can vary. With respect to the danger they may pose to the system, a hierarchy of failures can be identified.

1. With crash faults, a faulty entity works correctly according to the protocol, then suddenly just stops any activity (processing, sending, and receiving messages). These are also called fail-stop faults. Such a hard fault is actually the most benign from the overall system point of view.

2. With send/receive omission faults, a faulty entity occasionally loses some received messages or does not send some of the prepared messages. This type of fault may be caused by buffer overflows. Notice that crash faults are just a particular case of this type of failure: a crash is a send/receive omission in which all messages sent to and from that entity are lost. From the point of view of detectability, these faults are much more difficult than the previous one.

3. With Byzantine faults, a faulty entity is not bound by the protocol and can perform any action: it can omit to send or receive any message, send incorrect information to its neighbors, or behave maliciously so as to make the protocol fail. Undetected software bugs often exhibit Byzantine behavior. Clearly, dealing with Byzantine faults is going to be much more difficult than dealing with the previous ones.

FIGURE 7.1: Hierarchy of faults in the EF model (crash; send omission; receive omission; send/receive omission; Byzantine).

A similar hierarchy between faults exists in the link as well as in the hybrid failure model.
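The containment at the bottom of this hierarchy can be made concrete with a small simulation sketch (a hypothetical illustration, not from the book): an entity whose incoming and outgoing messages may each be dropped subsumes a crashed entity as the special case in which, from some time on, every message is dropped.

```python
import random

class Entity:
    """Toy model of an entity in the EF model (names are illustrative only)."""
    def __init__(self, ident, fault="none", crash_time=None, drop_prob=0.0):
        self.ident = ident
        self.fault = fault            # "none", "omission", or "crash"
        self.crash_time = crash_time  # time after which a crashed entity is silent
        self.drop_prob = drop_prob    # probability an omission-faulty entity drops a message

    def drops(self, t):
        """Does this entity drop a message sent or received at time t?"""
        if self.fault == "crash":
            # A crash is the extreme send/receive omission: after crash_time,
            # every message to or from the entity is lost.
            return t >= self.crash_time
        if self.fault == "omission":
            return random.random() < self.drop_prob
        return False

def deliver(sender, receiver, t):
    """A message at time t gets through only if neither endpoint omits it."""
    return not (sender.drops(t) or receiver.drops(t))

# A crashed entity behaves like an omission-faulty one with drop probability 1
# from its crash time onward; a correct entity is the drop_prob = 0 special case.
x = Entity("x", fault="crash", crash_time=3)
y = Entity("y", fault="omission", drop_prob=0.2)
z = Entity("z")
print([deliver(x, z, t) for t in range(6)])   # always False from t = 3 on
```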
Communication Failures Model

A totally different model is the communication failure or dynamic fault (DF) model; in this model, the blame for any fault is put on the communication subsystem. More precisely, the communication system can lose messages, corrupt them, or deliver them to the incorrect neighbor. As in this model only the communication system can be faulty, a component fault such as the crash failure of a node is modeled by the communication system losing all the messages sent to and from that node. Notice that in this model, no mark (permanent or otherwise) is assigned to any component.

In the communication failure model, the communication subsystem can cause only three types of faults:

1. An omission: a message sent by an entity is never delivered.

2. An addition: a message is delivered to an entity, although none was sent.

3. A corruption: a message is sent but one with different content is received.

While the nature of omissions and corruptions is quite obvious, that of additions is less so. Indeed, it describes a variety of situations. The most obvious one is when sudden noise in the transmission channel is mistaken for transmission of information by the neighbor at the other end of the link. The more important occurrence of additions in systems is rather subtle, as an addition models the reception of a "nonauthorized message" (i.e., a message not transmitted by any authorized user). In this sense, additions model messages surreptitiously inserted in the system by some outside, and possibly malicious, entity. Spam being sent from an unsuspecting site clearly fits the description of an addition. Summarizing, additions do occur and can be very dangerous.

These three types of faults are quite incomparable with each other in terms of danger. A hierarchy comes into place when two or all three of these basic fault types can occur simultaneously in the system. The presence of all three types of faults creates what is called a Byzantine faulty behavior.(1) The situation is depicted in Figure 7.2; a small simulation sketch of the three basic fault types is given at the end of this subsection.

FIGURE 7.2: Hierarchy of combinations of fault types in the DF model (omission, addition, corruption; their pairwise combinations; and all three together, i.e., Byzantine behavior).

Clearly, no protocol can tolerate any number of faults of any type. If the entire system collapses, no computation is possible. Thus, when we talk about fault-tolerant protocols and fault-resilient computations, we must always qualify the statements and clearly specify the type and number of faults that can be tolerated.

(1) The term "Byzantine" refers to the Byzantine Empire (330-1453 AD), the long-lived eastern component of the Roman Empire whose capital city was Byzantium (now Istanbul), in which endless conspiracies, intrigue, and untruthfulness were alleged to be common among the ruling class.
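As a minimal illustration of the three DF fault types, here is a hypothetical channel simulation (the names and structure are ours, not the book's): each transmitted message may be omitted or corrupted, and spurious messages may be added.

```python
import random

def faulty_channel(messages, p_omit=0.1, p_corrupt=0.1, p_add=0.1, rng=random):
    """Simulate a DF-model communication subsystem on a list of (src, dst, payload).

    - omission:   a sent message is never delivered
    - corruption: a message is delivered with altered content
    - addition:   a message is delivered although none was sent
    """
    delivered = []
    for (src, dst, payload) in messages:
        if rng.random() < p_omit:
            continue                          # omission: drop the message
        if rng.random() < p_corrupt:
            payload = payload ^ 1             # corruption: flip the (Boolean) content
        delivered.append((src, dst, payload))
        if rng.random() < p_add:
            # addition: a spurious message appears, not sent by any entity
            delivered.append((None, dst, rng.randint(0, 1)))
    return delivered

# Example: three Boolean messages pushed through the faulty channel.
sent = [("x", "y", 1), ("y", "z", 0), ("z", "x", 1)]
print(faulty_channel(sent))
```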
7.1.3 Topological Factors

Our goal is to design protocols that can withstand as many and as dangerous faults as possible and still exhibit a reasonable cost. What we will be able to do depends not only on our ability as designers but also on the inherent limits that the environment imposes. In particular, the impact of a fault, and thus our capacity to deal with it and design fault-tolerant protocols, depends not only on the type and number of faults but also on the communication topology of the system, that is, on the graph G. This is because all nontrivial computations are global, that is, they require the participation of possibly all entities.

For this reason, connectivity is a restriction required for all nontrivial computations. Even when it exists initially, connectivity may, owing to faults, cease to hold during the lifetime of the system, rendering correctness impossible. Hence, the capacity of the topological structure of the network to remain connected in spite of faults is crucial. There are two parameters that directly link topology to reliability and fault tolerance:

- the edge connectivity c_edge(G) is the minimum number of edges whose removal destroys the (strong) connectivity of G;

- the node connectivity c_node(G) is the minimum number of nodes whose removal destroys the (strong) connectivity of G.

NOTE. In the case of a complete graph, the node connectivity is always defined as n − 1.

Clearly, the higher the connectivity, the higher the resilience of the system to component failures. In particular,

Property 7.1.1  If c_edge(G) = k, then for any pair x and y of nodes there are k edge-disjoint paths connecting x to y.

Property 7.1.2  If c_node(G) = k, then for any pair x and y of nodes there are k node-disjoint paths connecting x to y.

Let us consider some examples of connectivity. A tree T has the lowest connectivity of all undirected graphs: c_edge(T) = c_node(T) = 1, so any failure of a link or a node disconnects the network. A ring R fares little better, as c_edge(R) = c_node(R) = 2. Higher connectivity can be found in denser graphs. For example, in a hypercube H, both connectivity parameters are log n. Clearly the highest connectivity is to be found in the complete network K. For a summary, see Figure 7.3.

  Network        c_node(G)   c_edge(G)   deg(G)
  Tree T         1           1           ≤ n − 1
  Ring R         2           2           2
  Torus Tr       4           4           4
  Hypercube H    log n       log n       log n
  Complete K     n − 1       n − 1       n − 1

FIGURE 7.3: Connectivity of some networks.

Note that in all connected networks G the node connectivity is not greater than the edge connectivity (Exercise 7.10.1), and neither can be larger than the maximum degree:

Property 7.1.3  ∀G, c_node(G) ≤ c_edge(G) ≤ deg(G).

As an example of the impact of edge connectivity on the existence of fault-tolerant solutions, consider the broadcast problem Bcast.

Lemma 7.1.1  If k arbitrary links can crash, it is impossible to broadcast unless the network is (k+1)-edge-connected.

Proof. If G is only k-edge-connected, then there are k edges whose removal disconnects G. The failure of those links will make some nodes unreachable from the initiator of the broadcast and, thus, they will never receive the information. By contrast, if G is (k+1)-edge-connected, then even after k links go down, by Property 7.1.1, there is still a path from the initiator to all other nodes. Hence flooding will work correctly.

As an example of the impact of node connectivity on the existence of fault-tolerant solutions, consider the problem of an initiator that wants to broadcast some information when some of the entities may be down. In this case, we just want the nonfaulty entities to receive the information. Then (Exercise 7.10.2),

Lemma 7.1.2  If k arbitrary nodes can crash, it is impossible to broadcast to the nonfaulty nodes unless the network is (k+1)-node-connected.
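Lemma 7.1.1 can be checked experimentally with a small, self-contained sketch (ours, not from the book): flood from an initiator after removing every possible set of k links and verify that all nodes are still reached whenever the graph is (k+1)-edge-connected.

```python
from itertools import combinations

def flood(nodes, edges, initiator):
    """Return the set of nodes reached by flooding over the surviving edges."""
    adj = {v: set() for v in nodes}
    for (u, v) in edges:
        adj[u].add(v)
        adj[v].add(u)
    reached, frontier = {initiator}, [initiator]
    while frontier:
        u = frontier.pop()
        for w in adj[u] - reached:   # forward the message to new neighbors
            reached.add(w)
            frontier.append(w)
    return reached

def broadcast_survives_k_link_crashes(nodes, edges, initiator, k):
    """Brute force: does flooding reach everyone for every choice of k crashed links?"""
    for crashed in combinations(edges, k):
        surviving = [e for e in edges if e not in crashed]
        if flood(nodes, surviving, initiator) != set(nodes):
            return False
    return True

# A 4-node ring is 2-edge-connected: it tolerates any single link crash (k = 1)
# but not two simultaneous link crashes (k = 2), matching Lemma 7.1.1.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(broadcast_survives_k_link_crashes(range(4), ring, 0, 1))  # True
print(broadcast_survives_k_link_crashes(range(4), ring, 0, 2))  # False
```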
7.1.4 Fault Tolerance, Agreement, and Common Knowledge

In most distributed computations there is a need for the entities to make a local but coordinated decision. This coordinated decision is called an agreement. For example, in the election problem, every entity must decide whether it is the leader or not. The decision is local but must satisfy some global constraint (only one entity must become leader); in other words, the entities must agree on which one is the leader.

The set of constraints defining the agreement differs from problem to problem. For example, in minimum finding, the constraint is that all and only the entities with the smallest input value must become minimum. In ranking, where every entity has an initial data item, the constraint is that the value decided by each entity is precisely the rank of its data item in the overall distributed set.

When there are no faults, reaching these agreements is possible (as we have seen in the other chapters) and often straightforward. Unfortunately, the picture changes dramatically in the presence of faults. Interestingly, the impact that faults have on problems requiring agreement for their solution has common traits, in spite of the differences among the agreement constraints. That is, some of the impact is the same for all these problems. For this reason, we consider an abstract agreement problem where this common impact of faults on agreements is more evident.

In the p-Agreement Problem (Agree(p)), each entity x has an input value v(x) from some known set (usually {0, 1}) and must terminally decide upon a value d(x) from that set within a finite amount of time. Here, "terminally" means that once made, the decision cannot be modified. The problem is to ensure that at least p entities decide on the same value. Additional constraints, called nontriviality (or sometimes validity) constraints, usually exist on the value to be chosen; in particular, if all values are initially the same, the decision must be on that value. This nontriviality constraint rules out default-type solutions (e.g., "always choose 0").

Depending on the value of p, we have different types of agreement problems. Of particular interest is the case of p = n/2 + 1, which is called strong majority. When p = n, we have the well-known Unanimity or Consensus Problem (Consensus), in which all entities must decide on the same value, that is,

  ∀x, y ∈ E, d(x) = d(y).                (7.1)

The consensus problem occurs in many different applications. For example, consider an aircraft where several sensors are used to decide if the moment has come to drop a cargo; it is possible that some sensors detect "yes" while others "not yet." On the basis of these values, a decision must be made on whether or not the cargo is to be dropped now. One solution strategy for our example is to drop the cargo only if all sensors agree; another is to decide for a drop as soon as at least one of the sensors indicates so. Observe that the first solution corresponds to computing the AND of the sensors' values; in the consensus problem this solution corresponds to each entity x setting d(x) = AND({v(y) : y ∈ E}). The second solution consists of determining the OR of those values, that is, d(x) = OR({v(y) : y ∈ E}). Notice that in both strategies, if the initial values are identical, each entity chooses that value.

Another example is in distributed database systems, where each site (the entity) of the distributed database must decide whether to accept or drop a transaction; in this case, all sites will agree to accept the transaction only if no site rejects it. The same solution strategies apply in this case as well.
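In a failure-free complete network, both strategies amount to every entity collecting all the input values and applying the same function to them. A minimal sketch of this (our illustration; a single collection step stands in for the actual message exchanges):

```python
def fault_free_consensus(inputs, combine=all):
    """Every entity 'receives' every input value and applies the same rule.

    combine=all computes the AND of the values, combine=any the OR;
    either way, all entities decide the same value, and if the inputs
    are identical the decision is that common value (nontriviality).
    """
    values = list(inputs.values())        # in a complete network, one exchange suffices
    decision = int(combine(values))
    return {x: decision for x in inputs}

sensors = {"s1": 1, "s2": 0, "s3": 1}
print(fault_free_consensus(sensors, combine=all))  # AND strategy: everyone decides 0
print(fault_free_consensus(sensors, combine=any))  # OR strategy:  everyone decides 1
```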
Summarizing, if there are no faults, consensus can be easily achieved (e.g., by computing the AND or the OR of the values). Lower forms of agreement, that is, when p < n, are even easier to solve.

In the presence of faults, the situation changes drastically, and even the problem must be restated. In fact, if an entity is faulty, it might be unable to participate in the computation; even worse, its faulty behavior might be an active impediment to the computation. In other words, as faulty entities cannot be required to behave correctly, the agreement constraint can hold only for the nonfaulty entities. So, for example, a consensus problem we are interested in is Entity-Fault-Tolerant Consensus (EFT-Consensus): each nonfaulty entity x has an input value v(x) and must terminally decide upon a value d(x) within a finite amount of time. The constraints are

1. agreement: all nonfaulty entities decide on the same value;

2. nontriviality: if all values of the nonfaulty entities are initially the same, the decision must be on that value.

(A small executable restatement of these two constraints is given at the end of this subsection.) Similarly, we can define lower forms (i.e., when p < n) of agreement in the presence of entity failures (EFT-Agree(p)). For simplicity (and without any loss of generality), we can consider the Boolean case, that is, when the values are all in {0, 1}.

Possible solutions to this problem are, for example, computing the AND or the OR of the input values of the nonfaulty entities, or the value of an elected leader. In other words, consensus (fault tolerant or not) can be solved by solving any of a variety of other problems (e.g., function evaluation, leader election, etc.). For this reason, the consensus problem is elementary: if it cannot be solved, then none of those other problems can be solved either.

Reaching agreement, and consensus in particular, is strictly connected with the problem of reaching common knowledge. Recall (from Section 1.8.1) that common knowledge is the highest form of knowledge achievable in a distributed computing environment. Its connection to consensus is immediate. In fact, any solution protocol P to the (fault-tolerant) consensus problem has the following property: as it leads all (nonfaulty) entities to decide on the same value, say d, within finite time the value d becomes common knowledge among all the nonfaulty entities. Conversely, any (fault-tolerant) protocol Q that creates common knowledge among all the nonfaulty entities can be used to make them decide on a same value and thus achieve consensus.

IMPORTANT. This implies that common knowledge is as elementary as consensus: if one cannot be achieved, neither can the other.
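To make the two EFT-Consensus constraints stated above concrete, here is a small checker (an illustration of the definition only, with hypothetical names; it is not a solution protocol):

```python
def satisfies_eft_consensus(inputs, decisions, faulty):
    """Check the agreement and nontriviality constraints over the nonfaulty entities.

    inputs, decisions: dicts mapping entity -> value in {0, 1}
    faulty: set of entities whose behavior is not constrained
    """
    good = [x for x in inputs if x not in faulty]
    agreement = len({decisions[x] for x in good}) == 1
    same_inputs = len({inputs[x] for x in good}) == 1
    nontriviality = (not same_inputs) or all(decisions[x] == inputs[x] for x in good)
    return agreement and nontriviality

# The faulty entity may decide anything; only the nonfaulty ones must agree.
inputs    = {"x": 1, "y": 1, "z": 1}
decisions = {"x": 1, "y": 1, "z": 0}
print(satisfies_eft_consensus(inputs, decisions, faulty={"z"}))   # True
print(satisfies_eft_consensus(inputs, decisions, faulty=set()))   # False
```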
7.2 THE CRUSHING IMPACT OF FAILURES

In this section we will examine the impact that faults have on distributed computing environments. As we will see, the consequences are devastating even when faults are limited in quantity and danger. We will establish these results assuming that the entities have distinct values (i.e., under restriction ID); this makes the bad news even worse.

7.2.1 Node Failures: Single-Fault Disaster

In this section we examine node failures. We consider the possibility that entities may fail during the computation, and we ask under what conditions the nonfaulty entities may still carry out the task. Clearly, if all entities fail, no computation is possible; also, we have seen that some faults are more dangerous than others. We are therefore interested in computations that can be performed provided that at most a certain number f of entities fail and those failures are of a certain type τ (i.e., danger). We will focus on achieving fault-tolerant consensus (problem EFT-Consensus described in Section 7.1.4), that is, we want all nonfailed entities to agree on the same value. As we have seen, this is an elementary problem.

A first and immediate limitation to the possibility of achieving consensus in the presence of node failures is given by the topology of the network itself. In fact, by Lemma 7.1.2, we know that if the graph is not (k+1)-node-connected, a broadcast to nonfaulty entities is impossible if k entities can crash. This means that

Lemma 7.2.1  If k ≥ 1 arbitrary entities can possibly crash, fault-tolerant consensus cannot be achieved if the network is not (k+1)-node-connected.

This means, for example, that in a tree, if a node goes down, consensus among the others cannot be achieved.

Summarizing, we are interested in achieving consensus provided that at most a given number f of entities fail, those failures are of at most a certain type τ of danger, and the node connectivity c_node of the network is high enough. In other words, the problem is characterized by those three parameters, and we will denote it by EFT-Consensus(f, τ, c_node). We will start with the simplest case:

- f = 1, that is, at most one entity fails;

- τ = crash, that is, if an entity fails, it will fail in the most benign way;

- c_node = n − 1, that is, the topology is not a problem, as we are in the complete graph.

In other words, we are in a complete network (every entity is connected to every other entity); at most one entity will crash, leaving all the other entities connected to each other. What we want is that these other entities agree on the same value, that is, we want to solve problem EFT-Consensus(1, crash, n−1). Unfortunately, [...]
[The preview is truncated here. The remaining fragments indicate that the omitted pages prove the Single-Fault Disaster result (Theorem 7.2.1) through an argument on event sequences, configurations, and bivalent configurations (Lemma 7.2.2 on the commutativity of disjoint sequences of events, and a related lemma showing that a nonfaulty bivalent configuration remains reachable); discuss the consequences of this impossibility (Section 7.2.2); and, under more restricted environments, present the protocol TellAll-Crash, which tolerates up to f ≤ n − 1 crash failures by having each entity repeatedly report the AND of its own value and of all the reports received so far (Equation 7.2).]
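The reported recursion for TellAll-Crash is enough to sketch the idea in code. The following is our reconstruction under stated assumptions (synchronous rounds, at most f crash failures, Boolean inputs, complete network); it is not the book's pseudocode, and the book's TellAll mechanism may count rounds differently; f + 1 exchange rounds are used here, the standard bound for crash-tolerant consensus.

```python
def tell_all_crash(inputs, crash_round, f):
    """Synchronous sketch: each entity repeatedly broadcasts the AND of what it has seen.

    inputs:      dict entity -> initial Boolean value v(x)
    crash_round: dict entity -> round at which it crashes (None if correct)
    f:           maximum number of crash failures tolerated
    """
    rep = dict(inputs)                       # rep(x, 0) = v(x)
    for t in range(1, f + 2):                # f + 1 exchange rounds
        received = {x: [] for x in rep}
        for x in rep:                        # x sends rep(x, t-1) to everybody...
            if crash_round[x] is not None and t >= crash_round[x]:
                continue                     # ...unless x has already crashed
            # (a real crash can happen mid-round, reaching only some neighbors;
            #  this subtlety, central to the f + 1 round bound, is omitted here)
            for y in rep:
                if y != x:
                    received[y].append(rep[x])
        # rep(x, t) = AND(rep(x, t-1), messages received in round t)
        rep = {x: rep[x] and all(received[x]) for x in rep}
    correct = [x for x in rep if crash_round[x] is None]
    return {x: int(rep[x]) for x in correct}

inputs  = {"a": 1, "b": 1, "c": 0, "d": 1}
crashes = {"a": 2, "b": None, "c": None, "d": None}   # entity a crashes in round 2
print(tell_all_crash(inputs, crashes, f=1))           # all correct entities decide 0
```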
