13 Autonomic Computing and Grid

Pratap Pattnaik, Kattamuri Ekanadham, and Joefon Jann
Thomas J. Watson Research Center, Yorktown Heights, New York, United States

Grid Computing – Making the Global Infrastructure a Reality. Edited by F. Berman, A. Hey and G. Fox. © 2003 John Wiley & Sons, Ltd. ISBN: 0-470-85319-0.

13.1 INTRODUCTION

The goal of autonomic computing is the reduction of complexity in the management of large computing systems. The evolution of computing systems faces a continuous growth in the number of degrees of freedom a system must manage in order to be efficient. Two major factors contribute to this growth. First, computing elements such as CPUs, memory, disks and networks have historically advanced at nonuniform rates. The disparity between the capabilities and speeds of the various elements opens up a number of different strategies for a task, depending upon the environment; this in turn calls for a dynamic strategy that makes judicious choices to achieve the targeted efficiency. Second, systems tend to have a global scope in terms of the demand for their services and the resources they employ for rendering those services. Changes in the demands or resources in one part of the system can have a significant effect on other parts. Recent experiences with Web servers (related to popular events such as the Olympics) emphasize the variability and unpredictability of demands and the need to react rapidly to changes. A system must perceive changes in its environment and must be ready with a variety of choices, so that suitable strategies can be quickly selected for the new environment.

The autonomic computing approach is to orchestrate the management of the functionalities, efficiencies and qualities of service of large computing systems through logically distributed, autonomous controlling elements, and to achieve a harmonious functioning of the global system within the confines of its stipulated behavior, while individual elements make locally autonomous decisions. In this approach, one moves from a resource/entitlement model to a goal-oriented model. To significantly reduce system management complexity, one must clearly delineate the boundaries of these controlling elements. The reduction in complexity is achieved mainly by making a significant number of decisions locally in these elements. If the local decision process has a small time constant, the decision is easy to revise before large damage is done globally.

Since Grid computing, by its very nature, involves the controlled sharing of computing resources across distributed, autonomous systems, we believe that there are a number of synergistic elements between Grid computing and autonomic computing, and that advances in the architecture of either one will help the other. In Grid computing too, local servers are responsible for enforcing local security objectives and for managing various queuing and scheduling disciplines. Thus, the concept of cooperation in a federation of several autonomic components to accomplish a global objective is a common theme for both autonomic computing and Grid computing. As the architecture of Grid computing continues to improve and rapidly evolve, as expounded in a number of excellent papers in this collection, we have taken the approach of describing the autonomic server architecture in this chapter, and we make some observations on the ways we perceive it to be a useful part of the Grid architecture evolution.
The choice of the term autonomic in autonomic computing is influenced by an analogy with biological systems [1, 2]. In this analogy, a component of a system is like an organism that survives in an environment. A vital aspect of such an organism is its symbiotic relationship with others in the environment: it renders certain services to others and receives certain services rendered by them. More interesting for our analogy is its adaptivity: it makes constant efforts to change its behavior in order to fit into its environment. In the short term, the organism perseveres to perform its functions despite adverse circumstances, by readjusting itself within the degrees of freedom it has. In the long term, evolution of a new species takes place, where environmental changes force permanent changes to functionality and behavior. While there may be many ways to perform a function, an organism uses its local knowledge to adopt a method that economizes its resources. Rapid response to external stimuli in order to adapt to a changing environment is the key aspect we attempt to mimic in autonomic systems.

The autonomic computing paradigm imparts this same viewpoint to the components of a computing system. The environment is the collection of components in a large system. The services performed by a component are reflected in the advertised methods of the component that can be invoked by others; likewise, a component receives the services of others by invoking their methods. The semantics of these methods constitute the behavior that the component attempts to preserve in the short term. In the long term, as technology progresses, new resources and new methods may be introduced. Like organisms, the components are not perfect. They do not always exhibit the advertised behavior exactly: there can be errors, impreciseness or even cold failures.
An autonomic component watches for these variations in the behavior of the components it interacts with, and adjusts to them.

Reduction of complexity is not a new goal. During the evolution of computing systems, several concepts emerged that help manage complexity. Two are particularly relevant here: object-oriented programming and fault-tolerant computing.

Object-oriented designs introduced the concept of abstraction, in which the interface specification of an object is separated from its implementation. Implementation of an object can thus proceed independently of the implementation of dependent objects, since it uses only their interface specifications. The rest of the system is spared from knowing or dealing with the complexity of the internal details of the object's implementation. Notions of hierarchical construction, inheritance and overloading make it easy to develop different functional behaviors while reusing the common parts. An autonomic system takes a similar approach, except that the alternative implementations are designed to improve performance, rather than to provide different behaviors. The environment is constantly monitored and suitable implementations are dynamically chosen for best performance.

Fault-tolerant systems are designed with additional support that can detect and correct any fault out of a predetermined set of faults; usually, redundancy is employed to overcome faults. Autonomic systems generalize the notion of fault to encompass any behavior that deviates from the expected or negotiated norm, including performance degradation or changes of service cost based on resource changes. Autonomic systems do not expect other components to operate correctly according to stipulated behavior.
The input–output responses of a component are constantly monitored, and when a component's behavior deviates from expectation, the autonomic system readjusts itself, either by switching to an alternative component or by altering its own input–output response suitably.

Section 13.2 describes the basic structure of a typical autonomic component, delineating its behavior, its observation of the environment, its choices of implementation and an adaptive strategy. While many system implementations may have these aspects buried in some detail, it is necessary to identify and delineate them, so that the autonomic nature of the design can be improved in a systematic manner. Section 13.3 illustrates two speculative methodologies to collect environmental information, with some examples from server design. Section 13.4 elaborates on the role of these aspects in a Grid computing environment.

13.2 AUTONOMIC SERVER COMPONENTS

The basic structure of any autonomic server component, C, is depicted in Figure 13.1, in which all agents that interact with C are lumped into one entity called the environment. This includes clients that submit input requests to C, other components whose services can be invoked by C, and resource managers that control the resources for C. An autonomic component has four basic specifications:

AutonomicComp ::= ⟨BehaviorSpec, StateSpec, MethodSpec, StrategySpec⟩
BehaviorSpec  ::= ⟨InputSet Σ, OutputSet Φ, ValidityRelation β ⊆ Σ × Φ⟩
StateSpec     ::= ⟨InternalState ψ, EstimatedExternalState ξ̂⟩
MethodSpec    ::= ⟨MethodSet Π, each π ∈ Π : Σ × ψ × ξ̂ → Φ × ψ × ξ̂⟩
StrategySpec  ::= ⟨Efficiency η, Strategy α : Σ × ψ × ξ̂ → Π⟩

The functional behavior of C is captured by a relation β ⊆ Σ × Φ, where Σ is the input alphabet, Φ is the output alphabet and β is a relation specifying valid input–output pairs. Thus, if C receives an input u ∈ Σ, it delivers an output v ∈ Φ satisfying the relation β(u, v).
The output variability permitted by the relation β (as opposed to a function) is common to most systems. As illustrated in Figure 13.1, a client is satisfied to get any one of many possible outputs (v, v′, …) for a given input u, as long as they satisfy the property specified by β. All implementations of the component preserve this functional behavior.

The state information maintained by a component comprises two parts: internal state ψ and external state ξ. The internal state ψ contains the data structures used by an implementation and any other variables used to keep track of input–output history and resource utilization. The external state ξ is an abstraction of the environment of C and includes information on the input arrival process, the current level of resources available to C, and the performance levels of other components of the system whose services are invoked by C. The component C has no control over the variability in the ingredients of ξ, as they are governed by agents outside C. The input arrival process is clearly outside C. We assume an external global resource manager that may supply or withdraw resources from C dynamically. Finally, C has no control over how other components are performing and must expect arbitrary variations (including failure) in their health. Thus the state information ξ is dynamically changing and is distributed throughout the system.

[Figure 13.1 Schematic view of an autonomic component and its environment. Clients, resource managers and other services make up the environment of the autonomic server component C, which holds an internal state ψ, an estimated state of the environment ξ̂, and implementations π1, π2, π3 ∈ Π, each mapping an input u ∈ Σ to an output v ∈ Φ satisfying β(u, v).]

C cannot have complete and accurate knowledge of ξ at any time. Hence, the best C can do is to keep an estimate ξ̂ of ξ at any time, and periodically update it as and when it receives correct information from the appropriate sources.
An implementation π is the usual input–output transformation based on state, π : Σ × ψ × ξ̂ → Φ × ψ × ξ̂, where any input–output pair u ∈ Σ, v ∈ Φ it produces satisfies the relation β(u, v). There must be many implementations π ∈ Π available to the autonomic component in order to adapt to the situation; a single implementation provides no degrees of freedom. Each implementation may require different resources and data structures. For any given input, different implementations may produce different outputs (of different quality), although all of them must satisfy the relation β.

Finally, the intelligence of the autonomic component is in the algorithm α that chooses the best implementation for any given input and state. Clearly, switching from one implementation to another can be expensive, as it involves restructuring of resources and data. The component must therefore establish a cost model that defines the efficiency η at which the component is operating at any time; the objective is to maximize η. In principle, the strategy α evaluates whether it is worthwhile to switch the current implementation for a given input and state, based on the costs involved and the benefit expected. Thus, the strategy is a function of the form α : Σ × ψ × ξ̂ → Π. As long as the current implementation is in place, the component continues to make local decisions based on its estimate of the external state. When actual observation of the external state indicates significant deviations from the estimate, an evaluation is made to choose the right implementation, to optimize η.

This leads to two aspects that can be studied separately. First, given that the component has up-to-date and accurate knowledge of the state of the environment, it must have an algorithm to determine the best implementation to adopt. This is highly dependent upon the system characteristics, the costs involved and the estimated benefits from the different implementations.
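As a minimal sketch of this structure (all class and method names are our own invention, not from the chapter), the strategy α that weighs the expected gain of a switch against its cost might look like this:

```python
# Hypothetical sketch of an autonomic component: a set of interchangeable
# implementations (the set Pi), a per-implementation efficiency estimate
# (eta), and a strategy (alpha) that switches implementation only when the
# expected gain outweighs the switching cost.

class AutonomicComponent:
    def __init__(self, implementations, switch_cost):
        self.implementations = implementations   # the method set Pi
        self.current = implementations[0]        # pi currently in use
        self.switch_cost = switch_cost
        self.estimated_env = {}                  # xi-hat: estimate of external state

    def efficiency(self, impl, env):
        # eta: assumed here to be reported by each implementation as a
        # function of the (estimated) external state.
        return impl.expected_efficiency(env)

    def strategy(self, env):
        # alpha: pick the most efficient implementation, but switch only
        # when the benefit over the current one exceeds the switching cost.
        best = max(self.implementations, key=lambda p: self.efficiency(p, env))
        gain = self.efficiency(best, env) - self.efficiency(self.current, env)
        if best is not self.current and gain > self.switch_cost:
            self.current = best
        return self.current

    def serve(self, request):
        # Every implementation must satisfy the same relation beta(u, v),
        # so the client sees consistent behavior regardless of alpha's choice.
        impl = self.strategy(self.estimated_env)
        return impl.run(request)
```

The key design point mirrored here is that α never changes the advertised behavior, only which π realizes it, and that a small gain does not justify paying the restructuring cost.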
An interesting design criterion is to choose the time constants for change of implementation so that the system enters a stable state quickly. Criteria and models for such designs are under investigation, and here we give a few examples. Second, a component may keep an estimate of the external state (which is distributed and dynamically changing) and must devise a means to correct its estimate periodically, so that the deviation from the actual state is kept within bounds. We examine this question in the next section.

13.3 APPROXIMATION WITH IMPERFECT KNOWLEDGE

A general problem faced by all autonomic components is the maintenance of an estimate ξ̂ of a distributed and dynamically changing external state ξ, as accurately as possible. We examine two possible ways of doing this: by self-observation and by collective observation.

13.3.1 Self-observation

Here a component operates completely autonomously and does not receive any explicit external state information from its environment. Instead, the component deduces information about its environment solely from its own interactions with it. This is indeed the way organisms operate in a biological environment. (No one explicitly tells an animal that there is a fire on the east side. It senses the temperatures as it tries to move around, organizes the gradients in its memory and, if lucky, moves west and escapes the fire.) Following the analogy, an autonomic component keeps a log of the input–output history with its clients, to track both the quality of service it is rendering to its clients and the pattern of input arrivals. Similarly, it keeps a history of its interactions with each external service that it uses and tracks its quality. On the basis of these observations, it formulates the estimate ξ̂ of the state of its environment, which is used in its local decisions to adopt suitable implementations.
The estimate is constantly revised as new inputs arrive. This strategy results in a very independent component that can survive in any environment. However, the component cannot react quickly to a rapidly changing environment: it takes a few interactions before it can assess the change. Thus, it has poor impulse response, but adapts very nicely to gradually changing circumstances. We illustrate this with the example of a memory allocator.

13.3.1.1 Example 1: Memory allocator

This simple example illustrates how an autonomic server steers input requests with frequently observed characteristics to implementations that specialize in efficient handling of those requests. The allocator does not require any resources or external services. Hence, the only external state it needs to speculate upon, ξ, is the pattern of inputs – specifically, how frequently a particular size has been requested in the recent past.

The behavior (Σ, Φ, β) of a memory allocator can be summarized as follows. The input set Σ has two kinds of inputs: alloc(n) and free(a); the output set Φ has three possible responses: null, error and an address. Alloc(n) is a request for a block of n bytes; the corresponding output is the address of a block, or an error indicating inability to allocate. The relation β validates any block, as long as it has the requested number of free bytes in it. Free(a) returns a previously allocated block; the system checks that the block was indeed previously allocated and returns null or error accordingly.

The quality of service η must balance several considerations. A client expects quick response time and also that its request is never denied. A second criterion is locality of allocated blocks: if the addresses are spread out widely in the address space, the client is likely to incur more translation overheads, and so prefers all the blocks to be within a compact region of addresses.
Finally, the system would like to minimize fragmentation and avoid keeping a large set of noncontiguous blocks that prevent it from satisfying requests for large blocks.

We illustrate a method set Π with two implementations. The first is a linked-list allocator, which keeps a list of the addresses and sizes of the free blocks that it has. To serve a new allocation request, it searches the list to find a block that is larger than (or equal to) the requested size. It divides the block if necessary, deletes the allocated block from the list and returns its address as the output. When the block is returned, it searches the list again and tries to merge the block with any adjacent portions in the free list. The second strategy is called slab allocation. It reserves a contiguous chunk of memory, called a slab, for each size known to be frequently used. When a slab exists for the requested size, it peels off a block from that slab and returns it. When a block allocated from a slab is returned, it links it back to the slab. When no slab exists for a request, it fails to allocate.

The internal state ψ contains the data structures that handle the linked list and the list of available slabs. The estimated environmental state ξ̂ contains data structures to track the frequency at which blocks of each size are requested or released. The strategy α is to choose the slab allocator when a slab exists for the requested size; otherwise the linked-list allocator is used. When the frequency for a size for which no slab exists exceeds a threshold, a new slab is created for it, so that subsequent requests for that size are served faster. When a slab is unused for a long time, it is returned to the linked list. The cost of allocating from a slab is usually smaller than the cost of allocating from a linked list, which in turn is smaller than the cost of creating a new slab.
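A toy sketch of this adaptive policy follows (the threshold, slab capacity and the stand-in for the linked-list path are invented for illustration; a real allocator manages raw memory, not Python containers):

```python
from collections import Counter, deque

class AdaptiveAllocator:
    """Toy model: slab fast path for frequently requested sizes,
    generic path (standing in for the linked-list allocator) otherwise."""

    def __init__(self, slab_threshold=3, slab_capacity=8):
        self.freq = Counter()           # xi-hat: observed request-size frequencies
        self.slabs = {}                 # size -> deque of ready block addresses
        self.slab_threshold = slab_threshold
        self.slab_capacity = slab_capacity
        self.next_addr = 0              # stand-in for linked-list bookkeeping

    def _generic_alloc(self, n):
        # Stand-in for the linked-list path: hand out fresh addresses in order.
        addr = self.next_addr
        self.next_addr += n
        return addr

    def alloc(self, n):
        self.freq[n] += 1
        slab = self.slabs.get(n)
        if slab:                        # fast path: peel a block off the slab
            return slab.popleft()
        if self.freq[n] >= self.slab_threshold:
            # Size has become hot: reserve a slab so later requests are cheap.
            base = self._generic_alloc(n * self.slab_capacity)
            self.slabs[n] = deque(base + i * n
                                  for i in range(1, self.slab_capacity))
            return base
        return self._generic_alloc(n)   # cold size: generic path

    def free(self, addr, n):
        if n in self.slabs:
            self.slabs[n].append(addr)  # return the block to its slab
```

The frequency counter plays the role of ξ̂, and the `slab_threshold` encodes the cost trade-off between the slab, linked-list and slab-creation paths.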
The allocator sets the thresholds based on these relative costs. Thus, the allocator autonomically reorganizes its data structures based on the pattern of sizes in its inputs.

13.3.2 Collective observation

In general, a system consists of a collection of components that are interconnected by the services they offer to each other. As noted before, the part of the environmental state ξ that is relevant to a component C is affected by the states of other components. For instance, if D is a component that provides services for C, then C can make more intelligent decisions if it has up-to-date knowledge of the state of D. If C is periodically updated about the state of D, the performance can be better than what can be accomplished by self-observation.

To elaborate, consider a system of n interacting components C_i, i = 1, …, n. Let S_ii(t) denote the portion of the state of C_i at time t that is relevant to other components in the system. For each i ≠ j, C_i keeps an estimate S_ij(t) of the corresponding state S_jj(t) of C_j. Thus, each component has an accurate value of its own state and estimated values of the states of the other components. Our objective is to come up with a communication strategy that minimizes the norm Σ_{i,j} |S_ij(t) − S_jj(t)| for any time t. This problem is similar to the time-synchronization problem, and the best solution is for all components to broadcast their states to everyone after every time step. But since broadcasts are expensive, it is desirable to come up with a solution that minimizes communication unless the error exceeds certain chosen limits. For instance, let us assume that each component can estimate how its state is going to change in the near future. Let Δ_i^t be the estimated derivative of S_ii(t) at time t – that is, the estimated value of S_ii(t + dt) is given by S_ii(t) + Δ_i^t · dt. There can be two approaches to using this information.
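Both approaches rest on the same extrapolation-plus-tolerance rule, which can be sketched as follows (class and function names are our own, not from the chapter):

```python
class StateEstimate:
    """A remote-state estimate held as a tuple <t, S, Delta>: the value S
    was observed at time t and is believed to grow at rate Delta."""

    def __init__(self, t, value, gradient):
        self.t = t
        self.value = value        # S_ii(t) as last reported
        self.gradient = gradient  # Delta_i^t, the estimated derivative

    def predict(self, now):
        # Linear extrapolation: S_ii(t) + Delta_i^t * (now - t)
        return self.value + self.gradient * (now - self.t)

def needs_update(estimate, now, actual, tolerance):
    # A refresh is triggered (by the source in the push scheme, or by the
    # subscriber in the pull scheme) only when the extrapolation error
    # exceeds the tolerance limit.
    return abs(estimate.predict(now) - actual) > tolerance
```

Under this rule the update bandwidth is proportional to how fast states actually drift from their predicted trajectories, which is the property both the push and pull variants below exploit.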
13.3.2.1 Subscriber approach (push paradigm)

Suppose a component C_j is interested in the state of C_i. Then C_j subscribes to C_i and obtains a tuple of the form ⟨t, S_ii(t), Δ_i^t⟩, which is stored as part of its estimate of the external state, ξ̂. This means that at time t the state of C_i was S_ii(t) and it grows at the rate Δ_i^t, so that C_j can estimate the state of C_i at a future time t + δt as S_ii(t) + Δ_i^t · δt. The component C_i constantly monitors its own state S_ii(t), and whenever the value |S_ii(t) + Δ_i^t · δt − S_ii(t + δt)| exceeds a tolerance limit, it computes a new gradient Δ_i^{t+δt} and sends all its subscribers the new tuple ⟨t + δt, S_ii(t + δt), Δ_i^{t+δt}⟩. The subscribers replace the tuple in their ξ̂ with the new information. Thus, the bandwidth of updates is proportional to the rate at which states change. Also, depending upon the tolerance level, the system can have a rapid impulse response.

13.3.2.2 Enquirer approach (pull paradigm)

This is a simple variation of the above approach, in which an update is sent only upon explicit request from a subscriber. Each subscriber may set its own tolerance limit and monitor the variation. If the current tuple is ⟨t, S_ii(t), Δ_i^t⟩, the subscriber requests a new update when the increment Δ_i^t · δt exceeds its tolerance limit. This relieves the source component of the burden of keeping track of subscribers and periodically updating them. Since all information flows on demand from a requester, the impulse response can be poor if the requester chooses a poor tolerance limit.

13.3.2.3 Example 2: Routing by pressure propagation

This example abstracts a common situation that occurs in Web services. It illustrates how components communicate their state to each other, so that each component can make decisions that improve the overall quality of service.
The behavior β can be summarized as follows. The system is a collection of components, each of which receives transactions from outside. Each component is capable of processing any transaction, regardless of where it enters the system. Each component maintains an input queue of transactions and processes them sequentially. When a new transaction arrives at a component, it is entered into the input queue of a selected component. This selection is the autonomic aspect here, and the objective is to minimize the response time for each transaction. Each component is initialized with some constant structural information about the system, ⟨μ_i, τ_ij⟩, where μ_i is the constant time taken by component C_i to process any transaction and τ_ij is the time taken for C_i to send a transaction to C_j. Thus, if a transaction that entered C_i is transferred to and served at C_j, its total response time is given by τ_ij + (1 + Q_j) · μ_j, where Q_j is the length of the input queue at C_j when the transaction entered the queue there. In order to give the best response to the transaction, C_i chooses to forward it to the C_j that minimizes [τ_ij + (1 + Q_j) · μ_j] over all possible j. Since C_i has no precise knowledge of Q_j, it must resort to speculation, using the collective observation scheme.

As described in the collective observation scheme, each component C_i maintains the tuple ⟨t, Δ_j^t, Q_j^t⟩, from which the queue size of C_j at time t + δt can be estimated as Q_j^t + Δ_j^t · δt. When a request arrives at C_i at time t + δt, it computes the target j that minimizes [τ_ij + (1 + Q_j^t + Δ_j^t · δt) · μ_j] over all possible j. The request is sent to be queued at C_j. Each component C_j broadcasts a new tuple ⟨t + δt, Δ_j^{t+δt}, Q_j^{t+δt}⟩ to all other components whenever the quantity |Q_j^t + Δ_j^t · δt − Q_j^{t+δt}| exceeds a tolerance limit.
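The selection rule that minimizes τ_ij + (1 + Q_j) · μ_j over the locally estimated queue lengths can be sketched as follows (all names and numbers are illustrative):

```python
def choose_target(tau_i, mu, q_est):
    """Pick the component j minimizing tau_ij + (1 + Q_j) * mu_j.

    tau_i : dict mapping j -> transfer time from this component to C_j
    mu    : dict mapping j -> per-transaction processing time of C_j
    q_est : dict mapping j -> locally extrapolated queue length of C_j
            (in the chapter's scheme, Q_j^t + Delta_j^t * delta_t)
    """
    def response_time(j):
        # Estimated total response time if the transaction is sent to C_j:
        # transfer time plus (queue ahead of it + itself) * service time.
        return tau_i[j] + (1 + q_est[j]) * mu[j]

    return min(tau_i, key=response_time)
```

Because q_est is only an extrapolation, a component may occasionally pick a suboptimal target; the broadcast-on-deviation rule bounds how wrong these estimates can get.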
13.4 GRID COMPUTING

The primary objective of Grid computing [3] is to facilitate controlled sharing of resources and services that are made available in a heterogeneous and distributed system. Both heterogeneity and distribution force the interactions between entities to be based on protocols that specify the exchanges of information in a manner that is independent of how a specific resource or service is implemented. Thus, a protocol is independent of details such as the libraries, language, operating system or hardware employed in the implementation. In particular, implementation of protocol communication between two heterogeneous entities will involve some changes in types and formats, depending upon the two systems. Similarly, implementation of protocol communication between two distributed entities will involve some marshaling and demarshaling of information and the instantiation of local stubs to mimic remote calls. The fabric layer of the Grid architecture defines some commonly used protocols for accessing resources and services in such a system. Since the interacting entities span multiple administrative domains, one needs to put in place protocols for authentication and security; these are provided by the connectivity layer of the Grid architecture. A service is an abstraction that guarantees a specified behavior, provided interactions adhere to the protocols defined for the service. An effort is under way to standardize the means by which a behavior can be specified, so that clients of the services can plan their interactions accordingly and the implementers of the services enforce the behavior. The resource layer of the Grid architecture defines certain basic protocols that are needed for acquiring and using the available resources.
Since there can be a variety of ways in which resource sharing can be done, the next layer, called the collective layer, describes protocols for discovering available services, negotiating for desired services, and initiating, monitoring and accounting for the services chosen by clients.

13.4.1 Synergy between the two approaches

The service abstraction of the Grid architecture maps to the notion of a component of autonomic computing described in Section 13.2. As we noted with components, the implementation of a high-level service for a virtual organization often involves several other resources and services, which are heterogeneous and distributed. The behavior of a service is the BehaviorSpec of a component in Section 13.2, and an implementation must ensure that it provides the advertised behavior under all conditions. Since a service depends upon other services and on the resources that are allocated for its implementation, prudence dictates that its design be autonomic. Hence, it must monitor the behavior of its dependent services, its own level of resources (which may be controlled by other agents) and the quality of service it is providing to its clients. In turn, this implies that a service implementation must have a strategy, such as α of Section 13.2, which must adapt to the changing environment and optimize performance by choosing appropriate resources. Thus, all the considerations we discussed under autonomic computing apply to this situation. In particular, there must be general provisions for the maintenance of accurate estimates of global states as discussed in Section 13.3, using either the self-observation or the collective observation method. A specialized protocol in the collective layer of the Grid architecture could possibly support this function. Consider the example of a data-mining service offered on a Grid.
There may be one or more implementations of the data-mining service, and each of them requires database services on the appropriate data repositories. All the implementations of a service form a collective, and they can coordinate to balance their loads, redirecting requests arriving at one component to components that have lighter loads. An autonomic data-mining service implementation may change its resources and its database services based on its performance and the perceived levels of service it is receiving. Recursively, the database services will have to be autonomic to optimize the utilization of their services. Thus, the entire paradigm boils down to designing each service from an autonomic perspective, incorporating logic to monitor performance, discover resources and apply them as dictated by its objective function.

13.5 CONCLUDING REMARKS

As systems get increasingly complex, natural forces will eliminate interactions with components whose complexity must be understood by an interactor. The only components that survive are those that hide their complexity, provide a simple and stable interface, possess the intelligence to perceive environmental changes, and struggle to fit into the environment. While facets of this principle are present in varying degrees in extant designs, explicit recognition of the need to be autonomic can make a big difference, and it pushes us toward designs that are robust, resilient and innovative. In the present era, where technological changes are so rapid, this principle assumes even greater importance, as adaptation to change becomes paramount.

The first aspect of autonomic designs that we observe is the clear delineation of the interface through which a client perceives a server. Changes to the implementation of the service should not compromise this interface in any manner.
The second aspect of an autonomic server is the need to monitor the varying input characteristics of its clientele as well as the varying response characteristics of the servers on which it depends. In the present-day environment, demands shift rapidly and cannot be anticipated most of the time. Similarly, components degrade and fail, and one must move away from deterministic behavior to fuzzy behaviors, where perturbations do occur and must be observed and acted upon. Finally, an autonomic server must be prepared to adapt quickly to the observed changes in inputs as well as in dependent services. The perturbations are due not only to failures of components but also to performance degradations caused by changing demands; autonomic computing provides a unified approach to deal with both. A collective of services can collaborate to provide each other accurate information, so that local decisions by each service contribute to global efficiency.

We observe commonalities between the objectives of the Grid and autonomic approaches. We believe that they must blend together, and that the Grid architecture must provide the necessary framework to facilitate the design of each service with an autonomic perspective. While we outlined the kinds of protocols and mechanisms that may be supported for this purpose, there is more work to be done in the area of formulating models that capture the stability characteristics that are of the essence here. This is the subject of our future study.

REFERENCES

1. Horn, P. Autonomic Computing, http://www.research.ibm.com/autonomic.
2. Wladawsky-Berger, I. Project Eliza, http://www-1.ibm.com/servers/eserver/introducing/eliza.
3. Foster, I., Kesselman, C. and Tuecke, S. (2001) The anatomy of the Grid. International Journal of Supercomputing Applications, 15(3), 200–222.
objective is a common theme for both autonomic computing and Grid computing. As the architecture of Grid computing continues to improve and rapidly evolve,
