Principles of Network and System Administration, 2nd edition (part 9)

more efficient in man hours than one which places humans in the driving seat. This presupposes, of course, that the setup and maintenance of the automatic system is not so time-consuming in itself as to outweigh the advantages provided by such an approach.

13.5.4 Evaluation of system administration as a collective effort

Few system administrators work alone. In most cases they are part of a team who all need to keep abreast of the behavior of the system and of the changes made in administration policy. Automation of system administration issues does not alter this. One issue for human administrators is how well a model for administration allows them to achieve this cooperation in practice. Does the automatic system make it easier for them to follow the development of the system in i) theory and ii) practice? Here theory refers to the conceptual design of the system as a whole, and practice refers to the extent to which the theoretical design has been implemented in practice. How is the task distributed between people, systems, procedures and tools? How is responsibility delegated, and how does this affect individuals? Is time saved, and are accuracy and consistency improved? These issues can be evaluated in a heuristic way from the experiences of administrators. Longer-term, more objective studies could also be performed by analyzing the behavior of system administrators in action. Such studies will not be performed here.

13.5.5 Cooperative software: dependency

The fragile tower of components in any functional system is the fundament of its operation. If one component fails, how resilient is the remainder of the system to this failure? This is a relevant question to pose in the evaluation of a system administration model. How do software systems depend on one another for their operation? If one system fails, will this have a knock-on effect on other systems? What are the core systems which form the basis of system operation? In the present work it is relevant to ask how the model continues to work in the event of the failure of DNS, NFS and other network services which provide infrastructure. Is it possible to immobilize an automatic system administration model?
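To make the knock-on question concrete, the following sketch models service dependencies as a directed graph and computes which services are lost, directly or indirectly, when one of them fails. This is a minimal illustration of my own, not part of the original study: the service names and dependency edges are invented, and a real site would derive them from its actual infrastructure.

```python
# Hypothetical service dependency map: each service lists the services it
# needs in order to function. Names and edges are invented for illustration.
DEPENDS_ON = {
    "dns":      [],
    "nfs":      ["dns"],
    "mail":     ["dns"],
    "web":      ["dns", "nfs"],
    "login":    ["dns", "nfs"],
    "cfengine": ["dns"],   # an automated administration agent, as an example
}

def knocked_out(failed, depends_on):
    """Return the set of services that stop working, directly or indirectly,
    when the services in 'failed' are down (transitive knock-on failure)."""
    lost = set(failed)
    changed = True
    while changed:
        changed = False
        for service, deps in depends_on.items():
            if service not in lost and any(d in lost for d in deps):
                lost.add(service)
                changed = True
    return lost

if __name__ == "__main__":
    # With these example edges, a DNS failure takes every other service with it...
    print(sorted(knocked_out({"dns"}, DEPENDS_ON)))
    # ...while an NFS failure leaves dns, mail and cfengine running.
    print(sorted(knocked_out({"nfs"}, DEPENDS_ON)))
```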
13.5.6 Evaluation of individual mechanisms

For individual pieces of software, it is sometimes possible to evaluate the efficiency and correctness of the components. Efficiency is a relative concept and, if used, it must be placed in a context. For example, efficiency of low-level algorithms is conceptually irrelevant to the higher levels of a program, but it might be practically relevant; i.e. one must say what is meant by efficiency before quoting results. The correctness of the results yielded by a mechanism/algorithm can be measured in relation to its design specifications. Without a clear mapping of input/output, the correctness of any result produced by a mechanism is a heuristic quality. Heuristics can only be evaluated by experienced users expressing their informed opinions.

13.5.7 Evidence of bugs in the software

Occasionally bugs significantly affect the performance of software. Strictly speaking, an evaluation of bugs is not part of the software evaluation itself, but of the process of software development, so while bugs should probably be mentioned, they may or may not be relevant to the issues surrounding the software itself. In this work software bugs have not played any appreciable role in either the development or the effectiveness of the results, so they will not be discussed in any detail.

13.5.8 Evidence of design faults

In the course of developing a program one occasionally discovers faults which are of a fundamental nature, faults which cause one to rethink the whole operation of the program. Sometimes these are fatal flaws, but that need not be the case. Cataloguing design faults is important for future reference, to avoid making similar mistakes again. Design faults may be caused by faults in the model itself or merely in its implementation. Legacy issues might also be relevant here: how do outdated features or methods affect software by placing demands on onward compatibility, or by restricting optimal design or performance?

13.5.9 Evaluation of system policies

System administration does not exist without human attitudes, behaviors and policies. These three fit together inseparably. Policies are adjusted to fit behavioral patterns; behavioral patterns are local phenomena. The evaluation of a system policy therefore has only limited relevance for the wider community: normally only relative changes are of interest, i.e. how changes in policy can move one closer to a desirable solution. Evaluating the effectiveness of a policy in relation to the applicable social boundary conditions presents practical problems which sociologists have wrestled with for decades. The problems lie in obtaining statistically significant samples of data to support or refute the policy. Controlled experiments are not usually feasible, since they would tie up resources over long periods; no one can afford this in practice. In order to test a policy in a real situation, the best one can do is to rely on heuristic information from an experienced observer (in this case the system administrator). Only an experienced observer would be able to judge the value of a policy on the basis of incomplete data. Such information is difficult to trust, however, unless it comes from several independent sources. A better approach might be to test the policy with simulated data spanning the range from best to worst case. The advantage with simulated data is that the results are reproducible from those data, and thus one has something concrete to show for the effort.

13.5.10 Reliability

Reliability cannot be measured until we define what we mean by it. One common definition uses the average (mean) time before failure as a measure of system reliability. This is quite simply the average amount of time we expect to elapse between serious failures of the system. Another way of expressing this is to use the average uptime, or the amount of time for which the system is responsive (waiting no more than a fixed length of time for a response). A complementary figure is then the average downtime, the average amount of time for which the system is unavailable for work (a kind of informational entropy). We can define the reliability as the probability that the system is available:

    ρ = Mean uptime / Total elapsed time

Some like to define this in terms of the Mean Time Before Failure (MTBF) and the Mean Time To Repair (MTTR), i.e.

    ρ = MTBF / (MTBF + MTTR).

This is clearly a number between 0 and 1. Many network device vendors quote these values by the number of 9's they yield, e.g. 0.99999.
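As a check on these definitions, here is a minimal sketch of my own (not from the book) that computes the availability ρ from assumed MTBF and MTTR figures and expresses it as a number of nines. The example figures are invented.

```python
import math

def availability(mtbf_hours, mttr_hours):
    """rho = MTBF / (MTBF + MTTR), a number between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def number_of_nines(rho):
    """Approximate count of leading 9's in the availability, e.g. ~0.99999 -> 5."""
    if rho >= 1.0:
        return float("inf")
    # The small epsilon guards against floating-point rounding just below a
    # whole number of nines.
    return math.floor(-math.log10(1.0 - rho) + 1e-9)

if __name__ == "__main__":
    # Invented example: a server that fails on average every 2000 hours
    # and takes 4 hours to repair.
    rho = availability(2000.0, 4.0)
    print(rho)                   # about 0.998
    print(number_of_nines(rho))  # 2 nines
```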
The effect of parallelism or redundancy on reliability can be treated as a facsimile of the Ohm's law problem, by noting that service provision is just like a flow of work (see also section 6.3 for examples of this):

    Rate of service (delivery) = Rate of change in information / Failure fraction

This is directly analogous to Ohm's law for the flow of current through a resistance:

    I = V / R

The analogy is captured in this table:

    Potential difference V    Change in information
    Current I                 Rate of service (flow of information)
    Resistance R              Rate of failure

This relation is simplistic. For one thing, it does not take into account variable latencies (although these could be defined as failure to respond). It should be clear that this simplistic equation is full of unwarranted assumptions, and yet its simplicity justifies its use for simple hand-waving. If we consider figure 6.10, it is clear that a flow of service can continue, when servers work in parallel, even if one or more of them fails. In figure 6.11 it is clear that systems which are dependent on other systems are coupled in series, and a failure prevents the flow of service. Because of the linear relationship, we can use the usual Ohm's law expressions for combining failure rates:

    R_series = R_1 + R_2 + R_3 + ...

and

    1/R_parallel = 1/R_1 + 1/R_2 + 1/R_3 + ...

These simple expressions can be used to hand-wave about the reliability of combinations of hosts. For instance, let us define the rate of failure to be a probability of failure, with a value between 0 and 1. Suppose we find that the rate of failure of a particular kind of server is 0.1. If we couple two in parallel (a double redundancy), then we obtain an effective failure rate of

    1/R = 1/0.1 + 1/0.1,

i.e. R = 0.05: the failure rate is halved. This estimate is clearly naive. It assumes, for instance, that both servers work all the time in parallel. This is seldom the case. If we run parallel servers, normally a default server will be tried first and only if there is no response will the second, backup server be contacted. Thus, in a fail-over model, this is not really applicable. Still, we use this picture for what it is worth, as a crude hand-waving tool.

The Mean Time Before Failure (MTBF) is used by electrical engineers, who find that its values for the failures of many similar components (say light bulbs) have an exponential distribution. In other words, over large numbers of similar component failures, it is found that the probability of failure has the form

    P(t) = exp(-t/τ),

i.e. the probability of a component lasting a time t is exponential, where τ is the mean time before failure and t is the failure time of a given component. There are many reasons why a computer system would not be expected to have this simple form. One is dependency. Computer systems are formed from many interacting components. The interactions with third-party components mean that the environmental factors are always different. Again, the issue of fail-over and service latencies arises, spoiling the simple independent-component picture. Mean time before failure doesn't mean anything unless we define the conditions under which the quantity was measured.
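To make the hand-waving concrete, here is a small sketch of my own that combines failure rates in series and in parallel using the Ohm's law analogy above, reproducing the two-servers-at-0.1 example. The other rates used are invented, and, as the text stresses, this is only a crude estimate, not a fail-over model.

```python
def series(*rates):
    """Series coupling (a dependency chain): failure rates simply add."""
    return sum(rates)

def parallel(*rates):
    """Parallel coupling (redundancy): reciprocals add, like resistors."""
    return 1.0 / sum(1.0 / r for r in rates)

if __name__ == "__main__":
    # Two redundant servers, each with failure rate 0.1, as in the text:
    print(parallel(0.1, 0.1))                  # 0.05 -- the failure rate is halved
    # A service depending on two components in series (invented rates):
    print(series(0.01, 0.02))                  # about 0.03
    # Naive estimate for a chain that ends in a redundant pair:
    print(series(0.01, parallel(0.1, 0.1)))    # about 0.06
```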
In one test at Oslo College, the following values were measured for various operating systems, averaged over several hosts of the same type:

    Solaris 2.5    86 days
    GNU/Linux      36 days
    Windows 95     0.5 days

While we might feel that these numbers agree with our general intuition of how these operating systems perform in practice, this is not a fair comparison, since the patterns of usage are different in each case. An insider could tell us that the users treat the PCs with a casual disregard, switching them on and off at will; and in spite of efforts to prevent it, the same users tend to pull the plug on GNU/Linux hosts also. The Solaris hosts, on the other hand, live in glass cages where prying fingers cannot reach. Of course, we then need to ask: what is the reason why users reboot and pull the plug on the PCs? The numbers above cannot have any meaning until this has been determined; i.e. the software components of a computer system are not atomic; they are composed of many parts whose behavior is difficult to catalogue. Thus the problem with these measures of system reliability is that they are almost impossible to quantify, and assigning any real meaning to them is fraught with subtlety. Unless the system fails regularly, the number of points over which it is possible to average is rather small. Moreover, the number of external factors which can lead to failure makes the comparison of any two values at different sites meaningless. In short, this quantity cannot be used for anything other than illustrative purposes. Changes in the reliability, for constant external conditions, can be used as a measure to show the effect of a single parameter from the environment. This is perhaps the only instance in which this can be made meaningful, i.e. as a means of quantitative comparison within a single experiment.

13.5.11 Metrics generally

The quantifiers which can be usefully measured or recorded on operating systems are the variables which can be used to provide quantitative support for or against a hypothesis about system behavior. System auditing functionality can be used to record just about every operation which passes through the kernel of an operating system, but most hosts do not perform system auditing because of the huge negative effect it has on performance. Here we consider only metrics which do not require extensive auditing beyond what is normally available.

Operating system metrics are normally used for operating system performance tuning. System performance tuning requires data about the efficiency of an operating system. This is not necessarily compatible with the kinds of measurement required for evaluating the effectiveness of a system administration model. System administration is concerned with maintaining resource availability over time in a secure and fair manner; it is not about optimizing specific performance criteria. Operating system metrics fall into two main classes: current values and average values, for stable and drifting variables respectively. Current (immediate) values are not usually directly useful, unless the values are basically constant, since they seldom accurately reflect any changing property of an operating system adequately. They can be used for fluctuation analysis, however, over some coarse-graining period. An averaging procedure over some time interval is the main approach of interest. The Nyquist law for the sampling of a continuous signal is that the sampling rate needs to be twice the rate of the fastest peak cycle in the data if one is to resolve the data accurately.
This includes data which are intended for averaging, since this rule is not about accuracy of resolution but about the possible complete loss of data. The granularity required for measurement in current operating systems is summarized in the following table.

    0-5 secs          Fine-grain work
    10-30 secs        For peak measurement
    10-30 mins        For coarse-grain work
    Hourly average    Software activity
    Daily average     User activity
    Weekly average    User activity

Although kernel switching times are of the order of microseconds, this time scale is not relevant to users' perceptions of the system. Inter-system cooperation requires many context-switch cycles and I/O waits. These compound themselves into intervals of the order of seconds in practice. Users themselves spend long periods of time idle, i.e. not interacting with the system on an immediate basis. An interval of seconds is therefore sufficient. Peaks of activity can happen quickly by users' perceptions, but they often last for protracted periods; thus ten to thirty seconds is appropriate here. Coarse-grained behavior requires lower resolution, but as long as one is looking for peaks, a faster rate of sampling will always include the lower rate. There is also the issue of how quickly the data can be collected. Since the measurement process itself affects the performance of the system and uses its resources, measurement needs to be kept to a level where it does not play a significant role in loading the system or consuming disk and memory resources.

The variables which characterize resource usage fall into various categories. Some variables are devoid of any apparent periodicity, while others are strongly periodic in the daily and weekly rhythms of the system. The amount of periodicity in a variable depends on how strongly it is coupled to a periodic driving force, such as the user community's daily and weekly rhythms, and also on how strong that driving force is (users' behavior also has seasonal variations, vacations and deadlines etc.). Since our aim is to find a sufficiently complete set of variables which characterize a macrostate of the system, we must be aware of which variables are ignorable, which variables are periodic (and can therefore be averaged over a periodic interval) and which variables are not periodic (and therefore have no unique average). Studies of total network traffic have shown an allegedly self-similar (fractal) structure to network traffic when viewed in its entirety [192, 324]. This is in contrast to telephonic voice traffic on traditional phone networks, which is bursty, the bursts following a random (Poisson) distribution in arrival time. This almost certainly precludes total network traffic from a characterization of host state, but it does not preclude the use of numbers of connections/conversations between different protocols, which one would still expect to have a Poissonian profile. A value of none means that any apparent peak is much smaller than the error bars (standard deviation of the mean) of the measurements when averaged over the presumed trial period. The periodic quantities are plotted on a periodic time scale, with each covering of the period adding to the averages and variances. Non-periodic data are plotted on a straightforward, unbounded real line as an absolute value. A running average can also be computed, and an entropy, if a suitable division of the vertical axis into cells is defined [42]. We shall return to the definition of entropy later.
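The following sketch, my own rather than the book's, shows one way to form such periodic averages: samples of a metric are folded onto a 24-hour time scale, accumulating a mean and a standard deviation for each hour-of-day cell, in the spirit of the daily-rhythm plots that follow (figures 13.1 to 13.3). The metric and the sampling details are invented for illustration.

```python
from collections import defaultdict
from math import sqrt

def daily_profile(samples):
    """Fold (timestamp_in_hours, value) samples onto a 24-hour period.

    Returns {hour: (mean, std)}, where std is the standard deviation of the
    samples that fell into that hour-of-day cell -- the 'error bars' of the
    daily-rhythm plots.
    """
    cells = defaultdict(list)
    for t_hours, value in samples:
        cells[int(t_hours) % 24].append(value)

    profile = {}
    for hour, values in sorted(cells.items()):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        profile[hour] = (mean, sqrt(var))
    return profile

if __name__ == "__main__":
    # Invented example: a login count sampled once per hour over three days,
    # busier during working hours, with some random fluctuation.
    import random
    random.seed(0)
    samples = [(h, (3 if 8 <= h % 24 <= 16 else 1) + random.random())
               for h in range(72)]
    for hour, (mean, std) in daily_profile(samples).items():
        print(f"{hour:02d}:00  mean={mean:.2f}  std={std:.2f}")
```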
The average type referred to below divides into two categories: pseudo-continuous and discrete. In point of fact, virtually all of the measurements made have discrete results (excepting only those which are already system averages). This categorization refers to the extent to which it is sensible to treat the average value of the variable as a continuous quantity. In some cases, it is utterly meaningless. For the reasons already indicated, there are advantages to treating measured values as continuous, so it is with this motivation that we claim a pseudo-continuity to the averaged data.

In this initial instance, the data are all collected from Oslo College's own computer network, which is an academic environment with moderate resources. One might expect our data to lie somewhere in the middle of the extreme cases which might be found amongst the sites of the world, but one should be cognizant of the limited validity of a single set of such data. We re-emphasize that the purpose of the present work is to gauge possibilities rather than to extract actualities.

Net

• Total number of packets: Characterizes the totality of traffic, incoming and outgoing, on the subnet. This could have a bearing on latencies and thus influence all hosts on a local subnet.

• Amount of IP fragmentation: This is a function of the protocols in use in the local environment. It should be fairly constant, unless packets are being fragmented for scurrilous reasons.

• Density of broadcast messages: This is a function of local network services. This would not be expected to have a direct bearing on the state of a host (other than the host transmitting the broadcast), unless it became so high as to cause a traffic problem.

• Number of collisions: This is a function of the network community traffic. Collision numbers can significantly affect the performance of hosts wishing to communicate, thus adding to latencies. It can be brought on by sheer amount of traffic, i.e. a threshold transition, and by errors in the physical network, or in software. In a well-configured site, the number of collisions should be random. A strong periodic signal would tend to indicate a burdened network with too low a capacity for its users.

• Number of sockets (TCP) in and out: This gives an indication of service usage. Measurements should be separated so as to distinguish incoming and outgoing connections. We would expect outgoing connections to follow the periodicities of the local site, whereas incoming connections would be a superposition of weak periodicities from many sites, with no net result. See figure 13.1.

• Number of malformed packets: This should be zero, i.e. a non-zero value here signifies a problem in some networked host, or an attack on the system.

Storage

• Disk usage in bytes: This indicates the actual amount of data generated and downloaded by users, or the system. Periodicities here will be affected by whatever policy one has for garbage collection. Assuming that users do not produce only garbage, there should be a periodicity superposed on top of a steady rise.

• Disk operations per second: This is an indication of the physical activity of the disk on the local host. It is a measure of load and a significant contribution to latency, both locally and for remote hosts. The level of periodicity in this signal must depend on the relative magnitude of the forces driving the host. If a host runs no network services, then it is driven mainly by users, yielding a strong periodicity. If system services dominate, these could be either random or periodic. The values are thus likely to be periodic, but not necessarily strong.
Figure 13.1: The daily rhythm of the external logins shows a strong unambiguous peak during work hours.

• Paging (out) rate (free memory and thrashing): These variables measure the activity of the virtual memory subsystem. In principle they can reveal problems with load. In our tests, they have proved singularly irrelevant, though we realize that we might be spoiled with the quality of our resources here. See figures 13.2 and 13.3.

Processes

• Number of privileged processes: The number of processes running the system provides an indication of the number of forked processes or active threads which are carrying out the work of the system. This should be relatively constant, with a weak periodicity indicating responses to local users' requests. This is separated from the processes of ordinary users, since one expects the behavior of privileged (root/Administrator) processes to follow a different pattern. See figure 13.4.

• Number of non-privileged processes: This measure counts not only the number of processes but provides an indication of the range of tasks being performed by users, and of the number of users by implication. This measure has a strong periodic quality: relatively quiescent during weekends, rising sharply on Monday to a peak on Tuesday, followed by a gradual decline towards the weekend again. See figures 13.5 and 13.6.

Figure 13.2: The daily rhythm of the paging data illustrates the problems one faces in attaching meaning directly to measurements. Here we see that the error bars (signifying the standard deviation) are much larger than the variation of the graph itself. Nonetheless, there is a marginal rise in the paging activity during daytime hours, and a corresponding increase in the error bars, indicating that there is a real effect, albeit of little analytical value.

• Maximum percentage CPU used in processes: This is an experimental measure which characterizes the most CPU-expensive process running on the host at a given moment. The significance of this result is not clear. It seems to have a marginally periodic behavior, but is basically inconclusive. The error bars are much larger than the variation of the average, but the magnitude of the errors also increases with the increasing average; thus, while for all intents and purposes this measure's average must be considered irrelevant, a weak signal can be surmised. The peak value of the data might be important, however, since a high max-CPU task will significantly load the system. See figure 13.7.

Users

• Number logged on: This follows the classic pattern of low activity during the weekends, followed by a sharp rise on Monday, peaking on Tuesday and declining steadily towards the weekend again.

• Total number: This value should clearly be constant except when new user accounts are added. The average value has no meaning, but any change in this value can be significant from a security perspective.
Figure 13.3: The weekly rhythm of the paging data shows that there is a definite daily rhythm, but again, it is drowned in the huge variances due to random influences on the system, and is therefore of no use in an analytical context.

• Average time spent logged on per user: Can signify patterns of behavior, but has a questionable relevance to the behavior of the system.

• Load average: This is the system's own back-of-the-envelope calculation of resource usage. It provides a continuous indication of load, but on an exaggerated scale. It remains to be seen whether any useful information can be obtained from this value; its value can be quite disordered (high entropy).

• Disk usage rise per session per user per hour: The average amount of increase of disk space per user per session indicates the way in which the system is becoming loaded. This can be used to diagnose problems caused by a single user downloading a huge amount of data from the network. During normal behavior, if users have an even productivity, this might be periodic.

• Latency of services: The latency is the amount of time we wait for an answer to a specific request. This value only becomes significant when the system passes a certain threshold (a kind of phase transition). Once latency begins to restrict the practices of users, we can expect it to feed back and exacerbate latencies. Thus the periodicity of latencies would only be expected in a phase of the system in which user activity was in competition with the cause of the latency itself.

Part of what one wishes to identify in looking at such variables is patterns of change. These are classifiable but not usually quantifiable. They can be relevant to policy decisions as well as to fine-tuning of the parameters of an automatic response. Patterns of behavior include [...]

[...] which are menial and repetitive. The core principles of system administration will remain the same, but the job description of the system manager will be rather different. In many ways, the day-to-day business of system administration consists of just a few recipes which slowly evolve over time. However, underneath the veneer of cookery, there is a depth of understanding about computer systems which has [...]

[...] administrative strategies. At some level, the development of a computer system is a problem in economics: it is a mixed game of opposition and cooperation between users and the system. The aims of the game are several: to win resources, to produce work, to gain control of the system, and so on. A proper understanding of the issues should lead to better software and better strategies from human administrators. For [...]

[...] value of the measurement. The sources of systematic error are often difficult to find, since they are often a result of misunderstandings, or of the specific behavior of the measuring apparatus. In a system with finite resources, the act of measurement itself leads to a change in the value of the quantity one is measuring. In order to measure the CPU usage of a computer system, for instance, we have to start [...]
Explain why problems with quite different causes often lead to the same symptoms.

Chapter 14
Summary and outlook

The aim of this book has been to present an overview of the field of system administration for active system administrators, university courses and computer scientists everywhere. For a long time, system administration has been passed on by word of mouth and has resisted formalization. Only in recent [...]

[...] systems fall into two categories, depending on how we choose our problem to analyze. These are called open systems and closed systems.

• Open system: This is a subsystem of some greater whole. An open system can be thought of as a black box which takes in input and generates output, i.e. it communicates with its environment. The names source and sink are traditionally used for the input and output routes. What happens in the black box depends on the state of the environment around it. The system is open because input changes the state of the system's internal variables, and output changes the state of the environment. Every piece of computer software is an open system. Even an isolated total computer system is an open system as long as any user is using [...]

[...] system administration. We are approaching a new generation of operating systems, with the capacity for self-analysis and self-correction. It is no longer a question of whether they will arrive, but of when they will arrive. When it happens, the nature of system administration will change. The day-to-day tasks of system administration change constantly and we pay these changes little attention. However, improvements [...]

[...] through the use of specific software, or through the interpretation of the measurements. The final and most insidious type of error is the systematic error. This is an error which runs throughout all of the data. It is a systematic shift in the true value of the data, in one direction, and thus it cannot be eliminated by averaging. A systematic error leads also to an error in the mean value of the measurement [...]

[...] over a length of time. Analysis of system behavior can sometimes benefit from knowing these periods, e.g. if one is trying to determine a causal relationship between one part of a system and another, it is sometimes possible to observe the signature of a process which is periodic and thus obtain direct evidence for its effect on another part of the system. Periods in data are in the realm of Fourier analysis.
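As a small illustration of the Fourier point above, the following sketch (my own, with invented data, assuming NumPy is available) looks for a daily period in an hourly sampled metric by taking a discrete Fourier transform and reporting the strongest non-constant frequency component.

```python
import numpy as np

def dominant_period(samples, sample_interval_hours=1.0):
    """Return the period (in hours) of the strongest non-constant
    frequency component of an evenly sampled series."""
    spectrum = np.abs(np.fft.rfft(samples - np.mean(samples)))
    freqs = np.fft.rfftfreq(len(samples), d=sample_interval_hours)
    k = np.argmax(spectrum[1:]) + 1      # skip the zero-frequency bin
    return 1.0 / freqs[k]

if __name__ == "__main__":
    # Invented signal: two weeks of hourly samples with a daily rhythm plus
    # noise, loosely mimicking the login rhythm of figure 13.1.
    rng = np.random.default_rng(0)
    t = np.arange(14 * 24)
    signal = 2.0 + np.sin(2 * np.pi * t / 24.0) + 0.3 * rng.standard_normal(t.size)
    print(dominant_period(signal))       # close to 24 hours
```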
