Classification Systems in Orthopaedics

Donald S. Garbuz, MD, MHSc, FRCSC, Bassam A. Masri, MD, FRCSC, John Esdaile, MD, MPH, FRCPC, and Clive P. Duncan, MD, FRCSC

J Am Acad Orthop Surg 2002;10:290-297

Dr. Garbuz is Assistant Professor, Department of Orthopaedics, University of British Columbia, Vancouver, BC, Canada. Dr. Masri is Associate Professor and Head, Division of Reconstructive Orthopaedics, University of British Columbia. Dr. Esdaile is Professor and Head, Division of Rheumatology, University of British Columbia. Dr. Duncan is Professor and Chairman, Department of Orthopaedics, University of British Columbia. Reprint requests: Dr. Garbuz, Laurel Pavilion, Third Floor, 910 West Tenth Avenue, Vancouver, BC, Canada V5Z 4E3. Copyright 2002 by the American Academy of Orthopaedic Surgeons.

Abstract

Classification systems help orthopaedic surgeons characterize a problem, suggest a potential prognosis, and offer guidance in determining the optimal treatment method for a particular condition. Classification systems also play a key role in the reporting of clinical and epidemiologic data, allowing uniform comparison and documentation of like conditions. A useful classification system is reliable and valid. Although the measurement of validity is often difficult and sometimes impractical, reliability, as summarized by intraobserver and interobserver reliability, is easy to measure and should serve as a minimum standard for validation. Reliability is measured by the kappa value, which distinguishes true agreement among observations from agreement due to chance alone. Some commonly used classifications of musculoskeletal conditions have not proved to be reliable when critically evaluated.

Classifications of musculoskeletal conditions have at least two central functions. First, accurate classification characterizes the nature of a problem and then guides treatment decision making, ultimately improving outcomes. Second, accurate classification establishes an expected outcome for the natural history of a condition or injury, thus forming a basis for uniform reporting of results for various surgical and nonsurgical treatments. This allows the comparison of results from different centers purportedly treating the same entity.

A successful classification system must be both reliable and valid. Reliability reflects the precision of a classification system; in general, it refers to interobserver reliability, the agreement between different observers. Intraobserver reliability is the agreement of one observer's repeated classifications of an entity. The validity of a classification system reflects the accuracy with which it describes the true pathologic process: a valid system correctly categorizes the attribute of interest and accurately describes the process that is actually occurring [1].

To measure or quantify validity, the classification of interest must be compared with some "gold standard." If the surgeon is classifying bone stock loss prior to revision hip arthroplasty, for example, the gold standard could be intraoperative assessment of bone loss. Validation of the classification system would then require a high correlation between the preoperative radiographs and the intraoperative findings. In this example, the radiographic findings would be considered "hard" data because different observers can confirm them. Intraoperative findings, on the other hand, would be considered "soft" data because independent confirmation of the intraoperative assessment is often impossible. This problem with the validation phase affects many commonly used classification systems that are based on radiographic criteria, and it introduces an element of observer bias into the validation process. Because of the difficulty of measuring validity, it is critical that classification systems have at least a high degree of reliability.

Assessment of Reliability

Classifications and measurements in general must be reliable to be assessed as valid. However, because confirming validity is difficult, many commonly used classification systems can be shown to be reliable yet not valid. On preoperative radiographs of a patient with a hip fracture, for example, two observers may categorize the fracture as Garden type 3. This measurement is reliable because of the interobserver agreement. However, if the intraoperative findings are of a Garden type 4 fracture, then the classification on radiographs, although reliable, is not valid (ie, it is inaccurate). A minimum criterion for the acceptance of any classification or measurement, therefore, is a high degree of both interobserver and intraobserver reliability. Once a classification system has been shown to have acceptable reliability, testing for validity is appropriate. If the degree of reliability is low, however, the classification system will have limited utility.

Initial efforts to measure reliability looked only at observed agreement, the percentage of times that different observers categorized their observations the same way. This concept is illustrated in Figure 1, a situation in which the two surgeons agree 70% of the time. In 1960, Cohen [2] introduced the kappa value (or kappa statistic) as a measure of the agreement that occurs above and beyond that attributable to chance alone. Today the kappa value and its variants are the most accepted methods of measuring observer agreement for categorical data.

Figure 1 demonstrates how the kappa value is used and how it differs from the simple measurement of observed agreement. In this hypothetical example, observed agreement is calculated as the percentage of times both surgeons agree whether fractures were displaced or nondisplaced; it does not take into account the fact that they may have agreed by chance alone. To calculate the percentage of chance agreement, it is assumed that each surgeon chooses a category independently of the other. The marginal totals are then used to calculate the agreement expected by chance alone; in Figure 1, this is 0.545. To calculate the kappa value, the observed agreement (Po) minus the chance agreement (Pc) is divided by the maximum possible agreement that is not related to chance (1 - Pc):

κ = (Po - Pc) / (1 - Pc)

Figure 1. Hypothetical example of agreement between two orthopaedic surgeons classifying radiographs of subcapital hip fractures.

                              Surgeon No. 2
    Surgeon No. 1      Displaced   Nondisplaced   Total
    Displaced              50           15          65
    Nondisplaced           15           20          35
    Total                  65           35         100

    Observed agreement (Po) = (50 + 20)/100 = 0.70
    Chance agreement (Pc) = (65/100)(65/100) + (35/100)(35/100) = 0.545
    Agreement beyond chance: κ = (0.70 - 0.545) / (1 - 0.545) = 0.34

This example is the simplest case: two observers and two categories. The kappa value can be used for multiple categories and multiple observers in a similar manner. In analyzing categorical data, which the kappa value is designed to measure, there will be cases in which disagreement between some categories does not have as profound an impact as disagreement between others.
For this reason, categorical data are divided into two types: nominal (unranked) data, in which all categorical differences are equally important, and ordinal (ranked) data, in which disagreement between some categories is more consequential than disagreement between others. An example of nominal data is eye color; an example of ordinal data is the AO classification, in which each subsequent class denotes an increase in the severity of the fracture.

The kappa value can be unweighted or weighted, depending on whether the data are nominal or ordinal. Unweighted kappa values should always be used with unranked data. When ordinal data are being analyzed, however, a decision must be made whether or not to weight the kappa value. Weighting has the advantage of giving some credit to partial agreement, whereas the unweighted kappa value treats all disagreements as equal. A good example of appropriate use of the weighted kappa value is the study by Kristiansen et al [3] of interobserver agreement in the Neer classification of proximal humeral fractures. This well-known classification has four categories of fractures, from nondisplaced or minimally displaced to four-part fractures. Weighting was appropriate in this case because disagreement between a two-part and a three-part fracture is not as serious as disagreement between a nondisplaced fracture and a four-part fracture. By weighting kappa values, one can account for the different levels of importance attached to different levels of disagreement. If a weighted kappa value is determined to be appropriate, the weighting scheme must be specified in advance, because the weights chosen will dramatically affect the kappa value. In addition, when reporting studies that have used a weighted kappa value, the weighting scheme must be documented clearly. One problem with weighting is that, without uniform weighting schemes, it is difficult to generalize across studies. A larger sample size will narrow the confidence interval around the kappa value, but it does not by itself change the value obtained.

Although the kappa value has become the most widely accepted method of measuring observer agreement, interpretation is difficult. Values range from −1.0 (complete disagreement) through 0.0 (chance agreement) to 1.0 (complete agreement). Hypothesis testing has limited usefulness when the kappa value is used because it allows the researcher to determine only whether the obtained agreement is significantly different from zero (chance agreement), revealing nothing about the extent of agreement. Consequently, when kappa values are obtained for assessing classifications of musculoskeletal conditions, hypothesis testing has almost no role. As Kraemer [4] stated, "It is insufficient to demonstrate merely the nonrandomness of diagnostic procedures; one requires assurance of substantial agreement between observations." This statement is equally applicable to classifications used in orthopaedics.
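The calculations in Figure 1, and the difference between unweighted and weighted kappa values, can be made concrete with a short Python sketch. The sketch below is illustrative only: the contingency table reproduces the hypothetical counts from Figure 1, the four-category ratings are invented for demonstration, and the weighted computation assumes the scikit-learn library is available.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score  # assumes scikit-learn is installed

def kappa_from_table(table):
    """Unweighted Cohen kappa from a square contingency table of counts."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_o = np.trace(t) / n                       # observed agreement (Po)
    marg1 = t.sum(axis=1) / n                   # marginal proportions, observer 1
    marg2 = t.sum(axis=0) / n                   # marginal proportions, observer 2
    p_c = float(np.sum(marg1 * marg2))          # agreement expected by chance (Pc)
    return p_o, p_c, (p_o - p_c) / (1.0 - p_c)

# Figure 1: two surgeons classifying 100 radiographs as displaced or nondisplaced
figure1 = [[50, 15],   # surgeon 1 "displaced":    surgeon 2 displaced, nondisplaced
           [15, 20]]   # surgeon 1 "nondisplaced": surgeon 2 displaced, nondisplaced
p_o, p_c, kappa = kappa_from_table(figure1)
print(f"Po = {p_o:.3f}, Pc = {p_c:.3f}, kappa = {kappa:.2f}")  # Po = 0.700, Pc = 0.545, kappa = 0.34

# Weighted vs unweighted kappa for ordinal data (hypothetical gradings 1-4,
# loosely in the spirit of a four-category fracture classification).
rater_a = [1, 2, 2, 3, 4, 1, 2, 3, 3, 4]
rater_b = [1, 2, 3, 3, 4, 2, 2, 4, 3, 4]
print(cohen_kappa_score(rater_a, rater_b))                     # all disagreements weigh equally
print(cohen_kappa_score(rater_a, rater_b, weights="linear"))   # partial credit for near misses
```

Whatever weighting scheme is used (linear, quadratic, or custom), it should be chosen in advance and reported alongside the result, for the reasons given above.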
To assess the strength of agreement associated with a given kappa value, two benchmarks have gained widespread use in orthopaedics and other branches of medicine. The most widely adopted criteria are those of Landis and Koch [5]: κ > 0.80, almost perfect; κ = 0.61 to 0.80, substantial; κ = 0.41 to 0.60, moderate; κ = 0.21 to 0.40, fair; κ = 0.00 to 0.20, slight; and κ < 0.00, poor. Although these criteria have gained widespread acceptance, the values were chosen arbitrarily and were never intended to serve as general benchmarks. The criteria of Svanholm et al [6], while less widely used, are more stringent than those of Landis and Koch and are perhaps more practical for use in medicine. Like Landis and Koch, Svanholm et al chose arbitrary values: κ ≥ 0.75, excellent; κ = 0.51 to 0.74, good; and κ ≤ 0.50, poor.

When reviewing reports of studies on the agreement of classification systems, readers should look at the actual kappa value and not just at the arbitrary categories described here. Although the interpretation of a given kappa value is difficult, it is clear that the higher the value, the more reliable the classification system. When interpreting a given kappa value, the effects of prevalence and bias must also be considered; Feinstein and Cicchetti [7,8] refer to these as the two paradoxes of high observed agreement with low kappa values.

Most important is the effect that prevalence (the base rate) can have on the kappa value. Prevalence refers to the number of times a given category is selected. In general, as the proportion of cases in one category approaches 0 or 100%, the kappa value will decrease for any given observed agreement. In Figure 2, the same two hypothetical orthopaedic surgeons as in Figure 1 review and categorize 100 different radiographs. The observed agreement is the same as in Figure 1, 0.70; however, the agreement beyond chance (the kappa value) is only 0.06. The difference between Figures 1 and 2 lies in the marginal totals, that is, the underlying prevalence of displaced and nondisplaced fractures, defined as the proportion of fractures in each category.

Figure 2. Hypothetical example of agreement between two orthopaedic surgeons classifying radiographs, with a higher prevalence of displaced fractures than in Figure 1.

                              Surgeon No. 2
    Surgeon No. 1      Displaced   Nondisplaced   Total
    Displaced              65           15          80
    Nondisplaced           15            5          20
    Total                  80           20         100

    Observed agreement (Po) = (65 + 5)/100 = 0.70
    Chance agreement (Pc) = (80/100)(80/100) + (20/100)(20/100) = 0.68
    Agreement beyond chance: κ = (0.70 - 0.68) / (1 - 0.68) = 0.06

If one category has a very high prevalence, there can be paradoxically high observed agreement yet low kappa values (although to some extent this is a consequence of the way chance agreement is calculated). The effect of prevalence on kappa values must be kept in mind when interpreting studies of observer variability, and the prevalence, observed agreement, and kappa values should all be clearly stated in any report on classification reliability. A study with a low kappa value and an extreme prevalence rate does not represent the same level of disagreement as a low kappa value in a sample with a balanced prevalence rate.

Bias (systematic difference between observers) is the second factor that can affect the kappa value, although its effect is smaller than that of prevalence. As bias increases, kappa values paradoxically increase, although this is usually seen only when kappa values are low. To assess the extent of bias in observer agreement studies, Byrt et al [9] have suggested measuring a bias index, but this has not been widely adopted.
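The prevalence paradox in Figure 2 is easy to reproduce. The sketch below recomputes κ for the two hypothetical tables, maps each value to the Landis and Koch bands listed above, and also reports prevalence and bias indices in the form commonly attributed to Byrt et al [9]; the index definitions, variable names, and output format are illustrative assumptions rather than part of the original article.

```python
import numpy as np

def kappa_2x2(table):
    """Observed agreement, chance agreement, and unweighted kappa for a 2 x 2 table of counts."""
    t = np.asarray(table, dtype=float)
    n = t.sum()
    p_o = np.trace(t) / n
    p_c = float(np.sum(t.sum(axis=1) * t.sum(axis=0))) / n**2
    return p_o, p_c, (p_o - p_c) / (1.0 - p_c)

def landis_koch(k):
    """Descriptive band for a kappa value, following the Landis and Koch criteria."""
    if k < 0.00:
        return "poor"
    if k <= 0.20:
        return "slight"
    if k <= 0.40:
        return "fair"
    if k <= 0.60:
        return "moderate"
    if k <= 0.80:
        return "substantial"
    return "almost perfect"

# Same observed agreement (0.70) in both tables; only the prevalence differs.
tables = {"Figure 1": [[50, 15], [15, 20]],
          "Figure 2": [[65, 15], [15, 5]]}

for name, tab in tables.items():
    (a, b), (c, d) = tab
    n = a + b + c + d
    p_o, p_c, k = kappa_2x2(tab)
    prevalence_index = abs(a - d) / n   # imbalance between the two categories
    bias_index = abs(b - c) / n         # systematic difference between the observers
    print(f"{name}: Po = {p_o:.2f}, Pc = {p_c:.3f}, kappa = {k:.2f} ({landis_koch(k)}), "
          f"prevalence index = {prevalence_index:.2f}, bias index = {bias_index:.2f}")
```

With identical observed agreement, the table with the more extreme prevalence yields a far lower κ, which is exactly the behavior described above.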
Although the kappa value, influenced as it is by prevalence and bias, measures agreement, it is not the only determinant of the precision of a classification system. Many other factors can affect observer agreement and disagreement.

Sources of Disagreement

As noted, any classification system must have a high degree of reliability, or precision. The degree of observer agreement obtained is affected by many factors in addition to the precision of the classification system itself. To improve reliability, these other sources of disagreement must be understood and minimized. Once this is done, the reliability of the classification system itself can be estimated accurately. Three sources of disagreement, or variability, have been described [1,10]: the clinician (the observer), the patient (the examined), and the procedure (the examination). Each of these can affect the reliability of classifications in clinical practice and in studies that examine classifications and their reliability.

Clinician variability arises from the process by which information is observed and interpreted. The information can be obtained from different sources, such as the history, the physical examination, or the radiographic examination. These raw data are often then converted into categories; Wright and Feinstein [1] called the criteria used to put the raw data into categories "conversion criteria." Disagreement can occur when the findings are observed or when they are organized into the arbitrary categories commonly used in classification systems. An example of variability in the observational process is the measurement of the center edge angle of Wiberg: inconsistent choice of the edge of the acetabulum will lead to variations in the measurements obtained (Fig. 3).

Because the categories in a classification system are defined by arbitrary criteria, an observer may make measurements that do not meet all of the criteria of any one category. The observer will then choose the closest matching category; another observer may disagree about which category is closest and choose a different one. Such variability in the use of conversion criteria is common and results from trying to convert the continuous spectrum of clinical data into arbitrary, finite categories.

The particular state being measured will also vary depending on when and how it is measured. This results in patient variability. A good example is the variation obtained when the degree of spondylolisthesis is measured with the patient standing compared with supine [11]. To minimize patient variability, examinations should be performed in a consistent, standardized fashion.

The final source of variability is the procedure itself. This often refers to technical aspects, such as the taking of a radiograph. If the exposures of two radiographs of the same patient's hip differ, for example, then classification of the degree of osteopenia, which depends on the degree of exposure, will differ as a result. Standardization of technique helps reduce this source of variability.

Figure 3. Anteroposterior radiograph of a dysplastic hip, showing the difficulty of defining the true margin of the acetabulum when measuring the center edge angle of Wiberg (solid lines). The apparent lateral edge of the acetabulum (arrow) is really a superimposition of the true anterior and posterior portions of the superior rim of the acetabulum. Inconsistent choice among observers may lead to errors in measurement.
These three sources of variation apply to all measurement processes. Variability is therefore not just a matter of improving the classification system itself; the system is only one of the elements that determine the reliability and utility of a classification. Understanding these sources of measurement variability, and how to minimize them, is critically important [1,10].

Assessment of Commonly Used Orthopaedic Classification Systems

Although many classification systems have been widely adopted and are frequently used in orthopaedic surgery to guide treatment decisions, few have been scientifically tested for their reliability. A high degree of reliability, or precision, should be a minimum requirement before any classification system is adopted. The results of several recent studies that have tested various orthopaedic classifications for their intraobserver and interobserver reliability are summarized in Table 1 [12-21].

In general, the reliability of the listed classification systems would be considered low and probably unacceptable. Despite this lack of reliability, these systems are commonly used. Although Table 1 lists only a limited number of systems, they were chosen because they have been subjected to reliability testing. Many other classification systems commonly cited in the literature have not been tested; consequently, there is no evidence that they are or are not reliable. In fact, most classification systems for medical conditions and injuries that have been tested have levels of agreement that are considered unacceptably low [22,23]. There is no reason to believe that the systems that have not been tested would fare any better.

Table 1. Intraobserver and interobserver agreement in orthopaedic classification systems. Kappa values are unweighted unless noted as weighted; observed agreement is given where reported.

- Brady et al [12]: periprosthetic femur fractures (Vancouver); assessors were reconstructive orthopaedic surgeons (including the originator) and residents. Intraobserver κ 0.73–0.83; interobserver κ 0.60–0.65.
- Campbell et al [13]: acetabular bone defect in revision total hip, AAOS system [26]; reconstructive orthopaedic surgeons, including the originators. Intraobserver κ 0.05–0.75; interobserver κ 0.11–0.28.
- Campbell et al [13]: acetabular bone defect in revision total hip, Gross system [24]; same assessors. Intraobserver κ 0.33–0.55; interobserver κ 0.19–0.62.
- Campbell et al [13]: acetabular bone defect in revision total hip, Paprosky system [25]; same assessors. Intraobserver κ 0.27–0.60; interobserver κ 0.17–0.41.
- Ward et al [14]: congenital hip dislocation (Severin); pediatric orthopaedic surgeons. Intraobserver observed agreement 45%–61%, κ 0.20–0.44 (weighted 0.32–0.59); interobserver observed agreement 14%–61%, κ −0.01 to 0.42 (weighted 0.05–0.55).
- Kreder et al [15]: distal radius (AO); attending surgeons, fellows, residents, and nonclinicians. Intraobserver κ 0.25–0.42; interobserver κ 0.33.
- Sidor et al [16]: proximal humerus (Neer); shoulder surgeon, radiologist, and residents. Intraobserver observed agreement 62%–86%, κ 0.50–0.83; interobserver κ 0.43–0.58.
- Siebenrock et al [17]: proximal humerus (Neer); shoulder surgeons. Intraobserver κ 0.46–0.71 (weighted); interobserver κ 0.25–0.51 (weighted).
- Siebenrock et al [17]: proximal humerus (AO/ASIF); shoulder surgeons. Intraobserver κ 0.43–0.54 (weighted); interobserver κ 0.36–0.49 (weighted).
- McCaskie et al [18]: quality of cement grade in THA; experts in THA, consultants, and residents. Intraobserver κ 0.07–0.63; interobserver κ −0.04.
- Lenke et al [19]: scoliosis (King); spine surgeons. Intraobserver observed agreement 56%–85%, κ 0.34–0.95; interobserver observed agreement 55%, κ 0.21–0.63.
- Cummings et al [20]: scoliosis (King); pediatric orthopaedic surgeons, spine surgeons, and residents. Intraobserver κ 0.44–0.72; interobserver κ 0.44.
- Haddad et al [21]: femoral bone defect in revision total hip (AAOS [30], Mallory [28], Paprosky et al [29]); reconstructive orthopaedic surgeons. Intraobserver κ 0.43–0.62; interobserver κ 0.12–0.29.

Four of the studies listed in Table 1 are discussed in detail here to highlight the methodology that should be used to assess the reliability of any classification system: the AO classification of distal radius fractures [15], the classification of acetabular bone defects in revision hip arthroplasty [13], the Severin classification of congenital dislocation of the hip [14], and the Vancouver classification of periprosthetic fractures of the femur [12].

Kreder et al [15] assessed the reliability of the AO classification of distal radius fractures. This classification divides fractures into three types based on whether the fracture is extra-articular (type A), partial articular (type B), or complete articular (type C). These fracture types are then divided into groups, which are further divided into subgroups, for 27 possible combinations. Thirty radiographs of distal radial fractures were presented to observers on two occasions. Before the radiographs were classified, a 30-minute review of the AO classification was conducted, and assessors were given a handout that they were encouraged to use when classifying the fractures. There were 36 observers in all, including attending surgeons, clinical fellows, residents, and nonclinicians; these groups were chosen to ascertain whether the type of observer influenced the reliability of the classification. An unweighted kappa value was used.
The authors evaluated intraobserver and interobserver reliability for AO type, AO group, and AO subgroup, using the criteria of Landis and Koch [5] to grade the levels of agreement. Interobserver agreement was highest for the initial AO type and decreased for groups and subgroups as the number of categories increased. This should be expected: as the number of categories increases, there is more opportunity for disagreement. Intraobserver agreement showed similar results. Kappa values for AO type ranged from 0.67 for residents to 0.86 for attending surgeons. Again, with the more detailed AO subgroups, kappa values decreased progressively; when all 27 categories were included, kappa values ranged from 0.25 to 0.42. The conclusion of this study was that use of the AO types A, B, and C alone produced levels of reliability that were high and acceptable, whereas subclassification into groups and subgroups was unreliable. The clinical utility of using only the three types was not addressed and awaits further study.

Several aspects of this study beyond its results merit mention. The study showed that not only the classification system is tested but also the observer. For any classification system tested, it is important to document the observers' experience, because this can substantially affect reliability. One omission in this study [15] was the lack of discussion of observed agreement and of the prevalence of fracture categories; as discussed above, these factors have a distinct effect on observer variability.

Campbell et al [13] examined the reliability of acetabular bone defect classifications in revision hip arthroplasty. One group of observers included the originators of the classification systems. This is the ultimate way to remove observer bias; however, it limits generalizability, because the originators would be expected to have unusually high levels of reliability. In this study, preoperative radiographs of 33 hips were shown to three different groups of observers on two occasions at least 2 weeks apart. The groups were the three originators, three reconstructive orthopaedic surgeons, and three senior residents. The three classifications assessed were those attributed to Gross [24], Paprosky [25], and the American Academy of Orthopaedic Surgeons [26]. The unweighted kappa value was used to assess the level of agreement.

As expected, the originators had higher levels of intraobserver agreement than did the other two observer groups (AAOS, 0.57; Gross, 0.59; Paprosky, 0.75). However, levels of agreement fell markedly when the systems were tested by surgeons other than the originators. This study underscores the importance of the qualifications of the observers in studies that measure reliability. To test the classification system itself, experts are the optimal initial choice, as was the case in this study [13]. However, even if the originators achieve acceptable agreement, this result should not be generalized: because most classification systems are developed for widespread use, reliability must be high among all observers for a system to have clinical utility. Hence, although the originators of the classifications of femoral bone loss were not included in a similar study [21] at the same center, the conclusions of that study remain valuable with respect to the reliability of femoral bone loss classifications in the hands of orthopaedic surgeons other than the originators.

Ward et al [14] evaluated the Severin classification, which is used to assess the radiographic appearance of the hip after treatment for congenital dislocation. This system has six main categories, ranging from normal to recurrent dislocation, and is reported to be a prognostic indicator.
Despite its widespread acceptance, the Severin classification was not tested for reliability until 1997. The authors made every effort to test only the classification system by minimizing other potential sources of disagreement. All identifying markers were removed from 56 radiographs of hips treated by open reduction. Four fellowship-trained pediatric orthopaedic surgeons who routinely treated congenital dislocation of the hip independently rated the radiographs. Before classifying the hips, the observers were given a detailed description of the Severin classification. Eight weeks later, three of the observers repeated the classification exercise, with the radiographs presented in a different order to minimize recall bias. Both weighted and unweighted kappa values were calculated. Observed agreement was also calculated and reported, so that any instance of high observed agreement with a low kappa value would be apparent. The kappa values, whether weighted or unweighted, were low, usually less than 0.50. The authors used the arbitrary criteria of Svanholm et al [6] to grade agreement and concluded that this classification scheme is unreliable and should not be widely used. This study demonstrates the methodology that should be used when testing classification systems: it eliminated other sources of disagreement and focused on the precision of the classification system itself.

The Vancouver classification of periprosthetic femur fractures is an example of a system that was tested for reliability prior to its widespread adoption and use [12]. The first description was published in 1995 [27], and testing of the reliability and validity of the system began shortly afterward. The methodology was similar to that described for the three previous studies. Reliability was acceptable for the three experienced reconstructive orthopaedic surgeons tested, including the originator. To assess generalizability, three senior residents were also assessed for intraobserver and interobserver reliability; the kappa values for this group were nearly identical to those of the three expert surgeons. The study confirmed that the Vancouver classification is both reliable and valid. With these two criteria met, the system can be recommended for widespread use and can subsequently be assessed for its value in guiding treatment and outlining prognosis.

Summary

Classification systems are tools for identifying injury patterns, assessing prognosis, and guiding treatment decisions. Many classification systems have been published and widely adopted in orthopaedics without information on their reliability. A classification system should consistently produce the same results; at a minimum, it should have a high degree of intraobserver and interobserver reliability. Few systems have been tested for this reliability, and those that have been tested generally fall short of acceptable levels. Because most classification systems have poor reliability, their use to differentiate treatments and suggest outcomes is not warranted, and a system that has not been tested cannot be assumed to be reliable. The systems used by orthopaedic surgeons must be tested for reliability, and if a system is not found to be reliable, it should be modified or its use seriously questioned. Improving reliability involves looking at many components of the classification process [1].
Methodologies exist to assess classifications, with the kappa value the standard for measuring observer reliability. Once a system is found to be reliable, the next step is to prove its utility. Only when a system has been shown to be reliable should it be widely adopted by the medical community. This should not be construed to mean that untested classification systems, or those with disappointing reliability, are without value. Systems are needed to categorize or define surgical problems before surgery in order to plan appropriate approaches and techniques, and classification systems provide a discipline to help define pathology as well as a language to describe it. However, it is necessary to recognize the limitations of existing classification systems and the need to confirm or refine proposed preoperative categories by careful intraoperative observation of the actual findings. Furthermore, submitting classification systems to statistical analysis highlights their inherent flaws and lays the groundwork for their improvement.

References

1. Wright JG, Feinstein AR: Improving the reliability of orthopaedic measurements. J Bone Joint Surg Br 1992;74:287-291.
2. Cohen J: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960;20:37-46.
3. Kristiansen B, Andersen UL, Olsen CA, Varmarken JE: The Neer classification of fractures of the proximal humerus: An assessment of interobserver variation. Skeletal Radiol 1988;17:420-422.
4. Kraemer HC: Extension of the kappa coefficient. Biometrics 1980;36:207-216.
5. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics 1977;33:159-174.
6. Svanholm H, Starklint H, Gundersen HJ, Fabricius J, Barlebo H, Olsen S: Reproducibility of histomorphologic diagnoses with special reference to the kappa statistic. APMIS 1989;97:689-698.
7. Feinstein AR, Cicchetti DV: High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543-549.
8. Cicchetti DV, Feinstein AR: High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol 1990;43:551-558.
9. Byrt T, Bishop J, Carlin JB: Bias, prevalence and kappa. J Clin Epidemiol 1993;46:423-429.
10. Clinical disagreement: I. How often it occurs and why. Can Med Assoc J 1980;123:499-504.
11. Lowe RW, Hayes TD, Kaye J, Bagg RJ, Luekens CA: Standing roentgenograms in spondylolisthesis. Clin Orthop 1976;117:80-84.
12. Brady OH, Garbuz DS, Masri BA, Duncan CP: The reliability and validity of the Vancouver classification of femoral fractures after hip replacement. J Arthroplasty 2000;15:59-62.
13. Campbell DG, Garbuz DS, Masri BA, Duncan CP: Reliability of acetabular bone defect classification systems in revision total hip arthroplasty. J Arthroplasty 2001;16:83-86.
14. Ward WT, Vogt M, Grudziak JS, Tumer Y, Cook PC, Fitch RD: Severin classification system for evaluation of the results of operative treatment of congenital dislocation of the hip: A study of intraobserver and interobserver reliability. J Bone Joint Surg Am 1997;79:656-663.
15. Kreder HJ, Hanel DP, McKee M, Jupiter J, McGillivary G, Swiontkowski MF: Consistency of AO fracture classification for the distal radius. J Bone Joint Surg Br 1996;78:726-731.
16. Sidor ML, Zuckerman JD, Lyon T, Koval K, Cuomo F, Schoenberg N: The Neer classification system for proximal humeral fractures: An assessment of interobserver reliability and intraobserver reproducibility. J Bone Joint Surg Am 1993;75:1745-1750.
17. Siebenrock KA, Gerber C: The reproducibility of classification of fractures of the proximal end of the humerus. J Bone Joint Surg Am 1993;75:1751-1755.
18. McCaskie AW, Brown AR, Thompson JR, Gregg PJ: Radiological evaluation of the interfaces after cemented total hip replacement: Interobserver and intraobserver agreement. J Bone Joint Surg Br 1996;78:191-194.
19. Lenke LG, Betz RR, Bridwell KH, et al: Intraobserver and interobserver reliability of the classification of thoracic adolescent idiopathic scoliosis. J Bone Joint Surg Am 1998;80:1097-1106.
20. Cummings RJ, Loveless EA, Campbell J, Samelson S, Mazur JM: Interobserver reliability and intraobserver reproducibility of the system of King et al. for the classification of adolescent idiopathic scoliosis. J Bone Joint Surg Am 1998;80:1107-1111.
21. Haddad FS, Masri BA, Garbuz DS, Duncan CP: Femoral bone loss in total hip arthroplasty: Classification and preoperative planning. J Bone Joint Surg Am 1999;81:1483-1498.
22. Koran LM: The reliability of clinical methods, data and judgments (first of two parts). N Engl J Med 1975;293:642-646.
23. Koran LM: The reliability of clinical methods, data and judgments (second of two parts). N Engl J Med 1975;293:695-701.
24. Garbuz D, Morsi E, Mohamed N, Gross AE: Classification and reconstruction in revision acetabular arthroplasty with bone stock deficiency. Clin Orthop 1996;324:98-107.
25. Paprosky WG, Perona PG, Lawrence JM: Acetabular defect classification and surgical reconstruction in revision arthroplasty: A 6-year follow-up evaluation. J Arthroplasty 1994;9:33-44.
26. D'Antonio JA, Capello WN, Borden LS: Classification and management of acetabular abnormalities in total hip arthroplasty. Clin Orthop 1989;243:126-137.
27. Duncan CP, Masri BA: Fractures of the femur after hip replacement. Instr Course Lect 1995;44:293-304.
28. Mallory TH: Preparation of the proximal femur in cementless total hip revision. Clin Orthop 1988;235:47-60.
29. Paprosky WG, Lawrence J, Cameron H: Femoral defect classification: Clinical application. Orthop Rev 1990;19(suppl 9):9-15.
30. D'Antonio J, McCarthy JC, Bargar WL, et al: Classification of femoral abnormalities in total hip arthroplasty. Clin Orthop 1993;296:133-139.
