Statistical Methods in Medical Research - Part 3

Comparison of two counts

Suppose that $x_1$ is a count which can be assumed to follow a Poisson distribution with mean $\mu_1$. Similarly, let $x_2$ be a count independently following a Poisson distribution with mean $\mu_2$. How might we test the null hypothesis that $\mu_1 = \mu_2$?

One approach is to use the fact that the variance of $x_1 - x_2$ is $\mu_1 + \mu_2$ (by virtue of (3.19) and (4.9)). The best estimate of $\mu_1 + \mu_2$ on the basis of the available information is $x_1 + x_2$. On the null hypothesis, $E(x_1 - x_2) = \mu_1 - \mu_2 = 0$, and $x_1 - x_2$ can be taken to be approximately normally distributed unless $\mu_1$ and $\mu_2$ are very small. Hence

$$ z = \frac{x_1 - x_2}{\sqrt{x_1 + x_2}} \qquad (5.7) $$

can be taken as approximately a standardized normal deviate.

A second approach has already been indicated in the test for the comparison of proportions in paired samples (§4.5). Of the total frequency $x_1 + x_2$, a portion $x_1$ is observed in the first sample. Writing $r = x_1$ and $n = x_1 + x_2$ in (4.17), we have

$$ z = \frac{x_1 - \tfrac{1}{2}(x_1 + x_2)}{\tfrac{1}{2}\sqrt{x_1 + x_2}} = \frac{x_1 - x_2}{\sqrt{x_1 + x_2}}, $$

as in (5.7). The two approaches thus lead to exactly the same test procedure.

A third approach uses a rather different application of the $\chi^2$ test from that described for the $2 \times 2$ table in §4.5, the total frequency $x_1 + x_2$ now being divided into two components rather than four. Corresponding to each observed frequency we can consider the expected frequency, on the null hypothesis, to be $\tfrac{1}{2}(x_1 + x_2)$:

              Observed   Expected
First count   x_1        (x_1 + x_2)/2
Second count  x_2        (x_1 + x_2)/2

Applying the usual formula (4.30) for a $\chi^2$ statistic, we have

$$ X^2 = \frac{[x_1 - \tfrac{1}{2}(x_1+x_2)]^2}{\tfrac{1}{2}(x_1+x_2)} + \frac{[x_2 - \tfrac{1}{2}(x_1+x_2)]^2}{\tfrac{1}{2}(x_1+x_2)} = \frac{(x_1 - x_2)^2}{x_1 + x_2}. \qquad (5.8) $$

As for (4.30), $X^2$ follows the $\chi^2_{(1)}$ distribution, which we already know to be the distribution of the square of a standardized normal deviate. It is therefore not surprising that $X^2$ given by (5.8) is precisely the square of $z$ given by (5.7). The third approach is thus equivalent to the other two, and forms a particularly useful method of computation, since no square root is involved in (5.8).
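The equivalence of the three test procedures is easy to check numerically. The following sketch (Python with SciPy; the function name and layout are ours, not the book's) computes $z$ from (5.7) and $X^2$ from (5.8) and confirms that $X^2 = z^2$, so the normal and $\chi^2$ P values agree.

```python
import math
from scipy.stats import norm, chi2

def compare_counts(x1, x2):
    """Test of H0: mu1 = mu2 for two independent Poisson counts."""
    z = (x1 - x2) / math.sqrt(x1 + x2)        # equation (5.7)
    x_sq = (x1 - x2) ** 2 / (x1 + x2)         # equation (5.8)
    assert abs(x_sq - z ** 2) < 1e-12         # (5.8) is the square of (5.7)
    p_normal = 2 * norm.sf(abs(z))            # two-sided normal test
    p_chisq = chi2.sf(x_sq, df=1)             # identical chi-squared P value
    return z, x_sq, p_normal, p_chisq

print(compare_counts(13, 31))   # the colony counts used in Example 5.4 below
```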
Consider now an estimation problem: what can be said about the ratio $\mu_1/\mu_2$? The second approach described above can be generalized, when the null hypothesis is not necessarily true, by saying that $x_1$ follows a binomial distribution with parameters $x_1 + x_2$ (the $n$ of §3.6) and $\mu_1/(\mu_1 + \mu_2)$ (the $\pi$ of §3.6). The methods of §4.4 thus provide confidence limits for $\pi = \mu_1/(\mu_1 + \mu_2)$, and hence for $\mu_1/\mu_2$, which is merely $\pi/(1 - \pi)$. The method is illustrated in Example 5.4.

The difference $\mu_1 - \mu_2$ is estimated by $x_1 - x_2$, and the usual normal theory can be applied as an approximation, with the standard error of $x_1 - x_2$ estimated, as in (5.7), by $\sqrt{x_1 + x_2}$.

Example 5.4
Equal volumes of two bacterial cultures are spread on nutrient media and, after incubation, the numbers of colonies growing on the two plates are 13 and 31. We require confidence limits for the ratio of concentrations of the two cultures.

The estimated ratio is $13/31 = 0.4194$. From the Geigy tables, a binomial sample with 13 successes out of 44 provides the following 95% confidence limits for $\pi$: 0.1676 and 0.4520. Calculating $\pi/(1-\pi)$ for each of these limits gives the following 95% confidence limits for $\mu_1/\mu_2$:

$$ 0.1676/0.8324 = 0.2013 \quad\text{and}\quad 0.4520/0.5480 = 0.8248. $$

The mid-$P$ limits for $\pi$, calculated exactly as described in §4.4, are 0.1752 and 0.4418, leading to mid-$P$ limits for $\mu_1/\mu_2$ of 0.2124 and 0.7915. The normal approximations described in §4.4 can, of course, be used when the frequencies are not too small.
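In place of published tables, exact (Clopper–Pearson) binomial limits can be computed from beta quantiles and then transformed by $\pi/(1-\pi)$. A sketch, assuming SciPy is available; whether these reproduce the Geigy figures to all four decimals depends on the convention used in those tables.

```python
from scipy.stats import beta

def poisson_ratio_ci(x1, x2, conf=0.95):
    """Confidence limits for mu1/mu2 from two Poisson counts, using the
    conditional binomial argument of the text (x1 successes out of x1+x2)."""
    n = x1 + x2
    a = (1 - conf) / 2
    pi_lo = beta.ppf(a, x1, n - x1 + 1)        # exact lower limit for pi
    pi_hi = beta.ppf(1 - a, x1 + 1, n - x1)    # exact upper limit for pi
    return pi_lo / (1 - pi_lo), pi_hi / (1 - pi_hi)   # mu1/mu2 = pi/(1 - pi)

print(poisson_ratio_ci(13, 31))   # close to the limits 0.2013 and 0.8248 of Example 5.4
```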
Example 5.5
Just as the distribution of a proportion, when $n$ is large and $\pi$ is small, is well approximated by assuming that the number of successes, $r$, follows a Poisson distribution, so a comparison of two proportions under these conditions can be effected by the methods of this section. Suppose, for example, that in a group of 1000 men observed during a particular year, 20 incurred a certain disease, whereas in a second group of 500 men, four cases occurred. Is there a significant difference between these proportions?

This question could be answered by the methods of §4.5. As an approximation, we could compare the observed proportion of cases falling into group 2, $p = 4/24$, with the theoretical proportion $\pi = 500/1500 = 0.3333$. The equivalent $\chi^2$ test would run as follows:

                 Group 1                 Group 2                Total
Observed cases   20                      4                      24
Expected cases   (1000/1500)(24) = 16    (500/1500)(24) = 8     24

With continuity correction,

$$ X_c^2 = (3\tfrac{1}{2})^2/16 + (3\tfrac{1}{2})^2/8 = 0.766 + 1.531 = 2.30 \quad (P = 0.13). $$

The difference is not significant. Without the continuity correction, $X^2 = 3.00$ ($P = 0.083$).

If the full analysis for the $2 \times 2$ table is written out, it will become clear that this abbreviated analysis differs from the full version in omitting the contributions to $X^2$ from the non-affected individuals. Since these are much more numerous than the cases, their contributions to $X^2$ have large denominators and are therefore negligible in comparison with the terms used above. This makes it clear that the short method described here must be used only when the proportions concerned are very small.

Example 5.6
Consider a slightly different version of Example 5.5. Suppose that the first set of 20 cases occurred during the follow-up of a large group of men for a total of 1000 man-years, whilst the second set of four cases occurred amongst another large group followed for 500 man-years. Different men may have different risks of disease but, under the assumptions that each man has a constant risk during his period of observation and that the lengths of follow-up are unrelated to the individual risks, the number of cases in each group will approximately follow a Poisson distribution. As a test of the null hypothesis that the mean risks per unit time in the two groups are equal, the $\chi^2$ test shown in Example 5.5 may be applied.

Note, though, that a significant difference may be due to failure of the assumptions. One possibility is that the risk varies with time, and that the observations for one group are concentrated more heavily at the times of high risk than is the case for the other group; an example would be the comparison of infant deaths, where one group might be observed for a shorter period after birth, when the risk is high. Another possibility is that lengths of follow-up are related to individual risk. Suppose, for example, that individuals with high risk were observed for longer periods than those with low risk; the effect would be to increase the expected number of cases in that group. Further methods for analysing follow-up data are described in Chapter 17.
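The abbreviated test of Examples 5.5 and 5.6 is simple to program. A sketch (plain Python; the names are ours) reproducing the corrected and uncorrected statistics:

```python
def short_chi2(observed, expected, continuity=True):
    """Chi-squared over the 'cases' cells only, as in Examples 5.5 and 5.6;
    valid only when the proportions concerned are very small."""
    total = 0.0
    for o, e in zip(observed, expected):
        d = abs(o - e) - (0.5 if continuity else 0.0)   # Yates correction
        total += d * d / e
    return total

obs, exp = [20, 4], [16.0, 8.0]                 # observed and expected cases
print(short_chi2(obs, exp))                     # 2.30, with continuity correction
print(short_chi2(obs, exp, continuity=False))   # 3.00, without
```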
5.3 Ratios and other functions

We saw in §4.2 that inferences about a population mean are conveniently made by using the standard error of the sample mean. In §§4.4 and 5.2, approximate methods for proportions and counts made use of the appropriate standard errors, invoking the normal approximations to the sampling distributions. Similar normal approximations are widely used in other situations, and it is therefore useful to obtain formulae for standard errors (or, equivalently, their squares, the sampling variances) for various other statistics.

Many situations involve functions of one or more simple statistics, such as means or proportions. We have already, in (4.9), given a general formula for the variance of a difference between two independent random variables, and applied it, in §§4.3, 4.5 and 5.2, to comparisons of means, proportions and counts. In the present section we give some other useful formulae for the variances of functions of independent random variables.

Two random variables are said to be independent if the distribution of one is unaffected by the value taken by the other. One important consequence of independence is that mean values can be multiplied. That is, if $x_1$ and $x_2$ are independent and $y = x_1 x_2$, then

$$ E(y) = E(x_1)E(x_2). \qquad (5.9) $$

Linear function
Suppose $x_1, x_2, \ldots, x_k$ are independent random variables and $y = a_1 x_1 + a_2 x_2 + \cdots + a_k x_k$, the $a$s being constants. Then

$$ \mathrm{var}(y) = a_1^2\,\mathrm{var}(x_1) + a_2^2\,\mathrm{var}(x_2) + \cdots + a_k^2\,\mathrm{var}(x_k). \qquad (5.10) $$

The result (4.9) is a particular case of (5.10) with $k = 2$, $a_1 = 1$ and $a_2 = -1$. The independence condition is important. If the $x$s are not independent, there must be added to the right-hand side of (5.10) a series of terms of the form

$$ 2 a_i a_j\,\mathrm{cov}(x_i, x_j), \qquad (5.11) $$

where 'cov' stands for the covariance of $x_i$ and $x_j$, which is defined by

$$ \mathrm{cov}(x_i, x_j) = E\{[x_i - E(x_i)][x_j - E(x_j)]\}. $$

The covariance is the expectation of the product of deviations of two random variables from their means. When the variables are independent, the covariance is zero. When all $k$ variables are independent, all the covariance terms vanish and we are left with (5.10).

Ratio
In §5.1 we discussed the ratio of two variance estimates and (at least for normally distributed data) were able to use specific methods based on the $F$ distribution. In §5.2 we noted that the ratio of two counts could be treated by using results established for the binomial distribution. In general, though, exact methods for ratios are not available, and recourse has to be made to normal approximations.

Let $y = x_1/x_2$, where again $x_1$ and $x_2$ are independent. No general formula can be given for the variance of $y$; indeed, it may be infinite. However, if $x_2$ has a small coefficient of variation, the distribution of $y$ will be rather similar to a distribution with a variance given by the following formula:

$$ \mathrm{var}(y) = \frac{\mathrm{var}(x_1)}{[E(x_2)]^2} + \frac{[E(x_1)]^2}{[E(x_2)]^4}\,\mathrm{var}(x_2). \qquad (5.12) $$

Note that if $x_2$ has no variability at all, (5.12) reduces to $\mathrm{var}(y) = \mathrm{var}(x_1)/x_2^2$, which is an exact result when $x_2$ is a constant.

Approximate confidence limits for a ratio may be obtained from (5.12), with the usual multiplying factors for $\mathrm{SE}(y)$ $[= \sqrt{\mathrm{var}(y)}]$ based on the normal distribution. However, if $x_1$ and $x_2$ are normally distributed, an exact expression for confidence limits is given by Fieller's theorem (Fieller, 1940). This covers a rather more general situation, in which $x_1$ and $x_2$ may be dependent, with a non-zero covariance.

We suppose that $x_1$ and $x_2$ are normally distributed with variances and a covariance which are known multiples of some unknown parameter $\sigma^2$, and that $\sigma^2$ is estimated by a statistic $s^2$ on $f$ DF. Define $E(x_1) = \mu_1$, $E(x_2) = \mu_2$, $\mathrm{var}(x_1) = v_{11}\sigma^2$, $\mathrm{var}(x_2) = v_{22}\sigma^2$ and $\mathrm{cov}(x_1, x_2) = v_{12}\sigma^2$. Denote the unknown ratio $\mu_1/\mu_2$ by $\rho$, so that $\mu_1 = \rho\mu_2$. It then follows that the quantity $z = x_1 - \rho x_2$ is distributed as $N[0, (v_{11} - 2\rho v_{12} + \rho^2 v_{22})\sigma^2]$, and so the ratio

$$ T = \frac{x_1 - \rho x_2}{s\sqrt{v_{11} - 2\rho v_{12} + \rho^2 v_{22}}} \qquad (5.13) $$

follows a $t$ distribution on $f$ DF. Hence the probability is $1 - \alpha$ that $-t_{f,\alpha} < T < t_{f,\alpha}$ or, equivalently,

$$ T^2 < t_{f,\alpha}^2. \qquad (5.14) $$

Substitution of (5.13) in (5.14) gives a quadratic inequality for $\rho$, leading to $100(1-\alpha)\%$ confidence limits for $\rho$ given by

$$ \rho_L,\ \rho_U = \frac{\displaystyle y - \frac{g v_{12}}{v_{22}} \pm \frac{t_{f,\alpha}\, s}{x_2}\left[ v_{11} - 2y v_{12} + y^2 v_{22} - g\left(v_{11} - \frac{v_{12}^2}{v_{22}}\right) \right]^{1/2}}{1 - g} \qquad (5.15) $$

(with $y = x_1/x_2$), where

$$ g = \frac{t_{f,\alpha}^2\, s^2\, v_{22}}{x_2^2} \qquad (5.16) $$

and $[\ ]^{1/2}$ indicates a square root.
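A direct transcription of (5.15) and (5.16) into code (Python with SciPy; the argument names mirror the notation above and are otherwise our own). The case $g \ge 1$, discussed next, yields no finite interval and is returned as None here.

```python
import math
from scipy.stats import t as t_dist

def fieller_limits(x1, x2, v11, v22, v12, s2, f, alpha=0.05):
    """Fieller confidence limits for rho = mu1/mu2, equation (5.15).
    var(x1) = v11*sigma^2, var(x2) = v22*sigma^2, cov = v12*sigma^2;
    s2 estimates sigma^2 on f degrees of freedom."""
    y = x1 / x2
    t = t_dist.ppf(1 - alpha / 2, f)
    g = t ** 2 * s2 * v22 / x2 ** 2                      # equation (5.16)
    if g >= 1:
        return None                # x2 not significantly different from zero
    disc = v11 - 2 * y * v12 + y ** 2 * v22 - g * (v11 - v12 ** 2 / v22)
    half_width = (t * math.sqrt(s2) / x2) * math.sqrt(disc)
    centre = y - g * v12 / v22
    return (centre - half_width) / (1 - g), (centre + half_width) / (1 - g)
```

As $g \to 0$ the limits tend to the delta-method limits (5.17) below, which provides a useful check on any implementation.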
If $g$ is greater than 1, $x_2$ is not significantly different from zero at the $\alpha$ level, and the data are consistent with a zero value for $\mu_2$ and hence an infinite value for $\rho$. The confidence set will then be either the two intervals $(-\infty, \rho_L)$ and $(\rho_U, \infty)$, excluding the observed value $y$, or the whole set of values $(-\infty, \infty)$. Otherwise, the interval $(\rho_L, \rho_U)$ will include $y$, and when $g$ is very small the limits will be close to those given by the normal approximation using (5.12). This may be seen by setting $g = 0$ in (5.15), when the limits become

$$ y \pm t_{f,\alpha}\left[ \frac{\mathrm{var}(x_1)}{x_2^2} - \frac{2x_1}{x_2^3}\,\mathrm{cov}(x_1, x_2) + \frac{x_1^2}{x_2^4}\,\mathrm{var}(x_2) \right]^{1/2}. \qquad (5.17) $$

Equation (5.17) agrees with (5.12), with the replacement of the expectations of $x_1$ and $x_2$ by their observed values, and the inclusion of the covariance term.

The validity of (5.15) depends on the assumption of normality for $x_1$ and $x_2$. Important use is made of Fieller's theorem in biological assay (§20.2), where the normality assumption is known to be a good approximation.

A situation commonly encountered is the comparison of two independent samples when the quantity of interest is the ratio of the location parameters rather than their difference. The formulae above may be useful, taking $x_1$ and $x_2$ to be the sample means and using standard formulae for their variances. The use of Fieller's theorem will be problematic if (as is usually the case) the variances are not estimated as multiples of the same $s^2$, although approximations may be used. An alternative approach is to work with the logarithms of the individual readings and make inferences about the difference in the means of the logarithms (which is the logarithm of their ratio), using the standard procedures of §4.3.

Product
Let $y = x_1 x_2$, where $x_1$ and $x_2$ are independent. Denote the means of $x_1$ and $x_2$ by $\mu_1$ and $\mu_2$, and their variances by $\sigma_1^2$ and $\sigma_2^2$. Then

$$ \mathrm{var}(y) = \mu_1^2\sigma_2^2 + \mu_2^2\sigma_1^2 + \sigma_1^2\sigma_2^2. \qquad (5.18) $$

The assumption of independence is crucial.
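Because (5.18) is exact for independent variables, it can be verified by simulation. A small sketch (NumPy; the chosen means and standard deviations are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
m1, sd1, m2, sd2 = 10.0, 2.0, 5.0, 1.0
x1 = rng.normal(m1, sd1, 1_000_000)
x2 = rng.normal(m2, sd2, 1_000_000)

theory = m1**2 * sd2**2 + m2**2 * sd1**2 + sd1**2 * sd2**2   # equation (5.18)
print(theory)              # 204.0
print(np.var(x1 * x2))     # close to 204 in a sample this large
```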
General function
Suppose we know the mean and variance of the random variable $x$. Can we calculate the mean and variance of a general function of $x$, such as $3x^3$ or $\sqrt{\log x}$? There is no simple formula, but again a useful approximation is available when the coefficient of variation of $x$ is small. We have to assume some knowledge of calculus at this point. Denote the function of $x$ by $y$. Then

$$ \mathrm{var}(y) \simeq \left(\frac{dy}{dx}\right)^2_{x = E(x)} \mathrm{var}(x), \qquad (5.19) $$

the symbol $\simeq$ standing for 'approximately equal to'. In (5.19), $dy/dx$ is the differential coefficient (or derivative) of $y$ with respect to $x$, evaluated at the mean value of $x$.

If $y$ is a function of two variables, $x_1$ and $x_2$,

$$ \mathrm{var}(y) \simeq \left(\frac{\partial y}{\partial x_1}\right)^2 \mathrm{var}(x_1) + 2\left(\frac{\partial y}{\partial x_1}\right)\left(\frac{\partial y}{\partial x_2}\right) \mathrm{cov}(x_1, x_2) + \left(\frac{\partial y}{\partial x_2}\right)^2 \mathrm{var}(x_2), \qquad (5.20) $$

where $\partial y/\partial x_1$ and $\partial y/\partial x_2$ are the partial derivatives of $y$ with respect to $x_1$ and $x_2$, again evaluated at the mean values.

The reader with some knowledge of calculus will be able to derive (4.9) as a particular case of (5.20) with $\mathrm{cov}(x_1, x_2) = 0$. An obvious extension of (5.20) to $k$ variables gives (5.10) as a special case. Equations (5.12) and (5.18) are special cases of (5.20) with $\mathrm{cov}(x_1, x_2) = 0$; in (5.18) the last term becomes negligible if the coefficients of variation of $x_1$ and $x_2$ are very small, and the first two terms then agree with (5.20). The method of approximation by (5.19) and (5.20) is known as the delta method.
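The delta method is conveniently automated with symbolic differentiation. A sketch using SymPy (our illustration of (5.19), not part of the text), here for $y = \log x$:

```python
import sympy as sp

x, mu, v = sp.symbols('x mu v', positive=True)   # v stands for var(x)
y = sp.log(x)                                    # any smooth function of x
slope = sp.diff(y, x).subs(x, mu)                # dy/dx evaluated at the mean
var_y = sp.simplify(slope**2 * v)                # delta-method variance (5.19)
print(var_y)                                     # v/mu**2
```

The result, $\mathrm{var}(\log x) \simeq \mathrm{var}(x)/[E(x)]^2$, underlies the log-transform approach to ratios mentioned above.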
5.4 Maximum likelihood estimation

In §4.1 we noted several desirable properties of point estimators, and remarked that many of these are achieved by the method of maximum likelihood. In Chapter 4 and the earlier sections of the present chapter, we considered the sampling distributions of various statistics chosen on rather intuitive grounds, such as the mean of a sample from a normal distribution. Most of these turn out to be maximum likelihood estimators, and it is useful to reconsider their properties in the light of this very general approach.

In §3.6 we derived the binomial distribution, and in §4.4 we used this result to obtain inferences from a sample proportion. The probability distribution here is a two-point distribution with probabilities $\pi$ and $1 - \pi$ for the two types of individual. There is thus one parameter, $\pi$, and a maximum likelihood (ML) estimator is obtained by finding the value that maximizes the probability shown in (3.12). The answer is $p$, the sample proportion, which was, of course, the statistic chosen intuitively. We shall express this result by writing

$$ \hat{\pi} = p, $$

the 'hat' symbol indicating the ML estimator. Two of the properties already noted in §3.6 follow from general properties of ML estimators: first, in large samples (i.e. for large values of $n$) the distribution of $p$ tends to become closer and closer to a normal distribution; and, secondly, $p$ is a consistent estimator of $\pi$, because its variance decreases as $n$ increases, and so $p$ fluctuates more and more closely around its mean, $\pi$. A third property of ML estimators is their efficiency: no other estimator would have a smaller variance than $p$ in large samples. One other property of $p$ is its unbiasedness, in that its mean value is $\pi$. This can be regarded as a bonus, as not all ML estimators are unbiased, although in large samples any bias must become proportionately small in comparison with the standard error, because of the consistency property.

Since the Poisson distribution is closely linked with the binomial, as explained in §3.7, it is not surprising that similar properties hold. There is again one parameter, $\mu$, and the ML estimator from a sample of $n$ counts is the observed mean count:

$$ \hat{\mu} = \bar{x}. $$

An equivalent statement is that the ML estimator of $n\mu$ is $n\bar{x}$, which is the total count $\sum x$. The large-sample normality of ML estimators implies a tendency towards normality of the Poisson distribution with a large mean ($n\mu$ here), confirming the decreased skewness noted in connection with Fig. 3.9. The consistency of $\bar{x}$ is illustrated by the fact that

$$ \mathrm{var}(\bar{x}) = \mathrm{var}(x)/n = \mu/n, $$

so, as $n$ increases, the distribution of $\bar{x}$ becomes more tightly concentrated around its mean $\mu$. Again, the unbiasedness is a bonus.

In Fig. 4.1 the concept of maximum likelihood estimation was illustrated by reference to a single observation from a normal distribution $N(\mu, 1)$. The ML estimator of $\mu$ is clearly $x$. In a sample of size $n$ from the same distribution, the situation would be essentially the same, except that the distributions of $\bar{x}$ for different values of $\mu$ would now have a variance of $1/n$ rather than 1. The ML estimator would clearly be $\bar{x}$, which has the usual properties of consistency and efficiency and, as a bonus, unbiasedness.

In practice, if we are fitting a normal distribution to a set of $n$ observations, we shall not usually know the population variance, and the distribution we fit, $N(\mu, \sigma^2)$, will have two unknown parameters. The likelihood now has to be maximized simultaneously over all possible values of $\mu$ and $\sigma^2$. The resulting ML estimators are

$$ \hat{\mu} = \bar{x}, $$

as expected, and

$$ \hat{\sigma}^2 = \frac{\sum (x_i - \bar{x})^2}{n}. $$

This is the biased estimator of the variance, (2.1), with divisor $n$, rather than the unbiased estimator $s^2$ given by (2.2). As we noted in §2.6, the bias of (2.1) becomes proportionately unimportant as $n$ gets large, and the estimator is consistent, as we should expect.

Proofs that the ML estimators noted here maximize the likelihood are easily obtained by use of the differential calculus. That is, in fact, the general approach for maximum likelihood solutions to more complex problems, many of which we shall encounter later in the book. In some of these more complex models, such as logistic regression (§14.2), the solution is obtained by a computer program acting iteratively, so that each round of the calculation gets closer and closer to the final value.

Two points may be noted finally.
1. The ML solution depends on the model put forward for the random variation. Choice of an inappropriate model may lead to inefficient or misleading estimates. For certain non-normal distributions, for instance, the ML estimator of the location parameter may not be (as with the normal distribution) the sample mean $\bar{x}$. This corresponds to the point made in §§2.4 and 2.5 that for skew distributions the median or geometric mean may be a more satisfactory measure than the arithmetic mean.
2. There are some alternative approaches to estimation, other than maximum likelihood, that also provide large-sample normality, consistency and efficiency. Some of these, such as generalized estimating equations (§12.6), will be met later in the book.
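For the normal model the ML solution is available in closed form, but it is instructive to mimic the iterative numerical approach mentioned above. A sketch (NumPy/SciPy; simulated data and our own parametrization) confirming that the ML variance estimator uses divisor $n$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
data = rng.normal(loc=3.0, scale=2.0, size=200)

def neg_log_lik(theta):
    mu, log_sigma = theta                 # log-sigma keeps the scale positive
    sigma = np.exp(log_sigma)
    return len(data) * log_sigma + 0.5 * np.sum((data - mu) ** 2) / sigma ** 2

fit = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = fit.x[0], np.exp(fit.x[1]) ** 2
print(mu_hat, data.mean())                # ML estimate equals the sample mean
print(sigma2_hat, data.var(ddof=0))       # divisor n: the biased estimator (2.1)
```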
6 Bayesian methods

6.1 Subjective and objective probability

Our approach to the interpretation of probability, and its application in statistical inference, has hitherto been frequentist. That is, we have regarded the probability of a random event as the long-run proportion of occasions on which it occurs, conditional on some specified hypothesis. Similarly, in methods of inference, a $P$ value is defined as the proportion of trials in which some observed result would have been observed on the null hypothesis, and a confidence interval is characterized by the probability of inclusion of the true value of a parameter in repeated samples. Bayes' theorem (§3.3) allowed us to specify prior probabilities for hypotheses, and hence to calculate posterior probabilities after data had been observed, but the prior probabilities were, at that stage, justified as representing the long-run frequencies with which these hypotheses were true. In medical diagnosis, for example, we could speak of the probabilities of data (symptoms, etc.) on certain hypotheses (diagnoses), and attribute (at least approximately) probabilities to the diagnoses according to the relative frequencies seen in past records of similar patients.

It would be attractive if one could allot probabilities to hypotheses like the following: 'The use of tetanus antitoxin in cases of clinical tetanus reduces the fatality of the disease by more than 20%', for which no frequency interpretation is possible. Such an approach becomes possible only if we interpret the probability of a hypothesis as a measure of our degree of belief in its truth. A probability of zero would correspond to complete disbelief, a value of one representing complete certainty. These numerical values could be manipulated by Bayes' theorem, measures of prior belief being modified in the light of observations on random variables by multiplication by likelihoods, resulting in measures of posterior belief.

It is often argued that this is a more 'natural' interpretation of probability than the frequency approach, and that non-specialist users of statistical methods often erroneously interpret the results of significance tests or confidence intervals in this subjective way. That is, a non-significant result may be wrongly interpreted as showing that the null hypothesis has low probability, and a parameter may be claimed to have a 95% probability of lying inside a confidence interval.

8 Comparison of several groups

A comparison of one group mean with the mean of a set of $q$ other group means is measured by

$$ \bar{y}_c - \frac{1}{q}\sum_{i=1}^{q} \bar{y}_i $$

which, when multiplied by $q$, becomes

$$ L_1 = q\bar{y}_c - \sum_{i=1}^{q} \bar{y}_i, $$

a particular case of (8.23) with $l_c = q$, $l_i = -1$ for all $i$ in the set of $q$ groups, and $l_i = 0$ otherwise. Note that $\sum l_i = 0$.

A linear regression coefficient
Suppose that the set of $k$ groups is associated with a variable $x_i$ (for example, the dose of some substance). It might be of interest to ask whether the regression of $y$ on $x$ is significant. Using the result quoted in the derivation of (7.13),

$$ L_2 = \sum (x_i - \bar{x})\,\bar{y}_i, $$

which again is a particular case of (8.23), with $l_i = x_i - \bar{x}$, and again $\sum l_i = 0$.

A difference between two means
The difference between $\bar{y}_g$ and $\bar{y}_h$ is another case of (8.23), with $l_g = 1$, $l_h = -1$ and all other $l_i = 0$.

Corresponding to any linear contrast, $L$, the $t$ statistic on $k(n-1)$ DF is, from (8.24),

$$ t = \frac{L}{\mathrm{SE}(L)} = \frac{L}{s_W\sqrt{\sum l_i^2/n}}. $$

As we have seen, the square of $t$ follows the $F$ distribution on 1 and $k(n-1)$ DF. Thus,

$$ F = t^2 = \frac{L^2}{s_W^2 \sum l_i^2/n} = \frac{s_1^2}{s_W^2}, $$

where $s_1^2 = L^2/(\sum l_i^2/n)$. In fact, $s_1^2$ can be regarded as an MSq on 1 DF, derived from an SSq also equal to $s_1^2$, which can be shown to be part of the SSq between groups of the analysis of variance. (Here $T_i$ is the total for the $i$th group, $T$ the grand total and $S$ the sum of squares of the observations, as in §8.1.) The analysis thus takes the following form:

Source of variation     SSq                                 DF        MSq     VR
Between groups:
  Due to L              L^2/(Σ l_i^2/n)                     1         s_1^2   F_1 = s_1^2/s_W^2
  Other contrasts       Σ T_i^2/n − T^2/nk − L^2/(Σ l_i^2/n)  k − 2   s_R^2   F_2 = s_R^2/s_W^2
Within groups           S − Σ T_i^2/n                       k(n − 1)  s_W^2
Total                   S − T^2/nk                          nk − 1

Separate significance tests are now provided: (i) by $F_1$ on 1 and $k(n-1)$ DF for the contrast $L$ (as we have seen, this is equivalent to the $t$ test for $L$); and (ii) by $F_2$ on $k-2$ and $k(n-1)$ DF for differences between group means other than those measured by $L$.
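A sketch of the contrast computation (NumPy/SciPy; function and variable names are ours). The illustrative call reproduces the contrast between $\bar{y}_3 = 371.2$ and $\bar{y}_4 = 274.8$ from Example 8.1, with $s_W^2 = 3997$, $n = 5$ and 16 within-groups DF; the first two group means receive zero weight in this contrast, so they are entered simply as placeholders.

```python
import numpy as np
from scipy.stats import f as f_dist

def contrast_test(means, l, n, s_w2, df_within):
    """t and F tests of a linear contrast L = sum(l_i * ybar_i), equal n per group."""
    l = np.asarray(l, dtype=float)
    assert abs(l.sum()) < 1e-12            # contrast coefficients sum to zero
    L = float(np.dot(l, means))
    se = np.sqrt(s_w2 * np.sum(l**2) / n)  # SE(L), as in (8.24)
    t = L / se
    F = t ** 2                             # F on 1 and df_within DF
    return t, F, f_dist.sf(F, 1, df_within)

means = [0.0, 0.0, 371.2, 274.8]           # placeholder means for groups 1 and 2
print(contrast_test(means, [0, 0, 1, -1], n=5, s_w2=3997, df_within=16))
# t = 2.41, the value quoted in the Scheffe illustration below
```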
Suppose there are two or more linear contrasts of interest: $L_1 = \sum l_{1i}\bar{y}_i$, $L_2 = \sum l_{2i}\bar{y}_i$, etc. Can the single degrees of freedom for these contrasts all be incorporated in the same analysis of variance? They can, provided the $L$s are uncorrelated when the null hypothesis is true, and the condition for this is that, for any two contrasts ($L_p$ and $L_q$, say), the sum of products of the coefficients is zero:

$$ \sum_{i=1}^{k} l_{pi}\, l_{qi} = 0. $$

In this case $L_p$ and $L_q$ are said to be orthogonal. If there are $k'$ such orthogonal contrasts ($k' < k$), the analysis of variance is extended to include a separate row for each $L_i$, each with 1 DF, and the SSq for other contrasts, with $k - k' - 1$ DF, is again obtained by subtraction.
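The orthogonality condition is trivial to check. A sketch (plain Python), assuming equal group sizes as in the text:

```python
def orthogonal(lp, lq):
    """True if two contrasts satisfy sum_i lp_i * lq_i = 0 (equal n_i)."""
    return sum(a * b for a, b in zip(lp, lq)) == 0

# group 1 against the mean of groups 2-4, and group 2 against group 4:
print(orthogonal([3, -1, -1, -1], [0, 1, 0, -1]))   # True: 1 DF each in one ANOVA
print(orthogonal([3, -1, -1, -1], [1, -1, 0, 0]))   # False: not orthogonal
```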
8.4 Multiple comparisons

The straightforward use of the $t$ or $F$ tests is appropriate for any differences between means, or for more general linear contrasts, which arise naturally out of the structure of the investigation. However, a difficulty must be recognized. If there are $k$ groups, there are $\tfrac{1}{2}k(k-1)$ pairs of means which might conceivably be compared, and there is no limit to the number of linear contrasts which might be formed. These comparisons are not all independent, but it is fairly clear that, even when the null hypothesis is true, in any set of data some of these contrasts are likely to be significant. A sufficiently assiduous search will often reveal some remarkable contrasts which have arisen purely by chance. This may not matter if scrutiny is restricted to those comparisons on which the study was designed to throw light. If, on the other hand, the data are subjected to what is sometimes called a dredging procedure (a search for significant contrasts which would not have been thought of initially), there is a real danger that a number of comparisons will be reported as significant, but that they will almost all have arisen by chance.

A number of procedures have been devised to reduce the chance of this happening. They are referred to as methods of making multiple comparisons, or simultaneous inference, and are described in detail by Miller (1981) and Hsu (1996). We mention briefly two methods: one for differences between means and the other for more general linear contrasts.

The first method, based on the distribution of the studentized range, is due to Newman (1939) and Keuls (1952). Given a set of $p$ means, each based on $n$ observations, the studentized range, $Q$, is the range of the $\bar{y}_i$ divided by the estimated standard error. In an obvious notation,

$$ Q = \frac{\bar{y}_{\max} - \bar{y}_{\min}}{s/\sqrt{n}}. \qquad (8.25) $$

The distribution of $Q$, on the null hypothesis that all the $\mu_i$ are equal, has been studied, and some upper 5% and 1% points are given in Appendix Table A5. They depend on the number of groups, $p$, and the within-groups degrees of freedom, $f_2$, and are written $Q_{p,0.05}$ and $Q_{p,0.01}$. The procedure is to rank the $\bar{y}_i$ in order of magnitude and to test the studentized range for all pairs of adjacent means (when it actually reduces to the usual $t$ test), for all adjacent triads, all groups of four adjacent means, and so on. Two means are regarded as differing significantly only if all tests for sets of means including those two give a significant result. The procedure is most readily performed in the opposite order to that just described, starting with all $k$ means, following with the two sets of $k-1$ adjacent means, and so on. The reason for this is that, if at any stage a non-significant $Q$ is found, that set of means need not be used for any further tests. The procedure will, for example, stop after the first stage if $Q$ for all $k$ means is non-significant.

The following example is taken from Miller (1981, §6.1). Five means, arranged in order, are

A 16.1, B 17.0, C 20.7, D 21.1, E 26.5,

with $n = 5$, $f_2 = 20$ and standard error $s/\sqrt{n} = 1.2$. The values of $Q_{p,0.05}$ for $p = 2, 3, 4$ and 5 are, respectively, 2.95, 3.58, 3.96 and 4.23. Tests are done successively for $p = 5, 4, 3$ and 2 (in the original display the non-significant groupings are indicated by underlining). The interpretation is that E differs from {A, B, C, D}, and that within the latter group A differs from C and D, with B occupying an ambiguous position.

In Example 8.1, where we noted that $\bar{y}_3$ and $\bar{y}_4$ differed by more than twice the standard error of the difference, $Q$ calculated for all four groups is

$$ \frac{371.2 - 274.8}{\sqrt{3997/5}} = \frac{96.4}{28.3} = 3.41, $$

and from Table A5, $Q_{4,0.05} = 4.05$, so the test shows no significant difference, as might have been expected in view of the non-significant $F$ test.

The Newman–Keuls procedure has the property that, for a set of groups with equal $\mu_i$, the probability of asserting a significant difference between any of them is at most equal to the specified significance level (0.05 in the above examples). If the null hypothesis is untrue, however, the method has the property that the probability of making at least one incorrect assertion about the pattern of differences between the means may exceed the significance level (Hsu, 1996). A modification to ensure that, whatever the pattern of differences between means, the probability of making at least one incorrect assertion will not exceed the significance level was suggested by Einot and Gabriel (1975). The first step of this modified method is exactly as for the Newman–Keuls method, but in subsequent steps the critical values exceed those given in Table A5.

If linear contrasts other than differences are being 'dredged', the infinite number of possible choices suggests that a very conservative procedure should be used, that is, one which indicates significance much less readily than the $t$ test. A method proposed by Scheffé (1959) is as follows. A linear contrast, $L$, is declared significant at, say, the 5% level if the absolute value of $L/\mathrm{SE}(L)$ exceeds

$$ \sqrt{(k-1)F_{0.05}}, \qquad (8.26) $$

where $F_{0.05}$ is the tabulated 5% point of the $F$ distribution with $k-1$ and $k(n-1)$ DF. When $k = 2$ this rule is equivalent to the use of the $t$ test. For $k > 2$ it is noticeably conservative in comparison with a $t$ test, in that the numerical value of (8.26) may considerably exceed the 5% level of $t$ on $k(n-1)$ degrees of freedom. Scheffé's method has the property that, if the null hypothesis that all the $\mu_i$ are equal is true, only in 5% of cases will it be possible to find any linear contrast which is significant by this test. Any contrast significant by this test, even if discovered by an exhaustive process of data dredging, may therefore be regarded with a reasonable degree of confidence.

In Example 8.1 the contrast between the means for experiments 3 and 4 gives $L/\mathrm{SE}(L) = (371.2 - 274.8)/40.0 = 2.41$; by Scheffé's test the 5% critical value would be $\sqrt{3(3.24)} = 3.12$, and the observed contrast should not be regarded as significant.

It should again be emphasized that the Newman–Keuls and Scheffé procedures, and other multiple-comparison methods, are deliberately conservative, in order to reduce the probability of too many significant differences arising by chance in any one study. They are appropriate only when means are being compared in an exploratory way, to see what might 'turn up'. When comparisons are made which flow naturally from the plan of the experiment or survey, the usual $t$ test is appropriate.
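The Scheffé criterion (8.26) needs only an $F$ quantile. A sketch (Python with SciPy; the function name is ours):

```python
import math
from scipy.stats import f as f_dist

def scheffe_critical(k, df_within, alpha=0.05):
    """Critical value for |L/SE(L)| under Scheffe's method, equation (8.26)."""
    return math.sqrt((k - 1) * f_dist.ppf(1 - alpha, k - 1, df_within))

print(scheffe_critical(4, 16))   # about 3.12, as in the Example 8.1 illustration
```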
8.5 Comparison of several proportions: the 2 × k contingency table

In §4.5 the comparison of two proportions was considered from two points of view: the sampling error of the difference between the proportions, and the $\chi^2$ significance test applied to the $2 \times 2$ table. We saw that these two approaches led to equivalent significance tests of the null hypothesis. Where more than two proportions are to be compared, the calculation of standard errors between pairs of proportions raises points similar to those discussed in §8.4: many comparisons are possible, and an undue number of significant differences may arise by chance. However, an overall significance test, analogous to the $F$ test in the analysis of variance, is provided by a straightforward extension of the $\chi^2$ test.

Suppose there are $k$ groups of observations, and that in the $i$th group $n_i$ individuals have been observed, of whom $r_i$ show a certain characteristic (say, being 'positive'). The proportion of positives, $r_i/n_i$, is denoted by $p_i$. The data may be displayed as follows:

Group:                1          2          ...  i          ...  k          All groups combined
Positive              r_1        r_2             r_i             r_k        R
Negative              n_1 − r_1  n_2 − r_2       n_i − r_i       n_k − r_k  N − R
Total                 n_1        n_2             n_i             n_k        N
Proportion positive   p_1        p_2             p_i             p_k        P = R/N

The frequencies form a $2 \times k$ contingency table (there being two rows and $k$ columns, excluding the marginal totals). The $\chi^2$ test follows the same lines as for the $2 \times 2$ table (§4.5). For each of the observed frequencies, $O$, an expected frequency is calculated by the formula

$$ E = \frac{\text{row total} \times \text{column total}}{N}. \qquad (8.27) $$

The quantity $(O - E)^2/E$ is calculated for each cell and, finally,

$$ X^2 = \sum \frac{(O - E)^2}{E}, \qquad (8.28) $$

the summation being over the $2k$ cells in the table. On the null hypothesis that all $k$ samples are drawn randomly from populations with the same proportion of positives, $X^2$ is distributed approximately as $\chi^2_{(k-1)}$, the approximation improving as the expected frequencies increase in size. An indication of the extent to which the $\chi^2_{(k-1)}$ distribution is valid for small frequencies is given in §8.6. No continuity correction is required because, unless the observed frequencies are very small, the number of tables which may be formed with the same marginal totals as those observed is very large, and the distribution of $X^2$ is consequently more nearly continuous than is the case for $2 \times 2$ tables.

An alternative formula for $X^2$ is of some value. The value of $O - E$ for an entry in the first row of the table (the positives for group $i$, for instance) is $r_i - Pn_i$, and this is easily seen to differ only in sign from the entry for the negatives for group $i$:

$$ (n_i - r_i) - (1 - P)n_i = -(r_i - Pn_i). $$

The contribution to $X^2$ from these two cells is, therefore,

$$ (r_i - Pn_i)^2\left(\frac{1}{Pn_i} + \frac{1}{Qn_i}\right), $$

where $Q = 1 - P$. The expression in the second set of parentheses simplifies to give the following expression for $X^2$:

$$ X^2 = \frac{\sum n_i (p_i - P)^2}{PQ}, \qquad (8.29) $$

the summation now being over the $k$ groups. A little manipulation with the summation in (8.29) gives two equivalent expressions,

$$ X^2 = \frac{\sum n_i p_i^2 - NP^2}{PQ} \quad\text{and}\quad X^2 = \frac{\sum (r_i^2/n_i) - R^2/N}{PQ}. \qquad (8.30) $$

The last two expressions are more convenient as computing formulae than (8.29).
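The equivalence of the basic formula (8.28) and the short-cut (8.30) is easily confirmed numerically. A sketch (NumPy; the counts are invented for illustration):

```python
import numpy as np

def chi2_2xk(r, n):
    """X^2 for a 2 x k table computed two ways: via (8.27)-(8.28) and via (8.30).
    r = numbers positive, n = group totals."""
    r, n = np.asarray(r, float), np.asarray(n, float)
    R, N = r.sum(), n.sum()
    P = R / N
    e_pos, e_neg = P * n, (1 - P) * n          # expected frequencies, (8.27)
    basic = np.sum((r - e_pos)**2 / e_pos) + np.sum(((n - r) - e_neg)**2 / e_neg)
    shortcut = (np.sum(r**2 / n) - R**2 / N) / (P * (1 - P))
    return basic, shortcut                     # identical apart from rounding

print(chi2_2xk([10, 18, 25], [50, 60, 55]))    # hypothetical data
```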
The expression (8.29) and the results of §8.2 provide an indication of the reason why $X^2$ follows the $\chi^2_{(k-1)}$ distribution. In the general formulation of §8.2 we can replace $Y_i$ by $p_i$, $V_i$ by $PQ/n_i$ and $w_i$ by $n_i/PQ$. The weighted mean $\bar{Y}$ then becomes

$$ \frac{\sum (n_i p_i/PQ)}{\sum (n_i/PQ)} = \frac{\sum n_i p_i}{N} = \frac{R}{N} = P, $$

and, from (8.14), the test statistic $G$, distributed as $\chi^2_{(k-1)}$, becomes

$$ X^2 = \frac{\sum n_i (p_i - P)^2}{PQ}, $$

in agreement with (8.29). We have departed from the assumptions underlying (8.14) in two respects: the variation in $p_i$ is binomial, not normal, and the true variance $\sigma_i^2$ has been replaced by the estimated variance $PQ/n_i$. Both these approximations decrease in importance as the expected frequencies increase in size.

Example 8.3
Table 8.2 shows the numbers of individuals in various age groups who were found in a survey to be positive and negative for Schistosoma mansoni eggs in the stool.

Table 8.2 Presence or absence of S. mansoni eggs in the stool.

Age (years)   0–     10–    20–    30–    40–    Total
Positive      14     16     14     7      6      57
Negative      87     33     66     34     11     231
Total         101    49     80     41     17     288

The expected number of positives for the age group 0– is $(57)(101)/288 = 19.99$. The set of expected numbers for the 10 cells in the table is:

Positive      19.99   9.70    15.83   8.11    3.36    57
Negative      81.01   39.30   64.17   32.89   13.64   231
Total         101     49      80      41      17      288

The fact that the expected numbers add to the same marginal totals as those observed is a useful check. The contribution to $X^2$ from the first cell is $(14 - 19.99)^2/19.99 = 1.79$, and the set of contributions for the 10 cells is:

Positive      1.79    4.09    0.21    0.15    2.07
Negative      0.44    1.01    0.05    0.04    0.51

giving a total of $X^2 = 10.36$. The degrees of freedom are $k - 1 = 4$, for which the 5% point is 9.49. The departures from the null hypothesis are thus significant at the 5% level ($P = 0.035$).

In this example the column classification is based on a continuous variable, age, and it would be natural to ask whether the proportions of positives exhibit any smooth trend with age. The estimated proportions, with their standard errors calculated as $\sqrt{p_i q_i/n_i}$, are

0.14 ± 0.03, 0.33 ± 0.07, 0.18 ± 0.04, 0.17 ± 0.06, 0.35 ± 0.12,

the last being based on particularly small numbers. No clear trend emerges (a method for testing for a trend is given in §15.2). About half the contribution to $X^2$ comes from the second age group (10–19 years), and there is some suggestion that the proportion of positives in this group is higher than in the neighbouring age groups.

To illustrate the use of (8.30), call the numbers of positives $r_i$. Then

$$ X^2 = \frac{14^2/101 + \cdots + 6^2/17 - 57^2/288}{(0.1979)(0.8021)}, $$

where $P = 57/288 = 0.1979$. This gives $X^2 = 10.37$, as before, the discrepancy being due to rounding errors. Note that if the negatives rather than the positives had been denoted by $r_i$, each of the terms in the numerator of (8.30) would have been different, but the result would have been the same.
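The same result is produced by standard software. A sketch using SciPy's chi2_contingency, which implements (8.27) and (8.28) for any two-way table:

```python
import numpy as np
from scipy.stats import chi2_contingency

pos = np.array([14, 16, 14, 7, 6])
tot = np.array([101, 49, 80, 41, 17])
table = np.vstack([pos, tot - pos])     # Table 8.2 as a 2 x 5 array

X2, p, dof, expected = chi2_contingency(table, correction=False)
print(X2, dof, p)                       # about 10.36 on 4 DF, P about 0.035
```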
8.6 General contingency tables

The form of table considered in the last section can be generalized by allowing more than two rows. Suppose that a total frequency, $N$, is subdivided by $r$ row categories and $c$ column categories. The null hypothesis, corresponding to that tested in the simpler situations, is that the probabilities of falling into the various columns are independent of the rows or, equivalently, that the probabilities for the various rows are the same for each column. The $\chi^2$ test follows closely that applied in the simpler cases. For each cell in the body of the table an expected frequency, $E$, is calculated by (8.27), and the $X^2$ index is obtained from (8.28) by summation over the $rc$ cells. Various alternative formulae are available, but none is as simple as (8.29) or (8.30), and it is probably most convenient to use the basic formula (8.28). On the null hypothesis, $X^2$ follows the $\chi^2_{(f)}$ distribution with $f = (r-1)(c-1)$. This number of degrees of freedom may be thought of as the number of arbitrary choices of the frequencies in the body of the table, subject to the constraint that they should add to the same margins as those observed and thus give the same values of $E$. (If the entries in $r - 1$ rows and $c - 1$ columns are arbitrarily specified, those in the remaining row and column are determined by the marginal totals.)

Again, the $\chi^2$ distribution is an approximation, increasingly valid for large expected frequencies. A rough rule (Cochran, 1954) is that the approximation is safe provided that relatively few expected frequencies are less than 5 (say, 1 cell out of 5 or more, or 2 cells out of 10 or more), and that no expected frequency is less than 1. In tables with smaller expected frequencies the result of the significance test should be regarded with caution. If the result is not obviously either significant or non-significant, it may be wise to pool some of the rows and/or columns in which the small expected frequencies occur and recalculate $X^2$ (with, of course, a reduced number of degrees of freedom). See also the suggestions made by Cochran (1954, p. 420).

Example 8.4
Table 8.3 shows results obtained in a trial to compare the effects of para-amino-salicylic acid (PAS) and streptomycin in the treatment of pulmonary tuberculosis. In each cell of the table are shown the observed frequency, O, the expected frequency, E, and the discrepancy, O − E. For example, for the first cell, $E = (99)(139)/273 = 50.41$.

Table 8.3 Degrees of positivity of sputa from patients with pulmonary tuberculosis treated with PAS, streptomycin or a combination of both drugs (Medical Research Council, 1950).

                          Positive   Negative smear,    Negative smear,    Total
                          smear      positive culture   negative culture
PAS              O        56         30                 13                 99
                 E        50.41      23.93              24.66
                 O − E    5.59       6.07               −11.66
Streptomycin     O        46         18                 20                 84
                 E        42.77      20.31              20.92
                 O − E    3.23       −2.31              −0.92
Streptomycin     O        37         18                 35                 90
and PAS          E        45.82      21.76              22.42
                 O − E    −8.82      −3.76              12.58
Total                     139        66                 68                 273

Note that the values of O − E add to zero along each row and down each column, a useful check on the arithmetic.

$$ X^2 = \frac{(5.59)^2}{50.41} + \cdots + \frac{(12.58)^2}{22.42} = 17.64. $$

The degrees of freedom for the $\chi^2$ distribution are $(3-1)(3-1) = 4$, and from Table A2 the 1% point is 13.28. The relationship between treatment and type of sputum is thus significant at the 1% level ($P = 0.0014$). The magnitudes and signs of the discrepancies, O − E, show clearly that the main difference is between PAS (tending to give more positive results) and the combined treatment (more negative results).
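The same routine handles the general r × c case. A sketch reproducing the analysis of Table 8.3:

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows: PAS, streptomycin, streptomycin and PAS
# columns: positive smear; negative smear, positive culture;
#          negative smear, negative culture
table = np.array([[56, 30, 13],
                  [46, 18, 20],
                  [37, 18, 35]])

X2, p, dof, expected = chi2_contingency(table)
print(X2, dof, p)           # about 17.64 on 4 DF, P about 0.0014
print(table - expected)     # the discrepancies O - E shown in Table 8.3
```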
Fisher's exact test (§4.5) may be extended to a general $r \times c$ table (Mehta & Patel, 1983). The exact probability level is equal to the sum of the probabilities of all tables, with the observed margins, whose probability is less than or equal to that of the observed table, the probabilities being calculated under the null hypothesis that there is no association and all the marginal totals are fixed. This corresponds to a two-tailed test using the alternative method of calculating the other tail in a $2 \times 2$ table (p. 136), but as the test is of general association there are no defined tails. The calculation is available as 'EXACT' in the SAS program PROC FREQ, where it is feasible when $n < 5(r-1)(c-1)$, and in StatXact.

8.7 Comparison of several variances

The one-way analysis of variance (§8.1) is a generalization of the two-sample $t$ test (§4.3). Occasionally one requires a generalization of the $F$ test (used, as in §5.1, for the comparison of two variances) to the situation where more than two estimates of variance are to be compared. In a one-way analysis of variance, for example, the primary purpose is to compare means, but one might wish to test the significance of differences between variances, both for the intrinsic interest of this comparison and also because the analysis of variance involves an assumption that the group variances are equal.

Suppose there are $k$ estimates of variance, $s_i^2$, having possibly different degrees of freedom, $\nu_i$. (If the $i$th group contains $n_i$ observations, $\nu_i = n_i - 1$.) On the assumption that the observations are randomly selected from normal distributions, an approximate significance test due to Bartlett (1937) consists in calculating

$$ \bar{s}^2 = \frac{\sum \nu_i s_i^2}{\sum \nu_i}, $$

$$ M = \left(\sum \nu_i\right) \ln \bar{s}^2 - \sum \nu_i \ln s_i^2 $$

and

$$ C = 1 + \frac{1}{3(k-1)}\left(\sum \frac{1}{\nu_i} - \frac{1}{\sum \nu_i}\right), $$

and referring $M/C$ to the $\chi^2_{(k-1)}$ distribution. Here 'ln' refers to the natural logarithm (see p. 126). The quantity $C$ is likely to be near 1 and need be calculated only in marginal cases. Worked examples are given by Snedecor and Cochran (1989, §13.10).

Bartlett's test is perhaps less useful than might be thought, for two reasons. First, like the $F$ test, it is rather sensitive to non-normality. Secondly, with samples of moderate size, the true variances $\sigma_i^2$ have to differ very considerably before there is a reasonable chance of obtaining a significant test result. To put this point another way, even if $M/C$ is non-significant, the estimated $s_i^2$ may differ substantially, and so may the true $\sigma_i^2$. If possible inequality in the $\sigma_i^2$ is important, it may therefore be wise to assume it even if the test result is non-significant. In some situations moderate inequality in the $\sigma_i^2$ will not matter very much, so again the significance test is not relevant.

An alternative test, less influenced by non-normality, is due to Levene (1960). In this test the deviations of each value from its group mean, or median, are calculated, and negative deviations are changed to positive; that is, the absolute deviations are used. A test of the equality of the mean values of the absolute deviations over the groups is then carried out, using a one-way analysis of variance (§8.1). Since the mean value of the absolute deviation is proportional to the standard deviation, if the variances differ between groups so also will the mean absolute deviations; the variance-ratio test for the equality of group means is thus a test of the homogeneity of variances. Carroll and Schneider (1985) showed that it is preferable to measure the deviations from the group medians, to cope with asymmetric distributions.
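Both tests are available in SciPy. A sketch on simulated samples (the group data are invented; median centring in Levene's test corresponds to the Carroll and Schneider recommendation):

```python
import numpy as np
from scipy.stats import bartlett, levene

rng = np.random.default_rng(4)
groups = [rng.normal(0.0, sd, size=30) for sd in (1.0, 1.0, 2.0)]

print(bartlett(*groups))                  # M/C referred to chi-squared on k-1 DF
print(levene(*groups, center='median'))   # absolute deviations from group medians
```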
8.8 Comparison of several counts: the Poisson heterogeneity test

Suppose that $k$ counts, denoted by $x_1, x_2, \ldots, x_i, \ldots, x_k$, are available. It may be interesting to test whether they could reasonably have been drawn at random from Poisson distributions with the same (unknown) mean $\mu$. In many microbiological experiments, as we saw in §3.7, successive counts may be expected to follow a Poisson distribution if the experimental technique is perfect. With imperfect technical methods the counts will follow Poisson distributions with different means. In bacteriological counting, for example, the suspension may be inadequately mixed, so that clustering of the organisms occurs; the volumes of the suspension inoculated for the different counts may not be equal; the culture media may not invariably be able to sustain growth. In each of these circumstances heterogeneity of the expected counts is present and is likely to manifest itself in excessive variability of the observed counts.

It seems reasonable, therefore, to base a test on the sum of squares about the mean of the $x_i$. An appropriate test statistic is

$$ X^2 = \frac{\sum (x - \bar{x})^2}{\bar{x}}, \qquad (8.31) $$

which, on the null hypothesis of constant $\mu$, is approximately distributed as $\chi^2_{(k-1)}$. The method is variously called the Poisson heterogeneity test or the Poisson dispersion test.

The formula (8.31) may be justified from two different points of view. First, it is closely related to the test statistic (5.4) used for testing the variance of a normal distribution. On the present null hypothesis the distribution is Poisson, which we know is similar to a normal distribution if $\mu$ is not too small; furthermore, $\sigma^2 = \mu$, which can best be estimated from the data by the sample mean $\bar{x}$. Replacing $\sigma_0^2$ by $\bar{x}$ in (5.4) gives (8.31). Secondly, we could argue that, given the total count $\sum x$, the frequency 'expected' for the $i$th count on the null hypothesis is $\sum x/k = \bar{x}$; applying the usual formula for a $\chi^2$ index, $\sum[(O-E)^2/E]$, immediately gives (8.31). In fact, just as the Poisson distribution can be regarded as a limiting form of the binomial for large $n$ and small $\pi$, so the present test can be regarded as a limiting form of the $\chi^2$ test for the $2 \times k$ table (§8.5) when $R/N$ is very small and all the $n_i$ are equal; under these circumstances it is not difficult to see that (8.29) becomes equivalent to (8.31).

Example 8.5
The following data were given by 'Student' (1907), who first emphasized the role of the Poisson distribution in microbiology. Twenty counts of yeast cells in squares of a haemocytometer gave the following summary statistics:

$$ k = 20, \quad \textstyle\sum x = 96, \quad \bar{x} = 4.8, \quad \textstyle\sum x^2 = 542, $$
$$ \left(\textstyle\sum x\right)^2 / k = 460.8, \quad \textstyle\sum (x - \bar{x})^2 = 81.2, $$
$$ X^2 = 81.2/4.8 = 16.92 \text{ on 19 DF} \quad (P = 0.60). $$

There is no suggestion of variability in excess of that expected from the Poisson distribution.

In referring $X^2$ to the $\chi^2_{(k-1)}$ distribution we should normally use a one-sided test, since heterogeneity tends to give high values of $X^2$. Occasionally, though, departures from the Poisson distribution will lead to reduced variability. In microbiological counting this might be caused by omission of counts differing widely from the average; Lancaster (1950) has shown that unskilled technicians counting blood cells (which under ideal circumstances provide another example of the Poisson theory) tend to omit extreme values or take repeat observations, presumably because they underestimate the extent of random variation. Other causes of reduced variability are an inability to record high counts accurately (for instance, because of overlapping of bacterial colonies), or physical interference between particles which prevents large numbers from settling close together. The latter phenomenon has been noted by Lancaster (1950) in the counting of red blood cells.

The use of the $\chi^2_{(k-1)}$ distribution in the heterogeneity test is an approximation, but it is quite safe provided $\bar{x}$ is greater than about 5, and is safe even for much smaller values of $\bar{x}$ (as low as 2, say) provided $k$ is not too small (greater than 15, say). For very small values of $\bar{x}$, Fisher (1950, 1964) has shown how to obtain an exact test; the method is illustrated by Oldham (1968, §5.15). Finally, note that for $k = 2$, (8.31) is equivalent to $(x_1 - x_2)^2/(x_1 + x_2)$, which was used as a $\chi^2_{(1)}$ variate in §5.2.
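A sketch of the dispersion test (NumPy/SciPy; the simulated counts stand in for data such as Student's):

```python
import numpy as np
from scipy.stats import chi2

def poisson_dispersion(counts):
    """Poisson heterogeneity (dispersion) test, equation (8.31)."""
    x = np.asarray(counts, dtype=float)
    X2 = np.sum((x - x.mean()) ** 2) / x.mean()
    df = len(x) - 1
    return X2, df, chi2.sf(X2, df)     # one-sided: heterogeneity inflates X2

rng = np.random.default_rng(2)
print(poisson_dispersion(rng.poisson(4.8, size=20)))
# with Student's data the statistic is 81.2/4.8 = 16.92 on 19 DF
```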
9 Experimental design

9.1 General remarks

Notwithstanding the importance of observational studies, such as those to be discussed in Chapter 19, experiments are as fundamental to the advancement of medicine as they are in other branches of science. Experiments are performed to compare the effects of various treatments on some type of experimental unit; the investigator must decide which treatment to allocate to which unit. The following are examples.

1. A comparison of the effects of inoculating animals with different doses of a chemical substance. The units here will be the animals.
2. A prophylactic trial to compare the effectiveness for children of different vaccines against measles. Each child will receive one of the vaccines and may be regarded as the experimental unit.
3. A comparison, in one patient suffering recurrent attacks of a chronic disease, of different methods of alleviating discomfort. The successive occasions on which attacks occur are now the units for which the choice of treatment is to be made.
4. A study of the relative merits of different programmes of community health education. Each programme would be applied in a different area, and these areas would form the experimental units.

In the last three examples the experiments involve people, and this poses special problems. A fuller discussion of this type of experiment and its associated problems is given in Chapter 18. In the present chapter some of the devices of classical experimental design are described. Many of these designs have their basis in agricultural applications and do not adequately address the special problems that arise when the experimental units are people. Nevertheless, aspects of classical design can be useful in this context, and these are discussed as they arise in Chapter 18. It should also be noted that the analyses which accompany the designs discussed in this chapter can be useful even when the data have not arisen from a designed experiment.

In 1–4 above a crucial question is how the treatments are to be allotted to the available units. One would clearly wish to avoid any serious disparity between the characteristics of units receiving different treatments. In 2, for instance, it would be dangerous to give one vaccine to all the children in one school and another vaccine to all the children in a second school, for the exposure of the two groups of children to measles contacts might be quite different. It would then be difficult to decide whether a difference in the incidence of measles was due to different protective powers of the vaccines or to the different degrees of exposure to infection.

It would be possible to arrange that the groups of experimental units to which different treatments were to be applied were made alike in various relevant respects. For example, in 1, groups of animals with approximately the same mean weight could be formed; in 2, children from different schools and of different age groups could be represented equally in each treatment group. But, however careful the investigator is to balance factors which seem important, one can never be sure that the treatment groups do not differ markedly in some factor which is also important but which has been ignored in the allocation. The accepted solution to this dilemma is that advocated by Fisher in the 1920s and 1930s: the allocation should incorporate an element of randomization. In its simplest form this means that the choice of treatment for each unit should be made by an independent act of randomization, such as the toss of a coin or the use of random-number tables. This would lead to some uncertainty in the numbers of units finally allotted to each treatment, and if these are fixed in advance the groups may be formed by choosing random samples of the appropriate sizes from the total pool of experimental units.
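When the group sizes are fixed in advance, the random allocation just described amounts to permuting the pooled units and cutting the permutation into groups. A sketch (NumPy; the names are ours):

```python
import numpy as np

def random_allocation(units, group_sizes, seed=None):
    """Randomly split a pool of units into treatment groups of fixed sizes."""
    assert sum(group_sizes) == len(units)
    rng = np.random.default_rng(seed)
    shuffled = list(rng.permutation(units))
    groups, start = [], 0
    for size in group_sizes:
        groups.append(shuffled[start:start + size])
        start += size
    return groups

units = [f'unit{i}' for i in range(12)]
for treatment, members in zip('ABC', random_allocation(units, [4, 4, 4], seed=0)):
    print(treatment, members)
```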
Sometimes a form of systematic allocation, analogous to systematic sampling (p. 650), is used as an alternative to random allocation. The units are arranged in a certain order and are then allotted systematically to the treatment groups. This method has much the same advantages and disadvantages as systematic sampling. It is likely to be seriously misleading only if the initial ordering of the units presents some systematic variation of a cyclic type which happens to run in phase with the allocation cycle. However, prior knowledge of which treatment a unit is going to receive can lead to bias (see §18.4), so alternation and other forms of systematic allocation are best avoided in favour of strictly random methods.

A second important principle of experimental design is that of replication, the use of more than one experimental unit for each treatment. Various purposes are served by replication. First, an appropriate amount of replication ensures that the comparisons between treatments are sufficiently precise; the sampling error of the difference between two means, for instance, decreases as the amount of replication in each group increases. Secondly, the effect of sampling variation can be estimated only if there is an adequate degree of replication. In the comparison of the means of two groups, for instance, if both sample sizes were as low as two, the degrees of freedom in the $t$ test would be only two (§4.3); the percentage points of $t$ on two degrees of freedom are very high, and the test therefore loses a great deal in effectiveness merely because of the inadequacy of the estimate of within-groups variance. Thirdly, replication may be useful in enabling observations to be spread over a wide variety of experimental conditions. In the comparison of two surgical procedures, for instance, it might be useful to organize a cooperative trial in which the methods were compared in each of a number of hospitals, so that the effects of variations in medical and surgical practice, and perhaps in the precise type of disease, could be studied.

A third basic principle concerns the reduction of random variability between experimental units. The formula for the standard error of a mean, $\sigma/\sqrt{n}$, shows that the effect of random error can be reduced either by increasing $n$ (more replication) or by decreasing $\sigma$. This suggests that experimental units should be as homogeneous as possible in their response to treatment. However, too strenuous an effort to remove heterogeneity will tend to counteract the third reason given above for replication: the desire to cover a wide range of extraneous conditions. In a clinical trial, for example, it may be that a precise comparison could be effected by restricting the age, sex, clinical condition and other features of the patients, but these restrictions may make it too difficult to generalize from the results. A useful solution to this dilemma is to subdivide the units into relatively homogeneous subgroups, called blocks. Treatments can then be allocated randomly within blocks, so that each block provides a small experiment. The precision of the overall comparisons between treatments is then determined by the random variability within blocks rather than that between different blocks. This is called a randomized block design and is discussed in §9.2.
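Within-block randomization is the same permutation idea applied separately to each block. A sketch (NumPy) generating a layout for the randomized block design analysed in §9.2:

```python
import numpy as np

def randomized_block_layout(n_blocks, treatments, seed=0):
    """Each treatment occurs once per block, in random order within the block."""
    rng = np.random.default_rng(seed)
    return {block: list(rng.permutation(treatments))
            for block in range(1, n_blocks + 1)}

for block, order in randomized_block_layout(4, ['A', 'B', 'C']).items():
    print(block, order)
```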
More complex designs, allowing simultaneously for more than one source of extraneous variation, are discussed in §§9.4 and 9.5. Other extensions dealt with in this chapter are designs for the simultaneous comparison of more than one set of treatments; designs appropriate for situations similar to that of multistage sampling, in which some units are subdivisions of others; and designs which allow in various ways for the natural restrictions imposed by the experimental material.

9.2 Two-way analysis of variance: randomized blocks

In contrast to the data discussed in §§8.1–8.4, outcomes from a randomized block design are classified in two ways, by the block and the treatment. If there are $r$ blocks and $c$ treatments, each block containing $c$ experimental units to which treatments are randomly allocated, there will be a total of $N = rc$ observations on any variable, simultaneously divided into $r$ blocks with $c$ observations in each and $c$ treatment groups with $r$ observations in each. The analysis of randomized blocks can readily be viewed as a method for the analysis of data that are classified more generally in two ways, say, by the rows and columns of a table. In some experimental situations both the rows and the columns of the two-way table may represent forms of treatment. In a blood-clotting experiment, for instance, clotting times may be measured for each ...