Ebook: An Introduction to Statistical Methods and Data Analysis (6th Edition), Part 2

Part 2 of the book An Introduction to Statistical Methods and Data Analysis covers: linear regression and correlation; multiple regression and the general linear model; further regression topics; analysis of variance for blocked designs; the analysis of covariance; and analysis of variance for some unbalanced designs.

CHAPTER 11: Linear Regression and Correlation

11.1 Introduction and Abstract of Research Study
11.2 Estimating Model Parameters
11.3 Inferences about Regression Parameters
11.4 Predicting New y Values Using Regression
11.5 Examining Lack of Fit in Linear Regression
11.6 The Inverse Regression Problem (Calibration)
11.7 Correlation
11.8 Research Study: Two Methods for Detecting E. coli
11.9 Summary and Key Formulas
11.10 Exercises

11.1 Introduction and Abstract of Research Study

The modeling of the relationship between a response variable and a set of explanatory variables is one of the most widely used of all statistical techniques. We refer to this type of modeling as regression analysis. A regression model provides the user with a functional relationship between the response variable and the explanatory variables that allows the user to determine which of the explanatory variables have an effect on the response. The regression model also allows the user to explore what happens to the response variable for specified changes in the explanatory variables. For example, financial officers must predict future cash flows based on specified values of interest rates, raw material costs, salary increases, and so on. When designing new training programs for employees, a company would want to study the relationship between employee efficiency and explanatory variables such as the results from employment tests, experience on similar jobs, educational background, and previous training. Medical researchers attempt to determine the factors that have an effect on cardiorespiratory fitness. Forest scientists study the relationship between the volume of wood in a tree and the diameter of the tree at a specified height and the taper of the tree.

The basic idea of regression analysis is to obtain a model for the functional relationship between a response variable (often referred to as the dependent variable) and one or more explanatory variables (often referred to as the independent variables).

Regression models have a number of uses. The model provides a description of the major features of the data set. In some cases, a subset of the explanatory variables will not affect the response variable, and hence the researcher will not have to measure or control any of these variables in future studies. This may result in significant savings in future studies or experiments. The equation relating the response variable to the explanatory variables produced from the regression analysis provides estimates of the response variable for values of the explanatory variables not observed in the study. For example, a clinical trial is designed to study the response of a subject to various dose levels of a new drug. Because of time and budgetary constraints, only a limited number of dose levels are used in the study. The regression equation will provide estimates of the subjects' response for dose levels not included in the study. The accuracy of these estimates will depend heavily on how well the final model fits the observed data. In business applications, the prediction of future sales of a product is crucial to production planning. If the data provide a model that has a good fit in relating current sales to sales in previous months, prediction of sales in future months is possible. However, a crucial element in the accuracy of these predictions is that the business conditions during which the model-building data were collected remain fairly stable over the months for which the predictions are desired.
In some applications of regression analysis, the researcher is seeking a model that can accurately estimate the values of a variable that is difficult or expensive to measure, using explanatory variables that are inexpensive to measure and obtain. If such a model is obtained, then in future applications it is possible to avoid having to obtain the values of the expensive variable by measuring the values of the inexpensive variables and using the regression equation to estimate the value of the expensive variable. For example, a physical fitness center wants to determine the physical well-being of its new clients. Maximal oxygen uptake is recognized as the single best measure of cardiorespiratory fitness, but its measurement is expensive. Therefore, the director of the fitness center would want a model that provides accurate estimates of maximal oxygen uptake using easily measured variables such as weight, age, heart rate after a 1-mile walk, time needed to walk 1 mile, and so on.

We can distinguish between prediction (reference to future values) and explanation (reference to current or past values). Because of the virtues of hindsight, explanation is easier than prediction. However, it is often clearer to use the term prediction to include both cases. Therefore, in this book, we sometimes blur the distinction between prediction and explanation.

For prediction (or explanation) to make much sense, there must be some connection between the variable we're predicting (the dependent variable) and the variable we're using to make the prediction (the independent variable). No doubt, if you tried long enough, you could find 30 common stocks whose price changes over a year have been accurately predicted by the won–lost percentages of the 30 major league baseball teams on the Fourth of July. However, such a prediction is absurd because there is no connection between the two variables.

Prediction requires a unit of association; there should be an entity that relates the two variables. With time-series data, the unit of association may simply be time. The variables may be measured at the same time period or, for genuine prediction, the independent variable may be measured at a time period before the dependent variable. For cross-sectional data, an economic or physical entity should connect the variables. If we are trying to predict the change in market share of various soft drinks, we should consider the promotional activity for those drinks, not the advertising for various brands of spaghetti sauce. The need for a unit of association seems obvious, but many predictions are made for situations in which no such unit is evident.

In this chapter, we consider simple linear regression analysis, in which there is a single independent variable and the equation for predicting a dependent variable y is a linear function of a given independent variable x. Suppose, for example, that the director of a county highway department wants to predict the cost of a resurfacing contract that is up for bids. We could reasonably predict the cost to be a function of the road miles to be resurfaced. A reasonable first attempt is to use a linear prediction function. Let y = total cost of a project in thousands of dollars, x = number of miles to be resurfaced, and ŷ = the predicted cost, also in thousands of dollars.
A prediction equation ŷ = 2.0 + 3.0x (for example) is a linear equation. The constant term, such as the 2.0, is the intercept term and is interpreted as the predicted value of y when x = 0. In the road resurfacing example, we may interpret the intercept as the fixed cost of beginning the project. The coefficient of x, such as the 3.0, is the slope of the line, the predicted change in y when there is a one-unit change in x. In the road resurfacing example, if two projects differed by 1 mile in length, we would predict that the longer project cost 3.0 (thousand dollars) more than the shorter one. In general, we write the prediction equation as

    ŷ = β̂0 + β̂1x

where β̂0 is the intercept and β̂1 is the slope. See Figure 11.1.

[FIGURE 11.1: Linear prediction function ŷ = β̂0 + β̂1x]

The basic idea of simple linear regression is to use data to fit a prediction line that relates a dependent variable y and a single independent variable x. The first assumption in simple regression is that the relation is, in fact, linear. According to the assumption of linearity, the slope of the equation does not change as x changes. In the road resurfacing example, we would assume that there were no (substantial) economies or diseconomies from projects of longer mileage. There is little point in using simple linear regression unless the linearity assumption makes sense (at least roughly).

Linearity is not always a reasonable assumption, on its face. For example, if we tried to predict y = number of drivers who are aware of a car dealer's midsummer sale using x = number of repetitions of the dealer's radio commercial, the assumption of linearity means that the first broadcast of the commercial leads to no greater an increase in aware drivers than the thousand-and-first. (You've heard commercials like that.)
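To make the interpretation of the intercept and slope concrete, here is a minimal sketch in Python (the book itself works with Minitab and SAS output; the numbers below come only from the example prediction equation above):

```python
# Sketch of the example prediction equation y-hat = 2.0 + 3.0x from the text.
def predicted_cost(miles):
    """Predicted total cost, in thousands of dollars, for a resurfacing project."""
    intercept = 2.0   # fixed cost of beginning the project (in $1,000s)
    slope = 3.0       # predicted added cost per extra mile resurfaced (in $1,000s)
    return intercept + slope * miles

for x in [0.0, 1.0, 2.0, 6.0]:
    print(f"x = {x} miles -> predicted cost = {predicted_cost(x)} thousand dollars")
# Two projects differing by 1 mile differ by 3.0 (thousand dollars) in predicted cost.
```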
We strongly doubt that such an assumption is valid over a wide range of x values. It makes far more sense to us that the effect of repetition would diminish as the number of repetitions got larger, so a straight-line prediction wouldn't work well.

Assuming linearity, we would like to write y as a linear function of x: y = β0 + β1x. However, according to such an equation, y is an exact linear function of x; no room is left for the inevitable errors (deviations of actual y values from their predicted values). Therefore, corresponding to each y we introduce a random error term ε and assume the model

    y = β0 + β1x + ε

We assume the random variable y to be made up of a predictable part (a linear function of x) and an unpredictable part (the random error ε). The coefficients β0 and β1 are interpreted as the true, underlying intercept and slope. The error term ε includes the effects of all other factors, known or unknown. In the road resurfacing project, unpredictable factors such as strikes, weather conditions, and equipment breakdowns would contribute to ε, as would factors such as hilliness or the prerepair condition of the road (factors that might have been used in prediction but were not). The combined effects of unpredictable and ignored factors yield the random error terms ε.

For example, one way to predict the gas mileage of various new cars (the dependent variable) based on their curb weight (the independent variable) would be to assign each car to a different driver, say, for a 1-month period. What unpredictable and ignored factors might contribute to prediction error? Unpredictable (random) factors in this study would include the driving habits and skills of the drivers, the type of driving done (city versus highway), and the number of stoplights encountered. Factors that would be ignored in a regression analysis of mileage and weight would include engine size and type of transmission (manual versus automatic).

In regression studies, the values of the independent variable (the xi values) are usually taken as predetermined constants, so the only source of randomness is the εi terms. Although most economic and business applications have fixed xi values, this is not always the case. For example, suppose that xi is the score of an applicant on an aptitude test and yi is the productivity of the applicant. If the data are based on a random sample of applicants, xi (as well as yi) is a random variable. The question of fixed versus random in regard to x is not crucial for regression studies. If the xi values are random, we can simply regard all probability statements as conditional on the observed xi values. When we assume that the xi values are constants, the only random portion of the model for yi is the random error term εi. We make the following formal assumptions.

DEFINITION 11.1: Formal assumptions of regression analysis:
1. The relation is, in fact, linear, so that the errors all have expected value zero: E(εi) = 0 for all i.
2. The errors all have the same variance: Var(εi) = σε² for all i.
3. The errors are independent of each other.
4. The errors are all normally distributed; εi is normally distributed for all i.

[FIGURE 11.2: Theoretical distribution of y in regression; the true line shown is E(y) = 1.5 + 2.5x]

These assumptions are illustrated in Figure 11.2. The actual values of the dependent variable are distributed normally, with mean values falling on the regression line and the same standard deviation at all values of the independent variable.
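The model in Definition 11.1 is easy to simulate, which helps show what the assumptions mean in practice. The following sketch is a hypothetical illustration, not from the book: it generates data from the line E(y) = 1.5 + 2.5x of Figure 11.2, with an assumed error standard deviation of 2.0, using independent, mean-zero, constant-variance normal errors.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

beta0, beta1, sigma = 1.5, 2.5, 2.0   # true intercept, slope, error std. dev. (sigma assumed)
x = np.linspace(0, 12, 60)            # x values treated as predetermined constants

# Definition 11.1 in action: errors are independent, mean zero, normally
# distributed, and have the same variance at every x.
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = beta0 + beta1 * x + eps           # observed y values scatter around E(y)

print("E(y) at x = 4:", beta0 + beta1 * 4)                  # a point on the true line
print("sample mean of the errors:", round(eps.mean(), 3))   # should be near 0
```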
The only assumption not shown in the figure is the independence of one measurement from another. These are the formal assumptions, made in order to derive the significance tests and prediction methods that follow.

We can begin to check these assumptions by looking at a scatterplot of the data. This is simply a plot of each (x, y) point, with the independent variable value on the horizontal axis and the dependent variable value on the vertical axis. Look to see whether the points basically fall around a straight line or whether there is a definite curve in the pattern. Also look to see whether there are any evident outliers falling far from the general pattern of the data. A scatterplot is shown in part (a) of Figure 11.3.

Recently, smoothers have been developed to sketch a curve through data without necessarily assuming any particular model. If such a smoother yields something close to a straight line, then linear regression is reasonable. One such method is called LOWESS (locally weighted scatterplot smoother). Roughly, a smoother takes a relatively narrow "slice" of data along the x axis, calculates a line that fits the data in that slice, moves the slice slightly along the x axis, recalculates the line, and so on. Then all the little lines are connected in a smooth curve. The width of the slice is called the bandwidth; this may often be controlled in the computer program that does the smoothing. The plain scatterplot (Figure 11.3a) is shown again (Figure 11.3b) with a LOWESS curve through it. The scatterplot shows a curved relation; the LOWESS curve confirms that impression.

[FIGURE 11.3: (a) Scatterplot and (b) the same scatterplot with a LOWESS curve]

Another type of scatterplot smoother is the spline fit. It can be understood as taking a narrow slice of data, fitting a curve (often a cubic equation) to the slice, moving to the next slice, fitting another curve, and so on. The curves are calculated in such a way as to form a connected, continuous curve.

Many economic relations are not linear. For example, any diminishing-returns pattern will tend to yield a relation that increases, but at a decreasing rate. If the scatterplot does not appear linear, by itself or when fitted with a LOWESS curve, it can often be "straightened out" by a transformation of either the independent variable or the dependent variable. A good statistical computer package or a spreadsheet program will compute such functions as the square root of each value of a variable. The transformed variable should be thought of as simply another variable.

For example, a large city dispatches crews each spring to patch potholes in its streets. Records are kept of the number of crews dispatched each day and the number of potholes filled that day. A scatterplot of the number of potholes patched against the number of crews, and the same scatterplot with a LOWESS curve through it, are shown in Figure 11.4. The relation is not linear. Even without the LOWESS curve, the decreasing slope is obvious. That's not surprising; as the city sends out more crews, it will be using less effective workers, the crews will have to travel farther to find holes, and so on. All these reasons suggest that diminishing returns will occur. We can try several transformations of the independent variable to find a scatterplot in which the points more nearly fall along a straight line. Three common transformations are the square root, the natural logarithm, and the inverse (one divided by the variable).

[FIGURE 11.4: Scatterplots for the pothole data, (a) without and (b) with a LOWESS curve]
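As an illustration of how such a smoother is used in practice (the book's curves were produced by Minitab; here we assume Python with the statsmodels package, and the curved data are invented), the sketch below fits a LOWESS curve; the frac argument plays the role of the bandwidth described above.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(seed=2)

# Hypothetical curved data, similar in spirit to Figure 11.3.
x = rng.uniform(0, 200, size=80)
y = 100 * np.sqrt(x / 200) + rng.normal(0, 5, size=80)

# frac is the bandwidth: the fraction of the data used for each local fit.
smoothed = lowess(y, x, frac=0.4)   # array of (x, smoothed y) pairs, sorted by x

# A bend in these fitted values warns against a straight-line model.
print(smoothed[:5])
```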
We applied each of these transformations to the pothole repair data. The results are shown in Figure 11.5a–c, with LOWESS curves. The square root (a) and inverse (c) transformations didn't really give us a straight line. The natural logarithm (b) worked very well, however. Therefore, we would use LnCrew as our independent variable.

[FIGURE 11.5: Scatterplots of potholes patched with a transformed predictor: (a) SqrtCrew, (b) LnCrew, (c) InvCrew, each with a LOWESS curve]

Finding a good transformation often requires trial and error. Following are some suggestions for transformations to try. Note that there are two key features to look for in a scatterplot. First, is the relation nonlinear? Second, is there a pattern of increasing variability along the y (vertical) axis? If there is, the assumption of constant variance is questionable. These suggestions don't cover all the possibilities, but they do include the most common problems.

DEFINITION 11.2: Steps for choosing a transformation:
1. If the plot indicates a relation that is increasing but at a decreasing rate, and if variability around the curve is roughly constant, transform x using the square root, logarithm, or inverse transformation.
2. If the plot indicates a relation that is increasing at an increasing rate, and if variability is roughly constant, try using both x and x² as predictors. Because this method uses two variables, the multiple regression methods of the next two chapters are needed.
3. If the plot indicates a relation that increases to a maximum and then decreases, and if variability around the curve is roughly constant, again try using both x and x² as predictors.
4. If the plot indicates a relation that is increasing at a decreasing rate, and if variability around the curve increases as the predicted y value increases, try using y² as the dependent variable.
5. If the plot indicates a relation that is increasing at an increasing rate, and if variability around the curve increases as the predicted y value increases, try using ln(y) as the dependent variable. It sometimes may also be helpful to use ln(x) as the independent variable. Note that a change in a natural logarithm corresponds quite closely to a percentage change in the original variable. Thus, the slope for a log-transformed variable can be interpreted quite well as a percentage change.

The plots in Figure 11.6 correspond to the descriptions given in Definition 11.2. There are symmetric recommendations for decreasing relations: if the relation is decreasing at a decreasing rate, or decreasing at an increasing rate, the analogous transformations from the steps above apply.

[FIGURE 11.6: Plots corresponding to the steps in Definition 11.2]
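One rough numerical screen for comparing candidate transformations (a supplement to, not a substitute for, the plots) is to check which transformed x gives the straightest relation, for example via the correlation coefficient. The data below are invented stand-ins for the pothole records, which the text does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical pothole-style data: diminishing returns as crews increase.
crews = rng.integers(1, 16, size=60).astype(float)
patched = 40 * np.log(crews) + 10 + rng.normal(0, 6, size=60)

candidates = {
    "crews (untransformed)": crews,
    "sqrt(crews)": np.sqrt(crews),
    "ln(crews)": np.log(crews),
    "1/crews": 1.0 / crews,
}

# The transformation with |r| closest to 1 gives the straightest plot.
for name, xt in candidates.items():
    r = np.corrcoef(xt, patched)[0, 1]
    print(f"{name:22s} r = {r:+.3f}")
# For data generated this way, ln(crews) should come out best,
# matching the choice of LnCrew in the text.
```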
EXAMPLE 11.1
An airline has seen a very large increase in the number of free flights used by participants in its frequent flyer program. To try to predict the trend in these flights in the near future, the director of the program assembled data for the last 72 months. The dependent variable y is the number of thousands of free flights; the independent variable x is the month number. A scatterplot with a LOWESS smoother, done using Minitab, is shown in Figure 11.7. What transformation is suggested?

[FIGURE 11.7: Frequent flyer free flights by month, with a LOWESS smoother]

Solution: The pattern shows flights increasing at an increasing rate. The LOWESS curve is definitely turning upward. In addition, variation (up and down) around the curve is increasing. The points around the high end of the curve (on the right, in this case) scatter much more than the ones around the low end of the curve. The increasing variability suggests transforming the y variable. A natural logarithm (ln) transformation often works well. Minitab computed the logarithms and replotted the data, as shown in Figure 11.8. The pattern is much closer to a straight line, and the scatter around the line is much closer to constant.

[FIGURE 11.8: Result of the logarithm transformation; LnFlight versus Month]
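The effect of the ln(y) transformation in Example 11.1 can be mimicked with simulated data (hypothetical values; the airline's 72 monthly observations are not reproduced here, and the growth rate 0.035 and scale 20 are invented): exponential growth with errors that grow with the mean becomes a straight line with roughly constant scatter on the log scale.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

month = np.arange(1, 73, dtype=float)   # 72 months, as in Example 11.1

# Multiplicative errors make the scatter grow with the mean, as in Figure 11.7.
flights = 20 * np.exp(0.035 * month) * rng.lognormal(mean=0.0, sigma=0.10, size=72)

ln_flights = np.log(flights)            # the LnFlight variable of Figure 11.8

# On the log scale the relation is linear with roughly constant scatter:
# ln(flights) = ln(20) + 0.035 * month + error.
slope, intercept = np.polyfit(month, ln_flights, deg=1)
print(f"fitted line: ln(flights) = {intercept:.2f} + {slope:.4f} * month")
```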
We will have more to say about checking assumptions in Chapter 12. For a simple regression with a single predictor, careful checking of a scatterplot, ideally with a smooth curve fit through it, will help avoid serious blunders. Once we have decided on any mathematical transformations, we must estimate the actual equation of the regression line. In practice, only sample data are available. The population intercept, slope, and error variance all have to be estimated from limited sample data. The assumptions we made in this section allow us to make inferences about the true parameter values from the sample data.

Abstract of Research Study: Two Methods for Detecting E. coli

The case study introduced in an earlier chapter described a new microbial method for the detection of E. coli, the Petrifilm HEC test. The researcher wanted to evaluate the agreement of the results obtained using the HEC test with results obtained from an elaborate laboratory-based procedure, hydrophobic grid membrane filtration (HGMF). The HEC test is easier to inoculate, more compact to incubate, and safer to handle than conventional procedures. However, prior to using the HEC procedure, it was necessary to compare the readings from the HEC test to readings from the HGMF procedure obtained on the same meat sample, to determine whether the two procedures were yielding the same readings. If the readings differed, but an equation could be obtained that closely related the HEC reading to the HGMF reading, then the researchers could calibrate the HEC readings to predict what readings would have been obtained using the HGMF test procedure. If the HEC test results were unrelated to the HGMF test procedure results, then the HEC test could not be used in the field for detecting E. coli. The necessary regression analysis to answer these questions will be given at the end of this chapter.

11.2 Estimating Model Parameters

The intercept β0 and slope β1 in the regression model y = β0 + β1x + ε are population quantities. We must estimate these values from sample data. The error variance σε² is another population parameter that must be estimated. The first regression problem is to obtain estimates of the slope, intercept, and variance; we discuss how to do so in this section.

The road resurfacing example of Section 11.1 is a convenient illustration. Suppose the following data for similar resurfacing projects in the recent past are available. Note that we have a unit of association: the connection between a particular cost and mileage is that they're based on the same project.

    Cost yi (in thousands of dollars):  6.0  14.0  10.0  14.0  26.0
    Mileage xi (in miles):              1.0   3.0   4.0   5.0   7.0

A first step in examining the relation between y and x is to plot the data as a scatterplot. Remember that each point in such a plot represents the (x, y) coordinates of one data entry, as in Figure 11.9. The plot makes it clear that there is an increasing relation between mileage and cost.

[FIGURE 11.9: Scatterplot of cost versus mileage]
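The preview breaks off here, before the least-squares formulas are developed. As a sketch of where the computation is headed, the standard least-squares estimates β̂1 = Sxy/Sxx and β̂0 = ȳ − β̂1x̄ (stated here as an assumption, since the preview does not include their derivation) can be applied to the five data points above.

```python
import numpy as np

# Road resurfacing data from the text.
x = np.array([1.0, 3.0, 4.0, 5.0, 7.0])      # mileage, in miles
y = np.array([6.0, 14.0, 10.0, 14.0, 26.0])  # cost, in thousands of dollars

# Least-squares estimates: b1 = Sxy / Sxx and b0 = ybar - b1 * xbar.
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
Sxx = np.sum((x - x.mean()) ** 2)
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

print(f"y-hat = {b0:.1f} + {b1:.1f} x")
```

For these data the fitted line works out to ŷ = 2.0 + 3.0x, the same line used for illustration in Section 11.1.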