Analysis of Survey Data (part 9)
and prior are reasonable. Another useful feature of MI is that the complete-data analysis does not necessarily need to be based on the model used to impute the missing values. In particular, the complete-data analysis might consist of computing estimates and standard errors using more conventional design-based methods, in which case the potential effects of model misspecification are confined to the imputation of the missing values. For more discussion of the properties of MI under model misspecification, see for example Rubin (1996), Fay (1996), Rao (1996), and the associated discussions.

An important feature of MI is that draws of the missing values are imputed rather than means. Means would be preferable if the objective was to obtain the best estimates of the missing values, but have drawbacks when the objective is to make inferences about parameters. Imputation of draws entails some loss of efficiency for point estimation, but the averaging over the Q multiply-imputed datasets in (18.25) considerably reduces this loss. The gain from imputing draws is that it yields valid inferences for a wide range of estimands, including nonlinear functions such as percentiles and variances (Little, 1988).

The difficulty in implementing MI is in obtaining draws from the posterior distribution of y_mis given z_U, y_obs and r_s, which typically has an intractable form. Since draws from the posterior distribution of y_mis given z_U, y_obs, r_s, and θ are often easy to implement, a simpler scheme is to draw from the posterior distribution of y_mis given z_U, y_obs, r_s and θ̃, where θ̃ is an easily computed estimate of θ such as that obtained from the complete cases. This approach ignores uncertainty in estimating θ, and is termed improper in Rubin (1987). It yields acceptable approximations when the fraction of missing data is modest, but leads to overstatement of precision with large amounts of missing data.
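The averaging over the Q datasets, and the correction that adds between-imputation variability back into the standard error, can be sketched for a scalar estimand as follows (a minimal illustration of the MI combining rules with made-up numbers, not tied to any particular imputation model):

```python
import statistics

def mi_combine(estimates, variances):
    """Combine Q complete-data estimates and their variances using the
    multiple-imputation combining rules (scalar estimand)."""
    Q = len(estimates)
    theta_bar = sum(estimates) / Q           # average of the Q point estimates
    within = sum(variances) / Q              # average within-imputation variance
    between = statistics.variance(estimates) # 1/(Q-1) * sum (theta_hat - theta_bar)^2
    total = within + (Q + 1) / Q * between   # total variance of theta_bar
    return theta_bar, total

# Q = 5 estimates of a mean from five imputed datasets, with squared SEs
est = [10.1, 9.8, 10.4, 10.0, 9.7]
var = [0.25, 0.27, 0.24, 0.26, 0.25]
theta_bar, total_var = mi_combine(est, var)
```

Each of the Q analyses is run exactly as it would be on complete data; only this final combining step is MI-specific.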
In the latter situation one option is to draw θ^(q) from its asymptotic distribution and then impute y_mis from its posterior distribution given z_U, y_obs, r_s, and θ^(q). A better but more computationally intensive approach is to cycle between draws

    y_mis^(i) ~ p(y_mis | z_U, y_obs, r_s, θ^(i−1))   and   θ^(i) ~ p(θ | z_U, y_obs, y_mis^(i), r_s),

an application of the Gibbs sampler (Tanner and Wong, 1987; Tanner, 1996).

The formulation provided above requires specification of the joint distribution of y_s and r_s given z_U. As discussed in Section 18.1, if the missing data are missing at random (MAR), in that the distribution of r_s given y_s and z_U does not depend on y_mis, then inference can be based on a model for y_s alone rather than on a model for the joint distribution of y_s and r_s (Rubin, 1976; Little and Rubin, 2002). Specifically, (18.23)-(18.26) can be replaced by the following:

    p(θ | z_U, y_obs) ≈ (1/Q) Σ_{q=1}^{Q} p(θ | z_U, y_s^(q)),   y_s^(q) = (y_obs, y_mis^(q)),   y_mis^(q) ~ p(y_mis | z_U, y_obs),

    E(θ | z_U, y_obs) ≈ θ̄ = (1/Q) Σ_{q=1}^{Q} θ̂^(q),   where θ̂^(q) = E(θ | z_U, y_s^(q)),

    Var(θ | z_U, y_obs) ≈ (1/Q) Σ_{q=1}^{Q} Var(θ | z_U, y_s^(q)) + ((Q+1)/Q) (1/(Q−1)) Σ_{q=1}^{Q} (θ̂^(q) − θ̄)(θ̂^(q) − θ̄)′,

where the conditioning on r_s has been dropped. This approach is called inference ignoring the missing-data mechanism. Since modeling the missing-data mechanism is difficult in many applications, and results are vulnerable to model misspecification, ignorable models are attractive. Whenever possible, surveys should be designed to make this assumption plausible by measuring information on at least a subsample of nonrespondents.

Example 3.
MI for the Third National Health and Nutrition Examination Survey

The Third National Health and Nutrition Examination Survey (NHANES-3) was the third in a series of periodic surveys conducted by the National Center for Health Statistics to assess the health and nutritional status of the US population. The NHANES-3 survey began in 1988 and was conducted in two phases, the first in 1988-91 and the second in 1991-4. It involved data collection on a national probability sample of 39 695 individuals in the US population. The medical examination component of the survey dictated that it was carried out in a relatively small number (89) of localities of the country known as stands; stands thus form the primary sampling units. The survey was also stratified, with oversampling of particular population subgroups.

This survey was subject to nonnegligible levels of unit and item nonresponse, in both its interview and its examination components. In previous surveys, nonresponse was handled primarily using weighting adjustments. Increasing levels of nonresponse in NHANES, and inconsistencies in analyses of NHANES data attributable to differing treatments of the missing values, led to the desire to develop imputation methods for NHANES-3 and subsequent NHANES surveys that yield valid inferences.

Variables in NHANES-3 can be usefully classified into three groups:

1. Sample frame/household screening variables
2. Interview variables (family and health history variables)
3. Mobile Examination Center (MEC) variables.

The sample frame/household screening variables can be treated essentially as fully observed. Of all sampled individuals, 14.6% were unit nonrespondents who had only the sampling frame and household screening variables measured. The interview data consist of family questionnaire variables and health variables obtained for sampled individuals. These variables were subject to unit nonresponse and modest rates of item nonresponse.
For example, self-rating of health status (for individuals aged 17 or over) was subject to an overall nonresponse rate (including unit nonresponse) of 18.8%, and family income had an overall nonresponse rate of 21.1%.

Missing data in the MEC variables are referred to here as examination nonresponse. Since about 8% of the sample individuals answered the interview questions but failed to attend the examination, rates of examination nonresponse were generally higher than rates of interview nonresponse. For example, body weight at examination had an overall nonresponse rate of 21.6%, systolic blood pressure an overall nonresponse rate of 28.1%, and serum cholesterol an overall nonresponse rate of 29.4%.

The three blocks of variables (screening, interview, examination) had an approximately monotone structure, with screening variables basically fully observed, questionnaire variables missing when the interview is not conducted, and examination variables missing when either (i) the interview is not conducted or (ii) the interview is conducted but the MEC examination does not take place. However, item nonresponse for interview data, and component and item-within-component nonresponse for MEC data, spoil this monotone structure.

A combined weighting and multiple imputation strategy was adopted to create a public-use dataset consisting of over 70 of the main NHANES-3 variables (Ezzati and Khare, 1992; Ezzati-Rice et al., 1993, 1995; Khare et al., 1993). The dataset included the following information:

Basic demographics and geography: age, race/ethnicity, sex, household size, design stratum, stand, interview weight.

Other interview variables: alcohol consumption, education, poverty index, self-reported health, activity level, arthritis, cataracts, chest pain, heart attack, back pain, height, weight, optical health measures, dental health measures, first-hand and second-hand smoking variables.
Medical examination variables: blood pressure measures, serum cholesterol measures, serum triglycerides, hemoglobin, hematocrit, bone density measures, size measures, skinfold measures, weight, iron, drusen score, maculopathy, diabetic retinopathy, ferritin, mc measures, blood lead, red cell measures.

Many of the NHANES variables not included in the above set are recodes of included variables and hence easily derived. As in previous NHANES surveys, unit nonrespondents were dropped from the sample and a nonresponse weight created for respondents to adjust for the fact that they are no longer a random sample of the population. The nonresponse weights were created as inverses of estimated propensity-to-respond scores (Rubin, 1985), as described in Ezzati and Khare (1992). All other missing values were handled by multiple imputation, specifically creating five random draws from the predictive distribution of the missing values, based on a multivariate linear mixed model (Schafer, 1996). The database consists of the following six components:

1. A core dataset containing variables that are not subject to imputation (id, demographics, sampling weights, imputation flags) in fixed-width, space-delimited ASCII.
2. Five versions of a data file containing the observed data and the imputed values created as one draw from the joint predictive distribution of the missing values.
3. SAS code that will merge the core data with each of the imputed datasets, assign variable names, etc., to produce five SAS datasets of identical size, with identical variable names.
4. Sample analyses using SUDAAN and Wesvar-PC to estimate means, proportions, quantiles, linear and logistic regression coefficients. Each analysis will have to be run five times.
5. SAS code for combining five sets of estimates and standard errors using Rubin's (1987) methods for multiple imputation inference, as outlined above.
6.
Documentation written for a general audience that details (a) the history of the imputation project, (b) an overview of multiple imputation, (c) NHANES imputation models and procedures, (d) a summary of the 1994-5 evaluation study, (e) instructions on how to use the multiply-imputed database, and (f) caveats and limitations.

A separate model was fitted to sample individuals in nine age classes, with sample sizes ranging from 1410 to 8375 individuals. One reason for stratifying in this way is that the set of variables defined for individuals varies somewhat by age, with a restricted set applying to children under 17 years, and some variables restricted to adults aged over 60 years. Also, stratification on age is a simple modeling strategy for reflecting the fact that relationships between NHANES variables are known to vary with age.

I now describe the basic form of the model for a particular age stratum. For individual t in stand c, let y_tc be the (1 × J) vector of the set of items subject to missing data, and let x_tc be a fixed (1 × p) vector of design variables and items fully observed except for unit nonresponse. It is assumed that

    y_tc | b_c ~ N_J(x_tc β + b_c, Σ),   b_c ~ N_J(0, Ψ),   c = 1, ..., 89;  t = 1, ..., n_c,   (18.27)

where β is a (p × J) matrix of fixed effects, b_c is a (1 × J) vector of random stand effects with mean zero and covariance matrix Ψ = diag(ψ_1, ..., ψ_J), and Σ is an unstructured covariance matrix; conditioning on (β, Σ, Ψ) in (18.27) is implicit. It is further assumed that the missing components of y_tc are missing at random and the parameters (β, Σ, Ψ) are distinct from parameters defining the mechanism, so that the missing-data mechanism does not have to be modeled for likelihood inference (Rubin, 1976; Little and Rubin, 2002). In view of the normality assumption in (18.27), most variables were transformed to approximate normality using standard power transformations.
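A power transformation of this kind can be sketched as follows (an illustrative Box-Cox-style function; the exact transformations used for NHANES-3 are not specified here):

```python
from math import log

def power_transform(w, delta):
    """Box-Cox-style power transformation (w**delta - 1)/delta, tending to
    log(w) as delta -> 0; used to pull skewed positive variables toward
    approximate normality."""
    if delta == 0:
        return log(w)
    return (w ** delta - 1) / delta

# a square-root-type transform (delta = 0.5) of a skewed value
z = power_transform(9.0, 0.5)
```

The parameter delta is chosen per variable so that the transformed values look approximately normal; the transformation is inverted after imputation to return imputed values to the original scale.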
A few variables not amenable to this approach were forced into approximate normality by calculating the empirical cdf and the corresponding quantiles of the standard normal distribution. The model (18.27) is a refinement over earlier imputation models in that stand is treated as a random effect rather than a fixed effect. This reduces the dimensionality of the model and allows for greater pooling of information across stands.

The Gibbs sampler was used to generate draws from the posterior distribution of the parameters and the missing values for the model in Section 18.2.1, with diffuse conjugate prior distributions on (β, Σ, Ψ). S-Plus and Fortran code is available at Joseph Schafer's web site at http://www.psu.edu/~jls. In summary, given values from the ith iteration, the (i + 1)th iteration of Gibbs sampling involves the following five steps:

    Draw (b_c^(i+1) | y_obs,tc, y_mis,tc^(i), β^(i), Ψ^(i), Σ^(i)) ~ Normal, c = 1, ..., 89
    Draw (Ψ^(i+1) | y_obs,tc, y_mis,tc^(i), β^(i), Σ^(i), {b_c^(i+1)}) ~ Inverse Wishart
    Draw (Σ^(i+1) | y_obs,tc, y_mis,tc^(i), β^(i), {b_c^(i+1)}, Ψ^(i+1)) ~ Inverse Wishart
    Draw (β^(i+1) | y_obs,tc, y_mis,tc^(i), {b_c^(i+1)}, Ψ^(i+1), Σ^(i+1)) ~ Normal
    Draw (y_mis,tc^(i+1) | y_obs,tc, {b_c^(i+1)}, Ψ^(i+1), Σ^(i+1), β^(i+1)) ~ Normal, c = 1, ..., 89.

Here y_obs,tc consists of the set of observed items in the vector y_tc, and y_mis,tc the set of missing items. More details of the forms of these distributions are given in Schafer (1996). The Gibbs sampler for each age stratum was run as a single chain and converged rapidly, reflecting the fact that the model parameters and random effects were well estimated. After an initial run-in period, draws of the missing values were taken at fixed intervals in the chain, and these were transformed back to their original scales and rounded to produce the five sets of imputations.

18.4.
NONIGNORABLE MISSING DATA

The models discussed in the previous two sections assume the missing data are MAR. Nonignorable, non-MAR models are needed when missingness depends on the missing values. For example, suppose a participant in an income survey refuses to report an income amount because the amount itself is high (or low). If missingness of the income amount is associated with the amount, after controlling for observed covariates (such as age, education, or occupation), then the mechanism is not MAR, and methods for imputing income based on MAR models are subject to bias. A correct analysis must be based on the full likelihood from a model for the joint distribution of y_s and r_s. The standard likelihood asymptotics apply to nonignorable models provided the parameters are identified, and computational tools such as the Gibbs sampler also apply to this more general class of models.

Suppose the missing-data mechanism is nonignorable, but the selection mechanism is ignorable, so that a model is not required for the inclusion indicators i_U. There are two broad classes of models for the joint distribution of y_s and r_s (Little and Rubin, 2002, Ch. 11; Little, 1993b). Selection models model the joint distribution as

    p(y_s, r_s | z_U, θ, ψ) = p(y_s | z_U, θ) p(r_s | z_U, y_inc, ψ),   (18.28)

as in Section 18.1. Pattern-mixture models specify

    p(y_s, r_s | z_U, γ, π) = p(y_s | z_U, r_s, γ) p(r_s | z_U, π),   (18.29)

where γ and π are unknown parameters, and the distribution of y_s is conditioned on the missing-data pattern r_s. Equations (18.28) and (18.29) are simply two different ways of factoring the joint distribution of y_s and r_s. When r_s is independent of y_s the two specifications are equivalent, with θ = γ and ψ = π. Otherwise (18.28) and (18.29) generally yield different models.
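That the two factorizations describe the same joint distribution can be checked numerically on a toy discrete example (made-up probabilities for a binary outcome y and a response indicator r):

```python
# Made-up joint distribution over (y, r): y is 'lo'/'hi', r = 1 if y observed.
joint = {('lo', 0): 0.10, ('lo', 1): 0.30, ('hi', 0): 0.25, ('hi', 1): 0.35}

# Selection-model-style factorization: p(y) * p(r | y)
p_y = {y: sum(p for (yy, _), p in joint.items() if yy == y) for y in ('lo', 'hi')}
selection = {(y, r): p_y[y] * (joint[(y, r)] / p_y[y]) for (y, r) in joint}

# Pattern-mixture-style factorization: p(r) * p(y | r)
p_r = {r: sum(p for (_, rr), p in joint.items() if rr == r) for r in (0, 1)}
pattern_mix = {(y, r): p_r[r] * (joint[(y, r)] / p_r[r]) for (y, r) in joint}
```

Both factorizations reproduce the same joint cell probabilities; they differ in which conditional distribution the modeler must specify directly, and hence in where untestable assumptions about nonresponse enter.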
Pattern-mixture models (18.29) seem more natural when missingness defines a distinct stratum of the population of intrinsic interest, such as individuals reporting 'don't know' in an opinion survey. However, pattern-mixture models can also provide inferences for parameters θ of the complete-data distribution, by expressing the parameters of interest as functions of the pattern-mixture model parameters γ and π (Little, 1993b). An advantage of the pattern-mixture modeling approach over selection models is that assumptions about the form of the missing-data mechanism are sometimes less specific in their parametric form, since they are incorporated in the model via parameter restrictions. This point is explained for specific normal pattern-mixture models in Little (1994) and Little and Wang (1996).

Most of the literature on nonignorable missing data has concerned selection models of the form (18.28), for univariate nonresponse. An early example is the probit selection model.

Example 4. Probit selection model

Suppose Y is scalar and incompletely observed, X_1, ..., X_p represent design variables and fully observed survey variables, and interest concerns the parameters β of the regression of Y on X_1, ..., X_p. A normal linear model is assumed for this regression, that is,

    (y_t | x_t1, ..., x_tp) ~ N(β_0 + Σ_{j=1}^{p} β_j x_tj, σ²).   (18.30)

The probability that Y is observed given Y and X_1, ..., X_p is modeled as a probit regression function:

    Pr(R_t = 1 | y_t, x_t1, ..., x_tp) = Φ(ψ_0 + Σ_{j=1}^{p} ψ_j x_tj + ψ_{p+1} y_t),   (18.31)

where Φ denotes the standard normal cumulative distribution function. When ψ_{p+1} ≠ 0, this probability is a monotonic function of the values of Y, and the missing-data mechanism is nonignorable.
If, on the other hand, ψ_{p+1} = 0 and (ψ_0, ..., ψ_p) and (β, σ²) are distinct, then the missing-data mechanism is ignorable, and maximum likelihood estimates of (β, σ²) are obtained by least squares linear regression based on the complete cases.

Amemiya (1984) calls (18.31) a Type II Tobit model, and it was first introduced to describe selection of women into the labor force (Heckman, 1976). It is closely related to the logit selection model of Greenlees, Reece and Zieschang (1982), which is extended to repeated-measures data in Diggle and Kenward (1994). This model is substantively appealing, but problematic in practice, since information to simultaneously estimate the parameters of the missing-data mechanism and the parameters of the complete-data model is usually very limited, and estimates are very sensitive to misspecification of the model (Little, 1985; Stolzenberg and Relles, 1990). The following example illustrates the problem.

Example 5. Income nonresponse in the Current Population Survey

Lillard, Smith and Welch (1982, 1986) applied the probit selection model of Example 4 to income nonresponse in four rounds of the Current Population Survey Income Supplement, conducted in 1970, 1975, 1976, and 1980. In 1980 their sample consisted of 32 879 employed white civilian males aged 16-65 who reported receipt (but not necessarily amount) of W = wages and salary earnings and who were not self-employed. Of these individuals, 27 909 reported the value of W and 4970 did not. In the notation of Example 4, Y is defined to equal (W^δ − 1)/δ, where δ is a power transformation of the kind proposed in Box and Cox (1964).
The predictors X were chosen as education (five dummy variables), years of market experience (four linear splines), probability of being in first year of market experience, region (south or other), child of household head (yes, no), other relative of household head or member of secondary family (yes, no), personal interview (yes, no), and year in survey (1 or 2). The last four variables were omitted from the earnings equation (18.30); that is, their coefficients in the vector β were set equal to zero. The variables education, years of market experience, and region were omitted from the response equation (18.31); that is, their coefficients in the vector ψ were set to zero.

Lillard, Smith and Welch (1982) fit the probit selection model (18.30) and (18.31) for a variety of other choices of δ. Their best-fitting model, δ̂ = 0.45, predicted large income amounts for nonrespondents, in fact 73% larger on average than imputations supplied by the Census Bureau, which used a hot deck method that assumes ignorable nonresponse. However, this large adjustment is founded on the normal assumption for the population residuals from the δ = 0.45 model, and on the specific choice of covariates in (18.30) and (18.31). It is quite plausible that nonresponse is ignorable and the unrestricted residuals follow the same (skewed) distribution as that in the respondent sample. Indeed, comparisons of Census Bureau imputations with IRS income amounts from matched CPS/IRS files do not indicate substantial underestimation (David et al., 1986). Rather than attempting to simultaneously estimate the parameters of the model for Y and the model for the missing-data mechanism, it seems preferable to conduct a sensitivity analysis to see how much the answers change for various assumptions about the missing-data mechanism.
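The bias at stake is easy to demonstrate by simulation. The sketch below (made-up parameter values, one covariate) generates data from the probit selection model of Example 4 with a nonzero coefficient on y in the response equation, so that larger y values are more likely to be observed, and compares the complete-case mean with the full-sample mean:

```python
import random
from statistics import NormalDist, mean

rng = random.Random(0)
Phi = NormalDist().cdf   # standard normal cdf, the probit link in (18.31)

n = 20000
beta0, beta1, sigma = 0.0, 1.0, 1.0    # outcome model, as in (18.30)
psi0, psi1, psi2 = 0.0, 0.0, 1.0       # response model; psi2 != 0 => nonignorable

ys, resp = [], []
for _ in range(n):
    x = rng.gauss(0.0, 1.0)
    y = beta0 + beta1 * x + rng.gauss(0.0, sigma)
    r = rng.random() < Phi(psi0 + psi1 * x + psi2 * y)   # y observed with prob Phi(...)
    ys.append(y)
    resp.append(r)

full_mean = mean(ys)                                # target: mean over all units
cc_mean = mean(y for y, r in zip(ys, resp) if r)    # complete cases only
bias = cc_mean - full_mean                          # positive: respondents have larger y
```

Here the complete-case mean overstates the full-sample mean by a large margin, and an MAR-based imputation model would inherit the same bias; this is the kind of discrepancy a sensitivity analysis would vary psi2 to explore.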
Examples of this approach for pattern-mixture models are given in Rubin (1977), Little (1994), Little and Wang (1996), and Scharfstein, Robins and Rotnitzky (1999).

An alternative to simply accepting high rates of potentially nonignorable missing data for financial variables such as income is to use special questionnaire formats that are designed to collect a bracketed observation whenever a respondent is unable or unwilling to provide an exact response to a financial amount question. Heeringa, Little and Raghunathan (2002) describe a Bayesian MI method for multivariate bracketed data on household assets in the Health and Retirement Survey. The theoretical underpinning of these methods involves the extension of the formulation of missing-data problems via the joint distribution of y_U, i_U and r_s in Section 18.1 to more general incomplete-data problems involving coarsened data (Heitjan and Rubin, 1991; Heitjan, 1994). Full and ignorable likelihoods can be defined for this more general setting.

18.5. CONCLUSION

This chapter is intended to provide some indication of the generality and flexibility of the Bayesian approach to surveys subject to unit and item nonresponse. The unified conceptual basis of the Bayesian paradigm is very appealing, and computational tools for implementing the approach are becoming increasingly available in the literature. What is needed to convince practitioners are more applications such as that described in Example 3, more understanding of useful baseline 'reference' models for complex multistage survey designs, and more accessible, polished, and well-documented software: for example, SAS (2001) now has procedures (PROC MI and PROC MIANALYZE) for creating and analysing multiply-imputed data. I look forward to further developments in these directions in the future.

ACKNOWLEDGEMENTS

This research is supported by grant DMS-9803720 from the National Science Foundation.
CHAPTER 19

Estimation for Multiple Phase Samples

Wayne A. Fuller

19.1. INTRODUCTION

Two-phase sampling, also called double sampling, is used in surveys of many types, including forest surveys, environmental studies, and official statistics. The procedure is applicable when it is relatively inexpensive to collect information on a vector denoted by x, relatively expensive to collect information on the vector y of primary interest, and x and y are correlated. In the two-phase sample, the vector x is observed on a large sample and (x, y) is observed on a smaller sample. The procedure dates from Neyman (1938). Rao (1973) extended the theory and Cochran (1977, Ch. 12) contains a discussion of the technique. Hidiroglou (2001) presents results for different configurations for the two phases.

We are interested in constructing an estimator for a large vector of characteristics using data from several sources and/or several phases of sampling. We will concentrate on the use of the information at the estimation stage, omitting discussion of use at the design stage.

Two types of two-phase samples can be identified on the basis of sample selection. In one type, a first-phase sample is selected, some characteristics of the sample elements are identified, and a second-phase sample is selected using the characteristics of the first-phase units as controls in the selection process. A second type, and the type of considerable interest to us, is one in which a first-phase sample is selected and a rule for selection of second-phase units is specified as part of the field procedure. Very often the selection of second-phase units is not a function of first-phase characteristics. One example of the second type is a survey of soil properties conducted by selecting a large sample of points. At the same time the large sample is selected, a subset of the points, the second-phase sample, is specified.
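For estimation from such a design, each second-phase unit's total selection probability is the product of its first-phase probability and its conditional second-phase probability, and a population total can be estimated by inverse-probability weighting. A minimal sketch with hypothetical toy data:

```python
# Hypothetical two-phase sample: phase 1 observes x cheaply on four units,
# phase 2 observes y on a subset of them.
phase1 = {
    # unit id: (pi_1t, x_t) -- first-phase selection probability and x value
    1: (0.50, 2.0),
    2: (0.50, 3.0),
    3: (0.25, 1.0),
    4: (0.25, 4.0),
}
phase2 = {
    # unit id: (pi_2t_given_1, y_t) -- conditional second-phase probability and y
    1: (0.5, 10.0),
    3: (0.5, 6.0),
}

# Double-expansion estimator of the y-total: weight by 1 / (pi_1t * pi_{2t|1})
y_total_hat = sum(y / (phase1[t][0] * p2) for t, (p2, y) in phase2.items())

# First-phase estimator of the x-total uses only pi_1t
x_total_hat = sum(x / p1 for p1, x in phase1.values())
```

A regression estimator of the kind developed later in the chapter would combine x_total_hat with the second-phase (x, y) data to improve on the plain double-expansion estimator.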
In the field operation, a small set of data is collected from the first-phase sample and a larger set is collected from the second-phase sample. A second example of the second type is that of a population census in which most individuals receive a short form, but a subsample receives a long form with more data elements.

[Analysis of Survey Data. Edited by R. L. Chambers and C. J. Skinner. Copyright 2003 John Wiley & Sons, Ltd. ISBN: 0-471-89987-9]

The sample probability-of-selection structure of the two-phase sample is sometimes used for a survey containing item nonresponse. Item nonresponse is the situation in which respondents provide information for some, but not all, items on the questionnaire. The use of the two-phase model for this situation has been discussed by Särndal and Swensson (1987) and Rao and Sitter (1995). A very similar situation is a longitudinal survey in which respondents do not respond at every point of the survey. Procedures closely related to multiple phase estimation for these situations have been discussed by Fuller (1990, 1999). See also Little and Rubin (1987).

Our objective is to produce an easy-to-use dataset that meets several criteria. Generally speaking, an easy-to-use dataset is a file of complete records with associated weights such that linear estimators are simple weighted sums. The estimators should incorporate all available information, and should be design consistent for a wide range of population parameters at aggregate levels, such as states. The dataset will be suitable for analytic uses, such as comparison of domains, the computation of regression equations, or the computation of the solutions to estimating equations. We also desire a dataset that produces reasonable estimates for small areas, such as a county. A model for some of the small-area parameters may be required to meet reliability objectives for the small areas.

19.2. REGRESSION ESTIMATION

19.2.1.
Introduction

Our discussion proceeds under the model in which the finite population is a sample from an infinite population of (x_t, y_t) vectors. It is assumed that the vectors have finite superpopulation fourth moments. The members of the finite population are indexed with the integers U = {1, 2, ..., N}. We let (μ_x, μ_y) denote the mean of the superpopulation vector and let (x̄_U, ȳ_U) denote the mean of the finite population. The set of integers that identify the sample is the set s. In a two-phase sample, we let s_1 be the set of elements in the first phase and s_2 be the set of elements in the second phase. Let there be n_1 units in the first sample and n_2 units in the second sample. When we discuss consistency, we assume that n_1 and n_2 increase at the same rate.

Assume a first-phase sample is selected with selection probabilities π_1t. A second-phase sample is selected by a procedure such that the total probability of selection is π_2t. Thus, the total probability of being selected for the second-phase sample can be written

    π_2t = π_1t π_{2t|1},   (19.1)

[...] discussed by Särndal, Swensson and Wretman (1992, Ch. 9). Expression (19.17) is not always easy to implement. See Kott (1990, 1995), Rao and Shao (1992), Breidt and Fuller (1993), Rao and Sitter (1995), Rao (1996), and Binder (1996) for discussions of variance estimation. We now consider another approach to variance estimation. We can expand the estimator (19.3) in a first-order Taylor expansion to obtain [...]

[...] contextual analysis in which the characteristics of the group in which an individual is located are included in the analysis of individual-level survey data (Riley, 1964; Boyd and Iversen, 1979). A more recent response to the concern of analysing individuals out of context has been the development of multi-level modelling (Goldstein, 1987, 1995). Here [...]
[...] observation is then one-half of the original observation. Kim and Fuller (1999) demonstrate how to construct replicates using the fractionally imputed data to compute an estimated variance.

CHAPTER 20

Analysis Combining Survey and Geographically Aggregated Data

D. G. Steel, M. Tranmer and ...

[...] Aggregate data are used because individual-level data cannot be made available due to confidentiality requirements, as in the analysis of electoral data (King, 1997), or because aggregate data are readily available at little or no cost. Achen and Shively (1995) noted that the use of aggregate data was a popular approach in social research prior to the Second World War, but the rapid development of sample surveys [...] individual-level data on auxiliary variables in the analysis of aggregate data.

20.2. AGGREGATE AND SURVEY DATA AVAILABILITY

Consider two data sources, the first based on a sample of units, s_0, from which unit-level data are available, and the second on the sample, s_1, from which aggregate data are available. To simplify the discussion and focus on the role of survey and [...]

[...] discussed in the imputation literature and as a special topic by others. See Rubin (1987), Little and Rubin (1987), Kalton and Kasprzyk (1986), Chambers and Dunstan (1986), Nascimento Silva and Skinner (1995), Brick and Kalton (1996), and Rao (1996). The majority of the imputation procedures require a model specification.

19.3. REGRESSION ESTIMATION WITH IMPUTATION

Our [...]
[...] group. Cohen (1995) suggested that when costs depend on the number of groups and the number of individuals in the sample, then in many cases an optimal number of units to select per group is approximately proportional to the square root of the ratio of unit-level to group-level variance components. This is also the optimal choice of sample units [...]

[...] (19.42), where ŷ is the estimated total of y, x̂_1c is the phase 1 estimator of the total of x_t for cell c, x is the indicator for urban in 1992, and x̄_2c is the estimated mean of cell c for the second phase. The second term on the right side of (19.42) is the estimator of the conditional variance of the regression estimator given the first-phase sample. In this example, the regression estimator of [...]

[...] Bureau of Statistics (ABS) releases such data for many household surveys in this way. An important example is the release of data for samples of households and individuals from the population census, which is done in several countries, including Australia, the UK and the USA. Case (3) corresponds to purely aggregate data. As noted before, analysis [...] unit-level survey data containing group indicators and aggregate data. Case (5) corresponds to having unit-level survey data without group indicators and aggregate data. For these cases the impact of different relative values of m_0, n_0, n̄_0 and m_1, n_1, n̄_1 will be investigated.

20.3. BIAS AND VARIANCE OF VARIANCE COMPONENT ESTIMATORS BASED ON AGGREGATE AND SURVEY DATA