QuickStudy Academic: Statistics


STATISTICS: The study of methods for collecting, organizing, and analyzing data
• Descriptive statistics: Procedures used to organize and present data in a convenient and communicable form
• Inferential statistics: Procedures employed to arrive at broader conclusions or inferences about populations on the basis of samples

POPULATION: The complete set of actual or potential elements about which inferences are made

SAMPLE: A subset of the population selected using some sampling method
• Sampling methods
- Cluster sample: The population is divided into groups called clusters; some clusters are randomly selected, and every member in them is observed
- Stratified sample: The population is divided into strata, and a fixed number of elements of each stratum are selected for the sample
- Simple random sample: A sample selected so that each possible sample of the same size has an equal probability of being selected; used for most elementary inference

VARIABLE: An attribute of elements of a population or sample that can be measured; ex: height, weight, IQ, hair color, and pulse rate are some of the many variables that can be measured for people

DATA: Values of variables that have been observed
• Types of data
- Qualitative (or "categorical") data are descriptive but not numeric; ex: your gender, your birthplace, the color of an automobile
- Quantitative data take numeric values
- Discrete data take counting numbers (0, 1, 2, ...) as values, usually representing things that can be counted; ex: the number of fleas on a dog, the number of times a professor is late in a semester
- Continuous data can take a range of numeric values, not just counting numbers; ex: the height of a child, the weight of a bag of beans, the amount of time a professor is late
• Levels of measurement
- Qualitative data can be measured at the:
  • Nominal level: Values are just names, without any order; ex: color of a car, major in college
  • Ordinal level: Values have some natural order; ex: high school class (freshman/sophomore/junior/senior), military rank
- Quantitative data can be measured at the:
  • Interval level: Numeric data with no natural zero point; intervals (differences) are meaningful, but ratios are not; ex: temperature in Fahrenheit degrees; 80°F is 20°F hotter than 60°F, but it is not "133% as hot" (ratios of Fahrenheit temperatures are not meaningful)
  • Ratio level: Numeric data for which there is a true zero; both intervals and ratios are meaningful; ex: weight, length, duration, most physical properties

STATISTIC: A numeric measure computed from sample data, used to describe the sample and to estimate the corresponding population parameter

PARAMETER: A numeric measure that describes a population; parameters are usually not computed, but are inferred from sample statistics

FREQUENCY DISTRIBUTION
Provides the frequency (number of times observed) of each value of a variable
• Relative frequency distribution: Each frequency is divided by the total number of observations to produce the proportion or percentage of the data set having that value; ex: third column of Table 1
• Cumulative frequency distribution: Frequencies count all observations at a particular value or class and all those less; ex: third column of Table 2
• Grouped frequency distribution: Values of the variable are grouped into classes; ex: Table 2

Table 1: Students in a driving class are polled regarding the number of accidents they've had:

  x (# of accidents) | f (frequency) | RF (relative frequency)
  0                  | 12            | 0.2105
  1                  | 16            | 0.2807
  2                  | 15            | 0.2632
  3                  | 9             | 0.1579
  4                  | 3             | 0.0526
  5                  | 2             | 0.0351
  (n = 57)

Table 2: The scores on a midterm exam are grouped into classes:

  class | f  | cumulative freq
  90-99 | 4  | 80
  80-89 | 18 | 76
  70-79 | 31 | 58
  60-69 | 19 | 27
  50-59 | 7  | 8
  40-49 | 1  | 1

MEASURES OF CENTRAL TENDENCY
MEAN: Most commonly used measure of central tendency, usually what is meant by "average"; sensitive to extreme values
• Population mean: μ = (Σxᵢ)/N
• Sample mean: x̄ = (Σxᵢ)/n
• Trimmed mean: Computed by discarding some number of the highest and lowest values; less sensitive to extreme values than the ordinary mean
• Weighted mean: Computed with a weight multiplied to each value, making some values influence the mean more heavily than others: x̄w = (Σwᵢxᵢ)/(Σwᵢ)

MEDIAN: Value that divides the set so the same number of observations lie on each side of it; less sensitive to extreme values; for an odd number of values, it is the middle value; for an even number, it is the average of the middle two; ex: in Table 1, the median is the average of the 28th and 29th observations, or 1.5

MODE: Observation that occurs with the greatest frequency; ex: in Table 1, the mode is 1

MEASURES OF DISPERSION
SUM OF SQUARES (SS): The sum of squared deviations from the mean
• Population SS: Σ(xᵢ − μₓ)², or equivalently Σxᵢ² − (Σxᵢ)²/N
• Sample SS: Σ(xᵢ − x̄)², or equivalently Σxᵢ² − (Σxᵢ)²/n

VARIANCE: The average of squared differences between observations and their mean
• Population variance: σ² = (1/N)·Σ(xᵢ − μ)²
• Sample variance: s² = (1/(n − 1))·Σ(xᵢ − x̄)²
• Variances for grouped data (fᵢ = frequency and mᵢ = midpoint of class i, summed over the G classes):
- Population: σ² = (1/N)·Σfᵢ(mᵢ − μ)²
- Sample: s² = (1/(n − 1))·Σfᵢ(mᵢ − x̄)²

STANDARD DEVIATION: The square root of the variance; unlike variance, it has the same units as the original data and is more commonly used

STANDARD SCORES: Also known as z-scores; the standard score of a value is the directed number of standard deviations from the mean at which the value is found; that is, z = (x − μ)/σ
• A positive z-score indicates a value greater than the mean; a negative z-score indicates a value less than the mean; a z-score of zero indicates the mean value
• Converting every value in a data set or distribution to a z-score is called standardization; once a data set or distribution has been standardized, it has a new mean μ = 0 and a new standard deviation σ = 1
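These definitions translate directly into a few lines of code; a minimal Python sketch (standard library only; the pulse-rate values are hypothetical) computing the mean, median, mode, sample variance, standard deviation, and z-scores of a small data set:

```python
import math
from collections import Counter

pulse = [64, 68, 70, 72, 72, 75, 80]            # hypothetical pulse rates (n = 7)

n = len(pulse)
mean = sum(pulse) / n                            # sample mean: (sum of x_i) / n

srt = sorted(pulse)
# median: middle value for odd n, average of the two middle values for even n
median = srt[n // 2] if n % 2 else (srt[n // 2 - 1] + srt[n // 2]) / 2

mode = Counter(pulse).most_common(1)[0][0]       # value with the greatest frequency

ss = sum((x - mean) ** 2 for x in pulse)         # sample sum of squares
var = ss / (n - 1)                               # sample variance s^2 (n - 1 divisor)
sd = math.sqrt(var)                              # standard deviation, same units as data

z = [(x - mean) / sd for x in pulse]             # standardization: mean 0, sd 1

print(f"mean={mean:.2f} median={median} mode={mode} s={sd:.2f}")
print("z-scores:", [f"{v:+.2f}" for v in z])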
GRAPHING TECHNIQUES
BAR GRAPH: A graph that uses bars to indicate the frequency of occurrence of observations
• Histogram: A bar graph used with quantitative, continuous variables

FREQUENCY CURVE: A graph representing a frequency distribution in the form of a continuous line that traces a histogram
• Cumulative frequency curve: A continuous line that traces a histogram where bars in all the lower classes are stacked up in the adjacent higher class; cannot have a negative slope
• Symmetric curve: The frequency curve is unchanged if rotated around its center; median = mean
• Normal curve: Bell-shaped curve; symmetric
• Skewed curve: Deviates from symmetry; the frequency curve is shifted, with a longer "tail" to the left (mean < median) or to the right (mean > median)
[Figure: symmetric (normal) and skewed frequency curves, horizontal axis marked from -10 to +10]

PROBABILITY
A measure of the likelihood of a random event; the long-term relative frequency with which an outcome or event occurs

P(A) = (number of outcomes favoring Event A) / (total number of outcomes)

• Sample space: All possible simple outcomes of an experiment
• Relationships between events
- Exhaustive: Two or more events are said to be exhaustive if they represent all possible outcomes; symbolically, P(A or B or ...) = 1
- Non-exhaustive: Two or more events are said to be non-exhaustive if they do not exhaust all possible outcomes
- Mutually exclusive: Events that cannot occur simultaneously: P(A and B) = 0, and P(A or B) = P(A) + P(B); ex: males, females
- Non-mutually exclusive: Events that can occur simultaneously: P(A or B) = P(A) + P(B) − P(A and B); ex: males, brown eyes
- Independent: Events whose probability is unaffected by the occurrence or nonoccurrence of each other: P(A|B) = P(A), P(B|A) = P(B), and P(A and B) = P(A)P(B); ex: gender and eye color
- Dependent: Events whose probability changes depending upon the occurrence or nonoccurrence of each other: P(A|B) differs from P(A), P(B|A) differs from P(B), and P(A and B) = P(A)·P(B|A) = P(B)·P(A|B); ex: race and eye color
• Joint probabilities: Probability that two or more events occur simultaneously
• Marginal (or unconditional) probabilities: Obtained by summing joint probabilities
• Conditional probabilities: Probability of A given the existence of S, written P(A|S)
• Ex: Given the numbers 1 to ... as observations in a sample space:
- Events mutually exclusive and complementary; ex: P(all odd numbers), P(all even numbers)
- Events mutually exclusive but not complementary; ex: P(an even number), P(the numbers ... and 5)
- Events neither mutually exclusive nor exhaustive; ex: P(an even number or a 2)

• A random variable takes numeric values randomly, with probabilities specified by a probability distribution (or density) function
• Discrete random variables: Take only distinct values (as with discrete quantitative data)
• Binomial distribution: A model for the number (x) of successes in a series of n independent trials, where each trial results in success with probability p or failure with probability 1 − p; ex: the number (x) of heads ("successes") obtained in 12 (n) tosses of a fair (probability of heads = p = 0.5) coin

P(x) = nCx · p^x · (1 − p)^(n − x)

where P(x) is the probability of exactly x successes out of n trials with a constant probability p of success on each trial, and nCx = n!/((n − x)!·x!)
- Binomial mean: μ = np
- Binomial variance: σ² = np(1 − p)
- As n increases, the binomial approaches the normal distribution
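Before moving on to the other discrete distributions, a quick sketch of the binomial formula, using the coin example above (n = 12 tosses, p = 0.5); math.comb computes nCx:

```python
import math

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(x) = nCx * p**x * (1-p)**(n-x): probability of exactly x successes."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 12, 0.5                                # 12 tosses of a fair coin
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

print(f"P(6 heads) = {pmf[6]:.4f}")           # most likely single outcome, ~0.2256
print(f"total probability = {sum(pmf):.4f}")  # sums to 1 over x = 0..n
print(f"mean = {sum(x * pmf[x] for x in range(n + 1)):.1f} (np = {n * p})")
```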
• Hypergeometric distribution:
- Represents the number of successes from a series of n trials where each trial results in success or failure
- Like the binomial, except that each trial is drawn from a small population with N elements split between N₁ successes and N₂ failures
- The probability of splitting the n trials between x₁ successes and x₂ failures is:
P(x₁ and x₂) = [N₁!/(x₁!(N₁ − x₁)!)] · [N₂!/(x₂!(N₂ − x₂)!)] / [N!/(n!(N − n)!)]
- Hypergeometric mean: μ₁ = E(x₁) = nN₁/N, and variance: σ² = [(N − n)/(N − 1)]·(nN₁/N)·(N₂/N)

• Poisson distribution: A model for the number of occurrences of an event x = 0, 1, 2, ..., counted over some fixed interval of space or time rather than some fixed number of trials; the parameter λ is the average number of occurrences:
P(x) = e^(−λ)·λ^x/x!  for x = 0, 1, 2, 3, ... and λ > 0; otherwise P(x) = 0
- Poisson mean and variance: both equal λ

FREQUENCY TABLE

           | Event C | Event D | Totals
  Event E  |   52    |   35    |   87
  Event F  |   62    |   71    |  133
  Totals   |  114    |  106    |  220

Ex: Joint probability of C and E: P(C & E) = 52/220 = 0.24

JOINT, MARGINAL & CONDITIONAL PROBABILITY TABLE

           | Event C | Event D | Marginal probability
  Event E  |  0.24   |  0.16   |  0.40
  Event F  |  0.28   |  0.32   |  0.60
  Marginal |  0.52   |  0.48   |  1.00

Conditional probabilities:
  P(C|E) = 0.60, P(D|E) = 0.40; P(C|F) = 0.47, P(D|F) = 0.53
  P(E|C) = 0.46, P(F|C) = 0.54; P(E|D) = 0.33, P(F|D) = 0.67

Continuous random variables:
- A continuous random variable may take on any value along an uninterrupted interval of a number line
- Probabilities are measured only over intervals, never for single values; the probability that a continuous random variable falls between two values is exactly equal to the area under the density curve between those two values

• Normal distribution: Bell curve; a distribution whose values cluster symmetrically around the mean (also median and mode); common in nature and important in making inferences
- The density curve is the graph of:
f(x) = (1/(σ√(2π))) · e^(−(x − μ)²/(2σ²))
where f(x) = frequency at a given value, σ = standard deviation of the normal distribution, μ = the mean of the normal distribution, and x = value of the normally distributed variable
• Standard normal distribution: A normal distribution with a mean of 0 and a standard deviation of 1; values following a normal distribution can be transformed to the standard normal distribution by using z-scores [see Measures of Dispersion, page 1]

STATISTICAL INFERENCE
• In order to make inferences about a population, which is unobserved, a random sample is drawn
- The sample is used to compute statistics, which are then used to draw probability conclusions about the parameters of the population
- Schematically: the population (unobserved), measured by parameters (unknown), yields via random sampling a sample (observed), measured by statistics (known), from which statistical inference runs back to the population

BIASED & UNBIASED ESTIMATORS
• Unbiased estimator of a parameter: An estimator (sample statistic) with an average value equal to the value of the parameter; ex: the sample mean is an unbiased estimator of the population mean; the average value of all possible sample means is the population mean; all other factors being equal, an unbiased estimator is preferable to a biased one
• Biased estimator of a parameter: An estimator (sample statistic) that does not, on average, equal the value of the parameter; ex: the median is a biased estimator, since the average of sample medians is not always equal to the population median; the variance calculated from a sample by dividing by n is a biased estimator of the population variance; however, when calculated with n − 1, it is unbiased
- Note: Estimators themselves present only one source of bias; even when an unbiased estimator is used, bias in the sample (elements not all equally likely to be chosen) may still be present
- Elementary methods of inference assume unbiased sampling
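A minimal simulation sketch of these ideas (the uniform population, sample size, and trial count are arbitrary choices): averaged over many samples, the sample mean tracks the population mean, and the n − 1 variance tracks the population variance, while the divide-by-n variance runs low.

```python
import random
import statistics

random.seed(1)
population = [random.uniform(0, 6) for _ in range(10_000)]  # arbitrary population
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)                   # divides by N

n, trials = 5, 50_000
mean_sum = var_n_sum = var_n1_sum = 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    xbar = statistics.fmean(sample)
    ss = sum((x - xbar) ** 2 for x in sample)
    mean_sum += xbar
    var_n_sum += ss / n            # biased: divides by n
    var_n1_sum += ss / (n - 1)     # unbiased: divides by n - 1

print(f"population mean {mu:.3f};  average sample mean {mean_sum / trials:.3f}")
print(f"population variance {sigma2:.3f};  avg s^2 with n {var_n_sum / trials:.3f};"
      f"  avg s^2 with n-1 {var_n1_sum / trials:.3f}")
```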
• Sampling distribution: The probability distribution of a sample statistic that would result from drawing all possible samples of a given size from some population; because samples are drawn at random, every sample statistic is a random variable and has a probability distribution that can be described using a mean and standard deviation
• Standard error: The standard deviation of the estimator; do not confuse this with the standard deviation of the sample itself; the standard error measures the variability in the estimates around their expected value, while the standard deviation of the sample reflects the variability within the sample around the sample mean
- The standard deviation of all possible sample means of a given sample size, drawn from the same population, is called the standard error of the sample mean
- If the population standard deviation σ is known, the standard error is: σx̄ = σ/√n
- Usually, the population standard deviation σ is unknown and is estimated by s; in this case, the estimated standard error is: sx̄ = s/√n
- Note: In either case, the standard error of the sample mean decreases as sample size is increased; a larger sample provides more reliable information about the population
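A short sketch illustrating the last note: the spread of the sample mean shrinks like σ/√n as the sample size grows (the normal population with σ = 10 is assumed for illustration):

```python
import random
import statistics

random.seed(2)
sigma, trials = 10.0, 20_000

for n in (4, 16, 64):
    # empirical SD of the sample mean over many samples vs. the formula sigma/sqrt(n)
    means = [statistics.fmean(random.gauss(50, sigma) for _ in range(n))
             for _ in range(trials)]
    print(f"n={n:3d}  empirical SE={statistics.stdev(means):.3f}  "
          f"sigma/sqrt(n)={sigma / n ** 0.5:.3f}")
```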
HYPOTHESIS TESTING
• In a hypothesis test, sample data is used to accept or reject a null hypothesis (H₀) in favor of an alternative hypothesis (H₁); the significance level at which the null hypothesis can be rejected indicates how much evidence the sample provides against the null hypothesis
• Null hypothesis (H₀): Always specifies a value (the null hypothesis value) for a population parameter; the null hypothesis is assumed to be true, and this assumption underlies the computations for the hypothesis test; ex: H₀: "a coin is unbiased," that is, the proportion of heads is 0.5: H₀: p = 0.5
• Alternative hypothesis (H₁): Never specifies a value for a parameter; the alternative hypothesis states that a population parameter has some value different from the one specified under the null hypothesis; ex: H₁: a coin is biased, that is, the proportion of heads is not 0.5: H₁: p ≠ 0.5
- Two-tailed (or nondirectional): An alternative hypothesis (H₁) that states only that the population parameter is different from the one specified under H₀; two-tailed probability is employed; ex: to use sample data to test whether the population mean pulse rate is different from 65, we would use the two-tailed hypothesis test H₀: μ = 65 vs. H₁: μ ≠ 65
- One-tailed (or directional): An alternative hypothesis (H₁) that states that the population parameter is greater than (right-tailed) or less than (left-tailed) the value specified under H₀; one-tailed probability is employed; ex: to use sample data to test whether the population mean pulse rate is greater than 65, we would use the right-tailed hypothesis test H₀: μ = 65 vs. H₁: μ > 65
• The alternative hypothesis H₁ is also sometimes known as the "research hypothesis," as only claims expressed as alternative hypotheses can be positively asserted
• Level of significance: The probability of observing sample results as extreme or more extreme than those actually observed, under the assumption that the null hypothesis is true; if this probability is small enough, we conclude there is sufficient evidence to reject the null hypothesis; two basic approaches:
- Fixed significance level (traditional method): A level of significance α is predetermined; commonly used significance levels are 0.01, 0.05, and 0.10; the smaller the significance level α, the higher the standard for rejecting H₀; critical value(s) for the test statistic are determined such that the probability of the test statistic being farther from zero than the critical value (in one or two tails, depending on H₁) is α; if the test statistic falls beyond the critical value, in the rejection region, then H₀ can be rejected at that fixed significance level α
- Observed significance level (p-value method): The test statistic is computed using the sample data, then the appropriate probability distribution is used to find the probability of observing a sample statistic that differs at least that much from the null hypothesis value for the population parameter (the probability value, or p-value); the smaller the p-value, the better the evidence against H₀; this method is more commonly used by computer applications; the p-value also represents the smallest significance level α at which H₀ can be rejected; thus, p-value results can be used with a fixed significance level by rejecting H₀ if p-value ≤ α
• Generally, the larger (farther from zero, positive or negative) the value of the test statistic, the smaller the p-value will be, providing better evidence against the null hypothesis in favor of the alternative
• Notion of indirect proof: Through traditional hypothesis testing, the null hypothesis can never be proven true; ex: if we toss a coin 200 times and heads comes up exactly 100 times, we have no evidence the coin is biased, but we cannot prove the coin is fair because of the random nature of sampling; it is possible to flip an unfair coin 200 times and get exactly 100 heads, just as it is possible to draw a sample from a population with mean 104.5 and find a sample mean of 101; failing to reject the null hypothesis does not prove it true, and rejecting it does not prove it false
• Two types of errors
- Type I error: Rejecting H₀ when it is actually true; the probability of a type I error is given by the significance level α; type I error is generally more prominent, as it can be controlled
- Type II error: Failing to reject H₀ when it is actually false; the probability of a type II error is denoted β; type II error is often (unwisely) disregarded; it is difficult to measure or control, as β depends on the true value of the parameter in question, which is not known

                   | H₀ true          | H₀ false
  Reject H₀        | Type I error (α) | Correct (1 − β)
  Accept H₀        | Correct (1 − α)  | Type II error (β)
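Both approaches reduce to a single comparison once the p-value is known; a sketch for a z test statistic, using the standard normal CDF built from math.erf (the observed z and α below are hypothetical):

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value(z: float, tail: str) -> float:
    if tail == "right":
        return 1.0 - phi(z)
    if tail == "left":
        return phi(z)
    return 2.0 * (1.0 - phi(abs(z)))      # two-tailed: both outer areas

alpha, z = 0.05, 2.10                      # hypothetical observed test statistic
p = p_value(z, "two")
# p-value method and fixed-level method agree: reject H0 exactly when p <= alpha
print(f"p = {p:.4f}; reject H0 at alpha={alpha}? {p <= alpha}")
```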
CENTRAL LIMIT THEOREM (for the sample mean x̄)
If x₁, x₂, x₃, ..., xₙ is a simple random sample of n elements from a large (infinite) population with mean μ and standard deviation σ, then the distribution of x̄ takes on the bell shape of a normal random variable as n increases, and the distribution of the ratio
z = (x̄ − μ)/(σ/√n)
approaches the standard normal distribution as n goes to infinity; in practice, a normal approximation is acceptable for samples of size 30 or larger

INFERENCE FOR A POPULATION MEAN USING THE z-STATISTIC (σ KNOWN)
Requires that the sample be drawn from a normal distribution or have a sample size (n) of at least 30
• Used when the population standard deviation σ is known: if σ is known (treated as a constant, not random) and the above conditions are met, then the distribution of the sample mean follows a normal distribution, and the test statistic z follows a standard normal distribution; note that this is rarely the case in reality, and the t-distribution is more widely used
• The test statistic is z = (x̄ − μ)/σx̄, where μ = population mean (either known or hypothesized under H₀) and σx̄ = σ/√n
• Critical region: The portion of the area under the curve that includes those values of the test statistic that provide sufficient evidence for the rejection of the null hypothesis
- The most often used significance levels are 0.01, 0.05, and 0.10; for a one-tailed test using the z-statistic, these correspond to z-values of 2.33, 1.65, and 1.28, respectively (positive for a right-tailed test, negative for a left-tailed test)
- For a two-tailed test, the critical region for α = 0.01 is split into two equal outer areas marked by z-values of ±2.58; for α = 0.05, the critical values of z are ±1.96, and for α = 0.10, the critical values are ±1.65
• Ex 1: Given a population with σ = 50, a simple random sample of n = 100 values is chosen, with a sample mean x̄ of 255; test, using the p-value method, H₀: μ = 250 vs. H₁: μ > 250; is there sufficient evidence to reject the null hypothesis?
- In this case, the test statistic z = (255 − 250)/(50/√100) = 1.00
- Looking at Table A, the area given for z = 1.00 is 0.3413; the area to its right (since H₁ is ">", this is a right-tailed test) is 0.5 − 0.3413 = 0.1587, or 15.87%
- This is the p-value: the probability, if H₀ is true (that is, if μ = 250), of obtaining a sample mean of 255 or greater; it also represents the smallest significance level α at which H₀ can be rejected
- Since, even if H₀ is true, the probability of obtaining a sample mean ≥ 255 from this population with a sample of size n = 100 is about 16%, it is quite plausible that H₀ is true; there is not very good evidence to support the alternative hypothesis that the population mean is greater than 250, so we fail to reject H₀
- H₀ can't even be rejected at the weakest common significance level, α = 0.10, since 0.1587 > 0.10; remember, this doesn't prove the population mean to be equal to 250; we just haven't accumulated sufficient evidence against the claim
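A sketch reproducing Ex 1 (Ex 2, which follows, uses the same pattern with a two-tailed p-value):

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Ex 1: H0: mu = 250 vs H1: mu > 250, sigma known
xbar, mu0, sigma, n = 255, 250, 50, 100
z = (xbar - mu0) / (sigma / math.sqrt(n))   # = 1.00
p = 1.0 - phi(z)                            # right-tailed area

print(f"z = {z:.2f}, p-value = {p:.4f}")    # p ~ 0.1587: fail to reject H0
```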
• Ex 2: A simple random sample of size n = 25 is taken from a population following a normal distribution with σ = 15; the sample mean x̄ is 95; use the p-value method to test H₀: μ = 100 vs. H₁: μ ≠ 100; is there sufficient evidence to reject the claim that the population mean is 100 at a significance level α of 0.10? At α = 0.05?
- In this case, the test statistic z = (95 − 100)/(15/√25) = −5/3 = −1.67
- Since the normal curve is symmetric, we can look up a z-score of 1.67; the value in Table A is 0.4525; that is, P(0 < z < 1.67) = P(−1.67 < z < 0) = 0.4525
- Thus, P(z < −1.67) = P(z > 1.67) = 0.5 − 0.4525 = 0.0475
- Since this is a two-tailed test (H₁: μ ≠ 100), the p-value is twice this area, or 0.095
- Since the p-value = 0.095 < 0.10 = α, there is sufficient evidence to reject the null hypothesis at a significance level α of 0.10; but in the second case, the p-value = 0.095 > 0.05 = α, so the sample data are not strong enough to reject at the stricter (0.05) level of significance

[Table A: Normal Curve Areas; gives the area under the standard normal curve from the mean (z = 0) to a positive z, for z = 0.00 to 3.09; e.g., the entry for z = 1.00 is 0.3413 and for z = 1.67 is 0.4525; a standard z-table, not reproduced here]

• Ex 3 (a preview of the t-statistic, described in the next section): A simple random sample of size 25 is taken from a population following a normal distribution; the sample mean is 42 and the sample standard deviation is 7.5; test at a fixed significance level α = 0.05: H₀: μ = 45 vs. H₁: μ < 45
- This is a left-tailed test (H₁: μ < 45), so the critical value and rejection region will be negative
- Consulting Table B to find the appropriate critical value, with df = n − 1 = 24, produces a critical value of −1.711; the null hypothesis can be rejected at α = 0.05 if the value of the test statistic t < −1.711
- The test statistic t = (42 − 45)/(7.5/√25) = −3/1.5 = −2; since this is less than the critical value of −1.711, H₀ is rejected at α = 0.05
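A sketch reproducing the t-test in Ex 3, assuming SciPy is available for t-distribution quantiles and tail areas:

```python
import math
from scipy import stats   # assumes SciPy is available

# Left-tailed t-test: H0: mu = 45 vs H1: mu < 45
xbar, mu0, s, n = 42, 45, 7.5, 25
t = (xbar - mu0) / (s / math.sqrt(n))   # = -2.0
df = n - 1                              # = 24

t_crit = stats.t.ppf(0.05, df)          # left-tail critical value, ~ -1.711
p = stats.t.cdf(t, df)                  # left-tailed p-value

print(f"t = {t:.2f}, critical = {t_crit:.3f}, p = {p:.4f}")
print("reject H0" if t < t_crit else "fail to reject H0")
```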
INFERENCE FOR A POPULATION MEAN USING THE t-STATISTIC (σ UNKNOWN)
Requires that the sample be drawn from a normal distribution or have a sample size (n) of at least 30
• When σ is not known, as is usually the case, it is estimated from s, the sample standard deviation
• Because of the variability of both estimates (the sample mean as well as the sample standard deviation), the test statistic follows not a z-distribution but a t-distribution
• Comparison between t- and z-distributions
- Although both distributions are symmetric about a mean of zero, the t-distribution is more spread out than the normal distribution, producing a larger critical value of t as the boundary for the rejection region
- The t-distribution is characterized by its degrees of freedom (df), referring to the number of values that are free to vary after placing certain restrictions on the data
- For example, if we know that a sample of size 4 produces a mean of 87, we know that the sum of the numbers is 4 × 87 = 348; this tells us nothing about the individual values in the sample (there are an infinite number of ways to get four numbers to add up to 348), but as soon as we've chosen three of them, the fourth is determined; for instance, the first number might be 84, the second 98, and the third 81; but if the first three numbers are 84, 98, and 81, then the fourth must be 85, the only number producing the known sample mean; that is, there are n − 1 = 3 degrees of freedom in this example
- For a test about a population mean, the t-statistic follows a t-distribution with n − 1 df
- As df increases, the t-distribution approaches the standard normal z-distribution
• The test statistic t used for testing hypotheses about a population mean is:
t = (x̄ − μ)/sx̄, where μ = population mean under H₀ and sx̄ = s/√n
- Note: This is not so different from the test statistic z used when σ is known!

[Table B: Critical Values of t; rows are df = 1 to 30 plus ∞; columns are one-tailed α = 0.10, 0.05, 0.025, 0.01, 0.005 (equivalently two-tailed α = 0.20, 0.10, 0.05, 0.02, 0.01); entries give the t-value with area α to its right; e.g., df = 24, one-tailed α = 0.05 gives 1.711; df = 25, two-tailed α = 0.05 gives 2.060; full table not reproduced here]

Note: The t-distribution is a robust alternative to the z-distribution when testing for the population mean: inferences are likely to be valid even if the population distribution is far from normal; however, the larger the departure from normality in the population, the larger the sample size needed for a valid hypothesis test using either distribution

CONFIDENCE INTERVALS
Confidence interval: An interval within which a population parameter is likely to be found; determined by sample data and a chosen level of confidence, 1 − α (α refers to the level of significance)
• Common confidence levels are 90%, 95%, and 99%, just as common levels of significance are 0.10, 0.05, and 0.01
• The (1 − α) confidence interval for μ is:
x̄ − z(α/2)·(σ/√n) ≤ μ ≤ x̄ + z(α/2)·(σ/√n)
where z(α/2) is the value of the standard normal variable z that puts an area of α/2 in each tail of the distribution
• A t-statistic should be used in place of the z-statistic when σ is unknown and s must be used as an estimate
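A sketch of this interval with a t multiplier in place of z(α/2) (SciPy assumed); the numbers anticipate the worked example in the next entry:

```python
import math
from scipy import stats   # assumes SciPy is available for the t quantile

# 95% CI for mu with sigma unknown: xbar +/- t(alpha/2, n-1) * s / sqrt(n)
xbar, s, n, conf = 108.0, 15.0, 26, 0.95
t_mult = stats.t.ppf(1 - (1 - conf) / 2, n - 1)   # ~ 2.060 for df = 25
half_width = t_mult * s / math.sqrt(n)

print(f"{conf:.0%} CI: ({xbar - half_width:.1f}, {xbar + half_width:.1f})")
# ~ (101.9, 114.1), i.e. roughly 102 to 114, matching the example below
```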
• Ex: Given x̄ = 108, s = 15, and n = 26, estimate a 95% confidence interval for the population mean
- Since the population variance is unknown, the t-distribution is used
- The resulting interval, using a t-value of 2.060 from Table B (df = 25), is 108 ± 2.060·(15/√26), or approximately 102 to 114
- Consequently, any null hypothesis that μ is between 102 and 114 is tenable on the basis of this sample
- Any hypothesized μ below 102 or above 114 would be rejected at 0.05 significance

COMPARING POPULATION MEANS
• Sampling distribution of the difference between means: If a number of pairs of samples were taken from the same population or from two different populations, then:
- The distribution of differences between pairs of sample means tends to be normal (z-distribution)
- The mean of these differences between means, μ(x̄₁ − x̄₂), is equal to the difference between the population means, that is, μ₁ − μ₂
• Independent samples
- We are testing whether or not two samples are drawn from populations with the same mean, that is, H₀: μ₁ = μ₂, versus a one- or two-tailed alternative
- When σ₁ and σ₂ are known, the test statistic z follows a standard normal distribution under the null hypothesis:
z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / σ(x̄₁ − x̄₂)
where the standard error of the difference between means is σ(x̄₁ − x̄₂) = √(σ₁²/n₁ + σ₂²/n₂)
- When σ₁ and σ₂ are unknown, which is usually the case, substitute s₁ and s₂ for σ₁ and σ₂, respectively, in the above formulas, and use the t-distribution with df = n₁ + n₂ − 2
• Homogeneity of variances (a criterion for the pooled two-sample t-test): The condition that the variances of the two populations are equal; to establish homogeneity of variances, test H₀: σ₁² = σ₂² vs. H₁: σ₁² ≠ σ₂² (note that this is equivalent to testing H₀: σ₁²/σ₂² = 1 vs. H₁: σ₁²/σ₂² ≠ 1)
- Under the null hypothesis, the test statistic s₁²/s₂² follows an F-distribution with degrees of freedom (n₁ − 1, n₂ − 1); if the test statistic exceeds the critical value in Table C, then the null hypothesis can be rejected at the indicated level of significance
[Table C: critical points of the F-distribution, indexed by degrees of freedom for the numerator and denominator; top row α = 0.05, bottom row α = 0.01; not reproduced here]
• Pooled t-test: used for hypothesis tests about a difference in means when, as shown in the sketch after this list:
- Both populations have normal distributions
- n < 30
- Homogeneity of variance holds: σ₁ and σ₂ are not known but are assumed equal (a risky assumption!)
- Many statisticians do not recommend the t-distribution with pooled standard error; the unpooled approach above is more conservative
- The hypothesis test may be two-tailed (= vs. ≠) or one-tailed: H₀: μ₁ ≤ μ₂ with alternative H₁: μ₁ > μ₂ (or H₀: μ₁ ≥ μ₂ with alternative H₁: μ₁ < μ₂)
- Degrees of freedom (df): (n₁ − 1) + (n₂ − 1) = n₁ + n₂ − 2
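A sketch of the pooled two-sample t-test just described (the two samples are hypothetical; SciPy assumed for the tail area); the pooled variance weights each sample's sum of squares by its degrees of freedom:

```python
import math
from scipy import stats   # assumes SciPy is available

# hypothetical independent samples, assumed drawn from normal populations
x1 = [23.1, 25.0, 21.8, 24.4, 26.2, 22.7]
x2 = [20.3, 22.1, 19.8, 21.5, 20.9, 23.0, 21.2]

n1, n2 = len(x1), len(x2)
m1, m2 = sum(x1) / n1, sum(x2) / n2
ss1 = sum((v - m1) ** 2 for v in x1)
ss2 = sum((v - m2) ** 2 for v in x2)

df = n1 + n2 - 2                          # pooled degrees of freedom
sp2 = (ss1 + ss2) / df                    # pooled variance estimate
se = math.sqrt(sp2 * (1 / n1 + 1 / n2))   # SE of (xbar1 - xbar2)
t = (m1 - m2) / se                        # H0: mu1 = mu2

p = 2 * stats.t.sf(abs(t), df)            # two-tailed p-value
print(f"t = {t:.3f}, df = {df}, p = {p:.4f}")
```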
CHI-SQUARE (χ²) TESTS
• Most widely used non-parametric test
• The χ² mean = its degrees of freedom
• The χ² variance = twice its degrees of freedom
• Can be used to test independence, homogeneity, and goodness-of-fit
• The square of a standard normal variable is a chi-square variable with df = 1
• Like the t-distribution, the shape of the distribution depends on the value of df

DEGREES OF FREEDOM (df) COMPUTATION
• If chi-square tests for goodness-of-fit to a hypothesized distribution (uses a frequency distribution), df = g − 1, where g = number of groups, or classes, in the frequency distribution
• If chi-square tests for homogeneity or independence (uses a two-way contingency table), df = (# of rows − 1)(# of columns − 1)

GOODNESS-OF-FIT TEST: To apply the chi-square distribution in this manner, the test statistic is computed as:
χ² = Σ (fₒ − fₑ)²/fₑ
where fₒ = observed frequency of the variable and fₑ = expected frequency (based on the hypothesized population distribution)

TESTS OF CONTINGENCY: Application of chi-square tests to two separate populations to test statistical independence of attributes

TESTS OF HOMOGENEITY: Application of chi-square tests to two samples to test if they came from populations with like distributions

RUNS TEST: Tests whether a sequence (to comprise a sample) is random; the following equations are applied:
R̄ = 2n₁n₂/(n₁ + n₂) + 1
sᵣ = √[ 2n₁n₂(2n₁n₂ − n₁ − n₂) / ((n₁ + n₂)²·(n₁ + n₂ − 1)) ]
where R̄ = mean number of runs, n₁ = number of outcomes of one type, n₂ = number of outcomes of the other type, and sᵣ = standard deviation of the distribution of the number of runs
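A sketch of the runs test formulas above (the coin-flip sequence is hypothetical; for reasonably large n₁ and n₂, the run count R is approximately normal, so |z| > 1.96 suggests non-randomness at α = 0.05):

```python
import math

def runs_test(seq):
    """Runs test for a two-symbol sequence; returns (runs, z)."""
    kinds = sorted(set(seq))
    assert len(kinds) == 2, "sequence must contain exactly two symbols"
    n1 = sum(1 for s in seq if s == kinds[0])
    n2 = len(seq) - n1

    # a run is a maximal block of identical adjacent symbols
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

    mean_r = 2 * n1 * n2 / (n1 + n2) + 1
    var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
             / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mean_r) / math.sqrt(var_r)
    return runs, z

r, z = runs_test("HHTHTTHTHHHTTHTH")   # hypothetical coin-flip record
print(f"runs = {r}, z = {z:.3f}")
```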
HYPOTHESIS TEST FOR LINEAR CORRELATION
With a simple random sample of size n producing a sample correlation coefficient r, it is possible to test for linear correlation in the population, ρ; that is, we conduct the hypothesis test H₀: ρ = ρ₀ versus a right-, left-, or two-tailed alternative; usually we are interested in determining whether there is any linear correlation at all, that is, ρ₀ = 0
The test statistic is:
t = (r − ρ₀)/√((1 − r²)/(n − 2))
which follows a t-distribution with n − 2 degrees of freedom under H₀; this hypothesis test assumes that the sample is drawn from a population with a bivariate normal distribution
• Ex: A simple random sample of size 27 produces a correlation coefficient r = −0.41; is there sufficient evidence at α = 0.05 of a negative linear relationship?
- Since we're testing for a negative linear relationship, we need a left-tailed test: H₀: ρ = 0 vs. H₁: ρ < 0; the critical value can be found from the t-distribution with n − 2 = 25 df and one-tailed α = 0.05; since this is a left-tailed test, we take the negative: −1.708; that is, if the test statistic is less than −1.708, we conclude that there is sufficient evidence of a negative linear relationship
- The test statistic t = −0.41/√((1 − (−0.41)²)/(27 − 2)) = −2.248, allowing us to reject the null hypothesis of no linear correlation and support the alternative hypothesis of a negative linear correlation at α = 0.05

SIMPLE LINEAR REGRESSION
Regression is a method for predicting values of one variable (the outcome or dependent variable) on the basis of the values of one or more independent or predictor variables; fitting a regression model is the process of using sample data to determine an equation to represent the relationship
In a simple linear regression model, we use only one predictor variable and assume that the relationship to the outcome variable is linear; that is, the graph of the regression equation is that of a straight line (we often refer to the "regression line"); for the entire population, the model can be expressed as:
y = β₀ + β₁x + ε
- y is called the dependent variable (or outcome variable), as it is assumed to depend on a linear relationship to x
- x is the independent variable, also called the predictor variable
- β₀ is the intercept of the regression line; that is, the predicted value for y when x = 0
- β₁ is the slope of the regression line: the marginal change in y per unit change in x
- ε refers to random error; the error term is assumed to follow a normal distribution with a mean of zero and constant variation; that is, there should be no increase or decrease in dispersion for different regions along the regression line; in addition, it is assumed that error terms are independent for different (x, y) observations
On the basis of sample data, we find estimates b₀ and b₁ of the intercept β₀ and slope β₁; this gives us the estimated (or sample) regression equation ŷ = b₀ + b₁x
The parameter estimates b₀ and b₁ can be derived in a variety of ways; one of the most common is known as the method of least squares; least squares estimates minimize the sum of squared differences between predicted and actual values of the dependent variable y
For a simple linear regression model, the least squares estimates of the intercept and slope are:
estimated slope: b₁ = SSxy/SSx
estimated intercept: b₀ = ȳ − b₁x̄
These estimates, and other calculations in regression, involve sums of squares:
SSxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n
SSx = Σ(x − x̄)² = Σx² − (Σx)²/n
SSy = Σ(y − ȳ)² = Σy² − (Σy)²/n
Ex: A simple random sample of cars provides the following data on engine displacement (x) and highway mileage (y); fit a simple linear regression model:

  x (displacement) | y (mileage) | x²    | y²   | xy
  5.7              | 18          | 32.49 | 324  | 102.6
  2.5              | 19          | 6.25  | 361  | 47.5
  3.8              | 20          | 14.44 | 400  | 76.0
  2.8              | 19          | 7.84  | 361  | 53.2
  4.6              | 17          | 21.16 | 289  | 78.2
  1.6              | 32          | 2.56  | 1024 | 51.2
  1.6              | 29          | 2.56  | 841  | 46.4
  1.4              | 30          | 1.96  | 900  | 42.0
  SUMS: 24         | 184         | 89.26 | 4500 | 497.1

Fitting a model entails computing the least-squares estimates b₀ and b₁; note that there are 8 observations, that is, n = 8
First, SSxy = Σxy − (Σx)(Σy)/n = 497.1 − (24)(184)/8 = −54.9; SSx = Σx² − (Σx)²/n = 89.26 − 24²/8 = 17.26; and SSy = Σy² − (Σy)²/n = 4500 − 184²/8 = 268
Then the estimated slope is b₁ = SSxy/SSx = −3.18, and the estimated intercept is b₀ = ȳ − b₁x̄ = 32.54
The estimated regression model, then, is: mileage = 32.54 − 3.18 × displacement
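The least-squares arithmetic is mechanical; a sketch reproducing the car example above:

```python
x = [5.7, 2.5, 3.8, 2.8, 4.6, 1.6, 1.6, 1.4]   # displacement (table above)
y = [18, 19, 20, 19, 17, 32, 29, 30]            # highway mileage

n = len(x)
sx, sy = sum(x), sum(y)
ss_xy = sum(a * b for a, b in zip(x, y)) - sx * sy / n   # = -54.9
ss_x = sum(a * a for a in x) - sx * sx / n               # = 17.26

b1 = ss_xy / ss_x                 # estimated slope, ~ -3.18
b0 = sy / n - b1 * sx / n         # estimated intercept, ~ 32.54

print(f"mileage = {b0:.2f} + ({b1:.2f}) * displacement")
print(f"predicted mileage at 3.0 L: {b0 + b1 * 3.0:.1f}")   # fitted value
```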
SIGNIFICANCE OF A REGRESSION MODEL
We can assess the significance of the model by testing to see if the sample provides sufficient evidence of a linear relationship in the population; that is, we conduct the hypothesis test H₀: β₁ = 0 versus H₁: β₁ ≠ 0; this is exactly equivalent to testing for linear correlation in the population, H₀: ρ = 0 versus H₁: ρ ≠ 0; the test for correlation is somewhat simpler:
The correlation coefficient r = SSxy/√(SSx·SSy) = −0.8072
The test statistic t = (r − 0)/√((1 − r²)/(n − 2)) = −3.350
Consulting Table B with degrees of freedom = n − 2 = 6, we obtain a critical value of 3.143 at α = 0.02 and a critical value of 3.707 at α = 0.01; since we have a two-tailed test, we consider the absolute value of the test statistic, which exceeds 3.143 but does not exceed 3.707; that is, we can reject H₀ at α = 0.02 but not at α = 0.01, so the p-value is between 0.01 and 0.02 (the actual p-value, which can be found using computer applications, is 0.0154); this is a reasonably significant model

LINEAR DETERMINATION
Regression models are also assessed by the coefficient of linear determination, r²; this represents the proportion of total variation in y that is explained by the regression model; the coefficient of linear determination can be calculated in a variety of ways; the easiest is to compute r² = (r)²; that is, the coefficient of determination is the square of the coefficient of correlation

RESIDUALS
The difference between an observed and a fitted value of y, (y − ŷ), is called a residual; examining the residuals is useful to identify outliers (observations far from the regression line, representing unusual values of x and y) and to check the assumptions of the model
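A sketch reproducing this significance test from the sums of squares above (SciPy assumed for the t tail area); it also prints r², the coefficient of determination discussed under Linear Determination:

```python
import math
from scipy import stats   # assumes SciPy is available

# sums of squares from the regression example above
ss_xy, ss_x, ss_y, n = -54.9, 17.26, 268.0, 8

r = ss_xy / math.sqrt(ss_x * ss_y)        # ~ -0.8072
t = r / math.sqrt((1 - r**2) / (n - 2))   # ~ -3.350
p = 2 * stats.t.sf(abs(t), n - 2)         # two-tailed p-value, ~ 0.0154

print(f"r = {r:.4f}, r^2 = {r * r:.4f}, t = {t:.3f}, p = {p:.4f}")
```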
