Class Notes in Statistics and Econometrics, Part 18

CHAPTER 35

Least Squares as the Normal Maximum Likelihood Estimate

Now assume $\varepsilon$ is multivariate normal. We will show that in this case the OLS estimator $\hat\beta$ is at the same time the maximum likelihood estimator. For this we need to write down the density function of $y$. First look at one $y_t$, which is $y_t \sim N(x_t'\beta, \sigma^2)$, where

$$X = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix},$$

i.e., $x_t$ is the $t$th row of $X$. It is written as a column vector, since we follow the "column vector convention." The (marginal) density function for this one observation is

$$f_{y_t}(y_t) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(y_t - x_t'\beta)^2/2\sigma^2}. \tag{35.0.3}$$

Since the $y_t$ are stochastically independent, their joint density function is the product, which can be written as

$$f_y(y) = (2\pi\sigma^2)^{-n/2} \exp\Bigl(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Bigr). \tag{35.0.4}$$

To compute the maximum likelihood estimator, it is advantageous to start with the log likelihood function:

$$\log f_y(y;\beta,\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta). \tag{35.0.5}$$

Assume for a moment that $\sigma^2$ is known. Then the MLE of $\beta$ is clearly equal to the OLS $\hat\beta$. Since $\hat\beta$ does not depend on $\sigma^2$, it is also the maximum likelihood estimate when $\sigma^2$ is unknown. $\hat\beta$ is a linear function of $y$. Linear transformations of normal variables are normal, and normal distributions are characterized by their mean vector and covariance matrix. The distribution of the MLE of $\beta$ is therefore $\hat\beta \sim N(\beta, \sigma^2(X'X)^{-1})$.

If we replace $\beta$ in the log likelihood function (35.0.5) by $\hat\beta$, we get what is called the log likelihood function with $\beta$ "concentrated out":

$$\log f_y(y;\beta=\hat\beta,\sigma^2) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\hat\beta)'(y - X\hat\beta). \tag{35.0.6}$$

One gets the maximum likelihood estimate of $\sigma^2$ by maximizing this "concentrated" log likelihood function. Taking the derivative with respect to $\sigma^2$ (consider $\sigma^2$ the name of a variable, not the square of another variable), one gets

$$\frac{\partial}{\partial\sigma^2}\log f_y(y;\hat\beta) = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}(y - X\hat\beta)'(y - X\hat\beta). \tag{35.0.7}$$

Setting this to zero gives

$$\tilde\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n} = \frac{\hat\varepsilon'\hat\varepsilon}{n}. \tag{35.0.8}$$

This is a scalar multiple of the unbiased estimate $s^2 = \hat\varepsilon'\hat\varepsilon/(n-k)$ which we had earlier.

Let us look at the distribution of $s^2$ (from which that of its scalar multiples follows easily). It is a quadratic form in a normal variable, and such quadratic forms very often have $\chi^2$ distributions. Now recall equation (10.4.9) characterizing all the quadratic forms of multivariate normal variables that are $\chi^2$'s. Here it is again: assume $y$ is a multivariate normal vector random variable with mean vector $\mu$ and covariance matrix $\sigma^2\Psi$, and $\Omega$ is a symmetric nonnegative definite matrix. Then $(y-\mu)'\Omega(y-\mu) \sim \sigma^2\chi^2_k$ iff

$$\Psi\Omega\Psi\Omega\Psi = \Psi\Omega\Psi, \tag{35.0.9}$$

and $k$ is the rank of $\Psi\Omega$. This condition is satisfied in particular if $\Psi = I$ (the identity matrix) and $\Omega^2 = \Omega$, and this is exactly our situation:

$$s^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n-k} = \frac{\varepsilon'(I - X(X'X)^{-1}X')\varepsilon}{n-k} = \frac{\varepsilon' M\varepsilon}{n-k}, \tag{35.0.10}$$

where $M^2 = M$ and $\operatorname{rank} M = n-k$. (This last identity holds because for idempotent matrices rank = trace, and we computed the trace above.) Therefore $s^2 \sim \sigma^2\chi^2_{n-k}/(n-k)$, from which one obtains again unbiasedness, but also that $\operatorname{var}[s^2] = 2\sigma^4/(n-k)$, a result that one cannot get from mean and variance alone.
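The following is a small simulation sketch, not part of the original notes, which checks the distributional claim just made. Since $MX = O$, the statistic $s^2 = \varepsilon'M\varepsilon/(n-k)$ in (35.0.10) can be simulated directly from the disturbances, without generating $y$. The design matrix and the number of replications are arbitrary choices; $n = 20$, $k = 3$, $\sigma^2 = 2$ anticipate the setup of Problem 396 below.

set.seed(1)
n <- 20; k <- 3; sigma2 <- 2
X <- matrix(rnorm(n * k), n, k)                 # arbitrary design matrix, no constant term
M <- diag(n) - X %*% solve(t(X) %*% X, t(X))    # residual maker: idempotent, rank n - k
s2 <- replicate(50000, {
  eps <- rnorm(n, sd = sqrt(sigma2))
  drop(t(eps) %*% M %*% eps) / (n - k)          # s^2 as in (35.0.10)
})
c(mean(s2), sigma2)                   # unbiasedness: both close to 2
c(var(s2), 2 * sigma2^2 / (n - k))    # var[s^2] = 2 sigma^4/(n - k), about 0.47

A quantile-quantile plot of s2*(n-k)/sigma2 against qchisq(ppoints(length(s2)), df = n-k) would likewise display the claimed $\chi^2_{n-k}$ shape.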
Problem 395. 4 points. Show that, if $y$ is normally distributed, $s^2$ and $\hat\beta$ are independent.

Answer. We showed in question 300 that $\hat\beta$ and $\hat\varepsilon$ are uncorrelated, therefore in the normal case independent; therefore $\hat\beta$ is also independent of any function of $\hat\varepsilon$, such as $s^2$.

Problem 396. Computer assignment: You run a regression with 3 explanatory variables and no constant term; the sample size is 20; the errors are normally distributed and you know that $\sigma^2 = 2$. Plot the density function of $s^2$. Hint: the command dchisq(x, df=25) returns the density of a $\chi^2$ distribution with 25 degrees of freedom evaluated at x. But the number 25 was only taken as an example; it is not the number of degrees of freedom you need here.

• a. In the same plot, plot the density function of the Theil-Schweitzer estimate. Can one see from the comparison of these density functions why the Theil-Schweitzer estimator has a better MSE?

Answer. Start with the Theil-Schweitzer plot, because it is higher.

> x <- seq(from = 0, to = 6, by = 0.01)
> Density <- (19/2)*dchisq((19/2)*x, df=17)
> plot(x, Density, type="l", lty=2)
> lines(x, (17/2)*dchisq((17/2)*x, df=17))
> title(main = "Unbiased versus Theil-Schweitzer Variance Estimate, 17 d.f.")

Now let us derive the maximum likelihood estimator in the case of a nonspherical but positive definite covariance matrix, i.e., the model is $y = X\beta + \varepsilon$, $\varepsilon \sim N(o, \sigma^2\Psi)$. The density function is

$$f_y(y) = (2\pi\sigma^2)^{-n/2}\,(\det\Psi)^{-1/2}\exp\Bigl(-\frac{1}{2\sigma^2}(y - X\beta)'\Psi^{-1}(y - X\beta)\Bigr). \tag{35.0.11}$$

Problem 397. Derive (35.0.11) as follows: Take a matrix $P$ with the property that $P\varepsilon$ has covariance matrix $\sigma^2 I$. Write down the joint density function of $P\varepsilon$. Since $y$ is a linear transformation of $\varepsilon$, one can apply the rule for the density function of a transformed random variable.

Answer. Write $\Psi = QQ'$ with $Q$ nonsingular and define $P = Q^{-1}$ and $v = P\varepsilon$. Then $\operatorname{V}[v] = \sigma^2 PQQ'P' = \sigma^2 I$, therefore

$$f_v(v) = (2\pi\sigma^2)^{-n/2}\exp\Bigl(-\frac{1}{2\sigma^2}v'v\Bigr). \tag{35.0.12}$$

For the transformation rule, write $v$, whose density function you know, as a function of $y$, whose density function you want to know: $v = P(y - X\beta)$; therefore the Jacobian matrix is $\partial v/\partial y' = \partial(Py - PX\beta)/\partial y' = P$, or one can see it also element by element:

$$\begin{bmatrix} \dfrac{\partial v_1}{\partial y_1} & \cdots & \dfrac{\partial v_1}{\partial y_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial v_n}{\partial y_1} & \cdots & \dfrac{\partial v_n}{\partial y_n} \end{bmatrix} = P. \tag{35.0.13}$$

Therefore one has to do two things: first, substitute $P(y - X\beta)$ for $v$ in formula (35.0.12), and secondly multiply by the absolute value of the determinant of the Jacobian. Here is how to express the determinant of the Jacobian in terms of $\Psi$: from $\Psi^{-1} = (QQ')^{-1} = (Q')^{-1}Q^{-1} = (Q^{-1})'Q^{-1} = P'P$ follows $(\det P)^2 = (\det\Psi)^{-1}$, hence $|\det P| = (\det\Psi)^{-1/2}$.

From (35.0.11) one obtains the following log likelihood function:

$$\log f_y(y) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2}\ln\det[\Psi] - \frac{1}{2\sigma^2}(y - X\beta)'\Psi^{-1}(y - X\beta). \tag{35.0.14}$$

Here, usually not only the elements of $\beta$ are unknown, but $\Psi$ also depends on unknown parameters.
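Here is a brief numerical sketch, not from the text, of the construction used in the answer to Problem 397: factor $\Psi = QQ'$, set $P = Q^{-1}$, and verify that $P$ whitens the disturbances and that $|\det P| = (\det\Psi)^{-1/2}$. The particular $\Psi$ is an arbitrary positive definite matrix built only for illustration.

set.seed(2)
n <- 5
B <- matrix(rnorm(n * n), n, n)
Psi <- crossprod(B) + diag(n)        # an arbitrary positive definite Psi
Q <- t(chol(Psi))                    # Psi = Q Q' with Q lower triangular
P <- solve(Q)
range(P %*% Psi %*% t(P) - diag(n))  # P Psi P' = I, hence V[P eps] = sigma^2 I
c(abs(det(P)), det(Psi)^(-1/2))      # the two numbers agree: |det P| = (det Psi)^(-1/2)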
Instead of concentrating out $\beta$, we will first concentrate out $\sigma^2$, i.e., we will compute the maximum of this likelihood function over $\sigma^2$ for any given set of values for the data and the other parameters:

$$\frac{\partial}{\partial\sigma^2}\log f_y(y) = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{(y - X\beta)'\Psi^{-1}(y - X\beta)}{2\sigma^4} \tag{35.0.15}$$

$$\tilde\sigma^2 = \frac{(y - X\beta)'\Psi^{-1}(y - X\beta)}{n}. \tag{35.0.16}$$

Whatever the value of $\beta$ or the values of the unknown parameters in $\Psi$, $\tilde\sigma^2$ is the value of $\sigma^2$ which, together with the given $\beta$ and $\Psi$, gives the highest value of the likelihood function. If one plugs this $\tilde\sigma^2$ into the likelihood function, one obtains the so-called "concentrated likelihood function," which then only has to be maximized over $\beta$ and $\Psi$:

$$\log f_y(y;\tilde\sigma^2) = -\frac{n}{2}(1 + \ln 2\pi - \ln n) - \frac{n}{2}\ln\bigl((y - X\beta)'\Psi^{-1}(y - X\beta)\bigr) - \frac{1}{2}\ln\det[\Psi]. \tag{35.0.17}$$

This objective function has to be maximized with respect to $\beta$ and the parameters entering $\Psi$. If $\Psi$ is known, then it is clearly maximized by the $\hat\beta$ minimizing (26.0.9); therefore the GLS estimator is also the maximum likelihood estimator.

If $\Psi$ depends on unknown parameters, it is interesting to compare the maximum likelihood estimator with the nonlinear least squares estimator. The objective function minimized by nonlinear least squares is $(y - X\beta)'\Psi^{-1}(y - X\beta)$, which is the sum of squares of the innovation parts of the residuals. These two objective functions therefore differ by the factor $(\det[\Psi])^{1/n}$, which only matters if there are unknown parameters in $\Psi$. Asymptotically, the two objective functions are identical.
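The following is a minimal numerical sketch of this comparison, not part of the notes. It assumes, purely for illustration, that $\Psi$ depends on one parameter $\rho$ through the AR(1) pattern $\Psi_{ij} = \rho^{|i-j|}$; it maximizes the concentrated log likelihood (35.0.17) over $\rho$ and separately minimizes the nonlinear least squares criterion, with $\beta$ concentrated out by GLS in both cases. The two estimates of $\rho$ differ in finite samples exactly because of the $(\det\Psi)^{1/n}$ factor.

set.seed(42)
n <- 50
X <- cbind(1, rnorm(n)); beta <- c(1, 2)
eps <- as.numeric(arima.sim(list(ar = 0.6), n = n))      # AR(1) disturbances
y <- X %*% beta + eps

Psi <- function(rho) rho^abs(outer(1:n, 1:n, "-"))       # illustrative Psi(rho)

weighted.ssr <- function(rho) {      # NLS criterion, beta concentrated out by GLS
  W <- solve(Psi(rho))               # Psi^{-1}
  b <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
  drop(t(y - X %*% b) %*% W %*% (y - X %*% b))
}
conc.loglik <- function(rho) {       # concentrated log likelihood (35.0.17)
  ldet <- as.numeric(determinant(Psi(rho), logarithm = TRUE)$modulus)
  -(n/2)*(1 + log(2*pi) - log(n)) - (n/2)*log(weighted.ssr(rho)) - ldet/2
}
rho.ml  <- optimize(conc.loglik, c(-0.99, 0.99), maximum = TRUE)$maximum
rho.nls <- optimize(weighted.ssr, c(-0.99, 0.99))$minimum
c(ML = rho.ml, NLS = rho.nls)        # close but not identical in finite samples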
Using the factorization theorem for sufficient statistics, one also sees easily that $s^2$ and $\hat\beta$ together form sufficient statistics for $\sigma^2$ and $\beta$. For this use the identity

$$(y - X\beta)'(y - X\beta) = (y - X\hat\beta)'(y - X\hat\beta) + (\beta - \hat\beta)'X'X(\beta - \hat\beta) = (n-k)s^2 + (\beta - \hat\beta)'X'X(\beta - \hat\beta). \tag{35.0.18}$$

Therefore the observation $y$ enters the likelihood function only through the two statistics $\hat\beta$ and $s^2$. The factorization of the likelihood function is therefore the trivial factorization, in which the part that does not depend on the unknown parameters but only on the data is unity.

Problem 398. 12 points. The log likelihood function in the linear model is given by (35.0.5). Show that the inverse of the information matrix is

$$\begin{bmatrix} \sigma^2(X'X)^{-1} & o \\ o' & 2\sigma^4/n \end{bmatrix}. \tag{35.0.19}$$

The information matrix can be obtained in two different ways. Its typical element has the following two forms:

$$\operatorname{E}\Bigl[\frac{\partial\ln\ell}{\partial\theta_i}\,\frac{\partial\ln\ell}{\partial\theta_k}\Bigr] = -\operatorname{E}\Bigl[\frac{\partial^2\ln\ell}{\partial\theta_i\,\partial\theta_k}\Bigr], \tag{35.0.20}$$

or written as matrix derivatives

$$\operatorname{E}\Bigl[\frac{\partial\ln\ell}{\partial\theta}\,\frac{\partial\ln\ell}{\partial\theta'}\Bigr] = -\operatorname{E}\Bigl[\frac{\partial^2\ln\ell}{\partial\theta\,\partial\theta'}\Bigr]. \tag{35.0.21}$$

In our case $\theta = \begin{bmatrix}\beta \\ \sigma^2\end{bmatrix}$. The expectation is taken under the assumption that the parameter values are the true values. Compute it both ways.

Answer. The log likelihood function can be written as

$$\ln\ell = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y'y - 2y'X\beta + \beta'X'X\beta). \tag{35.0.22}$$

The first derivatives were already computed for the maximum likelihood estimators:

$$\frac{\partial}{\partial\beta'}\ln\ell = -\frac{1}{2\sigma^2}(-2y'X + 2\beta'X'X) = \frac{1}{\sigma^2}(y - X\beta)'X = \frac{1}{\sigma^2}\varepsilon'X \tag{35.0.23}$$

$$\frac{\partial}{\partial\sigma^2}\ln\ell = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\varepsilon'\varepsilon. \tag{35.0.24}$$

By the way, one sees that each of these has expected value zero, which is a fact that is needed to prove consistency of the maximum likelihood estimator.

The formula with only one partial derivative will be given first, although it is more tedious: taking the expected value of the outer product of the score vector, we get a symmetric $2\times2$ partitioned matrix with the diagonal elements

$$\operatorname{E}\Bigl[\frac{1}{\sigma^4}X'\varepsilon\varepsilon'X\Bigr] = \frac{1}{\sigma^2}X'X \tag{35.0.25}$$

and

$$\operatorname{E}\Bigl[\Bigl(-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\varepsilon'\varepsilon\Bigr)^2\Bigr] = \operatorname{var}\Bigl[-\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\varepsilon'\varepsilon\Bigr] = \operatorname{var}\Bigl[\frac{1}{2\sigma^4}\varepsilon'\varepsilon\Bigr] = \frac{1}{4\sigma^8}\,2n\sigma^4 = \frac{n}{2\sigma^4}. \tag{35.0.26}$$

One of the off-diagonal elements is $\bigl(-\frac{n}{2\sigma^4} + \frac{1}{2\sigma^6}\varepsilon'\varepsilon\bigr)\varepsilon'X$. Its expected value is zero: $\operatorname{E}[\varepsilon] = o$, and also $\operatorname{E}[\varepsilon\varepsilon'\varepsilon] = o$, since its $i$th component is $\operatorname{E}[\varepsilon_i\sum_j\varepsilon_j^2] = \sum_j\operatorname{E}[\varepsilon_i\varepsilon_j^2]$. If $i \neq j$, then $\varepsilon_i$ is independent of $\varepsilon_j^2$, therefore $\operatorname{E}[\varepsilon_i\varepsilon_j^2] = 0\cdot\sigma^2 = 0$. If $i = j$, we get $\operatorname{E}[\varepsilon_i^3] = 0$, since $\varepsilon_i$ has a symmetric distribution.
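As a sketch not found in the text, the outer-product form of the information matrix can be checked by simulation: the sample average of $\frac{\partial\ln\ell}{\partial\theta}\frac{\partial\ln\ell}{\partial\theta'}$ over many draws should reproduce the block-diagonal matrix whose inverse is (35.0.19). The design matrix and parameter values below are arbitrary.

set.seed(7)
n <- 40; k <- 3
X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))
beta <- c(1, -0.5, 2); sigma2 <- 1.5
score <- function(y) {                      # (35.0.23) and (35.0.24), stacked
  eps <- y - X %*% beta
  c(t(X) %*% eps / sigma2,
    -n/(2*sigma2) + drop(t(eps) %*% eps) / (2*sigma2^2))
}
S <- replicate(20000, score(X %*% beta + rnorm(n, sd = sqrt(sigma2))))
info.mc <- S %*% t(S) / ncol(S)             # Monte Carlo estimate of E[score score']
info.th <- rbind(cbind(t(X) %*% X / sigma2, 0), c(rep(0, k), n/(2*sigma2^2)))
max(abs(info.mc - info.th))                 # small relative to the entries of info.th
solve(info.th)                              # blocks sigma2*(X'X)^{-1} and 2*sigma2^2/n, as in (35.0.19)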
[…] does not attain the bound; however, one can show with other means that it is nevertheless efficient.

CHAPTER 36

Bayesian Estimation in the Linear Model

The model is $y = X\beta + \varepsilon$ with $\varepsilon \sim N(o, \sigma^2 I)$. Both $y$ and $\beta$ are random. The distribution of $\beta$, called the "prior information," is $\beta \sim N(\nu, \tau^2 A^{-1})$. (Bayesians work with the precision matrix, which is the inverse of the covariance matrix.) Furthermore, $\beta$ and $\varepsilon$ are independent of each other. […]

The posterior distribution combines all the information, prior information and sample information, about $\beta$. According to (??), this posterior mean can be written as

$$\hat{\hat\beta} = \nu + B^*(y - X\nu), \tag{36.0.33}$$

where $B^*$ is the solution of the "normal equation" (??), which reads here

$$B^*(XA^{-1}X' + \kappa^2 I) = A^{-1}X'. \tag{36.0.34}$$

The obvious solution of (36.0.34) is

$$B^* = A^{-1}X'(XA^{-1}X' + \kappa^2 I)^{-1}. \tag{36.0.35}$$

According to (??), the MSE-matrix of […] Bayesians are interested in $\hat{\hat\beta}$ because this is the posterior mean. The MSE-matrix, which is the posterior covariance matrix, can also be written as

$$\operatorname{MSE}[\hat{\hat\beta};\beta] = \sigma^2(X'X + \kappa^2 A)^{-1}. \tag{36.0.38}$$

Problem 399. Show that $B^*$ as defined in (36.0.35) satisfies (36.0.34), that (36.0.33) with this $B^*$ becomes (36.0.36), and that (??) becomes (36.0.38).

Answer. (36.0.35) in the […]

[…] First interpretation: It is a matrix weighted average of the OLS estimate and $\nu$, with the weights being the respective precision matrices. If $\nu = o$, then the matrix weighted average reduces to $\hat{\hat\beta} = (X'X + \kappa^2 A)^{-1}X'y$, which has been called a "shrinkage estimator" (ridge regression), since the "denominator" is bigger: instead of "dividing by" $X'X$ (strictly speaking, multiplying by $(X'X)^{-1}$) […] is "shrunk" not in direction of the origin but in direction of $\nu$.

Second interpretation: It is as if, in addition to the data $y = X\beta + \varepsilon$, also an independent observation $\nu = \beta + \delta$ with $\delta \sim N(o, \tau^2 A^{-1})$ was available, i.e., as if the model was

$$\begin{bmatrix} y \\ \nu \end{bmatrix} = \begin{bmatrix} X \\ I \end{bmatrix}\beta + \begin{bmatrix} \varepsilon \\ \delta \end{bmatrix} \quad\text{with}\quad \begin{bmatrix} \varepsilon \\ \delta \end{bmatrix} \sim N\!\left(\begin{bmatrix} o \\ o \end{bmatrix}, \begin{bmatrix} \sigma^2 I & O \\ O & \tau^2 A^{-1} \end{bmatrix}\right). \tag{36.0.45}$$

The least squares objective function minimized by the GLS estimator $\hat\beta = \hat{\hat\beta}$ in (36.0.45) is […]

[…] Problem 401. As in Problem 274, we will work with the Cobb-Douglas production function, which relates output $Q$ to the inputs of labor $L$ and capital $K$ as follows:

$$Q_t = \mu K_t^{\beta} L_t^{\gamma} \exp(\varepsilon_t). \tag{36.0.49}$$

Setting $y_t = \log Q_t$, $x_t = \log K_t$, $z_t = \log L_t$, and $\alpha = \log\mu$, one obtains the linear regression

$$y_t = \alpha + \beta x_t + \gamma z_t + \varepsilon_t. \tag{36.0.50}$$

Assume that the prior information about $\beta$, $\gamma$, and […] About $\alpha$ assume that the prior information is such that

$$\operatorname{E}[\alpha] = 5.0, \qquad \Pr[-10 < \alpha < 20] = 0.9, \tag{36.0.54}$$

and that our prior knowledge about $\alpha$ is not affected by (is independent of) our prior knowledge concerning $\beta$ and $\gamma$. Assume that $\sigma^2$ is known and that it has the value $\sigma^2 = 0.09$. Furthermore, assume that our prior views about $\alpha$, $\beta$, and $\gamma$ can be adequately represented […]

[…] the researcher from stepping on his own toes too blatantly. In the present textbook situation, this advantage does not hold. On the contrary, the only situation where the researcher may be tempted to do something which he does not quite understand is in the above elicitation of prior information. It often happens that prior information gained in this way is self-contradictory, and the researcher is probably […]

[…] variances of three linear combinations of two parameters imply for the correlation between them! I can think of two justifications of Bayesian approaches: In certain situations the data are very insensitive, without this being a priori apparent. Widely different estimates give an almost as good fit to the data as the best one. In this case the researcher's prior information may make a big difference and it should […]

[…] opinion what to think of this. I always get uneasy when I see graphs like [JHG+88, Figure 7.2 on p. 283]. The prior information was specified on pp. 277/8: the marginal propensity to consume is with high probability between 0.75 and 0.95, and there is a 50-50 chance that it lies above or below 0.85. The least squares estimate of the MPC is 0.9, with a reasonable confidence interval. There is no multicollinearity […]
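Returning to the posterior mean formulas earlier in this chapter, here is a small numerical sketch that is not part of the notes. It checks that three descriptions of $\hat{\hat\beta}$ coincide: the matrix weighted average of the OLS estimate and $\nu$ implied by the first interpretation, the form $\nu + B^*(y - X\nu)$ of (36.0.33) with $B^*$ from (36.0.35), and GLS on the augmented system (36.0.45). The data, prior mean $\nu$, and precision matrix $A$ are made-up values for illustration only, and $\kappa^2 = \sigma^2/\tau^2$ is assumed (the definition of $\kappa^2$ is not shown in this excerpt, but it is what makes the GLS equivalence work).

set.seed(3)
n <- 30; k <- 2
X <- cbind(1, rnorm(n)); beta <- c(1, 2)
sigma2 <- 1; tau2 <- 4; kappa2 <- sigma2 / tau2
A  <- diag(k)                       # illustrative prior precision (up to 1/tau^2)
nu <- c(0.5, 1.5)                   # illustrative prior mean
y <- X %*% beta + rnorm(n, sd = sqrt(sigma2))

# (1) matrix weighted average of the OLS estimate and nu
b1 <- solve(t(X) %*% X + kappa2 * A, t(X) %*% y + kappa2 * A %*% nu)
# (2) nu + B*(y - X nu), with B* as in (36.0.35)
Bstar <- solve(A, t(X)) %*% solve(X %*% solve(A, t(X)) + kappa2 * diag(n))
b2 <- nu + Bstar %*% (y - X %*% nu)
# (3) GLS on the augmented system (36.0.45)
y.aug <- c(y, nu); X.aug <- rbind(X, diag(k))
Vinv  <- rbind(cbind(diag(n)/sigma2, matrix(0, n, k)),
               cbind(matrix(0, k, n), A/tau2))
b3 <- solve(t(X.aug) %*% Vinv %*% X.aug, t(X.aug) %*% Vinv %*% y.aug)
cbind(b1, b2, b3)                         # the three columns agree
sigma2 * solve(t(X) %*% X + kappa2 * A)   # MSE-matrix as in (36.0.38)

Setting nu to the zero vector in the same code reproduces the ridge-regression form $(X'X + \kappa^2 A)^{-1}X'y$ mentioned in the first interpretation.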
