Peter Hoff

Shrinkage estimators

October 31, 2013

Contents

1 Shrinkage estimators
2 Admissible linear shrinkage estimators
3 Admissibility of unbiased normal mean estimators
4 Motivating the James-Stein estimator
  4.1 What is wrong with X?
  4.2 An oracle estimator
  4.3 Adaptive shrinkage estimation
5 Risk of δJS
  5.1 Risk bound for δJS
  5.2 Stein's identity
6 Some oracle inequalities
  6.1 A simple oracle inequality
7 Unknown variance or covariance

Much of this content comes from Lehmann and Casella [1998], sections 5.2, 5.4, 5.5, 4.6 and 4.7.

1 Shrinkage estimators

Consider a model {p(x|θ) : θ ∈ Θ} for a random variable X such that E[X|θ] = µ(θ) and 0 < Var[X|θ] = σ²(θ) < ∞ for all θ ∈ Θ.


A linear estimator δ(x) for µ(θ) is an estimator of the form δab(X) = aX + b. Is δab admissible?

Theorem 1 (LC thm 5.2.6). δab(X) = aX + b is inadmissible for E[X|θ] under squared error loss whenever

1. a > 1,
2. a = 1 and b ≠ 0, or
3. a < 0.

Proof. The risk of δab is

  R(θ, δab) = E[(aX + b − µ)²|θ]
            = E[(a(X − µ) + (b − µ(1 − a)))²|θ]
            = E[a²(X − µ)² + (b − µ(1 − a))² + 2a(X − µ)(b − µ(1 − a))|θ]
            = a²σ² + (b − µ(1 − a))².

1. If a > 1, then R(θ, δab) ≥ a²σ² > σ² = R(θ, X), so δab is dominated by X.
2. If a = 1 and b ≠ 0, then R(θ, δab) = σ² + b² > σ² = R(θ, X), so δab is dominated by X.
3. If a < 0, then R(θ, δab) > (b − µ(1 − a))² = (1 − a)²(b/(1 − a) − µ)² ≥ (b/(1 − a) − µ)² = R(θ, b/(1 − a)), so δab is dominated by the constant estimator b/(1 − a).
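The risk formula in the proof is easy to check by simulation; the following is a minimal sketch (the values of µ, σ, a, b below are arbitrary illustrations, not from the notes):

```python
import numpy as np

# Monte Carlo sanity check of R(theta, delta_ab) = a^2*sigma^2 + (b - mu*(1-a))^2,
# using case 1 of the theorem (a > 1), where aX + b should be beaten by X itself.
rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.0                    # a single observation X ~ N(mu, sigma^2)
a, b = 1.5, 0.3                         # a > 1, so aX + b is inadmissible
X = rng.normal(mu, sigma, size=1_000_000)

mc_risk = ((a * X + b - mu) ** 2).mean()                 # simulated risk of aX + b
formula = a**2 * sigma**2 + (b - mu * (1 - a)) ** 2      # closed-form risk from the proof
print(mc_risk, formula)                                  # the two should agree
print(((X - mu) ** 2).mean())                            # risk of X (= sigma^2), which is smaller
```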


Letting w = 1 − a and µ0 = b/(1 − a), the result suggests that if we want to use an admissible linear estimator, it should be of the form

  δ(X) = wµ0 + (1 − w)X, w ∈ [0, 1].

We call such estimators linear shrinkage estimators, as they "shrink" the estimate from X towards µ0. Intuitively, you can think of µ0 as your "guess" as to the value of µ, and w as the confidence you have in your guess. Of course, the closer your guess is to the truth, the better your estimator. If µ0 represents your guess as to µ(θ), it seems natural to require that µ0 ∈ µ(Θ) = {µ : µ = µ(θ), θ ∈ Θ}, i.e. that µ0 is a possible value of µ.

Lemma 1. If µ(Θ) is convex and µ0 ∉ µ̄(Θ) (the closure of µ(Θ)), then δ(X) = wµ0 + (1 − w)X is not admissible.

Proof. For the one-dimensional case, suppose µ0 > µ(θ) for all θ ∈ Θ. Let µ̃0 = sup_Θ µ(θ) and δ̃(X) = wµ̃0 + (1 − w)X. Then δ̃(X) dominates δ(X): the variances are the same, and the latter has higher bias for all θ. The proof is similar for the case µ0 < µ(θ) for all θ ∈ Θ.

Exercise: Generalize this result to higher dimensions.

2 Admissible linear shrinkage estimators

We have shown that δ(X) = wµ0 + (1 − w)X is inadmissible for µ(θ) = E[X|θ] if

• w ∉ [0, 1], or
• µ0 ∉ µ(Θ).


Restricting attention to w ∈ [0, 1] and µ0 ∈ µ(Θ), it may seem that such estimators should always be admissible, but "always" is almost always too inclusive.

Exercise: Give an example where wµ0 + (1 − w)X is not admissible, even with w ∈ (0, 1) and µ0 ∈ µ(Θ).

Linear shrinkage via conjugate priors

What about using a Bayesian argument? Recall:

Theorem. Any unique Bayes estimator is admissible.

If we can show that wµ0 + (1 − w)X is unique Bayes under some prior, then we will have shown admissibility. Let X1, . . . , Xn ∼ i.i.d. p(x|θ), where

  p(x|θ) ∈ P = {p(x|θ) = h(x) exp(x · θ − A(θ)) : θ ∈ H}.

Consider estimation of µ = E[X|θ] under squared error loss. Let

  π(θ) ∝ exp(n0 µ0 · θ − n0 A(θ)),

where n0 > 0 and µ0 ∈ Conv{E[X|θ] : θ ∈ H}. Recall that under this prior, E[µ] ≡ E[E[X|θ]] = µ0. Then

  π(θ|x) ∝ exp(n1 µ1 · θ − n1 A(θ)),

where n1 = n0 + n and n1 µ1 = n0 µ0 + n x̄, so that

  µ1 = (n0 µ0 + n x̄)/n1 = n0/(n0 + n) µ0 + n/(n0 + n) x̄.

Under this posterior distribution, E[µ|x] ≡ E[E[X|θ]|x] = µ1.


Therefore, the unique Bayes estimator of µ = E[X|θ] under squared error loss is µ1 = wµ0 + (1 − w)x̄, and so this linear shrinkage estimator is admissible.

Example (multiple normal means): Let X ∼ Np(θ, σ²I). First consider the case that σ² is known, so that

  p(x|θ) = (2πσ²)^(−p/2) exp(−(x − θ) · (x − θ)/[2σ²]) ∝_θ exp(x · θ/σ² − θ · θ/[2σ²]).

Consider the normal prior

  π(θ) = (2πτ0²)^(−p/2) exp(−(θ − θ0) · (θ − θ0)/[2τ0²]) ∝_θ exp(θ0 · θ/τ0² − θ · θ/[2τ0²]),

where τ0² is analogous to 1/n0 in the general formulation for exponential families. The posterior density is

  π(θ|x) ∝_θ exp{[θ0/τ0² + x/σ²] · θ − θ · θ[1/σ² + 1/τ0²]/2} = exp{θ1 · θ/τ1² − θ · θ/[2τ1²]},

where

• 1/τ1² = 1/τ0² + 1/σ²
• θ1 = (1/τ0²)/(1/τ0² + 1/σ²) θ0 + (1/σ²)/(1/τ0² + 1/σ²) x ≡ wθ0 + (1 − w)x.

So {θ|x} ∼ Np(θ1, τ1²I), which means that E[θ|x] = θ1 = wθ0 + (1 − w)x uniquely minimizes the posterior risk under squared error loss. The posterior mean is therefore a unique Bayes estimator and also an admissible estimator of θ. Since this result holds for all τ0² > 0, we have the following:


Lemma 2. For each w ∈ (0, 1) and θ0 ∈ Rp, the estimator δwθ0(x) = wθ0 + (1 − w)x is admissible for estimating θ in the model X ∼ Np(θ, σ²I), θ ∈ Rp, where σ² is known.

Of course, what we would like is the following lemma:

Lemma 3. For each w ∈ (0, 1) and θ0 ∈ Rp, the estimator δwθ0(X) = wθ0 + (1 − w)X is admissible for estimating θ in the model X ∼ Np(θ, σ²I), θ ∈ Rp, σ² ∈ R+.

How can this result be obtained?

Theorem 2. Let P = {p(x|θ, ψ) : (θ, ψ) ∈ Θ × Ψ}, and for ψ0 ∈ Ψ let Pψ0 = {p(x|θ, ψ0) : θ ∈ Θ} be a submodel. If δ is admissible for estimating θ under Pψ0 for each ψ0 ∈ Ψ, then δ is admissible for estimating θ under P.

Proof. Suppose δ satisfies the conditions of the theorem but is not admissible. Then there exists a δ′ ∈ D such that

  R((θ, ψ), δ′) ≤ R((θ, ψ), δ) for all (θ, ψ), and
  R((θ0, ψ0), δ′) < R((θ0, ψ0), δ) for some (θ0, ψ0).

In particular, restricting attention to ψ = ψ0, the estimator δ′ dominates δ within the submodel Pψ0. But this contradicts the assumption that δ is admissible for estimating θ under Pψ0. Therefore, no such δ′ can exist, and so δ is admissible for P.

A corollary to this theorem is the admissibility of wθ0 + (1 − w)X in the normal model with unknown variance.
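As a small numerical aside (a sketch of my own; the dimension, variances, and θ0 are illustrative choices, not from the notes), the Bayes-risk comparison behind Lemma 2 can be checked by simulation: drawing θ from the Np(θ0, τ0²I) prior, the posterior-mean shrinkage estimator has smaller average loss than x, consistent with it being the unique Bayes estimator.

```python
import numpy as np

# Compare average squared-error loss of x and of w*theta0 + (1-w)*x when theta is
# drawn from the conjugate normal prior.  All constants below are illustrative.
rng = np.random.default_rng(1)
p, sigma2, tau02 = 5, 1.0, 2.0
theta0 = np.zeros(p)
w = (1 / tau02) / (1 / tau02 + 1 / sigma2)        # shrinkage weight from the posterior

nsim = 50_000
theta = theta0 + rng.normal(scale=np.sqrt(tau02), size=(nsim, p))   # theta ~ prior
x = theta + rng.normal(scale=np.sqrt(sigma2), size=(nsim, p))       # x | theta ~ Np(theta, sigma^2 I)
delta = w * theta0 + (1 - w) * x                                    # linear shrinkage estimator

print(((x - theta) ** 2).sum(1).mean())        # average loss of x
print(((delta - theta) ** 2).sum(1).mean())    # average loss of the shrinkage estimator (smaller)
```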

3 Admissibility of unbiased normal mean estimators

Let X ∼ Np(θ, σ²I), θ ∈ Rp, σ² > 0. For estimation of θ under squared error loss, we have shown that the linear shrinkage estimator δ(x) = wθ0 + (1 − w)x is


• inadmissible if w ∉ [0, 1],
• admissible if w ∈ (0, 1).

What remains to evaluate is the admissibility for w ∈ {0, 1}. Admissibility for w = 1 is easy to show: the estimator δ1θ0(x) = θ0 beats everything at θ0 and so can't be dominated. The last and most interesting case is that of w = 0, i.e. δ0(X) = X, the unbiased MLE and UMVUE.

Blyth's method

Recall Blyth's method for showing admissibility using a limiting Bayes argument:

Theorem 3 (LC 5.7.13). Suppose Θ ⊂ Rp is open, and that R(θ, δ) is continuous in θ for all δ ∈ D. Let δ be an estimator and {πn} be a sequence of measures such that for any open ball B ⊂ Θ,

  [R(πn, δ) − R(πn, δπn)] / πn(B) → 0 as n → ∞.

Then δ is admissible.

Let's try to use this to show admissibility of δ0(X) = X in the normal means problem. We begin with the case that σ² = 1 is known:

  X ∼ Np(θ, I)
  θ ∼ Np(0, τ²I)
  {θ|X} ∼ Np( τ²/(1 + τ²) X, τ²/(1 + τ²) I ).

The unique Bayes estimator is

  δτ²(X) = E[θ|X] = τ²/(1 + τ²) X ≡ (1 − w)X.

To apply the theorem, we need to compute the Bayes risk of δτ² and of X under the Np(0, τ²I) prior πτ². The loss we will use is "scaled" squared error loss, L(θ, d) = Σ_j (θj − dj)²/p.


Because the risk is the average of the individual MSEs, the Bayes risk is just the average of the Bayes risks from the p components,

  R(θ, δ) = E[ Σ_{j=1}^p (θj − δj)² ]/p = Σ_{j=1}^p E[(θj − δj)²]/p,

and so calculating the Bayes risk is similar to calculating the risk in the p = 1 problem. For δτ²(x) = ax, where a = 1 − w = τ²/(1 + τ²), we have

  E[(aX − θ)²] = E[(a(X − θ) − (1 − a)θ)²]
               = a²E[(X − θ)²] − 2a(1 − a)E[(X − θ)θ] + (1 − a)²E[θ²]
               = a² + (1 − a)²τ²        (the cross term is zero since E[X − θ | θ] = 0)
               = (τ²/(1 + τ²))² + τ²/(1 + τ²)²
               = τ²/(1 + τ²).

A more intuitive way to calculate this makes use of the fact that δτ²(X) = E[θ|X], so

  E[(θ − δτ²)²] = E[(θ − E[θ|X])²] = E_X[ E_{θ|X}[(θ − E[θ|X])²] ] = E_X[Var[θ|X]] = E_X[τ²/(1 + τ²)] = τ²/(1 + τ²).

Similarly, E[(X − θ)²] = E_θ[E_{X|θ}[(X − θ)²]] = E_θ[1] = 1. So R(πτ², δτ²) = τ²/(1 + τ²) and R(πτ², X) = 1.

Returning to the p-variate case, since the Bayes risk is the arithmetic average of the risks for each of the p components of θ, we have

  R(πτ², δτ²) = τ²/(1 + τ²),
  R(πτ², X) = 1.


Note that

  δτ²(X) = τ²/(1 + τ²) X ↑ X as τ² ↑ ∞, and
  R(πτ², δτ²) ↑ R(πτ², X) as τ² ↑ ∞,

so X is a "limiting Bayes" estimator, for which the risk difference from the Bayes estimator converges to zero. This is promising; let's now apply the theorem. Letting B be any open finite ball in Rp, we need to see whether the following limit is zero:

  lim_{τ²→∞} [R(πτ², X) − R(πτ², δτ²)] / πτ²(B) = lim_{τ²→∞} (1 − τ²/(1 + τ²)) / πτ²(B)
                                                 = lim_{τ²→∞} [(1 + τ²) πτ²(B)]⁻¹.

Now πτ²(B) → 0 as τ² → ∞ for any bounded set B. Therefore, the limit is zero only if

  lim_{τ²→∞} τ² πτ²(B) = ∞.

We have

  τ² πτ²(B) = τ² ∫_B (2πτ²)^(−p/2) exp(−||θ||²/[2τ²]) dθ
            = (2π)^(−p/2) × (τ²)^(1−p/2) × ∫_B exp(−||θ||²/[2τ²]) dθ
            = (2π)^(−p/2) × (a) × (b).

Now take the limit as τ² → ∞:

  (b) → Vol(B);
  (a) → ∞ if p = 1, (a) → 1 if p = 2, (a) → 0 if p > 2.

Therefore, the desired limit is achieved for p = 1 but not for p > 1. (This behavior is illustrated numerically in the sketch after the list below.)

• By the theorem, X is admissible when p = 1.


• For p > 1, this method of showing admissibility does not work.
  – For p = 2, X can be shown to be admissible using Blyth's method with non-normal priors (see LC exercise 5.4.5).
  – For p > 2, X can't be shown to be admissible, because it isn't.
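To see the trichotomy for (a) concretely, here is a small numerical sketch of my own (it assumes scipy is available) of the quantity (1 + τ²)πτ²(B) when B is the unit ball, using πτ²(B) = P(χ²_p ≤ 1/τ²):

```python
import numpy as np
from scipy import stats

# For theta ~ Np(0, tau^2 I) and B the unit ball, pi_{tau^2}(B) = P(chi^2_p <= 1/tau^2).
# Blyth's method needs (1 + tau^2) * pi_{tau^2}(B) -> infinity as tau^2 -> infinity.
for p in (1, 2, 3):
    for tau2 in (1e2, 1e4, 1e6):
        mass = stats.chi2.cdf(1.0 / tau2, df=p)          # prior probability of the unit ball
        print(f"p={p}, tau^2={tau2:.0e}: (1 + tau^2) * mass = {(1 + tau2) * mass:.5f}")
# The product grows without bound for p = 1, levels off near Vol(B)/(2*pi) = 1/2 for p = 2,
# and collapses to 0 for p = 3, matching the limits of (a) above.
```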

Interpreting the failure of Blyth's method: The admissibility condition in Blyth's method is derived from consideration of the existence of an estimator δ that dominates X. If such an estimator exists, then by continuity of the risks there exist an ε > 0 and an open ball B ⊂ Θ such that R(θ, X) − R(θ, δ) > ε for all θ ∈ B. Integrating with respect to a prior πk, and comparing to the Bayes risk of the Bayes estimator δk under πk, gives

  R(πk, X) − R(πk, δk) ≥ R(πk, X) − R(πk, δ) = ∫ [R(θ, X) − R(θ, δ)] πk(dθ) ≥ ∫_B [R(θ, X) − R(θ, δ)] πk(dθ) ≥ ε πk(B) for all k,

as δk has Bayes risk less than or equal to that of δ.

Could such a δ exist?

Could exist: Suppose B is a ball such that πk(B) goes to zero very fast. Then an estimator (like X) can have a good limiting Bayes risk and still do poorly on B. This allows for the possibility of domination by another estimator that does better on B.


Couldn't exist: On the other hand, if R(πk, X) − R(πk, δk) goes to zero very fast (e.g., faster than the probability of any ball B), then in a sense X would have to be doing well everywhere and could not be dominated; this is Blyth's method for showing admissibility. What fails in the admissibility proof for the normal means problem is that for p > 2, the probability πk(B) of an open ball B goes to zero much faster than the Bayes risk difference, leaving a large enough "gap" for some other estimator to do better.

4 Motivating the James-Stein estimator

Stein [1956] showed that X is inadmissible for θ in the normal means problem when p > 2. This was surprising, as X is the MLE and UMVUE for θ. In this section, we motivate an estimator, due to James and Stein, that dominates X when p > 2.

4.1 What is wrong with X?

For large p,

• X may be close to θ, but
• X · X = ||X||² may be far from θ · θ = ||θ||².

If X ∼ Np(θ, I),

  E[||X||²] = E[ Σ_{j=1}^p Xj² ] = Σ_{j=1}^p (θj² + 1) = ||θ||² + p,

so for large p, the magnitude of the estimator vector X is expected to be much larger than the magnitude of the estimand vector θ.

More insight can be gained as follows: Note that every vector x can be expressed as x = sθ + r, for some s ∈ R and r such that θ · r = 0.


Here, the random variable s is the magnitude of the projection of x in the direction of θ, and r is the residual vector. Using this decomposition, we can write the squared-error loss of ax for estimating θ as

  ||ax − θ||² = (ax − θ) · (ax − θ)
              = ((as − 1)θ + ar) · ((as − 1)θ + ar)
              = (as − 1)²||θ||² + a²||r||².

Now consider replacing x with X ∼ Np(θ, I). The random-variable version of the above equation is then

  ||aX − θ||² = (aS − 1)²||θ||² + a²||R||².

Exercise: Show that

• S ∼ N(1, ||θ||⁻²),
• ||R||² ∼ χ²_{p−1},
• S and R are independent.

Now imagine a situation where p is growing but ||θ||² remains fixed. The distribution of (aS − 1)²||θ||² remains fixed, whereas the distribution of a²||R||² blows up. This suggests that if we think ||θ||²/p is small, we should use an estimator like aX with a < 1 to control the error that comes from ||R||². But what should the value of a be?
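Before turning to that question, here is a small numerical illustration (a sketch of my own; the values of a and ||θ||² are arbitrary) of how the two components of the loss behave as p grows:

```python
import numpy as np

# Decompose X = S*theta + R and watch the two loss components as p grows with
# ||theta||^2 held fixed.
rng = np.random.default_rng(2)
a, norm2 = 0.9, 4.0                                   # shrinkage factor and fixed ||theta||^2

for p in (5, 50, 500):
    theta = np.zeros(p)
    theta[0] = np.sqrt(norm2)                         # any theta with ||theta||^2 = norm2
    X = theta + rng.normal(size=(20_000, p))          # X ~ Np(theta, I)
    S = X @ theta / norm2                             # coefficient of the projection onto theta
    R2 = (X ** 2).sum(1) - S ** 2 * norm2             # ||R||^2 = ||X||^2 - S^2 ||theta||^2
    print(p, np.mean((a * S - 1) ** 2 * norm2), np.mean(a ** 2 * R2))
# The first average stays near (1 - a)^2 ||theta||^2 + a^2, while the second grows like a^2 (p - 1).
```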

4.2 An oracle estimator

Question: Among the estimators {aX : a ∈ [0, 1]}, which has the smallest risk?

Solution:

  E[||aX − θ||²] = E[||(aX − aθ) − (1 − a)θ||²] = a²p + (1 − a)²||θ||².


Taking derivatives, the minimizing value ã of a satisfies

  2ãp − 2(1 − ã)||θ||² = 0
  ã/(1 − ã) = ||θ||²/p
  ã = ||θ||²/(||θ||² + p).

Thus the optimal shrinkage "estimator" is given by δã(X) = ãX. This is not really an estimator in the usual sense, because the ideal degree of shrinkage ã depends on θ. For this reason, ãX is sometimes called an "oracle estimator": you would need an oracle to tell you the value of ||θ||² before you could use it. Note that the risk of this estimator is

  E[||ãX − θ||²] = [||θ||⁴ p + p²||θ||²] / (||θ||² + p)²
                 = p||θ||²(||θ||² + p) / (||θ||² + p)²
                 = p||θ||²/(||θ||² + p),

and so

  E[||ãX − θ||²] = p ||θ||²/(||θ||² + p) < p = E[||X − θ||²].

The risk differential is large if ||θ||² is small compared to p.
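The closed-form oracle risk can be checked quickly by simulation (a sketch of my own; p and θ below are arbitrary choices):

```python
import numpy as np

# Monte Carlo check of E||a~ X - theta||^2 = p ||theta||^2 / (||theta||^2 + p).
rng = np.random.default_rng(3)
p = 25
theta = rng.normal(scale=0.5, size=p)                 # an arbitrary mean vector
t2 = theta @ theta                                    # ||theta||^2
a_tilde = t2 / (t2 + p)                               # oracle shrinkage factor

X = theta + rng.normal(size=(100_000, p))             # X ~ Np(theta, I)
print(((a_tilde * X - theta) ** 2).sum(1).mean())     # simulated risk of a~ X
print(p * t2 / (t2 + p), "vs risk of X:", p)          # closed form, and the risk of X
```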

4.3 Adaptive shrinkage estimation

As shown above, the optimal amount of shrinkage ã is

  ã = ||θ||²/(||θ||² + p) = (||θ||²/p)/(||θ||²/p + 1).


Note that ||θ||²/p is the variability of the θj values around zero. Can this variability be estimated? Consider the following hierarchical model:

  Xj = θj + εj
  ε1, . . . , εp ∼ i.i.d. N(0, 1)
  θ1, . . . , θp ∼ i.i.d. N(0, τ²).

If you'd like to connect this with some actual inference problem, imagine that each Xj is the sample mean or t-statistic calculated from observations from experiment j, with population mean θj.

Suppose you believed this model and knew the value of τ². If you were interested in finding an estimator δ(X) that minimized the expected squared error ||θ − δ(X)||² under repeated sampling of

• θ1, . . . , θp, followed by sampling of
• X1, . . . , Xp,

you would want to come up with an estimator δ(X) that minimized

  E[||θ − δ(X)||²] = ∫∫ ||θ − δ(x)||² p(dx|θ) p(dθ).

Exercise: Show that δτ²(X) = τ²/(1 + τ²) X minimizes the expected loss.

If we knew τ², then the estimator to use would be τ²/(1 + τ²) X. We generally don't know τ², but maybe it can be estimated from the data. Under the above model,

  X = θ + ε,
  θ ∼ Np(0, τ²I),
  ε ∼ Np(0, I),
  Cov(θ, ε) = 0.

This means that the distribution of X marginalized over θ is X ∼ Np(0, (τ² + 1)I).


An unbiased estimator of τ² + 1 is clearly ||X||²/p, so an unbiased estimator of τ² is

  τ̂² = (||X||² − p)/p.

However, we were interested in estimating τ²/(τ² + 1), not τ². If p > 2, you can use the fact that ||X||² ∼ gamma(p/2, 1/[2(τ² + 1)]) to show that E[1/||X||²] = 1/[(p − 2)(τ² + 1)], and so

  E[(p − 2)/||X||²] = 1/(τ² + 1),
  E[1 − (p − 2)/||X||²] = τ²/(τ² + 1).
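This unbiasedness fact is easy to check by simulation (my own sketch; the values of p and τ² are illustrative):

```python
import numpy as np

# Under the marginal model X ~ Np(0, (tau^2 + 1) I) with p > 2, the average of
# (p - 2)/||X||^2 should be close to 1/(tau^2 + 1).
rng = np.random.default_rng(4)
p, tau2 = 10, 3.0
X = rng.normal(scale=np.sqrt(tau2 + 1), size=(200_000, p))
print(((p - 2) / (X ** 2).sum(1)).mean(), 1 / (tau2 + 1))   # the two numbers should agree
```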

Again, τ²/(τ² + 1) X would be the optimal estimator in this hierarchical model if we knew τ². If we don't know τ², we might instead consider using

  δJS(X) = ( estimated τ²/(τ² + 1) ) X = (1 − (p − 2)/||X||²) X.

This estimator is called the James-Stein estimator. As we will see, it has many interesting properties (previewed numerically in the sketch after this list):

• For large p in the hierarchical normal model, it is almost as good as the oracle estimator τ²/(τ² + 1) X:

  E_{X,θ}[||θ − δJS||²] ≈ E_{X,θ}[||θ − τ²/(τ² + 1) X||²].

• Even if the hierarchical normal model isn't correct, it is still almost as good as the oracle estimator ãX in the normal means model:

  E_{X|θ}[||θ − δJS||²] ≈ E_{X|θ}[||θ − ãX||²].

• In the normal means problem, this estimator dominates the unbiased estimator X if p > 2:

  E_{X|θ}[||θ − δJS||²] < E_{X|θ}[||θ − X||²] for all θ.

We will show this last inequality first.
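Here is that simulation sketch (my own illustration; p and θ are arbitrary choices, and only the second and third properties are checked, since θ is held fixed):

```python
import numpy as np

# Average loss of X, of the oracle a~ X, and of the James-Stein estimator for a fixed theta.
rng = np.random.default_rng(5)
p, nsim = 20, 50_000
theta = rng.normal(size=p)                       # a fixed, arbitrary mean vector
t2 = theta @ theta
a_tilde = t2 / (t2 + p)                          # oracle shrinkage factor

X = theta + rng.normal(size=(nsim, p))           # X ~ Np(theta, I)
js_factor = 1 - (p - 2) / (X ** 2).sum(1)        # James-Stein shrinkage factor

loss = lambda est: ((est - theta) ** 2).sum(1).mean()
print("X          :", loss(X))                             # about p
print("oracle     :", loss(a_tilde * X))                   # about p*||theta||^2/(||theta||^2 + p)
print("James-Stein:", loss(js_factor[:, None] * X))        # close to the oracle, and below p
```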

5 Risk of δJS

We will show that δJS dominates X by showing that the risk R(θ, δJS) = E[||δJS − θ||²]/p is uniformly less than 1. This will not be done by computing the risk function exactly, but instead by deriving an upper bound on it that is less than 1. This bound will be obtained via an identity that has applications beyond the calculation of R(θ, δJS).

5.1 Risk bound for δJS

We can write the James-Stein estimator as

  δJS(x) = ( estimated τ²/(1 + τ²) ) x = (1 − (p − 2)/(x · x)) x
         = x − [(p − 2)/(x · x)] x
         ≡ x − g(x).

Under X ∼ Np(θ, I),

  E[||δJS − θ||²] = E[||X − g(X) − θ||²]
                  = E[||(X − θ) − g(X)||²]
                  = E[||X − θ||²] + E[||g(X)||²] − 2E[(X − θ) · g(X)],

where all expectations are with respect to the distribution of X given θ. The first expectation is p, and the second is

  E[(p − 2)² (X · X)/(X · X)²] = (p − 2)² E[1/(X · X)].

The third expectation is more complicated, but in the next subsection we'll derive an identity (Stein's identity) for computing E[(X − θ) · g(X)] that is applicable for arbitrary functions g. Stein's identity as applied to g(x) = [(p − 2)/(x · x)] x gives

  E[(X − θ) · g(X)] = E[(p − 2)²/(X · X)].


Using this for the above risk calculation gives

  E[||δJS − θ||²] = p + (p − 2)²E[1/(X · X)] − 2(p − 2)²E[1/(X · X)]
                  = p − E[(p − 2)²/(X · X)].

Note that we haven't actually calculated the risk of δJS in closed form: our formula depends on the expectation of 1/(X · X), which is an inverse moment of a noncentral χ² distribution whose noncentrality parameter depends on θ. However, computing this moment is not necessary to show that δJS dominates X: since 1/(x · x) > 0 for all x ∈ Rp, we have

  E[||δJS − θ||²] = p − E[(p − 2)²/(X · X)] < p = E[||X − θ||²].

Since the expectation of (X · X)⁻¹ is complicated, further study of the risk of δJS is often achieved via a study of its unbiased risk estimate. From the above calculation, we see that

  E[||δJS − θ||²] = E[ p − (p − 2)²/(X · X) ],

and so p − (p − 2)²/(X · X) can be said to be an unbiased estimate of the risk of δJS.
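As a quick check (a sketch of my own, with arbitrary p and θ), the average of this unbiased risk estimate over simulated draws matches the simulated risk of δJS, and both are below p:

```python
import numpy as np

# The average of p - (p - 2)^2/(X.X) should match the simulated risk of delta_JS.
rng = np.random.default_rng(6)
p = 10
theta = np.linspace(-1.0, 1.0, p)                          # an arbitrary mean vector
X = theta + rng.normal(size=(100_000, p))
xx = (X ** 2).sum(1)                                       # X . X for each draw
delta_js = (1 - (p - 2) / xx)[:, None] * X

print(((delta_js - theta) ** 2).sum(1).mean())             # simulated risk of delta_JS
print((p - (p - 2) ** 2 / xx).mean())                      # average unbiased risk estimate
```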

5.2 Stein's identity

We start with a univariate version of the identity:

Lemma 4 (Stein's identity). Let X ∼ N(µ, σ²) and let g(x) be such that E[|g′(X)|] < ∞. Then

  E[g(X)(X − µ)] = σ² E[g′(X)].

Proof. The proof follows from Fubini's theorem and a bit of calculus. Letting p(x) = φ([x − µ]/σ)/σ, note that p′(x) = −((x − µ)/σ²) p(x). By the fundamental theorem of calculus,

  p(x) = ∫_{−∞}^{x} −((y − µ)/σ²) p(y) dy = ∫_{x}^{∞} ((y − µ)/σ²) p(y) dy.

The expectation we wish to calculate is

  E[g′(X)] = ∫_{−∞}^{∞} g′(x) p(x) dx = ∫_{0}^{∞} g′(x) p(x) dx + ∫_{−∞}^{0} g′(x) p(x) dx.

Doing the first part, we have

  ∫_{0}^{∞} g′(x) p(x) dx = ∫_{0}^{∞} g′(x) ∫_{x}^{∞} ((y − µ)/σ²) p(y) dy dx
                          = ∫_{0}^{∞} ∫_{x}^{∞} g′(x) ((y − µ)/σ²) p(y) dy dx