3. Conjugate families of distributions

Objective

One problem in the implementation of Bayesian approaches is analytical tractability. For a likelihood function l(θ|x) and prior distribution p(θ), in order to calculate the posterior distribution it is necessary to evaluate

f(x) = ∫ l(θ|x)p(θ) dθ

and, to make predictions, we must evaluate

f(y|x) = ∫ f(y|θ)p(θ|x) dθ.

General approaches for evaluating such integrals numerically are studied in chapter 6, but in this chapter we will study the conditions under which such integrals may be evaluated analytically.

Recommended reading

• Wikipedia entry on conjugate priors: http://en.wikipedia.org/wiki/Conjugate_prior

• Bernardo, J.M. and Smith, A.F.M. (1994). Bayesian Theory, Section 5.2.

Coin tossing problems and beta priors

It is possible to generalize what we have seen in Chapter 2 to other coin tossing situations. Suppose that we have defined a beta prior distribution θ ∼ B(α, β) for θ = P(head). Then, if we observe a sample of coin toss data, whether the sampling mechanism is binomial, negative binomial or geometric, the likelihood function always takes the form

l(θ|x) = c θ^h (1 − θ)^t,

where c is some constant that depends on the sampling distribution and h and t are the observed numbers of heads and tails respectively.

Applying Bayes' theorem, the posterior distribution is

p(θ|x) ∝ c θ^h (1 − θ)^t × [1/B(α, β)] θ^(α−1) (1 − θ)^(β−1)
        ∝ θ^(α+h−1) (1 − θ)^(β+t−1),

which implies that θ|x ∼ B(α + h, β + t). Thus, using a beta prior guarantees that the posterior distribution is also beta. In this case, we say that the class of beta prior distributions is conjugate to the class of binomial (or geometric or negative binomial) likelihood functions.

Interpretation of the beta parameters

The posterior distribution is B(α + h, β + t), so that α (β) in the prior plays the role of the number of heads (tails) observed in the experiment. The information represented by the prior distribution can be viewed as equivalent to the information contained in an experiment where we observe α heads and β tails. Furthermore, the posterior mean is given by (α + h)/(α + β + h + t), which can be expressed in mixture form as

E[θ|x] = w α/(α + β) + (1 − w) h/(h + t) = w E[θ] + (1 − w) θ̂,

where w = (α + β)/(α + β + h + t) is the relative weight of the number of equivalent coin tosses in the prior.

Limiting results

Consider also what happens as we let α, β → 0. We can interpret this as representing a prior which contains information equivalent to zero coin tosses. In this case, the posterior distribution tends to B(h, t) and, for example, the posterior mean E[θ|x] → θ̂, the classical MLE. The limiting form of the prior distribution in this case is Haldane's (1931) prior,

p(θ) ∝ 1/[θ(1 − θ)],

which is an improper density. We analyze the use of improper densities further in chapter 5.
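As a quick illustration of these updating and limiting results, here is a minimal R sketch (the prior parameters and the observed counts h and t are made-up values, used only for illustration). It computes the posterior parameters, expresses the posterior mean as a weighted average of the prior mean and the MLE, and shows the limit as α, β → 0.

```r
# Beta prior for theta = P(head): theta ~ B(alpha, beta)
alpha <- 2; beta <- 2        # hypothetical prior parameters
h <- 9; t <- 3               # hypothetical observed numbers of heads and tails

# Posterior is B(alpha + h, beta + t)
post_alpha <- alpha + h
post_beta  <- beta + t

# Posterior mean as a weighted average of prior mean and MLE
w <- (alpha + beta) / (alpha + beta + h + t)
post_mean <- w * alpha / (alpha + beta) + (1 - w) * h / (h + t)
all.equal(post_mean, post_alpha / (post_alpha + post_beta))   # TRUE

# As alpha, beta -> 0 the posterior mean tends to the MLE h / (h + t)
(0 + h) / (0 + 0 + h + t)
```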

Prediction and posterior distribution

The form of the predictive and posterior distributions depends on the sampling distribution. We shall consider four cases:
• Bernoulli trials.
• Binomial data.
• Geometric data.
• Negative binomial data.

Bernoulli trials

Given that the beta distribution is conjugate in coin tossing experiments, given a (Bernoulli or binomial, etc.) sampling distribution f(x|θ), we only need to calculate the prior predictive density f(x) = ∫ f(x|θ)p(θ) dθ for a beta prior. Assume first that we have a Bernoulli trial with P(X = 1|θ) = θ and prior distribution θ ∼ B(α, β). Then:

Theorem 3
If X is a Bernoulli trial with parameter θ and θ ∼ B(α, β), then

f(x) = α/(α + β)   if x = 1,
f(x) = β/(α + β)   if x = 0,

E[X] = α/(α + β),   V[X] = αβ/(α + β)^2.

Given a sample x = (x1, . . . , xn) of Bernoulli data, the posterior distribution is θ|x ∼ B(α + Σ_{i=1}^n xi, β + n − Σ_{i=1}^n xi).

Proof

P(X = 1) = ∫_0^1 P(X = 1|θ)p(θ) dθ = E[θ] = α/(α + β),
P(X = 0) = 1 − P(X = 1) = β/(α + β),
E[X] = 0 × P(X = 0) + 1 × P(X = 1) = α/(α + β),
V[X] = E[X^2] − E[X]^2 = α/(α + β) − [α/(α + β)]^2 = αβ/(α + β)^2.

The posterior distribution formula was demonstrated previously.

Binomial data

Theorem 4
Suppose that X|θ ∼ BI(m, θ) with θ ∼ B(α, β). Then the predictive density of X is the beta-binomial density

f(x) = (m choose x) B(α + x, β + m − x)/B(α, β)   for x = 0, 1, . . . , m,
E[X] = m α/(α + β),
V[X] = m(m + α + β) αβ/[(α + β)^2 (α + β + 1)].

Given a sample, x = (x1, . . . , xn), of binomial data, the posterior distribution is θ|x ∼ B(α + Σ_{i=1}^n xi, β + mn − Σ_{i=1}^n xi).

Proof

f(x) = ∫_0^1 f(x|θ)p(θ) dθ
     = ∫_0^1 (m choose x) θ^x (1 − θ)^(m−x) [1/B(α, β)] θ^(α−1) (1 − θ)^(β−1) dθ
     = (m choose x) [1/B(α, β)] ∫_0^1 θ^(α+x−1) (1 − θ)^(β+m−x−1) dθ
     = (m choose x) B(α + x, β + m − x)/B(α, β)   for x = 0, . . . , m.

E[X] = E[E[X|θ]] = E[mθ], remembering the mean of a binomial distribution,
     = m α/(α + β).

V[X] = E[V[X|θ]] + V[E[X|θ]]
     = E[mθ(1 − θ)] + V[mθ]
     = m B(α + 1, β + 1)/B(α, β) + m^2 αβ/[(α + β)^2 (α + β + 1)]
     = m αβ/[(α + β)(α + β + 1)] + m^2 αβ/[(α + β)^2 (α + β + 1)]
     = m(m + α + β) αβ/[(α + β)^2 (α + β + 1)].
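A brief R check of Theorem 4 (with made-up values of m, α and β): the beta-binomial predictive probabilities are computed directly from the beta function, and the mean and variance formulas are verified numerically.

```r
# Beta-binomial predictive density f(x) = choose(m, x) * B(a + x, b + m - x) / B(a, b)
m <- 10; a <- 3; b <- 2          # hypothetical values
x <- 0:m
f <- choose(m, x) * beta(a + x, b + m - x) / beta(a, b)

sum(f)                           # should equal 1
c(sum(x * f), m * a / (a + b))   # predictive mean, two ways
c(sum(x^2 * f) - sum(x * f)^2,   # predictive variance, two ways
  m * (m + a + b) * a * b / ((a + b)^2 * (a + b + 1)))
```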

Geometric data

Theorem 5
Suppose that θ ∼ B(α, β) and X|θ ∼ GE(θ), that is

P(X = x|θ) = (1 − θ)^x θ   for x = 0, 1, 2, . . .

Then, the predictive density of X is

f(x) = B(α + 1, β + x)/B(α, β)   for x = 0, 1, 2, . . . ,
E[X] = β/(α − 1)   for α > 1,
V[X] = αβ(α + β − 1)/[(α − 1)^2 (α − 2)]   for α > 2.

Given a sample, x = (x1, . . . , xn), then θ|x ∼ B(α + n, β + Σ_{i=1}^n xi).

Proof
Exercise.
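Since the proof is left as an exercise, here is a small simulation sketch in R (with hypothetical α and β) that checks the predictive probabilities of Theorem 5 by first drawing θ from the beta prior and then X from the geometric distribution.

```r
set.seed(1)
a <- 3; b <- 2                   # hypothetical prior parameters
M <- 1e5                         # Monte Carlo sample size

theta <- rbeta(M, a, b)          # theta ~ B(a, b)
x     <- rgeom(M, prob = theta)  # X | theta ~ GE(theta), support 0, 1, 2, ...

# Compare simulated frequencies with f(x) = B(a + 1, b + x) / B(a, b)
x0 <- 0:5
cbind(simulated   = sapply(x0, function(k) mean(x == k)),
      theoretical = beta(a + 1, b + x0) / beta(a, b))
```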

Negative binomial data

Theorem 6
Suppose that θ ∼ B(α, β) and X|θ ∼ NB(r, θ), that is

P(X = x|θ) = (x + r − 1 choose x) θ^r (1 − θ)^x   for x = 0, 1, 2, . . .

Then:

f(x) = (x + r − 1 choose x) B(α + r, β + x)/B(α, β)   for x = 0, 1, . . . ,
E[X] = r β/(α − 1)   for α > 1,
V[X] = rβ(r + α − 1)(α + β − 1)/[(α − 1)^2 (α − 2)]   for α > 2.

Given a sample x = (x1, . . . , xn) of negative binomial data, the posterior distribution is θ|x ∼ B(α + rn, β + Σ_{i=1}^n xi).

Proof
Exercise.

Conjugate priors

Raiffa

The concept and formal definition of conjugate prior distributions comes from Raiffa and Schlaifer (1961).

Definition 2
If F is a class of sampling distributions f(x|θ) and P is a class of prior distributions for θ, p(θ), we say that P is conjugate to F if p(θ|x) ∈ P for all f(·|θ) ∈ F and all p(·) ∈ P.

The exponential-gamma system

Suppose that X|θ ∼ E(θ) is exponentially distributed and that we use a gamma prior distribution, θ ∼ G(α, β), that is

p(θ) = [β^α/Γ(α)] θ^(α−1) e^(−βθ)   for θ > 0.

Given sample data x, the posterior distribution is

p(θ|x) ∝ p(θ) l(θ|x)
        ∝ [β^α/Γ(α)] θ^(α−1) e^(−βθ) ∏_{i=1}^n θ e^(−θ xi)
        ∝ θ^(α+n−1) e^(−(β+n x̄)θ),

which is the nucleus of a gamma density, θ|x ∼ G(α + n, β + n x̄), and thus the gamma prior is conjugate to the exponential sampling distribution.

Interpretation and limiting results

The information represented by the prior distribution can be interpreted as being equivalent to the information contained in a sample of size α with sample mean β/α. Letting α, β → 0, the posterior distribution approaches G(n, n x̄) and thus, for example, the posterior mean tends to 1/x̄, which is equal to the MLE in this experiment. However, the limiting prior distribution in this case is

p(θ) ∝ 1/θ,

which is improper.
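A short R sketch of the exponential-gamma update (simulated data with a made-up true rate and hypothetical prior parameters): it computes the gamma posterior and shows the posterior mean approaching the MLE 1/x̄ when the prior parameters are small.

```r
set.seed(2)
n <- 50
x <- rexp(n, rate = 1.5)         # simulated exponential data, true rate 1.5

a <- 2; b <- 1                   # hypothetical gamma prior G(a, b) for theta
post_a <- a + n
post_b <- b + sum(x)             # equals b + n * mean(x)

post_a / post_b                  # posterior mean of theta
1 / mean(x)                      # MLE; the two coincide in the limit a, b -> 0
```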

Prediction and posterior distribution

Theorem 7
Let X|θ ∼ E(θ) and θ ∼ G(α, β). Then the predictive density of X is

f(x) = αβ^α/(β + x)^(α+1)   for x > 0,
E[X] = β/(α − 1)   for α > 1,
V[X] = αβ^2/[(α − 1)^2 (α − 2)]   for α > 2.

Given a sample x = (x1, . . . , xn) of exponential data, the posterior distribution is θ|x ∼ G(α + n, β + n x̄).

Proof
Exercise.

Exponential families

The exponential distribution is the simplest example of an exponential family distribution. Exponential family sampling distributions are closely related to the existence of conjugate prior distributions.

Definition 3
A probability density f(x|θ), where θ ∈ R, is said to belong to the one-parameter exponential family if it has the form

f(x|θ) = C(θ)h(x) exp(φ(θ)s(x))

for given functions C(·), h(·), φ(·), s(·). If the support of X is independent of θ then the family is said to be regular, and otherwise it is irregular.

Examples of exponential family distributions

Example 12
The binomial distribution,

f(x|θ) = (m choose x) θ^x (1 − θ)^(m−x)   for x = 0, 1, . . . , m
       = (1 − θ)^m (m choose x) exp(x log[θ/(1 − θ)]),

is a regular exponential family distribution.

Example 13
The Poisson distribution,

f(x|θ) = θ^x e^(−θ)/x! = e^(−θ) (1/x!) exp(x log θ)   for x = 0, 1, 2, . . . ,

is a regular exponential family distribution.

Irregular and non exponential family distributions

Example 14
The uniform distribution,

f(x|θ) = 1/θ   for 0 < x < θ,

is an irregular exponential family distribution.

Example 15
The Cauchy density,

f(x|θ) = 1/[π(1 + (x − θ)^2)]   for −∞ < x < ∞,

is not an exponential family distribution. The Student's t and Fisher's F distributions and the logistic distribution are other examples of non exponential family distributions.

Sufficient statistics and exponential family distributions

Theorem 8
If X|θ is a one-parameter, regular exponential family distribution, then given a sample x = (x1, . . . , xn), a sufficient statistic for θ is t(x) = Σ_{i=1}^n s(xi).

Proof
The likelihood function is

l(θ|x) = ∏_{i=1}^n C(θ)h(xi) exp(φ(θ)s(xi))
       = (∏_{i=1}^n h(xi)) C(θ)^n exp(φ(θ)t(x))
       = h(x) g(t, θ),

where h(x) = ∏_{i=1}^n h(xi) and g(t, θ) = C(θ)^n exp(φ(θ)t(x)), and therefore t is a sufficient statistic as in Theorem 1.

A conjugate prior to an exponential family distribution

If f(x|θ) is an exponential family with density as in Definition 3, then a conjugate prior distribution for θ exists.

Theorem 9
The prior distribution p(θ) ∝ C(θ)^a exp(φ(θ)b) is conjugate to the exponential family distribution likelihood.

Proof
From Theorem 8, the likelihood function for a sample of size n is l(θ|x) ∝ C(θ)^n exp(φ(θ)t(x)) and, given p(θ) as above, we have a posterior of the same form,

p(θ|x) ∝ C(θ)^(a*) exp(φ(θ)b*),

where a* = a + n and b* = b + t(x).

The Poisson-gamma system

Suppose that X|θ ∼ P(θ). Then, from Example 13, we have the exponential family form

f(x|θ) = e^(−θ) (1/x!) exp(x log θ)

and, from Theorem 9, a conjugate prior density for θ has the form

p(θ) ∝ (e^(−θ))^a exp(b log θ)

for some a, b. Thus, p(θ) ∝ θ^b e^(−aθ), which is the nucleus of a gamma density, θ ∼ G(α, β), where α = b + 1 and β = a. Thus, the gamma prior distribution is conjugate to the Poisson sampling distribution.

Derivation of the posterior distribution

Now assume the prior distribution θ ∼ G(α, β) and suppose that we observe a sample of n Poisson data. Then, we can derive the posterior distribution via Bayes' theorem as earlier:

p(θ|x) ∝ [β^α/Γ(α)] θ^(α−1) e^(−βθ) ∏_{i=1}^n θ^(xi) e^(−θ)/xi!
        ∝ θ^(α+n x̄−1) e^(−(β+n)θ),

so that θ|x ∼ G(α + n x̄, β + n).

Alternatively, we can use Theorem 9. The sufficient statistic is t(x) = Σ_{i=1}^n xi and thus, from Theorem 9, we know immediately that θ|x ∼ G(α + t(x), β + n) = G(α + n x̄, β + n), recalling that α plays the role of b + 1 and β plays the role of a.

Interpretation and limiting results

We might interpret the information in the prior as being equivalent to the information contained in a sample of size β with sample mean α/β. Letting α, β → 0, the posterior mean converges to the MLE, although the corresponding limiting prior, p(θ) ∝ 1/θ, is improper.

Prediction and posterior distribution

Theorem 10
Let X|θ ∼ P(θ) with θ ∼ G(α, β). Then

X ∼ NB(α, β/(β + 1)),   that is,

f(x) = [Γ(x + α)/(x! Γ(α))] [β/(β + 1)]^α [1/(β + 1)]^x   for x = 0, 1, 2, . . . ,
E[X] = α/β,
V[X] = α(β + 1)/β^2.

Given a sample x = (x1, . . . , xn) of Poisson data, the posterior distribution is θ|x ∼ G(α + n x̄, β + n).

Proof
Exercise.
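As a quick check of Theorem 10 in R (α and β are made-up values), the predictive probabilities computed from the formula coincide with R's built-in negative binomial density with size α and success probability β/(β + 1).

```r
a <- 4; b <- 2                   # hypothetical gamma prior parameters
x <- 0:10

# Predictive density from Theorem 10
f <- gamma(x + a) / (factorial(x) * gamma(a)) * (b / (b + 1))^a * (1 / (b + 1))^x

# Same probabilities via R's negative binomial density
all.equal(f, dnbinom(x, size = a, prob = b / (b + 1)))   # TRUE

# Predictive mean: numerical sum versus alpha / beta
c(sum(0:200 * dnbinom(0:200, size = a, prob = b / (b + 1))), a / b)
```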

The uniform-Pareto system

As noted in Example 14, the uniform distribution is not a regular exponential family distribution. However, we can still define a conjugate prior distribution. Let X ∼ U(0, θ). Then, given a sample of size n, the likelihood function is l(θ|x) = 1/θ^n for θ > xmax, where xmax is the sample maximum. Consider a Pareto prior distribution, θ ∼ PA(α, β), with density

p(θ) = αβ^α/θ^(α+1)   for θ > β > 0

and mean E[θ] = αβ/(α − 1). It is clear that this prior distribution is conjugate and that θ|x ∼ PA(α*, β*), where α* = α + n and β* = max{β, xmax}.

Properties and limiting results

The posterior mean in this case is

E[θ|x] = (α + n) max{β, xmax}/(α + n − 1),

which cannot be represented in a general way as an average of the prior mean and the MLE (xmax). We can interpret the information in the prior as being equivalent to a sample of size α where the maximum value is equal to β. When we let α, β → 0, the posterior distribution approaches PA(n, xmax) and in this case the posterior mean is

E[θ|x] = n xmax/(n − 1) > xmax = θ̂.

This also corresponds to an improper limiting prior p(θ) ∝ 1/θ. In this experiment, no prior distribution (except for a point mass at xmax) will lead to a posterior distribution whose mean coincides with the MLE.
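A minimal R sketch of the uniform-Pareto update (uniform data simulated under a made-up true θ, with hypothetical Pareto prior parameters), showing that the posterior mean always exceeds the MLE xmax:

```r
set.seed(3)
n <- 30
x <- runif(n, 0, 2.5)            # simulated U(0, theta) data, true theta = 2.5

a <- 2; b <- 1                   # hypothetical Pareto prior PA(a, b)
xmax <- max(x)

post_a <- a + n                  # posterior is PA(a + n, max(b, xmax))
post_b <- max(b, xmax)

post_a * post_b / (post_a - 1)   # posterior mean E[theta | x]
xmax                             # MLE; always smaller than the posterior mean
```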

Prediction and posterior distribution

Theorem 11
If X|θ ∼ U(0, θ) and θ ∼ PA(α, β), then

f(x) = α/[(α + 1)β]   if 0 < x < β,
f(x) = αβ^α/[(α + 1)x^(α+1)]   if x ≥ β,

E[X] = αβ/[2(α − 1)]   for α > 1,
V[X] = α(α^2 − 2α + 4)β^2/[12(α − 1)^2 (α − 2)]   for α > 2.

Given a sample of size n, the posterior distribution of θ is θ|x ∼ PA(α + n, max{β, xmax}).

Proof
Exercise.

Higher dimensional exponential family distributions

Definition 4
A probability density f(x|θ), where θ ∈ R^k, is said to belong to the k-parameter exponential family if it has the form

f(x|θ) = C(θ)h(x) exp(Σ_{j=1}^k φj(θ)sj(x))

for given functions C(·), h(·), φj(·), sj(·). If the support of X is independent of θ then the family is said to be regular, and otherwise it is irregular.

Example 16
The multinomial density is given by

f(x|θ) = [m!/(∏_{j=1}^k xj!)] ∏_{j=1}^k θj^(xj),

where xj ∈ {0, 1, . . . , m} for j = 1, . . . , k, Σ_{j=1}^k xj = m and Σ_{j=1}^k θj = 1.

We can write

f(x|θ) = [m!/(∏_{j=1}^k xj!)] exp(Σ_{j=1}^k xj log θj)
       = [m!/(∏_{j=1}^k xj!)] exp[(m − Σ_{j=1}^{k−1} xj) log θk + Σ_{j=1}^{k−1} xj log θj]
       = [m!/(∏_{j=1}^k xj!)] θk^m exp(Σ_{j=1}^{k−1} xj log(θj/θk)),

and thus the multinomial distribution is a regular, (k − 1)-dimensional exponential family distribution.

It is clear that we can generalize Theorem 8 to the k-dimensional case.

Theorem 12
If X|θ is a k-parameter, regular exponential family distribution, then given a sample x = (x1, . . . , xn), a sufficient statistic for θ is t(x) = (Σ_{i=1}^n s1(xi), . . . , Σ_{i=1}^n sk(xi)).

Proof
Using the same arguments as in Theorem 8, the result is easy to derive.

A conjugate prior to an exponential family sampling distribution

We can also generalize Theorem 9. If f(x|θ) is an exponential family with density as in Definition 4, then a conjugate prior distribution for θ exists.

Theorem 13
The prior distribution p(θ) ∝ C(θ)^a exp(Σ_{j=1}^k φj(θ)bj) is conjugate to the k-dimensional exponential family distribution likelihood.

Proof
Exercise.

The multinomial-Dirichlet system

Suppose that X|θ ∼ MN(m, θ) is a k-dimensional multinomial sampling distribution. Then, from Example 16, we can express the multinomial density as

f(x|θ) = [m!/(∏_{j=1}^k xj!)] θk^m exp(Σ_{j=1}^{k−1} xj log(θj/θk))

and therefore a conjugate prior distribution is

p(θ) ∝ (θk^m)^a exp(Σ_{j=1}^{k−1} bj log(θj/θk))   for arbitrary a, b1, . . . , b_{k−1}
     ∝ (∏_{j=1}^{k−1} θj^(bj)) θk^(ma − Σ_{j=1}^{k−1} bj)
     ∝ ∏_{j=1}^k θj^(αj−1),

where αj = bj + 1 for j = 1, . . . , k − 1 and αk = ma − Σ_{j=1}^{k−1} bj + 1.

The Dirichlet distribution

Definition 5
A random variable θ = (θ1, . . . , θk) is said to have a Dirichlet distribution, θ ∼ D(α1, . . . , αk), if

p(θ) = [Γ(Σ_{j=1}^k αj)/∏_{j=1}^k Γ(αj)] ∏_{j=1}^k θj^(αj−1)   for 0 < θj < 1, Σ_{j=1}^k θj = 1.

The Dirichlet distribution may be thought of as a generalization of the beta distribution. In particular, the marginal distribution of any θj is beta, that is θj ∼ B(αj, α0 − αj), where α0 = Σ_{j=1}^k αj.

Also, the moments of the Dirichlet distribution are easily evaluated:

E[θj] = αj/α0,
V[θj] = αj(α0 − αj)/[α0^2 (α0 + 1)],
Cov[θi, θj] = −αiαj/[α0^2 (α0 + 1)].

Prediction and posterior distribution

Theorem 14
Let X|θ ∼ MN(m, θ) and θ ∼ D(α1, . . . , αk). Then

P(X = x) = [m! Γ(α0)/Γ(m + α0)] ∏_{j=1}^k Γ(xj + αj)/[xj! Γ(αj)]   for xj ≥ 0, Σ_{j=1}^k xj = m,

where α0 = Σ_{j=1}^k αj. Also,

E[Xj] = m αj/α0,
V[Xj] = m(α0 + m) αj(α0 − αj)/[α0^2 (α0 + 1)],
Cov[Xi, Xj] = −m(α0 + m) αiαj/[α0^2 (α0 + 1)].

Given a sample, x = (x1, . . . , xn), of multinomial data, then

θ|x ∼ D(α1 + Σ_{i=1}^n x1i, . . . , αk + Σ_{i=1}^n xki).

Proof
Exercise.
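The following R sketch (category counts and prior parameters are hypothetical) illustrates the Dirichlet update for multinomial data and checks the beta marginal of Definition 5 by sampling the Dirichlet via independent gamma draws.

```r
set.seed(4)
alpha <- c(1, 2, 3)              # hypothetical Dirichlet prior parameters
x     <- c(12, 5, 8)             # hypothetical multinomial counts (one sample, m = 25)

post_alpha <- alpha + x          # posterior is D(alpha + x)

# Sample from a Dirichlet by normalising independent gamma draws
rdirich <- function(M, a) {
  g <- sapply(a, function(ai) rgamma(M, shape = ai, rate = 1))
  g / rowSums(g)
}
theta <- rdirich(1e5, post_alpha)

# Marginal of theta_1 should be B(alpha_1, alpha_0 - alpha_1)
a0 <- sum(post_alpha)
c(mean(theta[, 1]), post_alpha[1] / a0)                                       # means
c(var(theta[, 1]), post_alpha[1] * (a0 - post_alpha[1]) / (a0^2 * (a0 + 1)))  # variances
```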

Canonical form exponential family distributions

Suppose that we have a standard exponential family distribution

f(x|θ) = C(θ)h(x) exp(Σ_{j=1}^k φj(θ)sj(x)).

Define the transformed variables Yj = sj(X) and transformed parameters φj = φj(θ) for j = 1, . . . , k. Then the density of y|φ is of canonical form.

Definition 6
The density

f(y|φ) = D(y) exp(Σ_{j=1}^k yjφj − e(φ))

is called the canonical form representation of the exponential family distribution.

Conjugate analysis

Clearly, a conjugate prior for φ takes the form

p(φ) ∝ exp(Σ_{j=1}^k bjφj − a e(φ)) ∝ exp(Σ_{j=1}^k aBjφj − a e(φ)),

where Bj = bj/a for j = 1, . . . , k. We shall call this form the canonical prior for φ. If a sample of size n is observed, then

p(φ|y) ∝ exp(Σ_{j=1}^k (aBj + n ȳ·j)φj − (a + n)e(φ)) ∝ exp((aB + n ȳ)ᵀφ − (a + n)e(φ)),

which is again of canonical prior form, with a replaced by a + n and B replaced by (aB + n ȳ)/(a + n). The updating process for conjugate models thus involves a simple weighted average representation.

The posterior mean as a weighted average

Suppose that we observe a sample of n data from a canonical exponential family distribution as in Definition 6 and that we use a canonical prior distribution

p(φ) ∝ exp(Σ_{j=1}^k aBjφj − a e(φ)) ∝ exp(aBᵀφ − a e(φ)).

Then we can demonstrate the following theorem.

Theorem 15
E[∇e(φ)|y] = (aB + n ȳ)/(a + n), where [∇e(φ)]j = ∂e(φ)/∂φj.

Proof
As we have shown that the prior is conjugate, it is enough to prove that, a priori, E[∇e(φ)] = B. However,

a(B − E[∇e(φ)]) = ∫ a(B − ∇e(φ))p(φ) dφ = ∫ ∇p(φ) dφ = 0,

assuming that p(φ) vanishes on the boundary of the parameter space, which gives the result.

Example 17
Let X|θ ∼ P(θ). Then, from Example 13, we can write

f(x|θ) = e^(−θ) (1/x!) exp(x log θ) = (1/x!) exp(x log θ − θ) = (1/x!) exp(xφ − e^φ),

where φ = log θ, in canonical exponential family form. We already know that the gamma prior density θ ∼ G(α, β) is conjugate here. Thus,

p(θ) ∝ θ^(α−1) e^(−βθ)  ⇒  p(φ) ∝ e^(αφ) e^(−βe^φ) ∝ exp(β (α/β) φ − β e^φ),

in canonical form. Now ∇e(φ) = (d/dφ) e^φ = e^φ = θ. Thus, from Theorem 15, we have

E[θ|x] = [β/(β + n)] (α/β) + [n/(β + n)] x̄,

a weighted average of the prior mean and the MLE.

Example 18
Consider the Bernoulli trial f(x|θ) = θ^x (1 − θ)^(1−x) for x = 0, 1. We have

f(x|θ) = (1 − θ) [θ/(1 − θ)]^x
       = exp(x log[θ/(1 − θ)] − log[1/(1 − θ)])
       = exp(xφ − log(1 + e^φ)),

in canonical form, where φ = log[θ/(1 − θ)].

Thus, the canonical prior for φ must take the form

p(φ) ∝ exp(aBφ − a log(1 + e^φ)) ∝ (1 + e^φ)^(−a) exp(aBφ).

This implies, using the standard change of variables formula, that

p(θ) ∝ (1 − θ)^a [θ/(1 − θ)]^(aB) × 1/[θ(1 − θ)]
     ∝ θ^(aB−1) (1 − θ)^(a(1−B)−1),

which we can recognize as a beta distribution, B(α, β), with parameters α = aB and β = a(1 − B).

Now ∇ log(1 + e^φ) = e^φ/(1 + e^φ) = θ and so, from Theorem 15, given a sample of n Bernoulli trials, we have

E[θ|x] = (aB + n x̄)/(a + n) = w α/(α + β) + (1 − w) x̄,

where w = (α + β)/(α + β + n).

We previously derived this formula by standard Bayesian analysis in the coin tossing section above.

Mixtures of conjugate priors

Suppose that a simple conjugate prior distribution p(·) ∈ P does not represent our prior beliefs well. An alternative is to consider a mixture of conjugate prior distributions,

Σ_{i=1}^k wi pi(θ),

where 0 < wi < 1, Σ_{i=1}^k wi = 1 and pi(·) ∈ P.

Note that Dalal and Hall (1983) demonstrate that any prior density for an exponential family can be approximated arbitrarily closely by a mixture of conjugate distributions. It is easy to demonstrate that, given a conjugate prior mixture distribution, the posterior distribution is also a mixture of conjugate densities.

Proof
Using Bayes' theorem, we have

p(θ|x) ∝ f(x|θ) Σ_{i=1}^k wi pi(θ)
       = Σ_{i=1}^k wi pi(θ)f(x|θ)
       = Σ_{i=1}^k wi (∫ pi(θ)f(x|θ) dθ) × pi(θ)f(x|θ)/∫ pi(θ)f(x|θ) dθ
       = Σ_{i=1}^k wi (∫ pi(θ)f(x|θ) dθ) pi(θ|x),

where pi(·|x) ∈ P because it is assumed that pi(·) is conjugate, and therefore

p(θ|x) = Σ_{i=1}^k wi* pi(θ|x),   where wi* = wi ∫ pi(θ)f(x|θ) dθ / Σ_{j=1}^k wj ∫ pj(θ)f(x|θ) dθ.

Example 19
In Example 11, suppose that we use a mixture prior distribution: θ ∼ 0.25 B(1, 1) + 0.75 B(5, 5). Then, the posterior distribution is given by

p(θ|x) ∝ [0.25 × 1 + 0.75 × (1/B(5, 5)) θ^(5−1)(1 − θ)^(5−1)] θ^9 (1 − θ)^3
       ∝ 0.25 θ^(10−1)(1 − θ)^(4−1) + [0.75/B(5, 5)] θ^(14−1)(1 − θ)^(8−1)
       ∝ 0.25 B(10, 4) [1/B(10, 4)] θ^(10−1)(1 − θ)^(4−1) + 0.75 [B(14, 8)/B(5, 5)] [1/B(14, 8)] θ^(14−1)(1 − θ)^(8−1)
       = w* [1/B(10, 4)] θ^(10−1)(1 − θ)^(4−1) + (1 − w*) [1/B(14, 8)] θ^(14−1)(1 − θ)^(8−1),

where w* = B(10, 4)/[B(10, 4) + 3 B(14, 8)/B(5, 5)] = 0.2315.

Thus, the posterior distribution is a mixture of B(10, 4) and B(14, 8) densities with weights 0.2315 and 0.7685 respectively.
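The posterior mixture weight in Example 19 can be checked directly in R using the base beta function:

```r
# Posterior weight on the B(10, 4) component in Example 19
w_star <- 0.25 * beta(10, 4) /
          (0.25 * beta(10, 4) + 0.75 * beta(14, 8) / beta(5, 5))
round(w_star, 4)                 # 0.2315

# Posterior density: mixture of B(10, 4) and B(14, 8) densities
posterior <- function(theta)
  w_star * dbeta(theta, 10, 4) + (1 - w_star) * dbeta(theta, 14, 8)
integrate(posterior, 0, 1)       # integrates to 1
```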

Software for conjugate models

Some general software for undertaking conjugate Bayesian analysis has been developed.
• First Bayes is a slightly dated but fairly complete package.
• The book on Bayesian computation by Albert (2009) gives a number of R routines for fitting conjugate and non conjugate models, all contained in the R LearnBayes package.

References

Albert, J. (2009). Bayesian Computation with R. Berlin: Springer.
Dalal, S.R. and Hall, W.J. (1983). Approximating priors by mixtures of natural conjugate priors. Journal of the Royal Statistical Society Series B, 45, 278–286.
Haldane, J.B.S. (1931). A note on inverse probability. Proceedings of the Cambridge Philosophical Society, 28, 55–61.
Raiffa, H. and Schlaifer, R. (1961). Applied Statistical Decision Theory. Cambridge, MA: Harvard University Press.

Application I: Bayesian inference for Markovian queuing systems

A queue of children

We will study inference for the M/M/1 queueing system, as developed by Armero and Bayarri (1994a).

The M/M/1 queueing system

The M(λ)/M(µ)/1 system is a queue with a single server and exponential inter-arrival and service times with rates λ and µ respectively. Under this model, clients arrive at the checkout according to a Markov process with rate λ and, if the server is free, they start an exponential service time with rate µ. If the server is busy, the client joins the queue of customers waiting to be served.

Stability and equilibrium distributions

In practical analysis, it is of particular importance to study the conditions under which such a queueing system is stable, i.e. when the queue length does not go off to infinity as more customers arrive. It can be shown that the system is stable if the traffic intensity, defined by ρ = λ/µ, is strictly less than 1. In the case that the system is stable, it is interesting to consider the equilibrium or stationary distribution of the queue size, client waiting times, etc. It can be shown that the equilibrium distribution of the number of clients in the system, N, is geometric,

P(N = n|ρ) = (1 − ρ)ρ^n   for n = 0, 1, 2, . . . ,

with mean E[N|ρ] = ρ/(1 − ρ).

Also, the limiting distribution of the time, W, spent in the system by a customer is exponential, W|λ, µ ∼ E(µ − λ), with mean E[W|λ, µ] = 1/(µ − λ). The distributions of other variables of interest, such as the duration of a busy period, are also well known. See e.g. Gross and Harris (1985).

Experiment and inference

In real systems, the arrival and service rates will be unknown and we must use inferential techniques to estimate these parameters and the system characteristics. A first step is to find a reasonable way of collecting data. The simplest experiment is that of observing nl inter-arrival times and ns service times. In this case, the likelihood is

l(λ, µ|x, y) ∝ λ^(nl) e^(−λ tl) µ^(ns) e^(−µ ts),

where tl and ts are the sums of inter-arrival and service times respectively. If we use conjugate, independent gamma prior distributions, λ ∼ G(αl, βl), µ ∼ G(αs, βs), then the posterior distributions are also gamma distributions:

λ|x ∼ G(αl + nl, βl + tl),   µ|y ∼ G(αs + ns, βs + ts).
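A minimal R sketch of this update (the gamma prior parameters are hypothetical small values approximating the improper limiting priors used later; the data summaries are those of the ATM example below):

```r
# Data summaries: nl inter-arrival times summing to tl, ns service times summing to ts
nl <- 98; tl <- 119.71
ns <- 98; ts <- 81.35

# Hypothetical gamma prior parameters (small values give a vague prior)
alpha_l <- 0.01; beta_l <- 0.01
alpha_s <- 0.01; beta_s <- 0.01

# Gamma posteriors for the arrival rate lambda and the service rate mu
c(shape = alpha_l + nl, rate = beta_l + tl)   # lambda | x
c(shape = alpha_s + ns, rate = beta_s + ts)   # mu | y
```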

Estimation of the traffic intensity

It is straightforward to estimate the mean of the traffic intensity ρ. We have

E[ρ|x, y] = E[λ/µ | x, y]
          = E[λ|x] E[1/µ | y]
          = [(αl + nl)/(βl + tl)] × [(βs + ts)/(αs + ns − 1)].

We can also evaluate the distribution of ρ by recalling that if φ ∼ G(a/2, b/2), then bφ ∼ χ²_a is chi-square distributed.

The probability that the system is stable

Thus, we have

2(βl + tl)λ|x ∼ χ²_{2(αl+nl)},   2(βs + ts)µ|y ∼ χ²_{2(αs+ns)}

and therefore, recalling that the ratio of two independent χ² variables, each divided by its degrees of freedom, is an F distributed variable, we have

[(βl + tl)(αs + ns)/((αl + nl)(βs + ts))] ρ | x, y ∼ F(2(αl+nl), 2(αs+ns)).

The posterior probability that the system is stable is

p = P(ρ < 1|x, y) = P(F < (βl + tl)(αs + ns)/[(αl + nl)(βs + ts)]),

where F ∼ F(2(αl+nl), 2(αs+ns)), which can be easily evaluated in e.g. Matlab or R.
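In R, for example, this probability is a single call to the F distribution function (hypothetical vague prior parameters; data summaries from the ATM example below):

```r
nl <- 98; tl <- 119.71; ns <- 98; ts <- 81.35
alpha_l <- 0.01; beta_l <- 0.01; alpha_s <- 0.01; beta_s <- 0.01   # hypothetical priors

# P(rho < 1 | x, y) = P(F < c) with F ~ F(2(alpha_l + nl), 2(alpha_s + ns))
c_stab <- (beta_l + tl) * (alpha_s + ns) / ((alpha_l + nl) * (beta_s + ts))
pf(c_stab, df1 = 2 * (alpha_l + nl), df2 = 2 * (alpha_s + ns))
```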

Estimation of the queue size in equilibrium

If p is large, it is natural to suppose that the system is stable. In this case, it is interesting to carry out prediction for the equilibrium distribution of the number of clients in the system. We have

P(N = n|x, y, equilibrium) = P(N = n|x, y, ρ < 1)
                           = ∫_0^1 (1 − ρ)ρ^n p(ρ|x, y, ρ < 1) dρ
                           = (1/p) ∫_0^1 (1 − ρ)ρ^n p(ρ|x, y) dρ.

We will typically need numerical methods to evaluate this integral. One possibility is to use simple numerical integration. Another technique is to use Monte Carlo sampling.

Aside: Monte Carlo sampling

Suppose that we wish to estimate an integral

E[g(X)] = ∫ g(x)f(x) dx,

where X is a random variable with density f(·). Then, if we draw a sample, say x1, . . . , xM, of size M from f(·), then, under certain regularity conditions, as M → ∞, we have

ḡ = (1/M) Σ_{i=1}^M g(xi) → E[g(X)].

In order to assess the precision of a Monte Carlo based estimator, a simple technique is to calculate the confidence band ḡ ± 2 sg/√M, where sg² = (1/M) Σ_{i=1}^M (g(xi) − ḡ)² is the sample variance of g(·). The Monte Carlo sample size can be increased until the size of the confidence band goes below a certain preset precision ε. We will discuss Monte Carlo methods in more detail in chapter 6.

Using Monte Carlo to estimate the system size distribution

We can set up a generic algorithm to generate a Monte Carlo sample.
1. Fix a large value M.
2. For i = 1, . . . , M:
   (a) Generate λi ∼ G(αl + nl, βl + tl) and µi ∼ G(αs + ns, βs + ts).
   (b) Set ρi = λi/µi.
   (c) If ρi > 1, return to (a).

Given the Monte Carlo sample, we can then estimate the queue size probabilities,

P(N = n|x, y, equilibrium) ≈ (1/M) Σ_{i=1}^M (1 − ρi)ρi^n   for n = 0, 1, 2, . . . ,

and the waiting time distribution,

P(W ≤ w|x, y, equilibrium) ≈ 1 − (1/M) Σ_{i=1}^M e^(−(µi−λi)w).

An R sketch of this algorithm is given below.
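A sketch of the algorithm in R (vague hypothetical priors and the ATM data summaries as before). Instead of resampling each rejected pair, the sketch simply discards all draws with ρ ≥ 1, which is equivalent to the rejection step in (c).

```r
set.seed(5)
M <- 1e5
nl <- 98; tl <- 119.71; ns <- 98; ts <- 81.35
alpha_l <- 0.01; beta_l <- 0.01; alpha_s <- 0.01; beta_s <- 0.01   # hypothetical priors

lambda <- rgamma(M, alpha_l + nl, rate = beta_l + tl)
mu     <- rgamma(M, alpha_s + ns, rate = beta_s + ts)
rho    <- lambda / mu

# Keep only draws from the stable region (rho < 1), as in step (c)
keep   <- rho < 1
lambda <- lambda[keep]; mu <- mu[keep]; rho <- rho[keep]

# Equilibrium queue size probabilities P(N = n | data, equilibrium)
p_N <- sapply(0:15, function(n) mean((1 - rho) * rho^n))
round(p_N, 3)

# Equilibrium waiting time distribution P(W <= w | data, equilibrium)
w <- seq(0, 6, by = 0.5)
p_W <- sapply(w, function(wi) mean(1 - exp(-(mu - lambda) * wi)))
round(p_W, 3)
```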

Estimating the mean of the equilibrium system size distribution

It is easier to evaluate the predictive moments of N. We have

E[N|x, y, equilibrium] = E[E[N|ρ] | x, y, ρ < 1]
                       = E[ρ/(1 − ρ) | x, y, ρ < 1]
                       = (1/p) ∫_0^1 [ρ/(1 − ρ)] p(ρ|x, y) dρ = ∞,

and thus these moments do not exist. This is a general characteristic of inference in queueing systems with Markovian inter-arrival or service processes. The same thing also happens with the equilibrium waiting time or busy period distributions. See Armero and Bayarri (1994a) or Wiper (1998). The problem stems from the use of (independent) prior distributions for λ and µ with P(λ ≥ µ) > 0 and can be partially resolved by assuming a priori that P(λ ≥ µ) = 0. See Armero and Bayarri (1994b) or Ruggeri et al. (1996).

Example

Hall (1991) gives collected inter-arrival and service time data for 98 users of an automatic teller machine in Berkeley, USA. We shall assume here that inter-arrival and service times can be modeled as exponential variables.

[Figure: histogram of the observed data.]

The sufficient statistics are nl = ns = 98, tl = 119.71 and ts = 81.35 minutes respectively. We will assume the improper prior distributions p(λ) ∝ 1/λ and p(µ) ∝ 1/µ, which we have seen earlier as limiting cases of the conjugate gamma priors. Then λ|x ∼ G(98, 119.71) and µ|y ∼ G(98, 81.35).

The posterior distribution of the traffic intensity

The posterior expected value of ρ is E[ρ|x, y] ≈ 0.69 and the distribution function F(ρ|x, y) is illustrated below.

[Figure: posterior cdf F(ρ|data) plotted for ρ between 0 and 1.]

The posterior probability that the system is stable is 0.997.

The posterior distribution of N

We used a large Monte Carlo sample to estimate the posterior density of N. There is only a very small probability that there are more than 15 clients in the system.

[Figure: estimated probabilities P(N = n|data) for n = 0, . . . , 16.]

The posterior distribution of W

There is only a very small probability that a client spends over 5 minutes in the system.

[Figure: estimated distribution function P(W ≤ w|data) for w between 0 and 6 minutes.]

Extensions

• Calculation of busy period and transient time distributions.
• Extension to other queueing systems:
  – Markovian systems with various or unlimited numbers of servers, e.g. Armero and Bayarri (1997).
  – Networks of queues, e.g. Armero and Bayarri (1999).
  – Non Markovian systems. See e.g. Wiper (1998), Ausín et al. (2004).

References

Armero, C. and Bayarri, M.J. (1994a). Bayesian prediction in M/M/1 queues. Queueing Systems, 15, 419–426.
Armero, C. and Bayarri, M.J. (1994b). Prior assessments for prediction in queues. The Statistician, 43, 139–153.
Armero, C. and Bayarri, M.J. (1997). Bayesian analysis of a queueing system with unlimited service. Journal of Statistical Planning and Inference, 58, 241–261.
Armero, C. and Bayarri, M.J. (1999). Dealing with uncertainties in queues and networks of queues: a Bayesian approach. In: Multivariate Analysis, Design of Experiments and Survey Sampling, Ed. S. Ghosh. New York: Marcel Dekker, pp. 579–608.
Ausín, M.C., Wiper, M.P. and Lillo, R.E. (2004). Bayesian estimation for the M/G/1 queue using a phase type approximation. Journal of Statistical Planning and Inference, 118, 83–101.
Gross, D. and Harris, C.M. (1985). Fundamentals of Queueing Theory (2nd ed.). New York: Wiley.
Hall, R.W. (1991). Queueing Methods. New Jersey: Prentice Hall.
Ruggeri, F., Wiper, M.P. and Rios Insua, D. (1996). Bayesian models for correlation in M/M/1 queues. Quaderno IAMI 96.8, CNR-IAMI, Milano.
Wiper, M.P. (1998). Bayesian analysis of Er/M/1 and Er/M/c queues. Journal of Statistical Planning and Inference, 69, 65–79.