CSU Hayward

Statistics Department

Session 3: A Simple Example of the Gibbs Sampler


1. Introduction — Using Conditional Information to Estimate Prevalence

This session relies heavily on information and notation from Sessions 1 and 2. It covers much of the material in Sections 4 and 5 of Suess, Fraser, and Trumbo: "Elementary Uses of the Gibbs Sampler: Applications to Medical Screening Tests," STATS #27, Winter 2000, pages 3-10. The main differences here are that additional detail is provided and that exercises suitable for classroom use are included. Some of the exercises require the use of computers for simulation.


In this session we return to the problem of using screening test data to estimate the prevalence of infection by a virus among units of blood at a particular site. Suppose we know two types of conditional information:

The conditional distribution of the test result given the infection status, T|D, determined by the sensitivity h = P(T=1|D=1) and the specificity q = P(T=0|D=0).

The conditional distribution of the infection status given the test result, D|T, determined by the predictive value of a positive test g = P(D=1|T=1) and the predictive value of a negative test d = P(D=0|T=0).

With this information it is possible to determine the prevalence of infection p = P(D=1) at the site. We will see that this computation can be done by several methods, including simple algebra and the Gibbs Sampler. Even though prevalence can be computed algebraically, our purpose here is to illustrate how the Gibbs Sampler can be used to do the computation. Because the Gibbs Sampler involves extensive computer simulation and gives approximate answers, its practical use is for cases where an algebraic solution is impossible or very difficult. But for us it is instructive to meet the Gibbs Sampler for the first time in a situation where its results are easy to verify by a more familiar method.

1.1 Simulation steps of the Gibbs Sampler

If we knew the joint distribution of the random variables D and T, then we could find the prevalence directly as the marginal probability

p = P(D=1) = P(D=1, T=1) + P(D=1, T=0).

Here instead, as is often the case in real applications, we know the two conditional distributions T|D and D|T, not the joint distribution. For us, the conditional distribution of T|D is determined by h and q, and the conditional distribution of D|T is determined by g and d.

Here is how the simulation procedure of the Gibbs Sampler untangles the given conditional information to find the prevalence p.

Step 1. Begin at step m = 1 with an arbitrary initial value of D, say D(1) = 1.

Step 2a. At step m = 2, condition on this value of D(1) to simulate whether T(1) is 1 or 0. That is, simulate T(1) = 1 with probability h or T(1) = 0 with probability h*. (Alternatively, if we had begun with D(1) = 0, then we would simulate using q* or q.)

Step 2b. Next, simulate the value of D(2) using information from g or d, as appropriate. For example, if we happened to get T(1) = 1, then we would simulate D(2) = 1 with probability g = P{D(2)=1|T(1)=1}.

Step 3a. In turn, at step m = 3, simulate T(2) using either h or q.

Step 3b. Then simulate D(3) using g or d, depending on the value of T(2).

Notice that at step m = 2 we started with a value of D(1) and then went via a simulated value of T(1) to obtain a simulated value of D(2). In the two parts of step m = 3 we start with this value of D(2) to obtain a simulated value of D(3).

Iteration. It can be shown that, upon iteration, this simulation process will stabilize to a limit. So repeat it to obtain D(1), D(2), ..., D(M1) during a "burn-in" period long enough to achieve stability, and then continue on to obtain values D(M1+1), ..., D(M2) at "steady state."

Result. Finally, we estimate p as the proportion of steps at steady state for which D(m) = 1.
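The steps above can be sketched in Python (our own translation of the algorithm; the programs on this site are in S-Plus and Q-Basic). The parameter values passed at the bottom are those of Section 1.2:

```python
import random

def gibbs_prevalence(h, q, g, d, burn_in, total, d_start=1, seed=None):
    """Estimate the prevalence p = P(D=1) using only the conditional
    distributions T|D (sensitivity h, specificity q) and D|T
    (predictive values g, d)."""
    rng = random.Random(seed)
    D = d_start
    ones = 0
    for m in range(total):
        u1, u2 = rng.random(), rng.random()
        # Part (a) of each step: simulate T(m) from the current D(m).
        if D == 1:
            T = 1 if u1 < h else 0      # P{T=1 | D=1} = h
        else:
            T = 0 if u1 < q else 1      # P{T=0 | D=0} = q
        # Part (b): simulate D(m+1) from T(m).
        if T == 1:
            D = 1 if u2 < g else 0      # P{D=1 | T=1} = g
        else:
            D = 0 if u2 < d else 1      # P{D=0 | T=0} = d
        # Count D's only after the burn-in period.
        if m >= burn_in:
            ones += (D == 1)
    return ones / (total - burn_in)

# One run with M1 = 50,000 and M2 = 100,000, as in Section 1.2.
p_hat = gibbs_prevalence(0.99, 0.97, 0.4024, 0.9998, 50_000, 100_000, seed=1)
print(round(p_hat, 4))
```

A single run of this sketch behaves like one row of Table 1: the estimate lands near 2%, with run-to-run variability on the order of 0.001.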

The Gibbs simulation algorithm just outlined is also available on this site as two computer programs, one in S-Plus and one in Q-Basic.

1.2. Simulation results

Throughout these sessions we are assuming that h = 99% and q = 97% for our ELISA screening test. At a particular site, suppose that reliable estimates of the other conditional probabilities are g = 40.24% and d = 99.98%. Based on these four numbers, we used an S-Plus program to simulate 20 runs with M1 = 50,000 and M2 = 100,000.

Ten of these runs started with D(1) = 0 and ten with D(1) = 1. We estimated p for each run from the last 50,000 values of D. Table 1 shows the results. In these circumstances, the consistent results show that p must be near 2%.

Table 1: Prevalence As Estimated From 20 Simulation Runs
(Each estimate based on 50,000 iterations at steady state)

                 Initial Value
           D(1) = 0            D(1) = 1
        0.0204  0.0201      0.0216  0.0187
        0.0210  0.0218      0.0184  0.0192
        0.0212  0.0185      0.0200  0.0180
        0.0184  0.0194      0.0206  0.0222
        0.0194  0.0199      0.0210  0.0194
Mean       0.0200 = 2.00%      0.0199 = 1.99%
StDev      0.0011              0.0014

Note: It would have been futile to try to simulate p = P(D=1) directly from the conditional distributions of D|T. The probability g = P(D=1|T=1) = 40.24% is much too large and d* = P(D=1|T=0) = 0.02% is much too small. However, as the simulation goes through the process described above, the conditional distributions of T|D come into play to ensure that, in the long run, each of these cases is simulated the appropriate proportion of the time. Thus the proportion of steps at steady state with D(m) = 1 turns out to be a good estimate of the prevalence p.

1.3. Questions about the simulation process

Before we move on, it is worthwhile to ask several questions about the simulation process.

First. Did the starting value, 0 or 1, of D(1) make any difference in the values of p obtained? Our results in Table 1 show that the starting value makes no significant difference.

Second. Were our simulation runs long enough? The similarity of the results in Table 1 shows that our values of M1 and M2 were large enough to give reproducible results, but longer runs would have further reduced their variability. The running proportion of D’s taking the value 1 (up to each stage m) is plotted in Figure 1 for one of our runs. The stability shown there is typical of all our runs. (Because d is nearly 1, this particular example requires relatively long simulation runs to achieve satisfactory results.)

Third. Why did this converging process always settle down to values of p ≈ 2%? Although the 2% value of p was not an explicit input value, it is consistent with the values of g and d we assumed for our hypothetical site in Session 1. The simulation has reclaimed this 2% value for p from the conditional information we provided.

In the next section we will see the theoretical basis for the stability of the simulation process and for the specific numerical answer we obtained.

Problems:

1.1. Assume the values h = 99%, q = 97%, g = 40.24% and d = 99.98%. Use the definitions of these quantities to show that P(D=1, T=1) = 0.99p = 0.4024t and that P(D=0, T=0) = 0.9998(1 – t) = 0.97(1 – p), where t = P(T=1). Solve these two equations to obtain the value of p. (Recall that in Session 1, the values h = 99%, q = 97%, and p = 2% were used to derive these assumed values of g and d. What value of p do you suppose you will retrieve from the values of h, q, g, and d assumed here?)
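If you would like to check your algebra for this problem numerically, here is one way in Python (our own sketch; t denotes P(T=1), and exact fractions are used to avoid round-off):

```python
# Numeric check of Problem 1.1. The two joint-probability equations
# are linear in p and t:
#   0.99 p = 0.4024 t                 (both sides equal P(D=1, T=1))
#   0.9998 (1 - t) = 0.97 (1 - p)     (both sides equal P(D=0, T=0))
from fractions import Fraction

h, g = Fraction("0.99"), Fraction("0.4024")
d, q = Fraction("0.9998"), Fraction("0.97")
# Substitute t = (h/g) p into the second equation and solve for p:
#   d - d (h/g) p = q - q p   =>   p = (d - q) / (d h / g - q)
p = (d - q) / (d * h / g - q)
t = (h / g) * p
print(float(p), float(t))
```

The recovered p is very close to (but, because g and d were rounded to four places, not exactly) 2%.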

1.2. Here are 12 random numbers (actually, pseudorandom numbers, because they were generated by a computer program). Assume that they are a random sample from the uniform distribution on the interval [0, 1).

0.38453   0.52815   0.12248   0.58569   0.97243   0.29543
0.34281   0.93651   0.69464   0.49699   0.64854   0.81394

(a) Use these random numbers to simulate 8 tosses of a fair coin: The coin is said to show Heads if its random number is smaller than 0.5, otherwise Tails. [This works because a random variable U that is uniformly distributed on [0, 1) has P(U < 0.5) = 0.5.] How many Heads do you get?

(b) Suppose that an ELISA test has specificity 97% and that it is administered to 12 units of blood that are, in fact, free of the virus. This event can be simulated by saying that a random number less than 0.97 corresponds to a negative test result and that a larger random number corresponds to a positive test result. How many negative test results do you simulate out of 12? [Answer: 11.]

1.3. Follow through the Gibbs simulation process outlined in this section using the parameter values given in this section (and reiterated in Problem 1.1) for the following two examples.

(a) Start with D(1) = 1 and use the twelve random numbers of Problem 1.2. The first few steps are as follows:

Step 2a: T(1) = 1 because the first random number 0.38453 < h = P(T=1|D=1) = 0.99.
Step 2b: D(2) = 0 because the second random number 0.52815 > g = P(D=1|T=1) = 0.4024.
Step 3a: T(2) = 0 because the third random number 0.12248 < q = P(T=0|D=0) = 0.97.
Step 3b: D(3) = 0 because the fourth random number 0.58569 < d = P(D=0|T=0) = 0.9998, and so on.

Use the remaining eight random numbers to carry this process through to the simulated value of D(7).

(b) Start with D(1) = 0 and use the following 12 random numbers:

0.73085   0.01484   0.60402   0.80980   0.96003   0.14434
0.87397   0.27461   0.99158   0.02151   0.57669   0.41376

[Answer: These random numbers were generated using a slight modification of the Q-Basic program available on this site. The program was modified to print out these first few random numbers and the corresponding values of T and D. Here are the results, in the indexing of Section 1.1, which you should verify step by step: T(1) = 0, D(2) = 0, T(2) = 0, D(3) = 0, T(3) = 0, D(4) = 0, T(4) = 0, D(5) = 0, T(5) = 1, D(6) = 1, T(6) = 1, D(7) = 0.]
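For readers working in Python rather than Q-Basic, here is a small sketch of our own (not the site's program) that replays the Gibbs steps from a supplied list of uniform numbers; it can be used to check either part of Problem 1.3:

```python
def gibbs_trace(d_start, u, h=0.99, q=0.97, g=0.4024, d=0.9998):
    """Replay the Gibbs steps of Section 1.1 using a pre-specified
    list u of uniform numbers; returns the simulated T's and D's
    (the D list includes the starting value D(1))."""
    T, D = [], [d_start]
    # Each full step consumes two numbers: one for T, one for the next D.
    for u1, u2 in zip(u[0::2], u[1::2]):
        if D[-1] == 1:
            T.append(1 if u1 < h else 0)    # P{T=1 | D=1} = h
        else:
            T.append(0 if u1 < q else 1)    # P{T=0 | D=0} = q
        if T[-1] == 1:
            D.append(1 if u2 < g else 0)    # P{D=1 | T=1} = g
        else:
            D.append(0 if u2 < d else 1)    # P{D=0 | T=0} = d
    return T, D

# The twelve random numbers of part (b), starting from D(1) = 0:
u = [0.73085, 0.01484, 0.60402, 0.80980, 0.96003, 0.14434,
     0.87397, 0.27461, 0.99158, 0.02151, 0.57669, 0.41376]
T, D = gibbs_trace(0, u)
print(T)
print(D)
```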

1.4. Perform the following two-sample t-procedures on the simulated values in Table 1 to confirm that the starting value of D makes no difference.

(a) Test the null hypothesis that the starting value makes no difference against the two-sided alternative. What is the value of the t-statistic? Do you reject the null hypothesis at the 5% level? [Partial answer: The "separate variances" or "unpooled" t-statistic is t = 0.18 with 17 degrees of freedom. The "equal variances" or "pooled" t-test gives similar results.]

(b) Find a 95% confidence interval for the difference in population means (starting values 0 and 1, respectively). Based on the data in Table 1, for how many decimal places of accuracy can we confidently claim agreement between the two starting values of D?

(c) What type of error might you have made in part (a) — Type I or Type II? Assuming a common population standard deviation of s = 0.0013, how many simulation runs (instead of 10) at each starting value would be required in order to be reasonably sure to discover (power 90% for a test with significance level 5%) any actual difference in population mean Gibbs prevalence values exceeding 0.0005?

1.5. (Computer Simulation) If you have a computer and appropriate software available, use the S-plus code or Q-Basic code available on this site (or a program you write yourself in another language) to make 20 simulation runs of your own, ten starting with D(1) = 0 and ten with D(1) = 1.

2. Stability and Limit of This Gibbs Sampler

How do we know for sure that the Gibbs Sampler of Section 1 ever stabilizes, and what determines the limiting value of p? We show that the D's of Section 1 form a 2-state Markov Chain and apply the results we established in Session 2.

2.1. A Markov Chain

We will now derive the key equation showing that the D-process is a Markov Chain. It uses the conditional distributions T|D and D|T to express the transition probabilities of the chain:

P{D(m+1)=j|D(m)=i} = Σk P{T(m)=k, D(m+1)=j|D(m)=i}
  = Σk P{T(m)=k|D(m)=i} P{D(m+1)=j|T(m)=k, D(m)=i}
  = Σk P{D(m+1)=j|T(m)=k} P{T(m)=k|D(m)=i},

for m = 1, 2, 3, ... and i, j, k = 0, 1.

The first equality takes into account that either T(m) = 0 or T(m) = 1 just before the value of D(m+1) is simulated. The second uses the definition of conditional probability. The third uses the fact that the earlier condition D(m) = i is irrelevant once we are given the later information that T(m) = k, so that

P{D(m+1)=j|D(m)=i, T(m)=k} = P{D(m+1)=j|T(m)=k}.

This equation is similar to the Markov Property discussed in Session 2, but it involves both D(m) and T(m) and so refers to "half-steps" of the simulation process. There are two simulated random events in each "full" step, and each one uses only the most recent previous result.

With the same values of h, q, g and d as in Section 1, the key equation for the transitional structure of the D-process is equivalent to the matrix equation

P = QR, where Q = (P{T(m)=k|D(m)=i}) and R = (P{D(m+1)=j|T(m)=k}), with rows and columns indexed in the order 0, 1:

    Q = | 0.97  0.03 |     R = | 0.9998  0.0002 |     P = QR = | 0.9877  0.0123 |
        | 0.01  0.99 |         | 0.5976  0.4024 |              | 0.6016  0.3984 |

P is the transition matrix of the Markov Chain D(1), D(2), ..., which has state space {0, 1}. For example, the upper-left element is

p00 = P{D(m+1)=0|D(m)=0}
  = P{T(m)=0|D(m)=0} P{D(m+1)=0|T(m)=0}  +  P{T(m)=1|D(m)=0} P{D(m+1)=0|T(m)=1}
  = qd + q*g* = (0.97)(0.9998) + (0.03)(0.5976) = 0.9877.

2.2. Limiting distribution of this chain

Because P has all positive elements, we know from Session 2 that the D-process settles down to a limiting distribution. From the agreement of the results in Table 1 and from the stability achieved for a typical run as shown in Figure 1, we conclude that 50,000 iterations were more than sufficient for the chain to reach its steady state.

As mentioned above (and proved in Session 2), it was not really necessary to use simulation to find the limiting distribution l of this chain. This distribution is the solution of the matrix equation lP = l which, upon solving two equations in two unknowns, is seen to be l = (p*, p) = (0.98, 0.02). The simulated values in Section 1 came quite close to the true value of p.
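As a numeric check of this linear-algebra solution, one can build P = QR and solve the stationary equation in Python (a sketch of our own, under the parameter values of this session):

```python
# Build P = QR and find the stationary distribution of the D-chain.
h, q, g, d = 0.99, 0.97, 0.4024, 0.9998
hs, qs, gs, ds = 1 - h, 1 - q, 1 - g, 1 - d   # the "starred" complements
# Q[i][k] = P{T(m)=k | D(m)=i};  R[k][j] = P{D(m+1)=j | T(m)=k}.
Q = [[q, qs], [hs, h]]
R = [[d, ds], [gs, g]]
P = [[sum(Q[i][k] * R[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
# For a two-state chain, the stationary equation lP = l together with
# l0 + l1 = 1 gives l1 = p01 / (p01 + p10), which is the prevalence p.
prevalence = P[0][1] / (P[0][1] + P[1][0])
print(round(P[0][0], 4), round(prevalence, 4))
```

The printed entry p00 agrees with the value computed at the end of Section 2.1, and the stationary component agrees with l = (0.98, 0.02).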

Problems:

2.1. Compute the matrix P = QR in the following circumstances:

(a) Express the matrix P in terms of the parameters h, q, g and d (using *s as needed). Substitute the numerical values of the parameters for the ELISA test discussed in this session to obtain the numerical result for P shown in Section 2.1. [One of the four entries of the matrix is stated in the required form at the very end of Section 2.1; find the other three.]

(b) A screening test for a certain parasitic infection in humans has sensitivity 80% and specificity 70%. Suppose that in a particular population the predictive values of positive and negative tests are 72.73% and 77.78%, respectively. Find the P-matrix that corresponds to the Gibbs Sampler in this situation.

2.2. In each of the following circumstances use algebraic methods to find prevalence.

(a) Use the P-matrix given at the end of Section 2.1 and the parameter values used throughout the text of this session to verify that the prevalence of the virus in units of blood at the site of interest is 2%. That is, verify the claim at the end of Section 2.2. (Recall that the two components of the steady-state vector l must sum to 1.) Compare this "linear-algebra" solution with the solution in Problem 1.1.

(b) For the screening test of problem 2.1(b), solve the matrix equation lP = l to find the prevalence of parasitic infections in the population described.

2.3. (Computer Simulation) Use the Gibbs Sampler to obtain the prevalence for the situation described in Problems 2.1(b) and 2.2(b). Do 10 runs with each starting value (20 in all). Use simulation runs of 100,000 steps with a burn-in period of 50,000. Why might you expect faster convergence here than for the ELISA test considered in this session?


The article on Gibbs Sampling in STATS magazine mentioned in the note at the beginning of this session, including Table 1 and Figure 1 repeated here, is copyright © 2000 by the American Statistical Association.

This document, Session 3, is copyright © 1999, 2000 by Bruce E. Trumbo. All rights reserved. Please contact the author btrumbo@csuhayward.edu to obtain permission for instructional and other uses. This is a draft; comments and corrections are welcome.