CSU Hayward

Statistics Department

Session 1: Screening Tests


1. Introduction to Screening Tests

Suppose that international public health officials want to determine the prevalence of a particular virus in donated blood at several sites throughout the world. Also suppose that a relatively inexpensive test is available to screen units of blood for this virus — an ELISA test. Accordingly, the study will be based on the results of ELISA tests performed on randomly chosen units of blood donated at each place to be surveyed.

ELISA stands for enzyme-linked immunosorbent assay. Specific ELISA tests detect antibodies to particular viruses, such as HIV, various types of hepatitis, etc.

1.1. Prevalence and proportion testing positive

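It will be convenient for us to define two random variables, D and T, corresponding to a randomly chosen unit of blood. Each of these random variables takes only the values 0 and 1:

  • D = 1 if the unit of blood is contaminated with the virus, and D = 0 if it is not;
  • T = 1 if the ELISA test on the unit is positive, and T = 0 if it is negative.

The prevalence of the virus is π = P(D = 1), and the probability of a positive test is τ = P(T = 1).

Throughout, we use Greek letters to denote probabilities. Whenever we attach an asterisk (*) to one of these symbols, we mean its value subtracted from 1, which is the probability of the corresponding complementary event. For example, π* = 1 − π = P(D = 0).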

The proportion τ of units of blood that test positive is not the same as the proportion π that are actually contaminated with the virus. The ELISA test is useful, but not perfect.

1.2. Sensitivity

During its development, this ELISA test was performed on a large number of blood samples known to have come from subjects infected with the virus. Suppose that about 99% of these infected samples showed a positive result. That is to say, the ELISA test correctly detects the virus in 99% of infected units of blood. In terms of random variables and probabilities, we say that the sensitivity of the test is

η = P(positive test | has virus) = P(T=1|D=1) = 99%.

Here, as elsewhere in this session, we often express probabilities as percentages.

1.3. Specificity

On the other hand, consider a group of units of blood known, from more costly, more accurate procedures, to be free of the virus (D = 0). When administered to such units of blood, the ELISA test was found to give negative results (T = 0) for about 97% of them. That is, for some reason ELISA incorrectly gave an indication of the virus in 3% of uncontaminated units of blood. These are called "false-positive" results. We say that the specificity of the test is

θ = P(negative test | no virus) = P(T=0|D=0) = 97%.
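Both rates are simple proportions from validation studies. The following is a minimal Python sketch of how they might be computed. It assumes hypothetical validation counts of 1000 known-infected and 1000 known virus-free units, invented here only to reproduce the 99% and 97% above. The function names are our own.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """eta = P(T=1 | D=1): fraction of infected units that test positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """theta = P(T=0 | D=0): fraction of virus-free units that test negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical validation counts (not from the text), chosen to match the stated rates.
eta = sensitivity(true_pos=990, false_neg=10)
theta = specificity(true_neg=970, false_pos=30)
print(f"sensitivity = {eta:.0%}, specificity = {theta:.0%}")  # sensitivity = 99%, specificity = 97%
```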

1.4. Particular values of sensitivity, specificity, and prevalence

The particular numerical values of η and θ that we have given above, and continue to use throughout this session, are reasonable but hypothetical.

The actual sensitivity and specificity of an ELISA test procedure for any particular virus would depend on whether tests are done once or several times on each unit of blood, and whether borderline results are declared as "positive" or "negative." When the immediate purpose is to protect the blood supply from contamination with the virus, such issues would be settled in favor of increasing the sensitivity. But the consequence would be to decrease the specificity.

For a discussion of sensitivity and specificity in several kinds of screening tests see: Gastwirth, Joseph L., "The statistical precision of medical screening procedures: Applications to polygraph and AIDS antibody test data" (including discussion), Statistical Science, Vol. 2, No. 3 (1987), pages 213-238.

Starting in the next section, we give several hypothetical prevalence values. In real life, actual prevalences range widely depending on the population and the disease. For example, in the US the prevalence of HIV in the donated blood supply is now essentially 0. (Pre-donation questionnaires used by blood banks tend to eliminate even donors likely to produce false-positive results.) On the other hand, in clinical applications screening tests are sometimes used where the prevalence of a disease exceeds 50%.

Problems:

1.1. In his regular trivia column "The Grab Bag" (appearing in The San Francisco Chronicle on July 17, 1999), L. M. Boyd poses the question of why lie detector results are not admissible in court. His answer is that "lie detectors tests pass 10 percent of the liars and fail 20 percent of the truth-tellers." If you use these percentages and take D = 1 to mean being deceitful and T = 1 to mean failing the test, what are the numerical values of the sensitivity and specificity of such a lie detector test?

1.2. Suppose that a medical screening test for a particular disease provides a continuum of numerical values as its output. It is generally agreed that values less than 50 on this scale must be judged as negative indications of the disease (T = 0) and that values greater than 56 must be judged as positive indications (T = 1). The borderline values between 50 and 56 are usually also read as positive, and this practice is reflected in the published sensitivity and specificity values of the test. What would happen to the sensitivity if the borderline values were read as negative: would it increase or decrease? What would happen to the specificity? Explain your answers briefly.

1.3. Consider a bogus test for a virus that always gives positive results, regardless of whether the virus is present or not. What is its sensitivity? What is its specificity? In describing the usefulness of a screening test, why might it be misleading to say how "sensitive" it is without saying how "specific" it is?

2. Some Attempts to Estimate Prevalence

At one of the sites under study, suppose that we estimate τ = P(T=1) by t, the proportion of positive tests in a sample. We have seen in Section 1 that we cannot use t itself to estimate the prevalence π = P(D=1). But can we somehow use t indirectly to find an estimate p of π?

2.1. Solving and estimating

One proposed method of estimating prevalence is to use the fact that t and p are related by the equation

$$
\begin{aligned}
\tau &= P(T=1) \\
 &= P(D=1, T=1) + P(D=0, T=1) \\
 &= P(D=1)\,P(T=1 \mid D=1) + P(D=0)\,P(T=1 \mid D=0) \\
 &= \pi\eta + \pi^*\theta^*,
\end{aligned}
$$

where θ* = 1 − θ (just as we have done above for other Greek letters representing probabilities). Here we have partitioned all positive tests into true positives and false positives, applied the law of total probability, and twice used the general multiplication rule for probabilities: P(E ∩ F) = P(E)P(F|E).
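As a quick numerical check, this Python sketch evaluates τ = πη + π*θ* term by term. The function name prob_positive is our own. The prevalence values are illustrative: 2% is the hypothetical value used later in Section 3.1, and 0 shows that only false positives remain when no unit is infected.

```python
def prob_positive(pi: float, eta: float, theta: float) -> float:
    """tau = P(T=1) = pi*eta + pi_star*theta_star (law of total probability)."""
    pi_star = 1 - pi                   # P(D=0)
    theta_star = 1 - theta             # P(T=1 | D=0), the false-positive rate
    true_pos = pi * eta                # P(D=1, T=1)
    false_pos = pi_star * theta_star   # P(D=0, T=1)
    return true_pos + false_pos


print(f"{prob_positive(0.02, 0.99, 0.97):.2%}")  # 4.92%
print(f"{prob_positive(0.0, 0.99, 0.97):.2%}")   # 3.00% (false positives only)
```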

Solving this equation for π, we obtain

$$\pi = \frac{\tau - \theta^*}{\eta - \theta^*}.$$

Then, replacing τ by its sample estimate t, we obtain the estimate

$$p = \frac{t - \theta^*}{\eta - \theta^*}.$$

For example, suppose that we have a sample of N = 1000 units of blood and that 49 of them test positive. We use the notation A = #(T=1) = 49. Then t = A/N = 0.049 = 4.9%, and

p = (4.9% – 3%)/(99% – 3%) = 1.98%.

An approximate 95% confidence interval for τ based on the normal approximation to the binomial distribution is (3.56%, 6.24%), and the corresponding 95% confidence interval for π is (0.58%, 3.38%). (See Problem 2.1.)
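The point estimate and both intervals can be reproduced with a short Python sketch. The helper names are our own. The interval for τ is the normal-approximation (Wald) interval described in Problem 2.1. The interval for π is obtained by mapping its endpoints through the estimator.

```python
import math


def estimate_prevalence(t: float, eta: float, theta: float) -> float:
    """p = (t - theta_star) / (eta - theta_star), with t in place of tau."""
    theta_star = 1 - theta
    return (t - theta_star) / (eta - theta_star)


def wald_interval(a: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation interval t +/- z*sqrt(t(1 - t)/n) for tau."""
    t = a / n
    se = math.sqrt(t * (1 - t) / n)
    return t - z * se, t + z * se


eta, theta = 0.99, 0.97
a, n = 49, 1000

p = estimate_prevalence(a / n, eta, theta)
t_lo, t_hi = wald_interval(a, n)
# p is an increasing function of t (because eta > 1 - theta),
# so the endpoints for tau map directly to endpoints for pi.
p_lo = estimate_prevalence(t_lo, eta, theta)
p_hi = estimate_prevalence(t_hi, eta, theta)

print(f"p = {p:.2%}")                    # p = 1.98%
print(f"tau: ({t_lo:.2%}, {t_hi:.2%})")  # tau: (3.56%, 6.24%)
print(f"pi:  ({p_lo:.2%}, {p_hi:.2%})")
```

Carrying full precision, the last line gives about (0.59%, 3.37%). The (0.58%, 3.38%) quoted above comes from mapping the rounded endpoints 3.56% and 6.24%.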

2.2. Difficulties

Unfortunately, this method sometimes gives absurd estimates p of π. For example, if we have a sample of N = 215 units of blood and five of them test positive, then t = 5/215 ≈ 2.33% and p ≈ −0.70%. The difficulty here is that we expect 3% of the tests to be positive even if the prevalence is 0, but sampling variation has given us a value of t less than 3%.

In some applications, such estimates of prevalence that stray into negative territory can be quite common. If π is very near 0, then p will be negative about half the time. In other circumstances (whenever t exceeds η), this method gives absurd estimates of prevalence that exceed 100%.
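How often does the estimate go negative? Under the binomial model for A, the probability can be computed exactly rather than by the normal approximation. This Python sketch (function name ours) sums binomial probabilities over the values of A for which t falls below θ*. It assumes η > θ*, as for our ELISA test.

```python
from math import comb


def prob_negative_estimate(pi: float, eta: float, theta: float, n: int) -> float:
    """Exact P(p < 0) when the number of positives A ~ Binomial(n, tau).

    Assuming eta > theta_star, p < 0 exactly when t = A/n < theta_star.
    """
    theta_star = 1 - theta
    tau = pi * eta + (1 - pi) * theta_star
    return sum(
        comb(n, a) * tau**a * (1 - tau) ** (n - a)
        for a in range(n + 1)
        if a / n < theta_star
    )


# Virus absent (pi = 0) at a site sampling N = 215 units:
print(f"{prob_negative_estimate(0.0, 0.99, 0.97, 215):.2f}")
```

For π = 0 and N = 215 the result is close to one half, in line with the statement above.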

For further applications of this method of estimating prevalence (including another example that gives an absurd result) see: Pagano, Marcello and Gauvreau, Kimberlee, Principles of Biostatistics, 2nd ed., Duxbury Press, Belmont, CA, 2000, pages 141-144.

Quiz Question 13 on this site contains still further examples of screening tests, and of estimating prevalence.

Problems:

2.1. This question involves computational verification of the confidence intervals at the end of Section 2.1.

(a) Compute the 95% confidence interval for τ given at the end of Section 2.1. Use the normal approximation to the binomial distribution (taking "Success" to be {T = 1}). The approximate boundaries are t ± 1.96 SE(t), where the standard error SE(t) is estimated by the square root of t(1 − t)/N.

(b) Show how the 95% confidence interval for π is obtained from the confidence interval in (a), using the relationship between τ and π stated in Section 2.1.

2.2. Suppose that a screening test for a particular parasite in humans has sensitivity 80% and specificity 70%.

(a) In a sample of 100 from a population we obtain 45 positive tests. Give a point estimate of the prevalence (single "best" value). Use the normal approximation to give an interval estimate of prevalence.

(b) In a sample of 70 from a different population we obtain 62 positive tests. Find the point estimate of prevalence. How do you explain this result?

2.3. Consider the ELISA test of this section and suppose that the prevalence of infection is 2% of the units of blood.

(a) Show that the proportion testing positive is 4.92%.

(b) Suppose that N = 215 units of blood are tested and that A of them yield positive results. What values of t = A/N and of the integer A yield a negative estimate of prevalence?

(c) Even in realistic circumstances, as with our ELISA test, it is not especially rare for the method of this section to give negative estimates of prevalence. With N = 215, use the results of part (b) to show that one would get a negative estimate about 10% of the time. (Use the normal approximation to the binomial with continuity correction.)

3. Predictive Values

In this section we introduce some additional conditional probabilities. They are of importance in practical situations. Moreover, they provide a point of view that will eventually permit us to make better estimates of prevalence.

3.1. Predictive value of a positive test

It is often useful to know what percentage of units with positive tests are actually infected. This is a property of the site, more than it is a property of the particular screening test used. As a hypothetical example, suppose that the prevalence at a particular location is π = 2%. Then, from the formula for τ in Section 2.1,

$$\tau = \pi\eta + \pi^*\theta^* = (0.02)(0.99) + (0.98)(0.03) = 4.92\%.$$

From this we can compute

$$\gamma = P(D=1 \mid T=1) = \frac{P(D=1, T=1)}{P(T=1)} = \frac{\pi\eta}{\tau} = \frac{0.0198}{0.0492} = 40.24\%.$$

This quantity γ is called the predictive value of a positive test. (You may recognize this computation as an example of Bayes' Theorem.)
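Since τ itself depends on π, so does γ. The following Python sketch (function name ours) reproduces the 40.24% above. It then shows, for a few arbitrarily chosen prevalences, how strongly the predictive value of the same test depends on the site.

```python
def ppv(pi: float, eta: float, theta: float) -> float:
    """gamma = P(D=1 | T=1) = pi*eta / tau, by Bayes' Theorem."""
    tau = pi * eta + (1 - pi) * (1 - theta)  # P(T=1)
    return pi * eta / tau


print(f"{ppv(0.02, 0.99, 0.97):.2%}")  # 40.24%

# Same test, different (hypothetical) prevalences:
for pi in (0.001, 0.02, 0.20):
    print(f"pi = {pi:.1%}: gamma = {ppv(pi, 0.99, 0.97):.1%}")
```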

3.2. Predictive value of a negative test

Similarly, we compute

$$\delta = P(D=0 \mid T=0) = \frac{\pi^*\theta}{\tau^*} = \frac{0.9506}{1 - 0.0492} = 99.98\%,$$

the predictive value of a negative test.

Of course, we can hope that the predictive values of both positive and negative tests are high (near 1), but even when one or both of these values is low, the screening test may still be useful. (For an example, see Problem 3.1.)
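Here is a matching Python sketch for δ, using the same hypothetical 2% prevalence and the same test as in Section 3.1 (function name ours). The numerator is P(D=0, T=0) = π*θ and the denominator is τ* = P(T=0).

```python
def npv(pi: float, eta: float, theta: float) -> float:
    """delta = P(D=0 | T=0) = pi_star*theta / tau_star."""
    tau = pi * eta + (1 - pi) * (1 - theta)  # P(T=1)
    return (1 - pi) * theta / (1 - tau)


print(f"{npv(0.02, 0.99, 0.97):.2%}")  # 99.98%
```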

3.3. "Gold standard" tests

One way to get direct information about predictive values is to perform a "gold standard" procedure on some of the units of blood. In principle, a gold standard provides an essentially 100% accurate determination of whether or not the virus is present in a unit, but its cost of administration prevents its use on every unit of blood. (If such a gold standard were inexpensive, why bother with imperfect ELISA tests for screening?)

Procedures called "Western blot" tests are regarded as a gold standard for some viruses. They use a different technology from ELISA tests and are considerably more accurate, but also more expensive. However, in practice, no such procedure is absolutely perfect. Both the Western blot test and the ELISA test actually detect antibodies to a specific virus. In most circumstances the presence of antibodies corresponds to presence of the virus itself. Exceptions might be units of blood from people who have taken vaccines (antibodies but no virus) or who have been very recently infected (virus, but not yet antibodies). Blood banks use pre-donation questionnaires to try to avoid accepting such units of blood.

Unless the prevalence of the virus is extremely high at a particular location, the number of units with ELISA-positive tests found there may be small enough that we could check them all against the gold standard. Without knowing π, we could then estimate γ for this site as #(T=1, D=1)/#(T=1); that is, the proportion of the ELISA-positive units that subsequent gold-standard procedures show to be infected with the virus.
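Here is a minimal Python sketch of this estimate. It assumes the gold-standard outcomes for the ELISA-positive units are recorded as a list of booleans. The counts are invented purely for illustration, and no value of π is needed.

```python
def estimate_ppv(confirmed: list[bool]) -> float:
    """Estimate gamma as #(T=1, D=1) / #(T=1).

    `confirmed` holds one gold-standard result per ELISA-positive unit
    (True = virus present), so len(confirmed) is #(T=1).
    """
    if not confirmed:
        raise ValueError("need at least one ELISA-positive unit")
    return sum(confirmed) / len(confirmed)


# Hypothetical site: 49 ELISA positives, 20 confirmed by the gold standard.
results = [True] * 20 + [False] * 29
print(f"{estimate_ppv(results):.1%}")  # 40.8%
```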

Although we would not ordinarily be able to apply the gold standard to all units of blood that tested ELISA-negative, we might be able to check some of them against the gold standard to get an estimate of δ (if only to verify that δ really is very nearly 1, as would be the case if the prevalence is, say, below 5%).

In Session 2, we shall see that if reliable estimates of the conditional probabilities γ and δ were available, they would provide the basis for an improved estimate of π. In fact, this possibility provides a simple illustration of the important estimation technique known as the Gibbs sampler.

Problems:

3.1. Suppose that a screening test for a particular disease is to be given to all available members of a population. The goal is to detect the disease early enough that a cure is still possible. This test is relatively cheap, convenient, and safe. It has sensitivity 98% and specificity 96%. Suppose that the prevalence of the disease in this population is 1/2%.

(a) What proportion of those who test positive will actually have the disease? Even though this value may seem quite low, notice that it is much greater than 1/2%.

(b) All of those who test positive will be subjected to more expensive, less convenient (possibly even somewhat risky) diagnostic procedures to determine whether or not they actually have the disease. What percentage of the population will be subjected to these procedures? Notice that this is a relatively small part of the population. It would have been prohibitively expensive (and depending on risks, possibly even unethical) to perform the definitive diagnostic procedures on the entire population. However, the screening test has provided us with a subpopulation that

  • Is relatively small, as just shown, and
  • Has relatively high probability of having the disease, as shown in (a).

Thus it may be warranted to perform the definitive diagnostic procedures within this subpopulation.

(c) The entire population can be viewed as having been split into four groups:

  • True positives (disease detected, possibly early enough for a cure),
  • False positives (no disease, but false alarm leading to inconvenience of follow-up diagnostics),
  • True negatives (no disease, no false alarm), and
  • False negatives (disease undetected, disease may develop to incurable stage).

What proportion of the entire population falls into each of these categories? Suppose you could change the sensitivity of the test to 99% with a consequent change in specificity to 94%. What factors of economics, patient risk, and preservation of life would be involved in deciding whether to make this change?

3.2. Recall the lie detector test from above. (Boyd asserts that "lie detectors tests pass 10 percent of the liars and fail 20 percent of the truth-tellers." The event {D = 1} = {Liar} and {T = 1} = {Fails test}.) Suppose that 5% of those in the population are liars.

(a) What is the probability that a randomly chosen member of the population will fail the test?

(b) What proportion of those who fail the test are really liars? What proportion of those who fail the test are really truth-tellers?

(c) What proportion of those who pass the test are really telling the truth?

(d) Following the notation of this section, assign the appropriate Greek letters to the probabilities and proportions in parts (a), (b), and (c).

3.3. There are three urns, identical in outward appearance. Two of them each contain 3 red balls and 1 white ball. One of them contains 1 red ball and 3 white balls. One of the three urns is selected at random.

(a) Neither you nor John has looked into the urn. On an intuitive "hunch," John is willing to make you an even-money bet that the urn selected has one red ball. (You each put up $1 and then look into the urn. He gets both dollars if the urn has exactly one red ball, otherwise you do.) Would you take the bet? Explain briefly.

(b) Consider the same situation as in (a), except that one ball has been chosen at random from the urn selected, and that ball is white. The result of this draw has provided both of you with some additional information. Would you take the bet in this situation? Explain briefly.

3.4. Let the sample space S be partitioned by the disjoint and exhaustive events $A_i$, where $i = 1, \ldots, K$. Then Bayes' Theorem states that for any event E contained in S and any one of the partition events $A_j$,

$$P(A_j \mid E) = \frac{P(E \mid A_j)\,P(A_j)}{\sum_{i=1}^{K} P(E \mid A_i)\,P(A_i)}.$$

That is, if one knows the probabilities of all of the partition events, and all of the conditional probabilities $P(E \mid A_i)$, then one can find the "reverse" conditional probability $P(A_j \mid E)$ for any $j = 1, \ldots, K$. Bayes' Theorem is proved by showing that the fraction on the right amounts to $P(A_j \cap E)/P(E)$, where the law of total probability is used in the denominator.
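The theorem translates directly into code. In this Python sketch (function name ours), the inputs are the K probabilities P(A_i) and the K conditional probabilities P(E | A_i). As a check, the two-event ELISA case of Section 3.1 reproduces γ = 40.24%.

```python
def bayes_posterior(priors: list[float], likelihoods: list[float]) -> list[float]:
    """Return P(A_j | E) for every j, given P(A_i) and P(E | A_i).

    The partition events A_1, ..., A_K must be disjoint and exhaustive,
    so the priors should sum to 1.
    """
    joint = [prior * lik for prior, lik in zip(priors, likelihoods)]  # P(A_i and E)
    total = sum(joint)  # P(E), by the law of total probability
    return [j / total for j in joint]


# Two-event check with A_1 = {D=1}, A_2 = {D=0}, E = {T=1}:
# priors are pi and pi_star, likelihoods are eta and theta_star.
posterior = bayes_posterior([0.02, 0.98], [0.99, 0.03])
print(f"{posterior[0]:.2%}")  # 40.24%
```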

In our discussion of screening tests in this session, we have used Bayes' Theorem in the special case where there are only two partition events, $A_1$ = {D = 1} and $A_2$ = {D = 0}. As an illustration of a more general case with three partition events, consider the following situation.

Each employee in a large company can be classified according to his or her use of an illegal drug into exactly one of three categories: Frequent users, Occasional users, and Abstainers (who never use the drug at all). Suppose that the percentages of employees in these categories are 2%, 8%, and 90%, respectively. Further suppose that a urine test for this drug is positive 98% of the time for frequent users, 50% of the time for occasional users, and 5% of the time for abstainers.

(a) If employees are selected at random from this company and given the drug test described, what percentage of them will test positive?

(b) Of those employees who test positive, what percentage are abstainers?

(c) Suppose that employees are selected at random for testing and that those who test positive are severely disciplined or dismissed. How might an employee union or civil rights organization argue against the fairness of drug testing in these circumstances?

(d) Can you envision different circumstances, under which such a test might be appropriately used in the workplace?


This web session was adapted by Bruce E. Trumbo from preliminary drafts of the article "Elementary uses of the Gibbs Sampler: Applications to medical screening tests" by Eric A. Suess, Christopher M. Fraser, and Bruce E. Trumbo, appearing in STATS, Winter 2000. That article in STATS is copyright © American Statistical Association, 2000. This web session is intended primarily for individual, noncommercial use by readers of the article in STATS magazine. The problems in this session are copyright © Bruce E. Trumbo, 2000. All rights reserved. Contact the author btrumbo@csuhayward.edu for all other intended uses. This is a draft, which may contain errors. Comments and corrections are welcome.