Copyright © Richard B. Darlington. All rights reserved.

When a scientist must give a brief introduction to statistical inference for a general audience,
it's hard to find a more easily understood example than the difference in success rates
between treatment and control groups. Thus it's surprising that there is little agreement
among statisticians on the best way to handle this problem or other problems involving
testing for independence in 2 x 2 frequency tables. As explained later, the so-called Fisher
"exact" 2 x 2 test is in fact not exact at all but is usually overly conservative, frequently
giving significance levels *p* over twice what most people would call a "true"
level.

This work covers several topics related to 2 x 2 tests. After a review of well-known
tests, the work includes a link to a table of exact significance levels for 2 x 2 tables, for two
groups each ranging from 3 to 20 in size so that the total sample size can range from 6 to
40. But then I argue that exact significance levels may not be what one wants for 2 x 2
tests, and that likelihood ratios (LR's) may be theoretically superior. A simple formula is
given for computing LR's. But I then argue that LR's are too unfamiliar to most scientists
for everyday use, so that a method in the familiar significance-test format is more useful.
Finally I offer a chi-square test which, because of its connection to likelihood-ratio theory,
may actually be superior to the table of exact *p*'s given earlier.

All these points are covered in 10 sections, which are summarized below.

**A review of several well-known 2 x 2 tests** reviews the Fisher, Pearson,
Pearson-Yates, and likelihood-ratio chi-square tests.

**The proximal null hypothesis** describes a concept familiar to
statisticians but unknown to most scientists.

**Random versus fixed marginal totals** explains why the Fisher test is not
exact for most problems, and why no test can be exact without considering whether marginal
totals for the particular problem are random or fixed.

**Are marginal totals ancillary?** analyzes and rejects an argument that has
been used for the Fisher test.

**Exact tests, uniformly most powerful tests, and likelihood ratios**
reviews several other concepts familiar to statisticians but unknown to most scientists.

**An exact test based on LR values** describes an exact test for the most
common 2 x 2 problem.

**A table of exact one-tailed p's** includes a link to a large
table of exact *p*'s.

**Do we really want an exact test?** then turns around and asks whether
this table of exact *p*'s is really what we want.

**Likelihood ratios versus significance levels** discusses the relative
advantages of likelihood ratios and significance levels for reporting research results, and
concludes that each approach has its advantages.

**Combining the advantages of LR's and p's** describes a new
chi-square test which is little more complex than the familiar Pearson test, but which is so
much more accurate that I actually defend its use even when the aforementioned table of
exact *p*'s is available.

A familiar test of the difference between two success rates is the *z* test

z = (p_{1} - p_{2})/sqrt[P(1-P)(1/n_{1} + 1/n_{2})]

where *P* is the overall proportion of successes in the two groups combined.

The square of this z exactly equals the chi-square you get from the familiar Pearson chi-square formula

chi-square = SUM (o - e)^{2}/e = SUM(o^{2}/e) - N,

where SUM denotes summation across cells (usually denoted by capital sigma), and as usual
*o* and *e* are the observed and expected cell frequencies.

You also get the same chi-square from the familiar formula

Chi-square = N(A*D-B*C)^{2}/[(A+B)*(C+D)*(A+C)*(B+D)].

Here n_{1} = A+B, n_{2} = C+D, p_{1} =
A/n_{1} and p_{2} = C/n_{2}.

This test is also equivalent to computing an ordinary Pearson correlation r between the row and column variables, then computing

Chi-square = N*r^{2} or z = r sqrt(N)

The significance level *p* in any of these chi-square tests is the *two-tailed
p* from the *z* test. If you did the test as a chi-square test, you can halve
*p* to get a one-tailed *p*. This chi-square test is a one-tailed test in the
mechanical sense that you are finding the area of just one tail of the chi-square distribution,
but it is logically a two-tailed test in that you can get a significant result in either of two
opposite ways: A/B > C/D or A/B < C/D.
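These equivalences are easy to verify numerically. The sketch below (an illustrative table with A = 3, B = 1, C = 0, D = 5, chosen only for illustration) computes the same chi-square four ways: from *z*, from the (o-e)²/e formula, from the AD-BC shortcut, and from N·r².

```python
import math

# Illustrative 2 x 2 table: A, B = treatment successes/failures;
# C, D = control successes/failures.
A, B, C, D = 3, 1, 0, 5
N = A + B + C + D
n1, n2 = A + B, C + D
p1, p2 = A / n1, C / n2
P = (A + C) / N  # overall success rate, pooled across both groups

# 1. z test for the difference between two proportions; z**2 is the chi-square.
z = (p1 - p2) / math.sqrt(P * (1 - P) * (1 / n1 + 1 / n2))

# 2. Pearson formula SUM (o - e)^2 / e, with e computed from the marginal totals.
obs = [A, B, C, D]
expected = [n1 * (A + C) / N, n1 * (B + D) / N,
            n2 * (A + C) / N, n2 * (B + D) / N]
chisq_pearson = sum((o - e) ** 2 / e for o, e in zip(obs, expected))

# 3. Shortcut formula N(AD - BC)^2 / [(A+B)(C+D)(A+C)(B+D)].
chisq_short = N * (A * D - B * C) ** 2 / ((A + B) * (C + D) * (A + C) * (B + D))

# 4. Via the Pearson r (phi coefficient) between the two 0-1 variables.
r = (A * D - B * C) / math.sqrt((A + B) * (C + D) * (A + C) * (B + D))
chisq_r = N * r ** 2

print(z ** 2, chisq_pearson, chisq_short, chisq_r)  # all four equal 5.625 here
```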

We shall denote all of these equivalent tests as the *Pearson* test. Pearson
suggested that the test may be invalid if any of the expected cell frequencies falls below 5.
Other authors have suggested that this rule may be too conservative, and that expected values
as low as 2 may be satisfactory.

Yates suggested that the Pearson test is often too liberal. He suggested fixing this by
adjusting each of the observed cell frequencies by .5 toward its own expected frequency
*e* before computing chi-square. You get the same result in the formula using A,
B, C, D by replacing (A*D-B*C)^{2} by (|A*D-B*C|-N/2)^{2}.
In either of these forms the modified test is usually called the Pearson-Yates test, or the
Pearson test with the Yates correction for continuity. The Pearson-Yates test is considered
too conservative by many writers.
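The two forms of the Yates correction can likewise be checked against each other; a minimal sketch, again using an illustrative table with A = 3, B = 1, C = 0, D = 5:

```python
A, B, C, D = 3, 1, 0, 5  # illustrative frequencies
N = A + B + C + D
obs = [A, B, C, D]
expected = [(A + B) * (A + C) / N, (A + B) * (B + D) / N,
            (C + D) * (A + C) / N, (C + D) * (B + D) / N]

# Form 1: move each observed frequency .5 toward its own expected frequency.
adj = [o - 0.5 if o > e else o + 0.5 for o, e in zip(obs, expected)]
chisq_yates1 = sum((o - e) ** 2 / e for o, e in zip(adj, expected))

# Form 2: replace (AD - BC)^2 by (|AD - BC| - N/2)^2 in the shortcut formula.
chisq_yates2 = N * (abs(A * D - B * C) - N / 2) ** 2 / (
    (A + B) * (C + D) * (A + C) * (B + D))

print(chisq_yates1, chisq_yates2)  # identical: 2.75625 here
```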

An alternative to the Pearson chi-square test is the *likelihood-ratio chi-square*
(LRCS) test. The LRCS test uses the formula

Chi-square = 2 SUM o ln(o/e)

where again the summation is across cells. LRCS is rarely used because it is slightly harder to use than the Pearson test, and also has generally been found to be too liberal.
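A minimal sketch of the LRCS computation, on the same illustrative table (A = 3, B = 1, C = 0, D = 5):

```python
import math

A, B, C, D = 3, 1, 0, 5  # illustrative frequencies
N = A + B + C + D
obs = [A, B, C, D]
expected = [(A + B) * (A + C) / N, (A + B) * (B + D) / N,
            (C + D) * (A + C) / N, (C + D) * (B + D) / N]

# Chi-square = 2 SUM o ln(o/e); an empty cell contributes 0
# (the limit of o ln o as o -> 0), so it is simply skipped.
chisq_lrcs = 2 * sum(o * math.log(o / e) for o, e in zip(obs, expected) if o > 0)
print(chisq_lrcs)  # about 6.96
```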

The Fisher test is more difficult to compute by hand than any of these alternate tests, but is still simple by the standards of modern computers. It is consistently more conservative than the Pearson test, but is often well approximated by the Pearson-Yates test.

In summary, the LRCS test is considered too liberal, the Pearson-Yates and Fisher tests are often considered too conservative, and the Pearson test falls between these poles. My own position is that the Fisher test is the correct test for a small set of problems to be described later, but for the great majority of 2 x 2 problems there are superior alternatives to all of these tests. These alternatives are described below.

If both success rates are equal to the same value *p _{0}*, then
the probability that all 3 cases in the treatment group will succeed and all 3 cases in the
control group will fail is p_{0}^{3}(1-p_{0})^{3}. This probability is maximized when
*p _{0}* = .5, so .5 is the value of *p _{0}* most consistent with the observed data.

In any area of statistics, if a null hypothesis allows more than one value for the parameter under discussion, then the value that is most consistent with the observed data, in the sense just described, is called the proximal null hypothesis. It will come as no great surprise that in simple binomial problems like the one in the illustration, the proximal null hypothesis is simply the overall success rate. Thus for instance if there are 4 successes and 8 failures total, the value in the proximal null hypothesis is simply 4/12 or 1/3.

The proximal null hypothesis is the specific instance of the null hypothesis that is
entered into a significance test. The argument is that if even this instance of the null
hypothesis turns out to be inconsistent with the observed data, then all other instances of the
null hypothesis must also be inconsistent with the data. In the 6-case example the proximal
null hypothesis of .5 yields a probability of .5^{6} or .015625. That value is
well below .05, so by normal criteria the null hypothesis is rejected. (In general a
significance level is the probability of finding the observed outcome *or other possible
outcomes still less consistent with the null hypothesis*. However, in this example there
are no such other outcomes, so that point can be ignored.)
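The arithmetic of the 6-case example is trivial but worth making explicit; a minimal sketch:

```python
# 6-case example: 3 treatment cases all succeed, 3 control cases all fail.
successes, failures = 3, 3
p0 = successes / (successes + failures)  # proximal null: overall success rate .5

# Probability of the observed outcome under the proximal null hypothesis:
# each treatment case succeeds (p0) and each control case fails (1 - p0).
p = p0 ** 3 * (1 - p0) ** 3
print(p)  # 0.015625, well below .05
```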

In the case already illustrated, we knew before running the experiment that there were 3 cases in the treatment group and 3 others in the control group. Thus if you create a table that looks like this

|           | Success | Failure | Total |
|-----------|---------|---------|-------|
| Treatment | 3       | 0       | 3     |
| Control   | 0       | 3       | 3     |
| Total     | 3       | 3       | 6     |

the row marginal totals of the table can be considered fixed or known in advance. However, the column marginal totals were not known in advance; they could be calculated only after we ran the experiment and observed 3 successes and 3 failures total. We have seen that an exact significance level for this table is .015625.

Now suppose you ask 6 people two Yes-No questions, such as whether they approve of their nation's current domestic policies and current foreign policies. Suppose that 3 people answer Yes to both and the other 3 answer No to both, so the 2 x 2 table looks like this:

|             | Yes - Qu. 2 | No - Qu. 2 | Total |
|-------------|-------------|------------|-------|
| Yes - Qu. 1 | 3           | 0          | 3     |
| No - Qu. 1  | 0           | 3          | 3     |
| Total       | 3           | 3          | 6     |

You want to test the null hypothesis that the two questions are mutually independent.
The proximal null hypothesis is that both questions have Yes rates of .5, so the probability
of observing any one pattern in any one subject (Yes-Yes, No-No, Yes-No, or No-Yes)
is .5^{2} or .25. The multinomial theorem (an extension of the binomial
theorem to problems with more than two possible responses) then says that the probability of
observing 3 Yes-Yes patterns and 3 No-No patterns, and no Yes-No or No-Yes patterns,
is (6!/(3!3!0!0!))*.25^{6} or .00488. This too is an exact significance
level, but it is less than one-third the size of the previous value of .015625, even though the
frequency table for both problems was

| 3 | 0 |
|---|---|
| 0 | 3 |

The difference between the two problems is that in the latter problem, neither the row marginal totals nor the column marginal totals were fixed in advance, while in the previous problem the row totals were fixed.
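A minimal sketch of this multinomial calculation:

```python
from math import factorial

# Under the proximal null (both questions with Yes rates of .5), each of the
# four response patterns has probability .5 * .5 = .25 for any one subject.
p_pattern = 0.25

# Number of ways to assign 6 subjects to the pattern counts (3, 3, 0, 0):
ways = factorial(6) // (factorial(3) * factorial(3) * factorial(0) * factorial(0))

p = ways * p_pattern ** 6
print(ways, p)  # 20 ways; p = 0.0048828125, i.e. about .00488
```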

Now consider a problem in which all marginal totals are fixed. Such problems are much rarer than the first two types, but they do occur. Suppose a judge is presented with 6 cups of coffee, is told that 3 are caffeinated and 3 are decaffeinated, and is asked to judge which are which. Suppose the judge succeeds perfectly. What's the probability of that? Well, the number of ways of dividing 6 objects into two groups of 3 is 6!/(3!3!) or 20, so the probability is 1/20 or .05 that the judge would succeed perfectly just by chance. But the problem can be set up in the same table as before; this time the table is

|             | Guess caff. | Guess decaf. | Total |
|-------------|-------------|--------------|-------|
| Real caff.  | 3           | 0            | 3     |
| Real decaf. | 0           | 3            | 3     |
| Total       | 3           | 3            | 6     |

So we now have a significance level of .05 from the same set of four frequencies that previously yielded levels of .015625 or .00488. The difference is that this time, we knew all the marginal totals in advance. We knew in advance that there were 3 caffeinated and 3 decaffeinated cups, and we also knew in advance that the judge would divide the cups 3 and 3.
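A minimal sketch of the fixed-margins calculation:

```python
from math import comb

# With all margins fixed, the judge's only freedom is which 3 of the 6 cups
# to call caffeinated; all comb(6, 3) divisions are equally likely under the null.
ways = comb(6, 3)
p_perfect = 1 / ways
print(ways, p_perfect)  # 20 divisions, so p = 0.05
```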

Thus we get very different exact significance levels, depending on whether both pairs of marginal totals are fixed (the coffee example), or just one is (the success-failure example), or neither is (the Yes-No example). There are still other cases, as for instance when one keeps drawing cases until one gets at least 5 cases in each cell, or perhaps at least 10 cases in each row. But cases like these are rare, and it is generally agreed that the three cases already described are the three major cases. Further, the case of all fixed marginals is generally agreed to be considerably rarer than the other two cases. The case of one fixed margin pair typically arises in testing the relation between an independent variable and a dependent variable (as in our success-failure example), while the case of no fixed margins typically arises in testing the relation between two outcome or dependent variables (as in our Yes-No example).

To see Fisher's argument, first suppose that we are setting up an experiment with
human subjects in treatment and control groups, and as each person arrives for the
experiment we flip a coin to determine whether they are placed in the treatment group or the
control group. Because of the coin, we don't know in advance how many people will be in
each group, so speaking literally the group sizes are random. However, the experiment is
about the treatment effect, not about the coin. It is generally agreed that the coin's results
are irrelevant or *ancillary* to the main experiment, so it is more reasonable to
think of the group sizes as fixed.

Continuing with this example, suppose each person succeeds or fails at a task. Fisher
argued that the overall success rates are also irrelevant to the question of whether there is a
*difference* in success rates between treatment and control groups. He thus
claimed that all marginal totals are always irrelevant or ancillary to the hypothesis tested. By
that argument, the Fisher test is always the correct test.

There is a technical argument (see Little, 1989) to the effect that one margin pair can always be considered ancillary, but that both pairs should not be so considered. I'll give a simple example to illustrate the point. To say that both margin pairs are ancillary is to say that the marginal data alone tell you nothing about the presence or absence of association.

But suppose you have an experiment with 400 people in a treatment group and 600 others in a control group. Suppose a counter malfunctions during the experiment, and at the end you know only that altogether there were 400 successes and 600 failures, but the broken counter prevents you from knowing the success rate within each group. Surprisingly, there is still a useful analysis you can do. The highest positive association between treatment and success occurs if everyone in the treatment group succeeds and everyone in the control group fails. If that hypothesis were true then the observed outcome would not only be possible, it would be the only possible outcome. So the higher the positive association between the independent and dependent variables, the greater the agreement we would expect between the row and column marginal totals. Since we observed perfect agreement, we can use that fact to test the null hypothesis: we can ask about the probability of perfect agreement occurring if the treatment and control groups had equal success rates. Given our data, the proximal null hypothesis is that the true success rate in each group is .4. Under that hypothesis, it can be calculated that the probability of observing perfect agreement between row and column marginal totals is only .0257. Thus surprisingly, one can actually reject at the .05 level the null hypothesis that the treatment has no effect, based entirely on the marginal totals! It's hard to argue then that the marginal totals are irrelevant to assessing the presence of association.
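The .0257 figure can be checked directly. Under the proximal null (success rate .4 in both groups), the total number of successes in the 1000 cases is binomial, and the column margins agree perfectly with the fixed row margins only if that total lands exactly on 400; a minimal sketch:

```python
from math import comb

n, p0 = 1000, 0.4  # 1000 cases in all; proximal null success rate 400/1000 = .4

# P(exactly 400 successes) under Binomial(1000, .4): the only way the column
# margins can agree perfectly with the fixed row margins of 400 and 600.
p_agree = comb(n, 400) * p0 ** 400 * (1 - p0) ** 600
print(p_agree)  # about .0257, below the .05 level
```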

Another argument against Fisher's position is at a far more practical level. The
defenders of the Fisher position are generally unmoved by the argument that the Fisher test is
too conservative, frequently replying that the other side just wants to get significance more
often than they should. If the Fisher test were consistently too liberal, it's hard to imagine
that the test's defenders would be equally adamant. But the fact is that for 2 x 2 tests, it's
often the case that *nonsignificance* is used to promote acceptance of new
treatments. This is true in the medical world, where for every significance test on the
effectiveness of a new treatment, there are usually many tests looking for undesirable side
effects. The promoters of the new treatment are of course hoping for nonsignificance in
these tests. They are thus all too happy to use the Fisher test, which is biased against finding
significance. There are many new treatments that are clearly effective though dangerous, so
people trying to get the new treatment approved are often happy to use the Fisher test
consistently, sacrificing some power on the test of effectiveness in order to make it easier to
claim an absence of side effects. My point is that scientists and statisticians who are
conservative in a genuine sense should be just as concerned about the charge that the Fisher
test is too "conservative" as they would be about the opposite charge.

However, I will accept Fisher's argument to the extent of agreeing that one marginal
total can always be considered ancillary. As mentioned, Little has actually given a rigorous
proof of that point. Thus the case of one fixed margin pair is by far the most important
case, and that is the case we shall consider in the rest of this document. This case is often
called the *double binomial* case, since we can think of two independent groups
(e.g., treatment and control groups), with the success rate in each group following a binomial
distribution.

For the double binomial problem, there are many ways to construct an exact test, but
there is no uniformly most powerful test. Test A might be better than test B at
distinguishing between true success rates of .8 and .9, while B might be better at
distinguishing between true success rates of .4 and .6. Thus there is no one test that is
clearly "best" in the sense the *t* test is best for the problem just described.
However, I shall argue later that the particular exact test I shall describe is "best" in a less
formal sense.

This test uses another standard concept of statistical theory--the *likelihood
ratio*. We'll introduce the concept with the frequency table

|           | Success | Failure | Total |
|-----------|---------|---------|-------|
| Treatment | 3       | 1       | 4     |
| Control   | 0       | 5       | 5     |
| Total     | 3       | 6       | 9     |

The likelihood ratio, here denoted LR, is defined as the ratio of the probability of the observed data under the proximal null hypothesis to its probability under the *overall proximal hypothesis*, the hypothesis most consistent with the data overall (here, success rates of 3/4 in the treatment group and 0 in the control group). For this table the former probability is .0130 and the latter is .4219, so LR = .0130/.4219 = .031. LR is a measure of the consistency of the data with the proximal null hypothesis, relative to its consistency with the overall proximal hypothesis. In this case we can say that the data would have been only about 3.1% as likely to occur under the proximal null hypothesis (which is of course the instance of the null hypothesis most consistent with the data) as under the hypothesis which is overall most consistent with the data.

A general rule is that 2 ln(1/LR) is distributed approximately as chi-square, with df equal to the number of additional restrictions imposed by the null hypothesis. Here there is only one such restriction (the restriction of equal success rates in the two groups), so df = 1. In the present case we have chi-square = 2 ln(1/.031) = 6.96. With 1 df, the one-tailed significance level associated with that chi-square is .0042. This is one test that has been proposed for testing the 2 x 2 null hypothesis, but it is rarely used because it is too liberal. As explained later, an exact significance level for this pattern is .021.
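These numbers can be reproduced from the double-binomial probabilities; a minimal sketch:

```python
from math import comb, log, sqrt, erfc

# Table: treatment 3 successes, 1 failure; control 0 successes, 5 failures.
A, B, C, D = 3, 1, 0, 5
n1, n2 = A + B, C + D

def binom(k, n, p):
    """Binomial point probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of the data under the proximal null (both rates = 3/9) ...
p_null = binom(A, n1, 3 / 9) * binom(C, n2, 3 / 9)
# ... and under the overall proximal hypothesis (rates 3/4 and 0).
p_alt = binom(A, n1, A / n1) * binom(C, n2, C / n2)

LR = p_null / p_alt                       # about .031
chisq = 2 * log(1 / LR)                   # about 6.96
p_one_tailed = erfc(sqrt(chisq / 2)) / 2  # chi-square (1 df) tail, halved: about .0042
print(LR, chisq, p_one_tailed)
```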

| 3 | 1 |
|---|---|
| 0 | 5 |

| 4 | 0 |
|---|---|
| 1 | 4 |

| 4 | 0 |
|---|---|
| 0 | 5 |

| 3 | 1 |
|---|---|
| 0 | 5 |
- Here *n*_{1} and *n*_{2} denote the two group sizes; neither *n*_{1} nor *n*_{2} may exceed 500.
- An entry appears only when the exact *p* is .1 or less; otherwise no *p* is given.

The total table is large (about 4.4 megabytes), so it is divided into 16 separate files based
on the size of *n*_{1}, as follows:

n1 = 1

n1 = 2

n1 = 3

n1 4 or 5

n1 6 to 8

n1 9 to 11

n1 12 to 15

n1 16 to 19

n1 20 to 24

n1 25 to 30

n1 31 to 40

n1 41 to 50

For the files above this point, entries are listed by *n*_{1}, then *n*_{2}, then *A*, then *C*.

For the files below, entries are listed by *n*_{1}, then *n*_{2}, then *C*, then *A*.

n1 51 to 75

n1 76 to 120

n1 121 to 200

n1 201 to 500

Please read the following directions for using the files; otherwise you may think you have
found the correct *p* when it's really the wrong *p*.

- After entering the correct file, use your browser's Search command (usually found in
the Edit menu) to find the correct value of *n*_{1}, then the correct value of
*n*_{2}. Sections of the table have headings like

  N1 = 30 N2 = 15

  There is always exactly one space before and after each equals sign. Use that spacing
in your browser's Search command.
- As mentioned above, in files for *n*_{1} of 50 or below, entries are listed by
*n*_{1}, then *n*_{2}, then *A*, then *C*, while in files for *n*_{1} of 51 or above,
entries are listed by *n*_{1}, then *n*_{2}, then *C*, then *A*. It has been found that
this arrangement lowers disk space requirements and download times relative to having all
files with the same arrangement.
- Once you have found the correct combination of *n*_{1} and *n*_{2}, **do not** use
the Search function to find the desired value of A or C. Rather, scroll down manually,
one screen at a time, to find it. Otherwise you may unwittingly jump to another n1-n2
combination. For instance, suppose you have *n*_{1} = 41, *n*_{2} = 2, *A* = 18,
*C* = 0. You can use Search to find the section of the table for *n*_{1} = 41,
*n*_{2} = 2. But that section of the table contains no entry for *A* = 18, because no
such entry yields a *p* below .1. Therefore if you use Search to find *A* = 18, you
will unwittingly jump to the section of the table for *n*_{1} = 41, *n*_{2} = 4, which
does contain such an entry.

Consider the following two outcomes, each with group sizes of 12 and 17:

| 12 | 0 |
|----|---|
| 12 | 5 |

| 9 | 3 |
|---|---|
| 5 | 12 |

These two *p* values differ substantially; the first is 85% larger than the
second. The second outcome is significant at the .01 level one-tailed, while the first isn't
even very close. However, the LR values for these two outcomes are respectively .04796
and .04819. These two values are nearly identical; their ratio is .9951. The associated
*p*'s are so different because when calculating *p* for the outcome that
comes later in the ranking, we must include the probability of the other outcome, and it turns
out that the probability of outcome #45 is substantial relative to the probability of outcome
#46. Many other examples like this can be found, in which the ratio of two LR values is
between .99 and 1 yet the ratio of the two *p* values is 1.5 or greater--and in
which one or both of the *p*'s are in the "important" range between .01 and .05.
I'll call this the *near-tie* phenomenon.

That raises the question of whether significance levels or likelihood ratios are really better measures of the outcome of a hypothesis test. The next section considers this question.

Statisticians define a test as valid if the actual probability of rejection does not exceed the nominal rejection probability. For instance, a test is invalid if there is an actual probability of .07 of rejecting a true null hypothesis at the .05 level.

But suppose test RN (for "random number") consists of drawing a random number from a uniform distribution between 0 and 1, and rejecting the null if that number is .05 or less. By the standard definition, test RN is a valid test at the .05 level. But suppose you heard that a scientist had used test RN to test the effectiveness of a proposed medical treatment, had found significance, and had therefore concluded that the treatment was effective. In ordinary English you would call this conclusion invalid.

For clarity, I'll use the term *credibility* to refer to the scientist's
conception of validity. That is, I'll define a test as "credible" only if a significant result will
lead intelligent scientists to lower their faith in the null hypothesis--provided of course they
agree that the experiment itself was well designed. Of course, even a credible test can make
Type 1 errors of rejecting true null hypotheses, but at least it doesn't do so nearly as
foolishly as the "valid" test RN described above.

Statisticians attempt to transmute their technical concept of "validity" into what I have
called credibility by combining "validity" with *power*. A test's power is its
ability to reject false null hypotheses--in other words, to reject when rejection is the correct
decision. Notice that test RN, ridiculed above, is singularly lacking in power. No matter
how effective the treatment and no matter how large and well-designed the experiment, the
probability of rejecting the null under test RN never exceeds .05. Thus we have a surprising
and paradoxical result--test RN's lack of power is actually a clue to its lack of credibility. A
test is credible only if it combines validity and power. Thus statisticians will say that they
really have the problem of credibility covered through the concepts of power and
validity.

My feeling is that this approach is at best indirect--rather like assuring yourself that a one-pound box of crackers is free of dirt, not by the direct method of looking for dirt and failing to find it, but indirectly by making sure that the weight of the crackers themselves is a full pound.

Consider again why power is so important in establishing credibility. A scientist
wants to know that the observed result was unlikely to be observed under the null hypothesis,
*and also* was likely to be observed under the alternative hypothesis. Thus to be
credible, a test's rejection region (the set of possible outcomes that will lead to rejection of
the null) must consist of outcomes unlikely under the null but likely under the
alternative.

Now consider a different example involving random numbers. The familiar
*z* value of 1.96 does not correspond exactly to the .05 level two-tailed; a more
exact level is .0499958, which is .0000042 below .05. Now imagine a scientist who pointed
that out, and who said that in conducting a test he had used a critical *z* of 1.96,
but to bring the test up to the .05 level more exactly, he had decided that if *z* did
not reach the 1.96 level he would choose a random number between 0 and 1, and reject the
null if that random number fell below .0000042. Suppose the scientist then reported that the
null had indeed been rejected at the .05 level, and suppose the hypothesis was an important
hypothesis about the effectiveness of a medical treatment. You would certainly want to know
whether the result had been labeled significant because *z* had exceeded 1.96 or
because the random number had fallen below .0000042.

Notice what this means. The great majority of possible outcomes in the rejection region are indeed more likely if the experimental treatment is effective than if it is ineffective. But there are some points in the rejection region (corresponding to the random number below .0000042) lacking that property. When you insist on knowing exactly why the scientist called the result significant, you are saying you don't care about the rejection region as a whole, you just care about the particular outcome that was observed.

This suggests that likelihood ratios have certain advantages over significance levels,
because the size of a likelihood ratio is determined entirely by the particular outcome that
was observed, not by a set of other outcomes that were not observed. On the other hand, a
significance level using, say, a *z* test reports "the probability of finding a
*z* this high *or higher*." The italicized words mean that a significance
level is a statement mostly about outcomes that did not in fact occur.

Many familiar significance tests are known by statisticians to be *likelihood-ratio
tests*. This means that the *p* values of all possible outcomes rank the
outcomes in the exact same order as their LR values. Thus odd examples like my
random-number examples cannot arise; the fact that any outcome has a low *p*
always
means that that very outcome also has a low LR value. Given the usual assumptions of
normality, linearity, and homoscedasticity, all the familiar parametric tests (*t* tests
on means, tests on correlations, analysis of variance, tests in regression and linear models)
are likelihood-ratio tests.

Some authors (e.g., Edwards, 1984) have argued that LR's should be used by scientists to evaluate hypotheses in place of significance levels. However, in that capacity LR's have two major disadvantages. First, they always require a specific distributional form; there is no such thing as a distribution-free likelihood ratio. This problem is avoided only with categorical data--the type of data used in the foregoing example. Second, there does not seem to be any satisfactory way to correct likelihood ratios for multiple tests, or to satisfactorily test composite null hypotheses such as the hypothesis that 5 cell means are all equal. Thus we can say that likelihood ratios seem to be superior to significance levels in a theoretical sense, but can't be used as everyday tools for all hypothesis evaluation because of practical limitations which arise in contexts other than performing a single 2 x 2 test.

For categorical data, however, LR can be computed directly from the formula

LR = exp(-LRCS/2) = exp(-SUM o ln(o/e))

where the summation is across cells. LRCS was defined in this paper's first section as a likelihood-ratio chi-square value. If any cell is empty then ln(o/e) is undefined for that cell, but that doesn't matter because it is multiplied by the cell's *o* of 0, giving a product of 0. However, scientists wanting to report LR's in scientific papers will face the problem that most scientists are not familiar with them. And as mentioned above, that situation is unlikely to change, because it is unlikely that LR's will ever replace significance levels as everyday tools.
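This identity is easy to check against the double-binomial calculation given earlier; a minimal sketch using the table with cells 3, 1, 0, 5:

```python
from math import log, exp

A, B, C, D = 3, 1, 0, 5
N = A + B + C + D
obs = [A, B, C, D]
expected = [(A + B) * (A + C) / N, (A + B) * (B + D) / N,
            (C + D) * (A + C) / N, (C + D) * (B + D) / N]

# LR = exp(-SUM o ln(o/e)); empty cells contribute 0 and are skipped.
LR = exp(-sum(o * log(o / e) for o, e in zip(obs, expected) if o > 0))
print(LR)  # about .031, matching the value found from the binomial probabilities
```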

All this suggests another line of approach. Suppose we could find a test which
reports results in terms of familiar *p*'s, but in which a reported *p* can
be transformed with good accuracy to the outcome's LR value. Such a test would be
approximately a likelihood-ratio test. Then readers unfamiliar with LR's could interpret the
*p*'s in the usual way, while readers familiar with LR's would gain confidence by
the fact that the *p*'s map closely onto LR values.

Recall that test LRCS does calculate *p*'s completely as a function of LR
values. The problem with this test is that it is too liberal by a noticeable margin. But if that
could be fixed somehow, then we would have a test that might actually be regarded as
theoretically superior to the "exact" test described earlier.

It turns out that this can be done. Modifying test LRCS by adding a continuity
correction *cc* of .25 produces a test I'll call LRCC for "likelihood
ratio with continuity correction". To perform test LRCC, calculate the four *o*
and *e* values, adjust each *o* by .25 toward its own *e*, then
enter the modified *o*'s into the LRCS formula.
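The procedure just described can be sketched as follows, using as an illustration the table with cells 3, 1, 0, 5 discussed earlier (for which an exact *p* of .021 was reported). Halving the chi-square tail probability to get a one-tailed *p*, as with the Pearson test, is my reading of the procedure:

```python
from math import log, sqrt, erfc

def lrcc(A, B, C, D, cc=0.25):
    """Likelihood-ratio chi-square with continuity correction, as described above.
    Returns the chi-square value and the one-tailed p."""
    N = A + B + C + D
    obs = [A, B, C, D]
    expected = [(A + B) * (A + C) / N, (A + B) * (B + D) / N,
                (C + D) * (A + C) / N, (C + D) * (B + D) / N]
    # Move each observed frequency cc toward its own expected frequency.
    adj = [o - cc if o > e else (o + cc if o < e else o)
           for o, e in zip(obs, expected)]
    chisq = 2 * sum(o * log(o / e) for o, e in zip(adj, expected))
    p_one_tailed = erfc(sqrt(chisq / 2)) / 2  # chi-square (1 df) tail, halved
    return chisq, p_one_tailed

chisq, p = lrcc(3, 1, 0, 5)
print(chisq, p)  # chi-square about 4.50, one-tailed p about .017
```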

Test LRCC has the following advantages:

- It is easily calculated, given a pocket calculator with an *ln* key.
- It doesn't exactly use LR's, and doesn't give exact *p*'s, but arguably it comes as
close as any known test to offering an accurate "likelihood-ratio 2 x 2 test", combining
the theoretical advantages of LR's with the familiarity of *p*'s.
- It yields *p*'s surprisingly close to the "exact" double-binomial *p*'s.
- Unlike the Pearson chi-square test and various modifications of it, test LRCC has no
lower limit on *e*-values.
- LRCC is also surprisingly accurate even for extremely small *p*'s. That's particularly
convenient when using the Bonferroni correction for multiple tests.
- For the rare cases when the Fisher test is appropriate but the means to calculate it
are not readily available, changing *cc* in LRCC from .25 to .5 gives an approximation
to the Fisher test that is far better than the Pearson-Yates test--which is generally
considered an excellent approximation. And changing *cc* to .125 in LRCC yields an
excellent approximation to "exact" *p*'s for the case in which all margins are random.

Note that the patterns just discussed (one frequency of 1, the diagonally opposite frequency large, and the other two entries both 0) are the patterns that yield the lowest expected cell frequencies and therefore the patterns for which Pearson-family tests are not even used because of their invalidity for these cases. For instance, if the one large frequency is 499 then the lowest expected cell frequency is .002. Test LRCC has no lower limit on expected cell frequencies, or any other comparable proscription on its use.

For the truly compulsive (or for those writing a computer program that needs to be
written only once), the continuity correction *cc* can be modified as follows.
Define *lnsum* as the natural log of the total sample size, and define
*lnr* = ln(n_{1}/n_{2}). It doesn't matter which
*n* is called n_{1} in this formula, since the choice will not affect the
absolute value of *lnr*, and *lnr* is used only in squared form. Then let
the continuity correction *cc* be

cc = .27589 - .00581*lnr^{2} - .000206*lnsum^{2} + .000628*lnr^{2}*lnsum
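A minimal sketch of this refinement:

```python
from math import log

def cc_refined(n1, n2):
    """Continuity correction cc as a function of the group sizes, per the formula above."""
    lnsum = log(n1 + n2)  # natural log of the total sample size
    lnr = log(n1 / n2)    # only lnr**2 is used, so which n is called n1 is immaterial
    return (0.27589 - 0.00581 * lnr ** 2 - 0.000206 * lnsum ** 2
            + 0.000628 * lnr ** 2 * lnsum)

print(cc_refined(10, 10))  # about .274 for two groups of 10
print(cc_refined(4, 147))  # the n1 = 4, n2 = 147 case mentioned in the text
```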

When n_{1} and n_{2} are both 500 or below, values of
*cc* calculated by this formula range from below .19 to above .27, and values of R
(the ratio of the *p* given by LRCC to the exact *p*)
calculated using these values of *cc* remain between .977 (found when
n_{1} = 4 and n_{2} = 147) and 1.061 (found when
n_{1} = 1 and n_{2} = 500). In other words, test LRCC is at
most about 2% too liberal or 6% too conservative, across the range of *n*'s
studied, for the extreme cases of maximum chi-square.

I have risked derision by emphasizing perfect patterns (all treatment-group cases
succeed and all control-group cases fail) with their ridiculously small *p*'s because
detailed analyses of other patterns involve the "near-tie" problem discussed above, in which
two outcomes with nearly equal LR values may yield "exact" *p*'s that differ from
each other by a factor of nearly 2. This makes test LRCC miss these "exact" *p*'s
by the same general factor. But the argument given earlier in favor of LR values over
*p*'s also implies that these misses by test LRCC are as much the fault of the
"exact" *p*'s as of LRCC. It can thus be argued that test LRCC is really as good
as the "exact" *p*'s. And of course the user who understands the advantages of
LR values has the assurance that any small *p* found by test LRCC is associated
with a low LR value, and thus with an outcome substantially less consistent with the null
hypothesis than with the alternative.

For readers not content to read only about perfect patterns, I also looked at a set of
29,201 outcome patterns that includes all patterns with the following characteristics: (a) the
success rate in the treatment group exceeds that in the control group, (b) each group size is
at least 3 and at most 25, (c) the treatment group size does not exceed the control group size.
The last condition prevents duplication, since every outcome pattern in which treatment and
control groups are, say, 25 and 20 respectively, duplicates some pattern in which the group
sizes are reversed. Using *cc* = .25, the test is generally a little conservative;
only 3665 of the 29,201 calculated *p*'s fell below their true *p*'s. The
29,201 values of R ranged from .6175 to 1.4898, with a median of 1.1243. This range
seems small enough to require no apologies at all, and it is small enough to suggest that
much of it could be produced by the "near-tie" phenomenon discussed earlier.

Storer and Kim (1990) reviewed a substantial literature on the double binomial
problem. Their recommendations depend on the individual user's access to computing
power and need for conservatism, but the very last sentence of
their article speaks highly of a test suggested by Pirie and Hamdan
in which the familiar Pearson chi-square is adjusted by a continuity correction of .25, rather
than the correction of .5 suggested decades ago by Yates. Like Pearson and Yates, they
suggest using the test only if the smallest expected cell frequency--here denoted min(e)--
exceeds some specified value. I found that requiring min(e) to exceed 4.0909 brings the
Pirie-Hamdan test to one of the same criteria of accuracy satisfied by LRCC, making the
smallest value of R equal to .6175 across the current set of 29,201 outcome patterns.
However, that restriction eliminates 48.6% of the 29,201 outcome patterns in the studied list,
while LRCC yields a *p* for every outcome in this or any other list. That seems
like a major difference in the usefulness of the two tests.

Edwards, A. W. F. (1984) *Likelihood.* Cambridge Univ. Press.

Little, R. J. A. (1989). Testing the equality of two independent binomial proportions.
*American Statistician*, v. 43 #4, 283-288.

Storer, Barry E. and Kim, Choongrak (1990). Exact properties of some exact test statistics
for comparing two binomial proportions. *Journal of the American Statistical
Association, *v. 85 #409, 146-155.