## Some New 2 x 2 Tests

Richard B. Darlington

Abstract
Many concepts relevant to 2 x 2 tests are reviewed. A large table of exact significance levels is offered. But also presented is a 2 x 2 chi-square test that is so accurate that I argue that it might actually be preferable to the table of exact p's.
* * * * *

When a scientist must give a brief introduction to statistical inference for a general audience, it's hard to find a more easily understood example than the difference in success rates between treatment and control groups. Thus it's surprising that there is little agreement among statisticians on the best way to handle this problem or other problems involving testing for independence in 2 x 2 frequency tables. As explained later, the so-called Fisher "exact" 2 x 2 test is in fact not exact at all but is usually overly conservative, frequently giving significance levels p over twice what most people would call a "true" level.

This work covers several topics related to 2 x 2 tests. After a review of well-known tests, the work includes a link to a table of exact significance levels for 2 x 2 tables, for two groups each ranging from 3 to 20 in size so that the total sample size can range from 6 to 40. But then I argue that exact significance levels may not be what one wants for 2 x 2 tests, and that likelihood ratios (LR's) may be theoretically superior. A simple formula is given for computing LR's. I then argue that LR's are too unfamiliar to most scientists for everyday use, so that a method in the familiar significance-test format is more useful. Finally I offer a chi-square test which, because of its connection to likelihood-ratio theory, may actually be superior to the table of exact p's given earlier.

All these points are covered in 10 sections, which are summarized below.

A review of several well-known 2 x 2 tests reviews the Fisher, Pearson, Pearson-Yates, and likelihood-ratio chi-square tests.

The proximal null hypothesis describes a concept familiar to statisticians but unknown to most scientists.

Random versus fixed marginal totals explains why the Fisher test is not exact for most problems, and why no test can be exact without considering whether marginal totals for the particular problem are random or fixed.

Are marginal totals ancillary? analyzes and rejects an argument that has been used for the Fisher test.

Exact tests, uniformly most powerful tests, and likelihood ratios reviews several other concepts familiar to statisticians but unknown to most scientists.

An exact test based on LR values describes an exact test for the most common 2 x 2 problem.

A table of exact one-tailed p's includes a link to a large table of exact p's.

Do we really want an exact test? then turns around and asks whether this table of exact p's is really what we want.

Likelihood ratios versus significance levels discusses the relative advantages of likelihood ratios and significance levels for reporting research results, and concludes that each approach has its advantages.

Combining the advantages of LR's and p's describes a new chi-square test which is little more complex than the familiar Pearson test, but which is so much more accurate that I actually defend its use even when the aforementioned table of exact p's is available. Unlike the Pearson test, the new test has no lower limits on expected cell frequencies, no upper or lower limits on sample sizes, and no lower limit on meaningful p's. That is, a p of, say, .000001 can be interpreted as essentially that value, and not merely as .0001 or less.

#### A review of several well-known 2 x 2 tests

Numerous textbooks describe a z test on the difference between two independent proportions p1 and p2--for instance, between the proportions of successes in treatment and control groups. Let P denote the proportion of successes in the total sample. Let n1 and n2 denote the two sample sizes, and let N = n1 + n2. Then the test is

z = (p1 - p2)/sqrt[P(1-P)(1/n1+1/n2)]

The square of this z exactly equals the chi-square you get from the familiar Pearson chi-square formula

chi-square = SUM (o - e)^2/e = SUM(o^2/e) - N,

where SUM denotes summation across cells (usually denoted by capital sigma), and as usual o and e are the observed and expected cell frequencies.

You also get the same chi-square from the familiar formula

Chi-square = N(A*D-B*C)^2/[(A+B)*(C+D)*(A+C)*(B+D)].

Here n1 = A+B, n2 = C+D, p1 = A/n1 and p2 = C/n2.

This test is also equivalent to computing an ordinary Pearson correlation r between the row and column variables, then computing

Chi-square = N*r^2 or z = r sqrt(N)

The significance level p in any of these chi-square tests is the two-tailed p from the z test. If you did the test as a chi-square test, you can halve p to get a one-tailed p. This chi-square test is a one-tailed test in the mechanical sense that you are finding the area of just one tail of the chi-square distribution, but it is logically a two-tailed test in that you can get a significant result in either of two opposite ways: A/B > C/D or A/B < C/D.
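The equivalence of these forms is easy to check numerically. The sketch below is only an illustration: the function name `pearson_forms` and the table (15, 5, 8, 12) are hypothetical choices, not taken from the text. It computes the statistic from the z formula, the cell-by-cell sum, the A, B, C, D shortcut, and N*r^2, and then gets the two-tailed p from the normal distribution.

```python
import math

def pearson_forms(A, B, C, D):
    """Pearson statistic for the table [[A, B], [C, D]], computed four ways."""
    n1, n2 = A + B, C + D
    N = n1 + n2
    p1, p2 = A / n1, C / n2
    P = (A + C) / N  # success proportion in the total sample
    z = (p1 - p2) / math.sqrt(P * (1 - P) * (1 / n1 + 1 / n2))

    # Cell-by-cell form: SUM (o - e)^2 / e, with e = row total * column total / N
    observed = [A, B, C, D]
    expected = [n1 * (A + C) / N, n1 * (B + D) / N,
                n2 * (A + C) / N, n2 * (B + D) / N]
    chi_cells = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Shortcut form using only A, B, C, D
    margins = (A + B) * (C + D) * (A + C) * (B + D)
    chi_shortcut = N * (A * D - B * C) ** 2 / margins

    # Correlation form: r between the two 0/1 variables, then N * r^2
    r = (A * D - B * C) / math.sqrt(margins)
    chi_r = N * r ** 2
    return z, chi_cells, chi_shortcut, chi_r

# Hypothetical table: 15 of 20 succeed in group 1, 8 of 20 in group 2
z, chi_cells, chi_shortcut, chi_r = pearson_forms(15, 5, 8, 12)
p_two_tailed = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal p
p_one_tailed = p_two_tailed / 2

print(z ** 2, chi_cells, chi_shortcut, chi_r)
print(p_two_tailed, p_one_tailed)
```

All four printed statistics should agree up to rounding error.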

We shall denote all of these equivalent tests as the Pearson test. Pearson suggested that the test may be invalid if any of the expected cell frequencies falls below 5. Other authors have suggested that this rule may be too conservative, and that expected values as low as 2 may be satisfactory.

Yates suggested that the Pearson test is often too liberal. He suggested fixing this by adjusting each of the observed cell frequencies by .5 toward its own expected frequency e before computing chi-square. You get the same result in the formula using A, B, C, D by replacing (A*D-B*C)^2 by (|A*D-B*C|-N/2)^2. In either of these forms the modified test is usually called the Pearson-Yates test, or the Pearson test with the Yates correction for continuity. The Pearson-Yates test is considered too conservative by many writers.
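A minimal sketch of the two forms of the Yates correction follows. The function names and the example table are hypothetical. The sketch applies the .5 adjustment mechanically, so it does not guard against tables where an observed frequency is already within .5 of its expected value.

```python
def yates_by_cells(A, B, C, D):
    """Move each observed frequency .5 toward its expected value, then sum."""
    n1, n2 = A + B, C + D
    N = n1 + n2
    observed = [A, B, C, D]
    expected = [n1 * (A + C) / N, n1 * (B + D) / N,
                n2 * (A + C) / N, n2 * (B + D) / N]
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))

def yates_by_formula(A, B, C, D):
    """Same statistic from the A, B, C, D shortcut formula."""
    N = A + B + C + D
    margins = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (abs(A * D - B * C) - N / 2) ** 2 / margins

# Hypothetical table
cells_value = yates_by_cells(15, 5, 8, 12)
formula_value = yates_by_formula(15, 5, 8, 12)
print(cells_value, formula_value)
```

The two values agree because |o - e| is the same in all four cells of a 2 x 2 table, namely |A*D-B*C|/N.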

An alternative to the Pearson chi-square test is the likelihood-ratio chi-square (LRCS) test. The LRCS test uses the formula

Chi-square = 2 SUM o ln(o/e)

where again the summation is across cells. LRCS is rarely used because it is slightly harder to use than the Pearson test, and also has generally been found to be too liberal.
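As a sketch, the LRCS statistic can be computed as below. The function name is hypothetical, and the code assumes the usual convention that a cell with o = 0 contributes nothing to the sum.

```python
import math

def lr_chi_square(A, B, C, D):
    """Likelihood-ratio chi-square, 2 * SUM o * ln(o / e), for [[A, B], [C, D]]."""
    n1, n2 = A + B, C + D
    N = n1 + n2
    observed = [A, B, C, D]
    expected = [n1 * (A + C) / N, n1 * (B + D) / N,
                n2 * (A + C) / N, n2 * (B + D) / N]
    # Assumed convention: empty cells (o = 0) are skipped, since o * ln(o) -> 0
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Table with 3 successes and 1 failure in group 1, 0 successes and 5 failures in group 2
lrcs = lr_chi_square(3, 1, 0, 5)
print(round(lrcs, 2))
```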

The Fisher test is more difficult to compute by hand than any of these alternate tests, but is still simple by the standards of modern computers. It is consistently more conservative than the Pearson test, but is often well approximated by the Pearson-Yates test.

In summary, the LRCS test is considered too liberal, the Pearson-Yates and Fisher tests are often considered too conservative, and the Pearson test falls between these poles. My own position is that the Fisher test is the correct test for a small set of problems to be described later, but for the great majority of 2 x 2 problems there are superior alternatives to all of these tests. These alternatives are described below.

#### The proximal null hypothesis

To discuss these issues we must introduce a concept well known to statisticians--the proximal null hypothesis. To illustrate the concept, suppose that 3 cases in a treatment group all succeed at a task, and 3 cases in a control group all fail. You want to test the null hypothesis that the treatment has no effect on the success rate. But there are many possible instances of the null hypothesis: it would be true if both success rates were .4, or if both were .5, or if both were .7. Under the null hypothesis, the two success rates could be any value so long as they are equal.

If both success rates are equal to the same value p0, then the probability that all 3 cases in the treatment group will succeed is p0^3, and the probability that all 3 cases in the control group will fail is (1-p0)^3. These two probabilities are mutually independent, so the probability of both together is their product p0^3(1-p0)^3. If you entered all possible values of p0 into this expression, you would find that the value of p0 that maximizes the expression is .5. That is, the value of p0 that makes 3 successes and 3 failures more likely than any other value of p0, is .5. Thus you can say that if you have observed 3 successes and 3 failures, then the hypothesis that p0 is .5 in both groups, is more consistent with the data than any other possible instance of the null hypothesis.

In any area of statistics, if a null hypothesis allows more than one value for the parameter under discussion, then the value that is most consistent with the observed data, in the sense just described, is called the proximal null hypothesis. It will come as no great surprise that in simple binomial problems like the one in the illustration, the proximal null hypothesis is simply the overall success rate. Thus for instance if there are 4 successes and 8 failures total, the value in the proximal null hypothesis is simply 4/12 or 1/3.

The proximal null hypothesis is the specific instance of the null hypothesis that is entered into a significance test. The argument is that if even this instance of the null hypothesis turns out to be inconsistent with the observed data, then all other instances of the null hypothesis must also be inconsistent with the data. In the 6-case example the proximal null hypothesis of .5 yields a probability of .5^6 or .015625. That value is well below .05, so by normal criteria the null hypothesis is rejected. (In general a significance level is the probability of finding the observed outcome or other possible outcomes still less consistent with the null hypothesis. However, in this example there are no such other outcomes, so that point can be ignored.)
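The search for the proximal null hypothesis in the 6-case example can be sketched as a simple grid search. The grid spacing of .001 and the function name are assumptions made for illustration.

```python
def null_probability(p0, successes=3, failures=3):
    """Probability that 3 treatment cases all succeed and 3 control cases all fail."""
    return p0 ** successes * (1 - p0) ** failures

# Try p0 = .001, .002, ..., .999 and keep the value that maximizes the probability
grid = [i / 1000 for i in range(1, 1000)]
proximal_p0 = max(grid, key=null_probability)
proximal_probability = null_probability(proximal_p0)

print(proximal_p0, proximal_probability)
```

The search lands on p0 = .5, the overall success rate, with probability .5^6 = .015625.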

#### Random versus fixed marginal totals

Although elementary applied statistics texts generally avoid the point, statisticians have known for decades that in discussing 2 x 2 tests, it is useful to distinguish among three (or sometimes more) cases involving the marginal totals of the 2 x 2 table. We'll use the previous example, with 3 successes and 3 failures, to consider the three major cases.

In the case already illustrated, we knew before running the experiment that there were 3 cases in the treatment group and 3 others in the control group. Thus if you create a table that looks like this

|  | Success | Failure | Total |
|---|---|---|---|
| Treatment | 3 | 0 | 3 |
| Control | 0 | 3 | 3 |
| Total | 3 | 3 | 6 |

the row marginal totals of the table can be considered fixed or known in advance. However, the column marginal totals were not known in advance; they could be calculated only after we ran the experiment and observed 3 successes and 3 failures total. We have seen that an exact significance level for this table is .015625.

Now suppose you ask 6 people two Yes-No questions, such as whether they approve of their nation's current domestic policies and current foreign policies. Suppose that 3 people answer Yes to both and the other 3 answer No to both, so the 2 x 2 table looks like this:

|  | Yes - Qu. 2 | No - Qu. 2 | Total |
|---|---|---|---|
| Yes - Qu. 1 | 3 | 0 | 3 |
| No - Qu. 1 | 0 | 3 | 3 |
| Total | 3 | 3 | 6 |

You want to test the null hypothesis that the two questions are mutually independent. The proximal null hypothesis is that both questions have Yes rates of .5, so the probability of observing any one pattern in any one subject (Yes-Yes, No-No, Yes-No, or No-Yes) is .5^2 or .25. The multinomial theorem (an extension of the binomial theorem to problems with more than two possible responses) then says that the probability of observing 3 Yes-Yes patterns and 3 No-No patterns, and no Yes-No or No-Yes patterns, is (6!/(3!3!0!0!))*.25^6 or .00488. This too is an exact significance level, but it is less than one-third the size of the previous value of .015625, even though the frequency table for both problems was

|   |   |
|---|---|
| 3 | 0 |
| 0 | 3 |

The difference between the two problems is that in the latter problem, neither the row marginal totals nor the column marginal totals were fixed in advance, while in the previous problem the row totals were fixed.

Now consider a problem in which all marginal totals are fixed. Such problems are much rarer than the first two types, but they do occur. Suppose a judge is presented with 6 cups of coffee, and is told that 3 are caffeinated and 3 are decaffeinated, and they are to judge which are which. Suppose the judge succeeds perfectly. What's the probability of that? Well, the number of ways of dividing 6 objects into two groups of 3 is 6!/(3!3!) or 20, so the probability is 1/20 or .05 that the judge would succeed perfectly just by chance. But the problem can be set up in the same table as before; this time the table is

|  | Guess caff. | Guess decaf. | Total |
|---|---|---|---|
| Real caff. | 3 | 0 | 3 |
| Real decaf. | 0 | 3 | 3 |
| Total | 3 | 3 | 6 |

So we now have a significance level of .05 from the same set of four frequencies that previously yielded levels of .015625 or .00488. The difference is that this time, we knew all the marginal totals in advance. We knew in advance that there were 3 caffeinated and 3 decaffeinated cups, and we also knew in advance that the judge would divide the cups 3 and 3.

Thus we get very different exact significance levels, depending on whether both pairs of marginal totals are fixed (the coffee example), or just one is (the success-failure example), or neither is (the Yes-No example). There are still other cases, as for instance when one keeps drawing cases until one gets at least 5 cases in each cell, or perhaps at least 10 cases in each row. But cases like these are rare, and it is generally agreed that the three cases already described are the three major cases. Further, the case of all fixed marginals is generally agreed to be considerably rarer than the other two cases. The case of one fixed margin pair typically arises in testing the relation between an independent variable and a dependent variable (as in our success-failure example), while the case of no fixed margins typically arises in testing the relation between two outcome or dependent variables (as in our Yes-No example).
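The three significance levels for the same 3, 0, 0, 3 table can be reproduced with a few lines of arithmetic. This is only a sketch of the calculations described in this section; the variable names are hypothetical.

```python
import math

# One fixed margin pair (success-failure example), proximal null p0 = .5:
# P(all 3 treatment cases succeed) * P(all 3 control cases fail)
p_one_pair_fixed = 0.5 ** 3 * (1 - 0.5) ** 3

# No fixed margins (Yes-No example): multinomial probability of 3 Yes-Yes,
# 3 No-No, 0 Yes-No, 0 No-Yes, each pattern having probability .25
coefficient = math.factorial(6) // (
    math.factorial(3) * math.factorial(3) * math.factorial(0) * math.factorial(0)
)
p_no_margins_fixed = coefficient * 0.25 ** 6

# All margins fixed (coffee example): one correct split out of C(6, 3) = 20
p_all_margins_fixed = 1 / math.comb(6, 3)

print(p_one_pair_fixed, round(p_no_margins_fixed, 5), p_all_margins_fixed)
```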

#### Are marginal totals ancillary?

It is universally agreed that in the case with all fixed margins, the Fisher 2 x 2 test is genuinely exact. However, as just mentioned, that case is rare. Fisher argued that even when marginal totals are not literally fixed, they can be thought of as fixed, and many statisticians accept that argument. If that argument were correct (and I am one of the many who disagree), it would mean that the Fisher test is always an exact test, and the 2 x 2 problem would be solved.

To see Fisher's argument, first suppose that we are setting up an experiment with human subjects in treatment and control groups, and as each person arrives for the experiment we flip a coin to determine whether they are placed in the treatment group or the control group. Because of the coin, we don't know in advance how many people will be in each group, so speaking literally the group sizes are random. However, the experiment is about the treatment effect, not about the coin. It is generally agreed that the coin's results are irrelevant or ancillary to the main experiment, so it is more reasonable to think of the group sizes as fixed.

Continuing with this example, suppose each person succeeds or fails at a task. Fisher argued that the overall success rates are also irrelevant to the question of whether there is a difference in success rates between treatment and control groups. He thus claimed that all marginal totals are always irrelevant or ancillary to the hypothesis tested. By that argument, the Fisher test is always the correct test.

There is a technical argument (see Little, 1989) to the effect that one margin pair can always be considered ancillary, but that both pairs should not be so considered. I'll give a simple example to illustrate the point. To say that both margin pairs are ancillary is to say that the marginal data alone tell you nothing about the presence or absence of association. But suppose you have an experiment with 400 people in a treatment group and 600 others in a control group. Suppose a counter malfunctions during the experiment, and at the end you know only that altogether there were 400 successes and 600 failures, but the broken counter prevents you from knowing the success rate within each group. Surprisingly, there is still a useful analysis you can do. The highest positive association between treatment and success occurs if everyone in the treatment group succeeds and everyone in the control group fails. If that hypothesis were true then the observed outcome would not only be possible, it would be the only possible outcome. So the higher the positive association between the independent and dependent variables, the greater the agreement we would expect between the row and column marginal totals. Since we observed perfect agreement, we can use that fact to test the null hypothesis: we can ask about the probability of perfect agreement occurring if the treatment and control groups had equal success rates. Given our data, the proximal null hypothesis is that the true success rate in each group is .4. Under that hypothesis, it can be calculated that the probability of observing perfect agreement between row and column marginal totals is only .0257. Thus surprisingly, one can actually reject at the .05 level the null hypothesis that the treatment has no effect, based entirely on the marginal totals! It's hard to argue then that the marginal totals are irrelevant to assessing the presence of association.
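One way to arrive at a figure of about .0257 is sketched below. It rests on an assumed reading of the example: with the group sizes fixed at 400 and 600, perfect agreement means exactly 400 successes among the 1000 cases, each succeeding independently with probability .4. The function name is hypothetical, and logarithms are used to avoid overflow in the large factorials.

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials), computed on the log scale."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

# Assumed reading: total successes ~ Binomial(1000, .4); agreement means a total of 400
p_perfect_agreement = binomial_pmf(400, 1000, 0.4)
print(round(p_perfect_agreement, 4))
```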

Another argument against Fisher's position is at a far more practical level. The defenders of the Fisher position are generally unmoved by the argument that the Fisher test is too conservative, frequently replying that the other side just wants to get significance more often than they should. If the Fisher test were consistently too liberal, it's hard to imagine that the test's defenders would be equally adamant. But the fact is that for 2 x 2 tests, it's often the case that nonsignificance is used to promote acceptance of new treatments. This is true in the medical world, where for every significance test on the effectiveness of a new treatment, there are usually many tests looking for undesirable side effects. The promoters of the new treatment are of course hoping for nonsignificance in these tests. They are thus all too happy to use the Fisher test, which is biased against finding significance. There are many new treatments that are clearly effective though dangerous, so people trying to get the new treatment approved are often happy to use the Fisher test consistently, sacrificing some power on the test of effectiveness in order to make it easier to claim an absence of side effects. My point is that scientists and statisticians who are conservative in a genuine sense should be just as concerned about the charge that the Fisher test is too "conservative" as they would be about the opposite charge.

However, I will accept Fisher's argument to the extent of agreeing that one marginal total can always be considered ancillary. As mentioned, Little has actually given a rigorous proof of that point. Thus the case of one fixed margin pair is by far the most important case, and that is the case we shall consider in the rest of this document. This case is often called the double binomial case, since we can think of two independent groups (e.g., treatment and control groups), with the success rate in each group following a binomial distribution.

#### Exact tests, uniformly most powerful tests, and likelihood ratios

Our goal, then, is to develop an exact test for the double binomial problem. It turns out that there can be several exact tests for this problem. To see why, briefly consider a different problem: the problem of testing the null hypothesis that a variable X has a mean of 0, when X is assumed to be normally distributed. The familiar t test is an exact test for this problem. But the binomial test, based on counting the number of positive and negative cases, is also exact. The Wilcoxon signed-ranks test is also exact. By "exact" we mean that the probability of getting a significance level of, say, .0268 or better, is exactly .0268. All three of these tests are exact in that sense. However, it can be shown that given the assumption of normality, the t test is the most powerful of all possible tests. That is, the t test is the one most likely to give significant results when the null hypothesis is really false. Furthermore, the t test is most powerful regardless of the value of the true mean: the t test is most powerful regardless of whether the true mean is .5 or 2 or 18 or any other value, and regardless of the value of the true standard deviation. Thus the t test is uniformly most powerful. It is thus clearly the best test, even though there are many "exact" tests.

For the double binomial problem, there are many ways to construct an exact test, but there is no uniformly most powerful test. Test A might be better than test B at distinguishing between true success rates of .8 and .9, while B might be better at distinguishing between true success rates of .4 and .6. Thus there is no one test that is clearly "best" in the sense the t test is best for the problem just described. However, I shall argue later that the particular exact test I shall describe is "best" in a less formal sense.

This test uses another standard concept of statistical theory--the likelihood ratio. We'll introduce the concept with the frequency table

|  | Success | Failure | Total |
|---|---|---|---|
| Treatment | 3 | 1 | 4 |
| Control | 0 | 5 | 5 |
| Total | 3 | 6 | 9 |

In this table there are altogether 3 successes in 9 cases, so the proximal null hypothesis is p0 = 1/3. Under that hypothesis the probability of exactly 3 successes in group 1 is (4!/(3!1!))*(1/3)^3*(2/3)^1 = .0987, and the probability of 5 failures in group 2 is (2/3)^5 = .1317, so the probability of the total pattern is .0987*.1317 = .0130. Now consider the overall proximal hypothesis, which is the hypothesis most consistent with the data--with no restrictions imposed by the null. The overall proximal hypothesis is that the success rate is 3/4 in group 1 and 0 in group 2. Under that hypothesis the probability of the observed result in group 1 is (4!/(3!1!))*(.75)^3*(.25) = .4219, and the probability of the observed result in group 2 is 1, since all cases must fail if the probability of success is 0 as hypothesized. So the probability of the total pattern is .4219.

The likelihood ratio, here denoted LR, is defined as the ratio of the probabilities calculated under the proximal null hypothesis and the overall proximal hypothesis. Thus in this example, LR = .0130/.4219 = .031. LR is a measure of the consistency of the data with the proximal null hypothesis, relative to its consistency with the overall proximal hypothesis. In this case we can say that the data would have been only about 3.1% as likely to occur under the proximal null hypothesis (which is of course the instance of the null hypothesis most consistent with the data) as under the hypothesis which is overall most consistent with the data.

A general rule is that 2 ln(1/LR) is distributed approximately as chi-square, with df equal to the number of additional restrictions imposed by the null hypothesis. Here there is only one such restriction (the restriction of equal success rates in the two groups), so df = 1. In the present case we have chi-square = 2 ln(1/.031) = 6.96. With 1 df, the one-tailed significance level associated with that chi-square is .0042. This is one test that has been proposed for testing the 2 x 2 null hypothesis, but it is rarely used because it is too liberal. As explained later, an exact significance level for this pattern is .021.
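The LR calculation and the chi-square approximation for this table can be sketched as follows. The function names and the argument order (successes and group size for each group) are hypothetical choices for illustration.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def likelihood_ratio(a, n1, c, n2):
    """LR for a successes out of n1 in group 1 and c successes out of n2 in group 2."""
    p0 = (a + c) / (n1 + n2)  # proximal null hypothesis: pooled success rate
    null_prob = binom_pmf(a, n1, p0) * binom_pmf(c, n2, p0)
    # Overall proximal hypothesis: each group has its own observed success rate
    alt_prob = binom_pmf(a, n1, a / n1) * binom_pmf(c, n2, c / n2)
    return null_prob / alt_prob

lr = likelihood_ratio(3, 4, 0, 5)
chi_square = 2 * math.log(1 / lr)
# Upper-tail area of chi-square with 1 df, halved to give a one-tailed level
p_one_tailed = 0.5 * math.erfc(math.sqrt(chi_square / 2))

print(round(lr, 3), round(chi_square, 2), round(p_one_tailed, 4))
```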

#### An exact test based on LR values

Even if we don't calculate a significance level directly from LR, it is still reasonable to use the LR values of various patterns to rank the patterns for consistency with the null hypothesis. We must do this because a significance level is defined as the probability, under the null hypothesis, of finding the observed result or some other result still less consistent with the null hypothesis. In the current example, with group sizes of 4 and 5 respectively, there are just two other patterns that are less consistent with the null hypothesis than the observed pattern

|   |   |
|---|---|
| 3 | 1 |
| 0 | 5 |

These other two patterns are

|   |   |
|---|---|
| 4 | 0 |
| 1 | 4 |

and

|   |   |
|---|---|
| 4 | 0 |
| 0 | 5 |

Therefore an exact significance level p for the pattern

|   |   |
|---|---|
| 3 | 1 |
| 0 | 5 |

is the probability of finding one of these three patterns if the null hypothesis is true. Thus p is the sum of the individual probabilities of these three patterns. But we want the probability generated by the specific value of p0 that maximizes p. That value of p0 will usually be fairly near the proximal null hypothesis for the observed pattern (which was 1/3 in this example), but the two values will rarely be exactly equal. Therefore one must try many values of p0 to find the one that maximizes p. When that is done, p turns out to be the value .021 mentioned earlier.
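The procedure just described can be sketched in a few lines. This is an illustration under stated assumptions rather than the program used for the calculations in this work: the function names are hypothetical, only outcomes in the same direction as the observed one are ranked (one-tailed), and p0 is searched over an evenly spaced grid instead of being refined iteratively.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def likelihood_ratio(a, n1, c, n2):
    p0 = (a + c) / (n1 + n2)
    null_prob = binom_pmf(a, n1, p0) * binom_pmf(c, n2, p0)
    alt_prob = binom_pmf(a, n1, a / n1) * binom_pmf(c, n2, c / n2)
    return null_prob / alt_prob

def exact_one_tailed_p(a, n1, c, n2, grid_size=999):
    """Max over p0 of P(outcome at least as extreme as observed, ranked by LR)."""
    observed_lr = likelihood_ratio(a, n1, c, n2)
    # Outcomes with group 1 ahead of group 2 and LR no larger than the observed LR
    # (the small tolerance guards against floating-point ties)
    extreme = [(x, y) for x in range(n1 + 1) for y in range(n2 + 1)
               if x * n2 > y * n1
               and likelihood_ratio(x, n1, y, n2) <= observed_lr * (1 + 1e-12)]
    best = 0.0
    for i in range(1, grid_size + 1):
        p0 = i / (grid_size + 1)  # assumed grid: .001, .002, ..., .999 by default
        total = sum(binom_pmf(x, n1, p0) * binom_pmf(y, n2, p0) for x, y in extreme)
        best = max(best, total)
    return best, extreme

exact_p, extreme_patterns = exact_one_tailed_p(3, 4, 0, 5)
print(round(exact_p, 3), extreme_patterns)
```

For the 3, 1, 0, 5 table the list of extreme patterns contains the observed pattern plus the two others named above, given as (successes in group 1, successes in group 2).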

#### A table of exact one-tailed p's

The process just described was used to generate a table giving all the exact one-tailed p's meeting the following two requirements:
• Where n1 and n2 denote the two group sizes, n1 * n2 may not exceed 500.
• The exact p is .1 or less; otherwise no p is given.
Instead of adjusting p0 iteratively, 999 values of p0 were used, ranging from .001 up to .999 in increments of .001. The highest of the 999 resulting values of p was then taken as the significance level for the pattern. To test the accuracy of this approach, p0 was determined iteratively for 220 cases with significance levels (p's) ranging from .45 down to 5E-9, and the ratio between the p found that way and the p found the "999 value" way never exceeded 1.001.

The total table is large (about 4.4 megabytes), so it is divided into 16 separate files based on the size of n1, where Group 1 is defined as the group with the higher success rate. These 16 files range in size from 206K to 369K bytes. Thus download times should not exceed about 2 minutes, even with only a 28.8K baud (bits per second) modem. Here are the links to the files:

n1 = 1
n1 = 2
n1 = 3
n1 4 or 5
n1 6 to 8
n1 9 to 11
n1 12 to 15
n1 16 to 19
n1 20 to 24
n1 25 to 30
n1 31 to 40
n1 41 to 50
For the files above this point, entries are listed by n1, then n2, then A, then C.
For the files below, entries are listed by n1, then n2, then C, then A.
n1 51 to 75
n1 76 to 120
n1 121 to 200
n1 201 to 500

Please read the following directions for using the files; otherwise you may think you have found the correct p when it's really the wrong p.

1. After entering the correct file, use your browser's Search command (usually found in the Edit menu) to find the correct value of n1, then the correct value of n2. Sections of the table have headings like
`N1 = 30     N2 = 15`
There is always exactly one space before and after each equals sign. Use that spacing in your browser's Search command.
2. As mentioned above, in files for n1 of 50 or below, entries are listed by n1, then n2, then A, then C, while in files for n1 of 51 or above, entries are listed by n1, then n2, then C, then A. It has been found that this arrangement lowers disk space requirements and download times relative to having all files with the same arrangement.
3. Once you have found the correct combination of n1 and n2, do not use the Search function to find the desired value of A or C. Rather, scroll down manually, one screen at a time, to find it. Otherwise you may unwittingly jump to another n1-n2 combination. For instance, suppose you have n1 = 41, n2 = 2, A = 18, C = 0. You can use Search to find the section of the table for n1 = 41, n2 = 2. But that section of the table contains no entry for A = 18, because no such entry yields a p below .1. Therefore if you use Search to find A = 18, you will unwittingly jump to the section of the table for n1 = 41, n2 = 4, which does contain such an entry.

#### Do we really want an exact test?

For years, scientists and statisticians who recognized the inexactness of the Fisher test wanted a truly exact test. Now that one is available, questions arise immediately as to whether that's what we really want.

Consider the outcome

|   |   |
|---|---|
| 12 | 0 |
| 12 | 5 |

When the aforementioned computer program was used to determine an exact significance level for this outcome, it determined that of all possible outcomes with group sizes of 12 and 17 respectively and with success rates in the treatment group above those in the control group, this outcome is the 46th best outcome when outcomes are ranked by their LR values. It also found that under the null hypothesis of no treatment effect, the probability of observing one of these 46 outcomes may be as high as .0172. So that's an exact significance level p for this outcome. The outcome

|   |   |
|---|---|
| 9 | 3 |
| 5 | 12 |

is the immediately preceding outcome in this ranking; it is the 45th best outcome. Under the null hypothesis of no treatment effect, the probability of observing one of the 45 best outcomes is .0093, so that's the exact significance level for this outcome.

These two p values differ substantially; the first is 85% larger than the second. The second outcome is significant at the .01 level one-tailed, while the first isn't even very close. However, the LR values for these two outcomes are respectively .04819 and .04796. These two values are nearly identical; the ratio of the smaller to the larger is .9951. The associated p's are so different because when calculating p for the outcome that comes later in the ranking, we must include the probability of the other outcome, and it turns out that the probability of outcome #45 is substantial relative to the probability of outcome #46. Many other examples like this can be found, in which the ratio of two LR values is between .99 and 1 yet the ratio of the two p values is 1.5 or greater--and in which one or both of the p's are in the "important" range between .01 and .05. I'll call this the near-tie phenomenon.
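The near tie in LR values can be checked directly. The sketch below uses a hypothetical `likelihood_ratio` helper taking the successes and group size for each group, and evaluates it for the two outcomes with group sizes 12 and 17.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def likelihood_ratio(a, n1, c, n2):
    """LR for a successes out of n1 in group 1 and c successes out of n2 in group 2."""
    p0 = (a + c) / (n1 + n2)
    null_prob = binom_pmf(a, n1, p0) * binom_pmf(c, n2, p0)
    alt_prob = binom_pmf(a, n1, a / n1) * binom_pmf(c, n2, c / n2)
    return null_prob / alt_prob

# Outcome with 12 of 12 successes in group 1 and 12 of 17 in group 2
lr_12_0_12_5 = likelihood_ratio(12, 12, 12, 17)
# Outcome with 9 of 12 successes in group 1 and 5 of 17 in group 2
lr_9_3_5_12 = likelihood_ratio(9, 12, 5, 17)

lr_ratio = min(lr_12_0_12_5, lr_9_3_5_12) / max(lr_12_0_12_5, lr_9_3_5_12)
print(round(lr_12_0_12_5, 5), round(lr_9_3_5_12, 5), round(lr_ratio, 4))
```

The outcome with the smaller LR (9, 3, 5, 12) is the one ranked earlier, which is consistent with its being the 45th outcome and the other the 46th.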

That raises the question of whether significance levels or likelihood ratios are really better measures of the outcome of a hypothesis test. The next section considers this question.

#### Likelihood ratios versus significance levels

When scientists talk about the "validity" of a hypothesis test, they usually mean something far different from what statisticians mean. In particular, there are tests that statisticians would describe as completely valid, that scientists would describe as completely invalid. A test may be completely valid in the statistician's sense, and may reject a null hypothesis at the .01 level or even the .001 level, and yet when reasonable scientists hear a detailed description of the test and its outcome, nearly all would say that the outcome has not changed their opinion about the hypothesis at all. This fact suggests that we need to think about hypothesis tests in quite a different way.

Statisticians define a test as valid if the actual probability of rejection does not exceed the nominal rejection probability. For instance, a test is invalid if there is an actual probability of .07 of rejecting a true null hypothesis at the .05 level.

But suppose test RN (for "random number") consists of drawing a random number from a uniform distribution between 0 and 1, and rejecting the null if that number is .05 or less. By the standard definition, test RN is a valid test at the .05 level. But suppose you heard that a scientist had used test RN to test the effectiveness of a proposed medical treatment, had found significance, and had therefore concluded that the treatment was effective. In ordinary English you would call this conclusion invalid.

For clarity, I'll use the term credibility to refer to the scientist's conception of validity. That is, I'll define a test as "credible" only if a significant result will lead intelligent scientists to lower their faith in the null hypothesis--provided of course they agree that the experiment itself was well designed. Of course, even a credible test can make Type 1 errors of rejecting true null hypotheses, but at least it doesn't do so nearly as foolishly as the "valid" test RN described above.

Statisticians attempt to transmute their technical concept of "validity" into what I have called credibility by combining "validity" with power. A test's power is its ability to reject false null hypotheses--in other words, to reject when rejection is the correct decision. Notice that test RN, ridiculed above, is singularly lacking in power. No matter how effective the treatment and no matter how large and well-designed the experiment, the probability of rejecting the null under test RN never exceeds .05. Thus we have a surprising and paradoxical result--test RN's lack of power is actually a clue to its lack of credibility. A test is credible only if it combines validity and power. Thus statisticians will say that they really have the problem of credibility covered through the concepts of power and validity.

My feeling is that this approach is at best indirect--rather like assuring yourself that a one-pound box of crackers is free of dirt, not by the direct method of looking for dirt and failing to find it, but indirectly by making sure that the weight of the crackers themselves is a full pound.

Consider again why power is so important in establishing credibility. A scientist wants to know that the observed result was unlikely to be observed under the null hypothesis, and also was likely to be observed under the alternative hypothesis. Thus to be credible, a test's rejection region (the set of possible outcomes that will lead to rejection of the null) must consist of outcomes unlikely under the null but likely under the alternative.

Now consider a different example involving random numbers. The familiar z value of 1.96 does not correspond exactly to the .05 level two-tailed; a more exact level is .0499958, which is .0000042 below .05. Now imagine a scientist who pointed that out, and who said that in conducting a test he had used a critical z of 1.96, but to bring the test up to the .05 level more exactly, he had decided that if z did not reach the 1.96 level he would choose a random number between 0 and 1, and reject the null if that random number fell below .0000042. Suppose the scientist then reported that the null had indeed been rejected at the .05 level, and suppose the hypothesis was an important hypothesis about the effectiveness of a medical treatment. You would certainly want to know whether the result had been labeled significant because z had exceeded 1.96 or because the random number had fallen below .0000042.
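The two figures quoted above are easy to verify. The following is a minimal sketch using only the standard library; the function name `two_tailed_p` is an illustrative choice.

```python
import math

def two_tailed_p(z):
    """Two-tailed normal significance level for a critical value z."""
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

level = two_tailed_p(1.96)   # actual level of the "z = 1.96" test
shortfall = 0.05 - level     # how far that level falls short of .05

if __name__ == "__main__":
    print(f"{level:.7f}")      # 0.0499958
    print(f"{shortfall:.7f}")  # 0.0000042
```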

Notice what this means. The great majority of possible outcomes in the rejection region are indeed more likely if the experimental treatment is effective than if it is ineffective. But there are some points in the rejection region (corresponding to the random number below .0000042) lacking that property. When you insist on knowing exactly why the scientist called the result significant, you are saying you don't care about the rejection region as a whole, you just care about the particular outcome that was observed.

This suggests that likelihood ratios have certain advantages over significance levels, because the size of a likelihood ratio is determined entirely by the particular outcome that was observed, not by a set of other outcomes that were not observed. On the other hand, a significance level using, say, a z test reports "the probability of finding a z this high *or higher*." The italicized words mean that a significance level is a statement mostly about outcomes that did not in fact occur.

Many familiar significance tests are known by statisticians to be likelihood-ratio tests. This means that the p values of all possible outcomes rank the outcomes in the exact same order as their LR values. Thus odd examples like my random-number examples cannot arise; the fact that any outcome has a low p always means that that very outcome also has a low LR value. Given the usual assumptions of normality, linearity, and homoscedasticity, all the familiar parametric tests (t tests on means, tests on correlations, analysis of variance, tests in regression and linear models) are likelihood-ratio tests.

Some authors (e.g., Edwards, 1984) have argued that LR's should be used by scientists to evaluate hypotheses in place of significance levels. However, in that capacity LR's have two major disadvantages. First, they always require a specific distributional form; there is no such thing as a distribution-free likelihood ratio. This problem is avoided only with categorical data--the type of data used in the foregoing example. Second, there does not seem to be any satisfactory way to correct likelihood ratios for multiple tests, or to satisfactorily test composite null hypotheses such as the hypothesis that 5 cell means are all equal. Thus we can say that likelihood ratios seem to be superior to significance levels in a theoretical sense, but can't be used as everyday tools for all hypothesis evaluation because of practical limitations which arise in contexts other than performing a single 2 x 2 test.

#### Combining the advantages of LR's and p's

The theoretical superiority of likelihood ratios suggests the uncomfortable conclusion that the preceding "exact" 2 x 2 test may actually not be as good as a much simpler procedure--reporting a test's LR value, which can be calculated by the simple formula

$$LR = \exp(-LRCS/2) = \exp\left(-\sum o \ln(o/e)\right)$$

where the summation is across cells. LRCS was defined in this paper's first section as a likelihood-ratio chi-square value. If any cell is empty then ln(o/e) is undefined for that cell, but that doesn't matter because it is multiplied by o = 0, so that cell's term is taken to be 0. However, scientists wanting to report LR's in scientific papers will face the problem that most scientists are not familiar with them. And as mentioned above, that situation is unlikely to change because it is unlikely that LR's will ever replace p's as everyday statistics for reporting scientific results.
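The LR formula can be sketched in a few lines. The table layout is an assumption made for illustration (rows are the two groups, columns are success and failure), the e values are the usual expected frequencies under independence, and the function names are hypothetical.

```python
import math

def expected_counts(table):
    """Expected cell frequencies e under independence: row total * column total / N."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[r * k / n for k in cols] for r in rows]

def likelihood_ratio(table):
    """LR = exp(-sum of o * ln(o / e)) over the four cells."""
    e = expected_counts(table)
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = table[i][j]
            if o > 0:  # an empty cell contributes 0 to the sum
                total += o * math.log(o / e[i][j])
    return math.exp(-total)
```

For example, `likelihood_ratio([[3, 0], [0, 3]])` is about .0156 (that is, 1/64), while a table that fits the null exactly, such as `[[2, 2], [3, 3]]`, gives an LR of 1.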

All this suggests another line of approach. Suppose we could find a test which reports results in terms of familiar p's, but in which a reported p can be transformed with good accuracy to the outcome's LR value. Such a test would be approximately a likelihood-ratio test. Then readers unfamiliar with LR's could interpret the p's in the usual way, while readers familiar with LR's would gain confidence by the fact that the p's map closely onto LR values.

Recall that test LRCS does calculate p's completely as a function of LR values. The problem with this test is that it is too liberal by a noticeable margin. But if that could be fixed somehow, then we would have a test that might actually be regarded as theoretically superior to the "exact" test described earlier.

It turns out that this can be done. Modifying test LRCS by adding a continuity correction cc of .25 produces a test I'll call LRCC for "likelihood ratio with continuity correction". To perform test LRCC, calculate the four o and e values, adjust each o by .25 toward its own e, then enter the modified o's into the LRCS formula.
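The LRCC recipe can be sketched as follows. Several details here are assumptions rather than statements from the text: the table layout (rows are groups, columns are success and failure), the rule that an o is never moved past its own e, and the conversion to a one-tailed p by halving the upper tail of chi-square with 1 df. The halving was chosen because it appears to reproduce the R values reported in this section, and it is meaningful only when the observed difference is in the predicted direction.

```python
import math

def lrcc(table, cc=0.25):
    """Return (chi_square, one_tailed_p) for test LRCC on a 2 x 2 table.

    table is ((a, b), (c, d)): rows are groups, columns are success/failure.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = table[i][j]
            e = rows[i] * cols[j] / n
            # Move o toward e by cc. Assumption: never step past e.
            if o > e:
                o_adj = max(o - cc, e)
            else:
                o_adj = min(o + cc, e)
            if o_adj > 0:
                total += o_adj * math.log(o_adj / e)
    chi_square = max(2.0 * total, 0.0)  # guard against tiny negative rounding
    # Assumption: one-tailed p is half the upper tail of chi-square (1 df),
    # where P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p = 0.5 * math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p
```

Passing `cc=0.5` or `cc=0.125` gives the variants mentioned in the text for the Fisher and all-margins-random cases.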

Test LRCC has the following advantages:

• It is easily calculated, given a pocket calculator with an ln key.
• It doesn't exactly use LR's, and doesn't give exact p's, but arguably it comes as close as any known test to offering an accurate "likelihood-ratio 2 x 2 test", combining the theoretical advantages of LR's with the familiarity of p's.
• It yields p's surprisingly close to the "exact" double-binomial p's.
• Unlike the Pearson chi-square test and various modifications of it, test LRCC has no lower limit on e-values.
• LRCC is also surprisingly accurate even for extremely small p's. That's particularly convenient when using the Bonferroni correction for multiple tests.
• For the rare cases when the Fisher test is appropriate but the means to calculate it are not readily available, changing cc in LRCC from .25 to .5 gives an approximation to the Fisher test that is far better than the Fisher-Yates test--which is generally considered an excellent approximation. And changing cc to .125 in LRCC yields an excellent approximation to "exact" p's for the case in which all margins are random.

From the nature of test LRCC, it is obvious that chi-square will be zero when the data exactly fit the null hypothesis. Therefore the test can be validated to a substantial degree merely by showing that it gives reasonably accurate p's at the opposite extreme, when all treatment-group cases succeed and all control-group cases fail. That would be regarded as a ridiculously stringent test for any Pearson-family 2 x 2 test, whose calculated p's at the extreme frequently differ from exact double-binomial p's by 10 orders of magnitude or more (that is, by a factor of $10^{10}$, or 10 billion).

Therefore the sizes of errors for test LRCC are remarkable. Define R as calculated p/exact p. I looked for liberal errors (values of R below 1) only for n1 and n2 values of 500 or below, since my Gauss program could not handle exact p's much below the p of 9.33E-302 that is found when n1 = n2 = 500. But within this range, R never falls below .7918, and even that value of R occurs at the extreme of n1 = n2 = 500. The highest (most conservative) values of R are found when one n is 1 and the other is large; the larger the other n, the higher R. But when one n is 1, R exceeds 2 only when the other n is 814 or higher. Thus we can say that R remains in a remarkably tight range around 1 for virtually all sample sizes that users are likely to have, and within that range never makes liberal errors over 21%.
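The perfect-pattern check can be sketched directly. Two things here are assumptions: the LRCC p is taken as one-tailed (half the chi-square tail with 1 df), and the exact double-binomial p for a perfect pattern is taken to be the null probability of that single outcome, maximized over the common success rate. The second assumption gives (1/4)^500, about 9.33E-302, when n1 = n2 = 500, which agrees with the figure quoted above. The function name is hypothetical.

```python
import math

def perfect_pattern_r(n1, n2, cc=0.25):
    """R = LRCC p / exact p when all n1 treatment cases succeed and all n2 controls fail."""
    n = n1 + n2
    # (observed, expected) per cell; expected = row total * column total / n
    cells = [
        (n1, n1 * n1 / n),  # treatment successes
        (0, n1 * n2 / n),   # treatment failures
        (0, n2 * n1 / n),   # control successes
        (n2, n2 * n2 / n),  # control failures
    ]
    total = 0.0
    for o, e in cells:
        # Shift each o toward its own e. Here |o - e| = n1*n2/n >= .5,
        # so a cc of .5 or less can never carry o past e.
        o_adj = o - cc if o > e else o + cc
        if o_adj > 0:
            total += o_adj * math.log(o_adj / e)
    chi_square = 2.0 * total
    calc_p = 0.5 * math.erfc(math.sqrt(chi_square / 2.0))  # assumed one-tailed
    # Assumed exact p, computed via its log: (n1/n)^n1 * (n2/n)^n2.
    log_exact_p = n1 * math.log(n1 / n) + n2 * math.log(n2 / n)
    return calc_p / math.exp(log_exact_p)
```

Under these assumptions `perfect_pattern_r(500, 500)` should come out close to the .7918 reported above, and `perfect_pattern_r(1, 814)` should be close to 2.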

Note that the patterns just discussed (one frequency of 1, the diagonally opposite frequency large, and the other two entries both 0) are the patterns that yield the lowest expected cell frequencies and therefore the patterns for which Pearson-family tests are not even used because of their invalidity for these cases. For instance, if the one large frequency is 499 then the lowest expected cell frequency is .002. Test LRCC has no lower limit on expected cell frequencies, or any other comparable proscription on its use.

For the truly compulsive (or for those writing a computer program that needs to be written only once), the continuity correction cc can be modified as follows. Define lnsum as the natural log of the total sample size, and define lnr = ln(n1/n2). It doesn't matter which n is called n1 in this formula, since the choice will not affect the absolute value of lnr, and lnr is used only in squared form. Then let the continuity correction cc be

$$cc = .27589 - .00581\,\mathit{lnr}^2 - .000206\,\mathit{lnsum}^2 + .000628\,\mathit{lnr}^2 \cdot \mathit{lnsum}$$

When n1 and n2 are both 500 or below, values of cc calculated by this formula range from below .19 to above .27, and values of R calculated using these values of cc remain between .977 (found when n1 = 4 and n2 = 147) and 1.061 (found when n1 = 1 and n2 = 500). In other words, test LRCC is at most about 2% too liberal or 6% too conservative, across the range of n's studied, for the extreme cases of maximum chi-square.
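For those writing the program, the refined correction is a one-line function. The sketch treats the second and third terms as using the squares of lnr and lnsum, and the last term as lnr squared times lnsum; the function name `refined_cc` is hypothetical.

```python
import math

def refined_cc(n1, n2):
    """Sample-size-dependent continuity correction cc for test LRCC."""
    lnsum = math.log(n1 + n2)      # natural log of the total sample size
    lnr2 = math.log(n1 / n2) ** 2  # lnr enters only in squared form
    return (0.27589
            - 0.00581 * lnr2
            - 0.000206 * lnsum ** 2
            + 0.000628 * lnr2 * lnsum)
```

With this reading, `refined_cc(500, 500)` is about .266 and `refined_cc(1, 500)` is about .194, and swapping n1 and n2 leaves the result unchanged.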

I have risked derision by emphasizing perfect patterns (all treatment-group cases succeed and all control-group cases fail) with their ridiculously small p's because detailed analyses of other patterns involve the "near-tie" problem discussed above, in which two outcomes with nearly equal LR values may yield "exact" p's that differ from each other by a factor of nearly 2. This makes test LRCC miss these "exact" p's by the same general factor. But the argument given earlier in favor of LR values over p's also implies that these misses by test LRCC are as much the fault of the "exact" p's as of LRCC. It can thus be argued that test LRCC is really as good as the "exact" p's. And of course the user who understands the advantages of LR values has the assurance that any small p found by test LRCC is associated with a low LR value, and thus with an outcome substantially less consistent with the null hypothesis than with the alternative.

For readers not content to read only about perfect patterns, I also looked at a set of 29,201 outcome patterns that includes all patterns with the following characteristics: (a) the success rate in the treatment group exceeds that in the control group, (b) each group size is at least 3 and at most 25, (c) the treatment group size does not exceed the control group size. The last condition prevents duplication, since every outcome pattern in which treatment and control groups are, say, 25 and 20 respectively, duplicates some pattern in which the group sizes are reversed. Using cc = .25, the test is generally a little conservative; only 3,665 of the 29,201 calculated p's fell below their true p's. The 29,201 values of R ranged from .6175 to 1.4898, with a median of 1.1243. This range seems small enough to require no apologies at all, and it is also small enough to suggest that much of it could be produced by the "near-tie" phenomenon discussed earlier.

Storer and Kim (1990) reviewed a substantial literature on the double binomial problem. Their recommendations depend on the individual user's access to computing power and need for conservatism, but the very last sentence of their article speaks highly of a test suggested by Pirie and Hamden in which the familiar Pearson chi-square is adjusted by a continuity correction of .25, rather than the correction of .5 suggested decades ago by Yates. Like Pearson and Yates, Pirie and Hamden suggest using the test only if the smallest expected cell frequency--here denoted min(e)--exceeds some specified value. I found that requiring min(e) to exceed 4.0909 brings the Pirie-Hamden test to one of the same criteria of accuracy satisfied by LRCC, making the smallest value of R equal to .6175 across the current set of 29,201 outcome patterns. However, that restriction eliminates 48.6% of the 29,201 outcome patterns in the studied list, while LRCC yields a p for every outcome in this or any other list. That seems like a major difference in the usefulness of the two tests.
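For comparison, a Pearson chi-square with an adjustable continuity correction can be sketched as below. This is the generic Yates-style form with cc as a parameter, not code taken from Pirie and Hamden or from Storer and Kim; the function name, the table layout (rows are groups, columns are success and failure), and the rule that the corrected deviation is never pushed below zero are all assumptions.

```python
def pearson_cc(table, cc=0.25):
    """Return (chi_square, min_e) for a Pearson chi-square with continuity correction cc.

    Assumes every row and column total is positive, so that no e is zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    chi_square = 0.0
    min_e = float("inf")
    for i in range(2):
        for j in range(2):
            e = rows[i] * cols[j] / n
            min_e = min(min_e, e)
            # Shrink |o - e| by cc, but not below zero (assumption).
            dev = max(abs(table[i][j] - e) - cc, 0.0)
            chi_square += dev * dev / e
    return chi_square, min_e
```

A caller following the restriction discussed above would use the chi-square only when the returned `min_e` exceeds 4.0909. Setting `cc=0.5` gives the Yates version.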

References

Edwards, A. W. F. (1984). Likelihood. Cambridge Univ. Press.

Little, R. J. A. (1989). Testing the equality of two independent binomial proportions. American Statistician, v. 43 #4, 283-288.

Storer, Barry E. and Kim, Choongrak (1990). Exact properties of some exact test statistics for comparing two binomial proportions. Journal of the American Statistical Association, v. 85 #409, 146-155.