A new method for pooling independent significance levels is described. The new method has three major advantages over competing methods: it yields more specific conclusions, it is far less vulnerable to methodological criticisms of individual studies in the analysis, and it can readily be extended to handle the "file-drawer problem"--the problem of unpublished negative results.

*P*-poolers are quite different from the better-known meta-analytic
methods for
estimating average effect sizes. These latter methods are widely used even when the
literature contains several results that are highly significant individually, so there is little
doubt that the effect in question exists sometimes. *P*-poolers are more
appropriately used to
show merely that an effect exists at least sometimes--though later I discuss going somewhat
beyond that basic conclusion.

*P*-poolers also differ from corrections for multiple tests, which always
yield a
corrected result less significant than the original result. The reason that
*p*-poolers can yield
more significant results is that the conclusion supported by the result is a vague conclusion,
such as the conclusion that at least one of the *k* studied effects must be
real, while corrections
for multiple tests always yield specific conclusions of the form, "This result is real, even
after
correcting for multiple tests."

The best-known *p*-pooler, due to Fisher, computes

C = 2 SUM ln(1/p_{i}) = -2 SUM ln(p_{i})

(where SUM denotes summation over the *k* studies) and compares C to a chi-square distribution with df = 2*k*, where *df* denotes degrees of
freedom. This method is based on the additivity of independent chi-squares, and on the fact
that when a null hypothesis is true, -2 ln(*p*) is distributed as chi-square with
2 df.
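Fisher's method can be sketched in a few lines. The following is a minimal Python illustration (using SciPy for the chi-square tail probability), not the author's original code; the function name and example p-values are chosen here just for demonstration:

```python
import math
from scipy.stats import chi2

def fisher_pool(pvals):
    """Pool independent p-values with Fisher's method.

    Under the joint null, C = -2 * sum(ln p_i) is chi-square with 2k df,
    so the pooled significance level is the upper tail beyond C.
    """
    k = len(pvals)
    c = -2.0 * sum(math.log(p) for p in pvals)
    return chi2.sf(c, df=2 * k)  # pooled significance level P

# Three individually marginal results pool to a clearly significant P.
print(fisher_pool([0.06, 0.07, 0.08]))
```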

Another *p*-pooler, due to Stouffer, uses a standard normal table to convert each
*p*_{i} to a standard normal deviate *z*, then computes

Z = SUM(z)/sqrt(k) = sqrt(k) mean(z)

then using a standard normal table to find the pooled significance level *P*
associated with *Z*.
This method works because each *z* has a standard normal distribution, so
SUM(z) has a normal
distribution with mean 0 and variance *k*.
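Stouffer's method is equally short. Again a minimal Python sketch, using SciPy's normal distribution in place of a printed table; names and example values are illustrative only:

```python
from math import sqrt
from scipy.stats import norm

def stouffer_pool(pvals):
    """Pool independent one-tailed p-values with Stouffer's method."""
    k = len(pvals)
    zs = [norm.isf(p) for p in pvals]  # each z is standard normal under its null
    Z = sum(zs) / sqrt(k)              # sum(z) has variance k, so Z is standard normal
    return norm.sf(Z)                  # pooled significance level P

print(stouffer_pool([0.06, 0.07, 0.08]))
```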

Both these methods are exact in the sense that if the values of
*p*_{i} entering them are distributed uniformly on the interval
(0,1) (in other words, if the individual significance tests
are exactly valid), then the pooled significance level *P* will have that same
uniform distribution under the null hypothesis.
Nevertheless, the two methods can differ noticeably: the Stouffer method usually gives a more significant pooled *P*
than the Fisher method
when the individual values *p*_{i} are fairly similar, while the
Fisher method gives a more
significant *P* when they are quite dissimilar.
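This contrast is easy to check numerically. The sketch below uses SciPy's built-in `combine_pvalues`; the two sets of p-values are illustrative choices, not data from the article:

```python
from scipy.stats import combine_pvalues

similar = [0.06, 0.07, 0.08]       # p-values close together
dissimilar = [0.001, 0.30, 0.90]   # one strong result among weak ones

for label, ps in [("similar", similar), ("dissimilar", dissimilar)]:
    p_fisher = combine_pvalues(ps, method='fisher')[1]
    p_stouffer = combine_pvalues(ps, method='stouffer')[1]
    print(f"{label}: Fisher P = {p_fisher:.4f}, Stouffer P = {p_stouffer:.4f}")
```

For the similar set, Stouffer's pooled *P* comes out smaller (more significant) than Fisher's; for the dissimilar set, the ordering reverses.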

Standard *p*-poolers suffer from three major limitations. First, as mentioned above, they yield only vague conclusions, of the form that "at least one" of the studied effects must exist. Methodological discussions of meta-analysis often talk about independent experiments testing "the same hypothesis", but that phrase ignores an important feature of the real world. In fact no two studies ever test exactly the same null hypothesis, even though the hypotheses may be similar. For instance, in psychology some studies use male subjects while others use female, some test subjects in natural surroundings while others use more artificial surroundings, and so on. It is typically not known how important these factors are. It is therefore very desirable to be able to be as specific as possible about which studies are contributing to a significant pooled result.

A second limitation is rarely mentioned in methodological discussions, but arises
frequently when *p*-poolers are actually used: any methodological criticism
of even a single
study in the analysis is sufficient to invalidate the pooled conclusion. If ten studies are
pooled, critic A may mistrust study 3 while critic B mistrusts study 5 and critic C mistrusts
studies 8 and 10--but all three critics agree that the pooled significance level is meaningless,
because pooling assumes that all the studies in the analysis are trustworthy. In the social
sciences it almost never happens that everyone is completely happy with all the studies in a
meta-analysis, so nearly everyone except the meta-analyst agrees that the pooled result is
meaningless.

A third limitation of standard *p*-poolers is that they have no good way
to handle the
"file drawer problem"--the problem that perhaps only the most significant results were
published, and many nonsignificant results are hiding in file drawers. A method by
Rosenthal and Rubin (Rosenthal, 1979) extends the Stouffer formula to handle a form of the
file-drawer
problem, but the method assumes that there is an equal tendency to publish the most
significant results, regardless of whether they are positive (e.g., treatment-group mean above
the control-group mean) or negative. If the tendency were instead to publish just the most
positive results, the Rosenthal-Rubin method would be invalid.

The conclusion reached by a significant pooled result is that at least one of these
*k*
effects must be real. That conclusion is of course much more specific than the vague
conclusion that at least one of all *t* effects must be real. And of course the
conclusion is
susceptible to methodological criticisms concerning the *k* studies whose
*p*-values were actually used in the pooled test, but that conclusion is
invulnerable to methodological criticisms
concerning the other *t - k* experiments--typically the great majority of all
the experiments
originally examined.

This approach can easily handle the file-drawer problem simply by setting
*t* larger
than the total number of studies actually examined. For instance, if 20 studies were
examined, then setting *t* = 30 allows for the possibility that 10 other
studies in the same area
were unknown to the meta-analyst. By scanning different sections of a table covering many
values of *t*, the analyst can even discover the largest value of
*t* that would still make the pooled result significant.

Each critical value of *R*--the product of the *k* smallest of the *t* *p*-values--is reported as a decimal value followed by an integer.
For instance,
the first value in the table, for *t* = 5, *k* = 2, alpha =
.10, is reported as ".294 2".
The integer is the number of zeros that should be inserted immediately after the decimal
point. In this
case the critical *R* is .00294. Equivalently, the sign-reversed integer equals
the exponent
when *R* is written in exponential notation; in this example
*R* = .294E-2.
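A small helper can decode this compact format; the function name below is hypothetical, chosen just for illustration:

```python
def parse_r(entry):
    """Convert a table entry like '.294 2' to the critical R it encodes.

    The integer gives the number of zeros to insert after the decimal
    point, i.e. the mantissa is scaled by 10 to the minus that integer.
    """
    mantissa, zeros = entry.split()
    return float(mantissa) * 10.0 ** (-int(zeros))

print(parse_r(".294 2"))  # the critical R for t = 5, k = 2, alpha = .10
```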

These values are based on Monte Carlo studies with 2 million trials each. The analysis took approximately 46 hours using GAUSS 2.2 on a 486-33 DOS machine.

When the observed *R* falls between two tabled critical values, the pooled significance level can be approximated by linear interpolation:

*P = LA + (UA - LA)((Observed R - LR)/(UR - LR))*

where LR and UR are the lower and upper tabled *R*-values straddling the observed *R*, and LA and UA are the corresponding alpha levels.

For instance, suppose *k* = 3, *t* = 10, and
*R* = .0000153. In the section of the table for *t* = 10 and
*k* = 3, we find this *R* is straddled by
*R*-values of .0000127 and .0000165, which correspond to alphas of .025
and .03 respectively. Thus we have LR = .0000127, UR = .0000165, LA = .025, and UA
= .03. Provided all the *R*-values involved have the same number of
leading zeros, the result is unchanged if their leading zeros are omitted. Thus the pooled
*P* is

P = .025 + .005(.153 - .127)/(.165 - .127) = .0284.
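The interpolation is a one-line function; the sketch below (with hypothetical names) reproduces the worked example:

```python
def interpolate_p(r, lr, ur, la, ua):
    """Linearly interpolate the pooled P between two tabled critical values.

    lr, ur: the tabled critical R-values straddling the observed r
    la, ua: the alpha levels corresponding to lr and ur
    """
    return la + (ua - la) * (r - lr) / (ur - lr)

# The worked example: k = 3, t = 10, observed R = .0000153
p = interpolate_p(r=0.0000153, lr=0.0000127, ur=0.0000165, la=0.025, ua=0.03)
print(round(p, 4))
```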

If you are in this situation, then one acceptable strategy is to first set *k*
high and then
lower it gradually until a nonsignificant result is found. Then the last significant result yields
the most specific conclusion reachable from the data. This strategy does not violate
proscriptions against multiple tests because no positive conclusion is drawn if the first test in
the series is not significant.

A standard tenet of very large meta-analyses is that the meta-analyst cannot, and
perhaps should not, attempt to screen studies for methodological adequacy. That approach is
totally inappropriate for *p*-poolers, which in no sense average the studies
examined, and thus
cannot count on the notion that conservative biases in some studies will cancel out liberal
biases in others. Rather, a liberal bias in even a few studies can invalidate the conclusions of
a *p*-pooler. Of course, even if studies are screened somewhat on
methodological grounds,
there will still be criticisms concerning individual studies. As already explained, the present
method is designed to handle that fact.

Even aside from this last point, I believe that most readers will be far more convinced by a meta-analysis focusing on 30 or fewer of the best studies in a field than by an analysis pooling the results of hundreds or thousands of weak studies. The reaction to the latter kind of analysis is typically "garbage in, garbage out".

I studied *k*-values only up to 5 for two reasons. First, for reasons
already explained, I
believe a conclusion is less vulnerable to criticism on methodological grounds if
*k* is only a
small fraction of *t*, and I studied *t*-values only up to 30.
Second, the individual investigators believed that their studies had a good chance of being
significant even individually. Thus to
say that one needs to pool over 5 such studies in order to reach a positive conclusion is to
say the original investigators were wrong by an order of magnitude. If that's the case, then
perhaps studies in the area should be redesigned.

To illustrate both the method and the type of conclusion that can be drawn, suppose
that the first 8 of 30 ranked p's are .002, .003, .005, .007, .01, .03, .05, and .08. Even the
first of these is not significant at the .05 level after a Bonferroni correction, since .002 x 30
= .06. But the product of the first 5 p's is .21E-11, which Table 1 shows to be significant
beyond the .0003 level. Thus we have the vague conclusion that at least one of the studied
effects must be real. Or to be more precise, the conclusion is that at least one of the studied
effects is either real or contaminated by methodological inadequacies. We then remove the
.002 from the pool, and compute the product of the next 5 p's, which is .315E-10. We can
compare this to the table using *t* = 29, since one of the 30 has been
removed. The table
shows this result to be significant just beyond the .0017 level, so at least one of these 29
effects must be real. Since this was so even after the removal of the one best result, we
conclude that at least two of the studied effects are either real or contaminated by
methodological inadequacies. We then also remove the .003
from the pool, and compute the product of the next 5 *p*'s, which is
.525E-9. Using *t* = 28, the table shows this result to be significant just
beyond the .01 level. Since this was so even
after the removal of the two best results, we conclude that at least three of the studied effects
are either real or contaminated by methodological inadequacies. When we delete the .005
and add the .08, the resulting product is .84E-8. Using *t* = 27 we find
this value is just
significant at the .05 level. Thus our final conclusion is that at least four of the top 8 effects
are either real or contaminated. Therefore to dismiss the entire set of results, a critic would
have to argue not merely that there is an occasional methodological error, but that errors
pervaded the most positive results. And this is despite the fact that not a single result was
significant by the standard Bonferroni correction for multiple tests.
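The running products in this walk-through are easy to verify mechanically. The sketch below recomputes each *R*; the significance levels themselves still require Table 1, so only the products are shown:

```python
import math

# The first 8 of the 30 ranked p-values from the example.
ranked = [0.002, 0.003, 0.005, 0.007, 0.01, 0.03, 0.05, 0.08]

# Slide a window of k = 5 over the ranked p-values, dropping the best
# result each time, as in the sequential argument above.
for start in range(4):
    window = ranked[start:start + 5]
    r = math.prod(window)
    print(f"t = {30 - start}: R = {r:.3e} from {window}")
```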

Any number found this way can be added to the number found significant by the Bonferroni method. For instance, suppose two studies are found significant after Bonferroni corrections, and are removed. Suppose the subsequent analysis of the remaining studies indicates that at least three of the remaining effects must be real (without saying precisely which ones). Then the overall conclusion is that at least five of the effects must be real, of which two can be identified specifically.

Hedges, Larry V. and Olkin, Ingram. *Statistical Methods for Meta-Analysis*. New York: Academic Press, 1985.

Rosenthal, Robert. The "file drawer problem" and tolerance for null results. *Psychological Bulletin*, 1979, *86*, 638-641.