A Meta-Analytic "p-pooler" with Three Advantages

Copyright © Richard B. Darlington. All rights reserved.


A new method for pooling independent significance levels is described. The new method has three major advantages over competing methods: it yields more specific conclusions, it is far less vulnerable to methodological criticisms of individual studies in the analysis, and it can readily be extended to handle the "file-drawer problem"--the problem of unpublished negative results.

What's a p-pooler?

I use the term p-pooler to describe any method for computing a "pooled significance level" for several independent experiments. These methods are used primarily to gain statistical power; a pooled p may be significant even when none of the individual results are.

P-poolers are quite different from the better-known meta-analytic methods for estimating average effect sizes. These latter methods are widely used even when the literature contains several results that are highly significant individually, so there is little doubt that the effect in question exists sometimes. P-poolers are more appropriately used to show merely that an effect exists at least sometimes--though later I discuss going somewhat beyond that basic conclusion.

P-poolers also differ from corrections for multiple tests, which always yield a corrected result less significant than the original result. The reason that p-poolers can yield more significant results is that the conclusion supported by the result is a vague conclusion, such as the conclusion that at least one of the k studied effects must be real, while corrections for multiple tests always yield specific conclusions of the form, "This result is real, even after correcting for multiple tests."

Two well-known p-poolers

Several p-poolers are known, and are described in sources such as Chapter 3 of Hedges and Olkin (1985). I briefly mention two of the best known. One is the Fisher chi-square method, in which the analyst computes

C = 2 SUM ln(1/p_i) = -2 SUM ln(p_i)

(where SUM denotes summation) and compares C to a chi-square distribution with df = 2k, where df denotes degrees of freedom. This method is based on the additivity of independent chi-squares, and on the fact that when a null hypothesis is true, -2 ln(p) is distributed as chi-square with 2 df.
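As an illustration, the Fisher method can be computed in a few lines. The sketch below is in plain Python; it exploits the fact that for even df = 2k the upper-tail chi-square probability has a closed form, and the three p-values are hypothetical.

```python
# Illustrative sketch of the Fisher chi-square p-pooler in plain Python.
# For even df = 2k, the upper-tail chi-square probability has the closed
# form exp(-C/2) * SUM_{j<k} (C/2)^j / j!, so no tables are needed.
# The three p-values below are hypothetical.
import math

def fisher_pool(pvals):
    k = len(pvals)
    C = -2.0 * sum(math.log(p) for p in pvals)  # C = -2 SUM ln(p_i)
    half = C / 2.0
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(k))

pooled = fisher_pool([0.08, 0.12, 0.20])  # about .05, smaller than any single p
```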

Another p-pooler, due to Stouffer, consists of using a standard normal table to convert each p_i to a standard normal z, then computing

Z = SUM(z)/sqrt(k) = sqrt(k) mean(z)

then using a standard normal table to find the pooled significance level P associated with Z. This method works because each z has a standard normal distribution, so SUM(z) has a normal distribution with mean 0 and variance k.
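The Stouffer computation can be sketched similarly. The version below uses the standard library's NormalDist in place of a normal table; the p-values are hypothetical one-sided significance levels.

```python
# Illustrative sketch of the Stouffer p-pooler, with the standard library's
# NormalDist standing in for a normal table.  The p-values are hypothetical
# one-sided significance levels.
import math
from statistics import NormalDist

def stouffer_pool(pvals):
    nd = NormalDist()                          # standard normal distribution
    zs = [nd.inv_cdf(1.0 - p) for p in pvals]  # z with upper-tail area p
    Z = sum(zs) / math.sqrt(len(pvals))        # Z = SUM(z)/sqrt(k)
    return 1.0 - nd.cdf(Z)                     # pooled P associated with Z

pooled = stouffer_pool([0.08, 0.12, 0.20])
```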

Both these methods are exact in the sense that if the values p_i entering them are distributed uniformly on the interval (0, 1)--in other words, if the individual significance tests are exactly valid--then the pooled significance level P will have that same uniform distribution. However, the two methods differ in sensitivity: the Stouffer method usually gives a more significant pooled P than the Fisher method when the individual values p_i are fairly similar, while the Fisher method gives a more significant P when they are quite dissimilar.

Three limitations of standard p-poolers

Standard p-poolers, including the Fisher and Stouffer methods, suffer from three major limitations.

First, as mentioned above, they yield only vague conclusions, of the form that "at least one" of the studied effects must exist. Methodological discussions of meta-analysis often talk about independent experiments testing "the same hypothesis", but that phrase ignores an important feature of the real world. In fact no two studies ever test exactly the same null hypothesis, even though the hypotheses may be similar. For instance, in psychology some studies use male subjects while others use female, some test subjects in natural surroundings while others use more artificial surroundings, and so on. It is typically not known how important these factors are. It is therefore very desirable to be able to be as specific as possible about which studies are contributing to a significant pooled result.

A second limitation is rarely mentioned in methodological discussions, but arises frequently when p-poolers are actually used: any methodological criticism of even a single study in the analysis is sufficient to invalidate the pooled conclusion. If ten studies are pooled, critic A may mistrust study 3 while critic B mistrusts study 5 and critic C mistrusts studies 8 and 10--but all three critics agree that the pooled significance level is meaningless, because pooling assumes that all the studies in the analysis are trustworthy. In the social sciences it almost never happens that everyone is completely happy with all the studies in a meta-analysis, so nearly everyone except the meta-analyst agrees that the pooled result is meaningless.

A third limitation of standard p-poolers is that they have no good way to handle the "file drawer problem"--the problem that perhaps only the most significant results were published, and many nonsignificant results are hiding in file drawers. A method by Rosenthal and Rubin (Rosenthal, 1979) extends the Stouffer formula to handle a form of the file-drawer problem, but the method assumes that there is an equal tendency to publish the most significant results, regardless of whether they are positive (e.g., treatment-group mean above the control-group mean) or negative. If the tendency is instead to publish only the most positive results, the Rosenthal-Rubin method is invalid.

An approach that avoids these difficulties

All these limitations are avoided by any approach that uses only the several most significant (numerically smallest) p's, together with a value t that is the number of tests from which these were selected. Let k denote the number of p's actually used in the analysis. Typically k will be only a small fraction of t. In a typical problem k might be 4 while t is 20.

The conclusion supported by a significant pooled result is that at least one of these k effects must be real. That conclusion is of course much more specific than the vague conclusion that at least one of all t effects must be real. The conclusion remains susceptible to methodological criticisms concerning the k studies whose p-values were actually used in the pooled test, but it is invulnerable to criticisms concerning the other t - k experiments--typically the great majority of all the experiments originally examined.

This approach can easily handle the file-drawer problem simply by setting t larger than the total number of studies actually examined. For instance, if 20 studies were examined, then setting t = 30 allows for the possibility that 10 other studies in the same area were unknown to the meta-analyst. By scanning different sections of a table covering many values of t, the analyst can even discover the largest value of t that would still make the pooled result significant.

A table based on R(k,t), the product of the k most significant p's

Define R(k,t) as the product of the k most significant p-values out of t values studied (or assumed to exist, as mentioned in the previous paragraph). A table gives critical values of R for various values of k and t, covering 5 <= t <= 30 and 2 <= k <= 5. The table does not include k = 1 because the familiar Bonferroni method is available for that case. It gives critical values of R for the 53 values of the pooled significance level alpha shown in the left column.

Each value of R is reported as a decimal value followed by an integer. For instance, the first value in the table, for t = 5, k = 2, alpha = .10, is reported as ".294 2". The integer is the number of zeros that should be inserted immediately after the decimal point. In this case the critical R is .00294. Equivalently, the sign-reversed integer equals the exponent when R is written in exponential notation; in this example R = .294E-2.

These values are based on Monte Carlo studies with 2 million trials each. The analysis took approximately 46 hours using GAUSS 2.2 on a 486-33 DOS machine.
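A simulation of this kind can be sketched as follows. This is an illustrative reconstruction in Python, not the original GAUSS program, and it uses far fewer trials than the 2 million reported above, so its estimates are correspondingly rougher.

```python
# Monte Carlo sketch of a critical value of R(k, t): the lower alpha-quantile,
# under the null, of the product of the k smallest of t uniform p-values.
# Illustrative reconstruction only; the paper's table used 2 million trials.
import math
import random

def critical_R(k, t, alpha, trials=200_000, seed=1):
    rng = random.Random(seed)
    sims = []
    for _ in range(trials):
        smallest = sorted(rng.random() for _ in range(t))[:k]  # k smallest of t uniforms
        sims.append(math.prod(smallest))
    sims.sort()
    return sims[int(alpha * trials)]  # empirical lower alpha-quantile

r = critical_R(k=2, t=5, alpha=0.10)  # the table reports .00294 for this cell
```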


To find a pooled significance level P more accurately, one can perform a simple linear interpolation between two tabled values of R, which we'll call UR and LR (for upper R and lower R respectively). The corresponding tabled values of alpha are UA and LA respectively. Then interpolation uses

P = LA + (UA - LA)((Observed R - LR)/(UR - LR))

For instance, suppose k = 3, t = 10, and R = .0000153. In the section of the table for t = 10 and k = 3, we find this R is straddled by R-values of .0000127 and .0000165, which correspond to alphas of .025 and .03 respectively. Thus we have LR = .0000127, UR = .0000165, LA = .025, and UA = .03. Provided all the R-values involved have the same number of leading zeros, the result is unchanged if their leading zeros are omitted. Thus the pooled P is

P = .025 + .005(.153 - .127)/(.165 - .127) = .0284.
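The interpolation can be written as a short function. The sketch below simply reproduces the worked example; the function name is illustrative.

```python
# Linear interpolation of the pooled P between two tabled critical values,
# reproducing the worked example (k = 3, t = 10, observed R = .0000153).
def interpolate_P(R, LR, UR, LA, UA):
    return LA + (UA - LA) * (R - LR) / (UR - LR)

P = interpolate_P(R=0.0000153, LR=0.0000127, UR=0.0000165, LA=0.025, UA=0.03)
# P is approximately .0284, matching the text
```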

Choosing k

In choosing k, a meta-analyst faces a tradeoff between power and specificity of conclusions. The smaller you set k, the more specific your conclusions can be, because a significant result allows you to conclude that at least one of the k effects used in computing R is nonzero. On the other hand, you may gain statistical power by increasing k, especially if the t experiments were all very similar to each other.

If you are in this situation, then one acceptable strategy is to first set k high and then lower it gradually until a nonsignificant result is found. Then the last significant result yields the most specific conclusion reachable from the data. This strategy does not violate proscriptions against multiple tests because no positive conclusion is drawn if the first test in the series is not significant.

Why are k and t assumed to be so small?

Many published meta-analyses involve collections of hundreds or even thousands of published studies. Such collections are usually so diverse in their populations, methods, and hypotheses, that specialists in the area often don't know what to make of positive results. If an analyst pools 800 studies, and 5 of them used Eskimos and 4 used Pygmies, then one doesn't know whether a positive result means the studied treatment works just for Eskimos, or just for Pygmies, or for everyone. I believe meta-analyses are more informative if studies are grouped somewhat according to the populations, methods, and precise hypotheses studied.

A standard tenet of very large meta-analyses is that the meta-analyst cannot, and perhaps should not, attempt to screen studies for methodological adequacy. That approach is totally inappropriate for p-poolers, which in no sense average the studies examined, and thus cannot count on the notion that conservative biases in some studies will cancel out liberal biases in others. Rather, a liberal bias in even a few studies can invalidate the conclusions of a p-pooler. Of course, even if studies are screened somewhat on methodological grounds, there will still be criticisms concerning individual studies. As already explained, the present method is designed to handle that fact.

Even aside from this last point, I believe that most readers will be far more convinced by a meta-analysis focusing on 30 or fewer of the best studies in a field than by an analysis pooling the results of hundreds or thousands of weak studies. The reaction to the latter kind of analysis is typically "garbage in, garbage out".

I studied k-values only up to 5 for two reasons. First, for reasons already explained, I believe a conclusion is less vulnerable to criticism on methodological grounds if k is only a small fraction of t, and I studied t-values only up to 30. Second, the original investigators presumably believed that their studies had a good chance of being significant even individually. Thus to say that one needs to pool over 5 such studies in order to reach a positive conclusion is to say the original investigators were wrong by an order of magnitude. If that's the case, then perhaps studies in the area should be redesigned.

This method should follow the Bonferroni

I suggest that the present method be preceded by Bonferroni tests on the most significant individual results, since that method yields more specific conclusions. One can even do Bonferroni layering, in which the Bonferroni correction factor is lowered by 1 for each successive test. For instance, if the most significant p's out of 15 are .001, .003, and .006, then a simple Bonferroni correction on the first one yields a corrected p of .001 x 15 = .015. Then the next one is the most significant of the remaining 14, so we have .003 x 14 = .042. Then the next one is the most significant of the remaining 13, so we have .006 x 13 = .078. This result is not significant by ordinary standards, so we stop there, and conclude that the first two results are significant even after correcting for the fact that they were selected post hoc as the most significant of 15 results. Assuming those studies were methodologically satisfactory, the specific effects studied by those two studies can be accepted as real. I suggest that any studies significant by such criteria be removed from the pool, and further analyses done on the remaining studies.
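Bonferroni layering is easy to mechanize. The sketch below reproduces the arithmetic of the example: the j-th most significant of t p-values is corrected by a factor of t - j + 1, and testing stops at the first nonsignificant corrected value. The function name is illustrative.

```python
# Bonferroni layering: the j-th most significant of t p-values is corrected
# by a factor of t - j + 1; testing stops at the first nonsignificant
# corrected value.  The function name is illustrative.
def bonferroni_layers(pvals, t, alpha=0.05):
    significant = []
    for j, p in enumerate(sorted(pvals), start=1):
        if p * (t - j + 1) > alpha:   # corrected p exceeds alpha: stop here
            break
        significant.append(p)
    return significant

sig = bonferroni_layers([0.001, 0.003, 0.006], t=15)
# .001 x 15 = .015 and .003 x 14 = .042 pass, .006 x 13 = .078 does not,
# so sig is [0.001, 0.003]
```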

Other routes to more specific conclusions

Whether or not some studies were removed from the pool by the Bonferroni method as just suggested, one can analyze the remaining studies, trying to show that a pooled result is significant even after the most significant remaining results are removed from the pool. Even if no results were significant individually by the Bonferroni criterion, such an analysis can yield the conclusion that at least two or more of the studied effects must be real. Such conclusions are not as specific as those reached by the Bonferroni method, but they are more specific than conclusions of the form "at least one". And such conclusions also allow for the possibility that one or two of the pooled studies contain methodological inadequacies.

To illustrate both the method and the type of conclusion that can be drawn, suppose that the first 8 of 30 ranked p's are .002, .003, .005, .007, .01, .03, .05, and .08. Even the first of these is not significant at the .05 level after a Bonferroni correction, since .002 x 30 = .06. But the product of the first 5 p's is .21E-11, which Table 1 shows to be significant beyond the .0003 level. Thus we have the vague conclusion that at least one of the studied effects must be real. Or to be more precise, the conclusion is that at least one of the studied effects is either real or contaminated by methodological inadequacies.

We then remove the .002 from the pool, and compute the product of the next 5 p's, which is .315E-10. We can compare this to the table using t = 29, since one of the 30 has been removed. The table shows this result to be significant just beyond the .0017 level, so at least one of these 29 effects must be real. Since this was so even after the removal of the one best result, we conclude that at least two of the studied effects are either real or contaminated by methodological inadequacies.

We then also remove the .003 from the pool, and compute the product of the next 5 p's, which is .525E-9. Using t = 28, the table shows this result to be significant just beyond the .01 level. Since this was so even after the removal of the two best results, we conclude that at least three of the studied effects are either real or contaminated by methodological inadequacies.

When we delete the .005 and add the .08, the resulting product is .84E-8. Using t = 27, we find this value is just significant at the .05 level. Thus our final conclusion is that at least four of the top 8 effects are either real or contaminated. Therefore to dismiss the entire set of results, a critic would have to argue not merely that there is an occasional methodological error, but that errors pervaded the most positive results. And this is despite the fact that not a single result was significant by the standard Bonferroni correction for multiple tests.
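The running products in this example are easy to verify. The Python sketch below reproduces the arithmetic of the successive removals; the table lookups themselves are not coded.

```python
# Running products R from the worked example: after removing the `removed`
# most significant p's from the pool, R is the product of the next five.
# The p-values are those given in the text; table lookups are not coded here.
from math import prod

ps = [0.002, 0.003, 0.005, 0.007, 0.01, 0.03, 0.05, 0.08]
for removed in range(4):
    R = prod(ps[removed:removed + 5])  # product of the 5 best remaining p's
    t = 30 - removed                   # the pool shrinks by one at each step
    print(f"t = {t}: R = {R:.3e}")     # 2.100e-12, 3.150e-11, 5.250e-10, 8.400e-09
```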

Any number found this way can be added to the number found significant by the Bonferroni method. For instance, suppose two studies are found significant after Bonferroni corrections, and are removed. Suppose the subsequent analysis of the remaining studies indicates that at least three of the remaining effects must be real (without saying precisely which ones). Then the overall conclusion is that at least five of the effects must be real, of which two can be identified specifically.


Hedges, Larry V. and Olkin, Ingram. Statistical Methods for Meta-Analysis. New York: Academic Press, 1985.

Rosenthal, Robert. The "file drawer problem" and tolerance for null results. Psychological Bulletin, 1979, 86, 638-641.
