A new method for pooling independent significance levels is described. The new method has three major advantages over competing methods: it yields more specific conclusions, it is far less vulnerable to methodological criticisms of individual studies in the analysis, and it can readily be extended to handle the "file-drawer problem"--the problem of unpublished negative results.
P-poolers are quite different from the better-known meta-analytic methods for estimating average effect sizes. Those latter methods are widely used even when the literature contains several results that are highly significant individually, so that there is little doubt that the effect in question exists at least sometimes. P-poolers are more appropriately used merely to show that an effect exists at least sometimes--though later I discuss going somewhat beyond that basic conclusion.
P-poolers also differ from corrections for multiple tests, which always yield a corrected result less significant than the original result. The reason that p-poolers can yield more significant results is that the conclusion supported by the result is a vague conclusion, such as the conclusion that at least one of the k studied effects must be real, while corrections for multiple tests always yield specific conclusions of the form, "This result is real, even after correcting for multiple tests."
The best-known p-pooler, due to Fisher, consists of computing

C = 2 SUM ln(1/pi) = -2 SUM ln(pi)
(where SUM denotes summation) and compares C to a chi-square distribution with df = 2k, where df denotes degrees of freedom. This method is based on the additivity of independent chi-squares, and on the fact that when a null hypothesis is true, -2 ln(p) is distributed as chi-square with 2 df.
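As a sketch (not part of the original article), Fisher's method is easy to code with only the standard library. Because df = 2k is even, the chi-square upper-tail probability reduces to a finite sum, a standard identity, so no statistics package is needed. The function name is mine, not the article's.

```python
import math

def fisher_combine(pvals):
    """Pool independent p-values by Fisher's method.

    C = -2 * sum(ln p_i) is chi-square with df = 2k under the null.
    For even df the chi-square survival function has the closed form
    P = exp(-C/2) * sum_{j=0}^{k-1} (C/2)**j / j!.
    """
    k = len(pvals)
    c = -2.0 * sum(math.log(p) for p in pvals)
    half = c / 2.0
    term, total = 1.0, 1.0          # j = 0 term of the finite sum
    for j in range(1, k):
        term *= half / j            # running value of (C/2)**j / j!
        total += term
    return math.exp(-half) * total  # pooled significance level P
```

With a single p-value the function simply returns it unchanged, as an exact method must.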
Another p-pooler by Stouffer consists of using a standard normal table to convert each pi to a standard normal z, then computing
Z = SUM(z)/sqrt(k) = sqrt(k) mean(z)
then using a standard normal table to find the pooled significance level P associated with Z. This method works because each z has a standard normal distribution, so SUM(z) has a normal distribution with mean 0 and variance k.
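The Stouffer method can be sketched the same way (again, the code and its name are mine, assuming one-tailed p-values as in the text):

```python
import math
from statistics import NormalDist

def stouffer_combine(pvals):
    """Pool independent one-tailed p-values by the Stouffer method."""
    nd = NormalDist()                          # standard normal
    zs = [nd.inv_cdf(1.0 - p) for p in pvals]  # z whose upper-tail area is p
    big_z = sum(zs) / math.sqrt(len(zs))       # Z = SUM(z)/sqrt(k)
    return 1.0 - nd.cdf(big_z)                 # pooled one-tailed P
```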
Both these methods are exact in the sense that if the values of pi entering them are distributed uniformly on the interval (0, 1) (in other words, if the individual significance tests are exactly valid), then under the pooled null hypothesis the pooled significance level P will have that same uniform distribution. The two methods nevertheless differ in power: the Stouffer method usually gives a more significant pooled P than the Fisher method when the individual values pi are fairly similar, while the Fisher method gives a more significant P when they are quite dissimilar.
First, as mentioned above, they yield only vague conclusions, of the form that "at least one" of the studied effects must exist. Methodological discussions of meta-analysis often talk about independent experiments testing "the same hypothesis", but that phrase ignores an important feature of the real world. In fact no two studies ever test exactly the same null hypothesis, even though the hypotheses may be similar. For instance, in psychology some studies use male subjects while others use female, some test subjects in natural surroundings while others use more artificial surroundings, and so on. It is typically not known how important these factors are. It is therefore very desirable to be as specific as possible about which studies are contributing to a significant pooled result.
A second limitation is rarely mentioned in methodological discussions, but arises frequently when p-poolers are actually used: any methodological criticism of even a single study in the analysis is sufficient to invalidate the pooled conclusion. If ten studies are pooled, critic A may mistrust study 3 while critic B mistrusts study 5 and critic C mistrusts studies 8 and 10--but all three critics agree that the pooled significance level is meaningless, because pooling assumes that all the studies in the analysis are trustworthy. In the social sciences it almost never happens that everyone is completely happy with all the studies in a meta-analysis, so nearly everyone except the meta-analyst agrees that the pooled result is meaningless.
A third limitation of standard p-poolers is that they have no good way to handle the "file drawer problem"--the problem that perhaps only the most significant results were published, and many nonsignificant results are hiding in file drawers. A method by Rosenthal and Rubin (Rosenthal, 1979) extends the Stouffer formula to handle a form of the file-drawer problem, but the method assumes that there is an equal tendency to publish the most significant results, regardless of whether they are positive (e.g., treatment-group mean above the control-group mean) or negative. If the tendency were instead to publish just the most positive results, the Rosenthal-Rubin method would be invalid.
The conclusion reached by a significant pooled result is that at least one of these k effects must be real. That conclusion is of course much more specific than the vague conclusion that at least one of all t effects must be real. And of course the conclusion is susceptible to methodological criticisms concerning the k studies whose p-values were actually used in the pooled test, but that conclusion is invulnerable to methodological criticisms concerning the other t - k experiments--typically the great majority of all the experiments originally examined.
This approach can easily handle the file-drawer problem simply by setting t larger than the total number of studies actually examined. For instance, if 20 studies were examined, then setting t = 30 allows for the possibility that 10 other studies in the same area were unknown to the meta-analyst. By scanning different sections of a table covering many values of t, the analyst can even discover the largest value of t that would still make the pooled result significant.
Each value of R is reported as a decimal value followed by an integer. For instance, the first value in the table, for t = 5, k = 2, alpha = .10, is reported as ".294 2". The integer is the number of zeros that should be inserted immediately after the decimal point. In this case the critical R is .00294. Equivalently, the sign-reversed integer equals the exponent when R is written in exponential notation; in this example R = .294E-2.
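A one-line helper illustrates this convention (the function is hypothetical, not part of the article or its tables):

```python
def decode_r(entry):
    """Decode a table entry like '.294 2': the trailing integer is the
    number of zeros to insert immediately after the decimal point."""
    frac, zeros = entry.split()
    return float("." + "0" * int(zeros) + frac.lstrip("."))
```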
These values are based on Monte Carlo studies with 2 million trials each. The analysis took approximately 46 hours using GAUSS 2.2 on a 486-33 DOS machine.
The pooled P corresponding to an observed R between two tabled critical values is found by linear interpolation:

P = LA + (UA - LA)((Observed R - LR)/(UR - LR))
For instance, suppose k = 3, t = 10, and R = .0000153. In the section of the table for t = 10 and k = 3, we find this R is straddled by R-values of .0000127 and .0000165, which correspond to alphas of .025 and .03 respectively (the larger critical R goes with the less stringent alpha). Thus we have LR = .0000127, UR = .0000165, LA = .025, and UA = .03. Provided all the R-values involved have the same number of leading zeros, the result is unchanged if their leading zeros are omitted. Thus the pooled P is
P = .025 + .005(.153 - .127)/(.165 - .127) = .0284.
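The interpolation is trivial to code; the sketch below (function name mine) reproduces the worked value:

```python
def interpolate_p(r_obs, lr, ur, la, ua):
    """Linear interpolation of the pooled P between two tabled critical R's.

    lr and ur are the critical R-values bracketing the observed product
    r_obs; la and ua are the significance levels they correspond to."""
    return la + (ua - la) * (r_obs - lr) / (ur - lr)
```

For the example above, interpolate_p(.0000153, .0000127, .0000165, .025, .03) gives .0284 after rounding.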
If you are in this situation, then one acceptable strategy is to first set k high and then lower it gradually until a nonsignificant result is found. Then the last significant result yields the most specific conclusion reachable from the data. This strategy does not violate proscriptions against multiple tests because no positive conclusion is drawn if the first test in the series is not significant.
A standard tenet of very large meta-analyses is that the meta-analyst cannot, and perhaps should not, attempt to screen studies for methodological adequacy. That approach is totally inappropriate for p-poolers, which in no sense average the studies examined, and thus cannot count on the notion that conservative biases in some studies will cancel out liberal biases in others. Rather, a liberal bias in even a few studies can invalidate the conclusions of a p-pooler. Of course, even if studies are screened somewhat on methodological grounds, there will still be criticisms concerning individual studies. As already explained, the present method is designed to handle that fact.
Even aside from this last point, I believe that most readers will be far more convinced by a meta-analysis focusing on 30 or fewer of the best studies in a field than by an analysis pooling the results of hundreds or thousands of weak studies. The reaction to the latter kind of analysis is typically "garbage in, garbage out".
I studied k-values only up to 5 for two reasons. First, for reasons already explained, I believe a conclusion is less vulnerable to criticism on methodological grounds if k is only a small fraction of t, and I studied t-values only up to 30. Second, the individual investigators believed that their studies had a good chance of being significant even individually. Thus to say that one needs to pool over 5 such studies in order to reach a positive conclusion is to say the original investigators were wrong by an order of magnitude. If that is the case, then perhaps studies in the area should be redesigned.
To illustrate both the method and the type of conclusion that can be drawn, suppose that the first 8 of 30 ranked p's are .002, .003, .005, .007, .01, .03, .05, and .08. Even the first of these is not significant at the .05 level after a Bonferroni correction, since .002 x 30 = .06. But the product of the first 5 p's is .21E-11, which Table 1 shows to be significant beyond the .0003 level. Thus we have the vague conclusion that at least one of the studied effects must be real. Or to be more precise, the conclusion is that at least one of the studied effects is either real or contaminated by methodological inadequacies.

We then remove the .002 from the pool, and compute the product of the next 5 p's, which is .315E-10. We can compare this to the table using t = 29, since one of the 30 has been removed. The table shows this result to be significant just beyond the .0017 level, so at least one of these 29 effects must be real. Since this was so even after the removal of the one best result, we conclude that at least two of the studied effects are either real or contaminated by methodological inadequacies.

We then also remove the .003 from the pool, and compute the product of the next 5 p's, which is .525E-9. Using t = 28, the table shows this result to be significant just beyond the .01 level. Since this was so even after the removal of the two best results, we conclude that at least three of the studied effects are either real or contaminated by methodological inadequacies.

When we delete the .005 and add the .08, the resulting product is .84E-8. Using t = 27 we find this value is just significant at the .05 level. Thus our final conclusion is that at least four of the top 8 effects are either real or contaminated. Therefore to dismiss the entire set of results, a critic would have to argue not merely that there is an occasional methodological error, but that errors pervaded the most positive results.
And this is despite the fact that not a single result was significant by the standard Bonferroni correction for multiple tests.
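The running products in this example are easy to verify in a few lines (the critical levels themselves come from Table 1, which is not reproduced here):

```python
# The 8 smallest of the 30 ranked p-values from the worked example
ps = [.002, .003, .005, .007, .01, .03, .05, .08]

def product(xs):
    out = 1.0
    for x in xs:
        out *= x
    return out

# Drop the best remaining p each round and take the product of the next 5
rounds = [product(ps[i:i + 5]) for i in range(4)]
# rounds matches the text: .21E-11, .315E-10, .525E-9, .84E-8
```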
Any number found this way can be added to the number found significant by the Bonferroni method. For instance, suppose two studies are found significant after Bonferroni corrections, and are removed. Suppose the subsequent analysis of the remaining studies indicates that at least three of the remaining effects must be real (without saying precisely which ones). Then the overall conclusion is that five of the effects must be real, of which two can be identified specifically.
Hedges, Larry V. and Olkin, Ingram. Statistical Methods for Meta-Analysis. New York: Academic Press, 1985.
Rosenthal, Robert. The "file drawer problem" and tolerance for null results. Psychological Bulletin, 1979, 86, 638-641.