Computation-Based Estimates of True Shrunken Multiple Correlations

Richard B. Darlington
Copyright © Richard B. Darlington. All rights reserved.

Virtually every commercial computer program for multiple regression includes the "Adjusted R²", whose square root estimates the population multiple correlation, which I'll denote RTP for "True Population coRrelation." Less well known is what I call the "shrunken" correlation, which I'll denote here as RTS for "True Shrunken coRrelation." Both RTP and RTS are measured in the population, but the regression weights used in RTS are the weights found in one particular sample regression, while the weights used in RTP are the unknown true population weights. Thus RTP is always at least as high as RTS, because the true population weights are by definition the weights yielding the highest possible correlation in the population. Especially when the number of predictor variables P is large and the sample size N is small, RTS may fall far below RTP. For a given population and set of variables there is only one value of RTP, but there are infinitely many values of RTS, one for every sample drawn from the population. However, once you have drawn one particular sample and fitted a multiple regression in that sample, a unique but unknown value of RTS is defined by that regression. Estimation of RTS, not RTP, is what is needed in most practical prediction problems--problems in which you ask, "How good is this predictive composite that I've derived, when applied to the whole population?"

At least four authors--Browne (1975), Burket (1964), Claudy (1978), and Rozeboom (1978)--have published formulas for estimating RTS from an observed multiple correlation R, together with N and P, assuming multivariate normality. The formulas yield quite different estimates of RTS from the same values of R, N, and P. Here I offer still another way to estimate RTS, in which the user employs a large table posted at my Web site. Using the grid that appears later in this document, the user can download just the portion of the table for a particular combination of N and P. Unlike any of the previous formulas, the table also includes upper and lower 90% "confidence limits", derived in a manner to be explained shortly.

A notable difference between the new method and the four older methods concerns a plot of estimated RTS against R for fixed values of N and P. The figure below shows such a plot for N = 50, P = 10. For the moment ignore the two confidence-limit lines labeled "Upper CL" and "Lower CL". The other five lines show the estimates of true validity (RTS) computed by the five methods, as a function of the observed R. The four older methods all drop abruptly to the horizontal axis, while the new method, labeled "Sample-based estimate", approaches the horizontal axis gradually. Intuitively, that gradual approach seems more reasonable than an abrupt drop to the axis.

In summary, the three major advantages of the new method are:

1. Unlike the formula-based methods, it provides lower and upper 90% confidence limits along with each estimate.
2. Its estimates of RTS decline gradually toward zero as R declines, rather than dropping abruptly to the horizontal axis.
3. Its estimates are derived directly from thousands of simulated samples rather than from approximate formulas, and the table reports the number of samples behind each estimate.

Below are two successive lines from the portion of the table for P = 10, N = 50. You can read in the final column that the table's estimates of RTS for P = 10, N = 50, R = .60 are based on RTS values found in 4850 samples. In the other columns you read that the mean of those 4850 RTS values was .324, the median was .329, the 10th percentile value was .123, and the 90th percentile value was .507. The percentile values are the ones that serve as a kind of confidence limit on the estimate of RTS.

  R      mean   median   10th    90th   # samples
 0.58    0.295   0.297   0.101   0.479     4732
 0.60    0.324   0.329   0.123   0.507     4850
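If an observed R falls between two tabled values, simple linear interpolation between the neighboring rows gives a serviceable estimate. The short Python fragment below is only an illustration of that arithmetic (the two rows above are hard-coded), not part of the table-construction program.

# Linear interpolation between two tabled rows (P = 10, N = 50).
# Each row is (R, mean, median, 10th percentile, 90th percentile).
row_lo = (0.58, 0.295, 0.297, 0.101, 0.479)
row_hi = (0.60, 0.324, 0.329, 0.123, 0.507)

def interpolate(r_obs):
    """Estimate the RTS statistics for an observed R between the two rows."""
    frac = (r_obs - row_lo[0]) / (row_hi[0] - row_lo[0])
    return [lo + frac * (hi - lo) for lo, hi in zip(row_lo[1:], row_hi[1:])]

print(interpolate(0.59))   # roughly [0.3095, 0.313, 0.112, 0.493]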

The reason that different estimates are based on different numbers of samples is explained in later sections. The maximum number of samples for any one estimate is 5000, and most estimates of RTS, like the ones shown here, are based on somewhere near that number.

As shown above, values appear in the table for increments of .02 in R. Thus the portion of the table for each N/P combination covers only about one page of normal type. There are altogether 360 portions of the table; you can select the portion you want to see by clicking on the X in any of the cells in the grid below.

Choose an N/P combination from this grid
N / P    2   3   4   5   6   7   8   9  10  11  12  14  16  18  20  24
  30     X   X   X   X   -   -   -   -   -   -   -   -   -   -   -   -
  35     X   X   X   X   X   X   X   X   X   -   -   -   -   -   -   -
  40     X   X   X   X   X   X   X   X   X   X   X   X   -   -   -   -
  45     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   -
  50     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  55     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  60     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  65     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  70     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  75     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  80     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  85     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  90     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
  95     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 100     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 110     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 120     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 130     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 140     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 150     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 175     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 200     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 225     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X
 250     X   X   X   X   X   X   X   X   X   X   X   X   X   X   X   X

(X = portion available; - = not tabled because N - P < 25)

The Irrelevance of Collinearity

At first it would seem that all the approaches mentioned here must be oversimplified, because they all ignore collinearity. That is, it seems that the expected difference between R and RTS must increase with collinearity. After all, collinearity increases the standard error of each regression slope, and the difference between RTS and RTP is caused entirely by errors in the regression slopes.

Every point in that argument is correct, but it omits an important one. Although collinearity increases the standard errors of individual regression slopes, it also strengthens the correlations among the errors in those slopes. To say that two slope estimates b1 and b2 correlate negatively is to say that if you drew 1000 samples from the same population and computed b1 and b2 in each sample, you would find a negative correlation between the 1000 values of b1 and the 1000 values of b2. If the two predictors involved correlate highly (or more precisely, if they correlate highly when the other predictor variables are controlled), then the correlation between b1 and b2 will in fact be negative. That negative correlation means that errors in the two regression slopes tend to cancel each other out: if predictor X1 is overweighted and X2 is underweighted, and X1 and X2 correlate highly with each other, prediction will be almost as accurate as if both variables had received their true population weights. This canceling of errors does in fact occur under collinearity, with the result that RTS falls no further below RTP and R under collinearity than in its absence. That is why collinearity is ignored by all the other workers cited above, and is ignored here too.

How the Table Was Constructed

In the next three sections I take advantage of the loose space constraints on the Web to present explanations of the table's derivation at three levels of technical difficulty. The current section is designed to be as nontechnical as possible, and much of its information is repeated in the following, more technical section; more technically oriented readers may wish to skip ahead. However, only the last of the three sections shows all the formulas used.

Since the degree of collinearity is irrelevant, without loss of generality we can assume the predictor variables to be uncorrelated in the population. Without loss of generality we can also assume all the predictors to have unit variance. If we then choose some arbitrary set of true regression weights B, the true variance of the predictable component of Y equals the sum of squared entries in B; the matrix expression for this value is B'B. If we let SDE denote the true residual standard deviation (a number we are free to choose), then the true squared multiple correlation equals B'B/(B'B + SDE²), the ratio of predictable variance to total variance.
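As a small hypothetical illustration, suppose P = 2 and we choose true weights B = (0.4, 0.3)'. Then B'B = 0.4² + 0.3² = 0.25, and if we also choose SDE² = 0.75, the true squared multiple correlation is 0.25/(0.25 + 0.75) = 0.25, so RTP = 0.5.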

If we use a random number generator to produce an N by P matrix of random normal numbers, we can think of those numbers as the scores of N cases on P predictor variables, drawn from the aforementioned population. We can also use a random normal number generator to generate a column of N true standardized residuals; multiplying them by SDE gives the true unstandardized residuals. From these random numbers one can work out both the multiple correlation R observed in the sample and the true multiple correlation RTS of the regression equation derived in that sample.

Now suppose for instance we wanted to study RTS values for an observed multiple correlation of .6. For a given set of random numbers as just described, the observed multiple correlation R and the true multiple correlation RTS both change with SDE. One can therefore adjust SDE to make R equal .6, and then compute what RTS would be for that value of SDE. One can repeat this for many samples. The mean or median of all the resulting RTS values then provides an estimate of the RTS associated with R = .6, for those particular values of N and P.

That is what was done to construct the table here. I used 24 different values of N and 16 values of P. When N/P combinations for which N - P < 25 were eliminated, 360 N/P combinations remained. R was studied in increments of .02, and 5000 samples were generated for each combination of N, P, and R.

Computer time required was massive (computing took about a week on a 486DX4-100 computer), but was still only a fraction of what the previous paragraph would imply, because a given set of random numbers could be used for many different values of R. For instance, for a given sample one can find the value of SDE that yields R = .6, and for the same set of random numbers one can find the values of SDE that yield R = .62, .64, and so on.

However, for any given sample there was a nonzero lower limit on R. Even when one assumes an infinite value of SDE (which corresponds to RTP = 0), R in any given sample will turn out to equal some positive value--a value that might be fairly high. Therefore the estimates for very high values of R were always based on 5000 samples, but as R got lower the number of usable samples fell. When R was so low that the number of usable samples fell below 50, no estimate was printed. The number of usable samples for each estimate of RTS appears in the table; most of these numbers are at or near 5000.

A More Complete Explanation of the Method

Although this section is more technical than the previous one, I have still not abandoned all effort to reach a wide readership. This section contains some matrix algebra, but readers unfamiliar with matrix algebra should still be able to follow large parts of this section.

Suppose you produce an N x P matrix X of random standardized normal numbers, and an N x 1 column vector es of standardized deviations from a true regression line. That is, es is standardized in the population, which is to say that es is simply a set of random standardized normal numbers. Let SDE be the population standard deviation of the unstandardized residuals, and let B denote the vector of true regression slopes. Then values of Y could be generated for simulations by the matrix equation

Y = X*B + es*SDE

This equation produces Y-values with a population mean of zero, though in the following we will pretend we don't know that, and derive sample regressions with the additive constant not set to zero.

Without loss of generality we can assume the predictor variables are of unit variance and uncorrelated in the population. Then the Y-variance due to X is B'B, the sum of squared entries in B, and the error Y-variance is SDE². Therefore the true multiple correlation RTP is

RTP = sqrt(B'B/(B'B + SDE²)).

If you generate X and es for a single sample, and you have specified some true weight vector B and a value of SDE, then you can work out the observed weight vector b and the observed multiple correlation R. Further, RTS is determined entirely by b, so you can find RTS.

The present method is based on the fact that if one uses a random number generator to generate a matrix X and vector es, then one can associate those values of X and es with many different values of SDE, ranging from 0 up toward infinity. For each value of SDE one can work out the associated values of R, RTP, and RTS. Therefore one can iteratively adjust SDE to make the associated R equal some value of R that was (or might be) observed in the real world. The value of RTS associated with that value of R is then an estimate of RTS for the observed R, though of course the estimate is based on just one sample.

But one can repeat this process as many times as desired. In each repetition one uses a random normal number generator to produce a matrix X and a vector es, then finds for that combination of X and es the value of SDE that makes R equal the value observed in the real world, then calculates the associated value of RTS. If one does that 5000 times, then the mean or median of the 5000 RTS values is an estimate of RTS for the real-world R, and the 500th and 4501st ranked values of RTS provide lower and upper 90% "confidence limits" on RTS. Each of these is a one-sided confidence limit, and a user would typically be most interested in the lower limit; one has 90% confidence that the real-world RTS exceeds that limit.
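As a rough illustration of that loop, here is a minimal Python sketch, assuming numpy and scipy are available. It is my schematic reconstruction, not the original program: the fixed weight vector B, the helper names, the bracketing interval for the root-finder, and the use of np.percentile in place of the 500th and 4501st ranked values are all my own choices. The sketch simply skips samples whose lower limit on R lies above the target, and it ignores the two-root complication discussed in the later sections.

import numpy as np
from scipy.optimize import brentq

def estimate_rts(r_obs, n=50, p=10, reps=5000, seed=0):
    rng = np.random.default_rng(seed)
    B = np.full(p, 0.3)                  # an arbitrary true weight vector
    rts_values = []
    for _ in range(reps):
        X = rng.standard_normal((n, p))  # sample predictor scores
        es = rng.standard_normal(n)      # true standardized residuals
        Xc = np.column_stack([np.ones(n), X])   # add the regression constant

        def sample_r(sde):
            # observed multiple correlation when the residual SD is sde
            y = X @ B + sde * es
            coef = np.linalg.lstsq(Xc, y, rcond=None)[0]
            resid = y - Xc @ coef
            return np.sqrt(1.0 - resid @ resid / np.sum((y - y.mean()) ** 2))

        def sample_rts(sde):
            # true correlation of the sample equation, using the population
            # facts cov(Y, Yhat) = b'B, var(Yhat) = b'b, var(Y) = B'B + sde^2
            y = X @ B + sde * es
            b = np.linalg.lstsq(Xc, y, rcond=None)[0][1:]   # drop the constant
            return (b @ B) / np.sqrt((b @ b) * (B @ B + sde ** 2))

        if sample_r(1e3) > r_obs:        # this sample's floor on R is too high
            continue
        sde = brentq(lambda s: sample_r(s) - r_obs, 1e-6, 1e3)
        rts_values.append(sample_rts(sde))

    rts_values = np.array(rts_values)
    return (rts_values.mean(), np.median(rts_values),
            np.percentile(rts_values, [10, 90]), len(rts_values))

For N = 50, P = 10, and r_obs = .60, a run of this kind is intended to approximate the tabled row shown earlier, though simulation error and implementation details will move the exact values somewhat.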

I used this procedure to estimate RTS for many values of R. I used the same 5000 samples for many different supposedly observed values of R, changing R in decrements of .02 from 1.00 down as far as possible for each sample. R cannot actually go down to 0, because even when you assume RTP = 0 (which in current terms is done by letting SDE approach infinity), the observed R still does not reach 0. Each of the 5000 samples has its own lower limit. Thus a very low value of R (lower than you would typically find even if RTP were 0) might have a computable value of RTS for only, say, 1000 samples out of 5000, and the associated estimate of RTS will therefore be based on 1000 samples instead of 5000, while estimates of RTS from higher values of R are routinely based on all 5000 samples. The last column in the table shows the actual number of samples used in each estimate of RTS; most of those values are 5000.

To achieve RTP = 0 one would actually have to use an infinite value of SDE, and that of course is not possible. Therefore in the tables I actually used the SDE associated with RTP = .00001 wherever I really wanted RTP = 0.

Technical section

This section finally gives all the formulas used in constructing the table. Let B and b denote respectively the population and sample vectors of regression weights, let SDE denote the true residual standard deviation, let X denote an N x P matrix of random normal numbers with a unit vector added on the left, and let es denote an N x 1 column vector of random normal numbers. Then, as mentioned earlier, a sample vector Y could be generated from

Y = X*B + es*SDE

But b = inv(X'X)X'Y

so b = inv(X'X)X'XB + inv(X'X)X'es*SDE = B + inv(X'X)X'es*SDE

Define be = inv(X'X)X'es, so b = B + be*SDE

We have ^Y = X*b, where ^Y denotes the estimates of Y made from the sample regression. Recall the predictors are assumed to be independent in the population. Thus in the population the covariance between ^Y and Y is b'B, the total variance of ^Y is b'b, and the total variance of Y is B'B + SDE². Therefore

RTS = Cov(Y,^Y)/sqrt(var(Y)*var(^Y)) = b'B/sqrt((b'b)(B'B + SDE²))

= (B'B + B'be*SDE)/sqrt((B'B + SDE²)(B'B + 2*B'be*SDE + (be'be)*SDE²))

This equation expresses RTS as a function of the four scalars B'B, B'be, be'be, and SDE. The first three are determined by B (which in my case I simply made up) and by be, which is determined by X and es. So I could calculate these three scalars once for a given sample, and then, by choosing SDE, calculate RTS from the last equation.
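To make that concrete, here is a small Python fragment (the names are mine, not from the original program) that computes the three sample-dependent scalars once and then evaluates RTS for any trial SDE using the equation above. It assumes X already carries the unit column on the left and that B carries a leading zero for the true intercept, matching the definitions of this section.

import numpy as np

def sample_scalars(X, es, B):
    # be = inv(X'X) X' es, computed once per sample
    be = np.linalg.solve(X.T @ X, X.T @ es)
    return B @ B, B @ be, be @ be          # B'B, B'be, be'be

def rts_from_scalars(BB, Bbe, bebe, sde):
    # the last equation above, as a function of the four scalars
    num = BB + Bbe * sde
    den = np.sqrt((BB + sde ** 2) * (BB + 2.0 * Bbe * sde + bebe * sde ** 2))
    return num / den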

We also need an equation relating R to SDE. If we denote the residuals in the sample regression as e, then we can write

R² = 1 - sse/sstot

where sse is the sum of squared residuals and sstot is the corrected sum of squares for Y. We have already noted that

Y = X*B + es*SDE

We also have e = Y - ^Y = Y - X*b

Replacing b in this equation by the right side of the equation

b = B + inv(X'X)(X'es)*SDE,

and replacing Y by X*B + es*SDE,

we can express e as a function of B, X, es, and SDE.

We can also write MY, the mean of Y, as MY = mean(X*B) + SDE*mean(es).

Thus all the values needed to calculate sse, sstot, and then R² can be expressed as functions of B, X, es, and SDE. Once again, functions of B, X, and es can be calculated once, and then different values of SDE can be used to produce different values of R. Lengthy but straightforward matrix algebra shows that a computationally efficient procedure is to first compute the two vectors X*B and X'es, then define MXB as the mean of X*B and MES as the mean of es, then write

sstot = [(X*B)'(X*B)] + SDE²[es'es] + 2*SDE*[es'(X*B)] - N*(MXB + SDE*MES)²

sse = SDE²[es'es - (X'es)'inv(X'X)(X'es)]

All quantities in brackets, plus MXB and MES, are scalars unaffected by SDE, and thus need be computed only once for a given sample.

It follows from all this that one can choose some value of R, use these formulas to iteratively find the value of SDE associated with it for a given sample (that is, for given X and es), and then find the value of RTS associated with that value of SDE.
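The sketch below, again in Python and again only my reconstruction, puts these pieces together for one sample: it computes the SDE-free scalars once, expresses R as a function of SDE through the sstot and sse formulas above, and then uses a bracketing root-finder (scipy's brentq, an assumption on my part, since the original iteration scheme is not described) to solve for the SDE that reproduces a target R. Because R need not be a monotone function of SDE, a bracketing search can return either of two roots when two exist; the half-and-half choice described in the last section is not implemented here.

import numpy as np
from scipy.optimize import brentq

def sde_for_target_r(X, es, B, r_target):
    # X includes the unit column; B includes a leading zero intercept.
    n = X.shape[0]
    XB = X @ B
    Xes = X.T @ es
    # Bracketed scalars from the sstot and sse formulas; none depend on SDE.
    xbxb = XB @ XB
    eses = es @ es
    esxb = es @ XB
    mxb, mes = XB.mean(), es.mean()
    quad = eses - Xes @ np.linalg.solve(X.T @ X, Xes)

    def r_of_sde(sde):
        sstot = xbxb + sde**2 * eses + 2*sde*esxb - n * (mxb + sde*mes)**2
        sse = sde**2 * quad
        return np.sqrt(1.0 - sse / sstot)

    # R is 1 at SDE = 0 and falls toward this sample's floor as SDE grows;
    # the call below assumes r_target lies above that floor.
    return brentq(lambda s: r_of_sde(s) - r_target, 1e-8, 1e4)

The returned SDE can then be handed to rts_from_scalars above to obtain the RTS value contributed by this sample.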

More technical material: negative values of RTS

RTS can be negative, and indeed if RTP is very low and N is small then RTS will be negative in about half of all samples. Interestingly, though, RTS will never be negative if RTP = 0. The reason is as follows. Recall that RTS, like RTP, is a population value; it is not merely a correlation in some cross-validation sample. Therefore RTS can never exceed RTP, because no linear function of the predictors can correlate more highly with Y than the true linear function does. But if RTS is negative, its absolute value also can never exceed RTP, because if it did, reversing the signs of all the regression weights would produce a function whose correlation with Y exceeded RTP, and that can never happen. When RTP = 0, then, neither RTS nor -RTS can exceed 0, so RTS must equal 0. Therefore negative values of RTS occur frequently when RTP is near zero, but do not occur when RTP actually equals zero. That in turn means that when one plots RTS against R for a single sample by adjusting SDE (thereby effectively adjusting RTP), the curve may fall below 0 near its lower end, but always returns to exactly 0 at the very end of the curve.

An odd technical twist

The true multiple correlation RTP can be related to SDE by the fact that in the population the true explained variance is B'B and the true residual variance is SDE², so

RTP² = B'B/(B'B + SDE²).

Thus for a given sample one can not only plot RTS against R, as discussed earlier, but also plot both RTS and R against RTP.
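To see these curves for oneself, one can sweep RTP over a grid for a single sample, convert each RTP to its SDE through the relation just given, and record the resulting pair (R, RTS). The short Python sketch below does that; it is illustrative only, and here X excludes the unit column and B is just the slope vector, with the constant added only for the sample fit.

import numpy as np

def trace_curve(X, es, B, grid=np.linspace(1e-5, 0.999, 400)):
    n, p = X.shape
    Xc = np.column_stack([np.ones(n), X])      # constant added for the fit
    BB = B @ B
    points = []
    for rtp in grid:
        # invert RTP^2 = B'B/(B'B + SDE^2) to get the SDE for this RTP
        sde = np.sqrt(BB * (1.0 - rtp**2)) / rtp
        y = X @ B + sde * es
        coef = np.linalg.lstsq(Xc, y, rcond=None)[0]
        resid = y - Xc @ coef
        r = np.sqrt(1.0 - resid @ resid / np.sum((y - y.mean())**2))
        b = coef[1:]                           # sample slopes
        rts = (b @ B) / np.sqrt((b @ b) * (BB + sde**2))
        points.append((r, rts))
    return points        # plot rts against r to see the shapes described below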

I for one assumed all these plots would be monotonically increasing, but such is not the case. Both the plot of RTS against RTP and the plot of R against RTP can be nonmonotonic, with a minimum somewhere other than at the left end of the curve. Further, the two minima do not occur at the same value of RTP, so the plot of RTS against R can be surprisingly convoluted. The curve always starts at some point on the horizontal axis of the unit square in which RTS is plotted against R, and always ends at the upper right corner of that square. But it can leave its starting point in literally any direction, not just toward the upper right as one would expect. It may go down to the right, or up to the left, or even down to the left. And if it goes down to the left, it may circle around either above or below its starting point. But once it leaves its starting point, it always ultimately curves around and ends up at the upper right.

The figure immediately above shows some of the possibilities. Each of the 8 curves shown starts somewhere on the horizontal line True R = 0, and ends in the upper right corner where True R = Observed R = 1. Curves 1, 2, 3, and 4 are all fairly unsurprising, in that they go in moderately straight lines from their starting points to their ending point. But they differ in detail; curve 1 is very nearly straight, curve 2 is slightly concave (that is, with a positive second derivative), curve 3 is slightly convex (negative second derivative), and curve 4 is S-shaped. Curves 5-8 have more surprising shapes. Curves 5 and 6 are opposites in a sense; curve 5 heads toward the upper left from its starting point, while curve 6 heads toward the lower right. Curves 7 and 8 both go first toward the lower left from their starting points (this is not easily seen for curve 8), then both curve around toward the upper right--though 7 curves around above its starting point while 8 curves around below its starting point.

For purposes of constructing tables to estimate RTS from R, curves like 5, 7, and 8 (which first head left from their starting points) are more problematic than the others, since such curves can yield two different values of RTS for a single value of R. Which value should you use? Or should you use both, counting that curve twice? It matters, because roughly half of all curves are of this type--though, as mentioned above, it matters only for values of R so low that they could easily have arisen from zero values of RTP and RTS.

It does not seem reasonable to me to use two values from a single curve, because each curve represents a single sample that should not be counted more heavily than any other single sample. Therefore whenever a single curve yielded two estimates of RTS, I chose the upper estimate half the time and the lower estimate half the time. The programming required to do that was rather complex, and the details are beyond the scope of this discussion.

References

Browne, M. W. (1975). Predictive validity of a linear regression equation. British Journal of Mathematical and Statistical Psychology, 28, 79-87.

Burket, G. R. (1964). A study of reduced rank models for multiple prediction. Psychometric Monographs (No. 12).

Claudy, J. G. (1978). Multiple regression and validity estimation in one sample. Applied Psychological Measurement, 32, 311-322.

Rozeboom, W. W. (1978). The estimation of cross-validated multiple correlation: a clarification. Psychological Bulletin, 85, 1349-1351.