Virtually every commercial computer program for multiple regression includes the "Adjusted
R^{2}", whose square root estimates the population multiple correlation, which I'll
denote *RTP* for "True Population coRrelation." Less well known is what I call
the "shrunken" correlation and will denote here as *RTS* for "True Shrunken
coRrelation." Both *RTP* and *RTS* are measured in the population,
but
the regression weights used in *RTS* are the weights found in one particular
sample
regression, while the weights used in *RTP* are the unknown true population
weights. Thus *RTP* is always at least as high as *RTS*, because the
true population weights are by definition the weights yielding the highest possible correlation
in the population. Especially when the number of predictor variables *P* is large
and the sample size *N* is small, *RTS* may fall far below
*RTP*. For a given population and set of variables there is only one value of
*RTP*, but there are infinitely many values of *RTS*, one for every
sample drawn from the population. However, once you have drawn one particular sample
and fitted a multiple regression in that sample, a unique but unknown value of
*RTS* is defined by that regression. Estimation of *RTS*, not *RTP*, is what is
needed in most practical prediction problems--problems in which you ask, "How good is this
predictive composite that I've derived, when applied to the whole population?"

At least four authors--Browne (1975), Burket (1964), Claudy (1978), and Rozeboom
(1978)--have published formulas for estimating *RTS* from an observed multiple
correlation *R*, together with *N* and *P*, assuming
multivariate normality. The formulas yield quite different estimates of *RTS* from
the same values of *R*, *N*, and *P*. Here I offer still
another
way to estimate *RTS*, in which the user employs a large table posted at my Web
site. Using the grid that appears later in this document, the
user can download just the portion of the table for a particular combination of
*N* and *P*. Unlike any of the previous formulas, the table also
includes
upper and lower 90% "confidence limits", derived in a manner to be explained
shortly.

A notable difference between the new method and the four older methods concerns a
plot of estimated *RTS* against *R* for fixed values of *N* and
*P*. The figure below shows such a plot for *N* = 50, *P* =
10. For the moment ignore
the two confidence-limit lines labeled "Upper CL" and "Lower CL". The other 5 lines show
estimates of true validity computed by the 5 methods, as a function of the observed R.
The four older methods all drop abruptly to the horizontal axis, while the new method,
labeled "Sample-based estimate", approaches the horizontal axis gradually. That seems more
reasonable intuitively than hitting the horizontal axis so abruptly.

In summary, the three major advantages of the new method are:

- "Confidence limits"
- Curves with a shape more intuitively reasonable
- Intuitively satisfying method of derivation (this point is developed in the next section)

R | mean | median | 10th %ile | 90th %ile | # obs |

0.58 | 0.295 | 0.297 | 0.101 | 0.479 | 4732 |

0.60 | 0.324 | 0.329 | 0.123 | 0.507 | 4850 |

The reason that different estimates
are based on different numbers of samples is explained in later sections. The maximum
number of samples for any one estimate is 5000, and most estimates of *RTS*, like the
ones shown here, are based on somewhere near that number.

As shown above, values appear in the table for increments of .02 in *R*.
Thus the
portion of the table for each *N/P* combination covers only about one page of
normal type.
There are altogether 360 portions of the table; you can select the portion you want to see
by clicking on the *X* in any of the cells in the grid below.

N / P | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 16 | 18 | 20 | 24 |

30 | X | X | X | X | - | - | - | - | - | - | - | - | - | - | - | - |

35 | X | X | X | X | X | X | X | X | X | - | - | - | - | - | - | - |

40 | X | X | X | X | X | X | X | X | X | X | X | X | - | - | - | - |

45 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | - |

50 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

55 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

60 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

65 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

N / P | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 16 | 18 | 20 | 24 |

70 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

75 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

80 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

85 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

90 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

95 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

100 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

110 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

N / P | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 14 | 16 | 18 | 20 | 24 |

120 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

130 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

140 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

150 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

175 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

200 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

225 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |

250 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |


Every point in the above argument is correct, but it omits an important consideration. Although
collinearity increases the standard errors of individual regression slopes, it also increases
the correlations between those slopes. To say that two slopes *b1* and *b2*
correlate negatively is to say that if you drew 1000 samples from the same population and
computed *b1* and *b2* in each sample, you would find a negative correlation between the
1000 values of *b1* and the 1000 values of *b2*. In fact, if the two predictor variables
involved correlate highly (or more precisely, if they correlate highly when the other predictor
variables are controlled), then the correlation between *b1* and *b2* will
be negative. That negative correlation means that errors in the two regression
slopes tend to cancel: if predictor variable *X1* is overweighted and *X2* is
underweighted, and *X1* and *X2* correlate highly with each other, then
prediction will be almost as accurate as if both variables had received their true
population weights. This sort of canceling of errors does in fact occur
under collinearity, with the result that *RTS* falls no further below
*RTP* and *R* under collinearity than in its absence. That's why
collinearity is ignored by all the other workers cited above, and is ignored here too.

Since the degree of collinearity is irrelevant, without loss of generality we can assume the
predictor variables to be uncorrelated in the population. Without loss of generality we can
also assume all the predictors to have unit variance. If we then choose some arbitrary set of
true regression weights *B*, the true variance of the predictable component of
*Y* equals the sum of squared entries in *B*; the matrix expression for
this value is *B'B*. If we let *SDE* denote the true residual standard
deviation (a number we are free to choose), then the true squared multiple
correlation equals *B'B*/(*B'B* + *SDE*^{2}), the ratio
of predictable variance to total variance.
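As a quick numeric illustration of that ratio (the weights and *SDE* below are made-up values, chosen only for the example):

```python
import numpy as np

# Hypothetical true weights B for P = 3 uncorrelated, unit-variance predictors,
# and a chosen true residual standard deviation SDE.
B = np.array([0.5, 0.3, 0.2])
SDE = 1.0

BtB = B @ B                        # predictable variance of Y, i.e. B'B
R2_true = BtB / (BtB + SDE**2)     # true squared multiple correlation
print(round(R2_true, 4))           # 0.38 / 1.38 → 0.2754
```
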

If we use a random number generator to produce
an *N* by *P* matrix of random normal numbers, we can think of those
numbers as the
scores of *N* cases on *P* predictor variables, drawn from the
aforementioned population.
We can also use a random normal number generator to generate
a column of *N* *true* standardized residuals; multiplying them by
*SDE* gives the true
unstandardized residuals. One can then use these random numbers to work out both the
multiple correlation *R* observed in the sample and the true multiple correlation
*RTS* of the regression equation derived in that sample.

Now suppose for instance we wanted to study *RTS* values for an observed multiple
correlation of .6. For a given set of random numbers as just described, the observed
multiple correlation *R* and the true multiple correlation *RTS* both
change with *SDE*.
One can therefore adjust *SDE* to make *R* equal .6, and then compute what
*RTS* would be for that value of *SDE*. One can repeat this for many
samples. The mean or median of all the values of *RTS* then provides an estimate
of the *RTS* associated with *R* = .6, for those particular values of
*N* and *P*.

That is what was done to construct the table here. I used 24 different values of
*N* and 16 values of *P*. When *N/P* combinations were eliminated for which
*N* - *P* < 25, that left 360 *N/P* combinations. *R* was
studied in increments of .02, and 5000 samples were generated for each combination of
*N*, *P*, and *R*.
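The count of 360 can be checked directly from the grid above; this short sketch (with the *N* and *P* values transcribed from that grid) confirms it:

```python
# N and P values from the grid above.
Ns = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100,
      110, 120, 130, 140, 150, 175, 200, 225, 250]
Ps = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 20, 24]

# Keep only combinations with N - P >= 25, as in the text.
combos = [(n, p) for n in Ns for p in Ps if n - p >= 25]
print(len(combos))   # → 360
```
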

Computer time required was massive (computing took about a week on a 486DX4-100
computer), but was still only a fraction of what the previous paragraph would imply. That is
because a given set of random numbers could be used for many different values of
*R*.
For instance, for a given sample one can find the value of *SDE* that yields
*R* = .6, and for
the same set of random numbers one can find the values of *SDE* that yield
*R* = .62, .64, etc.

However, for any given sample there was a nonzero lower limit on *R*.
Even when one
assumes an infinite value of *SDE* (which corresponds to *RTP* = 0),
*R* in any given sample
will turn out to equal some positive value--a value that might be fairly high. Therefore the
numbers of samples that yielded estimates for very high values of *R* were always
5000, but
as *R* got lower the number of usable samples got lower and lower. When
*R* was so low
that the number of usable samples fell below 50, then no estimate was printed. The number
of usable samples used for each estimate of *RTS* appears in the table; most of
these numbers
are at or near 5000.

Suppose you produce an *N* x *P* matrix *X* of
random standardized normal numbers, and an *N* x 1 column vector *se*
of standardized deviations from the true regression line. That is, *se* is standardized
in the population, which is to say that *se* is simply a set of random standardized
normal numbers. Let *SDE* be the population standard deviation of the unstandardized
residuals, and let *B* denote the vector of true regression slopes. Then values of Y could be
generated for simulations by the matrix equation

Y = X*B + SDE*se

This equation produces Y-values with a population mean of zero, though in what follows we will pretend we don't know that, and derive sample regressions with the additive constant not constrained to zero.
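A minimal sketch of this generating step and the sample fit (the seed, *N*, *P*, and weights below are arbitrary choices of mine, not values from the tables):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 10
SDE = 1.0
B = np.full(P, 0.2)            # made-up true slopes

X = rng.normal(size=(N, P))    # scores of N cases on P predictors
se = rng.normal(size=N)        # standardized true residuals
Y = X @ B + SDE * se           # the matrix equation above

# Fit the sample regression with the additive constant left free, as in the text.
X1 = np.column_stack([np.ones(N), X])
b = np.linalg.lstsq(X1, Y, rcond=None)[0]
R = np.corrcoef(X1 @ b, Y)[0, 1]   # observed multiple correlation
```
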

Without loss of generality we can assume the predictor variables are of unit variance
and uncorrelated in the population. Then the Y-variance due to X is B'B, the sum of
squared entries in B, and the error Y-variance is SDE^{2}. Therefore the true
multiple correlation *RTP* is

RTP = sqrt(B'B/(B'B + SDE^{2})).

If you generate *X* and *se* for a single sample, and you have
specified some true weight vector B and a value of SDE, then you can work out the observed
weight vector *b* and the observed multiple correlation *R*. Further,
*RTS* is determined entirely by *b*, so you can find
*RTS*.

The present method is based on the fact that if one uses a random number generator
to generate a matrix *X* and vector *se*, then one can associate those
values of *X* and *se* with *many different* values of SDE,
ranging from 0 up toward infinity. For each value of SDE one can work out the associated
values of *R*, *RTP*, and *RTS*. Therefore one can iteratively adjust SDE
to make the associated *R* equal some value of *R* that was (or might
be)
observed in the real world. The value of *RTS* associated with that value of
*R* is then an estimate of *RTS* for the observed *R*, though
of course
the estimate is based on just one sample.

But one can repeat this process as many times as desired. In each repetition one uses
a random normal number generator to produce a matrix *X* and a vector
*se*, then finds for that combination of *X* and *se* the value
of SDE that makes *R* equal the value observed in the real world, then calculates
the associated value of *RTS*. If one does that 5000 times, then the mean or
median of the 5000 *RTS* values is an estimate of *RTS* for the
real-world *R*, and the 500th and 4501st ranked values of *RTS* provide
lower and upper 90% "confidence limits" on *RTS*. Each of these is a one-sided
confidence limit, and a user would typically be most interested in the lower limit; one has
90% confidence that the real-world *RTS* exceeds that limit.
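The whole loop--draw a sample, solve for the SDE that makes *R* equal the observed value, record *RTS*, then take the median and ranked-value limits--can be sketched as follows. This is my illustrative sketch, not the original program; it uses far fewer repetitions than 5000, and all parameter choices and names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 5
B = np.full(P, 0.3)       # made-up true weights; predictors uncorrelated, unit variance
R_obs = 0.6               # the real-world observed multiple correlation
reps = 300                # the tables used 5000
rts_values = []

for _ in range(reps):
    X = rng.normal(size=(N, P))
    se = rng.normal(size=N)               # standardized true residuals
    X1 = np.column_stack([np.ones(N), X])

    def fit(SDE):
        """Observed R and slope vector b for this sample at a given SDE."""
        Y = X @ B + SDE * se
        b = np.linalg.lstsq(X1, Y, rcond=None)[0]
        return np.corrcoef(X1 @ b, Y)[0, 1], b[1:]   # drop the intercept

    if fit(1e6)[0] >= R_obs:      # this sample's lower limit on R is too high:
        continue                  # the "unusable sample" case described in the text

    lo, hi = 1e-9, 1e6            # bisect on SDE until R equals R_obs
    for _ in range(60):
        mid = np.sqrt(lo * hi)    # geometric bisection over a wide range
        if fit(mid)[0] > R_obs:
            lo = mid
        else:
            hi = mid
    SDE = np.sqrt(lo * hi)
    b = fit(SDE)[1]

    # RTS: population correlation between the sample composite and Y.
    rts = (b @ B) / np.sqrt((b @ b) * (B @ B + SDE**2))
    rts_values.append(rts)

rts_values = np.sort(rts_values)
estimate = np.median(rts_values)
lower, upper = np.quantile(rts_values, [0.10, 0.90])   # 90% one-sided limits
```

With 5000 repetitions the 10th and 90th percentiles here are exactly the 500th and 4501st ranked values the text describes.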

I used this procedure to estimate *RTS* for many values of *R*. I
used the same 5000 samples for many different supposedly observed values of *R*,
changing *R* in decrements of .02 from 1.00 down as far as possible for each
sample. The curve cannot actually go down to *R* = 0, because even when you
assume *RTP* = 0 (which in current terms is done by letting SDE approach
infinity), the observed *R* still does not get down to 0. Each of the 5000 samples
has its own lower limit. Thus a very low value of *R* (lower than you would
typically find even if *RTP* were 0) might have a computable value of
*RTS* only for, say, 1000 samples out of 5000, and the associated estimates of
*RTS* will therefore be based on 1000 samples instead of 5000, while estimates of
*RTS* from higher values of *R* are routinely based on all 5000
samples.
The last column in the table shows the actual number of samples used in each estimate of
*RTS*; most of those values are 5000.

To achieve *RTP* = 0 one would actually have to use an infinite value of
SDE, and that of course is not possible. Therefore in the tables I actually used the SDE
associated with *RTP* = .00001 wherever I really wanted *RTP* =
0.

Start from the generating equation

Y = X*B + se*SDE

The sample slope vector is b = inv(X'X)X'Y, so

b = inv(X'X)X'X*B + inv(X'X)X'se*SDE = B + inv(X'X)X'se*SDE

Define be = inv(X'X)X'se, so b = B + be*SDE.

We have ^Y = X*b, where ^Y denotes the estimates of Y made from the sample regression.
Recall the predictors are assumed to be independent in the population. Thus in the
population the covariance between ^Y and Y is b'B, the total variance of ^Y is b'b, and the
total variance of Y is B'B + SDE^{2}. Therefore

RTS = Cov(Y,^Y)/sqrt(var(Y)*var(^Y)) =
b'B/sqrt((b'b)(B'B+SDE^{2}))

=
(B'B+B'be*SDE)/sqrt((B'B+SDE^{2})(B'B+2*B'be*SDE+(be'be)*SDE^{2}))

This equation expresses RTS as a function of the four scalars B'B, B'be, be'be, and SDE. The first three are determined by B (which in my case I made up) and be, which is determined by X and se. So I could calculate these three scalars once for a given sample, and then by choosing SDE I could calculate RTS from the last equation.
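In code, that computation might look like this (a sketch with a randomly generated sample and made-up weights; like the scalar expression above, it ignores the additive constant):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 5
B = np.array([0.4, 0.3, 0.3, 0.2, 0.1])   # made-up true weights

X = rng.normal(size=(N, P))
se = rng.normal(size=N)

# The three sample-specific scalars, computed once per sample:
be = np.linalg.solve(X.T @ X, X.T @ se)   # be = inv(X'X) X'se
BtB, Btbe, bebe = B @ B, B @ be, be @ be

def rts(SDE):
    """True correlation of the sample composite, as a function of SDE alone."""
    return (BtB + Btbe * SDE) / np.sqrt(
        (BtB + SDE**2) * (BtB + 2 * Btbe * SDE + bebe * SDE**2))
```

At SDE = 0 the sample weights equal the true weights, so rts(0) is exactly 1; as SDE grows, rts falls away from 1.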

We also need an equation relating R to SDE. If we denote the residuals in the sample regression as e, then we can write

R^{2} = 1 - sse/sstot

where sse is the sum of squared residuals and sstot is the corrected sum of squares for Y. We have already noted that

Y = X*B + se*SDE

We also have e = Y - ^Y = Y - X*b. Replacing b in this equation by B + inv(X'X)(X'se)*SDE, and replacing Y by X*B + se*SDE, we can express e as a function of B, X, se, and SDE.

We can also write M_{Y} = mean(X*B) + SDE*mean(se)

Thus all the values needed to calculate sse, sstot, and then R^{2} can be
expressed as functions of B, X, se, and SDE. Once again, functions of B, X, and se can be
calculated once, and then different values of SDE can be used to produce different values of
R. Lengthy but straightforward matrix algebra shows that a computationally efficient
procedure is to first compute the two vectors X*B and X'se, then define MXB as the mean of
X*B and MSE as the mean of se, then write

sstot = [(X*B)'(X*B)] + SDE^{2}[se'se] + 2*SDE*[se'X*B] -
N*(MXB+SDE*MSE)^{2}

sse = SDE^{2}[se'se - (X'se)'inv(X'X)(X'se)]

All quantities in brackets, plus MXB and MSE, are scalars unaffected by SDE, and thus need be computed only once for a given sample.

It follows from all this that one can choose some value of R, use these formulas to iteratively find the value of SDE associated with it for a given sample (that is, for given X and se), then find the value of RTS associated with that value of SDE.
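Here is how that iteration might be coded (my sketch, not the original program). Because each sample has a floor on attainable R, the sketch first computes that floor and bisects toward a target above it; it works on the R^{2} scale to avoid taking square roots of small negative rounding errors near the floor:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 50, 5
B = np.full(P, 0.3)                 # made-up true weights

X = rng.normal(size=(N, P))
se = rng.normal(size=N)

# Sample-specific quantities, computed once:
XB = X @ B
Xse = X.T @ se
MXB, MSE = XB.mean(), se.mean()
hat = Xse @ np.linalg.solve(X.T @ X, Xse)   # (X'se)' inv(X'X) (X'se)

def R2_of(SDE):
    """R^2 = 1 - sse/sstot, from the two formulas above."""
    sstot = XB @ XB + SDE**2 * (se @ se) + 2 * SDE * (se @ XB) \
            - N * (MXB + SDE * MSE)**2
    sse = SDE**2 * (se @ se - hat)
    return 1 - sse / sstot

# This sample's floor on R^2, and a target safely above it.
floor2 = R2_of(1e6)
target2 = (1 + max(floor2, 0.0)) / 2

lo, hi = 1e-9, 1e6                  # bisect on SDE until R^2 equals target2
for _ in range(100):
    mid = np.sqrt(lo * hi)          # geometric bisection over a wide range
    if R2_of(mid) > target2:
        lo = mid
    else:
        hi = mid
SDE_star = np.sqrt(lo * hi)
```

The target R is then sqrt(target2); to aim at a specific observed value such as R = .6, one would instead set target2 = .36 for every sample whose floor lies below it, skipping the rest.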

Recall also that RTP^{2} = B'B/(B'B+SDE^{2}). Thus for a given sample one
can not only plot RTS against R, as discussed earlier, one can also plot both RTS and R
against RTP.

I for one assumed all these plots would be monotonically increasing, but such is not the case. Both the plots of RTS and R against RTP can be nonmonotonic, with a minimum somewhere other than at the left end of the curve. Further, the minima do not occur at the same value of RTP, so the plot of RTS against R can be surprisingly convoluted. The curve always starts at some point on the horizontal axis of the aforementioned unit square, and always ends at the upper right corner of the square. But it can leave its starting point in literally any direction, not just toward the upper right as one would expect. It may go down to the right, or up to the left, or even down to the left. And if it goes down to the left, it may circle around either above or below its starting point. But once it leaves its starting point, it always ultimately curves around and ends up at the upper right.

The figure immediately above shows some of the possibilities. Each of the 8 curves shown starts somewhere on the horizontal line True R = 0, and ends in the upper right corner where True R = Observed R = 1. Curves 1, 2, 3, and 4 are all fairly unsurprising, in that they go in moderately straight lines from their starting points to their ending point. But they differ in detail: curve 1 is very nearly straight, curve 2 is slightly concave (that is, with a positive second derivative), curve 3 is slightly convex (negative second derivative), and curve 4 is S-shaped. Curves 5-8 have more surprising shapes. Curves 5 and 6 are opposites in a sense: curve 5 heads toward the upper left from its starting point, while curve 6 heads toward the lower right. Curves 7 and 8 both go first toward the lower left from their starting points (this is not easily seen for curve 8), then both curve around toward the upper right--though 7 curves around above its starting point while 8 curves around below its starting point.

For purposes of constructing tables to estimate *RTS* from *R*,
curves like 5, 7, and 8 (which first go left from their starting points) are more problematic
than the others, since such curves can each yield two different values of *RTS*
for a single value of *R*. Which value should you use? Or should you use both,
counting that curve twice? It matters, because such curves are approximately half of all
curves--though as mentioned above it matters only for values of *R* so low that
they could easily have arisen from zero values of *RTP* and
*RTS*.

It does not seem reasonable to me to use two values from a single curve, because
each curve represents a single sample that should not be counted more heavily than any other
single sample. Therefore whenever a single curve yielded two estimates of *RTS*,
I chose the upper estimate half the time and the lower estimate half the time. The
programming required to do that was rather complex, but the details are beyond the scope of this
work.
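One simple way to implement the half-and-half rule (a sketch of the selection idea only, not the original program; the function name and alternation scheme are mine):

```python
def pick(two_valued_pairs):
    """Given (lower, upper) RTS pairs from two-valued curves, alternate
    between the upper and lower estimate so no sample counts twice."""
    chosen, take_upper = [], True
    for low_val, high_val in two_valued_pairs:
        chosen.append(high_val if take_upper else low_val)
        take_upper = not take_upper
    return chosen

print(pick([(0.1, 0.4), (0.2, 0.5), (0.0, 0.3)]))   # → [0.4, 0.2, 0.3]
```
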

Burket, G. R. (1964). A study of reduced rank models for multiple prediction. Psychometric Monographs (No. 12).

Claudy, J. G. (1978). Multiple regression and validity estimation in one sample. Applied Psychological Measurement, 32, 311-322.

Rozeboom, W. W. (1978). The estimation of cross-validated multiple correlation: a clarification. Psychological Bulletin, 85, 1349-1351.