ALLSETS: A simple algorithm for all-subsets regression

Richard B. Darlington
Cornell University

When P predictor variables are available to predict a dependent variable Y by regression, there are altogether 2P different sets of predictor variables that could be formed. That's because each predictor can be included or excluded independently of the others, and there are P such binary choices, making 2P combinations. That includes the "null" regression that contains no predictors, and the full regression containing all P predictors.

ALLSETS is a procedure that can scan all these regressions with surprising speed, computing R2 for each. I argue elsewhere that ALLSETS is useful in identifying collinear sets of variables in a regression.

ALLSETS uses a procedure called SWEEP which can be applied to a single variable in a correlation matrix -- that is, to a single row and matching column in the matrix. By SWEEPing any set of variables in the matrix, one can easily compute the R2 for predicting any unswept variable from the swept variables. For instance, if a matrix had 5 variables and you swept variables 1, 3, and 5, then the second diagonal entry in the swept matrix would be 1 - R2, where R2 is the squared multiple correlation predicting the second variable from variables 1, 3, and 5. The fourth diagonal entry would have a similar meaning for variable 4.

Each sweep changes every entry in the matrix. Sweeping is reversible -- that is, sweeping a variable twice returns the matrix to its original state. Thus to say that a variable "is" currently swept is to say that it has been swept an odd number of times, and to say it "is not" currently swept is to say it has been swept an even number of times.

ALLSETS works with the correlation matrix of P predictor variables plus Y, with Y placed last. By performing 2P - 1 sweeps in that matrix, ALLSETS computes all 2P - 1 nonnull values of R2 for predicting Y. (We don't need to calclulate R2 for the "null" regression with no predictors; we know it's 0.) Then performing one more sweep returns the correlation matrix to its original state, except for any accumulated rounding error.

Because of that last point, the last sweep is recommended, since it allows the user to compare the final matrix with the original one, to see how much rounding error has accumulated. The less has accumulated, the more confidence we can have in the R2 values computed along the way. Rounding error is discussed in more detail below.

Let d(Y) denote the last diagonal entry in the matrix -- the diagonal entry for Y. Then after any number of sweeps, R2 for the set of swept variables is R2 = 1 - d(Y).

The necessary sequence of sweeps is easily described. For P = 4 it is

1 2 1 3 1 2 1 4 1 2 1 3 1 2 1 4

That is, sweep variable 1, then 2, then 1 again, etc.

To construct the list of sweeps for any value of P, proceed as follows:

  1. Start with a list of 2P values of 1.

  2. Add 1 to every second entry, then another 1 to every 4th entry, then another 1 to every 8th entry, etc.

  3. Stop after (P-1) such additions, or equivalently when the last entry in the list equals P.
When P = 3 the list is 1 2 1 3 1 2 1 3. Below we show how this list generates every possible set of predictors for P = 3. Recall that sweeping a variable previously swept is equivalent to not sweeping it at all. For instance, after sweeping variables 1 and 2, re-sweeping variable 1 leaves variable 2 as the only swept variable.

Variable    Variables
last        currently
swept       swept
1           1
2           1, 2
1           2
3           2, 3
1           1, 2, 3
2           1, 3
1           3
3           -

Handling rounding error

One way to minimize the accumulation of rounding error is to first measure the tolerance of each predictor variable. Tolerance is 1-RC2, where RC is the multiple correlation predicting that one predictor variable from all the others. The higher a variable's tolerance, the less rounding error will accumulate from sweeping it. Since variable 1 is swept the most times, variable 2 the next most times, etc., rounding error is minimized by reordering the variables, before any sweeps, in order of declining tolerance. Or equivalently, leave the variables in their original order, but reinterpret the list of variables to be swept as pertaining to a list of the variables in order of decreasing tolerance. That is, the first three entries "1 2 1" mean you should first sweep the variable with the highest tolerance, then the variable with the second-highest tolerance, then sweep again the variable with the highest tolerance.

As already mentioned, one way to check on the accumulation of rounding error is to compare the matrix after 2P sweeps with the original matrix, computing for instance the root mean square change in the entries. Another way is perhaps more satisfying. This way does take more computer time, though that is rarely a major problem, since the whole procedure is very fast. Starting with the final matrix (the matrix after all 2P sweeps), go through the whole process again, performing 2P more sweeps, and compare the values of R2 computed the second time around to the values computed the first time around. The difference between these two sets of values should presumably exceed the amount of rounding error in the first set of values, because more sweeps occurred between those two values than were used to calculate most of the original values.