Richard B. Darlington

Cornell University

The significance of such sets, despite the nonsignificance of all their members taken individually, is usually caused primarily by collinearity among the members of the set. Such patterns of significance can also occur in the absence of collinearity, but they are of interest even so; that fact therefore does not detract from the usefulness of finding all sets with this property.

We can also define a variant of the problem which distinguishes
between covariates and independent variables. Covariates are variables
included in a regression model merely to control for their effects; the
investigator cares little whether their effects are statistically significant. The
collinearity problem is then defined the same as before, but with the additional
proviso that even though the covariates are always in the model, all identified
sets of variables must contain only independent variables. It turns out that the
solution to this problem requires only a simple extension to the solution of the
problem without covariates, so we consider first a problem with *P* independent
variables and no covariates.

After using ALLSETS to compute all
2^{P} values of *R*^{2} (and hold them in memory),
it is simple to use an *F* test to test the difference between
each of these values of *R*^{2} and the
*R*^{2} for the
full regression. Each of these *F*'s tests the significance of the set of omitted
variables. For instance, if *P* = 5 and you test the difference between the full
*R*^{2} and the *R*^{2} computed for
variables 1, 2,
and 4, that *F* tests the independent contribution of the set containing variables
3 and 5. (The multiple tests problem is discussed later; here I'll just say I
believe there is a simple and satisfactory solution to it.)
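The *F* statistic involved is the standard test for the increment in *R*^{2} contributed by a set of *q* omitted variables. A minimal sketch in Python (not the author's GAUSS code; the numerical values are hypothetical):

```python
# F test for the joint contribution of the q variables omitted from a
# sub-model. All numbers below are hypothetical illustrations.

def f_for_omitted_set(r2_full, r2_sub, n, p_full, q):
    """F testing the difference between the full-model R^2 and a sub-model R^2.

    r2_full : R^2 for the regression on all p_full predictors
    r2_sub  : R^2 for the regression omitting a set of q predictors
    n       : sample size
    The statistic has q and (n - p_full - 1) degrees of freedom.
    """
    return ((r2_full - r2_sub) / q) / ((1.0 - r2_full) / (n - p_full - 1))

# Example: P = 5 predictors, sub-model omits variables 3 and 5 (q = 2).
F = f_for_omitted_set(r2_full=0.49, r2_sub=0.40, n=50, p_full=5, q=2)
print(round(F, 4))  # -> 3.8824
```

The same function applies to every one of the 2^{P} sub-models, with *q* equal to the number of omitted variables in each.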

This series of *F* tests does not by itself provide a satisfactory
solution to the collinearity problem, because the number of such tests may be
unmanageably large. For instance, if *P* = 12, then the total number of
possible regressions is 2^{12} = 4096. Even if we don't count either the
regression with no variables or the regression with all variables, or the 12
regressions missing only one variable, that still leaves 4096 - 14 = 4082
regressions. It turns out to be quite feasible to calculate 4082 such values of *F*
and their associated *p* values. But it would still be an enormous job to examine
all 4082 of these values of *p* or *F* to determine which sets are
significant.
Further, we're not interested in the significance of any set if some subset
within it is also significant. For instance, if the set containing variables 2, 4,
and 8 is significant, then we're not separately interested in the set containing
variables 2, 4, 8, and 9. The significance of the latter set would tell us only
that some of the variables in that set must be related to *Y*, and we already
know that from the significance of the set containing variables 2, 4, and
8.

Thus another step is needed. We must pare the list of significant *F*'s
(which, as just seen, may be thousands of items long) down to just those *F*'s
testing sets without significant subsets. To do this, the method requires the
user to specify some significance level alpha which a set must meet to be
considered significant; most users will probably set alpha to .05 or .01.
After all values of *F* and *p* have been computed, the
method ranks the tested sets from small to large, according to the number of
variables in each. For instance, if *P* = 5, the regression containing variables
1, 2, 4 would be ranked ahead of the regression containing just variables 1 and
2, because the tested set for the latter (variables 3, 4, 5) is larger than the
tested set for the former (variables 3 and 5). Then the program goes through
the list of *F*'s in order, and ignores any *F* testing a set which
contains a subset
previously noted to be significant. All significant *F*'s passing this test are
printed, together with the list of variables tested by that *F*. The list of printed
sets is then usually quite short.
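The paring step just described can be sketched as follows (a minimal Python illustration of the same logic, not the GAUSS program itself; the sets and *p* values are hypothetical):

```python
# Screening step: rank tested sets from small to large, then drop any
# significant set that contains an already-kept significant subset.
# The sets and p values below are hypothetical.

def screen_sets(tested, alpha=0.05):
    """tested: list of (variables, p) pairs, variables given as a frozenset."""
    kept = []
    for vars_, p in sorted(tested, key=lambda t: len(t[0])):
        if p > alpha:
            continue                          # not significant at alpha
        if any(sub <= vars_ for sub, _ in kept):
            continue                          # a kept subset already covers it
        kept.append((vars_, p))
    return kept

tests = [
    (frozenset({2, 4, 8}), 0.031),
    (frozenset({2, 4, 8, 9}), 0.022),  # superset of a significant set: skipped
    (frozenset({1, 5}), 0.210),        # not significant: skipped
    (frozenset({3}), 0.004),
]
print(screen_sets(tests))  # keeps {3}, then {2, 4, 8}
```

Sorting by set size first guarantees that whenever a set is examined, any significant subset of it has already been kept, so a single pass suffices.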

Despite the screening process just described, a single variable may be
in two or more printed sets. For instance, suppose *P* = 8 and the only
significant sets are (2, 4, 5) and (1, 5, 7, 8). Then both will be printed
despite both containing variable 5, because neither one is a subset of the
other.

Computing time is trivial; with GAUSS running on a DOS 486 machine,
a problem with *P* = 10 takes about 15 seconds to do everything described above,
yielding output like that shown next. Most computing languages are considerably slower
than GAUSS, and problems with larger *P* will obviously take longer, but computing
time still seems inconsequential.

Below is some sample output with 10 predictors and alpha = .05.
*Q* is the number of variables in a significant set.
In the table section titled "Variables in tested set", 1's denote variables in the tested set.
You see in the first two rows that the program found variables 8 and 3
significant individually, with *p*'s of .0486 and .0007 respectively.
Then it identified a set containing variables 5 and 9
with *p* = .0395, then a set containing variables 2 and 9 with *p* =
.0457, then a set containing variables 1, 7, 9, and 10 with *p* = .0343, and finally
a set containing variables 1, 2, 5, 7, and 10 with *p* = .0497. Only variables 4 and 6
fail to fall in any significant set; columns 4 and 6 contain all 0's.

Overall R^2 = 0.4900

     R^2        F        p   Q   Variables in tested set
                                 1 2 3 4 5 6 7 8 9 10
  0.4358   4.1429   0.0486   1   0 0 0 0 0 0 0 1 0 0
  0.3115  13.6471   0.0007   1   0 0 1 0 0 0 0 0 0 0
  0.3980   3.5158   0.0395   2   0 0 0 0 1 0 0 0 1 0
  0.4025   3.3443   0.0457   2   0 1 0 0 0 0 0 0 1 0
  0.3385   2.8949   0.0343   4   1 0 0 0 0 0 1 0 1 1
  0.3292   2.4594   0.0497   5   1 1 0 0 1 0 1 0 0 1

We now return to the case in which the user wishes to distinguish
between independent variables and covariates. This case is handled simply by
applying the method just described, not to a simple correlation matrix but to a
matrix of partial correlations in which the covariates have been partialed out.
That is, the correlation matrix has (*P*+1) rows and columns as described
above, but *P* is just the number of independent variables, not including any
covariates. If this is done, *N* should be reduced by the number of covariates
to make the F tests work correctly. For instance, if *N* was 100 but 5
covariates were partialed out, enter *N* as 95.
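Partialing a covariate out of a correlation uses the standard first-order partial correlation formula; for several covariates the formula is applied recursively, partialing out one covariate at a time. A sketch (Python, not taken from FINDSETS; the input correlations are hypothetical):

```python
from math import sqrt

# First-order partial correlation between x and y with covariate z
# partialed out. Standard formula; the inputs below are hypothetical.
def partial_r(r_xy, r_xz, r_yz):
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Example: r_xy = .50, r_xz = .60, r_yz = .40
print(round(partial_r(0.50, 0.60, 0.40), 4))  # -> 0.3546
```

Applying this to every pair among the independent variables and *Y* yields the (*P*+1) x (*P*+1) matrix of partial correlations that the method then analyzes.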

We must certainly consider the problem of multiple tests, since the procedure just
described may scan hundreds or even thousands of tests, and print just a few
significant ones. I suggest using a multiple-test philosophy similar to that of
the Fisher protected *t* method. That philosophy suggests the following procedure: in the full model, test the set of all
variables not significant individually. If that set is not significant, stop,
because then there is no evidence that collinearity has prevented some sets from
being significant. That way, all later tests actually performed are tests of subsets from
a set already found to be significant. The Fisher philosophy is not accepted
universally, but is widely used and I believe it is often reasonable; for a fuller
discussion see Sections 11.5.4 and 11.5.5 of my textbook
*Regression and Linear Models* (McGraw-Hill, 1990).

This extra "Fisher protection" test is not incorporated into program FINDSETS described later, because I assume a worker will typically turn to FINDSETS only after considerable analysis of the data with an ordinary regression program. Therefore I believe the easiest thing to do is to perform the Fisher protection test with that other program. That may save the worker from having to use FINDSETS at all.

If the regression model contains some covariates which the investigator
has included in the model just so they can be controlled, and the investigator
wants to find sets of variables which include only independent variables (not
covariates), then FINDSETS assumes *R* is a matrix of *partial*
correlations among independent variables and *Y*, with the covariates partialed
out. In that case, *N* should be reduced by the number of covariates. For
instance, if *N* was 100 but 5 covariates were partialed out, enter *N* as
95.

Here are the major vectors (columns or rows) and matrices used in FINDSETS. GAUSS ignores a letter's case, so names are typically lower-case in the program itself, but upper-case in the discussion for easy identification.

VJ is a list of the 2^{P} sweeps to be
performed. It is created early in the program, using rules given
earlier.
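The resulting sweep order is the "ruler sequence" underlying the binary reflected Gray code: entry *i* of VJ is the position of the lowest set bit of *i* (capped at *P*), so each sweep changes the model by exactly one variable and all 2^{P} subsets are visited. A Python sketch of the same construction (an illustration, not the GAUSS code):

```python
# Sketch (Python, not GAUSS) of the VJ construction: vj[i] is the variable
# swept at step i. The value is the lowest-set-bit position of i, capped
# at p -- the Gray-code "ruler sequence".

def make_vj(p):
    return [min((i & -i).bit_length(), p) for i in range(1, 2**p + 1)]

# Replaying the toggles visits every one of the 2^p subsets exactly once,
# returning to the empty model on the final step.
def visited_subsets(p):
    ch, seen = set(), []
    for k in make_vj(p):
        ch.symmetric_difference_update({k})   # toggle variable k in or out
        seen.append(frozenset(ch))
    return seen

subs = visited_subsets(3)
print(make_vj(3))      # [1, 2, 1, 3, 1, 2, 1, 3]
print(len(set(subs)))  # 8 distinct subsets
```

Because consecutive models differ by a single variable, each *R*^{2} is obtained from the previous one by a single sweep rather than a fresh regression.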

C is a copy of R. SWEEPing modifies C, but leaves R unchanged.
After 2^{P} sweeps dictated by the entries in VJ, C
will differ from R only because of rounding error.

RSQ starts as a column vector of 2^{P} zeros,
but after each sweep one of those zeros is replaced by the newly-calculated
value of R^{2}, so RSQ is full of values of R^{2}
by the end of the program. RSQ is filled sequentially, from top to bottom.

CH is a 1 x P row vector which starts as all zeros. When any variable k is swept, CH[k] is changed from 0 to 1, or from 1 to 0 if CH[k] = 1. Thus after each sweep, CH contains a record of the rows swept at that moment.

MEMB (for "members") starts as a 2^{P} x P
matrix of zeros. After the ith sweep, CH is copied into row i of MEMB. Thus
after all the sweeps, MEMB shows the independent variables used to compute each
value of R^{2}. RSQ and MEMB are not printed out by default;
they're typically too long. But they are not erased by the program. A little
later we describe how the interested worker can access the unprinted
information in them.

MAX is the largest value of R^{2} found by the program
-- the value found when all P regressors are in the model. Its position in RSQ
is identified by ALLVAR. Thus to say ALLVAR = 9 is to say that the 9th
entry in RSQ is the one for the full model.

F is a vector that holds the 2^{P} values of F
calculated as described earlier. That is, each entry in F tests the difference
between a particular R^{2} and MAX.

PV is a vector that contains the size of the set tested by each of those F tests. For instance, if P = 8 and the 11th value of RSQ is for a model with 5 variables, then PV[11] = 3 because the tested model differs from the full model by 3 variables.

FINDSETS uses vector commands to calculate all
2^{P} values of F simultaneously. It so happens that
values of PV appear in the denominators of F. That means that when
FINDSETS attempts to test the full model against itself, the expression for F
has a denominator of 0. To avoid that problem, that single value of PV is
changed from 0 to 1 before F's are computed (we're not interested in that
value of F anyway), and then changed back to 0 afterward.

After all F's are computed, the significance level *p*
corresponding to each F is computed. These values of *p* are held in
vector PP.

Then a matrix SRT is created with 2^{P} rows
and P+4 columns. The first four columns of SRT are RSQ, F, PP, and PV
respectively, and the remaining columns are 1 - MEMB. The subtraction is
done because MEMB itself contains 1's for variables in a model and 0's for
other variables. But those other variables are the ones whose contribution to
the overall R^{2} is tested by each F. Therefore each row of 1 -
MEMB contains 1's for the variables tested by the corresponding F, and 0 for
other variables.

After it is created, SRT is sorted by its fourth column (PV), so the F tests on the smallest sets are placed first. That facilitates the later step in which a significant set is not printed if it contains a significant subset. After the sorting, any such subset will precede the set under consideration, so only the preceding rows must be scanned.

The actual procedure is a little different from that just described, but
equivalent. A vector SKIP is created, containing at first all 0's. Then changing
any entry *i* in SKIP to 1 tells FINDSETS to skip that row when
printing. First the program enters 1's into SKIP for all rows with
nonsignificant values of F. Then, for each significant set I, the program
sets SKIP to 1 for every later set J which includes I as a subset.

For readers who want to know exactly how this is done: the program dot-multiplies the I and J "membership lists" and sets skip[j]=1 if the product equals the membership list for I. For instance, consider the membership lists (1 0 1 0 1) and (1 0 1 1 1). The first is a subset of the second, since every 1 in the first list is matched by a 1 in the second. The dot-product of the two lists (containing products of corresponding entries) is (1 0 1 0 1), which matches the first list. The general rule is that the product of two lists equals the first list if and only if the first is a subset of the second. That is the rule the program uses; it compares the first list to the dot-product of the two lists, and changes SKIP for the second list to 1 if the two are identical.
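In Python terms, the subset rule can be sketched like this (an illustration only, not the program's GAUSS code; the example lists are the ones discussed above):

```python
# Subset rule: list_i is a subset of list_j exactly when the elementwise
# product of the two 0/1 membership lists reproduces list_i.

def is_subset(list_i, list_j):
    return [a * b for a, b in zip(list_i, list_j)] == list_i

print(is_subset([1, 0, 1, 0, 1], [1, 0, 1, 1, 1]))  # True
print(is_subset([1, 0, 1, 1, 1], [1, 0, 1, 0, 1]))  # False
```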

Finally SRT is saved to text file FINDSETS.OUT, except for the rows of SRT identified by SKIP.

The program is divided into 6 independent sections. By "independent" I mean there are no loops or "go to" commands that go between sections. Thus the program can be run section by section. Or a programmer trying to translate the program into another computer language can write and debug sections sequentially, debugging earlier sections before later sections have even been written. The six sections do the following:

- Define the SWEEP procedure
- Create VJ, the list of rows to sweep
- Perform 2^P sweeps, create RSQ and MEMB.
- Find or create MAX, ALLVAR, vectors F, PP, and PV, and matrix SRT
- Create vector SKIP
- Write an output file named FINDSETS.OUT.

To print the complete matrix SRT, with a row for every tested set, two changes are needed.

1. Most important, if you have already run the entire program and created file FINDSETS.OUT, change the first command in section 6 to, for instance, "output file = allsets.out reset". This names the about-to-be-created file ALLSETS.OUT instead of FINDSETS.OUT. Without this step you would overwrite the previously-created file FINDSETS.OUT.

2. Delete the commands "if skip[i]==0" and "endif" from section 6b. Those deletions make the section ignore the SKIP vector, and print all of SRT.

A note to programmers planning to rewrite: The SWEEP procedure is actually called from only one command in the entire program (in section 3). Therefore you can, if you wish, enter the commands for SWEEP at that one spot in the program, and thereby do away with all procedures or subroutines.

GAUSS sometimes uses two "=" signs in a row; they will look like "==".

The program lines

"....R^2......F........p.......Q...Variables in tested set";

"..................................1 2 3 4 5 6 7 8 9 10";

print the heading lines for the output table. In those program lines,
the dots should be replaced by spaces; they are shown as dots merely to make
it easier to see how many there are.

After this line, program comments are enclosed between two "@" symbols.

@ 1. Define the SWEEP procedure @

proc sweep(k,c); local pv,b; pv=1/c[k,k]; b=c-c[.,k]*c[k,.]*pv;

b[k,.]=c[k,.]*pv; b[.,k]= -c[.,k]*pv; b[k,k]=pv; retp(b); endp;

@ 2. Create VJ, the list of rows to sweep @

vj=ones(2^p,1); incr=1; do until incr==2^(p-1); incr=2*incr;

i=0; do until i>2^p-incr; i=i+incr; vj[i]=vj[i]+1; endo; endo;

@ 3. Perform 2^P sweeps, create RSQ and MEMB. @

rsq=zeros(2^p,1); ch=zeros(1,p); memb=zeros(2^p,p);

i=0; do until i==2^p; i=i+1; jj=vj[i]; ch[jj]=1-ch[jj];

memb[i,.]=ch; c=sweep(jj,c); rsq[i]=1-c[p+1,p+1]; endo;

@ 4. Find or create MAX, ALLVAR, vectors F, PP, and PV, and matrix SRT @

max=maxc(rsq); pv= -sumc(memb')+p; allvar=minindc(pv); pv[allvar]=1;

f=(rsq-max)./pv*(n-p-1)/(max-1); pp=cdffc(f,pv,(n-p-1)*ones(rows(pv),1));

pv[allvar]=0; srt=rsq~f~pp~pv~(1-memb); srt=sortc(srt,4);

@ 5. Create vector SKIP @

skip=(srt[.,3] .> alpha); i=1; do until i==2^p; i=i+1; if skip[i]==0;

j=i; do until j==2^p; j=j+1; if skip[j]==0; pr=srt[i,5:p+4].*srt[j,5:p+4];

skip[j]=(pr==srt[i,5:p+4]); endif; endo; endif; endo;

@ 6a. Open text file FINDSETS.OUT, write opening material into it @

output file = findsets.out reset;

format /rd 8,4; "Overall R^2 =";; srt[1,1]; " ";

"....R^2......F........p.......Q...Variables in tested set";

"..................................1 2 3 4 5 6 7 8 9 10";

@ 6b. Write into the file the rows of SRT for which SKIP = 0 @

i=0; do until i==2^p; i=i+1; if skip[i]==0;

format /rd 8,4; srt[i,1:3];; format 4,0; srt[i,4];; " ";;

format 1,0; srt[i,5:p+4]; endif; endo;

@ 6c. Write some notes at the end of file, close the file @

" "; "F tests difference between R^2 and overall R^2.";

"p is the signif. level for F. Q is number of variables in tested set.";

"In the 'Variables' section, 1 denotes the variables out of the model and";

" therefore in the tested set, while 0 denotes variables in the model.";

output off;