Many scientists think of random assignment and statistical control (the use of covariates in
linear models) as alternative methods of control. It is well known that random assignment
has certain advantages over statistical control; see Chapter 4 of my book *Regression and
Linear Models* (hereafter abbreviated RLM). However, there are at least four
reasons for using statistical control *along with* random assignment if the latter is
planned. This note outlines those reasons. The first three reasons (control of nonrandom
attrition, assessment of indirect effects, and increased power and precision in estimating
effect sizes) are well understood by many people. However, I believe that the last of the
four reasons has not been described before.

For instance, consider an experiment in which a randomly assigned half of all subjects
are told that a mental test indicates they should be especially good at solving problems of a
certain type, and are then found to persist longer in trying to solve such problems, which are
in fact unsolvable. Of course, subjects are told the truth at the end of the experiment. Was
their persistence produced (a) by self-confidence, or (b) by increased liking of the
experimenter who had complimented them, or (c) by some other intervening mechanism, or
(d) by some combination of these? These various possibilities can be distinguished by a
regression predicting perseverance from the independent variable of treatment condition, plus
measures of self-confidence and liking of the experimenter, which are the proposed
mechanisms in this example. Under choice (a) we expect only self-confidence to be
significant, under (b) only liking, under (c) only treatment condition, while (d) covers the
cases in which two or three of these effects are significant. The use of regression in such
cases clarifies not so much the presence of the effect as its nature, that is, the intervening
variables that mediate the effect. These are
actually measures of *indirect effects*, which are discussed more fully in RLM
Section 7.2.
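As a concrete sketch of the regression just described, the code below simulates data under mechanism (a), with invented effect sizes, purely to show the pattern of *t* statistics one would inspect; the variable names and numbers are hypothetical, not taken from any actual study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: treatment raises self-confidence, and persistence
# is driven by self-confidence alone (mechanism (a) in the text)
treat = rng.integers(0, 2, n).astype(float)       # randomized condition (0/1)
confidence = 2.0 * treat + rng.normal(0, 1, n)    # mediator affected by treatment
liking = rng.normal(0, 1, n)                      # mediator unaffected here
persist = 1.5 * confidence + rng.normal(0, 1, n)  # persistence from confidence only

# OLS of persistence on treatment condition plus both proposed mediators
X = np.column_stack([np.ones(n), treat, confidence, liking])
beta, *_ = np.linalg.lstsq(X, persist, rcond=None)
resid = persist - X @ beta
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t = beta / se

# Under mechanism (a), only the confidence coefficient should be clearly nonzero
for name, b, tv in zip(["const", "treat", "confidence", "liking"], beta, t):
    print(f"{name:>10}: b = {b:6.2f}, t = {tv:6.2f}")
```

With data generated this way, the *t* for self-confidence is large while the *t*'s for treatment condition and liking hover near zero, which is the signature of choice (a).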

The figure shows a case in which 7 people in a treatment group (the 7 triangles) and 7 others in a control group (the 7 circles) are measured both before and after a treatment. For simplicity I have made the mean pretest scores exactly equal for treatment and control samples, though the major points below are valid without this condition.

The diagonal lines in the figure represent the model fitted by regressing posttest onto pretest and a dummy treatment variable. The upper line shows the predicted posttest scores of the treatment group, while the lower line shows the predicted scores of the control group. Assuming random sampling from a population, you can tell without any real calculation that the differences between treatment and control groups cannot be explained by chance, because every one of the treatment-group cases is closer to the treatment-group line than to the control-group line, while every one of the control-group cases is closer to the control-group line. The inferential formulas of RLM Chapter 5 agree with this intuitive conclusion; when they are used to test the hypothesis of no treatment effect, that hypothesis is rejected at the .0000045 level of significance (t = 8.33, df = 11).

We can also estimate the size of the treatment effect. The vertical distance between the two diagonal lines is 3.0; thus 3.0 is the estimated effect of the treatment on posttest scores, as explained in RLM Section 3.2.3. Again you can see without calculation that the lines in the figure must closely approximate the population lines, because randomly discarding any one case from the sample would hardly change at all the placement of either line. Therefore the vertical distance of 3.0 between the lines must be an accurate estimate of the treatment effect. Again the formulas in RLM Chapter 5 agree with our intuition; they show a standard error of only .360 for the estimated treatment effect.

But when we ignore regression formulas and use a simple two-sample *t* test to test
the significance of the difference between the two groups, we ignore the information about
each case's horizontal placement in Figure 4.1, using only its vertical placement. If this
information were all we had, in our noncomputational intuitive testing we wouldn't be nearly
so certain about the size or even the existence of the treatment effect, since the treatment-group posttest scores range from 6 to 15, and the control-group scores overlap them
substantially, ranging from 2 to 12. The two-sample *t* test has this same limitation. Because
treatment and control groups had exactly the same means on pretest, the estimated difference
between groups is the same whether or not we control for pretest. The previously estimated
effect size of 3.0 was simply the difference between the two sample means, which forms the
numerator of the *t* test. But because the *t* test ignores pretest scores, that estimate's standard
error is 1.79, nearly five times the value of .360 mentioned above. Thus we find a
nonsignificant *t* of 3/1.79 = 1.68, df = 12, p = .12.
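The contrast between the two analyses can be reproduced in miniature. The data below are invented to mimic the figure's setup (equal pretest means in the two groups, a true treatment effect of 3), not the actual figure values, so the numbers printed will differ from those in the text while showing the same pattern.

```python
import numpy as np

# Invented data in the spirit of the figure: 7 treatment and 7 control
# cases with identical pretest means, posttest driven strongly by pretest
# plus a treatment effect of 3
pre = np.array([2., 4, 6, 8, 10, 12, 14] * 2)
treat = np.repeat([1., 0.], 7)
rng = np.random.default_rng(1)
post = pre + 3.0 * treat + rng.normal(0, 0.5, 14)

def ols_se(X, y):
    """Return OLS coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

# ANCOVA: posttest on pretest plus a treatment dummy
Xa = np.column_stack([np.ones(14), pre, treat])
ba, sa = ols_se(Xa, post)

# Plain two-sample comparison: same effect estimate (because the pretest
# means are exactly equal), but a far larger standard error, since the
# pretest information is discarded
Xt = np.column_stack([np.ones(14), treat])
bt, st = ols_se(Xt, post)

print(f"ANCOVA  effect = {ba[2]:.2f}, SE = {sa[2]:.3f}")
print(f"t test  effect = {bt[1]:.2f}, SE = {st[1]:.3f}")
```

Because the pretest means are exactly equal, the two effect estimates coincide to machine precision; only the standard errors differ, just as in the text's comparison of .360 with 1.79.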

In this example the treatment effect was not significant at even the .05 level without statistical control, but much the same point can apply even if it is. If you had to choose between spending thousands of dollars on a treatment that had been demonstrated effective at just the .05 level, or the same money on another treatment whose effectiveness had been demonstrated at the .001 level, which would you choose? Presumably the latter; after all, 50 times as many ineffective treatments pass tests at the .05 level as at the .001 level. Thus investigators should attempt to show the most significant results they validly can.

I do not mean to imply that one always gains power by indiscriminately adding
covariates to a model with random assignment. The more strongly a covariate affects the
dependent variable Y, the more power is gained from controlling it. But if a covariate has
absolutely no effect on Y, one actually loses a little power by adding it to the model. The
power lost is the same as that lost by randomly discarding one case from the sample, so the
loss is usually small. But even this small loss suggests that one should not indiscriminately
add dozens of extra covariates to the model, just because they happen to be in the data set.
Elsewhere I describe and justify a method for selecting a specific set of covariates. In
the method's simplest form, you predict the dependent variable from a broad set of relevant
covariates, then drop from the model the covariate with the lowest absolute *t*. As described
there, continue dropping covariates one at a time, recomputing the regression after each
deletion, until all remaining covariates have absolute *t*'s of 1.42 or higher. Add the
independent variable to the regression only after completing this process. Otherwise the
covariates correlating highest with the treatment variable--the very ones it is most important
to keep--will tend to be deleted because of their redundancy with the treatment
variable.
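A minimal sketch of this selection procedure, assuming ordinary OLS *t* statistics and the 1.42 cutoff given above; the covariate names and data below are hypothetical, invented only to exercise the loop.

```python
import numpy as np

def backward_select(y, covs, names, t_min=1.42):
    """Drop covariates one at a time (lowest absolute t first) until all
    remaining covariates have |t| >= t_min.  As the text directs, the
    treatment variable is deliberately excluded from this step."""
    keep = list(range(covs.shape[1]))
    while keep:
        X = np.column_stack([np.ones(len(y)), covs[:, keep]])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (len(y) - X.shape[1])
        se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
        t = beta[1:] / se[1:]              # skip the intercept
        worst = int(np.argmin(np.abs(t)))
        if abs(t[worst]) >= t_min:
            break                          # all survivors clear the cutoff
        keep.pop(worst)                    # drop the weakest, then refit
    return [names[i] for i in keep]

# Hypothetical example: c1 and c2 affect y, c3 is pure noise
rng = np.random.default_rng(2)
n = 100
covs = rng.normal(size=(n, 3))
y = 1.0 * covs[:, 0] + 0.8 * covs[:, 1] + rng.normal(0, 1, n)
selected = backward_select(y, covs, ["c1", "c2", "c3"])
print(selected)
```

Only after this loop finishes would the treatment variable be added to the model, for the reason given above.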

Once we agree that in this extreme example there is some doubt about the treatment's
effectiveness, we must ask how extreme an example must be to raise similar doubts.
Perhaps we should be concerned about all significant differences between treatment groups
on covariates, despite the familiar argument (given in RLM Section 4.1.2) against this
position. But we can avoid the whole problem by using linear models along with random
assignment. The problem arises because we presume that the covariates correlate with the
dependent variable in the population, so that if by chance we draw a sample in which the
covariates correlate with the treatment variable as well, then we must presume the sample
correlation between the treatment and the dependent variable is at least partly spurious. But
as described in RLM Chapter 2, in a linear model with an independent variable X and
several covariates, X's sample regression slope can be thought of as the simple regression
slope predicting the dependent variable from the portion of X independent of the covariates.
This portion of X is exactly uncorrelated with all covariates *in the sample
studied*, not merely in some hypothetical population. This eliminates the problem,
which was that X might conceivably correlate highly with covariates just by chance in the
sample studied, even though random assignment assures that this correlation is zero in the
population. But the use of a linear model means that we are always using in effect just the
portion of X that is independent of the covariates in the sample studied. Even in our extreme
example, regression would estimate the treatment effect to be zero, which is the estimate
supported by intuition.
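The residualization argument can be checked numerically. This sketch, with invented data, residualizes X on the covariates and confirms both claims: the residual is exactly uncorrelated with the covariates in the sample studied, and its simple regression slope on the dependent variable equals X's coefficient in the full model (the Frisch-Waugh-Lovell result).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
covs = rng.normal(size=(n, 2))               # two hypothetical covariates
x = rng.integers(0, 2, n).astype(float)      # randomized treatment variable
y = 2.0 * x + covs @ np.array([1.0, -0.5]) + rng.normal(0, 1, n)

# Residualize X on the covariates (plus an intercept)
C = np.column_stack([np.ones(n), covs])
x_res = x - C @ np.linalg.lstsq(C, x, rcond=None)[0]

# The residualized X is exactly uncorrelated with the covariates in sample
print(np.round(covs.T @ x_res, 10))

# Its simple regression slope equals X's slope in the full linear model
slope_simple = (x_res @ y) / (x_res @ x_res)
X_full = np.column_stack([np.ones(n), x, covs])
slope_multiple = np.linalg.lstsq(X_full, y, rcond=None)[0][1]
print(slope_simple, slope_multiple)
```

The two printed slopes agree to machine precision, illustrating why the linear model in effect uses only the portion of X independent of the covariates in the sample actually studied.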