Standard methods of simple and multiple regression assume
*homoscedasticity*--the condition that all conditional distributions of
the dependent variable *Y* have the same standard deviation. When
one tests for the significance of regression slopes in simple or multiple
regression, the accuracy of the test depends heavily on that assumption. In
fact, even "robust" tests of regression slopes, using techniques like jackknife
and bootstrap, can fail badly under heteroscedasticity (the absence of
homoscedasticity). The failure can be in either direction, making the tests
either too liberal or too conservative. This document describes methods for
estimating the standard error of a regression slope that work well even in
the cases that give competing methods the most trouble.

To see why, consider a simple regression predicting *Y* from
*X*. I'll use *SE* to stand for "standard error", and use
*b* to denote a regression slope, so *SE*(*b*) is
the standard error of a regression slope. If a single point at the far right of the
scatterplot happens to be exceptionally high, that random event will tend to
increase *b*. If a single point at the far left is exceptionally high,
that will tend to lower *b*. But if a single point near the mean of
*X* is exceptionally high on *Y*, it will have little effect on
*b*. That is, *b* is influenced primarily by the points at
the far left or the far right. The ordinary regression formula for
*SE*(*b*), however, is determined heavily by *MSE*, the
mean squared error, and all cases count equally in computing *MSE*.
Now suppose *X* is approximately normally distributed. That is not a
requirement of regression, but it is often true. Then most cases will be near
the mean of *X*, and only a few will be at the far left or right. Yet
those few cases are the only ones that have substantial effect on the computed
value of *b*. Therefore
*SE*(*b*) will be affected primarily by the majority of
cases near the mean of *X*, while the actual random fluctuations in
*b* will be determined by the minority of cases at the far left and
right. That is why the assumption of homoscedasticity is so crucial: the data
points affecting *SE*(*b*) and the data points affecting
*b* must be assumed to have the same degree of random variability.
But if the actual scatterplot has a butterfly shape, then the points on the far left
and right (those actually affecting *b* the most) will have more
random variability than those in the center [those having the most effect on the
computed value of *SE*(*b*)], so that the actual standard
error of *b* will exceed its estimated value.

There are at least two reasonable circumstances under which a butterfly pattern can arise. In one case, a population consists of two subpopulations with equal means on all variables, but subpopulation A has greater variability on all variables than subpopulation B. Then the extreme scores on any predictor variable will be mostly from A while the central scores will be mostly from B, so that the central scores will have a lower standard deviation on Y than the extreme scores. For instance, any nation includes natives and immigrants. The immigrants may be from many nations and continents, while by definition the natives are all from one. The immigrants may thus be more heterogeneous than the natives on a wide variety of variables, producing the effect just described.

Another circumstance that can produce butterfly heteroscedasticity occurs when the regression model ignores some variable that interacts with the variable of interest. For instance, imagine studying attitude toward nonstandard computer keyboards as a function of computer experience. Suppose the true situation is that experience interacts with handedness, so that left-handers become more positive toward nonstandard keyboards as their experience increases, while right-handers become less positive with experience. Suppose the two regression lines cross near the mean on experience. Then the attitude-experience scatterplot will have little range on attitude in the middle, where the two regression lines cross, and greater range at the left and right ends. If handedness isn't in the model at all because the investigator overlooked it, then the data will exhibit butterfly heteroscedasticity, and SE(b) for experience will be underestimated.
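To make this scenario concrete, here is a small simulation sketch in Python. The variable names, sample size, slopes, and error standard deviation are all invented purely for illustration; the point is only that opposite slopes crossing near the mean of experience produce the butterfly pattern when handedness is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: experience is roughly normal, handedness is a coin
# flip, and the two groups have slopes of opposite sign that cross near
# the mean of experience.
experience = rng.normal(0.0, 1.0, n)            # centered experience scores
lefthanded = rng.integers(0, 2, n)              # 1 = left-handed, 0 = right-handed
slope = np.where(lefthanded == 1, 0.8, -0.8)    # opposite slopes for the two groups
attitude = slope * experience + rng.normal(0.0, 0.3, n)

# Fit the misspecified model that ignores handedness entirely.
b, a = np.polyfit(experience, attitude, 1)      # slope, intercept
resid = attitude - (a + b * experience)

# Residual spread is small near the mean of experience and large at the
# extremes -- the "butterfly" pattern described above.
central = np.abs(experience) < 0.5
print("residual SD, central cases:", resid[central].std(ddof=1))
print("residual SD, extreme cases:", resid[~central].std(ddof=1))
```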

This document presents three methods for estimating SE(b) under heteroscedasticity. The three methods are closely related to one another, and they appear below in order of increasing computational and conceptual complexity.

This method can be extended to multiple regression by using the concept of the independent component of a regressor or predictor variable. In a problem with 8 regressors (predictors), suppose we think of one of the 8 as the independent variable of interest, and of the other 7 regressors as covariates that are in the regression in order to control them statistically. Denote the one independent variable as X and denote the covariates collectively as C. As before, the dependent variable is Y.

Now imagine using a multiple regression to predict X from C. The
residuals in that regression are uncorrelated with all the variables in C, and
can thus be thought of as the component of X independent of C. Let's denote
this residual variable as X.C. Similarly we can predict Y from C, denoting
the residuals there as Y.C. Now imagine a simple regression predicting Y.C
from X.C. The slope *b* in that simple regression will exactly equal
the multiple regression slope found for X when Y is predicted from X and C
simultaneously.
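This equivalence is easy to verify numerically. The following sketch uses invented data and numpy; the helper name residualize is mine, not part of any standard library. It computes X.C and Y.C and confirms that the simple-regression slope of Y.C on X.C matches the multiple-regression coefficient of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 3                       # k covariates plus one predictor of interest

C = rng.normal(size=(n, k))         # covariates
X = rng.normal(size=n) + C @ np.array([0.5, -0.2, 0.3])
Y = 2.0 + 1.5 * X + C @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)

ones = np.ones((n, 1))

def residualize(v, Z):
    """Residuals of v regressed on Z (plus a constant): the part of v independent of Z."""
    Zc = np.hstack([ones, Z])
    coef, *_ = np.linalg.lstsq(Zc, v, rcond=None)
    return v - Zc @ coef

X_C = residualize(X, C)             # X.C in the text
Y_C = residualize(Y, C)             # Y.C in the text

# Slope of the simple regression of Y.C on X.C
b_simple = (X_C @ Y_C) / (X_C @ X_C)

# Coefficient on X in the full multiple regression of Y on X and C
full = np.hstack([ones, X[:, None], C])
b_full = np.linalg.lstsq(full, Y, rcond=None)[0][1]

print(b_simple, b_full)             # the two slopes agree up to rounding error
```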

Because of this, the simple-regression method for controlling heteroscedasticity can be extended to multiple regression by computing X.C and Y.C, then deleting from the sample the cases closest to the mean on X.C. (That mean is always exactly 0, since residuals always have a mean of 0.) Then proceed as described above.

Let x_{i} denote the deviation of case *i*'s score on X from the mean of X, and let v_{i} denote the true variance of Y for case *i* (the quantity assumed constant across cases under homoscedasticity). Then the true standard error of *b* is

SE(b) = sqrt(SUM x_{i}^{2} v_{i}) / (SUM x_{i}^{2})

In this equation, the values x_{i} are known but the values
v_{i} are unknown. A reasonable (and slightly conservative)
estimate of v_{i} is ex_{i}^{2}, where
ex_{i} is the "deleted residual" of case *i*, meaning
the residual of case *i* from the regression computed in the sample
from which case *i* has been deleted. As discussed shortly, this
value is easier to compute than one might assume. Thus a reasonable
computable value for SE(b) is

SE(b) = sqrt(SUM x_{i}^{2} ex_{i}^{2}) / (SUM x_{i}^{2})

Let *N* denote the number of cases in the sample. It might
seem that computing each value of ex_{i} would require an
entirely new regression in a separate sample of size *N*-1, but it
turns out that all *N* values of ex_{i} can be computed
at once if one can compute the *N* values of h_{i}--values commonly called "leverage" values. Most standard regression
programs will compute leverage values. Let e_{i} denote the
usual residual for case *i*. Then it can be shown that
ex_{i} = e_{i}/(1-h_{i}). Thus in a
program like SYSTAT or MYSTAT, which will compute columns of
e_{i} and h_{i} values under the names RESIDUAL
and LEVERAGE respectively, a single command like LET EX =
RESIDUAL/(1 - LEVERAGE) will compute all the values of
ex_{i} at once.
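Putting these pieces together for simple regression, the computation might be sketched in Python as follows. The function name robust_se_simple is invented for illustration, and the leverage formula h_{i} = 1/N + x_{i}^{2}/SUM x_{j}^{2} is the standard one for a simple regression with an intercept.

```python
import numpy as np

def robust_se_simple(X, Y):
    """Slope b and SE(b) = sqrt(sum(x_i^2 * ex_i^2)) / sum(x_i^2) for a simple regression."""
    x = X - X.mean()                      # deviations from the mean of X
    b = (x @ (Y - Y.mean())) / (x @ x)    # ordinary least-squares slope
    a = Y.mean() - b * X.mean()           # intercept
    e = Y - (a + b * X)                   # ordinary residuals
    h = 1.0 / len(X) + x**2 / (x @ x)     # leverage values for simple regression
    ex = e / (1.0 - h)                    # deleted residuals ex_i = e_i / (1 - h_i)
    se = np.sqrt(np.sum(x**2 * ex**2)) / (x @ x)
    return b, se
```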

As with the earlier method of deleting cases, this procedure can be
adapted to multiple regression by replacing the values of x_{i} by
values I shall denote as xic_{i}, defined as the portion of X
independent of the other regressors. As mentioned earlier, values of
xic_{i} can be found by using regression to predict X from the
other predictor variables and taking the residuals in that regression. Then
enter those residuals in place of the values of x_{i} in the
formula for SE(b).
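A sketch of this multiple-regression extension appears below. The function name is invented; the leverage values are computed directly from the full design matrix rather than read from a regression program's LEVERAGE column, and X is the single predictor of interest while C holds the covariates.

```python
import numpy as np

def robust_se_for_one_coef(X, C, Y):
    """SE for the coefficient on X when Y is regressed on X and the covariates C,
    using sqrt(sum(xic_i^2 * ex_i^2)) / sum(xic_i^2)."""
    n = len(Y)
    ones = np.ones((n, 1))

    # xic_i: the portion of X independent of the other regressors
    Zc = np.hstack([ones, C])
    xic = X - Zc @ np.linalg.lstsq(Zc, X, rcond=None)[0]

    # Full regression of Y on X and C, with its residuals and leverage values
    M = np.hstack([ones, X[:, None], C])
    B = np.linalg.lstsq(M, Y, rcond=None)[0]
    e = Y - M @ B
    h = np.diag(M @ np.linalg.inv(M.T @ M) @ M.T)   # leverage values h_i
    ex = e / (1.0 - h)                              # deleted residuals

    se = np.sqrt(np.sum(xic**2 * ex**2)) / (xic @ xic)
    return B[1], se                                 # coefficient on X and its SE
```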

Let X now denote a matrix of regressor scores, including a unit column
to represent the additive constant in the regression, and let Y denote the
column vector of scores on the dependent variable. Let V denote an N x N
diagonal matrix whose *i*th diagonal entry is v_{i} as
defined earlier. And let B denote the entire vector of regression weights,
according to the familiar matrix regression formula B = (X'X)^{-1}X'Y. Then it can be shown that under heteroscedasticity, the
variance-covariance matrix G for B is G = (X'X)^{-1}X'VX(X'X)^{-1}.
That is, each diagonal entry in G is,
in the notation of previous sections, a value of SE(b)^{2}, and
the off-diagonal entries in G give covariances among the elements of B. As
before, each diagonal entry in V can be estimated by
ex_{i}^{2}. In the following, think of V as the matrix
formed that way.
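Translated literally into numpy, the matrix formula looks like the sketch below (the function name is invented, and ex holds the deleted residuals defined earlier). The next paragraph explains why one would not actually form V this way when N is large.

```python
import numpy as np

def hc_covariance_naive(X, Y, ex):
    """Literal translation of G = (X'X)^{-1} X'VX (X'X)^{-1}, with diag(V) estimated by ex_i^2.
    X must already contain a unit column for the constant."""
    C = np.linalg.inv(X.T @ X)      # (X'X)^{-1}
    V = np.diag(ex**2)              # N x N diagonal matrix -- wasteful when N is large
    G = C @ X.T @ V @ X @ C
    return G                        # diagonal entries are the SE(b)^2 values
```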

It is extremely inefficient to compute G in quite the way suggested by the above matrix formula, since V will be an enormous matrix if N is large, so that any multiplication involving it is very slow. However, because V is diagonal, VX can be computed quickly: each diagonal entry in V simply multiplies all the entries in the corresponding row of X. For instance, if X is

1 3 6
1 5 2
1 8 3
1 3 9
1 7 3

and the diagonal entries in V are respectively 5, 3, 7, 2, 8, then VX is

5 15 30
3 15 6
7 56 21
2 6 18
8 56 24

This computation took only 15 scalar multiplications and no additions. Then X'VX is computed in the ordinary way from X and VX, and the rest of the computation can proceed in the obvious way.
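In numpy, this row-scaling trick applied to the 5 x 3 example above might look like the following; no N x N matrix is ever formed.

```python
import numpy as np

# The 5 x 3 example from the text: scale each row of X by the
# corresponding diagonal entry of V instead of forming V itself.
X = np.array([[1, 3, 6],
              [1, 5, 2],
              [1, 8, 3],
              [1, 3, 9],
              [1, 7, 3]], dtype=float)
v = np.array([5, 3, 7, 2, 8], dtype=float)   # diagonal entries of V

VX = X * v[:, None]        # 15 scalar multiplications, no additions
XtVX = X.T @ VX            # X'VX computed without ever building V
print(VX)
```

The complete computation can be organized as the following sequence of steps.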

Compute C = (X'X)^{-1}

Compute D=C*X'

D has the same number of rows and columns as X'. Define F as the matrix
formed by multiplying each entry in D by the corresponding entry in X'.
Then the N values of h_{i} are simply the column sums of
F.

Run the regression, perhaps using B = CX'Y.

Compute residuals *e* as usual, perhaps using e = Y - X*B.

Compute values of ex_{i} from ex_{i} =
e_{i}/(1 - h_{i}).

Use the values of ex_{i}^{2} to compute VX as described
above.

Then G = D*VX*C.
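Collecting the steps above into a single numpy sketch (the function name is invented, and X is assumed to already contain the unit column):

```python
import numpy as np

def heteroscedasticity_consistent(X, Y):
    """Regression weights B and robust SE(b) values via G = D*VX*C."""
    C = np.linalg.inv(X.T @ X)       # C = (X'X)^{-1}
    D = C @ X.T                      # D = C * X'
    F = D * X.T                      # entrywise product of D and X'
    h = F.sum(axis=0)                # leverage values h_i are the column sums of F
    B = D @ Y                        # run the regression: B = C X'Y
    e = Y - X @ B                    # ordinary residuals
    ex = e / (1.0 - h)               # deleted residuals ex_i = e_i / (1 - h_i)
    VX = X * (ex**2)[:, None]        # VX via row scaling, as described above
    G = D @ VX @ C                   # G = D * VX * C
    return B, np.sqrt(np.diag(G))    # weights and their robust standard errors
```

A call such as B, se = heteroscedasticity_consistent(X, Y) then returns the regression weights together with a heteroscedasticity-consistent standard error for each.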