Correcting the standard errors of regression slopes for heteroscedasticity

Richard B. Darlington

Standard methods of simple and multiple regression assume homoscedasticity--the condition that all conditional distributions of the dependent variable Y have the same standard deviation. When one tests the significance of regression slopes in simple or multiple regression, the accuracy of the test depends heavily on that assumption. In fact, even "robust" tests of regression slopes, using techniques like the jackknife and the bootstrap, can fail badly under heteroscedasticity (the absence of homoscedasticity). The failure can be in either direction, making the tests either too liberal or too conservative. This document describes methods for estimating the standard error of a regression slope that work well even in the cases that give competing methods the most trouble.

When are standard formulas for standard errors most misleading?

To understand these methods, the most important point to know is that the patterns of heteroscedasticity that most distort statistical inference are not the most common or obvious cases. The most common case is what we might call the "wedge" because the XY scatterplot looks like a wedge, narrow at one end and fat at the other. The narrow end of the wedge might be at either the left or the right. But a wedge-shaped scatterplot does not distort statistical inference nearly as much as two other shapes that I'll call "butterfly" and "galaxy" shapes. A scatterplot with a butterfly shape is thick at both its left and right ends, and thin in the middle. A galaxy shape is the opposite, thick in the middle and thin on both the left and right. A butterfly shape leads to liberal errors because the standard error of the regression slope is underestimated by standard methods--or by nearly any method for that matter. A galaxy shape leads to conservative errors because the standard error is overestimated.

To see why, consider a simple regression predicting Y from X. I'll use SE to stand for "standard error" and b to denote a regression slope, so SE(b) is the standard error of a regression slope. If a single point at the far right of the scatterplot happens to be exceptionally high, that random event will tend to increase b. If a single point at the far left is exceptionally high, that will tend to lower b. But if a single point near the mean of X is exceptionally high on Y, it will have little effect on b. That is, b is influenced primarily by the points at the far left and far right. The ordinary regression formula for SE(b), on the other hand, is determined heavily by MSE, the mean squared error, and all cases count equally in computing MSE. Now suppose X is approximately normally distributed--not a requirement of regression, but often true. Then most cases will be near the mean of X and only a few will be at the far left or right, yet those few are the only cases with a substantial effect on the computed value of b. Therefore SE(b) will be affected primarily by the majority of cases near the mean of X, while the actual random fluctuation in b is determined by the minority of cases at the far left and right. That is why the assumption of homoscedasticity is so crucial: the data points affecting SE(b) and the data points affecting b must be assumed to have the same degree of random variability. If the actual scatterplot has a butterfly shape, then the points at the far left and right (those actually affecting b the most) will have more random variability than those in the center (those having the most effect on the computed value of SE(b)), so the actual standard error of b will exceed its estimated value.
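
Readers who want to see this effect numerically can run a small simulation like the following sketch in Python with numpy. Making the error standard deviation grow with the distance of X from its mean is just one convenient way to produce a butterfly pattern; the sketch compares the average textbook SE(b) with the observed standard deviation of b across replications.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 200, 5000
    slopes, nominal_se = [], []
    for _ in range(reps):
        x = rng.normal(size=n)
        # Butterfly pattern: error SD grows with |x| (X is centered at 0).
        y = 2.0 + 0.5 * x + rng.normal(size=n) * np.abs(x)
        xd = x - x.mean()                                   # deviation scores
        b = (xd * y).sum() / (xd ** 2).sum()                # OLS slope
        resid = y - y.mean() - b * xd
        mse = (resid ** 2).sum() / (n - 2)
        slopes.append(b)
        nominal_se.append(np.sqrt(mse / (xd ** 2).sum()))   # textbook SE(b)

    print("average nominal SE(b):           ", np.mean(nominal_se))
    print("actual SD of b over replications:", np.std(slopes))
    # Under this butterfly pattern the nominal SE(b) comes out well below
    # the actual standard deviation of b, i.e., the usual test is too liberal.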

There are at least two reasonable circumstances under which a butterfly pattern can arise. In one case, a population consists of two subpopulations with equal means on all variables, but subpopulation A has greater variability on all variables than subpopulation B. Then the extreme scores on any predictor variable will be mostly from A while the central scores will be mostly from B, so that the central scores will have a lower standard deviation on Y than the extreme scores. For instance, any nation includes natives and immigrants. The immigrants may be from many nations and continents, while by definition the natives are all from one. The immigrants may thus be more heterogeneous than the natives on a wide variety of variables, producing the effect just described.

Another circumstance that can produce butterfly heteroscedasticity occurs when the regression model totally ignores some variable that interacts with the variable of interest. For instance, imagine studying attitude about nonstandard computer keyboards as a function of computer experience. Suppose that the true situation is that experience interacts with handedness, so that lefthanders become more positive toward nonstandard keyboards as their experience increases, while righthanders become less positive with experience. Suppose the two regression lines cross near the mean on experience. Then the attitude-experience scatterplot will have little range on attitude in the middle where the two regression lines cross, and will have greater range at the left and right ends. If handedness isn't in the model at all because the investigator overlooked it, then the data will exhibit butterfly heteroscedasticity, and SE(b) for experience will be underestimated.

This document presents three methods for estimating SE(b) under heteroscedasticity. The various methods are all related to each other. They appear below in order of increasing computational and conceptual complexity.

Deletion of central cases

The first method, when applied to simple regression, consists simply of deleting from the sample the cases near the mean of X, to prevent those cases from affecting the computed values of MSE and SE(b). By reasoning not given here, I suggest deleting the central 46% of cases, leaving the leftmost 27% and the rightmost 27% of cases.
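
Below is a minimal sketch of this procedure in Python with numpy. It assumes the data are in arrays x and y, and it computes both b and SE(b) from the retained cases, which is one straightforward reading of the rule above.

    import numpy as np

    def se_b_central_deletion(x, y, keep_frac=0.27):
        """Simple regression of y on x using only the leftmost and rightmost
        keep_frac of cases on X (so the central ~46% are deleted by default)."""
        order = np.argsort(x)
        k = int(round(keep_frac * len(x)))
        keep = np.concatenate([order[:k], order[-k:]])      # drop the central cases
        xs, ys = x[keep], y[keep]
        xd = xs - xs.mean()                                 # deviation scores
        sxx = (xd ** 2).sum()
        b = (xd * ys).sum() / sxx                           # slope from retained cases
        resid = ys - ys.mean() - b * xd
        mse = (resid ** 2).sum() / (len(xs) - 2)
        return b, np.sqrt(mse / sxx)                        # b and SE(b)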

This method can be extended to multiple regression by use of the concept of the independent component of a regressor or predictor variable. In a problem with 8 regressors (predictors), suppose we're thinking of one of the 8 as the independent variable of interest, and are thinking of the other 7 regressors as covariates which are in the regression in order to control them statistically. Denote the one independent variable as X and denote the covariates collectively as C. As before, the dependent variable is Y.

Now imagine using a multiple regression to predict X from C. The residuals in that regression are uncorrelated with all the variables in C, and can thus be thought of as the component of X independent of C. Let's denote this residual variable as X.C. Similarly we can predict Y from C, denoting the residuals there as Y.C. Now imagine a simple regression predicting Y.C from X.C. The slope b in that simple regression will exactly equal the multiple regression slope found for X when Y is predicted from X and C simultaneously.
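
This equivalence is easy to verify numerically. In the sketch below (Python with numpy; the simulated data are purely illustrative), the slope from regressing Y.C on X.C matches the coefficient on X in the full multiple regression.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    C = np.column_stack([np.ones(n), rng.normal(size=(n, 7))])   # constant plus 7 covariates
    x = rng.normal(size=n) + C[:, 1]
    y = 1.0 + 0.5 * x + C @ rng.normal(size=8) + rng.normal(size=n)

    # X.C and Y.C: residuals from predicting X and Y from C.
    x_c = x - C @ np.linalg.lstsq(C, x, rcond=None)[0]
    y_c = y - C @ np.linalg.lstsq(C, y, rcond=None)[0]

    # Slope of Y.C on X.C (both residual vectors have mean 0, so no intercept is needed) ...
    b_simple = (x_c * y_c).sum() / (x_c ** 2).sum()
    # ... equals the coefficient on X when Y is predicted from X and C together.
    b_full = np.linalg.lstsq(np.column_stack([C, x]), y, rcond=None)[0][-1]
    print(b_simple, b_full)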

Because of this, the simple-regression method for controlling heteroscedasticity can be extended to multiple regression by computing X.C and Y.C, then deleting from the sample the cases closest to the mean on X.C. (That mean is always exactly 0, since residuals always have a mean of 0.) Then proceed as described above.
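
Continuing the sketch above (it reuses x_c, y_c, and n), the deletion method for multiple regression keeps only the cases most extreme on X.C and then runs the simple regression of Y.C on X.C in that reduced sample. The degrees-of-freedom correction shown is the simple-regression one and should be taken as approximate.

    # Keep the 27% of cases lowest on X.C and the 27% highest on X.C.
    order = np.argsort(x_c)
    k = int(round(0.27 * n))
    keep = np.concatenate([order[:k], order[-k:]])
    xk, yk = x_c[keep], y_c[keep]

    xd = xk - xk.mean()                          # re-center within the retained cases
    sxx = (xd ** 2).sum()
    b = (xd * yk).sum() / sxx
    resid = yk - yk.mean() - b * xd
    mse = (resid ** 2).sum() / (len(xk) - 2)
    se_b = np.sqrt(mse / sxx)                    # SE(b) from the retained cases only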

Weighting of cases

The second method is somewhat more complex but should give somewhat more accurate results. As before, we can describe first a method for simple regression, then extend that method to multiple regression. Assume that each individual case i is drawn from a subpopulation with its own unknown true variance vi on Y. In the simple regression, let xi denote the deviation score of case i on X. Let SUM denote summation, usually represented by the Greek letter sigma. Then it can be shown that

SE(b) = sqrt(SUM xi^2 vi) / (SUM xi^2)

In this equation, the values xi are known but the values vi are unknown. A reasonable (and slightly conservative) estimate of vi is exi^2, where exi is the "deleted residual" of case i, meaning the residual of case i from the regression computed in the sample from which case i has been deleted. As discussed shortly, this value is easier to compute than one might assume. Thus a reasonable computable value for SE(b) is

SE(b) = sqrt(SUM xi^2 exi^2) / (SUM xi^2)

Let N denote the number of cases in the sample. It might seem that computing each value of exi would require an entirely new regression in a separate sample of size N-1, but it turns out that all N values of exi can be computed at once if one can compute the N values of hi, commonly called "leverage" values. Most standard regression programs will compute leverage values. Let ei denote the usual residual for case i. Then it can be shown that exi = ei/(1 - hi). Thus in a program like SYSTAT or MYSTAT, which will compute columns of ei and hi values under the names RESIDUAL and LEVERAGE respectively, a single command like LET EX = RESIDUAL/(1 - LEVERAGE) will compute all the values of exi at once.
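
For readers working in a matrix language rather than SYSTAT or MYSTAT, here is a minimal numpy sketch of the same computation for simple regression; the leverage formula it uses, hi = 1/N + xi^2 / SUM xi^2, is the standard one for a simple regression with an intercept.

    import numpy as np

    def hetero_se_simple(x, y):
        """Slope b and the SE(b) of the formula above:
        SE(b) = sqrt(SUM xi^2 exi^2) / SUM xi^2, with exi = ei/(1 - hi)."""
        n = len(x)
        xd = x - x.mean()                      # deviation scores xi
        sxx = (xd ** 2).sum()
        b = (xd * y).sum() / sxx
        e = y - y.mean() - b * xd              # ordinary residuals ei
        h = 1.0 / n + xd ** 2 / sxx            # leverage values hi
        ex = e / (1.0 - h)                     # deleted residuals exi
        se_b = np.sqrt((xd ** 2 * ex ** 2).sum()) / sxx
        return b, se_b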

As with the earlier method of deleting cases, this procedure can be adapted to multiple regression by replacing the values of xi with the cases' values on X.C--the portion of X independent of the other regressors--which I'll denote x.ci. As mentioned earlier, the values of x.ci can be found by using regression to predict X from the other predictor variables and taking the residuals from that regression. Then enter those residuals in place of the values of xi in the formula for SE(b).
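
A numpy sketch of this adaptation follows. It assumes X already contains a unit column for the constant, and it takes the deleted residuals exi from the full regression of Y on all the regressors, using the relation exi = ei/(1 - hi) given above (this is consistent with the matrix approach in the next section).

    import numpy as np

    def hetero_se_multiple(X, y, j):
        """SE of the coefficient on column j of X, replacing xi with x.ci
        (the residuals of column j on the other columns) in the SE(b) formula."""
        n, p = X.shape
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b                                             # ordinary residuals ei
        h = np.einsum('ij,ij->i', X @ np.linalg.inv(X.T @ X), X)  # leverage values hi
        ex = e / (1.0 - h)                                        # deleted residuals exi
        others = np.delete(np.arange(p), j)
        xc = X[:, j] - X[:, others] @ np.linalg.lstsq(X[:, others], X[:, j], rcond=None)[0]
        se = np.sqrt((xc ** 2 * ex ** 2).sum()) / (xc ** 2).sum()
        return b[j], se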

A matrix approach

The approach in this section requires matrix operations such as matrix inversion and multiplication. Since it can be implemented only by someone who understands matrix algebra, matrix notation is used freely in explaining it. The approach has two advantages: one can compute SE(b) for all regressors simultaneously, and one can estimate the covariances among the b's, thereby allowing tests on linear functions of the b's.

Let X now denote a matrix of regressor scores, including a unit column to represent the additive constant in the regression, and let Y denote the column vector of scores on the dependent variable. Let V denote an N x N diagonal matrix whose ith diagonal entry is vi as defined earlier. And let B denote the entire vector of regression weights, according to the familiar matrix regression formula B = (X'X)^-1 X'Y. Then it can be shown that under heteroscedasticity, the variance-covariance matrix G for B is G = (X'X)^-1 X'VX (X'X)^-1. That is, each diagonal entry in G is, in the notation of previous sections, a value of SE(b)^2, and the off-diagonal entries in G give covariances among the elements of B. As before, each diagonal entry in V can be estimated by exi^2. In the following, think of V as the matrix formed that way.

It is extremely inefficient to compute G in quite the way suggested by the above matrix formula, since V will be an enormous matrix if N is large, so that any multiplication involving it is very slow. However, VX can be computed quickly without ever forming the full matrix V: simply multiply each diagonal entry of V by all the entries in the corresponding row of X. For instance, if X is

     1    3    6
     1    5    2
     1    8    3
     1    3    9
     1    7    3
and the diagonal entries in V are respectively 5, 3, 7, 2, 8, then VX is
     5    15   30
     3    15    6
     7    56   21
     2     6   18
     8    56   24
and this computation took only 15 scalar multiplications and no additions. Then X'VX is computed in the ordinary way from X and VX, and the rest of the computation can proceed in the obvious way.
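
In a matrix language such as numpy, this row scaling is a single broadcast operation; the sketch below reproduces the small example above.

    import numpy as np

    X = np.array([[1, 3, 6],
                  [1, 5, 2],
                  [1, 8, 3],
                  [1, 3, 9],
                  [1, 7, 3]], dtype=float)
    v = np.array([5, 3, 7, 2, 8], dtype=float)   # the diagonal entries of V
    VX = v[:, None] * X                          # multiplies row i of X by v[i]
    XtVX = X.T @ VX                              # X'VX, computed from X and VX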

Computation of h

If it is desired to use matrix formulas to compute both the h-values and G from scratch, an efficient procedure is as follows.

Compute C = (X'X)^-1

Compute D = C*X'

D has the same number of rows and columns as X'. Define F as the matrix formed by multiplying each entry in D by the corresponding entry in X'. Then the N values of hi are simply the column sums of F.

Run the regression, perhaps using B = CX'Y.

Compute residuals e as usual, perhaps using e = Y - X*B.

Compute values of exi from exi = ei/(1 - hi).

Use the values of exi^2 to compute VX as described above.

Then G = D*VX*C.
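
Putting these steps together, a compact numpy sketch of the whole procedure might look like the following (the function name is arbitrary); the corrected standard errors are the square roots of the diagonal entries of G.

    import numpy as np

    def hetero_cov_matrix(X, Y):
        """Regression weights B and the matrix G = (X'X)^-1 X'VX (X'X)^-1,
        with the diagonal of V estimated by exi^2, following the steps above."""
        C = np.linalg.inv(X.T @ X)            # C = (X'X)^-1
        D = C @ X.T                           # D = C*X'
        h = (D * X.T).sum(axis=0)             # column sums of F; the leverage values hi
        B = D @ Y                             # B = CX'Y
        e = Y - X @ B                         # ordinary residuals ei
        ex = e / (1.0 - h)                    # deleted residuals exi
        VX = (ex ** 2)[:, None] * X           # VX by the row-scaling shortcut
        G = D @ VX @ C                        # G = D*VX*C
        return B, G

    # Corrected standard errors: np.sqrt(np.diag(G))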