A New Diagnostic Measure di for Regression

Richard B. Darlington
Cornell University

Abstract

This note introduces a new diagnostic statistic di which measures the degree to which case i distorts the regression plane for the other cases in the sample. di is exactly proportional to the degree to which SSE (sum of squared errors) for those other cases exceeds the value it would have if case i were not in the sample. di is standardized so that di = 1 for a case whose standardized residual is 1 and whose value of leverage equals the mean leverage. di measures a quantity rather similar to the one measured by the Cook statistic, and in a typical sample values of di correlate highly with values of Cook. But I argue that di is both simpler than Cook and conceptually superior.

* * * * * * *

Notation and Some Basics on Diagnostic Statistics

In regression, diagnostic statistics are statistics computed for each case, which in some sense measure the case's influence or leverage on the regression. The three best-known types of diagnostic measure are hi, Cooki, and various types of residuals. This section reviews those briefly.

hi measures the degree to which a case's pattern of regressor (predictor) scores is atypical. For instance, suppose X1 and X2 correlate highly with each other, and suppose Joe has one of the highest scores on X1 and one of the lowest scores on X2. Then Joe might well have by far the highest value of hi, even if he has neither the very highest score on X1 nor the very lowest score on X2, since his pattern of scores is so unusual. In a regression with an additive constant, with P predictor variables not including the additive constant, in a sample of size N, the mean value of hi is always exactly (P+1)/N.
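The claim about the mean of hi can be checked numerically. The following is a minimal sketch, assuming the simplest case of one predictor plus a constant (P = 1), where leverage reduces to the well-known expression 1/N + (x - mean)^2/Sxx. The data values and the function name leverages are hypothetical, chosen only for illustration.

```python
# Leverage h_i for a simple regression (P = 1 predictor plus a constant).
# The data are hypothetical; only the standard library is used.

def leverages(x):
    """Return h_i = 1/N + (x_i - mean)^2 / Sxx for each case."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

x = [1.0, 2.0, 3.0, 4.0, 10.0]   # the last case lies far from the others
h = leverages(x)
N, P = len(x), 1
mean_h = sum(h) / N

print([round(hi, 3) for hi in h])
print(mean_h, (P + 1) / N)       # these agree up to rounding error
```

With a single predictor, an atypical "pattern" of regressor scores is just a large distance from the mean of X. With several predictors, hi is the corresponding diagonal entry of the hat matrix, which this sketch does not compute.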

hi is often called leverage, because a case with high hi has high ability to lower its own residual by pulling the regression plane toward itself. Thus we might say that hi measures a case's potential influence on the regression plane. But it does not measure a case's actual influence. To see why, suppose Joe's score on Y happens to place him right on the regression plane that would have existed even if Joe weren't in the sample. Then adding Joe to that sample wouldn't move the plane at all. Thus it's reasonable to say that Joe's influence on the regression plane is 0, despite his high leverage.

The best-known measure of "influence" is Cooki. This statistic is exactly proportional to the sum of the N squared changes in hat-Y values that would result from deleting case i from the sample. It is thus a reasonable measure of the "influence" of case i on the regression plane.

Various types of residual measure the distance that cases fall above or below the regression plane. Simple residuals are here denoted ei. But as just mentioned, cases with high leverage have high ability to lower their own values of ei toward zero, so the expected value E(ei^2) is lower for cases with high hi. This can be fixed with the leverage-corrected residual lcri, defined as
lcri = ei/sqrt(1 - hi).
It can be shown that under the standard assumptions of regression, E(lcri^2) equals the unknown true residual variance, regardless of case i's value of hi.
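The "it can be shown" step follows from a standard result on the covariance of the residual vector. The sketch below uses symbols that do not appear in the text above and are assumptions here: H for the hat matrix, whose diagonal entries are the hi, and \sigma^2 for the true residual variance.

```latex
% The residual vector is e = (I - H) y, and I - H is symmetric and idempotent, so
\operatorname{Var}(e) = \sigma^2 (I - H)
\quad\Longrightarrow\quad
E(e_i^2) = \sigma^2 (1 - h_i).
% Dividing e_i by \sqrt{1 - h_i} therefore removes the dependence on h_i:
E(\mathrm{lcr}_i^2) = \frac{E(e_i^2)}{1 - h_i} = \sigma^2 .
```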

Define SSE as the usual sum of squared errors. As usual we define the mean squared error MSE as MSE = SSE/(N-P-1). Dividing lcri by sqrt(MSE) standardizes values of lcri to make them independent of MSE. These values are called standardized residuals and are here denoted stri. Studentized residuals, also called t-residuals and here denoted tri, are further transformations of standardized residuals to make them exactly follow a t distribution under the standard assumptions of regression. Values of lcri, stri, and tri are all monotonically related; to say that case A exceeds case B on one of these measures is to say that A exceeds B on all of them.
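The three residual measures can be computed in a few lines. This is a minimal sketch for a simple regression (P = 1) on hypothetical data. The conversion from stri to tri uses the standard formula tri = stri x sqrt((N-P-2)/(N-P-1-stri^2)), which is not stated in the text above and is supplied here as an assumption about what "further transformation" means.

```python
import math

# Hypothetical data for a simple regression (P = 1 predictor plus a constant).
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 6.0]
N, P = len(x), 1

# Least-squares fit.
mean_x, mean_y = sum(x) / N, sum(y) / N
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
intercept = mean_y - slope * mean_x

e = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]   # simple residuals
h = [1.0 / N + (xi - mean_x) ** 2 / sxx for xi in x]          # leverages
mse = sum(ei ** 2 for ei in e) / (N - P - 1)

# Leverage-corrected, standardized, and studentized residuals.
lcr = [ei / math.sqrt(1 - hi) for ei, hi in zip(e, h)]
str_ = [v / math.sqrt(mse) for v in lcr]
tr = [s * math.sqrt((N - P - 2) / (N - P - 1 - s ** 2)) for s in str_]

for i in range(N):
    print(i, round(lcr[i], 3), round(str_[i], 3), round(tr[i], 3))
```

Ranking the cases by absolute value gives the same order on all three measures, which is the monotonic relation described above.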

A case high on all these residual measures may still have little influence on the regression plane if its leverage is low. There is a sense in which influence can be thought of as the product of residual and leverage. Specifically,

Cooki = stri^2 x hi/[(1-hi) x (P+1)].

Thus if we ignore P on the ground that it is constant across all cases, Cooki can be expressed as the product of a particular squared residual and a function of hi.
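The two descriptions of Cooki given so far (a scaled sum of N squared changes in hat-Y values, and a product of a squared residual and a function of hi) can be compared numerically. The sketch below assumes a simple regression on hypothetical data and uses the usual scaling constant (P+1) x MSE for the deletion form. The helper fit_line is an illustrative name, not part of any library.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 6.0]
N, P = len(x), 1

a, b = fit_line(x, y)
fitted = [a + b * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, fitted)]
mean_x = sum(x) / N
sxx = sum((xi - mean_x) ** 2 for xi in x)
h = [1.0 / N + (xi - mean_x) ** 2 / sxx for xi in x]
mse = sum(ei ** 2 for ei in e) / (N - P - 1)

cook_formula, cook_deletion = [], []
for i in range(N):
    # Closed form: squared standardized residual times a function of h_i.
    str2 = e[i] ** 2 / (mse * (1 - h[i]))
    cook_formula.append(str2 * h[i] / ((1 - h[i]) * (P + 1)))
    # Brute force: refit without case i, then sum the N squared changes in hat-Y.
    a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    change = sum((fitted[j] - (a_i + b_i * x[j])) ** 2 for j in range(N))
    cook_deletion.append(change / ((P + 1) * mse))

for cf, cd in zip(cook_formula, cook_deletion):
    print(round(cf, 4), round(cd, 4))
```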

A Measure of Distortion

Unless case i has 0 influence because it falls exactly on the regression plane fitted to the other (N-1) cases, case i always moves the regression plane so as to lower its own residual, and in doing so it always raises the sum of squared errors of the other (N-1) cases. Thus when we said above that Cook is proportional to the sum of N squared changes in hat-Y values, we were referring to a sum that mixes two kinds of change: the one change in case i's own hat-Y value, which lowers its own squared residual, and the (N - 1) changes in the other cases' hat-Y values, which raise the sum of their squared residuals. It can be shown that the sum of squared changes in the hat-Y values for those (N - 1) other cases exactly equals the increase in SSE for those same cases that results from adding case i to the sample. We might call that increase in SSE a measure of the degree to which case i distorts the regression plane for those other cases. This measure will nearly always be rather similar to the sum of N squared changes to which the Cook statistic is proportional, since it is the sum of (N - 1) of those N values. But it does seem to me that the sum of (N - 1) values measures something that is both conceptually purer and more important: as already mentioned, it measures (in a way the Cook statistic does not) the degree to which each case distorts the regression plane for the other cases.
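The identity asserted in this paragraph can be checked by brute force. The following is a minimal sketch on hypothetical data for a simple regression: for each case i it refits the line without that case, then compares the sum of squared changes in hat-Y over the other (N - 1) cases with the increase in those cases' SSE. The helper fit_line is an illustrative name.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 6.0]
N = len(x)

a, b = fit_line(x, y)   # fit with all N cases

squared_changes, sse_increases = [], []
for i in range(N):
    a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])  # fit without case i
    others = [j for j in range(N) if j != i]
    # Sum of squared changes in hat-Y for the other cases.
    squared_changes.append(
        sum(((a + b * x[j]) - (a_i + b_i * x[j])) ** 2 for j in others))
    # SSE of the other cases with case i in the sample, minus SSE without it.
    sse_with = sum((y[j] - (a + b * x[j])) ** 2 for j in others)
    sse_without = sum((y[j] - (a_i + b_i * x[j])) ** 2 for j in others)
    sse_increases.append(sse_with - sse_without)

for sc, si in zip(squared_changes, sse_increases):
    print(round(sc, 6), round(si, 6))
```

The two columns agree because the residuals from the fit to the other cases are orthogonal to any change in the fitted plane, so the squared changes add directly to those cases' SSE.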

It can be shown that this increase in SSE for the other (N - 1) cases equals ei^2 hi/(1-hi). For reasons to be explained shortly, we shall define a measure of distortion di as this increase times the constant N/[(P+1) x MSE]. We can then write

di = ei^2 hi/(1-hi) x N/[(P+1) x MSE]

For users who can easily compute values of ei and hi, this will typically be the easiest formula for computing values of di. The set of values ei^2 hi/(1-hi) can typically be computed from ei and hi with a few commands, and the constant N/[(P+1) x MSE] can be computed on a pocket calculator, then multiplied by the other values.
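The "few commands" might look like the following minimal sketch. It assumes a simple regression (P = 1) on hypothetical data, so that ei and hi can be obtained with the standard library alone. In practice both would normally come from a regression package.

```python
# Hypothetical data for a simple regression (P = 1 predictor plus a constant).
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 6.0]
N, P = len(x), 1

# Least-squares fit, residuals e_i, and leverages h_i.
mean_x, mean_y = sum(x) / N, sum(y) / N
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
intercept = mean_y - slope * mean_x
e = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
h = [1.0 / N + (xi - mean_x) ** 2 / sxx for xi in x]
mse = sum(ei ** 2 for ei in e) / (N - P - 1)

# Step 1: increase in the other cases' SSE caused by each case.
increase = [ei ** 2 * hi / (1 - hi) for ei, hi in zip(e, h)]
# Step 2: the constant, computed once.
constant = N / ((P + 1) * mse)
# Step 3: the distortion measure d_i.
d = [inc * constant for inc in increase]

print([round(di, 3) for di in d])
```

In this made-up sample the high-leverage case at x = 10 has by far the largest di, even though its simple residual is the smallest in absolute value.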

We now show why we chose that particular multiplicative constant. Since

stri^2 = ei^2/[MSE x (1-hi)],

substitution yields

di = stri^2 x hi x N/(P+1)

We might regard a "typical case" as one for which stri^2 equals its expected value of 1 and hi equals its mean value of (P+1)/N. We see that for such a case, di = 1. Thus by including N/[(P+1) x MSE] in the definition of di, we have made di equal the amount that case i has raised SSE for the other (N - 1) cases, divided by the increase in SSE produced by the "typical case" just described.
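Both points in this derivation can be confirmed numerically: the two formulas for di agree, and the "typical case" scores exactly 1. This is a minimal sketch on hypothetical data for a simple regression. The function name d_from_str is an illustrative choice.

```python
def d_from_str(str2, h, n, p):
    """d_i from a squared standardized residual and a leverage."""
    return str2 * h * n / (p + 1)

# Hypothetical data for a simple regression (P = 1 predictor plus a constant).
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 6.0]
N, P = len(x), 1

mean_x, mean_y = sum(x) / N, sum(y) / N
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
intercept = mean_y - slope * mean_x
e = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
h = [1.0 / N + (xi - mean_x) ** 2 / sxx for xi in x]
mse = sum(ei ** 2 for ei in e) / (N - P - 1)

# Form 1: from simple residuals and leverages.
d_from_e = [ei ** 2 * hi / (1 - hi) * N / ((P + 1) * mse) for ei, hi in zip(e, h)]
# Form 2: from squared standardized residuals and leverages.
str2 = [ei ** 2 / (mse * (1 - hi)) for ei, hi in zip(e, h)]
d_from_s = [d_from_str(s, hi, N, P) for s, hi in zip(str2, h)]

# The "typical case": str^2 = 1 and h equal to the mean leverage (P+1)/N.
d_typical = d_from_str(1.0, (P + 1) / N, N, P)

print([round(v, 4) for v in d_from_e])
print([round(v, 4) for v in d_from_s])
print(d_typical)
```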