## A New Diagnostic Measure *d*_{i} for Regression

Richard B. Darlington

Cornell University

**Abstract**
This note introduces a new diagnostic statistic *d*_{i} which measures
the degree to which case *i* distorts the regression plane for the other cases in the
sample. *d*_{i} is exactly proportional to the degree to which *SSE*
(sum of squared errors) for those other cases exceeds the value it would have if case
*i* were not in the sample. *d*_{i} is standardized so that
*d*_{i} = 1 for a case whose standardized residual is 1 and whose
value of leverage equals the mean leverage. *d*_{i} measures a
quality rather similar to the Cook statistic, and in a typical sample values of
*d*_{i} correlate highly with Cook_{i}. But I argue that
*d*_{i} is both simpler than the Cook statistic and conceptually superior to it.

* * * * * * *
#### Notation and Some Basics on Diagnostic Statistics

In regression, diagnostic statistics are statistics computed for each case that
measure, in some sense, the case's influence or leverage on the regression. The three best-known types of
diagnostic measure are *h*_{i}, Cook_{i}, and various
types of residuals. This section reviews those briefly.

*h*_{i} measures the degree to which a case's *pattern*
of regressor (predictor) scores is atypical. For instance, suppose X_{1}
and X_{2} correlate highly with each other, and suppose Joe has one of the
highest scores on X_{1} and one of the lowest scores on X_{2}.
Then Joe might well have by far the highest value of *h*_{i},
even if he has neither the very highest score on X_{1} nor the very lowest
score on X_{2}, since his *pattern* of scores is so unusual. In a
regression with an additive constant, with *P* predictor variables not including the
additive constant, in a sample of size *N*, the mean value of
*h*_{i} is always exactly (*P*+1)/*N*.
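
As a concrete illustration, here is a minimal numpy sketch (simulated data; the setup and variable names are mine, not from the text) that computes each *h*_{i} as a diagonal element of the hat matrix X(X'X)^{-1}X', the standard closed form assumed here, and confirms that the mean leverage is exactly (*P*+1)/*N*.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))                  # P predictors
Xc = np.column_stack([np.ones(N), X])        # prepend the additive constant

H = Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T     # hat matrix
h = np.diag(H)                               # leverage h_i of each case

print(h.mean(), (P + 1) / N)                 # both equal (P+1)/N = 0.08
```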

*h*_{i} is often called *leverage*, because a case
with high *h*_{i} has high ability to lower its own residual by pulling
the regression plane toward itself. Thus we might say that *h*_{i}
measures a case's *potential influence* on the regression plane. But it does not
measure a case's *actual* influence. To see why, suppose Joe's score on Y
happens to place him right on the regression plane that would have existed even if Joe
weren't in the sample. Then adding Joe to that sample wouldn't move the plane at all. Thus
it's reasonable to say that Joe's influence on the regression plane is 0, despite his high
leverage.

The best-known measure of "influence" is Cook_{i}. This statistic is
exactly proportional to the sum of the *N* squared changes in hat-Y values that
would result from deleting case *i* from the sample. It is thus a reasonable
measure of the "influence" of case *i* on the regression plane.
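
The quantity Cook_{i} is proportional to can be computed by brute force: refit the regression with each case deleted in turn and sum the *N* squared changes in the fitted values. A sketch (numpy, simulated data; all names are illustrative), with the proportionality constant checked against the closed form later:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 30, 2
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)
Xc = np.column_stack([np.ones(N), X])

beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
yhat = Xc @ beta                             # fitted values, full sample

def sum_sq_yhat_change(i):
    """Sum over all N cases of the squared change in yhat
    when case i is deleted from the fitting sample."""
    keep = np.arange(N) != i
    b_i = np.linalg.lstsq(Xc[keep], y[keep], rcond=None)[0]
    return np.sum((Xc @ b_i - yhat) ** 2)

changes = np.array([sum_sq_yhat_change(i) for i in range(N)])
# Cook_i is exactly these values divided by (P+1)*MSE, as shown below.
```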

Various types of residual measure the distance that cases fall above or below the
regression plane. Simple residuals are here denoted *e*_{i}. But as
just mentioned, cases with high leverage have high ability to lower their own values of
*e*_{i} toward zero, so the expected value
E(*e*_{i}^{2}) is lower for cases with high
*h*_{i}. This can be fixed with the leverage-corrected residual
*lcr*_{i}, defined as

*lcr*_{i} = *e*_{i}/sqrt(1 - *h*_{i}).

It can be shown that
under the standard assumptions of regression, E(*lcr*_{i}^{2})
equals the unknown true residual variance, regardless of case *i*'s value of
*h*.
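
A small Monte Carlo sketch can make this concrete. Assuming fixed regressors and i.i.d. normal errors with variance 1 (my assumptions, chosen to match the standard regression model), the per-case averages of *lcr*_{i}^{2} should all sit near 1, for high- and low-leverage cases alike:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, reps = 40, 2, 20000
X = rng.normal(size=(N, P))
Xc = np.column_stack([np.ones(N), X])
H = Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T
h = np.diag(H)

M = np.eye(N) - H                    # residual maker: e = M @ y = M @ errors
E = rng.normal(size=(reps, N)) @ M   # residuals for every replication (sigma = 1)
lcr2 = E**2 / (1 - h)                # squared leverage-corrected residuals

# Per-case means of lcr^2 hover near the true residual variance of 1,
# regardless of each case's leverage h_i.
print(lcr2.mean(axis=0).round(2))
```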

Define *SSE* as the usual sum of squared errors. As usual we define the mean squared
error *MSE* as *MSE* = *SSE*/(*N* - *P* - 1). Dividing *lcr*_{i} by sqrt(*MSE*)
standardizes values of *lcr*_{i} to make them independent of *MSE*.
These values are called *standardized residuals* and are here denoted
*str*_{i}. *Studentized residuals*, also called *t-residuals* and here denoted *tr*_{i}, are further transformations
of standardized residuals to make them exactly follow a *t* distribution under the
standard assumptions of regression. Values of *lcr*_{i},
*str*_{i}, and *tr*_{i} are all monotonically
related; to say that case A exceeds case B on one of these measures is to say that A exceeds
B on all of them.
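
Here is a sketch (numpy, simulated data) computing all three residual variants. The studentization step uses the standard external formula *tr*_{i} = *str*_{i} x sqrt[(*N* - *P* - 2)/(*N* - *P* - 1 - *str*_{i}^{2})], which is not given in the text above, so treat it as my assumption; the asserts check the monotone relation just described.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 50, 3
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
Xc = np.column_stack([np.ones(N), X])

beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
e = y - Xc @ beta                                   # simple residuals
h = np.diag(Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T)   # leverage

MSE = e @ e / (N - P - 1)

lcr = e / np.sqrt(1 - h)                            # leverage-corrected
st = lcr / np.sqrt(MSE)                             # standardized
tr = st * np.sqrt((N - P - 2) / (N - P - 1 - st**2))  # studentized (t)

# Monotonicity: all three measures order the cases identically.
assert (np.argsort(lcr) == np.argsort(st)).all()
assert (np.argsort(st) == np.argsort(tr)).all()
```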

A case high on all these residual measures may still have little influence on the
regression plane if its leverage is low. There is a sense in which influence can be thought of
as the product of residual and leverage. Specifically,

Cook_{i} = *str*_{i}^{2} x *h*_{i}/[(1 - *h*_{i}) x (*P* + 1)].

Thus if we ignore *P* on the
ground that it is constant across all cases, Cook_{i} can be expressed as the
product of a particular squared residual and a function of *h*_{i}.
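
The identity can be checked numerically. The sketch below (simulated data; names are mine) computes Cook_{i} from the product form above and compares it with the delete-one definition, the sum of *N* squared changes in hat-Y divided by (*P*+1) x *MSE*:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 30, 2
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)
Xc = np.column_stack([np.ones(N), X])

beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
e = y - Xc @ beta
h = np.diag(Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T)
MSE = e @ e / (N - P - 1)

st2 = e**2 / (MSE * (1 - h))                 # squared standardized residuals
cook = st2 * h / ((1 - h) * (P + 1))         # the product form above

def cook_delete_one(i):
    """Delete-one definition: sum of N squared yhat changes over (P+1)*MSE."""
    keep = np.arange(N) != i
    b_i = np.linalg.lstsq(Xc[keep], y[keep], rcond=None)[0]
    return np.sum((Xc @ b_i - Xc @ beta) ** 2) / ((P + 1) * MSE)

assert np.allclose(cook, [cook_delete_one(i) for i in range(N)])
```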

#### A Measure of Distortion

Unless case *i* has 0 influence because it falls exactly on the regression plane fitted
to the other (*N*-1) cases, case *i* always moves the regression plane so
as to lower its own residual, but it always raises the sum of squared errors of the other
(*N*-1) cases. Thus when we said above that Cook is proportional to the sum of
*N* squared changes in hat-Y values, we were referring to a sum that mixes two
kinds of change: a drop in the case's own squared residual, and a rise in the sum of the
remaining squared residuals. It can be shown that the sum of squared changes in the hat-Y
values for those (*N* - 1) other cases exactly equals the increase in *SSE* for those
same cases that results from adding case *i* to the sample. We might call that
increase in *SSE* a measure of the degree to which case *i* *distorts* the
regression plane for those other cases. This measure will nearly always be rather similar to
the sum of *N* squared changes that is closely related to the Cook statistic, since
it's the sum of (*N* - 1) of those *N* values. But it does seem to me that
the sum of (*N* - 1) values measures something which is both conceptually purer
and more important. As already mentioned, it measures (in a way the Cook statistic does
not) the degree to which each case distorts the regression plane for the other cases.
It can be shown that this increase in *SSE* for the other (*N* - 1) cases equals
*e*_{i}^{2} *h*_{i}/(1 - *h*_{i}). For reasons to be
explained shortly, we shall define a measure of distortion *d*_{i} as
this increase times the constant *N*/[(*P*+1) x *MSE*]. We can then
write

*d*_{i} = *e*_{i}^{2} *h*_{i}/(1 - *h*_{i}) x *N*/[(*P*+1) x *MSE*].

For users who can easily compute values of *e*_{i} and
*h*_{i}, this will typically be the easiest formula for computing values
of *d*_{i}. The set of values *e*_{i}^{2}
*h*_{i}/(1 - *h*_{i}) can be computed
from *e*_{i} and *h*_{i} with a few
commands, and the constant *N*/[(*P*+1) x *MSE*] can be found on a
pocket calculator and then multiplied by those values.
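
Here is a sketch of that computation (numpy, simulated data; names are mine), together with a brute-force check that *e*_{i}^{2} *h*_{i}/(1 - *h*_{i}) really is the rise in *SSE* for the other (*N* - 1) cases:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 30, 2
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)
Xc = np.column_stack([np.ones(N), X])

beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
e = y - Xc @ beta
h = np.diag(Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T)
MSE = e @ e / (N - P - 1)

raw = e**2 * h / (1 - h)             # rise in SSE for the other N-1 cases
d = raw * N / ((P + 1) * MSE)        # the distortion measure d_i

def sse_increase(i):
    """How much case i's presence raised SSE for the remaining cases:
    their SSE under the full fit minus their SSE under the fit without i."""
    keep = np.arange(N) != i
    b_i = np.linalg.lstsq(Xc[keep], y[keep], rcond=None)[0]
    sse_with = np.sum((y[keep] - Xc[keep] @ beta) ** 2)
    sse_without = np.sum((y[keep] - Xc[keep] @ b_i) ** 2)
    return sse_with - sse_without

assert np.allclose(raw, [sse_increase(i) for i in range(N)])
```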

We now show why we chose that particular multiplicative constant. Since

*str*_{i}^{2} = *e*_{i}^{2}/[*MSE* x (1 - *h*_{i})],

substitution yields

*d*_{i} = *str*_{i}^{2} x *h*_{i} x *N*/(*P*+1).

We might regard a "typical case" as one for which *str*_{i}^{2} equals its
expected value of 1 and *h*_{i} equals its mean value of (*P*+1)/*N*.
We see that for such a case, *d*_{i} = 1. Thus by including
*N*/[(*P*+1) x *MSE*]
in the definition of *d*_{i}, we have made *d*_{i} equal the amount
that case *i* has raised *SSE* for the other (*N* - 1) cases, divided by the increase in *SSE*
produced by the "typical case" just described.
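
Finally, a sketch (simulated data; names are mine) verifying the substitution numerically: the two formulas for *d*_{i} agree case by case, and plugging the "typical case" values into the *str* form returns exactly 1.

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 30, 2
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)
Xc = np.column_stack([np.ones(N), X])

beta = np.linalg.lstsq(Xc, y, rcond=None)[0]
e = y - Xc @ beta
h = np.diag(Xc @ np.linalg.inv(Xc.T @ Xc) @ Xc.T)
MSE = e @ e / (N - P - 1)

d_raw = e**2 * h / (1 - h) * N / ((P + 1) * MSE)   # e/h form of d_i
st2 = e**2 / (MSE * (1 - h))                       # squared str_i
d_str = st2 * h * N / (P + 1)                      # str form of d_i
assert np.allclose(d_raw, d_str)                   # identical, case by case

# The "typical case": str^2 = 1 and h = (P+1)/N give d = 1 exactly.
print(1.0 * ((P + 1) / N) * N / (P + 1))           # -> 1.0
```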