The first few paragraphs of this work describe 5 major advantages that result from the use of multiple regression, simultaneous linear equations, and regression-based time-series analysis in statistical process control (quality control).

Statistical process control (SPC) is the use of statistical methods to improve the quality or uniformity of the output of a process--usually a manufacturing process. Many SPC methods can apply to other processes as well, such as the processing of checks in a bank.

For simplicity I write throughout this document as if all adjustments to processes involved resetting dials on a machine, although of course in reality there are many other ways to adjust a process.

The methods here have the following advantages over more basic methods:

1. Simpler methods often determine merely whether a process needs adjustment, while I describe how a statistically untrained worker can use a simple computer program for simultaneous equations to determine quickly the specific nature and extent of any needed adjustments. I also show how an ordinary regression program can be used for this purpose, so that a separate program for simultaneous equations need not be purchased.

2. I consider the case in which a machine has several dials that may be reset, but each setting affects several characteristics of the finished pieces. Thus it may be very difficult to determine the optimum combination of settings, since changing any dial affects several output characteristics at once. I show how a statistically untrained worker may be able to quickly use the aforementioned computer program to simultaneously calculate a whole new set of dial settings estimated to produce optimum output under the current conditions (temperature, humidity, nature of raw material, etc.).

3. Though many books on SPC describe designs for carefully controlled experiments in which dial settings are varied systematically, the current approach makes no such requirement. While collecting data to enter into the aforementioned simultaneous equations, settings may be varied unsystematically as part of ad hoc attempts to achieve high-quality output as quickly as possible. All that the current approach requires is that you keep track of all the data associated with each piece of output, and that each input characteristic varies enough (and independently enough of the other input characteristics) to allow you to assess its effects. Later I comment more on this latter point.

4. I assume that the aforementioned analyses will be based on a substantial body of data--at least 100 pieces of output and ideally much more. However, I do not require that the target specifications of all the pieces in that data set be identical, and I also allow workers using the aforementioned program to change the target specifications frequently. Thus these methods will often work well even if manufacturing runs between specification changes are very short.

5. Like most methods built on hypothesis testing, many SPC methods take an either-or approach to process adjustments--much like the common practice of estimating the true difference between two means to be 0 if one calculates p = .06, but estimating it to equal the observed difference if one calculates p = .04. In contrast, the present approach does not treat current settings as a null hypothesis to be retained until definitively rejected, but rather treats settings as something to be recalculated and adjusted as often as is convenient.

I also require that a computer program for solving
simultaneous linear equations be available for quick calculations on
the factory floor, though I do not require that the workers making
these quick calculations be trained in statistics. As mentioned
above, I explain later how an ordinary regression program
can be used for this purpose if necessary. In what follows, the
original set of regressions will be termed the
**fundamental** analysis, while the later solution of
simultaneous equations will be termed a **quick**
analysis.

In addition, I share the following assumptions with nearly all SPC methods. These paragraphs also introduce my basic notation.

1. I assume that each piece, after its manufacture, can be
measured on one or more dimensions Y_{1},
Y_{2}, etc., such as length, hole diameter, hardness,
etc. I do not assume that every single piece is actually measured
on these dimensions, but I do assume that pieces are regularly
selected from the output stream for measurement. Typically every
*k*th piece of output is selected for evaluation, though
I shall later suggest other possibilities.

2. I assume there is some optimum score
T_{i} (T for "target") on each dimension
Y_{i}--e.g. an optimum length or hole size. More
basic approaches to SPC sometimes simply classify each piece as
"satisfactory" or "unsatisfactory", but consistent with the work of
Taguchi and others, I shall treat the output characteristics as
continuous dimensions, and assume that the goal is to maximize
uniformity of output by minimizing the mean squared deviation of
scores on each dimension Y_{i} from their optimum
level T_{i}.

3. I also assume that besides the settings controlled directly
by workers, there are several other variables, called
*covariates*, which affect the output and which can be
measured but are not easily controlled. Covariates may include
temperature of the machine or some crucial part of the machine,
atmospheric humidity, and perhaps certain characteristics of the
raw material. That is, the nature of the raw material is
controllable in theory, but in practice material with given
characteristics has been delivered at a particular time, and
substitute material is not readily available. Other covariates may
measure the elapsed time or number of items produced since a
machine was last cleaned or lubricated. If dirt accumulates and
lubricant evaporates partly with the mere passage of time and
partly as a result of use, then you might consider four covariates
of this type: time and number of uses since last cleaning, and time
and number of uses since last lubrication. Of course I cannot
begin to suggest all the other covariates that might be relevant in
a given instance.

There is a widespread exaggeration of the statistical dangers of using too many covariates as variables in regressions; see another article. In a nutshell, the most important stability characteristics of a regression are determined not by the ratio between the number of cases in the sample and the number of variables in the regression, as is widely believed, but by the difference between those two quantities. Therefore with large samples (which are usually available in SPC), adequate sampling stability can usually be achieved even with many covariates. Thus I prefer erring on the liberal side in determining the number of covariates.

I assume here that you have identified the *p* *most* easily controlled input parameters, where *p* is the number of output characteristics measured, and that those *p* input parameters will collectively affect all of the output characteristics, so that optimum output can in principle be achieved by manipulating just those *p* input parameters. I will define *settings* as the *p* most easily controlled input parameters, so that the number of "settings" equals the number of output characteristics, and define *covariates* as the less easily controlled input parameters. Let *q* denote the number of covariates.

I describe three methods, in order of increasing complexity: the basic regression method, regression with simple corrections, and regression with time-series corrections.

Further suppose that just before entering the oven, each piece is washed thoroughly with ordinary water. The temperature of that water affects the temperature of the piece at the time it enters the oven, and therefore affects the optimum oven temperature and bake time. The temperature of the water can be measured accurately, but it is uneconomical to try to control the water temperature. Thus the water temperature varies substantially from winter to summer, and even varies somewhat from day to day.

In this example bake time and oven temperature are
*settings* while water temperature is a
*covariate*. For simplicity that is the only covariate in
this example, though other examples could have several covariates.
The problem is to determine the optimum bake time and oven
temperature for a given moment given the water temperature at
that moment.

To apply the regression method to this problem, we would run a set of regressions predicting the hardness H and size S of pieces from the bake time B, oven temperature T, and water temperature W that obtained at the time of their manufacture. These regressions might be based on the characteristics of every hundredth piece manufactured, and might use pieces manufactured over several weeks or months so that water temperature varied substantially across the pieces.
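Under these assumptions the fundamental analysis amounts to two ordinary least-squares fits. The following is a minimal sketch in Python; all data and coefficient values are simulated purely for illustration and are not taken from any real process.

```python
import numpy as np

# Simulated manufacturing records (invented): bake time B, oven
# temperature T, water temperature W, and the resulting hardness H
# and size S, generated from known coefficients plus noise.
rng = np.random.default_rng(0)
n = 500
B = rng.uniform(20, 40, n)
T = rng.uniform(300, 400, n)
W = rng.uniform(5, 25, n)
H = 0.30 * B + 0.02 * T - 0.05 * W + 1.0 + rng.normal(0, 0.1, n)
S = 0.10 * B + 0.01 * T - 0.02 * W + 2.0 + rng.normal(0, 0.1, n)

# Design matrix with a column of ones for the additive constant;
# one least-squares fit per output characteristic.
X = np.column_stack([B, T, W, np.ones(n)])
coef_H, *_ = np.linalg.lstsq(X, H, rcond=None)   # estimates a, b, c, d
coef_S, *_ = np.linalg.lstsq(X, S, rcond=None)   # estimates e, f, g, h
```

With several hundred pieces, the fitted slopes land very close to the coefficients used to generate the data, which is the point of requiring a substantial data set for the fundamental analysis.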

Let the lower-case letters *a* through
*h* denote the coefficients and additive constants found
in these two regressions; specifically, the regressions are

H = aB + bT + cW + d

S = eB + fT + gW + h

Let H' and S' denote the desired values on hardness and size; they are of course known. Water temperature W is also a "given" for our analysis, since it cannot be controlled. Putting all the known fixed values on the left sides of the equations, we have

H' - cW - d = aB + bT

S' - gW - h = eB + fT

Since the quantities on the left sides of the equations are known,
we will denote them by *m* and *n*
respectively, so the equations become

m = aB + bT

n = eB + fT

All the lower-case letters in these equations are known, and we want to find the corresponding values of B and T. Ordinary simultaneous-equation methods yield the formulas

B = (mf - nb)/(af - eb)

T = (na - me)/(af - eb)

These equations allow one to combine the values found by the regression with the desired size and hardness of the pieces to find the optimum bake time B and oven temperature T.
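These formulas are easy to check numerically. In the sketch below, the values chosen for the coefficients *a* through *h*, for W, and for the targets H' and S' are all invented for illustration; the closed-form answer is compared against a general linear solver.

```python
import numpy as np

# Hypothetical regression results from the fundamental analysis
# (all values invented for illustration).
a, b, c, d = 0.30, 0.02, -0.05, 1.0   # hardness:  H = aB + bT + cW + d
e, f, g, h = 0.10, 0.01, -0.02, 2.0   # size:      S = eB + fT + gW + h

W = 15.0          # current water temperature (covariate)
H_target = 8.0    # desired hardness H'
S_target = 5.0    # desired size S'

m = H_target - c * W - d
n = S_target - g * W - h

# Closed-form solution from the text
B = (m * f - n * b) / (a * f - e * b)
T = (n * a - m * e) / (a * f - e * b)

# Same answer via a general 2x2 linear solve
B2, T2 = np.linalg.solve(np.array([[a, b], [e, f]]), np.array([m, n]))
assert np.allclose([B, T], [B2, T2])
# For these invented numbers, B = 11.5 and T = 215.0
```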

In a real-world setting, the regressions used to find the 8
values *a* through *h* (what I am calling the
"fundamental analysis") might have to be done only once, until
some feature of the manufacturing process changes. But then the
equations

m = H' - cW - d

n = S' - gW - h

B = (mf - nb)/(af - eb)

T = (na - me)/(af - eb)

would have to be recalculated on the factory floor every time there was a change in water temperature W or desired size and hardness S' and H'. Thus it is desirable to make the latter calculation as simple and automatic as possible. Methods for doing this are described later. This part of the analysis is the "quick analysis".

Using matrix notation unrelated to the previous notation, this procedure can be generalized to any number of settings and covariates as follows:

- *s* denotes the column vector of *p* Settings to be found
- *c* denotes the column vector of *q* current Covariate values
- *t* denotes the column vector of *p* Target values for the output variables
- B_{s} denotes the *p* x *p* matrix of regression coefficients found for the setting variables, corresponding to the values *a, b, e,* and *f* in the previous discussion
- B_{c} denotes the *p* x *q* matrix of regression coefficients found for the covariates
- *a* denotes the vector of *p* additive constants in the regressions

t = B_{s}s + B_{c}c + a

Solving for *s* gives

s = B_{s}^{-1}(t - B_{c}c - a)

The values in B_{s} change only with new
fundamental regression analyses, so the matrix inversion need be
done only occasionally; the only values that change often are in
*t* and *c*.
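The matrix form of the quick analysis can be sketched as follows, with invented values for *p* = 2 settings and *q* = 1 covariate. (A linear solve is used rather than an explicit matrix inversion; since B_{s} changes only with new fundamental analyses, either approach is cheap in practice.)

```python
import numpy as np

# Invented example with p = 2 settings and q = 1 covariate.
Bs = np.array([[0.30, 0.02],
               [0.10, 0.01]])        # p x p coefficients for the settings
Bc = np.array([[-0.05],
               [-0.02]])             # p x q coefficients for the covariates
a_vec = np.array([1.0, 2.0])         # p additive constants
t = np.array([8.0, 5.0])             # current target values
c_vec = np.array([15.0])             # current covariate values

# Quick analysis: s = Bs^{-1} (t - Bc c - a)
s = np.linalg.solve(Bs, t - Bc @ c_vec - a_vec)
```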

- Variable
*t*is the set of*p*target values - Variable
*a*is the set of*p*additive constants - Variables S1, S2,...,Sp contain the values in B
_{s} - Variables C1, C2,...,Cq contain the values in B
_{c}

LET d = t - C1*94 - C2*63 - C3*7 - a

where 94, 63, and 7 are the current values for covariates C1, C2,
C3 respectively. In other words, multiply each covariate value by
its appropriate coefficients. The variable *d* computed
in this way can serve as *y* in the equation b =
X^{-1}y. Then use a no-constant regression to
predict *d* from S1, S2,...,Sp. The resulting regression
slopes are the desired settings.
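This trick can be verified numerically. In the sketch below (invented values, *p* = 2 and *q* = 3), the "no-constant regression" is a least-squares fit without an intercept; because there are exactly *p* predictors and *p* equations, its slopes reproduce the exact solution of the simultaneous equations.

```python
import numpy as np

# Invented values: columns S1, S2 hold the setting coefficients,
# C1, C2, C3 the covariate coefficients.
S = np.array([[0.30, 0.02],
              [0.10, 0.01]])           # S1 and S2 as columns
C = np.array([[-0.05, 0.01, 0.002],
              [-0.02, 0.03, 0.001]])   # C1, C2, C3 as columns
a_vec = np.array([1.0, 2.0])           # additive constants
t = np.array([8.0, 5.0])               # target values
cov = np.array([94.0, 63.0, 7.0])      # current covariate values

d = t - C @ cov - a_vec                # LET d = t - C1*94 - C2*63 - C3*7 - a

# No-constant regression of d on S1..Sp: with a square predictor
# matrix, the least-squares slopes solve S b = d exactly.
slopes, *_ = np.linalg.lstsq(S, d, rcond=None)
```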

As you use this system in manufacturing, you may if you
wish save all relevant data (input settings, covariate values, and
output characteristics) for the newly manufactured items. When
enough data has accumulated, you can repeat the fundamental
analysis, in order to get more accurate results from the larger data
base. In fact, as you realize belatedly that some of the settings
used earlier were far from optimum, you may choose to discard
that data from the analysis. That is a second way to achieve
adequate linearity, because linearity is more likely to be a good
approximation over a short range than over a broad range. Notice
I am *not *suggesting that you discard from the data set
the pieces which missed the targets by the largest amounts; rather
discard the cases in which the settings were (in retrospect) worst.
In other words, discard the pieces for which Ŷ, not Y, was farthest from the target values.

As mentioned, the quick analysis may be done whenever covariates change noticeably. One covariate that you know is changing constantly is time--time since last lubrication, time since last cleaning, or simply time since the machine was turned on. The regression method allows you to estimate the amount by which the optimum settings change with time. Thus you might in principle choose to change the settings every few minutes as a function of time, without even bothering to measure other covariates so frequently. Since you are allowed to use polynomial functions of the covariates and time is a covariate, you might derive a polynomial rule which suggests changing a dial by 7 units every 10 minutes just after lubricating a machine, but then gradually increasing those changes to 10 then 12 then 15 units every 10 minutes before lubricating the machine again.

In the fundamental regression analysis the TOLERANCE values are more useful than in many applications. Much research outside SPC involves carefully designed experiments in which no attention need be paid to TOLERANCE values because they all achieve their maximum values of 1.0. Much other research involves independent variables such as rainfall or subject's gender which are totally out of the control of the analyst, so that the analyst might notice the TOLERANCE values but cannot easily control them. SPC represents an intermediate case in which you are trying to achieve maximum quality as quickly as possible without doing carefully controlled experiments, but you can at any time sacrifice the quality of the next few pieces by increasing the range of one of the settings in order to get a better idea of the effects of that particular setting. In this process the TOLERANCE values can help tell you how much you could improve the estimate of the effect of a particular setting by increasing its range. The standard errors of regression slopes will tell you how accurate your current estimates are.

This means that the very same covariate measurement may
appear two or more times in the data set used for the quick
analysis, in different rows and columns. This is routinely done in
time-series analysis, and is no problem because in regression there
is no requirement that variables be mutually independent. To
illustrate the process, suppose each piece sits in a chemical bath
for 5 minutes. Each minute you remove one piece and replace it
with a new piece, so that 5 pieces share the bath at any given
moment. You take the bath's temperature just before removing
each piece from it. Then the bath's temperature during piece
*j*'s immersion is measured by the temperatures taken
just before the removal of pieces *j*, *j*-1,
*j*-2, *j*-3, and *j*-4. You don't
know which of these temperatures will be most predictive of piece
*j*'s output characteristics, so you'd like to study all of
them, treating them as five different covariates TEMP1, TEMP2,
TEMP3, TEMP4, and TEMP5, with TEMP1 being the first
temperature taken after piece *j*'s immersion and
TEMP5 the temperature taken just before piece *j*'s
removal. Suppose you have already recorded TEMP5 in a data
set, with each piece's TEMP5 measurement appearing on the same
row of the data set as that piece's output characteristics. Then in
some statistical packages such as SYSTAT, you can easily create
columns TEMP4, TEMP3, TEMP2, and TEMP1 with commands
like

LET TEMP4 = LAG(TEMP5)

LET TEMP3 = LAG(TEMP4)

LET TEMP2 = LAG(TEMP3)

LET TEMP1 = LAG(TEMP2)

Each of these commands copies the entries in one column into the next rows of another column. Thus after executing these four commands, the entry that had been in row *j*-4 of column TEMP5 will appear as well in row *j*-3 of column TEMP4, row *j*-2 of column TEMP3, row *j*-1 of column TEMP2, and row *j* of column TEMP1. Therefore the five temperature measurements taken during piece *j*'s immersion will appear on row *j* of the data set along with piece *j*'s output characteristics.

If you don't want to use so many columns, in some packages such as SYSTAT 6 you can LAG a column by two or more rows at a time. For instance, suppose in the last example you wanted to use just TEMP1 and TEMP5, the first and last temperature measurements made during the immersion of piece *j*. You could construct TEMP1 from TEMP5 with the command

LET TEMP1 = LAG(TEMP5, 4)
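In packages without a LAG command, the same columns can be built with, for instance, pandas' `shift()` (an analogue, not the SYSTAT syntax the text uses); the temperature values below are invented.

```python
import pandas as pd

# Invented TEMP5 measurements, one per row (piece).
df = pd.DataFrame({"TEMP5": [70.1, 70.4, 69.8, 70.0, 70.2, 70.5]})

# One-row lags, chained as in the four LAG commands above
df["TEMP4"] = df["TEMP5"].shift(1)
df["TEMP3"] = df["TEMP4"].shift(1)
df["TEMP2"] = df["TEMP3"].shift(1)

# Or lag by several rows at once, as in LAG(TEMP5, 4)
df["TEMP1"] = df["TEMP5"].shift(4)
```

The early rows of the lagged columns are missing values, just as the first few rows after a LAG command are, so those rows are dropped from any regression.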

V = Sd, where V is the vector of values just described, S is the
*p* x *p* matrix of S-values, and *d*
is the amount by which the settings should be changed for
optimum future output. If you use a regression program to solve
the set of simultaneous equations (see above), then predict V from
S1, S2,...,Sp in a no-constant regression, and the regression slopes
are the values of *d*.

For a more complex example of the kind of problem time-series corrections can handle, consider again the previous example in which both the hardness and size of manufactured pieces were affected by settings on oven temperature and bake time and a covariate of water temperature. Suppose the last 5 errors in hardness are -2, 0, 2, 4, and 6, so the next error is forecast to be 8, while the last 5 errors in size are 11, 8, 5, 2, -1, so the next error is forecast to be -4. Both hardness and size are affected by both oven temperature and bake time, so the problem is to adjust settings on those variables to correct for the forecast errors.
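The forecasts in this example follow from straight-line extrapolation, since both error series happen to be exactly linear. The sketch below uses a simple linear trend fit purely to reproduce these illustrative numbers; the general method uses the time-series regressions discussed later.

```python
import numpy as np

# The two error series from the example above; each lies on a
# straight line, so a linear fit reproduces the stated forecasts.
hardness_err = np.array([-2.0, 0.0, 2.0, 4.0, 6.0])
size_err = np.array([11.0, 8.0, 5.0, 2.0, -1.0])
t = np.arange(5)

def forecast_next(errors, t=t):
    """Fit a straight line to the series and extrapolate one step."""
    slope, intercept = np.polyfit(t, errors, 1)
    return slope * len(errors) + intercept

# forecast_next(hardness_err) gives 8.0; forecast_next(size_err) gives -4.0
```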

As you might guess, once the errors are forecast, the method is essentially the same as the previous method (regression with simple corrections), with the only difference being that forecast errors instead of average past errors are entered into column V. Thus the major difference between the methods is that we will now use time-series methods to forecast errors. Of course it will usually be more difficult to forecast these errors than it was in the simple examples of the previous two paragraphs.

The central feature of any problem which makes time-series corrections desirable is that fluctuations over even short time intervals may be nonrandom. If unmeasured covariates change slowly enough so that such fluctuations are largely random, then time-series corrections will not improve over simple corrections. I do not assume that all readers are familiar with time-series regression, so I go into some detail illustrating the commands used.

E1   E2
-2   11
 0    8
 2    5
 4    2
 6   -1

It doesn't matter whether entries in ERRORS are (target - observed) or (observed - target); the output I'll describe is the same either way.

After these first *p* columns, *p* other
columns B1, B2, etc. are created using the LAG command. This
can be done in a variety of ways depending on which predictors of
errors you decide to use. I shall first consider the simple case in
which the error in each characteristic of each piece is to be
predicted from the error in the same characteristic of the
immediately preceding piece. If *p* = 3, the commands
adding the necessary columns to ERRORS might be

LET B1 = LAG(E1)

LET B2 = LAG(E2)

LET B3 = LAG(E3)

Then fit three separate regressions, the first predicting E1 from B1, the second predicting E2 from B2, and the third predicting E3 from B3. These should be no-constant regressions, so that errors of 0 in past output would be predictive of errors of 0 for the next output. It might seem that in recommending no-constant regression I have overlooked the fact that some settings may need continuous adjustment as a function of time (for instance constantly decreasing grinding time as a grinder with no temperature gauge gradually grinds faster as it heats up during the morning), but I assume that any such adjustments were already being made in the manufacturing process which yielded data set ERRORS.
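These separate no-constant autoregressions can be sketched as follows with *p* = 2, reusing the small error series from the earlier example as a stand-in for data set ERRORS. (The through-the-origin slope formula is standard; the tiny sample is for illustration only.)

```python
import pandas as pd

# Stand-in for data set ERRORS (values from the earlier example)
errors = pd.DataFrame({"E1": [-2.0, 0.0, 2.0, 4.0, 6.0],
                       "E2": [11.0, 8.0, 5.0, 2.0, -1.0]})
errors["B1"] = errors["E1"].shift(1)   # LET B1 = LAG(E1)
errors["B2"] = errors["E2"].shift(1)   # LET B2 = LAG(E2)

def no_constant_slope(y, x):
    # One-predictor regression through the origin: sum(xy) / sum(x^2)
    mask = x.notna() & y.notna()
    x, y = x[mask], y[mask]
    return float((x * y).sum() / (x * x).sum())

slope1 = no_constant_slope(errors["E1"], errors["B1"])
slope2 = no_constant_slope(errors["E2"], errors["B2"])

# Forecast each next error from the latest available error
forecast = (slope1 * errors["E1"].iloc[-1],
            slope2 * errors["E2"].iloc[-1])
```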

As in an earlier example, lags of various lengths might be introduced, and all might be used in the regression. For instance, one might compute two lagged variables for each output characteristic with commands like

LET B11 = LAG(E1)

LET B12 = LAG(B11)

LET B21 = LAG(E2)

LET B22 = LAG(B21)

LET B31 = LAG(E3)

LET B32 = LAG(B31)

then predict E1 from B11 and B12, predict E2 from B21 and B22, and predict E3 from B31 and B32.

I now consider three possible modifications of this basic procedure.

First, the regressions should be designed to exclude predictors which will not be available in time during the ordinary manufacturing process. For instance, if a piece is manufactured every minute but it takes 3 minutes to measure the characteristics of a completed piece, then don't use the characteristics of the previous two pieces as predictors in the time-series regressions, because in practice that data will not be available in time. Thus instead of the LAG commands given above, you might use commands like

LET B1 = LAG(E1, 3)

LET B2 = LAG(E2, 3)

LET B3 = LAG(E3, 3)

where the second entry within the parentheses tells the program to lag by 3 units instead of 1.

I will use the term "available" to handle the complication discussed in the previous paragraph. That is, the last few "available" pieces are the last few pieces whose characteristics would be known in time, during the ordinary manufacturing process, to allow them to be used to correct settings for the next piece. Actually I have been using this term throughout this section, even before explaining its full meaning. The availability problem is one reason you should typically use an ordinary regression program for this purpose; a time-series autoregression program typically does not allow you to omit the most recent observations from the set of predictors.

A second modification of the basic commands allows you
to use as predictors an *average* of the output
characteristics of several previous pieces. The usefulness of this
modification depends partly on the cost and ease of measuring
output: the cheaper and easier the measurement, the more
measurements you may want to average to form the predictors
used in the time-series regressions.

In a third possible modification to the basic command set,
you include in each time-series regression the previous errors in
other Y-variables as well as previous errors in the same variable
Y_{i}. To see why this might be useful, suppose
two machines, A and B, sit side by side, and each piece typically
enters machine A 30 minutes before entering B. Suppose output
characteristic Y_{1} is most affected by machine A,
while Y_{2} is most affected by B. Suppose the
same general environmental conditions--dust, vibration, etc.--
generally affect the two machines similarly. Each time the area is
hit by vibration from a passing truck, it affects Y_{2}
starting with pieces entering machine B at the time of the vibration
and affects Y_{1} starting with pieces entering
machine A at the same time. But the former pieces actually
entered machine A 30 minutes earlier, and thus will appear in data
set ERRORS before the other pieces. Therefore errors in
Y_{2} may be predictive of Y_{1}
errors in later output.

You needn't knock yourself out thinking of scenarios like
this; rather you can simply use regression to see whether errors in
one output characteristic are in fact predictive of errors in other
output characteristics for later pieces. Thus I suggest that the
time-series regressions should typically predict the error of each
output characteristic Y_{i} from all *p*
errors of the previously manufactured pieces. This is the other
major reason an ordinary time-series regression program can't
ordinarily be used to fit these time-series regressions; such
programs typically don't allow prediction using entries in series
other than the same series for which predictions are being
made.
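This cross-prediction can be sketched with *p* = 2 and simulated data: each error column is regressed on the lagged values of *all* error columns, which a standard autoregression program typically cannot do. The data and the built-in cross-effect below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Simulated errors with a built-in cross-effect: E1 errors echo
# the previous piece's E2 error (as in the two-machine scenario).
rng = np.random.default_rng(1)
e = rng.normal(size=(200, 2))
e[1:, 0] += 0.6 * e[:-1, 1]
errors = pd.DataFrame(e, columns=["E1", "E2"])

lagged = errors.shift(1).add_prefix("B")   # columns BE1, BE2
X = lagged.iloc[1:].to_numpy()             # drop the NaN first row

# No-constant regressions of each error on ALL lagged errors
slopes_e1, *_ = np.linalg.lstsq(X, errors["E1"].iloc[1:].to_numpy(),
                                rcond=None)
slopes_e2, *_ = np.linalg.lstsq(X, errors["E2"].iloc[1:].to_numpy(),
                                rcond=None)
# slopes_e1[1] approximately recovers the 0.6 cross-effect of lagged E2
```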

After the forecasts are made, enter them into column V and carry on from there as in regression with simple corrections.

One point that should be made clear is that the corrections
are to an *algorithm*, not to a set of *settings*.
Further, the algorithm itself includes time-series corrections. For
instance, it may be that certain covariate scores (including perhaps
TIME) will change for every single piece manufactured, so that
settings change for every piece. Suppose that for each piece you
first calculate a set of settings by the regression method, then
calculate time-series corrections to those settings, adjust the
settings accordingly, then produce the piece and measure its errors
E1, E2, etc. These errors are the values entered into the time-series
analysis to estimate the settings for the next piece, even
though the settings used were actually adjusted twice since the last
piece--once to allow for changes in covariate scores, and once to
incorporate the time-series corrections.
