The multivariate analysis problems discussed here are like problems in regression or linear models, except that a single analysis includes two or more dependent variables. For instance, suppose you measure consumer satisfaction with two or more variables such as "How pleased are you with this product?" and "How likely would you be to recommend it to a friend?" An independent variable might be the type of product bought, while covariates might include the consumer's age, sex, income, previous use of similar products, etc.

In general, then, a single analysis might include:

- *p* dependent variables Y1, Y2, Y3, etc., which might be achievement as measured by several tests. There is no requirement that the tests use comparable scales.
- *q* independent variables X1, X2, X3, etc., which might be the teacher's experience, the number of teacher aides available, and classroom size. An independent variable might be a multicategorical variable such as textbook used, when 5 different groups each used a different textbook.
- several covariates C1, C2, C3, etc., which might be the student's age, socioeconomic background, and family size.

Consider first a problem with no covariates. When there
is just one dependent variable Y, an ordinary multiple regression
finds the linear function of the independent variables that
correlates maximally with Y. That correlation is called the
multiple correlation *R.* The size of *R* (or
some equivalent statistic, such as a residual sum of squares) is
used to test the null hypothesis of no relation between Y and any
of the independent variables.
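Readers who want to see that computation spelled out could sketch it roughly as follows in Python with statsmodels; the data here are simulated and all names are arbitrary, so this is an illustration of the idea rather than part of the SYSTAT procedure discussed later.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
# Simulated data: three independent variables and one dependent variable
X = rng.normal(size=(n, 3))
y = X @ np.array([0.5, 0.2, 0.0]) + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
R = np.sqrt(fit.rsquared)                 # multiple correlation R
print("multiple R =", round(R, 3))
print("overall F p-value =", round(fit.f_pvalue, 4))   # test of H0: no relation
```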

That concept is extended to the multivariate case as
follows. The computer program derives *two* linear
functions--a linear function of the X's, and a separate linear
function of the Y's--such that the two functions correlate
maximally with each other. That maximum correlation is called
the **first canonical correlation** between the X and
Y sets. This canonical correlation would be 0 only if all
*p* × *q* simple correlations between the individual X's and Y's
were 0. That would virtually never be true in a sample. But
just as the size of a sample's simple correlation can be used to
test a null hypothesis about the population value of the
correlation, so can the sample value of the canonical correlation
be used to test the null hypothesis that all *p* ×
*q* simple XY correlations are zero.
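To make the definition concrete, here is a minimal numpy sketch of how canonical correlations can be computed; the data and variable names are invented, and a real analysis would normally use a packaged routine rather than this hand-rolled version.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the column sets X (n x q) and Y (n x p)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc
    Syy = Yc.T @ Yc
    Sxy = Xc.T @ Yc
    # Whiten each set; the singular values of the whitened cross matrix
    # are the canonical correlations.
    Kx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Ky = np.linalg.inv(np.linalg.cholesky(Syy))
    return np.linalg.svd(Kx @ Sxy @ Ky.T, compute_uv=False)

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))                        # two independent variables
Y = np.column_stack([
    X @ np.array([0.6, 0.2]),
    X @ np.array([0.1, 0.5]),
    rng.normal(size=n),
]) + rng.normal(size=(n, 3))                       # three dependent variables
print("canonical correlations:", canonical_correlations(X, Y).round(3))
```

The first value printed is the first canonical correlation described above; the remaining values are the second and later canonical correlations, which reappear in the discussion of residual roots near the end of this section.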

Now consider the more general case in which you use the multivariate approach to test the null hypothesis that there is no relation between several independent variables X1, X2, etc. and several dependent variables Y1, Y2, etc., when several covariates C1, C2, etc. are controlled. In effect a multivariate analysis will follow a three-step process:

- Regress each independent variable on the set of
covariates and save in memory the residuals in that regression.
Call these variables X1.C (the portion of X1 independent of the
C variables), X2.C, etc.
- Similarly derive Y1.C, Y2.C, etc. by regressing Y1,
Y2, etc. on the C variables.
- Find the canonical correlation between the two sets of
variables derived in steps 1 and 2. This
**partial canonical correlation** is the principal statistic used to test the hypothesis of interest. (A small sketch of these three steps appears below.)
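Here is a minimal Python sketch of that three-step process; the data, coefficients, and variable names are invented for illustration, and the canonical correlation is computed here from principal angles rather than by the whitening formula in the earlier sketch (the two give the same answer).

```python
import numpy as np

def residualize(V, C):
    """Residuals of each column of V after regressing it on C (plus an intercept)."""
    C1 = np.column_stack([np.ones(len(C)), C])
    beta, *_ = np.linalg.lstsq(C1, V, rcond=None)
    return V - C1 @ beta

def first_canonical_corr(A, B):
    """First canonical correlation between the columns of A and the columns of B."""
    Qa, _ = np.linalg.qr(A - A.mean(axis=0))
    Qb, _ = np.linalg.qr(B - B.mean(axis=0))
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)[0]

rng = np.random.default_rng(2)
n = 150
C = rng.normal(size=(n, 3))                           # covariates C1, C2, C3
X = 0.5 * C[:, :2] + rng.normal(size=(n, 2))          # independent variables X1, X2
Y = 0.4 * np.column_stack([X[:, 0], X.sum(axis=1), C[:, 0]]) + rng.normal(size=(n, 3))

X_res = residualize(X, C)     # step 1: X1.C, X2.C
Y_res = residualize(Y, C)     # step 2: Y1.C, Y2.C, Y3.C
# step 3: canonical correlation between the two residualized sets
print("partial canonical correlation:", round(first_canonical_corr(X_res, Y_res), 3))
```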

This same argument applies, in more complex forms, even when the Y's are not measured on similar scales so that the linear function of Y's is not a simple difference as in the previous example, and even when the most predictable linear function of Y's is a function of three or more Y-variables rather than just two variables.

On the other hand, a significant result from a
multivariate test yields only the vague conclusion that the
independent variable affects "one or more" of the dependent
variables, without allowing you to say which variables are
affected. Therefore if the multivariate test yields significant
results, you will typically want to look at the "univariate" tests--tests
on one dependent variable at a time. Philosophies differ as
to whether these univariate tests should be subjected to
corrections for multiple tests if they follow a significant
multivariate test. The logic of the Fisher protected *t*
method would imply that they need not be corrected. However,
that logic is controversial. As I explain in Section 11.5.4 of
*Regression and Linear Models* (McGraw-Hill, 1990),
I consider that logic to be appropriate on some occasions and
not others.

There are also many cases in which Bonferroni corrections to univariate tests are more powerful than a multivariate test. In fact, such cases might be the norm. This occurs when no linear function of the dependent variables correlates much more highly with the independent variable(s) than a single dependent variable. Then the canonical correlation or partial canonical correlation derived by the multivariate analysis will not be much above the highest simple correlations. Yet the significance test on the canonical correlation must take into account the fact that it is much easier for a canonical correlation to be high just by chance, than it is for a simple correlation. Therefore under the conditions just described, the multivariate test will be less significant than the Bonferroni-corrected univariate test.

For an extreme example of this point, consider a
problem with a single independent variable whose sample
correlation with each of six dependent variables is .5. Suppose
the sample is size 40 and, unknown to the investigator, all six
dependent variables correlate perfectly with each other because
they really measure the same thing. Thus the canonical
correlation is also .5. In that sample the two-tailed Bonferroni-corrected
*p* would be .0061 while the multivariate
*p* would be .075--over 12 times as large.
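The Bonferroni figure in that example is easy to check; a short scipy computation is shown below (the multivariate *p* of .075 quoted above is taken from the text, not recomputed here, since it depends on the particular multivariate test used).

```python
import numpy as np
from scipy import stats

r, n, k = 0.5, 40, 6            # sample correlation, sample size, number of DVs
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
p_uncorrected = 2 * stats.t.sf(t, df=n - 2)      # two-tailed p for a single DV
p_bonferroni = min(1.0, k * p_uncorrected)       # corrected for 6 tests
print(round(p_uncorrected, 4), round(p_bonferroni, 4))   # roughly .001 and .006
```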

From what I have said so far, multivariate tests are a mixed blessing. They may occasionally yield a substantial gain in power over Bonferroni-corrected univariate tests, but the typical case is probably just the opposite. However, multivariate tests offer another advantage described in the next section.

Suppose, for example, that the relations of three dependent variables to two independent variables were as follows:

|    | CS | TE |
|----|----|----|
| Y1 | .4 | .6 |
| Y2 | .6 | .9 |
| Y3 | .2 | .3 |

CS = classroom size, TE = teacher experience

In this matrix the two columns are proportional.
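A trivial check of that proportionality, using the values in the table above: both columns are multiples of the same vector, so the matrix has rank 1.

```python
import numpy as np

B = np.array([[0.4, 0.6],
              [0.6, 0.9],
              [0.2, 0.3]])      # rows: Y1, Y2, Y3; columns: CS, TE
print(B[:, 1] / B[:, 0])                 # constant ratio of 1.5 in every row
print(np.linalg.matrix_rank(B))          # rank 1: the columns are proportional
```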

The proportionality hypothesis is often of considerable
interest; it says that the dependent variables all relate to the
independent variables in the same way. Since a composite's
correlations with other variables are affected only by the
*relative* sizes of the weights in the composite, the
proportionality hypothesis says that the composite of X's
correlating highest with Y1 is the same composite that correlates
highest with Y2, Y3, and all the other Y's. A multivariate
analysis is capable of testing that hypothesis.

Proportionality is treated as a null hypothesis; a
significant result indicates that the unknown true regression
slopes fail to fit a pattern of proportionality. Thus you never
*prove* proportionality, any more than nonsignificance
in a simple two-group *t* test proves that two means
are equal. Rather if you assume that the hypothesis of
proportionality is simpler or more parsimonious than the
hypothesis of nonproportionality, then a nonsignificant result in
the proportionality test means that proportionality is the simplest
hypothesis consistent with the data.

As in any significance test, there is no requirement that you know the true values of the regression slopes. Rather the computer finds the particular set of proportional values that most closely fit the data, and then tests whether the fit between the data and that particular hypothesis is good or bad. If bad, then the null hypothesis of proportionality is rejected.
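The idea of "the particular set of proportional values that most closely fit the data" can be illustrated with a least-squares sketch: the closest rank-1 (proportional-columns) matrix to an estimated slope matrix is given by its leading singular triple. The slope matrix below is invented, and this is only a conceptual illustration--the significance test SYSTAT actually performs is the chi-square test on residual roots described later, not this least-squares fit.

```python
import numpy as np

# A hypothetical estimated slope matrix (rows: Y's, columns: X's) whose
# columns are nearly, but not exactly, proportional.
B_hat = np.array([[0.41, 0.58],
                  [0.62, 0.93],
                  [0.18, 0.32]])

# Closest rank-1 matrix in the least-squares sense (Eckart-Young theorem):
U, s, Vt = np.linalg.svd(B_hat)
B_prop = s[0] * np.outer(U[:, 0], Vt[0])     # columns exactly proportional
print(B_prop.round(3))
print("residual size:", round(np.linalg.norm(B_hat - B_prop), 3))
```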

*Figure 1.* Average rated durability (vertical axis) and ease of use (horizontal axis) of five lawnmower brands--imaginary population data illustrating the straight-line hypothesis.

The hypothesis that several points fall in a straight line is actually the hypothesis that proportionality applies to the regression slopes of a set of dummy variables. I assume here that you understand dummy variables; if not, see Darlington (1990) or any of many other books on regression and linear models. Then the variable "brand" in this example, with 5 levels, would be represented by 4 dummy variables. The one brand lacking a dummy variable would be called the "base" or "reference" brand.

Suppose for instance that the rightmost dot in Figure 1 is the dot for the base brand. Then the regression slope for any dummy variable predicting ease of use equals the horizontal difference between the dot for that brand and the dot for the base brand, while the regression slope for any dummy variable predicting durability equals the vertical difference between the same two dots. Therefore the hypothesis that proportionality applies to these regression slopes is the hypothesis that proportionality applies to the horizontal and vertical distances between dots--and that is the hypothesis that the dots fall in a straight line. Just as you can never prove the null hypothesis of proportionality, you can never prove the null hypothesis that the dots fall in a straight line; that is merely the simplest conclusion consistent with the data.
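A small numerical illustration of the dummy-variable point (the brand means and error variance are invented): with four dummies for five brands, each dummy's regression slope is just the difference between that brand's mean and the base brand's mean, so proportionality of the slope pairs is the same thing as the brand points lying on a straight line.

```python
import numpy as np

rng = np.random.default_rng(3)
brands = np.repeat(np.arange(5), 20)                    # five brands, 20 mowers each
ease = np.array([3.0, 3.5, 4.0, 4.5, 5.0])[brands] + rng.normal(0, 0.2, 100)
dur  = np.array([5.0, 4.5, 4.0, 3.5, 3.0])[brands] + rng.normal(0, 0.2, 100)

# Four dummy variables; brand 4 serves as the base (reference) brand.
D = np.column_stack([(brands == b).astype(float) for b in range(4)])
X = np.column_stack([np.ones(100), D])

slopes_ease = np.linalg.lstsq(X, ease, rcond=None)[0][1:]
slopes_dur  = np.linalg.lstsq(X, dur,  rcond=None)[0][1:]
# Each slope equals (brand mean) - (base-brand mean); the straight-line
# hypothesis says the (ease, durability) slope pairs are proportional.
print(slopes_ease.round(2), slopes_dur.round(2))
```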

For simplicity we have used no covariates in the lawnmower example, but a similar argument can also be made that allows for covariates. The hypothesis then states that the dots representing several groups would fall in a straight line after we adjust for differences among the groups on the covariates.

Also for simplicity the lawnmower example included
only two dependent variables, but the argument applies to any
number of dependent variables. The hypothesis of
proportionality then states that the points fall on a straight line
in *p*-dimensional space, where *p* is the
number of dependent variables.

Unfortunately a test of the straight-line hypothesis tells
you nothing about the *slope* of the line in question; it
merely tests whether it's a straight line. In the lawnmower
example, as in many examples, it matters a lot whether the
slope of the line is positive or negative. A positive slope in the
lawnmower example would mean that instead of a tradeoff
between durability and ease of use, more of one is always
associated with more of the other. The straight-line hypothesis
is also consistent with the hypothesis that lawnmowers differ on
durability *or* ease of use, but only on one of those.
Therefore you might want to do some additional examination of
the data to distinguish among those various possibilities.

In SYSTAT, the commands for such an analysis are as follows:

```
PRINT MEDIUM
MODEL Y1 Y2 Y3 = CONSTANT + X1 + X2 + C1 + C2 + C3
ESTIMATE
HYPOTHESIS
EFFECT = X1 + X2
TEST
```

As before, the Y's denote dependent variables, the X's denote independent variables, and the C's denote covariates. In this example there are three Y's, two X's, and three C's.

Substantial amounts of output will emerge after both the ESTIMATE and TEST commands, but the most important output follows the TEST command. This output includes, among other things, a test based on the Wilks lambda, which tests the first null hypothesis discussed above--that there is no relation between the X and Y sets when the C variables are controlled. The Wilks lambda is actually just one of several tests of this hypothesis that SYSTAT prints, but it is the one I recommend.
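For readers working in Python rather than SYSTAT, a roughly analogous analysis can be sketched with statsmodels; the data frame below is filled with random numbers purely to make the sketch runnable, and the joint test of X1 and X2, corresponding to SYSTAT's EFFECT = X1 + X2, is specified through a contrast matrix. This is a rough analogue offered under those assumptions, not a description of SYSTAT's own computations.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Assumed data frame with columns Y1, Y2, Y3, X1, X2, C1, C2, C3 (random here).
rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(80, 8)),
                  columns=["Y1", "Y2", "Y3", "X1", "X2", "C1", "C2", "C3"])

m = MANOVA.from_formula("Y1 + Y2 + Y3 ~ X1 + X2 + C1 + C2 + C3", data=df)

# Joint hypothesis L*B = 0 for the X1 and X2 rows of the coefficient matrix;
# the exog columns are (Intercept, X1, X2, C1, C2, C3).
L = np.array([[0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0]])
print(m.mv_test(hypotheses=[("X1+X2", L)]))   # reports Wilks' lambda among others
```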

The output following the TEST command will also
include a section titled TEST OF RESIDUAL ROOTS. This
section includes a series of chi-square tests. The
*second* test in this series tests the proportionality
hypothesis. (The first test in the series tests the same hypothesis
as the Wilks lambda and the other multivariate test statistics, but
does so less accurately and should be ignored.) The next
section describes in more detail how that test is performed.
That section also describes the hypotheses tested by the third
and later chi-square tests in this series, although those
hypotheses are usually of less scientific interest than the
hypotheses already described.

In the SYSTAT output section titled TEST OF RESIDUAL ROOTS, each chi-square tests the significance of the corresponding canonical correlation: the first chi-square tests the first canonical correlation, the second tests the second, and so on. The null hypothesis that the first canonical correlation is zero is the null hypothesis of no association between the independent and dependent sets. That hypothesis is tested more accurately by the Wilks and other multivariate tests printed after the ESTIMATE command, so there is no good reason to use the first chi-square test.

Each chi-square test *j* tests the null hypothesis
that the rank of the XY correlation matrix (the rectangular
correlation matrix relating the dependent variables to the
independent variables) is *j*-1 or lower. Thus the first
chi-square tests the null hypothesis that the rank is 0 (meaning
the matrix is all zeros), the second tests the null hypothesis that
the rank is 1 (the hypothesis of proportionality), and so on. I
shall not explain here the concept of the rank of a matrix
because I believe that for most scientists, the only test in this
series of any real importance is the second one, and the meaning
of the hypothesis tested by that test has already been explained
without reference to the concept of rank.
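To make the structure of that series concrete, here is a sketch of the standard Bartlett chi-square tests of residual roots, computed from the sample canonical correlations. This is the usual textbook formula, offered for illustration with made-up numbers; SYSTAT's exact computations, especially the degrees-of-freedom adjustments when covariates are present, may differ in detail.

```python
import numpy as np
from scipy import stats

def residual_root_tests(canon_corrs, n, p, q):
    """Bartlett chi-square tests of residual roots: the j-th test asks whether
    canonical correlations j, j+1, ... are all zero, i.e. whether the rank of
    the XY relation matrix is at most j-1."""
    r = np.asarray(canon_corrs)
    results = []
    for j in range(len(r)):                       # j is 0-indexed here
        chi2 = -(n - 1 - (p + q + 1) / 2) * np.sum(np.log(1 - r[j:] ** 2))
        df = (p - j) * (q - j)                    # (p-j+1)(q-j+1) with 1-indexed j
        results.append((j + 1, chi2, df, stats.chi2.sf(chi2, df)))
    return results

# Example: p = 3 dependent variables, q = 2 independent variables, n = 100
# cases, and made-up sample canonical correlations of .55 and .20.
for j, chi2, df, pval in residual_root_tests([0.55, 0.20], n=100, p=3, q=2):
    print(f"test {j}: chi2 = {chi2:.2f}, df = {df}, p = {pval:.4f}")
```

The second line of output is the test of the proportionality (rank 1) hypothesis discussed above; the first line tests the same hypothesis as the Wilks lambda, which, as noted earlier, is better tested by the multivariate statistics printed after the ESTIMATE command.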