The multivariate analysis problems discussed here are like problems in regression or linear models, except that a single analysis includes two or more dependent variables. For instance, suppose you measure consumer satisfaction with two or more variables such as "How pleased are you with this product?" and "How likely would you be to recommend it to a friend?" An independent variable might be the type of product bought, while covariates might include the consumer's age, sex, income, previous use of similar products, etc.
Consider first a problem with no covariates. When there is just one dependent variable Y, an ordinary multiple regression finds the linear function of the independent variables that correlates maximally with Y. That correlation is called the multiple correlation R. The size of R (or some equivalent statistic, such as a residual sum of squares) is used to test the null hypothesis of no relation between Y and any of the independent variables.
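To make the idea concrete, here is a small sketch in Python (the data are simulated and the variable names are my own, not part of any package discussed here) showing that the multiple correlation R is simply the correlation between Y and the fitted values from the regression:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 100 cases, 3 independent variables, one Y.
n = 100
X = rng.normal(size=(n, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=n)

# Ordinary least squares with an intercept.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
yhat = X1 @ beta  # the linear function of the X's

# R is the simple correlation between y and its fitted values,
# and it can never fall below the largest simple correlation
# between y and any single X.
R = np.corrcoef(y, yhat)[0, 1]
```

The last comment states the property used in the text: the fitted values are, by construction, the linear function of the X's correlating maximally with Y.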
That concept is extended to the multivariate case as follows. Suppose there are q independent variables X1, X2, ..., Xq and p dependent variables Y1, Y2, ..., Yp. The computer program derives two linear functions--a linear function of the X's, and a separate linear function of the Y's--such that the two functions correlate maximally with each other. That maximum correlation is called the first canonical correlation between the X and Y sets. This canonical correlation would be 0 only if all p x q simple correlations between individual X's and Y's were 0, and that would virtually never be true in a sample. But just as the size of a sample's simple correlation can be used to test a null hypothesis about the population value of the correlation, so can the sample value of the canonical correlation be used to test the null hypothesis that all p x q simple XY correlations are zero.
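The computation behind the first canonical correlation can be sketched in a few lines of numpy. This is an illustrative function of my own (standard linear algebra, not SYSTAT code): whiten each set of variables, then take the largest singular value of the cross-covariance between the whitened sets.

```python
import numpy as np

def first_canonical_correlation(X, Y):
    """First canonical correlation between two sets of columns.

    Returns the maximum correlation attainable between a linear
    function of the columns of X and a linear function of the
    columns of Y.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc
    Syy = Yc.T @ Yc
    Sxy = Xc.T @ Yc
    # Whiten each set via the inverse Cholesky factor, then take the
    # largest singular value of the whitened cross-covariance matrix.
    Kx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Ky = np.linalg.inv(np.linalg.cholesky(Syy))
    M = Kx @ Sxy @ Ky.T
    return np.linalg.svd(M, compute_uv=False)[0]
```

Because every single X paired with every single Y is itself a (trivial) pair of linear functions, the value returned can never fall below the largest of the p x q simple XY correlations.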
Now consider the more general case in which you use the multivariate approach to test the null hypothesis that there is no relation between several independent variables X1, X2, etc. and several dependent variables Y1, Y2, etc., when several covariates C1, C2, etc. are controlled. In effect a multivariate analysis will follow a three-step process:

1. Regress every X and every Y on the covariates, and compute residuals--the parts of the X's and Y's not explained by the covariates.
2. Derive a linear function of the X residuals and a linear function of the Y residuals that correlate maximally with each other; that maximum correlation is the first partial canonical correlation.
3. Test the null hypothesis that this partial canonical correlation is zero in the population.
This same argument applies, in more complex forms, even when the Y's are not measured on similar scales so that the linear function of Y's is not a simple difference as in the previous example, and even when the most predictable linear function of Y's is a function of three or more Y-variables rather than just two variables.
On the other hand, a significant result from a multivariate test yields only the vague conclusion that the independent variable affects "one or more" of the dependent variables, without allowing you to say which variables are affected. Therefore if the multivariate test yields significant results, you will typically want to look at the "univariate" tests--tests on one dependent variable at a time. Philosophies differ as to whether these univariate tests should be subjected to corrections for multiple tests if they follow a significant multivariate test. The logic of the Fisher protected t method would imply that they need not be corrected. However, that logic is controversial. As I explain in Section 11.5.4 of Regression and Linear Models (McGraw-Hill, 1990), I consider that logic to be appropriate on some occasions and not others.
There are also many cases in which Bonferroni corrections to univariate tests are more powerful than a multivariate test. In fact, such cases might be the norm. This occurs when no linear function of the dependent variables correlates much more highly with the independent variable(s) than a single dependent variable. Then the canonical correlation or partial canonical correlation derived by the multivariate analysis will not be much above the highest simple correlations. Yet the significance test on the canonical correlation must take into account the fact that it is much easier for a canonical correlation to be high just by chance, than it is for a simple correlation. Therefore under the conditions just described, the multivariate test will be less significant than the Bonferroni-corrected univariate test.
For an extreme example of this point, consider a problem with a single independent variable whose sample correlation with each of six dependent variables is .5. Suppose the sample is size 40 and, unknown to the investigator, all six dependent variables correlate perfectly with each other because they really measure the same thing. Thus the canonical correlation is also .5. In that sample the two-tailed Bonferroni-corrected p would be .0061 while the multivariate p would be .075--over 12 times as large.
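The univariate side of this arithmetic is easy to verify in Python with scipy (the multivariate p of .075 comes from the test on the canonical correlation and is not reproduced here):

```python
from math import sqrt
from scipy import stats

n, r, k = 40, 0.5, 6  # sample size, each simple correlation, number of DVs

# t statistic for testing a simple correlation of .5 with n = 40:
# t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
t = r * sqrt(n - 2) / sqrt(1 - r**2)

p_uni = 2 * stats.t.sf(t, df=n - 2)  # two-tailed univariate p
p_bonf = k * p_uni                   # Bonferroni-corrected p, about .0061
```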
From what I have said so far, multivariate tests are a mixed blessing. They may occasionally yield a substantial gain in power over Bonferroni-corrected univariate tests, but the typical case is probably just the opposite. However, multivariate tests offer another advantage described in the next section.
        CS    TE
Y1      .4    .6
Y2      .6    .9
Y3      .2    .3

CS = classroom size, TE = teacher experience
In this matrix the two columns are proportional.
The proportionality hypothesis is often of considerable interest; it says that the dependent variables all relate to the independent variables in the same way. Since a composite's correlations with other variables are affected only by the relative sizes of the weights in the composite, the proportionality hypothesis says that the composite of X's correlating highest with Y1 is the same composite that correlates highest with Y2, Y3, and all the other Y's. A multivariate analysis is capable of testing that hypothesis.
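In matrix terms, proportional columns mean the slope matrix has rank 1--a fact easy to check numerically. A brief sketch with numpy, using the classroom-size/teacher-experience matrix from the text:

```python
import numpy as np

# Population matrix from the text: rows Y1-Y3, columns CS and TE.
B = np.array([[0.4, 0.6],
              [0.6, 0.9],
              [0.2, 0.3]])

# The two columns are proportional exactly when the matrix has rank 1;
# here each TE entry is 1.5 times the corresponding CS entry.
rank = np.linalg.matrix_rank(B)
```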
Proportionality is treated as a null hypothesis; a significant result indicates that the unknown true regression slopes fail to fit a pattern of proportionality. Thus you never prove proportionality, any more than nonsignificance in a simple two-group t test proves that two means are equal. Rather, if you grant that the hypothesis of proportionality is simpler or more parsimonious than the hypothesis of nonproportionality, then a nonsignificant result in the proportionality test means that proportionality is the simplest hypothesis consistent with the data.
As in any significance test, there is no requirement that you know the true values of the regression slopes. Rather the computer finds the particular set of proportional values that most closely fit the data, and then tests whether the fit between the data and that particular hypothesis is good or bad. If bad, then the null hypothesis of proportionality is rejected.
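SYSTAT's internal computations differ, but the idea of "finding the particular set of proportional values that most closely fit the data" can be illustrated with a truncated singular value decomposition, which yields the closest rank-1 matrix (columns exactly proportional) to an estimated slope matrix in the least-squares sense. The data here are hypothetical, chosen to be nearly but not exactly proportional:

```python
import numpy as np

# A hypothetical estimated slope matrix: nearly, but not exactly,
# proportional columns (think of sampling error added to a rank-1
# population matrix).
B_hat = np.array([[0.42, 0.58],
                  [0.57, 0.93],
                  [0.21, 0.28]])

# Keep only the largest singular value: the result is the rank-1
# matrix closest to B_hat, i.e. the best-fitting proportional pattern.
U, s, Vt = np.linalg.svd(B_hat, full_matrices=False)
B_prop = s[0] * np.outer(U[:, 0], Vt[0])

# The size of the discrepancy between B_hat and B_prop (equal to the
# discarded singular value) is what a proportionality test evaluates.
misfit = np.linalg.norm(B_hat - B_prop)
```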
[Figure 1. Average rated durability (vertical axis) and ease of use (horizontal axis) of five lawnmower brands, one point per brand--imaginary population data illustrating the straight-line hypothesis.]

The hypothesis that several points fall in a straight line is actually the hypothesis that proportionality applies to the regression slopes of a set of dummy variables. I assume here that you understand dummy variables; if not, see Darlington (1990) or any of many other books on regression and linear models. Then the variable "brand" in this example, with 5 levels, would be represented by 4 dummy variables. The one brand lacking a dummy variable would be called the "base" or "reference" brand.

Suppose for instance that the rightmost dot in Figure 1 is the dot for the base brand. Then the regression slope for any dummy variable predicting ease of use equals the horizontal difference between the dot for that brand and the dot for the base brand, while the regression slope for the same dummy variable predicting durability equals the vertical difference between the same two dots. Therefore the hypothesis that proportionality applies to these regression slopes is the hypothesis that proportionality applies to the horizontal and vertical distances between dots--and that is the hypothesis that the dots fall in a straight line. Just as you can never prove the null hypothesis of proportionality, you can never prove the null hypothesis that the dots fall in a straight line; that is merely the simplest conclusion consistent with the data.
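The claim that a dummy variable's slope equals the difference between its brand's mean and the base brand's mean is easy to demonstrate numerically. The data below are simulated and the brand labels are my own:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ease-of-use ratings for five lawnmower brands,
# 10 mowers per brand; brand "E" serves as the base brand.
brands = np.repeat(np.array(["A", "B", "C", "D", "E"]), 10)
ease = rng.normal(loc=np.repeat([3.0, 3.5, 4.0, 4.5, 5.0], 10))

# Four dummy variables -- one for each non-base brand -- plus an intercept.
D = np.column_stack([(brands == b).astype(float) for b in "ABCD"])
X = np.column_stack([np.ones(len(ease)), D])
beta, *_ = np.linalg.lstsq(X, ease, rcond=None)

# The intercept estimates the base brand's mean, and each dummy's
# slope estimates that brand's mean minus the base brand's mean.
base_mean = ease[brands == "E"].mean()
slopes = beta[1:]
```

The same construction with durability as the dependent variable gives the vertical differences discussed in the text.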
For simplicity we have used no covariates in the lawnmower example, but a similar argument can also be made that allows for covariates. The hypothesis then states that the dots representing several groups would fall in a straight line after we adjust for differences among the groups on the covariates.
Also for simplicity the lawnmower example included only two dependent variables, but the argument applies to any number of dependent variables. The hypothesis of proportionality then states that the points fall on a straight line in p-dimensional space, where p is the number of dependent variables.
Unfortunately a test of the straight-line hypothesis tells you nothing about the slope of the line in question; it merely tests whether the points fall on a straight line at all. In the lawnmower example, as in many examples, it matters a great deal whether the slope of the line is positive or negative. A positive slope in the lawnmower example would mean that instead of a tradeoff between durability and ease of use, more of one is always associated with more of the other. The straight-line hypothesis is also consistent with the hypothesis that the brands differ on only one of the two dimensions--durability or ease of use--but not the other. Therefore you might want to examine the data further to distinguish among these various possibilities.
MODEL Y1 Y2 Y3 = CONSTANT + X1 + X2 + C1 + C2 + C3
ESTIMATE
EFFECT = X1 + X2
TEST
As before, the Y's denote dependent variables, the X's denote independent variables, and the C's denote covariates. In this example there are three Y's, two X's, and three C's.
Substantial amounts of output will emerge after both the ESTIMATE and TEST commands, but the most important output follows the TEST command. Among other things, this output includes a test using the Wilks lambda, which tests the first null hypothesis discussed--the hypothesis that there is no relation between the X and Y sets when the C variables are controlled. The Wilks lambda is actually just one of several tests of the same hypothesis that SYSTAT prints, but I recommend the Wilks test.
The output following the TEST command will also include a section titled TEST OF RESIDUAL ROOTS. This section includes a series of chi-square tests. The second test in this series tests the proportionality hypothesis. (The first test in the series tests the same hypothesis as the Wilks lambda and the other multivariate test statistics, but does so less accurately and should be ignored.) The next section describes in more detail how that test is performed. That section also describes the hypotheses tested by the third and later chi-square tests in this series, although those hypotheses are usually of less scientific interest than the hypotheses already described.
In the SYSTAT output section titled TEST OF RESIDUAL ROOTS, each chi-square tests the significance of the corresponding canonical correlation: the first chi-square tests the first canonical correlation, the second tests the second, and so on. The null hypothesis that the first canonical correlation is zero is the null hypothesis of no association between the independent and dependent sets. That hypothesis is tested more accurately by the Wilks lambda and the other multivariate tests printed after the TEST command, so there is no good reason to use the first chi-square test.
The jth chi-square test in the series tests the null hypothesis that the rank of the XY correlation matrix (the rectangular correlation matrix relating the dependent variables to the independent variables) is j-1 or lower. Thus the first chi-square tests the null hypothesis that the rank is 0 (meaning the matrix is all zeros), the second tests the null hypothesis that the rank is 1 (the hypothesis of proportionality), and so on. I shall not explain the concept of the rank of a matrix here, because I believe that for most scientists the only test in this series of any real importance is the second one, and the meaning of the hypothesis it tests has already been explained without reference to the concept of rank.
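This series of tests is commonly computed from Bartlett's chi-square approximation. The sketch below is my reading of that standard formula, not SYSTAT's actual code, so the two may differ in detail; it takes the sample canonical correlations and returns the chi-square, degrees of freedom, and p for each test in the series:

```python
import numpy as np
from scipy import stats

def residual_root_tests(canon_rs, n, p, q):
    """Bartlett chi-square tests on a series of canonical correlations.

    canon_rs: sample canonical correlations, largest first.
    n: sample size; p, q: number of variables in the two sets.
    Test j (1-based) tests the null hypothesis that canonical
    correlations j, j+1, ... are all zero in the population, i.e.
    that the rank of the XY correlation matrix is at most j - 1.
    """
    canon_rs = np.asarray(canon_rs, dtype=float)
    c = n - 1 - (p + q + 1) / 2  # Bartlett's multiplier
    results = []
    for j in range(1, len(canon_rs) + 1):
        chi2 = -c * np.sum(np.log(1 - canon_rs[j - 1:] ** 2))
        df = (p - j + 1) * (q - j + 1)
        results.append((chi2, df, stats.chi2.sf(chi2, df)))
    return results
```

With three variables in one set and two in the other, for example, the second test in the series--the proportionality test--has (3-1)(2-1) = 2 degrees of freedom.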