Overview
A canonical correlation is the correlation of two canonical (latent) variables, one representing a set of independent variables, the other a set of dependent variables. Each set may be considered a latent variable based on measured indicator variables in its set. The canonical correlation is optimized such that the linear correlation between the two latent variables is maximized. Whereas multiple regression is used for many-to-one relationships, canonical correlation is used for many-to-many relationships. There may be more than one such linear correlation relating the two sets of variables, with each such correlation representing a different dimension by which the independent set of variables is related to the dependent set. The purpose of canonical correlation is to explain the relation of the two sets of variables, not to model the individual variables. For each canonical variate we can also assess how strongly it is related to measured variables in its own set, or the set for the other canonical variate. Wilks's lambda is commonly used to test the significance of canonical correlation.
Analogous with ordinary correlation, canonical correlation squared is the percent of variance in the dependent set explained by the independent set of variables along a given dimension (there may be more than one). In addition to asking how strong the relationship is between two latent variables, canonical correlation is useful in determining how many dimensions are needed to account for that relationship. Canonical correlation finds the linear combination of variables that produces the largest correlation with the second set of variables. This linear combination, or "root," is extracted and the process is repeated for the residual data, with the constraint that the second linear combination of variables must not correlate with the first one. The process is repeated until a successive linear combination is no longer significant.
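The extraction of successive roots described above can be sketched numerically. What follows is a minimal illustration in Python with NumPy, not the algorithm used by SPSS or SAS. The function name canonical_correlations and the simulated data are assumptions made for the example, and the sketch assumes both data matrices have full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations and weights for X (n x p) and Y (n x q).

    Assumes both centered matrices have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column space of each set.
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    # Singular values of Qx'Qy are the canonical correlations, largest first.
    U, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    A = np.linalg.solve(Rx, U)     # canonical weights for the X set
    B = np.linalg.solve(Ry, Vt.T)  # canonical weights for the Y set
    return np.clip(s, 0.0, 1.0), A, B

# Simulated data (hypothetical): one latent dimension shared by both sets.
rng = np.random.default_rng(0)
z = rng.normal(size=(300, 1))
X = z @ np.array([[1.0, 0.5, 0.0]]) + rng.normal(size=(300, 3))
Y = z @ np.array([[1.0, 1.0]]) + rng.normal(size=(300, 2))

r, A, B = canonical_correlations(X, Y)
x_scores = (X - X.mean(axis=0)) @ A   # canonical variates for the X set
y_scores = (Y - Y.mean(axis=0)) @ B   # canonical variates for the Y set
print("canonical correlations:", np.round(r, 3))
```

The number of roots returned equals the smaller number of variables in the two sets (two here). Each pair of variates correlates at the corresponding canonical correlation, and variates for different roots within a set are uncorrelated, which is the constraint described in the text.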
Canonical correlation is a member of the multiple general linear hypothesis (MGLH) family and shares many of the assumptions of multiple regression such as linearity of relationships, homoscedasticity (same level of relationship for the full range of the data), interval or near-interval data, untruncated variables, proper specification of the model, lack of high multicollinearity, and multivariate normality for purposes of hypothesis testing. It also shares with factor analysis the need to impute labels for the canonical variables based on structure correlations, which function as a form of canonical factor loading; researchers may well impute different labels based on the same data. See also the nonlinear canonical correlation procedure (OVERALS). See also partial least squares regression, which is sometimes used to predict one set of response variables from a set of independent variables.
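The significance testing mentioned in the overview can be illustrated with Wilks's lambda, which for the k-th test is the product of (1 - Rc-squared) over the k-th and all later roots. The sketch below pairs it with Bartlett's chi-square approximation, which is one common choice and not necessarily what a given package reports. The squared canonical correlations are taken from the redundancy tables later in this document; the sample size n = 100 is hypothetical, and the set sizes p = 2 and q = 3 are assumptions consistent with those tables.

```python
import math

def wilks_lambda_tests(sq_canonical_corrs, n, p, q):
    """Sequential Wilks's lambda tests using Bartlett's chi-square approximation.

    Returns one (lambda, chi_square, df) tuple per root; test k asks whether
    root k and all later roots are jointly zero.
    """
    results = []
    for k in range(len(sq_canonical_corrs)):
        lam = math.prod(1.0 - r2 for r2 in sq_canonical_corrs[k:])
        chi_square = -(n - 1 - (p + q + 1) / 2.0) * math.log(lam)
        df = (p - k) * (q - k)
        results.append((lam, chi_square, df))
    return results

# n is hypothetical; p and q are assumed set sizes.
tests = wilks_lambda_tests([0.4715, 0.0052], n=100, p=2, q=3)
for root, (lam, chi_square, df) in enumerate(tests, start=1):
    print(f"roots {root}+: lambda={lam:.4f}, chi-square={chi_square:.2f}, df={df}")
```

A small lambda (large chi-square) for the first test and a lambda near 1 for the second would suggest that only the first dimension is needed, which matches the stopping rule described in the overview.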
The first canonical correlation is always the one which explains most of the relationship. The canonical correlations are interpreted the same as Pearson's r: their square is the percent of variance in the canonical variate of one set of variables explained by the canonical variate for the other set along the dimension represented by the given canonical correlation (usually the first). Another way to put it is to say that Rc-squared is the percent of variance shared by the canonical variates along this dimension. As an arbitrary rule of thumb, some researchers state that a dimension will be of interest if its canonical correlation is .30 or higher, corresponding to about 10% of variance explained. Some researchers, when reporting canonical correlation, report just the first canonical correlation, but it is recommended that all meaningful and interpretable canonical correlations be reported.
Structure correlations are used for three purposes.
Thus canonical correlation reflects the percent of variance in the dependent canonical variable explained by the independent canonical variable and is used when exploring relationships between the independent and the dependent set of variables. In contrast, redundancy has to do with the percent of variance in the set of original individual dependent variables explained by the independent canonical variable and is used when assessing the effectiveness of the canonical analysis in capturing the variance of the original variables. These are different, and one may think of redundancy analysis as a check on the meaning of the canonical correlation. Redundancy analysis is used to measure how well the dependent canonical variable predicts the values of the original independent variables, how well the independent canonical variable predicts the values of the original independent variables, and how well the dependent canonical variable predicts the values of the original dependent variables. See the redundancy analysis question in the FAQ section below.

The canonical correlation plot shows one (usually the first) canonical correlation. The X axis is the covariate canonical variable (the latent variable for the set of independents). The Y axis is the canonical variable representing the dependent variables. The tick marks on the axes are in standardized units. Note that the axes begin at about (-4, -4), so the (0, 0) location is in the center of the plot. The points are the canonical scores of each case on the two canonical variables. A regression line is fitted through the scatter of points. The canonical correlation (.95 in the example) is printed on the regression line. When canonical correlation is high, the points will form two clusters at different points on the regression line.
Outliers. Cases outside the two clusters in a canonical correlation plot are outliers and may merit scrutiny or separate modeling. In the example above, outliers are circled. That is, canonical correlation plots are a useful method of identifying outliers or exceptional cases which differ from other cases in not sharing the same pattern of correlation between the two sets of variables in the study.

The helio plot is of a single canonical correlation (usually the first one). The plot is formed of two semi-circles, with original variables arrayed around the perimeter. The left semicircle lists the dependent variables and the right semicircle lists the independents. Bars proportionate in length to the structure correlations of each variable are arrayed around an inner circle. Bars reaching outward represent positive correlations and bars reaching inward represent negative correlations. Helio plots are not output by SPSS or SAS through 2007 versions.
Note one cannot save canonical scores in this method.
INCLUDE 'c:\Program Files\SPSS\Canonical correlation.sps'.
CANCORR SET1=varlist/ SET2=varlist/.
Here each "varlist" is one of the two lists of numeric variables. Output will be saved to a file called "cc_tmp2.sav," which will contain the canonical scores as new variables along with the original data file. These scores will be labeled s1_cv1 and s2_cv1, s1_cv2 and s2_cv2, and the like, standing for the scores on the two canonical variables associated with each canonical correlation. The macro will create two canonical variables for a number of canonical correlations equal to the smaller number of variables in SET1 or SET2.
However, Levine (1977: 18-19) argues against the procedure above on the ground that the canonical coefficients may be subject to multicollinearity, leading to incorrect judgments. Also, because of suppression, a canonical coefficient may even have a different sign compared to the correlation of the original variable with the canonical variable. Therefore, instead, Levine recommends interpreting the relations of the original variables to a canonical variable in terms of the correlations of the original variables with the canonical variables - that is, by structure coefficients. This is now the standard approach.
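The sign reversal that Levine warns about can be shown with a small simulation. This is only an illustrative sketch in Python with NumPy: the function name structure_coefficients is an assumption, the data are simulated, and the weights are hypothetical values chosen to produce suppression, not coefficients from a fitted canonical analysis.

```python
import numpy as np

def structure_coefficients(X, weights):
    """Correlation of each column of X with the variate formed by X @ weights."""
    Xc = X - X.mean(axis=0)
    variate = Xc @ weights
    return np.array(
        [np.corrcoef(Xc[:, j], variate)[0, 1] for j in range(Xc.shape[1])]
    )

# Simulated data (hypothetical): two correlated measured variables.
rng = np.random.default_rng(1)
z = rng.normal(size=1000)
e = rng.normal(size=1000)
X = np.column_stack([z + e, e + 0.3 * z])

weights = np.array([1.0, -0.8])   # hypothetical weights, not from a fitted model
loadings = structure_coefficients(X, weights)
print("weights: ", weights)
print("loadings:", np.round(loadings, 2))
```

In this setup the second variable receives a negative weight yet correlates positively with the variate, so reading the weight alone would give the wrong impression of how that variable relates to the canonical variable.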
Canonical Redundancy Analysis
Raw variance tables are reported by SAS but are omitted here because redundancy is normally interpreted using the standardized tables.
Standardized Variance of the dependent variables
Explained by
          Their Own                          The Opposite
     Canonical Variables                 Canonical Variables
                 Cumulative   Canonical              Cumulative
     Proportion  Proportion   R-Squared  Proportion  Proportion
1        0.2394      0.2394      0.4715      0.1129      0.1129
2        0.3518      0.5912      0.0052      0.0018      0.1147
The table above shows that, for the first canonical correlation, although the independent canonical variable explains 47.15% of the variance in the dependent canonical variable, the independent canonical variable is able to predict only 11.29% of the variance in the individual original dependent variables. Also, the dependent canonical variable predicts only 23.94% of the variance in the individual original dependent variables. Similar statements could be made about the second canonical correlation (row 2).
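The relation among the columns of this table can be checked by hand: the proportion explained by the opposite canonical variable is the proportion explained by the set's own canonical variable multiplied by the squared canonical correlation. The short Python check below uses only the figures printed in the table; the variable names are illustrative.

```python
# Figures copied from the standardized variance table for the dependent variables.
own_variance = [0.2394, 0.3518]   # explained by the dependent canonical variables
canonical_r2 = [0.4715, 0.0052]   # squared canonical correlations

# Redundancy: own-set variance proportion times squared canonical correlation.
redundancy = [own * r2 for own, r2 in zip(own_variance, canonical_r2)]
cumulative = sum(redundancy)

print([round(v, 4) for v in redundancy], round(cumulative, 4))
```

Rounded to four places this reproduces the table's 0.1129 and 0.0018, and the cumulative 0.1147.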
Canonical Redundancy Analysis
Standardized Variance of the independent variables
Explained by
          Their Own                          The Opposite
     Canonical Variables                 Canonical Variables
                 Cumulative   Canonical              Cumulative
     Proportion  Proportion   R-Squared  Proportion  Proportion
1        0.5000      0.5000      0.4715      0.2357      0.2357
2        0.5000      1.0000      0.0052      0.0026      0.2383
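The table above repeats the first, except that it reports the variance in the original independent variables explained by their own (independent) canonical variables and by the opposite (dependent) canonical variables.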
Canonical Redundancy Analysis
Squared Multiple Correlations Between the dependent variables and
the First 'M' Canonical Variables of the independent variables
M            1           2
Y1      0.1510      0.1526
Y2      0.0280      0.0305
Y3      0.1596      0.1610
In the table above, the columns represent the canonical correlations and the rows represent the original dependent variables, three in this case. The R-squareds are the percent of variance in each original dependent variable explained by the independent canonical variables. A similar table for the independent variables and the dependent canonical variables is also output by SAS but is not reproduced here.
OVERALS uses optimal scaling, which quantifies categorical variables and then treats them as numerical variables, including applying nonlinear transformations to find the best-fitting model. For nominal variables, the order of the categories is not retained but values are created for each category such that goodness of fit is maximized. For ordinal variables, order is retained and values maximizing fit are created. For interval variables, order is retained as are equal distances between values.
Obtain OVERALS from the SPSS menu by selecting Analyze, Data Reduction, Optimal Scaling; Select Multiple sets; Select either Some variable(s) not multiple nominal or All variables multiple nominal; click Define; define at least two sets of variables; define the value range and measurement scale (optimal scaling level) for each selected variable. SPSS output includes frequencies, centroids, iteration history, object scores, category quantifications, weights, component loadings, single and multiple fit, object scores plots, category coordinates plots, component loadings plots, category centroids plots, and transformation plots.
Tip: To minimize output, use the Automatic Recode facility on the Transform menu to create consecutive categories beginning with 1 for variables treated as nominal or ordinal. To minimize output, for each variable scaled at the numerical (integer) level, subtract the smallest observed value from every value and add 1.
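The two recodings in this tip are simple enough to sketch outside SPSS. The Python functions below only mimic the idea and are not the SPSS Automatic Recode implementation; the function names and sample values are hypothetical.

```python
def consecutive_codes(values):
    """Map each distinct value to 1, 2, 3, ... in ascending order of the values."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values]

def shift_to_one(values):
    """Subtract the smallest observed value from every value and add 1."""
    low = min(values)
    return [v - low + 1 for v in values]

region = [10, 30, 30, 20, 10]   # hypothetical nominal codes
age = [18, 21, 19, 25]          # hypothetical integer-scaled variable

print(consecutive_codes(region))   # [1, 3, 3, 2, 1]
print(shift_to_one(age))           # [1, 4, 2, 8]
```

The first function closes gaps between category codes, while the second preserves the distances between integer values and only moves the minimum to 1.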
Warning: Optimal scaling recodes values on the fly to maximize goodness of fit for the given data. As with any atheoretical, post-hoc data mining procedure, there is a danger of overfitting the model to the given data. Therefore, it is particularly appropriate to employ cross-validation, developing the model for a training dataset and then assessing its generalizability by running the model on a separate validation dataset.
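The cross-validation idea can be sketched as follows. Note that this example substitutes ordinary linear canonical correlation for OVERALS, since optimal scaling itself is not reproduced here; the function name, the simulated data, and the half-and-half split are all assumptions made for illustration.

```python
import numpy as np

def first_canonical_weights(X, Y):
    """Weights (a, b) and correlation for the first canonical root (linear CCA)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)
    Qy, Ry = np.linalg.qr(Yc)
    U, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    a = np.linalg.solve(Rx, U[:, 0])
    b = np.linalg.solve(Ry, Vt[0, :])
    return a, b, s[0]

# Simulated data (hypothetical): one latent dimension shared by both sets.
rng = np.random.default_rng(2)
z = rng.normal(size=(400, 1))
X = z @ np.array([[1.0, 0.5, 0.0]]) + 0.5 * rng.normal(size=(400, 3))
Y = z @ np.array([[1.0, 1.0]]) + 0.5 * rng.normal(size=(400, 2))

# Develop the model on the training half only.
a, b, r_train = first_canonical_weights(X[:200], Y[:200])

# Apply the training weights, unchanged, to the validation half.
r_valid = np.corrcoef(X[200:] @ a, Y[200:] @ b)[0, 1]
print(f"training r = {r_train:.3f}, validation r = {r_valid:.3f}")
```

A validation correlation far below the training correlation would be a sign that the solution was fitted to idiosyncrasies of the training data.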
The SPSS manual notes, "If each set contains one variable, nonlinear canonical correlation analysis is equivalent to principal components analysis with optimal scaling. If each of these variables is multiple nominal, the analysis corresponds to homogeneity analysis. If two sets of variables are involved and one of the sets contains only one variable, the analysis is identical to categorical regression with optimal scaling."
Some authors have argued for varimax rotation of the matrix of structure coefficients or (less commonly) the canonical coefficients, similar to what is done routinely in factor analysis. This is possible whenever there are two or more canonical correlations in the solution. Rotation will not change the sums of the squared canonical correlation coefficients and it will lead to a simpler structure. SPSS supports rotation. However, Thompson (1984: 38) notes, "Although varimax rotation is appealing, the application seriously violates the fundamental logic of canonical analysis." The purposes of canonical analysis assume the importance of keeping separate the independent and dependent sets of variables, whereas factor analysis is used to explore the structure underlying all variables in the analysis. Thompson (1984:31-41) discusses possible strategies for using rotation but in general, most researchers do not use rotation as part of canonical analysis.
Both create latent variables (variates) based on a linear combination of measured variables, but factor analysis usually is not focused on the correlation of these variates. In fact, in most forms of factor analysis, the factors are orthogonal (uncorrelated). Factor analysis is a non-dependent procedure, whereas canonical correlation can be conceptualized in terms of an independent and a dependent set of variables. Variates are rarely rotated in canonical correlation, whereas rotation of factors is the norm in factor analysis.
One can conduct two separate principal components factor analyses, one on the independent set of variables and one on the dependent. Then one can substitute factor scores for the original variables and conduct canonical correlation. If there are relatively few principal component factors which explain most of the variance in the original sets of variables, and if the factorized canonical correlation analysis shows the independent set accounts for most of the variation in the dependent set, then the researcher has demonstrated that "the same set of underlying factors which account for most of the variation within variable sets are also important in determining relationships across variable sets" (Dunteman, 1989: 86).
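The two-step strategy described above can be sketched as follows. This is an illustration in Python with NumPy under stated assumptions: components are extracted from standardized variables (that is, from the correlation matrix), the function names are hypothetical, and the data are simulated.

```python
import numpy as np

def pc_scores(X, k):
    """Scores on the first k principal components of the standardized columns of X."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :k] * S[:k]

def canonical_corrs(X, Y):
    """Canonical correlations between two full-column-rank data matrices."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return np.clip(np.linalg.svd(Qx.T @ Qy, compute_uv=False), 0.0, 1.0)

# Simulated data (hypothetical): two latent factors drive both sets.
rng = np.random.default_rng(3)
f = rng.normal(size=(300, 2))
X = f @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(300, 4))
Y = f @ rng.normal(size=(2, 3)) + 0.5 * rng.normal(size=(300, 3))

r_original = canonical_corrs(X, Y)
r_reduced = canonical_corrs(pc_scores(X, 2), pc_scores(Y, 2))
print("original variables:", np.round(r_original, 3))
print("two components each:", np.round(r_reduced, 3))
```

If the first canonical correlation from the component scores is close to the one from the original variables, the few retained components carry most of the relationship across sets, which is the situation Dunteman describes. Retaining all components would reproduce the original canonical correlations exactly.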
MANOVA produces output which is extraneous to canonical correlation, and it will not save canonical scores.
One method is to compute a new nominal variable in which every combination of responses is a separate value (use the IF statements in SPSS), then you may use chi-square or nominal measures of association. Alternatively, you might consider using the SPSS CATEGORIES module after recoding your multi-response item as a set of separate variables. CATEGORIES then gives you the canonical correlation coefficient between your multiple response set as independents and another variable such as income level as dependent.
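The first method, coding every combination of responses as a separate nominal value, can be sketched outside SPSS as well. The Python function below is a hypothetical illustration and not the SPSS IF-statement approach itself; the response data are invented.

```python
def combination_codes(rows):
    """Assign one nominal code to each distinct combination of responses."""
    codebook = {}
    coded = []
    for row in rows:
        key = tuple(row)
        if key not in codebook:
            # Codes are handed out in order of first appearance.
            codebook[key] = len(codebook) + 1
        coded.append(codebook[key])
    return coded, codebook

# Hypothetical multi-response item with three yes/no options per case.
responses = [(1, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 0)]
combo, codebook = combination_codes(responses)
print(combo)      # [1, 2, 1, 3]
print(codebook)
```

The resulting variable can then be cross-tabulated with another nominal variable for a chi-square test. With many response options the number of combinations grows quickly, so sparse cells may become a practical concern.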
Copyright 1998, 2008 by G. David Garson.
last update 3/24/2008.