Overview
Partial least squares (PLS) regression/path analysis is an alternative to OLS regression, canonical correlation, or structural equation modeling (SEM) for analysis of systems of independent and response variables. PLS is a predictive technique which can handle many independent variables, even when there are more predictors than cases and even when predictors display multicollinearity. Like canonical correlation or multivariate GLM, it can also relate the set of independent variables to a set of multiple dependent (response) variables. However, PLS is less than satisfactory as an explanatory technique because it is low in power to filter out variables of minor causal importance (Tobias, 1997: 1).
The advantages of PLS include the ability to model multiple dependents as well as multiple independents; the ability to handle multicollinearity among the independents; robustness in the face of data noise and missing data; and the creation of independent latents directly on the basis of crossproducts involving the response variable(s), making for stronger predictions. Disadvantages of PLS include greater difficulty in interpreting the loadings of the independent latent variables (which are based on crossproduct relations with the response variables, not, as in conventional factor analysis, on correlations among the manifest independents) and the fact that, because the distributional properties of estimates are not known, the researcher cannot assess significance except through bootstrap induction. Overall, the mix of advantages and disadvantages means PLS is favored as a predictive technique and not as an interpretive technique, except for exploratory analysis as a prelude to an interpretive technique such as multiple linear regression or structural equation modeling.
Though developed by Herman Wold (Wold, 1981, 1985) for econometrics, PLS first gained popularity in chemometric research and later in industrial applications.
It has since spread to research in education, marketing, and the social sciences. PLS may be implemented as a regression model, predicting one or more dependents from a set of one or more independents; or it can be implemented as a path model, akin to structural equation modeling. PLS is implemented as a regression model by SPSS as of SPSS Version 16 and by SAS's PROC PLS as of ver. 6.11.
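To make the idea of latents built from crossproducts with the response concrete, the following is a minimal sketch of extracting a single PLS component for one dependent variable. It is not the SPSS or SAS implementation: the function names and the tiny dataset (three cases, four predictors, so more predictors than cases) are hypothetical, and only the first component is computed.

```python
import math

def center(columns):
    """Subtract each column's mean; return the centered columns and the means."""
    means = [sum(col) / len(col) for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    return centered, means

def first_pls_component(x_columns, y):
    """Extract one PLS latent for a single response (illustrative sketch only)."""
    xc, _ = center(x_columns)
    yc_list, y_means = center([y])
    yc, y_mean = yc_list[0], y_means[0]
    # Weights: crossproduct of each centered predictor with the centered response,
    # scaled to unit length.
    w = [sum(a * b for a, b in zip(col, yc)) for col in xc]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: the latent variable as a weighted sum of the predictors, one per case.
    n_cases = len(y)
    t = [sum(w[j] * xc[j][i] for j in range(len(xc))) for i in range(n_cases)]
    # Regress the centered response on the scores.
    q = sum(a * b for a, b in zip(yc, t)) / sum(v * v for v in t)
    fitted = [y_mean + q * ti for ti in t]
    return w, t, q, fitted

# Hypothetical data: 4 predictors (columns) measured on only 3 cases.
x_columns = [
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 4.0],
    [0.5, 0.1, 0.9],
    [3.0, 3.5, 2.0],
]
y = [1.0, 2.0, 4.0]

w, t, q, fitted = first_pls_component(x_columns, y)
print("weights:", w)
print("fitted values:", fitted)
```

Because the weights come from crossproducts rather than from inverting the predictor correlation matrix, the calculation goes through even though OLS would be unidentified with these data.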
Categorical variable coding. Both nominal and ordinal variables are treated the same, as categorical variables, by SPSS algorithms. Dummy variable coding is used. For a categorical variable with c categories, the first is coded (1, 0, 0, ..., 0), where the last 0 is for the cth category. The last category is coded (0, 0, 0, ..., 1). In the PLS dialog, the researcher specifies which dummy variable, representing the desired reference category, is to be omitted from the model.
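The coding scheme can be sketched in a few lines of Python. This is an illustration of dummy coding with an omitted reference category, not SPSS's internal routine; the function name and the "Other" category are assumptions made for the example.

```python
def dummy_code(variable, values, categories, reference):
    """Return indicator column names and one 0/1 row per case,
    leaving out the column for the reference category."""
    kept = [c for c in categories if c != reference]
    names = ["[{}={}]".format(variable, c) for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return names, rows

# Hypothetical three-category variable; "Other" is chosen as the reference.
categories = ["White", "Black", "Other"]
names, rows = dummy_code("race", ["White", "Other", "Black"], categories, "Other")
print(names)
for row in rows:
    print(row)
```

A case in the reference category is identified by zeros on all retained dummy variables.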
When prompted at the start of the PLS run, click the "Define Variable Properties" button. This first opens a dialog for entering the variables to be used, then proceeds to the "Define Variable Properties" dialog, shown above. SPSS scans the first 200 (default) cases and estimates the measurement level of each variable, classifying it as nominal, ordinal, or scale (interval or ratio). Symbols in front of variable names in the "Scanned variable list" on the left show the assigned measurement levels, though these initial assignments can be changed in the main dialog, using the drop-down menu for "Measurement Level". It is also a good idea to check the assignment of missing value codes and other settings in this dialog. Clicking the "Help" button explains the many options available in the "Define Variable Properties" dialog.
The cross-validation coefficient, r2cv, is the percent of variance explained in the dependent variate by the predictions from the leave-one-out process (see Wakeling & Morris, 2005: 294). That is, expressed as a proportion,
$$ r^2_{cv} = \frac{RSS - PRESS}{RSS} $$
where RSS is the initial sum of squares for the dependent variable and PRESS is the PRESS statistic (discussed below). Wakeling & Morris (2005: 298-300), using Monte Carlo simulation methods, have developed tables of critical values of r2cv for one-, two-, and three-dimensional models, for datasets with given numbers of rows and columns. Thus an r2cv greater than the critical value may be taken as significant, and the researcher may select the model with the fewest dimensions that has a significant cross-validation statistic as the most parsimonious and therefore optimal model.
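A small sketch of the leave-one-out calculation may help. It assumes r2cv = (RSS - PRESS) / RSS, with RSS taken as the sum of squares of the dependent variable about its mean, and it uses a one-predictor least-squares line as a stand-in for the PLS model purely to keep the example short. The data and function names are hypothetical.

```python
def loo_predictions(x, y):
    """Predict each case from a line fitted to all the other cases."""
    preds = []
    for i in range(len(y)):
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        mx = sum(xs) / len(xs)
        my = sum(ys) / len(ys)
        sxx = sum((a - mx) ** 2 for a in xs)
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        slope = sxy / sxx
        preds.append(my + slope * (x[i] - mx))
    return preds

def r2cv(y, preds):
    """Cross-validation coefficient as a proportion: (RSS - PRESS) / RSS."""
    mean_y = sum(y) / len(y)
    rss = sum((v - mean_y) ** 2 for v in y)            # initial sum of squares
    press = sum((v - p) ** 2 for v, p in zip(y, preds))  # leave-one-out errors
    return (rss - press) / rss

# Hypothetical data with a little noise around a straight line.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.9, 5.2, 6.8, 9.1, 11.0]
print(r2cv(y, loo_predictions(x, y)))
```

In practice the same function would be applied to leave-one-out predictions from PLS models with one, two, or three dimensions, and each result compared with the tabled critical value.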
The more of the variation in the Y variables a factor explains, the more powerful it is apt to be in explaining the variation in a new sample of dependent values. The more of the variation in the X variables a factor explains, the better it reflects the observed values of the set of independent variables.
Proportion of Variance Explained
| Latent Factors | X Variance | Cumulative X Variance | Y Variance | Cumulative Y Variance (R-square) | Adjusted R-square |
| 1 | .307 | .307 | .011 | .011 | .010 |
| 2 | .271 | .578 | .002 | .013 | .011 |
| 3 | .218 | .796 | .000 | .014 | .011 |
| 4 | .079 | .875 | 5.024E-5 | .014 | .010 |
| 5 | .125 | 1.000 | 1.875E-5 | .014 | .010 |
Weights
| Variables | Latent Factor 1 | Latent Factor 2 | Latent Factor 3 | Latent Factor 4 | Latent Factor 5 |
| [sex=Male] | .048 | .206 | .708 | .173 | -.457 |
| [race=White] | .297 | .528 | -.076 | .689 | .979 |
| [race=Black] | -.301 | -.472 | .238 | .688 | .978 |
| age | .463 | -.555 | -.482 | .154 | -.409 |
| prestg80 | .778 | -.524 | .459 | -.108 | .273 |
| [happy=Very Happy] | .113 | -.022 | .019 | .004 | .003 |
| [happy=Pretty Happy] | -.059 | .056 | .011 | .024 | .007 |
Loadings
| Variables | Latent Factor 1 | Latent Factor 2 | Latent Factor 3 | Latent Factor 4 | Latent Factor 5 |
| [sex=Male] | .065 | .182 | .705 | .871 | -.677 |
| [race=White] | .534 | .615 | -.207 | .461 | .187 |
| [race=Black] | -.531 | -.617 | .208 | .482 | .171 |
| age | .370 | -.380 | -.507 | .896 | -.577 |
| prestg80 | .652 | -.259 | .417 | -.576 | .380 |
| [happy=Very Happy] | .759 | -.713 | 1.504 | -.833 | -.829 |
| [happy=Pretty Happy] | -.707 | .792 | -.578 | 1.150 | 1.371 |
Variable Importance in the Projection
| Variables | Latent Factor 1 | Latent Factor 2 | Latent Factor 3 | Latent Factor 4 | Latent Factor 5 |
| [sex=Male] | .108 | .219 | .310 | .310 | .312 |
| [race=White] | .663 | .783 | .775 | .779 | .783 |
| [race=Black] | .674 | .757 | .753 | .758 | .761 |
| age | 1.034 | 1.075 | 1.075 | 1.073 | 1.073 |
| prestg80 | 1.739 | 1.651 | 1.641 | 1.638 | 1.637 |
Cumulative Variable Importance
Parameters
| Independent Variables | Dependent Variable: [happy=Very Happy] | Dependent Variable: [happy=Pretty Happy] |
| (Constant) | .049 | .724 |
| [sex=Male] | .013 | .017 |
| [race=White] | .034 | .048 |
| [race=Black] | -.020 | .027 |
| age | .001 | -.002 |
| prestg80 | .004 | -.003 |
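As a worked illustration, the parameter estimates in the table above can be turned into predicted values. The sketch assumes the entries are unstandardized coefficients applied additively to the raw predictor values; the respondent (a white male, age 40, occupational prestige score 50) and the function name are hypothetical.

```python
# Coefficients copied from the Parameters table above.
COEF_VERY_HAPPY = {
    "(Constant)": 0.049, "[sex=Male]": 0.013, "[race=White]": 0.034,
    "[race=Black]": -0.020, "age": 0.001, "prestg80": 0.004,
}
COEF_PRETTY_HAPPY = {
    "(Constant)": 0.724, "[sex=Male]": 0.017, "[race=White]": 0.048,
    "[race=Black]": 0.027, "age": -0.002, "prestg80": -0.003,
}

def predict(coefs, case):
    """Constant plus the sum of coefficient times predictor value."""
    return coefs["(Constant)"] + sum(coefs[name] * value for name, value in case.items())

# Hypothetical respondent, not taken from the dataset.
case = {"[sex=Male]": 1, "[race=White]": 1, "[race=Black]": 0, "age": 40, "prestg80": 50}

print("Very Happy:", round(predict(COEF_VERY_HAPPY, case), 3))
print("Pretty Happy:", round(predict(COEF_PRETTY_HAPPY, case), 3))
```

Because the dependents are themselves dummy variables, each predicted value can be read loosely as a score for membership in that category, keeping in mind how little Y variance the model explains.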






1. Install SPSS (SPSS CD).
2. Install Python 2.5.1 (SPSS CD).
3. Install the SPSS-Python Integration Plug-in (from the SPSS CD).
4. Install NumPy and SciPy (from the SPSS CD under Python and Additional Modules; note that this option installs Python, NumPy, and SciPy in order if they are not already present).
5. Install the PLS Extension, available at: http://www.spss.com/devcentral/index.cfm?pg=downloadDet&dId=76 (you can log in as Guest/Guest).
PLS generally yields the most accurate predictions and therefore has been much more widely used than PCR. PLS may also be more parsimonious than PCR. In a chemistry setting, Wentzell & Vega (2003: 257) conducted simulations to compare PLS and PCR, finding "In all cases, except when artificial constraints were placed on the number of latent variables retained, no significant differences were reported in the prediction errors reported by PCR and PLS. PLS almost always required fewer latent variables than PCR, but this did not appear to influence predictive ability."
Attempts have been made to improve the predictive power of PCR. Traditional PCR methods use the first k components (first in the sense of having the highest eigenvalues) to predict the response variable, Y. Hwang & Nettleton (2003: 71) note, "Restricting attention to principal components with the largest eigenvalues helps to control variance inflation but can introduce high bias by discarding components with small eigenvalues that may be most associated with Y. Jolliffe (1982) provided several real-life examples where the principal components corresponding to small eigenvalues had high correlation with Y. Hadi and Ling (1998) provided an example where only the principal component associated with the smallest eigenvalue was correlated with Y." Recall that variance inflation (measured in regression by the variance inflation factor, VIF) indicates multicollinearity: a multicollinear model may explain a high proportion of variance in Y, but redundancy among the X variables leads to inflated standard errors and inflated parameter estimates. Minimizing variance inflation may not minimize mean square error (MSE). To deal with the tradeoff between variance inflation and MSE, some researchers employ an "inferential approach", which uses only components whose regression coefficients significantly differ from zero (Mason & Gunst, 1985). More recently, Hwang & Nettleton (2003) have proposed a PCR selection strategy which selects components so as to minimize MSE, demonstrating through simulation studies that their estimator outperformed traditional PCR, inferential PCR, and even traditional PLS (which ranked second in the simulation, among many variants tested). However, it appears that Hwang-Nettleton estimators are not employed by current software.
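The point about small-eigenvalue components can be illustrated with a deliberately contrived two-predictor sketch. It relies on the fact that, for two standardized predictors with correlation r, the principal components are proportional to their sum and difference, with eigenvalues 1 + r and 1 - r, and that each predictor's VIF is 1 / (1 - r^2). The data are hypothetical, and Y is constructed to depend only on the difference between the predictors; this is not the Hwang-Nettleton estimator.

```python
import math

def standardize(values):
    """Center to mean 0 and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def correlation(a, b):
    za, zb = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(za, zb)) / (len(a) - 1)

# Two hypothetical, highly collinear predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1, 5.9]
z1, z2 = standardize(x1), standardize(x2)

r = correlation(x1, x2)
eigenvalues = (1 + r, 1 - r)   # large and small eigenvalue of the 2x2 correlation matrix
vif = 1 / (1 - r ** 2)         # variance inflation factor for either predictor

root2 = math.sqrt(2)
pc1 = [(a + b) / root2 for a, b in zip(z1, z2)]  # component with the large eigenvalue
pc2 = [(a - b) / root2 for a, b in zip(z1, z2)]  # component with the small eigenvalue

# Contrived response that depends only on the difference between the predictors.
y = [a - b for a, b in zip(z1, z2)]

corr_pc1 = correlation(pc1, y)
corr_pc2 = correlation(pc2, y)
print("eigenvalues:", eigenvalues, "VIF:", vif)
print("corr(PC1, Y):", corr_pc1, "corr(PC2, Y):", corr_pc2)
```

A rule that keeps only the first component would discard the one component that predicts Y here, which is the bias the quotation describes.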
PCR. With METHOD=PCR one is asking for principal components regression, which predicts response variables from factors underlying the predictor variables. Latents created with PCR may not predict Y-scores as well as latents created by PLS or SIMPLS.
Copyright 1998, 2008 by G. David Garson.
Last update 8/28/08.