To obtain this output:
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI BCOV R ANOVA COLLIN TOL CHANGE ZPP
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT rincome
/METHOD=ENTER agewed age educ degree
/PARTIALPLOT ALL
/SCATTERPLOT=(*SDRESID ,*ZPRED ) (*ZPRED ,rincome )
/RESIDUALS DURBIN HIST(ZRESID) NORM(ZRESID)
/CASEWISE PLOT(ZRESID) OUTLIERS(3) .
| Output Created | 29-JAN-2007 16:51:32 | |
|---|---|---|
| Comments | ||
| Input | Data | I:\PC\DATASETS\GSS\gss93\GSS93.SAV |
| Active Dataset | DataSet1 | |
| Filter | <none> | |
| Weight | <none> | |
| Split File | <none> | |
| N of Rows in Working Data File | 1606 | |
| Missing Value Handling | Definition of Missing | User-defined missing values are treated as missing. |
| Cases Used | Statistics are based on cases with no missing values for any variable used. | |
| Syntax | REGRESSION /DESCRIPTIVES MEAN STDDEV CORR SIG N /MISSING LISTWISE /STATISTICS COEFF OUTS CI BCOV R ANOVA COLLIN TOL CHANGE ZPP /CRITERIA=PIN(.05) POUT(.10) /NOORIGIN /DEPENDENT rincome /METHOD=ENTER agewed age educ degree /PARTIALPLOT ALL /SCATTERPLOT=(*SDRESID ,*ZPRED ) (*ZPRED ,rincome ) /RESIDUALS DURBIN HIST(ZRESID) NORM(ZRESID) /CASEWISE PLOT(ZRESID) OUTLIERS(3) . |
| Resources | Elapsed Time | 0:00:06.61 |
| Memory Required | 15276 bytes | |
| Additional Memory Required for Residual Plots | 2768 bytes | |
| Processor Time | 0:00:04.88 | |
[DataSet1] I:\PC\DATASETS\GSS\gss93\GSS93.SAV
| | Mean | Std. Deviation | N |
|---|---|---|---|
| RESPONDENTS INCOME | 9.94 | 3.035 | 837 |
| AGE WHEN FIRST MARRIED | 22.79 | 4.660 | 837 |
| AGE OF RESPONDENT | 43.13 | 12.028 | 837 |
| HIGHEST YEAR OF SCHOOL COMPLETED | 13.56 | 2.827 | 837 |
| RS HIGHEST DEGREE | 1.61 | 1.182 | 837 |
The correlation matrix below shows the Pearson correlations (r), the one-tailed significance of each r, and the sample size (n) for each r. All correlations are significant at the .05 level or better, except those of age with respondent's income, age when first married, and respondent's highest degree. The correlation between highest year of school completed and highest degree is r = .886, high enough that there may be multicollinearity between these two variables.
| | | RESPONDENTS INCOME | AGE WHEN FIRST MARRIED | AGE OF RESPONDENT | HIGHEST YEAR OF SCHOOL COMPLETED | RS HIGHEST DEGREE |
|---|---|---|---|---|---|---|
| Pearson Correlation | RESPONDENTS INCOME | 1.000 | .102 | .012 | .344 | .301 |
| | AGE WHEN FIRST MARRIED | .102 | 1.000 | .039 | .286 | .312 |
| | AGE OF RESPONDENT | .012 | .039 | 1.000 | -.114 | -.042 |
| | HIGHEST YEAR OF SCHOOL COMPLETED | .344 | .286 | -.114 | 1.000 | .886 |
| | RS HIGHEST DEGREE | .301 | .312 | -.042 | .886 | 1.000 |
| Sig. (1-tailed) | RESPONDENTS INCOME | . | .002 | .362 | .000 | .000 |
| | AGE WHEN FIRST MARRIED | .002 | . | .127 | .000 | .000 |
| | AGE OF RESPONDENT | .362 | .127 | . | .000 | .113 |
| | HIGHEST YEAR OF SCHOOL COMPLETED | .000 | .000 | .000 | . | .000 |
| | RS HIGHEST DEGREE | .000 | .000 | .113 | .000 | . |
| N | RESPONDENTS INCOME | 837 | 837 | 837 | 837 | 837 |
| | AGE WHEN FIRST MARRIED | 837 | 837 | 837 | 837 | 837 |
| | AGE OF RESPONDENT | 837 | 837 | 837 | 837 | 837 |
| | HIGHEST YEAR OF SCHOOL COMPLETED | 837 | 837 | 837 | 837 | 837 |
| | RS HIGHEST DEGREE | 837 | 837 | 837 | 837 | 837 |
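As a rough check on the significance levels in the correlation matrix, the sketch below converts a Pearson r into a t statistic with n - 2 degrees of freedom. The function name `r_significance` is hypothetical, and the p-value uses the standard normal in place of the t distribution, an approximation that is only reasonable because n = 837 is large. Because the r values are copied from the table at three decimals, the results will only roughly match the SPSS figures.

```python
import math

def r_significance(r, n):
    """Return (t, one_tailed_p) for a Pearson r computed from n cases.

    Assumption: the p-value is taken from the standard normal rather than
    the t distribution, which is only a good approximation for large n.
    """
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    # Upper-tail area of the standard normal beyond |t|
    p = 0.5 * math.erfc(abs(t) / math.sqrt(2))
    return t, p

n = 837  # listwise N from the descriptives table
pairs = [
    ("income with age when first married", 0.102),
    ("income with age of respondent", 0.012),
    ("education with highest degree", 0.886),
]
for label, r in pairs:
    t, p = r_significance(r, n)
    print(f"{label}: r = {r:.3f}, t = {t:.2f}, one-tailed p = {p:.3f}")
```

The weak correlation of income with age (r = .012) comes out far from significant, while r = .102 is significant despite being small, which illustrates how a sample of 837 cases makes even modest correlations detectable.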
| Model | Variables Entered | Variables Removed | Method |
|---|---|---|---|
| 1 | RS HIGHEST DEGREE, AGE OF RESPONDENT, AGE WHEN FIRST MARRIED, HIGHEST YEAR OF SCHOOL COMPLETED(a) | . | Enter |
| a All requested variables entered. | |||
| b Dependent Variable: RESPONDENTS INCOME | |||
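The model summary table below is the "bottom line." R Square is .121, so the four predictors together explain about 12% of the variance in respondent's income (Adjusted R Square = .117). The Durbin-Watson statistic of 1.980 is close to 2, which suggests little autocorrelation in the residuals.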
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change | Durbin-Watson |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | .348(a) | .121 | .117 | 2.852 | .121 | 28.634 | 4 | 832 | .000 | 1.980 |
| a Predictors: (Constant), RS HIGHEST DEGREE, AGE OF RESPONDENT, AGE WHEN FIRST MARRIED, HIGHEST YEAR OF SCHOOL COMPLETED | ||||||||||
| b Dependent Variable: RESPONDENTS INCOME | ||||||||||
The ANOVA table below tests the overall significance of the model (that is, of the regression equation). If we had been doing stepwise regression, significance for each step would be computed. Here the significance of the F value is below .05, so the model is significant.
| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 931.945 | 4 | 232.986 | 28.634 | .000(a) |
| | Residual | 6769.699 | 832 | 8.137 | | |
| | Total | 7701.644 | 836 | | | |
| a Predictors: (Constant), RS HIGHEST DEGREE, AGE OF RESPONDENT, AGE WHEN FIRST MARRIED, HIGHEST YEAR OF SCHOOL COMPLETED | ||||||
| b Dependent Variable: RESPONDENTS INCOME | ||||||
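The ANOVA table and the model summary are tied together by simple arithmetic. The sketch below recomputes the mean squares, the F ratio, R Square, Adjusted R Square, and the standard error of the estimate from the sums of squares and degrees of freedom printed above. The variable names are illustrative, and because the inputs are rounded, the results agree with the SPSS output only to about three decimals.

```python
import math

# Sums of squares and degrees of freedom copied from the ANOVA table
ss_regression = 931.945
ss_residual = 6769.699
df_regression = 4      # number of predictors
df_residual = 832      # n - number of predictors - 1

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F tests the model as a whole: explained vs. unexplained variance per df
f_ratio = ms_regression / ms_residual

# R Square is the share of total variation captured by the regression
r_square = ss_regression / ss_total

# Adjusted R Square penalizes for the number of predictors
df_total = df_regression + df_residual
adj_r_square = 1 - (1 - r_square) * df_total / df_residual

# The standard error of the estimate is the root of the residual mean square
std_error_estimate = math.sqrt(ms_residual)

print(f"F = {f_ratio:.3f}")
print(f"R Square = {r_square:.3f}, Adjusted R Square = {adj_r_square:.3f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.3f}")
```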
The table below gives the b and beta coefficients and other coefficients for the model.
| Model | | B (Unstandardized) | Std. Error | Beta (Standardized) | t | Sig. | 95% CI for B: Lower Bound | 95% CI for B: Upper Bound | Zero-order Correlation | Partial Correlation | Part Correlation | Tolerance | VIF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (Constant) | 3.949 | 1.014 | | 3.893 | .000 | 1.958 | 5.939 | | | | | |
| | AGE WHEN FIRST MARRIED | .001 | .022 | .001 | .043 | .966 | -.043 | .045 | .102 | .001 | .001 | .899 | 1.113 |
| | AGE OF RESPONDENT | .014 | .008 | .054 | 1.638 | .102 | -.003 | .030 | .012 | .057 | .053 | .967 | 1.034 |
| | HIGHEST YEAR OF SCHOOL COMPLETED | .407 | .076 | .379 | 5.330 | .000 | .257 | .556 | .344 | .182 | .173 | .209 | 4.777 |
| | RS HIGHEST DEGREE | -.085 | .183 | -.033 | -.464 | .643 | -.443 | .274 | .301 | -.016 | -.015 | .209 | 4.793 |
| a Dependent Variable: RESPONDENTS INCOME | |||||||||||||
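To make the b and beta columns concrete, the sketch below plugs the unstandardized coefficients into the regression equation and converts each b to a beta by rescaling it with the standard deviations from the descriptives table. The dictionary keys and the functions `predict` and `beta` are hypothetical names for illustration. All inputs are rounded table values, so the results are approximate.

```python
# Unstandardized coefficients (B) copied from the coefficients table
b = {"constant": 3.949, "agewed": 0.001, "age": 0.014,
     "educ": 0.407, "degree": -0.085}

# Means and standard deviations copied from the descriptives table
means = {"agewed": 22.79, "age": 43.13, "educ": 13.56, "degree": 1.61}
sds = {"agewed": 4.660, "age": 12.028, "educ": 2.827, "degree": 1.182}
sd_income = 3.035

def predict(case):
    """Predicted income category for a dict of predictor values."""
    return b["constant"] + sum(b[name] * value for name, value in case.items())

def beta(name):
    """Standardized coefficient: b rescaled by sd(predictor) / sd(dependent)."""
    return b[name] * sds[name] / sd_income

# A case at the mean of every predictor is predicted to be near the mean income
print(f"prediction at the predictor means: {predict(means):.2f}")

for name in ("agewed", "age", "educ", "degree"):
    print(f"beta for {name}: {beta(name):.3f}")
```

The prediction at the means lands close to the mean income of 9.94, as expected for a least-squares equation, and the rescaled coefficients reproduce the Beta column up to rounding.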
| Model | | | RS HIGHEST DEGREE | AGE OF RESPONDENT | AGE WHEN FIRST MARRIED | HIGHEST YEAR OF SCHOOL COMPLETED |
|---|---|---|---|---|---|---|
| 1 | Correlations | RS HIGHEST DEGREE | 1.000 | -.120 | -.125 | -.876 |
| | | AGE OF RESPONDENT | -.120 | 1.000 | -.060 | .168 |
| | | AGE WHEN FIRST MARRIED | -.125 | -.060 | 1.000 | -.031 |
| | | HIGHEST YEAR OF SCHOOL COMPLETED | -.876 | .168 | -.031 | 1.000 |
| | Covariances | RS HIGHEST DEGREE | .033 | .000 | -.001 | -.012 |
| | | AGE OF RESPONDENT | .000 | 6.96E-005 | -1.11E-005 | .000 |
| | | AGE WHEN FIRST MARRIED | -.001 | -1.11E-005 | .000 | -5.20E-005 |
| | | HIGHEST YEAR OF SCHOOL COMPLETED | -.012 | .000 | -5.20E-005 | .006 |
| a Dependent Variable: RESPONDENTS INCOME | ||||||
Above: The tolerance for a variable is 1 - R-squared for the regression of that variable on all the other independents, ignoring the dependent. When tolerance is close to 0, there is high multicollinearity of that variable with the other independents, and the b and beta coefficients will be unstable. VIF is the variance inflation factor, which is simply the reciprocal of tolerance. Therefore, when VIF is high, there is high multicollinearity and instability of the b and beta coefficients. Note that tolerance and VIF appear in the Collinearity Statistics columns of the coefficients table. The table of coefficient correlations and covariances that follows it describes the relationships among the estimated coefficients and does not report tolerance or VIF.
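The relationship between tolerance and VIF can be checked directly. The sketch below takes the tolerances from the coefficients table and inverts them. It also computes a pairwise approximation for the two education variables using only their correlation of .886. This is an assumption made for illustration: the SPSS tolerance regresses each variable on all the other predictors, so it is somewhat lower (.209) than the pairwise figure. The function names are hypothetical.

```python
def vif(tolerance):
    """Variance inflation factor: the reciprocal of tolerance."""
    return 1.0 / tolerance

def tolerance_from_r_square(r_square):
    """Tolerance is the share of a predictor's variance NOT explained by the others."""
    return 1.0 - r_square

# Tolerances copied from the coefficients table (rounded to three decimals)
tolerances = {
    "AGE WHEN FIRST MARRIED": 0.899,
    "AGE OF RESPONDENT": 0.967,
    "HIGHEST YEAR OF SCHOOL COMPLETED": 0.209,
    "RS HIGHEST DEGREE": 0.209,
}
for name, tol in tolerances.items():
    print(f"{name}: tolerance = {tol:.3f}, VIF = {vif(tol):.3f}")

# Pairwise approximation using only the education/degree correlation
pairwise_tolerance = tolerance_from_r_square(0.886 ** 2)
print(f"pairwise tolerance = {pairwise_tolerance:.3f}, "
      f"pairwise VIF = {vif(pairwise_tolerance):.3f}")
```

The pairwise tolerance is already close to the reported .209, which suggests that nearly all of the collinearity for these two variables comes from their relationship with each other.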
Below: The table below is another way of assessing whether there is too much multicollinearity in the model. To simplify, the crossproducts of the independent variables are factored. High eigenvalues indicate dimensions (factors) that account for a lot of the variance in the crossproduct matrix, while eigenvalues close to 0 indicate dimensions that explain little variance. Multiple eigenvalues close to 0 indicate an ill-conditioned crossproduct matrix, meaning there is a problem with multicollinearity. The condition index, which is the square root of the ratio of the largest eigenvalue to the eigenvalue of the given dimension, summarizes the findings. A common rule of thumb is that a condition index over 15 indicates a possible multicollinearity problem and a condition index over 30 suggests a serious one. If a dimension has a high condition index, one looks across its row in the variance proportions columns to see whether it accounts for a sizable proportion of the variance of two or more coefficients. If it does, multicollinearity is a problem. Here the fifth dimension has a high condition index (31.365), and it accounts for large variance proportions for highest year of school completed (.90) and highest degree (.72), as well as for the constant (.91). These are the same two education variables that correlate at r = .886. The researcher will wish to drop a variable, combine variables, or pursue some other multicollinearity strategy.
| Model | Dimension | Eigenvalue | Condition Index | Variance Proportion: (Constant) | Variance Proportion: AGE WHEN FIRST MARRIED | Variance Proportion: AGE OF RESPONDENT | Variance Proportion: HIGHEST YEAR OF SCHOOL COMPLETED | Variance Proportion: RS HIGHEST DEGREE |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 4.638 | 1.000 | .00 | .00 | .00 | .00 | .00 |
| | 2 | .279 | 4.074 | .00 | .00 | .04 | .00 | .19 |
| | 3 | .055 | 9.223 | .01 | .15 | .80 | .01 | .05 |
| | 4 | .024 | 13.998 | .08 | .80 | .05 | .09 | .04 |
| | 5 | .005 | 31.365 | .91 | .05 | .11 | .90 | .72 |
| a Dependent Variable: RESPONDENTS INCOME | ||||||||
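The condition index column can be approximately recovered from the eigenvalue column, since each index is the square root of the largest eigenvalue divided by the eigenvalue for that dimension. The sketch below applies that formula and flags dimensions using the rule-of-thumb cutoffs of 15 and 30 mentioned above. The function name is hypothetical, and because the eigenvalues are printed to only three decimals, the smallest ones give indices that differ somewhat from the SPSS values.

```python
import math

# Eigenvalues copied from the collinearity diagnostics table (rounded)
eigenvalues = [4.638, 0.279, 0.055, 0.024, 0.005]

def condition_indices(eigs):
    """Square root of (largest eigenvalue / each eigenvalue)."""
    largest = max(eigs)
    return [math.sqrt(largest / e) for e in eigs]

indices = condition_indices(eigenvalues)
for dimension, ci in enumerate(indices, start=1):
    if ci > 30:
        flag = "serious multicollinearity problem suggested"
    elif ci > 15:
        flag = "possible multicollinearity problem"
    else:
        flag = "no problem indicated"
    print(f"dimension {dimension}: condition index = {ci:.3f} ({flag})")
```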
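The table below is a listing of outliers: cases whose standardized residual is 3 or more in absolute value, meaning the observed value of the dependent lies 3 or more standard errors of the estimate away from the predicted value. Here all nine listed cases have very low observed incomes that the model overpredicts. The researcher looks at these cases to consider whether they merit a separate model or reflect measurement errors. Either way, the researcher may decide to drop these cases from analysis.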
| Case Number | Std. Residual | RESPONDENTS INCOME | Predicted Value | Residual |
|---|---|---|---|---|
| 110 | -3.225 | 1 | 10.20 | -9.199 |
| 411 | -3.278 | 1 | 10.35 | -9.350 |
| 767 | -3.310 | 1 | 10.44 | -9.442 |
| 789 | -3.112 | 1 | 9.88 | -8.878 |
| 878 | -3.415 | 2 | 11.74 | -9.741 |
| 933 | -3.510 | 1 | 11.01 | -10.013 |
| 1115 | -3.250 | 1 | 10.27 | -9.272 |
| 1174 | -3.104 | 1 | 9.85 | -8.853 |
| 1269 | -3.713 | 1 | 11.59 | -10.591 |
| a Dependent Variable: RESPONDENTS INCOME | ||||
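The standardized residuals in the casewise listing can be reproduced by dividing each raw residual by the standard error of the estimate (2.852 in the model summary). The sketch below does this for the nine listed cases and applies the cutoff of 3 requested by OUTLIERS(3). The names are illustrative, and results match the table only up to rounding.

```python
# Standard error of the estimate from the model summary table
std_error_estimate = 2.852

# Raw residuals (observed minus predicted) copied from the casewise table
residuals = {
    110: -9.199, 411: -9.350, 767: -9.442, 789: -8.878, 878: -9.741,
    933: -10.013, 1115: -9.272, 1174: -8.853, 1269: -10.591,
}

def standardized(residual):
    """Raw residual expressed in standard-error-of-estimate units."""
    return residual / std_error_estimate

cutoff = 3.0
for case, residual in residuals.items():
    z = standardized(residual)
    status = "outlier" if abs(z) >= cutoff else "within bounds"
    print(f"case {case}: std. residual = {z:.3f} ({status})")
```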
The table below contains summary data regarding the residuals (the differences between actual and predicted values). Std. Residual, for instance, is the standardized residual (the raw residual divided by the standard error of the estimate). Since the minimum standardized residual is -3.71, at least one case has an observed value more than 3 standard deviations below its predicted value. Studentized residuals are very similar to standardized residuals and follow the t distribution. These are used in plots of standardized or studentized residuals against predicted values. The "deleted residual" rows have to do with coefficients when the model is recomputed over and over, dropping one case from the analysis each time. The bottom three rows are measures of the influence of the minimum, maximum, and mean case on the model. Mahalanobis distance is (n-1) times leverage (the bottom row), which is a measure of case influence. Cases with leverage values less than .2 are not a problem, but cases with leverage values of .5 or higher may be unduly influential in the model and should be examined. Cook's distance measures how much the b coefficients change when a case is dropped. In this example, there do not appear to be problem cases, since the maximum leverage is only .077.
| | Minimum | Maximum | Mean | Std. Deviation | N |
|---|---|---|---|---|---|
| Predicted Value | 4.67 | 12.67 | 9.94 | 1.056 | 837 |
| Std. Predicted Value | -4.988 | 2.592 | .000 | 1.000 | 837 |
| Standard Error of Predicted Value | .113 | .795 | .208 | .072 | 837 |
| Adjusted Predicted Value | 4.81 | 12.68 | 9.94 | 1.057 | 837 |
| Residual | -10.591 | 6.356 | .000 | 2.846 | 837 |
| Std. Residual | -3.713 | 2.228 | .000 | .998 | 837 |
| Stud. Residual | -3.762 | 2.286 | .000 | 1.001 | 837 |
| Deleted Residual | -10.875 | 6.691 | -.002 | 2.865 | 837 |
| Stud. Deleted Residual | -3.792 | 2.292 | -.001 | 1.003 | 837 |
| Mahal. Distance | .310 | 63.998 | 3.995 | 4.443 | 837 |
| Cook's Distance | .000 | .110 | .001 | .006 | 837 |
| Centered Leverage Value | .000 | .077 | .005 | .005 | 837 |
| a Dependent Variable: RESPONDENTS INCOME | |||||
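The link between Mahalanobis distance and centered leverage described above can be verified against the table. The sketch below multiplies leverage by (n - 1). It also uses the standard identity that centered leverage values sum to the number of predictors, so their mean is k / n. The function name is hypothetical. The maximum leverage is printed to only three decimals, so the recomputed maximum distance is approximate.

```python
n = 837   # cases used in the analysis
k = 4     # predictors: agewed, age, educ, degree

def mahalanobis_from_leverage(leverage, n_cases):
    """Mahalanobis distance implied by a centered leverage value."""
    return (n_cases - 1) * leverage

# Mean centered leverage is k / n because centered leverages sum to k
mean_leverage = k / n
mean_distance = mahalanobis_from_leverage(mean_leverage, n)
print(f"mean leverage = {mean_leverage:.3f}, mean Mahal. distance = {mean_distance:.3f}")

# Maximum leverage as printed in the residuals statistics table (rounded)
max_leverage = 0.077
max_distance = mahalanobis_from_leverage(max_leverage, n)
print(f"max Mahal. distance is roughly {max_distance:.1f}")

# Rule of thumb from the text: under .2 is fine, .5 or higher needs a look
print("leverage concern" if max_leverage >= 0.2 else "no leverage concern")
```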
The zresid histogram below provides a visual way of assessing if the assumption of normally distributed residual error is met. Regression is robust in the face of some deviation from this assumption, and for this example the small skewness to the right should not affect substantive conclusions.
The normal probability plot (zresid normal p-p plot) below is another test of normally distributed residual error. Under perfect normality, the plot will be a 45-degree line. For this example, it is imperfect but close enough for exploratory conclusions.
We would like residuals to be randomly related to the value of the dependent, but here we see there is a downward-sloping trend.
In the plot of standardized predicted values vs. observed values, if 100% of the variance is explained by a linear relationship, the points will form a straight line. The lower the percent of variance explained, the more the points will form a cloud with no trend. The more the points are dispersed around the trend (a lot, in this case), the higher the standard error of estimate and the poorer the model. The plot below reflects the fact that the model in this case explains only a small percentage of the variance.
Partial regression plots, like those below, are also called "added variable plots" and are produced when there are two or more predictors. You get one partial regression plot for each predictor. In each plot, the dependent (rincome) is regressed on all predictors except the one of current interest (agewed, for instance), and likewise that predictor is regressed on all the other predictors. When the two sets of residuals are plotted against each other, the extent to which the points fall on a line shows the correlation of the dependent with the given independent, controlling for all other predictors. Thus the partial regression plot is a visual form of the t-test of the b coefficient for the given variable.
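The residual-on-residual construction described above can be sketched in a few lines. The example below uses small made-up data (not GSS values) with one predictor of interest and one control. It residualizes both the dependent and the predictor on the control, then shows that the slope through the two sets of residuals equals the b coefficient from the regression on both predictors together. All names and numbers are hypothetical.

```python
def centered_cross(u, v):
    """Sum of cross-products of u and v around their means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    slope = centered_cross(x, y) / centered_cross(x, x)
    intercept = sum(y) / len(y) - slope * sum(x) / len(x)
    return intercept, slope

def residuals(x, y):
    """Residuals of y after regressing it on x."""
    intercept, slope = simple_ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Hypothetical toy data: education is the predictor of interest, age the control
educ = [8, 10, 12, 12, 14, 16, 16, 18]
age = [60, 25, 40, 33, 52, 29, 45, 38]
income = [4, 5, 9, 8, 12, 11, 14, 15]

# Points of the added variable plot: both variables purged of the control
income_resid = residuals(age, income)
educ_resid = residuals(age, educ)
_, partial_slope = simple_ols(educ_resid, income_resid)

# b for educ from the two-predictor regression, via the closed-form solution
s_ee = centered_cross(educ, educ)
s_aa = centered_cross(age, age)
s_ea = centered_cross(educ, age)
s_ey = centered_cross(educ, income)
s_ay = centered_cross(age, income)
direct_slope = (s_ey * s_aa - s_ay * s_ea) / (s_ee * s_aa - s_ea ** 2)

print(f"slope through the residuals: {partial_slope:.6f}")
print(f"b from the full regression:  {direct_slope:.6f}")
```

Plotting `educ_resid` against `income_resid` would give the added variable plot itself. The tighter the points cluster around the line with slope `partial_slope`, the stronger the partial relationship.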