Overview
The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's are the regression coefficients, representing the amount the dependent variable y changes when the corresponding independent variable changes by 1 unit, holding the other independents constant. The c is the constant, where the regression line intercepts the y axis, representing the value the dependent y will take when all the independent variables are 0. The standardized versions of the b coefficients are the beta weights, and the ratio of the beta coefficients is the ratio of the relative predictive power of the independent variables. Associated with multiple regression is R2, the squared multiple correlation, which is the percent of variance in the dependent variable explained collectively by all of the independent variables. Multiple regression shares all the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable ("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose range is not truncated. In addition, it is important that the model being tested is correctly specified. The exclusion of important causal variables or the inclusion of extraneous variables can markedly change the beta weights and hence the interpretation of the importance of the independent variables.

Note: t-tests are not used for dummy variables, even though SPSS and other statistical packages output them -- see Frequently Asked Questions section below. Note also that the t-test is a test only of the unique variance an independent variable accounts for, not of shared variance it may also explain, as shared variance, while incorporated in R2, is not reflected in the b coefficient.
One- vs. two-tailed t tests. Also note that t-tests in SPSS and SAS are two-tailed, which means they test the hypothesis that the b coefficient is either significantly higher or lower than zero. If our model is such that we can rule out one direction (ex., negative coefficients) and thus should test only if the b coefficient is more than zero, we want a one-tailed test. Provided the estimated coefficient falls in the hypothesized direction, the one-tailed significance level will be half the two-tailed probability level: if SPSS reports .05, for instance, then the one-tailed equivalent significance level is .025.
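The conversion from a two-tailed to a one-tailed probability can be sketched as follows. A two-tailed p-value splits the rejection region across both tails, so the one-tailed value is half of it when the estimate lies in the hypothesized direction, and one minus that half otherwise. The function name and arguments are hypothetical, chosen only for illustration; this is not SPSS or SAS output.

```python
def one_tailed_p(two_tailed_p, estimate, expect_positive=True):
    """Convert a two-tailed p-value for a b coefficient to a one-tailed one.

    two_tailed_p    -- the p-value as reported by the package
    estimate        -- the estimated b coefficient (only its sign is used)
    expect_positive -- True if the hypothesis is b > 0, False if b < 0
    """
    if not 0.0 <= two_tailed_p <= 1.0:
        raise ValueError("p-value must lie between 0 and 1")
    half = two_tailed_p / 2.0
    # The estimate supports the hypothesis only if its sign matches.
    in_expected_direction = (estimate > 0) == expect_positive
    return half if in_expected_direction else 1.0 - half


# A reported two-tailed .05 with a positive b, testing b > 0:
print(one_tailed_p(0.05, estimate=1.2))
# The same report when the estimate has the "wrong" sign:
print(one_tailed_p(0.05, estimate=-1.2))
```

When the sign of the estimate contradicts the hypothesis, the one-tailed probability is large and the one-tailed test cannot be significant, however small the two-tailed value.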
For large samples, SEE approximates the standard error of a predicted value. SEE is the standard deviation of the residuals. In a good model, SEE will be markedly less than the standard deviation of the dependent variable. In a good model, the mean of the dependent variable will be greater than 1.96 times SEE.

In SPSS, the F test appears in the ANOVA table, shown above for the example of number of auto accidents predicted from gender and age. Note that the F test is too lenient for the stepwise method of estimating regression coefficients and an adjustment to F is recommended (see Tabachnick and Fidell, 2001: 143 and Table C.5). In SPSS, select Analyze, Regression, Linear; click Statistics; make sure Model fit is checked to get the ANOVA table and the F test. Here the model is significant at the .001 level, which is the same as shown in the Model Summary table.
Generally, the beta method and the model comparison method will show the same IVs to be most important, but it easily can happen that an IV will have a beta approaching zero but still have an appreciable effect on R2 when it is dropped from the model, because its joint effects are appreciable even if its unique effect is not. The beta method is related to partial correlation, which is relative to the variability of the dependent variable after partialling out from the dependent variable the common variance associated with the control IVs (the IVs other than the one being considered in partial correlation). The model comparison method is related to part correlation, which is relative to the total variability of the dependent variable.

Mathematically, R2 = (1 - (SSE/SST)), where SSE = error sum of squares = SUM((Yi - EstYi)squared), where Yi is the actual value of Y for the ith case and EstYi is the regression prediction for the ith case; and where SST = total sum of squares = SUM((Yi - MeanY)squared). Sums of squares are shown in the ANOVA table in SPSS output, where the example computes R2 = (1-(1140.1/1172.358)) = 0.028. The "residual sum of squares" in SPSS output is SSE and reflects regression error. Thus R-square is 1 minus regression error as a percent of total error and will be 0 when regression error is as large as it would be if you simply guessed the mean for all cases of Y. Put another way, the regression sum of squares/total sum of squares = R-square, where the regression sum of squares = total sum of squares - residual sum of squares. In SPSS, Analyze, Regression, Linear; click the Statistics button; make sure Model fit is checked to get R2.
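As a minimal sketch of the computation just described, the functions below derive SSE and SST from actual and predicted values and then form R2 = 1 - SSE/SST. The function names are hypothetical; the two sums of squares in the usage lines are the ones quoted from the ANOVA table above.

```python
def sums_of_squares(actual, predicted):
    """Return (SSE, SST) for paired actual and predicted values of Y."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    mean_y = sum(actual) / len(actual)
    # SSE: squared distance of each case from its regression prediction.
    sse = sum((y - est) ** 2 for y, est in zip(actual, predicted))
    # SST: squared distance of each case from the mean of Y.
    sst = sum((y - mean_y) ** 2 for y in actual)
    return sse, sst


def r_squared(sse, sst):
    """R2 = 1 - SSE/SST."""
    if sst <= 0:
        raise ValueError("SST must be positive")
    return 1.0 - sse / sst


# Sums of squares quoted in the text (residual and total).
print(round(r_squared(1140.1, 1172.358), 3))  # 0.028

# Guessing the mean for every case gives SSE = SST, hence R2 = 0.
sse, sst = sums_of_squares([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
print(r_squared(sse, sst))  # 0.0
```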

The Model Summary table in SPSS output, shown above, gives R, R2, adjusted R2, the standard error of estimate (SEE), R2 and F change and the corresponding significance level, and the Durbin-Watson statistic. In the example above, number of accidents is predicted from age and gender. This output shows age and gender together explain only 2.4% of the variance in number of accidents for this sample. R2 is close to adjusted R2 because there are only two independent variables (adjusted R2 is discussed below). R2 change is the same as R2 because the variables were entered at the same time (not stepwise or in blocks), so there is only one regression model to report, and R2 change is change from the intercept-only model, which is also what R2 is. R2 change is discussed below. Since there is only one model, "Sig F Change" is the overall significance of the model, which for one model is also the significance of adding sex and age to the model in addition to the intercept. The Durbin-Watson statistic is a test to see if the assumption of independent observations is met, which is the same as testing to see if autocorrelation is present. As a rule of thumb, a Durbin-Watson statistic in the range of 1.5 to 2.5 means the researcher may reject the notion that data are autocorrelated (serially dependent) and instead may assume independence of observations, as is the case here. The Durbin-Watson test is discussed below.
F-incremental = [(R2with - R2without)/m] / [(1 - R2with)/df]
where m = number of IVs in the new block which is added; and df = N - k - 1 (where N is sample size; k is number of independent variables). F is read with m and df degrees of freedom to obtain a p (probability) value. Note the without model is nested within the with model. In SPSS, Analyze, Regression, Linear; click the Statistics button; make sure R squared change is checked to get "Sig F Change".
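The incremental F computation can be sketched as below. Two assumptions are made here that the text does not spell out: the R2 in the denominator is taken to be that of the larger ("with") model, which is the usual convention, and k counts the independents in that larger model. The function name and the figures in the usage line are hypothetical.

```python
def f_incremental(r2_with, r2_without, m, n, k):
    """Incremental F for adding a block of m IVs to a nested model.

    r2_with    -- R2 of the model including the new block
    r2_without -- R2 of the nested model without the block
    m          -- number of IVs in the added block
    n          -- sample size
    k          -- number of IVs in the "with" model (assumed)
    Returns (F, numerator df, denominator df).
    """
    df = n - k - 1
    if m <= 0 or df <= 0:
        raise ValueError("m and N - k - 1 must both be positive")
    if not 0.0 <= r2_without <= r2_with < 1.0:
        raise ValueError("need 0 <= R2without <= R2with < 1")
    f = ((r2_with - r2_without) / m) / ((1.0 - r2_with) / df)
    return f, m, df


# Hypothetical figures: two IVs added to a three-IV model, N = 103.
f, df1, df2 = f_incremental(0.30, 0.25, m=2, n=103, k=5)
print(round(f, 2), df1, df2)
```

The resulting F is then read against an F distribution with the returned degrees of freedom, whether from a printed table or from the "Sig F Change" column of the package output.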
The beta weights for the equation in the final step of stepwise regression do not partition R2 into increments associated with each independent because beta weights are affected by which variables are in the equation. The beta weights estimate the relative predictive power of each independent, controlling for all other independent variables in the equation for a given model. The R2 increments estimate the predictive power an independent brings to the analysis when it is added to the regression model, as compared to a model without that variable. Beta weights compare independents in one model, whereas R2 increments compare independents in two or more models.
This means that assessing a variable's importance using R2 increments is very different from assessing its importance using beta weights. The magnitude of a variable's beta weight reflects its relative explanatory importance controlling for other independents in the equation. The magnitude of a variable's R2 increment reflects its additional explanatory importance given that common variance it shares with other independents entered in earlier steps has been absorbed by these variables. For causal assessments, beta weights are better (though see the discussion of corresponding regressions for causal analysis). For purposes of sheer prediction, R2 increments are better.

In the table above, Mahalanobis distance, Cook's distance, and leverage are used for identifying outliers and influential cases, as discussed further below.

In the figure above, for the example of predicting auto accidents from sex and age, the Casewise Diagnostics table shows two outliers: cases 166 and 244.

In the partial regression plot above, for the example of sex and age predicting car accidents, sex is being used first to predict accidents, then to predict age. Since sex does not predict age at all and predicts only a very small percentage of accidents, the pattern of residuals in the partial regression plot forms a random cloud. This case, where the dots do not form a line, does not indicate lack of linearity of age with accidents but rather correlation approaching zero.
When tolerance is close to 0 there is high multicollinearity of that variable with other independents and the b and beta coefficients will be unstable. The more the multicollinearity, the lower the tolerance and the larger the standard error of the regression coefficients. Tolerance is part of the denominator in the formula for calculating the confidence limits on the b (partial regression) coefficient.
| Rj | Tolerance | VIF | Impact on SEb |
|---|---|---|---|
| 0 | 1 | 1 | 1.0 |
| .4 | .84 | 1.19 | 1.09 |
| .6 | .64 | 1.56 | 1.25 |
| .75 | .44 | 2.29 | 1.51 |
| .8 | .36 | 2.78 | 1.67 |
| .87 | .25 | 4.0 | 2.0 |
| .9 | .19 | 5.26 | 2.29 |
Standard error is doubled when VIF is 4.0 and tolerance is .25, corresponding to Rj = .87. Therefore VIF >= 4 is an arbitrary but common cut-off criterion for deciding when a given independent variable displays "too much" multicollinearity: values above 4 suggest a multicollinearity problem. Some researchers use the more lenient cutoff of 5.0 or even 10.0 to signal when multicollinearity is a problem. The researcher may wish to drop the variable with the highest VIF if multicollinearity is indicated and theory warrants.
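The columns of the table above can be recomputed from Rj alone. The sketch assumes the standard definitions: Rj is the multiple correlation of independent j with the other independents, tolerance = 1 - Rj squared, VIF = 1/tolerance, and the multiplier on the standard error of b is the square root of VIF. The function name is hypothetical.

```python
import math


def collinearity_stats(r_j):
    """Return (tolerance, VIF, multiplier on SEb) for a given Rj."""
    if not 0.0 <= r_j < 1.0:
        raise ValueError("Rj must satisfy 0 <= Rj < 1")
    tolerance = 1.0 - r_j ** 2
    vif = 1.0 / tolerance
    se_multiplier = math.sqrt(vif)
    return tolerance, vif, se_multiplier


# Print one row per Rj value, in the same column order as the table.
for r_j in (0.0, 0.4, 0.6, 0.8, 0.9):
    tol, vif, mult = collinearity_stats(r_j)
    print(f"{r_j:.2f}  {tol:.2f}  {vif:.2f}  {mult:.2f}")
```

Because the multiplier is a square root, a VIF of 4 doubles the standard error, which is why that value is a convenient, if arbitrary, cutoff.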

The figure above is output for the example of predicting accidents from gender and age. This further confirms that this example has no collinearity problem since no condition index approaches 30, making it unnecessary to examine variance proportions.
Panel data regression models may be in one of three types: fixed, between, or random effects.
Note: Adding variables to the model will always improve R2 at least a little for the current data, but it risks misspecification and does not necessarily improve R2 for other datasets examined later on. That is, it can overfit the regression model to noise in the current dataset and actually reduce the reliability of the model.
Sometimes specification is phrased as the assumption that "independent variables are measured without error." Error attributable to omitting causally important variables means that, to the extent that these unmeasured variables are correlated with the measured variables which are in the model, the b coefficients will be off. If the correlation is positive, then b coefficients will be too high; if negative, too low. That is, when a causally important variable is added to the model, the b coefficients will all change, assuming that variable is correlated with existing measured variables in the model (usually the case).
In regression, as a rule of thumb, nonlinearity is generally not a problem when the standard deviation of the dependent is more than the standard deviation of the residuals. Linearity is further discussed in the section on data assumptions. Note also that regression smoothing techniques and nonparametric regression exist to fit smoothed curves in a nonlinear manner.

An alternative for the same purpose is the normal probability plot, with the observed cumulative probabilities of occurrence of the standardized residuals on the Y axis and of expected normal probabilities of occurrence on the X axis, such that a 45-degree line will appear when observed conforms to normally expected. For the same example, the P-P plot below shows the same moderate departure from normality.

The F test is relatively robust in the face of small to medium violations of the normality assumption. The central limit theorem implies that even when error is not normally distributed, when sample size is large, the sampling distribution of the b coefficient will still be approximately normal. Therefore violations of this assumption usually have little or no impact on substantive conclusions for large samples, but when sample size is small, tests of normality are important.
Histograms and P-P plots may be selected under the Plots button in the SPSS regression dialog. Alternatively, in SPSS, select Graphs, Histogram; specify sre_1 as the variable (this is the studentized residual, previously saved with the Save button in the regression dialog). One can also test the residuals for normality using a Q-Q plot: in SPSS, select Graphs, Q-Q; specify the studentized residual (sre_1) in the Variables list; click OK. Dots should approximate a 45 degree line when residuals are normally distributed.
Nonconstant error variance can be observed by requesting a simple residual plot (a plot of residuals on the Y axis against predicted values on the X axis). A homoscedastic model will display a cloud of dots, whereas lack of homoscedasticity will be characterized by a pattern such as a funnel shape, indicating greater error as the dependent increases. Nonconstant error variance can indicate the need to respecify the model to include omitted independent variables.
Lack of homoscedasticity may mean (1) there is an interaction effect between a measured independent variable and an unmeasured independent variable not in the model; or (2) that some independent variables are skewed while others are not. One usual method of dealing with heteroscedasticity is to use weighted least squares regression instead of OLS regression. This causes cases with smaller residuals to be weighted more in calculating the b coefficients. Square root, log, and reciprocal transformations of the dependent may also reduce or eliminate lack of homoscedasticity.
Data labels. Influential cases with high leverage can be spotted graphically. Save lev_1 in the SPSS Save procedure above, then select Graphs, Scatter/Dot; select Simple Scatter; click Define; make lev_1 the Y axis and caseid the X axis; be sure to make an appropriate variable (like Name) the "Label cases by" variable; OK. Then double-click on the plot to bring up the Chart Editor; select Elements, Data Label Mode; click on cases high on the Y axis.
One can also spot outliers graphically using Cook's distance, which highlights very (unduly) influential cases. In SPSS, save Cook's distance (coo_1) using the Save button in the Regression dialog. Then select Graphs, Scatter/Dot, Simple Scatter; click Define; let coo_1 be the Y axis and case number be the X axis; click OK. If the graph shows any points far off the line, you can label them by case number. Double-click in the chart to bring up the Chart Editor, then select Elements, Data Label Mode, then click on the outlying dot(s) to make the label(s) appear.
The (population) error term, which is the difference between the actual values of the dependent and those estimated by the population regression equation, should be uncorrelated with each of the independent variables. Since the population regression line is not known for sample data, the assumption must be assessed by theory. Specifically, one must be confident that the dependent is not also a cause of one or more of the independents, and that the variables not included in the equation are not causes of Y and correlated with the variables which are included. Either circumstance would violate the assumption of uncorrelated error. One common type of correlated error occurs due to selection bias with regard to membership in the independent variable "group" (representing membership in a treatment vs. a comparison group): measured factors such as gender, race, education, etc., may cause differential selection into the two groups and also can be correlated with the dependent variable. When there is correlated error, conventional computation of standard deviations, t-tests, and significance are biased and cannot be used validly.
Note that residual error -- the difference between observed values and those estimated by the sample regression equation -- will always be uncorrelated with the independents, since OLS estimation guarantees this by construction. Therefore the lack of correlation of the residuals with the independents is not a valid test of this assumption.
Alternatively, the d value has an associated p probability value for various significance cutoffs (ex., .05). For a given level of significance such as .05, there is an upper and a lower d value limit. If the computed Durbin-Watson d value for a given series is more than the upper limit, the null hypothesis of no autocorrelation is not rejected and it is assumed that errors are serially uncorrelated. If the computed d value is less than the lower limit, the null hypothesis is rejected and it is assumed that errors are serially correlated. If the computed value is in-between the two limits, the result is inconclusive. In SPSS, one can obtain the Durbin-Watson coefficient for a set of residuals by opening the syntax window and running the command, FIT RES_1, assuming the residual variable is named RES_1.
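For readers who want to see what the d value measures, a minimal sketch of the Durbin-Watson computation follows. It assumes the standard definition, the sum of squared differences between successive residuals divided by the sum of squared residuals, with residuals ordered by case sequence. The function names are hypothetical, and the 1.5 to 2.5 band is the rule of thumb mentioned earlier, not a formal test against the upper and lower limits.

```python
def durbin_watson(residuals):
    """Durbin-Watson d for residuals listed in case (time) order."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    denominator = sum(e ** 2 for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    # Squared change from each residual to the next one in sequence.
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    return numerator / denominator


def within_rule_of_thumb(d, low=1.5, high=2.5):
    """True if d falls in the informal 'no autocorrelation' band."""
    return low <= d <= high


# Residuals that flip sign every case (negative autocorrelation).
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
# Residuals that stay on one side in runs (positive autocorrelation).
print(round(durbin_watson([1.0, 1.0, 1.0, -1.0, -1.0, -1.0]), 3))  # 0.667
```

Values near 2 indicate little serial correlation; values toward 0 indicate positive autocorrelation and values toward 4 indicate negative autocorrelation.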
For a graphical test of serial independence, a plot of studentized residuals on the Y axis against the sequence of cases (the caseid variable) on the X axis should show no pattern, indicating independence of errors. In SPSS, select Graphs, Scatter/Dot, Simple Scatter; specify sre_1 (the studentized residual, previously saved with the Save button in the regression dialog) as the Y axis and caseid as the X Axis; OK; double-click the graph to bring up the Chart Editor; select Options, Y Axis Reference Line; click Properties, specify 0 for the Y axis position; click Apply; Close.
When autocorrelation is present, one may choose to use generalized least-squares (GLS) estimation rather than the usual ordinary least-squares (OLS). In iteration 0 of GLS, the estimated OLS residuals are used to estimate the error covariance matrix. Then in iteration 1, GLS estimation minimizes the sum of squares of the residuals weighted by the inverse of the sample covariance matrix.
As an independent: The regression model makes no distributional assumptions about the independents, which may be discrete variables as long as other regression assumptions are met. The discreteness of ordinal variables is thus not a problem, but do ordinal variables approach intervalness? Ordinal variables must be interpreted with great care when there are known large violations of intervalness, such as where it is known that rankings obscure large gaps between, say the top three ranks and all the others. In most cases, however, methodologists simply use a rule-of-thumb that there must be a certain minimum number of classes in the ordinal independent (Achen, 1991, argues for at least 5; Berry (1993: 47) states five or fewer is "clearly inappropriate"; others have insisted on 7 or more). However, it must be noted that use of 5-point Likert scales in regression is extremely common in the literature.
As a dependent: Ordinal dependents are more problematic because their discreteness violates the regression assumptions of normal distribution of error with constant variance. A conservative method is to test to see if there are significant differences in the regression equation when computed separately for each value class of the ordinal dependent. If the independents seem to operate equally across each of the ordinal levels of the dependent, then use of an ordinal dependent is considered acceptable. The more liberal and much more common approach is to allow use of ordinal dependents as long as the number of response categories is not very small (at least 5 or 7, see above) and the responses are not highly concentrated in a very small number of response categories.
Three considerations govern which category to leave out. Since the b coefficients for dummy variables will reflect changes in the dependent with respect to the reference group (which is the left-out group), it is best if the reference group is clearly defined. Thus leaving out the "Other" or "Miscellaneous" category is not a good idea since the reference comparisons will be unclear, though leaving out "North" in the example above would be acceptable since the reference is well defined. Second, the left-out reference group should not be one with only a small number of cases, as that will not lead to stable reference comparisons. Third, some researchers prefer to leave out a "middle" category when transforming ordinal categories into dummy variables, feeling that reference comparisons with median groups are better than comparisons with extremes.
Regression coefficients should be assessed for the entire set of dummy variables for an original variable like "Region" (as opposed to separate t-tests for b coefficients as is done for interval variables). For a regression model in which all the independents are dummies for one original ordinal or nominal variable, the test is the F test for R-squared. Otherwise the appropriate test is the F test for the difference of R-squareds for the model with the set of dummies and the model without the set.
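F = [(R²₂ - R²₁)/(k₂ - k₁)] / [(1 - R²₂)/(n - k₂ - 1)]
where R²₂ is the R-squared for the model with the set of dummies, R²₁ is the R-squared for the model without the set, k₂ and k₁ are the numbers of independents in the two models, and n is the sample size.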
There are three methods of coding dummy variables. Coding greatly affects the magnitude and meaning of the b and beta coefficients, but not their significance. Coding does not affect the R-squared for the model or the significance of R-squared, as long as all dummy variables save the reference category are included in the model.
In general, the b coefficients are the distances from the dummy values to the reference value, controlling for other variables in the equation, and the distance from the reference category to the other dummy variables will be the same in a model in which the reference (omitted) categories are switched. Another implication is that the distance from one included dummy value to another included value (ex., from East to West in the example in which North is the omitted reference category) is simply the difference in their b coefficients. Thus if the b coefficient for East is 2.1 and the b coefficient for West is 1.6, then we may say that the effect of East is .5 units more (2.1 - 1.6 = .5) than the West effect, where the effect is still gauged in terms of unit increases in the dependent variable compared to being in the North. For "Region," assuming "North" is the reference category and education level is the dependent, a b of -1.5 for the dummy "South" means that the expected education level for the South is 1.5 years less than the average of "North" respondents.
Some textbooks say the b coefficient for a dummy variable is the difference in means between the two values of the dummy (0,1) variable. This is true only if the variable is a dichotomy. In general, the b coefficient for a given dummy variable is the difference in means between the given dummy variable and omitted reference dummy variable. For dichotomies, there will be only one given dummy variable and the other value will be the omitted reference category and so it is a special case in which the b coefficient is the difference in means between the two values of the dummy variable.
In an experimental context, the omitted reference group would ordinarily be the control group.
Given effect coding and given education level as the dependent, a b of -1.5 for the dummy "South" means that the expected education level for the South is 1.5 years less than the unweighted mean of the expected values for all subgroups. That is, binary coding interprets b for the dummy category (South) relative to the reference group (the left-out category), effects coding interprets it relative to the entire set of groups. A positive b coefficient for any included group (other than the -1 group) means it scored higher on the response variable than the grand mean for all subgroups, or if negative, then lower. A significant b coefficient for any included group (other than the -1 group) means that group is significantly different on the response variable from the grand mean. Under effect coding there is no comparison between the group coded -1 and the grand mean.
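The difference between the two interpretations can be illustrated with a small sketch. It relies on a simplifying assumption: the model contains only the dummies for one categorical variable, in which case the OLS b coefficients reduce to differences of group means and no regression routine is needed. With other variables in the equation, the coefficients would instead be adjusted differences. The data and function names are hypothetical.

```python
from statistics import mean


def group_means(values, groups):
    """Mean of the dependent within each category."""
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    return {g: mean(vs) for g, vs in by_group.items()}


def dummy_coded_b(means, reference):
    """Binary (0,1) coding: b = group mean minus reference-group mean."""
    return {g: m - means[reference] for g, m in means.items() if g != reference}


def effect_coded_b(means, minus_one_group):
    """Effect coding: b = group mean minus unweighted mean of group means."""
    grand = mean(list(means.values()))
    return {g: m - grand for g, m in means.items() if g != minus_one_group}


# Hypothetical years of education by region.
years = [12.0, 14.0, 16.0, 11.0, 13.0, 15.0, 17.0, 17.0, 19.0]
region = ["North"] * 3 + ["South"] * 2 + ["East"] * 2 + ["West"] * 2

means = group_means(years, region)
print(dummy_coded_b(means, reference="North"))
print(effect_coded_b(means, minus_one_group="North"))
```

With these figures the South coefficient is -2 under binary coding (relative to North) but -3 under effect coding (relative to the unweighted grand mean of 15), although the underlying group means are identical in both cases.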
To compare the first cluster with the second, the cluster of interest (managers and white-collar) would thus be coded +.5 each (1 divided by the 2 categories in the cluster), and the categories of the reference cluster would be coded -.33 each (-1 divided by the 3 categories). Contrast code(s) will sum to zero across all categories. To contrast managers v. white-collar only, managers would be considered the category of interest (coded +1), white-collar the reference category (coded -1), and all others the third cluster (coded 0). The group contrast is the b coefficien