Overview
Logistic regression can be used to predict a dependent variable on the basis of continuous and/or categorical independents and to determine the percent of variance in the dependent variable explained by the independents; to rank the relative importance of independents; to assess interaction effects; and to understand the impact of covariate control variables. The impact of predictor variables is usually explained in terms of odds ratios. Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable (the natural log of the odds of the dependent occurring or not). In this way, logistic regression estimates the odds of a certain event occurring. Note that logistic regression calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS regression does. Logistic regression has many analogies to OLS regression: logit coefficients correspond to b coefficients in the logistic regression equation, the standardized logit coefficients correspond to beta weights, and a pseudo R2 statistic is available to summarize the strength of the relationship. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements. It does, however, require that observations be independent and that the independent variables be linearly related to the logit of the dependent. The predictive success of the logistic regression can be assessed by looking at the classification table, showing correct and incorrect classifications of the dichotomous, ordinal, or polytomous dependent. Goodness-of-fit tests such as the likelihood ratio test are available as indicators of model appropriateness, as is the Wald statistic to test the significance of individual independent variables.
In SPSS, binomial logistic regression is under Analyze - Regression - Binary Logistic, and the multinomial version is under Analyze - Regression - Multinomial Logistic. Logit regression, discussed separately, is another related option in SPSS for using loglinear methods to analyze one or more dependents. Where both are applicable, logit regression has numerically equivalent results to logistic regression, but with different output options. For the same class of problems, logistic regression has become more popular among social scientists.
SAS's PROC CATMOD computes both simple and multinomial logistic regression, whereas PROC LOGIST is for simple (dichotomous) logistic regression. CATMOD uses a conventional model statement, for example: model wsat*supsat*qman=_response_ / nogls ml; Note that in the model statement, nogls suppresses generalized least squares estimation and ml specifies maximum likelihood estimation.
The -2LL statistic is the likelihood ratio. It is also called goodness of fit, deviance chi-square, scaled deviance, deviation chi-square, DM, or L-square. It reflects the significance of the unexplained variance in the dependent. In SPSS output, this statistic is found in the "-2 Log Likelihood" column of the "Model Fitting Information" table for the Final row. The likelihood ratio is not used directly in significance testing, but it is the basis for the likelihood ratio test, which is the test of the difference between two likelihood ratios (two -2LL's), as discussed below. In general, as the model becomes better, -2LL will decrease in magnitude.
The likelihood ratio test looks at model chi-square (chi square difference) by subtracting deviance (-2LL) for the final (full) model from deviance for the intercept-only model. Degrees of freedom in this test equal the number of terms in the model minus 1 (for the constant). This is the same as the difference in the number of terms between the two models, since the null model has only one term. Model chi-square measures the improvement in fit that the explanatory variables make compared to the null model.
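The arithmetic of the likelihood ratio test can be sketched in a few lines of Python. This is only an illustration: the two -2LL values and the number of model terms are hypothetical, not taken from any output in this document, and chi2_sf is a small helper written here (it uses closed forms that hold for integer degrees of freedom) rather than a function from a statistics package.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)             # df = 2 case: exp(-x/2)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # df = 1 case
    # Step up two degrees of freedom at a time using the
    # recurrence for the regularized upper incomplete gamma function.
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Hypothetical deviances (-2LL) for illustration only.
neg2ll_null = 120.0   # intercept-only model
neg2ll_full = 105.0   # final model: constant plus 3 predictor terms
terms_in_full_model = 4

model_chi_square = neg2ll_null - neg2ll_full
df = terms_in_full_model - 1          # minus 1 for the constant
p_value = chi2_sf(model_chi_square, df)

print(f"model chi-square = {model_chi_square:.3f}, df = {df}, p = {p_value:.4f}")
```

A small p value here would indicate that the explanatory variables, taken together, improve fit over the null model.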
Warning: If the log-likelihood test statistic shows a small p value (<=.05) for a model with a large effect size, ignore contrary findings based on the Wald statistic discussed below as it is biased toward Type II errors in such instances - instead assume good model fit overall.
In the table above, the response variable is Gunown, a binary variable indicating whether or not there is a gun in the home. Predictors are marital status, race, attitude toward the death penalty (Cappun), and age. The likelihood ratio tests of individual parameters show that the model without Age is not significantly different from the final (full) model and therefore Age should be dropped based on preference for the more parsimonious reduced model. For the significant variables, the larger the chi-square value, the greater the loss of model fit if that term is dropped. In this example, dropping "marital" would result in the greatest loss of model fit.
Binomial logistic regression in SPSS offers these variants: forward conditional, forward LR, forward Wald, backward conditional, backward LR, or backward Wald. The conditional options use a computationally faster version of the likelihood ratio test, LR options utilize the likelihood ratio test (chi-square difference), and the Wald options use the Wald test. The LR option is most often preferred. The likelihood ratio test computes -2LL for the current model, then reestimates -2LL with the target variable removed. The conditional option is preferred when LR estimation proves too computationally time-consuming. The conditional statistic is considered not as accurate as the likelihood ratio test but more so than the third possible criterion, the Wald test. Stepwise procedures are selected in the Method drop-down list of the binomial logistic regression dialog.
Multinomial logistic regression offers these variants: forward stepwise, backward stepwise, forward entry, and backward elimination. Stepwise procedures are selected under the Model button of the multinomial logistic regression dialog. These four options are described in the FAQ section below. All are based on maximum likelihood estimation (MLE), with forward methods using the likelihood ratio or score statistic and backward methods using the likelihood ratio or Wald's statistic. LR is the default, but score and Wald alternatives are available under the Options button. Forward entry adds terms to the model until no omitted variable would contribute significantly to the model. Forward stepwise determines the forward entry model and then alternates between backward elimination and forward entry until all variables not in the model fail to meet entry or removal criteria. Backward elimination and backward stepwise are similar, but begin with all terms in the model and work backward, with backward elimination stopping when the model contains only terms which are significant and with backward stepwise taking this result and further alternating between forward entry and backward elimination until no omitted variable would contribute significantly to the model.
The illustration below shows forward stepwise modeling of a binary dependent, having a gun in the home or not (Gunown), as predicted by the categorical variables marital status (marital, with four categories), race (three categories), and attitude on the death penalty (Cappun, binary). The forward stepwise procedure adds Marital first, then Race, then Cappun.
For a one-independent model, z would equal the constant plus the b coefficient times the value of X1, when predicting odds(event) for persons with a particular value of X1. If X1 is a binary (0,1) variable, then z equals the constant for the "0" group on X1 and equals the constant plus the b coefficient for the "1" group. To convert the log odds (which is z, which is the logit) back into odds, the natural logarithmic base e is raised to the zth power: odds(event) = exp(z) = odds the binary dependent is 1 rather than 0. If X1 is a continuous variable, then z equals the constant plus the b coefficient times the value of X1. For models with additional independents, z is the constant plus the sum of the products of the b coefficients and the values of the X (independent) variables. Exp(z) is the odds of the dependent (z itself being the log odds), that is, the estimate of odds(event).
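The conversion from logit to odds to probability described above can be sketched as follows. The constant, coefficient, and X1 value are hypothetical, chosen so the arithmetic is easy to follow; the function names are illustrative only.

```python
import math

def logit_z(constant, coefs, xs):
    """z = constant + sum of (b coefficient * X value) over the independents."""
    return constant + sum(b * x for b, x in zip(coefs, xs))

def odds_from_logit(z):
    """odds(event) = exp(z)."""
    return math.exp(z)

def prob_from_odds(odds):
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

# Hypothetical one-independent model: constant = -1.0, b1 = 0.5, X1 = 2
z = logit_z(-1.0, [0.5], [2.0])
odds = odds_from_logit(z)
prob = prob_from_odds(odds)

print(f"z = {z}, odds(event) = {odds}, p(event) = {prob}")
```

With these values the logit is 0, so the odds are 1 (even odds) and the probability of the event is .5.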
To summarize, logits are the log odds of the event occurring (usually, that the dependent = 1 rather than 0). The "z" in the logistic formula above is the logit. Odds(event) = Exp(z). Where OLS regression has an identity link function, logistic regression has a logit link function (that is, logistic regression calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS regression does). Parameter estimates (b coefficients) associated with explanatory variables are estimators of the change in the logit caused by a unit change in the independent. In SPSS output, the parameter estimates appear in the "B" column of the "Variables in the Equation" table. Logits do not appear but must be estimated using the logistic regression equation above, inserting appropriate values for the constant and X variable(s). The b coefficients vary between plus and minus infinity, with 0 indicating the given explanatory variable does not affect the logit (that is, makes no difference in the probability of the dependent value equaling the value of the event, usually 1); positive or negative b coefficients indicate the explanatory variable increases or decreases the logit of the dependent. Exp(b) is the odds ratio for the explanatory variable, discussed below. Note that when b=0, Exp(b)=1, so therefore an odds ratio of 1 corresponds to an explanatory variable which does not affect the dependent variable.
Put another way, Exp(b) is the ratio of the odds for two groups whose values of Xj are one unit apart. An Exp(b)>1 means the independent variable increases the logit and therefore increases odds(event). If Exp(b) = 1.0, the independent variable has no effect. If Exp(b) is less than 1.0, then the independent variable decreases the logit and decreases odds(event). For instance, if b1 = 2.303, then the corresponding odds ratio (the exponential function, e^b) is 10, and we may say that when the independent variable increases one unit, the odds that the dependent = 1 increase by a factor of 10, when other variables are controlled. In SPSS, odds ratios appear as "Exp(B)" in the "Variables in the Equation" table.
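As a quick check of the b1 = 2.303 example, the short sketch below exponentiates the coefficient and applies the resulting odds ratio to two baseline odds. The baseline odds are hypothetical and serve only to show that the same factor applies at any starting point.

```python
import math

b1 = 2.303                      # logit coefficient from the example above
odds_ratio = math.exp(b1)       # approximately 10

def odds_after_unit_increase(baseline_odds, b):
    """A one-unit increase in X multiplies the odds by exp(b)."""
    return baseline_odds * math.exp(b)

for baseline_odds in (0.1, 1.0):    # hypothetical starting odds
    new_odds = odds_after_unit_increase(baseline_odds, b1)
    print(f"odds {baseline_odds} -> {new_odds:.3f} (factor {odds_ratio:.3f})")
```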
A second simple example: Some 20 people take a performance test, where 0=fail and 1=success. For males, 3 fail and 7 succeed. For females, 7 fail and 3 succeed. Then p(success) for males = 7/10 = .70; and q(failure) for males = 3/10 = .30. Likewise p(success) for females = 3/10 = .30, and q(failure) for females = 7/10 = .70. Therefore the odds of success for males is the ratio of the probabilities = .7/.3 = 2.3333. The odds of success for females = .3/.7 = .4286, rounded off. Then the odds ratio for success (for performance = 1) for males:females is 2.3333/.4286 = 5.4444. Since the parameter estimate is the natural log of the odds ratio, b(gender) = ln(5.4444) = 1.6946. Conversely, if the b for gender was 1.6946 we could convert it to an odds ratio using the function exp(1.6946) = 5.4444. And we would say that the odds of success (the odds that the dependent variable performance = 1) are 5.4444 times as large for males as for females. (If you try this on your calculator results will not be exact due to rounding error: there are actually more than the four decimal places shown above, and also delta must be set to 0).
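The performance-test example can be reproduced directly from the four cell counts. The sketch below uses only the counts given above; the variable names are illustrative.

```python
import math

# Counts from the example: (fail, succeed) for each gender
male_fail, male_success = 3, 7
female_fail, female_success = 7, 3

# Odds of success = p(success) / q(failure) within each group
odds_male = male_success / male_fail          # 7/3 = 2.3333
odds_female = female_success / female_fail    # 3/7 = .4286

odds_ratio = odds_male / odds_female          # 49/9 = 5.4444
b_gender = math.log(odds_ratio)               # natural log of the odds ratio

print(f"odds (male)   = {odds_male:.4f}")
print(f"odds (female) = {odds_female:.4f}")
print(f"odds ratio    = {odds_ratio:.4f}")
print(f"b(gender)     = {b_gender:.4f}")
```

Working from the unrounded counts avoids the calculator rounding issue mentioned above.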
Warning: Note that dichotomous variables may be entered as categorical dummies rather than as simple variables (which would be the norm). If entered as categorical variables, then their odds ratios will be computed differently and must be interpreted relative to the reference category rather than as a simple increase/decrease in odds. That is, it makes a major difference if a dichotomy is entered as a categorical variable using the Categorical button. This by default will make the higher category the reference category (2=female in this example). As illustrated in the spreadsheet below, we are then making an entirely different statement using a different odds ratio. We would now say that the odds a man does not have a gun are .55 times the odds a woman does not (.56 times in the spreadsheet, due to rounding/precision differences).
For the first category of marital in the example above, the odds ratio is .324. Recall binary logistic regression by default predicts the higher category of the dependent, which is gunown = 2 = not owning. We would therefore say that the odds of a married person (category 1 of marital) not owning a gun are .324 times the odds of a never-married person not owning a gun. Similar statements would be made about the other levels of marital, all making comparison to the reference category, never married. Note, however, that since level 2 (widowed) is not significant, we would not make such a statement for that category.
To take another example, consider the number of publications of professors (see Allison, 1999: 188). Let the logit coefficient for "number of articles published" be +.0737, where the dependent variable is "being promoted". The odds ratio which corresponds to +.0737 is approximately 1.08 (e to the .0737 power). Therefore one may say, "one additional article published increases the odds of promotion by about 8%, controlling for other variables in the model." (This is the same as saying the odds become about 108% of their original value, or noting that one multiplies the original odds by 1.08. It is not the same as saying that the probability of promotion increases by 8%.)
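The distinction between a change in odds and a change in probability can be illustrated numerically. In the sketch below the coefficient .0737 comes from the example above, while the baseline probabilities of promotion are hypothetical values picked for illustration.

```python
import math

b_articles = 0.0737
odds_ratio = math.exp(b_articles)     # roughly 1.08

def prob_from_odds(odds):
    return odds / (1.0 + odds)

results = {}
for p0 in (0.10, 0.50, 0.90):         # hypothetical baseline probabilities
    odds0 = p0 / (1.0 - p0)
    p1 = prob_from_odds(odds0 * odds_ratio)   # after one more article
    results[p0] = p1
    pct_change = 100.0 * (p1 - p0) / p0
    print(f"p = {p0:.2f} -> {p1:.4f} (probability changes by {pct_change:.2f}%)")
```

The odds rise by the same factor in every row, but the percentage change in probability differs by baseline and is below 8% in each case.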
To take a third example, let income be a continuous explanatory variable measured in ten thousands of dollars, with a parameter estimate of 1.5 in a model predicting home ownership=1, no home ownership=0. A 1 unit increase in income (one $10,000 unit) is then associated with a 1.5 increase in the log odds of home ownership. However, it is more intuitive to convert to an odds ratio: exp(1.5) = 4.48, allowing one to say that a unit ($10,000) change in income increases the odds of the event ownership=1 about 4.5 times.
For example, let "candidate" be a categorical dependent variable with three levels: the first parameter estimate will be the log of the odds (probability candidate=1: probability candidate=3), and the second parameter estimate will be the log odds of (p(candidate=2):p(candidate=3)). Let the explanatory variable be gender, with 0=female and 1=male, such that the reference category is 1=male. Let the reference category of the dependent equal 3, the default. There will be two parameter estimates for gender: one for candidate 1 and one for candidate 2. Let the parameter estimate for gender=0 for candidate 1 be .500. Then the odds ratio is exp(.500) = 1.649. We can then say the odds of a female selecting candidate 1 compared to candidate 3 are 1.649 times (about 65% greater than) the odds a male would. Warning: This is a statement about odds - do not directly transform it into a statement about probabilities/likelihood/chances.
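To make the warning concrete, the sketch below converts baseline-category logits into predicted probabilities for the three candidates. Only the .500 estimate comes from the example above; the two intercepts and the gender estimate for candidate 2 are hypothetical values supplied so the calculation can run.

```python
import math

def category_probs(logits):
    """Probabilities from logits of each non-reference category versus the
    reference category; the reference category's probability is returned last."""
    exp_z = [math.exp(z) for z in logits]
    denom = 1.0 + sum(exp_z)
    return [e / denom for e in exp_z] + [1.0 / denom]

intercepts = [0.2, -0.3]   # hypothetical, for candidate 1 and candidate 2
b_female = [0.5, 0.1]      # .500 from the text; 0.1 is hypothetical

# Males are the reference category of gender, so only intercepts apply.
p_male = category_probs(intercepts)
p_female = category_probs([a + b for a, b in zip(intercepts, b_female)])

# Odds of candidate 1 versus candidate 3 (the reference) within each gender
odds_female = p_female[0] / p_female[2]
odds_male = p_male[0] / p_male[2]
ratio = odds_female / odds_male

print(f"odds ratio (female:male), candidate 1 vs 3 = {ratio:.3f}")
print(f"p(candidate 1): female = {p_female[0]:.3f}, male = {p_male[0]:.3f}")
```

The odds ratio recovers exp(.500), while the ratio of the two probabilities of choosing candidate 1 is a different number that depends on the other parameters.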
Note that R2-like measures below are not goodness-of-fit tests but rather attempt to measure strength of association. Unfortunately, the pseudo-R2 measures reflect and confound effect strength with goodness of fit. For small samples, for instance, an R2-like measure might be high when goodness of fit was unacceptable by the likelihood ratio test. SPSS supports three R2-like measures: Cox and Snell's, Nagelkerke's, and McFadden's, as illustrated below. Output is identical for binomial and multinomial logistic regression and in SPSS appears in the "Pseudo R Square" table.
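The three R2-like measures can all be computed from the -2LL of the intercept-only model, the -2LL of the final model, and the sample size. The sketch below uses the standard textbook formulas; the deviances and sample size are hypothetical and do not correspond to any output in this document.

```python
import math

def pseudo_r2(neg2ll_null, neg2ll_full, n):
    """Return (Cox and Snell, Nagelkerke, McFadden) pseudo R-squares."""
    # Cox and Snell: 1 - (L0/L1)^(2/n), written in terms of -2LL
    cox_snell = 1.0 - math.exp((neg2ll_full - neg2ll_null) / n)
    # Nagelkerke rescales Cox and Snell by its maximum attainable value
    max_cox_snell = 1.0 - math.exp(-neg2ll_null / n)
    nagelkerke = cox_snell / max_cox_snell
    # McFadden: 1 - LL(full)/LL(null), the same ratio as for -2LL
    mcfadden = 1.0 - neg2ll_full / neg2ll_null
    return cox_snell, nagelkerke, mcfadden

# Hypothetical values for illustration only
cox_snell, nagelkerke, mcfadden = pseudo_r2(120.0, 105.0, 100)

print(f"Cox and Snell = {cox_snell:.3f}")
print(f"Nagelkerke    = {nagelkerke:.3f}")
print(f"McFadden      = {mcfadden:.3f}")
```

Because Nagelkerke's measure divides Cox and Snell's by a value below 1, it is always the larger of the two.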
Parameter codings for indicator contrasts
------------------------------------------------
                         Parameter coding
         Value   Freq      (1)      (2)
GROUP
           1      106     1.000     .000
           2      116      .000    1.000
           3      107      .000     .000
------------------------------------------------
This example shows a three-level categorical independent (labeled GROUP), with category values of 1, 2, and 3.
The predictor here is called simply GROUP. It takes on the values 1-3, with frequencies listed in the "Freq" column. The two "Coding" columns are the internal values (parameter codings) assigned by SPSS under indicator coding. There are two columns of codings because two dummy variables are created for the three-level variable GROUP. For the first variable, which is Coding (1), cases with a value of 1 for GROUP get a 1, while all other cases get a 0. For the second, cases with a 2 for GROUP get a 1, with all other cases getting a 0.
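The indicator coding shown in the table can be reproduced with a short function. This is a sketch of the coding scheme itself, not of SPSS internals; the function name and arguments are illustrative.

```python
def indicator_coding(levels, reference):
    """Map each category to its dummy-variable codes, with one dummy per
    non-reference category and all zeros for the reference category."""
    non_reference = [lv for lv in sorted(levels) if lv != reference]
    return {
        lv: [1.0 if lv == target else 0.0 for target in non_reference]
        for lv in sorted(levels)
    }

# GROUP has values 1, 2, 3; the last category (3) is the reference
codes = indicator_coding([1, 2, 3], reference=3)

for value, coding in codes.items():
    print(value, coding)
```

Cases in the reference category are identified by having 0 on every dummy variable, which is why three categories need only two coding columns.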
In the example above, both the likelihood ratio tests table and the parameter estimates table show for these data that Test (test score) is significant in differentiating those promoted from those not (i.e., subjects from matches), controlling for variables used for matching (age, gender). In addition, Rating (supervisor's rating) and Race are not significant.
When is discriminant analysis preferred over logistic regression?
LOGISTIC REGRESSION /VARIABLES income WITH age SES gender opinion1
opinion2 region
/CATEGORICAL=gender, opinion1, opinion2, region
/CONTRAST(region)=INDICATOR(4)
/METHOD FSTEP(LR)
/CLASSPLOT
Above is the SPSS syntax in simplified form. The dependent variable is the variable immediately after the VARIABLES term. The independent variables are those immediately after the WITH term. The CATEGORICAL subcommand specifies any categorical variables; note these must also be listed in the VARIABLES statement. The CONTRAST subcommand tells SPSS which category of a categorical variable is to be dropped when it automatically constructs dummy variables (here it is the 4th value of "region"; this value is the fourth one and is not necessarily coded "4"). The METHOD subcommand sets the method of computation, here specified as FSTEP to indicate forward stepwise logistic regression. Alternatives are BSTEP (backward stepwise logistic regression) and ENTER (enter terms as listed, usually because their order is set by theories which the researcher is testing). ENTER is the default method. The (LR) term following FSTEP specifies that likelihood ratio criteria are to be used in the stepwise addition of variables to the model. The /CLASSPLOT option specifies that a histogram of predicted probabilities is to be output (see above).
The full syntax is below:
LOGISTIC REGRESSION VARIABLES = dependent var
[WITH independent varlist [BY var [BY var] ... ]]
[/CATEGORICAL = var1, var2, ... ]
[/CONTRAST (categorical var) = [{INDICATOR [(refcat)] }]]
{DEVIATION [(refcat)] }
{SIMPLE [(refcat)] }
{DIFFERENCE }
{HELMERT }
{REPEATED }
{POLYNOMIAL[({1,2,3...})]}
{metric }
{SPECIAL (matrix) }
[/METHOD = {ENTER** } [{ALL }]]
{BSTEP [{COND}]} {varlist}
{LR }
{WALD}
{FSTEP [{COND}]}
{LR }
{WALD}
[/SELECT = {ALL** }]
{varname relation value}
[/{NOORIGIN**}]
{ORIGIN }
[/ID = [variable]]
[/PRINT = [DEFAULT**] [SUMMARY] [CORR] [ALL] [ITER [({1})]] [GOODFIT]]
{n}
[CI(level)]
[/CRITERIA = [BCON ({0.001**})] [ITERATE({20**})] [LCON({0** })]
{value } {n } {value }
[PIN({0.05**})] [POUT({0.10**})] [EPS({.00000001**})]]
{value } {value } {value }
[CUT[{0.5** }]]
{value }
[/CLASSPLOT]
[/MISSING = {EXCLUDE **}]
{INCLUDE }
[/CASEWISE = [tempvarlist] [OUTLIER({2 })]]
{value}
[/SAVE = tempvar[(newname)] tempvar[(newname)]...]
[/OUTFILE = [{MODEL }(filename)]]
{PARAMETER}
[/EXTERNAL]
**Default if the subcommand or keyword is omitted.
The syntax for multinomial logistic regression is:
NOMREG dependent varname [(BASE = {FIRST } ORDER = {ASCENDING**})] [BY factor list]
{LAST**} {DATA }
{value } {DESCENDING }
[WITH covariate list]
[/CRITERIA = [CIN({95**})] [DELTA({0**})] [MXITER({100**})] [MXSTEP({5**})]
{n } {n } {n } {n }
[LCONVERGE({0**})] [PCONVERGE({1.0E-6**})] [SINGULAR({1E-8**})]
{n } {n } {n }
[BIAS({0**})] [CHKSEP({20**})] ]
{n } {n }
[/FULLFACTORIAL]
[/INTERCEPT = {EXCLUDE }]
{INCLUDE** }
[/MISSING = {EXCLUDE**}]
{INCLUDE }
[/MODEL = {[effect effect ...]} [| {BACKWARD} = { effect effect ...}]]
{FORWARD }
{BSTEP }
{FSTEP }
[/STEPWISE =[RULE({SINGLE** })][MINEFFECT({0** })][MAXEFFECT(n)]]
{SFACTOR } {value}
{CONTAINMENT}
{NONE }
[PIN({0.05**})] [POUT({0.10**})]
{value } {value }
[ENTRYMETHOD({LR** })] [REMOVALMETHOD({LR**})]
{SCORE} {WALD}
[/OUTFILE = [{MODEL }(filename)]]
{PARAMETER}
[/PRINT = [CELLPROB] [CLASSTABLE] [CORB] [HISTORY({1**})] [IC] ]
{n }
[SUMMARY ] [PARAMETER ] [COVB] [FIT] [LRT] [KERNEL]
[ASSOCIATION] [CPS**] [STEP**] [MFI**] [NONE]
[/SAVE = [ACPROB[(newname)]] [ESTPROB[(rootname[:{25**}])] ]
{n }
[PCPROB[(newname)]] [PREDCAT[(newname)]]
[/SCALE = {1** }]
{n }
{DEVIANCE}
{PEARSON }
[/SUBPOP = varlist]
[/TEST[(valuelist)] = {['label'] effect valuelist effect valuelist...;}]
{['label'] ALL list; }
{['label'] ALL list }
** Default if the subcommand is omitted.
As there is no direct counterpart to R-squared in logistic regression, VIF cannot be computed -- though obviously one could apply the same logic to various pseudo-R-squared measures. Unfortunately, I am not aware of a VIF-type test for logistic regression, and I would think that the same obstacles would exist as for creating a true equivalent to OLS R-squared.
A high odds ratio would not be evidence of multicollinearity in itself.
To the extent that one independent is linearly or nonlinearly related to another independent, multicollinearity could be a problem in logistic regression since, unlike OLS regression, logistic regression does not assume linearity of relationship among independents. Some authors use the VIF test in OLS regression to screen for multicollinearity in logistic regression if nonlinearity is ruled out. In an OLS regression context, nonlinearity exists when eta-square is significantly higher than R-square. In a logistic regression context, the Box-Tidwell transformation and orthogonal polynomial contrasts are ways of testing linearity among the independents.
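As a rough illustration of the OLS-based VIF screening mentioned above, the sketch below computes VIF for the simplest case of two independents, where the R-square from regressing one independent on the other equals their squared Pearson correlation, so VIF = 1/(1 - r^2). The data values are hypothetical, and a model with more than two independents would need a full auxiliary regression for each one rather than this shortcut.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """VIF for either of two independents: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical values of two independents
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

r = pearson_r(x1, x2)
vif = vif_two_predictors(x1, x2)
print(f"r = {r:.2f}, VIF = {vif:.3f}")
```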
When an ordinal variable has been entered as a set of dummy variables, the interaction of another variable with the ordinal variable will involve multiple interaction terms. In this case the significance of the interaction of the two variables is the significance of the change of R-square of the equation with the interaction terms and the equation without the set of terms associated with the ordinal variable. (See the StatNotes section on "Regression" for computing the significance of the difference of two R-squares).
FORWARD ENTRY
1. Estimate the parameters and likelihood function for the initial model and let it be our current model.
2. Based on the MLEs of the current model, calculate the score or LR statistic for every variable eligible for inclusion and find its significance.
3. Choose the variable with the smallest significance. If that significance is less than the probability for a variable to enter, then go to step 4; otherwise, stop FORWARD.
4. Update the current model by adding a new variable. If there are no more eligible variables left, stop FORWARD; otherwise, go to step 2.
FORWARD STEPWISE
1. Estimate the parameters and likelihood function for the initial model and let it be our current model.
2. Based on the MLEs of the current model, calculate the score statistic or likelihood ratio statistic for every variable eligible for inclusion and find its significance.
3. Choose the variable with the smallest significance (p-value). If that significance is less than the probability for a variable to enter, then go to step 4; otherwise, stop FSTEP.
4. Update the current model by adding a new variable. If this results in a model which has already been evaluated, stop FSTEP.
5. Calculate the significance for each variable in the current model using LR or Wald's test.
6. Choose the variable with the largest significance. If its significance is less than the probability for variable removal, then go back to step 2. If the current model with the var