Overview
Log-linear analysis deals with association of categorical or grouped data, looking at all levels of possible main and interaction effects, comparing this saturated model with reduced models, with the primary purpose being to find the most parsimonious model which can account for cell frequencies in a table. That is, log-linear analysis is a non-dependent procedure for accounting for the distribution of cases in a crosstabulation of categorical variables. Log-linear analysis is a type of multi-way frequency analysis (MFA) and sometimes log-linear analysis is labeled MFA. Logit modeling is similar to log-linear modeling, but explains one or more dependent categorical variables. When there is a dependent categorical variable, however, binary and multinomial logistic regression are more commonly used. Logistic regression is also used when the independents are continuous (forcing continuous variables into categories attenuates correlation and is not recommended). Conditional logit handles matched-pairs and panel data, and data for analyzing choices. Probit is a variant of logit modeling based on different data assumptions. Logit is the more commonly used, based on the assumption of equal categories. Probit may be the more appropriate choice when the categories are assumed to reflect an underlying normal distribution of the dependent variable, even if there are just two categories.
Log-linear models were developed to analyze the conditional relationship of two or more categorical values. Log-linear analysis is different from logistic regression in four ways: Logit and probit extend the log-linear model to allow a mixture of categorical and continuous independent variables to predict one or more categorical dependent variables. Both logit and probit usually lead to the same conclusions for the same data. Logit regression yields results equivalent to logistic regression, but with different output options. Many problems can be handled by either logit or logistic regression, though the latter has become more popular among social scientists. Note that generalized linear models, discussed separately, represent a more recent set of procedures which can also analyze categorical dependents and independents, and in this sense represent a different method of implementing log-linear, logit, probit, Poisson, and other models. See also the separate section on ordinal regression, which can also implement logit, probit, and other models. See also the separate section on probit response models, which additionally supports logit response models. Traditional approaches to categorical data relied on chi-square and other measures of significance to establish if a relationship existed in a table, then employed any of a wide variety of measures of association to come up with a number, usually between 0 and 1, indicating how strong the relationship was. Loglinear methods are similar in function but have the advantage of making it far easier to analyze multi-way tables (more than two categorical variables) and to understand just which values of which variables and which interaction effects are contributing the most to the relationship. For simple two-variable tables, traditional approaches may still be preferred, but for multivariate analysis of three or more categorical variables, log-linear analysis is preferred. 
Loglinear methods also differ from multiple regression in substituting maximum likelihood estimation of a link function of the dependent for regression's use of least squares estimation of the dependent itself. The link function transforms the dependent variable and it is this transform, not the raw variable, which is linearly related to the model (the terms on the right-hand side of the equation). The link function used in log-linear analysis is the log of the dependent, y. The function used in logit is the natural log of the odds. The function used in probit is the inverse of the standard normal cumulative distribution function.
There are several possible purposes for undertaking log-linear modeling, the primary being to determine the most parsimonious model which is not significantly different from the saturated model, which is a model that fully but trivially accounts for the cell frequencies of a table. Log-linear analysis is used to determine if variables are related, to predict the expected frequencies (table cell values) of a dependent variable, to understand the relative importance of different independent variables in predicting a dependent, and to confirm models using a goodness of fit test (the likelihood ratio). Residual analysis can also determine where the model is working best and worst. Often researchers will use hierarchical loglinear analysis (in SPSS, the Model Selection option under Loglinear) for exploratory modeling, then use general loglinear analysis for confirmatory modeling. SPSS supports these related procedures, among others:
As elaborated below in the section on effects, a saturated log-linear model takes the form: the natural log of the frequency for any cell equals the grand mean (the constant) plus the sum of the lambda parameter estimates for all 1-way, 2-way, 3-way, ..., k-way interaction effects in a model with k variables. Depicted graphically, a saturated model with six variables (A through F) would show connecting lines from each variable to each other variable, for a total of (2^k - 1) = 63 effects and 15 lines. However, a parsimonious model such as that below might have far fewer connecting lines.
The parsimonious model above has the form: ln(cell frequency) = μ + A + B + C + D + E + F + A*B + B*C + B*D + C*D + D*E + E*F + B*C*D, for a total of only 13 effects and 6 connecting lines. Each effect is reflected in a parameter estimate, discussed below.
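The effect counts discussed above can be checked by enumeration. The sketch below is an illustration only: it assumes each effect corresponds to one non-empty subset of the variables and each connecting line to one pair of variables, and the helper name `saturated_effects` is hypothetical.

```python
from itertools import combinations

def saturated_effects(variables):
    """Return every main and interaction effect: one per non-empty subset."""
    effects = []
    for order in range(1, len(variables) + 1):
        effects.extend(combinations(variables, order))
    return effects

variables = ["A", "B", "C", "D", "E", "F"]
effects = saturated_effects(variables)
lines = [e for e in effects if len(e) == 2]  # one connecting line per pair

print(len(effects))  # number of effects in the saturated model
print(len(lines))    # number of connecting lines

# The parsimonious model from the text, written as a list of effect terms.
parsimonious = ["A", "B", "C", "D", "E", "F",
                "A*B", "B*C", "B*D", "C*D", "D*E", "E*F", "B*C*D"]
two_way = [t for t in parsimonious if t.count("*") == 1]
print(len(parsimonious), len(two_way))
```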
| Original Table | ||
|---|---|---|
| - | Black | White |
| Not Literate | 6 | 2 |
| Literate | 2 | 6 |
[Table: Model A: Race*Literacy]
[Table: Model B: Race*Region + Literacy*Region]
[Table: Model C: Race*Literacy + Race*Region + Literacy*Region]
The original table above is shown with three different possible splits by the control variable Region. In Model A, the split tables have the same relationship as the original table. There is no control effect and therefore the control variable, Region, is not part of the loglinear generating class reported by hierarchical loglinear modeling using backward elimination. In Model B, there is full explanation (total control by the control variable Region) and each component of the loglinear generating class is an interaction involving the control variable. In Model C, the original relationship disappears (is controlled) in the South region but is stronger than the original in the North region, showing the original table to be a misleading average. For Model C, the loglinear generating class contains all three interactions.
Once the most parsimonious model is selected, SPSS can compute the expected frequencies. These expected frequencies can be subtracted from the observed cell frequencies to give the residuals. The smaller the residual, the better the model is working for that cell. Likewise, large residuals indicate marginal (row and column) conditions where the model is not working well. SPSS shows residuals in a table of "Cell Counts and Residuals." Note SPSS outputs four types of residuals and three types of plots:
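As a rough illustration of how residuals are formed, the sketch below fits the simple independence model (main effects only) to the Original Table shown earlier and subtracts the expected counts from the observed counts. The independence model is used here only for convenience and is an assumption, not the model SPSS would necessarily select; the function name `independence_expected` is hypothetical.

```python
def independence_expected(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Original Table: rows = Not Literate, Literate; columns = Black, White
observed = [[6, 2],
            [2, 6]]
expected = independence_expected(observed)

# Raw residual = observed - expected, cell by cell
residuals = [[o - e for o, e in zip(obs_row, exp_row)]
             for obs_row, exp_row in zip(observed, expected)]

print(expected)   # [[4.0, 4.0], [4.0, 4.0]]
print(residuals)  # [[2.0, -2.0], [-2.0, 2.0]]
```

Residuals of equal size in every cell, as here, suggest the independence model misses the Race*Literacy association throughout the table rather than in one corner of it.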
Lambda (λ) is the usual designation for the effect coefficient. Mu (μ) is the usual designation for the constant. These coefficients are obtained in SPSS by asking for Estimates under Options in the loglinear dialog. The loglinear model is one in which the natural log of the frequency for any cell is equal to a grand mean (the constant, mu) plus the lambda parameter estimate for the effect of the first independent, plus the lambda for each other independent, plus the lambdas for all 2-way, 3-way, or higher interaction effects, according to the number of independents. Thus for two categorical variables, A and B, the saturated model is:
$$\ln(F_{ij}) = \mu + \lambda_i^{A} + \lambda_j^{B} + \lambda_{ij}^{AB}$$
where $F_{ij}$ is the frequency for cell ij, $\lambda_i^{A}$ is the main effect for variable A (the row effect), $\lambda_j^{B}$ is the main effect for variable B (the column effect), and $\lambda_{ij}^{AB}$ is the interaction effect of A with B. However, this is the saturated model which is always fully predictive of the table frequencies, but trivial. The trick is to figure out how many lambdas can be constrained to 0 and still have acceptable estimates of the frequencies.
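To make the saturated two-way model concrete, the sketch below computes mu and the lambdas for a small table and confirms that they reproduce each log frequency exactly. It assumes sum-to-zero (deviation) constraints on the lambdas and no zero cells; the function name `saturated_lambdas` is hypothetical and the values it gives will differ from output that uses indicator coding.

```python
from math import log, isclose

def saturated_lambdas(table):
    """Effect-coded parameters of the saturated two-way model.

    Assumes sum-to-zero (deviation) constraints and no zero cells.
    """
    logs = [[log(f) for f in row] for row in table]
    n_rows, n_cols = len(logs), len(logs[0])
    mu = sum(sum(row) for row in logs) / (n_rows * n_cols)
    lam_a = [sum(row) / n_cols - mu for row in logs]        # row effects
    lam_b = [sum(col) / n_rows - mu for col in zip(*logs)]  # column effects
    lam_ab = [[logs[i][j] - mu - lam_a[i] - lam_b[j]
               for j in range(n_cols)] for i in range(n_rows)]
    return mu, lam_a, lam_b, lam_ab

observed = [[6, 2], [2, 6]]
mu, lam_a, lam_b, lam_ab = saturated_lambdas(observed)

# The saturated model reproduces every log frequency exactly.
for i, row in enumerate(observed):
    for j, f in enumerate(row):
        fitted = mu + lam_a[i] + lam_b[j] + lam_ab[i][j]
        assert isclose(fitted, log(f))
```

In this table the main-effect lambdas are zero because the margins are balanced, so all of the structure sits in the interaction term.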
SPSS prints out all these lambda parameter estimates in the "Parameter Estimates" table of the output. Lambdas appear as coefficients in the "Estimates" column of this table. While some authors label these estimates as "b" coefficients, they are not regression-type coefficients. Lambdas are effect estimates, not slopes.
How to interpret a log-linear parameter estimate, b, using odds ratios. The odds ratio, discussed below, is Exp(b). That is, the odds ratio is the natural log base e raised to the power of b. There will be a b coefficient for each category of the categorical variable, except the reference category. The odds of a person in the given category of an independent variable (the category corresponding to b) also being associated with the reference category of the dependent variable (usually 1, corresponding to an event happening; or to the highest dependent category when the dependent has > 2 values; though the researcher can set the dependent reference category as desired) is Exp(b) times the odds of a person in the reference category of the independent variable, controlling for other variables in the model. Also, when the independent increases one unit, the odds of the dependent (usually 1 = event happening) increase by a factor of Exp(b).
Consider an example looking at the effects of gender, party, and race upon one another. In this model, for the most parsimonious model (Design: Constant + Gender * Party + Race * Party), the Parameter Estimates table looks like that below. Note that because output is from Analyze, Loglinear, General, regression-type indicator (dummy) coding is used, where the last category becomes the left-out category. (The differences between indicator coding in General Loglinear Regression and deviance coding in Hierarchical Loglinear Regression are discussed later on.):

For the data above, Party is coded 1=Democrat, 2=Independent, 3=Republican. Gender is coded 1=Male, 2=Female. Race is coded 1=White, 2=Hispanic, 3=Black.
Here it can be seen that the (Party=3)*(Race=2) interaction, which corresponds to Republican Hispanics, is not significant. All of the other combinations of interacting values are significantly contributing to the explanation of the distribution of data in the table. The highest significant Z value for this example is 8.38, for Female Democrats. If we were to go back to the original cell counts and compare the expected cell counts, we would find that the sum of absolute residuals (observed minus expected) for Republican Hispanics was only 3.5, whereas the corresponding residual sum for Female Democrats was 38.5 - far larger.
Thus parameter estimates can be used to explore the relative importance of the independent variables. The ratio of the absolute magnitudes of the standardized parameter estimates (labeled 'Z' in the Parameter Estimates table) for any two cells reflects the relative importance of those parameters in explaining the frequencies in the table. Standardized parameters are parameters divided by their standard errors and are shown in the Z column in the SPSS output table above.
In the table above, the two-way effect [Party=1]*[Gender=1] has a parameter estimate of 1.329; e^1.329 = exp(1.329) = 3.777, which is the odds ratio. The odds ratio is also a measure of effect size, in this case for the Male*Democrat effect. Standardized parameters are preferred for comparing effects, however.
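The conversion from a parameter estimate to an odds ratio, and from an estimate to its standardized (Z) value, can be sketched in a few lines. The function names are hypothetical; the estimate of 1.329 is the one quoted in the text.

```python
from math import exp

def odds_ratio(estimate):
    """Convert a log-linear parameter estimate to an odds ratio."""
    return exp(estimate)

def standardized(estimate, std_error):
    """Z value: the estimate divided by its standard error."""
    return estimate / std_error

male_democrat = 1.329  # estimate for [Party=1]*[Gender=1] quoted in the text
print(round(odds_ratio(male_democrat), 3))  # 3.777
```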
Two-way effects with large standardized parameters flag the most important two-way interactions, and so on for higher-way effects. For any given two-way effect, there will be a parameter estimate for each cell and the ratio of standardized lambdas will indicate which cells contributed the most to that effect. Higher-way effects are interpreted analogously to two-way effects.
For further discussion of the table above, see the section on relative risk and odds ratios as measures of association. (In SPSS, the table is obtained by selecting Analyze, Descriptive Statistics, Crosstabs, and then checking Risk under Statistics.)
The "Cell Counts and Residuals" table below is output from HILOG for the party-race-sex data discussed previously above. Delta is set to 0 (the SPSS default adding .5 to all cells is overridden). As can be seen, the saturated model explains cell frequencies perfectly, with 0 residuals.
In the "Parameter Estimates" table above, Gender has two categories while Race and Party have three. This is why the main effects for Gender, Race, and Party in the table above have 1, 2, and 2 parameters respectively, with the last in each case being the redundant reference category. The two-way interactions involving Gender thus have 1*2 = 2 parameters, while the Race*Party interaction has 2*2 = 4 parameters. The three way interaction has 1*2*2 = 4 parameters also.
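The parameter counts in this paragraph follow one rule: an effect has as many non-redundant parameters as the product of (number of categories - 1) over the variables it involves. A minimal sketch of that rule follows; the function name `parameter_counts` is hypothetical.

```python
from itertools import combinations
from math import prod

levels = {"Gender": 2, "Race": 3, "Party": 3}

def parameter_counts(levels):
    """Non-redundant parameters per effect: product of (levels - 1)."""
    counts = {}
    names = list(levels)
    for order in range(1, len(names) + 1):
        for effect in combinations(names, order):
            counts["*".join(effect)] = prod(levels[v] - 1 for v in effect)
    return counts

counts = parameter_counts(levels)
for effect, n in counts.items():
    print(effect, n)

# Adding the constant gives one parameter per cell of the 2 x 3 x 3 table.
total_with_constant = 1 + sum(counts.values())
```

That the total equals the number of cells is another way of seeing why the saturated model fits the table perfectly.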
The "Tests that K-way and higher order effects are zero" table, illustrated above, shows the value of adding effects of a given order or higher to the model. The table will have rows for K = 1 up to p, where p is the highest order possible for the data at hand (in this example, 3, since there are 3 factors: Gender, Race, Party). If the "Sig" significance level for the K = 3 row is non-significant, as it is above, then the researcher would conclude 3-way interaction terms should not be in the model. If the "Sig" for the second row, which is K = 2 for this example, were non-significant, then the researcher would conclude neither 2-way nor 3-way terms should be in the model. However, since K = 2 is significant above, the researcher rejects that null hypothesis and retains 2-way terms. Both likelihood ratio and Pearson chi-square tests of significance are available, but the former are generally preferred. In this example it makes no difference, which is usually the case.
The "K-Way Effects" table is the lower half of the same table in SPSS output, as shown above. This tests if specific K-way effects are zero. The table shows the value of adding main, two-way, three-way, fourth-order, or higher effects to the model. The table will have rows for K=1 to p, where p is the highest order for the data at hand. The probability column ("Prob.") for the likelihood ratio ("L. R. Chisq") shows the significance of adding the corresponding order of effects. For instance, if row 3 is non-significant, then adding 3rd-order effects (3-way interactions) to the model is rejected, as it would be in the example above. In the example, adding main and 2nd order effects in the model is warranted.
In hierarchical models, if one has a higher-order term, one must have subsidiary lower ones. If one dropped a 3rd-order term, one could not retain a 4th-order term containing one of the elements of the 3rd-order term. For this reason, the "Tests that K-way and higher order effects are zero" table is the more relevant to modeling using HILOG.
In the example above, the default of starting with the saturated model was taken. Thus Step 0 is for Gender*Race*Party and all hierarchically subsidiary 2nd and 1st order terms. In Step 0, the backward elimination algorithm tests to see if the highest order (here, 3rd order) term may be dropped from the model as non-significant. At Sig. = .953, it is indeed non-significant and is dropped, leading to Step 1. Step 1 is the model with all 2nd order (two-way) terms and subsidiary 1st order terms. Since here three factors correspond to three two-way interactions, each of the three is tested for possible dropping. It is found that Gender*Race is non-significant and may be dropped, but the other two 2nd order terms should be retained. In Step 2, Gender*Race is dropped and the remaining two 2nd order interactions are used as the generating class. This time no terms are found suitable for dropping (none are found to be non-significant). Step 3, the final step, merely lists the generating class for the most parsimonious hierarchical model.
How it works: the backward elimination option calculates partial chi-square for every term in the generating class. Backward elimination deletes any term with a zero partial chi-square, then it sees which effect has the largest significance level for the change in chi-square if it is deleted (the default alpha significance level is .05). This gives a new model and a new generating class, which is tested in turn. The process continues until deleting any further term would significantly worsen the fit.
In the final step output under backward elimination, SPSS will print the likelihood ratio chi square and its significance for the model as a whole. A non-significant likelihood ratio indicates a good fit, as is the case in this example. Keep in mind that in a hierarchical model, a higher-order term like factor2*factor3*factor4 includes subsidiary 2-way and 1-way effects such as factor2*factor3. If the researcher goes back to GENLOG to enter a custom model, the researcher would enter the hierarchically-implied terms as well as the actual "final model" terms listed in the HILOG output. Of course, backward elimination does not guarantee the most parsimonious well-fitting model; researcher experimentation may still be called for. If one enters the example data into GENLOG (general loglinear modeling, discussed below) and asks for the best model emerging from HILOG (the "Model Selection" option in SPSS), one will get the goodness-of-fit table below, which has the same likelihood ratio goodness of fit as shown in the backward elimination table in HILOG (Sig.=.964, where non-significance corresponds to a well-fitting model).
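The two goodness-of-fit statistics mentioned here can be computed directly from observed and expected counts. The sketch below uses the standard formulas for the likelihood ratio and Pearson chi-square; as an assumption for illustration, it reuses the small Original Table from earlier with independence-model expected counts rather than the party-race-gender data, whose cell counts are not reproduced in this text. Function names are hypothetical.

```python
from math import log

def likelihood_ratio_chisq(observed, expected):
    """G^2 = 2 * sum(O * ln(O / E)); cells with O = 0 contribute 0."""
    return 2 * sum(o * log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

def pearson_chisq(observed, expected):
    """X^2 = sum((O - E)^2 / E)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Flattened 2 x 2 Original Table and its independence-model expected counts
observed = [6, 2, 2, 6]
expected = [4.0, 4.0, 4.0, 4.0]

g2 = likelihood_ratio_chisq(observed, expected)
x2 = pearson_chisq(observed, expected)
print(round(g2, 3), round(x2, 3))
```

The two statistics are close but not identical, which matches the remark that the choice between them usually makes no practical difference.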
In SPSS, select Analyze, Loglinear, General to select the GENLOG procedure, illustrated below. In the General Loglinear Analysis dialog box, move all the categorical variables of interest (ex., gender, race, and party in the tabled example below) to the Factors box. Clicking OK enters the saturated model by default. (If you click the Models button, you will see that "Saturated model" is checked by default.) There is also an Options button where you may check Frequencies, Residuals, Estimates, Criteria, plots, and more. You will normally want to select at least Estimates, which also gives significance of the estimates for each effect. Under Options, Criteria, you can set Delta=0 to suppress the default under which .5 is added to all cells to avoid having cells with zero count. Set the data distribution assumption (see below), click Continue, then click OK.
Looking at the significance of effects, obtained by asking for Estimates under the Options button of the general loglinear dialog box, is a prime way of reducing the saturated model by eliminating non-significant effects. When dropping effects which are nonsignificant, it is best to drop one effect at a time to be sure lower-order non-significant effects don't become significant when a higher-order non-significant effect is dropped. When two or more effects are nonsignificant, start the reduction process by dropping the highest-order nonsignificant effect first, then proceed by dropping one term at a time on subsequent runs. To specify an unsaturated model, in the loglinear analysis dialog, click Model, Custom, and enter the effect terms you want (ex., race, gender, race*gender, highschool).
Consider the following table, in which TestRank is used to predict WorkRank:
The estimate of the B regression coefficient is shown in the "Parameter Estimates" table, B row, Estimate column.
If the likelihood ratio (or Pearson chi-square) is nonsignificant, there is goodness-of-fit achieved simply by adding the B linear-by-linear association (interaction) effect to the complete independence model (which would be Design: Constant + rowvariable + columnvariable). For these data, a finding of significance means the linear-by-linear interaction terms should not be added to the model.
If the likelihood ratio (or Pearson chi-square) is nonsignificant, there is goodness-of-fit. Here that is not the case.
The DESIGN takes these forms for the saturated model:
For unsaturated models, obtained under the Custom choice under the Model button, the design will include the main effect of the dependent plus the effect of the dependent interacting with whatever terms are listed. If x1 is listed, the design will be y + y*x1. If x1 and x2 are listed, the design will be y + y*x1 + y*x2, and so on.
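The rule for building the unsaturated logit design can be written as a small helper. This is only a sketch of the pattern described above, not SPSS syntax; the function name `logit_design` is hypothetical.

```python
def logit_design(dependent, predictors):
    """Build the design terms for an unsaturated logit model."""
    terms = [dependent]  # main effect of the dependent
    terms.extend(f"{dependent}*{x}" for x in predictors)  # dependent by each term
    return " + ".join(terms)

print(logit_design("y", ["x1"]))        # y + y*x1
print(logit_design("y", ["x1", "x2"]))  # y + y*x1 + y*x2
```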
Thus in the example above, Gender and Race are used to predict Party. The model is not saturated since Gender*Race is not modeled. The logit model thus includes Party (the main effect of the dependent) plus Party*Gender (the dependent's interaction with the first factor) plus Party*Race (the dependent's interaction with the second factor). Because the model is not saturated, it is possible for residuals to differ from 0 and for the goodness of fit to be computed. Because model fit is non-significant in this example, the model is considered well-fitting.
_____________________________________________________________________________
| Party | Income Low=0 | Income High=1 |
|---|---|---|
| R = 0 | 400 | 500 |
| D = 1 | 600 | 400 |
| odds (R to D) | 0.667 | 1.25 |
| ln(odds) | -0.405 | 0.223 |
Odds ratio (low income to high income) = 0.533; ln(odds ratio) = -0.629
.223 = parameter estimate for party=0
-.629 = parameter estimate for party=0*income=0
_____________________________________________________________________________
* The odds of being Republican compared to Democrat for high income people is 500/400 = 1.25
* Since Democrats is the reference category, we can take the log of the odds to get the parameter estimate using the reference row Party = 1 = Democrat. It is ln(1.25) = .223. In the Parameter Estimates table this will be listed as the estimate for Party = 0 (Republican). The estimate for Party = 1 (Democrat) will be 0, since it is the reference category.
* The odds ratio is the ratio of the odds of being Republican for low income people (.667) to the odds of being Republican for high income people (1.25). In this example it is .667/1.25 = .5333. The log of the odds ratio, ln(.5333)= -.629 is the parameter estimate for the interaction of the independent and dependent. Specifically, it is the parameter estimate for party=0*income=0 in the output above.
* The odds ratio, .5333, is easier to put into a sentence than is the corresponding parameter estimate of -.629. We can say that the odds of being a Republican if low income are .53 times the odds of being a Republican if high income, for the data in this example. That is, the odds ratio of .533 = .667/1.25 is the ratio of the two odds in the table above.
* Tip: If replicating this in SPSS, set Delta=0 so as not to add .5 to each cell.
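The arithmetic in the bullets above can be reproduced from the four cell counts. The sketch below uses only the counts given in the table; the dictionary layout and variable names are illustrative choices.

```python
from math import log

# Cell counts from the table: counts[party][income]
counts = {"R": {"low": 400, "high": 500},
          "D": {"low": 600, "high": 400}}

# Odds of being Republican rather than Democrat within each income level
odds_low = counts["R"]["low"] / counts["D"]["low"]
odds_high = counts["R"]["high"] / counts["D"]["high"]

# Ratio of the low-income odds to the high-income odds
odds_ratio = odds_low / odds_high

print(round(odds_low, 3), round(odds_high, 3))          # 0.667 1.25
print(round(log(odds_high), 3))                         # 0.223
print(round(odds_ratio, 3), round(log(odds_ratio), 3))  # 0.533 -0.629
```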
Though sometimes described as being similar to R-square in regression, these effect size coefficients may be small even when the relation between the independent and dependent is strong. Each estimates the percent of the dispersion in the dependent variable which is explained by the model, and both coefficients are usually but not always close to one another.
Every subject id will have two data rows: one for the case and one for its control.
In SPSS, select Analyze, Survival, Cox Regression. In the Cox Regression dialog, let the "Time" variable be the dichotomous dependent variable (ex., type, where the row is the subject, coded 1, or is the control, coded 2). This means 1 is the "event" condition and 2 is the "censored" condition for each matched pair. Let the "Status" variable be a copy of the dependent variable (ex., type2) and then click the Define Event button and select Single Value and set it to 1. This tells the program that the value 1 corresponds to the event occurring, that is, being the case rather than the control. Enter continuous explanatory variables in the Covariates box (there is an option to do so in blocks). Click the Categorical button and enter any categorical explanatory variables as covariates, choosing the default Indicator contrasts. Back in the main Cox Regression dialog, let the "Strata" variable be a variable giving the subject's id number.
Probit regression is an alternative log-linear approach to handling categorical dependent variables. Its assumptions are consistent with having a categorical dependent variable assumed to be a proxy for a true underlying continuous normal distribution. A typical use of probit is to analyze dose-response data in medical studies. Like logit or logistic regression, the researcher focuses on a transformation of the probability that Y, the dependent, equals 1. Where the logit transformation is the natural log of the odds, the function used in probit is the inverse of the standard normal cumulative distribution function. Where logistic regression is based on the assumption that the categorical dependent reflects an underlying qualitative variable and uses the binomial distribution, probit regression assumes the categorical dependent reflects an underlying quantitative variable and it uses the cumulative normal distribution. As with logit regression, there are oprobit (ordinal probit) and mprobit (multinomial probit) options.
In practical terms, probit models usually come to the same conclusions as logistic regression and have the drawback that probit coefficients are more difficult to interpret (there is no equivalent to logistic regression's odds ratios as effect sizes in probit), hence they are less used, though the choice is largely one of personal preference. Both the cumulative standard normal curve used by probit as a transform and the logistic (log odds) curve used in logistic regression display an S-shaped curve. Though the probit curve is slightly steeper, differences are small. Because of its basis on the standard normal curve, probit is not recommended when there are many cases in one tail or the other of a distribution. An extended discussion of probit is found in Pampel (2000: 54-68).
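The two transformations can be compared side by side. The sketch below uses only the Python standard library and implements the logit as the natural log of the odds and the probit as the inverse of the standard normal cumulative distribution function; the probabilities looped over are arbitrary illustrative values.

```python
from math import log
from statistics import NormalDist

def logit(p):
    """Logit link: natural log of the odds p / (1 - p)."""
    return log(p / (1 - p))

def probit(p):
    """Probit link: inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(p)

# Both links map 0.5 to 0 and are symmetric around it.
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(logit(p), 3), round(probit(p), 3))
```

The two columns differ in scale, which is one reason logit and probit coefficients are not directly comparable even when the models lead to the same conclusions.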
Note that the chi-square test of goodness of fit cannot be used with PROBIT because it is based on an n by 2 table with one observation per row, which cannot approximate the chi-square distribution even for large samples.
If you ask for Predictions under the Save button, SPSS will add a column of predicted counts (labeled PRE_1 on the first run) to the working dataset. In the example, this would be the predicted count of suicides for the given cell. The predicted rate can be calculated based on the Poisson regression model:
Partial odds ratios. Partial odds ratios, like partial correlation coefficients for interval data, indicate the strength of a relationship when other variables are controlled. Put another way, partial odds ratios are a measure of main and interaction effects in a model. The partial odds ratio is the geometric mean of second-order odds ratios (odds ratios for conditional odds ratios on a third variable, such as odds ratios for men and women on being Democrats, for levels of education as a third variable). The partial odds ratio for education as a control variable would be the geometric mean of the simple (marginal) odds ratios for each of the three levels of education.
The partial odds ratio and the marginal odds ratio usually differ. If the simple or marginal odds ratio of Democrat as the dependent variable and female as the independent is 1.50, then a unit increase (switching from male=0 to female=1) is associated with a 50% (1.5 - 1) increase in the odds of being a Democrat. If the partial odds ratio turns out to be, say, 1.25, then a unit increase (switching from male=0 to female=1) is associated with a 25% (1.25 - 1) increase in the odds of being a Democrat when education is controlled.
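The geometric-mean definition and the percent-change reading can be sketched as follows. The three conditional odds ratios are hypothetical values chosen so that their geometric mean comes out at 1.25; they are not taken from any dataset in this text, and the function names are hypothetical.

```python
from math import exp, log

def geometric_mean(values):
    """Geometric mean computed on the log scale."""
    return exp(sum(log(v) for v in values) / len(values))

def percent_change_in_odds(odds_ratio):
    """Percent change in the odds for a one-unit increase."""
    return (odds_ratio - 1) * 100

# Hypothetical odds ratios (female vs. male on being a Democrat)
# within each of three education levels.
conditional_odds_ratios = [1.0, 1.25, 1.5625]

partial = geometric_mean(conditional_odds_ratios)
print(round(partial, 4))                          # 1.25
print(round(percent_change_in_odds(partial), 1))  # 25.0
```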
Structural zeros may also occur when the Cell structure option is used to weight cells, and the If button is used to set the weighting variable to 0 under certain conditions. This might be done by a researcher who wanted to see if a loglinear model was a good fit not only on all the cells in a table, but also on the table ignoring some of the cells. The to-be-ignored cells are set to structural zeros using the Cell structure and If options, thereby forcing the creation of structural zeros. Then the loglinear analysis is run normally, but SPSS will not use the structural zero cells. When the to-be-ignored cells are the diagonal cells, the test of quasi-independence uses this method to see if the independence model (constant and main effects only, no higher effects) is a good fit (nonsignificant on the likelihood ratio).