Overview
SPSS offers three general approaches to cluster analysis. In hierarchical clustering, the researcher selects a definition of distance, then selects a linking method for forming clusters, then determines how many clusters best suit the data. Hierarchical clustering represents the clusters in icicle plots and dendrograms. In k-means clustering, the researcher specifies the number of clusters in advance, and the procedure then calculates how to assign cases to the K clusters. K-means clustering is much less computer-intensive and is therefore sometimes preferred when datasets are large (e.g., > 1,000 cases). K-means clustering generates an ANOVA table showing mean-square error. Finally, two-step clustering creates pre-clusters, then clusters the pre-clusters. Two-step clustering handles very large datasets, is the method chosen when data are categorical, and has the largest array of output options, including variable importance plots.
There are a variety of measures of inter-observation distance and of inter-cluster similarity and distance to use as criteria when merging the nearest clusters into broader groups or when considering the relation of a point to a cluster. Distance measures how far apart two observations are: cases that are alike have a small distance between them. Similarity measures how alike two cases are: cases that are alike have a high similarity. However, it is common to refer to all of these as "distance" measures since they serve the same function. Note that when two or more variables are used to define distance, the one with the larger magnitude will dominate, so it is common to standardize all variables first. SPSS supports these interval measures:
INTERVAL
BINARY
Summary. In SPSS, similarity/distance measures are selected in the Measure area of the Method subdialog obtained by pressing the Method button in the Classify dialog. There are three measure pulldown menus, for interval, binary, and count data respectively. The proximity matrix table in the output shows the actual distances or similarities computed for any pair of cases. In SPSS, proximity matrices are selected under Analyze, Cluster, Hierarchical clustering; Statistics button; check proximity matrix.
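The effect of standardizing before computing distances can be sketched outside SPSS. The following is a minimal illustration in Python, not SPSS's own computation: it converts each variable to z-scores and builds a proximity matrix of Euclidean distances. The function names and the three example cases (income and age) are hypothetical, chosen only to show a large-magnitude variable dominating the raw distances.

```python
import math
import statistics

def standardize(rows):
    """Convert each column to z-scores (mean 0, sample standard deviation 1)."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def euclidean(a, b):
    """Euclidean distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def proximity_matrix(rows):
    """Square matrix of distances between every pair of cases."""
    return [[euclidean(a, b) for b in rows] for a in rows]

# Hypothetical cases: [income in dollars, age in years].
cases = [[30000.0, 25.0], [32000.0, 65.0], [60000.0, 27.0]]

raw = proximity_matrix(cases)                 # income dominates these distances
z_cases = standardize(cases)
std = proximity_matrix(z_cases)               # both variables now contribute

for label, matrix in (("raw", raw), ("standardized", std)):
    print(label)
    for row in matrix:
        print("  ", [round(d, 3) for d in row])
```

In the raw matrix, cases 1 and 2 look nearly identical even though they differ by 40 years of age, because the income scale swamps the age scale. After standardization the age difference counts about as much as the income difference.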
One may wish to use the hierarchical cluster procedure on a sample of cases (e.g., 200) to inspect results for different numbers of clusters. The optimum number of clusters depends on the research purpose. Identifying "typical" types may call for few clusters and identifying "exceptional" types may call for many clusters. After using hierarchical clustering to determine the desired number of clusters, the researcher may then wish to analyze the entire dataset with k-means clustering (also known as the Quick Cluster procedure: Analyze, Cluster, K-Means Cluster Analysis), specifying that number of clusters.
In the figure above, from hierarchical cluster analysis on 8 judges who rated 50 objects, the agglomeration schedule shows, for instance, that judges 3 and 5 are combined in a cluster first (the cluster is labeled 3). Then at a later point, stage 4, the new cluster 3 is combined with judge 7 to form a larger cluster, also now labeled 3, and so on.
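How an agglomeration schedule is built up stage by stage can be sketched in a few lines of Python. This is an illustrative sketch only, assuming average (between-groups) linkage and Euclidean distance. The function name and the five one-dimensional points are hypothetical and are not the judges data from the figure. As in the schedule described above, a merged cluster keeps the lower of the two labels.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomeration_schedule(points):
    """Average-linkage agglomerative clustering.

    Returns (stage, cluster_a, cluster_b, coefficient) tuples, where the
    coefficient is the average distance between the two clusters merged.
    """
    # Each item starts as its own cluster, labeled 1..n.
    clusters = {i + 1: [i] for i in range(len(points))}
    schedule = []
    stage = 0
    while len(clusters) > 1:
        best = None
        # Compare every pair of current clusters; a is always the lower label.
        for a, b in combinations(sorted(clusters), 2):
            dists = [euclidean(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]]
            avg = sum(dists) / len(dists)
            if best is None or avg < best[0]:
                best = (avg, a, b)
        coefficient, a, b = best
        stage += 1
        # Merge b into a; the combined cluster keeps the lower label a.
        clusters[a] = clusters[a] + clusters.pop(b)
        schedule.append((stage, a, b, coefficient))
    return schedule

# Hypothetical data: five items measured on one variable.
points = [[1.0], [1.5], [5.0], [5.2], [10.0]]
schedule = agglomeration_schedule(points)
for row in schedule:
    print(row)
```

Reading the printed rows from top to bottom corresponds to reading an agglomeration schedule: the closest pair is merged first, and the coefficient grows as increasingly dissimilar clusters are joined.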
In the figure above, from hierarchical cluster analysis on 8 judges who rated 50 objects, the vertical icicle plot shows that when there are 2 clusters, the judge labeled "Enthusiast" is in one cluster and all the country-affiliated judges are in the other. When there are 3 clusters, Enthusiast is still in a cluster alone, Russia-China-Romania are in a second cluster, and all other countries in a third cluster. Etc.
In the figure above, from hierarchical cluster analysis on 8 judges who rated 50 objects, the dendrogram shows judges 3 & 5 (these were Romania and China respectively) to be in one of the two earliest clusters, with judge 7 (Russia) affiliated with cluster 3 & 5 only at a greater distance. In general, the dendrogram shows the pattern of clustering among the judges, with connecting lines further to the right indicating more distance between judges and clusters. The final linkage to judge 8 ("Enthusiast") is not shown but indicated by the trailing linkage line furthest to the left. While this figure and those above were for clustering variables, one can also cluster cases. The dendrogram below is for the clustering of the 50 objects, with objects 10, 38, 17, 16, 18, 43, 2, 46, and 27 forming one of the first clusters:
Save button: Optionally, you may press the Save button to save the final cluster number of each case as an added column in your dataset (labeled QCL_1), and/or you may save the Euclidean distance between each case and its cluster center (labeled QCL_2) by checking "Distance from cluster center."
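What the two saved columns contain can be illustrated with a small k-means sketch in Python. This is a textbook-style iteration under stated assumptions, not the SPSS Quick Cluster implementation. The function names, the four cases, and the initial centers are hypothetical. The returned membership and distance lists play the roles of the QCL_1 and QCL_2 columns.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(cases, centers):
    """Index of the nearest center for each case."""
    return [min(range(len(centers)), key=lambda k: euclidean(row, centers[k]))
            for row in cases]

def k_means(cases, initial_centers, max_iter=10):
    """Alternate between assigning cases and recomputing centers as means."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iter):
        membership = assign(cases, centers)
        updated = []
        for k, old in enumerate(centers):
            members = [row for row, m in zip(cases, membership) if m == k]
            # A cluster with no members keeps its previous center.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else old)
        if updated == centers:      # converged: centers no longer move
            break
        centers = updated
    membership = assign(cases, centers)
    distances = [euclidean(row, centers[m]) for row, m in zip(cases, membership)]
    # Report cluster numbers starting at 1.
    return centers, [m + 1 for m in membership], distances

# Hypothetical data: four cases on two variables, two initial centers.
cases = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
final_centers, qcl_1, qcl_2 = k_means(cases, [[1.0, 1.0], [9.0, 8.0]])

print("final centers:", final_centers)
print("cluster membership (like QCL_1):", qcl_1)
print("distance from center (like QCL_2):", qcl_2)
```

Cases with an unusually large distance from their cluster center are candidates for inspection as poorly fitting or outlying cases.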
Options button: Optionally, you may press the Options button to select statistics or missing values options. There are three statistics options:
In the figure above, for 8 judges (7 nations plus "Enthusiast" are the "variables") rating 50 objects, the ANOVA table shows the largest error associated with the "Enthusiast" judge, meaning that judge (variable) is least helpful in forming and differentiating the clusters. All judges/variables are significant, but this is largely meaningless because the clusters were formed precisely to maximize the differences among cases in different clusters. The ANOVA table is used mainly to look at the size of the mean square errors.
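The quantities in such an ANOVA table can be reproduced by hand. The sketch below, in Python, computes the cluster mean square, the error mean square, and their ratio F for one variable given each case's cluster membership. The function name and the two small variables are hypothetical and are not the judges data.

```python
def anova_row(values, membership):
    """One-way ANOVA of a single variable across clusters.

    Returns (cluster mean square, error mean square, F).
    """
    groups = {}
    for value, cluster in zip(values, membership):
        groups.setdefault(cluster, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-cluster sum of squares: cluster means around the grand mean.
    ss_cluster = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-cluster (error) sum of squares: cases around their cluster mean.
    ss_error = sum((v - sum(g) / len(g)) ** 2
                   for g in groups.values() for v in g)
    ms_cluster = ss_cluster / (k - 1)
    ms_error = ss_error / (n - k)
    return ms_cluster, ms_error, ms_cluster / ms_error

# Hypothetical example: six cases in two clusters, two variables.
membership = [1, 1, 1, 2, 2, 2]
separates_well = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
separates_poorly = [1.0, 5.0, 9.0, 2.0, 6.0, 10.0]

print(anova_row(separates_well, membership))
print(anova_row(separates_poorly, membership))
```

The second variable has a much larger error mean square than the first, which is the pattern the text describes for a variable that contributes little to differentiating the clusters.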
There are two missing values options: listwise (the default) and pairwise deletion of cases with missing values.
Drawing initial cluster centers from a file. In the K-means Cluster main dialog, in the Cluster centers area, one may check "Write final as" and enter a file name. This will save the final cluster centers of each variable to a data file which can later be used as the initial centers for a different sample of cases. In the same area, there is a "Read initial from" checkbox and a place to enter a file name for this purpose. If the initial file is created manually, note that the file referenced must have as its first column a variable named "cluster_". The additional columns are the same variables you specified in the dialog box, though they need not be in the same order. You may have additional variables in this file beyond those specified in the dialog box; these will be ignored. If you do not draw initial centers from a file, SPSS will find k well-separated cases (that is, one case, with its value on each variable, for each of the k clusters requested) and use these as initial values. Either way, the output includes a table of "Initial cluster centers."
In the example above, by the BIC criterion alone one would select 6 clusters as being optimal. By the SPSS default algorithm, however, 2 clusters are selected because this yields a large BIC ratio of change and a large ratio of distances. In essence, the SPSS algorithm judges that the gain in information from having more than 2 clusters is not worth the increased complexity (diminution of parsimony) of the model. The researcher has the option to override this default and specify 6 or some other number of clusters.
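The difference between choosing by minimum BIC and looking at ratios of BIC change can be illustrated with a short Python sketch. It computes only the BIC change and the ratio of each change to the first change. It does not implement the SPSS selection rule, which also uses the ratio of distance measures. The function name is hypothetical, and the BIC values are invented for illustration; they are not the values from the example above.

```python
def bic_changes(bic_by_k):
    """For each number of clusters after the first, return
    (k, BIC, change from the previous k, ratio of that change to the first change)."""
    ks = sorted(bic_by_k)
    rows = []
    first_change = None
    for prev, k in zip(ks, ks[1:]):
        change = bic_by_k[k] - bic_by_k[prev]
        if first_change is None:
            first_change = change
        rows.append((k, bic_by_k[k], change, change / first_change))
    return rows

# Invented BIC values for 1 to 7 clusters (illustration only).
bic = {1: 1000.0, 2: 700.0, 3: 640.0, 4: 610.0, 5: 600.0, 6: 596.0, 7: 601.0}

best_by_bic = min(bic, key=bic.get)
rows = bic_changes(bic)

print("lowest BIC at", best_by_bic, "clusters")
for k, value, change, ratio in rows:
    print(k, value, change, round(ratio, 3))
```

In these invented numbers the lowest BIC occurs at 6 clusters, yet most of the improvement comes from moving from 1 to 2 clusters, and each later change is a small fraction of that first change. This is the kind of pattern in which a parsimony-minded rule would stop at a small number of clusters.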
For continuous variables, error bar charts are shown for each cluster. These charts are labeled "Confidence Intervals for means" and show each continuous variable's mean with wing bars depicting the 95% confidence limits around each mean.
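The confidence limits drawn as wing bars can be approximated by hand. The Python sketch below computes a mean and approximate 95% limits for one variable within one cluster. It assumes a normal (z) critical value for simplicity, whereas limits based on the t distribution would be somewhat wider for small clusters. The function name and the data values are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(values, level=0.95):
    """Mean with approximate confidence limits using a normal critical value."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * standard_error
    return mean, mean - half_width, mean + half_width

# Hypothetical values of one continuous variable for the cases in one cluster.
cluster_values = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, lower, upper = mean_confidence_interval(cluster_values)
print(round(mean, 2), round(lower, 2), round(upper, 2))
```

When the intervals for two clusters on the same variable do not overlap, that variable is helping to distinguish those clusters.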
The plot below shows that both categorical variables, country and number of cylinders, differentiate the cars in Cluster 2.
Copyright 1998, 2008 by G. David Garson.
Last updated 3/24/08.