Bibliography on Cluster Analysis
Warren S. Sarle <[email protected]>
Originally published in the _SAS/STAT User's Guide_, 1990
Revised Sep 14, 1997

The clustering literature contains a vast number of useless publications. This bibliography is intended to concentrate on the more useful ones.

Massart and Kaufman (1983) is the best elementary introduction to cluster analysis. Other important texts are Anderberg (1973), Sneath and Sokal (1973), Duran and Odell (1974), Hartigan (1975), Titterington, Smith, and Makov (1985), McLachlan and Basford (1988), and Kaufman and Rousseeuw (1990). Hartigan (1975) and Spath (1980) give numerous FORTRAN programs for clustering. Any prospective user of cluster analysis should study the Monte Carlo results of Milligan (1980), Milligan and Cooper (1985), and Cooper and Milligan (1988). Important references on the statistical aspects of clustering include MacQueen (1967), Wolfe (1970), Scott and Symons (1971), Hartigan (1977; 1978; 1981; 1985), Symons (1981), Everitt (1981), Sarle (1983), Bock (1985), and Thode et al. (1988). Bayesian methods have important advantages over maximum likelihood; see Binder (1978; 1981), Banfield and Raftery (1993), and Bensmail et al. (1997). For fuzzy clustering, see Bezdek (1981) and Bezdek and Pal (1992). The signal-processing perspective is provided by Gersho and Gray (1992). See Blashfield and Aldenderfer (1978) for a discussion of the fragmented state of the literature on cluster analysis. Avoid articles in the Journal of Marketing Research.

There is a separate list of references at the end on nonparametric clustering methods, which define a cluster as a mode in the probability density function; these nonparametric methods have major advantages over all traditional methods.

Anderberg, M.R. (1973), _Cluster Analysis for Applications_, New York: Academic Press, Inc.
Art, D., Gnanadesikan, R., and Kettenring, R. (1982), "Data-based Metrics for Cluster Analysis," Utilitas Mathematica, 21A, 75-99.
Banfield, J.D. and Raftery, A.E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803-821.
Bensmail, H., Celeux, G., Raftery, A.E., and Robert, C.P. (1997), "Inference in model-based cluster analysis," Statistics and Computing, 7, 1-10.
Bezdek, J.C. (1981), _Pattern Recognition with Fuzzy Objective Function Algorithms_, New York: Plenum Press.
Bezdek, J.C. and Pal, S.K., eds. (1992), _Fuzzy Models for Pattern Recognition_, New York: IEEE Press.
Binder, D.A. (1978), "Bayesian Cluster Analysis," Biometrika, 65, 31-38.
Binder, D.A. (1981), "Approximations to Bayesian Clustering Rules," Biometrika, 68, 275-285.
Blashfield, R.K. and Aldenderfer, M.S. (1978), "The Literature on Cluster Analysis," Multivariate Behavioral Research, 13, 271-295.
Bock, H.H. (1985), "On Some Significance Tests in Cluster Analysis," Journal of Classification, 2, 77-108.
Calinski, T. and Harabasz, J. (1974), "A Dendrite Method for Cluster Analysis," Communications in Statistics, 3, 1-27.
Cooper, M.C. and Milligan, G.W. (1988), "The Effect of Error on Determining the Number of Clusters," Proceedings of the International Workshop on Data Analysis, Decision Support and Expert Knowledge Representation in Marketing and Related Areas of Research, 319-328.
Duda, R.O. and Hart, P.E. (1973), _Pattern Classification and Scene Analysis_, New York: John Wiley & Sons, Inc.
Duran, B.S. and Odell, P.L. (1974), _Cluster Analysis_, New York: Springer-Verlag.
Engelman, L. and Hartigan, J.A. (1969), "Percentage Points of a Test for Clusters," Journal of the American Statistical Association, 64, 1647-1648.
Everitt, B.S. (1979), "Unresolved Problems in Cluster Analysis," Biometrics, 35, 169-181.
Everitt, B.S. (1981), "A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions," Multivariate Behavioral Research, 16, 171-180.
Everitt, B.S. and Hand, D.J. (1981), _Finite Mixture Distributions_, New York: Chapman and Hall.
Gersho, A. and Gray, R.M. (1992), _Vector Quantization and Signal Compression_, Boston: Kluwer Academic Publishers.
Good, I.J. (1977), "The Botryology of Botryology," in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press, Inc.
Hartigan, J.A. (1975), _Clustering Algorithms_, New York: John Wiley & Sons, Inc.
Hartigan, J.A. (1977), "Distribution Problems in Clustering," in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press, Inc.
Hartigan, J.A. (1978), "Asymptotic Distributions for Clustering Criteria," Annals of Statistics, 6, 117-131.
Hartigan, J.A. (1981), "Consistency of Single Linkage for High-Density Clusters," Journal of the American Statistical Association, 76, 388-394.
Hartigan, J.A. (1985), "Statistical Theory in Clustering," Journal of Classification, 2, 63-76.
Hathaway, R.J. (1985), "A constrained formulation of maximum-likelihood estimation for normal mixture distributions," Annals of Statistics, 13, 795-800.
Hawkins, D.M., Muller, M.W., and ten Krooden, J.A. (1982), "Cluster Analysis," in Topics in Applied Multivariate Analysis, ed. D.M. Hawkins, Cambridge: Cambridge University Press.
Hubert, L. (1974), "Approximate Evaluation Techniques for the Single-Link and Complete-Link Hierarchical Clustering Procedures," Journal of the American Statistical Association, 69, 698-704.
Hubert, L.J. and Baker, F.B. (1977), "An Empirical Comparison of Baseline Models for Goodness-of-Fit in r-Diameter Hierarchical Clustering," in Classification and Clustering, ed. J. Van Ryzin, New York: Academic Press, Inc.
Kaufman, L. and Rousseeuw, P.J. (1990), _Finding Groups in Data_, New York: John Wiley & Sons, Inc.
Lee, K.L. (1979), "Multivariate Tests for Clusters," Journal of the American Statistical Association, 74, 708-714.
Lindsay, B.G. and Basak, P. (1993), "Multivariate normal mixtures: A fast consistent method of moments," Journal of the American Statistical Association, 88, 468-476.
Ling, R.F. (1973), "A Probability Theory of Cluster Analysis," Journal of the American Statistical Association, 68, 159-169.
MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281-297.
Marriott, F.H.C. (1971), "Practical Problems in a Method of Cluster Analysis," Biometrics, 27, 501-514.
Marriott, F.H.C. (1975), "Separating Mixtures of Normal Distributions," Biometrics, 31, 767-769.
Massart, D.L. and Kaufman, L. (1983), _The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis_, New York: John Wiley & Sons, Inc.
McClain, J.O. and Rao, V.R. (1975), "CLUSTISZ: A Program to Test for the Quality of Clustering of a Set of Objects," Journal of Marketing Research, 12, 456-460.
McLachlan, G.J. and Basford, K.E. (1988), _Mixture Models_, New York: Marcel Dekker, Inc.
Mezzich, J.E. and Solomon, H. (1980), _Taxonomy and Behavioral Science_, New York: Academic Press, Inc.
Milligan, G.W. (1980), "An Examination of the Effect of Six Types of Error Perturbation on Fifteen Clustering Algorithms," Psychometrika, 45, 325-342.
Milligan, G.W. (1981), "A Review of Monte Carlo Tests of Cluster Analysis," Multivariate Behavioral Research, 16, 379-407.
Milligan, G.W. and Cooper, M.C. (1985), "An Examination of Procedures for Determining the Number of Clusters in a Data Set," Psychometrika, 50, 159-179.
Pollard, D. (1981), "Strong Consistency of k-Means Clustering," Annals of Statistics, 9, 135-140.
Priebe, C.E. (1994), "Adaptive mixtures," Journal of the American Statistical Association, 89, 796-806.
Sarle, W.S. (1982), "Cluster Analysis by Least Squares," Proceedings of the Seventh Annual SAS Users Group International Conference, 651-653.
Sarle, W.S. (1983), _Cubic Clustering Criterion_, SAS Technical Report A-108, Cary, NC: SAS Institute Inc.
Scott, A.J. and Symons, M.J. (1971), "Clustering Methods Based on Likelihood Ratio Criteria," Biometrics, 27, 387-397.
Sneath, P.H.A. and Sokal, R.R. (1973), _Numerical Taxonomy_, San Francisco: W.H. Freeman.
Spath, H. (1980), _Cluster Analysis Algorithms_, Chichester, England: Ellis Horwood.
Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal Mixtures," Biometrics, 37, 35-43.
Thode, H.C., Jr., Mendell, N.R., and Finch, S.J. (1988), "Simulated percentage points for the null distribution of the likelihood ratio test for a mixture of two normals," Biometrics, 44, 1195-1201.
Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985), _Statistical Analysis of Finite Mixture Distributions_, New York: John Wiley & Sons, Inc.
Vuong, Q.H. (1989), "Likelihood ratio tests for model selection and non-nested hypotheses," Econometrica, 57, 307-333.
Ward, J.H. (1963), "Hierarchical Grouping to Optimize an Objective Function," Journal of the American Statistical Association, 58, 236-244.
Wolfe, J.H. (1970), "Pattern Clustering by Multivariate Mixture Analysis," Multivariate Behavioral Research, 5, 329-350.
Wolfe, J.H. (1978), "Comparative Cluster Analysis of Patterns of Vocational Interest," Multivariate Behavioral Research, 13, 33-44.

*************************************************************************

More references for nonparametric estimation of clusters as modes:

Barnett, V., ed. (1981), _Interpreting Multivariate Data_, New York: John Wiley & Sons, Inc.
Girman, C.J. (1994), "Cluster Analysis and Classification Tree Methodology as an Aid to Improve Understanding of Benign Prostatic Hyperplasia," Ph.D. thesis, Chapel Hill, NC: Department of Biostatistics, University of North Carolina.
Gitman, I. (1973), "An Algorithm for Nonsupervised Pattern Classification," IEEE Transactions on Systems, Man, and Cybernetics, SMC-3, 66-74.
Hartigan, J.A. and Hartigan, P.M. (1985), "The Dip Test of Unimodality," Annals of Statistics, 13, 70-84.
Hartigan, P.M. (1985), "Computation of the Dip Statistic to Test for Unimodality," Applied Statistics, 34, 320-325.
Huizinga, D.H. (1978), "A Natural or Mode Seeking Cluster Analysis Algorithm," Technical Report 78-1, Behavioral Research Institute, 2305 Canyon Blvd., Boulder, Colorado 80302.
Koontz, W.L.G. and Fukunaga, K. (1972a), "A Nonparametric Valley-Seeking Technique for Cluster Analysis," IEEE Transactions on Computers, C-21, 171-178.
Koontz, W.L.G. and Fukunaga, K. (1972b), "Asymptotic Analysis of a Nonparametric Clustering Technique," IEEE Transactions on Computers, C-21, 967-974.
Koontz, W.L.G., Narendra, P.M., and Fukunaga, K. (1976), "A Graph-Theoretic Approach to Nonparametric Cluster Analysis," IEEE Transactions on Computers, C-25, 936-944.
Minnotte, M.C. (1992), "A Test of Mode Existence with Applications to Multimodality," Ph.D. thesis, Rice University, Department of Statistics.
Mizoguchi, R. and Shimura, M. (1980), "A Nonparametric Algorithm for Detecting Clusters Using Hierarchical Structure," IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2, 292-300.
Mueller, D.W. and Sawitzki, G. (1991), "Excess mass estimates and tests for multimodality," Journal of the American Statistical Association, 86, 738-746.
Polonik, W. (1993), "Measuring Mass Concentrations and Estimating Density Contour Clusters--An Excess Mass Approach," Technical Report, Beitraege zur Statistik Nr. 7, Universitaet Heidelberg.
SAS Institute Inc. (1993), _SAS/STAT Software: The MODECLUS Procedure_, SAS Technical Report P-256, Cary, NC: SAS Institute Inc.
Silverman, B.W. (1986), _Density Estimation_, New York: Chapman and Hall.
Tukey, P.A. and Tukey, J.W. (1981), "Data-Driven View Selection; Agglomeration and Sharpening," in Barnett (1981).
Wong, M.A. (1982), "A Hybrid Clustering Method for Identifying High-Density Clusters," Journal of the American Statistical Association, 77, 841-847.
Wong, M.A. and Lane, T. (1983), "A _k_th Nearest Neighbor Clustering Procedure," Journal of the Royal Statistical Society, Series B, 45, 362-368.
Wong, M.A. and Schaack, C. (1982), "Using the _k_th Nearest Neighbor Clustering Procedure to Determine the Number of Subpopulations," American Statistical Association 1982 Proceedings of the Statistical Computing Section, 40-48.