IT training clustering for data mining a data recovery approach mirkin 2005 04 29

Computer Science and Data Analysis Series

Clustering for Data Mining
A Data Recovery Approach

Boris Mirkin

Boca Raton London New York Singapore

Published in 2005 by Chapman & Hall/CRC, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

© 2005 by Taylor & Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper

International Standard Book Number-10: 1-58488-534-3 (Hardcover)
International Standard Book Number-13: 978-1-58488-534-4 (Hardcover)
Library of Congress Card Number 2005041421

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered
trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Mirkin, B. G. (Boris Grigorévich)
Clustering for data mining : a data recovery approach / Boris Mirkin.
p. cm. (Computer science and data analysis series ; 3)
Includes bibliographical references and index.
ISBN 1-58488-534-3
Data mining. Cluster analysis. I. Title. II. Series.
QA76.9.D343M57 2005
006.3'12 dc22 2005041421

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Taylor & Francis Group is the Academic Division of T&F Informa plc.

Chapman & Hall/CRC Computer Science and Data Analysis Series

The interface between the computer and statistical sciences is increasing, as each discipline seeks to harness the power and resources of the other. This series aims to foster the integration between the computer sciences and statistical, numerical, and probabilistic methods by publishing a broad range of reference works, textbooks, and handbooks.

SERIES EDITORS
John Lafferty, Carnegie Mellon University
David Madigan, Rutgers University
Fionn Murtagh, Royal Holloway, University of London
Padhraic Smyth, University of California, Irvine

Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Chapman & Hall/CRC, 23-25 Blades Court, London SW15 2NU, UK

Published Titles
Bayesian Artificial Intelligence, Kevin B. Korb and Ann E. Nicholson
Pattern Recognition Algorithms for Data Mining, Sankar K. Pal and Pabitra Mitra
Exploratory Data Analysis with MATLAB®, Wendy L. Martinez and Angel R. Martinez
Clustering for Data Mining: A Data Recovery Approach, Boris Mirkin
Correspondence Analysis and Data Coding with JAVA and R, Fionn Murtagh
R Graphics, Paul Murrell

Contents

Preface
List of Denotations
Introduction: Historical Remarks

What Is Clustering
  Base words
  1.1 Exemplary problems
    1.1.1 Structuring
    1.1.2 Description
    1.1.3 Association
    1.1.4 Generalization
    1.1.5 Visualization of data structure
  1.2 Bird's-eye view
    1.2.1 Definition: data and cluster structure
    1.2.2 Criteria for revealing a cluster structure
    1.2.3 Three types of cluster description
    1.2.4 Stages of a clustering application
    1.2.5 Clustering and other disciplines
    1.2.6 Different perspectives of clustering

What Is Data
  Base words
  2.1 Feature characteristics
    2.1.1 Feature scale types
    2.1.2 Quantitative case
    2.1.3 Categorical case
  2.2 Bivariate analysis
    2.2.1 Two quantitative variables
    2.2.2 Nominal and quantitative variables
    2.2.3 Two nominal variables cross-classified
    2.2.4 Relation between correlation and contingency
    2.2.5 Meaning of correlation
  2.3 Feature space and data scatter
    2.3.1 Data matrix
    2.3.2 Feature space: distance and inner product
    2.3.3 Data scatter
  2.4 Pre-processing and standardizing mixed data
  2.5 Other table data types
    2.5.1 Dissimilarity and similarity data
    2.5.2 Contingency and flow data

K-Means Clustering
  Base words
  3.1 Conventional K-Means
    3.1.1 Straight K-Means
    3.1.2 Square error criterion
    3.1.3 Incremental versions of K-Means
  3.2 Initialization of K-Means
    3.2.1 Traditional approaches to initial setting
    3.2.2 MaxMin for producing deviate centroids
    3.2.3 Deviate centroids with Anomalous pattern
  3.3 Intelligent K-Means
    3.3.1 Iterated Anomalous pattern for iK-Means
    3.3.2 Cross validation of iK-Means results
  3.4 Interpretation aids
    3.4.1 Conventional interpretation aids
    3.4.2 Contribution and relative contribution tables
    3.4.3 Cluster representatives
    3.4.4 Measures of association from ScaD tables
  3.5 Overall assessment

Ward Hierarchical Clustering
  Base words
  4.1 Agglomeration: Ward algorithm
  4.2 Divisive clustering with Ward criterion
    4.2.1 2-Means splitting
    4.2.2 Splitting by separating
    4.2.3 Interpretation aids for upper cluster hierarchies
  4.3 Conceptual clustering
  4.4 Extensions of Ward clustering
    4.4.1 Agglomerative clustering with dissimilarity data
    4.4.2 Hierarchical clustering for contingency and flow data
  4.5 Overall assessment

Data Recovery Models
  Base words
  5.1 Statistics modeling as data recovery
    5.1.1 Averaging
    5.1.2 Linear regression
    5.1.3 Principal component analysis
    5.1.4 Correspondence factor analysis
  5.2 Data recovery model for K-Means
    5.2.1 Equation and data scatter decomposition
    5.2.2 Contributions of clusters, features, and individual entities
    5.2.3 Correlation ratio as contribution
    5.2.4 Partition contingency coefficients
  5.3 Data recovery models for Ward criterion
    5.3.1 Data recovery models with cluster hierarchies
    5.3.2 Covariances, variances and data scatter decomposed
    5.3.3 Direct proof of the equivalence between 2-Means and Ward criteria
    5.3.4 Gower's controversy
  5.4 Extensions to other data types
    5.4.1 Similarity and attraction measures compatible with K-Means and Ward criteria
    5.4.2 Application to binary data
    5.4.3 Agglomeration and aggregation of contingency data
    5.4.4 Extension to multiple data
  5.5 One-by-one clustering
    5.5.1 PCA and data recovery clustering
    5.5.2 Divisive Ward-like clustering
    5.5.3 Iterated Anomalous pattern
    5.5.4 Anomalous pattern versus Splitting
    5.5.5 One-by-one clusters for similarity data
  5.6 Overall assessment

Different Clustering Approaches
  Base words
  6.1 Extensions of K-Means clustering
    6.1.1 Clustering criteria and implementation
    6.1.2 Partitioning around medoids PAM
    6.1.3 Fuzzy clustering
    6.1.4 Regression-wise clustering
    6.1.5 Mixture of distributions and EM algorithm
    6.1.6 Kohonen self-organizing maps SOM
  6.2 Graph-theoretic approaches
    6.2.1 Single linkage, minimum spanning tree and connected components
    6.2.2 Finding a core
  6.3 Conceptual description of clusters
    6.3.1 False positives and negatives
    6.3.2 Conceptually describing a partition
    6.3.3 Describing a cluster with production rules
    6.3.4 Comprehensive conjunctive description of a cluster
  6.4 Overall assessment

General Issues
  Base words
  7.1 Feature selection and extraction
    7.1.1 A review
    7.1.2 Comprehensive description as a feature selector
    7.1.3 Comprehensive description as a feature extractor
  7.2 Data pre-processing and standardization
    7.2.1 Dis/similarity between entities
    7.2.2 Pre-processing feature based data
    7.2.3 Data standardization
  7.3 Similarity on subsets and partitions
    7.3.1 Dis/similarity between binary entities or subsets
    7.3.2 Dis/similarity between partitions
  7.4 Dealing with missing data
    7.4.1 Imputation as part of pre-processing
    7.4.2 Conditional mean
    7.4.3 Maximum likelihood
    7.4.4 Least-squares approximation
  7.5 Validity and reliability
    7.5.1 Index based validation
    7.5.2 Resampling for validation and selection
    7.5.3 Model selection with resampling
  7.6 Overall assessment

Conclusion: Data Recovery Approach in Clustering
Bibliography

Preface

Clustering is a discipline devoted to finding and describing cohesive or homogeneous chunks in data, the clusters.

Some exemplary clustering problems are:

- Finding common surf patterns in the set of web users
- Automatically revealing meaningful parts in a digitalized image
- Partition of a set of documents in groups by similarity of their contents
- Visual display of the environmental similarity between regions on a country map
- Monitoring socio-economic development of a system of settlements via a small number of representative settlements
- Finding protein sequences in a database that are homologous to a query protein sequence
- Finding anomalous patterns of gene expression data for diagnostic purposes
- Producing a decision rule for separating potentially bad-debt credit applicants
- Given a set of preferred vacation places, finding out what features of the places and vacationers attract each other
- Classifying
households according to their furniture purchasing patterns and finding groups' key characteristics to optimize furniture marketing and production

Clustering is a key area in data mining and knowledge discovery, which are activities oriented towards finding non-trivial or hidden patterns in data collected in databases.

Earlier developments of clustering techniques have been associated, primarily, with three areas of research: factor analysis in psychology [55], numerical taxonomy in biology [122], and unsupervised learning in pattern recognition [21].

Technically speaking, the idea behind clustering is rather simple: introduce a measure of similarity between entities under consideration and combine similar entities into the same clusters while keeping dissimilar entities in different clusters. However, implementing this idea is less than straightforward.

First, too many similarity measures and clustering techniques have been invented with virtually no support to a non-specialist user in selecting among them. The trouble with this is that different similarity measures and/or clustering techniques may, and frequently do, lead to different results. Moreover, the same technique may also lead to different cluster solutions depending on the choice of parameters such as the initial setting or the number of clusters specified. On the other hand, some common data types, such as questionnaires with both quantitative and categorical features, have been left virtually without any substantiated similarity measure.

Second, use and interpretation of cluster structures may become an issue, especially when available data features are not straightforwardly related to the phenomenon under consideration. For instance, certain data on customers available at a bank, such as age and gender, typically are not very helpful in deciding whether to grant a customer a loan or not.

Specialists acknowledge peculiarities of the discipline of
clustering. They understand that the clusters to be found in data may very well depend not only on the data but also on the user's goals and degree of granulation. They frequently consider clustering as art rather than science. Indeed, clustering has been dominated by learning from examples rather than theory-based instructions. This is especially visible in texts written for inexperienced readers, such as [4], [28] and [115].

The general opinion among specialists is that clustering is a tool to be applied at the very beginning of investigation into the nature of a phenomenon under consideration, to view the data structure and then decide upon applying better suited methodologies. Another opinion of specialists is that methods for finding clusters as such should constitute the core of the discipline; related questions of data pre-processing, such as feature quantization and standardization, definition and computation of similarity, and post-processing, such as interpretation and association with other aspects of the phenomenon, should be left beyond the scope of the discipline because they are motivated by external considerations related to the substance of the phenomenon under investigation. I share the former opinion and argue against the latter because it is at odds with the former: in the very first steps of knowledge discovery, substantive considerations are quite shaky, and it is unrealistic to expect that they alone could lead to properly solving the issues of pre- and post-processing.

Such a dissimilar opinion has led me to believe that the discovered clusters must be treated as an "ideal" representation of the data that could be used for recovering the original data back from the ideal format. This is the idea of the data recovery approach: not only use data for finding clusters but also use clusters for recovering the data. In a general situation, the data recovered from aggregate clusters cannot fit the original data exactly, which can be used for evaluation of the quality of
clusters: the better the fit, the better the clusters. This perspective would also lead to the addressing of issues in pre- and post-

Conclusion: Data Recovery Approach in Clustering

Traditionally clusters are built based on similarity. A found cluster of similar entities may be used as a whole to generalize and predict. This order of action is reversed in the data recovery approach. A property of similarity clusters, the possibility to aggregate data of individual entities into data of clusters, is taken here as the defining attribute. According to the data recovery approach, entities are brought together not because they are similar but because they can be used to better recover the data they have been built from.

Is it not just a new name for old wine? Indeed, the closer the entities to each other, the more each of them resembles the cluster's profile. Yet the shift of the focus brings forward an important difference. In the data recovery approach, the concept of similarity loses its foundation status and becomes a derivative of the criterion of recovery.

In conventional clustering, the emphasis is on presenting the user with a number of options for measuring similarity: the greater the choice, the better. There is nothing wrong with this idea when the substantive area is well understood. But everything is wrong with this when the knowledge of the substantive area is poor. No user is capable of reasonably choosing a similarity measure in such a situation. A similarity measure should come from the data mining side. This is the case in which a data recovery approach can provide sound recommendations for the similarity measurement and, moreover, for the data pre-processing needed to balance items constituting the data recovery criterion.

The heart of the data recovery clustering framework is Pythagorean decomposition of the data scatter into two items, that explained by the cluster structure and that unexplained,
the square error. The items are further decomposed in the contributions of individual entity-feature pairs or larger substructures.

Some developments that follow from this:

Pre-processing

(a) The data scatter expresses the scattering of entities around the origin of the feature space, which thus must be put into a central, or normal, position among the entity points (this position is taken to be the grand mean).

(b) The data scatter is the sum of feature contributions that are proportional to their variances, thus reflecting the distribution shapes; this allows for tuning feature normalization options by separating scale and shape related parts.

(c) The strategy of binary coding of the qualitative categories in order to simultaneously process them together with quantitative features, which cannot be justified in conventional frameworks, is supported in this framework with the following:

  i. Binary features appear to be the ultimate form of quantitative features, those maximally contributing to the data scatter.

  ii. The explained parts of category contributions sum up to association contingency coefficients that already have been heavily involved in data analysis and statistics, though from a very different perspective.

  iii. The association coefficients are related to the data normalization options, which can be utilized to facilitate the user's choice among the latter; this can be done now from either end, the process' input or output, or both.

  iv. The equivalent entity-to-entity similarity measure which has emerged in the data recovery context is akin to the best heuristic similarity measures but goes even further by taking into account the information weights of categories.

Clustering

(a) Data recovery models for both K-Means and Ward clustering extend the Principal Component Analysis (PCA) model to the cases in which scoring vectors are to be compulsorily binary or ternary, respectively. This analogy should not be
missed because the PCA itself is conventionally considered but a heuristic method for extracting the maximum variance from the data, which is not quite correct. In fact, the PCA can be justified by using a data recovery model, as shown in section 5.1.3, and then extended to clustering.

(b) K-Means is the method of alternating minimization applied to the square error criterion.

(c) Agglomerative and divisive Ward clustering involve somewhat "dual" decompositions of the data scatter. That for divisive clustering is a natural one, treating the scatter's unexplained part as a whole and further decomposing the explained part into items related to the contributions of individual clusters and features. In contrast, the decomposition for agglomerative clustering hides the structure of the explained part, which probably can explain why no specific tools for the interpretation of cluster hierarchies have been proposed before.

(d) By exploiting the additive structure of the data recovery models in the manner following that of the PCA method, one-by-one clustering methods are proposed to allow for effective computational schemes as well as greater flexibility in a controlled environment. In particular, the intelligent version of K-Means, iK-Means, can be used for incomplete clustering with removal of devious or, in contrast, overly normal items, if needed.

(e) Local search algorithms presented lead to provably tight clusters, a fact expressed with the attraction coefficient, a theory-based analogue to popular criteria such as the silhouette width coefficient.

(f) The approach is extended to contingency and flow data by taking into account the property that each entry is a part of the whole. The entries are naturally standardized into the Quetelet coefficients; the corresponding data scatter appears to be equal to the chi-squared contingency coefficient.

(g) The inner product can be used as an equivalent device in the correspondingly changed
criteria, thus leading to similarity measures and clustering criteria, some of which are quite popular and some of which are new, still being similar to those in use.

Interpretation aids

(a) The models provide for using the traditional cluster centroids as indicators of cluster tendencies. Also, more emphasis is put on the standardized, not original, values, thus relating them to the overall norms (averages).

(b) Inner products of individual entities and centroids, not distances between them that are used conventionally, express entity contributions to clusters. This relates to the choice of a cluster representative as well: not by the distance but by the inner product.

(c) The decomposition of the data scatter over clusters and features in table ScaD provides for comparative analysis of the relative contributions of all elements of the cluster structure, especially with Quetelet coefficients (table QScaD), to reveal the greatest of them.

(d) APPCOD, a method for conceptual description of individual clusters based on the table ScaD, can serve as a supplementary or complementary tool to classical decision trees as interpretation and description aids.

(e) In hierarchical classification, the conventional decomposition of the data scatter over splits and clusters has been supplemented with similar decompositions of feature variances, covariances and individual entries; these may be used as aids in the interpretation of the tendencies of hierarchical clusters.

(f) The decomposition of the data scatter over an upper cluster hierarchy can now be visualized with the concept of the box-chart, extending the conventional pie-charts.

Much room remains for further developments and extensions of the data recovery approach.

First, data specifics, such as those that we considered here only for the cases of mixed and contingency data tables, should be taken into account. Consider, for instance, digitized image data. Here entities are
pixels organized in a grid. Conventionally, image data are compressed and processed with techniques exploiting their spatial character with such constructions as quadtrees and wavelets, which are unrelated in the current thinking. The data recovery model with upper cluster hierarchies, used here for developing divisive Ward-like clustering, in fact can be considered as an extension of both quadtrees and wavelets. This may potentially lead to methods of image processing while simultaneously compressing them. Another promising area is applying the data recovery approach to analysis of temporal or spatio-temporal data. An advantage of modeling data with a cluster model is that the temporal trajectory of a cluster centroid can be modeled as a specific, say exponential, function of time. Gene expression data contain measurements of several properties made over the same gene array spots; they should be considered another promising direction.

Second, the models themselves can be much improved beyond the simplest formats employed in the book. These models, in fact, require any data entry to be equal to a corresponding entry in a centroid or a combination of centroids. This can be extended to include "gray" memberships, transformed feature spaces, and logically described clusters. The least squares criterion can be changed for criteria that are less sensitive to data variation. The least moduli criterion is a most obvious alternative. Possibilities of probabilistic modelling should not be discarded either. In the current setting, the Pythagorean decompositions much resemble those used for the analysis of variance of a single variable over various groupings. Related probabilistic models, included in most statistics texts, seem, however, overly rigid and restrictive. Hopefully, cluster analysis may provide a ground for seeking more relevant probabilistic frameworks.

Third, clustering methods can be extended from the purely local search techniques presented in the book. One of the directions is
building better versions with provable approximation estimates. Another direction is applying the evolutionary and multi-agent approaches of computational intelligence.

Bibliography

[1] K. Ali and M. Pazzani (1995) Hydra-mm: Learning multiple descriptions to improve classification accuracy, International Journal on Artificial Intelligence Tools, 4, 115-133.
[2] P. Arabie, L. Hubert, and G. De Soete (Eds.) (1996) Classification and Clustering, Singapore: World Scientific.
[3] S. Baase (1991) Computer Algorithms, Second Edition, Reading, Ma: Addison-Wesley.
[4] K.D. Bailey (1994) Typologies and Taxonomies: An Introduction to Classification Techniques, London: Sage Publications.
[5] Y. Barash and N. Friedman (2002) Context-specific Bayesian clustering for gene expression data, Journal of Computational Biology, 9, 169-191.
[6] J.P. Benzecri (1992) Correspondence Analysis Handbook, New York: Marcel Dekker.
[7] J.C. Bezdek, J.M. Keller, R. Krishnapuram, L.I. Kuncheva, and N.R. Pal (1999) Will the real Iris data please stand up?, IEEE Transactions on Fuzzy Systems, 7, no. 3, 368-369.
[8] H.H. Bock (1996) Probability models and hypothesis testing in partitioning cluster analysis, In P. Arabie, C.D. Carroll, and G. De Soete (Eds.) Clustering and Classification, River Edge, NJ: World Scientific Publishing, 377-453.
[9] H.H. Bock (1999) Clustering and neural network approaches, In W. Gaul and H. Locarek-Junge (Eds.)
Classification in the Information Age, Berlin-Heidelberg: Springer, 42-57.
[10] D.L. Boley (1998) Principal direction divisive partitioning, Data Mining and Knowledge Discovery, 2(4), 325-344.
[11] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone (1984) Classification and Regression Trees, Belmont, Ca: Wadsworth International Group.
[12] T.M. Cover and J.A. Thomas (1991) Elements of Information Theory, Wiley.
[13] L.L. Cavalli-Sforza (2001) Genes, Peoples, and Languages, London: Penguin Books.
[14] Clementine 7.0 User's Guide Package (2003) Chicago: SPSS Inc.
[15] W.W. Cohen (1995) Fast effective rule induction, In Armand Prieditis and Stuart Russell (Eds.) Proc. of the 12th International Conference on Machine Learning, 115-123, Tahoe City, CA, 1995, Morgan Kaufmann.
[16] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman (1990) Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41, 391-407.
[17] G. Der and B.S. Everitt (2001) Handbook of Statistical Analyses Using SAS, Second Edition, CRC Press.
[18] M. Devaney and A. Ram (1997) Efficient feature selection in conceptual clustering, In: Proceedings of 14th International Conference on Machine Learning, 92-97, Morgan Kaufmann.
[19] S. Dolnicar and F. Leisch (2000) Getting more out of binary data: segmenting markets by bagged clustering, Working paper no. 71, Vienna University of Economics and Business Administration, 22 p.
[20] S. Draghici (2003) Data Analysis Tools for DNA Microarrays, Boca Raton/London/New York: Chapman & Hall/CRC.
[21] R.O. Duda and P.E. Hart (1973) Pattern Classification and Scene Analysis, New York: J. Wiley & Sons.
[22] S. Dudoit and J. Fridlyand (2002) A prediction-based resampling method for estimating the number of clusters in a dataset, Genome Biology, 3, n. 7, 1-21.
[23] M.H. Dunham (2003) Data Mining: Introductory and Advanced Topics, Upper Saddle River, NJ: Pearson Education Inc.
[24] S. Dzeroski and N. Lavrac
(Eds.) (2001) Relational Data Mining, Berlin: Springer-Verlag.
[25] B. Efron and R.J. Tibshirani (1993) An Introduction to the Bootstrap, Chapman and Hall.
[26] M. Ester, A. Frommelt, H.-P. Kriegel, and J. Sander (2000) Spatial data mining: Database primitives, algorithms and efficient dbms support, Data Mining and Knowledge Discovery, 4, 193-216.
[27] B.S. Everitt and G. Dunn (2001) Applied Multivariate Data Analysis, London: Arnold.
[28] B.S. Everitt, S. Landau, and M. Leese (2001) Cluster Analysis (4th edition), London: Arnold.
[29] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.) (1996) Advances in Knowledge Discovery and Data Mining, Menlo Park, Ca: AAAI Press/The MIT Press.
[30] J. Felsenstein (1985) Confidence limits on phylogenies: an approach using the bootstrap, Evolution, 39, 783-791.
[31] D.W. Fisher (1987) Knowledge acquisition via incremental conceptual clustering, Machine Learning, 2, 139-172.
[32] K. Florek, J. Lukaszewicz, H. Perkal, H. Steinhaus, and S. Zubrzycki (1951) Sur la liaison et la division des points d'un ensemble fini, Colloquium Mathematicum, 2, 282-285.
[33] E.B. Fowlkes and C.L. Mallows (1983) A method for comparing two hierarchical clusterings, Journal of American Statistical Association, 78, 553-584.
[34] K.R. Gabriel and S. Zamir (1979) Lower rank approximation of matrices by least squares with any choices of weights, Technometrics, 21, 489-498.
[35] A.P. Gasch and M.B. Eisen (2002) Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering, Genome Biology, 3, 11.
[36] M. Girolami (2002) Mercer kernel based clustering in feature space, IEEE Transactions on Neural Networks, 13, 780-784.
[37] R. Gnanadesikan, J.R. Kettenring, and S.L. Tsao (1995) Weighting and selection of variables, Journal of Classification, 12, 113-136.
[38] G.H. Golub and C.F. Van Loan (1989) Matrix Computations, Baltimore: Johns Hopkins University Press.
[39] A.D. Gordon (1999) Classification (2nd edition), Boca Raton: Chapman and Hall/CRC.
[40] J.C. Gower (1967) A comparison of some methods of cluster analysis, Biometrics, 23, 623-637.
[41] J.C. Gower and G.J.S. Ross (1969) Minimum spanning trees and single linkage cluster analysis, Applied Statistics, 18, 54-64.
[42] S.B. Green and N.J. Salkind (2003) Using SPSS for the Windows and Macintosh: Analyzing and Understanding Data (3rd Edition), Prentice Hall.
[43] S. Guha, R. Rastogi, and K. Shim (2000) ROCK: A robust clustering algorithm for categorical attributes, Information Systems, 25, n. 2, 345-366.
[44] J. Han and M. Kamber (2001) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
[45] P. Hansen and N. Mladenovic (2001) J-means: a new local search heuristic for minimum sum-of-squares clustering, Pattern Recognition, 34, 405-413.
[46] J.A. Hartigan (1967) Representation of similarity matrices by trees, Journal of the American Statistical Association, 62, 1140-1158.
[47] J.A. Hartigan (1972) Direct clustering of a data matrix, Journal of the American Statistical Association, 67, 123-129.
[48] J.A. Hartigan (1975) Clustering Algorithms, New York: J. Wiley & Sons.
[49] T. Hastie, R. Tibshirani, M.B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W.C. Chan, D. Botstein, and P. Brown (2000) 'Gene shaving' as a method for identifying distinct sets of genes with similar expression patterns, Genome Biology, 1, http://genomebiology.com/2000/1/2/research/0003/abstract
[50] T. Hastie, R. Tibshirani, and J.H. Friedman (2001) The Elements of Statistical Learning, New York: Springer.
[51] S. Haykin (1999) Neural Networks, 2nd ed., New Jersey: Prentice Hall.
[52] L.J. Heyer, S. Kruglyak, and S. Yooseph (1999) Exploring expression data: Identification and analysis of coexpressed genes, Genome Research, 9, 1106-1115.
[53] M.O. Hill (1979) TWINSPAN: a FORTRAN program for arranging multivariate data in an ordered two-way table by classification of the individuals and attributes, Ecology
and Systematics, Ithaca, NY: Cornell University.
[54] N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar, and N.V. Fedoroff (2000) Fundamental patterns underlying gene expression profiles: Simplicity from complexity, Proceedings of the National Academy of Sciences of the USA, 97, no. 15, 8409-8414.
[55] K.J. Holzinger and H.H. Harman (1941) Factor Analysis, Chicago: University of Chicago Press.
[56] L.J. Hubert and P. Arabie (1985) Comparing partitions, Journal of Classification, 2, 193-218.
[57] P. Jaccard (1908) Nouvelles recherches sur la distribution florale, Bulletin de la Société Vaudoise des Sciences Naturelles, 44, 223-370.
[58] A.K. Jain and R.C. Dubes (1988) Algorithms for Clustering Data, Englewood Cliffs, NJ: Prentice Hall.
[59] A.K. Jain, M.N. Murty, and P.J. Flynn (1999) Data clustering: A review, ACM Computing Surveys, 31, n. 3, 264-323.
[60] C.V. Jawahar, P.K. Biswas, and A.K. Ray (1995) Detection of clusters of distinct geometry: A step toward generalized fuzzy clustering, Pattern Recognition Letters, 16, 1119-1123.
[61] I.T. Jolliffe (1986) Principal Component Analysis, New York: Springer-Verlag.
[62] L. Kaufman and P. Rousseeuw (1990) Finding Groups in Data: An Introduction to Cluster Analysis, New York: J. Wiley & Sons.
[63] M.G. Kendall and A. Stuart (1979) The Advanced Theory of Statistics, 2, 4th ed., New York: Hafner.
[64] M.K. Kerr and G.A. Churchill (2001) Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments, Proceedings of the National Academy of Sciences USA, 98, no. 16, 8961-8965.
[65] H.A.L. Kiers (1997) Weighted least squares fitting using ordinary least squares algorithms, Psychometrika, 62, 251-266.
[66] W. Klosgen (1996) Explora: A multipattern and multistrategy discovery assistant, In [29], 249-271.
[67] T. Kohonen (1995) Self-Organizing Maps, Berlin: Springer-Verlag.
[68] E. Koonin and M. Galperin (2002) Sequence-Evolution-Function: Computational Approaches in Comparative Genomics,
Dordrecht: Kluwer Academic Publishers 69] B Kovalerchuk and E Vityaev (2000) Data Mining in Finance: Advances in Relational and Hybrid Methods, Boston/Dordrecht/London: Kluwer Academic Publishers 70] D.E Krane and M.L Raymer (2003) Fundamental Concepts of Bioinformatics, San Francisco, CA: Pearson Education 71] W Krzanowski and Y Lai (1985) A criterion for determining the number of groups in a dataset using sum of squares clustering, Biometrics, 44, 23-34 72] W.J Krzanowski and F.H.C Marriott (1994) Multivariate Analysis, London: Edward Arnold 73] S Laaksonen (2000) Regression-based nearest neighbour hot decking, Computational Statistics, 15, 65-71 74] G Lako (1990) Women, Fire, and Dangerous Things: What Categories Reveal About the Mind, Chicago: University of Chicago Press 75] G.N Lance and W.T Williams (1967) A general theory of classi catory sorting strategies: Hierarchical Systems, The Computer Journal, 9, 373-380 76] M.H.Law, A.K.Jain, and M.A.T Figueirido (2003) Feature selection in mixture-based clustering, Advances in Neural Information Processing Systems, 15 77] L Lebart, A Morineau, and M Piron (1995) Statistique Exploratoire Multidimensionnelle, Paris: Dunod 78] E Levine and E Domany (2001) Resampling method for unsupervised estimation of cluster validity, Neural Computation, 13, 2573-2593 79] R.J.A Little and D.B Rubin (1987) Statistical Analysis with Missing Data, J Wiley & Sons 80] T Margush and F.R McMorris (1981) Consensus n-trees, Bulletin of Mathematical Biology, 43, 239-244 © 2005 by Taylor & Francis Group, LLC BIBLIOGRAPHY 255 81] R.M McIntyre and R.K Blash ed (1980) A nearest-centroid technique for evaluating the minimum variance clustering procedure, Multivariate Behavioral Research, 22, 225-238 82] G McLachlan and K Basford (1988) Mixture Models: Inference and Applications to Clustering, New York: Marcel Dekker 83] J.B MacQueen (1967) Some methods for classi cation and analysis of multivariate observations, L Lecam and J Neymen (Eds.) 
Proceedings of 5th Berkeley Symposium, 2, 281-297, University of California Press, Berkeley 84] G.W Milligan (1981) A Monte-Carlo study of thirty internal criterion measures for cluster analysis, Psychometrika, 46, 187-199 85] G.W Milligan (1989) A validation study of a variable weighting algorithm for cluster analysis, Journal of Classi cation, 6, 53-71 86] G.W Milligan and M.C Cooper (1985) An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 159-179 87] G.W Milligan and M.C Cooper (1988) A study of standardization of the variables in cluster analysis, Journal of Classi cation, 5, 181-204 88] B Mirkin (1987) Additive clustering and qualitative factor analysis methods for similarity matrices, Journal of Classi cation, 4, 7-31 Erratum (1989), 6, 271-272 89] B Mirkin (1990) Sequential tting procedures for linear data aggregation model, Journal of Classi cation, 7, 167-195 90] B Mirkin (1996) Mathematical Classi cation and Clustering, Dordrecht: Kluwer Academic Press 91] B Mirkin (1997) L1 and L2 approximation clustering for mixed data: scatter decompositions and algorithms, in Y Dodge (Ed.) L1 -Statistical Procedures and Related Topics, Hayward, Ca.: Institute of Mathematical Statistics (Lecture Notes-Monograph Series), 473-486 92] B Mirkin (1999) Concept learning and feature selection based on squareerror clustering, Machine Learning, 35, 25-40 93] B Mirkin (2001) Eleven ways to look at the chi-squared coe cient for contingency tables, The American Statistician, 55, no 2, 111-120 94] B Mirkin (2001) Reinterpreting the category utility function, Machine Learning, 45, 219-228 © 2005 by Taylor & Francis Group, LLC 256 CLUSTERING FOR DATA MINING 95] B Mirkin and R Brooks (2003) A tool for comprehensively describing class-based data sets, In J.M Rossiter and T Martin (Eds.) 
Proceedings of 2003 UK Workshop on Computational Intelligence, University of Bristol, UK, 149-156 96] B Mirkin and E Koonin (2003) A top-down method for building genome classi cation trees with linear binary hierarchies, In M Janowitz, J.-F Lapointe, F McMorris, B Mirkin, and F Roberts (Eds.) Bioconsensus, DIMACS Series, V 61, Providence: AMS, 97-112 97] B Mirkin, M Levin, and E Bakaleinik (2002) Intelligent K-Means clustering in analysis of newspaper articles on bribing, Intelligent Systems, Donetsk, Ukraine, n 2, 224-230 98] B Mirkin and I Muchnik (2002) Layered clusters of tightness set functions, Applied Mathematics Letters, 15, 147-151 99] M Mitchell (1998) An Introduction to Genetic Algorithms (Complex Adaptive Systems), MIT Press 100] D.S Modha and W.S Spangler (2003) Feature weighting in k-means clustering, Machine Learning, 52, 217-237 101] S Monti, P Tamayo, J Mesirov, and T Golub (2003) Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data, Machine Learning, 52, 91-118 102] J Mullat, Extremal subsystems of monotone systems: I, II, Automation and Remote Control, 37, 758-766, 1286-1294 (1976) 103] F Murtagh (1985) Multidimensional Clustering Algorithms, Heidelberg: Physica-Verlag 104] S Nascimento, B Mirkin, and F Moura-Pires (2003) Modeling proportional membership in fuzzy clustering, IEEE Transactions on Fuzzy Systems, 11, no 2, 173-186 105] Nei, M and Kumar, S (2000) Molecular Evolution and Phylogenetics, Oxford University Press 106] K Pearson (1901) On lines and planes of closest to systems of points in space, The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, Sixth Series 2, 559-572 107] M Perkowitz and O Etzioni (2000) Towards adaptive Web sites: Conceptual framework and case study, Arti cial Intelligence, 118, 245-275 © 2005 by Taylor & Francis Group, LLC BIBLIOGRAPHY 257 108] S.M Perlmutter, P.C Cosman, C.-W Tseng, R.A Olshen, R.M Gray, K.C.P.Li, and C.J 
Bergin (1998) Medical image compression and vector quantization, Statistical Science, 13, 30-53 109] K.S Pollard and M.J van der Laan (2002) A method to identify signi cant clusters in gene expression data, U.C Berkeley Division of Biostatistics Working Paper Series, 107 110] J Qin, D.P Lewis, and W.S Noble (2003) Kernel hierarchical gene clustering from microarray expression data, Bioinformatics, 19, 2097-2104 111] J.R Quinlan (1993) C4.5: Programs for Machine Learning, San Mateo: Morgan Kaufmann 112] W.M Rand (1971) Objective criteria for the evaluation of clustering methods, Journal of American Statistical Association, 66, 846-850 113] Repository of databases for machine learning and data mining, Urvine, UCL 114] R.J Roiger and M.W Geatz (2003) Data Mining: A Tutorial-Based Primer, Addison Wesley, Pearson Education, Inc 115] Romesburg, C.H (1984) Cluster Analysis for Researchers, Belmont, Ca: Lifetime Learning Applications Reproduced by Lulu Press, North Carolina, 2004 116] G Salton (1989) Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley 117] S L Salzberg (1998) Decision trees and Markov chains for gene nding, In S.L Salzberg, D.B Searls & S Kasif (Eds.) 
Computational Methods in Molecular Biology, 187-203, Amsterdam, Elsevier Science B.V 118] J.L Schafer (1997) Analysis of Incomplete Multivariate Data, Chapman and Hall 119] SAS/ETS User's Guide, Version (2000) Volumes and 2, SAS Publishing 120] C Seidman (2001) Data Mining with Microsoft SQL Server 2000 Technical Reference, Microsoft Corporation 121] R.N Shepard and P Arabie (1979) Additive clustering: representation of similarities as combinations of overlapping properties, Psychological Review, 86, 87-123 © 2005 by Taylor & Francis Group, LLC 258 CLUSTERING FOR DATA MINING 122] P.H.A Sneath and R.R Sokal (1973) Numerical Taxonomy, San Francisco: W.H Freeman 123] M Sonka, V Hlavac, and R Boyle (1999) Image Processing, Analysis and Machine Vision, Paci c Grove, Ca: Brooks/Cole Publishing Company 124] J.A Sonquist, E.L Baker, and J.N Morgan (1973) Searching for Structure, Institute for Social Research, Ann Arbor: University of Michigan 125] R Spence (2001) Information Visualization, ACM Press/Addison-Wesley 126] R Srikant, Q Vu, and R Agraval (1997) Mining association rules with the item constraints, in (Eds D Heckerman, H Manilla, D Pregibon & R Uthrusamy) Proceedings of Third International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA, AAAI Press, 67-73 127] M Steinbach, G Karypis, and V Kumar (2000) A comparison of document clustering techniques, In Proc Workshop on Text Mining, 6th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, wwwusers.cs.umn.edu/ karypis/ publications/Papers/PDF/doccluster.pdf 128] C.A Sugar and G.M James (2003) Finding the number of clusters in a data set: An information-theoretic approach, Journal of the American Statistical Association, 98, n 463, 750-778 129] R Tibshirani, G Walther, and T Hastie (2001) Estimating the number of clusters in a dataset via the Gap statistics, Journal of the Royal Statistical Society B, 63, 411-423 130] O Troyanskaya, M Cantor, G Sherlock, P Brown, T Hastie, R Hastie, R 
Tibshirani, D Botsein, and R.B Altman (2001) Missing value estimation methods for DNA microarrays, Bioinformatics 17, 520-525 131] R.C Tryon (1939) Cluster Analysis, Ann Arbor: Edwards Bros 132] C Tsallis, R.S Mendes, and A.R Plastino (1998) The role of constraints within generalized nonextensive statistics, Physica A, 261, 534-554 133] M A Turk and A P Pentland (1991) Eigenfaces for recognition, Journal of Cognitive Neuroscience, 3, 71-96 134] Vivisimo Document Clustering Engine (2003) http://vivisimo.com 135] J.H Ward, Jr (1963) Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 236-244 136] I Wasito and B Mirkin (2005) Nearest neighbour approach in the leastsquares data imputation algorithms, Information Systems, 169, 1-25 © 2005 by Taylor & Francis Group, LLC BIBLIOGRAPHY 259 137] A Webb (2002) Statistical Pattern Recognition, Chichester, England: J Wiley & Sons 138] A Weingessel, E Dimitriadou, and S Dolnicar (1999) An examination of indexes for determining the number of clusters in binary data sets, Working Paper No 29, Vienna University of Economics, Wien, Austria 139] S.M Weiss, N Indurkhya, T Zhang, and F.J Damerau (2005) Text Mining: Predictive Methods for Analyzing Unstructured Information, Springer Science+Business Media 140] D Wishart (1999) The ClustanGraphics Primer, Edinburgh: Clustan Limited 141] K.Y Yeung, C Fraley, A Murua, A.E Raftery, and W.L Ruzzo (2001) Model-based clustering and data transformations for gene expression data, Bioinformatics, 17, no 10, 977-987 142] S Zhong and J Ghosh (2003) A uni ed framework for model-based clustering, Journal of Machine Learning Research, 4, 1001-1037 © 2005 by Taylor & Francis Group, LLC ... Sankar K Pal and Pabitra Mitra Exploratory Data Analysis with MATLAB® Wendy L Martinez and Angel R Martinez Clustering for Data Mining: A Data Recovery Approach Boris Mirkin Correspondence Analysis... 
the data that are explained by clusters can be separated from those that are not. The data recovery approach is common in more traditional data mining and statistics areas such as regression, analysis... original data back from the ideal format. This is the idea of the data recovery approach: not only use data for finding clusters but also use clusters for recovering the data. In a general situation,
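
The recovery idea in the excerpt above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the book's own algorithm: each cluster is assumed to be represented by its centroid (the within-cluster mean), the "recovered" row for an entity is taken to be the centroid of its cluster, and fit is measured by sums of squares. The function names `cluster_means` and `scatter_decomposition` and the toy data are hypothetical.

```python
def cluster_means(data, labels):
    """Return {label: centroid}, where a centroid is the within-cluster mean."""
    sums, counts = {}, {}
    for row, lab in zip(data, labels):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def scatter_decomposition(data, labels):
    """Split the total sum of squares into explained and unexplained parts.

    Assumption: the data are "recovered" by replacing every row with the
    centroid of the cluster it belongs to.
    """
    centroids = cluster_means(data, labels)
    recovered = [centroids[lab] for lab in labels]

    # Total scatter of the original data.
    total = sum(v * v for row in data for v in row)
    # Scatter carried by the recovered (centroid) rows.
    explained = sum(v * v for row in recovered for v in row)
    # Scatter left in the residuals between original and recovered rows.
    unexplained = sum(
        (y - c) ** 2
        for row, rec in zip(data, recovered)
        for y, c in zip(row, rec)
    )
    return total, explained, unexplained


# Hypothetical toy data: four entities, two features, two clusters.
data = [[1.0, 2.0], [1.5, 1.8], [-1.0, -2.2], [-1.5, -1.6]]
labels = [0, 0, 1, 1]

total, explained, unexplained = scatter_decomposition(data, labels)
print(total, explained, unexplained)
```

When centroids are within-cluster means, the total sum of squares equals the explained part plus the unexplained part, which is one way to read the statement that the data explained by clusters can be separated from those that are not.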

Table of contents

cover

Clustering for Data Mining: A Data Recovery Approach

Preface

Acknowledgments

Author

List of Denotations

Introduction: Historical Remarks

Contents

Chapter 1 What Is Clustering

Base words

1.1 Exemplary problems

1.1.1 Structuring

Market towns

Primates and Human origin

Gene presence-absence profiles

1.1.2 Description

Describing Iris genera

Body mass

1.1.3 Association

Digits and patterns of confusion between them

Literary masterpieces

1.1.4 Generalization

1.1.5 Visualization of data structure

One-dimensional data

One-dimensional data within groups

Two-dimensional display

Block structure
