Constrained Clustering: Advances in Algorithms, Theory, and Applications

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota, Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. The series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES
UNDERSTANDING COMPLEX DATASETS: Data Mining with Matrix Decompositions, David Skillicorn
COMPUTATIONAL METHODS OF FEATURE SELECTION, Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: Advances in Algorithms, Theory, and Applications, Sugato Basu, Ian Davidson, and Kiri L. Wagstaff

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

Constrained Clustering: Advances in Algorithms, Theory, and Applications
Edited by Sugato Basu, Ian Davidson, and Kiri L. Wagstaff

The cover image shows the result of clustering a hyperspectral image of Mars using soft constraints to impose spatial contiguity on cluster assignments. The data set was collected by the Space Telescope Imaging Spectrograph (STIS) on the Hubble Space Telescope. The image is reproduced with permission from Intelligent Clustering with Instance-Level Constraints by Kiri Wagstaff.

Chapman & Hall/CRC, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

© 2009 by Taylor & Francis Group, LLC. Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business.

No claim to original U.S. Government works.
Printed in the United States of America on acid-free paper.
International Standard Book Number-13: 978-1-58488-996-0 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify it in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Constrained clustering : advances in algorithms, theory, and applications / editors, Sugato Basu, Ian Davidson, Kiri Wagstaff.
p. cm. (Chapman & Hall/CRC data mining and knowledge discovery series)
Includes bibliographical references and index.
ISBN 978-1-58488-996-0 (hardback : alk. paper)
Cluster analysis. Data processing. Data mining. Computer algorithms. I. Basu, Sugato. II. Davidson, Ian, 1971-. III. Wagstaff, Kiri. IV. Title. V. Series.
QA278.C63 2008
519.5'3 dc22
2008014590

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

Thanks to my family, friends and colleagues, especially Joulia, Constance and Ravi.
– Ian

I would like to dedicate this book to all of the friends and colleagues who've encouraged me and engaged in idea-swapping sessions, both about constrained clustering and other topics. Thank you for all of your feedback and insights!
– Kiri

Dedicated to my family for their love and encouragement, with special thanks to my wife Shalini for her constant love and support.
– Sugato

Foreword

In 1962 Richard Hamming wrote, "The purpose of computation is insight, not numbers." But it was not until 1977 that John Tukey formalized the field of exploratory data analysis. Since then, analysts have been seeking techniques that give them better understanding of their data. For one- and two-dimensional data, we can start with a simple histogram or scatter plot. Our eyes are good at spotting patterns in a two-dimensional plot. But for more complex data we fall victim to the curse of dimensionality; we need more complex tools because our unaided eyes can't pick out patterns in thousand-dimensional data.

Clustering algorithms pick up where our eyes leave off: they can take data with any number of dimensions and cluster them into subsets such that each member of a subset is near the other members in some sense. For example, if we are attempting to cluster movies, everyone would agree that Sleepless in Seattle should be placed near (and therefore in the same cluster as) You've Got Mail. They're both romantic comedies, they've got the same director (Nora Ephron), the same stars (Tom Hanks and Meg Ryan), and they both involve falling in love over a vast electronic communication network. They're practically the same movie. But what about comparing Charlie and the Chocolate Factory with A Nightmare on Elm Street?
On most dimensions, these films are near opposites, and thus should not appear in the same cluster. But if you're a Johnny Depp completist, you know he appears in both, and this one factor will cause you to cluster them together.

Other books have covered the vast array of algorithms for fully-automatic clustering of multi-dimensional data. This book explains how the Johnny Depp completist, or any analyst, can communicate his or her preferences to an automatic clustering algorithm, so that the patterns that emerge make sense to the analyst; so that they yield insight, not just clusters.

How can the analyst communicate with the algorithm? In the first five chapters, it is by specifying constraints of the form "these two examples should (or should not) go together." In the chapters that follow, the analyst gains vocabulary, and can talk about a taxonomy of categories (such as romantic comedy or Johnny Depp movie), can talk about the size of the desired clusters, can talk about how examples are related to each other, and can ask for a clustering that is different from the last one.

Of course, there is a lot of theory in the basics of clustering, and in the refinements of constrained clustering, and this book covers the theory well. But theory would have no purpose without practice, and this book shows how constrained clustering can be used to tackle large problems involving textual, relational, and even video data. After reading this book, you will have the tools to be a better analyst, to gain more insight from your data, whether it be textual, audio, video, relational, genomic, or anything else.

Dr. Peter Norvig
Director of Research, Google, Inc.
December 2007
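The pairwise constraints that the foreword describes are simple to represent in practice. As a concrete illustration, here is a minimal sketch, in the spirit of the COP-KMEANS algorithm of Wagstaff et al., of how must-link and cannot-link pairs can be checked during the assignment step of k-means; the function and variable names are illustrative and are not taken from the book.

```python
# A minimal sketch of instance-level constraints in a COP-KMEANS-style
# assignment step. Names are illustrative, not the book's code.
import numpy as np

def violates_constraints(i, cluster, assignments, must_link, cannot_link):
    """True if assigning point i to `cluster` breaks a constraint."""
    for a, b in must_link:
        # A must-link pair is violated when its two points end up
        # in different clusters.
        if i == a and assignments.get(b, cluster) != cluster:
            return True
        if i == b and assignments.get(a, cluster) != cluster:
            return True
    for a, b in cannot_link:
        # A cannot-link pair is violated when both points land
        # in the same cluster.
        if i == a and assignments.get(b) == cluster:
            return True
        if i == b and assignments.get(a) == cluster:
            return True
    return False

def assign_points(X, centers, must_link, cannot_link):
    """Greedily assign each point to its nearest feasible center."""
    assignments = {}
    for i, x in enumerate(X):
        order = np.argsort([np.linalg.norm(x - c) for c in centers])
        for k in order:
            if not violates_constraints(i, int(k), assignments,
                                        must_link, cannot_link):
                assignments[i] = int(k)
                break
        else:
            # No center satisfies the constraints for this point.
            raise ValueError(f"no feasible cluster for point {i}")
    return assignments
```

Because points are assigned greedily one at a time, a run can fail even when a feasible clustering exists; questions of feasibility and intractability under such constraints are taken up in the theoretical chapters of this book.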
Editor Biographies

Sugato Basu is a senior research scientist at Google, Inc. His areas of research expertise include machine learning, data mining, information retrieval, statistical pattern recognition and optimization, with special emphasis on scalable algorithm design and analysis for large text corpora and social networks. He obtained his Ph.D. in machine learning from the computer science department of the University of Texas at Austin. His Ph.D. work on designing novel constrained clustering algorithms, using probabilistic models for incorporating prior domain knowledge into clustering, won him the Best Research Paper Award at KDD in 2004 and the Distinguished Student Paper award at ICML in 2005. He has served on multiple conference program committees, journal review committees and NSF panels in machine learning and data mining, and has given several invited tutorials and talks on constrained clustering. He has written conference papers, journal papers, book chapters, and encyclopedia articles in a variety of research areas including clustering, semi-supervised learning, record linkage, social search and routing, rule mining and optimization.

Ian Davidson is an assistant professor of computer science at the University of California at Davis. His research areas are data mining, artificial intelligence and machine learning, in particular focusing on formulating novel problems and applying rigorous mathematical techniques to address them. His contributions to the area of clustering with constraints include proofs of intractability for both batch and incremental versions of the problem and the use of constraints with both agglomerative and non-hierarchical clustering algorithms. He is the recipient of an NSF CAREER Award on Knowledge Enhanced Clustering and has won Best Paper Awards at the SIAM and IEEE data mining conferences. Along with Dr. Basu he has given tutorials on clustering with constraints at several leading data mining conferences and has served on over 30 program committees for conferences in his research fields.

Kiri L. Wagstaff is a senior researcher at the Jet Propulsion Laboratory in Pasadena, CA. Her focus is on developing new machine learning methods, particularly those that can be used for data analysis onboard spacecraft, enabling missions with higher capability and autonomy. Her Ph.D. dissertation, "Intelligent Clustering with Instance-Level Constraints," initiated work in the machine learning community on constrained clustering methods. She has developed additional techniques for analyzing data collected by instruments on the EO-1 Earth Orbiter, Mars Pathfinder, and Mars Odyssey. The applications range from detecting dust storms on Mars to predicting crop yield on Earth.

Learning with Pairwise Constraints for Video Object Classification (concluding section)

The algorithms were derived by plugging in the hinge loss and the logistic loss functions. The experiments with two surveillance video data sets demonstrated that the proposed pairwise learning algorithms could achieve considerably improved performance with pairwise constraints, compared to the baseline classifier, which uses labeled data alone and a majority voting scheme. The proposed algorithms also outperformed the RCA algorithm and the Gaussian mixture model with constraints when the same number of pairwise constraints is used. A comparison among the proposed algorithms showed that the algorithms with non-convex loss functions could achieve a higher classification accuracy, but the algorithms with convex loss functions are more efficient and robust. Finally, we also evaluated the performance of weighted pairwise kernel logistic regression algorithms using noisy pairwise constraints provided by human feedback. The results showed that the weighted algorithms can achieve higher accuracy than their non-weighted counterparts.

In this work, we mainly focused on developing new pairwise learning algorithms and leave the exploration of more advanced visual features to future research. We also want to point out that although our learning framework and previous work on learning distance metrics exploit the pairwise constraints in different ways, they can be complementary to each other. For example, it is possible to apply the proposed learning framework with a new distance metric learned from other algorithms.
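The conclusion above describes a family of algorithms obtained by plugging different surrogate losses into one pairwise learning framework. As a minimal sketch of that recipe (assuming a real-valued two-class discriminant f; this is not the chapter's exact convex formulation), the agreement of f on a constrained pair can be scored with a pair margin and penalized with either the hinge or the logistic surrogate:

```python
# Sketch of the "plug in a loss" recipe for pairwise constraints.
# y = +1 for a must-link pair, y = -1 for a cannot-link pair.
import numpy as np

def hinge_loss(m):
    # Hinge surrogate, as in support vector machines.
    return np.maximum(0.0, 1.0 - m)

def logistic_loss(m):
    # Logistic surrogate, as in (kernel) logistic regression,
    # computed stably as log(1 + exp(-m)).
    return np.logaddexp(0.0, -m)

def pairwise_loss(f, x_i, x_j, y, surrogate=hinge_loss):
    # The pair margin is positive when f places x_i and x_j on the
    # same side (what a must-link wants) and, after the sign flip
    # from y = -1, positive when f separates a cannot-link pair.
    m = y * f(x_i) * f(x_j)
    return surrogate(m)
```

Because the pair margin contains the product f(x_i) f(x_j), this direct form is non-convex in f, which mirrors the trade-off reported above: the non-convex variants can reach higher classification accuracy, while the convex relaxations are more efficient and robust.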

Ngày đăng: 29/08/2020, 22:06

Từ khóa liên quan

Mục lục

  • cover.pdf

  • page_r01.pdf

  • page_r02.pdf

  • page_r03.pdf

  • page_r04.pdf

  • page_r05.pdf

  • page_r06.pdf

  • page_r07.pdf

  • page_r08.pdf

  • page_r09.pdf

  • page_r10.pdf

  • page_r11.pdf

  • page_r12.pdf

  • page_r13.pdf

  • page_r14.pdf

  • page_r15.pdf

  • page_r16.pdf

  • page_r17.pdf

  • page_r18.pdf

  • page_r19.pdf

Tài liệu cùng người dùng

  • Đang cập nhật ...

Tài liệu liên quan