Fuzzy Clustering with Hedge Algebra and Applications

SUMMARY OF THE MASTER'S THESIS

Title: Fuzzy Clustering with Hedge Algebra and Applications
Author: Đinh Khắc Đông, Class of 2009
Supervisor: Assoc. Prof. Dr. Trần Đình Khang

Summary

a) Motivation

Clustering methods have been studied extensively since their introduction and have produced many results in fields such as decision making, machine learning, and data mining. Research has nevertheless revealed limitations of these algorithms that affect clustering quality. Meanwhile, the study of the natural structure of linguistic domains by the group of authors led by Prof. Nguyễn Cát Hồ, with its fuzziness measure and semantic quantifying mapping, has become powerful enough to supply a metric for the fuzzy clustering problem, and it promises improvements to the traditional FCM algorithm. Following the author's own investigation of the feasibility of the topic, and under the guidance of Assoc. Prof. Dr. Trần Đình Khang of the School of Information and Communication Technology, Hanoi University of Technology, the author chose "Fuzzy Clustering with Hedge Algebra and Applications" as the subject of this Master's thesis.

b) Purpose, subject, and scope of the research

Among some 60,000 studies on clustering, one notable class of algorithms is fuzzy clustering, which allows a data point to belong to several clusters with different membership grades. Later studies, however, pointed out the difficulties that arise when the fuzzy clustering algorithm Fuzzy C-Means (FCM) is applied to practical problems. Starting from an analysis of the limitations of traditional FCM, the thesis proposes an improvement that uses the structure of hedge algebra (HA). Within the scope of this Master's thesis, the class of hedge algebras studied is the symmetrical linear hedge algebras.

c) Main content and contributions

After studying the traditional FCM algorithm and analyzing its limitations, the thesis proposes an improvement that uses the linguistic structure of HA. With this method, the thesis makes the following contributions:

• First, the linguistic structure of the truth variable is used to modify the distance from data points to cluster centers. Specifically, the fuzziness measures of linguistic values in the HA structure act as weights when computing the distance between a data point and the corresponding cluster center. As a result, the influence of each data point on the cluster-center update becomes more varied and depends on the HA parameters.
• Second, in the cluster-center update, only points whose membership grade is greater than or equal to the neutral element of the HA structure are considered to belong to the cluster and are used to compute the cluster center. Consequently, when the data contain noise or outliers, their effect on the cluster centers is reduced.
• Finally, the HA parameters are used as training parameters in supervised learning methods to obtain better clustering results.

d) Research method

First, the traditional Fuzzy C-means algorithm is studied to identify the problems it faces in theory and in implementation. Next, the linguistic structure of the truth variable is studied in symmetrical linear hedge algebras. The effect of the fuzziness measures of hedges on the fuzziness measures of linguistic values is examined through the fuzziness measure function and the semantic quantifying mapping. From this, improvements to FCM are proposed to overcome its weaknesses. After the improved algorithm, Fuzzy Clustering with Hedge Algebra, is introduced, a proposition on the uniqueness of the weight assigned to each pattern and a proposition on the generality of the algorithm are presented, establishing the theoretical validity of the algorithm. Finally, the algorithm is tested on two problems, clustering artificial data and clustering real multidimensional data, to confirm that it is practical for real-world problems.

e) Conclusion

The thesis studies the FCM algorithm to identify its limitations in clustering problems. Based on the study of symmetrical linear HA, it then proposes an improved FCM algorithm that assigns weights to the patterns in the input space. These weights are built from the distances from the patterns to the cluster centers. In addition, the neutral element of the HA structure is used as a threshold in the cluster-center update. With this approach, better results are obtained by optimizing the HA parameters. The thesis also shows that the proposed algorithm is general and that the traditional FCM algorithm is a special case of it. Several experimental applications demonstrate the applicability of the improved algorithm.

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF TECHNOLOGY

Đinh Khắc Đông

FUZZY CLUSTERING WITH HEDGE ALGEBRA AND APPLICATIONS

MASTER OF SCIENCE THESIS
INFORMATION TECHNOLOGY

SCIENTIFIC SUPERVISOR: Assoc. Prof. Dr. Trần Đình Khang

Hanoi, 2011

Acknowledgements

I would like to express my deep gratitude to my teacher, Assoc. Prof. Dr. Trần Đình Khang, Deputy Director of the School of Information and Communication Technology, Hanoi University of Technology. Under his strict guidance and rigorous approach to scientific research, I have matured a great deal while writing this thesis.

I sincerely thank MSc. Phan Anh Phong, lecturer at the Faculty of Information Technology, Vinh University, Nghệ An, and PhD candidate at Hanoi University of Technology, for his professional contributions to the thesis.

I also sincerely thank the teachers of the School of Information and Communication Technology, Hanoi University of Technology, for the valuable knowledge they passed on to me throughout my Master's studies.

Finally, I send my loving thanks to my parents, my brother, my friends, and especially Bùi Thu Trang, for their care and encouragement throughout my studies and while I completed this thesis far from home.

Đinh Khắc Đông
Hanoi, 2011

Declaration

I, Đinh Khắc Đông, a Master's student in the Information Technology class of 2009, declare that the content of this Master's thesis is my own research work, carried out under the supervision of Assoc. Prof. Dr. Trần Đình Khang. I declare that the results in the thesis are entirely truthful. All figures and knowledge reused from other sources are fully and clearly referenced to the documents listed at the end of the thesis.

Đinh Khắc Đông
Hanoi, 2011

Contents

Acknowledgements
Declaration
Contents
List of symbols and abbreviations
List of tables
List of figures
1 Introduction
  1.1 Motivation
  1.2 Research history
  1.3 Purpose, subject, and scope of the research
  1.4 Main arguments and contributions
  1.5 Research method
  1.6 Organization of the thesis
2 Data clustering
  2.1 Problem statement
  2.2 The data clustering problem
  2.3 Definitions and notation
  2.4 Pattern representation and feature extraction
  2.5 Similarity measures
  2.6 Clustering techniques
    2.6.1 Hierarchical clustering algorithms
    2.6.2 Partitional algorithms
    2.6.3 Fuzzy clustering
    2.6.4 Clustering with artificial neural networks
    2.6.5 Clustering with evolutionary algorithms
  2.7 Comparison of data clustering methods
  2.8 Applications of data clustering
3 Hedge algebra
  3.1 Overview of hedge algebra
  3.2 Sign function
  3.3 Fuzziness measure
    3.3.1 Definition of the fuzziness measure
    3.3.2 Properties of the fuzziness measure
  3.4 Semantic quantifying mapping
  3.5 Applications of hedge algebra
    3.5.1 Constructing type-2 fuzzy sets with hedge algebra
    3.5.2 Approximate reasoning based on hedge algebra
4 Fuzzy clustering with hedge algebra
  4.1 Problem statement
  4.2 The notion of linguistic cluster center
  4.3 Weight determination algorithm
  4.4 Fuzzy clustering algorithm with hedge algebra (HAFCM)
  4.5 Optimizing the hedge algebra parameters
5 Applications of the fuzzy clustering algorithm with HA
  5.1 Genetic algorithm
    5.1.1 Encoding
    5.1.2 Selection operator
    5.1.3 Crossover operator
    5.1.4 Mutation operator
    5.1.5 Fitness function
    5.1.6 Other components
  5.2 Application to clustering artificial data
  5.3 Application to clustering real multidimensional data
  5.4 Results and discussion
6 Conclusion
  6.1 Contributions of the thesis
  6.2 Open issues
  6.3 Future work
References
Appendix

List of symbols and abbreviations

ANN    Artificial Neural Network
ES     Evolution Strategy
EP     Evolutionary Programming
GK     Gustafson-Kessel
HA     Hedge Algebra
HAFCM  Hedge Algebraic Fuzzy C-Means
HCM    Hard C-Means
FCM    Fuzzy C-Means
FCV    Fuzzy C-Varieties
FLS    Fuzzy Logic System
GA     Genetic Algorithm
SA     Search-based Algorithm
SOM    Self-Organizing Map

Chapter 5. Applications of the fuzzy clustering algorithm with HA

HAFCM gives wrong results for only 2.86% of the patterns. Its rate of correct clustering is the highest, at 97.14%.

• In the clustering problem with the IRIS dataset, 75% of the data is used to learn the HA parameters. Once these parameters are obtained, the remaining 25% of the data is used to test the model and report the recognition rate. Figure 5.6 shows that the recognition rate of HAFCM is higher than that of FCM across the values of the fuzzifier m.

Chapter 6. Conclusion

"The Milky Way is nothing but countless stars arranged in clusters" - Galileo Galilei

6.1 Contributions of the thesis

The thesis presents a study of the traditional FCM algorithm and the basics of hedge algebra. On that basis, it analyzes the limitations of FCM and proposes an improvement that uses the linguistic structure of hedge algebra. With this method, the thesis makes the following contributions [30]:

• First, the linguistic structure of the truth variable is used to modify the distance from data points to cluster centers. Specifically, the fuzziness measures of linguistic values in the HA structure act as weights when computing the distance between a data point and the corresponding cluster center. As a result, the influence of each data point on the cluster-center update becomes more varied and depends on the HA parameters.
• In the traditional FCM algorithm, all patterns influence the cluster-center update. This makes FCM sensitive to outliers and degrades the quality of the cluster centers. The second contribution of the thesis is that, in the cluster-center update, only points whose membership grade is greater than or equal to the neutral element of the HA structure are considered to belong to the cluster, and only these points take part in computing the cluster center. Consequently, when the data contain noise or outliers, their effect on the cluster centers is reduced.
• When the fuzziness measures of linguistic values are used as weights in computing the distance from patterns to cluster centers, the weights change with the HA parameters. These parameters are used as training parameters in supervised learning methods to obtain better results with the fuzzy clustering algorithm with HA.

6.2 Open issues

The thesis has developed a fuzzy clustering algorithm based on hedge algebra and two illustrative applications with good results. However, the following issues remain:

• The algorithm focuses on changing the distances between points and cluster centers and does not indicate how to determine the fuzzifier m of FCM.
• The number of clusters must be known in advance as an input to the algorithm. Prior knowledge of the data is therefore needed to choose a suitable number of clusters.
• The thesis does not establish the relationship between the level k of the truth values and the effectiveness of the algorithm.

6.3 Future work

Fuzzy clustering has been studied for a long time, with many improvements in different directions. Based on the issues that remain open, the thesis proposes the following directions for further research:

• Study and develop a method for determining the optimal number of clusters from the input data. This would reduce the prior knowledge required to perform clustering.
• Implement and test the algorithm on more datasets with larger numbers of clusters, more attributes, and larger data volumes to assess the effectiveness of HAFCM.

References

[1] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, 31(3), 1999.
[2] B. Thomas and G. Raju. A novel fuzzy
clustering method for outlier detection in data mining. International Journal of Recent Trends in Engineering, 1(2), May 2009.
[3] Ball, G. H. and Hall, D. J. 1965. ISODATA, a novel method of data analysis and classification. Tech. Rep., Stanford University, Stanford, CA.
[4] Bobrowski, L. and Bezdek, J. (1991). c-Means clustering with the l1 and l∞ norms. IEEE Transactions on Systems, Man, and Cybernetics, 21(3): 545–554.
[5] C. Hwang and F. C.-H. Rhee. An interval type-2 fuzzy c spherical shells algorithm. In Proc. IEEE-FUZZ Conference, pages 1117-1122, Hungary, 2004.
[6] C. Hwang and F. C.-H. Rhee. Uncertainty fuzzy clustering: Interval type-2 fuzzy approach to c-means. IEEE Transactions on Fuzzy Systems, 15(1):107-120, February 2007.
[7] Cannon, R., Dave, J., and Bezdek, J. "Efficient implementation of the fuzzy c-means clustering algorithms." IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8: 248–255.
[8] Carpenter, G. and Grossberg, S. 1990. ART3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks 3, 129–152.
[9] Cheng, T., Goldgof, D., and Hall, L. (1998). "Fast fuzzy clustering." Fuzzy Sets and Systems, 93: 49–56.
[10] D. E. Gustafson and W. C. Kessel. Fuzzy clustering with a fuzzy covariance matrix. In Proc. IEEE Conf. Decision Control, pages 761-766, San Diego, CA, 1979.
[11] Diday, E. and Simon, J. C. 1976. Clustering analysis. In Digital Pattern Recognition, K. S. Fu, Ed. Springer-Verlag, Secaucus, NJ, 47–94.
[12] Dunn, J. "A fuzzy relative of the ISODATA process and its use in detecting compact well separated clusters." Journal of Cybernetics, 3(3): 32–57.
[13] E. Cox. Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration. Elsevier, 2005.
[14] Fogel, L. J., Owens, A. J., and Walsh, M. J. 1965. Artificial Intelligence Through Simulated Evolution. John Wiley and Sons, Inc., New York, NY.
[15] Hansen, P. and Jaumard, B. (1997). "Cluster analysis and mathematical programming." Mathematical Programming, 79: 191–215.
[16] Holland, J. H. 1975. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.
[17] Höppner, F., Klawonn, F., and Kruse, R. (1999). Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. New York, NY: Wiley.
[18] I. Gath and A. B. Geva. "Unsupervised optimal fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 773-781, 1989.
[19] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981.
[20] Jain, A. K. and Dubes, R. C. 1988. Algorithms for Clustering Data. Prentice Hall advanced reference series. Prentice-Hall, Inc., Upper Saddle River, NJ.
[21] Kohonen, T. 1989. Self-Organization and Associative Memory. 3rd ed. Springer information sciences series. Springer-Verlag, New York, NY.
[22] L. A. Zadeh, "Fuzzy logic = computing with words," IEEE Trans. on Fuzzy Systems, vol. 4, pp. 103-111, 1996.
[23] Mendel, J. M. "The Perceptual Computer: an Architecture for Computing With Words," Proceedings of Modeling With Words Workshop, in Proc. IEEE FUZZ Conference, Melbourne, Australia, Dec. 2-5, 2001, pp. 35-38.
[24] Rui Xu and Donald C. Wunsch II, Clustering. IEEE Press series on Computational Intelligence, 2009.
[25] Ruspini, E. H. 1969. A new approach to clustering. Inf. Control 15, 22–32.
[26] Schwefel, H. P. 1981. Numerical Optimization of Computer Models. John Wiley and Sons, Inc., New York, NY.
[27] U. Kaymak and M. Setnes. Fuzzy clustering with volume prototypes and adaptive cluster merging. IEEE Transactions on Fuzzy Systems, 10(6), December 2002.
[28] Watanabe, S. 1985. Pattern Recognition: Human and Mechanical. John Wiley and Sons, Inc., New York, NY.
[29] Wilson, D. R. and Martinez, T. R. 1997. Improved heterogeneous distance functions. J. Artif. Intell. Res. 6, 1–34.
[30] Dinh Khac Dong, Tran Dinh Khang and Phan Anh Phong, "Fuzzy clustering with hedge algebra," in Proceedings of the 2010 Symposium on Information and
Communication Technology, SoICT 2010, Hanoi, Viet Nam, August 27-28, 2010, pp. 49-54.
[31] Nguyễn Công Hào (2005), "Fuzzy database model based on the hedge algebra approach" (in Vietnamese), Proceedings of the national workshop on selected problems in Information and Communication Technology, Hải Phòng, 25-27/8, pp. 285–293.
[32] Ho N. C., Wechler W. (1990), "Hedge algebra: An algebraic approach to structures of sets of linguistic truth values," Fuzzy Sets and Systems 35, pp. 281–293.
[33] N. C. Ho, T. D. Khang, H. V. Nam and N. H. Chau, "Hedge algebras, linguistic-valued logic and their application to fuzzy reasoning," International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 7, no. 4, 347-361, 1999.
[34] T. D. Khang, P. A. Phong, D. K. Dong, and C. M. Trang, "Hedge algebraic type-2 fuzzy sets," in Proceedings of FUZZ-IEEE 2010, in conjunction with the WCCI 2010, pp. 1850-1857, July, Barcelona, Spain, 2010.
[35] P. A. Phong, D. K. Dong, and T. D. Khang, "HaT2-FLS and its application to predict survival time of myeloma patients," in Proc. of KSE2009, 2009, Vietnam, pp. 13-18, IEEE Computer Society.
[36] Phan Anh Phong, Dinh Khac Dong, and Tran Dinh Khang, "A method for constructing hedge algebraic type-2 fuzzy logic systems," in Proc. of 2011 IEEE Symposium on Advances in Type-2 Fuzzy Logic Systems, April, Paris, France.
[37] Lê Xuân Việt (2009), PhD thesis (in Vietnamese).

Appendix

PAPER

Fuzzy Clustering With Hedge Algebra
Dinh Khac Dong (1), Tran Dinh Khang (2), Phan Anh Phong (3)
(1, 2) Hanoi University of Technology, Hanoi, Vietnam
(3) Vinh University, Nghean, Vietnam
Proceedings of the 2010 Symposium on Information and Communication Technology, SoICT 2010, Hanoi, Viet Nam, August 27-28, 2010

Fuzzy Clustering with Hedge Algebra

Dinh Khac Dong, Hanoi University of Technology, Hanoi, Vietnam, dinhkhacdong86@gmail.com
Tran Dinh Khang, Hanoi University of Technology, Hanoi, Vietnam, khangtd-fit@mail.hut.edu.vn
Phan Anh Phong, Vinh University, Nghean, Vietnam, phongpa@gmail.com

SoICT '10, August 27-28, 2010, Hanoi, Vietnam. Copyright 2010 ACM 978-1-4503-0105-3/10/08.

ABSTRACT

In this paper, we propose a new approach to fuzzy clustering, built on the conventional fuzzy C-means algorithm (FCM), to handle the uncertainties in pattern recognition problems. In our approach, we define the concept of linguistic cluster center by employing the semantic structure of hedge algebra. This kind of cluster center is constructed to give an appropriate weight to each pattern of the dataset in our clustering algorithm. The parameters of the hedge algebra are then optimized in a training process to obtain parameters that suit the dataset. We also incorporate the k-means algorithm to get better results than conventional FCM.

Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering - Algorithms

General Terms
Algorithms, Theory

Keywords
Hedge algebra, Fuzzy clustering, Linguistic cluster center

1. INTRODUCTION

In pattern recognition problems, the fuzzy C-means algorithm (FCM) proposed by J. C. Bezdek [1] plays an important role and has been extensively researched. Unlike the k-means algorithm, FCM assigns each data point a membership grade for each cluster based on the distances between that pattern and the cluster centers. These membership grades indicate the degree to which each data point belongs to a cluster. Moreover, in the center-updating phase of FCM, these membership grades measure the contribution of each pattern to the new cluster center. Many studies, however, have demonstrated the drawbacks of FCM, such as the sensitivity to noise, the contribution
of all data points to an arbitrary cluster center, etc., and have also presented improvements to this algorithm [3, 6, 10]. Another class of approaches to enhancing the fuzzy C-means algorithm studies the uncertainty associated with the distance measures [7, 8]. In [7], the authors extended a pattern set to an interval type-2 fuzzy set with two parameters m1 and m2 to manage the uncertainty of the fuzzifier m of FCM. Hence, the membership grade of a data point is no longer a crisp number. This leaves a range of choices for the membership grade with which a data point belongs to a cluster.

In the traditional FCM, the membership grades are generated directly from the distance between the data point and the cluster center. Therefore, a pattern that is equally distant from two cluster centers will have similar membership grades for both. This is appropriate only when the two clusters have the same density. Moreover, E. Cox in [2] has shown that the influence of all patterns on the cluster centers tends to pull these centers toward the center of the dataset.

To enhance the performance of FCM, another important aspect to consider is the shape of the clusters in the dataset. If the cluster center is a single point, the shape of the clusters depends entirely on the distance metric employed. Therefore, modifying the distance measure makes it possible to deal with various cluster shapes. For instance, the Gustafson-Kessel (GK) clustering algorithm [3] performs well with ellipsoidal shapes. Moreover, the shape of the prototypes influences the shape of the clusters. The fuzzy c-varieties (FCV) algorithm [1] is appropriate for detecting linear structures in the dataset because it uses linear subspaces as prototypes. These algorithms can be used when sufficient knowledge about the shape of the clusters is available. For other cases, the authors in [8] proposed the concept of volume prototype and presented a method to determine this prototype from the dataset.

In some
studies, the authors in [4] and [5] have shown the ability of hedge algebra to represent and handle linguistic variables based on their semantic structure. Moreover, the recent study [9] has reinforced the potential of hedge algebra in practical applications.

In this paper, our contributions are the following:

• First, we employ the structure of hedge algebra [4, 5] to modify the distances from patterns to cluster centers. The fuzziness measures of linguistic values in hedge algebra are used as the weights for the corresponding patterns. Hence, the influence of each data point on the cluster centers becomes diversified on the basis of the hedge algebra parameters.
• Second, a pattern definitely belongs to a cluster when its membership grade is not less than w, the neutral element of hedge algebra. Only patterns falling above the threshold w are used to calculate the new cluster centers in the updating phase. Therefore, in noisy cases, the noise rarely affects the cluster centers.
• Finally, the parameters of hedge algebra are used as the training parameters for supervised learning methods in order to get better results from the clustering algorithm.

The remainder of this paper is organized as follows. Section 2 recalls the basic concepts of the fuzzy C-means algorithm and gives a brief overview of the structure of hedge algebra. In Section 3, we define the notion of linguistic cluster center and then propose the new fuzzy clustering algorithm as well as the optimization of the hedge algebra parameters. In Section 4, we present experimental results showing the ability of the proposed method, before the conclusions and research directions are drawn in Section 5.

2. PRELIMINARIES

2.1 Fuzzy Clustering [1]

Fuzzy C-means is one of the most popular methods used in pattern recognition. The key idea of FCM is that each data point simultaneously belongs to two or more clusters. The algorithm is an unsupervised fuzzy clustering method based on an objective function. We can obtain the optimal partitions
by minimizing the objective function:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \, \lVert x_i - c_j \rVert^{2} \tag{1}$$

where m is an arbitrary real number greater than 1, $u_{ij}$ is the membership grade of $x_i$ in cluster j, $x_i$ is the i-th pattern among N data points, $c_j$ is the j-th cluster center among C centers, and $\lVert \ast \rVert$ is a norm that measures the dissimilarity between data points and cluster centers. To optimize (1), FCM is carried out through an iterative procedure in which the membership grades $u_{ij}$ and the cluster centers $c_j$ are repeatedly updated as follows:

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \tag{2}$$

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} x_i}{\sum_{i=1}^{N} u_{ij}^{m}} \tag{3}$$

The iteration stops when the termination criterion is satisfied:

$$\lVert c_j^{(n+1)} - c_j^{(n)} \rVert < \varepsilon \tag{4}$$

where $\varepsilon$ is a small positive number and n is the iteration step. This procedure converges to a local minimum of $J_m$, as proved by Bezdek in [1].

2.2 A Short Overview of Hedge Algebra

Hedge algebra is an approach to representing and handling the values of a linguistic variable based on their semantic structure. Linguistic variables usually have primary terms, for instance, true and false for the variable TRUTH, or old and young for the variable AGE. Applying adverbs (hedges) such as very, more, or possibly to these generators or to other values generates new linguistic values such as very true, possibly false, or possibly true, under the following hypotheses:

(i) Each hedge weakens or strengthens the meaning of generators; e.g., very applied to true generates very true, which strengthens the meaning of true.

(ii) Each hedge weakens or strengthens the meaning of other hedges; e.g., possibly applied to very true generates possibly very true, which weakens the meaning of very, the first hedge.

(iii) Applying hedges to linguistic values generates new linguistic values which "semantically inherit" from the original linguistic values. For example, consider the two linguistic values very true and more true: no matter how many hedges very we add to more true, as in very very very more true, its meaning
is not as strong as possibly very true.

These observations lead to the definition of hedge algebras as quadruples (AX, G, H, ≤), in which AX is the set of values, G is the set of generators, H is the set of hedges, and "≤" is the order relation among the elements of the hedge algebra. This relation can be formalized from the order of the generators, the order of the hedges, and the weakening or strengthening effect of applying hedges to elements of the hedge algebra.

In fact, many linguistic variables have two opposite generators, e.g., true and false, tall and short, old and young. The hedge algebra then has a set G with two generators: the one with the "stronger" meaning (e.g., true, tall, old) is the positive generator, denoted $c^{+}$, and the other (e.g., false, short, young) is the negative generator, denoted $c^{-}$. This class of hedge algebras is called symmetrical hedge algebras. The set of hedges H can be divided into $H^{+}$, the positive hedges, which strengthen the meaning of the positive generator, and $H^{-}$, the negative hedges, which weaken it. In linear hedge algebras, the hedges are ordered as follows: $h_1 < h_2 < \dots < h_p$ in $H^{+}$, and $h_{-q} < h_{-q+1} < \dots < h_{-1}$ in $H^{-}$.

In [4], the authors considered symmetrical and linear hedge algebras and used the fuzziness measure of linguistic values, $fm(\hat{x})$ with $\hat{x} \in AX$. According to [4], the authors suppose that, for each hedge h, the ratio $fm(h\hat{x})/fm(\hat{x})$ is constant and does not depend on $\hat{x}$; hence, $fm(h\hat{x})/fm(\hat{x})$ is denoted $\mu(h)$ and called the fuzziness measure of the hedge h. Therefore, the fuzziness measure of every linguistic value can be calculated from the fuzziness measures of the generators and the fuzziness measures of the hedges, because

$$fm(h_n h_{n-1} \dots h_1 c) = \mu(h_n) \times fm(h_{n-1} \dots h_1 c) = \dots = \mu(h_n) \times \mu(h_{n-1}) \times \dots \times \mu(h_1) \times fm(c), \quad c \in \{c^{-}, c^{+}\}.$$

If we assume that $fm(c^{-}), fm(c^{+}) \in [0, 1]$, $\mu(h_i) \in [0, 1]$, $\sum_{i=-q}^{p} \mu(h_i) = 1$, and $fm(c^{+}) + fm(c^{-}) = 1$, then $fm(\hat{x}) \in [0, 1]$. In
this paper, we denote $\underline{fm}(\hat{x})$ and $\overline{fm}(\hat{x})$ as the left bound and the right bound of $fm(\hat{x})$, respectively (i.e., $\overline{fm}(\hat{x}) - \underline{fm}(\hat{x}) = fm(\hat{x})$). Another notion we have to consider is the level of a linguistic value. A k-level linguistic value is a linguistic value consisting of k symbols. For example, VeryMoreTrue is a 3-level linguistic value.

From now on, all examples use the symmetrical and linear hedge algebra with level k = 2, two generators $c^{-} = false$ and $c^{+} = true$ with $fm(c^{-}) = w$ and $fm(c^{+}) = 1 - w$, and the hedge sets $H^{-} = \{Less, Possible\}$ and $H^{+} = \{More, Very\}$. Figure 1 illustrates the fuzziness measures of linguistic values in the hedge algebra where w = 0.5 and $\mu(Less) = \mu(Possible) = \mu(More) = \mu(Very) = 0.25$.

Figure 1: An example of the fuzziness measures in hedge algebra

3. FUZZY CLUSTERING ALGORITHM WITH HEDGE ALGEBRA

3.1 Linguistic cluster center

In this section, we define the notion of linguistic cluster center as the extension of the conventional cluster center. In traditional FCM, the fuzzy membership grades of data points are calculated from the distances between patterns and cluster centers. These membership grades, as shown in (3), mean that nearer patterns contribute more to the cluster centers than farther patterns. The m in (3) plays the role of the fuzzifier parameter in FCM. Figure 2 illustrates the various membership grades in FCM. Suppose that there are two cluster centers, $c_1$ and $c_2$, for the dataset. The shaded area and its gradient represent the variation of the membership grades corresponding to $c_1$ (a dark area means that the membership grade is high).

Figure 2: An example of the membership grades in FCM

FCM-based algorithms are known to be suitable for datasets whose clusters are similar in shape and density. Different cluster densities, however, can cause unexpected results. In these cases, we can change m to obtain the desirable cluster centers. These changes, however, are not intuitive in some cases [7].

Our approach to overcoming this drawback of FCM-based algorithms is to assign a weight to each pattern in the dataset to diversify its contribution to the cluster centers. Associating the distance from a data point to the cluster center with a weight gives more chances to obtain the expected results. The structure of hedge algebra is employed to generate a suitable weight for each pattern on the basis of its distance to the cluster centers. The extension of the conventional cluster center with hedge algebra is called the linguistic cluster center. The linguistic cluster center for cluster center $c_j$ is constructed in three steps:

• Step 1: Determine all k-level linguistic values as well as their fuzziness measures.
• Step 2: Find the farthest pattern belonging to the cluster $c_j$ and denote the maximum distance as $d_{max}$.
• Step 3: Determine the circle with center $c_j$ and radius $d_{max}$. Partition this circle into sections proportional to the fuzziness measures of the linguistic values.

In Figure 3, we illustrate the linguistic cluster center for $c_j$ with the linguistic values' fuzziness measures presented in Figure 1.

Figure 3: Linguistic cluster center based on hedge algebra structure

3.2 The proposed fuzzy clustering algorithm

In our method, the distances from patterns to cluster centers are modified by attaching the corresponding weights to specific patterns. The following algorithm calculates the weight of each pattern on the basis of the linguistic cluster center.

Algorithm 1

Input: All patterns: $x_i$, $1 \le i \le N$; cluster centers: $c_j$, $1 \le j \le C$; hedge algebra parameters: level k of the linguistic values, neutral element w, fuzziness measures of hedges $fm(h_i)$, $-q \le i \le p$.

Output: The weight for each distance $d_{ij}$ from $x_i$ to $c_j$.

1. Construct the k-level linguistic values of the hedge algebra with the corresponding parameters.
2. Determine the linguistic cluster center for each $c_j$, $1 \le j \le C$.
3. Let the distance from $x_i$ to cluster center $c_j$ be $d_{ij}$ and $d_j = \max_{1 \le i \le N} \{d_{ij}\}$. Determine $\hat{x}_i$, the k-level linguistic value satisfying the following condition:

$$\underline{fm}(\hat{x}_i) \le \frac{d_j - d_{ij}}{d_j} < \overline{fm}(\hat{x}_i) \tag{5}$$

Then, the weight for pattern $x_i$ with cluster center $c_j$ is:

$$w_{ij} = fm(\hat{x}_i) = \overline{fm}(\hat{x}_i) - \underline{fm}(\hat{x}_i) \tag{6}$$

Proposition 1. In Algorithm 1, there exists a unique k-level linguistic value $\hat{x}_i$ which satisfies (5).

Proof. Let $r_{ij} = \dfrac{d_j - d_{ij}}{d_j}$, and let $\hat{x}_{min}$ and $\hat{x}_{max}$ be the minimum and maximum k-level linguistic values, respectively.

• It is obvious that $0 \le r_{ij} \le 1$ because $0 \le d_{ij} \le d_j$. In special cases,
  – $r_{ij} = 0$ if $d_{ij} = d_j$, i.e., $x_i$ is the farthest pattern belonging to cluster j, and
  – $r_{ij} = 1$ if $d_{ij} = 0$, i.e., $x_i$ coincides with the linguistic cluster center $c_j$.
• Besides, the hedge algebra considered in Algorithm 1 is symmetrical and linear. Hence, the set of intervals corresponding to the k-level linguistic values, $[\underline{fm}(\hat{x}_{min}), \overline{fm}(\hat{x}_{min})), \dots, [\underline{fm}(\hat{x}_{max}), \overline{fm}(\hat{x}_{max})]$, forms a partition of [0, 1]. Therefore, there exists a unique k-level linguistic value $\hat{x}_i$ satisfying $\underline{fm}(\hat{x}_i) \le r_{ij} < \overline{fm}(\hat{x}_i)$.

Example. In Figure 3, consider pattern $x_i$ and cluster center $c_j$. The k-level linguistic value of the hedge algebra which satisfies (5) is PossiblyTrue (PT). Hence, the corresponding weight of $x_i$ is $w_{ij} = fm(PT)$.

This algorithm is used in the proposed method for fuzzy clustering.

Hedge algebra based Fuzzy C-means algorithm (HAFCM)

At first, the initial centers are generated by the k-means algorithm. Then, similar to FCM, HAFCM is carried out through an iterative procedure in which the membership grades $u_{ij}$ and the cluster centers $c_j$ are repeatedly updated as follows:

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{w_{ij} \times \lVert x_i - c_j \rVert}{w_{ik} \times \lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \tag{7}$$

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} x_i}{\sum_{i=1}^{N} u_{ij}^{m}}, \quad \text{where } u_{ij} > w \tag{8}$$

The iteration stops when the termination criterion is satisfied:

$$\lVert c_j^{(n+1)} - c_j^{(n)} \rVert < \varepsilon \tag{9}$$

where $\varepsilon$ is a small positive number and n is the iteration step. The proposition below presents the relationship between the conventional FCM and HAFCM.

Proposition 2. Consider (AX, G, H, ≤) where G = {true, false} and $H = H^{-} \cup H^{+} = \{h_{-q}, \dots, h_p\}$. If $fm(h_{-q}) = fm(h_{-q+1}) = \dots = fm(h_{p-1}) = fm(h_p)$, then the clustering result of HAFCM is equivalent to that of FCM.

Proof. Consider the n-th iteration step of these two algorithms.

• Phase 1: Updating the membership grades. Referring to (7), the membership grade of $x_i$ with cluster center $c_j$ is:

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{w_{ij} \times \lVert x_i - c_j \rVert}{w_{ik} \times \lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \tag{10}$$

According to the assumption $fm(h_{-q}) = fm(h_{-q+1}) = \dots = fm(h_{p-1}) = fm(h_p)$ and Algorithm 1, the weights of all patterns are the same: $w_{ij} = \frac{1}{p+q}$, $\forall i, j$. Therefore, the weights cancel in each ratio and (10) becomes:

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}} \tag{11}$$

Equations (11) and (2) have the same form. Hence, after Phase 1, HAFCM and FCM produce the same membership grades for each pattern.

• Phase 2: Updating the cluster centers. Phase 2 of HAFCM does not differ from that of FCM, so the cluster centers produced by HAFCM and FCM are similar. The threshold w in the center update is used in HAFCM to eliminate the effect of noisy data. According to the assumption that $fm(h_{-q}) = fm(h_{-q+1}) = \dots = fm(h_{p-1}) = fm(h_p)$, we have $w = \frac{p}{p+q}$.

The above two phases show that the clustering results of HAFCM and FCM are equivalent.

3.3 The optimization of hedge algebra's parameters

As presented above, the backbone of our clustering method is the use of the hedge algebra structure. Changing the hedge algebra parameters changes the weights and therefore the clustering results. These changes can lead to the desirable centers, especially in noisy cases. The reason is that noisy patterns, which are typically far from the cluster centers, should have lower membership grades than other patterns. In order to obtain an appropriate set of parameters, we can use various training methods such as neural networks, genetic algorithms, etc.

Table 1: Clustering results of three algorithms

                          HCM      FCM      HAFCM
Number of data            70       70       70
Noise rate                40%      40%      40%
Misclassified patterns    –        –        –
Recognition rate          87.14%   94.29%   97.14%

Example. Consider the dataset and cluster $c_j$
in Figure The pattern x0 and x1 seem to be the noise of the dataset Therefore, if we use the hedge algebra in Figure and keep the parameters set {w, µ(Less), µ(P ossible), µ(M ore), µ(V ery)} = { 0.5, 0.25, 0.25, 0.25, 0.25 }, the weights for x0 , x1 and other patterns will be equal to µ(V )× f m(T ) = 0.25 × 0.5 = 0.125 This affects the quality of clustering process However, if we modify the parameters set into {0.6, 0.25, 0.25, 0.25, 0.25} then the weights for x0 and x1 are f m(V F ) = f m(M F ) = µ(M ) × f m(F ) = 0.25 × 0.6 = 0.15, larger than the weights for other patterns which is f m(LT ) = f m(P T ) = f m(M T ) = f m(V T ) = µ(V )× f m(T ) = 0.25 × 0.4 = 0.1 The larger the weights are, the larger the relative distances from patterns to cluster center are Hence, the modified parameters will give a more appropriate result Figure 5: The identical clustering results of HCM, FCM and HAFCM respectively, are different In particular, the number of misclassified patterns as well as the recognition rate are shown in Table (a) (b) Figure 4: A dataset with noisy patterns (a) A set of identical weights (b) A more reasonable set of weights EXPERIMENTAL RESULTS In this section, we present two implementations with synthetic data and multi-dimensional data Then, we compare among hard C-means algorithm (HCM) which is known as k-means, fuzzy C-means algorithm (FCM) and our method hedge algebra fuzzy C-means (HAFCM) These clustering results show that HAFCM outperforms HCM and FCM in all cases 4.1 Synthetic Data with Noise The below illustration compares the clustering results of HCM, FCM and HAFCM with two similar-volume clusters Figure show the appropriate results generated by these three algorithms as expected In this implementation, we have used 50 patterns in 2-dimensional space under the Gaussian random number generator These patterns’ distribution is quite typical, hence three clustering results are identical Nevertheless, when we added noisy data up to 40% to the above 
pattern set, the clustering results of HCM, FCM and HAFCM, as shown in Figure 6, Figure and Figure Figure 6: HCM’s clustering result with noisy data 4.2 Multi-dimensional Real Data In the following experiment, we have employed the ”IRIS plants” data, which is widely used in pattern recognition problems Some properties of this dataset are: the number of IRIS plant’s features is 4, the number of patterns is 150, and the number of class is To have the better result, we use genetic algorithm as the leaning method where 75% of the pattern set are used as training data and the remaining 25% for testing Because the classification of this dataset is known, it is easy to calculate the recognition rates generated by FCM and HAFCM In Figure 9, it is very clear that our proposed method gives better classification than FCM with various fuzzifier parameters m CONCLUSIONS Hedge algebra is a algebra structure with the ability of representing and handling linguictic variables In this pa- Figure 7: FCM’s clustering result with noisy data per, we have established the concept of linguistic cluster center based on this structure Then, we modify the FCM algorithm by adding weights for each pattern These weights depend on the representation of linguistic cluster centers and the distances from patterns to those centers Moreover, the neutral element of hedge algebra, w, is used as the threshold to update cluster centers From this approach, we can obtain the more desireable results by optimizing the parameters of hedge algebra Our clustering method are proved to be a generalization of FCM algorithm with specific parameters in Proposition Some implementations have shown the potential of our proposed clustering method In the further, we will study the effectiveness of our methods in various kind of clustering problems as well as the affects of hedge algebra parameters to clustering results In our proposed method, the number of clusters has to be defined as the input of the HAFCM Hence, 
another version of HAFCM will be studied in the future to determine this parameter from the dataset ACKNOWLEDGMENTS This paper is completed under the sponsor of National Foundation for Science and Technology Development, Vietnam REFERENCES Figure 8: HAFCM’s clustering result with noisy data Figure 9: Recognition rate of IRIS data [1] J C Bezdek Pattern recognition with fuzzy objetive function algorithms New York: Plenum, 1981 [2] E Cox Fuzzy Modeling and Genetic Algorithms for Data Mining and Exploration Elservier, 2005 [3] D E Gustafson and W C Kessel Fuzzy clustering with a fuzzy covariance matrix In Proc IEEE Conf Decision Control, pages 761–766 San Diego, CA, 1979 [4] N C Ho, T D Khang, H V Nam, and N H Chau Hedge algebras, linguistic-valued logic and their application to fuzzy reasoning International Journal of Uncertainty, Fuzziness and Knowledge-Based System, 7(4):347–361, December 1999 [5] N C Ho and W Wechler Hedge algebras: an algebraic approach to structures of sets of linguistic domains of linguistic truth variables Fuzzy Sets and Systems 35, pages 281–293, 1990 [6] C Hwang and F C.-H Rhee An interval type-2 fuzzy c spherical shells algorithm In Proc IEEE-FUZZ Conference, pages 1117–1122 Hungary, 2004 [7] C Hwang and F C.-H Rhee Uncertainty fuzzy clustering: Interval type-2 fuzzy approach to c-means IEEE Transactions on Fuzzy Systems, 15(1):107–120, February 2007 [8] U Kaymak and M Setnes Fuzzy clustering with volume prototypes and adaptive cluster merging IEEE Transactions on Fuzzy Systems, 10(6), December 2002 [9] P A Phong, D K Dong, and T D Khang Hedge algebra based type-2 fuzzy logic system and its application to predict survival time of myeloma patients In Proc Knowledge and System Engineering, KSE 2009, pages 13–18 IEEE Computer Society [10] B Thomas and G Raju A novel fuzzy clustering method for outlier detection in data mining International Journal of Recent Trends in Engineering, 1(2), May 2009 ... 
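As an illustration of the HAFCM updates described in this paper (the weights of (5) and (6), the membership update (7) and the thresholded center update (8)), the following Python sketch may help. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the fuzziness measures are assumed to be listed from the minimum to the maximum k-level linguistic value and to sum to 1, the exponent $2/(m-1)$ is assumed from the standard FCM update, and $r_{ij} = 1$ is assigned to the last interval.

```python
import numpy as np


def pattern_weights(dist, fm_values):
    """Weights w_ij in the spirit of (5)-(6) (hypothetical helper).

    dist: (N, C) array of distances d_ij.
    fm_values: fuzziness measures of the k-level linguistic values, ordered
    from the minimum value to the maximum one (assumed to sum to 1).
    Assumes every cluster has at least one pattern at non-zero distance.
    """
    fm_values = np.asarray(fm_values, dtype=float)
    upper = np.cumsum(fm_values)              # upper endpoint of each interval
    d_max = dist.max(axis=0, keepdims=True)   # d_j = max_i d_ij
    r = (d_max - dist) / d_max                # r_ij in [0, 1]
    # Index of the interval [lower, upper) that contains r_ij.
    idx = np.searchsorted(upper, r, side="right")
    idx = np.minimum(idx, len(fm_values) - 1)  # assumption: r_ij = 1 -> last interval
    return fm_values[idx]


def update_memberships(dist, weights, m=2.0):
    """Membership grades as in (7): FCM applied to weighted distances."""
    wd = np.maximum(weights * dist, 1e-12)    # guard against zero distance
    ratio = wd[:, :, None] / wd[:, None, :]   # ratio[i, j, k] = wd_ij / wd_ik
    return 1.0 / np.sum(ratio ** (2.0 / (m - 1.0)), axis=2)


def update_centers(X, u, w, m=2.0):
    """Centers as in (8): only patterns with u_ij > w contribute to center j.

    Returns NaN for a cluster if no pattern passes the threshold.
    """
    um = np.where(u > w, u ** m, 0.0)         # (N, C)
    return (um.T @ X) / um.sum(axis=0)[:, None]


# Toy data: two well-separated groups (illustrative values only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers = np.array([[0.3, 0.3], [10.3, 10.3]])
dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

fm_equal = [0.125] * 8                 # w = 0.5, every mu = 0.25
fm_tuned = [0.15] * 4 + [0.10] * 4     # w = 0.6: false side, then true side

w_equal = pattern_weights(dist, fm_equal)
w_tuned = pattern_weights(dist, fm_tuned)
u = update_memberships(dist, w_equal)
new_centers = update_centers(X, u, w=0.5)
```

With `fm_equal` every weight is identical, so the memberships coincide with those of plain FCM, which is the situation the equivalence proposition describes. With `fm_tuned` the farthest patterns of each cluster receive the larger weight, as in the parameter optimization example.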
... the linguistic structure of hedge algebras is presented in this chapter. The chapter closes with a number of recently studied applications of hedge algebras.

• Chapter 4: Fuzzy clustering with hedge algebras. This is the most important part of the thesis, with research... fuzzy clustering with hedge algebras. Two propositions are then presented to affirm the theoretical significance of the algorithm.

• Chapter 5: Applications of the fuzzy clustering algorithm with hedge algebras. The algorithm is applied to two... Several experimental applications show the applicability of the improved algorithm.

MINISTRY OF EDUCATION AND TRAINING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

Đinh Khắc Đông

FUZZY CLUSTERING WITH HEDGE ALGEBRAS AND APPLICATIONS

Table of contents

Acknowledgments

Declaration

Table of contents

List of symbols and abbreviations

List of tables

List of figures

Introduction

Reasons for choosing the topic

Research history

Objectives, subjects and scope of the research

Main arguments and new contributions

Research methodology

Structure of the thesis

Data clustering

Problem statement

The data clustering problem

Definitions and notation

Pattern representation and feature selection

Similarity measures

Clustering techniques

Hierarchical clustering algorithms

Partitional algorithms

Fuzzy clustering
