Data Mining and Knowledge Discovery Handbook, 2 Edition part 65 ppt

620 Maria Halkidi and Michalis Vazirgiannis

31.2.3 Interestingness Measures of Classification Rules

The number of classification patterns generated can be very large, and different approaches may produce different sets of patterns. The patterns extracted during the classification process can be represented in the form of rules, known as classification rules. It is important to evaluate the discovered patterns in order to identify those that are valid and provide new knowledge. Techniques that aim at this goal are broadly referred to as interestingness measures. The interestingness of the patterns discovered by a classification approach can thus be considered as another quality criterion. Some representative measures (Hilderman and Hamilton, 1999) for ranking the usefulness and utility of discovered classification patterns (classification rules) are:

• Rule-Interest Function. Piatetsky-Shapiro introduced the rule-interest function (Piatetsky-Shapiro, 1991), which is used to quantify the correlation between attributes in a classification rule. It is suitable only for single classification rules, i.e. rules in which both the left- and right-hand sides correspond to a single attribute.
• Smyth and Goodman's J-Measure. The J-measure (Smyth and Goodman, 1991) is a measure for probabilistic classification rules and is used to find the best rules relating discrete-valued attributes. A probabilistic classification rule is a logical implication, X → Y, satisfied with some probability p. The left- and right-hand sides of this implication correspond to a single attribute. The right-hand side is restricted to a simple single-valued assignment expression, while the left-hand side may be a conjunction of simple expressions.
• General Impressions. In (Liu et al., 1997) general impression is proposed as an approach for evaluating the importance of classification rules.
It compares discovered rules to an approximate or vague description of what is considered interesting. Thus a general impression can be considered a kind of specification language.
• Gago and Bento's Distance Metric. The distance metric (Gago and Bento, 1998) measures the distance between classification rules and is used to determine the rules that provide the highest coverage for the given data. The rules with the highest average distance to the other rules are considered the most interesting.

For additional discussion of interestingness measures, please refer to Chapter 29.5 in this volume.

31.3 Association Rules

Mining rules is one of the main tasks in the Data Mining process. It has attracted considerable interest because a rule provides a concise statement of potentially useful information that is easily understood by end-users. There is a great deal of research in the field of association rule extraction, resulting in a variety of algorithms that efficiently analyze data and extract rules from them. The extracted rules have to satisfy user-defined thresholds on association rule measures (such as support, confidence, leverage, and lift). These measures give an indication of an association rule's importance and confidence. They may represent the predictive advantage of a rule and help to identify interesting patterns of knowledge in data and to make decisions. Below we briefly summarize these measures.

31 Quality Assessment Approaches in Data Mining 621

31.3.1 Association Rules Interestingness Measures

Let LHS → RHS be an association rule. In what follows we refer to the left-hand side and the right-hand side of the rule as LHS and RHS respectively. Below, some of the best-known measures of rule interestingness are presented (Han and Kamber, 2001, Berry and Linoff, 1996).
Coverage

The coverage of an association rule is the proportion of cases in the data that have the attribute values or items specified on the left-hand side of the rule:

Coverage = n(LHS) / N (31.5)

or Coverage = P(LHS), where N is the total number of cases under consideration and n(LHS) denotes the number of cases covered by the left-hand side. Coverage takes values in [0, 1]. An association rule with coverage near 1 can be considered an interesting association rule.

Support

The support of an association rule is the proportion of all cases in the data set that satisfy the rule, i.e. both the LHS and the RHS of the rule. More specifically, support is defined as

Support = n(LHS ∩ RHS) / N (31.6)

or Support = P(LHS ∩ RHS), where N is the total number of cases under consideration and n(LHS ∩ RHS) denotes the number of cases covered by both the left- and right-hand sides. Support can be considered an indication of how often a rule occurs in a data set and, as a consequence, how significant the rule is.

Confidence

The confidence of an association rule is the proportion of the cases covered by the LHS of the rule that are also covered by the RHS:

Confidence = n(LHS ∩ RHS) / n(LHS) (31.7)

or Confidence = P(LHS ∩ RHS) / P(LHS), where n(LHS) denotes the number of cases covered by the left-hand side. Confidence takes values in [0, 1]. A confidence value near 1 is an indication of an important association rule.

The interestingness measures discussed above, support and confidence, are widely used in the association rule extraction process and are also known as Agrawal and Srikant's itemset measures. From their definitions, we can say that confidence corresponds to the strength of a rule, while support corresponds to its statistical significance.

Leverage

The leverage (Magnum Opus) of an association rule is the proportion of additional cases covered by both the LHS and RHS above those expected if the LHS and RHS were independent of each other.
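As a minimal sketch of the measures defined above, the following computes coverage, support, and confidence over a toy transaction list, together with leverage following its verbal definition (the joint proportion minus the product of the marginal proportions). The item names and data are hypothetical, chosen only to illustrate the formulas.

```python
# Toy computation of coverage, support, confidence and leverage for a
# rule LHS -> RHS over a list of transactions (each a set of items).

def rule_measures(transactions, lhs, rhs):
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= t)            # cases covered by LHS
    n_rhs = sum(1 for t in transactions if rhs <= t)            # cases covered by RHS
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)   # cases covered by both
    coverage = n_lhs / n                         # P(LHS)
    support = n_both / n                         # P(LHS ∩ RHS)
    confidence = n_both / n_lhs                  # P(LHS ∩ RHS) / P(LHS)
    # leverage: observed joint proportion minus the proportion expected
    # under independence of LHS and RHS
    leverage = support - coverage * (n_rhs / n)
    return coverage, support, confidence, leverage

# Hypothetical market-basket data
transactions = [{"bread", "butter"}, {"bread"}, {"milk", "butter"},
                {"bread", "butter", "milk"}]
cov, sup, conf, lev = rule_measures(transactions, {"bread"}, {"butter"})
# cov = 3/4, sup = 2/4, conf = 2/3, lev = 0.5 - 0.75 * 0.75
```
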
This is a measure of the importance of the association that includes both the confidence and the coverage of the rule. More specifically, it is defined as

Leverage = P(LHS ∩ RHS) − P(LHS) · P(RHS) (31.8)

Leverage takes values in [−1, 1]. Values of leverage equal to or below 0 indicate strong independence between LHS and RHS. On the other hand, values of leverage near 1 are an indication of an important association rule.

Lift

The lift of an association rule is the confidence divided by the proportion of all cases that are covered by the RHS. This is a measure of the importance of the association that is independent of coverage:

Lift = P(LHS ∩ RHS) / (P(LHS) · P(RHS)) (31.9)

or Lift = Confidence / P(RHS). It takes values in R+ (the positive real numbers). Based on the value of lift we can draw the following inferences about a rule's interestingness:

1. lift ≈ 1 means that RHS and LHS are independent, which indicates that the rule is not interesting.
2. lift values close to +∞. Here we have the following sub-cases:
   • RHS ⊆ LHS or LHS ⊆ RHS. If either of these cases holds, we may conclude that the rule is not interesting.
   • P(RHS) is close to 0 or P(RHS|LHS) is close to 1. The first case indicates that the rule is not important. On the other hand, the second case is a good indication that the rule is an interesting one.
3. lift = 0 means that P(RHS|LHS) = 0 ⇔ P(RHS ∩ LHS) = 0, which indicates that the rule is not important.

Further discussion on interestingness measures

Based on the definition of association rules and the related measures, it is obvious that support is an indication of a rule's importance based on the amount of data that supports it. For instance, assuming the rule A → B, high support of the rule indicates that a high number of tuples contain both the left-hand side and the right-hand side of the rule, and thus it can be considered a representative rule of our data set. Moreover,
confidence expresses our confidence, based on the available data, that when the left-hand side of the rule occurs, the right-hand side also occurs. Though support and confidence are useful for mining association rules in many applications, they can mislead us in some cases. Under the support-confidence framework, a rule can be identified as interesting even though the occurrence of A does not imply the occurrence of B. In this case lift and leverage can be considered alternative interestingness measures, giving also an indication of the correlation of LHS and RHS.

Also, lift is another measure which may give an indication of rule significance, i.e. of how interesting the rule is. Lift represents the predictive advantage a rule offers over simply guessing based on the frequency of the rule's consequent (RHS). Thus, lift may indicate whether a rule can be considered representative of the data, so as to use it in the process of decision-making (Roberto et al., 1999). For instance, let A + B → G be a rule with confidence 85% and support(G) = 90%. Due to the high confidence of the rule we might conclude that it is a significant rule. On the other hand, the right-hand side of the rule represents 90% of the studied data; that is, a high proportion of the data contains G. Then the rule may not be very interesting, since there is a high probability that the right-hand side of the rule (G) is satisfied by our data anyway. More specifically, the rule may be satisfied by a high percentage of the data under consideration, but at the same time the consequent of the rule (RHS) is highly supported. As a consequence, this rule may not be useful for making decisions or for extracting a general rule about the behavior of the data. Finally, leverage expresses the over-representation of the rule relative to its expected representation in the data set if there were no interaction between LHS and RHS. A similar measure, conviction,
was proposed by Brin et al. (Brin et al., 1997). The formal definition is:

Conviction = (1 − P(RHS)) / (1 − Confidence) (31.10)

Both lift and conviction are monotone in the confidence.

31.3.2 Other approaches for evaluating association rules

There are also other well-known approaches and measures for evaluating association rules:

• Rule templates are used to describe a pattern for the attributes that can appear in the left- or right-hand side of an association rule. A rule template may be either inclusive or restrictive. An inclusive rule template specifies desirable rules that are considered interesting, while a restrictive rule template specifies undesirable rules that are considered uninteresting. Rule pruning can be done by setting support, confidence and rule size thresholds.
• Dong and Li's interestingness measure (Dong and Li, 1998) is used to evaluate the importance of an association rule by considering its unexpectedness in terms of other association rules in its neighborhood. The neighborhood of an association rule consists of the association rules within a given distance of it.
• Gray and Orlowska's interestingness (Gray and Orlowska, 1998) is used to evaluate the confidence of associations between sets of items in the extracted association rules. Though support and confidence have been shown to be useful for characterizing association rules, interestingness contains a discriminator component that gives an indication of the independence of the antecedent and consequent.
• Peculiarity (Zhong et al., 1999) is a distance-based measure of rule interestingness. It is used to determine the extent to which one data object differs from other similar data objects.
• Closed Association Rules Mining. It is widely recognized that the larger the set of frequent itemsets, the more association rules are presented to the user, many of which turn out to be redundant.
However, it is not necessary to mine all frequent itemsets to guarantee that all non-redundant association rules will be found. It is sufficient to consider only the closed frequent itemsets (Zaki and Hsiao, 2002, Pasquier et al., 1999, Pei et al., 2000). The set of closed frequent itemsets guarantees completeness even in dense domains, and all non-redundant association rules can be defined on it. CHARM is an efficient algorithm for closed association rule mining.

31.4 Cluster Validity

Clustering is a major task in the Data Mining process for discovering groups and identifying interesting distributions and patterns in the underlying data (Fayyad et al., 1996). Thus, the main problem in the clustering process is to reveal the organization of patterns into 'sensible' groups, which allow us to discover similarities and differences, as well as to derive useful inferences about them (Guha et al., 1999). In the literature a wide variety of algorithms have been proposed for different applications and sizes of data sets (Han and Kamber, 2001), (Jain et al., 1999). The application of an algorithm to a data set aims at discovering its inherent partitions, assuming that the data set exhibits a clustering tendency. However, clustering is perceived as an unsupervised process, since there are no predefined classes and no examples that would show what kind of relations should be valid among the data (Berry and Linoff, 1996). The various clustering algorithms are therefore based on certain assumptions in order to define a partitioning of a data set. As a consequence, they may behave differently depending on: i) the features of the data set (geometry and density distribution of clusters), and ii) the input parameter values. A problem we face in clustering is deciding the optimal number of clusters into which our data can be partitioned.
In most algorithms' experimental evaluations, 2D data sets are used so that the reader can visually verify the validity of the results (i.e. how well the clustering algorithm discovered the clusters of the data set). Visualization of the data set is clearly a crucial verification of the clustering results. In the case of large multidimensional data sets (e.g. more than three dimensions), however, effective visualization of the data set is difficult. Moreover, the perception of clusters using available visualization tools is a difficult task for humans who are not accustomed to higher-dimensional spaces. As a consequence, if the clustering algorithm's parameters are assigned improper values, the clustering method may produce a partitioning scheme that is not optimal for the specific data set, leading to wrong decisions. The problems of deciding the number of clusters (i.e. the partitioning) that best fits a data set, as well as the evaluation of clustering results, have been the subject of several research efforts (Dave, 1996, Gath and Geva, 1989, Theodoridis and Koutroubas, 1999, Xie and Beni, 1991). We shall now discuss the fundamental concepts of cluster validity. Furthermore, we present the external and internal criteria in the context of clustering validity assessment, while the relative criteria are discussed in Section 31.4.4.

31.4.1 Fundamental Concepts of Cluster Validity

The procedure of evaluating the results of a clustering algorithm is known under the term cluster validity. In general terms, there are three approaches to investigating cluster validity (Theodoridis and Koutroubas, 1999). The first is based on external criteria. This implies that we evaluate the results of a clustering algorithm based on a pre-specified structure, which is imposed on the data set and reflects our intuition about its clustering structure. The second approach is based on internal criteria.
The results of a clustering algorithm are evaluated in terms of quantities that involve the data themselves (e.g. the proximity matrix). The third approach of clustering validity is based on relative criteria. Here the basic idea is the evaluation of a clustering structure by comparing it to other clustering schemes produced by the same algorithm but with different parameter values. The first two approaches are based on statistical tests, and their major drawback is their high computational cost. Moreover, the indices related to these approaches aim at measuring the degree to which a data set conforms to an a priori specified scheme. The third approach, on the other hand, aims at finding the best clustering scheme that a clustering algorithm can define under certain assumptions and parameter values.

Fig. 31.1. Confidence interval for (a) two-tailed index, (b) right-tailed index, (c) left-tailed index, where q_ρ is the ρ proportion of q under hypothesis H0.

External and Internal Validity Indices

In this section, we discuss methods suitable for the quantitative evaluation of clustering results, known as cluster validity methods. These methods give an indication of the quality of the resulting partitioning, and thus they can only be considered a tool at the disposal of experts for evaluating clustering results. The cluster validity approaches based on external and internal criteria rely on statistical hypothesis testing. In the following section, an introduction to the fundamental concepts of hypothesis testing in cluster validity is presented.

Hypothesis Testing in Cluster Validity

In cluster validity the basic idea is to test whether the points of a data set are randomly structured or not. This analysis is based on the Null Hypothesis, denoted H0, expressed as a statement of random structure of a data set X.
To test this hypothesis we use statistical tests, which lead to a computationally complex procedure. Monte Carlo techniques, discussed below, are used as a solution to these high computational demands (Theodoridis and Koutroubas, 1999).

How Monte Carlo is used in Cluster Validity

The goal of using Monte Carlo techniques is the computation of the probability density function (pdf) of the validity indices. They rely on simulating the process of estimating the pdf of a validity index using a sufficient number of computer-generated data sets. First, a large number of synthetic data sets is generated under a normal distribution. For each of these synthetic data sets, called X_i, the value of the defined index, denoted q_i, is computed. Then, based on the respective values of q_i for the data sets X_i, we create a scatter-plot. This scatter-plot is an approximation of the probability density function of the index. Figure 31.1 depicts the three possible shapes of the probability density function of an index q. There are three different possible shapes, depending on the critical interval D̄_ρ corresponding to the significance level ρ. The probability density function of a statistic index q, under H0, has a single maximum, and the critical region is either a half line or a union of two half lines (Theodoridis and Koutroubas, 1999).
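The pdf-estimation step above can be sketched as follows. The validity index used here (mean nearest-neighbor distance on 1-D points) is a hypothetical stand-in for whatever index q is being validated, and the parameter values are arbitrary.

```python
import random

# Monte Carlo sketch: generate r synthetic data sets under H0 (points
# drawn from a normal distribution), compute a validity index q_i for
# each, and collect the values; a histogram of the collected values
# approximates the pdf of the index under H0.

def toy_index(points):
    # mean distance from each point to its nearest neighbor (toy index)
    total = 0.0
    for i, p in enumerate(points):
        nearest = min(abs(p - other) for j, other in enumerate(points) if j != i)
        total += nearest
    return total / len(points)

def monte_carlo_pdf(r=200, n_points=30, seed=0):
    rng = random.Random(seed)
    q_values = []
    for _ in range(r):
        x_i = [rng.gauss(0.0, 1.0) for _ in range(n_points)]  # synthetic X_i
        q_values.append(toy_index(x_i))                       # index value q_i
    return q_values

q_values = monte_carlo_pdf()  # r values approximating the pdf of q under H0
```
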
Assuming that the scatter-plot has been generated using r values of the index q, denoted q_i, we examine the following conditions in order to accept or reject the Null Hypothesis H0:

if the shape is right-tailed (Figure 31.1b):
    if the value of q for our data set is greater than (1 − ρ) · r of the q_i values, then reject H0, else accept H0
else if the shape is left-tailed (Figure 31.1c):
    if the value of q for our data set is smaller than ρ · r of the q_i values, then reject H0, else accept H0
else if the shape is two-tailed (Figure 31.1a):
    if the value of q is greater than (ρ/2) · r of the q_i values and smaller than (1 − ρ/2) · r of the q_i values, then accept H0

31.4.2 External Criteria

Based on external criteria we can work in two different ways. First, we can evaluate the resulting clustering structure C by comparing it to an independent partition P of the data, built according to our intuition about the clustering structure of the data set. Second, we can compare the proximity matrix P to the partition P.

Comparison of C with partition P (non-hierarchical clustering)

Let C = {C_1, ..., C_m} be a clustering structure of a data set X and P = {P_1, ..., P_s} be a defined partition of the data. We refer to a pair of points (x_v, x_u) from the data set using the following terms:

• SS: both points belong to the same cluster of the clustering structure C and to the same group of partition P.
• SD: the points belong to the same cluster of C and to different groups of P.
• DS: the points belong to different clusters of C and to the same group of P.
• DD: both points belong to different clusters of C and to different groups of P.

Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively, then a + b + c + d = M, the total number of pairs in the data set (i.e., M = N · (N − 1)/2, where N is the total number of points in the data set).
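The pair counts a, b, c, d defined above can be computed directly from the two labelings, as the following sketch shows; the cluster and partition labels are toy data chosen for illustration.

```python
from itertools import combinations

# Count the SS, SD, DS and DD pairs (a, b, c, d) for a clustering C and
# a partition P, given as per-point label lists over the same N points.

def pair_counts(c_labels, p_labels):
    a = b = c = d = 0
    for i, j in combinations(range(len(c_labels)), 2):
        same_c = c_labels[i] == c_labels[j]   # same cluster of C?
        same_p = p_labels[i] == p_labels[j]   # same group of P?
        if same_c and same_p:
            a += 1   # SS
        elif same_c:
            b += 1   # SD
        elif same_p:
            c += 1   # DS
        else:
            d += 1   # DD
    return a, b, c, d

c_labels = [0, 0, 1, 1, 1]   # toy clustering C
p_labels = [0, 0, 0, 1, 1]   # toy partition P
a, b, c, d = pair_counts(c_labels, p_labels)
assert a + b + c + d == 5 * 4 // 2   # M = N(N-1)/2
```
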
Now we can define the following indices to measure the degree of similarity between C and P:

1. Rand Statistic: R = (a + d)/M
2. Jaccard Coefficient: J = a/(a + b + c)

The above two indices range between 0 and 1 and are maximized when m = s. Another index is:

3. Folkes and Mallows index:

FM = a / sqrt(m_1 · m_2) = sqrt( (a/(a + b)) · (a/(a + c)) ) (31.11)

where m_1 = (a + b) and m_2 = (a + c). For the previous three indices it has been proven that high values indicate great similarity between C and P: the higher the values of these indices, the more similar C and P are. Other indices are:

4. Hubert's Γ statistic:

Γ = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} X(i, j) · Y(i, j) (31.12)

High values of this index indicate a strong similarity between the matrices X and Y.

5. Normalized Γ statistic:

Γ̂ = [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (X(i, j) − μ_X)(Y(i, j) − μ_Y) ] / (σ_X · σ_Y) (31.13)

where X(i, j) and Y(i, j) are the (i, j) elements of the matrices X and Y respectively that we wish to compare, and μ_X, μ_Y, σ_X, σ_Y are the respective means and standard deviations of the X and Y matrices. This index takes values between −1 and 1.

All these statistics have right-tailed probability density functions under the random hypothesis. In order to use these indices in statistical tests, we must know their respective probability density functions under the Null Hypothesis H0, which is the hypothesis of random structure of our data set. This means that, using statistical tests, if we accept the Null Hypothesis then our data are randomly distributed. However, the computation of the probability density functions of these indices is computationally expensive. A solution to this problem is to use Monte Carlo techniques. The procedure is as follows:

Algorithm 1: Monte Carlo Algorithm
1: for i = 1 to r do
2:   Generate a data set X_i with N vectors (points) in the area of X (i.e. having the same dimension as those of the data set X).
3:   Assign each vector y_{j,i} of X_i to the group that x_j ∈ X belongs to, according to the partition P.
4:   Run the same clustering algorithm used to produce structure C for each X_i, and let C_i be the resulting clustering structure.
5:   Compute the value q(C_i) of the defined index q for P and C_i.
6: end for
7: Create a scatter-plot of the r validity index values q(C_i) computed in the for loop.

After plotting the approximation of the probability density function of the defined statistic index, its value, denoted q, is compared to the q(C_i) values, further referred to as q_i. The indices R, J, FM and Γ defined previously are used as the q index mentioned in the above procedure.

Comparison of P (proximity matrix) with partition P

Let P be the proximity matrix of a data set X and P be its partitioning. Partition P can be considered a mapping g : X → {1, ..., n_c}, where n_c is the number of clusters. Assume the matrix Y is defined as:

Y(i, j) = 1 if g(x_i) = g(x_j), and 0 otherwise.

The Γ (or normalized Γ) statistic index can then be computed using the proximity matrix P and the matrix Y. Based on the index value, we may have an indication of the similarity of the two matrices. To proceed with the evaluation procedure we use the Monte Carlo techniques mentioned above. In the 'Generate' step of the procedure, the corresponding mapping g_i is generated for every generated data set X_i. Then, in the 'Compute' step, the matrix Y_i is computed for each X_i in order to find the corresponding statistic index Γ_i.

31.4.3 Internal Criteria

Using this approach of cluster validity, the goal is to evaluate the clustering result of an algorithm using only quantities and features inherent in the data set. There are two cases in which we apply internal criteria of cluster validity, depending on the clustering structure: a) a hierarchy of clustering schemes, and b) a single clustering scheme.
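The Γ statistic for comparing a proximity matrix with a partition, as described above, can be sketched as follows: Y(i, j) = 1 when points i and j fall in the same group, and Γ = (1/M) Σ_{i<j} P(i, j) · Y(i, j). The proximity matrix and labels are toy values for illustration.

```python
# Sketch of Hubert's Γ statistic between a proximity matrix and the
# membership matrix Y induced by a partition (per-point labels).

def gamma_statistic(prox, labels):
    n = len(labels)
    m = n * (n - 1) // 2                      # M = N(N-1)/2
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            y = 1.0 if labels[i] == labels[j] else 0.0   # Y(i, j)
            total += prox[i][j] * y
    return total / m

# Toy symmetric proximity matrix for 3 points and a 2-group partition
prox = [[0.0, 0.1, 0.9],
        [0.1, 0.0, 0.8],
        [0.9, 0.8, 0.0]]
labels = [0, 0, 1]
g = gamma_statistic(prox, labels)  # only the pair (0, 1) shares a group
```
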
Validating a hierarchy of clustering schemes

A matrix called the cophenetic matrix, P_c, can represent the hierarchy diagram produced by a hierarchical algorithm. The element P_c(i, j) of the cophenetic matrix represents the proximity level at which the two vectors x_i and x_j are found in the same cluster for the first time. We may define a statistical index to measure the degree of similarity between the P_c and P (proximity) matrices. This index is called the Cophenetic Correlation Coefficient and is defined as:

CPCC = [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} d_ij · c_ij − μ_P · μ_C ] / sqrt( [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (d_ij^2 − μ_P^2) ] · [ (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} (c_ij^2 − μ_C^2) ] ) (31.14)

where M = N · (N − 1)/2 and N is the number of points in the data set. Also, μ_P and μ_C are the means of the matrices P and P_c respectively, defined as:

μ_P = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P(i, j),  μ_C = (1/M) Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} P_c(i, j) (31.15)

Moreover, d_ij and c_ij are the (i, j) elements of the P and P_c matrices respectively. The CPCC takes values between −1 and 1. A value of the index close to 1 is an indication of significant similarity between the two matrices. The Monte Carlo procedure described above is also used in this case of validation.

Validating a single clustering scheme

The goal here is to find the degree of match between a given clustering scheme C, consisting of n_c clusters, and the proximity matrix P. The index defined for this approach is Hubert's Γ statistic (or normalized Γ statistic). An additional matrix is used for the computation of the index:

Y(i, j) = 1 if x_i and x_j belong to different clusters, and 0 otherwise,

where i, j = 1, ..., N. The application of Monte Carlo techniques is again the way to test the random hypothesis for a given data set.

31.4.4 Relative Criteria

The basis of the above described validation methods is statistical testing.
Thus, the major drawback of techniques based on internal or external criteria is their high computational demands. A different validation approach is discussed in this section. It is based on relative criteria and does not involve statistical tests. The fundamental idea of this approach is to choose the best clustering scheme from a set of candidate schemes according to a pre-specified criterion. More specifically, the problem can be stated as follows:

Let P_alg be the set of parameters associated with a specific clustering algorithm (e.g. the number of clusters n_c). Among the clustering schemes C_i, i = 1, ..., n_c, defined by a specific algorithm for different values of the parameters in P_alg, choose the one that best fits the data set. We can then consider the following cases of the problem:

1. P_alg does not contain the number of clusters, n_c, as a parameter. In this case, the optimal parameter values are chosen as follows: the algorithm runs for a wide range of its parameters' values, and the largest range for which n_c remains constant is selected (usually n_c << N, the number of tuples). Then the values that correspond to the middle of this range are chosen as appropriate values of the P_alg parameters. This procedure also identifies the number of clusters that underlie our data set.

2. P_alg contains n_c as a parameter. The procedure of identifying the best clustering scheme is based on a validity index. Selecting a suitable performance index q, we proceed with the following steps:

• The clustering algorithm runs for all values of n_c between a minimum n_c_min and a maximum n_c_max, defined a priori by the user.
• For each value of n_c, the algorithm runs r times, using different sets of values for its other parameters (e.g. different initial conditions).
• The best value of the index q obtained for each n_c is plotted as a function of n_c.
Based on this plot we may identify the best clustering scheme. We have to stress that there are two approaches for defining the best clustering, depending on the behavior of q with respect to n_c. Thus, if the validity index does not exhibit an increasing or decreasing trend as n_c increases, we seek the maximum (minimum) of the plot. On the other hand, for indices
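The relative-criteria procedure above can be sketched as follows. A tiny 1-D k-means and a within-cluster-distance index stand in for the clustering algorithm and the validity index q; the data, parameter values, and index are hypothetical choices for illustration only.

```python
import random

# Relative criteria sketch: run a clustering algorithm for each n_c in
# [n_c_min, n_c_max], r times with different initial conditions, and
# keep the best index value per n_c; the resulting curve is inspected
# to pick the best clustering scheme.

def kmeans_1d(points, k, rng, iters=20):
    # minimal 1-D k-means with random initial centers
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: abs(p - centers[c]))].append(p)
        centers = [sum(cl) / len(cl) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

def index_q(points, centers):
    # toy validity index: total distance to the nearest center (lower is better)
    return sum(min(abs(p - c) for c in centers) for p in points)

rng = random.Random(0)
# three well-separated 1-D groups (hypothetical data)
points = [rng.gauss(m, 0.2) for m in (0.0, 5.0, 10.0) for _ in range(20)]

curve = {}
for n_c in range(2, 7):                       # n_c_min .. n_c_max
    runs = [index_q(points, kmeans_1d(points, n_c, rng)) for _ in range(5)]
    curve[n_c] = min(runs)                    # best q obtained for this n_c
# plotting curve against n_c reveals the best clustering scheme
```
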
