Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 63

600 Noa Ruschin Rimini and Oded Maimon

In the next phase of this research we plan to further develop the proposed scheme, which is based on fractal representation, to account for online changes in monitored processes. We plan to propose a novel type of online interactive SPC chart that enables dynamic inspection of non-linear, state-dependent processes. The presented algorithmic framework is applicable to many practical domains, for example visual analysis of the effect of operation sequence on product quality (see Ruschin-Rimini et al., 2009), visual analysis of customers' action history, visual analysis of products' defect-code history, and more. The developed application was utilized by the General Motors research labs located in Bangalore, India, for visual analysis of vehicle failure history.

References

Arbel, R. and Rokach, L., Classifier evaluation under limited resources, Pattern Recognition Letters, 27(14):1619–1631, 2006, Elsevier.
Barnsley, M., Fractals Everywhere, Academic Press, Boston, 1988.
Barnsley, M. and Hurd, L. P., Fractal Image Compression, A. K. Peters, Boston, 1993.
Cohen, S., Rokach, L. and Maimon, O., Decision tree instance space decomposition with grouped gain-ratio, Information Sciences, 177(17):3592–3612, 2007.
Da Cunha, C., Agard, B. and Kusiak, A., Data mining for improvement of product quality, International Journal of Production Research, 44(18-19):4027–4041, 2006.
Falconer, K., Techniques in Fractal Geometry, John Wiley & Sons, 1997.
Jeffrey, H. J., Chaos game representation of genetic sequences, Nucleic Acids Research, 18:2163–2170, 1990.
Keim, D. A., Information visualization and visual data mining, IEEE Transactions on Visualization and Computer Graphics, 7(1):100–107, 2002.
Maimon, O. and Rokach, L., Data mining by attribute decomposition with semiconductors manufacturing case study, in Data Mining for Design and Manufacturing: Methods and Applications, D. Braha (ed.), Kluwer Academic Publishers, pp. 311–336, 2001.
Moskovitch, R., Elovici, Y. and Rokach, L., Detection of unknown computer worms based on behavioral classification of the host, Computational Statistics and Data Analysis, 52(9):4544–4566, 2008.
Quinlan, J. R., C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
Rokach, L., Mining manufacturing data using genetic algorithm-based feature set decomposition, Int. J. Intelligent Systems Technologies and Applications, 4(1):57–78, 2008.
Rokach, L., Genetic algorithm-based feature set partitioning for classification problems, Pattern Recognition, 41(5):1676–1700, 2008.
Rokach, L., Decomposition methodology for classification tasks: a meta decomposer framework, Pattern Analysis and Applications, 9:257–271, 2006.
Rokach, L. and Maimon, O., Theory and applications of attribute decomposition, IEEE International Conference on Data Mining, IEEE Computer Society Press, pp. 473–480, 2001.
Rokach, L. and Maimon, O., Feature set decomposition for decision trees, Journal of Intelligent Data Analysis, 9(2):131–158, 2005.
Rokach, L. and Maimon, O., Data mining for improving the quality of manufacturing: a feature set decomposition approach, Journal of Intelligent Manufacturing, 17(3):285–299, 2006.

29 Visual Analysis of Sequences Using Fractal Geometry 601

Rokach, L., Maimon, O. and Arbel, R., Selective voting: getting more for less in sensor fusion, International Journal of Pattern Recognition and Artificial Intelligence, 20(3):329–350, 2006.
Rokach, L., Maimon, O. and Averbuch, M., Information retrieval system for medical narrative reports, Lecture Notes in Artificial Intelligence 3055, pp. 217–228, Springer-Verlag, 2004.
Rokach, L., Maimon, O. and Lavi, I., Space decomposition in data mining: a clustering approach, Proceedings of the 14th International Symposium on Methodologies for Intelligent Systems, Maebashi, Japan, Lecture Notes in Computer Science, Springer-Verlag, 2003, pp. 24–31.
Rokach, L., Romano, R. and Maimon, O., Mining manufacturing databases to discover the effect of operation sequence on the product quality, Journal of Intelligent Manufacturing, 2008.
Ruschin-Rimini, N., Maimon, O. and Romano, R., Visual analysis of quality-related manufacturing data using fractal geometry, working paper submitted for publication, 2009.
Weiss, C. H., Visual analysis of categorical time series, Statistical Methodology, 5:56–71, 2008.

30 Interestingness Measures - On Determining What Is Interesting

Sigal Sahar
Department of Computer Science, Tel-Aviv University, Israel
gales@post.tau.ac.il

Summary. As the size of databases increases, the sheer number of patterns mined from them can easily overwhelm users of the KDD process. Users run the KDD process because they are overloaded by data. To be successful, the KDD process needs to extract interesting patterns from large masses of data. In this chapter we examine methods of tackling this challenge: how to identify interesting patterns.

Key words: Interestingness Measures, Association Rules

Introduction

According to (Fayyad et al., 1996), "Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." Mining algorithms primarily focus on discovering patterns in data; for example, the Apriori algorithm (Agrawal and Shafer, 1996) outputs the exhaustive list of association rules that have at least the predefined support and confidence thresholds. Interestingness differentiates between the "valid, novel, potentially useful and ultimately understandable" mined association rules and those that are not, separating the interesting patterns from those that are not interesting. Thus, determining what is interesting, or interestingness, is a critical part of the KDD process. In this chapter we review the main approaches to determining what is interesting.
Figure 30.1 summarizes the three main types of interestingness measures, or approaches to determining what is interesting. Subjective interestingness explicitly relies on users' specific needs and prior knowledge. Since what is interesting to any user is ultimately subjective, these subjective interestingness measures will have to be used to reach any complete solution of determining what is interesting. (Silberschatz and Tuzhilin, 1996) differentiate between subjective and objective interestingness. Objective interestingness refers to measures of interest "where interestingness of a pattern is measured in terms of its structure and the underlying data used in the discovery process" (Silberschatz and Tuzhilin, 1996), but requires user intervention to select which of these measures to use and to initialize it. Impartial interestingness, introduced in (Sahar, 2001), refers to measures of interest that can be applied automatically to the output of any association rule mining algorithm to reduce the number of not-interesting rules independently of the domain, task and users.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_30, © Springer Science+Business Media, LLC 2010

Fig. 30.1. Types of Interestingness Approaches. [Figure: a taxonomy of approaches — subjective (expert/grammar, rule-by-rule classification, interest via not interesting) and objective (ranking patterns, pruning & constraints, summarization).]

30.1 Definitions and Notations

Let Λ be a set of attributes over the boolean domain. Λ is the superset of all attributes we discuss in this chapter. An itemset I is a set of attributes: I ⊆ Λ. A transaction is a subset of attributes of Λ that have the boolean value TRUE. We will refer to the set of transactions over Λ as a database. If exactly s% of the transactions in the database contain an itemset I then we say that I has support s, and express the support of I as P(I).
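The support computation defined above is straightforward to express in code. The following sketch (illustrative helper names, not from the chapter) treats a database as a list of transactions, each represented as the set of attributes that are TRUE in it:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every attribute of `itemset`.

    `transactions` is a list of sets of attributes (those with value TRUE
    in each transaction); `itemset` is a set of attributes, I ⊆ Λ.
    """
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

# Toy database over Λ = {"a", "b", "c"}
db = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
print(support({"a", "b"}, db))  # P({a, b}) = 2/4 = 0.5
```

With this helper, a support threshold simply filters itemsets whose computed support falls below it, which is the "large or frequent" criterion used in the next paragraph.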
Given a support threshold s we will call itemsets that have at least support s large or frequent. Let A and B be two sets of attributes such that A, B ⊆ Λ and A ∩ B = ∅. Let D be a set of transactions over Λ. Following the definition in (Agrawal and Shafer, 1996), an association rule A→B is defined to have support s% and confidence c% in D if s% of the transactions in D contain A ∪ B and c% of the transactions that contain A also contain B. For convenience, in an association rule A→B we will refer to A as the assumption and B as the consequent of the rule. We will express the support of A→B as P(A ∪ B). We will express the confidence of A→B as P(B|A) and denote it with confidence(A→B). (Agrawal and Shafer, 1996) presents an elegantly simple algorithm to mine the exhaustive list of association rules that have at least predefined support and confidence thresholds from a boolean database.

30.2 Subjective Interestingness

What is interesting to users is ultimately subjective; what is interesting to one user may be known or irrelevant, and therefore not interesting, to another user. To determine what is subjectively interesting, users' domain knowledge (or at least the portion of it that pertains to the data at hand) needs to be incorporated into the solution. In this section we review the three main approaches to this problem.

30.2.1 The Expert-Driven Grammatical Approach

In the first and most popular approach, the domain knowledge required to subjectively determine which rules are interesting is explicitly described through a predefined grammar. In this approach a domain expert is expected to express, using the predefined grammar, what is, or what is not, interesting. This approach was introduced by (Klemettinen et al., 1994), who were the first to apply subjective interestingness, and many other applications followed.
(Klemettinen et al., 1994) define pattern templates that describe the structure of interesting association rules through inclusive templates and the structure of not-interesting rules through restrictive templates. (Liu et al., 1997) present a formal grammar that allows the expression of imprecise or vague domain knowledge, the General Impressions. (Srikant et al., 1997) introduce user-defined constraints, including taxonomical constraints, into the mining process in the form of boolean expressions, and (Ng et al., 1998) introduce user constraints as part of an architecture that supports exploratory association rule mining. (Padmanabhan and Tuzhilin, 2000) use a set of predefined user beliefs in the mining process to output a minimal set of unexpected association rules with respect to that set of beliefs. (Adomavicius and Tuzhilin, 1997) define an action hierarchy to determine which association rules are actionable; actionability is an aspect of being subjectively interesting. (Adomavicius and Tuzhilin, 2001, Tuzhilin and Adomavicius, 2002) iteratively apply expert-driven validation operators to incorporate subjective interestingness in the personalization and bioinformatics domains.

In some cases the required domain knowledge can be obtained from a pre-existing knowledge base, thus eliminating the need to engage directly with a domain expert to acquire it. For example, in (Basu et al., 2001) the WordNet lexical knowledge-base is used to measure the novelty, an indicator of interest, of an association rule by assessing the dissimilarity between the assumption and the consequent of the rule. An example of a domain where such a knowledge base exists naturally is the detection of rule changes over time, as in (Liu et al., 2001a). In many domains, these knowledge-bases are not readily available.
In those cases the success of this approach is conditioned on the availability of a domain expert willing and able to complete the task of defining all the required domain knowledge. This is no easy task: the domain expert may unintentionally neglect to define some of the required domain knowledge, some of it may not be applicable across all cases, and some of it may change over time. Securing such a domain expert for the duration of the task is often costly and sometimes infeasible. But given the required domain knowledge, this approach can output the small set of subjectively interesting rules.

30.2.2 The Rule-By-Rule Classification Approach

In the second approach, taken in (Subramonian, 1998), the required domain knowledge base is constructed by classifying rules from prior mining sessions. This approach does not depend on the availability of domain experts to define the domain knowledge, but does require very intensive user interaction of a mundane nature. Although the knowledge base can be constructed incrementally, this, as the author says, can be a tedious process.

30.2.3 Interestingness Via What Is Not Interesting Approach

The third approach, introduced by (Sahar, 1999), capitalizes on an inherent aspect of the interestingness task: the majority of the mined association rules are not interesting. In this approach a user is iteratively presented with simple rules, with only one attribute in their assumption and one attribute in their consequent, for classification. These rules are selected so that a single user classification of a rule can imply that a large number of the mined association rules are also not interesting. The advantages of this approach are that it is simple enough for a naive user to apply without depending on a domain expert for input; that it can very quickly, with only a few questions, eliminate a significant portion of the not-interesting rules; and that it circumvents the need to define why a rule is interesting.
However, this approach is used only to reduce the size of the interestingness problem by substantially decreasing the number of potentially interesting association rules, rather than pinpointing the exact set of interesting rules. This approach has been integrated into the mining process in (Sahar, 2002b).

30.3 Objective Interestingness

The domain knowledge needed to apply subjective interestingness criteria is difficult to obtain. Although subjective interestingness is needed to reach the short list of interesting patterns, much can be done without explicitly using domain knowledge. The application of objective interestingness measures depends only on the structure of the data and the patterns extracted from it; some user intervention will still be required, for example to select the measure to be used. In this section we review the three main types of objective interestingness measures.

30.3.1 Ranking Patterns

To rank association rules according to their interestingness, a mapping, f, is introduced from the set of mined rules, Ω, to the domain of real numbers:

f : Ω → ℜ.    (30.1)

The number an association rule is mapped to is an indication of how interesting this rule is; the larger the number a rule is mapped to, the more interesting the rule is assumed to be. Thus, the mapping imposes an order, or ranking, of interest on a set of association rules. Ranking rules according to their interest has been suggested in the literature as early as (Piatetsky-Shapiro, 1991). (Piatetsky-Shapiro, 1991) introduced the first three principles of interestingness evaluation criteria, as well as a simple mapping that could satisfy them: P-S(A→B) = P(A ∪ B) − P(B) · P(A). Since then many different mappings, or rankings, have been proposed as measures of interest. Many definitions of such mappings, as well as their empirical and theoretical evaluations, can be found in (Klösgen, 1996, Bayardo Jr.
and Agrawal, 1999, Sahar and Mansour, 1999, Hilderman and Hamilton, 2000, Hilderman and Hamilton, 2001, Tan et al., 2002). The work on the principles introduced by (Piatetsky-Shapiro, 1991) has been expanded by (Major and Mangano, 1995, Kamber and Shinghal, 1996). (Tan et al., 2002) extends the studies of the properties and principles of the ranking criteria. (Hilderman and Hamilton, 2001) provide a very thorough review and study of these criteria, and introduce an interestingness theory for them.

30.3.2 Pruning and Application of Constraints

The mapping in Equation 30.1 can also be used as a pruning technique: prune as not interesting all the association rules that are mapped to an interest score lower than a user-defined threshold. Note that in this section we only refer to pruning and application of constraints performed using objective interestingness measures, not subjective ones, such as removing rules because they contain, or do not contain, certain attributes.

Additional methods can be used to prune association rules without requiring an interest mapping. Statistical tests such as the χ² test are used for pruning in (Brin et al., 1997, Liu et al., 1999, Liu et al., 2001b). These tests have parameters that need to be initialized. A collection of pruning methods is described in (Shah et al., 1999). Another type of pruning is the constraint-based approach of (Bayardo Jr. et al., 1999). To output a more concise list of rules from the mining process, the algorithm of (Bayardo Jr. et al., 1999) only mines rules that comply with the usual constraints of minimum support and confidence thresholds as well as with two new constraints. The first constraint is a user-specified consequent (subjective interestingness). The second, unprecedented, constraint is a user-specified minimum confidence improvement threshold.
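To make the ranking-then-pruning idea concrete, here is a small sketch (toy data and illustrative names, not from the chapter) that uses the Piatetsky-Shapiro mapping from Section 30.3.1 as the interest score f, ranks rules by it, and then prunes those scoring below a user-defined threshold:

```python
def ps_measure(p_ab, p_a, p_b):
    """Piatetsky-Shapiro interest: P(A ∪ B) − P(A) · P(B).

    p_ab is the support of the rule A→B (i.e., P(A ∪ B)); p_a and p_b
    are the supports of the assumption and the consequent. A score near
    zero indicates A and B look statistically independent.
    """
    return p_ab - p_a * p_b

# Hypothetical mined rules as (name, P(A ∪ B), P(A), P(B)) tuples.
rules = [
    ("bread→butter", 0.30, 0.50, 0.40),
    ("milk→eggs",    0.20, 0.40, 0.50),
    ("tea→sugar",    0.25, 0.30, 0.35),
]

# Ranking: the larger the score, the more interesting the rule is assumed to be.
ranked = sorted(rules, key=lambda r: ps_measure(*r[1:]), reverse=True)

# Pruning: drop rules whose score falls below a user-defined threshold.
threshold = 0.05
kept = [r for r in rules if ps_measure(*r[1:]) >= threshold]
```

The same skeleton works for any mapping f: only `ps_measure` changes, which is one reason so many alternative ranking criteria have been proposed in the literature cited above.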
Only rules whose confidence is at least the minimum confidence improvement threshold greater than the confidence of any of their simplifications are outputted; a simplification of a rule is formed by removing one or more attributes from its assumption.

30.3.3 Summarization of Patterns

Several distinct methods fall under the summarization approach. (Aggarwal and Yu, 1998) introduce a redundancy measure that summarizes all the rules at the predefined support and confidence levels very compactly by using more "complex" rules. The preference for complex rules is formally defined as follows: a rule C→D is redundant with respect to A→B if (1) A ∪ B = C ∪ D and A ⊂ C, or (2) C ∪ D ⊂ A ∪ B and A ⊆ C. A different type of summary, favoring less "complex" rules, was introduced by (Liu et al., 1999). (Liu et al., 1999) provide a summary of association rules with a single-attribute consequent using a subset of "direction-setting" rules, rules that represent the direction a group of non-direction-setting rules follows. The direction is calculated using the χ² test, which is also used to prune the mined rules prior to the discovery of direction-setting rules. (Liu et al., 2000) present a summary that simplifies the discovered rules by providing an overall picture of the relationships in the data and their exceptions. (Zaki, 2000) introduces an approach to mining only the non-redundant association rules, from which all the other rules can be inferred. (Zaki, 2000) also favors "less complex" rules, defining a rule C→D to be redundant if there exists another rule A→B such that A ⊆ C and B ⊆ D and both rules have the same confidence. (Adomavicius and Tuzhilin, 2001) introduce summarization through similarity-based rule grouping. The similarity measure is specified via an attribute hierarchy, organized by a domain expert who also specifies a level of rule aggregation in the hierarchy, called a cut.
The association rules are then mapped to aggregated rules by mapping to the cut, and the aggregated rules form the summary of all the mined rules.

(Toivonen et al., 1995) suggest clustering rules "that make statements about the same database rows [...]" using a simple distance measure, and introduce an algorithm to compute rule covers as short descriptions of large sets of rules. For this approach to work without losing any information, (Toivonen et al., 1995) make a monotonicity assumption, restricting the databases on which the algorithm can be used. (Sahar, 2002a) introduces a general clustering framework for association rules to facilitate the exploration of masses of mined rules by automatically organizing them into groups according to similarity. To simplify interpretation of the resulting clusters, (Sahar, 2002a) also introduces a data-inferred, concise representation of the clusters, the ancestor coverage.

30.4 Impartial Interestingness

To determine what is interesting, users first need to determine which interestingness measures to use for the task. Determining interestingness according to different measures can result in different sets of rules being outputted as interesting. This dependence of the output of the interestingness analysis on the measure used is clear when domain knowledge is applied explicitly, as in the case of the subjective interestingness measures (Section 30.2). When domain knowledge is applied implicitly, this dependence may not be as clear, but it still exists. As (Sahar, 2001) shows, objective interestingness measures depend implicitly on domain knowledge. This dependence is manifested during the selection of the objective interestingness measure to be used and, when applicable, during its initialization (for pruning and constraints) and the interpretation of the results (for summarization).
(Sahar, 2001) introduces a new type of interestingness measure, as part of an interestingness framework, that can be applied automatically to eliminate a portion of the rules that is not interesting, as in Figure 30.2. This type of interestingness is called impartial interestingness because it is domain-independent, task-independent, and user-independent, making it impartial to all considerations affecting other interestingness measures. Since the impartial interestingness measures do not require any user intervention, they can be applied sequentially and automatically, directly following the Data Mining process, as depicted in Figure 30.2. The impartial interestingness measures preprocess the mined rules to eliminate those rules that are not interesting regardless of the domain, task and user, and so they form the Interestingness PreProcessing Step. This step is followed by Interestingness Processing, which includes the application of objective (when needed) and subjective interestingness criteria.

Fig. 30.2. Framework for Determining Interestingness. [Figure: rules outputted by a data-mining algorithm → Interestingness PreProcessing (impartial criteria; includes several techniques) → Interestingness PreProcessed rules → Interestingness Processing (objective and subjective criteria: pruning, summarization, etc.) → interesting rules.]

To be able to define impartial measures, (Sahar, 2001) assumes that the goal of the interestingness analysis on a set of mined rules is to find a subset of interesting rules, rather than to infer from the mined rules other rules, not themselves mined, that could potentially be interesting. An example of an impartial measure is Overfitting (Sahar, 2001): the deletion of every rule r: A ∪ C→B for which there exists another mined rule r′: A→B such that confidence(r′) ≥ confidence(r).
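The Overfitting measure just defined lends itself to a direct implementation. The sketch below (an illustrative rule representation, not from the chapter) stores mined rules as (assumption, consequent) pairs of attribute sets with their confidences, and deletes every rule dominated by a strictly simpler rule with the same consequent:

```python
def prune_overfitting(rules):
    """Impartial 'Overfitting' pruning sketch.

    `rules` maps (assumption, consequent) pairs, both frozensets of
    attributes, to confidence values. A rule A∪C→B is deleted when some
    other mined rule A→B, with A a proper subset of A∪C, has confidence
    at least as high.
    """
    kept = {}
    for (assumption, consequent), conf in rules.items():
        dominated = any(
            other_cons == consequent
            and other_assum < assumption   # proper subset: a strictly simpler rule
            and other_conf >= conf
            for (other_assum, other_cons), other_conf in rules.items()
        )
        if not dominated:
            kept[(assumption, consequent)] = conf
    return kept

rules = {
    (frozenset({"a"}), frozenset({"b"})): 0.90,
    (frozenset({"a", "c"}), frozenset({"b"})): 0.85,  # dominated by a→b
    (frozenset({"a", "d"}), frozenset({"b"})): 0.95,  # survives: higher confidence
}
pruned = prune_overfitting(rules)
```

Because the criterion consults only the mined rules themselves (no thresholds, no domain knowledge, no user input), it can run fully automatically as part of the Interestingness PreProcessing Step.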
30.5 Concluding Remarks

Characterizing what is interesting is a difficult problem, primarily because what is interesting is ultimately subjective. Numerous attempts have been made to formulate these qualities, ranging from evidence and simplicity to novelty and actionability, with no formal definition of "interestingness" emerging so far. In this chapter we reviewed the three main approaches to tackling the challenge of discovering which rules are interesting under certain assumptions.

Some of the interestingness measures reviewed have been incorporated into the mining process as opposed to being applied after it. (Spiliopoulou and Roddick, 2000) discuss the advantages of processing the set of rules after the mining process, and introduce the concept of higher order mining, showing that rules with higher order semantics can be extracted by processing the mined results. (Hipp and Günter, 2002) argue that pushing constraints into the mining process "[...] is based on an understanding of KDD that is no longer up-to-date", as KDD is an iterative discovery process rather than "pure hypothesis investigation". There is no consensus on whether it is advisable to push constraints into the mining process. An optimal solution is likely to be produced through a balanced combination of these approaches; some interestingness measures (such as the impartial ones) can be pushed into the mining process without overfitting its output to match the subjective interests of only a small audience, permitting further interestingness analysis that tailors the output to each user's subjective needs.

Data Mining algorithms output patterns. Interestingness discovers the potentially interesting patterns. To be successful, the KDD process needs to extract the interesting patterns from large masses of data. That makes interestingness a very important capability in the extremely data-rich environment in which we live.
It is likely that our environment will continue to inundate us with data, making determining interestingness critical for success.

References

Adomavicius, G. and Tuzhilin, A. (1997). Discovery of actionable patterns in databases: The action hierarchy approach. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 111–114, Newport Beach, CA, USA. AAAI Press.
Adomavicius, G. and Tuzhilin, A. (2001). Expert-driven validation of rule-based user models in personalization applications. Data Mining and Knowledge Discovery, 5(1/2):33–58.
Aggarwal, C. C. and Yu, P. S. (1998). A new approach to online generation of association rules. Research Report RC 20899, IBM T. J. Watson Research Center.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A. I. (1996). Advances in Knowledge Discovery and Data Mining, chapter 12: Fast Discovery of Association Rules, pages 307–328. AAAI Press/The MIT Press, Menlo Park, California.
Basu, S., Mooney, R. J., Pasupuleti, K. V., and Ghosh, J. (2001). Evaluating the novelty of text-mined rules using lexical knowledge. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–238, San Francisco, CA, USA.
Bayardo Jr., R. J. and Agrawal, R. (1999). Mining the most interesting rules. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 145–154, San Diego, CA.
