Data Mining and Knowledge Discovery Handbook, 2nd Edition, part 38

350 Jean-Francois Boulicaut and Baptiste Jeudy

… describe a mining algorithm but rather a pruning technique for constraints that are neither anti-monotonic nor monotonic. Considering a sub-lattice Å of 2^I, the problem is to decide whether this sub-lattice can be pruned. A sub-lattice is characterized by its maximal element M and its minimal element m, i.e., the sub-lattice is the collection of all itemsets S such that m ⊆ S ⊆ M. To prune this sub-lattice, one must prove that none of its elements can satisfy the constraint C. To check this, the authors introduce the concept of negative witness: a negative witness for C in the sub-lattice Å is an itemset W such that ¬C(W) ⇒ ∀X ∈ Å, ¬C(X). Therefore, if the constraint is not satisfied by the negative witness, then the whole sub-lattice can be pruned. Finding witnesses for anti-monotonic or monotonic constraints is easy: m is the witness for all anti-monotonic constraints and M for all monotonic ones. The authors then show how to compute witnesses efficiently for various tough constraints. For instance, for AVG(S) > σ, a witness is the set m ∪ {i ∈ M | i.v > σ}, where i.v denotes the value of item i. The authors also give an algorithm (linear in the size of I) to compute a witness for the difficult constraint VAR(S) > σ, where VAR denotes the variance.

17.4.3 Ad-hoc Strategies

Apart from generic algorithms, many algorithms have been designed to cope with specific classes of constraints. We select only two examples.

The FIC algorithm (Pei et al., 2001) does a depth-first exploration of the itemset lattice. It is very efficient due to its clever data structure, a prefix-tree used to store the database. This algorithm can compute the extended theory for a conjunction C_am ∧ C_m ∧ C′ where C′ is convertible anti-monotonic or monotonic. A constraint C′ is convertible anti-monotonic if there exists an order on the items such that, if itemsets are written using this order, every prefix of an itemset satisfying C′ also satisfies C′.
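Two notions from this part of the section can be checked on a toy example: the negative witness for AVG(S) > σ, and the prefix property behind convertible anti-monotonicity. The sketch below is only an illustration under stated assumptions: the item values in `VALUE`, the function names, and the brute-force enumeration are hypothetical and are not taken from the cited papers.

```python
from itertools import combinations

# Hypothetical item values (the i.v of the text); not taken from the chapter.
VALUE = {"a": 1, "b": 2, "c": 9, "d": 4}

def avg(items):
    """Average value of a non-empty collection of items."""
    return sum(VALUE[i] for i in items) / len(items)

def avg_witness(m, M, sigma):
    """Witness for AVG(S) > sigma in the sub-lattice m <= S <= M:
    m plus every item of M whose value exceeds sigma."""
    return set(m) | {i for i in M if VALUE[i] > sigma}

def can_prune(m, M, sigma):
    """True if the witness fails AVG > sigma, so no itemset of the
    sub-lattice can satisfy the constraint."""
    w = avg_witness(m, M, sigma)
    return not w or avg(w) <= sigma

def sublattice(m, M):
    """All itemsets S with m <= S <= M (brute force, for checking only)."""
    extra = sorted(set(M) - set(m))
    for k in range(len(extra) + 1):
        for added in combinations(extra, k):
            yield set(m) | set(added)

def prefixes_by_decreasing_value(itemset):
    """Prefixes of the itemset once its items are sorted by decreasing value."""
    ordered = sorted(itemset, key=lambda i: VALUE[i], reverse=True)
    return [ordered[:k] for k in range(1, len(ordered) + 1)]

m, M = {"a", "b"}, {"a", "b", "c", "d"}
print(can_prune(m, M, sigma=5))  # witness {a, b, c} has average 4
print(can_prune(m, M, sigma=3))  # witness {a, b, c, d} has average 4
# Prefix property: {a, c, d} has average above 4, and so does each prefix
# when the items are written by decreasing value (c, d, a).
print([avg(p) > 4 for p in prefixes_by_decreasing_value({"a", "c", "d"})])
```

The brute-force `sublattice` generator exists only to confirm the pruning decision on small inputs; the point of a witness is that a single evaluation replaces that enumeration.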
For instance, AVG(S) > σ is convertible anti-monotonic if the items i are ordered by decreasing value i.v. The main problem with convertible constraints is that a conjunction of convertible constraints is generally not convertible.

Another example of an ad-hoc strategy is used in the c-Spade algorithm (Zaki, 2000). This algorithm extracts constrained sequences in which each event is dated. One of the constraints, the max-gap constraint, states that two consecutive events occurring in a pattern must not be further apart than a given maximum gap. This constraint is neither anti-monotonic nor monotonic, and a specific algorithm has been designed for it.

17.4.4 Other Directions of Research

Among others, let us introduce here three important directions of research.

Adaptive Pruning Strategies

We mentioned the trade-off between anti-monotonic pruning, which is known to be quite efficient, and pruning based on non anti-monotonic constraints. Since the selectivity of the various constraints is generally unknown, an exciting challenge is to look for adaptive strategies that can choose the pruning strategy dynamically. (Bonchi et al., 2003A, Bonchi et al., 2003B) propose algorithms for frequent itemsets under syntactical monotonic constraints. (Albert-Lorincz and Boulicaut, 2003) considers frequent sequence mining under regular expression constraints. These are promising approaches to widen the applicability of constraint-based mining techniques in real contexts.

17 Constraint-based Data Mining 351

Combining Constraints and Condensed Representations

A few papers, e.g., (Boulicaut and Jeudy, 2000, Bonchi and Lucchese, 2004), deal with the problem of extracting constrained condensed representations. In these works, the aim is to compute a condensed representation of the extended theory Th_x(D, 2^I, C_am ∧ C_m, freq).
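Returning briefly to the max-gap constraint described for c-Spade, the following sketch shows on a toy dated sequence why it is neither anti-monotonic nor monotonic. It is not the c-Spade algorithm: the function name, the data, and the backtracking search are assumptions made for illustration only.

```python
def occurs_with_max_gap(sequence, pattern, max_gap):
    """Does `pattern` occur in `sequence` so that consecutive matched
    events are at most `max_gap` time units apart?

    `sequence` is a list of (time, item) pairs sorted by time.
    """
    def extend(pos, last_time, start_index):
        if pos == len(pattern):
            return True
        for idx in range(start_index, len(sequence)):
            time, item = sequence[idx]
            # Events are sorted by time, so once the gap is exceeded
            # no later event can be used as the next match.
            if last_time is not None and time - last_time > max_gap:
                break
            if item == pattern[pos] and extend(pos + 1, time, idx + 1):
                return True
        return False

    return extend(0, None, 0)

# Hypothetical dated sequence.
seq = [(1, "A"), (3, "B"), (5, "C"), (11, "D")]

# A sub-pattern can fail while its super-pattern succeeds
# (so the constraint is not anti-monotonic) ...
print(occurs_with_max_gap(seq, ["A", "B", "C"], max_gap=2))
print(occurs_with_max_gap(seq, ["A", "C"], max_gap=2))
# ... and a super-pattern can fail while its sub-pattern succeeds
# (so it is not monotonic either).
print(occurs_with_max_gap(seq, ["A", "B"], max_gap=2))
print(occurs_with_max_gap(seq, ["A", "B", "D"], max_gap=2))
```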
In (Boulicaut and Jeudy, 2000), the authors use free itemsets, i.e., their algorithm computes the extended theory Th_x(D, 2^I, C_am ∧ C_m ∧ C_free, freq). In (Bonchi and Lucchese, 2004), the authors use closed itemsets, i.e., their algorithm computes the extended theory Th_x(D, 2^I, C_am ∧ C_m ∧ C_clos, freq). However, in these two works, the definitions of free sets and closed sets have been modified so that the extended theory Th_x(D, 2^I, C_am ∧ C_m, freq) can be regenerated from the extracted theories. This kind of research combines the advantages of both condensed representations and constrained mining, which results in very efficient algorithms.

Constraint-based Mining of more Complex Pattern Domains

Most of the recent results have concerned simple local pattern discovery tasks like the ones based on itemsets or sequences. We believe that inductive querying is much more general. Many open problems remain to be addressed, however. For instance, even constraint-based mining of association rules is already much harder than constraint-based mining of itemsets (Lakshmanan et al., 1999, Jeudy and Boulicaut, 2002). The recent work on the MINE RULE query language (Meo et al., 1998) is also typical of the difficulty of optimizing constraint-based association rule mining (Meo, 2003).

When considering model mining under constraints (e.g., classifier design or clustering), only very preliminary approaches are available (see, e.g., (Garofalakis and Rastogi, 2000)). We think that this will be a major issue for research in the next few years. For instance, for clustering, it seems important to go further than the classical similarity optimization constraints and make it possible to specify other constraints on clusters (e.g., enforcing that some objects are or are not within the same clusters).

17.5 Conclusion

In this chapter, we have considered constraint-based mining approaches, i.e., the core techniques for inductive querying.
This domain has been studied extensively for simple pattern domains like itemsets or sequences. Rather general forms of inductive queries on these domains (e.g., arbitrary boolean expressions over monotonic and anti-monotonic constraints) have been considered. Besides the many ad-hoc algorithms, an interesting effort has concerned generic algorithms. Many open problems remain: how to solve tough constraints? How to design relevant approximation or relaxation schemes? How to combine constraint-based mining with condensed representations, not only for simple pattern domains but also for more complex ones?

Moreover, within the inductive database framework, the problem is to optimize sequences of queries, and typically sequences of correlated inductive queries. It is crucial that the optimization of a query, and thus constraint-based mining, also take into account the previously solved queries. Looking for the formal properties between inductive queries, especially containment, is thus a major priority. Here again, we believe that condensed representations might play a major role.

Last but not least, a quite challenging problem is to consider where the constraints come from. Analysts can think in terms of constraints or declarative specifications that are not supported by the available solvers: an obvious example could be unexpectedness or novelty w.r.t. some explicit background knowledge. Deriving appropriate inductive queries, based on a limited number of primitives (and some associated solvers), from the constraints expressed by the analysts is challenging.

References

R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI Press, 1996.
H. Albert-Lorincz and J.-F. Boulicaut. Mining frequent sequential patterns under regular expressions: a highly adaptative strategy for pushing constraints. In Proc. SIAM DM’03, pages 316–320, 2003.
Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed itemsets. In Proc. CL 2000, volume 1861 of LNCS, pages 972–986. Springer-Verlag, 2000.
Y. Bastide, R. Taouil, N. Pasquier, G. Stumme, and L. Lakhal. Mining frequent patterns with counting inference. SIGKDD Explorations, 2(2):66–75, 2000.
R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. ACM SIGMOD’98, pages 85–93, 1998.
F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. Adaptive constraint pushing in frequent pattern mining. In Proc. PKDD’03, volume 2838 of LNAI, pages 47–58. Springer-Verlag, 2003A.
F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. Examiner: Optimized level-wise frequent pattern mining with monotone constraints. In Proc. IEEE ICDM’03, pages 11–18, 2003B.
F. Bonchi, F. Giannotti, A. Mazzanti, and D. Pedreschi. Exante: Anticipated data reduction in constrained pattern mining. In Proc. PKDD’03, volume 2838 of LNAI, pages 59–70. Springer-Verlag, 2003C.
F. Bonchi and C. Lucchese. On closed constrained frequent pattern mining. In Proc. IEEE ICDM’04 (In Press), 2004.
J.-F. Boulicaut. Inductive databases and multiple uses of frequent itemsets: the cInQ approach. In Database Technologies for Data Mining - Discovering Knowledge with Inductive Queries, volume 2682 of LNCS, pages 1–23. Springer-Verlag, 2004.
J.-F. Boulicaut and A. Bykowski. Frequent closures as a concise representation for binary Data Mining. In Proc. PAKDD’00, volume 1805 of LNAI, pages 62–73. Springer-Verlag, 2000.
J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Approximation of frequency queries by mean of free-sets. In Proc. PKDD’00, volume 1910 of LNAI, pages 75–85. Springer-Verlag, 2000.
J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery, 7(1):5–22, 2003.
J.-F. Boulicaut and B. Jeudy. Using constraint for itemset mining: should we prune or not? In Proc. BDA’00, pages 221–237, 2000.
J.-F. Boulicaut and B. Jeudy. Mining free-sets under constraints. In Proc. IEEE IDEAS’01, pages 322–329, 2001.
C. Bucila, J. E. Gehrke, D. Kifer, and W. White. Dualminer: A dual-pruning algorithm for itemsets with constraints. Data Mining and Knowledge Discovery, 7(4):241–272, 2003.
D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. In Proc. IEEE ICDE’01, pages 443–452, 2001.
A. Bykowski and C. Rigotti. DBC: a condensed representation of frequent patterns for efficient mining. Information Systems, 28(8):949–977, 2003.
T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. In Proc. PKDD’02, volume 2431 of LNAI, pages 74–85. Springer-Verlag, 2002.
B. Crémilleux and J.-F. Boulicaut. Simplest rules characterizing classes generated by delta-free sets. In Proc. ES 2002, pages 33–46. Springer-Verlag, 2002.
L. De Raedt. A perspective on inductive databases. SIGKDD Explorations, 4(2):69–77, 2003.
L. De Raedt, M. Jaeger, S. Lee, and H. Mannila. A theory of inductive query answering. In Proc. IEEE ICDM’02, pages 123–130, 2002.
L. De Raedt and S. Kramer. The levelwise version space algorithm and its application to molecular fragment finding. In Proc. IJCAI’01, pages 853–862, 2001.
M. M. Garofalakis and R. Rastogi. Scalable Data Mining with model constraints. SIGKDD Explorations, 2(2):39–48, 2000.
M. M. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. In Proc. VLDB’99, pages 223–234, 1999.
B. Goethals and M. J. Zaki, editors. Proc. of the IEEE ICDM 2003 Workshop on Frequent Itemset Mining Implementations, volume 90 of CEUR Workshop Proceedings, 2003.
D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharm. Discovering all most specific sentences. ACM Transactions on Database Systems, 28(2):140–174, 2003.
T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58–64, 1996.
B. Jeudy and J.-F. Boulicaut. Optimization of association rule mining queries. Intelligent Data Analysis, 6(4):341–357, 2002.
D. Kifer, J. E. Gehrke, C. Bucila, and W. White. How to quickly find a witness. In Proc. ACM PODS’03, pages 272–283, 2003.
S. Kramer, L. De Raedt, and C. Helma. Molecular feature mining in HIV data. In Proc. ACM SIGKDD’01, pages 136–143, 2001.
L. V. Lakshmanan, R. Ng, J. Han, and A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. In Proc. ACM SIGMOD’99, pages 157–168, 1999.
D.-I. Lin and Z. M. Kedem. Pincer search: An efficient algorithm for discovering the maximum frequent sets. IEEE Transactions on Knowledge and Data Engineering, 14(3):553–566, 2002.
H. Mannila and H. Toivonen. Multiple uses of frequent sets and condensed representations. In Proc. KDD’96, pages 189–194. AAAI Press, 1996.
H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
C. Mellish. The description identification problem. Artificial Intelligence, 52(2):151–168, 1992.
R. Meo. Optimization of a language for Data Mining. In Proc. ACM SAC’03 - Data Mining Track, pages 437–444, 2003.
R. Meo, G. Psaila, and S. Ceri. An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, 2(2):195–224, 1998.
T. Mitchell. Generalization as search. Artificial Intelligence, 18(2):203–226, 1980.
R. Ng, L. V. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. ACM SIGMOD’98, pages 13–24, 1998.
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25–46, 1999.
J. Pei, G. Dong, W. Zou, and J. Han. On computing condensed frequent pattern bases. In Proc. IEEE ICDM’02, pages 378–385, 2002.
J. Pei, J. Han, and L. V. S. Lakshmanan. Mining frequent itemsets with convertible constraints. In Proc. IEEE ICDE’01, pages 433–442, 2001.
R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. ACM SIGKDD’97, pages 67–73, 1997.
M. J. Zaki. Sequence mining in categorical domains: incorporating constraints. In Proc. ACM CIKM’00, pages 422–429, 2000.

18 Link Analysis

Steve Donoho

Mantas, Inc.

Summary. Link analysis is a collection of techniques that operate on data that can be represented as nodes and links. This chapter surveys a variety of techniques including subgraph matching, finding cliques and K-plexes, maximizing spread of influence, visualization, finding hubs and authorities, and combining with traditional techniques (classification, clustering, etc.). It also surveys applications including social network analysis, viral marketing, Internet search, fraud detection, and crime prevention.

Key words: Link analysis, Social network analysis, Graph theory

18.1 Introduction

The term "link analysis" does not refer to one specific technique or algorithm. Rather, it refers to a collection of techniques that are bound together by the type of data they operate on. Link analysis techniques are applied to data that can be represented as nodes and links, as in Figure 18.1.

A node represents an entity such as a person, a document, or a bank account. Nodes are sometimes referred to as "vertices." A link represents a relationship between two entities such as a parent/child relationship between two people, a reference relationship between two documents, or a transaction between two bank accounts.
Links are sometimes referred to as "edges." Because links show relationships among entities, this type of data is often referred to as relational data. This contrasts with the attribute-vector data used by many other unsupervised and supervised Data Mining techniques. In most standard Data Mining techniques, data is represented as a set of tuples (each a vector of attribute values). Each tuple represents an entity, but there is no explicit data about relationships among entities. In link analysis, information exists about the relationships among entities, and analysis of these relationships is the focus of the field.

O. Maimon, L. Rokach (eds.), Data Mining and Knowledge Discovery Handbook, 2nd ed., DOI 10.1007/978-0-387-09823-4_18, © Springer Science+Business Media, LLC 2010

Fig. 18.1. Node and Link Data Used by Link Analysis Techniques.

The roots of link analysis predate the use of modern computers. Law enforcement officials have carried out manual link analysis for many years. When a crime is investigated, a network such as the one in Figure 18.1 is drawn where the nodes represent people, weapons, crime scenes, etc. One person may be linked to another if they are family, friends, roommates, or business partners. A person may be linked to a weapon if it is registered in his name or if it was found at his home. Once a network of relationships is drawn out, the bigger picture of a crime emerges from the details. Holes in the network become apparent, and they are areas for further investigation. Hypotheses can be formed and tested.

Sociologists also performed manual link analysis long before there were computers. The structure of a clan or tribe would be mapped out with nodes representing people and links representing family, work, or social relationships. From this a sociologist could deduce who held powerful positions within the clan, who might influence whom, how information might spread within the clan, and what factions might arise.
The advent of computers allowed these techniques to become much more widespread and to be applied on a much larger scale. All 10 million of a bank’s customers can be analysed for money laundering relationships. The hundreds of millions of documents on the Internet can be analysed to determine which are most respected and reliable. Large communities can be analysed to determine how information and opinions spread and who the most influential individuals are.

This chapter surveys the techniques that fall under the umbrella of link analysis and how these techniques are being applied. Section 18.2 presents some key concepts from the field of Social Network Analysis. Section 18.3 examines how link analysis techniques are used to improve search engine results. Section 18.4 looks at recent link analysis ideas emerging from the field of viral marketing. Section 18.5 shows how fraud detection and law enforcement have presented unique challenges and opportunities for link analysis. Finally, Section 18.6 surveys recent combinations of link analysis with traditional Data Mining techniques.

18.2 Social Network Analysis

The field of social network analysis (Wasserman, 1994, Hanneman, 2001) has developed over many years as sociologists developed formal methods of studying groups of people and their relationships. When studying a social network, there are many questions sociologists are interested in answering:

1. Which people are powerful?
2. Which people influence other people?
3. How does information spread within the network?
4. Who is relatively isolated, and who is well connected?
5. In a disagreement, who is likely to side with whom?
6. What roles do people play in an organization, and who has similar roles?

While concepts such as powerful, influential, isolated, and connected are somewhat subjective, social network analysis methods give us a baseline for measuring and making comparisons.

Fig. 18.2. Three Networks to Illustrate an Individual’s Power within a Network.

Many things can make a person powerful within a group. Consider the shaded nodes in the three networks shown in Figure 18.2 (Hanneman, 2001). The person at the center of the star intuitively seems more powerful than the one in the circle or the one at the end of the line. If the people in the star want to communicate with each other, they have to go through the center person, and that person has the power to either facilitate or hamper communication. If the people in the star want to engage in business, they have to go through the person in the center, and that person has the power to charge a fee as the middleman. In contrast, the shaded node in the circular network is the most convenient path of communication or trade for some nodes, but he is not the only path. Intuitively, he has less power than the center of the star. The shaded node at the end of the line is dependent on others for communication and trade but has no one who is dependent on him. Intuitively, he has little or no power.

The networks in Figure 18.2 illustrate how "centrality" is one measure of power. The node at the center of the star derives its power from being in the center of its network. The shaded nodes in the circle and line are less central to their networks and are therefore less powerful. Some quantitative methods of measuring centrality are:

1. Degree. The shaded node in the star network is linked to six other nodes and thus has a degree of six. All the other nodes in the star have a degree of one and are comparatively less central. All the nodes in the circle have the same degree: two. The shaded node in the line has a degree of one and is thus slightly less central than other nodes in the line with degree two.
2. Closeness. The average distance from the shaded node in the star to all other nodes is 1.0. This node has very direct access to everyone else. Other nodes in the star have an average distance of 1.8. All the nodes in the circle have an average distance of 2.0. The node at the end of the line has an average distance of 3.5, whereas the node in the center of the line has an average distance of 2.0.
3. Betweenness. The shaded node in the star lies between all 15 pairs of other nodes. In the circle there are two paths between each pair of nodes. The shaded node in the circle is on a path between all 15 pairs of other nodes, but since there is an alternative path between each pair, the shaded node is on only 50% of the paths between pairs. The node at the end of the line is between no pairs. The node one from the end of the line is on paths between 5 pairs (33% of 15 pairs). The node at the center of the line is on paths between 9 pairs (60% of 15 pairs).
4. Cutpoints. Related to betweenness, cutpoints are nodes that, if removed, divide the network into unconnected systems. These nodes hold particular power because they are the only point of contact between otherwise disconnected networks. If the center of the star is removed, six disconnected systems result. If a node in the circle is removed, the network is still connected. If a non-end node is removed from the line, two disconnected systems result.

A clique is a small, highly interconnected group within a larger network. Cliques are of interest for several reasons. Ideas or information may spread extremely quickly within a clique because of the high connectivity. Members of a clique often act and behave as a cohesive unit. Disputes may form between cliques ("factions"). A person can be described with respect to the clique(s) they belong to. A person who is only connected to people in his clique is called a "local" and is strongly influenced by the clique. A person who belongs to many cliques is called a "cosmopolitan" and serves to bring outside ideas and information into a clique.
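The four centrality measures listed above can be reproduced on small versions of the star, circle, and line networks. The sketch below assumes seven nodes per network, which matches the degrees, average distances, and pair counts quoted in the text; the function names are hypothetical. Note that `pairs_between` counts shortest paths only, so for the circle it gives a different figure from the text's all-paths reasoning (50%).

```python
from collections import deque
from itertools import combinations

def star(n_leaves=6):
    """Node 0 is the center; nodes 1..n_leaves are linked only to it."""
    g = {0: set()}
    for leaf in range(1, n_leaves + 1):
        g[0].add(leaf)
        g[leaf] = {0}
    return g

def line(n=7):
    """Nodes 0..n-1 in a row; node 0 is an end, node n // 2 the middle."""
    g = {i: set() for i in range(n)}
    for i in range(n - 1):
        g[i].add(i + 1)
        g[i + 1].add(i)
    return g

def ring(n=7):
    """A line whose two ends are also linked (the 'circle' network)."""
    g = line(n)
    g[0].add(n - 1)
    g[n - 1].add(0)
    return g

def distances(g, source):
    """Breadth-first search distances from source (connected graph assumed)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree(g, v):
    return len(g[v])

def average_distance(g, v):
    """Closeness expressed as the mean distance from v to every other node."""
    d = distances(g, v)
    return sum(d[u] for u in g if u != v) / (len(g) - 1)

def pairs_between(g, v):
    """Number of pairs of other nodes with a shortest path through v."""
    dv = distances(g, v)
    others = [u for u in g if u != v]
    return sum(1 for s, t in combinations(others, 2)
               if dv[s] + dv[t] == distances(g, s)[t])

def components_without(g, v):
    """Number of connected pieces left after removing v (cutpoint test)."""
    seen, parts = set(), 0
    for start in g:
        if start == v or start in seen:
            continue
        parts += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for w in g[u]:
                if w != v and w not in seen:
                    seen.add(w)
                    stack.append(w)
    return parts

for name, g, node in [("star center", star(), 0), ("circle", ring(), 0),
                      ("line end", line(), 0), ("line middle", line(), 3)]:
    print(name, degree(g, node), average_distance(g, node),
          pairs_between(g, node), components_without(g, node))
```

A node is a cutpoint exactly when `components_without` returns more than one piece.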
The strictest definition of a clique is a complete subgraph (all nodes in the clique must be linked to all other nodes). Two more relaxed definitions are:

1. K-plexes. A group of N nodes is a K-plex if each of the nodes is connected to at least N-K other nodes in the group. Intuitively, if K=2 then every member of the clique has to be connected to at least N-2 members, i.e., to all but one of the other N-1 members.
2. K-cores. The definition of a K-core is slightly more relaxed than that of a K-plex. A K-core is a maximal group of nodes all of which are connected to at least K other nodes in the group. For example, if K=4 then every member of the clique is connected to at least 4 other clique members.

The concept of "equivalence" is very important within social networks. It makes it possible to determine if a person is playing a particular role within a network. This allows both intra-network comparisons (one node has the same role as another node within one network) and inter-network comparisons (two nodes in different networks are playing the same role). Two measures of equivalence are:

1. Structural equivalence. This is a strict measure of equivalence between two nodes. Two nodes are exactly structurally equivalent if they are linked to exactly the same other nodes. If they are not exactly equivalent, the degree of partial structural equivalence can be measured using the degree of overlap in the nodes they are linked to.
2. Regular equivalence. Regular equivalence is a less strict definition than structural equivalence. Two nodes have regular equivalence if the nodes they are linked to are regular equivalents. For example, Fred Flintstone is the regular equivalent of Barney Rubble because Fred is the husband of Wilma, and Barney is the husband of Betty, and Wilma and Betty are regular equivalents.

On a broader scale, equivalence of nodes lays the groundwork for measuring the similarity of one whole social network to another whole social network.
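The relaxed clique definitions and the overlap-based measure of partial structural equivalence described above can be sketched on a toy graph. Everything here is illustrative: the graph and function names are hypothetical, and using the Jaccard ratio as "the degree of overlap" is an assumption, since the text does not name a specific formula.

```python
def build_graph(edges):
    """Undirected graph as a dict mapping each node to its set of neighbours."""
    g = {}
    for u, v in edges:
        g.setdefault(u, set()).add(v)
        g.setdefault(v, set()).add(u)
    return g

def is_k_plex(g, group, k):
    """Each of the N group members is linked to at least N - k other members."""
    group = set(group)
    return all(len(g[v] & group) >= len(group) - k for v in group)

def k_core(g, k):
    """Largest set of nodes in which everyone has at least k neighbours
    inside the set, found by repeatedly discarding nodes that fall short."""
    alive = set(g)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if len(g[v] & alive) < k:
                alive.discard(v)
                changed = True
    return alive

def structural_overlap(g, a, b):
    """Jaccard overlap of the neighbour sets of a and b, ignoring a and b
    themselves; 1.0 means exact structural equivalence."""
    na, nb = g[a] - {b}, g[b] - {a}
    union = na | nb
    return len(na & nb) / len(union) if union else 1.0

# Hypothetical network: a, b, c, d form a 4-clique minus the a-d link,
# and e hangs off d.
g = build_graph([("a", "b"), ("a", "c"), ("b", "c"),
                 ("b", "d"), ("c", "d"), ("d", "e")])

print(is_k_plex(g, "abcd", 1))   # K=1 is the strict clique definition
print(is_k_plex(g, "abcd", 2))   # one missing link per member is tolerated
print(sorted(k_core(g, 2)))
print(structural_overlap(g, "b", "c"), structural_overlap(g, "a", "d"))
```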
This is useful for matching a network against a known template in order to identify the nature of the network, as will be seen in Section 18.5 on Fraud Detection and Law Enforcement.

Many groups such as academic circles, fraud rings, business circles, shoppers with common interests, and professional societies can be represented as social networks. Because of this, Social Network Analysis lays the groundwork for many important real-world applications.

18.3 Search Engines

The Internet is rich in relational data by the simple fact that web pages are linked to other web pages. While traditional search techniques such as keyword searches focus exclusively on the content of a single page, newer techniques (Page et al., 1998, Kleinberg, 1999) exploit relationships among pages. A user performing a search wants to find results that are not only relevant but also authoritative and reliable. A keyword search on "stock market" will not only return authoritative sites such as the NASDAQ and NYSE pages, it will also return pages from thousands of self-proclaimed gurus selling books, software, and advice. The truly reliable sources of information are likely to be lost among the self-proclaimed gurus. Is there a way to separate the wheat from the chaff?

This is where relational information contained in links comes into play. An authoritative site such as the NASDAQ is likely to be recognized as authoritative by many people; therefore, many other sites are likely to point to the NASDAQ site. But a self-proclaimed stock market guru is less likely to have many other sites pointing to his site unless there truly is some merit to what he has to say. When one site references another site, it is in fact declaring that the other site has some merit: it is casting a vote for the value and importance of the other site. Conceptually, this ...
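The "links as votes" idea can be illustrated with the simplest possible tally: counting how many distinct pages point to each page. This is only a sketch of in-link counting under that reading of the text, not the ranking methods of the cited papers; the page names and the function name are hypothetical.

```python
from collections import Counter

def inlink_votes(links):
    """Count, for each page, how many distinct other pages link to it.

    `links` is an iterable of (source, target) pairs. Duplicate links and
    self-links are ignored so that a page cannot vote twice or for itself.
    """
    votes = Counter()
    for source, target in set(links):
        if source != target:
            votes[target] += 1
    return votes

# Hypothetical link data.
links = [
    ("news", "exchange"), ("blog", "exchange"), ("guru", "exchange"),
    ("blog", "guru"), ("guru", "guru"), ("blog", "exchange"),
]

for page, count in inlink_votes(links).most_common():
    print(page, count)
```

A plain count treats every vote as equal; weighting a vote by the standing of the page that casts it is a natural refinement that this sketch does not attempt.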

Date posted: 04/07/2014, 05:21
