Data Mining and Knowledge Discovery Handbook, 2nd Edition (part 39)
Linking on the web is similar to paper citations in academia. A paper that is cited often is considered to contain important ideas; a paper that is seldom or never cited is considered less important. The following paragraphs present two algorithms for incorporating link information into search engines: PageRank (Page et al., 1998) and Kleinberg's Hubs and Authorities (Kleinberg, 1999).

The PageRank algorithm takes a set of interconnected pages and calculates a score for each. Intuitively, the score for a page is based on how many other pages point to that page and what their scores are. A page that is pointed to by a few other important pages is probably itself important. Similarly, a page that is pointed to by numerous other marginally important pages is probably itself important. But a page that is not pointed to by anything probably isn't important.

A more formal definition, taken from (Page et al., 1998), is: let u be a web page, let F_u be the set of pages u points to, let B_u be the set of pages that point to u, and let N_u = |F_u| be the number of links out of u. Let E(u) be an a priori score assigned to u. Then R(u), the score for u, is calculated as

    R(u) = Σ_{v ∈ B_u} R(v)/N_v + E(u)

So the score for a page is some constant plus the sum of the scores of its incoming links. Each incoming link contributes the score of the page it comes from divided by the number of outgoing links from that page (so a page's score is divided evenly among its outgoing links). The constant E(u) serves a couple of functions. First, it counterbalances the effect of "sinks" in the network: pages or groups of pages that are dead ends, pointed to but pointing out to no other pages. E(u) provides a "source" of score that counterbalances these "sinks." Second, it provides a method of introducing a priori scores if certain pages are known to be authoritative. (A small code sketch of this recurrence appears at the end of this subsection.)

The PageRank algorithm can be combined with other techniques to create a search engine. For example, PageRank is first used to assign a score to all pages in a population. Next, a simple keyword search is used to find a list of relevant candidate pages, and the candidate pages are ordered according to their PageRank scores. The top-ranked pages presented to the user are both relevant (based on the keyword match) and authoritative (based on the PageRank score). The Google search engine is based on PageRank combined with other factors, including standard IR techniques and the text of incoming links.
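The sketch below iterates the PageRank recurrence to a fixed point. The graph encoding, the uniform a priori score E(u) = e/n with damping, the even redistribution of sink mass, and the convergence tolerance are illustrative assumptions, not details from the chapter.

```python
# Minimal PageRank sketch: iterate R(u) = sum_{v in B_u} R(v)/N_v + E(u)
# until the scores stabilize. graph maps each page to the pages it points to.

def pagerank(graph, e=0.15, iterations=50, tol=1e-8):
    pages = list(graph)
    n = len(pages)
    scores = {u: 1.0 / n for u in pages}   # initial uniform scores
    source = {u: e / n for u in pages}     # E(u): uniform a priori "source" score
    for _ in range(iterations):
        new = dict(source)                 # every page starts with its E(u)
        for u in pages:
            out = graph[u]
            if not out:                    # "sink": spread its score evenly
                for w in pages:
                    new[w] += (1 - e) * scores[u] / n
            else:                          # pass R(u)/N_u along each out-link
                share = (1 - e) * scores[u] / len(out)
                for w in out:
                    new[w] += share
        done = max(abs(new[u] - scores[u]) for u in pages) < tol
        scores = new
        if done:
            break
    return scores

if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(web).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))
```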
Kleinberg's algorithm (Kleinberg, 1999) differs from the PageRank approach in two important respects:

1. Whereas the PageRank approach assigns a score to each page before applying a text search, Kleinberg assigns a score to a page within the context of a text search. For example, a page containing both the words "Microsoft" and "Oracle" may receive a high score if the search string is "Oracle" but a low score if the search string is "Microsoft." PageRank would assign a single score regardless of the text search.
2. Kleinberg's algorithm draws a distinction between "hubs" and "authorities." Hubs are pages that point out to many other pages on a topic (preferably many of them authorities). Authorities are pages that are pointed to by many other pages (preferably many of them hubs). Thus the two have a symbiotic relationship.

The first step in a search is to create a "focused subgraph": a small subset of the Internet that is rich in pages relevant to the search and also contains many of the strongest authorities on the searched topic. This is done by performing a pure text search and retrieving the top t pages (t around 200). This set is augmented by adding all the pages pointed to by pages in t and all pages that point to pages in t (for pages with a large in-degree, only a subset of the in-pointing pages are added). Note that this augmentation may add pages that do not contain the original search string. This is actually a good thing, because often an authority on a topic does not contain the search string. For example, toyota.com may not contain the string "automobile manufacturer" (Rokach and Maimon, 2006), or a page may discuss several machine learning algorithms but never use the phrase "machine learning" because its authors always write "Data Mining." Adding the linked pages pulls in related pages whether they contain the search text or not.

The second step calculates two scores for each page: an authority score and a hub score. Intuitively, a page's authority score is the normalized sum of the hub scores of all pages that point to it, and a page's hub score is the normalized sum of the authority scores of all the pages it points to. By iteratively recalculating each page's hub and authority scores, the scores converge to an equilibrium (a short code sketch appears at the end of this subsection). The reinforcing relationship between hubs and authorities helps the algorithm differentiate true authorities on a topic from generally popular web sites such as amazon.com and yahoo.com.

In summary, both algorithms measure a web page's importance by its relationships with other web pages: an extension of the notion that importance in a social network is determined by position in the network.
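A minimal sketch of that hub/authority iteration, assuming the focused subgraph is given as a dictionary of out-links; the iteration count and the unit-norm rescaling are the only tuning choices.

```python
# Minimal Hubs & Authorities (HITS) sketch over a focused subgraph.
# authority(p): sum of hub scores of pages pointing to p.
# hub(p):       sum of authority scores of pages p points to.

import math

def hits(graph, iterations=50):
    pages = set(graph) | {w for out in graph.values() for w in out}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority update from current hub scores
        auth = {p: 0.0 for p in pages}
        for u, out in graph.items():
            for w in out:
                auth[w] += hub[u]
        # hub update from the new authority scores
        hub = {p: sum(auth[w] for w in graph.get(p, ())) for p in pages}
        # normalize both score vectors to unit length
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

if __name__ == "__main__":
    subgraph = {"hub1": ["site_a", "site_b"], "hub2": ["site_a", "site_c"],
                "site_a": [], "site_b": [], "site_c": []}
    hub, auth = hits(subgraph)
    print(max(auth, key=auth.get))   # site_a: pointed to by both hubs
```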
18.4 Viral Marketing

Viral marketing relies heavily on "word-of-mouth" advertising, where one individual who has bought a product tells their friends about the product (Domingos and Richardson, 2001; Richardson and Domingos, 2002; Kempe et al., 2003). A famous example of viral marketing is the rapid spread of Hotmail as a free email service. Attached to each email was a short advertisement and Hotmail's URL, and customers spread the word about Hotmail simply by emailing their family and friends. Hotmail grew from zero to 12 million users in 18 months (Richardson and Domingos, 2002). Word-of-mouth is a powerful form of advertising because if a friend tells me about a product, I don't believe they have a hidden motive to sell it; instead, I believe they really believe in the inherent quality of the product enough to tell me about it.

Products are not the only things that spread by word of mouth. Fashion trends spread from person to person. Political ideas are transferred from one person to the next. Even technological innovations are transferred among a network of coworkers and peers. These problems can all be viewed as the diffusion of information, ideas, or influence among the members of a social network (Kempe et al., 2003).

These social networks take on a variety of forms. The most easily understood is the old-fashioned "social network" where people are friends, neighbors, coworkers, etc., and their connections are personal and often face-to-face. For example, when a new medical procedure is invented, some physicians will be early adopters, but others will wait until close friends have tried the procedure and been successful.

The Internet has also created social networks with virtual connections. In a collaborative filtering system, a recommendation for a book, movie, or musical CD may be made to Customer A based on N "similar" customers: customers who have bought similar items in the past. Customer A is influenced by these N other customers even though Customer A never meets them, communicates with them, or even knows their identities. Knowledge-sharing sites provide a second type of virtual connection with more explicit interaction. On these sites people provide reviews and ratings on things ranging from books to cars to restaurants. As an individual follows the advice offered by various "experts," they grow to trust the advice of some and not trust the advice of others.

Formally, this can be modeled as a social network where the nodes are people, and node X_i is linked to node X_j if the person represented by X_i in some way influences X_j. From a marketing standpoint some natural questions emerge: "Which nodes should I market to to maximize my profit?" Or alternatively, "If I only have the budget to market to k of the n nodes in the network, which k should I choose to maximize the spread of influence?" Based on work from the field of social network analysis, two plausible approaches would be to pick nodes with the highest out-degree (nodes that influence a lot of other nodes) or nodes with good distance centrality (nodes that have a short average distance to the rest of the network). Two recent approaches to these questions are described in the following paragraphs.

The first approach, proposed by Domingos and Richardson (2001) and Richardson and Domingos (2002), models the social network as a Markov random field. The probability that each node will purchase the product or adopt the idea is modeled as P(X_i | N_i, Y, M), where N_i are the neighbors of X_i, the ones who directly influence X_i. Y is a vector of attributes describing the product; this reflects the fact that X_i is influenced not only by neighbors but also by the attributes of the product itself. A bald man probably won't buy a hairbrush even if all the people he trusts most do. M_i is the marketing action taken for X_i; this reflects the fact that a customer's decision to buy is influenced by whether he is marketed to, such as whether he receives a discount. This probability can be combined with other information (how much it costs to market to a customer, what the revenue is from a customer who was marketed to, and what the revenue is from a customer who was not) to calculate the expected profit from a particular marketing plan. Various search techniques such as greedy search and hill-climbing can be employed to find local maxima for the profit.
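A rough illustration of that profit calculation (a sketch under simple assumptions, not the authors' actual model): expected profit of a plan is expected revenue minus marketing cost, with P(X_i | N_i, Y, M) supplied as a black-box probability function, and a hill-climbing search over plans.

```python
# Sketch: expected profit of a marketing plan plus hill-climbing over plans.
# prob(i, plan) stands in for P(X_i | N_i, Y, M): the probability customer i
# buys under the plan. COST and REVENUE are illustrative constants.

COST = 1.0      # cost of one marketing action
REVENUE = 10.0  # revenue if a customer buys

def expected_profit(customers, plan, prob):
    profit = 0.0
    for i in customers:
        profit += REVENUE * prob(i, plan)  # expected revenue from customer i
        if i in plan:
            profit -= COST                 # minus the cost of marketing to i
    return profit

def hill_climb_plan(customers, prob):
    plan, best = set(), expected_profit(customers, set(), prob)
    improved = True
    while improved:
        improved = False
        for i in customers:
            if i in plan:
                continue
            trial = plan | {i}
            p = expected_profit(customers, trial, prob)
            if p > best:                   # accept any improving addition
                plan, best, improved = trial, p, True
    return plan, best
```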
A second approach, proposed by Kempe et al. (2003), uses a more operational model of how ideas spread within a network. A set of nodes is initialized to be active (indicating they bought the product or adopted the idea) at time t = 0. If an inactive node X_i has active neighbors, those neighbors exert some influence on X_i to become active; as more of X_i's neighbors become active, this may cause X_i to become active. Thus the process unfolds in a set of discrete steps where a set of nodes change their values at time t based on the set of active nodes at time t - 1. Two models for how nodes become activated are (a simulation sketch appears at the end of this section):

1. Linear Threshold Model. Each link coming into X_i from its neighbors has a weight. When the sum of the weights of the links from X_i's active neighbors surpasses a threshold θ_i, X_i becomes active at time t + 1. The process runs until the network reaches an equilibrium state.
2. Independent Cascade Model. When a neighbor of an inactive node X_i first becomes active, it has one chance to activate X_i. It succeeds at activating X_i with probability p_{i,j}, and X_i then becomes active at time t + 1. The process runs until no new nodes become active.

Kempe presents a greedy hill-climbing algorithm and proves that its performance is within a factor of (1 - 1/e), about 63%, of optimal. Empirical experiments show that the greedy algorithm performs better than picking nodes with the highest out-degree or the best distance centrality.

Areas of further research in viral marketing include dealing with the fact that network knowledge is often incomplete. Network knowledge can be acquired, but this involves a cost that must be factored into the overall marketing cost.
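The simulation sketch referenced above: an independent cascade simulation, a Monte Carlo estimate of expected spread, and the greedy seed-selection loop. The graph encoding, per-link probabilities, and trial count are illustrative assumptions.

```python
# Sketch: independent cascade model plus greedy seed selection.
# graph[u] maps node u to {neighbor: activation probability}.

import random

def cascade_size(graph, seeds, rng):
    active, frontier = set(seeds), set(seeds)
    while frontier:                  # each newly active node gets one chance
        newly = set()                # to activate each inactive neighbor
        for u in frontier:
            for w, p in graph.get(u, {}).items():
                if w not in active and rng.random() < p:
                    newly.add(w)
        active |= newly
        frontier = newly
    return len(active)

def expected_spread(graph, seeds, trials=200):
    rng = random.Random(0)           # fixed seed for a reproducible estimate
    return sum(cascade_size(graph, seeds, rng) for _ in range(trials)) / trials

def greedy_seeds(graph, k):
    seeds = set()
    for _ in range(k):               # add the node with best marginal gain
        best = max((n for n in graph if n not in seeds),
                   key=lambda n: expected_spread(graph, seeds | {n}))
        seeds.add(best)
    return seeds
```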
18.5 Law Enforcement & Fraud Detection

Link analysis was used in law enforcement long before the advent of computers. Police and detectives would manually create charts showing how people and pieces of evidence in a crime were connected. Computers greatly advanced these techniques in two key ways:

1. Visualization of crime/fraud networks. Charts that were previously manually drawn and static became automatically drawn and dynamic. A variety of link analysis visualization tools allowed users to perform such operations as: a) automatically arranging networks to maximize clarity (e.g., minimize link crossings), b) rearranging a network by dragging and dropping nodes, c) filtering out links by weight or type, d) grouping nodes by type.
2. Proliferation of databases containing information that links people, events, accounts, etc. Two people or accounts could be linked because: a) one sent a wire transfer to the other, b) they were in the same auto accident and were mentioned in the insurance claim together, c) they both owned the same house at different times, d) they share a phone number.

All these pieces of information were gathered for non-law-enforcement reasons and stored in databases, but they could be used to detect fraud and crime rings.

A pioneering work in automated link analysis for law enforcement is FAIS (Senator et al., 1995), a system for investigating money laundering developed at FinCEN (the Financial Crimes Enforcement Network). The data supporting FAIS was a database of Currency Transaction Reports (CTRs) and other forms filed by banks, brokerages, casinos, businesses, etc. when a customer conducts a cash transaction over $10,000. Entities were linked to each other because they appeared on the same CTR, shared an address, etc. The FAIS system provided leads to investigators on which people, businesses, accounts, or locations they should investigate. Starting with a suspicious entity, an investigator could branch out to everyone linked to that entity, then everyone linked to those entities, and so on. This information was then displayed in a link analysis visualization tool where it could be more easily manipulated and understood.

Insurance fraud is a crime that is usually carried out by rings of professionals. A ringleader orchestrates staged auto accidents and partners with fraudulent doctors, lawyers, and repair shops to file falsified claims. Over time, this manifests itself in the claim data as many claims involving an interlinked group of drivers, passengers, doctors, lawyers, and body shops. Each claim in isolation looks legitimate, but taken together they are extremely suspicious. The "NetMap for Claims" solution from NetMap Analytics, like FAIS, allows users to start with a person, find everyone directly linked to them (on a claim together), then find everyone two links away, then three links away, etc. These people and their interconnections are then displayed in a versatile visualization tool.

More recent work has focused on generating leads based not on any one node but instead on the relationships among nodes. This is because many nodes in isolation are perfectly legitimate, or are only slightly suspicious but innocuous enough to stay "under the radar"; but when several of these "slightly suspicious" nodes are linked together, it becomes very suspicious (Donoho and Lewis, 2003). For example, if a bank account has cash deposits of between $2,000 and $5,000 in a month's time, this is only slightly suspicious. At a large bank there will be numerous accounts that meet this criterion: too many to investigate. But if it is found that 10 such accounts are linked by shared personal information or by transactions, this is suddenly very suspicious, because it is highly unlikely to happen by chance, and it is exactly what money launderers do to hide their behavior. (A small code sketch of this idea appears at the end of this section.)

Scenario-based approaches have instances of crimes represented as networks, and new situations are suspicious if they are sufficiently similar to a known crime instance. Fu et al. (2003) describe a system to detect contract murders by the Russian mafia. The system contains a library of known contract murders described as people linked by phone calls, meetings, wire transfers, etc. A new set of events may match a known instance if it has a similar network topology, even if a phone call in one matches a meeting in the other (both are communication events) or a wire transfer in one matches a cash payment in the other (both are payment events). The LAW system (Ruspini et al., 2003) is a similar system to detect terrorist activities before a terrorist act occurs. LAW measures the similarity between two networks using edit distance: the number of edits needed to convert network #1 into exactly network #2.

In summary, crime rings and fraud rings are their own type of social network, and analyzing the relationships among entities is a powerful method of detecting those rings.
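The sketch referenced above: flag accounts that individually meet only a weak criterion, link flagged accounts that share an identifier, and escalate any linked group above a size threshold. The account fields, dollar range, and group-size threshold are illustrative assumptions, not details of FAIS or NetMap.

```python
# Sketch: escalate rings of individually "slightly suspicious" accounts that
# are linked by shared identifiers (phone numbers, addresses, etc.).

from collections import defaultdict

def suspicious_rings(accounts, min_ring=3):
    # weak individual criterion: monthly cash deposits in a gray zone
    flagged = [a for a in accounts if 2000 <= a["monthly_cash"] <= 5000]

    # union-find over flagged accounts that share any identifier
    parent = {a["id"]: a["id"] for a in flagged}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    by_key = defaultdict(list)
    for a in flagged:
        for key in a["identifiers"]:
            by_key[key].append(a["id"])
    for ids in by_key.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    rings = defaultdict(list)
    for a in flagged:
        rings[find(a["id"])].append(a["id"])
    return [ids for ids in rings.values() if len(ids) >= min_ring]

if __name__ == "__main__":
    accounts = [
        {"id": "A1", "monthly_cash": 3000, "identifiers": {"555-0101"}},
        {"id": "A2", "monthly_cash": 4500, "identifiers": {"555-0101", "12 Elm St"}},
        {"id": "A3", "monthly_cash": 2500, "identifiers": {"12 Elm St"}},
        {"id": "A4", "monthly_cash": 9000, "identifiers": {"77 Oak Ave"}},
    ]
    print(suspicious_rings(accounts))   # [['A1', 'A2', 'A3']]
```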
18.6 Combining with Traditional Methods

Many recent works focus on combining link analysis techniques with traditional knowledge discovery techniques such as inductive inference and clustering. Jensen (1999) points out that certain challenges arise when traditional induction techniques are applied to linked data:

1. The linkages in the data may cause instances to no longer be statistically independent. If multiple instances are all linked to the same entity and draw some of their characteristics from that entity, then those instances are no longer independent.
2. Sampling becomes very tricky. Because instances are interlinked with each other, sampling breaks many of these linkages. Relational attributes of an instance such as degree, closeness, and betweenness can be drastically changed by sampling.
3. Attribute combinatorics are greatly increased in linked data. In addition to an instance's k intrinsic attributes, an algorithm can draw upon attributes from neighbors, neighbors' neighbors, etc. Yet more attributes arise from the combinations of neighbors and the topologies with which they are linked.

These and other challenges are discussed more extensively in (Jensen, 1999).

Neville and Jensen (2000) present an iterative method of classification using relational data (a sketch of this scheme appears at the end of this section). Some of an instance's attributes are static. These include intrinsic attributes (which contain information about an instance by itself, regardless of linkages) and static relational attributes (which contain information about an instance's linked neighbors but do not depend on the neighbors' classifications). More interesting are the dynamic relational attributes, which may change value as an instance's neighbors change classification. So the output of one instance (its class) is the input of a neighboring instance (in its dynamic relational attributes), and vice versa. The algorithm iteratively recalculates instances' classes. There are m iterations, and at iteration i it accepts class labels on the N*(i/m) instances with the highest certainty. Classification proceeds from the instances with highest certainty to those with lowest certainty until all N instances are classified. In this way, instances with the highest certainty have more opportunity to affect their neighbors.

Using linkage data in classification has also been applied to the classification of text (Chakrabarti et al., 1998; Slattery, 2000; Oh et al., 2000; Lu, 2003). Simply incorporating words from neighboring texts was not found to be helpful, but incorporating more targeted information such as hierarchical category information, predicted class, and anchor text was found to improve accuracy.

In order to cluster linked data, Neville et al. (2003) combine traditional clustering with graph partitioning techniques. Their work uses similarity metrics taken from traditional clustering to assign weights to linked graphs. Once this is done, several standard graph partitioning techniques can be used to partition the graph into clusters. Taskar et al. (2001) cluster linked data using probabilistic relational models.

While most work in link analysis assumes that the graph is complete and fairly correct, this is often far from the truth. Work in the area of link completion (Kubica et al., 2002; Goldenberg et al., 2003; Kubica et al., 2003) induces missing links from previously observed data. This allows users to ask questions about a future state of the graph, such as "Who is person XYZ likely to publish a paper with in the next year?"

There are many possible ways link analysis can be combined with traditional techniques, and many remain unexplored, making this a fruitful area for future research.
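The sketch referenced above: a minimal version of the iterative classification loop, assuming a black-box classify(instance, neighbor_labels) that returns a (label, certainty) pair; how the dynamic relational attributes are built from neighbor labels is left inside that black box.

```python
# Sketch of iterative classification (after Neville & Jensen, 2000): run m
# iterations; at iteration i, accept the N*(i/m) most certain labelings, and
# let accepted labels feed neighbors' dynamic relational attributes next round.

def iterative_classification(instances, neighbors, classify, m=10):
    accepted = {}                        # instance id -> accepted class label
    n = len(instances)
    for i in range(1, m + 1):
        scored = []
        for inst in instances:
            # neighbor labels accepted so far (None if not yet accepted)
            ctx = [accepted.get(nb) for nb in neighbors[inst["id"]]]
            label, certainty = classify(inst, ctx)
            scored.append((certainty, inst["id"], label))
        scored.sort(reverse=True)        # most certain labelings first
        quota = int(n * i / m)           # accept a growing fraction each round
        accepted = {iid: lab for _, iid, lab in scored[:quota]}
    return accepted
```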
18.7 Summary

Link analysis is a collection of techniques that operate on data that can be represented as nodes and links. A variety of applications rely on link analysis techniques or can be improved by them, among these Internet search, viral marketing, fraud detection, crime prevention, and sociological study. To support these applications, a number of new link analysis techniques have emerged in recent years. This chapter has surveyed several of these, including subgraph matching, finding cliques and k-plexes, maximizing spread of influence, visualization, and finding hubs and authorities. A fruitful area for future research is the combination of link analysis techniques with traditional Data Mining techniques.

References

Chakrabarti S, Dom B, Agrawal R, & Raghavan P. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal 1998; 7:163–178.

Domingos P & Richardson M. Mining the network value of customers. Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining; 2001 August 26–29; San Francisco, CA. ACM Press, 2001.

Donoho S & Lewis S. Understand behavior detection technology: Emerging approaches to dealing with three major consumer protection threats. April 2003.

Fu D, Remolina E, & Eilbert J. A CBR approach to asymmetric plan detection. Proceedings of the Workshop on Link Analysis for Detecting Complex Behavior; 2003 August 27; Washington, DC.

Goldenberg A, Kubica J, & Komarek P. A comparison of statistical and machine learning algorithms on the task of link completion. Proceedings of the Workshop on Link Analysis for Detecting Complex Behavior; 2003 August 27; Washington, DC.

Hanneman R. Introduction to Social Network Methods. University of California, Riverside, 2001.

Jensen D. Statistical challenges to inductive inference in linked data. Preliminary Papers of the 7th International Workshop on Artificial Intelligence and Statistics; 1999 Jan 4–6; Fort Lauderdale, FL.

Kempe D, Kleinberg J, & Tardos E. Maximizing the spread of influence through a social network. Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2003 August 24–27; Washington, DC. ACM Press, 2003.

Kleinberg J. Authoritative sources in a hyperlinked environment. Journal of the ACM 1999; 46(5):604–632.

Kubica J, Moore A, Schneider J, & Yang Y. Stochastic link and group detection. Proceedings of the Eighteenth National Conference on Artificial Intelligence; 2002 July 28–Aug 1; Edmonton, Alberta, Canada. AAAI Press, 2002.

Kubica J, Moore A, Cohn D, & Schneider J. cGraph: A fast graph-based method for link analysis and queries. Proceedings of the Text-Mining & Link-Analysis Workshop; 2003 August 9; Acapulco, Mexico.

Lu Q & Getoor L. Link-based text classification. Proceedings of the Text-Mining & Link-Analysis Workshop; 2003 August 9; Acapulco, Mexico.

Neville J & Jensen D. Iterative classification in relational data. Proceedings of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data; 2000 August 3; Austin, TX. AAAI Press, 2000.

Neville J, Adler M, & Jensen D. Clustering relational data using attribute and link information. Proceedings of the Text-Mining & Link-Analysis Workshop; 2003 August 9; Acapulco, Mexico.

Oh H, Myaeng S, & Lee M. A practical hypertext categorization method using links and incrementally available class information. Proceedings of the 23rd ACM International Conference on Research and Development in Information Retrieval (SIGIR-00); 2000 July; Athens, Greece.

Page L, Brin S, Motwani R, & Winograd T. The PageRank citation ranking: Bringing order to the Web. Stanford Digital Library Technologies Project, 1998.

Richardson M & Domingos P. Mining knowledge-sharing sites for viral marketing. Proceedings of the Eighth International Conference on Knowledge Discovery and Data Mining; 2002 July 28–Aug 1; Edmonton, Alberta, Canada. ACM Press, 2002.

Rokach L, Averbuch M, & Maimon O. Information retrieval system for medical narrative reports. Lecture Notes in Artificial Intelligence 3055, pp. 217–228. Springer-Verlag, 2004.
Rokach L & Maimon O. Data mining for improving the quality of manufacturing: A feature set decomposition approach. Journal of Intelligent Manufacturing 2006; 17(3):285–299.

Ruspini E, Thomere J, & Wolverton M. Database-editing metrics for pattern matching. SRI International, March 2003.

Senator T, Goldberg H, Wooton J, Cottini A, Umar A, Klinger C, et al. The FinCEN Artificial Intelligence System: Identifying potential money laundering from reports of large cash transactions. Proceedings of the 7th Conference on Innovative Applications of AI; 1995 August 21–23; Montreal, Quebec, Canada. AAAI Press, 1995.

Slattery S & Craven M. Combining statistical and relational methods for learning in hypertext domains. Proceedings of ILP-98, 8th International Conference on Inductive Logic Programming; 1998 July 22–24; Madison, WI. Springer Verlag, 1998.

Taskar B, Segal E, & Koller D. Probabilistic clustering in relational data. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence; 2001 August 4–10; Seattle, Washington.

Wasserman S & Faust K. Social Network Analysis. Cambridge University Press, 1994.

Part IV Soft Computing Methods
