Managing and Mining Graph Data

A Survey of Privacy-Preservation of Graphs and Social Networks

social graph. It considered the case where no underlying graph is released and, in fact, the owner of the network would like to keep the entire structure of the graph hidden from everyone. The goal of the adversary is not to de-anonymize particular individuals in that graph but to compromise the link privacy of as many individuals as possible. Specifically, the adversary determines the link structure of the graph from the local neighborhood views of the graph as seen by several non-anonymous users. Analysis showed that the number of users that need to be compromised in order to cover a constant fraction of the entire network drops exponentially as the lookahead parameter 𝑙 provided by the network data owner increases. Here a network has lookahead 𝑙 if a registered user can see all the links and nodes incident to him within distance 𝑙 from him. For example, 𝑙 = 0 if a user can see exactly who he links to; 𝑙 = 1 if a user can see exactly the friends that he links to as well as the friends that his friends link to.

Each time the adversary gains access to a user account, he immediately covers all nodes that are at distance no more than the lookahead distance 𝑙 enabled by the social network; in other words, he learns about all the edges incident to these nodes. Thus, by gaining access to the account of user 𝑢, an adversary immediately covers all nodes that are within distance 𝑙 of 𝑢. Additionally, he learns about the existence of all nodes within distance 𝑙 + 1 from 𝑢.

The authors studied several attack strategies, listed below.

Benchmark-Greedy: Among all users in the social network, pick the next user to bribe as the one whose perspective on the network gives the largest possible amount of new information. Formally, at each step the adversary picks the node covering the maximum number of nodes not yet covered.
Heuristically Greedy: Pick the next user to bribe as the one who can offer the largest possible amount of new information, according to some heuristic measure. For example, Degree-Greedy picks the next user to bribe as the one with the maximum unseen degree, i.e., its degree minus the number of edges incident to it already known by the adversary.
Highest-Degree: Bribe users in descending order of their degrees.
Random: Pick the users to bribe at random.
Crawler: Similar to the Heuristically Greedy strategy, but choose the next node to bribe only from the nodes already seen (within distance 𝑙 + 1 of some bribed node). One example is Degree-Greedy-Crawler, which picks, from all users already seen, the next user to bribe as the one with the maximum unseen degree.

Experiments on a 572,949-node friendship graph extracted from LiveJournal.com indicated that 1) Highest-Degree yields the best performance while Random performs the worst; 2) in order to obtain 80% coverage of the graph using lookahead 2, Highest-Degree needs to bribe 6,308 users, while it only needs to bribe 36 users to obtain the same coverage using lookahead 3. The authors suggested that, as a general rule, social network owners should refrain from permitting a lookahead higher than 2. Data owners may also want to decrease the vulnerability of the social network by not showing the exact number of connections that each user has, or by varying the lookahead available to users based on their trustworthiness.

7.2 Deriving Personal Identifying Information from Social Networking Sites

Online network users often publish their profiles as well as their connections, which contain vast amounts of personal and sometimes sensitive information (e.g., photo, birth date, phone number, current residence, various interests, and their friends). Acquisti and Gross in [16] studied the privacy risk associated with these networks.
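Returning to the bribery strategies of the link-privacy attack described before Section 7.2, the following minimal sketch illustrates how coverage depends on the lookahead 𝑙. It is not the authors' implementation: the toy graph (two four-node stars joined by an edge) and the names build_adjacency, nodes_within, and benchmark_greedy are assumptions made for illustration, and ties are broken in favor of the smallest node id.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def nodes_within(adj, source, max_dist):
    """Return the set of nodes at distance <= max_dist from source (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the lookahead radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def benchmark_greedy(adj, lookahead, target_fraction):
    """Bribe users greedily until target_fraction of all nodes is covered.

    Bribing u covers every node within distance `lookahead` of u. At each
    step the user covering the most not-yet-covered nodes is chosen.
    """
    reach = {u: nodes_within(adj, u, lookahead) for u in adj}
    covered, bribed = set(), []
    goal = target_fraction * len(adj)
    while len(covered) < goal:
        # max() keeps the first maximal element, i.e. the smallest node id.
        best = max(sorted(adj), key=lambda u: len(reach[u] - covered))
        bribed.append(best)
        covered |= reach[best]
    return bribed, covered


# Hypothetical toy graph: star around 0, star around 4, joined by edge 0-4.
EDGES = [(0, 1), (0, 2), (0, 3), (0, 4), (4, 5), (4, 6), (4, 7)]
ADJ = build_adjacency(EDGES)

for lookahead in (0, 1, 2):
    bribed, _ = benchmark_greedy(ADJ, lookahead, 1.0)
    print(lookahead, len(bribed), bribed)
```

On this toy graph, full coverage takes 8 bribes with lookahead 0, 2 bribes with lookahead 1, and a single bribe with lookahead 2, which mirrors in miniature the sharp drop with increasing lookahead reported in the experiments.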
The user’s profile information can be used to estimate a person’s social security number, which exposes him or her to identity theft. Their studies showed that only a small number of Facebook members change the default privacy preferences. As a result, users expose themselves to various physical and cyber risks, and make it extremely easy for third parties to create digital dossiers of their behavior. Their study quantified patterns of information revelation and inferred usage of privacy settings from actual field data.

8. Conclusion and Future Work

We surveyed recent studies on anonymization techniques for privacy-preserving publishing of social network data. The research and development of privacy-preserving social network analysis is still in its early stage compared with the much better studied privacy-preserving data analysis for tabular data. We revisited the naive anonymization approach and several structural attacks that can be exploited on naively anonymized graphs. We categorized the state-of-the-art anonymization methods on simple graphs into three main categories: 𝐾-anonymity based privacy preservation via edge modification, probabilistic privacy preservation via edge randomization, and privacy preservation via generalization. We then reviewed anonymization methods on rich graphs.

Since social network data is more complicated than tabular data, privacy preservation in social networks is much more challenging than privacy preservation in tabular data. While ideas and methods can be borrowed from the well-studied privacy preservation in tabular data, many serious efforts are greatly needed due to new challenges (see Sections 1.2 and 1.3) associated with the network data. We present a set of recommendations for future research in this emerging area.

Develop privacy models for graphs and networks.
Investigate how well different strategies protect privacy (identity, link privacy, and attribute privacy) when adversaries exploit various complex background knowledge in their attacks. How to model various kinds of background knowledge and how to quantify disclosures when complex attacks are used both need to be investigated.

Since preserving utility in the released graph is an important issue in privacy-preserving social network analysis, measures and methodologies need to be developed to quantify utility and information loss. It is important to develop workload-aware metrics that adequately quantify levels of information loss of graph data. Furthermore, various anonymization strategies need to be evaluated in terms of the tradeoff between privacy and utility.

Existing studies except [52] do not consider dynamic releases. Many applications of evolutionary networks and dynamic social network analysis require publishing data periodically to support dynamic analysis. The “one-time” released network data from existing anonymization methods cannot guarantee privacy when adversaries collect historical information from multiple releases.

Distributed privacy-preserving social network analysis based on secure multi-party computation [43]. Distributed privacy-preserving data analysis on tabular data has been well studied (e.g., [29]; refer to the book [1] for surveys). However, distributed privacy-preserving social network analysis has not been well reported in the literature.

Create a benchmark graph data repository. Researchers can compare and learn how different approaches work in terms of the privacy-utility tradeoff. The scalability issue needs to be studied and empirical evaluations need to be conducted on large social networks.

Acknowledgments

Authors Wu and Ying were supported in part by U.S. National Science Foundation IIS-0546027 and CNS-0831204.

References

[1] C. C. Aggarwal and P. S. Yu. Privacy-Preserving Data Mining: Models and Algorithms.
Springer, 2008.
[2] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th Symposium on Principles of Database Systems, 2001.
[3] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 439–450. Dallas, Texas, May 2000.
[4] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, pages 181–190, New York, NY, USA, 2007. ACM Press.
[5] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44–54, New York, NY, USA, 2006. ACM.
[6] J. Baumes, M. K. Goldberg, M. Magdon-Ismail, and W. A. Wallace. Discovering hidden groups in communication networks. In ISI, pages 378–389, 2004.
[7] T. Y. Berger-Wolf and J. Saia. A framework for analysis of dynamic social networks. In KDD, pages 523–528, 2006.
[8] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. In Proc. of 35th International Conference on Very Large Data Base, 2009.
[9] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social networks. In PinKDD, 2008.
[10] D. Chakrabarti, C. Faloutsos, and M. McGlohon. Graph Mining: Laws and Generators. Springer, 2010.
[11] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. In Proc. of VLDB08, pages 833–844, 2008.
[12] L. da F. Costa, F. A. Rodrigues, G. Travieso, and P. R. V. Boas. Characterization of complex networks: A survey of measurements. Advances in Physics, 56:167, 2007.
[13] S.
Das, Ömer Egecioglu, and A. E. Abbadi. Anonymizing edge-weighted social network graphs. Technical report, UCSB CS, March 2009.
[14] A. Fast, D. Jensen, and B. N. Levine. Creating social networks to improve peer-to-peer networking. In KDD, pages 568–573, 2005.
[15] M. Girvan and M. E. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, 99(12):7821–7826, June 2002.
[16] R. Gross and A. Acquisti. Information revelation and privacy in online social networks (the Facebook case). Proceedings of the Workshop on Privacy in the Electronic Society, 2005.
[17] S. Guo, X. Wu, and Y. Li. Determining error bounds for spectral filtering based reconstruction methods in privacy preserving data mining. Knowl. Inf. Syst., 17(2):217–240, 2008.
[18] S. Hanhijarvi, G. C. Garriga, and K. Puolamaki. Randomization techniques for graphs. In Proc. of the 9th SIAM Conference on Data Mining, 2009.
[19] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. In VLDB, 2008.
[20] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizing social networks. University of Massachusetts Technical Report, 07-19, 2007.
[21] Z. Huang, W. Du, and B. Chen. Deriving private information from randomized data. In Proceedings of the ACM SIGMOD Conference on Management of Data. Baltimore, MD, 2005.
[22] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Proc. of the 3rd Int’l Conf. on Data Mining, pages 99–106, 2003.
[23] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In KDD, pages 137–146, 2003.
[24] J. M. Kleinberg. Challenges in mining social network data: processes, privacy, and paradoxes. In KDD, pages 4–5, 2007.
[25] Y. Koren, S. C. North, and C. Volinsky.
Measuring and extracting proximity in networks. In KDD, pages 245–255, 2006.
[26] A. Korolova, R. Motwani, S. Nabar, and Y. Xu. Link privacy in social networks. In Proceedings of the 24th International Conference on Data Engineering, Cancun, Mexico, 2008.
[27] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In KDD, pages 611–617, 2006.
[28] D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In CIKM ’03: Proceedings of the twelfth international conference on Information and knowledge management, pages 556–559, New York, NY, USA, 2003. ACM.
[29] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology (CRYPTO’00), pages 36–53. Springer-Verlag, 2000.
[30] K. Liu, K. Das, T. Grandison, and H. Kargupta. Privacy-preserving data analysis on graphs and social networks, 2008.
[31] K. Liu and E. Terzi. Towards identity anonymization on graphs. In Proceedings of the ACM SIGMOD Conference, Vancouver, Canada, 2008. ACM Press.
[32] L. Liu, J. Wang, J. Liu, and J. Zhang. Privacy preservation in social networks with sensitive edge weights. In SDM, pages 954–965, 2009.
[33] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. 𝑙-diversity: privacy beyond 𝑘-anonymity. In Proceedings of the IEEE ICDE Conference, 2006.
[34] A. Narayanan and V. Shmatikov. De-anonymizing social networks. In IEEE Security & Privacy ’09, 2009.
[35] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson Education, 2003.
[36] A. Seary and W. Richards. Spectral methods for analyzing and visualizing networks: an introduction. National Research Council, Dynamic Social Network Modelling and Analysis: Workshop Summary and Papers, pages 209–228, 2003.
[37] M. Shiga, I. Takigawa, and H. Mamitsuka. A spectral clustering approach to optimally combining numerical vectors with a modular network. In KDD, pages 647–656, 2007.
[38] E. Spertus, M.
Sahami, and O. Buyukkokten. Evaluating similarity measures: a large-scale study in the orkut social network. In KDD, pages 678–684, 2005.
[39] C. Tantipathananandh, T. Y. Berger-Wolf, and D. Kempe. A framework for community identification in dynamic social networks. In KDD, pages 717–726, 2007.
[40] S. White and P. Smyth. Algorithms for estimating relative importance in networks. In KDD, pages 266–275, 2003.
[41] L. Wu, X. Ying, and X. Wu. Reconstruction of randomized graph via low rank approximation. Technical report, UNC-Charlotte, SIS, 2009.
[42] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In Proceedings of the 32nd International Conference on Very Large Data Bases, pages 139–150, September 2006.
[43] A. C. Yao. How to generate and exchange secrets. In SFCS ’86: Proceedings of the 27th Annual Symposium on Foundations of Computer Science, pages 162–167. IEEE Computer Society, 1986.
[44] X. Ying, K. Pan, X. Wu, and L. Guo. Comparisons of randomization and k-degree anonymization schemes for privacy preserving social network publishing. In SNA-KDD ’09: Proceedings of the 3rd SIGKDD Workshop on Social Network Mining and Analysis (SNA-KDD), 2009.
[45] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In Proc. of the 8th SIAM Conference on Data Mining, April 2008.
[46] X. Ying and X. Wu. Graph generation with prescribed feature constraints. In Proc. of the 9th SIAM Conference on Data Mining, 2009.
[47] X. Ying and X. Wu. On link privacy in randomizing social networks. In PAKDD, 2009.
[48] L. Zhang and W. Zhang. Edge anonymity in social graphs. In Proceedings of the 2009 International Conference on Social Computing, 2009.
[49] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. In PinKDD, pages 153–171, 2007.
[50] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks.
IEEE 24th International Conference on Data Engineering, pages 506–515, 2008.
[51] B. Zhou, J. Pei, and W.-S. Luk. A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explorations, 10(2), 2009.
[52] L. Zou, L. Chen, and M. T. Özsu. K-automorphism: A general framework for privacy preserving network publication. In Proc. of 35th International Conference on Very Large Data Base, 2009.

Chapter 15

A SURVEY OF GRAPH MINING FOR WEB APPLICATIONS

Debora Donato
Yahoo! Research
Avd Diagonal 177, Barcelona, Spain
debora@yahoo-inc.com

Aristides Gionis
Yahoo! Research
Avd Diagonal 177, Barcelona, Spain
gionis@yahoo-inc.com

Abstract

Graph structures provide a general framework for modeling entities and their relationships, and they are routinely used to describe a wide variety of data such as the Internet, the web, social networks, metabolic networks, protein-interaction networks, food webs, citation networks, and many more. In recent years, there has been an increasing amount of literature on studying properties, models, and algorithms for graph data. In this chapter we provide a brief overview of graph-mining algorithms for web and social-media applications. We review a wide range of algorithms, such as those for estimating reputation and popularity of items in a network, mining query logs, and performing query recommendations. The main goal of the chapter is to provide the reader with an understanding of how graph structural mining algorithms can be exploited in the context of web applications. This highlights the challenges of graph mining, and provides an understanding of its power, in the context of web and social-media applications.

Keywords: Graph Mining, Link Mining, Web Mining, Social Network Analysis, World Wide Web, Query-Log Mining, Query Recommendation

© Springer Science+Business Media, LLC 2010 C.C. Aggarwal and H.
Wang (eds.), Managing and Mining Graph Data, Advances in Database Systems 40, DOI 10.1007/978-1-4419-6045-0_15

1. Introduction

Graph mining has been widely used to study relationships among various types of entities. Real-world graphs are also referred to as networks, and the interactions between the entities represented in the networks are modeled as links. The problems of studying the properties of real-world networks, designing algorithms for mining such networks, and developing applications on top of network data have been of increasing interest in the past few years. This has led to the birth of a very active area of scientific research, which is known as analysis of complex networks [7, 16, 55].

One of the most pervasive properties of real-world networks is the emergence of power-law distributions, which tend to characterize many statistical properties of networks [6, 26]. Power laws have intrigued researchers, who have proposed various models that attempt to explain the presence of power-law distributions in real graphs. For examples of such models, see [6, 25, 40].

In this chapter, we deviate from the classical exposition of properties and generative models for complex networks, and we focus on graph-mining applications that appear in the context of the web and social media. Such graphs include data that model the interaction of users in a social network. For example, this may correspond to comments of users in a blog, user activity in a question-answering portal, or query-log data that summarize the interaction of users with a search engine. Understanding the structure of such graphs, modeling the complex interactions between entities, and designing algorithms for leveraging the latent knowledge (also known as the wisdom of the crowds) in those graphs introduce new challenges in the field of graph mining.
One important difference from previously studied networks is that in social-media and web-usage graphs the links represent many different types of interactions and activities among nodes. For instance, in a question-answering portal, users ask questions, answer questions for other users, vote for favorite answers and interesting questions, assign answers to categories of a hierarchy, and much more. Hence graphs from such applications are characterized by different types of nodes and a high degree of heterogeneity in the types of interactions among nodes. Consequently, algorithms and methodologies widely applied in the web and other complex networks have to be adapted to this new multifaceted scenario, which allows for the different meanings that are implicitly or explicitly captured by each link.

This chapter is organized as follows. In Section 2 we briefly introduce measures and algorithms that have been extensively used as basic tools for graph mining. Then we focus on two different areas of graph mining in the context of social-media and web applications. In Section 3, we review techniques for identifying items of high quality in social-media networks. We discuss two concrete examples: (1) predicting the number of citations of authors in a bibliographic data set, and (2) finding high-quality items in a question-answering system. In both cases, the examples rely on adapting link-mining algorithms for computing authoritativeness scores in linked environments. In Section 4 we discuss algorithms for mining graph structures that represent information collected in the query logs of search engines. We first discuss various graph representations of query logs, and then discuss how to use these representations in order to perform the task of query recommendation. The conclusions are presented in Section 5.

2. Preliminaries

An undirected graph 𝒢 = (𝑉, 𝐸) consists of a set of nodes 𝑉, also called vertices, and a set 𝐸 of pairs of distinct nodes, which are called edges or arcs. A directed graph, or digraph, is distinguished from the undirected version by the fact that its edges are ordered pairs of nodes. In an undirected graph, the degree of a node is the number of edges incident to it. For a directed graph, we define the in-degree and the out-degree of a node to be the number of incoming and outgoing edges, respectively.

In an undirected graph 𝒢, a set of nodes 𝑆 forms a connected component (CC) if for every pair of nodes 𝑢, 𝑣 ∈ 𝑆 there exists a path from 𝑢 to 𝑣 (which is also a path from 𝑣 to 𝑢). In a directed graph 𝒢, a set of nodes 𝑆 forms a strongly connected component (SCC) if for every pair of nodes 𝑢, 𝑣 ∈ 𝑆 there exists a (directed) path from 𝑢 to 𝑣 and a path from 𝑣 to 𝑢. A set of nodes 𝑆 forms a weakly connected component (WCC) if and only if the set 𝑆 is a connected component in the undirected graph 𝒢_𝑢 that is obtained by ignoring the directionality of the edges in 𝒢.

Power laws and scale-free networks. Power-law distributions ubiquitously characterize real-world networks. We say that a discrete random variable 𝑋 follows a power-law distribution if the probability distribution is defined for each discrete value 𝑘 as follows:

Pr[𝑋 = 𝑘] ∝ 𝑘^{−𝛾}

The value 𝛾 is called the exponent of the power law. We assume that 𝛾 ≥ 0. Detailed surveys on power laws may be found in [45] and [46]. If a random variable 𝑋 follows a power-law distribution, then the conditional probability Pr[𝑋 ≥ 𝑘𝑚 ∣ 𝑋 ≥ 𝑚] is (exactly for a continuous power law with minimum value 1, and asymptotically in the discrete case) the same as Pr[𝑋 ≥ 𝑘]. In other words, conditioning on the size only rescales the variable and does not yield any additional information about the shape of the distribution. For this reason, networks that have attributes that follow a power-law distribution are also called scale-free networks.

Degree and Assortativeness. The degree of the nodes of a graph can be of great interest in social-media applications.
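The definitions of this section can be made concrete with a short sketch. The example graph and the function names (degrees, reachable, strongly_connected_components, weakly_connected_components) are assumptions chosen for illustration. The components are computed directly from the definitions, as maximal sets of mutually reachable nodes, which is simple but not the most efficient approach for large graphs.

```python
from collections import defaultdict


def degrees(nodes, edges):
    """Return (in_degree, out_degree) dictionaries of a directed graph."""
    in_deg = {v: 0 for v in nodes}
    out_deg = {v: 0 for v in nodes}
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return in_deg, out_deg


def reachable(adj, source):
    """Return all nodes reachable from source by following adj (DFS)."""
    seen = {source}
    stack = [source]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def strongly_connected_components(nodes, edges):
    """SCC of u = nodes reachable from u that can also reach u."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for u, v in edges:
        fwd[u].add(v)
        bwd[v].add(u)
    comps, assigned = [], set()
    for u in nodes:
        if u not in assigned:
            comp = reachable(fwd, u) & reachable(bwd, u)
            comps.append(comp)
            assigned |= comp
    return comps


def weakly_connected_components(nodes, edges):
    """Connected components after ignoring edge directions."""
    und = defaultdict(set)
    for u, v in edges:
        und[u].add(v)
        und[v].add(u)
    comps, assigned = [], set()
    for u in nodes:
        if u not in assigned:
            comp = reachable(und, u)
            comps.append(comp)
            assigned |= comp
    return comps


# Hypothetical example: a directed 3-cycle a->b->c->a, an edge c->d,
# and an isolated node e.
NODES = ["a", "b", "c", "d", "e"]
EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

IN_DEG, OUT_DEG = degrees(NODES, EDGES)
print(IN_DEG, OUT_DEG)
print(strongly_connected_components(NODES, EDGES))
print(weakly_connected_components(NODES, EDGES))
```

In this example the cycle {a, b, c} is the only strongly connected component with more than one node, whereas ignoring directions merges d into it, giving the weakly connected components {a, b, c, d} and {e}.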
The out-degree of a node might