distributed solutions in privacy preserving data mining

, mn B GIÁO DC VÀ ÀO TO B QUC PHÒNG VIN KHOA HC VÀ CÔNG NGH QUÂN S e ̌f LNG TH DNG DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING (Nghiên cu xây dng mt s gii pháp đm bo an toàn thông tin trong quá trình khai phá d liu) LUN ÁN TIN S TOÁN HC Hà N  i - 2011 B GIÁO DC VÀ ÀO TO B QUC PHÒNG VIN KHOA HC VÀ CÔNG NGH QUÂN S ěf LNG TH DNG DISTRIBUTED SOLUTIONS IN PRIVACY PRESERVING DATA MINING (Nghiên cu xây dng mt s gii pháp đm bo an toàn thông tin trong quá trình khai phá d liu) Chuyên ngành: Bo đm toán hc cho máy tính và h thng tính toán. Mã s : 62 46 35 01 LUN ÁN TIN S TOÁN HC Ngi hng dn khoa hc: 1. GIÁO S - TIN S KHOA HC H TÚ BO 2. PHÓ GIÁO S - TIN S BCH NHT HNG Hà Ni - 2011 Pledge I promise that this thesis is a presentation of my ori gi n al research work. Any of the content was written based on the reliable references such as published papers in distinguished international conferences and journals, and books published by widely-known publishers. Results and discussions of the thesis are new, not previously published by any other authors. i Contents 1 INTRODUCTION 1 1.1 Privacy-preserving data mining: An overview . . . . . . . . . 1 1.2 Objectives and contributions . . . . . . . . . . . . . . . . . . 5 1.3 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.4 Organization of thesis . . . . . . . . . . . . . . . . . . . . . . 12 2 METHODS FOR SECURE MULTI-PARTY COMPUTATION 13 2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.1 Computational indistinguishability . . . . . . . . . . . 13 2.1.2 Secure multi-party computation . . . . . . . . . . . . . 14 2.2 Secure computation . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Secret sharing . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.2 Secure sum computation . . . . . . . . . . . . . . . . . 16 2.2.3 Probabilistic public key cryptosystems . . . . . . . . . 17 2.2.4 Variant ElGamal Cryptosystem . . . . . . . . . . . . . 18 2.2.5 Oblivious polynomial evaluation . . . . . . . . . . . . 20 2.2.6 Secure scalar product computation . . . . . . . . . . . 21 2.2.7 Privately computing ln x . . . . . . . . . . . . . . . . . 22 3 PRIVACY PRESERVING FREQUENCY-BASED LEARNING IN 2PFD SETTING 24 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Privacy preserving frequency mining in 2PFD setting . . . . . 27 3.2.1 Problem formulation . . . . . . . . . . . . . . . . . . 27 3.2.2 Definition of privacy . . . . . . . . . . . . . . . . . . . 29 3.2.3 Frequency mining protocol . . . . . . . . . . . . . . . 30 ii 3.2.4 Correctness Analysis . . . . . . . . . . . . . . . . . . . 32 3.2.5 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 34 3.2.6 Efficiency of frequency mining protocol . . . . . . . . 37 3.3 Privacy Preserving Frequ en cy -b ased Learning in 2PFD Setting 38 3.3.1 Naive Bayes learning problem in 2PFD setting . . . . 38 3.3.2 Naive Bayes learning Protocol . . . . . . . . . . . . . . 40 3.3.3 Correctness and privacy analysis . . . . . . . . . . . . 42 3.3.4 Efficiency of naive Bayes learning protocol . . . . . . . 42 3.4 An improvement of frequency mining protocol . . . . . . . . . 44 3.4.1 Improved frequency mining protocol . . . . . . . . . . 44 3.4.2 Protocol Analysis . . . . . . . . . . . . . . . . . . . . . 45 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4 ENHANCING PRIVACY FOR FREQUENT ITEMSET MINING IN VERTICALLY 49 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Association rules and frequent itemset . . . . . . . . . 51 4.2.2 Frequent itmeset identifyin g in vertically distributed data 52 4.3 Computational and privacy model . . . . . . . . . . . . . . . 53 4.4 Support count preserving protocol . . . . . . . . . . . . . . . 54 4.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.4.2 Protocol design . . . . . . . . . . . . . . . . . . . . . . 56 4.4.3 Correctness Analysis . . . . . . . . . . . . . . . . . . . 57 4.4.4 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 59 4.4.5 Performance analysis . . . . . . . . . . . . . . . . . . . 61 4.5 Support count computation-based protocol . . . . . . . . . . 64 4.5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.5.2 Protocol Design . . . . . . . . . . . . . . . . . . . . . . 65 4.5.3 Correctness Analysis . . . . . . . . . . . . . . . . . . . 65 4.5.4 Privacy Analysis . . . . . . . . . . . . . . . . . . . . . 67 4.5.5 Performance analysis . . . . . . . . . . . . . . . . . . . 68 4.6 Using binary tree communication structure . . . . . . . . . . 69 iii 4.7 Privacy-preserving distributed Apriori algorithm . . . . . . . 70 4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5 PRIVACY PRESERVING CLUSTERING 73 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.2 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . 74 5.3 Privacy preserving clustering for the multi-party distributed data 76 5.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.3.2 Private multi-party mean computation . . . . . . . . . 78 5.3.3 Privacy preserving multi-party clustering protocol . . 80 5.4 Privacy preserving clustering without disclosing cluster centers 82 5.4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.4.2 Privacy preserving two-party clustering protocol . . . 85 5.4.3 Secure mean sharing . . . . . . . . . . . . . . . . . . . 87 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6 PRIVACY PRESERVING OUTLIER DETECTION 91 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.2 Technical prelimi n ar i es . . . . . . . . . . . . . . . . . . . . . . 92 6.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . 92 6.2.2 Linear transformation . . . . . . . . . . . . . . . . . . 93 6.2.3 Privacy model . . . . . . . . . . . . . . . . . . . . . . 94 6.2.4 Private matrix product sharing . . . . . . . . . . . . . 95 6.3 Protocols for the horizontally distributed data . . . . . . . . . 95 6.3.1 Two-party protocol . . . . . . . . . . . . . . . . . . . . 97 6.3.2 Multi-party protocol . . . . . . . . . . . . . . . . . . . 100 6.4 Protocol for two-party vertically distributed data . . . . . . . 101 6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 SUMMARY 107 Publication List 110 Bibliography 111 iv List of Phrases Abbreviation Full name PPDM Privacy Preserving Data Mining k-NN k-nearest neighbor EM Expectation-maximization SMC Secure Multiparty Computation DDH Decisional Diffie-Hellman PMPS Private Matrices Product Sharing SSP Secure Scalar Product OPE Oblivious polynomial evaluation ICA Independent Component Analysis 2PFD 2-part fully distributed setting FD fully distributed setting c ≡ computational indistinguishability v List of Tables 4.1 The communication cost . . . . . . . . . . . . . . . . . . . . 62 4.2 The complexity of the support count preserving protocol . . . 63 4.3 The parties’s time for the support count preserving protocol 64 4.4 The communication cost . . . . . . . . . . . . . . . . . . . . 68 4.5 The complexity of the support count computation protocol . 69 4.6 The parties’s time for the support count computation protocol 70 6.1 The parties’s computational time for the horizontally distributed data 105 6.2 The parties’s computational time for the vertically distributed data 105 vi List of Figures 3.1 Frequency mining protocol . . . . . . . . . . . . . . . . . . . . 33 3.2 The time used by the miner for computing the frequency f . 38 3.3 Privacy preserving protocol of naive Bayes learning . . . . . . 41 3.4 The computational time for the first phase and the third phrase 43 3.5 The time for computing the key values in the first phase . . 43 3.6 The time for computing the frequency f in third phrase . . . 44 3.7 Improved frequency mining protocol . . . . . . . . . . . . . . 47 4.1 Support count preserving protocol. . . . . . . . . . . . . . . . 58 4.2 The support count computation protocol. . . . . . . . . . . . 66 4.3 Privacy-preserving distributed Apriori protocol . . . . . . . . 72 5.1 Privacy preserving multi-party mean computation . . . . . . 79 5.2 Privacy preserving multi-party clustering protocol . . . . . . 81 5.3 Privacy preserving two-party clustering . . . . . . . . . . . . 86 5.4 Secure mean sharing . . . . . . . . . . . . . . . . . . . . . . . 89 6.1 Private matrix product sharing (PMPS). . . . . . . . . . . . . 96 6.2 Protocol for two-party horizontally distributed data. . . . . . 98 6.3 Protocol for multi-party horizontally distributed data. . . . . 101 6.4 Protocol for two-party vertically distributed data. . . . . . . . 103 vii Chapter 1 INTRODUC TI O N 1.1. Privacy-preserving data mining : An overview Data mining plays an important role in the current world and provides us a powerful tool to efficiently discover valuable information from large databases [25]. However, the process of mining data can result in a viola- tion of privacy, therefore, issues of privacy preservation in data mining are receiving more and more attention from the this community [52]. As a result, there are a large number of studies has been produced on the topic of privacy-preserving data minin g (PPDM) [72]. These studies deal wi t h the problem of learning data mining models from the databases, while protecting data privacy at the level of individual records or the level of organizations. Basically, there are three major problems in PPDM [8]. First, the organizations such as government agencies wish to publish their data for research er s and even community. However, they want to preserve the d at a privacy, for example, highly sensitive financial and health private data. Second, a group of the organizations (or parties) wishes to together obtain the mining result on their joint data without disclosing each party’s privacy information. Third, a miner wishes to collect data or obtai n the data mining models from the individual users, while preserv i n g privacy of each user. Consequently, PPDM can be formed into three following areas depending on the models of information sharing. Privacy-preserving data publishing: The model of this resear ch consists of only an organization, is the trusted data holder. This organization wishes to publish its d at a to the mi n er or the r esear ch community such that the an onymized data are useful for the data mining applications. For example, some hospitals collect records from their patients for the some required 1 [...]... are used for privacy- preserving distributed data mining as well, where data are distributed across several parties Thus, the privacy property of privacy- preserving distributed data mining algorithms is quantified by the privacy definition of SMC, where each party involved in the privacy- preserving distributed protocols is only allowed to learn the desired data mining models without any other information... from privacy preserving data publishing, each study in privacy- preserving distributed data mining is often to solve a specific data mining task The model of this area usually consists of several parties instead, each party has one private data set The general purpose is to enable the parties for mining cooperatively on their joint data sets without revealing private information to other participating... 2PFD setting In the FD setting, other solutions based on k-anonymization of user’s data have been proposed in [83, 77] The advantage of these solutions is that they do not depend on the underlying data mining tasks, because the anonymous data can be used for various data mining tasks without disclosing privacy However, these solutions are inapplicable in 2PFD setting, because the miner can not link two... variety of privacy preserving data mining solutions have been proposed in this area Some randomization-based solutions proposed in [21, 4, 19, 3, 1, 36, 16] can be applied to classification algorithms in fully distributed setting The basic idea of these solutions is that every user perturbs its data, before sending it to the miner The miner then can reconstruct the original data to obtain the mining results... be used in 2PFD setting Some other solutions based on k-anonymization of user’s data have been proposed in [83, 77] The advantage of these solutions is that they do not depend on the underlying data mining tasks, because the anonymous data can be used for various data mining tasks without disclosing privacy However, these solutions are inapplicable in 2PFD setting, because the miner can not link two... accuracy in the data mining results, and vice-versa 25 In [74, 77] the authors solved various privacy preserving data mining tasks such as naive Bayes learning, decision tree learning, association rule mining etc The proposed cryptographic approaches are able to maintain strong privacy without loss of accuracy The key idea of these approaches is a private frequency computation method that allows a data miner... Li/Ion batteries lead to brain tumors in diabetics Privacy- preserving user data mining: This research involves a scenario in which a data miner surveys a large number of users to learn some data mining results based on the user data or collects the user data while the sensitive attributes of these users need to be protected [74, 77, 19] In this scenario, each user only maintains a data record This can be... learning algorithms in this scenario 2 To develop novel privacy- preserving techniques for popular data mining algorithms such as association rule mining and clustering methods 3 To present a technique to design protocols for privacy- preserving multivariate outlier detection in both horizontally and vertically distributed data models The developed solutions will be evaluated in terms of the degree of privacy. .. horizontally partitioned data In the context of privacy- preserving data mining, banks do not need to reveal their databases to each other They can still apply k-NN classification to the joint databases of banks while preserving each bank’s privacy information In vertically distribution, a data set is distributed into some parties Every party owns a vertical part of every record in the database (it holds records... then from observing the data one can determine the identity of the user or deduce a limited set that consists of user The goal of k-anonymity is that every tuple in the released private table is indistinguishability from at least other k users Privacy- preserving distributed data mining: This research area aims to develop distributed data mining algorithms without accessing original data [33, 79, 35, . s. Privacy- preserving distributed data mining: This research area aims to develop distributed data mining algorithms without accessing original data [33, 79, 35, 68, 80, 40]. Different from privacy. lead to brain tumors in diabetics. Privacy- preserving user data mining: This research involves a scenario in which a data miner surveys a large number of users to learn some data mining r esu. 68, 80, 40]. Different from privacy preserving data publishing, each study in privacy- preserving distr i b u t ed data mining is often to solve a specific data mining task. The model of this area

distributed solutions in privacy preserving data mining

Thông tin tài liệu

Từ khóa liên quan

Tài liệu cùng người dùng

Tài liệu liên quan