On the quality and price of data

TANG RUIMING
(Bachelor of Engineering, Northeastern University in China)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
COMPUTER SCIENCE, SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
Supervised by Professor Stéphane Bressan
2014

DECLARATION

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

TANG RUIMING
31 July 2014

Acknowledgements

This thesis would not have been possible without the guidance and help of the many people who supported my studies.

My first and foremost thanks go to my supervisor, Professor Stéphane Bressan. Professor Bressan first introduced me to the area of database research. He taught me how to read research papers, how to identify research problems, how to formalize problems and how to write research papers. His guidance has led me towards being able to think and work independently. As a supervisor, his insights in database research, as well as his advice, inspired my growth from an undergraduate student to a qualified Ph.D. candidate. I will benefit from these not only for my Ph.D. degree, but also for the rest of my life.

Professor Pierre Senellart, who has influenced me in different ways, deserves my special thanks. He hosted me for a 2-month internship at Télécom ParisTech. Whenever I had a question, his door was always open for discussion.

I gratefully acknowledge Professor Reynold Cheng and Professor Patrick Valduriez, who gave me insightful advice on my research work. Our collaborations were fruitful.

I appreciate all the co-authors who worked with me: Huayu Wu, Sadegh Nobari, Dongxu Shao, Zhifeng Bao, Antoine Amarilli and M. Lamine Ba. We are working on interesting research ideas. Their contributions further strengthened the technical depth and presentation quality of our papers.
Moreover, many thanks go to my lab-mates in the School of Computing. We spent a few years together, and they will remain precious memories forever.

Last but not least, my deepest love goes to my parents, Yongning Tang and Ling Jiang, and my wife, Yanjia Yang. They have always supported and encouraged me throughout my Ph.D. career. Their love and support give me the faith and strength to face any difficulties in my life.

Contents

Acknowledgements
Abstract
List of Publications
List of Figures
List of Tables

1 Introduction
  1.1 Motivation
  1.2 Research Problems
    1.2.1 Conditioning
    1.2.2 Data Pricing
    1.2.3 Query Pricing
  1.3 Contributions
  1.4 Organization

2 Related Work
  2.1 Probabilistic Data Models
    2.1.1 Probabilistic Relational Data Models
    2.1.2 Probabilistic XML Data Models
  2.2 Conditioning
  2.3 Data Pricing
    2.3.1 Price of Data
    2.3.2 Price of Query

3 Cleaning Data: Conditioning Uncertain Data
  3.1 Introduction
  3.2 Proposed Probabilistic Data Model
    3.2.1 Trees and XML documents
    3.2.2 Probabilistic Relational and XML Data
    3.2.3 Possible worlds
    3.2.4 Equivalent probabilistic databases
    3.2.5 Conditioning Problem
  3.3 General case
    3.3.1 Time complexity
    3.3.2 Compactness of representation
  3.4 Constraint Language
    3.4.1 Mutually Exclusive Constraints
    3.4.2 Implication Constraints
  3.5 Detailed Description of Considered Constraints and Local Database (Local Tree)
    3.5.1 Local Database (Local Tree) and Local Possible Worlds
    3.5.2 Considered Constraints
    3.5.3 Number of Local Possible Worlds
  3.6 Mutually Exclusive Tuple Constraints in Probabilistic Relational Databases
    3.6.1 MET constraints under WOMB semantics
    3.6.2 MET constraints under WMB semantics
  3.7 Implication constraints in probabilistic relational databases
    3.7.1 FKPK Implication Constraints
    3.7.2 FK Implication Constraints
    3.7.3 REF Implication Constraints
  3.8 Mutually Exclusive Constraints in probabilistic XML documents
    3.8.1 MutEx Siblings Constraints
    3.8.2 MutEx AD Constraints
    3.8.3 MutEx Descendance Constraints
    3.8.4 MED&AD MutEx Constraints
  3.9 Discussion: Multiple Constraints
    3.9.1 Multiple Constraints in Probabilistic Relational databases
    3.9.2 Multiple Constraints in Probabilistic XML documents
  3.10 Conclusion

4 Pricing Data: What you Pay for is What you Get
  4.1 Relational Data Pricing
    4.1.1 Introduction
    4.1.2 Data Model and Basic Concepts
    4.1.3 Distance and Probability Functions
    4.1.4 Optimal (pr0, Dbase)-acceptable distributions
    4.1.5 Algorithms
    4.1.6 Conclusion
  4.2 XML Data Pricing
    4.2.1 Introduction
    4.2.2 Background: Subtree/Subgraph Sampling
    4.2.3 Pricing Function and Sampling Problem
    4.2.4 Tractability
    4.2.5 Algorithms for Tractable Uniform Sampling
    4.2.6 Repeated Requests
    4.2.7 Conclusion
  4.3 Query Pricing
    4.3.1 Introduction
    4.3.2 Background: Relational Data Provenance Semantics
    4.3.3 Pricing Queries on Relational Data
    4.3.4 Computing Price
    4.3.5 Experiments
    4.3.6 Conclusion
  4.4 Conclusion

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
    5.2.1 Conditioning
    5.2.2 Data Pricing

Bibliography

Abstract

Data consumers, data providers and data market owners participate in data markets. Data providers collect, clean and trade data. In this thesis, we study the quality and price of data. More specifically, we study how to improve data quality through conditioning, and the relationship between quality and price of data.

In order to improve data quality (more specifically, accuracy) by adding constraints or information, we study the conditioning problem. A probabilistic database denotes a set of non-probabilistic databases called possible worlds, each of which has a probability. This is often a compact way to represent uncertain data. In addition, direct observations and general knowledge, in the form of constraints, help in refining the probabilities of the possible worlds, and possibly ruling out some of them.
Enforcing such constraints on the set of possible worlds of a probabilistic database is called conditioning the probabilistic database: it yields the subset of possible worlds that are valid under the given constraints, and refines the probability of each valid possible world to its conditional probability given that the constraints are true. The conditioning problem is to find a new probabilistic database that denotes the valid possible worlds, with respect to the constraints, with their new probabilities. We propose a framework for representing conditioned probabilistic (relational and XML) data. Unfortunately, the general conditioning problem involves the simplification of general Boolean expressions and is NP-hard. We therefore identify specific practical families of constraints, for which we devise and present efficient conditioning algorithms.

Data providers and data consumers expect the price of data to be commensurate with its quality. We study the relationship between quality and price of data. We separate the cases wherein data consumers request data items directly from those in which data consumers specify the parts of data they are interested in by issuing queries. For pricing data items, we propose a pricing framework in which data consumers can trade data quality for discounted prices. For pricing queries, we propose a pricing framework to define, compute and estimate the prices of queries.

For pricing data items, we propose a theoretical and practical pricing framework for a data market in which data consumers can trade data quality for discounted prices. In most data markets, prices are prescribed and not negotiable, and give access to the best data quality that the data provider can achieve. Instead, we consider a model in which data quality can be traded for discounted prices: “what you pay for is what you get”. A data consumer proposes a price for the data that she requests.
If the price is less than the price set by the data provider, then she may get a lower-quality version of the requested data. The data market owners negotiate the pricing schemes with the data providers and implement these schemes to generate lower-quality versions of the requested data. We propose a theoretical and practical pricing framework with algorithms for relational data and XML data respectively.

Firstly, in the framework for pricing relational data, “data quality” refers to data accuracy. The published data value is drawn at random from a probability distribution. The distribution is computed such that its distance to the actual value is commensurate with the discount. The published value comes with a guarantee on the probability of being the exact value. This probability is also commensurate with the discount. We present and formalize the principles that a healthy data market should meet for such a transaction. Two ancillary functions are defined, and the algorithms that use these functions to compute the approximate value from the proposed price are described. We prove that the functions and the algorithms meet the required principles.

Secondly, in the framework for pricing XML data, “data quality” refers to data completeness. In our setting, the data provider offers an XML document, and sets both the price of the document and a weight for each node of the document, depending on its potential worth. The data consumer proposes a price. If the proposed price is lower than that of the entire document, then the data consumer receives a sample, i.e., a random rooted subtree of the document whose selection depends on the discounted price and the weight of nodes. By requesting several samples, the data consumer can iteratively explore the data in the document. The uniform random sampling of a rooted subtree with prescribed weight is unfortunately intractable. However, it is possible to identify several practical cases that are tractable.
The first case is a uniform random sampling of a rooted subtree with prescribed size; the second case restricts to binary weights. For both these practical cases, polynomial-time algorithms are presented, with an explanation of how they can be integrated into an iterative exploratory sampling approach.

We study the problem of defining and computing the prices of queries for cases wherein data consumers request data in the form of queries. We propose a generic query pricing model based on minimal provenances, i.e., minimal sets of tuples contributing to the query result (which can be viewed as the quality of the query result). A data consumer has to pay for the tuples that her query needs to produce the query result: “what you pay for is what you get”. If a query needs higher-quality (namely higher-price) tuples, the price of this query should be higher. The proposed model fulfills desirable properties, such as contribution monotonicity, bounded-price and contribution arbitrage-freedom. Computing the exact price of a query in our pricing model is shown to be NP-hard, and a baseline algorithm to compute the exact price of a query is presented. Several heuristics are devised, presented and compared. A comprehensive experimental study is conducted to show their effectiveness and efficiency.

Chapter 4. Pricing Data: What you Pay for is What you Get

... are more suitable than the metrics in Heuristic 2 and Heuristic 4 in the random data sets. The intuition of Heuristic 2 and Heuristic 4 is to include many source tuples at first and try to reuse them. However, in our experiment setting, there are 100 source tuples, while each minimal why-provenance consists of only a few tuples. The chance of such reuse is very small. Therefore these heuristics work worse than choosing the cheapest minimal why-provenances directly.

• The algorithm with Heuristic 3 is more effective than the one with Heuristic 1, and the algorithm with Heuristic 4 is more effective than the one with Heuristic 2.
It can be inferred that memorizing previous choices when choosing the current minimal why-provenance improves the performance of the algorithms. The rationale is that, in most cases, memorizing this information helps bring the current choice closer to the global optimum.

4.3.5.4 Study of Efficiency

In our study, we observe that the number of result tuples affects efficiency the most, compared to the other factors. Therefore in this section, we study the efficiency when we vary the number of result tuples from 1,000 to 5,000. We fix both the number of minimal why-provenances for each result tuple and the number of source tuples in each minimal why-provenance to 5.

Figure 4.6 shows the running time of different approximation algorithms when varying the number of result tuples from 1,000 to 5,000. The x-axis represents the number of result tuples. The y-axis represents the running time of the approximation algorithms. Every value is the average of 10,000 runs. We do not present the running time of the exact algorithm since it does not scale to 1,000 to 5,000 result tuples. Note that the curves of Heuristic 1 and Heuristic 2 are almost the same, and the curves of Heuristic 3 and Heuristic 4 are almost the same; therefore we can only see two separate curves in Figure 4.6. Recall that the time complexity of the algorithms with Heuristic 1 and Heuristic 2 is O(mn), and the time complexity of the algorithms with Heuristic 3 and Heuristic 4 is O(mnb).

[Figure 4.5: Percentages of α value in different intervals. Panels: (a) Algorithm with Heuristic 1 (choose the cheapest ones independently); (b) Algorithm with Heuristic 2 (choose the ones with lowest average price independently); (c) Algorithm with Heuristic 3 (choose the cheapest ones considering the previous choices); (d) Algorithm with Heuristic 4 (choose the cheapest ones considering the previous choices). x-axis: α value, in intervals [1.0,1.2), [1.2,1.4), [1.4,1.6), [1.6,1.8), [1.8,2.0), [2.0,10.0]; y-axis: percentage (%).]

[Figure 4.6: Running time of different approximation algorithms. x-axis: number of result tuples (1K to 5K); y-axis: running time (ms); curves: Heuristic 1, Heuristic 2, Heuristic 3, Heuristic 4.]

We can make the following observations from Figure 4.6. Firstly, the algorithms with Heuristic 1 and Heuristic 2 have almost the same running time, while the algorithms with Heuristic 3 and Heuristic 4 also have the same running time.
The two algorithms with Heuristic 1 and Heuristic 2 differ from each other by using different metrics when choosing a minimal why-provenance for each result tuple: the former chooses the cheapest why-provenance while the latter chooses the one with the lowest average price. The computational costs of these two metrics are the same. Therefore the algorithms with Heuristic 1 and Heuristic 2 have the same running time. The same analysis applies to the running time of the algorithms with Heuristic 3 and Heuristic 4. Secondly, the running time is linear in the number of result tuples since other relevant parameters are fixed. Thirdly, the algorithms with Heuristic 1 and Heuristic 2 are more efficient than the algorithms with Heuristic 3 and Heuristic 4. When we choose a minimal why-provenance for a result tuple using Heuristic 3 and Heuristic 4, we have to check and exclude the source tuples that have already been bought. This checking is not part of the algorithms with Heuristic 1 and Heuristic 2. Therefore the algorithms with Heuristic 3 and Heuristic 4 are more costly in terms of running time.

Generally, the approximation algorithms are efficient. When the number of result tuples reaches 5,000, their running time is less than 300 ms.

4.3.6 Conclusion

In this section we proposed a generic query pricing model that is based on minimal provenances, i.e., minimal sets of tuples contributing to the result of a query, which can be viewed as the quality of the query result. We showed that the proposed model fulfils desirable properties, such as contribution monotonicity, bounded-price and contribution arbitrage-freedom. We showed that, in general, computing the exact price of a query is intractable. We devised a baseline algorithm to compute the exact price of a query and heuristics to approximate the price of a query in PTIME.
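The greedy choices discussed in the efficiency study can be illustrated with a small sketch. It assumes, as a simplification not spelled out in this excerpt, that the price of a query is the total price of the distinct source tuples bought when one minimal why-provenance is chosen per result tuple. The function names, the `price` dictionary and the toy data are hypothetical; the code shows the independent cheapest-first choice, its variant that remembers tuples already bought, and a brute-force exact baseline, and is not the thesis's implementation.

```python
from itertools import product


def union_price(chosen, price):
    """Total price of the distinct source tuples in the chosen provenances."""
    bought = set().union(*chosen)
    return sum(price[t] for t in bought)


def exact_price(provenances, price):
    """Brute force: try every way of picking one provenance per result tuple."""
    return min(union_price(choice, price) for choice in product(*provenances))


def greedy_independent(provenances, price):
    """Pick the cheapest provenance for each result tuple, independently."""
    chosen = [min(ps, key=lambda p: sum(price[t] for t in p)) for ps in provenances]
    return union_price(chosen, price)


def greedy_with_memory(provenances, price):
    """Pick the cheapest provenance after excluding tuples already bought."""
    bought = set()
    for ps in provenances:
        best = min(ps, key=lambda p: sum(price[t] for t in p - bought))
        bought |= best
    return sum(price[t] for t in bought)


# Hypothetical toy instance: two result tuples, four priced source tuples.
price = {"a": 3, "b": 3, "c": 4, "d": 2}
provenances = [
    [frozenset({"a", "b"})],                    # result tuple 1
    [frozenset({"c"}), frozenset({"a", "d"})],  # result tuple 2
]
print(exact_price(provenances, price))          # 8
print(greedy_independent(provenances, price))   # 10
print(greedy_with_memory(provenances, price))   # 8
```

On this instance the independent choice pays for tuple c although buying d would have been cheaper once a is already owned, which is the kind of reuse the memory-based variant exploits at the cost of the extra membership checks mentioned above.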
We also presented two favorable classes of queries for which the running time of the exact algorithm is polynomial. We evaluated the effectiveness and efficiency of the proposed algorithms. The experiments showed that the accuracy of the approximate price computation is much better than expected based on the theoretical analysis, and the algorithms are efficient.

4.4 Conclusion

In this chapter, we studied the relationship between the quality and price of data. We separated the cases where data consumers request data items directly from those where data consumers specify the parts of data they are interested in by issuing queries. For pricing data items, we proposed a theoretical and practical pricing framework for a data market in which data consumers can trade data quality for discounted prices: “what you pay for is what you get”, for relational data and XML data respectively.

In Section 4.1, we proposed a framework for pricing relational data in which “data quality” refers to data accuracy. In our framework, the value provided to a data consumer is exact if she offers the full price for it. The value is approximate if she offers to pay only a discounted price. In the case of a discounted price, the value is drawn at random from a probability distribution. The distance of the distribution to the actual value (to the degenerate distribution) is commensurate with the discount. The published value comes with a guarantee on its probability of being the exact value. This probability is also commensurate with the discount. We defined two ancillary pricing functions under several principles for a healthy market. We proposed algorithms that, given a price proposed by the data consumer, compute a satisfactory probability distribution (from which the published value is sampled) with the help of the two pricing functions. We proved the correctness of the functions and algorithms.
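The idea of trading accuracy for a discount can be made concrete with a minimal sketch. It does not reproduce the two ancillary pricing functions of Section 4.1, which are not given in this excerpt. Instead it assumes, purely for illustration, a discrete domain and a probability of publishing the exact value that grows linearly with the ratio of the proposed price to the full price; the function names and the linear rule are hypothetical.

```python
import random


def acceptable_distribution(domain, exact, offered, full_price):
    """Map a proposed price to a distribution over a discrete domain.

    Assumption (illustrative, not the thesis's functions): the probability of
    the exact value rises linearly from 1/len(domain) at price 0 to 1 at the
    full price, and the remaining mass is spread evenly over the other values.
    """
    if exact not in domain:
        raise ValueError("exact value must belong to the domain")
    if full_price <= 0 or not 0 <= offered <= full_price:
        raise ValueError("offered price must lie between 0 and the full price")
    n = len(domain)
    if n == 1:
        return {exact: 1.0}
    ratio = offered / full_price
    p_exact = 1 / n + (1 - 1 / n) * ratio
    p_other = (1 - p_exact) / (n - 1)
    return {v: (p_exact if v == exact else p_other) for v in domain}


def publish(domain, exact, offered, full_price, rng=random):
    """Sample the published value and return it with its guarantee."""
    dist = acceptable_distribution(domain, exact, offered, full_price)
    values = list(dist)
    value = rng.choices(values, weights=[dist[v] for v in values], k=1)[0]
    return value, dist[exact]


# Half the full price: the exact value "B" is published with probability 0.625.
value, guarantee = publish(["A", "B", "C", "D"], "B", offered=50, full_price=100)
```

Under this rule the full price yields the degenerate distribution on the exact value, and a lower proposed price moves the distribution further from it, which mirrors the "commensurate with the discount" requirement in spirit only.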
In Section 4.2, we proposed a framework for pricing XML data in which “data quality” refers to data completeness. Namely, a data provider offers an XML document, and sets both the price and weights of nodes of the document. The data consumer proposes a price but may get only a sample if the proposed price is less than that of the entire document. A sample is a rooted subtree of prescribed weight, as determined by the proposed price, sampled uniformly at random. We proved that if nodes in the XML document have arbitrary non-negative weights, the sampling problem is intractable. We identified tractable cases, namely the unweighted sampling problem and 0/1-weights sampling problem, for which we devised PTIME algorithms. We proved the time complexity and correctness of the algorithms. We also considered repeated requests and provided PTIME solutions to the unweighted cases. In Section 4.3, we studied the problem of defining and computing the prices of queries for the case that data consumers request for data in forms of queries. We proposed a generic query pricing model that is based on minimal provenances, i.e., minimal sets of tuples contributing to the result of a query, which can be viewed as the quality of the query result. A data consumer has to pay for the tuples that her query needs to produce the query result: “what you pay for is what you get”. If a query needs higher-quality (namely higher-price) tuples, the price of this query should be higher. We showed that the proposed model fulfils desirable properties, such as contribution monotonicity, bounded-price and contribution arbitrage-freedom. We showed that, in general, computing the exact price of a query is intractable. We devised a baseline algorithm to compute the exact price of a query and also devised heuristics to approximate the price in PTIME. We also presented two classes of queries for which the running time of the exact algorithm is polynomial. 
We evaluated the effectiveness and efficiency of the proposed algorithms. The experiments showed that the accuracy of the approximate price computation is much better than expected based on the theoretical analysis, and the algorithms are efficient.

Chapter 5
Conclusion and Future Work

5.1 Conclusion

In data marketplaces, data providers collect, clean and trade data. In this thesis, we studied the quality and price of data. More specifically, we studied how to improve data quality via conditioning, and studied the relationship between the quality and price of data.

Data providers may clean data to obtain higher-quality versions of data, in order to gain higher willingness-to-pay from data consumers. In Chapter 3, in order to improve data quality (more specifically, accuracy), we studied the conditioning problem. We presented our probabilistic data model (i.e., probabilistic relational data model and probabilistic XML data model), which natively caters for constraints rather than treating them as add-ons. We defined the conditioning problem in our proposed probabilistic data model. We showed that conditioning in general is intractable and that obtaining minimal representations relates to long-standing open problems in circuit complexity. An EXPTIME algorithm for the general case of conditioning probabilistic XML data was presented. Then we focused on the special cases of mutually exclusive constraints and implication constraints in probabilistic relational and probabilistic XML data with independent events. We devised and presented PTIME conditioning algorithms for such constraints. Lastly, we studied the conditions under which our conditioning algorithms can be applied to handle multiple constraints.

In Chapter 4, we studied the relationship between the quality and price of data.
We separated the cases wherein data consumers request data items directly, and those where data consumers specify the parts of data they are interested in by issuing queries. For pricing data items, we introduced the idea of “versioning”, which is to say we generate a lower-quality version of data for a data consumer with lower willingness-to-pay. We proposed a theoretical and practical pricing framework for a data market in which data consumers can trade data quality for discounted prices: “what you pay for is what you get”. A data consumer proposes a price for the data that she requests. If the proposed price is less than the price set by the data provider, then she possibly gets a lower-quality version of the requested data. We proposed a theoretical and practical pricing framework with the algorithms for relational data and XML data respectively. In Section 4.1, we proposed a theoretical and practical pricing framework for a data market in which data consumers can trade relational data accuracy for discounted prices. In our framework, the exact value is returned to a data consumer if she proposes the same price as the one set by the data provider. An approximate value is returned if she offers to pay only a discounted price. In the case of a discounted price, an approximation of the exact value is randomly determined from a probability distribution, where the distance to the exact value (the degenerate distribution) is commensurate with the discount. The published approximate value comes with a guarantee on its probability to be the exact value. The probability is also commensurate with the discount. We defined two ancillary pricing functions under several principles, for a healthy market. Algorithms to compute a satisfactory probability distribution (from which the published value is sampled) with the help of the two defined pricing functions, given a proposed price by the data consumer, were proposed. We proved the correctness of the functions and algorithms. 
In Section 4.2, we proposed a framework for pricing XML data in which “data quality” refers to data completeness. A data provider offers an XML document, and sets both the price and weights of nodes of the document. The data consumer gets the full XML document only if her proposed price is the same as that of the entire document. If a discounted price is offered, she receives a sample: a rooted subtree of prescribed weight commensurate with the proposed price, sampled uniformly at random. We proved that if nodes in the XML document have arbitrary non-negative weights, the sampling problem is intractable. We devised PTIME algorithms for two tractable cases: the unweighted sampling problem and the 0/1-weights sampling problem. We proved the time complexity and correctness of the algorithms. We also considered repeated requests and provided PTIME solutions to the unweighted cases.

In Section 4.3, we studied the problem of defining and computing the prices of queries for the cases where data consumers request data in the form of queries. In response to the observation that view granularity is too coarse for some applications, we proposed a tuple-level pricing model. In our model, each tuple has a price set by the data provider. The price of a query is determined by its minimal provenances, i.e., minimal sets of tuples contributing to the result of the query, which can be viewed as the quality of the query result. A data consumer has to pay for the tuples that her query needs to produce the query result: “what you pay for is what you get”. We showed that the proposed model fulfils desirable properties, such as contribution monotonicity, bounded-price and contribution arbitrage-freedom. We showed that, in general, computing the exact price of a query is intractable in our pricing model. We devised a baseline algorithm to compute the exact price of a query. We also devised heuristics to approximate the price of a query in PTIME.
We also presented two classes of queries for which the running time of the exact algorithm is polynomial. We evaluated the effectiveness and efficiency of the proposed algorithms. The experiments showed that the accuracy of the approximate price computation is much better than expected based on the theoretical analysis, and the algorithms are efficient.

5.2 Future Work

We studied how to improve data quality by conditioning and the relationship between the quality and price of data. In this section, we present several directions that we would like to work on in the future.

5.2.1 Conditioning

In Chapter 3, we studied the conditioning problem in probabilistic data. We devised and presented PTIME conditioning algorithms for mutually exclusive constraints and implication constraints in PrTPLind and mutually exclusive constraints in PrXMLind. We list possible research directions in conditioning probabilistic data.

Implication constraints are also a practical kind of constraint to consider in PrXMLind, and they have not yet been studied. For instance, in Figure 3.1, consider a constraint saying that module IT2002 is a prerequisite of module CS2102. This constraint is an implication constraint: the existence of node 15 implies the existence of node 16.

Finding approximate solutions and finding tractable special cases are two ways of studying intractable problems. Due to the intractability of the general conditioning problem, we studied some special cases of constraints for which the conditioning problem is tractable. As the other way of studying the general conditioning problem, approximation conditioning algorithms with reasonable bounds are useful in applications where accuracy is not strictly required. Devising such approximation conditioning algorithms is also a reasonable direction to work on.
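As a point of reference for these directions, conditioning under a single implication constraint can be illustrated by brute force over possible worlds. The sketch below assumes independent events with given existence probabilities and simply enumerates, filters and renormalises the worlds. It therefore takes exponential time and returns an explicit list of worlds rather than a compact conditioned probabilistic database, so it is not one of the PTIME algorithms of Chapter 3. The labels `n15` and `n16` and the probabilities are hypothetical.

```python
from itertools import product


def possible_worlds(probs):
    """Enumerate the worlds of independent events with their probabilities."""
    items = sorted(probs)
    for bits in product([False, True], repeat=len(items)):
        p = 1.0
        for item, present in zip(items, bits):
            p *= probs[item] if present else 1 - probs[item]
        yield frozenset(i for i, b in zip(items, bits) if b), p


def condition(probs, constraint):
    """Keep the valid worlds and renormalise to conditional probabilities."""
    valid = [(w, p) for w, p in possible_worlds(probs) if constraint(w)]
    total = sum(p for _, p in valid)
    if total == 0:
        raise ValueError("the constraint has probability zero")
    return {w: p / total for w, p in valid}


def implies(world):
    """Implication constraint: if n15 exists, then n16 exists."""
    return "n15" not in world or "n16" in world


# Two independent nodes, each present with probability 0.5 (made-up numbers).
conditioned = condition({"n15": 0.5, "n16": 0.5}, implies)
# The world containing n15 alone is ruled out; the three remaining worlds
# each receive conditional probability 1/3.
```

An approximation algorithm in the sense discussed above would aim to estimate these conditional probabilities without enumerating all worlds.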
In a third direction, it would be useful to better understand the general case of conditioning, and to have a decision procedure for determining whether a given conditioning problem is tractable. The results we have so far are sufficient conditions for a conditioning problem to be tractable. Finding necessary conditions is also important and worthy of study.

5.2.2 Data Pricing

In Section 4.1, we proposed a framework for trading relational data accuracy for discounted prices. In our proposed framework, we only considered data with discrete domains. We will extend our framework to handle data with continuous domains. The problem of avoiding arbitrage when the same consumer issues several requests has not yet been studied. We will adapt our framework to avoid arbitrage in this case. Moreover, accuracy is only one of the dimensions for measuring data quality. Other data quality dimensions (e.g., reputation, completeness, timeliness, etc.) could also be traded for discounted prices. We will extend our framework by considering other data quality dimensions.

In Section 4.2, we proposed a framework for trading XML data completeness for discounted prices. The more general issue that we are currently investigating is that of sampling rooted subtrees uniformly at random under more expressive conditions than size restrictions or 0/1-weights. In particular, we intend to identify the tractability boundary, that is, to describe the class of tree statistics for which it is possible to sample rooted subtrees in PTIME under a uniform distribution.

In Section 4.3, we proposed a pricing framework for charging for queries.
To the best of our knowledge, there is no existing research on pricing queries that allows a data consumer to propose her willingness-to-pay when issuing a query: on receiving any possible proposed payment from a data consumer, the data provider returns a query result (of the input query) according to the proposed price. One possible solution is to degrade the quality of the query result according to the discounted price (e.g., by reducing the number of answers, approximating the values, etc.), so that the returned query result is a lower-quality version of the perfect query result.

Bibliography

[Abiteboul and Senellart, 2006] Serge Abiteboul and Pierre Senellart. Querying and Updating Probabilistic Information in XML. In EDBT, pages 1059–1068, 2006.
[Abiteboul et al., 1995] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.
[Abiteboul et al., 2009] Serge Abiteboul, Benny Kimelfeld, Yehoshua Sagiv, and Pierre Senellart. On the Expressiveness of Probabilistic XML models. VLDB J., 18(5):1041–1064, 2009.
[Agrawal et al., 2006] Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Tomoe Sugihara, and Jennifer Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.
[Allender, 2008] Eric Allender. Chipping away at P vs NP: How far are we from proving circuit size lower bounds? In CATS, page 3, 2008.
[Amarilli and Senellart, 2013] Antoine Amarilli and Pierre Senellart. On the connections between relational and XML probabilistic data models. In Proc. BNCOD, pages 121–134, Oxford, United Kingdom, July 2013.
[Andrei and Robert, 1997] Andrei Shleifer and Robert Vishny. The limits of arbitrage. Journal of Finance, 1997.
[Antova et al., 2008] Lyublena Antova, Thomas Jansen, Christoph Koch, and Dan Olteanu. Fast and simple relational processing of uncertain data. In ICDE, pages 983–992, 2008.
[Ba et al., 2013] M. Lamine Ba, Talel Abdessalem, and Pierre Senellart. Uncertain version control in open collaborative editing of tree-structured documents. In Proc. DocEng, Florence, Italy, 2013.
[Balazinska et al., 2011] Magdalena Balazinska, Bill Howe, and Dan Suciu. Data markets in the cloud: An opportunity for the database community. PVLDB, 4(12):1482–1485, 2011.
[Barbará et al., 1992] Daniel Barbará, Hector Garcia-Molina, and Daryl Porter. The Management of Probabilistic Data. IEEE Trans. Knowl. Data Eng., 4(5):487–502, 1992.
[Bhargava and Sundaresan, 2003] Hemant K. Bhargava and Shankar Sundaresan. Contingency pricing for information goods and services under industrywide performance standard. J. Manage. Inf. Syst., 2003.
[Birnbaum and Jaffe, 2007] Ben Birnbaum and Alex Jaffe. Relational data markets. 2007.
[Brynjolfsson et al., 2011] Erik Brynjolfsson, Lorin M. Hitt, and Heekyung Kim. Strength in numbers: How does data-driven decision-making affect firm performance? In ICIS, 2011.
[Buneman et al., 2001] Peter Buneman, Sanjeev Khanna, and Wang-Chiew Tan. Why and Where: A Characterization of Data Provenance. In ICDT, pages 316–330, 2001.
[Cavallo and Pittarelli, 1987] Roger Cavallo and Michael Pittarelli. The Theory of Probabilistic Databases. In VLDB, pages 71–81, 1987.
[Chang et al., 2006] Chia-Hui Chang, Mohammed Kayed, Moheb Ramzy Girgis, and Khaled F. Shaalan. A survey of Web information extraction systems. IEEE Trans. on Knowl. and Data Eng., 2006.
[Cheney et al., 2009] James Cheney, Laura Chiticariu, and Wang Chiew Tan. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases, 1(4):379–474, 2009.
[Cohen et al., 2008] Sara Cohen, Benny Kimelfeld, and Yehoshua Sagiv. Incorporating Constraints in Probabilistic XML. In PODS, pages 109–118, 2008.
[Cook, 1971] Stephen A. Cook. The complexity of theorem-proving procedures. In Proceedings of the Third Annual ACM Symposium on Theory of Computing, STOC ’71, pages 151–158, New York, NY, USA, 1971. ACM.
[Cui and Widom, 2000] Yingwei Cui and Jennifer Widom. Practical Lineage Tracing in Data Warehouses. In ICDE, pages 367–378, 2000.
[Dalvi and Suciu, 2004] Nilesh N. Dalvi and Dan Suciu. Efficient query evaluation on probabilistic databases. In VLDB, pages 864–875, 2004.
[Dey and Sarkar, 1998] Debabrata Dey and Sumit Sarkar. Psql: A query language for probabilistic relational data. Data Knowl. Eng., 28(1):107–120, 1998.
[Dong et al., 2009] Xin Luna Dong, Alon Halevy, and Cong Yu. Data integration with uncertainty. VLDB Journal, 2009.
[Durkee, 2010] Dave Durkee. Why cloud computing will never be free. Commun. ACM, 2010.
[Fink et al., 2011] Robert Fink, Dan Olteanu, and Swaroop Rath. Providing support for full relational algebra in probabilistic databases. In ICDE, pages 315–326, 2011.
[Fuhr and Rölleke, 1997] Norbert Fuhr and Thomas Rölleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1), 1997.
[Green and Tannen, 2006] Todd J. Green and Val Tannen. Models for incomplete and probabilistic information. In EDBT Workshops, pages 278–296, 2006.
[Henzinger et al., 2000] Monika Rauch Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc Najork. On near-uniform URL sampling. Computer Networks, 33(1-6), 2000.
[Hey et al., 2009] Tony Hey, Stewart Tansley, and Kristin M. Tolle, editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.
[Horowitz and Sahni, 1974] Ellis Horowitz and Sartaj Sahni. Computing partitions with applications to the knapsack problem. J. ACM, 21(2):277–292, April 1974.
[Hübler et al., 2008] Christian Hübler, Hans-Peter Kriegel, Karsten Borgwardt, and Zoubin Ghahramani. Metropolis algorithms for representative subgraph sampling. In ICDM, 2008.
[Hung et al., 2003a] Edward Hung, Lise Getoor, and V. S. Subrahmanian. Probabilistic Interval XML. In ICDT, pages 358–374, 2003.
[Hung et al., 2003b] Edward Hung, Lise Getoor, and V. S. Subrahmanian. PXML: A Probabilistic Semistructured Data Model and Algebra. In ICDE, pages 467–479, 2003.
[INTELIUS, 2013] INTELIUS. INTELIUS. https://www.intelius.com/, 2013.
[J.Green et al., 2007] Todd J. Green, Grigoris Karvounarakis, and Val Tannen. Provenance Semirings. In PODS, 2007.
[Kannan, 1982] Ravi Kannan. Circuit-size lower bounds and non-reducibility to sparse sets. Information and Control, 55(1-3):40–56, 1982.
[Karp, 1972] R. Karp. Reducibility among combinatorial problems. In R. Miller and J. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.
[Khanna et al., 2000] Sanjeev Khanna, Madhu Sudan, Luca Trevisan, and David P. Williamson. The approximability of constraint satisfaction problems. SIAM J. Comput., 30(6):1863–1920, 2000.
[Kharlamov and Senellart, 2011] Evgeny Kharlamov and Pierre Senellart. Modeling, Querying, and Mining Uncertain XML Data. In Andrea Tagarelli, editor, XML Data Mining: Models, Methods, and Applications. IGI Global, 2011.
[Kharlamov et al., 2010] Evgeny Kharlamov, Werner Nutt, and Pierre Senellart. Updating probabilistic XML. In Proc. EDBT/ICDT Workshops, 2010.
[Kimelfeld and Senellart, 2013] Benny Kimelfeld and Pierre Senellart. Probabilistic XML: Models and complexity. In Zongmin Ma and Li Yan, editors, Proc. Advances in Probabilistic Databases for Uncertain Information Management. Springer-Verlag, 2013.
[Koch and Olteanu, 2008] Christoph Koch and Dan Olteanu. Conditioning probabilistic databases. PVLDB, 2008.
[Koutris et al., 2012a] Paraschos Koutris, Prasang Upadhyaya, Magdalena Balazinska, Bill Howe, and Dan Suciu. Query-based data pricing. In PODS, 2012.
[Koutris et al., 2012b] Paraschos Koutris, Prasang Upadhyaya, Magdalena Balazinska, Bill Howe, and Dan Suciu. QueryMarket demonstration: Pricing for online data markets. PVLDB, 5(12), 2012.
[Koutris et al., 2013] Paraschos Koutris, Prasang Upadhyaya, Magdalena Balazinska, Bill Howe, and Dan Suciu. Toward practical query pricing with QueryMarket. In SIGMOD, 2013.
[Kushal et al., 2012] Avanish Kushal, Sharmadha Moorthy, and Vikash Kumar. Pricing for data markets. 2012.
[Leskovec and Faloutsos, 2006] Jure Leskovec and Christos Faloutsos. Sampling from large graphs. In SIGKDD, 2006.
[Li and Miklau, 2012] Chao Li and Gerome Miklau. Pricing aggregate queries in a data marketplace. In WebDB, pages 19–24, 2012.
[Li et al., 2012] Chao Li, Daniel Yang Li, Gerome Miklau, and Dan Suciu. A theory of pricing private data. CoRR, abs/1208.5258, 2012.
[Lin and Kifer, 2014] Bing-Rong Lin and Daniel Kifer. On arbitrage-free pricing for general data queries. PVLDB, 7(9):757–768, 2014.
[LLC, 2012] AggData LLC. AggData. http://www.aggdata.com/, 2012.
[Lu and Bressan, 2012] Xuesong Lu and Stéphane Bressan. Sampling connected induced subgraphs uniformly at random. In SSDBM, 2012.
[Luo et al., 2009] Cheng Luo, Zhewei Jiang, Wen-Chi Hou, Feng Yu, and Qiang Zhu. A sampling approach for XML query selectivity estimation. In EDBT, 2009.
[Maiya and Berger-Wolf, 2010] Arun S. Maiya and Tanya Y. Berger-Wolf. Sampling community structure. In WWW, 2010.
[Microsoft, 2012] Microsoft. Windows Azure Marketplace. https://datamarket.azure.com/, 2012.
[Muschalle et al., 2012] Alexander Muschalle, Florian Stahl, Alexander Loser, and Gottfried Vossen. Pricing approaches for data markets. In BIRTE, 2012.
[Nierman and Jagadish, 2002] Andrew Nierman and H. V. Jagadish. ProTDB: Probabilistic Data in XML. In VLDB, pages 646–657, 2002.
[Nilsson, 1986] Nils J. Nilsson. Probabilistic logic. Artif. Intell., 28(1):71–87, 1986.
[Papadimitriou, 1994] Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[Pipino et al., 2002] Leo Pipino, Yang W. Lee, and Richard Y. Wang. Data quality assessment. Commun. ACM, 45(4):211–218, 2002.
[Pittarelli, 1994] Michael Pittarelli. An algebra for probabilistic databases. IEEE Trans. Knowl. Data Eng., 6(2):293–303, 1994.
[Prugovečki, 2006] E. Prugovečki. Quantum Mechanics in Hilbert Space: Second Edition. Dover Books on Physics Series. Dover Publications, 2006.
[Püschel and Neumann, 2009] Tim Püschel and Dirk Neumann. Management of cloud infrastructures: Policy-based revenue optimization. In ICIS, page 178, 2009.
[Püschel et al., 2009] Tim Püschel, Arun Anandasivam, Stefan Buschek, and Dirk Neumann. Making money with clouds: Revenue optimization through automated policy decisions. In ECIS, pages 2303–2314, 2009.
[Re and Suciu, 2007] Christopher Re and Dan Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In VLDB, pages 51–62, 2007.
[Ribeiro and Towsley, 2010] Bruno F. Ribeiro and Donald F. Towsley. Estimating and sampling graphs with multidimensional random walks. In Internet Measurement Conference, 2010.
[Shapiro and Varian, 1998] Carl Shapiro and Hal R. Varian. Harvard Business Review, 1998.
[Upadhyaya et al., 2012] Prasang Upadhyaya, Magdalena Balazinska, and Dan Suciu. How to price shared optimizations in the cloud. PVLDB, 5(6):562–573, 2012.
[van Keulen et al., 2005] Maurice van Keulen, Ander de Keijzer, and Wouter Alink. A Probabilistic XML Approach to Data Integration. In ICDE, pages 459–470, 2005.
[Varian, 1995] Hal R. Varian. Pricing information goods. 1995.
[Wang and Strong, 1996] Richard Y. Wang and Diane M. Strong. Beyond accuracy: What data quality means to data consumers. J. of Management Information Systems, 12(4):5–33, 1996.
[Wang et al., 2003] Wei Wang, Haifeng Jiang, Hongjun Lu, and Jeffrey Xu Yu. Containment join size estimation: Models and methods. In SIGMOD, 2003.
[WeatherUnlocked, 2014] WeatherUnlocked. WeatherUnlocked. https://developer.weatherunlocked.com/, 2014.
[Wu and Banker, 2010] Shinyi Wu and Rajiv D. Banker. Best pricing strategy for information services. J. AIS, 11(6), 2010.
... data consumer, the data provider charges for the query. The price of the query should reflect the quality of the query result. The quality of the query result is affected by (1) the amount of the data items needed to answer the query and (2) the quality of the data items needed to answer the query (the quality of a data item may be defined as its price). One strategy of pricing queries is to define the price of a query as the aggregation of the prices of the data items needed to produce the query result. We consider the research problem of proposing a pricing framework to charge for queries on relational data, and expand more on this in Section 1.2.3.

Data consumers offer lower willingness-to-pay for a lower-quality version of data. In contrast, data providers may provide higher-quality versions of data to gain ...

... best data quality that the data provider can achieve. Yet, the idea of generating different versions of data for data consumers with different willingness-to-pay is not considered in existing data pricing frameworks. If we want to devise a data pricing scheme based on the concept of “versioning”, we have to study the relationship between quality and price of data. The concept of versioning (“degrade the quality of the product offered to the consumers with a low willingness-to-pay”) suggests that data quality is an important factor for pricing data and that there is a positive correlation between quality and price of data. Still, we have to answer the following two questions: (1) what is the price of a specific version of the original data according to its quality, (2) which is the version of the original data ...

... if the data consumer’s willingness-to-pay is less than the full price of the requested data, a lower-quality version of the requested data, where the quality is commensurate with the proposed price, is returned to the data consumer. In this framework, data quality can be traded for discounted prices. The idea is explored further in Section 1.2.2. We present several examples to illustrate the basic idea of ...

... define the price of a query as the aggregation of the prices of the data items needed to produce the query result. The authors of [Koutris et al., 2012a; Koutris et al., 2013; Koutris et al., 2012b] propose a pricing model that defines the price of an arbitrary query as the minimum sum of the prices of views that can determine the query on the current database, where the prices of a set of pre-defined views ...

Figure 1.1: The big picture of the contributions of this thesis

... straints rather than treating them as add-ons. We define the conditioning problem in our proposed data model and prove that for every consistent probabilistic database, there exists an equivalent unconstrained probabilistic database. Unfortunately, the general conditioning problem involves the simplification of general Boolean expressions and is ...

... this example, the data quality dimension refers to reputation.

• Intelius [INTELIUS, 2013] sells personal data. A data set contains personal information about “Full name, DOB, Criminal check, Marriage & divorce”. For a data consumer whose proposed price is lower than the full price of this data set, we may return a lower-quality version of the data set, e.g., another data set that contains only “Full name, ...

... gathering for data providers and data consumers to purchase and sell data. The three participants within a data market are data providers, data market owners and data consumers [Muschalle et al., 2012]. Data providers bring data to the market and set its full price. Data consumers buy data from the market. A data market owner is a broker. She negotiates the pricing schemes with data providers and manages the market ...

... and trade data. In this thesis, we study the quality and price of data. More specifically, we study how to improve data quality through conditioning, and the relationship between quality and price ...

... devised and presented. Data providers and data consumers expect the price of data to be commensurate with its quality. We study the relationship between quality and price of data. We separate the cases ...

... get”. A data consumer proposes a price for the data that she requests. If the price is less than the price set by the data provider, then she will possibly get a lower-quality version of the requested data.
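The strategy quoted above, pricing a query as an aggregation of the prices of the data items needed to produce its result, can be sketched as follows. This is a minimal illustration under stated assumptions, not the pricing function defined in the thesis: the aggregation is assumed to be a plain sum, the query is assumed to be charged for its cheapest minimal provenance, and all tuple identifiers and prices are hypothetical.

```python
def minimal_provenances(provenances):
    """Keep only the provenances that have no proper subset among the inputs."""
    sets = {frozenset(p) for p in provenances}
    return [p for p in sets if not any(q < p for q in sets)]


def provenance_price(provenance, tuple_price):
    """Assumed aggregation: the sum of the prices of the tuples involved."""
    return sum(tuple_price[t] for t in provenance)


def query_price(provenances, tuple_price):
    """Assumed rule: charge for the cheapest minimal provenance.

    Returns 0 when the query needs no tuples at all.
    """
    minimal = minimal_provenances(provenances)
    if not minimal:
        return 0
    return min(provenance_price(p, tuple_price) for p in minimal)


# Hypothetical tuple prices set by the data provider.
tuple_price = {"t1": 3, "t2": 5, "t3": 4}
# Hypothetical sets of tuples, each sufficient to produce the query result.
provenances = [{"t1", "t2"}, {"t3"}, {"t1", "t2", "t3"}]
print(query_price(provenances, tuple_price))
```

Here `{"t1", "t2", "t3"}` is dropped because it strictly contains other provenances, and the remaining candidates cost 8 and 4. Other aggregations could be substituted in `provenance_price` without changing the rest of the sketch.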

Table of Contents

Acknowledgements

Abstract

List of Publications

List of Figures

List of Tables

Introduction

Motivation

Research Problems

Conditioning

Data Pricing

Query Pricing

Contributions

Organization

Related Work

Probabilistic Data Models

Probabilistic Relational Data Models

Probabilistic XML Data Models

Conditioning

Data Pricing

Price of Data

Price of Query

Cleaning Data: Conditioning Uncertain Data

Introduction

Proposed Probabilistic Data Model

Trees and XML documents

Probabilistic Relational and XML Data

Possible worlds
