Database Systems: The Complete Book, Part 12

CHAPTER 20. INFORMATION INTEGRATION

it appears. Figure 20.17 suggests the process of adding a border to the cube in each dimension, to represent the * value and the aggregated values that it implies. In this figure we see three dimensions, with the lightest shading representing aggregates in one dimension, darker shading for aggregates over two dimensions, and the darkest cube in the corner for aggregation over all three dimensions. Notice that if the number of values along each dimension is reasonably large, but not so large that most points in the cube are unoccupied, then the "border" represents only a small addition to the volume of the cube (i.e., the number of tuples in the fact table). In that case, the size of the stored data CUBE(F) is not much greater than the size of F itself.

Figure 20.17: The cube operator augments a data cube with a border of aggregations in all combinations of dimensions

A tuple of the table CUBE(F) that has * in one or more dimensions will have, for each dependent attribute, the sum (or another aggregate function) of the values of that attribute in all the tuples that we can obtain by replacing the *'s by real values. In effect, we build into the data the result of aggregating along any set of dimensions. Notice, however, that the CUBE operator does not support aggregation at intermediate levels of granularity based on values in the dimension tables. For instance, we may either leave data broken down by day (or whatever the finest granularity for time is), or we may aggregate time completely, but we cannot, with the CUBE operator alone, aggregate by weeks, months, or years.

Example 20.17: Let us reconsider the Aardvark database from Example 20.12 in the light of what the CUBE operator can give us. Recall the fact table from that example is

    Sales(serialNo, date, dealer, price)

However, the dimension represented by serialNo is not well suited for the cube,
since the serial number is a key for Sales. Thus, summing the price over all dates, or over all dealers, but keeping the serial number fixed has no effect; we would still get the "sum" for the one auto with that serial number. A more useful data cube would replace the serial number by the two attributes, model and color, to which the serial number connects Sales via the dimension table Autos. Notice that if we replace serialNo by model and color, then the cube no longer has a key among its dimensions. Thus, an entry of the cube would have the total sales price for all automobiles of a given model, with a given color, by a given dealer, on a given date.

There is another change that is useful for the data-cube implementation of the Sales fact table. Since the CUBE operator normally sums dependent variables, and we might want to get average prices for sales in some category, we need both the sum of the prices for each category of automobiles (a given model of a given color sold on a given day by a given dealer) and the total number of sales in that category. Thus, the relation Sales to which we apply the CUBE operator is

    Sales(model, color, date, dealer, val, cnt)

The attribute val is intended to be the total price of all automobiles for the given model, color, date, and dealer, while cnt is the total number of automobiles in that category. Notice that in this data cube, individual cars are not identified; they only affect the value and count for their category.

Now let us consider the relation CUBE(Sales). A hypothetical tuple that would be in both Sales and CUBE(Sales) is

    ('Gobi', 'red', '2001-05-21', 'Friendly Fred', 45000, 2)

The interpretation is that on May 21, 2001, dealer Friendly Fred sold two red Gobis for a total of $45,000. The tuple

    ('Gobi', *, '2001-05-21', 'Friendly Fred', 152000, 7)

says that on May 21, 2001, Friendly Fred sold seven Gobis of all colors, for a total price of $152,000.
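To make the border construction concrete, here is a minimal sketch in Python (not the book's implementation): dictionaries stand in for the fact table and for CUBE(F), and the two Sales tuples are hypothetical, chosen so that the aggregate over all colors reproduces the $152,000 tuple above.

```python
from itertools import combinations
from collections import defaultdict

STAR = "*"  # the special "any value" marker used in CUBE(F)

def cube(fact_tuples, num_dims):
    """Compute CUBE(F) for tuples of the form (d1, ..., dk, val, cnt):
    for every subset of the k dimensions, replace those components
    by * and sum val and cnt over the matching fact tuples."""
    agg = defaultdict(lambda: [0, 0])
    for t in fact_tuples:
        dims, (val, cnt) = t[:num_dims], t[num_dims:]
        for r in range(num_dims + 1):
            for starred in combinations(range(num_dims), r):
                key = tuple(STAR if i in starred else dims[i]
                            for i in range(num_dims))
                agg[key][0] += val
                agg[key][1] += cnt
    return {k: tuple(v) for k, v in agg.items()}

# Hypothetical Sales tuples: (model, color, date, dealer, val, cnt)
sales = [
    ('Gobi', 'red',  '2001-05-21', 'Friendly Fred', 45000, 2),
    ('Gobi', 'blue', '2001-05-21', 'Friendly Fred', 107000, 5),
]
c = cube(sales, 4)
# The border tuple aggregating both colors:
print(c[('Gobi', STAR, '2001-05-21', 'Friendly Fred')])  # (152000, 7)
```

Each fact tuple contributes to 2^k cube entries, which is why the border stays small only relative to a densely occupied cube.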
Note that this tuple is in CUBE(Sales) but not in Sales. Relation CUBE(Sales) also contains tuples that represent the aggregation over more than one attribute. For instance,

    ('Gobi', *, '2001-05-21', *, 2348000, 100)

says that on May 21, 2001, there were 100 Gobis sold by all the dealers, and the total price of those Gobis was $2,348,000. The tuple

    ('Gobi', *, *, *, 1339800000, 58000)

says that over all time, dealers, and colors, 58,000 Gobis have been sold for a total price of $1,339,800,000. Lastly, the tuple

    (*, *, *, *, 3521727000, 198000)

tells us that total sales of all Aardvark models in all colors, over all time, at all dealers is 198,000 cars for a total price of $3,521,727,000.

Consider how to answer a query in which we specify conditions on certain attributes of the Sales relation and group by some other attributes, while asking for the sum, count, or average price. In the relation CUBE(Sales), we look for those tuples t with the following properties:

1. If the query specifies a value v for attribute a, then tuple t has v in its component for a.

2. If the query groups by an attribute a, then t has any non-* value in its component for a.

3. If the query neither groups by attribute a nor specifies a value for a, then t has * in its component for a.

Each tuple t has the sum and count for one of the desired groups. If we want the average price, a division is performed on the sum and count components of each tuple t.

Example 20.18: The query

    SELECT color, AVG(price)
    FROM Sales
    WHERE model = 'Gobi'
    GROUP BY color;

is answered by looking for all tuples of CUBE(Sales) with the form

    ('Gobi', c, *, *, v, n)

where c is any specific color. In this tuple, v will be the sum of sales of Gobis in that color, while n will be the number of sales of Gobis in that color. The average price, although not an attribute of Sales or CUBE(Sales) directly, is v/n.
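The three matching rules can be sketched as a filter over an in-memory CUBE(Sales). The cube fragment below is hypothetical, and the spec encoding (a fixed value, 'group', or 'all' per dimension) is an assumption of this sketch, not the book's notation.

```python
STAR = "*"

def answer(cube, spec):
    """spec gives, per dimension, ('=', v), 'group', or 'all'.
    Returns {group-key: (sum, count)} following rules 1-3."""
    out = {}
    for key, (s, n) in cube.items():
        row, ok = [], True
        for i, how in enumerate(spec):
            if how == 'all':                 # rule 3: must be *
                ok = ok and key[i] == STAR
            elif how == 'group':             # rule 2: any non-* value
                ok = ok and key[i] != STAR
                row.append(key[i])
            else:                            # rule 1: fixed value v
                ok = ok and key[i] == how[1]
        if ok:
            out[tuple(row)] = (s, n)
    return out

# Hypothetical CUBE(Sales) fragment: (model, color, date, dealer) -> (val, cnt)
cube_frag = {
    ('Gobi', 'red',  STAR, STAR): (90000, 2),
    ('Gobi', 'blue', STAR, STAR): (126000, 3),
    ('Gobi', STAR,   STAR, STAR): (216000, 5),
}
# SELECT color, AVG(price) FROM Sales WHERE model='Gobi' GROUP BY color;
groups = answer(cube_frag, [('=', 'Gobi'), 'group', 'all', 'all'])
avg = {c[0]: v / n for c, (v, n) in groups.items()}
print(avg)  # {'red': 45000.0, 'blue': 42000.0}
```

The division v/n in the last line is exactly the average computed in Example 20.18.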
The answer to the query is the set of (c, v/n) pairs obtained from all ('Gobi', c, *, *, v, n) tuples.

20.5.2 Cube Implementation by Materialized Views

We suggested in Fig. 20.17 that adding aggregations to the cube doesn't cost much in terms of space, and saves a lot in time when the common kinds of decision-support queries are asked. However, our analysis is based on the assumption that queries choose either to aggregate completely in a dimension or not to aggregate at all. For some dimensions, there are many degrees of granularity that could be chosen for a grouping on that dimension. We have already mentioned the case of time, where numerous options such as aggregation by weeks, months, quarters, or years exist, in addition to the all-or-nothing choices of grouping by day or aggregating over all time. For another example based on our running automobile database, we could choose to aggregate dealers completely or not aggregate them at all. However, we could also choose to aggregate by city, by state, or perhaps by other regions, larger or smaller. Thus, there are at least six choices of grouping for time and at least four for dealers.

When the number of choices for grouping along each dimension grows, it becomes increasingly expensive to store the results of aggregating by every possible combination of groupings. Not only are there too many of them, but they are not as easily organized as the structure of Fig. 20.17 suggests for the all-or-nothing case. Thus, commercial data-cube systems may help the user to choose some materialized views of the data cube. A materialized view is the result of some query, which we choose to store in the database, rather than reconstructing (parts of) it as needed in response to queries. For the data cube, the views we would choose to materialize will typically be aggregations of the full data cube.

The coarser the partition implied by the grouping, the less space the materialized view takes.
On the other hand, if we want to use a view to answer a certain query, then the view must not partition any dimension more coarsely than the query does. Thus, to maximize the utility of materialized views, we generally want some large views that group dimensions into a fairly fine partition. In addition, the choice of views to materialize is heavily influenced by the kinds of queries that the analysts are likely to ask. An example will suggest the tradeoffs involved.

    INSERT INTO SalesV1
    SELECT model, color, month, city,
           SUM(val) AS val, SUM(cnt) AS cnt
    FROM Sales JOIN Dealers ON dealer = name
    GROUP BY model, color, month, city;

Figure 20.18: The materialized view SalesV1

Example 20.19: Let us return to the data cube

    Sales(model, color, date, dealer, val, cnt)

that we developed in Example 20.17. One possible materialized view groups dates by month and dealers by city. This view, which we call SalesV1, is constructed by the query in Fig. 20.18. This query is not strict SQL, since we imagine that dates and their grouping units such as months are understood by the data-cube system without being told to join Sales with the imaginary relation representing days that we discussed in Example 20.14.

    INSERT INTO SalesV2
    SELECT model, week, state,
           SUM(val) AS val, SUM(cnt) AS cnt
    FROM Sales JOIN Dealers ON dealer = name
    GROUP BY model, week, state;

Figure 20.19: Another materialized view, SalesV2

Another possible materialized view aggregates colors completely, aggregates time into weeks, and dealers by states. This view, SalesV2, is defined by the query in Fig. 20.19. Either view SalesV1 or SalesV2 can be used to answer a query that partitions no more finely than either in any dimension.
Thus, the query

    Q1: SELECT model, SUM(val)
        FROM Sales
        GROUP BY model;

can be answered either by

    SELECT model, SUM(val)
    FROM SalesV1
    GROUP BY model;

or by

    SELECT model, SUM(val)
    FROM SalesV2
    GROUP BY model;

On the other hand, the query

    Q2: SELECT model, year, state, SUM(val)
        FROM Sales JOIN Dealers ON dealer = name
        GROUP BY model, year, state;

can only be answered from SalesV1, as

    SELECT model, year, state, SUM(val)
    FROM SalesV1
    GROUP BY model, year, state;

Incidentally, the query immediately above, like the queries that aggregate time units, is not strict SQL. That is, state is not an attribute of SalesV1; only city is. We must assume that the data-cube system knows how to perform the aggregation of cities into states, probably by accessing the dimension table for dealers.

We cannot answer Q2 from SalesV2. Although we could roll up cities into states (i.e., aggregate the cities into their states) to use SalesV1, we cannot roll up weeks into years, since years are not evenly divided into weeks, and data from a week beginning, say, Dec. 29, 2001, contributes to years 2001 and 2002 in a way we cannot tell from the data aggregated by weeks. Finally, a query like

    Q3: SELECT model, color, date, SUM(val)
        FROM Sales
        GROUP BY model, color, date;

can be answered from neither SalesV1 nor SalesV2. It cannot be answered from SalesV1 because its partition of days by months is too coarse to recover sales by day, and it cannot be answered from SalesV2 because that view does not group by color. We would have to answer this query directly from the full data cube.

20.5.3 The Lattice of Views

To formalize the observations of Example 20.19, it helps to think of a lattice of possible groupings for each dimension of the cube. The points of the lattice are the ways that we can partition the values of a dimension by grouping according to one or more attributes of its dimension table. We say that partition P1 is below partition P2,
written P1 ≤ P2, if and only if each group of P1 is contained within some group of P2.

Figure 20.20: A lattice of partitions for time intervals. All is at the top; Years and Weeks lie below All; Quarters below Years; Months below Quarters; and Days at the bottom, below both Months and Weeks.

Example 20.20: For the lattice of time partitions we might choose the diagram of Fig. 20.20. A path from some node P2 down to P1 means that P1 ≤ P2. These are not the only possible units of time, but they will serve as an example of what units a system might support. Notice that days lie below both weeks and months, but weeks do not lie below months. The reason is that while a group of events that took place in one day surely took place within one week and within one month, it is not true that a group of events taking place in one week necessarily took place in any one month. Similarly, a week's group need not be contained within the group corresponding to one quarter or to one year. At the top is a partition we call "all," meaning that events are grouped into a single group; i.e., we make no distinctions among different times.

Figure 20.21: A lattice of partitions for automobile dealers: All at the top, then State, then City, with Dealer at the bottom.

Figure 20.21 shows another lattice, this time for the dealer dimension of our automobiles example. This lattice is simpler; it shows that partitioning sales by dealer gives a finer partition than partitioning by the city of the dealer, which is in turn finer than partitioning by the state of the dealer. The top of the lattice is the partition that places all dealers in one group.

Having a lattice for each dimension, we can now define a lattice for all the possible materialized views of a data cube that can be formed by grouping according to some partition in each dimension.
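A mechanical way to test P1 ≤ P2 is to search for a downward path in the lattice diagram. The sketch below encodes the time lattice of Fig. 20.20 as an adjacency list; the encoding itself is illustrative, not part of the text.

```python
# Edges point from a partition down to the partitions immediately
# below (finer than) it, following Fig. 20.20.
BELOW = {
    'all':      ['years', 'weeks'],
    'years':    ['quarters'],
    'quarters': ['months'],
    'months':   ['days'],
    'weeks':    ['days'],
    'days':     [],
}

def is_below(p1, p2, edges=BELOW):
    """True iff p1 <= p2, i.e., there is a downward path from p2
    to p1, so every group of p1 fits inside some group of p2."""
    if p1 == p2:
        return True
    return any(is_below(p1, child, edges) for child in edges[p2])

print(is_below('days', 'months'))   # True: a day lies in one month
print(is_below('weeks', 'months'))  # False: a week may span two months
```

The same reachability test works for the dealer lattice of Fig. 20.21, or any other dimension lattice.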
If V1 and V2 are two views formed by choosing a partition (grouping) for each dimension, then V1 ≤ V2 means that in each dimension, the partition P1 that we use in V1 is at least as fine as the partition P2 that we use for that dimension in V2; that is, P1 ≤ P2.

Many OLAP queries can also be placed in the lattice of views. In fact, frequently an OLAP query has the same form as the views we have described: the query specifies some partitioning (possibly none or all) for each of the dimensions. Other OLAP queries involve this same sort of grouping, and then "slice" the cube to focus on a subset of the data, as was suggested by the diagram in Fig. 20.15. The general rule is:

* We can answer a query Q using view V if and only if V ≤ Q.

Example 20.21: Figure 20.22 takes the views and queries of Example 20.19 and places them in a lattice. Notice that the Sales data cube itself is technically a view, corresponding to the finest possible partition along each dimension. As we observed in the original example, Q1 can be answered from either SalesV1 or SalesV2; of course it could also be answered from the full data cube Sales, but there is no reason to want to do so if one of the other views is materialized. Q2 can be answered from either SalesV1 or Sales, while Q3 can only be answered from Sales. Each of these relationships is expressed in Fig. 20.22 by the paths downward from the queries to their supporting views.

Figure 20.22: The lattice of views and queries from Example 20.19

Placing queries in the lattice of views helps design data-cube databases. Some recently developed design tools for data-cube systems start with a set of queries that they regard as "typical" of the application at hand. They then select a set of views to materialize so that each of these queries is above at least one of the views, preferably identical to it or very close (i.e., the query and the view use the same grouping in most of the dimensions).
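The rule that view V can answer query Q iff V ≤ Q is checked dimension by dimension. The sketch below hard-codes a few finer-than pairs from the time and dealer lattices; it is a simplified, hypothetical encoding that ignores the color dimension of SalesV1 and SalesV2.

```python
# FINER[d] holds pairs (p1, p2) with p1 <= p2 in dimension d's lattice.
FINER = {
    'time':   {('day', 'day'), ('day', 'week'), ('day', 'month'),
               ('day', 'year'), ('week', 'week'),
               ('month', 'month'), ('month', 'year'),
               ('year', 'year')},
    'dealer': {('dealer', 'dealer'), ('dealer', 'city'),
               ('dealer', 'state'), ('city', 'city'),
               ('city', 'state'), ('state', 'state')},
}

def can_answer(view, query):
    """view and query each map dimension -> chosen grouping; the view
    can answer the query iff its grouping is at least as fine as the
    query's in every dimension (V <= Q)."""
    return all((view[d], query[d]) in FINER[d] for d in query)

sales_v1 = {'time': 'month', 'dealer': 'city'}
sales_v2 = {'time': 'week',  'dealer': 'state'}
q2 = {'time': 'year', 'dealer': 'state'}
print(can_answer(sales_v1, q2))  # True: months roll up to years
print(can_answer(sales_v2, q2))  # False: weeks do not roll up to years
```

Note that ('week', 'year') is deliberately absent from the time set, capturing the observation that weeks straddle year boundaries.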
20.5.4 Exercises for Section 20.5

Exercise 20.5.1: What is the ratio of the size of CUBE(F) to the size of F if fact table F has the following characteristics?

* a) F has ten dimension attributes, each with ten different values.

b) F has ten dimension attributes, each with two different values.

Exercise 20.5.2: Let us use the cube CUBE(Sales) from Example 20.17, which was built from the relation

    Sales(model, color, date, dealer, val, cnt)

Tell what tuples of the cube we would use to answer the following queries:

* a) Find the total sales of blue cars for each dealer.

b) Find the total number of green Gobis sold by dealer "Smilin' Sally."

c) Find the average number of Gobis sold on each day of March, 2002 by each dealer.

*! Exercise 20.5.3: In Exercise 20.4.1 we spoke of PC-order data organized as a cube. If we are to apply the CUBE operator, we might find it convenient to break several dimensions more finely. For example, instead of one processor dimension, we might have one dimension for the type (e.g., AMD Duron or Pentium-IV), and another dimension for the speed. Suggest a set of dimensions and dependent attributes that will allow us to obtain answers to a variety of useful aggregation queries. In particular, what role does the customer play? Also, the price in Exercise 20.4.1 referred to the price of one machine, while several identical machines could be ordered in a single tuple. What should the dependent attribute(s) be?

Exercise 20.5.4: What tuples of the cube from Exercise 20.5.3 would you use to answer the following queries?

a) Find, for each processor speed, the total number of computers ordered in each month of the year 2002.

b) List for each type of hard disk (e.g., SCSI or IDE) and each processor type the number of computers ordered.
c) Find the average price of computers with 1500 megahertz processors for each month from Jan., 2001.

! Exercise 20.5.5: The computers described in the cube of Exercise 20.5.3 do not include monitors. What dimensions would you suggest to represent monitors? You may assume that the price of the monitor is included in the price of the computer.

Exercise 20.5.6: Suppose that a cube has 10 dimensions, and each dimension has 5 options for granularity of aggregation, including "no aggregation" and "aggregate fully." How many different views can we construct by choosing a granularity in each dimension?

Exercise 20.5.7: Show how to add the following time units to the lattice of Fig. 20.20: hours, minutes, seconds, fortnights (two-week periods), decades, and centuries.

Exercise 20.5.8: How would you change the dealer lattice of Fig. 20.21 to include "regions," if:

a) A region is a set of states.

* b) Regions are not commensurate with states, but each city is in only one region.

c) Regions are like area codes: each region is contained within a state, some cities are in two or more regions, and some regions have several cities.

! Exercise 20.5.9: In Exercise 20.5.3 we designed a cube suitable for use with the CUBE operator. However, some of the dimensions could also be given a nontrivial lattice structure. In particular, the processor type could be organized by manufacturer (e.g., Sun, Intel, AMD, Motorola), series (e.g., Sun UltraSparc, Intel Pentium or Celeron, AMD Athlon, or Motorola G-series), and model (e.g., Pentium-IV or G4).

a) Design the lattice of processor types following the examples described above.

b) Define a view that groups processors by series, hard disks by type, and removable disks by speed, aggregating everything else.

c) Define a view that groups processors by manufacturer, hard disks by speed, and aggregates everything else except memory size.
d) Give examples of queries that can be answered from the view of (b) only, the view of (c) only, both, and neither.

*!! Exercise 20.5.10: If the fact table F to which we apply the CUBE operator is sparse (i.e., there are many fewer tuples in F than the product of the number of possible values along each dimension), then the ratio of the sizes of CUBE(F) and F can be very large. How large can it be?

20.6 Data Mining

A family of database applications called data mining or knowledge discovery in databases has captured considerable interest because of opportunities to learn surprising facts from existing databases. Data-mining queries can be thought of as an extended form of decision-support query, although the distinction is informal (see the box on "Data-Mining Queries and Decision-Support Queries"). Data mining stresses both the query-optimization and data-management components of a traditional database system, as well as suggesting some important extensions to database languages, such as language primitives that support efficient sampling of data. In this section, we shall examine the principal directions data-mining applications have taken. We then focus on the problem called "frequent itemsets," which has received the most attention from the database point of view.

20.6.1 Data-Mining Applications

Broadly, data-mining queries ask for a useful summary of data, often without suggesting the values of parameters that would best yield such a summary. This family of problems thus requires rethinking the way database systems are to be used to provide such insights about the data. Below are some of the applications and problems that are being addressed using very large amounts of data.

[...] (stop words) such as "and" or "the,"
which tend to be present in all documents and tell us nothing about the content. A document is placed in this space according to the fraction of its word occurrences that are any particular word. For instance, if the document has 1000 word occurrences, two of which are "database," then the document would be placed at the .002 coordinate in the dimension corresponding to "database." By clustering documents in this space, we tend to get groups of documents that talk about the same thing. For instance, documents that talk about databases might have occurrences of words like "data," "query," "lock," and so on, while documents about baseball are unlikely to have occurrences of these words.

The data-mining problem here is to take the data and select the "means" or centers of the clusters. Often the number of clusters is given in advance, although that number may be selectable by the data-mining process as well. Either way, a naive algorithm for choosing the centers so that the average distance from a point to its nearest center is minimized involves many queries, each of which does a complex aggregation.

20.6.2 Finding Frequent Sets of Items

Now we shall see a data-mining problem for which algorithms using secondary storage effectively have been developed. The problem is most easily described in terms of its principal application: the analysis of market-basket data. Stores today often hold in a data warehouse a record of what customers have bought together. That is, a customer approaches the checkout with a "market basket" full of the items he or she has selected. The cash register records all of these items as part of a single transaction. Thus, even if we don't know anything about the customer, and we can't tell if the customer returns and buys additional items, we do know certain items that a single customer buys together.
If items appear together in market baskets more often than would be expected, then the store has an opportunity to learn something about how customers are likely to traverse the store. The items can be placed in the store so that customers will tend to take certain paths through the store, and attractive items can be placed along these paths.

Example 20.22: A famous example, which has been claimed by several people, is the discovery that people who buy diapers are unusually likely also to buy beer. Theories have been advanced for why that relationship is true, including the possibility that people who buy diapers, having a baby at home, are less likely to go out to a bar in the evening and therefore tend to drink beer at home. Stores may use the fact that many customers will walk through the store from where the diapers are to where the beer is, or vice versa. Clever marketers place beer and diapers near each other, with potato chips in the middle. The claim is that sales of all three items then increase.

We can represent market-basket data by a fact table:

    Baskets(basket, item)

where the first attribute is a "basket ID," or unique identifier for a market basket, and the second attribute is the ID of some item found in that basket. Note that it is not essential for the relation to come from true market-basket data; it could be any relation from which we want to find associated items. For instance, the "baskets" could be documents and the "items" could be words, in which case we are really looking for words that appear in many documents together.

The simplest form of market-basket analysis searches for sets of items that frequently appear together in market baskets. The support for a set of items is the number of baskets in which all those items appear. The problem of finding frequent sets of items is to find, given a support threshold s, all those sets of items that have support at least s.
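Support has a simple operational reading. As a sketch, with Python sets standing in for baskets (the three baskets here are hypothetical):

```python
def support(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(1 for b in baskets if set(itemset) <= set(b))

baskets = [
    {'milk', 'coke', 'beer'},
    {'milk', 'pepsi', 'juice'},
    {'milk', 'beer'},
]
print(support({'milk', 'beer'}, baskets))  # 2
```

With a support threshold s = 2, the set {milk, beer} would be frequent in this toy data.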
If the number of items in the database is large, then even if we restrict our attention to small sets, say pairs of items only, the time needed to count the support for all pairs of items is enormous. Thus, the straightforward way to solve even the frequent-pairs problem (compute the support for each pair of items i and j, as suggested by the SQL query in Fig. 20.24) will not work. This query involves joining Baskets with itself, grouping the resulting tuples by the two items found in that tuple, and throwing away groups where the number of baskets is below the support threshold s. Note that the condition I.item < J.item in the WHERE clause is there to prevent the same pair from being considered in both orders, or for a "pair" consisting of the same item twice from being considered at all.

    SELECT I.item, J.item, COUNT(I.basket)
    FROM Baskets I, Baskets J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(I.basket) >= s;

Figure 20.24: Naive way to find all high-support pairs of items

20.6.3 The A-Priori Algorithm

There is an optimization that greatly reduces the running time of a query like Fig. 20.24 when the support threshold is sufficiently large that few pairs meet it. It is reasonable to set the threshold high, because a list of thousands or millions of pairs would not be very useful anyway; we want the data-mining query to focus our attention on a small number of the best candidates. The a-priori algorithm is based on the following observation.

Association Rules

A more complex type of market-basket mining searches for association rules of the form {i1, i2, ..., in} => j. Two possible properties that we might want in useful rules of this form are:

1. Confidence: the probability of finding item j in a basket that has all of {i1, i2, ..., in} is above a certain threshold,
e.g., 50%; e.g., "at least 50% of the people who buy diapers buy beer."

2. Interest: the probability of finding item j in a basket that has all of {i1, i2, ..., in} is significantly higher or lower than the probability of finding j in a random basket. In statistical terms, j correlates with {i1, i2, ..., in}, either positively or negatively. The discovery in Example 20.22 was really that the rule {diapers} => beer has high interest.

Note that even if an association rule has high confidence or interest, it will tend not to be useful unless the set of items involved has high support. The reason is that if the support is low, then the number of instances of the rule is not large, which limits the benefit of a strategy that exploits the rule.

* If a set of items X has support s, then each subset of X must also have support at least s.

In particular, if a pair of items, say {i, j}, appears in, say, 1000 baskets, then we know there are at least 1000 baskets with item i, and we know there are at least 1000 baskets with item j.

The converse of the above rule is that if we are looking for pairs of items with support at least s, we may first eliminate from consideration any item that does not by itself appear in at least s baskets. The a-priori algorithm answers the same query as Fig. 20.24 by:

1. First finding the set of candidate items (those that appear in a sufficient number of baskets by themselves), and then

2. Running the query of Fig. 20.24 on only the candidate items.

The a-priori algorithm is thus summarized by the sequence of two SQL queries in Fig. 20.25. It first computes Candidates, the subset of the Baskets relation whose items have high support by themselves, then joins Candidates with itself, as in the naive algorithm of Fig. 20.24.
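The same two-step idea can be sketched in memory, outside SQL: count single items, keep only those meeting the threshold, and count pairs over the survivors. The basket data below is hypothetical.

```python
from itertools import combinations
from collections import Counter

def apriori_pairs(baskets, s):
    """Frequent pairs via the a-priori trick: first keep only items
    with support >= s, then count pairs over surviving items only."""
    item_counts = Counter(i for b in baskets for i in b)
    candidates = {i for i, c in item_counts.items() if c >= s}
    pair_counts = Counter()
    for b in baskets:
        kept = sorted(i for i in b if i in candidates)
        pair_counts.update(combinations(kept, 2))
    return {p: c for p, c in pair_counts.items() if c >= s}

baskets = [
    {'milk', 'coke', 'beer'},
    {'milk', 'pepsi', 'juice'},
    {'milk', 'beer'},
    {'coke', 'juice'},
    {'milk', 'pepsi', 'beer'},
]
print(apriori_pairs(baskets, 3))  # {('beer', 'milk'): 3}
```

With threshold 3, only milk and beer survive the first pass, so the second pass counts a single pair instead of all pairs over five items.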
    INSERT INTO Candidates
    SELECT *
    FROM Baskets
    WHERE item IN (
        SELECT item
        FROM Baskets
        GROUP BY item
        HAVING COUNT(*) >= s
    );

    SELECT I.item, J.item, COUNT(I.basket)
    FROM Candidates I, Candidates J
    WHERE I.basket = J.basket AND I.item < J.item
    GROUP BY I.item, J.item
    HAVING COUNT(*) >= s;

Figure 20.25: The a-priori algorithm first finds frequent items before finding frequent pairs

Example 20.23: To get a feel for how the a-priori algorithm helps, consider a supermarket that sells 10,000 different items. Suppose that the average market basket has 20 items in it. Also assume that the database keeps 1,000,000 baskets as data (a small number compared with what would be stored in practice). Then the Baskets relation has 20,000,000 tuples, and the join in Fig. 20.24 (the naive algorithm) has 190,000,000 pairs. This figure represents one million baskets times (20 choose 2) = 190 pairs of items. These 190,000,000 tuples must all be grouped and counted.

However, suppose that s is 10,000, i.e., 1% of the baskets. It is impossible that more than 20,000,000/10,000 = 2000 items appear in at least 10,000 baskets, because there are only 20,000,000 tuples in Baskets, and any item appearing in 10,000 baskets appears in at least 10,000 of those tuples. Thus, if we use the a-priori algorithm of Fig. 20.25, the subquery that finds the candidate items cannot produce more than 2000 items, and will probably produce many fewer than 2000.

We cannot be sure how large Candidates is, since in the worst case all the items that appear in Baskets will appear in at least 1% of them. However, in practice Candidates will be considerably smaller than Baskets, if the threshold s is high. For the sake of argument, suppose Candidates has on the average 10 items per basket; i.e., it is half the size of Baskets. Then the join of Candidates with itself in step (2) has 1,000,000 times (10 choose 2) = 45,000,000 tuples, less than 1/4 of the number of tuples in the join of Baskets with itself.
We would thus expect the a-priori algorithm to run in about 1/4 the time of the naive algorithm. In common situations, where Candidates has much less than half the tuples of Baskets, the improvement is even greater, since running time shrinks quadratically with the reduction in the number of tuples involved in the join.

20.6.4 Exercises for Section 20.6

Exercise 20.6.1: Suppose we are given the eight "market baskets" of Fig. 20.26.

    B1 = {milk, coke, beer}
    B2 = {milk, pepsi, juice}
    B3 = {milk, beer}
    B4 = {coke, juice}
    B5 = {milk, pepsi, beer}
    B6 = {milk, beer, juice, pepsi}
    B7 = {coke, beer, juice}
    B8 = {beer, pepsi}

Figure 20.26: Example market-basket data

* a) As a percentage of the baskets, what is the support of the set {beer, juice}?

b) What is the support of the set {coke, pepsi}?

* c) What is the confidence of milk given beer (i.e., of the association rule {beer} => milk)?

d) What is the confidence of juice given milk?

e) What is the confidence of coke, given beer and juice?

* f) If the support threshold is 35% (i.e., 3 out of the eight baskets are needed), which pairs of items are frequent?

g) If the support threshold is 50%, which pairs of items are frequent?

! Exercise 20.6.2: The a-priori algorithm also may be used to find frequent sets of more than two items. Recall that a set S of k items cannot have support at least s unless every proper subset of S has support at least s. In particular, the subsets of S that are of size k-1 must all have support at least s. Thus, having found the frequent itemsets (those with support at least s) of size k-1, we can define the candidate sets of size k to be those sets of k items, all of whose subsets of size k-1 have support at least s.
Write SQL queries that, given the frequent itemsets of size k-1, first compute the candidate sets of size k, and then compute the frequent sets of size k.

Exercise 20.6.3: Using the baskets of Exercise 20.6.1, answer the following:

a) If the support threshold is 35%, what is the set of candidate triples?

b) If the support threshold is 35%, what sets of triples are frequent?

20.7 Summary of Chapter 20

+ Integration of Information: Frequently, there exist a variety of databases or other information sources that contain related information. We have the opportunity to combine these sources into one. However, heterogeneities in the schemas often exist; these incompatibilities include differing types, codes or conventions for values, interpretations of concepts, and different sets of concepts represented in different schemas.

+ Approaches to Information Integration: Early approaches involved "federation," where each database would query the others in the terms understood by the second. More recent approaches involve warehousing, where data is translated to a global schema and copied to the warehouse. An alternative is mediation, where a virtual warehouse is created to allow queries to a global schema; the queries are then translated to the terms of the data sources.

+ Extractors and Wrappers: Warehousing and mediation require components at each source, called extractors and wrappers, respectively. A major function is to translate queries and results between the global schema and the local schema at the source.

+ Wrapper Generators: One approach to designing wrappers is to use templates, which describe how a query of a specific form is translated from the global schema to the local schema. These templates are tabulated and interpreted by a driver that tries to match queries to templates. The driver may also have the ability to combine templates in various ways, and/or perform additional work such as filtering,
to answer more complex queries.

+ Capability-Based Optimization: The sources for a mediator often are able or willing to answer only limited forms of queries. Thus, the mediator must select a query plan based on the capabilities of its sources, before it can even think about optimizing the cost of query plans as conventional DBMS's do.

+ OLAP: An important application of data warehouses is the ability to ask complex queries that touch all or much of the data, at the same time that transaction processing is conducted at the data sources. These queries, which usually involve aggregation of data, are termed on-line analytic processing, or OLAP, queries.

+ ROLAP and MOLAP: It is frequently useful when building a warehouse for OLAP, to think of the data as residing in a multidimensional space, with dimensions corresponding to independent aspects of the data represented. Systems that support such a view of data take either a relational point of view (ROLAP, or relational OLAP systems), or use the specialized data-cube model (MOLAP, or multidimensional OLAP systems).

+ Star Schemas: In a star schema, each data element (e.g., a sale of an item) is represented in one relation, called the fact table, while information helping to interpret the values along each dimension (e.g., what kind of product is item 1234?) is stored in a dimension table for each dimension.

+ The Cube Operator: A specialized operator called CUBE pre-aggregates the fact table along all subsets of dimensions. It may add little to the space needed by the fact table, and greatly increases the speed with which many OLAP queries can be answered.

+ Dimension Lattices and Materialized Views: A more powerful approach than the CUBE operator, used by some data-cube implementations,
is to establish a lattice of granularities for aggregation along each dimension (e.g., different time units like days, months, and years). The warehouse is then designed by materializing certain views that aggregate in different ways along the different dimensions, and the view with the closest fit is used to answer a given query.

+ Data Mining: Warehouses are also used to ask broad questions that involve not only aggregating on command, as in OLAP queries, but searching for the "right" aggregation. Common types of data mining include clustering data into similar groups, designing decision trees to predict one attribute based on the value of others, and finding sets of items that occur together frequently.

+ The A-Priori Algorithm: An efficient way to find frequent itemsets is to use the a-priori algorithm. This technique exploits the fact that if a set occurs frequently, then so do all of its subsets.

20.8 References for Chapter 20

Recent surveys of warehousing and related technologies are in [9], [3], and [7]. Federated systems are surveyed in [12]. The concept of the mediator comes from [14].

Implementation of mediators and wrappers, especially the wrapper-generator approach, is covered in [5]. Capabilities-based optimization for mediators was explored in [11, 13].

The cube operator was proposed in [6]. The implementation of cubes by materialized views appeared in [8].

[4] is a survey of data-mining techniques, and [13] is an on-line survey of data mining. The a-priori algorithm was developed in [1] and [2].

1. R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD Intl. Conf. on Management of Data (1993), pp. 207-216.

2. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. Intl. Conf. on Very Large Databases (1994), pp. 487-499.

3. S. Chaudhuri and U.
Dayal, "An overview of data warehousing and OLAP technology," SIGMOD Record 26:1 (1997), pp. 65-74.

4. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park CA, 1996.

5. H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, V. Vassalos, J. D. Ullman, and J. Widom, "The TSIMMIS approach to mediation: data models and languages," J. Intelligent Information Systems 8:2 (1997), pp. 117-132.

6. J. N. Gray, A. Bosworth, A. Layman, and H. Pirahesh, "Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-totals," Proc. Intl. Conf. on Data Engineering (1996), pp. 152-159.

7. A. Gupta and I. S. Mumick, Materialized Views: Techniques, Implementations, and Applications, MIT Press, Cambridge MA, 1999.

8. V. Harinarayan, A. Rajaraman, and J. D. Ullman, "Implementing data cubes efficiently," Proc. ACM SIGMOD Intl. Conf. on Management of Data (1996), pp. 205-216.

9. D. Lomet and J. Widom (eds.), special issue on materialized views and data warehouses, IEEE Data Engineering Bulletin 18:2 (1995).

10. Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, "Object exchange across heterogeneous information sources," Proc. Intl. Conf. on Data Engineering (1995), pp. 251-260.

11. Y. Papakonstantinou, A. Gupta, and L. Haas, "Capabilities-based query rewriting in mediator systems," Conference on Parallel and Distributed Information Systems (1996). Available as:

12. A. P. Sheth and J. A. Larson, "Federated databases for managing distributed, heterogeneous, and autonomous databases," Computing Surveys 22:3 (1990), pp. 183-236.
