Building the Data Warehouse, Third Edition (Part 4)

Occasionally, the introduction of derived (i.e., calculated) data into the physical database design can reduce the amount of I/O needed. Figure 3.31 shows such a case. A program accesses payroll data regularly in order to calculate the annual pay and taxes that have been paid. If the program is run regularly and at the year’s end, it makes sense to create fields of data to store the calculated data. The data has to be calculated only once; all future requirements can then access the calculated field. This approach has another advantage: once the field is calculated, it does not have to be calculated again, which eliminates the risk that a faulty algorithm will later produce an incorrect value.

The Data Warehouse and Design

Figure 3.29 Description is redundantly spread over the many places it is used. It must be updated in many places when it changes, but it seldom, if ever, does. (Selective use of redundancy. Before: the bom, inventory, prod ctl, and mrp tables carry only partno, and desc is stored once in the table holding partno, desc, u/m, and qty; description is nonredundant and is used frequently, but is seldom updated. After: the bom, inventory, prod ctl, and mrp tables each carry partno and desc.)

Uttama Reddy

One of the most innovative techniques in building a data warehouse is what can be termed a “creative” index, or a creative profile (a term coined by Les Moore). Figure 3.32 shows an example of a creative index. This type of index is created as data is passed from the operational environment to the data warehouse environment. Because each unit of data has to be handled in any case, it requires very little overhead to calculate or create an index at this point.
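As a rough illustration of why a creative index costs little to build, the sketch below computes a small profile in the same pass that loads records into the warehouse. The record layout (`customer`, `amount`), the function name `load_with_profile`, and the particular profile items are assumptions made for the example; the text does not prescribe an implementation.

```python
from collections import defaultdict
import heapq


def load_with_profile(records, top_n=10):
    """Load records and build a creative index (profile) in the same pass.

    `records` is an iterable of dicts with the hypothetical fields
    'customer' and 'amount'. Returns (loaded_rows, profile).
    """
    loaded = []                    # stands in for the actual warehouse load
    volume = defaultdict(float)    # total purchase amount per customer
    largest = None                 # largest single transaction seen so far
    total = 0.0
    count = 0

    for rec in records:
        loaded.append(rec)         # every record is handled once in any case
        amount = rec["amount"]
        volume[rec["customer"]] += amount
        total += amount
        count += 1
        if largest is None or amount > largest:
            largest = amount

    profile = {
        "top_customers": heapq.nlargest(top_n, volume, key=volume.get),
        "average_transaction": total / count if count else None,
        "largest_transaction": largest,
    }
    return loaded, profile


# Hypothetical extract used only to exercise the sketch.
extract = [
    {"customer": "A", "amount": 120.0},
    {"customer": "B", "amount": 40.0},
    {"customer": "A", "amount": 60.0},
    {"customer": "C", "amount": 300.0},
]
rows, profile = load_with_profile(extract, top_n=2)
print(profile)
```

The profile is accumulated inside the loop that already touches every record, so no second scan of the extract is needed.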
CHAPTER 3

Figure 3.30 Further separation of data based on a wide disparity in the probability of access. (A record holding acctno, domicile, date opened, and balance is split in two. Low probability of access: acctno, domicile, date opened. Very high probability of access: acctno, balance.)

Figure 3.31 Derived data, calculated once, then forever available. (Introducing derived data. Before: a program reads the records for week 1 through week 52, each holding pay, taxes, FICA, and other, to produce annual pay, taxes, FICA, and other. After: annual pay, taxes, FICA, and other are stored along with the weekly records.)

The creative index does a profile on items of interest to the end user, such as the largest purchases, the most inactive accounts, the latest shipments, and so on. If the requirements that might be of interest to management can be anticipated (admittedly, they cannot in every case), it makes sense to build a creative index at the time of passing data to the data warehouse.

A final technique that the data warehouse designer should keep in mind is the management of referential integrity. Figure 3.33 shows that referential integrity appears as “artifacts” of relationships in the data warehouse environment.

In the operational environment, referential integrity appears as a dynamic link among tables of data. But because of the volume of data in a data warehouse, because the data warehouse is not updated, and because the warehouse represents data over time and relationships do not remain static, a different approach should be taken toward referential integrity. In other words, relationships of data are represented by an artifact in the data warehouse environment.
Therefore, some data will be duplicated, and some data will be deleted when other data is still in the warehouse. In any case, trying to replicate referential integrity in the data warehouse environment is a patently incorrect approach.

Figure 3.32 Examples of creative indexes:
• The top 10 customers in volume are ___.
• The average transaction value for this extract was $nnn.nn.
• The largest transaction was $nnn.nn.
• The number of customers who showed activity without purchasing was nn.
(Diagram labels: existing systems; extract, load; lightly summarized data; true archival; creative indexes/profiles.)

Figure 3.33 Referential integrity in the data warehouse environment. (Diagram labels: subject A, subject B.)
In operational systems, the relationships between databases are handled by referential integrity. But, in the data warehouse environment:
• There is much more data than in the operational environment.
• Once in the warehouse, the data doesn't change.
• There is a need to represent more than one business rule over time.
• Data purges in the warehouse are not tightly coordinated.
Artifacts in the data warehouse environment:
• can be managed independently
• are very efficient to access
• do not require update

Snapshots in the Data Warehouse

Data warehouses are built for a wide variety of applications and users, such as customer systems, marketing systems, sales systems, and quality control systems. Despite the very diverse applications and types of data warehouses, a common thread runs through all of them. Internally, each of the data warehouses centers around a structure of data called a “snapshot.” Figure 3.34 shows the basic components of a data warehouse snapshot.

Snapshots are created as a result of some event occurring. Several kinds of events can trigger a snapshot.
One event is the recording of information about a discrete activity, such as writing a check, placing a phone call, receiving a shipment, completing an order, or purchasing a policy. In the case of a discrete activity, some business occurrence has taken place, and the business must make note of it. In general, discrete activities occur randomly. The other type of snapshot trigger is time, which is a predictable trigger, such as the end of the day, the end of the week, or the end of the month.

The snapshot triggered by an event has four basic components:
■■ A key
■■ A unit of time
■■ Primary data that relates only to the key
■■ Secondary data captured as part of the snapshot process that has no direct relationship to the primary data or key

Of these components, only secondary data is optional.

The key can be unique or nonunique, and it can be a single element of data. In a typical data warehouse, however, the key is a composite made up of many elements of data that serve to identify the primary data. The key identifies the record and the primary data.

The unit of time, such as year, month, day, hour, and quarter, usually (but not always) refers to the moment when the event being described by the snapshot occurred. Occasionally, the unit of time refers to the moment when the capture of data takes place. (In some cases a distinction is made between when an event occurs and when the information about the event is captured. In other cases no distinction is made.) In the case of events triggered by the passage of time, the time element may be implied rather than directly attached to the snapshot.

Figure 3.34 A data warehouse record of data is a snapshot taken at one moment in time and includes a variety of types of data. (Components shown: time, key, nonkey primary data, secondary data.)

The primary data is the nonkey data that relates directly to the key of the record.
As an example, suppose the key identifies the sale of a product. The element of time describes when the sale was finalized. The primary data describes what product was sold at what price, the conditions of the sale, the location of the sale, and who the representative parties were.

The secondary data, if it exists, identifies other extraneous information captured at the moment when the snapshot record was created. An example of secondary data that relates to a sale is incidental information about the product being sold (such as how much is in stock at the moment of sale). Other secondary information might be the prevailing interest rate for a bank’s preferred customers at the moment of sale. Any incidental information can be added to a data warehouse record if it appears that the information can be used at a later time for DSS processing. Note that the incidental information added to the snapshot may or may not be a foreign key. A foreign key is an attribute found in a table that is a reference to the key value of another table where there is a business relationship between the data found in the two tables.

Once the secondary information is added to the snapshot, a relationship between the primary and secondary information can be inferred, as shown in Figure 3.35. The snapshot implies that there is a relationship between secondary and primary data. Nothing other than the existence of the relationship is implied, and the relationship is implied only as of the instant of the snapshot. Nevertheless, by the juxtaposition of secondary and primary data in a snapshot record, at the instant the snapshot was taken, there is an inferred relationship of data. Sometimes this inferred relationship is called an “artifact.” The snapshot record that has been described is the most general and most widely found case of a record in a data warehouse.
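To make the four components concrete, here is a minimal sketch of a snapshot record for the sale example. The class name `Snapshot`, the field names, and all sample values are hypothetical; only the split into key, unit of time, primary data, and optional secondary data comes from the text.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Snapshot:
    key: Tuple[Any, ...]       # often a composite of several elements of data
    time: date                 # when the event occurred (or was captured)
    primary: Dict[str, Any]    # nonkey data that relates directly to the key
    # Secondary data is the only optional component, so it defaults to empty.
    secondary: Dict[str, Any] = field(default_factory=dict)

    def artifact(self):
        """Pair primary and secondary data as of the snapshot moment.

        Only their co-occurrence at `time` is implied, nothing more.
        """
        return {
            "as_of": self.time,
            "primary": self.primary,
            "secondary": self.secondary,
        }


# Hypothetical sale: all identifiers and values are invented for illustration.
sale = Snapshot(
    key=("store-17", "order-1001"),
    time=date(1999, 3, 15),
    primary={"product": "widget", "price": 19.95, "location": "Austin, TX"},
    secondary={"units_in_stock": 42},
)
print(sale.artifact()["secondary"])
```

Marking the dataclass as frozen mirrors the idea that a snapshot is not updated after it is taken; it prevents reassigning fields, although the dictionaries themselves remain ordinary mutable objects.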
Figure 3.35 The artifacts of a relationship are captured as a result of the implied relationship of secondary data residing in the same snapshot as primary data. (Components shown: nonkey primary data, secondary data.)

Meta Data

An important component of the data warehouse environment is meta data. Meta data, or data about data, has been a part of the information processing milieu for as long as there have been programs and data. But in the world of data warehouses, meta data takes on a new level of importance, for it affords the most effective use of the data warehouse. Meta data allows the end user/DSS analyst to navigate through the possibilities. Put differently, when a user approaches a data warehouse where there is no meta data, the user does not know where to begin the analysis. The user must poke and probe the data warehouse to find out what data is there and what data is not there, and considerable time is wasted. Even after the user pokes around, there is no guarantee that he or she will find the right data or correctly interpret the data encountered. With the help of meta data, however, the end user can quickly go to the necessary data or determine that it isn’t there.

Meta data then acts like an index to the contents of the data warehouse. It sits above the warehouse and keeps track of what is where in the warehouse. Typically, items the meta data store tracks are as follows:
■■ Structure of data as known to the programmer
■■ Structure of data as known to the DSS analyst
■■ Source data feeding the data warehouse
■■ Transformation of data as it passes into the data warehouse
■■ Data model
■■ Relationship between the data model and the data warehouse
■■ History of extracts

Managing Reference Tables in a Data Warehouse

When most people think of data warehousing, their thoughts turn to the normal, large databases constantly being used by organizations to run day-to-day business, such as customer files, sales files, and so forth.
Indeed, these common files form the backbone of the data warehousing effort. Yet another type of data belongs in the data warehouse and is often ignored: reference data.

Reference tables are often taken for granted, and that creates a special problem. For example, suppose in 1995 a company has some reference tables and starts to create its data warehouse. Time passes, and much data is loaded into the data warehouse. In the meantime, the reference table is used operationally and occasionally changes. In 1999, the company needs to consult the reference table. A reference is made from 1995 data to the reference table. But the reference table has not been kept historically accurate, and the reference from 1995 data warehouse data to reference entries accurate as of 1999 produces very inaccurate results. For this reason, reference data should be made time-variant, just like all other parts of the data warehouse.

Reference data is particularly applicable to the data warehouse environment because it helps reduce the volume of data significantly. There are many design techniques for the management of reference data. Two techniques, at the opposite ends of the spectrum, are discussed here. In addition, there are many variations on these options.

Figure 3.36 shows the first design option, where a snapshot of an entire reference table is taken every six months. This approach is quite simple and at first glance appears to make sense. But the approach is logically incomplete. For example, suppose some activity had occurred to the reference table on March 15. Say a new entry, ddw, was added; then on May 10 the entry for ddw was deleted. Taking a snapshot every six months would not capture the activity that transpired from March 15 to May 10.

A second approach is shown in Figure 3.37. At some starting point a snapshot is made of a reference table. Throughout the year, all the activities against the reference table are collected.
To determine the status of a given entry to the reference table at a moment in time, the activity is reconstituted against the reference table. In such a manner, logical completeness of the table can be reconstructed for any moment in time. Such a reconstruction, however, is not a trivial matter; it may represent a very large and complex task.

The two approaches outlined here are opposite in intent. The first approach is simple but logically incomplete. The second approach is very complex but logically complete. Many design alternatives lie between these two extremes. However they are designed and implemented, reference tables need to be managed as a regular part of the data warehouse environment.

Figure 3.36 A snapshot is taken of a reference table in its entirety every six months: one approach to the management of reference tables in the data warehouse.
Jan 1: AAA - Amber Auto; AAT - Allison's; AAZ - AutoZone; BAE - Brit Eng
July 1: AAA - Amber Auto; AAR - Ark Electric; BAE - Brit Eng; BAG - Bill's Garage
Jan 1: AAA - Alaska Alt; AAG - German Air; AAR - Ark Electric; BAE - Brit Eng

Cyclicity of Data: The Wrinkle of Time

One of the intriguing issues of data warehouse design is the cyclicity of data, or the length of time a change of data in the operational environment takes to be reflected in the data warehouse. Consider the data in Figure 3.38. The current information is shown for Judy Jones. The data warehouse contains the historical information about Judy. Now suppose Judy changes addresses. Figure 3.39 shows that as soon as that change is discovered, it is reflected in the operational environment as quickly as possible.

Once the data is reflected in the operational environment, the changes need to be moved to the data warehouse. Figure 3.40 shows that the data warehouse has a correction to the ending date of the most current record and a new record has been inserted reflecting the change.
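The adjustment described for Figure 3.40 amounts to two steps: correct the ending date of the most current record, then insert a new record that reflects the change. The sketch below illustrates this on a list of dictionaries loosely modeled on the J Jones history. The field names, the use of `None` to stand for "present," the function `apply_change`, and the effective year 1993 are all assumptions for the example.

```python
def apply_change(history, changed_fields, effective_year):
    """Close the current record and append a new one reflecting the change.

    `history` is a list of dicts ordered oldest to newest; the current
    record has end=None (standing for "present"). A new list is returned
    and the input list is left untouched.
    """
    if not history or history[-1]["end"] is not None:
        raise ValueError("history has no current record to close")

    current = history[-1]
    closed = {**current, "end": effective_year}   # correct the ending date
    new = {**current, **changed_fields, "start": effective_year, "end": None}
    return history[:-1] + [closed, new]


# Historical records for J Jones; years and layout are illustrative.
warehouse = [
    {"name": "J Jones", "start": 1989, "end": 1990, "address": "Apt B", "credit": "B"},
    {"name": "J Jones", "start": 1990, "end": 1991, "address": "Apt B", "credit": "AA"},
    {"name": "J Jones", "start": 1992, "end": None, "address": "123 Main", "credit": "AA"},
]
# 1993 is a hypothetical effective year for the move.
updated = apply_change(warehouse, {"address": "Rte 4, Austin, TX"}, effective_year=1993)
for record in updated:
    print(record)
```

Nothing already in the history is overwritten except the open ending date, which keeps the earlier records available for analysis over time.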
The issue is, how soon should this adjustment to the data warehouse data be made? As a rule, at least 24 hours should pass from the time the change is known to the operational environment until the change is reflected in the data warehouse (see Figure 3.41). There should be no rush to try to move the change into the data warehouse as quickly as possible. This “wrinkle of time” should be implemented for several reasons.

The first reason is that the more tightly the operational environment is coupled to the data warehouse, the more expensive and complex the technology is. A 24-hour wrinkle of time can easily be accomplished with conventional technology. A 12-hour wrinkle of time can be accomplished but at a greater cost in technology. A 6-hour wrinkle of time can be accomplished but at an even greater cost in technology.

Figure 3.37 Another approach to the management of reference data. A complete snapshot is taken on the first of the year. Changes to the reference table are collected throughout the year and are able to be used to reconstruct the table at any moment in time.
Snapshot, Jan 1: AAA - Amber Auto; AAT - Allison's; AAZ - AutoZone; BAE - Brit Eng
Changes:
Jan 1 - add TWQ - Taiwan Dairy
Jan 16 - delete AAT
Feb 3 - add AAG - German Power
Feb 27 - change GYY - German Govt

Figure 3.38 What happens when the corporation finds out that J Jones has moved? (J Jones has moved to Rte 4, Austin, TX.)
Operational: J Jones, 123 Main, Credit - AA
Data warehouse:
J Jones, 1989-1990, Apt B, Credit - B
J Jones, 1990-1991, Apt B, Credit - AA
J Jones, 1992-present, 123 Main, Credit - AA

Figure 3.39 The first step is to change the operational address of J Jones. (J Jones has moved to Rte 4, Austin, TX. Operational: J Jones, 123 Main, Credit - AA; new address: Rte 4, Austin, TX.)

[...]...
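Returning to the second approach to reference data (a starting snapshot plus a log of changes), the reconstruction can be sketched as a replay of the log up to the date of interest. The sketch uses the entries listed for Figure 3.37; the year 1995, the tuple layout of the log, the function `table_as_of`, and the decision to treat "change" as an insert-or-update are assumptions made for illustration.

```python
from datetime import date


def table_as_of(snapshot, changes, when):
    """Rebuild a reference table as of `when` by replaying logged changes.

    `snapshot` maps code -> description at the starting point.
    `changes` is a list of (date, action, code, description) tuples,
    where action is 'add', 'change', or 'delete'.
    """
    table = dict(snapshot)                   # work on a copy of the snapshot
    for day, action, code, desc in sorted(changes, key=lambda c: c[0]):
        if day > when:
            break                            # later activity is not yet visible
        if action == "delete":
            table.pop(code, None)
        elif action in ("add", "change"):
            table[code] = desc               # assumption: 'change' is an upsert
        else:
            raise ValueError(f"unknown action: {action}")
    return table


# Entries from Figure 3.37; the year 1995 is hypothetical.
jan1 = {"AAA": "Amber Auto", "AAT": "Allison's", "AAZ": "AutoZone", "BAE": "Brit Eng"}
log = [
    (date(1995, 1, 1), "add", "TWQ", "Taiwan Dairy"),
    (date(1995, 1, 16), "delete", "AAT", None),
    (date(1995, 2, 3), "add", "AAG", "German Power"),
    (date(1995, 2, 27), "change", "GYY", "German Govt"),
]
print(sorted(table_as_of(jan1, log, date(1995, 1, 20))))
```

Even in this toy form the cost is visible: every query for a past state replays the log from the snapshot, which is why the text calls the approach logically complete but complex.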
data gets from the data warehouse to the data mart. Data in the data warehouse is very granular. Data in the data mart is very compact and summarized. Periodically data must be moved from the data warehouse to the data mart. This movement of data from the data warehouse to the data mart is analogous to the movement of data into the data warehouse from the operational legacy environment. Data Marts: A Substitute...

for its data mart, the sales department will have another structure for its data mart, and the marketing department will have another data structure for its data mart. All of their structures will be fed from the granular data found in the data warehouse. The data structure found in any given data mart is different from the data structure for any other data mart. For example, the sales data mart data structure...

Substitute for a Data Warehouse?

There is an argument in the IT community that says that a data warehouse is expensive and troublesome to build. Indeed, a data warehouse requires resources in the best of cases. But building a data warehouse is absolutely worth the effort. The argument for not building a data warehouse usually leads to building something short of a data warehouse, usually a data mart. The premise...

Data Warehouse Data

Figure 3.46 illustrates the dynamics of the simplest of those circumstances: the direct access of data from the data warehouse by the operational environment. A request has been made within the operational environment for data that resides in the data warehouse. The request is transferred to the data warehouse environment, and the data is located and transferred to the operational...

records can be used to go into the data warehouse or a data mart that is fed by the data warehouse. When the profile records go into a data warehouse, they are for general-purpose use. When the profile records go into the data mart, they are customized for the department that uses the data mart. The aggregation of the operational records into a profile record is almost always done on the operational server because...

see why there is a minimal amount of backward flow of data in the case of direct access.

Indirect Access of Data Warehouse Data

Because of the severe and uncompromising conditions of transfer, direct access of data warehouse data by the operational environment is a rare occurrence. Indirect access of data warehouse data is another matter entirely. Indeed, one of the most effective uses of the data warehouse...

element of time may need to be added as the data moves from the operational environment to the data warehouse environment.
■■ The data warehouse addresses the informational needs of the corporation, while the operational environment addresses the up-to-the-second clerical needs of the corporation.
■■ Transmission of the newly created output file that will go into the data warehouse must be accounted for. In...

With a 24-hour wrinkle there is no temptation to do operational processing in the data warehouse and data warehouse processing in the operational environment. But if the wrinkle of time is reduced (say, to 4 hours), there is the temptation to do such processing, and that is patently a mistake. Another benefit of the wrinkle of time is opportunity for data to settle before it is moved to the data warehouse...

required as data passes from the operational, legacy environment to the data warehouse environment? The following lists some of the necessary functionality:
■■ The extraction of data from the operational environment to the data warehouse environment requires a change in technology. This normally includes reading the operational DBMS technology, such as IMS, and writing the data out in newer, data warehouse...

bookings and the flight status table. The result is a very fast and smooth interaction with the travel agent and the ability to use the data stored in the data warehouse.

A Retail Personalization System

Another example of the indirect use of data warehouse data in the operational environment occurs in the retail personalization system. In this system, a customer reads a catalog or other flyer issued by the retailer...

used.
■■ The design of the data warehouse must conform to a corporate data model. As such, there is order and discipline to the design and structuring of the data warehouse. The input to the data warehouse...