Mastering Data Warehouse Design: Relational and Dimensional Techniques, Part 5


Chapter 6: Modeling the Calendar

[...] maintain calendars on these terms, allowing the user to see April 18, 2002 instead of day 198, and yet internally, such systems continue to count days based on working and nonworking days. By their very nature, factory calendars are localized. That is, each production facility may have its own factory calendar to reflect the workdays and shifts within that facility.

Whether you are developing your data warehouse for a manufacturer or not, the idea of tracking workdays has wide applicability when performing analysis for a variety of businesses. The concept of the factory calendar has been expanded in modern usage to provide a calendar for consistently measuring the number of workdays between two events, such as the beginning and end of a promotional period, and for facilitating data access for an equal number of workdays regardless of the elapsed calendar days. The premise behind the factory calendar was to homogenize the timeline so that day calculations could be done using simple arithmetic. This homogeneity is also important when performing year-over-year and month-over-month analysis. If you have a sales force that works Monday through Friday and you need to compare this month’s performance to date against last month’s, it makes sense to count the same number of selling days in each month; otherwise, the comparison can be skewed depending on where the weekend falls in relation to the start of the month. This concept will be discussed at length in the next section.

Calendar Elements

An effective calendar entity can significantly improve people’s ability to perform analysis over time.
There are three fundamental concepts that facilitate the analysis:

■■ The day of the week, particularly in retail businesses
■■ Accounting for nonworking days to provide meaningful comparative analysis and elapsed-time calculations
■■ Defining time periods of interest, such as holiday seasons, for analysis

Day of the Week

The day of the week is probably the most commonly recognized date characteristic for performing date-oriented analyses. Often the day of the week is used to distinguish between workdays and nonworkdays. A workday indicator in the Day of the Week entity in Figure 6.1 shows whether it is a regularly scheduled workday, and this can be used to facilitate analysis.

A company’s revenue cycle is also sometimes dependent on the day of the week. Some retailers have a heavier daily sales volume during weekend days, while other retailers may have a lower volume during the weekend, particularly if they have shorter hours. Sales on Mondays may be consistently higher than sales on other days for some companies. The relationship between the Day of the Week and the Date in Figure 6.1 facilitates analysis based on the day of the week.

NOTE
Some sales cycles are also dependent on the day of the month. For example, commercial sales may be higher at the beginning of the month than at the end of the month. Inclusion of a day sequence number in the Date entity in Figure 6.1 can help analysts who need to use that information.

Holidays

Holidays impact two areas: your organization’s business practices and your customers’ ability to do business with you. In the first case, holidays (or more generically, nonworkdays) are established by internal business policies and impact company holiday schedules, store hours, plant closings, and other events. In Figure 6.1, we’ve included a Workday Indicator within the Day of the Week entity and a Holiday Indicator within the Date entity.
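As a rough illustration of how those two indicators could be combined, the Python sketch below numbers the workdays of a month. The function names, the Monday-through-Friday work week, and the single July 4 holiday are assumptions made for the example; they are not taken from the model in Figure 6.1.

```python
from datetime import date, timedelta


def workday_sequence(year, month, holidays=frozenset()):
    """Map each workday of the month to its 1-based workday sequence number.

    Assumes a Monday-through-Friday work week (the workday indicator);
    dates in `holidays` (the holiday indicator) are nonworkdays.
    """
    sequence = {}
    day = date(year, month, 1)
    count = 0
    while day.month == month:
        if day.weekday() < 5 and day not in holidays:  # Monday=0 ... Friday=4
            count += 1
            sequence[day] = count
        day += timedelta(days=1)
    return sequence


def nth_workday(year, month, n, holidays=frozenset()):
    """Return the date of the n-th workday of the month, or None if there is none."""
    for day, number in workday_sequence(year, month, holidays).items():
        if number == n:
            return day
    return None


# Assumption: July 4 is the only company holiday in these two months.
HOLIDAYS_2006 = frozenset({date(2006, 7, 4)})

print(nth_workday(2006, 6, 10, HOLIDAYS_2006))  # 2006-06-14
print(nth_workday(2006, 7, 10, HOLIDAYS_2006))  # 2006-07-17
```

Under these assumptions the same count of workdays covers different calendar spans in June and July 2006. The only inputs are a workday indicator for each day of the week and a holiday indicator for each date.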
With this information, analysis based on workdays is facilitated. In addition, the use of a Workday Sequence Number helps in comparing results during the first 10 workdays of two months. As Figure 6.5 shows, in June 2006, the 10th workday occurs on the 14th, while the 10th workday in July 2006 does not occur until the 17th.

WARNING
Analysts should be careful with the criteria used to perform month-to-month analysis. For example, in companies that are heavily oriented toward commercial sales or sales through distribution channels, it may be very appropriate to make comparisons based on the number of business days in the month, since these are the days in which purchasing agents typically make buying decisions. In companies that are oriented toward retail sales, it would be more appropriate to make comparisons based on the number of days that the stores are open.

[Figure 6.5 Workday comparisons: calendars for June 2006 and July 2006.]

The effect that holidays have on your customers is a much different matter. The greatest impact is seen in the retail business, in which holidays and other events can have a significant impact on comparative statistics. For example, it makes no sense to directly compare January sales against December sales if 25 percent (or more) of your sales for the year occur in the weeks leading up to Christmas.

External calendars influence businesses in many different ways. Children’s clothing and tourism have cycles influenced by the school calendar. Candy sales are influenced by events such as Easter and Halloween. Firework sales are influenced by special events such as Independence Day and New Year’s Day.
When performing analyses of sales that are influenced by such events, it is important for the data warehouse to provide the means to apply the necessary context to the data as the data migrates to the marts. Without such context, direct comparison of the numbers can be misleading, which may lead to bad decisions.

These are predictable business sales cycles. Hence, information about them can be included in the business data model and cascaded to the data warehouse model. Attributes can be included (as shown in Figure 6.5) in the Date entity to indicate special periods to which the date belongs. If there are a small number of such periods, then each could be accommodated by a separate attribute; if the number of periods is large, then the periods could be classified into logical groupings and the attributes could indicate the group(s) to which the date belongs.

Holiday Season

The holiday season, which begins on the day following Thanksgiving and ends on Christmas Day (December 25), is of special interest in retailing. An indicator for this season is very useful since the beginning date varies each year. With an indicator, an analyst comparing Holiday Season sales for 3 years can simply select dates based on the value of the indicator.

The holiday season impact cascades beyond just sales during the holiday season, since companies need to ensure that the products are available at that time. To prepare for the large sales volume, there are preceding periods that affect product planning and production. Sometimes it is meaningful to track these as well. For example, large inventory levels following the peak selling season are not healthy, but it is very appropriate (and in fact essential) to have high inventory levels immediately preceding the peak selling season. One way of handling this is to include a derived field that represents the number of days before the peak selling season.
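A minimal Python sketch of the two derived attributes follows. It assumes the U.S. definition of Thanksgiving (the fourth Thursday of November) and treats the holiday season itself as the peak selling season; the function names are hypothetical and would correspond to columns populated when the calendar is loaded.

```python
from datetime import date, timedelta


def thanksgiving(year):
    """U.S. Thanksgiving: the fourth Thursday of November (assumed definition)."""
    nov1 = date(year, 11, 1)
    days_to_thursday = (3 - nov1.weekday()) % 7  # Thursday is weekday 3
    return nov1 + timedelta(days=days_to_thursday + 21)


def holiday_season_indicator(day):
    """True from the day after Thanksgiving through December 25, inclusive."""
    start = thanksgiving(day.year) + timedelta(days=1)
    end = date(day.year, 12, 25)
    return start <= day <= end


def days_before_holiday_season(day):
    """Days until this year's season starts; None once the season has begun."""
    start = thanksgiving(day.year) + timedelta(days=1)
    return (start - day).days if day < start else None


print(thanksgiving(2006))                              # 2006-11-23
print(holiday_season_indicator(date(2006, 11, 24)))    # True
print(days_before_holiday_season(date(2006, 11, 1)))   # 23
```

Because the start date moves each year, storing the indicator on every date row lets a query select the season without recomputing it, and the days-before value can be stored alongside it as the derived field.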
The analyst can use that information to qualify analysis of data for the inventory levels, production schedules, and so on.

Company holiday information is usually available from the Human Resources Department within your organization. The challenge is that such information may not be held in an existing application. You may find that the only source for a list of nonworking days is memos and other such documents. In such cases, it would become necessary to implement a data entry application to collect and maintain this data for the warehouse. From a technical standpoint, it is a very simple, low-volume application. However, finding a department to support this data may be difficult. Usually, the holiday schedule is published by the Human Resources Department, so it would be the most likely candidate to maintain this information in the application. In most cases, initial warehouse implementations do not support the Human Resources Department, and the Human Resources Department is typically out of the loop when discussing warehouse requirements. So, it is common that, when asked, the Human Resources Department may decline to assume that responsibility. Do not be surprised if the responsibility for maintaining this data within the data warehouse falls on the data warehouse support staff.

Seasons

In addition to holidays and other events, seasons play an important role in influencing business activity. In the context of this discussion, it is best to look at a season in its most generic form. A season is defined as any time period of significance. What this means depends on what is significant to your business. If you are in sporting goods, then the baseball season is significant. If you manufacture watercraft, then summer is important. Carried to a logical conclusion, a seasonal calendar can be used in a data warehouse to provide context to any date or range of dates.
Figures 6.15 and 6.16, later in this chapter, show an example of a seasonal calendar model.

The advantage of a seasonal calendar is that it formalizes a process that allows the end user to surround any period of time with a special context. It recognizes the fact that the impact on business is not the event itself, but rather the days, weeks, or months preceding or following that event. It also facilitates year-to-year analysis based on seasons that are important to the business, similar to the holiday season previously described. The concept of the season acknowledges the fact that, as a data warehouse designer, you cannot anticipate every conceivable avenue of analysis. A seasonal calendar structure puts that control in the hands of the end user. Creating a seasonal calendar will be discussed later in this chapter.

Calendar Time Span

A major application of the calendar is to provide a time context to business activity. At a minimum, the calendar should cover the historical and current activity time period maintained in the warehouse. From a practical standpoint, it should also cover the planning horizon (for example, the future time span for which forecasts or quotas used in strategic analysis may be created), which is often several years into the future. Some industries, such as banking, may require much longer timeframes in their calendar to cover maturity dates of bonds, mortgages, and other financial instruments.

As you will see later in this section, there is a lot of information about a date that a calendar entity can include. It may not be possible to gather all the necessary information for dates 10, 20, or 30 years into the future. This should not be of great concern. There is no requirement that all columns for all dates be populated immediately. If the data is not available, then a null condition or a value indicating that the data is not available may be used.
When this is done, the metadata should explain the meaning of the field content.

SIDEBAR: Dealing with Missing Information
The data warehouse will have a column for each data element, including for dates in the future, and the data to populate these columns may not initially be available. Therefore, these columns may be null at first (if your database standards permit this). When the data becomes available, a new row is added to the data warehouse with values in these columns. From a purely theoretical point of view, the old row is also retained. To simplify the structure of the data warehouse, companies sometimes choose not to keep history of that nature, in which case the previous row containing data for that date is deleted.

Time and the Data Warehouse

Time can be an important aspect of analysis, depending on your business. In retail, identifying busy and slow parts of the day can aid in better work scheduling. In logistics, analysis of delay patterns at pickup and delivery points can help improve scheduling and resource utilization. This section will examine the use of time in the data warehouse.

The Nature of Time

A common mistake in data warehouse design is to treat date and time together. This is understandable because it is common for people and the business to consider them as one and the same. This natural tendency can result in very undesirable effects in the data warehouse.

If we develop the business model (such as the one shown in Figure 6.3) with the understanding that the Date attribute represents a specific Gregorian date, then all other entities that refer to the Date entity have a foreign key that represents a specific Gregorian date. An attribute that represents both the date and time cannot be used as a foreign key, since it represents a point in time rather than a date. To avoid this conflict, the model should represent date and time of day as separate attributes.
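The following Python sketch shows one way an incoming timestamp might be split during a load. The attribute names are hypothetical, and the two time-of-day encodings (a four-digit HHMM number and minutes since midnight) are illustrative choices rather than a prescribed design.

```python
from datetime import datetime


def split_timestamp(ts):
    """Split a timestamp into a date key and two time-of-day attributes."""
    return {
        "date_key": ts.date(),                               # joins to the calendar
        "time_hhmm": ts.hour * 100 + ts.minute,              # display-friendly form
        "minutes_since_midnight": ts.hour * 60 + ts.minute,  # arithmetic-friendly form
    }


row = split_timestamp(datetime(2002, 6, 2, 15, 33))
print(row["date_key"])                # 2002-06-02
print(row["time_hhmm"])               # 1533
print(row["minutes_since_midnight"])  # 933
```

Keeping the date and the time of day in separate attributes leaves the date free to act as a foreign key to the calendar.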
Doing so will help clarify the model and avoid potential implementation issues.

Standardizing Time

An aspect of time is that it is different from place to place. While it is 3:33 P.M. on June 2 in New York, it is 1:03 A.M. on June 3 in Calcutta. When you are designing the data warehouse, you will need to take into account which time is important for the business: a common standard time, the local time, or both. A traditional retail chain is most likely interested in the local time because it represents time from the customer’s perspective, whereas a telecommunications company needs both: local time to analyze customer patterns and rates, and a common standard time to analyze network traffic.

If there is a requirement for a common standard time, you must store the local time and date as well as the standard time and date in the data warehouse. There are some basic reasons for this. If you only stored one date and time, it would be very difficult to reliably calculate the other, particularly in historical analysis. Around the world, the recording of time has more to do with politics than it does with science. It is up to government authorities in each country to decide what time they will keep and how they will keep it. At the time of this writing, there are 37 different standard time zones. This number increases if you take into account those that observe daylight saving time and those that don’t. In addition, the observation dates will vary from country to country. The job would be easier if it were a national decision, but it is not. In the United States, for example, it is up to the States and Tribal Nations, not the Federal government, to establish time rules. For example, Arizona does not observe daylight saving time, whereas the Navajo Indian Reservation in Arizona does observe daylight saving time. In other cases, time zones go through States, following a river, mountain crest, or keeping true to the longitude. To avoid getting bogged down in legislative and geographic minutiae, it is better to capture both times at the time of the transaction.

If your source is Web-based transactions, it is fairly easy to do. Web transmissions are time stamped with the local time as well as Zulu time (UTC, or Greenwich Mean Time). Otherwise, you need to check your source systems to determine what time is collected. If a standard time is not recorded, it may be worthwhile investigating modifications to the transactional system to collect this information.

If all else fails, there are services that can provide worldwide time zone information on a subscription basis. The data will typically contain ISO country and region coding, the offset from Zulu time, and the dates that daylight saving time is observed. These services will provide periodic updates, so you do not need to worry about regulatory changes. The challenge will be to associate your locations with the ISO country and regional-coding standard. Also, it is not certain that the ISO codes will be specific enough to differentiate between sections of states. Therefore, initial use of such data may require some analysis and manual effort to assign the ISO codes to your locations in order to correlate them with the time zone data. A search for “time” at www.google.com will locate such services as well as a wealth of information about time zones.

SIDEBAR: Storing Time
Receiving time values from different systems can be problematic, as each system may provide time at a different level of precision. Two decisions you need to make for the data warehouse are what degree of precision is useful and how the time value will be stored.

The level of precision will depend on your business. In most cases, hour and minute are sufficient. However, some industries, such as telecommunications and banking, require more precise values.

When storing time, there are three approaches. One method is to store time as you would expect to display it, for example, as a four-digit number where the first two digits are the hour and the last two digits are the minute. A second method is to express the time as the number of minutes or seconds since the beginning of the day. The first method is useful when displaying time, while the second is more useful for calculating elapsed time. Both methods are useful for sorting. Both methods need to be supplemented with functions to accommodate their particular shortcomings.

A third approach is to store a discrete time value using one of the other two methods and to store a full date/time value in the database’s native format. This is redundant, as you would be storing a discrete date, a discrete time, and a continuous timestamp. The latter value will allow you to use native database functions to manipulate and measure time, while the discrete values provide useful keys for analysis. This approach provides the greatest flexibility and utility.

Of course, there is another class of date/time attributes that is used internally for audit and control purposes. The values of these attributes would never be used for business analysis and would not be made available to the user community at large. These values should be stored as timestamps in the native database format. No effort should be made to store them as discrete date and time values, as such values are only useful for business analysis.

Data Warehouse System Model

The previous section described the business characteristics of the calendar, including the various types, elements, and time spans. In this section, we describe the impact on both the system and technology representations of the data warehouse. Before getting into the case studies, we introduce the concept of keys as they apply to the calendar. This material expands on the material provided in Chapter 5.
As you will see throughout the remainder of this chapter, the use of entity-relationship modeling concepts for the data warehouse provides the designer with significant flexibility. Properly structured, the model preserves the primary mission of the warehouse as a focal point for collecting data and for subsequently distributing it to the data marts.

Date Keys

Within the data warehouse, data is related to the calendar by foreign keys to the calendar entity’s primary key. Transaction dates, such as enrollment date, order date, or invoice date, would be associated to the calendar in this manner. Other dates, such as birth dates, that have no relationship to business activity are usually stored as dates with no relationship to the calendar.

As discussed in Chapter 4, the date is one of the few attributes that has a known, reliable set of unique values. We can also assume that, at least in our lifetime, there will not be a change in the calendar system, so there is no danger that management will decide to renumber the dates. A date has all the trappings of a perfect key. Whether or not to use a surrogate key for the calendar table will depend on your particular preferences and policies. One could have a policy that all tables should use surrogate primary keys. That is fine because it certainly removes any question as to the nature of a table’s primary key. The other issues to consider are why you may want a surrogate key for the calendar and how you plan to deal with bad dates. The surrogate key section in Chapter 5 discusses different strategies to deal with erroneous reference data. Review that discussion before you decide.

The entity will also have multiple alternate natural key attributes, depending on how dates are represented in the source systems. One attribute may be the date in the native database format; unless you are using a surrogate primary key, this attribute would serve as the primary key.
Additional attributes could contain the dates stored in the format used by incoming data feeds. For example, if one of the data sources contains dates stored as an eight-character CCYYMMDD field, you should include an attribute in the date entity with the date in that format to ease interfacing the external system with the data warehouse. By storing the date in these different formats, your data warehouse load interfaces can locate the appropriate date row without having to use date format conversion functions. This avoids potential problems should the date being received be an invalid one. If you store a single natural key, usually a date in the database’s native format, you will be faced with developing code to validate and cleanse the date prior to using it to look up the primary key. Failure to do this properly will cause an exception from the database system during the load. Such an exception will cause the load process to abort and require some late-night troubleshooting and delays in publishing the warehouse data. As the data warehouse grows and new system interfaces are encountered, it is not unusual to discover new date formats. Adding new attributes to the entity easily accommodates these new formats.

Case Study: Simple Fiscal Calendar

Our consumer packaged goods company, Delicious Foods Company (DFC), is implementing its data warehouse in phases over many years. The initial phases will concentrate on sales, revenue, and customer fulfillment. All sales and revenue reporting is tied to the fiscal calendar. The company uses a modified 4-5-4 calendar, where the fiscal year always begins on January 1 and ends on December 31. The company has the following business rules for its calendar:

■■ The fiscal year begins January 1.
■■ The fiscal year ends December 31.
■■ There are always 52 fiscal weeks in the year.
■■ The week begins on Monday and ends on Sunday.
■■ If the year begins on a Monday, Tuesday, or Wednesday, the first week of the year ends on the first Sunday of the year; otherwise, it ends on the second Sunday of the year.
■■ The last week of the year is week 52 and always ends on December 31.
■■ Each quarter has 3 fiscal months consisting of 4 weeks, 5 weeks, and 4 weeks (4-5-4), respectively.
■■ Workdays are Monday through Friday, except holidays. Activity on Saturday or Sunday is treated as the same workday as the preceding Friday unless that Saturday and/or Sunday is in a different fiscal year, that is, the Friday in question fell on December 30 or December 31. Activity on a holiday is counted in the preceding business day.

[Figure 6.6 Fiscal calendar data feed.]

The company would like the calendar to support both fiscal and Gregorian calendars. It should support reporting by day, month, and year as well as fiscal week, fiscal month, fiscal quarter, and fiscal year. Year-to-date versus last-year-to-date and fiscal-month-to-date versus last-year-fiscal-month-to-date comparisons must use the working day of the fiscal month as the point of comparison. For example, if the current date is the 15th workday of the fiscal month, then last year’s numbers must be those for the period through the 15th workday of last year’s fiscal month. However, if the current date is the last day of the fiscal month, the comparison should include all days of last year’s fiscal month regardless of the number of workdays.

The company has a standard holiday list that is produced annually by the Human Resources Department. Days on this list are considered nonworkdays.

The company’s operational system can provide a data feed that defines a fiscal year by providing 12 records containing the calendar date for the end of each fiscal month. Figure 6.6 shows an example of a typical data feed that defines the fiscal calendar for 2002.
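The week and month rules in the list above can be sketched in Python as follows. This is only an interpretation of the stated rules, with hypothetical function names; it applies a strict 4-5-4 week pattern to every quarter and has not been checked against DFC's actual operational feed.

```python
from datetime import date, timedelta

# Weeks per fiscal month: 4-5-4 repeated for each of the four quarters.
WEEKS_PER_FISCAL_MONTH = [4, 5, 4] * 4


def first_week_end(year):
    """Sunday that ends fiscal week 1 under the stated rules."""
    jan1 = date(year, 1, 1)
    first_sunday = jan1 + timedelta(days=6 - jan1.weekday())  # Sunday is weekday 6
    if jan1.weekday() <= 2:  # year begins on Monday, Tuesday, or Wednesday
        return first_sunday
    return first_sunday + timedelta(days=7)  # otherwise the second Sunday


def fiscal_week(day):
    """Fiscal week number (1 to 52); week 52 absorbs the days through December 31."""
    end1 = first_week_end(day.year)
    if day <= end1:
        return 1
    return min(52, 2 + (day - end1 - timedelta(days=1)).days // 7)


def fiscal_month_ends(year):
    """Last calendar date of each of the 12 fiscal months."""
    end1 = first_week_end(year)
    ends = []
    weeks_elapsed = 0
    for weeks in WEEKS_PER_FISCAL_MONTH:
        weeks_elapsed += weeks
        ends.append(end1 + timedelta(days=7 * (weeks_elapsed - 1)))
    ends[-1] = date(year, 12, 31)  # week 52 always ends on December 31
    return ends


print(first_week_end(2002))         # 2002-01-06
print(fiscal_month_ends(2002)[0])   # 2002-01-27
print(fiscal_week(date(2002, 12, 31)))  # 52
```

In practice the period-ending dates would be taken from the operational data feed rather than derived, so a calculation like this is mainly useful as a cross-check on the feed.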
The incoming data contains four fields: the year, month, and day (which together specify the last day of the fiscal month), and the fiscal month number.

Year  Month  Day  Period_ID
2002  1      27   1
2002  2      24   2
2002  3      31   3
2002  4      28   4
2002  5      26   5
2002  6      30   6
2002  7      28   7
2002  9      1    8
2002  9      29   9
2002  10     27   10
2002  12     1    11
2002  12     31   12

Based on the business models discussed earlier, this section discusses how the technical model is implemented within the data warehouse.

Analysis

It is not unusual for an operational system to provide only minimal data about fiscal calendars. The concern on the operational side is to ensure that transactions are posted to the proper fiscal month. It is not concerned with [...]

[...] important to understand that sometimes a data warehouse effort goes beyond data collection and publication and into areas of data creation and maintenance. When such a situation occurs, it is best to deal with it as a separate application effort. A common approach is to build an application to create and maintain the data separate from the data warehouse, then to treat this data as any other data source being [...]

[...] Figure 6.11 shows the modified schema.

Handling Different Date Presentation Formats

As the scope of the data warehouse expands internationally, it is necessary to accommodate different date and number formats for each country. Ideally, the task of maintaining these formats should be handled in the data marts rather than the data warehouse. The first two approaches, database and query tool localization, rely [...]

[...] maintain the data and in the delivery process to associate the table with other data warehouse tables. Defining and storing the surrogate primary key in the data warehouse allows the surrogate key value to remain consistent across delivery processes. Keeping a stable primary key value is critical to maintaining conformance across the data marts and ensuring referential integrity with historical fact data. Delivering [...]
[...] compound primary key of location and date. This is not desirable in a dimensional data mart. Such a dimension would perform better in the data mart if there were a single simple primary key. In such cases it is best to create and store a surrogate primary key within the denormalized data warehouse table and use the location and date as an alternate key to support the data delivery process. Note that Figure [...]

Database Localization

If the plan is to publish separate data marts for each target country, and each of these marts will reside in its own database, you can rely on the database’s own localization parameters. These parameters allow you to specify how the database should return date and numeric values. All databases support these parameters in some form, and this is the most [...]

[...] three structures are combined to calculate and report commissions. From a business point of view, this may simply be referred to as the “commission structure” and viewed as a monolithic hierarchy. From a data warehouse modeling perspective, it is actually a combination of normal relationships and hierarchies, and it is up to the data warehouse modeler to detect and dissect such organizations to effectively [...]

[...] integration with transactional data is paramount to a successful implementation. In this chapter, we will look at the use of hierarchies in business and how to effectively model them in the data warehouse. First, we will take a look at the use of hierarchies in business and the different ways they can be represented in the business data model. The transition to the data warehouse model will be examined [...]
[...] dictate and used by multiple delivery processes.

Summary

In this chapter, we examined various forms that calendar data can take in your data warehouse. We also introduced the concept of predelivery staging within the data warehouse by creating denormalized tables with precalculated derived values. This technique can significantly reduce the effort necessary to deliver information to external databases and [...]

[...] also create denormalized versions of the data within the data warehouse. In this case, where you have both a standard corporate calendar and local variations, it would be most efficient to produce two sets of tables. The first would be the corporate calendar that supports fiscal reporting, and the other would be a location-specific work schedule using both the date and location as the primary key. The new [...]

[...] obtaining data across specified years, all of which are identified as belonging to the same season. There is no existing mechanism in the current data warehouse design or elsewhere to create and maintain these definitions. Management desires that one be developed.

Analysis

Starting with the last point first, it is not uncommon to encounter situations where the data warehouse is asked to support data that [...]

Date posted: 08/08/2014, 22:20
