Mastering Data Warehouse Design: Relational and Dimensional Techniques, Part 9

correspond to each subject area. (This technique cannot be used if the tool does not provide the ability to divide the model into subject area views.) This technique facilitates the grouping of the data entities by subject area and the provision of views accordingly. The major advantages of this technique are:
■■ Each entity is assigned to a subject area and the subject area assignment is clear.
■■ If a particular data steward or data modeler has responsibility for a specific subject area, then all of the data for which that person is responsible is in one place.
■■ Information can easily be retrieved for specific subject areas.
The major disadvantage of this technique is that the subject area view is fine for developing the data model, but a single subject area rarely provides a complete picture of the business scenario. Hence, for discussion with business users, we need to create additional (for example, process-oriented) views, thereby increasing the maintenance work.

Including the Subject Area within the Entity Name
The third approach is to include the subject area name or code within the entity name. For example, if the Customers subject area is coded CU and the Products subject area is coded PR, we would have entities such as CU Customer, CU Prospect, PR Item, and PR Product Family. The major advantages of this approach are:
■■ It is easy to create the initial entity name with the relationship to the subject area.
■■ It is independent of the data-modeling tool.
■■ There is no issue with respect to displaying the relationship between an entity and a subject area.
■■ Alphabetic lists of entities will be grouped by subject area.
The major disadvantages of this approach are:
■■ The entity name is awkward. With this approach, the modeler is moving away from using business-meaningful names for the entity names.
■■ Maintenance is more difficult. It is possible to have an entity move from one subject area to another when the subject area is refined.
A refinement, for example, may change the definition of a subject area, so that with the revised definition, some of the entities previously assigned to it may need to be reassigned. With this approach, the names of the entities must change. This is a relatively minor inconvenience since it does not cascade to the system and technology models.

Figure 11.4 Segregating subject areas. (Chapter 11, Maintaining the Models)

Business and System Data Models
The toughest relationship to maintain is that between the business data model and the system data model. This difficulty is caused by the volume of changes, the fact that these two models need to be consistent with, but not necessarily identical to, each other, and the limited tool support for maintaining these relationships. Some examples of the differences include:

Differences in the attributes within an entity. The entity within the business data model includes all of the attributes for that entity. Within each system model, only the attributes of interest to that “system” are included. In Chapter 4 (Step 1), we discussed the exclusion of attributes that are not needed in the data warehouse.

Representation over time. The business data model is a point-in-time model that represents the current view of the data and not a series of snapshots. The data warehouse represents data over time (that is, snapshots), and its governing system model is therefore an over-time model. As we saw in Step 2 of the methodology for developing this model, substantial structural differences exist in the deployment, since some relationships change, for example, from one-to-many to many-to-many.

Inclusion of summarized data. Summarized data is often included in a system model. Step 5 of our methodology described specifically how to incorporate summarized data in the data warehouse. Summarized data is inappropriate in a 3NF model such as the business data model.
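The three kinds of difference listed above can be illustrated with a small sketch. The Python below does not come from any modeling tool: the `Entity` and `Attribute` classes, the `derive_system_entity` and `unexplained_differences` helpers, and the sample Customer attributes are all hypothetical names chosen for illustration. Under those assumptions, it copies a business-model entity into a system model, keeps only the attributes in scope, adds a snapshot date to the key to represent data over time, and permits a summarized attribute that the business model would not hold.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Attribute:
    name: str
    is_key: bool = False
    is_summary: bool = False  # summarized data: allowed in a system model only


@dataclass
class Entity:
    name: str
    attributes: list = field(default_factory=list)

    def names(self):
        return [a.name for a in self.attributes]


def derive_system_entity(business, in_scope, snapshot_attr=None, summaries=()):
    """Build a system-model entity as an independent copy of a business entity."""
    # Difference 1: keep only the attributes of interest to this system.
    attrs = [replace(a) for a in business.attributes if a.name in in_scope]
    # Difference 2: an over-time model adds a snapshot date to the key.
    if snapshot_attr:
        attrs.append(Attribute(snapshot_attr, is_key=True))
    # Difference 3: summarized data lives only in the system model.
    attrs.extend(Attribute(s, is_summary=True) for s in summaries)
    return Entity(business.name, attrs)


def unexplained_differences(business, system, snapshot_attr=None):
    """Return system attributes that cannot be traced to the business model."""
    known = set(business.names())
    return [
        a.name
        for a in system.attributes
        if a.name not in known and not a.is_summary and a.name != snapshot_attr
    ]


# Hypothetical business-model entity with all of its attributes.
customer = Entity("Customer", [
    Attribute("customer_id", is_key=True),
    Attribute("customer_name"),
    Attribute("credit_limit"),
    Attribute("fax_number"),
])

# Hypothetical data warehouse system-model entity derived from it.
dw_customer = derive_system_entity(
    customer,
    in_scope={"customer_id", "customer_name", "credit_limit"},
    snapshot_attr="snapshot_date",
    summaries=["total_sales_amount"],
)

print(dw_customer.names())
print(unexplained_differences(customer, dw_customer, "snapshot_date"))
```

An empty result from `unexplained_differences` means every deviation is accounted for by scope, history, or summarization, which is one way to read "consistent but not necessarily identical."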
Sidebar: Associative Entities
Associative entities that resolve the many-to-many relationship between entities that reside in different subject areas do not cleanly fit into a single subject area. Because one of the uses of the subject area model is to ensure that an entity is only represented once in the business data model, a predictable process for designating the subject area for these entities is needed. Choices include basing the decision on stewardship responsibilities (our favorite) or making arbitrary choices and maintaining an inventory of these to ensure that they are not duplicated. If the first option is used, a special color can be used for these entities if desired; if the second option is used, entities could be shown in multiple subject area views, since they still would exist only once in the master model.

These differences (in attributes, representation over time, and summarized data) contribute to the difficulty of maintaining the relationships between these models. None of the data-modeling tools with which we are familiar provide an easy way to overcome these differences. The technique we recommend is that the relationship between the business data model and the system models be manually maintained. There are steps that you can take to make this job easier:
1. Develop the business data model to the extent practical for the first iteration. Be sure to include definitions for all the entities and attributes.
2. Include derived data in the business data model. The derived data represents a deviation from pure normal form. Including it within the business data model promotes consistency since we will be copying a portion of this model as a starting point for each system data model.
3. Maintain some physical storage characteristics of the attributes in the business data model. These characteristics really don’t belong in the business data model since that model represents the business and not the electronic storage of the information.
As you will see in a subsequent step, we use a copy of information in the business data model to generate the starting point for each system data model. Since an entity in the business data model may be replicated into multiple system data models, by storing some physical characteristics in the business data model, we promote consistency and avoid redundant entry of the physical characteristics. The physical characteristics we recommend maintaining within the business data model are the column name, nullability information, and the datatype (including the length or precision). There may be valid reasons for the nullability information and the datatype to change within a system model, but we at least start out with a standard set. For example, the relationship between a customer and a sales transaction may be optional (null permitted) in the business data model if prospects are considered customers. If we are building a data warehouse or application system that only applies to people who actually acquired our product, the relationship is mandatory, and the foreign key cannot be null.
4. Copy the relevant portion of the business data model and use it as the starting point of the system data model. In the modeling tool, this consists of a copy-and-paste operation, not inclusion. Inclusion of entities from one model (probably represented as a view in the modeling tool) into another within the modeling tool does not create a new entity, and any changes made will be reflected back into the business data model.
5. Make appropriate adjustments to the model based on the scope of the application system or data warehouse segment. Each time an adjustment is made, think about whether or not the change has an impact on the business data model. Changes that are made to reflect history, to adjust the storage granularity, and to improve performance generally don’t affect the business data model. It is possible that, as the system data model is developed, definitions will be revised.
These changes do need to be reflected in the business data model.
6. Periodically compare the system data model to the business data model and ensure that the models are consistent with each other and that all of the differences are due to what each of the models represents.

This process requires adherence to data-modeling practices that promote model consistency. Significant effort will be required, and a natural question to ask is, “Is it worth the trouble?” Yes, it is worth the effort. Maintaining consistency between the data warehouse system model and the business data model promotes stability and supports maintenance of the business view within the data warehouse and other systems. The benefits of the business data model noted in Chapter 2 can then be realized.

Another critical advantage is that the maintenance of the relationship between the business data model and the system data model forces a degree of discipline. Project managers are often faced with database designers who like to jump directly to the physical design (or technology model) without considering any of the preceding models on which it depends. To promote adherence to these practices, the project managers must ensure that the development methodology includes these steps and that everyone who works with the model understands the steps and why they are important. Effective adherence to these practices should also be included in the job descriptions.

The forced coordination of the business and system data models and the subsequent downstream relationship between the system and technology models ensure that sound data management techniques are applied in the data warehouse development of all data stores. It promotes managing data and information as corporate assets.

System and Technology Data Models
Most companies have only a single instance of a production database such as a data warehouse.
Even companies that have multiple production versions of this database typically deploy them on the same platform and in the same database management system. This approach significantly simplifies the maintenance of the system and technology data models since we have a one-to-one relationship, as shown in Figure 11.5.

Figure 11.5 Common deployment approach. (Labels: Potential Situation; Common Situation; Data Warehouse System Model; Technology Models.)

Most of the data-modeling tools maintain a “logical” and “physical” data model. While these are often presented as two separate data models, they are often actually two views of the same data model with (in some tools) an ability to include some of the entities and attributes in only one of the models. These two views correspond to the system data model and the technology data model. Without the aid of a repository, most of the tools do not enable the modeler to easily maintain separate system and technology data models. If a company has only one version of the physical data warehouse, we recommend coupling these tightly together and using the data-modeling tool to accomplish this.

The major advantage of this approach is its simplicity. We don’t have to do any extra work to keep the system and technology models synchronized; the modeling tool takes care of that for us. Further, if the data-modeling tool is used to generate the DDL for the database schema, the system model and the physical schema are always synchronized as well. The final technology model is dependent on the physical platform, and changes in the model are made to improve performance.
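The tight coupling described above can be sketched in code. The Python below assumes a hypothetical in-memory model in which each column carries both its system-model (logical) name and its technology-model (physical) name and datatype. The `Column`, `Table`, and `generate_ddl` names and the sample Sales Transaction table are illustrative only, and the emitted DDL is generic SQL rather than the output of any particular modeling tool. Because both views and the DDL are read from the same object, a change made once appears in all three.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    logical_name: str      # name shown in the system-model view
    physical_name: str     # column name in the technology-model view
    datatype: str          # datatype, including length or precision
    nullable: bool = True


@dataclass
class Table:
    logical_name: str
    physical_name: str
    columns: list = field(default_factory=list)

    def logical_view(self):
        return {self.logical_name: [c.logical_name for c in self.columns]}

    def physical_view(self):
        return {self.physical_name: [c.physical_name for c in self.columns]}


def generate_ddl(table):
    """Emit a generic CREATE TABLE statement from the shared model."""
    lines = []
    for c in table.columns:
        null_clause = "NULL" if c.nullable else "NOT NULL"
        lines.append(f"    {c.physical_name} {c.datatype} {null_clause}")
    body = ",\n".join(lines)
    return f"CREATE TABLE {table.physical_name} (\n{body}\n);"


# Hypothetical table; names and datatypes are assumptions for illustration.
sales = Table("Sales Transaction", "sales_txn", [
    Column("Sales Transaction Identifier", "sales_txn_id", "INTEGER", nullable=False),
    Column("Customer Identifier", "cust_id", "INTEGER", nullable=False),
    Column("Sale Amount", "sale_amt", "DECIMAL(12,2)"),
])

# One change to the shared model is visible in both views and in the DDL.
sales.columns.append(Column("Sale Date", "sale_dt", "DATE", nullable=False))

print(sales.logical_view())
print(sales.physical_view())
print(generate_ddl(sales))
```

Nothing in this structure records why a column was added or altered, so a change made for a system-level reason looks the same as one made for a physical deployment reason.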
The major disadvantage of this approach is that when the system and technology model are tightly linked, changes in the technology model create changes in the system model, and we lose information about which decisions concerning the model were made based on the system level constraints and which were made based on the physical deployment constraints. While this disadvantage is worth noting, we feel that a pragmatic approach is appropriate here unless the modeling tool facilitates the separate maintenance of the system and technology models.

Managing Multiple Modelers
The preceding section dealt with managing the relationships between successive pairs of data models. Another maintenance coordination we face is managing the activities of multiple modelers. The two major considerations for managing a staff of modelers are the roles and responsibilities of each person or group and the collision management facilities.

Roles and Responsibilities
Traditionally, data-modeling roles are divided between the data administration staff and the database administration staff. The data administration staff is generally responsible for the subject area model and the business data model, while the database administration staff is generally responsible for the technology model. The system model responsibility may fall in either court or may be shared. The first thing that companies must do is establish responsibilities at the group level. Even if a single group has responsibility for a model, we have the potential of having multiple people involved. Let’s examine each of the data models individually.

Subject Area Model
The subject area model is developed under the auspices of a cross-functional group of business representatives and rarely changes. While it may be under the responsibility of the data administration group, no single individual in that group should change the subject area model.
Any changes to this model need to be understood and sanctioned by the data administration organization. We feel the most appropriate approach is to maintain it under the auspices of the data stewardship group (if one exists), or under data administration if there is no data stewardship group. This model actually helps us in managing the development of the business data model.

Business Data Model
The business data model is the largest data model in our organization. This is true because, when completed, it encompasses the entire enterprise. A complete business data model may contain hundreds of entities and over 10,000 attributes. All entities and attributes in any of the successive models are either extracted from this model or can be derived, based on elements within this model.

The most effective way to manage changes in this model is to assign prime responsibilities based on groupings of entities, some of which may be defined by virtue of the subject areas. We may, for example, have a modeler responsible for an entire subject area, such as Customers. We could also split responsibility for a subject area, with the accountability for some of the entities within a subject area being within the realm of one modeler and the accountability for other entities being within the realm of another modeler. We feel that allocating responsibility at an attribute level is inappropriate.

Very often an individual activity will impact multiple subject areas. The entity responsibilities need to be visibly published so that efforts that entail overlaps can involve the appropriate people. Having prime responsibilities allocated does not mean that only one modeler can work within a section of the model. It means that one modeler is responsible for that section. When we undertake a data warehouse effort that encompasses several subject areas, it may not be appropriate to involve all of the responsible data analysts.
Instead, a single person may be assigned to represent data administration, and that person coordinates with the modelers responsible for each section of the model.

System and Technology Data Model
We previously recommended that the data-modeling tool facilities be used to maintain synchronization between the system and technology data model. We noted that, with respect to the tool, these are in fact a single model with two views. The system and technology data models are developed within the scope of a project. The project leader needs to assign responsibilities appropriately and to ensure that the entire team understands each person’s responsibility. Since all of the activities are under the realm of the project leader, the project plan can be used to aid in the coordination.

Remember that any change to the system data model needs to be considered in the business data model. The biggest challenge is not in maintaining the synchronization among the people responsible for any particular model; it is in maintaining the synchronization among the people responsible for the different models (that is, the business data model and the system data model). Just as companies have procedures that require maintenance programmers to consider downstream systems in making changes, procedures are needed to require people maintaining models to consider the impact on other models. The impact of the changes was presented in Figure 11.2. An inventory of the data models and their relationships to each other should be maintained so that the affected models can be identified.

Collision Management
Collision management is the process for detecting and addressing changes to the model. The process entails providing the modeler with access to a portion of the model, making the model changes, comparing the revised model to the base model, and incorporating appropriate changes. A member of the Data Administration team is responsible for managing this process.
That person must be familiar with the collision management capabilities of the tool, have data modeling skills, have strong communication and negotiation skills, and have a solid understanding of the overall business data model.

Model Access
Access to the model can be provided in one of two forms. One approach is to let the data modeler copy the entire model, and another is to let the data modeler check out a portion of the model. When the facility to check out a portion of the model exists, some tools provide options with respect to exclusivity of control. When these options are provided, the data modeler checks out the model portion and can lock this portion of the model, protecting it from changes made by any other person. Anyone else who makes a request to check out that portion of the model is informed that he or she is receiving read-only access and will not be able to save the changes. When the tool does not provide this level of protection, two people can actively make changes to the same portion of the model, and the one who gets his or her changes in first will have an easier time getting them absorbed, as described in the remainder of this section. With either approach, the data modeler has a copy of the data model that he or she can modify to reflect the necessary changes.

Modifications
Once the modeler has a copy of the portion of the data model of interest, he or she performs the modeling activities dictated by his or her responsibilities. Remember, these changes are being made to a copy of the data model, not to the base model (that is, the model from which components are extracted). When the modeler completes the work, the updates need to be migrated to the base model.

Comparison
Each data modeler is addressing his or her view of the enterprise. The full business data model has a broader perspective.
The business data model represents the entire enterprise; the system data model represents the entire scope of a data warehouse or application system. It is possible for the modeler to be unaware of other aspects of the model that are affected by the changes. The collision management process identifies these impacts. Prior to importing the changes into the base model, the base model and the changed model are compared using a technique called collision management. The technique has this name because it looks for collisions, or differences, between the two models and identifies them. The person responsible for overall model administration can review the identified collisions and indicate which ones should be absorbed into the base model. This step in the process also provides a checkpoint to ensure that the changes in the system model are appropriately reflected in the business model. Any changes that are not incorporated should be discussed with the modeler.

Incorporation
The last step in the process is incorporation of the changes. Once the person responsible for administering the base model makes the decision concerning incorporation of the changes, these are incorporated. Each modeling tool handles this process somewhat differently, but most provide for some degree of automation.

Summary
Synchronization of the various data models is critical if you are to accomplish a major goal of the data warehouse: data consistency. The business data model is used as the foundation for all subsequent models. Every data element that is eventually deployed in a database is linked back to a defined element in the business data model. This linkage ensures consistency and significantly simplifies integration and transformation activities in building the data warehouse. The individual data models may change for a variety of reasons.
Changes to the subject area model and business data model are driven primarily by business changes, and revisions to the other models are driven primarily by impacts of these changes and deployment decisions. The challenge of keeping the models synchronized is exacerbated by the absence of tools that can automate the entire process. The most difficult task is keeping the business data model synchronized with the lower-level models, but as we saw, this synchronization is at the heart of keeping the enterprise perspective.
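The comparison and incorporation steps of collision management described in this chapter can be sketched as follows. This is a minimal Python illustration, not the behavior of any modeling tool. It assumes that a model is a plain dictionary of the form {entity: {attribute: definition}}, that the modeler worked on a copy of the entire base model, and that the function names `find_collisions` and `incorporate` and the sample entities are hypothetical.

```python
def find_collisions(base, changed):
    """Compare two models given as {entity: {attribute: definition}} dicts."""
    collisions = []
    for entity in sorted(set(base) | set(changed)):
        base_attrs = base.get(entity)
        new_attrs = changed.get(entity)
        if base_attrs is None:
            collisions.append(("entity added", entity, None))
            continue
        if new_attrs is None:
            collisions.append(("entity removed", entity, None))
            continue
        for attr in sorted(set(base_attrs) | set(new_attrs)):
            if attr not in base_attrs:
                collisions.append(("attribute added", entity, attr))
            elif attr not in new_attrs:
                collisions.append(("attribute removed", entity, attr))
            elif base_attrs[attr] != new_attrs[attr]:
                collisions.append(("definition changed", entity, attr))
    return collisions


def incorporate(base, changed, approved):
    """Return a new base model with only the approved collisions applied."""
    merged = {e: dict(attrs) for e, attrs in base.items()}
    for kind, entity, attr in approved:
        if kind == "entity added":
            merged[entity] = dict(changed[entity])
        elif kind == "entity removed":
            merged.pop(entity, None)
        elif kind == "attribute removed":
            merged[entity].pop(attr, None)
        else:  # attribute added or definition changed
            merged[entity][attr] = changed[entity][attr]
    return merged


# Hypothetical base model and a modeler's working copy of it.
base = {
    "Customer": {
        "customer_id": "Unique identifier of a customer",
        "customer_name": "Legal name of the customer",
    },
    "Item": {"item_id": "Unique identifier of an item"},
}
changed = {e: dict(attrs) for e, attrs in base.items()}
changed["Customer"]["customer_name"] = "Name used on invoices"
changed["Customer"]["credit_limit"] = "Maximum credit extended"
changed["Prospect"] = {"prospect_id": "Unique identifier of a prospect"}

collisions = find_collisions(base, changed)
# The administrator accepts the additions but defers the definition change.
approved = [c for c in collisions if c[0] != "definition changed"]
merged = incorporate(base, changed, approved)

for collision in collisions:
    print(collision)
```

Keeping review separate from incorporation mirrors the process in the text: the tool reports every difference, and the person administering the base model decides which ones to absorb.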