Data Modeling Essentials (2005), part 5

Chapter 6 Primary Keys and Identity

6.4.1 When to Use Structured Keys

The rule for using structured keys is straightforward: you can include a foreign key in a primary key only if it represents a mandatory nontransferable [4] relationship.

[4] Transferability was introduced in Section 3.5.6.

The relationship needs to be mandatory because an optional relationship would mean that some rows would have a null value for the foreign key; hence, the primary key for those rows would be partially null. The problems of nulls in primary key columns are discussed in Section 6.7.

The reason for the nontransferability may not be so obvious. The problem with transferable relationships is that the value of the foreign key will need to change when the relationship is transferred to a new owner. For example, if an employee is transferred from one department to another, the value of Department ID for that employee will change. If the foreign key is part of the primary key, then we have a change in value of the primary key, and a violation of our stability criterion. In this example, Department ID should not form part of the primary key of Employee.

Another way of looking at this situation is that if we strictly follow the rule that primary key values cannot change (as we should), then structured keys can be used to enforce nontransferability (i.e., the structured key implements the rule that dependent entity instances cannot be transferred from one owner entity to another).

Figure 6.3 provides a more detailed example, using the notation for nontransferability introduced in Section 3.5.6. The Stock Holding entity class has mandatory, nontransferable relationships to both Stock and Client. In business terms:

1. An instance of Stock Holding cannot exist without corresponding instances of Stock and Client.
2. An instance of Stock Holding cannot be transferred to a different stock or client.

By contrast, the relationship from Client to Investment Advisor is optional and transferable, representing the business rules that:

1. We can hold information about a client who does not have an investment adviser.
2. A client can be transferred to a different investment adviser.

Accordingly, in constructing a primary key for a Stock Holding table, we could include the primary keys of the tables implementing the Stock and Client entity classes, but we would not include the primary key of the table implementing the Investment Advisor entity class in the primary key of the Client table.

Figure 6.3 Transferable and nontransferable relationships. [Diagram: entity classes Stock, Client, Stock Holding, and Investment Advisor. Relationship names: be of / be the subject of, be held by / hold, be advised by / advise.]

Incidentally, a very common case in which structured keys are suitable is that of an intersection table that supports a many-to-many relationship. This is because rows of the intersection table cannot exist without corresponding instances of the entity classes involved in the many-to-many relationship and cannot be reallocated to different instances of those entity classes.

In working through these examples, you should be aware of a real trap. Standard E-R diagrams do not include a symbol for nontransferability. [5] And many data modelers overlook the stability criterion for primary keys. We therefore reemphasize: It is only safe to incorporate a foreign key into a primary key if that foreign key represents a nontransferable relationship.

[5] Some CASE tools and E-R modeling extensions do provide some support.

6.4.2 Programming and Structured Keys

Structured keys may simplify programming and improve performance by providing more data items in a table row without violating normalization rules. In Figure 6.4, we are able to determine the department from which a leave application comes without needing to access the Employee table.
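The short cut in Figure 6.4 can be sketched in a few lines. This is only an illustration under stated assumptions: table and column names follow the figure, but the choice of Leave Start Date as the final key column, the sample rows, and the use of Python's built-in sqlite3 module as a stand-in DBMS are ours, not part of the model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;

CREATE TABLE department (
    department_id TEXT NOT NULL PRIMARY KEY
);

-- Structured key: Department ID is part of the primary key of Employee.
CREATE TABLE employee (
    department_id TEXT NOT NULL REFERENCES department (department_id),
    employee_id   TEXT NOT NULL,
    employee_name TEXT NOT NULL,
    PRIMARY KEY (department_id, employee_id)
);

-- Assumed key: the Employee key plus Leave Start Date.
CREATE TABLE leave_application (
    department_id    TEXT NOT NULL,
    employee_id      TEXT NOT NULL,
    leave_start_date TEXT NOT NULL,
    leave_end_date   TEXT NOT NULL,
    leave_type       TEXT NOT NULL,
    PRIMARY KEY (department_id, employee_id, leave_start_date),
    FOREIGN KEY (department_id, employee_id)
        REFERENCES employee (department_id, employee_id)
);

-- Hypothetical sample data.
INSERT INTO department VALUES ('D1'), ('D2');
INSERT INTO employee VALUES
    ('D1', 'E1', 'Ann'), ('D1', 'E2', 'Bob'), ('D2', 'E3', 'Cy');
INSERT INTO leave_application VALUES
    ('D1', 'E1', '2005-01-10', '2005-01-14', 'Annual'),
    ('D1', 'E2', '2005-02-01', '2005-02-02', 'Sick'),
    ('D2', 'E3', '2005-03-07', '2005-03-11', 'Annual');
""")

# The department is read straight from leave_application; employee is not joined.
leave_by_department = conn.execute("""
    SELECT department_id, COUNT(*)
    FROM leave_application
    GROUP BY department_id
    ORDER BY department_id
""").fetchall()
```

The query never touches the employee table, which is the whole of the convenience on offer; the cost is that Department ID is now locked into two primary keys.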
But can an employee transfer from one department to another? If so, the primary key of Employee will be unstable, almost certainly an unacceptable price to pay for a little programming convenience and performance. If performance was critically affected by the decision, it would probably be better to carry Department ID redundantly as a nonprimary-key item in the Leave Application table. In any event, these are decisions for the physical design stage!

Figure 6.4 Navigation short cut supported by structured key. [Diagram: Department (Department ID); Employee (Department ID, Employee ID, Employee Name); Leave Application (Department ID, Employee ID, Leave Start Date, Leave End Date, Leave Type). Relationship names: be employed by / employ, submit / be submitted by. Department ID in Leave Application is labeled as the navigation short-cut.]

6.4.3 Performance Issues with Structured Keys

Although performance is not our first concern as data modelers, it can provide a useful basis for deciding between alternatives that rate similarly against other criteria. (At the physical database design stage, we may need to reconsider the implications of structured keys as we explore compromises to improve performance.)

Structured keys may affect performance in three principal ways. First, they may reduce the number of tables that need to be accessed by some transactions, as in Figure 6.4 (discussed above).

Second, they may reduce the number of access mechanisms that need to be supported. Take the Stock Holding example from Figure 6.3. If we proposed a stand-alone surrogate key for Stock Holding, it is likely that the physical database designer would need to construct three indexes: one for the surrogate key and one for each of the foreign keys to Client and Stock. But if we used Client ID + Stock ID + Date, the designer could probably get by with two indexes, resulting in a saving in space and update time.
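The two-index arrangement can be made concrete with a small sketch. We assume here that the index built for the structured key, which leads with Client ID, can also serve lookups by client, so that only Stock ID needs an index of its own; whether a given DBMS exploits the leading column in this way is a physical design question. The quantity column and the index name are hypothetical, the Client and Stock tables are omitted for brevity, and SQLite (via Python's sqlite3 module) is simply a convenient stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock_holding (
    client_id    INTEGER NOT NULL,
    stock_id     INTEGER NOT NULL,
    holding_date TEXT    NOT NULL,
    quantity     INTEGER NOT NULL,   -- hypothetical nonkey column
    -- Index 1: created automatically for the structured key; it leads with
    -- client_id, so it can also support access by client.
    PRIMARY KEY (client_id, stock_id, holding_date)
);

-- Index 2: access by stock, which the key index does not lead with.
CREATE INDEX stock_holding_by_stock ON stock_holding (stock_id);
""")

# List the indexes SQLite maintains for the table (column 1 is the index name).
index_names = sorted(
    row[1] for row in conn.execute("PRAGMA index_list(stock_holding)")
)
```

With a stand-alone surrogate key, the same table would carry an access path for the surrogate plus one index for each of the two foreign keys.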
Third, as the number of columns in a structured key increases, so does the size of table and index records. It is not unknown for a table at the bottom of a deep hierarchy to have six or more columns in its key. A key we encountered in an Insurance Risk table reflected the following hierarchy: State + Branch + District + Agent + Client + Policy Class + Original Issuer + Policy + Risk, a nine-part key, used throughout the organization. In this case, the key had been constructed in the days of serial files and reflected neither a true hierarchy nor a nontransferable relationship. Very large keys are also common in data marts in which star schemas (see Chapter 16) are used.

When we encounter large keys, we have the option of introducing a stand-alone surrogate key at any point(s) in the hierarchy, reducing the size of the primary keys from that point downwards. Doing so will prevent us from fully enforcing nontransferability and will cost us an extra access mechanism.

In the Compact Disk Library model of Figure 6.5, we can add a surrogate key Track ID to Track, as the primary key, and use this to replace the large foreign key in Performer Role. The primary key of Performer Role would then become Track ID + Performer ID. However, the model would no longer enforce the fact that a track could not be transferred from one CD to another (and perhaps prompt us to rethink our definition of Track).

6.4.4 Running Out of Numbers

Structured keys are prone to a particular kind of stability problem, running out of numbers, which can ultimately require that we reallocate all key values. The more parts to a key, the more likely we are to exhaust all possible values for one of them. Of course, this may also imply running out of numbers for the relevant owner entity instances, but the impact on what is often only a reference table may be more local and manageable.
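A little arithmetic shows why a multipart key can fail long before its overall range is used up. The sketch below is purely illustrative: the two-part key, the digit widths, and the allocation figures are all invented for the purpose.

```python
# Hypothetical two-part key: a 2-digit branch code and a 3-digit customer
# number allocated within each branch.
part_digits = {"branch_code": 2, "customer_no": 3}

# Each part has its own ceiling, independent of the others.
capacity = {part: 10 ** digits for part, digits in part_digits.items()}
total_keys = capacity["branch_code"] * capacity["customer_no"]

# Invented allocation figures: one busy branch, two quiet ones.
customers_per_branch = {"01": 1000, "02": 12, "03": 40}

allocated = sum(customers_per_branch.values())
utilization = allocated / total_keys

# A branch is exhausted once its customer numbers are used up, however much
# of the overall key space remains free.
full_branches = [
    branch
    for branch, count in customers_per_branch.items()
    if count >= capacity["customer_no"]
]
```

In this invented case barely one percent of the possible key values have been allocated, yet branch 01 cannot accept another customer without a change to the key.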
Incidentally, the owner entity class may not actually be represented by a table in the database; its key may provide sufficient information in itself for our purposes.

If we do run out of numbers, it may be prohibitively expensive to redefine the key and amend the programs that use it. Experience suggests that we (or the system users) will be tempted to add new data and meaning to other parts of the key in order to keep the overall value unique. In turn, program logic now has to be amended to extract the meaning of the values held in these parts.

Most experienced data modelers have horror stories to tell in this area. One organization had a team of four staff members working full time on allocating location codes. Another had to completely redevelop a system because they ran out of insurance agent identifiers (the agent identifier consisted of a State Code, Branch Code within state, and Agent Number within state and branch; when all agent numbers for a particular branch had been allocated, new numbers were assigned by creating phantom branches and states).

As a result of problems of this kind, it is often suggested that structured keys be avoided altogether. However, a structured key should involve no more risk than a single-column key, as long as we make adequate provision for growth of each component, and do not break the basic rules of column definition and key design.

Figure 6.5 Large structured keys.
[Diagram: entity classes Manufacturer, CD, Track, Performer Role, and Performer. Relationship names: be issued by / issue, be contained on / contain, be featured on / feature, be performed by / perform. Attributes shown: Manufacturer: Label (Manufacturer ID); CD: Label (Manufacturer ID), Catalogue Number; Track: Label, Catalogue Number, Track Number; Performer: Performer ID; Performer Role: Label, Catalogue Number, Track Number, Performer ID, Role.]

6.5 Multiple Candidate Keys

Quite frequently we encounter tables in which there are two or more columns (or combinations of columns) that could serve as the primary key. There may be two or more natural keys or, more often, a natural and a surrogate key. We refer to each possible key as a candidate key. There are a few rules we need to observe and some traps to watch out for when there is more than one candidate key.

6.5.1 Choosing a Primary Key

We strongly recommend that you always nominate a single primary key for each table. One of the most important reasons for doing so is to specify how relationships will be supported; in nominating the primary key, you are specifying which columns are to be held elsewhere as foreign keys. [6]

The choice of primary key should be based on the requirements and issues discussed earlier in this section. In addition to comparing applicability, stability, structure, and meaningfulness, we should ask, “Does each candidate key represent the same thing for all time?” The presence of more than one candidate key may be a clue that an entity class should be split into two entity classes linked by a one-to-one transferable relationship.

If after this we still genuinely have two (or more) candidate keys for the same entity that are equally applicable and stable, the shortest of these may result in a significant saving in storage requirements, as primary keys are replicated in foreign keys and indexes.

6.5.2 Normalization Issues

Multiple candidate keys can be a sign of tables that are in third normal form but not Boyce-Codd normal form (this is discussed in Chapter 13).
Tables with two or more candidate keys can also be a source of confusion in earlier stages of normalization. Some informal definitions of 3NF imply that a nonkey column (i.e., a column that is not part of the primary key) is not allowed to be a determinant of another nonkey column. (“Each nonkey item must depend on the key, the whole key, and nothing but the key.”) Look at the table in Figure 6.6:

[6] The SQL standard and some DBMSs allow relationships to be supported by foreign keys that point to candidate keys other than the primary key (Section 10.6.1.2). We recommend that use of this facility be restricted to the physical design stage.

Let us assume that every customer has a Tax File No, and that no two customers have the same Tax File No. A bit of thought will show that Tax File No (a nonkey item) is a determinant of Name, Address, and indeed every other column in the table. On the basis of our informal definition of 3NF, we would conclude that the table is not in third normal form, and remove Name, Address, and so on to another table, with Tax File No copied across as the key.

We do not want to do this! It does not achieve anything useful. Remember our definition of 3NF in Chapter 2: Every determinant of a nonkey item must be a candidate key. Our table satisfies this; it is only the “rough and ready” definition of 3NF that leads us astray.

6.6 Guidelines for Choosing Keys

Having read this far, you may feel that we have adequately made our point about primary key choice being complex and difficult! As in much of data modeling, there are certainly choices to be made, and when unusual circumstances arise, there is no substitute for a good understanding of the underlying principles. However, we can usefully draw together the threads of the discussion so far and offer some general guidelines for choosing keys.
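Looking back at the Customer table of Section 6.5.2, the test "every determinant of a nonkey item must be a candidate key" can be checked mechanically. The sketch below uses the standard attribute-closure calculation; the lower-case column names, the restriction to four columns, and the helper functions are our own illustrative choices rather than anything prescribed by the model.

```python
def closure(attrs, fds):
    """Return every column determined by attrs under the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for determinant, dependents in fds:
            if determinant <= result and not dependents <= result:
                result |= dependents
                changed = True
    return result


# The Customer table, cut down to four columns for illustration.
columns = frozenset({"customer_no", "tax_file_no", "name", "address"})

# Customer No and Tax File No each determine every other column.
fds = [
    (frozenset({"customer_no"}), columns - {"customer_no"}),
    (frozenset({"tax_file_no"}), columns - {"tax_file_no"}),
]


def is_candidate_key(attrs):
    """True if attrs determines every column and no proper subset of it does."""
    if closure(attrs, fds) != columns:
        return False
    return all(closure(attrs - {a}, fds) != columns for a in attrs)


# The 3NF test as stated above: every determinant is a candidate key.
every_determinant_is_candidate_key = all(
    is_candidate_key(determinant) for determinant, _ in fds
)
```

Tax File No passes as a candidate key even though it is not the primary key, so the table satisfies the definition; a column such as Name, which determines nothing, does not qualify.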
We divide the problem into two cases, based on the concepts of dependent and independent entity classes introduced in Section 3.5.7. Recall that a dependent entity class is one that has at least one many-to-one mandatory, nontransferable relationship with another entity class. An independent entity class has no such relationships.

A table representing a many-to-many relationship can be thought of as implementing an intersection entity class, which (as we saw in Section 3.5.2) will be dependent on the entity classes participating in the relationship. Accordingly, such a table will follow the rules for a dependent entity class.

6.6.1 Tables Implementing Independent Entity Classes

The primary key of a table representing an independent entity class must be one of the following:

1. A natural identifier: one or more columns in the table corresponding to attributes that are used to identify things in the real world: if you have used the naming conventions outlined in Chapter 5, they will usually be columns with names ending in “Number,” “Code,” or “ID.”
2. A surrogate key: a single column.

A sensible general approach to selecting the primary key of an independent entity class is to use natural identifiers when they are available and surrogate keys otherwise.

Figure 6.6 Table with two candidate keys
CUSTOMER (Customer No, Tax File No, Name, Address, . . .)

6.6.2 Tables Implementing Dependent Entity Classes and Many-to-Many Relationships

We have an additional option for the primary key of a table representing a dependent entity class or a many-to-many relationship in that we can include the foreign key(s) representing the relationships to the entity classes on which the entity class in question depends. Obviously, a single foreign key alone is not sufficient as a primary key, since that would only allow for one instance of the dependent entity for each instance of the associated entity.
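The point that a single foreign key cannot serve alone as the primary key is easy to demonstrate. In this sketch we borrow Employee and Leave Application from Figure 6.4 but simplify them (no Department, hypothetical table names and dates), and again use Python's sqlite3 module as a stand-in DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id TEXT NOT NULL PRIMARY KEY
);

-- Foreign key alone as the primary key: at most one row per employee.
CREATE TABLE leave_fk_only (
    employee_id      TEXT NOT NULL PRIMARY KEY
                     REFERENCES employee (employee_id),
    leave_start_date TEXT NOT NULL
);

-- Foreign key plus an existing column: many rows per employee.
CREATE TABLE leave_fk_plus_date (
    employee_id      TEXT NOT NULL REFERENCES employee (employee_id),
    leave_start_date TEXT NOT NULL,
    PRIMARY KEY (employee_id, leave_start_date)
);

INSERT INTO employee VALUES ('E1');
""")


def count_accepted(table, rows):
    """Insert rows one at a time and count those the key allows."""
    accepted = 0
    for row in rows:
        try:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # rejected as a duplicate primary key value
    return accepted


# Two leave applications from the same employee.
applications = [("E1", "2005-01-10"), ("E1", "2005-03-02")]
fk_only_accepted = count_accepted("leave_fk_only", applications)
fk_plus_date_accepted = count_accepted("leave_fk_plus_date", applications)
```

The first table accepts only one of the two applications; the second accepts both, because the date distinguishes them.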
The additional options for the primary key of the table representing a dependent entity class are as follows:

1. The foreign key(s) plus one or more existing columns. For example, a scheduled flight will be flown as multiple actual flights; there is therefore a one-to-many relationship between Scheduled Flight and Actual Flight. Actual flights can be identified by a combination of the Flight No (the primary key of Scheduled Flight) and the date on which the actual flight is flown.
2. Multiple foreign keys that together satisfy the criteria for a primary key. The classic example of this is the implementation of an intersection entity class (Section 3.5.2), though this approach will not work for all intersection entity classes, some of which will require option 1 or 3 (i.e., the addition of an existing column, such as a date, or of a surrogate key).
3. The foreign key(s) plus a surrogate key. For example, a student could be identified by a combination of the Student ID issued by his or her college and the ID of the college that issued it (the foreign key representing the relationship between Student and College).

Our general rule is to include all foreign keys that represent dependency relationships, adding a surrogate or (if available) an existing column to ensure uniqueness if necessary. By doing this, we are enforcing nontransferability, as long as we stick to the general rule that primary key values cannot be changed.

We nearly always use primary keys containing foreign keys for tables representing dependent entity classes, but will sometimes find that such a table has an excellent stand-alone key available. We may then choose to trade enforcement of nontransferability for the convenience of using an available “natural” key.
For example, it may not be possible for a passport to be transferred from one person to another; hence, we could include the key of Person in the key of Passport, but we may prefer to use a well-established stand-alone Passport Number.

6.7 Partially-Null Keys

We complete this chapter by looking at an issue that arises from time to time: whether or not null values should be permitted in primary key columns.

There are plenty of good reasons why the entire primary key should never be allowed to be null (empty); we would then have a problem with interpreting foreign keys: does null mean “no corresponding row” or is it a pointer to the row with the null primary key? But conventional data modeling wisdom also dictates that no part (i.e., no column) of a multicolumn primary key should ever be null. Some of the arguments are to do with sophisticated handling of different types of nulls, which is currently of more academic than practical relevance, since the null handling of most DBMSs is very basic. To our knowledge, no DBMS allows for any column of a primary key to be null.

However, there are situations where not every attribute represented by a column of the primary key has a legitimate value for every instance. In these situations you may want to use some special value to indicate that there is no real-world value for those attributes in those instances. (We shall discuss possible special values shortly.)

The issue often arises when implementing a supertype whose subtypes have distinct primary keys. For example, an airline may want to implement a Service entity whose subtypes are Flight Service (identified by a Flight Number) and Accommodation Service (identified by an alphabetic Accommodation Service ID). The key for Service could be Flight Service No + Accommodation Service ID, where one value would always be logically null. This is a workable, if inelegant, alternative to generalizing the two to produce a single alphanumeric attribute.
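The Service key can be sketched as follows. We assume, purely for illustration, that zero is never a real flight service number and that the empty string is never a real accommodation service ID, and we use them as the special "logically null" values; the sample values and the CHECK rule are also ours. SQLite, used here through Python's sqlite3 module only because it is convenient, needs NOT NULL stated explicitly on the key columns of an ordinary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE service (
    -- 0 and '' are assumed special values meaning "logically null".
    flight_service_no        INTEGER NOT NULL DEFAULT 0,
    accommodation_service_id TEXT    NOT NULL DEFAULT '',
    PRIMARY KEY (flight_service_no, accommodation_service_id),
    -- Exactly one of the two key columns carries a real value.
    CHECK ((flight_service_no = 0) <> (accommodation_service_id = ''))
);
""")

# A flight service and an accommodation service (hypothetical values).
conn.execute("INSERT INTO service (flight_service_no) VALUES (123)")
conn.execute("INSERT INTO service (accommodation_service_id) VALUES ('HTL')")


def is_rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# A true null in a key column is refused, as is a row with both parts filled.
null_rejected = is_rejected("INSERT INTO service VALUES (NULL, 'SPA')")
both_rejected = is_rejected("INSERT INTO service VALUES (456, 'SPA')")
service_count = conn.execute("SELECT COUNT(*) FROM service").fetchone()[0]
```

The CHECK constraint is optional; it simply records the rule that one value is always logically null, which the key alone does not enforce.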
A variant of this situation is shown in Figure 6.7. The keys for Branch and Department are legitimate as long as branches cannot be transferred from one division to another and departments cannot be transferred from one branch to another. But if we decide to implement at the Organization Unit level, giving us a simple hierarchy, can we generalize the primary keys of the subtypes into a primary key for Organization Unit? The proposed key would be Division ID + Branch ID + Department ID. For divisions, Branch ID and Department ID would be logically null, and for branches Department ID would be logically null. Again, we have logically null values in the primary key; again, we have a solution that is workable and that has been employed successfully in practice.

The choice of key in this example has some interesting implications. The foreign key, which points to the next level up the hierarchy, is contained in the primary key (e.g., Branch ID “0219” contains the key of Division “02”). This limits us to three levels of hierarchy; our choice of primary key has imposed a constraint on the number of levels and their relationships. With a surrogate key, by contrast, any such limits would need to be enforced outside the data structure. This is another example of a structured key imposing constraints that we may or may not want to enforce for the life of the system.

What special values can we use to represent a logically-null primary key attribute given that our DBMS will almost certainly not allow us to use “null” itself? If the attribute is a text item or category (see Section 5.4.2), you might use a zero-length character string. If it is a quantifier, you can use zero if it does not represent a real-world value.
If it does, you are reduced to either choosing some other special value, like –1 or 999999 or adding a [...]

Figure 6.7 Use of a primary key with logically null attributes. [Diagram: Organization Unit with subtypes Division (Division ID), Branch (Division ID, Branch ID), and Department (Division ID, Branch ID, Department ID). Relationship names: report to / control, at both levels.]