Data Modeling Essentials (2005), Part 2

1.11.5 Personal Computing and User-Developed Systems

Today's professionals or knowledge workers use PCs as essential "tools of trade" and frequently have access to a DBMS such as Microsoft Access™. Though an organization's core systems may be supported by packaged software, substantial resources may still be devoted to systems development by such individuals. Owning a sophisticated tool is not the same thing as being able to use it effectively, and much time and effort is wasted by amateurs attempting to build applications without an understanding of basic design principles.

The discussion about the importance of data models earlier in this chapter should have convinced you that the single most important thing for an application designer to get right is the data model. A basic understanding of data modeling makes an enormous difference to the quality of the results that an inexperienced designer can achieve. Alternatively, the most critical place to get help from a professional is in the data-modeling phase of the project. Organizations that encourage (or allow) end-user development of applications would do well to provide specialist data modeling training and/or consultancy as a relatively inexpensive and nonintrusive way of improving the quality of those applications.

1.11.6 Data Modeling and XML

XML (Extensible Markup Language) was developed as a format for presenting data, particularly in web pages, its principal value being that it provided information about the meaning of the data, in the same way that HTML provides information about presentation format. The same benefits have led to its wide adoption as a format for the transfer of data between applications and enterprises, and to the development of a variety of tools to generate XML and process data in XML format.

XML's success in these roles has led to its use as a format for data storage as an alternative to the relational model of storage used in RDBMSs and, by extension, as a modeling language. At this stage, the key message is that, whatever its other strengths and weaknesses, XML does not remove the need to properly understand data requirements and to design sound, well-documented data structures to support them. As with object-oriented approaches, the format and language may differ, but the essentials of data modeling remain the same.

1.11.7 Summary

The role of the data modeler in many organizations has changed. But as long as we need to deal with substantial volumes of structured data, we need to know how to organize it and need to understand the implications of the choices that we make in doing so. That is essentially what data modeling is about.

1.12 Alternative Approaches to Data Modeling

One of the challenges of writing a book on data modeling is to decide which of the published data modeling "languages" and associated conventions to use, in particular for diagrammatic representation of conceptual models. There are many options and continued debate about their relative merits. Indeed, much of the academic literature on data modeling is devoted to exploring different languages and conventions and proposing DBMS architectures to support them. We have our own views, but in writing for practitioners who need to be familiar with the most common conventions, our choice is narrowed to two options:
1. One core set of conventions, generally referred to as the Entity-Relationship (E-R) approach,[14] with ancestry going back to the late 1960s,[15] was overwhelmingly dominant until the late 1990s. Not everyone uses the same "dialect," but the differences between practitioners are relatively minor.

2. Since the late 1990s, an alternative set of conventions—the Unified Modeling Language (UML), which we noted in Section 1.9.4—has gained in popularity.

The overwhelming majority of practicing modelers know and use one or both of these languages. Similarly, tools to support data modeling almost invariably use E-R or UML conventions.

UML is the "richer" language. It provides conventions for recording a wide range of conventional and object-oriented analysis and design deliverables, including data models represented by class diagrams. Class diagrams are able to capture a greater variety of data structures and rules than E-R diagrams. However, this complexity incurs a substantial penalty in difficulty of use and understanding, and we have seen even very experienced practitioners misusing the additional language constructs. Also, some of the rules and structures that UML is able to capture are not readily implemented with current relational DBMSs.

[14] Chen, P. P. (1976): "The Entity-Relationship Model—Toward a Unified View of Data," ACM Transactions on Database Systems 1(1), March, pp. 9–36.
[15] Bachman, C. (1969): "Data Structure Diagrams," Bulletin of ACM SIGFIDET 1(2).

We discuss the relative merits of UML and E-R in more detail in Chapter 7. Our decision to use (primarily) the E-R conventions in this book was the result of considerable discussion, which took into account the growing popularity of UML. Our key consideration was the desire to focus on what we believe are the most challenging parts of data modeling: understanding user requirements and designing appropriate data structures to meet them. As we reviewed the material that we wanted to cover, we noted that the use of a more sophisticated language would make a difference in only a very few cases and could well distract those readers who needed to devote a substantial part of their efforts to learning it.

However, if you are using UML, you should have little difficulty adapting the principles and techniques that we describe. In a few cases where the translation is not straightforward—usually because UML offers a feature not provided by E-R—we have highlighted the difference. At the time of writing, we are planning to publish all of the diagrams in this book in UML format on the Morgan Kaufmann website at www.mkp.com/?isbn=0126445516.

As practicing data modelers, we are sometimes frustrated by the shortcomings of the relatively simple E-R conventions (for which UML does not always provide a solution). In Chapter 7, we look at some of the more interesting alternatives, first because you may encounter them in practice (or more likely in reading more widely about data modeling), and second because they will give you a better appreciation of the strengths and weaknesses of the more conventional methods. However, our principal aim in this book is to help you to get the best results from the tools that you are most likely to have available.

1.13 Terminology

In data modeling, as in all too many other fields, academics and practitioners have developed their own terminologies and do not always employ them consistently.
We have already seen an example in the names for the different components of a database specification. The terminology that we use for the data models produced at different stages of the design process—viz. conceptual, logical, and physical models—is widely used by practitioners, but, as noted earlier, there is some variation in how each is defined. In some contexts (though not in this book), no distinction may be made between the conceptual and logical models, and the terms may be used interchangeably.

Finally, you should be aware of two quite different uses of the term data model itself. Practitioners use it, as we have in this chapter, to refer to a representation of the data required to support a particular process or set of processes. Some academics use "data model" to describe a particular way of representing data: for example, in tables, hierarchically, or as a network. Hence, they talk of the "Relational Model" (tables), the "Object-Role Model," or the "Network Model."[16] Be aware of this as you read texts aimed at the academic community or in discussing the subject with them. And encourage some awareness and tolerance of practitioner terminology in return.

[16] On the (rare) occasions that we employ this usage (primarily in Chapter 7), we use capitals to distinguish: the Relational Model of data versus a relational model for a particular database.

1.14 Where to from Here?—An Overview of Part I

Now that we have an understanding of the basic goals, context, and terminology of data modeling, we can take a look at how the rest of this first part of the book is organized.

In Chapter 2 we cover normalization, a formal technique for organizing data into tables. Normalization enables us to deal with certain common problems of redundancy and incompleteness according to straightforward and quite rigorous rules. In practice, normalization is one of the later steps in the overall data modeling process. We introduce it early in the book to give you a feeling for what a sound data model looks like and, hence, what you should be working towards.

In Chapter 3, we introduce a method for presenting models in a diagrammatic form. In working with the insurance model, you may have found that some of the more important business rules (such as only one customer being allowed for each policy) were far from obvious. As we move to more complex models, it becomes increasingly difficult to see the key concepts and rules among all the detail. A typical model of 100 tables with five to ten columns each will appear overwhelmingly complicated. We need the equivalent of an architect's sketch plan to present the main points, and we need the ability to work "top down" to develop it.

In Chapter 4, we look at subtyping and supertyping and their role in exploring alternative designs and handling complex models. We touched on the underlying idea when we discussed the possible division of the Customer table into separate tables for personal and corporate customers (we would say that this division was based on Personal Customer and Corporate Customer being subtypes of Customer, or, equivalently, Customer being a supertype of Corporate Customer and Personal Customer).

In Chapter 5 we look more closely at columns (and their conceptual model ancestors, which we call attributes). We explore issues of definition, coding, and naming.
In Chapter 6 we cover the specification of primary keys—columns such as Policy Number, which enable us to identify individual rows of data.

In Chapter 7 we look at some extensions to the basic conventions and some alternative modeling languages.

1.15 Summary

Data and databases are central to information systems. Every database is specified by a data model, even if only an implicit one. The data model is an important determinant of the design of the associated information systems. Changes in the structure of a database can have a radical and expensive impact on the programs that access it. It is therefore essential that the data model for an information system be an accurate, stable reflection of the business it supports.

Data modeling is a design process. The data model cannot be produced by a mechanical transformation from hard business facts to a unique solution. Rather, the modeler generates one or more candidate models, using analysis, abstraction, past experience, heuristics, and creativity. Quality is assessed according to a number of factors including completeness, non-redundancy, faithfulness to business rules, reusability, stability, elegance, integration, and communication effectiveness. There are often trade-offs involved in satisfying these criteria.

Performance of the resulting database is an important issue, but it is primarily the responsibility of the database administrator/database technician. The data modeler will need to be involved if changes to the logical data model are contemplated.

In developing a system, data modeling and process modeling usually proceed broadly in parallel. Data modeling principles remain important for object-oriented development, particularly where large volumes of structured data are involved. Prototyping and agile approaches benefit from a stable data model being developed and communicated at an early stage. Despite the wider use of packaged software and end-user development, data modeling remains a key technique for information systems professionals.

Chapter 2: Basics of Sound Structure

"A place for everything and everything in its place."
– Samuel Smiles, Thrift, 1875

"Begin with the end in mind."
– Stephen R. Covey, The 7 Habits of Highly Effective People

2.1 Introduction

In this chapter, we look at some fundamental techniques for organizing data. Our principal tool is normalization, a set of rules for allocating data to tables in such a way as to eliminate certain types of redundancy and incompleteness.

In practice, normalization is usually one of the later activities in a data modeling project, as we cannot start normalizing until we have established what columns (data items) are required. In the approach described in Part 2, normalization is used in the logical database design stage, following requirements analysis and conceptual modeling. We have chosen to introduce normalization at this early stage of the book[1] so that you can get a feeling for what a well-designed logical data model looks like. You will find it much easier to understand (and undertake) the earlier stages of analysis and design if you know what you are working toward.

Normalization is one of the most thoroughly researched areas of data modeling, and you will have little trouble finding other texts and papers on the subject. Many take a fairly formal, mathematical approach.
Here, we focus more on the steps in the process, what they achieve, and the practical problems you are likely to encounter. We have also highlighted areas of ambiguity and opportunities for choice and creativity.

The majority of the chapter is devoted to a rather long example. We encourage you to work through it. By the time you have finished, you will have covered virtually all of the issues involved in basic normalization[2] and encountered many of the most important data modeling concepts and terms.

[1] Most texts follow the sequence in which activities are performed in practice (as we do in Part 2). However, over many years of teaching data modeling to practitioners and college students, we have found that both groups find it easier to learn the top-down techniques if they have a concrete idea of what a well-structured logical model will look like. See also comments in Chapter 3, Section 3.3.1.
[2] Advanced normalization is covered in Chapter 13.

2.2 An Informal Example of Normalization

Normalization is essentially a two-step[3] process:

1. Put the data into tabular form (by removing repeating groups).
2. Remove duplicated data to separate tables.

[3] This is a simplification. Every time we create a table, we need to identify its primary key. This task is absolutely critical to normalization; the only reason that we have not nominated it as a "step" in its own right is that it is performed within each of the two steps which we have listed.

A simple example will give you some feeling for what we are trying to achieve. Figure 2.1 shows a paper form (it could equally be a computer input screen) used for recording data about employees and their qualifications.

Figure 2.1 Employee qualifications form.

  Employee Number: 01267        Employee Name: Clark
  Department Number: 05         Department Name: Auditing
  Department Location: HO

  Qualification            Year
  Bachelor of Arts         1970
  Master of Arts           1973
  Doctor of Philosophy     1976

If we want to store this data in a database, our first task is to put it into tabular form. But we immediately strike a problem: because an employee can have more than one qualification, it's awkward to fit the qualification data into one row of a table (Figure 2.2). How many qualifications do we allow for? Murphy's law tells us that there will always be an employee who has one more qualification than the table will handle.

Figure 2.2 Employee qualifications table.

  Employee Number | Employee Name | Dept. Number | Dept. Name | Dept. Location | Qualification 1 (Description, Year) | Qualification 2 | Qualification 3 | Qualification 4
  01267 | Clark | 05 | Auditing | HO | Bachelor of Arts, 1970 | Master of Arts, 1973 | Doctor of Philosophy, 1976 |
  70964 | Smith | 12 | Legal | MS | Bachelor of Arts, 1969 | | |
  22617 | Walsh | 05 | Auditing | HO | Bachelor of Arts, 1972 | Master of Arts, 1977 | |
  50607 | Black | 05 | Auditing | HO | | | |

We can solve this problem by splitting the data into two tables. The first holds the basic employee data, and the second holds the qualification data, one row per qualification (Figure 2.3). In effect, we have removed the "repeating group" of qualification data (consisting of qualification descriptions and years) to its own table. We hold employee numbers in the second table to serve as a cross-reference back to the first, because we need to know to whom each qualification belongs. Now the only limit on the number of qualifications we can record for each employee is the maximum number of rows in the table—in practical terms, as many as we will ever need.

Our second task is to eliminate duplicated data. For example, the fact that department number "05" is "Auditing" and is located at "HO" is repeated for every employee in that department. Updating data is therefore complicated. If we wanted to record that the Auditing department had moved to another location, we would need to update several rows in the Employee table.
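The update problem can be made concrete with a small SQL sketch. This is illustrative only: the book describes the tables rather than any SQL, and the column spellings and the new location code 'NY' are our assumptions.

  -- With department details embedded in the Employee table, one business fact
  -- ("Auditing has moved") must be recorded once per employee in that department:
  UPDATE employee
  SET    dept_location = 'NY'       -- assumed code for the new location
  WHERE  dept_number   = '05';      -- every Auditing employee row changes

  -- Once the department data has been moved to a table of its own
  -- (as described in the next paragraphs), the same fact is recorded exactly once:
  UPDATE department
  SET    dept_location = 'NY'
  WHERE  dept_number   = '05';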
Recall that two of our quality criteria introduced in Chapter 1 were "non-redundancy" and "elegance"; here we have redundant data and a model that requires inelegant programming. The basic problem is that department names and addresses are really data about departments rather than employees, and belong in a separate Department table. We therefore establish a third table for department data, resulting in the three-table model of Figure 2.4 (see page 37). We leave Department Number in the Employee table to serve as a cross-reference, in the same way that we retained Employee Number in the Qualification table. Our data is now normalized.

This is a very informal example of what normalization is about. The rules of normalization have their foundation in mathematics and have been very closely studied by researchers. On the one hand, this means that we can have confidence in normalization as a technique; on the other, it is very easy to become lost in mathematical terminology and proofs and miss the essential simplicity of the technique. The apparent rigor can also give us a false sense of security, by hiding some of the assumptions that have to be made before the rules are applied.

You should also be aware that many data modelers profess not to use normalization, in a formal sense, at all. They would argue that they reach the same answer by common sense and intuition. Certainly, most practitioners would have had little difficulty solving the employee qualification example in this way. However, common sense and intuition come from experience, and these experienced modelers have a good idea of what sound, normalized data models look like. Think of this chapter, therefore, as a way of gaining familiarity with some sound models and, conversely, with some important and easily classified design faults. As you gain experience, you will find that you arrive at properly normalized structures as a matter of habit.

Nevertheless, even the most experienced professionals make mistakes or encounter difficulties with sophisticated models. At these times, it is helpful to get back onto firm ground by returning to first principles such as normalization. And when you encounter someone else's model that has not been properly normalized (a common experience for data modeling consultants), it is useful to be able to demonstrate that some generally accepted rules have been violated.
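To make the normalized three-table model concrete, here is one way it could be declared in SQL. This is a sketch under our own assumptions: the book specifies the tables and columns but not data types, constraints, or the eventual choice of primary keys (keys are discussed later in the chapter), so those details are illustrative rather than prescribed.

  -- Department details are now held once per department.
  CREATE TABLE department (
      dept_number     CHAR(2)     PRIMARY KEY,
      dept_name       VARCHAR(30) NOT NULL,
      dept_location   VARCHAR(10) NOT NULL
  );

  -- Each employee row carries only employee facts, plus a cross-reference
  -- (foreign key) to the department.
  CREATE TABLE employee (
      employee_number CHAR(5)     PRIMARY KEY,
      employee_name   VARCHAR(30) NOT NULL,
      dept_number     CHAR(2)     NOT NULL REFERENCES department (dept_number)
  );

  -- One row per qualification held, however many an employee accumulates.
  CREATE TABLE qualification (
      employee_number           CHAR(5)     NOT NULL REFERENCES employee (employee_number),
      qualification_description VARCHAR(40) NOT NULL,
      qualification_year        INTEGER,
      PRIMARY KEY (employee_number, qualification_description)
  );

Each fact now has a single home, and recording an employee's fourth or fifth qualification means adding rows, not columns.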
2.3 Relational Notation

Before tackling a more complex example, we need to learn a more concise notation. The sample data in the tables takes up a lot of space and is not required to document the design (although it can be a great help in communicating it). If we eliminate the sample rows, we are left with just the table names and columns.

Figure 2.3 Separation of qualification data.

  Employee Table
  Employee Number | Employee Name | Dept. Number | Dept. Name | Dept. Location
  01267 | Clark | 05 | Auditing | HO
  70964 | Smith | 12 | Legal | MS
  22617 | Walsh | 05 | Auditing | HO
  50607 | Black | 05 | Auditing | HO

  Qualification Table
  Employee Number | Qualification Description | Qualification Year
  01267 | Bachelor of Arts | 1970
  01267 | Master of Arts | 1973
  01267 | Doctor of Philosophy | 1976
  70964 | Bachelor of Arts | 1969
  22617 | Bachelor of Arts | 1972
  22617 | Master of Arts | 1977

Figure 2.5 on the next page shows the normalized model of employees and qualifications using the relational notation of table name followed by column names in parentheses. (The full notation requires that the primary key of the table be marked—discussed in Section 2.5.4.) This convention is widely used in textbooks, and it is convenient for presenting the minimum amount of information needed for most worked examples. In practice, however, we usually want to record more information about each column: format, optionality, and perhaps a brief note or description. Practitioners therefore usually use lists as in Figure 2.6, also on the next page.

2.4 A More Complex Example

Armed with the more concise relational notation, let's now look at a more complex example and introduce the rules of normalization as we proceed.

Figure 2.4 Separation of department data.

  Employee Table
  Employee Number | Employee Name | Dept. Number
  01267 | Clark | 05
  22617 | Walsh | 05
  70964 | Smith | 12
  50607 | Black | 05

  Department Table
  Dept. Number | Dept. Name | Dept. Location
  05 | Auditing | HO
  12 | Legal | MS

  Qualification Table
  Employee Number | Qualification Description | Qualification Year
  01267 | Bachelor of Arts | 1970
  01267 | Master of Arts | 1973
  01267 | Doctor of Philosophy | 1976
  70964 | Bachelor of Arts | 1969
  22617 | Bachelor of Arts | 1972
  22617 | Master of Arts | 1977

[...]

... Drug Name 1, Manufacturer 1, Size of Dose 1, Unit of Measure 1, Method of Administration 1, Dose Cost 1, Number of Doses 1, Drug Short Name 2, Drug Name 2, Manufacturer 2, Size of Dose 2, Unit of Measure 2, Method of Administration 2, Dose Cost 2, Number of Doses 2, Drug Short Name 3, Drug Name 3, Manufacturer 3, Size of Dose 3, Unit of Measure 3, Method of Administration 3, Dose Cost 3, Number of Doses ...

[...]

... trade-off between data retrieval (faster if we do not have to assemble the base data and calculate the total each time) and data update (the total will need to be recalculated if we change the base data). Far more importantly, though, performance is not our concern at the logical modeling stage. If the physical database designers cannot achieve the required performance, then specifying redundant data in the ...

[...]

... Number of Doses 4)

Figure 2.9 Drug expenditure model after tidying up.

[4] "Key" can have a variety of meanings in data modeling and database design. Although it is common for data modelers to use the term to refer only to primary keys, we strongly recommend that you acquire the habit of using the full term to avoid misunderstandings.

2.6 Repeating Groups and First Normal Form ...

[...]

... of Data for Large Shared Data Banks," Communications of the ACM (June 1970). This was the first paper to advocate normalization as a data modeling technique.

[...]

2.7 Second and Third Normal Forms

... but the original data structure did allow us to distinguish between first, second, third, and fourth administrations. A sequence column in the Drug Administration table would have enabled us to retain that data ...

[...]

... Drug Name 1, Drug Name 2, Drug Name 3, and Drug Name 4 are in some sense the "same sort of thing," and we represent them with a generic Drug Name. It is hard to dispute this case, but what about the example in Figure 2.20?

  CURRENCY (Currency ID, Date, Spot Rate, Exchange Rate 3 Days, Exchange Rate 4 Days, Exchange Rate 5 Days, ...)

Figure 2.20 Currency exchange ...

[...]

... group" of drug administration data is that we have to set an arbitrary maximum number of repetitions, large enough to accommodate the greatest number that might ever occur in practice.

2.6.2 Data Reusability and Program Complexity

The need to predict and allow for the maximum number of repetitions is not the only problem caused by the repeating group. The data cannot ...

[...]

... preserve this data, we would need to add a Return Date or Return Sequence column. If the hospitals used red forms for emergency operations and blue forms for elective surgery, we would need to add a column to record the category if it was of interest to the database users.

2.5.3 Derivable Data

Remember our basic objective of nonredundancy. We should remove any data that can be derived from other data in the ...

[...]

... of the data that we do at this stage of normalization often brings to light issues that may take us back to the earlier stages of preparation for normalization and removal of repeating groups.

2.7.4 Third Normal Form

Figure 2.13 shows the final model. Every time we removed data to a separate table, we eliminated some redundancy and allowed the data in the table to be stored independently of other data ...

[...]

Figure 2.14 Drug table resulting from complex drug code.

  Code | Drug Name
  Max 50mg | Maxicillin
  Max 100mg | Maxicillin
  Max 200mg | Maxicillin

Equivalently, we can say that the other nominated columns are functionally dependent on the determinant. The determinant concept is what 3NF is all about; we are simply grouping data items around their determinants.

2.8.2 Primary Keys ...

[...]

... with "S." The prefixes add no information, at least when we are dealing with them as data in the database, in the context of their column names. If they were to be used without that context, we would simply add the appropriate prefix when we printed or otherwise exported the data.

2.5.4 Determining the Primary Key

Finally, we determine a primary key[4] for the table ...
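The CURRENCY structure quoted above, with a separate exchange-rate column for each number of days forward, is the kind of repeating group these sections discuss (the book presents it as a debatable case rather than an obvious one). Purely as an illustration of the mechanics, and with assumed SQL names and data types rather than anything taken from the book, removing that repeating group would look like this:

  -- Before: one column per forward period, so the number of periods is fixed
  -- in the table definition and in every program that reads it.
  CREATE TABLE currency_rate (
      currency_id          CHAR(3),
      rate_date            DATE,
      spot_rate            DECIMAL(12,6),
      exchange_rate_3_days DECIMAL(12,6),
      exchange_rate_4_days DECIMAL(12,6),
      exchange_rate_5_days DECIMAL(12,6),
      PRIMARY KEY (currency_id, rate_date)
  );

  -- After: the repeating group becomes rows in its own table, one row per
  -- currency, date, and number of days forward; new periods need no new columns.
  CREATE TABLE currency_spot_rate (
      currency_id   CHAR(3),
      rate_date     DATE,
      spot_rate     DECIMAL(12,6),
      PRIMARY KEY (currency_id, rate_date)
  );

  CREATE TABLE currency_forward_rate (
      currency_id   CHAR(3),
      rate_date     DATE,
      days_forward  INTEGER,
      exchange_rate DECIMAL(12,6),
      PRIMARY KEY (currency_id, rate_date, days_forward)
  );

With the second structure, quoting a rate for a sixth or seventh day forward means adding rows rather than altering the table and the programs that use it.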
