DATABASE SYSTEMS (part 22)

Chapter 25 Distributed Databases and Client-Server Architectures

Selected Bibliography

Distributed database design has been addressed in terms of horizontal and vertical fragmentation, allocation, and replication. Ceri et al. (1982) defined the concept of minterm horizontal fragments. Ceri et al. (1983) developed an integer programming based optimization model for horizontal fragmentation and allocation. Navathe et al. (1984) developed algorithms for vertical fragmentation based on attribute affinity and showed a variety of contexts for vertical fragment allocation. Wilson and Navathe (1986) present an analytical model for optimal allocation of fragments. Elmasri et al. (1987) discuss fragmentation for the EER model; Karlapalem et al. (1994) discuss issues for distributed design of object databases. Navathe et al. (1996) discuss mixed fragmentation by combining horizontal and vertical fragmentation; Karlapalem et al. (1996) present a model for redesign of distributed databases.

Distributed query processing, optimization, and decomposition are discussed in Hevner and Yao (1979), Kerschberg et al. (1982), Apers et al. (1983), Ceri and Pelagatti (1984), and Bodorick et al. (1992). Bernstein and Goodman (1981) discuss the theory behind semijoin processing. Wong (1983) discusses the use of relationships in relation fragmentation.

Concurrency control and recovery schemes are discussed in Bernstein and Goodman (1981a). Kumar and Hsu (1998) have some articles related to recovery in distributed databases. Elections in distributed systems are discussed in Garcia-Molina (1982). Lamport (1978) discusses problems with generating unique timestamps in a distributed system. A concurrency control technique for replicated data that is based on voting is presented by Thomas (1979). Gifford (1979) proposes the use of weighted voting, and Paris (1986) describes a method called voting with witnesses. Jajodia and Mutchler (1990) discuss dynamic voting.
A technique called available copy is proposed by Bernstein and Goodman (1984), and one that uses the idea of a group is presented in El Abbadi and Toueg (1988). Other recent work that discusses replicated data includes Gladney (1989), Agrawal and El Abbadi (1990), El Abbadi and Toueg (1990), Kumar and Segev (1993), Mukkamala (1989), and Wolfson and Milo (1991). Bassiouni (1988) discusses optimistic protocols for DDB concurrency control. Garcia-Molina (1983) and Kumar and Stonebraker (1987) discuss techniques that use the semantics of the transactions. Distributed concurrency control techniques based on locking and distinguished copies are presented by Menasce et al. (1980) and Minoura and Wiederhold (1982). Obermark (1982) presents algorithms for distributed deadlock detection. A survey of recovery techniques in distributed systems is given by Kohler (1981). Reed (1983) discusses atomic actions on distributed data. A book edited by Bhargava (1987) presents various approaches and techniques for concurrency and reliability in distributed systems.

Federated database systems were first defined in McLeod and Heimbigner (1985). Techniques for schema integration in federated databases are presented by Elmasri et al. (1986), Batini et al. (1986), Hayne and Ram (1990), and Motro (1987). Elmagarmid and Helal (1988) and Gamal-Eldin et al. (1988) discuss the update problem in heterogeneous DDBSs. Heterogeneous distributed database issues are discussed in Hsiao and Kamel (1989). Sheth and Larson (1990) present an exhaustive survey of federated database management.

Recently, multidatabase systems and interoperability have become important topics. Techniques for dealing with semantic incompatibilities among multiple databases are examined in DeMichiel (1989), Siegel and Madnick (1991), Krishnamurthy et al. (1991), and Wang and Madnick (1989). Castano et al. (1998) present an excellent survey of techniques for analysis of schemas. Pitoura et al.
(1995) discuss object orientation in multidatabase systems. Transaction processing in multidatabases is discussed in Mehrotra et al. (1992), Georgakopoulos et al. (1991), Elmagarmid et al. (1990), and Breitbart et al. (1990), among others. Elmagarmid et al. (1992) discuss transaction processing for advanced applications, including engineering applications discussed in Heiler et al. (1992).

Workflow systems, which are becoming popular for managing information in complex organizations, use multilevel and nested transactions in conjunction with distributed databases. Weikum (1991) discusses multilevel transaction management. Alonso et al. (1997) discuss limitations of current workflow systems.

A number of experimental distributed DBMSs have been implemented. These include distributed INGRES (Epstein et al., 1978), DDTS (Devor and Weeldreyer, 1980), SDD-1 (Rothnie et al., 1980), System R* (Lindsay et al., 1984), SIRIUS-DELTA (Ferrier and Stangret, 1982), and MULTIBASE (Smith et al., 1981). The OMNIBASE system (Rusinkiewicz et al., 1988) and the Federated Information Base developed using the Candide data model (Navathe et al., 1994) are examples of federated DDBMSs. Pitoura et al. (1995) present a comparative survey of the federated database system prototypes. Most commercial DBMS vendors have products using the client-server approach and offer distributed versions of their systems.

Some system issues concerning client-server DBMS architectures are discussed in Carey et al. (1991), DeWitt et al. (1990), and Wang and Rowe (1991). Khoshafian et al. (1992) discuss design issues for relational DBMSs in the client-server environment. Client-server management issues are discussed in many books, such as Zantinge and Adriaans (1996).

PART 8 EMERGING TECHNOLOGIES

Chapter 26 XML and Internet Databases

We now turn our attention to how databases are used and accessed from the Internet.
Many electronic commerce (e-commerce) and other Internet applications provide Web interfaces to access information stored in one or more databases. These databases are often referred to as data sources. It is common to use two-tier and three-tier client-server architectures for Internet applications (see Section 2.5). In some cases, other variations of the client-server model are used. E-commerce and other Internet database applications are designed to interact with the user through Web interfaces that display Web pages. The common method of specifying the contents and formatting of Web pages is through the use of hyperlink documents. There are various languages for writing these documents, the most common being HTML (Hypertext Markup Language). Although HTML is widely used for formatting and structuring Web documents, it is not suitable for specifying structured data that is extracted from databases. Recently, a new language, namely XML (Extensible Markup Language), has emerged as the standard for structuring and exchanging data over the Web. XML can be used to provide information about the structure and meaning of the data in the Web pages rather than just specifying how the Web pages are formatted for display on the screen. The formatting aspects are specified separately, for example by using a formatting language such as XSL (Extensible Stylesheet Language).

This chapter describes the basics of accessing and exchanging information over the Internet. We start in Section 26.1 by discussing how traditional Web pages differ from structured databases, and discuss the differences between structured, semistructured, and unstructured data. Then in Section 26.2 we turn our attention to the XML standard and its tree-structured (hierarchical) data model. Section 26.3 discusses XML documents and the languages for specifying the structure of these documents, namely, XML DTD (Document Type Definition) and XML schema.
Section 26.4 presents the various approaches for storing XML documents, whether in their native (text) format, in a compressed form, or in relational and other types of databases. Section 26.5 gives an overview of the languages proposed for querying XML data. Section 26.6 summarizes the chapter.

26.1 STRUCTURED, SEMISTRUCTURED, AND UNSTRUCTURED DATA

The information stored in databases is known as structured data because it is represented in a strict format. For example, each record in a relational database table, such as the EMPLOYEE table in Figure 5.6, follows the same format as the other records in that table. For structured data, it is common to carefully design the database using techniques such as those described in Chapters 3, 4, 7, 10, and 11 in order to create the database schema. The DBMS then checks to ensure that all data follows the structures and constraints specified in the schema.

However, not all data is collected and inserted into carefully designed structured databases. In some applications, data is collected in an ad hoc manner before it is known how it will be stored and managed. This data may have a certain structure, but not all the information collected will have identical structure. Some attributes may be shared among the various entities, but other attributes may exist only in a few entities. Moreover, additional attributes can be introduced in some of the newer data items at any time, and there is no predefined schema. This type of data is known as semistructured data. A number of data models have been introduced for representing semistructured data, often based on using tree or graph data structures rather than the flat relational model structures.

A key difference between structured and semistructured data concerns how the schema constructs (such as the names of attributes, relationships, and entity types) are handled.
In semistructured data, the schema information is mixed in with the data values, since each data object can have different attributes that are not known in advance. Hence, this type of data is sometimes referred to as self-describing data. Consider the following example. We want to collect a list of bibliographic references related to a certain research project. Some of these may be books or technical reports, others may be research articles in journals or conference proceedings, and still others may refer to complete journal issues or conference proceedings. Clearly, each of these may have different attributes and different types of information. Even for the same type of reference (say, conference articles) we may have different information. For example, one article citation may be quite complete, with full information about author names, title, proceedings, page numbers, and so on, whereas another citation may not have all the information available. New types of bibliographic sources may appear in the future, for example references to Web pages or to conference tutorials, and these may have new attributes that describe them.

FIGURE 26.1 Representing semistructured data as a graph.

Semistructured data may be displayed as a directed graph, as shown in Figure 26.1. The information shown in Figure 26.1 corresponds to some of the structured data shown in Figure 5.6. As we can see, this model somewhat resembles the object model (see Figure 20.1) in its ability to represent complex objects and nested structures. In Figure 26.1, the labels or tags on the directed edges represent the schema names: the names of attributes, object types (or entity types or classes), and relationships. The internal nodes represent individual objects or composite attributes.
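As a rough illustration of the self-describing graph model discussed here, the following Python sketch encodes a small fragment of company project data as nested dictionaries and lists. Dictionary keys play the role of edge labels, so the schema names travel with the data, and two workers can carry different attributes. The helper name `leaf_values`, the label names (`Ssn`, `LastName`, `FirstName`, `Hours`), and the sample values are assumptions made for illustration only.

```python
# Hypothetical encoding of a small semistructured graph: dictionary keys act
# as edge labels (schema names), nested dicts/lists are internal nodes, and
# plain strings/numbers are leaf values.
company_projects = {
    "Project": [
        {
            "Name": "Product X",
            "Worker": [
                {"Ssn": "123456789", "LastName": "Smith", "Hours": 32.5},
                # A different worker may carry a different set of attributes.
                {"Ssn": "453453453", "FirstName": "Joyce", "Hours": 20.0},
            ],
        }
    ]
}


def leaf_values(node, path=()):
    """Yield (path_of_labels, value) for every leaf reachable from node."""
    if isinstance(node, dict):
        for label, child in node.items():
            yield from leaf_values(child, path + (label,))
    elif isinstance(node, list):
        for child in node:
            yield from leaf_values(child, path)
    else:
        yield path, node


leaves = list(leaf_values(company_projects))
for path, value in leaves:
    print("/".join(path), "=", value)
```

No schema is declared anywhere in this sketch: the only way to learn which attributes a worker has is to walk the data itself.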
The leaf nodes represent actual data values of simple (atomic) attributes.

There are two main differences between the semistructured model and the object model that we discussed in Chapter 20:

1. The schema information (names of attributes, relationships, and classes or object types) in the semistructured model is intermixed with the objects and their data values in the same data structure.
2. In the semistructured model, there is no requirement for a predefined schema to which the data objects must conform.

In addition to structured and semistructured data, a third category exists, known as unstructured data because there is very limited indication of the type of data. A typical example is a text document that contains information embedded within it. Web pages in HTML that contain some data are considered to be unstructured data. Consider part of an HTML file, shown in Figure 26.2. Text that appears between angled brackets, < >, is an HTML tag. A tag with a slash, </...>, indicates an end tag, which represents the ending of the effect of a matching start tag. The tags mark up the document¹ in order to instruct an HTML processor how to display the text between a start tag and a matching end tag. Hence, the tags specify document formatting rather than the meaning of the various data elements in the document. HTML tags specify information such as font size and style (boldface, italics, and so on), color, heading levels in documents, and so on. Some tags provide text structuring in documents, such as specifying a numbered or unnumbered list or a table. Even these structuring tags specify that the embedded textual data is to be displayed in a certain manner, rather than indicating the type of data represented in the table.

¹ That is why it is known as Hypertext Markup Language.

    <html>
    <head>
    </head>
    <body>
    <H1>List of company projects and the employees in each project</H1>
    <H2>The ProductX project:</H2>
    <table width="100%" border=0 cellpadding=0 cellspacing=0>
    <TR>
    <TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
    <TD>32.5 hours per week</TD>
    </TR>
    <TR>
    <TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
    <TD>20.0 hours per week</TD>
    </TR>
    </table>
    <H2>The ProductY project:</H2>
    <table width="100%" border=0 cellpadding=0 cellspacing=0>
    <TR>
    <TD width="50%"><font size="2" face="Arial">John Smith:</font></TD>
    <TD>7.5 hours per week</TD>
    </TR>
    <TR>
    <TD width="50%"><font size="2" face="Arial">Joyce English:</font></TD>
    <TD>20.0 hours per week</TD>
    </TR>
    <TR>
    <TD width="50%"><font size="2" face="Arial">Franklin Wong:</font></TD>
    <TD>10.0 hours per week</TD>
    </TR>
    </table>
    </body>
    </html>

FIGURE 26.2 Part of an HTML document representing unstructured data.

HTML uses a large number of predefined tags, which are used to specify a variety of commands for formatting Web documents for display. The start and end tags specify the range of text to be formatted by each command. A few examples of the tags shown in Figure 26.2 follow:

• The <html> </html> tags specify the boundaries of the document.
• The document header information, within the <head> </head> tags, specifies various commands that will be used elsewhere in the document. For example, it may specify various script functions in a language such as JavaScript or Perl, or certain formatting styles (fonts, paragraph styles, header styles, and so on) that can be used in the document. It can also specify a title to indicate what the HTML file is for, and other similar information that will not be displayed as part of the document.
• The body of the document, specified within the <body> </body> tags, includes the document text and the markup tags that specify how the text is to be formatted and displayed. It can also include references to other objects, such as images, videos, voice messages, and other documents.
• The <H1> </H1> tags specify that the text is to be displayed as a level 1 heading. There are many heading levels (<H2>, <H3>, and so on), each displaying text in a less prominent heading format.
• The <table> </table> tags specify that the following text is to be displayed as a table. Each row in the table is enclosed within <TR> </TR> tags, and the actual text data in a row is displayed within <TD> </TD> tags.²
• Some tags may have attributes, which appear within the start tag and describe additional properties of the tag.³ In Figure 26.2, the <table> start tag has four attributes describing various characteristics of the table. The following <TD> and <font> start tags have one and two attributes, respectively.

² <TR> stands for table row, and <TD> for table data.
³ This is how the term attribute is used in document markup languages, which differs from how it is used in database models.

HTML has a very large number of predefined tags, and whole books are devoted to describing how to use these tags. If designed properly, HTML documents can be formatted so that humans are able to easily understand the document contents, and are able to navigate through the resulting Web documents. However, the source HTML text documents are very difficult to interpret automatically by computer programs because they do not include schema information about the type of data in the documents. As e-commerce and other Internet applications become increasingly automated, it is becoming crucial to be able to exchange Web documents among various computer sites and to interpret their contents automatically. This need was one of the reasons that led to the development of XML, which we discuss in the next section.

26.2 XML HIERARCHICAL (TREE) DATA MODEL

We now introduce the data model used in XML. The basic object in XML is the XML document. Two main structuring concepts are used to construct an XML document: elements and attributes.
It is important to note right away that the term attribute in XML is not used in the same manner as is customary in database terminology, but rather as it is used in document description languages such as HTML and SGML.⁴ Attributes in XML provide additional information that describes elements, as we shall see. There are additional concepts in XML, such as entities, identifiers, and references, but we first concentrate on describing elements and attributes to show the essence of the XML model.

Figure 26.3 shows an example of an XML element called <projects>. As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets < >, and end tags are further identified by a slash, </...>.⁵ Complex elements are constructed from other elements hierarchically, whereas simple elements contain data values. A major difference between XML and HTML is that XML tag names are defined to describe the meaning of the data elements in the document, rather than to describe how the text is to be displayed. This makes it possible to process the data elements in the XML document automatically by computer programs.

It is straightforward to see the correspondence between the XML textual representation shown in Figure 26.3 and the tree structure shown in Figure 26.1. In the tree representation, internal nodes represent complex elements, whereas leaf nodes represent simple elements. That is why the XML model is called a tree model or a hierarchical model. In Figure 26.3, the simple elements are the ones with the tag names <Name>, <Number>, <Location>, <DeptNo>, <SSN>, <LastName>, <FirstName>, and <hours>. The complex elements are the ones with the tag names <projects>, <project>, and <Worker>. In general, there is no limit on the levels of nesting of elements.
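The distinction between complex and simple elements can be sketched with Python's standard `xml.etree.ElementTree` module. The snippet below parses an abbreviated `<projects>` element (a shortened, assumed sample in the style of the chapter's example) and classifies each tag as complex if it has child elements, or simple if it holds only a data value. The helper name `classify` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample document, modeled on a <projects> element.
XML_TEXT = """<?xml version="1.0" standalone="yes"?>
<projects>
  <project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Worker>
      <SSN>123456789</SSN>
      <LastName>Smith</LastName>
      <hours>32.5</hours>
    </Worker>
  </project>
</projects>"""


def classify(root):
    """Return (complex_tags, simple_tags) for the tree rooted at root."""
    complex_tags, simple_tags = set(), set()
    for elem in root.iter():      # visits root and all of its descendants
        if len(elem) > 0:         # len(elem) is the number of child elements
            complex_tags.add(elem.tag)
        else:
            simple_tags.add(elem.tag)
    return complex_tags, simple_tags


root = ET.fromstring(XML_TEXT)
complex_tags, simple_tags = classify(root)
print("complex:", sorted(complex_tags))
print("simple:", sorted(simple_tags))
```

Because the tag names describe what the data means, a program can pick out, say, every `<hours>` value without knowing anything about how the document would be displayed.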
In general, it is possible to characterize three main types of XML documents:

• Data-centric XML documents: These documents have many small data items that follow a specific structure and hence may be extracted from a structured database. They are formatted as XML documents in order to exchange them or display them over the Web.
• Document-centric XML documents: These are documents with large amounts of text, such as news articles or books. There are few or no structured data elements in these documents.
• Hybrid XML documents: These documents may have parts that contain structured data and other parts that are predominantly textual or unstructured.

It is important to note that data-centric XML documents can be considered either as semistructured data or as structured data. If an XML document conforms to a predefined XML schema or DTD (see Section 26.3), then the document can be considered as structured data. On the other hand, XML allows documents that do not conform to any schema, and these would be considered as semistructured data. The latter are also known as schemaless XML documents. When the value of the standalone attribute in an XML document is "yes", as in the first line of Figure 26.3, the document is standalone and schemaless.

⁴ SGML (Standard Generalized Markup Language) is a more general language for describing documents and provides capabilities for specifying new tags. However, it is more complex than HTML and XML.
⁵ The left and right angled bracket characters (< and >) are reserved characters, as are the ampersand (&), apostrophe ('), and double quotation mark ("). To include them within the text of a document, they must be encoded as &lt;, &gt;, &amp;, &apos;, and &quot;, respectively.

    <?xml version="1.0" standalone="yes"?>
    <projects>
      <project>
        <Name>ProductX</Name>
        <Number>1</Number>
        <Location>Bellaire</Location>
        <DeptNo>5</DeptNo>
        <Worker>
          <SSN>123456789</SSN>
          <LastName>Smith</LastName>
          <hours>32.5</hours>
        </Worker>
        <Worker>
          <SSN>453453453</SSN>
          <FirstName>Joyce</FirstName>
          <hours>20.0</hours>
        </Worker>
      </project>
      <project>
        <Name>ProductY</Name>
        <Number>2</Number>
        <Location>Sugarland</Location>
        <DeptNo>5</DeptNo>
        <Worker>
          <SSN>123456789</SSN>
          <hours>7.5</hours>
        </Worker>
        <Worker>
          <SSN>453453453</SSN>
          <hours>20.0</hours>
        </Worker>
        <Worker>
          <SSN>333445555</SSN>
          <hours>10.0</hours>
        </Worker>
      </project>
    </projects>

FIGURE 26.3 A complex XML element called <projects>.

XML attributes are generally used in a manner similar to how they are used in HTML (see Figure 26.2), namely, to describe properties and characteristics of the elements (tags) within which they appear. It is also possible to use XML attributes to hold the values of [...]

... the traditional database models based on flat files (relational model) and graph representations (ER model).

26.4.2 Extracting XML Documents from Relational Databases

This section discusses the representational issues that arise when converting data from a database system into XML documents. As we have discussed, XML uses a hierarchical (tree) model to represent documents. The database systems with the...
... the UNIVERSITY database. The data needed for these documents is contained in the database attributes of the entity types COURSE, SECTION, and STUDENT from Figure 26.6, and the relationships S-S and C-S between them. In general, most documents extracted from a database will only use a subset of the attributes, entity types, and relationships in the database. In this example, the subset of the database that...

... corresponding to the COMPANY database shown in Figures 3.2 and 5.5. Although it is unlikely that we would want to display the whole database as a single document, there have been proposals to store data in native XML format as an alternative to storing the data in relational databases. The schema in Figure 26.5 would serve the purpose of specifying the structure of the COMPANY database if it were stored...

... diagram for a simplified UNIVERSITY database.

FIGURE 26.7 Subset of the UNIVERSITY database schema needed for XML document extraction.

At least three possible document hierarchies can be extracted from the database subset in Figure 26.7. First, we can choose COURSE as the root, as illustrated in Figure...

... and databases. To date, it is not well integrated with database management systems. We will briefly review the state of the art of this rather extensive field of data mining, which uses techniques from such areas as machine learning, statistics, neural networks, and genetic algorithms. We will highlight the nature of the information that is discovered, the types of problems faced when trying to mine databases,...
FIGURE 26.13 Converting a graph with cycles into a hierarchical (tree) structure.

26.4.4 Other Steps for Extracting XML Documents from Databases

In addition to creating the appropriate XML hierarchy and corresponding XML schema document, several other steps are needed to extract a particular XML document from a database:

1. It is necessary...

Exercises

26.7 Create an XML instance document to correspond to the data stored in the relational database shown in Figure 5.6 such that the XML document conforms to the XML schema document in Figure 26.5.
26.8 Create XML schema documents to correspond to the hierarchies shown in Figures 26.12 and 26.13 part (3).
26.9 Consider the LIBRARY relational database schema of Figure...

... many organizations have generated a large amount of machine-readable data in the form of files and databases. To process this data, we have the database technology available that supports query languages like SQL. The problem with SQL is that it is a structured language that assumes the user is aware of the database schema. SQL supports operations of relational algebra that allow a user to select rows and...

... XML documents from relational databases and storing XML documents.

26.4 XML DOCUMENTS AND DATABASES

We now discuss how various types of XML documents can be stored and retrieved. Section 26.4.1 gives an overview of the various approaches for storing XML documents. Section 26.4.2 discusses one of these approaches, in which data-centric XML documents are extracted from existing databases, in more detail. In...
... databases: Because there are enormous amounts of data already stored in relational databases, parts of this data may need to be formatted as documents for exchanging or displaying over the Web. This approach would use a separate middleware software layer to handle the conversions needed between the XML documents and the relational database. All four of these approaches have received considerable attention over ...
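The middleware idea mentioned above, converting relational rows into an XML document, can be sketched as follows. This is only a minimal illustration using Python's standard `sqlite3` and `xml.etree.ElementTree` modules as stand-ins for the database and the conversion layer. The table name `WORKS_ON`, its columns, the sample rows, and the function name `rows_to_xml` are all hypothetical, and the sketch is not the specific middleware the text has in mind.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn):
    """Build a <projects> tree from a flat (hypothetical) WORKS_ON table."""
    root = ET.Element("projects")
    projects = {}  # project name -> its <project> element
    cursor = conn.execute(
        "SELECT pname, ssn, hours FROM WORKS_ON ORDER BY pname, ssn"
    )
    for pname, ssn, hours in cursor:
        project = projects.get(pname)
        if project is None:
            # First row seen for this project: create its complex element.
            project = ET.SubElement(root, "project")
            ET.SubElement(project, "Name").text = pname
            projects[pname] = project
        worker = ET.SubElement(project, "Worker")
        ET.SubElement(worker, "SSN").text = ssn
        ET.SubElement(worker, "hours").text = str(hours)
    return root


# In-memory sample data, assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WORKS_ON (pname TEXT, ssn TEXT, hours REAL)")
conn.executemany(
    "INSERT INTO WORKS_ON VALUES (?, ?, ?)",
    [
        ("ProductX", "123456789", 32.5),
        ("ProductX", "453453453", 20.0),
        ("ProductY", "123456789", 7.5),
    ],
)
xml_text = ET.tostring(rows_to_xml(conn), encoding="unicode")
print(xml_text)
```

The grouping step is where the flat relational rows are turned into a hierarchy: choosing the project as the parent of its workers is one design choice among several, and a different root would yield a different document from the same rows.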