Trees, Hierarchies, and Graphs


CHAPTER 12

Trees, Hierarchies, and Graphs

Although at times it may seem chaotic, the world around us is filled with structure and order. The universe itself is hierarchical in nature, made up of galaxies, stars, and planets. One of the natural hierarchies here on earth is the food chain that exists in the wild; a lion can certainly eat a zebra, but alas, a zebra will probably never dine on lion flesh. And of course, we’re all familiar with corporate management hierarchies, which some companies try to kill off in favor of matrixes, which are not hierarchical at all . . . but more on that later!

We strive to describe our existence based on connections between entities (or the lack thereof), and that’s what trees, hierarchies, and graphs help us do at the mathematical and data levels. The majority of databases are at least mostly hierarchical, with a central table or set of tables at the root, and all other tables branching from there via foreign key references. However, sometimes the database hierarchy needs to be designed at a more granular level, representing the hierarchical relationship between records contained within a single table. For example, you wouldn’t design a management database that required one table per employee in order to support the hierarchy. Rather, you’d put all of the employees into a single table and create references between the rows.

This chapter discusses three different approaches for working with these intra-table hierarchies and graphs in SQL Server 2008, as follows:

• Adjacency lists
• Materialized paths
• The hierarchyid datatype

Each of these techniques has its own virtues depending on the situation. I will describe each technique individually and compare how it can be used to query and manage your hierarchical data.

Terminology: Everything Is a Graph

Mathematically speaking, trees and hierarchies are both different types of graphs. A graph is defined as a set of nodes (or vertices) connected by edges.
The edges in a graph can be further classified as directed or undirected, meaning that they can be traversed in one direction only (directed) or in both directions (undirected). If all of the edges in a graph are directed, the graph itself is said to be directed (sometimes referred to as a digraph). Graphs can also have cycles: sets of nodes and edges that, when traversed in order, bring you back to the same initial node. A graph without cycles is called an acyclic graph. Figure 12-1 shows some simple examples of the basic types of graphs.

Figure 12-1. Undirected, directed, undirected cyclic, and directed acyclic graphs

The most immediately recognizable example of a graph is a street map. Each intersection can be thought of as a node, and each street an edge. One-way streets are directed edges, and if you drive around the block, you’ve illustrated a cycle. Therefore, a street system can be said to be a cyclic, directed graph. In the manufacturing world, a common graph structure is a bill of materials, or parts explosion, which describes all of the necessary component parts of a given product. And in software development, we typically work with class and object graphs, which form the relationships between the component parts of an object-oriented system.

A tree is defined as an undirected, acyclic graph in which exactly one path exists between any two nodes. Figure 12-2 shows a simple tree.

Figure 12-2. Exactly one path exists between any two nodes in a tree.

Note: Borrowing from the same agrarian terminology from which the term tree is derived, we can refer to multiple trees as a forest.

A hierarchy is a special kind of tree, and it is probably the most common graph structure that developers need to work with. It has all of the qualities of a tree but is also directed and rooted. This means that a certain node is designated as the root, and all other nodes are said to be subordinates (or descendants) of that node.
In addition, each nonroot node must have exactly one parent node, that is, a node that directs into it. Multiple parents are not allowed, nor are multiple root nodes. Hierarchies are extremely common when it comes to describing most business relationships; manager/employee, contractor/subcontractor, and firm/division associations all come to mind. Figure 12-3 shows a hierarchy containing a root node and several levels of subordinates.

Figure 12-3. A hierarchy must have exactly one root node, and each nonroot node must have exactly one parent.

The parent/child relationships found in hierarchies are often classified more formally using the terms ancestor and descendant, although this terminology can get a bit awkward in software development settings. Another important term is siblings, which describes nodes that share the same parent. Other terms used to describe familial relationships are also routinely applied to trees and hierarchies, but I’ve personally found that it can get confusing trying to figure out which node is the cousin of another, and so have abandoned most of this extended terminology.

The Basics: Adjacency Lists and Graphs

The most common graph data model is called an adjacency list. In an adjacency list, the graph is modeled as pairs of nodes, each representing an edge. This is an extremely flexible way of modeling a graph; any kind of graph, hierarchy, or tree can fit into this model. However, it can be problematic from the perspectives of query complexity, performance, and data integrity. In this section, I will show you how to work with adjacency lists and point out some of the issues that you should be wary of when designing solutions around them.
The simplest of graph tables contains only two columns, X and Y:

    CREATE TABLE Edges
    (
        X int NOT NULL,
        Y int NOT NULL,
        PRIMARY KEY (X, Y)
    );
    GO

The combination of columns X and Y constitutes the primary key, and each row in the table represents one edge in the graph. Note that X and Y are assumed to be references to some valid table of nodes. This table only represents the edges that connect the nodes. It can also be used to reference unconnected nodes; a node with a path back to itself but no other paths can be inserted into the table for that purpose.

Note: When modeling unconnected nodes, some data architects prefer to use a nullable Y column rather than having both columns point to the same node. The net effect is the same, but in my opinion the nullable Y column makes some queries a bit messier, as you’ll be forced to deal with the possibility of a NULL. The examples in this chapter, therefore, do not follow that convention, but you can use either approach in your production applications.

Constraining the Edges

As-is, the Edges table can be used to represent any graph, but semantics are important, and none are implied by the current structure. It’s difficult to know whether each edge is directed or undirected. Traversing the graph, one could conceivably go either way, so the following two rows may or may not be logically identical:

    INSERT INTO Edges VALUES (1, 2);
    INSERT INTO Edges VALUES (2, 1);

If the edges in this graph are supposed to be directed, there is no problem. If you need both directions for a certain edge, simply insert both rows; for an edge that runs in one direction only, insert a single row. If, on the other hand, all edges are supposed to be undirected, a constraint is necessary in order to ensure that two logically identical paths cannot be inserted. The primary key is clearly not sufficient to enforce this constraint, since it treats every combination as unique.
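To make the gap in the primary key concrete, here is a minimal sketch that loads both orientations of one edge and then lists the mirrored pairs. The table variable @Edges and the aliases e1 and e2 are assumptions made for this illustration; the table variable stands in for the Edges table so that the sketch does not disturb any existing rows.

```sql
-- Hypothetical stand-in for the Edges table, so the sketch is self-contained
DECLARE @Edges TABLE
(
    X int NOT NULL,
    Y int NOT NULL,
    PRIMARY KEY (X, Y)
);

-- All three rows are accepted: the key treats (1, 2) and (2, 1) as distinct
INSERT INTO @Edges VALUES (1, 2), (2, 1), (1, 3);

-- List every edge whose mirror image is also stored.
-- The e1.X < e1.Y filter reports each mirrored pair only once.
SELECT e1.X, e1.Y
FROM @Edges e1
JOIN @Edges e2 ON
    e2.X = e1.Y
    AND e2.Y = e1.X
WHERE e1.X < e1.Y;
```

For the three rows loaded here, only the edge between nodes 1 and 2 is stored in both directions, so the query reports that single pair. A check of this kind is useful for auditing existing data, but it does not prevent new duplicates from being inserted.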
The most obvious solution to this problem is to create a trigger that checks the rows when inserts or updates take place. Since the primary key already enforces that duplicate directional paths cannot be inserted, the trigger must only check for the opposite path. Before creating the trigger, empty the Edges table so that it no longer contains the duplicate undirected edges just inserted:

    TRUNCATE TABLE Edges;
    GO

Then create the trigger that will check rows as they are inserted or updated, as follows:

    CREATE TRIGGER CheckForDuplicates
    ON Edges
    FOR INSERT, UPDATE
    AS
    BEGIN
        IF EXISTS
        (
            SELECT *
            FROM Edges e
            WHERE EXISTS
            (
                SELECT *
                FROM inserted i
                WHERE
                    i.X = e.Y
                    AND i.Y = e.X
            )
        )
        BEGIN
            ROLLBACK;
        END
    END;
    GO

Attempting to reinsert the two rows listed previously will now cause the trigger to end the transaction and issue a rollback of the second row, preventing the duplicate edge from being created.

A slightly cleverer way of constraining the uniqueness of the paths is to make use of an indexed view. You can take advantage of the fact that an indexed view has a unique index, using it as a constraint in cases like this where a trigger seems awkward. In order to create the indexed view, you will need a numbers table (also called a tally table) with a single column, Number, which is the primary key. The following code listing creates such a table, populated with every number between 1 and 8000:

    SELECT TOP (8000)
        IDENTITY(int, 1, 1) AS Number
    INTO Numbers
    FROM master..spt_values a
    CROSS JOIN master..spt_values b;

    ALTER TABLE Numbers
    ADD PRIMARY KEY (Number);
    GO

Note: We won’t actually need all 8,000 rows in the Numbers table (in fact, the solution described here requires only two distinct rows), but there are lots of other scenarios where you might need a larger table of numbers, so it doesn’t do any harm to prime the table with additional rows now.
The master..spt_values table is an arbitrary system table chosen simply because it has enough rows that, when cross-joined with itself, the output will be more than 8,000 rows.

A table of numbers is incredibly useful in many cases in which you might need to do interrow manipulation and look-ahead logic, especially when dealing with strings. However, in this case, its utility is fairly simple: a CROSS JOIN to the Numbers table, combined with a WHERE condition, will result in an output containing two rows for each row in the Edges table. A CASE expression will then be used to swap the X and Y column values, reversing the path direction, for one of the rows in each duplicate pair. The following view encapsulates this logic. Because the view is schema-bound, the tables must be referenced by two-part names; the dbo schema is assumed here, as it is elsewhere in this chapter:

    CREATE VIEW DuplicateEdges
    WITH SCHEMABINDING
    AS
        SELECT
            CASE n.Number
                WHEN 1 THEN e.X
                ELSE e.Y
            END X,
            CASE n.Number
                WHEN 1 THEN e.Y
                ELSE e.X
            END Y
        FROM dbo.Edges e
        CROSS JOIN dbo.Numbers n
        WHERE
            n.Number BETWEEN 1 AND 2;
    GO

Once the view has been created, it can be indexed in order to constrain against duplicate paths:

    CREATE UNIQUE CLUSTERED INDEX IX_NoDuplicates
    ON DuplicateEdges (X, Y);
    GO

Since the view logically contains both paths as they were inserted into the table, as well as the reverse paths, the unique index serves to constrain against duplication. Both techniques have similar performance characteristics, but there is admittedly a certain cool factor with the indexed view. It can also double as a quick lookup for finding all paths in a directed notation.

Note: Once you have chosen either the trigger or the indexed view approach to prevent duplicate edges, be sure to delete all rows from the Edges table again before executing any of the remaining code listings in this chapter.

Basic Graph Queries: Who Am I Connected To?

Before traversing the graph to answer questions, it’s again important to discuss the differences between directed and undirected edges and the way in which they are modeled.
Figure 12-4 shows two graphs: I is undirected and J is directed.

Figure 12-4. Directed and undirected graphs have different connection qualities.

The following node pairs can be used to represent the edges whether the Edges table is considered to be directed or undirected:

    INSERT INTO Edges VALUES (2, 1), (1, 3);
    GO

Now we can answer a simple question: starting at a specific node, what nodes can we traverse to? In the case of a directed graph, any node Y is accessible from another node X if an edge exists that starts at X and ends at Y. This is easy enough to represent as a query (in this case, starting at node 1):

    SELECT Y
    FROM Edges e
    WHERE X = 1;

For an undirected graph, things get a bit more complex because any given edge between two nodes can be traversed in either direction. In that case, any node Y is accessible from another node X if an edge is represented as either starting at X and ending at Y, or the other way around. We need to consider all edges for which the starting node is either the start or the endpoint, or else the graph has effectively become directed. To find all nodes accessible from node 1 now requires a bit more code:

    SELECT
        CASE
            WHEN X = 1 THEN Y
            ELSE X
        END
    FROM Edges e
    WHERE
        X = 1
        OR Y = 1;

Aside from the increased complexity of this code, there’s another much more important issue: performance on larger sets will start to suffer, because a search argument that relies on two columns with an OR condition cannot be satisfied by a single index seek.
The problem can be fixed to some degree by creating multiple indexes (one with each column as the leading key) and using a UNION ALL query, as follows:

    SELECT Y
    FROM Edges e
    WHERE X = 1

    UNION ALL

    SELECT X
    FROM Edges e
    WHERE Y = 1;

This code is somewhat unintuitive, and because both indexes must be maintained and the query must do two index operations to be satisfied, performance will still suffer compared with querying the directed graph. For that reason, I recommend generally modeling graphs as directed and dealing with inserting both pairs of edges unless there is a compelling reason not to, such as an extremely large undirected graph where the extra edge combinations would challenge the server’s available disk space. The remainder of the examples in this chapter will assume that the graph is directed.

Traversing the Graph

Finding out which nodes a given node is directly connected to is a good start, but in order to answer questions about the structure of the underlying data, the graph must be traversed. For this section, a more rigorous example data set is necessary. Figure 12-5 shows an initial sample graph representing an abbreviated portion of a street map for an unnamed city.

Figure 12-5. An abbreviated street map

A few tables are required to represent this map. To begin with, a table of streets:

    CREATE TABLE Streets
    (
        StreetId int NOT NULL PRIMARY KEY,
        StreetName varchar(75)
    );
    GO

    INSERT INTO Streets VALUES
        (1, '1st Ave'),
        (2, '2nd Ave'),
        (3, '3rd Ave'),
        (4, '4th Ave'),
        (5, 'Madison');
    GO

Each street is assigned a surrogate key so that it can be referenced easily in other tables. The next requirement is a table of intersections, which are the nodes in the graph.
This table creates a key for each intersection, which is defined in this set of data as a collection of one or more streets:

    CREATE TABLE Intersections
    (
        IntersectionId int NOT NULL PRIMARY KEY,
        IntersectionName varchar(10)
    );
    GO

    INSERT INTO Intersections VALUES
        (1, 'A'),
        (2, 'B'),
        (3, 'C'),
        (4, 'D');
    GO

Next is a table called IntersectionStreets, which maps streets to their respective intersections. Note that I haven’t included any constraints on this table, as they can get quite complex. One constraint that might be ideal would specify that any given combination of streets should not intersect more than once. However, it’s difficult to say whether this would apply to all cities, given that many older cities have twisting roads that may intersect with each other at numerous points. Dealing with this issue is left as an exercise for you to try on your own.

    CREATE TABLE IntersectionStreets
    (
        IntersectionId int NOT NULL
            REFERENCES Intersections (IntersectionId),
        StreetId int NOT NULL
            REFERENCES Streets (StreetId),
        PRIMARY KEY (IntersectionId, StreetId)
    );
    GO

    INSERT INTO IntersectionStreets VALUES
        (1, 1), (1, 5),
        (2, 2), (2, 5),
        (3, 3), (3, 5),
        (4, 4), (4, 5);
    GO

The final table describes the edges of the graph, which in this case are segments of street between each intersection. I’ve added a couple of constraints that might not be so obvious at first glance:

Rather than using foreign keys to the Intersections table, the StreetSegments table references the IntersectionStreets table for both the starting point and ending point. In both cases, the street is also included in the key. The purpose of this is so that you can’t start on one street and magically end up on another street or at an intersection that’s not even on the street you started on.

The CK_Intersections constraint ensures that the two intersections are actually different, so you can’t start at one intersection and end up at the same place after only one move.
It’s theoretically possible that a circular street could intersect another street at only one point, in which case traveling the entire length of the street could get you back to where you started. However, doing so would clearly not help you traverse through the graph to a destination, which is the situation currently being considered.

Here’s the T-SQL to create the street segments that constitute the edges of the graph:

    CREATE TABLE StreetSegments
    (
        IntersectionId_Start int NOT NULL,
        IntersectionId_End int NOT NULL,
        StreetId int NOT NULL,
        CONSTRAINT FK_Start
            FOREIGN KEY (IntersectionId_Start, StreetId)
            REFERENCES IntersectionStreets (IntersectionId, StreetId),
        CONSTRAINT FK_End
            FOREIGN KEY (IntersectionId_End, StreetId)
            REFERENCES IntersectionStreets (IntersectionId, StreetId),
        CONSTRAINT CK_Intersections
            CHECK (IntersectionId_Start <> IntersectionId_End),
        CONSTRAINT PK_StreetSegments
            PRIMARY KEY (IntersectionId_Start, IntersectionId_End)
    );
    GO

    INSERT INTO StreetSegments VALUES
        (1, 2, 5),
        (2, 3, 5),
        (3, 4, 5);
    GO

In addition to these tables, a helper function is useful in order to make navigation easier. The GetIntersectionId function returns the intersection at which the two input streets intersect. As mentioned before, the schema used in this example assumes that each street intersects only once with any other street, and the GetIntersectionId function makes the same assumption. It works by searching for all intersections that the input streets participate in, and then finding the one that has exactly two matches, meaning that both input streets intersect there.
Following is the T-SQL for the function:

    CREATE FUNCTION GetIntersectionId
    (
        @Street1 varchar(75),
        @Street2 varchar(75)
    )
    RETURNS int
    WITH SCHEMABINDING
    AS
    BEGIN
        RETURN
        (
            SELECT i.IntersectionId
            FROM dbo.IntersectionStreets i
            WHERE StreetId IN
            (
                SELECT StreetId
                FROM dbo.Streets
                WHERE StreetName IN (@Street1, @Street2)
            )
            GROUP BY i.IntersectionId
            HAVING COUNT(*) = 2
        )
    END;
    GO

Using the schema and the function, we can start traversing the nodes. The basic technique of traversing the graph is quite simple: find the starting intersection and all nodes that it connects to, and iteratively or recursively move outward, using the previous node’s ending point as the starting point for the next. This is easily accomplished using a recursive common table expression (CTE). The following is a simple initial example of a CTE that can be used to traverse the nodes from Madison and 1st Avenue to Madison and 4th Avenue:

    DECLARE
        @Start int = dbo.GetIntersectionId('Madison', '1st Ave'),
        @End int = dbo.GetIntersectionId('Madison', '4th Ave');

    WITH Paths

[...]

    ... + '/'
            AS varchar(255)
        )
        FROM Paths p
        JOIN dbo.StreetSegments ss ON ss.IntersectionId_Start = p.theEnd
        WHERE
            p.theEnd <> @End
            AND p.thePath NOT LIKE '%/' + CONVERT(varchar, ss.IntersectionId_End) + '/%'
    )
    SELECT *
    FROM Paths;
    GO

This concludes this chapter’s coverage on general graphs. The remainder of the chapter deals with modeling and querying of hierarchies...

... this data is not easy to work with and requires a lot of cleanup to get it to the point where it can be easily queried.

Adjacency List Hierarchies

As mentioned previously, any kind of graph can be modeled using an adjacency list. This of course includes hierarchies, which are nothing more than rooted, directed, acyclic graphs with exactly one path between...
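As a rough illustration of a hierarchy held in an adjacency list, the sketch below stores each node’s parent reference in the same row and walks down from the root with a recursive CTE. The @Employee table variable, its four rows, and the Depth column are assumptions made for this example only; they are not the chapter’s sample data, and the query is not the chapter’s elided listing.

```sql
-- Hypothetical hierarchy: each row points at its parent (manager) row
DECLARE @Employee TABLE
(
    EmployeeID int NOT NULL PRIMARY KEY,
    ManagerID int NULL   -- NULL marks the root node
);

INSERT INTO @Employee VALUES
    (1, NULL),  -- root
    (2, 1),
    (3, 1),
    (4, 2);

WITH n AS
(
    -- Anchor member: start at the root
    SELECT EmployeeID, ManagerID, 0 AS Depth
    FROM @Employee
    WHERE ManagerID IS NULL

    UNION ALL

    -- Recursive member: find the direct subordinates of the previous level
    SELECT e.EmployeeID, e.ManagerID, n.Depth + 1
    FROM @Employee e
    JOIN n ON n.EmployeeID = e.ManagerID
)
SELECT EmployeeID, ManagerID, Depth
FROM n
ORDER BY Depth, EmployeeID;
```

With these four rows, the root comes back at depth 0, employees 2 and 3 at depth 1, and employee 4 at depth 2. Keeping the parent reference in the row itself is what makes the "exactly one parent" rule easy to hold: a single ManagerID column cannot name two managers.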
... to encapsulate the code in a multistatement table-valued UDF to allow greater potential for reuse.

Note: If you’re following along with the examples in this chapter and you increased the number of rows in the Employee_Temp table, you should drop and re-create it before continuing with the rest of the chapter.

Traversing up the Hierarchy

For an adjacency list,...

... already has a couple of constraints that help guard against certain issues: a primary key and a self-referencing foreign key. The primary key, which is on the EmployeeID column, guards against most cycles by making it impossible for a given employee to have more than one manager. And the self-referencing foreign key guards against most forest issues because every...

    ... et.EmployeeID
        WHERE
            et.ManagerID IS NOT NULL
            AND e.ManagerID <> e.EmployeeID
    )
    SELECT @CycleExists = 1
    FROM e
    WHERE e.ManagerID = e.EmployeeID;

    IF @CycleExists = 1
    BEGIN
        RAISERROR('The update introduced a cycle', 16, 1);
        ROLLBACK;
    END
    END
    GO

This type of cycle can only be caused by either updates or multirow inserts, and in virtually all of the hierarchies I’ve...

... DECLARE statement that assigns the @Start and @End variables to be as follows:

    DECLARE
        @Start int = dbo.GetIntersectionId('Madison', '1st Ave'),
        @End int = dbo.GetIntersectionId('Lexington', '1st Ave');

Having made these changes, the output of the CTE query is now as follows:

    theStart  theEnd
    1         2
    2         3
    2         6
    6         5
    3         4
    4         8
    8         7
    7         6
    6         5

There are now two paths from the...
... outer query, but this is not the only way to write this query. The query could also be written such that the CTE uses and returns only the EmployeeID column, necessitating an additional JOIN in the outer query to get the other columns:

    WITH n AS
    (
        SELECT EmployeeID
        FROM Employee_Temp
        WHERE ManagerID IS NULL

        UNION ALL

        SELECT e.EmployeeID
        FROM Employee_Temp...

... Title, requires a bit of manipulation to the path. Instead of materializing the EmployeeID, materialize a row number that represents the current ordered sibling. This can be done using SQL Server’s ROW_NUMBER function, and is sometimes referred to as enumerating the path. The following modified version of the CTE enumerates the path:

    WITH n AS
    (
        SELECT
            EmployeeID,
            ManagerID,...

... Engineering Manager (EmployeeID 3) starts with EmployeeID 109 and continues to EmployeeID 12 before getting to the Engineering Manager. Looking at the same column using the enumerated path, it is not possible to discover the actual IDs that make up a given path without following it back up the hierarchy in the output.

Are CTEs the Best Choice?

While CTEs are possibly...

... becoming an additional root node.

Once this code has been run, the Employee_Temp hierarchy will have 9,249 nodes, instead of the 290 that we started with. However, the hierarchy still has only five levels. To increase the depth, a slightly different algorithm is required. To add levels, find all managers except the CEO, and insert new duplicate nodes, incrementing...

Date posted: 05/10/2013, 08:48
