Microsoft SQL Server 2008 R2 Unleashed - P128


CHAPTER 35 Understanding Query Optimization

Identifying Search Arguments

A SARG is defined as a WHERE clause that compares a column to a constant. The format of a SARG is as follows:

column operator constant_expression [and ...]

SARGs provide a way for the Query Optimizer to limit the rows searched to satisfy a query. The general goal is to match a SARG with an index to avoid a table scan. Valid operators for a SARG are =, >, <, >=, <=, BETWEEN, and LIKE. Multiple SARGs can be combined with the AND clause. (A single index might match some or all of the SARGs ANDed together.) Following are examples of SARGs:

- flag = 7
- salary > 100000
- city = 'Saratoga' and state = 'NY'
- price between $10 and $20 (the same as price >= $10 and price <= $20)
- 100 between lo_val and hi_val (the same as lo_val <= 100 and hi_val >= 100)
- au_lname like 'Sm%' (the same as au_lname >= 'Sm' and au_lname < 'Sn')

In some cases, the column in a SARG might be compared with a constant expression rather than a single constant value. The constant expression can be an arithmetic operation, a built-in function, a string concatenation, a local variable, or a subquery result. As long as the left side of the SARG contains a column, it's considered an optimizable SARG.

Identifying OR Clauses

The next statements the Query Optimizer looks for in the query are OR clauses. OR clauses are SARGable expressions combined with an OR condition rather than an AND condition, and they are treated differently from standard SARGs. The format of an OR clause is

SARG or SARG [or ...]

with all columns involved in the OR belonging to the same table. The IN statement

column in (constant1, constant2, ...)

is also treated as an OR clause, becoming this:

column = constant1 or column = constant2 or ...

Some examples of OR clauses are as follows:

where au_lname = 'Smith' or au_fname = 'Fred'
where (type = 'business' and price > $25) or pub_id = '1234'
where au_lname in ('Smith', 'Jones', 'N/A')

An OR clause is a disjunction; all rows matching either of the two criteria appear in the result set. Any row matching both criteria should appear only once. The main issue is that an OR clause cannot be satisfied by a single index search. Consider the first example just presented:

where au_lname = 'Smith' or au_fname = 'Fred'

An index on au_lname and au_fname helps SQL Server find all the rows where au_lname = 'Smith' AND au_fname = 'Fred', but searching the index tree does not help SQL Server efficiently find all the rows where au_fname = 'Fred' and the last name is any value. Unless an index on au_fname exists as well, the only way to find all rows with au_fname = 'Fred' is to search every row in the table or scan every row in a nonclustered index that contains au_fname as a nonleading index key.

An OR clause can typically be resolved either by a table scan or by using the OR strategy. Using a table scan, SQL Server reads every row in the table and applies each OR criterion to each row. Any row that matches any one of the OR criteria is put into the result set. A table scan is an expensive way to process a query, so the Query Optimizer looks for an alternative for resolving an OR.
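One way to see which strategy the Query Optimizer chose is to inspect the estimated plan. The following is a minimal sketch, assuming the pubs sample database and that separate usable indexes exist on au_lname and au_fname (pubs does not create a standalone index on au_fname by default, so that index is an assumption here):

USE pubs;
GO
-- Display the estimated plan as text instead of executing the query
SET SHOWPLAN_TEXT ON;
GO
-- With usable indexes on both columns, the plan may show two index seeks
-- whose results are combined and de-duplicated (the index union strategy);
-- otherwise, expect a table or clustered index scan
SELECT au_id, au_lname, au_fname
FROM authors
WHERE au_lname = 'Smith'
   OR au_fname = 'Fred';
GO
SET SHOWPLAN_TEXT OFF;
GO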
If an index can be matched against all SARGs involved in the OR clause, SQL Server evaluates the possibility of applying the index union strategy described later in this chapter, in the section "Using Multiple Indexes."

Identifying Join Clauses

The next type of clause the Query Optimizer looks for during the query analysis phase is the join clause. A join condition is specified in the FROM clause using the JOIN keyword, as follows:

FROM table1 JOIN table2 ON table1.column = table2.column

Alternatively, join conditions can be specified in the WHERE clause using the old-style join syntax, as shown in the following example:

table1.column operator table2.column

A join clause always involves two tables, except in the case of a self-join, but even in a self-join, you must specify the table twice in the query. Here's an example:

select employee = e.LastName + ', ' + e.FirstName,
       manager = m.LastName + ', ' + m.FirstName
from Northwind..Employees e
left outer join Northwind..Employees m on e.ReportsTo = m.EmployeeID
order by 2, 1

SQL Server treats a self-join just like a normal join between two different tables.

In addition to join clauses, the Query Optimizer also looks for subqueries, derived tables, and common table expressions and determines whether they need to be flattened into joins or processed using a different strategy. Subquery optimization is discussed later in this chapter.

Row Estimation and Index Selection

When the query analysis phase of optimization is complete and all SARGs, OR clauses, and join clauses have been identified, the next step is to determine the selectivity of the expressions (that is, the estimated number of matching rows) and the cost of finding the rows. The costs are measured primarily in terms of logical and physical I/O, with the goal of generating a query plan that results in the lowest estimated I/O and processing cost.

Primarily, the Query Optimizer attempts to identify whether an index exists that can be used to locate the matching rows. If multiple indexes or search strategies can be considered, their costs are compared with each other and also against the cost of a table or clustered index scan to determine the least expensive access method. An index is typically considered useful for an expression if the first column in the index is used in the expression and the search argument in the expression provides a means to effectively limit the search. If no useful indexes are found for an expression, typically a table or clustered index scan is performed on the table. A table or clustered index scan is the fallback tactic for the Query Optimizer to use if no lower-cost method exists for returning the matching rows from a table.

Evaluating SARG and Join Selectivity

To determine the selectivity of a SARG, which helps in determining the most efficient query plan, the Query Optimizer uses the statistical information stored for the index or column, if any. If no statistics are available for a column or index, SQL Server automatically creates statistics on nonindexed columns specified in a SARG if the AUTO_CREATE_STATISTICS option is enabled for the database. SQL Server also automatically generates and updates the statistics for any indexed columns referenced in a SARG if the AUTO_UPDATE_STATISTICS option is enabled.
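Before relying on automatic statistics, you can verify that these options are enabled for your database. A minimal sketch, using the bigpubs2008 sample database name as an example (substitute your own database name):

-- Check whether the automatic statistics options are enabled
SELECT name,
       is_auto_create_stats_on,
       is_auto_update_stats_on
FROM sys.databases
WHERE name = N'bigpubs2008';

-- Enable the options if either flag is 0
ALTER DATABASE bigpubs2008 SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE bigpubs2008 SET AUTO_UPDATE_STATISTICS ON;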
In addition, you can explicitly create statistics for a column or set of columns in a table or an indexed view by using the CREATE STATISTICS command. Both index statistics and column statistics (whether created automatically or manually with the CREATE STATISTICS command) are maintained and kept up-to-date, as needed, if the AUTO_UPDATE_STATISTICS option is enabled or if the UPDATE STATISTICS command is explicitly run for a table, index, or column statistics. Available and up-to-date statistics allow the Query Optimizer to more accurately assess the cost of different query plans and choose a high-quality plan.

If no statistics are available for a column or an index and the AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS options have been disabled for the database or table, SQL Server cannot make an informed estimate of the number of matching rows for a SARG and resorts to using some built-in percentages for the number of matching rows for various types of expressions. These percentages currently are as follows:

Operator                                Row Estimate
=                                       (# of rows in table)^0.75
between, > and < (closed-range search)  9%
>, <, >=, <= (open-range search)        30%

Using these default percentages almost certainly results in inappropriate query execution plans being chosen. You should always try to ensure that you have up-to-date statistics available for any columns referenced in your SARGs and join clauses.

When the value of a SARG can be determined at the time of query optimization, the Query Optimizer uses the statistics histogram to estimate the number of matching rows for the SARG. The histogram contains a sampling of the data values in the column and stores information on the number of matching rows for the sampled values, as well as for values that fall between the sampled values. If the statistics are up-to-date, this is the most accurate estimate of the number of matching rows for a SARG. (For a more detailed discussion of index and column statistics and the information contained in them, see Chapter 34.)

If the SARG contains an expression that cannot be evaluated until runtime (for example, a local variable or scalar function) but is an equality expression (=), the Query Optimizer uses the density information from the statistics to estimate the number of matching rows. The density value reflects the overall uniqueness of the data values in the column or index. Density information does not estimate the number of matching rows as accurately as the histogram because its value is determined across the entire range of values in a column or an index key and can be skewed higher by one or more values that have a high number of duplicates.

If an expression cannot be evaluated at the time of optimization and the SARG is not an equality search but a closed- or open-range search, the density information cannot be used. The same percentages are used for the row estimates as when no statistics are available (9% for a closed-range search and 30% for an open-range search).

As a special case, if a SARG contains the equality (=) operator and a unique index exists that matches the SARG, the Query Optimizer knows from the nature of a unique index, without having to analyze the index statistics, that one and only one row can match the SARG.
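You can inspect the histogram and density information described here with DBCC SHOW_STATISTICS. A sketch follows; the index name is hypothetical, so substitute the name of an actual index or statistics object on your table:

-- Returns three result sets: the statistics header, the density vector,
-- and the histogram the optimizer uses for row estimates
DBCC SHOW_STATISTICS ('sales', 'idx_sales_qty');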
If the query contains a join clause, SQL Server determines whether any usable indexes or column statistics exist that match the column(s) in the join clause. Because the Query Optimizer has no way of determining at optimization time what value(s) will join between rows in the tables, it can't use the statistics histogram on the join column to estimate the number of matching rows. Instead, it uses the density information, as it does for SARGs that are unknown during optimization. A lower density value indicates a more selective index. As the density approaches 1, the join condition becomes less selective. For example, if a nonclustered index has a high density value, it will likely be more expensive in terms of I/O to retrieve the matching rows using the nonclustered index than to perform a table scan or clustered index scan, and the index likely will not be used.

NOTE
For a more thorough and detailed discussion of indexes and index and column statistics, see Chapter 34.

SARGs and Inequality Operators

In previous versions of SQL Server, when a SARG contained an inequality operator (!= or <>), the selectivity of the SARG could not be determined effectively, for the simple reason that index or column statistics can help you estimate only the number of matching rows for a specific value, not the number of nonmatching rows. However, for some SARGs with inequality operators, if index or column statistics are available, SQL Server 2008 is able to estimate the number of matching rows. For example, consider the following SARG:

WHERE qty <> 1000

Without any available index or column statistics on the qty column, SQL Server would treat the inequality SARG as a SARG with no available statistics. Potentially every row in the table could satisfy the search criteria, so it would estimate the number of matching rows as all rows in the table. However, if index or column statistics were available for the qty column, the Query Optimizer would look up the search value (1000) in the statistics, estimate the number of matching rows for that value, and then determine the number of matching rows for the query as the total number of rows in the table minus the estimated number of matching rows for the search value. For example, if there are 150,000 rows in the table and the statistics indicate that 1,570 rows match where qty = 1000, the number of matching rows would be calculated as follows:

150,000 rows - 1,570 rows (where qty = 1000) = 148,430 rows (where qty <> 1000)

In this example, with the large number of estimated rows where qty <> 1000, SQL Server would likely end up performing a table scan to resolve the query. However, if the Query Optimizer estimates that there is a very small number of rows where qty <> 1000, it might determine that it would be more efficient to use an index to find the nonmatching rows. You may be wondering how SQL Server efficiently searches the index for the rows where qty <> 1000 without having to look at every row. In this case, internally, it converts the inequality SARG into two range retrievals by using an OR condition:

WHERE qty < 1000 OR qty > 1000
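A minimal sketch of this rewrite, reusing the sales table from the example above; the two queries return the same rows (rows with a NULL qty are excluded by both forms):

-- The optimizer can internally treat the inequality as two range retrievals
SELECT * FROM sales WHERE qty <> 1000;

-- Logically equivalent form: two ranges combined with OR
SELECT * FROM sales WHERE qty < 1000 OR qty > 1000;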
NOTE
Even if an inequality SARG is optimizable, that doesn't necessarily mean an index will be used. It simply allows the Query Optimizer to make a more accurate estimate of the number of rows that will match a given SARG. More often than not, an inequality SARG results in a table or clustered index scan. You should try to avoid using inequality SARGs whenever possible.

SARGs and LIKE Clauses

In SQL Server versions prior to SQL Server 2005, the Query Optimizer would estimate the selectivity of a LIKE clause only if the first character in the string was a constant; otherwise, every row had to be examined to determine whether it was a match. SQL Server 2008 uses string summary statistics, which were introduced in SQL Server 2005, for estimating the selectivity of LIKE conditions.

String summary statistics provide a statistical summary of substring frequency distribution for character columns. They can be created on columns of type text, ntext, char, varchar, and nvarchar. String summary statistics allow SQL Server to estimate the selectivity of LIKE conditions where the search string may have any number of wildcards in any combination, including when the first character is a wildcard. (In versions of SQL Server prior to 2005, row estimates could not be accurately obtained when the leading character of a search string was a wildcard.) SQL Server 2008 can estimate the selectivity of LIKE predicates similar to the following:

- au_lname LIKE 'Smith%'
- stor_name LIKE '%Books'
- title LIKE '%Cook%'
- title_id LIKE 'BU[1234567]001'
- title LIKE '%Cook%Chicken'

The string summary statistics result in fairly accurate row estimates. However, if there is a user-specified escape character in a LIKE pattern (for example, stor_name LIKE '%abc#_%' ESCAPE '#'), SQL Server 2008 has to guess at the selectivity of the SARG.

The values generated for string summary statistics are not visible via DBCC SHOW_STATISTICS. However, DBCC SHOW_STATISTICS does indicate whether string summary statistics have been calculated: if the value YES appears in the String Index field in the first rowset returned by DBCC SHOW_STATISTICS, the statistics also include a string summary. Also, if the strings are more than 80 characters in length, only the first and last 40 characters are used for creating the string summary statistics, so accurate frequency estimates cannot be determined for substrings that do not appear in the first and last 40 characters of a string.
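To check whether a string summary was built for a given statistics object, you can request just the header rowset and look at the String Index field. The table and statistics names in this sketch are hypothetical:

-- A value of YES in the String Index column means the statistics include
-- a string summary usable for LIKE selectivity estimates
DBCC SHOW_STATISTICS ('stores', 'stat_stor_name') WITH STAT_HEADER;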
SARGs on Computed Columns

In versions of SQL Server prior to 2005, for a SARG to be optimizable, there had to be no computations on the column itself in the SARG. In SQL Server 2008, expressions involving computations on a column might be treated as SARGs during optimization if SQL Server can simplify the expression into a SARG. For example, the SARG

ytd_sales/12 = 1000

can be simplified to this:

ytd_sales = 12000

The simplified expression is used only during optimization to determine an estimate of the number of matching rows and the usefulness of the index. During actual execution, the conversion is not applied while traversing the index tree because SQL Server cannot perform the repeated division by 12 for each row while searching through the tree. However, doing the conversion during optimization and getting a row estimate from the statistics helps the Query Optimizer decide on other strategies to consider, such as index scanning versus table scanning, or it might help to determine an optimal join order if it's a multitable query.

SQL Server 2008 supports the creation, update, and use of statistics on computed columns. The Query Optimizer can make use of the computed column statistics even when a query doesn't reference the computed column by name but rather contains an expression that matches the computed column expression. This feature avoids the need to rewrite the SARGs in queries with expressions that match a computed column expression into SARGs that explicitly contain the computed column itself.

When the SARG has a more complex operation performed on it, such as a function, it can potentially prevent effective optimization of the SARG. If you cannot avoid using a function or complex expression on a column in the search expression, you should consider creating a computed column on the table and creating an index on the computed column. This materializes the function result into an additional column on the table that can be indexed for faster searching, and the index statistics can be used to better estimate the number of matching rows for the SARG expression that references the function.

An example of using this approach would be a query that has to find the number of orders placed in a certain month, regardless of the year. The following is a possible solution:

select distinct stor_id
from sales
where datepart(month, ord_date) = 6

This query gets the correct result set but ends up having to do so with a full table or index scan because the function on the ord_date column prevents the Query Optimizer from using an index seek against any index that might exist on the ord_date column. If this query is used frequently in the system and quick response time is critical, you could create a computed column on the function and index it as follows:

alter table sales add ord_month as datepart(month, ord_date)
create index nc_sales_ordmonth on sales(ord_month)

Now, when you run the query on the table again, if you specify the computed column in the WHERE clause, the Query Optimizer can use the index on the computed column to accurately estimate the number of matching rows and possibly use the nonclustered index to find the matching rows and avoid a table scan, as it does for the following query:

select distinct stor_id
from sales
where ord_month = 6

Even if the query still ends up using a table scan, it at least has statistics available to know how many rows it can expect to match where the month matches the value specified. In addition, if a computed column exists that exactly matches the SARG expression, SQL Server 2008 can still use the statistics and index on the computed column to optimize the query, even if the computed column is not specified in the query itself. For example, with the ord_month column defined on the sales table and an index created on it, the following query can also use the statistics and index:

select distinct stor_id
from sales
where datepart(month, ord_date) = 6

TIP
The automatic matching of computed columns in SQL Server 2008 enables you to create and exploit computed columns without having to change the queries in your application. Be aware, though, that computed column matching is based on identical comparison. For example, a computed column of the form A + B + C does not match an expression of the form A + C + B, as the sketch following this tip illustrates.
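A minimal sketch of this caveat, using a hypothetical table:

-- Hypothetical table with an indexed computed column
CREATE TABLE dbo.t
(
    a   int,
    b   int,
    c   int,
    abc AS a + b + c   -- computed column
);
CREATE INDEX ix_t_abc ON dbo.t (abc);

-- Matches the computed column expression exactly; the optimizer can use
-- ix_t_abc and its statistics
SELECT a, b, c FROM dbo.t WHERE a + b + c = 10;

-- Returns the same rows, but a + c + b is not an identical expression,
-- so the computed column and its statistics are not matched
SELECT a, b, c FROM dbo.t WHERE a + c + b = 10;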
Estimating Access Path Cost

After the selectivity of each of the SARGs, OR clauses, and join clauses is determined, the next phase of optimization is estimating the access path cost of the query. The Query Optimizer attempts to identify the total cost of various access paths to the data and determine which path results in the lowest cost to return the matching rows for an expression.

The primary cost of an access path, especially for single-table queries, is the number of logical I/Os required to retrieve the data. Using the available statistics and the information stored in SQL Server regarding the average number of rows per page and the number of pages in the table, the Query Optimizer estimates the number of logical page reads necessary to retrieve the estimated number of rows using a table scan or any of the candidate indexes. It then ranks the candidate indexes to determine which access path would retrieve the matching data rows with the lowest cost, typically the access path that requires the fewest logical and physical I/Os.

NOTE
A logical I/O occurs every time a page is accessed. If the page is not in cache, a physical I/O is first performed to bring the page into cache memory, and then a logical I/O is performed against the page. The Query Optimizer has no way of knowing whether a page will be in memory when the query actually is executed, so it always assumes a cold cache: the first read of a page will be from disk. In a very few cases (for example, small OLTP queries), this assumption may result in a slightly slower plan being chosen that optimizes for the number of initial I/Os required to process the query. However, the cold cache assumption is a minor factor in the query plan costing; the total number of logical I/Os is the primary factor in determining the cost of the access path.

TIP
The following sections assume a general understanding of SQL Server index structures. If you haven't done so already, now is a good time to read through Chapter 34.

Clustered Index Cost

Clustered indexes are efficient for lookups because the rows that match the SARGs are clustered on the same page or over a range of adjacent pages. SQL Server needs only to find its way to the first page and then read the rows from that page and any subsequent pages in the page chain until no more matching rows are found. Therefore, the I/O cost estimate for a clustered index is calculated as follows:

Number of index levels in the clustered index
+ Number of pages to scan within the range of values

The number of pages to scan is based on the estimated number of matching rows divided by the number of rows per page. For example, if SQL Server can store 250 rows per page for a table, and 600 rows are within the range of values being searched, SQL Server would estimate that it requires at least three page reads to find the qualifying rows. If the index is three levels deep, the logical I/O cost would be as follows:

3 (index levels to find the first row)
+ 3 (data pages: 600 rows divided by 250 rows per page)
= 6 logical page I/Os

For a unique clustered index and an equality operator, the logical I/O cost estimate is one data page plus the number of index levels that need to be traversed to access the data page. When a clustered index is used to retrieve the data rows, you see a query plan similar to the one shown in Figure 35.1.

FIGURE 35.1 An execution plan for a clustered index seek.
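You can compare these estimates against the actual page reads for a query by enabling SET STATISTICS IO. A sketch follows, assuming the sales table has a clustered index with stor_id as its leading key (as in the sample databases); the exact read counts depend on your data:

SET STATISTICS IO ON;
GO
-- A range search on the clustered index key; the reported 'logical reads'
-- figure should be roughly (index levels) + (matching rows / rows per page)
SELECT stor_id, ord_num, qty
FROM sales
WHERE stor_id BETWEEN '7066' AND '7131';
GO
SET STATISTICS IO OFF;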
Nonclustered Index Cost

When searching for values using a nonclustered index, SQL Server reads the index key values at the leaf level of the index and uses the bookmark to locate and read the data row. SQL Server has no way of knowing whether matching search values will be on the same data page until it has read the bookmark. It is possible that while retrieving the rows, SQL Server might find all data rows on different data pages, or it might revisit the same data page multiple times. Either way, a separate logical I/O is required each time it visits the data page.

The I/O cost is therefore based on the depth of the index tree, the number of index leaf rows that need to be scanned to find the matching key values, and the number of matching rows. The cost of retrieving each matching row depends on whether the table is clustered or is a heap table (that is, a table with no clustered index defined on it). For a heap table, the nonclustered row bookmark is the page and row pointer (the row ID [RID]) to the actual data row, and a single I/O is required to retrieve each data row. Therefore, the worst-case logical I/O cost for a heap table can be estimated as follows:

Number of nonclustered index levels
+ Number of leaf pages to be scanned
+ Number of qualifying rows (each row represents a separate data page read)
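As a worked example under assumptions similar to the clustered index example above (the numbers are hypothetical), a three-level nonclustered index whose matching keys span 2 leaf pages, with 600 qualifying rows, would be costed as follows:

3 (index levels to find the first leaf row)
+ 2 (leaf pages to scan for matching keys)
+ 600 (data page reads, one per qualifying row)
= 605 logical page I/Os

This is why a large estimated number of matching rows quickly makes a nonclustered index more expensive than a table or clustered index scan.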


Contents

  • Table of Contents

  • Introduction

  • Part I: Welcome to Microsoft SQL Server

    • 1 SQL Server 2008 Overview

      • SQL Server Components and Features

      • SQL Server 2008 R2 Editions

      • SQL Server Licensing Models

      • Summary

      • 2 What’s New in SQL Server 2008

        • New SQL Server 2008 Features

        • SQL Server 2008 Enhancements

        • Summary

        • 3 Examples of SQL Server Implementations

          • Application Terms

          • OLTP Application Examples

          • DSS Application Examples

          • Summary

          • Part II: SQL Server Tools and Utilities

            • 4 SQL Server Management Studio

              • What’s New in SSMS

              • The Integrated Environment

              • Administration Tools

              • Development Tools

              • Summary

              • 5 SQL Server Command-Line Utilities

                • What’s New in SQL Server Command-Line Utilities

                • The sqlcmd Command-Line Utility
