Friday, 10 February 2012

Using Covering Indexes to Improve Query Performance

Introduction

There are many ways to measure a database’s performance. We can look at its overall availability; that is: are our users able to get the information they need when they need it? We can consider its concurrency; how many users can query the database at one time? We can even watch the number of transactions that occur per second to gauge its level of activity.
Often for our users however, it comes down to one very simple, but somewhat nebulous, measurement – perceived responsiveness. Or put another way, when a user issues a query does he receive his results without having to wait too long?
Of course, there are many factors that influence this. There is the hardware itself and how its configured. There is the physical implementation of the database; the logical design of the database; and even how the queries themselves are written.
In this article however, I will focus on one specific technique that can be used to improve performance – creating covering indexes.

An indexing refresher

Before taking on the challenge of covering indexes, let’s take a moment for a brief but very relevant review of index types and terminology.
Indexes allow SQL Server to quickly find rows in a table based on key values, much like the index of a book allows us to easily find the pages that contain the information we want. There are two types of indexes in SQL Server, clustered and nonclustered indexes.

Clustered Indexes    

A clustered index is an index whose leaf nodes, that is the lowest level of the index, contain the actual data pages of the underlying table. Hence the index and the table itself are, for all practical purposes, one and the same.  Each table can have only one clustered index. For more information on clustered indexes, see the Books Online topic "Clustered Index Structures" (http://msdn.microsoft.com/en-us/library/ms177443.aspx).
When a clustered index is used to resolve a query, SQL Server begins at the root node for the index and traverses the intermediate nodes until it locates the data page that contains the row it's seeking.
Many database designs make prolific use of clustered indexes. In fact, it is generally considered a best practice to include a clustered index on each table; of course that's painting with a very broad brush and there will most assuredly be exceptions. For more information about the benefits of clustered indexes, see the SQL Server Best Practices Article entitled "Comparing Tables Organized with Clustered Indexes versus Heaps" on TechNet .
Lets’ consider an example. In the Figure 1, the Customers table has a clustered index defined on the Customer_ID column. When a query is executed that searches by the Customer_ID column, SQL Server navigates through the clustered index to locate the row in question and returns the data. This can be seen in the Clustered Index Seek operation in the query’s Execution Plan.

Figure 1. Clustered index execution plan
The following Transact-SQL statement was used to create the clustered index on the Customer_ID column.

CREATE CLUSTERED INDEX [ix_Customer_ID] ON [dbo].[Customers] (
            
[Customer_ID] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]

Nonclustered indexes

Nonclustered indexes use a similar methodology to store indexed data for tables within SQL Server. However in a nonclustered index, the lowest level of the index does not contain the data page of the table. Instead, it contains the information that allows SQL Server to navigate to the data pages it needs. For tables that have a clustered index, the leaf node of the nonclustered index contains the clustered index keys. In the previous example, the leaf node of a nonclustered index on the Customers table would contain the Customer_ID key.
If the underlying table does not have a clustered index (this data structure is known as a heap), the leaf node of the nonclustered index contains a row locator to the heap data pages.
In the following example, a nonclustered composite index has been created on the Customers table as described in the following Transact-SQL code.

CREATE NONCLUSTERED INDEX [ix_Customer_Name] ON [dbo].[Customers] (
            
[Last_Name] ASC,
            
[First_Name] ASC) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
In this case when a query that searched by customer last name was executed, the SQL Server query optimizer chose to use the ix_Customer_Name index to resolve the query. This can be seen in the Execution Plan in the following figure.

Figure 2. Nonclustered index execution plan
For more information on clustered indexes, see the Books Online topic "Nonclustered Index Structures" .

Using Nonclustered indexes

As illustrated in the preceding example, nonclustered indexes may be employed to provide SQL Server with an efficient way to retrieve data rows. However, under some circumstances, the overhead associated with nonclustered indexes may be deemed too great by the query optimizer and SQL Server will resort to a table scan to resolve the query. To understand how this may happen, let's examine the preceding example in more detail.

Key Lookups

Looking again at the graphical query execution plan depicted in Figure 2, notice that the plan not only includes an Index Seek operation that uses the ix_Customer_Name nonclustered index, it also includes a Key Lookup operation.
SQL Server uses a Key Lookup to retrieve non-key data from the data page when a nonclustered index is used to resolve the query. That is, once SQL Server has used the nonclustered index to identify each row that matches the query criteria, it must then retrieve the column information for those rows from the data pages of the table.
Since the leaf node of the nonclustered index contains the key value information for the row, SQL Server must navigate through the clustered index to retrieve the columnar information for each row of the result set. In this example, SQL Server choose to do this using a nested loop join type.
This query produced a result of 1,000 rows and nearly 100% of the expense of the query was directly attributed to the Key Lookup operation. Digging a little deeper into the Key Lookup operation, we can see why.

Figure 3. Key lookup operation properties
This Key Lookup operation was executed 1000 times, once for each row of the result set. 

Resorting to Table Scans

As the number of rows in the result set increases, so does the number of Key lookups. At some point, the cost associated with the Key Lookup will outweigh any benefit provided by the nonclustered index.
To illustrate this point, let's modify the query so that it retrieves more rows. Figure 4. depicts the new query along with the actual execution plan used to resolve the query.

Figure 4. Resorting to a table scan
The new query searches for a range of customers whose last name is between "Roland" and "Smith". There are 69,000 of them in our database. From the actual execution plan, we can see that the query optimizer determined that the overhead cost of performing a Key Lookup for each of the 69,000 rows was more than simply traversing the entire table via a table scan. Hence, our ix_Customer_Name index was not used at all during the query.
Figure 5  shows some additional properties of the table scan.

Figure 5. Properties for the table scan operation
One may be tempted to force SQL Server to resolve the query using the nonclustered index by supplying a table hint as shown in the following illustration.

Figure 6. Using a table hint to resolve the query
This is almost always a bad idea since the optimizer generally does a good job in choosing an appropriate execution plan. Additionally, the optimizer bases its decisions on column statistics; those are likely to change over time. A table hint that works well today, may not work well in the future when the selectivity of the key columns change.
Figure 7 shows the properties for the Key Lookup when we forced SQL Server to use the nonclustered ix_Customer_Name index. The Estimated Operator Cost for the Key Lookup is 57.02 compared to 12.17 for the Clustered Index Scan shown in Figure 5. Forcing SQL Server to use the index significantly affected performance, and not for the better!

Figure 7. Properties of the Key Lookup for the table hint

Covering Indexes

So, if Key Lookups can be detrimental to performance during query resolution for large result sets, the natural question is: how can we avoid them? To answer that question, let's consider a query that does not require a Key Lookup.
Let's begin by modifying our query so that it no longer selects the Email_Address column. Figure 8 illustrates this updated query along with its actual execution plan.

Figure 8. Reduced query to eliminate the Key Lookup
The new execution plan has been streamlined and only uses the ix_Customer_Name nonclustered index. Looking at the properties of the operation providers further evidence of the improvement. The properties are shown in Figure 9.

Figure 9. Reduced query to eliminate the Key Lookup properties
The Estimated Operator Cost went down dramatically, from 12.17 in Figure 5 to 0.22 in Figure 9.  We could also look at the Logical and Physical Read characteristics by setting STATISTICS IO on, however for this demonstration it’s sufficient to view the Operator Costs for each operation.
The observed improvement is due to the fact that the nonclustered index contained all of the required information to resolve the query. No Key Lookups were required. An index that contains all information required to resolve the query is known as a "Covering Index"; it completely covers the query.

Using the Clustered Key Column

Recall that if a table has a clustered index, those key columns are automatically part of the nonclustered index. So, the following query is a covering query by default.

Figure 10. Covering index using the clustered index keys
However unless your clustered index contains the required columns, which is not the case in our example, it will be insufficient for covering our query.

Adding Key Columns to the Index

To increase the likelihood that a nonclustered index is a covering index, it is tempting to begin adding additional columns to the index key. For example, if we regularly query the customer's middle name and telephone number, we could add those columns to the ix_Customer_Name index. Or, to continue with our previous example, we could the Email_Address column to the index as shown in the following Transact-SQL code.

CREATE NONCLUSTERED INDEX [ix_Customer_Email] ON [dbo].[Customers] (
            
[Last_Name] ASC,
            
[First_Name] ASC,
            
[Email_Address] ASC)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
Before doing so, it is important to remember that indexes must be maintained by SQL Server during data manipulation operations. Too many index hurts performance during write operations. Additionally, the wider the index, that is to say the more bytes that make up the index keys, the more data pages it will take to store the index.
Furthermore, there are some built in limitations for indexes. Specifically, indexes are limited to 16 key columns or 900 bytes, whichever comes first, in both SQL Server 2005 and SQL Server 2008. And some datatypes cannot be used as index keys, varchar(max) for instance.

Including Non-Key columns

SQL Server 2005 provided a new feature for nonclustered indexes, the ability to include additional, non-key columns in the leaf level of the nonclustered indexes. These columns are technically not part of the index, however they are included in the leaf node of the index. SQL Server 2005 and SQL Server 2008 allow up to 1023 columns to be included in the leaf node.
To create a nonclustered index with included columns, use the following Transact-SQL syntax.

CREATE NONCLUSTERED INDEX [ix_Customer_Email] ON [dbo].[Customers] (
            
[Last_Name] ASC,
            
[First_Name] ASC)INCLUDE ( [Email_Address]) WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
Rerunning our query yields an execution plan that make use of our new index to rapidly return the result set. The execution plan may be found in the Figure 11.

Figure 11. Covering query with included columns
Notice that even though our query selects columns that are not part of the nonclustered index’s key, SQL Server is still able to resolve the query without having to use a Key Lookup for each row. Since the ix_CustomerEmail index includes the Email_Address column as part of its definition, the index “covers” the query. The properties of the Nonclustered Index Seek operator confirm our findings as depicted in the Figure 12.

Figure 12. Execution properties for the covering query with included columns
From this execution plan, we can see that the Estimate Operator Cost decreased from 12.17 in Figure 5 to 0.41. Including an additional non-key column has dramatically improved the performance of the query.
For more information about including non-key columns in a nonclustered index, See "Index with Included Columns" in Books Online .

Summary

By including frequently queried columns in nonclustered indexes, we can dramatically improve query performance by reducing I/O costs. Since the data pages for an nonclustered index are frequently readily available in memory, covering indexes are the usually the ultimate in query resolution.

No comments:

Post a Comment