Friday 10 February 2012

SQL Server Performance Testing

Clustered indexes and primary keys

By Grant Fritchey
The company I work for is implementing a custom system, built in Visual Studio and running against SQL Server 2000, to replace some old mainframe applications. It's an all-new database design that will meet all of the old application's data requirements and a whole slew of new ones. I was responsible for the initial design, which I completed and handed over to a development team. I had been working quite closely with them for some time on the application, but then I got pulled in other directions and the app and database proceeded without me. One day, the development team raised a concern that the database was far too slow to support the anticipated transaction load. I began my investigation.

The problem

The application in question has a few data requirements that are different from those of the average application. At its heart, it is a document management system. However, every change to large sections of the data has to be stored historically, so that we have multiple versions of pieces of a given document. Not all of the data stored is recreated for each version. We needed to know that change 'A' was included in version '1', that change 'B' was in version '3', and, for a completely different piece of data stored in another table, that change 'Q' was included in version '2'. The data looks something like this:
Table 1

Version   Value
1         'A'
2
3         'B'

Table 2

Version   Value
1         'Some other value'
2         'Q'
3
When queries are run, the latest data in Table 1 is 'B' from version 3 and the latest data in Table 2 is 'Q' from version 2.
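To make the "latest value" rule concrete, here is a minimal sketch, assuming a hypothetical dbo.Table1 with the Version and Value columns shown above, of how the value in force as of version 3 would be retrieved:

-- Hypothetical, single-table illustration: find the most recent version at or
-- below version 3 that actually carries a value, then return that value.
SELECT t.Value
FROM dbo.Table1 t
WHERE t.Version = (SELECT MAX(t2.Version)
                   FROM dbo.Table1 t2
                   WHERE t2.Version <= 3
                   AND t2.Value IS NOT NULL)

The production queries shown later apply the same pattern as a correlated subquery against each child table.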

The original design: clustered indexes and non-clustered PKs

The original database design started with a table, Document. Each Document had one or more Versions. The attributes that describe a Document were attached as children to the Version. Since we knew that data access would always be by Document and Version, I made the decision to denormalize the data a bit and keep the DocumentId and VersionId on each table. This is a sample of what the database would have looked like. Names have been changed and the structure simplified for the purposes of the article.

As you can see, my original design, which I refer to as the "switch-board" approach, placed a clustered index on the DocumentId and VersionId columns of each table in order to group the data for the anticipated access path. This means that, in addition to the non-clustered primary key on each table, there is a clustered index.
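As a rough illustration (the table and column names here are simplified stand-ins, not the real schema), one child table in the switch-board layout would look something like this:

-- Switch-board sketch: a non-clustered surrogate primary key plus a separate
-- clustered index on (DocumentId, VersionId) to group rows for the expected access path.
CREATE TABLE dbo.DocumentLimit (
   DocumentLimitId INT IDENTITY(1, 1) NOT NULL,
   DocumentId      INT NOT NULL,   -- denormalized from Document
   VersionId       INT NOT NULL,   -- denormalized from Version
   LimitValue      VARCHAR(50) NULL,
   CONSTRAINT PK_DocumentLimit PRIMARY KEY NONCLUSTERED (DocumentLimitId)
)
GO
CREATE CLUSTERED INDEX IX_DocumentLimit_Document_Version
   ON dbo.DocumentLimit (DocumentId, VersionId)
GO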
Or, at least this was how it was supposed to be.

The actual design

What I found was that, after several months on their own, and with input from some other DBAs, the developers had dropped some of the indexes, changed some of the primary keys to be clustered and the indexes to be non-clustered, removed some of the denormalization, and so on. In other words, a total mess.
My initial reaction, after cursing and spitting, was to simply clean up the database and put the developers back on track. However, it wasn't that simple. During this same period, we had a consultant from Microsoft helping out in the database area. He had been talking to the developers about their problems and had proposed, directly to the development lead, a radical change to the design of the database. Now I was presented with more than one solution to the problem.

The proposed new design: compound, clustered PKs

Everyone agreed that the existing database design wouldn't support the required user load, the expected transaction volume, or the amount of data needed. Despite my inclination to revert to my original design, I couldn't simply dismiss the consultant's idea, which was as follows: drop all the indexes, make the primary key on each table that table's clustered index and, finally, redesign those keys so that each parent table acted as an identifier for its child tables, thus creating compound primary keys. His proposed new design can be summarized as follows:
NOTE: The real design, being quite a bit more complex, had some tables with a clustered, compound primary key containing as many as eight fields. Colors and symbols are from ERStudio: blue represents foreign keys; red represents identifying foreign keys.
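For comparison, here is a hedged sketch of the same structures under the compound, clustered PK approach (again with simplified, stand-in names; LimitTypeId stands in for whatever distinguishes sibling rows in the real schema):

-- Compound PK sketch: each parent's key cascades into the child's clustered primary key,
-- so a single clustered index serves as both the primary key and the access path.
CREATE TABLE dbo.Document (
   DocumentId INT NOT NULL,
   CONSTRAINT PK_Document PRIMARY KEY CLUSTERED (DocumentId)
)
GO
CREATE TABLE dbo.Version (
   DocumentId INT NOT NULL,
   VersionId  INT NOT NULL,
   CONSTRAINT PK_Version PRIMARY KEY CLUSTERED (DocumentId, VersionId),
   CONSTRAINT FK_Version_Document FOREIGN KEY (DocumentId)
      REFERENCES dbo.Document (DocumentId)
)
GO
CREATE TABLE dbo.DocumentLimit (
   DocumentId  INT NOT NULL,
   VersionId   INT NOT NULL,
   LimitTypeId INT NOT NULL,
   LimitValue  VARCHAR(50) NULL,
   CONSTRAINT PK_DocumentLimit PRIMARY KEY CLUSTERED (DocumentId, VersionId, LimitTypeId),
   CONSTRAINT FK_DocumentLimit_Version FOREIGN KEY (DocumentId, VersionId)
      REFERENCES dbo.Version (DocumentId, VersionId)
)
GO
-- In the real design this pattern continues down the hierarchy, with some clustered
-- primary keys reaching as many as eight columns.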
This design bothered me. It violated one of the fundamentals that I'd learned and read about for years: keep the primary key small and narrow. It also looked like it would be difficult to maintain. Finally, after arguing back and forth about the merits and drawbacks of each of the designs, we decided that testing them was the only way to be sure.

Implementing and testing the designs

In all, I had to implement and test three different designs:
  1. The current, "messed up" design (used as our baseline case).
  2. My original "switch-board" design, as described in Figure 1 (produced by cleaning up the current "messed up" design).
  3. The proposed new "compound PK" design, as described in Figure 2.

Implementing the designs

Having created each basic design, I needed to load each with the various defined data loads under which we wished to test. I also had to modify the stored procedures that ran the queries against these designs, as appropriate.

Loading the data

I needed to test under multiple data loads, projecting two, four, and eight years' worth of data, and therefore had to create multiple databases for each of the three designs. Creating these data loads was simple in concept but quite labor-intensive. I had a known client list from which to pull a data distribution that would accurately represent the real business data and provide some of the largest anticipated data sets. Further, loading from known data made it possible to define our test transactions based on that same data. In other words, I knew which organizations we would be calling and could make that list known to the testing tool. To facilitate the data load process, I took advantage of a tool from the Quest suite called Data Factory. I first defined a two year load, verified that it was working correctly, backed up the database, and then simply doubled the initial row counts across the tables to get to the four and then eight year data loads. Each went into a separate database. In the end, I had nine different databases representing the three designs to be tested.

Verifying the stored procedures

Next, I had to modify and verify the stored procedures to work with the "cleaned up" original design and the new Compound PK design (obviously, the procedures for the existing "messed up" design I simply left alone). The stored procedures consisted of a set of simple, required INSERT, UPDATE and SELECT commands plus some more complex SELECT statements needed to retrieve the latest values across the historical data, as described earlier. The cleaned up version required me to check a number of stored procedures and rewrite quite a few of them. The new "compound PK" architecture required me to completely rewrite the join criteria on all the procedures. I made the initial pass through this and then had the MS consultant clean up my work so that his design was represented in the best possible light. I wanted it to fail, but I wanted it to fail fairly.
Even in the original design the queries to retrieve the latest values across the historical, versioned data were rather cumbersome to write so we had spent a lot of time on getting them just right. The original queries, with only the switchboard style of primary-to-foreign key relationships, looked something like this:
SELECT …
FROM dbo.Document d
   INNER JOIN dbo.Version v
      ON d.DocumentId = v.DocumentId
   LEFT OUTER JOIN dbo.DocumentCountry dc
      ON d.DocumentId = dc.DocumentId
      AND dc.VersionId = (SELECT MAX(dc2.VersionId)
                          FROM dbo.DocumentCountry dc2
                          WHERE dc2.VersionId <= v.VersionId
                          AND dc2.DocumentId = d.DocumentId
                         )
   INNER JOIN dbo.DocumentLimit dl
      ON d.DocumentId = dl.DocumentId
      AND dl.VersionId = (SELECT MAX(dl2.VersionId)
                          FROM dbo.DocumentLimit dl2
                          WHERE dl2.VersionId <= v.VersionId
                          AND dl2.DocumentId = d.DocumentId
                         )
   INNER JOIN dbo.LimitQualifier lq
      ON lq.DocumentLimitId = dl.DocumentLimitId
      AND d.DocumentId = lq.DocumentId
      AND lq.VersionId = (SELECT MAX(lq2.VersionId)
                          FROM dbo.LimitQualifier lq2
                          WHERE lq2.VersionId <= v.VersionId
                          AND lq2.DocumentId = d.DocumentId
                         )
These queries would get the latest value from each of the separate tables using the MAX function.
NOTE: We experimented with using TOP(1) instead of MAX and, in some situations, it performed better. However, the aggregate function gave more consistent performance across the system, so we standardized on it (a sketch of the TOP(1) variant appears below).
Clearly, as the number of table joins grows, these queries can become quite unwieldy. However, the new compound PK architecture produced different, and even more unwieldy, queries:
SELECT …
FROM dbo.Document d
   INNER JOIN dbo.Version v
      ON d.DocumentId = v.DocumentId
   LEFT OUTER JOIN dbo.DocumentCountry dc
      ON v.DocumentId = dc.DocumentId
      AND dc.VersionId = (SELECT MAX(dc2.VersionId)
                          FROM dbo.DocumentCountry dc2
                          WHERE dc2.VersionId <= v.VersionId
                          AND dc2.DocumentId = v.DocumentId
                         )
   INNER JOIN dbo.DocumentLimit dl
      ON v.DocumentId = dl.DocumentId
      AND dl.VersionId = (SELECT MAX(dl2.VersionId)
                          FROM dbo.DocumentLimit dl2
                          WHERE dl2.VersionId <= v.VersionId
                          AND dl2.DocumentId = v.DocumentId
                         )
   INNER JOIN dbo.LimitQualifier lq
      ON dl.DocumentId = lq.DocumentId
      AND lq.VersionId = (SELECT MAX(lq2.VersionId)
                          FROM dbo.LimitQualifier lq2
                          WHERE lq2.VersionId <= v.VersionId
                          AND lq2.DocumentId = v.DocumentId
                         )
      AND dl.LimitTypeId = lq.LimitTypeId
Because I can't show the real design, and in order to keep the article manageable, these queries appear almost as simple as those for the original design. However, trust me: as the structure gets deeper and more complex, some of the joins are between six and eight columns wide, rather than the maximum of four shown here. This causes not only the main part of the query to grow, but the subqueries to be extended as well.
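As promised, here is a sketch of the TOP 1 form of one of the correlated lookups we experimented with, shown against the switch-board joins; it is functionally equivalent to the MAX version used above:

   -- TOP 1 variant of the DocumentCountry lookup; the ORDER BY ... DESC makes it
   -- return the same VersionId as the MAX() subquery it replaces.
   LEFT OUTER JOIN dbo.DocumentCountry dc
      ON d.DocumentId = dc.DocumentId
      AND dc.VersionId = (SELECT TOP 1 dc2.VersionId
                          FROM dbo.DocumentCountry dc2
                          WHERE dc2.VersionId <= v.VersionId
                          AND dc2.DocumentId = d.DocumentId
                          ORDER BY dc2.VersionId DESC
                         )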

Testing the designs

In order to test which design would facilitate the fastest queries and function in the most scalable way, I had to come up with a method of performing database-only testing. This was the only way to eliminate any concerns that we were seeing artifacts from the user interface or business façade, etc. As noted, I needed to test each design under three different data loads and I also needed to ramp up the simultaneous user load in order to test the scalability and measure the breakdown point of each design, for each data load. The tool that answered all these needs was Quest's Benchmark Factory.
I had to create a set of transactions to run the required queries against each of the database designs. Due to a limitation in Benchmark Factory (it can't handle output parameters in SQL Server), I was forced to create a wrapper procedure that encapsulated the required series of reads, writes, updates, and finally a delete, representing how the application would be making the database calls. This wasn't optimal, since the queries ran within a single wrapper call rather than as a series of separate transactions, but as long as I did the same thing to each of the database designs, I'd be comparing apples-to-apples.
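The sketch below is purely illustrative; every procedure name and parameter is invented, and the real wrapper contained many more calls:

-- Hypothetical wrapper: one Benchmark Factory "transaction" drives one full
-- read/write/update/delete cycle against the design under test.
CREATE PROCEDURE dbo.DocumentTransactionWrapper
   @DocumentId INT
AS
BEGIN
   SET NOCOUNT ON

   -- Reads: fetch the latest version of the document and its child data
   EXEC dbo.Document_GetLatest @DocumentId = @DocumentId

   -- Writes and updates: add a new version, then modify a child attribute
   EXEC dbo.Version_Insert @DocumentId = @DocumentId
   EXEC dbo.DocumentLimit_Update @DocumentId = @DocumentId

   -- Finally, a delete to complete the mix of operations
   EXEC dbo.DocumentCountry_Delete @DocumentId = @DocumentId
END
GO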

Test results

I set up the tests to run for 30 minutes at each of a series of concurrent user loads: 2, 4, 8, 16, and 20 (20 being the maximum for our license). This meant 2.5 hours at a pop for each test. What with resetting the databases between tests and gathering up all the test data, I could only complete two tests a day. Ultimately, it took most of a week to hack through all this stuff, on top of the three weeks it took to prepare it all. However, in the end, I had some answers.

Existing Design (baseline case)

The first tests were run against the existing design, messed-up indexes and everything else that was wrong with it. It was hardly a shock to see that it didn't perform very well. The results for a two year data load showed incredibly slow performance, even when one takes into account that the "transaction" displayed represents the wrapper stored procedure and not the execution of each of the individual procedures within the wrapper.

Userload   TPS    Errors   Avg Time (s)   Min Time (s)   Max Time (s)
2          1.21   0        1.659          0.671          2.513
4          1.01   0        3.952          1.3            5.304
8          1.01   0        7.882          2.852          11.239
16         1.39   0        11.541         4.882          24.826
20         0.93   0        21.529         7.94           28.353
Things went from bad to worse. The four year data load started deadlocking immediately. I tried resetting the database, and even SQL Server, to clear up the problem, but nothing worked. I tried running the eight year data load and got a partial set of results before the deadlocks started again:

Userload   TPS    Errors   Avg Time (s)   Min Time (s)   Max Time (s)
2          1.07   11       1.864          1.087          6.385
4          0.77   399      5.17           0.569          21.3
NOTE: When we reviewed the information collected, we found lots of table scans, index scans, and lock escalation, all of which were contributing to the deadlocks.
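As an aside, these are the sort of standard SQL Server 2000 diagnostics that surface that kind of behavior (a sketch, not the exact scripts we used):

-- Write deadlock participant details to the SQL Server error log (trace flag 1204).
DBCC TRACEON (1204, -1)
GO
-- Snapshot of the locks currently held; escalation shows up as table-level (TAB) locks.
EXEC sp_lock
GO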

Cleaned up "switch-board" design

I believed very strongly that the cleaned-up indexes and primary keys would perform well enough, and the tests bore this out, as you can see for the two year data load:

Userload   TPS     Errors   Avg Time (s)   Min Time (s)   Max Time (s)
2          2.4     0        0.833          0.115          2.985
4          4.22    0        0.947          0.088          3.239
8          9.99    0        0.8            0.04           3.099
16         23.79   0        0.672          0.06           8.821
20         23.26   0        0.859          0.136          9.006
When compared to the baseline, I got a very consistent level of performance, with the transactions completing in an average time of between 0.6 and 0.9 seconds. The design produced minimum response times down to 0.04 s (compared to around 0.7 s for the existing design). The maximum response time never exceeded 9 s, whereas for the existing design it exceeded 28 s for a single transaction.
I approached the test of the four year data set with some degree of trepidation, since the existing design had failed so spectacularly. I shouldn't have worried:

Userload   TPS     Errors   Avg Time (s)   Min Time (s)   Max Time (s)
2          2.38    0        0.84           0.345          2.694
4          4.73    0        0.844          0.049          2.697
8          9.8     0        0.816          0.042          3.329
16         22.96   0        0.698          0.076          11.532
20         22.2    0        0.902          0.108          14.196
This showed that, despite doubling the amount of data in the system, not only did I avoid deadlocks, but I got results that were entirely consistent with the two year data test, including that odd sweet spot at 16 users. The maximum times were a bit higher, but not alarmingly so. I went on to the eight year test feeling more optimistic:

Userload   TPS     Errors   Avg Time (s)   Min Time (s)   Max Time (s)
2          2.44    0        0.818          0.149          2.609
4          4.84    0        0.825          0.063          2.838
8          9.98    0        0.801          0.042          3.541
16         23.02   0        0.694          0.07           10.235
20         22.23   0        0.899          0.092          10.133
The results were the same. Use of two, four and eight years worth of data didn't cause a significant change in performance. I had captured PerfMon data as well and it showed how the load scaled as more users were added. This is from the eight-year test:

As you can see, as the number of users increased, CPU & disk I/O went up accordingly. The interesting moment occurs when going from eight to sixteen simultaneous users. The CPU goes quite high, but the disk queuing goes through the roof (more on this later).
If I hadn't had the consultant's design in hand, I would have declared the cleaned up design a success and walked away quite satisfied. But, I had to run the same tests against this new design.

The compound PK design

My initial estimate was that this new design might be about 10-20% faster, which quite honestly wouldn't justify the additional overhead of rewriting all the stored procedures with extra code, and so on. The two year data tests came back:

Userload   TPS     Errors   Avg Time (s)   Min Time (s)   Max Time (s)
2          19.15   0        0.104          0.038          2.091
4          30.5    0        0.13           0.038          2.354
8          30.39   0        0.262          0.05           7.015
16         26.9    0        0.594          0.102          8.012
20         27.01   0        0.74           0.09           11.412
I was a little surprised. OK, more than a little surprised. At the lower end, 2-8 users, this was running between four and eight times faster. However, for the higher user loads the performance was only 10-20% better – in line with my original expectations.
Something was going on here that merited a lot more investigation. I didn't believe we'd hit deadlocks above 8 users, but given the failure of the baseline case at the four year data load, I wasn't sure what to expect when I moved on to it:

Userload   TPS     Errors   Avg Time (s)   Min Time (s)   Max Time (s)
2          20.38   0        0.098          0.036          1.933
4          29.32   0        0.136          0.038          2.105
8          29.22   0        0.276          0.05           8.447
16         28.57   0        0.562          0.08           11.983
20         27.74   0        0.721          0.09           11.079
The results were again consistent with those obtained for the two year data load, with the same performance trend as for the cleaned up design, but this time with a slight drop-off in TPS performance at sixteen users. Nothing for it but to get the eight year data load results:

Userload   TPS     Errors   Avg Time (s)   Min Time (s)   Max Time (s)
2          15.99   0        0.124          0.038          2.004
4          27.55   0        0.145          0.038          2.789
8          29.29   0        0.276          0.044          6.723
16         29.83   0        0.536          0.078          8.834
20         29.14   0        0.687          0.103          8.695
These were also very consistent with the other results. Performance was decidedly better at low user loads, but only somewhat better at higher loads. The Perfmon data gave a better indication of why things were behaving as they were:

As you can see, disk queuing was quite variable but, on average, much lower than for the cleaned up design. It was the CPU that was getting the most use in this design. The strange decline in performance at 16 users corresponded directly to the CPU maxing out at about that point. Clearly, as CPUs were added or upgraded, this design would be faster than the original.
The cause of this radical increase in performance was found in the query plans.

The query plans

At the conclusion of the testing, a few things were very clear. The existing design was not only slow; its missing or improper clustered indexes were leading to table scans and unnecessary lock escalation, resulting in deadlocks. We could get adequate performance by simply cleaning up the indexes and stored procedures as they existed. But, when you can point to a fourfold increase in performance, you have to assume that it's worth the time and effort to achieve it. The only question remaining was: why did the new design outperform the old one?
The answer was provided by the respective query plans. In my switch-board design, the clustered indexes were being used quite well. However, because the primary key structure was independent of the clustered index, we were seeing bookmark lookups from the non-clustered primary key into the clustered index. This radically increases the disk I/O (for a more thorough examination see, for example, Randy Dyess' article on SQL Server Central).
However, the compound primary key structure eliminated this processing. My original design required, at a minimum, two indexes per table: the clustered index and the non-clustered primary key. The new design reduced this to one, which in turn reduced the actual size of the database noticeably. This is where the potential added cost of inserts, due to the wider key structure, actually ends up as a performance improvement, since only a single index is being maintained. In some cases we even saw the elimination of a clustered index scan, because the more selective index created by adding multiple columns was able to find the data more efficiently.
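To make the difference concrete, here is a hedged spot check of the kind that showed it (the id value is hypothetical):

-- In the switch-board design, a lookup through the non-clustered primary key produces
-- a Bookmark Lookup operator in the plan text to fetch the remaining columns; with the
-- compound clustered PK, the clustered index seek alone covers the same request.
SET SHOWPLAN_TEXT ON
GO
SELECT *
FROM dbo.DocumentLimit
WHERE DocumentLimitId = 42   -- surrogate, non-clustered PK in the original design
GO
SET SHOWPLAN_TEXT OFF
GO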

Conclusions

We decided to go with the compound key approach. It does add to the number of lines of code, and every time we do a review with a different DBA, the size of the primary keys starts them twitching. That said, it's scaling very well and looks to be a success. A concern was raised that the larger and more complex queries would compile slowly; testing again assured us that, while compilation was a bit slower, the cost was easily offset by the performance gains. We're now evaluating all new projects as they come on board to determine whether they need quick and easy development (the switch-board approach) or reliable speed and scalability (the compound key approach). Oh, and we've implemented more frequent reviews of databases that we've turned over to development, to ensure they don't deviate from the original design as far as this one did.
###
Grant Fritchey is a database administrator for a major insurance company, with 18 years' experience in IT. He has been working with SQL Server since version 6.0 back in 1995 and has also developed in VB, VB.Net, C# and Java. He is currently working on methods for incorporating Agile development techniques into database design and development at his company.
