T-SQL coding

Processing Rows from Very Large Tables

Processing rows in a VLT (very large table)…something that many of us encounter from time to time. About once a year I am asked either to write code that processes one or more VLTs or to performance tune code that someone else has written to do so. It happened again recently and made me think about writing this post. Last year I had to tune a set of insert statements for a data mart/reporting database with hundreds of millions of rows; one particular step of the process was taking almost 5 days to complete. Using the methods below I was able to speed that step up to about 7-8 hours.

I break down VLT processing into 5 basic coding methods which I will describe below. This post relates to inserts, updates, and deletes equally, and since I most recently dealt with this topic regarding deletes that is the point of view that I will take today. Just substitute INSERT or UPDATE in place of DELETE and most of this post will be applicable to you. It is applicable to a MERGE statement as well.

What makes a table a VLT? That definition is pretty much up to you and your SQL Server. To me a very large table (VLT) is one that consumes a large amount of resources to either select or change rows; that means a large number of rows or a large “footprint” in the database (a combination of row count and row length, i.e. a large amount of storage). The actual numbers vary from database to database, but I find that somewhere around 100 million rows is a good place to start. I usually find VLTs in data marts or reporting databases, but you may have one or more in your OLTP database as well.

It’s usually pretty straightforward to write the basic T-SQL statements to accomplish your goal (an INSERT, UPDATE, or DELETE against a VLT), and when you run a simple unit test it works. Most developers stop at this point, promoting their code onward and forgetting about it…until the DBA or a user complains that the process is taking too long, using too many resources, etc.

I usually get involved because of performance. The conversation usually starts with “my insert/update/delete is taking way too long to complete and I can’t figure out why”. It usually continues with “can’t you just add an index or something and make it run faster”? Well…it’s not that easy.

What follows are the 5 methods that I use to process VLTs. I will give you a piece of skeleton T-SQL logic that you can start with, since each situation will be specific to the VLT that you need to process. Each of these methods will get the job done, but each has its own set of pros and cons. They are presented in relative order of performance; however, keep in mind that your performance will vary based on the characteristics of your database and SQL Server configuration. When I am performance tuning I usually see either method 1 or 2 in the existing code (and it’s not performing well against the VLT!) and try one or more of methods 3-5 (or a combination of them!) to get the process moving much faster.

My example is this: I need to write some T-SQL to delete rows from my VLT. For simplicity let’s say there are only 2 tables involved, MyParentTable and MyChildTable. They are related to each other by a single foreign key column called ParentKey. In this example I have already deleted rows from MyChildTable and I am preparing to remove all rows from MyParentTable that are no longer needed. In my development environment the tables have a million rows each, but in production, 300 million rows each. In my database that makes it a VLT.
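For reference, here is a minimal sketch of the two tables as I am assuming them throughout this post; the column names beyond ParentKey are purely illustrative, and your real tables will have many more columns (and probably an explicit foreign key constraint).

-- Assumed, illustrative table definitions for the examples that follow
CREATE TABLE dbo.MyParentTable
(ParentKey int NOT NULL PRIMARY KEY,
 ParentData varchar(100) NULL);

CREATE TABLE dbo.MyChildTable
(ChildKey int NOT NULL PRIMARY KEY,
 ParentKey int NOT NULL);   -- logically references dbo.MyParentTable(ParentKey)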

Method 1 – (NOT) EXISTS

The code for this method usually looks something like this:

DELETE FROM p
FROM dbo.MyParentTable p
WHERE NOT EXISTS (SELECT *
                  FROM dbo.MyChildTable c
                  WHERE p.ParentKey = c.ParentKey);

This method is probably the easiest to code, but it also frequently produces the least efficient plan to perform the work. For my delete I want to remove all rows from my parent table where I don’t have a corresponding child row. This method works fine in my development environment since the tables are “regular” sized, but with a VLT the performance may bog down due to resource constraints on your server. Think about it: what if your server has 10 GB of free memory when you execute this, but the plan created requires much more than that due to the size of the tables involved? Your SQL Server will do the best that it can, but it must make use of TempDB or virtual memory or some other virtual resource to get the job done (google “SQL Server spills” if you want to learn about the technical details!). Using virtual resources or TempDB to complete sorting, joining, or even I/O operations will significantly slow down your process.

Another potential issue with this method is the amount of transaction log space required. Since there is only one DML statement (a DELETE in this case), the database engine will log the entire set of changes as a single transaction. If your database is using the full recovery model then the space utilized by these transactions (the virtual log files, or VLFs) will not be marked as available and cannot be reused until the entire operation is committed and the log is backed up. If your database is using the simple recovery model the same logic applies to the space utilized in the transaction log, although the space will become available to other transactions once the operation is committed. This means that your transaction log must be large enough, or have enough room to grow, to hold the entire operation. That transaction log growth will probably cause your DBA some pain and suffering, usually in the middle of the night. Trust me; you don’t want that to happen.
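If you want to see how close you are to that pain, a quick sanity check of log size and usage is the classic DBCC command below; it reports on every database on the instance.

-- Reports the log size (MB) and the percentage of the log currently in use
DBCC SQLPERF (LOGSPACE);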

I often see this method used by the developer and moved into production use, since it performed quite reliably in the development environment.

Method 2 – LEFT JOIN / key is null

The code for this method usually looks something like this:

DELETE FROM p
FROM dbo.MyParentTable p
LEFT JOIN dbo.MyChildTable c ON p.ParentKey = c.ParentKey
WHERE c.ChildKey is null;

This method is also easy to code, although it may not be the method that is apparent to the developer at first glance. I will often use this method in my code because I believe it to be (generally) more efficient than NOT EXISTS; I believe that the database engine works with sets more efficiently than the looping-type logic represented by method 1. In spite of this, the LEFT JOIN has performance problems similar to the NOT EXISTS method when VLTs are involved. We also have the same transaction log issue as discussed with method 1.

Method 3 – Temp table to hold keys

The code for this method usually looks something like this:

-- Make a temporary table to hold the keys
CREATE TABLE #KeepTheseKeys
(ParentKey int);

-- Grab the keys
INSERT INTO #KeepTheseKeys
SELECT DISTINCT ParentKey
FROM dbo.MyChildTable;

-- Delete the rows based on the temp table contents
DELETE FROM p
FROM dbo.MyParentTable p
LEFT JOIN #KeepTheseKeys c ON p.ParentKey = c.ParentKey
WHERE c.ParentKey is null;

This method is also fairly easy to code; it’s really just a variation of either method 1 or method 2 (I used method 2 in my example). The difference is that I have replaced my VLT with a temp table that holds only the keys. I have used this method successfully when MyParentTable is large, especially when the row length on MyParentTable is quite large. This method can put many more ParentKey values on a single database page, providing performance benefits. I will often try this method if the table sizes are just entering the “VLT” zone. Let’s not forget the transaction log…this method has the same issue there as the methods above.

One of the drawbacks to this method is that there is “setup” time involved before the delete begins; i.e. it takes time and resources to populate the temp table. This time needs to be taken into consideration when deciding if this method is right for your situation. My experience has been that this “setup” time is often a few minutes and is recouped by the faster performance of the DELETE statement.
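One small tweak I sometimes add as part of that setup (my own habit, not required by the method): index the temp table after it is populated so the keys are stored in order and the join back to MyParentTable has something to seek on.

-- Optional: index the temp table once it is loaded (adds a little to the setup time)
CREATE CLUSTERED INDEX IX_KeepTheseKeys ON #KeepTheseKeys (ParentKey);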

If this method doesn’t work fast enough for me then I know I will need to use either method 4 or 5. Both of these methods use a loop to complete the task at hand.

Loops you say? They aren’t as efficient as set-based operations!

As a general statement that is correct, but these are VERY LARGE TABLES; the performance problems that I see come down to the fact that there are not enough physical resources available to SQL Server, so SQL Server must spill to get the job done or, even worse, the operating system must make use of virtual resources. So your options at this point are to 1) ask for a larger server (probably won’t happen!), or 2) write the T-SQL code to process your rows in a more efficient manner based on the resources available. Methods 4 and 5 use loops to break the work up into “batches”; set-based operations are still used within each batch, and the batch size is determined through trial and error to find a fairly optimal size for the task that you need to accomplish.

Method 4 – Loop using a batch size

The code for this method usually looks something like this:

-- Setup the necessary variables for looping
DECLARE @BatchSize int = 100000,   -- change this value as appropriate
        @LoopSize int;
SET @LoopSize = @BatchSize;

-- Loop until there are no more records processed
WHILE @LoopSize > 0
BEGIN
    -- Delete the rows based on a set batch size
    DELETE TOP(@BatchSize) FROM p
    FROM dbo.MyParentTable p
    LEFT JOIN dbo.MyChildTable c ON p.ParentKey = c.ParentKey
    WHERE c.ParentKey is null;

    -- Update the looping variables and do it again
    -- This logic will end the loop once all rows have been processed
    SET @LoopSize = @@ROWCOUNT;
    SET @LoopSize = CASE WHEN @LoopSize < @BatchSize
                         THEN 0 ELSE @BatchSize END;
END;

This method processes only the number of rows represented by @BatchSize in each DML operation. If you were to write this type of loop against a “regular” size table your DBA would send it back to you, because it is not as efficient as method 1 or 2 above. However, we are now in the land of VLTs, so conventional wisdom doesn’t apply. This is a better method for processing VLTs than those above for a few reasons.

First, since you are selecting a subset of the entire set to work on at one time SQL Server can use a different plan to perform the DML operation. If you don’t believe me, try it out. Set your @BatchSize = 1 or 10 or 100 and check out the estimated plan produced. For VLTs, SQL Server will (in many situations) select a different plan (using an index seek, for instance) than method 1 or 2.
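A quick way to try that experiment without actually deleting anything is to ask for the estimated plan only. This is just a sketch; note that SET SHOWPLAN_XML must be the only statement in its batch.

SET SHOWPLAN_XML ON;
GO
-- The estimated plan is returned instead of executing the DELETE
DECLARE @BatchSize int = 100;
DELETE TOP(@BatchSize) FROM p
FROM dbo.MyParentTable p
LEFT JOIN dbo.MyChildTable c ON p.ParentKey = c.ParentKey
WHERE c.ParentKey is null;
GO
SET SHOWPLAN_XML OFF;
GO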

Second, because of the different plan a single iteration of your loop should execute fairly fast. This will depend upon the indexes available, and you will need to try out different values of @BatchSize to find your own optimal value. My experience with this method is that once you get above a certain batch size the database engine will revert to a plan much like the ones in methods 1 or 2; you don’t want to set your batch size that high. In addition, when the batch size becomes so large that all of the physical resources available to SQL Server are used, you will see a dramatic increase in the elapsed time of each loop. That is another clue that you need to decrease the batch size for optimal results.

Third, your worries about the transaction log should go away since each transaction is now much smaller. Just make sure that, if you are using the full recovery model, you back up the transaction log during the operation so that (at least some of) the log space can be reused. If you are using the simple recovery model then you will (in effect) reuse the same transaction log space over and over again. Your DBA will thank you for this…
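As a rough sketch (the database name and backup path below are placeholders), a log backup like this can be run every so many iterations of the loop, or from a regularly scheduled Agent job, while the process is running.

-- Placeholder names; back up the log periodically so committed VLFs can be reused
BACKUP LOG MyDatabase TO DISK = N'X:\SQLBackups\MyDatabase_log.trn';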

This method sometimes has a large negative associated with it (again, it depends on your table definitions and your DML operation). Since the DML statement uses “TOP n” logic, SQL Server often must select all of the possible rows, sort them, and then filter them by “TOP n”. Because of this filtering I often see the execution time of a single loop iteration grow as more and more of the entire set has been processed. For this reason I don’t often use this method.
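If you want to watch this per-iteration growth (or compare different @BatchSize values), one simple approach is to print the elapsed time of each batch from inside the loop. This is a hypothetical harness of my own, assuming the variables from the method 4 skeleton above.

DECLARE @BatchStart datetime2(3), @Msg nvarchar(200);
WHILE @LoopSize > 0
BEGIN
    SET @BatchStart = SYSDATETIME();

    -- ...the DELETE TOP(@BatchSize) statement from the skeleton above goes here...

    SET @LoopSize = @@ROWCOUNT;   -- capture immediately after the DELETE
    SET @Msg = CONCAT('Deleted ', @LoopSize, ' rows in ',
                      DATEDIFF(millisecond, @BatchStart, SYSDATETIME()), ' ms');
    RAISERROR (@Msg, 0, 1) WITH NOWAIT;   -- print progress without waiting for the batch to end

    SET @LoopSize = CASE WHEN @LoopSize < @BatchSize
                         THEN 0 ELSE @BatchSize END;
END;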

Method 5 – Loop using a key range

The code for this method usually looks something like this:

-- Setup the necessary variables for looping
DECLARE @Increment int = 100000,   -- change this value as appropriate
        @MinKey int,
        @MaxKey int,
        @MaxInLoopKey int;

SELECT @MinKey = MIN(ParentKey), @MaxKey = MAX(ParentKey)
FROM dbo.MyParentTable;

SET @MaxInLoopKey = @MinKey + @Increment;

-- Loop until you have processed all of the keys
WHILE @MinKey <= @MaxKey
BEGIN
    -- Delete the rows based on a key range
    DELETE FROM p
    FROM dbo.MyParentTable p
    LEFT JOIN dbo.MyChildTable c ON p.ParentKey = c.ParentKey
    WHERE c.ParentKey is null
    AND p.ParentKey between @MinKey and @MaxInLoopKey;

    -- Update the looping variables and do it again
    -- This logic will end the loop once all key ranges have been hit
    SET @MinKey = @MaxInLoopKey + 1;
    SET @MaxInLoopKey = @MaxInLoopKey + @Increment;
END;

This method processes your VLT using a subset of the entire range of key values at a time. It usually provides all of the benefits of the other methods above without the negative hit that you can take in method 4 from the “TOP n” logic. I have used this method several times with great improvement in performance for the entire operation. You may notice that this requires that the key column be numeric and the key values in numeric order; I have also used this method with a varchar key column by placing the keys into a temp table that also contained an identity column and using that identity column to control my looping. If you are still paying attention at this point you may recognize that this is essentially combining methods 3 and 5 together…yes, I have combined method 3 with method 4 or 5 before and used it successfully as well (a sketch of that combination follows below).

Don’t forget that the increment size in this method is much like the batch size in method 4; you will need to test with different values to arrive at the one that is best for your situation.
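To make that combination concrete, here is a minimal sketch of methods 3 and 5 together for the varchar-key case; the temp table #KeysToDelete and its column names are my own (hypothetical), and the idea is simply to stage the orphaned keys once and then loop over the identity range instead of the key values themselves.

-- Stage the keys to delete, numbered by an identity column
CREATE TABLE #KeysToDelete
(RowId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
 ParentKey varchar(50) NOT NULL);

INSERT INTO #KeysToDelete (ParentKey)
SELECT p.ParentKey
FROM dbo.MyParentTable p
LEFT JOIN dbo.MyChildTable c ON p.ParentKey = c.ParentKey
WHERE c.ParentKey is null;

-- Loop over the identity range rather than the key values
DECLARE @Increment int = 100000,   -- change this value as appropriate
        @MinId int = 1,
        @MaxId int;
SELECT @MaxId = MAX(RowId) FROM #KeysToDelete;

WHILE @MinId <= @MaxId
BEGIN
    DELETE FROM p
    FROM dbo.MyParentTable p
    JOIN #KeysToDelete k ON p.ParentKey = k.ParentKey
    WHERE k.RowId between @MinId and @MinId + @Increment - 1;

    SET @MinId = @MinId + @Increment;
END;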
