
Boosting SQL Server Backup/Restore Performance: Threads and Parallelism

markwdev
9 days ago

Deploying faster backup storage is a huge step forward, but if you don’t tune your database hosts, you’re leaving a lot of performance on the table.

In this post, we’ll discuss day-one tuning you can do on your database hosts to take full advantage of your new high-performance backup storage. We’ll go over a few tricks around database layout and backup configuration for maximum throughput, cover some quirks with SMB, and finally look at using S3 effectively.

Threads and Parallelism

First things first, let's talk about how SQL Server allocates worker threads for backup and restore operations. It’s no secret that a lot of SQL Server’s power comes from being able to process tasks in parallel. Backup and restore is no different. Using multiple threads, SQL Server is able to parallelize these operations and maximize potential throughput. 

Two main factors can control this parallelism:

  • Volume count: A worker thread is allocated for each volume that contains SQL Server data files for the database being backed up/restored.
  • Number of backup target files: When a backup is striped across multiple backup files, a new worker thread is allocated per backup file.

This is true for both backup and restore operations. In the case of a backup, a new reader thread is allocated per volume, and a new writer thread is allocated per backup file. The reverse is true for a restore, with each backup file being allocated a reader thread, and each volume allocated a writer thread.
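The rule of thumb above is easy to sketch out. Here's a tiny Python illustration (the `backup_thread_count` helper is made up for this post; actual thread scheduling is up to SQL Server):

```python
# Back-of-the-envelope I/O thread estimate for a SQL Server backup,
# based on the rules above: one thread per data volume holding the
# database's files, plus one thread per backup target file.
def backup_thread_count(data_volumes: int, backup_files: int) -> int:
    """Readers (one per volume) plus writers (one per backup file)."""
    return data_volumes + backup_files

# One volume, one backup file: a single reader and a single writer.
print(backup_thread_count(1, 1))  # 2
# Two volumes striped across four backup files: six I/O threads.
print(backup_thread_count(2, 4))  # 6
```

For a restore, the same count applies with the roles swapped: readers on the backup files, writers on the volumes.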

Simple backup operation

Let’s start with a simple configuration and see what SQL Server does. Here, we have an instance with a single data volume and a single backup target. This backup target could be SMB, S3, or even a local disk on the same host.

Figure 1: Simple backup operation with a single reader and a single writer.

In this example, a single reader thread is allocated to handle I/O for the single data volume, and a single writer thread is allocated to handle I/O to the single backup target. This configuration is not going to have a ton of throughput, but in some cases, that is absolutely fine, or maybe even necessary.

If you’re working in a highly constrained environment, you simply may not have enough resources to increase throughput. But, when higher throughput is required, and you have the resources to handle it, increasing SQL Server backup/restore parallelism can significantly improve performance.

Increasing backup/restore parallelism

This next example is a bit more complex. Here, we’ve added a second data volume and spread our data files over both volumes. We’ve also added three additional backup targets (files or objects), bringing the total to four. 

Figure 2: Parallelized backup operation with two readers and four writers.

In this example, we’re going to end up with quite a few more threads to handle our I/O. Since we get one thread per data volume and one thread per backup file, we now have six threads handling this backup operation. While this isn’t going to be 6X the performance, you’ll definitely see a performance boost here.

The amount of that boost is going to depend a lot on the environment and its limits. If your primary storage is already under pressure, adding more data volumes is likely not going to increase performance. Similarly, adding more backup targets when the network is already completely saturated, or the backup storage is overloaded, is not going to increase performance. 

Considerations when using SMB

If you’re running SQL Server on a Windows server and sending backups to an SMB share, there’s another knob you can use to further tune performance. 

When Windows connects to an SMB share, it first checks whether it already has an active session for that SMB host. If it finds one, it reuses it for all SMB operations; if not, it sets one up. That session services any and all SMB traffic for that unique client/server pair. The session has its own data queue and handles operations as fast as it can. Oftentimes, this session becomes a bottleneck because Windows simply cannot service the SMB commands fast enough. When that happens, the queue climbs, and throughput suffers.

Fortunately, there’s a simple way to get around this limitation: create more DNS names for the SMB host. Adding additional DNS names for the SMB host and referencing those names in your backup target file paths forces Windows to set up dedicated sessions per DNS name. The end result is more sessions, more queues, and potentially more throughput.

The diagram below helps illustrate what we’re talking about. Here, we have two DNS names (SRV01 and SRV02) for the same SMB host, and we’re writing backup files to both DNS names. They’re ending up on the same SMB share, but I/O is being managed by two separate SMB sessions.

Figure 3: SQL Server backup thread parallelism with SMB backup targets.

Now, there is surely a ceiling to what this method can provide you, so you should really test it to find the number of DNS names that works best for you. I would suggest testing with one, two, four, six, and eight to start, just to get a feel for what’s possible in your environment. Don’t forget to alter your backup scripts to take advantage of these new DNS names as well; otherwise, SQL Server will not know to use them. Here is an example of what I am talking about. The following script writes a compressed backup striped across four files. Each of the four file paths references a different DNS name for the same SMB host and share:

BACKUP DATABASE [MyBigDB] TO
    DISK = '\\srv01\backups\db_01.bak',
    DISK = '\\srv02\backups\db_02.bak',
    DISK = '\\srv03\backups\db_03.bak',
    DISK = '\\srv04\backups\db_04.bak'
WITH INIT, COMPRESSION;
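If you script your backups, you can generate the stripe list rather than hand-typing it as the alias count changes during testing. Here's a small Python sketch; the alias names, share name, and the `striped_backup_sql` helper are all placeholders for whatever your environment uses:

```python
# Sketch: build the striped DISK clauses for a BACKUP DATABASE statement,
# spreading the stripes across several DNS aliases for the same SMB host.
def striped_backup_sql(db: str, aliases: list[str], share: str = "backups") -> str:
    disks = ",\n".join(
        f"    DISK = '\\\\{alias}\\{share}\\db_{i + 1:02d}.bak'"
        for i, alias in enumerate(aliases)
    )
    return f"BACKUP DATABASE [{db}] TO\n{disks}\nWITH INIT, COMPRESSION;"

# Four aliases -> a four-file stripe, one DNS name (and SMB session) each.
print(striped_backup_sql("MyBigDB", ["srv01", "srv02", "srv03", "srv04"]))
```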

When making these changes in your environment, make sure to keep an eye not only on backup throughput, but also on server load, network load, and the load on your target storage. Speed only matters if your systems stay online.

Monitoring queue length

While you’re testing, you can also actively monitor your SMB session queue lengths with some simple PowerShell code:

Get-Counter -Counter '\SMB Client Shares(*)\Avg. Data Queue Length' `
            -SampleInterval 5 `
            -Continuous

While queue length is not a definitive signal that SMB session management is a bottleneck, it’s a good leading indicator. This metric should be monitored over time, along with backup throughput, to understand the impact as you add more DNS names. 

Considerations when using S3/object

While there aren’t a lot of settings you can change to speed up backup/restores with S3, there are some important limitations and behaviors that should be considered when using S3 as a backup target. Throughout this section, I’ll be referring to the backup target as an “object” instead of a file since S3 is an object store.

Note: S3 does not benefit from using multiple DNS names like SMB does. S3 utilizes the Multipart API to parallelize the workload.

When writing backups to S3, SQL Server uses the S3 Multipart API, which splits each backup object into smaller parts for parallel writing. If multiple backup target objects are specified, SQL Server spawns multiple threads per backup object, and then even more threads to handle the object parts in parallel, enabling highly parallel upload operations and significantly improving performance.

Figure 4: SQL Server backup thread parallelism with S3/object backup targets.

There are a few limitations around the S3 Multipart API and SQL Server that you should be aware of:

  • The MAXTRANSFERSIZE (MTS) backup setting determines the size of the parts each object is broken into (maximum value of 20MB, default of 10MB).
  • If using non-default values for MTS, compression is required. 
  • A given backup object can be broken up into a maximum of 10,000 parts.
  • SQL Server supports a maximum of 64 backup targets in a backup set.

Based on these limits, a backup set tops out at ~12.2TB, with an individual backup object limit of ~192GB.
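Those ceilings fall straight out of the numbers above. A quick back-of-the-envelope check (binary units; exact published figures vary a bit with rounding):

```python
# Rough check of the S3 backup-set ceiling implied by the limits above.
MTS_MIB = 20           # MAXTRANSFERSIZE maximum
MAX_PARTS = 10_000     # parts per backup object
MAX_OBJECTS = 64       # backup targets per backup set

per_object_mib = MTS_MIB * MAX_PARTS        # 200,000 MiB per object
per_set_mib = per_object_mib * MAX_OBJECTS  # 12,800,000 MiB per set

print(round(per_object_mib / 1024, 1))      # ~195.3 GiB per backup object
print(round(per_set_mib / 1024 / 1024, 1))  # ~12.2 TiB per backup set
```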

Performance tuning backups is a balancing act, and S3 brings one more thing to the party: object parts. With everything else we’ve discussed, you’re trying to keep SQL Server thread consumption under control, respect network constraints, and be nice to your backup storage. With S3, you also have to keep an eye on how many parts your backup objects are being split into.

More object parts means more load on storage. So a perfectly tuned setup that was working fine with SMB may not work as well with S3 without further tuning to balance these factors. In the limited lab testing I did, 5MB MTS had a noticeably different wait profile, was overall slower, and put more load on the storage. Like with anything, though, you need to test to see which settings work best for you.

Examples and errors

Let’s walk through a quick example scenario to better understand how SQL Server interacts with the S3 Multipart API and the types of errors that can be encountered. Assume we have the following:

  • Single DB instance
  • Single 1TB database
  • Average backup set size of 530GB (compressed)

To successfully write a backup set this large (and stay under the limits of S3), the following calculation has to be true:

<total backup set size> / <number of backup objects> / MAXTRANSFERSIZE <= 10000

For example, assume that MTS is set to 20MB and the backup is striped across four objects:

530GB / 4 objects = 132.5GB per backup object

132.5GB / 20MB MTS = 6,784 parts per backup object

In this case, we come in well under 10,000 parts per object, and the backup would complete successfully.

If MTS was instead set to 10MB, and the backup was still striped across four objects, the story would be a bit different:

530GB / 4 objects = 132.5GB per backup object

132.5GB / 10MB MTS = 13,568 parts per backup object

We’ve gone past our 10,000 limit, and the backup would fail. What might surprise you is how it fails and how long it takes to fail.
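The two scenarios above are easy to check programmatically. Here's a small Python sketch of the parts-per-object formula (the `parts_per_object` helper is made up for this post; it uses 1GB = 1024MB to match the worked numbers):

```python
import math

# Parts-per-object check: <set size> / <object count> / MAXTRANSFERSIZE,
# rounded up, must come in at or under the 10,000-part S3 limit.
def parts_per_object(set_size_gb: float, objects: int, mts_mb: int) -> int:
    per_object_mb = set_size_gb * 1024 / objects
    return math.ceil(per_object_mb / mts_mb)

print(parts_per_object(530, 4, 20))  # 6784  -> under 10,000, backup succeeds
print(parts_per_object(530, 4, 10))  # 13568 -> over 10,000, backup fails
```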

When you attempt this backup, you’ll eventually see an error in whatever tool you’re using to execute the backup (like maybe SSMS, VS Code, or the SQL Server Agent):

Write on "s3://srv01/backups/BIG_Backup1of4.bak" failed: 1117(The request could not be performed because of an I/O device error.)

This sounds sort of scary, like maybe your backup storage failed, and it didn’t happen right away, so that backs up that theory. However, if you check the SQL Server error log, you’ll see a slightly more useful error:

Write to S3 object storage device s3://srv01/backups/BIG_Backup1of4.bak failed. Device has reached its limit of 10000 allowed parts.

So the backup device did not fail; we just went over our 10,000-part limit. But this error didn’t happen right away. Since the S3 API does not know the size of the object being sent, it has no idea how many parts are coming its way. So it happily accepts an upload that is bound to fail. 

From a SQL Server perspective, the server will start the backup process and begin breaking the backup objects into parts and sending them down the wire. Only when SQL attempts to send more than 10,000 parts for a given object does the error get thrown. What this means is that in the case of our 530GB backup set with a 10MB MTS, SQL Server would send ~400GB of data before the failure occurs. Each of the four backup objects would have gotten to ~10,000 parts before the failure, and each part is 10MB.
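That ~400GB figure is just the four objects each hitting the part limit with 10MB parts:

```python
# Rough estimate of data uploaded before the 10,000-part failure:
# every one of the four objects reaches the part ceiling first.
objects, part_limit, mts_mb = 4, 10_000, 10
sent_mb = objects * part_limit * mts_mb
print(sent_mb)  # 400000 MB on the wire before the error is thrown
```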

This surely brings up some questions in your mind:

  1. Doesn’t SQL Server check to see if the backup set is compatible with S3 before firing 400GB of data across the network?
     No, it does not. This is up to you, the DBA, to keep track of.
  2. What happens to all that data? Is it just getting dumped on the data center floor?
     No! That would be wasteful. Instead, it’s stored on your backup storage target.

The S3 Multipart API does support resuming a broken upload, so most S3 providers will keep the data around. I think that makes sense and is a nice quality-of-life feature. SQL Server, however, does not support resuming uploads, and in this case, it wouldn’t really matter since the backup set isn’t compatible.

In this case, there is 400GB of storage consumed that you will need to manually clean up. The process varies by S3 provider, but there is typically an API to find and clean up orphaned parts, or a policy-based option to remove them after a certain amount of time.

Checking your environment

Nobody likes being paged in the middle of the night for failed backups, so here is some sample T-SQL code you can use in your environment. This code could simply be run ad hoc to determine which databases might need some changes before backing up to S3, or it could be incorporated directly into your backup processes to alert you before failures start happening.

This script takes four parameters:

  • MaxTransferSize: The MTS you plan to use in your environment.

  • Threshold: This is a percentage of the 10,000 part limit. It defaults to 0.90 (90%). Once the estimated parts count breaches this threshold, it will be flagged for attention.

  • DatabaseName: If desired, this can be run for a single database. This might be a good option if this code was incorporated into a backup process.

  • MonitorMode: This is a bit parameter. If set to 1, it will only return rows if databases are found that breach the threshold. This would be ideal when integrating with your existing backup processes.

DECLARE @MaxTransferSizeMB INT = 10,
        @Threshold NUMERIC(3,2) = 0.9,
        @DatabaseName SYSNAME = NULL,
        @MonitorMode BIT = 0,
        @1MB INT = 1048576; -- For simpler math

SELECT  bs.database_name AS DatabaseName,
        buf.backup_objs AS BackupObjectCount,
        IIF(bs.compression_algorithm IS NULL,0,1) AS Compressed,
        bs.compressed_backup_size / @1MB AS BackupSizeMB,
        ( bs.compressed_backup_size / @1MB ) / buf.backup_objs AS BackupObjectSizeMB,
        buf.EstObjectParts,
        IIF(buf.EstObjectParts > 10000,
            'Incompatible', IIF(buf.EstObjectParts > (10000 * @Threshold),
            'Needs Attention' , 'Compatible')) AS S3Compatibility,
        CEILING(10000 * @Threshold) AS Threshold
FROM    [msdb].[dbo].[backupset] AS bs
        CROSS APPLY (
            SELECT  COUNT(*) AS backup_objs,
                    CEILING((bs.compressed_backup_size / COUNT(*)) / (@MaxTransferSizeMB * @1MB)) AS EstObjectParts
            FROM    [msdb].[dbo].[backupmediafamily] AS bmf
            WHERE   bmf.media_set_id = bs.media_set_id
        ) AS buf
        CROSS APPLY (
            SELECT TOP(1) bs2.backup_set_id
            FROM msdb.dbo.backupset AS bs2
            WHERE bs2.database_name = bs.database_name
            AND bs2.type = 'D'
            ORDER BY bs2.backup_finish_date DESC
        ) AS TopSet
WHERE   bs.backup_set_id = TopSet.backup_set_id
        AND (
            @DatabaseName IS NULL
            OR bs.database_name = @DatabaseName
        )
        AND (
            @MonitorMode = 0
            OR buf.EstObjectParts > (10000 * @Threshold)
        )
ORDER BY DatabaseName;

Note: If your database backup set is close to the threshold, you’re using compression, and MAXTRANSFERSIZE and object count are already at the maximum, there really isn’t much you can do if you want to use S3 as your backup target. 

Conclusion

Fast backup and restore is more complex than just faster storage. Tuning backup and restore is a balancing act between controlling resource consumption and meeting your RPO and RTO targets. The topics discussed here, when paired with a high-performance backup target like Everpure™ FlashBlade//S™, give you a lot of options when finding that balance. 

I would highly encourage you to walk through your processes and configurations and see if there is anywhere throughput can be increased safely. Remember, most of the methods in this post apply to restores as well, so the more throughput you can get, the faster your restores can be. 
