Cloud Data Management
Backup 10,000 SQL Server databases in hours
Companies that use Microsoft SQL Server tend to increase the number of databases horizontally. For newer versions of SQL Server, the maximum number of databases on a single instance is 32,767 – and it’s common for customer needs to exceed that limit. Rubrik, the Zero Trust Data management company, supports multiple organizations approaching the 10,000 database mark on a single host. As SQL Server moves more and more into the enterprise segment, one has to deal with the problem of managing and securing these servers on a large scale. The biggest problem with this scale is the fuse. Is it possible to perform daily backups of 10,000 databases on a single Windows server?
Rubrik has placed 4,000 small SQL Server databases on a single host and decided to back them up one by one using Microsoft’s standard APIs and the SQL Server Writer to see where there are optimization opportunities. It took about a month to complete the backup process. The team then implemented batch processing, where they backed up a batch of databases at once. The team varied the batch size up to 200, but couldn’t complete the backup in less than a day.
Rubrik got to the bottom of the problems and discovered three areas that could be optimized:
- The first task of backing up a large number of databases is to tabulate the metadata of all databases in the system. If there are many databases on the host, the current method is likely to take more time because it collects metadata about all databases on the host, including those that are not backed up. In the above experiment, the team found that this task can take more than an hour.
- The time required to collect the metadata of all databases could be reduced if a large number of databases could be backed up at once and the metadata would not have to be collected several times. However, Rubrik could not increase the batch size beyond 200 databases. For a backup of 10,000 databases, the metadata would have to be collected 50 times, which would take an hour each. This would defeat the goal of creating daily backups.
- Finally, any mechanism that copies data from a Windows host to a backup destination must scale with a large number of files. A host with 10,000 databases will have at least 20,000 files, and in such a situation, the copy mechanism should scale with a large number of files on a host.
Two innovations in the SQL Server backup solution help overcome the challenge:
- Scalable Snapshot service: Rubrik has developed a new database snapshot mechanism that captures and freezes only the databases to be backed up, takes a snapshot of the volumes in which the databases reside, “thaws” those databases, and performs post-snapshot billing. In addition to being very fast, this mechanism optimizes the number of threads and the amount of memory used, allowing the use of a large batch size of 400.
- Parallel Copy Writer: Analysis showed that there was significant overhead associated with each individual file retrieval per database. Therefore, Rubrik has developed a new parallel copy writer to recoup the overhead with a single group retrieval function that can retrieve a large number of files with a single RPC. This allows for a more efficient mechanism for transferring files from the Windows host to a backup destination and maximizes bandwidth usage for scenarios with a large number of small files.
To test the solution, Rubrik measured the time it took to back up 100 to 10,000 test databases with an average size of less than one GB on a four-core Windows host with 128 GB of memory. This was done with both Microsoft’s standard infrastructure and the new Rubrik solution. It took less than six hours to back up 10,000 databases with the solution.