Managed Instance Gotchas – Error Conditions

I was recently on a project to migrate a very transactional installation of SQL Server to Azure SQL Managed Instance (MI). SQL Managed Instance is a good stepping stone between a full, on-prem SQL instance / Azure VM and an Azure SQL Database. It has most of the functionality of a full, on-prem instance, with management of the SQL engine, backups, OS and underlying hardware done by Microsoft. It allows you to use cross database queries and run SQL Agent jobs, with fewer limitations than Azure SQL Database migrations.

The migration process isn’t completely seamless. During the migration of this system, we encountered several surprises. Hopefully, this will help you avoid, or at least be prepared for these differences from the on-prem version. This also reinforces the importance of testing each aspect of your migration.

I was going to make this a single post, but as I was documenting the issues, it was clear that this needs to be broken into sections. This post focuses on error conditions. These are items that may stop, or severely impact, your migration. Post-migration surprises and management issues will be covered later.

DNS Aliases

If you plan on using DNS aliases to assist the transition to managed instances, know that it isn’t supported. There may be some way to hack a solution, but that’s generally not a good idea. We tried to use aliases to allow clients / external systems to make the DNS name transition before our production migration. To put it simply, this was a failure. The Azure environment blocks these aliases. You can only use the Microsoft assigned DNS entry. There are likely also certificate issues if using your own alias. You may be able to hack some solution, but that would be fragile at best, and unsupported.

Plan on client applications migrating at the same time as your server migration. In a best-case scenario, the number of clients hitting your system directly will be limited. But be ready for this coordination when you migrate.

Zone Redundancy

An attractive aspect of migrating to MI is the high-availability options available. Zone redundancy synchronizes your server between zones in Azure (refer to the references for detailed zone redundancy information). This is the high-availability option. If a datacenter has issues, your server will fail-over to another zone in your region, when using zone redundancy.

We had significant problems with zone redundancy. After our initial deployments, we enabled zone redundancy in our dev environment. This was successful. After testing, we tried to enable zone redundancy in production. Because server and CPU resources weren’t available in this region (during this deployment), all of our deployments failed. Even environments that were established failed to deploy with no changes.

After many discussions internally, talking to our Microsoft liaisons, and opening a Microsoft ticket, we found that the documentation was incomplete. Zone redundancy was not available in our region and likely will not be available until the 4^th quarter of this year. We had this error even though we had successfully enabled zone redundancy in one of our environments.

This is error we saw in our DevOps environment during our attempted deploy:

{“code”:”ManagedInstanceZoneRedundantFeatureTemporarilyDisabledPerHardwareGeneration”,”message”,”Creation of ZoneRedundant SQL MI is temporarily disabled for this hardware generation on this region. You can create Zone Redundant SQL Managed Instances with different hardware generation on this region or many other regions in Azure...”}

The lesson learned from this is, confirm that the options you want to use are available before you plan your migration. The documentation is incomplete. Demand for services in Azure is very high and documented resources may be inaccurate. Consult your Azure representative to verify your plan. Even if it works once, you may be unable to deploy later due to these limitations.

TDE Enabled Databases

Moving to Azure Managed Instances is a commitment. If you want to have all of the benefits of a managed server, that probably includes having Microsoft manage the keys for your TDE (encryption at rest) enabled databases. The initial transfer of your databases is a little complicated if it’s TDE enabled. You either need to manage your own keys, which complicates things considerably in the long run, or make the switch to Microsoft managed keys.

If you continue to use your own keys, you need to manage them in Azure SQL MI. This negates some benefits of migrating to MI (primarily, key management and unmanaged PITR restore options between environments). Assuming you don’t want to manage keys, the steps to migrate a TDE enabled database are the following:

Disable TDE on the source database
Backup the source database
Copy the backup to an MI accessible location (blob storage)
Restore the backup
Enable TDE on the MI
Delete the unencrypted backups
Re-enable TDE on the source system (if the source database is not immediately deleted)

Using Microsoft managed keys allows Point in Time Recovery (PITR) to become seamless between environments, including databases protected with TDE, and key management is handled for you. The extra steps are well worth the migration effort. Remember to allow extra time in your migration for the decryption / encryption steps. This will vary considerably depending on your database size, MI performance tier, and source hardware.

Traditional Backup / Restore

Backup and restore options when using MI are limited. Point in time restore (PITR) is the recommended method and provides the most functionality. It allows restores between different databases on the same server to be easily synchronized. One of the primary reasons for moving to MI instead of individual SQL Databases is cross database queries and dependencies between databases. PITR makes this easier.

You can still use the traditional backup / restore process in MI environments, with limitations. You need to use a COPY_ONLY backup. To limit scope of our migration effort, we continued to use our old backup / restore process between environments using a COPY_ONLY backup scenario. This worked most of the time. Most of the time is a problem with things like backups and restores. Some of our restores failed. Since our databases were interdependent, one database failing left the entire environment unreliable and potentially unusable.

Our failure and diagnosis looked like the following:

Copy Only restore attempted
- Failed after an extended period of time with “Unable to cycle error log” message
  - Unable to cycle error log file from <FileLocation> to <FileLocation> due to OS error ‘32(The process cannot access the file because it is being used by another process.). A process outside of SQL Server may be preventing SQL Server from reading the files.
- The restore may work on different servers / environments
- Database restores for the same server may work for different databases
Errors start to show up in the server error log soon after the restore starts (well before the error message is returned to the restore process)
- Page corruption
  - RESTORE detected an error on page (1:12345) in database <GUID> as read from the backup set.
  - [CorruptedPageList: InsertEntryImITS]: Page (1:45678) in database <GUID> is suspect
  - Error: 3183, Severity: 16, State: 1
DBCC CHECKDB run on source database
- No issues
- This was the nightmare scenario – needed it to be ruled out as soon as possible
RESTORE VERIFYONLY run against backup file
- No issues
- Backup was able to be restored to a different server

The restore process didn’t fail on every server. This, in addition to checking the source with CHECKDB and the restore with VERIFYONLY, indicated that it was a problem with the destination server. This item also involved opening a Microsoft ticket after checking with our Microsoft architect. The issue seemed to be I/O buffer related. Using the following options during the restore fixed the issue:

Restore options added
- BLOCKSIZE = 65536
- MAXTRANSFERSIZE = 4194304

This fixed the issue, but PITR is the recommended method for SQL MI backups and restores. It’s what you should plan on using. Using a traditional, COPY_ONLY, backup method is technical debt. The first question asked by every level of support was, “Why aren’t you using PITR?”. Move to PITR as soon as possible for all of your MI processes.

SQL Agent Job Errors

The final blocker for us to migrate to SQL MI was SQL Agent job errors. In our non-production environments, we had errors requiring investigation and changes to the servers by Microsoft. They were intermittent issues, making fixing this very difficult. Resolving the errors required us to open Microsoft support tickets.

The Agent job issue was relatively easy to fix, since our Microsoft architect was able to find it in the internal support database. That database is not available to customers, so it was only accessible by our support team.

1	Unable to connect to SQL Server ‘tcp:localhost,11000’. The step failed

We don’t interact with localhost, so it was clear it wasn’t our issue. We also don’t specify port 11,000. We also saw ports above 11,000 (i.e., localhost,11007) – which we also don’t specify. (both of these items were verified by querying the job step table, dbo.sysjobsteps, in msdb). The fix for this was out of our hands. The Microsoft team had to roll back an ODBC driver on the server. After the driver was reverted, this error disappeared.

Agent Jobs are one of the reasons for selecting SQL MI, so this was very disruptive. We will eventually move to a different technology, such as Elastic Jobs or ADF, but be sure thoroughly test your jobs under load.

Summary

SQL Managed Instances are attractive to enterprises that don’t want to be responsible for all of the maintenance of a full SQL Server installation. That includes backups and engine maintenance. We encountered a few issues that would have stopped our migration if we hadn’t tested first or had help from the various Microsoft teams.

We were able to hit our deadlines and had a successful migration, but – it wasn’t a smooth process. There are a few items that may hamper your migration. If you know about these issues before you begin, you are more likely to be successful. Test thoroughly, talk to the technical team about hardware availability in your region, and be ready to open tickets to resolve issues. And check your error logs during and after the migration. I’ll discuss other differences and surprises in SQL MI in later posts.

Register for Simple Talk

Managed Instance Gotchas – Error Conditions

DNS Aliases

Zone Redundancy

TDE Enabled Databases

Traditional Backup / Restore

SQL Agent Job Errors

Summary

References

Article tags

About the author

Ben Johnston

Ben's contributions

Articles

Books

Top topics

Ben's latest contributions:

SQL Server 2022 CU Updates and Error Code 0x851A0043

A Use Case for Memory-Optimized Tempdb Metadata

Writing an Efficient Query

DNS Aliases

Zone Redundancy

TDE Enabled Databases

Traditional Backup / Restore

SQL Agent Job Errors

Summary

References

Article tags

Recommended

About the author

Ben Johnston

Ben's contributions

Articles

Books

Top topics

Ben's latest contributions:

SQL Server 2022 CU Updates and Error Code 0x851A0043

A Use Case for Memory-Optimized Tempdb Metadata

Writing an Efficient Query