Database Updates: Roll Back or Fix Forward?
Phil Factor's take on the best ways to reverse or accelerate your way out of the problems caused by a failed database update.
- When the process of updating the database is halted during its execution
- When the release succeeds, but a problem occurs after the new version has ‘gone live’
- Designing database releases defensively
Database releases are a source of anxiety for all the participants, because the entire organization can suffer if things go wrong. Even an apparently minor release has been known to have disastrous consequences. What is the best way of dealing with a failed release? As so often with databases, it depends.
When the process of updating the database is halted during its execution
When a problem, such as a SQL runtime error or a connection termination, occurs during the execution of a database update, all the changes will be rolled back automatically, provided that you run the update operation within an explicit SQL Server transaction.
If not, then the failure could leave the database in an indeterminate state, and the only possible courses of action are either to ‘rollback’ all the changes manually, using a script, or to revert the database to a SQL Server database snapshot or backup, taken before the update started.
Using a transaction to update the database
The rollback of a transaction is well-understood in SQL processes, and it is one of the more valuable qualities of a relational database. When the process of updating the database occurs within an explicit (multi-statement) transaction, all changes either succeed completely and are committed or, if some error occurs during the transaction, are rolled back completely.
This transaction rollback is essential to maintaining database integrity because, if an error happens, all the changes performed within the transaction are played back in reverse order, undoing the changes, and so the database is restored to the consistent state it was in when the transaction started.
It is not just the changes in data that can be rolled back but, with one or two exceptions, also the changes to the metadata. In other words, the DDL that alters the database itself, rather than just the data within it, can be rolled back. In some cases, this relies on the correct isolation level being set and appropriate connection settings such as XACT_ABORT.
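As a minimal sketch, a migration step might look like the following; the table and column names are hypothetical, and XACT_ABORT ON ensures that any runtime error aborts the batch and rolls back everything, DDL included:

SET XACT_ABORT ON;
BEGIN TRANSACTION;
    ALTER TABLE dbo.Customer ADD PreferredName nvarchar(50) NULL;
    -- dynamic SQL, so the batch compiles even though the new column
    -- doesn't exist until the ALTER TABLE above has run
    EXEC (N'UPDATE dbo.Customer SET PreferredName = FirstName;');
COMMIT TRANSACTION;
-- if any statement fails, the whole transaction, ALTER TABLE included, is rolled back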
SQL Compare, SQL Change Automation (SCA), and any other competent database synchronization or deployment tool, will normally use a transaction, with the isolation level set to SERIALIZABLE, or occasionally READ COMMITTED (in tandem with other checks, such as on @@TRANCOUNT), when running a deployment script. In the former case, it prevents all read phenomena, meaning that no other user sessions can see any effects of the changes to the data, nor make any changes to it, until the transaction commits or rolls back. These tools will also help deal with error handling to ensure a clean roll back, as sketched below.
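A much-simplified sketch of this wrapper pattern follows; the real generated scripts set many more session options and include fuller error handling, and the ALTER TABLE here is just a hypothetical stand-in for the generated changes:

SET XACT_ABORT ON;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;
    ALTER TABLE dbo.Widget ADD Colour nvarchar(20) NULL;  -- the generated changes go here
GO
-- an error under XACT_ABORT aborts the batch and rolls back, so a check in a
-- later batch confirms the transaction is still open before committing
IF @@TRANCOUNT > 0
    COMMIT TRANSACTION;
ELSE
    RAISERROR (N'The database update failed and was rolled back.', 16, 1);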
So, if SQL Compare or SCA modifies a table, and then the changes are rolled back, what happens to other users who are accessing that table? In fact, there will be no other users accessing the table, because the schema modifications that Compare performs require acquisition of a table-level schema modification (Sch-M) lock, which can only be acquired if no other sessions are accessing the table, and once it’s acquired no other session can access the table for its duration. It is a resilient and well-used system that underlies the practice of making updates to the live production system rather than having to take the system offline for the release.
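You can observe this for yourself; the following sketch, which assumes a hypothetical dbo.Widget table, shows the Sch-M lock held for the duration of the transaction:

BEGIN TRANSACTION;
    ALTER TABLE dbo.Widget ADD Notes nvarchar(100) NULL;
    -- while this transaction is open, the session holds a schema-modification lock
    SELECT resource_type, request_mode, request_status
    FROM sys.dm_tran_locks
    WHERE request_session_id = @@SPID AND request_mode = 'Sch-M';
ROLLBACK TRANSACTION;  -- undo the change; other sessions touching the table were blocked meanwhile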
Without an explicit transaction
Occasionally, you can’t use an explicit transaction when modifying a database. For example, you can’t issue CREATE/ALTER/DROP DATABASE, or CREATE/ALTER/DROP FULLTEXT INDEX commands, within an explicit transaction. If so, then you have some rather more painful alternatives. The extent of the complications depends on whether the database release is being done with the database offline, or with the database and applications online and being used.
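You can verify the restriction easily enough; this attempt, for example, fails immediately:

BEGIN TRANSACTION;
    ALTER DATABASE CURRENT SET RECOVERY FULL;
    -- Msg 226: ALTER DATABASE statement not allowed within multi-statement transaction.
ROLLBACK TRANSACTION;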
Using a ‘rollback’ script
If you cannot use a SQL transaction when deploying a new version of a database, you can use a ‘rollback’ script, so-called because, if it is effective, it mimics the end result of a ROLLBACK TRANSACTION statement.
It is a good practice to have a well-tested rollback script prepared alongside every migration script. The ‘rollback’ script will check all the individual changes in the corresponding migration script and remove them, if they’ve been made, to revert the database to the previous release, while preserving the existing data.
You can use a tool such as SQL Compare to get a first cut of the rollback script, by comparing the release version of the database, as source, with the current production version as target. This will, at least, deal with the transactional nature of the script, and error handling. You will then need to test and amend the script, as required, to ensure that both the schema and data are returned to their original states.
The insertion of a column in a table, for example, will need a corresponding rollback script that removes the column, if it exists, after moving any data changes in the column into an appropriate place in the previous version. Like a migration or synchronization script, it moves the database from one version to another, but in this case the database may be in an indeterminate state between the two versions, which is why rollback scripts must be idempotent: they must check what has and hasn’t been done, as in the sketch below.
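As an illustration, an idempotent rollback step for that new column might look something like this; the table, column, and destination names are all hypothetical:

IF COL_LENGTH('dbo.Customer', 'PreferredName') IS NOT NULL
BEGIN
    -- preserve the column's data somewhere the previous version can reach it;
    -- dynamic SQL keeps the batch compilable when the column is already gone
    EXEC (N'UPDATE a SET a.Notes = c.PreferredName
            FROM dbo.CustomerArchive a
            JOIN dbo.Customer c ON c.CustomerID = a.CustomerID;');
    ALTER TABLE dbo.Customer DROP COLUMN PreferredName;
END;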
There are two distinct types of rollback script. One is used if the service was taken ‘offline’ for the release, so you can be confident that no live users of the system have changed the data subsequent to the release. The other is a rollback of an ‘online’ release, which must also preserve any changes to the data that have happened since the ‘failed’ release.
The first is relatively simple and can be an effective second-best to a transaction rollback. The second isn’t necessarily simple at all, especially if the release made significant changes to the base tables and live data changes happened subsequently. It is, in fact, only possible if you’ve designed the database to be able to do it.
At the level of the base tables, user changes could be deletes, updates, or insertions. If the release has made alterations to base tables, then ‘reversing’ the data, including the subsequently changed data, back into the old table structures can get complicated. The extra complication is in having to deal with data in the released version for which there is no table structure, in the previous version, to preserve it. After all, if Adventureworks had decided to also do surfboards, or if Pubs wanted to cope with CDs and audiobooks, and released a version to enable them to do it, where would the subsequent data on the new products be stored on rollback? It can be done, but it is a lengthy and distracting discussion involving XML and/or JSON.
If you have a pre-prepared and tested rollback script, you can run it immediately. You’ll need to prevent other users making changes while the rollback takes place. For example, you could follow the format that SQL Compare uses for encapsulating the script within a serializable transaction. Alternatively, you’ll need to ensure that the system is offline before the rollback starts.
It is not unknown for a rollback script to be created in an emergency, after an error happens, but it is bad practice as it is untested, and could extend the time that the database must be kept offline. It is much better to have this script already prepared even if, like a fire extinguisher or parachute, it is almost never used.
Reverting to a SQL Server database snapshot
If a release goes badly wrong, it is sometimes possible to revert to a SQL Server database snapshot, taken at the start of the release. This can only be done if there is a single snapshot for the database. For the ‘revert to snapshot’ operation to succeed, there must not be any read-only or compressed filegroups in the database, and all files that were online when the snapshot was taken must still be online when you revert. Back up the log before the operation and take a full database backup after a snapshot-based reversion.
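The mechanics are simple enough. A sketch, with illustrative database, file, and path names (the logical file name must match that of the source database):

-- before the release starts:
CREATE DATABASE Sales_PreRelease
ON (NAME = Sales_Data, FILENAME = 'D:\Snapshots\Sales_PreRelease.ss')
AS SNAPSHOT OF Sales;

-- if the release goes badly wrong, back up the log, then revert:
BACKUP LOG Sales TO DISK = 'D:\Backups\Sales_TailOfLog.trn';
USE master;
RESTORE DATABASE Sales FROM DATABASE_SNAPSHOT = 'Sales_PreRelease';
-- then take a full backup, because reverting breaks the log backup chain:
BACKUP DATABASE Sales TO DISK = 'D:\Backups\Sales_PostRevert.bak';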
Any changes made to the database after the snapshot was created will be lost. The only simple way to avoid this is to ensure that no other users can access the database from the point the snapshot is created until after the release is complete. Otherwise, all data additions, deletions, and changes made to the new version of the database since the start of the release will somehow need to be captured via a script and replayed in sequence on the reverted database (I’ll explain a way of doing this later; see Designing database releases defensively).
It is possible to do a SAN snapshot rather than a SQL Server snapshot. These generally have fewer restrictions than a SQL Server snapshot, but they share the same problem of retrieving all data changes made after the snapshot was taken.
Restoring from a backup
If, shortly before the deployment starts, you take a full database backup, and immediately prior to starting the deployment you take a final log backup, then a small database can be restored quite quickly to the state immediately before a failed release. As the database grows in size, though, this restoration process can take a long time. Again, any changes made to the database after the final log backup was taken will be lost unless you capture them some other way.
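In sketch form, with illustrative names and paths:

-- shortly before the deployment:
BACKUP DATABASE Sales TO DISK = 'D:\Backups\Sales_PreRelease.bak' WITH CHECKSUM;
-- immediately prior to starting it:
BACKUP LOG Sales TO DISK = 'D:\Backups\Sales_PreRelease.trn';

-- if the release fails, restore to the state at that final log backup:
RESTORE DATABASE Sales FROM DISK = 'D:\Backups\Sales_PreRelease.bak' WITH NORECOVERY, REPLACE;
RESTORE LOG Sales FROM DISK = 'D:\Backups\Sales_PreRelease.trn' WITH RECOVERY;

Note that WITH REPLACE overwrites the existing, broken copy of the database; omit it if you prefer to restore under a new name and swap over afterwards.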
When the release succeeds, but a problem occurs after the new version has ‘gone live’
The sort of problem that happens at this point is generally different in character to an immediate SQL Server error during a release. This time the release succeeds, and so the database is at a known version, but then something ‘breaks’ later, after normal operation is resumed. It could have several causes, such as a required SQL Server component that is missing, or a problem caused by someone making uncontrolled changes to the current release. A failure after release, however, is generally an error in database functionality, or an unexpected condition, that wasn’t detected during testing. It can manifest itself in a number of ways, such as SQL errors, faulty processing of data, performance problems, or outright database narcolepsy (sudden sleep).
The correct course of action is a matter of judgement. It all depends on the nature of the problem. You can still revert to a snapshot, or restore a backup, if all else fails, but you would lose all the data changes that have happened since the release, and it would mean significant downtime.
Rolling back
As discussed earlier, rolling back database changes made with the service online, and where users have made subsequent changes, is complicated.
If releases are frequent and the metadata changes are minor, a rollback script might still be reasonably simple. After all, it is just migrating a database from one known version to another. However, the release may be a major one that has made changes to base tables, adding entities, perhaps, and then users have made subsequent changes to the data. Then, restoring the data after the release would require a complex ‘roll-back’ script that is very dependent on how the database is designed.
Rolling forward or fixing forward
In general database practice, a ‘roll forward’ is the term given to the part of a database recovery process that ‘redoes’ committed transactions whose changes weren’t reflected in the data file at the point the database service was interrupted, rolling the database forward to a transactionally consistent state.
On restart, SQL Server automatically runs the database recovery operation. The “redo” portion of this operation replays the log records of any committed transaction that was not included in the last CHECKPOINT operation to run before the database went offline, and the “undo” portion will also reverse the effects of any uncommitted transaction that was included in the last CHECKPOINT operation.
The term ‘roll forward’ has been hijacked to describe two scripted techniques that are advocated when a release succeeds but causes a significant problem subsequently. One requires ‘DBA heroics’ and is very bad; the other can be good if it is part of a well-controlled release process.
First, the very bad.
Uncontrolled roll forward
There was a time that, when a release failed, it could be fixed with ‘DBA heroics’. After all, with the confident belief of knowing exactly what is causing the failure, what could go wrong if one strode towards the terminal with a confident chuckle and keyed in the necessary DDL changes? I have witnessed occasions where this has gone woefully wrong because, although one can do this within a transaction and revert it if necessary, you cannot test the change before you commit it. Even if you’re lucky and it works, if you end up making several changes to the live system, it no longer represents the current release. The code changes that you make must then be retrofitted to the code for the release and added to source control as a new release. This will, of course, invalidate the deployment process with all its tests and checks. This is high-risk: if your changes aren’t signed off by the security team, for example, could you be culpable for a subsequent security breach? Basically, this is an uncontrolled process which has no place in any functioning development department. We haven’t described what happens if the ‘heroic’ roll-forward fails, because it is too painful to relate the potential outcome, but it will involve production downtime and perhaps loss of data.
A more satisfactory version of fixing forward is the Hotfix, which is usually created as a migration script in development as soon as the issue in Production is detected, and, after automated testing, is tested thoroughly in Staging before being applied, as a version increment, to Production.
Rolling forward with a new release
An alternative ‘roll forward’ technique, and the one I advocate, is merely to expedite the next database release to fix the problem. This requires a rapid database releasing system. The changes then represent a new version, with as much of the test series as possible and all the necessary sign-offs by the various teams. Because it is tested, hopefully with improved tests to cover the unexpected condition that caused the problem in the first place, it is low risk. The code is in source control. With a fully scripted process, this can be done in a matter of hours. If the tests can focus on the root of the problem, then it could be much quicker. A team that is good at the rapid and reliable release of small, discrete chunks of database functionality will manage this type of ‘fixing’ release easily, but the more wide-ranging a release is, the slower a subsequent ‘fixing’ release will be, because more must be tested, checked, and signed off.
Alternative approaches: using feature switching or feature toggling
When a problem happens after a release and is isolated to a new feature that isn’t fundamental to the working of the database as a whole, you may have an alternative to rolling forward or reverting. You don’t necessarily have to revert if you’ve designed the database and application to be able to switch off new features gracefully. These switches, usually called Feature Toggles, are designed to disable new features or new functionality without changing the version of the database, so the device used to toggle must not require any significant database schema changes, just data changes. Sometimes, toggles can be represented by extended properties, or else can be done simply by referencing ‘switch values’ in a table.
Via a toggle, new features introduced in the release can be ‘disappeared’, with the user interface changing in concert (e.g. hiding a field from the UI). There must be a method of managing any data associated with a new feature, to preserve this data if the feature must be re-engineered to fix a problem.
Toggles are most useful when used in conjunction with role-based security. For example, only a small set of “power users” can be assigned permissions to access the new feature, meaning that turning the toggle “on” makes it available only to a small pool of tolerant users, rather than everyone.
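A minimal sketch of the ‘switch values’ approach, combining a toggle table with a role check; all the names here are hypothetical, and the surfboard feature is the imaginary one from earlier:

CREATE TABLE dbo.FeatureToggle
(
    FeatureName sysname NOT NULL PRIMARY KEY,
    IsEnabled bit NOT NULL DEFAULT 0,
    EnabledForRole sysname NULL  -- if set, only members of this role see the feature
);
GO
CREATE OR ALTER PROCEDURE dbo.GetSurfboardStock
AS
BEGIN
    -- the new feature checks its toggle (and the caller's role) before doing anything
    IF NOT EXISTS (SELECT 1 FROM dbo.FeatureToggle
                   WHERE FeatureName = N'Surfboards' AND IsEnabled = 1
                     AND (EnabledForRole IS NULL OR IS_ROLEMEMBER(EnabledForRole) = 1))
    BEGIN
        RAISERROR (N'This feature is not currently available.', 16, 1);
        RETURN;
    END;
    SELECT ProductID, Name FROM dbo.Surfboard;  -- the feature's real work
END;

Turning the feature off, or widening its audience, is then a single data change to the toggle table; no DDL is required.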
The feature toggle approach has the downside that it must be engineered into the database and application, and requires additional test resources. Its advantage is that, if done well, it will require no DDL changes to turn features on or off, or to restrict them to a subset of the users. I’ve had problems with feature toggles where someone has failed to understand the reason for them, made changes to them, or even removed them. The consequences of a well-used feature on which the business relies being accidentally switched off by a bored ops guy are enough to induce a cold sweat.
Designing database releases defensively
There is no technique that entirely removes the anxieties of the release of a production system. All one can do is take all possible precautions. I’m no stranger to having a well-known captain of industry striding up and down the server room, scowling and shouting at me ‘every minute this **!#* system is down you folks are losing me $xxxxxxx of revenue!’. You can’t argue with that, and so it is incumbent on the developer to reduce the risk as far as possible. Think lifebelts. In my databases, for example, all migration scripts from source version to target version are accompanied by a reverse-migration script from target to source that likewise preserves the data.
This requires taking steps to make sure that it is easy to preserve data in a rollback. If, for example, I wish to make changes to a table, I like to preserve a column and its now-redundant data, for one release, after moving or reorganizing the data elsewhere in the database. Only when the change proves itself in production is the redundant column removed in the subsequent release. I like to use an interface with the application that logs all the input, time, SPIDs, and so on, so that business transactions can be replayed. I have phantom users on the live system directly after a release. These users, from a script, exercise all the vital processes just to double-check that everything is in working order. For example, users armed with special credit card numbers will perform a series of customer actions on a shopping website. After the initial release of a database system, I release little and often.
Where I have a say in the design of a new system, and it happens occasionally, I insist that all interaction between the applications and the database must happen via a defined interface within the database. This helps to decouple the evolution of the database from the development of the application. It also allows a ‘playback’ log to be kept of all user interaction that affects the data. This includes the date it happened as well as the SPID, user identification, procedure, and parameters. If done well, this can, with suitable scripting, allow the restoration of data, even if the database must be restored from backup or reverted from snapshot. If it can be anonymized, it is also gold-dust for subsequently testing database processes.
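As a sketch of that interface idea, with all names illustrative, each procedure in the interface writes to a playback log before doing its work:

CREATE TABLE dbo.PlaybackLog
(
    LogID bigint IDENTITY PRIMARY KEY,
    LoggedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    SessionID int NOT NULL DEFAULT @@SPID,
    UserName sysname NOT NULL DEFAULT ORIGINAL_LOGIN(),
    ProcedureName sysname NOT NULL,
    Parameters nvarchar(max) NULL  -- the parameter values, serialized as JSON
);
GO
CREATE OR ALTER PROCEDURE dbo.PlaceOrder @CustomerID int, @ProductID int, @Quantity int
AS
BEGIN
    -- record the call so that the business transaction can be replayed later
    INSERT INTO dbo.PlaybackLog (ProcedureName, Parameters)
    VALUES (N'dbo.PlaceOrder',
            (SELECT @CustomerID AS CustomerID, @ProductID AS ProductID,
                    @Quantity AS Quantity FOR JSON PATH));
    -- ...the actual work of placing the order goes here...
END;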
Finally, how do we revert a database to a previous version when that previous version has no suitable schema structure (tables) that could assimilate the data? For this, I add, before the first release, an ‘accommodation’ or ‘limbo’ table. It might just have one date column, and another column for storing the data, in the form of JSON arrays or XML. The release has no knowledge of this ‘limbo’ table, and it provides a way of preserving the data for subsequent restoration.
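A sketch of such a table, and of parking data in it during a rollback; the surfboard table is the hypothetical new entity from earlier:

CREATE TABLE dbo.Limbo
(
    SavedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    SourceTable sysname NOT NULL,   -- where the rows came from
    RowData nvarchar(max) NOT NULL  -- the rows themselves, as a JSON array
);
GO
-- on rollback, preserve the data that the previous version has no home for:
INSERT INTO dbo.Limbo (SourceTable, RowData)
SELECT N'dbo.Surfboard', (SELECT * FROM dbo.Surfboard FOR JSON PATH);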
Conclusion
There is no magical way to ensure that the database release process is easy and stress-free. If problems crop up, the best remedial actions vary according to the nature of the problem, the scale of the release, and the nature of both the database and application.
Some advice is obvious. Always, where possible, do a migration within a transaction. Always take backups before performing any process on a live database that can potentially go wrong. Shake out as many issues as possible in Staging. Where possible, perform ‘sanity checks’ on the basic processes within the production system, with test scripts performed by a ‘phantom user’, before the new release goes live. Have available the tested, idempotent ‘rollback’ scripts that match every migration script. Other advice ‘depends’.
The best approach is to get into the habit of releasing often, cautiously, and in small doses of new functionality. This gives you a controlled way of updating the database to a new version that fixes a flaw in the previous release. Where possible, design the database with a defensive approach to releases, and to preserving data.
By doing all this, it is possible to reduce the anxiety that surrounds the database deployment process.