Flyway Database Drift and How it Happens
Flyway's approach to database migrations is based on strict versioning, but there is a limit to what a single process can do to prevent 'drift'. This article explains how drift can happen, and why you also need source control and external processes that log changes, to prevent it.
Flyway works to ensure that a database is at the version you specify, by imposing a reliable versioning system on database development. However, while Flyway encourages versioning, isn’t its mission to enforce it. In fact, it can’t prevent someone subverting the system, either deliberately or accidentally. Someone can make database or changes, either outside Flyway’s control, or by ‘altering history’ using Flyway’s own features, so it isn’t, by itself, an ‘auditable’ system. There are three extra components that are needed:
- A source control system that can track every change to any of the scripts and files used to develop the database.
- A build-check process that is used both to ensure that the scripts can build any version of the database being developed and to check it against the ‘canonical source’ for the version, for drift.
- An external log that records the details of every migration, for auditing
Why database drift matters
In general, a version of a database can only apply to one particular ‘state’ of the database. If a database is in the correct state for its recorded version, then the process of updating a database becomes much more reliable because a migration from one known database state to another can be well-tested and rehearsed. Conversely, if you have several copies of a database that all claim to be at the same version but in fact are in different states, then you have database drift, and deployments becomes unreliable.
You can change the data all you like, and it doesn’t change the version, but if you change the database objects such as tables, functions, views, indexes and constraints, then the version must change too.
Why should this matter? The larger the team, the more things can go wrong if this essential discipline is lost. Just a few examples from my own experience:
- The testers check the wrong release candidate – after days of work they discover that somehow the release candidate has been changed without changing the version, so the test team must start again.
- A ‘backdoor’ interface is introduced into the database that circumvents security. It isn’t in the code lodged in source control.
- A merge operation fails because someone has, in the meantime done an uncontrolled change that conflicts with the new feature
- A deployment fails, despite adequate testing, because of a patch that was applied to the production database but not versioned.
- A deployment succeeds, but after a time causes a problem that would have been picked up in integration testing. The release was changed to fix a different bug, ‘because nobody will notice’. Too late for a roll back, the fault must be corrected with a time-consuming and risky roll forward.
Flyway is designed to make the version, and the migration script between versions, central to the database development method. This doesn’t stop a feature being built by other means, but it becomes part of the released database only via a migration rather than being merely included in a build script. It performs some of the things that a source control system would do but not all, and it isn’t a substitute for it. Here are some of the things that can go wrong.
Uncontrolled changes
Perhaps the most obvious source of ‘drift’ will be database changes (deletions, amendments, and additions) made outside of Flyway, through an IDE, or by executing SQL against the database via a query tool.
Flyway cannot detect these ‘uncontrolled’ changes, so you cannot be sure that the version is correct for the state of the database. The best way to detect this form of ‘drift’ is to rebuild the database from the ‘canonical source’ of migration files for that version, in source control, and then compare it to the database being used for Flyway development. These regular rebuilds will also prove to the team that the source can, by itself, build that version.
Changing the history of changes
If there is just one database being built, and if you never perform a ‘clean’ operation, then that database will contain the list of all the migrations that have been applied to it, when it happened, and who applied it. It can also detect retrospective changes to files that should be immutable once applied.
The clean operation is vital at some stages in development, but unfortunately, it clears out the history table without archiving it, and drops all database objects such as tables, views, procedures, and triggers, within the configured schemas. If you change the database, or delete and recreate it, the new database won’t have the correct history. It is easy to lose the real history of changes and replace it with a plausible ‘origin myth’. You can now change the old migration files, changing the metadata without changing the version. Flyway won’t be able to detect the fact. You lose the name of the person who originally applied the migration and when it was done. You lose the list of files that were used to produce each version and have no idea if the contents were the same. This matters little if you are the only developer, working on an isolated branch, and you haven’t made any version ‘public’. In a team development, however, it could be the cause of chaos.
The basic problem is that the schema history table is designed by Flyway to tell us about the history of the development of the one copy of the database. This is useful, but for larger projects, the information needs to be collated and consistent across a project. You need to know when each version was first applied, and by whom, and you may also need to know the history of the real twists and turns of development.
Mutable migrations
If you alter the contents of the file of a resolved migration, Flyway will, the next time it is run, notice that something is amiss. It will detect a difference in the checksum between the file that represents the resolved migration, and the record of that file in the history table in the database being built.
Flyway politely assumes that the checksum has somehow been corrupted (this happens if you change the character encoding), rather than suspecting nefarious tampering. Also, of course, checksum-mismatch can also happen for perfectly innocent reasons, such as if someone retrospectively adds documentation to a source or reformats the code to make it more readable.
In such cases, you can get Flyway to ‘repair‘ the problem and then re-run the migration files to update or create a new copy of the database. However, if it turns out that the checksum mismatch was, in fact, caused by a change in the metadata of a resolved migration, then this will cause drift; the new copy of the database will have the same version number as the original one, but will be in a different state.
Using the ‘repair’ option before you’ve investigated why a migration failed is like switching off the burglar alarm rather than seeing what set caused the alarm to be raised. You need to investigate why Flyway couldn’t do the migration. If a file was altered it means that somewhere down the line, versions won’t match and that could cause a deployment to fail.
Intermediate changes
This is a special case of exploiting ‘mutable migrations’ (see previous section). Flyway’s ‘versioned migration’ files (‘V and U’ prefix in the filename) specify the version number that they create. However, unlike a conventional synchronization file, they don’t specify the source version from which that they migrate. Therefore, it is possible to slip in a previous migration with code that maybe evades code review, and then cover your tracks with a ‘repair’ command to Flyway. As this is a new file it will evade any audit system that is looking for changes to existing files.
Failed migrations that don’t run in a transaction
Normally, there isn’t a problem with a failed migration because Flyway executes them in a transaction. However, some systems such as MySQL can’t wrap DDL actions in a transaction so that when a migration fails it does so messily, requiring the user to mop up. If this isn’t done completely, then the version reverted to isn’t what it should be, and the alterations will stay in the database until a build-check can be done.
Incomplete Undos
An undo migration is designed to ‘migrate backwards’ to the designated version, from the subsequent version. However, there is no guarantee that it actually does that, because there is no way of doing such a check that works across different RDBMSs. A schema comparison tool such as SQL Compare will verify an undo, but without this, it is easy to make an undetected mistake that remains uncorrected. This will alter the ‘state’ of the version.
Running an Undo without updating baseline migrations
Imagine you have a sequence of migrations v1-v2-v3-v4, and one developer performs an “undo” to v1 (by running the u4, u3 and u2 scripts), corrects the v2 migration and then reruns the v2 to v4 sequence. This is legitimate in Flyway and requires no ‘repair’. If several databases will be built from the same source, then every developer using that source will need to be able to “undo” to v1 and rerun the sequence (or perform a ‘clean’ and rebuild their version from scratch).
However, what if there were existing baseline migration scripts for v3 or v4? These would also need to be updated, otherwise new databases that are created from their ‘baseline’ won’t contain the v2 changes. They may also fail to execute subsequent migrations because their source is different so tables or views might have changed, and procedures may be different.
Use of repeatable migrations
Repeatable files (prefix: R) have a description and a checksum, but no version. Instead of being run just once, they are applied, and then re-applied every time their checksum changes. They are useful at times in the development cycle, particularly for tasks that don’t affect the metadata such as adding various sets of test data, or attaching documentation to the database.
It is fine to use repeatable migrations when working in an isolated feature branch, where you are trying out all sorts of ideas with idempotent code and creating many test harnesses. At the point of merge, the deliverable will be one migration script in the parent branch, with a new version number, that contains all the changes, including creating/altering any code modules previously covered by a repeatable migration.
However, elsewhere, using them is dangerous because it means that you are basically “opting out” of versioning. The production or development branches should never use them, for example, because they will sabotage any chance of accurate versioning: an aspect that becomes essential in any ‘releasable’ code. One can slip all manner of changes in a ‘repeatable’ file, and then subsequently remove all trace of the code so preventing any code review, supervision or audit.
Enforcing Versioning
I hope I’ve described ways that versioning in Flyway can be subverted, either by willfulness or accident. The larger the team, the more this becomes a problem. To avoid any kind of database drift, we need a source control system, occasional build checks by an external process, and a complete and reliable history of changes exported from, but stored outside, Flyway.
Source Control
All the files used by a project should be saved in source control on every change, since the source control system is used as the final arbiter of an issue affecting the development of a database, and if used properly can be used to disentangle any development muddle. By ‘all files’ I mean not just the content of the Flyway ‘locations’, but also scripts, reports, and documentation.
This will draw attention to the incorrect use of repeatable files, as long as you commit the entire contents of the Flyway ‘locations’ used to a source control system at every migration, thereby detecting changes.
Checking for version ‘drift’
When a version is first created, I save a ‘model’ of the database metadata. It its crudest form, this will be a build script, but I use a JSON model of the main components of the database. One can also use a schema comparison tool, such as SQL Compare, to provide an object-level directory.
This gives you an artefact that you can then use for comparison work. It provides a reference for the version that can be compared with any subsequent recreation of that version to check for drift. You can check that an existing database really is at its stated version, and to verify that an undo script really does take the database back to its stated version.
This is all very easy with SQL Server and Oracle because there is so much third-party tooling around. However, for the others there will always be an available method of comparing the metadata of two relational databases of the same RDBMS because you can always do this by comparing build scripts generated from the two databases with a text DIFF tool. Usually, you would need to produce a build script as a routine when a version is first achieved. Then you need only one live database to run the checks. I use method whereby I produce JSON files representing the metadata for every version which is then compared more accurately to report changes, even across RDBMSs.
Keeping an auditable history
We also need to tackle the problem of the ease by which the real history of migrations can be lost from the flyway schema history table. I create an SQLite database that keeps an external log or record of every migration, so that it can report the first time a migration is ever applied. This can allow any forensic examination of the database development. To be an intrinsic part of a Flyway operation, it needs to be done outside a transaction after the end of a migration run. It also requires that all team members use their own names in the user-level flyway.conf file for the installedBy parameter (see A Programmer’s Guide to Flyway Configuration for more details).
Summary
Flyway’s whole approach to database migrations is based on strict versioning. It does all it can to ensure that the database version can be built from the resolved migrations, but there is a limit to what a single process can do to prevent ‘drift’. Therefore, you need external processes that log changes and manage the ‘materials’ used for the development process. No single system can do this without the risk of being subverted.
After a long experience of database development in a range of different types of organization, I must emphasize that the unimaginable always happen eventually, to any process that relies on the frailties of humanity to work. The more automated checks and precautions that an IT process can bring to bear on the work, and the more team-based reviews of the work that there are, the more likely that process will succeed without problems. This is particularly true of a database in the financial, engineering or healthcare sector. Fortunately, Flyway is particularly amenable to automated checks that supplement its built-in versioning, because of its architecture that encourages call-back scripts. It is easy to make a Flyway-based system into a rock-solid enforcer of versioning, and team-based database developments will be easier for it.