Defusing Flyway Validation Errors using Smarter Checksum Comparisons
This article provides a scripted SQL tokenizer that quickly verifies whether a Flyway validation error is a real cause for concern, caused by retrospective changes to the SQL in a migration file, or just the result of a developer valiantly adding formatting and documentation to improve the code. If the changes are purely cosmetic, we can safely run Flyway repair and resume normal migrations.
Flyway validation checks: necessary but indiscriminate
Flyway runs validation checks to make sure that the files that have been successfully applied to the database haven’t subsequently been changed. It must do this to maintain version consistency: if two copies of the same database have the same schema version number then the schemas should match. Put another way, the set of migration file locations that you specify must always produce the same versions of the database.
If a versioned migration file that has been used on a database is later changed, Flyway will detect it and raise a validation error the next time you try to run a migration on that database.
It’s vital that Flyway can run the validation checks as quickly as possible, so it just calculates a generic file checksum. This is a necessary compromise, but it means the checks are indiscriminate. Flyway knows the file has changed but cannot determine the nature of the change. Even if you’ve just added helpful comments or formatted the SQL nicely, the file checksum will have changed, and you’ll see a validation error. It also gives false positives because it doesn’t just detect changes to the SQL itself: an extra space between keywords, for example, will cause Flyway to refuse to run a migration.
At this point, you’ll need to carefully review the changes. This can be a time-consuming process. If you’re confident that the changes were purely cosmetic and don’t compromise the ‘version integrity’ of the migration chain, you can run a Flyway repair to ‘re-sync’ the checksums, and continue migrations.
This article is about automating, as far as possible, the process of “verifying the validation error” so that if the changes are purely decorative, normal development work can continue as quickly as possible.
Defusing Flyway validation errors
Ideally, all code would be fully documented and beautifully formatted before it’s committed to the shared development branch in version control. But just in case it isn’t, it is worth making it as easy as possible to tidy up code retrospectively, while minimizing problems with Flyway migrations. Otherwise, it could easily inhibit developers from formatting code for better readability, or from adding better documentation with block comments and end-of-line comments. Nothing should discourage you or your co-workers from making these sorts of changes, however long they are delayed. After all, the source is the only easy way of understanding the design of the database.
The idea behind this article is to provide developers with a quick, automated way to verify whether a Flyway validation error is a real problem, or a ‘false flag’ triggered by cosmetic rather than SQL changes, so they can respond accordingly.
I’ll show how to do this check using a PowerShell-based SQL tokenizer that performs a more discriminating, ‘metadata only’ checksum comparison. It parses the code into tokens, filters out those tokens that represent comments or whitespace changes, and then re-calculates the checksums for the original and modified migration files. If the ‘tokenized checksums’ now match, it’s likely the changes are purely cosmetic.
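To make the idea concrete before we get to the PowerShell implementation, here is a minimal sketch of the same technique in Python. It is an illustration only, not the framework’s code: the regex is much cruder than the one in Tokenize-SQLString.ps1, and the function name `sql_code_hash` is my own. It tokenizes the SQL, drops comments and whitespace, lowercases what remains, and hashes the result.

```python
import hashlib
import re

# Token pattern: comments are captured in named groups so they can be
# discarded; everything else that isn't whitespace survives as a token.
TOKEN_RE = re.compile(
    r"(?s)(?P<BlockComment>/\*.*?\*/)"
    r"|(?P<EndOfLineComment>--[^\n\r]*)"
    r"|(?P<String>N?'.*?')"
    r"|(?P<Token>[^\s/-]+|/(?!\*)|-(?!-))"
)

def sql_code_hash(sql: str) -> str:
    """SHA-256 of the SQL with comments, case and layout normalized away."""
    tokens = []
    for m in TOKEN_RE.finditer(sql):
        if m.lastgroup in ("BlockComment", "EndOfLineComment"):
            continue  # cosmetic: ignore comments entirely
        tokens.append(m.group().lower())
    return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

original = "SELECT id, name FROM Customers WHERE id = 1;"
tidied = """
/* fetch one customer */
SELECT id,
       name
FROM   customers  -- main customer table
WHERE  id = 1;
"""
# cosmetic edits (comments, case, layout) leave the hash unchanged
assert sql_code_hash(original) == sql_code_hash(tidied)
```

If the two hashes match, the change was decorative; if not, the SQL itself has been altered.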
If a migration is already applied to production, then any subsequent validation errors on that migration should probably be investigated with a full drift check, even if the ‘tokenized checksum’ indicates only cosmetic changes.
If a migration has been committed to the shared development branch in version control but has not yet been deployed, the approach can be different. If the ‘tokenized checksums’ still don’t match, then it means there are SQL changes, and the live database no longer represents the database built from the source. The database will need to be cleaned and then rebuilt from the source, followed by a fresh test cycle.
If the checksums match then there are no changes to the SQL, even accidental ones, and developers can quickly and safely run the Flyway repair command to resume migrations on the affected databases.
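This decision flow can be summarized in a short Python sketch. The function name and the three outcome strings are my own shorthand for the responses described above, not anything Flyway itself emits:

```python
def classify_validation_error(old_hash: str, new_hash: str,
                              deployed_to_production: bool) -> str:
    """Suggest a response to a Flyway validation error, given the
    tokenized checksums of the old and the modified migration file."""
    if deployed_to_production:
        # production migrations deserve a full drift check regardless
        return "investigate with a full drift check"
    if old_hash == new_hash:
        # only comments/formatting changed: safe to re-sync checksums
        return "run flyway repair"
    # real SQL changes: the source no longer matches the database
    return "clean, rebuild from source and re-test"

print(classify_validation_error("abc", "abc", deployed_to_production=False))
# → run flyway repair
```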
The tokenized checksum comparison
To be certain that a block of code has the same functionality as the one it replaces, we’d need to perform a full semantic code comparison, comparing the syntax trees of the language with an AST (Abstract Syntax Tree) comparison tool and considering the meaning of symbols.
We don’t need anything so sophisticated because we have much simpler aims. We just want to allow SQL code to be reformatted or documented without it being flagged as functionally different. I wouldn’t even want to allow SQL that is different but, hopefully, has the same effect. Even if the effect really were identical, you’d need the discipline of a migration, along with its associated tests, to make that sort of change.
We don’t need a parser to achieve this. We just need a tokenizer to split the SQL code into its individual tokens, and then eliminate those tokens that represent allowable cosmetic changes, such as comments. The tokenizer is already provided in the Flyway Teamwork framework as one of its resources, Tokenize-SQLString.ps1. It’s cross-RDBMS (based on the SQL-92 standard) and I’ve used it previously to present color-coded migration files in a browser, and to help you check for external references in your deployment code.
The Get-SQLCodeHash function
The Tokenize-SQLString cmdlet is just a script and is rather slow, because of the complications of detecting three-part or four-part names. It runs speedily enough for normal DDL scripting, but it isn’t quick enough for SQL-based DML (data manipulation) involving megabytes of data inserts. For our purpose here, we use a slimmed-down version of it, removing everything that isn’t essential for calculating the checksum.
The slimmed-down version uses a rather elaborate regex. Each token, such as ‘Select’, ‘From’ or ‘,’, is matched, but with sufficient detail to allow us to check whether the code is the same or different. If required, it can also tell you where in the code the change is:
Name   : Punctuation
Index  : 446
Length : 1
Value  : ;
Type   :
line   : 9
column : 62
We use it to read the code and create a set of tokens; the regex then removes any tokens that represent end-of-line comments, documentation, or formatting. This allows us to quickly build a string consisting of just the SQL code, and calculate its checksum:
$mystream = [IO.MemoryStream]::new([byte[]][char[]](
        $SQLString | Tokenize_SQLString |
            where { $_.name -notin ('BlockComment', 'EndOfLineComment') } |
            foreach -Begin { $skinnedSQL = '' } { $skinnedSQL += "$($_.value.ToLower()) " } -End { $skinnedSQL }
    ));
Get-FileHash -InputStream $mystream -Algorithm SHA256 |
    select -ExpandProperty Hash
This process ensures that the SHA256 checksum calculation is impervious to changes in case, comments, formatting or documentation within the original file. We can turn this into a function, Get-SQLCodeHash:
<#
    .SYNOPSIS
        Gets a hash of the actual SQL code, ignoring formatting and comments.

    .DESCRIPTION
        This function uses Tokenize-SQLString to calculate the SHA256 checksum
        of a SQL String, ignoring extra whitespace, block comments and
        end-of-line comments

    .PARAMETER SQLString
        The sql code as a string.

    .PARAMETER Filepath
        A description of the Filepath parameter.

    .EXAMPLE
        PS C:\> Get-SQLCodeHash -SQLString @'
/* Complex update with case and subquery */
UPDATE Inventory
    SET StockLevel = CASE
            WHEN StockLevel < 10 THEN StockLevel + 5
            ELSE (SELECT AVG(StockLevel) FROM Inventory
                  WHERE CategoryID = Inventory.CategoryID)
        END
    WHERE ProductID = 101;
'@
#>
function Get-SQLCodeHash
{
    [CmdletBinding()]
    param
    (
        [Parameter(Mandatory = $false, ValueFromPipeline = $true)]
        [string]$SQLString = '', # if you are passing a string rather than a filename
        [string]$Filepath # if you wish to pass a filepath containing the SQL
    )

    Begin
    {
        # The regex string will check for the major SQL components
        $parserRegex = [regex]@'
(?i)(?s)(?<BlockComment>/\*.*?\*/)|(?#
)(?<EndOfLineComment>--[^\n\r]*)|(?#
)(?<String>N?'.*?')|(?#
)(?<number>[+\-]?\d+\.?\d{0,100}[+\-0-9E]{0,6})|(?#
)(?<SquareBracketed>\[.{1,255}?\])|(?#
)(?<Quoted>".{1,255}?")|(?#
)(?<Identifier>[@]?[\p{N}\p{L}_][\p{L}\p{N}@$#_]{0,127})|(?#
)(?<Operator><>|<=>|>=|<=|==|=|!=|!|<<|>>|<|>|\|\||\||&&|&|-|\+|\*(?!/)|/(?!\*)|\%|~|\^|\?)|(?#
)(?<Punctuation>[^\w\s\r\n])
'@
    }

    Process
    {
        if (!([string]::IsNullOrEmpty($Filepath)))
        { $SQLString = Get-Content -Path $Filepath -Raw } # get the SQL from a filepath if one is provided
        $exceptions = @(0, 'BlockComment', 'EndOfLineComment') # things we want to ignore
        $parserRegex.Matches($SQLString).Groups |
            foreach -Begin { $script = '' } {
                if ($_.Success -eq $true -and $_.Name -notin $exceptions)
                { $script += " $($_.Value.ToLower())" }
            } -End {
                $mystream = [IO.MemoryStream]::new([byte[]][char[]]($script));
                # now calculate the hash
                Get-FileHash -InputStream $mystream -Algorithm SHA256 |
                    select -ExpandProperty Hash
            }
    }
}
Comparing SQL expressions
Let’s try it out. The two SQL expressions in the following block of code are the same. The only differences are the formatting, the end-of-line comments, and the block documentation:
@'
WITH SalesCTE
AS (
    SELECT p.ProductID, SUM(s.Quantity) AS TotalQuantity
    FROM Sales s
        INNER JOIN Products p
            ON s.ProductID = p.ProductID
    WHERE s.SaleDate BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY p.ProductID
)
SELECT p.ProductName, c.CategoryName, sc.TotalQuantity
FROM SalesCTE sc
    JOIN Products p
        ON sc.ProductID = p.ProductID
    JOIN Categories c
        ON p.CategoryID = c.CategoryID
ORDER BY sc.TotalQuantity DESC;
'@ | Get-SQLCodeHash
@'
WITH SalesCTE /* get the total quantity for each
product between the two dates */
AS (SELECT p.ProductID, Sum (s.Quantity) AS TotalQuantity
    FROM sales s
        INNER JOIN Products p
            ON s.ProductID = p.ProductID
    WHERE s.SaleDate BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY p.ProductID)
SELECT p.ProductName, c.CategoryName, sc.TotalQuantity
FROM SalesCTE sc --execute the CTE
    JOIN Products p
        ON sc.ProductID = p.ProductID
    JOIN Categories c
        ON p.CategoryID = c.CategoryID
ORDER BY sc.TotalQuantity DESC;
'@ | Get-SQLCodeHash
The ‘tokenized checksums’ are identical:
9C32C29389678889788B486EC6F516F15076C962856D13394A507087CFCECA40
9C32C29389678889788B486EC6F516F15076C962856D13394A507087CFCECA40
Try changing an alias in one version but not the other, or removing a column from one of them, and you’ll see that the checksums no longer match.
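When the checksums do differ, the same token stream can also pinpoint the first substantive change, much as Tokenize-SQLString can report a line and column. Here is a deliberately naive Python sketch of that idea; the helper names are my own, and the whitespace-based split is far cruder than the framework’s regex (it won’t, for instance, treat `a=b` and `a = b` as equal):

```python
import re

# strip block comments and end-of-line comments before tokenizing
COMMENT_RE = re.compile(r"(?s)/\*.*?\*/|--[^\n\r]*")

def sql_tokens(sql: str) -> list:
    """Lowercased tokens with comments and whitespace stripped out."""
    return COMMENT_RE.sub(" ", sql).lower().split()

def first_real_difference(old: str, new: str):
    """Return (index, old_token, new_token) for the first non-cosmetic
    difference, or None if the SQL is token-for-token identical."""
    a, b = sql_tokens(old), sql_tokens(new)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return (i, x, y)
    if len(a) != len(b):
        return (min(len(a), len(b)), "<end>", "<end>")
    return None

old = "SELECT sc.TotalQuantity FROM SalesCTE sc;"
new = "SELECT s.TotalQuantity  /* tidy */ FROM SalesCTE s;"
print(first_real_difference(old, new))
# → (1, 'sc.totalquantity', 's.totalquantity')
```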
Checking for the nature of changes to Flyway migration scripts
So, we can now use this to check a file that has failed the validation process in the Flyway migrate command, because Flyway has detected that the file has changed.
You should never blindly run Flyway Repair. First, you need to verify that there were no metadata changes in the file, and if there were, you need to investigate what changed and why. As noted earlier, Flyway knows the file has changed but not what has changed, which is why the command is ‘repair’ rather than ‘fix’. All the repair command does is modify entries in the Flyway Schema History (FSH) table; in this case, it would refresh the checksum it has recorded for the file. You should only allow this if you’re certain the changes are just comments or formatting.
To make the check, we just take the previous version of the migration file from version control and compare its checksum output with that of the current file. To do this, we use Get-SQLCodeHash to calculate both the old and new checksums, but this time getting the SQL from a file:
Get-SQLCodeHash -Filepath '<PathTo>\northwindBuildNoData.sql'
You can, of course, get these hashes from a file directory all at once. This script has a guard to prevent it from getting scripts that are too long, but you can adjust it to taste. Script-based tokenization is always going to be slow, but then, you are unlikely to want to add documentation to a huge wadge of DML data imports!
dir '<path to migration files>' *.sql -Recurse | foreach {
    if ($_.Length -lt 1000000)
    {
        @{
            'filename' = $_.Name;
            'HashCode' = "$(Get-SQLCodeHash -filepath $_.FullName)";
            'FullName' = "$($_.FullName)"
        }
    }
    else
    {
        @{
            'filename' = $_.Name;
            'HashCode' = 'not attempted';
            'FullName' = "$($_.FullName)"
        }
    }
} | ConvertTo-Json > 'FilesAndHashcodes.json'
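For comparison, the same batch scan can be sketched in Python. This is an illustrative equivalent, not part of the Teamwork framework: the simplified `sql_code_hash` normalizer stands in for the tokenizer, and the demo runs against a throwaway folder rather than a real migration location.

```python
import hashlib
import json
import re
import tempfile
from pathlib import Path

COMMENT_RE = re.compile(r"(?s)/\*.*?\*/|--[^\n\r]*")

def sql_code_hash(sql: str) -> str:
    """Comment-, case- and layout-insensitive SHA-256 of the SQL."""
    normalized = " ".join(COMMENT_RE.sub(" ", sql).lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def hash_migrations(folder: str, limit: int = 1_000_000) -> list:
    """Hash every .sql file under folder, skipping oversized DML scripts."""
    results = []
    for path in sorted(Path(folder).rglob("*.sql")):
        too_big = path.stat().st_size >= limit  # guard against huge data scripts
        results.append({
            "filename": path.name,
            "HashCode": "not attempted" if too_big
                        else sql_code_hash(path.read_text()),
            "FullName": str(path),
        })
    return results

# Demo against a temporary folder standing in for the migration location
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "V1__create_tables.sql").write_text(
        "CREATE TABLE Customers (CustomerID INT PRIMARY KEY);")
    manifest = hash_migrations(folder)
    Path(folder, "FilesAndHashcodes.json").write_text(
        json.dumps(manifest, indent=2))
print(manifest[0]["filename"])  # → V1__create_tables.sql
```

The resulting JSON manifest can then be compared, file by file, against one generated from the previous versions in version control.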
Conclusion
Flyway’s method of checking changes in migration files is fine, just as long as it is seen as a way of notifying the user that something needs investigating. Source control is fine for finding text changes but without a semantic extension or plugin, it cannot distinguish between code changes that change functionality and mere decoration.
If you are faced with a failed migration due to a change in the checksum of one of the successfully-applied migration files, then you’ll want a simple way of finding out whether someone has merely done some useful work, making the annotations and documentation more readable, or has made the horrible mistake of retrospectively changing the SQL. A scripted tokenizer can find out for you. It might be slow, but it is quicker than trying to scan through source control to spot the nature of the changes.