Adding Test Data to Databases During Flyway Migrations
How to generate "realistic but fake" SQL Server test data, using SQL Data Generator, and then use SQL Data Compare to produce an INSERT script that Flyway can run to load the data into the database. The technique is useful for managing small volumes of test data or for "topping up" existing data when you create new entities.
This article is part of a series on managing test data in a Flyway development:
- Better Database Development: The Role of Test Data Management
- A Test Data Management Strategy for Database Migrations
- Dealing with Database Data and Metadata in Flyway Developments
- Adding Test Data to Databases During Flyway Migrations
- Getting Data In and Out of SQL Server Flyway Builds
- Bulk Loading of Data via a PowerShell Script in Flyway
- Managing Test Datasets for a Database: What’s Required?
- Managing Datasets for Database Development Work using Flyway
Also relevant is the series of related articles on Database testing with Flyway.
Before you apply development updates to a live production database, you need to test the changes thoroughly using realistic data. However, it is often not practical, or even permissible, to use the real data taken from the production version of the database, if it contains personal information. Pseudonymized data disguises parts of the information that can be easily used to identify individuals, but to be entirely safe you must either anonymize the production data completely or else generate your own ‘realistic but fake’ development data.
Generating test data for a Flyway-managed database is just as easy as for a database developed by more ‘traditional’ methods. I’ll show how to add data to a new set of tables in an existing Flyway database, using SQL Data Compare and SQL Data Generator, but we need to go over a few principles before we get stuck in.
Should data be ‘versioned’?
It is unusual for databases to be ‘versioned’ by the data as well as the ‘metadata’ (the tables, indexes, views and so on). However, data is structured for a particular design of tables, and a change in the design of the tables, or the way that they inter-relate, generally demands changes to the way the data is structured. In other words, a change to a table often requires a change to the data. Even so, there isn’t always a one-to-one mapping of data to metadata. One dataset may survive for many database versions if you are doing most of the work on the logic and rarely touching the base-table design. Otherwise, you’ll need one data version per database version, and you might even want a range of datasets for any one version.
A migration approach to database development allows data to be inserted with the same ease as metadata. However, just because you can do it doesn’t mean that you should do it, for a real database. Test data can accidentally leak into a production system (we all remember the letter that went out from a British Bank with the salutation ‘Dear Rich Bastard’). It also isn’t a good idea for a database version to depend on the data, even if the data is almost entirely static, such as when it is historic. Production databases must, after all, avoid ‘uncontrolled’ version changes, yet must allow data changes. There is a small exception to this rule, where a set of ‘enumeration’ static data in a table is crucial to the running of the database and would cause errors if changed. These are best held in views containing SELECT…VALUES statements and treated as metadata.
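As a minimal sketch of this exception, an ‘enumeration’ can be held in a view built from a VALUES list, so that it travels with the metadata and is versioned along with it. The view name and the values here are invented for illustration:

```sql
-- A minimal sketch of an 'enumeration' held as metadata rather than table data.
-- The view name and the values are invented for illustration.
CREATE VIEW dbo.OrderStatus
AS
SELECT StatusID, StatusName
  FROM (VALUES (1, 'Ordered'),
               (2, 'Dispatched'),
               (3, 'Delivered'),
               (4, 'Cancelled')
       ) AS EnumValues (StatusID, StatusName);
```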
You will, ideally, store the development data separately from the code. This also allows you to keep several development datasets according to their purpose. Because table changes usually occur less frequently than version changes, I prefer to keep each set of development data with the earliest database version for which the data is appropriate. Then the script that imports the data needs only to find the nearest dataset with a version that is lower than, or equal to, the metadata version.
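As an illustrative sketch, the current metadata version of the database can be read from Flyway’s schema history table, so that the import script can pick the nearest dataset at or below it. This assumes SQL Server and the default history table name in the dbo schema:

```sql
-- Sketch: read the latest successfully-applied version from the Flyway history table.
-- Assumes SQL Server and the default table, flyway_schema_history, in the dbo schema.
SELECT TOP (1) version
  FROM dbo.flyway_schema_history
  WHERE success = 1
    AND version IS NOT NULL -- ignore repeatable migrations and failed runs
  ORDER BY installed_rank DESC;
```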
Data can be stored outside a database, in a binary file format, one table per file. Data can also be extracted or imported in ways that mimic the tables, in a ‘tabular’ markup such as CSV. It is also possible to use markup such as JSON or XML, which allows a degree of mapping to deal with small-scale changes such as table-splitting. A different technique is to store the data as INSERT SQL statements, but this approach is usually over ten times slower than using bulk load techniques.
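To give a feel for the bulk-load alternative, here is a minimal sketch of loading one table from a native-format (BCP) file. The table name, file path and options are assumptions for illustration:

```sql
-- Sketch of the bulk-load alternative: load one table from a native-format (BCP) file.
-- The table name, file path and options are assumptions for illustration.
BULK INSERT dbo.authors
  FROM 'C:\Data\Pubs\dbo_authors.bcp'
  WITH (DATAFILETYPE = 'native', -- the file was written with bcp ... -n
        KEEPIDENTITY,            -- preserve any identity values held in the file
        TABLOCK);                -- take a table lock for a faster load
```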
Is a database that is developed by Flyway different in the way that data is treated?
No. All databases must consider their strategy for creating separate data for different database environments, and for loading data into databases for test runs. The only real difference in Flyway is that it enforces the concept of a database version. A database that is built conventionally by a ‘build script’ generally consists of only metadata, and the data is subsequently added as a separate exercise. The development team still need to develop the change script, often called the ‘synchronization script’, that moves from the current version to the version that has just been built. This script will probably need to change the existing data structure to fit the new version. At this point, it will become apparent that a migration-based approach is merely a difference in emphasis from the ‘static’ approach. When releasing a new version of a production/live database, there is no magic way of avoiding the task of developing and testing a deployment script that preserves the existing data.
How do you generate spoof development data?
I use SQL Data Generator for the task. It has built-in generators for several types of data such as names, addresses and phone numbers, but you’ll often get more realistic results by also using lists within files, reverse-regex expressions, or a mixture of both. It also allows you to create generators in Python, although truthfully this is best avoided. My preferred approach is to develop the word banks using a decent scripting language (it is possible to do this in SQL, but the process is slow), and then use SQL Data Generator to grab the word banks and pop them into the database.
You can generate data for just one table, a group of tables, or the entire database. A table that is unrelated to any other is a good place to start. If you are populating a group of inter-related tables, start with all the tables that have no foreign keys. Then add those tables whose foreign keys relate only to unique or primary keys in the tables you’ve already filled. Working in this order, you will eventually be able to fill all the tables.
I cover one of the techniques for generating realistic text in How to Complain Bitterly using SQL Data Generator, and I explain some of the ways that I generated the books and notes for the Pubs sample database in Getting your SQL Server Development Data in Three Easy Steps.
How do you insert development data during Flyway migrations?
With Flyway Community, the only way of adding data as an intrinsic part of the process is by using SQL INSERT statements, and such a script is best created with a utility that generates the SQL code for you. In this article we’ll use SQL Data Compare to generate the INSERT statements.
For small quantities of data, it is fine to use INSERT statements in this way, but it very soon makes more sense to remove all data from a database and bulk-copy in the data that is appropriate to the destination version, once the migration run is finished.
Fortunately, Flyway Teams allows migrations to be scripted in PowerShell or by a DOS batch, so the data can be added as appropriate to the target of a migration run. I show how to do this in my article, Bulk Loading Data via a PowerShell Script in Flyway. With Flyway Community, this can only be done as a separate process rather than with a Flyway Callback script.
Using SQL Data Generator with a Flyway database
For our demo, we’ll imagine that we have an existing database, and we need to update our test data so that it includes spoof data for a new group of tables we’ve recently added. We’ll use my PubsFlywayTeamsMigration project for the Pubs sample database. Within the Scripts folder you’ll find the V1.1.10__AddAddressesPhonesEtc.sql migration script, which introduces a whole set of new tables, to give us a better way of handling people, their contact details, and interactions.
You’ll notice from this script that some of these new tables already have data in them. We had to store the names of all the people who were already authors and employees, so that we could then change the old tables to remove ‘people’ information. We therefore need to make sure that SQL Data Generator adds the spoof data without disturbing any existing data.
Having generated the data and inserted it into V1.1.10 of our Pubs database, we then use SQL Data Compare to compare its data to an empty copy of V1.1.10 and so produce a script to insert the data, which we can then add to our Flyway project.
To make it more interesting, we use the Flyway Teams version that allows forward and backward migrations. This means that the output from our work must be not only a SQL script that adds the data but also its ‘evil twin’ sister, a script that removes it.
Generating the data
The first job is to create two empty copies of the Pubs database and then migrate each one so that they are both at V1.1.10. You can use the UndoBuild.ps1 script to run the migrations on each database.
We choose one of these copies to add the data to the new tables and connect to it from SQL Data Generator. All the new tables are in a separate schema called 'people'.
First de-select all the tables. Next, select one of the new tables that has no foreign keys. If you’re unsure, this SQL will list all the tables without foreign keys, along with the number of rows they contain.
```sql
SELECT Object_Schema_Name(Tables.object_id) + '.' + Tables.name AS TheTable,
       Partitions.rows AS [RowCount]
  FROM sys.tables
    INNER JOIN sys.partitions
      ON Partitions.object_id = Tables.object_id
    LEFT OUTER JOIN sys.foreign_keys
      ON Tables.object_id = Foreign_keys.parent_object_id
  WHERE Foreign_keys.object_id IS NULL
    AND Partitions.index_id IN (0, 1); -- count rows from the heap or clustered index only
```
If you run this on V1.1.10 of Pubs, you’ll see that some of the tables already contain data; a previous revision put the original Pubs data about ‘people’ (mostly authors and employees) into them. With SQL Data Generator, we can opt to overwrite it or add to it, and we’ll need to be careful to always choose the latter option.
We’ll start with the people.phoneType table:
As you see, there are no warnings (you get warned if there is a foreign key to a table that needs populating first). When you click the name of the table, you get an input form, as follows:
You need to fill in the number of rows and whether you need to generate the data or just grab it from a data source. We want just five rows, one type of phone in each, because we can’t think of any other types of phone.
Finally, it is important not to click the ‘Delete data from table before generation’ option, because as discussed earlier we don’t want to replace whatever is there, just add the data. If we replace the data, the authors of the original publications in the ancient Pubs database will get lost, which would be sad. If this were real life, things could get tense in development, and you’d have backups.
To specify how you fill those columns, first click on the ModifiedDate column to set a reasonable range of dates for when the row was inserted or changed.
To fill in the TypeofPhone column, we set the column to unique and tap in a simple reverse regex to give just five alternatives.
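For example, a reverse-regex expression along these lines, with the exact values invented here, would produce five distinct alternatives:

```
(Home|Work|Mobile|Office|Fax)
```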
That’s it. You can work your way through all the tables without foreign keys in this fashion. For the Note table, I use the technique I described in How to Complain Bitterly using SQL Data Generator.
You can then select them all and populate them all in one go (Ctrl-G), or just select each one in turn and populate it (but you must remember to de-select it afterwards!).
Once these are done, you can work your way through the tables that just reference these tables. SQL Data Generator will warn you of all dependencies in the shape of foreign key references and, if the referenced tables are already populated, will set the foreign key values automatically. From this point onwards, it is easy to get the data in. It is just a matter of doing the tables in reverse dependency order, but SDG will soon warn you if you get it wrong.
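If you want to check which tables reference which, a query along these lines (a sketch using the system catalog views, not part of the original project) lists each foreign key relationship, so you can see which tables must be filled first:

```sql
-- Sketch: list each foreign key relationship, so that the referenced (parent) tables
-- can be populated before the referencing (child) tables.
SELECT Object_Schema_Name(fk.parent_object_id) + '.'
         + Object_Name(fk.parent_object_id) AS ReferencingTable,
       Object_Schema_Name(fk.referenced_object_id) + '.'
         + Object_Name(fk.referenced_object_id) AS ReferencedTable
  FROM sys.foreign_keys AS fk
  ORDER BY ReferencedTable, ReferencingTable;
```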
Producing a script for Flyway
For this you need SQL Data Compare. You will remember that I insisted that you need two copies of the database. You now open SQL Data Compare to compare the filled copy (in my case, PubsFive) to the empty one (PubsFour), with the filled one as the source:
You hit Deploy… to deploy the changes to PubsFour, but instead of letting SQL Data Compare do the deployment, you tell it to generate a script. Before you do this, you first need to go to the Options tab and tell SQL Data Compare not to use transactions when it generates this script:
Open the script in your favorite editor and check it over. Save it as a Flyway versioned migration (V) file.
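The generated script boils down to a series of INSERT statements, with identity columns handled via SET IDENTITY_INSERT. A simplified, hypothetical fragment is sketched below; the version number, key column and values are invented, and the real output from SQL Data Compare will be longer and use one INSERT per row:

```sql
-- Hypothetical, simplified fragment of the generated data migration, saved under an
-- invented version number as V1.1.11__AddSpoofPeopleData.sql.
SET IDENTITY_INSERT [people].[PhoneType] ON;
INSERT INTO [people].[PhoneType] ([PhoneTypeID], [TypeofPhone], [ModifiedDate])
  VALUES (1, N'Home', '2021-04-12T10:32:00'),
         (2, N'Work', '2021-05-03T15:10:00'),
         (3, N'Mobile', '2021-05-27T09:45:00');
SET IDENTITY_INSERT [people].[PhoneType] OFF;
```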
To generate the accompanying undo file (Flyway Teams only) you just switch the source and target database using the button provided and rerun the deployment. Edit the resulting script and save it (in Unicode of course) as the undo file for the new version.
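The undo script reduces, in essence, to DELETE statements that remove the generated rows, children before the tables they reference. A simplified, hypothetical fragment, matching the invented version above, might look like this:

```sql
-- Hypothetical, simplified fragment of the matching undo file, U1.1.11__AddSpoofPeopleData.sql.
-- Rows in referencing (child) tables must be removed before the rows they reference.
DELETE FROM [people].[Phone]     WHERE PhoneTypeID IN (1, 2, 3); -- invented child table
DELETE FROM [people].[PhoneType] WHERE PhoneTypeID IN (1, 2, 3);
```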
Conclusions
It is so much easier to develop databases with plenty of development data, especially if it is generated from scratch (you can, for example, work from home then). If you develop using realistic amounts of data, you can more easily gauge performance, and the places that require better indexes or improved design become obvious. I use a slow server just to make it even more obvious when I’ve got a well-tuned database. You can never guarantee that the database will use the same plans for queries or batches, but you are much closer to how the database will behave in production.