A Test Data Management Strategy for Database Migrations
This article will help you understand the steps to better test data management when using a migrations-based approach to database development and deployment. It explains the different types of data required, and why; the need to separate DDL and DML code; and the most efficient way to create, load and switch between the required datasets.
This article is part of a series on implementing a test data management strategy for Flyway:
- Better Database Development: The Role of Test Data Management
- A Test Data Management Strategy for Database Migrations
- Dealing with Database Data and Metadata in Flyway Developments
- Adding Test Data to Databases During Flyway Migrations
- Getting Data In and Out of SQL Server Flyway Builds
- Bulk Loading of Data via a PowerShell Script in Flyway
- Managing Test Datasets for a Database: What’s Required?
- Managing Datasets for Database Development Work using Flyway
Also relevant are the series of related articles on Database testing with Flyway.
A cruel reality of database testing is that the harder you work at it, the more bugs you’ll discover. When doing database work for an organization, you’ll find yourself testing database code constantly and rapidly. For this you need the right sort of test data and the means of changing datasets as easily as changing your hat. When working in a team, you’ll also save a lot of time by sharing data, and to do that, you need a system for doing so. That system is called Test Data Management.
In this article, I’ll describe the steps to effective test data management in a migrations-based approach, as typified by Flyway, and explain a few techniques that will make it much easier to support test-driven database development. I’ve tackled the general issues of test data management in an introductory article on Simple-Talk, called Managing Test Data for Database Development.
The steps to effective database testing
The main reason why database deployments fail is a lack of rigorous database testing, and the main reason this task is so often avoided, or delayed, is the difficulty of creating and managing all the required test data.
The way that database testing is performed varies widely, both between organizations and between projects, each of which tends to be, culturally, at a different stage in the management of test data. Most database projects start with only a primitive way of managing test data and then introduce appropriate management as required by the project. To support rapid releases on complex live systems, you'll need more sophisticated test data management, because the testing will need to be done rapidly, in parallel, and with high 'coverage'. In the early, pre-release stages of a project, however, the emphasis is on rapid build and teardown of database prototypes, and so the objectives are different. We need to test out different ideas rapidly and 'prove' database designs. We still need the test data, but the priorities are different.
In looking at the different stages of database testing, there seem to be distinct steps towards methods of test data management that allow the team to deliver frequent, high-quality database releases.
Stage 1. Use production data
You may think that you don’t need test data because you already have production data and can just use this. Not so fast! Firstly, there are often difficult security and legal issues around the use of personal and sensitive data (if its use is necessary, you’ll need secure testing environments and techniques like data masking and subsetting, discussed later).
Secondly, production data is only the data that managed to get into the system: it won’t represent all the ‘edge case’ data that should have got into the system but failed or was mangled in the process. It also won’t represent invalid data that was prevented from entering the database by your checks and constraints, but you still want that data in certain test data sets to ensure all these safeguards continue to function correctly, as you make database changes.
Stage 2. Grow my database with its own built-in test data
Surely, you might think, you just need a single dataset that is intrinsic to the development database. As you change the tables and their relationships, the data will, with a bit of luck, change naturally with it and, if not, you can provide the code to move the data into the new database design. While this technique is simple, in that there is little separation between the data and the metadata, and no separate repository or source for the test data, there are problems with it.
Firstly, if you mix data manipulation with data definition in a migrations-first approach like Flyway, you are then forced to use migrations under all circumstances, even when you just need to change or correct the test data or use a special test dataset.
Secondly, you cannot do rapid releasing without the risk of test data leaking into production. If, during release, you apply a migration to the live server that accidentally includes test data, it will either give an error that could well require a lot of 'mopping-up' of incorrect data, or it will obligingly add the test data to the live version of your database, which is likely to be a career-changing experience. Yes, it has happened often, and the effects are sometimes awful and public.
The final problem with mixing DML and DDL SQL code is that it slows down the process. The best RDBMSs provide a fast bulk-load facility for loading datasets, which is far quicker than running the individual INSERT statements in a migration script.
Stage 3. Separate DDL and DML code in database development
The next stage of sophistication in Flyway is to load the test data for the correct database version, rather than use a migration to insert the data. Purely for simplicity, my Flyway demos on GitHub do tend to use migration scripts to insert development data. Generally, however, data should never affect a version number. If it did, your production system would always have a different version history than development.
Changes to the data, like changes to documentation, don’t equate to a version change, and you don’t need the same controls as provided by a version control system. This means that you shouldn’t mix data and DDL, particularly in a system like Flyway that takes ‘versioning’ seriously.
When making a change to a database that affects where data is held, you’ll need to load the data for the current version, develop the SQL code that manages the migration of the data into the changed table structures, and then save the data for the resulting version.
Test data and development datasets are managed quite differently to the DDL code that is used to create the database. Data is best maintained separately, and Flyway offers several techniques for doing this, such as bulk inserting the data using a repeatable migration, or my preferred technique of loading it into the database in a separate process, using a callback script. To load data after a migration run, all you need to know is the version number of the last successful DDL migration. See Managing Datasets for Database Development Work using Flyway.
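For instance, a post-migration data-loading process needs to know which version it is loading data for. As a minimal sketch, assuming SQL Server and Flyway's default history table name (dbo.flyway_schema_history), the last successfully applied version can be read like this:

```sql
-- Find the highest version that was successfully applied, so that a
-- callback or external loading script can pick the matching dataset.
SELECT TOP (1) version
FROM dbo.flyway_schema_history
WHERE success = 1           -- ignore failed attempts
  AND version IS NOT NULL   -- ignore repeatable migrations
ORDER BY installed_rank DESC;
```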
Stage 4. Different datasets for different test objectives
To move to a more rapid way of deploying a database change to production, you'll need more than one dataset. I've spelled out the reasons for this in detail in Managing Test Data for Database Development. You need data that, through being unexpected, is likely to break your code; you need data that tells you whether your database is designed and indexed in such a way that it will continue to perform well as the volume of data grows; you need Mr Null data for checking your data inputs; you need 'little Bobby Tables' data that is able to check your security and access control.
A single dataset can meet several of these requirements but, even for a simple database, you'll need at least a couple of datasets: one that includes the data that helps the developer to write reliable code, and another, with the full gamut of data, designed to test whether it really is reliable.
In my experience, there are four main requirements for datasets:
Data for data feeds
Data that tests your general validation checks and constraints. Can bad data get in? Checks and constraints act as a first line of defense against data-related errors and inconsistencies. Tests should be run that intentionally attempt to place bad data in the database. This will identify potential vulnerabilities or weaknesses in the way that constraints are implemented.
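As a minimal sketch of the kind of 'negative' test such a dataset drives, here is a T-SQL check against a hypothetical dbo.Customer table; the test only passes if the deliberately bad row is rejected:

```sql
-- Attempt to insert a row that should violate NOT NULL and CHECK constraints.
-- (In a real test harness you would also roll back any accidental insert.)
BEGIN TRY
    INSERT INTO dbo.Customer (CustomerName, EmailAddress, DateOfBirth)
    VALUES (NULL, 'not-an-email', '2090-01-01'); -- null name, bad email, future date of birth
    THROW 50000, 'Constraint check failed: bad data was accepted', 1;
END TRY
BEGIN CATCH
    IF ERROR_NUMBER() = 50000 THROW;  -- re-raise the test failure
    PRINT 'Constraint check passed: the bad data was rejected';
END CATCH;
```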
Data to check compliance with business rules
Data that is valid, but incompatible with business rules. From experience, these checks cannot be confined to a single layer of a database application. Even if checks are made in a workstation application or website, they need to be at the database level as well, for defense-in-depth. Constraints should reflect the business rules and logic specific to the applications or systems that use the database. Testing the constraints ensures that the database enforces these business rules correctly. It helps verify that the data being stored aligns with the expected business requirements and validation criteria.
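As a hedged illustration, business rules of this sort end up as constraints like the following, on a hypothetical dbo.SalesOrder table; the dataset's job is to include rows of valid datatypes that nevertheless break the rules, so that you can confirm they are rejected:

```sql
-- Business rules expressed as constraints: a discount can never exceed 15%,
-- and an order cannot be despatched before it was placed.
ALTER TABLE dbo.SalesOrder WITH CHECK
  ADD CONSTRAINT SalesOrder_DiscountInRange
        CHECK (DiscountPercent BETWEEN 0 AND 15),
      CONSTRAINT SalesOrder_DespatchAfterOrder
        CHECK (DespatchDate IS NULL OR DespatchDate >= OrderDate);

-- A business-rules dataset should include rows like this, which are valid
-- datatypes but break the rules, and which the database must reject.
INSERT INTO dbo.SalesOrder (OrderDate, DespatchDate, DiscountPercent)
VALUES ('2024-03-01', '2024-02-01', 40);  -- expected to fail
```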
Data for security
This is data that tests for incorrect data input routines that allow SQL injection, or inputs that allow fraud. These datasets will be used to check that there are appropriate access controls in place. The checks will normally run from scripts, via ODBC connections, and will ensure that each category of user can only access the data necessary for the role.
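A minimal SQL Server sketch of such an access-control check, assuming a hypothetical ReportingUser that should be able to read dbo.SalesSummary but have no access at all to dbo.CustomerPII:

```sql
-- Impersonate the restricted user and list the permissions it actually has.
EXECUTE AS USER = 'ReportingUser';

SELECT permission_name
FROM sys.fn_my_permissions('dbo.SalesSummary', 'OBJECT'); -- expect SELECT only

SELECT permission_name
FROM sys.fn_my_permissions('dbo.CustomerPII', 'OBJECT');  -- expect no rows at all

REVERT;
```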
Data that checks processes
There is a special category of data that will allow us to do a 'black box' check that a process works exactly as intended. In fact, these datasets come in pairs: the set that provides the input data for the process and the set that defines the expected output. A classic example of this is the processing of purchases, such as in a website shopping basket. A set of purchases will need to be represented by a set of changes to various tables in the database. To check that the process works, the inputs need to be presented to the system, and the tables affected by the outputs from the process have to be checked to ensure that the changes are correct.
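As a minimal sketch, with hypothetical table and procedure names, the comparison between the actual and the expected output tables can be done with EXCEPT in both directions; any rows returned indicate a discrepancy:

```sql
-- Run the process under test against the input dataset ...
EXEC dbo.ProcessShoppingBasket @BasketID = 1;

-- ... then compare the affected table with the expected results, both ways round.
SELECT OrderID, ProductID, Quantity, LineTotal FROM dbo.OrderLine
EXCEPT
SELECT OrderID, ProductID, Quantity, LineTotal FROM expected.OrderLine;

SELECT OrderID, ProductID, Quantity, LineTotal FROM expected.OrderLine
EXCEPT
SELECT OrderID, ProductID, Quantity, LineTotal FROM dbo.OrderLine;
```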
Overcoming the culture shock
A solid test data management strategy will allow you to pick and choose the appropriate dataset for your needs. You may want a good, neat, general-purpose dataset for general development, something dark and rather scary for penetration testing, and something big for scalability testing. However, this canteen system for test data management can result in a strange culture shock in a migrations-based approach to database development. A build-focused system always has the task of loading data and so can easily incorporate a canteen system for loading and maintaining the most effective test data for the task. A migration-based system must come up with a way of doing the same.
Sure, one can devise a way of always migrating with an empty database and then adding the data as a post-migration task, but cleaning and re-stocking a database just for a trivial migration run might sound irksome. However, this clean-and-restock approach is only necessary if the data has changed, or where a migration alters one or more tables in such a way that a bulk insert with the existing dataset no longer works. A migration run that loads the appropriate dataset after it is completed will, if the RDBMS has a good bulk-load system, run much faster than one that involves loading data via INSERT statements.
The other cultural change is that whoever creates or alters one or more of the base tables in such a way as to require a change to the datasets must do the work of altering the datasets. For example, a table-split will always require changes to the datasets. If you change a base table in such a way that the old dataset can no longer be inserted into an empty variant of the database, then you need to save the new correct version of the data out to a file-based dataset.
Of course, all of this indicates that a dataset will need a system for defining the range of database versions for which it is suitable. I've described a system for doing this, in case you struggle to think of one, where you save each dataset in a directory with a name indicating the first Flyway version for which the dataset is appropriate. The bulk load routine then just picks the dataset whose version is equal to, or the next lowest below, the current database version.
In the short-term, the disassociation of data (DML) and definition (DDL) is tiresome. Data can change with almost every version of the database, especially if you have adopted an evolutionary approach to database development that demands continual changes to base tables. At this stage, data must be matched to the version, but in a one-to-many relationship. One version of a particular dataset, such as that required for scalability testing, will be loadable into several versions of the database. Once the design of the base tables has settled down, there will be less requirement for storing old versions. There are, very occasionally, times when one needs to revive an old version of the database with its data to track down a particularly tricky bug, but it is rare.
Doing the heavy lifting of datasets
Once you separate data out from the database definition code, you'll immediately need a way of removing data from a database, when necessary, and replacing it with the test dataset that is most effective for your test run. This dataset must, of course, be correct for the database version.
In the Flyway Teamwork framework, I have a routine described in Getting Data In and Out of SQL Server Flyway Builds that will automatically save or load the right dataset, specifically for your branch. This will, as soon as I have the time, become RDBMS-independent!
Importing data
The different RDBMS providers offer various proprietary tools and utilities for bulk-importing data. SQL Server has the command-line BCP (Bulk Copy Program) and the DacFx utility. PostgreSQL provides the COPY command to copy data between files and tables. Oracle has SQL*Loader, which loads data from external files into Oracle database tables, and the Oracle Data Pump tool for importing and exporting data and metadata from Oracle databases; it can handle large datasets efficiently. MySQL has the LOAD DATA INFILE statement, which allows you to import data from a file into a table. SQLite's command-line shell provides the .import and .output commands, which can import data from a file into an SQLite database and write query results out to a file.
You can write scripts or programs using languages like PowerShell, Python, Java, or Ruby to extract data from the source database and load it into the target database. Database-specific libraries or APIs can provide direct access to the database, making the data transfer more efficient. Some databases offer CLI-based utilities for bulk copy operations or provide high-performance data transfer capabilities for moving large volumes of data between databases. Examples include the BCP (Bulk Copy Program) utility in Microsoft SQL Server or the COPY command in PostgreSQL.
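To give a flavour of these, here are minimal sketches of the SQL Server and PostgreSQL bulk-load statements; the file paths, table names and formats are placeholders:

```sql
-- SQL Server: load a tab-delimited file into a table, skipping the header row.
BULK INSERT dbo.Customer
FROM 'C:\Datasets\V1.3\Customer.tsv'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);

-- PostgreSQL: load a CSV file. COPY reads from the server's file system;
-- use psql's \copy to read a file on the client instead.
COPY customer FROM '/var/lib/datasets/v1.3/customer.csv'
WITH (FORMAT csv, HEADER true);
```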
As well as scripting, it is worth considering other ways of unloading and loading data, such as by using utilities that manage the whole process. The main approaches are:
- Database Export/Import Utilities: Many RDBMSs provide built-in export and import utilities, or tools that allow you to unload and load data. These utilities often allow you to export and import data in various formats such as CSV, XML and JSON. Examples include MySQL's mysqldump, SQL Server's SSIS, or the EXPORT and IMPORT commands in Oracle.
- ETL (Extract, Transform, Load) Tools: ETL tools provide graphical interfaces or workflows that allow you to define the extraction and loading process visually. They often offer data transformation capabilities, allowing you to manipulate and format the data during the transfer. Examples of popular ETL tools include Hevo, Informatica PowerCenter, Oracle GoldenGate, Talend, and Apache NiFi.
Make it easy to create datasets
Good datasets will reflect closely the nature of real data in its distribution, covariance and cardinality. However, they are hard to produce, even with a tool such as SQL Data Generator. Fortunately, it is rarely necessary to fake it so closely. Most RDBMSs create their query plans from a fairly simple distribution map or ‘statistics’. The main reason for mimicking real data as closely as possible is to ensure that the same query plans are selected in testing as in the production system, which is important for scalability and performance testing. There are a number of technologies that can be used, and it is likely that a database developer will need them all.
Random (pseudo-random) data generation
This involves generating data that follows a specific pattern or distribution. You normally use random numbers drawn from the distribution you specify, and then shuffle, select or sample from pools of data to generate random names, addresses, phone numbers, and so on.
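As a minimal T-SQL sketch, assuming hypothetical pool tables of first names and last names, you can combine and shuffle the pools and assign random values such as a date of birth:

```sql
-- Sample 1,000 pseudo-random customers from small pools of names.
SELECT TOP (1000)
       f.FirstName + ' ' + l.LastName AS CustomerName,
       DATEADD(DAY, -ABS(CHECKSUM(NEWID()) % 20000), GETDATE()) AS DateOfBirth
FROM dbo.FirstNamePool AS f
CROSS JOIN dbo.LastNamePool AS l
ORDER BY NEWID();  -- shuffle, so the sample isn't biased towards the first rows of the pools
```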
Synthetic data generation
Synthetic data is derived from language rules, frequency lists or pools of public-domain data, but not necessarily from actual data. It mimics the characteristics and patterns of real data. This is particularly useful for performance or scalability testing, where you need to generate large volumes of data that is unrelated to the real data of the production system. There are a number of techniques that can be used to provide verisimilitude to the derived data.
Data masking, pseudonymization and anonymization
In this approach, we obfuscate real data to create development and test data that closely resembles the real data but does not reveal sensitive information. This usually involves replacing personally identifiable information (PII), such as names, addresses, and social security numbers, with fictional or randomized values. However, even having done this, the data is only pseudonymized, rather than fully anonymized. If you only mask the data, without shuffling it, it is prone to inference attacks that can reveal enough about certain individuals to be classed as a data breach. Data Masker for SQL Server and Oracle can use a variety of techniques to ensure that the data is fully anonymized.
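As a hedged sketch of the sort of transformations involved, on a hypothetical dbo.Customer table, direct identifiers can be replaced with generated values and surnames shuffled between rows to make inference harder:

```sql
-- Replace direct identifiers with generated, obviously fictitious values.
UPDATE dbo.Customer
SET EmailAddress = CONCAT('customer', CustomerID, '@example.com'),
    PhoneNumber  = '07700 900123';  -- reserved fictitious UK number range

-- Shuffle surnames between rows so that the remaining attributes are harder
-- to tie back to a real individual by inference.
WITH Source AS (
    SELECT CustomerID,
           ROW_NUMBER() OVER (ORDER BY CustomerID) AS rn
    FROM dbo.Customer),
Shuffled AS (
    SELECT Surname,
           ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM dbo.Customer)
UPDATE c
SET Surname = sh.Surname
FROM dbo.Customer AS c
JOIN Source   AS s  ON s.CustomerID = c.CustomerID
JOIN Shuffled AS sh ON sh.rn = s.rn;
```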
Data generation tools and libraries
There are specialized data generation tools and libraries available that can simplify the process of generating data. These tools often provide features for generating specific types of data, such as customer data, financial data, or geospatial data. Some popular data generation tools include SQL Data Generator, Mockaroo, Talend Data Generator, and DataGen.
Conclusions
Choosing a dataset of test data should be as easy as selecting the right packet of crisps at the supermarket. ‘What version of the database?’, ‘What sort of testing?’. ‘Soak testing? Bad luck mate. Still, here you are, how much do you want?’. Next thing you know, it is in the database.
The reason why databases have so much trouble with deployment is that they aren't tested rigorously enough, and they aren't tested rigorously enough because it is tiresome and tedious to create, find and load the test data. Test Data Management has developed as a way of planning out, in advance, when the different types of data will be required, and the nature of that data. It gives you time to devise the best way of loading the data. It also allows us to recreate a previous version of the database, complete with its data, so that developers can track down when and how a particular bug got into the system. Finally, it gives developers who are practicing test-driven development a lot more flexibility in the way they plan their development branches.