Building Better Test Data with SQL Provision

Steve Jones shows how a team might use SQL Provision to build consistent, compliant, useful databases, on demand, for development and test environments.

Development teams make software available for release once they are confident that it behaves consistently, as designed, under as many different user workflows as they can test. Unfortunately, their test environments often don’t reflect the harsher reality of the live environments, where their software will encounter large volumes of real data, and some ‘unexpected’ data values.

Inevitably, this leads to unexpected issues, when they deliver changes to the live environment. A good way to minimize this is to run early builds and tests with data that mimics what we expect in live environments, so that the test results provide a more reliable indicator of the behavior seen on release. Unfortunately, data privacy and protection regulations will often restrict the movement and use of that data. So, how do you create test data that complies with regulations that prohibit the sharing of sensitive or personal data, but still looks and behaves like the real data?

If you need to do this manually, using scripts, and you need to provision many development and test environments, then you’re going to find it difficult. This article explains how to use SQL Provision to solve the problem in an automated fashion. It uses SQL Provision’s Data Masker tool to create SQL Server test data in which any sensitive or personal information is obfuscated, but that retains the volume and characteristics of the predicted production loads. It then describes how to use SQL Clone to automate the provisioning of development and test environments with realistic, but compliant, test data.

This combination of techniques is valuable in ensuring we can evaluate the true behavior of our software, before users encounter the application.

Security matters

As we develop database software, there is often no substitute for the actual data that exists in a live environment. However, the security of this data is paramount, and most of our test and development systems have lower security configurations, to allow for the flexible nature of the software development process.

While there are many complex issues around data security, data privacy, and data subject rights, most organizations realize that using actual data in less secure environments is a poor practice. Indeed, many countries around the world are introducing legislation that penalizes organizations for losing track of data. Laws, such as the GDPR, explicitly call out the need to anonymize or pseudonymize data in test and development environments.

Data masking tools, which randomize and protect sensitive data, help ensure that the development team can still use data that is as close as possible in character, volume and distribution to the production data when evaluating the functionality and performance of their software.

Random data is not enough

If security regulations mean you can’t use the production data, then you may decide simply to use ‘randomly-generated’ test data. While this is helpful from a security standpoint, it introduces other issues into the development process.

Firstly, it won’t reflect the characteristics and distribution of the real data, so our application processes that use this data probably won’t behave or perform in the same way as they will when they encounter the real data. We need to ‘randomize’ data to protect it, but we need to do so in a way that means it still looks and behaves like the real thing.

Secondly, the developers and testers who write the features, and evaluate their correctness, become used to certain data sets, and when they see familiar data in the results, they can rapidly evaluate whether the system is working as intended. For example, specific data values may need to be present, to mimic a series of orders in an e-commerce system and ensure all the various order permutations are tested and working correctly. Similarly, the more realistic the data, the easier it is for users involved in UAT to decide whether what the developers are delivering meets their needs. If the user is expecting to see specific schedules for a registration system, for example, then seeing random values instead can often be misinterpreted as a software malfunction.

In short, using random data will protect us from accidentally revealing sensitive or personal data, and may well have the additional benefit of introducing ‘edge cases’ that we may not otherwise have considered testing. However, it’s not sufficient on its own. The test data must also contain a curated set of values, to test the known cases and boundaries of our software. This might be a set of general ledger transactions for our accounting software, or a known series of schedules for a registration system. This curated set of data would contain no sensitive data, and would be consistent in all our development and test databases.

Realistic but compliant test data using SQL Provision

Our goal is a lightweight, adaptable and automated process that can provide developers and testers with a consistent set of data. That data is safe to use, since sensitive and personal values have been replaced with randomized ones, yet it remains representative of the values found in the live system, and it contains the known values that are useful in their daily work.

However, any technique that relies on shifting vast quantities of data between systems, using database backup and restore, and running manual ‘data cleansing’ scripts, will be slow, cumbersome, and hard to maintain. By contrast, the tools in this article use modern data masking and data virtualization to allow us to provision multiple environments, consistently, throughout the software development lifecycle. I’ll show how to use SQL Provision, a combination of SQL Clone and Data Masker for SQL Server, to automate the following process steps:

  1. Make a copy of production data
  2. Perform random masking of any columns containing sensitive or personal data so that this data does not ‘leak’ into less secure environments
  3. Inject a curated data set into the database – known data sets that contain enough values to adequately test the software, but without any sensitive information.
  4. Use data virtualization to rapidly provision consistent databases on demand

My previous article demonstrated a simple way to implement steps 1, 2 and 4. It used the PowerShell script shown in Listing 1 to automate the creation of data images using SQL Clone, the substitution of any sensitive or personal data with randomly-generated values using Data Masker, and then the deployment of sanitized database clones to the development and test servers, again using SQL Clone.

Listing 1: Deploy sanitized database clones for development and testing work
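In outline, such a script follows the pattern below; the SQL Clone server URL, machine and instance names, file paths, and the image and database names are all placeholders to be replaced with values from your own environment:

  # Connect to the SQL Clone server (URL is a placeholder)
  Connect-SqlClone -ServerUrl 'http://sqlclone.example.com:14145'

  # The instance holding the production copy, and a registered image file share
  $sourceInstance   = Get-SqlCloneSqlServerInstance -MachineName 'ProdCopyServer' -InstanceName 'SQL2017'
  $imageDestination = Get-SqlCloneImageLocation | Select-Object -First 1

  # The Data Masker masking set that randomizes the sensitive columns (step 2)
  $maskingSet = New-SqlCloneMask -Path 'C:\DataMasker\CustomersEmployees.DMSMaskSet'

  # Create the data image, applying the masking set during image creation (steps 1 and 2)
  New-SqlCloneImage -Name 'DMDatabase_Masked' `
      -SqlServerInstance $sourceInstance `
      -DatabaseName 'DMDatabase' `
      -Destination $imageDestination `
      -Modifications @($maskingSet) |
      Wait-SqlCloneOperation

  # Deploy a sanitized clone to a development instance (step 4)
  $devInstance = Get-SqlCloneSqlServerInstance -MachineName 'DevServer' -InstanceName 'SQL2017'
  Get-SqlCloneImage -Name 'DMDatabase_Masked' |
      New-SqlClone -Name 'DMDatabase_Dev' -Location $devInstance |
      Wait-SqlCloneOperation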

This article will build on that work to extend the solution to include step 3, the injection of a curated set of data.

Creating the curated data set

The curated data set must contain values to test all known behaviors of our software, including edge cases. Creating it is an important step that cannot easily be automated, because each database contains its own set of unique transactional elements, each of which must be tested to ensure that the software handles them properly. However, once the data set has been created, the ongoing maintenance needed to add data for new cases is minimal.

In the previous article, as part of the image creation process, we used Data Masker to generate a data masking set for the dbo.dm_Customers and dbo.dm_Employees tables, replacing any sensitive values with randomized data.

We implemented the masking by using the -Modifications parameter of the New-SqlCloneImage cmdlet to specify the use of a data masking set. We supplied the masking set file to the New-SqlCloneMask cmdlet, which produces a data modification that can be run as part of the image creation process.

Listing 2 shows two relevant lines of code:

Listing 2: Using random data masking.
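In the sketch above, those are the line that builds the masking-set modification and the -Modifications argument passed to New-SqlCloneImage (the masking set path and the image and database names are illustrative):

  $maskingSet = New-SqlCloneMask -Path 'C:\DataMasker\CustomersEmployees.DMSMaskSet'

  # ... later, applied when the image is created ...
  New-SqlCloneImage -Name 'DMDatabase_Masked' -SqlServerInstance $sourceInstance `
      -DatabaseName 'DMDatabase' -Destination $imageDestination `
      -Modifications @($maskingSet) | Wait-SqlCloneOperation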

We will modify this section of the code so that, after masking, it injects a curated data set into each of these tables, using two script files. While this could be a single script, separating the data sets into one file per table is better for ongoing maintenance in a team environment. Each file would be created manually, perhaps using information from the production database, but with sensitive values changed in such a way that the data retains the correct characteristics.

Listing 3 shows a sample of the contents of the dm_customer_testdata.sql file, which adds the curated data to dbo.dm_Customers, setting the customer name, address, credit card values, and so on, to realistic values that are not linked to any actual person.

Listing 3: The curated data set for customers
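A few sample rows might look like this; the column names are illustrative, since they depend on your schema, and the card numbers are the industry-standard test values rather than real ones:

  -- Curated customer rows: realistic values, not linked to any real person
  INSERT INTO dbo.dm_Customers
      (customer_firstname, customer_lastname, customer_street_address,
       customer_region, customer_email, customer_credit_card_number)
  VALUES
      ('Ann',   'Archer', '12 High Street', 'Colorado', 'ann.archer@example.com', '4111111111111111'),
      ('Bob',   'Baker',  '34 Main Street', 'Texas',    'bob.baker@example.com',  '4012888888881881'),
      ('Carla', 'Cruz',   '56 Oak Avenue',  'Ohio',     'carla.cruz@example.com', '5555555555554444');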

Listing 4 shows sample contents from the dm_employee_testdata.sql script file, which uses some famous names (from public data), plus some generated values to represent specific cases.

Listing 4: The curated data set for employees
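Again, the column names are illustrative; the names are well-known figures from public data, and the remaining values are generated to cover specific cases:

  -- Curated employee rows: famous names from public data, plus generated values
  INSERT INTO dbo.dm_Employees
      (employee_firstname, employee_lastname, employee_jobtitle, employee_salary)
  VALUES
      ('Ada',   'Lovelace', 'Analyst',   52000),
      ('Grace', 'Hopper',   'Developer', 61000),
      ('Alan',  'Turing',   'Tester',    48000);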

In any live system, there would be additional rules to cover other entities but, for the sake of simplicity, this article includes data for just these two tables. The idea can be extended to more objects if necessary.

Injecting the curated data

Now that we have the two files containing curated data, let’s inject that data into the database. SQL Clone makes this easy because, in addition to a data masking set created using Data Masker, we can specify regular SQL script files as further modifications.

We use the New-SqlCloneSqlScript cmdlet to create an object for each script file, as shown in Listing 5, assigning them to two variables, one per table.

Listing 5: Adding the curated data scripts
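Assuming the two files are saved to a location the SQL Clone agent can read (the paths shown are illustrative), the calls look like this:

  $customerScript = New-SqlCloneSqlScript -Path 'C:\CuratedData\dm_customer_testdata.sql'
  $employeeScript = New-SqlCloneSqlScript -Path 'C:\CuratedData\dm_employee_testdata.sql'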

Then, we just modify the New-SqlCloneImage line to add the curated data scripts to the list of modifications. Place these objects after the data masking set, so that they execute last.

Listing 6: Using SQL Clone to apply both the randomized masking set and the curated data scripts during image creation
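The modified call might look something like this; the image and database names and the variables are carried over from the earlier sketches, and the modifications run in the order listed, so the curated data is added after the masking set:

  New-SqlCloneImage -Name 'DMDatabase_Masked' `
      -SqlServerInstance $sourceInstance `
      -DatabaseName 'DMDatabase' `
      -Destination $imageDestination `
      -Modifications @($maskingSet, $customerScript, $employeeScript) |
      Wait-SqlCloneOperation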

When we run this as part of our automation, we will see both the masking set and the SQL scripts execute as part of the image creation process. When we query any clones created from this image, we will see our curated data set included alongside the other masked data.

Conclusion

In the real world, however, where applications are running, we often find that users will stress our systems with data that we haven’t used in our test and development environments. This forces teams to set aside ‘unplanned’ maintenance time to fix bugs and performance issues, post-deployment.

Ingrained in the DevOps culture is the idea that to minimize this unplanned work, we need to ensure that our software behaves correctly, even when handling edge cases, performs adequately, and integrates smoothly, as part of routine daily development. This means being able to build databases and test them under conditions that mimic as closely as possible the live environment.

This article presents a method that allows us to use SQL Provision to build consistent, compliant, useful databases, on demand, for development and test environments. By combining the power of Data Masker for SQL Server, to randomize any sensitive data, with custom SQL scripts that represent a curated data set, we ensure that the resulting databases can be used for effective and reliable testing, and are easy for people to work with when examining the data.
