
Reducing Database Complexity and Size using Subsetting

Getting realistic test data from large production databases can be a challenge, especially when you're trying to keep dev environments lightweight and secure. Redgate Test Data Manager includes a subsetting CLI that simplifies this by letting you generate smaller, fully representative subsets automatically. This article will walk you through getting started: from setup and configuration to running and refining simple subset operations.

Guest post

This is a guest post from Tonie Huizer. With 20+ years of experience, Tonie likes to read, experiment, talk, and write about software. In his stories on Medium, he explains Azure, SQL, and other Microsoft technologies.

Redgate Test Data Manager aims to support a test data management strategy that gives developers flexible ways to provision the test data they need – whether that’s seed data, generated test datasets, or masked production subsets – without unnecessary overhead. By reducing database size while preserving referential integrity, subsetting helps teams simplify provisioning and improve test efficiency.

What is Subsetting?

Subsetting is a technique used to reduce a large database to a smaller, more manageable size while maintaining data relationships and integrity. This process is valuable when working with large production databases that cannot be easily handled in a non-production environment.

By extracting only the necessary data, developers can work with a dataset that contains just the relevant data, while reducing the complexity and overhead of managing large databases for development and testing. By excluding columns or rows that carry sensitive personal information, subsetting also becomes a data‑protection strategy: the best way to safeguard sensitive data is simply not to copy it in the first place.

The challenge is to subset in a way that retains referential and logical integrity, so that applications and reporting queries still behave as expected. Phil Factor’s Database Subsetting and Data Extraction article provides a good overview of the technical hurdles involved: cascading foreign keys, identity columns, orphaned rows, and more.

Redgate’s Subsetter tool is designed to help overcome these challenges and automate the process of test data provisioning for large databases. Automation is what makes subsetting repeatable, reliable, and efficient at scale, especially when working with large, complex databases. Let’s see how it works!

Setting up the Subsetter CLI

To get started with the Redgate Test Data Manager Subsetter CLI (rgsubset), you must ensure that all the requirements are met and that you download the right version for your system. If you’re already using Subsetter v1, check the documentation for guidance on the upgrade path, because v2 contains breaking changes.

Before running Subsetter, authenticate your Redgate license using the auth command:
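
A sketch of the command, assuming the login subcommand and EULA flag used across Redgate's CLI tools (check the auth reference for the exact syntax in your version):

rgsubset auth login --i-agree-to-the-eula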

If you are testing the functionality without a license, add the --start-trial flag to the end of the above command. There is no formal confirmation that the trial has been activated, but I verified the activation of my trial on the Redgate Licensing Portal. I then needed to rerun the auth command as follows (this extra step is only necessary for trial licenses):
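
Presumably this is just the standard login command again, this time without the trial flag; treat the line below as an unverified sketch:

rgsubset auth login --i-agree-to-the-eula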

Preparing for subsetting

After installation, you can begin configuring the CLI to run subsetting operations, either through command-line parameters or by using a JSON/YAML configuration file.

Redgate Subsetter takes a source database and applies filtering rules to extract only the required data, while automatically preserving the relationships between tables. It then bulk inserts this subset into an empty target database with the same schema, so that your test environments mirror your production environment, but at a much more manageable scale.

The following diagram illustrates this process:

Subsetter workflow

As you can see, to prepare for subsetting we need three things: a source database containing the full set of data, a target database with the same schema but completely empty, and a Subsetter options file that instructs Subsetter how to perform its task.

Creating an empty target database with the same schema

Our first task is to prepare an empty target database, meaning no data but with a full schema model of the source database. The documentation suggests you use either a Flyway baseline migration or a native tool (like SSMS for SQL Server) to script out the database schema with no data. Either way, you must start with an empty target database (no user objects).

My main use for subsetting is to prepare a dev or test database for Flyway, so this would be my preferred way to create the target database. Using a baseline migration script is fine if you’re testing this out on an existing Flyway repository and the script you need already exists. However, I’d tread carefully with regenerating a baseline migration purely to test subsetting, because you wouldn’t want this script committed to source control (without a valid use case). The alternative is simply to run the sequence of migration files on an empty database, to recreate the version you need.

In either case, you’ll have a non-empty flyway_schema_history table in the target database, which will cause a validation error when you attempt to run the rgsubset command. You’ll need to drop this table before running subsetting or instruct the subsetter to delete any existing data in the target (see “Troubleshooting common issues” later in the article).
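
On SQL Server, dropping the history table is a one-liner (assuming Flyway's default table name and the dbo schema):

DROP TABLE dbo.flyway_schema_history;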

I think a simpler and cleaner approach is to create the target database using the Flyway prepare and deploy commands. The prepare command will create a script of the source database:
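
The shape of the command is roughly as follows, where MyDB and MySubsetDB stand for environments defined in your Flyway configuration; the prepare parameter names are assumptions on my part, so check the Flyway prepare documentation for the exact syntax:

flyway prepare -prepare.source=MyDB -prepare.target=MySubsetDB -prepare.scriptFilename=prepare-MySubsetDB.sql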

…and deploy will run it against your empty target database. See the documentation for more details.

Prepare a basic subsetting options file

Subsetter can automatically create a subset either by targeting a “desired size”, or by following the relationships from a starting table and filter clause that you define. These config options are specified in an Options file, which can be written in YAML or JSON, so you define the rules once and then reuse them.

The desired size mode is most useful and accurate when dealing with a large source database. The entry in the options file might look like this:
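
For example (desiredSize is the option name referred to later in this article; the value format shown here is illustrative, so check the options file reference for the accepted units):

{
  "desiredSize": "1GB"
}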

Alternatively, if we provide a starting table in the subsetter options file, the subsetter will use that as the root table, and then it automatically “follows the referential tree from this table and collects all the data necessary from the related tables” to build a coherent and referentially intact subset. We can also provide a filter clause to control the specific rows from the starting table that ought to be included in the subset.

You cannot use the two modes in conjunction in a single run, but you can run each mode successively. For example, the first subsetting run might use a starting table and filter to specify what data is needed, and then a second run can use the desiredSize option to reduce the footprint.

What Happens If You Don’t Provide Options?

If you don’t provide any options, Subsetter will choose a starting table and attempt to build a subset automatically. While this can work for simple databases, the results may be unpredictable in complex schemas. For more consistent results, it’s best to define your own starting table.

The following is a bare-bones JSON options file that specifies the Orders table as the starting table and includes only rows with an OrderID of less than 10260. We’ll discuss how to refine this with other options shortly.
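
A sketch of what that file might look like; the overall shape (a startingTables array with a table and filter clause) matches the options discussed in this article, but the exact key names and the dbo schema are assumptions, so check the options file reference for your version:

{
  "startingTables": [
    {
      "schema": "dbo",
      "table": "Orders",
      "filterClause": "OrderID < 10260"
    }
  ]
}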

Save the file as rgsubset-options-MyDB.json (or similar) in the root of your project directory, where you’ll be running the Subsetter commands.

Running the subsetting process

Once your configuration file is ready, you can run the subsetting operation using the CLI. The rgsubset run command is the core of the subsetting process. It extracts a subset of data from a source database and transfers it to a target database while preserving referential integrity, as described above.

The following command runs a subset operation on the source SQL Server database (MyDB) according to the criteria set out in the options file. It loads the subsetted data into the target database (MySubsetDB), which has the same schema but no data. The connection strings for the source and target databases must be in .NET format; refer to connectionstrings.com for examples. For illustration, I’ve used SQL Server credentials for the source and Windows credentials for the target.

rgsubset run 
  --database-engine SqlServer 
  --source-connection-string "Server=mySourceServer;Database=MyDB;User Id=myUser;Password=myPassword;" 
  --target-connection-string "Server=myTargetServer;Database=MySubsetDB;Integrated Security=true;TrustServerCertificate=true;"
  --options-file rgsubset-options-MyDB.json

The result should be that MySubsetDB contains a smaller, yet structurally complete, subset of the MyDB database. The following example shows the results for a simple database like Northwind for SQL Server (you can find some sample scripts here).

Subsetting example for Northwind

Refining the subsetting options

You can refine your subsetting configuration to gain more control over the size and scope of the subset. By defining starting points, exclusions, and reference tables, you can adjust your subset to match real usage patterns, without resorting to manual scripting or one-off exports.

As well as specifying starting tables or a desired size, we can use filtering rules, exclusions, and static data tables to refine the dataset. Here’s a refined JSON options file that I used to subset my TasteWhisky database:

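Reconstructed from the settings listed below, the file looks roughly like this (as before, the exact key names and the dbo schema are assumptions to verify against the options file reference):

{
  "startingTables": [
    {
      "schema": "dbo",
      "table": "Customers",
      "filterClause": "Id = 1"
    }
  ],
  "staticDataTables": [
    { "schema": "dbo", "table": "Distilleries" }
  ],
  "excludedTables": [
    { "schema": "dbo", "table": "Tastings" }
  ],
  "useTemporaryTables": true
}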

  • startingTables – specifies that subsetting starts from the Customers table but only includes rows where Id = 1.
  • staticDataTables – ensures the Distilleries table is fully copied as static data. This is useful for reference data or lookup tables that are needed for consistency in the target database.
  • excludedTables – the Tastings table is excluded from the subset, which can help reduce database size by excluding logs, audit trails, or other non-essential data.
  • useTemporaryTables – set to true so that Subsetter will create temporary tables in tempdb during processing, instead of using standard tables in the source database. This is particularly useful in environments where direct write access to the source database is restricted, as it avoids making changes to the source schema while still enabling efficient data processing.

Troubleshooting Common Issues

Subsetting can sometimes present challenges, especially in complex databases or due to configuration mistakes. I’ve covered some of the minor issues I encountered. For a more detailed list, refer to the official documentation.

Overwrite your subsetted database

By default, Subsetter prevents overwriting an existing target database, so you’ll see a validation failure if you attempt to write to a target database that contains, for example, pre-seeded data or a populated flyway_schema_history table. To avoid this, you can use the --target-database-write-mode parameter with the overwrite option:
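
For example, appended to the run command from earlier (connection strings abbreviated, and the exact casing of the option value is something to confirm against the CLI help):

rgsubset run --database-engine SqlServer --source-connection-string "..." --target-connection-string "..." --options-file rgsubset-options-MyDB.json --target-database-write-mode Overwrite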

This ensures that Subsetter will delete all rows from tables in the target database before writing out the new subset.

Case-sensitive options

Subsetter’s configuration files are case-sensitive, meaning table names and options must match the exact casing used in the database schema. A minor difference in capitalization can cause tables to be ignored or lead to unexpected errors.

beware case-sensitivity

Double-check that table names in your configuration file match the exact case as defined in the database schema.

Connection string issues

Incorrectly formatted connection strings can lead to connection failures. The Subsetter CLI is sensitive to specific keywords and formats, which may differ from other database tools.

  • Ensure that your connection string follows the correct format for your database engine. Refer to connectionstrings.com for examples.
  • Use .NET connection strings, not the JDBC format used by Flyway. If transitioning from Flyway, be aware of these differences (see the example after this list).
  • Some power users attempt to convert between formats—see this Redgate article for more details.
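
To illustrate the difference, here is the same SQL Server connection expressed in both formats (server, database, and credential values are placeholders):

.NET:  Server=myServer;Database=MyDB;User Id=myUser;Password=myPassword;TrustServerCertificate=True;
JDBC:  jdbc:sqlserver://myServer:1433;databaseName=MyDB;user=myUser;password=myPassword;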

Conclusion

Redgate’s Subsetter CLI simplifies how you manage test data by reducing database size while preserving the structure and integrity of the data your applications rely on. It helps developers and testers work faster, with less overhead, but with environments that still accurately reflect the real data, both in terms of its characteristics and its distribution.

This article introduced the basics of subsetting: installing the CLI, configuring your source and target, defining what data to include or exclude, and avoiding common pitfalls. The next articles will look at more advanced use cases, from optimizing for fixed subset sizes to implementing complex filtering logic to fully automating subsetting in a CI/CD pipeline.

 
