Simplifying Data Import and Export for Database Development
Extracting and importing data for development and testing is made trickier by issues such as constraints, dependencies, and special data types. This article introduces a cross-RDBMS solution that uses JSON for data storage, PowerShell cmdlets that use ODBC to help automate extraction and import, and JSON Schema for validation. Just provide a DSN, and you’re good to go!
This article aims to simplify the mechanics of extracting test data from an existing database and of loading the required test data set into empty copies of your development and test databases. To simplify the import and export processes, we must script them, of course; otherwise, too much can go wrong. Dependencies, constraint violations, trouble with special datatypes, and so on, can and will cause either process to fail. I’ll provide two PowerShell routines, Insert-JSONDataset and Extract-JSONDataset, to load a dataset into a new database version, or to extract a dataset from it, while taking care of these issues. They are designed to let you easily run a series of database tests with different datasets.
We store the test data outside the database in JSON and use JSON Schema to validate the data against the metadata (ensuring correct data types, constraints, and so on). This technique works across different RDBMSs (SQL Server, PostgreSQL, MySQL, and MariaDB at the time of writing) and so offers the prospect of a unified import/export process.
Generating and maintaining test datasets
To begin, you need a test data management strategy for generating the required development and test datasets and then using that data to provision development and test databases for your Flyway projects. If you use Redgate Test Data Manager, it will produce the different datasets you need, whether by data generation or by masking production data. You can then provision clones for your Flyway projects, containing the test data you need. You can save a data container revision for each new version of a database created by Flyway. This will allow you to quickly load any version of the database during test setup, and quickly ‘reset’ a container during test teardown. You can also graduate a new data container revision to an image so the team can then use it for ongoing work. This will be the most efficient approach, especially when working with very large amounts of test data.
If you’re working with Flyway only, I’ve provided plenty of manual techniques over the years to help you generate test data (although mainly for SQL Server databases). You’ll also need a way to import and export different datasets during development, for different types of tests. This article provides scripted techniques to make this much easier. The Insert-JSONDataset routine will load the required dataset into a freshly-built Flyway database that has no data. If you’ve designed and tested a migration that alters datatypes or requires data migration, for example, then you can use Extract-JSONDataset to extract and save the data for the new version.
These PowerShell cmdlets aren’t designed for speed but are useful for loading and maintaining the various smaller test data sets required to support ad-hoc testing and development tasks for Flyway projects. I tackled a similar problem in an earlier article, Managing Datasets for Database Development Work using Flyway. The approach I describe there is potentially faster for big datasets because it essentially just dumps the data into JSON format without additional structure or validation. The routines I provide in this article represent a much more reliable, versatile and automated way of doing it.
So, why script data import and export?
We need reliable, scripted data import and export processes, whether for managing test data sets, exchanging data between systems, archiving, or ensuring compatibility with legacy systems. Exporting data is generally straightforward, but it isn’t always that simple to import data. The two main problems that will confront you are changes to the target table made after the data was exported and violation of foreign key constraints.
Loading data into a newly built database is easy. You just need a manifest that defines the correct dependency order, and then you use it to import the data into each table, one after another. However, if you’re re-loading that data into a database to which you’ve made subsequent schema changes then, as disparities mount, it can soon escalate to the sort of task where you have to have sandwiches brought in every four hours.
You can take one of two different approaches to avoiding referential constraint violations. One is to temporarily disable the referential constraints, whack in the data, then enable them. The other is to “delete and reload from scratch” using a manifest that establishes table dependency order. Either way is fine, and the method you choose depends on the RDBMS and your circumstances. BCP in SQL Server, for example, is a fine tool for the “disable constraints, whack in the data” approach.
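As a concrete illustration of the first approach, here is a minimal, hedged sketch using SQL Server syntax over ODBC. The DSN ('MyDSN'), database and table (dbo.Orders) are placeholder names, not part of the routines described later in this article.

```powershell
# "Disable constraints, whack in the data, re-enable" pattern (SQL Server syntax).
# 'MyDSN' and dbo.Orders are placeholders; any configured ODBC DSN will do.
$conn = New-Object System.Data.Odbc.OdbcConnection('DSN=MyDSN;Database=Pubs;')
$conn.Open()
$cmd = $conn.CreateCommand()
$cmd.CommandText = 'ALTER TABLE dbo.Orders NOCHECK CONSTRAINT ALL'   # disable FK/check constraints
$cmd.ExecuteNonQuery() | Out-Null
# ... bulk-load the data here (BCP, BULK INSERT, or batched INSERTs) ...
$cmd.CommandText = 'ALTER TABLE dbo.Orders WITH CHECK CHECK CONSTRAINT ALL'  # re-enable and re-validate
$cmd.ExecuteNonQuery() | Out-Null
$conn.Close()
```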
As well as the problem of violating referential constraints, all sorts of other issues can derail data loading, and the native tools for each RDBMS will help you handle many of them. However, if you want a unified, cross-RDBMS approach then you’ll need to deal with a lot of these issues yourself. For example, you’ll need to handle calculated columns separately to ensure that the target database can re-calculate them. When there are auto-incrementing columns, you’ll need to avoid loading duplicate key values or disrupting the auto-increment sequence in a way that causes future inserts to fail. Then there are CLR datatypes, identifiers that must be delimited because they clash with keywords, and quirks and conversion issues with other datatypes to deal with.
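For instance, the identity-column wrinkle is typically handled in SQL Server with something along these lines; the table and values here are hypothetical, and the re-seed step is hedged with the usual caveat that you should check it against your own identity usage:

```powershell
# Sketch of explicit identity insertion (SQL Server syntax); dbo.Orders is a placeholder.
# Explicit key values can only be inserted between the ON and OFF statements, and the
# sequence is then re-seeded so that later inserts don't collide with the loaded keys.
$identityInsertSql = @'
SET IDENTITY_INSERT dbo.Orders ON;
INSERT INTO dbo.Orders (OrderID, CustomerID) VALUES (1, 'ALFKI');
SET IDENTITY_INSERT dbo.Orders OFF;
DBCC CHECKIDENT ('dbo.Orders', RESEED);
'@
```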
It pays to know something about the source and destination tables, especially if there are subtle differences between them, and for this you need to know the metadata of the table you are inserting into.
If you only use one relational database
If you’re only managing test data for one RDBMS, then you’ll probably best use the native export and import tools they provide. They will be fast and will take care of many of the ‘gotchas’ for you. Here is a summary of how you’d extract and import data for each table, using the appropriate tools provided by the Relational Database System:
| RDBMS | Import | Export |
|-------|--------|--------|
| SQL Server | BULK INSERT, BCP, SSIS | BCP, SSMS Export Data Wizard |
| PostgreSQL | COPY, \copy (client-side) | COPY, \copy (client-side) |
| MySQL | LOAD DATA INFILE | SELECT INTO OUTFILE, mysqldump |
| SQLite | .import (SQLite CLI) | .dump (SQLite CLI), CSV export |
| Oracle | SQL*Loader, Data Pump (IMPDP) | Data Pump (EXPDP), SPOOL (to CSV or text files) |
SQL Server
For SQL Server you have the choice of BCP, BULK INSERT, and SSIS:
- BCP (Bulk Copy Program) – allows bulk import or export of data between SQL Server tables and CSV text files or native BCP-format files. You can automate the process with scripts to test multiple datasets (see the sketch after this list).
- BULK INSERT – this command lets you import large datasets into a table directly from a file located on the server. It requires elevated privileges or specific file access permissions.
- SSIS – a tool for automating the import of several datasets, with support for transformations, scheduling, and error handling.
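As an illustration, here is a minimal, hedged sketch of automating BCP from PowerShell. The server, database and table names are placeholders; -n uses native format (which preserves datatypes), -T uses Windows authentication, and -E keeps identity values on import.

```powershell
# Export a table in native BCP format, then re-import it elsewhere.
# 'MyServer' and AdventureWorks.Person.Person are placeholder names.
& bcp 'AdventureWorks.Person.Person' out 'C:\Data\Person.bcp' -n -S 'MyServer' -T
& bcp 'AdventureWorks.Person.Person' in  'C:\Data\Person.bcp' -n -S 'MyServer' -T -E
```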
PostgreSQL
PostgreSQL provides both COPY (or \copy from client machines) and pg_dump:
- COPY command – allows you to import or export data using CSV or other text files located on the server. It is used when the file is on the PostgreSQL server and requires superuser privileges or specific file access permissions. It is highly efficient for bulk test datasets.
- \copy – can be used from a client machine via the psql utility, and does not require superuser access or server-side file availability (a sketch follows this list).
- pg_dump and pg_restore – can be used to dump a dataset from a PostgreSQL database and restore it for testing purposes.
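For example, here is a hedged sketch of a client-side round trip with \copy, run from PowerShell. The server, database, table and credentials are placeholders, and psql must be on the path.

```powershell
# Export a table to CSV from the client, then load it back in.
& psql -h MyServer -d pubs -U MyUser -c "\copy authors TO 'authors.csv' WITH (FORMAT csv, HEADER)"
& psql -h MyServer -d pubs -U MyUser -c "\copy authors FROM 'authors.csv' WITH (FORMAT csv, HEADER)"
```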
MySQL and MariaDB
MySQL has LOAD DATA INFILE. This is like BCP, in that it allows you to load data from a text file into a MySQL table. To export data, you can use:
```sql
SELECT * INTO OUTFILE '/path/to/file.csv'
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  FROM table_name;
```
For CLI-based export, you can use mysqldump, which has options for various export formats.
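For instance, here is a hedged sketch of a per-table dump; the user, database and table names are placeholders. The --tab option writes a .sql file (table definition) and a .txt data file per table into a directory the server can write to, so it needs the FILE privilege.

```powershell
# Dump just the 'authors' table of the 'pubs' database as definition + delimited data.
& mysqldump -u MyUser -p --tab=/var/lib/mysql-files --fields-terminated-by=',' pubs authors
```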
SQLite
SQLite provides the .import command in its command-line interface for bulk importing. You can also export data directly into CSV or text files using the CLI tool:
```
sqlite3 database.db
.headers on
.mode csv
.output output.csv
SELECT * FROM table_name;
.exit
```
Oracle
To import into Oracle, you use a bulk loading utility called SQL*Loader. To export, you can use the EXP utility or the newer EXPDP (Data Pump Export), which will export tables, schemas, or entire databases to files. These files can later be imported using the complementary IMPDP (Data Pump Import).
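As a hedged sketch of the Data Pump route: the schema name, credentials and the DATA_PUMP_DIR directory object below are placeholders that must already exist on the server.

```powershell
# Export the HR schema to a dump file, then import it into another database.
& expdp MyUser/MyPassword SCHEMAS=HR DIRECTORY=DATA_PUMP_DIR DUMPFILE=hr.dmp LOGFILE=hr_exp.log
& impdp MyUser/MyPassword SCHEMAS=HR DIRECTORY=DATA_PUMP_DIR DUMPFILE=hr.dmp LOGFILE=hr_imp.log
```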
An RDBMS-agnostic approach: storing test data in JSON
Historically speaking, we didn’t worry much about neutral ways of storing data. After all, if we only use SQL Server and can rely on BCP’s native format, there is little problem, unless we change the individual tables or the referential design. If we are engaged in multi-RDBMS work, with maybe PostgreSQL, MySQL and SQLite in the mix, then we need metadata.
A major part of the ODBC and JDBC interface is concerned with querying metadata. Why? If we need to move data between different RDBMSs, for example, some information about the metadata must travel with the data.
Until recently, there was little thought given to providing metadata in data documents. XML tried, bless it. YAML said it could do it, but in practice, it couldn’t. JSON, with JSON Schema, can. JSON provides a lightweight, human-readable and editable format for representing structured data. Although not as terse as CSV or TSV, it is a reliable way of storing database records in a text file, because a truncated or incomplete document can be detected. JSON Schema adds a standardized way to define and validate the structure, data types, and constraints of JSON documents, ensuring data integrity and consistency, and providing a way of avoiding common pitfalls in the process of inserting records that use the less common datatypes.
JSON Schema is quite easy to read and write. By default, it allows you to sanity-check a JSON file. For table data, it will ensure that every column is there and that the values can be parsed into whichever of the basic datatypes (“string”, “number”, “integer”, “boolean”, “object” or “array”) is appropriate for the column. It will check numeric data ranges and string lengths. If you want, it will check values against a regex or an enumeration list, and you can write more elaborate expressions to check the data. It is enough to prevent data corruption from causing import errors.
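To give a flavour of what that validation looks like, here is a minimal sketch using PowerShell’s built-in Test-Json cmdlet (PowerShell 6.1 or later). The table, columns and constraints are hypothetical; the schemas generated by the cmdlets described later are rather fuller than this.

```powershell
# A hand-written JSON Schema for one row of a hypothetical Customer table
$schema = @'
{
  "type": "object",
  "required": ["CustomerID", "Name"],
  "properties": {
    "CustomerID": { "type": "integer", "minimum": 1 },
    "Name":       { "type": "string",  "maxLength": 40 },
    "Email":      { "type": ["string", "null"], "pattern": "^[^@]+@[^@]+$" }
  }
}
'@
$row = '{ "CustomerID": 1, "Name": "Annette Lyon", "Email": "annette@example.com" }'
# Test-Json returns $true if the document conforms to the schema
if (Test-Json -Json $row -Schema $schema) { 'row is valid against its schema' }
```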
So, we need three essential components for our file-based data: a directory to represent the database, in which is stored a manifest that lists all the data tables in the correct order for operations such as insertions and deletions; a JSON Schema file for each table, holding the table metadata; and a JSON data file for each table, holding the actual data.
Practicalities of extracting and importing JSON data
To make this system work smoothly, I offer two PowerShell cmdlets that will do a lot of the heavy-lifting for you and allow you to automate these tasks as far as possible: Insert-JSONDataset and Extract-JSONDataset. These are designed to be generally usable for loading a dataset into a new version of a database or to extract a dataset from it.
The cmdlets extract and import the data via an ODBC connection, and all you need to do to get them to work with Flyway is provide the name of the DSN as a user placeholder, plus the name of the database and the required credentials. ODBC and JDBC provide some metadata to help with the extraction, but there are gaps, and inconsistencies in what the drivers for each RDBMS support, so the cmdlets try to deal with as many of the complications as possible.
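To show roughly what the cmdlets have to work with, here is a minimal sketch of querying metadata over an ODBC connection from PowerShell. The DSN, database and credentials are placeholders; this is not the cmdlets’ actual code.

```powershell
# Open an ODBC connection via a configured DSN and list the column metadata
$conn = New-Object System.Data.Odbc.OdbcConnection('DSN=MyDSN;Database=pubs;Uid=MyUser;Pwd=MyPassword;')
$conn.Open()
# ADO.NET exposes the driver's metadata collections (tables, columns, indexes and so on)
$columns = $conn.GetSchema('Columns')
$columns | Select-Object TABLE_SCHEM, TABLE_NAME, COLUMN_NAME, TYPE_NAME | Format-Table -AutoSize
# Note: foreign-key information isn't part of the standard collections, which is one of
# the gaps the cmdlets have to fill from RDBMS-specific system views.
$conn.Close()
```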
The Extract-JSONDataset cmdlet will extract data from a database using an ODBC connection. You need to provide it with the Data Source Name (DSN). It will then:
- Create a manifest that records the order in which tables should be processed, based on dependencies.
- Generate a JSON Schema for each table, describing the structure of the data (columns, data types, constraints, etc.).
- Extract the data for each table into a JSON file. It handles special cases for data types (e.g., geography, hierarchyid, time) to ensure compatibility with the JSON format.
- Validate the JSON data files against their corresponding JSON schemas.
In the illustration below, we provision a development database at the required version, complete with test data (step 1 – not covered in this article). We connect our Flyway project to the database, containing the prepared test data (step 2). We call the Extract-JSONDataset PowerShell cmdlet, which I’ll provide, to extract the data into JSON (step 3). It stores all the JSON data files in a designated data directory, named according to the current schema version of the connected Flyway database.
We can repeat this extraction process, if required, for each new version of the database. When connected to a new, empty copy of a later version of the database, we call the Insert-JSONDataset PowerShell cmdlet and it grabs the data from the corresponding JSON data directory and uses the manifest file to insert it into the empty tables in the correct dependency order (step 4):
Putting it to the test
To demonstrate all this, we can, in SQL Server, create an empty version of AdventureWorks using a build script and fill it with data from an existing copy of the database. We can then extract the data, save it in JSON files and subsequently import the JSON datasets into another empty copy of AdventureWorks. Having done that, we can compare the two databases with our favorite data-comparison tool (e.g. SQL Data Compare) to make sure the data is identical.
AdventureWorks has some deliberately planted ‘gotchas’, or traps, for this sort of test, such as XML datatypes, plenty of identity/auto-increment columns, and sneaky dependencies. If you’re feeling less adventurous, you can try it out with the Pubs database instead.
Extracting datasets into JSON
Once you’ve set up your Flyway project, Flyway’s first task is establishing a connection to the required database. You’ll need to provide the name of the DSN. In the following demo, I’ve also used Add-FlywayEnvironmentConfigItems, which makes it easier to set up ODBC connections and allows Flyway to be run with the minimum of command-line parameters. You’ll need to provide the Flyway configuration file containing all the required details.
Once connected, we execute the Flyway commands to clean and then migrate the database to the required version (to simplify the demo, one of the migration scripts simply inserts some test data). We use the Extract-JSONDataset cmdlet to save the JSON data files, JSON Schema files, and the manifest file in a ReportDirectory folder named according to the current schema version of the connected Flyway database (V1.3 in this example). We can later use this exported JSON dataset to fill any empty copy of the database at the same version.
The cmdlet extracts various bits of information from the config variables, such as the names of the server ($Server) and database ($database), and gets the Flyway schema version ($version) from the info command. If you provide the root path to the ReportDirectory, it will save all the files in a folder called \MyPath\$server\$database\$version. You can override the name of the database or server that was in the DSN, so you only need a separate DSN for each RDBMS.
Here is the code:
```powershell
$VerbosePreference = 'Continue'
#Go to our working directory,
cd '<MyPathTo>\FlywayTeamwork\Pubs'
<# Clear the existing Flyway Environment variables and then read the credential
   files from the user area into environment variables, adding the RDBMS and
   server so we can use an ODBC connection #>
Add-FlywayEnvironmentConfigItems "$($env:USERPROFILE)\<MyDatabaseEnvironment>.conf"
<# Now clean the source database and bring it up to the current level of the
   database so we can use it for our data export. We fetch the current version
   from the INFO command #>
@('clean', 'migrate', 'info') | foreach {
    Flyway $_ -outputType=json > "$($_)Feedback.json"
}
#Fetch the useful information, mainly to get the current version number
$Feedback = (dir *Feedback.json).FullName |
    foreach { Get-Content -path $_ -Raw } | ConvertFrom-JSON
#check for errors. Normally, in a script, you'd stop here if there were errors
$Feedback | foreach { if ($_.error -ne $null) { $_.error } } |
    foreach { write-warning $_.message }
#now get the version number
$Version = $Feedback | where { $_.operation -eq 'info' } | foreach { $_.schemaversion }
#wasn't it lucky we used Add-FlywayEnvironmentConfigItems! We can now do ODBC work
$ODBCServer = $env:FLYWAY_PLACEHOLDERS_ODBCSERVER
$Database = $env:FLYWAY_PLACEHOLDERS_DATABASE
$Parameters = @{
    'SourceDSN' = 'Clone Server'; # you need to set this up first. We overwrite the server/port and credentials
    'Server' = $ODBCserver;
    'Database' = $database;
    'Schemas' = '*';
    'User' = $env:FLYWAY_USER;
    'Password' = $env:FLYWAY_PASSWORD;
    'ReportDirectory' = "C:\JSONData\$Server\$Database\$Version";
} <# we use a directory convention that just keeps things tidy #>
<# Now we actually generate the JSON files by taking the data from the tables #>
$VerbosePreference = 'Continue'
Extract-JSONDataset @Parameters # take our input from the hashtable
$VerbosePreference = 'SilentlyContinue'
```
As a result, we have a directory containing the JSON data files, the JSON Schema files and the manifest.
Importing the JSON datasets
Nice. Now we can insert this JSON data into another, empty copy of the same database. For this, we need the Insert-JSONDataset cmdlet. After connecting and disabling constraints and triggers, if necessary, it will read the manifest to establish the dependency order and then construct batch insert statements for each table, using the associated JSON Schema file. It performs any special handling required for certain data types, excludes columns such as calculated columns, and executes the batch inserts for each table sequentially.
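To illustrate the idea (though not the cmdlet’s actual internals), here is a minimal sketch of turning the rows in a JSON data file into a single batched INSERT statement. The file path, table name and column list are hypothetical, and a real implementation also has to deal with dates, binary types and identity columns.

```powershell
# Read the rows for one table and build a multi-row INSERT statement from them.
$rows = Get-Content 'C:\JSONData\MyServer\Pubs\1.1.10\dbo_Customer.json' -Raw | ConvertFrom-Json
$columns = @('CustomerID', 'Name', 'Email')   # from the table's JSON Schema, minus calculated columns
$values = $rows | ForEach-Object {
    $row = $_
    $literals = foreach ($col in $columns) {
        $v = $row.$col
        if ($null -eq $v) { 'NULL' }                                    # pass nulls through
        elseif ($v -is [string]) { "'" + $v.Replace("'", "''") + "'" }  # quote and escape strings
        else { $v }                                                     # numbers go in as-is
    }
    '(' + ($literals -join ', ') + ')'
}
$batchInsert = "INSERT INTO dbo.Customer ($($columns -join ', ')) VALUES`r`n" +
               ($values -join ",`r`n") + ';'
```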
The following “import” example works very similarly to the export example described previously and uses Insert-JSONDataset to do all the heavy lifting:
```powershell
cd "<MyPathTo>\FlywayTeamwork\Pubs\Branches\develop\Variants\NoData"
#Change environment credentials and connection details
Add-FlywayEnvironmentConfigItems "$($env:USERPROFILE)\<MyOtherDatabaseEnvironment>.conf"
#Get the target version filled and up-to-date
@('clean', 'migrate', 'info') | foreach {
    Flyway $_ -outputType=json > "$($_)Feedback.json"
}
#Fetch the useful information, mainly to get the current version number
$Feedback = (dir *Feedback.json).FullName |
    foreach { Get-Content -path $_ -Raw } | ConvertFrom-JSON
#check for errors. Normally, in a script, you'd stop here if there were errors
$ThereWereErrors = $False; #assume the best
$Feedback | foreach {
    if ($_.error -ne $null) { $ThereWereErrors = $True; $_.error }
} | foreach { write-warning $_.message }
if (-not $ThereWereErrors)
{
    $ODBCServer = $env:FLYWAY_PLACEHOLDERS_ODBCSERVER
    $Database = $env:FLYWAY_PLACEHOLDERS_DATABASE
    $Parameters = @{
        'TargetDSN' = 'Philf01';
        'Server' = $ODBCserver;
        'Database' = $database;
        'ReportDirectory' = $ReportDirectory;
        'User' = $env:FLYWAY_USER;
        'Password' = $env:FLYWAY_PASSWORD;
    }
    $VerbosePreference = 'Continue'
    Insert-JSONDataset @Parameters
    $VerbosePreference = 'SilentlyContinue'
}
else { Write-warning "could not continue because of an error" }
```
By choosing a neutral data format and cross-database connectivity, we can use different RDBMSs, and even (theoretically) do cross-RDBMS data transfers. However, because this technique requires quite a lot of information about the metadata, it exposes the inconsistencies in ODBC drivers.
The most obvious inconsistency that complicates the use of ODBC is the missing foreign key information in the standard. Sure, the information schemas can provide extra information, but each RDBMS has interpreted the information schema in different ways, and one must be prepared to make use of the RDBMS-specific system views to get some data. As I write this, Insert-JSONDataset and Extract-JSONDataset work with SQL Server, PostgreSQL, MySQL, and MariaDB, but more are on the way.
Bugs still crop up under unexpected conditions, such as a table using a reserved word or an illegal but ‘escaped’ identifier, but the cmdlets are reasonably robust.
Why do you need a manifest?
I provide a table manifest within the routine Extract-JSONDataset, which is written out with the data, though the system will work without it because the same information can be gleaned from the JSON Schemas. A manifest is a list of tables in their dependency order. Actually, dependency is in levels rather than a strict order: before you change the data in a table, you must first have dealt with its lower-level dependencies, the tables it depends on for its foreign key references. I keep the level in the JSON Schema. Data must be deleted from the most dependent tables (highest level) downwards and imported into tables from the lowest levels upwards. Even if you don’t explicitly use a manifest, it is possible to work out a correct order for the operation (delete or import) from the information in the JSON Schemas.
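Purely as an illustration of how the levels drive the ordering, here is a sketch using a hypothetical manifest; the real file written by Extract-JSONDataset differs in detail.

```powershell
# A hypothetical manifest: tables and their dependency levels.
$manifest = @'
[
  { "table": "dbo.Customer",  "level": 0 },
  { "table": "dbo.Product",   "level": 0 },
  { "table": "dbo.Order",     "level": 1 },
  { "table": "dbo.OrderLine", "level": 2 }
]
'@ | ConvertFrom-Json
# Delete from the most dependent tables (highest level) downwards...
$deleteOrder = $manifest | Sort-Object level -Descending | ForEach-Object { $_.table }
# ...and insert into tables from the lowest level upwards.
$insertOrder = $manifest | Sort-Object level | ForEach-Object { $_.table }
```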
Additional uses for the import and extract routines
A zipped directory of JSON data isn’t the most economical way of backing up test datasets, but the fact that you get extra information about the columns and dependency level makes it far easier to put the datasets into a database. There is a switch in Insert-JSONDataset that allows you to use it merely for cleaning out all the existing data in the target database. It is handy occasionally, but there are quicker and easier routines for doing this.
It is also possible to use a modified version of Insert-JSONDataset to write out the necessary insertion statements without executing them, but these SQL scripts take up a fair amount of disk space. A more interesting possibility is generating fake data in the JSON tables, rather than extracting it from an existing database. I find this to be more useful if it is done just a table at a time.
Conclusion
JSON-based datasets work very effectively as long as you aren’t in a hurry. This is designed to be a cross-RDBMS utility for relatively small datasets. For large datasets, you are best off with a ‘clone’ technology or a container, but the image or container still must be prepared. Any experienced database developer will know of the times, such as when preparing a container or clone image, when a simple import of data from a set of files is the only way to get the work done, especially if you are working with a range of RDBMSs. If you are creating datasets from applications, scripts, and existing text-based information, then this is certainly a great boon.