Meeting your CCPA needs with Data Classification and Masking
This article will explain how to import the data classification metadata for a SQL Server database into Data Masker, providing a masking plan that you can use to ensure the protection of all this data. By applying the data masking operation as part of an automated database provisioning process, you make it fast, repeatable and auditable.
What data do you store and how is it used?
“In the event of a data breach, any organization must be able to prove that they understand what data they hold and where, how it is being used, and that they have enforced the required standards for access control, security and the ethical use of personal data.”
– William Brewer, The Need for a Data Catalog
Legislation such as CCPA, as well as GDPR and others, now holds organizations legally accountable to the owners of the data for the care with which they manage and use it. Understandably, this means new demands on data teams, who will need to identify all the personal and sensitive data that must be removed or obfuscated, prior to distributing any copies of the database for reuse within the organization, such as for development, testing, analysis and training. For many, the first step will be to build a data catalog that classifies and describes the data in every data source, revealing where, throughout the organization or elsewhere, this data is stored and used and for what purpose, as well as the business owners of that data.
These classifications form the basis for a documented and visible data protection strategy, which will involve, among many other measures, encryption, use of role-based security to restrict access to this data, strict auditing of any changes to it, and ensuring that none of this personal or sensitive data ‘leaks’, unprotected, into various corners of the organization’s infrastructure.
Creating a masking plan based on data classification metadata
“Even small extracts of data need to be created with caution, if they are for public consumption. Sensitive data can ‘hide’ in unexpected places.”
– Phil Factor, Masking Data in Practice
With Data Masker (v7 and later), we can now create a masking plan for each database, based directly on the data classifications agreed by the organization, describing exactly what data needs protection, and how.
Exporting and Importing the data classifications
To get started, open Data Masker, connect to the database and schema containing the data you wish to mask, and then switch to the ‘Tables’ tab. It provides a useful “masking shopping list” that summarizes the columns in each of the tables, the primary keys defined for each table, the number of rows, and which of the columns have been classified and masked appropriately.
In particular, the Plan Summary, Sensitivity and Plan Comment columns tell us, respectively, which of the columns in a table have been classified and masked, whether a column contains sensitive data that needs masking, and the particular data classification for each column (public, GDPR – Sensitive, and so on). As you can see, though, it’s a blank slate to start with, so we need to populate it. As we apply Sensitivity settings to each column, the Plan Summary column will automatically update to reflect which columns have been classified, and which have been masked.
In simple cases, we can apply Sensitivity settings to each column, or to a set of columns or tables, directly within the UI, using the right-click context menu. However, a safer way is to create a masking plan based on your documented data classification strategy, using Data Masker’s new Export/Import Plan feature, the button for which you can see at the bottom of the Tables tab. As you can see, I have two possible sources:
With Data Masker v7, we can import, as a masking plan, either native SQL Server data classifications, or our own custom classifications for each of the tables and columns, stored in a .CSV file. If I’ve used T-SQL or SQL Server Management Studio’s Classify Data workflow, then I can import these SQL Server classifications from the connected schema. For many databases, however, writing the data classification metadata to the schema isn’t really an option: either because the schema is off-limits (as is the case for many third-party databases, which still need to be classified), or because the organization has chosen another method for capturing classification metadata, such as by building a dedicated data catalog.
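For completeness, here’s roughly what writing a native classification to the schema looks like. This is only a sketch, run via Invoke-Sqlcmd from the SqlServer PowerShell module: the ADD SENSITIVITY CLASSIFICATION syntax requires SQL Server 2019 or Azure SQL Database (on older versions, SSMS’s Classify Data workflow stores the same metadata as extended properties instead), and the label and information type shown are just examples.

# a sketch of tagging a column with a native SQL Server classification
# (requires the SqlServer PowerShell module and SQL Server 2019 / Azure SQL Database)
Import-Module SqlServer

$classifyQuery = @"
ADD SENSITIVITY CLASSIFICATION TO dbo.Users.DisplayName
WITH (LABEL = 'Confidential - GDPR', INFORMATION_TYPE = 'Name');
"@

Invoke-Sqlcmd -ServerInstance 'sql1.mydomain.com' -Database 'StackOverflow2010' -Query $classifyQuery

Once the classifications are in the schema, Data Masker can read them directly from the connected database.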
In my case, I’ve classified the data in my StackOverflow2010 database using SQL Data Catalog. Soon, you’ll be able to import these classifications directly into Data Masker, but for now we have the additional step of first using SQL Data Catalog’s PowerShell cmdlets to export the classification metadata as a CSV file.
$authToken = "[my auth token]" Invoke-WebRequest -Uri 'http://sqldatacatalog.mydomain.com:15156/powershell' -OutFile 'data-catalog.psm1' -Headers @{"Authorization" = "Bearer $authToken" } Import-Module .\data-catalog.psm1 -Force # connect to your SQL Data Catalog instance Connect-SqlDataCatalog -ServerUrl 'http://sqldatacatalog.mydomain.com:15156' -AuthToken $authToken $instanceName = 'sql1.mydomain.com' $databaseName = 'StackOverflow2010' # export all columns to a .csv file, swapping the native sensitivity label to 'sensitive' where needed Get-ClassificationColumn -InstanceName $instanceName -DatabaseName $databaseName | ` Select-Object schemaName, tableName, columnName, @{Name = "Sensitivity"; Expression = { switch ($_.sensitivityLabel) { { $_ -match "Confidential - GDPR" } { "Sensitive" } { $_ -match "Public" } { "Nonsensitive" } Default { "Check" } } } } , @{Name = "Comments"; Expression = { $_.sensitivityLabel } } | ` ConvertTo-Csv -NoTypeInformation | ` ForEach-Object { $_ -replace '"' } | ` Out-File "C:\temp\StackOverflow2010-Classification.csv" -Encoding Unicode |
This done, I import the StackOverflow2010-Classification.csv into Data Masker, and I can see that my ‘shopping list’ of the tables and columns that need protection is populated.
Any tables containing ‘sensitive’ columns that haven’t yet had masking rules applied to them (or where you haven’t specified whether a column should be masked) are highlighted in bold.
Using the masking plan to design the masking set
In Data Masker, we arrange various masking rules into a masking set that will obfuscate all the sensitive data in a table, or across a set of tables. The job of a masking operation is to redact, substitute, shuffle and replace personal and sensitive data, as required, such that it can’t be used to identify the subject. The challenge is to do this while also ensuring that the masked data still looks like the real data, and retains its referential integrity and distribution characteristics.
In a well-designed, normalized database with appropriate constraints, this is relatively straightforward and should require only a small number of rules. However, Data Masker also supplies rules to help you, for example, mask denormalized data in a consistent way, or even protect data in databases that don’t have any keys or constraints (a much harder task).
Learning the masking rules
Refer to the documentation for the details of each masking rule, and to our video training course and Product Learning articles for examples of how to use them.
Adding simple rules to the masking set
I can add the simpler rules, such as the Substitution rule, to the masking set directly from the Tables tab, by right-clicking a column and selecting the rule.
Then, I just need to select the substitution dataset, and for simple cases like this, I’m done. Here, I’m just using a “Paragraphs of Gibberish” dataset as an example, but you can do a much better job of generating realistic text, if that’s what’s required.
Masking denormalized data
My requirements are more complex, however, because my database is not fully normalized. This illustrates one of the challenges of automation in this space, and one that we see regularly at customer sites.
The LastEditorDisplayName column on the Posts table duplicates the values in the DisplayName column of the Users table. We also have an identifier (LastEditorUserId), which we can use to join to the Users table, but for some reason the data has been duplicated (and who hasn’t encountered a database schema something like this?).
I want my masking operation to update this data consistently, otherwise my testing may suffer, or I may introduce unrealistic bugs into my test environments. First, I add a substitution rule to the DisplayName column of Users, since this could contain personal data, and then use Data Masker’s ‘Table-To-Table’ rule to update Posts with the corresponding values for LastEditorDisplayName.
I need to make sure my ‘Table-To-Table’ rule is dependent on the previous substitution rule that replaces the DisplayName data, so I’ll drag it inward in the UI to protect that dependency (this could be important during later optimizations).
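Once the masking set has run, it’s worth a quick sanity check that the Table-To-Table rule really has kept the denormalized copy in step with the masked source. The query below is just a sketch, run via Invoke-Sqlcmd from the SqlServer PowerShell module and assuming the usual StackOverflow schema, where Posts.LastEditorUserId references Users.Id; it counts any rows where the two display names still disagree after masking:

Import-Module SqlServer

# count Posts rows whose denormalized display name no longer matches
# the (masked) DisplayName on the corresponding Users row; we expect zero
$checkQuery = @"
SELECT COUNT(*) AS MismatchedRows
FROM dbo.Posts AS p
JOIN dbo.Users AS u ON u.Id = p.LastEditorUserId
WHERE p.LastEditorDisplayName <> u.DisplayName;
"@

Invoke-Sqlcmd -ServerInstance 'sql1.mydomain.com' -Database 'StackOverflow2010' -Query $checkQuery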
In this manner, you can implement a masking set, built by following a plan that itself was generated using the classification metadata (labels) agreed with the business. We’d recommend storing each masking set file in source control, as a DevOps best practice, and to facilitate any future audit demands.
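As a minimal illustration of that practice (the folder layout and commit message here are just placeholders), the masking set file can be versioned alongside the rest of the database project:

# version the masking set alongside the database project
git add .\Masking\StackOverflow2010.DMSMaskSet
git commit -m "Add masking set generated from the agreed classification plan"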
Masking as part of automated and auditable database provisioning
Having devised the masking set, you can test it by running it through the UI. However, generally, you need to run these masking operations as part of an automated process, so you can be sure that they are always applied before moving any data around the organization. Moreover, you will be able to prove it, since you’ll have auditable logs that verify that the masking plan was implemented.
You can automate the masking operation via the command line, or by incorporating the masking set into a SQL Clone operation to provision a sanitized, virtual copy of the database. Here’s my operation running in SQL Clone:
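In SQL Clone, the masking set is attached to the image as a modification, so every clone created from that image is already sanitized. Scripted with SQL Clone’s PowerShell cmdlets, the operation looks something like the sketch below; the server names, paths and masking set file are placeholders from my environment, and the New-SqlCloneMask modification and parameter names should be checked against the SQL Clone documentation for your version:

# a sketch: build a masked image with SQL Clone's PowerShell cmdlets
Connect-SqlClone -ServerUrl 'http://sqlclone.mydomain.com:14145'

$sqlServer  = Get-SqlCloneSqlServerInstance -MachineName 'sql1' -InstanceName 'SQL2017'
$imageStore = Get-SqlCloneImageLocation -Path '\\fileshare\SQLCloneImages'

# attach the masking set so it runs as the image is created
$maskingSet = New-SqlCloneMask -Path 'C:\MaskingSets\StackOverflow2010.DMSMaskSet'

New-SqlCloneImage -Name 'StackOverflow2010-Masked' `
                  -SqlServerInstance $sqlServer `
                  -DatabaseName 'StackOverflow2010' `
                  -Destination $imageStore `
                  -Modifications @($maskingSet) | Wait-SqlCloneOperation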
This automated process produces a log file and set of reports that can prove to my governance team (or auditor) how the operation was carried out:
The log file (which you’ll also see if you run the masking operation through the UI, or the command line) shows clearly that I had a plan and implemented masking that matched that plan.
Summary
There are no ‘shortcuts’ to a data masking operation. Regulatory requirements dictate that all personal and sensitive data is protected. Using a combination of Data Masker and SQL Data Catalog, you can protect your data using a masking plan based on agreed and documented data classifications. The masking operation is designed to be incorporated easily into an automated database provisioning process, so your developers and testers will be able to work with data that accurately reflects the characteristics of real data, but could never be used to reveal the identity of the subject. Your operations team will have the audit trail to prove that the appropriate data protection was always applied.
Discover how Data Masker and SQL Data Catalog can help you implement a robust masking plan and protect sensitive data.