Improving the Quality of Data Governance: Where to Start?
How do we set about improving the quality of data governance within an organization? What are the priorities? Data Governance is generally considered to mean providing clear roles, responsibilities, policies, principles, and organizational structures that can ensure that data is managed well, in a way that benefits the whole organization. Where do you start?
Every experienced IT manager will give you a different answer, but here’s what I think is important.
The scope of data governance
Data Governance is an umbrella term for the various tasks that an organization must perform to look after the data it uses responsibly. These tasks include:
- Data Classification – all data needs to be classified so that the organisation can answer the question of where personal, sensitive and commercially confidential data is held
- Data custodianship and compliance – making sure that this data is held securely and robustly in conformance with legal requirements and industry standards
- Data Security and resilience – ensuring that the methods you use to meet these legal requirements are sufficient and up to date
- Metadata inventory, lineage and provenance – data governance aims to ensure that you know what data you are using, the source of the data, who is using it, and where it goes.
- Data Integration and interoperability – is the data being used consistently and in the right context
- Data Retention – providing clear guidance on how, and for how long, the data is kept
- Data Quality – defining the criteria by which we assess that the data is correct and consistent.
My previous article, Data Governance: Joining the Dots, introduced the need for data governance. It described what was involved in data discovery, mapping and classification, and how this provided the basis for devising data strategy and policy as well as data standards. Here, I’ll delve a bit deeper into what’s involved in the above tasks and suggest a route to getting started with the overall task of designing or improving an organization’s data governance strategy.
Find the common ground
One of the first steps must be to find the common threads of an organization’s data governance requirements. It is important to have a consistent organization-wide scope for the standards of the curation of data. Outsiders, especially the judiciary, or even the customers of an organization, are unlikely to sympathize with the predicament of any organization that contains several competing IT activities, warring like mediaeval city-states.
Besides, it is far less costly to have a single set of rules for the different types of data, and a single method of identifying and mapping data. These are normally held simply as documents and available to IT staff. They are best shared as widely as possible. I once had to advise two different departments of Government on data governance, and the idea that they could share a common data strategy was as much of a shock to them as if I’d suggested them sharing with the Russians. It turned out that their data governance requirements were identical. It wouldn’t surprise me if the Russians had a few requirements in common with them.
The same applies to teams within IT. An isolated, inward-looking IT team tends to think that its data, requirements, legislative framework and security requirements are somehow unique. If the diverse teams within an organization that are working with Information Technology can cooperate to solve data governance issues, or share an existing set of strategies, it can save a lot of money and a lot of time. It will also convince them that they are faced with the same data governance issues.
If necessary, this shared model can be ‘federated’ rather than monolithic, as long as the divergencies are justified and clearly expressed. Each team will have special expertise that can help refine any data governance strategy. The best data governance strategy is the result of fostering ways of capturing all the information, wisdom and knowledge of the members of the organization.
Any data governance model will need to document the metadata. The method used must suit the organization: it can be done effectively by ER diagramming or UML modelling, but it must be grounded in the simple facts and described in the organization’s own language and specialized jargon, so that a non-specialist can access it.
Establish the recommended ‘data governance’ roles
Your organization must have one or more people who are responsible for data governance and who report directly to the board or trustees. They will advise the board directly on its responsibilities regarding data, and inform it whether these responsibilities are being met.
The data governance ‘office’ must work closely with everyone in the organization who is involved with data, to ensure that their data is accurate, unduplicated, protected, consistent, complete, timely, resilient, and valid. The data governance office would be the ones to supervise the creation and maintenance of data catalogs that include business glossaries, metadata-driven data dictionaries and data lineage records. These will assist business intelligence staff and at the same time provide an immediate and easily understood purpose for data governance.
Implement role-based access control for data
All databases must have access control. Users should be able to access, and perform searches on, only the data that they need for their roles in the organization. See, for example, Schema-Based Access Control for SQL Server Databases. Data encryption is a sensible precaution, but it is no substitute for role-based access controls.
Unfortunately, you’ll still find systems that access databases with a login that can access any data or even execute files on the server, change configuration settings and elevate their own access permissions. Proper access control is your first line of defense against attacks, such as SQL Injection.
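The principle is easier to see in miniature. What follows is only a sketch, and an application-level one at that: the role names, data sets and actions are all hypothetical, and in a real system you would rely on the database’s own roles and grants rather than a Python dictionary. It does show the essential idea, though, which is that anything not explicitly granted to a role is denied.

```python
# Hypothetical roles, data sets and actions, for illustration only.
ROLE_PERMISSIONS = {
    "sales_analyst": {("orders", "read")},
    "support_agent": {
        ("customers", "read"),
        ("tickets", "read"),
        ("tickets", "write"),
    },
}


def is_allowed(role, dataset, action):
    """Return True only if the role has been explicitly granted the action.

    Unknown roles, data sets or actions are denied by default.
    """
    return (dataset, action) in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("sales_analyst", "orders", "read"))     # granted
print(is_allowed("sales_analyst", "customers", "read"))  # never granted
print(is_allowed("contractor", "orders", "read"))        # unknown role
```

A login that can read everything is the equivalent of skipping this check altogether.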
Make data governance tasks as simple as possible
Many of the people involved in data governance will be some way outside the comfort-zone of their expertise. Wherever you can, use simple language to avoid amplifying the mystery. Encourage the use of tools for the heavy lifting required for searching directories for data, for monitoring data servers, data modelling, building a data catalog and using this to document the data.
Make it an evolutionary process
This isn’t a one-off exercise. Every organization will change, the technology of data will change and so will the requirements of society. It is a process of continuous improvement. The data governance team must monitor progress towards meeting the requirements of compliance and providing a more responsive IT service that values the quality of data and its curation.
Check that your data inventory is complete
Make sure that you have a complete record of databases and data resources run by the organization, and their associated metadata. Check that their classification is correct and complete. Make sure that there are data catalogues and metadata for all the organization’s entities (customers, products, inventory, purchases and so on) that are used by these data resources. It needs to be clear, for example, what database application is keeping aspects of the customer’s interactions with the organisation, and how the organisation can gain the big picture of their customers.
Ensure that there is a single source for accessing all the metadata in the organization.
Data must be associated with, or reference, a description of the data source, its purpose, provenance and processing. This is referred to in data governance as the ‘metadata’. Relevant metadata supports data quality work and helps users to assess whether the data is adequate or appropriate for their use.
It can seem daunting to add a new application to a mature organization-wide reference about the data, its lineage and its governance. This has proved to present an increasing hurdle to the model’s maintenance. As long as such applications are committed to eventual sharing or federation, they should be allowed to make a start with data governance ‘a bite at a time’ rather than try to integrate immediately with an existing single source across all the activities in an organization.
Where a department of an organization does this, they need to be encouraged, but with the clear aim of integration with a centralized metadata repository that is usable across many platforms.
Every organization must be able to demonstrate that it is holding its data responsibly, so a common and consistent source for metadata that has a tracked history of changes (e.g., source-controlled) is essential for effective governance and stewardship. It has the bonus that it makes it much easier for planning scalability and performing data analytics. It also helps different departments understand the value of data lineage when it becomes easier to correct the data at its source, and report on the data at the most appropriate place.
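To make the idea of a tracked history concrete, here is a minimal sketch of a metadata entry that records every change made to it. The class, its fields and the example values are assumptions made purely for illustration. In practice the history would more likely come from source control or from a governance tool than from the object itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Attributes that may be edited; every edit is recorded in the history.
TRACKED_FIELDS = ("source", "purpose", "classification")


@dataclass
class MetadataEntry:
    """Hypothetical metadata record for one data element."""
    name: str
    source: str
    purpose: str
    classification: str
    history: list = field(default_factory=list)

    def update(self, changed_by, **changes):
        """Apply changes, logging who changed what, and when."""
        for attr, new_value in changes.items():
            if attr not in TRACKED_FIELDS:
                raise AttributeError(f"{attr!r} is not a tracked field")
            old_value = getattr(self, attr)
            if old_value == new_value:
                continue  # nothing changed, so nothing to record
            self.history.append({
                "field": attr,
                "old": old_value,
                "new": new_value,
                "changed_by": changed_by,
                "changed_at": datetime.now(timezone.utc).isoformat(),
            })
            setattr(self, attr, new_value)


entry = MetadataEntry(
    name="customer_email",
    source="CRM application",
    purpose="Order confirmations",
    classification="unclassified",
)
entry.update("data.steward", classification="personal")
for change in entry.history:
    print(change["field"], change["old"], "->", change["new"])
```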
Review data quality and check for consistency in usage
Data quality is defined in government documents in terms of completeness, uniqueness, consistency, timeliness, validity and accuracy. This relies on good design of services, data architecture and data collection.
Data quality is relatively easy to achieve by good database design, appropriate documentation, having the right values in the data, preventing duplication, and processing that data effectively. A relational database should use CHECK constraints liberally on the data to prevent bad data getting in and to highlight potential problems. The CHECK constraints on data should be consistent across the organization, as defined in the metadata.
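As a small illustration of CHECK constraints at work, the following sketch uses Python’s built-in sqlite3 module with an in-memory database. The table, its columns and the rules are hypothetical. The same idea applies to any relational database, though the exact syntax and error reporting will differ.

```python
import sqlite3

# Hypothetical table and rules, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_order (
        order_id   INTEGER PRIMARY KEY,
        quantity   INTEGER NOT NULL CHECK (quantity > 0),
        unit_price REAL    NOT NULL CHECK (unit_price >= 0),
        status     TEXT    NOT NULL
                   CHECK (status IN ('pending', 'shipped', 'cancelled'))
    )
    """
)


def try_insert(quantity, unit_price, status):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer_order (quantity, unit_price, status) "
                "VALUES (?, ?, ?)",
                (quantity, unit_price, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(3, 9.99, "pending"))  # valid row
print(try_insert(0, 9.99, "pending"))  # quantity must be positive
print(try_insert(1, 9.99, "lost"))     # status not in the agreed list
```

Because the rules live in the database, every application that writes to the table is held to them, which is what makes it practical to keep them consistent with the metadata.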
Check records management and retention
It is likely that long-established organizations will already have organizational policies and procedures for storing and retaining information. This can provide shortcuts for data governance because it will provide a history of the understanding and management of data across the organization. Good records management will have already identified clear ownership, maintenance of assets, retention and disposal, audit trails and metadata, and this will benefit the development of the metadata.
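Once each type of record has a defined minimum and maximum retention period, checking a record against the policy is straightforward. The sketch below assumes a hypothetical policy table expressed in days. The record types and periods are invented for the example and are not recommendations.

```python
from datetime import date

# Hypothetical policy: record type -> (minimum days, maximum days).
RETENTION_DAYS = {
    "invoice": (2555, 3650),
    "marketing_contact": (0, 730),
}


def retention_status(record_type, created, today):
    """Classify a record against its retention policy.

    Returns 'must_keep' before the minimum period has passed,
    'may_dispose' between the minimum and maximum, and
    'overdue_for_disposal' once the maximum has been exceeded.
    """
    min_days, max_days = RETENTION_DAYS[record_type]
    age_days = (today - created).days
    if age_days < min_days:
        return "must_keep"
    if age_days <= max_days:
        return "may_dispose"
    return "overdue_for_disposal"


today = date(2024, 1, 1)
print(retention_status("invoice", date(2020, 1, 1), today))
print(retention_status("marketing_contact", date(2020, 1, 1), today))
```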
Maintain and improve the Governance Model
For some organizations, the most appropriate way of recording and maintaining an understanding of how data is used, and the requirements for managing it, is by using a single centralized model that defines the important data elements, their category, requirements, lineage and use. It must cover the aspects I listed in the earlier section, “The scope of data governance”. A model can range from a document to a model held in a graph database or in a commercial governance application.
This model will aim to define the core processes to which the organization adheres. Other organizations will have a federated model that has differences for the different data needs of various parts of the organization. The resource must be accessible, and editable. Changes must be tracked. If it can be represented diagrammatically, then it becomes easier to communicate.
Whatever the type of model, it must be simple, and democratic. The only useful data governance model is one that can be embedded naturally into normal IT processes and activities.
Evangelize the benefits of data governance
There are few things less intrinsically exciting than a data governance model; they are difficult to describe without dozing off. This means that one of the most important tasks in a data governance team is to describe and promote the whole activity to members of the organization in plain language, in such a way that its advantages and benefits are obvious.
The initial pain is soon forgotten when the advantages are felt across the organization: the ease of rapid correction of data at source, the speed of finding the right data for business analytics, and the increased speed of audit, for example. The other benefit that the organization gets is that change can be quicker because the repeated ‘discovery’ phase of business analysis for applications can be avoided. You need less time to work out precisely how the organization uses data, because it is, hopefully, already in the model.
Review emerging security risks and changes in the legislative framework
Even if you want your data governance model to remain unchanging, events will soon disabuse you of the idea. The legislative framework and society’s expectations are forever changing. The organization will forever be changing the way it uses data. New security risks will emerge and existing, but hitherto unknown, risks will be highlighted by data breaches. The data governance model will be constantly changing and will need a regular review process.
How to track your progress
We can now come to a quick way of assessing progress towards an effective standard of data governance. I’ve taken the seven data governance tasks identified earlier and tried to map out a sensible progression route from chaotic through to “expert”:
Data governance task | Chaotic | Beginner | Intermediate | Professional | Expert |
Data Retention | No idea when to discard records | There is a standard way of classifying records | Each type of record has a minimum retention period | All types of commercial transaction and customer interaction are retained for a defined period | All personal and commercial data has a clearly defined minimum and maximum retention date |
Data Quality | What could go wrong? | There is a way of classifying data so that its quality can be assessed | Constraints are devised to ensure that the key data is ‘sanity-checked’ | Each type of data has its validation and constraints defined | Each type of data is checked for validity, and is checked against commonly agreed constraints |
Data Classification | It is either an integer, a float or a string | Each system has its own clear way of classifying the data it uses | There is a common and unambiguous means of classifying data | All data in the organization that requires special curation (e.g., encryption, access restriction) is identified | The full range of data used by the organization is classified using common criteria |
Data custodianship and compliance | It’s ours, isn’t it? | All development work uses masked data. The general responsibilities for using personal data are understood | Every part of the organization knows what type of data requires special precautions | It is possible to determine all the users of personal, financial or sensitive data in the organization | All data is used in conformance with both legal and professional standards |
Data Security and resilience | It must be safe: it is behind a firewall and it is backed up! | All systems with data that is classified as requiring security or special resilience are identified | The security settings of every database can be centrally monitored. A resilience plan is created | Role-based access controls to data are introduced for systems and file stores | Full encryption-at-rest and role-based access for databases, applications and files that contain sensitive data |
Data Integration and interoperability | No way are Sales getting our data | Each system is consistent in the way that it defines each type of data | Every upstream and downstream user of data understands it in the same way | Across the organization, each type of data has the same name, type, units, constraints and rules. | The units, conditions of acquisition, and constraints for all types of data are documented |
Metadata inventory, lineage and provenance | I wonder where it all came from | The essential data that is vital to the organization is identified | All data being held by the organization is mapped, identified and recorded | All data is categorized so there can be no question of how it is managed, protected or used | We can correct data at source, and we know all ‘upstream’ and ‘downstream’ systems using the data |
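If you want to keep a record of where you stand, the table translates easily into something you can track over time. This is only a sketch: the level names and task names come from the table above, but the assessment values are made up for the example.

```python
# Maturity levels, in order, as used in the table above.
LEVELS = ["Chaotic", "Beginner", "Intermediate", "Professional", "Expert"]

# Hypothetical self-assessment for one organization.
assessment = {
    "Data Retention": "Beginner",
    "Data Quality": "Intermediate",
    "Data Classification": "Intermediate",
    "Data custodianship and compliance": "Beginner",
    "Data Security and resilience": "Professional",
    "Data Integration and interoperability": "Chaotic",
    "Metadata inventory, lineage and provenance": "Beginner",
}


def weakest_tasks(scores):
    """Return the task or tasks at the lowest maturity level, sorted by name."""
    lowest = min(LEVELS.index(level) for level in scores.values())
    return sorted(
        task for task, level in scores.items()
        if LEVELS.index(level) == lowest
    )


def next_level(level):
    """Return the next level to aim for, or None if already at the top."""
    position = LEVELS.index(level)
    return LEVELS[position + 1] if position + 1 < len(LEVELS) else None


for task in weakest_tasks(assessment):
    print(task, "->", next_level(assessment[task]))
```

Starting with the weakest task is only one possible way of setting priorities. Legal or security pressures may well dictate a different order.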
Summary
Data governance is by no means a new idea and has its roots in business analysis and data architecture. What has changed is that data governance is no longer just good practice but is now a legal requirement as well as being required to meet industry or professional standards. The task of data governance has increased with the use of data across every organization, not just in databases, but also in data stores, files and cloud-based ‘big data’ stores. Wherever it is, and however the data is represented, the responsibilities remain the same and the processes that together make up the task of data governance must be extended to them.
It really shouldn’t be seen as an unnecessary chore to engage in data governance. It is a necessary chore that subsequently speeds up the adoption of further systems and makes the subsequent task of auditing the use of data much easier. Whatever methodologies are used for introducing new functionality to an organization, data governance is still required. It isn’t exciting, but an organization that understands its data can be much quicker in evolving the way it uses that data.