SQL Server Closure Tables

SQL Server has a couple of different ways to represent hierarchies, but they do not allow storing any additional information that is important to the relationship between objects. In this article, Phil Factor explains how to use closure tables to overcome this deficiency.

You’ll find self-referencing tables being used to represent hierarchies in SQL databases, and they seem like an elegant recursive way to do it: AdventureWorks2016 has three of them, including one to map the staff hierarchy. The problem is, of course, that such an approach mixes relationships and values. Before we get too conceptual about this difficulty, just imagine how you’d deal with a sudden request from your AdventureWorks bosses to be able to track the corporate hierarchy over time, or to work out when, and for how long, Tim was reporting to Alice. Yes, the information isn’t there, and you can’t deal with it in a self-referencing table even if you had the information. Relationships can have attributes as well as entities.

The Closure Table pattern

Real-life hierarchies need more than a parent-child relationship. Such hierarchies are often best modelled in SQL Server by having one table for the nodes and another for the ‘edges’, the relations between them. This ‘Closure Table’ pattern is much more suitable for real-life hierarchies that change over time or have conditions or other subtleties in the nature of the branches. You can add attributes of the relationship. In real life, for example, relationships tend have a beginning and an end, and this often needs to be recorded, so your database can tell you the state of the hierarchy at a certain point in time.

EDITOR’S NOTE: SQL Server 2017 has a new feature called Graph Databases which is similar to the techniques explained here. To learn more about this feature, take a look at the series by Robert Sheldon.

Closure tables are plain ordinary relational tables that are designed to work easily with relational operations. It is true that useful extensions are provided for SQL Server to deal with hierarchies. The HIERARCHYID data type and the common language runtime (CLR) SqlHierarchyId class are provided to support the Path Enumeration method of representing hierarchies and are intended to make tree structures represented by self-referencing tables more efficient, but they are likely to be appropriate for some but not all the practical real-life hierarchies or directories. As well as path enumerations, there are also the well-known design patterns of Nested Sets and Adjacency Lists. In this article, we’ll concentrate on closure tables.

A directed acyclic graph (DAG) is a more general version of a closure table. You can use a closure table for a tree structure where there is only one trunk, because a branch or leaf can only have one trunk. We just have a table that has the nodes (e.g. staff member or directory ‘folder’) and edges (the relationships). We are representing an acyclic (no loops allowed) connected graph where the edges must all be unique, and where there is reflexive closure. (each node has an edge pointing to itself)

An Example of a Closure Table

Vadim Tropashko, in his book SQL Design Patterns (Rampant Techpress, Kittrell, NC, USA, 2006.) gave a good explanation of the Closure pattern, and it has since been described in more detail in Bill Karwin’s SQL Antipatterns (The Pragmatic Bookshelf Dallas Texas).

We won’t go into the theory. We’ll just get stuck into demonstrating it with data from AdventureWorks2016 and so you can try things out.

Preparation

We are going to convert the AdventureWorks Employee table from a hierarchy path to a closure table. Then we will see if we can insert a department, delete a manager, insert someone and so on, just to see how hard it is.

We fill this with the data from AdventureWorks

We can now create and fill our new staff table and the StaffClosure Table

The Closure table has a constraint to prevent duplicate edges and to ensure that all heads and tails reference the IDs of existing staff. We’ve added a Depth attribute that isn’t strictly necessary but it’s useful. As well as the direct parent/child references (edges) between nodes, there are ancestor/descendent references (edges). Also, nodes reference themselves (reflexive closure).

 

That’s it. All done now. We have a hierarchy represented by a closure table that we can play with now.

Listing the Tree

Let’s start by seeing who is there and the chain of reporting in the organisation. We’ll include the whole organisation by specifying that the CEO is in the chain. Obviously. You wouldn’t hard-code the name of the CEO in a working system, but use a function to return the reporting structure of the part of the organisation you’re interested in.

Non-recursive Listing

And we can see everyone listed with their management chain.

This method uses a simple aggregation to do this which relies on sorting the paths to get the order right. This was demonstrated by Bill Karwin in Rendering Trees with Closure Tables. To do this using SQL Server, we used the XML trick to emulate the group_concat() that MySQL has.

With SQL Server 2017, we now have String_agg() that is close to group_concat() but which doesn’t allow you to specify the order of concatenation of the expressions within the strings .

In this example, the ultimate boss (ancestor) was specified by hard-coding the name of the person. If this were a Table-valued function you could get the complete reporting line of any node and its descendants all the way up to the ultimate boss..

And here is an example of its use

Recursive Listing

The same thing can be done using a recursive Common Table Expression (CTE), with the advantage that the natural order reflects the traversal of the nodes of the hierarchy and so you don’t need to order by the path.

In case you wish to preserve the order of the result, you can provide a row count and then subsequently sort by that row count, by adding …

… in the final line

Both these queries allow you report the management structure starting at whatever point you wish. The Recursive CTE was suggested by Vadim Tropashko, in his book SQL Design Patterns (Rampant Techpress, Kittrell, NC, USA, 2006.). Here is another version that is handy for a simpler indented list of the structure. This version can be easily turned into a YAML document and thence into JSON.

Here, I’ve used the employee’s job title too, which is handy for double-checking that everyone is reporting to the right manager!

Adding an Employee

Let’s create a new employee, Philip J Factor IIIrd.

Now, we’d like him to report to Ovidiu V Cracium. Senior Tool Designer. We need to insert a new leaf node, including the self-referencing edge. Then we take Ovidiu V Cracium’s connections and copy them to Philip but adding 1 to the depth. Then all we need is the edge that links Ovidiu to Philip.

Now we can check that he is there OK (Normally, you’d turn this code into a function but I’ve unwrapped it, so you can see the cogs move)

We can also use our table-valued function to check the reporting line

Record That an Employee Has Left

Now let’s imagine that Roberto Tamburello, the Engineering Manager, has left the company. We can’t just delete him from the staff. First, we need to decide who his team should now report to. Terri Lee Duffy, Vice President of Engineering, isn’t doing that much, so let’s transfer Roberto’s team to him. In a sense, it is a bit of a promotion for the team because they report directly to a manager who is higher up in the pecking order. Depths will be affected, and so on. Our routine needs to be robust enough to transfer the leaderless team anywhere else in the organization: not just up and down the same branch.

Here is the reporting line for Terri Duffy before we do anything

First, we get the IDs of the participants

Now here is Terri Duffy’s team

General Reporting

These tend to be very simple to do, owing to the comprehensive way that the edges are placed in the closure table.

How many direct reports does ‘Stephen Y Jiang. North American Sales Manager’ have?

How many indirect reports does ‘Ken J Sánchez. Chief Executive Officer’ have?

Which of AdventureWorks staff have the most direct reports?

Using Recursive Queries

There is an occasional need for a recursive function. Getting a JSON document for an organisation hierarchy is an example of where it is handy.

Which gives (after formatting) …

Extending the Model

The whole point of the closure pattern is that we can give the edges other attributes. We’ve actually done this in the previous example with the depth attribute. We’ll add another, the time the edge started and ended.

Imagine we have a playground with some children. Fortunately, they all have different names, so we can use the name as a primary key to make life simpler.

We now create the closure table

We can now simply add the reflexive closure

We can now display the organization tree at any time.

This is at the point that Adolf, Atilla, Mary and Vlad had joined the gang but Gengis and Neil hadn’t appeared

Now Neil and Gengis have joined Adolf’s faction

Catherine has turned up to be Vlad’s sidekick

Now Neil and Gengis have both flounced off indoors

OK, at this stage it looks like overkill, but this system is capable of a great deal of elaboration at little cost.

Conclusions

There are two main points about hierarchies and trees for representing such data as parts lists, organizations or routes.

  • The first is that the record of the relationships should be held separately from the entities that have the relationships. They are different entities with different attributes. Even with the simplest of hierarchies, such as a file system, the files are different than directories. In staff hierarchies, staff members don’t cease to exist the moment they no longer report to a manager, and organizational structures change unceasingly.
  • The second point is that once one gets over the idea that one must pack everything into one table, other ways of doing hierarchies beyond Adjacency lists, Nested Sets or Path Enumerations become attractive. There is no single Closure Table pattern. I’ve seen several, but they make a type of storage that before now seemed slightly scary to the traditional relational database developer, seem much more natural.

I think that the two points should be considered separately. Different applications require different patterns, and Adjacency Lists, Nested Sets and Path Enumerations all have their uses. There are certain applications where the nature of the edges is irrelevant, so it is perfectly fine in that case to hold them in the one relational table. We just don’t have to do it that way.