Setting up hierarchical data in Microsoft SQL Server

Background

I have more than once been called upon to set up an SQL database where data is organized in a hierarchy. Examples might include employees set up in an org chart, or a set of ideas arranged in a taxonomy. I have a preferred implementation that I am calling the hierarchy diagram.  I came upon it through a project I inherited, and I’ve been pleased by its elegance and performance.

This implementation requires a single database table, which I’ll call the Nodes table. Each record is an item (i. e., node) in your hierarchy; e.g., an employee, an outline item, a genus of animal. You must have at least these three required fields the Nodes table:

  1. Node ID: this uniquely identifies the node. Typically I use an auto numbered integer field and make this the primary key.
  2. Parent node ID: this is the ID of the node above the current record’s node. In a true hierarchy, every node has only one parent, except the root/top node which has no parent.In the interest of database normalization, your database should be configured with a foreign key constraint. The parent node ID must have the value of a node ID that exists in the table. Also, the parent node ID must not be the same as the node ID.#1 and #2 are sufficient to create a hierarchy of nodes, but not to make the table easily queriable.  This is why we add the third column:
  3. Hierarchy diagram: a delimited list of the node ID and all ancestor node IDs (e.g., the parent, the parent’s parent, etc. all the way up to the root node.). You can use any non numer character as the delimiter. My own personal convention is to use a period.  You can use a VARCHAR field type. (Don’t use VARCHAR(MAX) because it cannot be indexed, which will become important later.)

Here is an example using a business org chart:

NodeID NodeParentID HierarchyDiagram NodeText
1 NULL .1. Fauna
2 1 .1.2. Jerome
3 2 .1.2.3. Lee
4 2 .1.2.4. Alice
5 3 .1.2.3.5. Tyler

 

Ease of Maintenance

The first advantage of the hierarchy diagram is that you do not have to maintain it. You should never manually update this field’s value. Instead, create a trigger that updates this value.

This trigger will run on a node if the following occurs:

  1. The node is inserted.
  2. The node’s parent node ID is updated

The trigger, when invoked, will do the following:

  1. Set its hierarchy diagram to be its parent’s hierarchy diagram concatenated with its own ID.
  2. Do the same operation to all its descendants, as determined by looking at the hierarchy diagram.  (See Get all descendants of a node below.)

Now, the hierarchy diagram will be maintained no matter how the nodes are arranged.  An example of a trigger follows:

CREATE TRIGGER trigger_Nodes_HierarchyDiagram ON Nodes AFTER INSERT, UPDATE
AS 
BEGIN
		-- This trigger will update the hierarchy diagram for the inserted/updated node and all descendants.

	IF (UPDATE(NodeParentID))  -- Ignore updates to other data fields
	BEGIN
			-- First, all updated nodes will start with the same string: the new hierarchy diagram of the
			--  updated/inserted node.  Find this first, then use it as the base for all updated values.
		DECLARE @newDiagramStart VARCHAR(MAX)  

			-- Just use a dot if this is the top of the hierarchy. Otherwise, obtain the hierarchy diagram of the new parent.
		SELECT @newDiagramStart = ISNULL(Nodes.HierarchyDiagram, '.') FROM inserted LEFT JOIN Nodes ON Nodes.NodeID = inserted.NodeParentID  

			-- Append the inserted/updated node ID.
		SELECT @newDiagramStart = @newDiagramStart + CAST(NodeID AS VARCHAR(MAX)) + '.' FROM inserted 

			-- Now update the hierarchy diagram for the inserted/updated node and any descendants.
		UPDATE NodesDescendants
			-- Find the existing hierarchy diagram after the node ID of the inserted/updated node, and append it.
		SET HierarchyDiagram = @newDiagramStart
		 + ISNULL(RIGHT(NodesDescendants.HierarchyDiagram, LEN(NodesDescendants.HierarchyDiagram) - LEN(inserted.HierarchyDiagram)),'')  
		FROM inserted 
			INNER JOIN Nodes AS NodesDescendants 
			ON NodesDescendants.nodeID = inserted.nodeID		-- Update the inserted/updated node.
			OR NodesDescendants.HierarchyDiagram LIKE inserted.HierarchyDiagram + '%'  -- Update all descendants.
	END

END
GO

 

Ease of Use

The second advantage of the hierarchy diagram is that it is easy and efficient to query. It can immediately give you all a node’s ancestors or all a node’s descendants.  These are both accomplished using a string compare with the LIKE operator.

 

Get all ancestors of a node.

Join the table to itself on all hierarchy diagrams that are a substring of the node’s hierarchy diagram.

-- Get all ancestors (n2) of a node (n1) (including the node itself).
SELECT n1.*, n2.NodeID AS AncestorNodeID, n2.HierarchyDiagram AS AncestorHierarchyDiagram
FROM Nodes n1
INNER JOIN Nodes AS n2
ON n1.HierarchyDiagram LIKE n2.HierarchyDiagram + '%'
ORDER BY n1.NodeID, n2.NodeID

 

Get all descendants of a node.

Join the table to itself on all hierarchy diagrams of which the node’s hierarchy diagram is a substring.

-- Get all descendants (n2) of a node (n1) (including the node itself).
SELECT n1.*, n2.NodeID AS DescendantNodeID, n2.HierarchyDiagram as DescendantHierarchyDiagram
FROM Nodes n1
INNER JOIN Nodes AS n2
ON n2.HierarchyDiagram LIKE n1.HierarchyDiagram + '%'
ORDER BY n1.NodeID, n2.NodeID

 

Optimize SELECT speed through indexing

The two string operations listed above are fast because, rather than searching the whole diagram string for substrings, the query can start from the beginning of each string and throw out strings as soon as it finds that the first few characters don’t match.

I recommend further optimizing these queries by creating an index on the Nodes table. The index should have one indexed field: the HierarchyDiagram field. The Node ID can be an included, non-indexed field in the index.  This index speeds up the descendants query because now all descendants of a node will be arranged in a contiguous set of records in the index.

CREATE INDEX IDX_Nodes_HierarchyDiagram ON Nodes(HierarchyDiagram) INCLUDE (NodeID)

To test the effectiveness of an index on the hierarchy diagram field, I took a table of hierarchical data with over 500,000 rows, and queried for all descendants of a given node.  Using the SET STATISTICS TIME ON command, I compared the time it took with and without the index.  Without the index, this query average around 65 ms.  With the index, it took around 4 ms.

 

Visual inspection

Aside from whatever method your front end uses to display the hierarchy information, it is easy to look directly at the database and see the nodes arranged in their hierarchy.  Just use the following query:

SELECT * FROM Nodes ORDER BY hierarchy_diagram

The root node will be at the top, and every node will be followed immediately by all its descendants.  Each node will be followed by its children, but each of its children will be followed by their own children, recursively down to each terminal node in the hierarchy.  (In computer science, this is equivalent to a depth-first search on a tree.)

By looking at your nodes’ metadata along with the hierarchy diagram, you can visually inspect the hierarchy directly in the database table.

 

What about the hierarchy data type in SQL Server 2008?

HierarchyID is a data type introduced by Microsoft in SQL Server 2008 for handling exactly the scenario I’ve described, but with a built-in data type [1].

I have tried this data type and decided against using it for the following reasons:

  1. It is not as easy to use. As you have seen above, queries using the hierarchy diagram can be written very quickly with JOINs and LIKE operators. The SQL statements are short and elegant. The hierarchy data type, on the other hand, requires calls to CLR functions instead of standard SQL. The complexity of using the objects could be overcome through practice, but I see no reason to take the time if an easier way already exists and there is no advantage to changing.The one downside to the hierarchy diagram is the initial work to create the trigger.  But once that is done, the table will take care of itself from then on.
  2. I have not seen a performance advantage. If the hierarchy data type were significantly faster, I would look into using it. However, I ran a test and found that the hierarchy diagram outperformed the HIERARCHY_ID data type. The test I ran was on a table of nodes with over 50,000 nodes, seeking all descendants of node ID 1:
    SELECT NodeDescendants.*
    FROM Nodes AS NodeDescendants
    INNER JOIN Nodes AS NodeToSearch
    ON NodeDescendants.hierarchy_diagram LIKE NodeToSearch.hierarchy_diagram + '%'
    WHERE NodeToSearch.NodeID = 1
    ORDER BY NodeDescendants.hierarchy_diagram
    
    SELECT NodeDescendants.*
    FROM Nodes AS NodeDescendants
    INNER JOIN Nodes AS NodeToSearch 
    ON NodeDescendants.NodeHierarchyID.IsDescendantOf(NodeToSearch.NodeHierarchyID) = 1
    WHERE NodeToSearch.NodeID = 1
    ORDER BY NodeDescendants.NodeHierarchyID
    

    The running times for both queries were about the same, around 720 ms on average.

    I admit I may be doing something wrong, such as failing to create the proper indexes. So, it’s possible that the hierarchy data type could run more efficiently than I realize. But if so, I haven’t found the way to make that efficiency happen.

However, the hierarchy type does have some other functions that could be useful.  For example. the combination of GetLevel() and GetAncestor() could be used to find the ancestor of a given node at the level below the root level.  This could be done using the hierarchy diagram, if an extra field is created that stores the hierarchy level for each node.  I may continue looking into the HIERARCHY_ID in future projects, but for now, the hierarchy diagram is working very well.
 

Sample file
This ZIP file contains a script that will create a sample nodes table, show the ancestors for each node, and show the descendants for each node: hierarchy_sample.sql (ZIP).

 
 

2016-07-22 Update: Clarified some of the code samples, corrected a couple of errors, and added download.