Get It Done With MySQL 5&6, Chapter 20. Copyright © Peter Brawley and Arthur Fuller 2008. All rights reserved.
Working with Graphs in MySQL
Nested set model of a tree Edgelist model of a network Parts explosions
Most nontrivial data is hierarchical. Customers have orders, which have line items, which refer to products, which have prices. Population samples have subjects, who take tests, which give results, which have subresults and norms. Web sites have pages, which have links, which collect hits, which distribute across dates and times. With such data, we know the depth of the hierarchy before we sit down to write a query. The fixed depth of the hierarchy logically limits the number of JOINs needed in a query.
But if our data describes a family tree, or a browsing history, or a bill of materials, then hierarchical depth depends on the data. We no longer know how many JOINs our query will require. We need a different data model.
That model is the graph (Fig 1), which is a set of nodes (vertices) and the edges (lines or arcs) that connect them. This chapter is about how to model and query graphs in a MySQL database.
Graph theory is a branch of topology. It is the study of geometric relations which aren't changed by stretching and compression—rubber sheet geometry, some call it. Graph theory is ideal for modelling hierarchies—like family trees, browsing histories, search trees and bills of materials—whose shape and size we can't know in advance.
Let the set of nodes in Fig 1 be N
, the set of edges be L
, and the graph be G
. Then G
is the tuple or ordered pair {N,L}
:
N = {A,B,C,D,E,F}
L = {AC,CD,CF,BE}
G = {N,L}
If the edges are directed, the graph is a digraph or directed graph. A mixed graph has both directed and undirected edges.
Examples of graphs are organisational charts; itineraries; route maps; parts explosions; massively multiplayer games; language rules; chat histories; network and link analysis in a wide variety of fields, for example search engines, forensics, epidemiology and telecommunications; data mining; models of chemical structure hierarchies; and biochemical processes.
Graph characteristics and models
Nodes and edges : Two nodes are adjacent if there is an edge between them. Two edges are adjacent if they connect to a common node. In a complete graph, all nodes are adjacent to all other nodes.
In a digraph or directed graph, the number of edges entering a node is its indegree; the number leaving is its outdegree. A node of indegree zero is a root node, a node of outdegree zero is a leaf node.
In a weighted graph, used for example to solve the travelling salesman problem, edges have a weight attribute. A digraph with weighted edges is a network.
Paths and cycles: A connected sequence of edges is a path, its length the number of edges traversed. Two nodes are connected if there is a path between them. If there is a path connecting every pair of nodes, the graph is a connected graph.
A path in which no node repeats is a simple path. A path which returns to its own origin without crossing itself is a cycle or circuit. A graph with multiple paths between at least one pair of nodes is reconvergent. A reconvergent graph may be cyclic or acyclic. A unit length cycle is a loop.
If a graph's edges intersect only at nodes, it is planar. Two paths having no node in common are independent.
Traversing graphs: There are two main approaches, breadthfirst and depthfirst. Breadthfirst traversal visits all a node's siblings before moving on to the next level, and typically uses a queue. Depthfirst traversal follows edges down to leaves and back before proceeding to siblings, and typically uses a stack.
Sparsity: A graph where the size of E approaches the maximum N^{2} is dense. When the multiple is much smaller than N, the graph is considered sparse.
Trees: A tree is a connected graph with no cycles. It is also a graph where the indegree of the root node is 0, and the indegree of every other node is 1. A tree where every node is of outdegree <=2 is a binary tree. A forest is a graph in which every connected component is a tree.
Euler paths: A path which traverses every edge in a graph exactly once is an Euler path. An Euler path which is a circuit is an Euler circuit.
If and only if every node of a connected graph has even degree, it has an Euler circuit (which is why the good people of Königsberg cannot go for a walk crossing each of their seven bridges exactly once). If and only if a connected graph has exactly 2 nodes with odd degree, it has a noncircuit Euler path. The degree of an endpoint of a noncycle Euler path is 1 + twice the number of times the path passes through that node, so it is always odd.
Models for computing graphs
Traditionally, computer science textbooks have offered edge lists, adjacency lists and adjacency matrices as data structures for graphs, with algorithms implemented in languages like C, C++ and Java. More recently other models and tools have been suggested, including query languages customised for graphs.
Edge list: The simplest way to represent a graph is to list its edges: for Fig 1, the edge list is {AC,CD,CF,BE}
. It is easy to add an edge to the list; deletion is a little harder.
 
Nodes  Adjacent nodes 
A  C 
B  E 
C  F,D,A 
D  C 
E  B 
F  C 
An undirected graph may be represented by having node j in the list for node i, and node i in the list for node j. Table 1 shows the adjacency list of the graph in Fig 1 interpreted as undirected.
Adjacency matrix: An adjacency matrix represents a graph with n nodes as an n x n matrix, where the entry at (i,j) is 1 if there is an edge from node i to node j, or zero if there is not.
An adjacency matrix can represent a weighted graph using the weight as the entry, and can represent an undirected graph by using the same entry in both (i,j) and (j,i), or by using an upper triangular matrix.
There are useful glossaries here and here.
Graphs and SQL
Often standard SQL has been thought cumbersome for graph problems. Craig Mullins once wrote that "the setbased nature of SQL is not simple to master and is anathema to the OO techniques practiced by Java developers."
A few years after Mullins wrote that, SQL is everywhere, and it is increasingly applied to graph problems. DB2 has a WITH operator for processing recursive sets. Oracle has a CONNECT BY operator for graphs that are trees. SQL Server has recursive unions. MySQL has no such enhancements for graphs, but Joe Celko and Scott Stephens, among others, have published general SQL graph problem solutions that are simpler and smaller than equivalent C++, C# or Java code. Here we implement some of these ideas in MySQL.
Beware that in ports of edge list and adjacency list methods to SQL, there has been name slippage. What's often called the adjacency list model in the SQL world is actually an edge list model. If you follow the nowcommon practice in the SQL world of referring to edge lists as adjacency lists, don't be surprised to find that the model isn't quite like the adjacency list in Table 1. Here we waffle. We call them edgeadjacency lists.
There are also two newer kinds of models: what Joe Celko called the nested sets mode—also known as the interval model—which uses greaterthan/lessthan arithmetic to encode tree relationships and modified preorder tree traversal (MPTT) to query them, and Tropashko's materialised path model, where each node is stored with its path to the root. So we have four main possibilities for modelling graphs in MySQL:
 edgeadjacency lists: based on an adaptation by EF Codd of the logic of linked lists to table structures and queries,
 adjacency matrices,
 nested sets for trees simplify some queries, but insertion and deletion are cumbersome, and
 materialised paths.
Here we work out how to implement edgeadjacency, nested set and materialised path models— or parts of them—in MySQL 5&6.
The edge list
The edge list is the simplest way to represent a graph: minimally, a onecolumn table of nodes and a twocolumn table of edges. The edges table can be thought of as a nodesnodes bridge table, each row containing pointers to origin and destination nodes. Other details of nodes and edges can be encoded in extra nodes and edges columns, or in child tables.
In the real world, the nodes table might be a table of personnel, or assembly parts, or locations on a map. It might have many other columms of data. The edges table might also have additional columns for edge properties. The key integers of both tables might be BIGINTs.
To model Fig 1, though, we keep things as simple as possible:
Listing 1
CREATE TABLE nodes(
nodeID CHAR(1) PRIMARY KEY
);
CREATE TABLE edges(
childID CHAR(1) NOT NULL,
parentID CHAR(1) NOT NULL,
PRIMARY KEY(childID,parentID)
);
INSERT INTO nodes VALUES('A'), ('B'), ('C'), ('D'), ('E'), ('F');
INSERT INTO edges VALUES ('A','C'), ('C','D'), ('C','F'), ('B','E');
SELECT * FROM edges;
+++
 childID  parentID 
+++
 A  C 
 B  E 
 C  D 
 C  F 
+++
Now, without any assumptions whatever about whether the graph is connected, whether it is directed, whether it is a tree, or whatever, how hard is it to write a reachability procedure, a procedure which tells us where we can get to from here, wherever 'here' is?
A simple approach is a breadthfirst search:
 Seed the list with the starting node,
 Add, but do not duplicate, nodes which are children of nodes in the list,
 Add, but do not duplicate, nodes which are parents of nodes in the list,
 Repeat steps 2 and 3 until there are no more nodes to add.
Here it is as a MySQL stored procedure. It avoids duplicate nodes by defining reached.nodeID
as a primary key and adding reachable nodes with INSERT IGNORE
.
If you are running MySQL 5.0.6 through 5.0.15, you will have to make two changes to this and subsequent procedures: move CREATE TABLE
statements outside the procedures, and set log_bin_trust_routine_creators=TRUE.
Listing 2
DROP PROCEDURE IF EXISTS ListReached;
DELIMITER 
CREATE PROCEDURE ListReached( IN root CHAR(1) )
BEGIN
DECLARE rows SMALLINT DEFAULT 0;
DROP TABLE IF EXISTS reached;
CREATE TABLE reached (
nodeID CHAR(1) PRIMARY KEY
) ENGINE=HEAP;
INSERT INTO reached VALUES (root );
SET rows = ROW_COUNT();
WHILE rows > 0 DO
INSERT IGNORE INTO reached
SELECT DISTINCT childID
FROM edges AS e
INNER JOIN reached AS p ON e.parentID = p.nodeID;
SET rows = ROW_COUNT();
INSERT IGNORE INTO reached
SELECT DISTINCT parentID
FROM edges AS e
INNER JOIN reached AS p ON e.childID = p.nodeID;
SET rows = rows + ROW_COUNT();
END WHILE;
SELECT * FROM reached;
DROP TABLE reached;
END;

DELIMITER ;
CALL ListReached('A');
++
 nodeID 
++
 A 
 C 
 D 
 F 
++
To make the procedure more versatile, give it input parameters which tell it whether to list child, parent or all connections, and whether to recognise loops (for example C to C).
To give the model referential integrity, use InnoDB and make edges.childID
and edges.parentID
foreign keys. To add or delete a node, add or delete desired single rows in nodes
and edges
. To change an edge, edit it. The model does not require the graph to be connected or treelike, and does not presume direction.
The edge list is basic to what SQLers often call the adjacency list model.
Edgeadjacency list model of a tree
Writers in the SQL graph literature often give solutions using single denormalised tables. Denormalisation can cost, big time. The bigger the table, the bigger the cost. You cannot edit nodes and edges separately. Carrying extra node information during edge computation slows performance. With nodes and edges denormalised to one table, you have to represent the root node with a NULL.
Normalisation banishes these difficulties. For William Shakespeare's family tree (Fig 2) we again use two tables, a family
table describing family members (nodes), and a familytree
table with a row for each tie that binds (edges). Later, when we use a different tree model, we won't have to mess with the data being modelled.
Listing 3
 Base data:
CREATE TABLE family (
ID smallint(6) PRIMARY KEY AUTO_INCREMENT,
name char(20) default '',
siborder tinyint(4) default NULL,
born smallint(4) unsigned default NULL,
died smallint(4) unsigned default NULL
);
INSERT INTO family VALUES (1, 'Richard Shakespeare', NULL, NULL, 1561);
INSERT INTO family VALUES (2, 'Henry Shakespeare', 1, NULL, 1569);
INSERT INTO family VALUES (3, 'John Shakespeare', 2, 1530, 1601);
INSERT INTO family VALUES (4, 'Joan Shakespeare', 1, 1558, NULL);
INSERT INTO family VALUES (5, 'Margaret Shakespeare', 2, 1562, 1563);
INSERT INTO family VALUES (6, 'William Shakespeare', 3, 1564, 1616);
INSERT INTO family VALUES (7, 'Gilbert Shakespeare', 4, 1566, 1612);
INSERT INTO family VALUES (8, 'Joan Shakespeare', 5, 1568, 1646);
INSERT INTO family VALUES (9, 'Anne Shakespeare', 6, 1571, 1579);
INSERT INTO family VALUES (10, 'Richard Shakespeare', 7, 1574, 1613);
INSERT INTO family VALUES (11, 'Edmond Shakespeare', 8, 1580, 1607);
INSERT INTO family VALUES (12, 'Susana Shakespeare', 1, 1583, 1649);
INSERT INTO family VALUES (13, 'Hamnet Shakespeare', 1, 1585, 1596);
INSERT INTO family VALUES (14, 'Judith Shakespeare', 1, 1585, 1662);
INSERT INTO family VALUES (15, 'William Hart', 1, 1600, 1639);
INSERT INTO family VALUES (16, 'Mary Hart', 2, 1603, 1607);
INSERT INTO family VALUES (17, 'Thomas Hart', 3, 1605, 1670);
INSERT INTO family VALUES (18, 'Michael Hart', 1, 1608, 1618);
INSERT INTO family VALUES (19, 'Elizabeth Hall', 1, 1608, 1670);
INSERT INTO family VALUES (20, 'Shakespeare Quiney', 1, 1616, 1617);
INSERT INTO family VALUES (21, 'Richard Quiney', 2, 1618, 1639);
INSERT INTO family VALUES (22, 'Thomas Quiney', 3, 1620, 1639);
INSERT INTO family VALUES (23, 'John Bernard', 1, NULL, 1674);
 Table which models the tree:
CREATE TABLE familytree (
childID smallint NOT NULL,
parentID smallint NOT NULL,
PRIMARY KEY (childID, parentID);
);
INSERT INTO familytree VALUES
(2, 1), (3, 1), (4, 2), (5, 2), (6, 2), (7, 2), (8, 2), (9, 2),
(10, 2), (11, 2), (12, 6), (13, 6), (14, 6), (15, 8), (16, 8),
(17, 8), (18, 8), (19, 12), (20, 14), (21, 14), (22, 14), (23, 19);
(The family
PK is autoincrement, but the listing is more readerfriendly when the ID
values are shown.)
It will be useful to have a function that returns the family.name
for a pointer in familytree
. Listing 4 shows that function, and simple queries which call it to retrieve child and parent names:
Listing 4
 5.0.16 OR LATER:
SET GLOBAL log_bin_trust_function_creators=TRUE;
DROP FUNCTION IF EXISTS PersonName;
DELIMITER 
CREATE FUNCTION PersonName( personID SMALLINT )
RETURNS CHAR(20)
BEGIN
DECLARE pname CHAR(20) DEFAULT '';
SELECT name INTO pname FROM family WHERE ID=personID;
RETURN pname;
END;

DELIMITER ;
SELECT PersonName( parentID ) AS 'Parent of William'
FROM familytree
WHERE childID = 6;
++
 Parent of William 
++
 Henry Shakespeare 
++
SELECT PersonName( childID ) AS 'Children of William'
FROM familytree
WHERE parentID = (
SELECT ID FROM family
WHERE name = 'William Shakespeare'
);
++
 Children of William 
++
 Susana Shakespeare 
 Hamnet Shakespeare 
 Judith Shakespeare 
++
SELECT
PersonName(childID) AS child,
PersonName(parentID) AS parent
FROM familytree;
+++
 child  parent 
+++
 Henry Shakespeare  Richard Shakespeare 
 John Shakespeare  Richard Shakespeare 
 Joan Shakespeare  Henry Shakespeare 
 Margaret Shakespeare  Henry Shakespeare 
 William Shakespeare  Henry Shakespeare 
 Gilbert Shakespeare  Henry Shakespeare 
 Joan Shakespeare  Henry Shakespeare 
 Anne Shakespeare  Henry Shakespeare 
 Richard Shakespeare  Henry Shakespeare 
 Edmond Shakespeare  Henry Shakespeare 
 Susana Shakespeare  William Shakespeare 
 Hamnet Shakespeare  William Shakespeare 
 Judith Shakespeare  William Shakespeare 
 William Hart  Joan Shakespeare 
 Mary Hart  Joan Shakespeare 
 Thomas Hart  Joan Shakespeare 
 Michael Hart  Joan Shakespeare 
 Elizabeth Hall  Susana Shakespeare 
 Shakespeare Quiney  Judith Shakespeare 
 Richard Quiney  Judith Shakespeare 
 Thomas Quiney  Judith Shakespeare 
 John Bernard  Elizabeth Hall 
+++
GROUP_CONCAT() simplifies grouping parents with their children:
Listing 5
SELECT
PersonName(parentID) AS Parent,
GROUP_CONCAT( PersonName(childID) SEPARATOR ', ' ) AS Children
FROM familytree
GROUP BY parentID;
+++
 Parent  Children 
+++
 Richard Shakespeare  Henry Shakespeare, John Shakespeare 
 Henry Shakespeare  Edmond Shakespeare, Richard Shakespeare, Anne Shakespeare, Joan Shakespeare, 
  Gilbert Shakespeare, William Shakespeare, Margaret Shakespeare, Joan Shakespeare 
 William Shakespeare  Judith Shakespeare, Hamnet Shakespeare, Susana Shakespeare 
 Joan Shakespeare  William Hart, Mary Hart, Thomas Hart, Michael Hart 
 Susana Shakespeare  Elizabeth Hall 
 Judith Shakespeare  Shakespeare Quiney, Richard Quiney, Thomas Quiney 
 Elizabeth Hall  John Bernard 
+++
The paterfamilias is the root node, individuals with no children are the leaf nodes, and queries to retrieve subtree statistics are straightforward:
Listing 6
SELECT
PersonName(ID) AS Paterfamilias,
IFNULL(born,'?') AS Born,
IFNULL(died,'?') AS Died
FROM family AS f
LEFT JOIN familytree AS t ON f.ID=t.childID
WHERE t.childID IS NULL;
++++
 Paterfamilias  Born  Died 
++++
 Richard Shakespeare  ?  1561 
++++
SELECT
PersonName(ID) AS Childless,
IFNULL(born,'?') AS Born,
IFNULL(died,'?') AS Died
FROM family AS f
LEFT JOIN familytree AS t ON f.ID=t.parentID
WHERE t.parentID IS NULL;
++++
 Childless  Born  Died 
++++
 John Shakespeare  1530  1601 
 Joan Shakespeare  1558  ? 
 Margaret Shakespeare  1562  1563 
 Gilbert Shakespeare  1566  1612 
 Anne Shakespeare  1571  1579 
 Richard Shakespeare  1574  1613 
 Edmond Shakespeare  1580  1607 
 Hamnet Shakespeare  1585  1596 
 William Hart  1600  1639 
 Mary Hart  1603  1607 
 Thomas Hart  1605  1670 
 Michael Hart  1608  1618 
 Shakespeare Quiney  1616  1617 
 Richard Quiney  1618  1639 
 Thomas Quiney  1620  1639 
 John Bernard  ?  1674 
++++
SELECT ROUND(AVG(diedborn),2) AS 'Longevity of the childless'
FROM family AS f
LEFT JOIN familytree AS t ON f.ID=t.parentID
WHERE t.parentID IS NULL;
++
 Longevity of the childless 
++
 25.86 
++
In marked contrast with Celko's nested sets model, inserting a new item in this model requires no revision of existing rows. We just add a new family
row, then a new familytree
row with IDs specifying who is the parent of whom. Deletion is also a twostep: delete the familytree
row which documents the childparent link, then delete the family
row for that child.
Retrieving subtrees in the edgeadjacency list model
Retrieving subtrees is what gives the edgeadjacency list model its reputation for difficulty. We can't know in advance, except in the simplest of trees, how many levels of parent and child have to be queried, so we need recursion or a logically equivalent loop.
It's a natural problem for a stored procedure. Here is a breadthfirst algorithm that is straightforward, though not optimised:
 Create a table for the results,
descendants
,  Create a
HEAP
table,nextparents
, for the next parents whose children are to be found , and anotherHEAP
table for use as a temp copy,  Seed
nextparents
with the name of the ancestor whose descendants we wish to list,  Add to
descendants
all children of rows innextparents
,  Seed
nextparents
with those children's IDs, and if there are any, loop back to 4.
As a convenience, the procedure accepts either a name or numeric ID argument. (With MySQL 5.0.6 or 5.0.7, you will have to move the CREATE TABLE calls outside the procedure, to step around a MySQL bug that was fixed in 5.0.9.)
Listing 7
DROP PROCEDURE IF EXISTS ListDescendants;
DELIMITER 
CREATE PROCEDURE ListDescendants( ancestor CHAR(20) )
BEGIN
DECLARE rows, iLevel, iMode INT DEFAULT 0;
 create temp tables
DROP TEMPORARY TABLE IF EXISTS descendants,nextparents,prevparents;
CREATE TEMPORARY TABLE descendants( childID INT, parentID INT, level INT );
CREATE TEMPORARY TABLE IF NOT EXISTS nextparents ( parentID SMALLINT) ENGINE=MEMORY;
CREATE TEMPORARY TABLE prevparents LIKE nextparents;
 seed nextparents
IF ancestor RLIKE '[:alpha:]+' THEN  ancestor passed as a string
INSERT INTO nextparents SELECT id FROM family WHERE name=ancestor;
ELSE
SET iMode = 1;  ancestor passed as a numeric
INSERT INTO nextparents VALUES( CAST( ancestor AS UNSIGNED ));
END IF;
SET rows = ROW_COUNT();
WHILE rows > 0 DO
 add children of nextparents
SET iLevel = iLevel + 1;
INSERT INTO descendants
SELECT t.childID, t.parentID, iLevel
FROM familytree AS t
INNER JOIN nextparents USING(parentID);
SET rows = ROW_COUNT();
 save nextparents to prevparents
TRUNCATE prevparents;
INSERT INTO prevparents
SELECT * FROM nextparents;
 next parents are children of these parents:
TRUNCATE nextparents;
INSERT INTO nextparents
SELECT childID FROM familytree
INNER JOIN prevparents USING (parentID);
SET rows = rows + ROW_COUNT();
END WHILE;
 result
IF iMode = 1 THEN
SELECT CONCAT(REPEAT( ' ', level), parentID ) As Parent, GROUP_CONCAT(childID) AS Child
FROM descendants GROUP BY parentID ORDER BY level;
ELSE
SELECT CONCAT(REPEAT( ' ', level), PersonName(parentID) ) As Parent, PersonName(childID) AS Child
FROM descendants;
END IF;
DROP TEMPORARY TABLE descendants,nextparents,prevparents;
END;

DELIMITER ;
CALL ListDescendants( 1 );
+++
 Parent  Child 
+++
 1  2,3 
 2  11,10,9,8,7,6,5,4 
 6  14,13,12 
 8  15,16,17,18 
 12  19 
 14  20,21,22 
 19  23 
+++
CALL ListDescendants( 'Richard Shakespeare' );
+++
 Parent  Child 
+++
 Richard Shakespeare  Henry Shakespeare 
 Richard Shakespeare  John Shakespeare 
 Henry Shakespeare  Joan Shakespeare 
 Henry Shakespeare  Margaret Shakespeare 
 Henry Shakespeare  William Shakespeare 
 Henry Shakespeare  Gilbert Shakespeare 
 Henry Shakespeare  Joan Shakespeare 
 Henry Shakespeare  Anne Shakespeare 
 Henry Shakespeare  Richard Shakespeare 
 Henry Shakespeare  Edmond Shakespeare 
 William Shakespeare  Susana Shakespeare 
 William Shakespeare  Hamnet Shakespeare 
 William Shakespeare  Judith Shakespeare 
 Joan Shakespeare  William Hart 
 Joan Shakespeare  Mary Hart 
 Joan Shakespeare  Thomas Hart 
 Joan Shakespeare  Michael Hart 
 Susana Shakespeare  Elizabeth Hall 
 Judith Shakespeare  Shakespeare Quiney 
 Judith Shakespeare  Richard Quiney 
 Judith Shakespeare  Thomas Quiney 
 Elizabeth Hall  John Bernard 
+++
Not too bad. Notice how little of the code is specific to the Shakespeare example: the logic will port easily to any other edge list. In fact let's prove that right now by rewriting ListDescendants() for the general case. We need six parameters to drive three PREPAREd queries:
 the name and key column of the data table (family, ID in the Shakespeare example),
 the name of the edge table, its ID column and its parentID column (familytree, childID, parentID),
 a parameter for the ancestorID whose subtree is to be listed:
Listing 7a: Generalpurpose edge list subtree walker DROP PROCEDURE IF EXISTS GenericSubtree; DELIMITER 
CREATE PROCEDURE GenericSubtree( dataTable CHAR(64), dataIDcol CHAR(64), edgeTable CHAR(64), edgeIDcol CHAR(64), edgeParentIDcol CHAR(64), ancestorID INT ) BEGIN DECLARE rows INT DEFAULT 0;  create temp tables DROP TEMPORARY TABLE IF EXISTS descendants,nextparents,prevparents; CREATE TEMPORARY TABLE descendants( id INT, parentID INT ); CREATE TEMPORARY TABLE IF NOT EXISTS nextparents ( parentID INT ) ENGINE=MEMORY; CREATE TEMPORARY TABLE prevparents LIKE nextparents;  seed nextparents table with ancestorID INSERT INTO nextparents VALUES (ancestorID); SET rows = ROW_COUNT(); WHILE rows > 0 DO  insert children of nextparents into descendants SET @sql = CONCAT( 'INSERT INTO descendants SELECT t.', edgeIDcol, ',t.', edgeParentIDcol, ' FROM ', edgeTable, ' AS t JOIN nextparents n ON t.', edgeParentIDcol, ' = n.parentID' ); PREPARE stmt FROM @sql; EXECUTE stmt; SET rows = ROW_COUNT(); DROP PREPARE stmt;  save copy of nextparents to prevparents TRUNCATE prevparents; INSERT INTO prevparents SELECT * FROM nextparents;  insert the children of these parents into nextparents: TRUNCATE nextparents; SET @sql = CONCAT( 'INSERT INTO nextparents SELECT ', edgeIDcol, ' FROM ', edgeTable, ' t JOIN prevparents p ON t.', edgeParentIDcol, ' = p.parentID' ); PREPARE stmt FROM @sql; EXECUTE stmt; SET rows = rows + ROW_COUNT(); DROP PREPARE stmt; END WHILE;  show the result SET @sql = CONCAT( 'SELECT t.* FROM descendants d', ' JOIN ', edgeTable, ' e ON d.ID = e.', edgeIDCol, ' JOIN ', dataTable, ' t ON e.', edgeIDcol, '=t.', dataIDcol ); PREPARE stmt FROM @sql; EXECUTE stmt; DROP PREPARE stmt;  clean up DROP TEMPORARY TABLE descendants,nextparents,prevparents; END;  DELIMITER ;
Does it work? Let's have William's descendants again:
CALL GenericSubtree( 'family', 'ID', 'familytree', 'childID', 'parentID', (SELECT ID from family where name='William Shakespeare') ); ++++++  ID  name  siborder  born  died  ++++++  12  Susana Shakespeare  1  1583  1649   13  Hamnet Shakespeare  1  1585  1596   14  Judith Shakespeare  1  1585  1662   19  Elizabeth Hall  1  1608  1670   20  Shakespeare Quiney  1  1616  1617   21  Richard Quiney  2  1618  1639   22  Thomas Quiney  3  1620  1639   23  John Bernard  1  NULL  1674  ++++++
Is it fast? No. But once we have this algorithm, pruning subtrees is no harder than calling GenericSubtree() then deleting the listed rows. Better still, write a generic tree pruner from Listing 7a replacing the final SELECT with a DELETE command. To insert a subtree, prepare a table of new rows, point its top edge at an existing node as parent, and INSERT it.
For speed, try recursion! Here is a recursive depthfirst PHP treewalk for the familytree and family tables:
Listing 7b: Recursive edge list subtree in PHP $info = recursivesubtree( 1, $a = array(), 0 ); foreach( $info as $row ) echo str_repeat( " ", 2*$row[4] ), ( $row[3] > 0 ) ? "<b>{$row[1]}</b>" : $row[1], "<br/>"; function recursivesubtree( $rootID, $a, $level ) { $childcountqry = "(SELECT COUNT(*) FROM familytree WHERE parentID=t.childID) AS childcount"; $qry = "SELECT t.childid,f.name,t.parentid,$childcountqry,$level " . "FROM familytree t JOIN family f ON t.childID=f.ID " . "WHERE parentid=$rootID ORDER BY childcount<>0,t.childID"; $res = mysql_qry( $qry ); while( $row = mysql_fetch_row( $res )) { $a[] = $row; if( $row[3] > 0 ) $a = recursivesubtree( $row[0], $a, $level+1 ); // down before right } return $a; }
A query with a subquery, a fetch loop, and a recursive callthat's all there is to it. To port this to MySQL, you must have set maximum recursion depth in my.cnf/ini or in your client:
Listing 7c: Recursive edge list subtree in MySQL SET @@SESSION.max_sp_recursion_depth=25; DROP PROCEDURE IF EXISTS recursivesubtree; DELIMITER  CREATE PROCEDURE recursivesubtree( iroot INT, ilevel INT ) BEGIN DECLARE ichildid, iparentid,ichildcount,done INT DEFAULT 0; DECLARE cname VARCHAR(64); DECLARE cur CURSOR FOR SELECT t.childid,t.parentid,f.name, (SELECT COUNT(*) FROM familytree WHERE parentID=t.childID) AS childcount FROM familytree t JOIN family f ON t.childID=f.ID WHERE parentid=iroot ORDER BY childcount<>0,t.childID; DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET done = 1; IF ilevel = 0 THEN DROP TEMPORARY TABLE IF EXISTS descendants; CREATE TEMPORARY TABLE descendants( childID INT,parentID INT,name VARCHAR(64),childcount INT,level INT ); END IF; OPEN cur; WHILE NOT done DO FETCH cur INTO ichildid,iparentid,cname,ichildcount; IF NOT done THEN INSERT INTO descendants VALUES(ichildid,iparentid,cname,ichildcount,ilevel ); IF ichildcount > 0 THEN CALL recursivesubtree( ichildid, ilevel + 1 ); END IF; END IF; END WHILE; CLOSE cur; END;  DELIMITER ; CALL recursivesubtree(1,0); SELECT CONCAT(REPEAT(' ',2*level),IF(childcount,UPPER(name),name)) AS 'Richard\'s Descendants' FROM descendants; ++  Richard's Descendants  ++  John Shakespeare   HENRY SHAKESPEARE   Joan Shakespeare   Margaret Shakespeare   Gilbert Shakespeare   Anne Shakespeare   Richard Shakespeare   Edmond Shakespeare   WILLIAM SHAKESPEARE   Hamnet Shakespeare   SUSANA SHAKESPEARE   ELIZABETH HALL   John Bernard   JUDITH SHAKESPEARE   Shakespeare Quiney   Richard Quiney   Thomas Quiney   JOAN SHAKESPEARE   William Hart   Mary Hart   Thomas Hart   Michael Hart  ++
The recursive depthfirst treewalk is faster than the breadthfirst procedures. It is also faster than a MySQL version of Kendall Willet's depthfirst edge list subtree algorithm:
Listing 7d: Depthfirst edge list subtree CREATE PROCEDURE depthfirstsubtree( iroot INT ) BEGIN DECLARE ilastvisited, inxt, ilastord INT; SET ilastvisited = iroot; SET ilastord = 1; DROP TABLE IF EXISTS descendants; CREATE TABLE descendants SELECT childID,parentID,1 AS ord FROM familytree; UPDATE descendants SET ord=1 WHERE childID=iroot; this: LOOP SET inxt = NULL; SELECT MIN(childID) INTO inxt FROM descendants  go down WHERE parentID = ilastvisited AND ord = 1 ; IF inxt IS NULL THEN  nothing down, so go right SELECT MIN(d2.childID) INTO inxt FROM descendants d1 JOIN descendants d2 ON d1.parentID = d2.parentID AND d1.childID < d2.childID WHERE d1.childID = ilastvisited; END IF; IF inxt IS NULL THEN  nothing right. so go up SELECT parentID INTO inxt FROM descendants WHERE childID = ilastvisited AND parentID IS NOT NULL; END IF; UPDATE descendants SET ord = ilastord + 1 WHERE childID = inxt AND ord = 1; IF ROW_COUNT() > 0 THEN SET ilastord = ilastord + 1; END IF; IF inxt IS NULL THEN LEAVE this; END IF; SET ilastvisited = inxt; END LOOP; END;
One reason Willet's is slower is that MySQL does not permit use of temporary tables with its queries; when all algorithms are denied temp tables, though, this algorithm is still slower than recursion.
Edge list subtrees are easier to query than their reputation suggests. And edge tables are flexible. For a tree describing a parts explosion rather than a family, just add columns for weight, quantity, assembly time, cost, price and so on. Reports need only aggregate column values and sums. We'll revisit this near the end of the chapter.
Enumerating paths in an edgeadjacency list
Path enumeration in an edge list tree is almost as easy as depthfirst subtree traversal:
 create a table for paths,
 seed it with paths of unit length from the tree table,
 iteratively add paths till there are no more to add.
MySQL's INSERT IGNORE command simplifies the code by removing the need for a NOT EXISTS(...) clause in the INSERT ... SELECT statement. Since adjacencies are logically symmetrical, we make path direction the caller's choice, UP
or DOWN
:
Listing 8
DROP PROCEDURE IF EXISTS ListAdjacencyPaths;
DELIMITER 
CREATE PROCEDURE ListAdjacencyPaths( IN direction CHAR(5) )
BEGIN
DROP TABLE IF EXISTS paths;
CREATE TABLE paths(
start SMALLINT,
stop SMALLINT,
len SMALLINT,
PRIMARY KEY(start,stop)
) ENGINE=HEAP;
IF direction = 'UP' THEN
INSERT INTO paths
SELECT childID,parentID,1
FROM familytree;
ELSE
INSERT INTO paths
SELECT parentID,childID,1
FROM familytree;
END IF;
WHILE ROW_COUNT() > 0 DO
INSERT IGNORE INTO paths
SELECT DISTINCT
p1.start,
p2.stop,
p1.len + p2.len
FROM paths AS p1 INNER JOIN paths AS p2 ON p1.stop = p2.start;
END WHILE;
SELECT start, stop, len
FROM paths
ORDER BY start, stop;
DROP TABLE paths;
END;

DELIMITER ;
To find the paths from just one node, seed the paths
table with paths from the starting node, then iteratively search a JOIN of familytree
and paths
for edges which will extend existing paths in the userspecified direction:
Listing 9
DROP PROCEDURE IF EXISTS ListAdjacencyPathsOfNode;
DELIMITER 
CREATE PROCEDURE ListAdjacencyPathsOfNode( IN node SMALLINT, IN direction CHAR(5) )
BEGIN
TRUNCATE paths;
IF direction = 'UP' THEN
INSERT INTO paths
SELECT childID,parentID,1
FROM familytree
WHERE childID = node;
ELSE
INSERT INTO paths
SELECT parentID,childID,1
FROM familytree
WHERE parentID = node;
END IF;
WHILE ROW_COUNT() > 0 DO
IF direction = 'UP' THEN
INSERT IGNORE INTO paths
SELECT DISTINCT
paths.start,
familytree.parentID,
paths.len + 1
FROM paths
INNER JOIN familytree ON paths.stop = familytree.childID;
ELSE
INSERT IGNORE INTO paths
SELECT DISTINCT
paths.start,
familytree.childID,
paths.len + 1
FROM paths
INNER JOIN familytree ON paths.stop = familytree.parentID;
END IF;
END WHILE;
SELECT start, stop, len
FROM paths
ORDER BY start, stop;
END;

DELIMITER ;
CALL ListAdjacencyPathsOfNode(1,'DOWN');
++++
 start  stop  len 
++++
 1  2  1 
 1  3  1 
 1  4  2 
 1  5  2 
 1  6  2 
 1  7  2 
 1  8  2 
 1  9  2 
 1  10  2 
 1  11  2 
 1  12  3 
 1  13  3 
 1  14  3 
 1  15  3 
 1  16  3 
 1  17  3 
 1  18  3 
 1  19  4 
 1  20  4 
 1  21  4 
 1  22  4 
 1  23  5 
++++
These algorithms don't bend the brain. They perform acceptably with large trees. Querying edgeadjacency lists for subtrees and paths is less daunting than their reputation suggests.
Nested sets model of a tree
Imagine an oval drawn round every leaf and every subtree in Fig 2, and a final oval round the entire tree. The tree is a set. Each subtree is a subset. That's the basic idea of the nested sets model.
The advantage of the nested sets model is that root, leaves, subtrees, levels, tree height, ancestors, descendants and paths can be retrieved without recursion or application language code. The disadvantages are:
 initial setup of the tree table can be difficult,
 queries for parents (immediate superiors) and children (immediate subordinates) are more complicated than with an edge list model,
 insertion, updates and deletion are extremely cumbersome since they may require updates to much of the tree.
The nested sets model depends on using a modified preorder tree traversal (MPTT) depthfirst algorithm to assign each node left and right integers which define the node's tree position. All nodes of a subtree have
 left values greater than the subtree parent's left value, and
 right values smaller than that of the subtree parent's right value.
so queries for subtrees are dead simple. If the numbering scheme is integersequential as in Fig 3, the root node receives a left value of 1 and a right value equal to twice the item count.
To see how to code nested sets using MPTT, trace the ascending integers in Fig 3, starting with 1 on the left side of the root node (Richard Shakespeare
). Following edges downward and leftward, the left side of each box gets the next integer. When you reach a leaf (Joan
, left=3), the right side of that box gets the next integer (4). If there is another node to the right on the same level, continue in that direction; otherwise continue up the right side of the subtree you just descended. When you arrive back at the root on the right side, you're done. Down, right and up.
A serious problem with this scheme jumps out right away: after you've written the Fig 3 tree to a table, what if historians discover an older brother or sister of Henry and John? Every row in the tree table must be updated!
Celko and others have proposed alternative numbering schemes to get round this problem, but the logical difficulty remains: inserts and updates can invalidate many or all rows, and no SQL CHECK or CONSTRAINT can prevent it. The nested sets model is not good for trees which require frequent updates, and is pretty much unsupportable for large updatable trees that will be accessed by many concurrent users. But as we'll see in a moment, it can be very useful indeed for reporting a tree.
How to build a nested set representation from an edge list
Obviously, numbering a tree by hand will be errorprone, and seriously impractical for large trees. It's usually best, therefore, to code the tree initially as an edgeadjacency list, then use a stored procedure to translate the edge list to a nested sets model. We can write a stored procedure that uses Celko's pushdown stack method to translate any edge list into a nested sets tree:
 create a table for the tree:
node
,leftedge
,rightedge
, and a stack pointer (top
),  seed that table,
nestedsettree
, with the root node of the adjacency tree,  set a counter to 1 plus the left value of the root node, i.e. 2,
 while that counter is less than the right value of the root node ...
 insert rows for children of parent rows into
nestedsettree
, or  update existing right integers in
nestedsettree
.
 insert rows for children of parent rows into
Listing 10
DROP PROCEDURE IF EXISTS EdgeListToNestedSet;
DELIMITER 
CREATE PROCEDURE EdgeListToNestedSet( edgeTable CHAR(64), idCol CHAR(64), parentCol CHAR(64) )
BEGIN
DECLARE maxrightedge, rows SMALLINT DEFAULT 0;
DECLARE current SMALLINT DEFAULT 1;
DECLARE nextedge SMALLINT DEFAULT 2;
 create working tree table as a copy of edgeTable
DROP TEMPORARY TABLE IF EXISTS tree;
CREATE TEMPORARY TABLE tree( childID INT, parentID INT );
SET @sql = CONCAT( 'INSERT INTO tree SELECT ', idCol, ',', parentCol, ' FROM ', edgeTable );
PREPARE stmt FROM @sql;
EXECUTE stmt;
DROP PREPARE stmt;
 initialise result table
DROP TABLE IF EXISTS nestedsettree;
CREATE TABLE nestedsettree (
top SMALLINT, nodeID SMALLINT, leftedge SMALLINT, rightedge SMALLINT
) ENGINE=HEAP;
 find root of tree (or: add an sproc param for the desired root value)
SET @sql = CONCAT( 'SELECT DISTINCT f.', parentCol, ' INTO @root FROM ',
edgeTable, ' AS f LEFT JOIN ', edgeTable, ' AS t ON f.',
parentCol, '=', 't.', idCol, ' WHERE t.', idCol, ' IS NULL'
);
 For the familytree table, the above query parses to:
 SELECT DISTINCT f.parentid INTO @root
 FROM familytree AS f LEFT JOIN familytree AS t ON f.parentid=t.childid
 WHERE t.childid IS NULL 
PREPARE stmt FROM @sql;
EXECUTE stmt;
DROP PREPARE stmt;
 build nested sets tree
SET maxrightedge = 2 * (1 + (SELECT + COUNT(*) FROM tree));
INSERT INTO nestedsettree VALUES( 1, @root, 1, maxrightedge );
WHILE nextedge < maxrightedge DO
SELECT *
FROM nestedsettree AS s
JOIN tree AS t ON s.nodeID=t.parentID AND s.top=current;
SET rows = FOUND_ROWS();
IF rows > 0 THEN
BEGIN
INSERT INTO nestedsettree
SELECT current+1, MIN(t.childID), nextedge, NULL
FROM nestedsettree AS s
JOIN tree AS t ON s.nodeID = t.parentID AND s.top = current;
DELETE FROM tree
WHERE childID = (SELECT nodeID FROM nestedsettree WHERE top=(current+1));
SET nextedge = nextedge + 1;
SET current = current + 1;
END;
ELSE
BEGIN
UPDATE nestedsettree
SET rightedge=nextedge, top = top
WHERE top=current;
SET nextedge=nextedge+1;
SET current=current1;
END;
END IF;
END WHILE;
SET rows := ( SELECT COUNT(*) FROM tree );
DROP TEMPORARY TABLE tree;
 show result
IF rows > 0 THEN
SELECT 'Orphaned rows remain';
ELSE
SELECT nodeID, leftedge AS 'Left', rightedge AS 'Right'
FROM nestedsettree ORDER BY nodeID;
END IF;
END 
DELIMITER ;
CALL EdgeListToNestedSet( 'familytree', 'childID', 'parentID' );
++++
 nodeID  Left  Right 
++++
 1  1  46 
 2  2  43 
 3  44  45 
 4  3  4 
 5  5  6 
 6  7  24 
 7  25  26 
 8  27  36 
 9  37  38 
 10  39  40 
 11  41  42 
 12  8  13 
 13  14  15 
 14  16  23 
 15  28  29 
 16  30  31 
 17  32  33 
 18  34  35 
 19  9  12 
 20  17  18 
 21  19  20 
 22  21  22 
 23  10  11 
++++
Finding a node's parent and children
A nested sets tree can be queried without recursion, but using the model's arithmetic in a query requires a bit of thought. For example, if we INNER JOIN the tree table AS t1
to itself AS t2 ON t2.leftedge BETWEEN t1.leftedge AND t1.rightedge
, and if we scope the query to the descendants of a particular node, we get a list of t2.nodeID
values in which the children (one level down) occur once, the grandkids (two levels down) occur twice, and so on:
Listing 11a
SELECT GROUP_CONCAT(t2.nodeID ORDER BY t2.nodeID) AS 'Descendants of William'
FROM nestedsettree AS t1
INNER JOIN nestedsettree AS t2
ON t2.leftedge BETWEEN t1.leftedge AND t1.rightedge
WHERE t1.leftedge > 7 AND t1.rightedge < 24;
++
 Descendants of William 
++
 12,13,14,19,19,20,20,21,21,22,22,23,23,23 
++
Therefore
HAVING COUNT(t2.nodeID)=1
will scope listed descendants to those who are one level down:
Listing 11b
DROP PROCEDURE IF EXISTS ListNestedSetChildNodes;
DELIMITER 
CREATE PROCEDURE ListNestedSetChildNodes( node SMALLINT )
BEGIN
DECLARE thisleft, thisright SMALLINT DEFAULT 0;
SELECT leftedge, rightedge
INTO thisleft, thisright
FROM nestedsettree
WHERE nodeID = node;
SELECT
PersonName(t2.nodeid) AS Children
FROM nestedsettree AS t1
INNER JOIN nestedsettree AS t2
ON t2.leftedge BETWEEN t1.leftedge AND t1.rightedge
WHERE t1.leftedge > thisleft AND t1.rightedge < thisright
GROUP BY t2.nodeid
HAVING COUNT(t2.nodeid) = 1
ORDER BY t2.leftedge;
END;

DELIMITER ;
CALL ListNestedSetChildNodes(6);
++
 Children 
++
 Susana Shakespeare 
 Hamnet Shakespeare 
 Judith Shakespeare 
++
Similar logic gets us the parent of a node:
 retrieve its
leftedge
andrightedge
values,  compute its level,
 find the node which is one level up and which has edge values outside the node's
leftedge
andrightedge
values.
Listing 12
DROP PROCEDURE IF EXISTS ShowNestedSetParent;
DELIMITER 
CREATE PROCEDURE ShowNestedSetParent( node SMALLINT )
BEGIN
DECLARE level, thisleft, thisright SMALLINT DEFAULT 0;
 find node edges
SELECT leftedge, rightedge
INTO thisleft, thisright
FROM nestedsettree
WHERE nodeID = node;
 find node level
SELECT COUNT(n1.nodeid)
INTO level
FROM nestedsettree AS n1
INNER JOIN nestedsettree AS n2
ON n2.leftedge BETWEEN n1.leftedge AND n1.rightedge
WHERE n2.nodeid = node
GROUP BY n2.nodeid;
 find parent
SELECT
PersonName(n2.nodeid) AS Parent
FROM nestedsettree AS n1
INNER JOIN nestedsettree AS n2
ON n2.leftedge BETWEEN n1.leftedge AND n1.rightedge
WHERE n2.leftedge < thisleft AND n2.rightedge > thisright
GROUP BY n2.nodeid
HAVING COUNT(n1.nodeid)=level1;
END;

DELIMITER ;
CALL ShowNestedSetParent(6);
++
 Child 
++
 Henry Shakespeare 
++
Other queries
For some queries, adjacency list and nested sets queries are equivalently simple. For example to find the tree root and leaves, compare Listing 6 with:
Listing 13
SELECT
name AS Paterfamilias,
IFNULL(born,'?') AS Born,
IFNULL(died,'?') AS Died
FROM nestedsettree AS t
INNER JOIN family AS f ON t.nodeID=f.ID
WHERE leftedge = 1;
++++
 Paterfamilias  Born  Died 
++++
 Richard Shakespeare  ?  1561 
++++
SELECT
name AS 'Childless Shakespeares',
IFNULL(born,'?') AS Born,
IFNULL(died,'?') AS Died
FROM nestedsettree AS t
INNER JOIN family AS f ON t.nodeID=f.ID
WHERE rightedge = leftedge + 1;
++++
 Childless Shakespeares  Born  Died 
++++
 Joan Shakespeare  1558  ? 
 Margaret Shakespeare  1562  1563 
 John Bernard  ?  1674 
 Hamnet Shakespeare  1585  1596 
 Shakespeare Quiney  1616  1617 
 Richard Quiney  1618  1639 
 Thomas Quiney  1620  1639 
 Gilbert Shakespeare  1566  1612 
 William Hart  1600  1639 
 Mary Hart  1603  1607 
 Thomas Hart  1605  1670 
 Michael Hart  1608  1618 
 Anne Shakespeare  1571  1579 
 Richard Shakespeare  1574  1613 
 Edmond Shakespeare  1580  1607 
 John Shakespeare  1530  1601 
++++
But in contrast to edge list models, finding subtrees in a nested sets model requires no twisted code, no stored procedure. To retrieve the nestedsettree
nodes in William's subtree, simply ask for the nodes whose leftedge
values are greater, and whose rightedge
values are smaller than William's:
Listing 14
SELECT
PersonName(t.nodeID) AS Descendant
FROM nestedsettree AS s
INNER JOIN nestedsettree AS t
ON s.leftedge < t.leftedge AND s.rightedge > t.rightedge
WHERE s.nodeID = (
SELECT ID FROM family
WHERE name='William Shakespeare'
);
Finding a single path in the nested sets model is about as complicated as path enumeration (Listings 8, 9) with edgeadjacency lists:
Listing 15
SELECT
t2.nodeID AS Node,
PersonName(t2.nodeID) AS Person,
(SELECT COUNT(*)
FROM nestedsettree AS t4
WHERE t4.leftedge BETWEEN t1.leftedge AND t1.rightedge
AND t2.leftedge BETWEEN t4.leftedge AND t4.rightedge
) AS Path
FROM nestedsettree AS t1
INNER JOIN nestedsettree AS t2 ON t2.leftedge BETWEEN t1.leftedge AND t1.rightedge
INNER JOIN nestedsettree AS t3 ON t3.leftedge BETWEEN t2.leftedge AND t2.rightedge
WHERE t1.nodeID=(SELECT ID FROM family WHERE name='William Shakespeare')
AND t3.nodeID=(SELECT ID FROM family WHERE name='John Bernard');
++++
 Node  Person  Path 
++++
 6  William Shakespeare  1 
 12  Susana Shakespeare  2 
 19  Elizabeth Hall  3 
 23  John Bernard  4 
++++
Displaying the tree
Here the nested sets model shines. The arithmetic that was used to build the tree makes short work of summary queries. For example to retrieve a node list which preserves all parentchild relations, we need just two facts:
 listing order is the order taken in the node walk that created the tree, i.e.
leftedge
,  a node's indentation depth is simply the JOIN (edge) count from root to node:
Listing 16
SELECT
CONCAT( SPACE(2*COUNT(n1.nodeid)2), PersonName(n2.nodeid) )
AS 'The Shakespeare Family Tree'
FROM nestedsettree AS n1
INNER JOIN nestedsettree n2
ON n2.leftedge BETWEEN n1.leftedge AND n1.rightedge
GROUP BY n2.nodeid
ORDER BY n2.leftedge;
++
 The Shakespeare Family Tree 
++
 Richard Shakespeare 
 Henry Shakespeare 
 Joan Shakespeare 
 Margaret Shakespeare 
 William Shakespeare 
 Susana Shakespeare 
 Elizabeth Hall 
 John Bernard 
 Hamnet Shakespeare 
 Judith Shakespeare 
 Shakespeare Quiney 
 Richard Quiney 
 Thomas Quiney 
 Gilbert Shakespeare 
 Joan Shakespeare 
 William Hart 
 Mary Hart 
 Thomas Hart 
 Michael Hart 
 Anne Shakespeare 
 Richard Shakespeare 
 Edmond Shakespeare 
 John Shakespeare 
++
To retrieve only a subtree, add a query clause which restricts nodes to those whose edges are within the range of the parent node's left and right edge values, for example for William and his descendants...
WHERE n1.leftedge >= 7 AND n1.rightedge <=24
Node insertions, updates and deletions
Nested set arithmetic also helps with insertions, updates and deletions, but the problem remains that changing just one node can require changing much of the tree.
Inserting a node in the tree requires at least two pieces of information: the nodeID
to insert, and the nodeID
of its parent. The model is normalised so the nodeID
first must have been added to the parent (family
) table. The algorithm for adding the node to the tree is:
 increment
leftedge
by 2 in nodes where therightedge
value is greater than the parent'srightedge
,  increment
rightedge
by 2 in nodes where theleftedge
value is greater than the parent'sleftedge
,  insert a row with the given
nodeID
,leftedge
= 1 + parent'sleftedge
,rightedge
= 2 + parent'sleftedge
.
That's not difficult, but all rows will have to be updated!
Listing 17
DROP PROCEDURE IF EXISTS InsertNestedSetNode;
DELIMITER 
CREATE PROCEDURE InsertNestedSetNode( IN node SMALLINT, IN parent SMALLINT )
BEGIN
DECLARE parentleft, parentright SMALLINT DEFAULT 0;
SELECT leftedge, rightedge
INTO parentleft, parentright
FROM nestedsettree
WHERE nodeID = parent;
IF FOUND_ROWS() = 1 THEN
BEGIN
UPDATE nestedsettree
SET rightedge = rightedge + 2
WHERE rightedge > parentleft;
UPDATE nestedsettree
SET leftedge = leftedge + 2
WHERE leftedge > parentleft;
INSERT INTO nestedsettree
VALUES ( 0, node, parentleft + 1, parentleft + 2 );
END;
END IF;
END;

DELIMITER ;
"Sibline" or horizontal order is obviously significant in family trees, but may not be significant in other trees. Listing 17 adds the new node at the left edge of the sibline. To specify another position, modify the procedure to accept a third parameter for the nodeID
which is to be to the left or right of the insertion point.
Updating a node in place requires nothing more than editing nodeID
to point at a different parent row.
Deleting a node raises the problem of how to repair links severed by the deletion. In tree models of parts explosions, the item to be deleted is often replaced by a new item, so it can be treated like a simple nodeID
update. In organisational bossemployee charts, though, does a colleague move over, does a subordinate get promoted, does everybody in the subtree move up a level, or does something else happen? No formula can catch all the possibilities. Listing 18 illustrates how to handle two common scenarios, move everyone up, and move someone over. All possibilities except simple replacement of the nodeID
involve changes elsewhere in the tree.
Listing 18
DROP PROCEDURE IF EXISTS DeleteNestedSetNode;
DELIMITER 
CREATE PROCEDURE DeleteNestedSetNode( IN mode CHAR(7), IN node SMALLINT )
BEGIN
DECLARE thisleft, thisright SMALLINT DEFAULT 0;
SELECT leftedge, rightedge
INTO thisleft, thisright
FROM nestedsettree
WHERE nodeID = node;
IF mode = 'PROMOTE' THEN
BEGIN  Ian Holsman found these two bugs
DELETE FROM nestedsettree
WHERE nodeID = node;
UPDATE nestedsettree
SET leftedge = leftedge  1, rightedge = rightedge  1  rather than = thisleft
WHERE leftedge BETWEEN thisleft AND thisright;
UPDATE nestedsettree
SET rightedge = rightedge  2
WHERE rightedge > thisright;
UPDATE nestedsettree
SET leftedge = leftedge  2
WHERE leftedge > thisright;  rather than > thisleft
END;
ELSEIF mode = 'REPLACE' THEN
BEGIN
UPDATE nestedsettree
SET leftedge = thisleft  1, rightedge = thisright
WHERE leftedge = thisleft + 1;
UPDATE nestedsettree
SET rightedge = rightedge  2
WHERE rightedge > thisleft;
UPDATE nestedsettree
SET leftedge = leftedge  2
WHERE leftedge > thisleft;
DELETE FROM nestedsettree
WHERE nodeID = node;
END;
END IF;
END;

DELIMITER ;
Nested set model summary
Given the concurrency nightmare which nested sets impose for inserts and deletions, it is reasonable to reserve the nested set model for fairly static trees whose users are mostly interested in querying subtrees. You could think of the nested set model as an OLAP tool: maintain an OLTP tree in an edgeadjacency list representation, and build a nested sets OLAP table when the report is needed.
If you will be using the nested sets model, you may be converting back and forth with edge list models, so here is a simple query which generates an edge list from a nested set tree:
Listing 19
SELECT
p.nodeID AS parentID,
c.nodeID AS childID
FROM nestedsettree AS p
INNER JOIN nestedsettree AS c
ON p.leftedge = (SELECT MAX(s.leftedge)
FROM nestedsettree AS s
WHERE c.leftedge > s.leftedge
AND c.leftedge < s.rightedge)
ORDER BY p.nodeID;
Edge list model of a nontree graph
Many graphs are not trees. Imagine a small airline which has just acquired licences for flights no longer than 6,000 km between Los Angeles (LAX), New York (JFK), Heathrow in London, Charles de Gaulle in Paris, AmsterdamSchiphol, Arlanda in Sweden, and HelsinkiVantaa. You have been asked to compute the shortest possible oneway routes that do not deviate more than 90° from the direction of the first hop—roughly, oneway routes and no circuits.
Airports are nodes, flights are edges, routes are paths. We will need three tables.
Airports (nodes)
To identify an airport we need its code, location name, latitude and longitude. Latitude and longitude are usually given as degrees, minutes and seconds, north or south of the equator, east or west of Greenwich. To hide details that aren't directly relevant to nodes and edges, code latitude and longitude as simple reals where longitudes west of Greenwich and latitudes south of the equator are negative, whilst longitudes east of Greenwich and latitudes north of the equator are positive:
Listing 20
CREATE TABLE airports (
code char(3) NOT NULL,
city char(100) default NULL,
latitude float NOT NULL,
longitude float NOT NULL,
PRIMARY KEY (code)
) ENGINE=MyISAM;
INSERT INTO airports VALUES ('JFK', 'New York, NY', 40.75, 73.97);
INSERT INTO airports VALUES ('LAX', 'Los Angeles, CA', 34.05, 118.22);
INSERT INTO airports VALUES ('LHR', 'London, England', 51.5, 0.45);
INSERT INTO airports VALUES ('HEL', 'Helsinki, Finland', 60.17, 24.97);
INSERT INTO airports VALUES ('CDG', 'Paris, France', 48.86, 2.33);
INSERT INTO airports VALUES ('STL', 'St Louis, MO', 38.63, 90.2);
INSERT INTO airports VALUES ('ARN', 'Stockholm, Sweden', 59.33, 18.05);
Flights (edges)
The model attaches two weights to flights: distance and direction.
We need a method of calculating the Great Circle Distance—the geographical distance between any two cities  another natural job for a stored function. The distance calculation
 converts to radians the degree coordinates of any two points on the earth's surface,
 calculates the angle of the arc subtended by the two points, and
 converts the result, also in radians, to surface (circumferential) kilometres (1 radian=6,378.388 km).
Listing 21
SET GLOBAL log_bin_trust_function_creators=TRUE;  since 5.0.16
DROP FUNCTION IF EXISTS GeoDistKM;
DELIMITER 
CREATE FUNCTION GeoDistKM( lat1 FLOAT, lon1 FLOAT, lat2 FLOAT, lon2 FLOAT ) RETURNS float
BEGIN
DECLARE pi, q1, q2, q3 FLOAT;
SET pi = PI();
SET lat1 = lat1 * pi / 180;
SET lon1 = lon1 * pi / 180;
SET lat2 = lat2 * pi / 180;
SET lon2 = lon2 * pi / 180;
SET q1 = COS(lon1lon2);
SET q2 = COS(lat1lat2);
SET q3 = COS(lat1+lat2);
SET rads = ACOS( 0.5*((1.0+q1)*q2  (1.0q1)*q3) );
RETURN 6378.388 * rads;
END;

DELIMITER ;
That takes care of flight distances. Flight direction is, approximately, the arctangent (ATAN) of the difference between flights.depart
and flights.arrive
latitudes and longitudes. Now we can seed the airline's flights
table with onehop flights up to 6,000 km long:
Listing 22
CREATE TABLE flights (
id INT PRIMARY KEY AUTO_INCREMENT,
depart CHAR(3),
arrive CHAR(3),
distance DECIMAL(10,2),
direction DECIMAL(10,2)
) ENGINE=MYISAM;
INSERT INTO flights
SELECT
NULL,
depart.code,
arrive.code,
ROUND(GeoDistKM(depart.latitude,depart.longitude,arrive.latitude,arrive.longitude),2),
ROUND(DEGREES(ATAN(arrive.latitudedepart.latitude,arrive.longitudedepart.longitude)),2)
FROM airports AS depart
INNER JOIN airports AS arrive ON depart.code <> arrive.code
HAVING Km <= 6000;
SELECT * FROM flights;
++++++
 id  depart  arrive  distance  direction 
++++++
 1  LAX  JFK  3941.18  8.61 
 2  LHR  JFK  5550.77  171.68 
 3  CDG  JFK  5837.46  173.93 
 4  STL  JFK  1408.11  7.44 
 5  JFK  LAX  3941.18  171.39 
 6  STL  LAX  2553.37  170.72 
 7  JFK  LHR  5550.77  8.32 
 8  HEL  LHR  1841.91  161.17 
 9  CDG  LHR  354.41  136.48 
 10  ARN  LHR  1450.12  157.06 
 11  LHR  HEL  1841.91  18.83 
 12  CDG  HEL  1912.96  26.54 
 13  ARN  HEL  398.99  6.92 
 14  JFK  CDG  5837.46  6.07 
 15  LHR  CDG  354.41  43.52 
 16  HEL  CDG  1912.96  153.46 
 17  ARN  CDG  1545.23  146.34 
 18  JFK  STL  1408.11  172.56 
 19  LAX  STL  2553.37  9.28 
 20  LHR  ARN  1450.12  22.94 
 21  HEL  ARN  398.99  173.08 
 22  CDG  ARN  1545.23  33.66 
++++++
The distances agree approximately with public information sources for flight lengths. For a pair of airports A and B not very near the poles, the error in calculating direction using ATAN(), is small. To remove that error, instead of ATAN() use a formula from spherical trigonometry (for example one of the formulas at http://www.dynagen.co.za/eugene/where/formula.html).
Routes (paths)
A route is a path along one or more of these edges, so flights:routes
is a 1:many relationship. For simplicity, though, we denormalise our representation of routes with a variation of the materialised path model to store all the hops of one route as a list of flights in one routes
column. The column routes.route
is the sequence of airports, from first departure to final arrival, the column routes.hops
is the number of hops in that route, and the column routes.direction
is the direction:
Listing 23
CREATE TABLE routes (
id INT PRIMARY KEY AUTO_INCREMENT,
depart CHAR(3),
arrive CHAR(3),
hops SMALLINT,
route CHAR(50),
distance DECIMAL(10,2),
direction DECIMAL(10,2)
) ENGINE=MYISAM;
Starting with an empty routes
table, how do we populate it with the shortest routes
between all ordered pairs of airports
?
 Insert all 1hop flights from the
flights
table.  Add in the set of shortest multihop routes for all pairs of airports which don't have rows in the flights table.
For 1hop flights we just write
Listing 24
INSERT INTO routes
SELECT
NULL,
depart,
arrive,
1,
CONCAT(depart,',',arrive),
distance,
direction
FROM flights;
NULL being the placeholder for the autoincrementing id
column.
For multihop routes, we iteratively add in sets of all allowed 2hop, 3hop, ... nhop routes, replacing longer routes by shorter routes as we find them, until there is nothing more to add or replace. That also breaks down to two logical steps: add hops to build the set of next allowed routes, and update longer routes with shorter ones.
Next allowed routes
The set of next allowed routes is the set of shortest routes that can be built by adding, to existing routes, flights which leave from the last arrival airport of an existing route, which arrive at an airport which is not yet in the given route, and which stay within ± 90° of the route's initial compass direction. That is, every new route is a JOIN between routes
and flights
in which
depart = routes.depart
,arrive = flights.arrive
,flights.depart = routes.arrive
,distance =
MIN(routes.distance + flights.distance)
,LOCATE(
flights.arrive,routes.route) = 0
,flights.direction+360 > routes.direction+270 AND flights.direction+360 < routes.direction+450
This is a natural logical unit of work for a VIEW:
Listing 25
CREATE OR REPLACE VIEW nextroutes AS
SELECT
routes.depart,
flights.arrive,
routes.hops+1 AS hops,
CONCAT(routes.route, ',', flights.arrive) AS route,
MIN(routes.distance + flights.distance) AS distance,
routes.direction
FROM routes INNER JOIN flights
ON routes.arrive = flights.depart
AND LOCATE(
flights.arrive,routes.route) = 0
WHERE flights.direction+360>routes.direction+270
AND flights.direction+360<routes.direction+450
GROUP BY depart,arrive;
How to add these new hops to routes
? In standard SQL, this variant on a query by Scott Stephens should do it...
Listing 26
INSERT INTO routes
SELECT NULL,depart,arrive,hops,route,distance,direction FROM nextroutes
WHERE (nextroutes.depart,nextroutes.arrive) NOT IN (
SELECT depart,arrive FROM routes
);
but MySQL does not yet support subqueries on the table being updated. We have to use a subqueryless (and faster) version of that logic:
Listing 27
INSERT INTO routes
SELECT
NULL,
nextroutes.depart,
nextroutes.arrive,
nextroutes.hops,
nextroutes.route,
nextroutes.distance,
nextroutes.direction
FROM nextroutes
LEFT JOIN routes ON nextroutes.depart = routes.depart
AND nextroutes.arrive = routes.arrive
WHERE routes.depart IS NULL AND routes.arrive IS NULL;
Running that code right after the initial seeding from flights
gives ...
SELECT * FROM routes; ++++++++  id  depart  arrive  hops  route  distance  direction  ++++++++  1  LAX  JFK  1  LAX,JFK  3941.18  8.61   2  LHR  JFK  1  LHR,JFK  5550.77  171.68   3  CDG  JFK  1  CDG,JFK  5837.46  173.93   4  STL  JFK  1  STL,JFK  1408.11  7.44   5  JFK  LAX  1  JFK,LAX  3941.18  171.39   6  STL  LAX  1  STL,LAX  2553.37  170.72   7  JFK  LHR  1  JFK,LHR  5550.77  8.32   8  HEL  LHR  1  HEL,LHR  1841.91  161.17   9  CDG  LHR  1  CDG,LHR  354.41  136.48   10  ARN  LHR  1  ARN,LHR  1450.12  157.06   11  LHR  HEL  1  LHR,HEL  1841.91  18.83   12  CDG  HEL  1  CDG,HEL  1912.96  26.54   13  ARN  HEL  1  ARN,HEL  398.99  6.92   14  JFK  CDG  1  JFK,CDG  5837.46  6.07   15  LHR  CDG  1  LHR,CDG  354.41  43.52   16  HEL  CDG  1  HEL,CDG  1912.96  153.46   17  ARN  CDG  1  ARN,CDG  1545.23  146.34   18  JFK  STL  1  JFK,STL  1408.11  172.56   19  LAX  STL  1  LAX,STL  2553.37  9.28   20  LHR  ARN  1  LHR,ARN  1450.12  22.94   21  HEL  ARN  1  HEL,ARN  398.99  173.08   22  CDG  ARN  1  CDG,ARN  1545.23  33.66   23  ARN  JFK  2  ARN,LHR,JFK  7000.89  157.06   24  CDG  LAX  2  CDG,JFK,LAX  9778.64  173.93   25  CDG  STL  2  CDG,JFK,STL  7245.57  173.93   26  HEL  JFK  2  HEL,LHR,JFK  7392.68  161.17   27  JFK  ARN  2  JFK,LHR,ARN  7000.89  8.32   28  JFK  HEL  2  JFK,LHR,HEL  7392.68  8.32   29  LAX  CDG  2  LAX,JFK,CDG  9778.64  8.61   30  LAX  LHR  2  LAX,JFK,LHR  9491.95  8.61   31  LHR  LAX  2  LHR,JFK,LAX  9491.95  171.68   32  LHR  STL  2  LHR,JFK,STL  6958.88  171.68   33  STL  CDG  2  STL,JFK,CDG  7245.57  7.44   34  STL  LHR  2  STL,JFK,LHR  6958.88  7.44  ++++++++
... adding 12 twohop rows.
Replace longer routes with shorter ones
As we build routes with more hops, it is logically possible that the nextroutes
view will find shorter routes for an existing routes
pair of depart
and arrive
. Standard SQL for replacing existing routes
rows with nextroutes
rows which match (depart, arrive)
and have shorter distance
values would be:
Listing 28
UPDATE routes SET (hops,route,distance,direction) = (
SELECT hops, route, distance, direction
FROM nextroutes
WHERE nextroutes.depart = routes.depart AND nextroutes.arrive = routes.arrive
)
WHERE (depart,arrive) IN (
SELECT depart,arrive FROM nextroutes
WHERE nextroutes.distance < routes.distance
);
but MySQL does not support SET(col1,...)
syntax, and as with Listing 7, MySQL does not yet accept subqueries referencing the table being updated, so we have to write more literal SQL:
Listing 29
UPDATE routes, nextroutes
SET
routes.hops=nextroutes.hops,
routes.route=nextroutes.route,
routes.distance=nextroutes.distance,
routes.direction=nextroutes.direction
WHERE routes.arrive=nextroutes.arrive
AND routes.depart=nextroutes.depart
AND nextroutes.distance < routes.distance;
Running this code right after the first run of Listing 27 updates no rows. To test the logic of iteration, continue running Listings 27 and 29 until no rows are being added or changed. The final result is:
SELECT * FROM ROUTES;
++++++++
 id  depart  arrive  hops  route  distance  direction 
++++++++
 1  LAX  JFK  1  LAX,JFK  3941.18  8.61 
 2  LHR  JFK  1  LHR,JFK  5550.77  171.68 
 3  CDG  JFK  1  CDG,JFK  5837.46  173.93 
 4  STL  JFK  1  STL,JFK  1408.11  7.44 
 5  JFK  LAX  1  JFK,LAX  3941.18  171.39 
 6  STL  LAX  1  STL,LAX  2553.37  170.72 
 7  JFK  LHR  1  JFK,LHR  5550.77  8.32 
 8  HEL  LHR  1  HEL,LHR  1841.91  161.17 
 9  CDG  LHR  1  CDG,LHR  354.41  136.48 
 10  ARN  LHR  1  ARN,LHR  1450.12  157.06 
 11  LHR  HEL  1  LHR,HEL  1841.91  18.83 
 12  CDG  HEL  1  CDG,HEL  1912.96  26.54 
 13  ARN  HEL  1  ARN,HEL  398.99  6.92 
 14  JFK  CDG  1  JFK,CDG  5837.46  6.07 
 15  LHR  CDG  1  LHR,CDG  354.41  43.52 
 16  HEL  CDG  1  HEL,CDG  1912.96  153.46 
 17  ARN  CDG  1  ARN,CDG  1545.23  146.34 
 18  JFK  STL  1  JFK,STL  1408.11  172.56 
 19  LAX  STL  1  LAX,STL  2553.37  9.28 
 20  LHR  ARN  1  LHR,ARN  1450.12  22.94 
 21  HEL  ARN  1  HEL,ARN  398.99  173.08 
 22  CDG  ARN  1  CDG,ARN  1545.23  33.66 
 23  ARN  JFK  2  ARN,LHR,JFK  7000.89  157.06 
 24  CDG  LAX  2  CDG,JFK,LAX  9778.64  173.93 
 25  CDG  STL  2  CDG,JFK,STL  7245.57  173.93 
 26  HEL  JFK  2  HEL,LHR,JFK  7392.68  161.17 
 27  JFK  ARN  2  JFK,LHR,ARN  7000.89  8.32 
 28  JFK  HEL  2  JFK,LHR,HEL  7392.68  8.32 
 29  LAX  CDG  2  LAX,JFK,CDG  9778.64  8.61 
 30  LAX  LHR  2  LAX,JFK,LHR  9491.95  8.61 
 31  LHR  LAX  2  LHR,JFK,LAX  9491.95  171.68 
 32  LHR  STL  2  LHR,JFK,STL  6958.88  171.68 
 33  STL  CDG  2  STL,JFK,CDG  7245.57  7.44 
 34  STL  LHR  2  STL,JFK,LHR  6958.88  7.44 
 35  ARN  LAX  3  ARN,LHR,JFK,LAX  10942.07  157.06 
 36  ARN  STL  3  ARN,LHR,JFK,STL  8409.00  157.06 
 37  HEL  LAX  3  HEL,LHR,JFK,LAX  11333.86  161.17 
 38  HEL  STL  3  HEL,LHR,JFK,STL  8800.79  161.17 
 39  LAX  ARN  3  LAX,JFK,CDG,ARN  10942.07  8.61 
 40  LAX  HEL  3  LAX,JFK,CDG,HEL  11333.86  8.61 
 41  STL  ARN  3  STL,JFK,CDG,ARN  8409.00  7.44 
 42  STL  HEL  3  STL,JFK,CDG,HEL  8800.79  7.44 
++++++++
All that's left to do is to assemble the code in a stored procedure:
Listing 30
DROP PROCEDURE IF EXISTS BuildRoutes;
DELIMITER 
CREATE PROCEDURE BuildRoutes()
BEGIN
DECLARE rows INT DEFAULT 0;
TRUNCATE routes;
 STEP 1, LISTING 24: SEED ROUTES WITH 1HOP FLIGHTS
INSERT INTO routes
SELECT
NULL,
depart,
arrive,
1,
CONCAT(depart,',',arrive),
distance,
direction
FROM flights;
SET rows = ROW_COUNT();
WHILE (rows > 0) DO
 STEP 2, LISTINGS 25, 27: ADD NEXT SET OF ROUTES
INSERT INTO routes
SELECT
NULL,
nextroutes.depart,
nextroutes.arrive,
nextroutes.hops,
nextroutes.route,
nextroutes.distance,
nextroutes.direction
FROM nextroutes
LEFT JOIN routes ON nextroutes.depart = routes.depart
AND nextroutes.arrive = routes.arrive
WHERE routes.depart IS NULL AND routes.arrive IS NULL;
SET rows = ROW_COUNT();
 STEP 3, LISTING 29: UPDATE WITH SHORTER nextroutes ROUTES IF ANY
UPDATE routes,nextroutes SET
routes.hops=nextroutes.hops,
routes.route=nextroutes.route,
routes.distance=nextroutes.distance,
routes.direction=nextroutes.direction
WHERE routes.arrive=nextroutes.arrive
AND routes.depart=nextroutes.depart
AND nextroutes.distance < routes.distance;
SET rows = rows + ROW_COUNT();
END WHILE;
END;

DELIMITER ;
If you are running MySQL 5.0.6 or 5.0.7, BuildRoutes()
fails to insert four rows:
+++++++
 depart  arrive  route  hops  distance  direction 
+++++++
 ARN  LAX  ARN,LHR,JFK,LAX  3  10942.07  157.06 
 ARN  STL  ARN,LHR,JFK,STL  3  8409.00  157.06 
 HEL  LAX  HEL,LHR,JFK,LAX  3  11333.86  161.17 
 HEL  STL  HEL,LHR,JFK,STL  3  8800.79  161.17 
+++++++
That MySQL bug (#11302) was corrected in 5.0.9.
Route queries
Route queries are straightforward. How do we check that the algorithm produced no duplicate departarrive
pairs? The following query should yield zero rows,
Listing 31
SELECT depart, arrive, COUNT(*)
FROM routes
GROUP BY depart,arrive
HAVING COUNT(*) > 1;
and does. Reachability queries are just as simple, for example where can we fly to from Helsinki?
Listing 32
SELECT *
FROM routes
WHERE depart='HEL'
ORDER BY distance;
++++++++
 id  depart  arrive  hops  route  distance  direction 
++++++++
 21  HEL  ARN  1  HEL,ARN  398.99  173.08 
 8  HEL  LHR  1  HEL,LHR  1841.91  161.17 
 16  HEL  CDG  1  HEL,CDG  1912.96  153.46 
 26  HEL  JFK  2  HEL,LHR,JFK  7392.68  161.17 
 38  HEL  STL  3  HEL,LHR,JFK,STL  8800.79  161.17 
 37  HEL  LAX  3  HEL,LHR,JFK,LAX  11333.86  161.17 
++++++++
An extended edge list model is simple to implement, gracefully accepts extended attributes for nodes, edge and paths, does not unduly penalise updates, and responds to queries with reasonable speed.
Parts explosions
A bill of materials for a house would include the cement block, lumber, shingles, doors, wallboard, windows, plumbing, electrical system, heating system, and so on. Each subassembly also has a bill of materials; the heating system has a furnace, ducts, and so on. A bill of materials implosion links component pieces to a major assembly. A bill of materials explosion breaks apart assemblies and subassemblies into their component parts.
Which graph model best handles a parts explosion? Combining edge list and "nested sets" algorithms provides a clean solution.
Imagine a new company that plans to make variously sized bookcases, either packaged as doityourself kits of, or assembled from sides, shelves, shelf brackets, backboards, feet and screws. Shelves and sides are cut from planks. Backboards are trimmed from laminated sheeting. Feet are machinecarved from readycut blocks. Screws and shelf brackets are purchased in bulk. Here are the elements of one bookcase:
1 backboard, 2 x 1 m 1 laminate 8 screws 2 sides 2m x 30 cm 1 plank length 4m 12 screws 8 shelves 1 m x 30 cm (incl. top and bottom) 2 planks 24 shelf brackets 4 feet 4cm x 4cm 4 cubes 16 screws
which may be packaged in a box for sale at one price, or assembled as a finished product at a different price. At any time we need to be able to answer questions like
 Do we have enough parts to make the bookcases on order?
 What assemblies or packages would be most profitable to make given the current inventory?
There is no reason to break the normalising rule that item detail belongs in a nodes table, and graph logic belongs in an edges table. Edges also require a quantity attribute, for example a shelf includes four shelf brackets. Nodes and edges may also have costs and prices:
 item purchase cost,
 item assembly cost,
 assembly cost,
 assembly selling price.
In many parts problems like this one, items occur in multiple assemblies and subassemblies. The graph is not a tree. Also, it is often desirable to model multiple graphs without the table glut that would arise from giving each graph its own edges table. A simple way to solve this problem is to represent multiple graphs (assemblies) in the edges table by giving every row not only childID
and parentID
pointers, but a pointer which identifies the root itemID
of the graph to which the row belongs.
So the data model is just two tables, for items (nodes) and for product graphs or assemblies (edges). Assume that the company begins with a plan to sell the 2m x 1m bookcase in two forms, assembled and kit, and that the purchasing department has bought quantities of raw materials (laminate, planks, shelf supports, screws, wood cubes, boxes). Here are the nodes (items) and edges (assemblies):
Listing 33
CREATE TABLE items (
itemID INT PRIMARY KEY AUTO_INCREMENT,
name CHAR(20) NOT NULL,
onhand INT NOT NULL DEFAULT 0,
reserved INT NOT NULL DEFAULT 0,
purchasecost DECIMAL(10,2) NOT NULL DEFAULT 0,
assemblycost DECIMAL(10,2) NOT NULL DEFAULT 0,
price DECIMAL(10,2) NOT NULL DEFAULT 0
);
CREATE TABLE assemblies (
assemblyID INT NOT NULL,
assemblyroot INT NOT NULL,
childID INT NOT NULL,
parentID INT NOT NULL,
quantity DECIMAL(10,2) NOT NULL,
assemblycost DECIMAL(10,2) NOT NULL,
PRIMARY KEY(assemblyID,childID,parentID)
);
INSERT INTO items VALUES  inventory
(1,'laminate',40,0,4,0,8),
(2,'screw',1000,0,0.1,0,.2),
(3,'plank',200,0,10,0,20),
(4,'shelf bracket',400,0,0.20,0,.4),
(5,'wood cube',100,0,0.5,0,1),
(6,'box',40,0,1,0,2),
(7,'backboard',0,0,0,3,0),
(8,'side',0,0,0,8,0),
(9,'shelf',0,0,0,4,0),
(10,'foot',0,0,0,1,0),
(11,'bookcase2x30',0,0,0,10,0),
(12,'bookcase2x30 kit',0,0,0,2,0);
INSERT INTO assemblies VALUES
(1,11,1,7,1,0),  laminate to backboard
(2,11,2,7,8,0),  screws to backboard
(3,11,3,8,.5,0),  planks to side
(4,11,2,8,6,0),  screws to side
(5,11,3,9,0.25,0),  planks to shelf
(6,11,4,9,4,0),  shelf brackets to shelf
(7,11,5,10,1,0),  wood cubes to foot
(8,11,2,10,1,0),  screws to foot
(9,11,7,11,1,0),  backboard to bookcase
(10,11,8,11,2,0),  sides to bookcase
(11,11,9,11,8,0),  shelves to bookcase
(12,11,10,11,4,0),  feet to bookcase
(13,12,1,7,1,0),  laminate to backboard
(14,12,2,7,8,0),  screws to backboard
(15,12,3,8,0.5,0),  planks to side
(16,12,2,8,6,0),  screws to sides
(17,12,3,9,0.25,0),  planks to shelf
(18,12,4,9,4,0),  shelf brackets to shelves
(19,12,5,10,1,0),  wood cubes to foot
(20,12,2,10,1,0),  screws to foot
(21,12,7,12,1,0),  backboard to bookcase kit
(22,12,8,12,2,0),  sides to bookcase kit
(23,12,9,12,8,0),  shelves to bookcase kit
(24,12,10,12,4,0),  feet to bookcase kit
(25,12,6,12,1,0);  container box to bookcase kit
Now, we want a parts list, a bill of materials, which will list show parentchild relationships and quantities, and sum the costs. Could we adapt the depthfirst "nested sets" treewalk algorithm (Listing 10) to this problem even though our graph is not a tree and our sets are not properly nested? Yes indeed. We just have to modify the treewalk to handle multiple parent nodes for any child node, and add code to percolate costs and quantities up the graph. Navigation remains simple using leftedge
and rightedge
values. This is just the sort of problem the Celko algorithm is good for: reporting!
Listing 34
DROP PROCEDURE IF EXISTS ShowBOM;
DELIMITER 
CREATE PROCEDURE ShowBOM( IN root INT )
BEGIN
DECLARE thischild, thisparent, rows, maxrightedge INT DEFAULT 0;
DECLARE thislevel, nextedgenum INT DEFAULT 1;
DECLARE thisqty, thiscost DECIMAL(10,2);
 Create and seed intermediate table:
DROP TABLE IF EXISTS edges;
CREATE TABLE edges (
childID smallint NOT NULL,
parentID smallint NOT NULL,
PRIMARY KEY (childID, parentID)
) ENGINE=HEAP;
INSERT INTO edges
SELECT childID,parentID
FROM assemblies
WHERE assemblyRoot = root;
SET maxrightedge = 2 * (1 + (SELECT COUNT(*) FROM edges));
 Create and seed result table:
DROP TABLE IF EXISTS bom;
CREATE TABLE bom (
level SMALLINT,
nodeID SMALLINT,
parentID SMALLINT,
qty DECIMAL(10,2),
cost DECIMAL(10,2),
leftedge SMALLINT,
rightedge SMALLINT
) ENGINE=HEAP;
INSERT INTO bom
VALUES( thislevel, root, 0, 0, 0, nextedgenum, maxrightedge );
SET nextedgenum = nextedgenum + 1;
WHILE nextedgenum < maxrightedge DO
 How many children of this node remain in the edges table?
SET rows = (
SELECT COUNT(*)
FROM bom AS s
INNER JOIN edges AS t ON s.nodeID=t.parentID AND s.level=thislevel
);
IF rows > 0 THEN
 There is at least one child edge.
 Compute qty and cost, insert into bom, delete from edges.
BEGIN
 Alas MySQL nulls MIN(t.childid) when we combine the next two queries
SET thischild = (
SELECT MIN(t.childID)
FROM bom AS s
INNER JOIN edges AS t ON s.nodeID=t.parentID AND s.level=thislevel
);
SET thisparent = (
SELECT DISTINCT t.parentID
FROM bom AS s
INNER JOIN edges AS t ON s.nodeID=t.parentID AND s.level=thislevel
);
SET thisqty = (
SELECT quantity FROM assemblies
WHERE assemblyroot = root
AND childID = thischild
AND parentID = thisparent
);
SET thiscost = (
SELECT a.assemblycost + (thisqty * (i.purchasecost + i.assemblycost ))
FROM assemblies AS a
INNER JOIN items AS i ON a.childID = i.itemID
WHERE assemblyroot = root
AND a.parentID = thisparent
AND a.childID = thischild
);
INSERT INTO bom
VALUES(thislevel+1, thischild, thisparent, thisqty, thiscost, nextedgenum, NULL);
DELETE FROM edges
WHERE childID = thischild AND parentID=thisparent;
SET thislevel = thislevel + 1;
SET nextedgenum = nextedgenum + 1;
END;
ELSE
BEGIN
 Set rightedge, remove item from edges
UPDATE bom
SET rightedge=nextedgenum, level = level
WHERE level = thislevel;
SET thislevel = thislevel  1;
SET nextedgenum = nextedgenum + 1;
END;
END IF;
END WHILE;
SET rows := ( SELECT COUNT(*) FROM edges );
IF rows > 0 THEN
SELECT 'Orphaned rows remain';
ELSE
 Total
SET thiscost = (SELECT SUM(qty*cost) FROM bom);
UPDATE bom
SET qty = 1, cost = thiscost
WHERE nodeID = root;
 Show the result
SELECT
CONCAT(Space(Abs(level)*2), ItemName(nodeid,root)) AS Item,
ROUND(qty,2) AS Qty,
ROUND(cost, 2) AS Cost
FROM bom
ORDER BY leftedge;
END IF;
END;

DELIMITER ;
 Function used by ShowBOM() to retrieve bom item names:
DROP FUNCTION IF EXISTS ItemName;
SET GLOBAL log_bin_trust_function_creators=TRUE;
DELIMITER 
CREATE FUNCTION ItemName( id INT, root INT ) RETURNS CHAR(20)
BEGIN
DECLARE s CHAR(20) DEFAULT '';
SELECT name INTO s FROM items WHERE itemid=id;
RETURN IF( id = root, UCASE(s), s );
END;

DELIMITER ;
CALL ShowBOM(11);
++++
 Item  Qty  Cost 
++++
 BOOKCASE2X30  1.0  327.93 
 backboard  1.0  3.00 
 laminate  1.0  4.00 
 screw  8.0  0.80 
 side  2.0  16.00 
 screw  6.0  0.60 
 plank  0.5  5.00 
 shelf  8.0  32.00 
 plank  0.3  2.50 
 shelf bracket  4.0  0.80 
 foot  4.0  4.00 
 screw  1.0  0.10 
 wood cube  1.0  0.50 
++++
With
ShowBOM()
in hand, it's easy to compare costs of assemblies and subassemblies. By adding price columns, we can do the same for prices and profit margins. And now that MySQL has reenabled prepared statements in stored procedures, it will be relatively easy to write a more general version of ShowBOM()
. We leave that to you.
Shorter and sweeter
But ShowBOM() is not the small, efficient bit of nested sets reporting code we were hoping for. There is a simpler solution: hide graph cycles from the edges table by making them references to rows in a nodes table, so we can treat the edges table like a tree; then apply a breadthfirst edgelist subtree algorithm to generate the Bill of Materials. Again assume a cabinetmaking company making bookcases (with a different costing model). For clarity, skip inventory tracking for now. An items table ww_nodes tracks purchased and assembled bookcase elements with their individual costs, and an assemblies/edges ww_edges table tracks sets of edges that combine to make products.
Listing 35: DDL for a simpler parts explosion
DROP TABLE IF EXISTS ww_nodes;
CREATE TABLE ww_nodes (
nodeID int,
description CHAR(50),
cost decimal(10,2)
);
INSERT INTO ww_nodes VALUES (1,'finished bookcase',10);
INSERT INTO ww_nodes VALUES (2,'backboard2x1',1);
INSERT INTO ww_nodes VALUES (3,'laminate2x1',8);
INSERT INTO ww_nodes VALUES (4,'screw',.10);
INSERT INTO ww_nodes VALUES (5,'side',4);
INSERT INTO ww_nodes VALUES (6,'plank',20);
INSERT INTO ww_nodes VALUES (7,'shelf',4);
INSERT INTO ww_nodes VALUES (8,'shelf bracket',.5);
INSERT INTO ww_nodes VALUES (9,'feet',1);
INSERT INTO ww_nodes VALUES (10,'cube4cmx4cm',1);
INSERT INTO ww_nodes VALUES (11,'bookcase kit',2);
INSERT INTO ww_nodes VALUES (12,'carton',1);
DROP TABLE IF EXISTS ww_edges;
CREATE TABLE ww_edges (
rootID INT,
nodeID int,
parentnodeID int,
qty decimal(10,2)
);
INSERT INTO ww_edges VALUES (1,1,null,1);
INSERT INTO ww_edges VALUES (1,2,1,1);
INSERT INTO ww_edges VALUES (1,3,2,1);
INSERT INTO ww_edges VALUES (1,4,2,8);
INSERT INTO ww_edges VALUES (1,5,1,2);
INSERT INTO ww_edges VALUES (1,6,5,1);
INSERT INTO ww_edges VALUES (1,4,5,12);
INSERT INTO ww_edges VALUES (1,7,1,8);
INSERT INTO ww_edges VALUES (1,6,7,.5);
INSERT INTO ww_edges VALUES (1,8,7,4);
INSERT INTO ww_edges VALUES (1,9,1,4);
INSERT INTO ww_edges VALUES (1,10,9,1);
INSERT INTO ww_edges VALUES (1,4,9,1);
INSERT INTO ww_edges VALUES (11,11,null,1);
INSERT INTO ww_edges VALUES (11,2,11,1);
INSERT INTO ww_edges VALUES (11,3,2,1);
INSERT INTO ww_edges VALUES (11,4,2,8);
INSERT INTO ww_edges VALUES (11,5,11,2);
INSERT INTO ww_edges VALUES (11,6,5,1);
INSERT INTO ww_edges VALUES (11,4,5,12);
INSERT INTO ww_edges VALUES (11,7,11,8);
INSERT INTO ww_edges VALUES (11,6,7,.5);
INSERT INTO ww_edges VALUES (11,8,7,4);
INSERT INTO ww_edges VALUES (11,9,11,4);
INSERT INTO ww_edges VALUES (11,10,9,1);
INSERT INTO ww_edges VALUES (11,4,9,11);
INSERT INTO ww_edges VALUES (11,12,11,1);
Here is an adaptation of the breadthfirst edge list algorithm to retrieve a Bill of Materials for a product identified by a rootID:
· Initialise a leveltracking variable to zero.
· Seed a temp reporting table with the rootID of the desired product.
· While rows are being retrieved, increment the level tracking variable and add rows to the temp table whose parentnodeIDs are nodes at the current level.
· Print the BOM ordered by path with indentation proportional to tree level.
Listing 36: A simpler parts explosion
DROP PROCEDURE IF EXISTS ww_bom;
DELIMITER 
CREATE PROCEDURE ww_bom( root INT )
BEGIN
DECLARE lev INT DEFAULT 0;
DECLARE totalcost DECIMAL( 10,2);
DROP TABLE IF EXISTS temp;
CREATE TABLE temp  initialise temp table with root node
SELECT
e.nodeID AS nodeID,
n.description AS Item,
e.parentnodeID,
e.qty,
n.cost AS nodecost,
e.qty * n.cost AS cost,
0 as level,  tree level
CONCAT(e.nodeID,'') AS path  path to this node as a string
FROM ww_nodes n
JOIN ww_edges e USING(nodeID)  root node
WHERE e.nodeID = root AND e.parentnodeID IS NULL;
WHILE FOUND_ROWS() > 0 DO
BEGIN
SET lev = lev+1;  increment level
INSERT INTO temp  add children of this level
SELECT
e.nodeID,
n.description AS Item,
e.parentnodeID,
e.qty,
n.cost AS nodecost,
e.qty * n.cost AS cost,
lev,
CONCAT(t.path,',',e.nodeID)
FROM ww_nodes n
JOIN ww_edges e USING(nodeID)
JOIN temp t ON e.parentnodeID = t.nodeID
WHERE e.rootID = root AND t.level = lev1;
END;
END WHILE;
WHILE lev > 0 DO  percolate costs up the graph
BEGIN
SET lev = lev  1;
DROP TABLE IF EXISTS tempcost;
CREATE TABLE tempcost  compute child cost
SELECT p.nodeID, SUM(c.nodecost*c.qty) AS childcost
FROM temp p
JOIN temp c ON p.nodeid=c.parentnodeid
WHERE c.level=lev
GROUP by p.nodeid;
UPDATE temp JOIN tempcost USING(nodeID)  update parent item cost
SET nodecost = nodecost + tempcost.childcost;
UPDATE temp SET cost = qty * nodecost  update parent cost
WHERE level=lev1;
END;
END WHILE;
SELECT  list BoM
CONCAT(SPACE(level*2),Item) AS Item,
ROUND(nodecost,2) AS 'Unit Cost',
ROUND(Qty,0) AS Qty,ROUND(cost,2) AS Cost FROM temp
ORDER by path;
END 
DELIMITER ;
CALL ww_bom( 1 );
+++++
 Item  Unit Cost  Qty  Cost 
+++++
 finished bookcase  206.60  1.0  206.60 
 backboard2x1  9.80  1.0  9.80 
 laminate2x1  8.00  1.0  8.00 
 screw  0.10  8.0  0.80 
 side  25.20  2.0  50.40 
 screw  0.10  12.0  1.20 
 plank  20.00  1.0  20.00 
 shelf  16.00  8.0  128.00 
 plank  20.00  0.5  10.00 
 shelf bracket  0.50  4.0  2.00 
 foot  2.10  4.0  8.40 
 cube4cmx4cm  1.00  1.0  1.00 
 screw  0.10  1.0  0.10 
+++++
Summary
Stored procedures, stored functions and Views make it possible to implement edge list graph models, nested sets graph models, and breadthfirst and depthfirst graph search algorithms in MySQL 5&6.
Further Reading
Celko J, "Trees and Hierarchies in SQL For Smarties", Morgan Kaufman, San Francisco, 2004.
Codersource.net, "Branch and Bound Algorithm in C#", http://www.codersource.net/csharp_branch_and_bound_algorithm_implementation.aspx.
Math Forum, "Euler's Solution: The Degree of a Vertex", http://mathforum.org/isaac/problems/bridges2.html
Muhammad RB, "Trees", http://www.personal.kent.edu/~rmuhamma/GraphTheory/MyGraphTheory/trees.htm.
Mullins C, "The Future of SQL", http://www.craigsmullins.com/idug_sql.htm.
Rodrigue JP, "Graph Theory: Definition and Properties", http://people.hofstra.edu/geotrans/eng/ch2en/meth2en/ch2m1en.html.
Santry P, "Recursive SQL User Defined Functions", http://www.wwwcoder.com/main/parentid/191/site/1857/68/default.aspx.
Shasha D, Wang JTL, and Giugno R, "Algorithmics and applications of tree and graph searching", In Symposium on Principles of Database Systems, 2002, p 3952.
Stephens S, "Solving directed graph problems with SQL, Part I", http://builder.com.com/51006388_145245017.html.
Stephens, S, "Solving directed graph problems with SQL, Part II", http://builder.com.com/51006388_145253701.html.
Steinbach T, "Migrating Recursive SQL from Oracle to DB2 UDB", http://www106.ibm.com/developerworks/db2/library/techarticle/0307steinbach/0307steinbach.html.
Tropashko V, "Nested Intervals Tree Encoding in SQL, http://www.sigmod.org/sigmod/record/issues/0506/p47articletropashko.pdf
Van Tulder G, "Storing Hierarchical Data in a Database", http://www.sitepoint.com/print/hierarchicaldatadatabase.
Venagalla S, "Expanding Recursive Opportunities with SQL UDFs in DB2 v7.2", http://www106.ibm.com/developerworks/db2/library/techarticle/0203venigalla/0203venigalla.html.
Wikipedia, "Graph Theory", http://en.wikipedia.org/wiki/Graph_theory.
Willets K, "SQL Graph Algorithms", http://willets.org/sqlgraphs.html.
TOC  Previous  Next  Last updated 20 Aug 2007 
'SQL > MySQL' 카테고리의 다른 글
[MySQL] hierarchical Data 표현... (0)  2009.02.11 

[MySQL] Got error 28 from storage engine (0)  2008.03.03 
[MySQL] mysqld_multi 사용법, 데몬 두 개 운영하기 (0)  2006.11.03 
mysql 관리 툴 EMS (0)  2006.09.30 
[Mysql] GRANT로 사용자 추가하기 (0)  2006.08.28 
replication 사용하기.. version  4.0.17 (0)  2004.07.06 
댓글을 달아 주세요