Graphs
Graph Terminology
A graph is a general data structure for showing connections among a set of objects. The objects are called nodes or vertices, and we will usually label them with a unique number or string. The connections are called edges; an edge connects two nodes. Depending on the application, there may be additional conditions or properties of edges:
- In an undirected graph, an edge linking a to b is also a link from b to a.
- In a directed graph, we distinguish an edge from a to b (we refer to a as the source and b as the target) from an edge going the other way, from b to a.
- In a simple graph, there must be at most one edge connecting any pair of nodes, and there are no loops (edges from a node to itself).
- To emphasize that multiple edges are allowed between nodes we may refer to a multigraph; to emphasize that loops are allowed we may refer to a loop graph. A multigraph with loops is sometimes called a pseudograph, although we will generally just refer to all of these variants as graphs and only make distinctions where needed.
- An edge may be labeled with a number or string; we do not generally require edge labels to be unique.
- If there is an edge from a to b, then we say that b is adjacent to a.
Graph Representations
One representation of an unlabeled graph is simply as a set of ordered pairs of nodes. This corresponds to the usual notion of a relation as a set of pairs (sometimes called, perhaps confusingly, the "graph" of the relation).
Another representation is as a function (often given in the form of a table) from nodes to a set of neighbors (i.e., the adjacent nodes). The set of neighbors of each node is often presented as a list, which leads to the term adjacency list for this presentation.
A third representation is as an adjacency matrix. If the nodes are numbered 1 through n, then the adjacency matrix is the two-dimensional array where the entry in row i, column j will be 1 (or 'true') if there is an edge from i to j and 0 ('false') if not. This representation is often extended to labeled graphs where the labels are weights: the entry for (i, j) gives the weight (or distance, or cost, …) of the edge between i and j. Depending on the application, it may be preferable to use an infinite weight to represent absent edges, rather than 0 (since a weight of 0 might be interpreted as saying there is no cost to go from i to j).
Here is some ReasonML code defining types for these representations, plus some conversion functions. In each case, the type is parameterized by a node type 'n, and the representation type is a pair whose first element is the list of nodes (because there is no generic way to get this list for an arbitrary type 'n, and if 'n is int or string we only want to use a subset of the possible values anyway). The code uses a number of standard functions on lists, such as List.map and List.filter; for details, see the library documentation.
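As a rough sketch of what such definitions might look like (the type and function names graphPairs, graphAdjList, and pairsToAdjList here are illustrative, not necessarily those used in the accompanying code):

```reason
/* Hypothetical names for the pairs and adjacency-list representations
   described above; the definitions in the accompanying code may differ. */
type graphPairs('n) = (list('n), list(('n, 'n)));
type graphAdjList('n) = (list('n), list(('n, list('n))));

/* Convert the pairs representation into an adjacency-list table. */
let pairsToAdjList = ((nodes, edges): graphPairs('n)): graphAdjList('n) => {
  let neighbors = n =>
    List.map(snd, List.filter(((src, _)) => src == n, edges));
  (nodes, List.map(n => (n, neighbors(n)), nodes));
};

/* An adjacency matrix for nodes numbered 0..n-1 could similarly be
   represented as an array(array(bool)), or array(array(float)) for weights. */
```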
Here is a collection of functions to render a graph (represented as pairs) with DPoodle:
Graph Traversals
We will consider two main ways of traversing a graph here. In addition to visiting all of the nodes in the graph, each approach will allow us to answer some extra questions about the structure of the graph. The depth-first traversal is analogous to the pre-, in-, and postorder traversals of a tree, in the sense that we will (recursively) visit all of the nodes reachable from one neighbor before backing up and trying another neighbor. The breadth-first traversal by contrast is analogous to the level order traversal of a tree, where we will visit all of the immediate neighbors before continuing on to visit their neighbors, and so on. Just as with level order traversal, the breadth-first traversal does not have as natural a recursive implementation as depth-first, although we will see that both can be expressed with very similar code by making use of an explicit data structure to control the traversal. Graph traversals in general are more complicated than tree traversals, because we have to worry about sharing of descendants of a node (there may be multiple paths to reach the same node) as well as cycles in the graph (there may be paths that loop back on themselves). Indeed, one definition of a tree is that it is a graph with a distinguished node, called the root, such that there is a unique path from the root to any other node.
Depth-First Traversal
To perform a depth-first traversal, we will need to keep track of two additional pieces of information about the nodes. The first is a set of visited nodes; whenever we first arrive at a node, we add it to this set. The other is the finishing list; whenever we have finished examining all of a node's neighbors, we will add it to the head of the finishing list. When working with a graphical representation, we will color in a node when it is first visited, and then write a number (the finishing number) next to the node when it is finished, showing its position from the end of the finishing list. This information corresponds to performing actions according to preorder (visiting) and postorder (finishing) traversal of a tree.1
Here is the basic algorithm for depth-first traversal starting from a node a:
- Mark a as visited
- While there is an adjacent unvisited node b,
- Perform a depth-first traversal starting from b
- When all nodes adjacent to a are visited, add a to the head of the finishing list.
To traverse all of the nodes in a graph, first choose an arbitrary node and perform a depth-first traversal starting from that node. When that node is finished, if there are any remaining unvisited nodes then choose one and repeat.
Clearly this procedure will eventually visit all of the nodes exactly once. By keeping track of a little more information, it can also answer some interesting questions about a graph. As we are traversing the graph, we will put each edge into one of three categories:
- Tree Edge: an edge that we follow to get to an adjacent unvisited node
- Back Edge: an edge that we do not follow because it leads to a visited, but not yet finished, node
- Forward Edge: an edge that we do not follow because it leads to an already finished node2
Here then is the augmented algorithm for depth-first traversal from a:
- Mark a as visited
- For each edge e coming out of a and going to a node b:
- If b is unvisited, perform a depth-first traversal from b and mark e as a tree edge
- If b is visited but not yet finished, mark e as a back edge
- If b is finished, mark e as a forward edge
- Add a to the finishing list.
The tree edges form what is known as the depth-first search tree, with the root at the node where we started the traversal. More generally, since we get a separate tree each time we restart the traversal, the result of visiting all of the nodes in a graph will in general be a depth-first search forest (since a forest is just a set of trees).
If we ever find a back edge in traversing a graph, then the graph has a cycle. Suppose that there is a back edge from a to b. Since b is visited but not yet finished, there must be a sequence of tree edges leading from b to a (the current path we are exploring). That path plus the back edge will form a cycle. Conversely, if a graph has a cycle, then we will find at least one back edge, because there will be some point in the traversal where we close the loop back to a node on the current path (which will necessarily be visited but not yet finished).
On the other hand, if we find no back edges, then the graph is acyclic. Directed acyclic graphs (often called dags) share many of the advantages of trees; indeed, we have already encountered them as combinational circuits, which we viewed as a generalization of boolean expression trees (with shared subterms). They are also useful to model dependencies, for example showing which tasks must be completed before others (such as prerequisite courses in a college catalog).
Given an acyclic graph, we may "linearize" the nodes by choosing an order in which to list them that respects the dependencies among them (for example, an order in which to take a sequence of classes where all the prerequisites are taken before the courses that depend on them). This is a generalization of sorting, where the edges in the graph reflect only a "partial" ordering among the items; it is known as topological sorting, and the result is a topological ordering of the nodes. The perhaps surprising fact about the finishing list is that, if there were no cycles, then it gives a topological ordering! The reason is that we can finish a node only after all of the nodes that can be reached from it are finished, so when we put it at the front of the finishing list it will be followed by all of the nodes that depend on it.
As an example of depth-first traversal, consider the following graph:
Start by marking visited:
Follow the (tree) edge to :
Follow the (tree) edge from to :
Node has no neighbors at all, so mark it finished:
Back at node , follow the (tree) edge to :
The only edge out of goes to , which is visited but not yet finished, so mark it as a back edge and mark finished:
Now node is also finished, so back up to node and consider its remaining outward edge. It goes to node , which is already finished, so mark it as a forward edge and mark finished as well:
We still have nodes and unvisited. Arbitrarily restart the traversal at , and first consider the edge to . Since is already finished, this is another forward edge:
Now follow the tree edge from to , and find the edge from to is a forward edge:
Finally, the edge from to itself is a back edge, after which and then are finished:
Since there were back edges, the graph has at least one cycle. In fact, there are two: , and . In general, there may not be an exact match between the number of back edges and the number of cycles, because several cycles may share a single back edge.
For another example, here is the same graph with the edges and removed:
We leave the details of tracing through the traversal as an exercise, but here is the final marked graph:
The finishing list is , , , , , in this case. Since there were no back edges, this is a topological ordering of the nodes in the graph. Here is the graph with the nodes reordered:
Observe that all of the arrows go from left to right. (TODO This diagram will look better when the rendering function can use curved arrows to avoid overlap….)
Note that other markings are possible, depending on the choices made (which nodes to start at, and the order in which to visit the edges out of each node). As another exercise, perform another traversal of this graph that produces a different marked-up result. Can you find a different topological ordering?
ReasonML Implementation of Depth-First Traversal
To write the depth-first traversal as a pure functional program, we do not want to store the extra information (the visited and finished lists, and the categorization of the edges) in mutable data structures. Instead, we will pass around a ReasonML record containing the state. The syntax for records is very much like that in JavaScript: a record consists of a comma-separated set of fields of the form label: value inside a pair of curly brackets. To access the field x of record r we use r.x. To create a new record as a copy of r, with field x updated to v, we write {...r, x: v}.
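For instance, the traversal state might be kept in a record such as the following sketch (the type and field names here are hypothetical, and the actual record may carry more information, such as the categorized edges):

```reason
/* A hypothetical state record for the traversal. */
type dfsState('n) = {
  visited: list('n),
  finished: list('n),
};

let s = {visited: ["a", "b"], finished: ["b"]};   /* a record literal */
let s2 = {...s, finished: ["a", ...s.finished]};  /* copy of s with one field updated */
/* s.visited is ["a", "b"]; s2.finished is ["a", "b"] */
```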
The returned value from depthFirst is a pair of a dfsResult and a graphPairs. The dfsResult is either Cycle(e), where e is a back edge (a pair of nodes) causing a cycle, or TopoOrder(nodes), where nodes is a list of the graph's nodes in topological order. The graphPairs value is the subgraph consisting of just the tree edges.
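One plausible declaration of the result type just described (the actual definition in the accompanying code may differ):

```reason
type dfsResult('n) =
  | Cycle(('n, 'n))       /* a back edge (pair of nodes) that closes a cycle */
  | TopoOrder(list('n));  /* the finishing list: nodes in topological order */
```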
Instead of using the recursive helper function dfs, which performs a single depth-first exploration starting from a given node, we may perform all of the exploration in the (tail-recursive) helper function run by using an explicit stack to keep track of the current path from the starting node. Each entry in the stack can be one of three values: Visit(n) says that the current task is to visit node n; Finish(n) says that the current task is to finish node n (note that this is pushed on the stack underneath all of the edges out of n, so it will only be done after all of the neighbors are processed); and Edge(n1, n2) says that the current task is to consider the edge from n1 to n2. The run helper function can now be seen as a loop that continually removes a task from the stack and then updates the stack and the state to perform that task:
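A simplified sketch of such a loop is shown below; it follows the Visit/Finish/Edge scheme just described but omits the edge classification and the search forest, computing only the visited set and the finishing list (the helper names and record fields are illustrative, not necessarily those in the accompanying code):

```reason
type task('n) =
  | Visit('n)
  | Finish('n)
  | Edge('n, 'n);

/* The same kind of state record sketched earlier. */
type dfsState('n) = {
  visited: list('n),
  finished: list('n),
};

let rec run = (neighbors, stack, state) =>
  switch (stack) {
  | [] => state
  | [Visit(n), ...rest] =>
    if (List.mem(n, state.visited)) {
      run(neighbors, rest, state); /* already reached via another path */
    } else {
      /* push the outgoing edges, with Finish(n) underneath them all */
      let outEdges = List.map(m => Edge(n, m), neighbors(n));
      run(
        neighbors,
        outEdges @ [Finish(n), ...rest],
        {...state, visited: [n, ...state.visited]},
      );
    }
  | [Edge(_, m), ...rest] => run(neighbors, [Visit(m), ...rest], state)
  | [Finish(n), ...rest] =>
    run(neighbors, rest, {...state, finished: [n, ...state.finished]})
  };

/* Example: explore from "a" in a small graph given by a neighbors function. */
let neighbors = n =>
  switch (n) {
  | "a" => ["b", "c"]
  | "b" => ["c"]
  | _ => []
  };

let result = run(neighbors, [Visit("a")], {visited: [], finished: []});
/* result.finished is ["a", "b", "c"]: "a" finished last, so it heads the list */
```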
Breadth-First Traversal
The big payoff of rewriting the depth-first traversal to use an explicit stack is that we can now explain breadth-first traversal (and level order traversal of a tree) quite simply: instead of a stack, use a queue! The idea is that the queue is maintaining a list of edges yet to be processed. When the algorithm starts, we push all of the edges of the initial node onto the queue and process them in order. As we handle each edge, if its target node has not yet been visited, then we push all of its outgoing edges onto the queue, but they will have to wait until all of the current edges are processed. In this way, we visit all of the immediate neighbors first, and then we continue with their neighbors, followed by their neighbors' neighbors, etc., until the entire graph is traversed.
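As a sketch of the idea (this version queues nodes rather than edges, which gives the same visiting order, and uses a plain list as the queue; a real implementation would use an efficient queue, and the names are illustrative):

```reason
/* Breadth-first traversal: dequeue from the front, enqueue at the back. */
let rec bfs = (neighbors, queue, visited) =>
  switch (queue) {
  | [] => List.rev(visited) /* nodes in the order they were first visited */
  | [n, ...rest] =>
    if (List.mem(n, visited)) {
      bfs(neighbors, rest, visited);
    } else {
      bfs(neighbors, rest @ neighbors(n), [n, ...visited]);
    }
  };

/* bfs(neighbors, ["a"], []) visits "a", then all of its neighbors, then
   their neighbors, and so on. */
```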
Breadth-first traversal is desirable when the graph might be large and we want to stop searching when we find the closest match, or if we only want to process items within a given distance (for example, it is used in game-playing strategies to look ahead up to a certain number of moves). Several famous algorithms can also be viewed as variants of breadth-first traversal.
Dijkstra's Algorithm
Consider a labeled graph where each edge has an associated (non-negative) distance or cost. Instead of using a queue to store the edges under consideration, we will use a priority queue (perhaps implemented as a binary heap). By assigning a priority based on the total length of the path from the starting node (including the edge under consideration), Dijkstra's algorithm will greedily build a shortest-path tree with the starting node as the root. This is an effective way to answer questions like "what is the shortest travel time from a to b?" or "what is the closest node to a with a given property?"
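Here is a very naive sketch of the idea, assuming integer edge costs and a graph given as a function from each node to a list of (neighbor, cost) pairs; the "priority queue" is just a list scanned for its minimum, only distances (not the tree) are computed, and all names are illustrative:

```reason
let rec dijkstra = (neighbors, frontier, dist) =>
  switch (frontier) {
  | [] => dist /* association list of (node, shortest distance from the start) */
  | [first, ...others] => {
      /* select the frontier entry with the smallest total distance so far */
      let best =
        List.fold_left(
          (acc, e) => if (snd(e) < snd(acc)) { e } else { acc },
          first,
          others,
        );
      let (n, d) = best;
      let rest = List.filter(e => e != best, frontier);
      if (List.mem_assoc(n, dist)) {
        dijkstra(neighbors, rest, dist); /* n was already settled; skip it */
      } else {
        /* settle n at distance d and relax all of its outgoing edges */
        let relaxed = List.map(((m, c)) => (m, d + c), neighbors(n));
        dijkstra(neighbors, rest @ relaxed, [(n, d), ...dist]);
      };
    }
  };

/* dijkstra(neighbors, [(start, 0)], []) gives the distance from start to
   every reachable node. */
```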
Prim's Algorithm
If we have an undirected graph and perform Dijkstra's algorithm with the priority of an edge being just the cost of that edge alone, then instead of a shortest-path tree we get a minimum spanning tree. This is the smallest subset of edges that connects all of the nodes, with the least possible total cost of the edges.
(TODO examples will have to wait until the tree rendering can do labels…)
Exercises
- Give the list of pairs, adjacency list, and adjacency matrix representations for the following undirected graph:
Answer
- Consider the adjacency matrix representation of a graph. What can you say about the matrix if the graph is undirected? What if the graph has loops on all of the nodes?
Answer
An undirected graph will have a symmetric adjacency matrix: the entry in row i, column j is the same as the entry in row j, column i (symmetric about the diagonal). A graph with all of the self-loops will have 1's along the diagonal (and a simple graph, with no loops, will have 0's on the diagonal).
- (Based on a problem from Aho & Ullman) Consider the following directed graph:
- Give two different depth-first traversal trees starting at node . For each, also label the graph to show the forward and back edges, and the finishing number.
Answer
Using depth-first traversal (others are possible):
- Find the distance (length of the shortest path) from to each of the other nodes.
Answer
Using breadth-first traversal:
- What would go wrong in Dijkstra's algorithm if we allowed edges with a negative cost? Give an example where it fails to find the shortest path.
Answer
With negative-cost edges, the greedy approach to building the shortest-path tree is not guaranteed to find the shortest paths. Suppose there is an edge from a to b with cost 2, and an edge from a to c with cost 3: Dijkstra's algorithm will say that the shortest path from a to b has cost 2, without even looking at node c. However, if there is an edge from c to b with cost −2, then the better route would be to go from a to c to b, with total cost 1.
Even worse, if there were a cycle in the graph with a total negative cost (such as an edge from a to b with cost 1, but an edge back from b to a with cost −2, say), then any path touching that cycle could be extended to follow the cycle any number of times, leading to an arbitrarily low (large negative) cost! For such a graph, the notion of "shortest path" is undefined.
1. There is no analogue to inorder, because we don't impose any order on the neighbors of a node, so there is no equivalent of having finished the left child and being about to start the right child.
2. Some authors distinguish between proper forward edges, where there is a path of tree edges leading from the current node to the already finished node, and cross edges, where there is not such a path, but we will not need that distinction.