A "tree" is a discrete structure that serves as an important model for a variety of hierarchical data sets that need to be represented and processed in computer science (and in bioinformatics). Here are just a handful of common uses of trees in modeling data:
| CS term | Description | Biologist's term |
|---|---|---|
| node | A representation of a single entity within the tree | |
| edge | A connection representing a relationship between two nodes | |
| root | The (topmost) node from which an entire tree emanates | |
| parent | The immediate ancestor of a node in a tree | |
| external node (leaf) | A node without any children | tip |
| internal node | A node with at least one child | node |
| child | An immediate descendant of a node in a tree | |
| ancestor | Any of the nodes "above" a given node (i.e., toward the root) | |
| descendant | Any of the nodes "below" a given node (i.e., away from the root) | |
| branch | A path between an ancestor and one of its descendants | |
| subtree | The portion of a tree including a node and all of its descendants | clade |
Trees are inherently recursive and so our representation of trees, and our functions for processing trees, will also be recursive.
To represent trees, we will begin by considering a special class of trees known as binary trees, in which each internal node has precisely two children (although the techniques we use can typically be extended to more general trees with arbitrary branching factors).
We choose a relatively simple representation using Python's tuples (a built-in structure that is similar to a list, but immutable). The basic format is a triple,
(label, firstsubtree, secondsubtree)
By convention, if a node doesn't have any children, we use empty tuples in place of its subtrees, such as
('C', (), ())
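As a minimal sketch of this convention, the helpers below build and recognize childless nodes. The names `make_leaf` and `is_leaf` are hypothetical, chosen only for illustration; they are not part of Python or of any particular library.

```python
def make_leaf(label):
    """Return a childless node: a label followed by two empty tuples."""
    return (label, (), ())


def is_leaf(tree):
    """Return True if the node has no children (both subtrees are empty)."""
    label, first, second = tree
    return first == () and second == ()


leaf = make_leaf('C')
print(leaf)           # ('C', (), ())
print(is_leaf(leaf))  # True
```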
By this convention, a phylogenetic tree whose root A has two leaf children, B and C, would be represented by the recursive structure
('A',
    ('B', (), ()),
    ('C', (), ())
)
although to Python, the whitespace and indentation are not significant in this context, so this could be written more compactly as:
('A', ('B', (), ()), ('C', (), ()) )
or even without the spaces as
('A',('B',(),()),('C',(),()))
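The following short sketch checks that the layouts really are the same value to Python, and shows how the three parts of a node can be unpacked. The variable names `spread_out` and `compact` are illustrative assumptions.

```python
# The same tree written with generous whitespace and with none at all.
spread_out = ('A',
    ('B', (), ()),
    ('C', (), ())
)
compact = ('A',('B',(),()),('C',(),()))

print(spread_out == compact)  # True

# A node is a triple, so it unpacks into a label and two subtrees.
label, first, second = compact
print(label)   # A
print(first)   # ('B', (), ())
print(second)  # ('C', (), ())

# Tuples are immutable: assigning to an element raises TypeError.
try:
    compact[0] = 'Z'
except TypeError:
    print("tuples cannot be modified in place")
```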
As a more complex example, consider the following tree:
('A',
    ('B',(),()),
    ('C',
        ('D',
            ('E',(),()),
            ('F',(),())
        ),
        ('G', (), ())
    )
)
which is equally valid in Python as
('A', ('B',(),()), ('C', ('D', ('E',(),()), ('F',(),()) ), ('G', (), ()) ) )
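Because the representation is recursive, functions that process it can follow the same shape: handle the empty tuple as the base case, then recurse into both subtrees. The sketch below is one possible way to do this. The function names `node_count`, `leaf_labels`, and `height` are hypothetical, and defining the height of a single leaf as 0 is an assumption made here for illustration.

```python
def node_count(tree):
    """Count all nodes in the tree; an empty tuple contributes nothing."""
    if tree == ():
        return 0
    label, first, second = tree
    return 1 + node_count(first) + node_count(second)


def leaf_labels(tree):
    """Return the labels of the leaves (tips), in left-to-right order."""
    if tree == ():
        return []
    label, first, second = tree
    if first == () and second == ():
        return [label]
    return leaf_labels(first) + leaf_labels(second)


def height(tree):
    """Return the number of edges on the longest root-to-leaf path.

    Assumption: an empty tuple has height -1, so a lone leaf has height 0.
    """
    if tree == ():
        return -1
    label, first, second = tree
    return 1 + max(height(first), height(second))


example = ('A', ('B',(),()), ('C', ('D', ('E',(),()), ('F',(),()) ), ('G', (), ()) ) )

print(node_count(example))   # 7
print(leaf_labels(example))  # ['B', 'E', 'F', 'G']
print(height(example))       # 3
```

Each function unpacks a node into its label and two subtrees, which mirrors the triple format directly and avoids index arithmetic.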