BACKGROUND INFORMATION
One rapidly expanding area of biology is
the classification of species. Traditionally, species
were grouped together by anatomical similarities. Today, access
to molecular-level data has given rise to new models based
on proteins and DNA. New computer tools will be needed to
help scientists create and contrast these new classification structures.
Although optimized for comparison of classification
datasets, the solution presented here might apply to other
large hierarchies with a moderate number of levels and relatively
small areas of change.
Dataset Selection
Zoomology was created in response to
the InfoVis 2003 Contest for Visualization and Pair Wise
Comparison of Trees. The contest posed three problems:
comparing small phylogenetic trees by structure alone,
comparing file system logs of about 70,000 nodes containing
many variables, and comparing classification trees
of approximately 200,000 nodes with three variables.
The classification trees dataset
is the most challenging of the three to visualize because
it requires representation of both structure and attributes.
We selected it because we wanted to gain firsthand experience
in creating interactive visualizations of large structures,
a skill that is becoming crucial in this era of information explosion.
The classification datasets compared here
represent about 15% of the more than one million known species
of living organisms. These are traditional anatomical phylogenies
rather than molecular phylogenies [6]. Classification trees
follow a hierarchy of rank. In order of increasing specificity,
the seven major ranks are Kingdom, Phylum, Class, Order,
Family, Genus, and Species. Each major rank may also have sub-,
infra-, and super- levels (for example, Suborder or Superfamily);
in total we found twenty distinct ranks. Walking the path of
nodes from the root (Kingdom Animalia) to a leaf yields the
complete formal classification for a particular species.
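
As a minimal sketch of this root-to-leaf walk, assume each node keeps a reference to its parent. The names `TaxonNode` and `classification` are hypothetical, introduced only for illustration and not taken from Zoomology, and the example lineage (the green treefrog) is a textbook classification rather than a record from the contest data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TaxonNode:
    """One node of a classification tree (hypothetical structure)."""
    rank: str                              # e.g. "Kingdom", "Order", "Species"
    latin_name: str
    parent: Optional["TaxonNode"] = None   # None only for the root


def classification(node: Optional[TaxonNode]) -> List[Tuple[str, str]]:
    """Return (rank, name) pairs ordered from the root down to `node`."""
    path = []
    while node is not None:
        path.append((node.rank, node.latin_name))
        node = node.parent
    path.reverse()                         # collected leaf-first, so flip it
    return path


# Build a single illustrative lineage, linking each node to its parent.
ranks_and_names = [
    ("Kingdom", "Animalia"),
    ("Phylum", "Chordata"),
    ("Class", "Amphibia"),
    ("Order", "Anura"),
    ("Family", "Hylidae"),
    ("Genus", "Hyla"),
    ("Species", "Hyla cinerea"),
]
species = None
for rank, name in ranks_and_names:
    species = TaxonNode(rank, name, parent=species)

for rank, name in classification(species):
    print(f"{rank}: {name}")
```

Storing a parent pointer keeps the walk proportional to the depth of the tree, which stays small here because the hierarchy has only a moderate number of levels.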
Each node contains up to three variables,
of which the first two are always present. The first is its
rank. The second is its Latin name, for example, Ctenophora.
This uniquely identifies the animal. Each name identifies
a corresponding pair of nodes, one in each of the two datasets,
though their children and the tree topology may differ. The
third, when included, is its common name, such as jellyfish
or treefrog. One common name may encompass numerous species.
Conversely, a node might be known by different common names
in differing representations [2].
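
The three node variables, and the use of the Latin name as the key that pairs nodes across the two datasets, can be sketched as follows. This is an illustration under stated assumptions: `Taxon`, `matched_pairs`, and the records below are hypothetical toy values, not the contest file format or its contents.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Taxon(NamedTuple):
    rank: str                           # always present
    latin_name: str                     # always present, unique per dataset
    common_name: Optional[str] = None   # third variable, may be absent


def matched_pairs(tree_a: Iterable[Taxon],
                  tree_b: Iterable[Taxon]) -> List[Tuple[Taxon, Taxon]]:
    """Pair up nodes that share a Latin name across the two datasets."""
    index_b = {t.latin_name: t for t in tree_b}
    return [(a, index_b[a.latin_name])
            for a in tree_a
            if a.latin_name in index_b]


# Toy records for illustration only.
dataset_a = [
    Taxon("Phylum", "Ctenophora", "comb jellies"),
    Taxon("Genus", "Hyla", "treefrogs"),
    Taxon("Class", "Amphibia"),
]
dataset_b = [
    Taxon("Phylum", "Ctenophora", "sea gooseberries"),
    Taxon("Genus", "Hyla"),
]

for a, b in matched_pairs(dataset_a, dataset_b):
    print(a.latin_name, "|", a.common_name, "|", b.common_name)
```

Keying on the Latin name rather than the common name reflects the text above: the Latin name is unique, while a common name may be missing, shared by several species, or different between representations.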
Domain Background
One of the central tasks in comparing
these trees involves uncovering and analyzing the differences
in their hierarchies. Differences occur due to variation
in the way animals are ranked, and can be subtle. For example,
an infraorder in one tree might be ranked as an order in
another. The addition of a branch node changes the hierarchy
for all of its descendants. As Cyndy Sims Parr, one of the
contributors to the two trees involved in the contest, explains
on her web page:
“Classification is
a human undertaking. Most systematists agree that classification
(how organisms are named and grouped into things like Families
and Orders and Classes) ought to reflect what we know about
how organisms are related to each other. Yet, what we know
is constantly changing, hopefully for the better. And like
all human undertakings, there are controversies over what
exactly we do know. There is no single ‘correct’ classification,
just classifications that are currently accepted by most
systematists.” [9]
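
To make the effect of an added branch node concrete, the sketch below compares the ancestor chains of shared names in two toy trees. It assumes each tree is stored as a child-to-parent dictionary; the function names and the placeholder taxa (`order_x`, `suborder_w`, and so on) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def lineage(parent_of: Dict[str, str], name: str) -> List[str]:
    """Return the ancestors of `name`, root first, from a child->parent map."""
    chain = []
    while name in parent_of:
        name = parent_of[name]
        chain.append(name)
    chain.reverse()
    return chain


def changed_lineages(parent_a: Dict[str, str],
                     parent_b: Dict[str, str]) -> List[str]:
    """Names present in both trees whose ancestor chains differ."""
    shared = sorted(parent_a.keys() & parent_b.keys())
    return [n for n in shared if lineage(parent_a, n) != lineage(parent_b, n)]


# Tree B inserts one extra branch node (suborder_w) beneath order_x.
tree_a = {
    "order_x": "root",
    "family_y": "order_x",
    "genus_z": "family_y",
}
tree_b = {
    "order_x": "root",
    "suborder_w": "order_x",
    "family_y": "suborder_w",
    "genus_z": "family_y",
}

print(changed_lineages(tree_a, tree_b))
```

A single inserted node changes the lineage of every shared descendant beneath it, while nodes above the insertion point are unaffected.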
Previous Work
Our framework is similar to the "Pad"
[10] metaphor and its extension, Pad++
[1]. In these systems, the information space is treated
as an infinite 2D plane that can be magnified by orders
of magnitude at any point to reveal detail. Pad++ has
mostly been explored as a highly interactive, zoomable alternative
to traditional window-and-icon interfaces and in applications
such as navigable web interfaces.
Our visualization exploits zooming techniques
employed in GVis, a tool for visualizing genome data [5].