METHODOLOGY
The approach to the problem spanned several
steps as listed below:
1. Create an XML parser and analyze the datasets.
2. Develop user tasks:
   - Generate specific queries from dataset analyses.
   - Review questions posed by the contest creators.
   - Develop queries suggested by a popular taxonomy of visualization
     tasks.
3. Evaluate current visualization models for their application to
   relevant tasks.
4. Apply encoding techniques.
Develop User Tasks
Specific Queries from Dataset Analyses
Part of our analysis of the domain involved
understanding the classification data. Because we were dealing
with almost 200,000 nodes per tree, we needed an automated
method to discover basic information from our data. Using
an XML parser that was written in the Squeak programming
language, we deduced certain information about our data sets:
How many unique ranks exist? 20. The
ranks are the same in both data sets.
How deep in the tree is the most classified
creature? 15. All animals appear to share the seven
major classification ranks, but none includes all 20 subclassifications.
How wide is the tree at its widest? 154,415
species exist in the combined tree (union). The next largest
rank is genus, with 28,243. It appears that the tree grows
exponentially with each major rank, with many leaves sprouting
from few branches.
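The three measurements above (number of unique ranks, maximum depth, and the most populous rank) can be sketched as a single pass over the tree. The following is a minimal Python illustration, not the Squeak parser used in this work. The XML layout, with nested `node` elements carrying `rank` and `name` attributes, is an assumption made for the example, as is the tiny sample tree.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical schema: nested <node rank="..." name="..."> elements.
# The contest files may be structured differently.
SAMPLE = """
<node rank="kingdom" name="Animalia">
  <node rank="phylum" name="Chordata">
    <node rank="class" name="Mammalia">
      <node rank="genus" name="Equus">
        <node rank="species" name="Equus caballus"/>
        <node rank="species" name="Equus asinus"/>
      </node>
    </node>
  </node>
</node>
"""

def summarize(root):
    """Return (node count per rank, maximum depth) for a nested <node> tree."""
    per_rank = Counter()
    max_depth = 0
    stack = [(root, 1)]  # iterative walk, so deep trees do not hit the recursion limit
    while stack:
        elem, depth = stack.pop()
        per_rank[elem.get("rank")] += 1
        max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in elem.findall("node"))
    return per_rank, max_depth

per_rank, max_depth = summarize(ET.fromstring(SAMPLE))
widest_rank, widest_count = per_rank.most_common(1)[0]
print(len(per_rank), max_depth, widest_rank, widest_count)
```

For trees of roughly 200,000 nodes, a streaming parse (for example `ET.iterparse`) might be preferable to loading the whole document at once.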
Following this analysis, we created queries
to gain further insight into the data. For instance:
Not counting species or subspecies, and
identifying by Latin name and rank, what instances exist
in one dataset that do not exist in the other? This was intended
to provide insight into the structure of the datasets.
Create a dataset with the output from the
previous question and find out whether any of those names
exist in the other dataset. Can we list the genealogy of
each through its ranks, all the way up the ladder? This helped
us locate a node in a different area of the alternate tree
and identified some basic differences between the datasets.
It also listed all the species in one tree that resided at
a different level in the other, allowing comparison between
the datasets.
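The two queries above amount to a set difference keyed on (Latin name, rank), followed by a name lookup in the other dataset and a walk up the parent links. A minimal Python sketch follows. The dictionary representation and the toy entries, including the rank given to Cetacea in dataset B, are invented for illustration and are not taken from the contest data.

```python
# Each dataset is sketched as {(latin_name, rank): parent_key or None}.
# All entries below are invented toy data.
A = {
    ("Animalia", "kingdom"): None,
    ("Chordata", "phylum"): ("Animalia", "kingdom"),
    ("Mammalia", "class"): ("Chordata", "phylum"),
    ("Cetacea", "order"): ("Mammalia", "class"),
}
B = {
    ("Animalia", "kingdom"): None,
    ("Chordata", "phylum"): ("Animalia", "kingdom"),
    ("Mammalia", "class"): ("Chordata", "phylum"),
    ("Cetacea", "suborder"): ("Mammalia", "class"),
}

SKIP = {"species", "subspecies"}  # ranks excluded from the first query

def only_in(first, second):
    """Keys (name, rank) present in `first` but absent from `second`."""
    return {k for k in first if k not in second and k[1] not in SKIP}

def same_name_elsewhere(missing, other):
    """For each missing key, list the ranks its name holds in `other`, if any."""
    ranks_by_name = {}
    for name, rank in other:
        ranks_by_name.setdefault(name, []).append(rank)
    return {k: ranks_by_name[k[0]] for k in missing if k[0] in ranks_by_name}

def lineage(tree, key):
    """Walk parent links from `key` up to the root."""
    path = []
    while key is not None:
        path.append(key)
        key = tree[key]
    return path

missing = only_in(A, B)
moved = same_name_elsewhere(missing, B)
print(missing, moved, lineage(A, ("Cetacea", "order")))
```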
Questions Posed by the Contest Creators
The contest originators posed some questions
representing typical user tasks. These helped define our
approach, although our solution does not answer all of them.
Some of these are listed here:
Considering both datasets, to what extent
are the differences in the classifications due to differences
in how animals are thought to be related? Are there other
kinds of differences, and can you explain them? Possible
approaches are:
Are the common names the same? Do they share
a Latin name higher in the genealogy that splits into more
trees in one than the other? Are there any clues when it
splits? Are there common prefixes in their names?
Implications: Upon identifying a node in one
tree, allow the user to locate the node in the other tree.
Zoomology approaches this by linking the detail windows.
Additional solutions to the locate task are being examined.
Considering one dataset or the other, can
you say in how many different subtrees a particular common
name (such as dolphin or horse) is
used? How closely are these animals related? Are common names
a good guide to understanding relationships?
Implication: Allow search by common name, with
results highlighted in a recognizable manner. This has been
designed but not yet implemented.
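Because this search is not yet implemented in Zoomology, the following Python sketch shows only one way the subtree count could be computed: collect the ancestor at a chosen depth for every node whose common name contains the search term. The record layout, the `subtrees_using` helper, and the toy records (with lineages truncated at the class level) are assumptions made for the example.

```python
# Invented toy records: (common name, Latin names from the root down to the node).
# Lineages are truncated at the class level for brevity.
RECORDS = [
    ("bottlenose dolphin", ("Animalia", "Chordata", "Mammalia")),
    ("river dolphin", ("Animalia", "Chordata", "Mammalia")),
    ("dolphinfish", ("Animalia", "Chordata", "Actinopterygii")),
    ("horse", ("Animalia", "Chordata", "Mammalia")),
    ("seahorse", ("Animalia", "Chordata", "Actinopterygii")),
]

def subtrees_using(term, records, depth):
    """Return the ancestors at `depth` (0 = root) whose subtree uses `term`."""
    term = term.lower()
    return {
        path[depth]
        for common, path in records
        if term in common.lower() and len(path) > depth
    }

classes = subtrees_using("dolphin", RECORDS, depth=2)
print(len(classes), sorted(classes))
```

Substring matching is deliberately loose here, so "horse" also matches "seahorse"; whether that is desirable depends on how the results are highlighted.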
Some scientific names are maddeningly similar.
For example, Spirulida and Spirurida are two nodes in two
different subtrees. A user types in the wrong one. What kind
of feedback does your tool provide to alert the user quickly?
Do the names have the same rank? Is the typed name in the
expected part of the tree?
Implication: When users type a Latin name for
search, offer predictive help so misspellings may be avoided.
Text-based queries would be useful here but are outside the
scope of this project.
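Although text-based queries are outside the scope of this project, the kind of feedback the question asks for can be sketched with the standard-library `difflib` module: after a name is typed, list any other names that are nearly identical, together with their rank and location. The index contents, the `lookalikes` helper, and the 0.85 similarity cutoff are assumptions chosen for illustration.

```python
import difflib

# Toy index: Latin name -> (rank, phylum). Entries are illustrative only.
INDEX = {
    "Spirulida": ("order", "Mollusca"),
    "Spirurida": ("order", "Nematoda"),
    "Squamata": ("order", "Chordata"),
}

def lookalikes(typed, index, cutoff=0.85):
    """Names in `index` that closely resemble `typed`, excluding `typed` itself."""
    close = difflib.get_close_matches(typed, list(index), n=5, cutoff=cutoff)
    return [name for name in close if name != typed]

typed = "Spirulida"
for name in lookalikes(typed, INDEX):
    rank, phylum = INDEX[name]
    print(f"Did you mean {name} ({rank}, in {phylum}) rather than {typed}?")
```

Showing the rank and an ancestor next to each suggestion addresses the two follow-up questions directly: whether the names share a rank, and whether the typed name lies in the expected part of the tree.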
Queries by Popular Taxonomy of Visualization Tasks
The following commonly identified visualization
tasks from the Wehrend and Lewis taxonomy were selected as
relevant to the classification data [13]:
Identify: questions could easily arise in the structure represented;
good support for discovery (browsing). Example: From what Class is this
huge fanout derived?
Browse: see Identify, above.
Locate: critical to the user. Example: Where is such-and-such (rank,
common name, Latin name)?
Compare Between: important because more than one dataset with like
structures is present. Example: Where is a species located in each set
(common name, Latin name)? In the same place (rank)?
Associate: an important task. Example: In any given species, can the
common name indicate where a rank has the least physical differentiation
(i.e., is a species "something snake" called that
because it looks like a reptile, even though it is an amphibian)?
Correlation: possible correlation questions can arise. Example: If the
common name of one species is found under an unlikely genus, will other
species with similar common names be likewise allocated?
Rank: an important task that is inherently defined in our visualization.
The data represents a hierarchical structure. Example: Is Mammalia
a part of x Class or x Class? How many classes are found
in the Phylum x?
Distinguish: a challenge with this huge dataset, since users should be
able to pick out individual items. Example: Can I distinguish between
two adjacent suborders? Will I be able to select an individual
one or two species for comparisons?
Categorize, Cluster,
Compare Within and Monitor were not considered important
due to the limited number of variables and the structure
of the dataset.
Evaluating Current Visualization Models
We applied the contest questions and our
user tasks to a number of existing visualization models in
order to evaluate their effectiveness in this domain. These
included Treemaps, Radial/Sunburst Models, the Node-Link
Model, and the Hyperbolic Browser.
Treemaps
In the Treemap model [7], nodes are displayed
as squares or rectangles where the size of the figure is
proportional to a quantitative property such as number of
children, and the color represents a nominal or ordinal property
such as rank. One strength of this model is its efficient
utilization of space; however, nodes are not easily arranged
in a manner that reveals the underlying structure. We found
treemaps inadequate for representing structure beyond a
few levels. Without the presence of connecting lines or a
consistent spatial placement, it was difficult to establish
relationships between nodes. We concluded that treemaps might
be a good choice to represent global changes between trees
but found them wanting at pinpointing more precise areas
of change.
Radial/Sunburst Models
While similar to a treemap in its spatial
representation of a quantitative property such as density,
Sunburst's [12] advantage is that successive levels
radiate outward in an easily recognizable fashion. As each
level expands outward to encircle its predecessor, it encompasses
more area. This maps well to a tree structure such as classifications,
where the number of nodes increases with each subsequent level.
Our overview uses the same space-filling technique as Sunburst
and shares many of its advantages. For instance, an increase
in area represents an increase in the number of nodes. It
also shares its weaknesses, e.g., as the number of nodes
expands at lower levels, the lack of available space results
in pixel-thin lines that actually represent large numbers
of nodes.
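The space-filling rule described above, and the reason for the pixel-thin lines, can be illustrated with a small Python sketch in which each node receives an angular extent proportional to its number of leaf descendants. This is a generic radial layout written for illustration, not the Zoomology implementation; the toy tree, the function names, and the 400-pixel ring radius are assumptions.

```python
import math

# Toy tree as nested dicts; an empty dict marks a leaf.
TREE = {"A": {"a1": {}, "a2": {}, "a3": {}}, "B": {"b1": {}}}

def leaf_count(children):
    """Number of leaves below a node (a leaf counts as one)."""
    return 1 if not children else sum(leaf_count(c) for c in children.values())

def layout(name, children, start, extent, depth, out):
    """Assign (start angle, angular extent) to every node, proportional to leaves."""
    out[name] = (depth, start, extent)
    total = leaf_count(children)
    angle = start
    for child, grandchildren in children.items():
        share = extent * leaf_count(grandchildren) / total
        layout(child, grandchildren, angle, share, depth + 1, out)
        angle += share
    return out

def arc_pixels(extent, radius):
    """Arc length, in pixels, of an angular extent on a ring of given radius."""
    return extent * radius

arcs = layout("root", TREE, 0.0, 2 * math.pi, 0, {})

# With the 154,415 species of the combined tree sharing one ring of an
# assumed 400-pixel radius, each species receives far less than one pixel.
per_species = arc_pixels(2 * math.pi / 154415, 400)
print(arcs["A"], arcs["B"], per_species)
```

At that assumed radius, each species is allotted roughly 0.016 pixels of arc, which is why a single drawn line at the outer levels stands for many nodes.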
Node-Link Model
Node-link trees excel at representing structure,
especially at higher levels of the tree. Lines connecting
parent and child nodes clearly show the relationship between
them. However, the one-to-one relationship implied by these
links does not scale well and can result in a visualization
that resembles a ball of string. Attempts to encode rankings
by color might result in a confusing mass of colored confetti.
Despite these disadvantages, Zoomology includes a second
overview in node-link format to accommodate biologists who
commonly view trees in this manner [2].
Radial Node-Link Hybrid
A hybrid radial node-link model was explored, utilizing the
space-saving advantages of a radial layout and its intrinsic
mapping of levels to concentric rings. The greatest challenge
lay in determining what kind of algorithm could be developed to
prevent links within a wide arc from crossing over sibling nodes.
Hyperbolic Browser
Since a hyperbolic browser [8] represents
focus and context together, it can spotlight an area in detail
while retaining the general structure of a tree. In limited
space, this provides the capability to show more nodes than
a traditional node-link model. However, as one changes focus
using hyperbolic browsers, surrounding nodes shift in both
position and size. Thus, it is easy to lose track of the
underlying structure, a problem that is intensified by the
large number of levels in the classification trees. In deep
trees such as these, it is also difficult to represent the
full hierarchy because levels far away from the focus might
be rendered too small to see or not be drawn at all.
Encoding Schemes
We considered the following encoding schemes
for representing structure and attributes:
Position
In the zoom view, which shows detail, spatial
encoding distinguishes ranks. Root nodes enclose child
nodes, which then enclose their own children. As the user
zooms in to the next level, the current level expands until
it is larger than the screen, revealing detail of the next
level. Zoom out shrinks the current level, and the previous
level narrows in towards the center of the screen to appear
in detail. The location of nodes within a given rank does not map to anything.
In the overview, proximity indicates a parent/child
relationship, and width maps to progeny. Horizontal location does not map to anything. (9)
Color
We chose color as the primary encoding for
rank. The challenge was in representing up to twenty different
levels with colors distinct enough to be easily distinguished.
To do this, we gave all of the rankings within a major
rank, such as order or class, the same base color and
distinguished them by tint. Thus, all of the subrankings
at a particular rank use shades of a single color.
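The scheme described above, one base hue per major rank with tints separating its subrankings, might be sketched in Python as follows. The specific hues, the lightness range, and the `rank_color` helper are assumptions for illustration and are not the palette used in Zoomology.

```python
import colorsys

# The seven major ranks, spread along a rainbow scale (assumed ordering of hues).
MAJOR_RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def rank_color(major, variant=0, variants=1):
    """Hex color for the `variant`-th of `variants` subrankings of a major rank.

    The hue depends only on the major rank; lightness separates the subrankings.
    """
    hue = 0.8 * MAJOR_RANKS.index(major) / (len(MAJOR_RANKS) - 1)
    if variants == 1:
        lightness = 0.5
    else:
        lightness = 0.4 + 0.35 * variant / (variants - 1)
    r, g, b = colorsys.hls_to_rgb(hue, lightness, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))

# Three hypothetical subrankings around "class" share a hue but differ in tint.
class_tints = [rank_color("class", i, 3) for i in range(3)]
print(rank_color("kingdom"), class_tints)
```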
Since children are usually no more than
one or two ranks down from their parent, we determined that
only a few colors would be required at any given time. This
eliminated the lack of clarity that can result when too many
colors compete for one's attention. Thus, a traditional
rainbow scale with in-between tints works without causing
undue visual clutter.
We determined that the best background color
would be dark gray. Black would render the overview visualization
too vibrant and distracting. A white background would not
allow us to use white and light tints effectively for encoding
differences.
We have yet to determine the best technique
for displaying differences. We currently use white to represent
change since it is easily distinguished from the colored
background. However, the overview is a representation of
the combined sets, and this model does not allow us to show
nodes that exist only in dataset A separately from those
that are only in dataset B.
Size
In the overview, size is used to represent
relative number of cases within each ranking. In the detail
view, size is not relevant.
Shape
We selected color over shape to encode levels
since we believed recognition would be more rapid [3]. Ovals
represent nodes in the detail view. In the overview and legend,
shape distinguishes the navigation paths of the different
datasets.