Zoomology

 

 

METHODOLOGY

The approach to the problem spanned several steps as listed below:

• Create an XML parser and analyze the datasets.
• Develop user tasks:

   - Generate specific queries from dataset analyses.
   - Review questions posed by the contest creators.
   - Develop queries suggested by a popular taxonomy of visualization tasks.

• Evaluate current visualization models for their application to relevant tasks.
• Apply encoding techniques.

Develop User Tasks

Specific Queries from Dataset Analyses

Part of our analysis of the domain involved understanding the classification data. Because we were dealing with almost 200,000 nodes per tree, we needed an automated method to discover basic information about our data. Using an XML parser written in the Squeak programming language, we derived the following facts about our datasets:

How many unique ranks exist? 20 — The ranks are the same in both data sets.

How deep in the tree is the most “classified” creature? 15 — All animals appear to share the seven major classification ranks, but none includes all 20 subclassifications.

How wide is the tree at its widest? 154,415 species exist in the combined tree (union). The next largest rank is genus with 28,243. It appears that the tree grows exponentially with each major rank, with many leaves sprouting from few branches.
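The figures above came from the Squeak parser, which is not reproduced here. Purely as an illustration, the following Python sketch shows how the same three statistics (unique ranks, maximum depth, widest rank) could be gathered in one pass over a tree. The element name `node`, the attributes `rank` and `name`, and the miniature sample are assumptions made for this example and are not the contest's actual XML schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature dataset; the real contest files use their own schema.
SAMPLE = """<node rank="kingdom" name="Animalia">
  <node rank="phylum" name="Chordata">
    <node rank="class" name="Mammalia">
      <node rank="order" name="Cetacea">
        <node rank="family" name="Delphinidae"/>
      </node>
    </node>
    <node rank="class" name="Aves"/>
    <node rank="class" name="Reptilia"/>
  </node>
  <node rank="phylum" name="Mollusca"/>
</node>"""


def tree_stats(root):
    """Return (number of nodes per rank, depth of the deepest node)."""
    rank_counts = Counter()
    max_depth = 0
    stack = [(root, 1)]  # explicit stack: no recursion limit on deep trees
    while stack:
        node, depth = stack.pop()
        rank_counts[node.get("rank")] += 1
        max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in node)
    return rank_counts, max_depth


root = ET.fromstring(SAMPLE)
rank_counts, max_depth = tree_stats(root)
unique_ranks = len(rank_counts)
widest_rank, widest_count = rank_counts.most_common(1)[0]
print(unique_ranks, max_depth, widest_rank, widest_count)
```

On a full dataset, `ET.parse` on the file would replace `ET.fromstring`; the traversal itself would not change.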

Following this analysis, we created queries to gain further insight into the data. For instance:

Not counting species or subspecies, and identifying by Latin name and rank, what instances exist in one dataset that do not exist in the other? This was intended to provide insight into the structure of the datasets.

Create a dataset with the output from the previous question and find out whether any of those names exist in the other dataset. Can we list the genealogy of each — its ranks all the way up the ladder? This helped us locate a node in a different area of the alternate tree and identified some basic differences between the datasets. It also listed all the species in one tree that resided at a different level in the other, allowing comparison between the datasets.
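The two queries above amount to a set difference keyed on (Latin name, rank), followed by a name lookup in the other dataset and a walk up the parent links. The sketch below shows one way this could be expressed in Python. The dictionary layout, the function names, and the toy records are all assumptions for illustration; in particular, the rank given to Cetacea in the second dataset is invented and is not taken from the contest data.

```python
# Each hypothetical dataset maps (latin_name, rank) -> parent key (None = root).
DATASET_A = {
    ("Animalia", "kingdom"): None,
    ("Chordata", "phylum"): ("Animalia", "kingdom"),
    ("Mammalia", "class"): ("Chordata", "phylum"),
    ("Cetacea", "order"): ("Mammalia", "class"),
    ("Delphinus delphis", "species"): ("Cetacea", "order"),
}
DATASET_B = {
    ("Animalia", "kingdom"): None,
    ("Chordata", "phylum"): ("Animalia", "kingdom"),
    ("Mammalia", "class"): ("Chordata", "phylum"),
    ("Cetacea", "suborder"): ("Mammalia", "class"),  # invented rank change
}

SKIPPED_RANKS = {"species", "subspecies"}


def only_in(first, second):
    """Keys present in `first` but not `second`, ignoring species/subspecies."""
    return {key for key in first
            if key not in second and key[1] not in SKIPPED_RANKS}


def relocated(first, second):
    """For names missing from `second` by (name, rank), list the ranks at
    which the same Latin name does appear in `second`."""
    ranks_by_name = {}
    for name, rank in second:
        ranks_by_name.setdefault(name, []).append(rank)
    return {key: ranks_by_name[key[0]]
            for key in only_in(first, second) if key[0] in ranks_by_name}


def lineage(dataset, key):
    """Walk the parent links from `key` up to the root."""
    path = []
    while key is not None:
        path.append(key)
        key = dataset[key]
    return path


print(only_in(DATASET_A, DATASET_B))
print(relocated(DATASET_A, DATASET_B))
print(lineage(DATASET_A, ("Cetacea", "order")))
```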

Questions Posed by the Contest Creators

The contest originators posed some questions representing typical user tasks. These helped define our approach, although our solution does not answer all of them. Some of these are listed here:

Considering both datasets, to what extent are the differences in the classifications due to differences in how animals are thought to be related? Are there other kinds of differences, and can you explain them? Possible approaches are:

Are the common names the same? Do they share a Latin name higher in the genealogy that splits into more trees in one than the other? Are there any clues when it splits? Are there common prefixes in their names?

Implications: Upon identifying a node in one tree, allow the user to locate the node in the other tree. Zoomology approaches this by linking the detail windows. Additional solutions to the locate task are being examined.

Considering one dataset or the other, can you say in how many different subtrees a particular common name (such as “dolphin” or “horse”) is used? How closely are these animals related? Are common names a good guide to understanding relationships?

Implication: Allow search by common name, with results highlighted in a recognizable manner. This has been designed but not yet implemented.
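Since this search is designed but not implemented, the following Python sketch is only one possible shape for it: it groups every common name containing a search term by its ancestor at a chosen depth, which gives the number of subtrees in which the name is used. The record layout, the function name `subtrees_using`, and the toy records are assumptions for this example.

```python
from collections import defaultdict

# Hypothetical records: (common name, lineage from the root down to the node).
RECORDS = [
    ("common dolphin", ("Animalia", "Chordata", "Mammalia", "Cetacea")),
    ("river dolphin", ("Animalia", "Chordata", "Mammalia", "Cetacea")),
    ("dolphinfish", ("Animalia", "Chordata", "Actinopterygii", "Perciformes")),
    ("horse", ("Animalia", "Chordata", "Mammalia", "Perissodactyla")),
]


def subtrees_using(term, records, depth):
    """Group matching common names by their ancestor at the given depth
    (0 = root). The number of groups is the number of subtrees."""
    groups = defaultdict(list)
    for common_name, path in records:
        if term.lower() in common_name.lower() and len(path) > depth:
            groups[path[depth]].append(common_name)
    return dict(groups)


by_class = subtrees_using("dolphin", RECORDS, depth=2)
print(len(by_class), by_class)
```

The groups returned here are what a highlighting step would mark in the overview; a larger number of groups at a shallow depth suggests the common name is a poor guide to relatedness.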

Some scientific names are maddeningly similar. For example, Spirulida and Spirurida are two nodes in two different subtrees. A user types in the wrong one. What kind of feedback does your tool provide to alert the user quickly? Do the names have the same rank? Is the typed name in the expected part of the tree?

Implication: When users type a Latin name for search, offer predictive help so misspellings may be avoided. Text-based queries would be useful here but are outside the scope of this project.
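Although text-based queries are outside the scope of this project, the kind of feedback described above can be sketched with Python's standard `difflib` module, which ranks candidate strings by similarity. The name table, the function `similar_names`, and the cutoff of 0.8 are assumptions chosen for illustration, not part of Zoomology.

```python
import difflib

# Hypothetical lookup table: Latin name -> rank.
KNOWN_NAMES = {
    "Spirulida": "order",
    "Spirurida": "order",
    "Mammalia": "class",
}


def similar_names(typed, known, cutoff=0.8):
    """Return (name, rank) pairs for known names that closely resemble
    `typed`, excluding an exact match, so the user can be warned."""
    matches = difflib.get_close_matches(typed, list(known), n=5, cutoff=cutoff)
    return [(name, known[name]) for name in matches if name != typed]


warnings = similar_names("Spirulida", KNOWN_NAMES)
print(warnings)
```

Returning the rank alongside each near match lets the interface answer the question of whether the two names share a rank; reporting each name's position in the tree would require the lineage as well.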

Queries by Popular Taxonomy of Visualization Tasks

The following commonly identified visualization tasks from the Wehrend and Lewis taxonomy were selected as relevant to the classification data [13]:

Identify: questions could easily arise from the structure represented; this task calls for good support for discovery (browsing). Example: From what Class is this huge fanout derived?

Browse: (See Identify, above.)

Locate: critical to the user. Example: Where is such-and-such (rank, common name, Latin name)?

Compare Between: important because more than one dataset with like structures is present. Example: Where is a species located in each set (common name, Latin name)? In the same place (rank)?

Associate: an important task. Example: For any given species, can the common name indicate where ranks show the least physical differentiation (e.g., is a species called "something snake" because it looks like a reptile, even though it is an amphibian)?

Correlation: possible correlation questions can arise. Example: If the common name of one species is found under an unlikely genus, will other species with similar common names be likewise allocated?

Rank: an important task that is inherently defined in our visualization, since the data represents a hierarchical structure. Example: Is Mammalia a part of x Class or x Class? How many classes are found in the Phylum x?

Distinguish: a challenge with this huge dataset, since users should be able to pick out individual items. Example: Can I distinguish between two adjacent suborders? Will I be able to select one or two individual species for comparison?

Categorize, Cluster, Compare Within and Monitor were not considered important due to the limited number of variables and the structure of the dataset.

Evaluating Current Visualization Models

We applied the contest questions and our user tasks to a number of existing visualization models in order to evaluate their effectiveness in this domain. These included Treemaps, Radial/SunBurst Models, the Node Link Model, and the Hyperbolic Browser.

Treemaps

In the Treemap model [7], nodes are displayed as squares or rectangles where the size of the figure is proportional to a quantitative property such as number of children, and the color represents a nominal or ordinal property such as rank. One strength of this model is its efficient use of space. However, nodes are not easily arranged in a manner that reveals the underlying structure. We found treemaps inadequate for representing structure beyond a few levels. Without connecting lines or a consistent spatial placement, it was difficult to establish relationships between nodes. We concluded that treemaps might be a good choice to represent global changes between trees but found them wanting at pinpointing more precise areas of change.
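To make the size encoding concrete, here is a minimal Python sketch of a treemap layout, assuming the simple slice-and-dice variant in which the split direction alternates at each level. Which layout variant was examined in our evaluation is not stated here, so this choice, the `(name, size, children)` tuple format, and the toy tree are assumptions.

```python
def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Lay out node = (name, size, children) inside the rectangle
    (x, y, w, h). Returns a list of (name, x, y, w, h) in depth-first order."""
    if out is None:
        out = []
    name, _size, children = node
    out.append((name, x, y, w, h))
    total = sum(child[1] for child in children)
    offset = 0.0  # fraction of the parent already handed out
    for child in children:
        share = child[1] / total  # area is proportional to child size
        if depth % 2 == 0:
            # Even depth: children sit side by side along the x axis.
            slice_and_dice(child, x + offset * w, y, w * share, h,
                           depth + 1, out)
        else:
            # Odd depth: children are stacked along the y axis.
            slice_and_dice(child, x, y + offset * h, w, h * share,
                           depth + 1, out)
        offset += share
    return out


# Hypothetical tree; each size is the number of leaves below the node.
TREE = ("root", 8, [
    ("A", 4, [("A1", 1, []), ("A2", 3, [])]),
    ("B", 4, []),
])

layout = slice_and_dice(TREE, 0.0, 0.0, 8.0, 4.0)
for rect in layout:
    print(rect)
```

The output contains only rectangles. Nesting is implied by containment, which is consistent with the difficulty we had in reading parent/child relationships beyond a few levels.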

Radial/Sunburst Models

While similar to a treemap in its spatial representation of a quantitative property such as density, Sunburst’s [12] advantage is that successive levels radiate outward in an easily recognizable fashion. As each level expands outward to encircle its predecessor, it encompasses more area. This maps well to a tree structure such as classifications, where the number of nodes increases with each subsequent level. Our overview uses the same space-filling technique as Sunburst and shares many of its advantages. For instance, an increase in area represents an increase in the number of nodes. It also shares its weaknesses: as the number of nodes expands at lower levels, the lack of available space results in pixel-thin lines that actually represent large numbers of nodes.
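The space-filling idea can be sketched as an angular allocation in which each node receives a share of its parent's arc proportional to the number of leaves beneath it. The Python below is an illustrative sketch under that assumption and is not the layout code used in our overview; the `(name, children)` tuple format and the toy tree are hypothetical.

```python
def leaf_count(node):
    """Number of leaves under node = (name, children)."""
    _name, children = node
    return 1 if not children else sum(leaf_count(c) for c in children)


def sunburst(node, start=0.0, end=360.0, level=0, out=None):
    """Assign each node an arc (start, end) in degrees on ring `level`.
    Returns a list of (name, level, start, end) in depth-first order."""
    if out is None:
        out = []
    _name, children = node
    out.append((node[0], level, start, end))
    total = leaf_count(node)
    angle = start
    for child in children:
        # A child's arc is its share of the parent's leaves.
        span = (end - start) * leaf_count(child) / total
        sunburst(child, angle, angle + span, level + 1, out)
        angle += span
    return out


# Hypothetical tree with four leaves.
TREE = ("root", [
    ("A", [("A1", []), ("A2", []), ("A3", [])]),
    ("B", [("B1", [])]),
])

arcs = sunburst(TREE)
for arc in arcs:
    print(arc)
```

The same arithmetic shows where the pixel-thin lines come from: if all 154,415 species were equally weighted leaves, each would receive 360 / 154,415 degrees, or roughly 0.0023 degrees of arc.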

Node-Link Model

Node-link trees excel at representing structure, especially at higher levels of the tree. Lines connecting parent and child nodes clearly show the relationship between them. However, the one-to-one relationship implied by these links doesn’t scale well and can result in a visualization that resembles a ball of string. Attempts to encode rankings by color might result in a confusing mass of colored confetti. Despite these disadvantages, Zoomology includes a second overview in node-link format to accommodate biologists who commonly view trees in this manner [2].

Radial Node-link Hybrid

We also explored a hybrid radial node-link model that combines the space-saving advantages of a radial layout with an intrinsic mapping of levels to concentric rings. The greatest challenge was determining what kind of algorithm could prevent links within a wide arc from crossing over the nodes of their siblings.

Hyperbolic Browser

Since a hyperbolic browser [8] represents focus and context together, it can spotlight an area in detail while retaining the general structure of a tree. In limited space, this provides the capability to show more nodes than a traditional node link model. However, as one changes focus using hyperbolic browsers, surrounding nodes shift in both position and size. Thus, it is easy to lose track of the underlying structure, a problem that is intensified by the large number of levels in the classification trees. In deep trees such as these, it is also difficult to represent the full hierarchy because levels far away from the focus might be rendered too small to see or not be drawn at all.

Encoding Schemes

We considered the following encoding schemes for representing structure and attributes:

Position

In the zoom view, which shows detail, spatial encoding distinguishes ranks. Root nodes enclose child nodes, which then enclose their own children. As the user zooms in to the next level, the current level expands until it is larger than the screen, revealing detail of the next level. Zooming out shrinks the current level, and the previous level narrows in towards the center of the screen to appear in detail. The location of nodes within a given rank does not map to anything.

In the overview, proximity indicates a parent/child relationship, and width maps to the number of progeny. Horizontal location does not map to anything. (9)

Color

We chose color as the primary encoding for rank. The challenge was in representing up to twenty different levels with colors distinct enough to be easily distinguished. To do this, we gave all of the rankings within a major rank, such as order or class, the same base color and varied only their tints. Thus, all of the subrankings at a particular rank use shades of a single color.
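One way to realize this scheme is to assign each major rank a hue and to derive tints for its subrankings by raising lightness. The Python sketch below does this with the standard `colorsys` module. The names of the seven major ranks, the even hue spacing, and the lightness range of 0.5 to 0.8 are assumptions for illustration, not the palette actually used in Zoomology.

```python
import colorsys

# Assumed names for the seven major ranks (standard Linnaean ranks).
MAJOR_RANKS = ["kingdom", "phylum", "class", "order",
               "family", "genus", "species"]


def rank_color(major_rank, sub_index=0, sub_count=1):
    """Return an (r, g, b) tuple in 0..255 for a ranking.

    The major rank picks the hue; sub_index (0 = the major rank itself)
    picks a progressively lighter tint of that hue."""
    hue = MAJOR_RANKS.index(major_rank) / len(MAJOR_RANKS)
    # Lightness runs from 0.5 (pure color) up to 0.8 (palest tint).
    lightness = 0.5 + 0.3 * sub_index / max(sub_count - 1, 1)
    r, g, b = colorsys.hls_to_rgb(hue, lightness, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))


# Three hypothetical rankings sharing the "order" base color.
order_colors = [rank_color("order", i, 3) for i in range(3)]
print(order_colors)
```

Because only lightness changes within a major rank, every subranking stays recognizably in its parent's color family while remaining distinguishable from its neighbors.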

Since children are usually no more than one or two ranks down from their parent, we determined that only a few colors would be required at any given time. This eliminated the lack of clarity that can result when too many colors compete for one’s attention. Thus, a traditional rainbow scale with in-between tints works without causing undue visual clutter.

We determined that the best background color would be dark gray. Black would render the overview visualization too vibrant and distracting. A white background would not allow us to use white and light tints effectively for encoding differences.

We have yet to determine the best technique for displaying differences. We currently use white to represent change since it is easily distinguished from the colored background. However, the overview is a representation of the combined sets, and this model does not allow us to show nodes that exist only in dataset A separately from those that are only in dataset B.

Size

In the overview, size is used to represent relative number of cases within each ranking. In the detail view, size is not relevant.

Shape

We selected color over shape to encode levels since we believed recognition would be more rapid [3]. Ovals represent nodes in the detail view. In the overview and legend, shape distinguishes the navigation paths of the different datasets.