---
title: "Tree space analysis"
author: "[Martin R. Smith](https://smithlabdurham.github.io/)"
output:
rmarkdown::html_vignette:
fig_width: 6
fig_height: 4
bibliography: ../inst/REFERENCES.bib
csl: https://raw.githubusercontent.com/citation-style-language/styles/master/apa-old-doi-prefix.csl
vignette: >
%\VignetteIndexEntry{Tree space analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r secret-header, echo=FALSE}
set.seed(0)
protoclust <- if (requireNamespace("protoclust", quietly = TRUE)) {
protoclust::protoclust
} else {
hclust
}
```
It can be instructive to visualize the distribution of trees by mapping their
spatial relationships.
This can be a helpful means to address whether discrete islands exist in a
sample of trees, or whether analytical runs have converged.
Such analysis is relatively simple to conduct, but a few common oversights
can mislead interpretation.
## Tree space analysis via user interface
Tree space mapping and analysis is made simple with the Shiny app included in
the "TreeDist" R package. Simply install [R](https://www.r-project.org/) or
[RStudio](https://posit.co/), then copy the code below into the R command
line:
```r
install.packages("TreeDist")
TreeDist::MapTrees()
```
This will allow you to conduct and evaluate basic tree space mappings
from tree lists saved in most common file formats; see an
[outline of the basic functionality](https://ms609.github.io/TreeDist/reference/MapTrees.html).
To avoid misinterpreting tree space, it's worth having a broad idea of what
an analysis involves, and some potential pitfalls.
## Avoiding common pitfalls in tree space analysis
Here's an example analysis of a series of 200 trees from an ordered list.
The list corresponds to a mixed-base representation of trees (see
`TreeTools::as.TreeNumber()`), so is expected to contain some structure as we
jump from one "class" of tree to another.
Let's see whether we can visualize and corroborate this structure.
First we'll generate the trees, and load some colours with which we might
identify them.
```{r generate-trees}
library("TreeTools", quietly = TRUE)
treeNumbers <- c(1:220)
trees <- as.phylo(treeNumbers, 8)
spectrum <- hcl.colors(220, "plasma")
treeCols <- spectrum[treeNumbers]
```
### Using a suitable distance metric
Now we need to calculate the distance between each pair of trees in our list.
The choice of distance metric is important [@SmithSpace].
The widely used Robinsonâ€“Foulds distance is, unfortunately, unsuitable for
tree space analysis.
The clustering information distance [@SmithDist] is a reliable alternative
that is fast to calculate:
```{r calculate-distances, message=FALSE}
library("TreeDist")
distances <- ClusteringInfoDistance(trees)
```
The reader is encouraged to repeat the exercise with other distances:
```r
distances <- RobinsonFoulds(trees)
distances <- PhylogeneticInfoDistance(trees)
distances <- as.dist(Quartet::QuartetDivergence(
Quartet::ManyToManyQuartetAgreement(trees), similarity = FALSE))
```
### Mapping distances
Then we need to reduce the dimensionality of these distances.
We'll start out with a 12-dimensional mapping; if needed, we can always
drop higher dimensions.
Principal coordinates analysis is quick and performs very well:
```{r map}
mapping <- cmdscale(distances, k = 12)
```
Alternative mapping methods do exist, and sometimes give slightly better
mappings. `isoMDS()` performs non-metric multidimensional scaling (MDS)
with the Kruskal-1 stress function [@Kruskal1964]:
```r
kruskal <- MASS::isoMDS(distances, k = 12)
mapping <- kruskal$points
```
whereas `sammon()`, one of many metric MDS methods, uses Sammon's stress
function [@Sammon1969]:
```
sammon <- MASS::sammon(distances, k = 12)
mapping <- sammon$points
```
That's a good start. It is tempting to plot the first two dimensions
arising from this mapping and be done:
```{r plot-mapping-2d, fig.asp = 1, fig.width = 3, fig.align="center"}
par(mar = rep(0, 4))
plot(mapping,
asp = 1, # Preserve aspect ratio - do not distort distances
ann = FALSE, axes = FALSE, # Don't label axes: dimensions are meaningless
col = treeCols, pch = 16
)
```
### Identifying clusters
A quick visual inspection suggests at least two clusters, with the possibility
of further subdivision of the brighter trees.
But visual inspection can be
highly misleading [@SmithSpace]. We must take a statistical approach.
A combination of partitioning around medoids and hierarchical clustering with
minimax linkage will typically find a clustering solution that is close to
optimal, if one exists [@SmithSpace]; if suitably initialized, k-means++
clustering [@Arthur2007] can also be worthwhile.
```{r clustering, fig.align="center"}
possibleClusters <- 2:10
# Partitioning around medoids
pamClusters <- lapply(possibleClusters, function(k) cluster::pam(distances, k = k))
pamSils <- vapply(pamClusters, function(pamCluster) {
mean(cluster::silhouette(pamCluster)[, 3])
}, double(1))
bestPam <- which.max(pamSils)
pamSil <- pamSils[bestPam]
pamCluster <- pamClusters[[bestPam]]$cluster
# Hierarchical clustering
hTree <- protoclust(distances)
hClusters <- lapply(possibleClusters, function(k) cutree(hTree, k = k))
hSils <- vapply(hClusters, function(hCluster) {
mean(cluster::silhouette(hCluster, distances)[, 3])
}, double(1))
bestH <- which.max(hSils)
hSil <- hSils[bestH]
hCluster <- hClusters[[bestH]]
# k-means++ clustering
kClusters <- lapply(possibleClusters, function(k) KMeansPP(distances, k = k))
kSils <- vapply(kClusters, function(kCluster) {
mean(cluster::silhouette(kCluster$cluster, distances)[, 3])
}, double(1))
bestK <- which.max(kSils)
kSil <- kSils[bestK]
kCluster <- kClusters[[bestK]]$cluster
plot(pamSils ~ possibleClusters,
xlab = "Number of clusters", ylab = "Silhouette coefficient",
ylim = range(c(pamSils, hSils)))
points(hSils ~ possibleClusters, pch = 2, col = 2)
points(kSils ~ possibleClusters, pch = 3, col = 3)
legend("topright", c("PAM", "Hierarchical", "k-means++"),
pch = 1:3, col = 1:3)
```
Silhouette coefficients of < 0.25 suggest that structure is not meaningful;
\> 0.5 denotes good evidence of clustering, and \> 0.7 strong evidence
[@Kaufman1990].
The evidence for the visually apparent clustering is not as strong as it first
appears. Let's explore our two-cluster hierarchical clustering solution anyway.
```{r chosen-cluster}
nClusters <- 2
whichResult <- match(nClusters, possibleClusters)
cluster <- hClusters[[whichResult]]
```
We can visualize the clustering solution as a tree:
```{r h-tree, fig.align="center"}
class(hTree) <- "hclust"
par(mar = c(0, 0, 0, 0))
plot(hTree, labels = FALSE, main = "")
points(seq_along(trees), rep(1, length(trees)), pch = 16,
col = spectrum[hTree$order])
```
Another thing we may wish to do is to take the consensus of each cluster:
```{r cluster-consensus, fig.align="center"}
par(mfrow = c(1, 2), mar = rep(0.2, 4))
col1 <- spectrum[mean(treeNumbers[cluster == 1])]
col2 <- spectrum[mean(treeNumbers[cluster == 2])]
plot(consensus(trees[cluster == 1], p = 0.5),
edge.color = col1, edge.width = 2, tip.color = col1)
plot(consensus(trees[cluster == 2], p = 0.5),
edge.color = col2, edge.width = 2, tip.color = col2)
```
Here, we learn that the two clusters are distinguished by the position of `t7`.
### Identifying islands
Besides clustering, we can also define 'islands' in tree space that are
separated by a 'moat', such that all trees on one island are separated from
all trees on another by at least a certain distance [@Silva2021].
```{r island-id, fig.asp = 1, fig.width = 3, fig.align="center"}
par(mar = rep(0, 4))
# set a threshold corresponding to the width of the "moat" between islands
threshold <- 1.8
island <- Islands(distances, threshold)
# See how many trees are on each island
table(island)
# Let's ignore the small islands for now
largeIsle <- Islands(distances, threshold, smallest = 5)
# Colour trees according to their island
plot(mapping,
asp = 1, # Preserve aspect ratio - do not distort distances
ann = FALSE, axes = FALSE, # Don't label axes: dimensions are meaningless
col = ifelse(is.na(largeIsle), "grey", largeIsle + 1),
pch = 16
)
```
Let's view the consensus of each large island cluster:
```{r island-consensus, fig.align="center"}
par(mfrow = c(1, 2), mar = rep(0.2, 4))
plot(consensus(trees[!is.na(largeIsle) & largeIsle == 1], p = 0.5),
edge.color = 2, edge.width = 2, tip.color = 2)
plot(consensus(trees[!is.na(largeIsle) & largeIsle == 2], p = 0.5),
edge.color = 3, edge.width = 2, tip.color = 3)
```
### Validating a mapping
Now let's evaluate whether our map of tree space is representative.
First we want to know how many dimensions are necessary to adequately
represent the true distances between trees. We hope for a trustworthiness Ă—
continuity score of \> 0.9 for a usable mapping, or \> 0.95 for a good one.
```{r how-many-dims, fig.align="center"}
txc <- vapply(seq_len(ncol(mapping)), function(k) {
newDist <- dist(mapping[, seq_len(k)])
MappingQuality(distances, newDist, 10)["TxC"]
}, 0)
plot(txc, xlab = "Dimension")
abline(h = 0.9, lty = 2)
```
We are going to need at least five dimensions to adequately represent the
distances between trees.
To help establish visually what structures are more likely to be genuine,
we might also choose to calculate a minimum spanning tree:
```{r calculate-MST}
mstEnds <- MSTEdges(distances)
```
Let's plot the first five dimensions of our tree space, highlighting the convex
hulls of our clusters:
```{r plot-mapping-5d, fig.asp = 1, fig.align="center"}
nDim <- which.max(txc > 0.9)
plotSeq <- matrix(0, nDim, nDim)
plotSeq[upper.tri(plotSeq)] <- seq_len(nDim * (nDim - 1) / 2)
plotSeq <- t(plotSeq[-nDim, -1])
plotSeq[nDim * 1:3] <- (nDim * (nDim - 1) / 2) + 1:3
layout(plotSeq)
par(mar = rep(0.1, 4))
for (i in 2:nDim) for (j in seq_len(i - 1)) {
# Set up blank plot
plot(mapping[, j], mapping[, i], ann = FALSE, axes = FALSE, frame.plot = TRUE,
type = "n", asp = 1, xlim = range(mapping), ylim = range(mapping))
# Plot MST
MSTSegments(mapping[, c(j, i)], mstEnds,
col = StrainCol(distances, mapping[, c(j, i)]))
# Add points
points(mapping[, j], mapping[, i], pch = 16, col = treeCols)
# Mark clusters
for (clI in unique(cluster)) {
inCluster <- cluster == clI
clusterX <- mapping[inCluster, j]
clusterY <- mapping[inCluster, i]
hull <- chull(clusterX, clusterY)
polygon(clusterX[hull], clusterY[hull], lty = 1, lwd = 2,
border = "#54de25bb")
text(mean(clusterX), mean(clusterY), clI, col = "#54de25bb", font = 2)
}
}
# Annotate dimensions
plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
text(0, 0, "Dimension 2")
plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
text(0, 0, "Dimension 3")
plot(0, 0, type = "n", ann = FALSE, axes = FALSE)
text(0, 0, "Dimension 4")
```
Our clusters, so distinct in dimension 1, overlap strongly in every other
dimension. The fact that the minimum spanning tree moves between clusters also
underlines the fact that they are not as well defined as they appear by eye.
Note that cluster membership, as well as the precise shape of tree space,
is a function of the tree distance metric. The phylogenetic information
distance recovers a different pair of clusters, which may not correspond to
those that are most apparent from a simple visual inspection of the
two-dimensional tree space plot:
```{r pid, fig.asp = 1, fig.width = 4, fig.align = "center", echo = FALSE, message = FALSE}
library("TreeDist")
pid_distances <- PhylogeneticInfoDistance(trees)
pid_mapping <- cmdscale(pid_distances, k = 6)
pid_cluster <- cutree(protoclust(pid_distances), k = 2)
par(mar = rep(0, 4))
plot(pid_mapping, ann = FALSE, axes = FALSE, asp = 1,
col = treeCols, pch = 16)
MSTEdges(pid_distances, TRUE, pid_mapping[, 1], pid_mapping[, 2],
col = "#bbbbbb", lty = 1)
pid_clusters <- seq_along(unique(pid_cluster))
for (clI in pid_clusters) {
inCluster <- pid_cluster == clI
clusterX <- pid_mapping[inCluster, 1]
clusterY <- pid_mapping[inCluster, 2]
hull <- chull(clusterX, clusterY)
polygon(clusterX[hull], clusterY[hull], lty = 1, lwd = 2,
border = "#54de25bb")
}
```
### Comparing cluster size
It is tempting to compare the size of clusters by calculating the area of
convex hulls on a two-dimensional mapping. However, mapped areas do [not
necessarily correspond](https://xkcd.com/2489/) to true hypervolumes.
Accuracy may be improved by comparing higher dimensions of projections using
the "[hypervolume](https://www.benjaminblonder.org/hypervolume_faq.html)"
package, though the same considerations apply [@Blonder2018].
Interpretation of overlap statistics is detailed in @Mammola2019.
```{r hypervolume, message = FALSE}
hypervolumeInstalled <- requireNamespace("hypervolume", quietly = TRUE)
if (hypervolumeInstalled) {
library("hypervolume")
hv1 <- hypervolume_gaussian(pid_mapping[pid_cluster == 1, 1:3],
verbose = FALSE)
hv2 <- hypervolume_gaussian(pid_mapping[pid_cluster == 2, 1:3],
verbose = FALSE)
hv_dist <- hypervolume_distance(hv1, hv2)
capture.output(
hyperset <- hypervolume_set(hv1, hv2, verbose = FALSE,
check.memory = FALSE)
) -> XX_VerboseNotRespected
hv_overlap <- hypervolume_overlap_statistics(hyperset)
hv_dist
hv_overlap
} else {
print("Install the 'hypervolume' package to run this example")
}
```
If the objective is to quantify the spread of different clusters, other metrics
may be easier to interpret than the clusters' hypervolume (e.g. @Smith2022).
The divergence of outlying points can be measures using the sum of ranges:
```{r sum-of-ranges}
SumOfRanges(pid_mapping, pid_cluster)
```
The overall size of a cluster can be measured using the sum of variances,
or the mean distance from the centroid or median:
```{r cluster-size}
SumOfVariances(pid_mapping, pid_cluster)
MeanCentroidDistance(pid_mapping, pid_cluster)
DistanceFromMedian(pid_mapping, pid_cluster)
```
The density of points within a cluster can be measured using the mean nearest-neighbour distance or the mean minimum spanning tree edge length:
```{r cluster-density}
MeanNN(pid_mapping, pid_cluster)
MeanMSTEdge(pid_mapping, pid_cluster)
```
## Self-organizing maps
An alternative approach to visualizing tree space is to create emergent
self-organizing maps [@Kohonen1982;@Ultsch2003;@Thrun2016],
which map high-dimensional data into two dimensions,
then add a third dimension to indicate distance between data points:
nearby points occur in valleys, and are separated by ridges from more distant
data points.
```{r umatrix, fig.asp = 1, fig.align = "center"}
umatrixInstalled <- requireNamespace("Umatrix", quietly = TRUE)
if (umatrixInstalled) {
map <- Umatrix::esomTrain(as.matrix(distances), Key = seq_along(trees),
Epochs = 5, # Increase for better results
Lines = 42,
Columns = 42,
Toroid = FALSE)
Umatrix::plotMatrix(Matrix = map$Umatrix,
Toroid = FALSE, FixedRatio = TRUE,
TransparentContours = FALSE, Clean = TRUE) +
ggplot2::geom_point(data = data.frame(x = map$BestMatches[, 3],
y = map$BestMatches[, 2]),
shape = 19, color = treeCols, size = 2)
} else {
message("Install the 'Umatrix' package to run this example")
}
```
## What next?
You may wish to:
- [Analyse landscapes](landscapes.html) of phylogenetic trees
- [Provide context](using-distances.html) for tree distances
- Review [available distance measures](https://ms609.github.io/TreeDist/index.html)
and the corresponding [functions](https://ms609.github.io/TreeDist/reference/index.html#section-tree-distance-measures)
- [Interpret tree distance metrics](https://ms609.github.io/TreeDistData/articles/09-expected-similarity.html)
- Compare the distribution of different [sets of trees](compare-treesets.html)
# References