Introduction

This vignette gives you a quick introduction to data.tree applications. We took care to keep the examples simple enough for non-specialists to follow. The price, of course, is that they are often simpler than real-life applications.

If you are using data.tree for things not listed here, and if you believe this is of general interest, then please do drop us a note, so we can include your application in a future version of this vignette.

World Population TreeMap (visualization)

This example is inspired by the examples of the treemap package.

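You’ll learn how to convert a data.frame to a data.tree structure, navigate the tree, aggregate and cumulate attributes, prune the tree, and plot the result.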

Original Example, to be improved

The original example visualizes the world population as a tree map.

library(treemap)
data(GNI2014)
treemap(GNI2014,
       index=c("continent", "iso3"),
       vSize="population",
       vColor="GNI",
       type="value")

As there are many countries, the chart gets cluttered with many very small boxes. In this example, we will limit the number of countries and sum the remaining population in a catch-all country called “Other”.

We use data.tree to do this aggregation.

Convert from data.frame

First, let’s convert the population data into a data.tree structure:

library(data.tree)
GNI2014$continent <- as.character(GNI2014$continent)
GNI2014$pathString <- paste("world", GNI2014$continent, GNI2014$country, sep = "/")
tree <- as.Node(GNI2014[,])
print(tree, pruneMethod = "dist", limit = 20)
##                        levelName
## 1  world                        
## 2   ¦--North America            
## 3   ¦   ¦--Bermuda              
## 4   ¦   ¦--United States        
## 5   ¦   °--... 22 nodes w/ 0 sub
## 6   ¦--Europe                   
## 7   ¦   ¦--Norway               
## 8   ¦   ¦--Switzerland          
## 9   ¦   °--... 39 nodes w/ 0 sub
## 10  ¦--Asia                     
## 11  ¦   ¦--Qatar                
## 12  ¦   ¦--Macao SAR, China     
## 13  ¦   °--... 45 nodes w/ 0 sub
## 14  ¦--Oceania                  
## 15  ¦   ¦--Australia            
## 16  ¦   ¦--New Zealand          
## 17  ¦   °--... 11 nodes w/ 0 sub
## 18  ¦--South America            
## 19  ¦   ¦--Uruguay              
## 20  ¦   ¦--Chile                
## 21  ¦   °--... 10 nodes w/ 0 sub
## 22  ¦--Seven seas (open ocean)  
## 23  ¦   ¦--Seychelles           
## 24  ¦   ¦--Mauritius            
## 25  ¦   °--... 1 nodes w/ 0 sub 
## 26  °--Africa                   
## 27      °--... 48 nodes w/ 0 sub
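To see the path string mechanism in isolation, here is a minimal sketch on a toy data.frame. The data (toy, toyTree and all values) is invented for illustration and is not taken from GNI2014; the sketch assumes the default "/" path delimiter used above.

```r
library(data.tree)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  continent  = c("Europe", "Europe", "Asia"),
  country    = c("Norway", "Switzerland", "Qatar"),
  population = c(5, 8, 2),
  stringsAsFactors = FALSE
)

# Each row describes the path from the root to one leaf
toy$pathString <- paste("world", toy$continent, toy$country, sep = "/")

# as.Node splits pathString on "/" and stores the other columns as attributes
toyTree <- as.Node(toy)

toyTree$Europe$Norway$population   # attribute of a single leaf
names(toyTree$children)            # the continents found in the paths
```

Because the path is split on the delimiter, a country name that itself contains "/" would end up as an extra level, so such names would need to be cleaned first.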

We can also navigate the tree to find the population of a specific country. Luckily, RStudio is quite helpful with its code completion (use CTRL + SPACE):

tree$Europe$Switzerland$population
## [1] 7604467

Or, we can look at a sub-tree:

northAm <- tree$`North America`
Sort(northAm, "GNI", decreasing = TRUE)
print(northAm, "iso3", "population", "GNI", limit = 12)
##                    levelName iso3 population    GNI
## 1  North America                          NA     NA
## 2   ¦--Bermuda                BMU      67837 106140
## 3   ¦--United States          USA  313973000  55200
## 4   ¦--Canada                 CAN   33487208  51630
## 5   ¦--Bahamas, The           BHS     309156  20980
## 6   ¦--Trinidad and Tobago    TTO    1310000  20070
## 7   ¦--Puerto Rico            PRI    3971020  19310
## 8   ¦--Barbados               BRB     284589  15310
## 9   ¦--St. Kitts and Nevis    KNA      40131  14920
## 10  ¦--Antigua and Barbuda    ATG      85632  13300
## 11  ¦--Panama                 PAN    3360474  11130
## 12  °--... 14 nodes w/ 0 sub              NA     NA

Or, we can find out which country has the largest GNI:

maxGNI <- Aggregate(tree, "GNI", max)
#same thing, in a more traditional way:
maxGNI <- max(sapply(tree$leaves, function(x) x$GNI))

tree$Get("name", filterFun = function(x) x$isLeaf && x$GNI == maxGNI)
##   Bermuda 
## "Bermuda"

Aggregate and Cumulate

We aggregate the population. For non-leaves, this will recursively iterate through children, and cache the result in the population field.

tree$Do(function(x) {
  x$population <- Aggregate(node = x,
                            attribute = "population",
                            aggFun = sum)
}, traversal = "post-order")
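The effect of the post-order traversal is easier to see on a small tree. The following is a sketch on a hypothetical three-leaf tree (all names and numbers are invented): children are visited before their parent, so each parent can sum values that are already stored on its children.

```r
library(data.tree)

# Hypothetical toy tree, invented for illustration
world  <- Node$new("world")
europe <- world$AddChild("Europe")
asia   <- world$AddChild("Asia")

a <- europe$AddChild("A")
a$population <- 10
b <- europe$AddChild("B")
b$population <- 5
cc <- asia$AddChild("C")
cc$population <- 20

# Post-order: leaves first, then continents, then the root
world$Do(function(x) x$population <- Aggregate(x, "population", sum),
         traversal = "post-order")

europe$population   # sum of A and B
world$population    # sum of all leaves
```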

Next, we sort each node by population:

Sort(tree, attribute = "population", decreasing = TRUE, recursive = TRUE)

Finally, we cumulate among siblings, and store the running sum in an attribute called cumPop:

tree$Do(function(x) x$cumPop <- Cumulate(x, "population", sum))

The tree now looks like this:

print(tree, "population", "cumPop", pruneMethod = "dist", limit = 20)
##                        levelName population     cumPop
## 1  world                         6683146875 6683146875
## 2   ¦--Asia                      4033277009 4033277009
## 3   ¦   ¦--China                 1338612970 1338612970
## 4   ¦   ¦--India                 1166079220 2504692190
## 5   ¦   °--... 45 nodes w/ 0 sub         NA         NA
## 6   ¦--Africa                     962382035 4995659044
## 7   ¦   ¦--Nigeria                149229090  149229090
## 8   ¦   ¦--Ethiopia                85237338  234466428
## 9   ¦   °--... 46 nodes w/ 0 sub         NA         NA
## 10  ¦--Europe                     728669949 5724328993
## 11  ¦   ¦--Russian Federation     140041247  140041247
## 12  ¦   ¦--Germany                 82329758  222371005
## 13  ¦   °--... 39 nodes w/ 0 sub         NA         NA
## 14  ¦--North America              528748158 6253077151
## 15  ¦   ¦--United States          313973000  313973000
## 16  ¦   ¦--Mexico                 111211789  425184789
## 17  ¦   °--... 22 nodes w/ 0 sub         NA         NA
## 18  ¦--South America              394352338 6647429489
## 19  ¦   ¦--Brazil                 198739269  198739269
## 20  ¦   ¦--Colombia                45644023  244383292
## 21  ¦   °--... 10 nodes w/ 0 sub         NA         NA
## 22  ¦--Oceania                     33949312 6681378801
## 23  ¦   ¦--Australia               21262641   21262641
## 24  ¦   ¦--Papua New Guinea         6057263   27319904
## 25  ¦   °--... 11 nodes w/ 0 sub         NA         NA
## 26  °--Seven seas (open ocean)      1768074 6683146875
## 27      °--... 3 nodes w/ 0 sub          NA         NA
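As a quick sanity check, the cumPop column of the continents can be reproduced with a plain cumsum over the continent populations printed above. This is only a base R illustration of what cumulating among siblings means, not how data.tree computes it internally.

```r
# Continent populations as printed above, in sorted order
continentPop <- c(
  Asia                      = 4033277009,
  Africa                    = 962382035,
  Europe                    = 728669949,
  `North America`           = 528748158,
  `South America`           = 394352338,
  Oceania                   = 33949312,
  `Seven seas (open ocean)` = 1768074
)

# Running sum among siblings; the last value equals the world population
cumPop <- cumsum(continentPop)
cumPop
```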

Prune

The previous steps were done to define our threshold: big countries should be displayed, while small ones should be grouped together. This lets us define a pruning function that keeps at most 7 countries per continent, and that keeps a country only as long as the cumulative population of it and its larger siblings stays below 90% of the continent’s population. All remaining countries are pruned.

We would like to store the original number of countries for further use:

tree$Do(function(x) x$origCount <- x$count)

We are now ready to prune. This is done by defining a pruning function, returning ‘FALSE’ for all countries that should be combined:

myPruneFun <- function(x, cutoff = 0.9, maxCountries = 7) {
  if (isNotLeaf(x)) return (TRUE)
  if (x$position > maxCountries) return (FALSE)
  return (x$cumPop < (x$parent$population * cutoff))
}
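To make the two conditions concrete, here is a base R sketch that applies the same rule to a plain vector. The populations are hypothetical and assumed to be sorted in decreasing order, as the tree is after the Sort step.

```r
# Hypothetical populations of one continent, sorted in decreasing order
population <- c(A = 50, B = 30, C = 15, D = 5)

cutoff       <- 0.9
maxCountries <- 7

cumPop   <- cumsum(population)      # running sum among siblings
position <- seq_along(population)   # rank within the continent

# Same two conditions as in the pruning function, applied to the whole vector
keep <- position <= maxCountries & cumPop < sum(population) * cutoff

names(population)[keep]   # countries that stay in the tree
sum(population[!keep])    # population that would go into "Other"
```

With these numbers, C is dropped although it is only the third country, because adding it pushes the running sum past 90% of the total.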

We clone the tree, because we might want to play around with different parameters:

treeClone <- Clone(tree, pruneFun = myPruneFun)
print(treeClone$Oceania, "population", pruneMethod = "simple", limit = 20)
##              levelName population
## 1 Oceania                33949312
## 2  ¦--Australia          21262641
## 3  °--Papua New Guinea    6057263

Finally, we need to sum countries that we pruned away into a new “Other” node:

treeClone$Do(function(x) {
  # population of the continent that is no longer covered by its remaining children
  missingPop <- x$population - sum(sapply(x$children, function(child) child$population))
  other <- x$AddChild("Other")
  other$iso3 <- paste0("OTH(", x$origCount, ")")
  other$country <- "Other"
  other$continent <- x$name
  other$GNI <- 0
  other$population <- missingPop
},
# level 2 holds the continents
filterFun = function(x) x$level == 2
)

print(treeClone$Oceania, "population", pruneMethod = "simple", limit = 20)
##              levelName population
## 1 Oceania                33949312
## 2  ¦--Australia          21262641
## 3  ¦--Papua New Guinea    6057263
## 4  °--Other               6629408
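The population of the “Other” node can be verified by hand from the Oceania figures printed above. This is just the subtraction performed inside the function, written out in base R.

```r
# Figures for Oceania as printed above
continentPop <- 33949312
keptPop <- c(Australia = 21262641, `Papua New Guinea` = 6057263)

# Whatever is not covered by the kept countries goes into "Other"
otherPop <- continentPop - sum(keptPop)
otherPop
```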

Plot

Plotting the treemap

In order to plot the treemap, we need to convert the data.tree structure back to a data.frame:

df <- ToDataFrameTable(treeClone, "iso3", "country", "continent", "population", "GNI")

treemap(df,
        index=c("continent", "iso3"),
        vSize="population",
        vColor="GNI",
        type="value")

Plot as dendrogram

Just for fun, and for no reason other than to demonstrate conversion to dendrogram, we can plot this in a very unusual way:

plot(as.dendrogram(treeClone, heightAttribute = "population"))

Further developments

A natural next step is to aggregate the GNI as a weighted average. In particular, we should do this for the OTH catch-all countries that we add to the tree, which currently get a GNI of 0.
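One possible way to do this, sketched here in base R, is a population-weighted mean over the countries that were pruned away. The choice of population as the weight is an assumption, and the data frame below is hypothetical (values invented for illustration).

```r
# Hypothetical pruned countries of one continent (invented values)
pruned <- data.frame(
  population = c(4e6, 1e6, 5e5),
  GNI        = c(40000, 10000, 2000)
)

# Population-weighted average GNI for the catch-all "Other" node
otherGNI <- weighted.mean(pruned$GNI, w = pruned$population)
otherGNI
```

In the tree itself, this value would replace the `other$GNI <- 0` assignment, using the pruned children of each continent as input.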

Portfolio Breakdown (finance)

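In this example, we show how to display an investment portfolio as a hierarchic breakdown into asset classes. You’ll see how to convert a data.frame into a tree, aggregate weights and durations while reusing a traversal, and add default formatters for printing.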

Convert from data.frame

fileName <- system.file("extdata", "portfolio.csv", package="data.tree")
pfodf <- read.csv(fileName, stringsAsFactors = FALSE)
head(pfodf)
##           ISIN                                     Name Ccy Type Duration
## 1 LI0015327682          LGT Money Market Fund (CHF) - B CHF Fund       NA
## 2 LI0214880598        CS (Lie) Money Market Fund EUR EB EUR Fund       NA
## 3 LI0214880689        CS (Lie) Money Market Fund USD EB USD Fund       NA
## 4 LU0243957825    Invesco Euro Corporate Bond A EUR Acc EUR Fund     5.10
## 5 LU0408877412 JPM Euro Gov Sh. Duration Bd A (acc)-EUR EUR Fund     2.45
## 6 LU0376989207 Aberdeen Global Sel Emerg Mkt Bd A2 HEUR EUR Fund     6.80
##   Weight AssetCategory AssetClass        SubAssetClass
## 1  0.030          Cash        CHF                     
## 2  0.060          Cash        EUR                     
## 3  0.020          Cash        USD                     
## 4  0.120  Fixed Income        EUR Sov. and Corp. Bonds
## 5  0.065  Fixed Income        EUR Sov. and Corp. Bonds
## 6  0.030  Fixed Income        EUR       Em. Mkts Bonds

Let us convert the data.frame to a data.tree structure. Here, we again use the path string method. For other options, see ?as.Node.data.frame.

pfodf$pathString <- paste("portfolio", 
                          pfodf$AssetCategory, 
                          pfodf$AssetClass, 
                          pfodf$SubAssetClass, 
                          pfodf$ISIN, 
                          sep = "/")
pfo <- as.Node(pfodf)

Aggregate

To calculate the weight per asset class, we use the Aggregate method:

t <- Traverse(pfo, traversal = "post-order")
Do(t, function(x) x$Weight <- Aggregate(node = x, attribute = "Weight", aggFun = sum))

We now calculate WeightOfParent, i.e. the weight of each node relative to the weight of its parent:

Do(t, function(x) x$WeightOfParent <- x$Weight / x$parent$Weight)
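As a worked example, the three cash funds shown in the head of pfodf have weights 0.03, 0.06 and 0.02. Assuming these are the only cash positions (which matches the Cash total of 11% in the printed tree), the division can be reproduced in base R:

```r
# Weights of the Cash asset classes, taken from the data shown above
cash <- c(CHF = 0.03, EUR = 0.06, USD = 0.02)

# Weight of each child relative to its parent (Cash = 0.11)
weightOfParent <- cash / sum(cash)
round(100 * weightOfParent, 1)
```

The rounded percentages are 27.3, 54.5 and 18.2.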

Duration is a bit more complicated, as this is a concept that applies only to the fixed income asset class. Note that, in the second statement, we are reusing the traversal from above.

pfo$Do(function(x) x$Duration <- ifelse(is.null(x$Duration), 0, x$Duration), filterFun = isLeaf)
Do(t, function(x) x$Duration <- Aggregate(x, function(x) x$WeightOfParent * x$Duration, sum))
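For an inner node, the second statement amounts to a weighted sum of the children’s durations. The sketch below redoes this for the Fixed Income node, using the rounded values of its two children (EUR and USD) as they appear in the printed tree in the Print section; because the inputs are rounded, the result is only approximate.

```r
# Rounded values read off the printed tree (children of "Fixed Income")
weightOfParent <- c(EUR = 0.912, USD = 0.088)
duration       <- c(EUR = 3.1,   USD = 1.6)

# Duration of the parent: sum of children's durations, weighted by WeightOfParent
fixedIncomeDuration <- sum(weightOfParent * duration)
round(fixedIncomeDuration, 1)
```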

Formatters

We can add default formatters to our data.tree structure. Here, we add them to the root, but we might as well add them to any Node in the tree.

SetFormat(pfo, "WeightOfParent", function(x) FormatPercent(x, digits = 1))
SetFormat(pfo, "Weight", FormatPercent)

FormatDuration <- function(x) {
  if (x != 0) res <- FormatFixedDecimal(x, digits = 1)
  else res <- ""
  return (res)
}

SetFormat(pfo, "Duration", FormatDuration)

These formatter functions will be used when printing a data.tree structure.

Print

#Print
print(pfo, 
      "Weight", 
      "WeightOfParent",
      "Duration",
      filterFun = function(x) !x$isLeaf)
##                           levelName   Weight WeightOfParent Duration
## 1  portfolio                        100.00 %                     0.8
## 2   ¦--Cash                          11.00 %         11.0 %         
## 3   ¦   ¦--CHF                        3.00 %         27.3 %         
## 4   ¦   ¦--EUR                        6.00 %         54.5 %         
## 5   ¦   °--USD                        2.00 %         18.2 %         
## 6   ¦--Fixed Income                  28.50 %         28.5 %      3.0
## 7   ¦   ¦--EUR                       26.00 %         91.2 %      3.1
## 8   ¦   ¦   ¦--Sov. and Corp. Bonds  18.50 %         71.2 %      2.4
## 9   ¦   ¦   ¦--Em. Mkts Bonds         3.00 %         11.5 %      6.8
## 10  ¦   ¦   °--High Yield Bonds       4.50 %         17.3 %      3.4
## 11  ¦   °--USD                        2.50 %          8.8 %      1.6
## 12  ¦       °--High Yield Bonds       2.50 %        100.0 %      1.6
## 13  ¦--Equities                      40.00 %         40.0 %         
## 14  ¦   ¦--Switzerland                6.00 %         15.0 %         
## 15  ¦   ¦--Euroland                  14.50 %         36.2 %         
## 16  ¦   ¦--US                         8.10 %         20.2 %         
## 17  ¦   ¦--UK                         0.90 %          2.2 %         
## 18  ¦   ¦--Japan                      3.00 %          7.5 %         
## 19  ¦   ¦--Australia                  2.00 %          5.0 %         
## 20  ¦   °--Emerging Markets           5.50 %         13.7 %         
## 21  °--Alternative Investments       20.50 %         20.5 %         
## 22      ¦--Real Estate                5.50 %         26.8 %         
## 23      ¦   °--Eurozone               5.50 %        100.0 %         
## 24      ¦--Hedge Funds               10.50 %         51.2 %         
## 25      °--Commodities                4.50 %         22.0 %

ID3 (machine learning)

This example shows you how to train an ID3 classification tree with data.tree, and how to use the resulting tree for prediction.

Thanks a lot for all the helpful comments made by Holger von Jouanne-Diedrich.

Classification trees are very popular these days. If you have never come across them, here is the idea in brief. These models let you classify observations (e.g. things, outcomes) according to the observations’ qualities, called features. Essentially, all of these models consist of creating a tree, where each node acts as a router. In this example, the observations are mushrooms: you insert a mushroom at the root of the tree, and then, depending on its features (size, points, color, etc.), you follow along a different path, until a leaf node spits out the mushroom’s class, i.e. whether it’s edible or not.

There are two different steps involved in using such a model: training (i.e. constructing the tree), and predicting (i.e. using the tree to predict whether a given mushroom is poisonous). This example provides code to do both, using one of the very early algorithms to classify data according to discrete features: ID3. It lends itself well to this example, but of course today there are much more elaborate and refined algorithms available.

ID3 Introduction

During the prediction step, each node routes our mushroom according to a feature. But how do we choose the feature? Should we first separate our set according to color or size? That is where classification models differ.

In ID3, we pick, at each node, the feature with the highest Information Gain. In a nutshell, this is the feature that splits the sample into the purest possible subsets. For example, in the case of mushrooms, dots might be a more sensible feature than organic.

Purity and Entropy

# A dataset is pure if its last column (the class) holds a single distinct value
IsPure <- function(data) {
  length(unique(data[, ncol(data)])) == 1
}
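A short usage sketch may help here. The mushroom data below is hypothetical (column names and values are invented), and IsPure is repeated so that the snippet runs on its own; as in the function, the class is assumed to sit in the last column.

```r
# Repeated from above so that the snippet is self-contained
IsPure <- function(data) {
  length(unique(data[, ncol(data)])) == 1
}

# Hypothetical mushrooms; the last column holds the class
mushrooms <- data.frame(
  color     = c("red", "brown", "brown"),
  edibility = c("toxic", "edible", "edible"),
  stringsAsFactors = FALSE
)

IsPure(mushrooms)                               # mixed classes
IsPure(mushrooms[mushrooms$color == "brown", ]) # only edible mushrooms left
```

The full set is not pure, while the subset of brown mushrooms is, so splitting on color would end the recursion for that branch.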

The entropy is a measure of the purity of a dataset.

Entropy <- function( vls ) {