[en] An example of recursive partitioning with Titanic data

Nicolas Robette

2021-07-14

First steps

First, the necessary packages are loaded into memory.

library(tidyverse)  # data management
library(caret)  # confusion matrix
library(party)  # conditional inference random forests and trees
library(partykit)  # conditional inference trees
library(pROC)  # ROC curves
library(measures)  # performance measures
library(varImp)  # variable importance
library(pdp)  # partial dependence
library(vip)  # measure of interactions
library(moreparty)  # surrogate trees, accumulated local effects, etc.
library(RColorBrewer)  # color palettes
library(GDAtools)  # bivariate analysis

Next, we import the titanic data set from the moreparty package.

data(titanic)
str(titanic)
spec_tbl_df[,5] [1,309 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Survived: Factor w/ 2 levels "No","Yes": 2 2 1 1 1 2 2 1 2 1 ...
 $ Sex     : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
 $ Pclass  : Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age     : num [1:1309] 29 0.917 2 30 25 ...
 $ Embarked: Factor w/ 3 levels "Cherbourg","Queenstown",..: 3 3 3 3 3 3 3 3 3 1 ...

The data set contains 1309 cases, one categorical explained variable, Survived, which codes whether or not an individual survived the shipwreck, and four explanatory variables (three categorical and one continuous): gender, age, passenger class, and port of embarkation. We first examine the distribution of each variable.

summary(titanic)
 Survived      Sex      Pclass         Age                 Embarked  
 No :809   female:466   1st:323   Min.   : 0.1667   Cherbourg  :270  
 Yes:500   male  :843   2nd:277   1st Qu.:21.0000   Queenstown :123  
                        3rd:709   Median :28.0000   Southampton:914  
                                  Mean   :29.8811   NA's       :  2  
                                  3rd Qu.:39.0000                    
                                  Max.   :80.0000                    
                                  NA's   :263                        

The distribution of the explained variable is unbalanced: survivors are in the minority (500 of 1309 passengers, about 38%). In addition, some explanatory variables have missing values, most notably Age (263 missing values) and, marginally, Embarked (2 missing values).
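The shares behind these two remarks can be checked by hand. The following is a minimal sketch in base R that only reuses the counts printed by summary(titanic) above; the object names are illustrative.

```r
# Counts copied from the summary(titanic) output above
survived <- c(No = 809, Yes = 500)
n <- sum(survived)  # total number of passengers

# Share of each outcome: survivors are the minority class
round(survived / n, 3)

# Share of missing values for the two variables that have any
n_missing <- c(Age = 263, Embarked = 2)
round(n_missing / n, 3)
```

With the data loaded, prop.table(table(titanic$Survived)) and colMeans(is.na(titanic)) should give the same shares directly.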

We examine the bivariate statistical relationships between the variables.

BivariateAssoc(titanic$Survived, titanic[,-1])
$YX
  variable measure assoc p.value   criterion
1      Sex  cramer 0.527 0.00000 0.000000000
2   Pclass  cramer 0.313 0.00000 0.000000000
3 Embarked  cramer 0.184 0.00000 0.000000001
4      Age    eta2 0.002 0.26069 0.302040642

$XX
  variable1 variable2 measure assoc p.value    criterion
1    Pclass       Age    eta2 0.170 0.00000 0.0000000000
2    Pclass  Embarked  cramer 0.280 0.00000 0.0000000000
3       Sex    Pclass  cramer 0.125 0.00004 0.0000378611
4       Sex  Embarked  cramer 0.122 0.00006 0.0000563134
5       Age  Embarked    eta2 0.006 0.01789 0.0180491352
6       Sex       Age    eta2 0.003 0.03964 0.0404512887

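Survival is associated primarily with gender and secondarily with passenger class. The explanatory variables are only weakly to moderately related to each other; the strongest links involve Pclass, with Embarked (Cramer's V = 0.280) and with Age (eta2 = 0.170).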
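To make the Cramer's V measure more concrete, here is a minimal sketch that recomputes it for Sex and Survived with base R. The cell counts are an assumption: they are reconstructed from the marginal counts in summary(titanic) and from the rounded percentages in the catdesc output further down. The helper cramers_v is illustrative and not part of GDAtools.

```r
# Sex x Survived counts, reconstructed (assumption) from the marginal counts
# and the rounded percentages shown in the outputs of this document
tab <- matrix(c(127, 339,
                682, 161),
              nrow = 2, byrow = TRUE,
              dimnames = list(Sex = c("female", "male"),
                              Survived = c("No", "Yes")))

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v <- function(tab, correct = FALSE) {
  chi2 <- unname(chisq.test(tab, correct = correct)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

v_plain <- cramers_v(tab)                  # no continuity correction
v_yates <- cramers_v(tab, correct = TRUE)  # Yates correction (2 x 2 tables)
round(c(plain = v_plain, yates = v_yates), 3)
```

The two values round to 0.529 and 0.527, which coincide with the two figures reported for Sex in this document (0.527 by BivariateAssoc, 0.529 by catdesc). Whether a continuity correction is really what separates the two functions is an assumption, not something stated here.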

catdesc(titanic$Survived, titanic[,-1], min.phi=0.1, robust=FALSE)
$variables
  variable    measure association permutation.pvalue
1      Sex Cramer's V       0.529                 NA
2   Pclass Cramer's V       0.313                 NA
3 Embarked Cramer's V       0.184                 NA
4      Age       Eta2       0.003                 NA

$bylevel
$bylevel$No
$bylevel$No$categories
             categories pct.ycat.in.xcat pct.xcat.in.ycat pct.xcat.global
2              Sex.male            0.809            0.843           0.644
3            Pclass.3rd            0.745            0.653           0.542
6  Embarked.Southampton            0.667            0.754           0.699
8    Embarked.Cherbourg            0.444            0.148           0.207
9            Pclass.1st            0.381            0.152           0.247
11           Sex.female            0.273            0.157           0.356
      phi
2   0.529
3   0.283
6   0.152
8  -0.183
9  -0.279
11 -0.529

$bylevel$No$continuous.var
  variables mean.x.in.ycat mean.x.global sd.x.in.ycat sd.x.global   cor
1       Age             NA            NA           NA          NA 0.056


$bylevel$Yes
$bylevel$Yes$categories
             categories pct.ycat.in.xcat pct.xcat.in.ycat pct.xcat.global
1            Sex.female            0.727            0.678           0.356
4            Pclass.1st            0.619            0.400           0.247
5    Embarked.Cherbourg            0.556            0.301           0.207
7  Embarked.Southampton            0.333            0.610           0.699
10           Pclass.3rd            0.255            0.362           0.542
12             Sex.male            0.191            0.322           0.644
      phi
1   0.529
4   0.279
5   0.183
7  -0.152
10 -0.283
12 -0.529

$bylevel$Yes$continuous.var
  variables mean.x.in.ycat mean.x.global sd.x.in.ycat sd.x.global    cor
2       Age             NA            NA           NA          NA -0.056

Women, first-class passengers and those who boarded at Cherbourg are over-represented among the survivors. Men, third-class passengers and those who boarded at Southampton are over-represented among the non-survivors.
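The notion of over-representation can be illustrated on one row of the table, Sex.female among the survivors. This is a minimal sketch: the reading of the three percentage columns is inferred from their names and values, and the count of 339 surviving women is an assumption reconstructed from the rounded 0.727 share.

```r
# Counts for one category (Sex = female) and one outcome (Survived = "Yes")
n_total      <- 1309  # all passengers
n_female     <- 466   # women on board
n_yes        <- 500   # survivors
n_female_yes <- 339   # women who survived (reconstructed, see assumption)

pct_ycat_in_xcat <- n_female_yes / n_female  # share of survivors among women
pct_xcat_in_ycat <- n_female_yes / n_yes     # share of women among survivors
pct_xcat_global  <- n_female / n_total       # share of women overall

round(c(pct_ycat_in_xcat, pct_xcat_in_ycat, pct_xcat_global), 3)

# Over-representation: the category is more frequent among survivors
# than in the whole sample
pct_xcat_in_ycat > pct_xcat_global
```

Women make up about two thirds of the survivors but only about one third of all passengers, which is what the positive phi coefficient for Sex.female summarizes.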

Random forests involve an element of randomness (through resampling and the random drawing of candidate splitting variables), as do some interpretation tools (through variable permutations). Results may therefore differ slightly from one run to the next. To obtain identical results on every run and ensure reproducibility, set the random seed with the set.seed function.

set.seed(1912)
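As a small illustration of what the seed does, the sketch below draws a random permutation twice from the same seed. It uses only base R and is independent of the Titanic data; the object names are illustrative.

```r
set.seed(1912)
first_draw <- sample(10)   # random permutation of 1..10

set.seed(1912)             # resetting the seed replays the same random stream
second_draw <- sample(10)

identical(first_draw, second_draw)  # TRUE
```

The same logic applies to forests and permutation-based tools: the seed has to be set immediately before each random step that should be reproducible.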

Classification tree

To build a classification tree with the CTree conditional inference algorithm, we use the partykit package, which offers more flexibility than the party package, in particular for handling missing values.

The tree can be displayed in textual or graphical form.

arbre <- partykit::ctree(Survived~., data=titanic, control=partykit::ctree_control(minbucket=30, maxsurrogate=Inf, maxdepth=3))

print(arbre)

Model formula:
Survived ~ Sex + Pclass + Age + Embarked

Fitted party:
[1] root
|   [2] Sex in female
|   |   [3] Pclass in 1st, 2nd
|   |   |   [4] Pclass in 1st: Yes (n = 144, err = 3.5%)
|   |   |   [5] Pclass in 2nd: Yes (n = 106, err = 11.3%)
|   |   [6] Pclass in 3rd
|   |   |   [7] Embarked in Cherbourg, Queenstown: Yes (n = 87, err = 36.8%)
|   |   |   [8] Embarked in Southampton: No (n = 129, err = 39.5%)
|   [9] Sex in male
|   |   [10] Pclass in 1st
|   |   |   [11] Age <= 53: No (n = 148, err = 38.5%)
|   |   |   [12] Age > 53: No (n = 31, err = 12.9%)
|   |   [13] Pclass in 2nd, 3rd
|   |   |   [14] Age <= 9: No (n = 77, err = 35.1%)
|   |   |   [15] Age > 9: No (n = 587, err = 12.4%)

Number of inner nodes:    7
Number of terminal nodes: 8
plot(arbre)
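The printed tree already allows a rough assessment of the fit. The sketch below copies the terminal nodes from the output of print(arbre) and derives an approximate overall error rate. It is an in-sample figure, computed on the data used to grow the tree, and it is approximate because the node error rates are rounded in the printout. The data frame leaves is illustrative.

```r
# Terminal nodes copied from print(arbre): node id, predicted class,
# node size and within-node error rate (in %, rounded in the printout)
leaves <- data.frame(
  node = c(4, 5, 7, 8, 11, 12, 14, 15),
  pred = c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
  n    = c(144, 106, 87, 129, 148, 31, 77, 587),
  err  = c(3.5, 11.3, 36.8, 39.5, 38.5, 12.9, 35.1, 12.4)
)

# Approximate number of misclassified passengers per leaf
leaves$n_err <- round(leaves$n * leaves$err / 100)

# Overall training error versus always predicting the majority class "No"
tree_error     <- sum(leaves$n_err) / sum(leaves$n)
baseline_error <- 500 / 1309
round(c(tree = tree_error, baseline = baseline_error), 3)
```

The tree misclassifies roughly one passenger in five, against nearly two in five for the naive rule that predicts "No" for everyone. All male leaves predict "No", so the gain comes entirely from the splits among women.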