25.1 CART

The idea behind CART is to divide the predictor space (petal width and length) with a straight, axis-aligned line and fit simple models on either side of the line.

Mathematically, consider a predictor \(X_j\) and a split point \(s\) that divide the predictor space into two half-spaces \[L_{j, s} = \lbrace X \mid X_j \le s \rbrace \quad \text{and} \quad R_{j, s} = \lbrace X \mid X_j > s \rbrace.\]

The idea is to split on the variable, and at the location, which minimizes some loss function: \[\min_{j, s}\lbrace \text{loss}(R_{j, s}) + \text{loss}(L_{j, s}) \rbrace\] For classification problems, the easiest model is "majority wins" and the loss function is the number of misclassified observations. For regression, the easiest model is the mean model, and the loss function is squared error loss.
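To make the split search concrete, here is a toy sketch (this is not the algorithm used by the ctree function below, which relies on significance tests): for a single numeric predictor and a binary response, scan every candidate split point and return the one minimizing misclassification loss under a majority-wins model on each side. The function bestSplit and the simulated data are made up for illustration.

# Toy illustration: exhaustive search for the best split point s on one
# numeric predictor x, for a binary factor response y
bestSplit = function(x, y) {
  candidates = sort(unique(x))
  losses = sapply(candidates, function(s) {
    left  = y[x <= s]
    right = y[x >  s]
    # loss = number of observations not in the majority class on their side
    sum(left != names(which.max(table(left)))) +
      sum(right != names(which.max(table(right))))
  })
  candidates[which.min(losses)]
}

# Simulated example: the true split is at 0.6
set.seed(1)
x = runif(100)
y = factor(ifelse(x > 0.6, "spam", "ham"))
bestSplit(x, y)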

Trees, left unattended, can easily overfit. Loss can always be reduced to zero by cutting the predictor space up enough. The two major ways to handle this are the Bayesian approach of putting priors on trees, and the penalized loss function approach, which adds a penalty for more complex trees (more leaves).

It’s common practice to grow a tree too large and then prune it back, rather than just stopping growth when the tree gets too complex, since this explores tree space more thoroughly. Once we have our overgrown tree, we remove terminal nodes so as to minimize a penalized loss function. For a tree with \(T\) terminal nodes labelled \(N_t\) for \(t \in 1, \ldots, T\), we want to minimize

\[kT + \sum_{t = 1}^{T} \text{loss}(N_t)\] where \(k\) is a parameter controlling the penalty on the tree size. Harsher penalties lead to smaller trees.
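The rpart package implements this penalized (cost-complexity) approach; below is a rough sketch, assuming the spam7 data from the DAAG package that we use throughout this section, with rpart's cp parameter playing the role of the penalty \(k\) on a rescaled loss.

library(rpart)
library(DAAG)  # provides the spam7 data used below

# Grow a deliberately large tree by setting a tiny complexity penalty...
bigTree = rpart(yesno ~ ., data = spam7, control = rpart.control(cp = 0.001))
printcp(bigTree)  # cross-validated error at each candidate value of cp
# ...then prune it back with a harsher penalty, giving a smaller tree
prunedTree = prune(bigTree, cp = 0.01)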

Other methods use a significance-test approach to decide whether or not to split. For each variable, the model performs a univariate hypothesis test and splits on the variable with the lowest p-value. If none of the null hypotheses can be rejected, the tree stops splitting. This is the approach taken by the conditional inference trees (ctree) used below.

A main advantage of trees is their interpretability and ease of use.

library(party)  # provides ctree()
library(DAAG)   # provides the spam7 data
spamTree = ctree(yesno ~ ., data = spam7)
spamTree
## 
##   Conditional inference tree with 20 terminal nodes
## 
## Response:  yesno 
## Inputs:  crl.tot, dollar, bang, money, n000, make 
## Number of observations:  4601 
## 
## 1) n000 <= 0.12; criterion = 1, statistic = 515.579
##   2) dollar <= 0.055; criterion = 1, statistic = 393.255
##     3) bang <= 0.102; criterion = 1, statistic = 129.53
##       4) bang <= 0.028; criterion = 1, statistic = 29.259
##         5) money <= 0.13; criterion = 0.998, statistic = 12.684
##           6)*  weights = 2154 
##         5) money > 0.13
##           7)*  weights = 24 
##       4) bang > 0.028
##         8) money <= 0; criterion = 1, statistic = 23.253
##           9) bang <= 0.034; criterion = 0.953, statistic = 7.03
##             10)*  weights = 19 
##           9) bang > 0.034
##             11) dollar <= 0.04; criterion = 0.982, statistic = 8.779
##               12)*  weights = 170 
##             11) dollar > 0.04
##               13)*  weights = 15 
##         8) money > 0
##           14)*  weights = 21 
##     3) bang > 0.102
##       15) crl.tot <= 86; criterion = 1, statistic = 64.968
##         16) crl.tot <= 51; criterion = 1, statistic = 24.737
##           17) money <= 1.16; criterion = 1, statistic = 19.282
##             18)*  weights = 330 
##           17) money > 1.16
##             19)*  weights = 8 
##         16) crl.tot > 51
##           20)*  weights = 184 
##       15) crl.tot > 86
##         21) bang <= 0.471; criterion = 1, statistic = 37.344
##           22) money <= 0; criterion = 1, statistic = 20.151
##             23) crl.tot <= 321; criterion = 0.963, statistic = 7.487
##               24)*  weights = 152 
##             23) crl.tot > 321
##               25)*  weights = 58 
##           22) money > 0
##             26)*  weights = 43 
##         21) bang > 0.471
##           27)*  weights = 180 
##   2) dollar > 0.055
##     28) bang <= 0.075; criterion = 1, statistic = 35.434
##       29) money <= 0.16; criterion = 1, statistic = 27.115
##         30)*  weights = 138 
##       29) money > 0.16
##         31)*  weights = 67 
##     28) bang > 0.075
##       32)*  weights = 436 
## 1) n000 > 0.12
##   33) bang <= 0; criterion = 1, statistic = 21.155
##     34)*  weights = 49 
##   33) bang > 0
##     35) n000 <= 0.25; criterion = 0.973, statistic = 8.025
##       36) n000 <= 0.19; criterion = 0.96, statistic = 7.343
##         37)*  weights = 43 
##       36) n000 > 0.19
##         38)*  weights = 21 
##     35) n000 > 0.25
##       39)*  weights = 489

Let’s predict an email!
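A minimal sketch, using the spamTree fit above: classify the first few emails in spam7 and compare the predictions to the truth.

predict(spamTree, newdata = spam7[1:3, ])
spam7$yesno[1:3]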

How well does it predict?
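One quick (but optimistic) answer is the in-sample accuracy; a sketch, again using the spamTree fit above:

# In-sample confusion matrix and accuracy; optimistic, since the tree
# was fit to these very observations
inSamplePredictions = predict(spamTree)
table(predicted = inSamplePredictions, actual = spam7$yesno)
mean(inSamplePredictions == spam7$yesno)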

# Let's do a simple validation: hold out a random test set
inSampleProp = 0.85
inSampleIndicator = sample(c(TRUE, FALSE), size = nrow(spam7),
                           replace = TRUE,
                           prob = c(inSampleProp, 1 - inSampleProp))
trainingSet = spam7[inSampleIndicator, ]
testingSet = spam7[!inSampleIndicator, ]
trainingTree = ctree(yesno ~ ., data = trainingSet)
trainingTree
## 
##   Conditional inference tree with 20 terminal nodes
## 
## Response:  yesno 
## Inputs:  crl.tot, dollar, bang, money, n000, make 
## Number of observations:  3906 
## 
## 1) dollar <= 0.055; criterion = 1, statistic = 474.384
##   2) bang <= 0.085; criterion = 1, statistic = 107.911
##     3) bang <= 0.028; criterion = 1, statistic = 22.15
##       4) n000 <= 0.11; criterion = 1, statistic = 16.097
##         5) money <= 0.03; criterion = 0.998, statistic = 12.534
##           6)*  weights = 1794 
##         5) money > 0.03
##           7)*  weights = 23 
##       4) n000 > 0.11
##         8)*  weights = 31 
##     3) bang > 0.028
##       9)*  weights = 161 
##   2) bang > 0.085
##     10) crl.tot <= 85; criterion = 1, statistic = 70.548
##       11) crl.tot <= 51; criterion = 1, statistic = 19.938
##         12) money <= 0.48; criterion = 1, statistic = 16.228
##           13)*  weights = 293 
##         12) money > 0.48
##           14)*  weights = 7 
##       11) crl.tot > 51
##         15)*  weights = 169 
##     10) crl.tot > 85
##       16) bang <= 0.448; criterion = 1, statistic = 36.142
##         17) money <= 0; criterion = 1, statistic = 17.308
##           18) n000 <= 0.1; criterion = 0.955, statistic = 7.13
##             19) crl.tot <= 321; criterion = 0.972, statistic = 7.991
##               20)*  weights = 142 
##             19) crl.tot > 321
##               21)*  weights = 54 
##           18) n000 > 0.1
##             22)*  weights = 10 
##         17) money > 0
##           23) money <= 1.05; criterion = 0.984, statistic = 8.964
##             24)*  weights = 62 
##           23) money > 1.05
##             25)*  weights = 7 
##       16) bang > 0.448
##         26)*  weights = 181 
## 1) dollar > 0.055
##   27) bang <= 0.049; criterion = 1, statistic = 39.757
##     28) n000 <= 0.39; criterion = 1, statistic = 22.383
##       29) money <= 0.16; criterion = 1, statistic = 22.929
##         30)*  weights = 114 
##       29) money > 0.16
##         31)*  weights = 53 
##     28) n000 > 0.39
##       32)*  weights = 35 
##   27) bang > 0.049
##     33) crl.tot <= 39; criterion = 0.994, statistic = 10.79
##       34)*  weights = 16 
##     33) crl.tot > 39
##       35) crl.tot <= 411; criterion = 0.981, statistic = 8.653
##         36) bang <= 0.407; criterion = 0.994, statistic = 10.723
##           37)*  weights = 220 
##         36) bang > 0.407
##           38)*  weights = 189 
##       35) crl.tot > 411
##         39)*  weights = 345

Now we can do a fairer out-of-sample calculation.
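A sketch of that calculation, using the trainingTree and testingSet objects defined above:

# Predict the held-out emails and compute the confusion matrix and accuracy
testPredictions = predict(trainingTree, newdata = testingSet)
table(predicted = testPredictions, actual = testingSet$yesno)
mean(testPredictions == testingSet$yesno)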