25.4 Viewing Random Forests
The last part of the course will introduce you to the randomForest package, some nifty things you can do with it, and some visuals. IMPORTANT NOTE: this implementation of random forests can't handle categorical predictors directly, so you need to convert them to a model matrix first. That isn't hard to do, but if you aren't aware of it you can get spurious results.
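For example, here is a minimal sketch of the conversion using the built-in iris data, where Species is a factor (the iris example and the x/irisRf names are just for illustration):

library(randomForest)
# model.matrix() expands the factor Species into one dummy column per level;
# dropping the intercept with - 1 keeps a column for every level
x = model.matrix(Sepal.Length ~ . - 1, data = iris)
irisRf = randomForest(x, iris$Sepal.Length)

The mtcars data used below is all numeric, so it needs no conversion.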
mtcarsRf = randomForest(mpg ~ ., data=mtcars, importance=TRUE, proximity=TRUE, mtry = 4)
mtcarsRf
##
## Call:
## randomForest(formula = mpg ~ ., data = mtcars, importance = TRUE, proximity = TRUE, mtry = 4)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 5.424577
## % Var explained: 84.58
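The "Mean of squared residuals" reported above comes from the out-of-bag (OOB) predictions. Calling predict() on a randomForest fit with no new data returns exactly those OOB predictions, so the number is easy to reproduce by hand:

oobPred = predict(mtcarsRf)     # no newdata, so these are the OOB predictions
mean((mtcars$mpg - oobPred)^2)  # reproduces the mean of squared residuals above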
Variable importance. With type = 2, importance() reports IncNodePurity: the total decrease in node impurity (residual sum of squares, for regression) from splits on each variable, averaged over all 500 trees.
varImportance = importance(mtcarsRf, type = 2)
varImportance
## IncNodePurity
## cyl 189.275498
## disp 262.391424
## hp 205.431608
## drat 45.489344
## wt 278.369216
## qsec 19.915649
## vs 9.846512
## am 8.995382
## gear 9.776913
## carb 19.251604
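Because the model was fit with importance = TRUE, the permutation-based measure (type = 1) is also available. It is often preferred, since impurity-based importance can be biased toward variables with many distinct values:

importance(mtcarsRf, type = 1)  # %IncMSE: rise in OOB MSE when each variable is permuted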
Now plot the importance measures next to partial dependence plots for the three most important variables. partialPlot() traces the marginal effect of a single predictor on the predicted mpg.
layout(matrix(c(1, 1, 2, 1, 1, 3, 1, 1, 4), nrow = 3, ncol = 3, byrow = TRUE))  # big left panel, three stacked on the right
varImpPlot(mtcarsRf, type = 2)
impvar = rownames(varImportance)[order(varImportance[, 1], decreasing = TRUE)]
colors = c("black", "blue", "red")
for (j in 1:3) {
  partialPlot(mtcarsRf, mtcars, impvar[j], main = "", xlab = impvar[j],
              col = colors[j], lwd = 3, cex.lab = 2, n.pt = 100)
}
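partialPlot() also returns the plotted values as a list with x and y components, so you can suppress the plot and keep the curve for your own graphics (the pd name here is just illustrative):

layout(1)  # reset the plotting layout from above
pd = partialPlot(mtcarsRf, mtcars, "wt", plot = FALSE)
head(data.frame(pd))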
A neat output of an RF model is a measure of proximity between rows. The proximity of two observations is the proportion of trees in which they land in the same terminal (leaf) node. This can be very useful when rows mix continuous and categorical data, a situation most distance metrics handle poorly.
If we compute 1 - proximity, we can treat the result as a distance matrix and do things like multidimensional scaling.
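Here is a small sketch of that idea using classical MDS via base R's cmdscale() (the randomForest package also bundles a similar helper, MDSplot(), which expects a grouping factor for coloring the points):

proxDist = as.dist(1 - mtcarsRf$proximity)  # proximity of 1 means always in the same leaf
mdsFit = cmdscale(proxDist, k = 2)          # classical MDS down to two dimensions
plot(mdsFit, type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(mdsFit, labels = rownames(mtcars), cex = 0.7)  # similar cars plot near each other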