Today we will be discussing:
- meta-learning
- ensembles
- bagging - bootstrap aggregating
- boosting - boosts the performance of weak learners
- random forests - decision tree forests
May 6, 2020
The caret package in R gives a unified interface to most of the packages in R that are used for data mining, machine learning and statistical learning.
It also provides a very nice function for splitting data.
It provides functions for evaluating the results.
Helps with working with ensembles.
And it gives parallel processing when appropriate.
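As a minimal sketch of that unified interface (assuming caret and rpart are installed; the built-in iris data stands in for whatever data set you are mining):

```r
library(caret)

set.seed(123)

# Split the data: createDataPartition keeps class proportions balanced
idx <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# train() gives the same interface no matter which underlying
# package does the fitting -- here "rpart" for a decision tree
fit <- train(Species ~ ., data = train_set, method = "rpart")

# Evaluate on the held-out data
preds <- predict(fit, newdata = test_set)
confusionMatrix(preds, test_set$Species)
```

Swapping in a different model is just a matter of changing the `method` argument.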
On an unrelated note, there is the zelig package for basic statistics.
This is another package that tries to produce a unified interface, but for statistics in general.
(I have enjoyed working with this package in the past. I also like the naming reference.)
The caret package is very useful for tuning models.
When tuning models, many different similar models need to be fit. This is a perfect situation for parallel processing.
The caret package can parallel process with the parallel package in R.
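A sketch of parallel tuning (assuming doParallel and randomForest are installed; caret parallelizes through a foreach backend, with the parallel package supplying the worker cluster underneath; the worker count and tuning settings are illustrative):

```r
library(caret)
library(doParallel)

# Register a parallel backend; train() will then fit the many
# similar resampled/tuned models across the workers automatically
cl <- makePSOCKcluster(4)   # 4 worker processes; adjust to your machine
registerDoParallel(cl)

set.seed(123)
fit <- train(Species ~ ., data = iris,
             method = "rf",                  # random forest
             tuneLength = 5,                 # try several mtry values
             trControl = trainControl(method = "cv", number = 10))

stopCluster(cl)
print(fit)   # shows the tuning grid and the chosen model
```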
The idea is to combine several models to form a powerful team.
Build a strong team of weak learners.
All ensemble methods are based on the idea that by combining multiple weaker learners, a stronger learner is created.
bootstrap aggregating
Here sampling with replacement is used to produce the training data; the examples never drawn are said to be out-of-bag, and they are used for validation/testing.
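A hand-rolled bagging sketch (assuming rpart is installed; the ensemble size B and the iris data are stand-ins):

```r
library(rpart)

set.seed(42)
B <- 25                # number of bagged trees (illustrative)
n <- nrow(iris)

# Fit each tree on its own bootstrap sample (sampling WITH replacement);
# the rows never drawn for a given tree are its out-of-bag examples
trees <- lapply(seq_len(B), function(b) {
  in_bag <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[in_bag, ])
})

# Majority vote across the ensemble for the final prediction
votes <- sapply(trees, function(t)
  as.character(predict(t, iris, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged_pred == iris$Species)   # ensemble accuracy on the full data
```

In practice caret wraps this up for you (e.g. `method = "treebag"`), but the loop above is the whole idea.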
Ensembles of models are trained on randomly resampled data, and (weighted) voting is used to determine the final prediction.
We implemented boosting earlier with C5.0.
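As a reminder of how that looks (assuming the C50 package is installed; the trial count and iris data are illustrative):

```r
library(C50)

# Setting trials > 1 turns on boosting: each successive tree
# concentrates on the examples the earlier trees got wrong
set.seed(42)
boosted <- C5.0(Species ~ ., data = iris, trials = 10)

summary(boosted)            # per-trial trees and error rates
predict(boosted, iris[1:5, ])
```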
Ensembles of decision trees are produced and voting is used again.
Here not only are the data rows resampled, but a random subset of the features is also sampled for each tree.
Decision tree forests.
The author uses random forests to look at the credit data again.
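The credit data itself isn't reproduced here, so a sketch with a stand-in data set (assuming the randomForest package is installed; ntree and mtry values are illustrative):

```r
library(randomForest)

# Each tree sees a bootstrap sample of the rows AND considers only
# a random subset of mtry features at each split
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

print(rf)          # includes the out-of-bag (OOB) error estimate
importance(rf)     # which features mattered most
```

The out-of-bag error printed by the model is the free validation estimate that bagging's leftover examples provide.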