2024-02-19
Today we will work with the C5.0 decision tree algorithm and the credit dataset from the book.
We will see one way to randomize the rows of a dataset.
We will also see how to make plots of trees.
Here is the link to the documentation of the R package C50, which implements C5.0.
In the book there is no tree plotted. Why?
I do not know. It seems trees are nice to plot.
Here is another blog post about making nice plots of trees.
2007-2008 were bad years for the banking industry.
How to identify risky loans?
Who can get a loan and who cannot? Why?
Decision trees are very nice and give the model in plain language.
Banks use decision trees to try and minimize potential losses, i.e., minimize making bad loans.
The credit data is from the UCI Machine Learning Repository.
Note that the data is from Germany and the amounts are in Deutsche Marks (DM).
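To randomize the dataset, the following R code draws a random sample of 900 of the 1000 row indices. The sampled rows become the training set and the remaining 100 rows become the test set.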
set.seed(123)
train_sample <- sample(1000, 900)
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]
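A quick way to convince yourself that the split behaves as intended is to try it on a toy data frame. This is only a sketch: the data frame `toy` and its columns `id` and `default` are made up for illustration and are not the credit data from the book.

```r
# Toy stand-in for the credit data: 1000 rows, made-up columns.
set.seed(123)
toy <- data.frame(
  id = 1:1000,
  default = factor(sample(c("no", "yes"), 1000,
                          replace = TRUE, prob = c(0.7, 0.3)))
)

# Draw 900 row numbers at random, without replacement.
train_sample <- sample(1000, 900)

toy_train <- toy[train_sample, ]   # the 900 sampled rows
toy_test  <- toy[-train_sample, ]  # the remaining 100 rows

# The two sets share no rows and together cover the whole dataset.
length(intersect(toy_train$id, toy_test$id))  # 0
nrow(toy_train) + nrow(toy_test)              # 1000

# The class proportions should be roughly similar in both sets.
prop.table(table(toy_train$default))
prop.table(table(toy_test$default))
```

If the proportions of `default` differ a lot between the two sets, it may be worth trying another seed or a stratified split.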
Train the model and inspect the output.
It is here we would like to see a plot of the tree.
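As a rough sketch of how training and plotting could look, the C50 package provides a `plot()` method for fitted C5.0 models. The example below uses the built-in `iris` data as a stand-in, which is an assumption made only to keep the code self-contained; it is not the credit data, and the object names are made up.

```r
library(C50)

# iris stands in for the credit data here (assumption, not the book's dataset).
set.seed(123)
idx <- sample(nrow(iris), 120)
iris_train <- iris[idx, ]
iris_test  <- iris[-idx, ]

# Fit a single C5.0 tree: predictors as a data frame, class as a factor.
tree_model <- C5.0(x = iris_train[, 1:4], y = iris_train$Species)

summary(tree_model)  # text version of the tree plus the training error
plot(tree_model)     # graphical version of the tree
```

With the credit data the same pattern should apply, with the predictor columns as `x` and the default indicator (as a factor) as `y`.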
Decision trees also tend to overfit the training data, so the error rate reported on the training data may be overly optimistic.
It is therefore important to evaluate the decision tree on the test dataset.
Using the test data, evaluate the model.
The model correctly predicted only 50% of the defaulted loans. Not so good.
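To make clear what "predicted 50% of the defaulted loans" means, here is a small sketch that computes that share from a confusion matrix. The ten labels below are invented purely for illustration; they are not the results from the credit data.

```r
# Made-up labels for ten test loans (illustration only, not the book's results).
actual <- factor(
  c("yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  levels = c("no", "yes")
)
predicted <- factor(
  c("yes", "yes", "no", "no", "no", "no", "no", "no", "no", "yes"),
  levels = c("no", "yes")
)

# Rows are the actual classes, columns are the predicted classes.
conf_mat <- table(actual, predicted)
conf_mat

# Share of actual defaults that the model caught: 2 of 4.
default_recall <- conf_mat["yes", "yes"] / sum(conf_mat["yes", ])
default_recall  # 0.5

# Overall accuracy: 7 of 10 loans classified correctly.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy  # 0.7
```

The example shows why accuracy alone can look acceptable while the model still misses half of the loans that matter most to the bank.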
One way to improve the model is boosting. This is a process in which many decision trees are built and the trees vote on the best class for each example.
According to the author, “… boosting is rooted in the notion that by combining a number of weak learners, you can create a team that is much stronger than any one of the learners alone.”
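In the C50 package, boosting is requested through the `trials` argument of `C5.0()`. The sketch below again uses `iris` as a stand-in for the credit data (an assumption for the sake of a self-contained example), and the value of 10 trials is just an illustrative choice.

```r
library(C50)

# iris stands in for the credit data here (assumption, not the book's dataset).
set.seed(123)
idx <- sample(nrow(iris), 120)
iris_train <- iris[idx, ]
iris_test  <- iris[-idx, ]

# Ask for up to 10 boosting iterations; C5.0 may stop early
# if further trees do not help.
boost_model <- C5.0(x = iris_train[, 1:4], y = iris_train$Species,
                    trials = 10)

# Predict the held-out rows and compare with the actual classes.
boost_pred <- predict(boost_model, iris_test[, 1:4])
table(actual = iris_test$Species, predicted = boost_pred)
```

Comparing this table with the one from a single tree (`trials = 1`) on the same test rows is a simple way to see whether boosting helps.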
Cost
It would be good to have an easy way to get the plot of the tree.
One thing to try is the rattle package.