---
title: "Classification2"
author: "Prof. Eric A. Suess"
date: "February 19, 2024"
format:
  revealjs:
    self-contained: true
---

## Today

Today we will work with the C5.0 decision tree algorithm and the credit dataset from the book.

We will see one way to randomize the rows of a dataset.

We will also see how to make plots of trees.

## Package C5.0 in R

Here is the link to the documentation of the [R Package C5.0](http://cran.r-project.org/web/packages/C50/C50.pdf)

## plot for iris data

```{r}
library(C50)
mod1 <- C5.0(Species ~ ., data = iris)
plot(mod1)
```

## sub plot for iris data

```{r}
plot(mod1, subtree = 3)
```

## plot the C5.0 tree

In the book there is no tree plotted. Why? I do not know. It seems trees are nice to plot.

## plots of trees from rpart

Here is another blog post about making nice plots of trees.

[Revolution Analytics: Plotting trees](http://blog.revolutionanalytics.com/2013/06/plotting-classification-and-regression-trees-with-plotrpart.html)

## Example -- identifying risky bank loans using C5.0 decision trees

2007-2008 were bad years for the banking industry.

How to identify risky loans? Who can get a loan and who cannot? Why?

Decision trees are very nice and give the model in plain language.

Banks use decision trees to try to minimize potential losses, i.e., to avoid making bad loans.

## Step 1 -- collecting the data

The credit data is from the UCI Machine Learning Repository.

Note that the data is from Germany and the currency is in Deutsche Marks (DM).

## Step 2 -- exploring and preparing the data

To randomize the dataset, the following R code draws a random sample of 900 of the 1000 row indices for the training set. The remaining 100 rows form the test set.

> set.seed(123)
> train_sample <- sample(1000, 900)
> credit_train <- credit[train_sample, ]
> credit_test <- credit[-train_sample, ]

## Step 3 -- train the model

Run the model and look at the output.

It is here we would like to see a plot of the tree.

Also, it should be noted that decision trees tend to overfit the model to the training data.
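A minimal sketch of this effect, assuming the built-in `iris` data as a stand-in for the credit data (which is not loaded here). The names `iris_train`, `iris_test`, and `fit` are hypothetical. It splits the rows at random, fits a C5.0 tree on the training rows, and compares accuracy on seen and unseen rows.

```{r}
library(C50)

# Assumption: iris stands in for the credit data, which is not loaded here.
set.seed(123)                            # make the random split reproducible
train_rows <- sample(nrow(iris), 120)    # 120 random row indices out of 150
iris_train <- iris[train_rows, ]
iris_test  <- iris[-train_rows, ]        # the remaining 30 rows

# Fit a C5.0 tree on the training rows only
fit <- C5.0(Species ~ ., data = iris_train)

# Share of correct predictions on rows the tree has seen vs. rows it has not
acc_train <- mean(as.character(predict(fit, iris_train)) ==
                    as.character(iris_train$Species))
acc_test  <- mean(as.character(predict(fit, iris_test)) ==
                    as.character(iris_test$Species))
c(train = acc_train, test = acc_test)
```

On a small, clean dataset such as iris the two numbers may be close. On noisier data, the accuracy on the training rows tends to be the higher of the two.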
So the error rates on the training data may be overly optimistic. It is therefore important to evaluate the decision tree on the test dataset.

## Step 4 -- evaluating model performance

Using the test data, evaluate the model.

The model correctly predicted only 50% of the defaulted loans. Not so good.

## Step 5 -- improving model performance

[Adaptive Boosting](https://www.cs.princeton.edu/courses/archive/spring12/cos598A/slides/intro.pdf)

This is a process in which many trees are built and the trees vote on the best class for each example.

According to the author, "... boosting is rooted in the notion that by combining a number of **weak learners**, you can create a **team that is much stronger** than any one of the learners alone."

## Step 5 -- improving model performance

Cost

## rattle

It would be good to have an easy way to get the plot of the tree.

Try [rattle](http://rattle.togaware.com/).
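
One possible way to do this, sketched under assumptions: rattle's `fancyRpartPlot()` draws trees fitted with `rpart`, not C5.0 models, so the example fits an `rpart` tree to the built-in `iris` data as a stand-in. It assumes the rattle and rpart.plot packages are installed. The name `fit_rpart` is hypothetical.

```{r}
library(rpart)
library(rattle)

# Assumption: an rpart classification tree on iris stands in for the
# C5.0 credit model, because fancyRpartPlot() expects an rpart object.
fit_rpart <- rpart(Species ~ ., data = iris, method = "class")

# Draw the tree with rattle's plotting helper
fancyRpartPlot(fit_rpart)
```

The splits of an `rpart` tree need not match those of a C5.0 tree on the same data, so this plot is a visual aid rather than a picture of the C5.0 model.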