library(pacman)
p_load(tidyverse, janitor, naniar, DataExplorer, Boruta, mlbench)
Stat. 652: Final
Answer the following questions:
Question 1
Read the paper “Feature Selection with the Boruta Package” by Miron B. Kursa and Witold R. Rudnicki. See the Final R Project subdirectory /Boruta for the PDF of the paper.
What are the important variables identified by the Boruta algorithm from the Ozone data? Use the output of the provided code below.
Answer:
Code:
data(Ozone)
head(Ozone)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1 1 1 4 3 5480 8 20 NA NA 5000 -15 30.56 200
2 1 2 5 3 5660 6 NA 38 NA NA -14 NA 300
3 1 3 6 3 5710 4 28 40 NA 2693 -25 47.66 250
4 1 4 7 5 5700 3 37 45 NA 590 -24 55.04 100
5 1 5 1 5 5760 3 51 54 45.32 1450 25 57.02 60
6 1 6 2 6 5720 4 69 35 49.64 1568 15 53.78 60
Note that the target variable is V4 = Daily maximum one-hour-average ozone reading.
Convert V1, V2, and V3 to integers.
Ozone <- Ozone %>% mutate(
  V1 = as.integer(V1),
  V2 = as.integer(V2),
  V3 = as.integer(V3)
)
head(Ozone)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
1 1 1 4 3 5480 8 20 NA NA 5000 -15 30.56 200
2 1 2 5 3 5660 6 NA 38 NA NA -14 NA 300
3 1 3 6 3 5710 4 28 40 NA 2693 -25 47.66 250
4 1 4 7 5 5700 3 37 45 NA 590 -24 55.04 100
5 1 5 1 5 5760 3 51 54 45.32 1450 25 57.02 60
6 1 6 2 6 5720 4 69 35 49.64 1568 15 53.78 60
Ozone2 <- Ozone %>% clean_names()
head(Ozone2)
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13
1 1 1 4 3 5480 8 20 NA NA 5000 -15 30.56 200
2 1 2 5 3 5660 6 NA 38 NA NA -14 NA 300
3 1 3 6 3 5710 4 28 40 NA 2693 -25 47.66 250
4 1 4 7 5 5700 3 37 45 NA 590 -24 55.04 100
5 1 5 1 5 5760 3 51 54 45.32 1450 25 57.02 60
6 1 6 2 6 5720 4 69 35 49.64 1568 15 53.78 60
It is always a good idea to check for duplicate records/examples/rows in your dataset.
get_dupes(Ozone2)
No variable names specified - using all columns.
No duplicate combinations found of: v1, v2, v3, v4, v5, v6, v7, v8, v9, ... and 4 other variables
[1] v1 v2 v3 v4 v5 v6
[7] v7 v8 v9 v10 v11 v12
[13] v13 dupe_count
<0 rows> (or 0-length row.names)
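As a cross-check on the duplicate search above, the same idea can be sketched in base R. The data frame `toy` and its values are hypothetical and exist only for illustration; `duplicated()` flags the second and later copies of a row, whereas `get_dupes()` returns every copy along with a count.

```r
# Hypothetical toy data: row 4 repeats row 2 exactly.
toy <- data.frame(
  v1 = c(1, 1, 1, 1),
  v2 = c(1, 2, 3, 2),
  v4 = c(3, 3, 3, 3)
)

# duplicated() marks a row TRUE when an identical row appeared earlier.
is_dup <- duplicated(toy)
n_dup <- sum(is_dup)

# Keep only the first occurrence of each distinct row.
deduped <- toy[!is_dup, , drop = FALSE]

print(n_dup)
print(deduped)
```

If duplicates were found in the real data, `dplyr::distinct()` would be one tidyverse way to remove them.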
Start by investigating the missing values and completeness of the features in the data. Note that several variables (for example, v9) contain missing values.
vis_miss(Ozone2)
gg_miss_var(Ozone2)
gg_miss_var(Ozone2, show_pct = TRUE)
create_report(Ozone2, y = "v4", output_file = "report_Ozone.html", output_dir = getwd())
Drop V9 because it is missing more than 40% of its values.
Ozone2 <- Ozone2 %>% select(-v9)
vis_miss(Ozone2)
gg_miss_var(Ozone2)
gg_miss_var(Ozone2, show_pct = TRUE)
Ozone2 <- drop_na(Ozone2)
vis_miss(Ozone2)
gg_miss_var(Ozone2)
gg_miss_var(Ozone2, show_pct = TRUE)
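The two cleaning steps above (dropping a sparse column, then dropping incomplete rows) can also be checked numerically rather than only from the plots. The sketch below is a base R illustration, not the code used in this document: `toy` holds just the first six rows of v7, v8, and v9 copied from the head() output, so its percentages do not describe the full Ozone data, and the names `miss_frac`, `sparse_cols`, `kept`, and `complete` are made up for the example.

```r
# First six rows of three columns, copied from the head() output above.
# Illustration only: six rows say nothing about the full data set.
toy <- data.frame(
  v7 = c(20, NA, 28, 37, 51, 69),
  v8 = c(NA, 38, 40, 45, 54, 35),
  v9 = c(NA, NA, NA, NA, 45.32, 49.64)
)

# Fraction of missing values in each column.
miss_frac <- colMeans(is.na(toy))

# Columns above the 40% cutoff mentioned in the text.
sparse_cols <- names(miss_frac)[miss_frac > 0.40]

# Drop the sparse columns, then keep only rows with no remaining NA.
kept <- toy[, !(names(toy) %in% sparse_cols), drop = FALSE]
complete <- kept[complete.cases(kept), , drop = FALSE]

print(round(miss_frac, 2))
print(sparse_cols)
print(nrow(complete))
```

Dropping the sparse column first, as the document does, keeps more rows than calling `drop_na()` on all columns at once, because a mostly empty column would otherwise eliminate most rows.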
Boruta.Ozone <- Boruta(v4 ~ ., data = Ozone2, doTrace = 2, ntree = 500)
1. run of importance source...
2. run of importance source...
3. run of importance source...
4. run of importance source...
5. run of importance source...
6. run of importance source...
7. run of importance source...
8. run of importance source...
9. run of importance source...
10. run of importance source...
11. run of importance source...
After 11 iterations, +1.8 secs:
confirmed 8 attributes: v1, v10, v11, v12, v13 and 3 more;
still have 3 attributes left.
12. run of importance source...
13. run of importance source...
14. run of importance source...
15. run of importance source...
16. run of importance source...
17. run of importance source...
18. run of importance source...
After 18 iterations, +2.6 secs:
rejected 1 attribute: v3;
still have 2 attributes left.
19. run of importance source...
20. run of importance source...
21. run of importance source...
22. run of importance source...
23. run of importance source...
24. run of importance source...
25. run of importance source...
26. run of importance source...
27. run of importance source...
28. run of importance source...
29. run of importance source...
30. run of importance source...
31. run of importance source...
32. run of importance source...
33. run of importance source...
34. run of importance source...
35. run of importance source...
36. run of importance source...
37. run of importance source...
38. run of importance source...
39. run of importance source...
40. run of importance source...
41. run of importance source...
42. run of importance source...
43. run of importance source...
44. run of importance source...
45. run of importance source...
46. run of importance source...
47. run of importance source...
48. run of importance source...
49. run of importance source...
50. run of importance source...
51. run of importance source...
52. run of importance source...
53. run of importance source...
54. run of importance source...
55. run of importance source...
56. run of importance source...
57. run of importance source...
58. run of importance source...
59. run of importance source...
60. run of importance source...
61. run of importance source...
62. run of importance source...
63. run of importance source...
64. run of importance source...
65. run of importance source...
66. run of importance source...
67. run of importance source...
68. run of importance source...
69. run of importance source...
After 69 iterations, +8.1 secs:
rejected 1 attribute: v6;
still have 1 attribute left.
70. run of importance source...
71. run of importance source...
72. run of importance source...
73. run of importance source...
74. run of importance source...
75. run of importance source...
76. run of importance source...
77. run of importance source...
78. run of importance source...
79. run of importance source...
80. run of importance source...
81. run of importance source...
82. run of importance source...
83. run of importance source...
84. run of importance source...
85. run of importance source...
86. run of importance source...
87. run of importance source...
88. run of importance source...
89. run of importance source...
90. run of importance source...
91. run of importance source...
After 91 iterations, +10 secs:
rejected 1 attribute: v2;
no more attributes left.
Boruta.Ozone
Boruta performed 91 iterations in 10.48163 secs.
8 attributes confirmed important: v1, v10, v11, v12, v13 and 3 more;
3 attributes confirmed unimportant: v2, v3, v6;
plot(Boruta.Ozone)
Boruta.Short <- Boruta(v4 ~ ., data = Ozone2, maxRuns = 12)
Boruta.Short
Boruta performed 11 iterations in 1.194764 secs.
8 attributes confirmed important: v1, v10, v11, v12, v13 and 3 more;
1 attributes confirmed unimportant: v3;
2 tentative attributes left: v2, v6;
plot(Boruta.Short)
TentativeRoughFix(Boruta.Short)
Boruta performed 11 iterations in 1.194764 secs.
Tentatives roughfixed over the last 11 iterations.
8 attributes confirmed important: v1, v10, v11, v12, v13 and 3 more;
3 attributes confirmed unimportant: v2, v3, v6;
getConfirmedFormula(Boruta.Ozone)
v4 ~ v1 + v5 + v7 + v8 + v10 + v11 + v12 + v13
<environment: 0x561db0403010>
attStats(Boruta.Ozone)
meanImp medianImp minImp maxImp normHits decision
v1 11.36648857 11.26799811 9.6675475 13.512800 1.00000000 Confirmed
v2 1.76139392 1.74109038 -0.3398283 3.639818 0.32967033 Rejected
v3 0.05196969 -0.05630592 -1.6226358 2.264972 0.02197802 Rejected
v5 12.62144108 12.65065174 10.8743353 14.317759 1.00000000 Confirmed
v6 1.61699406 1.68806068 -0.8939025 3.655509 0.23076923 Rejected
v7 15.07364908 15.07569578 13.5979196 17.066436 1.00000000 Confirmed
v8 23.86364085 23.80697949 22.0722809 26.101230 1.00000000 Confirmed
v10 16.69386366 16.63445298 15.0061347 18.535477 1.00000000 Confirmed
v11 16.14994074 16.09983397 14.3987129 18.788007 1.00000000 Confirmed
v12 22.05108807 22.05770956 20.3210652 23.700102 1.00000000 Confirmed
v13 10.52212408 10.44392183 8.5735066 12.302283 1.00000000 Confirmed
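The attStats() table above can be filtered and sorted to list the confirmed attributes by importance. The sketch below rebuilds two columns of that table by hand (meanImp rounded to two decimals from the output above) so that it runs without the Boruta object; the names `stats`, `confirmed`, and `ranked` are illustrative. With the real object, the same filter should work on `attStats(Boruta.Ozone)`, and `getSelectedAttributes()` from the Boruta package returns the confirmed names directly.

```r
# Hand-copied subset of the attStats() output above (meanImp rounded).
stats <- data.frame(
  meanImp = c(11.37, 1.76, 0.05, 12.62, 1.62, 15.07,
              23.86, 16.69, 16.15, 22.05, 10.52),
  decision = c("Confirmed", "Rejected", "Rejected", "Confirmed",
               "Rejected", "Confirmed", "Confirmed", "Confirmed",
               "Confirmed", "Confirmed", "Confirmed"),
  row.names = c("v1", "v2", "v3", "v5", "v6", "v7",
                "v8", "v10", "v11", "v12", "v13")
)

# Keep the confirmed attributes only.
confirmed <- stats[stats$decision == "Confirmed", , drop = FALSE]

# Sort from highest to lowest mean importance.
confirmed <- confirmed[order(-confirmed$meanImp), , drop = FALSE]
ranked <- rownames(confirmed)

print(ranked)
```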
plotImpHistory(Boruta.Ozone)
Question 2
What are the important variables identified by the Boruta algorithm from the titanic training data? Provide code below to run the Boruta algorithm on the titanic_train data.
Answer:
Code:
library(titanic)
data(titanic_train)
Question 3:
a) Define clearly what the correlation coefficient measures.
Answer:
b) Use an AI tool you are comfortable with (Google AI Studio, ChatGPT, Mistral, Microsoft Copilot, or another) to prepare a response to the prompt in part a). Comment on how similar your answer is to the one provided by the AI tool.
Answer:
Question 4:
Using your selected AI tool, upload or share steps01.docx with the AI.
- Write a prompt to summarize and explain the ideas presented in the document. Ask for further details to be added to the summary for each step. Ask for example R code using the tidyverse and tidymodels packages.
Answer: prompt
Answer: summary of the AI-generated summary of steps01.docx.