Stat. 652: Final

Author

Prof. Eric A. Suess

Answer two questions:

  1. What are the important variables identified by the Boruta algorithm from the Ozone data?

Answer:

  2. What are the important variables identified by the Boruta algorithm from the Titanic training data?

Answer:

library(pacman)
p_load(tidyverse, janitor, naniar, DataExplorer, Boruta, mlbench)
data(Ozone)

head(Ozone)
  V1 V2 V3 V4   V5 V6 V7 V8    V9  V10 V11   V12 V13
1  1  1  4  3 5480  8 20 NA    NA 5000 -15 30.56 200
2  1  2  5  3 5660  6 NA 38    NA   NA -14    NA 300
3  1  3  6  3 5710  4 28 40    NA 2693 -25 47.66 250
4  1  4  7  5 5700  3 37 45    NA  590 -24 55.04 100
5  1  5  1  5 5760  3 51 54 45.32 1450  25 57.02  60
6  1  6  2  6 5720  4 69 35 49.64 1568  15 53.78  60

Note that the target variable is V4, the daily maximum one-hour-average ozone reading.

Convert V1, V2, and V3 (month, day of month, and day of week) to integers.

Ozone <- Ozone %>% mutate(
  V1 = as.integer(V1),
  V2 = as.integer(V2),
  V3 = as.integer(V3)
)
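A caveat about as.integer() that is worth keeping in mind at this step: applied to a factor, it returns the internal level codes rather than the labels. That is harmless when the levels are "1", "2", ... in order, which is assumed to be the case for V1, V2, and V3, but it is not safe in general. A minimal sketch with a hypothetical factor `f` (not part of the Ozone data):

```r
# Hypothetical factor whose labels are not 1, 2, 3, ...
f <- factor(c("10", "20", "10", "30"))

# as.integer() on a factor returns the level codes
codes <- as.integer(f)

# Going through as.character() recovers the labels as numbers
values <- as.integer(as.character(f))

print(codes)
print(values)
```

If the labels themselves matter, converting through as.character() first is the safer habit.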

head(Ozone)
  V1 V2 V3 V4   V5 V6 V7 V8    V9  V10 V11   V12 V13
1  1  1  4  3 5480  8 20 NA    NA 5000 -15 30.56 200
2  1  2  5  3 5660  6 NA 38    NA   NA -14    NA 300
3  1  3  6  3 5710  4 28 40    NA 2693 -25 47.66 250
4  1  4  7  5 5700  3 37 45    NA  590 -24 55.04 100
5  1  5  1  5 5760  3 51 54 45.32 1450  25 57.02  60
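6  1  6  2  6 5720  4 69 35 49.64 1568  15 53.78  60

Clean the column names with clean_names() from janitor, which changes V1, ..., V13 to v1, ..., v13.

Ozone2 <- Ozone %>% clean_names()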

head(Ozone2)
  v1 v2 v3 v4   v5 v6 v7 v8    v9  v10 v11   v12 v13
1  1  1  4  3 5480  8 20 NA    NA 5000 -15 30.56 200
2  1  2  5  3 5660  6 NA 38    NA   NA -14    NA 300
3  1  3  6  3 5710  4 28 40    NA 2693 -25 47.66 250
4  1  4  7  5 5700  3 37 45    NA  590 -24 55.04 100
5  1  5  1  5 5760  3 51 54 45.32 1450  25 57.02  60
6  1  6  2  6 5720  4 69 35 49.64 1568  15 53.78  60

It is always a good idea to check for duplicate records/examples/rows in your dataset.

get_dupes(Ozone2)
No variable names specified - using all columns.
No duplicate combinations found of: v1, v2, v3, v4, v5, v6, v7, v8, v9, ... and 4 other variables
 [1] v1         v2         v3         v4         v5         v6        
 [7] v7         v8         v9         v10        v11        v12       
[13] v13        dupe_count
<0 rows> (or 0-length row.names)
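The same kind of duplicate check can be sketched in base R, which may help when janitor is not available. The data frame `toy` below is hypothetical and only illustrates the idea; it is not the Ozone data.

```r
# Hypothetical toy data with one repeated row
toy <- data.frame(id = c(1, 2, 2, 3), x = c("a", "b", "b", "c"))

# duplicated() flags the second and later occurrences of a row
n_dupes <- sum(duplicated(toy))

# unique() keeps the first occurrence of each row
toy_unique <- unique(toy)

print(n_dupes)
print(nrow(toy_unique))
```

A count of zero from duplicated() corresponds to the "No duplicate combinations found" message above.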

Start by investigating the missing values and completeness of the features in the data. Note that several variables, v9 in particular, contain missing values.

vis_miss(Ozone2)

gg_miss_var(Ozone2)

gg_miss_var(Ozone2, show_pct = TRUE)

create_report(Ozone2, y = "v4", output_file = "report_Ozone.html", output_dir = getwd())

Drop v9 because more than 40% of its values are missing.

Ozone2 <- Ozone2 %>% select(-v9)
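The missingness percentages behind a cutoff like this can also be computed directly rather than read off the plots. The sketch below uses a hypothetical data frame `toy` and treats the 40% threshold as an assumption taken from the text above; it is one possible way to apply the rule, not the only one.

```r
# Hypothetical toy data; column b is 60% missing
toy <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(NA, NA, 3, NA, 5),
  c = 1:5
)

# Percent of missing values in each column
pct_missing <- colMeans(is.na(toy)) * 100

# Keep only the columns at or below the assumed 40% cutoff
keep <- names(pct_missing)[pct_missing <= 40]
toy_kept <- toy[, keep, drop = FALSE]

print(pct_missing)
print(keep)
```

Applied to the real data, the same colMeans(is.na(...)) call would show which columns exceed the cutoff.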

vis_miss(Ozone2)

gg_miss_var(Ozone2)

gg_miss_var(Ozone2, show_pct = TRUE)

Drop the remaining rows that contain missing values.

Ozone2 <- drop_na(Ozone2)
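Because drop_na() removes every row that has at least one missing value, it is worth checking how many rows are lost. A minimal base-R sketch of that check, using a hypothetical data frame `toy` rather than the Ozone data:

```r
# Hypothetical toy data with missing values in two different rows
toy <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))

# complete.cases() is TRUE for rows with no missing values
complete <- complete.cases(toy)

# Number of rows that a listwise deletion would remove
n_dropped <- sum(!complete)

# Rows that survive the deletion
toy_complete <- toy[complete, ]

print(n_dropped)
print(nrow(toy_complete))
```

Comparing nrow() before and after the deletion gives the same information on the real data.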

vis_miss(Ozone2)

gg_miss_var(Ozone2)

gg_miss_var(Ozone2, show_pct = TRUE)

Boruta.Ozone <- Boruta(v4 ~ ., data = Ozone2, doTrace = 2, ntree = 500)
 1. run of importance source...
 2. run of importance source...
 3. run of importance source...
 4. run of importance source...
 5. run of importance source...
 6. run of importance source...
 7. run of importance source...
 8. run of importance source...
 9. run of importance source...
 10. run of importance source...
 11. run of importance source...
After 11 iterations, +0.99 secs: 
 confirmed 8 attributes: v1, v10, v11, v12, v13 and 3 more;
 still have 3 attributes left.
 12. run of importance source...
 13. run of importance source...
 14. run of importance source...
 15. run of importance source...
After 15 iterations, +1.1 secs: 
 rejected 1 attribute: v3;
 still have 2 attributes left.
 16. run of importance source...
 17. run of importance source...
 18. run of importance source...
 19. run of importance source...
 20. run of importance source...
 21. run of importance source...
 22. run of importance source...
 23. run of importance source...
 24. run of importance source...
 25. run of importance source...
 26. run of importance source...
 27. run of importance source...
 28. run of importance source...
 29. run of importance source...
 30. run of importance source...
 31. run of importance source...
 32. run of importance source...
 33. run of importance source...
 34. run of importance source...
 35. run of importance source...
 36. run of importance source...
 37. run of importance source...
 38. run of importance source...
 39. run of importance source...
 40. run of importance source...
 41. run of importance source...
 42. run of importance source...
 43. run of importance source...
 44. run of importance source...
 45. run of importance source...
 46. run of importance source...
 47. run of importance source...
 48. run of importance source...
 49. run of importance source...
 50. run of importance source...
 51. run of importance source...
 52. run of importance source...
 53. run of importance source...
 54. run of importance source...
 55. run of importance source...
 56. run of importance source...
 57. run of importance source...
 58. run of importance source...
 59. run of importance source...
 60. run of importance source...
 61. run of importance source...
 62. run of importance source...
 63. run of importance source...
 64. run of importance source...
 65. run of importance source...
 66. run of importance source...
 67. run of importance source...
After 67 iterations, +2.8 secs: 
 rejected 1 attribute: v6;
 still have 1 attribute left.
 68. run of importance source...
 69. run of importance source...
 70. run of importance source...
 71. run of importance source...
 72. run of importance source...
 73. run of importance source...
 74. run of importance source...
 75. run of importance source...
 76. run of importance source...
 77. run of importance source...
 78. run of importance source...
 79. run of importance source...
 80. run of importance source...
 81. run of importance source...
 82. run of importance source...
 83. run of importance source...
 84. run of importance source...
 85. run of importance source...
 86. run of importance source...
 87. run of importance source...
 88. run of importance source...
 89. run of importance source...
 90. run of importance source...
 91. run of importance source...
 92. run of importance source...
 93. run of importance source...
 94. run of importance source...
 95. run of importance source...
 96. run of importance source...
After 96 iterations, +3.7 secs: 
 rejected 1 attribute: v2;
 no more attributes left.
Boruta.Ozone
Boruta performed 96 iterations in 3.719625 secs.
 8 attributes confirmed important: v1, v10, v11, v12, v13 and 3 more;
 3 attributes confirmed unimportant: v2, v3, v6;
plot(Boruta.Ozone)

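Rerun Boruta with maxRuns = 12. With so few runs the algorithm stops before every attribute is decided, so three attributes remain tentative.

Boruta.Short <- Boruta(v4 ~ ., data = Ozone2, maxRuns = 12)
Boruta.Short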
Boruta performed 11 iterations in 0.368885 secs.
 8 attributes confirmed important: v1, v10, v11, v12, v13 and 3 more;
 No attributes deemed unimportant.
 3 tentative attributes left: v2, v3, v6;
plot(Boruta.Short)

TentativeRoughFix(Boruta.Short)
Boruta performed 11 iterations in 0.368885 secs.
Tentatives roughfixed over the last 11 iterations.
 8 attributes confirmed important: v1, v10, v11, v12, v13 and 3 more;
 3 attributes confirmed unimportant: v2, v3, v6;
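As described in the Boruta documentation, TentativeRoughFix() settles tentative attributes with a simple comparison: the median importance of the attribute against the median importance of the best shadow attribute. The sketch below illustrates that comparison with hypothetical importance values; it is an illustration of the rule, not the package's implementation.

```r
# Hypothetical importance values over four runs (not taken from the fit above)
imp_tentative  <- c(1.2, 1.8, 1.5, 1.6)  # a tentative attribute
imp_shadow_max <- c(2.0, 2.4, 1.9, 2.2)  # best shadow attribute in each run

# Confirm the attribute only if its median importance beats the shadow median
decision <- if (median(imp_tentative) > median(imp_shadow_max)) {
  "Confirmed"
} else {
  "Rejected"
}

print(decision)
```

Because this is a weaker test than the full Boruta procedure, the result is best read as a rough call rather than a final decision.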
getConfirmedFormula(Boruta.Ozone)
v4 ~ v1 + v5 + v7 + v8 + v10 + v11 + v12 + v13
<environment: 0x561dd5af3ee0>
attStats(Boruta.Ozone)
       meanImp  medianImp     minImp     maxImp   normHits  decision
v1  11.3874552 11.3651587  9.8382744 12.9709571 1.00000000 Confirmed
v2   1.5758688  1.5881500 -0.7651735  3.4231289 0.33333333  Rejected
v3  -0.2573714 -0.2609554 -2.0913614  0.8875875 0.01041667  Rejected
v5  12.5399895 12.5756653 10.5397329 14.0529765 1.00000000 Confirmed
v6   1.5711420  1.4576277 -0.8055712  4.4285334 0.20833333  Rejected
v7  15.0561014 14.9926542 13.0008470 17.3376540 1.00000000 Confirmed
v8  23.9711310 24.0119841 21.4645870 26.0054181 1.00000000 Confirmed
v10 16.7129825 16.7243863 14.8258328 18.4648197 1.00000000 Confirmed
v11 16.1937697 16.1761995 13.8122657 18.2077687 1.00000000 Confirmed
v12 22.0144459 21.9837469 20.4770559 23.9295550 1.00000000 Confirmed
v13 10.5502988 10.5900424  8.5199322 12.4803862 1.00000000 Confirmed
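The printed summary abbreviates the confirmed attributes ("and 3 more"), so it can help to list them explicitly. The sketch below copies the decision column from the attStats() table above into a named vector and filters it; on the fitted object itself, getSelectedAttributes(Boruta.Ozone, withTentative = FALSE) should return the same names.

```r
# Decisions copied from the attStats(Boruta.Ozone) output above
decisions <- c(
  v1 = "Confirmed", v2 = "Rejected", v3 = "Rejected", v5 = "Confirmed",
  v6 = "Rejected", v7 = "Confirmed", v8 = "Confirmed", v10 = "Confirmed",
  v11 = "Confirmed", v12 = "Confirmed", v13 = "Confirmed"
)

# Names of the attributes Boruta confirmed as important
confirmed <- names(decisions)[decisions == "Confirmed"]

# Names of the attributes Boruta rejected
rejected <- names(decisions)[decisions == "Rejected"]

print(confirmed)
print(rejected)
```

The confirmed names match the right-hand side of the formula returned by getConfirmedFormula(Boruta.Ozone).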
plotImpHistory(Boruta.Ozone)