Example: Compare Simple Linear Regression to a single-layer NN.
The cars dataset in R contains two variables: speed, the speed of the car in mph, and dist, the stopping distance in feet. Using speed to predict stopping distance, two models are fit. See the R code.
- What function is used to normalize the data?
- What percentage of the data is used for training? What percentage of the data is used for testing?
- What is the fitted linear regression model?
- What is the correlation between the linear regression predicted values and the values from the test data?
- Sketch the NN model that is used to model stopping distance.
- What kind of activation function was used in the ANN? Sketch a picture of what the activation function looks like.
- What is the correlation between the ANN predicted values and the values from the test data?
- Examine the scatterplot of speed by distance with the fitted models. Is the NN fitting a near linear function?
- Which model would you use for prediction? Explain.
Answer:
Read in data and examine structure.
suppressMessages(library("tidyverse"))
cars <- as_tibble(cars)  # as.tibble() is deprecated; as_tibble() is the current function
cars
str(cars)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
cars %>% ggplot(aes(x=speed, y=dist)) +
geom_point(size = 4) +
ggtitle("Cars data")
Apply scaling to entire data frame.
cars_norm <- cars %>% mutate(speed = scale(speed), dist=scale(dist))
cars_norm
str(cars_norm)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 50 obs. of 2 variables:
$ speed: num [1:50, 1] -2.16 -2.16 -1.59 -1.59 -1.4 ...
..- attr(*, "scaled:center")= num 15.4
..- attr(*, "scaled:scale")= num 5.29
$ dist : num [1:50, 1] -1.59 -1.28 -1.513 -0.814 -1.047 ...
..- attr(*, "scaled:center")= num 43
..- attr(*, "scaled:scale")= num 25.8
cars_norm %>% ggplot(aes(x=speed, y=dist)) +
geom_point(size = 4) +
ggtitle("Scaled cars data") +
scale_x_continuous(limits = c(-2.2, 2)) +
scale_y_continuous(limits = c(-2, 3))
Create training and test data.
Side note: This is not done using best practices. The scaling parameters should be estimated from the training data only and then applied to the test data, rather than computed from the entire dataset; otherwise information from the test set leaks into the training step. Scaling the full dataset before splitting is nonetheless common in many machine learning books. This should be corrected.
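A minimal sketch of the best-practice alternative described above: estimate the centering and scaling constants from the training rows only, then apply those same constants to the test rows. (The variable names here are illustrative and not part of the original analysis.)

```r
set.seed(12345)
idx <- sample(1:50, 40)

train_raw <- cars[idx, ]
test_raw  <- cars[-idx, ]

# scaling constants computed from the training data only
m <- sapply(train_raw, mean)
s <- sapply(train_raw, sd)

# apply the training-set constants to both splits
train_scaled <- as.data.frame(scale(train_raw, center = m, scale = s))
test_scaled  <- as.data.frame(scale(test_raw,  center = m, scale = s))
```

With this approach the test set is transformed exactly as new data would be at prediction time.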
set.seed(12345)
idx <- sample(1:50, 40)
cars_train <- cars_norm[idx, ]
str(cars_train)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 40 obs. of 2 variables:
$ speed: num 0.681 0.87 1.816 0.87 -0.265 ...
$ dist : num 0.117 0.816 1.631 0.505 -0.271 ...
cars_test <- cars_norm[-idx, ]
str(cars_test)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 10 obs. of 2 variables:
$ speed: num -1.589 -1.021 -1.021 -0.454 -0.454 ...
$ dist : num -1.513 -0.969 -0.348 -0.348 0.117 ...
cars_train %>% ggplot(aes(x=speed, y=dist)) +
geom_point(size = 4) +
ggtitle("Training Data") +
scale_x_continuous(limits = c(-2.2, 2)) +
scale_y_continuous(limits = c(-2, 3))
cars_test %>% ggplot(aes(x=speed, y=dist)) +
geom_point(size = 4) +
ggtitle("Test Data") +
scale_x_continuous(limits = c(-2.2, 2)) +
scale_y_continuous(limits = c(-2, 3))
Fit a simple linear regression model on the training data, predict the test data, and compare the predicted values with the holdout values.
cars_lm <- cars_train %>% lm(dist ~ speed, data = .)
summary(cars_lm)
Call:
lm(formula = dist ~ speed, data = .)
Residuals:
Min 1Q Median 3Q Max
-1.0337298 -0.3661916 -0.0613685 0.2381485 1.6623987
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03133762 0.08918628 -0.35137 0.72725
speed 0.73450008 0.09457276 7.76651 0.0000000023147 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5638863 on 38 degrees of freedom
Multiple R-squared: 0.6135016, Adjusted R-squared: 0.6033306
F-statistic: 60.31865 on 1 and 38 DF, p-value: 0.000000002314661
predicted_lm_dist <- predict(cars_lm, cars_test)
# examine the correlation between predicted and actual values
cor(predicted_lm_dist, cars_test$dist)
[1] 0.8696117656
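Because the model was fit on scaled data, its predictions are in standard-deviation units, not feet. A sketch (not part of the original analysis) of converting them back using the constants that scale() stored as attributes on cars_norm$dist; note that correlation is unchanged by this linear transformation.

```r
ctr <- attr(cars_norm$dist, "scaled:center")  # about 43
scl <- attr(cars_norm$dist, "scaled:scale")   # about 25.8

predicted_lm_feet <- predicted_lm_dist * scl + ctr
head(predicted_lm_feet)
```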
Fit an NN: train a neural network model. Compare the R code; it is very similar to the linear regression code.
library(neuralnet)
set.seed(12345)
cars_model <- cars_train %>% neuralnet(formula = dist ~ speed,
act.fct = "logistic", hidden = 3, linear.output=TRUE)
plot(cars_model)
Nice plot with the plotnet() function.
library(NeuralNetTools)
par(mar = numeric(4), family = 'serif')
plotnet(cars_model, alpha = 0.6)
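The act.fct = "logistic" argument above selects the logistic (sigmoid) activation, f(x) = 1 / (1 + exp(-x)). A quick base-R sketch of its S-shaped curve, useful for answering the activation-function question:

```r
curve(1 / (1 + exp(-x)), from = -6, to = 6,
      xlab = "x", ylab = "f(x)", main = "Logistic activation")
```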
Predict the Test Data. Compare predicted values with the holdout values.
model_results <- compute(cars_model, cars_test[1])
predicted_dist <- model_results$net.result
# examine the correlation between predicted and actual values
cor(predicted_dist, cars_test$dist)
[,1]
[1,] 0.8744086583
Plot the fitted models.
ggplot(data=cars_test, aes(x=speed, y=dist)) +
geom_point(size = 4) +
geom_smooth(method='lm', formula=y~x, fill=NA) +
geom_line(aes(y = predicted_dist)) +
ggtitle("Test Data Fitted with a Linear Model (blue) and NN (black)") +
scale_x_continuous(limits = c(-2.2, 2)) +
scale_y_continuous(limits = c(-2, 3))
Example: Compare Simple Linear Regression to a deep, multilayer neural network.
- Do you think this model will overfit?
- What does parsimonious mean?
- Suggest a better measure for goodness-of-fit.
cars_model <- cars_train %>% neuralnet(formula = dist ~ speed,
act.fct = "logistic", hidden = c(10,5), linear.output=TRUE)
plot(cars_model)
Nice plot with the plotnet() function.
par(mar = numeric(4), family = 'serif')
plotnet(cars_model, alpha = 0.6)
Predict the Test Data. Compare predicted values with the holdout values.
model_results <- compute(cars_model, cars_test[1])
predicted_dist <- model_results$net.result
# examine the correlation between predicted and actual values
cor(predicted_dist, cars_test$dist)
[,1]
[1,] 0.889083417
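On the question of a better goodness-of-fit measure: correlation ignores bias and scale errors in the predictions, so a measure such as root mean squared error on the test set is more informative. A minimal sketch (rmse is a helper defined here, not part of the original code):

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(cars_test$dist, predicted_dist)      # multilayer NN
rmse(cars_test$dist, predicted_lm_dist)   # linear model, for comparison
```

Lower RMSE is better; comparing the two models on this scale penalizes systematic over- or under-prediction that correlation would miss.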
Plot the fitted models.
ggplot(data=cars_test, aes(x=speed, y=dist)) +
geom_point(size = 4) +
geom_smooth(method='lm', formula=y~x, fill=NA) +
geom_line(aes(y = predicted_dist)) +
ggtitle("Test Data Fitted with a Linear Model (blue) and NN (black)") +
scale_x_continuous(limits = c(-2.2, 2)) +
scale_y_continuous(limits = c(-2, 3))