Example: Compare Simple Linear Regression to a single-layer NN.

The cars dataset in R contains two variables: speed, the speed of cars in mph, and dist, the stopping distance in feet. Using speed to predict stopping distance, two models are fit. See the R code.

  1. What function is used to normalize the data?
  2. What percentage of the data is used for training? What percentage of the data is used for testing?
  3. What is the fitted linear regression model?
  4. What is the correlation between the linear regression predicted values and the values from the test data?
  5. Sketch the NN model that is used to model stopping distance.
  6. What kind of activation function was used in the ANN? Sketch a picture of what the activation function looks like.
  7. What is the correlation between the ANN predicted values and the values from the test data?
  8. Examine the scatterplot of speed by distance with the fitted models. Is the NN fitting a near linear function?
  9. Which model would you use for prediction? Explain.

Answer:

Read in data and examine structure.

suppressMessages(library("tidyverse"))
cars <- as_tibble(cars)  # as.tibble() is deprecated in favor of as_tibble()
cars
str(cars)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
cars %>% ggplot(aes(x=speed, y=dist)) + 
  geom_point(size = 4) +
  ggtitle("Cars data") 

Apply scaling to entire data frame.

cars_norm <- cars %>% mutate(speed = scale(speed), dist=scale(dist))
cars_norm
str(cars_norm)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   50 obs. of  2 variables:
 $ speed: num [1:50, 1] -2.16 -2.16 -1.59 -1.59 -1.4 ...
  ..- attr(*, "scaled:center")= num 15.4
  ..- attr(*, "scaled:scale")= num 5.29
 $ dist : num [1:50, 1] -1.59 -1.28 -1.513 -0.814 -1.047 ...
  ..- attr(*, "scaled:center")= num 43
  ..- attr(*, "scaled:scale")= num 25.8
cars_norm %>% ggplot(aes(x=speed, y=dist)) + 
  geom_point(size = 4) + 
  ggtitle("Scaled cars data") +
  scale_x_continuous(limits = c(-2.2, 2)) +
  scale_y_continuous(limits = c(-2, 3))

Create training and test data.

Side note: This is not done using best practices: the scale() function should be applied only to the training data, not to the entire dataset, and the test data should then be standardized with the training-set center and scale. Scaling before splitting is common in many machine learning books, but it leaks information from the test set into training and should be corrected.
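
As a minimal sketch of the corrected workflow (assuming the same 40/10 split index idx used below), the centering and scaling parameters would be estimated from the training rows only and then reused for the test rows:

```r
# split first, then scale using training-set statistics only
set.seed(12345)
idx <- sample(1:50, 40)

train_raw <- cars[idx, ]
test_raw  <- cars[-idx, ]

# center and scale estimated on the training data alone
ctr <- sapply(train_raw, mean)
scl <- sapply(train_raw, sd)

# both sets transformed with the SAME training-derived parameters
train_scaled <- as.data.frame(scale(train_raw, center = ctr, scale = scl))
test_scaled  <- as.data.frame(scale(test_raw,  center = ctr, scale = scl))
```

The rest of the analysis below keeps the original (whole-dataset) scaling so the printed output still matches.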

set.seed(12345)
idx <- sample(1:50, 40)
cars_train <- cars_norm[idx, ]
str(cars_train)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   40 obs. of  2 variables:
 $ speed: num  0.681 0.87 1.816 0.87 -0.265 ...
 $ dist : num  0.117 0.816 1.631 0.505 -0.271 ...
cars_test <- cars_norm[-idx, ]
str(cars_test)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   10 obs. of  2 variables:
 $ speed: num  -1.589 -1.021 -1.021 -0.454 -0.454 ...
 $ dist : num  -1.513 -0.969 -0.348 -0.348 0.117 ...
cars_train %>% ggplot(aes(x=speed, y=dist)) + 
  geom_point(size = 4) + 
  ggtitle("Training Data") +
  scale_x_continuous(limits = c(-2.2, 2)) +
  scale_y_continuous(limits = c(-2, 3))
cars_test %>% ggplot(aes(x=speed, y=dist)) + 
  geom_point(size = 4) + 
  ggtitle("Test Data") +
  scale_x_continuous(limits = c(-2.2, 2)) +
  scale_y_continuous(limits = c(-2, 3))

Fit a simple linear regression. Train a linear regression model. Predict the Test Data. Compare predicted values with the holdout values.

cars_lm <- cars_train %>% lm(dist ~ speed, data = .)
summary(cars_lm)

Call:
lm(formula = dist ~ speed, data = .)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.0337298 -0.3661916 -0.0613685  0.2381485  1.6623987 

Coefficients:
               Estimate  Std. Error  t value        Pr(>|t|)    
(Intercept) -0.03133762  0.08918628 -0.35137         0.72725    
speed        0.73450008  0.09457276  7.76651 0.0000000023147 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5638863 on 38 degrees of freedom
Multiple R-squared:  0.6135016, Adjusted R-squared:  0.6033306 
F-statistic: 60.31865 on 1 and 38 DF,  p-value: 0.000000002314661
predicted_lm_dist <- predict(cars_lm, cars_test)
# examine the correlation between predicted and actual values
cor(predicted_lm_dist, cars_test$dist)  
[1] 0.8696117656

Fit an NN. Train a neural network model. Compare the R code; it is very similar to the linear regression workflow.

library(neuralnet)
set.seed(12345)
cars_model <- cars_train %>% neuralnet(formula = dist ~ speed, 
        act.fct = "logistic", hidden = 3, linear.output=TRUE)
plot(cars_model)
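
The act.fct = "logistic" argument means each hidden unit applies the sigmoid 1/(1 + e^(-x)), which answers question 6. A quick base-R sketch of its shape:

```r
# logistic (sigmoid) activation: S-shaped curve rising from 0 to 1
curve(1 / (1 + exp(-x)), from = -6, to = 6,
      xlab = "x", ylab = "logistic(x)",
      main = "Logistic activation function")
abline(h = c(0, 1), lty = 2)  # horizontal asymptotes at 0 and 1
```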

A nicer plot can be drawn with the plotnet() function.

library(NeuralNetTools)
par(mar = numeric(4), family = 'serif')
plotnet(cars_model, alpha = 0.6)

Predict the Test Data. Compare predicted values with the holdout values.

model_results <- compute(cars_model, cars_test[1])
predicted_dist <- model_results$net.result
# examine the correlation between predicted and actual values
cor(predicted_dist, cars_test$dist)  
             [,1]
[1,] 0.8744086583

Plot the fitted models.

ggplot(data=cars_test, aes(x=speed, y=dist)) + 
  geom_point(size = 4) +
  geom_smooth(method='lm', formula=y~x, fill=NA) +
  geom_line(aes(y = predicted_dist)) +
  ggtitle("Test Data Fitted with a Linear Model (blue) and NN (black)") +
  scale_x_continuous(limits = c(-2.2, 2)) +
  scale_y_continuous(limits = c(-2, 3))

Example: Compare Simple Linear Regression to a deep, multilayer neural network.

  1. Do you think this model will overfit?
  2. What does parsimonious mean?
  3. Suggest a better measure for goodness-of-fit.

cars_model <- cars_train %>% neuralnet(formula = dist ~ speed, 
        act.fct = "logistic", hidden = c(10,5), linear.output=TRUE)
plot(cars_model)

A nicer plot can be drawn with the plotnet() function.

par(mar = numeric(4), family = 'serif')
plotnet(cars_model, alpha = 0.6)

Predict the Test Data. Compare predicted values with the holdout values.

model_results <- compute(cars_model, cars_test[1])
predicted_dist <- model_results$net.result
# examine the correlation between predicted and actual values
cor(predicted_dist, cars_test$dist)  
            [,1]
[1,] 0.889083417

Plot the fitted models.

ggplot(data=cars_test, aes(x=speed, y=dist)) + 
  geom_point(size = 4) +
  geom_smooth(method='lm', formula=y~x, fill=NA) +
  geom_line(aes(y = predicted_dist)) +
  ggtitle("Test Data Fitted with a Linear Model (blue) and NN (black)") +
  scale_x_continuous(limits = c(-2.2, 2)) +
  scale_y_continuous(limits = c(-2, 3))
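
On question 3: correlation rewards linear association but ignores bias and scale, so a better goodness-of-fit measure for prediction is an error metric such as test-set RMSE. A short sketch, reusing predicted_lm_dist and predicted_dist from the chunks above:

```r
# root mean squared error on the test set, one value per model
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(cars_test$dist, predicted_lm_dist)          # linear regression
rmse(cars_test$dist, as.vector(predicted_dist))  # deep neural network
```

The model with the smaller test RMSE predicts better on held-out data; a large gap between training and test RMSE would also flag overfitting (question 1).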
