Stat. 674 Project

Author

Prof. Eric A. Suess

Published

February 24, 2025

Class Project

For the COVID-19 data from the United States, develop the best forecasting model for confirmed cases and produce a forecast for the 30 days following the end of the available data. Note that the data are recorded daily.

See Exercise 10 in Section 9.11 for an example.

library(pacman)
p_load(tidyverse, fpp3, COVID19)
covid_data <- covid19(c("United States"))
We have invested a lot of time and effort in creating COVID-19 Data
Hub, please cite the following when using it:

  Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open
  Source Software 5(51):2376, doi: 10.21105/joss.02376

The implementation details and the latest version of the data are
described in:

  Guidotti, E., (2022), "A worldwide epidemiological database for
  COVID-19 at fine-grained spatial resolution", Sci Data 9(1):112, doi:
  10.1038/s41597-022-01245-1
To print citations in BibTeX format use:
 > print(citation('COVID19'), bibtex=TRUE)

To hide this message use 'verbose = FALSE'.
covid_data <- covid_data %>% as_tsibble(key = "id", index = "date")
  1. Produce an STL decomposition of the data and describe the trend and seasonality.

First, determine the start and end dates of the available data for confirmed cases.

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:

covid_data %>% autoplot(confirmed)
Warning: Removed 415 rows containing missing values or values outside the scale range
(`geom_line()`).
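
As one possible starting point, the sketch below drops the days with no recorded value, reports the first and last available dates, and fits an STL decomposition. It assumes that `covid_data` is the tsibble created above, that the missing values sit only at the start and end of the series (so no gaps are created by filtering), and that a weekly seasonal period of 7 days is appropriate for daily reporting. The name `covid_cases` is introduced here for illustration only.

```r
library(fpp3)

# Assumes covid_data is the tsibble built above (key = id, index = date).
# Keep only the days that have a recorded cumulative count.
covid_cases <- covid_data %>%
  filter(!is.na(confirmed))

# First and last dates with a confirmed-case value
range(covid_cases$date)

# STL needs a regular series without gaps; this checks that assumption.
has_gaps(covid_cases)

# STL decomposition with an assumed weekly (7-day) seasonal period
covid_cases %>%
  model(stl = STL(confirmed ~ trend() + season(period = 7), robust = TRUE)) %>%
  components() %>%
  autoplot()
```

If `has_gaps()` reports `TRUE`, the series would need its gaps handled (for example by restricting to a contiguous date range) before STL can be fitted.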

  2. Do the data need transforming? If so, find a suitable transformation.

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:
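
A minimal sketch of one way to look for a transformation, assuming `covid_data` is the tsibble created above. It uses the Guerrero feature to suggest a Box-Cox parameter and then plots the transformed series. The names `covid_cases` and `lambda` are introduced here for illustration, and whether the suggested value is actually suitable should be judged from the plot.

```r
library(fpp3)

# Assumes covid_data is the tsibble built above; drop days with no value.
covid_cases <- covid_data %>%
  filter(!is.na(confirmed))

# Box-Cox parameter suggested by the Guerrero method
lambda <- covid_cases %>%
  features(confirmed, features = guerrero) %>%
  pull(lambda_guerrero)
lambda

# Plot the transformed series to judge whether the variance looks more stable
covid_cases %>%
  autoplot(box_cox(confirmed, lambda))
```

A value of `lambda` near 0 corresponds to a log transformation, and a value near 1 means little or no transformation is needed.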

  3. Are the data approximately stationary? If not, find an appropriate differencing which yields approximately stationary data.

Try differencing once, then twice. It may not be possible to transform the data so that both the mean and the variance are constant; aim for at least a constant mean.

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:
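
The differencing suggested above can be explored along the following lines. This is a sketch that assumes `covid_data` is the tsibble created above and that `confirmed` is a cumulative count, so its first difference is the number of new cases per day. The name `covid_cases` and the choice of 28 lags are illustrative.

```r
library(fpp3)

# Assumes covid_data is the tsibble built above; drop days with no value.
covid_cases <- covid_data %>%
  filter(!is.na(confirmed))

# KPSS test plus the number of first and seasonal differences
# suggested by unit-root tests
covid_cases %>%
  features(confirmed, list(unitroot_kpss, unitroot_ndiffs, unitroot_nsdiffs))

# Differenced once: time plot, ACF and PACF
covid_cases %>%
  gg_tsdisplay(difference(confirmed), plot_type = "partial", lag_max = 28)

# Differenced twice
covid_cases %>%
  gg_tsdisplay(difference(confirmed, differences = 2),
               plot_type = "partial", lag_max = 28)
```

A small KPSS p-value indicates that the series is not stationary and that differencing is needed; the plots show whether the mean of the differenced series looks roughly constant.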

  4. Identify two ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AICc values?

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Hint: A possible best model is of the form ARIMA(3,1,0)(0,1,1).

Code and Comments:
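
A sketch of how two candidate models might be fitted and compared, assuming `covid_data` is the tsibble created above. The first model follows the form given in the hint with an assumed seasonal period of 7 days and no constant; the second is an arbitrary alternative with the same differencing, chosen only for illustration. The names `covid_cases`, `fit`, `hinted` and `alt` are hypothetical.

```r
library(fpp3)

# Assumes covid_data is the tsibble built above; drop days with no value.
covid_cases <- covid_data %>%
  filter(!is.na(confirmed))

# Two candidates with the same differencing (d = 1, D = 1), so that
# their AICc values are comparable. The period of 7 is an assumption.
fit <- covid_cases %>%
  model(
    hinted = ARIMA(confirmed ~ 0 + pdq(3, 1, 0) + PDQ(0, 1, 1, period = 7)),
    alt    = ARIMA(confirmed ~ 0 + pdq(0, 1, 3) + PDQ(0, 1, 1, period = 7))
  )

# Smaller AICc is better
glance(fit) %>%
  arrange(AICc) %>%
  select(.model, AICc)
```

AICc values should only be compared between models that use the same orders of differencing and the same transformation.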

  5. Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise? If not, try to find another ARIMA model which fits better.

Hint: Use ARIMA() to search for the best model automatically.

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:
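
The diagnostic steps can be sketched as follows, assuming `covid_data` is the tsibble created above. Here `ARIMA()` is left to choose the orders automatically; a specific model from the previous question could be substituted. The names `covid_cases`, `fit` and `n_coef` are illustrative, and the choice of 14 lags (twice an assumed weekly period) is a rule of thumb rather than a requirement.

```r
library(fpp3)

# Assumes covid_data is the tsibble built above; drop days with no value.
covid_cases <- covid_data %>%
  filter(!is.na(confirmed))

# Let ARIMA() search for the orders automatically
fit <- covid_cases %>%
  model(arima = ARIMA(confirmed))

# Estimated parameters and information criteria
report(fit)

# Residual time plot, ACF and histogram
fit %>% gg_tsresiduals()

# Ljung-Box test, with degrees of freedom set to the number of
# estimated coefficients in the fitted model
n_coef <- nrow(tidy(fit))
augment(fit) %>%
  features(.innov, ljung_box, lag = 14, dof = n_coef)
```

A large Ljung-Box p-value is consistent with white-noise residuals; a small one suggests that the model has left some autocorrelation unexplained.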

  6. Forecast the next 25 weeks.

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.

Code and Comments:
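
A sketch of producing the forecast, assuming `covid_data` is the tsibble created above and using an automatically selected ARIMA model as a stand-in for whichever model was chosen earlier. Because the data are daily, 25 weeks corresponds to 175 steps; setting `h = 30` would give a 30-day horizon instead. The names `covid_cases`, `fit` and `fc` are illustrative.

```r
library(fpp3)

# Assumes covid_data is the tsibble built above; drop days with no value.
covid_cases <- covid_data %>%
  filter(!is.na(confirmed))

# Stand-in model; replace with the model chosen in the earlier questions
fit <- covid_cases %>%
  model(arima = ARIMA(confirmed))

# 25 weeks of daily forecasts = 25 * 7 = 175 steps ahead
fc <- fit %>%
  forecast(h = 25 * 7)

# Plot the forecasts with prediction intervals against the observed data
fc %>% autoplot(covid_cases)

# Tabulate the 95% prediction intervals to see how quickly they widen
fc %>% hilo(level = 95)
```

Comparing the interval widths at different horizons gives a basis for judging how far ahead the forecasts remain usable.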

  7. Eventually, the prediction intervals are so wide that the forecasts are not particularly useful. How many weeks of forecasts do you think are sufficiently accurate to be usable?

Answer:

Summarize your answer to the question here. All code and comments should be below and your written answer above.