In this R Notebook we load the Chicago Crime 2001 - present data into R using a data.table.
First, download the data as one large .csv file, or try the SODA API.
To see what is possible with the data, check out Minging Chi - Chicago Crime Data Visualization R Shiny App.
library(pacman)
p_load(RSocrata, tidyverse, arsenal, dtplyr, data.table, fst, tictoc, lubridate, disk.frame, beepr)
We could download the data using the Socrata API. However, for a dataset of this size that takes far too long.
# ChiCrimeDataFrame <- read.socrata("https://data.cityofchicago.org/api/odata/v4/ijzp-q8t2")
# nrow(ChiCrimeDataFrame)
So we should download the .csv file just once, into the data/ directory. This is also very slow.
# download.file("https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD", "data/Crimes_-_2001_to_present.csv")
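Because the download is slow, it helps to guard it so that it only runs when the file is missing. The following is a minimal sketch; `download_once()` is a hypothetical helper written for this notebook, not part of any package, and the demo URL is a placeholder that is never fetched.

```r
# Hypothetical helper: download `url` to `destfile` only if it is not there yet.
download_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("Using cached copy: ", destfile)
    return(invisible(FALSE))
  }
  # Create the target directory (e.g. data/) if needed, then fetch the file.
  dir.create(dirname(destfile), recursive = TRUE, showWarnings = FALSE)
  download.file(url, destfile, mode = "wb")
  invisible(TRUE)
}

# Demo with a file that already exists, so nothing is downloaded.
cached <- tempfile(fileext = ".csv")
writeLines("ID,Year", cached)
downloaded <- download_once("https://example.invalid/never-fetched.csv", cached)
downloaded
```

For the real data, the call would take the URL and the data/Crimes_-_2001_to_present.csv path from the commented line above.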
tic()
ChiCrimes_dt <- fread("data/Crimes_-_2001_to_present.csv", nThread = 12)
toc()
## 14.395 sec elapsed
#
tic()
fwrite(ChiCrimes_dt, file="data/Crimes_-_2001_to_present01.csv", nThread=12)
toc()
## 2.139 sec elapsed
#
tic()
ChiCrimes_df <- read_csv("data/Crimes_-_2001_to_present.csv")
## Warning: 35324024 parsing failures.
## row col expected actual file
## 63303 X Coordinate 1/0/T/F/TRUE/FALSE 1181051 'data/Crimes_-_2001_to_present.csv'
## 63303 Y Coordinate 1/0/T/F/TRUE/FALSE 1837225 'data/Crimes_-_2001_to_present.csv'
## 63303 Latitude 1/0/T/F/TRUE/FALSE 41.708589 'data/Crimes_-_2001_to_present.csv'
## 63303 Longitude 1/0/T/F/TRUE/FALSE -87.612583094 'data/Crimes_-_2001_to_present.csv'
## 63303 Location 1/0/T/F/TRUE/FALSE (41.708589, -87.612583094) 'data/Crimes_-_2001_to_present.csv'
## ..... ............ .................. .......................... ...................................
## See problems(...) for more details.
toc()
## 46.694 sec elapsed
#
comparedf(ChiCrimes_dt, ChiCrimes_df)
## Compare Object
##
## Function Call:
## comparedf(x = ChiCrimes_dt, y = ChiCrimes_df)
##
## Shared: 22 non-by variables and 7134610 observations.
## Not shared: 0 variables and 0 observations.
##
## Differences found in 2/11 variables compared.
## 0 variables compared have non-identical attributes.
rm(ChiCrimes_df)
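The parsing failures reported by read_csv() above ("expected 1/0/T/F/TRUE/FALSE") are what readr prints when a column was guessed as logical from leading rows that were all empty and numbers show up later. A minimal sketch of one way around this, using a made-up toy file rather than the crime data, is to declare the column types up front:

```r
library(readr)

# Toy file: the first 1500 values of x_coord are missing, the rest are numeric.
toy_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  id = 1:2000,
  x_coord = c(rep(NA, 1500), seq(1181051, by = 1, length.out = 500))
)
write_csv(toy, toy_path, na = "")

# Declaring the types means readr does not have to guess from leading blanks.
toy_typed <- read_csv(
  toy_path,
  col_types = cols(id = col_integer(), x_coord = col_double())
)
str(toy_typed$x_coord)
```

For the crime file the same idea would apply to the columns named in the warning (`X Coordinate`, `Y Coordinate`, Latitude, Longitude, Location); raising `guess_max` is an alternative, at the cost of a slower read.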
As an aside, check out the fst package and its write_fst() function, which stores the data in a compressed binary format. How much space is saved by storing the data in a .fst file?
tic()
write_fst(ChiCrimes_dt, "data/Crimes_-_2001_to_present.fst", compress = 100, uniform_encoding = TRUE)
toc()
## 13.39 sec elapsed
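One way to answer the space question is to compare file sizes with file.size(). The sketch below uses a small made-up table written to temporary files, so the numbers it prints say nothing about the crime data; pointing the two paths at the real .csv and .fst files in data/ would give the actual answer.

```r
library(data.table)
library(fst)

# Made-up stand-in for the crime table.
set.seed(1)
toy_dt <- data.table(
  id = 1:10000,
  type = sample(c("THEFT", "BATTERY", "ASSAULT"), 10000, replace = TRUE),
  arrest = sample(c(TRUE, FALSE), 10000, replace = TRUE)
)

csv_path <- tempfile(fileext = ".csv")
fst_path <- tempfile(fileext = ".fst")
fwrite(toy_dt, csv_path)
write_fst(toy_dt, fst_path, compress = 100)

# Sizes in bytes, and the .fst size as a fraction of the .csv size.
sizes <- c(csv = file.size(csv_path), fst = file.size(fst_path))
sizes
unname(sizes["fst"] / sizes["csv"])
```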
Then we can read the data back into a data.table using the read_fst() function with as.data.table = TRUE.
tic()
ChiCrimes2_dt <- read_fst("data/Crimes_-_2001_to_present.fst",
  columns = NULL,
  from = 1,
  to = NULL,
  as.data.table = TRUE,
  old_format = FALSE
)
toc()
## 7.717 sec elapsed
dim(ChiCrimes2_dt)
## [1] 7134610 22
rm(ChiCrimes2_dt)
Note that the fst package can create a reference to the .fst file saved on your hard drive. This reference can be used to access the data without loading all of it into memory.
ChiCrimes2_fs <- fst("data/Crimes_-_2001_to_present.fst")
dim(ChiCrimes2_fs)
## [1] 7134610 22
##In memory
ChiCrimes2_fs %>% dim()
## [1] 7134610 22
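The point of such a reference is that only the rows and columns you ask for are read from disk. A minimal sketch on a made-up toy file (the column names here are illustrative, not those of the crime data):

```r
library(fst)

# Write a small toy table to a temporary .fst file.
toy_path <- tempfile(fileext = ".fst")
toy <- data.frame(
  id = 1:1000,
  year = rep(2001:2020, each = 50),
  arrest = rep(c(TRUE, FALSE), 500)
)
write_fst(toy, toy_path)

# fst() returns a reference; indexing it reads only the requested subset.
toy_fs <- fst(toy_path)
first_rows <- toy_fs[1:5, c("id", "year")]
years <- toy_fs$year

# read_fst() can do the same partial read via columns/from/to.
same_rows <- read_fst(toy_path, columns = c("id", "year"), from = 1, to = 5)
first_rows
```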
Analyze the data in the data.table
ChiCrimes_dt %>% head()
tic()
ChiCrimes_dt %>% group_by(`FBI Code`) %>%
  summarise(n=n())
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## *
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.
## `summarise()` ungrouping output (override with `.groups` argument)
toc()
## 0.119 sec elapsed
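The warning above suggests wrapping the data.table in lazy_dt() so that dplyr verbs are translated into data.table code instead of falling back to the data frame implementation. A minimal sketch on a made-up toy table (the `fbi_code` column stands in for `FBI Code`), with the native data.table form alongside for comparison:

```r
library(data.table)
library(dplyr)
library(dtplyr)

toy_dt <- data.table(fbi_code = c("06", "06", "08B", "14", "08B", "06"))

# dplyr syntax, translated to data.table by dtplyr; as_tibble() triggers evaluation.
counts_lazy <- toy_dt %>%
  lazy_dt() %>%
  group_by(fbi_code) %>%
  summarise(n = n()) %>%
  as_tibble()

# The same count written directly in data.table syntax.
counts_native <- toy_dt[, .(n = .N), by = fbi_code]

counts_lazy
```

Calling show_query() on the lazy pipeline before as_tibble() prints the data.table expression that dtplyr generated.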
ChiCrimes_dt %>% group_by(Year) %>%
  summarise(n=n())
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## *
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.
## `summarise()` ungrouping output (override with `.groups` argument)
ChiCrimes_dt %>% group_by(Year, `Community Area`) %>%
  summarise(n=n(), Arrest_total = sum(Arrest), Arrest_rate=mean(Arrest)) %>%
  ggplot(aes(x=`Community Area`, y=`Arrest_rate`)) +
  geom_point() +
  geom_smooth()
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## *
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
Let's recreate the plots from the dashboard.
Number of Reported Crimes by Date. The date is when the incident occurred; this is sometimes a best estimate.
ChiCrimes_dt %>% mutate(Date2 = mdy_hms(Date, tz=Sys.timezone())) %>%
  mutate(day = date(Date2)) %>%
  group_by(day) %>%
  summarize(n=n()) %>%
  ggplot(aes(x=day, y=n)) +
  geom_line()
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## *
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.
## `summarise()` ungrouping output (override with `.groups` argument)
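Parsing with Sys.timezone() ties the result to whichever machine runs the notebook. As a sketch of an alternative, assuming the timestamps are Chicago local times in "MM/DD/YYYY hh:mm:ss AM/PM" form (the toy values below are made up), the time zone can be fixed explicitly and the daily counts computed in data.table syntax:

```r
library(data.table)
library(lubridate)

# Made-up timestamps in the assumed format.
toy_dt <- data.table(
  Date = c("01/01/2001 11:00:00 PM",
           "01/01/2001 01:30:00 AM",
           "01/02/2001 12:15:00 PM")
)

# Parse as Chicago local time, independent of the machine's own time zone.
toy_dt[, Date2 := mdy_hms(Date, tz = "America/Chicago")]

# Calendar day in the same time zone, then one row per day with its count.
toy_dt[, day := lubridate::as_date(Date2)]
daily <- toy_dt[, .(n = .N), by = day]
daily
```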
Number of Reported Crimes by Primary Type. The primary description of the IUCR code.
ChiCrimes_dt %>%
  ggplot(aes(Arrest)) +
  geom_bar()
ChiCrimes_dt %>%
  ggplot(aes(`Location Description`)) +
  geom_bar()
ChiCrimes_dt %>%
  ggplot(aes(`Primary Type`)) +
  geom_bar()
ChiCrimes_dt %>%
  ggplot(aes(`District`)) +
  geom_bar()
## Warning: Removed 47 rows containing non-finite values (stat_count).
ChiCrimes_dt %>%
  ggplot(aes(`Domestic`)) +
  geom_bar()
setup_disk.frame()
## The number of workers available for disk.frame is 6
# this will allow unlimited amount of data to be passed from worker to worker
options(future.globals.maxSize = Inf)
library(nycflights13)
# convert the flights data to a disk.frame, store it in the folder
# "data_disk_frame/tmp_flights.df", and overwrite any existing content
flights.df <- as.disk.frame(
  flights,
  outdir = file.path("data_disk_frame", "tmp_flights.df"),
  overwrite = TRUE)
flights.df
## path: "data_disk_frame/tmp_flights.df"
## nchunks: 6
## nrow (at source): 336776
## ncol (at source): 19
## nrow (post operations): ???
## ncol (post operations): ???
flights.df1 <- select(flights.df, year:day, arr_delay, dep_delay)
flights.df1
## path: "data_disk_frame/tmp_flights.df"
## nchunks: 6
## nrow (at source): 336776
## ncol (at source): 19
## nrow (post operations): ???
## ncol (post operations): ???
class(flights.df1)
## [1] "disk.frame" "disk.frame.folder"
collect(flights.df1) %>% head(2)
filter(flights.df, dep_delay > 1000) %>% collect %>% head(2)
c4 <- flights.df %>%
  filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
  select(carrier, dep_delay, air_time, distance) %>%
  mutate(air_time_hours = air_time / 60) %>%
  collect %>%
  arrange(carrier) # arrange should occur after `collect`
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## *
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.
c4 %>% head
flights.df %>%
  group_by(carrier) %>% # older disk.frame versions needed hard_group_by() here
  summarize(count = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% # mean follows normal R rules
  collect %>%
  arrange(carrier)
flights.sample <- flights.df %>% sample_frac(0.01) %>%
  collect
flights.sample
beep();beep();beep(sound=4)