In this R Notebook we load the Chicago Crime 2001 - present data into R using a data.table.

First, download the data as a large .csv file, or try the SODA API.

To see what is possible with the data, check out Minging Chi - Chicago Crime Data Visualization R Shiny App.

library(pacman)
p_load(RSocrata, tidyverse, arsenal, dtplyr, data.table, fst, tictoc, lubridate, disk.frame, beepr)

We could download the data using the Socrata API, but this takes far too long.

# ChiCrimeDataFrame <- read.socrata("https://data.cityofchicago.org/api/odata/v4/ijzp-q8t2")

# nrow(ChiCrimeDataFrame) 

So we download the .csv file once into the data/ directory instead. This is also very slow.

# download.file("https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD", "data/Crimes_-_2001_to_present.csv")

Read the data into a data.table

tic()
ChiCrimes_dt <- fread("data/Crimes_-_2001_to_present.csv", nThread = 12)
toc()
## 14.395 sec elapsed

#Parallel fread

tic()
fwrite(ChiCrimes_dt, file="data/Crimes_-_2001_to_present01.csv", nThread=12)
toc()
## 2.139 sec elapsed

#Parallel fwrite
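
The thread count of 12 used above is specific to the machine the notebook was run on. As a minimal sketch, assuming a toy table and a temporary file (both hypothetical stand-ins for the crime data), the thread count can be taken from data.table itself with getDTthreads():

```r
library(data.table)

# Toy table standing in for the crime data (hypothetical values).
toy_dt <- data.table(ID = 1:1000, Year = rep(2001:2010, each = 100))

# Ask data.table how many threads it is configured to use,
# rather than hard-coding a machine-specific number.
n_threads <- getDTthreads()

# Write the table to a temporary file and read it back.
tmp_csv <- tempfile(fileext = ".csv")
fwrite(toy_dt, file = tmp_csv, nThread = n_threads)
toy_back <- fread(tmp_csv, nThread = n_threads)

# The round trip should preserve the shape of the table.
dim(toy_back)
unlink(tmp_csv)
```

On the real file, the same n_threads value could be passed to the fread() and fwrite() calls above in place of 12.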

Using read_csv() from the readr package is much slower.

tic()
ChiCrimes_df <- read_csv("data/Crimes_-_2001_to_present.csv")
## Warning: 35324024 parsing failures.
##   row          col           expected                     actual                                file
## 63303 X Coordinate 1/0/T/F/TRUE/FALSE 1181051                    'data/Crimes_-_2001_to_present.csv'
## 63303 Y Coordinate 1/0/T/F/TRUE/FALSE 1837225                    'data/Crimes_-_2001_to_present.csv'
## 63303 Latitude     1/0/T/F/TRUE/FALSE 41.708589                  'data/Crimes_-_2001_to_present.csv'
## 63303 Longitude    1/0/T/F/TRUE/FALSE -87.612583094              'data/Crimes_-_2001_to_present.csv'
## 63303 Location     1/0/T/F/TRUE/FALSE (41.708589, -87.612583094) 'data/Crimes_-_2001_to_present.csv'
## ..... ............ .................. .......................... ...................................
## See problems(...) for more details.
toc()
## 46.694 sec elapsed

#Not parallel read_csv
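
The parsing failures above are consistent with readr guessing column types from the first rows of the file: if the coordinate columns are empty there, they are guessed as logical, and the numeric values further down fail to parse. That is a likely explanation rather than a confirmed one. The sketch below, which uses a hypothetical toy file, shows one way to avoid the guess by declaring the column types explicitly.

```r
library(readr)

# Toy CSV: the coordinate column is empty in the early rows and
# numeric later, mimicking the pattern suggested by the warnings.
tmp_csv <- tempfile(fileext = ".csv")
toy <- data.frame(
  ID = 1:1500,
  `X Coordinate` = c(rep(NA, 1200), seq(1181051, length.out = 300)),
  check.names = FALSE
)
write.csv(toy, tmp_csv, row.names = FALSE, na = "")

# Declaring the types up front means readr does not have to guess them.
toy_df <- read_csv(
  tmp_csv,
  col_types = cols(ID = col_integer(), `X Coordinate` = col_double())
)

# Number of coordinate values that were parsed as numbers.
sum(!is.na(toy_df$`X Coordinate`))
unlink(tmp_csv)
```

An alternative is to raise the guess_max argument of read_csv() so that more rows are inspected before the types are chosen.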

comparedf(ChiCrimes_dt, ChiCrimes_df)
## Compare Object
## 
## Function Call: 
## comparedf(x = ChiCrimes_dt, y = ChiCrimes_df)
## 
## Shared: 22 non-by variables and 7134610 observations.
## Not shared: 0 variables and 0 observations.
## 
## Differences found in 2/11 variables compared.
## 0 variables compared have non-identical attributes.
rm(ChiCrimes_df)

As an aside, check out the fst package and the write_fst() function that stores the data in a compressed format. How much space is saved storing the data in a .fst file?

tic()
write_fst(ChiCrimes_dt, "data/Crimes_-_2001_to_present.fst", compress = 100, uniform_encoding = TRUE)
toc()
## 13.39 sec elapsed
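
One way to answer the question about saved space is to compare file sizes with file.size(). The sketch below does this for a small, hypothetical toy data frame written to temporary files; the resulting ratio depends heavily on the data, so it says nothing about the crime data itself.

```r
library(fst)

# Toy data with repetitive values (hypothetical stand-in for the crime data).
toy <- data.frame(
  ID = 1:100000,
  Type = rep(c("THEFT", "BATTERY", "ASSAULT", "NARCOTICS"), 25000),
  stringsAsFactors = FALSE
)

# Write the same data once as .csv and once as a compressed .fst file.
tmp_csv <- tempfile(fileext = ".csv")
tmp_fst <- tempfile(fileext = ".fst")
write.csv(toy, tmp_csv, row.names = FALSE)
write_fst(toy, tmp_fst, compress = 100)

# Sizes in bytes, and the .fst size as a fraction of the .csv size.
sizes <- file.size(c(tmp_csv, tmp_fst))
size_ratio <- sizes[2] / sizes[1]
size_ratio
```

For the notebook's own files, the same call would be file.size() on "data/Crimes_-_2001_to_present.csv" and "data/Crimes_-_2001_to_present.fst".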

Then we can read the data back into a data.table using the read_fst() function.

tic()
ChiCrimes2_dt <- read_fst("data/Crimes_-_2001_to_present.fst",
  columns = NULL,
  from = 1,
  to = NULL,
  as.data.table = TRUE,
  old_format = FALSE
)
toc()
## 7.717 sec elapsed
dim(ChiCrimes2_dt)
## [1] 7134610      22
rm(ChiCrimes2_dt)

Note that the fst package can also create a reference to the .fst file saved on your hard drive. This reference can be used to query the data without loading the whole file into memory.

ChiCrimes2_fs <- fst("data/Crimes_-_2001_to_present.fst")

dim(ChiCrimes2_fs)
## [1] 7134610      22

##In memory

ChiCrimes2_fs %>% dim()
## [1] 7134610      22
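
As a rough illustration of what the reference is good for, the sketch below subsets a small, hypothetical .fst file by row range and column name. It assumes the `[` method of the object returned by fst(), which reads the requested rows and columns from disk and returns them as a data frame.

```r
library(fst)

# Small toy file (hypothetical stand-in for the crime data).
toy <- data.frame(
  ID = 1:10,
  Year = rep(2001:2002, each = 5),
  Arrest = rep(c(TRUE, FALSE), 5)
)
tmp_fst <- tempfile(fileext = ".fst")
write_fst(toy, tmp_fst)

# Create the on-disk reference; no column data is loaded at this point.
toy_fs <- fst(tmp_fst)

# Read rows 3 to 6 of the two named columns.
subset_df <- toy_fs[3:6, c("ID", "Arrest")]
subset_df
```

The same pattern on ChiCrimes2_fs would pull a slice of the crime data without reading all 22 columns.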

Analyze the data in the data.table

ChiCrimes_dt %>% head()
tic()
ChiCrimes_dt %>% group_by(`FBI Code`) %>%
  summarise(n=n())
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## * 
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.
## `summarise()` ungrouping output (override with `.groups` argument)
toc()
## 0.119 sec elapsed
ChiCrimes_dt %>% group_by(Year) %>%
  summarise(n=n())
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## * 
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.
## `summarise()` ungrouping output (override with `.groups` argument)
ChiCrimes_dt %>% group_by(Year, `Community Area`) %>%
  summarise(n=n(), Arrest_total = sum(Arrest), Arrest_rate=mean(Arrest)) %>%
  ggplot(aes(x=`Community Area`, y=`Arrest_rate`)) +
  geom_point() +
  geom_smooth()
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## * 
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.
## `summarise()` regrouping output by 'Year' (override with `.groups` argument)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
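
The repeated warning above suggests wrapping the data.table in lazy_dt() so that dtplyr translates the dplyr verbs into data.table code. A minimal sketch on a hypothetical toy table, using the same grouping and summary as above:

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Toy table with the two columns used in the summary (hypothetical values).
toy_dt <- data.table(
  Year = c(2001, 2001, 2001, 2002, 2002),
  Arrest = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# lazy_dt() records the dplyr verbs and translates them to data.table;
# as.data.table() triggers the computation and returns the result.
by_year <- toy_dt %>%
  lazy_dt() %>%
  group_by(Year) %>%
  summarise(n = n(), Arrest_rate = mean(Arrest)) %>%
  as.data.table()

by_year
```

The native data.table equivalent would be toy_dt[, .(n = .N, Arrest_rate = mean(Arrest)), by = Year].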

Let's recreate the plots from the dashboard.

Number of Reported Crimes by Date. Date is when the incident occurred; this is sometimes a best estimate.

ChiCrimes_dt %>% mutate(Date2 = mdy_hms(Date,tz=Sys.timezone())) %>%
  mutate(day = date(Date2)) %>%
  group_by(day) %>%
  summarize(n=n()) %>%
  ggplot(aes(x=day, y=n)) +
  geom_line()
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## * 
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.

## `summarise()` ungrouping output (override with `.groups` argument)
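
The date handling in the pipeline above can be tried in isolation. The sketch below assumes the Date column uses a "month/day/year hour:minute:second AM/PM" layout, which is what mdy_hms() implies; the timestamps are made up. It also fixes the time zone to "America/Chicago" instead of Sys.timezone(), an assumption made here so that the result does not depend on the machine running the code.

```r
library(lubridate)

# Hypothetical timestamps in the assumed layout.
raw_dates <- c(
  "01/15/2020 11:30:00 PM",
  "01/15/2020 08:05:00 AM",
  "01/16/2020 12:10:00 AM"
)

# Parse to date-times, then keep only the calendar day.
parsed <- mdy_hms(raw_dates, tz = "America/Chicago")
day <- date(parsed)

# Count incidents per day, as the group_by()/summarize() step does.
daily_counts <- table(day)
daily_counts
```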

Number of Reported Crimes by Primary Type. Primary Type is the primary description of the IUCR code.

ChiCrimes_dt %>% 
  ggplot(aes(Arrest)) +
  geom_bar()

ChiCrimes_dt %>% 
  ggplot(aes(`Location Description`)) +
  geom_bar()

ChiCrimes_dt %>% 
  ggplot(aes(`Primary Type`)) +
  geom_bar()

ChiCrimes_dt %>% 
  ggplot(aes(`District`)) +
  geom_bar()
## Warning: Removed 47 rows containing non-finite values (stat_count).

ChiCrimes_dt %>% 
  ggplot(aes(`Domestic`)) +
  geom_bar()

Test out disk.frame

setup_disk.frame()
## The number of workers available for disk.frame is 6
# this will allow unlimited amount of data to be passed from worker to worker
options(future.globals.maxSize = Inf)
library(nycflights13)

# convert the flights data to a disk.frame, store it in the folder
# "data_disk_frame/tmp_flights.df", and overwrite any existing content
flights.df <- as.disk.frame(
  flights, 
  outdir = file.path("data_disk_frame", "tmp_flights.df"),
  overwrite = TRUE)
flights.df
## path: "data_disk_frame/tmp_flights.df"
## nchunks: 6
## nrow (at source): 336776
## ncol (at source): 19
## nrow (post operations): ???
## ncol (post operations): ???
flights.df1 <- select(flights.df, year:day, arr_delay, dep_delay)
flights.df1
## path: "data_disk_frame/tmp_flights.df"
## nchunks: 6
## nrow (at source): 336776
## ncol (at source): 19
## nrow (post operations): ???
## ncol (post operations): ???
class(flights.df1)
## [1] "disk.frame"        "disk.frame.folder"
collect(flights.df1) %>% head(2)
filter(flights.df, dep_delay > 1000) %>% collect %>% head(2)
c4 <- flights.df %>%
  filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
  select(carrier, dep_delay, air_time, distance) %>%
  mutate(air_time_hours = air_time / 60) %>%
  collect %>%
  arrange(carrier) # arrange should occur after `collect`
## Warning: You are using a dplyr method on a raw data.table, which will call the
## * data frame implementation, and is likely to be inefficient.
## * 
## * To suppress this message, either generate a data.table translation with
## * `lazy_dt()` or convert to a data frame or tibble with
## * `as.data.frame()`/`as_tibble()`.

c4  %>% head
flights.df %>%
  group_by(carrier) %>% # a plain group_by is used here, not hard_group_by
  summarize(count = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%  # mean follows normal R rules
  collect %>% 
  arrange(carrier)
flights.sample <- flights.df %>% sample_frac(0.01) %>% 
  collect 
flights.sample
beep();beep();beep(sound=4)