In this R Notebook we load the Chicago Crime 2001 - present data into R using a data.table.
First, download the data as a large .csv file or try the SODA API.
To see what is possible with the data, check out Minging Chi - Chicago Crime Data Visualization R Shiny App.
library(pacman)
p_load(RSocrata, tidyverse, arsenal, dtplyr, data.table, fst, tictoc, lubridate, disk.frame, beepr)
We could download the data using the Socrata API. However, this takes far too long for a dataset with millions of rows.
# ChiCrimeDataFrame <- read.socrata("https://data.cityofchicago.org/api/odata/v4/ijzp-q8t2")
# nrow(ChiCrimeDataFrame)
So we download the .csv file once to the data/ directory. This is also very slow, which is why the call is commented out after the first run.
# download.file("https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD", "data/Crimes_-_2001_to_present.csv")
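Since the download is slow, it helps to skip it when the file is already on disk. A minimal sketch, assuming a hypothetical helper `download_once()` (it is not part of any package used here):

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not exist yet.
download_once <- function(url, dest) {
  if (file.exists(dest)) {
    return(invisible(FALSE))  # nothing to do, the file is already there
  }
  # Create the target directory (e.g. "data/") if it is missing
  dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
  download.file(url, dest, mode = "wb")
  invisible(TRUE)
}

# Usage (commented out, like the download above, because the file is large):
# download_once(
#   "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD",
#   "data/Crimes_-_2001_to_present.csv"
# )
```

The helper returns TRUE invisibly when it downloaded something and FALSE when the file was already present, so the notebook can be re-run without repeating the download.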
tic()
ChiCrimes_dt <- fread("data/Crimes_-_2001_to_present.csv", nThread = 12)
toc()
17.059 sec elapsed
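If only a few columns are needed, `fread()` can skip the rest through its `select` argument, which may reduce both load time and memory use. A small sketch on a toy file; the toy values are made up, and only the column names are taken from the crimes data:

```r
library(data.table)

# Toy stand-in for the crimes file (values are invented for illustration)
toy <- data.table(
  ID = 1:3,
  `Primary Type` = c("THEFT", "BATTERY", "THEFT"),
  Arrest = c(TRUE, FALSE, FALSE),
  Year = c(2001L, 2002L, 2002L)
)
tmp <- tempfile(fileext = ".csv")
fwrite(toy, tmp)

# Read only the columns needed for the analysis; ID is never parsed
subset_dt <- fread(tmp, select = c("Primary Type", "Arrest", "Year"))
print(subset_dt)
```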
#
tic()
fwrite(ChiCrimes_dt, file="data/Crimes_-_2001_to_present01.csv", nThread=12)
toc()
4.448 sec elapsed
#
tic()
ChiCrimes_df <- read_csv("data/Crimes_-_2001_to_present.csv")
35324024 parsing failures.
row col expected actual file
63303 X Coordinate 1/0/T/F/TRUE/FALSE 1181051 'data/Crimes_-_2001_to_present.csv'
63303 Y Coordinate 1/0/T/F/TRUE/FALSE 1837225 'data/Crimes_-_2001_to_present.csv'
63303 Latitude 1/0/T/F/TRUE/FALSE 41.708589 'data/Crimes_-_2001_to_present.csv'
63303 Longitude 1/0/T/F/TRUE/FALSE -87.612583094 'data/Crimes_-_2001_to_present.csv'
63303 Location 1/0/T/F/TRUE/FALSE (41.708589, -87.612583094) 'data/Crimes_-_2001_to_present.csv'
..... ............ .................. .......................... ...................................
See problems(...) for more details.
toc()
50.645 sec elapsed
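The "expected 1/0/T/F/TRUE/FALSE" entries above indicate that readr guessed a logical type for the coordinate columns, presumably because the rows it inspected for guessing were empty in those columns. One way to avoid this is to declare the column types explicitly with `col_types` (raising `guess_max` is another option). A sketch on a toy file that mimics the situation; the toy data is an assumption, not a sample of the real file:

```r
library(readr)

# Toy file: "X Coordinate" is empty for the first 1200 rows, numeric afterwards
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(ID = 1:1500, x = c(rep(NA, 1200), seq_len(300)))
names(toy)[2] <- "X Coordinate"
write.csv(toy, tmp, row.names = FALSE, na = "")

# Declare the types up front so readr does not have to guess them
toy_df <- read_csv(
  tmp,
  col_types = cols(
    ID = col_integer(),
    `X Coordinate` = col_double()
  )
)

# Number of parsing problems recorded for this read
print(nrow(problems(toy_df)))
```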
#
comparedf(ChiCrimes_dt, ChiCrimes_df)
Compare Object
Function Call:
comparedf(x = ChiCrimes_dt, y = ChiCrimes_df)
Shared: 22 non-by variables and 7134610 observations.
Not shared: 0 variables and 0 observations.
Differences found in 2/11 variables compared.
0 variables compared have non-identical attributes.
rm(ChiCrimes_df)
As an aside, check out the fst package and the write_fst() function that stores the data in a compressed format. How much space is saved storing the data in a .fst file?
tic()
write_fst(ChiCrimes_dt, "data/Crimes_-_2001_to_present.fst", compress = 100, uniform_encoding = TRUE)
toc()
13.282 sec elapsed
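To answer the space question, the sizes of the two files can be compared with base R's `file.size()`. A sketch on toy data written to temporary files; for the real data the paths would be the .csv and .fst files in data/, and the ratio will depend on the data:

```r
library(data.table)
library(fst)

set.seed(1)
# Toy data with a few repetitive columns (values are invented)
toy <- data.table(
  Year = sample(2001:2020, 1e5, replace = TRUE),
  `Primary Type` = sample(c("THEFT", "BATTERY", "NARCOTICS"), 1e5, replace = TRUE),
  Arrest = sample(c(TRUE, FALSE), 1e5, replace = TRUE)
)
csv_path <- tempfile(fileext = ".csv")
fst_path <- tempfile(fileext = ".fst")
fwrite(toy, csv_path)
write_fst(toy, fst_path, compress = 100)

# File sizes in bytes, then in MiB, then the fst/csv ratio
sizes <- file.size(c(csv_path, fst_path))
names(sizes) <- c("csv", "fst")
print(sizes / 1024^2)
print(sizes[["fst"]] / sizes[["csv"]])
```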
Then we can read the data back into a data.table using the read_fst() function.
tic()
ChiCrimes2_dt <- read_fst(
  "data/Crimes_-_2001_to_present.fst",
  columns = NULL,
  from = 1,
  to = NULL,
  as.data.table = TRUE,
  old_format = FALSE
)
toc()
9.568 sec elapsed
dim(ChiCrimes2_dt)
[1] 7134610 22
rm(ChiCrimes2_dt)
Note that the fst package can also create a reference to the .fst file saved on your hard drive. This reference can be used to access the data without loading all of it into memory.
ChiCrimes2_fs <- fst("data/Crimes_-_2001_to_present.fst")
dim(ChiCrimes2_fs)
[1] 7134610 22
##In memory
ChiCrimes2_fs %>% dim()
[1] 7134610 22
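The reference can be subset like a data frame, and only the requested rows and columns are read from disk. A sketch on a toy .fst file; the toy values are invented, and `read_fst()` with `columns`, `from`, and `to` is shown as an alternative for partial reads:

```r
library(fst)

# Toy data written to a temporary .fst file
toy <- data.frame(
  Year = rep(2001:2005, each = 2),
  Arrest = rep(c(TRUE, FALSE), 5),
  District = 1:10
)
fst_path <- tempfile(fileext = ".fst")
write_fst(toy, fst_path)

toy_fs <- fst(fst_path)

# Read three rows of two columns through the reference
first_rows <- toy_fs[1:3, c("Year", "Arrest")]

# Read a single column through the reference
years <- toy_fs$Year

# Equivalent partial read without creating a reference
part <- read_fst(fst_path, columns = c("Year", "District"), from = 4, to = 6)
print(first_rows)
print(part)
```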
Analyze the data in the data.table
ChiCrimes_dt %>% head()
tic()
ChiCrimes_dt %>% group_by(`FBI Code`) %>%
summarise(n=n())
You are using a dplyr method on a raw data.table, which will call the
* data frame implementation, and is likely to be inefficient.
*
* To suppress this message, either generate a data.table translation with
* `lazy_dt()` or convert to a data frame or tibble with
* `as.data.frame()`/`as_tibble()`.
`summarise()` ungrouping output (override with `.groups` argument)
toc()
0.149 sec elapsed
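The message above suggests wrapping the data.table in `lazy_dt()` so that dtplyr translates the dplyr verbs into data.table code. A sketch on toy data (values are invented), with the native data.table form next to it for comparison:

```r
library(data.table)
library(dplyr)
library(dtplyr)

toy_dt <- data.table(
  `FBI Code` = c("06", "08B", "06", "18", "06"),
  Arrest = c(FALSE, TRUE, FALSE, TRUE, TRUE)
)

# dplyr syntax, translated to data.table by dtplyr and evaluated on as.data.table()
counts_lazy <- toy_dt %>%
  lazy_dt() %>%
  group_by(`FBI Code`) %>%
  summarise(n = n()) %>%
  as.data.table()

# The same count written directly in data.table syntax
counts_native <- toy_dt[, .(n = .N), by = "FBI Code"]

print(counts_lazy)
print(counts_native)
```

Nothing is computed until the lazy pipeline is collected with `as.data.table()`, `as_tibble()`, or `collect()`.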
ChiCrimes_dt %>% group_by(Year) %>%
summarise(n=n())
ChiCrimes_dt %>% group_by(Year, `Community Area`) %>%
summarise(n=n(), Arrest_total = sum(Arrest), Arrest_rate=mean(Arrest)) %>%
ggplot(aes(x=`Community Area`, y=`Arrest_rate`)) +
geom_point() +
geom_smooth()
`summarise()` regrouping output by 'Year' (override with `.groups` argument)
Let's recreate the plots from the dashboard.
Number of Reported Crimes by Date. This is the date when the incident occurred, which is sometimes a best estimate.
ChiCrimes_dt %>% mutate(Date2 = mdy_hms(Date,tz=Sys.timezone())) %>%
mutate(day = date(Date2)) %>%
group_by(day) %>%
summarize(n=n()) %>%
ggplot(aes(x=day, y=n)) +
geom_line()
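The daily counts can also be computed in data.table syntax, which avoids the data frame fallback. A sketch on toy data, assuming the `Date` strings use the "MM/DD/YYYY hh:mm:ss AM/PM" layout implied by the `mdy_hms()` call above; only the date part is parsed, so the time of day is ignored:

```r
library(data.table)

# Toy timestamps in the assumed layout (values are invented)
toy_dt <- data.table(
  Date = c(
    "01/02/2001 11:30:00 PM",
    "01/02/2001 01:15:00 AM",
    "01/03/2001 09:00:00 AM"
  )
)

# Parse the leading date part; as.Date() ignores the trailing time characters
toy_dt[, day := as.IDate(as.Date(Date, format = "%m/%d/%Y"))]

# Count incidents per day
daily <- toy_dt[, .(n = .N), by = day]
print(daily)
```

The resulting table has one row per day and can be passed to `ggplot(aes(x = day, y = n)) + geom_line()` as before.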
Number of Reported Crimes by Primary Type. The primary description of the IUCR code.
ChiCrimes_dt %>%
ggplot(aes(Arrest)) +
geom_bar()
ChiCrimes_dt %>%
ggplot(aes(`Location Description`)) +
geom_bar()
ChiCrimes_dt %>%
ggplot(aes(`Primary Type`)) +
geom_bar()
ChiCrimes_dt %>%
ggplot(aes(`District`)) +
geom_bar()
ChiCrimes_dt %>%
ggplot(aes(`Domestic`)) +
geom_bar()
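With millions of rows, `geom_bar()` has to count every row inside ggplot2. Counting first and plotting the summary with `geom_col()` gives the same bars from a much smaller table, and flipping the axes keeps long category labels readable. A sketch on toy data (values are invented):

```r
library(data.table)
library(ggplot2)

toy_dt <- data.table(
  `Primary Type` = c("THEFT", "BATTERY", "THEFT", "NARCOTICS", "THEFT", "BATTERY")
)

# Aggregate first, sorted from most to least frequent
type_counts <- toy_dt[, .(n = .N), by = "Primary Type"][order(-n)]

# Plot the pre-computed counts; reorder() sorts the bars by count
p <- ggplot(type_counts, aes(x = reorder(`Primary Type`, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Primary Type", y = "Number of reported crimes")
print(type_counts)
```

Printing `p` draws the chart; the same pattern applies to `Location Description`, `District`, and the other categorical columns.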
setup_disk.frame()
The number of workers available for disk.frame is 6
# this will allow unlimited amount of data to be passed from worker to worker
options(future.globals.maxSize = Inf)
library(nycflights13)
# convert the flights data to a disk.frame and store the disk.frame in the folder
# "tmp_flights" and overwrite any content if needed
flights.df <- as.disk.frame(
flights,
outdir = file.path("data_disk_frame", "tmp_flights.df"),
overwrite = TRUE)
flights.df
path: "data_disk_frame/tmp_flights.df"
nchunks: 6
nrow (at source): 336776
ncol (at source): 19
nrow (post operations): ???
ncol (post operations): ???
class(flights.df)
[1] "disk.frame" "disk.frame.folder"
filter(flights.df, dep_delay > 1000) %>% collect %>% head(2)
c4 <- flights.df %>%
  filter(month == 5, day == 17, carrier %in% c('UA', 'WN', 'AA', 'DL')) %>%
  select(carrier, dep_delay, air_time, distance) %>%
  mutate(air_time_hours = air_time / 60) %>%
  collect %>%
  arrange(carrier) # arrange should occur after `collect`
c4 %>% head
flights.df %>%
  group_by(carrier) %>% # older disk.frame versions required hard_group_by() here
  summarize(count = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>% # mean follows normal R rules
collect %>%
arrange(carrier)
beep();beep();beep(sound=4)