Version 1.0 | First Created Oct 12, 2023 | Updated Oct 18, 2023
Keywords: Poisson Regression, Spatial Hotspot Analysis, Local Moran’s
I, Spatial LOGO Cross Validation, K-fold Cross Validation, K Nearest
Neighbors, Kernel Density Estimation
Introduction
Predictive policing is an application of machine learning through
building geospatial risk models that aggregate historical crime data,
socioeconomic factors, and spatial attributes to forecast where and when
crimes are likely to occur. These models are crucial because they offer
law enforcement agencies a proactive approach to crime prevention. While
traditional policing method often rely on reacting to incidents after
they occur, predictive policing enables law enforcement to anticipate
and prevent crimes in advance. Because of that, law enforcement agencies
can optimize their resources and deploy officers strategically, leading
to more cost-effective law enforcement practices.
However, there are notable limitations and biases associated with
these models. One significant concern is the potential for bias in the
historical data used to train these algorithms. If the historical data
reflects existing biases in law enforcement practices, such as racial
profiling or discriminatory policies, the algorithms can perpetuate
these biases, leading to unfair or disproportionate targeting of certain
communities. Additionally, there is the risk of self-fulfilling
prophecies: if law enforcement concentrates their efforts in specific
areas predicted by the algorithm, crime data from those areas might be
inflated, creating a feedback loop of over-policing. That said,
predictions from geospatial risk models can be thought of as latent risk
- areas at risk even if a crime has not actually been reported there.
Selection bias can be a problematic issue for it is very likely that
observed crime would discount actual crime risks.
With these in mind, the goal of this report is to fit a geospatial
risk model predicting robbery incidents in Chicago, a city facing
various crime and policing challenges, to the best of our ability.
Robberies are often subjected to selection bias because 1) certainly not
all incidents of robbery are reported to the police and 2) areas that
had been heavily patrolled in the past due to higher robbery incidents
will continue to report higher robbery incidents. For any type of crime,
the hypothesis is that crime risk is a function of exposure to a series
of geospatial risk and protective factors, such as blight or recreation
centers, respectively. As exposure to risk factors increases, so does
crime risk. In addition to identifying risk factors, the model used in
this report also considers spatial factors to improve accuracy. Poisson
regression is used here for its compatibility with count data. All data
sets used for predictions here are downloaded from Chicago Data Portal.
This report is structured as follow: Firstly, we did
a close examination of our dependent variable, in this case robbery
incidents.Secondly, we discuss our rational of our
selection of risk factors and presented our approach to feature
engineering, which includes the calculation of nearest distance to each
risk factors and spatial hotspots. Thirdly, we fit two
regression models: one simple poisson regression and one that account
for spatial attributes. We subsequently perform two kinds of cross
validation: one random k-fold cross validation.
Finally, we compared the errors of different regression
models, tested their generalization, and compare our models to the
traditional kernel density estimation.
The GitHub repository for this report is here.
Robbery Incidents
The Chicago Data Portal is an online platform that serves as a
central repository for a wide range of municipal data sets. The portal
maintains a record for crime incidents for Chicago every year. In this
report, crime data for 2021 is downloaded using the
RSocrata
package. Socrata is the creator of the open source
platform that Chicago uses to share its data. After downloading, robbery
incidents were filtered out. Some additional data wrangling removes
extraneous characters from the Location
field with gsub.
The resulting field is then converted to separate fields of X and Y
coordinates.
robbery21 <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2021/dwme-t96c") %>%
filter(Primary.Type == "ROBBERY") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),Y = as.numeric(Y)) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
st_transform('ESRI:102271') %>%
distinct()
A quick mapping of each individual incident in Chicago reveals that
robberies are pretty scattered in all parts of the city, though slightly
more concentrated in central and northeast Chicago.
ggplot() +
geom_sf(data = chicagoBoundary, fill = "black") +
geom_sf(data = robbery21, colour="#BB3754", size=0.1, show.legend = "point") +
labs(title= "Robbery Incidents in Chicago 2021") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8)
)
The density map shows three clear robbery hotspots in Chicago. The
most prominent hotspots are located in north east and north west part of
Chicago, with some smaller hotspots in the southern part.
options(scipen=0)
ggplot() +
geom_sf(data = chicagoBoundary, fill = "black") +
stat_density2d(data = data.frame(st_coordinates(robbery21)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 60, geom = 'polygon') +
scale_fill_viridis(option = "magma", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
labs(title = "Density of Robbery") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
Since crime risk is not as a phenomenon that varies across
administrative units, but one varying smoothly across the landscape, the
best way to represent this spatial trend in a regression-ready form is
to aggregate point-level data into a lattice of grid cells. The code
below generates a fishnet of Chicago to achieve this goal.
fishnet <-
st_make_grid(chicagoBoundary,
cellsize = 500,
square = TRUE) %>%
.[chicagoBoundary] %>% # fast way to select intersecting polygons
st_sf() %>%
mutate(uniqueID = 1:n())
ggplot() +
geom_sf(data=fishnet, color="black", fill="#FFF5EE") +
labs(title = "Fishnet of Chicago") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
After generating the fishnet, we used that as a spatial unit for
visualizing robbery incidents, from which we may see that number of
robberies are the highest in a couple of girds in northeast Chicago.
This is where the clustered spatial process of robbery begins to take
shape.
robbery21_net <-
dplyr::select(robbery21) %>%
mutate(countRobbery = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countRobbery = replace_na(countRobbery, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = robbery21_net, aes(fill = countRobbery), color = NA) +
scale_fill_viridis(option = "magma", name = "Robbery Counts") +
labs(title = "Count of Robberies for the Fishnet") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
The figure below shows the distribution of robbery incidents in
Chicago, which is non-normally distributed with a few grids having
significantly higher number of robberies. To account for this
non-normality, it is necessary to use Possion regression for this
analysis.
ggplot(robbery21_net, aes(x = countRobbery)) +
geom_histogram( fill="#56106E", color="#e9ecef") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)) +
labs(title = "Distribution of Robberies in Chicago 2021",
caption = "Data: Chicago Data Portal Crimes 2021") +
xlab("Robbery Incidents") +
ylab("Count")
Risk Factors
We included in our model eight different risk factors. This includes
311 reports of abandoned cars, street lights out, graffiti remediation,
sanitation complaints, abandon buildings, location of retail stores that
sell liquor to go, location of shotspotter, and banks. Here are the
rationale for including these predictors in the model.
Abandoned Cars: Neighborhoods with a high number of
abandoned cars might indicate economic distress, lack of community
engagement, and lower levels of surveillance. These conditions can
create an environment conducive to criminal activities like
robberies.
Street Lights Out: Poorly lit areas are often
targeted by criminals because darkness provides cover. Reports of street
lights out can highlight areas with reduced visibility, making them
potential hotspots for robberies.
Graffiti Presence: Areas with frequent graffiti
might suggest a lack of community maintenance and oversight. Such areas
can be perceived as neglected, making them attractive for criminal
activities, including robberies.
Sanitation Complaints: Similarly, neighborhoods with
sanitation issues might suggest the lack of maintenance. This breakdown
in community can contribute to an environment where criminal activities,
such as robberies, are more likely to occur.
Abandoned Buildings: Abandoned buildings also serve
as hideouts for criminals and can be perceived as unsafe by residents.
They create a sense of disorder and can attract criminal elements,
making nearby areas more susceptible to robberies.
Retail Stores Selling Liquor to Go: Liquor stores
are associated with alcohol consumption. Areas with a high concentration
of liquor stores might indicate higher lead to higher alcohol related
crime (though this is a very subjective statement) because inebriated
individuals, particularly during late hours, might be more vulnerable
and less aware of their surroundings, making them easy targets for
robberies.
Location of ShotSpotter: ShotSpotter technology
detects gunfire in real-time and areas with frequent gunshot incidents
likely experience higher crime rates, including robberies. These
locations can be considered as potential risk factors for predicting
robbery incidents.
Banks: Bank locations are often targeted due to the
availability of cash. Since we are predicting robberies here, areas with
a high density of banks might attract robbers seeking immediate
financial gains.
abandonCars <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2020") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Cars")
abandonBuildings <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
mutate(year = substr(date_service_request_was_received,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Abandoned_Buildings")
graffiti <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Graffiti")
streetLightsOut <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Street_Lights_Out")
sanitation <-
read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Sanitation")
liquorRetail <-
read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%
filter(business_activity == "Retail Sales of Packaged Liquor") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Liquor_Retail")
shotSpotter <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Violence-Reduction-Shotspotter-Alerts/3h7q-7mdb") %>%
dplyr::select(Y = latitude, X = longitude) %>%
na.omit() %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Shot_Spotter")
Banks <-
read.socrata("https://data.cityofchicago.org/Administration-Finance/Lending-Equity-Depository-Locations-2021/jgup-zt7k") %>%
dplyr::select(location) %>%
na.omit()
Banks <- Banks %>%
mutate(X = as.numeric(sub("POINT \\((-?\\d+\\.\\d+) \\d+\\.\\d+\\)", "\\1", Banks$location)),
Y = as.numeric(sub("POINT \\(-?\\d+\\.\\d+ (-?\\d+\\.\\d+)\\)", "\\1", Banks$location))) %>%
dplyr::select(-location) %>%
filter(!is.na(X) & !is.na(Y)) %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform(st_crs(fishnet)) %>%
mutate(Legend = "Bank")
All of these predictor variables were then joined to the fishnet.
Each cell in the fishnet would now contain a number summarizing the
number of respective risk factor occurrence in the cell.
vars_net <-
rbind(abandonCars,streetLightsOut,abandonBuildings,
liquorRetail, graffiti, sanitation, shotSpotter, Banks) %>%
st_join(., fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(fishnet) %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
dplyr::select(-`<NA>`) %>%
na.omit() %>%
ungroup()
The figure below shows that each risk factor has its unique spatial
distribution in Chicago. Abandoned buildings are concentrated mostly
concentrated in southern Chicago. Banks and liquor retail stores are
concentrated in the CBD. Abandoned cars, sanitation complaints, and
streets lights out complaints are scattered everywhere in the city.
Graffiti cleaning request are mostly found close to the city center
while shot spotter are mostly outside of the city center.
vars_net.long <-
gather(vars_net, Variable, value, -geometry, -uniqueID)
plot1 <- ggplot() +
geom_sf(data = vars_net.long %>% filter(Variable == "Abandoned_Buildings"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Abandoned Buildings") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
plot2 <- ggplot() +
geom_sf(data = vars_net.long %>% filter(Variable == "Abandoned_Cars"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Abandoned Cars") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
plot3 <- ggplot() +
geom_sf(data = vars_net.long %>% filter(Variable == "Bank"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Banks") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
plot4 <- ggplot() +
geom_sf(data = vars_net.long %>% filter(Variable == "Graffiti"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Graffiti") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
plot5 <- ggplot() +
geom_sf(data = vars_net.long %>% filter(Variable == "Liquor_Retail"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Liquor Retail") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
plot6 <- ggplot() +
geom_sf(data = vars_net.long %>% filter(Variable == "Sanitation"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Sanitation") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
plot7 <- ggplot() +
geom_sf(data = vars_net.long %>% filter(Variable == "Shot_Spotter"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Shot Spotter") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
plot8 <- ggplot() +
geom_sf(data = vars_net.long %>% filter(Variable == "Street_Lights_Out"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Street Light Out") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
wrap_plots(
plot1, plot3, plot2, plot4,
plot5, plot6, plot7, plot8,
ncol = 4
)
Feature Engineering
Nearest Distance
The first feature engineering approach we took is to calculate
average nearest neighbor distance to hypothesize a smoother exposure
relationship across space. Here, the nn_function is used. For each of
our risk factors, average nearest distance is calculated from the
centroid of each cell to its nearest three, abandoned cars, for
example.
st_c <- st_coordinates
st_coid <- st_centroid
vars_net <-
vars_net %>%
mutate(
Abandoned_Buildings.nn =
nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
Abandoned_Cars.nn =
nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
Graffiti.nn =
nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
Liquor_Retail.nn =
nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
Street_Lights_Out.nn =
nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
Sanitation.nn =
nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
Shot_Spotter.nn =
nn_function(st_c(st_coid(vars_net)), st_c(shotSpotter),3))
vars_net <-
vars_net %>% mutate(Bank.nn = nn_function(st_c(st_coid(vars_net)), st_c(Banks),3))
The nearest neighbor features are then plotted below. Visualizing
distance provides us with a better view of the spatial distribution of
various risk factors. It appears that the southern and southwest part of
Chicago are the furthest away from all kinds of risks. One interesting
observation is that the northeastern corner of Chicago is the furthest
away from shotspotters while being relatively close to all other kinds
of risk factors. If shot spotter are used to location crime incidents,
their absence in this area raises question of whether this “risk” factor
act the same as others. Are shotspotter an indication of more crime in
the area or does the presence of shot spotter means greater policing and
therefore safter neighborhood? These questions are worth
considering.
vars_net.long.nn <-
dplyr::select(vars_net, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
p1.nn <- ggplot() +
geom_sf(data = vars_net.long.nn %>% filter(Variable == "Abandoned_Buildings.nn"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Abandoned Buildings NN") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
p2.nn <- ggplot() +
geom_sf(data = vars_net.long.nn %>% filter(Variable == "Abandoned_Cars.nn"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Abandoned Cars NN") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
p3.nn <- ggplot() +
geom_sf(data = vars_net.long.nn %>% filter(Variable == "Bank.nn"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Banks NN") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
p4.nn <- ggplot() +
geom_sf(data = vars_net.long.nn %>% filter(Variable == "Graffiti.nn"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Graffiti NN") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
p5.nn <- ggplot() +
geom_sf(data = vars_net.long.nn %>% filter(Variable == "Liquor_Retail.nn"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Liquor Retail NN") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
p6.nn <- ggplot() +
geom_sf(data = vars_net.long.nn %>% filter(Variable == "Sanitation.nn"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Sanitation NN") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
p7.nn <- ggplot() +
geom_sf(data = vars_net.long.nn %>% filter(Variable == "Shot_Spotter.nn"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Shot Spotter NN") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
p8.nn <- ggplot() +
geom_sf(data = vars_net.long.nn %>% filter(Variable == "Street_Lights_Out.nn"), aes(fill=value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Street Light Out NN") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
wrap_plots(
p1.nn, p3.nn, p2.nn, p4.nn,
p5.nn, p6.nn, p7.nn, p8.nn,
ncol = 4
)
Local Spatial Autocorrelation
The second feature engineering approach we took is to account for the
spatial process of robbery incidents through determining robbery
hotspots in Chicago. Unlike the global Moran’s I which would tell us if
there’s spatial autocorrelation of robberies of the entire area, local
Moran’s I examine if robbery count at a given location is randomly
distributed relative to its immediate neighbors. A spatial weights
matrix is calcualted below, which is used to relate a unit to its
neighbors using queen contiguity (every grid cell is related to its
eight adjacent neighbors).
final_net <-
left_join(robbery21_net, st_drop_geometry(vars_net), by="uniqueID")
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)
local_morans <- localmoran(final_net$countRobbery, final_net.weights, zero.policy=TRUE) %>%
as.data.frame()
final_net.localMorans <-
cbind(local_morans, as.data.frame(final_net)) %>%
st_sf() %>%
dplyr::select(Robbery_Count = countRobbery,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
gather(Variable, Value, -geometry)
The results of the localmoran test is combined with the spatial layer
for visualization. Several useful test statistics, including Moran’s I,
the p-value, and significant robbery hotspots are visualized. We may see
that the Local Moran’s I value is significantly higher in northeastern
Chicago, close to the CBD area. Areas with low p-value suggests that
there is spatial autocorrelation between robbery incidents in cells that
are proximate to each other. Significant hotspots, meaning those grid
cells with higher local counts than what might otherwise be expected
under randomness (p-values <= 0.05), follows the same pattern as
what’s shown in the kernal density map above.
moran1 <- ggplot() +
geom_sf(data = final_net.localMorans %>% filter(Variable == "Robbery_Count"), aes(fill=Value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Robbery Count") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
moran2 <- ggplot() +
geom_sf(data = final_net.localMorans %>% filter(Variable == "Local_Morans_I"), aes(fill=Value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Local Morans I") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
moran3 <- ggplot() +
geom_sf(data = final_net.localMorans %>% filter(Variable == "P_Value"), aes(fill=Value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "P Value") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
moran4 <- ggplot() +
geom_sf(data = final_net.localMorans %>% filter(Variable == "Significant_Hotspots"), aes(fill=Value), colour=NA) +
scale_fill_viridis(option = "magma") +
labs(title = "Signigicant Hotspots") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
moran1 + moran2 + moran3 + moran4
Singificant Robbery Clusters
Following the calculation of significant robbery clusters, we created
a dummy variable indicating whether this grid cell is located within the
hotspot. We also calculated the distance from the centroid of each grid
cell to its nearest robbery hotspot.
final_net <-
final_net %>%
mutate(robbery.isSig = ifelse(localmoran(final_net$countRobbery,
final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
mutate(robbery.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
st_coordinates(st_centroid(
filter(final_net, robbery.isSig == 1))), 1))
The figure below visualizes distance to nearest robbery hotspots in
Chicago, where the darker the color, the shorter the distance.
ggplot() +
geom_sf(data = final_net, aes(fill=robbery.isSig.dist), colour=NA) +
scale_fill_viridis(option = "magma", name="NN Distance") +
labs(title="Robbery NN Distance") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 10, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
All Risk Factors
The figure below organizes count and nearest neighbor (nn)
correlations side-by-side. All of our count features are positively
correlated with robbery incidents, meaning the greater number of risk
factors, the higher number of robbery. The relationship is the strongest
between robbery and sanitation complaint as well as shot spotter. All of
our nearest neighbor features are negatively correlated with robbery
incidents, meaning that the greater the distance, the less number of
robbery. The relationship is the strongest for distance to signficant
robbery clusters. While correlation for count features is a bit awkward,
this approach can help with feature selection. For a given risk factor,
we have to avoid colinearity by selecting either the count or nearest
neighbor feature. In the case of our model, we used nearest
distance.
final_net_long <- final_net%>%
st_drop_geometry() %>%
dplyr::select(-c(uniqueID, cvID)) %>%
pivot_longer(cols = -countRobbery, # everything except measurement
names_to = "Type", # categorizes all quantitative variables into Type
values_to = "Number")
correlation.cor <-
final_net_long %>%
group_by(Type) %>%
summarize(correlation = cor(Number, countRobbery, use = "complete.obs"))
final_net_long %>%
ggplot(aes(x= Number, y = countRobbery)) +
geom_point(size = 0.01, color = "#000004") +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1, size=3) +
geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
`Abandoned_Buildings` = "Abandoned Buildings",
`Abandoned_Buildings.nn` = "Distance to Abandoned Buildings",
`Abandoned_Cars` = "Abandoned Cars",
`Abandoned_Cars.nn` = "Distance to Abandoned Cars",
`Bank` = "Bank",
`Bank.nn` = "Distance to Bank",
`Graffiti` = "Graffiti",
`Graffiti.nn` = "Distance to Graffiti",
`Liquor_Retail` = "Liquor Retail",
`Liquor_Retail.nn` = "Distance to Liquor Retail",
`robbery.isSig` = "Significant Robbery",
`robbery.isSig.dist` = "Distance to Significant Robbery",
`Sanitation` = "Sanitation",
`Sanitation.nn` = "Distance to Sanitation",
`Shot_Spotter` = "Shot Spotter",
`Shot_Spotter.nn` = "Distance to Shot Spotter",
`Street_Lights_Out` = "Street Light Out",
`Street_Lights_Out.nn` = "Distance to Broken Street Light"))) +
labs(title = "Scatter Plots of Predictor Variables",
caption = "Data: Chicago Data Portal") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Poisson Regression
Random K-Fold CV
Below, goodness of fit metrics are generated for four regressions -
two including Just Risk Factors reg.vars
, and a second
reg.ss.vars
includes risk factors plus the Local Moran’s I
Spatial Process. Two kinds of cross validation test were performed. The
first one uses a randomly generated cvID
associated with
each grid cell for random k-fold cross validation.
reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn",
"Shot_Spotter.nn", "Bank.nn")
reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn",
"Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn",
"Shot_Spotter.nn", "Bank.nn", "robbery.isSig", "robbery.isSig.dist")
reg.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countRobbery",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)
reg.ss.cv <- crossValidate(
dataset = final_net,
id = "cvID",
dependentVariable = "countRobbery",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)
Spatial LOGO CV
The second one hold out one neighborhood, train the model on the
remaining n-1 areas, predict for the hold out, and record the goodness
of fit. This is also called the spatial ‘Leave-one-group-out’
cross-validation (LOGO-CV), in which each neighborhood takes a turn as a
hold-out. The result of each analysis is a sf layer with observed and
predicted burglary counts.
neighborhoods <-
st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
st_transform(st_crs(fishnet))
final_net <-
st_centroid(final_net) %>%
st_join(dplyr::select(neighborhoods, name)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
reg.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countRobbery",
indVariables = reg.vars) %>%
dplyr::select(cvID = name, countRobbery, Prediction, geometry)
reg.ss.spatialCV <- crossValidate(
dataset = final_net,
id = "name",
dependentVariable = "countRobbery",
indVariables = reg.ss.vars) %>%
dplyr::select(cvID = name, countRobbery, Prediction, geometry)
Error Analysis
The code block below creates a long form reg.summary, that binds
together observed/predicted counts and errors for each grid cell and for
each regression. Residuals are calculated for each regression by
subtracting the original robbery incidents from predicted incidents.
reg.summary <-
rbind(
mutate(reg.cv, Error = Prediction - countRobbery,
Regression = "Random k-fold CV: Just Risk Factors"),
mutate(reg.ss.cv, Error = Prediction - countRobbery,
Regression = "Random k-fold CV: Spatial Process"),
mutate(reg.spatialCV, Error = Prediction - countRobbery,
Regression = "Spatial LOGO-CV: Just Risk Factors"),
mutate(reg.ss.spatialCV, Error = Prediction - countRobbery,
Regression = "Spatial LOGO-CV: Spatial Process")) %>%
st_sf()
Comparing the error across different regressions, we see that adding
spatial process features improve the model. In addition, the model
appears slightly less robust for the spatial cross-validation, probably
because LOGO-CV is such a conservative assumption.
error_by_reg_and_fold <-
reg.summary %>%
group_by(Regression, cvID) %>%
summarize(Mean_Error = mean(Prediction - countRobbery, na.rm = T),
MAE = mean(abs(Mean_Error), na.rm = T),
SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>% ungroup()
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable(col.name=c("Regression", 'Mean Absolute Error','Standard Deviation MAE')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(2, color = "black", background = "#FCFFA4") %>%
row_spec(4, color = "black", background = "#FCFFA4")
Regression
|
Mean Absolute Error
|
Standard Deviation MAE
|
Random k-fold CV: Just Risk Factors
|
0.66
|
0.59
|
Random k-fold CV: Spatial Process
|
0.64
|
0.55
|
Spatial LOGO-CV: Just Risk Factors
|
2.53
|
2.77
|
Spatial LOGO-CV: Spatial Process
|
1.82
|
2.04
|
The small multiple maps below visualizes the distribution of errors
by regression. For the random k-fold cv method, the errors are pretty
evenly distributed across space. However, for spatial logo-cv method,
higher errors occur at spatial hotspots.
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill=MAE), color="transparent") +
scale_fill_viridis(option = "magma", direction = -1) +
facet_wrap(~Regression, ncol = 4) +
labs(title = "MAE by Fold and Regression",
caption = "Data: Chicago Data Portal") +
theme(legend.position = "bottom",
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8),
strip.text = element_text(size = 7))
Prediction Accuracy
Generalizability Test
Let’s now test for generalizability across racial-neighborhood
context. To test this proposition, tidycensus is used to pull race data
by census tract. Percentage minority population for each census tract is
calculated by comparing the total number of Latinx, African American,
and Asian population to the total population. Census tracts are split
into two groups: majority white and minority, depending on whether the
proportion of minority population in that census tract exceeds 60%.
census_api_key(dlgInput(
"Enter a Census API Key", # ask for an api key
Sys.getenv("CENSUS_API_KEY")
)$res,
overwrite = TRUE)
tracts21 <-
get_acs(geography = "tract",
variables = c("B02001_001E", # total population
"B02001_002E", # white population
"B02001_003E", # black population
"B02001_005E", # asian population
"B03002_012E"),
year=2021, state=17, county=031,
geometry=TRUE, output="wide") %>%
st_transform('ESRI:102271') %>%
rename(TotalPop = B02001_001E,
Whites = B02001_002E,
African_Americans = B02001_003E,
Asians = B02001_005E,
Latinx = B03002_012E) %>%
mutate(pctMinority = ifelse(TotalPop > 0, (African_Americans + Asians + Latinx ) / TotalPop, 0),
majority = ifelse(pctMinority > 0.5, "minority", "majority")) %>% .[neighborhoods,]
We may see that Chicago remains a very segregated city and there’s
clear racial boundary between places where more than 60% of the
population are minority and places that’s majority white. In particular,
northern and northeastern (CBD) part of Chicago are majority white while
the remaining parts are all made up of racial minorities.
ggplot() + geom_sf(data = na.omit(tracts21), aes(fill = majority), color = NA) +
scale_fill_manual(values = c("#BB3754", "#56106E"), name="Race Context") +
labs(title = "Race Context ",
caption = "Data: American Community Survey 2021") +
theme(
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank(),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.4)
)
Error is calculated by subtracting the observed robbery count from
the prediction. A positive difference represents an over-prediction. The
least ideal result is a model that over-predicts risk in minority areas,
and under-predicts in white areas. If reporting selection bias is an
issue, such a model may unfairly allocate police resource
disproportionately in black and brown communities. In our case,
over-prediction of minorities is an issue for random k-fold cv
regression, but not for logo-cv regressions. The latter two models
under-predict minority neighborhoods and over-predict majority white
neighborhood, making it looks like that this algorithms generalizes well
with respect to race.
joinrace <- st_centroid(reg.summary) %>%
st_intersection(tracts21 %>%dplyr:: select(pctMinority)) %>%
st_drop_geometry() %>%
group_by(cvID) %>%
summarize(meanMinor = mean(pctMinority))
reg.summary <- reg.summary %>%
left_join(joinrace, by = "cvID")
reg.summary %>%
mutate(raceContext = ifelse(meanMinor > .6, "Minority", "Majority White")) %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(2, color = "black", background = "#FCFFA4") %>%
row_spec(4, color = "black", background = "#FCFFA4")
Regression
|
Majority White
|
Minority
|
Random k-fold CV: Just Risk Factors
|
0.3945728
|
-0.0191392
|
Random k-fold CV: Spatial Process
|
0.5608446
|
-0.0273303
|
Spatial LOGO-CV: Just Risk Factors
|
0.5010425
|
-0.3246241
|
Spatial LOGO-CV: Spatial Process
|
0.2887490
|
-0.2015924
|
Kernel Desntiy Estimation
The utility of this algorithm is judged relative to an alternative
police allocation method: kernel density estimation. Kernel density
works by centering a smooth kernel, or curve, atop each crime point such
that the curve is at its highest directly over the point and the lowest
at the range of a circular search radius. The code block below creates a
Kernel density map of robbery with a 1000 foot search radius.
rob_ppp <- as.ppp(st_coordinates(robbery21), W = st_bbox(final_net))
rob_KD.1000 <- spatstat.explore::density.ppp(rob_ppp, 1000)
rob_KD.df <- data.frame(rasterToPoints(mask(raster(rob_KD.1000), as(neighborhoods, 'Spatial'))))
ggplot(data=rob_KD.df, aes(x=x, y=y)) +
geom_raster(aes(fill=layer)) +
coord_sf(crs=st_crs(final_net)) +
scale_fill_viridis(option = "magma", name="Density") +
labs(title = "Kernel Density of Robbery 1000ft Radii") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
Next, a new goodness of fit indicator is created to illustrate
whether the 2021 kernel density or risk predictions capture more of the
2022 robberies.
robbery22 <-
read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2022/9hwr-2zxp") %>%
filter(Primary.Type == "ROBBERY") %>%
mutate(x = gsub("[()]", "", Location)) %>%
separate(x,into= c("Y","X"), sep=",") %>%
mutate(X = as.numeric(X),
Y = as.numeric(Y)) %>%
na.omit %>%
st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
st_transform('ESRI:102271') %>%
distinct() %>%
.[fishnet,]
The kernel density estimation was classified into 5 risk categories
and the count for the 2022 robberies were summarized into the same sf
layer. The purpose here is to see if the risk predictions capture more
observed burglaries than the kernel density. If so, then the risk
prediction model provides a more robust targeting tool for allocating
police resources.
rob_KDE_sum <- as.data.frame(rob_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
aggregate(., final_net, mean)
kde_breaks <- classIntervals(rob_KDE_sum$value,
n = 5, "fisher")
rob_KDE_sf <- rob_KDE_sum %>%
mutate(label = "Kernel Density",
Risk_Category = classInt::findCols(kde_breaks),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
mutate(robberyCount = replace_na(robberyCount, 0))) %>%
dplyr::select(label, Risk_Category, robberyCount)
The same process was repeated for the risk prediction. Note that both
predictions from the LOGO-CV with and without the spatial features is
being used here.
ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction,
n = 5, "fisher")
rob_risk_sf <-
reg.ss.spatialCV %>%
mutate(label = "Risk Predictions Spatial",
Risk_Category =classInt::findCols(ml_breaks),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
mutate(robberyCount = replace_na(robberyCount, 0))) %>%
dplyr::select(label,Risk_Category, robberyCount)
ml_breaks_simple <- classIntervals(reg.ss.cv$Prediction,
n = 5, "fisher")
rob_risk_sf_simple <-
reg.ss.cv %>%
mutate(label = "Risk Predictions Simple",
Risk_Category =classInt::findCols(ml_breaks_simple),
Risk_Category = case_when(
Risk_Category == 5 ~ "5th",
Risk_Category == 4 ~ "4th",
Risk_Category == 3 ~ "3rd",
Risk_Category == 2 ~ "2nd",
Risk_Category == 1 ~ "1st")) %>%
cbind(
aggregate(
dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
mutate(robberyCount = replace_na(robberyCount, 0))) %>%
dplyr::select(label,Risk_Category, robberyCount)
Prediction Comparison
Below a map is generated of the risk categories for all three model
types. A strongly fit model should show that the highest risk category
is uniquely targeted to places with a high density of burglary points.
The map shows that our risk prediction models perform as good as the
kernal density approach in capturing the distribution of robbery
incidents.
rbind(rob_KDE_sf, rob_risk_sf, rob_risk_sf_simple) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
facet_wrap(~label, ) +
scale_fill_viridis(option = "magma", discrete = TRUE, name = "Risk Category") +
geom_sf(data = sample_n(robbery22, 3000), size = .03, colour = "white") +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="2021 robbery risk predictions; 2022 robberies") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
A more straightforward way to compare the accuracy of prediction is
through histogram. A well fit model should show that the risk
predictions capture a greater share of 2022 robberies in the highest
risk category relative to the Kernel density. Unfortunately, the kernel
density model edges out both risk prediction model for highest risk
areas. Nevertheless, the risk prediction model does a better job in
capturing robbery incidents in lower-risk areas that the kernel density
model had ignored.
rbind(rob_KDE_sf, rob_risk_sf, rob_risk_sf_simple) %>%
st_drop_geometry() %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countRobbery = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Pcnt_of_test_set_crimes = countRobbery / sum(countRobbery)) %>%
ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis(option = "magma", discrete = TRUE, name = "Model", direction = -1) +
labs(title = "Risk Prediction vs. Kernel Density",
y = "% of Test Set Robberies (per model)",
x = "Risk Category") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
Conclusions
In conclusion, there are still a lot of work that needs to be done
before this model could be put into practice. Comparing to the
traditional methodology of using kernel density to estimate crime
locations, our model did not do very well in capturing robbery incidents
in areas designated as the highest risk category. There are two main
reasons behind. Firstly, most predictor variables we used in our model
are out of date. While the Chicago Data Portal has been constantly
updating its crime dataset, almost all of the risk factor dataset we
used were last updated in 2019. That said, we are using reports of
abandoned cars, graffiti, abandoned buildings, street light incidents,
sanitation complaints, etc., from 2019 to fit a model that summarize
robberies in 2021, which is then subsequently used to predict robberies
in 2022. If there’s a correlation between locations of these risk
factors and robbery incidents, then the removals of graffiti and
renovations of abandoned buildings in 2021, for example, would very much
likely to shift the geographies of robbery incidents. For a better
prediction, we highly encourage the release of risk factor datasets that
are up to date.
In addition, we believe that the global pandemic since 2020 would
have had a significant impact on locations of robberies. Economic
distress, driven by high unemployment rates, could create financial
strain, leading to an increase in robberies in areas already facing
economic challenges. Lockdowns and changed routines due to social
distancing measures could also create opportunities for criminals. In
addition, the reallocation of police resources to enforce public health
measures would result in decreased police presence in areas that were
regularly patrolled. All of these would contribute to a growing number
of robbery incidents in area unexpected before. If we sticked with a
more general model to predict robbery incidents during this very unusual
time, then it is very likely that our model produces skewed results.
This reminds us again the importance of ensuring the generalizability of
any machine learning model not only across space but also time. In this
case, the model should consider risk factors related to the
pandemic.
Nevertheless, our model did a better job in capturing robbery
incidents in lower-risk areas that the kernel density model had ignored.
While being low risk, it is also important not to overlook potential
robbery incidents in these areas. Ignoring the “risk” of low-risk area
and overly emphasizing high risk areas would lead to over-policing in
high-risk areas, which would continue to remain as high risk ever after.
This leads us to think about whether using a single model is not enough
to capture the complexities of robbery incidents in one place. A
collective approach that strategically make use of the result of both
the kernel density estimation and the risk prediction models can
probably lead to more accurate predictions.
On top of that, we have avoided the overprediction of robbery
incidents in majority non-white neighborhood. Yet, despite finding that
the model generalizes well across different neighborhood contexts, we
cannot be sure the model doesn’t suffer from selection bias. If law
enforcement systematically over-polices certain communities, and this
selection criteria goes unaccounted for in the model, then the model may
be still biased regardless of the above tests. We also found that adding
spatial process features improve the model and using the random k-fold
cross validation method would enhance the model.
---
title: "Geospatial Risk Predictions of Robbery Incidents in Chicago"
author: "Emily Zhou"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: simplex
    toc: yes
    toc_float: yes
    code_folding: hide
    code_download: yes
editor_options:
  markdown:
    wrap: sentence
---

Version 1.0 \| First Created Oct 12, 2023 \| Updated Oct 18, 2023

Keywords: Poisson Regression, Spatial Hotspot Analysis, Local Moran's I, Spatial LOGO Cross Validation, K-fold Cross Validation, K Nearest Neighbors, Kernel Density Estimation

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r load libraries, include=FALSE}

library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat.explore)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(classInt)   # for KDE and ML risk class intervals
library(svDialogs)
library(patchwork)

# functions
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

```


# Introduction

Predictive policing is an application of machine learning through building geospatial risk models that aggregate historical crime data, socioeconomic factors, and spatial attributes to forecast where and when crimes are likely to occur. These models are crucial because they offer law enforcement agencies a proactive approach to crime prevention. While traditional policing method often rely on reacting to incidents after they occur, predictive policing enables law enforcement to anticipate and prevent crimes in advance. Because of that, law enforcement agencies can optimize their resources and deploy officers strategically, leading to more cost-effective law enforcement practices. 

However, there are notable limitations and biases associated with these models. One significant concern is the potential for bias in the historical data used to train these algorithms. If the historical data reflects existing biases in law enforcement practices, such as racial profiling or discriminatory policies, the algorithms can perpetuate these biases, leading to unfair or disproportionate targeting of certain communities. Additionally, there is the risk of self-fulfilling prophecies: if law enforcement concentrates their efforts in specific areas predicted by the algorithm, crime data from those areas might be inflated, creating a feedback loop of over-policing. That said, predictions from geospatial risk models can be thought of as latent risk - areas at risk even if a crime has not actually been reported there. Selection bias can be a problematic issue for it is very likely that observed crime would discount actual crime risks. 

With these in mind, the goal of this report is to fit a geospatial risk model predicting robbery incidents in Chicago, a city facing various crime and policing challenges, to the best of our ability. Robberies are often subjected to selection bias because 1) certainly not all incidents of robbery are reported to the police and 2) areas that had been heavily patrolled in the past due to higher robbery incidents will continue to report higher robbery incidents. For any type of crime, the hypothesis is that crime risk is a function of exposure to a series of geospatial risk and protective factors, such as blight or recreation centers, respectively. As exposure to risk factors increases, so does crime risk. In addition to identifying risk factors, the model used in this report also considers spatial factors to improve accuracy. Poisson regression is used here for its compatibility with count data. All data sets used for predictions here are downloaded from [Chicago Data Portal](https://data.cityofchicago.org/). 

This report is structured as follow: **Firstly**, we did a close examination of our dependent variable, in this case robbery incidents.**Secondly**, we discuss our rational of our selection of risk factors and presented our approach to feature engineering, which includes the calculation of nearest distance to each risk factors and spatial hotspots. **Thirdly**, we fit two regression models: one simple poisson regression and one that account for spatial attributes. We subsequently perform two kinds of cross validation: one random k-fold cross validation. **Finally**, we compared the errors of different regression models, tested their generalization, and compare our models to the traditional kernel density estimation. 

The GitHub repository for this report is [here](https://github.com/emilyzhou112/MUSA5080-Chicago-Risks). 


# Robbery Incidents

The Chicago Data Portal is an online platform that serves as a central repository for a wide range of municipal data sets. The portal maintains a record for crime incidents for Chicago every year. In this report, crime data for 2021 is downloaded using the `RSocrata` package. Socrata is the creator of the open source platform that Chicago uses to share its data. After downloading, robbery incidents were filtered out. Some additional data wrangling removes extraneous characters from the `Location` field with gsub. The resulting field is then converted to separate fields of X and Y coordinates. 

```{r load outcome variable, warning=FALSE, message=FALSE, results='hide'}

robbery21 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2021/dwme-t96c") %>% 
    filter(Primary.Type == "ROBBERY") %>%
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct()
```

A quick mapping of each individual incident in Chicago reveals that robberies are pretty scattered in all parts of the city, though slightly more concentrated in central and northeast Chicago. 

```{r robbery locations, warning=FALSE, message=FALSE, fig.align = 'center'}

ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "black") +
  geom_sf(data = robbery21, colour="#BB3754", size=0.1, show.legend = "point") +
  labs(title= "Robbery Incidents in Chicago 2021") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8)
        )

```

The density map shows three clear robbery hotspots in Chicago. The most prominent hotspots are located in north east and north west part of Chicago, with some smaller hotspots in the southern part. 

```{r robbery density, fig.width=5, message=FALSE, warning=FALSE, fig.align = 'center'}
options(scipen=0)
ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "black") +
  stat_density2d(data = data.frame(st_coordinates(robbery21)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "magma", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  labs(title = "Density of Robbery") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

Since crime risk is not as a phenomenon that varies across administrative units, but one varying smoothly across the landscape, the best way to represent this spatial trend in a regression-ready form is to aggregate point-level data into a lattice of grid cells. The code below generates a fishnet of Chicago to achieve this goal. 


```{r create fishnet, fig.width=5, fig.align = 'center'}
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

ggplot() +
  geom_sf(data=fishnet, color="black", fill="#FFF5EE") +
  labs(title = "Fishnet of Chicago") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

After generating the fishnet, we used that as a spatial unit for visualizing robbery incidents, from which we may see that number of robberies are the highest in a couple of girds in northeast Chicago. This is where the clustered spatial process of robbery begins to take shape. 

```{r join robbery to fishnet, fig.align = 'center'}

robbery21_net <- 
  dplyr::select(robbery21) %>% 
  mutate(countRobbery = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRobbery = replace_na(countRobbery, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = robbery21_net, aes(fill = countRobbery), color = NA) +
  scale_fill_viridis(option = "magma", name = "Robbery Counts") +
  labs(title = "Count of Robberies for the Fishnet") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

The figure below shows the distribution of robbery incidents in Chicago, which is non-normally distributed with a few grids having significantly higher number of robberies. To account for this non-normality, it is necessary to use Possion regression for this analysis. 

```{r distribution of robberies, message=FALSE, warning=FALSE, fig.align = 'center'}
ggplot(robbery21_net, aes(x = countRobbery)) +
  geom_histogram( fill="#56106E", color="#e9ecef") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)) + 
  labs(title = "Distribution of Robberies in Chicago 2021",
       caption = "Data: Chicago Data Portal Crimes 2021") +
  xlab("Robbery Incidents") +
  ylab("Count")
```

# Risk Factors

We included in our model eight different risk factors. This includes 311 reports of abandoned cars, street lights out, graffiti remediation, sanitation complaints,  abandon buildings, location of retail stores that sell liquor to go, location of shotspotter, and banks. Here are the rationale for including these predictors in the model. 

**Abandoned Cars:** Neighborhoods with a high number of abandoned cars might indicate economic distress, lack of community engagement, and lower levels of surveillance. These conditions can create an environment conducive to criminal activities like robberies.

**Street Lights Out:** Poorly lit areas are often targeted by criminals because darkness provides cover. Reports of street lights out can highlight areas with reduced visibility, making them potential hotspots for robberies.

**Graffiti Presence**: Areas with frequent graffiti might suggest a lack of community maintenance and oversight. Such areas can be perceived as neglected, making them attractive for criminal activities, including robberies.

**Sanitation Complaints:** Similarly, neighborhoods with sanitation issues might suggest the lack of maintenance. This breakdown in community can contribute to an environment where criminal activities, such as robberies, are more likely to occur.

**Abandoned Buildings:** Abandoned buildings also serve as hideouts for criminals and can be perceived as unsafe by residents. They create a sense of disorder and can attract criminal elements, making nearby areas more susceptible to robberies.

**Retail Stores Selling Liquor to Go:** Liquor stores are associated with alcohol consumption. Areas with a high concentration of liquor stores might indicate higher lead to higher alcohol related crime (though this is a very subjective statement) because inebriated individuals, particularly during late hours, might be more vulnerable and less aware of their surroundings, making them easy targets for robberies. 

**Location of ShotSpotter:** ShotSpotter technology detects gunfire in real-time and areas with frequent gunshot incidents likely experience higher crime rates, including robberies. These locations can be considered as potential risk factors for predicting robbery incidents.

**Banks:** Bank locations are often targeted due to the availability of cash. Since we are predicting robberies here, areas with a high density of banks might attract robbers seeking immediate financial gains.

```{r load predictor variables, warning=FALSE, message=FALSE}
abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2020") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")
  
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")


shotSpotter <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Violence-Reduction-Shotspotter-Alerts/3h7q-7mdb") %>%  
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Shot_Spotter")

Banks <- 
  read.socrata("https://data.cityofchicago.org/Administration-Finance/Lending-Equity-Depository-Locations-2021/jgup-zt7k") %>%  
  dplyr::select(location) %>%
  na.omit()
Banks <- Banks %>%   
  mutate(X = as.numeric(sub("POINT \\((-?\\d+\\.\\d+) \\d+\\.\\d+\\)", "\\1", Banks$location)),
         Y = as.numeric(sub("POINT \\(-?\\d+\\.\\d+ (-?\\d+\\.\\d+)\\)", "\\1", Banks$location))) %>%
  dplyr::select(-location) %>% 
  filter(!is.na(X) & !is.na(Y)) %>% 
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Bank") 
  

```

All of these predictor variables were then joined to the fishnet. Each cell in the fishnet would now contain a number summarizing the number of respective risk factor occurrence in the cell.

```{r join predictor variables to fishnet, warning=FALSE, message=FALSE}

vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation, shotSpotter, Banks) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

```

The figure below shows that each risk factor has its unique spatial distribution in Chicago. Abandoned buildings are concentrated mostly concentrated in southern Chicago. Banks and liquor retail stores are concentrated in the CBD. Abandoned cars, sanitation complaints, and streets lights out complaints are scattered everywhere in the city. Graffiti cleaning request are mostly found close to the city center while shot spotter are mostly outside of the city center. 

```{r visualize predictor variables, fig.height=8, fig.width=11, fig.align = 'center'}
vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

plot1 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Abandoned_Buildings"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Abandoned Buildings") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot2 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Abandoned_Cars"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Abandoned Cars") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot3 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Bank"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Banks") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot4 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Graffiti"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Graffiti") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot5 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Liquor_Retail"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Liquor Retail") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot6 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Sanitation"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Sanitation") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

plot7 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Shot_Spotter"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Shot Spotter") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


plot8 <- ggplot() +
  geom_sf(data = vars_net.long %>% filter(Variable == "Street_Lights_Out"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Street Light Out") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

wrap_plots(
  plot1, plot3, plot2, plot4,
  plot5, plot6, plot7, plot8,
  ncol = 4  
)


```

# Feature Engineering


## Nearest Distance

The first feature engineering approach we took is to calculate average nearest neighbor distance to hypothesize a smoother exposure relationship across space. Here, the nn_function is used. For each of our risk factors, average nearest distance is calculated from the centroid of each cell to its nearest three, abandoned cars, for example. 

```{r calculate nearest distance, warning=FALSE, message=FALSE}

st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      Shot_Spotter.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(shotSpotter),3))

vars_net <-
  vars_net %>% mutate(Bank.nn = nn_function(st_c(st_coid(vars_net)), st_c(Banks),3))
 

```

The nearest neighbor features are then plotted below. Visualizing distance provides us with a better view of the spatial distribution of various risk factors. It appears that the southern and southwest part of Chicago are the furthest away from all kinds of risks. One interesting observation is that the northeastern corner of Chicago is the furthest away from shotspotters while being relatively close to all other kinds of risk factors. If shot spotter are used to location crime incidents, their absence in this area raises question of whether this "risk" factor act the same as others. Are shotspotter an indication of more crime in the area or does the presence of shot spotter means greater policing and therefore safter neighborhood? These questions are worth considering. 

```{r visualize nearest distance, fig.height=8, fig.width=11, fig.align = 'center'}

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)


p1.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Abandoned_Buildings.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Abandoned Buildings NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p2.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Abandoned_Cars.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Abandoned Cars NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p3.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Bank.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Banks NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p4.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Graffiti.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Graffiti NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p5.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Liquor_Retail.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Liquor Retail NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p6.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Sanitation.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Sanitation NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

p7.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Shot_Spotter.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Shot Spotter NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


p8.nn <- ggplot() +
  geom_sf(data = vars_net.long.nn %>% filter(Variable == "Street_Lights_Out.nn"), aes(fill=value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Street Light Out NN") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

wrap_plots(
  p1.nn, p3.nn, p2.nn, p4.nn,
  p5.nn, p6.nn, p7.nn, p8.nn,
  ncol = 4  
)


```


## Local Spatial Autocorrelation

The second feature engineering approach we took is to account for the spatial process of robbery incidents through determining robbery hotspots in Chicago. Unlike the global Moran's I which would tell us if there's spatial autocorrelation of robberies of the entire area, local Moran's I examine if robbery count at a given location is randomly distributed relative to its immediate neighbors. A spatial weights matrix is calcualted below, which is used to relate a unit to its neighbors using queen contiguity (every grid cell is related to its eight adjacent neighbors). 

```{r calculate local moran I, message=FALSE, warning=FALSE}

final_net <-
  left_join(robbery21_net, st_drop_geometry(vars_net), by="uniqueID")
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

local_morans <- localmoran(final_net$countRobbery, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Robbery_Count = countRobbery, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse(P_Value <= 0.001, 1, 0)) %>%
  gather(Variable, Value, -geometry)
```

The results of the localmoran test is combined with the spatial layer for visualization. Several useful test statistics, including Moran's I, the p-value, and significant robbery hotspots are visualized. We may see that the Local Moran's I value is significantly higher in northeastern Chicago, close to the CBD area. Areas with low p-value suggests that there is spatial autocorrelation between robbery incidents in cells that are proximate to each other. Significant hotspots, meaning those grid cells with higher local counts than what might otherwise be expected under randomness (p-values <= 0.05), follows the same pattern as what's shown in the kernal density map above.  

```{r visualize moran I, fig.height=8, fig.width=11, fig.align = 'center'}

moran1 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Robbery_Count"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Robbery Count") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran2 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Local_Morans_I"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Local Morans I") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran3 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "P_Value"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "P Value") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


moran4 <- ggplot() +
  geom_sf(data = final_net.localMorans %>% filter(Variable == "Significant_Hotspots"), aes(fill=Value), colour=NA) +
  scale_fill_viridis(option = "magma") + 
  labs(title = "Signigicant Hotspots") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

moran1 + moran2 + moran3 + moran4
```

## Singificant Robbery Clusters

Following the calculation of significant robbery clusters, we created a dummy variable indicating whether this grid cell is located within the hotspot. We also calculated the distance from the centroid of each grid cell to its nearest robbery hotspot. 

```{r calculate significant robbery points, warning=FALSE, message=FALSE}

final_net <-
  final_net %>% 
  mutate(robbery.isSig = ifelse(localmoran(final_net$countRobbery, 
                             final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(robbery.isSig.dist = nn_function(st_coordinates(st_centroid(final_net)),
                                          st_coordinates(st_centroid(
                                            filter(final_net, robbery.isSig == 1))), 1))
```

The figure below visualizes distance to nearest robbery hotspots in Chicago, where the darker the color, the shorter the distance. 

```{r visualize robbery NN Distance, fig.align = 'center'}

ggplot() +
      geom_sf(data = final_net, aes(fill=robbery.isSig.dist), colour=NA) +
      scale_fill_viridis(option = "magma", name="NN Distance") +
      labs(title="Robbery NN Distance") +
      theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 10, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )


```

## All Risk Factors

The figure below organizes count and nearest neighbor (nn) correlations side-by-side. All of our count features are positively correlated with robbery incidents, meaning the greater number of risk factors, the higher number of robbery. The relationship is the strongest between robbery and sanitation complaint as well as shot spotter. All of our nearest neighbor features are negatively correlated with robbery incidents, meaning that the greater the distance, the less number of robbery. The relationship is the strongest for distance to signficant robbery clusters. While correlation for count features is a bit awkward, this approach can help with feature selection. For a given risk factor, we have to avoid colinearity by selecting either the count or nearest neighbor feature. In the case of our model, we used nearest distance. 


```{r correlation, fig.height=15, fig.width=8, fig.align = 'center'}

final_net_long <- final_net%>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countRobbery, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") 

correlation.cor <-
  final_net_long %>%
    group_by(Type) %>%
    summarize(correlation = cor(Number, countRobbery, use = "complete.obs"))

final_net_long %>%
  ggplot(aes(x= Number, y = countRobbery)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1, size=3) +
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Abandoned_Buildings` = "Abandoned Buildings",
    `Abandoned_Buildings.nn` = "Distance to Abandoned Buildings",
    `Abandoned_Cars` = "Abandoned Cars",
    `Abandoned_Cars.nn` = "Distance to Abandoned Cars",
    `Bank` = "Bank",
    `Bank.nn` = "Distance to Bank",
    `Graffiti` = "Graffiti",
    `Graffiti.nn` = "Distance to Graffiti",
    `Liquor_Retail` = "Liquor Retail",
    `Liquor_Retail.nn` = "Distance to Liquor Retail",
    `robbery.isSig` = "Significant Robbery",
    `robbery.isSig.dist` = "Distance to Significant Robbery",
    `Sanitation` = "Sanitation",
    `Sanitation.nn` = "Distance to Sanitation",
    `Shot_Spotter` = "Shot Spotter", 
    `Shot_Spotter.nn` = "Distance to Shot Spotter",
    `Street_Lights_Out` = "Street Light Out",
    `Street_Lights_Out.nn` = "Distance to Broken Street Light")))  +
  labs(title = "Scatter Plots of Predictor Variables",
       caption = "Data: Chicago Data Portal") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

```


# Poisson Regression

## Random K-Fold CV

Below, goodness of fit metrics are generated for four regressions - two including Just Risk Factors `reg.vars`, and a second `reg.ss.vars` includes risk factors plus the Local Moran’s I Spatial Process. Two kinds of cross validation test were performed. The first one uses a randomly generated `cvID` associated with each grid cell for random k-fold cross validation. 

```{r random k-fold regression, warning=FALSE, results='hide'}

reg.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
              "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
              "Shot_Spotter.nn", "Bank.nn")

reg.ss.vars <- c("Abandoned_Buildings.nn", "Abandoned_Cars.nn", "Graffiti.nn", 
                 "Liquor_Retail.nn", "Street_Lights_Out.nn", "Sanitation.nn", 
                 "Shot_Spotter.nn", "Bank.nn", "robbery.isSig", "robbery.isSig.dist")

reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobbery",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = cvID, countRobbery, Prediction, geometry)

```

## Spatial LOGO CV

The second one hold out one neighborhood, train the model on the remaining n-1 areas, predict for the hold out, and record the goodness of fit. This is also called the spatial 'Leave-one-group-out' cross-validation (LOGO-CV), in which each neighborhood takes a turn as a hold-out. The result of each analysis is a sf layer with observed and predicted burglary counts.

```{r spatial logo cv regression, warning=FALSE, message=FALSE, results='hide'}

neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countRobbery, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countRobbery",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countRobbery, Prediction, geometry)
```

## Error Analysis

The code block below creates a long form reg.summary, that binds together observed/predicted counts and errors for each grid cell and for each regression. Residuals are calculated for each regression by subtracting the original robbery incidents from predicted incidents. 

```{r summarize regression}

reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countRobbery,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countRobbery,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countRobbery,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countRobbery,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 
```

Comparing the error across different regressions, we see that adding spatial process features improve the model. In addition, the model appears slightly less robust for the spatial cross-validation, probably because LOGO-CV is such a conservative assumption. 

```{r summarize errors by fold and regression, message=FALSE, warning=FALSE}

error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countRobbery, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>% ungroup()

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(col.name=c("Regression", 'Mean Absolute Error','Standard Deviation MAE')) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
    row_spec(2, color = "black", background = "#FCFFA4") %>%
    row_spec(4, color = "black", background = "#FCFFA4") 


```

The small multiple maps below visualizes the distribution of errors by regression. For the random k-fold cv method, the errors are pretty evenly distributed across space. However, for spatial logo-cv method, higher errors occur at spatial hotspots. 

```{r visualize error,fig.height=8, fig.width=11, fig.align = 'center'}

error_by_reg_and_fold %>% 
  ggplot() +
  geom_sf(aes(fill=MAE), color="transparent") +
  scale_fill_viridis(option = "magma", direction = -1) + 
  facet_wrap(~Regression, ncol = 4) +
   labs(title = "MAE by Fold and Regression",
       caption = "Data: Chicago Data Portal") +
  theme(legend.position = "bottom",
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8),
        strip.text = element_text(size = 7)) 

```

# Prediction Accuracy 

## Generalizability Test

Let’s now test for generalizability across racial-neighborhood context. To test this proposition, tidycensus is used to pull race data by census tract. Percentage minority population for each census tract is calculated by comparing the total number of Latinx, African American, and Asian population to the total population. Census tracts are split into two groups: majority white and minority, depending on whether the proportion of minority population in that census tract exceeds 60%. 


```{r download race data, message=FALSE, warning=FALSE, results='hide'}

census_api_key(dlgInput(
  "Enter a Census API Key", # ask for an api key
  Sys.getenv("CENSUS_API_KEY")
)$res,
overwrite = TRUE)
tracts21 <- 
  get_acs(geography = "tract", 
          variables = c("B02001_001E", # total population
            "B02001_002E", # white population
            "B02001_003E", # black population
            "B02001_005E", # asian population
            "B03002_012E"), 
          year=2021, state=17, county=031, 
          geometry=TRUE, output="wide") %>%
  st_transform('ESRI:102271') %>% 
  rename(TotalPop = B02001_001E, 
         Whites = B02001_002E,
         African_Americans = B02001_003E,
         Asians = B02001_005E,
         Latinx = B03002_012E) %>% 
  mutate(pctMinority = ifelse(TotalPop > 0, (African_Americans + Asians + Latinx ) / TotalPop, 0), 
         majority = ifelse(pctMinority > 0.5, "minority", "majority")) %>%   .[neighborhoods,]


```

We may see that Chicago remains a very segregated city and there's clear racial boundary between places where more than 60% of the population are minority and places that's majority white. In particular, northern and northeastern (CBD) part of Chicago are majority white while the remaining parts are all made up of racial minorities. 

```{r visualize race context, fig.align = 'center'}

ggplot() + geom_sf(data = na.omit(tracts21), aes(fill = majority), color = NA) +
    scale_fill_manual(values = c("#BB3754", "#56106E"), name="Race Context") +
    labs(title = "Race Context ",
         caption = "Data: American Community Survey 2021") +
  theme(
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.4)
        ) 
```

Error is calculated by subtracting the observed robbery count from the prediction. A positive difference represents an over-prediction. The least ideal result is a model that over-predicts risk in minority areas, and under-predicts in white areas. If reporting selection bias is an issue, such a model may unfairly allocate police resource disproportionately in black and brown communities. In our case, over-prediction of minorities is an issue for random k-fold cv regression, but not for logo-cv regressions. The latter two models under-predict minority neighborhoods and over-predict majority white neighborhood, making it looks like that this algorithms generalizes well with respect to race. 

```{r generalizability by race, message=FALSE, warning=FALSE}

joinrace <- st_centroid(reg.summary) %>% 
  st_intersection(tracts21 %>%dplyr:: select(pctMinority)) %>% 
  st_drop_geometry() %>% 
  group_by(cvID) %>% 
  summarize(meanMinor = mean(pctMinority))


reg.summary <- reg.summary %>% 
  left_join(joinrace, by = "cvID") 


reg.summary %>% 
  mutate(raceContext = ifelse(meanMinor > .6, "Minority", "Majority White")) %>% 
  st_drop_geometry() %>% 
  group_by(Regression, raceContext) %>%
  summarize(mean.Error = mean(Error, na.rm = T)) %>%
  spread(raceContext, mean.Error) %>%
  kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
    row_spec(2, color = "black", background = "#FCFFA4") %>%
    row_spec(4, color = "black", background = "#FCFFA4") 

```

## Kernel Desntiy Estimation

The utility of this algorithm is judged relative to an alternative police allocation method: kernel density estimation. Kernel density works by centering a smooth kernel, or curve, atop each crime point such that the curve is at its highest directly over the point and the lowest at the range of a circular search radius. The code block below creates a Kernel density map of robbery with a 1000 foot search radius. 
 
```{r calculate 2021 Kernal Density, fig.width=6, message=FALSE, warning=FALSE, fig.align = 'center'}

rob_ppp <- as.ppp(st_coordinates(robbery21), W = st_bbox(final_net))
rob_KD.1000 <- spatstat.explore::density.ppp(rob_ppp, 1000)
rob_KD.df <- data.frame(rasterToPoints(mask(raster(rob_KD.1000), as(neighborhoods, 'Spatial'))))


ggplot(data=rob_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(option = "magma", name="Density") +
  labs(title = "Kernel Density of Robbery 1000ft Radii") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )
```

Next, a new goodness of fit indicator is created to illustrate whether the 2021 kernel density or risk predictions capture more of the 2022 robberies.

```{r download 2022 robery data, warning=FALSE, message=FALSE}

robbery22 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2022/9hwr-2zxp") %>% 
   filter(Primary.Type == "ROBBERY") %>% 
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

```

The kernel density estimation was classified into 5 risk categories and the count for the 2022 robberies were summarized into the same sf layer.  The purpose here is to see if the risk predictions capture more observed burglaries than the kernel density. If so, then the risk prediction model provides a more robust targeting tool for allocating police resources. 

```{r categorize KDE}

rob_KDE_sum <- as.data.frame(rob_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 

kde_breaks <- classIntervals(rob_KDE_sum$value, 
                             n = 5, "fisher")

rob_KDE_sf <- rob_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
    mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label, Risk_Category, robberyCount)

```

The same process was repeated for the risk prediction. Note that both predictions from the LOGO-CV with and without the spatial features is being used here.

```{r categorize random k fold}

ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
rob_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions Spatial",
         Risk_Category =classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
      mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label,Risk_Category, robberyCount)

ml_breaks_simple <- classIntervals(reg.ss.cv$Prediction, 
                             n = 5, "fisher")
rob_risk_sf_simple <-
  reg.ss.cv %>%
  mutate(label = "Risk Predictions Simple",
         Risk_Category =classInt::findCols(ml_breaks_simple),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery21) %>% mutate(robberyCount = 1), ., sum) %>%
      mutate(robberyCount = replace_na(robberyCount, 0))) %>%
  dplyr::select(label,Risk_Category, robberyCount)

```

## Prediction Comparison

Below a map is generated of the risk categories for all three model types. A strongly fit model should show that the highest risk category is uniquely targeted to places with a high density of burglary points. The map shows that our risk prediction models perform as good as the kernal density approach in capturing the distribution of robbery incidents. 

```{r prediction comparisons map, fig.align = 'center'}
rbind(rob_KDE_sf, rob_risk_sf, rob_risk_sf_simple) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    facet_wrap(~label, ) +
    scale_fill_viridis(option = "magma", discrete = TRUE, name = "Risk Category") +
      geom_sf(data = sample_n(robbery22, 3000), size = .03, colour = "white") +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2021 robbery risk predictions; 2022 robberies") + 
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )
```

A more straightforward way to compare the accuracy of prediction is through histogram. A well fit model should show that the risk predictions capture a greater share of 2022 robberies in the highest risk category relative to the Kernel density. Unfortunately, the kernel density model edges out both risk prediction model for highest risk areas. Nevertheless, the risk prediction model does a better job in capturing robbery incidents in lower-risk areas that the kernel density model had ignored. 

```{r prediction comparisons figure, warning=FALSE, message=FALSE, fig.align = 'center'}
rbind(rob_KDE_sf, rob_risk_sf, rob_risk_sf_simple) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countRobbery = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countRobbery / sum(countRobbery)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(option = "magma", discrete = TRUE, name = "Model", direction = -1) +
      labs(title = "Risk Prediction vs. Kernel Density",
           y = "% of Test Set Robberies (per model)",
           x = "Risk Category") +
  theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))
```

# Conclusions

In conclusion, there are still a lot of work that needs to be done before this model could be put into practice. Comparing to the traditional methodology of using kernel density to estimate crime locations, our model did not do very well in capturing robbery incidents in areas designated as the highest risk category. There are two main reasons behind. Firstly, most predictor variables we used in our model are out of date. While the Chicago Data Portal has been constantly updating its crime dataset, almost all of the risk factor dataset we used were last updated in 2019. That said, we are using reports of abandoned cars, graffiti, abandoned buildings, street light incidents, sanitation complaints, etc., from 2019 to fit a model that summarize robberies in 2021, which is then subsequently used to predict robberies in 2022. If there’s a correlation between locations of these risk factors and robbery incidents, then the removals of graffiti and renovations of abandoned buildings in 2021, for example, would very much likely to shift the geographies of robbery incidents. For a better prediction, we highly encourage the release of risk factor datasets that are up to date. 

In addition, we believe that the global pandemic since 2020 would have had a significant impact on locations of robberies. Economic distress, driven by high unemployment rates, could create financial strain, leading to an increase in robberies in areas already facing economic challenges. Lockdowns and changed routines due to social distancing measures could also create opportunities for criminals. In addition, the reallocation of police resources to enforce public health measures would result in decreased police presence in areas that were regularly patrolled. All of these would contribute to a growing number of robbery incidents in area unexpected before. If we sticked with a more general model to predict robbery incidents during this very unusual time, then it is very likely that our model produces skewed results. This reminds us again the importance of ensuring the generalizability of any machine learning model not only across space but also time. In this case, the model should consider risk factors related to the pandemic. 

Nevertheless, our model did a better job in capturing robbery incidents in lower-risk areas that the kernel density model had ignored. While being low risk, it is also important not to overlook potential robbery incidents in these areas. Ignoring the “risk” of low-risk area and overly emphasizing high risk areas would lead to over-policing in high-risk areas, which would continue to remain as high risk ever after. This leads us to think about whether using a single model is not enough to capture the complexities of robbery incidents in one place. A collective approach that strategically make use of the result of both the kernel density estimation and the risk prediction models can probably lead to more accurate predictions. 

On top of that, we have avoided the overprediction of robbery incidents in majority non-white neighborhood. Yet, despite finding that the model generalizes well across different neighborhood contexts, we cannot be sure the model doesn’t suffer from selection bias. If law enforcement systematically over-polices certain communities, and this selection criteria goes unaccounted for in the model, then the model may be still biased regardless of the above tests. We also found that adding spatial process features improve the model and using the random k-fold cross validation method would enhance the model. 