Version 3.0 | First Created Nov 17, 2023 | Updated Dec 8, 2023
Keywords: K Nearest Neighbors, Moran’s I, Getis-Ord Gi*, Kernel
Density Estimation, Random K-fold Cross Validation, Logistic Regression,
Poisson Regression, ROC Curve, Principal Component Analysis,
Multi-Criteria Analysis, Health Geography
Important Links: study
repository, pitch
presentation
Introduction
The opioid overdose crisis continues as one of the most significant
public health challenges of the past two decades in the US. Between 1999
and 2019, the national age-adjusted opioid-involved overdose death rate
increased from 2.9 per 100,000 population to 15.5 per 100,000. Rates
have increased steeply during the COVID-19 pandemic. In 2020, the rate
climbed to 21.4 per 100,000 and to 24.7 per 100,000 in 2021. Most
literature to this date have focused on understanding the drivers of
overdose crisis, which include drug supply, drug manufacturing, pain
arising from work-related injuries, income inequality, and added stress,
isolation, and economic disadvantage connected to the COVID-19 pandemic.
Until recently, however, more and more studies have begun to consider
the association between drug use, opioid-related mortality, and the
built-environment. Contextual risk factors, especially features of the
built environment, the presence of violent crime or crime associated
with substance use, and neighborhood-level socioeconomic factors, have
enabled more accurate identification of high-risk overdose areas, thus
driving initiatives on alleviating overdose risk and crises.
We conducted a case study on opioid-overdose risk modeling in
Cincinnati, Ohio where the opioid epidemic has been a serious issue.
Since 2009, drug overdose deaths in Ohio have continued to increase, and
Cincinnati, a large city in southwestern Ohio with a population of about
300,000 people, was one of 12 opioid hotspots in Ohio and had the
highest per capita rates of a fatal overdose between 2010 and 2017.
Today, heroin overdose incidences still remain high in the city since
heroin and prescription drugs abuse have further led to other public
health problems such as HIV and needle-borne diseases.
We combined multiple data sources, including emergency medical
service response, crime and non-emergency reports data, American
Community Survey data, and datasets that are proxies for local built
environment to analyze the distributions of heroin-related overdose
incidents in Cincinnati, Ohio.
Among studies of spatial epidemiology, geospatial clustering has been
used to analyze the geographic variation of phenomena, such as the
disproportionate impact of public health crisis on people with
disabilities, and the concentrated vulnerability of socio-economically
disadvantaged population under natural hazards. It allows the
investigators to identify groups of spatial objects (i.e. clusters) that
have similar characteristics and analyze patterns of the clusters (e.g.,
hot spots, cold spots). As such, we used Getis-Ord hotspot analyses to
identify locations of persistent risk in Cincinnati from 2016-2020. We
constructed measures of the socio-built environment and neighborhood
demographic characteristics using principal components analysis (PCA).
The resulting measures were used as covariates in a logistic regression
with random k-fold cross validation to predict if an area was part of an
opioid hotspot cluster. All built environment risk factors were
aggregated, scaled, and weighted to generate an interactive overdose
risk terrain map.
On the technical side, the goal of our study is to
build a model that could predict opioid overdose clusters as well as
socioeconomic and built environment factors associated with opioid
overdose rates in Cincinnati from 2016 to 2020 as much as possible.
On the practical side, we would like our risk terrain
model to benefit both the stakeholders and community members in the main
form of a municipal dashboard and a mobile application. For
stakeholders, the dashboard would serve as an early warning system by
predicting/showing areas at high risk of opioid overdose. Specifically,
it should include maps with overdose cluster locations, trend graphs
over time, and charts illustrating the correlation between overdose
rates and various factors. For community members, this application
connects them to healthcare and rehabilitation resources within the
city. Specifically, the application allow them to call for emergency
services and medical assistance for themselves or for others at one tap.
To encourage public-private partnership, this app would adopt the
approach that OpenStreetMap took to allow both technical and
non-technical users, private and public entities to update existing
information, such as crime incidents, rehabilitation center capacities
and location. We hope that non-technical stakeholders can use the app to
allocate resources efficiently, directing outreach efforts, treatment
facilities, and educational campaigns to areas with the highest
predicted risk. We hope that the public would use this platform to
foster community engagement and support for initiatives addressing the
opioid crisis.
The rest of the report is organized as follows. Section one describes
our approach to access and pre-process the data. Section two documents
an exploratory analysis we did using poission regression with only crime
and 311 incidents. Section three explains the conceptualization our
approach applying the Getis Ord G* statistics to determine spatial
hotspots. Section four is the main part of this study, where we iclude
all built environment predictors to run a logistic regression predicting
if an area is an opioid overdose hotspot or not. Section five applies
principal component analysis to reduce colinerity and include only half
of the principal components into the model. Section six validates the
accuracy of this model against a kernel density estimation while the
last section constructs the risk terrain map.
The conceptualization and methodologies of this study were drawn in
part from:
Choi, et al.(2022) Spatial clustering of heroin-related overdose
incidents: a case study in Cincinnati, Ohio
Srinivasan, et al. (2023) Risk factors for persistent fatal
opioid-involved overdose hotspots in Massachusetts 2011-2021
Chichester, et al.(2023) Crime and Features of the Built
Environment Predicting Risk of Fatal Overdose: A Comparison of Rural and
Urban Ohio Counties with Risk Terrain Modeling.
Data
We used the EMS
dataset that includes all responses to heroin-related overdose
incidents from the Cincinnati Fire Department between 2016 and 2020.
Each incident record contains incident location information, including
geospatial information, location address, a classification of a
neighborhood in the city, time of the incident, and disposition of
incident response. Data records regarding incidents outside of the study
area, without geospatial information, and with unassociated disposition
codes were excluded from this study. Records from canceled calls or
false alarms and duplicated records were also excluded.
heroin <- read.csv(here("data", "public", "Cincinnati_Fire_Incidents__CAD___including_EMS__ALS_BLS_.csv"))
cincinnati <- st_read(here("data", "public", "Cincinnati_City_Boundary.geojson")) %>% st_transform("EPSG:3735")
heroin <- heroin %>%
filter(CFD_INCIDENT_TYPE_GROUP != "NON-PROTOCOL PROBLEM TYPES") %>%
filter(is.na(LATITUDE_X) == FALSE & is.na(LONGITUDE_X) == FALSE) %>% # remove incidents without spatial info
filter(is.na(DISPOSITION_TEXT) == FALSE # remove record with un-associated disposition codes
& CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL" # remove records from canceled calls, false alarm, and duplication
& CFD_INCIDENT_TYPE_GROUP !="CANCEL INCIDENT"
& CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,DEF: DEFAULT"
& CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,EMSF: FALSE"
& CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,DUPF: DUPLICATE"
& CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,FALA: FIRE FALSE AC"
& CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,MEDD: MT DISREGARDE"
& CFD_INCIDENT_TYPE_GROUP !="DUPF: DUPLICATE"
& CFD_INCIDENT_TYPE_GROUP !="DUPLICATE INCIDENT"
& CFD_INCIDENT_TYPE_GROUP !="MAL: SYSTEM MALFUNCTION") %>%
mutate(time = mdy_hms(CREATE_TIME_INCIDENT, tz = "UTC"),
year_column = year(time)) %>%
st_as_sf(., coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326) %>%
st_transform("EPSG:3735") %>%
st_intersection(cincinnati %>% dplyr::select(OBJECTID),.) %>% # remove incidents outside of study area
filter(year_column %in% c(2016, 2017, 2018, 2019, 2020))
In addition, we used Homeland
Infrastructure Foundation-Level Data (HIFLD) and Substance Abuse
and Mental Health Services Administration (SAMHSA) data for
2015–2020 to obtain information about available healthcare services,
such as hospitals, opioid treatment programs, and buprenorphine
prescribing physicians, etc.
We filtered data only for Cincinnati in the HIFLD dataset. The SAMHSA
data were geocoded manually to get its location information first before
reading into R.
hospitals <- st_read(here("data","public", "Hospitals.geojson"))
hospitals <- hospitals %>%
filter(STATE == "OH" & CITY == "CINCINNATI") %>%
st_transform("EPSG:3735") # select those only in Cincinnati
rehab <- read.csv(here("data", "public", "rehabilitation.csv"))
rehab <- rehab %>%
filter(is.na(Latitude) == FALSE & is.na(Longitude) == FALSE) %>%
st_as_sf(., coords = c("Longitude", "Latitude"), crs = 4326) %>%
st_transform("EPSG:3735")
Furthermore, we included the crime
rate data from the Cincinnati Police Department between 2016 and
2020. Specific crimes were selected for inclusion in analyses based on
having been identified in the literature as a spatial risk factor for
drug overdose or substance use behavior or for having a theoretical
relationship with drug overdose.
crime <- read.csv(here("data", "private", "PDI__Police_Data_Initiative__Crime_Incidents.csv"))
crime_list <- c("AGGRAVATED ASSAULT", "BURGLARY", "BREAKING AND ENTERING", "ROBBERY", "DOMESTIC VIOLENCE", "MURDER ", "RAPE", "THEFT")
crime <- crime %>%
filter(OFFENSE %in% crime_list) %>%
filter(is.na(LATITUDE_X) == FALSE & is.na(LONGITUDE_X) == FALSE) %>%
st_as_sf(., coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326) %>%
st_transform("EPSG:3735") %>%
mutate(time = mdy_hms(DATE_REPORTED, tz = "UTC"),
year = year(time)) %>%
filter(year %in% c(2016, 2017, 2018, 2019, 2020))
From the same data source, we included the 311
non-emergency service requests from 2016 to 2020.
complaints <- read_csv(here("data", "private", "Cincinnati_311__Non-Emergency__Service_Requests.csv"))
complaints <- complaints %>%
filter(is.na(LATITUDE) == FALSE & is.na(LONGITUDE) == FALSE) %>%
st_as_sf(., coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>%
st_transform("EPSG:3735") %>%
mutate(time = mdy_hms(REQUESTED_DATE, tz = "UTC"),
year = year(time)) %>%
filter(year %in% c(2016, 2017, 2018, 2019, 2020))
The connection between substance use and built environment variables
(access to public restrooms, access to pharmacies, and driving distance
to services, defined in our study as fast-food restaurants, gas stations
and public parks) are also important. Public restrooms are associated
with people who inject drugs (PWID) because many people use drugs in
these spaces.Pharmacies represent an important access variable for
several reasons. During the initial wave of the overdose crisis,
pharmaceutical prescriptions, either legitimate, diverted, or
potentially inappropriate, fed the opioid supply. In addition, naloxone
(a medication to reverse an opioid overdose) is available at pharmacies
without a prescription, although this provision may vary by neighborhood
socio-demographic levels. All of these data were queried from
OpenStreetMap and preprocessed in QGIS (fixing geometry, creating
spatial indices, and unioning point with polygon data pieces)
# gas station
fuel <- st_read(here("data", "public", "fuel.geojson")) %>% st_transform("EPSG:3735")
# fast food restaurant
fastfood <- st_read(here("data", "public", "fastfood.geojson")) %>% st_centroid() %>% st_transform("EPSG:3735")
# public parks
parks <- st_read(here("data", "public", "parks.geojson")) %>% st_transform("EPSG:3735") %>% st_centroid()
# pharmacy
pharmacy <- st_read(here("data","public", "pharmacy.geojson")) %>% st_transform("EPSG:3735") %>% st_centroid()
To identify demographic and sociographic characteristics
corresponding to the EMS dataset, we utilized the ACS data for 2016 –
2020. We downloaded demographic information for each census tract,
including total population, gender, adult population size, and
race/ethnicity. The ACS data also included socioeconomic characteristics
of each census tract, such as education, income, and poverty. For each
census tract, we computed the population aged between 25 and 54 (since
this is the age group most vulnerable to opioid overdose), gender ratio,
minority popultion ratio, percentage poverty, and percentage of
population with a bachelor’s degree. Note that the boundary of
Cincinnati does not fully contain several census tracts and therefore we
have tried our best to include tracts within Hamilton county that mostly
fall within Cincinnati.
cincinnati20 <- get_acs(geography = "tract",
variables = c(
"B01001_001E", # total population
"B01001_002E", # total male
"B01001_011E", # male 25-29
"B01001_012E",
"B01001_013E",
"B01001_014E",
"B01001_015E",
"B01001_016E", # male 50-54
"B01001_026E", # total female
"B01001_035E", # female 25-29
"B01001_036E",
"B01001_037E",
"B01001_038E",
"B01001_039E",
"B01001_040E", # female 50-54
"B02001_002E", # white population
"B02001_003E", # black population
"B02001_005E", # asian population
"B03002_012E", # latinx population
"B19013_001E", # median household income
"B06012_002E", # poverty
"B06009_005E" # bachelor
),
year=2020, state="OH", county="Hamilton",
geometry=TRUE, output="wide") %>%
st_transform("EPSG:3735")
cincinnati_tracts <- st_read(here("data", "public", "cincinnati_tracts.geojson")) %>% st_transform("EPSG:3735")
cincinnati_tracts <- cincinnati_tracts %>% st_drop_geometry() %>% dplyr::select(GEOID) %>% as.list(GEOID)
cincinnati20 <- cincinnati20 %>%
filter(GEOID %in% cincinnati_tracts$GEOID) %>%
rename(Totalpop = B01001_001E,
MedHHInc = B19013_001E) %>%
mutate(pop25_54 = B01001_011E + B01001_012E + B01001_013E + B01001_014E + B01001_015E + B01001_016E + B01001_035E + B01001_036E + B01001_037E + B01001_038E+ B01001_039E + B01001_040E) %>%
mutate(MF_ratio = B01001_002E / B01001_026E) %>%
mutate(race_ratio = B02001_002E / (B02001_003E + B02001_005E + B03002_012E)) %>%
mutate(pctPoverty = B06012_002E / Totalpop) %>%
mutate(pctBachelor = B06009_005E / Totalpop) %>%
dplyr::select(pop25_54, Totalpop, MedHHInc, GEOID, MF_ratio, race_ratio, pctPoverty, pctBachelor)
The table below summarizes all of our predictor variables in this
study:
Medical Service |
|
Hospital |
Homeland Infrastructure Foundation-Level Data |
Rehabilitation Center |
Substance Abuse and Mental Health Services
Administration |
Built Environment |
|
Crime Rate |
Open Data Cincinnati |
311 complaints |
Open Data Cincinnati |
Gas station |
OSM |
Public parks |
OSM |
Fast food restaurants |
OSM |
Pharmacies |
OSM |
Socio-demographic
Characteristics |
|
Population aged between 25 and 54 |
ACS |
Gender ratio |
ACS |
White over minority ratio |
ACS |
Population with a Bachelor’s degree |
ACS |
Median household income |
ACS |
Population under poverty line |
ACS |
Median Household Income |
ACS |
Poission Regression
As part of our exploratory analysis, we ran a poisson regression
using only the number of crime incidents and the number of 311
complaints as our predictor variables. Since heroin overdose incident is
not a phenomenon that varies across administrative units, but one
varying smoothly across the landscape, we would need to aggregate these
point-level data into a lattice grid of cells. The code below generates
a 500 by 500 meters (820 feet) fishnet for Cincinnati to achieve this
goal.
fishnet <- st_make_grid(cincinnati,
cellsize = 820,
square = TRUE) %>%
.[cincinnati] %>%
st_sf() %>%
mutate(uniqueID = 1:n())
ggplot() +
geom_sf(data=fishnet, color="black", fill="#FFF5EE") +
labs(title = "Fishnet of Cincinnati") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
We would like to compare across both space and time, and so we divide
the heroin dataset into one training set and two testing sets. The
training set contains incidents from 2016 to 2018 while the testing sets
contain incidents from 2019 and 2020 respectively. Since each individual
year has well enough observations of heroin-overdose incidents,
separating the testing set into 2019 and 2020 allows us to understand
how COVID-19 plays a role in affecting drug overdose rate. We also
separate the crime incident data and the 311 complaints data into
respective year groups so that we could aggregate them into the
fishnet.
heroinTrain <- heroin %>%
filter(year_column %in% c(2016, 2017, 2018))
heroin19 <- heroin %>%
filter(year_column == 2019)
heroin20 <- heroin %>%
filter(year_column == 2020)
crimeTrain <- crime %>%
filter(year %in% c(2016, 2017, 2018))
crime19 <- crime %>%
filter(year == 2019)
crime20 <- crime %>%
filter(year == 2020)
complaintTrain <- complaints %>%
filter(year %in% c(2016, 2017, 2018))
complaint19 <- complaints %>%
filter(year == 2019)
complaint20 <- complaints%>%
filter(year == 2020)
The following two code chunks perform all the necessary tasks to
prepare our training and testing datasets by aggregating
heroin-incidents, crime incidents, and 311 complaints all into the same
spatial unit, in this case, the fishnet.
# add heroin in net
heroinTrain_net <-
dplyr::select(heroinTrain) %>%
mutate(countHeroin = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countHeroin = replace_na(countHeroin, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
# adding crime to net
heroinTrain_net <- heroinTrain_net %>%
st_join(crimeTrain, ., join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(Crimecount = n()) %>%
left_join(heroinTrain_net, . ) %>%
st_sf() %>%
mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))
# add 311 to net
heroinTrain_net <- heroinTrain_net %>%
st_join(complaintTrain, ., join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(Complaintscount = n()) %>%
left_join(heroinTrain_net, . ) %>%
st_sf() %>%
mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
# 2019 test sets
heroin19_net <-
dplyr::select(heroin19) %>%
mutate(countHeroin = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countHeroin = replace_na(countHeroin, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
heroin19_net <- heroin19_net %>%
st_join(crime19, ., join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(Crimecount = n()) %>%
left_join(heroin19_net, . ) %>%
st_sf() %>%
mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))
heroin19_net <- heroin19_net %>%
st_join(complaint19, ., join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(Complaintscount = n()) %>%
left_join(heroin19_net, . ) %>%
st_sf() %>%
mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
# 2020 test set
heroin20_net <-
dplyr::select(heroin20) %>%
mutate(countHeroin = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countHeroin = replace_na(countHeroin, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
heroin20_net <- heroin20_net %>%
st_join(crime20, ., join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(Crimecount = n()) %>%
left_join(heroin20_net, . ) %>%
st_sf() %>%
mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))
heroin20_net <- heroin20_net %>%
st_join(complaint20, ., join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(Complaintscount = n()) %>%
left_join(heroin20_net, . ) %>%
st_sf() %>%
mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
We performed some data visualization work to see if there are any
differences between heroin-overdose incidents over year. We found that
the distribution of these incidents are pretty consistent between
2016-2018 and 2019, with the majority of incidents significantly
clustered in south Cincinnati. However, in 2020, despite that this
overdose cluster still remain, overdose incidents became more disperse.
We may see some small, emerging clusters of high overdose incidents
scattered around the city.
plot1 <- ggplot() +
geom_sf(data = heroinTrain_net, aes(fill = countHeroin), color = NA) +
scale_fill_viridis(option = "rocket", name = "Counts") +
labs(title = "Heroin OD 2016-2018") +
theme(legend.position="bottom",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot2 <- ggplot() +
geom_sf(data = heroin19_net, aes(fill = countHeroin), color = NA) +
scale_fill_viridis(option = "rocket", name = "Counts") +
labs(title = "Heroin OD 2019") +
theme(legend.position="bottom",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot3 <- ggplot() +
geom_sf(data = heroin20_net, aes(fill = countHeroin), color = NA) +
scale_fill_viridis(option = "rocket", name = "Counts") +
labs(title = "Heroin OD 2020") +
theme(legend.position="bottom",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot1 + plot2 + plot3
We created heat maps of heroin-overdose incidents from 2016 to 2018
and overlayed crime incidents and 311 complaints above. Based on the
visualization, we assume crime and 311 complaints to be strong
predictors of overdose risk as places of high overdose incidents align
perfectly with places with higher crime rate and more complaints. Yet,
it is also pretty evident that crime incidents are correlated with 311
complaints.
plot1_1 <- ggplot() +
geom_sf(data = cincinnati, fill = "black") +
stat_density2d(data = data.frame(st_coordinates(heroinTrain)),
aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),
size = 0.01, bins = 60, geom = 'polygon') +
scale_fill_viridis(option = "rocket", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
geom_sf(data = sample_n(crimeTrain, 3000), size = 0.03, colour = "white", alpha = 0.5) +
labs(title = "Heroin OD and Crime 2016-2018") +
theme(legend.position="bottom",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot1_2 <- ggplot() +
geom_sf(data = cincinnati, fill = "black") +
stat_density2d(data = data.frame(st_coordinates(heroinTrain)),
aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),
size = 0.01, bins = 60, geom = 'polygon') +
scale_fill_viridis(option = "rocket", name = "Density") +
scale_alpha(range = c(0.00, 0.35), guide = "none") +
geom_sf(data = sample_n(complaintTrain, 3000), size = 0.03, colour = "white", alpha = 0.5) +
labs(title = "Heroin OD and Complaint 2016-2018") +
theme(legend.position="bottom",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot1_1 + plot1_2
Mapping the correlation between our predictor variables reveals that
indeed, there’s a strong, linear, and positive relationship between
heroin-overdose incidents and crime as well as 311 complaints, though
the relationships become weaker in 2020. We suspect that disruptions in
the drug supply chain, changes in law enforcement priorities, as well as
increase social isolation and mental health challenges during the
pandemic might have contribute to the decrease in correlation between
overall crime incidents and heroin overdose incidents in 2020.
plot4 <- heroinTrain_net %>%
st_drop_geometry() %>%
dplyr::select(-c(uniqueID, cvID)) %>%
pivot_longer(cols = -countHeroin, # everything except measurement
names_to = "Type", # categorizes all quantitative variables into Type
values_to = "Number") %>%
ggplot(aes(x= countHeroin, y = Number)) +
geom_point(size = 0.01, color = "#000004") +
geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
`Crimecount` = "Crime",
`Complaintscount` = "311 Complaints"))) +
labs(title = "Scatter Plots of Predictor Variables 2016-18") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
plot5 <- heroin19_net %>%
st_drop_geometry() %>%
dplyr::select(-c(uniqueID, cvID)) %>%
pivot_longer(cols = -countHeroin, # everything except measurement
names_to = "Type", # categorizes all quantitative variables into Type
values_to = "Number") %>%
ggplot(aes(x= countHeroin, y = Number)) +
geom_point(size = 0.01, color = "#000004") +
geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
`Crimecount` = "Crime",
`Complaintscount` = "311 Complaints"))) +
labs(title = "Scatter Plots of Predictor Variables 2019") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
plot6 <- heroin20_net %>%
st_drop_geometry() %>%
dplyr::select(-c(uniqueID, cvID)) %>%
pivot_longer(cols = -countHeroin, # everything except measurement
names_to = "Type", # categorizes all quantitative variables into Type
values_to = "Number") %>%
ggplot(aes(x= countHeroin, y = Number)) +
geom_point(size = 0.01, color = "#000004") +
geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
`Crimecount` = "Crime",
`Complaintscount` = "311 Complaints"))) +
labs(title = "Scatter Plots of Predictor Variables 2020") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
plot4 + plot5 + plot6
We ran the Poisson regression and confirmed that crime incidents and
311 complaints are good at predicting heroin-overdose risk. The model is
slightly better when used to predict 2020 data, but the difference is
minimal.
poissionTrain <- glm(countHeroin ~ Crimecount + Complaintscount, family = "poisson",
data = heroinTrain_net)
Prediction19 <-
mutate(heroin19_net, Prediction = predict(poissionTrain, heroin19_net, type = "response")) %>%
mutate(Error = countHeroin - Prediction) %>%
mutate(MAE = mean(abs(Error))) %>%
mutate(SD_MAE = sd(Error)) %>%
mutate(prediction = "2019") %>%
slice(1:1)
Prediction20 <-
mutate(heroin20_net, Prediction = predict(poissionTrain, heroin20_net, type = "response")) %>%
mutate(Error = countHeroin - Prediction) %>%
mutate(MAE = mean(abs(Error))) %>%
mutate(SD_MAE = sd(Error)) %>%
mutate(prediction = "2020") %>%
slice(1:1)
rbind(Prediction19, Prediction20) %>%
st_drop_geometry() %>%
group_by(prediction) %>%
summarize(Mean_MAE = round(MAE, 2),
SD_MAE = round(SD_MAE, 2)) %>%
kable(col.name=c("Prediction", "Mean Absolute Error",'Standard Deviation MAE')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Prediction
|
Mean Absolute Error
|
Standard Deviation MAE
|
2019
|
0.96
|
1.40
|
2020
|
0.91
|
1.14
|
Here are our reflections up to this point:
While there’s significant associations between crime incidents, 311
complaints, and heroin-overdose incidents, building a model that relies
solely on them to predict risk is simply not enough because substance
abuse and overdose risks are complex phenomena influenced by various
factors beyond just criminal activity. The geographical and
environmental context, including the availability of treatment
facilities, proximity to support networks, and neighborhood
characteristics, also plays a crucial role in understanding overdose
risks. Therefore, we must integrate more features showing built
environment and local contexts into the model.
On top of that, Poisson regression is a traditional method used in
geospatial risk model to predict the number (count) of discrete event,
especially crime. However, in making decisions about how to alleviate
the risk of particular medical or criminal incidence, we shouldn’t just
look at the number of accidents at an intersection in a day at a very
specific place. Rather, we should flag areas in a city that are more
vulnerable/risky than others. It is not wrong to care to focus on the
number of incidence, but from a policy-making perspective, learning
whether or not a place is risky would be more efficient, helpful, and
straightforward in guiding decision making. In other words, stakeholders
and the general public do not have to know how many incidents
happened at this place, but whether this place is risky
and therefore requires more attention.
Getis Ord Spatial Hotspot
We adapted this research workflow from Srinivasan, et al. (2023), who
studied the risk factors for opioid overdose in Massachusetts. In its
essence, Srinivasan, et al used principal component analysis to reduce
colinerarity between predictor variables and ran a logistic regression
to predict whether places in Massachusetts are within overdose hotspot
or not.
To replicate their approach as much as possible, we first examined
the spatial distribution of heroin-related incident rates using the
Jenks natural breaks maps. The Jenks natural breaks maps use a nonlinear
algorithm to create groups where within-group similarity is maximized,
and between-group similarity is minimized. The map shows at least one
significant incidence cluster in south Cincinnati and several other
smaller clusters surrounding that hotspot.
heroin_net <-
dplyr::select(heroin) %>%
mutate(countHeroin = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countHeroin = replace_na(countHeroin, 0),
uniqueID = 1:n(),
cvID = sample(round(nrow(fishnet) / 24),
size=nrow(fishnet), replace = TRUE))
breaks <- classIntervals(heroin_net$countHeroin, n = 5, style = "jenks")
ggplot() +
geom_sf(data = heroin_net, aes(fill = countHeroin), color = NA) +
scale_fill_viridis(option = "rocket", breaks = breaks$brks, name = "Count") +
labs(title = "Count of Heroin Overdose for the Fishnet") +
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, linewidth = 0.8)
) +
theme(legend.key.size = unit(0.8, 'cm'), legend.text = element_text(size=6))
We validated our observation by checking the global moran’s I under
the hypothesis that heroin incidences are randomly distributed across
space. The figure below plot the frequency of 999 randomly permutated I
under Monte Carlo simulation, which shows that our observed I is much
higher than all of the randomly generated I’s. This validates the
existence of spatial autocorrelation.
coords <- st_coordinates(heroin_net %>% st_centroid())
neighborList <- knn2nb(knearneigh(coords, 4))
spatialWeights <- nb2listw(neighborList, style="W")
moranTest <- moran.mc(heroin_net$countHeroin,
spatialWeights, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01, fill = "black") +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
labs(title = "Observed and Permuted Moran's I",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10))
Since the global spatial clustering analysis yields only one
statistic to describe the overall point pattern across the whole study
area, we have to use the Getis-Ord Gi* hotspot statistic to identify and
map clusters (i.e., grids with high overdose rates surrounded by grids
with high rates). Below, each hotspot identified is statistically
significant at the p<0.1 level. For estimating the clusters, we used
the queen’s criterion to define our spatial weights matrix.
neigh_nbs <- heroin_net %>%
mutate(
nb = st_contiguity(geometry), # neighbors share border
wt = st_weights(nb), # row-standardized weights
neigh_lag = st_lag(countHeroin, nb, wt)
)
gi_hot_spots <- neigh_nbs %>%
mutate(Gi = local_g_perm(countHeroin, nb, wt, nsim = 999)) %>%
unnest(Gi)
gi_hot_spots <- gi_hot_spots %>%
dplyr::select(gi, p_folded_sim, uniqueID) |>
mutate(
classification = case_when(
# Classify based on the following criteria:
gi > 0 & p_folded_sim <= 0.01 ~ "Very hot",
gi > 0 & p_folded_sim <= 0.05 ~ "Hot",
gi > 0 & p_folded_sim <= 0.1 ~ "Somewhat hot",
gi < 0 & p_folded_sim <= 0.01 ~ "Very cold",
gi < 0 & p_folded_sim <= 0.05 ~ "Cold",
gi < 0 & p_folded_sim <= 0.1 ~ "Somewhat cold",
TRUE ~ "Insignificant"
), # Convert 'classification' into a factor for easier plotting
classification = factor(
classification,
levels = c("Very hot", "Hot", "Somewhat hot",
"Insignificant",
"Somewhat cold", "Cold", "Very cold")
)
)
gi_hot_spots %>%
ggplot() +
geom_sf(aes(fill = classification), color = "black", lwd = 0.1) +
scale_fill_brewer(type = "div", palette = 6, name = "Classification") +
labs(title = "Spatial Hotspots and Coldspots of Heroin Overdose Incidence") +
theme(legend.position="bottom",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
heroin_net <- gi_hot_spots %>%
mutate(hotspot = ifelse(classification %in% c("Very hot", "Hot", "Somewhat hot"), 1, 0)) %>%
st_drop_geometry() %>%
left_join(heroin_net, ., by = "uniqueID")
The cluster map shows seven types of spatial association determined
based on heroin-related overdose incident rate within a grid and its
neighboring grids. If a grid has a high heroin incident rate with
neighboring grids of high heroin incident rates, it was determined as a
hot spot. At the same scheme, if a grid shows a low heroin incident rate
with neighboring grids of low heroin incident rates, it was classified
as a cold spot. Hotspots with p<0.01 are considered as “Very hot”,
with p<0.05 are considered as “Hot”, with p<0.1 are consiered as
“Somewhat hot”. All grids that fall into these three categories received
a 1 for hotspot, otherwise, it gets a zero.
Logistic Regression
Built Environment Features
We aggregated built environment features in addition to crime and 311
complaints into the fishnet of Cincinnati. For crime and complaints, we
counted their number in each grid. For hospitals, rehabilitation
centers, pharmacies, fast food restaurants, gas stations, and public
parks, we calculated the average distance from the center of each grid
to these locations. For demographic variables, we spatially joined tract
level into the grids depending on which census tract each centroid of
the grid fall into.
# adding crime to net
heroin_net <- heroin_net %>%
st_join(crime, ., join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(Crimecount = n()) %>%
left_join(heroin_net, . ) %>%
st_sf() %>%
mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))
# add 311 to net
heroin_net <- heroin_net %>%
st_join(complaints, ., join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID) %>%
summarize(Complaintscount = n()) %>%
left_join(heroin_net, . ) %>%
st_sf() %>%
mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
net_centroid <- st_centroid(heroin_net)
# add hospitals
heroin_net <- heroin_net %>%
left_join(net_centroid %>%
mutate(hospital.nn = nn_function(st_coordinates(net_centroid),
st_coordinates(hospitals), 1)*0.3048) %>%
st_drop_geometry() %>% dplyr::select(hospital.nn, uniqueID),
by = "uniqueID")
# add rehab center
heroin_net <- heroin_net %>%
left_join(net_centroid %>%
mutate(rehab.nn = nn_function(st_coordinates(net_centroid),
st_coordinates(rehab), 1)*0.3048) %>%
st_drop_geometry() %>% dplyr::select(rehab.nn, uniqueID),
by = "uniqueID")
# add pharmacy
heroin_net <- heroin_net %>%
left_join(net_centroid %>%
mutate(pharm.nn = nn_function(st_coordinates(net_centroid),
st_coordinates(pharmacy), 1)*0.3048) %>%
st_drop_geometry() %>% dplyr::select(pharm.nn, uniqueID),
by = "uniqueID")
# add gas station
heroin_net <- heroin_net %>%
left_join(net_centroid %>%
mutate(fuel.nn = nn_function(st_coordinates(net_centroid),
st_coordinates(fuel), 2)*0.3048) %>%
st_drop_geometry() %>% dplyr::select(fuel.nn, uniqueID),
by = "uniqueID")
# add fast food restaurant
heroin_net <- heroin_net %>%
left_join(net_centroid %>%
mutate(fast.nn = nn_function(st_coordinates(net_centroid),
st_coordinates(fastfood), 2)*0.3048) %>%
st_drop_geometry() %>% dplyr::select(fast.nn, uniqueID),
by = "uniqueID")
# add parks
heroin_net <- heroin_net %>%
left_join(net_centroid %>%
mutate(parks.nn = nn_function(st_coordinates(net_centroid),
st_coordinates(parks), 2)*0.3048) %>%
st_drop_geometry() %>% dplyr::select(parks.nn, uniqueID),
by = "uniqueID")
# add demographic vars
heroin_net <- heroin_net %>%
left_join(net_centroid %>%
dplyr::select(uniqueID) %>%
st_intersection(cincinnati20) %>%
st_drop_geometry() %>% dplyr::select(-GEOID), by = "uniqueID") %>%
filter(is.na(pop25_54) == FALSE)
Some exploratory data analysis were performed, including computing
the correlation between all of our predictor variables and heroin
incidence. The figure below shows that all of our predictor variables
are correlated with heroin-overdose incidence, among which all of our
nearest distance variables are negatively correlated with incidence.
This indicates that the further away we are from fast food restaurants
and gas stations, for example, the less number of heroin-overdose
incidence in that area. The relationship is the strongest for crime
incidence and weakest for population aged between 25-54.
heroin_net_long <- heroin_net%>%
st_drop_geometry() %>%
dplyr::select(-c(uniqueID, cvID, gi, p_folded_sim, classification, hotspot, Totalpop)) %>%
pivot_longer(cols = -countHeroin, # everything except measurement
names_to = "Type", # categorizes all quantitative variables into Type
values_to = "Number")
correlation.cor <-
heroin_net_long %>%
group_by(Type) %>%
summarize(correlation = cor(Number, countHeroin, use = "complete.obs"))
heroin_net_long %>%
ggplot(aes(x= Number, y = countHeroin)) +
geom_point(size = 0.01, color = "#000004") +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1, size=3) +
geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
`Complaintscount` = "Complaints",
`Crimecount` = "Crime",
`fast.nn` = "Distance to Fast Food",
`fuel.nn` = "Distance to Gas Stations",
`hospital.nn` = "Distance to Hospital",
`MedHHInc` = "Median Household Income",
`MF_ratio` = "Gender Ratio",
`parks.nn` = "Distance to Parks",
`pctBachelor` = "Percent Bachelor's Degree",
`pctPoverty` = "Poverty Rate",
`pharm.nn` = "Distance to Pharmacy",
`pop25_54` = "Population 25_54",
`rehab.nn` = "Distance to Rehab Center",
`race_ratio` = "Racial Majority over Minority"))) +
labs(title = "Scatter Plots of All Predictor Variables") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
The figure below shows the difference in the number of hotspots or
cold spots based on each of our predicting variable and we may see that
there seems to be a significant difference in all scenarios.
anova_dataset <- heroin_net %>% st_drop_geometry() %>%
dplyr::select(hotspot, Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor)
anova_dataset %>%
gather(Variable, value, -hotspot) %>%
ggplot(aes(as.character(hotspot), value, fill=as.character(hotspot))) +
geom_bar(position = "dodge", stat = "summary", fun = "mean") +
facet_wrap(~Variable, scales = "free", ncol = 5, labeller= labeller(Variable = c(
`Complaintscount` = "Complaints",
`Crimecount` = "Crime",
`fast.nn` = "Distance to Fast Food",
`fuel.nn` = "Distance to Gas Stations",
`hospital.nn` = "Distance to Hospital",
`MedHHInc` = "Median Household Income",
`MF_ratio` = "Gender Ratio",
`parks.nn` = "Distance to Parks",
`pctBachelor` = "Percent Bachelor's Degree",
`pctPoverty` = "Poverty Rate",
`pharm.nn` = "Distance to Pharmacy",
`pop25_54` = "Population 25_54",
`rehab.nn` = "Distance to Rehab Center",
`race_ratio` = "Racial Majority over Minority"))) +
scale_fill_manual(values = c("#000004", "#BB3754")) +
labs(x="Hotspot", y="Value",
title = "Feature Associations with the Likelihood of Overdose Hotspot") +
theme(legend.position = "none") +
theme(plot.subtitle = element_text(size = 12,face = "italic"),
plot.title = element_text(size = 18, face = "bold"),
axis.text.x=element_text(size=10),
axis.text.y=element_text(size=10),
axis.title=element_text(size=9),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
We further conducted a series of ANOVA tests for all predictor
variables to validate if there’s a significant difference in mean number
of crimes, for example, between cold spot and hot spot. The result is in
concur with our expectations: all variables are significant
predictors.
anova_var <- c("Crimecount", "Complaintscount", "pharm.nn", "hospital.nn", "fast.nn", "fuel.nn", "rehab.nn", "parks.nn", "pop25_54", "MedHHInc", "MF_ratio", "race_ratio", "pctPoverty", "pctBachelor")
anova_results <- data.frame(
Df = integer(),
Sum_Sq = numeric(),
Mean_Sq = numeric(),
F_value = numeric(),
P_Value = numeric(), stringsAsFactors = FALSE)
new_names <- c("Crimecount" = "Crime",
"Complaintscount" = "Complaints",
"pharm.nn" = "Distance to Pharmacy",
"hospital.nn" = "Distance to Hospital",
"fast.nn" = "Distance to Fast Food",
"fuel.nn" = "Distance to Gas Stations",
"rehab.nn" = "Distance to Rehab Center",
"parks.nn" = "Distance to Parks",
"pop25_54" = "Population 25_54",
"MedHHInc" = "Median Household Income",
"MF_ratio" = "Gender Ratio",
"race_ratio" = "Racial Majority over Minority",
"pctPoverty" = "Poverty Rate",
"pctBachelor" = "Percent Bachelor's Degree"
)
for (var in anova_var) {
anova_result <- aov(anova_dataset[[var]] ~ anova_dataset$hotspot)
summary_data <- summary(anova_result)[[1]][, c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")][1:1, ]
anova_results <- rbind(anova_results, summary_data)
}
rownames(anova_results) <- new_names
anova_results %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
footnote(general_title = "\n")
|
Df
|
Sum Sq
|
Mean Sq
|
F value
|
Pr(>F)
|
Crime
|
1
|
1.052003e+06
|
1.052003e+06
|
650.50625
|
0.0000000
|
Complaints
|
1
|
1.375003e+07
|
1.375003e+07
|
474.07239
|
0.0000000
|
Distance to Pharmacy
|
1
|
3.429349e+08
|
3.429349e+08
|
149.89597
|
0.0000000
|
Distance to Hospital
|
1
|
1.091828e+08
|
1.091828e+08
|
34.67735
|
0.0000000
|
Distance to Fast Food
|
1
|
2.791552e+08
|
2.791552e+08
|
251.76801
|
0.0000000
|
Distance to Gas Stations
|
1
|
7.760263e+07
|
7.760263e+07
|
95.70131
|
0.0000000
|
Distance to Rehab Center
|
1
|
1.684621e+09
|
1.684621e+09
|
232.82620
|
0.0000000
|
Distance to Parks
|
1
|
7.582010e+07
|
7.582010e+07
|
199.87377
|
0.0000000
|
Population 25_54
|
1
|
2.719819e+07
|
2.719819e+07
|
63.34770
|
0.0000000
|
Median Household Income
|
1
|
5.935462e+10
|
5.935462e+10
|
57.82477
|
0.0000000
|
Gender Ratio
|
1
|
4.562751e+00
|
4.562751e+00
|
115.01797
|
0.0000000
|
Racial Majority over Minority
|
1
|
1.549411e+03
|
1.549411e+03
|
55.81814
|
0.0000000
|
Poverty Rate
|
1
|
3.966738e+00
|
3.966738e+00
|
150.92897
|
0.0000000
|
Percent Bachelor’s Degree
|
1
|
1.328479e-01
|
1.328479e-01
|
14.58168
|
0.0001366
|
Simple Logit
Considering that all predictor variables used in the anova test
appear to be significant, we included all of them in a kitchen-sink
logistic regression. The model shows that in this scenario,
number of crimes, number of
complaints, distance to hospital,
distance to fast food restaurant, gender
ratio, and distance to public parks are highly
significant predictors of heroin-overdose hotspot whereas other
predictors become insignificant.
kitchensink <- glm(hotspot ~ .,
data=heroin_net %>%
st_drop_geometry() %>%
dplyr::select(hotspot, Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor), family="binomial" (link="logit"))
kitchen_sum <- summary(kitchensink)
coefficients_table <- as.data.frame(kitchen_sum$coefficients)
coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|z|)` < 0.001, '***',
ifelse(coefficients_table$`Pr(>|z|)` < 0.01, '**',
ifelse(coefficients_table$`Pr(>|z|)` < 0.05, '*',
ifelse(coefficients_table$`Pr(>|z|)` < 0.1, '.', ''))))
coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|z|)`, digits = 3), coefficients_table$significance)
coefficients_table %>%
dplyr::select(-significance, -`Pr(>|z|)`) %>%
kable(align = "r") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
footnote(general_title = "\n")
|
Estimate
|
Std. Error
|
z value
|
p_value
|
(Intercept)
|
-1.4654205
|
0.6035523
|
-2.4279925
|
0.015*
|
Crimecount
|
0.0075275
|
0.0014618
|
5.1493408
|
0***
|
Complaintscount
|
0.0024300
|
0.0003841
|
6.3269221
|
0***
|
pharm.nn
|
-0.0002075
|
0.0000910
|
-2.2803200
|
0.023*
|
hospital.nn
|
0.0004531
|
0.0000833
|
5.4404140
|
0***
|
fast.nn
|
-0.0020144
|
0.0002101
|
-9.5865098
|
0***
|
fuel.nn
|
-0.0000786
|
0.0001555
|
-0.5053237
|
0.613
|
rehab.nn
|
-0.0000281
|
0.0000614
|
-0.4574004
|
0.647
|
parks.nn
|
-0.0014241
|
0.0002891
|
-4.9260257
|
0***
|
pop25_54
|
-0.0001910
|
0.0001751
|
-1.0908460
|
0.275
|
MedHHInc
|
-0.0000089
|
0.0000082
|
-1.0929448
|
0.274
|
MF_ratio
|
1.2936016
|
0.3804092
|
3.4005528
|
0.001***
|
race_ratio
|
-0.1054348
|
0.0481930
|
-2.1877625
|
0.029*
|
pctPoverty
|
1.1738101
|
0.7640073
|
1.5363858
|
0.124
|
pctBachelor
|
0.0356627
|
1.5429941
|
0.0231126
|
0.982
|
Logit with CV
We would like to perform cross validation on this model to measures
its overall performance and robustness. We decide to modify a cross
validation function that was originally built on cases using Poission
regression by making it suitable for use in our scenario (logistic
regression). This function takes in four parameters: the dataset, a
cross validation ID, which it could use to separate the dataset into
training and testing, a list of dependent variables, and a independent
variable. In our case, we use a randomly generated cvID associated with
each grid cell for random k-fold cross validation. Each group of grids
with the same cvID gets hold out during one iteration.
logitCV <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, all_of(indVariables),
all_of(dependentVariable))
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, all_of(indVariables),
all_of(dependentVariable))
form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
form <- as.formula(form_parts)
regression <- glm(form, data = fold.train %>%
dplyr::select(-geometry, -id), family="binomial" (link="logit"))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
Here, another helper function is written to visualize the confusion
matrix.
draw_confusion_matrix <- function(cm) {
layout(matrix(c(1,1,2)))
par(mar=c(2,2,2,2))
plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
title('CONFUSION MATRIX', cex.main=2)
# create the matrix
rect(150, 430, 240, 370, col='black')
text(195, 435, '0', cex=1.2)
rect(250, 430, 340, 370, col="#BB3754")
text(295, 435, '1', cex=1.2)
text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
text(245, 450, 'Actual', cex=1.3, font=2)
rect(150, 305, 240, 365, col="#BB3754")
rect(250, 305, 340, 365, col='black')
text(140, 400, '0', cex=1.2, srt=90)
text(140, 335, '1', cex=1.2, srt=90)
# add in the cm results
res <- as.numeric(cm$table)
text(195, 400, res[1], cex=1.6, font=2, col='white')
text(195, 335, res[2], cex=1.6, font=2, col='white')
text(295, 400, res[3], cex=1.6, font=2, col='white')
text(295, 335, res[4], cex=1.6, font=2, col='white')
# add in the specifics
plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)
# add in the accuracy information
text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}
The cross validation model is visualized below, from which we see
that our model has a high accuracy rate of 0.936 and a moderate
sensitivity rate of 0.727. Sensitivity is the true-positive rate, which
is the proportion of actual hotspot grids that were correctly identified
by the model. Given this model’s high accuracy rate, this means that our
model is missing out a few heorin-overdose hotspots.
However, note that in Cincinnati, the number of grids that are
classified as hotspot is much smaller than the number of grids that are
classified as coldspot. While this is a good news for Cincinnati,
logistic regression can be sensitive to class imbalance, meaning that
the model might be biased towards the majority class, leading to poor
performance on the minority class.
reg.vars <- c("Crimecount", "Complaintscount", "pharm.nn", "hospital.nn", "fast.nn", "fuel.nn", "rehab.nn", "parks.nn", "pop25_54", "MedHHInc", "MF_ratio", "race_ratio", "pctPoverty", "pctBachelor")
kitchensink.logit.cv <- logitCV (
dataset = heroin_net,
id = "cvID",
dependentVariable = "hotspot",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, hotspot, Prediction, geometry)
kitchensink.logit.cv <- kitchensink.logit.cv %>%
mutate(PredictionCat = as.factor(ifelse(kitchensink.logit.cv$Prediction > 0.5 , 1, 0))) %>%
mutate(hotspot = as.factor(hotspot))
cm_basic <- caret::confusionMatrix(kitchensink.logit.cv$hotspot, kitchensink.logit.cv$PredictionCat,
positive = "1")
draw_confusion_matrix(cm_basic)
Principal Component Analysis (PCA)
One of the reasons why each individual predictor variable is
correlated with overdose hotspot but when combined in a regression, some
of them become insignificant is because of multicollinearity. This
commonly occurs in regression analysis when predictor variables are
highly correlated. It won’t be surprise to see that many of those
demographic variables are highly correlated with each other. It is
possible that distance to hospitals is correlated with distance to
parks, given that there are more those kind of services in downtown
areas. Srinivasan, et al. (2023)’s approach to this issue is to use
Principal Component Analysis (PCA) as a preprocessing step before
running the regression.
PC Identification
PCA is a dimension reduction technique that is commonly used to
reduce the number of variables used in analyses while preserving the
information from them. The prcomp
function below is used to
identify the principal components among all 14 of our predictor
variables.
pca_heroin <- heroin_net %>%
st_drop_geometry() %>%
dplyr::select(Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor)
pca_heroin <- na.omit(pca_heroin)
pc <- prcomp(pca_heroin,
center = TRUE,
scale. = TRUE)
The scree plot is used to visualize the importance of each principal
component and can be used to determine the number of principal
components to retain. The y axis is eigenvalues, which essentially stand
for the amount of variation. Here, we may see that the first six, or
seven components are capturing the majority of variability. We can
therefore, just use the first seven principal components and ignore the
rest.
fviz_eig(pc, addlabels = TRUE, barfill = "#BB3754", barcolor = "transparent", xlab = "Principal Components", choice = "eigenvalue")
The code above computed the square cosine value for each variable
with respect to the first seven principal components. From the
illustration below, gender ratio, population aged between 25-54,
distance to parks, and poverty rate are top four variables with the
highest cos2, hence contributing the most to the top seven principal
components.
fviz_cos2(pc, choice = "var", axes = 1:7) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_bar(stat = "identity", fill = "#BB3754", color = "transparent")
We also did a biplot based on our principal component analysis, which
helps us to visualize the similarities and dissimilarities between the
samples, and further understand the impact of each attribute on each of
the principal components. Here, the length of the arrow represents the
strength of the contribution of the corresponding variable to the
specific principal component. The color of the arrow represents the
relative contribution of the variable to the principal component. Darker
colors typically indicate larger contributions. For variables with long
arrow, but light color, such as percentage of population with a
Bachelor’s degree and percentage poverty, their contributions were less
significant compared to other variables, and might only contribute
significantly to one PC. For variables with short arrow but dark color,
such as 311 complaints and crime, they have relatively strong
contributions to all PC. Also, the direction of those arrows indicate
that crimes and complaints have positive contributions to hotspots.
fviz_pca_var(pc, col.var = "cos2", repel = TRUE,
arrowsize = 1, ggtheme = theme_minimal(), gradient.cols = c("black", "#BB3754"),) + ggtitle("Biplot Combind with Contribution")
Below, we computed the first seven principal components based on the
loadings of each of our predictor variables in that component.
pca_net <- heroin_net %>%
mutate(pc1 = Crimecount * pc$rotation[1] + Complaintscount * pc$rotation[2] + pharm.nn * pc$rotation[3] + hospital.nn * pc$rotation[4] + fast.nn * pc$rotation[5] + fuel.nn * pc$rotation[6] + rehab.nn * pc$rotation[7] + parks.nn * pc$rotation[8] + pop25_54 * pc$rotation[9] + MedHHInc * pc$rotation[10] + MF_ratio * pc$rotation[11] + race_ratio * pc$rotation[12] + pctPoverty * pc$rotation[13] + pctBachelor * pc$rotation[14]) %>%
mutate(pc2 = Crimecount * pc$rotation[15] + Complaintscount * pc$rotation[16] + pharm.nn * pc$rotation[17] + hospital.nn * pc$rotation[18] + fast.nn * pc$rotation[19] + fuel.nn * pc$rotation[20] + rehab.nn * pc$rotation[21] + parks.nn * pc$rotation[22] + pop25_54 * pc$rotation[23] + MedHHInc * pc$rotation[24] + MF_ratio * pc$rotation[25] + race_ratio * pc$rotation[26] + pctPoverty * pc$rotation[27] + pctBachelor * pc$rotation[28]) %>%
mutate(pc3 = Crimecount * pc$rotation[29] + Complaintscount * pc$rotation[30] + pharm.nn * pc$rotation[31] + hospital.nn * pc$rotation[32] + fast.nn * pc$rotation[33] + fuel.nn * pc$rotation[34] + rehab.nn * pc$rotation[35] + parks.nn * pc$rotation[36] + pop25_54 * pc$rotation[37] + MedHHInc * pc$rotation[38] + MF_ratio * pc$rotation[39] + race_ratio * pc$rotation[40] + pctPoverty * pc$rotation[41] + pctBachelor * pc$rotation[42]) %>%
mutate(pc4 = Crimecount * pc$rotation[43] + Complaintscount * pc$rotation[44] + pharm.nn * pc$rotation[45] + hospital.nn * pc$rotation[46] + fast.nn * pc$rotation[47] + fuel.nn * pc$rotation[48] + rehab.nn * pc$rotation[49] + parks.nn * pc$rotation[50] + pop25_54 * pc$rotation[51] + MedHHInc * pc$rotation[52] + MF_ratio * pc$rotation[53] + race_ratio * pc$rotation[54] + pctPoverty * pc$rotation[55] + pctBachelor * pc$rotation[56]) %>%
mutate(pc5 = Crimecount * pc$rotation[57] + Complaintscount * pc$rotation[58] + pharm.nn * pc$rotation[59] + hospital.nn * pc$rotation[60] + fast.nn * pc$rotation[61] + fuel.nn * pc$rotation[62] + rehab.nn * pc$rotation[63] + parks.nn * pc$rotation[64] + pop25_54 * pc$rotation[65] + MedHHInc * pc$rotation[66] + MF_ratio * pc$rotation[67] + race_ratio * pc$rotation[68] + pctPoverty * pc$rotation[69] + pctBachelor * pc$rotation[70]) %>%
mutate(pc6 = Crimecount * pc$rotation[71] + Complaintscount * pc$rotation[72] + pharm.nn * pc$rotation[73] + hospital.nn * pc$rotation[74] + fast.nn * pc$rotation[75] + fuel.nn * pc$rotation[76] + rehab.nn * pc$rotation[77] + parks.nn * pc$rotation[78] + pop25_54 * pc$rotation[79] + MedHHInc * pc$rotation[80] + MF_ratio * pc$rotation[81] + race_ratio * pc$rotation[82] + pctPoverty * pc$rotation[83] + pctBachelor * pc$rotation[84]) %>%
mutate(pc7 = Crimecount * pc$rotation[85] + Complaintscount * pc$rotation[86] + pharm.nn * pc$rotation[87] + hospital.nn * pc$rotation[88] + fast.nn * pc$rotation[89] + fuel.nn * pc$rotation[90] + rehab.nn * pc$rotation[91] + parks.nn * pc$rotation[92] + pop25_54 * pc$rotation[93] + MedHHInc * pc$rotation[94] + MF_ratio * pc$rotation[95] + race_ratio * pc$rotation[96] + pctPoverty * pc$rotation[97] + pctBachelor * pc$rotation[98])
PCA Logistic Regression
Following Srinivasan, et al. (2023)’s method, now we can estimated
this logistic regression to predict if a grid was in a hotspot (0/1) as
identified by the Getis-Ord Gi* statistic using components derived from
the PCA as the explanatory variables. For this model using the principal
components as predictors, we still get a accuracy rate of 0.928. Our
sensitivity rate dropped a little bit to 0.66 compared to the kitchen
sink model, probably because we addressed the issue of multicolinearity,
but that is still considered as a moderate sensitivity.
reg.vars <- c("pc1", "pc2", "pc3", "pc4", "pc5", "pc6", "pc7")
pca.logit.cv <- logitCV (
dataset = pca_net,
id = "cvID",
dependentVariable = "hotspot",
indVariables = reg.vars) %>%
dplyr::select(cvID = cvID, hotspot, Prediction, geometry)
pca.logit.cv <- pca.logit.cv %>%
mutate(PredictionCat = as.factor(ifelse(pca.logit.cv$Prediction > 0.5 , 1, 0))) %>%
mutate(hotspot = as.factor(hotspot))
cm_pca <- caret::confusionMatrix(pca.logit.cv$hotspot, pca.logit.cv$PredictionCat,
positive = "1")
draw_confusion_matrix(cm_pca)
In addition to cross-validating using random cvID, we also tried the
spatial ‘Leave-one-group-out’ cross-validation (LOGO-CV) approach,
during which we hold out one neighborhood, train the model on the
remaining n-1 areas, predict for the hold out, and record the goodness
of fit. The result shows that our model has high accuracy, but the
sensitivity further decreases. One possible explanation here is that
different neighborhoods may have distinct characteristics or patterns.
If the model is trained on a set of neighborhoods and then tested on a
neighborhood with different characteristics, its performance may be
negatively affected. Random cross-validation IDs might not capture these
spatial variations as explicitly, but further investigation is required
here.
neighborhood <- st_read(here("data", "public", "neighborhood.geojson")) %>% st_transform("EPSG:3735") %>% st_zm()
pca_net <- pca_net %>%
left_join(net_centroid %>% dplyr::select(uniqueID) %>%
st_intersection(neighborhood %>% dplyr::select(SNA_NAME)) %>%
st_drop_geometry(), by = "uniqueID") %>%
mutate(SNA_NAME = ifelse(is.na(SNA_NAME), "NOT APPLICABLE", SNA_NAME))
pca.logit.cv.neigh <- logitCV (
dataset = pca_net,
id = "SNA_NAME",
dependentVariable = "hotspot",
indVariables = reg.vars) %>%
dplyr::select(SNA_NAME = SNA_NAME, hotspot, Prediction, geometry)
pca.logit.cv.neigh <- pca.logit.cv.neigh %>%
mutate(PredictionCat = as.factor(ifelse(pca.logit.cv.neigh$Prediction > 0.5 , 1, 0))) %>%
mutate(hotspot = as.factor(hotspot))
cm_pca_neigh <- caret::confusionMatrix(pca.logit.cv.neigh$hotspot, pca.logit.cv.neigh$PredictionCat,
positive = "1")
draw_confusion_matrix(cm_pca_neigh)
We computed the Area Under the Receiver Operating Characteristic
Curve (AUC-ROC) metric for our model to evaluate their performance. The
AUC value ranges from 0 to 1, where 0.5 means a model with no
discriminatory power (it performs as well as random chance), and 1 means
a perfect model (it perfectly distinguishes between positive and
negative cases). Generally, an AUC over 0.8 indicates a good performance
of the model. For the random K-fold cross validation, we get a AUC score
of 0.904 and for the Spatial LOGO-CV, we get a AUC score of 0.8628. Both
are good scores and indicate that model is capturing important patterns
in the data and making predictions better than random guessing.
plot7 <- ggplot(pca.logit.cv, aes(d = as.numeric(hotspot), m = Prediction)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "black") +
labs(title = "ROC Curve for Logistic Regression Model using PCA",
subtitle = "Random Cross Validation") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#BB3754") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=9),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
plot8 <- ggplot(pca.logit.cv.neigh, aes(d = as.numeric(hotspot), m = Prediction)) +
geom_roc(n.cuts = 50, labels = FALSE, colour = "black") +
labs(title = "ROC Curve for Logistic Regression Model using PCA",
subtitle = "Spatial LOGO Cross Validation") +
style_roc(theme = theme_grey) +
geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#BB3754") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=9),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
plot7 + plot8
plot9 <- pca.logit.cv %>%
filter(is.na(Prediction) == FALSE) %>%
mutate(result = case_when(
PredictionCat==0 & hotspot==0 ~ "Accurate",
PredictionCat==1 & hotspot==1 ~"Accurate",
PredictionCat==0 & hotspot==1 ~"Problematic",
TRUE ~ "Inaccurate"
)) %>%
ggplot() +
geom_sf(aes(fill=as.character(hotspot)), color="transparent")+
scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
labs(title = "Actual Hotspot") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot10 <- pca.logit.cv %>%
ggplot() +
geom_sf(aes(fill=as.character(PredictionCat)), color="transparent")+
scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
labs(title = "Predicted Hotspot",
subtitle = "Random Cross Validation") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot11 <- pca.logit.cv.neigh %>%
ggplot() +
geom_sf(aes(fill=as.character(PredictionCat)), color="transparent")+
scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
labs(title = "Predicted Hotspot",
subtitle = "Spatial LOGO Validation") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot12 <- pca.logit.cv %>%
filter(is.na(Prediction) == FALSE) %>%
mutate(result = case_when(
PredictionCat==0 & hotspot==0 ~ "Accurate",
PredictionCat==1 & hotspot==1 ~"Accurate",
PredictionCat==0 & hotspot==1 ~"Problematic",
TRUE ~ "Inaccurate"
)) %>%
ggplot() +
geom_sf(aes(fill=as.character(result)), color="transparent")+
scale_fill_manual(values = c("black", "#FFC000", "#BB3754"), name = "Hotspot") +
labs(title = "Mapping Error",
subtitle = "Random Cross Validation") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
plot9 + plot10 + plot11 + plot12
If our model has generally been good at making prediction (high
accuracy) but with some uncertainties in correctly locating the hotspot,
then we would like to know where it is making mistakes. We mapped the
error below and compared it to the actual hotspot. What we can see is
that our model is accurate in roughly estimating the location of the
hotspot, but tends to make mistakes when deciding which of these grids
in particular should be categorized as hotspot.
We highlighted those grids which are actual hotspot but were wrongly
predicted as problematic, from which we may see that those most of those
problematic grids pretty much neighbors hotspot grids. This implies that
when drawing conclusion from our model, we should not only look at
specific hotspot grid, but also probably pay attention to grids
nearby.
KDE Validation
The accuracy of our model is also compared relative to kernel density
estimation. Kernel density works by centering a smooth kernel, or curve,
atop each crime point such that the curve is at its highest directly
over the point and the lowest at the range of a circular search radius.
The code block below creates a Kernel density map of robbery with a 1000
foot search radius.
heroin_ppp <- as.ppp(st_coordinates(heroin), W = st_bbox(heroin_net))
heroin_KD.1000 <- spatstat.explore::density.ppp(heroin_ppp, 1000)
heroin_KD.df <- data.frame(rasterToPoints(mask(raster(heroin_KD.1000), as(cincinnati20, 'Spatial'))))
ggplot(data=heroin_KD.df, aes(x=x, y=y)) +
geom_raster(aes(fill=layer)) +
coord_sf(crs=st_crs(heroin_net)) +
scale_fill_viridis(option = "rocket", name="Density") +
labs(title = "Kernel Density of Heroin 1000ft Radii") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
The kernel density estimation was classified into two categories,
lower risk and higher risk, matching the binary of hotspot and coldspot.
We used the Fisher's method
that involves finding intervals
such that the variance within each interval is minimized while
maximizing the variance between the intervals. The purpose here is to
see if the our model capture the same hotspot grids as the kernel
density. The map below shows that our model perform as good as, if not
worse, the kernal density approach in capturing hotspots (high risk
areas).
heroin_KDE_sum <- as.data.frame(heroin_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(heroin_net)) %>%
aggregate(., heroin_net, mean)
kde_breaks <- classIntervals(heroin_KDE_sum$value,
n = 2, "fisher")
heroin_KDE_sf <- heroin_KDE_sum %>%
mutate(label = "Kernel Density",
Risk_Category = classInt::findCols(kde_breaks),
Risk_Category = case_when(
Risk_Category == 2 ~ "High",
Risk_Category == 1 ~ "Low"))
heroin_risk_sf <-
pca.logit.cv %>%
mutate(label = "Logistic Regression",
Risk_Category = case_when(
PredictionCat == 1 ~ "High",
PredictionCat == 0 ~ "Low")) %>%
dplyr::select(PredictionCat, label, Risk_Category) %>%
rename(value = PredictionCat)
rbind(heroin_KDE_sf, heroin_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
facet_wrap(~label, ) +
scale_fill_viridis(option = "rocket", discrete = TRUE, name = "Risk Category") +
labs(title="Comparison of Kernel Density and Risk Predictions") +
theme(legend.position="bottom",
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
)
Multi Criteria Analysis
We would like to conclude our analysis by building
multi-criteria-analysis-centered geospatial risk terrain model that
stakeholders and the public can use to learn about the community
vulnerability to heroin-overdose in Cincinnati. In GIS, a Multi-Criteria
Analysis (MCA) model is a decision-making approach that involves
evaluating and comparing multiple criteria or factors to make informed
spatial decisions. The criteria we use here are the predictor variables
we use in our model. These criteria are standardized to ensure that all
of them are on a scale of 0 to 1. Then, some of those scaled variables
are reversed if necessary based on whether they contribute positively or
negatively to the heroin-overdose risk. We supply a weight on each of
those variables based on their contribution. Finally, a risk score is
computed for each grid.
scale_values <- function(x){(x-min(x))/(max(x)-min(x))}
risk_score <- heroin_net %>%
mutate(MedHHInc = ifelse(is.na(MedHHInc), 0, MedHHInc)) %>%
mutate(scl_crime = scale_values(Crimecount),
scl_complaints = scale_values(Complaintscount),
scl_pharm = scale_values(pharm.nn),
scl_hospital = scale_values(hospital.nn),
scl_rehab = scale_values(rehab.nn),
scl_park = scale_values(parks.nn),
scl_fuel = scale_values(fuel.nn),
scl_fast = scale_values(fast.nn),
scl_pop = scale_values(pop25_54),
scl_income = scale_values(MedHHInc),
scl_gender = scale_values(MF_ratio),
scl_race = scale_values(race_ratio),
scl_poverty = scale_values(pctPoverty),
scl_edu = scale_values(pctBachelor)) %>%
mutate(scl_race_re = 0 - scl_race + 1,
scl_edu_re = 0 - scl_edu + 1,
scl_income_re = 0 - scl_income + 1,
scl_fuel_re = 0 - scl_fuel + 1,
scl_fast_re = 0 - scl_fast + 1,
scl_park_re = 0 - scl_park + 1,
scl_pharm_re = 0 - scl_pharm + 1) %>%
mutate(score = 0.1 * (scl_hospital + scl_crime + scl_complaints + scl_gender + scl_income_re + scl_fuel_re + scl_fast_re + scl_park_re) + 0.05 * (scl_race_re + scl_pharm_re) + 0.025*(scl_pop +scl_edu_re + scl_poverty + scl_rehab)) %>%
st_as_sf()
risk_score <- risk_score %>%
left_join(net_centroid %>% dplyr::select(uniqueID) %>%
st_intersection(neighborhood %>% dplyr::select(SNA_NAME)) %>%
st_drop_geometry(), by = "uniqueID") %>%
mutate(SNA_NAME = ifelse(is.na(SNA_NAME), "NOT APPLICABLE", SNA_NAME))
This below interactive leaflet map shows the risk score of each grid,
the neighborhood that grid fall into, and whether or not it falls into a
hotspot that our model predicts. We hope that this will serve as our
first step towards building up our use case that present stakeholders
with a clearer picture of community vulnerability to heroin overdose and
that help the public to have a better understanding of the severity of
heroin-overdose in Cincinnati as well as existing rehabilitation
resources.
colors <- rev(viridis::viridis(6, option = "rocket"))
pal <- colorBin(palette = colors,
domain = risk_score$score,
bins = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7))
labs <- paste0("<strong>", risk_score$SNA_NAME,
"</strong><br/>Heorin-Overdose Risk Score: ", risk_score$score
)
leaflet <- risk_score %>%
st_transform("EPSG:4326") %>%
leaflet() %>%
setView(lng = -84.51, lat = 39.10, zoom = 10.5) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addPolygons(weight = 1,
fillColor = ~pal(score), # change the name of the parameter in the bracket
popup = labs,
color = "transparent",
group = "Risk Score",
fillOpacity = 1) %>%
addPolygons(
data = pca.logit.cv %>% filter(PredictionCat == 1) %>% st_transform("EPSG:4326"),
weight = 2,
group = "Overdose Hotspot",
color = "yellow") %>%
addLayersControl(
overlayGroups = c("Risk Score", "Overdose Hotspot"),
options = layersControlOptions(collapsed = FALSE)
) %>%
addLegend(position = "bottomright",
pal = pal,
values = ~risk_score, # change the name of the parameter after ~
title = "Risk Score",
opacity = 1) # change the name of the legend
leaflet
mapshot(leaflet, url = "cincyrisk.html")
Conclusion
In this report, we conducted a study on opioid-overdose risk modeling
in Cincinnati, Ohio where the opioid epidemic has been a serious issue.
Drawing on methods by Srinivasan, et al. (2023), we used Getis-Ord
hotspot analyses to identify locations of persistent risk in Cincinnati
from 2016-2020. We constructed measures of the socio-built environment
and neighborhood demographic characteristics using principal components
analysis and ran logistic regression to determine if places in
Cincinnati fall into heroin-overdose hotspots or not. While Poisson
regression has long been the traditional method used in geospatial risk
model to predict the number (count) of discrete event, we propose this
alternative method in this report because we believe that it is less
useful for policymakers to focus on the number of heroin-overdose
incidence among neighborhoods in Cincinnati. Rather, we should develop a
way to quickly identify areas/clusters of high heroin-overdose
incidence.
Our model, which uses principal components as predictors, has an
accuracy rate of 0.928, a sensitivity rate of 0.66, and an AUC score of
0.904, meaning that it has been effective overall in distinguishing
between hotspots and cold spots. We mapped the error and compared that
to the actual hotspot, from which we noticed that our model is accurate
in roughly estimating the location of the hotspot but tends to make
mistakes when deciding which of these grids in particular should be
categorized as hotspot. On top of that, many of those grids are
neighbors of hotspot grids that are correctly predicted.
Reflecting on the overall study design, we have the following
concerns, which we would like to continue working on to improve the
accuracy and reliability of our model. Firstly, our study utilized a
binary classification of hotspot and cold spot to describe risk. While
this is a common approach in risk modeling, we should recognize the
complexity within these categories as areas neighboring hotspots may
also exhibit varying degrees of risk. Building onto that, another major
issue is that our analysis has categorized “insignificant” grids into
cold spots. In theory, cold spot should refer to places with a series of
grids that have low heroin-incidence count whereas for insignificant
grid, there might be a high number of incidences within specific grid.
It’s just that those grids do not form clusters with surrounding grids.
Merging insignificant grids into cold spot might risk overlooking
heroin-incidence within those grids.
Secondly, within the discussion, we highlight the potential imbalance
in logistic regression, prompting consideration for weighted logistic
regression to address disparities in the prediction of positive and
negative cases. Thirdly, during cross validation, we found that our
model does not generalize well across neighborhoods. This highlights
significant variations between different neighborhoods in Cincinnati,
which requires further investigations. Fourthly, as we need to arrogate
information into a standardized spatial unit that’s different from any
existing spatial unit, we need to be aware of how our model might be
sensitive to errors and uncertainties introduced by area-weighted
re-aggregation and boundary/edge effect.
Regardless, the takeaways from this analysis will contribute to our
understanding of the nuanced landscape of heroin-overdose crisis in
Cincinnati and provide important guidance on public health initiatives
aiming to address heroin addiction.
We believe that our risk terrain model could effectively visualize
heroin-overdose hotspots and could serve a good starting point for
providing real-time update through calculating a risk score for each
part of the city, allowing the government to conduct real-time data
analysis to efficiently allocate EMS resources. We hope that our effort
in bringing in data from a variety of sources would encourage data
sharing between public and private sectors so that the government can
manage public service and monitor its built environment. Reports can be
collected from public parks, fast food restaurants, gas stations, places
where most overdose incidence occurs. Having data from local hospitals
and rehabilitation centers allows the government to see capacity
information, thereby efficiently dispatching preparing and dispatching
care so that the risk of heroin overdose is minimized before they
happen.
We look forward to seeing our risk terrain model being adapted in
various scenarios that truly benefit both the stakeholders in managing
community health and well-being as well as serve the needs of local
residents.
References
Choi, J. I., Lee, J., Yeh, A. B., Lan, Q., & Kang, H. (2022).
Spatial clustering of heroin-related overdose incidents: a case study in
Cincinnati, Ohio. BMC public health, 22(1), 1-12. https://doi.org/10.1186/s12889-022-13557-3
Srinivasan, S., Pustz, J., Marsh, E., Young, L. D., & Stopka, T.
J. (2023). Risk factors for persistent fatal opioid-involved overdose
hotspots in Massachusetts 2011-2021: A spatial statistical analysis with
socio-economic, access and prescription factors.https://doi.org/10.21203/rs.3.rs-3249650/v1
Chichester, K. R., Drawve, G., Sisson, M., Giménez-Santana, A.,
McCleskey, B., Goodin, B. R., … & Cropsey, K. L. (2023). Crime and
Features of the Built Environment Predicting Risk of Fatal Overdose: A
Comparison of Rural and Urban Ohio Counties with Risk Terrain Modeling.
American Journal of Criminal Justice, 1-25. https://doi.org/10.1007/s12103-023-09739-3
---
title: "Geospatial Risk Prediction of Heroin Overdose Hotspots in Cincinnati Ohio"
author: "Emily Zhou, EChin Li"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: simplex
    toc: yes
    toc_float: yes
    code_folding: hide
    code_download: yes
editor_options:
  markdown:
    wrap: sentence
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

```

Version 3.0 \| First Created Nov 17, 2023 \| Updated Dec 8, 2023

Keywords: K Nearest Neighbors, Moran's I, Getis-Ord Gi\*, Kernel Density Estimation, Random K-fold Cross Validation, Logistic Regression, Poisson Regression, ROC Curve, Principal Component Analysis, Multi-Criteria Analysis, Health Geography

Important Links: [study repository](https://github.com/emilyzhou112/MUSA5080-Cincinnati-Heroin), [pitch presentation](https://youtu.be/LRWgTeHXNtk)

# Introduction

The opioid overdose crisis continues as one of the most significant public health challenges of the past two decades in the US.
Between 1999 and 2019, the national age-adjusted opioid-involved overdose death rate increased from 2.9 per 100,000 population to 15.5 per 100,000.
Rates have increased steeply during the COVID-19 pandemic.
In 2020, the rate climbed to 21.4 per 100,000 and to 24.7 per 100,000 in 2021.
Most literature to this date have focused on understanding the drivers of overdose crisis, which include drug supply, drug manufacturing, pain arising from work-related injuries, income inequality, and added stress, isolation, and economic disadvantage connected to the COVID-19 pandemic.
Until recently, however, more and more studies have begun to consider the association between drug use, opioid-related mortality, and the built-environment.
Contextual risk factors, especially features of the built environment, the presence of violent crime or crime associated with substance use, and neighborhood-level socioeconomic factors, have enabled more accurate identification of high-risk overdose areas, thus driving initiatives on alleviating overdose risk and crises.

We conducted a case study on opioid-overdose risk modeling in Cincinnati, Ohio where the opioid epidemic has been a serious issue.
Since 2009, drug overdose deaths in Ohio have continued to increase, and Cincinnati, a large city in southwestern Ohio with a population of about 300,000 people, was one of 12 opioid hotspots in Ohio and had the highest per capita rates of a fatal overdose between 2010 and 2017.
Today, heroin overdose incidences still remain high in the city since heroin and prescription drugs abuse have further led to other public health problems such as HIV and needle-borne diseases.

We combined multiple data sources, including emergency medical service response, crime and non-emergency reports data, American Community Survey data, and datasets that are proxies for local built environment to analyze the distributions of heroin-related overdose incidents in Cincinnati, Ohio.

Among studies of spatial epidemiology, geospatial clustering has been used to analyze the geographic variation of phenomena, such as the disproportionate impact of public health crisis on people with disabilities, and the concentrated vulnerability of socio-economically disadvantaged population under natural hazards.
It allows the investigators to identify groups of spatial objects (i.e. clusters) that have similar characteristics and analyze patterns of the clusters (e.g., hot spots, cold spots).
As such, we used Getis-Ord hotspot analyses to identify locations of persistent risk in Cincinnati from 2016-2020.
We constructed measures of the socio-built environment and neighborhood demographic characteristics using principal components analysis (PCA).
The resulting measures were used as covariates in a logistic regression with random k-fold cross validation to predict if an area was part of an opioid hotspot cluster.
All built environment risk factors were aggregated, scaled, and weighted to generate an interactive overdose risk terrain map.

**On the technical side**, the goal of our study is to build a model that could predict opioid overdose clusters as well as socioeconomic and built environment factors associated with opioid overdose rates in Cincinnati from 2016 to 2020 as much as possible. 
**On the practical side**, we would like our risk terrain model to benefit both the stakeholders and community members in the main form of a municipal dashboard and a mobile application. For stakeholders, the dashboard would serve as an early warning system by predicting/showing areas at high risk of opioid overdose. Specifically, it should include maps with overdose cluster locations, trend graphs over time, and charts illustrating the correlation between overdose rates and various factors. For community members, this application connects them to healthcare and rehabilitation resources within the city. Specifically, the application allow them to call for emergency services and medical assistance for themselves or for others at one tap. To encourage public-private partnership, this app would adopt the approach that OpenStreetMap took to allow both technical and non-technical users, private and public entities to update existing information, such as crime incidents, rehabilitation center capacities and location. We hope that non-technical stakeholders can use the app to allocate resources efficiently, directing outreach efforts, treatment facilities, and educational campaigns to areas with the highest predicted risk. We hope that the public would use this platform to foster community engagement and support for initiatives addressing the opioid crisis.

The rest of the report is organized as follows.
Section one describes our approach to access and pre-process the data.
Section two documents an exploratory analysis we did using poission regression with only crime and 311 incidents.
Section three explains the conceptualization our approach applying the Getis Ord G\* statistics to determine spatial hotspots.
Section four is the main part of this study, where we iclude all built environment predictors to run a logistic regression predicting if an area is an opioid overdose hotspot or not.
Section five applies principal component analysis to reduce colinerity and include only half of the principal components into the model.
Section six validates the accuracy of this model against a kernel density estimation while the last section constructs the risk terrain map.

The conceptualization and methodologies of this study were drawn in part from: 

- Choi, et al.(2022) Spatial clustering of heroin-related overdose incidents: a case study in Cincinnati, Ohio

-  Srinivasan, et al. (2023) Risk factors for persistent fatal opioid-involved overdose hotspots in Massachusetts 2011-2021 

- Chichester, et al.(2023) Crime and Features of the Built Environment Predicting Risk of Fatal Overdose: A Comparison of Rural and Urban Ohio Counties with Risk Terrain Modeling.


```{r library and setup, include=FALSE}

library(tidyverse)
library(tidycensus)
library(sf)
library(lubridate)
library(tigris)
library(mapview)
library(sfdep)
library(spdep)
library(caret)
library(kableExtra)
library(knitr)
library(here)
library(viridis)
library(FNN)
library(stats)
library(factoextra)
library(plotROC)
library(classInt)
library(raster)
library(spatstat.explore)
library(leaflet)
library(patchwork)

options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)

```

# Data

We used the [EMS dataset](https://data.cincinnati-oh.gov/Safety/Cincinnati-Fire-Incidents-CAD-including-EMS-ALS-BL/vnsz-a3wp) that includes all responses to heroin-related overdose incidents from the Cincinnati Fire Department between 2016 and 2020.
Each incident record contains incident location information, including geospatial information, location address, a classification of a neighborhood in the city, time of the incident, and disposition of incident response.
Data records regarding incidents outside of the study area, without geospatial information, and with unassociated disposition codes were excluded from this study.
Records from canceled calls or false alarms and duplicated records were also excluded.

```{r heroin data, message=FALSE, warning=FALSE, class.source='fold-show', results='hide'}

heroin <- read.csv(here("data", "public", "Cincinnati_Fire_Incidents__CAD___including_EMS__ALS_BLS_.csv"))
cincinnati <- st_read(here("data", "public", "Cincinnati_City_Boundary.geojson")) %>% st_transform("EPSG:3735")

heroin <- heroin %>% 
  filter(CFD_INCIDENT_TYPE_GROUP != "NON-PROTOCOL PROBLEM TYPES") %>% 
  filter(is.na(LATITUDE_X) == FALSE & is.na(LONGITUDE_X) == FALSE) %>%  # remove incidents without spatial info
  filter(is.na(DISPOSITION_TEXT) == FALSE         # remove record with un-associated disposition codes
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL" # remove records from canceled calls, false alarm, and duplication
         & CFD_INCIDENT_TYPE_GROUP !="CANCEL INCIDENT" 
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,DEF: DEFAULT"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,EMSF: FALSE"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,DUPF: DUPLICATE"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,FALA: FIRE FALSE AC"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,MEDD: MT DISREGARDE"
         & CFD_INCIDENT_TYPE_GROUP !="DUPF: DUPLICATE"
         & CFD_INCIDENT_TYPE_GROUP !="DUPLICATE INCIDENT"
         & CFD_INCIDENT_TYPE_GROUP !="MAL: SYSTEM MALFUNCTION") %>% 
  mutate(time = mdy_hms(CREATE_TIME_INCIDENT, tz = "UTC"),  
         year_column = year(time)) %>% 
  st_as_sf(., coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  st_intersection(cincinnati %>% dplyr::select(OBJECTID),.) %>%   # remove incidents outside of study area
  filter(year_column %in% c(2016, 2017, 2018, 2019, 2020))


```

In addition, we used [Homeland Infrastructure Foundation-Level Data (HIFLD)](https://hifld-geoplatform.opendata.arcgis.com/maps/geoplatform::hospitals/about) and [Substance Abuse and Mental Health Services Administration (SAMHSA) data](https://dpt2.samhsa.gov/treatment/directory.aspx) for 2015--2020 to obtain information about available healthcare services, such as hospitals, opioid treatment programs, and buprenorphine prescribing physicians, etc.

We filtered data only for Cincinnati in the HIFLD dataset.
The SAMHSA data were geocoded manually to get its location information first before reading into R.

```{r medical resource data, message=FALSE, warning=FALSE, class.source='fold-show', results='hide'}

hospitals <- st_read(here("data","public", "Hospitals.geojson"))
hospitals <- hospitals %>% 
  filter(STATE == "OH" & CITY == "CINCINNATI") %>% 
  st_transform("EPSG:3735") # select those only in Cincinnati


rehab <- read.csv(here("data", "public", "rehabilitation.csv"))
rehab <- rehab %>% 
  filter(is.na(Latitude) == FALSE & is.na(Longitude) == FALSE) %>%
  st_as_sf(., coords = c("Longitude", "Latitude"), crs = 4326) %>% 
  st_transform("EPSG:3735")

```

Furthermore, we included the [crime rate data](https://data.cincinnati-oh.gov/safety/PDI-Police-Data-Initiative-Crime-Incidents/k59e-2pvf) from the Cincinnati Police Department between 2016 and 2020.
Specific crimes were selected for inclusion in analyses based on having been identified in the literature as a spatial risk factor for drug overdose or substance use behavior or for having a theoretical relationship with drug overdose.

```{r crime data, message=FALSE, warning=FALSE, class.source='fold-show', results='hide'}

crime <- read.csv(here("data", "private", "PDI__Police_Data_Initiative__Crime_Incidents.csv"))
crime_list <- c("AGGRAVATED ASSAULT", "BURGLARY", "BREAKING AND ENTERING", "ROBBERY", "DOMESTIC VIOLENCE", "MURDER	", "RAPE", "THEFT")
crime <- crime %>% 
  filter(OFFENSE %in% crime_list) %>% 
  filter(is.na(LATITUDE_X) == FALSE & is.na(LONGITUDE_X) == FALSE) %>%
  st_as_sf(., coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  mutate(time = mdy_hms(DATE_REPORTED, tz = "UTC"),  
         year = year(time)) %>% 
  filter(year %in% c(2016, 2017, 2018, 2019, 2020))

```

From the same data source, we included the [311 non-emergency service requests](https://data.cincinnati-oh.gov/Thriving-Neighborhoods/Cincinnati-311-Non-Emergency-Service-Requests/4cjh-bm8b) from 2016 to 2020.

```{r 311 complaints data, warning=FALSE, message=FALSE, class.source='fold-show', results='hide'}

complaints <- read_csv(here("data", "private", "Cincinnati_311__Non-Emergency__Service_Requests.csv"))
complaints <- complaints %>% 
  filter(is.na(LATITUDE) == FALSE & is.na(LONGITUDE) == FALSE) %>%
  st_as_sf(., coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  mutate(time = mdy_hms(REQUESTED_DATE, tz = "UTC"),  
         year = year(time)) %>% 
  filter(year %in% c(2016, 2017, 2018, 2019, 2020))

```

The connection between substance use and built environment variables (access to public restrooms, access to pharmacies, and driving distance to services, defined in our study as fast-food restaurants, gas stations and public parks) are also important.
Public restrooms are associated with people who inject drugs (PWID) because many people use drugs in these spaces.Pharmacies represent an important access variable for several reasons. 
During the initial wave of the overdose crisis, pharmaceutical prescriptions, either legitimate, diverted, or potentially inappropriate, fed the opioid supply. 
In addition, naloxone (a medication to reverse an opioid overdose) is available at pharmacies without a prescription, although this provision may vary by neighborhood socio-demographic levels. 
All of these data were queried from OpenStreetMap and preprocessed in QGIS (fixing geometry, creating spatial indices, and unioning point with polygon data pieces)

```{r built environment data, warning=FALSE, message=FALSE, results='hide'}
# gas station
fuel <- st_read(here("data", "public", "fuel.geojson")) %>% st_transform("EPSG:3735")

# fast food restaurant 
fastfood <- st_read(here("data", "public", "fastfood.geojson")) %>% st_centroid() %>% st_transform("EPSG:3735")

# public parks
parks <- st_read(here("data", "public", "parks.geojson")) %>% st_transform("EPSG:3735") %>% st_centroid()

# pharmacy
pharmacy <- st_read(here("data","public", "pharmacy.geojson")) %>% st_transform("EPSG:3735") %>% st_centroid()
```

To identify demographic and sociographic characteristics corresponding to the EMS dataset, we utilized the ACS data for 2016 -- 2020.
We downloaded demographic information for each census tract, including total population, gender, adult population size, and race/ethnicity.
The ACS data also included socioeconomic characteristics of each census tract, such as education, income, and poverty.
For each census tract, we computed the population aged between 25 and 54 (since this is the age group most vulnerable to opioid overdose), gender ratio, minority popultion ratio, percentage poverty, and percentage of population with a bachelor's degree.
Note that the boundary of Cincinnati does not fully contain several census tracts and therefore we have tried our best to include tracts within Hamilton county that mostly fall within Cincinnati.

```{r census data, warning=FALSE, message=FALSE, results='hide'}

cincinnati20 <- get_acs(geography = "tract", 
          variables = c(
            "B01001_001E", # total population
            "B01001_002E", # total male
            "B01001_011E", # male 25-29
            "B01001_012E",
            "B01001_013E",
            "B01001_014E",
            "B01001_015E",
            "B01001_016E", # male 50-54
            "B01001_026E", # total female
            "B01001_035E", # female 25-29
            "B01001_036E",
            "B01001_037E",
            "B01001_038E",
            "B01001_039E",
            "B01001_040E", # female 50-54
            "B02001_002E", # white population
            "B02001_003E", # black population
            "B02001_005E", # asian population
            "B03002_012E", # latinx population
            "B19013_001E", # median household income
            "B06012_002E", # poverty
            "B06009_005E" # bachelor
            ), 
          year=2020, state="OH", county="Hamilton", 
          geometry=TRUE, output="wide") %>%
  st_transform("EPSG:3735")

cincinnati_tracts <- st_read(here("data", "public", "cincinnati_tracts.geojson")) %>% st_transform("EPSG:3735")

cincinnati_tracts <- cincinnati_tracts %>% st_drop_geometry() %>% dplyr::select(GEOID) %>% as.list(GEOID)

cincinnati20 <- cincinnati20 %>% 
  filter(GEOID %in% cincinnati_tracts$GEOID) %>% 
  rename(Totalpop = B01001_001E,
         MedHHInc = B19013_001E) %>% 
  mutate(pop25_54 = B01001_011E + B01001_012E + B01001_013E + B01001_014E + B01001_015E + B01001_016E + B01001_035E + B01001_036E + B01001_037E + B01001_038E+ B01001_039E + B01001_040E) %>% 
  mutate(MF_ratio = B01001_002E / B01001_026E) %>% 
  mutate(race_ratio = B02001_002E / (B02001_003E + B02001_005E + B03002_012E)) %>% 
  mutate(pctPoverty = B06012_002E / Totalpop) %>% 
  mutate(pctBachelor = B06009_005E / Totalpop) %>% 
  dplyr::select(pop25_54, Totalpop, MedHHInc, GEOID, MF_ratio, race_ratio, pctPoverty, pctBachelor)


```

The table below summarizes all of our predictor variables in this study:

|             Variable Name             |                          Source                       |
|:----------------------------------:   |:----------------------------------:                   |
|          **Medical Service**          |                                                       |
|               Hospital                |       Homeland Infrastructure Foundation-Level Data   |
|         Rehabilitation Center         | Substance Abuse and Mental Health Services Administration |
|         **Built Environment**         |                                                       |
|              Crime Rate               |                   Open Data Cincinnati                |
|            311 complaints             |                   Open Data Cincinnati                |
|              Gas station              |                            OSM                        |
|             Public parks              |                            OSM                        |
|         Fast food restaurants         |                            OSM                        |
|              Pharmacies               |                            OSM                        |
| **Socio-demographic Characteristics** |                                                       |
|   Population aged between 25 and 54   |                            ACS                        |
|             Gender ratio              |                            ACS                        |
|       White over minority ratio       |                            ACS                        |
|  Population with a Bachelor's degree  |                            ACS                        |
|        Median household income        |                            ACS                        |
|     Population under poverty line     |                            ACS                        |
|        Median Household Income        |                            ACS                        |


# Poission Regression

As part of our exploratory analysis, we ran a poisson regression using only the number of crime incidents and the number of 311 complaints as our predictor variables. Since heroin overdose incident is not a phenomenon that varies across administrative units, but one varying smoothly across the landscape, we would need to aggregate these point-level data into a lattice grid of cells. The code below generates a 500 by 500 meters (820 feet) fishnet for Cincinnati to achieve this goal. 

```{r create fishnet, warning=FALSE, message=FALSE}
fishnet <- st_make_grid(cincinnati,
               cellsize = 820, 
               square = TRUE) %>%
  .[cincinnati] %>%           
  st_sf() %>%
  mutate(uniqueID = 1:n())

ggplot() +
  geom_sf(data=fishnet, color="black", fill="#FFF5EE") +
  labs(title = "Fishnet of Cincinnati") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

We would like to compare across both space and time, and so we divide the heroin dataset into one training set and two testing sets. The training set contains incidents from 2016 to 2018 while the testing sets contain incidents from 2019 and 2020 respectively. Since each individual year has well enough observations of heroin-overdose incidents, separating the testing set into 2019 and 2020 allows us to understand how COVID-19 plays a role in affecting drug overdose rate. We also separate the crime incident data and the 311 complaints data into respective year groups so that we could aggregate them into the fishnet. 

```{r splitting train and test, warning=FALSE, message=FALSE}

heroinTrain <- heroin %>% 
  filter(year_column %in% c(2016, 2017, 2018))
heroin19 <- heroin %>% 
  filter(year_column == 2019)
heroin20 <- heroin %>% 
  filter(year_column == 2020)

crimeTrain <- crime %>% 
  filter(year %in% c(2016, 2017, 2018))
crime19 <- crime %>% 
  filter(year == 2019)
crime20 <- crime %>% 
  filter(year == 2020)

complaintTrain <- complaints %>% 
  filter(year %in% c(2016, 2017, 2018))
complaint19 <- complaints %>% 
  filter(year == 2019)
complaint20 <- complaints%>% 
  filter(year == 2020)

```

The following two code chunks perform all the necessary tasks to prepare our training and testing datasets by aggregating heroin-incidents, crime incidents, and 311 complaints all into the same spatial unit, in this case, the fishnet. 

```{r preparing training set, warning=FALSE, message=FALSE}

# add heroin in net
heroinTrain_net <- 
  dplyr::select(heroinTrain) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

# adding crime to net
heroinTrain_net <- heroinTrain_net %>% 
  st_join(crimeTrain, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroinTrain_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

# add 311 to net
heroinTrain_net <- heroinTrain_net %>% 
  st_join(complaintTrain, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroinTrain_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
```


```{r test sets, warning=FALSE, message=FALSE}

# 2019 test sets
heroin19_net <- 
  dplyr::select(heroin19) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

heroin19_net <- heroin19_net %>% 
  st_join(crime19, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin19_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

heroin19_net <- heroin19_net %>% 
  st_join(complaint19, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin19_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))


# 2020 test set
heroin20_net <- 
  dplyr::select(heroin20) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

heroin20_net <- heroin20_net %>% 
  st_join(crime20, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin20_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

heroin20_net <- heroin20_net %>% 
  st_join(complaint20, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin20_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
```

We performed some data visualization work to see if there are any differences between heroin-overdose incidents over year. We found that the distribution of these incidents are pretty consistent between 2016-2018 and 2019, with the majority of incidents significantly clustered in south Cincinnati. However, in 2020, despite that this overdose cluster still remain, overdose incidents became more disperse. We may see some small, emerging clusters of high overdose incidents scattered around the city. 

```{r is there a difference between heroin incidents over year, fig.height=6, fig.width=15, warning=FALSE, message=FALSE}


plot1 <- ggplot() +
  geom_sf(data = heroinTrain_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot2 <- ggplot() +
  geom_sf(data = heroin19_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2019") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

 plot3 <- ggplot() +
  geom_sf(data = heroin20_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2020") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
 
plot1 + plot2 + plot3
```

We created heat maps of heroin-overdose incidents from 2016 to 2018 and overlayed crime incidents and 311 complaints above. Based on the visualization, we assume crime and 311 complaints to be strong predictors of overdose risk as places of high overdose incidents align perfectly with places with higher crime rate and more complaints. Yet, it is also pretty evident that crime incidents are correlated with 311 complaints. 

```{r crime and complaints, fig.height=6, fig.width=13, warning=FALSE, message=FALSE}

plot1_1 <- ggplot() +
  geom_sf(data = cincinnati, fill = "black") +
  stat_density2d(data = data.frame(st_coordinates(heroinTrain)), 
                 aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "rocket", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  geom_sf(data = sample_n(crimeTrain, 3000), size = 0.03, colour = "white", alpha = 0.5) +
  labs(title = "Heroin OD and Crime 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot1_2 <- ggplot() +
  geom_sf(data = cincinnati, fill = "black") +
  stat_density2d(data = data.frame(st_coordinates(heroinTrain)), 
                 aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "rocket", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  geom_sf(data = sample_n(complaintTrain, 3000), size = 0.03, colour = "white", alpha = 0.5) +
  labs(title = "Heroin OD and Complaint 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot1_1 + plot1_2
```

Mapping the correlation between our predictor variables reveals that indeed, there's a strong, linear, and positive relationship between heroin-overdose incidents and crime as well as 311 complaints, though the relationships become weaker in 2020. We suspect that disruptions in the drug supply chain, changes in law enforcement priorities, as well as increase social isolation and mental health challenges during the pandemic might have contribute to the decrease in correlation between overall crime incidents and heroin overdose incidents in 2020. 

```{r cor plots, fig.height=6, fig.width=12, message=FALSE, warning=FALSE}

plot4 <- heroinTrain_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2016-18") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot5 <- heroin19_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2019") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))


plot6 <- heroin20_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2020") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot4 + plot5 + plot6

```

We ran the Poisson regression and confirmed that crime incidents and 311 complaints are good at predicting heroin-overdose risk. The model is slightly better when used to predict 2020 data, but the difference is minimal. 

```{r running poission, warning=FALSE, message=FALSE}

poissionTrain <- glm(countHeroin ~ Crimecount + Complaintscount, family = "poisson",
                      data = heroinTrain_net)

Prediction19 <-
  mutate(heroin19_net, Prediction = predict(poissionTrain, heroin19_net, type = "response")) %>% 
  mutate(Error = countHeroin - Prediction) %>% 
  mutate(MAE = mean(abs(Error))) %>% 
  mutate(SD_MAE = sd(Error)) %>% 
  mutate(prediction = "2019") %>% 
  slice(1:1)

Prediction20 <-
  mutate(heroin20_net, Prediction = predict(poissionTrain, heroin20_net, type = "response")) %>% 
  mutate(Error = countHeroin - Prediction) %>% 
  mutate(MAE = mean(abs(Error))) %>% 
  mutate(SD_MAE = sd(Error)) %>% 
  mutate(prediction = "2020") %>% 
  slice(1:1)

rbind(Prediction19, Prediction20) %>% 
  st_drop_geometry() %>% 
  group_by(prediction) %>% 
  summarize(Mean_MAE = round(MAE, 2),
              SD_MAE = round(SD_MAE, 2)) %>%
  kable(col.name=c("Prediction", "Mean Absolute Error",'Standard Deviation MAE')) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 

```

Here are our reflections up to this point:

While there's significant associations between crime incidents, 311 complaints, and heroin-overdose incidents, building a model that relies solely on them to predict risk is simply not enough because substance abuse and overdose risks are complex phenomena influenced by various factors beyond just criminal activity. The geographical and environmental context, including the availability of treatment facilities, proximity to support networks, and neighborhood characteristics, also plays a crucial role in understanding overdose risks. Therefore, we must integrate more features showing built environment and local contexts into the model. 

On top of that, Poisson regression is a traditional method used in geospatial risk model to predict the number (count) of discrete event, especially crime. However, in making decisions about how to alleviate the risk of particular medical or criminal incidence, we shouldn't just look at the number of accidents at an intersection in a day at a very specific place. Rather, we should flag areas in a city that are more vulnerable/risky than others. It is not wrong to care to focus on the number of incidence, but from a policy-making perspective, learning whether or not a place is risky would be more efficient, helpful, and straightforward in guiding decision making. In other words, stakeholders and the general public do not have to know *how many incidents happened at this place*, but *whether this place is risky* and therefore requires more attention. 

# Getis Ord Spatial Hotspot

We adapted this research workflow from Srinivasan, et al. (2023), who studied the risk factors for opioid overdose in Massachusetts. In its essence, Srinivasan, et al used principal component analysis to reduce colinerarity between predictor variables and ran a logistic regression to predict whether places in Massachusetts are within overdose hotspot or not. 

To replicate their approach as much as possible, we first examined the spatial distribution of heroin-related incident rates using the Jenks natural breaks maps.
The Jenks natural breaks maps use a nonlinear algorithm to create groups where within-group similarity is maximized, and between-group similarity is minimized. 
The map shows at least one significant incidence cluster in south Cincinnati and several other smaller clusters surrounding that hotspot. 

```{r jenks break, warning=FALSE, message=FALSE}

heroin_net <- 
  dplyr::select(heroin) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))
breaks <- classIntervals(heroin_net$countHeroin, n = 5, style = "jenks")

ggplot() +
  geom_sf(data = heroin_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", breaks = breaks$brks, name = "Count")  + 
labs(title = "Count of Heroin Overdose for the Fishnet") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, linewidth = 0.8)
  ) + 
  theme(legend.key.size = unit(0.8, 'cm'), legend.text = element_text(size=6))
```

We validated our observation by checking the global moran's I under the hypothesis that heroin incidences are randomly distributed across space.  The figure below plot the frequency of 999 randomly permutated I under Monte Carlo simulation, which shows that our observed I is much higher than all of the randomly generated I’s. This validates the existence of spatial autocorrelation.

```{r global morans I, warning=FALSE, message=FALSE}

coords <-  st_coordinates(heroin_net %>% st_centroid()) 
neighborList <- knn2nb(knearneigh(coords, 4))
spatialWeights <- nb2listw(neighborList, style="W")

moranTest <- moran.mc(heroin_net$countHeroin, 
                      spatialWeights, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01, fill = "black") +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
  labs(title = "Observed and Permuted Moran's I",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10))

```

Since the global spatial clustering analysis yields only one statistic to describe the overall point pattern across the whole study area, we have to use the Getis-Ord Gi\* hotspot statistic to identify and map clusters (i.e., grids with high overdose rates surrounded by grids with high rates). Below, each hotspot identified is statistically significant at the p\<0.1 level. For estimating the clusters, we used the queen's criterion to define our spatial weights matrix.

```{r local moran I, warning=FALSE, message=FALSE}

neigh_nbs <- heroin_net %>% 
  mutate(
    nb = st_contiguity(geometry),  # neighbors share border    
    wt = st_weights(nb), # row-standardized weights              
    neigh_lag = st_lag(countHeroin, nb, wt)  
  )

gi_hot_spots <- neigh_nbs %>% 
  mutate(Gi = local_g_perm(countHeroin, nb, wt, nsim = 999)) %>% 
  unnest(Gi) 

gi_hot_spots <- gi_hot_spots %>%  
  dplyr::select(gi, p_folded_sim, uniqueID) |> 
  mutate(
    classification = case_when(
      # Classify based on the following criteria:
      gi > 0 & p_folded_sim <= 0.01 ~ "Very hot",
      gi > 0 & p_folded_sim <= 0.05 ~ "Hot",
      gi > 0 & p_folded_sim <= 0.1 ~ "Somewhat hot",
      gi < 0 & p_folded_sim <= 0.01 ~ "Very cold",
      gi < 0 & p_folded_sim <= 0.05 ~ "Cold",
      gi < 0 & p_folded_sim <= 0.1 ~ "Somewhat cold",
      TRUE ~ "Insignificant"
    ),    # Convert 'classification' into a factor for easier plotting
    classification = factor(
      classification,
      levels = c("Very hot", "Hot", "Somewhat hot",
                 "Insignificant",
                 "Somewhat cold", "Cold", "Very cold")
    )
  )

gi_hot_spots %>% 
  ggplot() + 
  geom_sf(aes(fill = classification), color = "black", lwd = 0.1) +
  scale_fill_brewer(type = "div", palette = 6, name = "Classification") +
  labs(title = "Spatial Hotspots and Coldspots of Heroin Overdose Incidence") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

heroin_net <- gi_hot_spots %>% 
  mutate(hotspot = ifelse(classification %in% c("Very hot", "Hot", "Somewhat hot"), 1, 0)) %>% 
  st_drop_geometry() %>% 
  left_join(heroin_net, ., by = "uniqueID")
```

The cluster map shows seven types of spatial association determined based on heroin-related overdose incident rate within a grid and its neighboring grids. If a grid has a high heroin incident rate with neighboring grids of high heroin incident rates, it was determined as a hot spot. At the same scheme, if a grid shows a low heroin incident rate with neighboring grids of low heroin incident rates, it was classified as a cold spot. Hotspots with p\<0.01 are considered as "Very hot", with p\<0.05 are considered as "Hot", with p\<0.1 are consiered as "Somewhat hot". All grids that fall into these three categories received a 1 for hotspot, otherwise, it gets a zero. 

# Logistic Regression

## Built Environment Features

We aggregated built environment features in addition to crime and 311 complaints into the fishnet of Cincinnati. For crime and complaints, we counted their number in each grid. For hospitals, rehabilitation centers, pharmacies, fast food restaurants, gas stations, and public parks, we calculated the average distance from the center of each grid to these locations. For demographic variables, we spatially joined tract level into the grids depending on which census tract each centroid of the grid fall into.   

```{r adding risk factors to net, warning=FALSE, message=FALSE}

# adding crime to net
heroin_net <- heroin_net %>% 
  st_join(crime, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

# add 311 to net
heroin_net <- heroin_net %>% 
  st_join(complaints, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))

net_centroid <- st_centroid(heroin_net)

# add hospitals
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(hospital.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(hospitals), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(hospital.nn, uniqueID), 
            by = "uniqueID")

# add rehab center
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(rehab.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(rehab), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(rehab.nn, uniqueID), 
            by = "uniqueID")

# add pharmacy
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(pharm.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(pharmacy), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(pharm.nn, uniqueID), 
            by = "uniqueID")


# add gas station
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(fuel.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(fuel), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(fuel.nn, uniqueID), 
            by = "uniqueID")

# add fast food restaurant
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(fast.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(fastfood), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(fast.nn, uniqueID), 
            by = "uniqueID")

# add parks
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(parks.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(parks), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(parks.nn, uniqueID), 
            by = "uniqueID")


# add demographic vars
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              dplyr::select(uniqueID) %>% 
              st_intersection(cincinnati20) %>% 
              st_drop_geometry() %>% dplyr::select(-GEOID), by = "uniqueID") %>% 
  filter(is.na(pop25_54) == FALSE)
```

Some exploratory data analysis were performed, including computing the correlation between all of our predictor variables and heroin incidence. The figure below shows that all of our predictor variables are correlated with heroin-overdose incidence, among which all of our nearest distance variables are negatively correlated with incidence. This indicates that the further away we are from fast food restaurants and gas stations, for example, the less number of heroin-overdose incidence in that area. The relationship is the strongest for crime incidence and weakest for population aged between 25-54. 


```{r cor for all predictors, fig.height=12, fig.width=8, message=FALSE, warning=FALSE}

heroin_net_long <- heroin_net%>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID, gi, p_folded_sim, classification, hotspot, Totalpop)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") 

correlation.cor <-
  heroin_net_long %>%
    group_by(Type) %>%
    summarize(correlation = cor(Number, countHeroin, use = "complete.obs"))

heroin_net_long %>%
  ggplot(aes(x= Number, y = countHeroin)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1, size=3) +
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Complaintscount` = "Complaints",
    `Crimecount` = "Crime",
    `fast.nn` = "Distance to Fast Food",
    `fuel.nn` = "Distance to Gas Stations",
    `hospital.nn` = "Distance to Hospital",
    `MedHHInc` = "Median Household Income",
    `MF_ratio` = "Gender Ratio",
    `parks.nn` = "Distance to Parks",
    `pctBachelor` = "Percent Bachelor's Degree",
    `pctPoverty` = "Poverty Rate",
    `pharm.nn` = "Distance to Pharmacy",
    `pop25_54` = "Population 25_54", 
    `rehab.nn` = "Distance to Rehab Center",
    `race_ratio` = "Racial Majority over Minority"))) +
  labs(title = "Scatter Plots of All Predictor Variables") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
```

The figure below shows the difference in the number of hotspots or cold spots based on each of our predicting variable and we may see that there seems to be a significant difference in all scenarios. 

```{r anova figure, fig.height=10, fig.width=15, message=FALSE, warning=FALSE}

anova_dataset <- heroin_net %>% st_drop_geometry() %>% 
                    dplyr::select(hotspot, Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor)

anova_dataset %>%
  gather(Variable, value, -hotspot) %>%
    ggplot(aes(as.character(hotspot), value, fill=as.character(hotspot))) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free", ncol = 5, labeller= labeller(Variable = c(
        `Complaintscount` = "Complaints",
        `Crimecount` = "Crime",
        `fast.nn` = "Distance to Fast Food",
        `fuel.nn` = "Distance to Gas Stations",
    `hospital.nn` = "Distance to Hospital",
    `MedHHInc` = "Median Household Income",
    `MF_ratio` = "Gender Ratio",
    `parks.nn` = "Distance to Parks",
    `pctBachelor` = "Percent Bachelor's Degree",
    `pctPoverty` = "Poverty Rate",
    `pharm.nn` = "Distance to Pharmacy",
    `pop25_54` = "Population 25_54", 
    `rehab.nn` = "Distance to Rehab Center",
    `race_ratio` = "Racial Majority over Minority"))) +
      scale_fill_manual(values = c("#000004",  "#BB3754")) +
      labs(x="Hotspot", y="Value", 
           title = "Feature Associations with the Likelihood of Overdose Hotspot") +
      theme(legend.position = "none") +
    theme(plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 18, face = "bold"), 
        axis.text.x=element_text(size=10),
        axis.text.y=element_text(size=10), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
```

We further conducted a series of ANOVA tests for all predictor variables to validate if there's a significant difference in mean number of crimes, for example, between cold spot and hot spot. The result is in concur with our expectations: all variables are significant predictors.

```{r anova test, message=FALSE, warning=FALSE}

anova_var <- c("Crimecount", "Complaintscount", "pharm.nn", "hospital.nn", "fast.nn", "fuel.nn", "rehab.nn", "parks.nn", "pop25_54", "MedHHInc", "MF_ratio", "race_ratio", "pctPoverty", "pctBachelor")

anova_results <- data.frame(
                            Df = integer(), 
                            Sum_Sq = numeric(), 
                            Mean_Sq = numeric(), 
                            F_value = numeric(), 
                            P_Value = numeric(), stringsAsFactors = FALSE)

new_names <- c("Crimecount" = "Crime",
                "Complaintscount" = "Complaints",
                "pharm.nn" = "Distance to Pharmacy",
                "hospital.nn" = "Distance to Hospital",
                "fast.nn" = "Distance to Fast Food",
                "fuel.nn" = "Distance to Gas Stations",
                "rehab.nn" = "Distance to Rehab Center", 
                "parks.nn" = "Distance to Parks", 
                "pop25_54" = "Population 25_54", 
                "MedHHInc" = "Median Household Income",
                "MF_ratio" = "Gender Ratio",
                "race_ratio" = "Racial Majority over Minority",
                "pctPoverty" = "Poverty Rate", 
                "pctBachelor" = "Percent Bachelor's Degree"
               )

for (var in anova_var) {
  anova_result <- aov(anova_dataset[[var]] ~ anova_dataset$hotspot)
  summary_data <- summary(anova_result)[[1]][, c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")][1:1, ]
  anova_results <- rbind(anova_results, summary_data)
}

rownames(anova_results) <- new_names

anova_results %>% 
 kable() %>% 
 kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
    footnote(general_title = "\n")

```

## Simple Logit

Considering that all predictor variables used in the anova test appear to be significant, we included all of them in a kitchen-sink logistic regression. The model shows that in this scenario, **number of crimes**, **number of complaints**, **distance to hospital**, **distance to fast food restaurant**, **gender ratio**, and **distance to public parks** are highly significant predictors of heroin-overdose hotspot whereas other predictors become insignificant. 

```{r logistic without cv, warning=FALSE, message=FALSE}

kitchensink <- glm(hotspot ~ .,
                  data=heroin_net %>% 
                    st_drop_geometry() %>% 
                    dplyr::select(hotspot, Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor), family="binomial" (link="logit"))


kitchen_sum <- summary(kitchensink)
coefficients_table <- as.data.frame(kitchen_sum$coefficients)

coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|z|)` < 0.001, '***',
                                         ifelse(coefficients_table$`Pr(>|z|)` < 0.01, '**',
                                                ifelse(coefficients_table$`Pr(>|z|)` < 0.05, '*',
                                                       ifelse(coefficients_table$`Pr(>|z|)` < 0.1, '.', ''))))

coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|z|)`, digits = 3), coefficients_table$significance)

coefficients_table %>%
  dplyr::select(-significance, -`Pr(>|z|)`) %>% 
  kable(align = "r") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))  %>%
  footnote(general_title = "\n")
```

## Logit with CV

We would like to perform cross validation on this model to measures its overall performance and robustness. We decide to modify a cross validation function that was originally built on cases using Poission regression by making it suitable for use in our scenario (logistic regression). This function takes in four parameters: the dataset, a cross validation ID, which it could use to separate the dataset into training and testing, a list of dependent variables, and a independent variable. In our case, we use a randomly generated cvID associated with each grid cell for random k-fold cross validation. Each group of grids with the same cvID gets hold out during one iteration. 


```{r logit function}

logitCV <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])

  for (i in cvID_list) {
    
    thisFold <- i

    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    
    form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
    form <- as.formula(form_parts)
    regression <- glm(form, data = fold.train %>%
                        dplyr::select(-geometry, -id),  family="binomial" (link="logit"))
    
    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

```

Here, another helper function is written to visualize the confusion matrix. 

```{r cm function}

draw_confusion_matrix <- function(cm) {

  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

  # create the matrix 
  rect(150, 430, 240, 370, col='black')
  text(195, 435, '0', cex=1.2)
  rect(250, 430, 340, 370, col="#BB3754")
  text(295, 435, '1', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col="#BB3754")
  rect(250, 305, 340, 365, col='black')
  text(140, 400, '0', cex=1.2, srt=90)
  text(140, 335, '1', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  
```

The cross validation model is visualized below, from which we see that our model has a high accuracy rate of 0.936 and a moderate sensitivity rate of 0.727. Sensitivity is the true-positive rate, which is the proportion of actual hotspot grids that were correctly identified by the model. Given this model's high accuracy rate, this means that our model is missing out a few heorin-overdose hotspots. 

However, note that in Cincinnati, the number of grids that are classified as hotspot is much smaller than the number of grids that are classified as coldspot. While this is a good news for Cincinnati, logistic regression can be sensitive to class imbalance, meaning that the model might be biased towards the majority class, leading to poor performance on the minority class. 

```{r logistic with CV, warning=FALSE, message=FALSE}

reg.vars <- c("Crimecount", "Complaintscount", "pharm.nn", "hospital.nn", "fast.nn", "fuel.nn", "rehab.nn", "parks.nn", "pop25_54", "MedHHInc", "MF_ratio", "race_ratio", "pctPoverty", "pctBachelor")


kitchensink.logit.cv <- logitCV (
  dataset = heroin_net,
  id = "cvID",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, hotspot, Prediction, geometry)

kitchensink.logit.cv <- kitchensink.logit.cv %>% 
  mutate(PredictionCat  = as.factor(ifelse(kitchensink.logit.cv$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))


cm_basic <- caret::confusionMatrix(kitchensink.logit.cv$hotspot, kitchensink.logit.cv$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_basic)

```

# Principal Component Analysis (PCA)

One of the reasons why each individual predictor variable is correlated with overdose hotspot but when combined in a regression, some of them become insignificant is because of multicollinearity. This commonly occurs in regression analysis when predictor variables are highly correlated. It won't be surprise to see that many of those demographic variables are highly correlated with each other. It is possible that distance to hospitals is correlated with distance to parks, given that there are more those kind of services in downtown areas. Srinivasan, et al. (2023)'s approach to this issue is to use Principal Component Analysis (PCA) as a preprocessing step before running the regression.  


## PC Identification

PCA is a dimension reduction technique that is commonly used to reduce the number of variables used in analyses while preserving the information from them. The `prcomp` function below is used to identify the principal components among all 14 of our predictor variables. 

```{r determine pc, warning=FALSE, message=FALSE}

pca_heroin <- heroin_net %>% 
  st_drop_geometry() %>% 
  dplyr::select(Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor)

pca_heroin <- na.omit(pca_heroin)
pc <- prcomp(pca_heroin,
             center = TRUE,
            scale. = TRUE)
```

The scree plot is used to visualize the importance of each principal component and can be used to determine the number of principal components to retain. The y axis is eigenvalues, which essentially stand for the amount of variation. Here, we may see that the first six, or seven components are capturing the majority of variability. We can therefore, just use the first seven principal components and ignore the rest. 

```{r pca viz1, warning=FALSE, message=FALSE}

fviz_eig(pc, addlabels = TRUE, barfill = "#BB3754", barcolor = "transparent", xlab = "Principal Components", choice = "eigenvalue")

```

The code above computed the square cosine value for each variable with respect to the first seven principal components. From the illustration below, gender ratio, population aged between 25-54, distance to parks, and poverty rate are top four variables with the highest cos2, hence contributing the most to the top seven principal components. 

```{r pca viz2,  message=FALSE, warning=FALSE}

fviz_cos2(pc, choice = "var", axes = 1:7) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_bar(stat = "identity", fill = "#BB3754", color = "transparent")

```

We also did a biplot based on our principal component analysis, which helps us to visualize the similarities and dissimilarities between the samples, and further understand the impact of each attribute on each of the principal components. Here, the length of the arrow represents the strength of the contribution of the corresponding variable to the specific principal component. The color of the arrow represents the relative contribution of the variable to the principal component. Darker colors typically indicate larger contributions. For variables with long arrow, but light color, such as percentage of population with a Bachelor's degree and percentage poverty, their contributions were less significant compared to other variables, and might only contribute significantly to one PC. For variables with short arrow but dark color, such as 311 complaints and crime, they have relatively strong contributions to all PC. Also, the direction of those arrows indicate that crimes and complaints have positive contributions to hotspots. 

```{r pca viz3, fig.height=6, fig.width=9, message=FALSE, warning=FALSE}

fviz_pca_var(pc, col.var = "cos2", repel = TRUE, 
             arrowsize = 1,  ggtheme = theme_minimal(), gradient.cols = c("black", "#BB3754"),) + ggtitle("Biplot Combind with Contribution")


```

Below, we computed the first seven principal components based on the loadings of each of our predictor variables in that component. 

```{r compute pca, message=FALSE, warning=FALSE}

pca_net <- heroin_net %>% 
  mutate(pc1 = Crimecount * pc$rotation[1] + Complaintscount * pc$rotation[2] + pharm.nn * pc$rotation[3] + hospital.nn * pc$rotation[4] + fast.nn * pc$rotation[5] +  fuel.nn * pc$rotation[6] + rehab.nn * pc$rotation[7] + parks.nn *  pc$rotation[8] + pop25_54 * pc$rotation[9] + MedHHInc * pc$rotation[10] + MF_ratio * pc$rotation[11] + race_ratio * pc$rotation[12] +  pctPoverty * pc$rotation[13] + pctBachelor * pc$rotation[14]) %>% 
  mutate(pc2 = Crimecount * pc$rotation[15] + Complaintscount * pc$rotation[16] + pharm.nn * pc$rotation[17] + hospital.nn * pc$rotation[18] + fast.nn * pc$rotation[19] +  fuel.nn * pc$rotation[20] + rehab.nn * pc$rotation[21] + parks.nn *  pc$rotation[22] + pop25_54 * pc$rotation[23] + MedHHInc * pc$rotation[24] + MF_ratio * pc$rotation[25] + race_ratio * pc$rotation[26] +  pctPoverty * pc$rotation[27] + pctBachelor * pc$rotation[28]) %>%
  mutate(pc3 = Crimecount * pc$rotation[29] + Complaintscount * pc$rotation[30] + pharm.nn * pc$rotation[31] + hospital.nn * pc$rotation[32] + fast.nn * pc$rotation[33] +  fuel.nn * pc$rotation[34] + rehab.nn * pc$rotation[35] + parks.nn *  pc$rotation[36] + pop25_54 * pc$rotation[37] + MedHHInc * pc$rotation[38] + MF_ratio * pc$rotation[39] + race_ratio * pc$rotation[40] +  pctPoverty * pc$rotation[41] + pctBachelor * pc$rotation[42]) %>%
  mutate(pc4 = Crimecount * pc$rotation[43] + Complaintscount * pc$rotation[44] + pharm.nn * pc$rotation[45] + hospital.nn * pc$rotation[46] + fast.nn * pc$rotation[47] +  fuel.nn * pc$rotation[48] + rehab.nn * pc$rotation[49] + parks.nn *  pc$rotation[50] + pop25_54 * pc$rotation[51] + MedHHInc * pc$rotation[52] + MF_ratio * pc$rotation[53] + race_ratio * pc$rotation[54] +  pctPoverty * pc$rotation[55] + pctBachelor * pc$rotation[56]) %>%
  mutate(pc5 = Crimecount * pc$rotation[57] + Complaintscount * pc$rotation[58] + pharm.nn * pc$rotation[59] + hospital.nn * pc$rotation[60] + fast.nn * pc$rotation[61] +  fuel.nn * pc$rotation[62] + rehab.nn * pc$rotation[63] + parks.nn *  pc$rotation[64] + pop25_54 * pc$rotation[65] + MedHHInc * pc$rotation[66] + MF_ratio * pc$rotation[67] + race_ratio * pc$rotation[68] +  pctPoverty * pc$rotation[69] + pctBachelor * pc$rotation[70]) %>%
  mutate(pc6 = Crimecount * pc$rotation[71] + Complaintscount * pc$rotation[72] + pharm.nn * pc$rotation[73] + hospital.nn * pc$rotation[74] + fast.nn * pc$rotation[75] +  fuel.nn * pc$rotation[76] + rehab.nn * pc$rotation[77] + parks.nn *  pc$rotation[78] + pop25_54 * pc$rotation[79] + MedHHInc * pc$rotation[80] + MF_ratio * pc$rotation[81] + race_ratio * pc$rotation[82] +  pctPoverty * pc$rotation[83] + pctBachelor * pc$rotation[84]) %>%
  mutate(pc7 = Crimecount * pc$rotation[85] + Complaintscount * pc$rotation[86] + pharm.nn * pc$rotation[87] + hospital.nn * pc$rotation[88] + fast.nn * pc$rotation[89] +  fuel.nn * pc$rotation[90] + rehab.nn * pc$rotation[91] + parks.nn *  pc$rotation[92] + pop25_54 * pc$rotation[93] + MedHHInc * pc$rotation[94] + MF_ratio * pc$rotation[95] + race_ratio * pc$rotation[96] +  pctPoverty * pc$rotation[97] + pctBachelor * pc$rotation[98])
  

```

## PCA Logistic Regression

Following Srinivasan, et al. (2023)'s method, now we can estimated this logistic regression to predict if a grid was in a hotspot (0/1) as identified by the Getis-Ord Gi\* statistic using components derived from the PCA as the explanatory variables. For this model using the principal components as predictors, we still get a accuracy rate of 0.928. Our sensitivity rate dropped a little bit to 0.66 compared to the kitchen sink model, probably because we addressed the issue of multicolinearity,  but that is still considered as a moderate sensitivity. 

```{r using pca in logistic regression, message=FALSE, warning=FALSE}

reg.vars <- c("pc1", "pc2", "pc3", "pc4", "pc5", "pc6", "pc7")

pca.logit.cv <- logitCV (
  dataset = pca_net,
  id = "cvID",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, hotspot, Prediction, geometry)

pca.logit.cv <- pca.logit.cv %>% 
  mutate(PredictionCat  = as.factor(ifelse(pca.logit.cv$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))

cm_pca <- caret::confusionMatrix(pca.logit.cv$hotspot, pca.logit.cv$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_pca)

```

In addition to cross-validating using random cvID, we also tried the spatial ‘Leave-one-group-out’ cross-validation (LOGO-CV) approach, during which we hold out one neighborhood, train the model on the remaining n-1 areas, predict for the hold out, and record the goodness of fit. The result shows that our model has high accuracy, but the sensitivity further decreases. One possible explanation here is that different neighborhoods may have distinct characteristics or patterns. If the model is trained on a set of neighborhoods and then tested on a neighborhood with different characteristics, its performance may be negatively affected. Random cross-validation IDs might not capture these spatial variations as explicitly, but further investigation is required here. 

```{r using pca in logistic regression with neighborhood, message=FALSE, warning=FALSE, results='hide'}

neighborhood <- st_read(here("data", "public", "neighborhood.geojson")) %>% st_transform("EPSG:3735") %>% st_zm()
pca_net <- pca_net %>%
  left_join(net_centroid %>% dplyr::select(uniqueID) %>% 
               st_intersection(neighborhood %>% dplyr::select(SNA_NAME)) %>% 
              st_drop_geometry(), by = "uniqueID") %>% 
  mutate(SNA_NAME = ifelse(is.na(SNA_NAME), "NOT APPLICABLE", SNA_NAME))


pca.logit.cv.neigh <- logitCV (
  dataset = pca_net,
  id = "SNA_NAME",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(SNA_NAME = SNA_NAME, hotspot, Prediction, geometry)

pca.logit.cv.neigh <- pca.logit.cv.neigh %>% 
  mutate(PredictionCat  = as.factor(ifelse(pca.logit.cv.neigh$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))

cm_pca_neigh <- caret::confusionMatrix(pca.logit.cv.neigh$hotspot, pca.logit.cv.neigh$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_pca_neigh)
```

We computed the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) metric for our model to evaluate their performance. The AUC value ranges from 0 to 1, where 0.5 means a model with no discriminatory power (it performs as well as random chance), and 1 means a perfect model (it perfectly distinguishes between positive and negative cases). Generally, an AUC over 0.8 indicates a good performance of the model. For the random K-fold cross validation, we get a AUC score of 0.904 and for the Spatial LOGO-CV, we get a AUC score of 0.8628. Both are good scores and indicate that model is capturing important patterns in the data and making predictions better than random guessing.

```{r ROC curve, fig.height=6, fig.width=11, message=FALSE, warning=FALSE}

plot7 <- ggplot(pca.logit.cv, aes(d = as.numeric(hotspot), m = Prediction)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "black") +
    labs(title = "ROC Curve for Logistic Regression Model using PCA", 
         subtitle = "Random Cross Validation") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#BB3754") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot8 <- ggplot(pca.logit.cv.neigh, aes(d = as.numeric(hotspot), m = Prediction)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "black") +
    labs(title = "ROC Curve for Logistic Regression Model using PCA", 
         subtitle = "Spatial LOGO Cross Validation") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#BB3754") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot7 + plot8

```




```{r mapping error, message=FALSE, warning=FALSE, fig.height=6, fig.width=11}

plot9 <- pca.logit.cv %>%
  filter(is.na(Prediction) == FALSE) %>% 
   mutate(result = case_when(
   PredictionCat==0 & hotspot==0 ~ "Accurate",
   PredictionCat==1 & hotspot==1 ~"Accurate",
   PredictionCat==0 & hotspot==1 ~"Problematic",
   TRUE ~ "Inaccurate"
  )) %>% 
  ggplot() +
  geom_sf(aes(fill=as.character(hotspot)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Actual Hotspot") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
  

plot10 <- pca.logit.cv %>%
  ggplot() +
  geom_sf(aes(fill=as.character(PredictionCat)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Predicted Hotspot", 
         subtitle = "Random Cross Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot11 <- pca.logit.cv.neigh %>%
  ggplot() +
  geom_sf(aes(fill=as.character(PredictionCat)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Predicted Hotspot", 
         subtitle = "Spatial LOGO Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
  

plot12 <- pca.logit.cv %>%
  filter(is.na(Prediction) == FALSE) %>% 
   mutate(result = case_when(
   PredictionCat==0 & hotspot==0 ~ "Accurate",
   PredictionCat==1 & hotspot==1 ~"Accurate",
   PredictionCat==0 & hotspot==1 ~"Problematic",
   TRUE ~ "Inaccurate"
  )) %>% 
  ggplot() +
  geom_sf(aes(fill=as.character(result)), color="transparent")+
  scale_fill_manual(values = c("black", "#FFC000", "#BB3754"), name = "Hotspot") +
  labs(title = "Mapping Error", 
         subtitle = "Random Cross Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot9 + plot10 + plot11 + plot12
```


If our model has generally been good at making prediction (high accuracy) but with some uncertainties in correctly locating the hotspot, then we would like to know where it is making mistakes. We mapped the error below and compared it to the actual hotspot. What we can see is that our model is accurate in roughly estimating the location of the hotspot, but tends to make mistakes when deciding which of these grids in particular should be categorized as hotspot. 

We highlighted those grids which are actual hotspot but were wrongly predicted as problematic, from which we may see that those most of those problematic grids pretty much neighbors hotspot grids. This implies that when drawing conclusion from our model, we should not only look at specific hotspot grid, but also probably pay attention to grids nearby. 


# KDE Validation

The accuracy of our model is also compared relative to kernel density estimation. Kernel density works by centering a smooth kernel, or curve, atop each crime point such that the curve is at its highest directly over the point and the lowest at the range of a circular search radius. The code block below creates a Kernel density map of robbery with a 1000 foot search radius.

```{r KDE 1000 radii,  message=FALSE, warning=FALSE}

heroin_ppp <- as.ppp(st_coordinates(heroin), W = st_bbox(heroin_net))
heroin_KD.1000 <- spatstat.explore::density.ppp(heroin_ppp, 1000)
heroin_KD.df <- data.frame(rasterToPoints(mask(raster(heroin_KD.1000), as(cincinnati20, 'Spatial'))))


ggplot(data=heroin_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  coord_sf(crs=st_crs(heroin_net)) + 
  scale_fill_viridis(option = "rocket", name="Density") +
  labs(title = "Kernel Density of Heroin 1000ft Radii") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )
```

The kernel density estimation was classified into two categories, lower risk and higher risk, matching the binary of hotspot and coldspot. We used the `Fisher's method` that involves finding intervals such that the variance within each interval is minimized while maximizing the variance between the intervals. The purpose here is to see if the our model capture the same hotspot grids as the kernel density. The map below shows that our model perform as good as, if not worse, the kernal density approach in capturing hotspots (high risk areas). 

```{r comparison, message=FALSE, warning=FALSE}

heroin_KDE_sum <- as.data.frame(heroin_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(heroin_net)) %>%
  aggregate(., heroin_net, mean) 

kde_breaks <- classIntervals(heroin_KDE_sum$value, 
                             n = 2, "fisher")

heroin_KDE_sf <- heroin_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 2 ~ "High",
           Risk_Category == 1 ~ "Low"))

heroin_risk_sf <-
  pca.logit.cv %>%
  mutate(label = "Logistic Regression",
         Risk_Category = case_when(
           PredictionCat == 1 ~ "High",
           PredictionCat == 0 ~ "Low")) %>% 
  dplyr::select(PredictionCat, label, Risk_Category) %>% 
  rename(value = PredictionCat)


rbind(heroin_KDE_sf, heroin_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    facet_wrap(~label, ) +
    scale_fill_viridis(option = "rocket", discrete = TRUE, name = "Risk Category") +
    labs(title="Comparison of Kernel Density and Risk Predictions") + 
  theme(legend.position="bottom",
    axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )
```


# Multi Criteria Analysis

We would like to conclude our analysis by building multi-criteria-analysis-centered geospatial risk terrain model that stakeholders and the public can use to learn about the community vulnerability to heroin-overdose in Cincinnati. In GIS, a Multi-Criteria Analysis (MCA) model is a decision-making approach that involves evaluating and comparing multiple criteria or factors to make informed spatial decisions. The criteria we use here are the predictor variables we use in our model. These criteria are standardized to ensure that all of them are on a scale of 0 to 1. Then, some of those scaled variables are reversed if necessary based on whether they contribute positively or negatively to the heroin-overdose risk. We supply a weight on each of those variables based on their contribution. Finally, a risk score is computed for each grid. 

```{r MCA, message=FALSE, warning=FALSE}

scale_values <- function(x){(x-min(x))/(max(x)-min(x))}

risk_score <- heroin_net %>% 
  mutate(MedHHInc = ifelse(is.na(MedHHInc), 0, MedHHInc)) %>% 
  mutate(scl_crime = scale_values(Crimecount),
         scl_complaints = scale_values(Complaintscount),
         scl_pharm = scale_values(pharm.nn),
         scl_hospital = scale_values(hospital.nn),
         scl_rehab = scale_values(rehab.nn),
         scl_park = scale_values(parks.nn),
         scl_fuel = scale_values(fuel.nn),
         scl_fast = scale_values(fast.nn),
         scl_pop = scale_values(pop25_54),
         scl_income = scale_values(MedHHInc),
         scl_gender = scale_values(MF_ratio),
         scl_race = scale_values(race_ratio),
         scl_poverty = scale_values(pctPoverty),
         scl_edu = scale_values(pctBachelor)) %>% 
  mutate(scl_race_re = 0 - scl_race + 1,
         scl_edu_re = 0 - scl_edu + 1,
         scl_income_re = 0 - scl_income + 1, 
         scl_fuel_re = 0 - scl_fuel + 1,
         scl_fast_re = 0 - scl_fast + 1,
         scl_park_re = 0 - scl_park + 1,
         scl_pharm_re = 0 - scl_pharm + 1) %>% 
  mutate(score = 0.1 * (scl_hospital + scl_crime + scl_complaints +  scl_gender + scl_income_re + scl_fuel_re + scl_fast_re +  scl_park_re) + 0.05 * (scl_race_re +  scl_pharm_re) + 0.025*(scl_pop +scl_edu_re + scl_poverty +  scl_rehab)) %>% 
  st_as_sf()
  

risk_score <- risk_score %>%
  left_join(net_centroid %>% dplyr::select(uniqueID) %>% 
               st_intersection(neighborhood %>% dplyr::select(SNA_NAME)) %>% 
              st_drop_geometry(), by = "uniqueID") %>% 
  mutate(SNA_NAME = ifelse(is.na(SNA_NAME), "NOT APPLICABLE", SNA_NAME))


```

This below interactive leaflet map shows the risk score of each grid, the neighborhood that grid fall into, and whether or not it falls into a hotspot that our model predicts. We hope that this will serve as our first step towards building up our use case that present stakeholders with a clearer picture of community vulnerability to heroin overdose and that help the public to have a better understanding of the severity of heroin-overdose in Cincinnati as well as existing rehabilitation resources.  

```{r map risk, message=FALSE, warning=FALSE}

colors <- rev(viridis::viridis(6, option = "rocket"))

pal <- colorBin(palette = colors, 
                domain = risk_score$score, 
                bins = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7)) 

labs <- paste0("<strong>", risk_score$SNA_NAME,
               "</strong><br/>Heorin-Overdose Risk Score: ", risk_score$score
               ) 

leaflet <- risk_score %>%
  st_transform("EPSG:4326") %>% 
  leaflet() %>%
  setView(lng = -84.51, lat = 39.10, zoom = 10.5) %>%
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(weight = 1,
              fillColor = ~pal(score), # change the name of the parameter in the bracket
              popup = labs,
              color = "transparent",
              group = "Risk Score", 
              fillOpacity = 1) %>%
  addPolygons(
    data = pca.logit.cv %>% filter(PredictionCat == 1) %>% st_transform("EPSG:4326"), 
    weight = 2,
    group = "Overdose Hotspot", 
    color = "yellow") %>% 
  addLayersControl(
    overlayGroups = c("Risk Score", "Overdose Hotspot"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  addLegend(position = "bottomright",
            pal = pal,
            values = ~risk_score, # change the name of the parameter after ~
            title = "Risk Score",
            opacity = 1) # change the name of the legend 

leaflet
mapshot(leaflet, url = "cincyrisk.html")
```


# Conclusion

In this report, we conducted a study on opioid-overdose risk modeling in Cincinnati, Ohio where the opioid epidemic has been a serious issue. Drawing on methods by Srinivasan, et al. (2023), 
we used Getis-Ord hotspot analyses to identify locations of persistent risk in Cincinnati from 2016-2020. We constructed measures of the socio-built environment and neighborhood demographic characteristics using principal components analysis and ran logistic regression to determine if places in Cincinnati fall into heroin-overdose hotspots or not. While Poisson regression has long been the traditional method used in geospatial risk model to predict the number (count) of discrete event, we propose this alternative method in this report because we believe that it is less useful for policymakers to focus on the number of heroin-overdose incidence among neighborhoods in Cincinnati. Rather, we should develop a way to quickly identify areas/clusters of high heroin-overdose incidence. 

Our model, which uses principal components as predictors, has an accuracy rate of 0.928, a sensitivity rate of 0.66, and an AUC score of 0.904, meaning that it has been effective overall in distinguishing between hotspots and cold spots. We mapped the error and compared that to the actual hotspot, from which we noticed that our model is accurate in roughly estimating the location of the hotspot but tends to make mistakes when deciding which of these grids in particular should be categorized as hotspot. On top of that, many of those grids are neighbors of hotspot grids that are correctly predicted.  

Reflecting on the overall study design, we have the following concerns, which we would like to continue working on to improve the accuracy and reliability of our model. Firstly, our study utilized a binary classification of hotspot and cold spot to describe risk. While this is a common approach in risk modeling, we should recognize the complexity within these categories as areas neighboring hotspots may also exhibit varying degrees of risk. Building onto that, another major issue is that our analysis has categorized "insignificant" grids into cold spots. In theory, cold spot should refer to places with a series of grids that have low heroin-incidence count whereas for insignificant grid, there might be a high number of incidences within specific grid. It’s just that those grids do not form clusters with surrounding grids. Merging insignificant grids into cold spot might risk overlooking heroin-incidence within those grids. 

Secondly, within the discussion, we highlight the potential imbalance in logistic regression, prompting consideration for weighted logistic regression to address disparities in the prediction of positive and negative cases. Thirdly, during cross validation, we found that our model does not generalize well across neighborhoods. This highlights significant variations between different neighborhoods in Cincinnati, which requires further investigations. Fourthly, as we need to arrogate information into a standardized spatial unit that’s different from any existing spatial unit, we need to be aware of how our model might be sensitive to errors and uncertainties introduced by area-weighted re-aggregation and boundary/edge effect. 

Regardless, the takeaways from this analysis will contribute to our understanding of the nuanced landscape of heroin-overdose crisis in Cincinnati and provide important guidance on public health initiatives aiming to address heroin addiction. 

We believe that our risk terrain model could effectively visualize heroin-overdose hotspots and could serve a good starting point for providing real-time update through calculating a risk score for each part of the city, allowing the government to conduct real-time data analysis to efficiently allocate EMS resources. We hope that our effort in bringing in data from a variety of sources would encourage data sharing between public and private sectors so that the government can manage public service and monitor its built environment. Reports can be collected from public parks, fast food restaurants, gas stations, places where most overdose incidence occurs. Having data from local hospitals and rehabilitation centers allows the government to see capacity information, thereby efficiently dispatching preparing and dispatching care so that the risk of heroin overdose is minimized before they happen. 

We look forward to seeing our risk terrain model being adapted in various scenarios that truly benefit both the stakeholders in managing community health and well-being as well as serve the needs of local residents. 


# References

Choi, J. I., Lee, J., Yeh, A. B., Lan, Q., & Kang, H. (2022). Spatial clustering of heroin-related overdose incidents: a case study in Cincinnati, Ohio. BMC public health, 22(1), 1-12. [https://doi.org/10.1186/s12889-022-13557-3](https://doi.org/10.1186/s12889-022-13557-3)

Srinivasan, S., Pustz, J., Marsh, E., Young, L. D., & Stopka, T. J. (2023). Risk factors for persistent fatal opioid-involved overdose hotspots in Massachusetts 2011-2021: A spatial statistical analysis with socio-economic, access and prescription factors.[https://doi.org/10.21203/rs.3.rs-3249650/v1](https://doi.org/10.21203/rs.3.rs-3249650/v1)

Chichester, K. R., Drawve, G., Sisson, M., Giménez-Santana, A., McCleskey, B., Goodin, B. R., ... & Cropsey, K. L. (2023). Crime and Features of the Built Environment Predicting Risk of Fatal Overdose: A Comparison of Rural and Urban Ohio Counties with Risk Terrain Modeling. American Journal of Criminal Justice, 1-25. [https://doi.org/10.1007/s12103-023-09739-3](https://doi.org/10.1007/s12103-023-09739-3)


