Version 3.0 | First Created Nov 17, 2023 | Updated Dec 8, 2023

Keywords: K Nearest Neighbors, Moran’s I, Getis-Ord Gi*, Kernel Density Estimation, Random K-fold Cross Validation, Logistic Regression, Poisson Regression, ROC Curve, Principal Component Analysis, Multi-Criteria Analysis, Health Geography

Important Links: study repository, pitch presentation

Introduction

The opioid overdose crisis continues as one of the most significant public health challenges of the past two decades in the US. Between 1999 and 2019, the national age-adjusted opioid-involved overdose death rate increased from 2.9 per 100,000 population to 15.5 per 100,000. Rates have increased steeply during the COVID-19 pandemic. In 2020, the rate climbed to 21.4 per 100,000 and to 24.7 per 100,000 in 2021. Most literature to this date have focused on understanding the drivers of overdose crisis, which include drug supply, drug manufacturing, pain arising from work-related injuries, income inequality, and added stress, isolation, and economic disadvantage connected to the COVID-19 pandemic. Until recently, however, more and more studies have begun to consider the association between drug use, opioid-related mortality, and the built-environment. Contextual risk factors, especially features of the built environment, the presence of violent crime or crime associated with substance use, and neighborhood-level socioeconomic factors, have enabled more accurate identification of high-risk overdose areas, thus driving initiatives on alleviating overdose risk and crises.

We conducted a case study on opioid-overdose risk modeling in Cincinnati, Ohio where the opioid epidemic has been a serious issue. Since 2009, drug overdose deaths in Ohio have continued to increase, and Cincinnati, a large city in southwestern Ohio with a population of about 300,000 people, was one of 12 opioid hotspots in Ohio and had the highest per capita rates of a fatal overdose between 2010 and 2017. Today, heroin overdose incidences still remain high in the city since heroin and prescription drugs abuse have further led to other public health problems such as HIV and needle-borne diseases.

We combined multiple data sources, including emergency medical service response, crime and non-emergency reports data, American Community Survey data, and datasets that are proxies for local built environment to analyze the distributions of heroin-related overdose incidents in Cincinnati, Ohio.

Among studies of spatial epidemiology, geospatial clustering has been used to analyze the geographic variation of phenomena, such as the disproportionate impact of public health crisis on people with disabilities, and the concentrated vulnerability of socio-economically disadvantaged population under natural hazards. It allows the investigators to identify groups of spatial objects (i.e. clusters) that have similar characteristics and analyze patterns of the clusters (e.g., hot spots, cold spots). As such, we used Getis-Ord hotspot analyses to identify locations of persistent risk in Cincinnati from 2016-2020. We constructed measures of the socio-built environment and neighborhood demographic characteristics using principal components analysis (PCA). The resulting measures were used as covariates in a logistic regression with random k-fold cross validation to predict if an area was part of an opioid hotspot cluster. All built environment risk factors were aggregated, scaled, and weighted to generate an interactive overdose risk terrain map.

On the technical side, the goal of our study is to build a model that could predict opioid overdose clusters as well as socioeconomic and built environment factors associated with opioid overdose rates in Cincinnati from 2016 to 2020 as much as possible. On the practical side, we would like our risk terrain model to benefit both the stakeholders and community members in the main form of a municipal dashboard and a mobile application. For stakeholders, the dashboard would serve as an early warning system by predicting/showing areas at high risk of opioid overdose. Specifically, it should include maps with overdose cluster locations, trend graphs over time, and charts illustrating the correlation between overdose rates and various factors. For community members, this application connects them to healthcare and rehabilitation resources within the city. Specifically, the application allow them to call for emergency services and medical assistance for themselves or for others at one tap. To encourage public-private partnership, this app would adopt the approach that OpenStreetMap took to allow both technical and non-technical users, private and public entities to update existing information, such as crime incidents, rehabilitation center capacities and location. We hope that non-technical stakeholders can use the app to allocate resources efficiently, directing outreach efforts, treatment facilities, and educational campaigns to areas with the highest predicted risk. We hope that the public would use this platform to foster community engagement and support for initiatives addressing the opioid crisis.

The rest of the report is organized as follows. Section one describes our approach to access and pre-process the data. Section two documents an exploratory analysis we did using poission regression with only crime and 311 incidents. Section three explains the conceptualization our approach applying the Getis Ord G* statistics to determine spatial hotspots. Section four is the main part of this study, where we iclude all built environment predictors to run a logistic regression predicting if an area is an opioid overdose hotspot or not. Section five applies principal component analysis to reduce colinerity and include only half of the principal components into the model. Section six validates the accuracy of this model against a kernel density estimation while the last section constructs the risk terrain map.

The conceptualization and methodologies of this study were drawn in part from:

  • Choi, et al.(2022) Spatial clustering of heroin-related overdose incidents: a case study in Cincinnati, Ohio

  • Srinivasan, et al. (2023) Risk factors for persistent fatal opioid-involved overdose hotspots in Massachusetts 2011-2021

  • Chichester, et al.(2023) Crime and Features of the Built Environment Predicting Risk of Fatal Overdose: A Comparison of Rural and Urban Ohio Counties with Risk Terrain Modeling.

Data

We used the EMS dataset that includes all responses to heroin-related overdose incidents from the Cincinnati Fire Department between 2016 and 2020. Each incident record contains incident location information, including geospatial information, location address, a classification of a neighborhood in the city, time of the incident, and disposition of incident response. Data records regarding incidents outside of the study area, without geospatial information, and with unassociated disposition codes were excluded from this study. Records from canceled calls or false alarms and duplicated records were also excluded.

heroin <- read.csv(here("data", "public", "Cincinnati_Fire_Incidents__CAD___including_EMS__ALS_BLS_.csv"))
cincinnati <- st_read(here("data", "public", "Cincinnati_City_Boundary.geojson")) %>% st_transform("EPSG:3735")

heroin <- heroin %>% 
  filter(CFD_INCIDENT_TYPE_GROUP != "NON-PROTOCOL PROBLEM TYPES") %>% 
  filter(is.na(LATITUDE_X) == FALSE & is.na(LONGITUDE_X) == FALSE) %>%  # remove incidents without spatial info
  filter(is.na(DISPOSITION_TEXT) == FALSE         # remove record with un-associated disposition codes
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL" # remove records from canceled calls, false alarm, and duplication
         & CFD_INCIDENT_TYPE_GROUP !="CANCEL INCIDENT" 
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,DEF: DEFAULT"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,EMSF: FALSE"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,DUPF: DUPLICATE"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,FALA: FIRE FALSE AC"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,MEDD: MT DISREGARDE"
         & CFD_INCIDENT_TYPE_GROUP !="DUPF: DUPLICATE"
         & CFD_INCIDENT_TYPE_GROUP !="DUPLICATE INCIDENT"
         & CFD_INCIDENT_TYPE_GROUP !="MAL: SYSTEM MALFUNCTION") %>% 
  mutate(time = mdy_hms(CREATE_TIME_INCIDENT, tz = "UTC"),  
         year_column = year(time)) %>% 
  st_as_sf(., coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  st_intersection(cincinnati %>% dplyr::select(OBJECTID),.) %>%   # remove incidents outside of study area
  filter(year_column %in% c(2016, 2017, 2018, 2019, 2020))

In addition, we used Homeland Infrastructure Foundation-Level Data (HIFLD) and Substance Abuse and Mental Health Services Administration (SAMHSA) data for 2015–2020 to obtain information about available healthcare services, such as hospitals, opioid treatment programs, and buprenorphine prescribing physicians, etc.

We filtered data only for Cincinnati in the HIFLD dataset. The SAMHSA data were geocoded manually to get its location information first before reading into R.

hospitals <- st_read(here("data","public", "Hospitals.geojson"))
hospitals <- hospitals %>% 
  filter(STATE == "OH" & CITY == "CINCINNATI") %>% 
  st_transform("EPSG:3735") # select those only in Cincinnati


rehab <- read.csv(here("data", "public", "rehabilitation.csv"))
rehab <- rehab %>% 
  filter(is.na(Latitude) == FALSE & is.na(Longitude) == FALSE) %>%
  st_as_sf(., coords = c("Longitude", "Latitude"), crs = 4326) %>% 
  st_transform("EPSG:3735")

Furthermore, we included the crime rate data from the Cincinnati Police Department between 2016 and 2020. Specific crimes were selected for inclusion in analyses based on having been identified in the literature as a spatial risk factor for drug overdose or substance use behavior or for having a theoretical relationship with drug overdose.

crime <- read.csv(here("data", "private", "PDI__Police_Data_Initiative__Crime_Incidents.csv"))
crime_list <- c("AGGRAVATED ASSAULT", "BURGLARY", "BREAKING AND ENTERING", "ROBBERY", "DOMESTIC VIOLENCE", "MURDER  ", "RAPE", "THEFT")
crime <- crime %>% 
  filter(OFFENSE %in% crime_list) %>% 
  filter(is.na(LATITUDE_X) == FALSE & is.na(LONGITUDE_X) == FALSE) %>%
  st_as_sf(., coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  mutate(time = mdy_hms(DATE_REPORTED, tz = "UTC"),  
         year = year(time)) %>% 
  filter(year %in% c(2016, 2017, 2018, 2019, 2020))

From the same data source, we included the 311 non-emergency service requests from 2016 to 2020.

complaints <- read_csv(here("data", "private", "Cincinnati_311__Non-Emergency__Service_Requests.csv"))
complaints <- complaints %>% 
  filter(is.na(LATITUDE) == FALSE & is.na(LONGITUDE) == FALSE) %>%
  st_as_sf(., coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  mutate(time = mdy_hms(REQUESTED_DATE, tz = "UTC"),  
         year = year(time)) %>% 
  filter(year %in% c(2016, 2017, 2018, 2019, 2020))

The connection between substance use and built environment variables (access to public restrooms, access to pharmacies, and driving distance to services, defined in our study as fast-food restaurants, gas stations and public parks) are also important. Public restrooms are associated with people who inject drugs (PWID) because many people use drugs in these spaces.Pharmacies represent an important access variable for several reasons. During the initial wave of the overdose crisis, pharmaceutical prescriptions, either legitimate, diverted, or potentially inappropriate, fed the opioid supply. In addition, naloxone (a medication to reverse an opioid overdose) is available at pharmacies without a prescription, although this provision may vary by neighborhood socio-demographic levels. All of these data were queried from OpenStreetMap and preprocessed in QGIS (fixing geometry, creating spatial indices, and unioning point with polygon data pieces)

# gas station
fuel <- st_read(here("data", "public", "fuel.geojson")) %>% st_transform("EPSG:3735")

# fast food restaurant 
fastfood <- st_read(here("data", "public", "fastfood.geojson")) %>% st_centroid() %>% st_transform("EPSG:3735")

# public parks
parks <- st_read(here("data", "public", "parks.geojson")) %>% st_transform("EPSG:3735") %>% st_centroid()

# pharmacy
pharmacy <- st_read(here("data","public", "pharmacy.geojson")) %>% st_transform("EPSG:3735") %>% st_centroid()

To identify demographic and sociographic characteristics corresponding to the EMS dataset, we utilized the ACS data for 2016 – 2020. We downloaded demographic information for each census tract, including total population, gender, adult population size, and race/ethnicity. The ACS data also included socioeconomic characteristics of each census tract, such as education, income, and poverty. For each census tract, we computed the population aged between 25 and 54 (since this is the age group most vulnerable to opioid overdose), gender ratio, minority popultion ratio, percentage poverty, and percentage of population with a bachelor’s degree. Note that the boundary of Cincinnati does not fully contain several census tracts and therefore we have tried our best to include tracts within Hamilton county that mostly fall within Cincinnati.

cincinnati20 <- get_acs(geography = "tract", 
          variables = c(
            "B01001_001E", # total population
            "B01001_002E", # total male
            "B01001_011E", # male 25-29
            "B01001_012E",
            "B01001_013E",
            "B01001_014E",
            "B01001_015E",
            "B01001_016E", # male 50-54
            "B01001_026E", # total female
            "B01001_035E", # female 25-29
            "B01001_036E",
            "B01001_037E",
            "B01001_038E",
            "B01001_039E",
            "B01001_040E", # female 50-54
            "B02001_002E", # white population
            "B02001_003E", # black population
            "B02001_005E", # asian population
            "B03002_012E", # latinx population
            "B19013_001E", # median household income
            "B06012_002E", # poverty
            "B06009_005E" # bachelor
            ), 
          year=2020, state="OH", county="Hamilton", 
          geometry=TRUE, output="wide") %>%
  st_transform("EPSG:3735")

cincinnati_tracts <- st_read(here("data", "public", "cincinnati_tracts.geojson")) %>% st_transform("EPSG:3735")

cincinnati_tracts <- cincinnati_tracts %>% st_drop_geometry() %>% dplyr::select(GEOID) %>% as.list(GEOID)

cincinnati20 <- cincinnati20 %>% 
  filter(GEOID %in% cincinnati_tracts$GEOID) %>% 
  rename(Totalpop = B01001_001E,
         MedHHInc = B19013_001E) %>% 
  mutate(pop25_54 = B01001_011E + B01001_012E + B01001_013E + B01001_014E + B01001_015E + B01001_016E + B01001_035E + B01001_036E + B01001_037E + B01001_038E+ B01001_039E + B01001_040E) %>% 
  mutate(MF_ratio = B01001_002E / B01001_026E) %>% 
  mutate(race_ratio = B02001_002E / (B02001_003E + B02001_005E + B03002_012E)) %>% 
  mutate(pctPoverty = B06012_002E / Totalpop) %>% 
  mutate(pctBachelor = B06009_005E / Totalpop) %>% 
  dplyr::select(pop25_54, Totalpop, MedHHInc, GEOID, MF_ratio, race_ratio, pctPoverty, pctBachelor)

The table below summarizes all of our predictor variables in this study:

Variable Name Source
Medical Service
Hospital Homeland Infrastructure Foundation-Level Data
Rehabilitation Center Substance Abuse and Mental Health Services Administration
Built Environment
Crime Rate Open Data Cincinnati
311 complaints Open Data Cincinnati
Gas station OSM
Public parks OSM
Fast food restaurants OSM
Pharmacies OSM
Socio-demographic Characteristics
Population aged between 25 and 54 ACS
Gender ratio ACS
White over minority ratio ACS
Population with a Bachelor’s degree ACS
Median household income ACS
Population under poverty line ACS
Median Household Income ACS

Poission Regression

As part of our exploratory analysis, we ran a poisson regression using only the number of crime incidents and the number of 311 complaints as our predictor variables. Since heroin overdose incident is not a phenomenon that varies across administrative units, but one varying smoothly across the landscape, we would need to aggregate these point-level data into a lattice grid of cells. The code below generates a 500 by 500 meters (820 feet) fishnet for Cincinnati to achieve this goal.

fishnet <- st_make_grid(cincinnati,
               cellsize = 820, 
               square = TRUE) %>%
  .[cincinnati] %>%           
  st_sf() %>%
  mutate(uniqueID = 1:n())

ggplot() +
  geom_sf(data=fishnet, color="black", fill="#FFF5EE") +
  labs(title = "Fishnet of Cincinnati") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

We would like to compare across both space and time, and so we divide the heroin dataset into one training set and two testing sets. The training set contains incidents from 2016 to 2018 while the testing sets contain incidents from 2019 and 2020 respectively. Since each individual year has well enough observations of heroin-overdose incidents, separating the testing set into 2019 and 2020 allows us to understand how COVID-19 plays a role in affecting drug overdose rate. We also separate the crime incident data and the 311 complaints data into respective year groups so that we could aggregate them into the fishnet.

heroinTrain <- heroin %>% 
  filter(year_column %in% c(2016, 2017, 2018))
heroin19 <- heroin %>% 
  filter(year_column == 2019)
heroin20 <- heroin %>% 
  filter(year_column == 2020)

crimeTrain <- crime %>% 
  filter(year %in% c(2016, 2017, 2018))
crime19 <- crime %>% 
  filter(year == 2019)
crime20 <- crime %>% 
  filter(year == 2020)

complaintTrain <- complaints %>% 
  filter(year %in% c(2016, 2017, 2018))
complaint19 <- complaints %>% 
  filter(year == 2019)
complaint20 <- complaints%>% 
  filter(year == 2020)

The following two code chunks perform all the necessary tasks to prepare our training and testing datasets by aggregating heroin-incidents, crime incidents, and 311 complaints all into the same spatial unit, in this case, the fishnet.

# add heroin in net
heroinTrain_net <- 
  dplyr::select(heroinTrain) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

# adding crime to net
heroinTrain_net <- heroinTrain_net %>% 
  st_join(crimeTrain, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroinTrain_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

# add 311 to net
heroinTrain_net <- heroinTrain_net %>% 
  st_join(complaintTrain, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroinTrain_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
# 2019 test sets
heroin19_net <- 
  dplyr::select(heroin19) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

heroin19_net <- heroin19_net %>% 
  st_join(crime19, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin19_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

heroin19_net <- heroin19_net %>% 
  st_join(complaint19, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin19_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))


# 2020 test set
heroin20_net <- 
  dplyr::select(heroin20) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

heroin20_net <- heroin20_net %>% 
  st_join(crime20, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin20_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

heroin20_net <- heroin20_net %>% 
  st_join(complaint20, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin20_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))

We performed some data visualization work to see if there are any differences between heroin-overdose incidents over year. We found that the distribution of these incidents are pretty consistent between 2016-2018 and 2019, with the majority of incidents significantly clustered in south Cincinnati. However, in 2020, despite that this overdose cluster still remain, overdose incidents became more disperse. We may see some small, emerging clusters of high overdose incidents scattered around the city.

plot1 <- ggplot() +
  geom_sf(data = heroinTrain_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot2 <- ggplot() +
  geom_sf(data = heroin19_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2019") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

 plot3 <- ggplot() +
  geom_sf(data = heroin20_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2020") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
 
plot1 + plot2 + plot3

We created heat maps of heroin-overdose incidents from 2016 to 2018 and overlayed crime incidents and 311 complaints above. Based on the visualization, we assume crime and 311 complaints to be strong predictors of overdose risk as places of high overdose incidents align perfectly with places with higher crime rate and more complaints. Yet, it is also pretty evident that crime incidents are correlated with 311 complaints.

plot1_1 <- ggplot() +
  geom_sf(data = cincinnati, fill = "black") +
  stat_density2d(data = data.frame(st_coordinates(heroinTrain)), 
                 aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "rocket", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  geom_sf(data = sample_n(crimeTrain, 3000), size = 0.03, colour = "white", alpha = 0.5) +
  labs(title = "Heroin OD and Crime 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot1_2 <- ggplot() +
  geom_sf(data = cincinnati, fill = "black") +
  stat_density2d(data = data.frame(st_coordinates(heroinTrain)), 
                 aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "rocket", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  geom_sf(data = sample_n(complaintTrain, 3000), size = 0.03, colour = "white", alpha = 0.5) +
  labs(title = "Heroin OD and Complaint 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot1_1 + plot1_2

Mapping the correlation between our predictor variables reveals that indeed, there’s a strong, linear, and positive relationship between heroin-overdose incidents and crime as well as 311 complaints, though the relationships become weaker in 2020. We suspect that disruptions in the drug supply chain, changes in law enforcement priorities, as well as increase social isolation and mental health challenges during the pandemic might have contribute to the decrease in correlation between overall crime incidents and heroin overdose incidents in 2020.

plot4 <- heroinTrain_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2016-18") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot5 <- heroin19_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2019") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))


plot6 <- heroin20_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2020") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot4 + plot5 + plot6

We ran the Poisson regression and confirmed that crime incidents and 311 complaints are good at predicting heroin-overdose risk. The model is slightly better when used to predict 2020 data, but the difference is minimal.

poissionTrain <- glm(countHeroin ~ Crimecount + Complaintscount, family = "poisson",
                      data = heroinTrain_net)

Prediction19 <-
  mutate(heroin19_net, Prediction = predict(poissionTrain, heroin19_net, type = "response")) %>% 
  mutate(Error = countHeroin - Prediction) %>% 
  mutate(MAE = mean(abs(Error))) %>% 
  mutate(SD_MAE = sd(Error)) %>% 
  mutate(prediction = "2019") %>% 
  slice(1:1)

Prediction20 <-
  mutate(heroin20_net, Prediction = predict(poissionTrain, heroin20_net, type = "response")) %>% 
  mutate(Error = countHeroin - Prediction) %>% 
  mutate(MAE = mean(abs(Error))) %>% 
  mutate(SD_MAE = sd(Error)) %>% 
  mutate(prediction = "2020") %>% 
  slice(1:1)

rbind(Prediction19, Prediction20) %>% 
  st_drop_geometry() %>% 
  group_by(prediction) %>% 
  summarize(Mean_MAE = round(MAE, 2),
              SD_MAE = round(SD_MAE, 2)) %>%
  kable(col.name=c("Prediction", "Mean Absolute Error",'Standard Deviation MAE')) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 
Prediction Mean Absolute Error Standard Deviation MAE
2019 0.96 1.40
2020 0.91 1.14

Here are our reflections up to this point:

While there’s significant associations between crime incidents, 311 complaints, and heroin-overdose incidents, building a model that relies solely on them to predict risk is simply not enough because substance abuse and overdose risks are complex phenomena influenced by various factors beyond just criminal activity. The geographical and environmental context, including the availability of treatment facilities, proximity to support networks, and neighborhood characteristics, also plays a crucial role in understanding overdose risks. Therefore, we must integrate more features showing built environment and local contexts into the model.

On top of that, Poisson regression is a traditional method used in geospatial risk model to predict the number (count) of discrete event, especially crime. However, in making decisions about how to alleviate the risk of particular medical or criminal incidence, we shouldn’t just look at the number of accidents at an intersection in a day at a very specific place. Rather, we should flag areas in a city that are more vulnerable/risky than others. It is not wrong to care to focus on the number of incidence, but from a policy-making perspective, learning whether or not a place is risky would be more efficient, helpful, and straightforward in guiding decision making. In other words, stakeholders and the general public do not have to know how many incidents happened at this place, but whether this place is risky and therefore requires more attention.

Getis Ord Spatial Hotspot

We adapted this research workflow from Srinivasan, et al. (2023), who studied the risk factors for opioid overdose in Massachusetts. In its essence, Srinivasan, et al used principal component analysis to reduce colinerarity between predictor variables and ran a logistic regression to predict whether places in Massachusetts are within overdose hotspot or not.

To replicate their approach as much as possible, we first examined the spatial distribution of heroin-related incident rates using the Jenks natural breaks maps. The Jenks natural breaks maps use a nonlinear algorithm to create groups where within-group similarity is maximized, and between-group similarity is minimized. The map shows at least one significant incidence cluster in south Cincinnati and several other smaller clusters surrounding that hotspot.

heroin_net <- 
  dplyr::select(heroin) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))
breaks <- classIntervals(heroin_net$countHeroin, n = 5, style = "jenks")

ggplot() +
  geom_sf(data = heroin_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", breaks = breaks$brks, name = "Count")  + 
labs(title = "Count of Heroin Overdose for the Fishnet") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, linewidth = 0.8)
  ) + 
  theme(legend.key.size = unit(0.8, 'cm'), legend.text = element_text(size=6))

We validated our observation by checking the global moran’s I under the hypothesis that heroin incidences are randomly distributed across space. The figure below plot the frequency of 999 randomly permutated I under Monte Carlo simulation, which shows that our observed I is much higher than all of the randomly generated I’s. This validates the existence of spatial autocorrelation.

coords <-  st_coordinates(heroin_net %>% st_centroid()) 
neighborList <- knn2nb(knearneigh(coords, 4))
spatialWeights <- nb2listw(neighborList, style="W")

moranTest <- moran.mc(heroin_net$countHeroin, 
                      spatialWeights, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01, fill = "black") +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
  labs(title = "Observed and Permuted Moran's I",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10))

Since the global spatial clustering analysis yields only one statistic to describe the overall point pattern across the whole study area, we have to use the Getis-Ord Gi* hotspot statistic to identify and map clusters (i.e., grids with high overdose rates surrounded by grids with high rates). Below, each hotspot identified is statistically significant at the p<0.1 level. For estimating the clusters, we used the queen’s criterion to define our spatial weights matrix.

neigh_nbs <- heroin_net %>% 
  mutate(
    nb = st_contiguity(geometry),  # neighbors share border    
    wt = st_weights(nb), # row-standardized weights              
    neigh_lag = st_lag(countHeroin, nb, wt)  
  )

gi_hot_spots <- neigh_nbs %>% 
  mutate(Gi = local_g_perm(countHeroin, nb, wt, nsim = 999)) %>% 
  unnest(Gi) 

gi_hot_spots <- gi_hot_spots %>%  
  dplyr::select(gi, p_folded_sim, uniqueID) |> 
  mutate(
    classification = case_when(
      # Classify based on the following criteria:
      gi > 0 & p_folded_sim <= 0.01 ~ "Very hot",
      gi > 0 & p_folded_sim <= 0.05 ~ "Hot",
      gi > 0 & p_folded_sim <= 0.1 ~ "Somewhat hot",
      gi < 0 & p_folded_sim <= 0.01 ~ "Very cold",
      gi < 0 & p_folded_sim <= 0.05 ~ "Cold",
      gi < 0 & p_folded_sim <= 0.1 ~ "Somewhat cold",
      TRUE ~ "Insignificant"
    ),    # Convert 'classification' into a factor for easier plotting
    classification = factor(
      classification,
      levels = c("Very hot", "Hot", "Somewhat hot",
                 "Insignificant",
                 "Somewhat cold", "Cold", "Very cold")
    )
  )

gi_hot_spots %>% 
  ggplot() + 
  geom_sf(aes(fill = classification), color = "black", lwd = 0.1) +
  scale_fill_brewer(type = "div", palette = 6, name = "Classification") +
  labs(title = "Spatial Hotspots and Coldspots of Heroin Overdose Incidence") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

heroin_net <- gi_hot_spots %>% 
  mutate(hotspot = ifelse(classification %in% c("Very hot", "Hot", "Somewhat hot"), 1, 0)) %>% 
  st_drop_geometry() %>% 
  left_join(heroin_net, ., by = "uniqueID")

The cluster map shows seven types of spatial association determined based on heroin-related overdose incident rate within a grid and its neighboring grids. If a grid has a high heroin incident rate with neighboring grids of high heroin incident rates, it was determined as a hot spot. At the same scheme, if a grid shows a low heroin incident rate with neighboring grids of low heroin incident rates, it was classified as a cold spot. Hotspots with p<0.01 are considered as “Very hot”, with p<0.05 are considered as “Hot”, with p<0.1 are consiered as “Somewhat hot”. All grids that fall into these three categories received a 1 for hotspot, otherwise, it gets a zero.

Logistic Regression

Built Environment Features

We aggregated built environment features in addition to crime and 311 complaints into the fishnet of Cincinnati. For crime and complaints, we counted their number in each grid. For hospitals, rehabilitation centers, pharmacies, fast food restaurants, gas stations, and public parks, we calculated the average distance from the center of each grid to these locations. For demographic variables, we spatially joined tract level into the grids depending on which census tract each centroid of the grid fall into.

# adding crime to net
heroin_net <- heroin_net %>% 
  st_join(crime, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

# add 311 to net
heroin_net <- heroin_net %>% 
  st_join(complaints, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))

net_centroid <- st_centroid(heroin_net)

# add hospitals
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(hospital.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(hospitals), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(hospital.nn, uniqueID), 
            by = "uniqueID")

# add rehab center
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(rehab.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(rehab), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(rehab.nn, uniqueID), 
            by = "uniqueID")

# add pharmacy
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(pharm.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(pharmacy), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(pharm.nn, uniqueID), 
            by = "uniqueID")


# add gas station
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(fuel.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(fuel), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(fuel.nn, uniqueID), 
            by = "uniqueID")

# add fast food restaurant
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(fast.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(fastfood), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(fast.nn, uniqueID), 
            by = "uniqueID")

# add parks
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(parks.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(parks), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(parks.nn, uniqueID), 
            by = "uniqueID")


# add demographic vars
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              dplyr::select(uniqueID) %>% 
              st_intersection(cincinnati20) %>% 
              st_drop_geometry() %>% dplyr::select(-GEOID), by = "uniqueID") %>% 
  filter(is.na(pop25_54) == FALSE)

Some exploratory data analysis were performed, including computing the correlation between all of our predictor variables and heroin incidence. The figure below shows that all of our predictor variables are correlated with heroin-overdose incidence, among which all of our nearest distance variables are negatively correlated with incidence. This indicates that the further away we are from fast food restaurants and gas stations, for example, the less number of heroin-overdose incidence in that area. The relationship is the strongest for crime incidence and weakest for population aged between 25-54.

heroin_net_long <- heroin_net%>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID, gi, p_folded_sim, classification, hotspot, Totalpop)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") 

correlation.cor <-
  heroin_net_long %>%
    group_by(Type) %>%
    summarize(correlation = cor(Number, countHeroin, use = "complete.obs"))

heroin_net_long %>%
  ggplot(aes(x= Number, y = countHeroin)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1, size=3) +
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Complaintscount` = "Complaints",
    `Crimecount` = "Crime",
    `fast.nn` = "Distance to Fast Food",
    `fuel.nn` = "Distance to Gas Stations",
    `hospital.nn` = "Distance to Hospital",
    `MedHHInc` = "Median Household Income",
    `MF_ratio` = "Gender Ratio",
    `parks.nn` = "Distance to Parks",
    `pctBachelor` = "Percent Bachelor's Degree",
    `pctPoverty` = "Poverty Rate",
    `pharm.nn` = "Distance to Pharmacy",
    `pop25_54` = "Population 25_54", 
    `rehab.nn` = "Distance to Rehab Center",
    `race_ratio` = "Racial Majority over Minority"))) +
  labs(title = "Scatter Plots of All Predictor Variables") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))

The figure below shows the difference in the number of hotspots or cold spots based on each of our predicting variable and we may see that there seems to be a significant difference in all scenarios.

anova_dataset <- heroin_net %>% st_drop_geometry() %>% 
                    dplyr::select(hotspot, Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor)

anova_dataset %>%
  gather(Variable, value, -hotspot) %>%
    ggplot(aes(as.character(hotspot), value, fill=as.character(hotspot))) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free", ncol = 5, labeller= labeller(Variable = c(
        `Complaintscount` = "Complaints",
        `Crimecount` = "Crime",
        `fast.nn` = "Distance to Fast Food",
        `fuel.nn` = "Distance to Gas Stations",
    `hospital.nn` = "Distance to Hospital",
    `MedHHInc` = "Median Household Income",
    `MF_ratio` = "Gender Ratio",
    `parks.nn` = "Distance to Parks",
    `pctBachelor` = "Percent Bachelor's Degree",
    `pctPoverty` = "Poverty Rate",
    `pharm.nn` = "Distance to Pharmacy",
    `pop25_54` = "Population 25_54", 
    `rehab.nn` = "Distance to Rehab Center",
    `race_ratio` = "Racial Majority over Minority"))) +
      scale_fill_manual(values = c("#000004",  "#BB3754")) +
      labs(x="Hotspot", y="Value", 
           title = "Feature Associations with the Likelihood of Overdose Hotspot") +
      theme(legend.position = "none") +
    theme(plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 18, face = "bold"), 
        axis.text.x=element_text(size=10),
        axis.text.y=element_text(size=10), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

We further conducted a series of ANOVA tests for all predictor variables to validate if there’s a significant difference in mean number of crimes, for example, between cold spot and hot spot. The result is in concur with our expectations: all variables are significant predictors.

anova_var <- c("Crimecount", "Complaintscount", "pharm.nn", "hospital.nn", "fast.nn", "fuel.nn", "rehab.nn", "parks.nn", "pop25_54", "MedHHInc", "MF_ratio", "race_ratio", "pctPoverty", "pctBachelor")

anova_results <- data.frame(
                            Df = integer(), 
                            Sum_Sq = numeric(), 
                            Mean_Sq = numeric(), 
                            F_value = numeric(), 
                            P_Value = numeric(), stringsAsFactors = FALSE)

new_names <- c("Crimecount" = "Crime",
                "Complaintscount" = "Complaints",
                "pharm.nn" = "Distance to Pharmacy",
                "hospital.nn" = "Distance to Hospital",
                "fast.nn" = "Distance to Fast Food",
                "fuel.nn" = "Distance to Gas Stations",
                "rehab.nn" = "Distance to Rehab Center", 
                "parks.nn" = "Distance to Parks", 
                "pop25_54" = "Population 25_54", 
                "MedHHInc" = "Median Household Income",
                "MF_ratio" = "Gender Ratio",
                "race_ratio" = "Racial Majority over Minority",
                "pctPoverty" = "Poverty Rate", 
                "pctBachelor" = "Percent Bachelor's Degree"
               )

for (var in anova_var) {
  anova_result <- aov(anova_dataset[[var]] ~ anova_dataset$hotspot)
  summary_data <- summary(anova_result)[[1]][, c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")][1:1, ]
  anova_results <- rbind(anova_results, summary_data)
}

rownames(anova_results) <- new_names

anova_results %>% 
 kable() %>% 
 kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
    footnote(general_title = "\n")
Df Sum Sq Mean Sq F value Pr(>F)
Crime 1 1.052003e+06 1.052003e+06 650.50625 0.0000000
Complaints 1 1.375003e+07 1.375003e+07 474.07239 0.0000000
Distance to Pharmacy 1 3.429349e+08 3.429349e+08 149.89597 0.0000000
Distance to Hospital 1 1.091828e+08 1.091828e+08 34.67735 0.0000000
Distance to Fast Food 1 2.791552e+08 2.791552e+08 251.76801 0.0000000
Distance to Gas Stations 1 7.760263e+07 7.760263e+07 95.70131 0.0000000
Distance to Rehab Center 1 1.684621e+09 1.684621e+09 232.82620 0.0000000
Distance to Parks 1 7.582010e+07 7.582010e+07 199.87377 0.0000000
Population 25_54 1 2.719819e+07 2.719819e+07 63.34770 0.0000000
Median Household Income 1 5.935462e+10 5.935462e+10 57.82477 0.0000000
Gender Ratio 1 4.562751e+00 4.562751e+00 115.01797 0.0000000
Racial Majority over Minority 1 1.549411e+03 1.549411e+03 55.81814 0.0000000
Poverty Rate 1 3.966738e+00 3.966738e+00 150.92897 0.0000000
Percent Bachelor’s Degree 1 1.328479e-01 1.328479e-01 14.58168 0.0001366

Simple Logit

Considering that all predictor variables used in the anova test appear to be significant, we included all of them in a kitchen-sink logistic regression. The model shows that in this scenario, number of crimes, number of complaints, distance to hospital, distance to fast food restaurant, gender ratio, and distance to public parks are highly significant predictors of heroin-overdose hotspot whereas other predictors become insignificant.

kitchensink <- glm(hotspot ~ .,
                  data=heroin_net %>% 
                    st_drop_geometry() %>% 
                    dplyr::select(hotspot, Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor), family="binomial" (link="logit"))


kitchen_sum <- summary(kitchensink)
coefficients_table <- as.data.frame(kitchen_sum$coefficients)

coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|z|)` < 0.001, '***',
                                         ifelse(coefficients_table$`Pr(>|z|)` < 0.01, '**',
                                                ifelse(coefficients_table$`Pr(>|z|)` < 0.05, '*',
                                                       ifelse(coefficients_table$`Pr(>|z|)` < 0.1, '.', ''))))

coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|z|)`, digits = 3), coefficients_table$significance)

coefficients_table %>%
  dplyr::select(-significance, -`Pr(>|z|)`) %>% 
  kable(align = "r") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))  %>%
  footnote(general_title = "\n")
Estimate Std. Error z value p_value
(Intercept) -1.4654205 0.6035523 -2.4279925 0.015*
Crimecount 0.0075275 0.0014618 5.1493408 0***
Complaintscount 0.0024300 0.0003841 6.3269221 0***
pharm.nn -0.0002075 0.0000910 -2.2803200 0.023*
hospital.nn 0.0004531 0.0000833 5.4404140 0***
fast.nn -0.0020144 0.0002101 -9.5865098 0***
fuel.nn -0.0000786 0.0001555 -0.5053237 0.613
rehab.nn -0.0000281 0.0000614 -0.4574004 0.647
parks.nn -0.0014241 0.0002891 -4.9260257 0***
pop25_54 -0.0001910 0.0001751 -1.0908460 0.275
MedHHInc -0.0000089 0.0000082 -1.0929448 0.274
MF_ratio 1.2936016 0.3804092 3.4005528 0.001***
race_ratio -0.1054348 0.0481930 -2.1877625 0.029*
pctPoverty 1.1738101 0.7640073 1.5363858 0.124
pctBachelor 0.0356627 1.5429941 0.0231126 0.982

Logit with CV

We would like to perform cross validation on this model to measures its overall performance and robustness. We decide to modify a cross validation function that was originally built on cases using Poission regression by making it suitable for use in our scenario (logistic regression). This function takes in four parameters: the dataset, a cross validation ID, which it could use to separate the dataset into training and testing, a list of dependent variables, and a independent variable. In our case, we use a randomly generated cvID associated with each grid cell for random k-fold cross validation. Each group of grids with the same cvID gets hold out during one iteration.

logitCV <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])

  for (i in cvID_list) {
    
    thisFold <- i

    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    
    form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
    form <- as.formula(form_parts)
    regression <- glm(form, data = fold.train %>%
                        dplyr::select(-geometry, -id),  family="binomial" (link="logit"))
    
    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

Here, another helper function is written to visualize the confusion matrix.

draw_confusion_matrix <- function(cm) {

  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

  # create the matrix 
  rect(150, 430, 240, 370, col='black')
  text(195, 435, '0', cex=1.2)
  rect(250, 430, 340, 370, col="#BB3754")
  text(295, 435, '1', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col="#BB3754")
  rect(250, 305, 340, 365, col='black')
  text(140, 400, '0', cex=1.2, srt=90)
  text(140, 335, '1', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  

The cross validation model is visualized below, from which we see that our model has a high accuracy rate of 0.936 and a moderate sensitivity rate of 0.727. Sensitivity is the true-positive rate, which is the proportion of actual hotspot grids that were correctly identified by the model. Given this model’s high accuracy rate, this means that our model is missing out a few heorin-overdose hotspots.

However, note that in Cincinnati, the number of grids that are classified as hotspot is much smaller than the number of grids that are classified as coldspot. While this is a good news for Cincinnati, logistic regression can be sensitive to class imbalance, meaning that the model might be biased towards the majority class, leading to poor performance on the minority class.

reg.vars <- c("Crimecount", "Complaintscount", "pharm.nn", "hospital.nn", "fast.nn", "fuel.nn", "rehab.nn", "parks.nn", "pop25_54", "MedHHInc", "MF_ratio", "race_ratio", "pctPoverty", "pctBachelor")


kitchensink.logit.cv <- logitCV (
  dataset = heroin_net,
  id = "cvID",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, hotspot, Prediction, geometry)

kitchensink.logit.cv <- kitchensink.logit.cv %>% 
  mutate(PredictionCat  = as.factor(ifelse(kitchensink.logit.cv$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))


cm_basic <- caret::confusionMatrix(kitchensink.logit.cv$hotspot, kitchensink.logit.cv$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_basic)

Principal Component Analysis (PCA)

One of the reasons why each individual predictor variable is correlated with overdose hotspot but when combined in a regression, some of them become insignificant is because of multicollinearity. This commonly occurs in regression analysis when predictor variables are highly correlated. It won’t be surprise to see that many of those demographic variables are highly correlated with each other. It is possible that distance to hospitals is correlated with distance to parks, given that there are more those kind of services in downtown areas. Srinivasan, et al. (2023)’s approach to this issue is to use Principal Component Analysis (PCA) as a preprocessing step before running the regression.

PC Identification

PCA is a dimension reduction technique that is commonly used to reduce the number of variables used in analyses while preserving the information from them. The prcomp function below is used to identify the principal components among all 14 of our predictor variables.

pca_heroin <- heroin_net %>% 
  st_drop_geometry() %>% 
  dplyr::select(Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor)

pca_heroin <- na.omit(pca_heroin)
pc <- prcomp(pca_heroin,
             center = TRUE,
            scale. = TRUE)

The scree plot is used to visualize the importance of each principal component and can be used to determine the number of principal components to retain. The y axis is eigenvalues, which essentially stand for the amount of variation. Here, we may see that the first six, or seven components are capturing the majority of variability. We can therefore, just use the first seven principal components and ignore the rest.

fviz_eig(pc, addlabels = TRUE, barfill = "#BB3754", barcolor = "transparent", xlab = "Principal Components", choice = "eigenvalue")

The code above computed the square cosine value for each variable with respect to the first seven principal components. From the illustration below, gender ratio, population aged between 25-54, distance to parks, and poverty rate are top four variables with the highest cos2, hence contributing the most to the top seven principal components.

fviz_cos2(pc, choice = "var", axes = 1:7) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_bar(stat = "identity", fill = "#BB3754", color = "transparent")

We also did a biplot based on our principal component analysis, which helps us to visualize the similarities and dissimilarities between the samples, and further understand the impact of each attribute on each of the principal components. Here, the length of the arrow represents the strength of the contribution of the corresponding variable to the specific principal component. The color of the arrow represents the relative contribution of the variable to the principal component. Darker colors typically indicate larger contributions. For variables with long arrow, but light color, such as percentage of population with a Bachelor’s degree and percentage poverty, their contributions were less significant compared to other variables, and might only contribute significantly to one PC. For variables with short arrow but dark color, such as 311 complaints and crime, they have relatively strong contributions to all PC. Also, the direction of those arrows indicate that crimes and complaints have positive contributions to hotspots.

fviz_pca_var(pc, col.var = "cos2", repel = TRUE, 
             arrowsize = 1,  ggtheme = theme_minimal(), gradient.cols = c("black", "#BB3754"),) + ggtitle("Biplot Combind with Contribution")

Below, we computed the first seven principal components based on the loadings of each of our predictor variables in that component.

pca_net <- heroin_net %>% 
  mutate(pc1 = Crimecount * pc$rotation[1] + Complaintscount * pc$rotation[2] + pharm.nn * pc$rotation[3] + hospital.nn * pc$rotation[4] + fast.nn * pc$rotation[5] +  fuel.nn * pc$rotation[6] + rehab.nn * pc$rotation[7] + parks.nn *  pc$rotation[8] + pop25_54 * pc$rotation[9] + MedHHInc * pc$rotation[10] + MF_ratio * pc$rotation[11] + race_ratio * pc$rotation[12] +  pctPoverty * pc$rotation[13] + pctBachelor * pc$rotation[14]) %>% 
  mutate(pc2 = Crimecount * pc$rotation[15] + Complaintscount * pc$rotation[16] + pharm.nn * pc$rotation[17] + hospital.nn * pc$rotation[18] + fast.nn * pc$rotation[19] +  fuel.nn * pc$rotation[20] + rehab.nn * pc$rotation[21] + parks.nn *  pc$rotation[22] + pop25_54 * pc$rotation[23] + MedHHInc * pc$rotation[24] + MF_ratio * pc$rotation[25] + race_ratio * pc$rotation[26] +  pctPoverty * pc$rotation[27] + pctBachelor * pc$rotation[28]) %>%
  mutate(pc3 = Crimecount * pc$rotation[29] + Complaintscount * pc$rotation[30] + pharm.nn * pc$rotation[31] + hospital.nn * pc$rotation[32] + fast.nn * pc$rotation[33] +  fuel.nn * pc$rotation[34] + rehab.nn * pc$rotation[35] + parks.nn *  pc$rotation[36] + pop25_54 * pc$rotation[37] + MedHHInc * pc$rotation[38] + MF_ratio * pc$rotation[39] + race_ratio * pc$rotation[40] +  pctPoverty * pc$rotation[41] + pctBachelor * pc$rotation[42]) %>%
  mutate(pc4 = Crimecount * pc$rotation[43] + Complaintscount * pc$rotation[44] + pharm.nn * pc$rotation[45] + hospital.nn * pc$rotation[46] + fast.nn * pc$rotation[47] +  fuel.nn * pc$rotation[48] + rehab.nn * pc$rotation[49] + parks.nn *  pc$rotation[50] + pop25_54 * pc$rotation[51] + MedHHInc * pc$rotation[52] + MF_ratio * pc$rotation[53] + race_ratio * pc$rotation[54] +  pctPoverty * pc$rotation[55] + pctBachelor * pc$rotation[56]) %>%
  mutate(pc5 = Crimecount * pc$rotation[57] + Complaintscount * pc$rotation[58] + pharm.nn * pc$rotation[59] + hospital.nn * pc$rotation[60] + fast.nn * pc$rotation[61] +  fuel.nn * pc$rotation[62] + rehab.nn * pc$rotation[63] + parks.nn *  pc$rotation[64] + pop25_54 * pc$rotation[65] + MedHHInc * pc$rotation[66] + MF_ratio * pc$rotation[67] + race_ratio * pc$rotation[68] +  pctPoverty * pc$rotation[69] + pctBachelor * pc$rotation[70]) %>%
  mutate(pc6 = Crimecount * pc$rotation[71] + Complaintscount * pc$rotation[72] + pharm.nn * pc$rotation[73] + hospital.nn * pc$rotation[74] + fast.nn * pc$rotation[75] +  fuel.nn * pc$rotation[76] + rehab.nn * pc$rotation[77] + parks.nn *  pc$rotation[78] + pop25_54 * pc$rotation[79] + MedHHInc * pc$rotation[80] + MF_ratio * pc$rotation[81] + race_ratio * pc$rotation[82] +  pctPoverty * pc$rotation[83] + pctBachelor * pc$rotation[84]) %>%
  mutate(pc7 = Crimecount * pc$rotation[85] + Complaintscount * pc$rotation[86] + pharm.nn * pc$rotation[87] + hospital.nn * pc$rotation[88] + fast.nn * pc$rotation[89] +  fuel.nn * pc$rotation[90] + rehab.nn * pc$rotation[91] + parks.nn *  pc$rotation[92] + pop25_54 * pc$rotation[93] + MedHHInc * pc$rotation[94] + MF_ratio * pc$rotation[95] + race_ratio * pc$rotation[96] +  pctPoverty * pc$rotation[97] + pctBachelor * pc$rotation[98])

PCA Logistic Regression

Following Srinivasan, et al. (2023)’s method, now we can estimated this logistic regression to predict if a grid was in a hotspot (0/1) as identified by the Getis-Ord Gi* statistic using components derived from the PCA as the explanatory variables. For this model using the principal components as predictors, we still get a accuracy rate of 0.928. Our sensitivity rate dropped a little bit to 0.66 compared to the kitchen sink model, probably because we addressed the issue of multicolinearity, but that is still considered as a moderate sensitivity.

reg.vars <- c("pc1", "pc2", "pc3", "pc4", "pc5", "pc6", "pc7")

pca.logit.cv <- logitCV (
  dataset = pca_net,
  id = "cvID",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, hotspot, Prediction, geometry)

pca.logit.cv <- pca.logit.cv %>% 
  mutate(PredictionCat  = as.factor(ifelse(pca.logit.cv$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))

cm_pca <- caret::confusionMatrix(pca.logit.cv$hotspot, pca.logit.cv$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_pca)

In addition to cross-validating using random cvID, we also tried the spatial ‘Leave-one-group-out’ cross-validation (LOGO-CV) approach, during which we hold out one neighborhood, train the model on the remaining n-1 areas, predict for the hold out, and record the goodness of fit. The result shows that our model has high accuracy, but the sensitivity further decreases. One possible explanation here is that different neighborhoods may have distinct characteristics or patterns. If the model is trained on a set of neighborhoods and then tested on a neighborhood with different characteristics, its performance may be negatively affected. Random cross-validation IDs might not capture these spatial variations as explicitly, but further investigation is required here.

neighborhood <- st_read(here("data", "public", "neighborhood.geojson")) %>% st_transform("EPSG:3735") %>% st_zm()
pca_net <- pca_net %>%
  left_join(net_centroid %>% dplyr::select(uniqueID) %>% 
               st_intersection(neighborhood %>% dplyr::select(SNA_NAME)) %>% 
              st_drop_geometry(), by = "uniqueID") %>% 
  mutate(SNA_NAME = ifelse(is.na(SNA_NAME), "NOT APPLICABLE", SNA_NAME))


pca.logit.cv.neigh <- logitCV (
  dataset = pca_net,
  id = "SNA_NAME",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(SNA_NAME = SNA_NAME, hotspot, Prediction, geometry)

pca.logit.cv.neigh <- pca.logit.cv.neigh %>% 
  mutate(PredictionCat  = as.factor(ifelse(pca.logit.cv.neigh$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))

cm_pca_neigh <- caret::confusionMatrix(pca.logit.cv.neigh$hotspot, pca.logit.cv.neigh$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_pca_neigh)

We computed the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) metric for our model to evaluate their performance. The AUC value ranges from 0 to 1, where 0.5 means a model with no discriminatory power (it performs as well as random chance), and 1 means a perfect model (it perfectly distinguishes between positive and negative cases). Generally, an AUC over 0.8 indicates a good performance of the model. For the random K-fold cross validation, we get a AUC score of 0.904 and for the Spatial LOGO-CV, we get a AUC score of 0.8628. Both are good scores and indicate that model is capturing important patterns in the data and making predictions better than random guessing.

plot7 <- ggplot(pca.logit.cv, aes(d = as.numeric(hotspot), m = Prediction)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "black") +
    labs(title = "ROC Curve for Logistic Regression Model using PCA", 
         subtitle = "Random Cross Validation") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#BB3754") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot8 <- ggplot(pca.logit.cv.neigh, aes(d = as.numeric(hotspot), m = Prediction)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "black") +
    labs(title = "ROC Curve for Logistic Regression Model using PCA", 
         subtitle = "Spatial LOGO Cross Validation") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#BB3754") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot7 + plot8

plot9 <- pca.logit.cv %>%
  filter(is.na(Prediction) == FALSE) %>% 
   mutate(result = case_when(
   PredictionCat==0 & hotspot==0 ~ "Accurate",
   PredictionCat==1 & hotspot==1 ~"Accurate",
   PredictionCat==0 & hotspot==1 ~"Problematic",
   TRUE ~ "Inaccurate"
  )) %>% 
  ggplot() +
  geom_sf(aes(fill=as.character(hotspot)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Actual Hotspot") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
  

plot10 <- pca.logit.cv %>%
  ggplot() +
  geom_sf(aes(fill=as.character(PredictionCat)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Predicted Hotspot", 
         subtitle = "Random Cross Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot11 <- pca.logit.cv.neigh %>%
  ggplot() +
  geom_sf(aes(fill=as.character(PredictionCat)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Predicted Hotspot", 
         subtitle = "Spatial LOGO Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
  

plot12 <- pca.logit.cv %>%
  filter(is.na(Prediction) == FALSE) %>% 
   mutate(result = case_when(
   PredictionCat==0 & hotspot==0 ~ "Accurate",
   PredictionCat==1 & hotspot==1 ~"Accurate",
   PredictionCat==0 & hotspot==1 ~"Problematic",
   TRUE ~ "Inaccurate"
  )) %>% 
  ggplot() +
  geom_sf(aes(fill=as.character(result)), color="transparent")+
  scale_fill_manual(values = c("black", "#FFC000", "#BB3754"), name = "Hotspot") +
  labs(title = "Mapping Error", 
         subtitle = "Random Cross Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot9 + plot10 + plot11 + plot12

If our model has generally been good at making prediction (high accuracy) but with some uncertainties in correctly locating the hotspot, then we would like to know where it is making mistakes. We mapped the error below and compared it to the actual hotspot. What we can see is that our model is accurate in roughly estimating the location of the hotspot, but tends to make mistakes when deciding which of these grids in particular should be categorized as hotspot.

We highlighted those grids which are actual hotspot but were wrongly predicted as problematic, from which we may see that those most of those problematic grids pretty much neighbors hotspot grids. This implies that when drawing conclusion from our model, we should not only look at specific hotspot grid, but also probably pay attention to grids nearby.

KDE Validation

The accuracy of our model is also compared relative to kernel density estimation. Kernel density works by centering a smooth kernel, or curve, atop each crime point such that the curve is at its highest directly over the point and the lowest at the range of a circular search radius. The code block below creates a Kernel density map of robbery with a 1000 foot search radius.

heroin_ppp <- as.ppp(st_coordinates(heroin), W = st_bbox(heroin_net))
heroin_KD.1000 <- spatstat.explore::density.ppp(heroin_ppp, 1000)
heroin_KD.df <- data.frame(rasterToPoints(mask(raster(heroin_KD.1000), as(cincinnati20, 'Spatial'))))


ggplot(data=heroin_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  coord_sf(crs=st_crs(heroin_net)) + 
  scale_fill_viridis(option = "rocket", name="Density") +
  labs(title = "Kernel Density of Heroin 1000ft Radii") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

The kernel density estimation was classified into two categories, lower risk and higher risk, matching the binary of hotspot and coldspot. We used the Fisher's method that involves finding intervals such that the variance within each interval is minimized while maximizing the variance between the intervals. The purpose here is to see if the our model capture the same hotspot grids as the kernel density. The map below shows that our model perform as good as, if not worse, the kernal density approach in capturing hotspots (high risk areas).

heroin_KDE_sum <- as.data.frame(heroin_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(heroin_net)) %>%
  aggregate(., heroin_net, mean) 

kde_breaks <- classIntervals(heroin_KDE_sum$value, 
                             n = 2, "fisher")

heroin_KDE_sf <- heroin_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 2 ~ "High",
           Risk_Category == 1 ~ "Low"))

heroin_risk_sf <-
  pca.logit.cv %>%
  mutate(label = "Logistic Regression",
         Risk_Category = case_when(
           PredictionCat == 1 ~ "High",
           PredictionCat == 0 ~ "Low")) %>% 
  dplyr::select(PredictionCat, label, Risk_Category) %>% 
  rename(value = PredictionCat)


rbind(heroin_KDE_sf, heroin_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    facet_wrap(~label, ) +
    scale_fill_viridis(option = "rocket", discrete = TRUE, name = "Risk Category") +
    labs(title="Comparison of Kernel Density and Risk Predictions") + 
  theme(legend.position="bottom",
    axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )

Multi Criteria Analysis

We would like to conclude our analysis by building multi-criteria-analysis-centered geospatial risk terrain model that stakeholders and the public can use to learn about the community vulnerability to heroin-overdose in Cincinnati. In GIS, a Multi-Criteria Analysis (MCA) model is a decision-making approach that involves evaluating and comparing multiple criteria or factors to make informed spatial decisions. The criteria we use here are the predictor variables we use in our model. These criteria are standardized to ensure that all of them are on a scale of 0 to 1. Then, some of those scaled variables are reversed if necessary based on whether they contribute positively or negatively to the heroin-overdose risk. We supply a weight on each of those variables based on their contribution. Finally, a risk score is computed for each grid.

scale_values <- function(x){(x-min(x))/(max(x)-min(x))}

risk_score <- heroin_net %>% 
  mutate(MedHHInc = ifelse(is.na(MedHHInc), 0, MedHHInc)) %>% 
  mutate(scl_crime = scale_values(Crimecount),
         scl_complaints = scale_values(Complaintscount),
         scl_pharm = scale_values(pharm.nn),
         scl_hospital = scale_values(hospital.nn),
         scl_rehab = scale_values(rehab.nn),
         scl_park = scale_values(parks.nn),
         scl_fuel = scale_values(fuel.nn),
         scl_fast = scale_values(fast.nn),
         scl_pop = scale_values(pop25_54),
         scl_income = scale_values(MedHHInc),
         scl_gender = scale_values(MF_ratio),
         scl_race = scale_values(race_ratio),
         scl_poverty = scale_values(pctPoverty),
         scl_edu = scale_values(pctBachelor)) %>% 
  mutate(scl_race_re = 0 - scl_race + 1,
         scl_edu_re = 0 - scl_edu + 1,
         scl_income_re = 0 - scl_income + 1, 
         scl_fuel_re = 0 - scl_fuel + 1,
         scl_fast_re = 0 - scl_fast + 1,
         scl_park_re = 0 - scl_park + 1,
         scl_pharm_re = 0 - scl_pharm + 1) %>% 
  mutate(score = 0.1 * (scl_hospital + scl_crime + scl_complaints +  scl_gender + scl_income_re + scl_fuel_re + scl_fast_re +  scl_park_re) + 0.05 * (scl_race_re +  scl_pharm_re) + 0.025*(scl_pop +scl_edu_re + scl_poverty +  scl_rehab)) %>% 
  st_as_sf()
  

risk_score <- risk_score %>%
  left_join(net_centroid %>% dplyr::select(uniqueID) %>% 
               st_intersection(neighborhood %>% dplyr::select(SNA_NAME)) %>% 
              st_drop_geometry(), by = "uniqueID") %>% 
  mutate(SNA_NAME = ifelse(is.na(SNA_NAME), "NOT APPLICABLE", SNA_NAME))

This below interactive leaflet map shows the risk score of each grid, the neighborhood that grid fall into, and whether or not it falls into a hotspot that our model predicts. We hope that this will serve as our first step towards building up our use case that present stakeholders with a clearer picture of community vulnerability to heroin overdose and that help the public to have a better understanding of the severity of heroin-overdose in Cincinnati as well as existing rehabilitation resources.

colors <- rev(viridis::viridis(6, option = "rocket"))

pal <- colorBin(palette = colors, 
                domain = risk_score$score, 
                bins = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7)) 

labs <- paste0("<strong>", risk_score$SNA_NAME,
               "</strong><br/>Heorin-Overdose Risk Score: ", risk_score$score
               ) 

leaflet <- risk_score %>%
  st_transform("EPSG:4326") %>% 
  leaflet() %>%
  setView(lng = -84.51, lat = 39.10, zoom = 10.5) %>%
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(weight = 1,
              fillColor = ~pal(score), # change the name of the parameter in the bracket
              popup = labs,
              color = "transparent",
              group = "Risk Score", 
              fillOpacity = 1) %>%
  addPolygons(
    data = pca.logit.cv %>% filter(PredictionCat == 1) %>% st_transform("EPSG:4326"), 
    weight = 2,
    group = "Overdose Hotspot", 
    color = "yellow") %>% 
  addLayersControl(
    overlayGroups = c("Risk Score", "Overdose Hotspot"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  addLegend(position = "bottomright",
            pal = pal,
            values = ~risk_score, # change the name of the parameter after ~
            title = "Risk Score",
            opacity = 1) # change the name of the legend 

leaflet
mapshot(leaflet, url = "cincyrisk.html")

Conclusion

In this report, we conducted a study on opioid-overdose risk modeling in Cincinnati, Ohio where the opioid epidemic has been a serious issue. Drawing on methods by Srinivasan, et al. (2023), we used Getis-Ord hotspot analyses to identify locations of persistent risk in Cincinnati from 2016-2020. We constructed measures of the socio-built environment and neighborhood demographic characteristics using principal components analysis and ran logistic regression to determine if places in Cincinnati fall into heroin-overdose hotspots or not. While Poisson regression has long been the traditional method used in geospatial risk model to predict the number (count) of discrete event, we propose this alternative method in this report because we believe that it is less useful for policymakers to focus on the number of heroin-overdose incidence among neighborhoods in Cincinnati. Rather, we should develop a way to quickly identify areas/clusters of high heroin-overdose incidence.

Our model, which uses principal components as predictors, has an accuracy rate of 0.928, a sensitivity rate of 0.66, and an AUC score of 0.904, meaning that it has been effective overall in distinguishing between hotspots and cold spots. We mapped the error and compared that to the actual hotspot, from which we noticed that our model is accurate in roughly estimating the location of the hotspot but tends to make mistakes when deciding which of these grids in particular should be categorized as hotspot. On top of that, many of those grids are neighbors of hotspot grids that are correctly predicted.

Reflecting on the overall study design, we have the following concerns, which we would like to continue working on to improve the accuracy and reliability of our model. Firstly, our study utilized a binary classification of hotspot and cold spot to describe risk. While this is a common approach in risk modeling, we should recognize the complexity within these categories as areas neighboring hotspots may also exhibit varying degrees of risk. Building onto that, another major issue is that our analysis has categorized “insignificant” grids into cold spots. In theory, cold spot should refer to places with a series of grids that have low heroin-incidence count whereas for insignificant grid, there might be a high number of incidences within specific grid. It’s just that those grids do not form clusters with surrounding grids. Merging insignificant grids into cold spot might risk overlooking heroin-incidence within those grids.

Secondly, within the discussion, we highlight the potential imbalance in logistic regression, prompting consideration for weighted logistic regression to address disparities in the prediction of positive and negative cases. Thirdly, during cross validation, we found that our model does not generalize well across neighborhoods. This highlights significant variations between different neighborhoods in Cincinnati, which requires further investigations. Fourthly, as we need to arrogate information into a standardized spatial unit that’s different from any existing spatial unit, we need to be aware of how our model might be sensitive to errors and uncertainties introduced by area-weighted re-aggregation and boundary/edge effect.

Regardless, the takeaways from this analysis will contribute to our understanding of the nuanced landscape of heroin-overdose crisis in Cincinnati and provide important guidance on public health initiatives aiming to address heroin addiction.

We believe that our risk terrain model could effectively visualize heroin-overdose hotspots and could serve a good starting point for providing real-time update through calculating a risk score for each part of the city, allowing the government to conduct real-time data analysis to efficiently allocate EMS resources. We hope that our effort in bringing in data from a variety of sources would encourage data sharing between public and private sectors so that the government can manage public service and monitor its built environment. Reports can be collected from public parks, fast food restaurants, gas stations, places where most overdose incidence occurs. Having data from local hospitals and rehabilitation centers allows the government to see capacity information, thereby efficiently dispatching preparing and dispatching care so that the risk of heroin overdose is minimized before they happen.

We look forward to seeing our risk terrain model being adapted in various scenarios that truly benefit both the stakeholders in managing community health and well-being as well as serve the needs of local residents.

References

Choi, J. I., Lee, J., Yeh, A. B., Lan, Q., & Kang, H. (2022). Spatial clustering of heroin-related overdose incidents: a case study in Cincinnati, Ohio. BMC public health, 22(1), 1-12. https://doi.org/10.1186/s12889-022-13557-3

Srinivasan, S., Pustz, J., Marsh, E., Young, L. D., & Stopka, T. J. (2023). Risk factors for persistent fatal opioid-involved overdose hotspots in Massachusetts 2011-2021: A spatial statistical analysis with socio-economic, access and prescription factors.https://doi.org/10.21203/rs.3.rs-3249650/v1

Chichester, K. R., Drawve, G., Sisson, M., Giménez-Santana, A., McCleskey, B., Goodin, B. R., … & Cropsey, K. L. (2023). Crime and Features of the Built Environment Predicting Risk of Fatal Overdose: A Comparison of Rural and Urban Ohio Counties with Risk Terrain Modeling. American Journal of Criminal Justice, 1-25. https://doi.org/10.1007/s12103-023-09739-3

---
title: "Geospatial Risk Prediction of Heroin Overdose Hotspots in Cincinnati Ohio"
author: "Emily Zhou, EChin Li"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: simplex
    toc: yes
    toc_float: yes
    code_folding: hide
    code_download: yes
editor_options:
  markdown:
    wrap: sentence
---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

```

Version 3.0 \| First Created Nov 17, 2023 \| Updated Dec 8, 2023

Keywords: K Nearest Neighbors, Moran's I, Getis-Ord Gi\*, Kernel Density Estimation, Random K-fold Cross Validation, Logistic Regression, Poisson Regression, ROC Curve, Principal Component Analysis, Multi-Criteria Analysis, Health Geography

Important Links: [study repository](https://github.com/emilyzhou112/MUSA5080-Cincinnati-Heroin), [pitch presentation](https://youtu.be/LRWgTeHXNtk)

# Introduction

The opioid overdose crisis continues as one of the most significant public health challenges of the past two decades in the US.
Between 1999 and 2019, the national age-adjusted opioid-involved overdose death rate increased from 2.9 per 100,000 population to 15.5 per 100,000.
Rates have increased steeply during the COVID-19 pandemic.
In 2020, the rate climbed to 21.4 per 100,000 and to 24.7 per 100,000 in 2021.
Most literature to this date have focused on understanding the drivers of overdose crisis, which include drug supply, drug manufacturing, pain arising from work-related injuries, income inequality, and added stress, isolation, and economic disadvantage connected to the COVID-19 pandemic.
Until recently, however, more and more studies have begun to consider the association between drug use, opioid-related mortality, and the built-environment.
Contextual risk factors, especially features of the built environment, the presence of violent crime or crime associated with substance use, and neighborhood-level socioeconomic factors, have enabled more accurate identification of high-risk overdose areas, thus driving initiatives on alleviating overdose risk and crises.

We conducted a case study on opioid-overdose risk modeling in Cincinnati, Ohio where the opioid epidemic has been a serious issue.
Since 2009, drug overdose deaths in Ohio have continued to increase, and Cincinnati, a large city in southwestern Ohio with a population of about 300,000 people, was one of 12 opioid hotspots in Ohio and had the highest per capita rates of a fatal overdose between 2010 and 2017.
Today, heroin overdose incidences still remain high in the city since heroin and prescription drugs abuse have further led to other public health problems such as HIV and needle-borne diseases.

We combined multiple data sources, including emergency medical service response, crime and non-emergency reports data, American Community Survey data, and datasets that are proxies for local built environment to analyze the distributions of heroin-related overdose incidents in Cincinnati, Ohio.

Among studies of spatial epidemiology, geospatial clustering has been used to analyze the geographic variation of phenomena, such as the disproportionate impact of public health crisis on people with disabilities, and the concentrated vulnerability of socio-economically disadvantaged population under natural hazards.
It allows the investigators to identify groups of spatial objects (i.e. clusters) that have similar characteristics and analyze patterns of the clusters (e.g., hot spots, cold spots).
As such, we used Getis-Ord hotspot analyses to identify locations of persistent risk in Cincinnati from 2016-2020.
We constructed measures of the socio-built environment and neighborhood demographic characteristics using principal components analysis (PCA).
The resulting measures were used as covariates in a logistic regression with random k-fold cross validation to predict if an area was part of an opioid hotspot cluster.
All built environment risk factors were aggregated, scaled, and weighted to generate an interactive overdose risk terrain map.

**On the technical side**, the goal of our study is to build a model that could predict opioid overdose clusters as well as socioeconomic and built environment factors associated with opioid overdose rates in Cincinnati from 2016 to 2020 as much as possible. 
**On the practical side**, we would like our risk terrain model to benefit both the stakeholders and community members in the main form of a municipal dashboard and a mobile application. For stakeholders, the dashboard would serve as an early warning system by predicting/showing areas at high risk of opioid overdose. Specifically, it should include maps with overdose cluster locations, trend graphs over time, and charts illustrating the correlation between overdose rates and various factors. For community members, this application connects them to healthcare and rehabilitation resources within the city. Specifically, the application allow them to call for emergency services and medical assistance for themselves or for others at one tap. To encourage public-private partnership, this app would adopt the approach that OpenStreetMap took to allow both technical and non-technical users, private and public entities to update existing information, such as crime incidents, rehabilitation center capacities and location. We hope that non-technical stakeholders can use the app to allocate resources efficiently, directing outreach efforts, treatment facilities, and educational campaigns to areas with the highest predicted risk. We hope that the public would use this platform to foster community engagement and support for initiatives addressing the opioid crisis.

The rest of the report is organized as follows.
Section one describes our approach to access and pre-process the data.
Section two documents an exploratory analysis we did using poission regression with only crime and 311 incidents.
Section three explains the conceptualization our approach applying the Getis Ord G\* statistics to determine spatial hotspots.
Section four is the main part of this study, where we iclude all built environment predictors to run a logistic regression predicting if an area is an opioid overdose hotspot or not.
Section five applies principal component analysis to reduce colinerity and include only half of the principal components into the model.
Section six validates the accuracy of this model against a kernel density estimation while the last section constructs the risk terrain map.

The conceptualization and methodologies of this study were drawn in part from: 

- Choi, et al.(2022) Spatial clustering of heroin-related overdose incidents: a case study in Cincinnati, Ohio

-  Srinivasan, et al. (2023) Risk factors for persistent fatal opioid-involved overdose hotspots in Massachusetts 2011-2021 

- Chichester, et al.(2023) Crime and Features of the Built Environment Predicting Risk of Fatal Overdose: A Comparison of Rural and Urban Ohio Counties with Risk Terrain Modeling.


```{r library and setup, include=FALSE}

library(tidyverse)
library(tidycensus)
library(sf)
library(lubridate)
library(tigris)
library(mapview)
library(sfdep)
library(spdep)
library(caret)
library(kableExtra)
library(knitr)
library(here)
library(viridis)
library(FNN)
library(stats)
library(factoextra)
library(plotROC)
library(classInt)
library(raster)
library(spatstat.explore)
library(leaflet)
library(patchwork)

options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
tidycensus::census_api_key("e79f3706b6d61249968c6ce88794f6f556e5bf3d", overwrite = TRUE)

```

# Data

We used the [EMS dataset](https://data.cincinnati-oh.gov/Safety/Cincinnati-Fire-Incidents-CAD-including-EMS-ALS-BL/vnsz-a3wp) that includes all responses to heroin-related overdose incidents from the Cincinnati Fire Department between 2016 and 2020.
Each incident record contains incident location information, including geospatial information, location address, a classification of a neighborhood in the city, time of the incident, and disposition of incident response.
Data records regarding incidents outside of the study area, without geospatial information, and with unassociated disposition codes were excluded from this study.
Records from canceled calls or false alarms and duplicated records were also excluded.

```{r heroin data, message=FALSE, warning=FALSE, class.source='fold-show', results='hide'}

heroin <- read.csv(here("data", "public", "Cincinnati_Fire_Incidents__CAD___including_EMS__ALS_BLS_.csv"))
cincinnati <- st_read(here("data", "public", "Cincinnati_City_Boundary.geojson")) %>% st_transform("EPSG:3735")

heroin <- heroin %>% 
  filter(CFD_INCIDENT_TYPE_GROUP != "NON-PROTOCOL PROBLEM TYPES") %>% 
  filter(is.na(LATITUDE_X) == FALSE & is.na(LONGITUDE_X) == FALSE) %>%  # remove incidents without spatial info
  filter(is.na(DISPOSITION_TEXT) == FALSE         # remove record with un-associated disposition codes
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL" # remove records from canceled calls, false alarm, and duplication
         & CFD_INCIDENT_TYPE_GROUP !="CANCEL INCIDENT" 
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,DEF: DEFAULT"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,EMSF: FALSE"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,DUPF: DUPLICATE"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,FALA: FIRE FALSE AC"
         & CFD_INCIDENT_TYPE_GROUP !="CN: CANCEL,MEDD: MT DISREGARDE"
         & CFD_INCIDENT_TYPE_GROUP !="DUPF: DUPLICATE"
         & CFD_INCIDENT_TYPE_GROUP !="DUPLICATE INCIDENT"
         & CFD_INCIDENT_TYPE_GROUP !="MAL: SYSTEM MALFUNCTION") %>% 
  mutate(time = mdy_hms(CREATE_TIME_INCIDENT, tz = "UTC"),  
         year_column = year(time)) %>% 
  st_as_sf(., coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  st_intersection(cincinnati %>% dplyr::select(OBJECTID),.) %>%   # remove incidents outside of study area
  filter(year_column %in% c(2016, 2017, 2018, 2019, 2020))


```

In addition, we used [Homeland Infrastructure Foundation-Level Data (HIFLD)](https://hifld-geoplatform.opendata.arcgis.com/maps/geoplatform::hospitals/about) and [Substance Abuse and Mental Health Services Administration (SAMHSA) data](https://dpt2.samhsa.gov/treatment/directory.aspx) for 2015--2020 to obtain information about available healthcare services, such as hospitals, opioid treatment programs, and buprenorphine prescribing physicians, etc.

We filtered data only for Cincinnati in the HIFLD dataset.
The SAMHSA data were geocoded manually to get its location information first before reading into R.

```{r medical resource data, message=FALSE, warning=FALSE, class.source='fold-show', results='hide'}

hospitals <- st_read(here("data","public", "Hospitals.geojson"))
hospitals <- hospitals %>% 
  filter(STATE == "OH" & CITY == "CINCINNATI") %>% 
  st_transform("EPSG:3735") # select those only in Cincinnati


rehab <- read.csv(here("data", "public", "rehabilitation.csv"))
rehab <- rehab %>% 
  filter(is.na(Latitude) == FALSE & is.na(Longitude) == FALSE) %>%
  st_as_sf(., coords = c("Longitude", "Latitude"), crs = 4326) %>% 
  st_transform("EPSG:3735")

```

Furthermore, we included the [crime rate data](https://data.cincinnati-oh.gov/safety/PDI-Police-Data-Initiative-Crime-Incidents/k59e-2pvf) from the Cincinnati Police Department between 2016 and 2020.
Specific crimes were selected for inclusion in analyses based on having been identified in the literature as a spatial risk factor for drug overdose or substance use behavior or for having a theoretical relationship with drug overdose.

```{r crime data, message=FALSE, warning=FALSE, class.source='fold-show', results='hide'}

crime <- read.csv(here("data", "private", "PDI__Police_Data_Initiative__Crime_Incidents.csv"))
crime_list <- c("AGGRAVATED ASSAULT", "BURGLARY", "BREAKING AND ENTERING", "ROBBERY", "DOMESTIC VIOLENCE", "MURDER	", "RAPE", "THEFT")
crime <- crime %>% 
  filter(OFFENSE %in% crime_list) %>% 
  filter(is.na(LATITUDE_X) == FALSE & is.na(LONGITUDE_X) == FALSE) %>%
  st_as_sf(., coords = c("LONGITUDE_X", "LATITUDE_X"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  mutate(time = mdy_hms(DATE_REPORTED, tz = "UTC"),  
         year = year(time)) %>% 
  filter(year %in% c(2016, 2017, 2018, 2019, 2020))

```

From the same data source, we included the [311 non-emergency service requests](https://data.cincinnati-oh.gov/Thriving-Neighborhoods/Cincinnati-311-Non-Emergency-Service-Requests/4cjh-bm8b) from 2016 to 2020.

```{r 311 complaints data, warning=FALSE, message=FALSE, class.source='fold-show', results='hide'}

complaints <- read_csv(here("data", "private", "Cincinnati_311__Non-Emergency__Service_Requests.csv"))
complaints <- complaints %>% 
  filter(is.na(LATITUDE) == FALSE & is.na(LONGITUDE) == FALSE) %>%
  st_as_sf(., coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>% 
  st_transform("EPSG:3735") %>% 
  mutate(time = mdy_hms(REQUESTED_DATE, tz = "UTC"),  
         year = year(time)) %>% 
  filter(year %in% c(2016, 2017, 2018, 2019, 2020))

```

The connection between substance use and built environment variables (access to public restrooms, access to pharmacies, and driving distance to services, defined in our study as fast-food restaurants, gas stations and public parks) are also important.
Public restrooms are associated with people who inject drugs (PWID) because many people use drugs in these spaces.Pharmacies represent an important access variable for several reasons. 
During the initial wave of the overdose crisis, pharmaceutical prescriptions, either legitimate, diverted, or potentially inappropriate, fed the opioid supply. 
In addition, naloxone (a medication to reverse an opioid overdose) is available at pharmacies without a prescription, although this provision may vary by neighborhood socio-demographic levels. 
All of these data were queried from OpenStreetMap and preprocessed in QGIS (fixing geometry, creating spatial indices, and unioning point with polygon data pieces)

```{r built environment data, warning=FALSE, message=FALSE, results='hide'}
# gas station
fuel <- st_read(here("data", "public", "fuel.geojson")) %>% st_transform("EPSG:3735")

# fast food restaurant 
fastfood <- st_read(here("data", "public", "fastfood.geojson")) %>% st_centroid() %>% st_transform("EPSG:3735")

# public parks
parks <- st_read(here("data", "public", "parks.geojson")) %>% st_transform("EPSG:3735") %>% st_centroid()

# pharmacy
pharmacy <- st_read(here("data","public", "pharmacy.geojson")) %>% st_transform("EPSG:3735") %>% st_centroid()
```

To identify demographic and sociographic characteristics corresponding to the EMS dataset, we utilized the ACS data for 2016 -- 2020.
We downloaded demographic information for each census tract, including total population, gender, adult population size, and race/ethnicity.
The ACS data also included socioeconomic characteristics of each census tract, such as education, income, and poverty.
For each census tract, we computed the population aged between 25 and 54 (since this is the age group most vulnerable to opioid overdose), gender ratio, minority popultion ratio, percentage poverty, and percentage of population with a bachelor's degree.
Note that the boundary of Cincinnati does not fully contain several census tracts and therefore we have tried our best to include tracts within Hamilton county that mostly fall within Cincinnati.

```{r census data, warning=FALSE, message=FALSE, results='hide'}

cincinnati20 <- get_acs(geography = "tract", 
          variables = c(
            "B01001_001E", # total population
            "B01001_002E", # total male
            "B01001_011E", # male 25-29
            "B01001_012E",
            "B01001_013E",
            "B01001_014E",
            "B01001_015E",
            "B01001_016E", # male 50-54
            "B01001_026E", # total female
            "B01001_035E", # female 25-29
            "B01001_036E",
            "B01001_037E",
            "B01001_038E",
            "B01001_039E",
            "B01001_040E", # female 50-54
            "B02001_002E", # white population
            "B02001_003E", # black population
            "B02001_005E", # asian population
            "B03002_012E", # latinx population
            "B19013_001E", # median household income
            "B06012_002E", # poverty
            "B06009_005E" # bachelor
            ), 
          year=2020, state="OH", county="Hamilton", 
          geometry=TRUE, output="wide") %>%
  st_transform("EPSG:3735")

cincinnati_tracts <- st_read(here("data", "public", "cincinnati_tracts.geojson")) %>% st_transform("EPSG:3735")

cincinnati_tracts <- cincinnati_tracts %>% st_drop_geometry() %>% dplyr::select(GEOID) %>% as.list(GEOID)

cincinnati20 <- cincinnati20 %>% 
  filter(GEOID %in% cincinnati_tracts$GEOID) %>% 
  rename(Totalpop = B01001_001E,
         MedHHInc = B19013_001E) %>% 
  mutate(pop25_54 = B01001_011E + B01001_012E + B01001_013E + B01001_014E + B01001_015E + B01001_016E + B01001_035E + B01001_036E + B01001_037E + B01001_038E+ B01001_039E + B01001_040E) %>% 
  mutate(MF_ratio = B01001_002E / B01001_026E) %>% 
  mutate(race_ratio = B02001_002E / (B02001_003E + B02001_005E + B03002_012E)) %>% 
  mutate(pctPoverty = B06012_002E / Totalpop) %>% 
  mutate(pctBachelor = B06009_005E / Totalpop) %>% 
  dplyr::select(pop25_54, Totalpop, MedHHInc, GEOID, MF_ratio, race_ratio, pctPoverty, pctBachelor)


```

The table below summarizes all of our predictor variables in this study:

|             Variable Name             |                          Source                       |
|:----------------------------------:   |:----------------------------------:                   |
|          **Medical Service**          |                                                       |
|               Hospital                |       Homeland Infrastructure Foundation-Level Data   |
|         Rehabilitation Center         | Substance Abuse and Mental Health Services Administration |
|         **Built Environment**         |                                                       |
|              Crime Rate               |                   Open Data Cincinnati                |
|            311 complaints             |                   Open Data Cincinnati                |
|              Gas station              |                            OSM                        |
|             Public parks              |                            OSM                        |
|         Fast food restaurants         |                            OSM                        |
|              Pharmacies               |                            OSM                        |
| **Socio-demographic Characteristics** |                                                       |
|   Population aged between 25 and 54   |                            ACS                        |
|             Gender ratio              |                            ACS                        |
|       White over minority ratio       |                            ACS                        |
|  Population with a Bachelor's degree  |                            ACS                        |
|        Median household income        |                            ACS                        |
|     Population under poverty line     |                            ACS                        |
|        Median Household Income        |                            ACS                        |


# Poission Regression

As part of our exploratory analysis, we ran a poisson regression using only the number of crime incidents and the number of 311 complaints as our predictor variables. Since heroin overdose incident is not a phenomenon that varies across administrative units, but one varying smoothly across the landscape, we would need to aggregate these point-level data into a lattice grid of cells. The code below generates a 500 by 500 meters (820 feet) fishnet for Cincinnati to achieve this goal. 

```{r create fishnet, warning=FALSE, message=FALSE}
fishnet <- st_make_grid(cincinnati,
               cellsize = 820, 
               square = TRUE) %>%
  .[cincinnati] %>%           
  st_sf() %>%
  mutate(uniqueID = 1:n())

ggplot() +
  geom_sf(data=fishnet, color="black", fill="#FFF5EE") +
  labs(title = "Fishnet of Cincinnati") +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
```

We would like to compare across both space and time, and so we divide the heroin dataset into one training set and two testing sets. The training set contains incidents from 2016 to 2018 while the testing sets contain incidents from 2019 and 2020 respectively. Since each individual year has well enough observations of heroin-overdose incidents, separating the testing set into 2019 and 2020 allows us to understand how COVID-19 plays a role in affecting drug overdose rate. We also separate the crime incident data and the 311 complaints data into respective year groups so that we could aggregate them into the fishnet. 

```{r splitting train and test, warning=FALSE, message=FALSE}

heroinTrain <- heroin %>% 
  filter(year_column %in% c(2016, 2017, 2018))
heroin19 <- heroin %>% 
  filter(year_column == 2019)
heroin20 <- heroin %>% 
  filter(year_column == 2020)

crimeTrain <- crime %>% 
  filter(year %in% c(2016, 2017, 2018))
crime19 <- crime %>% 
  filter(year == 2019)
crime20 <- crime %>% 
  filter(year == 2020)

complaintTrain <- complaints %>% 
  filter(year %in% c(2016, 2017, 2018))
complaint19 <- complaints %>% 
  filter(year == 2019)
complaint20 <- complaints%>% 
  filter(year == 2020)

```

The following two code chunks perform all the necessary tasks to prepare our training and testing datasets by aggregating heroin-incidents, crime incidents, and 311 complaints all into the same spatial unit, in this case, the fishnet. 

```{r preparing training set, warning=FALSE, message=FALSE}

# add heroin in net
heroinTrain_net <- 
  dplyr::select(heroinTrain) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

# adding crime to net
heroinTrain_net <- heroinTrain_net %>% 
  st_join(crimeTrain, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroinTrain_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

# add 311 to net
heroinTrain_net <- heroinTrain_net %>% 
  st_join(complaintTrain, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroinTrain_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
```


```{r test sets, warning=FALSE, message=FALSE}

# 2019 test sets
heroin19_net <- 
  dplyr::select(heroin19) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

heroin19_net <- heroin19_net %>% 
  st_join(crime19, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin19_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

heroin19_net <- heroin19_net %>% 
  st_join(complaint19, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin19_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))


# 2020 test set
heroin20_net <- 
  dplyr::select(heroin20) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

heroin20_net <- heroin20_net %>% 
  st_join(crime20, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin20_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

heroin20_net <- heroin20_net %>% 
  st_join(complaint20, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin20_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))
```

We performed some data visualization work to see if there are any differences between heroin-overdose incidents over year. We found that the distribution of these incidents are pretty consistent between 2016-2018 and 2019, with the majority of incidents significantly clustered in south Cincinnati. However, in 2020, despite that this overdose cluster still remain, overdose incidents became more disperse. We may see some small, emerging clusters of high overdose incidents scattered around the city. 

```{r is there a difference between heroin incidents over year, fig.height=6, fig.width=15, warning=FALSE, message=FALSE}


plot1 <- ggplot() +
  geom_sf(data = heroinTrain_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot2 <- ggplot() +
  geom_sf(data = heroin19_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2019") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

 plot3 <- ggplot() +
  geom_sf(data = heroin20_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", name = "Counts") +
  labs(title = "Heroin OD 2020") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
 
plot1 + plot2 + plot3
```

We created heat maps of heroin-overdose incidents from 2016 to 2018 and overlayed crime incidents and 311 complaints above. Based on the visualization, we assume crime and 311 complaints to be strong predictors of overdose risk as places of high overdose incidents align perfectly with places with higher crime rate and more complaints. Yet, it is also pretty evident that crime incidents are correlated with 311 complaints. 

```{r crime and complaints, fig.height=6, fig.width=13, warning=FALSE, message=FALSE}

plot1_1 <- ggplot() +
  geom_sf(data = cincinnati, fill = "black") +
  stat_density2d(data = data.frame(st_coordinates(heroinTrain)), 
                 aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "rocket", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  geom_sf(data = sample_n(crimeTrain, 3000), size = 0.03, colour = "white", alpha = 0.5) +
  labs(title = "Heroin OD and Crime 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot1_2 <- ggplot() +
  geom_sf(data = cincinnati, fill = "black") +
  stat_density2d(data = data.frame(st_coordinates(heroinTrain)), 
                 aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),
                 size = 0.01, bins = 60, geom = 'polygon') +
  scale_fill_viridis(option = "rocket", name = "Density") +
  scale_alpha(range = c(0.00, 0.35), guide = "none") +
  geom_sf(data = sample_n(complaintTrain, 3000), size = 0.03, colour = "white", alpha = 0.5) +
  labs(title = "Heroin OD and Complaint 2016-2018") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot1_1 + plot1_2
```

Mapping the correlation between our predictor variables reveals that indeed, there's a strong, linear, and positive relationship between heroin-overdose incidents and crime as well as 311 complaints, though the relationships become weaker in 2020. We suspect that disruptions in the drug supply chain, changes in law enforcement priorities, as well as increase social isolation and mental health challenges during the pandemic might have contribute to the decrease in correlation between overall crime incidents and heroin overdose incidents in 2020. 

```{r cor plots, fig.height=6, fig.width=12, message=FALSE, warning=FALSE}

plot4 <- heroinTrain_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2016-18") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot5 <- heroin19_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2019") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))


plot6 <- heroin20_net %>%
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") %>% 
  ggplot(aes(x= countHeroin, y = Number)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Crimecount` = "Crime",
    `Complaintscount` = "311 Complaints"))) +
  labs(title = "Scatter Plots of Predictor Variables 2020") +
    theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot4 + plot5 + plot6

```

We ran the Poisson regression and confirmed that crime incidents and 311 complaints are good at predicting heroin-overdose risk. The model is slightly better when used to predict 2020 data, but the difference is minimal. 

```{r running poission, warning=FALSE, message=FALSE}

poissionTrain <- glm(countHeroin ~ Crimecount + Complaintscount, family = "poisson",
                      data = heroinTrain_net)

Prediction19 <-
  mutate(heroin19_net, Prediction = predict(poissionTrain, heroin19_net, type = "response")) %>% 
  mutate(Error = countHeroin - Prediction) %>% 
  mutate(MAE = mean(abs(Error))) %>% 
  mutate(SD_MAE = sd(Error)) %>% 
  mutate(prediction = "2019") %>% 
  slice(1:1)

Prediction20 <-
  mutate(heroin20_net, Prediction = predict(poissionTrain, heroin20_net, type = "response")) %>% 
  mutate(Error = countHeroin - Prediction) %>% 
  mutate(MAE = mean(abs(Error))) %>% 
  mutate(SD_MAE = sd(Error)) %>% 
  mutate(prediction = "2020") %>% 
  slice(1:1)

rbind(Prediction19, Prediction20) %>% 
  st_drop_geometry() %>% 
  group_by(prediction) %>% 
  summarize(Mean_MAE = round(MAE, 2),
              SD_MAE = round(SD_MAE, 2)) %>%
  kable(col.name=c("Prediction", "Mean Absolute Error",'Standard Deviation MAE')) %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed")) 

```

Here are our reflections up to this point:

While there's significant associations between crime incidents, 311 complaints, and heroin-overdose incidents, building a model that relies solely on them to predict risk is simply not enough because substance abuse and overdose risks are complex phenomena influenced by various factors beyond just criminal activity. The geographical and environmental context, including the availability of treatment facilities, proximity to support networks, and neighborhood characteristics, also plays a crucial role in understanding overdose risks. Therefore, we must integrate more features showing built environment and local contexts into the model. 

On top of that, Poisson regression is a traditional method used in geospatial risk model to predict the number (count) of discrete event, especially crime. However, in making decisions about how to alleviate the risk of particular medical or criminal incidence, we shouldn't just look at the number of accidents at an intersection in a day at a very specific place. Rather, we should flag areas in a city that are more vulnerable/risky than others. It is not wrong to care to focus on the number of incidence, but from a policy-making perspective, learning whether or not a place is risky would be more efficient, helpful, and straightforward in guiding decision making. In other words, stakeholders and the general public do not have to know *how many incidents happened at this place*, but *whether this place is risky* and therefore requires more attention. 

# Getis Ord Spatial Hotspot

We adapted this research workflow from Srinivasan, et al. (2023), who studied the risk factors for opioid overdose in Massachusetts. In its essence, Srinivasan, et al used principal component analysis to reduce colinerarity between predictor variables and ran a logistic regression to predict whether places in Massachusetts are within overdose hotspot or not. 

To replicate their approach as much as possible, we first examined the spatial distribution of heroin-related incident rates using the Jenks natural breaks maps.
The Jenks natural breaks maps use a nonlinear algorithm to create groups where within-group similarity is maximized, and between-group similarity is minimized. 
The map shows at least one significant incidence cluster in south Cincinnati and several other smaller clusters surrounding that hotspot. 

```{r jenks break, warning=FALSE, message=FALSE}

heroin_net <- 
  dplyr::select(heroin) %>%  
  mutate(countHeroin = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countHeroin = replace_na(countHeroin, 0),
         uniqueID = 1:n(), 
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))
breaks <- classIntervals(heroin_net$countHeroin, n = 5, style = "jenks")

ggplot() +
  geom_sf(data = heroin_net, aes(fill = countHeroin), color = NA) +
  scale_fill_viridis(option = "rocket", breaks = breaks$brks, name = "Count")  + 
labs(title = "Count of Heroin Overdose for the Fishnet") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, linewidth = 0.8)
  ) + 
  theme(legend.key.size = unit(0.8, 'cm'), legend.text = element_text(size=6))
```

We validated our observation by checking the global moran's I under the hypothesis that heroin incidences are randomly distributed across space.  The figure below plot the frequency of 999 randomly permutated I under Monte Carlo simulation, which shows that our observed I is much higher than all of the randomly generated I’s. This validates the existence of spatial autocorrelation.

```{r global morans I, warning=FALSE, message=FALSE}

coords <-  st_coordinates(heroin_net %>% st_centroid()) 
neighborList <- knn2nb(knearneigh(coords, 4))
spatialWeights <- nb2listw(neighborList, style="W")

moranTest <- moran.mc(heroin_net$countHeroin, 
                      spatialWeights, nsim = 999)

ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01, fill = "black") +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
  labs(title = "Observed and Permuted Moran's I",
       subtitle = "Observed Moran's I in Red",
       x = "Moran's I",
       y = "Count") +
  theme_light() +   
theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=10))

```

Since the global spatial clustering analysis yields only one statistic to describe the overall point pattern across the whole study area, we have to use the Getis-Ord Gi\* hotspot statistic to identify and map clusters (i.e., grids with high overdose rates surrounded by grids with high rates). Below, each hotspot identified is statistically significant at the p\<0.1 level. For estimating the clusters, we used the queen's criterion to define our spatial weights matrix.

```{r local moran I, warning=FALSE, message=FALSE}

neigh_nbs <- heroin_net %>% 
  mutate(
    nb = st_contiguity(geometry),  # neighbors share border    
    wt = st_weights(nb), # row-standardized weights              
    neigh_lag = st_lag(countHeroin, nb, wt)  
  )

gi_hot_spots <- neigh_nbs %>% 
  mutate(Gi = local_g_perm(countHeroin, nb, wt, nsim = 999)) %>% 
  unnest(Gi) 

gi_hot_spots <- gi_hot_spots %>%  
  dplyr::select(gi, p_folded_sim, uniqueID) |> 
  mutate(
    classification = case_when(
      # Classify based on the following criteria:
      gi > 0 & p_folded_sim <= 0.01 ~ "Very hot",
      gi > 0 & p_folded_sim <= 0.05 ~ "Hot",
      gi > 0 & p_folded_sim <= 0.1 ~ "Somewhat hot",
      gi < 0 & p_folded_sim <= 0.01 ~ "Very cold",
      gi < 0 & p_folded_sim <= 0.05 ~ "Cold",
      gi < 0 & p_folded_sim <= 0.1 ~ "Somewhat cold",
      TRUE ~ "Insignificant"
    ),    # Convert 'classification' into a factor for easier plotting
    classification = factor(
      classification,
      levels = c("Very hot", "Hot", "Somewhat hot",
                 "Insignificant",
                 "Somewhat cold", "Cold", "Very cold")
    )
  )

gi_hot_spots %>% 
  ggplot() + 
  geom_sf(aes(fill = classification), color = "black", lwd = 0.1) +
  scale_fill_brewer(type = "div", palette = 6, name = "Classification") +
  labs(title = "Spatial Hotspots and Coldspots of Heroin Overdose Incidence") +
  theme(legend.position="bottom",
        axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

heroin_net <- gi_hot_spots %>% 
  mutate(hotspot = ifelse(classification %in% c("Very hot", "Hot", "Somewhat hot"), 1, 0)) %>% 
  st_drop_geometry() %>% 
  left_join(heroin_net, ., by = "uniqueID")
```

The cluster map shows seven types of spatial association determined based on heroin-related overdose incident rate within a grid and its neighboring grids. If a grid has a high heroin incident rate with neighboring grids of high heroin incident rates, it was determined as a hot spot. At the same scheme, if a grid shows a low heroin incident rate with neighboring grids of low heroin incident rates, it was classified as a cold spot. Hotspots with p\<0.01 are considered as "Very hot", with p\<0.05 are considered as "Hot", with p\<0.1 are consiered as "Somewhat hot". All grids that fall into these three categories received a 1 for hotspot, otherwise, it gets a zero. 

# Logistic Regression

## Built Environment Features

We aggregated built environment features in addition to crime and 311 complaints into the fishnet of Cincinnati. For crime and complaints, we counted their number in each grid. For hospitals, rehabilitation centers, pharmacies, fast food restaurants, gas stations, and public parks, we calculated the average distance from the center of each grid to these locations. For demographic variables, we spatially joined tract level into the grids depending on which census tract each centroid of the grid fall into.   

```{r adding risk factors to net, warning=FALSE, message=FALSE}

# adding crime to net
heroin_net <- heroin_net %>% 
  st_join(crime, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Crimecount = n()) %>%
    left_join(heroin_net, . ) %>%
    st_sf() %>%
  mutate(Crimecount = ifelse(is.na(Crimecount), 0, Crimecount))

# add 311 to net
heroin_net <- heroin_net %>% 
  st_join(complaints, ., join=st_within) %>% 
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(Complaintscount = n()) %>%
    left_join(heroin_net, . ) %>%
    st_sf() %>%
  mutate(Complaintscount = ifelse(is.na(Complaintscount), 0, Complaintscount))

net_centroid <- st_centroid(heroin_net)

# add hospitals
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(hospital.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(hospitals), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(hospital.nn, uniqueID), 
            by = "uniqueID")

# add rehab center
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(rehab.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(rehab), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(rehab.nn, uniqueID), 
            by = "uniqueID")

# add pharmacy
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(pharm.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(pharmacy), 1)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(pharm.nn, uniqueID), 
            by = "uniqueID")


# add gas station
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(fuel.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(fuel), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(fuel.nn, uniqueID), 
            by = "uniqueID")

# add fast food restaurant
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(fast.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(fastfood), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(fast.nn, uniqueID), 
            by = "uniqueID")

# add parks
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              mutate(parks.nn = nn_function(st_coordinates(net_centroid), 
                                           st_coordinates(parks), 2)*0.3048) %>% 
              st_drop_geometry() %>% dplyr::select(parks.nn, uniqueID), 
            by = "uniqueID")


# add demographic vars
heroin_net <- heroin_net %>% 
  left_join(net_centroid %>% 
              dplyr::select(uniqueID) %>% 
              st_intersection(cincinnati20) %>% 
              st_drop_geometry() %>% dplyr::select(-GEOID), by = "uniqueID") %>% 
  filter(is.na(pop25_54) == FALSE)
```

Some exploratory data analysis were performed, including computing the correlation between all of our predictor variables and heroin incidence. The figure below shows that all of our predictor variables are correlated with heroin-overdose incidence, among which all of our nearest distance variables are negatively correlated with incidence. This indicates that the further away we are from fast food restaurants and gas stations, for example, the less number of heroin-overdose incidence in that area. The relationship is the strongest for crime incidence and weakest for population aged between 25-54. 


```{r cor for all predictors, fig.height=12, fig.width=8, message=FALSE, warning=FALSE}

heroin_net_long <- heroin_net%>% 
  st_drop_geometry() %>% 
  dplyr::select(-c(uniqueID, cvID, gi, p_folded_sim, classification, hotspot, Totalpop)) %>% 
  pivot_longer(cols = -countHeroin, # everything except measurement
               names_to = "Type", # categorizes all quantitative variables into Type
               values_to = "Number") 

correlation.cor <-
  heroin_net_long %>%
    group_by(Type) %>%
    summarize(correlation = cor(Number, countHeroin, use = "complete.obs"))

heroin_net_long %>%
  ggplot(aes(x= Number, y = countHeroin)) +
  geom_point(size = 0.01, color = "#000004") +  
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1, size=3) +
  geom_smooth(method='lm', formula= y~x, lwd=0.5, se = FALSE, color = "#BB3754") +
  facet_wrap(~ Type, scales = "free", ncol = 2, labeller= labeller(Type = c(
    `Complaintscount` = "Complaints",
    `Crimecount` = "Crime",
    `fast.nn` = "Distance to Fast Food",
    `fuel.nn` = "Distance to Gas Stations",
    `hospital.nn` = "Distance to Hospital",
    `MedHHInc` = "Median Household Income",
    `MF_ratio` = "Gender Ratio",
    `parks.nn` = "Distance to Parks",
    `pctBachelor` = "Percent Bachelor's Degree",
    `pctPoverty` = "Poverty Rate",
    `pharm.nn` = "Distance to Pharmacy",
    `pop25_54` = "Population 25_54", 
    `rehab.nn` = "Distance to Rehab Center",
    `race_ratio` = "Racial Majority over Minority"))) +
  labs(title = "Scatter Plots of All Predictor Variables") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=6),
        axis.text.y=element_text(size=6), 
        axis.title=element_text(size=8), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
```

The figure below shows the difference in the number of hotspots or cold spots based on each of our predicting variable and we may see that there seems to be a significant difference in all scenarios. 

```{r anova figure, fig.height=10, fig.width=15, message=FALSE, warning=FALSE}

anova_dataset <- heroin_net %>% st_drop_geometry() %>% 
                    dplyr::select(hotspot, Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor)

anova_dataset %>%
  gather(Variable, value, -hotspot) %>%
    ggplot(aes(as.character(hotspot), value, fill=as.character(hotspot))) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free", ncol = 5, labeller= labeller(Variable = c(
        `Complaintscount` = "Complaints",
        `Crimecount` = "Crime",
        `fast.nn` = "Distance to Fast Food",
        `fuel.nn` = "Distance to Gas Stations",
    `hospital.nn` = "Distance to Hospital",
    `MedHHInc` = "Median Household Income",
    `MF_ratio` = "Gender Ratio",
    `parks.nn` = "Distance to Parks",
    `pctBachelor` = "Percent Bachelor's Degree",
    `pctPoverty` = "Poverty Rate",
    `pharm.nn` = "Distance to Pharmacy",
    `pop25_54` = "Population 25_54", 
    `rehab.nn` = "Distance to Rehab Center",
    `race_ratio` = "Racial Majority over Minority"))) +
      scale_fill_manual(values = c("#000004",  "#BB3754")) +
      labs(x="Hotspot", y="Value", 
           title = "Feature Associations with the Likelihood of Overdose Hotspot") +
      theme(legend.position = "none") +
    theme(plot.subtitle = element_text(size = 12,face = "italic"),
        plot.title = element_text(size = 18, face = "bold"), 
        axis.text.x=element_text(size=10),
        axis.text.y=element_text(size=10), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))
```

We further conducted a series of ANOVA tests for all predictor variables to validate if there's a significant difference in mean number of crimes, for example, between cold spot and hot spot. The result is in concur with our expectations: all variables are significant predictors.

```{r anova test, message=FALSE, warning=FALSE}

anova_var <- c("Crimecount", "Complaintscount", "pharm.nn", "hospital.nn", "fast.nn", "fuel.nn", "rehab.nn", "parks.nn", "pop25_54", "MedHHInc", "MF_ratio", "race_ratio", "pctPoverty", "pctBachelor")

anova_results <- data.frame(
                            Df = integer(), 
                            Sum_Sq = numeric(), 
                            Mean_Sq = numeric(), 
                            F_value = numeric(), 
                            P_Value = numeric(), stringsAsFactors = FALSE)

new_names <- c("Crimecount" = "Crime",
                "Complaintscount" = "Complaints",
                "pharm.nn" = "Distance to Pharmacy",
                "hospital.nn" = "Distance to Hospital",
                "fast.nn" = "Distance to Fast Food",
                "fuel.nn" = "Distance to Gas Stations",
                "rehab.nn" = "Distance to Rehab Center", 
                "parks.nn" = "Distance to Parks", 
                "pop25_54" = "Population 25_54", 
                "MedHHInc" = "Median Household Income",
                "MF_ratio" = "Gender Ratio",
                "race_ratio" = "Racial Majority over Minority",
                "pctPoverty" = "Poverty Rate", 
                "pctBachelor" = "Percent Bachelor's Degree"
               )

for (var in anova_var) {
  anova_result <- aov(anova_dataset[[var]] ~ anova_dataset$hotspot)
  summary_data <- summary(anova_result)[[1]][, c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")][1:1, ]
  anova_results <- rbind(anova_results, summary_data)
}

rownames(anova_results) <- new_names

anova_results %>% 
 kable() %>% 
 kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>% 
    footnote(general_title = "\n")

```

## Simple Logit

Considering that all predictor variables used in the anova test appear to be significant, we included all of them in a kitchen-sink logistic regression. The model shows that in this scenario, **number of crimes**, **number of complaints**, **distance to hospital**, **distance to fast food restaurant**, **gender ratio**, and **distance to public parks** are highly significant predictors of heroin-overdose hotspot whereas other predictors become insignificant. 

```{r logistic without cv, warning=FALSE, message=FALSE}

kitchensink <- glm(hotspot ~ .,
                  data=heroin_net %>% 
                    st_drop_geometry() %>% 
                    dplyr::select(hotspot, Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor), family="binomial" (link="logit"))


kitchen_sum <- summary(kitchensink)
coefficients_table <- as.data.frame(kitchen_sum$coefficients)

coefficients_table$significance <- ifelse(coefficients_table$`Pr(>|z|)` < 0.001, '***',
                                         ifelse(coefficients_table$`Pr(>|z|)` < 0.01, '**',
                                                ifelse(coefficients_table$`Pr(>|z|)` < 0.05, '*',
                                                       ifelse(coefficients_table$`Pr(>|z|)` < 0.1, '.', ''))))

coefficients_table$p_value <- paste0(round(coefficients_table$`Pr(>|z|)`, digits = 3), coefficients_table$significance)

coefficients_table %>%
  dplyr::select(-significance, -`Pr(>|z|)`) %>% 
  kable(align = "r") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))  %>%
  footnote(general_title = "\n")
```

## Logit with CV

We would like to perform cross validation on this model to measures its overall performance and robustness. We decide to modify a cross validation function that was originally built on cases using Poission regression by making it suitable for use in our scenario (logistic regression). This function takes in four parameters: the dataset, a cross validation ID, which it could use to separate the dataset into training and testing, a list of dependent variables, and a independent variable. In our case, we use a randomly generated cvID associated with each grid cell for random k-fold cross validation. Each group of grids with the same cvID gets hold out during one iteration. 


```{r logit function}

logitCV <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])

  for (i in cvID_list) {
    
    thisFold <- i

    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, all_of(indVariables),
                    all_of(dependentVariable))
    
    form_parts <- paste0(dependentVariable, " ~ ", paste0(indVariables, collapse = "+"))
    form <- as.formula(form_parts)
    regression <- glm(form, data = fold.train %>%
                        dplyr::select(-geometry, -id),  family="binomial" (link="logit"))
    
    thisPrediction <-
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

```

Here, another helper function is written to visualize the confusion matrix. 

```{r cm function}

draw_confusion_matrix <- function(cm) {

  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX', cex.main=2)

  # create the matrix 
  rect(150, 430, 240, 370, col='black')
  text(195, 435, '0', cex=1.2)
  rect(250, 430, 340, 370, col="#BB3754")
  text(295, 435, '1', cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col="#BB3754")
  rect(250, 305, 340, 365, col='black')
  text(140, 400, '0', cex=1.2, srt=90)
  text(140, 335, '1', cex=1.2, srt=90)

  # add in the cm results 
  res <- as.numeric(cm$table)
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}  
```

The cross validation model is visualized below, from which we see that our model has a high accuracy rate of 0.936 and a moderate sensitivity rate of 0.727. Sensitivity is the true-positive rate, which is the proportion of actual hotspot grids that were correctly identified by the model. Given this model's high accuracy rate, this means that our model is missing out a few heorin-overdose hotspots. 

However, note that in Cincinnati, the number of grids that are classified as hotspot is much smaller than the number of grids that are classified as coldspot. While this is a good news for Cincinnati, logistic regression can be sensitive to class imbalance, meaning that the model might be biased towards the majority class, leading to poor performance on the minority class. 

```{r logistic with CV, warning=FALSE, message=FALSE}

reg.vars <- c("Crimecount", "Complaintscount", "pharm.nn", "hospital.nn", "fast.nn", "fuel.nn", "rehab.nn", "parks.nn", "pop25_54", "MedHHInc", "MF_ratio", "race_ratio", "pctPoverty", "pctBachelor")


kitchensink.logit.cv <- logitCV (
  dataset = heroin_net,
  id = "cvID",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, hotspot, Prediction, geometry)

kitchensink.logit.cv <- kitchensink.logit.cv %>% 
  mutate(PredictionCat  = as.factor(ifelse(kitchensink.logit.cv$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))


cm_basic <- caret::confusionMatrix(kitchensink.logit.cv$hotspot, kitchensink.logit.cv$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_basic)

```

# Principal Component Analysis (PCA)

One of the reasons why each individual predictor variable is correlated with overdose hotspot but when combined in a regression, some of them become insignificant is because of multicollinearity. This commonly occurs in regression analysis when predictor variables are highly correlated. It won't be surprise to see that many of those demographic variables are highly correlated with each other. It is possible that distance to hospitals is correlated with distance to parks, given that there are more those kind of services in downtown areas. Srinivasan, et al. (2023)'s approach to this issue is to use Principal Component Analysis (PCA) as a preprocessing step before running the regression.  


## PC Identification

PCA is a dimension reduction technique that is commonly used to reduce the number of variables used in analyses while preserving the information from them. The `prcomp` function below is used to identify the principal components among all 14 of our predictor variables. 

```{r determine pc, warning=FALSE, message=FALSE}

pca_heroin <- heroin_net %>% 
  st_drop_geometry() %>% 
  dplyr::select(Crimecount, Complaintscount, pharm.nn, hospital.nn, fast.nn, fuel.nn, rehab.nn, parks.nn, pop25_54, MedHHInc, MF_ratio, race_ratio, pctPoverty, pctBachelor)

pca_heroin <- na.omit(pca_heroin)
pc <- prcomp(pca_heroin,
             center = TRUE,
            scale. = TRUE)
```

The scree plot is used to visualize the importance of each principal component and can be used to determine the number of principal components to retain. The y axis is eigenvalues, which essentially stand for the amount of variation. Here, we may see that the first six, or seven components are capturing the majority of variability. We can therefore, just use the first seven principal components and ignore the rest. 

```{r pca viz1, warning=FALSE, message=FALSE}

fviz_eig(pc, addlabels = TRUE, barfill = "#BB3754", barcolor = "transparent", xlab = "Principal Components", choice = "eigenvalue")

```

The code above computed the square cosine value for each variable with respect to the first seven principal components. From the illustration below, gender ratio, population aged between 25-54, distance to parks, and poverty rate are top four variables with the highest cos2, hence contributing the most to the top seven principal components. 

```{r pca viz2,  message=FALSE, warning=FALSE}

fviz_cos2(pc, choice = "var", axes = 1:7) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_bar(stat = "identity", fill = "#BB3754", color = "transparent")

```

We also did a biplot based on our principal component analysis, which helps us to visualize the similarities and dissimilarities between the samples, and further understand the impact of each attribute on each of the principal components. Here, the length of the arrow represents the strength of the contribution of the corresponding variable to the specific principal component. The color of the arrow represents the relative contribution of the variable to the principal component. Darker colors typically indicate larger contributions. For variables with long arrow, but light color, such as percentage of population with a Bachelor's degree and percentage poverty, their contributions were less significant compared to other variables, and might only contribute significantly to one PC. For variables with short arrow but dark color, such as 311 complaints and crime, they have relatively strong contributions to all PC. Also, the direction of those arrows indicate that crimes and complaints have positive contributions to hotspots. 

```{r pca viz3, fig.height=6, fig.width=9, message=FALSE, warning=FALSE}

fviz_pca_var(pc, col.var = "cos2", repel = TRUE, 
             arrowsize = 1,  ggtheme = theme_minimal(), gradient.cols = c("black", "#BB3754"),) + ggtitle("Biplot Combind with Contribution")


```

Below, we computed the first seven principal components based on the loadings of each of our predictor variables in that component. 

```{r compute pca, message=FALSE, warning=FALSE}

pca_net <- heroin_net %>% 
  mutate(pc1 = Crimecount * pc$rotation[1] + Complaintscount * pc$rotation[2] + pharm.nn * pc$rotation[3] + hospital.nn * pc$rotation[4] + fast.nn * pc$rotation[5] +  fuel.nn * pc$rotation[6] + rehab.nn * pc$rotation[7] + parks.nn *  pc$rotation[8] + pop25_54 * pc$rotation[9] + MedHHInc * pc$rotation[10] + MF_ratio * pc$rotation[11] + race_ratio * pc$rotation[12] +  pctPoverty * pc$rotation[13] + pctBachelor * pc$rotation[14]) %>% 
  mutate(pc2 = Crimecount * pc$rotation[15] + Complaintscount * pc$rotation[16] + pharm.nn * pc$rotation[17] + hospital.nn * pc$rotation[18] + fast.nn * pc$rotation[19] +  fuel.nn * pc$rotation[20] + rehab.nn * pc$rotation[21] + parks.nn *  pc$rotation[22] + pop25_54 * pc$rotation[23] + MedHHInc * pc$rotation[24] + MF_ratio * pc$rotation[25] + race_ratio * pc$rotation[26] +  pctPoverty * pc$rotation[27] + pctBachelor * pc$rotation[28]) %>%
  mutate(pc3 = Crimecount * pc$rotation[29] + Complaintscount * pc$rotation[30] + pharm.nn * pc$rotation[31] + hospital.nn * pc$rotation[32] + fast.nn * pc$rotation[33] +  fuel.nn * pc$rotation[34] + rehab.nn * pc$rotation[35] + parks.nn *  pc$rotation[36] + pop25_54 * pc$rotation[37] + MedHHInc * pc$rotation[38] + MF_ratio * pc$rotation[39] + race_ratio * pc$rotation[40] +  pctPoverty * pc$rotation[41] + pctBachelor * pc$rotation[42]) %>%
  mutate(pc4 = Crimecount * pc$rotation[43] + Complaintscount * pc$rotation[44] + pharm.nn * pc$rotation[45] + hospital.nn * pc$rotation[46] + fast.nn * pc$rotation[47] +  fuel.nn * pc$rotation[48] + rehab.nn * pc$rotation[49] + parks.nn *  pc$rotation[50] + pop25_54 * pc$rotation[51] + MedHHInc * pc$rotation[52] + MF_ratio * pc$rotation[53] + race_ratio * pc$rotation[54] +  pctPoverty * pc$rotation[55] + pctBachelor * pc$rotation[56]) %>%
  mutate(pc5 = Crimecount * pc$rotation[57] + Complaintscount * pc$rotation[58] + pharm.nn * pc$rotation[59] + hospital.nn * pc$rotation[60] + fast.nn * pc$rotation[61] +  fuel.nn * pc$rotation[62] + rehab.nn * pc$rotation[63] + parks.nn *  pc$rotation[64] + pop25_54 * pc$rotation[65] + MedHHInc * pc$rotation[66] + MF_ratio * pc$rotation[67] + race_ratio * pc$rotation[68] +  pctPoverty * pc$rotation[69] + pctBachelor * pc$rotation[70]) %>%
  mutate(pc6 = Crimecount * pc$rotation[71] + Complaintscount * pc$rotation[72] + pharm.nn * pc$rotation[73] + hospital.nn * pc$rotation[74] + fast.nn * pc$rotation[75] +  fuel.nn * pc$rotation[76] + rehab.nn * pc$rotation[77] + parks.nn *  pc$rotation[78] + pop25_54 * pc$rotation[79] + MedHHInc * pc$rotation[80] + MF_ratio * pc$rotation[81] + race_ratio * pc$rotation[82] +  pctPoverty * pc$rotation[83] + pctBachelor * pc$rotation[84]) %>%
  mutate(pc7 = Crimecount * pc$rotation[85] + Complaintscount * pc$rotation[86] + pharm.nn * pc$rotation[87] + hospital.nn * pc$rotation[88] + fast.nn * pc$rotation[89] +  fuel.nn * pc$rotation[90] + rehab.nn * pc$rotation[91] + parks.nn *  pc$rotation[92] + pop25_54 * pc$rotation[93] + MedHHInc * pc$rotation[94] + MF_ratio * pc$rotation[95] + race_ratio * pc$rotation[96] +  pctPoverty * pc$rotation[97] + pctBachelor * pc$rotation[98])
  

```

## PCA Logistic Regression

Following Srinivasan, et al. (2023)'s method, now we can estimated this logistic regression to predict if a grid was in a hotspot (0/1) as identified by the Getis-Ord Gi\* statistic using components derived from the PCA as the explanatory variables. For this model using the principal components as predictors, we still get a accuracy rate of 0.928. Our sensitivity rate dropped a little bit to 0.66 compared to the kitchen sink model, probably because we addressed the issue of multicolinearity,  but that is still considered as a moderate sensitivity. 

```{r using pca in logistic regression, message=FALSE, warning=FALSE}

reg.vars <- c("pc1", "pc2", "pc3", "pc4", "pc5", "pc6", "pc7")

pca.logit.cv <- logitCV (
  dataset = pca_net,
  id = "cvID",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, hotspot, Prediction, geometry)

pca.logit.cv <- pca.logit.cv %>% 
  mutate(PredictionCat  = as.factor(ifelse(pca.logit.cv$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))

cm_pca <- caret::confusionMatrix(pca.logit.cv$hotspot, pca.logit.cv$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_pca)

```

In addition to cross-validating using random cvID, we also tried the spatial ‘Leave-one-group-out’ cross-validation (LOGO-CV) approach, during which we hold out one neighborhood, train the model on the remaining n-1 areas, predict for the hold out, and record the goodness of fit. The result shows that our model has high accuracy, but the sensitivity further decreases. One possible explanation here is that different neighborhoods may have distinct characteristics or patterns. If the model is trained on a set of neighborhoods and then tested on a neighborhood with different characteristics, its performance may be negatively affected. Random cross-validation IDs might not capture these spatial variations as explicitly, but further investigation is required here. 

```{r using pca in logistic regression with neighborhood, message=FALSE, warning=FALSE, results='hide'}

neighborhood <- st_read(here("data", "public", "neighborhood.geojson")) %>% st_transform("EPSG:3735") %>% st_zm()
pca_net <- pca_net %>%
  left_join(net_centroid %>% dplyr::select(uniqueID) %>% 
               st_intersection(neighborhood %>% dplyr::select(SNA_NAME)) %>% 
              st_drop_geometry(), by = "uniqueID") %>% 
  mutate(SNA_NAME = ifelse(is.na(SNA_NAME), "NOT APPLICABLE", SNA_NAME))


pca.logit.cv.neigh <- logitCV (
  dataset = pca_net,
  id = "SNA_NAME",
  dependentVariable = "hotspot",
  indVariables = reg.vars) %>%
    dplyr::select(SNA_NAME = SNA_NAME, hotspot, Prediction, geometry)

pca.logit.cv.neigh <- pca.logit.cv.neigh %>% 
  mutate(PredictionCat  = as.factor(ifelse(pca.logit.cv.neigh$Prediction > 0.5 , 1, 0))) %>% 
  mutate(hotspot = as.factor(hotspot))

cm_pca_neigh <- caret::confusionMatrix(pca.logit.cv.neigh$hotspot, pca.logit.cv.neigh$PredictionCat, 
                       positive = "1")

draw_confusion_matrix(cm_pca_neigh)
```

We computed the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) metric for our model to evaluate their performance. The AUC value ranges from 0 to 1, where 0.5 means a model with no discriminatory power (it performs as well as random chance), and 1 means a perfect model (it perfectly distinguishes between positive and negative cases). Generally, an AUC over 0.8 indicates a good performance of the model. For the random K-fold cross validation, we get a AUC score of 0.904 and for the Spatial LOGO-CV, we get a AUC score of 0.8628. Both are good scores and indicate that model is capturing important patterns in the data and making predictions better than random guessing.

```{r ROC curve, fig.height=6, fig.width=11, message=FALSE, warning=FALSE}

plot7 <- ggplot(pca.logit.cv, aes(d = as.numeric(hotspot), m = Prediction)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "black") +
    labs(title = "ROC Curve for Logistic Regression Model using PCA", 
         subtitle = "Random Cross Validation") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#BB3754") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot8 <- ggplot(pca.logit.cv.neigh, aes(d = as.numeric(hotspot), m = Prediction)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "black") +
    labs(title = "ROC Curve for Logistic Regression Model using PCA", 
         subtitle = "Spatial LOGO Cross Validation") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = "#BB3754") +
  theme(plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"), 
        axis.text.x=element_text(size=8),
        axis.text.y=element_text(size=8), 
        axis.title=element_text(size=9), 
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth =0.8))

plot7 + plot8

```




```{r mapping error, message=FALSE, warning=FALSE, fig.height=6, fig.width=11}

plot9 <- pca.logit.cv %>%
  filter(is.na(Prediction) == FALSE) %>% 
   mutate(result = case_when(
   PredictionCat==0 & hotspot==0 ~ "Accurate",
   PredictionCat==1 & hotspot==1 ~"Accurate",
   PredictionCat==0 & hotspot==1 ~"Problematic",
   TRUE ~ "Inaccurate"
  )) %>% 
  ggplot() +
  geom_sf(aes(fill=as.character(hotspot)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Actual Hotspot") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
  

plot10 <- pca.logit.cv %>%
  ggplot() +
  geom_sf(aes(fill=as.character(PredictionCat)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Predicted Hotspot", 
         subtitle = "Random Cross Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot11 <- pca.logit.cv.neigh %>%
  ggplot() +
  geom_sf(aes(fill=as.character(PredictionCat)), color="transparent")+
  scale_fill_manual(values = c("black", "#BB3754"), name = "Hotspot") +
  labs(title = "Predicted Hotspot", 
         subtitle = "Spatial LOGO Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )
  

plot12 <- pca.logit.cv %>%
  filter(is.na(Prediction) == FALSE) %>% 
   mutate(result = case_when(
   PredictionCat==0 & hotspot==0 ~ "Accurate",
   PredictionCat==1 & hotspot==1 ~"Accurate",
   PredictionCat==0 & hotspot==1 ~"Problematic",
   TRUE ~ "Inaccurate"
  )) %>% 
  ggplot() +
  geom_sf(aes(fill=as.character(result)), color="transparent")+
  scale_fill_manual(values = c("black", "#FFC000", "#BB3754"), name = "Hotspot") +
  labs(title = "Mapping Error", 
         subtitle = "Random Cross Validation") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
        )

plot9 + plot10 + plot11 + plot12
```


If our model has generally been good at making prediction (high accuracy) but with some uncertainties in correctly locating the hotspot, then we would like to know where it is making mistakes. We mapped the error below and compared it to the actual hotspot. What we can see is that our model is accurate in roughly estimating the location of the hotspot, but tends to make mistakes when deciding which of these grids in particular should be categorized as hotspot. 

We highlighted those grids which are actual hotspot but were wrongly predicted as problematic, from which we may see that those most of those problematic grids pretty much neighbors hotspot grids. This implies that when drawing conclusion from our model, we should not only look at specific hotspot grid, but also probably pay attention to grids nearby. 


# KDE Validation

The accuracy of our model is also compared relative to kernel density estimation. Kernel density works by centering a smooth kernel, or curve, atop each crime point such that the curve is at its highest directly over the point and the lowest at the range of a circular search radius. The code block below creates a Kernel density map of robbery with a 1000 foot search radius.

```{r KDE 1000 radii,  message=FALSE, warning=FALSE}

heroin_ppp <- as.ppp(st_coordinates(heroin), W = st_bbox(heroin_net))
heroin_KD.1000 <- spatstat.explore::density.ppp(heroin_ppp, 1000)
heroin_KD.df <- data.frame(rasterToPoints(mask(raster(heroin_KD.1000), as(cincinnati20, 'Spatial'))))


ggplot(data=heroin_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  coord_sf(crs=st_crs(heroin_net)) + 
  scale_fill_viridis(option = "rocket", name="Density") +
  labs(title = "Kernel Density of Heroin 1000ft Radii") +
    theme(axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )
```

The kernel density estimation was classified into two categories, lower risk and higher risk, matching the binary of hotspot and coldspot. We used the `Fisher's method` that involves finding intervals such that the variance within each interval is minimized while maximizing the variance between the intervals. The purpose here is to see if the our model capture the same hotspot grids as the kernel density. The map below shows that our model perform as good as, if not worse, the kernal density approach in capturing hotspots (high risk areas). 

```{r comparison, message=FALSE, warning=FALSE}

heroin_KDE_sum <- as.data.frame(heroin_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(heroin_net)) %>%
  aggregate(., heroin_net, mean) 

kde_breaks <- classIntervals(heroin_KDE_sum$value, 
                             n = 2, "fisher")

heroin_KDE_sf <- heroin_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 2 ~ "High",
           Risk_Category == 1 ~ "Low"))

heroin_risk_sf <-
  pca.logit.cv %>%
  mutate(label = "Logistic Regression",
         Risk_Category = case_when(
           PredictionCat == 1 ~ "High",
           PredictionCat == 0 ~ "Low")) %>% 
  dplyr::select(PredictionCat, label, Risk_Category) %>% 
  rename(value = PredictionCat)


rbind(heroin_KDE_sf, heroin_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    facet_wrap(~label, ) +
    scale_fill_viridis(option = "rocket", discrete = TRUE, name = "Risk Category") +
    labs(title="Comparison of Kernel Density and Risk Predictions") + 
  theme(legend.position="bottom",
    axis.text.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks =element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.subtitle = element_text(size = 9,face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill=NA, size=0.8)
        )
```


# Multi Criteria Analysis

We would like to conclude our analysis by building multi-criteria-analysis-centered geospatial risk terrain model that stakeholders and the public can use to learn about the community vulnerability to heroin-overdose in Cincinnati. In GIS, a Multi-Criteria Analysis (MCA) model is a decision-making approach that involves evaluating and comparing multiple criteria or factors to make informed spatial decisions. The criteria we use here are the predictor variables we use in our model. These criteria are standardized to ensure that all of them are on a scale of 0 to 1. Then, some of those scaled variables are reversed if necessary based on whether they contribute positively or negatively to the heroin-overdose risk. We supply a weight on each of those variables based on their contribution. Finally, a risk score is computed for each grid. 

```{r MCA, message=FALSE, warning=FALSE}

scale_values <- function(x){(x-min(x))/(max(x)-min(x))}

risk_score <- heroin_net %>% 
  mutate(MedHHInc = ifelse(is.na(MedHHInc), 0, MedHHInc)) %>% 
  mutate(scl_crime = scale_values(Crimecount),
         scl_complaints = scale_values(Complaintscount),
         scl_pharm = scale_values(pharm.nn),
         scl_hospital = scale_values(hospital.nn),
         scl_rehab = scale_values(rehab.nn),
         scl_park = scale_values(parks.nn),
         scl_fuel = scale_values(fuel.nn),
         scl_fast = scale_values(fast.nn),
         scl_pop = scale_values(pop25_54),
         scl_income = scale_values(MedHHInc),
         scl_gender = scale_values(MF_ratio),
         scl_race = scale_values(race_ratio),
         scl_poverty = scale_values(pctPoverty),
         scl_edu = scale_values(pctBachelor)) %>% 
  mutate(scl_race_re = 0 - scl_race + 1,
         scl_edu_re = 0 - scl_edu + 1,
         scl_income_re = 0 - scl_income + 1, 
         scl_fuel_re = 0 - scl_fuel + 1,
         scl_fast_re = 0 - scl_fast + 1,
         scl_park_re = 0 - scl_park + 1,
         scl_pharm_re = 0 - scl_pharm + 1) %>% 
  mutate(score = 0.1 * (scl_hospital + scl_crime + scl_complaints +  scl_gender + scl_income_re + scl_fuel_re + scl_fast_re +  scl_park_re) + 0.05 * (scl_race_re +  scl_pharm_re) + 0.025*(scl_pop +scl_edu_re + scl_poverty +  scl_rehab)) %>% 
  st_as_sf()
  

risk_score <- risk_score %>%
  left_join(net_centroid %>% dplyr::select(uniqueID) %>% 
               st_intersection(neighborhood %>% dplyr::select(SNA_NAME)) %>% 
              st_drop_geometry(), by = "uniqueID") %>% 
  mutate(SNA_NAME = ifelse(is.na(SNA_NAME), "NOT APPLICABLE", SNA_NAME))


```

This below interactive leaflet map shows the risk score of each grid, the neighborhood that grid fall into, and whether or not it falls into a hotspot that our model predicts. We hope that this will serve as our first step towards building up our use case that present stakeholders with a clearer picture of community vulnerability to heroin overdose and that help the public to have a better understanding of the severity of heroin-overdose in Cincinnati as well as existing rehabilitation resources.  

```{r map risk, message=FALSE, warning=FALSE}

colors <- rev(viridis::viridis(6, option = "rocket"))

pal <- colorBin(palette = colors, 
                domain = risk_score$score, 
                bins = c(0.2, 0.3, 0.4, 0.5, 0.6, 0.7)) 

labs <- paste0("<strong>", risk_score$SNA_NAME,
               "</strong><br/>Heorin-Overdose Risk Score: ", risk_score$score
               ) 

leaflet <- risk_score %>%
  st_transform("EPSG:4326") %>% 
  leaflet() %>%
  setView(lng = -84.51, lat = 39.10, zoom = 10.5) %>%
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addPolygons(weight = 1,
              fillColor = ~pal(score), # change the name of the parameter in the bracket
              popup = labs,
              color = "transparent",
              group = "Risk Score", 
              fillOpacity = 1) %>%
  addPolygons(
    data = pca.logit.cv %>% filter(PredictionCat == 1) %>% st_transform("EPSG:4326"), 
    weight = 2,
    group = "Overdose Hotspot", 
    color = "yellow") %>% 
  addLayersControl(
    overlayGroups = c("Risk Score", "Overdose Hotspot"),
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  addLegend(position = "bottomright",
            pal = pal,
            values = ~risk_score, # change the name of the parameter after ~
            title = "Risk Score",
            opacity = 1) # change the name of the legend 

leaflet
mapshot(leaflet, url = "cincyrisk.html")
```


# Conclusion

In this report, we conducted a study on opioid-overdose risk modeling in Cincinnati, Ohio where the opioid epidemic has been a serious issue. Drawing on methods by Srinivasan, et al. (2023), 
we used Getis-Ord hotspot analyses to identify locations of persistent risk in Cincinnati from 2016-2020. We constructed measures of the socio-built environment and neighborhood demographic characteristics using principal components analysis and ran logistic regression to determine if places in Cincinnati fall into heroin-overdose hotspots or not. While Poisson regression has long been the traditional method used in geospatial risk model to predict the number (count) of discrete event, we propose this alternative method in this report because we believe that it is less useful for policymakers to focus on the number of heroin-overdose incidence among neighborhoods in Cincinnati. Rather, we should develop a way to quickly identify areas/clusters of high heroin-overdose incidence. 

Our model, which uses principal components as predictors, has an accuracy rate of 0.928, a sensitivity rate of 0.66, and an AUC score of 0.904, meaning that it has been effective overall in distinguishing between hotspots and cold spots. We mapped the error and compared that to the actual hotspot, from which we noticed that our model is accurate in roughly estimating the location of the hotspot but tends to make mistakes when deciding which of these grids in particular should be categorized as hotspot. On top of that, many of those grids are neighbors of hotspot grids that are correctly predicted.  

Reflecting on the overall study design, we have the following concerns, which we would like to continue working on to improve the accuracy and reliability of our model. Firstly, our study utilized a binary classification of hotspot and cold spot to describe risk. While this is a common approach in risk modeling, we should recognize the complexity within these categories as areas neighboring hotspots may also exhibit varying degrees of risk. Building onto that, another major issue is that our analysis has categorized "insignificant" grids into cold spots. In theory, cold spot should refer to places with a series of grids that have low heroin-incidence count whereas for insignificant grid, there might be a high number of incidences within specific grid. It’s just that those grids do not form clusters with surrounding grids. Merging insignificant grids into cold spot might risk overlooking heroin-incidence within those grids. 

Secondly, within the discussion, we highlight the potential imbalance in logistic regression, prompting consideration for weighted logistic regression to address disparities in the prediction of positive and negative cases. Thirdly, during cross validation, we found that our model does not generalize well across neighborhoods. This highlights significant variations between different neighborhoods in Cincinnati, which requires further investigations. Fourthly, as we need to arrogate information into a standardized spatial unit that’s different from any existing spatial unit, we need to be aware of how our model might be sensitive to errors and uncertainties introduced by area-weighted re-aggregation and boundary/edge effect. 

Regardless, the takeaways from this analysis will contribute to our understanding of the nuanced landscape of heroin-overdose crisis in Cincinnati and provide important guidance on public health initiatives aiming to address heroin addiction. 

We believe that our risk terrain model could effectively visualize heroin-overdose hotspots and could serve a good starting point for providing real-time update through calculating a risk score for each part of the city, allowing the government to conduct real-time data analysis to efficiently allocate EMS resources. We hope that our effort in bringing in data from a variety of sources would encourage data sharing between public and private sectors so that the government can manage public service and monitor its built environment. Reports can be collected from public parks, fast food restaurants, gas stations, places where most overdose incidence occurs. Having data from local hospitals and rehabilitation centers allows the government to see capacity information, thereby efficiently dispatching preparing and dispatching care so that the risk of heroin overdose is minimized before they happen. 

We look forward to seeing our risk terrain model being adapted in various scenarios that truly benefit both the stakeholders in managing community health and well-being as well as serve the needs of local residents. 


# References

Choi, J. I., Lee, J., Yeh, A. B., Lan, Q., & Kang, H. (2022). Spatial clustering of heroin-related overdose incidents: a case study in Cincinnati, Ohio. BMC public health, 22(1), 1-12. [https://doi.org/10.1186/s12889-022-13557-3](https://doi.org/10.1186/s12889-022-13557-3)

Srinivasan, S., Pustz, J., Marsh, E., Young, L. D., & Stopka, T. J. (2023). Risk factors for persistent fatal opioid-involved overdose hotspots in Massachusetts 2011-2021: A spatial statistical analysis with socio-economic, access and prescription factors.[https://doi.org/10.21203/rs.3.rs-3249650/v1](https://doi.org/10.21203/rs.3.rs-3249650/v1)

Chichester, K. R., Drawve, G., Sisson, M., Giménez-Santana, A., McCleskey, B., Goodin, B. R., ... & Cropsey, K. L. (2023). Crime and Features of the Built Environment Predicting Risk of Fatal Overdose: A Comparison of Rural and Urban Ohio Counties with Risk Terrain Modeling. American Journal of Criminal Justice, 1-25. [https://doi.org/10.1007/s12103-023-09739-3](https://doi.org/10.1007/s12103-023-09739-3)


