The purpose of this section of the methodology is to examine the loaded data, explore its structure, and perform additional processing to prepare the variables needed for machine learning models. The workflow merges all relevant variables into a single cohesive dataframe. By the end of the notebook, the output is a unified dataframe containing approximately 3,000 rows, each representing a census tract.
This notebook can be executed by loading the processed data from our GitHub repository. To reproduce our analysis, the following Python libraries are required: pandas for data manipulation, geopandas for handling spatial data, numpy for numerical operations, shapely for geometric analysis, and libpysal for spatial statistics and spatial data processing. Ensure that libpysal is installed to replicate our results accurately.
import os
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import shape, Polygon, Point
from libpysal.weights import Queen
from libpysal.weights.spatial_lag import lag_spatial
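The processed datasets themselves are loaded from the repository. A minimal loading sketch is shown below; the file names, formats, and folder layout are assumptions for illustration, not the repository's actual structure, and the list order for the landcover and Landsat files is assumed to match the indices used later in this notebook.

import glob

# Hypothetical loading step -- paths and file names are assumptions; adjust to the repository layout
PA_Tracts = gpd.read_file("data/PA_Tracts.geojson")      # census tract boundaries (GEOID, geometry)
PA_Chronic = pd.read_csv("data/PA_Chronic.csv")          # COPD / asthma prevalence by tract
PA_HRB = pd.read_csv("data/PA_HRB.csv")                  # health risk behaviors by tract
PA_Retailers = pd.read_csv("data/PA_Retailers.csv")      # tobacco retailer locations (lon, lat)
PA_Census = gpd.read_file("data/PA_Census.geojson")      # demographic predictors with geometry

# Lists of DataFrames; the order is assumed to match the indices referenced below
landcover = [pd.read_csv(f) for f in sorted(glob.glob("data/landcover/*.csv"))]   # 6 files
landsat = [pd.read_csv(f) for f in sorted(glob.glob("data/landsat/*.csv"))]       # 16 files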
Let’s first check the geometry of Pennsylvania’s census tracts. After processing, the dataset is simplified to contain only two columns: GEOID and geometry. GEOID is useful when performing plain attribute joins with other variables, while geometry will be useful when we compute nearest neighbors.
PA_Tracts.head(3)
 | GEOID | geometry |
---|---|---|
0 | 42001030101 | POLYGON ((-77.15558 40.05671, -77.15467 40.057... |
1 | 42001030103 | POLYGON ((-77.07256 39.97480, -77.07223 39.974... |
2 | 42001030104 | POLYGON ((-77.09913 39.94784, -77.09750 39.948... |
The plot below quickly visualizes census tracts in Pennsylvania.
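A minimal sketch of that plot (the figure itself is not reproduced here, and styling choices are just illustrative) could be:

# Plot census tract boundaries (sketch of the figure described above)
fig, ax = plt.subplots(figsize=(10, 6))
PA_Tracts.plot(ax=ax, facecolor="none", edgecolor="grey", linewidth=0.3)
ax.set_title("Census tracts in Pennsylvania")
ax.set_axis_off()
plt.show()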
Now, let’s check both the COPD data and the health risk behaviors data. After the cleaning process, the COPD dataset includes only 6 columns, among which we are mostly interested in LocationName (the GEOID), used to join it with the other datasets, and the prevalence of COPD.
PA_Chronic.head(3)
 | CountyName | LocationName | Asthma | TotalPopulation | TotalPop18plus | COP |
---|---|---|---|---|---|---|
0 | Allegheny | 42003141200 | 9.7 | 4007 | 3242 | 4.8 |
1 | Allegheny | 42003140100 | 10.6 | 5579 | 5066 | 4.2 |
2 | Allegheny | 42003191900 | 10.6 | 2177 | 1786 | 5.9 |
A quick visualization of COPD prevalence in Pennsylvania reveals that urban areas such as Pittsburgh and Philadelphia experience higher COPD prevalence compared to the suburban areas in their vicinity. However, several rural areas in northern and central Pennsylvania exhibit much higher rates of COPD.
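The map described above could be produced along these lines; this is only a sketch, and it assumes the COPD prevalence column is the one displayed as COP in the table above and that GEOID values match as strings after casting.

# Map COPD prevalence by tract (sketch; "COP" is the prevalence column shown in the head() output above)
chronic = PA_Chronic.copy()
chronic["GEOID"] = chronic["LocationName"].astype(str)   # align key types before joining
copd_map = PA_Tracts.merge(chronic[["GEOID", "COP"]], on="GEOID", how="left")

fig, ax = plt.subplots(figsize=(10, 6))
copd_map.plot(column="COP", cmap="Reds", legend=True, ax=ax)
ax.set_title("COPD prevalence (%) by census tract")
ax.set_axis_off()
plt.show()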
Indeed, if we look at the top 10 counties in Pennsylvania with the highest average COPD prevalence, we may notice that these are not urban counties, nor are they proximate to any urban areas. Cameron, Mifflin, Forest, and Sullivan counties are all located in central Pennsylvania. This is a rather surprising observation, as researchers often assume that people living in urban areas are more vulnerable to COPD because of air pollution, development density, and similar factors.
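That county ranking can be computed along these lines (a sketch, again assuming the prevalence column is the one displayed as COP above):

# Top 10 counties by average COPD prevalence (sketch)
top_counties = (
    PA_Chronic.groupby("CountyName")["COP"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
print(top_counties)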
The health risk behaviors dataset also includes only 6 variables, mainly the percentages of the population who smoke, drink on a regular basis, report short sleep, or get no physical activity.
PA_HRB.head(3)
 | CountyName | LocationName | Smoking | Drinking | Short_Sleep | Physical_Activity |
---|---|---|---|---|---|---|
0 | Adams | 42001030300 | 17.7 | 18.2 | 39.6 | 25.4 |
1 | Allegheny | 42003151700 | 15.2 | 21.2 | 34.8 | 20.8 |
2 | Allegheny | 42003151600 | 17.1 | 21.7 | 35.1 | 21.8 |
After loading in the retailers data PA_Retailers, we need to create a new geometry column by applying a lambda function to each row. The lambda function constructs a Point object using the longitude (lon) and latitude (lat) values from each row. The next piece of code converts the PA_Retailers DataFrame into a GeoDataFrame, specifying the geometry column and setting the coordinate reference system (CRS) to EPSG:4326.
# make retailer data spatial
PA_Retailers['geometry'] = PA_Retailers.apply(
    lambda row: Point(row['lon'], row['lat']), axis=1
)
PA_Retailers = gpd.GeoDataFrame(
    PA_Retailers,
    geometry='geometry',
    crs='EPSG:4326'
)
We included here a quick visualization of the distribution of tobacco retailers in Pennsylvania.
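One way to sketch that figure, using the layers created above (styling choices are illustrative):

# Overlay tobacco retailer points on tract boundaries (sketch of the figure described above)
fig, ax = plt.subplots(figsize=(10, 6))
PA_Tracts.plot(ax=ax, facecolor="none", edgecolor="lightgrey", linewidth=0.3)
PA_Retailers.plot(ax=ax, color="darkred", markersize=1, alpha=0.5)
ax.set_title("Tobacco retailers in Pennsylvania")
ax.set_axis_off()
plt.show()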
As mentioned in the introduction, we would like to account for environmental exposure to tobacco retailers from neighboring tracts, since people travel between tracts every day; this requires some extra steps to compute retailer density and average it across neighbors. The code below first performs a spatial join between PA_Tracts and PA_Retailers, joining the two GeoDataFrames based on their spatial relationship. The how="left" parameter ensures that all tracts are retained, and the predicate="intersects" parameter specifies that retailers intersecting with tracts should be joined. Then, we summarize the total retailers in each tract and merge the summary back to the original PA_Tracts GeoDataFrame. Any tracts without retailers are filled with a count of zero. To compute density, we also need to convert our GeoDataFrame to EPSG:3857 for more accurate area calculations. Then, the density of retailers is calculated by dividing the total number of retailers by the area of the tract.
# count retailers
tracts_with_retailers = gpd.sjoin(PA_Tracts, PA_Retailers, how="left", predicate="intersects")
tracts_summary = (
    tracts_with_retailers.groupby("GEOID")
    .size()
    .reset_index(name="total_retailers")
)

tracts_with_retailers = PA_Tracts.merge(tracts_summary, on="GEOID", how="left")
tracts_with_retailers["total_retailers"] = tracts_with_retailers["total_retailers"].fillna(0)

# compute density
density = tracts_with_retailers.to_crs(epsg=3857)
density['area_km2'] = density.geometry.area / 1e6  # Convert m² to km²
density['density'] = density['total_retailers'] / density['area_km2']
To obtain the average density across neighboring tracts, a spatial weights matrix is created from the tract geometries using the Queen contiguity method, which treats tracts as neighbors if they share a boundary or a vertex. This is done with the Queen.from_dataframe(density) function. Next, the average density of neighboring tracts is computed by applying the lag_spatial function, which uses the spatial weights matrix to calculate the spatial lag of the density column, producing a new column neighbor_avg_density in the density GeoDataFrame. Finally, the code drops the area_km2 column and stores the result as PA_Retailers.
# Create spatial weights matrix based on tract geometries
weights = Queen.from_dataframe(density)

# Compute average density of neighbors using lag_spatial
density['neighbor_avg_density'] = lag_spatial(
    weights, density['density']
)
PA_Retailers = density.drop(columns=["area_km2"])
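One detail worth flagging: with the binary weights produced by Queen.from_dataframe, lag_spatial returns the sum of neighboring values, whereas row-standardized weights yield the mean. A minimal sketch of the row-standardized variant is shown below; it is an optional alternative, not necessarily what was run above, and the column name neighbor_mean_density is just an illustration.

# Optional variant: row-standardize the weights so the spatial lag is a neighbor mean rather than a sum
weights.transform = 'r'
density['neighbor_mean_density'] = lag_spatial(weights, density['density'])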
The final PA_Retailers dataset includes only the following columns in addition to GEOID and the geometry: the total retailers in a census tract, the retailer density, and the average density across its neighboring tracts.
PA_Retailers.head(5)
 | GEOID | geometry | total_retailers | density | neighbor_avg_density |
---|---|---|---|---|---|
0 | 42001030101 | POLYGON ((-8588920.357 4874186.492, -8588818.5... | 3 | 0.031552 | 0.227339 |
1 | 42001030103 | POLYGON ((-8579678.266 4862280.523, -8579641.3... | 1 | 0.031470 | 0.158547 |
2 | 42001030104 | POLYGON ((-8582635.361 4858365.994, -8582454.6... | 7 | 0.081237 | 1.671000 |
3 | 42001030200 | POLYGON ((-8599802.938 4869004.417, -8599007.8... | 8 | 0.038641 | 0.413897 |
4 | 42001030300 | POLYGON ((-8618506.453 4863224.423, -8618215.5... | 5 | 0.026165 | 0.216389 |
The visualization of average tobacco retailer density across neighboring census tracts is not very exciting, as expected, since most of the retailers are located in urban areas. Except for Philadelphia and Pittsburgh, where the density of tobacco retailers is high, most rural areas in Pennsylvania have a low tobacco retailer density. This suggests that exposure to tobacco retail stores might not do much to explain the high COPD prevalence in central and northern Pennsylvania.
Now, let’s take a look at the census data. The processed dataset contains only the predictors we will use in the machine learning models: the percentages of the minority, aging, and disabled populations.
PA_Census.head(3)
 | GEOID | minority | aging | disability | geometry |
---|---|---|---|---|---|
0 | 42125775200 | 0.211840 | 0.215264 | 0.287671 | POLYGON ((-79.876586 40.177549, -79.876234 40.... |
1 | 42125775300 | 0.395709 | 0.131506 | 0.273739 | POLYGON ((-79.879294 40.174857, -79.878454 40.... |
2 | 42125782700 | 0.048652 | 0.314267 | 0.302433 | POLYGON ((-79.913564 40.153257, -79.913332 40.... |
For the population with disability (as well as the aging population, though not plotted), the spatial pattern is not that apparent. However, the racial minority population is clearly more concentrated in urban areas (Philadelphia and Pittsburgh). Meanwhile, some census tracts in northeastern and central Pennsylvania also exhibit more diversity in racial composition.
Let’s load in all of our land cover data. There are six of them in total representing different land cover groups: forest, grasses, high density development, low density development, water and wetland. Below is an example of the forest land cover dataset that contains four columns. In addition to the GEOID and the file name identifier, it includes the total forest pixels and number of contiguous forest pixels in each census tract.
landcover[1].head(3)
 | geoid | neighbor_landcover | total_landcover | source_file |
---|---|---|---|---|
0 | 42001030101 | 6885.043137 | 14626.749020 | forest_landcover_metrics.csv |
1 | 42001030103 | 2053.384314 | 5035.964706 | forest_landcover_metrics.csv |
2 | 42001030104 | 3582.129412 | 10099.086275 | forest_landcover_metrics.csv |
The code below demonstrates the process of merging multiple landcover datasets based on a common key (geoid) and then performing a series of transformations to organize and compute relevant landcover metrics. The code begins by merging the first dataset landcover[0] with the next dataset landcover[1] using a left join on the geoid column. This ensures that the resulting merged_df contains all rows from the first dataset, along with matching rows from the second dataset. The columns are then renamed to more meaningful names such as forest_total, forest_neighbor, and grasses_total, and unnecessary columns are dropped to clean up the data. The process of merging, renaming, and dropping columns is repeated for several additional datasets (landcover[2] to landcover[5]), progressively adding landcover data related to wetlands, water, and development density.
# combine landcover data
merged_df = pd.merge(landcover[0], landcover[1], on="geoid", how="left")
merged_df = merged_df.rename(
    columns={
        "total_landcover": "forest_total",
        "neighbor_landcover": "forest_neighbor",
        "sum": "grasses_total",
    }
)
merged_df = merged_df.drop(columns=["source_file_x", "source_file_y"])

merged_df = pd.merge(merged_df, landcover[2], on="geoid", how="left")
merged_df = merged_df.rename(columns={"sum": "wetland_total"})
merged_df = merged_df.drop(columns=["source_file"])

merged_df = pd.merge(merged_df, landcover[3], on="geoid", how="left")
merged_df = merged_df.rename(columns={"sum": "water_total"})
merged_df = merged_df.drop(columns=["source_file"])

merged_df = pd.merge(merged_df, landcover[4], on="geoid", how="left")
merged_df = merged_df.rename(columns={"sum": "Ldensity_total"})
merged_df = merged_df.drop(columns=["source_file"])

merged_df = pd.merge(merged_df, landcover[5], on="geoid", how="left")
merged_df = merged_df.rename(
    columns={
        "total_landcover": "Hdensity_total",
        "neighbor_landcover": "Hdensity_neighbor",
    }
)
PA_Landcover = merged_df.drop(columns=["source_file"])

# some metrics with landcover data
PA_Landcover['forest_pct'] = PA_Landcover['forest_neighbor'] / PA_Landcover['forest_total']
PA_Landcover['Hdensity_pct'] = PA_Landcover['Hdensity_neighbor'] / PA_Landcover['Hdensity_total']
PA_Landcover = PA_Landcover.drop(columns=["forest_neighbor", "Hdensity_neighbor"])
The final dataset would look like the following:
PA_Landcover.head(3)
 | geoid | grasses_total | forest_total | wetland_total | water_total | Ldensity_total | Hdensity_total | forest_pct | Hdensity_pct |
---|---|---|---|---|---|---|---|---|---|
0 | 42001030101 | 31291.925490 | 14626.749020 | 2675.717647 | 288.745098 | 5868.972549 | 817.870588 | 0.470716 | 0.084107 |
1 | 42001030103 | 11254.184314 | 5035.964706 | 408.607843 | 879.866667 | 3644.113725 | 283.894118 | 0.407744 | 0.042269 |
2 | 42001030104 | 40245.227451 | 10099.086275 | 2142.858824 | 405.870588 | 6041.019608 | 1041.843137 | 0.354698 | 0.032345 |
We could also visualize the spatial distribution of the aggregated number of pixels for each kind of land cover in Pennsylvania. We may see that grasslands and forests are mainly concentrated in central and northern Pennsylvania, while wetlands are mainly concentrated in western Pennsylvania. Eastern Pennsylvania has more high density development than the western part of the state.
The folder that stores the Landsat data contains 16 files, covering 4 vegetation indices for 4 seasons, each stored as an individual file. Below is an example of one of the files. It has only three columns, in which mean refers to the mean SAVI in fall for each census tract.
landsat[0].head(3)
 | GEOID | mean | source_file |
---|---|---|---|
0 | 42001030101 | 0.395006 | savi_fall.csv |
1 | 42001030103 | 0.341787 | savi_fall.csv |
2 | 42001030104 | 0.359570 | savi_fall.csv |
We need to perform the same operation as above to merge all 16 files together. The process starts by merging the first two datasets in the landsat list using a left join on GEOID and renaming the relevant columns for clarity, specifically the columns representing different vegetation indices (e.g., savi_fall, lst_summer). Unnecessary columns are then dropped to clean up the data. This merging and renaming process is repeated for the remaining datasets in the landsat list (landsat[2] to landsat[15]), progressively adding columns for different vegetation indices (e.g., evi_spring, lst_winter, ndvi_summer) across the four seasons (spring, summer, fall, winter). Once all the datasets are merged, the final dataframe, PA_Landsat, contains a comprehensive set of vegetation index data for the different seasons.
# combine landsat data
merged_df2 = pd.merge(landsat[0], landsat[1], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean_x": "savi_fall", "mean_y": "lst_summer"})
merged_df2 = merged_df2.drop(columns=["source_file_x", "source_file_y"])

merged_df2 = pd.merge(merged_df2, landsat[2], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "evi_spring"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[3], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "evi_winter"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[4], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "evi_summer"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[5], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "lst_spring"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[6], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "evi_fall"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[7], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "lst_winter"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[8], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "ndvi_summer"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[9], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "savi_spring"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[10], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "lst_fall"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[11], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "savi_winter"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[12], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "savi_summer"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[13], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "ndvi_winter"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[14], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "ndvi_fall"})
merged_df2 = merged_df2.drop(columns=["source_file"])

merged_df2 = pd.merge(merged_df2, landsat[15], on="GEOID", how="left")
merged_df2 = merged_df2.rename(columns={"mean": "ndvi_spring"})
PA_Landsat = merged_df2.drop(columns=["source_file"])
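Since the sixteen merges follow the same pattern, the block above could equivalently be written as a loop. The sketch below assumes each entry of landsat has exactly the columns GEOID, mean, and source_file, in the order listed.

# Equivalent compact version of the merge above (sketch)
landsat_names = [
    "savi_fall", "lst_summer", "evi_spring", "evi_winter", "evi_summer",
    "lst_spring", "evi_fall", "lst_winter", "ndvi_summer", "savi_spring",
    "lst_fall", "savi_winter", "savi_summer", "ndvi_winter", "ndvi_fall",
    "ndvi_spring",
]
PA_Landsat = landsat[0].rename(columns={"mean": landsat_names[0]}).drop(columns=["source_file"])
for df, name in zip(landsat[1:], landsat_names[1:]):
    df = df.rename(columns={"mean": name}).drop(columns=["source_file"])
    PA_Landsat = pd.merge(PA_Landsat, df, on="GEOID", how="left")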
Let’s check the output of the merge.
PA_Landsat.head(3)
 | GEOID | savi_fall | lst_summer | evi_spring | evi_winter | evi_summer | lst_spring | evi_fall | lst_winter | ndvi_summer | savi_spring | lst_fall | savi_winter | savi_summer | ndvi_winter | ndvi_fall | ndvi_spring |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 42001030101 | 0.395006 | 33.257296 | 0.319802 | 0.260378 | 0.568514 | 10.827598 | 0.410165 | 4.714395 | 0.762120 | 0.316083 | 19.379386 | 0.263375 | 0.517445 | 0.473559 | 0.672438 | 0.501817 |
1 | 42001030103 | 0.341787 | 33.251842 | 0.291822 | 0.238932 | 0.537638 | 9.268735 | 0.349453 | 4.470871 | 0.736301 | 0.288043 | 19.461976 | 0.241329 | 0.491634 | 0.434709 | 0.620192 | 0.470976 |
2 | 42001030104 | 0.359570 | 33.716166 | 0.337012 | 0.280906 | 0.563929 | 10.559965 | 0.368580 | 4.283530 | 0.753435 | 0.328228 | 19.039489 | 0.279969 | 0.512783 | 0.468789 | 0.616475 | 0.509266 |
We could also visualize all the variables we have. Taking into account the influence of seasonality is an important decision for more accurate prediction of COPD prevalence, given that vegetation indices change significantly depending on the season, should there be a correlation between them and COPD prevalence. In addition, spatial variation in vegetation indices across Pennsylvania is also evident in this map.
The very last step in preparing our dataframe for machine learning is to merge all dataframes together. The GEOID column in each dataset is explicitly converted to a string type to ensure consistency in the merging process. The datasets include chronic disease statistics from PA_Chronic, health risk behaviors from PA_HRB, retailer density from PA_Retailers, demographic factors from PA_Census, landcover from PA_Landcover, and satellite imagery data from PA_Landsat. The purpose of these merges is to create a comprehensive dataset, PA_Final, which combines relevant health, demographic, environmental, and spatial factors for each geographic area (defined by GEOID). This combined dataset will be useful for analyzing spatial relationships between chronic health conditions, environmental factors, and socio-economic characteristics, enabling more targeted health interventions and spatial analysis in Pennsylvania. Additionally, missing values in the final dataset are filled with zeros (fillna(0)).
# everything combined
PA_Retailers['GEOID'] = PA_Retailers['GEOID'].astype(str)
PA_Landsat['GEOID'] = PA_Landsat['GEOID'].astype(str)
PA_Landcover['GEOID'] = PA_Landcover['geoid'].astype(str)
PA_Census['GEOID'] = PA_Census['GEOID'].astype(str)
PA_Chronic['GEOID'] = PA_Chronic['LocationName'].astype(str)
PA_HRB['GEOID'] = PA_HRB['LocationName'].astype(str)

PA_Final = pd.merge(PA_Chronic, PA_HRB[["GEOID", "Smoking", "Drinking", "Short_Sleep", "Physical_Activity"]], on="GEOID", how="inner")
PA_Final = pd.merge(PA_Final, PA_Retailers[["GEOID", "neighbor_avg_density"]], on="GEOID", how="inner")
PA_Final = pd.merge(PA_Final, PA_Census[["GEOID", "minority", "aging", "disability"]], on="GEOID", how="inner")
PA_Final = pd.merge(PA_Final, PA_Landcover.drop(columns=["geoid"]), on="GEOID", how="inner")
PA_Final = pd.merge(PA_Final, PA_Landsat, on="GEOID", how="inner")
PA_Final = PA_Final.fillna(0)
Below is a snapshot of the final dataframe.
PA_Final.head(5)
 | CountyName | LocationName | Asthma | TotalPopulation | TotalPop18plus | COP | GEOID | Smoking | Drinking | Short_Sleep | ... | evi_fall | lst_winter | ndvi_summer | savi_spring | lst_fall | savi_winter | savi_summer | ndvi_winter | ndvi_fall | ndvi_spring |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Allegheny | 42003141200 | 9.7 | 4007 | 3242 | 4.8 | 42003141200 | 10.3 | 21.3 | 32.1 | ... | 0.254667 | 5.846844 | 0.569421 | 0.274674 | 23.110598 | 0.126746 | 0.342207 | 0.316698 | 0.485495 | 0.469827 |
1 | Allegheny | 42003140100 | 10.6 | 5579 | 5066 | 4.2 | 42003140100 | 10.7 | 23.9 | 34.5 | ... | 0.299216 | 6.916926 | 0.570255 | 0.297535 | 21.461009 | 0.155348 | 0.365760 | 0.347601 | 0.512010 | 0.488617 |
2 | Allegheny | 42003191900 | 10.6 | 2177 | 1786 | 5.9 | 42003191900 | 14.6 | 22.1 | 35.0 | ... | 0.250319 | 8.235525 | 0.489839 | 0.272597 | 21.101579 | 0.134626 | 0.307345 | 0.277480 | 0.455361 | 0.446655 |
3 | Allegheny | 42003320700 | 10.7 | 1613 | 1318 | 7.3 | 42003320700 | 16.4 | 21.3 | 36.0 | ... | 0.327833 | 4.928171 | 0.654718 | 0.350588 | 19.085399 | 0.144300 | 0.430196 | 0.367408 | 0.560819 | 0.565722 |
4 | Allegheny | 42003100500 | 11.3 | 2013 | 1695 | 7.7 | 42003100500 | 15.3 | 15.8 | 40.6 | ... | 0.312338 | 4.781316 | 0.632465 | 0.330390 | 20.809972 | 0.164339 | 0.414667 | 0.356294 | 0.547767 | 0.538620 |
5 rows × 39 columns
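As a quick sanity check (a sketch; the exact row count depends on the inner joins above), the dimensions of the combined dataframe can be compared against the roughly 3,000 census tracts mentioned at the beginning of this section.

# Expect roughly 3,000 rows (census tracts) and 39 columns
print(PA_Final.shape)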