Machine Learning Data Preparation

The purpose of this section of the methodology is to examine the loaded data, explore its structure, and perform additional processing to prepare the variables needed for machine learning models. The workflow merges all relevant variables into a single cohesive dataframe. By the end of the notebook, the output is a unified dataframe containing approximately 3,000 rows, each representing a census tract.

This notebook can be executed by loading the processed data from our GitHub repository. To reproduce our analysis, the following Python libraries are required: pandas for data manipulation, geopandas for handling spatial data, numpy for numerical operations, shapely for geometric analysis, and libpysal for spatial statistics and spatial data processing. Ensure that libpysal is installed to replicate our results accurately.

Code
import os
import pandas as pd
import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import shape, Polygon, Point
from libpysal.weights import Queen
from libpysal.weights.spatial_lag import lag_spatial
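
If you are running the notebook locally, the processed inputs can be read in before the steps below. The snippet is only a minimal sketch: the folder and file names are hypothetical placeholders, not the exact paths in our repository.

Code
# hypothetical local paths -- replace with the processed files from the repository
PA_Tracts = gpd.read_file("data/PA_Tracts.geojson")      # tract GEOID + geometry
PA_Chronic = pd.read_csv("data/PA_Chronic.csv")          # COPD / asthma prevalence
PA_HRB = pd.read_csv("data/PA_HRB.csv")                  # health risk behaviors
PA_Retailers = pd.read_csv("data/PA_Retailers.csv")      # tobacco retailers with lon/lat
PA_Census = gpd.read_file("data/PA_Census.geojson")      # demographic predictors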

PA Geometry

Let’s first check the geometry of Pennsylvania’s census tracts. After processing, the dataset is simplified to contain only two columns: GEOID and geometry. GEOID is useful for performing attribute joins with the other variables, while geometry will be needed when we compute spatial relationships such as neighboring tracts.

Code
PA_Tracts.head(3)
GEOID geometry
0 42001030101 POLYGON ((-77.15558 40.05671, -77.15467 40.057...
1 42001030103 POLYGON ((-77.07256 39.97480, -77.07223 39.974...
2 42001030104 POLYGON ((-77.09913 39.94784, -77.09750 39.948...

The plot below quickly visualizes census tracts in Pennsylvania.
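
The rendered figure is not reproduced here, but a plot along these lines could be generated with GeoDataFrame.plot(); the styling choices below are only suggestions.

Code
# quick map of Pennsylvania census tract geometries
fig, ax = plt.subplots(figsize=(8, 6))
PA_Tracts.plot(ax=ax, color="lightgrey", edgecolor="white", linewidth=0.2)
ax.set_title("Census tracts in Pennsylvania")
ax.set_axis_off()
plt.show()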

COPD Data and Health Risk Behaviors

Now, let’s check both the COPD data and the health risk behaviors data. After the cleaning process, the COPD dataset includes only six columns, among which we are mostly interested in LocationName (the GEOID), used to join it with the other datasets, and the prevalence of COPD.

Code
PA_Chronic.head(3)
CountyName LocationName Asthma TotalPopulation TotalPop18plus COPD
0 Allegheny 42003141200 9.7 4007 3242 4.8
1 Allegheny 42003140100 10.6 5579 5066 4.2
2 Allegheny 42003191900 10.6 2177 1786 5.9

A quick visualization of COPD prevalence in Pennsylvania reveals that urban areas such as Pittsburgh and Philadelphia experience higher COPD prevalence compared to the suburban areas in their vicinity. However, several rural areas in northern and central Pennsylvania exhibit much higher rates of COPD.

Indeed, if we look at the top 10 counties in Pennsylvania with the highest average COPD prevalence, we may notice that these are not urban counties, nor are they close to any urban areas. Cameron, Mifflin, Forest, and Sullivan counties are all located in central Pennsylvania. This is a rather surprising observation, as researchers often assume that people living in urban areas are more vulnerable to COPD because of air pollution, development density, and related factors.
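
As a rough sketch, the county ranking described above could be reproduced with a simple groupby; this assumes the prevalence column in PA_Chronic is named COPD.

Code
# average tract-level COPD prevalence by county, then keep the ten highest
top10_copd = (
    PA_Chronic.groupby("CountyName")["COPD"]
    .mean()
    .nlargest(10)
)
print(top10_copd)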

The health risk behaviors dataset likewise includes only six variables, mainly the percentages of the population who smoke, who drink on a regular basis, who get insufficient sleep, and who get no physical activity.

Code
PA_HRB.head(3)
CountyName LocationName Smoking Drinking Short_Sleep Physical_Activity
0 Adams 42001030300 17.7 18.2 39.6 25.4
1 Allegheny 42003151700 15.2 21.2 34.8 20.8
2 Allegheny 42003151600 17.1 21.7 35.1 21.8

Tobacco Retailer Data

After loading the retailers data PA_Retailers, we need to create a new geometry column by applying a lambda function to each row. The lambda function constructs a Point object from the longitude (lon) and latitude (lat) values of each row. The next piece of code converts the PA_Retailers DataFrame into a GeoDataFrame, specifying the geometry column and setting the coordinate reference system (CRS) to EPSG:4326.

# make retailer data spatial
PA_Retailers['geometry'] = PA_Retailers.apply(
    lambda row: Point(row['lon'], row['lat']), axis=1
)
PA_Retailers = gpd.GeoDataFrame(
    PA_Retailers,
    geometry='geometry',
    crs='EPSG:4326'
)

We included here a quick visualization of the distribution of tobacco retailers in Pennsylvania.
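
A point map along these lines would show the retailer locations over the tract boundaries; the exact styling of our figure may differ.

Code
# tobacco retailer locations over tract boundaries (both layers in EPSG:4326)
fig, ax = plt.subplots(figsize=(8, 6))
PA_Tracts.boundary.plot(ax=ax, color="lightgrey", linewidth=0.2)
PA_Retailers.plot(ax=ax, color="red", markersize=1, alpha=0.3)
ax.set_title("Tobacco retailers in Pennsylvania")
ax.set_axis_off()
plt.show()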

Since we mentioned in the introduction that we would like to account for environmental exposure to tobacco retailers from neighboring tracts, given that people travel between tracts every day, a few extra steps are needed to compute retailer density and average it across each tract’s neighbors. The code below first performs a spatial join between PA_Tracts and PA_Retailers, joining the two GeoDataFrames based on their spatial relationship. The how="left" parameter ensures that all tracts are retained, and the predicate="intersects" parameter specifies that retailers intersecting a tract are joined to it. We then summarize the total number of retailers in each tract and merge the summary back into the original PA_Tracts GeoDataFrame; tracts without retailers are filled with a count of zero. To compute density, we also reproject our GeoDataFrame to EPSG:3857 for more accurate area calculations. The density of retailers is then calculated by dividing the total number of retailers by the area of the tract.

# count retailers in each tract: the spatial join keeps every tract (how="left"),
# so we count only rows where a retailer was actually matched (non-null index_right)
tracts_with_retailers = gpd.sjoin(PA_Tracts, PA_Retailers, how="left", predicate="intersects")
tracts_summary = (
    tracts_with_retailers.groupby("GEOID")["index_right"]
    .count()
    .reset_index(name="total_retailers")
)

tracts_with_retailers = PA_Tracts.merge(tracts_summary, on="GEOID", how="left")
tracts_with_retailers["total_retailers"] = tracts_with_retailers["total_retailers"].fillna(0)

# compute density (retailers per km²)
density = tracts_with_retailers.to_crs(epsg=3857)
density['area_km2'] = density.geometry.area / 1e6  # convert m² to km²
density['density'] = density['total_retailers'] / density['area_km2']

To obtain the average density across neighboring tracts, a spatial weights matrix is created from the tract geometries using the Queen contiguity method, which treats tracts as neighbors if they share a boundary or a vertex. This is done with the Queen.from_dataframe(density) function. Next, the average density of neighboring tracts is computed by applying the lag_spatial function, which uses the spatial weights matrix to calculate the spatial lag of the density column, producing a new neighbor_avg_density column in the density GeoDataFrame. Finally, the code drops the area_km2 column and assigns the result to PA_Retailers.

# Create spatial weights matrix based on tract geometries
weights = Queen.from_dataframe(density)

# Compute average density of neighbors using lag_spatial
density['neighbor_avg_density'] = lag_spatial(
    weights, density['density']
)
PA_Retailers = density.drop(columns=["area_km2"])

The final PA_Retailers dataset includes only the following columns in addition to GEOID and geometry: the total number of retailers in a census tract, the retailer density, and the average density across its neighboring tracts.

Code
PA_Retailers.head(5)
GEOID geometry total_retailers density neighbor_avg_density
0 42001030101 POLYGON ((-8588920.357 4874186.492, -8588818.5... 3 0.031552 0.227339
1 42001030103 POLYGON ((-8579678.266 4862280.523, -8579641.3... 1 0.031470 0.158547
2 42001030104 POLYGON ((-8582635.361 4858365.994, -8582454.6... 7 0.081237 1.671000
3 42001030200 POLYGON ((-8599802.938 4869004.417, -8599007.8... 8 0.038641 0.413897
4 42001030300 POLYGON ((-8618506.453 4863224.423, -8618215.5... 5 0.026165 0.216389

The visualization of average tobacco retailer density across neighboring census tracts is not very exciting, as expected, since most retailers are located in urban areas. Except for Philadelphia and Pittsburgh, where tobacco retailer density is high, most rural areas in Pennsylvania have a low tobacco retailer density. This suggests that exposure to tobacco retail stores might not have a significant effect in explaining the high COPD prevalence in central and northern Pennsylvania.
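
For reference, a choropleth of the neighborhood density could be drawn as follows; a quantile classification (via the optional mapclassify package) may make the skewed distribution easier to read, but the default continuous scale also works.

Code
# choropleth of average tobacco retailer density across neighboring tracts
fig, ax = plt.subplots(figsize=(8, 6))
PA_Retailers.plot(column="neighbor_avg_density", cmap="viridis", legend=True, ax=ax)
ax.set_title("Average neighboring-tract retailer density (per km²)")
ax.set_axis_off()
plt.show()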

Census Data

Now, let’s take a look at the census data. The processed dataset contains only the predictors we will use in the machine learning models: the percentages of the minority, aging, and disabled population.

Code
PA_Census.head(3)
GEOID minority aging disability geometry
0 42125775200 0.211840 0.215264 0.287671 POLYGON ((-79.876586 40.177549, -79.876234 40....
1 42125775300 0.395709 0.131506 0.273739 POLYGON ((-79.879294 40.174857, -79.878454 40....
2 42125782700 0.048652 0.314267 0.302433 POLYGON ((-79.913564 40.153257, -79.913332 40....

For the population with disability (as well as the aging population, though not plotted), the spatial pattern is not that apparent. However, the racial minority population is clearly more concentrated in urban areas (Philadelphia and Pittsburgh). Meanwhile, some census tracts in northeastern and central Pennsylvania also exhibit more diversity in racial composition.

Land Cover Data

Let’s load in all of our land cover data. There are six datasets in total, representing different land cover groups: forest, grasses, high-density development, low-density development, water, and wetland. Below is an example of the forest land cover dataset, which contains four columns. In addition to the GEOID and the file name identifier, it includes the total number of forest pixels and the number of contiguous forest pixels in each census tract.
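
If the summaries are stored as one CSV per land cover group, the landcover list could be assembled as in the sketch below. The folder and most file names are hypothetical, and the list order matters: the merge code further down assumes landcover[0] is grasses, landcover[1] forest, landcover[2] wetland, landcover[3] water, landcover[4] low-density development, and landcover[5] high-density development.

Code
# hypothetical file names and order -- adjust to match the actual processed files
landcover_files = [
    "data/landcover/grasses_landcover_metrics.csv",
    "data/landcover/forest_landcover_metrics.csv",
    "data/landcover/wetland_landcover_metrics.csv",
    "data/landcover/water_landcover_metrics.csv",
    "data/landcover/low_density_landcover_metrics.csv",
    "data/landcover/high_density_landcover_metrics.csv",
]
landcover = [pd.read_csv(f) for f in landcover_files]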

Code
landcover[1].head(3)
geoid neighbor_landcover total_landcover source_file
0 42001030101 6885.043137 14626.749020 forest_landcover_metrics.csv
1 42001030103 2053.384314 5035.964706 forest_landcover_metrics.csv
2 42001030104 3582.129412 10099.086275 forest_landcover_metrics.csv

The code below demonstrates the process of merging the land cover datasets on a common key (geoid) and then performing a series of transformations to organize and compute the relevant land cover metrics. It begins by merging the first dataset landcover[0] with the next dataset landcover[1] using a left join on the geoid column. This ensures that the resulting merged_df contains all rows from the first dataset, along with matching rows from the second dataset. The columns are then renamed to more meaningful names such as forest_total, forest_neighbor, and grasses_total, and unnecessary columns are dropped to clean up the data. The process of merging, renaming, and dropping columns is repeated for the remaining datasets (landcover[2] to landcover[5]), progressively adding land cover data for wetlands, water, and development density.

Code
# combine landcover data
merged_df = pd.merge(landcover[0], landcover[1], on="geoid", how="left")
merged_df = merged_df.rename(
    columns={
        "total_landcover": "forest_total",
        "neighbor_landcover": "forest_neighbor",
        "sum": "grasses_total",
    }
)
merged_df = merged_df.drop(columns=["source_file_x", "source_file_y"])

merged_df = pd.merge(merged_df, landcover[2], on="geoid", how="left")
merged_df = merged_df.rename(columns={"sum": "wetland_total"})
merged_df = merged_df.drop(columns=["source_file"])

merged_df = pd.merge(merged_df, landcover[3], on="geoid", how="left")
merged_df = merged_df.rename(columns={"sum": "water_total"})
merged_df = merged_df.drop(columns=["source_file"])

merged_df = pd.merge(merged_df, landcover[4], on="geoid", how="left")
merged_df = merged_df.rename(columns={"sum": "Ldensity_total"})
merged_df = merged_df.drop(columns=["source_file"])

merged_df = pd.merge(merged_df, landcover[5], on="geoid", how="left")
merged_df = merged_df.rename(
    columns={
        "total_landcover": "Hdensity_total",
        "neighbor_landcover": "Hdensity_neighbor",
    }
)
PA_Landcover = merged_df.drop(columns=["source_file"])

# some metrics with landcover data
PA_Landcover['forest_pct'] = PA_Landcover['forest_neighbor'] / PA_Landcover['forest_total']
PA_Landcover['Hdensity_pct'] = PA_Landcover['Hdensity_neighbor'] / PA_Landcover['Hdensity_total']
PA_Landcover = PA_Landcover.drop(columns=["forest_neighbor", "Hdensity_neighbor"])

The final dataset would look like the following:

Code
PA_Landcover.head(3)
geoid grasses_total forest_total wetland_total water_total Ldensity_total Hdensity_total forest_pct Hdensity_pct
0 42001030101 31291.925490 14626.749020 2675.717647 288.745098 5868.972549 817.870588 0.470716 0.084107
1 42001030103 11254.184314 5035.964706 408.607843 879.866667 3644.113725 283.894118 0.407744 0.042269
2 42001030104 40245.227451 10099.086275 2142.858824 405.870588 6041.019608 1041.843137 0.354698 0.032345

We can also visualize the spatial distribution of the aggregated pixel counts for each kind of land cover in Pennsylvania. Grasslands and forests are mainly concentrated in central and northern Pennsylvania, wetlands are mainly concentrated in western Pennsylvania, and eastern Pennsylvania has more high-density development than the western part.

Landsat Data

The folder that stores the Landsat data contains 16 files: four Landsat-derived metrics (SAVI, EVI, NDVI, and land surface temperature) for each of the four seasons, with each combination stored as an individual file. Below is an example of one of the files. It has only three columns, in which mean refers to the mean fall SAVI for each census tract.

Code
landsat[0].head(3)
GEOID mean source_file
0 42001030101 0.395006 savi_fall.csv
1 42001030103 0.341787 savi_fall.csv
2 42001030104 0.359570 savi_fall.csv

We need to perform the same operation as above to merge all 16 files. The process starts by merging the first two datasets in the landsat list using a left join on GEOID, renaming the mean columns for clarity (e.g., savi_fall, lst_summer). Unnecessary columns are then dropped to clean up the data. This merging and renaming process is repeated for the remaining datasets in the landsat list (landsat[2] to landsat[15]), progressively adding columns for the different metrics (e.g., evi_spring, lst_winter, ndvi_summer) across the four seasons (spring, summer, fall, winter). Once all the datasets are merged, the final dataframe, PA_Landsat, contains the complete set of seasonal Landsat metrics.

Code
# combine landsat data
# each file contributes one tract-level mean; the names below follow the file order in `landsat`
landsat_cols = [
    "savi_fall", "lst_summer", "evi_spring", "evi_winter", "evi_summer",
    "lst_spring", "evi_fall", "lst_winter", "ndvi_summer", "savi_spring",
    "lst_fall", "savi_winter", "savi_summer", "ndvi_winter", "ndvi_fall",
    "ndvi_spring",
]

merged_df2 = landsat[0].rename(columns={"mean": landsat_cols[0]}).drop(columns=["source_file"])
for i in range(1, len(landsat)):
    merged_df2 = pd.merge(
        merged_df2,
        landsat[i].rename(columns={"mean": landsat_cols[i]}).drop(columns=["source_file"]),
        on="GEOID",
        how="left",
    )
PA_Landsat = merged_df2

Let’s check the output of the merge.

Code
PA_Landsat.head(3)
GEOID savi_fall lst_summer evi_spring evi_winter evi_summer lst_spring evi_fall lst_winter ndvi_summer savi_spring lst_fall savi_winter savi_summer ndvi_winter ndvi_fall ndvi_spring
0 42001030101 0.395006 33.257296 0.319802 0.260378 0.568514 10.827598 0.410165 4.714395 0.762120 0.316083 19.379386 0.263375 0.517445 0.473559 0.672438 0.501817
1 42001030103 0.341787 33.251842 0.291822 0.238932 0.537638 9.268735 0.349453 4.470871 0.736301 0.288043 19.461976 0.241329 0.491634 0.434709 0.620192 0.470976
2 42001030104 0.359570 33.716166 0.337012 0.280906 0.563929 10.559965 0.368580 4.283530 0.753435 0.328228 19.039489 0.279969 0.512783 0.468789 0.616475 0.509266

We can also visualize all of the variables we have. Taking seasonality into account is an important decision for more accurate prediction of COPD prevalence: vegetation indices change significantly with the season, so if they are correlated with COPD prevalence, capturing that seasonal variation matters. In addition, spatial variation in vegetation indices across Pennsylvania is also evident in this map.
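
As an illustration of that seasonal variation, the NDVI columns could be mapped side by side after joining PA_Landsat back to the tract geometries; this is only a sketch, not the exact figure from our analysis.

Code
# join seasonal NDVI means to tract geometries and map them as small multiples
PA_Landsat["GEOID"] = PA_Landsat["GEOID"].astype(str)  # cast both keys to string
PA_Tracts["GEOID"] = PA_Tracts["GEOID"].astype(str)
ndvi_map = PA_Tracts.merge(
    PA_Landsat[["GEOID", "ndvi_spring", "ndvi_summer", "ndvi_fall", "ndvi_winter"]],
    on="GEOID", how="left"
)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, season in zip(axes.ravel(), ["spring", "summer", "fall", "winter"]):
    ndvi_map.plot(column=f"ndvi_{season}", cmap="Greens", vmin=0, vmax=1, ax=ax)
    ax.set_title(f"NDVI ({season})")
    ax.set_axis_off()
plt.tight_layout()
plt.show()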

Merging All Data

The very last step in preparing our dataframe for machine learning is to merge all of the dataframes together. The GEOID column in each dataset is explicitly converted to a string type to ensure consistency during merging. The datasets include chronic disease statistics from PA_Chronic, health risk behaviors from PA_HRB, retailer density from PA_Retailers, demographic factors from PA_Census, land cover from PA_Landcover, and satellite imagery data from PA_Landsat. The purpose of these merges is to create a comprehensive dataset, PA_Final, which combines the relevant health, demographic, environmental, and spatial factors for each geographic area (defined by GEOID). This combined dataset will be useful for analyzing spatial relationships between chronic health conditions, environmental factors, and socio-economic characteristics, enabling more targeted health interventions and spatial analysis in Pennsylvania. Finally, missing values in the merged dataset are filled with zeros (fillna(0)).

Code
# everything combined
PA_Retailers['GEOID'] = PA_Retailers['GEOID'].astype(str)
PA_Landsat['GEOID'] = PA_Landsat['GEOID'].astype(str)
PA_Landcover['GEOID'] = PA_Landcover['geoid'].astype(str)
PA_Census['GEOID'] = PA_Census['GEOID'].astype(str)
PA_Chronic['GEOID'] = PA_Chronic['LocationName'].astype(str)
PA_HRB['GEOID'] = PA_HRB['LocationName'].astype(str)

PA_Final = pd.merge(PA_Chronic, PA_HRB[["GEOID", "Smoking", "Drinking", "Short_Sleep", "Physical_Activity"]], on="GEOID", how="inner")
PA_Final = pd.merge(PA_Final, PA_Retailers[["GEOID", "neighbor_avg_density"]],  on="GEOID", how="inner")
PA_Final = pd.merge(PA_Final, PA_Census[["GEOID", "minority", "aging", "disability"]],  on="GEOID", how="inner")
PA_Final = pd.merge(PA_Final, PA_Landcover.drop(columns=["geoid"]),  on="GEOID", how="inner")
PA_Final = pd.merge(PA_Final, PA_Landsat,  on="GEOID", how="inner")
PA_Final = PA_Final.fillna(0)

Below is a snapshot of the final dataframe.

Code
PA_Final.head(5)
CountyName LocationName Asthma TotalPopulation TotalPop18plus COPD GEOID Smoking Drinking Short_Sleep ... evi_fall lst_winter ndvi_summer savi_spring lst_fall savi_winter savi_summer ndvi_winter ndvi_fall ndvi_spring
0 Allegheny 42003141200 9.7 4007 3242 4.8 42003141200 10.3 21.3 32.1 ... 0.254667 5.846844 0.569421 0.274674 23.110598 0.126746 0.342207 0.316698 0.485495 0.469827
1 Allegheny 42003140100 10.6 5579 5066 4.2 42003140100 10.7 23.9 34.5 ... 0.299216 6.916926 0.570255 0.297535 21.461009 0.155348 0.365760 0.347601 0.512010 0.488617
2 Allegheny 42003191900 10.6 2177 1786 5.9 42003191900 14.6 22.1 35.0 ... 0.250319 8.235525 0.489839 0.272597 21.101579 0.134626 0.307345 0.277480 0.455361 0.446655
3 Allegheny 42003320700 10.7 1613 1318 7.3 42003320700 16.4 21.3 36.0 ... 0.327833 4.928171 0.654718 0.350588 19.085399 0.144300 0.430196 0.367408 0.560819 0.565722
4 Allegheny 42003100500 11.3 2013 1695 7.7 42003100500 15.3 15.8 40.6 ... 0.312338 4.781316 0.632465 0.330390 20.809972 0.164339 0.414667 0.356294 0.547767 0.538620

5 rows × 39 columns
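
As a final optional step, the prepared dataframe can be written out so the modeling notebooks can pick it up directly; the file name here is a hypothetical placeholder.

Code
# persist the ML-ready dataframe (hypothetical path)
PA_Final.to_csv("data/PA_Final.csv", index=False)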
