Vector Data and US Census Bureau API

To get all the vector data required in this study, we used the following packages. The cenpy package provides a streamlined interface for accessing data from the U.S. Census Bureau’s APIs, such as the American Community Survey (ACS) and Decennial Census. It simplifies querying demographic, socioeconomic, and housing data directly from census databases. The pygris package focuses on working with the U.S. Census Bureau’s geographic data, particularly boundary files like TIGER/Line shapefiles. We would also need geopandas and pandas for working with spatial and tabular data, respectively, to load and further process downloaded data.

Note: The dataset from CDC is 760 MB and therefore cannot be commited to GitHub. It is currently stored in a private folder ignored by the repository. Given the size of the original data, we used Colab for all the vector data loading and processing work as well. The code here are just for demonstration purposes. All file paths have been removed.

If you wish to reproduce this study, you may download the original CDC data that’s available here. Alternatively, you may head to the next step in the Methodology section and follow the steps there to load in proccessed data.

Tobacco Retailer Data

The original retailer dataset from PA Open Data Portal is first loaded from a CSV file. Next, the data is filtered to include only rows where the state column equals ‘PA’. This extra steps ensures that we are only including the retailers from Pennsylvania. The resulting subset is refined to retain only the columns of interest: county, license_type, lat, and lon, which provide relevant spatial and categorical information for further analysis.

all_retailers = pd.read_csv('######')
pa_retailers = all_retailers[all_retailers['state'] == 'PA']
pa_retailers = pa_retailers[["county", "license_type", "lat", "lon"]]

The processed data is then exported.

pa_retailers.to_csv('#####', index=False)

COPD Data and Health Risks Behaviors

As mentioned in the introduction, the CDC data contains census tract level estimates of 27 disease measures for 500 major cities in the U.S, which totals about 3 million rows. After reading in the data, we need to significnatly trim down the dataset to the information we need: we are mainly interested in the prevalence of COPD as well as the number of adults with health risks behavior.

cdc_data = pd.read_csv("######")

According to CDC, they computed a probability among adults who report having ever been told by a doctor, nurse, or other health professional they had chronic obstructive pulmonary disease (COPD), emphysema, or chronic bronchitis. The probability was then applied to the detailed population estimates at the appropriate geographic level to generate the prevalence. More information can be found here.

With the raw cdc_data, we select rows where Measure is either “Current asthma among adults” or “Chronic obstructive pulmonary disease among adults,” and the state abbreviation is “PA”. The asthma and COPD subsets are then merged based on the common column LocationName (essentially the GEOID), using a left join to preserve all locations in the asthma data. The merged dataset is renamed to include clearer column labels, where Asthma represents asthma prevalence and COP represents COPD prevalence.

# process CRD data
PA_Asthma = cdc_data[(cdc_data['Measure'] == "Current asthma among adults") & (cdc_data['StateAbbr'] == "PA")]
PA_COP = cdc_data[(cdc_data['Measure'] == "Chronic obstructive pulmonary disease among adults") & (cdc_data['StateAbbr'] == "PA")]
PA_Chronic = PA_Asthma.merge(
    PA_COP[['LocationName', 'Data_Value']],
    on="LocationName",
    how="left"
).rename(columns={"Data_Value_x": "Asthma", "Data_Value_y": "COP"})

This dataset is then exported to a new CSV file to hold our outcome variables. While we only intend to use COPD is the outcome variable, we also include asthma here in case we need it for further analysis.

PA_Chronic.to_csv('######', index=False)

We then extract health risk behavior data from the same cdc_data. The PLACES Health Risk Behaviors data capture estimated prevalence for various U.S. adult behaviors that pose a risk to health, from binge drinking to smoking, lack of physical inactivity, and short sleep duration. Among those, binge drinking adults are those who report having ≥5 drinks (men) or ≥4 drinks (women) on ≥1 occasion during the previous 30 days. Smoking adults are those who report having smoked ≥ 100 cigarettes in their lifetime and currently smoke every day or some days. Lack of physical activity refers to adults having no leisure-time physical activity during the past month. Short sleep duration refers to those who have less than 7 hours of sleep.

For more detailed explaination of each variable, you may refer to this page.

The code below extracts subsets for four specific health behaviors. Then, each subset is filtered for rows where the Measure matches the behavior and the StateAbbr is “PA”. The subsets are sequentially merged on the common LocationName column using left joins, ensuring that all locations from the smoking data are preserved. The merged dataset, PA_HRB, contains columns for the prevalence of each behavior, renamed for clarity.

# process HRB data
PA_Smoking = cdc_data[(cdc_data['Measure'] == "Current cigarette smoking among adults") & (cdc_data['StateAbbr'] == "PA")]
PA_Drinking = cdc_data[(cdc_data['Measure'] == "Binge drinking among adults") & (cdc_data['StateAbbr'] == "PA")]
PA_Physical_Activity = cdc_data[(cdc_data['Measure'] == "No leisure-time physical activity among adults") & (cdc_data['StateAbbr'] == "PA")]
PA_Short_Sleep = cdc_data[(cdc_data['Measure'] == "Short sleep duration among adults") & (cdc_data['StateAbbr'] == "PA")]

PA_HRB = PA_Smoking.merge(
    PA_Drinking[['LocationName', 'Data_Value']], on='LocationName', how='left'
).rename(columns={"Data_Value_x": "Smoking", "Data_Value_y": "Drinking"})

PA_HRB = PA_HRB.merge(
    PA_Physical_Activity[['LocationName', 'Data_Value']], on='LocationName', how='left'
).rename(columns={'Data_Value': 'Physical_Activity'})

PA_HRB = PA_HRB.merge(
    PA_Short_Sleep[['LocationName', 'Data_Value']], on='LocationName', how='left'
).rename(columns={'Data_Value': 'Short_Sleep'})
PA_HRB[['LocationName', 'Smoking', 'Drinking', 'Physical_Activity', 'Short_Sleep']]

This dataset is then exported to a new CSV file.

PA_HRB.to_csv('######', index=False)

Census Data

The code below connects to the Census Bureau’s ACS (American Community Survey) 5-Year Data (2022) using the cenpy package and specifies a set of variables to retrieve demographic, age-related, and disability-related characteristics. These include total population, race and ethnicity breakdowns (White, Black, Native American, Asian, and Hispanic populations), age distributions for individuals 65 years and older, and disability prevalence across various age and gender groups. By accessing these data, the project aims to understand the factors associated with public health outcomes, particularly COPD (Chronic Obstructive Pulmonary Disease).

acs = cenpy.remote.APIConnection("ACSDT5Y2022")

Including these census variables is critical for analyzing COPD rates because the condition is influenced by demographic, social, and environmental factors. For example, COPD prevalence varies across racial and ethnic groups due to disparities in healthcare access, environmental exposures, and socioeconomic conditions, making it essential to include race and ethnicity data. Age distribution data is also vital since COPD predominantly affects older adults; understanding the size and characteristics of elderly populations helps identify at-risk groups. Additionally, disability prevalence provides insight into potential comorbidities and healthcare needs among populations with COPD, as individuals with disabilities often face heightened vulnerability to chronic conditions. By integrating these data, the project can explore correlations between COPD rates and demographic characteristics, identify vulnerable subpopulations, and inform targeted public health strategies. This comprehensive approach supports evidence-based decision-making to address health disparities and improve outcomes.

census_var = ["NAME",
              "B02001_001E", # total
              "B02001_002E", # white
              "B02001_003E", # black
              "B02001_004E", # native american
              "B02001_005E", # asian
              "B03002_012E", # hispanic
              'B01001_020E', # male 65-66
              'B01001_021E', # male 67-69
              'B01001_022E', # male 70-74
              'B01001_023E', # male 75-79
              'B01001_024E', # male 80-84
              'B01001_025E', # male over 85
              'B01001_044E', # female 65-66
              'B01001_045E', # female 67-69
              'B01001_046E', # female 70-74
              'B01001_047E', # female 75-79
              'B01001_048E', # female 80-84
              'B01001_049E', # female over 85
              'B18101_007E', # Male 5 to 17 years With a disability
              'B18101_010E', # Male 18 to 34 years With a disability
              'B18101_013E', # Male 35 to 64 years With a disability
              'B18101_016E', # Male 65 to 74 years With a disability
              'B18101_019E', # Male over 75 years With a disability
              'B18101_026E', # Female 5 to 17 years With a disability
              'B18101_029E', # Female 18 to 34 years With a disability
              'B18101_032E', # Female 35 to 64 years With a disability
              'B18101_035E', # Female 65 to 74 years With a disability
              'B18101_038E'
             ]

After getting the data, we calculated the three indices we are interested in: percentage of racial minority population, percentage of population with disability, and percentage of aging population. Specifically, the minority column calculates the proportion of the population that identifies as a racial or ethnic minority. This is achieved by subtracting the White population B02001_002E from the total population B02001_001E and dividing by the total population. The aging column computes the proportion of elderly individuals aged 65 and older, using a sum of relevant age-specific population variables B01001_020E to B01001_049E divided by the total population. Similarly, the disability column calculates the proportion of the population with a disability, based on age and gender-specific disability data B18101_007E to B18101_038E.

pa_state_code = "42"
census_data = acs.query(
    cols=census_var,
    geo_unit="tract",
    geo_filter={"state": pa_state_code}
)
for variable in census_var:
    if variable != "NAME":
        census_data[variable] = census_data[variable].astype(float)

census_data['minority'] = (
    (census_data['B02001_001E'] - census_data['B02001_002E']) / census_data['B02001_001E']
)
census_data['aging'] = (
    census_data[[
        'B01001_020E', 'B01001_021E', 'B01001_022E', 'B01001_023E',
        'B01001_024E', 'B01001_025E', 'B01001_044E', 'B01001_045E',
        'B01001_046E', 'B01001_047E', 'B01001_048E', 'B01001_049E'
    ]].sum(axis=1) / census_data['B02001_001E']
)
census_data['disability'] = (
    census_data[[
        'B18101_007E', 'B18101_010E', 'B18101_013E', 'B18101_016E',
        'B18101_019E', 'B18101_026E', 'B18101_029E', 'B18101_032E',
        'B18101_035E', 'B18101_038E'
    ]].sum(axis=1) / census_data['B02001_001E']
)

The last step here simply merges demographic metrics with geographic boundaries at the tract level. Key variables, including minority representation, aging population, and disability prevalence, are filtered alongside geographic identifiers. The purpose is to append GEOID onto the dataframe, making it easier to work with later.

census_data = census_data[["NAME", "county", "tract", "minority", "aging", "disability"]]
tracts = pygris.tracts(state=pa_state_code, year=2022)
pa_census_data = tracts.merge(census_data, left_on=["COUNTYFP", "TRACTCE"], right_on=["county", "tract"],)
pa_census_data = pa_census_data[["GEOID", "minority", "aging", "disability", "geometry"]]

pa_census_data.to_csv('######', index=False)