Discussions
Findings Summary
In short summary, this study offers a multifaceted exploration of how various predictors—both social-behavioral and environmental—interact with seasonal changes and dimensionality reduction techniques to influence the prediction of COPD rates. One of the most consistent findings across all models is the prominence of physical activity and development density as top predictors of COPD rates. These predictors retain their significance regardless of seasonal variations, underscoring their role as fundamental health-related factors. In contrast, environmental predictors such as temperature and vegetation indices demonstrate greater sensitivity to seasonal fluctuations. Notably, their relevance diminishes during winter, reflecting a seasonal reduction in their predictive power. This observation aligns with broader climatic patterns, where winter conditions—such as reduced vegetation and lower temperatures—may interact differently with COPD-related health outcomes.
The evaluation of model performance highlights seasonal differences, with spring models achieving the highest R² values, while winter models show the lowest R². This trend suggests that spring conditions offer more predictable patterns for COPD rates, possibly due to more stable environmental. Conversely, the reduced performance in winter indicates challenges in capturing complex seasonal dynamics, such as cold temperatures or poorer air quality, which may disproportionately impact COPD outcomes. The use of cross-validation led to slightly lower R² values compared to a simple train-test split. This is expected, as cross-validation evaluates model performance across multiple subsets of the data, providing a more reliable but conservative estimate of accuracy.
A closer look at the residual versus predicted plots reveals a notable positive linear pattern in the residuals suggests systematic underprediction or overprediction in certain data ranges. Ideally, residuals should be randomly distributed around zero, but the observed trend indicates that the model may not be fully capturing certain aspects of the data. This points to the need for additional feature engineering, such as incorporating new variables or exploring alternative algorithms, to enhance predictive accuracy. Through mapping the residuals, we noticed that in urban areas like northern Philadelphia, the model tends to underpredict COPD rates. This may reflect complex urban health dynamics that are not fully captured by the current set of predictors. Conversely, in surrounding suburban areas, the model shows a tendency to overpredict. In rural areas, underprediction is also common, possibly due to limited data or missing explanatory variables that influence COPD outcomes in these regions. Residuals exhibit spatial continuity, with similar prediction errors clustering in neighboring census tracts. This indicates the influence of regional factors that may affect multiple tracts similarly, highlighting the importance of spatial dependencies in COPD prediction.
Limitations and Uncertainties
This study provides valuable insights into the predictors and seasonal patterns of COPD, but several limitations and uncertainties must be acknowledged to contextualize the findings and guide future research. One key concern is the accuracy of the CDC’s COPD estimates, which serve as a foundational dataset for the analysis. If the estimation methodology contains biases or inaccuracies, these could propagate through the model and influence the results. Additionally, there is some uncertainty regarding whether we are utilizing these estimates appropriately for our specific modeling purposes. A closer examination of the CDC’s methods and assumptions is necessary to ensure compatibility with our approach.
Another limitation lies in our method of estimating seasonal COPD occurrences, which relies on exacerbation rates as a proxy. While exacerbations are closely tied to COPD severity, it is unclear whether they can reliably represent seasonal variations in overall COPD prevalence. This raises questions about the validity of our approach and highlights the need for further validation using independent seasonal COPD data. Furthermore, the selection of predictors in the model warrants further consideration. Although we included social, behavioral, and environmental factors, the omission of specific predictors, such as pollution indicators like PM2.5 or NO₂ concentrations, may have reduced the model’s ability to fully capture the complexity of COPD outcomes. Incorporating such variables in future analyses could enhance the explanatory power and robustness of the model.
A related issue is our choice of outcome variable. While we primarily focused on predicting COPD counts, this approach introduced high variability, which likely reduced the model’s accuracy. Preliminary analyses not included in this study demonstrated that predicting COPD rates, normalized by population, resulted in much higher accuracy due to the lower variability in rates. This raises the question of whether future analyses should prioritize rate predictions over raw counts. Although predicting rates can improve accuracy, it might shift the focus to relative comparisons and obscure the absolute disease burden in high-prevalence areas.
Next Step
Moving forward, it is important to acknowledge that this work is still in progress and represents our first trial in developing models to better understand COPD predictors and patterns. While we aimed to build robust models, the current approach has not fully explained the spatial distribution of COPD, highlighting a critical area for improvement. Future iterations will require significant effort in feature engineering to enhance the model’s explanatory power. This includes carefully selecting the most relevant indicators of local environment, demographic characteristics, and socio-behavioral factors. For instance, identifying more specific and impactful predictors, such as localized air pollution levels, housing quality, or access to healthcare, could help capture finer details of COPD prevalence.
Additionally, more exploratory analyses are needed to delve into the nuances of the data, particularly to account for urban and rural differences that likely influence COPD patterns in distinct ways. By zooming into specific geographic areas or demographic subgroups, we can gain deeper insights into how these factors interact. Such efforts will allow us to refine our models and create tools that better represent the spatial and seasonal variability of COPD. With these enhancements, future work will aim to provide a more comprehensive and actionable understanding of COPD determinants, supporting public health strategies and interventions tailored to local needs.