To estimate PM2.5 and temperature, we applied advanced machine learning methods such as extreme gradient boosting (XGBoost), random forest models and ensemble averaging. The models drew on multiple sources of predictors: ground monitoring data, satellite remote sensing data, meteorology, land use, emissions inventories, and chemical transport model output.
Our hybrid model design followed a multi-stage approach. First, we used our published machine learning approach to downscale the meteorological and aerosol optical depth (AOD) parameters, which are available only at a coarser spatial resolution, to obtain continuous, full time series data at 1 × 1 km resolution. Second, we implemented a calibration regression to mitigate the sparsity of ground-based PM2.5 and temperature measurements. Using available PM10 data from co-located monitors, we modelled the ratio of PM2.5 to PM10 against weather parameters and selected land use variables. This calibration regression allowed us to predict PM2.5 at locations and times where only PM10 data are available. Next, we modelled the relationship between the ground-based PM2.5 and temperature measurements (using separate models) and all predictor variables using a combination of statistical and machine learning techniques. The dataset was split into training and test datasets. We trained multiple machine learning algorithms, including random forest, XGBoost and neural networks. To borrow strength across methodologies, we combined the predictions from each learner using a generalized additive model ensemble approach with tensor product smoothing over the spatial coordinates. This allowed the weights given to the individual algorithms to vary over both space and time. Measures of prediction accuracy were obtained from the test datasets, including root mean squared error (RMSE), bias and prediction R2 using robust linear regression. The temperature model will be fit similarly.
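As an illustration of the calibration and evaluation steps described above, the following minimal Python sketch models the PM2.5/PM10 ratio against a single covariate, converts PM10 into a PM2.5 estimate, and computes RMSE, bias and R2 on held-out values. It is not the implementation used in this work. The function names, the single humidity covariate, the ordinary least squares fit, the clipping of the ratio to [0, 1] and all numbers are assumptions made for the example, and R2 is computed here as a plain squared Pearson correlation rather than from a robust linear regression.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_pm25_from_pm10(pm10, covariate, intercept, slope):
    """Estimate PM2.5 as (modelled PM2.5/PM10 ratio) * observed PM10."""
    ratio = intercept + slope * covariate
    # Assumption: clip to [0, 1], since PM2.5 is a subset of PM10.
    ratio = min(max(ratio, 0.0), 1.0)
    return ratio * pm10


def evaluate(observed, predicted):
    """Return RMSE, mean bias and squared Pearson correlation."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    soo = sum((o - mean_o) ** 2 for o in observed)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    sop = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted))
    return {"rmse": rmse, "bias": bias, "r2": sop ** 2 / (soo * spp)}


# Hypothetical co-located monitors reporting both PM2.5 and PM10.
humidity = [30.0, 50.0, 70.0, 90.0]
pm25 = [9.0, 22.0, 13.0, 30.0]
pm10 = [20.0, 40.0, 20.0, 40.0]
ratios = [a / b for a, b in zip(pm25, pm10)]
intercept, slope = fit_line(humidity, ratios)

# Hypothetical site and day where only PM10 was measured.
pm25_estimate = predict_pm25_from_pm10(50.0, 60.0, intercept, slope)

# Hypothetical held-out observations and model predictions.
metrics = evaluate([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

In practice the ratio model would include several weather and land use covariates, and the fitted PM2.5 estimates would be used alongside directly measured PM2.5 when training the learners.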