This work is distributed under the Creative Commons Attribution 4.0 License.
Improving 3-day deterministic air pollution forecasts using machine learning algorithms
Christer Johansson
Zhiguo Zhang
Magnuz Engardt
Massimo Stafoggia
Xiaoliang Ma
Abstract. As air pollution is regarded as the single largest environmental health risk in Europe, it is important that communication to the public is up-to-date and accurate and provides means to avoid exposure to high air pollution levels. Both long- and short-term exposure to outdoor air pollution is associated with increased risks of mortality and morbidity. Up-to-date information on present and coming days' air quality helps people avoid exposure during episodes with high levels of air pollution. Air quality forecasts can be based on deterministic dispersion modelling, but to be accurate this requires detailed information on future emissions and meteorological conditions, as well as process-oriented dispersion modelling. In this paper we apply different machine learning (ML) algorithms – random forest (RF), extreme gradient boosting (XGB) and long short-term memory (LSTM) – to improve 1-, 2- and 3-day deterministic forecasts of PM10, NOx and O3 at different sites in Greater Stockholm, Sweden.
It is shown that the deterministic forecasts can be significantly improved using the ML models, but that the degree of improvement depends more on pollutant and site than on which ML algorithm is applied. Deterministic forecasts of PM10 are improved by the ML models through the input of lagged measurements and Julian day, which partly reflect seasonal variations not properly parameterised in the deterministic forecasts. A systematic discrepancy in the diurnal cycle of NOx in the deterministic forecasts is removed by the ML models through lagged measurements and calendar data such as hour of the day and weekday, reflecting the influence of local traffic emissions. For O3 at the urban background site, the local photochemistry is not properly accounted for by the relatively coarse Copernicus Atmosphere Monitoring Service (CAMS) ensemble model used here for forecasting O3, but this is compensated for by the ML models taking lagged measurements into account. The ML models performed similarly well across sites and pollutants: performance measures such as Pearson correlation, root mean square error (RMSE), mean absolute percentage error (MAPE) and mean absolute error (MAE) typically differed by less than 30 % between ML models. At the urban background site, the deviations between modelled and measured concentrations (RMSE) are smaller than the measurement uncertainties estimated according to the recommendations of the Forum for Air Quality Modelling (FAIRMODE) in the context of the air quality directives. At the street canyon sites, model errors are larger and comparable to the measurement uncertainties. Further work is needed to reduce deviations between model results and measurements for short periods with relatively high concentrations (peaks). Such peaks can be due to a combination of non-typical emissions and unfavourable meteorological conditions and may be difficult to forecast. We have also shown that deterministic forecasts of NOx at street canyon sites can be improved using ML models even when they are trained at other sites. For PM10 this was only possible using LSTM.
An important aspect to consider when choosing an ML algorithm is that the decision-tree-based models (RF and XGB) can provide useful output on the importance of features that is not as readily obtained from neural network models like LSTM, and that training and optimisation are more complex for LSTM; both points could matter when selecting the ML algorithm for an operational forecast system. A random forest model is now implemented operationally in the forecasts of air pollution and health risks in Stockholm. Refinement of the tuning process and identification of more efficient predictors may make the forecasts more accurate.
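To make the setup concrete, the following is a minimal sketch of the correction approach described above: a random forest learns observed concentrations from the deterministic forecast combined with lagged measurements and calendar features. The file name, column names and lag choices are hypothetical illustrations, not the authors' actual pipeline.

```python
# Minimal sketch of the ML-correction idea: a random forest learns observed
# PM10 from the deterministic forecast plus lagged measurements and calendar
# features. File and column names are hypothetical, not the authors' pipeline.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("station_hourly.csv", parse_dates=["time"])  # hypothetical file

# Lagged measurements (assumes one row per hour), e.g. 24 h and 48 h back
for lag in (24, 48):
    df[f"pm10_lag{lag}"] = df["pm10_obs"].shift(lag)

# Calendar features reflecting traffic patterns and seasonality
df["hour"] = df["time"].dt.hour
df["weekday"] = df["time"].dt.weekday
df["julian_day"] = df["time"].dt.dayofyear

features = ["pm10_forecast_det", "pm10_lag24", "pm10_lag48",
            "hour", "weekday", "julian_day"]
df = df.dropna(subset=features + ["pm10_obs"])

# Chronological split: train on the past, evaluate on the most recent period
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

rf = RandomForestRegressor(n_estimators=500, random_state=0)
rf.fit(train[features], train["pm10_obs"])
corrected = rf.predict(test[features])  # ML-corrected forecast
```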
Status: final response (author comments only)
- RC1: 'Comment on acp-2023-38', Anonymous Referee #1, 21 Mar 2023
Review of “Improving 3-day deterministic air pollution forecasts using machine learning algorithms”
The authors present several ML models for air quality forecasting, focusing on PM10 and NOx, and compare those results against deterministic forecasts. The dataset used to train the ML models focuses on Stockholm. Overall, the authors show that the ML models investigated seem to outperform the deterministic forecasts over the periods on which the models were tested, for forecast horizons of 1, 2 and 3 days. The ML models considered are very standard methods used widely in the literature and are probably the appropriate starting points given the structure and size of the training data. The authors additionally show some feature importance results.
In my opinion the paper is too long: the results section needs to be shortened, the overall results summarised, and the key findings emphasised more. Both Tables 2 and 4 could go to the Appendix, as could most of the figures in the results section. It might be easier to present and condense the tabulated results as figures. I think the feature importance plots should be in the main text.
In the abstract it is wrongly stated that one cannot subject LSTMs to feature importance methods; a Google search provides examples of how this can be done. I think the authors need to investigate a gradient-based feature importance method for the LSTM and compare that model against the other models investigated, since the application of feature importance alongside the use of ML models for longer-horizon prediction is a main focus of their paper. Furthermore, the authors did not mention which importance method was used for the tree-based models. There are now a variety of methods available for these models, which often do not agree on the importance ranking. For the XGB model, how do other feature importance metrics (such as permutation and SHAP importance) compare to the one used?
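To illustrate the comparison the referee requests, the sketch below applies three importance methods to the same fitted XGBoost model: the built-in gain-based ranking, permutation importance, and a global SHAP ranking. The data are synthetic and the feature names hypothetical; this is not the method used in the manuscript.

```python
import numpy as np
import pandas as pd
import xgboost as xgb
import shap
from sklearn.inspection import permutation_importance

# Synthetic tabular data standing in for the paper's features
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["lag24", "hour", "wind", "temp"])
y = 2 * X["lag24"] + X["wind"] + rng.normal(scale=0.3, size=500)
X_train, X_test = X.iloc[:400], X.iloc[400:]
y_train, y_test = y.iloc[:400], y.iloc[400:]

model = xgb.XGBRegressor(n_estimators=300).fit(X_train, y_train)

# 1. Built-in gain-based importance (a common default in papers)
gain = model.get_booster().get_score(importance_type="gain")

# 2. Permutation importance: score drop when a feature is shuffled
perm = permutation_importance(model, X_test, y_test,
                              n_repeats=10, random_state=0)

# 3. Global SHAP ranking: mean absolute per-prediction attribution
shap_values = shap.TreeExplainer(model).shap_values(X_test)
shap_rank = dict(zip(X.columns, np.abs(shap_values).mean(axis=0)))

print(gain, dict(zip(X.columns, perm.importances_mean)), shap_rank, sep="\n")
```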
Data
It is not clear in the paper how the data sources were combined and then split into training and validation sets. Furthermore, it is not stated whether any preprocessing of the data was performed (which was probably needed, given the different ranges the input quantities cover). How large was the dataset? The authors state that the ML models were trained on the same data; were the tree models trained on randomised (tabular) data, or was it split some other way (was the time dependence preserved)? Figure 3 makes sense for the LSTM; however, I think the authors could extend the figure to include some more schematic details pertaining to the tree-based models studied (a schematic of the LSTM could be helpful as well).
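As an example of the kind of detail requested, one defensible preprocessing and splitting scheme is sketched below: a purely temporal split, with any scaling fitted on the training period only so that no test-period statistics leak into training. File name, dates and column names are hypothetical.

```python
# Sketch of a leakage-safe scheme: temporal train/test split, with scaling
# statistics computed from the training period only. Names are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("combined_features.csv", parse_dates=["time"]).sort_values("time")

train = df[df["time"] < "2021-01-01"]   # fit models (and scalers) here
test = df[df["time"] >= "2021-01-01"]   # report skill here only

num_cols = ["pm10_lag24", "wind_speed", "temperature"]  # hypothetical columns
scaler = StandardScaler().fit(train[num_cols])          # train statistics only
train_scaled = scaler.transform(train[num_cols])
test_scaled = scaler.transform(test[num_cols])
```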
Models
It becomes clear that there are in fact many models being used. How many, relative to (presumably) Figure 4? (The text currently says 'Fel! Hittar inte referenskälla', Swedish for 'Error! Reference source not found'.) The authors should probably add some comments about how this considerably complicates the overall model pipeline, relative to a single model being used, and limits the models' generalisability. Is this really a better solution than a single LSTM in terms of complexity?
There is no detail provided on model and training hyperparameters, or on whether hyperparameter optimisation was performed, for any of the models. How large was the LSTM (how many layers, what layer size, what activation functions were used, etc.)? Similarly, what XGB parameters were used? Were all of the parameters guessed (or were defaults used)? Overall, the models look mostly comparable. Given the results as they stand in the current manuscript, the obvious choice seems to be RF, but I think the authors also need to compare linear regression to the other models; it should be considered the baseline ML model that the others need to beat. I don't see much value in including the GAM (other than ruling it out).
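The baseline the referee asks for could look like the sketch below: an ordinary least-squares regression fitted and scored on the same splits, giving the floor that RF, XGB and LSTM need to beat. The data are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic stand-in for the shared feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)
print("baseline MAE:", mean_absolute_error(y_test, pred))
print("baseline R2: ", r2_score(y_test, pred))
```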
Finally, it was not clear whether cross-validation was performed and whether the presented results show ensemble averages or something else. This is by now a very standard procedure, and it also provides an estimate of the uncertainty in the models' prediction capabilities given the training dataset (the tables should, for example, present means and variances over the ensemble). Given that there are so many models at play, it might be more useful to understand when the models are more or less certain in their predictions.
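A sketch of the standard procedure the referee describes, using expanding-window time-series cross-validation so temporal order is preserved; the fold scores give the mean and spread that the tables could report. The data are synthetic.

```python
# Expanding-window cross-validation: each fold trains on an earlier span and
# tests on the span that follows, yielding a distribution of fold scores.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=1000)

rmses = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)

print(f"RMSE: {np.mean(rmses):.3f} +/- {np.std(rmses):.3f}")
```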
Other
Most of the results in later figures show the mean over the 1-, 2- and 3-day forecasts. How does the performance depend on forecast day?
I would rather the authors show the coefficient of determination (R²) alongside, or instead of, the Pearson coefficient.
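The distinction matters because the squared Pearson correlation only measures linear association, whereas R² also penalises bias and scale errors; a tiny numerical example:

```python
# A perfectly correlated but strongly biased forecast: Pearson r^2 is 1.0,
# while R2 is large and negative because the residuals dwarf the variance.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

obs = np.array([10.0, 12.0, 15.0, 11.0, 14.0])
pred = obs * 2 + 5   # biased forecast

r, _ = pearsonr(obs, pred)
print("Pearson r^2:", r ** 2)           # 1.0
print("R2:", r2_score(obs, pred))       # strongly negative
```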
Line 10: The deterministic predictions are used as the models' inputs, but at which time? Is it the current prediction (so that the models' job is to correct CAMS)?
Throughout the paper the authors use “MLs” but this should probably read ML models for grammatical consistency.
In Figure 5, when there is a large drop just after September 20, what did the XAI method claim were the important features? Is there anything different about the inputs to the model on that date?
Citation: https://doi.org/10.5194/acp-2023-38-RC1
- RC2: 'Comment on acp-2023-38', Anonymous Referee #2, 31 Mar 2023
This paper describes and evaluates the use of different machine learning algorithms to make short-term forecasts of ground-level air pollution at 4 observational stations in Stockholm, Sweden. While the paper certainly adds to the current literature, and fits within the scope of the journal, the paper lacks some important details and is sometimes difficult to follow. My major comments relate to the need for more clarity on the exact methodology used, particularly with reference to the data processing. Clarity on these details is necessary before publication.
Specific major comments:
- It is unclear how the data is split for model training, validation and testing. What is the percentage split of this data? Crucially, how was the data split? Was it split randomly, or was it split temporally (if so, what were the dates for the training/test data)? Some details are necessary here, as improper data splitting can lead to over-inflation of model skill results due to data leakage.
- Did the authors carry out hyperparameter tuning for their models? If so, did they use a validation set to do this tuning, and then evaluate on a held-out test dataset? Currently there are no details on hyperparameters in the manuscript. It is stated in the conclusion that fine-tuning is possible future work - does this mean that defaults were used for the hyperparameters?
- The authors state that the LSTM model is not used to its full potential. I am unclear what this means. As I understand it, the LSTM was not trained to make autoregressive predictions of air pollution, as would typically be the case when forecasting with an LSTM (see the sketch after this list). If this is the case, I imagine this was done to allow comparison between the LSTM and the tree-based methods. However, this is not really a fair test of the skill of the LSTM and the possible advantages it provides over tree-based methods. Some clarity on this would be appreciated.
- It is unclear how the feature importances were extracted from the tree-based methods. There are a number of different methods for this. Was this based on permutation-based feature importance, for example? Why was the particular importance method chosen, and what are the drawbacks of using this method? In addition, there are methods to extract feature importances from LSTM models that could be used, to compare against importances from the tree-based models.
- It is stated that data from the UB site covers around 1000 days, while the street canyon data extends over 500 days. Were the percentage splits of the training and testing data similar for both cases? Given there were fewer data for the street canyon sites, might this affect model performance? Finally, given that the data comes from 2019-2021, might the impact of coronavirus restrictions affect model skill, or model generalisation to future data?
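Regarding the LSTM comment above, the sketch below shows a direct multi-step formulation, in which one forward pass maps a fixed history window to all forecast hours at once; an autoregressive alternative would instead feed each prediction back as input for the next step. Window lengths, layer sizes and array shapes are illustrative only, not the authors' setup.

```python
# Direct multi-step LSTM: a 48 h history window is mapped to all 72 forecast
# hours in one forward pass (no feedback of predictions). Data are synthetic.
import numpy as np
import tensorflow as tf

lookback, horizon, n_features = 48, 72, 6
X = np.random.rand(256, lookback, n_features).astype("float32")
y = np.random.rand(256, horizon).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(lookback, n_features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(horizon),  # all 72 lead hours predicted at once
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```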
Other specific comments:
- Line 5: some citations for reduced lung function etc. would be welcome.
- Line 26-28: ‘Although… the challenges of forecasting air pollution concentrations in a longer-term horizon such as a day or even several days have not been investigated’. If this refers to multi-day daily forecasts of pollutants, it is not true. There is a significant body of work on forecasting air pollution on the time horizon of several days, e.g., Kleinert et al. (2022) for ozone.
- Line 28-29: ‘very few studies have combined deterministic models and ML in forecasting air pollution levels of a few hours/days in the future’. There are some studies that do this. They should be cited here.
- Page 4, line 15: ‘whereas the alternative approach substitutes the missing values with mean values of available data in the neighbourhood’. Substituting with the mean value from the other 3 stations? Or the nearest station?
- Page 5, line 9: ‘are extracted from a location outside the greater Stockholm domain’ - what location?
- Page 8, line 15 – missing reference.
- The metrics used are generally clear and well reported, however, for clarity, it would be good to include bold highlighting of the best performing model for each case in Table 2 and Table 4.
- Thank you for including the analysis of model performance and high pollutant concentrations. This is important.
Technical comments:
- Line 26: ‘For O3 at the urban background site the local photochemistry is? not properly accounted for by the relatively coarse Copernicus Atmosphere Monitoring Service ensemble model (CAMS) used here for forecasting O3, but is compensated for using the MLs (ML models?) by taking lagged measurements into account.’ Perhaps?
- Throughout: MLs or machine learning models? MLs is not a common term.
- The Figure on page 9 has no caption or label.
- Conclusion: reflection -> reflecting
Citation: https://doi.org/10.5194/acp-2023-38-RC2
- AC1: 'Comment on acp-2023-38', Christer Johansson, 02 May 2023
The comment was uploaded in the form of a supplement: https://acp.copernicus.org/preprints/acp-2023-38/acp-2023-38-AC1-supplement.pdf
Data sets
ACP-2023-38 paper submission support: data for 3-days prediction of Air Quality using Machine Learning algorithms, Johansson et al., https://zenodo.org/record/7576042#.Y9k9KHbMK72
Model code and software
ACP-2023-38 paper submission support: code for 3-days prediction of Air Quality using Machine Learning algorithms, Johansson et al., https://zenodo.org/record/7576042#.Y9k9KHbMK72
Viewed

HTML | PDF | XML | Total | BibTeX | EndNote
---|---|---|---|---|---
620 | 203 | 18 | 841 | 7 | 5
Cited

2 citations as recorded by Crossref:
- Flores, V. et al.: Water Quality Classification and Machine Learning Model for Predicting Water Quality Status—A Study on Loa River Located in an Extremely Arid Environment: Atacama Desert, https://doi.org/10.3390/w15162868
- Pan, K. et al.: A Hybrid Autoformer Network for Air Pollution Forecasting Based on External Factor Optimization, https://doi.org/10.3390/atmos14050869