08 Feb 2023
Status: this preprint is currently under review for the journal ACP.

Improving 3-day deterministic air pollution forecasts using machine learning algorithms

Christer Johansson, Zhiguo Zhang, Magnuz Engardt, Massimo Stafoggia, and Xiaoliang Ma

Abstract. As air pollution is regarded as the single largest environmental health risk in Europe, it is important that communication to the public is up-to-date and accurate and that it provides means to avoid exposure to high air pollution levels. Long- as well as short-term exposure to outdoor air pollution is associated with increased risks of mortality and morbidity. Up-to-date information on the air quality of the present and coming days helps people avoid exposure during episodes with high levels of air pollution. Air quality forecasts can be based on deterministic dispersion modelling, but to be accurate this requires detailed information on future emissions and meteorological conditions as well as process-oriented dispersion modelling. In this paper we apply different machine learning (ML) algorithms – random forest (RF), extreme gradient boosting (XGB) and long short-term memory (LSTM) – to improve 1-, 2- and 3-day deterministic forecasts of PM10, NOx, and O3 at different sites in Greater Stockholm, Sweden.

It is shown that the deterministic forecasts can be significantly improved using the ML models, but that the degree of improvement depends more on pollutant and site than on which ML algorithm is applied. Deterministic forecasts of PM10 are improved by the ML models through the input of lagged measurements and Julian day, the latter partly reflecting seasonal variations not properly parameterised in the deterministic forecasts. A systematic discrepancy in the diurnal cycle of NOx in the deterministic forecasts is removed by the ML models, which consider lagged measurements and calendar data such as hour of the day and weekday, reflecting the influence of local traffic emissions. For O3 at the urban background site, the local photochemistry is not properly accounted for by the relatively coarse Copernicus Atmosphere Monitoring Service (CAMS) ensemble model used here for forecasting O3, but this is compensated for by the ML models by taking lagged measurements into account. The ML models performed similarly well across sites and pollutants. Performance measures such as Pearson correlation, root mean square error (RMSE), mean absolute percentage error (MAPE) and mean absolute error (MAE) typically differed by less than 30 % between ML models. At the urban background site, the deviations between modelled and measured concentrations (RMSE) are smaller than the uncertainties in the measurements estimated according to recommendations by the Forum for Air Quality Modeling (FAIRMODE) in the context of the air quality directives. At the street canyon sites, model errors are higher and similar to the measurement uncertainties. Further work is needed to reduce deviations between model results and measurements for short periods with relatively high concentrations (peaks). Such peaks can be due to a combination of non-typical emissions and unfavourable meteorological conditions and may be difficult to forecast.
We have also shown that deterministic forecasts of NOx at street canyon sites can be improved using ML models even if they are trained at other sites. For PM10 this was only possible using LSTM.
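The four performance measures named above have standard definitions, which can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function name `forecast_metrics`, the choice to skip zero observations when computing MAPE, and the example concentrations are all assumptions made here for the sake of the example.

```python
import math


def forecast_metrics(observed, predicted):
    """Return Pearson r, RMSE, MAE and MAPE (%) for paired series.

    Hypothetical helper for illustration; hours with an observed value of
    zero are skipped in the MAPE, which is an assumption of this sketch.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long series with at least 2 values")

    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n

    # Pearson correlation: covariance divided by the product of the spreads.
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    denom = math.sqrt(var_o * var_p)
    pearson = cov / denom if denom > 0 else float("nan")

    # Error-based measures use the difference forecast minus observation.
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n

    # MAPE is undefined where the observation is zero, so skip those pairs.
    ratios = [abs(e / o) for o, e in zip(observed, errors) if o != 0]
    mape = 100.0 * sum(ratios) / len(ratios) if ratios else float("nan")

    return {"pearson": pearson, "rmse": rmse, "mae": mae, "mape": mape}


# Made-up concentrations (e.g. in µg m-3), not data from the paper.
measured = [10.0, 20.0, 30.0, 40.0]
forecast = [12.0, 18.0, 33.0, 39.0]
metrics = forecast_metrics(measured, forecast)
print(metrics)
```

Comparing the RMSE obtained this way with an estimate of the measurement uncertainty is the kind of check the abstract refers to; the FAIRMODE uncertainty formula itself is not reproduced here.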

An important aspect to consider when choosing an ML algorithm for an operational forecast system is that the decision-tree-based models (RF and XGB) can provide useful output on the importance of features, which is not possible with neural network models like LSTM. Training and optimisation are also more complex with LSTM. A random forest model is now implemented operationally in the forecasts of air pollution and health risks in Stockholm. Further development of the tuning process and identification of more efficient predictors may make the forecasts more accurate.
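The predictors mentioned in the abstract (the deterministic forecast itself, lagged measurements, Julian day, hour of the day and weekday) could be assembled into one feature row per forecast hour roughly as sketched below. Everything specific here is an assumption for illustration: the function name `build_features`, the feature names, the lags of 1 h and 24 h, taking lags relative to the forecast issue time, and reading "Julian day" as day of year. The paper's own feature set may differ.

```python
from datetime import datetime, timedelta


def build_features(issue_time, target_time, deterministic_forecast,
                   measurements, lags_hours=(1, 24)):
    """Build one feature row for correcting a deterministic forecast.

    Hypothetical sketch. `measurements` maps observation time to measured
    concentration. Lags are counted back from the issue time, since
    measurements after that time are not yet available to the forecast.
    """
    if target_time <= issue_time:
        raise ValueError("target_time must be later than issue_time")

    features = {
        "deterministic_forecast": deterministic_forecast,
        "lead_hours": (target_time - issue_time).total_seconds() / 3600.0,
        # Calendar features describe the hour being forecast.
        "julian_day": target_time.timetuple().tm_yday,  # day of year, 1-366
        "hour": target_time.hour,
        "weekday": target_time.weekday(),  # Monday = 0, Sunday = 6
    }
    for lag in lags_hours:
        lag_time = issue_time - timedelta(hours=lag)
        # None marks a missing measurement; handling is left to the caller.
        features[f"measured_lag_{lag}h"] = measurements.get(lag_time)
    return features


# Made-up values, not data from the paper.
issue = datetime(2023, 2, 8, 6)
target = datetime(2023, 2, 9, 8)
observations = {
    datetime(2023, 2, 8, 5): 21.0,
    datetime(2023, 2, 7, 6): 18.5,
}
row = build_features(issue, target, 25.0, observations)
print(row)
```

Rows like this, paired with the measured concentration at the target hour, would form the training set for a regression model such as RF or XGB. With tree-based models, the per-feature importances could then be inspected to see which predictors contribute most.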

Christer Johansson et al.

Status: final response (author comments only)

Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
  • RC1: 'Comment on acp-2023-38', Anonymous Referee #1, 21 Mar 2023
  • RC2: 'Comment on acp-2023-38', Anonymous Referee #2, 31 Mar 2023
  • AC1: 'Comment on acp-2023-38', Christer Johansson, 02 May 2023


Data sets

ACP-2023-38 paper submission support: data for 3-days prediction of Air Quality using Machine Learning algorithms Johansson et al.

Model code and software

ACP-2023-38 paper submission support: code for 3-days prediction of Air Quality using Machine Learning algorithms Johansson et al.



Total article views: 841 (including HTML, PDF, and XML)
  • HTML: 620
  • PDF: 203
  • XML: 18
  • Total: 841
  • BibTeX: 7
  • EndNote: 5



Latest update: 22 Sep 2023
Short summary
Up-to-date information on the air quality of the present and coming days helps people avoid exposure to high levels of air pollution. We apply different machine learning models to significantly improve traditional forecasts of PM10, NOx, and O3 in Stockholm, Sweden. It is shown that forecasts of all air pollutants are improved through the input of lagged measurements and by taking calendar information into account. The final modelled errors are substantially smaller than uncertainties in the measurements.