Improving 3-day deterministic air pollution forecasts using machine learning algorithms

Zhang, Zhiguo; Johansson, Christer; Engardt, Magnuz; Stafoggia, Massimo; Ma, Xiaoliang

doi:https://doi.org/10.5194/acp-24-807-2024

Articles | Volume 24, issue 2

Research article

19 Jan 2024

Abstract

As air pollution is regarded as the single largest environmental health risk in Europe, it is important that communication to the public is up to date and accurate and provides means to avoid exposure to high air pollution levels. Long- and short-term exposure to outdoor air pollution is associated with increased risks of mortality and morbidity. Up-to-date information on present and coming days' air quality helps people avoid exposure during episodes with high levels of air pollution. Air quality forecasts can be based on deterministic dispersion modelling, but to be accurate this requires detailed information on future emissions, meteorological conditions and process-oriented dispersion modelling. In this paper, we apply different machine learning (ML) algorithms – random forest (RF), extreme gradient boosting (XGB), and long short-term memory (LSTM) – to improve 1, 2, and 3 d deterministic forecasts of PM₁₀, NO_x, and O₃ at different sites in Greater Stockholm, Sweden.

It is shown that the deterministic forecasts can be significantly improved using the ML models but that the degree of improvement of the deterministic forecasts depends more on pollutant and site than on what ML algorithm is applied. Also, four feature importance methods, namely the mean decrease in impurity (MDI) method, permutation method, gradient-based method, and Shapley additive explanations (SHAP) method, are utilized to identify significant features that are common and robust across all models and methods for a pollutant. Deterministic forecasts of PM₁₀ are improved by the ML models through the input of lagged measurements and Julian day partly reflecting seasonal variations not properly parameterized in the deterministic forecasts. A systematic discrepancy by the deterministic forecasts in the diurnal cycle of NO_x is removed by the ML models considering lagged measurements and calendar data like hour and weekday, reflecting the influence of local traffic emissions. For O₃ at the urban background site, the local photochemistry is not properly accounted for by the relatively coarse Copernicus Atmosphere Monitoring Service ensemble model (CAMS) used here for forecasting O₃ but is compensated for using the ML models by taking lagged measurements into account.

Through multiple repetitions of the training process, the resulting ML models achieved improvements for all sites and pollutants. For NO_x at street canyon sites, mean squared error (MSE) decreased by up to 60 %, and seven metrics, such as R² and mean absolute percentage error (MAPE), exhibited consistent results. The prediction of PM₁₀ is improved significantly at the urban background site, whereas the ML models at street sites have difficulty capturing more information. The prediction accuracy of O₃ also modestly increased, with differences between metrics.

Further work is needed to reduce deviations between model results and measurements for short periods with relatively high concentrations (peaks) at the street canyon sites. Such peaks can be due to a combination of non-typical emissions and unfavourable meteorological conditions, which are rather difficult to forecast. Furthermore, we show that general models trained using data from selected street sites can improve the deterministic forecasts of NO_x at the station not involved in model training. For PM₁₀ this was only possible using more complex LSTM models. An important aspect to consider when choosing ML algorithms is the computational requirements for training the models in the deployment of the system. Tree-based models (RF and XGB) require fewer computational resources and yield comparable performance in comparison to LSTM. Therefore, tree-based models are now implemented operationally in the forecasts of air pollution and health risks in Stockholm. Nevertheless, there is big potential to develop generic models using advanced ML to take into account not only local temporal variation but also spatial variation at different stations.

Received: 31 Jan 2023 – Discussion started: 08 Feb 2023 – Revised: 24 Sep 2023 – Accepted: 28 Oct 2023 – Published: 19 Jan 2024

1 Introduction

According to the World Health Organization (WHO), air pollution is one of the leading causes of mortality worldwide and is regarded as the single largest environmental health risk (Fuller et al., 2022). Acute effects of air pollution are due to short-term (e.g. daily) exposures that can lead to reduced lung function, respiratory infections, and aggravated asthma (Lee et al., 2021). According to the European air quality directive, information on air quality should be made available to the public. Public information regarding the expected health risks associated with current or the next few days' concentrations of pollutants can be very important for sensitive persons when planning their outdoor activities.

There are different approaches to obtaining information on the spatio-temporal variation of air pollutant concentrations – from complex process-oriented models to different types of statistical models. Gaussian plume models are widely used in urban areas for estimating impacts on atmospheric concentrations from different emission sources and for health risk assessments (Munir et al., 2020; Johansson et al., 2009; Orru et al., 2015; Johansson et al., 2017b). Eulerian chemical transport models that describe emission, transport, mixing, and chemical transformation of trace gases and aerosols, e.g. CHIMERE, EMEP, and MATCH, are part of the Copernicus Atmosphere Monitoring Service (CAMS, https://atmosphere.copernicus.eu/, last access: 20 December 2023) to predict air pollution over Europe (Horálek et al., 2019). The uncertainties in the output of the deterministic models include uncertainties in the input, such as emissions, model algorithms, and parameterizations.

In urban areas, detailed knowledge of the dedicated emission source is often crucial. For example, road traffic, as a main emission source, can be modelled by various levels of emission models (Ma et al., 2012; Keller et al., 2017). To assess the concentration of contaminants, it is often required to combine the models of emission and dispersion processes (e.g. Ma et al., 2014). An alternative approach may derive spatio-temporal distribution of air pollutants without modelling the emission process. For example, using a land use regression model is a popular method to explain spatial contrasts in air pollution concentrations (e.g. Hoek et al., 2008).

Data-driven models using machine learning (ML) have become increasingly popular in predicting outdoor air quality (Rybarczyk and Zalakeviciute, 2018; Iskandaryan et al., 2020). Previous studies predict both hourly and daily average concentrations of particulate matter (PM), as well as gaseous air pollutants, using meteorological and traffic data (e.g. Qadeer et al., 2020; Di et al., 2019; Thongthammachart et al., 2021; Kamińska, 2019; Chuluunsaikhan et al., 2021; Doreswamy et al., 2020; Castelli et al., 2020; Stafoggia et al., 2019, 2020). In addition, a combination of ML, land-use regression (LUR), dispersion modelling, and ground-based and satellite measurements has been used to obtain temporally and spatially distributed concentrations (Shtein et al., 2020; Stafoggia et al., 2019; Brokamp et al., 2017; Di et al., 2019). Recently, Kleinert et al. (2022) conducted a study to forecast O₃ concentrations over a longer-term horizon; meanwhile, a deterministic model was also combined with ML in the study of Hong et al. (2022) to forecast the PM₂.₅ concentration.

This paper aims to demonstrate how ML can improve the 1, 2, and 3 d deterministic forecasts of several critical urban air pollutants: particulate matter (PM₁₀, particles with an aerodynamic diameter less than 10 µm), nitrogen oxides (NO_x), and ozone (O₃). The study covers both urban background and street canyon sites in Stockholm, Sweden. Three ML algorithms were adopted, two based on decision trees (random forest, RF, and extreme gradient boosting, XGB) and one deep neural network model (long short-term memory, LSTM). These models were compared to investigate if there are systematic differences in their prediction performance depending on different pollutants and measurement sites, which can be used to improve current applications in Stockholm. Meanwhile, four methods for feature importance ranking were applied to analyse the effects of different features on the model prediction results.

2 Background

2.1 The Stockholm air quality forecast system

Stockholm city has used an air quality forecast system since 2021. Three different dispersion models are used to forecast concentrations, considering emissions and dispersion at the European, urban, and street-level scales, as illustrated in Fig. 1. The CAMS ensemble model, part of the Copernicus Programme, is used to obtain forecasts of long-range transported air pollution from outside the Greater Stockholm area. Previous assessments have found the ensemble model to be more accurate than any individual model that is part of CAMS (Meteo-France, 2017; Marécal et al., 2015). CAMS regional ensemble forecasts are published once a day, and each forecast covers 96 h (4 d).

https://acp.copernicus.org/articles/24/807/2024/acp-24-807-2024-f01

Figure 1. Illustration of the deterministic modelling from European scale at a resolution of 0.1° × 0.1° (ca. 11 km × 6 km) via urban scale (100 m resolution over an area of 35 × 35 km) down to the street canyon sites. The CAMS ensemble forecast map example is taken from https://atmosphere.copernicus.eu/. The map with the Gaussian model local forecast example is output from the Airviro system (https://www.airviro.com/airviro/, last access: 1 February 2023) used in Stockholm. The illustration of a street canyon site is taken from https://www.wikiwand.com/en/Operational_Street_Pollution_Model (last access: 1 February 2023).
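The "ca. 11 km × 6 km" extent quoted for a 0.1° × 0.1° grid cell can be checked with a short calculation. The sketch below assumes a spherical Earth with roughly 111.2 km per degree of latitude and an approximate latitude of 59.33° N for Stockholm; both values and the function name are illustrative assumptions, not taken from the paper.

```python
import math

# Assumption: spherical Earth, about 111.2 km per degree of latitude.
KM_PER_DEG_LAT = 111.2

def grid_cell_size_km(lat_deg, res_deg=0.1):
    """Return the (north-south, east-west) extent in km of a res_deg x res_deg cell."""
    north_south = res_deg * KM_PER_DEG_LAT
    # Meridians converge towards the pole, so the east-west extent shrinks with cos(latitude).
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

# Assumption: 59.33 N is used as an approximate latitude for Stockholm.
ns_km, ew_km = grid_cell_size_km(lat_deg=59.33)
print(f"{ns_km:.1f} km x {ew_km:.1f} km")
```

At this latitude the east-west extent is about half of the north-south extent, which is why the cell is not square in kilometres.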

The contributions to concentrations from local emissions in the metropolitan area are calculated at a 100 m resolution using a Gaussian dispersion model that is part of the Airviro system (https://www.airviro.com/airviro/). In this modelling domain (Greater Stockholm, 35 × 35 km), individual buildings and street canyons are not resolved but are treated using a roughness parameter (Gidhagen et al., 2005). The Gaussian model is fed with meteorological forecasts from the Swedish Meteorological and Hydrological Institute (SMHI). A diagnostic wind model is used to account for the influence of variations in topography and land use on the dispersion parameters input to the Gaussian model. For details regarding uncertainties and validation of local modelling, see Johansson et al. (2017a).

Finally, the Operational Street Pollution Model (OSPM), developed by Berkowicz (2000) and driven by forecasted meteorology from SMHI, is applied to the street canyon sites. It has been applied earlier at Hornsgatan in Stockholm in a number of modelling studies (e.g. Krecl et al., 2021; Ottosen et al., 2015). NO_x and PM₁₀ are modelled on all scales, whereas O₃ is only forecasted by the CAMS ensemble model.
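The way the three modelling scales feed a single deterministic forecast can be sketched as follows. This is a minimal illustration that assumes the regional, urban, and street contributions are simply additive, which the text does not state explicitly; the function name and the site-type labels are hypothetical.

```python
def combine_scales(pollutant, site_type, regional, urban=0.0, street=0.0):
    """Combine modelled contributions (same concentration unit) into one forecast value.

    Assumption (not stated in the text): contributions from the scales are additive.
    """
    if pollutant == "O3":
        # O3 is only forecast by the CAMS ensemble model.
        return regional
    total = regional + urban
    if site_type == "street_canyon":
        # The street canyon contribution (OSPM) applies only at street sites.
        total += street
    return total

# Illustrative numbers only, not measurements or model output.
nox_street = combine_scales("NOx", "street_canyon", regional=5.0, urban=10.0, street=30.0)
nox_urban = combine_scales("NOx", "urban_background", regional=5.0, urban=10.0, street=30.0)
o3_any = combine_scales("O3", "street_canyon", regional=60.0, urban=10.0, street=30.0)
```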

For the urban-scale model domain, a detailed emission database is used as input for the local dispersion modelling. The database and its applications and comparisons between modelling and measurements are described in SLB (2022). The total emissions from road traffic are based on emission factors for different vehicle types, including passenger cars, buses, and light- and heavy-duty trucks. Exhaust emission factors of NO_x and particles are based on HBEFA version 3.3 (Keller et al., 2017) depending on the Euro class of the vehicle. The emission factors per vehicle category were weighted according to the national Swedish Transport Administration vehicle registry; the vehicle composition taken from the national vehicle registry has been shown to be similar to the local fleet, based on real-world number plate recognition measurements at Hornsgatan (Burman and Johansson, 2010; Burman et al., 2019). Non-exhaust emissions of PM due to wear of brakes, tyres, and roads are calculated using the NORTRIP model (Denby et al., 2013a) forced by the forecasted meteorology from SMHI. Information on shares of studded winter tyres is obtained from manual counting every week during the winter at different locations in the city centre and along highways outside of the city. Road traffic emissions are calculated for all roads with more than 3000 vehicles per day. Other emission sources in the local emissions database include shipping and private and municipal heating (including burning of waste). More information about the Stockholm air quality forecast system is provided in Engardt et al. (2021).
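The fleet weighting of emission factors described above can be illustrated with a small calculation. The sketch below is a simplification: it weights one emission factor per vehicle category by its fleet share and ignores the Euro-class breakdown. All shares, emission factors, and function names are hypothetical values chosen for illustration; they are not HBEFA or registry data.

```python
def fleet_weighted_ef(shares, emission_factors):
    """Average emission factor (g per vehicle-km) weighted by fleet share."""
    total_share = sum(shares.values())
    return sum(shares[c] * emission_factors[c] for c in shares) / total_share

def link_emission_g_per_km_day(vehicles_per_day, shares, emission_factors, min_traffic=3000):
    """Daily emission per km of road, or None for roads at or below the traffic threshold."""
    if vehicles_per_day <= min_traffic:
        return None  # only roads with more than 3000 vehicles per day are included
    return vehicles_per_day * fleet_weighted_ef(shares, emission_factors)

# Hypothetical fleet composition and NOx emission factors (g/km), for illustration only.
shares = {"passenger_car": 0.80, "bus": 0.05, "light_truck": 0.10, "heavy_truck": 0.05}
efs = {"passenger_car": 0.4, "bus": 3.0, "light_truck": 0.8, "heavy_truck": 2.5}

ef = fleet_weighted_ef(shares, efs)
busy_road = link_emission_g_per_km_day(10000, shares, efs)
quiet_road = link_emission_g_per_km_day(2000, shares, efs)
```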

2.2 Meteorological forecasts

As an integral part of the Stockholm air quality forecast system, meteorological forecasts for a point in central Stockholm are downloaded every morning from the websites of SMHI (https://www.smhi.se/data/oppna-data, last access: 20 December 2023) and MET Norway (https://docs.api.met.no/doc/, last access: 20 December 2023). The meteorological forecasts extend over 10 d and are a combination of output from a number of regional and global numerical weather prediction models. The combination is based on statistical adjustments and manual edits. Initial models of weather-dependent PM emissions and urban and street canyon air quality modelling are driven by meteorology. The forecasted meteorological data are also used as predictors for the models in this study.

3 Methods

3.1 Data and pre-processing

The data used in this study were collected from four monitoring stations in central Stockholm (see Fig. 2): one urban background site (Torkel Knutssonsgatan, hereafter called UB or urban) and three street canyon sites (Hornsgatan, HO; Folkungagatan, FO; and Sveavägen, SV). Detailed descriptions of measurement methods and sites are provided in Appendix A.

https://acp.copernicus.org/articles/24/807/2024/acp-24-807-2024-f02

Figure 2. Map of central Stockholm showing the locations of the urban background site and the street canyon traffic sites. Base map credits are as follows: © OpenStreetMap contributors, licensed under the Open Data Commons Open Database License (ODbL) v1.0.

Data from the UB site cover approx. 1000 d (10 April 2019 through 31 December 2021). As the OSPM model became operational at a later date, the street canyon data extend over 500 d (5 August 2020 through 31 December 2021). Pollutant concentration measurements from monitoring stations, pollutant forecasts, and meteorological forecasts from the Stockholm air quality forecast system were aggregated into the following four datasets.

All the data above were collected at 1 h intervals, with details illustrated in Table 1. It should be noted that there are several studies that show the impact of the COVID-19 pandemic on pollutant emissions as a result of some restrictive regulations (Sokhi et al., 2021; Torkmahalleh et al., 2021). The COVID-19 pandemic in Sweden commenced in January 2020 and continued until February 2022, meaning that the majority of the data were collected during this pandemic period.
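The approximate record lengths quoted above ("approx. 1000 d" and "500 d") and the corresponding number of hourly samples can be checked from the stated date ranges. The sketch below assumes that both end dates are inclusive and that every hour of every day is represented; the helper name is illustrative.

```python
from datetime import date

def n_days(start, end):
    """Number of calendar days from start to end, counting both end points."""
    return (end - start).days + 1

ub_days = n_days(date(2019, 4, 10), date(2021, 12, 31))      # urban background record
street_days = n_days(date(2020, 8, 5), date(2021, 12, 31))   # street canyon record

# With 1 h sampling, each day contributes 24 potential samples.
ub_hours = ub_days * 24
street_hours = street_days * 24
print(ub_days, street_days, ub_hours, street_hours)  # 997 514 23928 12336
```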

Table 1. Description of the dataset.

The pollutant measurements and forecasts from the deterministic model exhibit a missing rate of less than 5 %, with a few inaccurate samples, including outliers and negative values. Appendix B shows the missing status of O₃ in the UB dataset. To accurately represent the extreme values in the real world, outliers were deliberately included in the data because their occurrence is hard to justify. However, negative pollutant samples were eliminated, and missing data were manually interpolated using historical average interpolation (Willmott and Matsuura, 1995).
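The cleaning step described above can be sketched as follows: negative samples are eliminated while outliers are kept. The sketch assumes that missing values are represented by `None` and that eliminated negative samples are treated as missing afterwards; the function names are hypothetical.

```python
def clean_series(values):
    """Mark negative concentrations as missing (None); large outliers are left untouched."""
    return [None if (v is not None and v < 0) else v for v in values]

def missing_rate(values):
    """Fraction of samples that are missing."""
    return sum(v is None for v in values) / len(values)

# Illustrative hourly values only: one negative sample, one gap, one high outlier.
raw = [12.0, -1.5, None, 30.0, 250.0]
cleaned = clean_series(raw)
rate = missing_rate(cleaned)
```

In this toy series the outlier (250.0) survives cleaning, whereas the negative value becomes a gap to be interpolated later.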

Frequently employed approaches for interpolating time series data include constant, nearest-neighbour, and linear interpolation. To preserve the temporal structure, historical average interpolation is applied here, based on the periodicity in the data. The periodicity of each feature, denoted by p, is determined by analysing the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the data. The missing value $\tilde{\rho}(t)$ at time t is then replaced by the average of the available data from the two preceding periods and their adjacent values:

$$
\tilde{\rho}(t) = \frac{1}{n} \sum \bigl( \tilde{\rho}(t - p),\ \tilde{\rho}(t - p \pm 1),\ \tilde{\rho}(t - 2p),\ \tilde{\rho}(t - 2p \pm 1) \bigr), \tag{1}
$$

where n is the number of available samples used in Eq. (1). An example result of interpolation is shown in Appendix B.
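A minimal sketch of the historical average interpolation in Eq. (1) is given below. It assumes that missing values are represented by `None`, that only originally observed values (not previously filled ones) enter the average, and that a gap is left unfilled when none of the six candidate samples is available. The function name and the toy period of 4 are illustrative; for hourly data a period such as 24 would be a natural candidate, but the actual value is determined from the ACF and PACF.

```python
def historical_average_fill(series, period):
    """Fill gaps with the mean of available values at t-p, t-p±1, t-2p, and t-2p±1."""
    filled = list(series)
    for t, value in enumerate(series):
        if value is not None:
            continue
        candidates = []
        for back in (period, 2 * period):   # the two preceding periods
            for shift in (-1, 0, 1):        # each period's value and its two neighbours
                idx = t - back + shift
                if idx >= 0 and series[idx] is not None:
                    candidates.append(series[idx])
        if candidates:                      # n = len(candidates) in Eq. (1)
            filled[t] = sum(candidates) / len(candidates)
    return filled

# Toy series with period 4 and one gap at index 9.
toy = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0, 1.0, None, 3.0, 4.0]
filled = historical_average_fill(toy, period=4)
print(filled[9])  # 2.0
```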

3.2 Prediction scheme

This study aims to forecast hourly concentrations for the coming 1, 2, and 3 d based on historical pollutant measurements and other available information as inputs. This is a time series prediction over multiple time steps; for example, a 3 d prediction spans 72 hourly time steps. Instead of a more complex network structure, multiple single-output ML models are used to forecast the different air pollutants at horizons of k = 1, 2, and 3 d, as shown in Eq. (2).

$$
\hat{\rho}_{i,j}(d,t) = \mathrm{ML\_model}\bigl( \tilde{\rho}_{i,j}(d-k,t),\ \overline{\rho}^{S}_{i,j}(d-k,t),\ \check{\rho}_{i,j}(d,t),\ W(d,t),\ C(d,t) \bigr), \tag{2}
$$

where $\hat{\rho}_{i,j}(d,t)$ is the forecast of pollutant j for day d and time t at location i, and $\tilde{\rho}_{i,j}(d,t)$ is the corresponding measurement. $\overline{\rho}^{S}_{i,j}(d,t)$ denotes a set S of statistical measures, including the maximum, minimum, 25 % quantile, and 75 % quantile of the concentrations measured during the past 24 h until t; the underlying measurements can be represented by the set $\{\tilde{\rho}_{i,j}(d,t), \tilde{\rho}_{i,j}(d,t-1), \tilde{\rho}_{i,j}(d,t-2), \dots\}$. $\check{\rho}_{i,j}(d,t)$ is the concentration predicted by the deterministic model, W(d,t) represents the weather conditions predicted for day d and time t, and C(d,t) represents the calendar data (such as hour and weekday) for day d and time t.
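The inputs of Eq. (2) can be assembled into one feature vector per target hour, roughly as sketched below. The sketch makes several assumptions: all series are hourly lists sharing the same start time, the 24 h statistics window ends at the lagged time and includes it, C(d,t) is taken to be calendar data (hour, weekday, day of year), and the quantiles use the "inclusive" method of Python's `statistics.quantiles`. The function and feature names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import quantiles

def build_features(measured, deterministic, weather, start, h, k):
    """Feature dict for target hour index h and a horizon of k days.

    measured, deterministic: hourly lists; weather: list of dicts; start: datetime of index 0.
    """
    lag = h - 24 * k                      # index of the lagged measurement (d - k, t)
    if lag - 23 < 0:
        raise ValueError("not enough history for the 24 h statistics window")
    window = measured[lag - 23: lag + 1]  # past 24 h up to and including the lagged time
    q25, _, q75 = quantiles(window, n=4, method="inclusive")
    ts = start + timedelta(hours=h)
    features = {
        "lagged_measurement": measured[lag],
        "max_24h": max(window),
        "min_24h": min(window),
        "q25_24h": q25,
        "q75_24h": q75,
        "deterministic_forecast": deterministic[h],
        "hour": ts.hour,
        "weekday": ts.weekday(),          # Monday = 0
        "julian_day": ts.timetuple().tm_yday,
    }
    features.update(weather[h])           # forecast weather for the target hour
    return features

# Synthetic data for illustration only.
measured = [float(i) for i in range(100)]
deterministic = [2.0 * i for i in range(100)]
weather = [{"temperature": 5.0} for _ in range(100)]
feats = build_features(measured, deterministic, weather, datetime(2021, 1, 1), h=72, k=1)
```

One such dictionary per target hour, paired with the measured concentration at that hour, would form a training sample for a single-output model at horizon k.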

Figure 3 illustrates the prediction horizon and the lagged information horizon for the case of 1 d prediction. To build consistent statistical ML models with a fixed rolling horizon, each new measurement at the current time (d,t) leads to an additional prediction 1 d ahead, i.e. the predicted value at $(d+1,t)$. In this case, the measurement statistics $\overline{\rho}^{S}_{i,j}(d,t)$ are based on the measurements from the 1 d preceding (d,t), resulting in the lagged rolling horizon shown in Fig. 3.

https://acp.copernicus.org/articles/24/807/2024/acp-24-807-2024-f03

Figure 3. Illustration of the machine learning modelling scheme for 1 d prediction based on available datasets.
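The rolling horizon can be sketched as a small scheduling helper: each new measurement time yields one target time per horizon, and all horizons share the same 24 h statistics window. The sketch assumes the window covers the 24 hourly samples ending at the current time; the function name and dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta

def rolling_targets(now, horizons=(1, 2, 3)):
    """For a new measurement at `now`, return the target time and statistics window per horizon."""
    # Assumption: the window covers the 24 hourly samples ending at `now`.
    window = (now - timedelta(hours=23), now)
    return {
        k: {"target": now + timedelta(days=k), "stats_window": window}
        for k in horizons
    }

plan = rolling_targets(datetime(2021, 6, 1, 8))
for k, item in plan.items():
    print(k, item["target"], item["stats_window"])
```

Each horizon k would then be served by its own single-output model, fed with features taken from the shared window and from the forecasts valid at the target time.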
