Meteorology-normalized impact of COVID-19 lockdown upon NO 2 pollution in Spain

. The spread of the new coronavirus (COVID-19) forced the Spanish Government to implement extensive lockdown measures to reduce the number of hospital admissions, starting on March 14 th 2020. Over the following days and weeks, strong reductions of nitrogen dioxide (NO 2 ) pollution were reported in many regions of Spain. A substantial part of these reductions is obviously due to decreased local and regional anthropogenic emissions. Yet, the confounding effect of meteorological variability hinders a reliable quantiﬁcation of the lockdown impact upon the observed pollution levels. Our study uses machine learning 5 (ML) models fed by meteorological data along with other time features to estimate the "business-as-usual" NO 2 mixing ratios that would have been observed in the absence of the lockdown. We then quantify the so-called meteorology-normalized NO 2 reductions induced by the lockdown measures by comparing the business-as-usual with the actually observed NO 2 mixing ratios. We applied this analysis for a selection of urban background and trafﬁc stations covering the more than 50 Spanish provinces and islands. 10 , a stronger meteorology-normalized NO 2 change is highlighted at trafﬁc stations compared to urban background ones. Our results are consistent with foreseen (although still uncertain) changes in anthropogenic emissions induced by the lockdown. We also show the importance of taking into account meteorological variability for accurately assessing the impact of the lockdown on NO 2 levels, in particular at ﬁne spatial and temporal scales. implications of +2.5%, respectively, when 395 averaged over all provinces. Among the different provinces, the corresponding 5 th /95 th percentiles reach -17/+48, -29/+34 and -30/+37% during phases I, II and III, respectively. For the case of Barcelona province, these relative biases are +39, +25 and 30%. Code and data availability. The EEA AQ eReporting, ERA5 and Gridded Population of the World (GPW) version 5 datasets used in this study are publicly available. The HERMESv3_BU (Bottom-Up) code package with its documentation is publicly available at the following gitlab repository: https://earth.bsc.es/gitlab/es/hermesv3_bu (https://doi.org/10.5281/zenodo.3521897, Guevara et al., 2019). leaf hyperparameter controls the minimum number of samples to allow in a terminal node (larger values limiting the risk of overﬁtting).


Introduction
The rapid spread of the new coronavirus  has led numerous countries worldwide to put their citizens on various forms of lockdown, with measures ranging from light social distancing to almost complete restrictions on mobility (Anderson single ML model over a relatively long and thus potentially non-stationary time series. In contrast to linear regression, GBM does not learn equations relating the target variable to the different features, but rather builds non-parametric relationships between target and features. As a consequence, such a model will always make NO 2 predictions within the range of NO 2 values used in the training, regardless of the inclusion of the aforementioned date index feature or the feature values it takes for making the predictions. However, if NO 2 strongly increases (decreases) with time in the training dataset, the GBM model is 130 able to split the data using the trend feature and therefore predict NO 2 in the range of the higher (lower) mixing ratios reached by the end of the training period. We note that even with a trend feature, such a model is not expected to stay valid very far in 5 https://doi.org/10.5194/acp-2020-446 Preprint. Discussion started: 25 May 2020 c Author(s) 2020. CC BY 4.0 License. time relative to the training data when the training data is following a too strong trend. Our sensitivity tests have clearly shown that the behaviour of the ML models substantially improves when including the trend feature.
In our study, the GBM models are trained over the 3 last full years, namely 2017-2019 and then used for predicting business-135 as-usual NO 2 mixing ratios over the 4 following months, from January to April 2020. Such a duration is expected to allow capturing a substantial part of the inter-annual variability of NO 2 mixing ratios and meteorological conditions and ensures some past data is available for quantifying the uncertainties of our ML modelling strategy (as explained later in Sect. 2.3.3).
Note that no improvement was found with extended training periods of 4 or 5 years. Although our interest is to predict NO 2 during the lockdown period, the two and half preceding months were kept to test the validity of our predictions and uncertainty These uncertainties are suited for our ML-based daily NO 2 predictions. Since daily uncertainties are at least partly uncorrelated, predictions averaged over longer time periods are expected to have smaller uncertainties. We estimated the uncertainty affecting our ML predictions at the weekly scale. We used a similar approach than previously described for the daily uncertainty, but based on the 7-day running average of the daily residuals (by requiring a minimum of 5 over 7 days with available 170 data). The 5 th and 95 th percentiles were computed based on the entire set of residuals (514 residuals on average at each station over 2016-2019). On average over all provinces, the weekly uncertainty interval obtained are [-4.2, +3.5] ppbv at urban background stations, and [-4.9, +4.5] ppbv at traffic stations, which represents a reduction of 27 and 30%, respectively, with respect to the daily uncertainties.
Our main interest in this study is to quantify the mean NO 2 changes during the lockdown period. We decided to keep the 175 weekly scale uncertainties for the predictions of business-as-usual NO 2 mixing ratios averaged over its different phases (10-13 days each) and over the entire lockdown period (41 days). The use of weekly uncertainties is likely conservative when used for the entire lockdown average, but accounts for potential data gaps, particularly when estimating the shorter phases therein.

Results and Discussion
In this section, we first evaluate the ML-based predictions of business-as-usual NO 2 mixing ratios (Sect. 3.1). We then illustrate 180 our methodology in the two provinces with largest population density, namely Madrid and Barcelona (Sect. 3.2). We then analyze the meteorology-normalized changes of NO 2 obtained in all Spanish provinces (Sect. 3.3). We discuss in Sect. 3.4 the potential relationships with emission reductions. Finally, we discuss in Sect. 3.5 the advantages of our ML-based approach for estimating the baseline NO 2 pollution compared to the climatological approach.

185
The performance of the ML predictions is shown in Fig. 2, and the statistics averaged over all Spanish provinces reported in Table 1. Results are evaluated using the following metrics, calculated based on daily NO 2 mixing ratios : mean bias (MB), normalized mean bias (nMB), root mean square error (RMSE), normalized root mean square error (nRMSE) and Pearson cor- For information purposes, we included the statistical results obtained over the training dataset (2017/01/01-2019/12/31).  ing results over the training data may be useful for highlighting obvious situations of overfitting, when the performance is almost perfect. At urban background stations, results show no bias, low RMSE (always below 30%, 20% on average over all provinces), and a high PCC (0.91 on average). Similar results are obtained at traffic stations. Although such a performance obtained is very good, there are no clear signs of too prejudicial overfitting at this stage.

195
On average over the different provinces, the bias increases to +2 and +7%, the RMSE to 32 and 28%, and the PCC is reduced to 0.71 and 0.75, at urban background and traffic stations, respectively. Results highlight some inter-regional variability of the performance, with poorer statistics in some provinces, at least for one type of station. At most stations, the bias remains below Table 1. Performance of the ML predictions of NO2 mixing ratios on the test dataset (2020/01/01-2020/03/13), averaged over all Spanish provinces (the standard deviation among the provinces is also indicated).

Dataset
Type of  Several factors may explain the poorer statistical results obtained at some stations. First and foremost, it may be due to deficiencies in the ML modelling, and in particular to some overfitting. This seems to be the case of Fuerteventura and Huesca, given the good performances obtained with the training data. Since we are considering numerous stations in this study, we need a fixed procedure applied similarly to all ML models to be trained. As described in Sect. 2.3.2, we designed our training and 205 tuning procedure in order to limit as much as possible this common issue, through rolling-origin cross-validation and randomized search in the hyperparameters space. Overall results are satisfactory but some overfitting can still persist in some cases.
Second, although moderately, some of the biases and errors may be partly due to trends and/or inter-annual variability of NO 2 .
As previously explained (Sect. 2.3.2), by model design, if NO 2 levels in the first months of 2020 are outside of the NO 2 range in the 2017-2019 training dataset, our predictions over the lockdown period could be equally biased. The different NO 2 time 210 series indeed show some cases where NO 2 mixing ratios are lower than in the past years (since 2013). In the frame of our study, it is important to mention that, although the lockdown was officially implemented on March 14 th , the COVID-19 started to perturb the business-as-usual situation in the days/weeks before, first through the cancellation of numerous events and, later, through unusual movements of a part of the population (e.g. to second homes). Although complicated to assess more precisely in each of the Spanish provinces, this likely explains part of the biases noticed in the second half of the test period.

215
Third, poor performances at some stations may be due to weaker relationships between meteorological input data and NO 2 mixing ratios. This points to uncertainties in the ERA5 meteorology data. For example, the relatively coarse spatial resolution (31 km) of ERA5 data may only capture part of the meteorological variability existing at a given station. This is especially true when considering stations located in urban areas where the complex urban morphology (e.g. presence of buildings, canyon street) is known to locally distort the mesoscale circulation. Decision-tree based ML methods like GBM offer some inter-220 pretability by providing a measure of the importance of the different features included as input data. In our case, on average over all ML models, the most important feature is the boundary layer height (22±7%) followed by the surface wind speed  After the lockdown, NO 2 mixing ratios decreased down to 8 and 11 ppbv on average at the urban background and traffic 290 stations, respectively, both reaching minimum daily mixing ratios of 4 ppbv. Results highlight strong and statistically significant differences with the business-as-usual situation in which NO 2 levels would have remained around 15-21 ppbv during that period. As in Madrid, the strongest differences are found in April, during the phases II and III of the lockdown. Note that these differences exceed by large the aforementioned positive bias encountered after February. Interestingly, besides the strong reduction, observed NO 2 mixing ratios followed a very similar variability than business-as-usual NO 2 , which highlights the 295 major influence of meteorological conditions on the levels of pollution, as previously mentioned by Tobías et al. (2020). For instance, the increase of NO 2 mixing ratios between April 6 th and April 9 th appears linked to unusually low wind speeds over Barcelona, 1.7 m.s −1 on average over these days, which is slightly below the climatological (2013-2020) 5 th percentile of wind speed in April (1.8 m.s −1 ). Without the lockdown, this stagnant situation associated with the business-as-usual emission forcing would have increased NO 2 by about 5-10 ppbv, according to the ML predictions. Observed NO 2 also slightly increased

Meteorology-normalized changes of NO 2 mixing ratios over Spain
We computed the meteorology-normalized changes of NO 2 for all the selected stations. Results are presented in Fig. 5, together with the weekly uncertainty of our ML predictions (colored lines). For information purposes, we also display the daily uncer-310 tainty (black lines). Results are colored as a function of their degree of significance, here computed as the distance between the NO 2 change best estimate and the upper limit of the weekly uncertainty interval, normalized by the distance between the best estimate and zero. A degree of significance of 1 thus indicates a NO 2 change significant at a 90% confidence level. Statistics over the changes of NO 2 obtained in all provinces are reported in Table 2. A map of best estimates of NO 2 changes at each station is also given in Fig. 6.

315
Results highlight that the reduction previously described in Madrid and Barcelona extends to most Spanish provinces, although with some inter-regional variability in the extent of the change and the degree of statistical significance. On average over all changes obtained in all provinces) are -7 ppbv (-65%) and -2 ppbv (-36%). The NO 2 change is significant with more than 320 90% confidence in 20 out of 38 provinces, with many of the remaining ones being relatively close to that confidence level. A similar, yet more statistically significant reduction is found at traffic stations, with a mean NO 2 decrease of -6[-11,-1] ppbv (or -50[-88,-8]%), and 28 out of 37 stations exceeding the 90% confidence level. The spread of NO 2 change between the different provinces is also quite similar between the two types of stations, with 5 th and 95 th percentiles of -66 and -28%, respectively.
Generally, the meteorology-normalized NO 2 reductions in the provinces of the southern half of the country appear stronger 325 and in more cases statistically significant.
As previously observed in Madrid and Barcelona, results in Table 2 highlight noticeable differences between the different phases of the lockdown. The corresponding figures (with both absolute and relative changes) can be found in the Appendix (Figs. A1, A2, A3 and A4). The mean reduction of NO 2 during phase I was about -40% at both station types, and further increased to about -55% during phases II and III. The lower reduction during the first phase is partly explained by the fact that 330 NO 2 concentrations started at their business-as-usual levels and took a few days to reach their minimum. During the two last phases, NO 2 was found to be reduced in many more provinces, as shown by the 95 th percentile that ranges between -33 and -45% depending on the type of station during phases II and III, compared to only -12 to -19% during phase I.

Relationship to emission reductions
We contrasted our results with a detailed NO x anthropogenic emission inventory at 4km x 4km resolution over Spain avail-335 able through the bottom-up module of the HERMESv3 emission model, developed at the Earth Sciences Department of the Barcelona Supercomputing Center (Guevara et al., 2020). Averaged over the different stations considered in this study, road transport emissions are the dominant source, with 66 and 69% of the total NO x emissions in the vicinity of urban background and traffic stations, respectively. The other emission sources are the residential/commercial combustion sector (14 and 15%), industrial point sources (8 and 13%) and shipping and port activities (11 and 3%). In Spain, the public agency in charge of 340 monitoring traffic (Dirección General del Tráfico) reported progressive reductions in total traffic down to levels about -60 to -90% lower than usual, with substantial day-to-day variability and strongest reductions during weekends. Assuming to first order a linear relationship between NO 2 urban background mixing ratios and local surrounding NO x emissions (within a 4km x 4km cell) and applying a 70% (80%) reduction of road transport would lead to a NO 2 reduction of about 47% (54%), which is consistent with our findings. Our knowledge about the impact of the lockdown on the other emission sectors remains at 345 this stage quite limited. NO x emissions from industry likely also decreased but quantifying this reduction, even roughly, is more complex as some industries were considered as essential and thus not affected by the lockdown. Although 9-13% of the surrounding emissions (in the 4km x 4km cell of the inventory) are associated to this sector, the impact of idling industrial activities on the pollution levels observed at the selected stations may be relatively small considering that none of these stations are classified as "industrial". The residential/commercial emission sector represents another unknown since the expected emis-350 sion increment caused by a population spending more time at home may be compensated by the closure of most shops, schools trained on different labels: the first base learner is trained on the dataset, the second on the errors left by the previous one, the third on the errors left by the two previous ones, and so on. RF (used by Grange et al. (2018) and Grange and Carslaw (2019)) 475 consists in aggregating base learners trained on random subsets of the training dataset based on a random subset of features.
Appendix C: Tuning of the GBM model The training of the model is conducted together with a search of the optimal hyperparameter tuning. We retained a so-called randomized search in which a range of values is given for each hyperparameter of interest and a total number of hyperparameters combinations to test (20 in our case). Compared to the so-called grid search in which all combinations of hyperparameters 480 are tested, this choice allows to explore a large part of the hyperparameters space for a greatly reduced computational cost, and is less prone to overfitting.
We used the scikit-learn Python package. The learning rate was fixed to 0.05 and the number of features to consider when looking for the best split is fixed to the square root of the number of features (max_features in scikit-learn, set to "sqrt"). Besides that, the tuning of the GBM model was done over the following set of hyperparameters: the tree maximum depth (max_depth