A machine learning approach to quantify meteorological drivers of 1 recent ozone pollution in China

. Surface ozone concentrations have been increasing in many regions of China for the past few years, in contrast to 11 policy-driven declines in other key air pollutants such as particulate matter. While the central role of meteorology in modulating 12 ozone pollution is widely acknowledged, its quantitative contribution remains highly uncertain. Here, we use a data-driven 13 machine learning approach to assess the impacts of meteorology on surface ozone variations in China for the years 2015 to 14 2019, considering the months of highest ozone pollution from April to October. To quantify the importance of various 15 meteorological driver variables, we apply non-linear random forest regression (RFR) and linear ridge regression (RR) to learn 16 relationships between meteorological variability and surface ozone in China, and contrast the results to those obtained with 17 the widely used multiple linear regression (MLR) and stepwise MLR. We show that RFR outperforms the three linear methods 18 when predicting ozone using only local meteorological predictor variables. This implies the importance of non-linear 19 relationships between local meteorological factors and ozone, which are not captured by linear regression algorithms. In 20 addition, we find that including non-local meteorological predictors can further improve the modelling skill of RR, particularly 21 for Southern China, highlighting the importance of large-scale meteorological phenomena for ozone pollution in that region. 22 Overall, RFR and RR are in close agreement concerning the leading meteorological drivers behind regional ozone pollution. 23 For example, we find that temperature variations are the dominant meteorological driver for ozone pollution in Northern China 24 (e.g., Beijing Tianjin Hebei region), whereas variations in relative humidity are the most important factor in Southern China 25 (e.g., Pearl River Delta). Variability in surface solar radiation modulates photochemistry but was not considered as such in 26 previous controlling factor analyses, and is found to be the most important predictor in the Yangtze River Delta and Sichuan 27 Basin regions. In general, our analysis underlines that hot and dry weather conditions with high sunlight intensity are strongly 28 related to high ozone pollution across China. This further validates our novel approach to quantify the central role of 29 meteorology. By contrasting our meteorological ozone predictions with ozone measurements between 2015 and 2019, we 30 estimate that almost half of the observed ozone trends across China might have been caused by meteorological variabilities on 31 average. We highlight that these insights are of particular importance given possible increases in the frequency and intensity 32 of weather extremes such as heatwaves under climate change. 33

Prior to modelling ozone, we pre-processed the meteorological data by averaging the raw hourly data over different 151 periods each day and this process is summarised in Table 1. The averaging periods were not the same for all meteorological 152 variables. For example, T2, SSRD, SLP, RH, and W850 are averaged between local time (UTC+8:00) 06:00 to 18:00 on each 153 day. The average of these hours is sufficient to cover all daytime hours when ozone is photochemically produced from April 154 to October. Total precipitation is calculated as the sum of accumulated precipitation for all hours from 06:00 to 18:00. For 155 zonal and meridional wind at 10 m and 850 hPa, data are averaged over 06:00 to 12:00, which covers the main hours that may 156 have potential fresh emission of precursors and transport or dispersion of precursors or ozone. Boundary layer height (BLH) 157 is averaged over 00:00 to 12:00 for the consideration of both potential night-time emission of industrial activities when 158 boundary layer is still low and transportation emission during morning rush hours. Through this process, raw hourly 159 meteorological data can be converted to daily format, temporally matching with MDA8 ozone data. 160 Table 1. Summary of the meteorological controlling factor variables and the respective times of day considered in their averages.

161
The motivation behind each selected time period is provided in the main text. Note: a positive zonal wind means westerly; positive 162 meridional wind means southerly; positive vertical velocity means downward motion. Finally, both ozone data and meteorological data are deseasonalized. Specifically, for MDA8 ozone and the converted 164 daily meteorological variables, we first calculate 15-day moving window averages centered on the particular calendar date 165 from 2015 to 2019. We then take the difference between each day's MDA8 ozone or daily meteorological variables and these 166 15-day averages to obtain daily anomalies, creating smooth time series of deseasonalized MDA8 ozone and deseasonalized 167 meteorological variables. 168 https://doi.org/10.5194/acp-2021-1075 Preprint. Discussion started: 25 January 2022 c Author(s) 2022. CC BY 4.0 License. predictions can then be used to measure the "out-of-sample" skill of the algorithm in predicting ozone pollution given certain 201 meteorological conditions. In this study, we split the 5-years of data (2015 to 2019) systematically into training/validation and 202 testing datasets one at a time and in a rotating fashion. Specifically, 4 of these 5 years are classified as training/validation data, 203 leaving 1 year for testing. To ensure that we are measuring the true predictive performance and relationships, our predictive 204 results and model evaluations are only conducted for the test data, which has not been used at the training and validation stages. 205 This process rotates until ozone data for each year have been assigned once as test data so that all 5 years of data can be 206 predicted by RFR and RR. 207 Machine learning regressions such as RFR and RR optimize their predictive performance by tuning certain sets of 208 hyperparameters. To determine the most suitable set of hyperparameters, we use a statistical cross-validation method. Initially, 209 we split the 5 years of data into 1 test year and 4 training/validation years. For cross-validation, the 4-year training/validation 210 set is further split into four folds (one year per fold). We then run a grid search over pre-defined combinations of 211 hyperparameters by training on three folds and predicting on the fourth fold in a classic 4-fold cross-validation procedure. We 212 finally select the best estimated set of hyperparameters on the basis of the average validation data prediction performance as 213 measured by the coefficient of determination (R 2 -score), and refit model coefficients using this set of hyperparameters for the 214 entire 4 years of training/validation data. We note that we avoid a 'leave-one-out' cross validation method as we expect 215 autocorrelation in our data (i.e., MDA8 ozone may share similarity in adjacent dates), which, intuitively, could lead to an 216 overestimate of our predictive skill if testing data immediately follows training data points. 217 The ranges of hyperparameters we search over for both RFR and RR are set as follows. For RFR, the maximum depth for 218 trees growing is iterated in a step of 1 from 8 to 15. Maximum percentage of features and maximum samples (with bootstrap 219 method) are set from 20% to 90% and 30% to 80% with 10% incremental step, respectively. Total tree number for the forest 220 is set at 200 as a compromise between model complexity and runtime. Optimizations further showed that the minimum samples 221 per leaf is best set to 3 in our RFRs so that we finally kept this value constant in our grid searches. In terms of RR, the 222 regularization strength is iterated over a range of 1 to 199 with incremental step of 2, which appeared to encapsulate the best 223 solution in each case. A detailed explanation of these hyperparameters for RFR  It is important to first assess how well of these machine learning algorithms can model ozone by using only meteorological 238 variables as predictors. Therefore, we adopt the coefficient of determination (R 2 ) (i.e., the square of Pearson correlation 239 coefficient, R) as a standard metric for prediction performance on the deseasonalized MDA8 ozone data. As mentioned above, 240 to measure the true predictive skill of the machine learning functions, we only compare our predictions for out-of-sample test 241 data that are not used during training/validation stages against the deseasonalized measured MDA8 ozone data. 242 The predictors used by RFR and RR here are only the local meteorological variables, i.e., each ERA5 grid point's 243 meteorological variables are used as predictors to model averaged deseasonalized MDA8 ozone for that particular grid location. 244 The average prediction performance of RFR and RR by comparing predictions across all test years against the deseasonalized 245 measured MDA8 ozone data across China is illustrated in Fig. 2.  that the problem of collinearity is still limited given the use of 11 meteorological predictors. The enhanced performance of 265 RFR compared to RR may therefore be attributed to RFR being able to model non-linear relationships between local 266 meteorological variables and ozone, indicating that a flexible machine learning approach such as RFR that can capture non-267 linearity is more suitable to reflect relationships between local meteorological factors and ozone. 268

Predictive skill using additional non-local meteorological predictors 269
Meteorological phenomena usually belong into a larger spatial context. For example, high-pressure systems usually take 270 in a larger spatial domain, suppressing air flow in certain directions. Consequently, it seems intuitive that a meteorological 271 controlling factor framework for ozone might benefit from including additional non-local information in the regressions, i.e., 272 if we were to consider surrounding meteorological context information that is not just limited to the predicted ozone grid point 273 in question (Ceppi and Nowack 2021). 274 We thus ran a second version of our controlling factor analysis in which we did not just include local values of 275 meteorological drivers, but additionally consider a spatially wider effect of meteorology on a two-dimensional (2D) domain 276 of meteorological variables. This is possible since both RR and RFR are less prone to collinearity and overfitting in high-277 dimensional regression settings than simple non-regularized MLR approaches would be, meaning that the additional 278 information included in the regressions might well outweigh the cost of adding more predictors. to our predictions, but also significantly increases the dimensionality (number of predictors) of our regression problem and 284 increases the number of collinear predictors. Indeed, we find that through the additional L2-regularization in RR with 2D 285 expansion (denoted as RR-2D), its predictions by far outperform its MLR-2D equivalent which now suffers from severe 286 overfitting (compare R 2 scores in Fig. 3a and 3b). Noteworthy, with the increase of dimensionality in RR-2D, the regularization 287 strength now is adjusted to larger values starting from 10 3 to 10 9 with a factor of 1.42 incremental increase at each step, which 288 is much higher than the regularization strength set in RR with only local predictors. Suh a large increase of range is due to the 289 consideration of adding large number of meteorological predictors within the 2D domain, and it ensures that the best solution 290 with the most suitable regularization strength for each run can be well covered within this range. The overall R 2 scores for RR-291 2D ranges from 0.5 to 0.6 while R 2 in MLR-2D ranges from 0.3 to 0.4; MLR-2D is overall worse than MLR with only local 292 meteorological predictors in terms of R 2 . It is well-known that RFR may not be as effective at handling multi-collinearity in 293 very high dimensional settings as RR (e.g., Dormann et al., 2013) and its training time also increases exponentially with the 294 number of predictors. We thus only ran RFR with 2D expansion (denoted as RFR-2D) for the southern Chinese PRD region, 295 where we found a particularly large R 2 -score improvement after including non-local predictors in RR-2D (R 2 =0.60) as 296 compared to local RR (R 2 =0.47), and even non-linear local RFR (R 2 =0.53). These results highlight the apparent importance 297 of large-scale meteorological phenomena in this region. However, we find that RFR-2D improves the average R 2 score (0.57) 298 relative to RR and RFR with only local predictors, but does not perform better than RR-2D. 299  For clarity, Table 2 summarizes the averaged R 2 in each region by all machine learning methods including RFR, RR, 303 MLR, stepwise MLR, RR-2D, MLR-2D and RFR-2D. In summary, RFR and RR-2D are overall the two machine learning 304 algorithms with highest R 2 in these four regions, while MLR and RR are equivalent. 305

Regionally averaged prediction skill 310
In order to assess the performance of the algorithms in modelling regional average ozone, we further compared our 311 regionally-averaged machine learning predictions by RFR, RR and RR-2D against observations for each of the four selected 312 regions in China (Fig. 4), whereas previously we compared regional averages based on predictions for individual grid points 313 whose R 2 scores were subsequently averaged within each region. For this purpose, we averaged all 0.25° × 0.25° grid point 314 observations and model results within each region first and then compared the resulting time series for each test dataset directly. 315 The results of regional averaged predictions and observations for each region are shown in Figure 4, where the goal for the 316 predictions is to fall as close as possible onto the 1:1-line, in combination with a high R 2 -score (i.e., square of Pearson 317 correlation, R). With only local meteorological predictors, RFR still outperforms RR regarding both the coefficient of 318 determination (R 2 , same calculation method as above) and slope (closer to 1) in all four regions. This can likely be attributed 319 to the ability of RFR to capture non-linearity as well. 320 Using this calculation method, regional R 2 are much higher; for RFR, regional R 2 in BTH, YRD, PRD and Sichuan Basin 321 is 0.71, 0.75, 0.7 and 0.83, since each grid is more prone to the effect of local emissions and related local uncertainties as 322 regional average can factor out the local effects (i.e., emissions and uncertainties) to some extent. For instance, stations that 323 are located relatively close to emission source may be more influenced by NO titration effect which may lower ozone level 324 (Sillman, 1999). This effect can be more significant in some urban areas (Li et al., 2017)    than 1, suggesting a systemic underprediction of ozone for the highest observed ozone levels (higher than the deseasonalized 336 zero mean) and overpredictions of ozone for low ozone pollution regimes (lower than the deseasonalized zero mean). As 337 previously mentioned, such a mismatch may -at least to a degree -arises from non-meteorological factors such as the effect 338 of precursor emissions, which are not taken into account here. Although regionally averaged prediction skill is less affected 339 by local emissions, it will not be completely free from such effects. However, the increase of the magnitude of the slopes in 340 RR-2D with closer to 1 also suggests that considering non-local meteorological variables may help improve the performance 341 of ozone pollution controlling factor analyses, even if non-linearity is not intrinsically taken into account. 342 https://doi.org/10.5194/acp-2021-1075 Preprint. Discussion started: 25 January 2022 c Author(s) 2022. CC BY 4.0 License.

Quantifying the importance of meteorological predictors 343
We next aim to quantify how important each local meteorological predictor is for ozone pollution across China. For RR, 344 we use the regression slope as a standard metric to measure how important of each the meteorological predictor on ozone 345 pollution. A large positive value for the slope (regression coefficient) of a meteorological predictor indicates that the predictor 346 has a strong positive effect on ozone levels and vice versa. Since each of the 4-year training data is learned independently, we 347 will show averaged results. For RFR, we measure each predictor's importance through Gini importance (see Sect. 2.5). The 348 highest absolute value for both the RR slope or RFR Gini importance is interpreted as the most important meteorological driver 349 variable identified through our data-driven learning procedure. Note that Gini importance only allows to measure relative 350 influences of predictor variables on ozone variability, but not the sign of the influence, i.e., a high value of Gini importance 351 score is not able to determine whether the predictor has positive or negative effect on ozone. 352 The Gini importance scores estimated by RFR and the slopes learned by RR for each region are shown in

359
In general, both RFR and RR show good agreement in terms of identifying the dominant meteorological drivers for each 360 region. Temperature at 2 m is found to be the most important meteorological driver for ozone in BTH, followed by surface 361 https://doi.org/10.5194/acp-2021-1075 Preprint. Discussion started: 25 January 2022 c Author(s) 2022. CC BY 4.0 License. solar downward radiation, albeit the relative difference between these two variables differs more clearly for RFR, which might 362 be caused by non-linearity in the ozone-temperature relationship (Supplementary Fig. S5). Temperature was also identified as 363 the most important meteorological variable in BTH by Li et al. (2019a) using MLR. Moreover, a more pronounced positive 364 correlation between daily maximum temperature and MDA8 ozone is found in northern regions of China (Fig. 6a) In addition, despite more OH may be available given high humidity, OH can react with NO2, forming HNO3 in highly NOx-390 polluted regions, which ultimately leading to less efficient O3 formation by competing with the oxidation of VOC and CO with 391 in PRD, and it is negatively correlated with average MDA8 ozone. More generally, the regional average of MDA8 ozone in 397 PRD is negatively correlated with meridional wind at 850hPa from South China Sea (Fig. 6b), indicating strong marine air 398 inflow may have a significant cleaning and dispersion effect on PRD ozone and its precursors. Furthermore, the negative 399 correlation also expands to the northeast areas to the PRD, suggesting lower ozone in PRD given strong southerly wind in 400 these areas, which may hinder the transport of ozone and its precursors to PRD. This finding is consistent with the backward 401 trajectories and numerical modelling analysis by Qu et al. (2021). 402 403 Figure. 6 (a) Spearman correlation between daytime (06:00 to 18:00) averaged temperature at 2 m and MDA8 ozone from 2015 to variable in this set-up (i.e., one for each grid point in the 2D predictor domain), we sum up Gini scores for all grid points within 417 the expanded domain for each meteorological variable; and this summed value is denoted as the importance for that particular 418 meteorological variable. As illustrated in Fig. 7 (a), the relative feature importance of vertical velocity at 850hPa (w850) 419 increases compared to RFR using only local predictors (see Fig. 5b), likely reflecting the larger-scale influences of downward 420 transport of air masses in PRD region. Other key meteorological drivers (RH, solar radiation and meridional wind at 850hPa) 421 remain in a similar order to what was identified by purely local regressions. The model performance is slightly improved by 422 adding the 2D information with an increase of R 2 to 0.73 (from 0.70) in comparison to original RFR without 2D expansion. 423 However, we note that there appears to be a trade-off between the inclusion of non-linear relationships using RFR and 424 collinearity in high dimensions. Indeed, we find that the R 2 in RFR-2D for PRD region (see Fig. 7b) is still slightly less than 425 the R 2 using RR-2D (0.76) and the predictions from RR-2D are closer to observations with less deviations at both high and 426 low ozone value predictions (see Fig. 4j), suggesting that RR is better at handling the dimensionality increase of predictors, 427 which now slightly outweighs the importance of non-linearity in high dimensions.  Given this very short period, we are aware that any long-term trend analysis is explorative and has to be interpreted carefully, 455 as will also become evident from low statistical significance in many detected trends. We nevertheless attempt such an analysis 456 to demonstrate how our method can be used in such contexts and to also evaluate if any statistically significant trends are 457 https://doi.org/10.5194/acp-2021-1075 Preprint. Discussion started: 25 January 2022 c Author(s) 2022. CC BY 4.0 License. robust after accounting for meteorological influences. After all, as we have demonstrated above, we can quantify such 458 influences with greater skill than using simple MLR methods applied previously. 459 For trend analyses, we first convert MDA8 ozone concentrations from mass concentrations (μg m -3 ) to volume mixing 460 ratios (ppbv). We then average MDA8 ozone over April to October or summertime for each year for both observational data 461 and model results predicted by our three best-performing controlling factor regressions (RFR, RR and RR-2D). The predictions 462 can be considered as a quantitative estimate for the influence of meteorology on the ozone record during the study period. The 463 residual (true ozone signal minus meteorological predictions) will for example be mainly reflective of anthropogenic 464 contribution but will also inevitably contain some uncertainties related to the accuracy of the machine learning algorithms in 465 modelling ozone. 466 Finally, we aim to calculate trends on a ERA5 grid-by-grid point basis. Although RFR, RR and RR-2D all show significant 493 skill in modelling ozone across China, RR-2D exhibited particularly increased predictive skill in southern China. Therefore, 494 for assessing meteorologically-driven trends of MDA8 ozone across all ERA5 grid locations in China, we will only be 495 each region. In general, ozone pollution in northern China such as in the Beijing-Tianjin-Hebei (BTH) region is predominantly 532 driven by temperature fluctuations while ozone in southern China like in Pearl River Delta (PRD) region is particularly strongly 533 controlled by humidity, possibly due to the important role of humid weather in preventing significant ozone pollution episodes 534 in this region, while the effect of humidity is constrained in BTH probably because of the relatively drier condition in this 535 region. Besides, we observe a strong influence in PRD of air exchange with pristine marine regions, leading to a greater 536 influence of large-scale wind directions, e.g., through the transport of clean marine air into the region, or through air stagnation 537 and ozone accumulation under large-scale sinking atmospheric motion. Surface solar radiation plays a major role in general 538 due to its importance for setting the conditions for ozone photochemistry, which is particularly dominant in the Yangtze River 539 Delta (YRD) and Sichuan Basin. Our work thus highlights that surface solar radiation might be a key predictor to consider in 540 future controlling factor analyses in these two regions. In summary, hot, dry and sunny weather tends to provide more favorable 541 conditions for ozone pollution in China, which is not entirely unexpected but carries important implications for future changes 542 in air pollution under climate change, while simultaneously considering the pivotal role of targeted emission control strategies 543 on ozone precursors. 544 In terms of ozone trends, we find a linear MDA8 ozone increase of about 1.1 ppbv a -1 on average during April to October 545 from 2015 to 2019 across China. Regionally, these trends can be more than twice as large as in BTH. The largest positive 546 https://doi.org/10.5194/acp-2021-1075 Preprint. Discussion started: 25 January 2022 c Author(s) 2022. CC BY 4.0 License.