Model Output Statistics (MOS) applied to CAMS O3 forecasts: trade-offs between continuous and categorical skill scores

Abstract. Air quality (AQ) forecasting systems are usually built upon physics-based numerical models that are affected by a number of uncertainty sources. In order to reduce forecast errors, first and foremost the bias, they are often coupled with Model Output Statistics (MOS) modules. MOS methods are statistical techniques used to correct raw forecasts at surface monitoring station locations, where AQ observations are available. In this study, we investigate to what extent AQ forecasts can be improved using a variety of MOS methods, including persistence (PERS), moving average (MA), quantile mapping (QM), Kalman filter (KF), analogs (AN), and gradient boosting machine (GBM). We apply our analysis to the Copernicus Atmosphere Monitoring Service (CAMS) regional ensemble median O3 forecasts over the Iberian Peninsula during 2018-2019. A key aspect of our study is the evaluation, which is performed using a very comprehensive set of continuous and categorical metrics at various time scales (hourly to daily), along different lead times (1 to 4 days), and using different meteorological input data (forecast vs reanalyzed). Our results show that O3 forecasts can be substantially improved using such MOS corrections and that this improvement goes much beyond the correction of the systematic bias. Although lead time typically affects all MOS methods, some appear more adversely impacted by it than others. When considering MOS methods relying on meteorological information and comparing the results obtained with IFS forecasts and ERA5 reanalysis, the relative deterioration brought by the use of IFS is minor, which paves the way for their use in operational MOS applications. Importantly, our results also clearly show the trade-offs between continuous and categorical skills and their dependencies on the MOS method. The most sophisticated MOS methods better reproduce O3 mixing ratios overall, with the lowest errors and highest correlations. However, they are not necessarily the best at predicting the highest O3 episodes, for which simpler MOS methods can give better results. Although the complex impact of MOS methods on the distribution and variability of raw forecasts can only be comprehended through an extended set of complementary statistical metrics, our study shows that optimally implementing MOS in AQ forecast systems crucially requires selecting the appropriate skill score to be optimized for the forecast application of interest.

Author contributions. [...] contributed to the conception and design of the study. [...] were responsible for downloading CAMS and meteorological data. [...] was responsible for installing the Python packages and other useful modules on the MareNostrum supercomputer. [...] was responsible for the acquisition and preprocessing of the air quality data through the GHOST project. HP carried out the analysis. HP, CPGP, OJ, AS, MG, JMA and DB contributed to the interpretation of results. HP was responsible for writing the article, with a careful review from CPGP and [...]

possible in a worst-case scenario where a new operational AQ forecasting system is implemented together with a MOS module, starting from scratch, i.e. without any hindcast data or past observations available. On a given day, all MOS methods can only rely on the historical data accumulated so far. We believe that such a strategy allows a balanced comparison of the different MOS methods given the operational context. As described in more detail in the next section, some MOS methods require very limited prior information to achieve their optimal performance, while others need a larger amount of training data.
In an operational context, the first category of methods might thus be advantaged at the beginning, before being gradually supplanted by the second category.

This section describes the different MOS methods implemented for correcting the raw forecasts (hereafter referred to as RAW), namely: persistence (PERS), moving average (MA), Kalman filter (KF), quantile mapping (QM), analogs (AN) and gradient boosting machine (GBM). All MOS methods are applied independently at each monitoring station.

Persistence (PERS) and moving average (MA) methods
We primarily consider two relatively simple MOS methods: the persistence (PERS) and the moving average (MA). The PERS method simply uses the previously observed concentration values at a specific hour of the day (averaged over one or several days) as the predicted value for this specific hour. It is often used as a reference to measure the skill achieved by other methods, especially for very short-term forecasts. In the MA method, the forecast bias in the previous day or days is used to correct the forecast. As a first approach, we use a time window of one single day for both PERS and MA methods; the corresponding approaches are hereafter referred to as PERS(1) and MA(1). The sensitivity of both PERS and MA methods to the time window is discussed in Sect. 3.4.
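To make the two methods concrete, the following minimal sketch (our own rendering, not code from the study) implements PERS(n) and MA(n), assuming hourly series arranged as (day, hour-of-day) arrays:

```python
import numpy as np

def pers(obs, window=1):
    """PERS(n): for each hour of day, predict the mean of the observed
    values at that hour over the previous `window` days."""
    pred = np.full_like(obs, np.nan, dtype=float)  # obs: (n_days, 24)
    for d in range(window, obs.shape[0]):
        pred[d] = np.nanmean(obs[d - window:d], axis=0)
    return pred

def ma(fcst, obs, window=1):
    """MA(n): subtract the mean forecast bias of the previous `window`
    days (per hour of day) from the current raw forecast."""
    bias = fcst - obs
    corr = np.full_like(fcst, np.nan, dtype=float)
    for d in range(window, fcst.shape[0]):
        corr[d] = fcst[d] - np.nanmean(bias[d - window:d], axis=0)
    return corr
```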

Quantile mapping (QM) method
The quantile mapping (QM) method aims at adjusting the distribution of the forecast concentrations to the distribution of observed concentrations. For a given day, the QM method consists in (1) computing two cumulative distribution functions (CDF), corresponding to past modeled and observed O3 mixing ratios, respectively, (2) locating the current O3 forecast in the modeled CDF, and (3) taking as the corrected forecast the observed value associated with the same quantile: if the current forecast corresponds, for instance, to the 95th percentile of the modeled O3 mixing ratios, the corrected forecast will correspond to the 95th percentile of the observed O3 mixing ratios. This approach thus aims at correcting all quantiles of the distribution, and not only the mean.
In the operational-like context in which this study is conducted (Sect. 2.2), the first QM corrections are computed once 30 days of data have been accumulated, to ensure a minimum representativeness of the model and observation CDFs. For computational reasons, both CDFs are updated every 30 days (although an update frequency of one single day would be optimal in a real operational context).
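A minimal empirical implementation of steps (1)-(3), under our own assumptions about the data layout, could look as follows:

```python
import numpy as np

def qm(current_fcst, past_fcst, past_obs):
    """Quantile mapping with empirical CDFs: locate the current forecast
    in the modeled CDF and return the observed value at the same quantile."""
    q = np.searchsorted(np.sort(past_fcst), current_fcst) / len(past_fcst)
    return np.quantile(past_obs, np.clip(q, 0.0, 1.0))
```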

Kalman filter (KF) method
Over the last decades, the Kalman filter (KF) theory has found numerous applications in problems with different levels of complexity. In atmospheric sciences, it offers a popular framework for sophisticated data assimilation applications (e.g., Gaubert et al., 2014; Di Tomaso et al., 2017), but can also be used as a simple yet powerful MOS method for correcting forecasts (e.g., Delle Monache et al., 2006; Kang et al., 2008; De Ridder et al., 2012). A detailed description of the KF algorithm can be found in Appendix B (as well as in Delle Monache et al. (2006)).
KF provides an efficient way of estimating the forecast bias based on past model and observation information. For a given day at a given hour, the forecast bias is computed as a weighted average of (1) the forecast bias estimated one day before and (2) the corresponding observed forecast bias. Each of these two terms is weighted according to the value of the so-called Kalman gain (K_t), which intrinsically depends on the so-called variance ratio (see Appendix B for more details). The value chosen for this internal parameter substantially affects the behaviour of the KF, and thus the obtained MOS corrections. A variance ratio close to zero induces a Kalman gain close to 0. In such situations, the estimated forecast bias corresponds to the estimated forecast bias of the previous day, independently of the observed forecast error. A very high (infinite) variance ratio gives a Kalman gain close to 1. In this case, the estimated forecast bias corresponds to the observed forecast bias of the previous day, which makes it equivalent to the MA(1) method.
In this study, the variance ratio is adjusted dynamically and updated regularly in order to optimize a specific statistical metric, in our case the RMSE (the corresponding approach being hereafter referred to as KF(RMSE)). The different steps are: (1) at a given day of update, the KF corrections over the entire historical dataset are computed considering different values of the variance ratio, from 0.001 to 100 in a logarithmic progression; (2) the RMSE is computed for each of the corrected historical time series obtained; (3) the variance ratio associated with the best RMSE is retained and used until the next update. Other choices of metric to optimize are explored in Sect. 3.4.
As for QM, for computational reasons, the update frequency is set to 30 days in this study (although, again, an update frequency of one single day would be optimal).
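As an illustration, the following sketch (our own simplified rendering of the recursion given in Appendix B, assuming a complete, gap-free error series at a fixed hour of day) estimates the bias series and selects the variance ratio minimizing the RMSE:

```python
import numpy as np

def kf_bias(errors, ratio, var_eps=1.0):
    """Kalman-filter bias estimates from observed forecast errors
    (forecast minus observation) at a fixed hour of day."""
    var_eta = ratio * var_eps
    x_hat, p = 0.0, var_eta          # initial bias estimate and its variance
    out = np.empty_like(errors)
    for t, y in enumerate(errors):
        out[t] = x_hat               # estimate before seeing today's error
        k = (p + var_eta) / (p + var_eta + var_eps)  # Kalman gain
        x_hat += k * (y - x_hat)
        p = (1.0 - k) * (p + var_eta)
    return out

def tune_ratio(fcst, obs, ratios=np.logspace(-3, 2, 11)):
    """KF(RMSE): retain the variance ratio giving the lowest RMSE over
    the corrected historical series."""
    err = fcst - obs
    rmse = lambda r: np.sqrt(np.mean((obs - (fcst - kf_bias(err, r))) ** 2))
    return min(ratios, key=rmse)
```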

Analogs (AN) method
The analogs method (AN) implemented here consists in (1) comparing the current forecast to all past forecasts available, (2) identifying the past days with the most similar forecast (hereafter referred to as analog days or analogs), and (3) using the corresponding past observed concentrations to estimate the AN-corrected O3 forecast (e.g., Delle Monache et al., 2011, 2013; Djalalova et al., 2015; Huang et al., 2017). The current forecast is compared to past forecasts based on a set of features including the raw O3 mixing ratio forecast from the AQ model and the 10-meter wind speed, 2-meter temperature, surface pressure and boundary layer height forecasts from the meteorological model. The similarity of each day of forecast is assessed using the distance metric proposed by Delle Monache et al. (2011) and previously used in Djalalova et al. (2015) (see the formula in Appendix C). As a first approach, we consider the 10 best analog days, hereafter referred to as AN(10); other values are tested in Sect. 3.4. From those best analog days, the MOS-corrected forecast is computed as the weighted average of the corresponding observed concentrations, where weights are taken as the inverse of the distance metric previously computed. In comparison to a normal average, introducing the weights is expected to slightly reduce the dependence upon the number of analog days chosen.
Therefore, in the analogs paradigm, the past days of similar chemical and/or meteorological conditions are identified in the forecast (i.e. model) space, while the output (i.e. the AN-corrected forecast) is taken from the observation space. The AQ model thus only serves to identify the past observed situations that look similar to the current one.
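The following sketch illustrates the approach for a single target hour, under our own assumptions about array shapes (it is not the study's implementation); the distance follows the form of the metric in Appendix C:

```python
import numpy as np

def an_forecast(fcst_now, past_fcsts, past_obs, weights, n_analogs=10):
    """Analog correction at one target hour.
    fcst_now:   (n_features, 2T+1) forecast around the target hour
    past_fcsts: (n_days, n_features, 2T+1) past forecasts, same window
    past_obs:   (n_days,) observed O3 at the target hour"""
    sigmas = past_fcsts.std(axis=(0, 2))  # per-feature spread of past forecasts
    dists = np.array([
        sum(w / s * np.sqrt(np.sum((fcst_now[i] - day[i]) ** 2))
            for i, (w, s) in enumerate(zip(weights, sigmas)))
        for day in past_fcsts])
    best = np.argsort(dists)[:n_analogs]  # the n closest analog days
    inv = 1.0 / np.maximum(dists[best], 1e-9)
    return np.sum(inv * past_obs[best]) / np.sum(inv)  # inverse-distance weights
```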

Machine-learning-based MOS method
In this study, we also explore the use of ML algorithms as an innovative MOS approach for correcting AQ forecasts. In ML terms, it corresponds to a supervised regression problem where a ML model is trained to predict the observed concentrations, hereafter referred to as the target or output, based on multiple ancillary variables, hereafter referred to as the features or inputs, coming from meteorological and chemistry-transport geophysical models and/or past observations. In this context, the use of ML is of potential interest because (i) we suspect that some relationships exist between the target variable and at least some of these features, (ii) these relationships are likely too complex to be modeled in an analytical way, and (iii) data are available for extracting (learning) information about them. In recent years, ML algorithms have become very popular for many types of predictions, notably due to their ability to model complex (typically non-linear and multi-variable) relationships with good prediction skills. Among the myriad of ML algorithms developed so far, we focus on decision-tree-based ensemble methods, and more specifically on the gradient boosting machine (GBM), which often ranks among the best in prediction skill (as shown in various ML competitions and model intercomparisons, e.g., Caruana and Niculescu-Mizil, 2005).
At each monitoring station, one single ML model is trained to forecast O3 concentrations at all lead hours (from 1 to 96) or days (from 1 to 4), depending on the time scale used (see Sect. 2.4). The features taken into account include a set of chemical features (raw forecast O3 concentration, O3 concentration observed one day before), meteorological features (2-m temperature, 10-m surface wind speed, normalized 10-m zonal and meridional wind speed components, surface pressure, total cloud cover, surface net solar radiation, surface solar radiation downwards, downward UV radiation at the surface, boundary layer height, and geopotential at 500 hPa; all forecast by the meteorological model) and time features (day of year, day of week, lead hour).

Although the past observed O3 concentration corresponds to recursive information that will not be available for all forecast lead days, we use here the same value for all lead days. The tuning of the GBM models is described in Appendix D.
As for QM, the GBM model is first trained (and tuned) only after 30 days, to accumulate enough data, and is then retrained every 30 days based on all historical data available.
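As a minimal illustration of the training step in scikit-learn (with synthetic data; all hyperparameter values other than the learning rate given in Appendix D are placeholders of ours):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 1000  # training instances: one per (day, lead hour) at a given station

# One column per feature listed above (raw O3 forecast, O3 observed one day
# before, meteorological forecasts, day of year, day of week, lead hour);
# synthetic values here, for illustration only.
X = rng.normal(size=(n, 15))
y = 0.7 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.3, size=n)

model = GradientBoostingRegressor(learning_rate=0.05, n_estimators=300,
                                  max_depth=3, subsample=0.7)
model.fit(X, y)                        # retrained every 30 days in the study
o3_corrected = model.predict(X[-96:])  # one prediction per lead hour (1..96)
```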

Time scales of MOS corrections
Current AQ standards are defined according to pollutant-dependent time scales, e.g. the daily 8-hour maximum (d8max) concentration in the case of O3. In the literature, MOS corrections are typically applied to hourly concentrations, providing hourly corrected concentrations from which the value at the appropriate time scale can then be computed. Following this approach, for a given MOS method X, corrections in this study are first computed based on hourly time series (hereafter referred to as X_h), from which the concentrations at the other time scales are then deduced. In addition, MOS corrections are computed directly on daily 24-hour average (X_dd, the additional "d" indicating that the MOS method is applied directly on daily rather than hourly time series), daily 1-hour maximum (X_dd1max) and daily 8-hour maximum (X_dd8max) time series, respectively. When needed, meteorological features are used at the same time scale. This is done to investigate whether applying the MOS correction directly at the regulatory time scale can help achieve better performance.
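For example, under the simplifying assumption (ours) that the 8-hour windows do not cross day boundaries, a d8max series could be derived from hourly corrected values as:

```python
import numpy as np

def d8max(hourly):
    """Daily maximum of 8-hour running means from a (n_days, 24) array.
    Simplified sketch: the 17 windows per day stay within the calendar day."""
    means = np.array([[day[h:h + 8].mean() for h in range(17)]
                      for day in hourly])
    return means.max(axis=1)
```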

Results
We first briefly describe the O3 pollution over the Iberian Peninsula as observed by the monitoring stations and simulated by the raw CAMS forecasts (Sect. 3.1). We then assess the performance of the MOS methods on continuous (Sect. 3.2) and categorical (Sect. 3.3) forecasts, and discuss a set of sensitivity tests (Sect. 3.4). Finally, the impact of the input meteorological data on the MOS methods performance is discussed in Sect. 3.5. In order to ensure fair comparisons between observations and RAW/MOS forecasts, O3 values at a given hour are discarded when at least one of these different datasets does not have data. Over the 2018-2019 period, the resulting data availability exceeds 94% whatever the time scale considered. Note that about 4% of the data is missing due to the aforementioned minimum of 30 days (i.e. January 2018) of accumulated historical data required to start computing the corrected forecasts with some MOS methods.

Ozone pollution over the Iberian Peninsula and raw CAMS forecasts
The European Union sets different standards regarding O3 pollution, including (1) a target threshold of 60 ppbv for the daily 8-hour maximum, with 25 exceedances per year allowed on average over 3 years, (2) an information threshold of 90 ppbv for the daily 1-hour maximum, and (3) an alert threshold of 120 ppbv for the daily 1-hour maximum. In this study, we focus on the first two thresholds and exclude the last one, mainly because exceedances of the alert threshold are extremely rare (only 13 exceedances over 314,005 points, i.e. 0.004%). With such a low frequency of occurrence, such events remain extremely difficult to predict (without predicting too many false alarms).

Performance of MOS methods on continuous forecasts

As mentioned in Sect. 3.1, the RAW O3 D+1 forecasts over the Iberian Peninsula show a moderate overestimation, with nMB around +18% at hourly and daily scales, reduced to +7 and +2% at d8max and d1max scales, respectively. Similarly, the nRMSE ranges between 38% at the hourly scale and 19% at the d1max scale. A reasonable correlation is obtained, around 0.75-0.79 depending on the time scale. The variability appears substantially underestimated, with a nMSDB between -28 and -30%, which is reflected in the low model-versus-observation linear slope obtained (between 0.53 and 0.57 depending on the time scale). The deterioration of the performance of the raw CAMS forecasts with lead time is very low, with hourly-scale nRMSE/PCC decreasing from 38%/0.75 at D+1 to 39%/0.72 at D+4. Such a slow decrease in performance might be due to the relatively coarse resolution of the CAMS forecasts.
The impact of the MOS corrections on the performance strongly varies with the method considered. As expected (by construction), the most basic PERS(1) method gives unbiased O3 forecasts with unbiased variability (nMB and nMSDB of 0%).
Figure 1. Overview of the O3 pollution over the Iberian Peninsula, as observed by monitoring stations (left panels) and as simulated by the CAMS regional ensemble D+1 forecasts (right panels), showing the mean O3 mixing ratios (top panels), and the number of exceedances of the standard (d8max > 60 ppbv; middle panels) and information threshold (d1max > 90 ppbv; bottom panels), over the period 2018-2019. For clarity, the stations without any observed or simulated exceedance are omitted.
Due to the temporal auto-correlation of O3 concentrations, reasonable results are obtained at D+1. Compared to RAW, the PERS(1) method slightly reduces the nRMSE (36% at hourly scale), but does not improve the PCC (0.75). Although still too low, the slope is also greatly improved, with 0.75 at hourly scale (up to 0.84 at d8max scale). However, the performance of this simple method quickly deteriorates with lead time, down to nRMSE/PCC of 42%/0.65 at D+4, thus worse than the RAW forecast. The MA(1) method also removes the bias and corrects most of the underestimated variability (absolute nMSDB below 2%). It substantially improves the other metrics for all lead days, with hourly-scale nRMSE/PCC/slope of 31%/0.81/0.82 at D+1 and 36%/0.74/0.75 at D+4. Thus, the performance still slightly deteriorates with lead time, but less dramatically than with PERS(1). The QM method shows quite similar results to the MA(1) method, but usually with slightly worse results at short lead times and better ones at longer lead times (thus with a slower performance deterioration with lead time). When considering the hourly time scale, the KF, AN and GBM methods give relatively similar results on most of these continuous statistical metrics, except the slope and nMSDB, which are slightly better with KF (followed by GBM). Some negative biases are introduced by these MOS methods, essentially at d1max scale, as well as at d8max scale for GBM specifically. Interestingly, applying these MOS methods directly on d1max or d8max O3 mixing ratios rather than hourly data (i.e. dd1max and dd8max scales) removes most of these biases. However, KF, AN and GBM all outperform the previous MOS methods in terms of nRMSE (about 25%) and PCC (about 0.86), which are substantially improved. Their main limitation lies in the variability, which remains underestimated (nMSDB around -10%), although less than in RAW (-29%).
Note that some MOS methods (QM, AN and GBM) ingest increasing amounts of input data over time, and their performance is thus expected to be relatively lower at the beginning of the period and to increase with time. When comparing the relative change of nRMSE and PCC obtained during the last year (2019) against those previously discussed (period 2018-2019), RAW shows a slight relative deterioration of its performance (nRMSE increased by +2% and no change of PCC), while all MOS methods depict a small relative improvement. Interestingly, the improvement for GBM is substantially larger than for the other MOS methods, with nRMSE decreased by 5% (against +1 to +2% for the other methods) and PCC increased by +2% (against +0 to +1% for the other methods). This improvement is mainly due to the relatively poor predictions made during the very first months of 2018, when the training dataset was the most limited (see time series in Fig. F1 in Appendix F).

Performance of MOS methods on categorical forecasts
Focusing now on the performance for detecting exceedances of the target and information thresholds, Fig. 3 (middle and bottom panels) shows a comprehensive set of metrics, among which the most relevant ones are probably CSI and PSS, followed by SR and AUC.
As previously foreseen, although RAW is very "conservative" with low H and F (very few true positives and false positives), it does not benefit from a strong SR (0.45), and finally shows the worst performance in terms of CSI (0.10) or PSS (0.15).

The term "conservative" here refers to forecasting systems that predict exceedances only with strong evidence; it thus predicts very few exceedances but with higher confidence. It follows relatively well the variability of O 3 (as shown by a reasonably good AUC) but dramatically fails at reaching high O 3 mixing ratios, as illustrated by the low FB (0.25). Even a basic method like PERS(1) provides better detection skills regarding target thresholds. This is especially true during the first lead days, but the performance quickly decreases along lead time, with CSI/PSS reduced from about 0.27/0.42 at D+1 to about 0.14/0.23 at 315 D+4. Except FB, all categorical metrics show a similarly strong sensitivity to the lead time. However, the usefulness of having geophysical O 3 forecasts is nicely illustrated by the results obtained with MA(1), QM and KF(RMSE), the MOS methods relying only on both RAW and observed O 3 data. Indeed, these methods show among the best CSI and/or PSS results, not so far from the two last methods, AN(10) and GBM. For short lead times (D+1), MA(1) clearly outperforms the other methods, especially for PSS. Differences of performance are reduced when considering longer lead times. At D+4, best CSI are obtained 320 with KF(RMSE) dd1max and GBM dd1max (0.28), while best PSS are achieved by QM and MA(1). More generally, KF(RMSE), AN(10) and GBM appear as the most "conservative" MOS approaches here, with relatively low H and F, but a strong SR. In other terms, they predict fewer exceedances but with a higher reliability.
However, when considering the detection of the information threshold (d1max O3 above 90 ppbv), the KF(RMSE), AN(10) and GBM methods still benefit from a strong SR but miss too many of the observed exceedances, which leads to a dramatic deterioration of both CSI and PSS. This means that there is a high chance that an exceedance predicted by these methods indeed occurs, but such exceedances are too rarely predicted. For detecting such high O3 values, the best methods are finally MA(1) for the shortest lead times, and QM for longer ones. Both methods reproduce fairly well the geographical distribution of high O3 episodes (PERS(1) reproduces it perfectly, by construction), as shown in Fig. 4, but still with very low SR (below 0.25 for exceedances of the information threshold). Note that the RAW model alone misses almost all exceedances of the information threshold.

Sensitivity tests
In the previous sections, we provided a first evaluation of the performance of a set of MOS methods. All methods rely on specific choices or parameters that can substantially influence the behaviour of the MOS-corrected forecasts, and thus their general performance. In this section, we discuss some of these choices and investigate their impact on the performance through a series of dedicated sensitivity tests.

Persistence method
The persistence method essentially relies on the choice of the time window over which past observations are averaged to provide the O3 forecast. In the previous section, we used a window of 1 d. A sensitivity test is performed with windows ranging between 1 and 10 d (hereafter referred to as PERS(n), with n the window in days). Results are shown in Fig. G1 in Appendix G, and indicate that, while PERS(1) forecasts were unbiased (whatever the time scale), increasing the window leads to a growing negative bias at d1max and d8max scales. The bias is substantially reduced when working at dd1max and dd8max scales, i.e. when applying the PERS approach directly on daily 1-hour and 8-hour maximums rather than on the hourly time series. The differences between the two approaches originate from the day-to-day variability in the hour of the day when O3 mixing ratios peak. For illustration purposes, let's assume that O3 peaks between 15 and 17 h; on a given day, O3 mixing ratios [...] Conversely, both RMSE and PCC can be slightly improved with longer windows. However, averaging past observations over more days reduces the variability (which was unbiased in PERS(1)), thus introducing a substantial negative nMSDB. [...] There are important differences in terms of AUC for detecting exceedances of the target threshold depending on the lead day, ranging from a decrease of AUC with longer windows at D+1 to an increase at D+4. Therefore, for detecting exceedances, considering PSS and/or CSI as the most relevant metrics (Appendix E), the PERS method shows its best performance for a time window of 1 d. However, it gives very "liberal" O3 forecasts with rather poor SR. The term "liberal" is here borrowed from Fawcett (2006) to designate forecasting systems that predict exceedances with weak evidence, in opposition to the aforementioned term "conservative". Longer time windows can improve SR, but result in an important deterioration of CSI and PSS, particularly for the shorter lead times (D+1/D+2).

Moving average method
Similarly to PERS, the MA method depends on the time window over which past model biases are averaged to correct the forecast. Again, a sensitivity test is performed with windows ranging between 1 and 10 d (hereafter referred to as MA(n), with n the window in days). Results are shown in Fig. G2 in Appendix G. Increasing the window length impacts the MA performance in a very similar way to PERS, especially in terms of continuous metrics, for which the sensitivity is almost exactly the same. Regarding the detection of the target threshold (d8max O3 above 60 ppbv), the main noticeable difference is the absence of a strong deterioration of some metrics like AUC, SR or CSI for shorter lead times. Regarding the detection of the information threshold (d1max O3 above 90 ppbv), the clearest difference with PERS concerns the SR, which substantially improves when considering longer windows. However, the deterioration of both CSI and PSS persists.
Therefore, the detection of O3 exceedances with the MA method shows its best skills with the shortest windows (1 d). As for PERS, the corresponding forecasts are quite liberal, with low SR. However, in contrast to PERS, the SR associated with strong thresholds (d1max above 90 ppbv) can be substantially improved when using longer windows, which may be an interesting option if the corresponding deterioration of CSI/PSS is seen as acceptable.

Kalman filter method
As explained in Sect. 2.3.3 (and Appendix B), the behaviour of the KF intrinsically depends on the σ²_η/σ²_ε ratio chosen. So far, this parameter has been adjusted dynamically (and updated regularly) to optimize the RMSE on past data. Here, a sensitivity test is performed with alternative strategies in which the variance ratio is chosen to optimize the SR, CSI, PSS or AUC with threshold values of 60 or 90 ppbv (hereafter referred to as SR-60, SR-90, CSI-60, CSI-90, PSS-60, PSS-90, AUC-60 and AUC-90). The objective is to investigate to what extent tuning the KF algorithm with appropriate categorical metrics allows improving the exceedance detection skills. Results (Fig. G3 in Appendix G) show that this tuning strategy barely impacts the performance obtained on continuous metrics, except for CSI-60 and PSS-60, which show slightly deteriorated [...] of target threshold exceedances, but these are mostly restricted to the first lead day. The improvement is stronger for the [...]. This demonstrates that choosing an appropriate tuning strategy can help slightly improve the detection skills, at a potential cost in terms of continuous metrics.

Analog method
The AN method identifies the closest analog days to estimate the corresponding prediction, and thus depends on the number of analog days taken into account. We performed a sensitivity test with 1, 5, 10, 15, 20, 25 and 30 analog days (hereafter referred to as AN(N), with N the number of analogs). Results are shown in Fig. G4 in Appendix G. Increasing the number of analog days up to 5 (AN(5)) positively impacts the PCC, but deteriorates it when more days are included. It also increases the negative bias affecting the variability (nMSDB), which leads to a worse slope and intercept. Concerning the detection of target threshold exceedances, increasing the number of analog days logically makes the forecast more "conservative" (lower H and F), although the best SR are found with a number of analogs around 20. However, the best CSI and PSS are obtained with the lowest numbers of analogs (1 in this case). When focusing on information threshold exceedances, the AN forecasts based on 10 analogs or more never reach such high O3 values. Therefore, similarly to the PERS and MA methods that reached their best skills for the shortest time windows, with AN the best CSI and PSS skills are obtained when using the lowest number of analogs (with a cost in the continuous metrics, as for PERS and MA). Computing the AN-corrected O3 mixing ratios based on a larger number of analogs gives smoother predictions, and our choice to weight the average by the distance to the different analogs is unable to substantially mitigate this issue.

Gradient boosting machine method
Although GBM gives among the best RMSE and PCC, it strongly underestimates the variability of O3 mixing ratios, with critical consequences in terms of detection skills, especially for the highest thresholds (e.g. d1max > 90 ppbv). This is at least partly due to the low frequency of occurrence of such episodes, and their corresponding low weight in the entire population of points used for the training. One way of mitigating this issue consists in assigning different weights to the different training instances. This aims at forcing the GBM model to better predict the instances of higher weight, at the cost of a potential deterioration of the performance on the instances of lower weight.
In order to assess to what extent it may improve the performance of the GBM MOS method, we tested different weighting strategies. At each training phase, we compute the absolute distance D between all observed O3 mixing ratio instances and the mean O3 mixing ratio (averaged over the entire training dataset). Then several sensitivity tests are performed, weighting the training data by D, D² and D³, respectively (hereafter referred to as GBM(W), GBM(W2) and GBM(W3), respectively). Using such weights, we want the GBM model to better predict the lower and upper tails of the O3 distribution, in order to better represent the variability of the O3 mixing ratios. Given that the O3 mixing ratio distribution is typically positively skewed, the highest weights are put on the strongest positive deviations from the mean.
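In scikit-learn terms, this weighting can be sketched as follows (a minimal rendering of ours; the feature matrix X and target y are assumed to be built as in Sect. 2.3.6):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def weighted_gbm(X, y, power=2):
    """Train a GBM with instance weights |y - mean(y)|**power
    (power 1, 2, 3 -> GBM(W), GBM(W2), GBM(W3))."""
    D = np.abs(y - y.mean())  # distance to the mean observed O3
    model = GradientBoostingRegressor(learning_rate=0.05)
    model.fit(X, y, sample_weight=D ** power)
    return model
```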
As a parallel sensitivity test, we explore the performance of these different ML models when removing the input feature corresponding to the previous (one day before) observed O3 mixing ratio (hereafter referred to as GBM(noO), GBM(noO,W), GBM(noO,W2) and GBM(noO,W3)). This additional test is of interest for operational purposes, since O3 observations are not always available in near real-time. In this context, it appears interesting to evaluate to what extent the performance is altered when not relying on this specific information. Results are shown in Fig. G5 in Appendix G.
As expected, results highlight a deterioration of the RMSE and PCC combined with an improvement of the slope, intercept and nMSDB. The negative bias affecting the variability with the unweighted GBM is substantially reduced when using weights, although too strong weights (as in GBM(W3) for instance) can lead to a slight overestimation of the variability at specific time scales.

Regarding the skills for detecting d8max O3 above 60 ppbv, stronger weights typically increase both H and F, improve the (underestimated) FB, but deteriorate the SR and AUC (the forecasts become more liberal). Regarding the more balanced metrics (of strongest interest here), putting more weight on the tails of the O3 distribution has a positive although small impact on PSS. A minor positive impact is also found on CSI, but the best results are obtained with GBM(W2), i.e. moderate weights.
For both metrics, improvements are most obvious at the d8max scale, while changes at the dd8max scale are much smaller.

Regarding the detection of d1max O3 above 90 ppbv, the influence of the weighting strategies is more ambiguous, but the detection skills generally remain very poor. Again, the strongest CSI or PSS improvements are obtained at the d1max scale, with much smaller changes in the dd1max results.
Therefore, adopting an appropriate weighting strategy is a simple yet effective way of achieving slightly better O3 exceedance detection skills in exchange for a reasonable deterioration in RMSE and PCC. Overall, the improvements are relatively small, but still valuable given the initially very low detection skills for the strongest O3 episodes.

Influence of the meteorological input data in AN and GBM methods
In the previous sections, O3 corrections with the AN and GBM methods relied on IFS meteorological forecasts. Here, we investigate the impact of using the ERA5 reanalysis instead. Generally, the hourly O3 predictions with both meteorological datasets are consistent. Assuming the ERA5-based O3 mixing ratios as the truth, the IFS-based O3 predictions with the AN(10) method show a nRMSE/PCC of 8%/0.98 at D+1 (N=7,067,085), slowly deteriorating to 10%/0.97 at D+4 (N=6,960,524). Similarly, the nRMSE/PCC with the GBM method evolves from 12%/0.96 to 13%/0.95. Whatever the lead day or the MOS method, no systematic differences are found between the ERA5-based and IFS-based predictions.
The results obtained against observations are shown in Fig. G6 in Appendix G for the AN(1), AN(5), AN(10) and GBM methods. Since the O3 predictions are close, the statistical performance against observations is also very consistent between both meteorological datasets. As expected, the performance is slightly lower with IFS data, and the discrepancies increase with lead time. Using IFS rather than ERA5 data increases the nRMSE of AN(10) by 1% at D+1 and by 5% at D+4. This relative deterioration at D+4 depends upon the MOS method, with 5, 4, 4 and 8% for AN(1), AN(5), AN(10) and GBM, respectively.
Similarly, the PCC is slightly reduced when using IFS data, by only 1% at D+1 whatever the MOS method, and by up to 3, 2, 2 and 3% for AN(1), AN(5), AN(10) and GBM, respectively, at D+4. Therefore, the sensitivity to the quality of the meteorological input data varies with the MOS method and the metric considered, GBM being the most sensitive to this aspect.
Overall, similar conclusions can be drawn for the categorical metrics. GBM shows a relative deterioration of CSI/PSS from -7 to -9%. A similar deterioration of the performance is observed with the AN method, with up to -5, -8 and -8% for AN(1), AN(5) and AN(10), respectively.
Compared to IFS, the ERA5 reanalysis undoubtedly benefits from the assimilation of many meteorological observations but conversely has a coarser spatial resolution (about 31 versus 9 km), which may have a negative impact on its reliability, especially in specific areas (e.g. complex orography, urban areas). All in all, ERA5 likely gives better meteorological information, in particular for longer lead times. Here, our results show that a better performance of the MOS correction can be obtained using such higher-quality meteorological inputs. At the same time, the deterioration introduced by the use of IFS forecasts remains relatively small (at lead times below 4 d). The very similar results obtained with IFS and ERA5 meteorological input data are likely not explained by the fact that both datasets give very similar values for the different meteorological variables, but rather by the intrinsic characteristics of both the AN and GBM methods. The AN method makes use of the meteorological data only to identify past days with more or less similar meteorological conditions, and can thus handle to some extent the presence of biases in meteorological variables as long as they are systematic (and thus do not impact the identification of the analogs).
On the other hand, the GBM method uses past information to learn the complex relationship between O3 mixing ratios and the other ancillary features. Although the better the input data, the higher the chances of fitting a reliable model for predicting O3, the GBM models can also indirectly learn at least part of the potential errors affecting some meteorological variables and how they relate to O3 mixing ratios. Therefore, the presence of systematic biases in some of the ancillary features is not expected to strongly impact the performance of the predictions. However, results at longer lead times are still clearly better with ERA5 than with IFS, because of the chaotic nature of weather and the unavoidable increase of errors with lead time.

Discussion and conclusions
We demonstrate the strong impact of MOS methods in enhancing raw CAMS O3 forecasts, not only by removing potential systematic biases but also by correcting other issues related to the distribution and/or variability of O3 mixing ratios. Apart from the PERS method, all MOS approaches were indeed able to substantially improve at least some aspects of the RAW O3 forecasts, first and foremost the RMSE and PCC, for which the strongest improvements are obtained with the most sophisticated MOS methods like KF, AN or GBM. However, although all MOS methods were able to increase the underestimated variability of O3 mixing ratios of RAW, the strongest improvements of slope and nMSDB were obtained with simpler MOS methods like MA or QM. O3 mixing ratios corrected with AN, GBM and, to a lesser extent, KF remained too smooth, and such a deficiency has a major impact on the detection skills for high O3 thresholds. All in all, the best PSS or CSI are usually obtained with the simpler MOS methods. Therefore, there is a clear trade-off between the continuous and categorical skill scores, as also shown by the different sensitivity tests. The quality of a MOS-corrected forecast assessed solely based on metrics like RMSE or PCC thus tells little about the forecast value, here understood as information a user can benefit from to make better decisions, notably for mitigating O3 short-term episodes.
More generally, our study highlights the complexity of identifying the "best" MOS method given the multiple dimensions of the problem. The relative performance of the MOS methods can vary depending on the metric used, the threshold considered in the case of categorical metrics (or more specifically the base rate), the time scale at which MOS corrections are computed and/or evaluated, or the lead time. Other dimensions not covered by this study, like the seasonality of the performance, could also shed a different light on the inter-comparison.
Among the continuous metrics, both RMSE and PCC provide valuable first information on the performance of a MOS method. However, a MOS method can give the best RMSE and PCC, yet the poorest high-O3 detection skills. This was the case of the unweighted GBM method. Continuous metrics like the model-versus-observation linear slope or the nMSDB provide important complementary information, potentially less misleading, especially in a context where the final objective is to predict episodes of strong O3. Among the categorical metrics, although results were presented for a relatively large set of metrics, not all metrics benefit from the same properties. PSS may be considered one of the most valuable, notably due to its independence from the base rate, in contrast to CSI. Such a property is particularly useful when comparing scores over different regions and/or time periods where the frequency of observed exceedances might vary, for instance due to different emission forcing and/or meteorological conditions. In an operational context where statistical metrics are continuously monitored, the independence from the base rate is an interesting property because the base rate may change with time, which would otherwise prevent a consistent comparison between different periods. However, a well-known issue of both PSS and CSI (as well as many other categorical metrics) is that they degenerate to trivial values (either 0 or 1) as events become rarer (Jolliffe and Stephenson, 2011; Ferro and Stephenson, 2011), which should restrict their use to the detection of not too rare (and therefore not too high) O3 episodes.
In this study, the base rate of the target threshold was likely sufficiently high (s around 5%), but we were probably already at the limit regarding the information threshold (s around 0.1%). All in all, the selection of the evaluation metrics depends on subjective choices and the intended use, and is fundamentally a cost-loss problem where the user should arbitrate between the cost of missing exceedances and that of predicting false alarms.
The performance of the RAW forecasts was found to be only slightly sensitive to the lead day, but this sensitivity was substantially stronger with some MOS methods. This aspect is important, although different users may have different needs in terms of lead time, depending on the intended use of the AQ forecast. Forecasts at D+1 may already be useful for some applications like warning the vulnerable population in advance, so that they can adapt their outdoor activities. However, implementing short-term emission reduction measures at local scale usually goes through decisions taken at different administrative and political levels, and thus typically requires forecasts at least at D+2. If such measures had to be taken at a larger scale, the occurrence of O3 episodes would probably need to be forecast even further in advance.
We saw that some MOS methods like PERS can provide a reasonable performance at D+1 but quickly deteriorate at longer lead times. The performance of the MA(1) method also substantially depends on the lead time, although less than PERS(1). Conversely, some MOS methods like GBM, AN or QM were less impacted by the increasing lead time.
By comparing the MOS results obtained with ERA5 reanalysis data rather than IFS forecasts, we demonstrated that higher-quality meteorological input data helps improve the performance of the prediction. However, the improvement obtained with ERA5 was relatively small, which is an important result for the use of MOS in an operational context where only meteorological forecasts can be used. Although data are so far only available until July 2019, it would be interesting in the near future to extend the present analysis using the UERRA regional reanalysis for Europe, which provides meteorological information at a refined spatial resolution of 5.5×5.5 km² (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-uerra-europe-single-levels?tab=overview).
For operational purposes, other important aspects are to be taken into account. A first aspect concerns the input data required by the MOS method. Does the MOS method rely on observations, models or a combination of both? When the method relies on observations, are they needed in near real-time? How much historical data is required? When the method relies on historical data, to what extent does the length of the historical dataset impact the performance? Related to this last point, another essential aspect concerns the ability of the MOS method to handle progressive and/or abrupt changes in the AQ forecasting system (e.g. configuration, parameterizations, input data like emissions) and/or in the Earth's atmosphere (long-term trends, anomalous events like the COVID-19-related emission reduction, climate change). In this frame, the year 2020 obviously offers a unique large-scale case study to investigate the behaviour of the different MOS methods.
MOS methods relying only on very recent data (namely the PERS, MA and KF methods) are evidently more adaptable to rapid changes, which is a clear asset under changing atmospheric conditions or modeling system configurations. On the other hand, they naturally discard all the potentially useful information available within the historical dataset. Methods like QM, AN or GBM aim at extracting such information to produce better forecasts, but implicitly rely on the assumption that these historical data are still up-to-date and thus representative of the current conditions, which can be too strong a hypothesis when the historical dataset is long or when the emission forcing and/or meteorological conditions are changing rapidly. In this study, we considered a relatively short 2-year dataset, but using a longer training dataset would likely require building specific methodologies to tackle this issue, either by identifying and discarding the potentially outdated data, or by giving them a lower weight in the procedure.
In this study, we implemented a relatively simple ML-based MOS method. Although the performance on categorical metrics was found to be limited despite encouraging results on continuous metrics, there is likely room for improvement in near-future developments. In order to improve the high-O3 detection skills, potentially interesting aspects to explore include testing other types of ML models, customizing loss functions and/or cross-validation scores, designing specific weighting strategies and/or re-sampling approaches, or comparing regression and classification ML models for the detection of exceedances. During the preparation of this study, some of them were investigated, but more efforts are required to draw firm conclusions regarding their potential for better predicting O3 episodes. Finally, we focused here on the CAMS regional ensemble, but including the individual CAMS models in the set of ML input features may help achieve better performance if the ML model is somehow able to learn the variability (in time and space, or during specific meteorological conditions) of the strengths and weaknesses of each model and build its predictions based on the most appropriate sub-set of individual models. More generally, the performance of the different MOS methods is expected to vary from one raw model to another. Investigating the performance and behavior of these methods on the different individual models might shed an interesting light on the results obtained here with the ensemble, and eventually allow generalizing some of our conclusions.
Data availability. The EEA AQ e-Reporting and ERA5 datasets used in this study are publicly available.
Appendix A: Quality assurance with GHOST

Using the metadata available in GHOST (Globally Harmonised Observational Surface Treatment), a quality assurance screening is applied to the O3 hourly observations, in which the following data are removed: missing measurements (GHOST's flag 0), infinite values (flag 1), negative measurements (flag 2), zero measurements (flag 4), measurements associated with data quality flags given by the data provider which have been decreed by the GHOST project architects to suggest the measurements are associated with substantial uncertainty or bias (flag 6), measurements for which no valid data remains to average in the temporal window after screening by key QA flags (flag 8), measurements showing persistently recurring values (rolling 7 out of 9 data points; flag 10), concentrations greater than a scientifically feasible limit (above 5000 ppbv; flag 12), measurements detected as distributional outliers using adjusted boxplot analysis (flag 13), measurements manually flagged as too extreme (flag 14), data with too coarse a reported measurement resolution (above 1.0 ppbv; flag 17), data with too coarse an empirically derived measurement resolution (above 1.0 ppbv; flag 18), measurements below the reported lower limit of detection (flag [...]), measurements with [...] for preparing NO2 for subsequent measurement (flag 40), measurements with inappropriate sample preparation for preparing NO2 for subsequent measurement (flag 41) and measurements with erroneous measurement methodology (flag 42).
Appendix B: Kalman filter

CAMS forecasts are available over 4 lead days, from D+1 to D+4. We define here the time t as the day D at a given hour of the day (t + 1 thus corresponds to D+1 at this specific hour of the day). In an operational context, observations at this hour of the day are available only until time t (included). In this frame, the primary objective of the Kalman filter is to estimate the so-called x_{t+1|t}, which designates here the true (unknown) forecast bias at time t + 1 using the information available until t (included). We distinguish estimated values from true values using a hat (ˆ) (x̂_{t+1|t} therefore corresponds to the estimated value of x_{t+1|t}). In its application as a MOS method, the Kalman filter considers the following system equation for describing the time evolution of the true forecast bias:

x_{t+1|t} = x_{t|t-1} + η_t

where η_t is assumed to be a white noise term with normal distribution, zero mean, variance σ²_η and no correlation in time. It also assumes that the forecast error (forecast minus observation) y_t observed at time t does not represent the true x_{t|t-1} exactly, due to the presence of some random error ε_t:

y_t = x_{t|t-1} + ε_t

where ε_t is assumed to be a white noise term with normal distribution, zero mean, variance σ²_ε and no correlation in time, and independent from η_t.
Using the Kalman filter theory, it can be demonstrated that the optimal estimate of the true forecast bias x_{t+1|t} can be obtained from the following equations:

x̂_{t+1|t} = x̂_{t|t-1} + K_t (y_t − x̂_{t|t-1})

K_t = (p_{t|t-1} + σ²_η) / (p_{t|t-1} + σ²_η + σ²_ε)

p_{t+1|t} = (1 − K_t)(p_{t|t-1} + σ²_η)

where K_t corresponds to the so-called Kalman gain used to weight the respective importance of the previous forecast bias estimate (x̂_{t|t-1}) and its observed value (y_t), and p_{t+1|t} the expected error of the forecast bias estimate (i.e. the variance of the forecast bias error: p_{t+1|t} = Var(x_{t+1|t} − x̂_{t+1|t})).
Solving these equations requires assigning values to both σ²_η and σ²_ε. It can be demonstrated that, once σ²_ε is set to a fixed value (any reasonable value can be chosen, for instance σ²_ε = 1), the KF results solely depend on the σ²_η/σ²_ε variance ratio. Various strategies can be used to choose an appropriate value for this variance ratio. This aspect is discussed in Sect. 2.3.3.

Appendix C: Analogs norm
The analogs (AN) method requires identifying which past forecast days are the most similar to the current one. Given a set of features to take into account, this similarity is computed using the norm introduced by Delle Monache et al. (2011):

‖F_t, A_{t′}‖ = Σ_{i=1}^{N} (w_i/σ_i) √( Σ_{j=−T}^{T} (F_{i,t+j} − A_{i,t′+j})² )

with F_t the raw forecast at time t, A_{t′} an analog forecast at time t′, N the number of features taken into account, w_i the weight of feature i, σ_i its standard deviation calculated over past forecasts, and T half the width of the time window over which the metric is computed (i.e. a value T = 2 means that the squared difference between the forecast and the analog will be computed over a ±2-hour time window). In our study, we used weights of 1 for all features (wind speed, wind direction, temperature, surface pressure) and T = 1.

Appendix D: Tuning of the GBM models
The GBM models are tuned using a so-called randomized search, in which a range of values is given for each hyperparameter of interest, together with a total number of hyperparameter combinations to test. After fixing the learning rate to 0.05 (learning_rate in the scikit-learn Python package), the tuning of the GBM model was done over the following set of hyperparameters: the maximum tree depth (max_depth: from 1 to 5 by 1), the subsample fraction (subsample: from 0.3 to 1.0 by 0.1), the number of trees (n_estimators: from 50 to 1000 by 50) and the minimum number of samples required at a leaf node (min_samples_leaf: from 1 to 50). As we are dealing here with time series, this tuning is conducted through a rolling-origin cross-validation in which validation data are always posterior to training data.
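A minimal sketch of this tuning with scikit-learn, assuming TimeSeriesSplit as a stand-in for the rolling-origin cross-validation (and hypothetical X_train/y_train arrays):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

param_distributions = {
    "max_depth": list(range(1, 6)),                            # 1 to 5 by 1
    "subsample": [round(0.3 + 0.1 * i, 1) for i in range(8)],  # 0.3 to 1.0 by 0.1
    "n_estimators": list(range(50, 1001, 50)),                 # 50 to 1000 by 50
    "min_samples_leaf": list(range(1, 51)),                    # 1 to 50
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(learning_rate=0.05),
    param_distributions, n_iter=25,          # number of combinations to test
    cv=TimeSeriesSplit(n_splits=5),          # validation always after training
    scoring="neg_root_mean_squared_error")
# search.fit(X_train, y_train)  # X_train, y_train: historical features and obs
```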

Appendix E: Evaluation metrics
The performance of the categorical forecasts of exceedances of the AQ standards can primarily be described through a contingency table (Tab. E1). Based on the individual numbers a (hits), b (false alarms), c (misses) and d (correct rejections), a set of categorical metrics can be derived, following Jolliffe and Stephenson (2011).
For a given total number of data n (= a + b + c + d), the 2x2 contingency table can be fully described by three independent measures, namely the base rate s = (a + c)/n, independent from the forecasting system (total proportion of observed exceedances, also known as the climatological probability of an exceedance), the hit rate H = a/(a + c) (proportion of the observed exceedances that are correctly detected) and the false alarm rate F = b/(b + d) (proportion of the observed non-exceedances erroneously forecast as exceedances). [...] with a_r = (a + b)(a + c)/n the expected number of hits (a) of a random forecast with the same s (meaning that GSS is an equivalent of CSI where the number of hits is adjusted for the hits associated with random chance, due to the climatological frequency of the event).
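For reference, these standard definitions translate directly into code; the following sketch (our own, assuming a non-degenerate table with no zero denominators) computes the scores used in this study:

```python
def categorical_scores(a, b, c, d):
    """Standard 2x2 contingency-table scores; a=hits, b=false alarms,
    c=misses, d=correct rejections."""
    n = a + b + c + d
    s = (a + c) / n                       # base rate
    H = a / (a + c)                       # hit rate
    F = b / (b + d)                       # false alarm rate
    SR = a / (a + b)                      # success ratio
    FB = (a + b) / (a + c)                # frequency bias
    CSI = a / (a + b + c)                 # critical success index
    PSS = H - F                           # Peirce skill score
    a_r = (a + b) * (a + c) / n           # expected hits of a random forecast
    GSS = (a - a_r) / (a + b + c - a_r)   # Gilbert skill score
    return dict(s=s, H=H, F=F, SR=SR, FB=FB, CSI=CSI, PSS=PSS, GSS=GSS)
```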
Jolliffe and Stephenson (2011) gave a detailed explanation of the different metric properties desirable for assessing the quality of a forecasting system (see Table 3.4 in Jolliffe and Stephenson, 2011). In this framework, among the previous metrics, we retained PSS as the best choice for assessing the skills of our MOS methods, given that it gathers numerous interesting properties: (i) truly equitable (all random and fixed-value forecasting systems are awarded the same score, which provides a single no-skill baseline), (ii) not trivial to hedge (the forecaster cannot cheat on his forecast in order to increase PSS), (iii) base rate independent (PSS only depends on H and F, which makes it invariant to natural variations in climate; this is particularly interesting in the frame of AQ forecasts, where AQ standards and subsequently the base rate can also change) and (iv) bounded (values lie within a fixed range). Note also that no perfect metric exists, and PSS (as most other metrics) does not benefit from the property of non-degeneracy (it tends to meaningless values for rare events).
Finally, another useful metric is the Area under the ROC curve (AUC).
Appendix F: Time series