Simulations from chemical weather models are subject to uncertainties in the
input data (e.g. emission inventory, initial and boundary conditions) as well
as those intrinsic to the model (e.g. physical parameterization, chemical
mechanism). Multi-model ensembles can improve the forecast skill, provided
that certain mathematical conditions are fulfilled. In this work, four
ensemble methods were applied to two different datasets, and their performance
was compared for ozone (O

Uncertainties in atmospheric models, such as chemical weather models, whether due to the input data or to the model itself, limit their predictive skill. The incorporation of data assimilation techniques and the continued effort to understand the physical, chemical and dynamical processes result in better forecasts (Zhang et al., 2012). In addition, ensemble methods provide an extra channel for forecast improvement and uncertainty quantification. The benefits of ensemble averaging arise from filtering out the components of the forecast with uncorrelated errors (Kalnay, 2003).

The European Centre for Medium-Range Weather Forecasts (ECMWF) reports an increase in forecast skill of 1 day per decade for meteorological variables, evaluated on the geopotential height anomaly (Simmons, 2011). Air quality modelling and monitoring have a shorter history, which does not yet allow an adequate estimation of such trends for the numerous species being modelled. Moreover, the skill changes dramatically from species to species and is strongly connected to the availability of accurate emission data. Results for ozone suggest that medium-range forecasts can be performed with a quality similar to that of the geopotential height anomaly forecasts (Eskes et al., 2002). Aside from the continuous increase in skill due to improved scientific understanding, harmonized emission inventories, more accurate and denser observations, as well as ensemble averaging, an extra gain of similar magnitude can be achieved for ensemble-based deterministic modelling using conditional averaging (e.g. Galmarini et al., 2013; Mallet et al., 2009; Solazzo et al., 2013).

Ideally, for continuous and unbiased variables, the multi-model ensemble mean surpasses the skill of the deterministic models provided that the members have similar skill and independent errors (Potempski and Galmarini, 2009; Weigel et al., 2010). In practice, the multi-model ensemble mean usually surpasses the deterministic models when the evaluation is performed over multiple observation sites and times. This occurs because, over a network of stations, there are some where the essential conditions for the ensemble members (e.g. that the skill difference between the models is not too large) are fulfilled, favouring the ensemble mean; at the remaining stations, where the conditions are not fulfilled, local verification identifies the best model, but generally no single model is best at all sites. Hence, although the skill of the numerical models varies in space (latitude, longitude, altitude) and time (e.g. hour of the day, month, season), the ensemble mean is usually the most accurate spatio-temporal representation.

One of the challenges in multi-model ensemble forecasting is the processing
of the deterministic model datasets prior to averaging, in order to
construct another dataset whose members ideally constitute an
independent and identically distributed (i.i.d.) sample (Kioutsioukis and
Galmarini, 2014; Bishop and Abramowitz, 2013). This statistical process favours
the ensemble mean at each observation site. Two basic pathways exist to achieve this goal: model
weighting or model sub-selecting. There are several methods to assign
weights to ensemble members, such as the singular value decomposition
(Pagowski et al., 2005), dynamic linear regression (Pagowski et al., 2006;
Djalalova et al., 2010), Kalman filtering (Delle Monache et al., 2011),
Bayesian model averaging (Riccio et al., 2007; Monteiro et al., 2013) and
analytical optimization (Potempski and Galmarini, 2009), while model
selection usually relies on the quadratic error or its proxies in time
(e.g. Solazzo et al., 2013; Kioutsioukis and Galmarini, 2014) or frequency
space (Galmarini et al., 2013). The majority of those ensemble studies
focus on O

In this work, we apply and intercompare both approaches (weighting and
sub-selecting) using the Air Quality Model Evaluation International
Initiative (AQMEII) datasets from phase I and phase II. The ensemble
approaches are evaluated against ground level observations from the EMEP (European Monitoring and Evaluation Programme) and
AirBase databases, focusing on the pollutants O

In spite of these differences, we consider the analysis of the two sets of ensembles to be revealing. In detail, the objectives of the paper are (a) to interpret the skill of the unconditional multi-model mean within AQMEII-I and AQMEII-II, (b) to calculate the maximum expected skill of alternative ensemble estimators and (c) to evaluate the operational implementation of the approaches using cross-validation. The original contributions of the study are (a) the comparison of several ensemble methods on pollutants of different skill using different datasets, (b) the introduction of an approach based on high-dimensional spectral optimization, and (c) the introduction of innovative charts for interpreting the error of the unconditional ensemble mean with respect to indicators reflecting the skill difference and error dependence of the models as well as the effective number of models. We therefore carry out an analysis of the performance of different ensemble techniques rather than a comparison of the results from the two phases of the AQMEII activity.

The paper is structured as follows: Sect. 2 provides a brief description of the ensemble's basic properties through a series of conditions expressed by mathematical equations. In Sect. 3, the experimental setup is described. Results are presented in Sect. 4, where the skill of the deterministic models, the unconditional ensemble mean and the conditional ensemble estimators are analysed and intercompared. Conclusions are drawn in Sect. 5.

The notation conventions used in this section are briefly presented in the
following. Assuming an ensemble composed of

Assuming a uniform ensemble, the mean squared error (MSE) of the multi-model
ensemble mean can be broken down into three components, namely, the average
bias (first term), the average error variance (second term) and the
average error covariance (third term) of the ensemble members (Ueda and Nakano, 1996):
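The three-term decomposition above can be checked numerically. The sketch below (an illustration with synthetic data, not the AQMEII runs) builds an ensemble with biased, mutually correlated errors and verifies that the MSE of the ensemble mean equals the squared average bias plus 1/M of the average error variance plus (1 − 1/M) of the average error covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 2000, 5                                   # time steps, ensemble members
obs = rng.normal(50.0, 10.0, size=T)             # synthetic "observations"
member_bias = rng.normal(0.0, 2.0, size=M)       # per-model systematic biases
shared = rng.normal(0.0, 3.0, size=(T, 1))       # common error -> correlated members
forecasts = obs[:, None] + member_bias + shared + rng.normal(0.0, 4.0, size=(T, M))

errors = forecasts - obs[:, None]                # member errors e_i(t)
mse_mme = np.mean(errors.mean(axis=1) ** 2)      # MSE of the ensemble mean

bias = errors.mean(axis=0)                       # per-model mean bias
C = np.cov(errors.T, ddof=0)                     # error covariance matrix
avg_var = np.mean(np.diag(C))                    # average error variance
avg_cov = (C.sum() - np.trace(C)) / (M * (M - 1))  # average error covariance

decomposed = bias.mean() ** 2 + avg_var / M + (1.0 - 1.0 / M) * avg_cov
assert np.isclose(mse_mme, decomposed)
```

The identity is exact when the sample moments are computed consistently (here with `ddof=0`), which makes the check independent of the random draw.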

In a similar fashion, the squared error of the multi-model ensemble mean can
be decomposed into the difference of two positive-definite components, with
their expectations characterized as accuracy and diversity (Krogh and Vedelsby, 1995):
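This accuracy–diversity (ambiguity) decomposition holds at every time step by construction, as the following sketch with synthetic data illustrates:

```python
import numpy as np

rng = np.random.default_rng(1)
T, M = 500, 6
obs = rng.normal(size=T)
forecasts = obs[:, None] + rng.normal(size=(T, M))        # synthetic member forecasts

ens_mean = forecasts.mean(axis=1)
sq_error = (ens_mean - obs) ** 2                          # squared error of the mean
accuracy = ((forecasts - obs[:, None]) ** 2).mean(axis=1)       # avg member sq. error
diversity = ((forecasts - ens_mean[:, None]) ** 2).mean(axis=1)  # spread about the mean

# squared error of the ensemble mean = accuracy - diversity, at every time step
assert np.allclose(sq_error, accuracy - diversity)
```

The trade-off is visible here: increasing member diversity lowers the error of the mean only as long as the average member error (accuracy term) does not grow faster.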

An error decomposition approach can also be applied to the spectral
components (SC) of the observed and modelled time series. The data can be
spectrally decomposed with the Kolmogorov–Zurbenko (kz) filter (Zurbenko,
1986), while the original time series can be obtained with the linear
combination of the spectral components. Assuming the pollution data at the
frequency domain yield

The three decompositions presented assume uniform ensembles, i.e. all
members receive equal weight. For the case of a non-uniform ensemble, the
MSE of the multi-model ensemble mean can be analytically minimized to yield
the optimal weights, provided that the participating models are
bias-corrected (Potempski and Galmarini, 2009):
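For bias-corrected members, the minimum-MSE weights under a sum-to-one constraint take the standard minimum-variance form w = C⁻¹1 / (1ᵀC⁻¹1), with C the error covariance matrix. The sketch below (synthetic data, and assuming this standard form corresponds to the analytical solution referenced above) illustrates the calculation:

```python
import numpy as np

rng = np.random.default_rng(2)
T, M = 2000, 4
obs = rng.normal(size=T)
scales = np.array([0.5, 1.0, 1.5, 2.0])          # members with different skill
forecasts = obs[:, None] + scales * rng.normal(size=(T, M))
errors = forecasts - obs[:, None]
errors -= errors.mean(axis=0)                    # bias-correct each member first

C = np.cov(errors.T, ddof=0)                     # error covariance matrix
ones = np.ones(M)
w = np.linalg.solve(C, ones)
w /= w @ ones                                    # w = C^{-1} 1 / (1^T C^{-1} 1)

mse_weighted = np.mean((errors @ w) ** 2)        # MSE of the weighted mean
mse_uniform = np.mean(errors.mean(axis=1) ** 2)  # MSE of the plain ensemble mean
assert np.isclose(w.sum(), 1.0)
assert mse_weighted <= mse_uniform + 1e-12
```

By construction the weighted mean can never have a larger in-sample MSE than the uniform mean; the practical question, examined later in the paper, is how stable the weights are when estimated on a finite training period.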

Unlike the straightforward calculation of the optimal weights, the
sub-selecting schemes make use of a reduced-dimensionality ensemble. An
estimate of the effective number of models (

It is apparent from the considerations above that the unconditional
ensemble mean has potential advantages over the single members, provided
some properties are satisfied. As those properties are not systematically
met in practice, superior ensemble skill can be achieved through the
sub-selecting or weighting schemes presented in this section. An
intercomparison of the following ensemble-averaging approaches is carried
out in this work using observed and simulated air quality time series:

The unconditional ensemble mean (mme) is investigated.

The conditional (on selected members) ensemble mean in time domain (mme

The conditional (on selected members) ensemble mean in frequency domain (kzFO) is investigated. Following Eq. (3), an ensemble estimator is synthesized from the best member at each spectral band (Galmarini et al., 2013). The original time series are decomposed into four spectral components (see Appendix), namely the intra-diurnal, diurnal, synoptic and long-term components, using the Kolmogorov–Zurbenko filter (Zurbenko, 1986).

The conditional (on selected members) ensemble mean in frequency domain (kzHO) is investigated.
It is an extension of the kzFO, where the spectral components of the ensemble
estimator are averaged from

The conditional (optimally weighted) ensemble mean (mmW) is investigated, according to Eq. (4) (Potempski and Galmarini, 2009).
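The spectral separation used by the kzFO and kzHO estimators above can be sketched with the iterated-moving-average definition of the Kolmogorov–Zurbenko filter. The window and iteration choices below (kz(3,3), kz(13,5), kz(103,5)) follow common practice for hourly air quality series and are assumptions here, not necessarily the paper's exact settings; the data are synthetic:

```python
import numpy as np

def kz_filter(x, m, k):
    """Kolmogorov-Zurbenko filter: a moving average of window m, iterated k times."""
    kernel = np.ones(m) / m
    for _ in range(k):
        x = np.convolve(x, kernel, mode="same")   # simple zero-padded edge handling
    return x

rng = np.random.default_rng(3)
t = np.arange(24 * 90)                            # 90 days of hourly data
x = 40 + 10 * np.sin(2 * np.pi * t / 24) + rng.normal(size=t.size)

smooth_du = kz_filter(x, 3, 3)        # removes fluctuations shorter than ~half a day
smooth_sy = kz_filter(x, 13, 5)       # removes fluctuations shorter than ~a day
smooth_lt = kz_filter(x, 103, 5)      # removes fluctuations shorter than ~2 weeks

intra_diurnal = x - smooth_du
diurnal = smooth_du - smooth_sy
synoptic = smooth_sy - smooth_lt
long_term = smooth_lt

# the four components reconstruct the original series exactly (telescoping sum)
assert np.allclose(intra_diurnal + diurnal + synoptic + long_term, x)
```

Because the components are defined as differences of successively smoother series, their sum returns the original time series regardless of the filter settings, which is what allows a spectral estimator to be reassembled from per-band selections.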

The modelling systems participating in the first and second phases of AQMEII for Europe.

AQMEII phase I: standard boundary conditions, provided by the GEMS
project (Global and regional Earth-system Monitoring using Satellite and in situ
data). Refer to Schere et al. (2012) for details.

The statistical distribution of (a) the normalized mean square
error (NMSE) of the best model (NMSE

The two AQMEII ensemble datasets have simulated the air quality for Europe
(10

The participating models followed a restrictive protocol concerning the emissions and the meteorological and chemical boundary conditions. In AQMEII-I, the meteorological models applied nudging towards the NCEP GFS (National Centers for Environmental Prediction, Global Forecast System) meteorological analysis. In AQMEII-II, the simulations were run more as if they were real forecasts; the meteorological boundary conditions for the majority of the models came from the ECMWF operational archive (see Tables 1 and 2 in Brunner et al., 2015), and no nudging or FDDA (four-dimensional data assimilation) was applied. However, the driving meteorological data were analyses (not reanalyses) for all simulations, with the exception of the COSMO-MUSCAT run. Hence, the runs from AQMEII-II are more like forecasts than those from AQMEII-I.

Recent studies with regional air quality models showed that the full
variability of the ensemble can be retained with only an effective number of
models (

Table 1 summarizes the features of the modelling
systems analysed in this study with regard to O

Following Sect. 2, each model was bias-corrected prior to the analysis, i.e. its own mean bias over the examined 3-month period was subtracted from its modelled time series at each monitoring site. For each modelling system, the long-term systematic error is a known quantity estimated during its validation stage; therefore, the subtraction of the seasonal bias does not restrict the generality of the study. In fact, bias removal is a necessary condition only for the weighted ensemble mean. In the results section we address this issue and its effect on the skill of the ensemble estimators.
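The per-station bias correction described here amounts to subtracting each model's own seasonal mean bias from its time series. A minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
obs = rng.normal(50.0, 10.0, size=24 * 90)          # 3 months of hourly observations
model = obs + rng.normal(size=obs.size) + 4.0       # model with a +4 systematic bias

seasonal_bias = np.mean(model - obs)                # mean bias over the 3-month period
model_bc = model - seasonal_bias                    # bias-corrected time series

# after correction the mean bias at this station is zero by construction
assert np.isclose(np.mean(model_bc - obs), 0.0)
```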

The observational datasets for O

In this section we apply the conceptual framework briefly presented in Sect. 2 to investigate the effect of the differences in ensemble properties within each of the two AQMEII phases (Rao et al., 2011) on the skill of the unconditional multi-model mean. The potential for improved estimates through conditional ensemble averages, and their robustness, is then assessed.

The cumulative density functions of the observations (O

From the station-based hourly time series provided, we analysed one season
(3-month period) with continuous data and relatively high
concentrations: for O

The distributions of each model's NMSE for O

The distribution of average NMSE at each station (

Aside from

The skill of the multi-model mean was compared to the skill of the
best deterministic model, independently evaluated at each monitoring site
(hereafter bestL). The geographical distribution of the ratio
RMSE(mme)/RMSE

The skill difference between the best model and the average skill is
inferred from the indicator NMSE

Model skill difference via the NMSE. For each box, the central mark
indicates the median, and the bottom and top edges of the box indicate the
25th and 75th percentiles, respectively. The whiskers extend to the most extreme
data points not considered outliers, and the outliers (points whose distance from
the 25th or 75th percentile is larger than 1.5 times the interquartile range)
are plotted individually using the “

Comparison of the mme skill against the best local deterministic model
by means of the indicator RMSE

Model error dependence through the eigenvalues spectrum. The average
explained variation from the maximum eigenvalue is 71 and 78 (phase I and II) for O

Interpretation of Fig. 3: the explanation of the mme skill against
the best local deterministic model with respect to skill difference (evaluated
from MSE

The eigenvalues of the covariance matrix calculated from the model errors
provide information on the member diversity and the ensemble redundancy
(Eq. 5). Following the eigenanalysis of the error covariance matrix at each
station separately and converting the eigenvalues to cumulative amount of
explained variance, the resulting matrix is presented in a box and whisker
plot (Fig. 4). The error dependence of the
ensemble members is deduced from the explained variation by the maximum
eigenvalue
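The eigenanalysis step can be sketched as follows. The 90 % explained-variance cutoff used here to illustrate an effective number of models is one common convention and an assumption of this sketch, not necessarily the paper's exact criterion; the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)
T, M = 1000, 10
common = rng.normal(size=(T, 1))                       # shared error component
errors = 0.8 * common + 0.6 * rng.normal(size=(T, M))  # member errors at one station

C = np.cov(errors.T, ddof=0)                      # error covariance matrix
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]    # eigenvalues, largest first
explained = np.cumsum(eigvals) / eigvals.sum()    # cumulative explained variance

# illustrative effective number of models: eigenvalues needed to explain
# 90 % of the total error variance
n_eff = int(np.searchsorted(explained, 0.90) + 1)
assert 1 <= n_eff <= M
```

A dominant first eigenvalue (large `explained[0]`) signals strongly dependent member errors and hence a redundant ensemble, which is exactly the diagnostic used in the discussion above.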

All species demonstrate smaller skill difference and higher error dependence
in the AQMEII-II dataset. The Kolmogorov–Smirnov test showed that the difference
in the corresponding distributions of the indicators between AQMEII-I and
AQMEII-II is significant at the 1 % level. However, it is the joint
distribution of skill difference and error dependence that modulates the
mme skill with respect to the bestL, as seen in Fig. 5.
Shifts in the distributions of the indicators at opposite directions
eventually cancel out, yielding no change in the mme skill. This case is
observed for O

The area below the diagonal in Fig. 5 corresponds
to monitoring sites with disproportionately low diversity given the current
level of accuracy. This area of the chart indicates a high spread in skill
difference and relatively highly dependent errors. In practice, this
situation means a limited number of skilled models with correlated errors,
which in turn denotes a small

The cumulative distribution of

Following the identification of the weaknesses in the ensemble design, the
potential for corrections through more sophisticated schemes is now
investigated. We consider the skill of the multi-model mean as the starting
point, and we investigate pathways for further enhancing it through the
non-trivial problem of weighting or sub-selecting. The optimal weights (mmW)
are estimated from the analytical formulas presented in Potempski and
Galmarini (2009). The sub-selection of members was built upon the
optimization of either the accuracy–diversity trade-off (mme

The global skill of all the single models and the ensemble estimators,
evaluated at all stations, is presented in Fig. 7 in the form of
Taylor plots. For O

Like Fig. 5 but showing the

Composite skill of all deterministic models and ensemble estimators
(mme, mme

The MSE from (a) the best deterministic models globally (bestG) and
locally (bestL), (b) the unconditional ensemble mean (mme), and (c) the four
conditional ensemble estimators (mme

mme: unconditional ensemble mean; mme

Like Fig. 5 but for the mme

The cumulative density function of the skill score (1

A comparison between the skill of the examined ensemble estimators versus
the mme and the best single model is now conducted
(Table 3). The best single model is evaluated
globally (bestG is the average across all stations) and locally (bestL is at each station
separately). The former estimates the best average deterministic skill among
the candidate models; the latter provides a useful indicator for checking
whether the anticipated benefits of ensemble averaging hold. The skill
scores were evaluated against the guaranteed minimum gain of the
ensemble (

The mme always achieves a lower error than bestG. The improvement is larger for
O

The skill score of mme over

The skill score of the examined ensemble estimators (mmW, mme

The improvement of the ensemble estimator using selected

The theoretical minimum MSE of mme for the case of unbiased and uncorrelated models (from Eq. 1) is far from being achieved by any of the ensemble estimators.

The probability of detection (POD) and false alarm rate (FAR) from
(a) the best deterministic models, globally (bestG) and locally (bestL), (b) the
unconditional ensemble mean (mme), and (c) the four conditional ensemble
estimators (mme

mme: unconditional ensemble mean; mme

The interquartile range over all stations of the day-to-day difference in the weights arising from variable time series length.

The ability to simulate extreme values is now examined through the POD and
FAR indices. Two thresholds were utilized for each pollutant, namely 120 and
180

For O

The mme

As we move towards higher percentiles, the first-order spectral model (kzFO)
has higher POD than the higher-order spectral model (kzHO) due to the
averaging in the latter. In addition, the frequency domain averaging (kzHO) had
slightly higher POD compared to the time domain averaging (mme

The mmW, aside from its lower MSE, has the highest POD among all models and ensemble estimators.

The variation of FAR was very small between all examined models and ensemble estimators.
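The categorical scores behind these comparisons come from a 2 × 2 contingency table of threshold exceedances. Note that "false alarm rate" can denote either the false alarm ratio or the probability of false detection; the sketch below (with made-up numbers, and the 120 threshold as an example) assumes the false alarm ratio:

```python
import numpy as np

def pod_far(forecast, observed, threshold):
    """Probability of detection and false alarm ratio for threshold exceedances.

    FAR here is the false alarm ratio, false_alarms / (hits + false_alarms);
    conventions vary, so this is an assumption of the sketch.
    """
    fcst_ex = forecast >= threshold
    obs_ex = observed >= threshold
    hits = np.sum(fcst_ex & obs_ex)
    misses = np.sum(~fcst_ex & obs_ex)
    false_alarms = np.sum(fcst_ex & ~obs_ex)
    pod = hits / (hits + misses) if hits + misses else np.nan
    far = false_alarms / (hits + false_alarms) if hits + false_alarms else np.nan
    return pod, far

obs = np.array([100, 130, 90, 150, 110, 125])
fcst = np.array([105, 125, 95, 160, 115, 110])
pod, far = pod_far(fcst, obs, 120)     # 2 hits, 1 miss, 0 false alarms
```

With these toy numbers the forecast detects two of the three observed exceedances and raises no false alarms, so POD = 2/3 and FAR = 0.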

The additional skill over the mme, in the range between 8 and 31 %, obtained by the statistical approaches applied to a pool of ensemble simulations identifies the upper ceiling of the improvements from correcting the skill difference and the error dependence of the ensemble members. This bound results from the removal of the seasonal bias from the time series and the optimal training of the methods. We now proceed by splitting the datasets into training and testing sets, and we explore the sensitivity of the mmW skill to improper bias removal and weights. Both factors are estimated on the training set for a variable time series length that progressively increases from 1 to 60 days, for all monitoring stations and pollutants. The evaluation period for all training windows is the same 30-day segment, which is not available to the training procedure. The analysis provides a perspective on applying the techniques in a forecasting context, although the examined simulations did not operate in forecasting mode.

The interquartile range of the day-to-day difference in the weights is
calculated and its range over all stations is displayed in
Fig. 10. No convergence occurs; however, the
variability of the mmW weights is notably reduced after a certain amount of
time. If we set a tolerance level at the second decimal, to be satisfied at
all stations, we need at a minimum 20–45 days of hourly time series. The
variability of weights is smaller for O

The average MSE of mmW for various training lengths, calculated for the testing time series (i.e. not used in the training phase) that contains all stations.

The results demonstrate that the ensemble estimators based on the analytical
optimization become insensitive to inaccuracies in the bias and weights for
training periods exceeding 60 days. However, other published studies with weighted
ensembles using non-analytical optimization (e.g. linear regression;
Monteiro et al., 2013), argue that 1 month is sufficient for the weights
and the bias. The sub-selecting schemes are more robust than the
optimal weighting scheme to variations in their parameters (bias,
members). Using data from AQMEII-I, training periods on the order of 1 week
were found essential for mme

In this paper we analyse two independent suites of chemical weather
modelling systems regarding their effect on the skill of the ensemble mean (mme).
The results are interpreted with respect to the error decomposition of
the mme. Four ways to extract more information from an ensemble aside from the
mme are ultimately investigated and evaluated. The first approach applies
optimal weights to the models of the ensemble (mmW), and the other three methods
utilize selected members in time (mme

The comparison of the mme skill versus the globally best single model (bestG is identified
from the evaluation over all stations) points out that mme achieves lower average
(across all stations) error compared to bestG. The enhancement of accuracy is
highest for O

The four examined ensemble estimators are then assessed for their skill in
the average error as well as their capability to correctly identify extreme
values (events exceeding threshold value). The key results of the analysis
are summarized below:

The skill score of mme over its guaranteed upper ceiling (case of zero
diversity) ranges between 15 and 30 %, being lower for PM

The promotion of the right amount of accuracy and diversity in the
conditional ensemble estimators almost doubles the distance to the
guaranteed upper ceiling. The skill score over mme is higher for O

The theoretical minimum MSE of mme for the case of unbiased and uncorrelated models is far from being achieved by any of the ensemble estimators.

As we move towards the tail, the probability of detection (POD) of bestG
(bestL) dominates over the mme (mme

The combination of the results from the average error and the extremes
identifies mmW as the estimator that outscores the others across all
percentiles. kzFO has a high capacity for extremes but requires attention for the
limited sites with high

The improvement of the physical, chemical and dynamical processes in the deterministic models is a continuous effort that results in better forecasts. Furthermore, mathematical optimizations of the input data (e.g. data assimilation) or of the model output (e.g. ensemble estimators) contribute significantly to the accuracy of the whole modelling process. The post-simulation improvements presented here were the result of favourable ensemble design alone. However, the theoretical minimum MSE of mme for the case of unbiased and uncorrelated models is far from being achieved by any of the ensemble estimators. Further development of the presented ensemble methods, taking into account the meteorological and chemical regimes, is underway.

All data used in the study are available in the ensemble platform (

The relevant separate scales of motion are defined by means of physical
considerations and periodogram analysis (Rao et al., 1997): the intraday
component (ID), the diurnal component (DU), the synoptic
component (SY) and the long-term component (LT). The hourly time series (

We gratefully acknowledge the contribution of various groups to the second Air Quality Model Evaluation International Initiative (AQMEII) activity: US EPA, Environment Canada, Mexican Secretariat of the Environment and Natural Resources (Secretaría de Medio Ambiente y Recursos Naturales-SEMARNAT) and National Institute of Ecology (Instituto Nacional de Ecología-INE) (North American national emissions inventories); US EPA (North American emissions processing); TNO (European emissions processing); ECMWF/MACC project & Météo-France/CNRM-GAME (chemical boundary conditions). Ambient North American concentration measurements were extracted from Environment Canada's National Atmospheric Chemistry Database (NAtChem) PM database and provided by several US and Canadian agencies (AQS, CAPMoN, CASTNet, IMPROVE, NAPS, SEARCH and STN networks); North American precipitation-chemistry measurements were extracted from NAtChem's precipitation-chemistry database and were provided by several US and Canadian agencies (CAPMoN, NADP, NBPMN, NSPSN, and REPQ networks); the WMO World Ozone and Ultraviolet Data Centre (WOUDC) and its data-contributing agencies provided North American and European ozone sonde profiles; NASA's Aerosol Robotic Network (AeroNet) and its data-contributing agencies provided North American and European AOD measurements; the MOZAIC Data Centre and its contributing airlines provided North American and European aircraft take-off and landing vertical profiles. For European air quality data the following data centres were used: EMEP, the European Environment Agency, the European Topic Centre on Air and Climate Change, and AirBase provided European air- and precipitation-chemistry data. The Finnish Meteorological Institute is acknowledged for providing biomass burning emission data for Europe.
Data from meteorological station monitoring networks were provided by NOAA and Environment Canada (for the US and Canadian meteorological network data) and the National Center for Atmospheric Research (NCAR) data support section. Joint Research Center Ispra and Institute for Environment and Sustainability provided their ENSEMBLE system for model output harmonization and analyses and evaluation. The co-ordination and support of the European contribution through COST Action ES1004 EuMetChem is gratefully acknowledged. The views expressed here are those of the authors and do not necessarily reflect the views and policies of the US Environmental Protection Agency (EPA) or any other organization participating in the AQMEII project. This paper has been subjected to EPA review and approved for publication. The UPM authors thankfully acknowledge the computer resources, technical expertise, and assistance provided by the Centro de Supercomputación y Visualización de Madrid (CESVIMA) and the Spanish Supercomputing Network (BSC). GC and PT were supported by the Italian Space Agency (ASI) in the frame of the PRIMES project (contract no. I/017/11/0). The same authors are deeply thankful to the Euro Mediterranean Centre on Climate Change (CMCC) for having made available the computational resources. Edited by: G. Carmichael Reviewed by: M. Plu and one anonymous referee