Pauci ex tanto numero: reducing redundancy in multi-model ensembles

We explicitly address the fundamental issue of member diversity in multi-model ensembles. To date no attempts in this direction are documented within the air quality (AQ) community, despite the extensive use of ensembles in this field. Common biases and redundancy are the two issues directly deriving from lack of independence, undermining the significance of a multi-model ensemble, and are the subject of this study. Shared biases among models will produce a biased ensemble; it is therefore essential that the errors of the ensemble members be independent, so that biases can cancel out. Redundancy derives from having too large a portion of common variance among the members of the ensemble, producing overconfidence in the predictions and underestimation of the uncertainty. The two issues of common biases and redundancy are analysed in detail using the AQMEII ensemble of AQ model results for four air pollutants in two European regions. We show that models share large portions of bias and variance, extending well beyond those induced by common inputs. We make use of several techniques to further show that subsets of models can explain the same amount of variance as the full ensemble, with the advantage of being poorly correlated. Selecting the members for generating skilful, non-redundant ensembles from such subsets proved, however, non-trivial. We propose and discuss various methods of member selection and rate the ensemble performance they produce. In most cases, the full ensemble is outscored by the reduced ones. We conclude that, although independence of outputs may not always guarantee enhancement of scores (but AQ


Introduction
Geophysical modelling nowadays relies, among other techniques, on ensemble methods to improve predictive skill, assess performance, and quantify uncertainties. This is particularly the case for the atmospheric sciences, where climate and air quality models are often treated as ensembles of an arbitrary collection of model results belonging to the same family, sharing similar structure and resolution ("ensembles of opportunity", as defined, e.g. by Tebaldi and Knutti, 2007). Just as human beings normally consult a number of sources prior to making a decision (see for example the "trillion dollar garden party" analogy adopted by Knutti, 2010), the advantage of collecting the information from several sources into ensembles relies on the fundamental assumption that information coming from multiple sources allows an estimation of its quality, in line with the "Principle of multiple explanations" proposed by the Greek philosopher Epicurus (341 BC-270 BC), which says that for an optimal solution of a concrete problem we have to take into consideration all the hypotheses that are consistent with the input data. The fundamental aspect that makes multiple estimations better than a single one is that the sources must be independent. In our view, multi-model (MM) ensemble practices, as they have been developed over the years, lack this fundamental consideration, since models are phenotypically similar (Potempski and Galmarini, 2009), and therefore need caution in their applications. Serving the scope of removing misconceptions and ambiguous interpretations of the paper, we define here:

- Independence: a formal property, holding when the joint Probability Distribution Function (PDF) of two or more model results is derived from the product of the single PDFs (Cover and Thomas, 2006). This is the rigorous definition of independence, though the joint PDF is difficult to estimate in practice (Abramowitz, 2010; Potempski and Galmarini, 2009).

Ideally, perturbation of model parameters and the associated uncertainty on model output could serve this scope, as suggested by Tebaldi and Knutti (2007), but this is often impractical. The strategy common to the (few) studies that directly investigate model diversity consists in attributing model independence only from the analysis of the output the models produce. In particular, Potempski and Galmarini (2009) showed that, by relaxing the condition of model independence to that of model associativity, a robust theoretical framework can be built from which precise mathematical formulations can be drawn. Associativity is measured by the covariance or by the correlation of pairs of model outputs. However, caution is needed, as "it is possible that two models could agree with respect to outputs despite being based on different causal assumptions" (Pirtle et al., 2010). Thus, when looking at the correlation of model outputs as a metric for defining independence, different models producing the same output would erroneously be considered as dependent. Further to that, the similarity of the results from two independent models is valuable information about model accuracy and uncertainty. Uncorrelated outputs are a necessary but not sufficient condition to guarantee independence. Given the impossibility of an a priori assessment of ensemble members' independence, model biases are excellent parameters with which to investigate the interdependence of ensemble members.
Models are intrinsically wrong due to their numerical nature, imprecise input data and limited understanding of the atmospheric chemical-physical processes. What is important is that models have independent systematic errors, so that the biases cancel out when combining models into ensembles. Should that not be the case, combining more and more similar models into an ensemble is not a solution: results do not improve! Moreover, a MM ensemble for which all biases have the same sign and value may give the false impression of accuracy, which is often confused with precision. The agreement of models in precisely predicting the same (biased) result is confused with the accuracy of the models, which implies homogeneously distributed biases around the measurements (Potempski and Galmarini, 2009).
A further fundamental aspect is that of the uncertainty of the measurements used to calculate the bias, to evaluate the models, and/or to weight the ensemble members. Annan and Hargreaves (2010, 2011) have shown that, due to the uncertainties in the measurements, models' deviations from the observations can be strongly correlated, even in the case of independent models. Thus, the independence of models does not necessarily translate into independence of their deviations from the observations. At the same time, similar models have correlated errors, but correlation of errors does not imply similarity of models. Furthermore, the conceptual assumption that models are drawn from a distribution centred around the truth (meaning that measurements and model output are not biased, or are bias-corrected) might lead to wrong conclusions, as recently reported by Annan and Hargreaves (2010, 2011).
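This effect can be reproduced with a few lines of synthetic data (a minimal sketch, not part of the original analysis; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000                                    # length of the synthetic series
truth = rng.normal(0.0, 1.0, n)             # unobserved true signal

# Two models with errors that are independent by construction
model_a = truth + rng.normal(0.0, 0.5, n)
model_b = truth + rng.normal(0.0, 0.5, n)

# Observations carry their own measurement noise
obs = truth + rng.normal(0.0, 0.5, n)

# Errors with respect to the truth are uncorrelated...
err_a, err_b = model_a - truth, model_b - truth
print(np.corrcoef(err_a, err_b)[0, 1])      # ~0

# ...but deviations from the noisy observations are correlated,
# because both contain the same observational noise term.
dev_a, dev_b = model_a - obs, model_b - obs
print(np.corrcoef(dev_a, dev_b)[0, 1])      # ~0.5
```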

To date, the link between ensemble accuracy and diversity of members is still unclear, the reason being that there is no unique way to decompose the ensemble's error in terms of bias and variance (Potempski and Galmarini, 2009). This is the fundamental problem of ensemble techniques, whose error obeys the bias-variance-covariance decomposition (e.g. Brown et al., 2005). The trade-off between bias and variance indeed involves three terms, and there is no way to simply minimise the covariance without affecting the other components of the error. The error of the ensemble mean increases linearly with the correlation of the members through the covariance term. Techniques promoting diversity (or penalising commonalities) do exist (negative correlation learning and others, Liu and Yao, 1999) and are an active area of research in the field of information science, though they are not the goal of this study.
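For reference, a standard form of this decomposition (our transcription; see Brown et al., 2005, for the derivation) for the mean of M predictors reads:

```latex
% Bias-variance-covariance decomposition of the ensemble-mean error,
% for the mean \bar{f} = (1/M) \sum_{i=1}^{M} f_i of M predictors of a target y:
\mathrm{E}\!\left[(\bar{f}-y)^2\right]
  = \overline{\mathrm{bias}}^{\,2}
  + \frac{1}{M}\,\overline{\mathrm{var}}
  + \left(1-\frac{1}{M}\right)\overline{\mathrm{cov}}
% \overline{\mathrm{bias}}, \overline{\mathrm{var}}, \overline{\mathrm{cov}} denote the
% member-averaged bias, variance and pairwise covariance, respectively.
```

The covariance term makes explicit why the error of the ensemble mean grows with the correlation of the members.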
The Latin expression Pauci ex tanto numero is extracted from De Bello Gallico (The Gallic War, book 7, chapter 88) by G. J. Caesar (100 BC-44 BC) and refers to the battle of the Roman army against the Gauls. The complete citation reads "pauci ex tanto numero se incolumes in castra recipiunt" (few, out of so great a number, returned unharmed to the camp).

The paper is structured as follows. In Sect. 2 the scopes are highlighted and the dataset and methodology are presented. In Sect. 3 we introduce an appropriate metric that allows detecting similarities beyond the overarching ones, and use this metric to quantify the level of redundancy of the dataset. To reduce the redundancy we then apply several techniques of dimension reduction (Sect. 4), which serve the scope of identifying the minimum number of elements necessary to explain the variance of the observational data. Once the dimension of the minimum set is established, we apply a number of member selection criteria (Sect. 5). The methods of member selection have the purpose of identifying the members (or the weights) that (i) have poorly correlated errors (and are thus non-redundant) and (ii) whose ensemble mean is skilful in terms of accuracy and precision. Conclusions are drawn in Sect. 6.

Scopes, data and method
We want to address here some fundamental questions: to what extent is an ensemble of different models, put together on the grounds of opportunity and convenience, indeed producing a better result? How can one quantify the information in MM ensembles that is genuinely non-redundant? To this end, we:

1. Determine the ensemble redundancy: i.e. the minimum set of members that explains the variance of the observations and maximises the accuracy;

2. Reduce the ensemble redundancy: if two models, or their errors, are highly correlated, one can be expressed in terms of the other by a simple scaling factor. If many redundant models are combined together, there would be a loss of valuable information due to dependent biases.
The fundamental idea is thus to investigate different methodologies viable to achieve the above-mentioned goals and to verify the pros and cons of each of them, trying to produce a generalised concept and methodology applicable to any other kind of ensemble and variable.

For our analysis we investigate the correlation between errors produced by AQ models run by twelve groups in the context of the Air Quality Model Evaluation International Initiative (AQMEII) (Rao et al., 2011; Galmarini et al., 2012). For all of the analyses presented in this study we use hourly time series for the months of June, July, August (JJA) of the year 2006 of the gaseous species O3, CO, NO2, and SO2. We apply the analysis to two distinct regions of Europe: region 1 (-10, 5)° W; (42, 60)° N, including the UK, France, northern Spain and Belgium; region 2 (5, 24.5)° W; (46, 60)° N, continental Europe, including Germany, Poland, Austria, and the Czech Republic. The modelled and observed time series have been spatially averaged over regions 1 and 2 defined above. These two regions have been the subject of in-depth investigation in other AQMEII studies (Solazzo et al., 2012a,b; Vautard et al., 2012). The number of receptors, by species, in each region is reported in Table 1. The participating models are summarised in Table 2. Details about the model settings and the operational evaluation against observational data can be found in Solazzo et al. (2012a,b) and Vautard et al. (2012), with the exception of the GEM-AQ model (Côté et al., 1998; Kaminski et al., 2008), which did not take part in the previous AQMEII analysis. The AQMEII ensemble of models is indeed a genuine ensemble of opportunity, with a good level of diversity in terms of AQ models and meteorological drivers. Emissions and boundary conditions are, however, largely shared, making the distribution of model errors neither systematic nor random. The history of regional-scale modelling has also forcibly produced a number of elements common to all the models, which should be considered an a priori contaminating element of the ensemble results.
Since an accurate estimation of a multivariate PDF is hard to achieve, due to the computational cost it entails even for a small number of models, we decide to focus on quantifying the amount of information two models share, measured by the redundancy, which can be computed more easily in this case. Given the output from two models, x, y, organised as a two-column table, let Σ_xy denote their covariance and p(·) their joint PDF. The redundancy can be defined either through the redundancy index ρI(x, y) (Stewart and Love, 1968), a metric quantifying the portion of variance already accounted for by other members of the ensemble (Eq. 1), or by the mutual information among models I(x, y) (Ding and Peng, 2005) (Eq. 2):

ρI(x, y) = tr(Σ_xy Σ_yy⁻¹ Σ_yx) / tr(Σ_xx)   (1)

I(x, y) = ∬ p(x, y) log[ p(x, y) / (p(x) p(y)) ] dx dy   (2)

Eq. (1) is related to the prediction of x by y through multiple linear regression. ρI(x, y) is a weighted average of the squared multiple correlation coefficients between all pairs of variables of x and y. It is a measure of the quality of the prediction of x by y and represents the proportion of explained variance in the regression of x by y (see e.g. Youness and Saporta, 2010). In the case of x and y being one-dimensional vectors, ρI returns R², the squared correlation coefficient.
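In the one-dimensional case both quantities are easy to estimate; the sketch below (function names are ours; the histogram estimator of I is biased for finite samples and the bin count is an arbitrary choice) illustrates Eqs. (1) and (2):

```python
import numpy as np

def redundancy_index(x, y):
    """For one-dimensional series the redundancy index reduces to R^2,
    the squared correlation coefficient (proportion of the variance of x
    explained by a linear regression on y)."""
    return np.corrcoef(x, y)[0, 1] ** 2

def mutual_information(x, y, bins=32):
    """Histogram (raw frequency count) estimate of I(x, y), in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                      # joint PDF estimate
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y
    mask = pxy > 0                             # avoid log(0) on empty bins
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.7 * x + 0.3 * rng.normal(size=10_000)    # strongly associated pair
print(redundancy_index(x, y), mutual_information(x, y))
```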
The mutual information in Eq. (2) is more complex and involves the PDFs of multivariate variables. In practical terms, I measures the level of information shared by the two models. Note also that in the observations the noise can be comparable with the signal, in particular at low concentrations.

The AQMEII database includes results from ensemble members sharing meteorological drivers, emissions, and chemical boundary conditions (Table 2). It has been shown that these input fields introduce systematic biases in the model results (Solazzo et al., 2012a,b). A simple error metric would not be adequate to detect any type of underlying commonality other than these overarching biases. For this reason we have to find a metric that (a) explores hidden similarities, i.e. those underlying common modules and parameters in the models, and that (b) is robust enough to be used under a number of scenarios. Keeping in mind that no "wonder" metric exists and that different metrics produce different results (Gleckler et al., 2008), we opted for the metric d_m proposed by Pennel and Reichler (2011) (hereafter referred to as PR2011), which explores the biases of the models and removes from each model the dominating similarities, thus making individual model errors more dissimilar and unveiling "hidden" trends that are masked by overarching commonalities. Let us start by defining the standardised deviation of model (mod) from observations (obs) for species s as:

e_{m,s}(i) = [mod_{m,s}(i) − obs_s(i)] / σ_s   (3)

where σ_s is the standard deviation of the observed chemical species, i = 1, ..., N is the index of the time series, m is the model index, and s that of the species being considered (O3, CO, NO2, SO2).
The normalisation in Eq. (3) makes the errors of different species more comparable. We denote by MME the multi-model ensemble mean error, which contains the "bulk" of the bias among models. To eliminate the dominating model similarities, we remove from each model's error the portion of the MME associated with that individual model error. According to PR2011, the removal of the portion of the MME relevant to an individual model can be accomplished by calculating the difference d_m between the standardised model error and the weighted standardised MME, with the weight being the correlation coefficient R between the m-th model error and the MME:

d_m = e*_m − R(e*_m, MME*) · MME*   (4)

where the "*" indicates standardised vectors, calculated, for each time series, by subtracting the corresponding mean value ē_m and dividing by the standard deviation σ_em (we have now dropped the index i for a more compact notation). Note that the normalisation serves the sole purpose of making the results for different species inter-comparable, as the correlation is bias- and normalisation-independent. Also note that the normalisation makes correlation and covariance interchangeable. As said above, removal of the MME makes model errors more dissimilar and uncovers "hidden" trends that are outweighed by overarching commonalities. For example, corr(e*_{FR3,O3}, e*_{FI1,O3}) = 0.73, while corr(d_{FR3,O3}, d_{FI1,O3}) = 0.36. The subtraction of the correlated portion of the bulk error from the individual error emphasises the real differences among models. On the other hand, in the case of two highly similar models, such as DE2 and US4, the correlation among the e*_i is approximately the same as that among the d_i.

We provide two graphical examples of the efficacy of d_m vs. e_m. The correlation between individual model error and the MME (corr(e_i, MME)), averaged over all models, is reported in Fig. 1 as a function of the variable, demonstrating the extent of the large commonalities; it also shows a dependence on the region (correlations for SO2 are quite different over the two regions). In Fig. 1 the correlation corr(d_i, d_j) is also shown, averaged over all model pairs. The values of the curves for the two regions are very similar and small, indicating that the effect of the MME has been largely mitigated. By removing the MME, model errors become region-independent, as shown by the similar curves of corr(d_i, d_j). In Fig. 2 we report the associativity tree (the dendrogram, see details in Sect. 4.4) of cov(d_i, d_j) and cov(e_i, e_j) for the joint time series of the four pollutants in region 2. While the e_m-associations are based on the species (model errors for each species are the most correlated), the d_m-associations are drastically diverse, and unexpected patterns emerge. Models are grouped by the bias of underlying modules and/or parameters strictly associated with the physics and chemistry of a given compound; the diversity for d_m is higher with respect to the e_m-dendrogram, and the number of disjoint clusters is at least six (at a distance level of ∼0.9), while four e_m-clusters were identified (at an even smaller distance of ∼0.7).
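The computation of d_m is compact; below is a sketch of our reading of Eqs. (3)-(4) (array layout and function names are ours):

```python
import numpy as np

def pr2011_dm(errors):
    """errors: (M, N) array of model errors e_m (Eq. 3), one row per model.
    Returns the d_m metric of Eq. (4): each standardised error minus the
    standardised MME weighted by their correlation coefficient."""
    def standardise(v):
        return (v - v.mean()) / v.std()
    mme = errors.mean(axis=0)                  # multi-model mean error (MME)
    mme_s = standardise(mme)
    d = np.empty_like(errors)
    for m, e in enumerate(errors):
        e_s = standardise(e)
        r = np.corrcoef(e_s, mme_s)[0, 1]      # weight R(e*_m, MME*)
        d[m] = e_s - r * mme_s                 # Eq. (4)
    return d

# Synthetic usage: 12 members sharing a common bias term
rng = np.random.default_rng(0)
e = rng.normal(size=(12, 1000)) + rng.normal(size=(1, 1000))
d = pr2011_dm(e)   # pairwise correlations of the rows of d drop markedly
```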

In Fig. 3 a graphical representation of the covariance cov(d_i, d_j) is shown for the species of European region 2 (plots for region 1 are omitted for brevity). Mutual model covariance is indicated by the positioning of the model codes in black with respect to the models on the horizontal axis. Because the covariance matrix is symmetric, we display only half of it, for clarity. The model codes in red indicate the variance (cov(d_i, d_i)). In these plots we also report the redundancy measured by R² (blue crosses), the square of corr(d_i, d_j); R² represents the amount of variance already explained by the regressor model and, for model pairs, corresponds to the redundancy index ρI. The mutual information I is shown as vertical segments in orange. Because of the normalisation of the metric d_m, the covariance and redundancy can be expressed on the same scale, between −1 and 1. Depending on the species, the mutual relationships among members vary greatly, proving that for AQ models many factors (chemistry and dispersion modules, meteorology, grid resolution) contribute to the final outcome. This was also found to be the case for climate models (PR2011; Annan and Hargreaves, 2010). Overall, errors do not seem to co-vary more in the case of two instances of the same AQ model (DE3 and UK2, for example) than for different combinations of meteorology-dispersion models (FR4 and DE1 for O3 and CO; HR1 and UK2 for SO2; and many others). The sharing of routines specifically designed for certain pollutants and processes could be a possible cause of this. It is often the case that model developers borrow entire model components whose use was demonstrated to be an improved, or sometimes the only, solution for simulating a process. For example, the ISORROPIA module (Nenes et al., 1998) for inorganic pollutants, the resistive scheme by Zhang et al. (2001) for dry deposition, and the scavenging parameterisation for wet deposition are all examples of routines shared among the majority of the AQMEII models (see Table 1 of Solazzo et al., 2012a).
Because the redundancy measured by R² is simply the ratio of the squared covariance to the product of the variances, models with a large spectrum of covariance are also the more redundant ones (DK1 and DE1 for O3; US3 and US4 for CO; DK1 and DE3 for SO2; NL1, DE2, US4 for NO2). The redundancy measured by the mutual information is often in line with that of R², although in some cases higher values are estimated, for example for DE2 and US4 (the same model run by different groups), but also US3 for CO and NO2, and FR4 and PL1 for SO2; this is due to I being calculated as a raw frequency count, whilst R derives from a regression analysis.

A further aspect of error redundancy is the amount of the observed variance explained by the MM ensemble. Following the methodology proposed by Annan and Hargreaves (2011), we projected the observation anomalies onto the principal components (PCs) of the covariance matrix of the deviations of the ensemble of models around the MM mean. We found that just the first component (or the first two for ozone) already exceeds the observed variance. When all components are taken into account, the MM mean for EU region 1 (region 2) can explain as much as 1.2 (1.7), 2 (4.8), 2.1 (9), and 7 (18) times the observed variability for O3, CO, SO2, and NO2, respectively (the large difference between regions 1 and 2 for NO2 and SO2 is due to the much smaller variance of the observed values of these two compounds in region 2, ∼4 and 12 times smaller for NO2 and SO2, respectively). This case is depicted as a "wide ensemble" by Annan and Hargreaves (2010). A wide ensemble can also be interpreted in terms of lack of reliability, with a rank histogram (Talagrand et al., 1998) exhibiting a "central dome" pattern: the ensemble performs poorly in predicting less frequent episodes (both high and low concentrations) and lacks sharpness. Given the massive application of AQ models in regulatory applications and the ever more stringent AQ targets, the detected overconfidence can be costly. Dealing with a wide ensemble implies that there is a substantial amount of redundant variability already accounted for by other models. One plausible explanation is that the ensemble size, constrained by the available members, is simply too large.

Quantifying ensemble redundancy through dimensionality
The foremost advantage of reducing the dimensionality of large datasets by discarding redundant information is that lower dimensionality means reduced computational cost and noise, improving the accuracy of the ensemble. Data mining and data reduction are active areas of research in various fields, from genetics to ecology to machine learning. There exists a plethora of methods aimed at detecting commonalities, most of which were developed ad hoc for a specific application, such as Independent Component Analysis (Kong et al., 2008), Maximum-Relevance-Minimum-Redundancy (Ding and Peng, 2005), the methods reviewed by Grömping et al. (2007), and others. It is, however, seldom the case that a method developed by one community crosses the barrier to be adopted in a field other than the one it was originally developed for. Here we explore some analytical techniques proposed in various flavours in the climate modelling community. Note that the outcome of dimension-reduction methods is simply a number, that is, the dimension of the subspace sought. Selecting the members belonging to that subspace is a different problem, and is addressed in Sect. 5.

Eigenvalue methods
We calculated the effective number of models (also known as the effective number of degrees of freedom) sufficient to reproduce the variability of the full ensemble (the MM ensemble generated with all available members) as:

M_eff = (Σ_i λ_i)² / Σ_i λ_i²   (6)

with λ_i the eigenvalues of the corr(d_i, d_j) matrix. The theoretical derivation of Eq. (6) can be found in Bretherton et al. (1999). Under the assumption that the modelled and observed fields are normally distributed, the fraction of the overall variance expressed by the first M_eff eigenvalues is 86 % (Eq. 8 of Bretherton et al., 1999).
Results for M_eff are reported in Table 3. The sum over all eigenvalues at the numerator of Eq. (6) equals M, the trace of the correlation matrix; M_eff would equal M if all eigenvalues were equal to unity, in which case all directions would be equally important. In reality there exist eigenvalues that are larger than unity, and consequently others that are smaller than unity, and since these are squared (denominator of Eq. 6) the contribution of the former outweighs that of the latter, so that M_eff < M approximately in the amount of the number of eigenvalues larger than unity (Guttman (1954) and Kaiser (1960) indeed proposed to adopt this as a rule for selecting the number of components to retain). At the opposite extreme, if all models were perfectly correlated, only one eigenvalue would be non-zero and M_eff = 1. By applying Eq. (6) to the datasets of model errors (corr(d_i, d_j)) we find that M_eff is in the range 5 to 6.5 (Table 3). If the MME term is retained (that is, if M_eff is calculated from corr(e_i, e_j)) we find much lower values for M_eff (Table 3), as a consequence of most of the similarity among models being expressed by the MME term.
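A sketch of Eq. (6) and its two limiting cases (function names are ours):

```python
import numpy as np

def m_eff(corr_matrix):
    """Effective number of models (Bretherton et al., 1999, Eq. 6):
    (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    lam = np.linalg.eigvalsh(corr_matrix)   # eigenvalues of corr(d_i, d_j)
    return lam.sum() ** 2 / (lam ** 2).sum()

# For a correlation matrix the trace (= sum of eigenvalues) equals M,
# so M_eff = M only when all eigenvalues are unity (independent members).
M = 12
print(m_eff(np.eye(M)))                     # 12.0: no redundancy
print(m_eff(np.full((M, M), 1.0)))          # 1.0: fully redundant
```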

Principal Components Analysis (PCA)
PC analysis (PCA) (Jolliffe, 2002) is probably the best known and most widespread unsupervised dimension-reduction technique. It is based on eigenanalysis to select uncorrelated directions associated with the largest variances. Relationships between PCA and clustering (Ding and He, 2004), redundancy (Jolliffe, 2002), Multi-Dimensional Scaling (Groenen and van de Velden, 2004), and regression analysis (Jong and Kotz, 1999) have been documented, proving the versatility of this method. For example, the ratio of the sum of the leading eigenvalues to the sum of all eigenvalues obtained by means of PCA is proportional to the ratio of the regression sum of squares (SS_reg) (the explained, or signal, variance) to the total sum of squares (SS_tot) (the total variance) in regression analysis. This latter ratio is the coefficient of determination R², the redundancy index (Jun et al., 2008).
Eq. (6) provides an analytical estimate of the dimensionality of the subspace of models needed to reproduce the information of the whole ensemble. Graphically, the "scree test" (Cattell, 1966) is often applied in problems of dimension reduction. We first produce a plot of the number of dimensions vs. quantities related to the amount of variability or independence, measured by appropriate metrics. Then, we use the "elbow criterion", seeking the point at which the curve levels off to a plateau. To produce a scree plot from Eq. (6), we look at M_eff as a dependent variable of N, the number of models. Curves are reported in Fig. 4 for the four pollutants and the two European regions. The variability scale is calculated as the cumulative variability for each N. As for Table 3, curves have been derived from corr(d_i, d_j) and from corr(e_i, e_j). We notice that in both sub-regions M_eff from corr(e_i, e_j) is much lower and that variability above 80 % is reached by the first 2-3 leading eigenvalues. As noted by PR2011, the concavity of the curves over N indicates that the addition of more models to the ensemble is not compensated by a linear increase in the overall information. This is a direct consequence of the commonalities among members: the chance that a new member shares features with an existing one increases with the ensemble size. This would not happen in the case of independent models.
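The scree curve itself is a one-liner once the eigenvalues are available (a sketch on synthetic data with a strong common mode; names are ours):

```python
import numpy as np

def cumulative_variability(corr_matrix):
    """Cumulative fraction of variance carried by the leading eigenvalues;
    plotted against the number of components this gives the scree curve."""
    lam = np.sort(np.linalg.eigvalsh(corr_matrix))[::-1]
    return np.cumsum(lam) / lam.sum()

# Twelve synthetic members sharing a common mode, as in an
# ensemble of opportunity:
rng = np.random.default_rng(1)
common = rng.normal(size=(1, 500))
members = 0.8 * np.repeat(common, 12, axis=0) + 0.6 * rng.normal(size=(12, 500))
frac = cumulative_variability(np.corrcoef(members))
print(np.round(frac, 2))   # levels off quickly: few effective directions
```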

Multi-Dimensional Scaling (MDS)
Another method, among the many, to create a scree plot is to use the Multi-Dimensional Scaling (MDS) algorithm (Borg and Groenen, 2005) for determining the relationships between model errors. MDS basically searches for a spatial configuration of the objects such that the mutual Euclidean distances among them match their proximities as closely as possible. Here, we use the corr(d_i, d_j) matrix as proximities. The degree of correspondence between the distances among points implied by the MDS map and the input matrix is measured by a suitably defined stress function, the minimisation of which also provides information about the dimensionality of the subspace covering the whole variability of the data. Without going into too much detail, in MDS theory the Euclidean distance s_ij between two rows of a matrix X is defined as:

s_ij = [ Σ_{k=1,p} (x_ik − x_jk)² ]^{1/2}   (7)

The objective of MDS is to find the elements of X minimising the difference between the s_ij and the d_ij (the elements of the proximity matrix corr(d_i, d_j)):

σ² = Σ_{i<j} w_ij (s_ij − d_ij)²   (8)

σ² is the raw stress function (with w_ij non-negative weights, set to unity). Minimisation of the stress function is not trivial, and thus numerical iterative methods are employed (Borg and Groenen, 2005). By running the minimisation problem for different values of p in Eq. (8), we plot the stress against the dimension. Results for European region 2 are reported in Fig. 5 (results for region 1 are very similar and therefore not shown).
The "elbow" in the scree plot indicates when more dimensions only yield a negligible improvement in terms of stress. The trend of the curves in Fig. 5 (similar for all pollutants) indicates four as the number of independent components that best fit the data, i.e. about one third of the whole sample size.

Hierarchical Clustering (HC)
In Hierarchical Clustering (HC), each clustering solution π_i is a partition of the data set X into K_i (i = 1, ..., r) disjoint clusters of instances (Fern and Brodley, 2004). A typical output of HC is a dendrogram, or associativity tree, where redundant models are grouped together and the level of similarity among groups is reported, based on the distance between the elements of the input matrix. Here, we use the Euclidean distance as the distance metric and corr(d_i, d_j) as the input matrix. Applications of HC and of the dendrogram representation to air quality ensemble modelling are documented in Riccio et al. (2012) and Solazzo et al. (2012b).
A fundamental challenge of the HC method is that different groupings are obtained by slightly changing some of the options underlying the HC algorithms (Fern and Brodley, 2004). Here, the weighted pair-group average was selected as the agglomeration method (Murtagh, 1984), with the cut-off value set between 0.10 and 0.15 (1 being the maximum similarity) for all pollutants in both regions, which produced five disjoint clusters (Fig. 6) for all species. The cut-off value is chosen by looking at the structure of the dendrogram: it is convenient to break structures that are obviously disjoint and, within each structure, to avoid separating highly connected groups, or groups of only two models. Looking at the dendrogram for ozone, for example, the two main branches at the top further split into two more at a relatively low similarity level, suggesting a plausible way to proceed. At a ∼10 to 15 % similarity level, five clusters are detected for all species in both regions.
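A sketch of the clustering step (SciPy's "weighted" linkage implements the weighted pair-group average; the cut-off here is expressed on the cophenetic distance scale, and all names are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_members(corr_d, cutoff):
    """WPGMA ('weighted') agglomeration with Euclidean distance between
    the rows of the corr(d_i, d_j) matrix, as described in the text.
    Returns the linkage matrix and the cluster label of each member."""
    Z = linkage(corr_d, method="weighted", metric="euclidean")
    return Z, fcluster(Z, t=cutoff, criterion="distance")

# Synthetic usage with 12 members:
rng = np.random.default_rng(3)
x = rng.normal(size=(12, 400))
Z, labels = cluster_members(np.corrcoef(x), cutoff=1.0)
print(labels)   # cluster index of each member
# scipy.cluster.hierarchy.dendrogram(Z) draws the associativity tree (Fig. 6)
```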

Given the normalisation implied by the metric d_m, we found M_eff to range between 5.2 (ozone in region 2) and 6, with only NO2 and CO in region 1 requiring 6.5 components (Table 3). M_eff based on d_m is between 1.5 (SO2, region 2) and 5 (CO, region 1) times higher than the values based on e_m (values in parentheses in Table 3). The variability of M_eff among species depends on the heterogeneity of processes and sources within the two regions, as well as on the receptor coverage. Despite having removed the commonalities among models encapsulated by the MME, we still found a level of redundancy above 50 %, M_eff being less than half the sample size.
As said above, the results of the HC analysis indicate that at a ∼15 % similarity level five clusters are detected. The between-classes variance (the weighted average of the mean distance of each cluster and the mean distance of the whole dendrogram) detected by the five components generated by the HC method is between 70 % and 80 % of the total variance (depending on the variable), which would be totally reproduced only in the case of a cut-off level at the root of the dendrogram tree (one cluster only). On the other hand, the within-classes variance (the average distance within each cluster) is an estimate of the redundancy, as it is proportional to the cluster-averaged coefficient of determination R² (Moesa et al., 2005). This result is in line with that obtained by applying PCA: M_eff in that case explained 86 % of the total variance (Sect. 4.2) with a slightly larger number of models. Thus the two techniques are consistent for similar amounts of variance. Dimensionality estimated through MDS and the minimisation of the stress function returned four components. In general, though, MDS fit indexes are descriptive and do not always provide an absolute criterion for selecting the best dimensionality (Tinsley and Brown, 2000).
Summarising, the ensemble of models is highly redundant even after removal of the MM error. It is possible to reduce the full dataset by more than 50 %, down to five or six components. As discussed next, this allows reducing noise and improving accuracy. The methods adopted give consistent results.

Member selection techniques
As many as Σ_{i=1}^{m} C(M, i) subspaces with dimension smaller than or equal to m can be identified from the M members (M being the total number of available members). It is therefore difficult to univocally identify a subset of members systematically outscoring all the others over a large number of skills. Furthermore, countless methods for selecting members have been developed by different communities, testifying that the available methods are "fit for purpose" rather than of general applicability. Ideally, once a skill or feature is identified, one could select the best performing ensemble by extracting from the group of M members only those m that individually perform best when compared with the measurements. However, combinations of individually good models do not necessarily produce a good ensemble for a given feature: the m best models are not necessarily the best m models to combine. Although this approach is justified in some cases (e.g. McSweeney et al., 2012), we discourage it where no assumption about the redundancy of the information is provided (Solazzo et al., 2012b).
A number of data reduction and member selection/weighting techniques exist that can serve our goal, some of which have already been discussed in Sect. 4, and that will be adopted here to reduce the ensemble. Further to that, we will compare the new ensembles to the full ensemble mean, taken here as the benchmark. The member selection techniques applied are:

- Hierarchical Clustering (HC);
- Multi-Dimensional Scaling (MDS);
- minimisation of the ensemble mean square error (minMSE);
- Principal Components Analysis (PCA);
- the CAR score.

Not all of the methods above take into account the redundancy of members. The first two (HC and MDS) provide ensembles of low-redundancy members; the minMSE technique is a heuristic method based on the minimisation of the error, and thus the selection is skill-driven (Solazzo et al., 2012b; Riccio et al., 2012; Knutti et al., 2010); PCA provides weights for the models along the directions of maximum variance; finally, CAR is a score-based member selection method developed by Zuber and Strimmer (2011) that is a hybrid of marginal correlation and regression analysis and is briefly discussed in Sect. 5.5.1.

Hierarchical clustering (HC)
With reference to Fig. 6, the model with the smallest bias within each cluster was the one chosen to represent the cluster. The bias was selected since it is the metric underlying d_m, the metric used to build the dendrogram. The selected members are reported in Table 4. Other options have been taken into account, such as selecting the model closest to the centre of each cluster, or the model whose MSE with respect to the cluster's centroid is minimum. However, the reduced ensembles generated with these selection criteria were outperformed by that of the members of minimum bias (see Sect. 5.5.2) and are therefore not shown.

Multi-Dimensional Scaling (MDS)
In MDS the distance among models can be used as a proxy for independence, providing the visual aid needed for interpreting the grouping and selecting the more diverse members.

Minimisation of the mean square error (minMSE)

Solazzo et al. (2012b) show that the ensemble mean minimising the MSE also has superior skill with respect to the full ensemble, both in terms of variance and, of course, error. We ran a similar analysis for the present datasets. Application of this analysis yields (i) the number of dimensions to retain (the dimension of the subset), and (ii) the members to retain (the components of the subset, reported in Table 4). Finding the subset of models minimising the MSE is a heuristic practice based on the evidence that a MM ensemble whose mean minimises the error is likely to have members with small mutual covariance and similar variance (a case referred to as exchangeable, or indistinguishable, ensembles). Moreover, the fact that the MSE can never reach zero, no matter how large M is, is a consequence of the variability affecting the observations (from the error decomposition relationship, the variance of the observations is the lower bound for the error). The RMSE for ozone (region 2) of the mean of random subsets of the ensemble members, plotted as a function of subset size, is reported in Fig. 8.
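For an ensemble of this size the minMSE subset can be found by exhaustive search (a sketch of the heuristic as we read it; names are ours):

```python
import numpy as np
from itertools import combinations

def min_mse_subset(models, obs, max_size=6):
    """Exhaustive search for the subset of members whose ensemble mean
    minimises the MSE against the observations (the minMSE heuristic).
    models: (M, N) array, obs: (N,) array. The cost grows combinatorially:
    fine for M ~ 12, use greedy or random search for larger ensembles."""
    M = models.shape[0]
    best = (np.inf, ())
    for size in range(1, max_size + 1):
        for subset in combinations(range(M), size):
            mse = np.mean((models[list(subset)].mean(axis=0) - obs) ** 2)
            if mse < best[0]:
                best = (mse, subset)
    return best   # (minimum MSE, indices of the selected members)
```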

Principal Components Analysis (PCA)
Although PCA cannot be applied to select individual, independent members, it can nonetheless be used to generate an artificial time series mod_PC, obtained by projecting the original data onto its PCs. This generates a weighted ensemble, the weights being the projections of the model components onto the eigenvectors associated with the leading m eigenvalues. We have applied PCA to the covariance matrix cov(d_i, d_j) to disclose redundancy patterns (see Sect. 4.2). The reduced matrix d_m^red is obtained by projecting d_m onto PC_m, the subspace of the first m eigenvectors:

d_m^red = PC_m PC_m^T d_m   (9)

We have discussed the scree plot of Eq. (9) (Sect. 4.2) and the dimensionality of the subspace PC_m. Ideally, we should now be in a position to score the weighted ensemble obtained by retaining m components. Getting the time series back from Eq. (9) is not trivial, though, since d_m is a composite metric, and no similar applications of PCA have been found in the literature. The reduced time series is given by:

mod_PC = (σ_obs / σ_em) (d_m^red + r · MME*) + ē_m + ōbs   (10)

where ē_m is the mean error of each model and ōbs the mean observation. Some assumptions are necessary, for example on how to obtain the MME and the ē_m for the reduced space. The assumption we made consists in projecting these quantities onto PC_m too, as it is not possible to associate them with their original time series.

The use of the observational data for recreating the time series is a major shortcoming of this methodology, which can be moderated in the case of forecasting applications. In that case we could use a portion of the data to generate mod_PC, and another portion for verification of the forecast. Current work is devoted to this aspect.
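The projection of Eq. (9) can be sketched as follows (the back-transform of Eq. (10) additionally requires MME*, the mean errors and the observations, and is therefore omitted; names are ours):

```python
import numpy as np

def project_on_leading_pcs(d, m):
    """Project the (M, N) matrix of d_m vectors onto its first m principal
    directions: d_red = PC_m PC_m^T d (our reading of Eq. 9)."""
    cov = np.cov(d)                                # member-by-member covariance
    eigval, eigvec = np.linalg.eigh(cov)           # ascending eigenvalues
    V = eigvec[:, np.argsort(eigval)[::-1][:m]]    # leading m eigenvectors
    return V @ (V.T @ d)                           # reduced matrix d_m^red
```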

Member selection
Members selected with MDS, HC, minMSE and the CAR score are reported in Table 4, where the redundancy index ρI of each reduced ensemble is also reported.
The HC analysis of region 1 highlights that there is a group of two models common to the four pollutants (FI1 and FR3) and, with the exception of ozone, DE1 and HR1 are also in common. Furthermore, given the high level of similarity between DE2 and US4, NO2 and SO2 are represented by the same members (Table 4). The outputs of these models have therefore the least correlated bias, and in this sense can be considered non-redundant. For region 2 we found the selection to be more sensitive to the species, with only US3 and DE1 common to three species. The spatial dependence of the bias emerges here too. This is an additional indication that independence and skills need to be investigated separately (Abramowitz, 2010). Two similar, highly redundant models are bound to score alike under a variety of member selection techniques. This could be a possible way to read the combinations of redundant members optimising the MSE.

We have applied the CAR score, recently developed by Zuber and Strimmer (2011), to our dataset (available in the package "relaimpo" for the R statistical software, www.r-project.org). This method provides a ranking based on the partial correlation between model and observations, conditioned on all other models. The CAR methodology is related to the amount of explained variance; it enforces the simultaneous selection of highly correlated predictors and penalises variables correlating with opposite signs with the observations. Models with a small CAR score contribute little to improving the prediction error or to reducing the unexplained variance. Interestingly, for the species O3 and NO2 of region 1 and O3, CO, and NO2 of region 2, two of the models selected using CAR are in common with the minMSE selection (the first four CAR-ranked models are reported in Table 4). Further to that, the overall redundancy of the ensemble built from the mean of the first four CAR-ranked models is, on some occasions, even lower than that of the HC selection (O3, SO2, and NO2 in region 1; CO and NO2 in region 2). We further notice that the minMSE selection having lower redundancy than any other method is related to the fact that the subspace has dimensionality of two (SO2 and CO, region 1) or three (NO2, region 1, and O3, region 2). As noted above, from the error decomposition theorem, in the case of low dimensionality it is straightforward to assess that the error is minimised by low-redundancy members, as the covariance term is null.
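For readers without access to the R package, the CAR scores can be sketched in a few lines (our transcription of the definition in Zuber and Strimmer, 2011: marginal correlations decorrelated by the inverse square root of the predictor correlation matrix; a well-conditioned correlation matrix is assumed, and all names are ours):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def car_scores(models, obs):
    """CAR scores: omega = R^{-1/2} r_xy, with R the model-model
    correlation matrix and r_xy the model-observation correlations.
    models: (M, N) array, obs: (N,) array."""
    X = (models - models.mean(axis=1, keepdims=True)) \
        / models.std(axis=1, keepdims=True)
    y = (obs - obs.mean()) / obs.std()
    R = np.corrcoef(X)                       # model-model correlations
    r_xy = X @ y / len(y)                    # model-obs correlations
    return fractional_matrix_power(R, -0.5) @ r_xy

# Rank members by squared CAR score; members with a small score
# contribute little to explaining the observed variance:
# ranking = np.argsort(car_scores(models, obs) ** 2)[::-1]
```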

Skill scores
In Table 5 we report the scores of the reduced ensembles generated with the methods discussed above. The full-member ensemble mean is also included as a reference. With few exceptions, the reduced ensembles score better than, or as well as, the full ensemble, especially in terms of variability. Overall, the minMSE selection seems to perform best, while the HC, MDS and CAR methods do not consistently score high, although performing best on some occasions. The overwhelming strength of the weighted PCA ensemble derives from having used the observations to rebuild the time series, as discussed in Sect. 5.4. In general, though, one of the main drawbacks of weighted ensembles is that they are not robust enough to be applied under a variety of scenarios (species, temporal, and spatial), and in practical applications the MM mean is often preferred (Pierce et al., 2009; Knutti et al., 2010).

Implications for AQ forecasting
We outline here some considerations about applying the techniques of dimension reduction and member selection to periods of time other than those used for selecting the members. It is in ensemble forecasting applications that the low redundancy of the bias plays the most important role: since observations are not available for evaluation, the averaging out of errors is the only means to prevent common, redundant biases from determining the direction of the (biased) agreement.
We thus ask whether any associativity among members can be inferred in the case where observational data are not available. In other words, knowing the associativity among the errors, what can be deduced about the associativity of the models underlying those errors? This problem is of direct relevance to forecasting, and thus worth investigating. The starting point is, as usual, the covariance matrix of the errors cov(d_i, d_j), which we assume to be known. Writing the raw errors as e_i = mod_i − obs, some basic manipulation gives:

cov(e_i, e_j) = cov(mod_i, mod_j) − cov(mod_i, obs) − cov(mod_j, obs) + var(obs)

The covariance of the model errors and the covariance of the models are thus strictly related, and we cannot prescind from the observations. All we can do is to infer some considerations about the covariance of the model errors for short periods of time ahead. In practical terms, we first derive a reduced ensemble from the matrix of errors cov(d_i, d_j). Then, if the trend of the error does not change drastically for a few hours or days ahead, we can deduce that the association among the members does not change either, and thus the reduced ensemble is still the best option. Exploitation of ensemble reduction and member selection for forecasting applications is a topic of ongoing work.
Recently, Galmarini et al. (2013) investigated the possibility of forecasting air quality starting from the combination of well-behaved spectral properties extracted from the AQMEII ensemble. The results show that the approach outperforms even the ensemble median. Further investigation will be devoted to determining the correspondence between the reduced set obtained here and the properties of the ensemble put together by Galmarini et al. (2013), for the sake of identifying a deeper structure inside the model behaviour and performance.

Conclusions
The similarity of members is an outstanding issue in ensemble modelling which has recently raised awareness in the climate ensemble community, but not yet in the air quality one. In this study we explain the risks of combining models sharing highly correlated biases into ensembles. We apply our analysis to a high-resolution dataset covering two European regions for three months. Along with observational data, we have treated the results of 13 AQ models for four air pollutants: CO, O3, NO2 and SO2.
We have provided definitions for the concepts of independence, diversity/similarity, and redundancy of models and their errors, which are often used interchangeably, giving rise to misconceptions. Due to practical difficulties in computing independence, we used the redundancy instead, which is simpler to handle and has the advantage of expressing the amount of accounted-for variance, regardless of the diversity of the models. Conceptually we believe this is very important, as it allows a univocal interpretation of the results. We started by applying the metric d_m, introduced in climate modelling studies, to our ensemble of regional-scale pollutant concentrations. d_m serves the scope of eliminating overarching commonalities among members and of exploring hidden similarities, i.e. those underlying common modules and parameters in the models. Some main results and considerations:

1. The correlation among the majority of models remained a constant feature across the two examined regions, but varied from species to species. In fact, it is generally not possible to identify model similarities common to the four species. This implies a large spectrum of partially shared modules and parameterisations within the AQ modelling systems, which are invoked depending on the species and on other inputs, such as meteorology and emissions. Indeed, although most of the model similarities encapsulated by the multi-model ensemble mean error were removed by calculating d_m, similarities among model errors were still found to be significant;

2. We applied several techniques of dimension reduction and member selection, among which:

- eigenvalue analysis and Principal Components Analysis of the error correlation matrix;
- Hierarchical Clustering and the dendrogram representation of model similarities;
- Multi-Dimensional Scaling and the graphical representation of model similarities as mutual distances among models;
- the minMSE heuristic, determining the size of the ensemble of models whose mean minimises the MSE.
None of the aforementioned methods is new; they are all well-established techniques used, in many flavours, in various branches of science. They had, however, not been used before in an AQ ensemble context, nor compared with one another. We also introduce, where possible, the nexus between these techniques and redundancy. We found that the optimal size of an ensemble of poorly correlated members is about 4-6, implying that more than half of the information of the full MM ensemble is redundant.
Concerning the member selection methods:

1. The minMSE selection gives the best accuracy (lowest error) by definition, but scores among the best also for variability. The HC, MDS and CAR methods do not consistently score high, although performing best on some occasions.
2. The error being minimised by highly redundant members does not justify, in our view, the use of the ensemble of those members. Skills and diversity need to be analysed separately. This is because redundant members might share common biases, which will force the agreement towards the same direction, with the risk of misjudging the results. These aspects are likely to be detected by diagnostic types of analysis (rather than by simple operational scores based on distance metrics) and may often reveal more about the causes of model error than about the best numeric combination in the trade-off between covariance and bias. Indeed, if all mutual covariances between the models are negative, then the optimal MSE is lower than for uncorrelated models. It remains to be explained why highly redundant ensembles also produced high scores; a further open issue is why ensembles of individually high-ranked members might be outperformed by ensembles of high- and low-ranking members.
3. Application of PCA to the matrix of errors for the purpose of data reduction has proved successful. By contrast, generating the reduced time series (the time series projected onto the leading eigenvectors) is not trivial and requires the use of the observational data, which masks the outcome of the procedure. As no applications of this sort have been found in the literature, our intention is to devote future work to this aspect, which might be relevant in the realm of forecasting;

4. Finally, we have highlighted the steps for applying the methods of dimension reduction and member selection to a forecasting context.
We also believe that the effort spent migrating some of the knowledge and techniques developed in other scientific areas (especially computer science, genetics, and climate modelling) will contribute to raising awareness in the AQ ensemble community about the dependency of models and the meaning of model agreement.