A science-based use of ensembles of opportunities for assessment and scenario study : a re-analysis of HTAP-1 ensemble

Introduction Conclusions References


Introduction
A multi-model (MM) ensemble is defined as a group of simulations of the same case study, produced by formally different models, which are statistically treated in an attempt to improve the quality of the result (Potempski and Galmarini, 2009).Given the ever increasing collaborations of geophysical modelling communities in joint assessment studies, MM ensembles are becoming very popular and an opportunity to extend and generalize individual deterministic model results (Solazzo et al., 2012(Solazzo et al., , 2014;;Solazzo and Galmarini, 2014;Galmarini et al., 2004;Vautard et al., 2012;Evans et al., 2013;Bishop and Abramowitz, 2013).
In particular in atmospheric sciences, MM ensembles are used extensively in climate and air quality predictions and assessments.While in climate research and applications many of the concepts applied and described here are well known and correctly used, in air quality this is not always the case and several are the examples of direct use of un-inspected ensembles.We shall describe an inspected ensemble (opposed to an un-inspected one) as: a set of model results, whose properties and characteristics, Figures have been analysed in an attempt to reduce the presence of redundant information or elements that are not relevant to the determination of an accurate result.An inspected ensemble should always produce a result that is more accurate than the simple average of the multi model results.
The motivations behind the necessity to inspect an ensemble are connected to the way in which MM ensembles are put together and the nature of the participating models.In fact, the selection of the models whose results will make the ensemble is not regulated by any science based criteria and there is no a-priori specification that defines the characteristics of a model that should or should not take part to an ensemble.The constitution of a MM ensemble is merely based on an opportunity to provide model simulations and to participate to a community activity where anybody is welcome (ensemble of opportunity).Regarding the nature of the models producing results for ensemble applications, one should never forget that the best results are those produced by ensembles of independent (and accurate) models (Potempski and Galmarini, 2009;Kioutsioukis and Galmarini, 2014;Weigel et al., 2008;Pirtle et al., 2010;Knutti, 2010).Formally, model m 1 is defined independent from m 2 if the joint probability p for a result of m 1 and m 2 can be expressed as p(m 1 , m 2 ) = p(m 1 )p(m 2 ).Under this condition, biases of opposite signs cancel out and the deviation from the average of all models results is a true representation of the uncertainty affecting the solution.If independent models produce similar results, the simulations converge toward the accurate results, if they differ, it means that the results are uncertain and the spread would be a reliable measure of the uncertainty.Models used in air quality (among others) are not independent, they are often sharing common assumptions, modules, input data, and cannot therefore be considered independent.In most of the cases the models are different (Phenotypical model difference, Potempsky and Galmarini, 2009), but are not independent.This leads to the possibility that results obtained from an ensemble, rather than representing a true alternative and independent solution, would just be like in music composition a variation on the theme, producing a false sense of variability Introduction

Conclusions References
Tables Figures

Back Close
Full which could lead to coinciding (diverging) biased results and a false sense of agreement (uncertainty).MM ensembles derived from simply different models are said to be potentially prone to redundancy and overconfidence.The inspection is therefore primarily finalised at: the identification of the level of diversity (communality) shared by the model results, retaining only those that are contributing with original information removing the redundancy.
Techniques exist that allow such screenings that rely on the existence of observations and the comparison of the ensemble variability with the observational variability (Potempski and Galmarini, 2009;Solazzo et al., 2013).
In this study we aim at demonstrating the importance of using existing good practices in the air quality MM ensemble context.Toward the scope we have selected a case study published in the past which does not exploit the true value of having multiple model results at hand.The case analyzed is the HTAP (Hemispheric Transport of Air Pollution) phase 1 multi-model exercise (Dentener et al., 2010) and in particular the multi-model ensemble activity performed within it and presented by Fiore et al. (2009).The study of Fiore et al. (2009) is used here as mere representative of a wide spread practices in the air quality modelling communities at all scales and it represents just an example on how things could be improved further.An additional reason for selecting this case is the fact that ensembles are used by Fiore et al. (2009) for sensitivity studies with respect to emission reduction options.The inspection of the ensemble can have important consequences also for emission scenarios as shown later, an aspect never considered before in the literature.Introduction

Conclusions References
Tables Figures

Back Close
Full We will focus on the MM ensemble analysis by Fiore et al. (2009) (from now FetA09).
In FetA09, an average of 21 model results was used to investigate the monthly mean surface ozone concentration in three sub-regions of Europe (Mediterranean, Central Europe with receptors between 0 and 1 km height and Central Europe with receptors between 1 and 2 km height), five North-American sub-regions (North East, South West, South East, Great Lakes, and Mountainous) and one Japanese sub-region (EANET stations).Operational scores (bias, correlation coefficient and SD) were calculated in each sub-region making use of ground-based measurements.The combined spatial and temporal average of the modelled concentration values resulted in smoothed monthly time-series.The analysis of FetA09 reveals that the distribution of the results is rather symmetric (Fig. 1).Supported by the agreement with observations, the authors considered the MM ensemble mean to be the best possible estimate as it "generally captures the observed seasonal cycle and is close to the observed regional mean" (FetA09), although the time-series distribution showed the presence of some clear outliers.The MM ensemble mean was then used to quantify source-receptor relationships as well as ozone concentration response to changes in the emissions scenarios.
Although the scope of the study of FetA09 was not to prove the robustness of the MM ensemble mean, it is an example of the widespread practice of averaging all available members in the light of the ill-based assumption that the average of many model results Introduction

Conclusions References
Tables Figures

Back Close
Full is always a better result than that of one model.That would be true if the models were independent but there is no a-priori proof of that.Some questions arise: how robust are the results if the models are not independent models?How different would be the result should some model not take part to the activity or more outliers like the one present in the Fig. 1 would be present?How generalised is the result since the selection of the ensemble members is based on the voluntary participation to a joint activity and the ensemble does not contain all possible results?Is there any duplication of information?Is all the information contained in a MM ensemble relevant and necessary?Since the construction of ensemble is not governed by scientific selection criteria, so it happens that the subsequent ensemble result strictly depends on aleatory factors and one can presume that it lacks generality.
The screening methodology proposed and that we will apply as an example to the FetA09 set, is a good way to exploit an abundance of model results in the best way, to transform the aleatory gathering of information into a more robust result that is based on scientific principles.The large ensemble of model results becomes an opportunity to cherry-pick those models that produce the most accurate MM ensemble and use only those to drive conclusions.The analysis will help identifying the size of the nonredundant ensemble and the subsets of members to produce skilled results.

Inspecting a multi model ensemble
In this section the MM ensemble of FetA09 is inspected.We will concentrate on the ozone simulations over the same regions presented in FetA09 and we will make use of exactly the same model data and observations used by Fiore et al. (2009).The inspection is based on the following steps: identification of the models that will be part of the reduced ensemble which will be subsequently used.

The "accounted" variability: eigen-analysis and ranked histogram technique
The goal of this first analysis is to determine to what extent the observational variability is reproduced by the ensemble.An optimal situation is the one in which the variability of observations coincides with that produced by the ensemble of models, in other words the ensemble of the results all together covers the same range of variation of the measurements.Any deviation from this condition, namely a smaller or a larger variability of the ensemble with respect to the observed one would show, on one side, the incapacity of the ensemble to model the observed reality, or on the other, the addition of irrelevant information to the simulation of the observed situation.Therefore considering that a MM ensemble is assembled on an opportunity basis rather than results characteristics, this first step is of primary importance to estimate to what extent the gathered set is appropriate for the case study.
A technique to assess the variability and to estimate the redundancy of the MM ensemble with respect to that of the observations, was suggested by Annan and Hargreaves (2010) and applied in some MM ensemble modelling contexts (see, e.g.Solazzo et al., 2013;Solazzo and Galmarini, 2014).It consists of projecting the observation anomalies (the element-wise difference between the observations and their mean) onto the principal components (PCs) of the covariance matrix of the deviation of the ensemble of models from the MM mean (the element-wise difference between each model realisation and the MM ensemble mean).Principal component analysis (Jolliffe, 2002) is probably the most well-known and wide-spread dimension-reduction technique.It is based on eigen-analysis to select uncorrelated directions associated with the largest variances.
When applied to the HTAP 21-member ensemble analysed by FetA09, this method shows that the first (largest) eigenvalue already explains more than 90 % of the ob-30529 Introduction

Conclusions References
Tables Figures

Back Close
Full servational variability in most regions, the only exception being Japan with 60 %.In other words, most of the ensemble members have a significant projection onto the first eigen-vector defining the major component, thus explaining the same portion of variance.If too many models are projected on the same eigenvector, it means that there are too many models producing repeating or "overlapping" solutions (thus, the ensemble is redundant).A well-behaved MM ensemble (not necessarily the theoretical case of independent models) should be made of a number of models whose eigenvalues contribute to the explanation of as many different components as the observational variability and the ratio model-to-observed variance should be close to unity.In the case of the HTAP MM ensemble, when all eigen-values are taken into account, the MM ensemble variance is 4.7, 6.0, 8.7 times the variance of the observation anomalies for the EU Mediterranean, Central 0-1 km, and Central 1-2 km, regions respectively.Concerning the US Mountains, Great Lakes, SE, NE, SW regions, the full MM ensemble mean accounts for 25.4,9.1, 20.6, 10.7, 5.6 times the observed variability, respectively, and finally 4.7 times for the Japanese sub-region.According to the definition of Annan and Hargreaves (2010) the ensemble is therefore wide, i.e. its variability is larger than the observed one.Dealing with a wide ensemble implies that there is a substantial amount of redundant variability, i.e. variability already accounted for by other models.Not all information contained in the ensemble is needed in principle and needs to me reduced.
An alternative method to diagnose the variability spanned by an ensemble of models to the eigenvalues used is the Talagrand or Ranked Histogram (RH) (Talagrand et al., 1998), which provides and evaluation of the consistency of the ensemble with an observed quantity.In a RH the observations are ranked into a number of bins equal to the number of models making up the ensemble plus one for the extremes.The ensemble members are sorted to define ranges or "bins" of the modeled variable such that the probability of occurrence of the observation within each bin is, ideally, equal.then there will be N + 1 bins (Hamill, 2001).The underlying assumption is that each ensemble member in principle can introduce an independent degree of variability.An indication of an ill-constructed ensemble is the ratio between the number of elements and the number of data available per model.If there are N models with time series each of size n t (elements of the time series), the implication of N > n t is that there will be at least N − n t empty bins in the RH, indicating redundancy of the ensemble and that the ensemble is inappropriate for the case analyzed.This same result could be visualized by looking at the load factors resulting from the decomposition in PCs: many projections would be null, as the number of Eigen-vector is larger than the number of data to project.The HTAP MM ensemble used in this example, N = 21 and n t = 12.The RH for the nine sub-regions is reported in Fig. 2. Six (NA NE) to nine (NA SW) bins out of 21 are populated, (i.e.contain non-zero values), due to insufficient data and excess of redundant information.The use of the Ranked histogram reveals another important problem with the FetA09 ensemble.Good ensemble practice would require n t N. The plots clearly show that there are many empty bins (so degrees of freedom in the process that are not part of the reality as no observations are present in that range).The uneven distribution of the histograms shows that much emphasis (overconfidence) is given to some aspects of the process description, while others are neglected, that is another way of representing the redundancy obtained with PC analysis presented earlier.

Effective number of models
Having assessed that the ensemble is redundant it is important to determine the minimum number of models from those available in the ensemble that would suffice to describe the observational variability.A very robust method never used in air quality is that developed by Bretherton et al. (1999).The effective number of models sufficient to Introduction

Conclusions References
Tables Figures

Back Close
Full reproduce the variability of the observation is defined as: with λ eigenvalue of the corr(d i , d j ) matrix, which contains the linear correlation coefficient between any pair d i , d j (i , j = 1, . .., N). d is a metric defined accordingly to Pennel and Reichler (2011): where the index m identifies the model, MME is the multi model error (the average of all individual model's errors) and R is the Pearson correlation coefficient between e m , the error of model m and the MME.The removal of MME in Eq. ( 2) makes model errors more dissimilar from one another and uncovers "hidden" trends that are outweighed by overarching commonalities.Indeed the scope of the metric d m is to determine similarities between models beyond the dominating ones induced by shared inputs and/or common parameterisations to the extent that the former are accounted for in the average.Expression (1) should be interpreted as: only if all eigenvalues were equal to unity, Eq. ( 1) would take a value of N eff = N, which corresponds to the situation where all directions are equally important and all models add independent contributions to the explanation of the observational variability.On the other hand, if all error fields were similar, only one eigenvalue would be non-zero and N eff = 1.Equation ( 1) provides an analytical estimate of the dimensions of the subspace of models necessary to produce the information of the whole ensemble.
For the HTAP MM ensemble of FetA09, Eq. ( 1) gives N eff ranging between ∼ 2 and 4 for the regions analysed by FetA09 compared to the original 21 models.Thus, approximately three quarter of the available information on variance is redundant.This is a very revealing result that indicates paradigmatically the relevance of a pre-inspection Introduction

Conclusions References
Tables Figures

Back Close
Full of an ensemble.What seemed like a largely populated ensemble turns out to be incapable of capturing several degrees of freedom of observations and 2 to 4 members of 21 are sufficient to describe the observational variability.One may ask: if so, why is the average of the 21 models fitting so well with the observations as presented in FetA09?
The answers could be: pure chance, since finally the model results participated out of good will, and happened to be there in the right mixture.Just consider what would have happened to the mean of the models should one of the two most evident outliers in Fig. 1 decide to withdraw from the exercise.Alternatively an explanation could be the massive smoothing due to the monthly averaging along with the high level of tuning of the models around specific solutions that are normally distributed around the average observed data.

Reducing ensembles
As demonstrated in the previous sections, the HTAP MM ensemble is redundant and in particular 2 to 4 members are sufficient to represent the observational variability while the rest do not add any new information.Similarly, the extra elements are likely to deteriorate any evaluation metrics applied to the ensemble.At this point we know that the number of models that are necessary and sufficient is smaller than 21 but we do not know which combination of members for every grouping produces the optimal ensemble.Given N members, there are G = N!/[r!(N − r)!] possible groups of r elements.
A straight forward way to identify the optimal ensemble (optimal sub set) and maximize the accuracy of the ensemble is to analyse all the G combinations of subsets of models and identify the one that minimize the Root Mean Square Error (RMSE).The latter is a measure of the accuracy (the even distribution of model results around the observed value), and high accuracy also improves precision (a reduced spread/scatter of the model results around the observed value).In fact while accuracy is a pre-requisite for precision, the contrary does not hold.Introduction

Conclusions References
Tables Figures

Back Close
Full In Fig. 3 we report the curves of minimum, mean, and maximum RMSE for the nine sub-regions used by FetA09 as a function of the number of members of ensembles (r = 2, . .., 21).The figure confirms the results on the number of models necessary to maximize the ensemble performance and tells us that which combination of the 2 to 4 models out of 21 produces such improvement.The scores of the reduced ensemble are reported in Table 2 and are compared against the ones produced by the full ensemble mean.In all cases the mean of the reduced ensemble improves the accuracy (from 31 % for NA NW to 71 % for NA Mountain and NA Lakes) and precision (most notably for NA SE and NA NE).As it can be seen in several regions the use of the full ensemble of opportunity produces a clear deterioration in the ensemble statistics.In Table 2 we report also the ranking of the models contributing to minimize the error in the subregions.As from the table it is often the case that the error is minimized by mix-ranked (good performing and bad performing) of members.In fact, if the two best models have a high chance of being also highly correlated then they would share some portion of information thus resulting redundant.Therefore when considering the ensemble mean of these two models, very little decrease in error would be found compared to the individual models.Mathematically, the theorems by Elashoff et al. (1967) andCover (1974) have proven two important results on the selection of member and evaluation of individual scores: the best two models are seldom the combination of two models that maximises the score of an ensemble average, and furthermore, that the best single model may not appear in the ensemble maximising the feature score.As a result, the simple method of making ranked combinations of models with the best individual features may prove unsuccessful, as also demonstrated by e.g.Solazzo et al. (2013), Hannan and Hargreaves (2011), and others.This confirms the importance of the inspection of the available results prior to their use and of having at disposal a large pool of models from which optimal subsets can be extracted.Introduction

Conclusions References
Tables Figures

Back Close
Full 3 Impact on the results of emission sensitivity analysis of an inspected vs. uninspected ensemble An important part of FetA09 relates to the sensitivity study on emission reduction.As part of the HTAP program the consequences of an emission reduction of 20 % anthropogenic NO x in specific part of the globe where investigated using the MM ensemble available.Since we have demonstrated that the MM ensemble used in FetA09 is redundant and having identified the optimal number of elements and the most accurate set of models, one may wonder how the predicted consequences of the emission reduction on ozone concentration would change if we used the reduced ensemble.
We focused the analysis on the North-American region only for reasons that will be discussed in the next section.In FetA09 the use the mean of the full ensemble produced an average response in ozone concentration of −0.76 ppb in the NA region as a consequence of the reduction of NO x emission by 20 %.We shall note that the NA region is subjected to the emission reduction and therefore the investigation includes the whole of the US and part of Mexico (Fig. 1 of FetA09), and thus it has a spatial extension that includes the five NA sub-regions discussed in Sect. 2 for the evaluation.Furthermore, of the 21 models participating to the evaluation part of the exercise, only 14 models results were made available for the simulation with reduced emission scenarios.Therefore for the sake of consistency, we repeated the redundancy inspection for the 14-member ensemble and calculated the most accurate set through the minimization of RMSE described Sect.2. The size of the newly calculated subsets ranges between three for the Lakes, North-East, South-West, South-East of USA, and four model results for the Mountainous region.The newly calculated set obtained from the original 14 member ensemble produced an ozone concentration reduction of 2.32 ppb on average across all regions.That is 300 % more than that found by FetA09.The largest variation is obtained for the South-East region of USA, with an ozone concentration decrease of 5.30 ppb that is a 5-fold than what obtained by FetA09.Such an Introduction

Conclusions References
Tables Figures

Back Close
Full analysis demonstrates how conclusions could change if the ensemble is not inspected a priori and reduced if necessary.
In the exploration of scenario or sensitivity to ideal conditions like that presented in HTAP, one may be tempted to construct an ensemble that only groups the best preforming models results in the evaluation against measurements and using only those in the sensitivity or scenario case study grouping them in an ensemble.This would be wrong in principle or in other words would not produce the best ensemble by definition as demonstrated by the already cited theorems of Elashoff et al. (1967) andCover (1974).

Conclusions
Multi-model ensemble is becoming very popular in geophysical studies.In this paper we have been contrasting the results from an ensemble of opportunity where casually assembled model phenotypical different are the driving elements, with the results obtained when the same pool of model is screened to eliminate redundancy and the optimal combination is used.
The case of HTAP phase 1 is taken here as an example of a practice that is widely spread, especially in the realm of air quality, atmospheric dispersion at all scales.A very limited amount of studies apply correctly the technique.The HTAP case has been selected for two main reasons: the very large number of models that participated to the initiative and that were available for the ensemble analysis; the ensemble results were also used as basis to assess the consequences of an emission reduction strategy on ozone in several regions of the world.
The HTAP ensemble has been assessed against available measurements and the following conclusion were obtained: Introduction

Conclusions References
Tables Figures

Back Close
Full -In spite of the large number of participating models, the scarcity of time steps produces an important level of redundancy as from the simple analysis of a ranked histogram.
-At smaller subset of model perform much better when compared to measurements and it is statistically more significant.
-In the case of HTAP (FetA09) the objective of the study was to determine, through a MM ensemble, the impact of emission changes produced in one continent on another.The analysis conducted on the impact over the same continent where the emissions are produced, reveals that the conclusions remain the same as those produced by FetA09 but the values found are between 3 to 5 times higher when using a non-redundant ensemble.
These are problems that are common to many multi model studies and for which a minimum set of good practice rules should be taken into account (Kioutsioukis and Galmarini, 2014).
On a more general level, it is clear that the use of un-inspected ensembles of opportunities is a miss-practice that could lead to under-exploitation of the latter and in some case even wrong conclusions.Quantitative practices guarantee for the best possible diagnosis of the ensemble potential and its full exploitation.The availability of monitoring information is essential for the performance of the analysis presented here and it could be argued that the optimal ensemble identification is prone to the time and spatial representativity of the observations.This is true but as much as it is for the evaluation of any individual model result that depends on the space and time distribution of observation and the phenomenology represented.The hemispheric transport case analyzed here brings to the attention also the issue of the space and timescale at which a set of model verified in a certain area could be used.The verification of the effect of the selection of an optimal set out of an ensemble based on data pertaining to a specific region and time frame, produces over another region, remains an important element of research.In other words, whether an optimal Introduction

Conclusions References
Tables Figures

Back Close
Full set selected for region A using observation in region A can be used for a region B and in a scenario or sensitivity analysis mode.Scale dependence of the atmospheric processes involved could become an issue in this case and will have to be verified.On the other end we consider the use of the optimal set for scenario and sensitivity study in the area where the observation used for its selection have been collected much more appropriate than the use of a full ensemble of opportunity.The selection of the optimal set through observations on a base case scenario is equivalent to the evolution of a single deterministic model and its application for speculative scenario analysis or forecast applications.
The representativity of the ensemble compared to observation and the minimization of the redundancy remain a important issues.In the light of that we speculate here, the use of multi-scale multi-model ensembles, constructed with the combinations of models covering different portions of the atmospheric power spectrum, could greatly improve the representativity and provide coverage of the problem in a much more detailed form.The combination of global and regional scale results, for example, in one ensemble is a possibility that will be explored in the framework of the next phase of HTAP.Introduction

Conclusions References
Tables Figures

Back Close
Full Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | 2 The case study and MM ensemble inspection In 2006 the Task Force on Hemispheric Transport of Air Pollution (http://www.htap.org/)organised a comparison exercise of global and hemispheric transport models, focussing on the relationships between regional scale emission perturbations and the response in air quality, ecosystem, and climate related variables.The information was used in an aggregated form to evaluate air pollution abatement strategies and their impact across the Northern Hemisphere.Results of the comparison exercise are summarized in Dentener et al. (2010); Sanderson (2008); Fry et al. (2012); Wild et al. (2012); Jonson et al. (2010); Anenberg et al. (2009); Fiore et al. (2009).
Discussion Paper | Discussion Paper | Discussion Paper |

-
determine to what extent the variability present in the observation is reproduced by the ensemble, determine the minimum number of models necessary to represent the observed variability, Discussion Paper | Discussion Paper | Discussion Paper | Screen / Esc Printer-friendly Version Interactive Discussion Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | The bins are determined by ranking the ensemble members from lowest to highest.The interval between each pair of ranked values forms a bin.If there are N ensemble members, Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Discussion Paper | Table 2. RMSE-ranking and scores of the reduced MM ensemble mean for the sub-regions object of the analysis (RMSE: Root-Mean-Square-Error; PCC: Pearson Correlation Coefficient; σ: ratio of the modelled to the observed SD).= 3.11 (5.70)Japan EANET 12, 15 PCC = 0.96 (0.79) σ = 0.66 (0.51)