Objective assessment of ozone in chemistry-climate model simulations

A. Yu. Karpechko, N. P. Gillett, B. Hassler, K. H. Rosenlof, and E. Rozanov

Climatic Research Unit, School of Environmental Sciences, University of East Anglia, UK
Canadian Centre for Climate Modelling and Analysis, Environment Canada, Canada
National Institute of Water and Atmospheric Research, Lauder, New Zealand
NOAA Aeronomy Laboratory, Boulder, USA
Institute for Atmospheric and Climate Science, ETH Zürich, Switzerland
Physical-Meteorological Observatory/World Radiation Center, Davos, Switzerland

Received: 6 July 2009 – Accepted: 2 September 2009 – Published: 17 September 2009
Correspondence to: A. Yu. Karpechko (a.karpechko@uea.ac.uk)
Published by Copernicus Publications on behalf of the European Geosciences Union.


Introduction
In the last two decades of the 20th century, stratospheric ozone, which accounts for about 90% of total ozone, declined significantly as a result of chemical destruction by anthropogenic halogen-containing compounds (WMO, 2007). In the Southern Hemisphere (SH), where ozone depletion is particularly severe at high latitudes in spring, ozone changes have led to cooling of the lower stratosphere and an increase in the lifetime of the Antarctic polar vortex (e.g., Randel and Wu, 1999; Zhou et al., 2000). These changes have further led to intensification of the tropospheric circumpolar circulation (Thompson and Solomon, 2002; Gillett and Thompson, 2003). Among other impacts, the intensification of the tropospheric circulation has contributed to a significant decrease in rainfall in southwest Australia (Cai et al., 2005) and to dramatic warming of the Antarctic Peninsula (Marshall et al., 2006). Stratospheric ozone is expected to recover during the 21st century as a result of declining halogen abundances (WMO, 2007), and ozone recovery is expected to influence future SH tropospheric circulation (Miller et al., 2006; Perlwitz et al., 2008; Son et al., 2008). This implies that details of the ozone recovery need to be predicted well in order to reliably simulate future SH climate.
At present, full representation of stratospheric chemistry in climate models is computationally expensive, and the majority of coupled atmosphere-ocean climate models use prescribed ozone fields. Models that were used for the Intergovernmental Panel on Climate Change Fourth Assessment Report (IPCC AR4) used either a simplified ozone recovery scenario or even assumed constant ozone (i.e. an annual cycle not varying from year to year) throughout the 21st century (Miller et al., 2006). More physically sound future ozone scenarios are provided by coupled Chemistry-Climate Models (CCMs). These models account for interactions between stratospheric ozone chemistry and atmospheric physics and dynamics, which may change due to projected increases in greenhouse gases (GHGs). However, ozone projections differ from model to model (Eyring et al., 2007), raising the question of which ozone recovery scenario is more reliable.
Information on model performance in simulating present climate may be used to decide which model's projection is more reliable (Reichler and Kim, 2008; Gleckler et al., 2008). However, models are tuned to represent the present climate, and the best tuned model may not simulate future climate more correctly. Yet, without a better alternative, ranking models by their ability to simulate present climate and observed trends looks like a reasonable approach and is widely employed (e.g. Connolley and Bracegirdle, 2007; Bracegirdle et al., 2008). Eyring et al. (2006) assessed different aspects of the performance of several CCMs, including their ozone simulation skill; however, they did not derive any objective metric of agreement between simulations and observations. Waugh and Eyring (2008) carried out an objective assessment of CCMs' ability to simulate several key processes relevant to stratospheric ozone, since a process-oriented evaluation might be a better predictor of a model's ability to make reliable projections; however, they did not assess the models' skill at simulating ozone itself. The goal of this study is to provide climate modellers with a guideline for the choice of future ozone scenario for simulations with prescribed ozone fields. To achieve this we perform an objective assessment of CCM skill in simulating the observed ozone climatology and ozone trends, with a focus on the SH, where the largest impacts of ozone recovery on climate are expected. As a reference, we employ several available up-to-date observational data sets, which allow us to evaluate uncertainties associated with the observations.

Data
We use output of twelve CCMs assembled in the Chemistry-Climate Model Validation (CCMVal) Archive at the British Atmospheric Data Centre (BADC). These twelve CCM groups contributed data to the first round of CCMVal (CCMVal-1). The models are listed in Table 1 together with a reference for each model. We consider simulations of the last two decades of the 20th century based on forcings described in Eyring et al. (2006). These include observed sea surface temperature, sea ice concentrations, surface concentrations of well-mixed GHGs and halogens, solar variability, and aerosol from major volcanic eruptions. For all the models except MRI and SOCOL, outputs from the simulations performed in support of the WMO ozone assessment 2006 (WMO, 2007) are used.
For MRI we use data from an updated run with an improved transport scheme (see http://www.pa.op.dlr.de/CCMVal/CCMValErrataBADC.html). For SOCOL we use simulations from model version 2.0 described in Schraner et al. (2008). Compared with model version 1.0 used by Eyring et al. (2006), this version has an improved parameterization of stratospheric water vapour condensation, a more sophisticated heterogeneous chemistry scheme, and an improved transport scheme (Schraner et al., 2008). Here we mainly use ozone outputs from the first simulation of each model and restrict our attention to the period 1980-1999, for which outputs from all the models and sufficient observations are available. Additional simulations started from different initial conditions are available for SOCOL, MRI, and WACCM and are used to study the sensitivity of the results to sampling errors. In addition to the individual models we also consider ensemble-averaged ozone time series (MULTI).
Observational data sets used for model performance validation include total ozone and ozone profile data sets from several sources. The merged satellite total ozone data set (TOMS/SBUV) is based on individual Total Ozone Mapping Spectrometer (TOMS) and Solar Backscatter Ultraviolet 2 (SBUV/2) data sets (Stolarski and Frith, 2006). Another total ozone data set used in this study is that compiled by Karen Rosenlof from satellite (SME, SAGE-II, MLS, HALOE and TOMS/SBUV) and standard ozone climatology data (Dall'Amico et al., 2009). The Rosenlof data set also provides ozone profiles. Two other ozone profile data sets used here are those described by Randel and Wu (2007) and Hassler et al. (2009). The former (Randel) is based on a regression model fitted to SAGE 1 and 2 and ozonesonde profiles combined with a seasonally varying ozone climatology. Over the Antarctic region, which is of interest here, the model utilises only data from Syowa station, located at 69° S, and may not adequately represent the ozone field further south. Implications of this will be discussed below. The latter data set (Hassler) is based on satellite (SAGE 1 and 2, POAM 2 and 3, HALOE) and ozonesonde profiles. Owing to the lack of observations, the Rosenlof and the Hassler data sets also apply different techniques to fill in the gaps, which will be discussed in more detail below.

Method
To assess model performance we calculate a metric similar to that used by Reichler and Kim (2008) and Gleckler et al. (2008). First we calculate normalized root mean square (RMS) differences $e_{jkl}$ between the $j$-th model and the $k$-th reference data set for the $l$-th diagnostic

$$e_{jkl} = \sqrt{\frac{1}{W} \sum_{i} \sum_{m} w_{im} \frac{(x_{imjl} - y_{imkl})^2}{\sigma_{imkl}^2}} \qquad (1)$$

where $x_{imjl}$ is the simulated variable and $y_{imkl}$ is the observed variable at month $m$ and grid point $i$, $w_{im}$ is the weight assigned to each data point, $W = \sum_{i} \sum_{m} w_{im}$ is the sum of the individual weights, and $\sigma_{imkl}$ is a measure of the uncertainty in the observed variable $y_{imkl}$. In the following, the value $e_{jkl}$ will be referred to as the model error in the $l$-th diagnostic with respect to the $k$-th reference data set. Calculations of the weights and the observational uncertainty are described below in this section. Following Reichler and Kim (2008), we scale the errors in all diagnostics by the average error across the individual models to ensure that different diagnostics receive similar weights when calculating the combined metric of model performance

$$e^{*}_{jkl} = \frac{e_{jkl}}{\frac{1}{J} \sum_{j=1}^{J} e_{jkl}} \qquad (2)$$

where $J$ is the number of models. Model errors are calculated with respect to several available observation-based data sets in order to reduce the possible influence of biases in any one of them. However, the observational data sets are not completely independent, since they share some of the same input data and may therefore suffer from similar biases. We next average the model errors with respect to all available reference data sets for each diagnostic

$$e^{*}_{jl} = \frac{1}{K} \sum_{k=1}^{K} e^{*}_{jkl} \qquad (3)$$

where $K$ is the number of reference data sets. Finally, a model performance index $I_j$ is calculated as an average across errors in all individual diagnostics

$$I_j = \frac{1}{L} \sum_{l=1}^{L} e^{*}_{jl} \qquad (4)$$

where $L$ is the number of diagnostics. A lower value of $I$ indicates better overall agreement with the observations and is interpreted as better model performance.
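The chain of steps described above (normalized RMS error, scaling by the across-model mean, averaging over reference data sets, averaging over diagnostics) can be illustrated with a minimal Python sketch. It assumes that the normalized RMS error is the weighted mean of squared differences divided by the squared observational uncertainty, and that every entry passed in counts as an individual model when the across-model mean is formed. The function names and the nested-list layout are hypothetical and chosen only for illustration.

```python
import math


def normalized_rms_error(model, obs, sigma, weights):
    """Weighted RMS of (model - obs) / sigma over flattened data points."""
    total_weight = sum(weights)
    acc = sum(w * ((x - y) / s) ** 2
              for x, y, s, w in zip(model, obs, sigma, weights))
    return math.sqrt(acc / total_weight)


def performance_indices(errors):
    """Combine raw errors into one index per model.

    errors[l][k][j] is the raw error of model j with respect to
    reference data set k in diagnostic l (illustrative layout).
    """
    n_diag = len(errors)
    n_models = len(errors[0][0])
    index = [0.0] * n_models
    for per_ref in errors:                    # loop over diagnostics
        per_model = [0.0] * n_models
        for e_k in per_ref:                   # loop over reference data sets
            mean_e = sum(e_k) / len(e_k)      # average error across models
            for j, e in enumerate(e_k):
                # scale by the across-model mean, then average over references
                per_model[j] += (e / mean_e) / len(per_ref)
        for j in range(n_models):
            index[j] += per_model[j] / n_diag  # average over diagnostics
    return index
```

With this layout the indices of all models average to one by construction, so values below one indicate better-than-average agreement with the reference data.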
The choice of diagnostics and grading metric is inevitably subjective, which is a shortcoming of this approach (Connolley and Bracegirdle, 2007; Waugh and Eyring, 2008). The sensitivity of our results to the choice of grading metric and diagnostics will be discussed later. The diagnostics we employ are listed in Table 2. Two of them are (1) the total ozone climatology in the period 1980-1984, when the influence of ozone depletion was minimal, so that the models' ability to simulate the pre-ozone-hole climatology can be assessed; and (2) the total ozone linear trend during the period 1980-1999. A five-year period is somewhat short for calculating a climatology; however, a longer period is precluded by the absence of observations (and also some model data) before 1980 and by the increasing influence of ozone depletion after 1984. The influence of sampling errors on our results will be assessed by using additional simulations available for some models. Trends are calculated using linear least-squares regression of the ozone time series on time. We consider monthly zonal mean values and calculate model errors according to Eq. (1) by summation over months and latitudes. The weights w are the cosine of the latitude. While the total ozone climatology errors are calculated globally, the total ozone trend errors are only calculated over the SH. Ozone abundance in the NH extratropics during winter and early spring, when observed total ozone trends are largest, is strongly controlled by dynamics, in particular the Brewer-Dobson circulation. The latter experienced a significant change during the last two decades of the 20th century (e.g. Hu and Tung, 2002; Karpetchko and Nikulin, 2004) for reasons which are not completely understood (Hu et al., 2005). This leaves open the possibility that natural decadal variability, not related to external forcing, has contributed considerably to the NH trends. It is therefore not reasonable to expect that the models simulate the NH trends correctly.
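As an illustration of the two numerical ingredients named in this paragraph, the sketch below computes cosine-of-latitude weights and an ordinary least-squares trend. It is a minimal example under the assumption of evenly trusted data points; the function names are hypothetical, and the paper does not specify its implementation.

```python
import math


def cos_lat_weights(lats_deg):
    """Area weights proportional to the cosine of latitude (degrees in)."""
    return [math.cos(math.radians(lat)) for lat in lats_deg]


def linear_trend(times, values):
    """Slope of the ordinary least-squares regression of values on times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    return sxy / sxx
```

For a monthly diagnostic, one such trend would be computed per calendar month and latitude band, for example from the twenty October values of 1980-1999 at a given latitude.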
The simulation of realistic climate and climate trends depends not only on a correct simulation of total column ozone, but also on the vertical distribution of ozone. Therefore two additional diagnostics are considered here: (3) the monthly mean vertical ozone distribution climatology over the period 1980-1984 and (4) the monthly mean vertical ozone distribution trend over the period 1980-1999 at several pressure levels (see Table 2 for the list of pressure levels). As discussed in the Introduction, the largest influence of stratospheric ozone changes on climate in the 21st century is expected in the SH, associated with Antarctic ozone hole recovery. Therefore, to put more weight on model skill in simulating ozone over the Antarctic, we average vertical ozone distributions only over the SH polar cap (60-90° S) and weight the values by the annually averaged ozone profile, i.e. according to their contribution to the total ozone.
Total ozone from all the data sets is linearly interpolated onto a 5° latitude grid (87.5° S-87.5° N). Ozone profiles are interpolated linearly in the logarithm of pressure onto the pressure levels specified in Table 2. As a measure of the observational uncertainty σ, the standard error of the mean is used in the case of the ozone climatology, and the standard error of the slope parameter from a linear regression is used in the case of ozone trends. Measurement errors are only available for the Hassler data set. These are combined with the sampling errors by root mean squares, and the resulting σ are used for all three profile data sets.
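The interpolation and the two uncertainty measures mentioned here can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes independent samples (serial correlation, which the text does not discuss, is ignored), and all function names are hypothetical.

```python
import math


def interp_log_pressure(p_src, values, p_target):
    """Interpolate linearly in ln(pressure) between bracketing source levels."""
    x = math.log(p_target)
    for i in range(len(p_src) - 1):
        x0, x1 = math.log(p_src[i]), math.log(p_src[i + 1])
        if min(x0, x1) <= x <= max(x0, x1):
            frac = (x - x0) / (x1 - x0)
            return values[i] + frac * (values[i + 1] - values[i])
    raise ValueError("target pressure outside source range")


def standard_error_of_mean(samples):
    """Sample standard deviation divided by sqrt(n)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return math.sqrt(var / n)


def slope_standard_error(times, values):
    """Standard error of the OLS slope, using n - 2 degrees of freedom."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_mean - slope * t_mean
    resid_ss = sum((v - (intercept + slope * t)) ** 2
                   for t, v in zip(times, values))
    return math.sqrt(resid_ss / (n - 2) / sxx)
```

For a five-year climatology the standard error of the mean is based on only five values per month and location, which is one reason the choice of averaging period matters for the size of σ.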

Total ozone
Figure 1 shows the 5-yr total ozone climatology for the period 1980-1984 from the individual models, MULTI, TOMS/SBUV, and Rosenlof. This picture is similar to Fig. 14 of Eyring et al. (2006), except that they show a 20-year total ozone climatology (1980-1999) and the NIWA data set instead of the Rosenlof data set shown here. We also show an updated MRI simulation. All the models simulate familiar features of the ozone distribution, including the wintertime build-up in both hemispheres and the early stage of the Antarctic springtime ozone depletion. Figure 2 shows the relative errors $(x_{im} - y_{im})/\sigma_{im}$ of the models and TOMS/SBUV with respect to Rosenlof. Agreement between the two observational data sets is excellent, as might be expected since prior to 1985 Rosenlof employs only data from TOMS and SBUV. Figures 1 and 2 show that some models (MAECHAM4CHEM, MRI) strongly overestimate total ozone globally, while others (SOCOL, UMETRAC) strongly underestimate it in the extratropics. Some models (E39C, LMDZrepro) underestimate total ozone in SH mid- and high latitudes while overestimating it elsewhere. In many models the errors typically exceed 3σ and are therefore very unlikely to be explained by sampling variability associated with the particular 5-yr period chosen for comparison. Eyring et al. (2006) identified the causes of some model errors, such as the positive biases in the extratropics in some models, which are likely due to the simulated Brewer-Dobson circulation being too strong. However, in most cases the causes are not straightforward to identify. MULTI tends to overestimate total ozone, especially in the SH mid-latitudes, where several models show strong positive biases. In general there are no consistent biases in total ozone across the models.

The 20-yr linear trends in total column ozone are shown in Fig. 3. The largest negative trends according to the observations are in the SH high latitudes in November. All the models simulate a maximum negative trend in the SH high latitudes, but its timing varies between September and December. The magnitude of the trend also differs considerably between the models. The smallest simulated trend is only half of the observed trend, while the largest exceeds the observed trend by almost a factor of 2. Eyring et al. (2006) showed that the simulated ozone trends are consistent with the trends in stratospheric halogen loading. The largest trends in $\mathrm{Cl}_y$ are simulated by UMETRAC, while the smallest trends are simulated by E39C and SOCOL. Accordingly, these models simulate ozone trends that are too large and too small, respectively. According to Waugh and Eyring's (2008) [...] MULTI errors are everywhere within 3σ of the observed trends. Almost all the CCMVal models show too much ozone depletion in the tropics, but elsewhere biases are not consistent in sign amongst the models.
To test the sensitivity of our results to the trend period we calculated the trends for the period 1980-2001 for the observations and for those models for which the data are available. The observed trends for this period are typically smaller than those shown in Fig. 3; however, the error patterns do not change much and our conclusions are unaffected by these changes.

Vertical ozone distribution
The climatology of the vertical distribution of ozone is shown in Fig. 5. All three observational data sets exhibit an ozone minimum in October and a lifting of the ozone maximum layer in November-December, when the polar vortex breaks up and mid-latitude air is mixed into high latitudes. The majority of the models reproduce these features; however, some models (ULAQ, UMETRAC) simulate a comparable minimum at the end of summer, a feature typically observed in the NH seasonal cycle and attributed to ozone depletion by summertime NOx chemistry (Brühl et al., 1998). Excessive ozone simulated by MAECHAM4CHEM and MRI is apparent throughout the year. Several other models simulate too much ozone during the winter build-up period, which may indicate either a too strong Brewer-Dobson circulation, or weak isolation of the lower stratospheric polar vortex from mid-latitude ozone-rich air, or both.

Figure 6 shows model errors in the vertical ozone distribution climatology with respect to the Rosenlof data set. The differences between the observational data sets are striking. Randel typically has larger values than the two other data sets, especially in spring and summer. This may be because Randel comprises only data from Syowa station, located relatively close to the polar vortex edge (Randel and Wu, 2007). The polar vortex edge region is more influenced by mixing with mid-latitude ozone-rich air, while air from the vortex interior further south, impacted by chemical ozone depletion, remains more isolated. Reassuringly, both the Randel and the Hassler total errors are lower than those of the individual models and MULTI, although the differences between the data sets often exceed 3σ. The differences between the Rosenlof and the Hassler data sets are largest in the troposphere, where no satellite data are available and both data sets rely on a reconstruction to fill in the gaps. Rosenlof obtains tropospheric ozone as the difference between total ozone and stratospheric ozone (Dall'Amico et al., 2009), while Hassler calculates it as a regression fit to ozonesonde data, mainly available after 1986 (Hassler et al., 2008), using equivalent effective stratospheric chlorine, QBO, solar cycle, El Niño Southern Oscillation, and stratospheric aerosol loading resulting from volcanic eruptions (Hassler et al., 2009). The large differences in the stratosphere during winter, when satellite coverage of high latitudes is limited, are also most probably related to the differences in the reconstruction techniques. Almost all the models show lower values than Rosenlof in the upper troposphere and in the lower stratosphere below 150 hPa throughout the year. The differences in the troposphere with respect to the Hassler data set are smaller (Fig. 7) and mainly restricted to 200-300 hPa, with model values below this typically being higher than in the Hassler data set. In some models (E39C, ULAQ, UMETRAC) the lower values near the tropopause arise because the ozonopause is too high. Above 100 hPa models typically simulate higher ozone values than observed, particularly during the winter build-up period and above 50 hPa during summer, presumably due to a more vigorous exchange with mid-latitudes. Model errors with respect to both observational data sets typically exceed 3σ. MULTI shows the lowest total error among the models with respect to Rosenlof but not with respect to Hassler.
Vertical ozone distribution trends are shown in Fig. 8. The trends in the observational data sets differ considerably from each other, and the differences between them are comparable to the differences between the observations and the models. The maximum negative trend in the Rosenlof data set is only 60% of that in the Hassler data set and lags it by two months. The differences between the time series arise largely after 1990 and are therefore attributable to the different data sources rather than to the methods used to construct the data sets.
Figures 9 and 10 show trend errors with respect to the Rosenlof and Hassler data sets, respectively. The majority of the models underestimate the springtime depletion compared with the Hassler data set but not with the Rosenlof data set. Several models simulate too strong ozone depletion below 100 hPa in summer compared with the three observational data sets. In some models (LMDZrepro, WACCM) this may be a result of a delayed polar vortex break-up (Eyring et al., 2006). MULTI and several individual models show better agreement with the Rosenlof data set (and also with the Hassler data set) than the two other observational data sets do.
To investigate whether the choice of reference data set has a large impact on the model ranking, we calculate Spearman's rank correlation coefficient between total model errors with respect to the three observational data sets. There is a high correlation between the model errors in ozone profile trends with respect to the Rosenlof and Randel data sets (r=0.93); however, the model errors with respect to the Hassler data set are poorly correlated with those with respect to either the Rosenlof (r=0.41) or the Randel (r=0.56) data sets. Similar tests performed for the model errors in the ozone profile climatology showed a high correlation between the errors with respect to the Rosenlof and Hassler data sets (r=0.93) but lower correlations between the model errors with respect to the Randel data set and either the Rosenlof (r=0.60) or the Hassler (r=0.54) data sets. Model errors in the total ozone climatology and trends were very similar with respect to both data sets (r>0.9).
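For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the ranks of two series. The sketch below is a minimal pure-Python version that assigns average ranks to ties; the function names are hypothetical, and the input lists stand for total model errors with respect to two reference data sets.

```python
import math


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    a_mean = sum(a) / n
    b_mean = sum(b) / n
    cov = sum((x - a_mean) * (y - b_mean) for x, y in zip(a, b))
    var_a = sum((x - a_mean) ** 2 for x in a)
    var_b = sum((y - b_mean) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))
```

Because only the ordering of the errors enters, a value near one means the two reference data sets rank the models almost identically, regardless of how large the errors are in absolute terms.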

Combined errors
Performance indices calculated using Eq. (4) are shown in Fig. 11a together with errors for the individual diagnostics. Comparing individual models shows that no one model performs better than the others in all diagnostics; however, some models have generally low errors while others have generally high errors. MULTI does not perform better than the individual models in all diagnostics; however, its combined error is the lowest. Comparing Figs. 2, 6, 7 with Figs. 4, 9, 10, one can see that the trend errors are typically smaller than the climatology errors. However, since errors are normalised by the uncertainty σ, we caution against the interpretation that the models simulate trends better than the climatology. To make sure that our diagnostics do not duplicate each other, we have correlated model errors in different diagnostics. The correlation between model errors in total ozone climatology and ozone profile climatology is found to be the only one significant at the 5% level (r=0.7), with the other correlation coefficients being less than 0.52. Therefore our diagnostics provide complementary information on different aspects of model performance.
To study the sensitivity of the results to sampling errors we used additional runs available for SOCOL, MRI, and WACCM. The calculations were repeated for two additional simulations for each of these models. The resulting changes in performance indices for these models are shown in Fig. 11a. They are up to 15% (about 0.1 in absolute units), suggesting that smaller differences in performance index between models may be insignificant. In terms of ranking, these changes shifted model ranks by 1-2 positions.

Alternative grading
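It was mentioned in Sect. 3 that another set of diagnostics may give different results. So far we have considered diagnostics constructed by spatial (i.e. zonal) and temporal (i.e. monthly) averaging of variables. Waugh and Eyring (2008) used even wider seasonal averaging. One possible way to construct a set of diagnostics alternative to our original one is to use annual or global averaging. Gleckler et al. (2008) compared model errors in annual mean variables with annual mean errors in monthly mean variables. In the latter case model errors in simulating the seasonal cycle are accounted for. For some variables they found large differences between the two ways of calculating errors.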
To test the sensitivity of our ranking to the choice of diagnostics we consider an alternative set consisting of the annual mean globally averaged total ozone climatology and trend, and the annual mean ozone partial pressure climatology and trend at 50 hPa averaged over 60-90° S. The 50 hPa level is chosen because it is close to the ozone maximum. The periods for climatology and trend calculations are the same as in the original diagnostics. Instead of applying Eqs. (1)-(4) for calculation of a model performance metric we use the same index as Waugh and Eyring (2008)

$$g_{jkl} = 1 - \frac{1}{n} \frac{|x_{jl} - y_{kl}|}{\sigma_{kl}} \qquad (5)$$

where $x_{jl}$ is the simulated variable and $y_{kl}$ is the observed variable, $\sigma_{kl}$ is the standard deviation of the observed variable $y_{kl}$, $n$ is a scaling factor, and $g_{jkl}$ is the grade of the $j$-th model in the $l$-th diagnostic with respect to the $k$-th reference data set. Similarly to Waugh and Eyring (2008), we choose n=3 and set negative values of g, wherever they are obtained, to zero. The model grade for an individual diagnostic $g_{jl}$ is calculated as an average over model grades with respect to all reference data sets, and the overall model grade $g_j$ is calculated by averaging over the grades for the individual diagnostics. The maximum possible grade is 1. A zero grade means that the model differs from the observation by 3σ or more. Model performance in this simplified set of diagnostics is shown in Fig. 11b. As in the original test, the models get higher grades in the trend diagnostics than in the climatology diagnostics. All but two models get zero grades in the total ozone climatology. In the other cases model diagnostics are generally within 3σ of the observed diagnostics, as suggested by the non-zero grades. This is in contrast with the original diagnostics discussed in Sects. 4.1-4.3, where typical model errors exceed 3σ. The reason is that errors of different sign cancel each other when averaged annually.
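The grading rule described in this paragraph (a linear penalty in units of the observational standard deviation, clipped at zero, then averaged over reference data sets and diagnostics) can be written as a short Python sketch. The function names and the nested-list layout are hypothetical; only n=3 and the clipping at zero are taken from the text.

```python
def grade(model_value, obs_value, obs_sigma, n=3.0):
    """Grade in [0, 1]: 1 for a perfect match, 0 at n sigma or more."""
    g = 1.0 - abs(model_value - obs_value) / (n * obs_sigma)
    return max(g, 0.0)  # negative grades are set to zero


def overall_grade(grades):
    """Average one model's grades over references, then over diagnostics.

    grades[l][k] is the grade in diagnostic l with respect to
    reference data set k (illustrative layout).
    """
    per_diag = [sum(g_k) / len(g_k) for g_k in grades]
    return sum(per_diag) / len(per_diag)
```

For example, with n=3 a simulated value that differs from the observed one by 1.5σ receives a grade of 0.5, and any difference of 3σ or more receives zero, so the grade cannot distinguish between models once they fall outside the 3σ interval.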
Models having the largest combined errors in the original diagnostics get the lowest grades, but the correlation between the model errors obtained by the two methods is only 0.49 and there are considerable changes in the ranking. In particular, MULTI does not get the highest grade. Since the original diagnostics include a more sophisticated comparison, i.e. consideration of the annual cycle, the differences between model rankings suggest that consideration of the mean annual values may not be sufficient for objective model assessments.
We will now see how our ranking agrees with the results of Eyring et al. (2006) and Waugh and Eyring (2008). For this comparison we use the same simulations as those used by them. Eyring et al. (2006) highlighted 6 out of 13 models which agree better with the observations based on analysis of several transport, temperature and chemistry diagnostics. The five individual models with the best performance index (lowest I) according to our original analysis are among the six models highlighted by Eyring et al. (2006), while the four models with the worst performance index are among the seven non-highlighted models. Taking into account that margins between the models with intermediate performance index are small compared to the overall spread of the errors, we conclude that our ranking is in reasonable agreement with the results of Eyring et al. (2006). Waugh and Eyring (2008) evaluated a subset of key processes important for stratospheric ozone, with the focus mainly on diagnostics to evaluate transport and dynamics in the CCMs. Spearman's rank correlation coefficient between our model ranking and theirs is 0.59. Waugh and Eyring (2008) also provide model grades based separately on transport diagnostics or polar dynamics diagnostics. The agreement with our ranking gets worse if we consider either grades based on transport diagnostics (r=0.44) or grades based on polar dynamics diagnostics (r=0.54). In the case of polar dynamics diagnostics the most considerable difference is that WACCM, which has the best performance index among the individual models in our analysis, gets a low grade in the polar dynamics diagnostics, specifically in the zonal wind diagnostic (V. Eyring, personal communication). The differences between our grades arise partly because some of the diagnostics considered in Waugh and Eyring (2008), in particular the tropical diagnostics, may not be important for polar ozone, which is given large weight in our analysis. On the other hand, some important diagnostics (e.g. those related to chemical processes) may not be considered in their study. Waugh and Eyring (2008) found that, in many diagnostics, MULTI does not get better grades than the best individual models. This is because significant biases in these diagnostics are shared by many but not all of the models. We also found that MULTI does not have the smallest error in all the diagnostics. However, the overall performance of MULTI according to the original test is considerably better than that of any individual model. This result also holds if available alternative model realisations are used (see Sect. 4.3), and is also robust to small modifications in the diagnostics, which include restricting the domain to above 200 hPa or weighting the ozone profile errors according to the mass or geometric thickness of the corresponding layer. Such diagnostic modifications do, however, result in changes in the performance index of the individual models of up to 6% (about 0.05 in absolute units), typically changing model ranks by no more than one position. Therefore, assuming that the ability of the models to simulate observed ozone climatology and trends is a reliable indicator of their ability to simulate future ozone, we conclude that the multi-model average of ozone projections appears to be the best choice as a future ozone scenario for use in climate model simulations that require prescribed ozone fields. Waugh and Eyring (2008) also noted that weighting the model results according to their performance does not significantly change the multi-model average projection.

Conclusions
The goal of this study is to provide the climate modelling community with some recommendations regarding the choice of future ozone scenario for implementation in climate simulations. We have validated the abilities of twelve CCMs to simulate the observed total ozone climatology and trends, as well as the Antarctic ozone profile climatology and trends, and ranked the models according to their errors averaged across four chosen diagnostics. No one model performs better than the others in all four diagnostics; how- [...] et al. (2006) and Waugh and Eyring (2008). Rather good agreement is found with the results of Eyring et al. (2006), who only separated the models into two groups, with models in one group being generally in better agreement with the observations than models in the other group. However, comparison with the other model ranking, based on evaluation of dynamics and transport in the models, showed only modest correlation, probably because the processes considered by Waugh and Eyring (2008) are not necessarily those important for polar ozone, which is given large weight in our study. This comparison shows that the selection of diagnostics for process-oriented CCM validation remains a challenging task and likely depends on the region of interest.
We also performed a sensitivity test which showed that an alternative grading based on simplified ozone diagnostics results in a considerably different ranking.Therefore we argue that the multi-model averaged projection, which is less sensitive to individual model biases, provides the best estimate of future ozone.
In order to reduce the influence of possible biases in ozone observations, more than one up-to-date observational data set has been used in this evaluation, and the true model error is estimated as an average across errors with respect to individual observational data sets. While diagnostics based on total ozone are found to be insensitive to the choice of observational data set, partly because they share the same data sources, the evaluation of simulated Antarctic ozone profiles provides significantly different results depending on which data set is used as a reference. In the ozone profile trend diagnostic, differences between the observations are found to be comparable to or even to exceed model errors. In particular, the maximum negative trend in the Rosenlof data set in the Antarctic lower stratosphere during spring is only 60% of that in the Hassler data set, presumably due to the different data sources employed in the data sets. This result stresses the need for the compilation of a unified reliable vertically resolved ozone [...]

[...], levels: 500, 400, 300, 250, 200, 150, 130, 115, 100, 90, 80, 70, 50, 30, 20, 15, 10 hPa
Vertical ozone distribution trend: Ozone partial pressure profile linear trend between 1980-1999, monthly mean values averaged over 90-60° S, levels: 500, 400, 300, 250, 200, 150, 130, 115, 100, 90, 80, 70, 50, 30, 20, 15, 10

Figure 2 shows models and TOMS/SBUV relative errors $(x_{im} - y_{im})/\sigma_{im}$ with respect to Rosenlof. Agreement between the two observational

Figure 6 shows model errors in vertical ozone distribution climatology with respect to the Rosenlof data set. The differences between the observational data sets are striking. Randel has typically larger values than the two other data sets especially in

) are shown in Fig. 11a together with errors for the individual diagnostics. Comparing individual models shows that no one model

used even wider seasonal averaging. One possible way to construct a set of diagnostics alternative to our original one is to use annual or global averaging. Gleckler et al. (2008) compared model errors in annual mean variables with annual mean errors in monthly mean variables. In the latter case model errors in simulating the seasonal cycle are accounted for. For some

Fig. 2. Normalised errors with respect to the Rosenlof data set in total ozone climatologies shown in Fig. 1. Numbers next to data set names indicate area-weighted globally averaged errors normalised by the average error across the individual models according to Eq. (2).
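As a rough illustration of what an area-weighted average of normalised errors can look like, the sketch below weights a latitude-by-month field of relative errors (model minus observation, divided by an uncertainty) by the cosine of latitude. The function names, the grid, and the numbers are hypothetical, and the root-mean-square aggregate is an assumption made for this example only; it is not claimed to reproduce Eq. (2).

```python
import numpy as np

def relative_error(model, obs, sigma):
    """Model-minus-observation difference in units of the uncertainty sigma."""
    return (np.asarray(model, dtype=float) - obs) / sigma

def area_weighted_rms_error(model, obs, sigma, lat_deg):
    """Aggregate relative errors on a (n_lat, n_month) grid.

    Latitude bands are weighted by cos(latitude) as a proxy for their area;
    months are weighted equally. RMS aggregation is an illustrative choice.
    """
    err2 = relative_error(model, obs, sigma) ** 2
    weights = np.cos(np.deg2rad(lat_deg))
    per_month = np.average(err2, axis=0, weights=weights)
    return float(np.sqrt(per_month.mean()))

# Hypothetical total ozone fields in Dobson units on a 3-latitude, 2-month grid.
lat_deg = np.array([-60.0, 0.0, 60.0])
obs = np.full((3, 2), 300.0)
sigma = np.full((3, 2), 10.0)
model = obs + np.array([[10.0, 10.0], [20.0, 20.0], [10.0, 10.0]])

total_error = area_weighted_rms_error(model, obs, sigma, lat_deg)
```

Because the tropical band covers more area than the high-latitude bands, its larger error dominates the aggregate in this example.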

Fig. 4. Normalised errors with respect to the Rosenlof data set in total ozone trends shown in Fig. 3. Numbers next to data set names indicate area-weighted hemisphere-averaged errors normalised by the average error across the individual models according to Eq. (2).

Fig. 6. Normalised errors with respect to the Rosenlof data set in vertical ozone distribution climatology shown in Fig. 5. Numbers next to data set names indicate domain-averaged errors normalised by the average error across the individual models according to Eq. (2).
SBUV errors are small, indicating consistency between TOMS/SBUV and the other satellites (SAGE-II, MLS, HALOE) employed in the Rosenlof data set after 1985. Models that overestimate the magnitude of the trends typically show the largest errors. As a result, MULTI trends are biased negative. However, the MULTI total error with respect to Rosenlof (and also to TOMS/SBUV) is smaller than in any individual model.
assessment UMETRAC has high grades in simulating polar Cl$_y$ while E39C and SOCOL have low grades. Note that this interpretation may not be applicable to the newer version of SOCOL used here. The Cl$_y$ trends may be somewhat different in this model version. Figure 4 shows relative errors with respect to the Rosenlof data set. The

in individual diagnostics into one metric of model performance allowed us to objectively rank the models. The highest rank is obtained by the multimodel ensemble average. The model ranks obtained are further compared with those made earlier by Eyring

Our assessment has at least two practical applications. First, modellers can make a choice of future ozone scenario based on the objective evaluation. Second, ozone simulations by future model generations can be validated in the same way as done here and model improvements can be quantitatively assessed.

Scientific Assessment of Ozone Depletion: 2006, World Meteorological Organization, Global Ozone Research and Monitoring Project, Report No. 50, Geneva, Switzerland, 2007.

Zhou, S. T., Gelman, M. E., Miller, A. J., and McCormack, J. P.: An inter-hemisphere comparison of the persistent stratospheric polar vortex, Geophys. Res. Lett., 27, 1123-1126, 2000.

Table 1. CCMs used in this study.

Table 2. Diagnostics used in this study.