Calculating a multi-model mean, a commonly used method for ensemble averaging, assumes model independence and equal model skill. Sharing of model components amongst families of models and research centres, compounded by growing ensemble size, means model independence cannot be assumed and is hard to quantify. We present a methodology to produce a weighted-model ensemble projection, accounting for model performance and model independence. Model weights are calculated by comparing model hindcasts to a selection of metrics chosen for their physical relevance to the process or phenomenon of interest. This weighting methodology is applied to the Chemistry–Climate Model Initiative (CCMI) ensemble to investigate Antarctic ozone depletion and subsequent recovery. The weighted mean projects an ozone recovery to 1980 levels by 2056, with a 95 % confidence interval of 2052–2060, 4 years earlier than the most recent study. Perfect-model testing and out-of-sample testing validate the results and show a greater projective skill than a standard multi-model mean. The construction of a weighted mean also provides insight into model performance and dependence between the models. This weighting methodology is robust to both model and metric choices and therefore has potential applications throughout the climate and chemistry–climate modelling communities.

Global chemistry–climate models (CCMs) are the most comprehensive tools to investigate how the global composition of the atmosphere develops, both naturally and under anthropogenic influence

Many years of scientific studies and assessments have tied stratospheric-ozone depletion to the anthropogenic emission and subsequent photochemistry of halogen-containing gases, such as chlorofluorocarbons (CFCs), hydrochlorofluorocarbons (HCFCs) and halons

The recovery process is slow due to the long atmospheric lifetimes of ozone-depleting substances (ODSs) and could be hampered by releases of ODSs not controlled by the Montreal Protocol, such as short-lived halogens

The assessment of when the ozone layer will recover is conducted using an ensemble of chemistry–climate models, forced by past and projected future emissions of ozone-depleting substances (ODSs) and climate forcers

Given these issues, there is currently no consensus on how best to combine model output when analysing an ensemble. Probably the simplest and most widely used approach is to take a multi-model mean in which each model contributes equally, and indeed it has been established that an ensemble mean performs better than any single model

Quantifying model skill (or performance) against comparable observations forms an important part of the validation and analysis of multi-model ensembles

In this study we develop a weighting methodology, originally presented by

In this study, we develop and exploit a framework to calculate model weights based on recent work in the physical-climate-science community

The numerator captures the closeness of the model to observations.
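The weighting equation itself is not reproduced in this text, so the following Python sketch assumes a Gaussian-kernel form of the kind used in the climate-model weighting literature: a numerator that rewards closeness to observations, divided by a term that grows with similarity to other models. The function name, the argument names `sigma_d` and `sigma_s`, and the exact functional form are illustrative assumptions and not necessarily the authors' implementation.

```python
import math

def performance_independence_weights(dist_to_obs, dist_between, sigma_d, sigma_s):
    """Return normalised weights for one metric (assumed functional form).

    dist_to_obs[i]     : distance of model i from the observations (>= 0)
    dist_between[i][j] : distance between models i and j (symmetric)
    sigma_d, sigma_s   : hypothetical scaling parameters for skill and similarity
    """
    n = len(dist_to_obs)
    raw = []
    for i in range(n):
        # Numerator: close to 1 when the model is close to the observations.
        skill = math.exp(-(dist_to_obs[i] / sigma_d) ** 2)
        # Denominator: grows by up to 1 for each near-identical other model.
        similarity = 1.0 + sum(
            math.exp(-(dist_between[i][j] / sigma_s) ** 2)
            for j in range(n) if j != i
        )
        raw.append(skill / similarity)
    total = sum(raw)
    return [w / total for w in raw]

# Toy ensemble (made-up numbers): models 0 and 1 are near-duplicates,
# model 2 is independent; all three are equally close to the observations.
d_obs = [0.2, 0.2, 0.2]
d_pair = [[0.0, 0.01, 1.0],
          [0.01, 0.0, 1.0],
          [1.0, 1.0, 0.0]]
weights = performance_independence_weights(d_obs, d_pair, sigma_d=0.5, sigma_s=0.3)
```

In this toy case the two near-duplicate models share roughly the weight that the single independent model receives alone, even though all three have the same skill.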

Investigating and evaluating a phenomenon or complex process often relies on identifying multiple metrics since it can only be partially expressed by any single variable. Expert understanding of the physical process is needed to select a set of relevant metrics with which to develop the process-based weighting. Including multiple metrics, provided they are not highly correlated, has the further benefit of giving less weight to models which perform well but do so for the wrong reasons. In this framework, ensuring that these metrics influence the weighting proportionally is done by normalising the model data using a min–max scaling between
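As an illustration of the min–max scaling step, the sketch below rescales each metric so that metrics with different units contribute on a comparable footing. The target range is cut off in this text; the default of 0 to 1 used here is an assumption, as are the function name and the made-up input values.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Linearly rescale values so the smallest maps to lower and the largest to upper.

    The [0, 1] default range is an assumption made for this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant metric carries no information; map it to the midpoint.
        return [(lower + upper) / 2.0 for _ in values]
    return [lower + (v - lo) * (upper - lower) / (hi - lo) for v in values]

# Two metrics with very different units and magnitudes (made-up values).
temperature_metric = [205.0, 210.0, 215.0]   # K
ozone_trend_metric = [-4.0, -2.0, 0.0]       # DU per year
scaled_t = min_max_scale(temperature_metric)
scaled_o = min_max_scale(ozone_trend_metric)
```

After scaling, both metrics span the same range, so neither dominates the combined weighting simply because of its units.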

When combining multiple metrics into a weighting, the weight of the

We take the combined weights for each model and apply them to our parameter or process of interest (the evolution of stratospheric ozone here). As with the metrics this parameter need not be a time series and could be a spatial distribution or a single measure. The weighted projection is therefore
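A minimal sketch of these two steps follows, assuming (as stated later in the text) that the total weight of a model is the mean of its per-metric weights and that the weighted projection is a linear combination of the model series. Function names and all numbers are hypothetical.

```python
def combine_metric_weights(weights_per_metric):
    """Average each model's weight over the metrics, then renormalise to sum to 1."""
    n_metrics = len(weights_per_metric)
    n_models = len(weights_per_metric[0])
    mean_w = [sum(m[i] for m in weights_per_metric) / n_metrics
              for i in range(n_models)]
    total = sum(mean_w)
    return [w / total for w in mean_w]

def weighted_projection(weights, model_series):
    """Weighted mean across models at every time step."""
    n_times = len(model_series[0])
    return [sum(w * s[t] for w, s in zip(weights, model_series))
            for t in range(n_times)]

# Two metrics, three models (made-up weights; each row sums to 1).
weights_per_metric = [[0.5, 0.3, 0.2],
                      [0.3, 0.3, 0.4]]
# One total-column-ozone series per model, in DU (made-up values).
ozone = [[300.0, 250.0, 280.0],
         [310.0, 240.0, 300.0],
         [290.0, 260.0, 290.0]]

total_weights = combine_metric_weights(weights_per_metric)
wm = weighted_projection(total_weights, ozone)          # weighted mean
mmm = weighted_projection([1.0 / 3.0] * 3, ozone)       # multi-model mean
```

The multi-model mean is the special case in which every model receives the same weight, which makes the two projections directly comparable.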

The two scaling parameters (

As noted in

Panel

We demonstrate the applicability of this weighting framework by applying it to the important and well-understood phenomenon of the Antarctic stratospheric ozone hole, for which we can use several decades of suitable observations to weight the models. Below, we describe the model and observation data used and the metrics selected, against which we measure model performance and independence.

CCM output was taken from the simulations conducted under Phase 1 of the Chemistry–Climate Model Initiative (CCMI) (

The CCMI model simulations used in this analysis and their key references.

We focus on two sets of simulations, called refC1SD and refC2, and for the weighting analysis we only consider models which ran both simulations. Table

The refC2 simulations cover 1960–2100 and are used to construct weighted projections of Antarctic ozone recovery, using the weights calculated from refC1SD. The forcing from GHGs and anthropogenic emissions follows the historical scenario conditions prescribed for phase 5 of the Coupled Model Intercomparison Project (CMIP5)

The observational products and respective variables used to construct metrics on which to weight the models. MSU: Microwave Sounding Unit. NIWA-BS: National Institute of Water and Atmospheric Research – Bodeker Scientific. GOZCARDS: Global OZone Chemistry And Related trace gas Data records for the Stratosphere. ERA-Interim: European Centre for Medium-Range Weather Forecasts – ECMWF – Reanalysis.

Model performance was evaluated against a series of well-accepted metrics (see below), drawing from widely used observational and reanalysis datasets listed in Table

Like many ozone recovery studies, we utilise TSAM (time series additive modelling)

The first step in the weighting process is to identify the most relevant processes that affect Antarctic ozone depletion to allow for appropriate metric choice. Suitable metrics require adequate observational coverage and for the models to have output the corresponding variables. The metrics we chose are as follows:

This is the first derivative with respect to time of the total ozone column. Given the discontinuity in the total ozone column record, the years 1992–1996 are excluded. It is a southern polar cap (60–90

The lower-stratospheric temperatures for all of the models are constructed using the MSU TLS-weighting function

This is the first derivative with respect to time of the lower-stratospheric temperature found above.

The vortex breakdown date is calculated as when the zonal mean wind at 60

Both the lower-stratospheric temperature and the total ozone column are separately averaged over 60 to 82.5

This is similar to the metric above except that we first calculated the time derivative of the total ozone column and temperature polar time series before calculating the linear relationship. The gradient of the linear relationship is the total-ozone-column-trend–temperature-trend gradient metric.
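The two gradient metrics can be sketched as an ordinary least-squares slope, applied either to the polar time series directly or to their time derivatives. The sketch assumes annual spacing, approximates the derivative by first differences and regresses ozone on temperature; these choices, the function names and the numbers are illustrative assumptions.

```python
def time_derivative(series):
    """First differences, assuming one value per year (an approximation)."""
    return [b - a for a, b in zip(series, series[1:])]

def regression_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Made-up polar-cap series: ozone is an exact linear function of temperature.
temperature = [200.0, 202.0, 201.0, 205.0]          # K
ozone = [3.0 * t - 350.0 for t in temperature]      # DU

# Total-ozone-column–temperature gradient.
ozone_temperature_gradient = regression_slope(temperature, ozone)

# Total-ozone-column-trend–temperature-trend gradient.
trend_gradient = regression_slope(time_derivative(temperature),
                                  time_derivative(ozone))
```

The same slope would be computed for each model and for the observations, and the difference between the two enters the performance part of the weighting.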

The hydrogen chloride concentration was averaged over the austral spring months and over the southern polar cap, for areas which have observational coverage. We consider a pressure range of 316 to 15

These metrics capture two of the main features of ozone depletion, namely (1) the decrease in temperature over the poles caused by the depletion of ozone and (2) the breakdown of the vortex, which plays a major role in isolating the ozone-depleted air mass. The chlorine metric encapsulates the anthropogenic release of ODSs and the main chemical driver of ozone depletion. Ozone–temperature metrics allow us to look at model success in reproducing the temperature dependency in ozone reaction rates and stratospheric structure. By looking at the instantaneous rate of change as well as the overall trends, we can gather a picture of both short-term and long-term changes for a range of chemical and dynamical processes.

The metrics are not highly correlated, except for the total ozone column gradient and the lower-stratospheric temperature gradient, which are correlated because of the strong coupling of ozone and temperature in the stratosphere

Two types of testing were used to investigate the usefulness of the weighted prediction and to validate metric choices. Firstly, we performed a simple out-of-sample test on the weighted prediction against the total ozone column observations from NIWA-BS. Although the weights are generated from comparison between the specified-dynamics runs (refC1SD) and observations, it does not necessarily follow that the weighted projection created using the free-running (refC2) runs will be a good fit for the observations. To test this, we compared the refC2 multi-model mean and weighted projection to the observations. Due to the large inter-annual variability in the total column ozone (TCO) observations, we do not expect the weighted average to be a perfect match; after all, free-running models are not designed to replicate the past. However, we need to test the level of agreement between the weighted mean and the observations for an out-of-sample period (2010–2016). This serves a secondary purpose of determining transitivity between the two model scenarios used, i.e. that the weightings found from refC1SD apply to refC2.

Secondly, we used a perfect-model test (also known as model-as-truth or a pseudo-model test) to determine whether our weighting methodology is producing valid and robust projections. In turn, each model is taken as the pseudo truth and weightings are found in the same way as described in Sect.

Perfect-model testing also allows us to test transitivity between scenarios, since, unlike with the obvious temporal limit on observations, the pseudo truth exists in both the hindcast and forecast. If a weighting strategy produces weighted means which are closer to the pseudo truth than a multi-model mean, then we can have some confidence that we can apply a weighting across model scenarios.
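The structure of the perfect-model test can be sketched as a loop in which each model in turn supplies the pseudo truth, weights are derived from the hindcast period, and the weighted mean (WM) and multi-model mean (MMM) are scored against the pseudo truth over the forecast period. For brevity this sketch uses performance-only weights with a hypothetical scaling parameter `sigma`; the independence term, the helper names and the toy numbers are assumptions and do not reproduce the full methodology.

```python
import math

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def weighted_mean(weights, series):
    return [sum(w * s[t] for w, s in zip(weights, series))
            for t in range(len(series[0]))]

def skill_weights(truth, hindcasts, sigma):
    # Performance-only weights: a simplified stand-in for the full weighting.
    raw = [math.exp(-(rmse(h, truth) / sigma) ** 2) for h in hindcasts]
    total = sum(raw)
    return [r / total for r in raw]

def perfect_model_test(hindcasts, forecasts, sigma):
    """For each pseudo truth, return RMSE(MMM) - RMSE(WM) over the forecast."""
    improvements = []
    for k in range(len(hindcasts)):
        others = [i for i in range(len(hindcasts)) if i != k]
        hind = [hindcasts[i] for i in others]
        fore = [forecasts[i] for i in others]
        # Weights come from the hindcast period only.
        w = skill_weights(hindcasts[k], hind, sigma)
        wm = weighted_mean(w, fore)
        mmm = weighted_mean([1.0 / len(fore)] * len(fore), fore)
        # Positive values mean the WM is closer to the pseudo truth.
        improvements.append(rmse(mmm, forecasts[k]) - rmse(wm, forecasts[k]))
    return improvements

# Toy ensemble (made-up values): models 0 and 1 behave alike, model 2 differs.
hindcasts = [[1.0, 1.0], [1.1, 1.1], [3.0, 3.0]]
forecasts = [[2.0, 2.0], [2.1, 2.1], [4.0, 4.0]]
improvements = perfect_model_test(hindcasts, forecasts, sigma=0.5)
mean_improvement = sum(improvements) / len(improvements)
```

Because the pseudo truth exists in both periods, a positive mean improvement indicates that weights derived from the hindcast carry over to the forecast in this toy setting.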

Antarctic (60–90

Figure

The WM predicts a return to 1980 TCO levels by 2056 with a 95 % confidence interval (2052–2060). For comparison the recovery dates presented in D18 were 2062 with a

The confidence interval for recovery dates is formed from the predictive uncertainty in the WM from the TSAM (for which the 95 % confidence interval is 2054–2059) and the uncertainty associated with the weighting process. Choices made about which models and metrics to include influence the return dates and therefore introduce uncertainty. This is similar to the concept of an “ensemble of opportunity”, which is that only modelling centres with the time, resources or interest take part in certain model ensembles. To quantify this uncertainty, we performed a dropout test where a model and a metric were systematically left out of the recovery date calculation. This was done for all combinations of models (
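The dropout test can be sketched as follows: one model and one metric are removed at a time, the remaining weights are recombined, and the return date is recomputed. This simplified version reuses precomputed per-metric weights and renormalises them, whereas a full implementation would recalculate the weights for each reduced ensemble; the return-date definition (first year after the minimum at which the series regains its 1980 value), the function names and all numbers are assumptions for illustration.

```python
def return_year(years, series, reference_year=1980):
    """First year after the series minimum at which the reference value is regained."""
    ref = series[years.index(reference_year)]
    i_min = series.index(min(series))
    for year, value in zip(years[i_min:], series[i_min:]):
        if value >= ref:
            return year
    return None  # no recovery within the series

def dropout_return_years(years, model_series, weights_per_metric):
    """Return dates with every (model, metric) pair left out in turn."""
    n_metrics, n_models = len(weights_per_metric), len(model_series)
    results = []
    for drop_model in range(n_models):
        for drop_metric in range(n_metrics):
            keep = [i for i in range(n_models) if i != drop_model]
            metrics = [m for k, m in enumerate(weights_per_metric)
                       if k != drop_metric]
            # Mean over the remaining metrics, renormalised over remaining models.
            w = [sum(m[i] for m in metrics) / len(metrics) for i in keep]
            total = sum(w)
            wm = [sum(w[j] / total * model_series[i][t]
                      for j, i in enumerate(keep))
                  for t in range(len(years))]
            results.append(return_year(years, wm))
    return results

# Made-up ensemble: three models, two metrics, ozone in DU.
years = [1980, 2000, 2020, 2040, 2060, 2080]
model_series = [[300.0, 200.0, 230.0, 270.0, 305.0, 320.0],
                [300.0, 210.0, 250.0, 300.0, 320.0, 330.0],
                [300.0, 190.0, 220.0, 260.0, 290.0, 310.0]]
weights_per_metric = [[0.5, 0.3, 0.2],
                      [0.2, 0.4, 0.4]]
dropout_years = dropout_return_years(years, model_series, weights_per_metric)
```

The spread of `dropout_years` gives a rough measure of how sensitive the return date is to the choice of models and metrics.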

Figure

Due to the nudging of temperature that takes place in most of the specified-dynamics simulations, we would expect stratospheric temperatures to be reasonably well simulated. However, variation exists in nudging methods in addition to inter-model differences, and this leads to part of the variability in weights

The total weighting, formed from the mean of individual metric weights per model, is largely influenced by CNRM-CM5-3, which has a weight of 0.27 (297 % of the value of a uniform weighting). The CNRM-CM5-3 simulations are more successful at simulating metrics whilst being reasonably independent from other models, leading to a weight with greater prominence than the other models. This does not mean that CNRM-CM5-3 is the most skilful model. For example, if two nearly identical models had the highest performance, their final weights would be much lower, as they would be down-weighted for their similarity. All models contribute towards the weighted ensemble mean, which provides confidence that our weighting methodology is not over-tuned and returning model weights of zero. The lowest total model weight is 45 % of the value of a uniform weighting.

Model weights for each of the seven metrics are all shown in blue. The weights account for both performance and independence and are found using Eq. (

Results of the perfect-model test. The mean monthly improvement in the Antarctic October TCO projection (1960–2095) of the WM compared to the MMM for each model taken as the pseudo truth. The average shown in red is the improvement across all the perfect-model tests. No conclusions about overall model skill should be drawn from this plot.

We performed a perfect-model test (Sect.

Three models, when treated as the pseudo truth, do not show an improvement of the WM with respect to the MMM. Note that this is not poor performance of the model in question, but it is rather that the weighting methodology does not do an adequate job of creating a weighted projection for that model as the pseudo truth. Using CHASER-MIROC-ESM as the pseudo truth gives a worse WM projection than if we used the MMM. However, the average correlation between the CHASER-MIROC-ESM-simulated TCO and other models in the ensemble is the lowest at 0.65, compared to the average ensemble cross-correlation score of 0.81. Since a weighted mean is a linear combination of models in the ensemble, it is understandable that models with low correlation to CHASER-MIROC-ESM will be less skilful at replicating its TCO time series. This is why an improvement is not seen for CHASER-MIROC-ESM as the pseudo truth in the perfect-model testing.
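The average correlation between one model and the rest of the ensemble can be sketched with a plain Pearson correlation, as below. The use of Pearson correlation, the function names and the toy series are assumptions; the text does not specify how the cross-correlation scores were computed.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_correlation_with_others(series, k):
    """Average correlation between model k and every other model."""
    scores = [pearson(series[k], s) for i, s in enumerate(series) if i != k]
    return sum(scores) / len(scores)

# Made-up TCO-like series: models 0 and 1 move together, model 2 less so.
tco = [[1.0, 2.0, 3.0, 4.0],
       [2.0, 4.0, 6.0, 8.0],
       [1.0, 3.0, 2.0, 4.0]]
per_model = [mean_correlation_with_others(tco, k) for k in range(len(tco))]
```

A model with a low average correlation, like model 2 here, is harder to reproduce as a linear combination of the others, which is the situation described for CHASER-MIROC-ESM.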

We also performed out-of-sample testing on the WM projection for the years 2011–2016 inclusive by comparing it to the TCO observational time series, which was smoothed as described in Sect.

Inter-model similarity across all refC1SD models as calculated by Eq. (

The current design of model inter-comparison projects does not account for structural similarities in models, ranging from sharing transport schemes to entire model components. Therefore, a key part of generating an informed weighting is considering how alike any two models are. The weighting scheme presented here accounts for model independence through the denominator in Eq. (

The refC1SD scenario from CCMI consists of 14 different simulations, some of which are with different models, whereas others are just different realisations of the same models. Note that there are more models used here than in the creation of the Antarctic ozone projection. This is because for the weighted projection we require both a refC1SD and a refC2 simulation for each model, but for similarity analysis we can use all the refC1SD simulations. For these model runs we calculated a similarity index

Similarities between all refC1SD models are shown in Fig.

The projection of the ozone hole recovery date presented here makes use of an ensemble of the latest generation of CCMs and a weighting methodology that accounts for complexities within model ensembles. While the ozone recovery date found in this work (2056) is different to that found by

The CNRM-CM5-3 model received the largest weight of

Central to the weighting methodology is the selection of metrics, which requires expert knowledge. The set of metrics we chose is grounded in scientific understanding and produces a good improvement of the weighted projection compared to the MMM. There are numerous other metrics of varying complexity which could be considered, such as the size of the ozone hole or the abundance of polar stratospheric clouds. These extra metrics could improve the model weighting and give a more accurate projection, but testing an exhaustive collection of metrics was not our aim, and appropriate measurements with which to validate the metrics are not always available. We have shown a weighting framework which improves upon the current methodology for combining model ensembles and is also flexible and adaptable to whichever metric choices the user deems reasonable. Furthermore, the narrow range in return dates produced from the dropout testing shows that the results produced in this weighting framework are robust to metric and model choices. It is desirable for a methodology to provide stable results irrespective of fluctuations in the input. It is also reassuring that the methodology is robust to metric choices, as we are often constrained by the availability of observational data. In this work we benefit from the decades of interest in polar ozone which have led to datasets of a length suitable for constructing model weights. This highlights the importance of continued production of good observational datasets because, although perfect-model testing allows us a form of testing which forgoes the need for observations, weighting methodologies must be grounded in some estimate of the truth.

We generated weights from the refC1SD simulations which means that some metrics we chose are based on nudged variables, such as the lower-stratospheric temperature gradient. As a result, one might expect that the model skill for these metrics should be equal, though given Fig.

As the perfect-model test produces better projections for models which are nudged in a variety of ways, we can conclude that the weighting is not dominated by nudging. Take for example UMUKCA-UCAM, which is nudged quite differently from the rest of the ensemble, as evidenced by a southern pole significantly colder than the ensemble. When we take UMUKCA-UCAM as the pseudo truth (temporarily assuming the UMUKCA-UCAM output is the observational truth) we generate weights based upon the refC1SD simulations and test them on the refC2 simulations. The weights generated are based on the dynamical system simulated in refC1SD, which includes any model nudging. We can test how well these weights apply to a different dynamical system without nudging (refC2). As we see an improvement in the WM compared to the MMM, we can conclude that the weights generated from the refC1SD dynamical system can be applied to the refC2 dynamical system. If there had not been an improvement, then the dynamical systems described by refC1SD and refC2 may be too dissimilar for this weighting methodology, and the weights may instead have been dominated by how well models are nudged. Nudging may be influencing the weights, but not to a degree that the accuracy of the projection suffers.

We justified using the nudged refC1SD simulations, despite these considerations, for two reasons. Firstly, these nudged simulations give the models the best chance at matching the observational record, by providing relatively consistent meteorology across the models. The free-running CCMI hindcast simulations (refC1) have a large ensemble variance and, despite producing potentially realistic atmospheric states, are not directly comparable to observational records. Secondly, the perfect-model testing discussed above demonstrates that the nudging does not have a detrimental effect on the model weighting.

Although we were not seeking to grade the CCMs as per

We have presented a model-weighting methodology, which considers model dependence and model skill. We applied this over a suite of metrics grounded in scientific understanding to Antarctic ozone depletion and subsequent recovery. In particular we have shown that the weighted projection of the total ozone column trend, with inter-annual variability removed, predicts recovery by 2056 with a 95 % confidence interval of 2052–2060. Through perfect-model testing we demonstrated that on average a weighted mean performs better than the current community standard of calculating a multi-model mean. Additionally, the perfect-model test, a necessary step in validating the methodology, showed a level of transitivity between the free-running and the specified-dynamics simulations.

This methodology addresses the known shortcomings of an ensemble multi-model mean, which include the problem of ensembles including many similar models and the inability to factor in model performance. It does this by quantifying skill and independence for all models in the ensemble over a selection of metrics which are chosen for their physical relevance to the phenomenon of interest. This weighting methodology is still subject to some of the same limitations as taking an ensemble mean; that is, we are still limited by what the models simulate. For example, in the case of ozone depletion, a weighted mean is no more likely to capture the ozone changes due to the recent fugitive CFC-11 release

Addressing the shortcomings and presenting possible improvements of methods for averaging model ensembles is timely given the current running of CMIP6 simulations

In summary, we have presented a flexible and useful methodology, which has applications throughout the environmental sciences. It is not a silver bullet for creating the perfect projection for all circumstances; however, it can be used to construct a phenomenon-specific analysis process that can account for model skill and model independence, both of which can improve ensemble projections compared to a multi-model mean.

The Jupyter notebook used to run the analysis, along with a collection of functions to produce weightings from ensembles, can be found at

MA developed the methods, led the analysis and conceived the study alongside JSH and PJY, who made major contributions as the work progressed. MA drafted the paper, with the guidance of PJY and JSH and input from JFL. JFL and all the other co-authors provided model simulation data. All the co-authors provided model outputs and helped with finalising the paper.

The authors declare that they have no conflict of interest.

This article is part of the special issue “Chemistry–Climate Modelling Initiative (CCMI) (ACP/AMT/ESSD/GMD inter-journal SI)”. It is not associated with a conference.

This work was supported by the Natural Environment Research Council (NERC grant reference no. NE/L002604/1), with Matt Amos's studentship through the Envision Doctoral Training Partnership. Paul J. Young is partially supported by the Data Science of the Natural Environment (DSNE) project, funded by the UK Engineering and Physical Sciences Research Council (EPSRC; grant no. EP/R01860X/1), as well as the Impact of Short-Lived Halocarbons on Ozone and Climate (ISHOC) project, funded by the UK Natural Environment Research Council (NERC; grant number: NE/R004927/1). We would like to thank Bodeker Scientific, funded by the New Zealand Deep South National Science Challenge, for providing the combined NIWA-BS total column ozone database. We also acknowledge the GOZCARDS team for the production of the HCl record and Remote Sensing Systems for the MSU TLS record. We acknowledge the modelling groups for making their simulations available for this analysis, the joint WCRP SPARC–IGAC Chemistry–Climate Model Initiative (CCMI) for organising and coordinating the model data analysis activity, and the British Atmospheric Data Centre (BADC) for collecting and archiving the CCMI model output. The EMAC simulations have been performed at the German Climate Computing Center (DKRZ) through support from the Bundesministerium für Bildung und Forschung (BMBF). DKRZ and its scientific steering committee are gratefully acknowledged for providing the HPC and data-archiving resources for this consortial project ESCiMo (Earth System Chemistry integrated Modelling). The CCSRNIES-MIROC3.2 model computations were performed on NEC-SX9/A(ECO) and NEC SX-ACE computers at the CGER and NIES, supported by the Environment Research and Technology Development Funds of the Ministry of the Environment, Japan (2-1303), and Environmental Restoration and Conservation Agency, Japan (2-1709).

This research has been supported by the Natural Environment Research Council (grant no. NE/L002604/1) and the Engineering and Physical Sciences Research Council (grant no. EP/R01860X/1).

This paper was edited by Gunnar Myhre and reviewed by Sandip Dhomse and two anonymous referees.