Comment on "Quantitative performance metrics for stratospheric-resolving chemistry-climate models" by Waugh and Eyring (2008)
- Deutsches Zentrum für Luft- und Raumfahrt, Institut für Physik der Atmosphäre, Oberpfaffenhofen, 82230 Wessling, Germany
Abstract. This comment focuses on the statistical limitations of the model grading applied by D. Waugh and V. Eyring (2008) (WE08). The grade g is calculated for a specific diagnostic and essentially relates the difference between the means of the model and observational data to the standard deviation of the observational dataset. We performed Monte Carlo simulations, which show that this method can lead to large 95% confidence intervals for the grade. Moreover, the difference between two model grades often has to be very large to become statistically significant. Since the confidence intervals were not considered in detail for all diagnostics, the grading in WE08 cannot be interpreted without further analysis. The results of the statistical tests performed in WE08 agree with our findings. However, most of those tests are based on special cases, which implicitly assume that the observations are free of errors and that the interannual variability of the observational data and the model data are equal. Without these assumptions, the 95% confidence intervals become even larger. Hence, the case in which we assumed perfect observations (ignoring errors) provides a good estimate for an upper bound of the threshold below which a grade becomes statistically significant. Examples have shown that the 95% confidence interval may even span the whole grading interval [0, 1]. Without considering confidence intervals, the grades presented in WE08 do not allow one to decide whether a model result deviates significantly from reality. Neither WE08 nor our comment identifies which of the grades presented in WE08 exhibit such a significant deviation. However, our analysis of the grading method demonstrates the unacceptably high potential for these grades to be insignificant. This implies that the grades given by WE08 cannot be interpreted by the reader.
We further show that including confidence intervals in the grading approach is necessary, since otherwise even a perfect model may receive a low grade.
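
To illustrate the kind of Monte Carlo experiment described above, the following minimal sketch estimates a 95% confidence interval for a grade. It rests on several assumptions that are not stated in this abstract: the grade is taken to be g = max(0, 1 - |mean_model - mean_obs| / (n_g * sigma_obs)) with scaling factor n_g = 3, both time series are drawn from Gaussian distributions with unit interannual variability, and the function names `grade` and `grade_interval` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def grade(model, obs, n_g=3.0):
    """Assumed grade: 1 - |difference of means| / (n_g * obs std), floored at 0."""
    diff = abs(np.mean(model) - np.mean(obs))
    sigma_obs = np.std(obs, ddof=1)  # sample standard deviation of the observations
    return max(0.0, 1.0 - diff / (n_g * sigma_obs))


def grade_interval(true_offset, n_years=10, n_trials=5000, seed=0):
    """Monte Carlo 95% interval of the grade for a given true model bias.

    true_offset is the true difference of means in units of the (assumed
    equal) interannual standard deviation; n_years is the sample length.
    """
    rng = np.random.default_rng(seed)
    grades = np.empty(n_trials)
    for i in range(n_trials):
        obs = rng.normal(0.0, 1.0, n_years)            # synthetic observations
        model = rng.normal(true_offset, 1.0, n_years)  # synthetic model output
        grades[i] = grade(model, obs)
    lo, hi = np.percentile(grades, [2.5, 97.5])
    return lo, hi


# A model biased by one standard deviation, sampled over 10 years.
lo_biased, hi_biased = grade_interval(true_offset=1.0)

# A "perfect" model (no true bias) still receives a spread of grades.
lo_perfect, hi_perfect = grade_interval(true_offset=0.0)

# Shorter records give wider intervals than longer ones.
lo_short, hi_short = grade_interval(true_offset=1.0, n_years=5)
lo_long, hi_long = grade_interval(true_offset=1.0, n_years=40)

print(f"biased model,  10 yr: [{lo_biased:.2f}, {hi_biased:.2f}]")
print(f"perfect model, 10 yr: [{lo_perfect:.2f}, {hi_perfect:.2f}]")
print(f"biased model,   5 yr: [{lo_short:.2f}, {hi_short:.2f}]")
print(f"biased model,  40 yr: [{lo_long:.2f}, {hi_long:.2f}]")
```

In this sketch the width of the interval is governed by the sampling error of the two means and of the observed standard deviation, so it shrinks only slowly as the record length grows. Adding observational errors or unequal variabilities, which the sketch omits, would be expected to widen the intervals further.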