Re-review of Two-scale multi-model ensemble: Is a hybrid ensemble of opportunity telling us more?
The new manuscript addresses many of the issues raised in my initial review. Of my initial concerns, two have been well addressed: the structure of the paper does seem to be improved, and the quality of the figures has improved substantially. However, my other major concern – the lack of quantitative conclusions – has not been addressed. In light of this, and of some smaller problems which still remain, I recommend additional minor revisions. Once addressed, I am confident that this paper is appropriate for publication in ACP.
My only major concern is the lack of quantitative conclusions. The authors state in their response that “More care has been given to the abstract an to adding more quantitative conclusions”. However, in comparing the original to the revised version, the abstract does not appear to have any additional quantitative conclusions. Instead, all of the additional material is qualitative (“slight” increase of diversity, “marked” improvement of accuracy). I again ask that the authors quantify their conclusions, using the data they already have to hand. This recommendation also applies to the conclusions section (section 5) of the paper, which I note presents only one quantitative result, a “1-5% improvement”. I also recommend that the authors be more specific about what has improved, as the actual quantity which has improved is not clearly stated there.
All remaining concerns are minor, and listed below:
1) In my previous review, I stated:
“In the conclusions the authors refer to analysis of annual hourly, J-J-A hourly, and annual daily maximum records. However, until that point there seems to be no discussion of the latter two metrics. The authors should consider elaborating in the previous sections on the analysis they performed using these metrics.”
Based on the response given by the authors, I fear that I was not clear. My meaning is that, whenever data on “accuracy” or any other outcome is presented, the exact quantity being considered is often not clearly stated. For example, I note that section 4.3 (previously section 7) states that “[t]he comparison of the ensemble performances will be restricted to the months of June -August”. However, the authors do not state whether they are considering the average of the daily maximum over this period, or the average of all hourly readings over this period. While it may be clear to the authors, I strongly recommend that they clearly state the exact quantity being analyzed at the opening of each figure’s discussion. For example, in the discussion of POD and FAR on lines 334 to 349, are they referring to exceedances in all hourly data, or in the daily maximum?
2) Further to the previous comment, I cannot find any dedicated analysis in the manuscript of the daily maximum records. The closest I can find is in section 3.2, but this is prior to any discussion of ensemble results, and only addresses the maximum as a limiting case for model diversity. The ability of models to accurately predict the one-hour daily maximum is of significant policy relevance, both because it is a monitored quantity under some regulations and because it was used as the predictor in a widely-cited epidemiological study which underpins much of the literature on ozone exposure impacts (Jerrett et al., 2009). The ability of models to achieve a certain level of accuracy when predicting the daily mean is not shown here to be a reliable predictor of their ability to predict the daily maximum, and there are reasonable grounds to expect this not to be true (e.g. a model may represent night-time ozone chemistry well but have a poor representation of daytime behavior). I therefore suggest that the authors either:
• Clearly demonstrate which of their analyses are specifically relevant for the daily-maximum record as opposed to including all hourly records, if indeed some of the analyses are specific to the daily maximum; or
• Provide analysis in the results section of the ensemble performance specifically for daily maximum ozone, prior to the assertions on lines 629-630; or
• Remove the assertion on line 630 that the annual daily maximum records are analyzed.
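To illustrate the concern in comment 2, the following is a minimal sketch using entirely hypothetical hourly values (they are not taken from the manuscript, from any model, or from any observation record). It shows how a modelled day could match the observed daily mean while still missing the daily maximum. The function names and the numbers are my own assumptions, chosen only for illustration.

```python
def daily_mean(hourly):
    """Average of all hourly values in one day."""
    return sum(hourly) / len(hourly)


def daily_max(hourly):
    """Largest one-hour value in one day."""
    return max(hourly)


# Hypothetical 24-hour ozone series in ppb (illustrative only).
observed = [20] * 12 + [40] * 4 + [80] * 4 + [40] * 4
# Hypothetical model: too high at night, too low at the afternoon peak.
modelled = [30] * 12 + [40] * 4 + [50] * 4 + [40] * 4

mean_bias = daily_mean(modelled) - daily_mean(observed)
max_bias = daily_max(modelled) - daily_max(observed)

print(f"bias in daily mean: {mean_bias:.1f} ppb")
print(f"bias in daily maximum: {max_bias:.1f} ppb")
```

In this toy case the bias in the daily mean is zero while the daily maximum is underestimated by 30 ppb, which is why I would like to see the two quantities evaluated separately.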
3) Figure 11 is critical to the paper’s conclusions, but remains somewhat difficult to interpret due to unnecessary clutter and lack of appropriate labeling. I would suggest that:
• The four color bars are replaced by one;
• Units are added for the x- and y-axis labels;
• A smaller marker is used to prevent (or at least minimize) the degree to which markers block one another;
• Units are added for the in-figure notations.
To be clear, I recognize that Figure 11 is complex and is attempting to compactly represent a lot of data. My hope is that these suggestions will help the reader to more easily interpret an important but complicated diagram. Units should also be used during the discussion of Figure 11 (lines 577-607).
4) On line 279, the authors state that the spectrum behaves “remarkably similarly at scales smaller than the daily peak”. I was surprised at the lack of analysis regarding the mismatch between all models and the observations in the high frequency components, with seemingly all models showing strong resonant peaks while the observations show no such effect. Do the authors have a hypothesis for this? Is this because of shortcomings in the observations or model error?
5) Lines 196 to 205 have out-of-date section numbers. This paragraph should be re-written to better reflect the new paper structure.
6) Line 337: “whereas the latter is the proportion of non-occurrences that were incorrectly identified as happening” – this sentence is confusing and I suggest the authors reword it.
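To make clear how I read the quoted sentence, here is a minimal sketch of the two scores computed from a 2x2 contingency table. It assumes that an "occurrence" is an exceedance of a threshold, and it takes the quoted definition literally: false alarms divided by all observed non-occurrences. The threshold of 60, the sample values, and all function names are hypothetical and are not taken from the manuscript. Note that a differently defined score, the false alarm ratio, divides instead by all predicted occurrences; I make no assumption about which of the two the authors intend.

```python
def contingency(observed, predicted, threshold):
    """Count hits, misses, false alarms and correct negatives for exceedances."""
    hits = misses = false_alarms = correct_negatives = 0
    for obs, pred in zip(observed, predicted):
        obs_exceeds = obs > threshold
        pred_exceeds = pred > threshold
        if obs_exceeds and pred_exceeds:
            hits += 1
        elif obs_exceeds:
            misses += 1
        elif pred_exceeds:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def pod(hits, misses):
    """Proportion of observed occurrences that were correctly predicted."""
    return hits / (hits + misses)


def far_as_quoted(false_alarms, correct_negatives):
    """Proportion of observed non-occurrences that were predicted to occur."""
    return false_alarms / (false_alarms + correct_negatives)


# Hypothetical values; these could equally be hourly data or daily maxima.
obs = [50, 70, 65, 40, 30, 80]
mod = [65, 75, 55, 35, 62, 58]

hits, misses, false_alarms, correct_negatives = contingency(obs, mod, threshold=60)
print("POD:", pod(hits, misses))
print("FAR (as quoted):", far_as_quoted(false_alarms, correct_negatives))
```

The same functions give different scores depending on whether they are fed all hourly values or only the daily maxima, which is the ambiguity raised in comment 1.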
7) Lines 445-451 seem to be referring to Figure 11, but at this point only Figures 1-6 have been introduced. The authors should either defer this discussion to the relevant part of the manuscript or at least make clear that they are talking about a different figure.
8) On line 496, the terms LT, SY, and DU are used, but these are no longer defined explicitly in the manuscript.
9) On lines 499-502, would it be accurate to say that a larger number of global models is needed in locations with complex topography because we are using unrelated differences between models to substitute for an accurate representation of the effect of topography on the result? Or do some of the global models have subgrid parameterizations capable of capturing this effect?
10) On lines 569-571, the authors assert that, where a small number of global models are used in the optimum ensemble for a given location, this “clearly indicat[es] a deficiency”. Doesn’t this depend on both the accuracy and diversity of that small number of global models (e.g. high accuracy but low diversity of the globals + low accuracy but high diversity of the regionals could produce this result)?
11) Figure 10 has only three lines of discussion. I recommend that it be moved to the SI, as the outcomes could easily be presented without it.
12) There are still some typos in the paper. I have listed some below but recommend that the authors perform another sweep:
• Line 114: “of which where global” should be “of which were global”
• Line 254: “selected as the capture” should be “selected as these capture” (or possibly “they”)
• Lines 264, 396, 573: “Figures” should be “Figure”
• Line 408: This should be Figure 5, not Figure 6
• Line 570: “Global” should be “global”
• Line 599: “+” should be “and”
• Line 600: “form” should be “from”
13) The assertion on lines 613 to 615 is that “regardless of the treatment, the ensemble data capture the ozone power spectrum with no notable deviation from the measured spectrum”. This is far too strong and not supported by the data. Whether because of shortcomings in the observational technique or in the models (see comment 4 above), there is significant disagreement at periods shorter than 6 hours, and this deserves explanation. The authors' assertion also seems to be almost immediately contradicted by the observation of the “large power deficit in the range from 0.8 days to 100 days”.
14) Line 924: what is meant by “most present”?
15) Table 2 still seems quite inconsistent, especially in its vertical grid descriptions. For example, the CHASER models report “32 VL up to 40 km” – no detail on spacing – while the IFS model reports “60 VL from surface up to 0.1 hPa – lowest level 15 m”. Meanwhile the GEOS-Chem adjoint model still reports “47 levels up to 0.066 hPa (bottom of last grid)”. While I appreciate that the authors did follow up on my previous question, their explanation was not convincing; GEOS-Chem does not generate values “at the base of the cell” but instead almost always calculates cell-average quantities. Meanwhile the other models report the correct upper limit, i.e. the pressure at the “ceiling” of the uppermost grid cell – this includes for example the IFS model. In the interests of consistency, I recommend that the authors replace the “vertical grid” column with “Number of layers”, “Lowest layer thickness”, and “Model top (hPa)” columns, and that they either use the actual model top for GEOS-Chem or change the description of the other models to also use the bottom of the uppermost layer (e.g. 0.02 hPa for IFS, 136.5 hPa for the EMEP model, and so on).
References
Jerrett, M., Burnett, R.T., Pope, C.A., Ito, K., Thurston, G., Krewski, D., Shi, Y., Calle, E., Thun, M., 2009. Long-term ozone exposure and mortality. N. Engl. J. Med. 360, 1085–1095.