Assessing the potential for simplification in global climate model cloud microphysics

We sincerely thank the reviewers for their insightful and constructive feedback. We have implemented their feedback in a revised version of the manuscript. Please find our answers to the reviewers' points below, followed by a marked-up version of the manuscript.


Reviewer Comment #1
RC: This study used a perturbed parameter ensemble (PPE) framework together with Gaussian process emulation to explore sensitivity of a global climate model to perturbations of four ice microphysical processes. The paper first describes analysis of global results, then a spatial decomposition using spherical harmonics expansion. Overall results show strong sensitivity to the "aggregation" process, while riming has a strong impact on LWP. The authors then discuss implications for the level of complexity of microphysical process representations. Overall, the paper is well written and addresses an important topic, and is within the scope of Atmospheric Chemistry and Physics. This study is interesting. The methodology appears to be sound. The demonstration of this PPE-based approach to investigate cloud microphysical processes as a "proof of concept" is the strongest part of the paper in my opinion. However, I do have several questions regarding the broader context, including the motivation for this approach and interpretation of results.
Overall I think it could be acceptable in ACP, but more context is needed and some clarification of the points raised below. These are detailed in "specific major comments". I also have several additional minor comments and a few editorial corrections.
AR: Thank you for your clear and constructive feedback. Please find our respective answers directly below your comments.
1.1. Specific major comments

1.1.1

RC: I don't think "phasing" is the best term to use for the process perturbations. Essentially, you're applying a multiplicative factor directly to the process rates. But "phasing" typically means the relationship between the timing of two or more events, or synchronizing of multiple events. Similarly, the paper also uses the term "phase in and out" which implies some kind of dynamic or periodic change to the multiplicative factors, which is not the case here (they are constant in time and space for a given run, as I understand). Instead, process "perturbing" seems like a more appropriate term than "phasing". I don't think addressing this is required before publication, but I strongly suggest not calling this "phasing" to avoid confusion.
AR: To avoid confusion, we have followed your suggestion and use "perturbing" instead of "phasing" except where the dynamic connotation of "phase in and out" is intended.
1.1.2

RC: My main comment, as noted above, is in the interpretation and broader context of the study. The authors motivate this study by suggesting that results from this approach can be used to inform which processes can be simplified, but the specifics of this are unclear and the paper seems self-contradictory in several places. For example, the argument on lines 320-325 is: "Recognizing it as a threshold process and seeing the gradual response to small deviations from 1.0 in η aggr , it appears that there is potential for a less accurate description of aggregation in the model.", where "it" in this sentence means the "aggregation" process. I don't understand this argument. Moreover, this seems to directly contradict the argument later on lines 385-389 and lines 445-447. These lines state that the model is sensitive to representation of "aggregation" and riming, and therefore these processes "need to be represented accurately" and should receive attention of model developers, while self-collection and accretion can be simplified. But then this is contradicted again on line 450: "The previous analysis shows that the response of ECHAM-HAM to an inhibition of self-collection or accretion is negligible, while for riming and aggregation at least a less accurate representation can be appropriate." Perhaps this is not what the authors intended to write and it's simply written incorrectly? Either way, this is confusing. See also lines 478-479 in the conclusions section, and lines 481-484, where the latter again argues that aggregation should be the "process of choice" to represent most accurately. Overall, these self-contradictory arguments make the paper feel somewhat incoherent, and this needs to be addressed before the paper can be accepted.
AR: We see that in the presented form our argument was confusing. Our argument indeed goes two ways: On the one hand, aggregation is the dominant CMP process of the four investigated. As such, if one wanted to improve the model, aggregation provides leverage to do so. This means that the representation of aggregation should be subjected to scrutiny and - as you point out in other comments - that its dominant role might need to be reconsidered (and if it is considered unphysical, one solution might be to move to a scheme that does not separate the ice and snow categories). On the other hand, around η aggr = 1 the considered model output variables are unresponsive to a change in η aggr. This implies that a change in the formulation of aggregation would keep the model intact, and thus simplifications in the formulation of aggregation are possible as well. We have reformulated the sentences you point to to make this argument clearer.
The model is not sensitive to accretion and self-collection of ice and therefore these processes can be simplified, while aggregation :::::::::::: autoconversion : and riming dominate the model responseand therefore need to be represented accurately.
The previous analysis shows that the response of ECHAM-HAM to an inhibition of self-collection or accretion is negligible, while for riming and aggregation at least :::::::::::: autoconversion a less accurate representation can be appropriate.
These results as well as the shape of the response surface suggest that the parameterisation of self-collection and accretion can be readily and drastically simplified. While aggregation :::::::::::: autoconversion and riming have a large impact on the model output ::::::::: considering ::: the ::::: whole :::::::::: investigated ::::: phase ::::: space, the shallow slope of the response surface around the default η i = 1 hints that slight modifications of their representations may leave the model output unchanged. The strength of the PPE approach is that interactions are already taken into account, meaning that all four processes could be simplified at the same time. If one wants to make the representation of one of these processes more accurate, aggregation should be the process of choice :::::: develop ::: the ::::: CMP :::::: scheme ::::::: further, ::::::::::::: autoconversion : is ::: the :::::: process :: to :::::::: scrutinize : as it has the largest leverage in the model and therefore the largest :::: most ::::: urgent need to be represented correctly.

1.1.3
RC: I have several other questions and comments about the broader motivation and context for this study. First, a process could still be important even if there is not much sensitivity to some limited set of output metrics in a given climate state (for example, a process might be important for cloud-aerosol interactions even if the impact on LWP, IWP, etc for the current climate state is small). This is alluded to on lines 507-508, but I feel this point should be emphasized more in the paper (e.g. in the introduction). Moreover, if a process is unimportant, it doesn't necessarily "do harm", except for extra computational cost. Computational cost here is negligible for these processes anyway, as discussed on lines 451-455 and Table 1. I appreciate the gain in interpretability with model simplification (stated explicitly on line 460), and this is an important point, but this should be made clear in the introduction when motivating the work.
RC: Also, regarding the sentence on line 460, I don't see how such simplification leads to a gain in "comprehensiveness", which I assume means generality; could you clarify this?
AR: We corrected our incorrect choice of words for "comprehensiveness" by replacing it: However, as detailed in Sec. 1, there are numerous benefits in simplification that are independent of the associated computing cost, such as a gain in comprehensiveness ::::::::::: compactness, ::::::::: robustness : and interpretability.
RC: Furthermore, regarding both computational cost and interpretability (as well as comprehensiveness), it seems likely that other scheme features, particularly the choice of prognostic variables and hydrometeor categories, is more important than process formulations anyway (see major comment #4 below for further discussion of this point).
AR: This point about the other scheme features is answered under #4 below.

1.1.4

RC:
The process formulations themselves are only one aspect of scheme complexity, and arguably a rather small part. The choice of hydrometeor categories and prognostic variables is generally more important in terms of determining overall scheme complexity, interpretability, comprehensiveness, and computational cost. Yet nothing is mentioned about this in the paper. In my experience, simplifying the representation of microphysics (ice microphysics in particular) in terms of choice of prognostic variables and hydrometeor categories is likely to have a *much* larger effect on interpretability, cost, etc. than only simplifying the process formulations. To this end, can your approach inform simplification of these other aspects that are likely to be more important overall? As a specific example, one of the main findings of this paper is the importance of "aggregation" (which is probably better termed "ice autoconversion", see major comment #7 below). Yet in simpler schemes that use a single ice category but multiple predicted properties for that category (e.g., Morrison and Milbrandt 2015; Eidhammer et al. 2017) this process is not even included, nor are any other conversion processes between ice categories (e.g. accretion of cloud ice by snow). The critical point is that the separation of self-collection, aggregation, and accretion using separate cloud ice and snow categories is artificial, and the approach ultimately must rely on an ad-hoc choice for how cloud ice and snow are separated. Given that "aggregation" (autoconversion) is the most important process (of the 4 tested) based on your results, it's troubling that there is little physical constraint for this process.
To me, the findings from this study motivate the kind of simplification of using one ice category so that "aggregation" is no longer needed. I feel discussion of this issue is needed, particularly the approach of moving away from separate categories for cloud ice and snow. More broadly, discussion of these other sources of structural uncertainty (categories, prognostic variables, etc.), besides just the process formulations, is needed in the paper.

AR:
We have added a data point for η aggr = 0.1 to illustrate the behaviour we are referring to more clearly. We have also added more detail to the explanation of this behaviour in the text the first time it is mentioned. In lack of a better term, we continue to refer to it as "threshold" behaviour but hope that with the explanation this is now more clear.
RC: A few other related comments. First, it's not entirely clear that snow should actually be zero when "aggregation" is turned off, if there are other processes which might contribute at least some to snow (see major comment #8 below). Second, this behavior really just reflects assumptions in the scheme, particularly the separation of total ice into cloud ice and snow categories (see major comment #4 above). For example, the shutting off of riming as aggregation goes to 0 must reflect the fact that riming of snow is included but riming of cloud ice is neglected (at least, I think riming of cloud ice is neglected but that isn't clear from the paper). Ultimately this kind of behavior does not necessarily reflect actual underlying cloud physics but rather the separation of ice into cloud ice and snow categories with an ad-hoc conversion between the two called "aggregation" (ice autoconversion).

1.1.6
RC: I have a few comments on the process perturbations. The range for the process perturbations (magnitude of η i ) seems ad-hoc, from 0 to 200%. Following the motivation of understanding sensitivity to process rates and the level of process uncertainty, it seems that the range of η i should follow from the level of uncertainty of a given process. However, turning off a process entirely (η i = 0) is unrealistic and generally not justified from a physical standpoint or from the standpoint of the actual range of uncertainty of a process. I understand why this might be done as a sensitivity test and "proof of concept" to assess importance of that process, but this point should be clarified in the paper. Also, what determined the upper range of 200%? Was this ad-hoc?
AR: In response to your comments and those of the second reviewer, we have adjusted the range of the perturbations to go from η i = 0.5 to 2 in the PPE. The choice of going to 200% was ad-hoc, but is now matched by the multiplicatively appropriate lower end of the range. We think that this range is appropriate as we are interested in perturbations around η i = 1 (as demonstrated in the newly added sketch) and a deviation of double or half the original results through simplification seems reasonable for all four processes (e.g. doubling corresponds to implicitly adding a new process). The adapted PPE design also makes the "proof of concept" idea more apparent in the one-at-a-time sensitivity studies going down to η i = 0, conducted for e.g. Fig. 4.
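The multiplicative symmetry of the 0.5 to 2 range, and a way to sample it evenly on that scale, can be sketched as follows. This is a hypothetical illustration only, not the actual PPE design code; the ensemble size and the log-uniform sampling scheme are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# The range 0.5 to 2 is multiplicatively symmetric around the
# default eta = 1: halving and doubling are equidistant in log space.
eta_min, eta_max = 0.5, 2.0
assert np.isclose(np.log(eta_min) + np.log(eta_max), 0.0)

# Draw perturbation factors for the four processes uniformly in log
# space, so under- and over-estimation are covered evenly.
n_members, n_processes = 40, 4
eta = np.exp(rng.uniform(np.log(eta_min), np.log(eta_max),
                         size=(n_members, n_processes)))
```

Sampling in log space also addresses the second reviewer's coverage concern below, since a uniform draw over [0, 2] would place only a quarter of the points in the halving regime (0, 0.5).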
RC: Finally, a fairly major simplification of this study is assuming the process perturbations are constant in time and space. Of course, introducing time/space dependence would add more dimensions and increase complexity of the problem considerably, and therefore I can see why constant perturbations were assumed in this study for practical reasons. But from the standpoint of process uncertainty, it seems likely that relative uncertainty in these processes is not uniform in time and space. I suggest briefly mentioning this issue.
AR: Thanks for this suggestion. We have adapted the naming of the process to autoconversion instead of aggregation (and autc instead of aggr) in the manuscript.

1.1.8
RC: It's mentioned that "aggregation" (autoconversion) is the only process that can generate snow. Is this really true? What about freezing of rain? Does that form cloud ice, or what else is done?
AR: Indeed aggregation of ice crystals is the only process that can form snow in the model. Rain is diagnosed and precipitates to the surface within one timestep and therefore cannot freeze.

1.1.9

RC: You might consider briefly mentioning other emulation approaches which have some benefits (as well as some drawbacks) compared to using Gaussian processes for emulation as done here, specifically those based on neural networks. I believe some of these issues are outlined in Watson-Parris et al. (2021), and perhaps you could just mention the issue in a quick sentence or two and cite that paper (already cited elsewhere in your paper).
AR: Thanks for noting this, we have added a specification to the caption.

1.1.11
RC: There have been several studies (mainly authored by Posselt, van Lier-Walqui, and/or Morales) that have examined sensitivity to microphysical scheme perturbations in a Bayesian framework. These studies have used MCMC directly rather than emulation, which was possible by focusing on idealized modeling with reduced dimensionality (1D or 2D models generally, or small 3D domains). Some examples are Posselt (2016), Morales et al. (2021), He and Posselt (2015), and van Lier-Walqui et al. (2020) (references given at the end of this review). While these studies haven't focused on global climate, they seem relevant to mention. In particular, the study of van Lier-Walqui et al. (2014) is relevant because they similarly applied constant multiplicative factors to individual microphysical process rates as a way to vary processes and examine the effects of process uncertainty (but in a 1D model, with MCMC).
AR: Thank you for pointing us to this set of references which had previously escaped our attention. We agree that they are relevant and have added references in the Introduction and Methods Section of the manuscript: In a PPE multiple input parameters are perturbed at the same time.
AR: As detailed in our answer above, we changed "phasing" to "perturbation".

Line 27
RC: This could give the false impression that Fisher and Koven (2020) concerns cloud microphysics in climate models specifically, when it's about land surface processes. This is because you say "these processes" in the sentence before on line 25, which refers specifically to cloud microphysics. There's also a paper by Morrison et al. (2020) that discusses similar issues as on lines 23-35, but specifically from the standpoint of microphysics.

Line 66
RC: I don't think that "stripped of detail" means "less accurate". Accuracy implies how close the model is to some benchmark or truth. Thus, my suggestion is to simply state ". . . where process parameterisations can be stripped of detail to aid the development of simplified models . . . ".
AR: Thank you for the suggestion. We have adapted the sentence accordingly.
In this paper, we propose a new methodology to assess where process parameterisations can be less accurate, i.e. stripped of detail, to aid the development of a simplified model as well as to increase process understanding.

Line 143
RC: So riming of droplets on cloud ice is neglected?
AR: Yes, the current CMP scheme does not contain riming of cloud droplets on cloud ice.

Line 146
RC: But snow number should not change at all for riming, of course. I assume that's the case for the parameterization here?

Line 147
RC: So cloud ice sedimentation is neglected completely? Later on lines 181-182 it seems that sedimentation *is* included, but reword earlier around line 147 to make this clear.

Line 158
RC: Assumed to be *all* liquid at mixed-phase temperatures? Also, better to specifically state the temperature range rather than saying "mixed-phase temperatures" which is imprecise.

Line 164
RC: 15 micron volumetric radius seems very low for a maximum-allowed value. Is this a typo or is this actually the value used? I understand this is only for determining the minimum CDNC, but with such a small maximum radius this implies a rather large CDNC. I realize this isn't a central part of the study but it seems worth pointing out.
AR: In the model a minimum CDNC value is needed likely because some aerosol sources are missing. As detailed in the manuscript we have replaced a fixed minimum CDNC value with a fixed maximum cloud droplet radius. 15 µm is the actual value used for the cloud droplet radius. Admittedly, cloud droplets larger than that can be found in the atmosphere. However, the parameter is set not to match the highest realistic value but to tune towards satisfactory model performance. From this perspective, a maximum cloud droplet radius of 15 µm provides a reasonable grid cell mean value.
1.2.9 Lines 184-186

RC: I think this sentence is confusing: "the gain of ice crystal concentrations in the level into which the ice crystals sediment is restricted to the closest loss of in-cloud ice crystal mass and number concentration in the levels above." Can this be clarified, perhaps using a more mathematical description which would be less confusing?

Line 204
RC: I don't follow the logic that if the response to η i follows a sigmoidal curve, then the process need be only represented roughly. Do you mean because the relationship is monotonic? A sigmoidal curve of course could still have a sharp increase with small changes in η i implying a strong sensitivity to the given process. Or am I missing something here?
AR: This is a terminological mistake, we meant a curve where the slope around η = 1 is 0, i.e. a critical point. We have corrected this in the manuscript and added a sketch to illustrate the point (Fig. 2 in the new manuscript).
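The criterion described here — a slope of (near) zero in the response curve at η i = 1 — can be evaluated numerically for any emulated response. A hypothetical sketch with two made-up response shapes (arbitrary units, not actual model output):

```python
import numpy as np

def slope_at_default(eta, response, h=0.05):
    """Finite-difference slope of a tabulated response curve at eta = 1."""
    f = lambda x: np.interp(x, eta, response)
    return (f(1.0 + h) - f(1.0 - h)) / (2 * h)

eta = np.linspace(0.5, 2.0, 31)
# Critical point at eta = 1 (the "purple line" case): small
# perturbations around the default barely change the output.
flat_near_one = (eta - 1.0) ** 3
# Linear response: sensitive everywhere, including at eta = 1.
steep = 2.0 * (eta - 1.0)

assert abs(slope_at_default(eta, flat_near_one)) < 0.1
assert slope_at_default(eta, steep) > 1.0
```

Only the first shape would suggest that the process tolerates a less accurate description around the default.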
For example, if the model output variable (e.g. ice water path, IWP) as a function of η i follows a sigmoidal function ::: has : a :::::: critical ::::: point :: at :::::: η i = 1 :::: (i.e. ::::: slope :: of ::::: zero), this suggests that the process i needs to be represented only roughly :::::::::::: approximately and that some detail could probably be removed from its parameterisation without much of an effect on the model performance.

Figure 2

RC: Perhaps mention PPE in the figure itself, not just the caption. i.e., the diagonal arrow could be labeled "Run ECHAM-HAM to generate the PPE" or something like that. Just a suggestion, the authors can take it or leave it.

1.2.11
AR: We welcome this suggestion and have added the term as suggested.

Line 235
RC: I assume "precipitation" here is surface precipitation?
AR: Your assumption is correct and we have specified this accordingly.
From the PPE, we can construct a surrogate model for every output variable that we are interested in by training a separate emulator for each output variable (ice crystal and cloud droplet number concentration, ice and liquid water path, shortwave and longwave cloud radiative effect, cloud cover, :::::: surface precipitation, ice, liquid and mixed-phase cloud cover).
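For readers unfamiliar with the setup, a GP emulator with an additive kernel can be written in a few lines of NumPy. This is a toy sketch on made-up data, not the actual ECHAM-HAM emulator: it uses bias, linear and exponential kernel components with fixed unit hyperparameters (the manuscript's kernel also includes a polynomial term, and hyperparameters would normally be optimised):

```python
import numpy as np

def kernel(X1, X2):
    """Additive kernel: bias + linear + exponential components."""
    bias = 1.0
    linear = X1 @ X2.T
    d = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)
    expo = np.exp(-d)  # exponential covariance, unit length scale
    return bias + linear + expo

def gp_predict(X_train, y_train, X_test, noise=1e-6):
    """GP posterior mean at X_test given noisy training data."""
    K = kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = kernel(X_test, X_train)
    return K_s @ np.linalg.solve(K, y_train)

# Hypothetical training data: PPE members (eta factors for four
# processes) mapped to one output variable (stand-in response).
rng = np.random.default_rng(1)
X = rng.uniform(0.5, 2.0, size=(30, 4))
y = X[:, 0] + 0.5 * X[:, 1]
y_hat = gp_predict(X, y, X)
assert np.allclose(y_hat, y, atol=1e-2)  # interpolates the training set
```

In the study's setup, one such emulator would be trained per output variable listed above.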

Lines 258-259
RC: This seems like an important point, but could you be more specific what you mean by "disruptive changes" in the CMP processes compared to the changes in Johnson et al.(2015)? Do you mean you use a larger range, including shutting processes off entirely by setting η i = 0? Or something else?
AR: In the original manuscript, we were also thinking of the changes induced by η i = 0, but this part of the parameter space is no longer included in the PPE in the new manuscript.

Line 304
RC: This result doesn't seem that surprising -by suppressing "aggregation" you're likely decreasing the snow mixing ratio substantially, in turn reducing riming since the amount of riming depends on the amount of snow. This result likely reflects the somewhat artificiality of neglecting riming of cloud ice (at least, I think it's neglected in the scheme here), and more generally the separation of ice into "ice crystal" or "cloud ice" and "snow" categories (major comment #4). This kind of "thresholding" behavior associated with "aggregation" is discussed a bit below on line 309, but this is unphysical and reflects an ad-hoc separation of ice into cloud ice and snow categories. See major comment #5 above.

AR:
We agree with your interpretation that aggregation inhibits riming because the latter depends on the snow that the former produces. As discussed above this is inherent to the CMP scheme.

Lines 309-310
RC: Snow is not generated by freezing of rain? In a climate model this will be tiny, but it should still be non-zero. Or else, how is rain formed/lofted above the freezing level treated?
AR: No, in this CMP scheme rain reaches the surface within one timestep and thus cannot freeze to form snow.

Line 322
RC: I disagree with the sentence: "In classical sensitivity studies, where processes are only turned on and off, only the large signal induced by aggregation would have been visible." Clearly there would still be a large signal to turning off the aggregation process completely.

AR:
Indeed, this is what the sentence says: in classical sensitivity studies (...) the large signal induced by aggregation would have been visible.
RC: A few lines down this sentence also isn't clear: "Recognizing it as a threshold process and seeing the gradual response to small deviations from 1.0 in η aggr , it appears that there is potential for a less accurate description of aggregation in the model." To me, discussing a "threshold" process as a "gradual response" is self-contradictory. Not clear exactly what you mean by aggregation as a "threshold process".
AR: What we mean is illustrated in the sketch we added (Fig. 2 in the new manuscript): no matter what the response is far away from η = 1, if the slope is close to 0 around η = 1 there is some potential for a less accurate description of the process in question (purple line in the sketch).

Lines 343-345

RC: I'm not sure if this comparison with Lohmann and Ferrachat (2010) is meaningful because here you're choosing an ad-hoc range to vary the η i 's. And the range extends to unrealistic values, such as setting η i = 0 meaning the process is shut off completely. In contrast, I assume Lohmann and Ferrachat (2010) varied scheme parameters over a range of physically plausible values?
AR: Lohmann and Ferrachat (2010) varied tuning parameters over plausible tuning ranges. For two of the values, the perturbations span multiplicative ranges of 10 or larger. As the updated PPE design in our study includes perturbations between 0.5 and 2 (now excluding η i = 0), a comparison to Lohmann and Ferrachat (2010) is plausible. Indeed, the autoconversion rate has been multiplied by a factor between 1 and 10 in Lohmann and Ferrachat (2010), and we have specified this accordingly: Only for LWP Lohmann and Ferrachat (2010) find a larger range of about 50 gm −2 when they vary ::::::: multiply the autoconversion rate :::: with : a ::::: factor : between 1 and 10. As this warm-rain process is not included in the present analysis, it is reasonable that the observed variation for LWP is smaller.

Line 400
RC: I would suggest using a different variable than "f" for the data and complex coefficients. It's confusing to use the same symbol "f" for both.
AR: To remain in line with common notation but avoid ambiguity we have changed the symbol for the function to F .
Mathemtically :::::::::::: Mathematically, the model data can be represented as a linear combination of the orthogonal spherical harmonics basis functions as follows: The data f is then a function of the longitude θ and latitude φ, with Y_l^m a spherical harmonics function of degree l and order m (l and m are integers, with −l ≤ m ≤ l). The complex coefficients f_l^m ::: F_l^m can be computed as:
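As a concrete illustration of this expansion, the coefficients F_l^m are projections of the data onto the basis functions. The sketch below computes the m = 0 (zonally symmetric) coefficients by quadrature with plain NumPy for a field proportional to Y_1^0; it is a simplified stand-in, as the actual analysis would use a dedicated spherical harmonics package and cover all orders m:

```python
import numpy as np

def trapezoid(y, x):
    """Simple trapezoidal rule."""
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))

def coeff_m0(f_vals, l, phi):
    """F_l^0: projection of a zonally symmetric real field f(phi)
    onto Y_l^0 over the sphere (phi is colatitude; Y_l^0 is real)."""
    x = np.cos(phi)
    P_l = np.polynomial.legendre.Legendre.basis(l)(x)
    Y_l0 = np.sqrt((2 * l + 1) / (4 * np.pi)) * P_l
    return 2 * np.pi * trapezoid(f_vals * Y_l0 * np.sin(phi), phi)

phi = np.linspace(0.0, np.pi, 20001)  # colatitude grid
f = np.cos(phi)                       # f is proportional to Y_1^0
# Only the l = 1 coefficient is non-zero for this field:
assert abs(coeff_m0(f, 0, phi)) < 1e-6
assert np.isclose(coeff_m0(f, 1, phi), np.sqrt(4 * np.pi / 3), atol=1e-4)
assert abs(coeff_m0(f, 2, phi)) < 1e-6
```

The orthogonality of the basis functions is what makes each coefficient recoverable by such a projection.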

Line 465
RC: Not sure I'd say you're multiplying the process "effects" by a factor, but rather you're multiplying the process rates.

AR:
We have changed the formulation accordingly to make this clearer: We perturb the four CMP processes of aggregation :::::::::::: autoconversion, riming, accretion and self-collection of ice as a whole. This is achieved by multiplying their immediate effects :::::: process :::: rates with a factor between 0 :: 0.5 : and 2.

Editorial comments
• Lines 236-237: Would this sentence be better as: "For the kernel, an additive combination of the linear, polynomial, bias and exponential kernels was used (Duvenaud, 2014)."?
• Line 270: I think you can remove "thus".
• Line 396: I think there should be a comma after "decomposition".
AR: Thank you for pointing out these errors. We corrected them in the new version of the manuscript.

Reviewer Comment #2
RC: This study makes use of the technique of variance-based sensitivity analysis on an emulated perturbed parameter ensemble of the global aerosol-climate model ECHAM-HAM to understand the impact of perturbing the effect of selected cloud-ice microphysical processes on the model's output. For each process considered, a 'phasing parameter' is implemented to perturb the strength of the generated effect of a process (from 0 to 200%), and the study uses this 'phasing' as a proxy for the effect of process simplification. This is a novel use of the emulation and sensitivity analysis approach to assess model behaviour under uncertainty. The paper definitely falls within the scope of ACP and EGU, and is written to a high standard. However, there are several points that I believe need clarification (see specific comments below). In particular, I am concerned that the design and sampling for the PPE simulations does not provide the required coverage of the actual phasing effect for a completely robust analysis. The parameters are multiplicative factors but they are not treated as such in the PPE design, so the PPE has a very skewed coverage over the effect of 'phasing out' a process (see specific comment at Line 210-214, below). Because of this, the PPE looks to have very low coverage of training data where the 'phasing out effect' is strongest and the model response is likely to be greatest/more erratic (as the η i parameters move towards zero), and a much denser coverage of training points where the phasing out effect is weaker (0.5<η i <1) or there is over-estimation of a process due to an inaccurate description (η i >1). Given the low amount of training information for the emulator where the phasing out is strong, I'm not convinced that the emulator can properly capture this response in any kind of detail.
For a more robust conclusion, I would recommend (if possible -this would be a major revision) a re-design of the PPE input combinations to properly cover the parameter space for emulation and provide a more even sampling of the 'phasing out effect' for the sensitivity analysis. Once this and the further issues/comments below are addressed, I would recommend the publication of the manuscript in ACP.
AR: Thank you for your thoughtful and well explained feedback. Please find the answer to your comments below.

Line 7 (in abstract)
RC: 'The response to the phasing of a process thereby serves as a proxy for the effect of a simplification'. This sentence is confusing me in two ways. Firstly, what is meant by 'the phasing of a process'? -this is unclear (I realise this might be explained in the paper, but people will read the abstract first, so it's not clear at this point.). Also, the use of the word 'thereby' in this sentence is confusing -it suggests that the information in this sentence follows as a result of the sentence before it, but I don't think it is -it is a separate point with new information about the method/assumptions made. Please remove 'thereby' and re-phrase to clarify.
AR: In response to Reviewer #1 we have exchanged the "phasing" for "perturbation" and we have removed the "thereby" as you suggested: The response to the phasing ::::::::: perturbation : of a process thereby serves as a proxy for the effect of a simplification.

Line 13 (in abstract)
RC: Is this really a 'new framework'? Statistical emulation and sensitivity analysis have been used in several studies to assess process impacts in complex models of clouds, the atmosphere and the climate, as you have stated in the paper e.g. Line 239-240: 'This approach is similar to Johnson et al. (2015). . . ,' Please re-phrase.

Line 42-44
RC: 'Finally, the detail of . . . increases computational demand and thereby costs or inhibits other advancements such as the move towards higher resolution. . . '. Do parameterisations at a lower resolution still always hold at a higher resolution? Is it not the case that for a move to higher resolution, we need more detailed representations of processes? -So, to a reasonable extent, doesn't higher resolution and more detailed process descriptions have a dependence? -I don't think they are quite as independent as this sentence is suggesting -please clarify.

Line 63
RC: '. . . a simplified model equifinal to a more complex model'. What does 'equifinal to' mean here, when the difference is between 2 different models (simplified and complex)? -I'm not sure if this is the same as the definition for equifinality on page 2, where different parts of a model's parameter space lead to a similar observed state? Please clarify.

Line 68
RC: '. . . The influence of CMPs has been shown to dominate over that of aerosol schemes. . . ' I'm not sure this has always been the case for a GCM? For example, Regayre at al (2018) [ Figure 9] showed that both aerosol and physical atmosphere (cloud-related) model parameters are both important sources of uncertainty in aerosol ERF in the GCM HadGEM3-UKCA. Please update the text here to reflect this.

Line 112
RC: 'By phasing we mean that we vary the effectiveness of a given process, going from using 0 to 200% of a process's effect in the model'. This is quite a difficult concept to understand here in terms of how this can be done -I don't think all processes within a model could be easily 'phased'. What is meant by 'effectiveness'? How is it defined and is this 'effectiveness' the same or does it differ between inputs / processes? I realise that the next section (2) will bring more detail on this, but giving a small (brief) example or a little more detail here could provide a bit more clarity for the reader as a starting point.
AR: As you suggested below we have added a sketch to illustrate our point (new Fig. 2). Process accuracy does not correspond to the perturbation directly. You are right that a simplified process could reduce or enlarge the delta component, and this change would likely be variable in space and time. The perturbations we apply generate results, and from the shape of these results and the sensitivity analysis we can deduce how sensitive the model is to any changes in the process in question. This shape of the results/sensitivity is the proxy for what would happen if one simplified or approximated the process.

RC: This needs more clarity -how this parameter for each process can inform the need for model complexity is a key message from the study, so understanding how to interpret it is very important, yet it seems to be skipped over here. A diagram to help the reader picture what you mean (maybe with several different options as to how the parameter η i could be interpreted for a process) would be helpful, as it is not clear to me that the statement in this example (lines 203-205) is true, or how the parameter in general will inform us.
AR: As Reviewer #1 also pointed out, what we meant is not a sigmoidal function but a curve where the slope around η = 1 is 0, i.e. a critical point. We have corrected this in the manuscript and added a sketch to illustrate the point (Fig. 2 in the new manuscript).

Lines 210-214
RC: The scaling of the 'η' parameters here treats them as linear factors, but I don't think they are. I think each η i is a multiplicative factor, and as such, the phasing effect is not varying evenly over the η ranges, with it likely that there will be a much more significant effect on the model behaviour with very small values as an η approaches 0. [I think this is also an aspect of the cause of the large outlier at the very low η aggr in Fig 3?]. This is tricky to explain, but within your range of 0<η<2, 0.5<η i <1 corresponds to a scaling of the given process by 1 times (1x) to a half times (0.5x), covering a 'phasing reduction' of the process by up to 2 times (2x) smaller than its default effect over a range in η values of 0.5. But, lower down the η range, say 0.01<η<0.1, this covers a more significant reduction of 10 times smaller (0.1x) to 100 times smaller (0.01x) than the default effect, but within a much smaller range on η of size 0.09. Because of this, your PPE looks to have very low coverage of training data where the 'phasing out effect' is strongest and the model response is likely to be greatest/more erratic (as the η parameters move towards zero), and a much denser coverage of training points where the phasing out effect is relatively weaker (0.5<η i <1) or where you consider over-estimation (1<η i <2). In fact, designing the training points linearly between 0 and 2 leads to having approx. 50% of simulations with η>1 for each η parameter -so really concentrating on the parts of the ranges / 4-d parameter space that is to 'imitate an overestimation of a given process due to an inaccurate description' (Line 214). Is that what you intended? As, my understanding is that this area of the space isn't really the focus of the study (to understand how sensitive the model responses are to phasing out processes), so why sample it the most? I think this is a significant error in your PPE design. 
And this will also feed through to affect how you sample the phasing effects for the sensitivity analysis (concentrated away from a strong phasing out, and highly focussed on η i >1, if sampling uniformly). In most PPE studies, parameters like this are varied on a log 10 scale to account for the multiplicative behaviour. However, including zero in your range means a log 10 transform is difficult here (as log 10 (0) = -inf) -it might be better to only vary the parameters down to a small value close to zero (e.g. 0.001) so that a log 10 scaling could be used to even out the phasing effect over the η ranges. Given the low amount of training information for the emulator where the phasing out is strong, I'm not convinced that the emulator can properly capture this response in any kind of detail. If possible, for a more robust conclusion, I would recommend a re-design of the PPE input combinations in this way to properly cover the parameter space for emulation, provide a more even sampling of the 'phasing out effect' for the sensitivity analysis, and also provide more detail on how the model response changes as a process is phased out. If this is not possible, please at least acknowledge the assumption that has been made here -that you treat the η parameters as linearly varying factors -and note/describe/discuss here and in the results and discussion section how this is affecting your analysis and results.
AR: Thank you for pointing us to this error. We have revised the PPE design to sample logarithmically from 0.5 to 2 to address your concern. As you correctly point out, our focus is on the region around η = 1, which is why we now constrained the parameter space to a multiplicatively equal range of 2 on either side of η = 1. The extreme case of η = 0 is still covered by the one-at-a-time sensitivity tests displayed in e.g. Fig. 4, but it is no longer included in the PPE to avoid overinterpretation. Much of the analysis and results proved to be robust to this change. However, for the spatial analysis the emulators default to the prior mean for more variables. This is reasonable since the PPE now covers a smaller region of the phase space with a less intense signal.
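To make the revised design concrete, the log-uniform Latin hypercube over η ∈ [0.5, 2] can be sketched as follows. This is an illustrative pure-NumPy snippet, not the actual PPE scripts; the function name and the stratification details are our own:

```python
import numpy as np

def lhs_log_eta(n_members, n_params, lo=0.5, hi=2.0, seed=0):
    """Latin hypercube sample of perturbation factors eta, uniform in log space.

    Each eta is drawn log-uniformly in [lo, hi], so the design is
    multiplicatively symmetric around eta = 1 (a factor of 2 either way),
    rather than concentrating half the members on eta > 1 as a linear
    design over [0, 2] would.
    """
    rng = np.random.default_rng(seed)
    # stratify [0, 1) into n_members bins per parameter, one point per bin,
    # with the bin order shuffled independently for each parameter
    strata = rng.permuted(
        np.tile(np.arange(n_members), (n_params, 1)), axis=1).T
    u = (strata + rng.random((n_members, n_params))) / n_members
    # map uniform [0, 1) -> log-uniform [lo, hi)
    return lo * (hi / lo) ** u

eta = lhs_log_eta(48, 4)  # 48-member PPE over the four process perturbations
```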

Line 215
RC: The phrase 'sets of simulation input' is unclear. I think you mean 'the set of input parameter combinations (η 1 , η 2 , η 3 , η 4 ) to be simulated with the model'. Please clarify the text.

Line 237
RC: 'The input data was centred and whitened prior to emulation'. What exactly does that mean? Why is this needed, and how does it affect the emulator / surrogate model? Please give more detail. [There isn't enough detail here for someone to be able to replicate the analysis.]
AR: In addition, we hope that the scripts we supply on the open-access data repository Zenodo will help anyone replicate the analysis.
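For the record, a minimal sketch of what we mean by centring and whitening: each input column is shifted to zero mean and scaled to unit variance before the emulator is fitted. This snippet is illustrative and not taken from the Zenodo scripts:

```python
import numpy as np

def center_and_whiten(X):
    """Center each input column to zero mean and scale it to unit variance.

    Standard preprocessing before Gaussian-process emulation: it puts all
    parameters on a comparable scale so that a single set of kernel length
    scales behaves sensibly. Mean and std are returned so that new points
    can be transformed (and predictions back-transformed) consistently.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std
```

(Note that "whitening" is sometimes also taken to include decorrelating the columns via an eigendecomposition of the covariance; for a Latin hypercube design with independent parameters, per-column scaling is the relevant step.)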

Line 245
RC: '1-out validation'. This is an unusual term to describe this approach. Please change to 'Leave-one-out validation', here and elsewhere.
AR: Thank you for alerting us to this terminology. We have exchanged it accordingly.
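For readers unfamiliar with the procedure, leave-one-out validation refits the emulator n times, each time holding out one PPE member and predicting it. A generic sketch, with a toy least-squares surrogate standing in for the Gaussian process (names and the toy surrogate are illustrative, not our actual code):

```python
import numpy as np

def leave_one_out(X, y, fit, predict):
    """Leave-one-out validation: for each sample i, refit the surrogate on
    the other n-1 samples and predict the held-out one."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        model = fit(X[mask], y[mask])
        preds[i] = predict(model, X[i:i + 1])[0]
    return preds

# toy surrogate: ordinary least squares (the real emulator is a GP)
fit_ls = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict_ls = lambda w, X: X @ w
```

Comparing the held-out predictions against the simulated values (and the emulator variance) then gives the validation plots discussed below.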

Line 246-247
RC: In the brackets, please use the notation as it is in the formula. So, '(with Y sim and Y emu the output of the ECHAM-HAM simulations and the emulated output respectively, and V emu the emulator variance)'

AR:
We have adapted the text accordingly: In Fig. 4 a) and b), the individual standardized errors, (Y_sim − Y_emu)/√(V_emu) (with Y_sim and Y_emu the output of the ECHAM-HAM simulations and the emulated output, respectively, and V_emu the emulator variance), are plotted against the emulated output and input parameters.
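In code, the standardized errors and the usual calibration rule of thumb read as follows (illustrative only; the notation mirrors the formula above):

```python
import numpy as np

def standardized_errors(y_sim, y_emu, v_emu):
    """Individual standardized errors (Y_sim - Y_emu) / sqrt(V_emu).

    For a well-calibrated emulator roughly 95% of these should lie
    within [-2, 2], by analogy with a standard normal distribution.
    """
    return (np.asarray(y_sim) - np.asarray(y_emu)) / np.sqrt(np.asarray(v_emu))
```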

Lines 257-266
RC: I think this could also result in part from the PPE design and low coverage of large changes in the phasing amount at the low end of the parameter ranges (see comment [L210-214] above), which should also be acknowledged here.
AR: As we have improved the PPE design based on your comments above, this reasoning no longer applies here.
AR: We reworded the sentence, since what we meant was simply that η aggr = 0 has a larger effect than the other η i = 0: "Of the four perturbed processes, turning off autoconversion has the largest effect on model output: the global annual mean ice water path (IWP) is more than doubled, and the increase in cloud cover and decrease in precipitation dwarf the changes inflicted by turning off the other three processes."

Line 305
RC: 'The shape of the model response to the gradual phasing of the processes holds additional information: while the generated model response is mostly gradual, for low η aggr the response is more abrupt.' -This is also, in part, the effect of the uneven distribution of the 'phasing effect' over the parameter range (see comment [L210-214] above), which should be acknowledged here.
AR: As we have improved the PPE design based on your comments above, this reasoning no longer applies here.

Line 310
RC: 'As can be seen from Fig. 1 it (aggregation) is the only process that generates snow flakes. Accretion and riming need the snow flakes to be able to act upon them.' Does this mean that there is a dependence between the phasing parameters here? Is this a strong dependence? i.e. without quite a high value of η aggr , you cannot have (it isn't realistic to have) a high value of η accr or η rim ? -or can you have (is it realistic to have) a high value of η rim when η aggr is pretty small (just not zero)? If it is a strong dependence, then this would invalidate the assumptions of the variance-based sensitivity analysis (Sections 2.5,3.3) which assumes independence between inputs for the breakdown of the variance into its component parts.

AR:
The four parameters that we vary are independent: no matter what the modulation of the aggregation effect is, we can modulate the riming effect completely independently of it. Only when aggregation is turned off entirely does varying η rim have no effect, but this case is now excluded from the PPE design.

Line 363-364
RC: 'This was excluded from the sensitivity analysis as only the input parameter space with η aggr ≥ 0.5 was taken into consideration. . . ' Is it not more appropriate to consider the sensitivity analysis (SA) for a range of η i that doesn't go all the way down to zero anyway? Is it not the case that processes need to still be accounted for (they still need to be included in the model), but that you are investigating just how detailed or not (phased in or out) that representation needs to be? Why did you choose 0.5 here? Also, is the focus of the SA in terms of space sampled more on η i >1? Could this be biasing the SA results away from the effect that you really want to consider? (i.e. is it focussing much less on phasing out from the current full complexity at η=1, and more on the effects of increasing complexity / overestimation of a process?). How might this feed through to affect the inferences and conclusions made?

AR: In response to comment # 1.1.2 this sentence has changed substantially. To answer your question: yes, at this point of the analysis one could only derive statements on the global annual mean values.

Line 392
RC: '. . . analysis of grid-point level data is tedious and error prone. . . ' I agree that you have to create a lot of emulators to do grid-point analysis, so reducing the dimensions as you do in this section is definitely advantageous. But, why is the grid-point analysis 'error-prone'? Given the quite high level of uncertainty in emulator prediction from the emulators of the spherical harmonics here (Line 429 and Appendix D Figure D1 part d, where the points are rather widely scattered about the line of equality), is it really fair to infer that this approach is less error-prone? Also, is it possible to generate a plot (map) of how the sensitivities of parameters vary over the globe via the method used here, like you might with a grid-point level analysis? (I'm not suggesting you necessarily do this here -I'm just wondering if it is possible to do this. . . )

AR:
The emulation and subsequent sensitivity analysis of grid-point level data is more error-prone because the emulation method from Watson-Parris et al. (2021b) does not take spatial correlation into account. As each grid point is evaluated only by itself, there is more noise in the data, which is reduced by reducing the dimensions of the data set. Therefore, seeing that the validation of the spherical harmonics is already difficult because of a small signal, we assume that this problem would be exacerbated at the grid-point scale. We have specified this in the new manuscript as follows: Since the emulation and subsequent analysis of grid-point level data is tedious and error-prone due to the small signal and large noise, we compress the information in the data to a space of lower dimensionality.
AR: No, a map of how the sensitivities of the parameters vary over the globe is not possible with this method. This is because we apply the sensitivity analysis to the angular amplitude spectrum, which is a combination of all m for each l, and thus we cannot do the inverse transformation.
AR: You are right. However, with the new PPE design the small-scale emulation of CDNC does not validate, so the sentence was adjusted to the following: "The LWP is dominated by riming on all regional scales and on the global scale, while at some degrees l < 7 it is also heavily influenced by autoconversion."
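To illustrate why the inverse transformation is unavailable: the sensitivity analysis acts on a per-degree amplitude that sums over all orders m, so the m-dependence (and hence the map) is discarded. A sketch, assuming coefficients a_lm stored for m = 0..l of a real field; the exact normalisation convention here is illustrative, not our production code:

```python
import numpy as np

def per_degree_amplitude(alm):
    """Collapse spherical-harmonic coefficients a_lm (indexed [l, m],
    m = 0..l, entries with m > l zero) to one amplitude per degree l.

    Summing |a_lm|^2 over m is a many-to-one map: different spatial
    fields share the same per-degree spectrum, so no inverse transform
    back to a map exists.
    """
    L = alm.shape[0]
    return np.array([np.sum(np.abs(alm[l, :l + 1]) ** 2) for l in range(L)])
```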

Line 443
RC: '. . . as with the 48th member the computational constraint was too tight for the emulator' I don't understand what you mean by 'computational constraint was too tight'. Please clarify.
AR: In terms of practical understanding, what we mean to describe is that for some of the emulations, the GP could only be fitted to 47 members of the PPE. Adding a 48th member led to an overdescribed system, which in practical terms manifested as a non-invertible matrix error in the code. We have tried but failed to eliminate this error by adjusting the GP model properties, and understanding it further is beyond the scope of this paper, as it is embedded in the libraries underlying the software used.
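For context, a non-invertible matrix in GP fitting typically means the covariance matrix has become numerically singular; a common workaround is to add a small diagonal "jitter" before factorising. A generic sketch of that remedy (illustrative only; not from our code, and not necessarily applicable to the library in question):

```python
import numpy as np

def safe_cholesky(K, jitter=1e-10, max_tries=8):
    """Cholesky factorisation with escalating diagonal jitter.

    Adding eps * I makes a numerically singular covariance matrix
    positive definite; the GP effectively treats eps as a small amount
    of extra observation noise.
    """
    for k in range(max_tries):
        try:
            return np.linalg.cholesky(K + jitter * 10 ** k * np.eye(len(K)))
        except np.linalg.LinAlgError:
            continue
    raise np.linalg.LinAlgError("matrix not positive definite even with jitter")
```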

Figures 10 and 11
RC: Why do you only show the total sensitivity index? Isn't the First order effect more informative? With the total sensitivity index, if there are large interactions, or if the model output is quite noisy which can sometimes induce interaction in the indices that is not real, this index on its own can give a potentially skewed/false impression of the true sensitivity. To conclude robustly on the sensitivity, really you should present both indices. Can the first order index also be shown in Figures 10 and 11?
AR: Because the first order sensitivity index looked very similar to the total one, we did not include it for clarity/brevity of the Figures. For added transparency we have now included the first order sensitivity index plots in the Appendix.