Reply on RC2

ll. 19: To emphasize the originality of the dataset, I would also mention the simulation period and the model resolution like at l. 333 We thank the reviewer for the suggestion, this will be clarified in the revised manuscript. ll. 23: "Greenhouse warming will in particular alter...". I would rather write "Greenhouse warming in the model in particular alters..." and mention that model uncertainty is not considered here, as explained in the paragraph of line 356. We will implement the rewording recommended by the reviewer. Introduction: ll. 33: “spectral variance peaks”. The “peaks” are also relatively broad for different reasons contrary to spikes (Dirac’s). I would write “relatively sharp peaks”. We will make the change recommended by the reviewer. ll 53: "variance is repeated We will remove the redundant use of "variance". ll. 59: Explain why the results of the CMIP6 version of CESM2 is expected to be conclusive relative to "earlier studies" (l. 46). Perhaps based on what is explained in Section 2. A large number of improvements occurred with climate models since the CCSM4 version 1.4 used in the ensemble study of Zelle et al. (2005), where the effective atmospheric resolution there having been 3.75°x3.75°. In addition to greatly improved resolution, a number of improvements in process-resolution contribute to increased skill in representing variability across a broad range of timescales. This includes the importance of land processes, with changes in the CLM5 model documented below. For the atmosphere, a sampling of the many changes that have occurred are in boundary layer and shallow convection schemes, as well as representations of microphysics. New parameterizations in the atmospheric CAM6 model are described in the Danabasoglu et al. (2020) presentation paper for the CESM2 model. We will add appropriate text to convey these points in the revised text.

We thank the reviewer for the suggestion, this will be clarified in the revised manuscript.
ll. 23: "Greenhouse warming will in particular alter...". I would rather write "Greenhouse warming in the model in particular alters..." and mention that model uncertainty is not considered here, as explained in the paragraph of line 356.
We will implement the rewording recommended by the reviewer.
Introduction: ll. 33: "spectral variance peaks". The "peaks" are also relatively broad for different reasons contrary to spikes (Dirac's). I would write "relatively sharp peaks".
We will make the change recommended by the reviewer.
ll 53: "variance is repeated We will remove the redundant use of "variance". ll. 59: Explain why the results of the CMIP6 version of CESM2 is expected to be conclusive relative to "earlier studies" (l. 46)

. Perhaps based on what is explained in Section 2.
A large number of improvements occurred with climate models since the CCSM4 version 1.4 used in the ensemble study of Zelle et al. (2005), where the effective atmospheric resolution there having been 3.75°x3.75°. In addition to greatly improved resolution, a number of improvements in process-resolution contribute to increased skill in representing variability across a broad range of timescales. This includes the importance of land processes, with changes in the CLM5 model documented below. For the atmosphere, a sampling of the many changes that have occurred are in boundary layer and shallow convection schemes, as well as representations of microphysics. New parameterizations in the atmospheric CAM6 model are described in the Danabasoglu et al. (2020) presentation paper for the CESM2 model.
We will add appropriate text to convey these points in the revised text.

ll. 78: Missing punctuation
The problem with the missing punctuation will be fixed.
ll. 81: Is the resolution for the POP model also 1 degree? It is important to know if it is eddy-resolving since this might also effect the atmospheric variability.
The nominal resolution of the POP model is 1-degree, with enhanced resolution in the meridional direction near the equator as well as in the Southern Ocean. We will state this more clearly in the revised text.
ll. 91 -paragraph: How are these improvements measured? To which extent does the land model (fire model and agricultural management in particular) rely on assumptions regarding human behavior in the future (for instance in resopnse to climate change)?
We thank the reviewer for pointing out that we were not sufficiently clear in the paragraph starting on line 91, in particular in describing the strengths of the land model and the benchmarking that has been done to assess skill in the model. We will reorganize the paragraph at line 91 to say: "An important advance of great value to Large Ensemble investigations is achieved through new developments incorporated into the Community Land Model Version 5 (CLM5) (Danabasoglu et al., 2020;Lawrence et al., 2019;Lombardozzi et al., 2020). This model addresses a number of well-known limitations relative to previous versions of CLM, including enhanced simulated cumulative CO2 uptake over the historical period (Bonan et al., 2019) and the seasonal cycle of net ecosystem production (NEP; Lawrence et al, 2019), which is highlighted in our analysis of phenology changes. Improvements in CLM are found across a broad range of simulated variables that have been documented through evaluation of model simulations against the International Land Model Benchmarking (ILAMBv2.1) package and other analyses (Collier et al., 2018;Danabasoglu et al. 2020;Lawrence et al., 2019;Wieder et al., 2019). Notable features also included in CLM5 are the explicit representation of agricultural management and improvements in the implementation of the prognostic fire model (Lombardozzi et al, 2020;Lie et al., 2013;Li and Lawrence, 2017). We note that land model trajectories are sensitive to the SSP-RCP scenarios that determine the spatial distribution and extent of land use and land cover change, as well as climate change scenarios (O'Neill et al., 2016). "

Results:
ll. 169: Could you explain what motivated the choise of these observables? I guess one factor is the relationship between the observables and climate-change impact, but this is not obvious.
The variables we chose for Fig. 2 reflect interest in both climate/ecosystem dynamics and societal relevance in terms of adaptation and resource management. We will state this more emphatically in our revisions.
ll. 169: Instead of the Fourier transform of the observable, why not use an estimate (e.g. periodogram) of the power spectrum which can be directly related to the variance that you use in Figure 1 (as the integral of an adequately normalized power spectrum)? The variance is also used in Figure 3. If this is in fact what you are doing, please make it clearer.
In the caption for Fig. 2 and in l.169 we specified "amplitude spectrum", which is the square root of the power spectrum. Please note that Fig. 1, Fig. 3e and Fig. 3f are crossensemble standard deviations using annual mean data (the annual mean is calculated first for each member, and then the standard deviation is taken across ensemble members for the same year). Furthermore, they are not representing "variance". Thus, even if we use power spectral density (PDS) in Fig. 2, the integral of PSD is not equivalent to what is shown in either Fig. 1 or Fig. 3. We chose the amplitude spectrum in Fig. 2 to have a better understanding of the amplitudes of perturbations at different timescales, and to be consistent with the standard deviations shown in the other figures. Also, if PDS were used, the already strong annual cycle and its harmonic peaks in the amplitude spectrum would become too large, making other spectral behaviors relatively less variable.
ll. 169: To avoid spectral leakage, a window should be applied before the FFT, is this the case?
The point is well-taken, but the large degree of aggregation used for Fig. 2 (both across ensemble members but also spatial aggregation) alleviates the need for windowing. We will state this more clearly in our revised text. (2070-2099 and 1960-1989)?

ll. 171 and 172: Why 35 years and not 30 years
We thank the reviewer for catching this typo, we will modify "35 years" to say "30 years". ll. 173: If the power spectrum is computed, an alternative would be to first compute correlation functions for each member, average over the members and then do the FFT. I do not know which estimation method has the best properties, but could you explain why you made this choice?
There is a disparity between dynamical characteristics (such as power spectra) and statistical characteristics (like the natural measure) in that we can -for now -define only the latter in an instantaneous/shapshot or pullback sense. Given that there is no reference for the power spectrum with respect to which biases are defined it is not really possible to rank different possibly estimators. We went with what appears to be the most intuitive estimator: the E-mean of the temporally computed power spectra (or FFT amplitudes). A benefit of this quantity is that nonergodicity could be naturally defined considering the difference between this and the correct, conceptually sound quantity.

Figure 2: What are the units of the spectral amplitudes given the observables?
Please find this in the caption for Fig. 2 in the original submitted manuscript, where we state: "Spectra are shown as amplitude, with the units being the same as the x-axis for the PDFs." ll. 183: Even if I do not think that it is necessary to add confidence intervals to all panels and for all frequencies or to test the significance of the differences between spectra, could you give an estimate of what would be the width of these confidence intervals given that data that you use (in the supplementary material for instance)? This would also make this section more coherent with the part on wavelets.
We will provide a 95% confidence interval for Niño3.4 precipitation in the Supplementary Material, and provide a figure. The confidence interval is simply estimated as 1.96 multiplied by standard error of 100 spectra at each frequency. In this calculation, we first spatially average spectra at individual grid points over the Niño3.4 region for each member, and we treat the sampling size N=100 for the 100 ensemble members. We believe that this is a somewhat conservative approach. If we assume that all grid points can be sampled, the confidence interval becomes much narrower as the sampling number becomes much larger (N=100*number_of_gridpoints).

ll. 200 and ll 208: Although a scalar observable can techincally be seen as a bilinear form, I would reserve the term positive definite for non-trivial bilnear forms (e.g. represented by non-scalar matrices) and simply write "positive variables"
We thank the reviewer for making this point, and we wil use the term "positive variables" as suggested.

ll. 202: This is not true for all stochastic processes. I guess you mean for a Brownian motion?
We will modify the text from saying "stochastic processes" to say "white noise processes", we thank the reviewr for expressing the need to be clearer here.

Could you clarify what is meant by "cross-ensemble" everywhere this expression is used?
Yes, we will do this.

ll. 256: "minimum m" => "minimum in"
This is correct, we will make the change as suggested.
ll. 265: same as for ll. 251. . "cross-ensemble calculations applied for identical time records for each ensemble member" is not clear to me. The standard deviation is computed from a sample combining all members and all years for a given period (historic or future)? Based on the caption, I guess not, the standard deviation is computed over the ensemble and then time-averaged over the period. Could you clarify and explain why you made this choice and not the other?
We apologize for not being clearer, we will clarify this in the text. The inference here made by the reviewer is correct, the standard deviation was first computed across the ensemble members for the same time slice, and then time averaging was conducted. This was done out of pragmatism, as we were looking for convergence (or non-convergence) with the number of years included through the averaging procedure.

ll. 310: "and" => "an" I guess
Yes, this is correct, we will change this.

Figure 5: Do the histograms aggregate ensemble members for a single year only or is there also an aggregation of the 20 years in the interval (in which case the histograms would include the 20y-trend)?
The aggregation is done for individual years, we will clarify this in the text. The width of binning for the histograms is 1-day, we will clarify this in the text, we thank the reviewer for pointing out that this wasn't clear in the reviewed manuscript.
ll. 318: Since the spacing between the vertical grid lines in Fig. 5 represents an interval of 10 days, it seems to me that the shift in the onset is closer to 3 weeks or even 4 weeks than 2 weeks. Am I wrong?
We thank the reviewer for raising this question, we will replace the mis-statement of the duration of the onset shift to reflect that it is in fact three weeks. Comparing the mean over 1860-1869 and the mean over 2090-2099, the day of the first zero-crossing shifts 20.9 days and the day of the termination shifts 6.3 days.
ll. 322: How do you "measure" the interannual variability? Even if you read the histograms by eye, I guess that you have some metric representation of the spread in mind, such as the standard deviation. In fact, if you use the distance between the minima and the maxima, the interannual variaiblity appears comparable to the trend to me.
We will clarify this in the text, we used one standard deviation to measure the variabiltiy.
ll. 325: Same question as for ll. 322: Which measure do you use to obtain these percentages?
We thank the reviewer for raising this point, as we were insufficiently clear in the text. The calculations were performed year-by-year, where the transitions (zero crossing) were calculated for each member across the full set of 90 ensemble members as this