Comparing different generations of idealized solar geoengineering simulations in the Geoengineering Model Intercomparison Project (GeoMIP)

Solar geoengineering has been receiving increased attention in recent years as a potential temporary solution to offset global warming. One method of approximating global-scale solar geoengineering in climate models is via solar reduction experiments. Two generations of models in the Geoengineering Model Intercomparison Project (GeoMIP) have now simulated offsetting a quadrupling of the CO2 concentration with solar reduction. This simulation is idealized and designed to elicit large responses in the models. Here we show that energetics, temperature, and hydrological cycle changes in this experiment are statistically indistinguishable between the two ensembles. Of the variables analyzed here, the only major differences involve


Introduction
Solar geoengineering describes a set of technologies designed to (ideally) temporarily, deliberately reduce some of the effects of climate change by changing the radiative balance of the planet, often by reflecting sunlight back to space (NRC, 2015).
Numerous methods have been proposed, but the most studied is stratospheric sulfate aerosol injection (Budyko, 1977; Crutzen, 2006). This method involves substantially increasing the stratospheric sulfate aerosol burden, replicating the mechanisms that cause cooling after large volcanic eruptions (Robock, 2000), although one might expect different climate responses from pulse versus sustained injections. Climate models are currently the only tools for understanding the climatic consequences of solar geoengineering. In model simulations of solar geoengineering, insolation reduction is often used as a proxy for actual stratospheric sulfate aerosols, as it captures many of the broad radiative effects of stratospheric aerosol geoengineering as well as some of the important climate effects like surface cooling and hydrological cycle strength reduction (Kalidindi et al., 2015). However, stratospheric sulfate aerosols also absorb longwave radiative flux, which heats the upper troposphere and lower stratosphere. As such, any implementation of stratospheric geoengineering with sulfate aerosols would produce additional effects, such as changing atmospheric circulation in response to stratospheric heating and heating gradients (e.g., Richter et al., 2017; Tilmes et al., 2018; Simpson et al., 2019) and stratospheric ozone changes (e.g., Pitari et al., 2014), as well as changes in ultraviolet radiative flux and enhanced diffuse radiation at the surface (Madronich et al., 2018). However, here we consider the major, large-scale effect of reflecting sunlight to cool Earth.
Simulations of solar geoengineering with solar reduction have long shown that solar geoengineering would cool the planet, offsetting global warming (e.g., Govindasamy and Caldeira, 2000; NRC, 2015; Irvine et al., 2016), although there would still be residual regional effects. Idealized solar reduction experiments have also been conducted in a multi-model context under the Geoengineering Model Intercomparison Project (GeoMIP; Kravitz et al., 2011), to understand the robust model responses to various standardized solar geoengineering simulation designs. Multi-model conclusions from these studies indicate that solar geoengineering would be effective at partially offsetting greenhouse gas-induced temperature changes, as well as changes in the hydrological cycle (Tilmes et al., 2013), the cryosphere, extreme events (Curry et al., 2014; Aswathy et al., 2015), vegetation (Glienke et al., 2015), circulation (Guo et al., 2018; Gertler et al., 2020), agricultural yield potential (Xia et al., 2014), and numerous other areas. However, the offset is not exact (Moreno-Cruz et al., 2012), particularly on a regional basis or when considering multiple simultaneous metrics of climate change (Irvine et al., 2019), leading to concerns about winners and losers from geoengineering (Ricke et al., 2010). To some extent, the effects of solar geoengineering may be tailored or designed (Kravitz et al., 2016, 2017, 2019), but solar geoengineering will still not be able to completely offset climate change from greenhouse Earth System Models (Kravitz et al., 2015). As such, this is an opportunity to revisit some central questions in solar geoengineering. Many of the CMIP5 results regarding solar geoengineering showed substantial agreement across the participating GeoMIP models. In this newest iteration of GeoMIP, do the same science conclusions still hold, and do the models still generally agree on the resulting climate effects?
Here we address these questions by evaluating and comparing general climate model response to GeoMIP experiment G1 (described in the next section) from both CMIP5 and CMIP6.

Simulations and Participating Models
In this study, we evaluate GeoMIP experiment G1, in which, starting from a preindustrial control (piControl) baseline, the atmospheric CO2 concentration is instantaneously quadrupled (the standard CMIP experiment abrupt4xCO2), and insolation is simultaneously reduced such that net top-of-atmosphere (TOA) radiative flux is within ±0.1 W m−2 of the baseline value in the first decade of simulation (Kravitz et al., 2011, 2015). This experiment was part of the original suite of GeoMIP experiments and was repeated and extended in the newest suite in an effort to understand the role of model structural uncertainty in broad conclusions about solar geoengineering. Participating models are listed in Table 1. We include 13 models from CMIP5 and 7 models from CMIP6. Experiment G1 is an idealized experiment aimed at understanding physical climate response, and is not intended as a proposed real-world geoengineering implementation. Although G1 should not be used directly for impacts analysis, improved understanding of climate model response to G1 will increase confidence when evaluating more policy-relevant scenarios.

The original G1 experiment was 50 years in length, whereas the CMIP6 version is 100 years in length to allow for better analyses of rare events and to capture very slow responses. Comparison between the two ensembles necessitates only using the first 50 years, but we need to verify that this can be done without losing important longer-term evolution in features. Figures 1 and 2 look at G1 behavior over the entire 100-year period of the CMIP6 simulations to determine whether there is any drift or steady state error that would not be revealed by only analyzing the first 50 years. (Also see Table 2 for quantitative information.)

Over years 11-100 of simulation, CNRM-ESM2.1 and IPSL-CM6A-LR show negative temperature trends greater than 0.1 K/decade in magnitude, and CESM2(WACCM) and UKESM1.0-LL show positive trends of similar magnitudes. This is despite no model showing a trend in net TOA radiative flux greater in magnitude than 0.02 W m−2/decade. Beyond an initial transient period, CESM2(WACCM), CNRM-ESM2.1, and IPSL-CM6A-LR show approximately 0.06%/decade trends in precipitation and evaporation of the same sign as the temperature trends. Nevertheless, the differences in temperature and hydrological cycle change due to experiment G1 are orders of magnitude greater than the calculated values in Table 2. As such, we conclude that our choice to focus on the first 50 years of simulation does not appreciably affect our results. Figure 2 shows that many of the models have low frequency variability that appears in the different regions plotted here.
For the region north of 30°N, IPSL-CM6A-LR has a steadily increasing temperature, possibly related to a slight trend in sea ice coverage (Boucher et al., 2020).
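The per-decade trends discussed above can be illustrated with an ordinary least-squares fit to annual means. The following is a minimal sketch only: the function name `trend_per_decade`, the choice of a simple linear fit, and the synthetic series are assumptions for illustration, not the procedure used to produce Table 2.

```python
def trend_per_decade(values, first_year=11):
    """Least-squares linear trend of annual means, in units per decade.

    `values[0]` is taken to be the annual mean for `first_year`
    (a hypothetical convention chosen for this sketch).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two annual means")
    years = [first_year + i for i in range(n)]
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    # The slope is per year; multiply by 10 to express it per decade.
    return 10.0 * sxy / sxx


# Synthetic series for years 11-100 with an imposed drift of 0.012 K per year.
synthetic = [287.0 + 0.012 * i for i in range(90)]
drift = trend_per_decade(synthetic)
print(f"{drift:.3f} K/decade")
```

A trend computed this way can be compared directly against the size of the G1 response to judge whether drift matters for the analysis period.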

IPSL-CM6A-LR is also known to have a bicentennial oscillation, which could affect G1-piControl differences, depending on the baseline period used for subtraction. To verify that this oscillation is not impacting our results, we divided that model's 1200-year piControl run into 50-year chunks and computed the surface air temperature average for each of those chunks.
The largest temperature found was 286.0339 K, and the smallest was 285.6384 K. The average over the entire ensemble was 285.8604 K. As such, using the mean of the entire ensemble versus matching the appropriate period in the bicentennial oscillation would have an impact on G1-piControl temperature by at most 0.22 K. Only averaging the first 100 years of the piControl run (which may be the best match to the period covered by G1) yields a temperature of 285.9084 K, which is 0.048 K different from the mean of the entire piControl run. As such, we conclude that this bicentennial oscillation is unlikely to have substantially influenced our findings.
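The chunking check described above reduces to simple arithmetic, sketched here using only the values quoted in the text. The helper names `chunk_means` and `max_baseline_offset` are hypothetical, and the piControl time series itself is not reproduced.

```python
def chunk_means(annual_means, chunk_len=50):
    """Mean of each consecutive, non-overlapping chunk of `chunk_len` years."""
    n_chunks = len(annual_means) // chunk_len
    return [
        sum(annual_means[i * chunk_len:(i + 1) * chunk_len]) / chunk_len
        for i in range(n_chunks)
    ]


def max_baseline_offset(chunk_values, full_mean):
    """Largest absolute gap between any chunk mean and the full-run mean."""
    return max(abs(c - full_mean) for c in chunk_values)


# Values quoted in the text (K): warmest chunk, coolest chunk, full-run mean.
warmest, coolest, full_mean = 286.0339, 285.6384, 285.8604
offset = max_baseline_offset([warmest, coolest], full_mean)

# Mean of the first 100 years of piControl, also quoted in the text (K).
first_century = 285.9084
century_offset = first_century - full_mean

print(round(offset, 2), round(century_offset, 3))
```

The larger of the two gaps comes from the coolest chunk, which is what sets the 0.22 K upper bound on the baseline sensitivity.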
Per the results in Figure 1, IPSL-CM6A-LR and GISS-E2.1-G appear to have a different responsiveness of the hydrological cycle to the combined CO2-solar forcing than the other models. We are reluctant to attribute this feature to any potential shortcomings or lack of fidelity to observations because there are no observations of this type of experiment. Although these models are outliers, there is no evidential basis on which to assume they are more or less valid than the other models for this study.
Because the main focus of this paper is a comparison between the CMIP5 and CMIP6 generations of model results, we have opted for the following to aid comparisons:
- Since we are not evaluating any features that require 100 years of statistics, and the results do not show any appreciable time evolution of behavior after the first couple of years (see discussion above), we only evaluate the first 50 years of all simulations. All maps show changes over years 11-50, removing the initial transient period.
- We do not compare previous versions of individual models with current ones, instead only examining ensembles. Even though models may share similar development histories (e.g., atmosphere and ocean dynamical cores, convective parameterizations, radiative transfer modules, terrestrial biosphere and cryosphere; Knutti et al., 2013; Zelinka et al., 2020), there have been numerous developments in models in these areas (and others) between CMIP5 and CMIP6 such that in most cases a direct comparison would not be meaningful.
- We focus extensively on the G1 results and, with few exceptions, do not focus on the corresponding abrupt4xCO2 simulations. It has been well documented that the CMIP6 models tend to have higher climate sensitivities than the CMIP5 models (Flynn and Mauritsen, 2020; Meehl et al., 2020; Zelinka et al., 2020), so we do not wish to draw conclusions that might be based on a form of selection bias.
- On all map plots, lack of stippling indicates agreement on the sign of the response in at least 75% of models, as in previous GeoMIP studies. Because G1 CMIP5 has more participating models than G1 CMIP6, this threshold provides some consistency across analyses of the ensembles. When plotting differences between the ensembles (G1 CMIP6 - G1 CMIP5), there is no stippling, as it is difficult to meaningfully represent such differences between ranges.
Aggregate differences between the two ensembles, as calculated using Welch's t-test or differences in stippled area, are discussed in Table 3.
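The two aggregate checks mentioned above (sign agreement across at least 75% of models, and Welch's t-test) can be sketched as follows. This is an illustrative sketch only: the function names, the treatment of exact zeros as agreeing with neither sign, and the example values are assumptions. Only the t statistic and the Welch-Satterthwaite degrees of freedom are computed here; converting them to a p-value requires a t-distribution routine, which is not included.

```python
import math
from statistics import mean, variance


def sign_agreement(values, threshold=0.75):
    """True if at least `threshold` of models share the sign of the response."""
    n_pos = sum(1 for v in values if v > 0)
    n_neg = sum(1 for v in values if v < 0)
    return max(n_pos, n_neg) / len(values) >= threshold


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (unequal variances).
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof


# Hypothetical per-model responses at one grid point (not real model output).
agree_13 = sign_agreement([0.3] * 10 + [-0.2] * 3)   # 10 of 13 models agree
agree_7 = sign_agreement([0.3] * 5 + [-0.2] * 2)     # 5 of 7 models agree
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

With a 75% threshold, a 13-member ensemble needs at least 10 models to agree on sign, whereas a 7-member ensemble needs at least 6, which is one way the smaller ensemble can show more stippling for the same underlying spread.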

Energetics
Ensemble mean radiative and turbulent flux quantities are plotted in Figure 3, and the ensemble ranges are plotted in Figure 4.
An immediate observation is that, in both ensembles, the models were successful at limiting net TOA radiative flux change to within approximately ±0.1 W m−2 of the models' respective preindustrial values. Accomplishing this required an average solar reduction of 4.14% (model range 3.20-5.00%) in CMIP5 and 4.14% (3.72-4.91%) in CMIP6. As such, despite numerous structural changes between the two generations of models, there is no appreciable change in solar efficacy (Hansen et al., 2005).
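As a sketch of how the quantities in this paragraph relate, the solar reduction can be expressed as the percent change in incident TOA solar irradiance between G1 and piControl, and the balance criterion as a tolerance check on the net TOA flux change. The function names and the numerical inputs below are hypothetical and chosen only for illustration.

```python
def solar_reduction_percent(insolation_g1, insolation_picontrol):
    """Percent reduction in TOA incident solar irradiance, G1 vs piControl."""
    return 100.0 * (insolation_picontrol - insolation_g1) / insolation_picontrol


def toa_balanced(net_toa_g1, net_toa_picontrol, tolerance=0.1):
    """True if the net TOA flux change is within +/- `tolerance` W m-2."""
    return abs(net_toa_g1 - net_toa_picontrol) <= tolerance


# Hypothetical global-mean values (W m-2), not taken from any model.
pct = solar_reduction_percent(326.0, 340.0)
balanced = toa_balanced(0.35, 0.30)
print(f"{pct:.2f}% reduction, balanced={balanced}")
```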
None of the radiative flux quantities indicate large transients over 50 years of simulation of G1, other than the initial flux change within the first year or so of simulation. This is consistent with the "perpetual fast response" found by Kravitz et al. (2013b): because global mean temperature does not change appreciably over the course of the G1 simulation, climate feedbacks are not excited, and the internal state of the system (as measured by, for example, fluxes and hydrological cycle changes) similarly does not change. Ensemble mean fluxes show few differences (<1 W m−2 in magnitude) with the exception of shortwave cloud forcing, defined as all-sky minus clear-sky shortwave flux at the surface. On average, the CMIP6 ensemble has 3-4 W m−2 less shortwave cloud forcing than CMIP5. Neglecting some outliers, for each flux except shortwave (and hence total) cloud forcing, the median model in one ensemble is within the inter-quartile range of the other ensemble. This indicates that there are no major differences between the ensembles in how the models handle energy balance and energetics, with the exception of clouds, which is consistent with findings about CMIP6 (Zelinka et al., 2020). Moreover, it appears that most of the major differences in shortwave cloud forcing are due to outliers in each ensemble, positive for CMIP5 and negative for CMIP6. To further explore these potential differences, Figure 5 provides maps of the ensemble means for cloud forcing.
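The median-within-interquartile-range comparison used above can be sketched as follows. The function names, the "inclusive" quartile convention, and the example ensembles are assumptions; the paper does not state which quartile convention underlies its box plots.

```python
from statistics import median, quantiles


def shortwave_cloud_forcing(all_sky_sw, clear_sky_sw):
    """Shortwave cloud forcing: all-sky minus clear-sky surface SW flux."""
    return all_sky_sw - clear_sky_sw


def median_within_iqr(ensemble_a, ensemble_b):
    """True if the median of `ensemble_a` lies inside the IQR of `ensemble_b`."""
    q1, _, q3 = quantiles(ensemble_b, n=4, method="inclusive")
    return q1 <= median(ensemble_a) <= q3


def ensembles_consistent(ensemble_a, ensemble_b):
    """Apply the median-within-IQR check in both directions."""
    return (median_within_iqr(ensemble_a, ensemble_b)
            and median_within_iqr(ensemble_b, ensemble_a))


# Hypothetical per-model global-mean flux changes (W m-2), illustration only.
older = [1.0, 2.0, 3.0, 4.0, 5.0]
newer = [2.5, 3.0, 3.5]
shifted = [10.0, 11.0, 12.0]
```

Checking both directions matters because a narrow ensemble can sit inside a wide one without the reverse being true.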

In G1, the CMIP5 ensemble showed more positive shortwave cloud forcing and more negative longwave cloud forcing (i.e., more cancellation) than the CMIP6 ensemble. Overall, the CMIP6 ensemble has greatly reduced (in some places by over 10 W m−2) shortwave cloud forcing as compared to CMIP5 under the G1 experiment. This is a widespread result, but the most prominent features are in the tropics, especially over the Amazon, Africa, and the Maritime Continent. These regions encompass tropical forests, indicating a potential for vegetation feedbacks on the temperature reductions. However, the reasons behind these forcing changes are difficult to diagnose, as they could be due to changes in cloud thickness, cloud cover, or cloud level between CMIP5 and CMIP6 models (e.g., Vignesh et al., 2020), differences in how solar geoengineering affects clouds (Russotto and Ackerman, 2018), or artifacts of the analyses (e.g., cloud masking; Andrews et al., 2009; Kravitz et al., 2013b).
Moreover, based on the results in Figure 4, it is likely that many of these features are exaggerated by outlier models (also see Vignesh et al., 2020). As such, we reserve such detailed investigations for future work.

Temperature
These small flux changes also lead to few G1 temperature changes between the two ensembles. Figure 6 shows global, land, and ocean-averaged temperatures for the CMIP5 and CMIP6 ensembles. In general, the abrupt4xCO2 simulation in CMIP6 has higher temperatures than in CMIP5, consistent with the noted increase in climate sensitivity (Vial et al., 2013; Flynn and Mauritsen, 2020; Meehl et al., 2020; Zelinka et al., 2020). In both ensembles, G1 is effective at offsetting global mean temperature change, in some cases with a slight positive residual temperature change over land. Figure 7 shows three aggregate temperature metrics: global mean temperature (T0), the interhemispheric temperature gradient (T1), and the equator-to-pole temperature gradient (T2) (Ban-Weiss and Caldeira, 2010; Kravitz et al., 2016): where A is area. As for the fluxes, the median model in one ensemble is within the inter-quartile range of the other ensemble.

This indicates that neither ensemble is on average warmer or cooler than the other, has a substantially warmer Northern or Southern Hemisphere than the other, or has warmer tropics or poles than the other. We can conclude that spatial patterns of temperature change from G1 are robust across a wide range of structural uncertainty, including an increase in climate sensitivity between the two generations of CMIP.
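For readers who want to compute comparable metrics, one formulation often used for T0, T1, and T2 is the area-weighted projection of zonal-mean temperature onto the first three Legendre polynomials in the sine of latitude: 1, sin(lat), and (3 sin²(lat) − 1)/2. The sketch below assumes that form; the defining equations are not reproduced in the text here, so the basis functions, the cos(lat) area weighting on a regular latitude grid, and the sign convention (T1 positive when the Northern Hemisphere is warmer) should all be treated as assumptions.

```python
import math


def temperature_metrics(lat_deg, zonal_mean_t):
    """Return (T0, T1, T2) from zonal-mean temperature on latitude bands.

    Assumed definitions (area-weighted means over the globe):
      T0: weight 1                      (global mean)
      T1: weight sin(lat)               (interhemispheric gradient)
      T2: weight (3 sin^2(lat) - 1)/2   (equator-to-pole gradient)
    Band area is approximated by cos(lat) for equally spaced latitudes.
    """
    weights = [math.cos(math.radians(lat)) for lat in lat_deg]
    total = sum(weights)
    t0 = t1 = t2 = 0.0
    for lat, w, t in zip(lat_deg, weights, zonal_mean_t):
        s = math.sin(math.radians(lat))
        t0 += w * t
        t1 += w * t * s
        t2 += w * t * 0.5 * (3.0 * s * s - 1.0)
    return t0 / total, t1 / total, t2 / total


# One-degree band centres from 89.5 S to 89.5 N.
lats = [-89.5 + i for i in range(180)]

# A uniform field has no gradients; a field equal to sin(lat) is purely
# interhemispheric and should give T1 close to 1/3.
uniform = temperature_metrics(lats, [1.0] * 180)
tilted = temperature_metrics(lats, [math.sin(math.radians(x)) for x in lats])
```

Under these definitions, a negative T2 corresponds to tropics that are warm relative to the poles, so a change toward less negative T2 indicates a weakened equator-to-pole contrast.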
The spatial structure of temperature change (Figure 8) does have small differences between the two ensembles. G1 in CMIP6 has multiple locations that are warmer than G1 in CMIP5, despite both ensembles achieving net energy balance at TOA and the surface (Figure 3). The majority of the differences are over land and in the tropics, where CMIP6 is slightly warmer than CMIP5 (up to 1 °C in some places). Nevertheless, both ensembles show the well noted feature that offsetting a CO2 increase with globally uniform solar reduction overcools the tropics and undercools the poles (Govindasamy and Caldeira, 2000). CMIP6 shows slightly less high latitude warming than CMIP5, but temperature differences between the two ensembles are largely negligible. However, the warmer temperatures in CMIP6 near Greenland have important implications for ice sheet melt and consequent sea level rise, as well as bottom water formation. We reserve such analyses for future investigations, particularly since the models used here are not capable of simulating the eustatic component of sea level rise. In any case, these ensemble mean differences between CMIP5 and CMIP6 cannot be deemed statistically significant (Table 3 and Figure 7).
Maritime Continent in G1 CMIP6. Evaporation in the two ensembles is nearly identical except for more evaporation in Amazonia and Australia in G1 CMIP6. As such, the net P-E change between the two ensembles strongly resembles the precipitation changes. Figure 10 shows that, like previous evaluations of ensemble ranges, the median model in one ensemble falls well within the interquartile range of the other ensemble for P, E, and P-E. As such, we cannot conclude any robust hydrological cycle changes between the two ensembles.
Figure 11 shows average (years 11-50) temperature change (with respect to piControl) plotted against average precipitation change for each model, as in Tilmes et al. (2013). Other than a potentially greater climate sensitivity of some CMIP6 models, there is no distinguishable difference in aggregate behavior between the two ensembles. The conclusion reached by Tilmes et al. (2013) still holds: solar reduction cannot simultaneously offset CO2-induced changes in both global mean temperature and global mean precipitation.
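The per-model points in such a scatter plot can be derived as sketched below: a years 11-50 mean for each variable, an absolute difference from piControl for temperature (K), and a percent difference for precipitation. The function names and the synthetic series are hypothetical and used only to show the arithmetic.

```python
def mean_over_years(annual_values, first_year=11, last_year=50):
    """Mean over simulation years first_year..last_year (1-based, inclusive)."""
    window = annual_values[first_year - 1:last_year]
    return sum(window) / len(window)


def temperature_change(t_g1, t_picontrol_mean):
    """G1 minus piControl global-mean temperature (K), years 11-50."""
    return mean_over_years(t_g1) - t_picontrol_mean


def precipitation_change_percent(p_g1, p_picontrol_mean):
    """G1 minus piControl global-mean precipitation, in percent of piControl."""
    return 100.0 * (mean_over_years(p_g1) - p_picontrol_mean) / p_picontrol_mean


# Synthetic 50-year series: a 10-year transient followed by a steady state.
t_series = [288.0] * 10 + [287.1] * 40      # K
p_series = [3.0] * 10 + [2.85] * 40         # mm/day
d_t = temperature_change(t_series, 287.0)
d_p = precipitation_change_percent(p_series, 3.0)
print(f"dT = {d_t:.2f} K, dP = {d_p:.1f} %")
```

In this synthetic case the temperature residual is small while precipitation is reduced by several percent, the kind of combination that makes it impossible to offset both quantities at once.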

As an integrator of CO 2 , temperature, and precipitation effects over land, Figure 12 shows changes in terrestrial net primary productivity (NPP

Discussion and Conclusions
Based on the results presented here, model response to G1 has not changed substantially between CMIP5 and CMIP6, despite numerous changes to models between the two generations, including an increase in climate sensitivity. The sign of residual climate impacts (for example in temperature) is in better agreement in CMIP5 than CMIP6 (Table 3 shows a difference in stippled area between the two ensembles), but this could be a function of the smaller ensemble size in CMIP6. Alternatively, the factors affecting the signs of residual climate impacts may not be well enough understood for the CMIP6 models to show improvement over CMIP5. Energetics, temperature, and the hydrological cycle are qualitatively and quantitatively similar in both ensemble means and ensemble ranges, although these variables are somewhat related, so we might expect them to all portray a similar picture. Notable differences do exist in shortwave cloud forcing and NPP, particularly in Amazonia, Africa, and Australia, which are also regions of inter-ensemble difference in precipitation.

From these findings, we can conclude that results obtained over the past 20 years of study have not been overturned by the latest round of simulations. All of the major ensemble differences highlighted above deal with clouds and land surface modeling, both of which are difficult to model and are necessarily highly parameterized. The conclusions that are based on more fundamental knowledge, such as column energetics (in the case of the hydrological cycle), are relatively robust to structural uncertainty, insofar as this study adequately captures representative variations in structural uncertainty. This lends confidence to our conclusions about the broad climate effects from modeling solar geoengineering via solar dimming.
We also conclude that the models used in CMIP5 are not obviously biased or inferior as compared to CMIP6. While improvements have been made in the CMIP6 generation of models, and those models are likely better for representing numerous features of the present-day climate that may be important for studies of geoengineering, there are many aspects of climate that are well represented by earlier models. In some cases, more robust analyses may be enabled by augmenting ensemble sizes with archived output from earlier generations of CMIP models.
Many of the broad features of solar geoengineering with sulfate aerosols can be represented by a reduction in solar constant (e.g., Niemeier et al., 2013; Kalidindi et al., 2015). However, the more subtle changes that derive from complex response to stratospheric aerosol heating (
There are numerous aspects of physical climate that we did not evaluate, nor did we pursue analyses beyond physical climate, including many other aspects of natural science, social science, the humanities, governance, justice, or ethics (to name a few important areas). Moreover, we emphasize that experiment G1 is an idealized experiment aimed at understanding physical climate response to combinations of large forcings and should not be interpreted as a realistic or policy-relevant scenario of geoengineering. A holistic assessment of the consequences of geoengineering, particularly of more policy-relevant scenarios, would certainly need to take these numerous aspects into account. Nevertheless, based on the results presented here, results for geoengineering across several important metrics appear to be consistent across some important structural uncertainties. This lends confidence to some conclusions drawn from global climate models regarding solar geoengineering.
Data availability. All CMIP5 and CMIP6 output, including the respective GeoMIP simulations, is available via the Earth System Grid Federation (https://esgf-node.llnl.gov/projects/esgf-llnl/) or by contacting the respective modeling groups responsible for the output. For
Table 1. All participating models in both the CMIP5 and CMIP6 eras of GeoMIP, including references. For G1 solar reduction, the percentage is calculated as the percent change in incident solar irradiance at the top-of-atmosphere between G1 and its respective piControl run. Numbers in the first column correspond to the model numbers in Figure 11.
Figure 10. Global mean ensemble median (red lines), inter-quartile (blue boxes), and ranges (black whiskers or, for P-E, one blue circle indicating an extreme outlier) for the hydrological quantities shown in Figure 9 for both the CMIP5 and CMIP6 ensembles.
Figure 11. Average (years 11-50) temperature (y-axis; K) and precipitation (x-axis; %) change for each model in this study. Numbers indicate the model number (listed in
Figure 12. Terrestrial net primary productivity (kg C m−2 y−1) for the CMIP5 (top) and CMIP6 (middle) ensembles, as well as the ensemble differences (bottom). All shaded values are ensemble means. Lack of stippling indicates agreement on the sign of the values across at least 75% of the models.