Four years of global carbon cycle observed from OCO-2 version 9 and in situ data, and comparison to OCO-2 v7

The Orbiting Carbon Observatory 2 (OCO-2) satellite has provided information to estimate carbon dioxide (CO2) fluxes at global and regional scales since 2014 through the combination of CO2 retrievals with top-down atmospheric inversion methods. Column-average CO2 dry air mole fraction retrievals have been continually improved. A bias correction has been applied in the OCO-2 version 9 retrievals compared to the previous OCO-2 version 7r, improving data accuracy and coverage. We study an ensemble of ten atmospheric inversions, all characterized by different transport models, data assimilation algorithms and prior fluxes, using first OCO-2 v7 in 2015-2016 and then OCO-2 version 9 land observations for the longer period 2015-2018. Inversions assimilating in situ (IS) measurements have also been used to provide a baseline against which to compare the satellite-driven results. The time series of the model emissions at different scales (from global to regional) are analyzed and compared across the experiments using either OCO-2 or IS data. We then evaluate the inversion ensemble against data from TCCON, aircraft, and in situ observations, all independent from the assimilated data. While we find a similar constraint on global total carbon emissions between the ensemble spread using IS and both OCO-2 retrievals, differences between the two retrieval versions appear at regional scales, particularly in tropical Africa. A difference in the carbon budget between v7 and v9 is found over this region, which seems to show the impact of the corrections applied in the retrievals.
https://doi.org/10.5194/acp-2021-373 Preprint. Discussion started: 11 August 2021 © Author(s) 2021. CC BY 4.0 License.


Introduction
Understanding the global carbon cycle and how quickly the planet warms in response to human activities is becoming a global priority. CO2 is a key driver of global warming and its dynamics can be explored with a variety of CO2 measurements. Ground-based (in situ) data, while highly precise and accurate, are distributed very sparsely over the globe (Ciais et al., 2013). Space-based CO2 retrievals, on the other hand, allow comprehensive spatial coverage across the globe, particularly over regions with few surface observations, such as the tropics. Furthermore, the number of satellites observing atmospheric CO2 has grown rapidly over the past decade, e.g. the Greenhouse Gases Observing Satellite (GOSAT/GOSAT2; Kuze et al., 2009; Nakajima et al., 2012) and the Orbiting Carbon Observatory (OCO-2/OCO-3).
The rise in CO2 concentration at a global scale has motivated the drive towards a better understanding of the global surface fluxes of carbon (World Meteorological Organisation, 2020). In order to understand the different processes involved in the carbon cycle, such as uptake or release of CO2 by the oceans and the land biosphere, and hence be able to predict future climate change, we need accurate emissions estimates and an improved understanding of natural CO2 emissions and uptakes. Top-down atmospheric inversion approaches that couple atmospheric observations of CO2 with chemistry (atmospheric) transport models (CTMs) have been widely used to estimate CO2 fluxes (Ciais et al., 2010; Peylin et al., 2013; Basu et al., 2013; Wang et al., 2018; Crowell et al., 2019). This is in contrast to "bottom-up" methods, which often use a mechanistic understanding of the carbon cycle, e.g. soil dynamics, photosynthesis, decomposition processes, and steady-state ocean-atmosphere chemical exchange, to predict land-atmosphere and ocean-atmosphere exchange, and hence atmospheric CO2 concentrations.
While the mechanistic underpinnings of these models are attractive, there is no guarantee that the resulting atmospheric exchange of CO2 will bear any similarity to reality. By contrast, the "top-down" approach often uses a "bottom-up" model output as a starting guess and then optimizes atmospheric exchange to agree with atmospheric observations. Formal uncertainties of top-down approaches can be attributed to the errors in the observations assimilated, i.e. "observation" errors, and to errors in the starting guess from mechanistic models. However, past studies have also shown that top-down estimates can be sensitive to errors in the modeled atmospheric transport as well as to choices related to the optimization technique (Houweling et al., 2015; Basu et al., 2018; Schuh et al., 2019), uncertainties which are difficult if not impossible to characterize formally in any one atmospheric inversion scheme. This shortcoming was the motivation for the OCO-2 Model Inter-comparison Project (MIP), whose goal was to (1) study the impact of assimilating OCO-2 retrieval data into several atmospheric inversion models and (2) provide an overall ensemble spread of the model emissions characterizing most sources of known uncertainty. In addition to its primary goal of assimilating OCO-2 retrievals, MIP modelers [...]
The paper is structured as follows. The MIP design and the data used in the inversions (i.e., the in situ data and OCO-2 v9 retrievals, as well as how v9 differs from v7) are detailed in Section 2, which also presents the independent data used for evaluation. Section 3 presents the optimized fluxes estimated from the in situ, v7 and v9 inversions at global, latitudinal and regional scales, and ends with the evaluation against independent data. Finally, Section 4 discusses the results and findings.
The MIP project, organized by the OCO-2 Science Team, is a collaboration of CO2 modelers formed to study the impact of assimilating OCO-2 retrieval data into atmospheric inversion models. The project's goal is to create an ensemble of CO2 surface flux estimates to understand how flux estimates using OCO-2 retrievals and in situ measurements depend on (i) transport, (ii) data assimilation methodology, (iii) prior flux and associated errors and (iv) possible systematic errors in the OCO-2 retrievals, in particular across viewing modes, i.e. ocean glint (OG), land nadir (LN), and land glint (LG). The OCO-2 MIP philosophically mimics past projects such as RECCAP (REgional Carbon Cycle Assessment and Processes) and TRANSCOM [...]
This simulation has been used in the v9 MIP. LoFI uses a different method than the other inversions but fits some independent data as the other simulations do. In addition, it has the highest resolution. LoFI has therefore been used in this MIP project to look at a range of different methods but, in this study, we focus our analysis on the range of emissions from the other simulations.
[...] both land nadir- and land glint-mode retrievals combined together have also been used, providing data over the oceans. Combining both modes has been shown to yield a stronger constraint on CO2 fluxes at regional scales (Miller and Michalak, 2020). In addition, biases existing between these two modes of retrievals have been reduced (O'Dell et al., 2018).
In this study, we focus on the nadir and glint modes, so the target mode is not discussed further. In addition, we only use the land nadir (LN) and land glint (LG) modes and do not use the ocean observation mode. Even though ocean biases in OCO-2 retrievals have been largely reduced since version 7 (O'Dell et al., 2018), inversions assimilating OCO-2 ocean retrievals produced unrealistic results, with annual global ocean sinks that are higher (2.6 ± 0.5 GtC/yr) than the state-of-the-art estimate in Le Quéré et al. (2018). Consequently, as for the v7 MIP (Crowell et al., 2019), the OCO-2 ocean retrievals are not discussed further in this study.
The algorithm developed to retrieve the column-average dry air mole fraction of CO2 in the atmosphere (XCO2) from the measured radiance spectrum comes from NASA's Atmospheric CO2 Observations from Space (ACOS) project (O'Dell et al., 2012; Connor et al., 2008). The ACOS algorithm was first applied to GOSAT NIR and SWIR spectral measurements, which have similar spectral characteristics to the OCO-2 measurements, before being used for OCO-2 and OCO-3 (which were launched at later dates). In addition to the spectral data, ACOS uses meteorology and model data to constrain retrievals of XCO2 along with a variety of other parameters such as aerosol optical depth, surface albedo, surface pressure and total column water vapor. In this paper, the modelers have used the ACOS bias-corrected retrievals (OCO-2 Level 2 Lite XCO2 product) version 9 (Kiel et al., 2019; O'Dell et al., 2018; https://daac.gsfc.nasa.gov). Since October 2019, OCO-2 processing has used the ACOS version 9 (or "ACOS B9") algorithm, an update to the previous v7 and v8 versions (O'Dell et al., 2018). Several changes have been applied in v9 compared to v7. In particular, the v8 data included corrections related to the spectroscopy, aerosol treatment, prior meteorology and the surface model. More details can be found in O'Dell et al. (2018). v9 included an additional correction for the surface pressure estimation (Kiel et al., 2019), which significantly reduced biases, particularly over areas of rough topography. This bias correction in v9 yields a more uniform distribution of XCO2 over regions of interest, decreasing the standard deviation to 0.74 ppm, compared to 1.35 ppm in v8.
MIP modelers used all valid cloud-free OCO-2 retrievals (those flagged as "good", i.e. xco2_quality_flag=0) from the OCO-2 Lite files and then selected the bias-corrected data (Wunch et al., 2011). Since the spatial resolution of OCO-2 data is much higher than the model grid box scale used in the inversions, the OCO-2 data are averaged to a coarser scale (in this study, across a 10-second span, equivalent to about 67.5 km along-track) before being assimilated. The retrieved column CO2, averaging kernels, prior CO2 profiles, and a subset of the auxiliary parameters from the Lite files have all been averaged across these 10-second spans in the same way, weighted by the inverse of the square of the retrieval uncertainty (variable xco2_uncertainty) for each scene in the average. This is similar to the averaging done for the OCO-2 v7 MIP (Crowell et al., 2019), except that the 10-second averages are calculated directly, without the intermediate step of computing 1-second averages, as was done before. In computing the uncertainty to be placed upon the new 10-second-average XCO2 value, an attempt was made to account for correlations between the model-data mismatch (MDM) errors for each individual scene: each scene within the 10-second span was assumed to have errors that were correlated with every other scene in the span with the same positive correlation coefficient (+0.3 and +0.6 for scenes over land and ocean, respectively). Details of the form and derivation of these average uncertainties may be found in the 'constant correlation' section of Baker et al. (2021). The approach to handling the correlations, while crude, represents an increase in complexity compared to what was assumed in the v7 MIP (no reduction in uncertainty due to the averaging process, as described in Crowell et al. (2019)).
Since it is known that the uncertainty computed by the retrieval (in variable xco2_uncertainty) underestimates the true level of error in the retrieved XCO2, an additional term is added onto this "theoretical" uncertainty, in quadrature, to obtain a more realistic uncertainty per scene: the standard deviation of all the XCO2 values used in the 10-second average. In this ad hoc approach, scenes that have a very small spread in XCO2 values across the 10-second span are assigned the theoretical uncertainty from the retrieval, while those for which the actual variability of the XCO2 values is larger than the theoretical values are assigned a value closer to this computed error level. The resulting uncertainty is then passed through the formula to account for error correlations. Finally, an additional term is added in quadrature to account for transport model errors. This model error term is computed from the difference between the CO2 concentrations computed by the TM5 and GEOS-Chem models when both are driven by the same realistic surface CO2 fluxes, after the annual mean difference field is subtracted off; the values that result are considerably smaller than the model errors added on for the OCO-2 v7 MIP (Crowell et al., 2019). In contrast to this level of detail, the uncertainties between different 10-second averages are assumed to be independent when assimilated into the inversions (Worden et al., 2017; Crowell et al., 2019). Several studies have used this averaging method to reconcile the resolution of their inversions or simulations with the OCO-2 resolution (Chevallier et al., 2019).
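The averaging and uncertainty steps described above can be illustrated with a minimal Python sketch. It is not the MIP code: the function name `ten_second_average` and its arguments are hypothetical, the spread is assumed to be a population standard deviation, and the correlation step uses the generic variance of a weighted mean whose errors share a single correlation coefficient, which may differ in form from the expression derived in Baker et al. (2021).

```python
import math
import statistics


def ten_second_average(xco2, sigma, rho, sigma_model=0.0):
    """Average the scenes of one 10-second span (illustrative sketch only).

    xco2        : retrieved XCO2 per scene (ppm)
    sigma       : retrieval uncertainty per scene (ppm), i.e. xco2_uncertainty
    rho         : assumed constant error correlation between scenes
                  (the text quotes +0.3 over land and +0.6 over ocean)
    sigma_model : transport-model error term (ppm), added in quadrature
    """
    # Inverse-variance weights, as described in the text.
    w = [1.0 / s ** 2 for s in sigma]
    wsum = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, xco2)) / wsum

    # Spread of the XCO2 values in the span (assumption: population std).
    spread = statistics.pstdev(xco2)

    # Inflate each scene's theoretical uncertainty with the spread, in quadrature.
    s_eff = [math.sqrt(s ** 2 + spread ** 2) for s in sigma]

    # Variance of a weighted mean with one shared correlation rho:
    # Var = [(1 - rho) * sum((w_i s_i)^2) + rho * (sum(w_i s_i))^2] / (sum w_i)^2
    a = sum((wi * si) ** 2 for wi, si in zip(w, s_eff))
    b = sum(wi * si for wi, si in zip(w, s_eff)) ** 2
    var_mean = ((1.0 - rho) * a + rho * b) / wsum ** 2

    # Add the transport-model error term in quadrature.
    return mean, math.sqrt(var_mean + sigma_model ** 2)
```

With `rho = 0` the averaged uncertainty falls off as the inverse square root of the number of scenes, while `rho = 1` gives no reduction at all, which corresponds to the v7 MIP assumption described above.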

In situ CO 2 measurements
The set of in situ CO2 measurements used for assimilation and for evaluation is drawn from 5 collections in ObsPack (Masarie et al., 2014, and [...] ships, from light aircraft at regular profiling sites, and on commercial aircraft (see Fig. 1.a). Only a small subset of ObsPack measurements are designated as suitable for assimilation; the remainder are designated for evaluation. The assimilable measurements meet two criteria: they can be successfully simulated in coarse-resolution global models, and they can be assigned an MDM error value. Many factors can render observations difficult to simulate in the global models used for this exercise. Sites located in areas with complex topography, close to strong local sources like cities, or strongly influenced by small-scale circulation features such as land/sea breezes are all considered difficult to simulate.
The CarbonTracker "adaptive model-data mismatch" scheme from CT2017 (Peters et al., 2007, with [...] campaigns (e.g. ACT-America (DiGangi et al., 2018) and ATom (Wofsy and ATom Science Team, 2018)). Many of these measurements without an MDM value are otherwise assimilable, since they sample background conditions that models should be able to simulate successfully. These data are particularly useful for model evaluation because they are generally independent of assimilated measurements.

Withheld data
In order to evaluate inversion model performance, a small subset of the assimilable in situ measurements were withheld for cross-validation. Each withheld measurement has an MDM value, which allows model residuals to be normalized by expected performance. The collection of withheld measurements was then used to evaluate the MIP ensemble.
Approximately 5% of the assimilable data were chosen for withholding. These were chosen carefully to maximize independence from the data designated for assimilation in the IS experiment. The criteria for withholding vary by measurement type. Flasks, which are generally taken on a weekly basis and with sampling criteria that emphasize background conditions, are already assumed to be independent from one another. All the measurements in a given aircraft profile are assumed to be related, so entire profiles were excluded. Quasi-continuous measurements, like those at towers and observatories, are assumed to be correlated in time, so all measurements during 5% of entire days were excluded.
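The idea of withholding whole correlated groups (entire days for quasi-continuous sites, entire profiles for aircraft) rather than individual measurements can be sketched as follows. This is an illustration only: the text does not say how the withheld days or profiles were selected, so the seeded random draw, the function name `withhold_whole_groups` and the record layout are all assumptions.

```python
import random


def withhold_whole_groups(records, key, fraction=0.05, seed=0):
    """Split records into (assimilated, withheld), withholding whole groups.

    records  : list of measurement records (any objects)
    key      : function mapping a record to its group, e.g. its calendar day
               for a tower site or its profile identifier for aircraft data
    fraction : share of groups to withhold (about 5% in the text)
    seed     : seed for the random draw (the random draw is an assumption)
    """
    groups = sorted({key(r) for r in records})
    if not groups:
        return [], []
    rng = random.Random(seed)
    n_out = max(1, round(fraction * len(groups)))
    held = set(rng.sample(groups, n_out))
    # A group is either fully assimilated or fully withheld, never split.
    assimilated = [r for r in records if key(r) not in held]
    withheld = [r for r in records if key(r) in held]
    return assimilated, withheld
```

For example, passing hourly tower records with `key=lambda r: r["day"]` withholds every measurement on the selected days, so no withheld value shares a day with an assimilated one.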

TCCON
The Total Carbon Column Observing Network (TCCON) is composed of around 30 sites around the globe estimating the column-averaged dry air mole fraction of several atmospheric gases using ground-based remote sensing by Fourier Transform Spectrometers (Wunch et al., 2011). TCCON measures spectra of direct sunlight in the near-infrared region. TCCON CO2 retrievals are estimated to have precisions better than 0.25% (1-sigma) (Wunch et al., 2011). These retrievals have been used as the primary validation resource for several satellite missions, including OCO-2, SCIAMACHY and GOSAT (Wunch et al., 2011). OCO-2 observations were extensively evaluated against TCCON data in [...]. Posterior and prior concentrations are sampled for each 30-minute average TCCON retrieval before calculating the statistics, following the approach described in Crowell et al. (2019). For the LNLGv9 inversions, the available 10-second OCO-2 retrievals were averaged and compared to TCCON observations with a 5° latitude and longitude geometric coincidence criterion and within 1 h of the overpass. All TCCON sites used in the evaluation section are listed in Table 3. Figure 3 illustrates the different time ranges and observation numbers across TCCON sites during the 2015-2018 period.
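The coincidence criterion used for the TCCON comparison can be written as a short check. The sketch below is hedged: the helper names and the dictionary layout are hypothetical, and the 5° criterion is interpreted here as a maximum absolute offset of 5° in latitude and in longitude from the site, which is an assumption since the text does not state whether the box is ±5° or 5° wide.

```python
from datetime import datetime, timedelta


def lon_diff(a, b):
    """Smallest absolute longitude difference in degrees, across the dateline."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def is_coincident(sounding, site, tccon_time,
                  max_deg=5.0, max_dt=timedelta(hours=1)):
    """Return True if a 10-second OCO-2 average matches a TCCON retrieval.

    sounding   : dict with 'lat', 'lon' (degrees) and 'time' (datetime)
    site       : dict with 'lat', 'lon' (degrees) of the TCCON station
    tccon_time : datetime of the TCCON retrieval
    max_deg    : assumed maximum offset in latitude and longitude (degrees)
    max_dt     : maximum time difference from the overpass
    """
    return (abs(sounding["lat"] - site["lat"]) <= max_deg
            and lon_diff(sounding["lon"], site["lon"]) <= max_deg
            and abs(sounding["time"] - tccon_time) <= max_dt)
```

Handling the longitude difference modulo 360° matters only for sites near the dateline, but it keeps the check valid everywhere.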

Results
We discuss results from two v9 MIP experiments: IS, in which only the in situ CO2 measurements are assimilated, and [...]
[Figure 3. Time ranges and observation numbers across TCCON sites, by year through 2018. Site labels: Eureka, Nunavut, Canada; Ny-Ålesund, Svalbard, Norway; Sodankylä, Lapland, Finland; Białystok, Podlasie, Poland; Bremen, Bremen, Germany; Karlsruhe, Baden-Württemberg, Germany; Paris, Île-de-France, France; Orléans, Centre-Val de Loire, France; Garmisch, Bavaria, Germany; Park F[...]]
[...] are slightly larger with OCO-2 v7 than with OCO-2 v9. Additionally, while the growing season observed with v7 is shifted earlier in the year relative to IS and the unoptimized prior fluxes, this is not the case with v9 (although the priors in the v7 and [...]
We also see some differences between v7 and v9. LNv7 fluxes are closer to what is observed in both LNLGv9 and IS during the El Niño period, with a median sink around -2.25 PgC/yr, but LGv7 gives a weaker sink (-1.5 PgC/yr) due to larger sources in fall 2015 and 2016. As for all other latitude bands, and as we will see for smaller regions, LNLGv9 tends to be closer to LNv7 than to LGv7. This points to previously known issues with the v7 LG data that were resolved with a unified bias correction in OCO-2 v9. Interestingly, the seasonality of the v9 results aligns more closely with the IS results, and the large efflux at the end of the growing season in v7 LNLG has disappeared in v9.
The Southern extra-tropics (SHExt, Fig. 6e and f) are known to have fewer IS observations as well as little land mass (Crowell et al., 2019), and hence fewer land retrievals to constrain the fluxes, which are significantly weaker over this latitude band. This could explain why, for this latitude band, the prior, IS, v7 and v9 results have different seasonality. LNv7 and LNLGv9 have different seasonality, and v7 has a delay in the efflux peak for 2015 and no efflux peak at all for 2016. The seasonal amplitude observed with LNLGv9 is smaller for the whole period compared to IS and LGv7 and has a delay in 2017 and 2018 compared to IS. The differences in monthly emissions are also observed in the annual fluxes. They show, for the whole period, stronger sinks with v9 than with IS and v7. However, in contrast to NHExt, the ensemble spread is larger with v9 than with v7.
The bias reduction of v9 gives a smaller spread and hence a better agreement among the models, particularly over the Northern Hemisphere.
Over the tropics (Fig. 6c and d), the seasonal peak efflux is typically in the fall, with an anomalously strong source in fall 2015 during the intense El Niño period. On average, the seasonality seems to be similar for the IS and OCO-2 inversions, but different from that of the prior. However, for the whole period, the v9 OCO-2 annual mean source is about 0.5 ± 0.1 PgC/yr stronger than for IS. OCO-2 observations have more frequent coverage over the tropics than the in situ network. However, OCO-2 retrievals can be biased due to cloud coverage during the wet season and aerosol from biomass burning during the dry season (Merrelli et al., 2015; Massie et al., 2017). LNLGv9 gives stronger annual sources, particularly for the El Niño period, and with a smaller ensemble spread, than does v7. The OCO-2 LGv7 ensemble spread does not deviate from the prior spread, [...]
We see similarities in the seasonality and annual flux across zonal bands in the OCO-2 (mainly v9) and IS results for the Northern extra-tropics, but also differences between v7 and v9. We now look at smaller spatial scales to see where these agreements or disagreements are observed. Figure 7 shows monthly and annual fluxes for Northern America (top panel), Europe (middle panel) and Northern Asia (bottom panel).
Over Northern America (Fig. 7a and [...] Houweling et al. (2015) assimilated GOSAT data over the 2009 and 2010 period and observed a larger carbon uptake for Europe with GOSAT than with in situ data, as was also observed by Chevallier et al. (2014) and Reuter et al. (2014). Inferred fluxes using v9 thus seem to be more consistent with other studies, but more analysis is needed to understand why this difference between v7 and v9 appears over Europe (which could be due to a dipole between Europe and Northern Africa as observed and mentioned by the previous studies of Houweling [...] fewer in situ than OCO-2 observations above Northern Asia, and particularly above the boreal forest of Eurasia, which is an important area for sources and sinks of atmospheric CO2 (Houghton et al., 2007; Siewert et al., 2015). The combination of sparse data, in an area not well observed, with transport uncertainty could be the cause. We see that the ensemble spread of the models is larger for IS than for v7 or v9 and that the annual fluxes differ by almost 1.5 PgC/yr between the IS and OCO-2 inversions during the El Niño period. This is not the case for Europe and Northern America, where the surface measurements are most densely concentrated (Fig. 1.a). Finally, thanks to the additional two years of fluxes using IS and v9, we see a trend towards a weaker sink from 2015 to 2018 for Northern Asia that is not observed for Europe or North America.

Tropical region
We now look at the Tropics split across the Northern and Southern hemispheres. The Southern and Northern tropics are [...]
In order to see the resolution provided by the data at finer scales in the tropics, we examine fluxes for six tropical regions (three over the Northern Tropical hemisphere (Fig. 9) and three over the Southern Tropical hemisphere (Fig. 10)).
[...] the two versions, we do not observe much difference between v7 and v9 for Southern Tropical South America (Fig. 10a and b), except that the ensemble spread is larger with v9 than with v7. The bias correction in v9 allowed more data than in v7 to pass the quality flag criteria over the Amazon (O'Dell et al., 2018). In addition, we can observe a different amplitude in the seasonality between IS and OCO-2 for this region. This difference over Tropical Southern America might be due to the cloud effect (Crowell et al., 2019) of the wet season over the Amazon affecting OCO-2 data. But it could also be because most of the in situ data are located inside the Amazon and not in the Cerrado savanna of Brazil, resulting in IS inversions being dominated by tropical forest seasonality. Alternatively, the OCO-2 inversions could be dominated by savanna seasonality (Baker et al., 2021a, in prep). Finally, over Southern Tropical Africa (Fig. 10c), we obtain a large difference between the annual means of v7 and v9 that we did not observe for the other regions. While LN and LG v7 give a sink of carbon during the El Niño period of around -0.25 PgC/yr, LNLGv9 gives a source of around 0.25 PgC/yr, which seems not to be compensated by a flux signal in Northern Tropical Africa. This source of carbon comes from weaker sinks during the growing season (from November through March). It is also observable during the El Niño period with the IS inversions (and was also observed with ISv7; Crowell et al., 2019).

Evaluation against independent data
To assess the accuracy of the posterior flux results presented previously, we evaluate them here by sampling the resultant posterior concentrations and comparing them to withheld data, ATom aircraft measurements and TCCON data. All modelers have sampled their posterior concentrations at the times and locations of the evaluation data.

Withheld in situ evaluation data
Here, we evaluate against the withheld in situ data introduced in Section 2.4. Figure 11 shows the evaluation with the withheld data by latitude band for IS and LNLGv9. Over the southern hemisphere, the LNLGv9 ensemble mean largely underestimates the observations. While biases are larger over the southern hemisphere, they are smaller over the tropics. The large biases observed in LNLGv9, and not with IS, could be due to a latitudinal bias in the OCO-2 data. Going from the southern to the northern hemisphere, we can see a change in the v9 biases, which go from underestimation to overestimation by the models. This behavior also appears in the IS results but with lower biases (less than 0.5 for all latitudes). The variability between the models and observations is higher in the southern hemisphere than in the tropics and the northern latitudes (Fig. 11, right plot). Overall, the IS experiments are more consistent with the in situ measurements than the LNLG experiments are. Figure 12 shows the normalized bias and standard deviation of the ensemble mean by MIP regions for LNLGv9 and IS.
Evaluation by MIP regions reveals a different behavior than observed by latitudes. For instance, over Temperate North America, v9 seems to have a larger underestimation of the observations than IS has, while for Temperate South America and Northern Tropical Africa, v9 has larger underestimation than IS. Over Europe, we see here that v9 has smaller biases (less overestimation) than IS. Finally, if we look at Southern Tropical Africa, v9 is less biased than IS. Regarding the standard deviation, it seems that the three experiments have similarities, but with closer agreement between IS and v9 for some regions.
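The statistics used in this evaluation (bias and standard deviation of residuals normalized by the MDM, and the raw RMSE) can be sketched as below. This is a minimal illustration under stated assumptions: the function name `residual_stats` is hypothetical, residuals are taken as model minus observation so that negative values mean underestimation by the model, and the standard deviation is computed as a population value.

```python
import math


def residual_stats(model, obs, mdm):
    """Normalized bias, normalized standard deviation and raw RMSE.

    model : simulated CO2 at the withheld measurement times and places (ppm)
    obs   : withheld in situ CO2 measurements (ppm)
    mdm   : model-data mismatch error assigned to each measurement (ppm)
    """
    n = len(obs)
    # Assumed sign convention: model minus observation.
    res = [m - o for m, o in zip(model, obs)]
    # Normalizing by the MDM puts sites with very different expected
    # variability on a common, dimensionless scale.
    norm = [r / e for r, e in zip(res, mdm)]
    norm_bias = sum(norm) / n
    norm_std = math.sqrt(sum((z - norm_bias) ** 2 for z in norm) / n)
    rmse = math.sqrt(sum(r * r for r in res) / n)
    return {"norm_bias": norm_bias, "norm_std": norm_std, "rmse": rmse}
```

Because each residual is divided by its own MDM, a 2 ppm error at a site with a 2 ppm MDM counts the same as a 1 ppm error at a site with a 1 ppm MDM, whereas the raw RMSE is dominated by the noisier sites.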
Since the MDM values range over two orders of magnitude, the use of the normalized residuals gives the most meaningful interpretation of the residuals. When we look at the normalized bias and standard deviation for each model in the IS and LNLGv9 experiments by latitude (Fig. 13 a, b, c, and d) and by MIP region (Fig. 14 a, b, c, and d), we can see this large difference, with underestimation for all models assimilating v9 in the southern and tropical latitudes. This underestimation across all models is, however, generally not obtained with IS. When comparing the root-mean-square error (RMSE) between the models and the withheld data (Fig. 13 e, f), we observe higher values for all models between 30°N and 75°N, with RMSE between 6 and 10 ppm for the LNLG inversions (4 and 7 ppm for the IS inversions), while RMSE values are below 3 ppm in the southern latitudes and in the tropics. These larger raw errors in the northern latitudes could come from the difference in the in situ network between northern and southern latitudes. Indeed, there are more measurements in the northern hemisphere close to regionally significant sources and sinks compared to other latitudinal bands (tropics and southern hemisphere). Compared to the ensemble mean, every model shows small normalized standard deviations across the latitudes, with values near 0.1 in the Northern Hemisphere and going from 0.2 to 0.6, depending on the model, between 75°S and 60°S. However, the evaluation by MIP regions shows similarities between IS and v9 for all regions. This similarity is also found for the RMSE values (Fig. 14 [...] distribution. Normalizing the models with the MDM values is equivalent to normalizing against this expected variability. We can also observe that some models do better at some latitudes and regions, while others do better in other regions.
However, the goal of the study is to envelop the uncertainty in the inversions, rather than to rank model performance.
[...] at almost every latitude and gives larger biases than IS in general. This underestimation is consistent with the large underestimation observed, particularly in the Southern Hemisphere, with the withheld data (Fig. 13). In addition, for both experiments, we observe a large overestimation (with biases of 1 to 2 ppm) in the stratosphere of the boreal latitudes. This stratospheric region is affected by atmospheric circulation and has few observations to constrain the inversions. It is possible then that this excess of concentration, in both experiments, reflects the initial conditions of the inversion.
Compared to the evaluation of ISv7 in the study of Crowell et al. (2019), ISv9 and LNLGv9 biases are closer to each other (in the v7 MIP, the LNv7 results were biased high compared to ISv7). Additionally, the OCO-2 biases have decreased (to values between -1.0 and 1.0 ppm) with v9 compared to v7, where biases ranged between -1.5 and 1.5 ppm (Crowell et al., 2019). As observed here as well, IS and v9 have large positive biases over most of the European sites, which could indicate an issue related either to the coarse resolution used by the transport models or (though this has not been studied yet) to the accuracy of TCCON retrievals over these regions (Crowell et al., 2019). The Caltech, Saga, and Tenerife sites show large underestimation in both the IS and v9 results across all models. As mentioned and observed in the v7 MIP, differences between the Caltech and Edwards sites (which are very close to each other) could be due to the location of Edwards over the mountains, while Caltech is affected by its location in the Los Angeles basin (Kort et al., 2012; Schwandner et al., 2017): the coarse resolution of the models cannot differentiate the variability of these two sites (Crowell et al., 2019; Schuh et al., 2021).
This could also explain the underestimation observed over Saga and Izaña (Tenerife Island), which are urban regions. Finally, most sites over southern latitudinal bands are underestimated with LNLGv9 but slightly overestimated with IS, as also observed with the withheld in situ and ATom observations. However, Ascension Island, situated in the tropics, shows an overestimation of around 0.6 ppm for all models and for both IS and LNLGv9. The biases in v9 have decreased for Ascension Island compared to v7, where the biases were around 1.0 ppm for LN and LG.

Discussions
We have analyzed inversions assimilating OCO-2 version 9 XCO2 retrievals and compared them to in situ (IS) inversions and to the OCO-2 v7 inversions. Our study is an update to the v7 MIP analysis of Crowell et al. (2019), which used OCO-2 v7 retrievals. In addition to comparing the LNLGv9 viewing mode inversion results to the IS ones, we also wanted to see if differences existed between an ensemble of atmospheric inverse models using either LNLGv9 or LNv7 and LGv7. We remind the reader that differences between XCO2 in v7 and v9 are not related to meteorology, which is fixed, but to spectroscopy and geolocation improvements, and to a better representation of aerosols (O'Dell et al., 2018). We have compared these different experiments, starting at the global scale, moving on to latitudinal scales, and finishing with regional scales. As expected, we did not find large differences between v7 and v9 at the global or latitudinal scale, except that the ensemble spread among the models was smaller with v9 than with v7. Transport model uncertainty is not expected to have changed dramatically since v7. This suggests that the reduction in the ensemble spread is likely related to a decrease in OCO-2 retrieval errors in v9 [...]
This anomaly over Africa was not expected, as Africa is generally not a tropical region affected by El Niño events. When they compared to total column carbon monoxide (CO) anomalies using MOPITT data, they found an anomalous flux in Southern Tropical Africa, with a large CO release in February 2016. Finally, they also found a water deficit at the beginning of 2016 (weaker than over the Amazon). We can then see that this study assimilating in situ measurements is in agreement with what we observe with v9 and our IS inversions. The corrections in the retrievals (in going from v7 to v9) hence seem to be important for CO2 emissions estimates, particularly over tropical regions.

Conclusions
In this study, we compare an ensemble of inversion models separately assimilating in situ data, OCO-2 v7 LN and LG retrievals, and OCO-2 v9 LNLG retrievals across the 2015-2018 period. Using the four years available with v9, compared with the two available with v7, we have been better able to observe the impact of the 2015-2016 El Niño period and of the 2017-2018 recovery period, especially over the tropics. Additionally, the ensemble spread among the models assimilating OCO-2 v9 retrievals is smaller for almost all regions than the ensemble spread with OCO-2 v7 retrievals. This reflects either the impact of the longer period used with the different models and priors, or a better agreement in emissions among the models together with the impact of the v9 retrieval bias correction. At the global scale, we find good overall agreement between fluxes inferred from the v7, v9 and in situ data. However, differences are found at smaller scales over northern latitudes and particularly over the tropics. While the seasonality in the tropics differs significantly between the OCO-2 v7 and v9 results and the in situ results, the annual emissions show better agreement between the in situ, LNv7 and v9 results, except over Northern Tropical Africa and Northern Tropical Asia. As was observed with v7, v9 also shows a stronger source of carbon over Northern Tropical Africa than that observed with in situ data. This weaker source given by the in situ data seems to be balanced over Northern Tropical Asia by a larger out-gassing during the El Niño period that is not observed with the OCO-2 retrievals, which show only sinks of carbon there for the whole period. It is difficult to conclude where these sources come from, as there are few in situ observations over this region.
Finally, we see, as previously mentioned in several studies, a carbon uptake (of around -0.25 PgC/yr) over Southern Tropical Africa using the v7 data, but a carbon release there using the v9 data of around 0.25 PgC/yr during the El Niño period.
The in situ data also suggest a carbon release. This difference between v7 and v9 over Southern Tropical Africa seems to show the impact of retrieval bias corrections on the regional CO2 fluxes, particularly in the tropics. This contradiction in the carbon budget conclusion between the two sets of OCO-2 inversions requires further investigation over this African region.

Evaluation with the withheld data, ATom aircraft measurements, and TCCON retrievals suggests similarities in biases between the in situ data and LNLG v9 data, with a slight negative bias in the v9 OCO-2 data for almost all latitudes, particularly in the Southern Hemisphere and the tropics, where few evaluation data are available. Evaluation against TCCON also shows a reduction in retrieval errors with the v9 ensemble models compared to v7. Now that OCO-2 v10 retrievals are available, analysis of this new release and comparison with the two preceding ones presented in this study should bring further flux information for the tropical regions regarding retrieval corrections.

A2 Baker
The transport model used in the Baker simulations is the Parameterized Chemical Transport Model (PCTM, Kawa et al. (2004)). The meteorology fields come from the MERRA-2 reanalysis: these have been coarsened to a resolution of 2° x 2.5° for the forward runs of the prior fluxes and to a resolution of 6.67° x 6.67° for the assimilation of the measurements; the vertical resolution has been coarsened to 40 levels in both cases by grouping together the original 72 layers in the upper atmosphere. Biospheric priors come from the CASA-GFED3 (Potter et al., 1993; Randerson et al., 1997; van der Werf et al., 2004) and include gross primary productivity (GPP), net ecosystem respiration, wildfires, and biofuel emissions.

The biospheric prior errors are assumed, over land, to dominate the error budget and covariances are based on an analysis of mismatches with in situ flux measurements (Chevallier et al., 2006, 2012). Spatial correlations on daily mean NEE decay exponentially with a length of 500 km; standard deviations are set to 0.8 times the climatological daily-varying heterotrophic respiration flux simulated by ORCHIDEE, with a ceiling of 4 gC.m−2 per day. Over a full year, the total 1-sigma uncertainty for the prior land fluxes amounts to about 3.0 GtC.yr−1. Ocean prior uncertainty is defined as follows: (i) temporal correlations decay exponentially with a length of one month,

2016)). The CARDAMOM data assimilation system explicitly represents the time-resolved uncertainties in the NBE. The prior estimates are already constrained with multiple data streams accounting for measurement uncertainties following a Bayesian approach similar to that used in the 4D-variational approach. CMS-Flux simulations use the CARDAMOM setup as described by Bloom et al. (2016) and Bloom et al. (2020), resolved at monthly timescales; data constraints include GOME-2 solar-induced fluorescence (Joiner et al., 2013), MODIS Leaf Area Index (LAI), and biomass and soil carbon. In addition, mean GPP and fire carbon emissions from 2010-2017 are constrained by FLUXCOM RS+METEO version 1 GPP (Tramontana et al., 2016; Jung et al., 2017) and GFEDv4.1s (Randerson et al., 2012), respectively, both assimilated with an uncertainty of 20%. The Olsen and Randerson (2004) approach has been used to downscale monthly GPP and respiration fluxes to 3-hourly timescales, based on ERA-Interim reanalysis of global radiation and surface temperature. Fire fluxes are downscaled using the GFEDv4.1 daily and diurnal scale factors on monthly emissions. Posterior CARDAMOM NBE estimates are then summarized as NBE mean and standard deviation values. The NBE from CARDAMOM shows a net carbon uptake of 2.3 GtC/year over the tropics and is close to neutral in the extratropics. The year-to-year variability (i.e., interannual variability, IAV) estimated from CARDAMOM from 2010-2017 is generally less than 0.1 gC/m2/day outside of the tropics. Because of the weak interannual variability estimated by CARDAMOM, the same 2017 NBE prior is used for 2018. CARDAMOM generates uncertainty along with the mean state.

The relative uncertainty over the tropics is generally larger than 100%, and the magnitude is between 50% and 100% over the extra-tropics. We assume no correlation in the prior flux errors in either space or time.
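The downscaling of monthly GPP to 3-hourly timescales mentioned above can be illustrated with a minimal sketch. It assumes that the monthly mean is distributed in proportion to incoming radiation, so that the mean is conserved; the exact weighting used by Olsen and Randerson (2004) may differ. The function name `downscale_gpp` and the radiation values are hypothetical and chosen for illustration only.

```python
def downscale_gpp(monthly_mean_gpp, radiation):
    """Distribute a monthly-mean GPP over sub-monthly steps.

    Assumption (not taken from the paper): each step receives GPP in
    proportion to its incoming radiation, which preserves the mean.
    """
    mean_rad = sum(radiation) / len(radiation)
    if mean_rad == 0:
        # No radiation information: fall back to a flat distribution.
        return [monthly_mean_gpp] * len(radiation)
    return [monthly_mean_gpp * r / mean_rad for r in radiation]


# Illustrative 3-hourly radiation values for a single day (arbitrary units).
radiation = [0.0, 0.0, 100.0, 300.0, 400.0, 200.0, 0.0, 0.0]
gpp_3h = downscale_gpp(2.0, radiation)
```

Night-time steps receive zero GPP, while the average over all steps remains equal to the monthly mean passed in.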

A5 CSU
The CSU simulation used the GEOS-Chem model with a Bayesian technique at a 1° x 1° resolution. The model is driven by MERRA-2 meteorology fields. Biospheric prior emissions are based on the Simple Biosphere Model Version 4 (SiB4, Baker et al. (2013)) on a 1° x 1° resolution grid. SiB4 is a land surface model that predicts vegetation and soil moisture, land surface energy, and the terrestrial carbon cycle. It uses the carbon fluxes to determine biomass above and below the ground. Ocean priors are based on CT2019 OIF and fire priors on GFED4.

A6 CT
The CarbonTracker (CT) simulations presented here closely follow the methods used for CT2019B (Jacobson et al., 2020b), except that only one set of biosphere, wildfire, and oceanic emissions data was used. For the present experiments, first-guess land carbon emissions were provided by GFED4.1s (Giglio et al., 2013; Van Der Werf et al., 2017) and first-guess ocean emissions come from Jacobson et al. (2007). CT inversions optimize surface fluxes by estimating weekly scaling factors multiplying first-guess emissions for 156 ecoregions covering the globe. The optimization scheme is an ensemble Kalman filter with a 12-week assimilation window. Atmospheric transport is simulated by TM5 (Krol et al., 2005) with meteorology from the ERA-Interim reanalysis (Dee et al., 2011) with a global resolution of 3° longitude by 2° latitude and a 1° x 1° zoom region over North America.
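The role of the scaling factors can be illustrated with a minimal sketch of how one week's factors would be applied to first-guess fluxes. This is not the CarbonTracker code: the function name, the region map, and all numerical values are hypothetical, and the ensemble Kalman filter that estimates the factors is not shown.

```python
def apply_scaling_factors(first_guess, region_of_cell, scaling):
    """Multiply each grid-cell first-guess flux by the factor of its ecoregion."""
    return [flux * scaling[region]
            for flux, region in zip(first_guess, region_of_cell)]


# Toy example: four grid cells belonging to three ecoregions (hypothetical values).
first_guess = [1.0, -2.0, 0.5, 3.0]      # first-guess fluxes, arbitrary units
region_of_cell = [0, 0, 1, 2]            # ecoregion index of each cell
scaling = {0: 1.1, 1: 0.8, 2: 1.0}       # one weekly factor per ecoregion
scaled = apply_scaling_factors(first_guess, region_of_cell, scaling)
```

Because every cell in an ecoregion shares one factor, the inversion adjusts the magnitude of the first-guess flux in that region but not its spatial pattern within the region.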

A7 LoFI
While the OCO-2 MIP v7 (Crowell et al., 2019) was composed of models assimilating OCO-2 v7 retrievals or in situ data, the OCO-2 MIP v9 additionally includes the LoFI model (Weir et al., 2020), which is an integrated Earth system model with data assimilation capabilities. The Low-order Flux Inversion (LoFI, Weir et al. (2020)) reproduces atmospheric and oceanic growth rates from a group of remote sensing carbon fluxes. The model used is the Heracles 4.0 GEOS GCM, including MERRA-2 meteorological inputs and a 0.5° x 0.625° resolution grid with 72 vertical levels. Biospheric fluxes have a 3-hourly timestep while all other fluxes have a daily timestep. Biospheric and biofuel priors come from the CASA-GFED3. Biomass burning priors are based on the Quick Fire Emissions Dataset (QFED, Darmenov and Silva (2015)). QFED is based on MODIS fire radiative power estimates using the technique from GFAS (Kaiser et al., 2012). Finally, oceanic priors are an extension of the Takahashi et al. (2009) climatology product used with NOAA zonal-mean surface CO2 and MERRA-2 wind speed. This approach is also used in the NOAA CarbonTracker system.

A8 OU
The OU simulation used the chemistry transport model TM5 (Krol et al., 2005) with a 4DVAR assimilation algorithm. The model was run at a 4° x 6° resolution with 25 vertical layers. ERA-Interim meteorology fields are used here as well. Initial conditions are provided by CarbonTracker. Oceanic priors were constructed from Takahashi et al. (2009). Biospheric priors are based on CT2019 CASA-GFED3 and fire priors on GFED3. Uncertainties are derived from different climatological fluxes.
Exponential spatiotemporal correlation is assumed for the uncertainty in the prior flux. For the oceanic component, the correlation length is 1000 km and the timescale is 3 weeks, while for the terrestrial component the length and timescale are 250 km and 1 week.
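As an illustration of these correlation settings, the sketch below evaluates a correlation between two prior flux errors separated in space and time. It assumes a separable form, exp(-d/L) multiplied by exp(-|t|/T), which is a common choice but is not stated explicitly in the text; the function and dictionary names are hypothetical.

```python
import math


def exp_correlation(distance_km, lag_weeks, length_km, timescale_weeks):
    """Separable exponential correlation in space and time (assumed form)."""
    spatial = math.exp(-distance_km / length_km)
    temporal = math.exp(-abs(lag_weeks) / timescale_weeks)
    return spatial * temporal


# Correlation lengths and timescales quoted in the text for the OU setup.
OCEAN = {"length_km": 1000.0, "timescale_weeks": 3.0}
LAND = {"length_km": 250.0, "timescale_weeks": 1.0}

# Correlation between two flux errors 250 km and 1 week apart.
rho_ocean = exp_correlation(250.0, 1.0, **OCEAN)
rho_land = exp_correlation(250.0, 1.0, **LAND)
```

With these settings, ocean flux errors remain correlated over larger distances and longer lags than land flux errors, so the ocean fluxes are adjusted at coarser effective resolution.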

A9 TM5-4DVAR
The TM5-4DVAR simulation used the same transport model as the OU simulation, with the same meteorological fields but at a 2° x 3° resolution grid. Prior oceanic fluxes are taken from a climatological average of CT2019 oceanic flux estimates. Biospheric priors are taken from SiB CASA GFED4 (Van Der Velde et al., 2013) and fire priors from GFED4 (Randerson et al., 2012). The uncertainties on the biospheric prior are fixed to 0.5 times the heterotrophic respiration from SiB CASA. For the oceanic prior, the uncertainties are fixed at 1.57 times the absolute flux at each grid cell and time step. The same correlation lengths and timescales as in the OU simulation are assumed in this TM5-4DVAR simulation.
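The two prior uncertainty rules stated above amount to simple scalings, sketched here for a single grid cell and time step. The function names and input values are hypothetical; only the factors 0.5 and 1.57 come from the text.

```python
def biosphere_prior_sigma(heterotrophic_respiration):
    """Prior uncertainty on the biospheric flux: 0.5 x heterotrophic respiration."""
    return 0.5 * heterotrophic_respiration


def ocean_prior_sigma(ocean_flux):
    """Prior uncertainty on the oceanic flux: 1.57 x the absolute flux."""
    return 1.57 * abs(ocean_flux)


# Illustrative inputs in arbitrary flux units (not values from the paper).
sigma_bio = biosphere_prior_sigma(3.0)
sigma_ocean = ocean_prior_sigma(-2.0)
```

Taking the absolute value keeps the ocean uncertainty positive for cells where the prior flux is an uptake (negative sign).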

A10 UT
The GEOS-Chem model has been used in the UT simulation, driven by assimilated meteorological observations from GEOS-FP and used with the 4D-Var assimilation algorithm. The model is run at 4° x 5° resolution with 47 vertical layers. More information can be found in Deng et al. (2016). Fire priors are based on GFED4. Oceanic prior fluxes are based on the monthly climatology of Takahashi et al. (2009). Finally, biospheric priors are based on 3-hourly fluxes from the Boreal Ecosystem Productivity Simulator. Annual terrestrial ecosystem exchange is assumed neutral in each grid box (Deng and Chen, 2011; Deng et al., 2014). Optimized scaling factors are estimated with a monthly temporal resolution. The assumed uncertainty is 38% for the fire emissions in each month and each model grid box, 44% for the ocean emissions, and 22% for terrestrial emissions.