Objective evaluation of surface- and satellite-driven CO2 atmospheric inversions

We study an ensemble of six multi-year global Bayesian CO2 atmospheric inversions that vary in terms of assimilated observations (either column retrievals from one of two satellites or surface air sample measurements) and transport model. The time series of inferred annual fluxes are first compared with each other at various spatial scales. We then objectively evaluate the small inversion ensemble based on a large dataset of accurate aircraft measurements in the free troposphere over the globe, which are independent from all assimilated data. The measured variables are connected with the inferred fluxes through mass-conserving transport in the global atmosphere and are part of the inversion results. Large-scale annual fluxes estimated from the bias-corrected land retrievals of the second Orbiting Carbon Observatory (OCO-2) differ substantially from the prior fluxes, but are similar to the fluxes estimated from the surface network within the uncertainty of these surface-based estimates. The OCO-2- and surface-based inversions have similar performance when projected in the space of the aircraft data, but relative strengths and weaknesses of the two flux estimates vary within the Northern and Tropical parts of the continents. The verification data also suggest that the more complex and more recent transport model does not improve the inversion skill. In contrast, the inversion using bias-corrected retrievals from the Greenhouse Gases Observing Satellite (GOSAT) or, to a larger extent, a non-Bayesian inversion that simply adjusts a recent bottom-up flux estimate with the annual growth rate diagnosed from marine surface measurements, estimate much different fluxes and fit the aircraft data less well. Our study highlights a way to rate global atmospheric inversions. Without any general claim on the usefulness of all OCO-2 retrieval datasets vs. all GOSAT retrieval datasets, it still suggests that some satellite retrievals can now provide inversion results that are, despite their uncertainty, comparable in credibility to traditional inversions using the accurate but sparse surface network and that are therefore complementary for studies of the global carbon budget.

The article presents global inversion results for an ensemble of 6 inversion set-ups, comprising observations from GOSAT and in situ measurements, and using
Our focus is on the definition of quality measures, and we apply them to the analysis of a set of global inversions in view of the effect of the transport model and assimilated observations. We discuss our contribution to our general focus in section Discussion and recall previous uses of aircraft measurements there. We explain that an "objective evaluation" (to use the terms of our title) of global CO2 atmospheric inversions currently has to rely on aircraft measurements, however scarce they are, but that they have hardly served this purpose so far. We hope that our text can stimulate progress in this direction because we believe that objective evaluation is critical for the credibility of atmospheric inversions.
In comparison, the results of our application to the progress of OCO-2-driven inversions may come across as stronger, but our paper reflects the structure of our scientific approach here (addressing a methodological question with an application). We note that the reviewer does not challenge any specific part of our logic.
Specific comments:
-The effect of the transport model on the inversion result is included in the analysis by using two different versions of the same model. The results indicate that this difference has very little effect on the obtained surface fluxes, from which the conclusion seems to be drawn that transport errors have a much lesser influence on the quality of the inversion than do the assimilated observations. Although this seems to be the case for the analyzed inversions, it is well known that in general the transport model is one of the largest sources of difference between inversions (as for example shown again recently in Schuh 2019). This analysis is therefore misleading: the chosen variations in transport model are likely not representative for the actual transport variations typically observed between different inversion set-ups. In order to include these results, they should therefore be situated in the context of an analysis of the decrease in model error between both versions, and preferably also compared against the typical spread between transport models.
The relatively small impact of transport uncertainty has been expected in the case of the assimilation of OCO-2 after Basu et al. (2018). At first glance, this may be more of a surprise for the surface-driven inversions (ibid.). However, Schuh et al. (2019, their Section 4.1), after Yu et al. (2018), highlighted technical artefacts (i.e. regridding problems) to explain some of the large model-to-model differences. LMDz and its parent model are run at the same resolution, so that our model does not suffer from this issue. We also note that only five years ago, the refinement of the vertical grid in our version of LMDz had a major impact (Chevallier et al. 2014). The version with coarse vertical resolution ran much faster than the new one: it is likely that nobody would have complained if we still used it for model intercomparisons (thereby contributing to enlarge result spread) or other studies, but we have decided to phase it out. Further refinements in the process modelling and in the driving winds have a much more modest effect for the quantities that we study in the paper (inferred annual Transcom-region-scale fluxes). This is consistent with what we saw with the parent model of our off-line model (Remaud et al., 2018, a study that also included a test of the refinement of the 3D model resolution).
Initial text: This result is consistent with our study of the parent model of our off-line model in forward mode (Remaud et al., 2018) and suggests that LMDz transport errors play a much smaller role in the quality of our inversion results than the choice of assimilated data.
Modified text: This result is consistent with our study of the parent model of our off-line model in forward mode (Remaud et al., 2018) and suggests that LMDz transport errors play a much smaller role in the quality of our inversion results than the choice of assimilated data. This may be different for previous versions of LMDz (Chevallier et al., 2014) or some other off-line models (Schuh et al., 2019).
-The OCO-2 MIP (Crowell 2019) is touched upon rather briefly, but it would make the analysis stronger if more of the obtained results are put into perspective with respect to the inversion ensemble analyzed in the MIP.
Even in the case of our own LMDz-based results, important changes have been made between this contribution to Crowell et al. and the OCO-2 inversions presented here: the retrievals are different both in terms of quality and coverage (O'Dell et al., 2018; see also the analysis of Miller and Michalak, 2019, https://doi.org/10.5194/acp-2019-382, of the difference between the two retrieval products), and the transport model has been upgraded from version 3 to versions 5A and 6A. The most striking difference with respect to the OCO-2 MIP conclusions is the proximity of the OCO-2-based annual fluxes with the surface-based fluxes, as noted in our discussion section. A finer analysis of the changes would have required running a series of sensitivity tests, but the running time of the inversion system makes this impractical, as explained in our conclusion.
-One aspect which might be given some more attention in the manuscript is the presence of spatial patterns in the optimized surface fluxes due to the used prior, another aspect that typically causes large differences between inversion results. It is mentioned that the OCO-2 and in situ based fluxes look similar, can this be due to the prior pattern being dominant over observation patterns in regions with sparse observational coverage?
The paper only presents annual natural fluxes for which our prior fields are uniformly zero over land. All our inversions deviate substantially from this prior value. This was already noted in Section 3.2, which stated that the OCO-2 inversions were "clearly distinct from the inversion prior fluxes (that are zero on annual average over land), and from the GOSAT-based fluxes". The last words of the sentence, about GOSAT, indicate that the inversion system does not automatically converge to a unique point. Posterior natural fluxes are also influenced by the prior fossil fuel maps, but the spatial patterns of Fig. 6, which we discuss at the end of Section 3.2, do not resemble fossil fuel patterns.
-The aircraft verification data has especially large coverage over North America. Is the coverage over the other continents sufficient to make statements about the inversion quality?
Most of the aircraft measurement programs are over North America, but the majority of measurements are provided by the CONTRAIL program, as noted in the legends of Figs. 7 and 8. CONTRAIL samples air at our study altitudes above many cities outside North America. CONTRAIL represents 74% of all data for the GOSAT period and 39% for the OCO-2 period. Still, as noted in the conclusion, "the situation is not satisfactory at present in some parts of the world, like Africa".
How is this considered when analyzing the root-mean-square difference (Line28 p7)? Are these values a representative quality measure, or biased towards quality above North America?
Line28 p7 says "If we combine all measurements together, the root-mean-square difference for the OCO-2-based and the surface-based inversions varies only between 1.51 and 1.56 ppm." If we only take CONTRAIL data, the root-mean-square difference for the OCO-2-based and the surface-based inversions still only varies between 1.60 and 1.67 ppm, which tends to indicate that our conclusions are not biased towards features that are specific to North America.
We have added: "Note that 39% of these data are from CONTRAIL, a program that spreads over all continents."
-Similar to figure 1, it would be nice to include maps with the observational coverage (and if possible quality) of the different sets of assimilated observations, and to refer to this information in the discussion of the inversion results.
Maps of data coverage in our case follow common patterns.
In Section 2.3. "Assimilated observations", we now refer to the maps of Chevallier (2018a) for SURF, to those of Bösch and Anand (2017)
-Line 32, page 4: please include a little more information about the used transport uncertainty.
We have added "based on the variability across several models at the OCO-2 sounding locations" before the reference to Crowell et al. (2019).
-On page 4 (lines 37-39) it is mentioned that the bias of OCO-2 ocean observations is still too large for these observations to be included in the inversion, based on initial testing. It would be interesting to include a little more information here: how significant is the bias and the effect of it on optimized fluxes?
We have added the following information: "(annual global ocean sinks about 5 GtC·a⁻¹, to be compared with the much smaller state-of-the-art estimates in Le Quéré et al., 2018)". In order to be more exhaustive, we have also added "(as are retrievals over inland water or over mixed land-water surfaces)" at the end of the sentence.
-Line 2 on page 5 states that outliers are rejected: are these considered outliers in the aircraft data, or in the simulation result? If the latter, is there a pattern to be discovered? And are the same observations removed for all inversions in the ensemble?
We consider that these measurements sampled local plumes that cannot be represented by our global model. Nearly the same measurements of that sort are identified when compared with each inversion posterior simulation.
We have added "they likely represent very local pollution plumes".
-Line 4 p6: this is not entirely clear: Do you mean that the grid cell value is used as simulated value for the verification data?

Yes.
We have added the reviewer's expression.
-Line 33 p9: if the aircraft data does not allow to distinguish between the quality of OCO-2 and in situ-based fluxes, what does this imply for future quality assessments of inversions? One might expect that the quality of inversion results will increase in the coming years, both through the availability of high-quality satellite data, and through improvements in inversion set-ups. Would this mean that the presented aircraft verification approach will no longer be useful to distinguish between their quality?
Consistent with our title, we focus on objective criteria to evaluate global CO2 atmospheric inversions. In our discussion, we explain that they have to rely on aircraft measurements, however scarce they are. To better serve this purpose, in particular in view of future improvements of the inversion systems, we obviously need more aircraft measurements. The analysis could also be extended to AirCore measurements, whose number over the globe is growing.
We have added "We also need better coverage to accompany the better quality of inversion results expected in the coming years.".
Small comments / technical corrections:
-Please use ppm/yr instead of ppm/a
We follow ISO 80000-3:2006, which defines a as the symbol for year.
-Lines 47-51 p7: it would be clearer if you move these after eq (1), especially since the equation is now on the next page.

Done.
- Figure 6 might be clearer if you remove the edges of the contour plot, as they now look messy in combination with the edges of the continents.
On the other hand, the contour lines help to visualize the flux patterns and are particularly useful over the ocean to spot the outgassing areas. We have left them.
-Figures 7&8: please start the y-axis of both plots from zero. It would be insightful to include the 0.15 ppm benchmark to show which biases fall within the measurement uncertainty.
We agree that such changes would be interesting, but they would also reduce the readability of the existing information, which is already limited. We prefer to keep the figures as they are.

Response to Reviewer #2
This paper explores the flux constraint of the in situ network, GOSAT, and OCO-2 independently using two different transport models. As such, this paper is extremely important to the community as a benchmark of progress on satellite remote sensing and the relative information in different sensors on the carbon cycle.
The paper is well-written and thorough, and I recommend publication after a few minor additions and/or corrections.
One issue I can foresee is the use of the OCFP XCO2 product for GOSAT and the OCO-2 FP algorithm for OCO-2. A major conclusion of this paper is that GOSAT has serious issues with regard to its carbon cycle constraint.
We disagree on this last point. We were careful not to draw general conclusions about either OCO-2 or GOSAT, because our study is limited to two specific retrieval datasets: "Further, other GOSAT-based inversions could be more competitive if made differently (e.g., with a different bias-correction), while other OCO-2-based inversions (e.g., with a different transport model or with different retrievals), or ours with ACOS v9 retrievals after our study period (e.g., if the empirical bias-correction is less efficient for later months), could still be found deficient for carbon specialists." (Section Conclusions).
To further clarify this point, in the abstract of the revised version, we have inserted a short similar statement: "Without any general claim on the usefulness of all OCO-2 retrieval datasets vs. all GOSAT retrieval datasets…".
In the conclusion, we have added two words (in bold here): "Within the limitations imposed by the use of two different verification periods, the tested bias-corrected OCO-2 retrievals perform better than the tested GOSAT retrievals in our inversion system.".
A paper by Takagi et al, perhaps in 2015, showed large differences between the different retrieval algorithms, and you might actually get similar results between OCO-2 and GOSAT if you were to use the same retrieval algorithm. I believe this at least warrants some discussion in the manuscript.
Our response above also addresses this point.
We have added the reference to Takagi et al. (2014) in the conclusion: "(despite large differences between GOSAT retrieval algorithms, Takagi et al., 2014)".
Also, the small sensitivity to the two different transport models is a significant finding, as other studies have shown a very large sensitivity to transport. Can you comment on this lack of sensitivity? Is it due to similarities between Era-Interim and Era-5? Prior flux constraint? You comment that LMDz5a and LMDz6a are very different models. I realize that this evaluation could be a series of papers on its own.
In our discussion section, we already mentioned the fact that TransCom-type ensembles "gather systems with various degrees of sophistication (resolution of the transport model, size of the control vector)". For instance, Schuh et al. (2019, their Section 4.1) mostly refer to technical problems in the mass flux processing from the parent model to explain strong differences in interhemispheric mixing, after Yu et al. (2018). Strictly speaking, this is not really transport uncertainty: transport processes are better known than what some off-line models simulate. Our study suggests that transport process uncertainty no longer plays a large role for annual and Transcom-type region scale fluxes, but more investigation would be needed to confirm this explanation.
The dependence of the satellite retrievals on the TCCON data through the bias-correction is a serious limitation for the use of TCCON to discriminate between the skill of the various inversions. Further, as we say in the introduction, the fact that "current TCCON retrievals that serve as the best reference for column retrievals with global coverage, have commensurate [sub-ppm] offset uncertainties (Wunch et al., 2015)" hampers clear conclusions. Indeed, our inversion SURF fits the TCCON data mostly within 1 ppm (see
For better clarity, in the above sentence, we have added "site-specific" before "commensurate offset uncertainties". We also refer in Section Discussion to  where this issue is further discussed.
Page 6, Line 19: Could this be an artifact of the way that the NOAA growth rate is calculated (in the marine boundary layer or at Mauna Loa) vs. a global flux?
This is our interpretation as well, but we have not further investigated this point.

Response to Reviewer #3
The authors of this paper aim at evaluating whether airborne CO2 measurements made in the free troposphere can distinguish six flux inversion results that are based on two different transport model versions and three independent CO2 datasets. The three CO2 datasets used here are a selection of surface-based CO2 measurements and column-mean CO2 concentrations retrieved from GOSAT and OCO-2 satellite observations. The authors presented the evaluation result in terms of the mean of differences between the airborne measurement and corresponding modeled posterior concentration at each airborne site.
Specific comments:
1. The authors achieved their goal of distinguishing the six independent flux inversion results with the free troposphere airborne CO2 measurements, but they go further to address the relative merits and demerits of using GOSAT and OCO-2 retrieval datasets based on their evaluation results. The current design/setup of the experiment, however, is too limited to discuss that topic; the two satellite retrieval datasets were evaluated over different time periods and with different amounts of airborne observations at different locations/sites. Indeed, the authors acknowledge in the conclusion chapter that the current experiment can be expanded to 1) cover longer periods in which both satellite retrieval datasets overlap, 2) understand the impact of differences among multiple GOSAT/OCO-2 retrieval algorithms available, 3) test out other approaches to handle the dense OCO-2 retrievals, and 4) assess the impact of OCO-2/GOSAT differences in data density, data precision, and quality in CO2 retrieving and bias correction. Items 2-4 are particularly essential topics that need to be explored. I would encourage the authors to go to that depth, if they are to touch on the merit/demerit topic.
When writing the paper, we had no intention to address the relative merits and demerits of using GOSAT and OCO-2 retrieval datasets in general. We tried to make it clear that we were studying two specific products in particular: "Further, other GOSAT-based inversions could be more competitive if made differently (e.g., with a different bias-correction), while other OCO-2-based inversions (e.g., with a different transport model or with different retrievals), or ours with ACOS v9 retrievals after our study period (e.g., if the empirical bias-correction is less efficient for later months), could still be found deficient for carbon specialists." (Section Conclusions).
To further clarify this point, in the abstract of the revised version, we have inserted a short similar statement: "Without any general claim on the usefulness of all OCO-2 retrieval datasets vs. all GOSAT retrieval datasets…".
In the conclusion, we have added two words (in bold here): "Within the limitations imposed by the use of two different verification periods, the tested bias-corrected OCO-2 retrievals perform better than the tested GOSAT retrievals in our inversion system.".

Can this help explain the gaps in the regional flux time series?
Considering the same 185 measurements (i.e. much less than what is shown in Fig. 7 and 8), the performance of SURF is in-between GOSAT and OCO-2: 0.8 and 1.0 ppm, respectively for 5A and 6A. These statistics are based on a small sample and provide little help to explain the spread of the time series over a large portion of the South American continent.

Do the flux gaps fall within the range of flux uncertainties?
For Tropical South America, the posterior uncertainty of SURF (1σ) is 1.3 GtC/a and well covers the spread. This large uncertainty is expected since the reviewer points to regions that are measurement-sparse at the surface. A more general view on this question is provided by Figure 5 that shows the ratio of the absolute difference between the OCO-2 annual fluxes and the SURF annual fluxes to the SURF posterior uncertainty: "For all regions discussed so far, this difference is usually within the Bayesian uncertainty standard deviation (but reaches up to 2.6 times this quantity in Northern Africa for 2015), which means that the difference between the two flux estimates at this scale is mostly not statistically significant" (Section 3.2).
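As an illustration of the ratio described above, a minimal Python sketch follows. The function name and the two flux values are hypothetical and chosen only for illustration; the only number taken from the text is the 1.3 GtC/a posterior uncertainty.

```python
def uncertainty_ratio(flux_a, flux_b, sigma_b):
    """Absolute difference between two annual flux estimates, expressed in
    units of the posterior standard deviation of the second estimate.
    All three inputs are assumed to be in GtC/a."""
    if sigma_b <= 0:
        raise ValueError("sigma_b must be positive")
    return abs(flux_a - flux_b) / sigma_b


# Hypothetical annual fluxes for one region (GtC/a); not values from the paper.
oco2_flux = 0.4
surf_flux = -0.2
surf_sigma = 1.3  # 1-sigma SURF posterior uncertainty quoted for Tropical South America

ratio = uncertainty_ratio(oco2_flux, surf_flux, surf_sigma)
within_one_sigma = ratio <= 1.0  # True when the difference lies within 1 sigma
```

A ratio at or below 1 corresponds to a difference within one Bayesian standard deviation, the criterion used in the quoted sentence.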
What are other reasons that may explain these gaps?
We acknowledged the fact that "remaining differences between fluxes from these two flux inversion types require further analysis." (Section 5, Conclusions). As long as they are consistent with the posterior uncertainty estimates, they are not surprising: they reflect the limitations of the observation information content as seen through the inverse system.

What Figures 2 through 4 show poses readers a question of whether column-mean CO2 retrievals from multiple satellite missions can be mixed in CO2 source-sink studies. I think these are worth discussing in depth in this paper.
We argued that the satellite-driven time-series presented here cannot be mixed: "Upstream, both inferred flux time series do not overlap with each other at all scales studied here (for instance in the tropical lands) in terms of both the mean and variability. This prevents us from computing flux anomalies from one vs. the other" (Section 5, Conclusions).

Minor comments:
We have replaced "because of likely edge effects" by "because it is less constrained at the end by lack of 2017 retrievals here."
2. Figure 7: The SURF values are shown in red color in the bar chart, but blue color is used for SURF circles indicating >0.15 ppm differences. Should this be in red?
We thank the reviewer for having spotted the mistake. We have corrected the legend.

Abstract. We study an ensemble of six multi-year global Bayesian CO2 atmospheric inversions that vary in terms of assimilated observations (either column retrievals from one of two satellites or surface air sample measurements) and transport model. The time series of inferred annual fluxes are first compared with each other at various spatial scales. We then objectively evaluate the small inversion ensemble based on a large dataset of accurate aircraft measurements in the free troposphere over the globe, which are independent from all assimilated data. The measured variables are connected with the inferred fluxes through mass-conserving transport in the global atmosphere and are part of the inversion results. Large-scale annual fluxes estimated from the bias-corrected land retrievals of the second Orbiting Carbon Observatory (OCO-2) differ substantially from the prior fluxes, but are similar to the fluxes estimated from the surface network within the uncertainty of these surface-based estimates. The OCO-2- and surface-based inversions have similar performance when projected in the space of the aircraft data, but relative strengths and weaknesses of the two flux estimates vary within the Northern and Tropical parts of the continents. The verification data also suggest that the more complex and more recent transport model does not improve the inversion skill. In contrast, the inversion using bias-corrected retrievals from the Greenhouse Gases Observing Satellite (GOSAT) or, to a larger extent, a non-Bayesian inversion that simply adjusts a recent bottom-up flux estimate with the annual growth rate diagnosed from marine surface measurements, estimate much different fluxes and fit the aircraft data less well. Our study highlights a way to rate global atmospheric inversions. Without any general claim on the usefulness of all OCO-2 retrieval datasets vs. all GOSAT retrieval datasets, it still suggests that some satellite retrievals can now provide inversion results that are, despite their uncertainty, comparable in credibility to traditional inversions using the accurate but sparse surface network and that are therefore complementary for studies of the global carbon budget.

Introduction
Carbon dioxide (CO2) is increasingly monitored in the global atmosphere due to its important role in climate change. For example, NOAA's GlobalView Plus Observation Package (ObsPack, Cooperative Global Atmospheric Data Integration Project, 2018) archives high-quality measurements made at the surface or from aircraft by various institutes. Despite occasional budget difficulties (Houweling et al., 2012), the number of collected data points has increased exponentially over the years, with, in reference to 1980, six times more measurements in 2000 and 100 times more measurements in 2017. In addition, the ground-based Total Carbon Column Observing Network of column retrievals (TCCON, Wunch et al., 2011) is less than 15 years old but already operates about 30 sites over the globe. Other measurements, like the recent AirCore technique that samples air in freefall tubes (Karion et al., 2010) or the COllaborative Carbon Column Observing Network (COCCON, Frey et al. 2018), have also emerged in the past decade. Most remarkably, the number of spectrometers designed to monitor the CO2 column from space has grown from one in 2002 to six at the end of 2018 (Crisp et al., 2018). The primary motivation for this increase of CO2 observations has been to further our understanding of the global surface fluxes of carbon, with the additional help of meteorological data (e.g., Bolin and Keeling, 1963; WMO, 2018). This is done in practice by inversion of atmospheric transport models within a Bayesian framework (e.g., Peylin et al., 2013). Scientists have urged caution when interpreting this growing amount of data because the uncertainty of the available meteorological information was identified early as a critical limitation on the exploitable measurement information.
This limitation motivated the creation of the international Atmospheric Tracer Transport Model Intercomparison project 25 years ago (TransCom, Law et al., 1996) and is still relevant today (Schuh et al., 2019). Adequate representation of the various error statistics involved in the Bayesian estimation remains a challenge (e.g., Bocquet et al., 2011). In addition, column retrievals, made from measured radiances from space or on the ground after complex processing, cannot fundamentally be calibrated relative to WMO-traceable standards, in contrast to surface measurements like those in ObsPack GlobalView Plus. Indeed, systematic errors in the retrievals at the sub-μmol/mol level (10⁻⁶ mol/mol, abbreviated as parts per million, ppm) are enough to affect the flux estimation (Chevallier et al. 2007), but the current TCCON retrievals that serve as the best reference for column retrievals with global coverage have site-specific commensurate offset uncertainties (Wunch et al., 2015). A given inversion configuration is made of one or several observation types, a transport model and a few statistical models. Many of them seem reasonable. Though model disagreement has been reduced over the last couple of decades, current inversion results show an unacceptably large spread, even for zonal averages (e.g., Le Quéré et al., 2018). This study aims at evaluating whether simple measures of quality based on airborne measurements in the free troposphere can distinguish between six inversion configurations. These inversion configurations differ in the assimilated data and in the transport model. The assimilated data are either surface measurements in ObsPack and related databases, retrievals from the Greenhouse Gases Observing Satellite (GOSAT) or retrievals from the second Orbiting Carbon Observatory (OCO-2).
The transport models are two versions of the atmospheric general circulation model of the Laboratoire de Météorologie Dynamique (LMDz, Hourdin et al., 2013) nudged towards analysed meteorological variables. The Bayesian inversion system from the Copernicus Atmosphere Monitoring Service (CAMS, https://atmosphere.copernicus.eu/, Chevallier et al., 2005) is used in all six inversions. We use a "poor man's inversion" (Chevallier et al., 2009) based on recent bottom-up fluxes and on the global annual atmospheric growth rate estimated from the average of marine surface measurements (Conway et al., 2014) to define a baseline for the skill of each Bayesian inversion result.
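The idea behind such a baseline can be sketched in a few lines of Python. This is only an illustrative mass-balance calculation under stated assumptions, not the procedure of Chevallier et al. (2009): the function name, the input values, and the conversion factor of about 2.12 GtC per ppm of CO2 (a commonly used approximation) are not taken from the text.

```python
# Approximate mass of carbon per ppm of atmospheric CO2 (assumed value).
GTC_PER_PPM = 2.12


def global_flux_adjustment(growth_rate_ppm, fossil_gtc, bottom_up_natural_gtc):
    """Return the global offset (GtC/a) to add to a bottom-up natural flux
    so that fossil + natural fluxes match the observed atmospheric growth.
    Positive fluxes are towards the atmosphere."""
    # Total annual flux to the atmosphere implied by the growth rate.
    total_flux = growth_rate_ppm * GTC_PER_PPM
    # Natural flux needed once the prescribed fossil emissions are removed.
    target_natural = total_flux - fossil_gtc
    return target_natural - bottom_up_natural_gtc


# Hypothetical inputs for one year; not values from the paper.
adjustment = global_flux_adjustment(
    growth_rate_ppm=2.0, fossil_gtc=10.0, bottom_up_natural_gtc=-4.0
)
```

How the global offset is then distributed in space and time is a separate design choice that this sketch does not address.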
Our use of airborne measurements in the free troposphere as verification data is motivated by their frequent, WMO-traceable calibration, their independence from all data assimilated here (including the measurements in the boundary layer) and their spatial distribution that samples all oceans and continents. Arguably they are the only CO2 dataset that possesses all three of these qualities.
In the following, data and models are described in Section 2, while Section 3 presents the various results. They are discussed in Section 4. Section 5 concludes the study.

LMDz is the atmospheric component of the Earth system model of Institut Pierre-Simon-Laplace (Dufresne et al., 2013), which has been contributing to the recent versions of the Coupled Model Intercomparison Project (CMIP) established by the World Climate Research Programme (https://cmip.llnl.gov/). Here, we use its off-line version (Hourdin et al. 2006) to simulate the transport of CO2. The off-line LMDz model reads a frozen archive of 3-hourly-mean meteorological data pre-computed by the full LMDz so that it only needs to simulate large-scale advection and subgrid transport processes (i.e. deep convection and boundary-layer turbulence). LMDz is nudged towards 6-hourly analysed meteorological variables, here either ERA-Interim (Dee et al., 2011) or ERA-5 (https://www.ecmwf.int/en/forecasts/datasets/archive-datasets/reanalysis-datasets/era5, access 31 January 2019) with a relaxation time of 3 hours. On-line and off-line models are consistently run at the same spatial resolution in order to avoid any challenging interpolation of the air mass fluxes for the subgrid processes (see, e.g., Yu et al., 2018): here 39 eta-pressure layers between the surface and around 80 km above sea level, and 96×96 grid-points, i.e. a horizontal resolution of 1.89° in latitude by 3.75° in longitude. This configuration discretizes the 2-7 km above sea level region of the atmosphere, which will be a major focus in the following, into 6 to 10 layers, depending on local orography.
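The quoted horizontal resolution can be recovered from the grid size with simple arithmetic, sketched below in Python. The sketch assumes that the 96 latitude points include both poles (so that 95 intervals span 180°) while the 96 longitude points wrap around the globe; this is our reading of the numbers, not something stated in the text.

```python
N_LON = 96  # number of grid points in longitude
N_LAT = 96  # number of grid points in latitude

# Longitudes are periodic, so 96 points give 96 intervals over 360 degrees.
lon_step_deg = 360.0 / N_LON

# Assumption: latitude points sit at both poles, giving 95 intervals over 180 degrees.
lat_step_deg = 180.0 / (N_LAT - 1)

resolution = (round(lat_step_deg, 2), round(lon_step_deg, 2))
```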
We use two physical formulations of LMDz, called 5A (in code identification number 1649) and 6A (in code identification number 3353), as described by Remaud et al. (2018, and references therein). The gap between the two versions represents about six years of development by the LMDz team and includes, e.g., a complete revision of radiation, the introduction of the thermodynamical effect of ice and changes in the subgrid-scale parameterizations (convection, boundary layer dynamics) and in the land surface processes. For version 5A, horizontal winds are nudged towards ERA-Interim, but we use the new ERA-5 for LMDz6A. Therefore, the differences between the two versions cannot be exclusively attributed to subgrid-scale processes, since boundary variables (nudging files and land processes) differ as well.

LMDz is embedded within the CAMS CO2 inversion system. This system minimizes a Bayesian cost function to optimize the grid-cell eight-day surface fluxes (with a distinction between local night-time fluxes and daytime fluxes, but without fossil fuel emissions, that are prescribed) and the initial state of CO2. To do so, it assimilates a series of CO2 observations over a given time window within the LMDz model. The minimization approach is called 'variational' because it explicitly computes the gradient of the cost function using the adjoint code of LMDz. Prior information about the surface fluxes is provided to the

All assimilated observations are dry air mole fractions of CO2.
Assimilated surface air sample measurements have been selected from four large ongoing databases of atmospheric CO2 measurements: (i) NOAA's ObsPack (Cooperative Global Atmospheric Data Integration Project, 2018, and CarbonTracker Team, 2018), (ii) the World Data Centre for Greenhouse Gases archive (WDCGG, https://gaw.kishou.go.jp/), (iii) the Réseau Atmosphérique de Mesure des Composés à Effet de Serre database (RAMCES, http://www.lsce.ipsl.fr/), and (iv) the Integrated Carbon Observation System-Atmospheric Thematic Center (ICOS-ATC, https://icos-atc.lsce.ipsl.fr/). The list of selected sites and maps of their location are given by Chevallier (2018a). Each dataset provides at least five years of measurements. The error variances assigned to these measurements in the inversion system correspond to transport modelling uncertainty (analytical measurement uncertainty of in situ CO2 data is a negligible component) and are computed as the variance of the high frequency variability of the de-seasonalized and de-trended CO2 time series of the daily-mean measurements at each site.

These variances are then inflated in order to give the same weight to each measurement day at a given location.
GOSAT was launched in January 2009, as a joint project of the Japan Aerospace Exploration Agency (JAXA), the National Institute for Environmental Studies (NIES) and Japan's Ministry of the Environment (MOE) (Kuze et al., 2009). OCO-2 is a NASA satellite that was launched in July 2014 (Eldering et al., 2017). Both satellites still collect scientific data today. They orbit around the Earth from pole to pole with a local crossing time at the Equator in the early local afternoon. Each carries a spectrometer that measures the sunlight reflected by the Earth and its atmosphere in the near-infrared/shortwave infrared spectral regions, with high spectral resolution (>~ 20,000) such that individual gas absorption lines are resolved. OCO-2 provides spatially dense data with a narrow swath and with footprints of a few km², while GOSAT provides coarser-resolution data (100 km² at nadir) with low spatial density. Various algorithms have been developed to retrieve the column-average dry air-mole fraction of CO2 in the atmosphere (XCO2) from the measured radiance spectra. For GOSAT, we use bias-corrected XCO2 retrievals from product OCO Full Physics (OCFP) v7.1 made by the University of Leicester and available from the Copernicus Climate Change Service for the period April 2009 to December 2016 (https://climate.copernicus.eu/). For OCO-2, we use NASA's Atmospheric CO2 Observations from Space (ACOS) bias-corrected retrievals, version 9 (Kiel et al., 2018; O'Dell et al., 2018) from September 2014 until July 2018. In both cases, a previous release of the CAMS surface-based inversion contributed to the official retrieval bias-correction to some extent.
We neglect this dependency in the following because other reference data are used that reduce the weight of the CAMS inversion (e.g., TCCON), and because the bias-correction schemes rely on 2 to 5 time- and space-invariant parameters only, with internal retrieval variables (e.g., the retrieved vertical CO2 gradient between the surface and the free troposphere) as predictors. We do not tune the official retrieval bias-corrections. To reduce data volume without loss of information at the scale of a global model, glint and nadir OCO-2 retrievals have been averaged in 10-s bins for the Model Intercomparison Project (MIP) of OCO-2, as described in Crowell et al. (2019),

and we use them in this form. The retrieval averaging kernels, prior profiles and Bayesian uncertainty are accounted for in the assimilation of both types of satellite retrievals (the interpolation procedure between the model vertical grid and the retrieval grid is described in Section 2.2 of Chevallier, 2015). For OCO-2 retrievals, we also use the transport uncertainty term that is provided by the OCO-2 MIP, based on the variability across several models at the OCO-2 sounding locations (Crowell et al., 2019).

Maps of the coverage of GOSAT and OCO-2 retrievals are shown in Bösch and Almud (2017) and O'Dell et al. (2018), respectively. We only consider "good" retrievals as identified by variable xco2_quality_flag of each product. Both land and ocean data are used for GOSAT. GOSAT data over ocean have matured in the ~10 years since they were first produced, and have reached a point where they appear to have smaller biases than over land (Zhou et al., 2016). Their direct inclusion in inversions also appears to be beneficial (Deng et al., 2016). However, though the ocean biases in OCO-2 have been substantially reduced since the initial version 7 (O'Dell et al., 2018), initial inversion tests using OCO-2 ocean observations still produced highly unrealistic results (annual global ocean sinks about 5 GtC·a⁻¹, to be compared with the much smaller state-of-the-art estimates in Le Quéré et al., 2018) and are hence left out of this work (as are retrievals over inland water or over mixed land-water surfaces). As for GOSAT, this situation may change in time and OCO-2 ocean data could be beneficial in future inversion set-ups. Despite the exclusion of ocean retrievals and the 10 s averaging, there are still 65% more OCO-2 retrievals than GOSAT retrievals assimilated on average per month.

Verification observations
We use some specific measurements of the dry air mole fraction of CO2 as verification data. They are aircraft measurements in the free troposphere made between July 2009 and December 2017 and archived in different ObsPacks (Cooperative Global Atmospheric Data Integration Project, 2018, and NOAA Carbon Cycle Group ObsPack Team, 2018). Table 1 lists the various aircraft measurement sites, campaigns or programs. For simplicity, all sites, campaigns or programs will be referred to as "programs" in the following. All measurements have been calibrated to the WMO CO2 X2007 scale or to the NIES 09 CO2 scale to better than 0.1 ppm (e.g., Machida et al., 2008; Sweeney et al., 2015). We note that no aircraft data are assimilated here (Section 2.3).
We define the free troposphere as the altitudes between 2 and 7 km above sea level. We avoid data below 2 km because (i) local anthropogenic emissions affect many aircraft measurements there, and (ii) some of the aircraft flew in the vicinity of measurement sites that have been used in the surface-based inversions. We avoid data above 7 km because the measurement variations (and the flux regional signal) are much reduced there. A few outliers for which the difference between model and observation is larger than 40 ppm are rejected: they likely represent very local pollution plumes.
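The selection rules above can be illustrated with a minimal sketch. It is not the code used for this study: the `Sample` record, the function name and the treatment of the 2 km and 7 km bounds as inclusive are assumptions made for the illustration; only the altitude range and the 40 ppm rejection threshold come from the text.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical record for one aircraft measurement and its model equivalent."""
    altitude_km: float    # altitude above sea level
    observed_ppm: float   # measured CO2 dry air mole fraction
    simulated_ppm: float  # model value in the grid cell at measurement time


def select_verification(samples, z_min=2.0, z_max=7.0, max_misfit_ppm=40.0):
    """Keep free-tropospheric samples and reject gross outliers.

    Assumption: the altitude bounds are inclusive. A sample is rejected when
    the absolute model-minus-observation difference exceeds max_misfit_ppm.
    """
    kept = []
    for s in samples:
        in_layer = z_min <= s.altitude_km <= z_max
        misfit = abs(s.simulated_ppm - s.observed_ppm)
        if in_layer and misfit <= max_misfit_ppm:
            kept.append(s)
    return kept
```

For instance, a sample at 1.5 km is dropped whatever its misfit, and a sample at 6 km with a 50 ppm misfit is dropped as a likely local pollution plume.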
We define two periods for the following statistical computations. They are based on the availability of the satellite retrievals and of the aircraft data in the databases used here: a "GOSAT period" from July 2009 until September 2016 and an "OCO-2 period" from December 2014 until December 2017. Note that they overlap and that there is a minimum of three months between the temporal bounds of the verification data and the temporal bounds of the assimilated data in order to account for inversion spin-up and spin-down. Figure 1 shows the geographical location of the verification data for the two periods.

Poor man's inversion
In order to put the differences between inversion simulations and aircraft measurements in perspective, we compare them to an inversion that only assimilated the annual global growth rate of CO2. This baseline, called "the poor man's inversion" by Chevallier et al. (2009), adjusts prior natural fluxes over land in order to fit the annual trend of globally-averaged marine measurements (http://www.esrl.noaa.gov/gmd/ccgg/trends/, access 10 January 2019) multiplied by a conversion factor (2.086 GtC·ppm⁻¹, from Prather, 2012) when combined with prior ocean and fossil fuel fluxes. The correction to the natural land fluxes is made proportional to the prior error standard deviations assigned within a given inversion system. In the case of the CAMS system here, the prior error standard deviations are themselves proportional to a climatology of heterotrophic respiration fluxes simulated by a vegetation model, with a ceiling of 4 gC·m⁻²·day⁻¹. This simple approach is not Bayesian because prior error correlations are ignored, but it still allows transport models to fit atmospheric data with less bias than its prior fluxes because it closes the carbon budget in a plausible way.
Over the ocean and for the fossil fuel emissions, we choose the same prior fluxes as for the six Bayesian inversions (Landschützer et al., 2017, Crippa et al., 2016, Le Quéré et al., 2018, see Section 2.2). However, we choose more informed natural fluxes over land than for the Bayesian inversions: rather than leaving the inversion fully free to locate the annual land sinks (see Section 2.2), we take a simulation of a dynamic global vegetation model that accounts for land-use, climate and CO2 history (simulation ORCHIDEE-Trunk in Le Quéré et al., 2018). When multiplied by 2.086 GtC·ppm⁻¹, this combination of prior fluxes already fits the annual trend of globally-averaged marine measurements with a root-mean-square difference of 0.3 ppm·a⁻¹. By construction, the poor man's adjustment brings these annual global differences to zero.
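The annual budget closure described above can be sketched as follows. This is a simplified illustration rather than the CAMS implementation: the function and argument names are hypothetical, each land cell is represented by one annual flux total in GtC·a⁻¹, and the weights are assumed to be the prior error standard deviations already integrated over cell area and time. Only the 2.086 GtC·ppm⁻¹ factor and the proportionality rule come from the text.

```python
PPM_TO_GTC = 2.086  # GtC per ppm, conversion factor quoted in the text (Prather, 2012)


def poor_mans_adjustment(growth_ppm_per_yr, land_prior, sigma, ocean_total, fossil_total):
    """Adjust prior land fluxes so that the global budget matches the observed growth.

    growth_ppm_per_yr: observed annual global growth rate (ppm per year)
    land_prior:        prior annual land fluxes per cell (GtC per year), hypothetical layout
    sigma:             prior error standard deviations per cell (same layout), used as weights
    ocean_total, fossil_total: prior global ocean and fossil fuel fluxes (GtC per year)
    """
    target_total = growth_ppm_per_yr * PPM_TO_GTC
    residual = target_total - (sum(land_prior) + ocean_total + fossil_total)
    sigma_sum = sum(sigma)
    if sigma_sum <= 0.0:
        raise ValueError("prior error standard deviations must have a positive sum")
    # Each cell receives a share of the residual proportional to its prior error.
    return [f + residual * s / sigma_sum for f, s in zip(land_prior, sigma)]
```

With this construction the adjusted land fluxes, combined with the unchanged ocean and fossil fuel fluxes, reproduce the observed annual growth exactly, while cells with a zero prior error keep their prior value.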

For the comparison of the poor man's inversion with aircraft measurements, we use LMDz5A. We start the poor man's simulation on 1 January 2000 from a 3D prior initial state of CO2. We then add an offset to the simulation so that its mean bias with respect to NOAA's surface measurements at South Pole Observatory (Cooperative Global Atmospheric Data Integration Project, 2018) over the 2010-2017 period is zero. This offset addresses the uncertainty of the initial state and the uncertainty of the 2.086 GtC·ppm⁻¹ conversion factor.
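The offset step amounts to subtracting a single constant from the whole simulation. The sketch below illustrates it under the assumption that simulated and observed values at the reference site are already paired in time; the function and argument names are hypothetical.

```python
def remove_mean_bias(simulated_all, simulated_at_ref, observed_at_ref):
    """Shift a simulation by one constant so that its mean bias at a reference site is zero.

    simulated_all:    simulated mole fractions to correct (ppm), anywhere in the domain
    simulated_at_ref: simulated mole fractions at the reference site (ppm)
    observed_at_ref:  matching measurements at the reference site (ppm)
    Returns the offset and the shifted simulation.
    """
    if not observed_at_ref or len(simulated_at_ref) != len(observed_at_ref):
        raise ValueError("reference series must be non-empty and of equal length")
    n = len(observed_at_ref)
    mean_bias = sum(s - o for s, o in zip(simulated_at_ref, observed_at_ref)) / n
    offset = -mean_bias
    return offset, [c + offset for c in simulated_all]
```

Because the offset is uniform in space and time, it changes the mean bias of the simulation but leaves its gradients and variability untouched.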

Principle
We build an ensemble of six Bayesian inversions using the inversion system of Section 2.2, the two transport model versions of Section 2.1, and the three observation datasets of Section 2.3. The assimilation periods differ (Section 2.2), but the prior fluxes and the prior error model are the same. For each inversion, the posterior model simulation statistically fits its own assimilated data well within their 1-sigma uncertainty. Note that the surface-based inversion with LMDz5A is exactly the CO2 inversion product 18r1 of CAMS that was released in November 2018 (http://atmosphere.copernicus.eu/). In the figures, we will refer to the surface-based inversions by the generic name "SURF" for simplicity.
We first present the carbon budget estimates. We choose to look at fluxes at the annual scale only, knowing that over land, the inferred interannual variability is completely driven by the assimilated observations (because prior natural fluxes over land are zero on annual average for the Bayesian inversions, see Section 2.2). As we will see, it is relatively large. Except at the global scale, capturing the interannual variability well is particularly challenging because its estimation accumulates all errors made throughout the seasonal cycle.
Then we compare the inversion performance vis-à-vis the aircraft measurements of Section 2.4, to the performance of the poor man's inversion of Section 2.5. This comparison is made for two periods (Section 2.4). For each of them, we will only consider the inversions that cover the window completely, which means that the GOSAT-based (or OCO-2-based) inversions will not be used in the results for the "OCO-2 period" (or "GOSAT period"). The projection of the inversion fluxes onto the space of the aircraft-measured variables (mole fractions) is made by the same LMDz model version that was used in the inversion.

Doing this, we are consistent with the way the inversion system distributes the well-constrained total mass of carbon in the atmosphere and we avoid error compensations between the version used in the assimilation and the one used in the evaluation. The model is directly sampled at measurement time and space, without any interpolation: the grid cell value is used as simulated value for the verification data.

Australia regions. While being clearly distinct from the inversion prior fluxes (that are zero on annual average over land), and from the GOSAT-based fluxes, we note the agreement of the two OCO-2-based inversions with the 6A SURF inversion and the poor man's fluxes (that are informed by an up-to-date bottom-up simulation) in the two boreal regions, despite the lack of OCO-2 data there during half of the year as a consequence of insufficient insolation (see, e.g., Deng et al., 2014). The main differences between inversions OCO-2 and SURF over the ocean are regions North Pacific Temperate and Southern Ocean.

Figure 5 compares the difference between fluxes estimated by assimilating either the OCO-2 or the surface data within LMDz6A, to the posterior uncertainty diagnosed from the Bayesian system (Chevallier et al., 2007) for the surface-based inversion. For all regions discussed so far, this difference is usually within the Bayesian uncertainty standard deviation (but reaches up to 2.6 times this quantity in Northern Africa for 2015), which means that the difference between the two flux estimates at this scale is mostly not statistically significant. Figure 6 further zooms in to the pixel-scale for year 2015, a year that is common between all inversions. Only the LMDz5A results for the satellite-based inversions are shown. For the two surface-based inversions, the change of transport model leaves the flux patterns generally unchanged but slightly modulates their amplitude. In contrast, the two satellite-based inversions show more differences in the flux pattern.
They suggest large flux gradients in southern Africa and South America: similar to each other in Africa, with a large sink in the tropical evergreen forests and large sources around; different in America with a source over the tropical evergreen forests for GOSAT and over a northeast corner for OCO-2. The broad flux patterns in the lands of the Northern Hemisphere are similar between the four maps, but OCO-2 has flux gradients closer to SURF than to GOSAT in America while the opposite is seen in South-east Asia. The Tropical ocean outgassing region reduces with OCO-2 and expands to the south with GOSAT.

Figure 7 presents the statistics of model-minus-measurement differences per measurement program for the GOSAT period. Note that the data number varies by several orders of magnitude among the programs: there are a few hundred samples for most of the 37 programs, but a few thousand for CALNEX2010, KORUS-AQ, ORCAS, SGP and ATom, a few 10,000s for ACT, DC3, DISCOVER-AQ, GSFC, HIPPO, SEAC4RS, and SONGNEX2015, and 900,000 for CONTRAIL. Obviously, many measurements may fit into a single time-space block of the global transport model. We will only discuss bias differences larger than 0.15 ppm (i.e. above the calibration uncertainty of the aircraft data, see Section 2.4) and that are statistically significant at the 0.05 level, as reported on the figure. The computation of the significance level is made with an unpaired t-test when comparing inversion results that assimilated different data (we assume that changing the assimilated data makes the inversion results independent), and with a paired t-test when comparing inversion results that assimilated the same data (we assume that inversion results in which only the transport model varies are dependent). In practice, changing the independence assumption only affects the detail of the significance-level results, but not the overall picture.
Figure 6) translate into relatively close performance compared to the aircraft. SURF performs better than OCO-2 for INPE and ACT in terms of biases, and worse for GSFC and KORUS-AQ. The poor man's simulation has lesser skill than in the GOSAT period: it performs much worse than the surface-based and the OCO-2-based inversions in the Northern Hemisphere, and comparably or better in the Southern one. If we combine all measurements together, the root-mean-square difference for the OCO-2-based and the surface-based inversions varies only between 1.51 and 1.56 ppm. Note that 39% of these data are from CONTRAIL, a program that spreads over all continents. The standard deviations are comparable between the OCO-2-based inversions and the surface-based inversions. LMDz6A improves the SURF biases for KORUS-AQ and degrades them for three other programs (INPE, LEF and ETL). This lack of improvement also appears for OCO-2 (degradation at INPE, KORUS-AQ, and ABOVE). The statistics for four programs (ORCAS, KORUS-AQ, ACT and SONGNEX2015) are directly comparable between the two periods because the corresponding data are fully in both of them: in all four, OCO-2 performs better than GOSAT.

Figure 9 reformulates the bias statistics of Figure 8 on a map of the differences between the absolute biases of inversions OCO-2 and SURF. As for the program biases, some points are more robust than others (due to varying amount of data), but there is some large-scale coherence, with better performance of SURF in the Southern Hemisphere (as could already be seen in Figure 7 and Figure 8) and in central and Eastern US, while OCO-2 yields smaller biases in the Northern Hemisphere subtropics and in Europe. Other parts of the globe are less consistent such as the western Pacific edge or boreal America.
Pixel attribution

Liu and Bowman (2016) proposed a method to quantify the impact of flux changes over the globe on the corresponding change in the mean squared error (MSE) of the transport model simulation with respect to n independent measurements. They demonstrated it in the case of the flux changes from their prior values to their posterior values within the approximations of a linear transport model M (including the sampling operator at measurement time and location) and of an unchanged initial state of CO2. It is actually valid for other types of changes within an inversion, provided they respect the tangent-linear hypothesis for the transport model. The change in the MSE (δMSE) is expressed as a finite sum of terms. There is one term for each element i of the inversion control vector (i.e., a CO2 flux at a given time and location, or some part of the 3D initial state of CO2). Term i is the product of the corresponding change in the control vector (i.e. a scalar δfi), times the corresponding row of the transpose of the linear model M, times (dot product here) the vector of the sum of the differences between the two model simulations (one, C1, before the change in the control vector and one, C2, after the change) and all verification measurements (δC1+δC2, both vectors with dimension n). Interestingly, the second product in this formula can be calculated by the adjoint code of the transport model, if it exists, which is the case for LMDz (Section 2.2). Further detail is given in Liu and Bowman (2016).
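The decomposition described in words above can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the adjoint code of LMDz: the transport operator is a small dense matrix, the helper names are hypothetical, and the 1/n normalisation is inferred from the definition of a mean squared error rather than quoted from Liu and Bowman (2016). With a linear model, δC2 = δC1 + M δf, so that δMSE = (1/n) Σi δfi [Mᵀ(δC1+δC2)]i.

```python
def matvec(M, x):
    """Forward product M x (flux change to mole fraction change)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def rmatvec(M, y):
    """Transpose product M^T y, the role played by an adjoint model."""
    n_cols = len(M[0])
    return [sum(M[k][i] * y[k] for k in range(len(M))) for i in range(n_cols)]


def mse(residuals):
    """Mean squared error of a list of model-minus-measurement differences."""
    return sum(r * r for r in residuals) / len(residuals)


def delta_mse_terms(M, delta_f, resid1):
    """Per-element contributions to the MSE change for a linear model M.

    resid1 holds the differences between simulation C1 and the measurements;
    the differences for simulation C2 follow from linearity.
    """
    n = len(resid1)
    resid2 = [r + d for r, d in zip(resid1, matvec(M, delta_f))]
    summed = [a + b for a, b in zip(resid1, resid2)]
    sensitivity = rmatvec(M, summed)
    return [df * g / n for df, g in zip(delta_f, sensitivity)]
```

Summing the returned terms recovers the difference between the two MSEs exactly, and each term can be mapped back to the time and location of its control-vector element.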

We apply this approach to interpret the difference between the OCO-2-based and the surface-based inversions using LMDz5A. The overall MSE is very similar between both (1.5² ppm²), but the relative performance still varies in space and time (Figure 8 and Figure 9) and we hope to extract some further insight into the relative merits of each dataset. In practice, we compute δC1+δC2 using the LMDz model linearized around the inversion prior simulation in order to respect the underlying hypothesis. However, some inconsistencies for the initial state of CO2 could not be completely removed between δC1 and δC2 due to the America, southern Africa, South and South-east Asia, and Indonesia. No noticeable contribution is seen over the ocean, where OCO-2 retrievals have not been assimilated. By construction, regions that are not well observed downstream by aircraft have lesser contributions, like in Africa. This feature makes the relative magnitude of the patterns among each other not very informative about the flux quality. We will therefore pay more attention to the sign of the dominant patterns.
The map of Figure 9, which refers to differences in absolute biases within moving windows, is in principle not directly comparable with Figure 10, which refers to MSEs. However, bias changes are much larger than standard deviation changes (Figure 8) which makes the map of root-mean-square errors (RMSEs, not shown) very similar to Figure 9. Differences between the patterns of Figure 9 in the space of free-tropospheric mole fractions and those of Figure 10 in the space of fluxes are linked to the way CO2 is transported between the surface and the free troposphere. Dominating westerlies outside the Tropics bring the positive flux contributions of Figure 10 to the west of the positive RMSE variations of Figure 9, like from the western to eastern US, or, at a much larger scale, from Eurasia to Alaska. Similarly, negative flux contributions from the eastern US induce negative RMSE variations in the central North Atlantic Ocean, and tropical easterlies link the negative flux contributions from Southern Africa to the negative RMSE variations in the tropical Atlantic Ocean. The distance between flux signal and free tropospheric signal implies an important role for the transport model in attributing the latter to the former so that these patterns should be considered with caution, as has been the case for inversion systems in general.

Discussion
Interest in atmospheric CO2 observations has grown dramatically over the last decade, with the hope that they can reliably quantify the evolution of the CO2 sources and sinks. However, a suite of physical and statistical models is needed to estimate the latter from the former. For instance, the link between some of these observations, like the satellite retrievals, and measurement standards is not direct and needs to be empirically made. We also lack measurements dedicated to the development and validation of atmospheric transport models, in particular for subgrid-scale processes. Therefore, the various underlying models are still in development and our current source-sink estimation capability is not clear: there is no consensus about the latitudinal distribution of the natural carbon fluxes (Le Quéré et al., 2018) or about the carbon budget of relatively well-documented regions like Europe (Reuter et al., 2017). We have defined here quality measures for global inversion systems in order to evaluate the current skill of global inversions, through the example of the CAMS inversion system. By focussing on a specific inversion system, we have avoided the problem of heterogeneity of TransCom-type ensembles, that gather systems with various degrees of sophistication (resolution of the transport model, size of the control vector), but we still varied the assimilated data (surface or satellite) and the transport model in order to generate a small inversion ensemble.
In practice, quality measures for a data assimilation system must rely on unbiased and independent data (Talagrand, 2015). The property of unbiasedness means that the errors are null on statistical average. The property of independence means that the errors affecting the verification data must not be correlated with the errors affecting the observations that have been used in the inversion. Ideally, the verification data should be the carbon fluxes to be evaluated, but in the specific case of global inverse systems, the spatial resolution of existing flux observations (of the order of a hundred meters) is much smaller than the spatial resolution of global transport models (larger than a degree). Therefore, one has no option but to evaluate the analysed CO2 fields (that are the combination of the analysed surface fluxes, of an analysed initial state of CO2 and of the transport model used in the inversion) rather than the analysed surface fluxes alone, both of them being related through mass-conserving transport in the global atmosphere. This can be done with atmospheric observations like those listed in the Introduction: surface measurements, aircraft measurements, TCCON retrievals, AirCore measurements or satellite retrievals. We remove TCCON and the satellite data from the list on the criterion of unbiasedness (Figure 9 suggests that we are interested here in signals that are smaller than the TCCON trueness; see also the discussion in Chevallier, 2018b), the surface data on the criterion of independence (the surface data in ObsPack-type databases that are well simulated by the transport models are usually assimilated), and the AirCore data because of limited time and space coverage so far. This leaves aircraft data, however scarce they are, as an obvious choice to define objective measures of quality of the inversion systems, when they are not assimilated.

They have served this role in the past to some extent, starting from Peylin et al. (2006) or Stephens et al. (2007) (see also Pickett-Heaps et al., 2011; Basu et al., 2014; Houweling et al., 2015; Frankenberg et al., 2016; Le Quéré et al., 2018; Crowell et al., 2019), but few aircraft measurement programs have been used so far and, as a consequence, their use has rarely been formulated in terms of quality assurance or quality control processes for atmospheric inversions. Compared to previous studies, we benefit from a much larger number of aircraft measurements over the globe in the free troposphere (600,000 for the OCO-2 period, twice as many for the GOSAT period) and from more recent satellite retrievals.
We have used data between 2 and 7 km above sea level only, where the age of air varies significantly (Krol et al., 2018). Aircraft data in this region of the atmosphere only sample a portion of the carbon cycle. With their sparse coverage at places, they may miss some of the tropical flux signal that can reach higher levels within a few days, but flux errors compensate at the global scale such that errors in the Tropics that would not be directly seen will likely induce errors elsewhere that can be seen.

Conversely, our 5-km wide layer still represents a large portion of the column observed by the satellites. However, with the use of individual pointwise measurements (rather than profile averages), we hope to have minimized the possible advantage given to the satellite inversions with respect to the surface-based inversions. The gradient between mole fractions in the boundary layer and the free troposphere is also informative (Stephens et al., 2007). It provides complementary information about inversion quality, provided that the minority of measurements above urban areas or in the vicinity of assimilated surface sites are excluded. This has not been explored here.
For our ensemble of six Bayesian inversion results, we have seen that large differences in the estimated annual subcontinental fluxes (GOSAT-based vs. surface-based results) are paralleled by different quality of fit to the aircraft data, with GOSAT-based results performing less well. An additional poor man's inversion that simply adjusts very recent bottom-up flux estimates with the annual global growth rate, has larger differences than the surface-based and the OCO-2-based inversions in terms of flux and the aircraft data. Changing the transport model affected the flux estimation only at the scale of TransCom-type regions: no benefit could be seen with respect to aircraft data, despite six years of model development within the CMIP framework by the LMDz team and despite improved nudging meteorological variables between the two versions (from ERA-Interim to ERA-5). This result is consistent with our study of the parent model of our off-line model in forward mode (Remaud et al., 2018) and suggests that LMDz transport errors play a much smaller role in the quality of our inversion results than the choice of assimilated data. This may be different for previous versions of LMDz (Chevallier et al., 2014) or some other off-line models (Schuh et al., 2019). In comparison to the GOSAT results, or to previous OCO-2 inversion results (Crowell et al., 2019), OCO-2-based annual fluxes are surprisingly close to the surface-based fluxes (usually within 1 σ of the Bayesian uncertainty of the surface-based fluxes). Consequently, the aircraft data used here do not allow us to distinguish between the quality of OCO-2-based fluxes and surface-based fluxes. The poor man's inversion still performs worse despite the contribution of a recent dynamic global vegetation model simulation (i.e. simulation ORCHIDEE-Trunk in Le Quéré et al., 2018), showing that the OCO-2 performance is not trivial.
Following Liu and Bowman (2016), we attribute the simulation error changes in the free troposphere for the OCO-2 period to flux differences in specific regions of the globe. We find a rather homogeneous geographical distribution of the flux performance with OCO-2-based fluxes and surface-based fluxes alternating as the best ones over continental land masses. This adjoint analysis also illustrates the large footprint of our aircraft data in the free troposphere in terms of flux information, which prevents using them for the evaluation of local fluxes, given our choice of altitude range of 2-7 km above sea level.

Conclusions
Within the limitations imposed by the use of two different verification periods, the tested bias-corrected OCO-2 retrievals perform better than the tested GOSAT retrievals in our inversion system. Upstream, both inferred flux time series do not overlap with each other at all scales studied here (for instance in the tropical lands) in terms of both the mean and variability. This prevents us from computing flux anomalies from one vs. the other. Within the study timeframe, it was not possible to test more than a couple of different versions of the GOSAT retrievals (despite large differences between GOSAT retrieval algorithms, Takagi et al., 2014) or other ways to assimilate the OCO-2 retrievals. Indeed, each one of our six Bayesian inversions represented a large computational effort that lasted between four and six weeks on a parallel cluster. We could therefore not identify the distinctive asset of OCO-2 vs. GOSAT in our system: either the data density, the data precision, the data trueness (linked both to the quality of the physical retrieval scheme and to its empirical bias-correction), or a combination of these qualities at once. Further, other GOSAT-based inversions could be more competitive if made differently (e.g., with a different bias-correction), while other OCO-2-based inversions (e.g., with a different transport model or with different retrievals), or ours with ACOS v9 retrievals after our study period (e.g., if the empirical bias-correction is less efficient for later months), could still be found deficient for carbon specialists. As we have shown, aircraft data can help rank the skill of these alternative inversion configurations between each other and vs. ours (all data used here, apart from the recent INPE data, are publicly available).

This validation strategy assumes that airborne measurement programs are continued while new satellite observations are made, and that these programs fairly sample the diversity of CO2 plumes in the free troposphere. In this respect, the situation is not satisfactory at present in some parts of the world, like Africa. We also need better coverage to accompany the better quality of inversion results expected in the coming years. This validation strategy also implies that aircraft data are kept independent from the inversion system, and therefore that observations dedicated to the free troposphere (aircraft or satellite partial column retrievals) are not assimilated. This is usually the case, for instance because of the challenging characterization of model errors in simulating aircraft profiles or because systematic errors for partial column retrievals are too large. Zhang et al. (2014) or Alden et al. (2016) presented a different strategy in which aircraft profile measurements are assimilated: a compromise has to be found between exploiting valuable data directly (in particular in areas void of surface measurements), or keeping them for validation.

Finally, the evidence provided by aircraft measurements in the free troposphere suggests that the quality of some OCO-2 retrievals over land is now high enough to provide results that are comparable in credibility to the reference (but sparse) surface air sample network, within the above-discussed limits. For ocean retrievals, this remains unclear as OCO-2 ocean soundings were not tested in this work. The consistency of results from the surface and OCO-2-driven inversions, in stark contrast to the bottom-up fluxes or to the GOSAT-driven inversion, does not seem to be fortuitous. It may reinforce some specific conclusions from the surface network, for instance pertaining to the location of the land sink in latitude in recent years. Remaining differences between the fluxes from these two inversion types require further analysis and underline their complementarity. The best results may now be obtained by inversions that simultaneously assimilate both observation types.