Objective evaluation of surface- and satellite-driven carbon dioxide atmospheric inversions

We study an ensemble of six multi-year global Bayesian carbon dioxide (CO2) atmospheric inversions that vary in terms of assimilated observations (either column retrievals from one of two satellites or surface air sample measurements) and transport model. The time series of inferred annual fluxes are first compared with each other at various spatial scales. We then objectively evaluate the small inversion ensemble based on a large dataset of accurate aircraft measurements in the free troposphere over the globe, which are independent of all assimilated data. The measured variables are connected with the inferred fluxes through massconserving transport in the global atmosphere and are part of the inversion results. Large-scale annual fluxes estimated from the bias-corrected land retrievals of the second Orbiting Carbon Observatory (OCO-2) differ greatly from the prior fluxes, but are similar to the fluxes estimated from the surface network within the uncertainty of these surface-based estimates. The OCO-2-based and surface-based inversions have similar performance when projected in the space of the aircraft data, but the relative strengths and weaknesses of the two flux estimates vary within the northern and tropical parts of the continents. The verification data also suggest that the more complex and more recent transport model does not improve the inversion skill. In contrast, the inversion using biascorrected retrievals from the Greenhouse Gases Observing Satellite (GOSAT) or, to a larger extent, a non-Bayesian inversion that simply adjusts a recent bottom-up flux estimate with the annual growth rate diagnosed from marine surface measurements both estimate much different fluxes and fit the aircraft data less. Our study highlights a way to rate global atmospheric inversions. Without any general claim regarding the usefulness of all OCO-2 retrieval datasets vs. all GOSAT retrieval datasets, it still suggests that some satellite retrievals can now provide inversion results that are, despite their uncertainty, comparable with respect to credibility to traditional inversions using the accurate but sparse surface network and that are therefore complementary for studies of the global carbon budget.

of 2018 (CEOS Atmospheric Composition Virtual Constellation Greenhouse Gas Team, 2018). The primary motivation for this increase in CO 2 observations has been to further our understanding of the global surface fluxes of carbon, with the additional help of meteorological data (e.g. Bolin and Keeling, 1963;WMO, 2018). This is done in practice by inversion of atmospheric transport models within a Bayesian framework (e.g. Peylin et al., 2013). Scientists have urged caution when interpreting this growing amount of data because the uncertainty of the available meteorological information was identified early as a critical limitation on the exploitable measurement information. This limitation motivated the creation of the international Atmospheric Tracer Transport Model Intercomparison project 25 years ago (TransCom, Law et al., 1996) and is still relevant today . Adequate representation of the various error statistics involved in the Bayesian estimation remains a challenge (e.g. Bocquet et al., 2011). In addition, column retrievals, made from measured radiances from space or on the ground after complex processing, cannot fundamentally be calibrated relative to WMO-traceable standards, in contrast to surface measurements like those in ObsPack GLOBALVIEWplus. Indeed, systematic errors in the retrievals at the sub-micromole per mole level (10 −6 mol mol −1 , abbreviated as part per million, ppm) are enough to affect the flux estimation (Chevallier et al., 2007), but the current TCCON retrievals that serve as the best reference for column retrievals with global coverage have site-specific commensurate offset uncertainties (Wunch et al., 2015).
A given inversion configuration is made of one or several observation types, a transport model and a few statistical models. Many of them seem reasonable. Although model disagreement has been reduced over the last couple of decades, current inversion results show an unacceptably large spread, even for zonal averages (e.g. Le Quéré et al., 2018). This study aims at evaluating whether simple measures of quality based on airborne measurements in the free troposphere can distinguish between six inversion configurations. These inversion configurations differ in the assimilated data and in the transport model. The assimilated data are either surface measurements in ObsPack and related databases, retrievals from the Greenhouse Gases Observing Satellite (GOSAT) or the second Orbiting Carbon Observatory (OCO-2). The transport models are two versions of the atmospheric general circulation model of the Laboratoire de Météorologie Dynamique (LMDz, Hourdin et al., 2013) nudged towards analysed meteorological variables. The Bayesian inversion system from the Copernicus Atmosphere Monitoring Service (CAMS, https://atmosphere.copernicus.eu/, last access: 21 November 2019, Chevallier et al., 2005) is used in all six inversions. We use a "poor man's inversion" (Chevallier et al., 2009) based on recent bottom-up fluxes and on the global annual atmospheric growth rate estimated from the average of marine surface measurements (Conway et al., 2014) to define a baseline for the skill of each Bayesian inversion result.
Our use of airborne measurements in the free troposphere as verification data is motivated by their frequent, WMOtraceable calibration, their independence from all data assimilated here (including the measurements in the boundary layer) and their spatial distribution that samples all oceans and continents. Arguably they are the only CO 2 dataset that possesses all three of these qualities.
In the following, data and models are described in Sect. 2, while Sect. 3 presents the various results. The results are discussed in Sect. 4, and Section 5 concludes the study. 2 Model, system and data

Transport models
LMDz is the atmospheric component of the Earth system model of the Institut Pierre-Simon-Laplace  which has been contributing to the recent versions of the Climate Model Intercomparison Project (CMIP) established by the World Climate Research Programme (https:// www.wcrp-climate.org/wgcm-cmip, last access: 21 November 2019). Here, we use its offline version (Hourdin et al., 2006) to simulate the transport of CO 2 . The offline LMDz model reads a frozen archive of 3-hourly mean meteorological data pre-computed by the full LMDz so that it only needs to simulate large-scale advection and subgrid transport processes (i.e. deep convection and boundary layer turbulence). LMDz is nudged towards 6-hourly analysed meteorological variables, here either ERA-Interim (Dee et al., 2011) or ERA5 (https://www.ecmwf.int/en/forecasts/ datasets/archive-datasets/reanalysis-datasets/era5, last access: 31 January 2019) with a relaxation time of 3 h. Online and offline models are consistently run at the same spatial resolution in order to avoid any challenging interpolation of the air mass fluxes for the sub-grid processes (see e.g. Yu et al., 2018): here 39 eta-pressure layers between the surface and around 80 km above sea level (km a.s.l.), and 96 grid points × 96 grid points, i.e. a horizontal resolution of 1.89 in latitude × 3.75 • in longitude. This configuration discretises the 2-7 km a.s.l. region of the atmosphere, which will be a major focus in the following, into 6 to 10 layers, depending on local orography.
We use two physical formulations of LMDz, called 5A (in code identification number 1649) and 6A (in code identification number 3353), as described by Remaud et al. (2018, and references therein). The gap between the two versions represents about 6 years of development from the LMDz team and includes e.g. a complete revision of radiation, the introduction of the thermodynamical effect of ice and changes in the sub-grid-scale parameterisations (convection, boundary layer dynamics) and in the land surface processes. For version 5A, horizontal winds are nudged towards ERA-Interim, but we use the new ERA5 for LMDz6A. Therefore, the differences between the two versions cannot be exclusively at-tributed to sub-grid-scale processes, as boundary variables (nudging files and land processes) differ as well.

Inversion system
LMDz is embedded within the CAMS CO 2 inversion system. This system minimises a Bayesian cost function to optimise the grid cell 8-day surface fluxes (with a distinction between local night-time fluxes and daytime fluxes, but without fossil fuel emissions, which are prescribed) and the initial state of CO 2 . To do so, it assimilates a series of CO 2 observations over a given time window within the LMDz model. The minimisation approach is called "variational" because it explicitly computes the gradient of the cost function using the adjoint code of LMDz. Prior information about the surface fluxes is provided to the Bayesian system by a combination of climatologies and other types of measurementdriven flux estimates (e.g. Emission Database for Global Atmospheric Research version 4.3.2, Crippa et al., 2016, scaled globally and annually from Le Quéré et al., 2018, for the fossil fuel emissions or Landschützer et al., 2017, for the ocean fluxes). Details can be found in Chevallier (2018a). Of special interest here is the fact that, when integrated over a calendar year, prior natural fluxes are zero over all land grid points: this implies that the interannual variability of the inferred annual mean of terrestrial vegetation fluxes is generated by the assimilated observations only. Over a full year, the total 1σ uncertainty (resulting from assigned error variances that vary in space and time, and from assigned temporal and spatial error correlations) for these prior land fluxes amounts to about 3.0 GtC a −1 . The error statistics for the open ocean correspond to a global air-sea flux uncertainty of about 0.5 GtC a −1 .
The assimilation window is either 19 years for the surface measurements (from January 2000 to October 2018), 8 years for the GOSAT retrievals (from January 2009 to December 2016) or 4 years for the OCO-2 retrievals (from September 2014 to July 2018).

Assimilated observations
All assimilated observations are the dry air mole fraction of CO 2 .
Assimilated surface air sample measurements have been selected from four large ongoing databases of atmospheric CO 2 measurements: (i) NOAA's ObsPack (Cooperative Global Atmospheric Data Integration Project, 2018, andCar-bonTracker Team, 2018), (ii) the World Data Centre for Greenhouse Gases archive (WDCGG, https://gaw.kishou.go. jp/, last access: 21 November 2019), (iii) the Réseau Atmosphérique de Mesure des Composés à Effet de Serre database (RAMCES, http://www.lsce.ipsl.fr/, last access: 21 November 2019) and (iv) the Integrated Carbon Observation System Atmospheric Thematic Centre (ICOS-ATC, https://icos-atc. lsce.ipsl.fr/, last access: 21 November 2019). The list of se-lected sites and maps of their location are given by Chevallier (2018a). Each dataset provides at least 5 years of measurements. The error variances assigned to these measurements in the inversion system correspond to transport modelling uncertainty (analytical measurement uncertainty of in situ CO 2 data is a negligible component) and are computed as the variance of the high-frequency variability of the deseasonalised and de-trended CO 2 time series of the daily mean measurements at each site. These variances are then inflated in order to give the same weight to each measurement day at a given location.
GOSAT was launched in January 2009, as a joint project of the Japan Aerospace Exploration Agency (JAXA), the National Institute of Environmental Studies (NIES) and Japan's Ministry of the Environment (MOE) (Kuze et al., 2009). OCO-2 is a NASA satellite that was launched in July 2014 (Eldering et al., 2017). Both satellites still collect scientific data today. They orbit around the Earth from pole to pole with a local crossing time at the Equator in the early local afternoon. Each carries a spectrometer that measures the sunlight reflected by the Earth and its atmosphere in the nearinfrared/shortwave infrared spectral regions, with high spectral resolution (>∼ 20 000) such that individual gas absorption lines are resolved. OCO-2 provides spatially dense data with a narrow swath and with footprints of a few square kilometres, whereas GOSAT provides coarser-resolution data (100 km 2 at nadir) with low spatial density. Various algorithms have been developed to retrieve the column-average dry air mole fraction of CO 2 in the atmosphere (XCO 2 ) from the measured radiance spectrums. For GOSAT, we use bias-corrected XCO 2 retrievals from the OCO Full Physics (OCFP) v7.1 product made by the University of Leicester and available from the Copernicus Climate Change Service for the period from April 2009 to December 2016 (https: //climate.copernicus.eu/, last access: 21 November 2019). For OCO-2, we use NASA's Atmospheric CO 2 Observations from Space (ACOS) bias-corrected retrievals, version 9 O'Dell et al., 2018) from September 2014 to July 2018. In both cases, a previous release of the CAMS surface-based inversion contributed to the retrieval official bias-correction to some extent. We neglect this dependency in the following because other reference data are used that reduce the weight of the CAMS inversion (e.g. TCCON), and because the bias-correction schemes rely on two to five timeand space-invariant parameters only, with internal retrieval variables (e.g. the retrieved vertical CO 2 gradient between the surface and the free troposphere) as predictors. We do not tune the official retrieval bias corrections. To reduce data volume without loss of information at the scale of a global model, glint and nadir OCO-2 retrievals have been averaged in 10 s bins for the Model Intercomparison Project (MIP) of OCO-2, as described in Crowell et al. (2019), and we use them in this form. The retrieval averaging kernels, prior profiles and Bayesian uncertainty are accounted for in the assimilation of both types of satellite retrievals (the interpolation procedure between the model vertical grid and the retrieval grid is described in Sect. 2.2 of Chevallier, 2015). For OCO-2 retrievals, we also use the transport uncertainty term that is provided by the OCO-2 MIP, based on the variability across several models at the OCO-2 sounding locations .
Maps of the coverage of GOSAT and OCO-2 retrievals are shown in Bösch and Anand (2017) and O'Dell et al. (2018), respectively. We only consider "good" retrievals as identified by the xco2_quality_flag variable of each product. Both land and ocean data are used for GOSAT. GOSAT data over ocean have matured in the ∼ 10 years since they were first produced, and have reached a point where they appear to have smaller biases than over land (Zhou et al., 2016). Their direct inclusion in inversions also appears to be beneficial (Deng et al., 2016). However, although the ocean biases in OCO-2 have been substantially reduced since the initial version 7 (O'Dell et al., 2018), initial inversion tests using OCO-2 ocean observations still produced highly unrealistic results (annual global ocean sinks of about 5 GtC a −1 compared with the much smaller state-of-the-art estimates in Le Quéré et al., 2018) and are therefore left out of this work (as are retrievals over inland water or over mixed land-water surfaces). As for GOSAT, this situation may change in time and OCO-2 ocean data could be beneficial in future inversion set-ups. Despite the exclusion of ocean retrievals and the 10 s averaging, there are still 65 % more OCO-2 retrievals than GOSAT retrievals assimilated on average per month.

Verification observations
We use some specific measurements of the dry air mole fraction of CO 2 as verification data. They are aircraft measurements in the free troposphere made between July 2009 and December 2017 and archived in different ObsPacks (Cooperative Global Atmospheric Data Integration Project, 2018, andNOAA Carbon Cycle Group ObsPack Team, 2018). Table 1 lists the various aircraft measurement sites, campaigns or programmes. For simplicity, all sites, campaigns or programmes will be referred to as "programmes" in the following. All measurements have been calibrated to the WMO CO 2 X2007 scale or to the NIES 09 CO 2 scale to better than 0.1 ppm (e.g. Machida et al., 2008;Sweeney et al., 2015). We note that no aircraft data are assimilated here (Sect. 2.3).
We define the free troposphere as the altitudes between 2 and 7 km a.s.l. We avoid data below 2 km because (i) local anthropogenic emissions affect many aircraft measurements there, and (ii) some of the aircraft flew in the vicinity of measurement sites that have been used in the surface-based inversions. We avoid data above 7 km because the measurement variations (and the flux regional signal) are much reduced there. A few outliers for which the difference between model and observation is larger than 40 ppm are rejected: they likely represent very local pollution plumes. We define two periods for the following statistical computations. They are based on the availability of the satellite retrievals and of the aircraft data in the databases used here: a "GOSAT period" from July 2009 to September 2016 and an "OCO-2 period" from December 2014 to December 2017. Note that they overlap and that there is a minimum of 3 months between the temporal bounds of the verification data and the temporal bounds of the assimilated data in order to account for inversion spin-up and spin-down. Figure 1 shows the geographical location of the verification data for the two periods.

Poor man's inversion
In order to put the differences between inversion simulations and aircraft measurements into perspective, we compare them to an inversion that only assimilated the annual global growth rate of CO 2 . This baseline, called "the poor man's inversion" by Chevallier et al. (2009), adjusts prior natural fluxes over land in order to fit the annual trend of globally averaged marine measurements (http://www.esrl.noaa. gov/gmd/ccgg/trends/, last access: 10 January 2019) multiplied by a conversion factor (2.086 GtC ppm −1 , from Prather, 2012) when combined with prior ocean and fossil fuel fluxes. The correction to the natural land fluxes is made proportional to the prior error standard deviations assigned within a given inversion system. In the case of the CAMS system here, the prior error standard deviations are themselves proportional to a climatology of heterotrophic respiration fluxes simulated by a vegetation model, with a ceiling of 4 gC m −2 d −1 . This simple approach is not Bayesian because prior error correlations are ignored, but it still allows transport models to fit atmospheric data with less bias than its prior fluxes because it closes the carbon budget in a plausible way. Over the ocean and for the fossil fuel emissions, we choose the same prior fluxes as for the six Bayesian inversions (Landschützer et al., 2017;Crippa et al., 2016;Le Quéré et al., 2018, see Sect. 2.2). However, we choose more informed natural fluxes over land than for the Bayesian inversions: rather than leaving the inversion fully free to locate the annual land sinks (see Sect. 2.2), we use a simulation of a dynamic global vegetation model that accounts for land use, climate and CO 2 history (simulation ORCHIDEE-trunk in Le Quéré et al., 2018). When multiplied by 2.086 GtC ppm −1 , this combination of prior fluxes already fits the annual trend of globally averaged marine measurements with a root-meansquare difference of 0.3 ppm a −1 , By construction, the poor man's adjustment brings these annual global differences to zero.
For the comparison of the poor man's inversion with aircraft measurements, we use LMDz5A. We start the poor man's simulation on 1 January 2000 from a 3-D prior initial state of CO 2 . We then add an offset to the simulation so that its mean bias with respect to NOAA's surface measurements at the South Pole Observatory (Cooperative Global Atmospheric Data Integration Project, 2018) over the 2010-2017 period is zero. This offset addresses the uncertainty of the initial state and the uncertainty of the 2.086 GtC ppm −1 conversion factor.

Principle
We build an ensemble of six Bayesian inversions using the inversion system from Sect. 2.2, the two transport model versions from Sect. 2.1 and the three observation datasets from Sect. 2.3. The assimilation periods differ (Sect. 2.2), but the prior fluxes and the prior error model are the same. For each inversion, the posterior model simulation statistically fits its own assimilated data well within their 1σ uncertainty. Note that the surface-based inversion with LMDz5A is exactly the same as the CO 2 inversion product version 18r1 of CAMS that was released in November 2018 (http://atmosphere. copernicus.eu/, last access: 21 November 2019). In the figures, we will refer to the surface-based inversions by the generic name "SURF" for simplicity.
We first present the carbon budget estimates. We choose to look at fluxes at the annual scale only, knowing that the inferred interannual variability is completely driven by the assimilated observations over land (because prior natural fluxes over land are zero on annual average for the Bayesian inver-sions, see Sect. 2.2). As we will see, it is relatively large. Except at the global scale, capturing the interannual variability well is particularly challenging because its estimation accumulates all errors made throughout the seasonal cycle.
Then we compare the inversion performance vis-à-vis the aircraft measurements of Sect. 2.4, to the performance of the poor man's inversion of Sect. 2.5. This comparison is made for two periods (Sect. 2.4). For each of them, we will only consider the inversions that cover the window completely, which means that the GOSAT-based (or OCO-2-based) inversions will not be used in the results for the "OCO-2 period" (or "GOSAT period"). The projection of the inversion fluxes onto the space of the aircraft-measured variables (mole fractions) is made by the same LMDz model version that was used in the inversion. Thus, we are consistent with the way the inversion system distributes the well-constrained total mass of carbon in the atmosphere and we avoid error compensations between the version used in the assimilation and the one used in the evaluation. The model is directly sampled at measurement time and space, without any interpolation: the grid cell value is used as the simulated value for the verification data.

Annual budgets
The time series of the annual natural carbon budgets at several very broad scales are displayed in Fig. 2 for the period between 2004 and 2017: the globe, the northern or southern extratropics, and the tropics with lands and oceans either separated or combined. At this scale, the influence of the transport model version is hardly distinguishable (coloured solid lines vs. coloured dashed lines). The poor man's inversion (black dashed lines) locates the land sink mostly in the northern extratropics but also in the tropics (consistent with its prior information shown in Fig. 8 of Le Quéré et al., 2018), whereas the six Bayesian inversions put it more in the northern extratropics (starting from a null prior on annual average). All approaches converge towards near-neutral southern extratropical lands (that represent a relatively small surface area). Over the oceans, the surface-based inversions vary little from the prior (which is equal to the poor man's estimate there), but the GOSAT-based inversions reduce the ocean sink by about 0.5 GtC a −1 in 2015; the OCO-2-based inversions increase it by up to 1 GtC a −1 . We recall that years 2015 and 2016 correspond to a strong El Niño event associated with a large CO 2 growth rate (e.g. Malhi et al., 2018 and references therein). The GOSAT inversions seem to underestimate the beginning of this anomaly (Fig. 2a), and to attribute it to the southern extratropical oceans rather than to the tropical lands like the other inversions. OCO-2-based fluxes are close to the surface-based fluxes, except for the increased ocean sink (which appears to be regularly spread between the three bands). The OCO-2-based and surfacebased growth rates are very similar, but do not fully overlap with the poor man's fluxes because they do not fully agree with NOAA's estimates, in particular in 2016 when they diagnose a smaller rate (by 0.25 ppm a −1 if we use the 2.086 GtC ppm −1 conversion factor). Figures 3 and 4 focus on the Bayesian inversion results at the scale of the 22 regions of the TransCom 3 experiment (Gurney et al., 2002): 11 regions over land and 11 regions over the oceans that together tile the whole globe. At this scale, the impact of the choice of the LMDz version appears: LMDz6A induces slightly less year-to-year variability for the surface-based inversion for some years (see the 2010s for the Europe region, the last couple of years for the Eurasian temperate region or the full time series for the North Atlantic temperate region), and the two model versions can yield different baselines (see the North American temperate and South American temperate regions, or the three Atlantic regions). The two GOSAT-based inversions show larger yearto-year variability than the other inversions. The OCO-2based inversions broadly agree with the surface-based inversions for the temporal variability of the fluxes in most regions (North American boreal, Southern Africa, Eurasian boreal, Tropical Asia, Europe) but there are noticeable differences in the North American temperate, South American tropical, South American temperate and Australia regions. While being clearly distinct from the inversion prior fluxes (that are zero on annual average over land), and from the GOSATbased fluxes, we note the agreement of the two OCO-2-based inversions with the 6A SURF inversion and the poor man's fluxes (that are informed by an up-to-date bottom-up simulation) in the two boreal regions, despite the lack of OCO-2 data there during half of the year as a consequence of insufficient insolation (see e.g. Deng et al., 2014). The main differences between inversions OCO-2 and SURF over the ocean are the North Pacific temperate and Southern Ocean regions. Figure 5 compares the difference between fluxes estimated by assimilating either OCO-2-based or surface-based data within LMDz6A to the posterior uncertainty diagnosed from the Bayesian system (Chevallier et al., 2007) for the surfacebased inversion. For all regions discussed so far, this difference is usually within the Bayesian uncertainty standard deviation (but reaches up to 2.6 times this quantity in northern Africa for 2015), which means that the difference between the two flux estimates at this scale is mostly not statistically significant. Figure 6 further zooms in to the pixel-scale for the year 2015, a year that is common between all inversions. Only the LMDz5A results for the satellite-based inversions are shown. For the two surface-based inversions, the change of transport model leaves the flux patterns generally unchanged but slightly modulates their amplitude. In contrast, the two satellite-based inversions show more differences in the flux pattern. They suggest large flux gradients in southern Africa and South America: the fluxes are similar for GOSAT and OCO-2 in Africa, with a large sink in the tropical evergreen forests and large sources around these same forests, whereas they are different in America with a source over the tropical  (2016) is not represented because it is less constrained at the end by the lack of 2017 retrievals here. Note that the prior fluxes are zero over land at this temporal scale (see Sect. 2.2) and that they are equal to the "poor man" curve over the ocean (see Sect. 2.5).  evergreen forests for GOSAT and over a northeast corner for OCO-2. The broad flux patterns in the lands of the Northern Hemisphere are similar between the four maps, but OCO-2 has flux gradients closer to SURF than to GOSAT in America while the opposite is seen in Southeast Asia. The tropical ocean outgassing region reduces with OCO-2 and expands to the south with GOSAT. Figure 7 presents the statistics of model-minus-measurement differences per measurement programme for the GOSAT period. Note that the data number varies by several orders of magnitude among the programmes: there are a few hundred samples for most of the 37 programmes, but a few thousand for CALNEX2010, KORUS-AQ, ORCAS, SGP and ATom, a few tens of thousands for ACT, DC3, DISCOVER-AQ, GSFC, HIPPO, SEAC4RS and SONGNEX2015, and 900 000 for CONTRAIL. Obviously, many measurements may fit into a single time-space block of the global transport model. We will only discuss bias differences larger than 0.15 ppm (i.e. above the calibration uncertainty of the aircraft data, see Sect. 2.4) and those that are statistically significant at the 0.05 level, as reported in the figure. The computation of the significance level is made using an unpaired t test when comparing inversion results that assimilated different data (we assume that changing the assimilated data makes the inversion results independent), and using a paired t test when comparing inversion results that assimilated the same data (we assume that inversion results in which only the transport model varies are dependent). In practice, changing the independency assumption only affects the detail of the significance level results, but not the overall picture.

Differences with aircraft data
Comparing solid and dotted lines, we see no benefit of LMDz6A vs. LMDz5A, as version 6A increases the absolute bias of SURF for eight programmes (three in Brazil -RBA-B, ALF and TAB, as well as CALNEX2010, DISCOVER-AQ, ACT, THD and LEF) and improves it for four programmes (the fourth Brazilian site -SAN , as well as SEAC4RS, KORUS-AQ and ETL). There is no obvious consistency between the changes brought by LMDz6A to the surface-based inversion and those brought to the GOSAT-based inversion. For SAN and SENEX2013, the two surface-based inversions have larger absolute biases than the GOSAT-based inversions, but perform better for 11 other sites. The poor man's inversion shows the worse biases north of 45 • N, but usually performs better than the GOSAT-based inversion in the Southern Hemisphere, likely helped by the tuning with the South Pole Observatory data. Between the Equator and 45 • N, the relative performance of the poor man's inversion is uneven but it is usually not as good as SURF. In terms of standard deviation (bottom row of Fig. 7), the surface-based inversions have the smallest values.
There are 26 aircraft programmes in the OCO-2 period. They challenge SURF a little less (Fig. 8)  atively close performance compared with the aircraft programmes. SURF performs better than OCO-2 for INPE and ACT in terms of biases, and worse for GSFC and KORUS-AQ. The poor man's simulation has less skill than in the GOSAT period: it performs much worse than the surface-based and the OCO-2-based inversions in the Northern Hemisphere, and is comparable or better in the Southern Hemisphere. If we combine all measurements, the root-mean-square difference for the OCO-2-based and the surface-based inversions only varies between 1.51 and 1.56 ppm. Note that 39 % of these data are from CONTRAIL, a programme that spreads over all continents. If we only take CONTRAIL data, the root-mean-square difference for the OCO-2-based and the surface-based inversions still only varies between 1.60 and 1.67 ppm, which tends to indicate that this result is not biased towards features that are specific to one region. The standard deviations are comparable between the OCO-2-based inversions and the surface-based inversions. LMDz6A improves the SURF biases for KORUS-AQ and degrades them for three other programmes (INPE, LEF and ETL). This lack of improvement also appears for OCO-2 (degradation at INPE, KORUS-AQ and ABOVE). The statistics for four programmes (ORCAS, KORUS-AQ, ACT and SONGNEX2015) are directly comparable between the two periods because the corresponding data are complete in both periods: in all four programmes OCO-2 performs better than GOSAT. Figure 9 reformulates the bias statistics of Fig. 8 on a map of the differences between the absolute biases of the OCO-2 and SURF inversions. As for the programme biases, some points are more robust than others (due to varying amount of data), but there is some large-scale coherence, with better performance of SURF in the Southern Hemisphere (as could already be seen in Figs. 7 and 8) and in the central and eastern US, whereas OCO-2 yields smaller biases in the Northern Hemisphere subtropics and Europe. Other parts of the globe are less consistent such as the western Pacific edge or boreal America.

Pixel attribution
Liu and Bowman (2016) proposed a method to quantify the impact of flux changes over the globe on the corresponding change in the mean squared error (MSE) of the transport model simulation with respect to n independent measurements. They demonstrated it in the case of the flux changes from their prior values to their posterior values within the approximations of a linear transport model M (including the sampling operator at the measurement time and location) and of an unchanged initial state of CO 2 . It is actually valid for other types of changes within an inversion, provided they respect the tangent linear hypothesis for the transport model. The change in the MSE (δMSE) is expressed as a finite sum of terms. There is one term for each element i of the inversion control vector (i.e. a CO 2 flux at a given time and location, or some part of the 3-D initial state of CO 2 ): Term i is the product of the corresponding change in the control vector (i.e. a scalar δf i ), times the corresponding row of the transpose of the linear model M, times (dot product here) the vector of the sum of the differences between the two model simulations (one, c 1 , before the change in the control vector and one, c 2 , after the change) and all verification measurements (δc 1 + δc 2 , both vectors with dimension n). Interestingly, the second product in this formula can be calculated by the adjoint code of the transport model, if it exists, which is the case for LMDz (Sect. 2.2). Further detail is given in Liu and Bowman (2016). We apply this approach to interpret the difference between the OCO-2-based and the surface-based inversions using LMDz5A. The overall MSE is very similar between both (1.5 2 ppm 2 ), but the relative performance still varies in space and time (Figs. 8,9) and we hope to extract some further insight into the relative merits of each dataset. In practice, we compute δc 1 + δc 2 using the LMDz model linearised around the inversion prior simulation in order to respect the underlying hypothesis. However, some inconsistencies for the initial state of CO 2 could not be completely removed between δc 1 and δc 2 due to the different starting date of each inversion. The map of the sum of all contributions of the flux changes δf i (from the surface-based inversion to the OCO-2-based inversion) at a given pixel to the change in MSE (δMSE) is presented in Fig. 10. Positive values occur when the OCO-2based fluxes increase the MSE relative to the surface-based fluxes. This happens in the western contiguous US, northeastern South America, western Europe, Turkey, the West Siberian Plain and eastern Siberia. Contributions to reduce the MSE (negative values) are mostly in Alaska and the eastern contiguous US, western South America, southern Africa, South and Southeast Asia, and Indonesia. No noticeable contribution is seen over the ocean, where OCO-2 retrievals have not been assimilated. By construction, regions that are not well observed downstream by aircraft have lesser contributions, like in Africa. This feature makes the relative magnitude of the patterns among each other not very informative regarding the flux quality. Therefore, we will pay more attention to the sign of the dominant patterns.
The map in Fig. 9, which refers to differences in absolute biases within moving windows, is, in principle, not directly comparable with Fig. 10, which refers to MSEs. However, bias changes are much larger than standard deviation changes (Fig. 8) which makes the map of root-mean-square errors (RMSEs, not shown) very similar to Fig. 9. Differences between the patterns in Fig. 9 in the space of free-tropospheric mole fractions and those in Fig. 10 in the space of fluxes are linked to the way CO 2 is transported between the surface and the free troposphere. Dominating westerlies outside the tropics bring the positive flux contributions in Fig. 10 to the west of the positive RMSE variations in Fig. 9, like from the western to the eastern US, or, at a much larger scale, from Eurasia to Alaska. Similarly, negative flux contributions from the eastern US induce negative RMSE variations in the central North Atlantic Ocean, and tropical easterlies link the negative flux contributions from southern Africa to the negative RMSE variations in the tropical Atlantic Ocean. The distance between the flux signal and the free-tropospheric signal implies an important role for the transport model in attributing the latter to the former; therefore, these patterns should be considered with caution, as is the case for inversion systems in general.

Discussion
Interest in atmospheric CO 2 observations has grown dramatically over the last decade, with the hope that they can reliably quantify the evolution of the CO 2 sources and sinks. However, a suite of physical and statistical models is needed to estimate the latter from the former. For instance, the link between some of these observations, like the satellite retrievals, and measurement standards is not direct and needs to be em- Figure 7. Model-minus-observation absolute differences and standard deviations over the GOSAT period per measurement programme for the surface-based inversion (SURF, red line), the GOSAT-based inversion (GOSAT, blue line) and the poor man's inversion (shaded area). Inversions with LMDz5A (LMDz6A) are shown using continuous (dashed) coloured lines. The number of measurements per site, campaign or programme varies between 113 (BNE) and 901 846 (CON). The programme definitions are given in Table 1. They are ranked by increasing mean latitude (north is on the right), irrespective of their latitudinal coverage (which is large at several tens of degrees for ORC, TOM, HIP and CON). These mean latitudes are shown in the middle of the panel. For each programme, a green circle appears in the upper panel if the difference between the GOSAT bias and the SURF bias using LMDz5A is statistically significant (see the main text for a definition) and exceeds 0.15 ppm. Similarly, a red (blue) circle indicates that the difference between LMDz5A and LMDz6A for SURF (GOSAT) is statistically significant and exceeds 0.15 ppm. pirically made. We also lack measurements dedicated to the development and validation of atmospheric transport models, in particular for sub-grid-scale processes. Therefore, the various underlying models are still in development and our current source-sink estimation capability is not clear: there is no consensus about the latitudinal distribution of the natural carbon fluxes (Le Quéré et al., 2018) or about the carbon budget of relatively well-documented regions like Europe (Reuter et al., 2017). Here, we have defined quality measures for global inversion systems in order to evaluate the current skill of global inversions, via the example of the CAMS inversion system. By focussing on a specific inversion system, we have avoided the problem of heterogeneity of TransComtype ensembles, which gather systems with various degrees of sophistication (resolution of the transport model, size of the control vector), but we still varied the assimilated data (surface or satellite) and the transport model in order to generate a small inversion ensemble.
In practice, quality measures for a data assimilation system must rely on unbiased and independent data (Talagrand, 2014). The property of being unbiased means that the errors are null on statistical average. The property of independence means that the errors affecting the verification data must not be correlated with the errors affecting the observations that have been used in the inversion. Ideally, the verification data should be the carbon fluxes to be evaluated, but in the specific case of global inverse systems, the spatial resolution of existing flux observations (of the order of a hundred metres) Figure 9. Difference between the model-minus-observation absolute differences in 10 • moving windows (a). Negative (positive) values denote areas where the OCO-2-based inversion has smaller (larger) biases than the surface-based inversion. Both inversions use LMDz5A. Panel (b) gives the number of data that contribute to the bias computation in each 10 • moving window. Biases are only computed in the windows where there are more than 100 measurements. Figure 10. Contribution of the grid-point flux changes to the change in the variance of CO 2 model-measurement differences between the OCO-2-based inversion and the surface-based inversion (variance of the former minus variance of the latter), in parts per million squared. Both inversions use LMDz5A. Note that the fluxes themselves are illustrated in Fig. 6a and c. is much smaller than the spatial resolution of global transport models (larger than a degree). Therefore, one has no option but to evaluate the analysed CO 2 fields (that are the combination of the analysed surface fluxes, of an analysed initial state of CO 2 and of the transport model used in the inversion) rather than the analysed surface fluxes alone, both of which are related through mass-conserving transport in the global atmosphere. This can be done with atmospheric observations like those listed in Sect. 1: surface measurements, aircraft measurements, TCCON retrievals, AirCore measurements or satellite retrievals. We remove TCCON and the satellite data from the list on the criterion that they are biased ( Fig. 9 suggests that we are interested here in signals that are smaller than the TCCON trueness; see also the discussion in Chevallier, 2018b), the surface data on the criterion of independence (the surface data in ObsPack-type databases that are well simulated by the transport models are usually assimilated), and the AirCore data because of limited time and space coverage so far. This leaves aircraft data, however scarce they are, as an obvious choice to define objective measures of the quality of the inversion systems, when they are not assimilated. They have served this role in the past to some extent, starting from Peylin et al. (2007) or Stephens et al. (2007) (see also Pickett-Heaps et al., 2011;Basu et al., 2014;Houweling et al., 2015;Frankenberg et al., 2016;Le Quéré et al., 2018;Crowell et al., 2019), but few aircraft measurement programmes have been used so far and, as a consequence, their use has rarely been formulated in terms of quality assurance or quality control processes for atmospheric inversions. Compared with previous studies, we benefit from a much larger number of aircraft measurements over the globe in the free troposphere (600 000 for the OCO-2 period and twice as many for the GOSAT period) and from more recent satellite retrievals.
We have only used data between 2 and 7 km a.s.l., where the age of air varies significantly (Krol et al., 2018). Aircraft data in this region of the atmosphere only sample a portion of the carbon cycle. With their sparse coverage in places, they may miss some of the tropical flux signal that can reach higher levels within a few days, but flux errors compensate for this at the global scale such that errors in the tropics that would not be directly seen will likely induce errors elsewhere that can be seen. Conversely, our 5 km wide layer still represents a large portion of the column observed by the satellites. However, with the use of individual pointwise measurements (rather than profile averages), we hope to have minimised the possible advantage given to the satellite inversions with respect to the surface-based inversions. The gradient between mole fractions in the boundary layer and the free troposphere is also informative (Stephens et al., 2007). It provides complementary information about inversion quality, provided that the minority of measurements above urban areas or in the vicinity of assimilated surface sites are excluded. This has not been explored here. Most of the aircraft measurement programmes here are over North America, but the majority of measurements are provided by the CONTRAIL programme, as noted in the legends of Figs. 7 and 8. CON-TRAIL samples air at our study altitudes above many cities outside North America; it also represents 74 % of all data for the GOSAT period and 39 % for the OCO-2 period.
For our ensemble of six Bayesian inversion results, we have seen that large differences in the estimated annual sub-continental fluxes (GOSAT-based vs. surface-based results) are paralleled by a different quality of fit to the aircraft data, with GOSAT-based results not performing as well. An additional poor man's inversion, which simply adjusts very recent bottom-up flux estimates with the annual global growth rate, has larger differences than the surface-based and the OCO-2based inversions in terms of flux and the aircraft data. Changing the transport model only affected the flux estimation at the scale of TransCom-type regions: no benefit could be seen with respect to aircraft data, despite 6 years of model development within the CMIP framework by the LMDz team and despite improved nudging meteorological variables between the two versions (from ERA-Interim to ERA5). This result is consistent with our study of the parent model of our offline model in forward mode (Remaud et al., 2018) and suggests that LMDz transport errors play a much smaller role in the quality of our inversion results than the choice of assimilated data. This may be different for previous versions of LMDz: for instance, the refinement of the vertical grid in our version of LMDz from 19 to 39 layers had a major impact (Chevallier et al., 2014). This may be different for some other offline models: for example, Schuh et al. (2019) highlighted regridding problems within one model to explain some large differences with another one. In comparison to the GOSAT results, or to previous OCO-2 inversion results , OCO-2-based annual fluxes are surprisingly close to the surface-based fluxes (usually within 1σ of the Bayesian uncertainty of the surface-based fluxes). Consequently, the aircraft data used here do not allow us to distinguish between the quality of OCO-2-based fluxes and surface-based fluxes. The poor man's inversion still performs worse despite the contribution of a recent dynamic global vegetation model simulation (i.e. simulation ORCHIDEE-trunk in Le Quéré et al., 2018), showing that the OCO-2 performance is not trivial. From our results, we cannot draw general conclusions about OCO-2 retrievals nor about GOSAT retrievals, because our study is limited to two specific retrieval datasets. Takagi et al. (2014) showed large differences between different retrieval algorithms for GOSAT, and it is still possible that we would get similar results between OCO-2 and GOSAT if we used the same retrieval algorithm.
Following Liu and Bowman (2016), we attribute the simulation error changes in the free troposphere for the OCO-2 period to flux differences in specific regions of the globe. We find a rather homogeneous geographical distribution of the flux performance with OCO-2-based fluxes and surfacebased fluxes alternating as those with the best performance over continental land masses. This adjoint analysis also illustrates the large footprint of our aircraft data in the free troposphere in terms of flux information, which prevents us using them for the evaluation of local fluxes, given our choice of altitude range between 2 and 7 km a.s.l.

Conclusions
Within the limitations imposed by the use of two different verification periods, the tested bias-corrected OCO-2 retrievals perform better than the tested GOSAT retrievals in our inversion system. Upstream, both inferred flux time series do not overlap with each other at all scales studied here (for instance in the tropical lands) in terms of both the mean and variability. This prevents us from computing flux anomalies from one vs. the other. Within the study time frame, it was not possible to test more than a couple of different versions of the GOSAT retrievals (despite large differences between GOSAT retrieval algorithms, Takagi et al., 2014) or other ways to assimilate the OCO-2 retrievals. Indeed, each one of our six Bayesian inversions represented a large computational effort that lasted between 4 and 6 weeks on a parallel cluster. Therefore, we could not identify the distinctive asset of OCO-2 vs. GOSAT in our system: either the data density, the data precision, the data trueness (linked both to the quality of the physical retrieval scheme and to its empirical bias-correction) or a combination of these qualities at once. Further, other GOSAT-based inversions could be more competitive if made differently (e.g. with a different biascorrection), while other OCO-2-based inversions (e.g. with a different transport model or with different retrievals), or our inversion with ACOS v9 retrievals after our study period (e.g. if the empirical bias-correction is less efficient for later months), could still be found deficient for carbon specialists. As we have shown, aircraft data can help with ranking the skill of these alternative inversion configurations among one another and vs. ours (all data used here, apart from the recent INPE data, are publicly available).
This validation strategy assumes that airborne measurement programmes are continued while new satellite observations are made, and that these programmes fairly sample the diversity of CO 2 plumes in the free troposphere. In this respect, the situation is not satisfactory at present in some parts of the world, such as Africa. We also need better coverage to accompany the better quality of inversion results expected in the coming years. This validation strategy also implies that aircraft data remain independent of the inversion system, and, therefore, that observations dedicated to the free troposphere (aircraft or satellite partial column retrievals) are not assimilated. This is usually the case, for instance because of the challenging characterisation of model errors in simulating aircraft profiles or because systematic errors for partial column retrievals are too large. Zhang et al. (2014) and Alden et al. (2016) presented a different strategy in which aircraft profile measurements are assimilated: a compromise has to be found between exploiting valuable data directly (in particular in areas void of surface measurements), or keeping them for validation.
Finally, the evidence provided by aircraft measurements in the free troposphere suggests that the quality of some OCO-2 retrievals over land is now high enough to provide results that are comparable in credibility to the reference (but sparse) surface air sample network, within the above-mentioned limits. For ocean retrievals, this remains unclear as OCO-2 ocean soundings were not tested in this work. The consistency of results from the surface-and OCO-2-driven inversions, in stark contrast to the bottom-up fluxes or to the GOSAT-driven inversion, does not seem to be fortuitous. It may reinforce some specific conclusions from the surface network, for instance pertaining to the location of the land sink in latitude during the recent years. Remaining differences between fluxes from these two flux inversion types require further analysis and underline their complementarity. The best results may now be obtained by inversions that simultaneously assimilate both observation types.
Data availability. The aircraft measurements are available from https://www.esrl.noaa.gov/gmd/ccgg/obspack/ (last access: 21 November 2019). The OCO-2 data can be obtained from http://co2.jpl.nasa.gov (last access: 21 November 2019) and were produced by the OCO-2 project at the Jet Propulsion Laboratory, California Institute of Technology. The CAMS v18r1 product can be obtained from http://atmosphere.copernicus.eu/ (last access: 21 November 2019), and the GOSAT retrievals are available from http://climate.copernicus.eu/ (last access: 21 November 2019). All inversion results can be obtained upon request to copernicus-support@ecmwf.int.
Author contributions. FC designed the experiments and carried them out. FC, MR, CWO'D, DB and PP analysed the results. MR and AC adapted the LMDz6A model for tracer transport. FC and MR prepared the paper with contributions from all co-authors.
Competing interests. The authors declare that they have no conflict of interest.
Acknowledgements. The authors are very grateful to the many people involved in the surface, aircraft and satellite CO 2 observations and in the archiving of these data that were kindly made available for this study, including Hartmut Bösch and Jasdeep Anand at the University of Leicester, the ObsPack team and LaGEE, and the Greenhouse Gases Laboratory from INPE. The Carbon in Arctic Reservoirs Vulnerability Experiment (CARVE) is an Earth Ventures (EV-1) investigation, under contract with the National Aeronautics and Space Administration. The Atmospheric Carbon and Transport (ACT) -America project is a NASA Earth Venture Suborbital 2 project funded by NASA's Earth Science Division (grant no. NNX15AG76G to Penn State). This work was performed using HPC resources from GENCI-TGCC Financial support. This research has been supported by the Copernicus Atmosphere Monitoring Service, implemented by the European Centre for Medium-Range Weather Forecasts on behalf of the European Commission (grant no. CAMS73), and from ESA through contract no. 4000123002/18/I-NB RECCAP-2.
Review statement. This paper was edited by Federico Fierli and reviewed by three anonymous referees.