A joint e ff ort to deliver satellite retrieved atmospheric CO 2 concentrations for surface flux inversions : the ensemble median algorithm

Introduction Conclusions References


Introduction
Our current knowledge about sources and sinks of atmospheric CO 2 is limited by the sparseness of highly accurate and precise CO 2 measurements (Stephens et al., 2007).Due to their global coverage and sensitivity down to the surface, satellite based XCO 2 (the column-average dry-air mole fraction of atmospheric CO 2 ) retrievals in the near infrared are a promising candidate to reduce existing uncertainties if accurate and precise enough (Rayner and O'Brien, 2001;Houweling et al., 2004;Miller et al., 2007;Chevallier et al., 2007).
The basic principle of all these algorithms is the same: (i) A satellite instrument measures backscattered solar Table 1.Main retrieval characteristics: algorithm name and version, satellite instrument, spectral bands, inversion technique (OE = optimal estimation, TP = Tikhonov-Phillips regularization, LS = least squares), consideration of scattering (FP = full physics, PR = light path proxy, PPDF = photon path length probability density function, 4EP20 = 4 extinction profiles with 20 layers (two aerosol types, water and ice cloud), CWP = cloud water path, CTH = cloud top height, APS x = aerosol profile scaling of x different aerosol types, AOD = aerosol optical depth, RSL = reflectivity of scattering layer, PLMP = path length modification parameter, APNC = aerosol particle number concentration, ASP = aerosol size parameter, AH = aerosol height, CEPS = cloud extinction profile scaling), main cloud filter (MERIS = medium resolution imaging spectrometer, CAI = cloud and aerosol imager of GOSAT, PMD = polarization measurement device of SCIAMACHY).

Algorithm
Sensor radiation in near-infrared O 2 and CO 2 absorption bands.
(ii) A radiative transfer plus instrument model (forward model) is utilized to simulate the satellite measurement for a set of known parameters (parameter vector) and unknown parameters (state vector).(iii) An inversion method tries to find that state vector which results in best agreement of simulated and measured radiances.(iv) The retrieved state vector is assumed to represent the true (or most likely) atmospheric state.However, when going into more detail, the algorithms have distinct conceptual differences: the algorithms are optimized for different instruments namely SCIAMACHY (scanning imaging absorption spectroMeter of atmospheric chartography; Burrows et al., 1995;Bovensmann et al., 1999) and GOSAT (greenhouse gases observing satellite; Yokota et al., 2004).They are based on different absorption bands, use different inversion methods (optimal estimation, Tikhonov-Phillips, least squares), and are based on different physical assumptions on the radiative transfer in scattering atmospheres.So-called full physics algorithms explicitly account for (multiple) scattering at molecules, aerosols, and/or clouds by having state vector elements such as cloud water path, cloud top height, and aerosol optical thickness.The photon path length probability density function (PPDF) can be used as shortcut for computational expensive radiative transfer calculations in a scattering atmosphere.The light path proxy method assumes that photon path lengths are modified similarly in the O 2 and CO 2 absorption bands, and that scattering related effects cancel out when dividing the retrieved CO 2 and O 2 columns when building XCO 2 .Additionally, the algorithms use different pre-and post-processing filters (e.g., cloud detection from O 2 -A band or from a cloud and aerosol imager).
The main differences of the seven retrieval algorithms are summarized in Table 1.This list does not claim to be exhaustive and there are other aspects which can also easily result in differences of some ppm (e.g., spectroscopy).Discussions of the specific strengths and weaknesses and many more points, where the individual algorithms differ, can be found in the cited literature.
All retrieval teams find encouraging validation results when comparing with TCCON (total carbon column observing network) (Washenfelder et al., 2006;Wunch et al., 2011) ground based FTS (Fourier transform spectrometer) measurements (see references above).This goes along with a good inter-algorithm agreement at TCCON sites and with the results of our unified validation study having station-tostation biases (i.e., the standard deviation of the biases at different sites) typically below 1 ppm and single measurement precisions typically between 2 ppm and 4 ppm (Fig. 1, Table 2).
However, the inter-algorithm agreement as well as the agreement with NOAA's (National Oceanic and Atmospheric Administration) CarbonTracker (Peters et al., 2007(Peters et al., , 2010) model (CT2011), i.e., our current knowledge about atmospheric CO 2 based on NOAA's air sampling network, often reduces remote from validation sites due to differing large scale bias patterns (Fig. 2).Such biases are currently the most critical issue for surface flux inversions and the user requirements are demanding; as an example, Miller et al. (2007) and Chevallier et al. (2007) found that regional biases of a few tenths of a ppm can already hamper surface flux inversions.
This indicates that assessing an algorithm's quality should not be based on comparisons against current TCCON stations only.Obviously, large regions of the world possess more "complicated" retrieval conditions without the availability of ground truth measurements which could be used to judge the algorithms' performance.
Diverging model results are common to many scientific disciplines (e.g., Araujo and New, 2007;Rötter et al., 2011), much attention and effort is devoted to this topic on the subject of weather and climate modeling.Here, the divergence of the model results arises not only from structural differences of the different models, but also from the nonlinearity of the model equations, leading to differing results of one single model when performing multiple realizations with slightly differing initial conditions (Hagedorn et al., 2005;Tebaldi and Knutti, 2007).Especially in the case of weather forecasting or climate projections, where no ground truth is available for the verification of the forecasts and projections, it is impossible to identify the "best" model and the "perfect" initial conditions.For long-term climate projections, this problem is impaired by the unknown future greenhouse forcing.This conceptual problem is dealt with by using multimodel, multi-realization, multi-emission-scenario ensembles of simulations, which ideally span the entire range of possible model outcomes and, thus, can be used to estimate the uncertainties of the forecast or projection.
However, interpreting the ensemble's spread as uncertainty is not the only possible application: some studies indicate that the ensemble mean, weighted mean, or median can outperform each individual model under appropriate conditions (e.g., Kharin and Zwiers, 2002;Vautard et al., 2009).Within Sect.3, we seize this idea and introduce the ensemble median algorithm EMMA with which a global one year data set (June 2009-May 2010) of individual soundings has been generated.It comprises data from the seven retrieval algorithms mentioned above.

Ensemble spread
Due to entirely different samplings (different satellites, different filtering strategies, etc.), any algorithm intercomparison considering the majority of individual soundings (level 2) can only be based on aggregated data (level 3), in our case monthly averages on a 10 • × 10 • grid.Before gridding, we apply the individual averaging kernels to adjust all retrieval  2012).We do this as proposed in the textbook of Rodgers (2000) and applied to XCO 2 by, for example, Reuter et al. (2011).SECM reproduces large-scale features such as the year-to-year increase, the north/south gradient, and the seasonal cycle.However, SECM is only empirically extrapolating from past modeled CO 2 fields.New or changing phenomena cannot be within SECM, and it should also be mentioned that the adjustments are mostly minor, typically a few tenths of a ppm.For consistency, we also remove the overall global bias of each retrieval with SECM as reference.
In order to get statistically robust results, we only use those grid boxes for which the standard error of the mean is estimated to be less than 1 ppm.This takes the individual retrieval precisions into account so that the minimum number of soundings needed to build the average of a grid box can vary from retrieval to retrieval and grid box to grid box.Beforehand, the reported retrieval precision is scaled to match (on average) the precision given in Table 2. TCCON and CT2011 are gridded in the same way.
Figure 2 shows for a typical month (September 2009) the calculated monthly averages.First of all, one can see many large scale similarities such as the north/south gradient.However, one can also find more or less obvious outliers of a few ppm in each of the seven algorithms (e.g., ACOS: Angola, BESD: Amazon, NIES PPDF-D: Saudi Arabia, NIES v02.xx:Senegal, RemoteC: north/east Siberia, UOL-FP: Brazil, WFMD: Somalia).
Often the observed systematic deviations (of level 3 data) are larger than the single sounding retrieval precisions expected from instrumental noise, i.e., they are dominated by specific algorithm effects.As level 3 grid boxes are always calculated from several individual level 2 soundings (ideally) sampled all over the grid box, we expect that sampling and representation errors are much lower than the observed deviations.Therefore, these errors are not discussed further in this context.
Due to independent algorithm developments, different physical approaches and assumptions, different pre-and post-processing filters, and due to the different instruments, we expect relatively independent bias patterns.This is supported in Fig. 2, which shows (uncorrelated) obvious outliers in various regions, i.e., it seems unlikely that all algorithms produce the same bias within one grid box.This implies that similar averages within one grid box can give us more confidence in the individual retrievals within this grid box.On the other way around, large inter-algorithm spreads indicate regions with more difficult and uncertain retrieval conditions.Therefore, we interpret the ensemble spread, i.e., the standard deviation of (at least five) algorithms, as uncertainty due to regional retrieval biases.
An example is given in Fig. 2 (right, bottom) showing larger inter-algorithm spreads in the tropics and in East Asia (always remote from TCCON sites).This pattern is temporally more or less stable, i.e., similar also in other months.

Ensemble median
As described in the previous section, up to seven XCO 2 averages (one for each algorithm) are calculated within each grid box.However, now, we are aiming to use the ensemble not only to assess regional and temporal uncertainties but also to create a data set which is less influenced by regional or temporal biases.This could be achieved, for example, by building the average, a weighted average, or the median in each grid box.
In this context, the median has some clear advantages: outliers are assumed to be seldom and there is a high chance that a grid box includes no or only one outlying algorithm.Therefore, cancellation of errors cannot be expected by calculating the average.The median is much less sensitive to such individual outliers.Additionally, the median calculates no new quantity from the individuals of an ensemble, it is rather a procedure to select one specific ensemble member.This allows us to easily trace back from level 3 averages to individual level 2 soundings.
Essentially, there are five possible scenarios for median calculation within one grid box: (i) All algorithms perform well and scatter slightly around the true XCO 2 value.In this case the median will help to reduce scatter.(ii) The minority of algorithms produce outliers so that the median is influenced only marginally.(iii) The majority of algorithms produce outliers in different directions.Here, it is still likely that the median falls on a well performing algorithm in the "middle".(iv) The majority of algorithms produce outliers in the same direction.This is the only case where the median is a bad choice, because it would select an outlier and ignore a well performing algorithm.As discussed in the previous section, we assume that the algorithms within one grid box are often realistic with uncorrelated occasional outliers, which makes this case very unlikely to happen often.(v) If all algorithms are outlying, the median is not better or worse than selecting any other ensemble member.
We calculate the median only for grid boxes with at least five successfully determined average XCO 2 values.In case of an even number of values, we define the median as that value being closer to the mean.We then trace back to the individual level 2 data, which were used to calculate that average being the median.Together, with all information needed for inverse modeling (geo-location, time, averaging kernels, etc.), these soundings are stored in the EMMA database.
Some algorithms may provide considerably larger amounts of level 2 data (e.g., WFMD, weighting function modified DOAS) than other algorithms.In order to prevent over-weighting these algorithms, we limit the maximum number of data points (per grid box).Therefore, we calculate the standard error of the mean of each successfully deter- mined average.The idea behind this is that the lower the standard error of the mean, the larger the potential constraint on an inverse model becomes.If the standard error of the mean of the selected algorithm is lower than the 25 % percentile of all algorithms, a centered truncated mean is calculated instead.The (symmetrical) truncation of elements adjusts the standard error of the mean to be slightly larger than the 25 % percentile.In this way, the number of data points can still be rather different but the potential constraint on an inverse model becomes similar.
Summarizing, the EMMA database consists of individual level 2 soundings retrieved by algorithms which can change from grid box to grid box and month to month.Figure 3 shows the integrated data content of each algorithm (defined as 1/σ 2 i ) within the EMMA database.ACOS has the largest integrated data content because it is often selected as median, has many data points per grid box, and a low scatter.

Performance of EMMA
The validation of EMMA's level 2 database with TCCON (Fig. 1, Table 2) has been performed in analogy with the work of Reuter et al. (2011) and shows very good overall performance: EMMA has more co-locations than any GOSAT retrieval, and a low station-to-station bias of 0.8 ppm.Mainly due to its WFMD component, EMMA has a single measurement precision of 3.1 ppm, which is somewhat larger than most of the GOSAT algorithms.It shall be noted that TC-CON's accuracy (2σ ) is about 0.8 ppm (Wunch et al., 2010(Wunch et al., , 2011)).This is similar to the observed station-to-station biases of the satellite retrievals and much larger than their differences.Additionally, it shall be noted that the number of co-locations is not solely driven by the satellite retrievals.Due to clouds and instrument maintenance, the seven TC-CON sites used provided suitable validation data in less than 40 % of the days.
The following algorithm intercomparison addresses temporal and spatial bias patterns and is based on gridded level 3 data sets (described in Sect.2).A glance at Fig. 2 shows that EMMA generates a relatively smooth global map with realistic patterns and no obvious outliers.As mentioned before, we use at least five algorithms to calculate a median.This can result in a loss of coverage relative to some of the individual algorithms.
We calculated the fraction of potential outliers according to unrealistically large spatial gradients (> 3 ppm/10 • , Fig. 4a, left) and unrealistically large deviations from CT2011 (> 3 ppm, Fig. 4a, middle).EMMA's fraction of potential outliers is below 2 %, which is considerably lower than for any other algorithm.Analyzing the difference between the individual algorithms and EMMA, Fig. 4a shows that large deviations from EMMA are often correlated with large deviations from CT2011 and large gradients.
With respect to the standard deviation of the difference (STDD, Fig. 4b), EMMA is in better agreement with CT2011 and TCCON than any other algorithm.
We also compared the north/south (N/S) gradient of each month with CT2011 and TCCON (Fig. 4c) by averaging all northern and southern hemispheric grid boxes (using the same sampling).All algorithms agree that CT2011 has a N/S gradient being about 0.3-0.9ppm too low.This effect is estimated to be 0.2 ppm less pronounced in the previous Car-bonTracker version CT2010.However, it shall be noted that CT2010 ends in 2009, and CO 2 fields after 2009 were estimated by extrapolation from previous years.EMMA's N/S gradients have the third smallest systematic deviations from CT2011 and the lowest scatter.It has the smallest systematic deviation from TCCON with the third smallest scatter.However, the statistics in comparing to TCCON are less robust because only seven grid boxes include TCCON stations and there are only 12 months for which the N/S gradient has been calculated.
Additionally, we compared the seasonal (peak-to-peak) amplitude of each grid box with CT2011 and TCCON (Fig. 4d) by calculating the difference between annual maximum and minimum.Beforehand, we subtracted a globally constant linear increase of 1.8 ppm yr −1 .We considered only those grid boxes with at least six valid months and used the same sampling.Most algorithms agree that CT2011 underestimates the seasonal amplitude by about 1.5-2.0ppm, which is broadly consistent with the findings of Yang et al. (2007); Schneising et al. (2011);Reuter et al. (2011);Keppel-Aleks et al. (2012), andMesserschmidt et al. (2012).The effect is estimated to be about 0.3 ppm less pronounced in CT2010 (extrapolated).However some algorithms (especially WFMD) see probably unrealistically large amplitudes.EMMA is in best agreement with CT2011 and in second best agreement with TCCON.It shall be noted that the CT2011 comparison is dominated by the Northern Hemisphere, due to significantly more filled grid boxes.The TCCON statistics are probably not very robust because they rely on seven grid boxes with seasonal cycles only.

Conclusions
In a joint effort of all XCO 2 retrieval teams worldwide, which are actively developing satellite based algorithms for the near infrared, the ensemble median algorithm EMMA has been set up.EMMA is a database of individual level 2 retrievals and takes advantage of the variety of different retrieval algorithms and their independent developments.For each month and each 10 • × 10 • grid box, one algorithm is chosen to supply level 2 retrievals in the database.The algorithm is chosen on the basis that its grid box mean is the median amongst the available algorithms.This allows the reduction of occasional outliers and sometimes unrealistic bias patterns, which can be found in each individual retrieval algorithm and which can hamper surface flux inversions.EMMA relies on the assumption that it is unlikely that the majority of algorithms produce outliers in the same direction because only in this case the median is a bad choice.Smoothing of real atmospheric variability, as it could happen when dealing with climate model ensembles, cannot be expected for EMMA because all ensemble members (XCO 2 retrieval algorithms) represent the same (real) atmospheric XCO 2 conditions and deviations from the real values are always due to retrieval errors (sampling issues are neglected in this context).
The EMMA database (June 2009-May 2010) includes all information needed for inverse modeling (geo-location, time, averaging kernels, etc.).As it consists of individual XCO 2 retrievals, it can be used in the same manner as any other XCO 2 satellite retrieval.EMMA also includes the inter-algorithm spread which gives important information about regional uncertainties.
Analyzing the inter-algorithm spread, we found that the algorithms agree often within < 1 ppm.However, especially in the tropics and East Asia remote from TCCON validation sites, we find larger spreads of about 1-2 ppm.This knowledge can be used to account for regional uncertainties in addition to the reported retrieval error estimates.Furthermore, it gives important indications where the most complicated retrieval conditions exist and where new validation sites would be of great interest.
TCCON is continuously expanding and improving and currently the de facto validation standard.However, many important regions are not covered, its accuracy (∼ 0.8 ppm) is not significantly better than the user requirements for regional biases, and TCCON cannot measure under cloudy conditions.Therefore, complementary validation concepts, for example, based on NOAA's AirCore system (Karion et al., 2010), which is currently in development, are of great interest, especially for future satellite missions.
A unified validation exercise showed that EMMA performs well at TCCON sites.This was somewhat expected because most of the algorithms perform similarly well here.In terms of station-to-station biases, TCCON's accuracy does not allow to identify significant differences between the analyzed algorithms.
The strength of EMMA lies in the reduction of the spatial and temporal bias patterns, which can be analyzed with the global gridded level 3 data: (i) EMMA's fraction of obvious outliers (in terms of unrealistically large gradients and unrealistically large deviation from CT2011) is lower than for any individual algorithm.(ii) It has the smallest STDD to CT2011 (considered as our current knowledge about atmospheric CO 2 concentrations) and TCCON.(iii) Its N/S gradients are in third best agreement with CT2011 and in best agreement with TCCON, and have the lowest scatter in case of CT2011 and the third lowest scatter in case of TCCON.(iv) EMMA's seasonal amplitude is in best agreement with CT2011 and second best with TCCON, and has the lowest scatter in both cases.
In summary, EMMA performs very well in terms of the analyzed statistical quantities.As long as no individual retrieval algorithm meets the demanding user requirements, we conclude that EMMA is a promising candidate for inverse modeling studies.
Our study also showed that all algorithms consistently observe a N/S gradient being about 0.3-0.9ppm larger and a seasonal (peak-to-peak) amplitude being about 1.5-2.0ppm larger than modeled by CT2011.Both effects were estimated to be slightly less pronounced in CT2010.
Future EMMA versions will profit from improvements of the individual algorithms.Furthermore, it is planned to extend the EMMA period to more years as soon as all algorithms have provided data.
Additional to EMMA v1.3a (described in this paper), we generated a version without WFMD (EMMA v1.3b) and a version without SCIAMACHY (EMMA v1.3c).EMMA seems to be very stable because rejecting, for example, WFMD from the ensemble has only minor influence on the global maps and level 3 statistics.However, a relatively large influence can be observed in the patterns of the ensemble spread.Additionally, the single measurement precision (compare Table 2) is reduced to about 2 ppm for v1.3b and v1.3c.All EMMA versions are available upon request to download at http://www.iup.uni-bremen.de/∼ mreuter/emma.php.

Fig. 1 .
Fig. 1.Co-locations with validation measurements at the TC-CON site Lamont, USA (distance < 500 km, time difference < 2 h).Soundings included in EMMA are highlighted with a white dot.

Fig. 3 .
Fig. 3. Integrated data weight of each algorithm within the EMMA database defined as 1/σ 2 i , where σ i is the (scaled) individual sounding error.

Fig. 4 .
Fig. 4. Performance statistics (based on level 3 data) of the seven individual retrieval algorithms and EMMA.From top to bottom: (a) frequency of potential outliers defined as unrealistically large spatial gradient, large deviation from CT2011, and large deviation from EMMA; (b) standard deviation of the difference (STDD) to CT2011 and TCCON; (c) difference of the north/south gradient to CT2011 and TCCON (average and standard deviation); (d) difference of the seasonal amplitude to CT2011 and TCCON (average and standard deviation).
results to a common a priori, namely the simple empirical CO 2 model (SECM) ofReuter et al. (