Linking global terrestrial CO 2 ﬂuxes and environmental drivers using OCO-2 and a geostatistical inverse model

. Observations from the Orbiting Carbon Observatory 2 (OCO-2) satellite, launched in July 2014, have been used to estimate CO 2 ﬂuxes in many regions of the globe and provide new insight on the global carbon cycle. A challenge now is to not only estimate ﬂuxes using satellite observations but also to understand how these ﬂuxes are connected to variations in environmental conditions. In this study, we speciﬁcally evaluate the capabilities and limitations of utilizing current OCO-2 observations to infer connections between CO 2 ﬂuxes and underlying environmental variables. To do so, we adapt geostatistical 5 inverse modeling to satellite-based applications and evaluate a case study for year 2016 using OCO-2. A unique aspect of the geostatistical approach is that we can use estimates of environmental and meteorological variables to help estimate CO 2 ﬂuxes in place of a traditional prior ﬂux model. We are able to quantify the relationships between CO 2 ﬂuxes and a few environmental variables across global biomes; we ﬁnd that a simple combination of air temperature, daily precipitation, and photosynthetically active radiation (PAR) can describe almost 90% of the variability in CO 2 ﬂuxes as seen through OCO-2 10 observations. PAR is an adept predictor of ﬂuxes across mid-to-high latitudes, whereas a combined set of air temperature and precipitation shows strong explanatory power across tropical biomes. However, we are unable to quantify relationships with additional environmental variables because many variables are correlated or colinear when passed through an atmospheric model and averaged across a total atmospheric column. Overall, we estimate a global net biospheric ﬂux of -1.73 ± 0.53 GtC in year 2016, in close agreement with recent inverse modeling studies using OCO-2 retrievals as observational constraints.

in year 2017, whereas there are no such gaps in year 2016. This time period also overlaps with an OCO-2 inverse modeling inter-comparison (MIP) study, enabling direct comparison with those results (Crowell et al., 2019). We specifically estimate 90 CO 2 fluxes for September 1, 2015 to December 31, 2016 but discard the first four months as a spin-up time period. We also offer up a wide range of environmental drivers and allow the GIM to select a subset that best predicts spatiotemporal patterns in CO 2 fluxes at the model grid scale, described in detail below (Sects. 2.2-2.4).

OCO-2 satellite observations
We utilize 10-s average XCO2 generated from version 9 of the satellite observations for the period from September 1, 2015 95 through the end of year 2016 (e.g., Chevallier et al., 2019). We use both land nadir-and land glint-mode retrievals in the inverse model. Recent retrieval updates have eliminated biases that previously existed between land nadir and land glint observations (O'Dell et al., 2018). Moreover, Miller and Michalak (2020) evaluated the impact of these updated OCO-2 retrievals on the terrestrial CO 2 flux constraint in different regions of the globe; the authors found that the inclusion of both land nadir and land glint retrievals yielded a stronger constraint on CO 2 fluxes relative to using only a single observation type.

Geostatistical inverse model
A GIM does not require an emission inventory or a bottom-up model as an initial guess of fluxes; instead, a GIM can leverage a wide range of environmental driver datasets to help predict spatial and temporal patterns in the CO 2 fluxes (e.g., Gourdji et al., 2008Gourdji et al., , 2012Shiga et al., 2018). We further pair the GIM with a statistical approach known as model selection to objectively determine which set of drivers can best reproduce CO 2 observations from OCO-2. This setup makes it feasible to both estimate 105 CO 2 fluxes and to explicitly quantify the relationships between the fluxes and the underlying environmental drivers. The fluxes, as estimated by the GIM, consist of two components. First, the GIM will scale the environmental drivers to match patterns in the atmospheric observations, and this component of the flux estimate is referred to as the 'deterministic model'. Second, the GIM will model space-time patterns in the CO 2 fluxes that are implied by the atmospheric observations but not explained by any environmental drivers, and this component of the fluxes is referred to as the 'stochastic component'. The best flux estimate 110 is a sum of the deterministic model and the stochastic component: where s are m × 1 unknown fluxes, X is a m × p matrix of environmental drivers (see Sect. 2.4), β are p × 1 unknown scaling factors or drift coefficients. These coefficients quantify the relationships between each of thep environmental drivers (i.e., each column of X) and the estimated CO 2 fluxes. The product of X and β is the deterministic model (Xβ). The stochastic 115 component (ζ) is zero-mean with a pre-specified spatial and/or temporal correlation structure; it describes spatial and temporal patterns in the fluxes that are not captured by the deterministic model. For the setup here, the drift coefficient (β) associated with each environmental driver is constant in space and time, while the stochastic component (ζ) varies at the model grid scale.
The cost function includes two components: the first component indicates that the fluxes (s), when run through an atmospheric model, h(s), should match the observations (z) to within a specific error tolerance (z−h(s)) that is prescribed by the covariance matrix R (n × n). R describes model-data mismatch errors, including errors from the atmospheric transport model and the OCO-2 retrievals, among other errors. The second component of Eq. 2 stipulates that the structure of the stochastic component 125 (s − Xβ) is described by the covariance matrix Q (m × m). Q, like R, must be defined by the modeler before estimating the fluxes; it represents the variances and spatiotemporal covariances of the stochastic component. We estimate Q using a statistical approach known as Restricted Maximum Likelihood (RML; e.g., Kitanidis, 1997;Gourdji et al., 2012;Miller et al., 2016). Q includes both diagonal and off-diagonal elements; the latter decay with the separation time and distance between two model grid boxes. We construct R as a diagonal matrix with constant elements on the diagonal. The Supplement Sect. S1 provides a 130 detailed explanation of the approach used here to estimate the covariance matrix parameters.
After estimating the covariance matrix parameters, we then estimate the CO 2 fluxes by iteratively minimizing Eq. 2 using the Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS, Liu and Nocedal, 1989). We use this approach to simultaneously estimate both s and β. Miller et al (2019) describe this iterative approach to minimize Eq. 2 in detail.

135
We consider a wide range of environmental drivers (X). These are meteorological variables primarily related to heat, water, and radiation, available from NASA's Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2, Rienecker et al., 2011). Specifically, we consider daily 2-m air temperature, daily precipitation, 30-day average precipitation, photosynthetically active radiation (PAR), surface downwelling shortwave radiation, soil temperature at 10-cm depth, soil moisture at 10-cm depth, specific humidity, and relative humidity. We also include a non-linear function of 2-m air temperature 140 as an environmental driver (refer to hereafter as scaled temperature). This function is from the Vegetation Photosynthesis and Respiration Model (VPRM, Mahadevan et al., 2008) and describes the non-linear relationship between temperature and photosynthesis (e.g., Raich et al. 1991, see the Supplement Sect. S2).
Note that we do not include any remote sensing indices (e.g., solar-induced chlorophyll fluorescence (SIF) or leaf area index (LAI)) in the present study. Rather, the focus of this study is to explore environmental drivers of CO 2 fluxes, not remote sensing 145 proxies for CO 2 fluxes.
We group the globe into seven biome-based regions and allow the GIM to use different environmental drivers in different biomes. Miller and Michalak (2020) found that current OCO-2 retrievals can be used to constrain terrestrial CO 2 fluxes for regions of this size. The seven-biome map ( Fig. 1) is derived from the biomes in Olson et al (2001), aggregated to form larger regions. As a result of this setup, each column of X includes a single environmental driver for a single biome. Therefore, each 150 environmental driver is represented by a total of seven columns in X. Within each column, all elements are zeros except for elements that correspond to a single biome.
We also include several constant columns of ones in X. These columns are analogous to the intercept in a linear regression.
Existing GIM studies always include one or more constant columns within X (e.g., Gourdji et al. 2008;Gourdji et al., 2012;Miller et al., 2018). In this study, we specifically use a total of seven constant columns, one for each biome. We also include a constant column for the ocean.
We further consider non-biospheric fluxes in the X matrix, including fossil fuel emissions from the Open-source Data Inventory for Anthropogenic CO 2 monthly fossil fuel emissions (ODIAC2016, Oda et al., 2018, climatological ocean fluxes from Takahashi et al. (2016), and biomass burning fluxes from the Global Fire Emissions Database (GFED) version 4.1 (Randerson et al., 2018). We only allocate a single column of X for fossil fuel, biomass burning, and ocean fluxes, respectively, because 160 these fluxes are not the focus of this study.
In total, we consider a total of 81 columns for the X matrix: 8 constant columns of ones, 70 columns associated with environmental drivers, and 3 columns associated with anthropogenic, ocean, and biomass burning fluxes.

Model selection
We utilize a model selection framework to evaluate which subset of the environmental drivers (i.e., columns of X) best describe 165 variations in CO 2 fluxes as inferred from the OCO-2 observations. The inclusion of additional environmental drivers or columns in X will always improve the model-data fit, but the inclusion of too many variables in X can yield an overfit of the OCO-2 observations or can yield unrealistic drift coefficients (β) (e.g., Zucchini, 2000). Instead of including all environmental drivers in X, we use model selection to decide which set of environmental drivers to include in X. In this study, we implement a type of model selection known as the Bayesian Information Criterion (BIC; Schwarz, 1978), which has been extensively used in 170 recent GIM studies (e.g., Gourdji et al., 2012;Miller et al. 2013;Fang and Michalak, 2015). Using the BIC, we score different combinations of environmental drivers that could be included in X based on how well each combination reproduces the OCO-2 observations. We calculate these scores using the following equation for the implementation here (Miller et al. 2018;Miller and Michalak, 2020):

175
where L is log likelihood of a particular combination of environmental drivers (i.e., columns of X),p is the number of environmental drivers in this particular combination, and n * is the effective number of independent observations. The first component (L) rewards combinations that are a better fit to the observations, whereas the second component in Eq. 3 (pln(n * )) penalizes models with a greater number of columns to prevent overfitting. The best combination of environmental drivers for X is the combination that receives the lowest score (the Supplement Sect. S3 and Table S2). We implement the BIC using a heuristic 180 branch and bound algorithm (Yadav et al., 2013) boreal and temperate forests relative to grasslands, indicating a stronger relationship between PAR and net biosphere CO 2 fluxes in those biomes (Table 1; Figs. 5a and S3d-f).
Indeed, previous studies also indicate that PAR and similar environmental drivers (e.g., shortwave radiation) are closely associated with CO 2 fluxes. For example, a top-down study of North America (Fang and Michalak, 2015) found that shortwave

255
A composite of PAR, scaled temperature, and daily precipitation adeptly describe variability in CO 2 fluxes across tropical forests (Figs. 4c and 4d), as seen through the OCO-2 observations. PAR in tropical forests is usually a function of the presence or absence of clouds (e.g., Baldocchi et al., 2017;Zeri et al., 2014); cloudiness is also associated with rainfall. Therefore, low PAR over tropical forests is likely an indicator of cloud presence and rainfall. A positive β estimated for PAR suggests that a decrease in PAR, indicative of enhanced precipitation, is associated with 380 increased carbon uptake. Furthermore, the 260 negative β value assigned to scaled temperature (the Supplement Sect. S2) implies that an increase in air temperature, which often exceeds optimal temperature over tropical forests, is associated with reduced carbon uptake. values of PAR, high air temperature, and low precipitation may be a manifestation of these drought patterns.

265
Indeed, multiple lines of evidence indicate that drought is associated with diminished carbon uptake in tropical forests (e.g., Phillips et al., 2009;Brienen et al., 2015;Baccini et al., 2017). For example, Gatti et al (2014) suggested that a suppression of photosynthesis during tropical drought may cause a reduction in carbon uptake. Brienen et al (2015) added that tropical drought is often associated with higher-than-normal temperature, which may further contribute to reducing gross primary production (GPP) and carbon uptake. Overall, this GIM study supports the conclusion that environmental conditions indicative of drought This result suggests that heat and water availability are likely associated with carbon fluxes across this biome.

275
A negative β value for precipitation indicates that an increase in precipitation is associated with an increase in carbon uptake, which is in line with current knowledge that water availability facilitates photosynthesis, especially in arid or semiarid regions. In addition, a negative β value for scaled temperature (the Supplement Sect. S2) indicates that an increase in air temperature is associated with a reduction in carbon uptake. Specifically, high temperatures in the tropics often exceed the optimal temperature for photosynthesis (e.g., Baldocchi et al., 2017), which can suppress GPP (e.g., Doughty and Golden, 2008). Overall, a combined set of air temperature and precipitation adeptly describes CO 2 flux variability in tropical grasslands, rendering it a net source in year 2016.

Estimated biospheric flux totals for different global regions
We estimate a global terrestrial biospheric CO 2 budget of -1.73 ± 0.53 GtC (Uncertainties listed are the 95% confidence interval. The Supplement Sect. S5 provides detail on the posterior uncertainty estimate for biospheric fluxes.). Among the 285 seven biomes, middle to high latitudes (primarily temperate, boreal and tundra biomes) act as a significant carbon sink; tropical biomes are a net source; desert and shrubland regions play a small, neutral role ( Table 2). Note that we subtract flux patterns that map onto fossil fuels (Xβ, Fig. 5d) from the posterior flux estimate (s, Fig. 2c) to obtain an estimate for biospheric fluxes (including terrestrial NEE and biomass burning fluxes). We estimate a β value of 1.09 ± 0.05 (95% confidence interval) for the fossil fuel emissions from ODIAC2016, indicating that the overall global magnitude of ODIAC2016 is consistent with OCO-2 290 observations. We therefore assume that ODIAC2016 is a reasonable global estimate for fossil fuel emissions.
These flux totals are broadly consistent with a recent MIP of different inverse models that assimilate OCO-2 observations The fluxes estimated here are also broadly consistent with aircraft-based in situ CO 2 observations, a topic discussed in the Supplement Sect. S7. It is important to note that the posterior uncertainties calculated in most classical Bayesian or geostatistical inverse models account for many but not all possible sources of uncertainty. For example, the posterior uncertainties presented here account for the sparsity of the OCO-2 observations, random observational or atmospheric transport errors, and uncertainties due to uncertain drift coefficients (β) (e.g., Kitanidis and Vomvoris, 1983;Michalak et al., 2004). However, these calculations do not fully account for bias-type errors: regional-or continental-scale biases in the OCO-2 observations, biases in modeled 320 atmospheric convection (e.g., Basu et al., 2018;Schuh et al., 2019), or biases in modeled interhemispheric transport, among other possible biases. Most classical Bayesian and geostatistical inverse models assume that the observational or model errors are Gaussian with a mean of zero (e.g., Kitanidis and Vomvoris. 1983;Michalak et al., 2004;Tarantola, 2005), making it challenging to account for the types of biases listed above. As a result, the posterior uncertainties estimated in this study are typically smaller than the range of flux estimates produced from the recent MIP study ( Fig. 6; Crowell et al., 2019).

Conclusions
In this study, we adapt the geostatistical approach to inverse modeling for global satellite observations of CO 2 , and evaluate the extent to which we can use these observations to make connections between CO 2 fluxes and environmental drivers. We find that 1. A simple combination of environmental drivers can adeptly describe patterns in CO 2 fluxes across different biomes of 330 the globe, as seen through observations from the OCO-2 satellite; 2. PAR is an adept predictor of fluxes across mid-to-high latitudes, whereas a combination of daily air temperature and daily precipitation shows strong explanatory power across tropical biomes; 3. A larger number of environmental drivers is not selected because many drivers are correlated or colinear when passed through an atmospheric model and averaged across a total atmospheric column. This high collinearity, not errors in the 335 OCO-2 retrievals or atmospheric model, appears to be a limiting factor in using satellite observations to connect CO 2 fluxes with environmental drivers; 4. We estimate a global terrestrial biospheric budget of -1.73 ± 0.53 GtC in year 2016, in close agreement with recent inverse modeling studies that use OCO-2 retrievals as observational constraints.
Author contributions. Z.C. and S.M.M. designed the study. Z.C. analyzed the data. Z.C. and S.M.M. wrote the manuscript. All authors reviewed and edited the paper.