Evaluating China’s anthropogenic CO2 emissions inventories: a northern China case study using continuous surface observations from 2005 to 2009

China has pledged reduction of carbon dioxide (CO2) emissions per unit of gross domestic product (GDP) by 60 %–65 % relative to 2005 levels, and to peak carbon emissions overall by 2030. However, the lack of observational data and disagreement among the many available inventories makes it difficult for China to track progress toward these goals and evaluate the efficacy of control measures. To demonstrate the value of atmospheric observations for constraining CO2 inventories we track the ability of CO2 concentrations predicted from three different CO2 inventories to match a unique multi-year continuous record of atmospheric CO2. Our analysis time window includes the key commitment period for the Paris Agreement (2005) and the Beijing Olympics (2008). One inventory is China-specific and two are spatial subsets of global inventories. The inventories differ in spatial resolution, basis in national or subnational statistics, and reliance on global or China-specific emission factors. We use a unique set of historical atmospheric observations from 2005 to 2009 to evaluate the three CO2 emissions inventories within China’s heavily industrialized and populated northern region accounting for∼33 %–41 % of national emissions. Each anthropogenic inventory is combined with estimates of biogenic CO2 within a high-resolution atmospheric transport framework to model the time series of CO2 observations. To convert the model–observation mismatch from mixing ratio to mass emission rates we distribute it over a region encompassing 90 % of the total surface influence in seasonal (annual) averaged back-trajectory footprints (L_0.90 region). The L_0.90 region roughly corresponds to northern China. Except for the peak growing season, where assessment of anthropogenic emissions is entangled with the strong vegetation signal, we find the China-specific inventory based on subnational data and domestic field studies agrees significantly better with observations than the global inventories at all timescales. Averaged over the study time period, the unscaled China-specific inventory reports substantially larger annual emissions for northern China (30 %) and China as a whole (20 %) than the two unscaled global inventories. Our results, exploiting a robust time series of continuous observations, lend support to the rates and geographic distribution in the China-specific inventory Though even long-term observations at a single site reveal differences among inventories, exploring inventory discrepancy over all of China requires a denser observational network in future efforts to measure and verify CO2 emissions for China both regionally and nationally. We find that carbon intensity in the northern China region has decreased by 47 % from 2005 to 2009, from approximately 4 kg of CO2 per USD (note that Published by Copernicus Publications on behalf of the European Geosciences Union. 3570 A. Dayalu et al.: Evaluating China’s anthropogenic CO2 emissions inventories all references to USD in this paper refer to USD adjusted for purchasing power parity, PPP) in 2005 to about 2 kg of CO2 per USD in 2009 (Fig. 9c). However, the corresponding 18 % increase in absolute emissions over the same time period affirms a critical point that carbon intensity targets in emerging economies can be at odds with making real climate progress. Our results provide an important quantification of model– observation mismatch, supporting the increased use and development of China-specific inventories in tracking China’s progress as a whole towards reducing emissions. We emphasize that this work presents a methodology for extending the analysis to other inventories and is intended to be a comparison of a subset of anthropogenic CO2 emissions rates from inventories that were readily available at the time this research began. For this study’s analysis time period, there was not enough spatially distinct observational data to conduct an optimization of the inventories. The primary intent of the comparisons presented here is not to judge specific inventories, but to demonstrate that even a single site with a long record of high-time-resolution observations can identify major differences among inventories that manifest as biases in the model–data comparison. This study provides a baseline analysis for evaluating emissions from a small but important region within China, as well a guide for determining optimal locations for future ground-based measurement sites.


Table of Contents
. Comparison of unadjusted annual anthropogenic CO2 emissions (TgCO2) by region 14

S1 WRF Model: Post-processing and Evaluation
We evaluate WRF output against publicly available, 24h-averaged Chinese Meteorological Administration (CMA) observational data. CMA observational data is not used in the NCEP FNL reanalysis WRF initialization fields. CMA provides daily averages of surface pressure, wind speed, temperature, and relative humidity. Access to higher temporal resolution observational data is limited. We convert hourly (d01) and half-hourly (d02, d03) WRF output to daily averages before evaluation. We use a combination of NCAR Command Language v6.1.2 (NCL; http://dx.doi.org/10.5065/D6WD3XH5) and R v2.9.0 (https://www.r-project.org/) to process the observed and simulated output. The standard post-processing toolbox, consisting of the WRF Unified Post Processor and METv4.1 Point-Stat Tool (http://www.dtcenter.org/code/) is not used here because of the low temporal resolution of observational data and file format mismatches. However, we base our evaluation method and procedures on the METv4.  (Fig S1), the median modeled wind speed was 15% higher than observations, with a median absolute deviation of 16%. We emphasize that a more robust evaluation of WRF windspeed (or other meteorological) biases relative to observations would require access to higher temporal resolution meteorological observations. Currently, we are restricted by data availability to 24-hour averages which blur smaller timescale processes and therefore likely underestimates the WRF surface wind speed bias relative to observations. We do not include d01 comparisons in this analysis, as the distance between nearest station and WRF gridcell center can be on the order of tens of kilometers, decreasing the information and value of the comparison. The graphics associated with the d02 and d03 comparisons are available from https://dx.doi.org/10.7910/DVN/OJESO0 as "006_WRFvCMAplots_2006_d0X.pdf".

S2 STILT Model Set-up and Run Details
The version of WRF-STILT 1 used in this study corresponds to STILT release r701 of the AER-NOAA branch at the STILT svn repository 2 , and Release-3-5 of the WRF-STILT interface 3 . Spinup periods are removed from the WRF meteorological data and the WRF netcdf output files are converted to .arl format (Air Research Laboratory; https://ready.arl.noaa.gov/HYSPLIT_data2arl.php#INFO) prior to being ingested into STILT.
In this study, we transport an ensemble of 500 particles 7-days back in time to model footprints for each measurement hour at the receptor. The receptor (Miyun; 40°29′N, 116°46.45′E, 152 m above sea level (asl)) has the measurement inlet (STILT particle "release" point) located 6m above ground level (agl) (Fig. S2). We employ dynamic regridding, which accounts for resolution changes among the nested WRF domains. Mixing height is derived from WRF PBL heights; we set the surface layer as 50% of the mixed layer height. Footprints are integrated hourly. We set up the STILT runs as "pleasantly parallel" by running each month of a year simultaneously; hours within a month are run serially.
When the receptor release occurs outside of peak daylight hours, stratification of the PBL becomes significant. Therefore, as is common practice in virtually all emissions optimization/assessment studies, we model the 1100 to 1600 (local time) subset. These daylight hours represent a typical window for which STILT reliably models transport (e.g., 4). We examine the unadjusted model performance at all times, averaged seasonally and diurnally, in Sec S7.

S3 Anthropogenic CO2 inventories
In order to facilitate comparison among the three anthropogenic inventories used in this study, we interpolate the two global inventories (EDGAR, 0.1ºx0.1º; CDIAC, 1ºx1º) to the same 0.25ºx0.25º grid as the regional inventory (ZHAO). We use the NCL Earth System Modeling Framework (ESMF) Conserve regridding method which minimizes deviation of the variable's integral between source and destination grids. We evaluate the impact of regridding in Fig. S6 by comparing annual totals (MtCO2) before and after regridding. The ZHAO inventory remains on its native grid. We show that regridding does not appreciably affect the total emissions reported for mainland China by EDGAR and CDIAC, providing confidence in our representation of the two original inventories.
The  S7b). We find the assumption to be valid; the mean change per gridcell from 2009 relative to 2005 is -0.011% with a 2-s of 15%.
Total uncorrected emissions for each anthropogenic inventory are calculated on the 0.25ºx0.25º grids and provided in Table S1. We provide emissions summed for each administrative region in the study domain, each STILT influence contour, and all China. Differences among the inventories zoomed to the L_0.90 region, are displayed in Fig. S8. Miyun and Beijing are encompassed by the L_0.25 contour. We display the average gridcell emissions of ZHAO (Fig. S8a) and the differences of EDGAR and CDIAC relative to ZHAO ( Fig. S8b and Fig. S8c, respectively). In heavily emitting regions, ZHAO is typically higher than EDGAR and CDIAC. In regions where ZHAO is consistently lower than CDIAC, the differences are lower than the instances where ZHAO is higher. Note that, in the case of CDIAC, the uniformity of the differences includes artefacts from downscaling the gridded CDIAC inventory from 1°x1° to 0.25°x0.25°.

S4 CT2015: Background Concentration Selection and Evaluation of Model Bias
We derive estimates of background CO2 concentrations from NOAA CarbonTracker (CT2015; https://www.esrl.noaa.gov/gmd/ccgg/carbontracker/CT2015/). CT2015 enables us to estimate concentrations of CO2 prior to interaction with the surfaces in the study domain. Background value selection is summarized as follows. For each hour, the end x-y-z-time coordinates for each of 500 particles is found and linked to its corresponding CT2015 CO2 concentrations using a spatiotemporal nearest neighbor approach. Only instances where a particle originated at the edge of the outermost domain and/or an altitude greater than or equal to 3000masl is included in the average background concentration calculation for that hour. If less than 75% of particles for an hour have valid background concentrations, that hour is not used in subsequent analyses. This selection criteria for background CO2 mole fractions enables realistic modeling of true background conditions that have not interacted with the domain within each hourly measurement's maximum seven-day regional influence period. For the five-year study period, this method of boundary selection retains approximately 85% of hourly modelled values per year and across years.
The CT2015 model for the study domain is heavily trained by observations made approximately weekly via flask sampling at four World Meteorological Organization (WMO) sites in the region (https://www.esrl.noaa.gov/gmd/dv/site/). Mt. Waliguan to the west of the receptor (WLG) represents free tropospheric background air; Ulaan Uul (UUM) in Mongolia represents clean continental air; Tae-ahn Peninsula (TAP) in South Korea represents urban-influenced air from the east; Lulin (LLN) in Taiwan represents urban-influenced air from the southeast. TAP and LLN become more prominent in their representation upwind/background air sites during the spring and summer months when the East Asian Monsoon begins to influence regional air trajectory patterns. WLG and UUM are prominent in their representation of upwind/background air at all times of the year but particularly weight background air during the winter and fall seasons.
We quantify bias in the background model by evaluating observations against the nearest CT2015 model pixel and level. Observations are filtered using highest quality flask sample points only. Fig. S10(top panel) displays the time series of 3-hourly modeled CT2015 values and observed WMO measurements. Deviation of residuals from a normal distribution are displayed in Fig. S10 (bottom panel). The typical 1-s model bias is 2ppm, but not all of the distributions are normal. For UUM, and therefore, CT2015 parameterization of clean continental background, the modelmeasurement residuals largely follow a normal distribution centered around 0. The clean continental background generally exhibits well-mixed behavior and is not defined by large excursions in the CO2 signal. At the high-altitude WLG site representative of the free troposphere, the residuals follow a normal distribution centered around 0 but deviate from normal during instances where significant excursions in the CO2 signal are present. This is also the case at LLN (distribution centered near 2.5ppm). TAP residuals deviate significantly from normal. In general CT2015 does not capture CO2 events that are significantly different from global means; CT2015 underestimates uptake processes and overestimates lower or higher than global mean.
As not all deviations from observations can be represented as normal distributions, we place the model-measurement residuals at the four WMO sites in an error pool and select as part of the overall bootstrapping procedure for the modeling framework.
As shown in Fig. S10, LLN shows CO2 depletion relative to CT2015 suggesting that for this analysis it is not representative as a background site. (CT is not responsive to all sites). The LLN observed CO2 drawdown compared to modeled CT2015 suggests that LLN sees a lot of surface influence on account of its location in the middle of an island in vegetated surroundings. Moreover, LLN is not an important sector for the influence region of this study; we include it primarily for reference for future studies considering regions of China that would be more sensitive to the sector associated with LLN.

S5 Scaling Results and Methodology
We translate the resulting mole fraction (ppm) mismatch between observed and modeled DCO2 to inventory corrections at annual and seasonal timescales. We scale in the L_0.90 region (Fig. S9) which represents regions that substantially influence the receptor without disproportionally weighting pixels that contribute very little to the observed signal (Fig. S11). As discussed in the main text, we are still using surface influences from the entire STILT footprint to derive the CO2 concentration at the receptor, but we ascribe the resulting model-observation mismatch as dominated by the L_0.90 region. Table S2 provides seasonal fluxes for each year before and after scaling. Annual scaling results are in Table 2 of the main text.
At annual scales, the dominant contributor to the CO2 signal are anthropogenic emissions; correction at annual scales is therefore applied only to the anthropogenic emissions inventories. The other significant contributors include longer term biological and ocean carbon sinks and interannual variability within these components, but for this study region, these components are embedded in the background concentrations. In particular, 13% of the northern China ecosystems and 20% of northeastern China's ecosystems are mixed forests. However, the ecosystems with greatest influence on this single site are croplands with high intra-annual carbon turnover rates. The heavily cropped L_0.90 region implies rapid turnaround of vegetation carbon stocks at the annual scale, justifying this assumption. At these timescales, we derive the DCO2,obs/DCO2,mod ratio which represents the factor by which the annual anthropogenic inventory must be scaled in order to match observations. We use a model of the mean method to derive the annual scaling factors, where hh represents each local afternoon hour (1100 to 1600) in the year. SF>1 implies the model underestimates CO2 concentrations while SF<1 implies the model overestimates CO2 concentrations. We obtain 95% confidence bounds by bootstrapping uncertainties in the numerator and denominator separately, and obtaining the 0.025 and 0.975 quantiles from the ratio of the means of the two distributions. The annual influence contours are overlayed on the IGBP land use map in Fig. S9, and shows the dominant grassland/cropland influence on the modeled Miyun signal at annual scales. As stated previously, The Miyun CO2 signal is certainly affected by other biological/oceanic/interannual variability; but these are not demonstrated to be significant parts of the regional (DCO2) signal. These are longer term features embedded in the background concentrations.
At the seasonal timescales, however, evaluation of CO2 processes is complicated by the biogenic flux contribution during the growing season and, to a lesser extent, the effects of ecosystem respiration in the dormant season. At these timescales, we derive additive corrections from converting observation-model mole fraction mismatch to the total CO2 to be added or subtracted from the inventories. We correct the anthropogenic and vegetation inventories together as it is not possible to distinguish the contributions from our existing observational data set. For each modeled hour we derive a residual-based flux correction, DFhh, in µmolCO2m -2 s -1 : where hh represents each local afternoon hour (1100 to 1600) in the season and h represents the STILT footprint back-trajectory hour up to 7 days back in time. Given that anthropogenic emissions are positive terms and the biogenic component is a net balance of two opposing terms (uptake and release) of CO2 during the growing seasons, use of inventory scaling factors for growing season scaling is inappropriate. That is, even a small mole fraction difference between modeled and observed in the growing season can result in meaningless scaling factors when there is a difference in sign involved. While scaling factors are appropriate during dormant seasons, for consistency we apply the same method of additive corrections across all seasons and report the adjusted inventory as fluxes (kg CO2 m -2 season -1 ). The methods are comparable; inventory corrections obtained by both methods during the winter and fall exhibit converging 95% confidence intervals. Original half-hourly output displayed in grey. Shaded yellow region represents observed daily range; daily minimum for windspeed is not available, but assumed to be 0m/s.