Intercomparison of atmospheric trace gas dispersion models: Barnett Shale case study

Greenhouse gas emissions mitigation requires understanding the dominant processes controlling fluxes of these trace gases at increasingly finer spatial and temporal scales. Trace gas fluxes can be estimated using a variety of approaches that translate observed atmospheric species mole fractions into fluxes or emission rates, often identifying the spatial and temporal characteristics of the emission sources as well. Meteorological models are commonly combined with tracer dispersion models to estimate fluxes using an inverse approach that optimizes emissions to best fit the trace gas mole fraction observations. One way to evaluate the accuracy of atmospheric flux estimation methods is to compare results from independent methods, including approaches in which different meteorological and tracer dispersion models are used. In this work, we use a rich data set of atmospheric methane observations collected during an intensive airborne campaign to compare different methane emissions estimates from the Barnett Shale oil and natural gas production basin in Texas, USA. We estimate emissions based on a variety of different meteorological and dispersion models. Previous estimates of methane emissions from this region relied on a simple model (a mass balance analysis) as well as on ground-based measurements and statistical data analysis (an inventory). We find that in addition to meteorological model choice, the choice of tracer dispersion model also has a significant impact on the predicted down-wind methane concentrations given the same emissions field. The dispersion models tested often underpredicted the observed methane enhancements with significant variability (up to a factor of 3) between different models and between different days. We examine possible causes for this result and find that the models differ in their simulation of vertical dispersion, indicating that additional work is needed to evaluate and improve vertical mixing in the tracer dispersion models commonly used in regional trace gas flux inversions.


S1 Evaluation of WRF-FDDA wind and PBL depth using Lidar and aircraft observations
To evaluate the performance of the WRF-FDDA model runs that we use in the dispersion calculation, we compare with observations from the field. Two sources of observations are used in this comparison: NOAA/ESRL Chemical Sciences Division (CSD)'s High-Resolution Doppler Lidar (HRDL), and the aircraft observations. The HRDL 5 retrieves horizontal wind speed and direction, vertical wind speed variance, and aerosol scattering as a function of altitude at one stationary location where the instrument was deployed in the field ( Fig. 1) (Grund et al., 2001); more information about this deployment is in Karion et al. (2015). HRDL was not operational for the first three flights in October 2013 due to the US federal government shutdown on those days.

10
Wind speed and direction observations from the HRDL allow for comparison with the WRF-FDDA model at one location but continuous in time, including the time period that is used in the STILT or HYSPLIT model runs, which extends 12 h -24 h back from the flight time. Both the HRDL winds and the model were averaged in height from the ground to the top of the PBL as determined in WRF. Airborne measurements of the wind speed and direction provided an additional evaluation metric, with information about the wind across the entire field but only during the time of the 15 flight. Comparisons indicate that the WRF-FDDA model, on most flight days, predicts average wind speed and direction quite well, with wind speed errors up to 30% (1.8 m s -1 ) at most, with a mean high bias of 11% (0.7 m s -1 ) over all the days relative to the aircraft and a low bias of 3% (0.1 m s -1 ) relative to HRDL, with significant variability over the times of the averages (Fig. S1, S3, Table S1).

20
The WRF-FDDA PBL depth was evaluated against HRDL observations for the 12 hours prior to each flight, and against the PBL depth used for the MBE, from airborne observations (Fig. S2, S3, and Table S2). PBL depth was provided by NOAA/ESRL's Chemical Sciences Division as it was retrieved from the HRDL data using an algorithm developed for this purpose (Bonin et al., 2017). PBL depths used for the mass balance calculation in Karion et al. (2015), which were estimated from the aircraft vertical profiles during the flight, are compared with the WRF-FDDA 25 PBL depths, averaged over the flight time.
Over all the flight days, the WRF-FDDA PBL depth showed a difference of 26 ± 26 % (mean ± standard deviation) relative to HRDL observations and 11 ± 14 % (mean ± standard deviation) relative to the aircraft observations, with the model biased towards low PBLs. Both wind and PBL comparisons indicate that relative errors in WRF-FDDA, 30 on average, are not large. However, looking for specific days when the average winds or PBL are not faithful to the observations can serve as a possible metric to evaluate the derived emissions for given days. We can see that WRF wind speeds are often higher than the observations, especially those from the aircraft (a negative difference in Fig.   S3), and on 20131016, 20131019, and 20131025, these differences are larger than other days (20%-30% vs. < 10% on other days). PBL depth on the other hand is often underestimated in the model (shown as a positive difference in 35

S2 Model configuration details
Here we describe in detail the various configurations of the models that were run for this study and are listed in Table   1 in the main text.

S2.1 WRF 40
The primary transport model used in this analysis was the Weather Research and Forecasting (WRF) model, version 3.4, in Four Dimensional Data Assimilation (FDDA) mode. The WRF-FDDA configuration for the model physics includes the use of: 1) the Thompson microphysical processes, 2) the Grell 3-D ensemble scheme for cumulus parameterization on the coarse grid, 3) the Rapid Radiative Transfer Model (RRTM) for longwave atmospheric radiation, and the Dudhia scheme for shortwave atmospheric radiation, 4) the TKE-predicting Mellor-Yamada-45 Nakanishi-Niino (MYNN) Level 2.5 turbulent closure scheme for boundary layer turbulence parameterization, and 5) the 6-level RUC land surface model (LSM) for representation of the interaction between the land surface and the atmospheric surface layer. The WRF modeling system also has data assimilation (FDDA) capabilities to allow the meteorological observations to be continuously assimilated into the model. The WRF-FDDA system is able to create four-dimensional dynamically consistent with data sets or dynamic analyses (Rogers et al., 2013). The wind fields 50 were nudged with the four-dimensional data assimilation (FDDA) using WMO surface meteorological observations and regional radiosondes in both March and October using methods described in Deng et al. (2009), but in March the wind fields were additionally nudged with High-Resolution Doppler Lidar (HRDL) and aircraft horizontal wind fields (see Karion et al. (2015) and Lauvaux et al. (2013) for additional details). Three nested domains were run at 1, 3, and 9 km resolution, with 50 vertical levels, and initial and boundary conditions from NOAA's Rapid Refresh (RAP) 55 model. Fifty vertical terrain-following layers are used, with the center point of the lowest model layer located ~12 m above ground level (agl). The thickness of the layers increases gradually with height, with 27 layers below 850 hPa (~1550 m agl).

S2.2 WRF-Chem forward simulations
WRF-Chem simulations were performed using the passive tracer module described in previous studies (Díaz Isaac et 60 al., 2014;Lauvaux et al., 2012), originally developed from the passive tracer option in WRF-Chem. The emissions of CH4 from the Z-A inventory, assumed constant over time, are emitted at every model time step, and transported according to the mean horizontal and vertical winds, and turbulence fields. For the 9-km grid, the WRF-FDDA simulation (S2.1) was nudged to re-analysis fields and meteorological data from WMO surface stations and rawinsondes (i.e. wind speed and direction, and temperature above the PBL only) as well as airborne wind data, and 65 Lidar wind profiles. For the WRF-Chem simulation, the 3-km grid was run with WRF but without FDDA to conserve mass within the domain. The simulation was re-initialized every 18 hours in both WRF-FDDA and WRF-Chem runs.
The 3-dimensional fields of CH4 mixing ratios were saved every hour to provide a continuous simulation over the time period (from 18:00 UTC on October 18, 2013 to 18:00 UTC on October 29, 2013). This time period was chosen because it encompassed the flights that showed the largest variability (in terms of the CH4 enhancement) in the 70 Lagrangian footprint-based simulations.

S2.3 HYSPLIT and STILT backward simulations
HYSPLIT and STILT are Lagrangian particle dispersion models that can be run off-line using archived transport fields from a meteorological model. HYSPLIT is often used for forward dispersion modeling, i.e. for modeling downwind concentration fields of pollutants from a known release point (Stein et al., 2015;Draxler and Hess, 1997). STILT (Lin 75 et al., 2003) was developed based on HYSPLIT, but uses a different parametrization for vertical mixing to those choices available in HYSPLIT. It is commonly used for trace gas flux estimation using atmospheric inverse methods, because it was developed to run backwards in time to generate influence functions, or footprints. These can be used as the adjoint in an inverse model (Miller et al., 2013;Mueller et al., 2008). For the backwards runs, the HYSPLIT model was used with the setting to emulate the STILT model, that is to save the particle trajectories and variables 80 required to produce "footprints", or influence functions, for each receptor, with units (ppm (µmol m -2 s -1 ) -1 ). Integrated footprints from WRF-HYSPLIT for all the flight days are shown in Fig. S4.
We note that in both HYSPLIT and STILT the parameter that governs the minimum planetary boundary layer (PBL) depth (KMIX0) was set to 25 m, from the default 250 m. The STILT model was used with its default parameter list 85 unless otherwise noted, but we note that the WRF-FDDA wind fields were not time-averaged, as is usually the case with coupled WRF-STILT model runs (Nehrkorn et al., 2010). Both models were run from receptors located every 30 seconds (approximately 2.1 km during level flight) along the flight tracks, i.e. following the aircraft's latitude, longitude, altitude, and time. At each receptor (particle origin), 2000 particles were released, and their trajectories followed back in time for 24 hours. After 24 hours, for all flights, the particles had little to no surface influence (i.e. 90 had exited the boundary layer) in the study domain (the 25-county Barnett domain over which the inventory was developed, Fig. 1). Three combinations of these models were run for all the flights (Table 1) were tested. In HYSPLIT, the changing of the boundary layer turbulence parameterization (using KBLT) was investigated in both the forward and backward runs. The default in HYSPLIT is to use the Kantha-Clayson parametrization, but the use of the WRF turbulent kinetic energy (TKE) to determine the vertical mixing was also tested. We note that the TKE parametrization was corrected for an error in HYSPLIT 4 revision 931, so here we are using this latest update (dated 2018-02-02). These options are described in detail in Draxler and Hess (1997) and the 100 HYSPLIT User Guide (https://www.arl.noaa.gov/documents/reports/HYSPLIT_user_guide.pdf). This vertical turbulence computational method was found to have little effect on results (Fig. S5), so we chose the default HYSPLIT configuration (Kantha-Clayson parametrization) for further analyses.
Other HYSPLIT parameters were set to the default values, including the use of heat and momentum fluxes from the 105 driving model to determine boundary layer stability (KBLS=1). We found that changing the PBL depth determination from the default (using the PBL in the WRF-FDDA model, KMIXD=0) to a TKE-based PBL depth (KMIXD=2), did not make a difference in either flight. In the STILT model, changing the default PBL determination method from the default of using the Richardson number (KMIXD=3; note, this Richardson number method is not available in the current HYSPLIT model) to using the PBL from the input WRF model (KMIXD=0), was also investigated but we 110 found little difference between the two (not shown). Additionally, for 20131028 the effect of gridding footprints at a higher resolution of 0.04 degrees (and convolving with a higher-resolution 4 km inventory) vs. our default of 0.1 degrees was investigated. Lastly, we tested one configuration on 20131028 for which we released 5000 particles instead of 2000 at each receptor point, to see if the results were sensitive to this number. Neither the spatial resolution or particle number affected the predicted enhancements and are not discussed further here. 115 The original WRF-FDDA runs (conducted in 2013 for the initial campaign) were not configured specifically for running with the STILT model, which requires specific variable outputs and time-averaged wind fields. We were able to re-run WRF for the middle 3-km domain using the driver data from the 9-km domain in the original runs, but now outputting the proper variables for WRF-STILT coupling, for the last four flights: 20131019, 20131020, 20131025, 120 and 20131028. For previous flights the driver data was unavailable from the original runs. For all eight flights, the STILT model was run with the original archived WRF-FDDA fields by bypassing the specific requirements for the WRF-STILT coupling and allowing STILT to use the fields that were provided, as it does when it is driven by non-WRF products, such as NAM. Using the averaged wind fields and other STILT-specific variables caused the footprint strength and enhancement to increase by 1% (20131019) to 40% (20131028), ranging between 0.2 ppb and 5 ppb in 125 the average downwind enhancement for those four flights where both versions were compared (Fig. S5). Results shown for STILT in Fig. 2 and 3 are using the averaged fields; Fig. 4 shows the results from the averaged fields for the last four flights and instantaneous wind fields for the first four.

S2.4 WRF-HYSPLIT forward simulations
HYSPLIT was run using the above-described WRF-Chem simulations (S2.2) for their meteorological fields only (no 130 tracer information), in forward mode. Unlike in the backward/footprint mode, here the particles were released from the emissions location, in this case at the center of each 0.1-degree cell from the Z-A inventory. The particles represent CH4 emissions and were released with mass corresponding to the emission rate in each cell, in kg h -1 . We conducted two of these simulations; one beginning at 1:00 UTC 20131019 and ending at 4:00 UTC 20131023, and one beginning at 21:00 UTC 20131024 and ending at 23:00 UTC 20131028. The mole fraction fields were output hourly at 0.1-135 degree horizontal resolution and at 15 vertical levels (top of the lowest level was 50 m, and then from 100 m to 1000 m every 100 m, then 1200, 2000, 3000, and 10,000 m). We note that although the mole fractions were saved at these resolutions, the particle motion was resolved at the native WRF-Chem resolution (3 km, 50 levels, hourly, as described above) regardless of the choice of output resolution. We conducted the forward HYSPLIT simulations with the default Kantha-Clayson turbulent mixing parametrization. We also tested the turbulent kinetic energy-based (TKE) 140 turbulence parametrization, but, similarly to the footprint/backward runs (Fig. S5), we did not see a large difference.
We first confirmed that the CH4 enhancements from the forward HYSPLIT model runs correspond to the backward runs at the observation locations (Fig. S6), and then used this forward model to compare with the four-dimensional WRF-Chem fields to more fully investigate model differences.

S2.5 WRF-LPDM backward simulations 145
The Lagrangian Particle Dispersion Model (LPDM) (Uliasz, 1994) was used to generate footprints from the WRF-FDDA transport fields. The turbulent motion in the Planetary Boundary Layer is parameterized following a Mellor-Yamada turbulence closure scheme to calculate the Lagrangian time scale and hence the vertical motion of particles near the surface. The turbulent motion is distributed in the horizontal and vertical directions assuming isotropic mixing in the horizontal plane. The energy dissipation rate is directly derived from the Turbulent Kinetic Energy values from 150 WRF-FDDA and combined with the wind velocity variance to calculate the Lagrangian time scale. The LPDM (coupled with WRF) has been evaluated in previous studies by comparing WRF-Chem direct simulations of trace gases (e.g. Lauvaux et al., 2012) for daytime tower-based measurement locations, with significant discrepancies between the two models under stable and neutral conditions. Daytime differences under convective stability conditions were found to be low at weekly time scales (less than 5% of the observed model-data mismatches) but significant in 155 terms of additional random errors at the hourly time scale (up to 50% of the observed model-data mismatches) (Lauvaux et al., 2012).

S2.6 WRF2-FLEXPART backward simulations
A second set of WRF model runs were performed to couple with FLEXPART ("FLEXible PARTicle dispersion model", https://www.flexpart.eu), another commonly-used Lagrangian particle trajectory model (Angevine et al., 160 2014;Brioude et al., 2013); we designate these as WRF2-FLEXPART (FP) in Table 1. For this set of runs, four different WRF configurations were run, testing two different PBL schemes and two different boundary / initial conditions. The four configurations are shown as GM, EM, GT, and ET in table 1 of Angevine et al. (2014). In brief, WRF version 3.5 was run on a single grid of 12-km horizontal spacing, with 60 vertical levels. All configurations used the RRTMG longwave and shortwave radiation schemes and Grell 3D cumulus scheme. The four configurations 165 differed by the use of the MYNN (GM and EM) or TEMF (GT and ET) PBL schemes, and by the initialization data set. GM and GT were initialized with GFS analysis, while EM and ET were initialized with ERA-Interim. In Angevine et al. (2014) these configurations were shown to be statistically indistinguishable and can be treated as members of a (small) ensemble.

170
FLEXPART was run using each of these four WRF configurations as a driver, to generate footprints along the flight tracks that were convolved with the Z-A inventory to produce simulated CH4 enhancements. The model released 3000 particles every 30 seconds along the flight tracks (as for the other Lagrangian footprint models), backwards in time, following them for 18 hours. The results shown in Fig. 2

S2.7 WRF-STILT CarbonTracker Lagrange (CT-L) backward simulations 180
An additional WRF/STILT coupled configuration, WRF-CTL/STILT (also referred to as CT-L in the figure captions and Table 1) was also tested for all but the first flight. WRF v3.6.1 runs were 10-km resolution, continental scale outputs conducted by Atmospheric and Environmental Research (AER) with configuration similar to Nerkhorn et al.
(2010) for NOAA's Carbon-Tracker Lagrange (CT-L; https://www.esrl.noaa.gov/gmd/ccgg/carbontracker-lagrange/) project. Analysis data from the NOAA NCEP North American Regional Reanalysis (NARR), 32 km grids was used 185 to provide initial and boundary conditions for the WRF runs. Forecasts were reinitialized every 24 h, and analysis nudging to NARR every 3 h was used to constrain the model solution. WRF used the Noah land surface model and the Yongsei University (YSU) PBL scheme, with the RRTMG longwave and shortwave radiation and Grell-Devenyi convection option. STILT footprints were generated by AER for these flights every 60 s along the track, with 500 particles per receptor, and the near-field footprints were used for this analysis, which were output hourly for 24 h back 190 at 0.1-degree resolution.

S3 Bayesian inversion: construction of error covariance matrices R and B
First, a series of inversions was completed for different combinations of the variances along the diagonals of the R and B matrices (Eq. (4)) to determine the sensitivity of the inversion results to these parameters (Fig. S7). The variances in R were constant throughout each flight (i.e. constant along the diagonal), while the diagonal terms of B 195 were considered as a multiple of the 1-sigma uncertainty on the Z-A inventory (the 95% confidence interval as provided by the authors, the 1-sigma uncertainty ranged from 5-35% of each grid cell's value), squared to obtain a variance for use in B. Off-diagonal elements (i.e. correlations) in either R or B were outside the scope of this work, and not considered.

200
We only model R and B as diagonal matrices here, with no correlation between observations, for simplicity. Significant work has been devoted to identifying the proper covariance structure for these correlation matrices (Wu et al., 2013;Bousserez et al., 2015;Lauvaux et al., 2016), and we would expect correlations to be especially important in small spatial domains such as this one. However, given the range of posterior results and their sensitivity to our choice of R and B, we do not expect the overall trend of model-based emissions being higher than inventory estimates 205 to change based on the structure of these covariance matrices in the inversion; likely these correlations would change the relative weighting of the result toward either the data or the prior within the range of solutions shown in Fig. S7.
After the sensitivity analysis, two other options were investigated for determining these matrices: Restricted Maximum Likelihood (RML) and the variance of the model ensemble. First, the R and B diagonal values were estimated using 210 RML (Michalak et al., 2005); we used the method to simultaneously optimize for a single variance for R per flight and a single multiplier on the inventory uncertainty (on the standard deviation) for B. We placed an upper limit of 10 times the inventory uncertainty on each pixel, and the RML chose that upper limit on all flights. We did not allow the RML to choose a higher multiplier on the inventory uncertainty because we did not believe it would be realistic to consider that the authors' uncertainty estimates were more than a factor of 10 too low. Thus, B was the same for all deviations for R ranged from 11 to 70 ppb, all values above the uncertainty on CH4 enhancements from the observations.
Given the wide range of CH4 enhancements predicted by the various transport and dispersion models, we also conduct 220 inversions choosing the model-data mismatch matrix (R) proportional to the spread in the enhancements from the various models. We used the variance of the modeled enhancements at each observation location and time, summed with the variance from the background uncertainty from Karion et al. (2015), to construct an R matrix with different values along the diagonal. We used the following five models to construct this variance: WRF-HYSPLIT, WRF-STILT, WRF-LPDM, CT-L, and NAM-HYSPLIT, as they were available for all eight flights, with the exception of 225 20130325 which did not have CT-L results, so the other four were used for this day. For this inversion, the matrix B was kept at 10 times the inventory uncertainty. Figure S7 shows the results of the full sensitivity test of R and B on the total posterior emissions. We chose to show the result of the RML-based parameters in Figs. 8 and 9 in the main text.
230 Figure S8 shows the flux corrections (posterior-prior) for the eight flight days from inversions with the RML-estimated R and B, along with the spatial map of the diagonal terms in B, the prior error covariance matrix. The spatial pattern of adjustments to the inventory follows the pattern of the prior error, as one would expect in this framework. 235   Table S1. Errors are the measured minus model variable, shown as a mean ± standard deviation of this difference. Days with significant wind speed differences are highlighted in gray.  the model-data mismatch (R) standard deviation for various values of the prior error covariance, B, as a multiplier of the inventory uncertainty. B is assumed to be diagonal and a product of a constant and the inventory uncertainty, ranging from one to 10 times the 1-sigma inventory uncertainty (different for each pixel in the domain). R is also assumed to be diagonal, and constant for each flight, with the standard deviation (square root of variance along the diagonal) ranging from 10 ppb to 200 ppb CH4. The posterior total value 335 for R determined from the RML optimization with B=10x is also indicated on the figure (black and red circle). The posterior total value for R determined using the variance of the meteorological models with B=10x is shown for the mean value of the standard deviation along the diagonal (black and yellow circle). Note the y-axis is scaled differently for each flight. The prior total is equal to the inventory (85 t h -1 ) (black line), shown along with its 95% confidence intervals (gray shading). Error bars on the total emissions are also 95% confidence 340 intervals, from the posterior uncertainty (Eq. 5).