Errors in top-down estimates of emissions using a known source

Air pollutant emissions estimates by top-down methods are subject to a variety of errors and uncertainties. This work uses a known source, a coal-fired power plant, to explore those errors. The known emissions amount and location remove two major types of error, facilitating understanding of other types. Biases and random errors are 15 distinguished. A Lagrangian dispersion model (HYSPLIT) is run forward in time from the known source, and virtual measurements of the resulting tracer plume are compared to actual measurements from research aircraft. Four flights in different years are used to illustrate a variety of conditions. The measurements are analyzed by a mass-balance method, and the assumptions of that method are discussed. Some of those assumptions can be relaxed in analysis of the modeled plume, allowing testing of their validity. Meteorological fields to drive HYSPLIT are provided by the 20 European Center for Medium Range Weather Forecasts Fifth Reanalysis (ERA5). A unique feature of this work is the use of an ensemble of meteorological fields intrinsic to ERA5. This analysis supports reasonably large (30-40%) uncertainties on top-down analyses.


Introduction 25
Emissions of air pollutants must be known for modeling of exposure and planning for compliance with concentration standards. Bottom-up and top-down methods are used to estimate emissions. Bottom-up methods combine activity data with emissions factors, essentially counting sources and multiplying by their individual emissions. This is the main method used to produce official inventories. Top-down methods use measurements of atmospheric concentrations to estimate emissions. Both types of methods are subject to substantial uncertainty, and often disagree 30 (e.g. (Hsu et al., 2010)).
The main purpose of this work is to evaluate some of the errors and uncertainties in top-down methods, while controlling other sources of uncertainty. Evaluating the errors and uncertainties of top-down methods is difficult. Not only the emissions amounts, but their location, distribution, and timing are often unknown. Attempts to constrain all 35 these matters simultaneously result in grossly under-determined systems. Most often the under-determination is dealt with by employing Bayesian statistical methods, which introduce further errors and uncertainties, and remove the desired independence of the top-down and bottom-up methods.
In this work, we start with an emission source known in quantity, timing, and location. That source is the Martin Lake 40 coal-fired power plant in eastern Texas. It is located in reasonably simple, flat terrain. Stack emissions of several gasses are measured by Continuous Emissions Monitoring Systems (CEMS). Concentration measurements from aircraft are used to estimate emissions by mass balance, and to compare with modeled concentrations. Here we use sulfur dioxide (SO2), nitrogen oxides (NOy), and carbon dioxide (CO2). SO2 is emphasized because its peak concentration measured in aircraft traverses is well-defined above the regional background. SO2 is lost to surfaces and 45 converted to sulfate aerosol, but that conversion is slow compared to the transport time of the main transects we use here, which are usually 30-60 minutes downwind of the stack. CEMS data are also uncertain, but we assume that those uncertainties are small, i.e. <10% (Peischl et al., 2010), relative to the other uncertainties treated here. For the purposes of this paper, CEMS data are considered the "reality" with which other estimates are compared. 50 Emissions can be estimated from observations alone under certain non-trivial assumptions. Inverse modeling is used to allow relaxation of some of those assumptions. In this work, we replace some of the observations with (forward) modeled values, in varying combinations. This allows us to characterize and estimate errors arising from the models, and is a step toward understanding errors and uncertainties in inverse modeling. 55 In the mass balance method the object is to determine the amount of mass flowing through a plane downwind of a source. The mass flow rate through this plane is used as an estimate for the emission rate from the source. Sources of error are (1) error in determining the mass flow rate through the plane; (2) the emission rate at the source may be different than the mass flow rate through the plane; and (3) pollutant may be lost to deposition and/or chemical transformation. Determining the mass flow rate through the plane is done by estimating the concentration at each point 60 in the plane, multiplying by the area to get a linear mass density and then multiplying by the wind speed perpendicular to the plane to obtain a mass flow rate. Usually one or more transects is flown through a plume during presumably well mixed conditions. Thus the concentrations at a single height may be used as an estimate for concentrations from the ground to a well-defined mixing height. Errors may arise because the plume is not well mixed causing concentrations to vary significantly in the vertical direction, the planetary boundary layer height is not well known, or significant mass 65 has been transported above the boundary layer. The other source of error is that the mass flow rate through the plane may be significantly different than the emission rate from the source. This may occur when winds are variable in speed and/or direction in time and/or space. By using a model, we can virtually eliminate errors arising from (1). In the simulated world, simulated concentrations 70 and model wind speeds at every point in the plane are known so there is no need to estimate a mixing height. In addition, the emission from the source is also known in the simulated world. Thus mismatch between the emission rate from the source and the mass flow rate from the analysis plane is only caused by variable winds and can be explored in detail. Uncertainties due to losses to deposition and/or transformation (3) are also eliminated in the simulations. 75 The mass balance method can also be applied to the simulated fields by using only simulated concentrations from a single transect and the model mixing height. The tracer profile is known exactly in the simulated world, but the mixing height may still not be well defined. Tracer profiles may not have a sharp cutoff in the vertical. Significant mass is often found above the boundary layer height specified to the simulation because the model is designed to allow transport above the boundary layer and also because the boundary layer height changes in space and time. It is difficult 80 to tell, however, if this accurately reflects vertical mixing in the real world.
We attempt to distinguish between error and uncertainty within this work. Error is the difference between an analyzed value and the true value, in this case, of the emission rate. Uncertainty is an estimate of the error that we expect in the absence of knowledge of the truth. (Metrology, 2008) Error and uncertainty have systematic and random components. 85 Systematic error is synonymous with bias, that is, persistent differences of one sign between reality and a result of analysis. The distinction is neither precise nor crisp, and terms are not always used carefully. Atmospheric measurements rarely have enough samples to reliably distinguish the two. Bias can be introduced by the use of methods or assumptions that most often move the result in one direction. For example, a low bias in a wind speed measurement will result in a low bias in the emissions estimated by mass balance. Random differences have many 90 possible causes, one important cause being sampling uncertainty, that is, the difference between the mean of a quantity measured with a small number of samples and the true (ensemble) mean. The distinction is important for several reasons. Reporting a small uncertainty with possibly large unknown biases can lead to incorrect policy decisions. Bayesian analyses assume that the measurements and prior have zero mean error and (usually) Gaussian uncertainty, and the proper characterization of the error covariances is critical to a good result. From a practical point of view, 95 repeated measurements can reduce random uncertainty but cannot reduce bias.
Errors and uncertainties in the meteorological fields (modeled or measured) used in analyses propagate directly into the result. Since we only measure one realization of the chaotic atmosphere, we never have an ensemble average (statistically speaking). Normally, only a single meteorological field is used. The numerical weather prediction 100 community has been moving steadily toward producing ensemble output, that is, well-designed sets of multiple realizations. These can provide a rough method to distinguish between bias and random uncertainty. The difference https://doi.org/10.5194/acp-2020-169 Preprint. Discussion started: 21 April 2020 c Author(s) 2020. CC BY 4.0 License.
between the result produced from a control run and reality is an estimate of bias. So is the difference between an ensemble mean result and reality. The statistics of differences between results from all ensemble members are estimates of random uncertainty. The quality of these estimates depends on the design and quality of the ensemble. 105 Meteorological quantities have been identified as major sources of error and uncertainty in emissions estimates. These can be divided into several classes. Wind direction errors displace the plume (or the source in an inverse analysis). Wind speed errors change the magnitude of the plume and its timing with respect to time-varying emissions. Mistaken diagnosis of the mixing height affects the concentrations and may contribute to violation of the well-mixed assumption. 110 Errors in the transport model are also important. Under-or over-estimation of horizontal dispersion changes the plume width. Discretization in either horizontal or vertical dimensions may add noise or uncertainty. Physical situations that are correctly handled in the models may still lead to errors in some analyses, for example temporal variation (unsteadiness) of wind can result in storage of pollutant that violates the assumption of steady wind in mass balance analysis. This can be exacerbated if the winds are not updated often enough in the transport model. 115

Data
The Martin Lake power plant complex is located in fairly simple terrain in east Texas (32.260ºN, -94.570º). It has three stacks that are each 452 ft (138 m) high. The stacks are spaced 100 m apart. 120 The measurements used here were taken by NOAA scientists aboard NOAA or NCAR research aircraft. Flights downwind of the Martin Lake power plant were made in four years (2000, 2006, 2013, and 2015). Dates are shown in Table 1. The flights were planned to intercept the plume at least once and usually several times. Downwind distances were chosen to satisfy the conditions for mass balance analysis (see below), far enough downwind for the plume to be 125 well-mixed through the boundary layer in the vertical but close enough for the concentration signal to be strong and to minimize chemical transformations.
For all four flights, SO2 was measured using a modified, commercial, pulsed UV fluorescence instrument, a Thermo Environmental Instruments, Inc., model 43S (Ryerson et al., 1998). The 1 Hz measurements have an estimated 1-130 sigma precision of ±0.3-1 ppbv and a 1-sigma uncertainty of ±10-12%, depending on year. NOy was measured by chemiluminescence after conversion to NO in a heated gold catalyst (Ryerson et al., 1999). The 1-Hz precision ranged from 0.015-0.4 ppbv, and uncertainty ranged from ±10-12%, depending on year. CO2 was measured by infrared absorption using a LI-COR 6262 in 2000 and 2006 (Peischl et al., 2010), and a Picarro 1301-m in 2013 and 2015 (Peischl et al., 2012). The 1-Hz precision was at or below ±0.1 ppm for all flight years, and the measurement 135 uncertainty was ±3% in 2000, and approximately ±0.15 ppm for the other flights.
Pursuant to federal regulation, commercial electric utility steam generating units with a capacity greater than 25 MW are required to monitor stack emissions and report them to the Environmental Protection Agency. The full capacity of each of the three units at Martin Lake are greater than 700 MW, thus subject to reporting requirements. The relative 140 accuracy of the SO2, NOx, CO2, and flow measurements are all required to be less than 10%. Therefore, we expect the uncertainty of a mass flow value or a ratio of two pollutant concentrations to be less than 14% after quadrature addition of the uncertainties. The ratio of pollutants was verified by top-down measurements of 11 Texas power plants, including Martin Lake, in a 2006 study (Peischl et al., 2010), but a determination of the mass flux has not. 145

Methods
Mass balance is a time-honored method of converting concentration measurements to emissions with minimal reference to models (e.g. (White et al., 1976;Trainer et al., 1995)). In the simplest cases, such as those presented below, it assumes that a plume is well-mixed in the vertical to a well-defined height and that the wind speed and 150 direction are steady. A robust estimate of the background (concentration not attributable to the source of interest) is required. At least one transect across the plume is made at a height well within the mixed layer. The mixed layer https://doi.org/10.5194/acp-2020-169 Preprint. Discussion started: 21 April 2020 c Author(s) 2020. CC BY 4.0 License.
height is usually determined from profiles at the ends of some transects. More elaborate methods involve multiple downwind transects, often accompanied by 2-D interpolation of downwind measurements (Mays et al., 2009). Detailed analysis of related methods in the context of column measurements is given by Varon et al. (2018). 155 Fully model-based retrievals of emissions from concentration measurements are usually done with a Bayesian analysis in order to overcome the under-constrained nature of the problem. A Lagrangian dispersion model is run backwards in time from the sites of measurements to produce footprints, which are convolved with a prior inventory. The differences between modeled and measured concentrations are then optimized by a mathematical algorithm. Weights 160 (error covariances) are applied to the measurements and the prior, expressing relative confidence in their correctness.
Bayesian analysis requires assumptions about the PDF of errors (usually Gaussian, always with zero mean).
Here we use hybrid or intermediate methods to examine the consequences of different classes of error. The meteorological model (reanalysis) produces full fields of wind speed and direction and of mixing height. The 165 Lagrangian transport model then produces a full four-dimensional view of the plume. From these we can calculate the concentrations at the aircraft location for direct comparison and for standard mass balance analysis, as in (Karion et al., 2019). We can also calculate other estimates by considering different heights within the plume, a range of heights, or the whole plume. We can use measured or modeled mixing heights, and measured or modeled winds. We can choose whether to be sensitive to horizontal displacement of the plume. 170 The ECMWF fifth-generation reanalysis (ERA5) [Copernicus Climate Change Service (C3S) (2017): ERA5: Fifth generation of ECMWF atmospheric reanalyses of the global climate . Copernicus Climate Change Service Climate Data Store (CDS), accessed March 2019. https://cds.climate.copernicus.eu/cdsapp#!/home] is used to provide meteorological fields for this study. It consists of a control (high resolution) run and 10 ensemble members. The 175 control run is available hourly on a 0.25x0.25 degree latitude-longitude grid, with 37 pressure levels. The ensemble members are available every 3 hours on a coarser 0.5x0.5 degree grid.
HYSPLIT version 944 was used in this study. HYSPLIT is a Lagrangian atmospheric transport and dispersion model developed by the National Oceanic and Atmospheric Administration's Air Resources Laboratory (NOAA ARL) (Stein 180 et al., 2015). Dispersion of a material is simulated by a number of computational particles which represent a specified amount of mass of material. The computational particles are advected by the wind field and dispersed by a turbulent component which is calculated by the model from meteorological data fields.
HYSPLIT provides a variety of options to optimize for different situations. We used the default methods for 185 determining vertical velocity variances (Kantha-Clayson), and wind and temperature profiles for computing the boundary layer stability. These are the same settings used by Karion et al. (2019). The mixed layer depth was taken from the ERA5 input, which is also the default method. HYSPLIT was modified to include the Stochastic Time-Inverted Lagrangian Transport (STILT) dispersion algorithm, which was employed in this study. The STILT dispersion algorithm incorporates the (Thomson et al., 1997) reflection/transmission scheme for Gaussian turbulence that 190 preserves well mixed distributions for particles moving vertically across interfaces between grid cells as described in (Lin et al., 2003).
A tracer was emitted from the location of the Martin Lake stacks (32.26ºN, -94.57º) at 100 m with the rate given by the hourly CEMS data for SO2. Heat content of 8.5x10 7 W was specified for all simulations and used in the plume rise 195 calculation.
Concentrations from HYSPLIT were output on a horizontal grid with increments of 0.011º latitude by 0.009º longitude, spanning 0.6º x 0.8º. Vertical levels (25) were spaced every 100 m up to 2000 m AGL and then every 200 m up to 3000 m AGL. HYSPLIT concentrations were output as averages between each defined level, so for example the first 200 level in the output is the average between 0 and 100 m AGL. The hourly output represents the average over each hour.
No deposition was used, and the particles were defined as entirely passive. Ten thousand particles per hour were used to produce a sufficiently smooth concentration field.

Results 205
We first examine aircraft observations from the flight on 25 June 2013, which took place in good conditions for massbalance analysis as described above. The plume from the Martin Lake power plant was intercepted by the aircraft four times, as shown in figure 1. Figure 2 shows the time series of the observations and the plumes simulated using the control meteorology and the ten ensemble members. The simulated plume is well-aligned with the observations in the 210 first transect, but displaced in the other three. Note that the third and fourth transects were flown in opposite directions. All the simulated plumes are weaker and wider than observed. There is little visually apparent spread in the ensemble, all the members produce plumes of similar location, magnitude, and width. As figure 3 shows, however, the integrated amount of SO2 in the plumes does vary. 215 The next step is to compute the emission rate represented by each transect. This is done by mass balance using observed and modeled values separately. The observational analysis uses the observed mixing ratios, mixing height, and wind speed. Simulated plumes from the control run are analyzed with the simulated wind speed and mixing height. A full-plane integration of the simulated plume is also conducted, and described below. Thus we have a total of three emission rate estimates from the control simulations for each transect to compare with the CEMS 220 measurement. Figure 4 shows these results. The CEMS data show that the aircraft transects all measured emissions that came out of the stack during a plateau of relatively constant emissions. In the lower panel of fig. 4, the emission rates are presented as ratios to the CEMS data for each species. By this means, the three measured species constitute separate estimates of the same emissions. The single simulated tracer, labeled "SO2 sim", serves for all species. 225 Transect 2, at the shortest downwind distance, best meets the mass balance assumptions. The error bar shows the estimated uncertainty (±30%, one standard deviation) of the observed mass balance estimate. The estimate itself is nearly perfect, falling just below the unity ratio line. The simulated mass balance estimate is about 20% high, within the error bar. NOy and CO2 observed estimates also fall within the estimated error (not shown). For the other transects, the SO2 observed falls consistently below the unity ratio line, although only the farthest transect falls outside the error 230 bar. The 30% error estimate is only shown for transect 2, but is the same for the other transects and species, and is described in more detail in the Discussion section below. NOy and CO2 estimates are scattered, but within the 30% uncertainty estimate. The simulated tracer is nearly perfect for transects 1, 3, and 4, the transects at 45-52 km downwind. 235 The set of emission rate estimates resulting from the ensemble meteorology are shown in figure 4 with red + marks. For the closest transect (2), the ensemble spread is about 15%, less than the estimate of uncertainty for the observations. The ensemble spread contains the control simulation value, but does not span the unity line. The ensemble estimates for the farther transects have greater spread, although still somewhat less than the observation estimate, and all span both the control simulation values and unity ratio. 240 The above mass balance analyses are not sensitive to errors in the width or displacement of the plume, since they involve integrating across each plume regardless of its exact location or width. We can also do an analysis that is insensitive to mixing height by integrating the simulated plume in a vertical plane along the flight transect, which we call the full-plane integration. Error in the emission estimate from this method would arise only from deviations from 245 the assumption of a steady mass flow rate through the plane. For transect 2 on 25 June 2013, the full-plane integration using the control simulation gives an emission rate of 7500 kg h -1 (table 1) compared to the CEMS value of 6780. This is a bias of 11% (all percentage values are with respect to the CEMS value). The full-plane integration using the ensemble meteorology produces estimates ranging from 6900 to 7500 kg h -1 , with mean 7130 and median 7140 kg h -1 , a bias of 5%. The ensemble range of the full-plane estimates does not include the CEMS value or the observation-based 250 mass balance value, but does include the control simulation mass balance value. The ensemble spread is rather small (8%). The simulated plumes are not perfectly well-mixed, with higher concentrations in the lower BL ( figure 5).
The question of mixing height deserves further exploration. Figure 5 shows vertical cross sections in the along-wind direction of the tracer mixing ratios simulated by HYSPLIT with the control meteorology and each of the ten ensemble 255 members. An observed value of mixing height was subjectively determined from potential temperature and water vapor profiles flown at the ends of the transects, shown as an o mark in each of the subplots. A simulated mixing height is estimated from the tracer mixing ratio profiles. It is the height at which the mixing ratio first falls below 50%  2, 3, and 4). Simulated values from the control run are 5-19% high for transects 1, 2, and 4, and 14% low for transect 3. The observational estimate for transect 1 has a substantial low bias (34%), for which we do not have a convincing explanation. 265 Seeing that the observations and control simulations produce small biases for transects 2, 3, and 4, we now examine the ensemble simulations. These are shown in figure 4, and numbers are given in table 1. The ensemble does not span reality for transect 2, but does cover the control estimate. The ensemble spread is 12% for transect 2 and 25-36% for the other transects. 270 Another flight with several transects took place on 16 September 2006 (figure 6). Transect 2 has the best match to the mass balance assumptions. The emission rate analysis from observations has a high bias of 25%, while the estimate from the control simulation is biased 5% low. The observational estimates for CO2 and NOy are close to their respective CEMS values (table 2). The full-plane estimate agrees closely with the observational estimate for SO2 (table  275 1). The ensemble estimates have a spread of 36%, cover the CEMS value, and are nearly centered around the control simulation value. As for the other transects, the closest transect gives a high-biased observational estimate. The transects farther downwind all produce low-biased observational and simulated estimates, except for CO2 and NOy at the farthest distance (not shown). The emissions as measured by CEMS are increasing during the span of time when the plume was emitted, adding substantial uncertainty to the comparison. Examining the vertical cross-sections of the 280 plume (figure 7) we see that the plume is approximately well-mixed at the latitude of transect 2 (32.5ºN) in all but one ensemble member (em8), but not well-mixed at 32.4ºN, the latitude of transect 1.
On 3 September 2000, only one transect is usable (table 1, figure 8). The observational analysis produces an emission rate within 4% of CEMS. The control simulation overestimates by 19%. The ensemble estimates have substantial 285 spread (46%), which covers the values from the observations and the control simulation. The full-plane integration has a 44% high bias. Some of the difficulty in the simulations is due to unrealistically large mixing heights in ERA5. HYSPLIT cannot produce a well-mixed plume in these conditions. Agreement between the observational analysis and CEMS reflects a reasonable mixing height estimate but may involve some element of good luck. We do not know whether the real plume was well-mixed. Potential temperature profiles (not shown) before the transect show relatively 290 shallow mixing heights consistent with the manual estimate. Observed profiles after the transect show a deep BL ~2500 m AGL, although not as deep as in ERA5 (3000-3600 m). This is an indication that the mixing height was changing during the time of the observations. Both profiles are at some distance from the plume location, so their applicability is questionable. The emissions measured by CEMS are increasing substantially around the time the plume was emitted, which adds to the uncertainty. 295 The flight on 25 April 2015 took place after the SO2 emissions of the power plant had been substantially reduced by scrubbing, so the SO2 plume was much weaker. Analysis of the observations requires estimating the background, and uncertainty in that estimate is more important when the plume is weaker. The observational analysis produces a flux estimate biased 40% low; the control simulation is biased only 10% high (table 1, figure 8). The ensemble estimates 300 have large spreads, more than 100%. The spread is due mostly to a single high member. We cannot justify removing that member, which is not obviously wrong. Wind speed is biased low and mixing height is biased high in the simulations (table 3). Several members produce concentration profiles that are not well-mixed at the observation plane (not shown). Potential temperature profiles before and after the transect (not shown) are consistent with the respective model and observed BLHs. 305

Discussion
In this section, we use the collection of emissions estimates described above to explore the uncertainty of such estimates. Several approaches are used: 310 1. Manual uncertainty estimates for the observational mass balance https://doi.org/10.5194/acp-2020-169 Preprint. Discussion started: 21 April 2020 c Author(s) 2020. CC BY 4.0 License.
2. Errors from the observational mass balance for the four flights, including multiple species 3. Errors from the simulated mass balance for the four flights 4. Errors from the simulated full-plane integrations for the four flights 5. Errors from simulated mass balance using the meteorological ensemble 315 Uncertainty estimates for mass balance calculations from observations alone are described in detail by (Peischl et al., 2015) for the Haynesville shale area, which includes the Martin Lake power plant, using the same flight data used here for 25 June 2013. They estimate uncertainties of 300m (about 20%) for mixing height and about 28% for wind speed. These are the dominant uncertainties in their presentation. Combined, all sources yield flux uncertainties (for methane 320 from the oil and gas field) of about 35%. We refer to these as the "manual" uncertainty estimates (list item 1 above). We estimate the total uncertainty in the mass balance emission rate by summing in quadrature the following sources of uncertainty: The accuracy of the gas species measurement (±0.10 ppmv + 3% in 2000, ±0.15 ppmv otherwise for CO2 ; ±10% for SO2, ±12% for NOy); the background determination of the gas species measurement (±20% for CO2; ±5% for SO2; ±10% for NOy), the boundary layer depth (±200 m), and the wind speed (±1 m/s). This results in a total 1-sigma 325 uncertainty between ±22% and ±32% for the emission estimates presented here. For clarity of presentation in figures 4 and 6, this is represented by a ±30% error bar.
Of the transects in table 1, many have large biases, but these problematic transects are either far from the source or very close. We chose one "primary" transect for each day, flown at a roughly optimal distance from the source. The 330 primary transects are highlighted in Table 1, and emissions of all three species are included in Table 2. Because the simulation uses a passive tracer, values scaled to the CEMS emission rates are applicable to all three species. Of these four transects, three have observational estimates within the manually estimated uncertainty for all three species. The outlying observational estimate, for 25 April 2015, is subject to uncertainty in the background, which may be underestimated, and which does not affect the simulations. The CO2 observational estimate is within the uncertainty on 335 this day. It is not clear how to rigorously combine uncertainties for different flight days and species, so we do not present a single number for this collection (list item 2 above). An uncertainty estimate of less than 23% would leave four outliers rather than two. We therefore choose an estimate of approximately 25% to avoid being too optimistic.
All four estimates for SO2 from the primary transects using the control simulation are within the manually estimated 340 uncertainty (Table 1 and figure 8). Again, no rigorous combined uncertainty can be computed (list item 3 above) but anything less than 20% would be too optimistic, leaving out two of the four estimates. The full-plane integration (list item 4) eliminates one source of uncertainty, the estimation of mixing height. Somewhat counter to our expectation, however, this reduces the error in only two of the four cases. The error in mixing height must be offsetting an error due to unsteady winds in the other two cases. The collection suggests a lower limit of 30% or so on the uncertainty. 345 Estimating probabilities or confidence intervals from small ensembles is challenging (Leutbecher, 2019;Wilks, 2011). Too few members will generally produce too small a spread. Ensemble spread is often expressed as a standard deviation, but if each member is equally likely, the range is a better measure. Even so, most ensembles have too little spread, and do not encompass the true value. If biases occur, for example because of model errors common to all 350 members, the spread probably should not encompass the true value. In other words, increased random spread is not an adequate substitute for diagnosis and correction of biases. Bias correction as part of ensemble calibration should be used. However, sufficient and reliable observations are often unavailable with which to correct biases, as is the case here. 355 The example of 25 June 2013 transect 2 is instructive (table 1). Emission rates from the ensemble range 12%, but do not cover the true value. If we call the ensemble range the uncertainty of the rate estimate, it would be too small (overconfident). If we used the ensemble standard deviation, the uncertainty estimate would be even smaller, about 5%, which is clearly unreasonable. For the same transect (25 June 2013 transect 2) the rate calculated from observations is 8% low relative to reality, and the rate calculated from the control simulation is 19% high. Both 360 estimates fall within the estimated uncertainty of the manual analysis.
Taking the other primary transects in order starting with the worst control flux estimate, for 3 Sept. 2000 we have a 19% high bias. The ensemble range is 46%. These differences suggest a somewhat larger uncertainty than the manual estimate. Other clues to problems with this plume include very high BLH from ERA5 and a large difference between 365 mixing heights estimated from the aircraft profiles before and after the transect. However, if mixing height were the only problem, the full plane integration would produce a correct estimate, since it is not sensitive to mixing height or to https://doi.org/10.5194/acp-2020-169 Preprint. Discussion started: 21 April 2020 c Author(s) 2020. CC BY 4.0 License.
the well-mixed assumption. The remaining source of error is the variability of the winds leading to storage of pollutant within the volume between the source and the measurement plane, which accounts for the rest of the bias (control) and spread (ensemble). The winds were light and variable during the night and early morning. 370 Transect 2 on 16 Sept. 2006 has emission rate estimates from the observations that are 25% high, and from the control simulation 5% low. The full plane integration returns a 22% high bias. The ensemble range is 36%, and covers reality. The other transects on this day are either too close for the plume to be well-mixed (transect 1) or too far away for the assumption that SO2 is roughly conserved to be valid. 375 Transect 2 on 25 April 2015 has already been mentioned as suffering from background uncertainty. The control-based estimate is close to reality. The full plane integration is biased 28% low, still within the manually-estimated uncertainty. One outlying member produces the very large ensemble spreads. There is nothing to justify removing this member, which again emphasizes the difficulty of estimating spread from a small number of members. 380 Overall, the ensemble ranges give uncertainty estimates of 12-105% on different days (list item 5). This is information that cannot be obtained in any other way. We must caution, however, that the small range on 25 June 2013 is probably underestimated. 385 Mixing height and the well-mixed assumption are important uncertainties in the mass balance framework. This is not very surprising considering that the definition of a single mixing height involves several unsafe assumptions. For the boundary layer to be well-mixed, there must be a substantial surface buoyancy flux, a well-defined capping layer (not necessarily an inversion per se), and minimal change of advection with height. Boundary layers are commonly wellmixed in potential temperature but not in water vapor mixing ratio, because the entrainment fluxes at the top of the 390 boundary layer are of opposite sign (warming for temperature, drying for water vapor). Additional conditions are required for a plume to be well-mixed, specifically there must be enough time (several boundary layer turnover times) between emission and measurement. It is worth noting that plumes that are not well-mixed can sometimes give reasonable mass balance estimates if the measurement transect is flown close to the middle of the boundary layer, where the measured value approximates the mean of the (roughly linear) vertical profile. 395 Wind speed is another important uncertainty. The usual assumption is that the wind speed at the measurement transect is representative for the entire plume transit. If the wind is light and therefore variable, or if the wind speed has changed substantially over the transit time, it is unclear what wind speed to use in the calculation. Sometimes modeled wind speeds are used (Karion et al., 2015). Looking at table 3, we see substantial biases in wind speed (low) and 400 mixing height (high), the smallest biases occurring on 25 June 2013. These biases are with respect to the winds and mixing height observed on the aircraft, and therefore also include substantial uncertainties. The ensemble ranges can be compared with the manual estimates of uncertainty. For wind speed, the ensemble ranges are comparable to the manual estimates or smaller. For mixing height, the ensemble ranges are larger. Combining simulated and observed values, for example using simulated wind speeds with observed mixing heights, would incur large errors. 405 The methods used here are not sensitive to some common sources of error. None of these methods are sensitive to plume displacement caused by errors in wind direction, as long as the plume is fully covered by the transect. The full plane integration is not sensitive to the mixing height estimation or to violations of the well-mixed assumption. The fact that the full-plane esimate is farther from reality than the traditional mass balance estimate for three of the four 410 primary transects suggests that compensating errors are present. Significant uncertainties remain in all methods.
This work was partly inspired by the study of Karion et al. (2019). The main emphasis of that work was on large (factor of 2 or greater) biases due primarily to errors in vertical mixing in the Lagrangian transport models. We should keep in mind, however, that the true emissions in that study are not known. The true vertical mixing is not known in 415 either that study or this one. Karion et al. (2019) also reported large uncertainties (error bars) based on multiple flights and multiple models. Variability from a small ensemble of meteorology in one configuration (WRF2-FP) was 20-30%.
An earlier study by Angevine et al. (2014) used a six-member WRF ensemble driving forward runs of the Flexible Particle (FLEXPART) Lagrangian dispersion model to estimate uncertainty due to meteorology. A passive tracer 420 representing national inventory emissions of carbon monoxide (CO) was transported by the models. That study found 30-40% spread in CO concentrations. The ensemble spread was insufficient to cover the errors in wind and temperature, as would be expected for a small ensemble using a single model framework.

Conclusions 425
Errors in top-down emissions estimates are substantial even under good conditions. Using a known source in forward runs driven by ensemble meteorology, we have shown errors ranging from a few percent to over 100%. No single robust estimate of either bias or random error can be derived from these results. Identification and removal of cases that are clearly bad reduces the error, but if the source is unknown, some bad cases cannot be clearly identified. 430 Investigator judgement is vital. Using forward modeling and examining the structure of the plumes it produces can be helpful in identifying better or worse cases. Using an ensemble provides another dimension for diagnosis, since the spread of the ensemble results can indicate the possibility of outlying solutions that are not apparent in a single deterministic run. 435 The largest source of error is the vertical mixing, as was also shown by Karion et al. (2019). This is not simply a question of finding a correct mixing height, although that is a major problem. In several of the cases shown here, the plumes are not well-mixed, and the mixing height is therefore not well defined.
Wind speed is another important source of error (table 3). The wind speed on the aircraft transect may be the only 440 source of data for comparison to the models. However, even if it compares perfectly, unsteadiness of the wind previous to the measurement can cause poorly-characterized errors.
Losses of pollutant to deposition or chemical transformation can be important for observations taken at longer downwind distances (greater than ~20 km), as seen in figures 4 and 6. These losses can be estimated if sufficient 445 information is available, for example measurements of product species, but those estimates will necessarily introduce additional uncertainty.
These results are not sensitive to errors in plume location or width. In future work, we will explore the sensitivity of Bayesian inversion methods to those errors as well as to the kinds of errors shown here. 450 Our results are consistent with uncertainty estimates from rigorous analysis of observations, as performed by Peischl et al. (2015). The minimum emissions flux uncertainty that can be supported by these results is 30%. Under less than ideal conditions, errors can be much larger.     Lower panel: Ratio of emission rate derived from different estimates (method and species, symbols as in legend) to 564 CEMS emission at estimated time of emission. Explanation of legend: "obs" is mass balance using observations; "Ctl 565 sim" is mass balance using control simulated tracer, wind, and mixing height; "full plane" is derived by integrating the 566 full x-z plane in the simulation at the transect latitude; "ensemble" is mass balance using simulated tracer, wind, and 567 https://doi.org/10.5194/acp-2020-169 Preprint. Discussion started: 21 April 2020 c Author(s) 2020. CC BY 4.0 License.  Ratio of emission rate derived from different estimates (method and species) to CEMS emission at estimated time of 581 emission. Explanation of legend: "obs" is mass balance using observations; "Ctl sim" is mass balance using control 582 https://doi.org/10.5194/acp-2020-169 Preprint. Discussion started: 21 April 2020 c Author(s) 2020. CC BY 4.0 License.  (method and species) to CEMS emission at estimated time of emission. Explanation of legend: "obs" is mass balance 596 using observations; "sim" is mass balance using simulated tracer, wind, and mixing height; "full plane" is derived by 597 https://doi.org/10.5194/acp-2020-169 Preprint. Discussion started: 21 April 2020 c Author(s) 2020. CC BY 4.0 License.