Atmospheric inversions have been used to assess biosphere–atmosphere

Atmospheric inversions are used to assess the exchange of

Atmospheric inversions based on Bayesian inference depend on the prior flux
error covariance matrix and the observation error covariance matrix. The
prior flux error covariance matrix represents the statistics of the mismatch
between the true fluxes and the prior fluxes, but the limited density of
flux observations limits our ability to characterize these errors (Hilton et
al., 2013). The observation error covariance describes errors of both
measurements and the atmospheric transport model. In atmospheric inversions,
the model errors tend to be much greater than the measurement errors (e.g.,
Gerbig et al., 2003; Law et al., 2008). Additionally, atmospheric inversions
assume that the atmospheric transport uncertainties are known and are
unbiased; therefore, the method propagates uncertain and potentially biased
atmospheric transport model errors to inverse fluxes, limiting their
optimality. Unfortunately, rigorous assessments of the transport
uncertainties within current atmospheric inversions are limited. Estimation
of the atmospheric transport errors and their impact on

A limited number of studies are dedicated to quantifying the uncertainty in
atmospheric transport models, and even fewer have attempted to translate this
information into the impact on the

Atmospheric transport models have multiple sources of uncertainty, including the boundary conditions, initial conditions, model physics parameterization schemes and parameter values. With errors inherited from all of these sources, ensembles have become a powerful tool for the quantification of atmospheric transport uncertainties. Different approaches have been evaluated in the carbon cycle community to represent the model uncertainty: (1) multi-model ensembles that encompass models from different research institutions around the world (e.g., TransCom experiment; Gurney et al., 2002; Baker et al., 2006; Patra et al., 2008; Peylin et al., 2013; Houweling et al., 2010), (2) multi-physics ensembles that involve different model physics configurations generated by varying the parameterization schemes of the model (e.g., Kretschmer et al., 2012; Yver et al., 2013; Lauvaux and Davis, 2014; Angevine et al., 2014; Feng et al., 2016; Sarmiento et al., 2017) and (3) multi-analysis (i.e., forcing data) ensembles that consist of running a model over the same period using different analysis fields (where perturbations can be added) (e.g., Lauvaux et al., 2009; Miller et al., 2015; Angevine et al., 2014). These ensembles are informative (e.g., Peylin et al., 2013; Kretschmer et al., 2012; Lauvaux and Davis, 2014) but have some shortcomings. In some cases, the ensemble spread includes a mixture of transport model uncertainties and other errors such as the variation in prior fluxes or the observations used. Other studies have only varied the PBL scheme parameterizations. None of these studies have carefully assessed whether or not their ensemble spreads represent the actual transport uncertainties.

In the last two decades, the development of ensemble methods has improved the
representation of transport uncertainty using the statistics of large
ensembles to characterize the statistical spread of atmospheric forecasts
(e.g., Evensen, 1994a, b). Single-physics ensemble-based statistics are highly
susceptible to model error, leading to underdispersive ensembles (e.g., Lee,
2012). Large ensembles (> 50 members) remain computationally
expensive and ill adapted to assimilation over longer timescales such as
multi-year inversions of long-lived species (e.g.,

Various post-processing techniques can be used to calibrate or
“down-select” from a transport ensemble of 50 or more members to a subset
of ensemble members that represent the model transport uncertainties (e.g.,
Alhamed et al., 2002; Garaud and Mallet, 2011; Lee, 2012; Lee et al., 2016).
Some of these techniques are principal component analysis (e.g., Lee, 2012),

In this study, we start with a large multi-physics/multi-analysis ensemble of
45 members presented in Díaz-Isaac et al. (2018) and apply a
calibration process similar to the one explained in Garaud and Mallet (2011). Two principal features characterize an ensemble: reliability and
resolution. The reliability is the probability that a simulation has of
matching the frequency of an observed event. The resolution is the ability
of the system to predict a specific event. Both features are needed in order
to represent model errors accurately. Our main goal is to down-select the
large ensemble to generate a calibrated ensemble that will represent the
uncertainty of the transport model with respect to meteorological variables
of most importance in simulating atmospheric

We generate an ensemble using the Weather Research and Forecasting (WRF)
model version 3.5.1 (Skamarock et al., 2008), including the chemistry module
modified in this study for

Physics schemes used in WRF for the sensitivity analysis.

The different simulations were run using the one-way nesting method, with two nested domains (Fig. 1). The coarse domain (d01) uses a horizontal grid spacing of 30 km and covers most of the United States and part of Canada. The inner domain (d02) uses a 10 km grid spacing, is centered in Iowa and covers the Midwest region of the United States. The vertical resolution of the model is described with 59 vertical levels, with 40 of them within the first 2 km of the atmosphere. This work focuses on the simulation with higher resolution; therefore, only the 10 km domain will be analyzed.

Geographical domain used by WRF-Chem

The

Our interest is to calibrate the ensemble over the US Midwest using the
meteorological observations available over this region. The calibration of
the ensemble will be done only within the inner domain. To perform the
calibration, we used balloon soundings collected over the Midwest region
(Fig. 1). Meteorological data were obtained from the University of Wyoming's
online data archive (

The ensemble will be calibrated for three different meteorological
variables: PBL wind speed, PBL wind direction and planetary boundary layer
height (PBLH). We will calibrate the ensemble with the late afternoon data
(i.e., 00:00 UTC) from the different rawinsondes. In this study, we use only
daytime data, because we want to calibrate and evaluate the ensemble under
the same well-mixed conditions that are used to perform atmospheric
inversions. For each rawinsonde site, we will use wind speed and wind
direction observations from approximately 300 m a.g.l. We
choose this observational level because we want the observations to lie
within the well-mixed layer, the layer into which surface fluxes are
distributed, and the same air mass that is sampled and simulated for
inversions based on tower

The PBLH was estimated using the virtual potential temperature gradient
(

WRF derives an estimated PBLH for each simulation; however, the technique used to estimate the PBLH varies according to the PBL scheme used to run the simulation. For example, the YSU PBL scheme estimates PBLH using the bulk Richardson number (Hong et al., 2006), the MYJ PBL scheme uses the turbulent kinetic energy (TKE) to estimate the PBLH (Janjic, 2002), and the MYNN PBL scheme uses twice the TKE to estimate the PBLH. To avoid any errors from the technique used to estimate the PBLH, we decided to estimate the PBLH from the model using the same method used for the observations. Simulated PBLH will be analyzed at the same time as the observations, 00:00 UTC, i.e., late afternoon in the study region.

We analyzed

In this research, we want to test the performance of the transport ensemble and try to achieve a better representation of transport uncertainties, if possible using an ensemble with a smaller number of members. A series of statistical metrics are used as criteria to measure the representation of uncertainty by the ensemble for the period of 18 June to 21 July 2008. The criteria used for our down-selection process include rank histograms, rank-histogram scores and ensemble bias.

The rank histogram and the rank-histogram scores are tools used to measure
the spread and hence the reliability of the ensemble (see Fig. A1 in
the Appendix). The rank histogram (Anderson, 1996; Hamill and Colucci, 1997;
Talagrand et al., 1999) is computed by sorting the corresponding modeled
variable of the ensemble in increasing order and then a rank among the
sorted predicted variables from lowest to highest is given to the
observation. The ensemble members are sorted to define “bins” of the
modeled variable; if the ensemble contains

The rank-histogram score is used to measure the deviation from flatness of a
rank histogram:
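As an illustration only, the ranking procedure described above and a flatness score can be sketched as follows. The normalization in `rank_histogram_score` is a commonly used one, under which a reliable ensemble has an expected score of 1 and a perfectly flat histogram scores 0; it is an assumption here and not necessarily the exact form used in this study. The function names, the array layout and the treatment of ties (strictly-less-than comparison) are also assumptions.

```python
import numpy as np

def rank_histogram(ensemble, obs):
    """Count how often the observation falls in each rank.

    ensemble: array of shape (n_members, n_obs); obs: array of shape (n_obs,).
    Returns n_members + 1 counts (rank 0 = observation below all members).
    """
    ensemble = np.asarray(ensemble, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n_members = ensemble.shape[0]
    # Rank of each observation = number of members strictly below it
    # (tie handling is an assumption of this sketch).
    ranks = np.sum(ensemble < obs[np.newaxis, :], axis=0)
    return np.bincount(ranks, minlength=n_members + 1)

def rank_histogram_score(counts):
    """Deviation from flatness, using an assumed normalization."""
    counts = np.asarray(counts, dtype=float)
    n_bins = counts.size          # N + 1 ranks
    n_members = n_bins - 1        # N members
    n_obs = counts.sum()          # M observations
    expected = n_obs / n_bins     # count per rank for a flat histogram
    return n_bins / (n_members * n_obs) * np.sum((counts - expected) ** 2)
```

A U-shaped histogram concentrates counts in the first and last ranks, which inflates the squared deviations and therefore the score.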

Atmospheric inverse flux estimates are highly sensitive to biases. The bias,
or the mean of the model–data mismatches, was used to assist the selection
of the calibrated sub-ensemble. We identify a sub-ensemble that has minimal
bias:

Different statistical tools were used to evaluate both the large (45-member) ensemble and calibrated ensemble; these statistics include Taylor diagrams, spread–skill relationship and ensemble root mean square deviation (RMSD). These statistical analyses will be used to describe the performance of each member (standard deviations and correlations), ensemble spread (root mean square deviation) and error structures in space (error covariance), which will allow us to evaluate all the important aspects of an ensemble.

We use Taylor diagrams to describe the performance of each of the models of the large ensemble (Taylor, 2001). The Taylor diagram relies on three nondimensional statistics: the ratio of the variance (model variance normalized by the observed variance), the correlation coefficient, and the normalized centered root mean square (CRMS) difference (Taylor, 2001). The ratio of the variance or normalized standard deviation indicates the difference in amplitude between the model and the observation. The correlation coefficient measures the similarity in the temporal variation between the model and the observation. The CRMS is normalized by the observed standard deviation and quantifies the difference between the modeled and observed variations once their means are removed.
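The three Taylor-diagram statistics can be computed for a single model configuration as in the sketch below. It assumes paired model and observation series at one site and uses the normalized standard deviation (rather than the variance ratio) as the amplitude statistic; the function name is hypothetical.

```python
import numpy as np

def taylor_statistics(model, obs):
    """Return (normalized std, correlation, normalized centered RMS difference)."""
    model = np.asarray(model, dtype=float)
    obs = np.asarray(obs, dtype=float)
    sigma_obs = obs.std()
    norm_std = model.std() / sigma_obs
    corr = np.corrcoef(model, obs)[0, 1]
    # Centered RMS difference: remove each series' mean before differencing.
    anomalies = (model - model.mean()) - (obs - obs.mean())
    crms = np.sqrt(np.mean(anomalies ** 2))
    return norm_std, corr, crms / sigma_obs
```

The three values are linked by the law of cosines that underlies the diagram, so any two determine the third.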

To verify that the ensemble captures the variability in the model performance across space and time, we computed the relationship between the spread of the ensemble and the skill of the ensemble over the entire dataset (i.e., spread–skill relationship). The linear fit between the two parameters measures the correlation between the ensemble spread and the ensemble mean error or skill (Whitaker and Lough, 1998). The ensemble spread is calculated by computing the standard deviation of the ensemble and the mean error by computing the absolute difference between the ensemble mean and the observations. Ideally, as the ensemble skill improves (the mean error gets smaller), the ensemble spread becomes smaller, and vice versa. Compared to the rank histograms, spread–skill diagrams represent the ability of the ensemble to represent the errors in time and space.
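A minimal sketch of the spread–skill calculation, assuming the ensemble is stored as an array of shape (members, samples) and that the population standard deviation is used for the spread (the text does not specify the normalization). The function name and return layout are hypothetical.

```python
import numpy as np

def spread_skill(ensemble, obs):
    """Spread, mean absolute error, and their linear relationship."""
    ensemble = np.asarray(ensemble, dtype=float)
    obs = np.asarray(obs, dtype=float)
    # Spread: standard deviation across members for each sample.
    spread = ensemble.std(axis=0)
    # Skill: absolute difference between the ensemble mean and the observation.
    error = np.abs(ensemble.mean(axis=0) - obs)
    # Linear fit error = slope * spread + intercept, plus the correlation.
    slope, intercept = np.polyfit(spread, error, 1)
    corr = np.corrcoef(spread, error)[0, 1]
    return spread, error, slope, intercept, corr
```

A correlation near 1 would indicate that the spread tracks the day-to-day error; points lying well above the one-to-one line indicate an underdispersive ensemble.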

The spread of the ensemble is evaluated in time using the RMSD. The RMSD does not consider the observations: it is the square root of the average squared difference between each model configuration and the ensemble mean. Additionally, we use the mean and standard deviation of the error (model–data mismatch) to evaluate the performance of each of the members selected for the calibrated ensembles.
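For concreteness, the two diagnostics in this paragraph can be written as below. This is a sketch under the assumption that the ensemble array has shape (members, times); the function names are hypothetical.

```python
import numpy as np

def ensemble_rmsd(ensemble):
    """RMSD of the members about the ensemble mean, for each time."""
    ensemble = np.asarray(ensemble, dtype=float)
    mean = ensemble.mean(axis=0)
    return np.sqrt(np.mean((ensemble - mean) ** 2, axis=0))

def residual_stats(member, obs):
    """Mean and standard deviation of the model-data mismatch for one member."""
    residual = np.asarray(member, dtype=float) - np.asarray(obs, dtype=float)
    return residual.mean(), residual.std()
```

The RMSD uses only the simulations, whereas the residual statistics require the observations.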

Transport model errors in atmospheric inversions are described in the
observation error covariance matrix and hence in

In this study, we want to test the ability to reduce the ensemble from 45 members to an ensemble with a smaller number of members that is still capable of representing the transport uncertainties and does not include members with redundant information. The ideal number of ensemble members could have been decided by performing the calibration for every ensemble size smaller than 45 members. However, we decided to use an objective approach to select the total number of members of the sub-ensemble. Therefore, we use the Garaud and Mallet (2011) technique to define the size of the calibrated sub-ensemble that each optimization technique will generate. The size of the sub-ensemble was determined by dividing the total number of observations by the maximum frequency in the large ensemble (45-member) rank histogram. We are going to generate sub-ensembles of three different sizes (number of members) to evaluate the impact that an ensemble size has on the representation of atmospheric transport uncertainties. Each of the ensembles will be calibrated for the period of 18 June to 21 July 2008.
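The size rule stated above amounts to a single division. The sketch below applies it to a made-up rank histogram; the counts are purely illustrative and are not values from this study, and rounding to the nearest integer is an assumption.

```python
def sub_ensemble_size(rank_counts):
    """Total number of observations divided by the largest rank frequency."""
    total_obs = sum(rank_counts)
    max_freq = max(rank_counts)
    if max_freq == 0:
        raise ValueError("rank histogram is empty")
    # Rounding to the nearest integer is an assumption of this sketch.
    return round(total_obs / max_freq)

# Hypothetical histogram: 120 observations, 30 of them in the first rank.
example_counts = [30, 10, 10, 10, 10, 10, 10, 10, 10, 10]
example_size = sub_ensemble_size(example_counts)
```

The more the observations pile up in a single rank, the smaller the sub-ensemble size that this rule returns.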

Two optimization methods, simulated annealing (SA) and a genetic algorithm
(GA), are used to select a sub-ensemble that minimizes the rank-histogram
score (

In this study, SA and GA techniques will randomly search for the different
combinations of members and compute the rank-histogram score. Both
techniques generate a sub-ensemble (

SA is a general probabilistic local search algorithm, described by Kirkpatrick et al. (1983) and Černý et al. (1985) as an optimization method inspired by the process of annealing in metal work. Based on the Monte Carlo iteration solving method, SA searches for the global minimum of a cost function using an acceptance rule that gives the algorithm the ability to escape from multiple local minima (see Fig. A2 in the Appendix). In this case, the optimal solution is a sub-ensemble with a rank-histogram score close to 1.
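A generic simulated-annealing search over fixed-size subsets can be sketched as follows. This is not the implementation used in this study: the geometric cooling schedule, the single-member swap used to generate neighbor states, the Metropolis acceptance rule and all parameter defaults are assumptions, and `cost` stands for any user-supplied function of a candidate sub-ensemble (for example, a rank-histogram score).

```python
import math
import random

def simulated_annealing(n_members, size, cost, n_iter=5000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Search for the subset of `size` member indices that minimizes `cost`."""
    if not 0 < size < n_members:
        raise ValueError("size must be between 1 and n_members - 1")
    rng = random.Random(seed)
    current = set(rng.sample(range(n_members), size))
    current_cost = cost(current)
    best, best_cost = set(current), current_cost
    for k in range(n_iter):
        # Geometric cooling from t_start to t_end (assumed schedule).
        temp = t_start * (t_end / t_start) ** (k / max(n_iter - 1, 1))
        # Neighbor state: swap one selected member for an unselected one.
        out = rng.choice(sorted(current))
        into = rng.choice(sorted(set(range(n_members)) - current))
        candidate = (current - {out}) | {into}
        cand_cost = cost(candidate)
        delta = cand_cost - current_cost
        # Metropolis rule: always accept improvements; accept a worse
        # state with a probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
            if current_cost < best_cost:
                best, best_cost = set(current), current_cost
    return sorted(best), best_cost
```

Accepting occasional uphill moves at high temperature is what lets the search leave a local minimum; as the temperature falls, the search becomes effectively greedy.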

The SA starts with a randomly selected sub-ensemble. The current state (i.e.,
initial random sub-ensemble) has a lot of neighbor states (i.e., other
randomly generated sub-ensembles) in which a unit (i.e., model) is changed,
removed or replaced. Let

When the algorithm reaches the predefined number of iterations, we collect
only the accepted sub-ensemble

GA is a stochastic optimization method that mimics the
process of biological evolution, with the selection, crossover and mutation
of a population (Fraser and Burnell, 1970; Crosby, 1973; Holland, 1975). Let

This process is repeated until it reaches the specified number of iterations. This algorithm will provide at the end a population of individuals with a better rank-histogram score than the initial population. Out of all those individuals, we choose the sub-ensemble with the best score for the three variables (i.e., wind speed, wind direction and PBLH) and with a smaller bias than the large ensemble.
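The selection, crossover and mutation cycle can be illustrated with the sketch below. It is a simplified stand-in rather than the algorithm used in this study: keeping the better half of the population as parents, drawing a child from the union of two parents, the mutation rule and all parameter defaults are assumptions, and `cost` is any user-supplied function of a candidate sub-ensemble.

```python
import random

def genetic_algorithm(n_members, size, cost, pop_size=30, n_gen=100,
                      mutation_rate=0.2, seed=0):
    """Evolve a population of fixed-size subsets toward a lower cost."""
    rng = random.Random(seed)
    all_members = list(range(n_members))
    population = [frozenset(rng.sample(all_members, size))
                  for _ in range(pop_size)]
    for _ in range(n_gen):
        # Selection: keep the better half of the population as parents.
        population.sort(key=cost)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw the child's members from both parents.
            child = set(rng.sample(sorted(a | b), size))
            # Mutation: occasionally replace one member with a random one.
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                child.add(rng.choice(sorted(set(all_members) - child)))
            children.append(frozenset(child))
        population = parents + children
    best = min(population, key=cost)
    return sorted(best), cost(best)
```

Because the parents are carried over unchanged, the best individual in the population never gets worse from one generation to the next.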

Various inputs are required to guide the selection algorithms. For example,
we typically need to choose the initial and final temperature (

The selection process is performed in three distinct steps to ensure that the final calibrated ensembles will be the optimal combinations of model configurations (Fig. 2). First, the flatness of the rank histograms will control the acceptance of the calibrated sub-ensembles by the selection algorithms (see Fig. A1 in the Appendix). The flatness is defined by Eq. (1) for the single-variable calibration and Eq. (3) for the calibration of the three variables simultaneously. The algorithm selects multiple sub-ensembles with a rank-histogram score smaller than 6 for each individual meteorological variable, or smaller than the original ensemble score if higher than 6 (see Fig. 2 and Table 2). In general, the lowest scores are found for PBLH and the highest for wind speed, as shown in Fig. 3. As a second step, sub-ensembles accepted by SA and GA algorithms with a bias larger than the bias of the full ensemble are filtered out. This step is critical to avoid the selection of biased ensembles as discussed by Hamill et al. (2001). Finally, the remaining calibrated ensembles are compared among SA and GA techniques to identify if both algorithms provide a common solution. If multiple common solutions were identified, the final sub-ensemble was determined by the solution with the smallest score and bias. However, if no common solution was found by both techniques, the final sub-ensemble corresponds to the smallest score among the different solutions that share > 50 % of the same model configurations.
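The three steps can be expressed as a pair of small filters, sketched below. The reading of the score threshold (a limit of 6, or the full-ensemble score where that is higher), the comparison of biases by magnitude, the definition of the shared fraction and the tuple layout `(members, score, bias)` are all assumptions made for illustration.

```python
def passes_filters(scores, biases, full_scores, full_biases, threshold=6.0):
    """Steps 1 and 2: score limit and bias check for every variable."""
    for var, full_score in full_scores.items():
        # Assumed reading: the limit is 6, or the full-ensemble score if higher.
        limit = max(threshold, full_score)
        if scores[var] >= limit:
            return False
        # Reject sub-ensembles more biased (in magnitude) than the full ensemble.
        if abs(biases[var]) > abs(full_biases[var]):
            return False
    return True

def shared_fraction(a, b):
    """Fraction of members that two sub-ensembles have in common."""
    a, b = set(a), set(b)
    return len(a & b) / max(len(a), len(b))

def select_final(sa_kept, ga_kept):
    """Step 3: prefer a solution found by both algorithms.

    Each candidate is a tuple (members, score, abs_bias).
    """
    common = [s for s in sa_kept for g in ga_kept if set(s[0]) == set(g[0])]
    if common:
        return min(common, key=lambda c: (c[1], c[2]))
    # Otherwise take the lowest score among solutions sharing > 50 % of members.
    similar = [c for s in sa_kept for g in ga_kept
               if shared_fraction(s[0], g[0]) > 0.5 for c in (s, g)]
    return min(similar, key=lambda c: c[1]) if similar else None
```

Returning `None` when no pair of solutions is similar enough leaves that case to the user, since the text does not describe it.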

Rank-histogram score (

Diagram of the process of selection of reduced-sized ensembles
explained in Sect. 2.7. In this diagram, in the sub-ensemble, we show our two main
thresholds after running each algorithm; the sub-ensemble score has to be smaller
than the full ensemble (

Box plot of the rank-histogram scores of the different sub-ensembles of 10

In this section, we evaluate the performance of the large ensemble. Our goal is to test the ensemble skill (ability of the models to match the observations) and the spread (variability across model simulations to represent the uncertainty). We will evaluate the skill and the spread for PBLH, PBL wind speed and PBL wind direction across the region of study using afternoon (00:00 UTC) rawinsonde observations.

We evaluate the performance of the different models of the 45-member ensemble by computing the normalized standard deviation, normalized centered root mean square difference and correlation coefficient for wind speed (Fig. 4a), wind direction (Fig. 4b) and PBLH (Fig. 4c) (Taylor, 2001). The majority of the model configurations produce wind speeds and directions with higher standard deviations (more variability) than the observations, whereas the simulations over- and underestimate PBLH variability depending on the model configuration. The model–data correlations with wind speed and wind direction are between 0.4 and 0.7, whereas the PBLH shows a smaller correlation, between 0.3 and 0.6. The range of modeled PBL heights will provide a wide spectrum of alternatives to select the optimal calibrated sub-ensemble. However, wind speed and wind direction do not show much difference among the different models. This limited spread potentially reduces the selection of the model configurations to produce a sub-ensemble that matches the observed variability.

Taylor diagram comparing the 00:00 UTC rawinsonde observations
(300 m wind speed

We illustrate the ensemble spread and how well this ensemble encompasses the observations using the time series of the simulated and observed meteorological variables. Figure 5 shows the time series of the ensemble spread for wind speed, wind direction and PBLH at Green Bay (GRB; Fig. 5a, c, e) and Topeka (TOP; Fig. 5b, d, f) sites. The time series show qualitatively that simulated wind speed (Fig. 5a–b) and wind direction (Fig. 5c–d) have a smaller spread compared to PBLH (Fig. 5e–f). Figure 5 shows how the ensemble can have a small spread and still encompass the observations (e.g., DOY 183; Fig. 5c), and have a large spread and not encompass the observation (e.g., DOY 174; Fig. 5e). These time series suggest that the ensemble may struggle to encompass the observed wind speed and wind direction more than the PBLH.

Time series of the simulated and observed for 300 m wind
speed

Figure 6 shows the rank histograms of the 45-member ensemble for each of the
meteorological variables that we use to calibrate the ensemble (i.e., wind
speed, wind direction and PBLH). In these rank histograms, we include all 14
rawinsonde sites. All the rank histograms have a U shape. U-shaped
histograms mean the ensemble is underdispersive; that is, the model members
are too often all greater than or less than the observed atmospheric values
(e.g., DOY 178–181; Fig. 5b). Each rank histogram has the first rank as the
highest frequency, indicating that observations are most frequently below
the envelope of the ensemble (e.g., DOY 178–180; Fig. 5b). The rank-histogram
score for each of the variables is greater than 1, confirming that we do
not have optimal spread in our ensemble. Table 2 shows that both wind speed
and wind direction have a higher rank-histogram score (i.e.,

Rank histogram of the 45-member ensemble for wind
speed

Spread–skill relationship for

Figure 7 shows the spread–skill relationship, another method that we use to examine the representation of errors of the ensemble. Wind direction (Fig. 7b) shows a higher correlation between the spread and the skill compared to the PBLH (Fig. 7c) and the wind speed (Fig. 7a). Therefore, the ensemble has a wider spread when the model–data differences are larger. The PBLH and wind speed show consistently poorer skill (a large mean absolute error) compared to their spread. This supports the conclusion that the large ensemble is underdispersive for these variables. None of these variables show a correlation equal to 1; this implies that our ensemble spread does not match exactly the atmospheric transport errors on a day-to-day basis. This feature is common among ensemble prediction systems (Wilks et al., 2011) and should not impair the ability to identify the optimal reduced-size ensembles.

In this section, we show the results of the calibrated ensembles generated
with both SA and GA. Each calibration was performed for three different
sub-ensemble sizes; the size of the ensembles is determined using the
technique explained in Sect. 2.4. To compute the size of the sub-ensemble,
we use the maximum frequency of the rank histogram using the large ensemble
(Fig. 6). In this case, the maximum frequency is the left bar (

Table 3 shows that both techniques (i.e., SA and GA) were able to find
similar combinations of model configurations (i.e., an ensemble that shares
more than half of the members) when each meteorological variable was used
separately. The configurations chosen for each sub-ensemble vary
significantly across the different variables, with the exception of the
10-member ensemble calibrated using wind speed and wind direction. The
majority of the ensembles include model configuration 14. This model
configuration, as shown in Díaz-Isaac et al. (2018), introduces large
errors for both wind speed and wind direction, and is selected to allow for
sufficient spread of these variables in the sub-ensembles. The final scores
of the calibrated ensembles for each variable show that finding a calibrated
sub-ensemble that reaches a score of 1 is not possible for wind speed and
wind direction. A sub-ensemble with a score less than or equal to 1 can be
found for PBLH. Figure 8 shows the rank histograms of the different
calibrated ensembles (i.e., 10, 8 and 5 members) for each meteorological
variable shown in Table 3. The calibrated ensembles of PBLH (Fig. 8c, f,
i) are nearly flat for all ensemble sizes, whereas the 10- and 8-member
sub-ensembles keep a slight U shape for wind speed and wind direction but
are significantly flatter than the original ensemble. The ratio between the
expected

Calibrated ensembles generated by both SA and GA and their rank-histogram scores and bias for each variable.

Rank histograms of the calibrated ensembles found for wind
speed

Table 4 shows the sub-ensembles selected by SA. Each of the sub-ensembles has two simulations in common (i.e., 17 and 33), implying that these models are crucial to build an ensemble that best represents the transport errors for the three variables. Figure 9 shows the rank histograms of the sub-ensembles shown in Table 4. These rank histograms show that we were able to flatten the histogram relative to the 45-member ensemble for all three meteorological variables. Similar to the individual variable calibration, the rank histograms for wind speed (Fig. 9a, d) and wind direction (Fig. 9b, e) still show a U shape which is minimized for the smallest (i.e., five-member) sub-ensemble (Fig. 9g–h). The rank histograms are flatter for the PBLH (Fig. 9c, f, i) and the histogram score is closer to 1 (Table 4) compared to wind speed and wind direction. The rank-histogram scores for all variables are greater than those for one-variable optimization (see Table 4). The high rank-histogram scores are associated with the equal weight given to the three variables for this simultaneous calibration, where wind speed controlled the calibration process. For the calibration of the three variables together, we were not able to produce an ensemble for wind speed with a score smaller than 4; this ends up limiting the selection of the calibrated ensemble for the rest of the variables (see Fig. A4 in the Appendix). In addition, all these calibrated sub-ensembles have biases smaller in magnitude than the 45-member ensemble. Both wind speed and PBLH retain an overall positive bias, and wind direction a negative bias. The standard deviations of these three calibrated ensembles are larger than those of the large ensemble, consistent with the effort to increase the ensemble spread.

Ensemble members, rank-histogram scores (

Rank histograms of wind speed

Using SA and GA techniques and the selection criteria detailed in Sect. 2.7 (i.e., low mean error of the entire ensemble), we defined an optimal five-member sub-ensemble (the optimal solution using both techniques) and nearly identical combinations of members for 10- and 8-member sub-ensembles, with only two model configurations not being shared by both algorithms. We also find that configuration 14 remains important for the multi-variable calibrated ensembles, as it was for the single-variable calibrated ensembles.

Both optimization techniques were able to generate sub-ensembles that reduce the U shape of the rank histograms, while significantly decreasing the number of members in the ensemble. A flatter histogram indicates that the ensemble is more reliable (unbiased) and has a more appropriate (greater) spread. The correlation between spread and skill for the wind direction increased, while wind speed and PBLH remain similar. Therefore, we conclude that the calibrated sub-ensembles are equivalent to or even better than the full ensemble to represent the daily model errors.

Time series of simulated and observed 300 m wind
speed

Figure 10 shows the time series of the different calibrated ensembles generated by the SA algorithm at the TOP site. In general, there are no major differences among 5- (Fig. 10a, d, g), 8- (Fig. 10b, e, h) and 10-member (Fig. 10c, f, i) ensembles. Figure 10 shows how the calibration can increase the spread of the ensemble to the extent of encompassing the observations (e.g., DOY 179; Fig. 10b–c) compared to the full ensemble (Fig. 5b). The ensemble spread was reduced after calibration at a few specific points in space and time.

Insight into the physics parameterizations can be gained by evaluating the calibrated ensembles. The LSM, PBL, CP and MP schemes, and reanalysis choice vary across all of the sub-ensemble members; no single parameterization is retained for all members in any of these categories. However, we also find that the calibrated ensembles rely upon certain physics parameterizations more than others. Figure 11 shows that most of the simulations in the calibrated ensemble use the RUC and thermal diffusion (T-D) LSMs in preference to the Noah LSM. In addition, more simulations use the MYJ PBL scheme than the other PBL schemes. The physics parameterizations shown with a higher percentage in Fig. 11 appear to contribute more to the spread of the ensemble than the other parameterizations.

Frequency with which the physics schemes are used for the SA

Residual (model–data mismatch) mean and standard deviation of
individual members for wind speed

We next explore the characteristics of the individual ensemble members that are retained in an effort to understand what member characteristics are important to increase the spread of the ensemble. Figure 12 shows the mean and standard deviation of the residuals for each simulation included in the five-member ensemble of SA and GA. Ensembles appear to need at least one member with a larger standard deviation to improve the spread for wind speed and wind direction (see member 23; Fig. 12a–b). Additionally, a member that has a large PBLH bias (see member 16; Fig. 12c) appears to be selected, highlighting the need for end members among the model configurations in order to reproduce the observed variability in PBLH. We note here that model configuration 14 was not selected when calibrating three variables together.

The calibrated ensembles found in this study were chosen based on the
meteorological variables and not on the

Figure 13 shows the spread of daily daytime average

Spread (model–ensemble mean), RMSE (model–data) and ratio (spread

Ensemble mean and spread (i.e., RMSD) of the DDA at approximately 100 m

Sum of the

Spatial correlation of

The calibration of the multi-physics/multi-analysis ensemble using SA and GA optimization techniques generated 10-, 8- and 5-member ensembles with a better representation of the error statistics of the transport model than the initial 45-member ensemble. One of our goals was to find sub-ensembles that fulfil the criteria of Sect. 2.7, independent of the selection algorithm and for multiple meteorological variables. Wind speed and wind direction statistics only improve by a modest amount in the calibrated ensembles as compared to the 45-member ensemble, while PBLH statistics, namely the flatness of the rank histogram, show a significant improvement in the calibrated ensembles. The variance in the calibrated ensembles increased relative to the 45-member ensemble but the potential for improvement was limited by the spread in the initial ensemble. Stochastic perturbations (e.g., Berner et al., 2009) could increase the spread of the initial ensemble, which, combined with the suite of model configurations, could better represent the model errors. Here, we limited the 45-member ensemble to mass-conserved, continuous flow (i.e., unperturbed) members that can be used in a regional inversion. Future work should address the problem of using an underdispersive ensemble before the calibration of the ensemble.

We first attempted to calibrate the ensemble for each meteorological variable
(i.e., wind speed, wind direction and PBLH). Table 3 shows that the different
sub-ensembles were able to follow the criteria presented in Sect. 2.7, but
the calibration of the single-variable ensembles did not allow us to find a
unique sub-ensemble that can be used to represent the errors of the three
variables. Therefore, the joint optimization of the three variables was
required to identify an ensemble that best represents model errors across the
three variables. By minimizing the sum of the squared rank-histogram scores
of the three variables, the selection algorithm found common solutions at the
expense of less satisfactory rank-histogram scores than were obtained for
single-variable ensembles (see Table 4). We
assumed that each variable was equally important to the problem, an
assumption that has not been rigorously evaluated. Future work on the
relative importance of meteorological variables on

The calibrated ensembles have rank-histogram scores closer to 1 (Table 4), that is, flatter rank histograms (Fig. 9), compared to the 45-member ensemble (Table 2 and Fig. 6). The sub-ensembles do have a greater variance than the large ensemble (i.e., improved reliability) (Fig. 14). However, the spread–skill relationship (i.e., resolution) of the calibrated ensembles does not show any major improvement compared to the 45-member ensemble, implying that the spread of the ensemble does not represent the day-to-day transport errors well. While the rank histogram suggests that the different calibrated ensembles have enough spread, the spread–skill relationship indicates that our ensemble does not systematically encompass the observations. The disagreement between the rank histogram and the spread–skill relationship can be associated with the metric used for the calibration (i.e., rank histogram) and the biases included in the calibrated ensemble. Using the score of the rank histogram alone may not be sufficient to measure the reliability of the ensemble (Hamill, 2001); therefore, future down-selection studies should incorporate the resolution as part of the calibration process (skill score optimization). The biases in the model are a complex problem because there are many sources of systematic errors within an atmospheric model (e.g., physical parameterizations and meteorological forcing). Future studies should consider data assimilation or improvement of the physics parameterizations to reduce or remove these systematic errors. To improve the representation of daily model errors, additional metrics should be introduced and the initial ensemble should offer a sufficient spread, possibly with additional physics parameterizations, additional random perturbations, or modifications of the error distribution of the ensemble (Roulston and Smith, 2003).

Rank histograms, as explained in Sect. 2.3.1, evaluate the ensemble by
ranking individual observations in a relative sense. The ensembles
calibrated using the rank histograms may be representing the variances over
the region correctly but not the spatial and temporal structures of the
errors (Hamill, 2001). These parameters are critical to inform regional
inversions of correlations in model errors, directly impacting flux
corrections (Lauvaux et al., 2009). In this study, the calibrated ensembles
show an improvement in the meteorological variances and an increase in the

We applied a calibration (or down-selection) process to a
multi-physics/multi-analysis ensemble of 45 members. In this calibration
process, two optimization techniques were used to extract a subset of
members from the initial ensemble to improve the representation of transport
model uncertainties in

The calibration techniques improved the spread (flatness of the rank
histogram) of the ensembles and slightly improved the biases, which were
already small in the larger ensemble, but the calibration did not improve
daily atmospheric transport errors as shown by the spread–skill
relationship. We assessed how the calibrated ensemble errors propagate into
the

The code is accessible upon request by contacting the corresponding author (lzd120@psu.edu).

Meteorological data were obtained from the University of
Wyoming's online data archive
(

LDI performed the model simulations, calibration and the model–data analysis. The calibration technique was coded by MB, LDI and TL based on the work of Garaud and Mallet (2011). TL, MB and KJD provided guidance with the calibration and model–data analysis. All authors contributed to the design of the study and the preparation of the paper.

The authors declare that they have no conflict of interest.

This research was supported by NASA's Terrestrial Ecosystem and Carbon Cycle
Program (grant NNX14AJ17G), NASA's Earth System Science Pathfinder Program
Office, Earth Venture Suborbital Program (grant NNX15AG76), NASA Carbon
Monitoring System (grant NNX13AP34G) and an Alfred P. Sloan Graduate
Fellowship. We thank Natasha Miles, Chris E. Forest and Andrew Carleton for
fruitful discussions. Meteorological data used in this work were provided by
University of Wyoming's online data archive
(

This paper was edited by Marc von Hobe and reviewed by Lori Bruhwiler and Ian Enting.