Air Quality Predictions with an Analog Ensemble

The authors demonstrate how the analog ensemble (AnEn) can efficiently generate deterministic and probabilistic forecasts of air quality. AnEn estimates the probability of future observations of a predictand based on a current deterministic numerical weather prediction and an archive of prior analog predictions paired with prior observations. The method avoids the complexity and real-time computational expense of dynamical (i.e., model-based) ensembles. The authors apply AnEn to observations from the Environmental Protection Agency’s (EPA’s) AIRNow network and to forecasts from the Community Multiscale Air Quality (CMAQ) model. Compared to raw forecasts from CMAQ, deterministic forecasts of O3 and PM2.5 based on AnEn’s mean have lower errors, both systematic and random, and are better correlated with observations. Probabilistic forecasts from AnEn are statistically consistent, reliable, and sharp, and they quantify the uncertainty of the underlying prediction.
* The National Center for Atmospheric Research is sponsored by the National Science Foundation.
Atmos. Chem. Phys. Discuss., https://doi.org/10.5194/acp-2017-1214. Manuscript under review for journal Atmos. Chem. Phys. Discussion started: 2 May 2018. © Author(s) 2018. CC BY 4.0 License.


Introduction
Every year poor air quality kills millions of people worldwide (Forouzanfar et al., 2015) and in the U.S. alone costs society tens to hundreds of billions of dollars (Muller and Mendelsohn, 2007). Air-quality forecasts are one resource that decision-makers can use to reduce the many threats that poor air quality poses.
The lower the uncertainty of the information on which decisions are based, the better. Unfortunately, uncertainty cannot be completely eliminated from air-quality forecasting, but there are effective ways to treat the inevitable uncertainty and to reduce its consequences. One way is through probabilistic ensemble prediction. In contrast to the deterministic approach of using a single forecast from a single model, an ensemble comprises multiple, meaningfully different forecasts that are valid at the same future time and location, from which probabilistic information can be obtained.
Ensembles are beneficial in many ways. The probabilistic guidance they provide is potentially much more useful for decision-makers than a single forecast could ever be (Buizza, 2008; Palmer, 2002). An ensemble's mean forecast tends to be (but is not always) more skillful than any individual member's prediction (Delle Monache et al., 2006a, 2006b, 2008; Delle Monache and Stull, 2003; Djalalova et al., 2010; Du et al., 1997; Ebert, 2001; Galmarini et al., 2001; Galmarini et al., 2004; Kioutsioukis and Galmarini, 2014; Leith, 1974; McKeen et al., 2005; Potempski et al., 2008; Potempski and Galmarini, 2009; Solazzo et al., 2012; Toth and Kalnay, 1997; Zhu et al., 2012). Calculating the mean filters out some of the unpredictable elements of the physical processes being simulated. Another benefit is that the approximate uncertainty in a mean forecast can be inferred from the spread among ensemble members (Kalnay, 2003) if the ensemble is calibrated, although that inference has to be made carefully and is not valid in every case (Barker, 1991; Hopson, 2014; Murphy, 1988). Ensembles also produce reliable and well resolved probabilistic forecasts that can be further improved through calibration and other methods of post-processing imperfect numerical predictions. A reliable ensemble is one that over many cases predicts conditions to occur with the same frequency as they actually occur in nature. A well resolved ensemble is one that is specific from case to case about what conditions will and will not occur.
In air-quality forecasting in particular, probabilistic approaches have been recommended as, and have been demonstrated to be, effective at dealing with the many sources of uncertainty (e.g., Bei et al., 2010; Carmichael et al., 2008; Dabberdt et al., 2004; Delle Monache et al., 2006a, 2006b, 2008; Delle Monache and Stull, 2003; Garaud and Mallet, 2010; Kioutsioukis et al., 2016; Marécal et al., 2015; Mallet, 2010; Mallet et al., 2013; Mallet and Sportisse, 2006a, 2006b; McKeen et al., 2005; Pagowski et al., 2005; Zhang et al., 2007; Zhang et al., 2012). Uncertainties stem from meteorological initial and boundary conditions; errors in observations assimilated into a model; truncations and approximations in a model's numerical schemes; specification of emissions, which are often neither well known nor well characterized in models; and physical or chemical processes that are simplified or omitted entirely (Delle Monache and Stull, 2003).
Ensemble prediction takes different forms. One of the simplest is a lagged ensemble (Dalcher et al., 1988; Delle Monache et al., 2006a; Ebisuzaki and Kalnay, 1991; Hoffman and Kalnay, 1983; Lu et al., 2007; Mittermaier, 2007). In a lagged ensemble, sequential forecasts from a deterministic system are grouped together, each with a common valid time but with different lead times (e.g., an 18-h forecast initialized at 0000 UTC and valid at 1800 UTC, a 15-h forecast initialized at 0300 UTC and valid at 1800 UTC, and so on). The effectiveness of this approach depends on how frequently the deterministic forecasts are updated (Lu et al., 2007; Mittermaier, 2007).
Lagged ensembles target uncertainty in the initial conditions. So do other, similar, single-model ensembles in which different initial conditions, boundary conditions, and/or perturbed observations are used to generate diversity even though the model code (including physical parameterizations) is fixed (e.g., Molteni et al., 1996; Zhang et al., 2007). In modelling air quality, perturbations can be applied not just to meteorological observations, but also to emissions. Considering both types of perturbations is important for addressing the nonlinearity of the problem (Delle Monache et al., 2006a, 2006c).
More complex and computationally expensive, and often more effective, are ensembles based on multiple models. Multi-model ensembles address uncertainty in the actual physical or chemical processes being simulated, not merely in the initial meteorological or chemical conditions (e.g., Bei et al., 2010; Delle Monache et al., 2008; Delle Monache and Stull, 2003; Djalalova et al., 2010; Garaud and Mallet, 2010; Kioutsioukis et al., 2016; Mallet, 2010; Mallet and Sportisse, 2006a, 2006b; McKeen et al., 2005; Vautard et al., 2012). (For the purposes of this article, we consider what are sometimes called multi-physics ensembles to be just one type of multi-model ensemble.) Multi-model ensembles have proved to be a particularly effective tool for probabilistic operational weather forecasting for quite some time (e.g., Buizza et al., 2005; Hacker et al., 2011; Krishnamurti, 1999). In air-quality forecasting, multi-model ensembles have been successfully applied to forecasts of ground-level ozone (e.g., McKeen et al., 2005; Solazzo et al., 2012; Žabkar et al., 2013), airborne particles (e.g., Djalalova et al., 2010; McKeen et al., 2007), and to both (e.g., Monteiro et al., 2013).
The complexity and expense of multi-model ensembles and the limitations of single-model ensembles are two motivations for developing simpler, more cost-effective, yet still powerful methods for probabilistic prediction, especially hybrid statistical-dynamical methods. The goal of this article is to describe how one such method, an analog ensemble (AnEn), can be applied to air-quality forecasting.
We compare AnEn's forecasts of ground-level ozone (O3) and particles less than 2.5 µm in diameter (PM2.5) to the simpler probabilistic standard of a persistence ensemble (PeEn), and to an operational standard of the industry, the Environmental Protection Agency's (EPA's) Community Multiscale Air Quality (CMAQ) model (Byun and Schere, 2006). The two ensemble methods, CMAQ, and the observations used for this research are described in the following sections.
2 Predictive systems and data

Analog ensemble (AnEn)
AnEn is a hybrid statistical-dynamical method to generate an ensemble for estimating the probability of some future observation of a predictand (e.g., 2-m dew point, geopotential height at 500 hPa, or, in the case of air-quality forecasts, PM2.5) given a current deterministic prediction and an archive of historical, analogous deterministic predictions paired with historical observations at those predictions' valid times. That archive is used to train the AnEn. Typically, the deterministic predictions in the training data and the current deterministic prediction come from the same configuration of the same numerical weather prediction (NWP) model, or nearly so. Hamill and Whitaker (2006) demonstrated the considerable value of using an analog ensemble approach to calibrate an existing ensemble, and Delle Monache et al. (2013) proposed a similar approach to instead generate an ensemble, which forms the basis of our approach.
How an analog is defined, how its quality is assessed, how many analogs to select from the archive, how deep an archive is necessary, etc., can vary from one application of the AnEn to another. Section 3 goes into more detail about our application's sensitivity to some of those choices.
For determining which historical forecasts are sufficiently analogous to the current forecast, we follow Delle Monache et al. (2011, 2013) and Nagarajan et al. (2015) in using the metric developed by Delle Monache et al. (2011):

$$\|F_t, A_{t'}\| = \sum_{i=1}^{N_v} \frac{w_i}{\sigma_{f_i}} \sqrt{\sum_{j=-\tilde{t}}^{\tilde{t}} \left(F_{i,t+j} - A_{i,t'+j}\right)^2},$$

wherein $F_t$ is the current deterministic forecast for a location of interest, valid at some time $t$ in the future; $A_{t'}$ is an analogous forecast from the archive, valid at some time $t'$ in the past, and with the same lead time as the current forecast's; $N_v$ is the number of atmospheric variables, i.e., predictors, used to select analogs; $w_i$ is the weight assigned to each atmospheric variable of index $i$; and $\sigma_{f_i}$ is the standard deviation of historical forecasts of each atmospheric variable of index $i$. The metric is calculated over a range of times from $-\tilde{t}$ to $+\tilde{t}$ centered on the valid time $t$. $F_{i,t+j}$ and $A_{i,t'+j}$ respectively are each current forecast and each analogous historical forecast for atmospheric variable index $i$ within that range of times.
We set $\tilde{t} = 1$. The forecast interval is 0-48 h. The weights $w_i$ for each analog predictor were determined independently for each observing site as explained in section 3.2. From each search of the archived datasets, the best 20 analogs were chosen, which for any given search is 3-4% of the total cases in the archive. How many analogs to choose is a balance between sampling enough of the observed distribution of the predictand variable and ensuring that all analogs are sufficiently similar to the current prediction.
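The analog search described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `select_analogs`, the array layout, the synthetic archive, and the choice to estimate the per-predictor standard deviation over all archived cases and window times are assumptions made only for illustration.

```python
import numpy as np


def select_analogs(current, archive_fcst, archive_obs, weights, n_analogs=20):
    """Rank archived forecasts by the analog metric and return the best matches.

    current      : (n_vars, n_times) current forecast inside the time window
    archive_fcst : (n_cases, n_vars, n_times) historical forecasts, same lead time
    archive_obs  : (n_cases,) observations paired with the historical forecasts
    weights      : (n_vars,) predictor weights
    """
    # Standard deviation of the historical forecasts of each predictor
    # (assumption: pooled over all archived cases and window times).
    sigma = archive_fcst.std(axis=(0, 2))
    # Inner sum over the time window, then square root, per case and predictor.
    per_var = np.sqrt(((archive_fcst - current[None, :, :]) ** 2).sum(axis=2))
    # Outer weighted sum over predictors gives one distance per archived case.
    distance = (per_var * (weights / sigma)).sum(axis=1)
    best = np.argsort(distance)[:n_analogs]
    # The ensemble members are the observations paired with the best analogs.
    return archive_obs[best], distance[best]


# Synthetic illustration: 500 archived cases, 5 predictors, 3-step window.
rng = np.random.default_rng(0)
archive_fcst = rng.normal(size=(500, 5, 3))
archive_obs = rng.normal(size=500)
current = archive_fcst[7].copy()  # an exact match exists, for demonstration only
weights = np.array([0.1, 0.1, 0.2, 0.2, 0.4])

members, dists = select_analogs(current, archive_fcst, archive_obs, weights)
print("ensemble mean:", members.mean())
```

With a window half-width of 1, each forecast contributes three time steps, which is why the synthetic arrays have a last dimension of 3. The ensemble mean of the returned members would serve as the deterministic forecast, and the members themselves as the probabilistic forecast.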

Persistence ensemble
Like Alessandrini et al. (2015), we use a persistence ensemble (PeEn) as a baseline method for probabilistic prediction, then demonstrate AnEn's improvement on that baseline. For each forecast lead time, PeEn is based on the most recent 20 observations of O3 or PM2.5 at the same hour of day as the forecast valid time. Other PeEn configurations were tested but did not perform as well as this one (not shown). They include a PeEn formed from the most recent observations collected over the previous seven days within a three-hour window centered on the same hour of the day, and a configuration that uses observations from the previous four days within a five-hour window. The PeEn can be skillful when air-quality conditions persist for several days, or when conditions fluctuate with the same repeating diurnal pattern. Rapidly changing patterns of O3 or PM2.5 are challenging for the PeEn.
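A minimal sketch of the 20-member PeEn configuration follows. The function name `persistence_ensemble`, the hourly indexing scheme, and the rule for skipping missing values are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np


def persistence_ensemble(hourly_obs, issue_index, valid_index, n_members=20):
    """Collect the most recent observations at the same hour of day as the valid time.

    hourly_obs  : 1-D array of hourly observations (NaN where missing)
    issue_index : index of the last observation available when the forecast is issued
    valid_index : index (possibly beyond the end of the array) of the valid time
    """
    members = []
    idx = valid_index - 24  # step back one day at a time, same hour of day
    while idx >= 0 and len(members) < n_members:
        # Use only observations already available at issue time and not missing.
        if idx <= issue_index and not np.isnan(hourly_obs[idx]):
            members.append(hourly_obs[idx])
        idx -= 24
    return np.array(members)


# Synthetic record: 30 days whose value equals the hour of day (0..23).
hourly_obs = (np.arange(24 * 30) % 24).astype(float)
issue_index = len(hourly_obs) - 1   # last observation is hour 23 of the final day
valid_index = issue_index + 13      # 13-h lead time, i.e., hour 12 of the next day

members = persistence_ensemble(hourly_obs, issue_index, valid_index)
print(len(members), members.min(), members.max())
```

Because the synthetic record repeats the same diurnal pattern every day, all members agree here; with rapidly changing conditions the members would be a poor sample of the coming observation, as noted above.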

CMAQ
CMAQ (Byun and Schere, 2006), with the Chemistry-Transport Model (CTM) it comprises, is an industry standard in air-quality forecasting. It is a modular, Eulerian, Cartesian modeling system for simulating on regional scales the emission, production, advection, diffusion, chemical transformation, and removal of atmospheric pollutants. As input into AnEn's algorithms, we use CMAQ's daily forecasts and ground-

Observations
The source of observations is the Environmental Protection Agency's (EPA's) AIRNow network (EPA, 2017) in the conterminous U.S. and southern Canada (Figure 1). Hourly concentrations of O3 are from 1337 sites, of PM2.5 from 551 sites. In this study, we apply a quality-control procedure to the observations that is suitable for real-time operational forecasting. A detailed description of this procedure applied to PM2.5 can be found in section 2 of the paper by Djalalova et al. (2015). Additionally, observation sites that frequently report missing data are excluded from the analysis presented in this study, i.e., only stations with at least 50% of data available are retained. That results in 1045 and 458 sites for O3 and PM2.5, respectively, available to generate AnEn.
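The 50% availability screen can be sketched as follows. The function name `retain_stations` and the convention of marking missing hours with NaN are assumptions for illustration; the quality-control procedure itself is not reproduced here.

```python
import numpy as np


def retain_stations(obs, min_fraction=0.5):
    """Return a boolean mask of stations with enough non-missing data.

    obs : (n_stations, n_hours) hourly observations, NaN where missing
    """
    available = np.mean(~np.isnan(obs), axis=1)  # fraction of hours reported
    return available >= min_fraction


# Three hypothetical stations with 75%, 25%, and 50% of hours available.
obs = np.array([[1.0, 2.0, np.nan, 4.0],
                [np.nan, np.nan, np.nan, 4.0],
                [1.0, np.nan, np.nan, 4.0]])
keep = retain_stations(obs)
print(keep)
```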

Sensitivity analysis of the analog ensemble
AnEn's performance is sensitive to the number of analogs chosen from the historical data set (Figure 2).
In the case of this study, 15-25 members produced ensemble mean forecasts of O3 and PM2.5 with the highest correlations and forecasts of PM2.5 with the lowest RMSEs.The lowest RMSEs from O3 forecasts were achieved with 10-20 members.These sensitivities motivated our choice of 20 members for this particular study.Other studies will exhibit different sensitivities and might call for different measures of AnEn's performance, not simply RMSE and correlation.
The curves in Figure 3 result from two opposing trends. The more analogs employed, the more thoroughly we sample the statistical relationships between forecasts and observations in the training data. However, as each additional analog is included, the similarity between it and the current forecast decreases. Under typical circumstances, the shorter the training period (i.e., the larger the percentage of the training data from which ensemble members are drawn), the less analogous are the chosen analogs, which will normally lead to lower correlations and higher RMSEs.
Another way to measure this sensitivity is to evaluate AnEn's performance as a function of training period with the number of analogs held constant (Figure 4). Longer training periods improve AnEn's performance. The degree of improvement depends on the variable and metric. Forecasts of O3 are improved most.

[Figure residue: panel title "Ozone"; x axis "Num. of members". Note the different ranges among y axes.]

... 1.0 range with increments of 0.1, and the weights of the O3 and PM2.5 CMAQ predictors are set to a minimum of 0.4 for the prediction of O3 and PM2.5, respectively. Figure 5 shows the distribution of the weights for each predictor for both O3 (left) and PM2.5 (right). For the prediction of O3, the predictors are wind speed (WSPD, m s⁻¹) and direction (WDIR, degrees from N) at 10 m AGL, air temperature (T2M, °C) at 2 m AGL, cloud fraction (CF), and ground-level concentrations of O3 (ppbv). For PM2.5, the analog predictors are T2M, WSPD, WDIR, specific humidity (Q, g kg⁻¹), and surface PM2.5 (µg m⁻³).
The distributions show the weights' variability across the stations. Both O3 and PM2.5 are weighted heavily for their respective predictions, as expected. For O3, T2M, WSPD, and WDIR are weighted similarly, while CF receives the lowest weights. For PM2.5, T2M has the second-highest median weight, followed by WSPD, WDIR, and then Q.

Deterministic predictions
When deterministic predictions are evaluated over all forecasts and observations in our study, AnEn's mean forecasts of O3 and PM2.5 are dramatically superior to the raw forecasts from CMAQ (Figure 6).
The AnEn significantly improves the CMAQ estimates that were used to generate it. As shown in other applications (e.g., Delle Monache et al., 2011; Djalalova et al., 2015; Huang et al., 2017), it reduces both systematic and unsystematic errors, while significantly improving the correlation with observations. In principle, given that the analog ensemble estimates are based on past observations, the AnEn mean should provide unbiased estimates. However, the residual bias after the AnEn correction (middle panels in Figure 6) is likely due to the fact that the training dataset is finite, which does not guarantee that the distribution of the observations that are the foundation for AnEn fully samples the predicted PDF.
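The three deterministic scores shown in Figure 6 can be computed as in the sketch below. The function name `deterministic_scores` is hypothetical, and the split of the error into bias and centered RMSE (with RMSE² = bias² + CRMSE²) is one common way to separate systematic from unsystematic error; it is an assumption here, not necessarily the decomposition used by the authors.

```python
import numpy as np


def deterministic_scores(forecast, observed):
    """Bias, RMSE, centered RMSE, and Pearson correlation of paired values."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    err = f - o
    bias = err.mean()                               # systematic part of the error
    rmse = np.sqrt(np.mean(err ** 2))               # total error
    crmse = np.sqrt(np.mean((err - bias) ** 2))     # error left after removing bias
    corr = np.corrcoef(f, o)[0, 1]
    return {"bias": bias, "rmse": rmse, "crmse": crmse, "corr": corr}


# Hypothetical case: a raw forecast that is 10 ppbv too high at every hour.
observed = np.array([30.0, 40.0, 50.0, 60.0])
raw_forecast = observed + 10.0
scores = deterministic_scores(raw_forecast, observed)
print(scores)
```

In this constructed case the whole error is systematic, so the centered RMSE vanishes and the correlation is perfect; a real forecast would show a mixture of both error types.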
Figure 7 and Figure 8 show the spatial distribution of the AnEn improvements (%) over CMAQ in RMSE and correlation for both O3 and PM2.5, computed independently at each available observation site and over the verification period. For O3, AnEn consistently reduces CMAQ RMSE (Figure 7

Probabilistic predictions
In this section, several attributes of AnEn and PeEn probabilistic predictions are evaluated to assess their performance for air-quality (AQ) forecasting. These include the match between the observed and predicted cumulative distributions, reliability, resolution, statistical consistency, and an analysis of the spread-skill relationship (Jolliffe and Stephenson, 2003; Wilks, 2006).

Observed vs. predicted cumulative distribution
The continuous ranked probability score (CRPS) is computed to assess the closeness between observed and predicted PDFs, by comparing the full ensemble distribution with the observations, where both prediction and observations are represented as cumulative distribution functions, or CDFs (Carney and Cunningham, 2006). For a deterministic prediction the CRPS reduces to the mean absolute error, and it has the same unit as the forecast variable. The sharper the ensemble-derived CDF is and the more closely it is centered on the corresponding observation, the lower the CRPS is. Zero is a perfect CRPS.
The distinct diurnal cycle in each CRPS series is not surprising, given the diurnal cycle in forecast error (Figure 6). The better CRPS of AnEn results from better resolution, an important attribute of probabilistic predictions. The reliability of the two systems is very similar. Reliability and resolution are discussed next.
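For a finite ensemble, the CRPS of the empirical (step-function) CDF can be evaluated with the identity CRPS = E|X - y| - 0.5 E|X - X'|, where X and X' are drawn independently from the ensemble members and y is the observation. The sketch below uses that identity; the function name `crps_ensemble` is hypothetical, and this is an illustration rather than the verification code used in the study.

```python
import numpy as np


def crps_ensemble(members, observation):
    """CRPS of the empirical CDF of an ensemble against a single observation."""
    x = np.asarray(members, dtype=float)
    # Mean absolute difference between each member and the observation.
    term_obs = np.mean(np.abs(x - observation))
    # Mean absolute difference between all pairs of members (including i == j).
    term_spread = np.mean(np.abs(x[:, None] - x[None, :]))
    return term_obs - 0.5 * term_spread


# A single-member "ensemble" reduces to the absolute error.
single = crps_ensemble([5.0], 3.0)
# Two members bracketing the observation.
pair = crps_ensemble([1.0, 3.0], 2.0)
print(single, pair)
```

Averaging this quantity over many forecast-observation pairs gives the mean CRPS at each lead time.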

Reliability
An ensemble is reliable when its forecast probability matches the observed relative frequency (i.e., the rate of occurrence) of a certain condition over a deep observational record. For instance, a reliable ensemble will predict a 7% probability of ground-level O3 concentration exceeding a regulatory threshold at some point during a 24-h period if historically on 7% of days with similar conditions that threshold was exceeded. For a given condition (e.g., O3 > 50 ppb), plotting forecast probabilities from a perfectly reliable model vs. observed relative frequencies will result in a 1:1 diagonal line on a reliability diagram (Jolliffe and Stephenson, 2003; Wilks, 2006). The CRPS reliability component, by contrast, provides an indication of overall reliability, i.e., one not tied to a particular condition. As shown in Figure 9 and indicated by "REL," the CRPS reliability components of AnEn and PeEn are very similar and close to the perfect value of zero, indicating that both probabilistic prediction systems would provide an end-user with information that is not misleading (i.e., that would not lead to over- or under-confidence in the forecast).
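The points of a reliability diagram for one condition can be obtained as sketched below. The function name `reliability_curve`, the use of the fraction of members above the threshold as the forecast probability, and the equal-width binning are assumptions for illustration; the tiny data set is invented.

```python
import numpy as np


def reliability_curve(ensembles, observations, threshold, n_bins=4):
    """Return (mean forecast probability, observed frequency, count) per bin."""
    ens = np.asarray(ensembles, dtype=float)      # (n_cases, n_members)
    obs = np.asarray(observations, dtype=float)   # (n_cases,)
    prob = (ens > threshold).mean(axis=1)         # forecast probability of the event
    event = obs > threshold                       # whether the event was observed
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each forecast probability to a bin; probability 1.0 goes in the last bin.
    which = np.clip(np.digitize(prob, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = which == b
        if mask.any():
            rows.append((prob[mask].mean(), event[mask].mean(), int(mask.sum())))
    return rows


# Four hypothetical O3 cases (ppb) with four members each; condition: O3 > 50 ppb.
ensembles = [[40, 45, 55, 60], [40, 41, 42, 43], [55, 60, 65, 70], [45, 48, 52, 49]]
observations = [52, 44, 61, 47]
rows = reliability_curve(ensembles, observations, threshold=50.0)
for mean_prob, obs_freq, count in rows:
    print(mean_prob, obs_freq, count)
```

Plotting the first column against the second over a long record gives the reliability diagram; points on the 1:1 line indicate perfect reliability for that condition.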

Resolution
The potential CRPS ("POT" in Figure 9) includes both uncertainty and resolution (Hersbach, 2000). The uncertainty term is an attribute of the observations only, and is the same across different prediction systems. The resolution, by contrast, quantifies the forecasts' ability to predict whether or not an event occurs. Probability forecasts with perfect resolution are 100% on occasions when the event occurs and 0% when the event does not occur. The CRPS POT values reported in Figure 9 show that AnEn has better resolution than PeEn. This is because AnEn forecasts are designed to capture errors in the current raw prediction, whereas PeEn includes the most recent 20 observations of O3 or PM2.5 at the same hour of day as the forecast valid time, which may not sample well the observation corresponding to the current prediction.

Statistical consistency
If an ensemble is statistically consistent, its members' forecasts are statistically indistinguishable from observations (Anderson, 1996). If this condition is satisfied, when ranking an observation against the corresponding ensemble forecasts, the observation falls with equal probability in any of the ranks. Over a sufficient number of cases, when rank frequencies are plotted the resultant rank histogram is statistically flat if an ensemble is perfectly consistent (Anderson, 1996; Hamill, 2001; Talagrand et al., 1997). Rank histograms of forecasts from AnEn and PeEn are close to flat (Figure 10). There is a slight high bias to both ensembles: shorter bars on the right of the histograms indicate that observations that fall among the higher predicted values are less common than those that fall among low values. The ensembles are also slightly under-dispersive: the U shape at the tops of the bars indicates that observations that fall within the envelope of the ensembles' spreads are less common than those that fall outside the envelope, either lower than the lowest forecast value from a member or higher than the highest.

• The AnEn significantly improves the skill of deterministic predictions by reducing the errors of the deterministic model used to generate it, the Community Multiscale Air Quality (CMAQ) model, while increasing its correlation with the observations. For example, AnEn's root mean square error is lower than CMAQ's by roughly 35% and 30% for O3 and PM2.5, respectively.
• AnEn produces a probabilistic prediction which is statistically consistent, reliable, and sharp. It quantifies the uncertainty of the underlying prediction, which could contribute to an increased ability to protect the public health.
• An analog ensemble can be generated for existing real-time air-quality forecast systems with very small additional computational cost in real time. However, an analog ensemble does require an archive of historical simulations from a deterministic model and observations of the quantity to be predicted, which can be built offline.
The results reported herein can be further improved with a longer training dataset (which would require additional computational resources), by extending existing training datasets to consider neighboring locations while searching analogs (at no additional computational cost), and by exploring more predictors for the analog-matching metric.

Code availability
The analog ensemble's code can be shared for research purposes upon request to the corresponding author.

Data availability
The datasets used in this study can be made available upon request to the corresponding author.
level concentrations of O3 and PM2.5 at lead times of 0-48 hours. Additional inputs for AnEn include 10-m wind speed and direction, 2-m air temperature, 2-m specific humidity, and cloud cover, which were extracted from the numerical weather predictions that are used to drive CMAQ, the National Centers for Environmental Prediction (NCEP) North American Mesoscale (NAM) model.

Figure 1. Sites of observations of a) O3 and b) PM2.5 used for this study. Colors indicate whether hourly observations at a site are available more (red) or less (blue) than 50% of the time during the two periods of study.

Figure 2. Examples of 0-48 h predictions of O3 (top) and PM2.5 (bottom) at two different locations and days. AnEn's results are on the left, PeEn's on the right. CMAQ's predictions are in red, observations in black. Both ensemble distributions are depicted with

Figure 5. Distribution of analog predictor weights across the available stations for O3 (left) and PM2.5 (right). The brown boxes indicate the interquartile (25th-75th percentile) range, the black line within the box is the median, the squares are the outliers, and the vertical black lines at the edge of the dashed lines are the minimum and maximum excluding the outliers. O3 predictors include wind speed (WSPD, m s⁻¹) and direction (WDIR, degrees from N) at 10 m AGL, air temperature (T2M, °C) at 2 m AGL, cloud fraction (CF), and ground-level concentrations of O3 (ppbv).

Figure 6. RMSE (top), bias (middle), and correlation (bottom) of O3 (left) and PM2.5 (right) vs. lead time in forecasts from CMAQ (black) and AnEn mean (red). Calculations are averages over all sites during the periods of study described in the text.