Development towards a global operational aerosol consensus: basic climatological characteristics of the International Cooperative for Aerosol Prediction Multi-Model Ensemble (ICAP-MME)

Here we present the first steps in developing a global multi-model aerosol forecasting ensemble intended for eventual operational and basic research use. Drawing from members of the International Cooperative for Aerosol Prediction (ICAP) latest generation of quasi-operational aerosol models, 5-day aerosol optical thickness (AOT) forecasts are analyzed for December 2011 through November 2012 from four institutions: European Centre for MediumRange Weather Forecasts (ECMWF), Japan Meteorological Agency (JMA), NASA Goddard Space Flight Center (GSFC), and Naval Research Lab/Fleet Numerical Meteorology and Oceanography Center (NRL/FNMOC). For dust, we also include the National Oceanic and Atmospheric Administration-National Geospatial Advisory Committee (NOAA NGAC) product in our analysis. The Barcelona Supercomputing Centre and UK Met Office dust products have also recently become members of ICAP, but have insufficient data to be included in this analysis period. A simple consensus ensemble of member and mean AOT fields for modal species (e.g., fine and coarse mode, and a separate dust ensemble) is used to create the ICAP Multi-Model Ensemble (ICAP-MME). The ICAP-MME is run daily at 00:00 UTC for 6-hourly forecasts out to 120 h. Basing metrics on comparisons to 21 regionally representative Aerosol Robotic Network (AERONET) sites, all models generally captured the basic aerosol features of the globe. However, there is an overall AOT low bias among models, particularly for high AOT events. Biomass burning regions have the most diversity in seasonal average AOT. The Southern Ocean, though low in AOT, nevertheless also has high diversity. With regard to root mean square error (RMSE), as expected the ICAP-MME placed first over all models worldwide, and was typically first or second in ranking against all models at individual sites. These results are encouraging; furthermore, as more global operational aerosol models come online, we expect their inclusion in a robust operational multi-model ensemble will provide valuable aerosol forecasting guidance. Published by Copernicus Publications on behalf of the European Geosciences Union. 336 W. R. Sessions et al.: Development towards a global operational aerosol consensus


Introduction
Aerosol modeling, once purely the domain of regional air quality and climate models, has seen recent rapid development at traditional numerical weather prediction (NWP) centers (e.g., Tanaka et al., 2003;Morcrette et al., 2009;Westphal et al., 2009;Kukkonen et al., 2012).Applications are numerous, and include corrections for radiance assimilation systems for the NWP modeling systems themselves (Wang and Niu, 2013;Weaver et al., 2007).There is further mounting evidence that for heavily burdened atmospheres, inclusion of the radiative effects of aerosol particles improves overall NWP forecasts (e.g., Haywood et al., 2005;Pérez et al., 2006;Wang et al., 2010;Mulcahy et al., 2014) and is even hypothesized to impact tropical cyclone (TC) development (e.g., from Karyampudi and Carlson, 1988;Karyampudi and Pierce, 2002;Dunion and Velden, 2004;to most recently Dunstone et al., 2013;Reale et al., 2011Reale et al., , 2014)).Direct and indirect radiative effects have also been found to impact common NWP parameters such as temperature.For example, in response to large biomass burning events, surface temperatures clearly drop (Westphal and Toon, 1991).Smoke over the Indian Ocean in 1997 and 2006 may have resulted in a net cooling of sea surface temperatures (e.g., Thampi et al., 2009;Rajeev et al., 2008), with dust over the Atlantic Ocean similarly indicted both physically (Evan et al., 2008) and as an artifact (Merchant et al., 2006).Atmospheric transport and diffusion can expand aerosol impacts to continental and global scales thus posing further NWP impact questions (e.g., Colarco et al., 2004;Damoah et al., 2004 over North America andKoe et al., 2001 over Asia).For these reasons, most NWP centers with global modeling mandates have some form of aerosol prediction program.Indeed, increased accuracy in forecasting aerosol particles has benefits for mitigating human impacts: poor air quality negatively impacts biological processes including human cardiovascular and respiratory health (Seaton et al., 1995;Pöschl et al., 2005).Reduced visibility due to aerosols creates operational hazards on land, at sea and for aviation.Volcanoes represent a dramatic example, with SO 2 and ash reducing visibility, while silicate tephra induces aircraft engine stalls and flame outs (Miller and Casdevall, 2000;Carn et al., 2009).Large volcanic eruptions that inject SO 2 in the stratosphere can also have a long-lasting cooling impact on surface temperature.
The path to the development of NWP aerosol capabilities has been quite different among centers.Certainly, the underlying meteorology driving aerosol models is from largely independent models.The aerosol source, microphysics and sink functions have also been developed or drawn from a variety of air quality and climate data sources.The differences in meteorology and assumed aerosol heritage when many aerosol parameterizations were developed lead to significant amounts of model tuning.Sometimes unphysical tuning parameters are required in order to get physical results against key metrics.Given the complexity of the aerosol and me-teorological environment, this tuning can lead to high scoring in one metric (aerosol optical thickness -AOT) and poor scoring on another (particulate matter d p < 2.5 µm, PM 2.5 ).With the advent of AOT data assimilation, models have been driving towards that metric (e.g., Reid et al., 2011) and AOT model analyses have dramatically improved.But even here, assimilation methods diverge significantly between centers (Reid et al., 2011;Benedetti, 2014), and eventually this must be reconciled for multi-day forecasts.
Due to the stochastic nature of the atmosphere, for any NWP variable, aerosol species or dynamical, deterministic forecasts eventually reduce in quality with increasing forecast time no better than climatological values (or sometimes worse).There are many sources of forecast error, but there are two categories in particular that garner significant NWP attention: systemic errors from the imperfect nature of the model and sensitivity of models to initial conditions.Lorenz (1963Lorenz ( , 1965Lorenz ( , 1969a, b) , b) showed in his classic papers that small errors in initial conditions produce large errors and divergence even within a perfect model.Errors ranging up to the synoptic scale have been found to not be the result of model deficiencies, but rather a small variation in initial states (Reynolds et al., 1994).To help control these errors, ensemble-based prediction, single-model ensemble meteorological forecasts are used by nearly all the major operational weather centers (Buizza et al., 2005).However, while single-model probabilistic ensemble forecasting is clearly enhancing model solutions (particularly in data sparse regions), multi-model ensembles are an ever increasing tool for forecasters.Multi-model ensemble (or consensus) forecasting, using independent and skilled forecasts, has long proven valuable to atmospheric sciences.Ensemble techniques have been applied to the benefit of tropical cyclone tracking (Leslie and Fraedrich, 1990;Mundell and Rupp, 1995;Goerss, 2000) and intensity forecasting (Kaplan and DeMaria, 2001;DeMaria et al., 2006;Sampson et al., 2008).The consensus of cyclone track forecasts was found, on average, to be more accurate than the individual member deterministic models.Consensus style multi-model ensembles and their interpretation are a mainstay of the climate change community (e.g., Meehl et al., 2007;Knutti et al., 2010).Fordham et al. (2012) used a multi-model ensemble of general circulation models to explore potential impacts of climate change following the demonstration of general circulation model (GCM) consensus values by Reichler and Kim (2008).Non-NWP methods also benefit from consensus techniques, as Sanders (1973) showed when the average forecast from a group of forecasters often proved better than any of the individual contributions given.Taken a step further, error weighting a multi-model ensemble leads to the development of the super ensemble (e.g., Krishnamurti et al., 1999;Casanova and Ahrens, 2009).However, equal weighting in a consensus style appears to provide the most robust result overall for a host of forecasting applications (e.g., Del-Sole et al., 2013;Sansom et al., 2013), especially if model errors are not precisely known (Weigel et al., 2010).Since error estimates are often quickly out of date as models develop rapidly, this final point is salient here.
The rapid increase in the number of operational and quasioperational global aerosol models coupled with the NWP community's wide experience of ensemble systems has resulted in an opportune moment to explore the development of a global operational multi-model aerosol forecast consensus.The International Cooperative for Aerosol Prediction (ICAP), consisting of developers servicing aerosol programs at forecasting centers and remote-sensing data providers, began meeting in April 2010 to discuss issues germane to the operational aerosol forecasting community (Reid et al., 2011).As a relatively nascent community, ICAP has worked to build the standards for data protocols, validation and verification between international centers.Data exchange for the purposes of consistent error analysis and consensus forecasting began in early 2011 and now includes four complete aerosol forecast models (European Centre for Medium-Range Weather Forecasts-Monitoring Atmospheric Composition and Climate Model (ECMWF-MACC); FN-MOC/NRL Fleet Numerical Meteorology and Oceanography Center-Navy Aerosol Analysis and Prediction System (NAAPS); Japan Meteorological Agency (JMA)-Model of Aerosol Species in the Global Atmosphere (MASINGAR); and NASA GMAO (Global Modeling and Assimilation Office) Goddard Earth Observing System Version 5 -GEOS-5).Three dust-only models are also included (NMMB (National Materials and Manufacturing Board)/BSC-CTM (Barcelona Supercomputing Center-Chemical Transport Model) Nonhydrostatic Multi-scale Meteorological; NOAA NCEP (National Oceanic and Atmospheric Administration-National Centers for Environmental Prediction) NEMS (NOAA Environmental Modeling System) GFS (Global Forecast System) Aerosol Component; NGAC (National Geospatial Advisory Committee) UKMO (United Kingdom Met Office) Unified Model).In this paper we briefly describe the International Cooperative for Aerosol Prediction Multi-Model Ensemble (ICAP-MME) framework and explore the first year of ensemble and ensemble member AOT products.For this study we only include the four complete aerosol models and NGAC (NMMB and the UKMO Unified Model will be in subsequent publications once sufficient data are incorporated for robust statistics).Forming an arithmetic mean of model parameters, the ICAP Multi-Model Ensemble (ICAP-MME) was generated.Climatological characteristics of the ensemble mean are presented.Verification statistics against Aerosol Robotic Network (AERONET) sun-sky radiometer data are presented including bias and root mean square error (RMSE), and we highlight areas of relative consensus and divergence.Finally, to set the stage for the next round of analyses, an example for Cape Verde dust is presented on ICAP-MME for issues to be addressed in predicting extreme aerosol events.

Methodology
For this introductory paper on the ICAP-MME, we briefly describe the included models and outline the fundamental metrics for model performance for AOT.The analysis period for this paper spans 1 year from December 2011 through November 2012.A further seasonal breakdown was also performed for boreal winter-spring (December-May) and summer-fall (June-November) periods.As per original ICAP agreements, we do not identify specific models to specific metrics other than the ensemble model itself.All such evaluations are to be performed and presented by the individual model's developers.Rather, we emphasize relative spread in skill for different sites.There are multiple reasons for this anonymous approach.These include the developmental nature of some of the input models and the very rapid updates the input models are receiving (e.g., by the time this paper is published, the model performance statistics will be certainly out of date).This paper is intended to demonstrate the usefulness of a multi-model ensemble in both forecasting applications, as well as a way to identify areas of common development needs in aerosol prediction.

Input models
The ICAP-MME is currently based on four comprehensive global aerosol models (GEOS-5, NAAPS, MACC, MASIN-GAR), and three dust-only global models (NOAA NGAC, NMMB/BSC-CTM, UKMO Unified Model).Requirements for entry in the ICAP-MME are a global model with at least quasi-operational status and reliable data distribution from a large data center.During the development of this paper, there were insufficient data to fully evaluate two of the dust models (BSC and UKMO).Thus, while we include these two models in the description, they are not used in this early evaluation.
We provide brief synopses of the input models in the current quasi-operational ICAP-MME consensus in Appendix A. As can be seen, these models tend to be quite independent in the parameterizations used for sources -particularly for dust and biomass burning.Although, biomass burning emissions all have some lineage back to MODIS (Moderate Resolution Imaging Spectroradiometer) active fire hotspot counts.Sea salt is treated similarly in nature between models in terms of functional form, but tuning based on underlying meteorology and sink terms results in significant differences between the models.Perhaps the most similar aspect of the models is in emissions of anthropogenic emissions which are poorly constrained and thus similar inventories are developed.MACCity (Monitoring Atmospheric Composition and Climate/CityZen inventory) is used by both ECMWF and NRL.NASA GMAO GEOS-5 uses Edgar, which has similar components to MACCity.Finally, the NOAA NCEP NGAC dust model has the same GO-CART (Goddard Chemistry, Aerosol, Radiation, and Trans- port Model) foundation as GEOS-5, although with different driving meteorology and without data assimilation.

ICAP-MME
The International Cooperative for Aerosol Prediction Multimodel Ensemble (ICAP-MME) is a consensus style multimodel ensemble where all members are equally weighted.ICAP-MME was born out of a simple ICAP proposition that some uniform basis of AOT plotting be adopted across centers.This quickly led to data exchange and ultimately the formation of the AOT ensemble.Because of differences in member center's data policy, data availability of ICAP-MME members and consensus fields is limited to participating centers.However, consensus plots are available on the web (http://www.nrlmry.navy.mil/aerosol/)with further expansion in the coming year.
The basic resolution of ICAP-MME is 1 • ×1 • , with member model data re-gridded through linear interpolation to 1 • ×1 • model grid.Three-dimensional aerosol and AOT fields are then generated in a member agreed NetCDF format.The ICAP consensus is the arithmetic mean of the interpolated fields.Because of latency constraints by some of the members, ICAP-MME is generated with a 24 h lag.This will be reviewed as those constraints change.Forecasts are available 6-hourly out to 120 h.At the moment the ensemble is limited to speciated AOT at a standard 550 nm wavelength.The data continuity for ICAP-MME for the current study period is presented in Fig. 1a.Because data are provided in an operational data stream, it was not always possible to back populate to make a completely contiguous data set.Outages could be due to a combination of network issues either at NRL, where the data are assembled, or at the production center.For this study, ICAP-MME is only generated when all four core models populate the ensemble, which holds data for 90 % of forecasts.
ICAP-MME has four broad species, dust, sea salt, pollution/sulfate and biomass burning/smoke.The largest difficulty in combining model data is in the various member models' speciation.For example, NAAPS separates out species by source (as is done in the ICAP-MME).Other models, such as GEOS-5, carry species by chemical species (e.g., sulfate, organic carbon, black carbon -BC).Also, some models carry size information (MACC), and others ignore biogenic organic carbon emissions.In the case of coarse mode aerosol species such as sea salt or dust, the speciation vs. source is easy to reconcile as the source and chemistry are one in the same, and size information can be integrated.The separation between anthropogenic pollution, biomass burning and sometimes included biogenic emissions is much more ambiguous.Therefore, we developed the simple rubric that sulfate and biogenic are considered in the pollution/sulfate category, whereas organic carbon is listed with biomass burning, which, if not physical, is in line with how the species are input and transformed into the models.Clearly, this is unsat-   1.
isfying from multiple points of view.To clarify the situation, our analysis reduces the degrees of freedom further, and we largely analyze on a simple fine and coarse mode species AOT (e.g., dust and sea salt is coarse, pollution, biomass burning is fine).While there is some residual coarse mode material in sub-micron size ranges, the spectral deconvolution method (SDA) algorithm takes these tails into account.
There are a number of products that are then generated from ICAP-MME.Most commonly used is the consensus arithmetic mean coupled with the standard deviation for the so-called mean-spread plot.Similarly, the median is calculated and sometimes used, as it is robust in the face of a major outlier.For event-based metrics, such as scores for dust storms, several cut points (e.g., thresholds) were used.The most notable and consistent is a 550 nm AOT of 0.8, high enough for the sky to have a complete haze color.Now that there are suitable data to develop a climatology, a dynamic event cut point will be developed in the future based on multiples of regional standard deviations or geometric standard deviations (e.g., 1σ or 2σ event).
Figure 2 presents example data from the ICAP-MME for the 72 h forecast for a particularly large dust event on 29 June 2012, including contributions from all four core and the three dust members.Plots and data such as these are expected to be released to the public following the publication of this paper.Figure 2a presents the simple ICAP-MME AOT mean.In Fig. 2b, a mean/spread plot is presented where the isopleths are AOT and the color is standard deviation.From these plots we can see that in many dust areas the models are very consistent, whereas in the Sahel and Arabian Gulf there is more uncertainty.In Fig. 2c, isopleths of AOT of 0.8 are presented, showing spatial differences in models, whereas in Fig. 2d, a simple warning area mask is plotted where at least half the models predict AOT > 0.8.All of these products are designed for easy interpretation and verification.
AOT data from the Aerosol Robotic Network (AERONET; Holben et al., 1998)  ∼ 0.01 to 0.02 (higher in the UV; Eck et al., 1999).These accuracies are for non-cloud contaminated data and comparison of AERONET field site AOT with independently calibrated sun photometers showed agreement to within ∼ 0.015 (root mean square) or better (Schmid et al., 1999;Nyeki et al., 2012).Level 1.5 AOD (aerosol optical depth) may have typical accuracies of ∼ 0.02-0.04but is quite variable and uncertainty may be larger, depending primarily on the length of deployment since initial calibration and the amount of material deposited on the optics lenses (dust, sea salt, etc.).Instances of cirrus contamination (Chew et al., 2011) were evident in the level 1.5 and, to a lesser extent, level 2 products.Influence of these outliers was removed by hand for clear outliers, as well as trimming the top 5 % of coarse observations in northern Africa and East Asia, and the top 15 % elsewhere.The remaining observations are then binned by the median observation value within a 6 h window centered on the model valid time.We focus on 21 sites chosen by the ensemble developers in consultation with AERONET before the analysis was conducted.Selection was based on regional representativeness (e.g., Shi et al., 2011) as well as a contiguous data record throughout the 1-year study period.These are listed in Table 1 and marked on Fig. 1b.All quantitative comparisons to AERONET are pairwise, conducted only when AERONET and the ICAP-MME can be co-located.Because three of the four multi-species models invoke some form of data assimilation, and ECMWF does not generate an analysis field of AOT on the model grid, our primary model metric for global representation of aerosol loadings is the 6-24 h forecast.Given AERONET only collects data on the sun side of the earth, this corresponds to 6-30 h of forecast time for any data day.For all calculations of forecasts out to five days, verification is performed ±3 h of model valid time which is instantaneous for that time.For brevity, we group error statistics into data days to simplify the number of columns in data tables.With the once daily 00:00 UTC (GMT) production of ICAP, some regions benefit from the availability of daytime-only data for verification and assimilation.Thus, over Asia verification and assimilation are at a shorter forecast time than say Europe and North America.This gives Asia a beneficial regional verification bias, but we do not believe this will impact any of our key results.
Rank histograms, also known as the Talagrand diagram, were also generated for the models (Talagrand et al., 1997).These help determine if the ensemble members are drawn from the same distribution that produces the true state.While not a true verification tool, they are diagnostically useful to judge the ensemble reliability.Given an observation point, an n member ensemble is organized from highest to lowest and assigned a rank of 1 to n + 1.If the ensemble is representative, the observation value is equally likely to be of any rank of the n + 1 ranks, assuming a statistically significant number of independent observations, resulting in a flat histogram.Conversely, bias could be evaluated if the observation too often falls into the top or bottom bin.A U-shaped histogram potentially indicates insufficient ensemble spread, as all the forecasts consistently resolve too high or low.Care must be taken with interpretation, as uniform or U-shaped distributions can arise, such as when observational biases change sign by location.More detail on rank histograms can be found in Talagrand et al. (1997) and Hamill (2001).
For event forecasting, we use the critical success index (CSI) also known as the threat score (TS) (CSI or TS = hits / (hits + misses + false alarms) as a common and straightforward metric with scales that range from 0 (no skill) to 1 (perfect skill).For AOT, threat scores are somewhat subjective.If the bar for triggering a hit is too low, then the model forecast is without functional value.If it is set extremely high, then the TS gives a false optimism in system performance.To address this issue, we also use the equitable threat score accounts for random change ETS = (hits − random chancehits) / (hits + misses + false alarms−random chance hits) where the random chance is (the total forecasts of the event • the total observation) / sample size.As discussed in Sect.6, the use of threat scores are somewhat problematic, especially in regard to amplitude or displacement error (Baldwin and Kain, 2006).

Results: climatological characteristics of ICAP-MME
The mean and standard deviation of the ICAP-MME 6hourly forecast mean is provided in Fig. 3. Data are broken down into a seasonal and size mode degree of freedom.Seasonally, data are presented for the boreal winterspring December 2011-May 2012 and boreal summer-fall June-November 2012 time periods.These bi-seasonal temporal stratifications account for the major monsoonal and climatic shifts in the atmosphere while preserving major aerosol events such as for the boreal summer-fall, the August-October biomass burning seasons in Africa, South America and Maritime Continent, the June-August African Dust Season and the contiguous United States (CONUS), and European summer haze seasons.Similarly the boreal winterspring period captures the March-May Asian dust season, and the Southeast Asia and Sahelian African biomass burning season.The next set of striation is by model size, separating fine mode species (sulfate, organic carbon, black carbon etc.) from coarse (sea salt and dust).This stratification resolves speciation differences between models.Correspond-ing seasonal means of AERONET fine and coarse mode 550 nm AOT are also presented on the mean plots.
The ICAP-MME, as well as the entirety of the core model members, easily resolves the world's largest aerosol features: Saharan dust, continental biomass burning, and the great Asian dust and pollution plume are well described.Associated standard deviations of the ICAP-MME 6-hourly means also highlight regions of more episodic aerosol events, with African and Asian dust being particularly noteworthy.The seasonal biomass burning features also stand out.
While in the next section we focus on member scores, from a climatological point of view it is worthwhile to examine the climatological variability between the models.In a manner similar to Fig. 3, in Fig. 4 we present bi-seasonal and size modal estimates of the pointwise maximum or minimum AOT of the ensemble members.That is, after generating seasonal and size modal mean AOT for each of the four core member models, for each 1 × 1 latitude and longitude point we select the highest and lowest AOT of the four.Such a minimum and maximum not only is indicative of differences in model amplitude, but also in plume location (if, for example, a zonal aerosol feature is shifted meridionally between models, then the minimum will be low across the region, missing the feature completely).Such a depiction can span the local seasonal range of coarse and fine mode AOT present in the models and identify which areas require attention.The largest area of difference between the models was clearly associated with biomass burning, with factors of 3 differences spanning springtime Sahelian and South American biomass burning and biomass burning over the maritime continent.The coarse models generally tuned dust reasonably well, but sea salt maximum and minimum in the Southern Ocean spanned a factor of 2.
We can become more quantitative through comparison to AERONET.Table 1 lists AERONET AOT and the model mean bias for the ICAP-MME and its core members for the two monsoonal periods.Table 2 presents a similar set of bias statistics for ICAP core models plus NGAC for those sites where the coarse mode is dominated by dust.In all of these tables, ICAP-MME ensemble mean is underlined.To improve visualization, in Fig. 5 we present similar data in a scatterplot (similar in nature to a reliability diagram), where the ICAP-MME means are in bold.
Our interpretation of pairwise AERONET data are in agreement with our interpretation of plots in Figs. 3 and 4. Overall the models have reasonable correlation and consistency across the AERONET sites.Cape Verde, perhaps the community's benchmark site for dust, was so well tuned in the models that it had virtually non-existent dust biases for summer-fall and an insignificant 10 % high bias for winterspring.Most background sites performed equally well.The one exception was the Crozet Islands in the Southern Ocean for boreal summer-austral winter, where most models clearly overestimated sea salt production.
For higher AOT sites, all of the models have a clear and consistent low bias.A small part of this can be explained by the smoothing nature of a global model (models propose to represent the grid box mean).High AOT plume or dust event amplitude simply is not captured either in the model physics or in data assimilation.Depending on how models screen their AOT data before data assimilation, bias could be a residual of the retrieval (e.g., see discussion in Zhang et al., 2008).However, some of the largest departures are clearly related to chemistry or sources.The highest single departure for the winter-spring period is Chiang Mai, Thailand, where all models seem to underestimate that season's biomass burning and pollution influence.Models also underestimate AOTs at Singapore.This is not surprising, as SE Asia has been identified as being perhaps the most challenging region in the world to observe and model (Reid et al., 2009(Reid et al., , 2013a, b) , b) because, among other reasons, high cloud cover conspires to disrupt both fire detections and data assimilation.Most models rely in fact on retrieved products from the MODIS instruments, which have large biases in presence of clouds or are not available at all.The Sahel region in the winter-spring is another area of considerable difficulty for nearly all models.Ilorin in fact had the highest climatological AOT of any site (0.89) with consistent low biases in all models, on the order of 50 %.This is likely due to underrepresentation of biomass burning in all models, although a correlated bias between models and smoke optical properties cannot be ruled out at this time.The Sahelian biomass burning system and its frequent mixing with dust and clouds makes it difficult to remotely monitor (Reid et al., 2009).Finally, areas of very high pollution load, such as the sites on the Indo-Gangetic plain (Kanpur and Gandhi College) and Beijing also have persistent low biases (Table 1).Models that have secondary organic aerosol production have lower biases than those without.However, large uncertainties at this site also point at inadequacies of the emission inventories.Dust is also under-represented for these sites.
A similar study of bias as above can be also conducted as a function of forecast day (Fig. 6).After the generation of the forecast analysis through data assimilation and the forecast commences, both the meteorological and aerosol models will evolve into its free running behavior.Thus, in general we expect model biases to worsen in time as the model gets further and further away from the satellite observations that help initialize the run.In areas of poor natural model performance, the change in model bias with forecast time can be dramatic.Sometimes, site performance can completely reverse itself between monsoonal phases.Most notable is Ilorin in the African Sahel.Mean AOT biases become evermore negative in forecast time in the winter-spring period, reaching 50 % of the mean value at 5 days.Similar biases are seen in Chiang Mai, Thailand.Bias change in the fine mode implicate biomass burning in this region.Again, these are the most complex burning regimes in the world (Reid et al., 2009(Reid et al., , 2013a, b), b).However, for the summer-fall, both sites do remarkably well.In the case of Ilorin, it is a result of a transition from mixed dust and biomass burning to a dust dominated regime (Eck et al., 2010).For Chiang Mai, it is a result of the linear nature of consensus style ensembles, one model with a very large high bias counteracted three others with a moderate low bias.Also of note is Kanpur, India, which consistently demonstrates poor forecasting performance of all the models, although its neighbor Gandhi College (not shown) only showed half the bias.Some sites actually improve in time, such as, Baengnyeong Korea, where there is statistically significant improvement in bias with forecast, in the winter-spring.This could implicate bias in the analysis, as the free running forecasts relax into lower error states before being erroneously jarred into high error by the assimilation process.Other sites show little difference at all as forecast time increases, such as Beijing, Banizoumbou, Sahelian Africa and winter-spring in Singapore, implicating the lesser impact of data assimilation in these regions.

Results: RMSE
While mean seasonal bias is important overall to many aerosol applications, for aerosol forecasting daily variability is equally if not more important.In this regard metrics such as the RMSE become more appropriate for characterizing model skill.Since RMSE incorporates both bias and variance, additional steps can also be taken to perform bias removal in order to determine how well models capture aerosol variability.In a like manner to bias, Tables 3 and 4, provide total AOT and dust RMSE for each site.These RMSEs are pictorially presented in Fig. 7 against each site's mean AOT.Shown are each model's value (small dots color coded) and the RMSE for the ensemble mean (large blue data point).A likewise representation of RMSE for 4-day forecasts (84 to 108 h model AERONET matchups) is similarly presented in Fig. 8 for total, and dust AOT.Total AOT RMSE and FGE (fractional gross error) as a function of forecast time for key sites are presented in Fig. 9.
By definition, for biases the ICAP-MME ensemble mean provides no more information than the average of its members.As was clearly demonstrated, as all of the models tend to low bias average AOT, so does the ICAP-MME.If the model averages are evenly distributed around the true state, the ICAP-MME will be without bias.For RMSE however, the situation is quite different, where typically we find the RMSE of the ensemble of skillful and independent models is superior than any individual members.We found this to be the case with ICAP MME.With RMSE (or mean absolute error, not shown), ICAP-MME provides the best performance.Examination of Fig. 7 and Tables 3 and 4 shows that in nearly all cases the ICAP-MME RMSE is either the leader or the second best in RMSE.For dust in particular (in which all 9, 1, 0, 0, 0, 0 10, 0, 0, 0, 0, 0 96 h rank (10 pos) 10, 0, 0, 0, 0, 0 10, 0, 0, 0, 0, 0  modeling groups emphasize development), the ICAP-MME is particularly skillful.
Based on the slope of RMSE against mean AOT value for each site in Fig. 7, the RMSEs of the 1-day forecasts of ICAP-MME run approximately 50 % of the climatological mean AOT value.Dust AOT forecasting is superior to over all fine and coarse mode AOT, running approximately one-third of climatological AOT.Again, this is part reflects the importance of the dust species by centers.Further, the AERONET Cape Verde site (in which RMSE is particularly skillful) is a common benchmark site for Saharan dust; hence, models are typically tuned for the region.
Regions of particular difficulty with RMSE are often the same as those with large biases.Chiang Mai and Singapore in their respective biomass burning seasons have some of the highest biases.Beijing China, Kanpur India and Ilorin in the Sahel have RMSEs that are more than half the mean AOT.But, if we account for mean AOT for the region in the FGE errors at some low AOT sites become more pronounced.Perhaps most important of sites would be Baengnyeong Korea, a receptor for East Asia, with a normalized RMSE of 1.3, or a FGE of 0.55.Owing to the low baseline AOT and the difficulty with modeling and remote sensing in the South-  ern Ocean, the Crozet Islands also appear to be poorly represented, with normalized RMSE of 1.24, and a FGE of 0.67.Monterey, CA, another marine site, also has FGEs in the 0.3-0.6 range.
Like bias, forecasting skill for all models and the ensemble mean degrades in time.Although the relative performance of the ICAP-MME mean relative to the member models increases in time, particularly for dust.In Fig. 8, we show the RMSE vs. AERONET AOT for 4-day forecasts, or 3 days after the first 24 h baseline for the total and dust cases (Fig. 7).In general, the RMSEs increase to 60 % of the total AOT value from ∼ 40 % at 1 day.For dust, however, skills remain constant in time.These general trends can be seen even more clearly in RMSE and FGE as a function of forecast day of the ICAP-MME consensus (Fig. 9).

Results: rank histograms
Thus far we have treated the ICAP-MME deterministically through comparisons of the ensemble mean to the individual members.Comparisons between models in bias and RMSE do tell us a general state of the modeling community.To move towards a goal of event driven applications of the ICAP-MME, we can begin to view the ensemble members probabilistically and ask questions related to where individual observations fall relative to the model.The rank histogram (a.k.a Talagrand diagram) is a useful diagnostic to depict the relative distribution of observations and models (e.g., Hamill, 2001).Rank histograms are constructed by repeatedly tallying the rank of the verifying observation relative to values from an ensemble sorted from lowest to highest.A flat rank histogram is usually taken as a sign of reliability, while a U-shaped rank histogram often indicates a lack of variability in the ensemble.In Fig. 10 we present global rank histograms of the first forecast data day (6-24 h forecast period) for all observations segregated bi-seasonally into the boreal winter (December-May) and summer (June-November) periods.Included are histograms for all AERONET matchups for our 22 sites (Fig. 9a-d) as well as for those AERONET cases were AOT > 0.6 (Fig. 9e-h).This value of 0.6 is somewhat arbitrary, and was chosen to give balance between high AOT and enough data points to lend significance to the product.Plots are given for total, fine and coarse AOTs for the four core multi-species models (leading to 5 ranks), and dust for the core four plus NGAC (6 ranks).
As a rank histogram is a histogram as to where an observation falls relative to the models, it is useful to calculate and examine relative to the biases and RMSEs.For all data (Fig. 10a), the histogram is relatively flat, with a slight slope with increasing rank.That is, the observations tend to be bigger than the individual members and the ICAP-MME mean.But, there is offsetting divergence in the individual aerosol particle size modes, with models generally overestimating fine mode AOT overall, and conversely underestimating coarse mode AOT.This is in agreement with the biases presented in Table 1.Thus, while the total AOT data histogram is relatively flat, it is flat for the wrong reasons with offsetting fine and coarse populations.If we examine more significant events for AOT > 0.6 (Fig. 10e-g), we will see that the models are strongly low biased overall.That is, for dust, smoke and pollution alike, the models are in general underestimating the most severe events.
These rank histograms are for all global observations and are generally representative for individual sites.In Fig. 11 we present histograms for 15 of the 21 sites for the four multispecies models (to conserve space, plots that showed similar tendencies to neighbors were dropped).In the first column, sites of a background nature or as a long-range receptor are given.All of these sites are relatively clean and have average AOTs < 0.15.In general, the histograms are relatively flat, although there is in general over prediction of AOT in the central United States, represented by the DOE CART site, and under-representation of dust at the Palma de Mallorca site in Spain as a receptor for dust.At Ragged Point, an African dust receptor in the Caribbean, the distribution is good.For sites with intermediate loadings or those that are taken as regionally representative of polluted areas (column 2), there is also a distribution of tendencies, with Singapore showing universal AOT under-representation in AOT, and Goddard Space flight center suggesting over-representation.Most interesting are the heavily impacted sites (column 3 and 4), where we show all data plus those cases where AERONET AOT > 0.6.Sites, such as Beijing, China and Gandhi College, India (for massive pollution models), and Baengnyeong, Korea (an Asian receptor), Cape Verde and Banizoubou (for African dust models), have similar tendencies in regard to all data.But all models are strongly low biased for high AOT events.This shows that while the models are independent in the meteorology and parameterizations, they nevertheless succumb to correlated bias overall.As an example of typical behavior for a well characterized site we turn to Cape Verde as an example.

Results: Cape Verde and Kanpur as examples of issues related to forecasting significant events
A substantial motivation for operational aerosol forecasting is natural hazards and significant events forecasting.Thus, while it is important for models to generally reproduce the basic characteristics of the aerosol system via good bias and RMSE scores, it is perhaps equally important for the models to succeed in identifying significant and unusual events.
Good RMSE scores by nature ensure the models have skill in predicting typical environments, but consistent bias and amplitude may cloud a model's value in more extreme situations.In the early stage of development we settled on an AOT of 0.8 to be a key benchmark for warning areas (e.g., Fig. 2c, d).For example, the MACC alert system which is aimed at detecting significant events for air quality exceedance, uses a threshold of 0.5, which can be shown to correspond to a particulate matter < 10 µm in diameter (PM 10 ) of approximately 50 µg m −3 .The number of days during which this PM 10 value is exceeded is used in European legislation as a threshold for fining EU countries.The value chosen for the ICAP-MME is largely subjective, and was agreed upon after an examination of AERONET data to find logical 2σ events in heavily polluted regions.However, after deeper investigation, this became somewhat dissatisfying.In the context of a multi-model ensemble, there are numerous subjective considerations in combining model products for the benefit of forecasters.For example, one model may have an amplitude consistent error (i.e., track AOT extremely well), but poor bias scores and threat scores.Others may have excellent amplitudes and biases overall, but have timing issues with significant events.As always, there is the potential for sampling bias in our observational data set.
To conceptualize the above issues we considered two sites in detail: Cape Verde and Kanpur.The Cape Verde AERONET site is a long-standing benchmark location for dust modeling.With more than 15 years of observations, it is one of AERONET's longest running providing not only satellite and model verification data, but also climatological aerosol trends.Given the significant amount of attention centers pay to modeling dust, it is no surprise that Cape Verde is a high scoring site for all models.In comparison, Kanpur is the lowest scoring site next to Beijing for the models.Given that Kanpur has a more contiguous data record than Beijing, we chose that site for further analysis.

Cape Verde
Cape Verde's location as a downwind receptor for African dust coupled with overall good model performance makes it a good location to study the nature of event scoring.The Boolean nature of threat scores is often problematic and there can be difficulty in this metric in first defining what con- stitutes an event.For air quality applications, for example, an event can be referenced to a degree of violation.Near misses are frequently valuable from a forecasting point of view, both in magnitude and in temporal offset.Observations are also problematic, as clear-sky bias can be a problem in both satellite and ground-based observations, thus leading to a bias as to when one can verify.We can explore this further with the time series of AERONET coarse mode AOT and the ICAP-MME mean for the 1-year study period (Fig. 12a).Differences in the dust seasonality are clear, with winter and spring months having a relatively low background with occasional significant events and a higher dust continuum during summer months with numerous high-frequency events.An enlargement is provided in Fig. 12b for the middle time series month of May.Examination of the data in combination of error statistics presented in Tables 1 and 2 suggests that indeed the ICAP-MME is performing well.Scatterplots of the 12 and 84 h forecasts for 00:00 UTC against AERONET (Fig. 12c), representing 24 and 96 h since the last satellite data assimilation cycle for the region, are quite good.However, there are clear outliers worth investigating from an events perspective.
In interpreting the regression of Fig. 12c, cases far to the right of the regression lines (7 February, AERONET = 1.15,ICAP MME = 0.25) tend to be in association with residual cirrus contamination.Cases studies such as these were visually verified such as in the right satellite image in Fig. 12a.While such misses are infrequent, they nevertheless are reminders that no verification data set is perfect, and in an unsupervised verification system, cases such as this can heavily affect scores.Data points far to the left of the regression line are false alarm cases where presumably the models far over predicted a dust event that did not materialize.These cases are nearly all associated with the 84 h forecast of isolated wintertime events, or four full days since the last satellite data assimilation cycle, and thus are purely forecast meteorology driven.We found that errors dropped to half their previous value as forecast lengths decreased to about two days as the forecast meteorology became more accurate.However, there are also cases where when we track the peak in dust AOT, this peak arrives outside of AERONET verification data availability but within 12 h.This artifact points to the necessity of loosening our verification criteria for the amplitude and timing for longer forecasts.
To help further describe the nature of the many Boolean skill scores, it helps to provide an example for when we have the most confidence: forecasts within 24 h.Our first challenge is to define the threshold to be implemented.This can be done uniformly for all sites (AOT > 0.5, 0.8 or 1 etc.), or it can be site specific, based on the probability of what is locally considered an extreme event.Figure 12d provides an AOT probability plot for the 1-year time series of the 12 and 84 h forecasts.There is generally good agreement on the probability distribution of AOT between observations and corresponding 12 and 84 h forecasts well past one geometric standard deviations (84.1 AOT percentile = 0.50) to just short of two (97.7 AOT percentile = 0.83).These lines are marked on Fig. 12a and b as well as the common 1.5σ g level (93 AOT percentile = 0.62).
The difficulty in skill scores becomes apparent if we consider the 2σ g level as a threshold.At 2σ g there are six events recorded by AERONET (3 in May, Fig. 12b), all of which were captured by the ICAP MME mean at 12 and 84 h.However, at 12 h, there were six false alarms.This leads to a TS or CSI of 0.5, and ETS of 0.48.This is a somewhat middling score.However, in five of the six false alarm cases, the observations reached at least 1.5σ g with the remaining one above the 1σ level.For 84 h forecasts, the false alarm rate goes up to 11, but even here six reach the 1.5σ g level.If we use 1.5σ g as a threshold, the 12 h TS goes up to 0.65 and the ETS to 0.58.
Between the above analysis and Fig. 12c the model clearly has skill.However, the Boolean nature of the metrics can make interpretation difficult, particularly when one applies them uniformly over the globe.This situation is common in the Numerical Weather Prediction realm, and in response dozen of skill scores have been developed, including those with fuzzy neighborhood boundaries such as spatial multievent contingency tables and fractional skill scores to (e.g., http://www.cawcr.gov.au/projects/verification/).If we move further to take advantage of the natural probabilistic applications of a multi-model ensemble, versions of Brier scores or the continuous rank probability score may also be appropriate.These are directions of research for the next set of multi-year ensemble data.

Kanpur
In contrast to Cape Verde, Kanpur represents a site with overall poor event scoring by all models for the common metrics as bias, RMSE and threat score.In this case, Kanpur provides a complex overall environment over land in opposition to the more simplified dust environment at the nominally oceanic site of Cape Verde.Kanpur district has a high population density (∼ 4.5 million), has high industrial and biofuel emissions, is a receptor for dust from all along the Indo-Gangetic plane and as is key here, and a complex aerosol meteorology, particularly in wintertime (e.g., Nair et al., 2007;Gautam et al., 2007Gautam et al., , 2009Gautam et al., , 2011;;Kar et al., 2010;Arola et al., 2013).Given such complexity, it is little wonder that the global models have great difficulty with the region in the context of common metrics.But, after further examination and consideration of the nature of global modeling, we find that bulk metrics do not entirely describe model performance, particularly in regard to extreme events.
Figure 13 provides data of a similar nature as shown in Fig. 12 for Cape Verde.Although here we provide fine mode data for the four multi-species models, and all five models under current analysis with dust.Beginning with fine mode comparison, we find that the 12 h forecast nominally tracks the overall nature of the regions aerosol pattern, although with a significant low bias in the winter months.Also in the winter months is when we find significant spikes in fine mode AOT.These are quite often haze events created during the evaporation of winter time stratocumulus (Eck et al., 2010).Under such circumstances, global models are unlikely to cope with such strong boundary layer meteorological forcing.In contrast, we see that in the spring, when pollution events are more regional, the models have some skill in at least simulating event onset, albeit with a significant low bias.When taken as a whole, skill scores for correlations are reasonable for 12 h forecasts, or nominally 18-24 h since the last satellite observations were assimilated (r 2 = 0.58).However, by the time forecasts reach 3-4 days, models appear to lose all fine mode skill.
For dust, the models appear in general to perform better.Regressions are decent at both 12 and 84 h (r 2 = ∼ 0.6).But in this case, there are no events.The distribution of coarse mode AOT observations are so tight, there are few to no observations past the 1.5 standard deviation level.At one geometric standard deviation, exceedances are in a continuum.
Thus, a threat score does not provide sufficient context to evaluate models in this environment.
Finally, Kanpur highlights a further situation with verification data.While the SDA algorithm does an admirable job separating fine and coarse mode AOT, in this case coarse mode is a combination of aeolian dust (which is generally the context of dust in the global models), and regional coarse mode species, including agriculture, industrial or road dust as well as perhaps droplets in the cloud burn off phase.This seems to be particularly true in the winter periods.Thus, is the use of the term coarse mode cannot be used synonymously with dust in classical terms, and thus provides an additional challenge to the global models.

Discussion and conclusions
This paper describes the basic climatological characteristics and evaluation of the world's first global multi-model aerosol forecast model, the International Cooperative for Aerosol Research Multi-Model ensemble: ICAP-MME.At the time of writing this paper, there were four core multi-species models (ECMWF, JMA MASINGAR, NASA GEOS-5, NRL NAAPS) and seven dust models (aforementioned four, plus NMMB/BSC-CTM, NOAA NGAC, and UKMO Unified Model) running daily at 00:00 UTC with 24 h latency.Here we focus on the first year of data, from 1 December 2011 to 30 November 2012 when all four multi-species models plus NGAC were providing data in near-real time.We expect rapid evolution in the individual member models based on these results and similar exercises with ICAP-MME products.Thus, the error metrics are likely out of date for the better at the publishing of this initial research.Further, as models are added to the ICAP-MME we expect better performance.The initial state of the ICAP-MME is worth documenting for base lining purposes, and the general tendencies in the state of global aerosol forecasting models are worth discussing.These are listed here as follows: 1. Overall performance via RMSE: as we expected when we first constructed the ICAP-MME, the ensemble mean outperforms all of its individual members in RMSE against AERONET globally throughout the forecast period.Typically RMSE runs 40-60 % of the mean AOT with coarse mode prediction outperforming fine mode.Given that RMSE has both a bias and variance component, and the ensemble mean bias is by definition in the middle of the members, the improvement in variance prediction is significant.Like other ensemblebased systems like the tropical cyclones (Leslie and Fraedrich, 1990;Mundell and Rupp, 1995;Goerss, 2000;DeMaria et al., 2006;Kaplan and DeMaria 2001;Sampson, 2008) and GCMs (e.g., Meehl et al., 2007;Knutti et al., 2010;Reichler and Kim 2008), we expect that as individual models improve and are added, so will the consensus.Indeed, even though NGAC has

W. R. Sessions et al.: Development towards a global operational aerosol consensus
average performance relative to other dust models, it did improve the overall RMSE of the ICAP-MME for dust.
2. Overall performance via bias: in general all models and thus the ensemble mean capture the major climatological aerosol features around the globe.However, while the models perform well in RMSE, there is a tendency for the modeling community to have a low bias in AOT, particularly for significant events.Conversely, for more moderate or clean conditions, fine mode AOT is overestimated.These biases seem to be persistent in the modeling community, and documentation dates back to the AeroCom (Aerosol Comparisons between Observations and Models) comparisons of Kinne et al. (2006).This persistence in low bias dust in models is perplexing, because one would think the community would tune around the observation.In the case of the forecast models, the assimilation of MODIS and verification via AERONET are ubiquitous.Further, regression is probably the most commonly used tuning metric, and as it is driven by the largest magnitude values we were surprised to find the under-representation of AOT.We can surmise that in some heavily polluted urban site like Beijing, large-scale models cannot represent fine-scale features nor are there observations for extreme events (the maximum AOT measurable by AERONET is ∼ 5).But for regional polluted sites compared to urban counterparts (such as Gandhi College vs. Kanpur) the biases remain.It may represent an overall reluctance by model developers to perturb or tune static emissions inventories.Thus, this persistent bias among models might also have a psychological supporting factor too in the way scientists interpret pollution data vs. other species such as dust and biomass burning.In regard to ICAP-MME, all core models have satellite data assimilation in some cases, remote-sensing biases can then work their way into forecast climatological biases.Even so, some species remain problematic.There is more diversity in climatological biomass burning AOTs than any other species.Despite the low AOTs, diversity in sea salt AOTs in the high mid-latitudes is also large.Tracking this effect is a goal of future efforts.
3. Site specific performance: AERONET sites were picked by mutual agreement by the model developers based on data representativeness and availability.There are clearly regions of relative high and low model performance.Cape Verde is a widely used AERONET site for monitoring dust emissions from the Sahara, and models in general tune to this site to great effect even though the benefit of data assimilation is marginal outside of the analysis period.Aerosol receptor sites or those sites which will have the benefit of data assimilation also tend to score well such as Palma de Mallorca and Ragged Point.There are also sites with universal difficulty.Models clearly have more difficulty with sites in the mixed fine and coarse mode environments of the Sahel, India and polluted cities of Asia.Cloud cover impacts on data assimilation are also likely a factor in sites such as Singapore and Chiang Mai.
4. Future directions: this is the first paper on the ICAP-MME and there are clearly many directions in which studies may proceed.Perhaps the most common question received by developers on future direction is whether we intend to convert the ICAP-MME to a super ensemble where models are weighted by their scores (e.g., Krishnamurti et al., 1999;Casanova and Ahrens, 2009).Experience has shown however that equal weighting in a consensus style appears to provide the most robust results overall, and this is backed up on both practical and theoretical grounds (DelSole et al., 2013).Further, we frequently see regional improvements to ensemble members as the models develop, and different models score differently by region or type of event.In the operational realm reanalyses cannot always be generated and significant events by nature are so rare that tuning will likely be unrepresentative.Thus, for all of these reasons an operational super ensemble is impractical at this time, although in the future adaptive systems may be possible.Nevertheless, the underlying premise that individual models be continuously scored uniformly is highly relevant to the field, particularly for major events.Now that a common data set has been generated developers are now in a position to agree upon standard metrics and protocols to ensure that performance improvement and best practices are cleanly documented across models.A second area for future direction related to metrics is to take advantage of the probabilistic nature of the ICAP-MME.Already consensus threat scores and warning areas have been defined.These clearly need to be explored further.Finally, given that the ICAP-MME members share some development legacy and at times exhibit similar forecast outcomes, we intend to probe the relative independence of the models.

Appendix A: Member model descriptions
Provided in the Appendix are short narratives of individual model descriptions provided by their developers.We begin with the four core multi-species model developers (ECMWF MACC, FNMOC/NRL NAAPS, JMA-MASIN-GAR, NASA GMAO GEOS-5) followed by the three dustonly models (NMMB/BSC-CTM, NOAA NCEP NGAC, and UKMO Unified Model).
A1 Multi-species models A detailed description of the ECMWF forecast and analysis model including aerosol processes is given in Morcrette et al. (2009) and Benedetti et al. (2009).The initial package of ECMWF physical parameterizations dedicated to aerosol processes mainly follows the aerosol treatment in the LOA/LMD-Z (Laboratoire d'Optique Atmospherique/Laboratoire de Météorologie Dynamique Model) model (Boucher et al., 2002;Reddy et al., 2005).Five types of tropospheric aerosols are considered: sea salt, dust, organic and black carbon and sulfate aerosols.Prognostic aerosols of natural origin, such as mineral dust and sea salt are described using three size bins.For dust, bin limits are at 0.03, 0.55, 0.9 and 20 microns, while for sea salt bin limits are at 0.03, 0.5, 5 and 20 microns.Emissions of dust depend on the 10 m wind, soil moisture, the UV-visible component of the surface albedo and the fraction of land covered by vegetation when the surface is snow free.A correction to the 10 m wind to account for gustiness is also included (Morcrette et al., 2008).Sea salt emissions are diagnosed using a source function based on work by Guelle et al. (2001) and Schulz et al. (2004).In this formulation, wet sea salt mass fluxes at 80 % relative humidity are integrated for the three size bins, merging work by Monahan et al. (1986) and Smith and Harrison (1998) between 2 and 4 mm.Sources for the other aerosol types, which are linked to emissions from domestic, industrial, power generation, transport and shipping activities, are taken from the SPEW (Speciated Particulate Emission Wizard), and EDGAR (Emission Database for Global Atmospheric Research) annual-mean or monthly mean climatologies.More details on the sources of these aerosols are given in Dentener et al. (2006).Emissions of OM (organic matter), BC and SO 2 linked to fire emissions are obtained using the Global Fire Assimilation System (GFAS) based on MODIS satellite observations of fire radiative power, as described in Kaiser et al. (2012).
Several types of removal processes are considered: dry deposition including the turbulent transfer to the surface, gravitational settling, and wet deposition including rainout by large-scale and convective precipitation and washout of aerosol particles in and below the clouds.The wet and dry deposition schemes are standard, whereas the sedimentation of aerosols follows closely what was introduced by Tompkins (2005) for the sedimentation of ice particles.Hygroscopic effects are also considered for organic matter and black carbon aerosols.
MODIS AOT data at 550 nm are routinely assimilated in a 4-D Var framework which has been extended to include aerosol total mixing ratio as extra control variable (Benedetti et al., 2009).A variational bias correction for MODIS AOD is implemented based on the operational setup for assimilated radiances following the developments by Dee and Uppala (2009).The bias model for the MODIS data consists of a global constant that is adjusted variationally in the minimization based on the first-guess departures.Although simple, this bias correction works well in the sense that the MACC analysis matches well the debiased MODIS observations.The observation error covariance matrix is assumed to be diagonal, to simplify the problem.The errors have been chosen based on the departure statistics and are prescribed as fixed values over land and ocean for the assimilated observations.The aerosol background error covariance matrix used for aerosol analysis was derived using the Parrish and Derber method (also known as NMC (National Meteorological Center, now National Centers for Environmental Prediction) method; Parrish and Derber, 1992) as detailed by Benedetti and Fisher (2007).This method was long used for the definition of the background error statistics for the meteorological variables and is based on the assumption that the forecast differences between the 48 h and the 24 h forecasts are a good statistical proxy to estimate the model background errors.

A1.2 FNMOC/NRL NAAPS
The Navy Aerosol Analysis and Prediction System (NAAPS) is the US Navy's offline chemical transport model running with dust, smoke, sulfate and sea salt at 1 • ×1 • /27 levels based on the Danish Eulerian Hemispheric Model (Christensen, 1997;Witek et al., 2007).NAAPS has generated quasi-operational forecasts since 1999 at the Naval Research Laboratory (NRL; http://www.nrlmry.navy.mil/aerosol),but in 2008 became fully operational global at Fleet Numerical Meteorology and Oceanography Center (FNMOC; http: //www.usno.navy.mil/FNMOC/).At the time of writing this paper, NAAPS was in the process of a major revision change, including an increase in resolution to 1/3  Hogan and Rosmond, 1991).A first order approximation of secondary organic aerosol (SOA) processes is adopted in which production of SOA from its precursors is assumed to be instant and included with the original sulfate species to form a combined pollution species.Anthropogenic emissions come from the ECMWF MACC inventory (Lamarque et al., 2010).Smoke from biomass burning is derived from nearreal-time satellite-based thermal anomaly data used to construct smoke source functions (Reid et al., 2009;Hyer et al., 2013).In the NAAPS version for the ensemble, dust is emitted dynamically and is a function of modeled friction velocity to the fourth power, surface wetness and surface erodibility, which in this model run is adopted from Ginoux (2001) with regional tuning.Sea salt modeling in ensemble version of NAAPS is the same as Witek et al. (2007) and sea salt emission is driven dynamically by sea surface wind.Analysis fields assimilate quality controlled collection 5 MODIS AOT (Zhang and Reid, 2006;Zhang et al., 2008, Hyer et al., 2011) with minor corrections from Multi-angle Imaging Spectroradiometer (MISR).Aerosol wet deposition is constrained at analysis time with satellite retrieved precipitation within the tropics (Xian et al., 2009).

A1.3 JMA MASINGAR
The Japan Meteorological Agency (JMA) has been providing the Aeolian dust information to the general public via its website (http://www.jma.go.jp/en/kosa/) since January 2004.The operational numerical dust forecast in JMA is based on the Model of Aerosol Species in the Global Atmosphere (MASINGAR) (Tanaka et al., 2003), which is coupled with the MRI (Meteorological Research Institute)/JMA98 AGCM (atmospheric general circulation model).The model includes five aerosol species, namely, sulfate (and its precursors), black carbon, organic aerosols, sea salt and mineral dust.The model resolutions were set to a T106 Gaussian horizontal grid (approximately 1.125 • × 1.125 • ) and 30 vertical layers from the surface to a height of 0.4 hPa.Dust and sea salt particles are logarithmically divided into ten discrete size bins from 0.1 to 10 µm in radius.The operational version of MASINGAR calculates the emission flux of dust as a function of the third power of 10 m wind velocity (Gillette, 1978), soil moisture, soil type, snow cover and vegetation cover.Anthropogenic emissions curing this study period are taken from the Representative Concentration Pathways Database (RCP), but have since transitioned to using MACCity.The ICAP-MME version of MASINGAR used updated dust aerosol module based on the saltationbombardment dust emission theory, which is described in Tanaka and Chiba (2005).The transport of aerosol is calculated with 3-D semi-Lagrangian advection, subgrid vertical diffusion, moist convective transport and gravitational settling.Removal processes of aerosol include rainout, washout and dry deposition.JMA is planning to update the operational dust forecast model to be based on the latest global climate model MRI-CGCM3 (Meteorological Research Institute coupled general circulation model) (Yukimoto et al., 2012).

A1.4 NASA GEOS-5
The Goddard Earth Observing System model, version 5 (GEOS-5), is the latest version of the NASA Global Modeling and Assimilation Office (GMAO) Earth system model (Rienecker et al., 2008).GEOS-5 serves NASA (1) as a state-of-the-art modeling tool to study climate variability and change, (2) as a provider of research quality reanalyses for use by NASA instrument teams and the scientific community at large and (3) as a source of near-real-time forecasts of aerosol and atmospheric constituents in support of NASA aircraft campaigns (e.g., SEAC4RS, ARC-TAS, HS3, DISCOVER-AQ).GEOS-5 includes components for atmospheric circulation and composition (including atmospheric data assimilation), ocean circulation and biogeochemistry, and land surface processes.Components and individual parameterizations within components are coupled under the Earth System Modeling Framework (ESMF, Hill et al., 2004).GEOS-5 has a mature atmospheric data assimilation system that builds upon the Grid-point Statistical Interpolation (GSI) algorithm jointly developed with NCEP (Rienecker et al., 2008) and is currently evolving into a hybrid ensemble-variational assimilation system.The version of GEOS-5 documented here is run in near-real time at a 0.25 • ×0.3125 • latitude × longitude horizontal spatial resolution on 72 hybrid sigma levels from the surface to approximately 85 km.In addition to traditional meteorological parameters (winds, temperatures, etc.;Rienecker et al., 2008), GEOS-5 includes modules to represent aerosols (Colarco et al., 2010) and tropospheric-stratospheric chemical constituents (Pawson et al., 2008), and their respective radiative feedback.Aerosols are handled through a version of the GO-CART (Chin et al., 2002) run online and radiatively coupled in GEOS-5.GOCART treats the sources, sinks and chemistry of dust, sulfate, sea salt and black and organic carbon aerosols.Aerosol species are assumed to be external mixtures.Aerosol and precursor emissions are based on a number of sources.Biofuel emissions of black and organic carbon are based on Park et al. (2003) with emissions from shipping based on EDGAR.Other anthropogenic sources follow from Streets et al. (2009).For SO 2 we have anthropogenic emissions from EDGAR except for aircraft emissions, which are based on the NASA AEAP program.Natural sources of organic carbon are derived from the GEIA terpene inventory (assuming 10 % conversion to secondary organic aerosol).DMS emissions (converted to SO 2 and then to sulfate) are based on Kettle et al. (1999).Dust and sea salt emissions are as in Colarco et al. (2010).Total mass of sulfate and hydrophobic and hydrophilic modes of carbonaceous aerosols are tracked, while for dust and sea salt the particle size distribution is explicitly resolved across five non-interacting size bins for each.Both dust and sea salt have wind-speed dependent emission functions, while sulfate and carbonaceous species have emissions principally from fossil fuel combustion, biomass burning and biofuel consumption, with additional biogenic sources of organic carbon.Sulfate has additional chemical production from oxidation of SO 2 and dimethyl sulfide (DMS), as well as a database of volcanic SO 2 emissions and injection heights.For all aerosol species, optical properties are primarily from the commonly used Optical Properties of Aerosols and Clouds data set (OPAC, Hess et al., 1998).Except for dust, optical properties are derived under the assumption of spherical particles.Our dust optical properties data set incorporates non-spherical dust properties based on Meng et al. (2010).GEOS-5 is driven by biomass burning emissions from the Quick Fire Emission Data set (QFED, Darmenov and da Silva, 2013).In near-real time, GEOS-5 includes assimilation of AOT observations from the MODIS sensors on both Terra and Aqua satellites.Based on the work of Zhang and Reid (2006) and Lary et al. (2010), we originally developed a back-propagation neural network to correct observational biases related to cloud contamination, surface parameterization and aerosol microphysics.This empirical algorithm has been adapted to retrieve AOT directly from cloud-cleared MODIS reflectances.Online quality control is performed with the adaptive buddy check of Dee et al. (2001), with observation and background errors estimated using the maximum likelihood approach of Dee and da Silva (1999).Following a multi-channel AOT analysis, three-dimensional analysis increments are produced exploring the Lagrangian characteristics of the problem, generating local displacement ensembles intended to represent misplacements of the aerosol plumes.

A2.1 NMMB/BSC-CTM
The NMMB/BSC-CTM (Pérez et al., 2011;Jorba et al., 2012;Spada et al., 2013) is an online chemical weatherprediction system for meso-to global-scale applications, developed at the Barcelona Supercomputing Center-Centro Nacional de Supercomputación (BSC-CNS) in collaboration with NOAA/NCEP, NASA Goddard Institute for Space Studies, the International Research Institute for Climate and Society (IRI) and the University of California Irvine.BSC-CNS maintains global and regional dust and sea salt aerosol forecasts based on NMMB/BSC-CTM.The BSC-Dust module is fully embedded into the Non-hydrostatic Multi-scale Model NMMB developed at NCEP (Janjic et al., 2011;Janjic and Gall, 2012).It includes a physically based dust emission scheme, which explicitly takes account of saltation and sand-blasting processes (White, 1979;Marticorena and Bergametti, 1995;Marticorena et al., 1997) and assumes a viscous sublayer between the smooth desert surface and the lowest model layer (Janjic, 1994;Nickovic et al., 2001).For the source function, the model uses the topographic preferential source approach after Ginoux et al. (2001) and the NESDIS (National Environmental Satellite, Data, and Information Service) vegetation fraction climatology (Ignatov and Gutman, 1998).It includes an eight-bin size distribution within the 0.1-10 microns radius range according to Tegen and Lacis (1996) and radiative interactions (Mlawer et al., 1997).The NMMB/BSC-Dust model has been evaluated at regional and global scales (Pérez et al., 2011;Haustein et al., 2012).Complementing the dust atmospheric aerosol, a sea salt module (Spada et al., 2013) is implemented through 8 bins in the dry radius interval (0.1-15 microns) to describe mass concentrations and optical depth.A sub-bin lognormal approach is assumed to calculate the optical properties of the particles.Several open-ocean emission schemes are implemented, accounting for bubble-bursting and spume production (Gong, 2003;Monahan et al., 1986;Smith et al., 1993;Mårtensson et al., 2003;Jaeglé et al., 2011).The water uptake is taken into account by using prescribed growth factors for different relative humidity values following Chin et al. (2002).The parameterizations of the aerosol processes affected by the water uptake (i.e., sedimentation, dry deposition, wet deposition) have been extended to wet particles from those implemented in the dust module.These developments are steps forward towards a unified multi-scale chemical weather-prediction system at BSC-CNS.This sea salt component is not in the ICAP-MME but may be included at a later date.

A2.2 NOAA NCEP NGAC
Since September 2012 NOAA NCEP has been providing 5day global dust forecasts at 1 • ×1 • /64 levels once per day (at 00:00 UTC cycle) from the NEMS GFS Aerosol Component (NGAC) system.It includes a five-bin size distribution with effective radius at 1, 1.8, 3, 6 and 10 microns.The NGAC is an online global atmospheric aerosol model developed at NCEP in collaboration with NASA GMAO (Lu et al., 2010(Lu et al., , 2013)).The forecast model is the NCEP's Global Forecast System (GFS) within the NOAA Environmental Modeling System (NEMS) infrastructure (Black et al., 2009).The aerosol component is NASA's GOCART within GMAO's GEOS-5 earth system model (Colarco et al., 2010).While NGAC has the capability to forecast dust, sea salt, sulfate and carbonaceous aerosols, the initial NGAC operational production in 2012 only generated global dust forecasts.NCEP is planning to upgrade the operational NGAC in 2015 to include the full suite of aerosols using real-time fire emissions from satellites observations.

59Figure 1
Figure 1.(a) Timeline of available data within this paper's study period.(b) Location of AERONET sites used for verification.Labels are listed in Table1.

Figure 1 .
Figure 1.(a) Timeline of available data within this paper's study period.(b) Location of AERONET sites used for verification.Labels are listed in Table1.

Figure 2 .
Figure 2. Examples of ICAP-MME products expected to be released to the public at publication of this paper for an example 72 h forecast of 2012's most significant dust events plus a secondary event over the Arabian Gulf using all six dust members.(a) Ensemble mean 550 nm AOT; (b) Mean/Spread of the six ensemble members, with the standard deviation as color and AOT isopleths; (c) Spaghetti plot of AOT 0.8 isopleth; (d) Dust warning areas where more than half of the models predict AOT > 0.8.

Figure 3 .
Figure 3. Mean and standard deviation of the ICAP-MME 550 nm AOT ensemble consensus for the December 2011-November 2012 time period.Included are the four core models of ECMWF MACC, FNMOC/NRL NAAPS, JMA MASINGAR, and NASA GMAO GEOS-5.Breakout is by boreal winter-spring (December-May) and summer-fall (June-November).Further striations are for total, fine and coarse mode optical depth.Provided in dots are the AERONET means for the same time period, although these are not pairwise with the model data.

Figure 4 .
Figure 4. Same as Fig. 3, but for pointwise maximum and minimum 550 nm AOTs drawn from the ICAP-MME's four core member seasonally averaged AOT fields.AERONET circles represent AOT means.

Figure 5 .
Figure 5. Bi-seasonal comparisons of model 550 nm AOT means with 21 core AERONET verification sites listed in Table 1.Large blue circles are ICAP-MME means.Other models are small colored diamonds.Data are stratified (left column) for December-May 2011, (right column) June-November.(a) and (b) core models and ensemble mean comparisons to total AERONET derived 550 nm AOT.(c) and (d) models vs. AERONET for fine mode particles.(e) and (f) models vs. AERONET for coarse mode particles.(g) and (h) model dust vs. AERONET Coarse for dust stations listed in Table2.NGAC is included in the dust comparison.

Figure 6 .
Figure 6.ICAP-MME 550 nm total AOT model bias as a function of forecast hour for key AERONET sites.(a) December-May boreal winter-spring period; (b) June-November boreal summer-fall.

Figure 7 .
Figure 7. Bi-seasonal comparisons of +1-day model 550 nm AOT RMSE with 21 core AERONET verification sites listed in Table 1.Large blue circles are ICAP-MME means.Other models are small colored diamonds.Data is stratified (left column) for December-May 2011, (right column) June-November.(a) and (b) core models and ensemble mean comparisons to total AERONET derived 550 nm AOT.(c) and (d) models vs. AERONET for fine mode particles.(e) and (f) models vs. AERONET for coarse mode particles.(g) and (h) model dust vs. AERONET coarse for dust stations listed in Table2.NGAC is included in the dust comparison.

Figure 9 .
Figure 9. ICAP-MME consensus root mean square error (RMSE) and fractional gross error as a function of forecast day for selected AERONET sites shown in Fig. 1b.

Figure 10 .
Figure 10.(a)-(d), bi-seasonal rank histograms of ICAP-MME members and the ensemble mean total, fine, coarse and dust AOT for all data.(e)-(h) same as previous for cases where AERONET AOT > 0.6.

Figure 11 .
Figure 11.Rank histograms for selected sites the entire 1-year study period.Included are sites considered as background or long range receptor sites (column 1), sites with intermediate loadings (column 2), and sites with high aerosol impact, segregated into all data (column 3) and those cases with AERONET AOT > 0.6 (column 4).The dominant aerosol type leading to AOTs > 0.6 are listed for sites in column 4.

Figure 12 .
Figure 12.An example of the derivation of threat scores for the CapeV site.(a) One-year time series of first-day forecasted ICAP-MME mean AOT with corresponding AERONET coarse mode AOT.Insets are MODIS RGB images for an actual and artifact dust event.(b) enlargement of (a) for the month of May 2012; (c) scatterplot of forecasted AOT against AERONET; (d) probability distribution of AERONET and forecasted AOT.

Figure 13 .
Figure 13.ICAP MME-AERONET comparisons for the Kanpur India site.Included are the (a) fine mode and (b) dust components.Marked are the 1, 1.5 and 2 geometric standard deviation lines.Also shown are scatterplots against 12 and 84 h forecasts for (c) fine mode and (d) dust, respectively.

Table 1 .
List of + 1-day biases from study core AERONET sites.Included are the 550 nm total AOTs.These are followed by list of model biases for the four core ICAP members listed sequentially low to high for each site.The ICAP-MME ensemble mean bias is bold.

Table 2 .
Same as Table1but for coarse mode AOT, and for those sites in which the coarse mode is dominated by dust.This includes the ICAP core and NGAC models.

Table 3 .
List of +1-day forecast 550 nm total AOT RMSE from study core AERONET sites.These are followed by list of model biases for the four core ICAP members listed sequentially low to high for each site.The ICAP-MME ensemble mean bias is bold.

Table 4 .
Same as Table3but for AERONET coarse mode AOT and model dust RMSE, and for those sites in which the coarse mode is dominated by dust.This includes the ICAP core and NGAC models.
Starting in 2008, ECMWF has been providing daily aerosol forecasts including dust as part of the EU-funded projects GEMS, MACC and MACC-II.All data are publicly available online at http://www.copernicus-atmosphere.eu.In the near future, these forecasts will be available operationally as part of the EU Copernicus Atmospheric Services which provides predictions of global atmospheric composition and regional European air pollution.The current model resolution is ∼ 80 km, and it is envisaged that this will be increased to ∼ 40 km in the operational phase expected to start in 2015.
• , new meteorology through NAVGEM (Navy Global Environmental Model), up-dated data assimilation and improved fire emissions.For this study, an intermediate version of the model is used for consistency.The 1 • ×1 • model is driven by the 0.5 • Navy Operational Global Analysis and Prediction System (NOGAPS;