Using Measurements for Evaluation of Black Carbon Modeling

The ever increasing use of air quality and climate model assessments to underpin economic, public health, and environmental policy decisions makes effective model evaluation critical. This paper discusses the properties of black carbon and light attenuation and absorption observations that are the key to a reliable evaluation of black carbon model and compares parametric and nonparametric statistical tools for the quantification of the agreement between models and observations. Black carbon concentrations are simulated with TM5/M7 global model from July 2002 to June 2003 at four remote sites (Alert, Jungfraujoch, Mace Head, and Trinidad Head) and two regional background sites (Bondville and Is-pra). Equivalent black carbon (EBC) concentrations are calculated using light attenuation measurements from January 2000 to December 2005. Seasonal trends in the measurements are determined by fitting sinusoidal functions and the representativeness of the period simulated by the model is verified based on the scatter of the experimental values relative to the fit curves. When the resolution of the model grid is larger than 1 • × 1 • , it is recommended to verify that the measurement site is representative of the grid cell. For this purpose, equivalent black carbon measurements at Alert, Bondville and Trinidad Head are compared to light absorption and elemental carbon measurements performed at different sites inside the same model grid cells. Comparison of these equivalent black carbon and elemental carbon measurements indicates that uncertainties in black carbon optical properties can compromise the comparison between model and observations. During model evaluation it is important to examine the extent to which a model is able to simulate the variability in the observations over different integration periods as this will help to identify the most appropriate timescales. The agreement between model and observation is accurately described by the overlap of probability distribution (PD) curves. Simple monthly median comparisons , the Student's t-test, and the Mann-Whitney test are discussed as alternative statistical tools to evaluate the model performance. The agreement measured by the Student's t-test, when applied to the logarithm of EBC concentrations, overestimates the higher PD agreements and underestimates the lower PD agreements; the Mann-Whitney test can be employed to evaluate model performance on a relative scale when the shape of model and experimental distributions are similar.


Introduction
The term black carbon (BC) refers to light absorbing carbon (LAC) aerosols produced by combustion processes (Bond and Bergstrom, 2006;Cachier, 1998), also known as soot.Its ability to absorb visible light is due to the molecular structure, which is characterized by valence electrons in πorbitals; the energy gaps between bonding and anti-bonding molecular orbitals is small enough to allow the absorption of visible light (380-880 nm).BC concentration is determined from light attenuation measurements at a constant wavelength or at multiple wavelengths.A detailed list of direct and indirect LAC measurement techniques is reported by Bond and Bergstrom (2006).The measurements discussed here are direct measurements of the light attenuation produced by aerosol either suspended in the air or deposited on a filter.
BC modeling is a crucial component of global climate model because BC is the most ubiquitous aerosol component which absorbs visible light and is a key contributor to the climate forcing by aerosol (Liousse et al., 1993;Cooke et al., 1997;Andreae and Merlet, 2001;Menon et al., 2002;Hansen et al., 2005;Ramanathan and Carmichael, 2008).BC aerosols can be transported far from their sources and affect Published by Copernicus Publications on behalf of the European Geosciences Union.
S. Gilardoni et al.: Evaluation of black carbon modeling remote and pristine areas, as well as rural and urban locations (Hansen et al., 1988;Hara et al., 2008;Eleftheriadis et al., 2009).The first attempt to model BC concentration used a simple approach assuming that BC was externally mixed and had a constant size distribution (Haywood and Shine, 1995;Tegen et al., 1997).More recent models included BC size fractionation and internal mixing with soluble aerosol components (Jacobson, 2001;Wilson et al., 2001;Stier et al., 2005).
Model evaluation is crucial because it helps understanding pollutant dynamics and makes it possible to use model output to quantify the effect of environmental policies on regional and global scale (Koch et al., 2009).Several approaches have been proposed to obtain accurate assessments of local and regional air quality model performance (Chang and Hanna, 2004;Seigneur et al., 2000;Boylan and Russell, 2006;Yu et al., 2006;Zhang et al., 2006); on the contrary, evaluation of large scale models (like global models) is commonly performed in a simplistic manner by comparing monthly means (or medians) of model data and observations.This study is part of the EUCAARI -European Integrated Project on Aerosol Cloud Climate and Air Quality Interactions (Kulmala et al., 2009) and AEROCOM -Aerosol Comparisons between Observations and Models (http://nansen.ipsl.jussieu.fr/AEROCOM)(Textor et al., 2006) efforts to improve model validation methods.The objectives of this paper are: -the optimization of a validation procedure that includes evaluation of observation representativeness, -the discussion of statistical tools to compare model and observations.
First, we recognize that the model/observation comparison is only meaningful when the temporal and spatial representativeness of observations are tested.In addition, light absorbing interfering species and uncertainties in the BC optical properties might strongly affect the interpretation of the observations.Equivalent black carbon measurements performed at remote and regional background sites are then used to validate TM5/M7 model; mathematical and statistical tools to evaluate model/experiment agreement and to highlight reasons of model failure are suggested and discussed.

Observation and model data
In the following sections we describe the measurement sites used for the model/observation comparison, the physical meaning of equivalent black carbon based on light attenuation measurements, and the model details.The choice of the measurement sites was based on the availability of long time series of observations.Unfortunately, the number of locations that offer public data access and long time coverage is limited and this study could not be extended to more than six sites.

Monitoring sites
Light attenuation measurements from the WMO Global Atmosphere Watch (GAW) sites at Alert (Canada), Bondville (Illinois), Ispra (Italy), Jungfraujoch (Switzerland), Mace Head (Ireland), and Trinidad Head (California) have been used in this study.Ispra, Jungfraujoch, and Mace Head are also part of the European Monitoring and Evaluation Programme (EMEP) network.Site coordinates, periods during which measurements are available (till 2005), and the details of the measurements made are summarised in Table 1.This set of sites is chosen as representative of both remote sites and rural background sites.In addition it gives us the opportunity to illustrate diverse issues that make black carbon modeling challenging.Table 2 lists sources of light absorbing carbonaceous aerosol that affect the sites with their corresponding seasonal cycles.
Alert is a continental remote site operated by Environment Canada, located on the edge of Lincoln Sea and Ellesmere Island, in the Canadian Arctic.The air masses circulation is driven by a semi-permanent pressure system able to transport continental airflow from Europe and Asia to the Arctic.Light absorbing carbon observed in the Arctic region is mainly produced by anthropogenic activities and biomass burning emissions transported from northern Europe, Siberia and southern Asia (Sharma et al., 2004(Sharma et al., , 2006;;Koch and Hansen, 2005); open fires in the boreal forest of North America contribute to the black carbon levels during the summer season (Stohl et al., 2006).
Bondville (IL) is a rural monitoring site and GAW regional station that is part of the NOAA/ESRL and IM-PROVE networks, located 8 km south of Champaign (population 75 000).The prevailing sources of light absorbing carbon are emissions from local traffic, coal-fired power plants located in Tennessee Valley, Ohio River Valley and western Ohio, Canadian forest fires, local crop harvesting and agricultural field burning (Sheridan et al., 2000;Kim et al., 2005;Malm et al., 2007).
Ispra site is located in northern Italy, 50 km north of Milan (population 1 300 000); although Ispra is classified as a rural background site, concentration of carbonaceous aerosol is comparable to those of urban European background sites, likely due to the impact of regional air pollution from the nearby Po Valley (Yttri et al., 2007).The main sources of light absorbing carbonaceous aerosols are central Europe continental pollution, local traffic and residential heating (Lazaridis et al., 2002;Putaud et al., 2002).
Mace Head Atmospheric Research Station is situated on the west coast of Ireland, and is one of the most westerly locations of Europe; the closest urban area is Galway City at 88 km to the east (population 70 000).Air masses that reach Mace Head are most of the time from the west i.e. clean air masses from the North Atlantic Ocean; less frequently, the site is affected by anthropogenic air pollution from continental Europe and United Kingdom (Jennings   , 1997;Derwent et al., 2001;Kleefeld et al., 2002;Yttri et al., 2007).Investigation of temporal trend of carbonaceous aerosol shows that removal mechanisms (i.e.precipitation) play a major role in modulating light absorbing carbon concentration (Cooke et al., 1997).
Jungfraujoch is a remote high elevation site located in the southern part of Switzerland.Local emissions are expected to be negligible, in fact the site can be reached only through an electrical railway, tourist facilities are heated electrically, and local wastes are transported to the valley (Baltensperger et al., 1997).Most of the time the site is representative of continental European background, while slope winds are responsible for planetary boundary layer influence during summer (Nyeki et al., 1998).
Trinidad Head is located on the northern coast of California, about 40 km north of Eureka (population 25 000).This remote maritime site with negligible anthropogenic influence and prevailing maritime airflow is also part of the NOAA/ESRL network.Trinidad Head is affected by North American continent pollution, Asian outflow including dust and anthropogenic emissions, and smoke from forest fires (Ogren et al., 2002;Goldstein et al., 2004;Perry et al., 2004;Ramanathan et al., 2007;Zhao et al., 2008).

Light attenuation and equivalent black carbon measurements
Determination of BC from visible light attenuation implies three assumptions that are seldom verified: -carbonaceous species are the only absorbing species at the measurement wavelength, -organic species do not contribute to light absorption, and -the relationship between light attenuation and BC concentration is well known and predictable.
Iron oxides, like hematite, are common components of aerosols containing dust and absorb visible light at wavelengths shorter than 600 nm (Sokolik and Toon, 1999;Alfaro et al., 2004;Yang et al., 2009).Some organic species also absorb radiation at wavelength shorter than 600 nm (Kirchstetter et al., 2004;Alexander et al., 2008).Light absorbing organic carbon, also known as brown carbon, biases BC measurements especially in regions strongly affected by biomass burning where brown carbon concentrations are higher than BC concentrations (Andreae and Gelencser, 2006;Barnard et al., 2008).The conversion of light attenuation into BC requires the knowledge of the attenuation cross section σ , which defines the ability of a species to absorb radiation at a certain wavelength.σ can be determined theoretically based on the Mie theory (Liousse et al., 1993).For ambient aerosol σ is more often determined experimentally using the Lambert-Beer equation and assuming that the BC concentration is equal or proportional to the elemental carbon (EC) concentration measured by thermo-optical methods.σ values reported in literature for combustion aerosols range over one order of magnitude (2-25 m 2 g −1 ), while values reported for BC ambient aerosol at 550 nm are between 7 and 12 m 2 g −1 (Rosen and Hansen, 1984;Sharma et al., 2002;Bond and Bergstrom, 2006;Corrigan et al., 2006;Junker et al., 2006;Yang et al., 2009).σ depends on particle morphology, composition of internally mixed soot particles and soot aging (Liousse et al., 1993;Bond and Bergstrom, 2006), consequently it varies from location to location, and in the same location, from season to season.To take into account this dependency and variability we refer to the light absorbing carbon as equivalent black carbon (EBC) and not BC.The term BC implies to have optical properties equivalent to those of soot, while EBC is operationally defined as the amount of light absorbing carbon that would give the same signal of soot in an optical instrument like the aethalometer (Andreae and Gelencser, 2006).Caution must be exercised when EBC concentrations are calculated from light attenuation measurements assuming a known and constant σ .Light attenuation measurements are made available through the World Data Centre for Aerosol (WDCA), as part of the GAW program, by the investigators responsible at each site (originally: http://wdca.jrc.ec.europa.eu,now http: //www.gaw-wdca.org/).Light attenuation measurements are performed with an aethalometer (Hansen et al., 1984) at Alert, Ispra, Jungfraujoch, and Mace Head and with a Particle Soot Photometer PSAP (Bond et al., 1999) at Bondville and Trinidad Head; time resolution of observation data is 1 h.The aethalometer and PSAP are filter-based measurement techniques.They measure attenuation of light (B ATN ) due to particles accumulated on a filter, according to: I and I 0 are the intensities of light transmitted through a loaded filter and a blank, respectively.The presence of a collection substrate might affect light attenuation: some of the aerosol particles can be embedded into the filter fibrous structure and do not contribute to light attenuation; at the same time filter fibres might be able to enhance aerosol absorption due to the scattering of the light beam and consequently increase of the beam path.
During aethalometer measurements light absorption (B abs ) is calculated from light attenuation through the equation where C and R are empirical calibration factors (dimensionless): C describes the multiple scattering of light produced by the filter fibers and depends on the filter properties, R depends on the amount of particles embedded in the filter structure and the particle optical properties (Weingartner et al., 2003).EBC concentration (in µg m −3 ) can be calculated from light attenuation (B ATN in Mm −1 ) according to the equation reported by Weingartner et al. (2003) and Liousse et al. (1993) where σ is the attenuation cross-section (in m 2 g −1 ) and varies with the radiation wavelength λ.For particles small compared to the radiation wavelength, according to the Mie theory, it can be proved (Jennings and Pinnick, 1980) that the attenuation coefficient is inversely proportional to the radiation wavelength (λ); Jennings and Pinnick (1980) calculated that σ of spherical pure black carbon particles is equal to 14625/λ (nm).To calculate EBC from light absorption, Eqs.
(2 and 3) can be written as: We define σ * (in m 2 g −1 ) the empirical calibration coefficient that relates light absorption and EBC, and for aethalometer it is equal to σ divided by C and R.
Light attenuation measurements reported by PSAP already include an empirical calibration that takes into account absorption by the filter medium and nonlinear response due to filter overloading.They are corrected using the method described by Bond et al. (1999) for flow rate and spot size inaccuracy.σ * values reported in literature for PSAP span over a fairly wide range (5 to 25 m 2 g −1 ) depending on the optical properties of ambient aerosols and on the thermooptical method used for calibration (Snyder and Schauer, 2007).The measurements are available at the two sites located in the United States.The comparison of EBC from light integrating plate measurements (LIPM) and EC from thermo-optical measurements at rural sites of the United States (Huffman, 1996;Malm et al., 1994) suggests that σ * is equal to 20 m 2 g −1 ; nevertheless Malm et al. (1994) used 10 m 2 g −1 during their discussion of the IMPROVE network data, as suggested by theoretical calculations; in fact, the value 20 m 2 g −1 would result from the misidentification of OC/EC split during the thermo-optical measurements and the consequent underestimation of EC.EBC concentrations presented here for Bondville and Trinidad Head are calculated assuming σ * equal to 10 m 2 g −1 .
A summary of σ * values used at each site is reported in Table 1, together with C and R correction factors when σ * was calculated from the theoretical σ value of 14625/λ.
At Bondville, Junfraujoch, and Trinidad Head EBC concentrations are available for PM 1 and PM 10 size fractions, while at Alert, Ispra, and Mace Head EBC concentrations are available for total suspended particles (TSP).

Black carbon modeling
The global model TM5/M7 simulated BC concentrations at the monitoring sites from July 2002 to June 2003, with 1hour time resolution.
The TM5 model is an off-line global transport chemistry model (Krol et al., 2005) that uses the ECMWF IFS (Integrated Forecast System) meteorological data.It has a spatial global resolution of 6 • × 4 • and a two-way zooming algorithm that allows regions (e.g.Europe, North America, Africa and Asia) to be resolved at a finer resolution of 1 • × 1 • .To smooth the transition between the global 6 • × 4 • region and the regional 1 • × 1 • domain, a domain with a 3 • × 2 • area resolution has been added.In the present application the zoom is over Europe, therefore outside the Eu-ropean domain the resolution of the model is 6 • × 4 • .In the current version, the model has a vertical resolution of 25 layers, defined in a hybrid sigma-pressure coordinate system with a higher resolution in the boundary layer and around the tropopause.The height of the first layer is approximately 50 m.
The model transport has been extensively validated using 222Rn and SF6 (Krol et al., 2005;Peters et al., 2004) and its further validation was performed within the EVERGREEN Project (Bergamaschi et al., 2006).
Gas phase chemistry is calculated using the CBM-IV chemical mechanism (Gery et al., 1989;Gery, 1989) modified by Houweling et al. (1998), solved by means of the EBI method (Hertel et al., 1993).Dry deposition is calculated using the ECMWF surface characteristics and the resistance method (Ganzeveld and Lelieveld, 1995).
Wet deposition is the dominant removal process for most aerosols.Removal occurs in convective systems (convective precipitation −2.14 Tg C year −1 ) and in large scale systems (5.86 Tg C year −1 ) that are associated with weather fronts.The in-cloud removal rates, which depend on the large scale precipitation and are differentiated for cumulus and stratiform precipitation, are calculated following Guelle et al. (1998) and Jeuken et al. (2001).Aerosol below-cloud scavenging is parameterised accordingly to Dana and Hales (1976).TM5 is coupled to the microphysical aerosol model M7 (Vignati et al., 2004(Vignati et al., , 2010a,b) ,b) that allows the resolution of particle masses and numbers.The model ability to evaluate BC concentration has been evaluated in Vignati et al. (2010a).The particles are represented by seven internally mixed classes, using a pseudo-modal approach.Four classes are for soluble mixed particles representing nucleation, Aitken, accumulation, and coarse mode, and three are for the insoluble (Aitken, accumulation, and coarse mode).Nucleation, condensation of sulphuric acid and coagulation between the particles are included.BC can be present in the insoluble and soluble Aitken modes, and in the soluble accumulation and coarse modes.The ageing is accomplished by considering condensation of H 2 SO 4 and coagulation with soluble particles, which form a soluble shell around the hydrophobic core and the particles are moved from the insoluble to the soluble/mixed modes.The other components in M7 are mineral dust, primary organic carbon (OC), sulfate, and sea salt.All particles are removed in case of convective wet removal.In presence of large scale precipitations only the soluble accumulation and coarse modes are scavenged by rain, while the remaining modes (insoluble Aitken, accumulation, coarse and soluble nucleation and Aitken) form interstitial aerosols and they are not in-cloud removed.The soluble accumulation and coarse modes are assumed to form cloud droplets where the oxidation of SO 2 by O 3 and H 2 O 2 takes place; the resulting sulphate is partitioned between the two modes as function of number of particles present in the modes (Stier et al., 2005).Below cloud scavenging removes all the particles as function of their size.The BC emission inventories used in the present application are from Bond et al. (2004) for the anthropogenic contributions (fossil and bio fuels) and from van der Werf et al. ( 2004) for large scale biomass burning areas.Black carbon is assumed to be insoluble when emitted.The number of BC emitted particles is calculated assuming the freshly emitted particles with number median radii of 0.03 and 0.075 µm, for fossil/bio fuel and biomass burning, respectively, and emitted in the insoluble Aitken mode with standard deviation equal to 1.59 (adapted from Dentener et al. 2006)

Results and discussion
This study uses observations from July 2002 to June 2003 where possible, to be in agreement with time period of the TM5/M7 model runs.At Ispra, where light attenuation measurements are available from January 2004 onwards, the comparison is performed using observations for the period July 2004-June 2005.In recognition of this, montly data are used in the comparisons at all sites.Since no size fractionation is applied to BC model output, TSP or PM 10 experimental measurements are used for the comparison.

Temporal representativeness of experimental measurements
First, we want to verify that the time period chosen for model evaluation is representative for the site based on longer time scale observations, when available.Figure 1 shows the time series of EBC concentrations at the different sites from January 2000 to December 2005.The variability of observations often ranges over one order of magnitude, with larger fluctuations (defined by the ratio 95th percentile to 5th percentile) at Jungfraujoch, Ispra, and Mace Head.The black lines in the six panels of Fig. 1 correspond to the fit functions defined as a sinusoidal curve: t is the time variable, EBC 0 is the mean value, A 0 is the amplitude of the sine wave, ω is the frequency, and ϕ is the phase.The model fitting uses a Levenberg-Marquardt algorithm to search for the coefficient values that minimize chi-square.To verify the presence of seasonality in the time series, the initializing value of ω is calculated assuming the periodicity equal to one year.The fitting parameters A 0 are reported in Table 1.At Alert, Jungfraujoch, and Ispra the resulting values of the amplitude (A 0 ) are significantly larger than the signal fluctuation, indicating the presence of a consistent seasonal cycle.To verify that the year July 2002-June 2003 is representative at each site, the standard deviation from the fit function over this period is compared to the standard deviation calculated for all the available measurements over 2000-2005.At all sites the standard deviation difference is smaller than 15%, and the largest differences are observed at Trinidad Head (12%), and Alert (15%).

Variability within a grid box
Ground observations are performed at a specific location and thus, from a spatial point of view, they describe the behaviour of a parameter at a point, while models simulate the investigated parameter over a surface whose size is defined by the model grid resolution.Point observations thus need to be representative of the entire model grid cell in order to be used in model evaluation; such a requirement can be a critical issue when the spatial resolution of the model is low.Model data at Ispra, Jungfraujoch and Mace Head are available with a resolution of 1 • × 1 • , while at Alert, Bondville, and Trinidad Head the model resolution is only 4 • × 6 • , that is about 500 × 500 km.The large grid cell is a non-optimal condition that we want to test in order to discuss model evaluation.EBC concentrations are compared to light absorption and elemental carbon measurements collected at other measuring stations to verify that Alert, Bondville and Trinidad Head are representative sites of the model grid cells.
Alert is a remote site, located 1700 km to the north of the Polar Circle and at more than 2000 km from Iqaluit, the closest populated area.Figure 2a compares EBC monthly medians at Alert with EBC monthly medians at Ny-Alesund and Barrow, two polar remote sites located in Northern Europe and Alaska, respectively.EBC is calculated from light absorption measurements at 880 nm with an aethalometer AE-31 at Ny-Alesund, and at 550 nm with a PSAP at Barrow.The EBC trend observed at Alert is common to the other polar locations, even if farther than 1000 km, indicating that sources, sinks, and transport pathways are similar for the three sites.This is consistent with the fact that the pristine environment surrounding Alert for hundreds of kilometres makes it representative of the 4 • × 6 • cell.Figure 2 compares EBC concentrations at Bondville (panel b) and Trinidad Head (panel c) to elemental carbon (EC) concentrations at different sites in the same 4 • × 6 • grid cell or in the closest adjacent cell.The EC sites belong to the IMPROVE network, where PM 2.5 aerosol samples are collected twice a week for 24 h.Although corresponding to a different size fraction, EC measurements are used for the comparison assuming that the amount of LAC present in particles larger than 2.5 µm is negligible compared to the amount of LAC in smaller particles.EC quantification is based on aerosol thermal properties, while EBC measurements are based on aerosol optical properties.We do not expect a perfect agreement between optical and thermal measurements (Watson et al., 2005), but at Bondville, where both EBC and EC measurements are available, their monthly medians agree within 20%, suggesting that the comparison is accurate enough for the purposes of this study.are not reported because data capture is smaller than 30%.The Trinidad Head grid cell thus demonstrates greater spatial variability than found at Bondville.The variability of EBC and EC measurements is larger during the period July-November compared to the following months; in particular, during February, March, and April the EC monthly averages differ by less than 10% of the variability range.Although EC medians are usually within the 25th-75th percentile range of EBC measurements, EBC medians are consistently higher then EC medians with higher discrepancies during spring.The use of a default specific absorption coefficient rather than a site calibrated one may be responsible for the systematic overestimation.In addition, dust can interfere with EBC measurements, especially during the spring season when Asian dust plumes might reach the west coast of the United States (Perry et al., 2004;Zhao et al., 2008).
Assuming that the concentration of dust in PM 10 at Trinidad Head is equal to 10 µg m −3 the corresponding light attenuation would be about 0.3 Mm −1 (Yang et al., 2009); this would lead to an overestimation of EBC of 30 ng m −3 , which is not enough to explain the discrepancy observed between EBC and EC measurements.To further investigate the role of absorption cross section coefficient uncertainty, EBC concentrations are calculated assuming σ * equal to 20 m 2 g −1 , as suggested by Huffman (1996).Figure 3 shows a better agreement between EC and EBC when a larger σ * is employed and the differences between EC and EBC measurements are reduced to 30%.The larger absorption cross section coefficient of Trinidad Head compared to Bondville could be justified by the presence of more processed aerosols at the coastal site; aged aerosol is mixed with sulfate and other organic species that enhance ability of BC to absorb visible light through the lens effect (Liousse et al., 1993).EBC at Bondville compares well with EC at the same site and at the nearest IMPROVE sites, indicating both similarity among rural and regional background sites, and an adequate choice of σ * to estimate EBC from optical measurements.On the contrary, the Trinidad Head grid cell is characterized by a larger variability of EC and EC is systematically smaller than EBC; the discrepancy is likely due to an incorrect σ * rather than dust interference on optical measurements; at Trinidad Head the uncertainty of black carbon optical properties and the variability of the grid cell compromise the use of EBC observations for model evaluation.

Statistical methods
The statistical tools used in the following paragraphs include Fast Fuorier Transformation (FFT) analysis, skewness analysis, median comparison, variability analysis, probability distribution curve comparison, Student's t-test, and Mann-Whitney test.The FFT analysis is used to identify the frequency and amplitude of data modulation.The skewness analysis investigates the symmetry of data distribution (of model or observations); when skewness is zero the data are symmetrically distributed around their average value.Median comparison and variability analysis are used to test the similarity of two data distributions: the median comparison focuses on the similarity of central values of each distribu-tion, while the variability analysis investigates how data are spread around the central values.The similarity of two data distributions can further be investigated with the overlap of the probability distribution curves, the Student's t-test, and the Mann-Whitney test; in the first two tests, higher value of the output (overlap area and test probability) corresponds to higher similarity of data distributions, while the Mann-Whitney test shows lower output parameter Z for higher similarity of distributions.
In the following sections we will use the different statistical tools within three tests with three different objectives.The comparison of monthly medians is used to evaluate the ability of the model to simulate the time trend of measurements over a longer time period.The variability analysis illustrates how data spread around the median of model and observations, to understand the time scale of possible model failure.The comparison of model and observation probability distributions summarizes the information from the first two tests, quantifying the agreement of measurements and simulations during a time period shorter than one year, i.e. a month or a season.5th, 25th, 50th, 75th, and 95th percentiles are calculated for each month.The median rarely falls in the middle of the 25th-75th percentile interval or the 5th-95th percentile interval, indicating that the 1-hour resolution data (both observations and model) when considered month by month are not normally distributed.Skewness is used to quantify the deviation relative to the normal Gaussian curve.If the skewness is zero, the distribution is symmetric like the bell shaped normal curve, with a positive skew the distribution curve shows a tail towards larger values, and with a negative skew the distribution curve shows a tail towards smaller values.The observations are characterized by positive skew that averages between 1 (Bondville and Ispra) and 3.5 (Mace Head).The model data distributions are characterized by lower skew parameters, varying between 0 (Ispra) and 2.2 (Mace Head).The mean value of an asymmetric data distribution is strongly affected by even a small number of extremely high or extremely low values.For this reason we use the monthly medians rather than the monthly means to compare the time series of model and observations.At Alert, Ispra and Jungfraujoch, where a seasonal cycle is detected (see Sect. 3.1), the model reproduces this cycle, with maxima during winter at Alert and Ispra, and during summer at Jungfraujoch.At Bondville and Mace Head no specific trend is observed in either the measurements or the model during the year, while at Trinidad Head observations and model shows a similar trend with slightly higher EBC/BC concentrations during fall compared to spring.

Comparison of model and observation time series
At Alert, observed monthly medians range between 7 and 140 ng m −3 , while model monthly medians range between 4 and 25 ng m −3 .The winter maxima are due to fewer precipitation events and atmospheric circulation patterns that control long-range transport (Sharma et al., 2006); the modelled BC concentration during winter and spring is more than 5 times smaller than observed EBC levels; the similarity of Alert EBC time series with those recorded at other polar sites (Fig. 3a) suggests that the model/observation disagreement is due to the inability of the model to simulate the Arctic haze.During early summer and fall (July-November) the disagreement between model and observations averages 40% of observations.
The highest concentrations are recorded at Bondville and Ispra.Bondville has a limited month-to-month variation; median values of observations and model vary in the range 275-570 ng m −3 , and 540-850 ng m −3 , respectively; on average model monthly medians are 70% higher than observations, with larger overestimation during January-March.The higher concentrations calculated by the model might be attributed to the model grid size, which includes rural areas together with large cities (i.e.Chicago), whose plume rarely reach Bondville.At Ispra model monthly medians vary between 1000 and 3000 ng m −3 , while observed monthly medians vary between 600 and 3000 ng m −3 .Winter measurements are three times higher compared to summer, most probably due to larger emissions from residential heating, and more frequent events of temperature inversion (Yttri et al., 2007).The model overestimates observations significantly during August and March, when modelled BC is roughly twice as large as measured EBC.
Experimental medians at Jungfraujoch vary between 10 and 140 ng m −3 .The range is well simulated by the model (10-110 ng m −3 ), which on average differs from the observation by less than 5%.Higher concentrations are observed and simulated during summer, when polluted air masses from lower altitude are transported to the site by thermal convection (Baltensperger et al., 1997;Cozic et al., 2008).
At Mace Head, observation and model medians vary in the interval 25-130 ng m −3 and 30-270 ng m −3 , respectively.The highest concentrations are reported during September, October, and December, both by observations and by model.On average model medians overestimate observations by 70%.
Observed monthly medians at Trinidad Head are in the range 230-480 ng m −3 , while model medians are 60-450 ng m −3 .EBC and EC concentration maxima are in October and November due to a larger contribution of North American emissions in the cold season.Model underestimates by 30% the observation medians, and larger disagreement takes place during spring (on average 50%).

Comparison of variability
In this section we illustrate how to investigate the ability of model to simulate observation variability, and to verify if the model fails in simulating the daily modulation.First we compare 1-hour and 24-hour resolution data to identify cases when variability is dominated by daily modulation, and then we use the FFT analysis to verify how model describes this modulation.
The comparison of monthly median values alone does not describe in detail the ability of the model to simulate the observations; the data variability around the median plays a crucial role in defining the agreement of model and experiment.Since model and observation data are not normally distributed, the standard deviation is not a meaningful parameter to represent data variability, instead the variability is expressed as the difference between 95th and 5th percentiles of 1-hour resolution data.The monthly variability of observations ranges from 75% of the median at Jungfraujoch to 750% at Mace Head, and variability of model ranges from 90% of the median at Alert to 500% at Mace Head. Figure 5a shows the ratio of model variability to observed variability per each month.The ratio is effectively an indication of how much variability of the observations is explained by the model.The ratio is expected to be smaller than one; primarily due to the effects of local sources, which are not represented in the model, but are responsible for fluctuations in pollutant concentration on a short time scale and can be responsible for larger variability of observations compared to model simulations.In addition, variability of meteorological parameters is not resolved on the large model grid.Most of the time the ratio is in the range 0.5-1 indicating that the model explains more than 50% of measurement variability.Bondville in winter and Ispra in summer show ratios larger than 1.5, while at Alert the ratio is always below 0.5.
To understand the causes of model failure, the variability of 1-hour resolution and 24-hour resolution data are investigated.Figure 6 illustrates this with two examples.When the 1-hour resolution data variability is dominated by the diurnal fluctuations (case A), the variability of 24-hour resolution data is much smaller than the variability of 1-hour resolution data and their ratio tends to 0. On the contrary, when the daily modulation is smoother (case B), the variability of 24-hour and 1-hour resolution datasets are similar and their ratio tends to 1. Figure 5 shows the variability ratio of 24-hour resolution data to 1-hour resolution data of measurements (panel b) and simulations (panel c).Generally, the ratio ranges between 0.5 and 1 indicating that the daily modulation does not dominate the 1-hour resolution data variability.Conversely the modelled concentrations at Ispra and Bondville have a stronger daily modulation, and at the two sites the ratio shows a seasonal trend with values smaller than 0.5 during spring and summer.
The daily variability is further investigated with the Fast Fourier Transformation (FFT) analysis, which identifies the frequency and amplitude of data modulations.Table 3 reports the intensity of the daily modulations of modelled concentrations and observations.At Jungfraujoch the daily modulation is perfectly simulated by the model.At Alert, the model fails to simulate the daily variability, but since the 1hour resolution data variability is larger than the daily modulation, this is not enough to explain the low ratio observed in Fig. 5a.At Bondville and Ispra the model overestimates the daily variability.Bondville does not show any seasonal differences between warm and cold seasons, while the model overestimation of daily variability at Ispra is 5 times higher during summer compared to winter.
Model overestimates observation variability at Bondville and Ispra, and underestimates it at Alert.At Ispra the overestimation is limited to the summer months, when the model variability is dominated by the daily modulation and the model overestimates daily modulation of observations.At Bondville and Alert the model fails in reproducing both daily modulations and 1-hour resolution data variability.

Model/observation agreement
We compare here four different statistical tools to quantify the agreement between model and observations.The most accurate way to quantify this agreement is the overlap of the probability distribution curves.Considering that this method is elaborate, we discuss limitations and advantages of alternative and more common statistical tools: comparison of medians, Student's t-test, and Mann-Whitney test.Probability distributions (PD) of 1-hour resolution model data and observations are calculated for each month.The distribution curves are normalized to 100 and the overlap of model and observation curves is calculated.The overlap is 1-h res Variability 24-h res Variability P 95 P 95 Day EBC concentration ple of overall variability and day-to-day variability defined as the difference bepercentile (P 95 ) and 5 th percentile (P 5 ) of 1-hour resolution data and 24-hour ta, respectively.42 Fig. 6.Example of overall variability and day-to-day variability defined as the difference between 95th percentile (P 95 ) and 5th percentile (P 5 ) of 1-hour resolution data and 24-hour resolution data, respectively.fects the agreement between model and observations.The PD overlap averages 25% (0-53%) at Alert, 57% (37-72%) at Bondville, 64% (34-83%) at Ispra, 68% (57-78%) at Jungfraujoch, 57% (43-69%) at Mace Head, and 53-57% (39-64%) at Trinidad Head.To validate the comparison at Ispra, the period July 2002-June 2003 has been compared to the period July 2004-June 2005 using 24-hour resolution sulfate measurements performed at the same site.The PD overlap of sulfate data, on average 68%, indicates that the two periods do not differ significantly for sulfate concentration and support the assumption that the black carbon measurements can be used for model evaluation.The worst agreement between model and experiment are observed at Alert, with minimum during the spring haze period, as expected from the median comparison.The highest values are recorded at Jungfraujoch where the agreement is constantly above 55% during the entire year.Bondville and Ispra show the same seasonal trends already seen, with higher agreements during the warm and the cold season, respectively.At Trinidad Head the model/experiment agreement is lower than 50% in winter if σ * is assumed equal to 20 m 2 g −1 and in spring if σ * is assumed equal to 10 m 2 g −1 .This result might indicate a seasonal dependence of σ * or the model failure during one season.A better knowledge of black carbon optical properties at the site is mandatory to use light attenuation measurements for model evaluation.The median agreement and PD overlap do not co-vary linearly: when the PD agreement is larger than 50%, the median comparison might overestimate or underestimate the agreement by more than 20%.
The agreement between two datasets characterized by normal distribution is usually calculated with Student's t-test, which evaluates the difference of the means based the dis-tribution standard deviations.Although model and experiment data are not normally distributed, we compare the ttest probability to the PD overlap to assess the error introduced by the normal distribution assumption.Observations and model data are treated as two independent samples with means X obs and X mod ; t is calculated as: t = X obs − X mod s 2 obs /n obs + s 2 mod /n mod (7) where s and n indicates standard deviations and number of data points, respectively.The null hypothesis states that the population means of observations and model data are the same.The comparison of t with the Student's t-distribution table defines the probability that the null hypothesis is true.Student's t-test probabilities co-vary with PD overlap (r 2 = 0.70) (Fig. 8a), although the t-test overestimates the agreement when the PD agreement is higher than 60%.The t-test percentages show a positive bias equal to 14% ± 5%.
EBC measurements and modelled BC data are not normally distributed.Skew parameters indicate that the probability distributions are characterized by a tail towards larger values, so logarithmic transformation of the raw data should obtain a distribution that is closer to a Gaussian distribution.Monthly means and standard deviations of the logarithm of EBC and BC concentrations are used to recalculate the agreement probability with the Student's t-test (Fig 8b).The t-test probability of logarithmic data on average is not biased, but lower agreements are underestimated and higher agreements are overestimated.
The comparison of two datasets characterized by nonnormal distributions can be performed with the Mann-Whitney U-test, which is also known as Mann-Whitney-Wilcoxon test or rank sum test.The null hypothesis states that the two samples are drawn from the same population, thus their distributions are equal.Observations and model data are put into a single array in increasing order, rank are assigned to each data (the smaller value is ranked 1), the sum of ranks (R) is computed separately for model and observations, and the parameters U are calculated as: n mod and n exp are the number of model and experiment data points.Similar model and observation distributions lead to similar U obs and U mod .The difference between the two distributions is quantified by Z: where µ is the mean value of U obs and U mod (it is uninfluential if U obs or U mod is used in the formula), and σ is defined as: where N is the sum of n obs and n mod , g is the number of groups of ties, and t j is the number of tied ranks in group j .
Z is used to calculate the significance probability associated to the null hypothesis; larger Z corresponds to larger difference between U mod and U exp and thus to lower agreement between model and observations.The Mann-Whitney test is applied to model and observations: 1-hour resolution monthly data are analyzed.The smallest values are calculated for Jungfraujoch while the highest values (larger than 20) are observed at Alert, Trinidad Head, and Bondville.At Alert the largest Z values occur during spring (March and April) when the model fails in simulating the Arctic haze.At Bondville Z is smaller during summer and higher during winter, in agreement with the time trend reported in Fig. 7 where higher PD similarity is observed in summer and lower similarity in winter.At Trinidad Head the highest Z values occur in spring when model and observations mostly disagree.Z values calculated at all sites are compared to PD overlap in Fig. 8a.Z anti-correlates with the PD overlap, as expected, but occasionally distributions characterized by a satisfactory agreement (larger than 60%) are associated with high Z values (larger than 20).The low correlation (r 2 = 0.67) is likely due to the different shapes of model and observation distribution curves: although Mann-Whitney test does not assume that the two populations of data are normally distributed, it requires that the shapes of their distributions are identical.For this reason we re-applied the Mann-Whitney test only to those months for whom model does not overestimate observation variability (see Sect. 4.2).Considering only months when the model to observed variability ratio is smaller than 1, the correlation of Z values and PD overlap area is improved (r 2 = 0.79).Monthly medians, Student's t-test, and Mann-Whitney test are employed to quantify the agreement between model and observation data distributions and compared to the probability distribution curve overlap method.The three tools are based on different sets of assumptions that are not always encountered in real population samples.If these assumptions are not discussed and the methods are applied without discrimination to the entire dataset, the result can be misleading.
In summary, the comparison of median is inaccurate and it should be discouraged because ignores the data distribution around the medians.The Student's t-test is a good choice only when the data distribution is similar to a normal distribution; this assumption is rarely verified and its testing can be time consuming.The Mann-Whitney test is a satisfactory alternative to the comparison of PD curves, but it requires that observation and model data distribution have the same shape; this can be easily verified by comparison of observation and model variability.

PD overlap summary
We explore here the physical meaning of PD overlap and how PD overlap analysis can be used to discuss model performance.For this purpose we focus on four cases that were already discussed in the previous sections.
At Alert the model is not able to reproduce the transport of pollutants from lower latitudes, which leads to PD overlap ranging between 0 and 25%.In March and April model distribution data is narrow and centered around BC concentrations smaller than 50 ng m −3 , while observations are shifted towards higher concentrations; from December to February observations distribution is bimodal and the model simulates only the lower concentration mode, again below 50 ng m −3 (Fig. 9a).The model does not show higher BC concentrations since it fails in describing pollutant transport taking place at the end of winter and in spring.A similar effect is observed at Jungfraujoch in summer, when local sources might affect the sampling site.The model fails in describing the transport of emissions from lower altitudes (Fig. 9b), and thus it does not simulate the occasionally higher BC concentrations in May-August.This results in PD overlap slightly smaller during summer than in winter.
Figure 9c shows that at Ispra in summer the model data distribution has a bimodal shape and the mode at concentrations larger than 2000 ng m −3 is not present in the observations.This result is likely due to the model www.atmos-chem-phys.net/11/439/2011/Atmos.Chem.Phys., 11, 439-455, 2011 underestimation of dilution processes, which also leads to the overestimation of daily cycle amplitude, as seen during variability analysis.At Bondville the model does not simulate the lower BC concentrations (Fig. 9d).This could be due to underestimation of dilution, but more likely by the large model grid cell that contains industrial, urban, and background areas.

Conclusions
Model evaluation studies are based on the comparison of model data and observations.The goal of this paper is to discuss, compare and evaluate a variety of statistical and analytical tools in order to better inform such evaluation.
Temporal representativeness and variability within a grid box of observations must be assessed before observations are used to evaluate model simulations.Temporal representativeness is tested when longer time series of measurements are available.Seasonal trend over multiple years is defined by a fitting sinusoidal curve and the scatter of data points relative to the fitting function is used to define the similarity of the investigated time series to previous and following Site measurements should be representative of the entire grid cell defined by the model.Investigation of spatial representativeness is recommended especially for lower model resolution.During the present study EBC measurements are compared to EC and light attenuation measurements collected at different sites inside the same grid cell (or close to the investigated location).
This study illustrates a suite of tests that should be applied to evaluate black carbon model with EBC measurements.Three tests are proposed: the comparison of monthly medians, the variability analysis, and the comparison of monthly probability distribution curves.The median comparison illustrates the ability of the model to simulate the seasonal trend of black carbon measurements.The variability explained by the model is defined as the ratio of measurement variability to model variability of 1-hour resolution data; it allows a preliminary evaluation of model performance.The comparison of normalized frequency distribution curves (or probability distribution PD curves) leads to a more complete and accurate model evaluation, compared to the previous tests.The agreement, defined by the overlapping area of the two curves, takes into account both the similarity of the median values and the variability of the data distributions.In addition the comparison of PD curves reflects the similarity of the data distributions.
Student's t-test and Mann-Whitney test are compared to the PD agreement to evaluate their limitations.Student's ttest requires that data are normally distributed and Mann-Whitney test requests that data distributions have the same shape.Typically neither EBC measurements nor modelled BC concentration time series fulfil such requirements and these tests might overestimate the agreement when PD over-laps are higher than 60%.Applying a logarithmic transformation to EBC measurements and BC simulations to reduce skewness does not improve the accuracy of Student's t-test.
Study of long measurement time series and comparison of observations collected over a larger geographical area are required to guarantee temporal and spatial representativeness of experimental data.The agreement between model and observation should be defined by the PD curve overlapping areas; the use of Student's t-test and Mann-Whitney is discouraged.When model fails, the comparison of model and data variability, together with Fast Fourier Transform analysis, helps to identify the reason of model/observation disagreement.
The IMPROVE sites compared to Bondville are Cadiz KY (36.7 • N 87.8 • W), LivoniaIN (38.5  • N 86.3 • W), and Mammoth Cave KY (37.1 • N 86.1 • W).The three sites are located in the grid cell adjacent to Bondville and are less than 380 km away.The four sites do not show a seasonal cycle over the study period.Other than January medians at Mammoth Cave, EC medians are within the 25th-75th percentile of EBC measurements, and on average the difference between EC and EBC medians is less than 5%.EBC monthly medians at Trinidad Head are compared to EC monthly medians at Redwood National Park CA (41.6 • N 124.1 • W), Lassen Volcanic National Park CA (40.5 • N 121.5 • W), Lava Beds CA (41.7 • N 121.5 • W), Trinity CA (40.7 • N 122.8 • W), Bliss State Park CA (38.9 • N 120.1 • W), and Point Reyes National Seashore CA (38.1 • N 122.9 • W).Point Reyes is located to the south of Trinidad Head cell at 350 km distance.Redwood National Park and Point Reyes National Seashore are marine sites, situated along the northern coast of California like Trinidad Head.EBC monthly medians during December 2002 and January 2003

Fig. 3 .
Fig. 3. Scatter plot of EBC concentration at Trinidad Head calculated with σ * equal to 10 m 2 g −1 (full diamonds) and 20 m 2 g −1 (open diamonds) versus the average EBC concentration at the IM-PROVE sites; black line indicates the 1:1 ratio.

Figure 4
Figure 4 shows box whisker plots of 1-hour resolution data of observed EBC concentrations (on the left side), and TM5/M7 model BC concentrations (on the right side), from July 2002 to June 2003.Ispra panel reports measurements from July 2004 to June 2005.Boxes and whiskers are not reported during months characterized by capture smaller than 30%.5th,25th, 50th, 75th, and 95th percentiles are calculated for each month.The median rarely falls in the middle of the 25th-75th percentile interval or the 5th-95th percentile interval, indicating that the 1-hour resolution data (both observations and model) when considered month by month are not normally distributed.Skewness is used to quantify the deviation relative to the normal Gaussian curve.If the skewness is zero, the distribution is symmetric like the bell shaped normal curve, with a positive skew the distribution curve shows a tail towards larger values, and with a negative skew the distribution curve shows a tail towards smaller values.The observations are characterized by positive skew that averages between 1 (Bondville and Ispra) and 3.5 (Mace Head).The model data distributions are characterized by lower skew parameters, varying between 0 (Ispra) and 2.2 (Mace Head).The mean value of an asymmetric data distribution is strongly affected by even a small number of extremely high or extremely low values.For this reason we use the monthly medians rather than the monthly means to compare the time series of model and observations.At Alert, Ispra and Jungfraujoch, where a seasonal cycle is detected (see Sect. 3.1), the model reproduces this cycle, with maxima during winter at Alert and Ispra, and during summer at Jungfraujoch.At Bondville and Mace Head no specific trend is observed in either the measurements or the model

Fig. 4 .
Fig. 4. Time series of experimental EBC and model BC concentrations, from July 2002 to June 2003; central point of the boxes correspond to median values, boxes indicate 25th-75th percentile intervals, and whiskers indicate 95th-5th percentile intervals.

Fig. 5 .Fig. 5 .
Fig. 5. Panel a: explained variability, defined as ratio of model variabil ability, from July 2002 to June 2003, at Alert (blue triangle), Bondville Ispra open squares), Jungfraujoch (light blue circles), Mace He Trinidad Head (black triangles).Panel b and c: variability ratio of 24-h 1-hour resolution data for experiment (b) and model (c).

Fig. 8 .
Fig. 8. Model experiment agreement determined with Student's t-test (green) and median comparison (red) plotted versus the PD overlap percentage; rank test Z (blue) is plotted on the right vertical axis.Linear fits are shown for Student's t-test and Mann-Whitney test; black line is the 1:1 ratio.Panels show agreement of original data (a) and agreement of logarithmic transformation of original data (b).

Figure 8
Figure 8 compares the PD agreement to the probability of agreement evaluated with parametric (comparison of medians and Student's t-test) and nonparametric (Mann-Whitney test) statistical tools.The PD overlap is first compared to the median agreement, defined as: Median agreement = 100 − 100 |median exp − median mod | median exp (6)

Fig. 9 .
Fig. 9. PD curves of model (blue) and observations (red) at Alert in February (a), at Jungfraujoch in June (b), at Ispra in July (c), and at Bondville in November (d) .

Table 1 .
Site description, model and experimental details.

Table 2 .
Sources and seasonality of light absorbing aerosol at the monitoring sites.

Table 3 .
Intensity of daily modulation from FFT data analysis (ng m −3 ).Figure7shows the PD overlap of monthly data at each site and during each month.PD overlap values at Trinidad Head are reported for σ * equal to 10 m 2 g −1 and 20 m 2 g −1 to understand how the uncertainty in σ * af-