A three-dimensional variational data assimilation system for multiple aerosol species with WRF / Chem and an application to PM 2 . 5 prediction

Z. Li, Z. Zang, Q. B. Li, Y. Chao, D. Chen, Z. Ye, Y. Liu, and K.-N. Liou Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California, USA Joint Institute for Regional Earth System Science and Engineering, University of California, Los Angeles, California, USA Department of Atmospheric and Oceanic Sciences, University of California, Los Angeles, California, USA Brookhaven National Laboratory, Upton, New York, USA Remote Sensing Solutions, Inc., Pasadena, California, USA


Introduction
Aerosols are airborne suspensions of minute particles and have fundamental impacts on the earth's environment and climate and on human health.To understand the physical, chemical, radiative and dynamical processes associated with aerosols, a variety of sophisticated atmospheric chemistry models have been developed and coupled with atmospheric models (Seinfeld and Pandis, 2006;Fast et al., 2006;Binkowski and Roselle, 2003).Parallel to the model development, the last decade has witnessed great progress in the technology for observing aerosols, ranging from in-situ speciated measurements to satellite-and surface-based remote sensing, leading to the establishment of a variety of observing networks (e.g., Diner et al., 2004).
The progress in both aerosol models and observing networks facilitates the development and implementation of aerosol data assimilation.Data assimilation is a methodology for the integration of all available observations into models to produce aerosol fields, which can be used to provide 4266 Z. Li et al.: Three-dimensional variational data assimilation for aerosol model initial conditions to improve forecasts, perform diagnostic analyses, and for other applications.The meteorological community has employed data assimilation for more than three decades to provide optimal initial conditions for numerical weather prediction models and to develop reanalysis products for a wide spectrum of applications (Kalnay, 2003).In recent years, data assimilation has increasingly been applied to aerosol analysis.
Here we present an aerosol three-dimensional variational data assimilation (3-DVAR) scheme.This 3-DVAR scheme is developed for the WRF/Chem (Grell et al., 2005), with the comprehensive aerosol scheme known as the Model for Simulating Aerosol Interactions and Chemistry (MOSAIC) (Zaveri et al., 2008).MOSAIC was first implemented in the Weather Research and Forecasting (WRF) model coupled with Chemistry (WRF/Chem) by Fast et al. (2006).Leveraging the generality of MOSAIC, this 3-DVAR system is used to estimate multi-species concentrations and their size distributions, and to assimilate observations of not only total concentrations, but also speciated concentrations.
The outline of this paper is as follows: in Sect.2, some challenges faced by aerosol data assimilation and the strategies for meeting those challenges are described.In Sect.3, a brief description of the MOSAIC scheme is given, and the analysis variables used in the 3-DVAR scheme are defined; Sect. 4 presents the 3-DVAR scheme and explains in detail the relationship between the observed and modeled variables.In Sect.5, the method used to estimate the background error covariance is detailed, with an emphasis on the vertical correlations.In Sect.6, the 3-DVAR system is applied to the prediction of PM 2.5 in the Los Angeles basin during the Cal-Nex (California Research at the Nexus of Air Quality and Climate Change) 2010 field experiment, and assessments of the performance of the data assimilation methodology and forecasts are presented.Finally, a summary and discussion are given in Sect.7.

Challenges and strategies
Aerosol data assimilation faces some fundamental difficulties beyond those encountered in meteorological data assimilation.The difficulties arise primarily in treating a large number of state variables.A sophisticated model may explicitly treat more than a dozen species, which involve not only mass concentrations, but also number concentrations.In particular, a large number of state variables are required to represent the wide range of aerosol size distributions, ranging from a few nanometers to around 100 µm in diameter.A modal method represents the size distributions by fitting the size distribution to a set of log-normal functions (Whitby, 1978).Four log-normal functions, known as the nucleation, Aitken, accumulation and coarse particle modes, are often used (Seinfeld and Pandis, 2006).Another method uses a set of bins of increasing size, and is referred to as a sectional or bin method (Gelbard et al., 1980;Jacobson, 1997).With either of these two methods, scores of state variables are then needed.A third method is to track the moments of the aerosol population (McGraw, 1997;Bauer et al., 2008).Although the moment method has been shown to be quite efficient, it still can lead to a large number of variables.
The large number of state variables poses multiple challenges in the practical implementation of data assimilation.Data assimilation is computationally demanding by nature.For generic formulations of different data assimilation schemes and their relationships, readers are referred to Daley (1991), Courtier et al. (1994), Ide et al. (1997), Cohn (1997), Ménard and Daley (1996) and Li and Navon (2001).Among the widely used data assimilation schemes, the threedimensional variational data assimilation (3-DVAR) scheme -the type used here -is the most computationally efficient.A 3-DVAR scheme iteratively minimises a cost function that depends on error covariance matrices.Its demands on computational and memory resources increase rapidly as the number of state variables increases (see Sect. 4).A greater challenge is related to the limited number of observations.There are only a few hundred aerosol measuring surface stations in the United States, one of most dense networks in the world, and the measurements are limited to a few parameters and to the surface.Instrumented-aircraft measurements, which often provide aerosol profiles, are even more limited in space and time.Satellite measurements provide global coverage, and the most common satellite observations are aerosol optical depths (AODs).The available observations are insufficient to constrain all the variables at spatial and temporal scales dictated by the inhomogeneity of aerosol emission sources and their relatively short atmospheric residence time.
Due to the afore-mentioned computational and observational requirements, it will be practically impossible for many years to establish a data assimilation system that can simultaneously and reliably estimate mass and number concentrations of all the major species at the size bins.We must, therefore, judiciously choose a limited number of variables to estimate, based on the aerosol treatment schemes, the desired accuracy for a given application, the geographical regions (urban, remote continental areas, oceans, etc.), the effectiveness of the use of observations available, and the computational feasibility.
Along this line, two types of schemes have been implemented.One scheme can be traced back to Collins et al. (2001), who assimilated AODs in a three-dimensional chemical transport model.This scheme uses AODs as the only data assimilation analysis variable and estimates their increments.Then the estimated AOD increments are transformed into species mass concentration increments, which are in turn added to the model forecast.Following Collins et al. (2001), a number of studies assimilated AODs in global and regional chemical transport models (Yu et al., 2003;Generoso et al., 2007;Adhikary et al., 2008;Zhang et al., 2008;Sandu and Chai, 2011).Another scheme first estimates the total aerosol mixing ratio increment and then distributes the total increment to mass concentrations of individual species.This type of scheme was presented in Benedetti and Janiskova (2008), Benedetti et al. (2009) and Mangold et al. (2011).In air quality oriented applications, the total mass concentration, which is often the concentrations of PM 2.5 (sizes smaller than 2.5 µm) and PM 10 (size smaller than 10 µm), is used as the analysis variable (Denby et al., 2008;Tombette et al., 2009;Pagowski et al., 2010).
The aerosol data assimilation methods described above can be characterised as two-step schemes (Liu et al., 2011).The first step is to estimate the increments of lumped variables, such as AODs, PM 2.5 and PM 10 ; the second step is to calculate the increments of individual species at specified sizes from the lumped variable increments.These two-step schemes are suboptimal.The optimal scheme would be to directly estimate all the prognostic variables in the forecast model.When a relatively simplified aerosol scheme is used, the number of state variables may be limited so that all the state variables can be estimated simultaneously (e.g., Liu et al., 2011;Sekiyama et al., 2010).We envision that for many years to come, two-step schemes will inevitably be used for most comprehensive and sophisticated aerosol schemes, which treat scores of variables to represent mass concentrations and number concentrations with multiple size distributions, but the number of lumped variables should gradually increase as the number of observations increases and computational technology advances.

Aerosol scheme and analysis variables
When a two-step data assimilation scheme is chosen, different data assimilation analysis variables are generally required to define for different aerosol schemes.Here we use the MO-SAIC scheme in WRF/Chem.
MOSAIC treats eight aerosol species, including elemental/black carbon (EC/BC), organic carbon (OC), nitrate (NO − 3 ), sulfate (SO 2− 4 ), chloride (Cl − ), ammonium (NH + 4 ), sodium (Na + ).Other unspecified inorganic species such as silica (SiO 2 ), other inert minerals, and trace metals are lumped together as "other inorganic mass" (OIN).A sectional approach is adopted to represent aerosol size distributions.The size bins are defined by their lower and upper dry particle diameters.Each bin is assumed to be internally mixed so that all particles within a bin have the same chemical composition, while particles in different bins are externally mixed.The number of size bins, denoted as N bin here, can be specified as appropriate for different applications.In MOSAIC, hence, the state variables consist of mass concentrations of as many as 8N bin , along with the number concentrations of as many as N bin .While MOSAIC offers flexibility in specifying the number of size bins, four or eight bins are commonly used.Here 4 bins are used.Accordingly, the state variables consist of 32 mass concentrations and 4 number concentrations.
To proceed to formulate a two-step scheme, we first define a set of lumped variables based on the aforementioned state variables.As in previous studies (Denby et al., 2008;Benedetti et al., 2009;Tombette et al., 2009;Pagowski et al., 2010), we may use PM 2.5 or PM 10 as the analysis variables.In this study, four bins used are located between 0.039-0.1 µm, 0.1-1.0µm, 1.0-2.5 µm, and 2.5-10 µm.The total mass concentration of PM 2.5 or PM 10 can be expressed as a summation across according size bins.Here we introduce more analysis variables and form lumped variables consisting of the total mass concentrations of the aforementioned eight species.Specifically, one data assimilation analysis variable is the total mass concentration of one aerosol species, that is, the sum of the mass concentrations across the size bins used.We note that the data assimilation is designed to allow further lumping some of these eight variables for particular applications, as well as using the first two, three or all four bins.
Once the lumped variables are formed, the two-step scheme is first to obtain the analysis increment of these lumped variables by solving a 3-DVAR problem, and then to partition these increments into increments for the individual species in each of the size bins.These partitioned increments are added to the model forecast to produce the final analysis.Accordingly, the number of aerosols for each bin is adjusted.

Data assimilation scheme
Here we describe the basic framework and then address the partition of the increments of the lumped variables into the increments for the individual species in each of the size bins.

Basic formulation
We consider five analysis variables, x EC , x OC , x NO 3 , x SO 4 and x OTR , which are the total mass concentrations of EC, OC, NO − 3 , SO 2− 4 , and OTR.The chloride, ammonium, sodium and other inorganic aerosol concentrations are lumped into one single variable OTR.Here we do not use eight species, but only five species as analysis variables for simplicity and also because the speciated measurements that are assimilated later correspond to these five species.Following the notation suggested by Ide et al. (1997), we express these five vectors as x, that is, x T = x T EC , x T OC , x T NO 3 , x T SO 4 , x T OTR , where "T" stands for transpose.The incremental form of the 3-DVAR cost function is written as: Here δx is the N vector, known as the incremental state variable, which is defined as δx = x − x b , where x b is the forecast or background state generated by the MOSAIC scheme is known as the observation innovation, where y is an observation vector and the M × M matrix R is the observation error covariance associated with the observation y.The M × N matrix H is an observational operator that maps the state variable to the observation and is assumed to be linear here.The minimisation solution is the so-called analysis increment δx a , and the final analysis is x a = x b + δx a .This analysis is statistically optimal as a minimum error variance estimate (e.g., Jazwinski, 1970;Cohn, 1997) or a maximum likelihood (Bayesian) estimate if both forecast and observation errors have Gaussian distributions.

Construction of background error covariance
For a given set of observations, the performance of a 3-DVAR scheme is dictated by the specified background error covariance B and the observational error covariance R in Eq. ( 1).R can generally be specified in a straightforward way and will not be discussed in detail here.Statistically, an accurate estimate of B is required to render the analysis x a the maximum likelihood estimate.More specifically, B plays the role of spreading out observational information contained in y to nearby model grid-points, smoothing out small scale noise and enforcing basic dynamic balance constraints.
In practice, however, B is incorporated in a suboptimal way.The primary reason is that B is too large to handle numerically.For a high resolution model such as that used here, the number of model grid points is on the order of 10 6 .The number of elements in B is, therefore, 10 12 multiplied by the square of the number of analysis variables.With this size, B cannot be explicitly manipulated.A simplification or parameterisation of B is required.
To pursue simplifications, we use the following factorisation where D is the root-mean-square error (RMSE) matrix, a diagonal matrix whose elements are RMSEs, and C is the correlation matrix.With this factorisation, the RMSE and correlation matrices can be described and prescribed separately.
Since D has an effective dimension the same as that of δx, it is computationally treatable.The simplification applies primarily to C.
To simplify C, we use a three step procedure.The Cholesky factorisation is first applied.Since C is symmetric and positive definite, the Cholesky factorisation is where the matrix C 1 2 is a lower triangular matrix.This Cholesky factorisation is used to ensure the symmetry and positive definiteness of C and reduce the computational cost.
Using this Cholesky factorisation, we can transform the analysis variable δx to δz through δx = DC 1 2 δz.
(4) Substituting Eq. ( 4) into Eq.( 1), we obtain the desired form of Eq. (1) as (5) The transformed cost function is generally better conditioned and, thus, this transformation expedites the convergence when it is iteratively minimised.We here assume that the background errors of different types of aerosols are not correlated.This is an ad hoc assumption and is used simply to circumvent the computational complexity of treating the cross-correlations between different types of aerosols.Using this assumption, C becomes a block diagonal matrix, with the main diagonal blocks being the correlation matrices of individual types of aerosols, and we have where C BC , C OC , C NO 3 , C SO 4 and C OTR are the background error correlation matrices associated with the five types of aerosols.
Finally, a fundamental simplification of C can be achieved following Li et al. (2008a, b), in which a Kronecker product method is used to construct C. Let C S denote the background error correlation matrix of one species, where S stands for EC, OC, NO − 3 , SO 2− 4 or OTR.We can then approximately express C S as where ⊗ denotes Kronecker product, which is also known as vector or tensor product (Graham, 1981).Here x, y and z stand for the three coordinate directions in longitude, latitude and the vertical.C Sx is, thus, an n x × n x correlation matrix in the x direction.Accordingly, C Sy is an n y × n y correlation matrix in the y direction, and C Sz an n z × n z correlation matrix in the z direction.Here n x , n y and n z are the numbers of grid points in the x, y and z directions, respectively.The most desirable property of Eq. ( 7) is that these three onedimensional correlation matrices are computationally treatable since n x , n y and n z are on the order of 10 2 to 10 3 in size in present atmospheric chemical models.Another desirable property of Eq. ( 7) is that we have the Cholesky factorisation Because of their treatable dimensions, C Sy and C 1 2 Sz are always pre-computed and saved, which renders this 3-DVAR scheme particularly efficient computationally.As showed in Li et al. (2008a, b), Eqs. ( 7) or (8) allows incorporating some forms of inhomogeneity and anisotropy in the correlations, while a particular desirable advantage of Eq. ( 7) is the capability of representing complex vertical correlations, which will be further discussed when the estimates of C Sx , C Sy and C Sz are described in Sect. 5.

Increments of lumped variables to species
The data assimilation scheme formulated above estimates five lumped variables formed from the mass concentrations of the aerosol species treated in MOSAIC, but the ultimate analysis solution is concerned with the mass concentrations of those individual species.By definition, the increment of a lumped variable can be expressed as where S again stands for EC, OC, NO − 3 , SO 2− 4 , or OTR.Here δm Sl is the mass concentration increment of one MOSAIC species for a single size bin.Since EC, BC, NO − 3 and SO 2− 4 are MOSAIC species, the summation is across all the size bins, and L equals N bin , the number of the size bins used.
Since OTR includes four MOSAIC species, the summation is for the four species and across all the size bins and L, thus, becomes 4N bin for OTR.We calculate the analysis increment of mass concentration for one MOSAIC species and one size bin as where σ Sl are the root-mean-square (RMS) of the mass concentration background error for each species and size bin.Equation ( 10) is used since it can be derived as a minimum error variance estimation with the constraint (Eq.9) and under the assumption that δx a S has no error and the background errors associated with m Sl are uncorrelated.
With δm a Sl at our disposal, we can obtain the final analysis mass concentrations for each MOSAIC species and size bin as where c rSl is a positive value.We use c rSl to ensure that m a Sl is non-negative, and no reduction is made to the background state when the background concentration is lower than the observational errors.In practice, aerosol observational errors are often not well characterised.In the current application, we empirically specify c rlS = 0.1σ Sl .
After the mass concentration increments are added to the background state, the number concentration increments should be added accordingly.For simplicity, we assume that the ratio of the number concentration to the mass concentration for each size bin remains to the same as it was in the background state.Let us denote this ratio as γ .γ is calculated from the background state.The analysis of the number concentrations is where n b k is the background number concentration within one size bin, and m a k is the total mass concentration summed for all the species within one size bin and computed from m a Sl given by Eq. ( 11).

Observational operators
In present observing networks, in-situ observations primarily consist of total surface concentrations of particular matter, such as PM 1.0 , PM 2.5 or PM 10 , and speciated concentrations of EC, OC, NO − 3 , SO 2− 4 and others.In this study, mass concentrations of PM 2.5 from Air Resources Board (ARB) of California Environmental Protection Agency (available at http://www.arb.ca.gov/aqmis2/aqdselect.php),and the daily mean speciated mass concentrations from the IMPROVE (Interagency Monitoring of Protected Visual Environments) network (http://vista.cira.colostate.edu/improve/) are assimilated.
The total concentrations of particular matter are indirect observations, and the observational operators H represent the summation of the concentrations of all the species across the appropriate size bins.The observational operator for the concentration of PM 2.5 represents the summation of the concentrations of all the species across the first three size bins (see Sect. 3).The observational operator often involves both horizontal and vertical interpolations from the model grid to observation locations.Here we assume that the lowest model level is at the surface and, thus, no vertical interpolation is applied for surface observations like ARB PM 2.5 .
The observational operator for IMPROVE observations is more complicated.IMPROVE observations are not concentrations at a given time but daily mean concentrations.To assimilate them, we first compute the daily mean observation innovation for the period of time 12 h before and after the data assimilation time.The computed innovation is then assumed to be constant during the period of time and, thus, assimilated as the innovation at the given time.The assumption used here is similar to that used in the First-Guess at the Appropriate Time (FGAT) method, which has been employed in several meteorological data assimilation systems (e.g., Simmons, 2000;Lorenc et al., 2000).

WRF/Chem configuration and background error covariance estimate
The WRF/Chem model is configured as a nested set of three spatial domains (Fig. 1).The large domain encompasses the western United States and adjacent coastal region (36 km grid), the middle domain a smaller portion of the western states and the California coast (12 km grid), and the small domain the Los Angeles Basin (4 km grid).The nesting is twoway for both interior domains.Each domain has 40 vertical levels with the top at 100 hPa.The vertical grid is stretched to place the highest resolution in the lower troposphere.The analyses presented here will be primarily confined to the 4 km domain.
We use version 3.3 of WRF/Chem.In WRF/Chem, the chemistry (both aerosol and gas-phase) and meteorological components are fully coupled (Grell et al., 2005;Fast et al., 2006).The parameterisations of physical processes used that are most relevant to aerosols are summarised as follows: the Monin-Obukhov surface layer scheme, the Yonsei University (YSU) boundary-layer scheme, the Morrison 2-moment microphysics scheme, the Noah land surface model and the Dudhia radiation scheme for longwave and shortwave interactions with clear-air and clouds (Skamarock et al., 2008 and references therein).For the chemical processes, the MOSAIC (4 bin) aerosol scheme is used.The Carbon Bond Mechanism (CBM-Z) scheme is used for the Gas-phase chemistry processes.The emissions were derived from National Emission Inventory 2005 (NEI'05) for both aerosols and trace gases (McKeen et al., 2002).
In Sect.4.2, we developed an approximate expression for the background error covariance.According to Eqs. ( 2) and ( 7), we need to estimate the RMSE matrix D and onedimensional correlation matrices for each analysis variable.
To estimate these matrices, we follow a methodology used in meteorology.An overview of methods to diagnose background error statistics for application in Numerical Weather Prediction (NWP) is provided in Bannister (2008).The main approaches are based on either statistics of observations minus model differences at observation locations, or on model fields generated on the model grid that can be used statistically as a proxy of the background error; this second method is known as the NMC method (Parrish and Derber, 1992).The observation innovation method cannot be used in aerosol data assimilation because of the lack of three-dimensional speciated observations and, hence, the NMC method is used here.
The NMC method has been used for estimating the aerosol concentration background error covariance (Benedetti and Fisher, 2007;Kahnert, 2008).In the European Centre for Medium-Range Weather Forecasts (ECMWF) 4-DVAR system (Benedetti and Fisher, 2007), the differences between 48 h and 24 h forecasts of aerosol mixing ratios for sea salt, desert dust and continental particulates are assumed to be a proxy of the background error.While the NMC method has not yet been fully justified in the context of aerosol data assimilation, the ECMWF 4-DVAR system indicates that it can be useful at least in some circumstances.We estimate the aforementioned error covariance matrices following Benedetti and Fisher (2007).
We generate one month of the 48 h and 24 h forecast differences from 15 May 2010 to 14 June 2010.The forecast is initialised using the North American Regional Reanalysis (NARR) (Mesinger et al., 2006).The interior boundary conditions and sea surface temperatures are updated at each initialisation, with the lateral boundary conditions updated continuously throughout the forecast.Note that the NARR .Vertical distribution of the root-mean-square of the background errors in centration for five species, estimated using the differences between 24 forecasts valid at the same time.fields do not include any aerosol variables.The initial conditions for the aerosol species are simply from the forecast without being updated.Hence, the forecast difference arises from the difference in the meteorological fields, which in turn give rise to differences in vertical and horizontal transports, dry and wet depositions, and chemical processes that are sensitive to temperature, moisture and cloud water content.It is assumed that these differences are representative of the short-term forecast errors in transport-related aerosol processes and can be used to calculate background error statistics.We note that the forecast differences for one month have been used, but a longer time series could be beneficial for robust statistics.
We directly estimate the RMSE matrix D. The domain average RMSEs for five species are shown in Fig. 2. The RM-SEs differ among the species.The largest RMSE is associated with NO − 3 , while the smallest RMSE is associated with OC.The vertical distributions of the RMSE for all the species display a relatively rapid decrease with height.
The fine structures of the RMSE vertical distribution are related to boundary layer heights.In Fig. 3, the boundary layers primarily consist of marine layers with depths of less than 400 m, and inland boundary layers with depths of around 1000 m.There is a noticeable increase in the SO 2− 4 RMSEs at the boundary layer height.
The three-dimensional correlation matrix C S has been approximately factorized and expressed as the Kronecker product of three one-dimensional correlation matrices C Sx , C Sy and C Sz for each species in Eq. ( 7), and these three onedimensional matrices need to be estimated.Although the horizontal correlation matrices C Sx and C Sy can be directly computed using the proxy of the background error (Li et al., 2008b), we here use Gaussian functions to represent then correlation functions for constructing C Sx and C Sy , and also  assume that the correlations are isotropic.With these two assumptions, a correlation function between two points r 1 and r 2 in the horizontal can be expressed as c S (r , where L S is the horizontal correlation scale.The correlation length scale L S becomes the only parameter that need to be estimated.This correlation length scale is estimated using the proxy of the background error.The correlation c S reduces to the value of e −1/2 when the distance of two points r 1 and r 2 is measured L S .This distance averaged over the model domain is used as the estimate of L S .The estimated L S are 36, 32, 20, 52, 48 km for the five species, EC, OC, NO − 3 , SO 2− 4 , and OTR, respectively.The estimated correlation length scales are significantly different among distinct species.The largest scale is associated with SO 2− 4 , which indicates that the background error has relatively large scales horizontally, and the influence of an SO 2− 4 observation could spread farther than other species.In contrast, the smallest correlation length scale is associated with NO − 3 , and it is about 2/5 of the scale associated with SO 2− 4 .Such differences among the correlation length scales indicate a need to use multi-species concentrations as the analysis variables.
The vertical correlation matrices C Sz are computed directly from the proxy data.This can be done since C Sz is only an n z ×n z matrix.A directly-computed C Sz helps represent the complicated vertical structures of the vertical correlations due to the discontinuity-like transition of the vertical distributions between the boundary layer and the free atmosphere above.Such structures are difficult to represent using analytical functions.The computed vertical correlation matrices of C Sz are displayed in Fig. 4. The vertical correlation for OC is not shown, because it is similar to that for EC.A salient and common feature of these vertical correlations is their strong relation to the boundary layer heights.Consistent with the discontinuity-like transition in the background .Vertical correlations of the background errors for EC, NO , SO , and OTR.
rrelations are computed using differences between 24 and 48 h forecasts valid at time.A localization was applied with a vertical scale of 3500 m.The contour s 0.1.4 , and OTR.These correlations are computed using differences between 24 and 48 h forecasts valid at the same time.A localisation was applied with a vertical scale of 3500 m.The contour interval is 0.1.vertical RMSE around the top of the boundary layer, the vertical correlations display a jump at the top of boundary layer.Another feature worth mentioning is that the vertical correlation for NO − 3 shows a relatively small vertical scale.

Applications to the prediction of PM 2.5
The Greater Los Angeles area continues to be the most polluted metropolitan region in the US, and prediction of the spatial and temporal distributions of PM 2.5 in the region remains a challenge.Here the 3-DVAR system is used to assimilate PM 2.5 measurements along with some speciated concentration measurements and then predictions are performed.This data assimilation and prediction experiment has been carried out for a period of one month from 12:00 UTC, 15 May to 12:00 UTC, 14 June 2010.This period of time was chosen because the CalNex 2010 field experiment took place at this time (http://www.esrl.noaa.gov/csd/calnex/).CalNex 2010 was a major climate and air quality study in California conducted by the National Oceanic and Atmospheric Administration (NOAA) and the California Air Resources Board (ARB) and, thus, more observations are available for assimilation and evaluation.

Implementation of data assimilation and forecast
To assess the data assimilation analyses and subsequent predictions, we compare the results from the experiments with and without data assimilation.The experiments are carried out as follows.The forecast is initialised at 00:00 UTC daily using the NARR meteorological fields, and then the forecast is run out to 36 h, that is, till 12:00 UTC on the following day.The first 12 h of the forecast are discarded as a model spin-up for the meteorological fields.This spin-up allows the WRF/Chem to produce not only proper clouds and precipitation, but also fine structures in the wind fields associated with the higher resolution of the model and surface topography.The chemical fields, including both the gaseous and aerosol species, are updated daily at 12:00 UTC after the 12 h spin-up of the meteorological fields, using the forecast from the previous day.For convenience, we refer to these results as the control analyses or forecasts.
The aerosol data assimilation is carried out every 6 h.A time window of 6 h is used to resolve diurnal variations.Specifically, the data assimilation is carried out at 12:00 UTC, after the above-mentioned spin-up, and the 6 h aerosol forecast initialised by the prior aerosol data assimilation analysis valid at 06:00 UTC is used as the background state.Then the aerosol analysis obtained is used as the initial condition at 12:00 UTC for the subsequent 6 h forecast, which generates the background state for the aerosol data assimilation at 18:00 UTC.In this way, the data assimilation and forecast cycle can continuously advance in time.It is worth noting that the aerosol feedback to the radiation and cloud processes is turned off in the WRF/Chem and, thus, the meteorological fields are exactly the same as those in the control experiment.
The aerosol observations assimilated have been described in Sect.4.4.The ARB PM 2.5 consists of hourly measurements from a total of 42 stations in the model domain (Fig. 1).The hourly observations are assimilated, but only at those hours when the data assimilation is carried out.The IMPROVE daily mean speciated mass concentrations are assimilated at 12:00 UTC.For those observations assimilated, a simple quality control is applied.First, observations with negative values are rejected; second, the differences between the observations and 6 h forecasts (O-F) are examined, and those observations with the O-F values greater than 120 µg m −3 are rejected.The observational error is specified as half of the background root-mean-square errors.
Additional observations of speciated concentrations are available during the CalNex 2010 experiment (available at http://www.esrl.noaa.gov/csd/calnex/)from one station located on the California Institute of Technology campus (see Fig. 1).These observations consist of the concentrations of EC, OC, NO − 3 , SO 2− 4 and others.They are not assimilated, but are used as independent observations to assess the skill of the analyses.

Assessment of data assimilation analyses
We first compare the analysis PM 2.5 against the observations that are assimilated, which is known as "sanity check".To quantify the difference, we analyse biases, correlations and root-mean-square differences between data assimilation  analyses and observations.For simplicity, we refer to rootmean-square differences as root-mean-squared errors (RM-SEs), although they are not RMSEs in a strict sense because the observation errors can be significant.These biases, correlations and RMSEs will also be used later in analysing the forecasts.
Figure 5 presents a scatter plot of the PM 2.5 concentrations from the control and data assimilation analysis, respectively, against the observations for a period of one month.Observations from each of the 42 stations and at all four times of the day with data assimilation, that is, at 00:00, 06:00, 12:00 and 18:00 UTC, are used.The control PM 2.5 results display a significant underestimation.The observed mean concentration of PM 2.5 is 21.5 µg m −3 , while that of the control analysis is 14.9 µm, a bias of −7.6 µg m −3 and, thus, about 30 % lower than the observed.The correlation is 0.51 and the RMSE is 11.0 µg m −3 , compared with a standard deviation of 26.8 µg m −3 in the observed PM 2.5 .In the data assimilation analysis, the bias is greatly reduced, to as small as −1.0 µg m −3 .The correlation between the analysis and observed PM 2.5 is as high as 0.87, while the RMSE decreases to 4.2 µg m −3 .These results show that this 3-DVAR scheme can effectively assimilate the PM 2.5 observations.
We have introduced multi-species concentrations as analysis variables, aiming to enhance the capability of the scheme in reproducing species concentrations.The concentrations of four major species -EC, OC, NO − 3 , and SO 2− 4 are evaluated here, using the speciated observations obtained at the super observing station (Fig. 1).We note that these speciated observations are not assimilated and, thus, are independent data.
Figure 6 shows the scatter plots of analysis species concentrations against the observations at the super observing station.The concentrations of all four species are improved in both their correlation and RMSE.Relative to the control results, the correlation increases by 0.1 for EC to 0.2 for NO − 3 and the RMSEs are reduced by 10 % for OC to 50 % for SO 2− 4 .The RMSE increases for EC, but this is due to the bias.The biases in the NO − 3 and SO 2− 4 concentrations are appreciably reduced, but an increase in bias occurs in the EC and OC concentrations.The increase of these biases is worth discussing.The EC and OC concentrations show a positive bias in both the control and analysis results, and this bias is opposite to that of the PM 2.5 bias in sign, while the NO − 3 and SO 2− 4 concentrations show a negative bias.Note that the assimilation of IMPROVE observations have relatively little impact on the analysis at this location, because there is no IMPROVE station in the area surrounding this location.Thus, the assimilation of PM 2.5 gives rise to increases in the biases in the EC and OC concentrations.This increase in bias actually indicates an inherent difficulty when only PM 2.5 observations are assimilated, that is, the individual species concentration bias can deteriorate when the PM 2.5 and species concentration biases have opposite signs.

Assessment of forecasts
One of the ultimate goals of developing this 3-DVAR system is to improve our capability for predicting aerosol concentrations.Here we evaluate forecast skill in PM 2.5 concentrations out to 24 h. Figure 7 presents the correlations and RMSEs as a function of forecast lead time.The 24 h forecast is initialised with the analysis at 12:00 UTC, which corresponds to 05:00 Local Time (LT).The 0 h forecast is the data assimilation analysis at 12:00 UTC.The correlations and RMSEs .Scatter plots of model mass concentrations against observations with (left column) ut (right column) data assimilation for the species of EC, OC, NO , and SO , ely.These observations are not assimilated and are thus independent data.These ons are obtained hourly from the super monitoring station during the CalNex 2010 riment.show that the forecast has consistent skill.Comparing the correlations and RMSEs with those from the control forecast without aerosol data assimilation, a positive impact on skill can be seen all the way out to 24 h.
Examining all forecast durations from 6 to 24 h, one characteristic seen is that there is not a monotonic decrease in skill as the forecast duration increases.Actually, during the period from 6 to 24 h, the forecast correlation increases.The RMSE shows a maximum value for the 6 h forecast and is basically flat from 12 to 24 h.Relative to the control forecast, both the correlation and RMSEs display no significant variations from 12 to 24 h.Another characteristic is the large increase in correlation and decrease in RMSEs during the first 6 h.These two characteristics of the forecast skill are not often encountered in meteorological data assimilation, where the forecast skill tends to decrease with the forecast duration gradually and monotonically, and may suggest an inherent challenge in aerosol data assimilation or the need for the incorporation of dynamic balance among aerosol components and between aerosol and gaseous components.
We have seen (Fig. 1) that the PM 2.5 observations are heterogeneously distributed.A related question is whether the forecast skill is also spatially heterogeneous in association with this heterogeneous distribution of observations.We have analysed the spatial distribution of both the correlations (not shown) and RMSEs for the 24 h forecast.The distributions of RMSE are shown in Fig. 8. From Fig. 8a and b, we see that the RMSE of the 24 h forecast shows a significant decrease in areas where the RMSE is relatively larger in the control forecast without data assimilation, including most of the Los Angeles area.To show this decrease in the RMSE more clearly, the difference between the control forecast and the 24 forecast with data assimilation is shown in Fig. 8c.A decrease in RMSE is seen at most of the observing locations, but there are four locations where the RMSEs increase.We note that these four locations are located along the coast.
The forecast of individual species is more challenging than that of the total PM 2.5 concentration, but the results are overall encouraging.Figure 9 presents forecast correlations and RMSEs out to 24 h at the super observing station.Most encouraging is that, in terms of both correlations and RMSE, the error reduction in the analysis SO 2− 4 concentration (Fig. 9) persists up to 24 h.The correlations of the OC and NO − 3 concentrations are larger than that of the control forecast up to 24 h, while the correlation of the EC concentrations is larger than that of the control forecast up to 18 h.The RMSEs of the OC and NO − 3 concentrations are reduced comparing to that of the control forecast, but for a relatively short period of time.The RMSE of the EC forecast is larger than that of the control forecast because of the large bias.

Conclusions and discussions
A 3-DVAR data assimilation system for the MOSAIC aerosol scheme in WRF/Chem has been developed and presented.MOSAIC provides a comprehensive representation of aerosol species and size distributions that result from a variety of pollutant emissions.This 3-DVAR scheme is formulated and implemented in an attempt to take full advantage of the MOSAIC scheme.The 3-DVAR system developed can be characterised as a two-step scheme.The total concentrations of individual species, defined as the sum of the concentrations across all size bins, are used as analysis variables and, thus, the analysis variables consist of the eight mass concentrations of the MOSAIC aerosol species.In the application presented here, the eight concentrations are further lumped into five species, and four size bins were used.The first step is to determine analysis increments for these eight concentrations following a 3-DVAR methodology (Sect.4.1).The second step is to distribute the analysis increments over four size bins.The distribution is inversely related to the background error variances.The number concentrations are adjusted based on the ratio of the mass and number concentration in the background state.
This system was applied to the analysis and prediction of PM 2.5 in the Los Angeles basin during CalNex 2010.Surface PM 2.5 and speciated concentration observations were assimilated.To evaluate the performance of the scheme, we carried out control forecasts, which were initialised with the North America Regional Reanalysis (NARR) and used both aerosol and trace gas emissions derived from National Emission Inventory 2005 (NEI'05).The comparison of the control forecasts and the forecasts with aerosol data assimilation against observations demonstrated that the data assimilation generated analyses with significantly reduced errors, which improved the subsequent forecasts of PM 2.5 up to 24 h.We also evaluated the performance of the forecasts of elemental carbon, organic carbon, nitrate and sulfate concentrations.The data assimilation significantly improved the forecast of sulfate concentrations up to 24 h, while the improvement extended up to 12-18 h for the other three species.
We have emphasised the use of multi-species concentrations as analysis variables.Because of this, this 3-DVAR has the capability of simultaneously assimilating total and speciated concentrations observations, such as total PM 2.5 concentrations and speciated concentrations from the IMPROVE network.The use of multi-species concentrations is also desirable for representations of the background error covariance.In Sect.5, we showed that both the horizontal and vertical scales of the background errors correlations are significantly different and this difference needs to be accounted for.We carried out an experiment (not presented), in which the background error correlation of each species was replaced by that estimated for PM 2.5 using the 48 h minus 24 h forecast described in Sect. 5 and showed that this degraded the performance.
In this study, only surface observations were assimilated, but this 3-DVAR scheme has been developed to assimilate additional observations from aircraft and satellites.With more observations assimilated, this data assimilation system is expected not only to further improve forecasts, but also to be of use in other applications.For example, the output can be used to interpolate limited observations for the evaluation of numerical models of aerosol-related processes such as aerosol-cloud interactions.
Despite the promising performance of this 3-DVAR system, a few fundamental assumptions used warrant further examination.The most fundamental assumption is that the background and observational errors are Gaussian.However, the evidence shows that aerosol concentrations tend to be non-Gaussian (Seinfeld and Pandis, 2006).While the background errors may not necessarily follow the aerosol concentration distributions, there is a possibility that the background errors are non-Gaussian.In addition, aerosol concentrations are strictly non-negative quantities, therefore, errors in these quantities cannot be strictly Gaussian distributed, although they may be approximately so since the Gaussian density assigns positive probability to negative values of these quantities (Cohn, 1997).A systematic analysis is needed to characterise the background error distribution.Another assumption is that the background errors are uncorrelated between different types of species and between distinct size bins.This is an ad hoc assumption made simply to render the problem computationally manageable.The correlations between different bins can be significant, because one species often arises from same sources and transfers across the bins.The consequences of this assumption should be quantified, and relaxation of this assumption should be pursued.In the present implementa-  tion, gaseous components are not involved.However, aerosol particles in the atmosphere contain a variety of volatile compounds (ammonium, nitrate, chloride, volatile organic compounds) that can exist either in the particulate or gas phase, and these two phases are often in thermodynamic equilibrium.Assimilation of gaseous components may be required to further improve the forecast of aerosol species concentrations.

Z.
Li et al.: Three-dimensional variational data assimilation for aerosol in WRF/Chem.B is the N × N matrix, denoting the error covariance associated with x b .The M vector d = y − Hx b

Figure 1 .Fig. 1 .
Figure 1.Model domains, topography and location of observations.The three nested

Fig. 2 .
Fig. 2.Vertical distribution of the root-mean-square of the background errors in mass concentration for five species, estimated using the differences between 24 and 48 h forecasts valid at the same time.

Fig. 4 .
Fig. 4. Vertical correlations of the background errors for EC, NO − 3 , SO 2−4 , and OTR.These correlations are computed using differences between 24 and 48 h forecasts valid at the same time.A localisation was applied with a vertical scale of 3500 m.The contour interval is 0.1.

1 1Figure 5 .
Figure 5. Scatter plots of the PM2.5 mass concentrations against observations in the

Fig. 5 .
Fig.5.Scatter plots of the PM 2.5 mass concentrations against observations in the analysis without (a) and with (b) data assimilation.The observations are assimilated, and consist of the 00:00, 06:00, 12:00 and 18:00 UTC observations from the 42 ARB monitoring stations during the period 12:00 UTC, 5 May to 12:00 UTC, 14 June 2010.

Fig. 6 .
Fig.6.Scatter plots of model mass concentrations against observations without (left column) and with (right column) data assimilation for the species of EC, OC, NO − 3 , and SO 2− 4 , respectively.These observations are not assimilated and are, thus, independent data.These observations are obtained hourly from the super monitoring station during the CalNex 2010 field experiment. 1

Fig. 7 .
Fig. 7. Correlations (a) and root-mean-square errors (RMSEs in µg m −3 ) (b) of the total PM 2.5 concentration forecasts against observations as a function of forecast duration.The forecast is initialised at 12:00 UTC.Both correlations and RMSEs are calculated against the observations from 42 stations during the period 12:00 UTC, 5 May to 12:00 UTC, 14 June 2010.The black bars are for the control forecast without data assimilation, and the red bars for the forecast with data assimilation.

.Fig. 8 .
Fig. 8. Spatial distribution of RMSEs of the 24 h forecasts.(a) shows the RMSEs of the control forecast, and (b) the RMSEs of the 24 h forecast with data assimilation.The RMSE difference between the control and 24 h forecast with data assimilation is shown in the panel (c), in which a negative value indicates a decrease of RMSE arising from data assimilation.The RMSEs are in µg m −3 . 50

Figure 9 .Fig. 9 .
Figure 9. Correlations (left) and root-mean square errors (RMSEs in µg/m 3 ) (right) of the specie concentration forecasts against observations as a function of forecast duration.The forecast is initialized at 12 UTC.Both correlations and RMSEs are calculated against the observations from 42 stations during the period 12 UTC, 5 May to 12 UTC, 14 June, 2010.The black bars for the control forecast without data assimilation, and the red bars for the forecast with data assimilation.