In this study, we apply the four-dimensional variational (4D-Var) data assimilation to optimize initial ozone state and to improve the predictability of air quality. The numerical modeling systems used for simulations of atmospheric condition and chemical formation are the Weather Research and Forecasting (WRF) model and the Community Multiscale Air Quality (CMAQ) model. The study area covers the capital region of South Korea, where the surface measurement sites are relatively evenly distributed.

The 4D-Var code previously developed for the CMAQ model is modified to consider background error in matrix form, and various numerical tests are conducted. The results are evaluated with an idealized covariance function for the appropriateness of the modified codes. The background error is then constructed using the NMC method with long-term modeling results, and the characteristics of the spatial correlation scale related to local circulation are analyzed. The background error is applied in the 4D-Var research, and a surface observational assimilation is conducted to optimize the initial concentration of ozone. The statistical results for the 12-hour assimilation periods and the 120 observatory sites show a 49.4 % decrease in the root mean squared error (RMSE), and a 59.9 % increase in the index of agreement (IOA). The temporal variation of spatial distribution of the analysis increments indicates that the optimized initial state of ozone concentration is transported to inland areas by the clockwise-rotating local circulation during the assimilation windows.

To investigate the predictability of ozone concentration after the
assimilation window, a short-time forecasting is carried out. The ratios of
the RMSE (root mean squared error) with assimilation versus that without assimilation are 8 and
13 % for the

Data assimilation provides a consistent representation of the physical state such as the atmosphere by blending imperfect model predictions and noisy observations. As a technique that applies observational information to numerical models with the aim of increasing model predictability, data assimilation is actively used in numerical weather prediction (NWP) and ocean modeling studies (Daley, 1991; Courtier et al., 1998; Rabier et al., 2000; Kalnay, 2003; Navon, 2009; Evensen, 2007). With more chemical observations available in recent years, including the satellite data, data assimilation is expected to make more contributions to weather forecasting and further improve the predictability of air quality. When the data assimilation technique is used in an air quality model, it not only improves the initial concentration distribution of pollutants, but also optimizes the emissions. In addition to the boundary inflow concentration (Carmichael et al., 2008), emission is also one crucial factor in the numerical prediction of various air pollutants. Several data assimilation techniques have been developed. The four-dimensional variational (4D-Var) data assimilation requires an adjoint model for use in non-linear numerical models. This represents an applied area in the use of adjoint sensitivity (Elbern and Schmidt, 2001; Penenko et al., 2002; Sandu et al., 2005; Hakami et al., 2007).

Research using the adjoint model in air quality models started in the mid-1990s. The adjoint models used in and before the year 2000 are well described in the review paper of Wang et al. (2001). Sandu and Chai (2011) and Carmichael et al. (2008) presented subsequent research, and described many areas in which the adjoint method has been applied. More recently, Bocquet et al. (2015) provided a more comprehensive review that includes coupled chemistry meteorology models.

Elbern et al. (1997) were the first to assimilate tropospheric air quality data into the European air pollution dispersion model. They argued that, at the time, air quality data assimilation was limited to stratospheric ozone data from satellite observations, which is far from sufficient for better air quality prediction. In their study, they performed data assimilation using both data generated by the model and various information from observations. The results indicated that when using the model-generated data, the predictability is improved not only for the chemical species directly related to those used in the data assimilation, but also for those not used in the data assimilation. In their subsequent research, Elbern and Schmidt (2001) applied 4D-Var to cases of high summer ozone concentrations based on ground observations over Europe, and ozone sonde observations from other locations. The results of 6 h data assimilation showed improved predictability. In addition, they examined the sensitivity of the model simulation to the radius of influence used in the data assimilation.

Chai et al. (2007) analyzed the effects of observations from various
observation systems, such as ground, civil aviation, ship, ozone sonde, and
lidar, on data assimilation. The ICARTT (International Consortium for
Atmospheric Research on Transport and Transformation) data were obtained and
used in the above research. In particular, they proposed a method to
calculate background errors, which had not been addressed in detail in the
previous research, and verified its performance in the modeling area of
interest. Boisgontier et al. (2008) assimilated tropospheric ozone
concentrations in their regional ozone prediction study prior to the launch
of the MetOp Satellite of European Organisation for the Exploitation of
Meteorological Satellites (EUMETSAT) Polar System (EPS) in October 2006.
Although the study performed data assimilation using the column ozone data
ranging over 0–6 km in the troposphere, they expected that it would
positively affect the accuracy of regional ozone prediction. The chemical
data assimilation has been conducted using NO

Gou and Sandu (2011) indicated that differences might exist in the
gradient results between discrete and continuous adjoint in the process of
developing an adjoint model due to the high non-linearity in the advection
equation of the air quality model. As a result, they argued that the
discrete method is more accurate in the adjoint sensitivity study, and that
the continuous method is faster in minimizing the cost function in the
4D-Var data assimilation. In their study of the background pollutants
affecting ground ozone concentrations in western America during the summer,
Huang et al. (2013) applied data assimilation not only to numerical
simulations, but also to evaluation of the concentrations associated with
transport. Based on analysis of the ground-observed ozone concentration,
they suggested that the simulated surface O

Most of the previous studies on chemical data assimilation have focused on synoptic-scale meteorological phenomena, using satellite-based observations as well as ground-based data. The transport of air pollution forced by a local circulation such as the land–sea breeze has rarely been examined.

One of the important elements affecting results of data assimilation in the 4D-Var process is the background errors of the model (Talagrand and Courtier, 1987). Previous research has treated the background errors as scalar quantities with a Gaussian distribution, whereas there is a lack of research applying them in a matrix form and considering the three-dimensional covariance (Constantinescu et al., 2007; Singh et al., 2011; Sliver et al., 2013).

In this study, the region centered on the capital area of South Korea, where the ground observation sites are densely distributed, is selected for the study of data assimilation. The previously developed 4D-Var code is modified to treat background errors in matrix form, and various numerical tests are conducted. The results are evaluated using an idealized covariance function. The realistic background errors are then obtained for the region around the capital of South Korea using long-term modeling results, and their characteristics are analyzed. Finally, the predictability of high ozone concentration is investigated by setting the initial ozone concentration as the control variable in the cost function for the 4D-Var data assimilation.

The variational method solves the data assimilation problem from an optimal
control framework (Penenko and Obraztsov, 1976; Courtier and Talagrand,
1987; Le Dimet and Talagrand, 1986). We aim to find control variables that
minimize the difference between the model predictions and observations. In
the frame of strongly constrained 4D-Var data assimilation, the
observational data at all times within the assimilation window are
simultaneously considered. The control variables become the initial
concentration distribution

In the maximum likelihood approach, the 4D-Var data assimilation gives the
maximum a posteriori estimator of the true initial concentration
distribution, which is obtained by minimizing the cost function:

The 4D-Var analysis can be obtained by the initial concentration that
minimizes Eq. (1) with respect to the model equation.
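For orientation, a commonly used form of the strong-constraint 4D-Var cost function and its gradient is sketched below. The notation is assumed for illustration and may differ from that of Eq. (1): $\mathbf{c}_0$ is the initial concentration (control variable), $\mathbf{c}^b$ the background state, $\mathbf{B}$ and $\mathbf{R}_k$ the background and observation error covariances, $\mathbf{y}_k$ the observations at time step $k$, $H_k$ the observation operator with linearization $\mathbf{H}_k$, and $\mathcal{M}_{0\to k}$ the model integration from the initial time to step $k$ with tangent linear $\mathbf{M}_{0\to k}$.

```latex
\begin{aligned}
J(\mathbf{c}_0) &= \frac{1}{2}\,(\mathbf{c}_0-\mathbf{c}^b)^{T}\,\mathbf{B}^{-1}\,(\mathbf{c}_0-\mathbf{c}^b)
 + \frac{1}{2}\sum_{k=0}^{N}\bigl(H_k(\mathbf{c}_k)-\mathbf{y}_k\bigr)^{T}\,\mathbf{R}_k^{-1}\,\bigl(H_k(\mathbf{c}_k)-\mathbf{y}_k\bigr),
 \qquad \mathbf{c}_k=\mathcal{M}_{0\to k}(\mathbf{c}_0), \\
\nabla_{\mathbf{c}_0} J &= \mathbf{B}^{-1}\,(\mathbf{c}_0-\mathbf{c}^b)
 + \sum_{k=0}^{N}\mathbf{M}_{0\to k}^{T}\,\mathbf{H}_k^{T}\,\mathbf{R}_k^{-1}\,\bigl(H_k(\mathbf{c}_k)-\mathbf{y}_k\bigr).
\end{aligned}
```

In this form the first term penalizes departures from the background and the second term penalizes the model-observation mismatch over the whole assimilation window; the transposed operator $\mathbf{M}_{0\to k}^{T}$ in the gradient is what the adjoint model evaluates.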

The model domains (d27, d09, and d03) for WRF (Weather Research and Forecasting). The domain size of CMAQ (Community Multiscale Air Quality) is mostly the same except that it has five fewer grid cells than WRF at the lateral boundaries. The air quality monitoring sites at ground level are marked by open green circles. Blue filled circles and red filled triangles indicate the selected locations for the idealized and realistic background error covariance experiments, respectively. These experiments are conducted to investigate the diurnal variation of ozone during the assimilation window. The administrative districts of Seoul, Gyeonggi-do, Gangwon-do, Chungcheongnam-do, and Chungcheongbuk-do are abbreviated to SU, GG, GW, CN, and CB, respectively, and are also shown on the map.

Accurate error covariances for background and observation are important for the quality of data assimilation. A reasonable analysis may deteriorate because of misunderstanding of these covariances (Daescu, 2008). The background error covariance (BEC) is of utmost importance, as it weights the model error against the competing observation error, spreads information from observations to the adjacent area, and influences several parameters such as temperature and wind fields or chemical constituents (Elbern and Schmidt, 2001).

The adjoint code for the CMAQ model (CMAQ-ADJ) was implemented in project H98
(University of Houston, 2009) by the Houston Advanced Research Center/Texas
Environmental Research Consortium (HARC/TERC). The validation and several
numerical tests of this code are well described in Hakami et al. (2007).
Below is the defined cost function in CMAQ-ADJ to optimize initial
condition, which refers to concentration at the initial time.

Configuration of WRF modeling system.

Configuration of CMAQ 4D-Var modeling system.

If a BEC is to be correctly adopted, a cost function should be defined in the form of a matrix; this is denoted by the first term on the right-hand side in Eq. (1). The background part and its gradient of the cost function, written in Fortran codes, have been revised in this study to make the matrix operation possible. A numerical test is conducted to validate the suitability and effects of the revised codes.
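A minimal sketch of the background term and its gradient in matrix form is given below, written in Python rather than in the Fortran of the actual code. It assumes the standard quadratic form of the background term; the three-cell state, the error variance, and the correlation values are hypothetical and chosen only for illustration.

```python
import numpy as np

def background_cost_and_gradient(x0, xb, B):
    """Return J_b = 0.5 * dx^T B^{-1} dx and its gradient B^{-1} dx."""
    dx = x0 - xb
    # Solve B z = dx rather than forming the inverse of B explicitly.
    z = np.linalg.solve(B, dx)
    return 0.5 * dx @ z, z

# Hypothetical three-cell example (ppb); all values are illustrative only.
xb = np.array([40.0, 50.0, 60.0])      # background (first guess)
x0 = np.array([45.0, 50.0, 58.0])      # trial initial state
sigma2 = 25.0                          # assumed error variance (ppb^2)
corr = np.array([[1.0, 0.5, 0.1],
                 [0.5, 1.0, 0.5],
                 [0.1, 0.5, 1.0]])     # assumed spatial error correlations

B_diag = sigma2 * np.eye(3)            # variance only, no covariance
B_full = sigma2 * corr                 # full matrix form

jb_diag, grad_diag = background_cost_and_gradient(x0, xb, B_diag)
jb_full, grad_full = background_cost_and_gradient(x0, xb, B_full)

print("variance only:", jb_diag, grad_diag)
print("matrix form  :", jb_full, grad_full)
```

With the variance-only matrix the gradient is nonzero only in cells where the trial state departs from the background, whereas the full matrix couples neighboring cells. A finite-difference comparison against the returned gradient is a simple way to check such a routine.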

The methods for obtaining the BEC of a numerical model are mainly divided into two types: an NMC method (Parrish and Derber, 1992) that defines the model error as the difference between the forecasting results at different initial times, and an ensemble method that uses a perturbed forecast. Recently, Kucukkaraca and Fisher (2006) introduced a technique for modeling a flow-dependent BEC. In Constantinescu et al. (2007), an autoregressive model was proposed for flow-dependent BEC in air quality data assimilation.

In this study, the BEC of the model is constructed by using the NMC method, which is the most intuitive and easily applied method.
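The NMC idea can be sketched in a few lines: forecasts of different lead times that are valid at the same time are differenced, and the sample covariance of the differences is taken as a proxy for the BEC. The example below uses synthetic random fields in place of real model output; the six-cell grid and the noise amplitudes are assumptions, and the 29 forecast pairs follow the number quoted in the summary of this study.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pairs, n_grid = 29, 6   # 29 forecast pairs; the 6-cell grid is a toy size

# Synthetic stand-ins for 24 h and 48 h forecasts valid at the same times.
truth = 50.0 + 10.0 * rng.standard_normal((n_pairs, n_grid))
f24 = truth + 3.0 * rng.standard_normal((n_pairs, n_grid))
f48 = truth + 5.0 * rng.standard_normal((n_pairs, n_grid))

def nmc_covariance(f_long, f_short):
    """Sample covariance of forecast differences (rows are forecast pairs)."""
    diff = f_long - f_short
    anomalies = diff - diff.mean(axis=0)
    return anomalies.T @ anomalies / (diff.shape[0] - 1)

B = nmc_covariance(f48, f24)

# Separate the covariance into standard deviations and correlations.
std = np.sqrt(np.diag(B))
corr = B / np.outer(std, std)

print("error standard deviations:", np.round(std, 2))
```

Because the synthetic errors here are spatially uncorrelated, the off-diagonal correlations are only sampling noise; with real forecasts they would carry the spatial structure discussed later in the text.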

Horizontal distributions of emission rate for domain d27 (top),
d09 (middle), and d03 (bottom). The left and right panels are for VOCs and
NO

If the observatory sites are distributed unevenly, results of data assimilation based on the variational theory will have low reliability, and it is difficult to minimize the cost function (Courtier and Talagrand, 1987). For this reason, the capital region of South Korea is selected for the present data assimilation study because measurement sites are relatively evenly distributed in this area. Figure 1 depicts the study area (d03), i.e., the capital region of South Korea, along with the domain configuration for the other two nesting domains of coarser resolution. A total of 120 observatory sites are evenly distributed in the areas of Seoul (SU), Gyeonggi-do (GG), Gangwon-do (GW), Chungcheongnam-do (CN), and Chungcheongbuk-do (CB). The innermost domain, d03, is located in a geographical area with the coast to the west and topography that gradually rises towards the east. The Weather Research and Forecasting (WRF) model (Skamarock et al., 2008) is a mesoscale atmospheric model that has been widely used to simulate local circulation patterns and to provide the meteorological input data for the air quality model. The chemical formation and transport of ozone are simulated by the Models-3 Community Multiscale Air Quality (CMAQ) model (Byun and Ching, 1999). This model simulates gas-phase chemistry using the carbon bond IV (CB-IV) photochemical mechanism (Gery et al., 1989). To describe the chemical transformation, the Euler backward iterative (EBI) solver (Hertel et al., 1993) is used. The advection is calculated by the piecewise-parabolic method (PPM) (Colella and Woodward, 1984), which is based on the finite volume subgrid definition of the advected scalar. The vertical diffusion in the planetary boundary layer is calculated following the approach in the Regional Acid Deposition Model, RADM (Chang et al., 1987), which is based on the similarity theory.
Detailed settings used for the atmospheric and air quality model systems in the present study are presented in Tables 1 and 2, respectively. All times mentioned in this paper, except those in Table 2, are in local standard time (LST), which is 9 h ahead of Coordinated Universal Time (UTC).
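Because both LST and UTC appear in the text and tables, a small conversion helper may be convenient. The sketch below assumes that LST is Korea Standard Time (UTC+9); the year is an arbitrary placeholder needed only to construct the dates, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: LST is Korea Standard Time, a fixed offset of UTC+9.
LST = timezone(timedelta(hours=9), name="LST")

def lst_to_utc(t_lst):
    """Interpret a naive datetime as LST and convert it to UTC."""
    return t_lst.replace(tzinfo=LST).astimezone(timezone.utc)

def utc_to_lst(t_utc):
    """Interpret a naive datetime as UTC and convert it to LST."""
    return t_utc.replace(tzinfo=timezone.utc).astimezone(LST)

YEAR = 2000  # arbitrary placeholder; only the day and hour matter here

# A 09:00 to 21:00 LST window on 5 August, expressed in UTC.
start_utc = lst_to_utc(datetime(YEAR, 8, 5, 9))
end_utc = lst_to_utc(datetime(YEAR, 8, 5, 21))
print(start_utc.strftime("%d %b %H:%M UTC"), "to", end_utc.strftime("%d %b %H:%M UTC"))
```

Under this assumption, a window given as 12:00 UTC on 5 August to 00:00 UTC on 6 August corresponds to 21:00 LST on 5 August to 09:00 LST on 6 August.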

The experiment without assimilation was conducted as a forward run (FWD),
which covers 4 days from 09:00 LST on 3 August to 09:00 LST on 7 August. In
addition, data assimilation (4DV) was performed within the 12 h
time window from 09:00 to 21:00 LST on 5 August. Figure 2 illustrates the
spatial distribution of the total NO

Synoptic weather charts (left) and simulated results (right) on 4 August (upper) and 5 August (lower). Filled contours and vectors represent ozone concentration and winds, respectively.

The results of the WRF simulation for the synoptic pattern of surface
pressure during the study period are presented in Fig. 3, along with the
weather charts. The vector indicates surface wind, and the values of the
contours are the concentrations of O

Diurnal variations of horizontal distribution of ozone (contour) and wind (vector) at 3-hour intervals starting from 06:00 LST on 5 August.

Figure 4 shows the horizontal distributions of simulated ozone concentration
and surface wind from 06:00 to 21:00 LST on 5 August at 3-hour intervals.
At 06:00 LST, a southeasterly to easterly wind developed along the western
coast, and the overall ozone concentration was low in this region.
As solar radiation increased after sunrise, the ozone
concentration began to increase, and an onshore sea-breeze developed after
12:00 LST on the western coast. This sea-breeze lasted from 18:00 to 21:00 LST.
After sunset, the influence of the sea-breeze can be identified over areas
where the ozone concentration decreased due to NO

Covariance distribution for Gaussian (blue) and Balgovind (red)
functions with respect to the distance (

Experimental design for the idealized background error covariance test. The FWD case is conducted and the results are compared with those of the 4D-Var run.

Horizontal distribution of analysis increments at the surface resulting
from the single observation experiment (EXP_A) with respect
to radius of influence (

Two simple yet popular covariance models are Gaussian and Balgovind (Balgovind et al., 1983) functions expressed as

Table 3 summarizes a suite of numerical tests with and without data assimilation. In the tests with application of data assimilation, a matrix is constructed assuming that the BEC of the model has the form of a Balgovind function. The model domain is the innermost domain as illustrated in Fig. 1. The FWD test is conducted without data assimilation, and the other test is performed with data assimilation. The two types of tests are named as EXP_A and EXP_B, respectively.
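As an illustration of how such a matrix could be assembled, the sketch below builds a BEC matrix from a Balgovind correlation function on a small one-dimensional grid and shows how a single observation spreads to neighboring cells. The functional forms are the commonly cited ones (Gaussian and second-order autoregressive), and the grid spacing, length scale, error standard deviations, and innovation are assumed values rather than the settings of this study. The increment is computed with the linear single-observation analysis formula, not with the full 4D-Var system.

```python
import numpy as np

def gaussian_corr(r, length):
    """Gaussian correlation, exp(-r^2 / (2 L^2))."""
    return np.exp(-0.5 * (r / length) ** 2)

def balgovind_corr(r, length):
    """Balgovind correlation, (1 + r/L) * exp(-r/L)."""
    return (1.0 + r / length) * np.exp(-r / length)

# Assumed toy settings (not the configuration of this study).
dx_km = 3.0       # grid spacing (km)
n_cells = 10      # number of cells along one line
length_km = 9.0   # correlation length scale, i.e. radius of influence (km)
sigma_b = 5.0     # background error standard deviation (ppb)
sigma_o = 1.0     # observation error standard deviation (ppb)

x = dx_km * np.arange(n_cells)
r = np.abs(x[:, None] - x[None, :])       # pairwise distances (km)
B = sigma_b**2 * balgovind_corr(r, length_km)

# Single observation at cell k with a given innovation (obs minus background).
k = 4
innovation = 20.0                          # ppb
increment = B[:, k] * innovation / (B[k, k] + sigma_o**2)

print(np.round(increment, 2))
```

The increment is largest at the observed cell and decays with distance following column k of B, so a larger length scale spreads the observation over a wider area.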

EXP_A is a test that can be used to evaluate the
characteristics of the BEC based on a single observation experiment. In this
experiment, 100 ppb of O

Cross-section of analysis increments along the blue line in Fig. 6b as
the radius of influence (

Diurnal variations of surface ozone from the results of
EXP_B at

In EXP_B, which is the second test, the effect of BEC used in
4D-Var is examined. Real observation data are used in EXP_B.
The observation data include 12 h ozone concentration at 120 sites within
the capital city regions. Two cases are investigated in the
EXP_B (Table 3): the XBE case only considers variance that is
not in a matrix form, and the OBE case uses the BEC in the matrix form that
adopts the Balgovind function. In the XBE, two tests that take into
consideration the different weighting between

Schematic illustration for the NMC approach to obtain the background error covariance (BEC) matrix.

Model error correlation coefficients between vertical levels. The physical height of each level is indicated by the non-uniform grid line only in the layer below 1553 m, which is the eighth layer of CMAQ.

Among the results of the EXP_A, horizontal distributions of
the analysis increment with respect to the radius of influence (

Model error correlation coefficients between two layers, as a
function of

Figure 8 shows the daily changes in ozone concentration simulated by each
experiment in the test EXP_B and from observations at
selected sites. Exact locations of these sites are marked in Fig. 1. At the
site GG01, the observed (black solid line) concentration of ozone, which is
higher than 100 ppb, was not simulated in the FWD (blue solid line). In
XBE_r0.08 (green solid line), although the BEC is not
applied, the simulated O

The BEC is obtained using the NMC (National Meteorological Center, now National Centers for Environmental Prediction) approach (Parrish and Derber, 1992), which is based on a real simulation for the realistic 4D-Var data assimilation study.

Model error correlation coefficients as a function of horizontal
distance

Figure 9 describes the method to define the model error. The error
statistics for the CMAQ model are defined by the differences between

Time variations of surface ozone concentration at selected sites whose specific locations are marked by red filled triangles in Fig. 1 during daytime on 5 August. Black solid lines are observed results, and blue dashed and red solid lines indicate simulated results from the FWD and 4DV, respectively.

Inverting ill-conditioned matrices is error-prone. Based on the
property of the Kronecker product,

The error correlations between the vertical layers of the model are given in Fig. 10. The error correlation decreases with increasing distance from a given layer. Judging from the structure of the errors, which is concentrated along the diagonal, the correlation is roughly a function of the physical distance between the layers. Within the depth of the boundary layer, errors near the ground surface remain highly correlated up to the fourth layer. This result indicates that assimilating observations from layers between the surface and the top of the boundary layer can also improve the model simulation in the neighboring layers.

In Fig. 11, the error correlations are plotted as a function of distance
between two layers. When the distribution of correlations versus distance is
fitted to a simple function,

Statistics of the model results.
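A minimal sketch of how the two statistics used throughout, RMSE and IOA, might be computed is given below. It assumes the widely used Willmott form of the index of agreement; the definitions actually used in this study are those listed in Table 4. The ozone values are hypothetical.

```python
import numpy as np

def rmse(model, obs):
    """Root mean square error between model and observed values."""
    model, obs = np.asarray(model, float), np.asarray(obs, float)
    return np.sqrt(np.mean((model - obs) ** 2))

def ioa(model, obs):
    """Index of agreement (Willmott form); 1 means perfect agreement."""
    model, obs = np.asarray(model, float), np.asarray(obs, float)
    obs_mean = obs.mean()
    num = np.sum((model - obs) ** 2)
    den = np.sum((np.abs(model - obs_mean) + np.abs(obs - obs_mean)) ** 2)
    return 1.0 - num / den

# Hypothetical ozone concentrations (ppb) at four site-hours.
obs = np.array([30.0, 60.0, 90.0, 70.0])
fwd = np.array([35.0, 45.0, 60.0, 55.0])   # run without assimilation
ana = np.array([32.0, 57.0, 85.0, 68.0])   # run with assimilation

for name, sim in (("FWD", fwd), ("4DV", ana)):
    print(name, "RMSE =", round(rmse(sim, obs), 2), "IOA =", round(ioa(sim, obs), 3))
```

A lower RMSE and an IOA closer to 1 both indicate better agreement with the observations, which is the sense in which the two statistics are used together in the text.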

Decreasing root mean square error (RMSE, solid line) and
increasing index of agreement (IOA, dashed line) with respect to each
iteration step. The RMSE and IOA are calculated by comparing 4D-Var data
assimilation (4D-Var) results during time window with observed O

4D-Var experiments are performed in this study, using actual observations
with the distribution of the initial concentration of O

The observation results of the diurnal variation of O

Diurnal variations of statistical results of IOA (dashed) and RMSE (solid) during the assimilation time window. The results with assimilation (4DV) are indicated by red and thick lines, and those without assimilation (FWD) are the blue and thin lines.

The root mean square error (RMSE) and index of agreement (IOA) of simulated
results at each iteration step of 4D-Var using observation data from all
sites were calculated, and the results are shown in Fig. 14 (the
definitions of statistical variables used in this research are listed in
Table 4). Results at the starting point, i.e., iteration

Horizontal distributions of surface ozone and its time variations.
The plotted time is valid at

Figure 15 gives the diurnal variations of the two statistical variables. As
the statistical results are derived from 120 observatory sites over a fixed
period of time, they actually represent the errors and general agreement in
spatial distribution of O

The same as Fig. 16 except that the contour value is analysis
increments

Table 5 shows the statistical results based on simulations with the 12-hour
assimilation periods and from the 120 observatory sites. The simulation
result of the 4DV experiment is 61.4 ppb, which is close to the average
concentration of observed ozone of 63.6 ppb. A 49.4 % decrease in RMSE and
a 59.9 % increase in IOA in the results of the 4DV (i.e., the difference
between FWD and 4DV) demonstrate the great improvement caused by data
assimilation. Mean bias, normalized by the average observed concentration
(MMB), was

Statistics for the observed (OBS) and simulated (FWD and 4DV)
results. The FWD indicates the simulation without data assimilation. 4DV
results are obtained by assimilating all observed surface O

Time variations of observed and forecast ozone concentration
after

To compare the spatial distribution of the simulated O

Figure 17 shows the difference between the results of FWD and 4DV (4DV results minus those of the FWD). These differences can be regarded as analysis increments and their effects during the assimilation window. At 09:00 LST, the analysis increments are negative in most of the area, but are positive over part of the western coast and the CN area, which is affected by the clockwise circulation of the sea-breeze. These analysis increments, which are also evident in the reanalyzed initial conditions, are transported to inland areas by the local circulation. As a result, the differences between the FWD and 4DV experiments become larger, and the areas of positive values expand to encompass the SU and GG areas. This process makes it possible to simulate the high concentration of daytime ozone.

The direct comparison with the observation data used during the assimilation
window has a limit in the verification of results. Forecasts of FWD and 4DV
with different initial conditions after the time window (Table 2) are
performed in this part. Figure 18a depicts the temporal variation of
ozone concentration, which is obtained by averaging the results of all the
observatory sites and those of corresponding model grids during the 12-hour
assimilation period and the 12-hour forecast. During the period for
validation, the FWD overestimates O

Time variations of RMSE (solid lines) and IOA (dashed lines) for
24 h forecast after

The above result shows a forecast for the nighttime ozone with application
of the daytime data assimilation. However, high concentrations of ozone that
have harmful effects to human health are often found during daytime.
Therefore, the effects of assimilation over a nighttime window
on the forecast accuracy of daytime ozone concentration are also
examined. The period for validation of data assimilation is set to be 12 h, from 12:00 UTC on 5 August to 00:00 UTC on 6 August (Table 2). The

Optimizing ozone through data assimilation did not significantly change
the other chemical components (not shown here). Ozone is a secondary
pollutant and has no direct emission sources. Other components,
especially the precursors of ozone, depend mostly on their emission
information. Our next study will optimize the initial condition for
NO

In this study, we presented an approach that uses an adjoint model in data assimilation. To incorporate observation data into a numerical model, 4D-Var data assimilation is conducted to improve the predictability of ozone concentration by optimizing the initial values. The model systems used in the present study are WRF, CMAQ, and CMAQ-ADJ.

The previously developed adjoint code for 4D-Var treats the background error of the model in the cost function as a constant. In this study, the code is revised to reflect the error information of the actual study area, and the revised code is verified. Two numerical experiments are first performed by defining an idealized matrix under the assumption that the background error follows a Balgovind function. It is found that synthetic observation information is effectively spread over the neighboring areas.

In order to define the realistic model error, the NMC method that is widely used in meteorological DA is adopted in this study. The background error covariance is constructed based on the 29 differences between 48 h forecasts and 24 h forecasts, which are taken as the model error. The forecasts are performed over August, with daily initialization and a forecast period of 48 h. The vertical correlation of the model results is constructed as a diagonal and symmetric matrix; the length scale in the correlation analysis of vertical distance is about 300 m, and the scale of length in the averaged east-west and south-north correlation is about 10 km (the east-west correlation is higher than the north-south correlation).
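The Kronecker-product property mentioned earlier can be illustrated with a separable correlation model, in which the full matrix is the Kronecker product of a vertical and a horizontal correlation matrix and its inverse is the Kronecker product of the two small inverses. The sketch below is an assumption-laden toy: it uses an exponential correlation function, hypothetical layer heights, a single horizontal direction with 3 km spacing, and the approximate length scales quoted above (300 m and 10 km).

```python
import numpy as np

def exp_corr(coords, length):
    """Exponential correlation exp(-d / L); the functional form is an assumption."""
    d = np.abs(coords[:, None] - coords[None, :])
    return np.exp(-d / length)

z_m = np.array([20.0, 80.0, 200.0, 400.0])   # hypothetical layer heights (m)
x_km = 3.0 * np.arange(5)                    # hypothetical horizontal cells (km)

Cv = exp_corr(z_m, 300.0)    # vertical correlations, length scale about 300 m
Ch = exp_corr(x_km, 10.0)    # horizontal correlations, length scale about 10 km

# Full separable correlation matrix and its inverse from the small factors:
# inv(kron(Cv, Ch)) = kron(inv(Cv), inv(Ch)).
C = np.kron(Cv, Ch)
C_inv = np.kron(np.linalg.inv(Cv), np.linalg.inv(Ch))

def apply_inverse(field):
    """Apply inv(kron(Cv, Ch)) to a (n_layers, n_cells) field without forming it."""
    tmp = np.linalg.solve(Cv, field)      # vertical solve
    return np.linalg.solve(Ch, tmp.T).T   # horizontal solve (Ch is symmetric)

print("max |C_inv @ C - I| =", np.abs(C_inv @ C - np.eye(C.shape[0])).max())
```

Only the 4 x 4 and 5 x 5 factors are inverted or solved here, which is cheaper and usually better conditioned than working with the full 20 x 20 matrix; the saving grows quickly with grid size.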

The generated background error of the model simulation is applied in the
4D-Var research, and the surface observation is incorporated by DA to
optimize the initial concentration of ozone. As a result of DA in a 12 h
time window during the daytime of 5 August, the 4DV experiment shows a
diurnal variation of O

Data assimilation plays an essential role in air quality modeling studies. The following studies therefore need to be conducted for further operational applications of data assimilation.

In addition to ground data, other observations such as the data from ozone sonde, airplanes, and satellites, need to be exploited.

In the case of long-range transport, the inflow boundary condition needs to be optimized by considering it as a control variable in 4D-Var data assimilation.

Instead of using the averaged values of BEC data (which is used in the present research) to easily obtain the inverse matrix, the error correlation with different length scales at each grid should be considered. For this purpose, the preconditioning procedure, which modifies the form of the cost function, should be applied.

When the error covariance used in the modeling study is considered, it is possible to conduct DA research using observation variables that are different from the control variables.

The study proposes a method to improve predictability by applying DA technology to air quality forecasts. Results of the present study provide helpful information to policy makers in charge of emission regulation. With more information related to a variety of air pollutants becoming available in the future, for example data from the geostationary orbit environmental satellite that is planned to operate in 2018 (Lee et al., 2010) and other observation systems, it is necessary to handle vast amounts of observation data for better chemical weather forecasting (Carmichael et al., 2008). This study can be considered a preliminary step in this direction.

We would like to thank the CSL research group at Virginia Tech for providing the CMAQ-ADJ code used in this study. This work was conducted under the framework of Research and Development Program of the Korea Institute of Energy Research (grant number: GP2014-0030). Edited by: A. Baklanov