Improving predictability of high ozone episodes through dynamic boundary conditions, emission refresh and chemical data assimilation during the Long Island Sound Tropospheric Ozone Study (LISTOS) field campaign

Although air quality in the United States improved remarkably in the past decades, ground-level ozone (O3) rises often in exceedance of the national ambient air quality standard in nonattainment areas, including the Long Island 20 Sound (LIS) and its surrounding areas. Accurate prediction of high ozone episodes is needed to assist government agencies and the public in mitigating harmful effects of air pollution. In this study, we have developed a suite of potential forecast improvements, including dynamic boundary conditions, rapid emission refresh and chemical data assimilation, in a 3 km resolution Community Multi-scale Air Quality (CMAQ) modeling system. The purpose is to evaluate and assess the effectiveness of these forecasting techniques, individually or in combination, in improving forecast guidance for two 25 major air pollutants: surface O3 and nitrogen dioxide (NO2). Experiments were conducted for a high O3 episode (August 28–29, 2018) during the Long Island Sound Tropospheric Ozone Study (LISTOS) field campaign, which provides abundant observations for evaluating model performance. The results show that these forecast system updates are useful in enhancing the capability of this 3 km forecasting model with varying effectiveness for different pollutants. For O3 prediction, the most significant improvement comes from the dynamic boundary conditions derived from the NOAA 30 operational forecast system, National Air Quality Forecast Capability (NAQFC), which increases the correlation coefficient (R) from 0.81 to 0.93 and reduces the Root Mean Square Error (RMSE) from 14.97 ppbv to 8.22 ppbv, compared to that with the static boundary conditions (BCs). The NO2 from all high-resolution simulations outperforms that from the operational 12 km NAQFC simulation, regardless of the BCs used, highlighting the importance of spatially resolved emission and meteorology inputs for the prediction of short-lived pollutants. The effectiveness of improved 35 initial concentrations through optimal interpolation (OI) is shown to be high in urban areas with high emission density. The influence of OI adjustment, however, is maintained for a longer period in rural areas where emissions and chemical transformation make a smaller contribution to the O3 budget than that in high emission areas. Following the assessment of individual updates, the forecasting system is configured with dynamic boundary conditions, optimal interpolation of


Introduction
Exposure to ambient air pollutants has been associated with detrimental health effects, including cardiovascular 50 diseases and premature deaths (Brunekreef and Holgate, 2002;Kim, 2007;Héroux et al., 2015). Recent decades saw remarkable improvement in the air quality across the United States. From 1990 to 2015, the United States Environmental Protection Agency (US EPA) estimated that the emissions of nitrogen oxides (NOx), a major pollutant that controls regional ozone formation, were reduced from 25.2 to 11.5 million t yr -1 (Feng et al., 2020). The downward trends in NOx emissions have been verified by ground and satellite observations in large cities  and in the eastern 55 United States (Zhou et al., 2013;Krotkov et al., 2016). Because of the substantial emission reductions, ground-level ozone concentrations decreased ubiquitously across the US (Hogrefe et al., 2011;Simon et al., 2015;He et al., 2020).
Regardless of the tremendous improvement in air quality, more than one third of the US population still lives in areas exceeding the National Ambient Air Quality Standards (NAAQS) for ozone (O3) and/or fine particulate matter (PM2.5) (US EPA, 2020). Many of these ozone nonattainment areas are located along the northeastern Interstate 95 (I-95, Interstate 60 Highway on the East Coast of the United States) corridor where high density of emissions is produced by transportation and other industrial sources. Surface ozone is formed from photochemical reactions between NOx and volatile organic compounds (VOCs) (NRC, 1992), and the high emission density of NOx is a major controlling factor for high ozone events in this region.
As part of the efforts to understand regional O3 pollution, a multi-agency collaborative study of precursor emissions, at coastal locations or over complex urban areas, including uncertainties in vertical mixing, deposition processes, spatialtemporal allocation of emissions to the air quality models (Hogrefe et al., 2007;Tong et al., 2006). Therefore, several modeling techniques have been developed to improve the forecasting skills of these air quality models (Liu et al., 2001;Tang et al., 2007). Previous studies (Wu et al., 2008;Sandu et al., 2010) suggested employing data assimilation methods to adjust the initial conditions of a model to reduce model bias. Optimal interpolation (OI) is a simple data assimilation 80 method used to enhance model prediction (Candiani et al., 2013;Tang et al., 2015;Tang et al., 2017). Considering the modeling sensitivity to BCs, Tang et al. (2009) examined the impact of six different sources of lateral BCs on the CMAQ (Community Multiscale Air Quality) forecast ability and the results showed that using global model predictions for BCs was able to improve the correlation coefficients of surface O3 prediction compared to observations. Evaluations of different databases and configurations for BCs in short-term and long-term simulations also showed that dynamic BCs 85 could have a positive impact on numerical predictions (Tang et al., 2007;Makar et al., 2010;Henderson et al., 2014;Khan and Kumar, 2019). However, many of these studies used BCs based on global forecasts that had a relatively low resolution (e.g., 1.4° × 1.4° and 2° × 2.5°). Therefore, databases with higher resolution, such as satellite observations or regional forecasting products, were introduced to construct boundary conditions that were shown to result in a measurable improvement in model performance (Borge et al., 2010;Pour-Biazar et al., 2011). Finally, updating emissions from the 90 base year to the specific forecast year was shown to be an effective approach to reduce the uncertainties of outdated emission inventories to increase forecasting accuracy (Pan et al., 2014;Tong et al., 2015Tong et al., , 2016. This study examines to what extent can various modeling techniques improve O3 and NO2 predictions over LIS and surrounding areas. As the largest metropolitan area in the United States on the Atlantic Ocean coast, this LIS region represents one of the most challenging places for air quality modeling. The resolution of the present operational 95 forecasting system, National Air Quality Forecast Capability (NAQFC), operated by the National Oceanic and Atmospheric Administration (NOAA), is at a 12 km horizontal resolution (Davidson et al., 2008). To better resolve finescale processes such as sea breeze and recirculation of air pollutants at coastal sites, a high-resolution (3km) air quality forecasting system over the LIS region (LIS3km) is developed using the latest meteorology and air quality models. Using observations from ground air quality monitors and the LISTOS field campaign, we evaluate the forecasting skills of the 100 high-resolution air quality forecasting system to predict O3 and NO2 over LIS. Specifically, we use three forecast improvements -dynamic boundary conditions, rapid emission refresh, and chemical data assimilation -to improve the LIS3km system. The effectiveness of each technique to improve forecasting skill is assessed using the observations from the LISTOS and the EPA AirNow network (http://airnowapi.org). Descriptions of the modeling system, forecast improvements, and observation data are presented in Section 2. Assessments of the CMAQ results with and without 105 different forecast system updates are described in Section 3. A summary of our findings and concluding remarks are provided in Section 4.

110
To simulate ozone variability over a complex coastal urban environment, a high-resolution air quality forecasting system has been developed for LIS and surrounding areas. The forecasting system is comprised of state-of-the-science weather, emission, and chemical transport models. The model domain covers eastern Pennsylvania, New Jersey, southern New York, Connecticut and Rhode Island. While this model domain is large enough to capture key physical/chemical processes within the LIS area, such as sea breeze circulation and photochemistry, the influence of regional transport 115 outside this domain cannot be adequately represented. Therefore, real-time forecasts from the operational NAQFC , produced by the NOAA National Weather Service, are used to provide dynamic boundary conditions to investigate the effect of this model input on forecasting performance. We also explore the effects of emission adjustment and chemical data assimilation on forecasting performance.

b) Queens College, c) New Haven, d) Westport, e) Colliers Mills, f) Riverhead, g) Greenwich, h) Madison-Beach Road, i) Middletown-CVH-Shed and Stratford
adjustment experiments are designed to update NOx emissions using observed changes from satellite and ground sensors (Tong et al., 2016). These emission adjustment factors are applied either uniformly across the domain (EmisAdj_whole), or separately for each subdomain (EmisAdj_sub). In the latter case, the domain was divided into five regions based on city areas: New York City (NYC), City of Philadelphia (PH), city area of New Haven-Hartford (NHH) and Providence-Pawtucket (PP), and the area other than these four regions (OTHR) (Fig. 1). Finally, three simulations with the combinations of these three techniques were conducted in search of the best performer. All simulations were conducted for a high ozone episode, which lasted 168 hours from 0:00 UTC August 25th to 23:00 UTC August 31st, 2018.  (Pierce et al., 1998). The emissions of sea spray aerosols are calculated using an updated version of the Gong (2003) sea-spray emission parameterization (Gantt et al., 2015).
The CMAQ model ingests emissions and meteorology to predict spatial and temporal variations of O3, NO2, and their 170 precursors. In this study, version 5.3.1 of the CMAQ model was configured to include detailed implementation of inline emission processes for biogenic, sea-salt and elevated anthropogenic emissions, horizontal and vertical advection, turbulent diffusion, dry/wet deposition and full gas, aqueous and aerosol chemistry using a revised Carbon Bond 6 gasphase mechanism and AE6 aerosol mechanism (CB6r3_AE6_AQ) (Byun and Schere, 2006;Luecken et al., 2019). Both the meteorological and air quality models have a 3 km horizontal resolution over the LIS region and its surrounding areas 175 ( Fig. 1).

Techniques to improve forecasting skills
We implement and test three forecasting improvement techniques to assess their effectiveness in enhancing the simulation performance of the CMAQ model. Details of each update are described below.

a) Dynamic lateral boundary conditions
Regional air quality models such as CMAQ rely on lateral boundary conditions to account for inflow of air pollutants and precursors from out-of-domain sources. These boundary conditions fall into two categories: static and dynamic. Static boundary conditions are time-independent vertical profiles of appropriate species at the boundaries that can be prepared from prescribed profiles, long-term vertical observations, or climatological model simulations (Tong and Mauzerall, 2006;Tang et al., 2007). Dynamical boundary conditions are provided by a concurrently running global model or another 185 regional model covering a larger domain. In the previous studies of regional modeling, a nested grid approach was often applied to provide dynamic BCs for the study area (e.g., Taghavi et al., 2004;Fu et al., 2009;Yin et al., 2015). However, the nested model would need higher computational resources and a longer running time. The increasing pool of real-time national and global forecasts provides alternative BCs that be used to drive a regional forecasting system as demonstrated in this work. Here, we explore the feasibility of utilizing the products of NOAA NAQFC, which provides real-time 190 national forecasts to prepare dynamic boundary conditions to drive the LIS3km system. The NAQFC is an operational system, operated by the National Weather Services, and the data are provided freely to the public. Hourly forecasts of the NAQFC  are processed using the BCON tool developed by the US EPA. The description of NAQFC configuration can be found in Lee et al., (2017) and a summary is provided in Table S1.

195
Optimal Interpolation (OI) is a commonly applied data assimilation method (Wang et al., 2013;Chai et al., 2017) that can be used to adjust the initial conditions (ICs) of an air quality model to minimize errors (Adhikary, 2008). This method runs fast and portably, making it very suitable for the forecasting system which needs regular execution. The equation of the OI method is defined as: where x a and x b are the analyzed and background fields, respectively. B and O are the background and observation error covariance matrix, H is the observational operator and H T is its matrix transpose, and y is the observation vector.
In the CMAQ model, the restart file, called CGRID, is daily generated during the simulation and acts as ICs for the next day. To constrain the biases in ICs, the concentrations of ozone, NO2 and NO in the restart file were adjusted via the OI method, which is applied every 24 hours at 0:00 Coordinated Universal Time (UTC). The influence area of OI is 205 controlled by the correlation length scale and the previous study by Chai et al. (2017) chose the range of 84 km for the contiguous US domain. Moreover, this influence length scale also varies from region to region. Over remote regions, the length scale may be longer while it is shorter over polluted areas as the correlation decreases more rapidly. Considering the high emission density and the fine model resolution over the LIS area, we chose a shorter influence length (33km) for a higher correlation threshold (r >= 0.5) for the LIS which means this OI adjustment was made on each 11×11 grid cell 210 block of the surface layer over the whole domain to obtain the analyzed field x a . Next, as there is no information of vertical background profile in this method, the ratio between x a and x b at each surface layer grid point was used to scale the concentrations for the vertical layers within the PBL. Detailed information for this method is described in Tang et al. (2015;. The OI assimilation first allocates ground-based observational data from the EPA AIRNow network into model grid 215 cells. The Tang et al. (2015) method puts in-situ data directly into the corresponding model grid cells. If there was more than one active site in the same grid cell, the observations are first averaged before being applied to the grid cell (OI_avg hereafter). Grid cells that did not have observations and were not within 5 grids cells from the observations were not adjusted. Therefore, the region of influence is limited, and the adjusted fields may be discrete in spatial distribution.
Besides this method, experiments were also performed with two different interpolation methods for preparing the 220 observational data. The first one was to interpolate the averaged observational grid points to the whole domain using the Inverse Distance Weighting (IDW) interpolation scheme (Shepard, 1968), (the OI_idw method). With this interpolation, the effect of OI will be not limited near the observational sites and most of the grid cells in the domain can be adjusted comparing to the OI_avg. The second method adjusted the initial concentrations by subtracting the bias between the simulation and the averaged observations within the grid point, then smoothing the adjusted concentration field via the 225 IDW scheme. This method is called OI_bias. Unlike the OI_idw which just applied the spatial interpolation to extend the OI effect, in this method the observation cells are distributed to the whole domain grids based on the spatial patterns provided by model so that it is able to better reflect the realistic fields.

c) Emission refresh
The third forecast system update evaluated here is the rapid emission refresh capability that allows timely updates of 230 outdated NEIs to the forecasting year (Tong et al., 2016). Here we focus on updating NOx emissions. NOx are important precursors to tropospheric ozone formation (Spicer, 1983;Chameides et al., 1992), therefore, their emissions can influence atmospheric ozone concentrations. Since NOx emissions decreased substantially over the last decade (Silvern et al., 2019;Dix et al., 2020) and the anthropogenic emission used in this study are based on the 2011 NEIs, the NOx emissions need to be projected from 2011 to the forecast year (2018). According to the approach proposed by . Temporal trends at the surface are determined from the hourly observed NOx concentration during the morning rush hours (06, 07, 08, and 09 local time). These times are optimal for assessing local emission conditions since they are related to the highest NOx levels typically produced as a result of both commuter traffic peaks and the shallow morning planetary boundary layer . Satellite-based temporal trends are calculated from 240 the monthly NO2 product retrieved from the Ozone Monitoring Instrument (OMI) aboard the Aura satellite (Lamsal et al., 2020). A weighting function is introduced to combine the surface-based and satellite-based temporal trends to acquire the merged projection adjustment factor (AF) for a specified region: where ΔS and NS are the temporal trend and the number of satellite data, respectively; and ΔG and NG are the temporal 245 trend and the number of surface-based data, respectively. Two weighting factors, fS and fG are applied to the satellite and surface data, respectively. Here the value of fS is set to 1 and fG to 100 to avoid dominance by either data source . In this study, two groups of AFs are prepared for the emission projection. One is the average AF over the whole domain (EmisAdj_avg) and the other group includes the AFs for each sub-region in the research area (EmisAdj_sub).
The AFs used in both groups are the averages of the monthly AFs from May to September.

Observational data sets
In this study, a suite of observational datasets were used either as inputs for emissions and chemical data assimilation or to evaluate model performance. These datasets include surface O3 and NO2 measurements from the US EPA Air Quality System (AQS) surface network, the NO2 vertical column density (VCD) from the OMI satellite data, NO2 VCD from the and NO2 at hourly intervals. Note that NO2 measurements are typically biased high due to interference in the chemiluminescence measurement (Dunlea et al., 2007). As the goal of this study is to improve forecasting performance, a near-real-time version of the AQS data was used, called AirNow. This is a preliminary dataset for the purpose of realtime air quality reporting and forecasting; it is not fully verified and provides fewer measured species. The data used in 265 this study are downloaded from the AirNow data portal maintained by the US EPA.
NO2 VCD measurements were provided by the Ozone Monitoring Instrument (OMI) standard product (version 4), available from the NASA Goddard Earth Sciences Data and Information Services Center (GES DISC). OMI is a nadirviewing hyperspectral imaging spectrometer that measures solar backscattered radiance and solar irradiance in the ultraviolet and visible regions (270-500 nm) (Levelt et al., 2006). The Aura spacecraft has a local equator-crossing time

270
of 13:45 h in the ascending node. OMI views the Earth along the satellite track with a swath of 3600 km on the surface in order to provide daily global coverage. In the normal global operational mode, the OMI ground pixel at nadir is approximately 13 km × 24 km, with increasing pixel sizes toward the edges of the orbital swaths. Multi-year OMI NO2 data were further aggregated to calculate state-level emission adjustment factors using a mass conservation approach .

275
The high-resolution NO2 observations from the GCAS (Kowalewski and Janz, 2014) are used for a direct comparison against model simulations of the NO2 VCD. GCAS is an ultraviolet-visible spectrometer used in air quality field studies to map the spatiotemporal distribution of NO2 and HCHO VCDs at high spatial resolution (Nowlan et al., 2018;Judd et al., 2020). For LISTOS, this instrument flew on 11 flight days collecting between 2-4 gapless raster datasets at spatial resolutions for NO2 as fine as 250 × 250 m. More information about the retrieval can be found in Judd et al. (2020).
During LISTOS, NO2 from GCAS was validated using coincident Pandora measurements and had a median percent difference of -1.2% with 95% of the most temporally homogeneous points within ± 25% or 0.1DU.

Effects of boundary conditions
In this section, we examine the effects of using the dynamic boundary conditions on O3 and NO2 predictions. As a reference, we also compare these simulations to the NAQFC results, extracted for the same region, during the August 29 high ozone event. Figure 2 shows the O3 and NO2 24-hour average concentrations simulated by Control (static BCs), BCON (dynamic BCs) and the NOAA NAQFC over the LIS region. Comparing to the underestimated O3 concentrations 295 simulated by Control run, the concentration level using dynamic boundary conditions increases considerably and is closer to the observations. High O3 concentrations appear over near-coast areas, but are lower in the northwest of the domain.
This spatial pattern illustrates the ozone river in a northeastward direction along the I-95 corridor, extending from Philadelphia to NYC, and then to Connecticut where the worst air quality is often observed. Although it overestimates surface O3 in Philadelphia and central New Jersey, the BCON simulation can reproduce O3 hourly variations during this 300 episode well in comparison with the observed data (see the time series in Fig. 2d). Note the peak O3 simulated in the control run is nearly the same on all days during the simulation period. The comparisons between the peak O3 with the default profile and dynamic LBC case indicates relatively large regional contributions on these days. Compared to the Control run, the BCON run performed better not only in bias, but also with higher correlations between prediction and observations (Table S2), especially during the August 26-27 high O3 days. As the profile BCs are static and lack spatial-305 temporal variations, the Control run mainly reflects the local contributions of emissions, transport and chemical processes within the domain (Tang et al., 2007). The underprediction suggests that these processes are insufficient to produce the observed O3 levels, and that the transport of air pollutants from upwind is important to predict the high O3 episodes. It highlights the significant influence of dynamic BCs on the simulations over this region during high pollution time. In comparison, the influence of BCs is less important during the cold season, when the simulation with the profile BCs can 310 also result in prediction in reasonable agreement with observations ( Fig. S2a, d). This indicates the influence of dynamic BCs varies with time and it is more significant during the high pollution time. The performance of the high-resolution simulation was next compared to that by the NAQFC. The NAQFC simulation, which is being used to provide national numerical guidance for O3 and PM2.5 , is run at a coarser 320 resolution (12 km), using a different CMAQ version (a revised CMAQ5.0), driven by different emission and meteorology datasets. Regardless of these differences, the NAQFC and BCON runs predict similar surface O3 distribution patterns.
Compared to that in the NAQFC prediction, the O3 from the 3 km BCON run demonstrated more detailed spatial distributions in the predicted O3 fields. For instance, the O3 concentration over the Long Island Sound is lower than its surroundings and the 3 km simulation could reproduce this pattern while the O3 from the 12 km NAQFC showed a 325 relatively coarser pattern of the concentration gradient. The O3 distribution along the coastal area also agrees better with the observations than the 12 km NAQFC prediction. This proves the high-resolution simulation can better reproduce the pollutant variability over this coastal urban area. In addition, the BCON run performs better over southern New Jersey, and northeast of the LIS domain, in particular exhibiting much-reduced biases in the LIS downwind areas as well. As to the diurnal variations, the BCON run overestimates the peak O3 concentrations on August 28 and 29, while the NAQFC 330 run performs well and is closer to the measurements (Fig. 2d). Use of coarser resolution NAQFC predictions as BCs substantially improves the capability of the 3 km forecasting system to reproduce the O3 variability. Compared to the Control run, the correlation coefficient between BCON and observed O3 concentrations increases from 0.81 to 0.93 and the relative mean square error (RMSE) decreases from 14.97 ppbv to 8.22 ppbv with a reduction of 45%, resulting in a comparable performance with the NOAA NAQFC predictions with correlation of 0.91 (Table S2).

335
The spatial patterns of predicted NO2 concentrations from the Control, BCON, and NAQFC runs are quite similar with high value areas all appearing over the NYC area (Fig. 3). The simulated NO2 concentrations by the 3 km forecasting system, either with static or with dynamic BCs, agree better with the observations than those from the 12 km NAQFC simulation, highlighting the importance of using high resolution inputs to better represent the emission sources in the model. The correlation coefficient and RMSEs for the Control and BCON runs are 0.69 (4.12 ppb) and 0.71 (3.82 ppb), 340 respectively, while those of NOAA are 0.67 (4.98 ppb) (Table S2). In addition, the improvement of simulated NO2 using dynamic BC was much smaller compared to that of O3. This is because the lifetime of NO2 is relatively short (1-7 h in summertime, Lu et al., 2015), and its budget in urban areas is mainly influenced by local emissions and chemistry, and less by regional transport, indicating the effectiveness of dynamic BCs depends not only on the downwind/upwind gradients, but also on lifetimes of the concerned species.

Effects of initial condition adjustment
Initial concentrations are an important input to air quality forecasting. Adjusting initial concentrations through chemical data assimilation has been shown to significantly improve air quality forecasting (Tang et al., 2015;Chai et al., 2017) although the impacts wane with increasing forecast length. Here we compare the results using various OI methods

355
with the simulations without any BC adjustment (same as the Control run) and study the effects of adjusting initial conditions on O3 and NO2 prediction. Figure 4 illustrates the initial concentrations of surface O3 adjusted by OI_avg, OI_idw and OI_bias, respectively. In the initial concentrations, the areas influenced by OI_avg are primarily limited to the ground-based sites and the regions within five model grid cells in each direction of the observations compared to the Control run (Fig. 4a, b). The rest of the domain is not affected by the adjustment, resulting in significant differences 360 between adjusted and unadjusted areas. The O3 fields adjusted by OI_idw (Fig. 4c) and OI_bias (Fig. 4d) show similar horizontal distributions, but the concentration level of OI_bias is relatively higher over NYC and northern New Jersey.   (Table 3). In addition, the effects of this adjustment on the modeling results decrease with the simulation time and display no discernible difference from the Control run after 12 hours (Fig. S1). Generally, the improvement of the simulated results due to OI data assimilation over the study domain is smaller than that from the dynamic BCs. Among the three OI methods, the simulation with OI_bias shows the best performance, so this method is chosen for subsequent analyses in which multiple techniques are combined to improve forecasting skills.  CORR: correlation coefficient, RMSE: relative mean square error, NMB: normalized mean bias, NME: normalized mean error. The ICs for each day were adjusted by OI using real time observations, it is interesting to note that the duration of OI influence on O3 simulation varies from place to place. Figure 6 shows the time series of the averaged differences in compared to between that in large cities. The OI effects in large cities remain for a shorter time than in non-urban area or smaller cities. For example, the differences between OI runs and the Control run decrease to ~0 ppb in four to eight hours in two metropolitan areas, NYC and Philadelphia (Fig. 6a, b). Meanwhile, in the New Haven -Hartford region (Fig. 6c), Providence-Pawtucket region (not shown) and the non-urban areas (Fig. 6d), the differences could last 12 to 16 hours.
The different durations indicate the influence time of OI adjusted ICs, not necessarily the improvement in model skill,

410
which is determined by both initial concentrations and other processes (chemical production and transport, etc.). The improvement using OI adjustment is similar in different subdomains (Table S3). This difference reflects the dependence of O3 level on the initial concentrations in the air quality model. In general, the influence of OI adjustment lingers for a longer period in an area with low emission density where emissions and chemical reactions make a smaller contribution to the O3 budget than that in the area with high emission density.

Effects of NOx emission adjustment
One of the major challenges in air quality forecasting is the time lag in updating the emission inputs generated for a specified base year which is typically different than the year for which the simulation is desired (Tong et al., 2012). Here The results in Table 4 and 5 show that the performance for O3 and NO2 prediction is very similar between two 440 simulations using two emission adjustment methods (a uniform average adjustment factor over the entire domain, and spatially varied factors for each subdomain defined in Figure 1).

450
observations from satellite or ground monitors, based on which the emissions were adjusted, may not accurately capture the temporal evolution of the emission sources. A large geographical range may better reflect the overall changes of NOx emissions in the LIS region. Previous studies either use a coarse model resolution (e.g., 1 degree in Lamsal et al., 2011, or state-level adjustment in Tong et al., 2016. As a result, the simulated concentrations using different methods were very close and the limited difference can also get averaged out when calculating the averaged statistical metrics. The effect of the emission adjustment method in this study is not as large as BCON or OI adjustments, which directly influence O3 concentrations. A recent study by Jin et al. (2020) showed that the decrease in NOx emissions has shifted the NOxsaturated to NOx-sensitive regime transition zone closer to urban centers, approximately 40 to 60 km from the center (the highest emission point) of New York City. Therefore, it is expected that the effectiveness of emission adjustment will increase over time in this region. For surface NO2, the emission adjustment showed more significant impact on the 460 simulated concentration. Note that the emission adjustment was only implemented in the LIS system, not in NAQFC, which still uses the 2014 NEIs for anthropogenic emissions. Without the emission adjustment, the changes in NOx emissions between the inventory and forecast years are not accounted for. On the high O3 days, NAQFC over-predicted surface O3 during the study period (Fig. 2c). The NAQFC LBCs are likely associated with a possible over-prediction of the regional transport, which can be partially responsible for the BCON LIS simulation overpredicted O3 during high O3 days (Fig 2d). Considering the similarities of these two emission adjustment methods, they will be both tested in the subsequent multi-adjustment simulations.

475
After assessing the effects of individual updates, we test how these updates can be combined to optimize forecasting performance. In the preceding sections, three groups of adjustment approaches have been included and evaluated. For each group, the best performing method has been identified, including the dynamic BCs, ICs with OI-bias, and rapid emission refresh (EmisAdj_avg/EmisAdj_sub). With these selected updates, we design and conduct two multi-adjustment simulations, the first one used both the dynamic BCs and the OI-bias adjusted initial concentration files ( Figure 8 compares the predicted hourly O3 and NO2 concentrations against in-situ observations from August 26 to 31, 2018 in five subdomains and the overall domain with Taylor diagrams (Taylor, 2001). In the Taylor diagram, the relative skill of each forecasting system to reproduce the O3 and NO2 variability is represented using three statistical adjustments show stronger correlations with R all above 0.9. Furthermore, the performance in the OTHER areas is better than that in the five subdomains with the R value up to 0.97 and SD close to 1 (Fig. 8e). Taylor diagrams also reveal that these adjustments are even more effective over the low emission areas. The three adjusted runs, namely BCON (#2), BO (#3) and BOE (#4) run in diagrams, have well reproduced surface O3 concentrations over the NYC region. The simulations 500 with BOE usually demonstrate a relatively lower O3 concentration level than that with the BCON run or the combined BCON and OI run. This means in the overestimated areas (such as NYC, Fig. 8a), the simulations with emission adjustment show better performance than that without emission adjustment. In addition, these three simulations have similar biases and errors with NMB ranging from 4% to 22% and NME from 15 to 22% (Fig. 9a, 9c). These results illustrate the importance of combining complementary modeling system updates to reduce model uncertainties in a 505 comprehensive way. A single update, such as emission adjustment, may result in a better emission input closer to the "true" level, but its effect can be offset by systematic biases caused by other inputs. Concurrent improvements of boundary conditions and initial concentrations allow a more realistic initial state and boundary conditions to demonstrate the effectiveness of the emission adjustment in improving O3 forecasting (Fig. 9).
The Taylor diagrams show that the performance of variability of NO2 predictions is generally worse than that of 510 variability of O3 predictions. Overlaid on the same diagrams, the points that represent NO2 performance are all further away from the OBS point compared to that representing O3 from the same simulations (Fig. 8). This is not surprising as O3 has been one of the focal points in air quality modeling in the past decades, while NO2 has not been scrutinized with the same intensity. All of the high-resolution simulations, including the Control run, perform better for NO2 prediction than the NAQFC run ( Fig. 9), highlighting the benefit of using a high-resolution modeling system for predicting short-

515
lived chemical species such as NO2. The NAQFC generally underestimates NO2 concentrations in all subdomains. Its bias is the smallest in the NYC subdomain and largest in its downwind New Haven-Hartford region. The correlation coefficient is between 0.8 and 0.9 in NYC, but lower than 0.6 in the New Haven-Hartford region (Fig. 8). Similarly, the NMB are within 10% in NYC but can be as large as -65% in the New Haven-Hartford region. Such a contrast suggests either an underestimate of emission sources in Connecticut, or an unrealistically short lifetime of NOx due to flawed 520 model chemistry, or a combination of both.  , this episode, which occurred during a welldesigned field campaign, offers a rare opportunity to assess how well a state-of-the-science air quality model can predict 550 a high O3 pollution event that is now less frequent than in the past decades.

NO2 prediction
CMAQ predictions of NO2 surface concentrations and vertical column density are compared against ground and aircraft observations. NO2 is not only a key precursor to tropospheric ozone, but also a proxy for traffic-related air pollution in many epidemiological studies (e.g., Jerrett et al., 2007). Within the LISTOS CMAQ domain, there are four 555 active ground monitors with valid NO2 readings during the study period. Hourly variations from AQS monitors, the BOE 3 km prediction, and the operational NAQFC prediction are illustrated in Fig. 10. Among these sites, the lowest NO2 concentrations were observed at the Flax Pond site in the middle of Long Island, away from the major emission sources.
Both BOE and NAQFC are able to reproduce the magnitude and diurnal variations of surface NO2 concentrations at this site. The NO2 concentration at the Queens College site, also located on Long Island though within NYC, is significantly 560 higher than at the Flax Pond site, due to its close proximity to major sources such as the tunnels, harbors and highways.
For this site, the BOE 3 km prediction is considerably better than that from the NAQFC prediction. Similarly, the BOE prediction outperforms the NAQFC at the New Haven site in Connecticut, where the surface NO2 concentration reaches 40 ppbv on August 28 and 55 ppbv on August 29, 2018. The NAQFC predicted concentration is constantly below 10 ppbv, severely underestimating the observations. In comparison, the BOE predicted concentration is much closer to the 565 observations, although still underpredicting the latter. Finally, both models missed the first, primary peak on both days at the Westport, CT site, which is strongly influenced by the NY City plume and sea breeze circulation. Next, the two model simulations are compared against the NO2 VCD measured by NASA GCAS during the LISTOS field campaign. In order to allow a comparison between simulations and measurements from GCAS, the CMAQ 575 prediction of NO2 mixing ratio is vertically integrated from the surface to the layer which is the closest to the plane altitude to generate vertical column density (unit: molecules cm -2 ), GCAS data are averaged over the 3 km grid to provide a spatially representative observation. We also sample the model data to match the actual measurement time. The GCAS observations show higher NO2 VCD in the morning and lower values in the afternoon. This temporal pattern is well captured by both simulations. The GCAS observations depict an NO2 hotspot over lower Manhattan and Brooklyn, which is reproduced by both BOE and NAQFC simulations (Fig. 11). The observed and simulated VCDs are generally at the same magnitude (4-40´10 15 molecules cm -2 ), with BOE better capturing the peak values. Moreover, the VCD prediction from the BOE run presents a northeastward pattern and it was lower over water area of LIS than that over surrounding lands. In comparison, the VCD from NAQFC shows a high NO2 plume over the land and the water around LIS. This spatial distribution from BOE is more consistent with that of GCAS compared to that from NAQFC. Similarly, this 585 situation is also similar for the prediction of surface NO2 distributions (Fig. 3), indicating the high-resolution system can outperform NAQFC through resolving the fine-scale processes. It should be noted that the VCD levels from both simulations are biased high outside the high emission density areas, especially in the morning. The BOE prediction shows a larger area of high NO2 VCD than that from GCAS, suggesting either a positive bias in NOx emissions or inefficient transformation and removal of emitted NOx in the CMAQ model. The high NO2 VCD from the NAQFC simulation is 590 lower than the measurements over lower Manhattan and Brooklyn, and the high NO2 VCD extends to an area larger than that from both GCAS and BOE. The performance is relatively unsatisfactory during the high polluting period on August 28 morning (Fig. 11e, 11i) with a correlation of only 0.56 for BOE and 0.44 for NAQFC. These low correlations could be partly caused by the high spatial variability of fine resolution measured VCD, so that the averaged VCD is still more variable than either model. In contrast, the spatial patterns of NO2 VCD in the afternoon are better reproduced than in the 595 morning (Table S6). In addition, the NO2 VCD from simulation with combined adjustments using EmisAdj_sub method for emission refresh shows a similar spatial pattern with that of BOE (Fig. S3) while its VCD level over the NYC area is lower, making it underestimates the hotspot but much closer to the VCD over the rest of the areas. And besides the uncertainties in the model, an evaluation conducted by Judd et al. (2020) showed that the absolute difference in GCAS from Pandora measurement has an average and standard deviation of -0.2×10 15 ±2×10 15 molecules cm -2 and a percent 600 difference on average of -1.5%±20%, which indicates biases exist in GCAS retrievals. Overall, the BOE simulation at 3 km resolution is able to reproduce the observed NO2 VCD, and unlike the results of surface NO2, the NO2 VCD using EmisAdj_sub has lower NMB (33%) and NME (57%) compared to that using EmisAdj_avg (40% and 61%) while their correlation is still the same (0.74). They both perform better than the NAQFC at 12 km resolution (0.57, 45% and 76%, respectively). The statistical metrics for these simulations are provided in Table S6.

O3 prediction
One key result expected from the improved prediction system is to better reproduce high O3 episodes, especially those 615 events that cause the exceedance of NAAQS. Here we compare the model performance between BOE and NAQFC at the seven sites where the O3 concentrations exceeded the NAAQS. Compared to NAQFC, BOE demonstrates enhanced prediction skills at all sites (Fig. 12). Note the comparisons may be attributed to the differences in meteorology, emission and other factors. Although it is difficult to attribute the improvement quantitatively to each factor, the magnitude of O3 improvement from the base run to the BOE run is compared to that of the overall reduced O3 bias, suggesting a significant 620 contribution from these improvement techniques. The results show that BOE can better capture peak O3 values than NAQFC in the afternoon, a highly desired feature in predicting O3 exceedances. Hourly surface O3 concentrations reached more than 100 ppbv at four Connecticut sites, including Greenwich, Westport, Middletown-CVH-Shed, and Stratford.
While neither BOE nor NAQFC is able to predict such high values, BOE reduces the bias by 10-20 ppbv during peak hours at these sites. The improvement of peak O3 prediction is less significant on the other sites with lower observed O3 625 concentration, but BOE still displays better performance than NAQFC. There are only three sites at which one or both simulations overpredict peak O3 on the August 29, 2018. Compared to NAQFC, BOE shows larger over-prediction of the peak O3 at the Greenwich site, but smaller overprediction at two other sites (Middletown and Westport).
Besides better peak prediction, BOE has also improved the prediction of the timing of peak O3. The peaks predicted by BOE are two to three hours earlier than that by NAQFC, which agree better with the timing of the observed peaks ( Fig.   630 12  , emissions and meteorological data with 3 km resolution could improve the simulation of peak value and timing, especially over urban areas.

645
Vertical profiles of O3 are compared between Langley Mobile O3 Lidar (LMOL) observations and CMAQ simulations at the Westport site. As shown in Fig. 13, LMOL observations reveal that the O3 concentration in the planetary boundary layer start to build around 16:00-17:00 UTC and high concentrations (>~70 ppbv), which extend to a height of about 1.5 km, last until 23:00 UTC on August 28 and 29. This pattern is reproduced by both the BOE and NAQFC simulations.

650
Above the PBL, the variations of O3 concentrations are also captured by both simulations. O3 concentrations in the free troposphere are more controlled by regional O3 production and transport than in the PBL. Consequently, the structure and magnitude of O3 profiles are very similar between the BOE and NAQFC simulations, since the BOE simulation is driven by the dynamic boundary conditions derived from the same NAQFC simulation. Compared to that from the LMOL observations, the predicted O3 concentrations from both runs are biased low above 800 hPa but biased high below 800 655 hPa. Between the two model simulations, the BOE run not only produces more O3 in the PBL, but also shows a better temporal evolution of the PBL structure, with a short-lived high O3 peak and a PBL height peak between 20:00-22:00 UTC on August 28, and persistent O3 and PBL height plateaus between 16:00-23:00 UTC on August 29 (Fig. 13). The PBL in the BOE simulation extends well above 850 mbar, while the observed high O3 from LMOL generally stays beneath this height, suggesting possible overprediction of the PBL height.

660
In general, the 3 km BOE simulation performs better to capture the temporal variability of the PBL and O3 production but tends to overestimate both during the episode. In contrast, the NAQFC simulation has produced less pronounced temporal variations in both O3 concentrations and PBL height in the lower troposphere, in particular on August 28 when this region experienced the worst air quality in several states. The NAQFC simulation, however, performed better during the time with lower O3 concentrations, which resulted in an overall lower NMB (9%) and NME (21%) comparing to that 665 in BOE (22% and 26% respectively). The BOE simulation, however, presented a much better reproduction of the O3 variability in term of correlation (0.71) than the NAQFC run (0.54). This suggests that the new 3 km BOE system is more responsive to the controlling factors that shape O3 pollution, although the system needs to be further refined to reduce bias. The model performance of O3 surface concentrations and vertical distribution using AFs from EmisAdj_sub is very close to those of using the AFs from EmisAdj_avg in the BOE case (Fig. S4, Table S7). Note white represents missing data from LMOL.

Summary
Improvement of air quality in the past decade renders the prediction of high ozone events more challenging. This study investigates the feasibility of designing a high-resolution air quality prediction system to capture these less frequent events with more accuracy. Relying on the observations collected during the Long Island Sound Tropospheric Ozone Study field campaign, we have assessed the effectiveness of various improvements to the predictions system to enhance 680 the predictability of high O3 episodes. These updates were then combined to explore how to further improve the predictability of both ozone and nitrogen dioxide. Finally, the modeling system with combined updates has been utilized to simulate a severe high O3 pollution event in the Long Island Sound and surrounding areas.
Different prediction system updates demonstrate varying potentials to improve O3 and NO2 prediction performance.
For O3 prediction, the most significant improvement comes from the dynamic boundary conditions derived from NOAA part to the fact that O3 is a regional air pollutant and the relatively small model domain used in this study, making the O3 prediction more susceptible to the influence of regional transport. Dynamic boundary conditions (BCs) are less influential for NO2 prediction, for which all high-resolution simulations outperform the 12 km NAQFC simulation, highlighting the importance of spatially resolved emission and meteorology for the prediction of short-lived pollutants. The impact of 690 improved initial concentrations through optimal interpolation (OI) is shown to be large in urban areas initially but fades away rapidly. The influence of OI adjustment, however, lingers for a longer period in an area with low emission density where emissions and chemical reactions make a smaller contribution to the O3 budget than that in the area with high emission density. Such method may be more useful if applied to vertical layers above the ground. Future air quality forecasting and modeling can benefit from concerted efforts to provide near real time data of O3 aloft on a continuous 695 basis (Mathur et al., 2018), so that improved initialization of the aloft conditions can better represent regional transport and modulate the inferred impact of LBCs on O3 forecasting. Finally, emission adjustment, which changes baseline emissions using the temporal trends derived from ground and satellite observations, only yields moderate improvement in O3 prediction compared to that without emission adjustment. One possible direction to explore is to apply other methods to constrain emissions that use both variational (e.g., Elbern et al., 2007;Vira and Sofiev, 2012) and ensemble-based (e.g., 700 Miyazaki et al., 2012Miyazaki et al., , 2017