Evaluation and intercomparison of wildfire smoke forecasts from multiple modeling systems for the 2019 Williams Flats fire

Wildfire smoke is one of the most significant concerns of human and environmental health, associated with its substantial impacts on air quality, weather, and climate. However, biomass burning emissions and smoke remain among the largest sources of uncertainties in air quality forecasts. In this study, we evaluate the smoke emissions and plume forecasts from 12 state-of-the-art air quality forecasting systems during the Williams Flats fire in Washington State, US, August 2019, which was intensively observed during the Fire Influence on Regional to Global Environments and Air Quality (FIREX-AQ) field campaign. Model forecasts with lead times within 1 d are intercompared under the same framework based on observations from multiple platforms to reveal their performance regarding fire emissions, aerosol optical depth (AOD), surface PM2.5, plume injection, and surface PM2.5 to AOD ratio. The comparison of smoke organic carbon (OC) emissions suggests a large range of daily totals among the models, with a factor of 20 to 50. Limited representations of the diurnal patterns and day-to-day variations of emissions highlight the need to incorporate new methodologies to predict the temporal evolution and reduce uncertainty of smoke emission estimates. The evaluation of Published by Copernicus Publications on behalf of the European Geosciences Union. 14428 X. Ye et al.: Evaluation and intercomparison of wildfire smoke forecasts smoke AOD (sAOD) forecasts suggests overall underpredictions in both the magnitude and smoke plume area for nearly all models, although the high-resolution models have a better representation of the fine-scale structures of smoke plumes. The models driven by fire radiative power (FRP)-based fire emissions or assimilating satellite AOD data generally outperform the others. Additionally, limitations of the persistence assumption used when predicting smoke emissions are revealed by substantial underpredictions of sAOD on 8 August 2019, mainly over the transported smoke plumes, owing to the underestimated emissions on 7 August. In contrast, the surface smoke PM2.5 (sPM2.5) forecasts show both positive and negative overall biases for these models, with most members presenting more considerable diurnal variations of sPM2.5. Overpredictions of sPM2.5 are found for the models driven by FRP-based emissions during nighttime, suggesting the necessity to improve vertical emission allocation within and above the planetary boundary layer (PBL). Smoke injection heights are further evaluated using the NASA Langley Research Center’s Differential Absorption High Spectral Resolution Lidar (DIAL-HSRL) data collected during the flight observations. As the fire became stronger over 3– 8 August, the plume height became deeper, with a day-today range of about 2–9 km a.g.l. However, narrower ranges are found for all models, with a tendency of overpredicting the plume heights for the shallower injection transects and underpredicting for the days showing deeper injections. The misrepresented plume injection heights lead to inaccurate vertical plume allocations along the transects corresponding to transported smoke that is 1 d old. Discrepancies in model performance for surface PM2.5 and AOD are further suggested by the evaluation of their ratio, which cannot be compensated for by solely adjusting the smoke emissions but are more attributable to model representations of plume injections, besides other possible factors including the evolution of PBL depths and aerosol optical property assumptions. By consolidating multiple forecast systems, these results provide strategic insight on pathways to improve smoke forecasts.

smoke AOD (sAOD) forecasts suggests overall underpredictions in both the magnitude and smoke plume area for nearly all models, although the high-resolution models have a better representation of the fine-scale structures of smoke plumes. The models driven by fire radiative power (FRP)-based fire emissions or assimilating satellite AOD data generally outperform the others. Additionally, limitations of the persistence assumption used when predicting smoke emissions are revealed by substantial underpredictions of sAOD on 8 August 2019, mainly over the transported smoke plumes, owing to the underestimated emissions on 7 August. In contrast, the surface smoke PM 2.5 (sPM 2.5 ) forecasts show both positive and negative overall biases for these models, with most members presenting more considerable diurnal variations of sPM 2.5 . Overpredictions of sPM 2.5 are found for the models driven by FRP-based emissions during nighttime, suggesting the necessity to improve vertical emission allocation within and above the planetary boundary layer (PBL). Smoke injection heights are further evaluated using the NASA Langley Research Center's Differential Absorption High Spectral Resolution Lidar (DIAL-HSRL) data collected during the flight observations. As the fire became stronger over 3-8 August, the plume height became deeper, with a day-today range of about 2-9 km a.g.l. However, narrower ranges are found for all models, with a tendency of overpredicting the plume heights for the shallower injection transects and underpredicting for the days showing deeper injections. The misrepresented plume injection heights lead to inaccurate vertical plume allocations along the transects corresponding to transported smoke that is 1 d old. Discrepancies in model performance for surface PM 2.5 and AOD are further suggested by the evaluation of their ratio, which cannot be compensated for by solely adjusting the smoke emissions but are more attributable to model representations of plume injections, besides other possible factors including the evolution of PBL depths and aerosol optical property assumptions. By consolidating multiple forecast systems, these results provide strategic insight on pathways to improve smoke forecasts.

Introduction
Wildfire is a natural ecological process that is necessary to maintain ecosystem structure and function (He et al., 2019;Pausas and Keeley, 2019) but also a crucial concern of public health, environment, and climate (Field et al., 2009;Jacobson, 2014;Kanda et al., 2002;Page et al., 2002;Reid et al., 2016). Smoke produced by fires is composed of considerable quantities of aerosols and trace gases originating from emissions of biomass combustion, including primary air pollutants such as particulate matter (PM), nitrogen oxides (NO x ), carbon monoxide (CO), and ammonia (NH 3 ), as well as trace metals and volatile organic compounds (VOCs). Composition of fire smoke evolves over time and space through com-plex chemical transformations, aerosol processes, and interactions with other atmospheric components (Sokolik et al., 2019), which lead to formation of secondary air pollutants such as O 3 and secondary aerosols (Baker et al., 2016). Air pollutants associated with smoke plumes, especially fine aerosol particles, can be transported over long distances and lead to degradation of local to regional air quality and harmful exposures over large areas (Colarco et al., 2004;Larsen et al., 2018). In recent decades, the risks posed by wildfires on human health and property have been costly and increasing in North America (McClure and Jaffe, 2018;Rappold et al., 2017), associated especially with the increases in the severity and frequency of large wildfires and development of the urban-wildland interface over the western US (Westerling, 2006;Williams et al., 2019) and owing mostly to anthropogenic climate change (Abatzoglou and Williams, 2016).
Numerical models of atmospheric chemistry and transport play an important role in advancing our understanding of the diverse impacts of wildfire smoke on air quality and the climate system, interpreting observed smoke plume characteristics, as well as providing valuable information on regulatory and health advisory purposes for decision-making during smoke events. Biomass burning emissions have been incorporated into many modeling systems to account for wildfire impacts in global operational or near-real-time (NRT) air quality forecasts (e.g., Inness et al., 2019;Pierce et al., 2007;Randles et al., 2017). Additionally, multiple regional air quality prediction systems across different regions in North America have also included smoke predictions (Ahmadov et al., 2017;Chen et al., 2019;Herron-Thorpe et al., 2014;Lee et al., 2017;Pavlovic et al., 2016). Effective performance of these air quality forecasting systems has been reported during fire seasons (e.g., Chen et al., 2019;Pan et al., 2020a;Yuchi et al., 2016). However, many of the investigations that evaluate simulations of wildfire smoke are implemented in a retrospective way with the proxies for fire emissions already known, and only a few of them were aimed at demonstrating predictive skill and using different evaluation metrics. Also, most evaluations were performed for a single model or different versions of a similar system. Therefore, a multi-model intercomparison of fire smoke predictions by current modeling systems under a common framework is lacking.
Although advances have been made in a number of forecasting systems, biomass burning emissions and smoke are still within the greatest sources of uncertainties in air quality predictions (Carter et al., 2020;Kaiser et al., 2012;Pan et al., 2020b), relating to many challenges that remain unresolved. Firstly, wildfires occur sporadically in many places; thus the emissions are inherently unpredictable. Biomass burning emissions within a forecasting window are usually estimated by persistence, which means that the emissions estimated based on the latest satellite observations are assumed to persist in the forecasting window. However, emissions from wildfires depend on many factors, including meteorology, fuel conditions, combustion stage, and fire containment activities, and thus can undergo substantial daily and diurnal variability (Saide et al., 2015). This limits the potential of smoke forecasts, especially for large wildfires with drastic day-to-day changes in their behavior.
In addition to the assumption of persistence, the detection of fire activity and quantification of emissions are also challenging and highly uncertain (Darmenov and da Silva, 2013;Kaiser et al., 2012). This is associated with limitations of the spatiotemporal coverage and resolution of satellite measurements, as well as the complexity of fuels and combustion processes. Satellite-based fire emission estimates are primarily derived from satellite observations of burned area, active fire counts, and/or fire radiative power, with constraints from satellite retrievals of aerosol optical depth (AOD), CO, or CO 2 (Wiedinmyer et al., 2011;Kaiser et al., 2012;Darmenov and da Silva, 2013;Li et al., 2019). While advances have been made in fire emission estimation with both topdown and bottom-up approaches, considerable uncertainties in fire emission inventories are reported in the literature. For instance, emissions of biomass burning aerosols differ by a factor of 4 to 7 over North America across inventories, driven mostly by dry-matter differences (Carter et al., 2020). Sensitivity studies have shown substantial emission-related uncertainty in smoke forecasts and the radiative effect of carbonaceous aerosols, contributed by different spatiotemporal distributions and magnitudes of fire emissions (Carter et al., 2020;Garcia-Menendez et al., 2014). This can substantially limit the accuracy of fire smoke forecasts and the potential of air quality models.
Parameterization of plume injection height is another essential factor in the simulation and forecast of smoke transport, lifetime, and chemistry (Paugam et al., 2016). Plume injection heights are defined as the altitudes at which fire emissions are entrained into the boundary layer, the free troposphere, and even the lower stratosphere, resulting from the updrafts generated by heat and buoyancy above fires (Freitas et al., 2007). Profound impact of plume injection height on transport of smoke constituents has been proven, as emissions injected into the free stable troposphere can be transported over long distances owing to stronger winds and fewer scavenging processes (Ansmann et al., 2018;Dirksen et al., 2009;Val Martín et al., 2006); on the other hand, plumes injected into the planetary boundary layer (PBL) are expected to have a much stronger effect on local air quality. The smoke plume injection processes are dependent on meteorological conditions and fire characteristics, which are both highly dynamic and make the representation of injection heights challenging. A variety of plume rise models have been developed and implemented in chemical transport models to parameterize the vertical distribution of fire emissions by taking fire buoyancy and atmospheric conditions into account, including empirical-statistical approaches such as adapted formulations of stacks injections (Briggs, 1975(Briggs, , 1965Pavlovic et al., 2016;Raffuse et al., 2012), methods considering microphysics and entrainment (Freitas et al., 2007) and fire-energy thermodynamics (Anderson et al., 2011;Chen et al., 2019;Sofiev et al., 2012), and integrated systems that fully resolve plume dynamics (Mandel et al., 2014). Other approaches have simply considered arbitrary vertical distributions of fire emissions, such as a uniform vertical distribution below the modeled mixed layer height, or specified altitudes determined empirically (Val Martin et al., 2012). Previous studies have evaluated performance of these approaches. For example, the Briggs approach gives mostly injection heights below the PBL height (Mallia et al., 2018) and makes comparisons at lower heights than the Multi-angle Imaging Spec-troRadiometer (MISR) plume heights because of the different processes controlling the uplift of wildfire plumes compared to the plume rise of stacks (Raffuse et al., 2012;Sofiev et al., 2012). Former studies are generally performed for retrospective cases or offline comparisons against satellite measurement of plume heights, and the performance of diverse plume injection parameterizations deployed in smoke forecasts is yet to be intercompared.
Air quality and smoke forecasting data from multiple modeling systems were collected during the NOAA/NASA Fire Influence on Regional to Global Environments and Air Quality (FIREX-AQ) field campaign, which provides a unique opportunity to extensively understand the status and prospects on wildfire smoke forecasting. The FIREX-AQ field campaign (https://www.esrl.noaa.gov/csd/projects/firex-aq/, last access: 10 March 2021) took place in the late summer of 2019 from 21 July to 5 September. This comprehensive field investigation provided detailed airborne and ground-based observations of the chemistry, composition, fuel, and meteorology for smoke from wildfires and agricultural fires across the continental US, which provides an opportunity to validate and improve real-time smoke forecasts. The air quality forecasting data from multiple systems were used to support the flight planning, including centralized collection and archival of information from major groups providing smoke forecasts covering the continental US. More importantly, a variety of observation platforms were involved, including four instrumented research aircraft, satellites, and ground-based stationary and mobile laboratories. Particularly, the NASA DC-8 aircraft flying laboratory was deployed to Boise, ID, and Salina, KS, from 17 July to 5 September 2019 and collected in situ and remote sensing measurements from multiple fires. The Differential Absorption Lidar-High Spectral Resolution Lidar (DIAL-HSRL) (Hair et al., 2018) on board the DC-8 collected observations of aerosol optical properties' profiles, which constitute a valuable dataset to validate model predictions of plume structure and injection.
In this paper, we present the evaluation and intercomparison of the forecasts of fire emissions and smoke plume from 12 global and regional air quality forecasting systems that represent the state of the art, with the aim of enhancing our knowledge about the main factors controlling their performance. The evaluation is carried out focusing on the Williams Flats fire that occurred in August 2019, Washing- Figure 1. Map of forecast domains for all regional models included in this study. The horizontal grid spacings of the models are labeled in white (see also ton State, US, to demonstrate the forecasting performance in multiple dimensions, including fire emissions, total column and surface aerosol loading, and plume injections. In Sect. 2, we describe the modeling systems with a brief overview of their primary differences. Section 3 provides the analysis of how the model forecasts perform by comparisons to satellitederived AOD, surface observations of PM 2.5 concentrations, and airborne observations of vertical plume structures and plume heights. We also investigate the joint performance for surface PM 2.5 and total column smoke aerosols, as indicated by AOD. The summary and conclusions are presented in Sect. 4, along with discussions on pathways to improve the accuracy and tackle the challenges in smoke forecasting.

Descriptions of forecast models
A total of 12 forecast systems that provide fire smoke predictions are incorporated within the following intercomparison, including three global and nine regional systems. A summary of the model descriptions can be found in Table 1, and the domains of the models are shown in Fig. 1. The evaluation here is restricted to an area between 44.0-50.0 • N and 110.0-122.0 • W, focusing on the smoke plumes from the Williams Flats fire. Description of the fire event is given in Sect. 3. Although some of these forecast systems produce multiple forecasting cycles a day, only one cycle a day was used in this study, as denoted by the initial time in Table 1. In the following, the forecast systems are described separately in brief (Sect. 2.1), along with a summary of main differences in their numerical methodology, especially regarding biomass burning emissions and plume rise parameterizations (Sect. 2.2).  (Benedetti et al., 2009;Flemming et al., 2015;Inness et al., 2019;Rémy et al., 2019), which provides 5 d forecasts of atmospheric composition including reactive gases, aerosols, and greenhouse gases (http://atmosphere.copernicus.eu/, last access: 10 September 2019) at an effective horizontal resolution of 0.4 • . Emissions from global anthropogenic activities are provided by the CAMS_GLOB_ANT v2.1 inventory (Granier et al., 2019). Emissions of organic matter (OM), black carbon (BC), and SO 2 from fires are obtained using the Global Fire Assimilation System (GFAS) v1.2 based on the Moderate Resolution Imaging Spectroradiometer (MODIS) observations of fire radiative power (FRP) (Kaiser et al., 2012). Plume injection height is parameterized with a method derived from MISR fire plume observations (Sofiev et al., 2012). The diagnostics of total and fine-mode AOD use bulk optical properties for each aerosol species that have been pre-computed with a standard code for Mie scattering (Bozzo et al., 2020;Wiscombe, 1980). Specifically, the smoke OM aerosol properties are based on the "continental" mixtures and BC properties are based on soot model described in the Optical Properties of Aerosols and Clouds (OPAC) database (Bozzo et al., 2020;Hess et al., 1998). For the hydrophilic types, the optical properties change with the relative humidity (RH) for the refractive index and density based on Koepke et al. (1997), and the size distribution is modified by applying growth factors (Table A2 in Bozzo et al., 2020) to the mode radius. MODIS AOD data (550 nm) with a variational bias correction applied (Dee and Uppala, 2009) are routinely assimilated in a 4D-Var framework using aerosol total mixing ratio as the control variable (Benedetti et al., 2009). Retrievals of reactive gases are also assimilated, including ozone, carbon monoxide, formaldehyde, and nitrogen dioxide (Inness et al., , 2019.

GEOS-FP
GEOS-FP (GEOS "Forward Processing") is a near-real-time (NRT) forecast system led by the NASA Global Modeling and Assimilation Office (GMAO). It provides NRT forecasts of meteorological fields, aerosols, and tracers globally and twice a day for a period of 120 h, with a grid resolution of about 25 km (0.25 • in latitude and 0.3125 • in longitude). GEOS-FP uses the same modeling configuration as the Modern-Era Retrospective Analysis for Research and Applications, version 2 (MERRA-2) (Randles et al., 2017). GEOS-FP uses the Goddard Chemistry Aerosol Radiation Table 1. Summary of the forecast systems evaluated in this study. "n/a" is used to denote "not applicable".  Anderson et al. (2011) and Transport (GOCART) aerosol model (Chin et al., 2002;Colarco et al., 2010), which includes simplified sulfur chemistry and tracks the aerosol mass mixing ratio of dust, sea salt, hydrophobic and hydrophilic types of BC and organic carbon (OC), and sulfate. Biomass burning emissions are from the Quick Fire Emission Dataset (QFED) v2.4 (Koster et al., 2015). By employing the FRP observations from MODIS, QFED uses FRP-to-emission coefficients adjusted to improve model agreement with AOD estimates. Smoke emissions are distributed within the PBL. While dust and seasalt emissions are wind-driven (Randles et al., 2017), anthropogenic aerosol emissions are obtained from annually varying global datasets (Diehl et al., 2012). Aerosols are assumed to be externally mixed in modes of fixed mean diameter and standard deviation (Colarco et al., 2014;Randles et al., 2017). The AOD is diagnosed by integrating the extinction over the column and aerosol species. The speciesspecific mass extinction coefficients are derived from Mie theory for spherical particles (Wiscombe, 1980) or the Tmatrix approach using the updated optics for nonspherical dust as described in Meng et al. (2010). The extinction coefficients for sulfate and hydrophilic carbonaceous aerosol are a function of RH following Chin et al. (2002). RH affects both the index of refraction and the size distribution. Assumed optical properties are primarily from the Optical Properties of Aerosols and Clouds (OPAC) dataset (Hess et al., 1998) with updated dust optical properties that incorporate nonsphericity (Meng et al. 2010;Colarco et al. 2014). GEOS-FP also includes data assimilation of satellite AOD retrievals corresponding to bias-corrected AOD estimates using MODIS radiances and a neural-network framework (Albayrak et al., 2013).

RAQMS
The Realtime Air Quality Modeling System (RAQMS) is a forecast system that provides global prediction of aerosol and reactive gases at 1 • resolution (Pierce et al., 2007(Pierce et al., , 2009. It uses EDGAR anthropogenic emissions, and biomass burning emissions are estimated using gridded carbon fuel consumption databases, MODIS fire detections, fire weather severity index, and published emission ratios (Petrenko et al., 2012;Soja et al., 2004). RAQMS forecasts are initialized with assimilation of OMI and MLS ozone and MODIS AOD retrievals using a statistical digital filter (Pierce et al., 2007). The same settings for aerosol optical properties and AOD calculation as used in WISC WRF-Chem are employed in this model (see Sect. 2.1.5).

HRRR-Smoke
The High-Resolution Rapid Refresh coupled with smoke (HRRR-Smoke) is an operational smoke forecasting system run by NOAA/NWS. It is based on NOAA's weather forecasting model HRRR (https://rapidrefresh.noaa.gov/hrrr, last access: 10 September 2019) and a coupled meteorologychemistry model WRF-Chem (Grell et al., 2005). HRRR-Smoke simulates primary aerosols from wildland fires in real time on a 3 km resolution grid over the entirety of the continental US (Ahmadov et al., 2017). It ingests FRP data from the MODIS sensor on Terra and Aqua satellites and the Visible Infrared Imaging Radiometer Suite (VIIRS) sensor on the Suomi National Polar-orbiting Partnership (S-NPP) and NOAA-20 to calculate fire size, heat flux, and fire emissions. Fire smoke is treated as a chemically inert tracer, and no other aerosol sources are included in this model. Both dry and wet deposition processes of the smoke tracer are represented. The model also includes direct feedback of the smoke aerosol on radiation. The AOD at 550 nm is diagnosed using the mass extinction coefficient of 4.5 m 2 g −1 . In addition, fire size and heat flux determined by FRP data is used to calculate plume injection heights for the flaming emissions using concurrently simulated meteorological fields by the model (Freitas et al., 2007;Grell and Baklanov, 2011;Paugam et al., 2015). The RAP-Smoke model (13.5 km resolution), which covers the entirety of North America, provides lateral boundary conditions (LBCs) of smoke concentrations to HRRR-Smoke (Ahmadov et al., 2019).

WISC WRF-Chem
WISC WRF-Chem is an experimental regional forecast system led by University of Wisconsin, which runs in nearreal time with its 8 km grid-resolution domain nested into RAQMS. The Goddard Chemistry GOCART is used as the aerosol scheme (Chin et al., 2002). The AOD at 550 nm is computed by vertical integration of the aerosol extinction. Changes in the optical properties of OC associated with hygroscopic growth are accounted for, with the mole fraction and density values being used from lookup tables and as a function of RH (OC mole fraction values range from 1 to 0.01 for RH values of 1 % to 99 % respectively). Extinction efficiencies are used as a function of mole fraction, with values ranging from 1.37 to 2.18 for mole fraction values of 0.01 to 1. The initial and boundary conditions for aerosols are provided by the RAQMS (Pierce et al., 2007). Meteorological initial and boundary conditions are provided by the NOAA Global Forecast System (GFS) V15 release, which uses the Finite-Volume Cubed-Sphere (FV3) dynamic core. The AOD assimilation is not performed within the model but is included through the RAQMS initial conditions. Smoke emissions were estimated using the geostationary fire detections from the GEOS-15 and the Brazilian Biomass Burning Model (3BEM), which is a bottom-up biomass burning emission estimation approach included in the PREP_CHEM_SRC emissions preprocessor (Freitas et al., 2011;Pierce et al., 2009).

UCLA WRF-Chem
UCLA WRF-Chem provided experimental forecasts during the FIREX-AQ field campaign at a spatial resolution of 4 km over the western US. The model is based on the WRF-Chem v3.6.1 (https://ruc.noaa.gov/wrf/wrf-chem, last access: 10 March 2021) and configured with a simplified aerosol-aware microphysics scheme (Saide et al., 2016;Thompson and Eidhammer, 2014) to reduce computational costs compared to full-chemistry runs. The system was deployed successfully in a previous NASA field campaign targeting smoke from fires (Redemann et al., 2021). The meteorological initial and boundary conditions are acquired from the 12 km North American Model (NAM) Nonhydrostatic Multiscale Model (Janjic and Gall, 2012). Two categories of aerosols are considered, i.e., water-and ice-friendly aerosols, which are non-reactive and only undergo wet deposition. The smoke aerosols are considered to be fully contained in the water-friendly aerosols. Ambient aerosol extinction and AOD at 550 nm are computed based on the two aerosol tracers (water-and ice-friendly aerosols) using fixed RHdependent mass extinction efficiencies to consider aerosol hygroscopic growth. Dry extinction is computed using RH of 20 %, and it is used to estimate PM 2.5 concentrations (using the mass extinction efficiency of 3.5 m 2 g −1 ).
Smoke emissions at 0.1 • resolution are obtained from QFED v2.4 and processed using the fire_emiss preprocessor (https://www2.acom.ucar.edu/wrf-chem/ wrf-chem-tools-community, last access: 10 September 2019). Each grid cell from which the fire smoke is released is assigned a burned area of 0.25 km 2 in the plume rise parameterization scheme (Freitas et al., 2007(Freitas et al., , 2010. The same scheme was used in NCAR WRF-Chem, UIOWA WRF-Chem, and WISC WRF-Chem. In addition, this model ingests MODIS AOD observations on Terra and Aqua satellites to provide constraints on fire emissions in near-real time using a inversion modeling framework (Saide et al., 2015(Saide et al., , 2016. The inversion scheme optimizes emissions for six fire complexes, with the largest average emissions over the last 4 d.

UIOWA WRF-Chem
The University of Iowa provided air quality forecasts based on WRF-Chem v3.9.1 (UIOWA WRF-Chem) (http: //bio.cgrer.uiowa.edu/FIREX-AQ/model_info.html, last access: 10 September 2019) using the Model for Ozone and Related Tracers-4 (MOZART-4) (Emmons et al., 2010) with aqueous chemistry as the chemistry scheme and the Model for Simulating Aerosol Interactions and Chemistry (MO-SAIC) (Fast et al., 2006;Gao et al., 2016;Zaveri et al., 2008) for aerosols. MOSAIC uses a sectional representation of aerosol size distribution, with detailed aerosol interactions with radiation and clouds described by Chapman et al. (2009). The aerosol size distributions are described in eight size bins, and the chemical constituents of the aerosol are assumed to be internally mixed within each bin and externally mixed between bins. The optical properties of each bin are calculated using the Mie code (Toon and Ackerman, 1981), then their summations over all bins give the bulk optical properties of the aerosol population (Fast et al., 2006;Barnard et al., 2010). Aerosol hygroscopicity is considered by calculating aerosol water content using the Zdanovskii-Stokes-Robinson (ZSR) method (Zdanovskii, 1948;Stokes and Robinson, 1966). The model included tracers with no lifetime for understanding the impact of processes such as smoke plume rise. The GFS data were used for the meteorological initial and boundary conditions and the Whole Atmosphere Community Climate Model (WACCM) (Marsh et al., 2013) forecasts for chemical species and aerosols. Biomass burning emissions are also obtained from the QFED v2.4, and the plume rise was also parameterized using the method of Freitas et al. (2007Freitas et al. ( , 2010.

NCAR WRF-Chem
The NCAR WRF-Chem is a regional forecast system (https: //www.acom.ucar.edu/firex-aq/, last access: 10 September 2019) led by NCAR Atmospheric Chemistry Observations and Modeling (ACOM) using WRF-Chem v3.9.1 (Kumar et al., 2021). The meteorological initial and boundary conditions are provided by the GFS model. Biomass burning emissions are produced each day using the near-realtime Fire Inventory from NCAR (FINN) based on MODIS Rapid Response fire counts (Wiedinmyer et al., 2011). Plume rise by Freitas et al. (2007Freitas et al. ( , 2010 is used to distribute the fire emissions vertically. The model uses MOZART for gasphase chemistry and GOCART for aerosol processes, with the chemical initial and boundary conditions coming from the WACCM forecasts. The aerosol optical properties are calculated using the parameterized Mie theory (Ghan and Zaveri, 2007), with a hygroscopicity parameter of OC of 0.14.

NAQFC
The National Air Quality Forecasting Capability (NAQFC) model is developed by NOAA/Air Resources Laboratory (ARL) and National Centers for Environmental Prediction (NCEP) to provide operational 48 h air quality prediction over the US (Lee et al., 2017;Pan et al., 2020a) at 12 km resolution. It is based on the US Environmental Protection Agency (EPA) Community Multi-scale Air Quality (CMAQ) v5.02 (Byun and Schere, 2006) modeling system and is offline-driven by the 12 km North American Model (NAM) Nonhydrostatic Multiscale Model (Janjic and Gall, 2012). It uses the US EPA National Emission Inventory (NEI) 2014v2 and correction factors based on satellites to account for trends in NO x emissions (Tong et al., 2015). Wildfire emissions, thermal, and speciation characteristics are determined in near-real time by the US Forest Service BlueSky smoke emission package (Larkin et al., 2009) and the NOAA/NESDIS Hazard Mapping System (HMS) for fire location and strength (Ruminski and Kondragunta, 2006). The BlueSky wildfire heat flux was used in the Briggs equation (Briggs, 1975) to determine smoke plume injection heights. Monthly averaged concentrations for 36 gaseous and aerosol species obtained from GEOS-Chem were used as lateral boundary conditions below 7 km altitude, and a clean-air scenario static condition was used between 7 km and model top. NAQFC used the Carbon-Bond 2005 (CB05) (Yarwood et al., 2005) for gas-phase mechanisms and followed largely the EPA's AERO4 module for aerosol processes (Binkowski and Roselle, 2003) with the related emission and removal processes in CMAQ v4.6. The AERO4 represents particle size as three modes. The processes of coagulation, particle growth, and new particle formation are included. Extinction of aerosols is represented using a parametric approximation to Mie extinction (Binkowski and Roselle, 2003;Byun and Ching, 1999;Evans and Fournier, 1990). Then the AOD is derived by integrating the Mie extinction coefficients over the column.

AIRPACT
The Air Information Report for Public Awareness and Community Tracking (AIRPACT) v5 is an air quality prediction system primarily for Idaho, Oregon, and Washington (Herron-Thorpe et al., 2014). Meteorology fields predicted by the University of Washington WRF (Skamarock et al., 2008) model at 4 km resolution are used to drive CMAQ v5.02, which accounts for the chemical and physical processes of air components, including emissions, transport, vertical mixing, dilution, rainout, and deposition. The CMAQ model includes the CB05 and AERO6 chemical and aerosol mechanisms. Aerosol extinction at 550 nm is estimated using the approximation of the Mie extinction efficiency (Binkowski and Roselle, 2003;Byun and Ching, 1999), with the reference refractive indices of aerosols based on the OPAC database (Hess et al., 1998). The model assumes that organics influence neither the water content nor the ionic strength of the aerosol particles; only the equilibrium of the sulfate, nitrate, ammonium and water system is considered (Binkowski and Roselle, 2003). The chemical boundary conditions are derived from the WACCM to account for long-range transport of pollutants from outside the domain. The WACCM simulations incorporate fire emissions from the FINN data (Wiedinmyer et al., 2011). Fire emissions in AIRPACT are derived from the BlueSky framework, with fire locations determined by the Satellite Mapping Automated Reanalysis Tool for Fire Incident Reconciliation (SMARTFIRE) v2. Plume rise is represented using a modified WRAP DEASCO3 method (Mavko and Morris, 2013).

FireWork
Environment and Climate Change Canada (ECCC) operates the Regional Air Quality Deterministic Prediction System (RAQDPS), which uses the Global Environmental Multiscale -Modelling Air quality and CHemistry (GEM-MACH) model (Moran et al., 2010) v5.0 with full description of atmospheric chemistry and meteorological processes to provide 48 h forecasts at 10 km resolution. The Forest Fire Smoke Model (FireWork) operational system is run as an additional version of the RAQDPS by including smoke emissions (Chen et al., 2019;Pavlovic et al., 2016). Gasphase chemistry is accounted for using the Young and Boris scheme, and aerosol processes are represented by the Canadian Aerosol Module (CAM) (Gong et al., 2003) with two bins. Aerosol optical properties within the model are decoupled from the meteorological model components -the model's radiative transfer routines make use of climatological aerosol radiative properties in its radiative transfer processes. AODs from the FireWork forecast were generated using post-processing of the model outputs. The operational 2-bin model output was first mapped to a 12-bin representation based on an assumed fixed size distribution, then optical properties were calculated based on speciated PM 2.5 (for nitrate, sulfate, OC, BC, dust, and sea salt) using formula from IMPROVE. Biomass burning emissions are computed by the Canadian Forest Fire Emission Prediction System (CFFEPS) v2.03 (Chen et al., 2019), which is a bottom-up system linked to the Canadian Wildland Fire Information System (CWFIS) (Lee et al., 2002), with the hourly changes in biomass fuel consumption parameterized considering forecasted meteorology at fire locations. The fire-activity information is based on initial NRT fire hotspot data from three satellite sensors, including the Advanced Very High Resolution Radiometer (AVHRR), MODIS, and the Visible Infrared Imaging Radiometer Suite (VIIRS). Smoke emissions and the energy generated from wildfires are a product of burned area with diurnally adjusted fire growth rates (Lawson et al., 1996), fuel consumed per unit area calculated considering meteorology, combustion stages and burn durations, and emission factors following the literature (Urbanski, 2014). The forecasts initialized at 12:00 UTC are used in the evaluation, which ingested the most recent data of the current day's hotspot by the initialization time. A plume rise parameterization based on fire-energy thermodynamics is used to define the smoke injection height and the vertical distribution of emissions (Anderson et al., 2011;Chen et al., 2019).

ARQI
An experimental version of GEM-MACH at 2.5 km resolution is maintained by the Air Quality Modelling and Integration (ARQI) group within ECCC (Makar et al., 2021). This domain is nested within the 10 km operational FireWork domain to generate forecasts for Alberta and Saskatchewan, Canada, continuously since 2012 and has been used during several field campaigns. In support of the FIREX-AQ, the experimental forecast system was set up by ECCC with the domain covering the northwest US and southwest Canada to track smoke flows from and towards Canada. In order to make these forecasts available during the forecasting window used in FIREX-AQ, ARQI was initialized at 12:00 UTC on the previous day to the forecast day, using the previous day 03:00 UTC update of CFFEPS emissions. Thus, these forecasts had smoke emissions that were nearly behind by 1 d compared to the rest of the forecasting systems, and this may account for some of the discrepancies between the ARQI and FireWork forecasts, both of which used the same forest fire emissions processing framework. The KPP model with Rodas3 solver is used for gas-phase chemistry. Aerosol processes are also depicted by the CAM module with 12 bins. AOD is generated using an online Mie scattering approach assuming homogeneous spherical aerosols (Bohren and Huffman, 1983). Similarly to FireWork, the biomass burning emissions are estimated by the CFFEPS v4.0 following Chen et al. (2019) but with some modifications: (1) plume rise is calculated using the full GEM-MACH vertical resolution of temperature for radiative balance; (2) a 24 h offline simulation using an a priori GEM forecast is used to create spin-up conditions for the GEM-MACH 2.5 km simulation, but the forest fire emissions used in the model were calculated by CFFEPSv4.0 online within GEM-MACH. This allowed the forest fire emissions of particulate matter to modify the weather within the simulation, in turn modifying the forest fire plume rise behavior on subsequent time steps. A more detailed description of the ARQI model and these feedback effects may be found in Makar et al. (2021).

Main differences of the forecast systems
The modeling systems included in this work represent a considerable diversity in forecasts accounting for fire smoke over the western US. Besides the spatial coverages, resolutions, and the driving meteorology, they differ in several major aspects: 1. Biomass burning emission input. Diverse approaches of quantifying emissions are involved, which can be broadly categorized into top-down estimates based on satellite FRP and FRP-to-emission coefficients and bottom-up estimates based on fire detections (hotspots reflecting burned area), fuel categories, and emission factors.
2. Plume injection. Smoke emissions are distributed within the PBL for GEOS-FP and RAQMS (severity dependent), while plume rise parameterizations are employed in the other models. All the WRF-Chem simulations used the default version of the plume rise parameterization within WRF-Chem (Freitas et al., 2007(Freitas et al., , 2010, and HRRR-Smoke and CAMS used satellite FRP observations in their parameterizations, while ARQI (Makar et al., 2021) and FireWork (Chen et al., 2019) used hourly calculated fire energy and thermodynamic (temperature) profiles at hotspot locations.
3. Diurnal cycle of smoke emissions. Diverse patterns are adopted in the systems from nearly flat profiles to strong diurnal variations, which will be shown in the comparison. Most models use fixed diurnal profiles, except for UCLA WRF-Chem, for which the pattern can be modified by the inverse modeling of fire emissions.
4. Initialization time. The forecasts considered here were mostly initialized at 00:00 UTC or 12:00 UTC, which leads to a different time of validity for satellite observations and thus fire emissions in the models, depending on the latency of data by initialization times.

Complexity of chemical mechanisms.
Smoke chemistry is treated in different ways, ranging from tracers in HRRR-Smoke and UCLA WRF-Chem, simplified GOCART (Chin et al., 2002) chemistry in GEOS-FP, NCAR WRF-Chem, and WISC WRF-Chem, and full chemistry by other models that may have differences in their treatment of organic aerosol (e.g., ARQI and Fire-Work).
6. Assimilation of satellite AOD data. All three global operational forecasting systems incorporated assimilation of satellite AOD data. For the regional models, WISC WRF-Chem was initialized with assimilated aerosol fields from RAMQS, and UCLA WRF-Chem used AOD retrievals to constrain fire emissions. Other models did not make use of chemical data assimilation in their systems.
7. Treatment of aerosol processes and assumptions of aerosol physical and chemical properties. These are relevant to aerosol optical properties' calculation. The direct aerosol feedbacks linked to radiative forcing are considered in many models (CAMS, GEOS-FP, ARQI, HRRR-Smoke, UCLA WRF-Chem, UIOWA WRF-Chem, WISC WRF-Chem, ARQI, and NCAR WRF-Chem), while the indirect feedbacks are enabled in three models (UIOWA WRF-Chem, UCLA WRF-Chem, and ARQI). Regarding the diagnosis of AOD, UCLA WRF-Chem, HRRR-Smoke, WISC WRF-Chem, RAQMS, and GEOS-FP used mass-concentration-based aerosol extinction, while CAMS, UIOWA WRF-Chem, NCAR WRF-Chem, NAQFC, AIRPACT, and ARQI used the Mie theory, and FireWork used a diagnostic postprocessing calculation to estimate AOD from modelgenerated aerosols. The uncertainty of AOD calculations owing to the different methods and assumptions about the mixing state, density, refractive index, and hygroscopic growth of aerosols has been estimated as 30 %-35 % (Curci et al., 2015), with the general tendency for the different AOD methods to result in negative biases relative to satellite observations. This uncertainty introduced by these factors is not treated explicitly in this work.

Evaluation of smoke forecasts for the Williams Flats fire
The Williams Flats fire was the largest wildfire event sampled during the FIREX-AQ field campaign. In this paper, we focus the model evaluations on this event. First reported at about 03:23 PDT (Pacific daylight time, or UTC−7) on 2 August 2019, the fire was ignited by various lightning strikes related to an early morning thunderstorm in the Colville Reservation, Washington State, located about 8 km to the southeast of Keller, WA, and 80 km to the northwest of Spokane, WA. The high-pressure system over the fire area produced abovenormal temperature and low humidity, which in combination with the wind pattern resulted in spreading of the blaze to the north bank of the Columbia River. The fire event led to numerous evacuation orders in the area. Fuel burned in the fire event included short grass, timber, and tall brush. Containment efforts were somewhat hampered by steep terrain, limited access, and primitive road conditions near the fire. The storm that occurred on 9-10 August brought large amounts of rain over the fire, which caused localized flash flooding, washing out several roads. Although the storms caused challenges for firefighting efforts, the cooler temperature and rain dramatically moderated the fire behavior, and the percentage of containment reached 65 % on 13 August (https://inciweb. nwcg.gov/incident/6493/, last access: 10 September 2019). According to the final Incident Status Report (ICS-209), prepared by the Bureau of Indian Affairs for the Williams Flats fire, the unit burned a total of 44 446 acres (179.87 km 2 ). The burned area measurements from the USDA Forest Service National Infrared Operations (NIROPS) Unit with airborne thermal infrared (IR) imaging are used to present the evolution of this event. As shown in Fig. 2, a slightly increasing trend in the daily burned area growth can be seen on 4-6 August, and on 7 August the fire expanded abruptly, with the daily increment of burned area being almost triple of that on 6 August. Therefore, the predicted burned area growth by assuming persistence, namely maintaining the same burned area growth on the previous day, may lead to significant underprediction of the fire expansion on 7 August and overestimation on 8 August. Considering the temporal evolution of this event, the 7 d period of 3-9 August 2019 is of interest in this evaluation, as it corresponds to the actively expanding stage of the fire with intense emissions and abundant observations. Evaluation of the models' performance is carried out from multiple perspectives, including (1) fire emissions and their diurnal variation pattern, (2) total column aerosol loading via comparisons against satellite AOD retrievals, (3) surface air quality impact via comparisons against in situ PM 2.5 measurements, and (4) vertical plume structure and fire plume injection compared against airborne lidar observations. Additionally, further discussion is presented on the surface PM 2.5 to AOD ratio and possible ways to reduce the discrepancies in performance between the two terms.

Comparison of fire emissions
In this section, the biomass burning (BB) emissions from the Williams Flats fire that are used in models are intercompared in terms of the evolution of daily emissions and diurnal variation patterns. Figure 3 shows the comparison of daily total BB organic carbon (OC) emissions for eight models. Due to data availability, not all models could be included. Emissions from each forecasting system are derived by aggregating emissions at grid pixels representing the Williams Flats fire from 00:00 to 23:00 PDT per day, with the value for each hour derived from the latest forecast cycle. An illustration of the grid pixels on the emission distribution maps is included in Fig. S1. Moreover, the emission estimates derived by a detailed analysis developed by the FIREX-AQ Fuel2Fire group (https://www-air.larc.nasa. gov/missions/firex-aq/index.html, last access: 4 July 2020) are also shown in Fig. 3. As a bottom-up emission dataset, the BB emissions are derived using Fuel Characteristic Classification System (FCCS) fuelbeds, VIIRS, Geostationary Operational Environmental Satellite (GOES), MODIS, and ground-based intelligence (Soja et al., 2004) and provided at a temporal resolution of 1 s for both flaming and smoldering conditions. We converted the total carbon emissions into carbonaceous aerosol emissions (OC and BC) using a fixed percentage of 2 % (Soja et al., 2004) and then extracted the partition of OC emissions following an up-to-date relationship between the ratio of OC and BC emission factors (EF OC / EF BC ) and modified combustion efficiency (MCE) for western US wildland fuels (Jen et al., 2019), with the MCE assumed to be 0.84 and 0.95 for smoldering and flaming emissions, respectively.
The result indicates a large spread of the daily total BB OC emissions within forecasting systems, with the differences in these estimates ranging from a factor of about 20 to 50 on 5-9 August (Fig. 3). The factors on the days before and after are even higher, owing to the different pace of the models ingesting satellite fire detections. Meanwhile, the FRP-based emissions that were used by UCLA WRF-Chem, UIOWA WRF-Chem, GEOS-FP, HRRR-Smoke, and CAMS tend to be overall higher than the hotspot-based emissions by a mean factor of 5.6. This is especially the case for the QFED emissions used by UCLA and UIOWA WRF-Chem and GEOS-FP, which are on average 6.4 times higher than the hotspot-based emissions. The Fuel2Fire emissions are within these two categories. The emission estimates tend to show an increasing trend throughout the days, which is consistent with typical fire behavior and Fuel2Fire emissions. The large spread in the emission estimates could be due to multiple factors, including differences in the identification of ecosystems or fuel load, land cover classification, type of satellite fire detections used (e.g., different sensors and pixel sizes), and the method and timing of ingesting satellite observations. Future work needs to be performed to understand the large spread between the smoke emissions and to reduce their uncertainty.
Diurnal factors of the smoke emissions are evaluated against the observed patterns derived by the GOES-17 Wildfire Automated Biomass Burning Algorithm (WFABBA) FRP product generated by the Cooperative Institute for Meteorological Satellite Studies (CIMSS) at the University of Wisconsin, Madison. As shown in Fig. 4, the FRP data exhibit discernable diurnal fire activities, which peaked towards late afternoon (14:00-19:00 PDT), with a substantial day-today variability. By contrast, most models assumed fixed diurnal profiles. A variety of patterns are found for the models, which show relatively smaller variations, e.g., GEOS-FP, RAQMS, and AIRPACT, and more pronounced peaks for the other models. Overall, NCAR WRF-Chem peaked the earliest (14:00 PDT), and ARQI peaked the latest (17:00 PDT). However, most model patterns deviate from the FRP observations. The day-to-day variation and bimodal patterns on some of the days were not captured by any of the models. UCLA WRF-Chem incorporated an inversion technique to constrain fire emissions, which allowed the emissions to be pushed later, resulting in a better agreement with the FRP data. But the coarse time resolution of the scaling factors (8 h) greatly limited how much the diurnal profile could be modified. Additionally, the nighttime fire activity was not well described on the nights of 2-3 and 7-8 August by most models, except for ARQI, due to its later peaks. Note that FireWork used similar diurnal factors to ARQI, but it is not shown in Fig. 4.
Multiple ways to improve the representation of diurnal emission variations can be drawn from these results. First, forecasts would likely benefit from including diurnal cycles based on geostationary FRP, coinciding with recent literature (Wiggins et al., 2020). For doing this, at least 1 d of spin-up would have to be performed or using near-real-time incorporation of data. A modeling system for this goal has been reported, which adopts a strategy of hourly sequential warm-start runs with FLEXPART-WRF (Solomos et al., 2015(Solomos et al., , 2019, which allows the emissions to be updated every hour using METEOSAT geostationary observations. However, due to large day-to-day variability in diurnal cycles, it still does not guarantee that the persistence of the latest diurnal pattern into the forecasting window will provide better results. One possibility is to utilize fire weather forecasts, which are currently used to predict fire danger. Thus, future work is needed to investigate methods to forecast diurnal behavior of fires.

Data and statistical metrics
The AOD at 550 nm from the MCD19A2 Version 6 product  is used for the evaluation. It is a MODIS Terra and Aqua satellites combined Level 2 product based on the Multi-Angle Implementation of Atmospheric Correction (MAIAC) algorithm producing AOD data at 1 km pixel resolution (https://lpdaac.usgs.gov/products/ mcd19a2v006/, last access: 10 January 2020). Compared to other algorithms, the MAIAC algorithm provides more available AOD data over smoke plumes with its capability to accurately classify thick smoke, which is frequently identified as clouds by other methods . With Terra and Aqua's sun-synchronous low earth orbit, the MA-IAC data have a higher nominal resolution than geostationary data but at lower temporal refresh rates. The equatorial crossing time for the MODIS Terra is 10:30 and 22:30 LST and 01:30 and 13:30 LST for the MODIS Aqua. Locations in low latitudes and midlatitudes are scanned twice per day by each of the satellites. Higher latitudes can receive more frequent data coverage, with up to six orbits per day in Alaska and northern Canada. As the MAIAC AOD is retrieved from visible-band (470 nm) measurements, only daytime data are available. The AOD accuracy is evaluated as ± (0.05+15 %) or even better ± (0.05 + 10 %) in a global validation . A recent assessment over North America against ground-based observations at the AErosol RObotic NETwork (AERONET) sites indicates that MAIAC performs well for this region over a wide range of surface conditions, with a bias of 0.015 and an RMSE of 0.062 over western North America (Jethva et al., 2019). MAIAC also shows extended coverage over the continent US compared to the VI-IRS product or other MODIS algorithms, owing to its pixel selection process and ability to retrieve aerosol information over brighter surfaces (Jethva et al., 2019;Superczynski et al., 2017). For post-processing, the data were filtered according to the quality assessment flags, keeping the retrievals with cloud masks indicating "clear" or "possibly cloudy" and adjacency flags of "clear" or "adjacent to a single cloudy pixel". The tiles of retrievals were concatenated to produce hourly snapshots, with the overpassing time rounded to full hours. The filtered AOD data were spatially mapped onto the grid corresponding to each model's resolution for consistency. There were 14 hourly scenes in total on 4-8 August 2019 (two to three snapshots per day). The evaluation was also performed at four specific grid resolutions (0.1, 0.2, 0.5, and 1.0 • ) to examine the model performance at different spatial re-gridding resolutions.
To evaluate the AOD retrievals during the Williams Flats fire over our region of analysis, we compared the MA-IAC data against the AERONET sun photometer data (Version 3, Level 2.0, cloud-cleared and quality-assured) . During FIREX-AQ, multiple temporary NASA AERONET platforms -Distributed Regional Aerosol Gridded Observation Networks (DRAGON) -were deployed to collect sun photometer measurements (Holben et al., 1998(Holben et al., , 2018. Fixed DRAGON sites operated in Missoula, Taylor Ranch, and McCall (https://aeronet.gsfc.nasa.gov/new_ web/DRAGON-FIREX-AQ_2019.html, last access: 8 August 2020). Along with the permanent AERONET sites, the AOD retrievals are available at 27 ground sites (Fig. S2). As MODIS MAIAC AOD data at 550 nm are used in model evaluation, for consistency, the AERONET AOD at 500 nm is converted to 550 nm using the Ångström exponent retrieved for 440-675 nm. Following the collocation strategy reported in Jethva et al. (2019), we used two sets of spatiotemporal averaging windows to get the AOD matchups. The MAIAC data are re-gridded onto a 0.1 • (0.4 • ) resolution grid and then bilinearly interpolated onto the locations of the sites; the AERONET data are averaged within 0.5 h (1 h) time windows centered at the overpass time of MAIAC. To avoid values after re-gridding driven by very sparse MAIAC pixels contained in the respective grid boxes, the minimum number of 1 km satellite observations contained in each grid cell is required to be larger than 20 % of the maximum possible 1 km pixels contained in a grid box. Figure 5a and c show the scatterplots constructed by using the matchups between AERONET and MAIAC. The MAIAC AOD is highly correlated (r ∼ 0.84 and 0.89) with AERONET and shows small positive biases. As expected, the larger spatial and temporal averaging intervals yield a larger number of data pairs. The dependence of MAIAC bias on the magnitude of AOD is examined, and the result for the bins with the number of matchups larger than five is shown (Fig. 5b, d). The me- dian error is less than 0.015, and an increasing trend towards higher aerosol loading is notable for AOD bins, with their center values larger than 0.1. The spread of the errors becomes greater as AOD increases. This result demonstrates acceptable accuracy of the MAIAC AOD during this wildfire event. It also suggests a tendency of slightly larger positive bias and increased variability in the retrieval errors over the areas with significant smoke impacts.
Regarding the model forecasts, coincident predictions at the closest hour relative to observations were derived from the most recent forecast cycle (initialized within 24 h) excluding the spin-up period. For consistency with AOD measurements, the forecasts are also filtered to exclude cloudy conditions based on the cloud water mixing ratio or cloud fraction, depending on data availability. Specifically, for HRRR-Smoke, UCLA WRF-Chem, AIRPACT, ARQI, and NCAR WRF-Chem, the grid cells with total column cloud water >10 −6 kg m −2 were filtered out. For GEOS-FP, the grid cells with low cloud fraction or middle cloud fraction >10 % were masked. For UIOWA WRF-Chem, grid columns with more than five vertical layers with clouds were excluded. Although no cloud filter was implemented for the other models as cloud variables were not archived, the grid cells with clouds can be mostly masked when filtering together with the observations. After the cloud screening, the temporally and spatially collocated prediction and observation data were kept for the comparisons.
Three sets of evaluations are presented here, focusing on standard statistical measures, the magnitude of smoke AOD (sAOD) enhancements, and spatial coverage of smoke plumes. The sAOD was derived by subtracting a background AOD, represented by the average of the lowest 20 % values within the entire region of comparison, for each modeled and observed distribution map. The statistical metrics used in the evaluation include correlation coefficient (r), ratio, mean bias (MB), normalized mean bias (NMB), root-mean-square error (RMSE), and normalized mean error (NME), which are calculated as follows: Here the subscript i represents the pairing of N observations (o) and model predictions (m) by spatial location and time.
Overbars indicate averages over location and/or time. These metrics were also used for comparison of surface PM 2.5 forecasts against ground-based observations in Sect. 3.5. In addition to standard statistical measures, the spatial extent of the smoke plume is evaluated based on how well the predicted location of the plume compares with the actual smoke plume detected by MODIS AOD. For each observation scene, observed and modeled plume coverages are compared by producing the figure of merit in space (FMS) and false alarm rate (FAR) (Boybeyi et al., 2001;Mosca et al., 1998;Rolph et al., 2009) These two categorical scores are calculated by counting the grid cells for model predictions and observations with sAOD > 0.05 that fall into the four categories listed in Table 2. FMS / 100 is equivalent to the threat score or critical success index (CSI) in verifying meteorological forecasts; it is defined as the ratio of the intersection to the union of the

Model performance statistics of smoke AOD (sAOD)
In this section, the evaluation results of sAOD forecasts are presented. The time period for evaluation was 4-8 August 2019, since there were multiple models that had not included emissions from the Williams Flats fire on 3 August, and on 9 August showers and cloudy weather resulted in very few AOD observations. It should be noted that because of the different setups for the chemical LBCs as summarized in Sect. 2.2, there may be systematic discrepancies in their AOD predictions. Additionally, HRRR-Smoke does not consider non-smoke sources; the models using simplified chemistry can struggle to represent background aerosols arising from secondary formation that is not resolved within the mechanism. Figure 6 shows the comparison of the background AOD estimated from MAIAC data and model forecasts per hourly scene. While the observed background ranges between 0.06-0.14, the modeled counterparts show less variability, except for RAQMS. Most models have smaller background AOD than the observations, and systematic discrepancies can be seen among the models. Thus, these discrepancies are excluded in the following comparison by subtracting the background values from the total AOD. Figure 7 shows the map of MAIAC sAOD from Terra MODIS at about 20:00 UTC 5 August 2019, along with the forecasts by the 12 models. Comparison of sAOD distributions for the other times is provided in the Supplement (Figs. S3-S15). As seen in Fig. 7, the observed areas with sAOD > 0.05 can largely represent the smoke plume, and this threshold is used to evaluate the spatial extents of smoke plumes in the following Sect. 3.2.3. The smoke plume for this day can be separated into three categories: (1) the fresh intense plume nearby the fire blowing east, with the peak sAOD reaching above 1.0; (2) an older plume in the vicinity of the fire (over Washington State, Oregon, and Idaho) from emissions earlier in the day or the previous day that is likely within the boundary layer; and (3) smoke transported further away from the fire (i.e., the band of high sAOD extending over northern Montana and southern Canada) associated with emissions from the previous days that was injected into the free troposphere. Note that the scattered enhancements over the southeast of the region in Fig. 7 are due to a small fire located in Idaho and also the scattered low clouds, as elevated satellite AOD retrievals have been seen around the rim of clouds (Ignatov et al., 2004;Kondragunta et al., 2008), owing to a high relative humidity environment near clouds and thus the hygroscopic growth of some particles. Overall, the high-resolution regional models tend to be more effective in depicting the fine structure of the plume transport but also show a higher risk of displacing the narrow plumes. All models represented the fresh plume but with a significant variability in the spatial coverage and magnitude, with the FRP-driven emissions resulting in higher sAOD than hotspot-driven emissions, in consistency with their relative emissions magnitudes shown in Sect. 3.1. Most models also show a representation of the nearby aged plume, but again the magnitude is highly variable, and the locations of the smoke differ substantially, likely related to the diurnal emission profiles and model resolution, as well as the driving meteorology. The misrepresented spatial pattern could also be due to the observed diurnal pattern on 4 August having a doublepeak structure, differing from any of the diurnal patterns assumed by the forecasts (Fig. 4). Conversely, the band of enhanced sAOD related to plume injection on the previous day seems to be only shown by a few models (HRRR-Smoke, UCLA WRF-Chem, and FireWork; Fig. 7), but there is still a large variability in the magnitudes. The representation of plume injection is further evaluated in Sect. 3.4. While this analysis is for a single overpass, the overall model performance follows a similar pattern, and the result can be generalized for most days (Figs. S3-S15). The quantitative evaluation of modeled sAOD during 4-8 August is summarized in Table 3, with the corresponding scatter plots shown in Fig. S16. In order to examine model performance in the vicinity of the fire and over transported smoke, the same statistical metrics are calculated over the fresh-plume-impacted (Fp) areas and other (Ot) areas separately, and the fresh-plume area boundaries are defined by examining satellite visible images (see Fig. 11). The total number of points included within the comparison was from 610 to 622 623 for the entire analysis region, depending on the grid resolutions. Overall, although some of the models show nearly unbiased predictions (UIOWA WRF-Chem and CAMS), negative biases in sAOD are seen for all models, with the MB ranging between −0.070 and −0.004 and NMB between −4.3 % and −87.4 %. The underpredictions are also seen over the Ot and Fp areas, except for CAMS and UIOWA WRF-Chem over the Fp areas, owing to likely the overpredicted emissions prior to times of the satellite overpasses. The absolute deviations of modeled sAOD against observations are large, with the NME of 61.4 % to 90.2 %, the RMSE of 0.11 and 0.17, and the correlation coefficients (r) ≤ 0.50. These results suggest that the spatial distribution patterns of sAOD were not well represented, which has been indicated by the discrepancies in the plume locations and spatial patterns shown in the map comparisons.
Although the characteristics of the models differ in a variety of dimensions, it is noteworthy that the models incorporating assimilation of satellite observations, including GEOS-FP, CAMS, RAQMS, UCLA WRF-Chem, and WISC WRF-Chem, are within the six models showing less underprediction of sAOD (NMB ≥ −51 %). Meanwhile, the five models ingesting FRP-based fire emissions, which are UIOWA WRF-Chem, GEOS-FP, CAMS, UCLA WRF-Chem, and HRRR-Smoke, rank within the seven models with comparably less bias (NMB ≥ −52.9 %). While improvements in 1 d AOD forecasts when assimilating AOD are expected (e.g., Kumar et al., 2019;Saide et al., 2013), FRPbased emission inventories generally use AOD to tune the conversion of FRP to emissions (Ichoku and Ellison, 2014; Kaiser et al., 2012;Koster et al., 2015), which can explain these results. Future work could explore this topic by performing sensitivity simulations to determine the major factors of the AOD forecast errors.
Besides the characteristics of model settings, the horizontal resolution used for re-gridding of model and observation data may also influence the performance statistics. In order to isolate the impact of grid resolution, the data were mapped onto four grid resolutions (0.1, 0.2, 0.5, and 1.0 • ) and examined the models' performance accordingly. As shown in Fig. 8, although the spatial resolution changes, the sAOD statistics for the 12 models remain within ranges of about 0.05-0.55 for correlation coefficient (r) (with r 2 <0.3), −90 %-20 % for NMB, and 60 %-95 % for NME. The ranges are slightly larger compared to the statistics shown for the original horizontal resolutions. However, the forecast performance is still poor in terms of r 2 (<0.3) and NME (>60 %), even when comparing at a resolution of 1.0 • . Regarding variations in the statistics against the re-gridding resolution, the NMB does not have distinctive changes, which is expected because the spatial smoothing could not yield much improvement in the mean bias against observations. For the NME, most models present decreasing trends when the re-gridding resolution gets coarser, except for NCAR WRF-Chem and UIOWA WRF-Chem, which show slight increases or mixed trends. In contrast to this, the variations of r values are more complex. With the re-gridding resolution getting coarser, we may expect an increased r due to some extreme outliers in sAOD distributions maybe getting smoothed out, but only 4 among the 12 models, namely ARQI, AIRPACT, UCLA WRF-Chem, and NAQFC, show the increasing trends. For the other models, mixed trends in r are shown when the re-gridding resolution becomes coarser. Overall, the relative ranking of the models' statistical performance does not vary significantly, and the horizontal grid resolution does not seem to be a decisive factor for models' performance. Thus, in the following section, the evaluations were performed based on model data at their original horizontal resolutions (i.e., without horizontal re-gridding).

Smoke AOD magnitude, temporal evolution, and spatial matching of plumes
In addition to the point-to-point comparisons, in this section the predictions are evaluated in perspective of the temporal evolution of sAOD magnitude and spatial extent of the smoke plumes. An sAOD threshold (sAOD >0.05) has been applied to filter sAOD representing the areas with pronounced smoke impact. This threshold is qualitative and chosen by visually examining the observation maps of sAOD and satellite visible images. Meanwhile, to exclude the grid cells significantly impacted by other smaller fires, only the data within a smaller area indicated by the dashed red box in Fig. 7 (see the MODIS sAOD map) were filtered for the analysis in this section.
The temporal evolution of the sAOD magnitude is shown in Fig. 9. In consistency with the statistical results, the overestimations of sAOD magnitudes occurred for some overpasses for the fresh-plume areas, particularly for the models driven by FRP-based emissions. Temporal variability in the model performance is noticeable, which is closely associated with the limitation of forecasted fire emissions based on the assumption of persistence. For example, on 4 August, nearly all the models show underestimations in sAOD over the fresh-plume area mostly resulting from the underpredicted emissions, due to delayed ingestion of emission information from satellite observations. An additional reason is that 4 August was the first day when the Williams Flats fire became active in most models, except for HRRR-Smoke and FireWork that already included it on 3 August. In comparison, on 5 and 6 August the burned area increased steadily without dramatic elevation (see the day-to-day increment of burned area in Fig. 2). Accordingly, the models show some skill, since the assumption of persistence managed to produce comparable emissions against the actual fire activity. However, the stronger burning activity was observed on 7 August (Fig. 2), leading to underestimations of the emissions. As the last overpass time of the Aqua-MODIS was about 14:00 PST, well before the peaking hour of FRP at about 17:00 PST on 7 August (Fig. 4), the impact of underestimated fire emissions was not shown by the modeled sAOD over the fresh-plume areas. However, this change in fire behavior generated a large underprediction on 8 August over other areas (Fig. 9). As indicated by the observed sAOD distribution at 19:00 UTC on 8 August (Fig. S14), the elevated sAOD was mostly contributed by the transported smoke aerosols resulting from the enhanced fire emissions late afternoon on 7 August. These results show that the assumption of persistence of smoke emissions degraded the forecasts. Future work needs to be performed to find strategies to predict changes in the smoke emissions over the forecasting window. Additionally, the representation of plume injection plays a critical role in the forecasted sAOD for the transported smoke plumes on 7 August, which is discussed further in Sect. 3.4.
Consistency of the modeled and forecasted spatial coverage of smoke plume is also examined. As shown in Fig. 9, the models underpredicted the total number of grid cells with sAOD > 0.05 for most of the snapshots, suggesting underestimation in the area of the smoke plumes. The accuracy of the predicted smoke areas is evaluated by the metrics of FAR and FMS for each MODIS snapshot during 4-8 August. These two metrics are derived at the original grid resolutions of dif-ferent models (Fig. 10) and at fixed grid resolutions (0.1, 0.2, 0.5, and 1.0 • ) as well (Fig. S17), and the results show similar features. The maximum FMS can reach as high as 80 %, with the medians for the models ranging from 10 % to 70 %. The FAR scores are generally low, and the median values are below 45 %. There is a noticeable group of models showing relatively better performance with lower FAR and higher FMS score, which include CAMS, RAQMS, GEOS-FP, UCLA WRF-Chem, and UIOWA WRF-Chem. As analyzed previously, these models also show better performance for the statistics of sAOD and the total number of grid cells with sAOD > 0.05. These models used FRP-based emissions (except for RAQMS) and incorporated assimilations of satellite AOD observations (except for UIOWA WRF-Chem). Other factors such as complexity of chemical mechanisms, chemical LBCs, horizontal resolution, initial time of forecast, and dynamic core used to drive the meteorological dispersion and transport do not seem to be determining for these metrics.

Data and statistical metrics
The model forecasts of surface PM 2.5 mass concentrations during 4-9 August 2019 are evaluated against the hourly measurements collected from the AirNow (https://www. airnow.gov/, last access: 15 June 2020) network. The observations were accessed from the OpenAQ Platform (https: //openaq.org, last access: 15 June 2020) and were originally collected by state, local, or tribal monitoring agencies using federal reference or equivalent monitoring methods approved by the US Environmental Protection Agency (EPA). As noted by AirNow, although the preliminary data quality assessments are performed, the data were not fully verified and validated through the quality assurance procedures that the monitoring organizations used to officially submit and certify data on the US EPA Air Quality System (AQS). Compared to the AQS data that are used for regulatory purposes, such as determining attainment of the National Ambient Air Quality Standards (NAAQS), the observations from AirNow are used to report the Air Quality Index (AQI) to the public, and they have a better completeness during extraordinary air pollution events such as wildfires. Additionally, the AirNow data have also been compared with the US EPA's Air Data (https://www.epa.gov/outdoor-air-quality-data, last access: 16 October 2020) to check the consistency. The missing data in AirNow were filled in by combining these two datasets. The locations of the 86 monitoring stations within the domain of analysis are shown in Fig. 11. By examining the visible images based on the GOES-17 data and MODIS MAIAC AOD maps, 14 sites are selected as "freshplume stations" (Fp) that show immediate impact by the fresh smoke from the Williams Flats fire on 4-7 August, i.e., the stations located within the fresh-plume borders on any of the days, as denoted by the red dots in Fig. 11. The PM 2.5 observations are hourly averages reported at the end of each hour, namely centered on the half hour. For the modeled counterparts, surface PM 2.5 values are derived by bilinearly interpolating the modeled 2-D forecasts at the lowest model level onto the latitude and longitude coordinates of each monitoring station. The inconsistency in spatiotemporal representations of the model forecasts and observations could be a source of model-observation differences. It should also be noted that most model data are provided as hourly files, while some of them come as 3-hourly snapshots (see Table 1). Thus, the limitation in output frequency could contribute to errors compared to PM 2.5 observations for some models.
The comparison is restricted to the 83 sites that have forecasts from all the 12 models; thus three stations (no. 1, 18, and 29 in Fig. 11) are omitted since they are not covered by all models. The common statistical measures as used for sAOD (described in Sect. 3.2.1) are used for the hourly PM 2.5 forecasts at all the 83 stations, as well as separately over the stations categorized as "fresh-plume sites" (Fp) and all the other sites (Ot) (Fig. 11). Similar to the comparisons for sAOD, the surface PM 2.5 mass concentration enhancements (surface smoke PM 2.5 , sPM 2.5 ) are derived, with the background represented by the average of the lowest 20 % values for the model forecasts over the entire region of analysis and 10 % over the observation stations. Note that a lower percentage is used for the observations due to the sparseness of stations. Three sets of comparisons are presented, which focus on (1) the overall statistical measures and their spatial patterns, (2) diurnal forecast performance, and (3) day-to-day performance. Table 4 gives the statistical comparison results of the hourly surface sPM 2.5 from the 12 models and measurements over all 83 monitors. Also included are the statistical measures for the 14 stations in the category of "fresh-plume areas" (referred to as Fp) and the other 69 stations (referred to as Ot), respectively. The scatter plots of all pairs of prediction and observation data and for the Fp and Ot stations separately are shown in Figs. S18-S20. Maps in Figs. S21-S23 show statistical results by station. Overall, FireWork and NAQFC generally rank within the best four for all the statistical metrics used here. It is interesting to note that these models are the two operational forecasts for predicting air quality in Canada and the US, where performance against surface monitoring stations is generally the primary metric for model evaluation. As revealed by the results for all stations, half of the models give positively biased sPM 2.5 predictions. Specifically, there are three models, i.e., FireWork, NAQFC, and RAQMS, showing nearly unbiased predictions with NMB of −4.00 %, 0.30 %, and −5.50 %, and the ratios of predictions versus observations are between 0.95 and 1.00. However, in terms of the correlation coefficients and absolute errors, all the models show low performance, with r<0.35 (r 2 <0.13), RMSE > 9.8 µg m −3 , and NME > 70 %, which is similar to the results for sAOD. It is worth mentioning that there is a discrepancy when comparing between the ranking of the best models for sPM 2.5 and sAOD. However, as the spatial and temporal representations of the observed surface PM 2.5 and AOD incorporated into the above statistics are very different, the statistics are not exactly comparable. Thus, the discrepancies in statistics for AOD and surface PM 2.5 are further discussed in Sect. 3.5 using coincident data. The statistical measures indicate that the sPM 2.5 errors can barely be recovered through simple bias corrections, suggesting the mismatch in spatial distributions of surface smoke aerosols.

Results
In terms of the comparison over different groups of stations, the correlation coefficients for the Fp stations are larger compared to the Ot stations for most models, except for UCLA WRF-Chem and UIOWA WRF-Chem, although the values are all less than 0.3 (r 2 <0.09, Table 4). By contrast, for the other five statistical measures, most models tend to show reduced performance for the Fp stations compared to the Ot stations, especially for RMSE, which can also be seen in the RMSE distributions (Fig. S22). It is likely because the model performance over the Fp stations is more closely impacted by errors in fire emissions, and RMSE is more driven by large absolute values of the model-observation differences compared to the other statistical metrics. These results suggest that the spatial structures of sPM 2.5 from the fresh smoke plumes appears to be slightly better captured by forecasts compared to the outer areas; however, considerable biases still contribute to the larger RMSE in fresh smoke than the outer areas. It is also noteworthy that there are five forecasts (HRRR-Smoke, UCLA WRF-Chem, UIOWA WRF-Chem, GEOS-FP, and CAMS) showing large overpredictions for the Fp stations, with NME > 40 % and MB > 6 ug m −3 , which are all driven by FRP-based fire emissions and have shown relatively fewer negative biases of sAOD. As discussed later in this section, the overpredictions in sPM 2.5 by these models mostly happened in the evening through early morning. Among these five models, CAMS shows even larger NMB for the Ot sites compared to the Fp sites, likely due to model processes besides fire emissions. Another notable feature is that some models give different signs of biases over the Fp and Ot sites. For example, HRRR-Smoke and FireWork show positive biases for Fp sites and negative biases for Ot sites, and the opposite is seen for WISC WRF-Chem, NAQFC, and RAQMS. For models showing the same sign, the magnitudes of NMBs between Fp and Ot sites can be quite different. This points to discrepancies in the model performance for the fresh versus aged smoke, which could be generated due to multiple reasons, including issues in plume injection, downwind chemical evolution, and transport of fire smoke.
As the fire activity, plume injection, and vertical mixing change significantly from daytime to nighttime, it is useful to compare the diurnal feature of forecast performance. The predicted and observed sPM 2.5 are compared diurnally (from 00:00 to 23:00 PDT) for the two categories of stations (Fp and Ot) respectively. Figure 12a and b show observed sPM 2.5 statistics. The mean sPM 2.5 over the Fp sites is about 3 times larger than the values for the Ot sites, as they are impacted by smoke more directly. Also, there is a more prominent diurnal variability in sPM 2.5 for the Fp sites, with overall higher values in nighttime than in daytime. The peak-tovalley difference of the means is about 8.5 µg m −3 (76 % of the minimum) for the Fp sites and about 2.0 µg m −3 (35 % of the minimum) for the Ot sites. For the Fp sites, the means have an early afternoon peak (13:00 PDT), which seems to be observed at a small portion of the sites depending on the episodical fire emissions and wind directions. The increase that occurred around 18:00 to 20:00 PDT can be attributed to the fresh smoke emitted during the peak hours of the fire activity reaching the sites. Also, the boundary layer collapse along with the transition of the convective boundary layer into a stable boundary layer during the early evening and the continued burning on some of the days during these hours can also play a role. By contrast, a decreasing trend is seen from 00:00 to 10:00 PDT, which is mostly related to the reduced emissions later in the night, the spatial dispersion of the smoke, and the PBL growth that leads to vertical dilution in the morning. For the Ot sites, there is a peak in the early morning (08:00 PDT), possibly due to anthropogenic activity. While the upper decile and quartile follow a similar trend than the mean for the Fp sites, the lower decile, lower quartile, and median show relatively flat diurnal profiles, likely due to the fresh plume not impacting all sites simultaneously. For Ot sites, the behavior of all statistics is similar. We thus focus on comparing the mean of the models.
Comparisons of the modeled diurnal means of sPM 2.5 against the observations are given in Fig. 12c and d. The corresponding diurnal variations of the model statistical measures are included in Figs. S25 and S26. For models, the forecasted diurnal variability of the means is mostly stronger than the observations, and the peak-to-valley differences range from 4.6-45.3 µg m −3 (36 %-474 % of the minimum) over the Fp sites and 1.0-6.4 µg m −3 (37 %-113 %) over the Ot sites. For the Fp sites, the early afternoon peak is captured by FireWork and HRRR-Smoke, although there are differences in timing, width, and magnitude. Also, most models capture the early morning decrease related to development of daytime PBL and dilution, and RAQMS, UIOWA WRF-Chem, and CAMS show the decrease 3-5 h later than ob-served. Another important feature is the timing and magnitude of evening buildup that are quite different among the models. UCLA WRF-Chem, UIOWA WRF-Chem, and GEOS-FP show the buildup 1-2 h earlier than observed with much higher concentrations; while FireWork, WISC WRF-Chem, NAQFC, and RAQMS show it 2-3 h later, and the nighttime concentrations are overall lower than the observations. CAMS captured the daytime trends and late-afternoon buildup well; however, it shows overpredictions during nighttime to early morning. The overpredictions during the late afternoon to early morning by UCLA WRF-Chem, UIOWA WRF-Chem, GEOS-FP, HRRR-Smoke, and CAMS seem to dominate the large positive biases and NMB for the Fp sites. This helps to explain the discrepancy between the performance for sAOD and sPM 2.5 . Overall, these differences in the evening and nighttime evolutions can be attributed to the model uncertainties in diurnal pattern of emissions, the larger proportion of emissions allocated within the PBL, and underestimated nighttime PBL heights as well. The timing of collapse of PBL may also play a significant role, as the models produce emissions at the surface rather than injected above after that, which could subsequently lead to more enhanced surface PM 2.5 when the fire emissions continue.
The diurnal variation of the average sPM 2.5 for the Ot sites also shows patterns deviating from the observations. Most models (except for UCLA WRF-Chem, NCAR WRF-Chem, and HRRR-Smoke) show a common feature of a valley in the afternoon; however it is not seen for the observations. It could be a joint consequence of the larger dispersion volume owing to a higher PBL depth than reality and/or issues with other sources, e.g., anthropogenic activities. The early-morning peak is captured by NAQFC, AIR-PACT, and ARQI, earlier by about 1 h than the observations (except for ARQI). Similar to the result over the Fp sites, stronger overpredictions are seen during the evening to early Considering the coincidence of sAOD and sPM 2.5 , the Terra-and Aqua-MODIS overpassing times are between 11:00 and 15:00 PDT, when the models with large nighttime sPM 2.5 overpredictions (except for HRRR-Smoke) tend to agree the best with the observed diurnal patterns but still overpredicting ( Fig. 12c and d). As mentioned previously, these models use FRP-based fire emissions and show less underprediction for sAOD compared to the other models. This suggests discrepancies in sAOD and sPM 2.5 prediction performance, which will be further discussed in the following sections.
The model representation of the day-to-day evolution of sPM 2.5 magnitudes is also evaluated. Figure 13 shows the comparison of the spread of modeled daily average sPM 2.5 for the two categories of stations. The NMB and NME results generally reflect similar information and are presented in Figs. S27 and S28. For the Fp sites, the observed means show an overall increasing trend, with a peak on 7 August and a slight decline afterwards. Most models generally captured the day-to-day variations of the sPM 2.5 , with several models showing underpredictions on 4 or 5 August and occasionally overpredicting on the following days (6-9 August), including ARQI, HRRR-Smoke, AIRPACT, UCLA WRF-Chem, UIOWA WRF-Chem, FireWork, and NAQFC. This could be attributed to delayed ingestion of fire emission information based on satellite observations collected prior to a forecast cycle, which has also been shown with the variations of the spread of magnitudes for the modeled sAOD. By contrast, the overestimations on 4 and 5 August are seen for WISC WRF-Chem, GEOS-FP, and CAMS, which could be linked to the larger emissions injected at levels closer to the land surface than reality and within the PBL. In addition, significant overestimations of the 75th and 90th percentiles are shown by the models driven by FRP-based fire emissions, especially on 9 August for the three models using QFED data (i.e., UCLA WRF-Chem, UIOWA WRF-Chem, and GEOS-FP). These overpredictions could be related to overpredicted total emissions and/or issues in the vertical allocation of emissions (i.e., putting too much amount of smoke within PBL) as will be discussed in Sect. 3.5.
For the Ot sites, the observed daily averages of sPM 2.5 show smaller day-to-day variations compared to the Fp sites (Fig. 13). In contrast to the variation of the observed means over the Fp sites that peaks on 7 August, the result over the Ot sites peaks on 5 August and reduces slightly afterwards. Besides meteorology, this decreasing trend after 5 August is likely attributed to the evolution of fire activity, which showed higher plume injection heights on 7 and 8 August, leading to lofted smoke above the PBL and generating less impact near the land surface, as will be discussed in Sect. 3.4 and 3.5. The observed decreasing trend is overall depicted by some of the models, such as HRRR-Smoke, AIRPACT, NAQFC, GEOS-FP, CAMS, and RAQMS, although with temporal shifts and biases in their magnitudes. In comparison, nearly flat or opposite trends of the daily means are seen for UCLA WRF-Chem, UIOWA WRF-Chem, ARQI, and FireWork, which is likely related to allocating higher emissions closer to the land surface compared to observations, which will be shown by the comparison against lidar data for the aged plumes in Sect. 3.4.3.
3.4 Evaluation of vertical structure of smoke plumes using NASA DC-8 aircraft data

Observations and derivation of smoke plume heights and PBL heights
The vertical allocation of fire emissions can significantly affect both the total column smoke aerosol loading and surface aerosol concentrations in adjacent areas of the fire and remotely downwind. In this section, we evaluate the vertical plume structures and fire plume injections predicted by the models, based on the observations acquired by the DIAL-HSRL (Hair et al., 2018) from the DC-8 aircraft that sampled the Williams Flats fire plume on 4 d, namely 3, 6, 7, and 8 August 2019, during the FIREX-AQ field campaign. The DIAL-HSRL system is capable of providing measurements of aerosol depolarization (355, 532, 1064 nm), aerosol/cloud extinction (532 nm), and backscatter coefficient (355, 532, 1064 nm) above and below the flight height at a temporal resolution of 10 s. In addition, the partial-column AODs were derived by integrating the aerosol extinction profile at 532 nm in nadir, when the flight height was above 5.15 km. In this section, the plume structures are compared using the DIAL-HSRL backscatter coefficient (at 532 nm) profiles, as they provided more detailed structures and fewer missing data than extinction and forecasts of PM 2.5 concentration profiles (as most models did not provide extinction or backscattering profiles). In addition, the AOD derived from the lidar data is also used to analyze the relation between vertical aerosol distribution and the ratio of surface PM 2.5 concentration versus total column aerosol loading in Sect. 3.5. All the models evaluated in the preceding sections, except for NAQFC, are examined in this section. The maps of flight tracks are shown in Fig. 14. As summarized in Table 5, the observations of 15 flight transects that sampled straight along the smoke plume are selected, among which 11 transects represent fresh plumes close to the fire on those 4 d, and 4 transects represent the aged plumes sampled on 7 and 8 August. The modeled counterparts of smoke plume structures along these transects are numerically derived using the 3-D forecasts of PM 2.5 concentration fields valid at the nearest hour of model output relative to the sampling time of a profile. Then the model data are bilinearly interpolated onto the latitude and longitude coordinates of the sample profile at each of the model levels.
Smoke plume heights and PBL heights are determined from the DIAL-HSRL data and model forecasts to evaluate models' performance for plume injection. The method used to determine plume and PBL heights is based on the Figure 13. Spreads of modeled and observed daily averages of smoke PM 2.5 enhancements (sPM 2.5 ) over the two categories of surface sites, i.e., (a) fresh-plume sites and (b) other sites during 4-9 August 2019. The dates are labeled on the x axes as "month/day". The box and whiskers (as in Fig. 9) stand for model results. The corresponding ranges of the 10th to 90th percentiles for observations are represented by the light grey shading, and the range of 25th to 75th percentiles is represented by the medium grey shading. The observed median and mean are denoted by the dark grey lines and black crosses, respectively. vertical gradients of aerosol backscatter or PM 2.5 concentrations. Aerosol profile measurements made by lidar have long been used to determine PBL heights or more appropriately the mixed layer (ML) heights (Hayden et al., 1997;Scarino et al., 2014;Tucker et al., 2009) in the daytime since aerosol gradients can indicate the level below which the aerosol species emitted within PBL tend to be well mixed and dispersed. For aerosol profiles through a smoke plume, strong aerosol gradients can indicate the plume top, which can be higher than the PBL heights when emissions are injected above the PBL. Following this heritage, for each of the five-point moving average backscatter profiles below 10.5 km, the highest level where the local minimum vertical gradient is less than a threshold k is derived; if this criterion is not satisfied at any level, the level of the global minimum vertical gradient over the entire profile is used. Then this level is referred to as plume top height (h plume ) if the profile is in plume, i.e., when the maximum backscatter below 8 km is larger than a threshold b, and otherwise referred to as PBL height (h PBL ) (when the profile is out of plume). The value of b is chosen as 2.1 × 10 −3 km −1 sr −1 for 6 August, as the backgrounds increased significantly due to dispersed smoke from fire emissions on the previous days and b = 1.2 × 10 −3 km −1 sr −1 for the other days. For outof-plume profiles, k = −1.2 × 10 −6 km −2 sr −1 ; for in-plume profiles k = −4.0 × 10 −6 km −2 sr −1 , which is smaller than the out-of-plume condition, in order to avoid picking up heights affected by in-plume variations due to vertical mixing and plume rise. The results are also visually inspected and filtered to exclude the impact from the incoming smoke from the fires in Siberia, which can be seen on 3 August. As for the model data, h plume is derived from the vertical profiles of PM 2.5 mass concentrations using a similar method. For each of the PM 2.5 profiles, h plume is obtained only if the profile is in plume, i.e., when the difference between the maximum and minimum PM 2.5 below 8 km is larger than 5.0 µg m −3 . The data impacted by the Siberian fires on 3 August were also excluded, with the masked levels manually tuned for each model. Also, the data below 100 m above ground level are excluded to avoid selecting the level  Table 5 for details) are colored by the observation time (UTC) on each day, overlaid on visible images of GOES-17 at 17:01 PDT. Note that the map coverage for 7 and 8 August is larger, in order to show the flight transects that sampled the aged plume. White triangles represent locations of surface monitoring sites of the AirNow network. impacted by strong emissions near the surface. The threshold of vertical gradient k is modified to −2.5 × 10 −3 µg m −4 for both in-plume and out-of-plume conditions. An additional modification is applied when the local minimum (or the global minimum) is smaller than −3.5×10 −3 µg m −4 that can occur at a certain level adjacent to the very intense injected fire emissions, and the vertical gradient of PM 2.5 is much stronger than near the plume top. Thus, the h plume is tuned upward using the highest level at which the gradient is less than 3.0 × 10 −3 µg m −4 . Depending on data availability and the agreement of the modeled PBL heights compared to the results estimated by forecasted virtual potential temperature and PM 2.5 profiles, the modeled h PBL is derived in different ways for three classes of models: 1. For CAMS and RAQMS, PBL heights are not available in their outputs, so the h PBL is determined as the ML heights derived from the out-of-plume PM 2.5 profiles.
2. For UCLA WRF-Chem, UIOWA WRF-Chem, and ARQI, the h PBL results diagnosed by PBL parameterization in model are usually lower than the ML heights estimated by the vertical PM 2.5 profiles (as will be shown in Sect. 3.4.2). By examining the potential temperature profile and land use, it is found that the underestimation mostly happened over the area downwind to the Columbia River, where the model-diagnosed PBL heights tend to represent the top of the thermal internal boundary layer relating to the underlying water body. Therefore, h PBL is determined from virtual potential temperature profiles (θ v ) for these three models as the lowest level at which ∂θ v /∂z = 1.3 K km −1 . The method, using a threshold of the vertical θ v gradient, is found to outperform other methods based on turbulence kinetic energy (TKE), θ v , or Richardson number for estimating convective boundary layer depth (LeMone et al., 2013).
3. For the other models not mentioned in (1) or (2), the model-diagnosed h PBL results that came with the forecasts are used.
It should be noted that the h PBL results derived by the above methods are compatible with the lidar results only during daytime when the term of ML heights is applicable. In this work, lidar measurements for the selected transects were mostly collected during the daytime; however, for the data collected in the evening, e.g., as late as 18:49 LST (local standard time, or UTC−8) for D3T3 (see Table 5), the lidar h PBL tends to represent the residual layer height, since the aerosol layer remains after the collapse of daytime boundary layer and transition into the nocturnal boundary layer due to radiative cooling. Therefore, to ensure the compatibility of model and lidar data, the model h PBL after 16:00 PDT (15:00 LST) is derived as the higher one between the h PBL values at the current hour and 16:00 PDT for the same location, which allows the top of the residual layer to be captured. The modeled h PBL and h plume for each of the selected transects are compared against the lidar results. Two sets of evaluations are shown, focusing on the fresh plumes and aged plumes respectively.

Evaluation for smoke plumes close to the fire
The evolution of vertical smoke plume structure and plume rise is demonstrated for the fresh plumes close to the fire, using the DIAL-HSRL observations that were collected on 3, 6, 7, and 8 August. As shown with the observed backscatter coefficients (532 nm) in Fig. 15, obvious day-to-day vari-ability of the plume rise behavior and fire activity occurred on these days. Slight plume injection is shown on 3 August, with a layer of enhanced backscatter located right above the PBL top at about 3 km above sea level (a.s.l. hereafter). The lofted layer of aerosols in between 4 and 7 km a.s.l. is associated with incoming smoke plume from the Siberian fires. On 6 August, the flight sampled the plume from approximately 12:00 to 15:00 PDT, which is earlier than for the other days, and no injection was observed. The fire just started to become active at the time of D6T1, with h plume being lower than the h PBL . Elevated emission heights are seen afterwards for D6T2 and D6T3, while no injection above the PBL was observed, and the plume tended to be well mixed within the PBL. This behavior can partly explain why the fire emissions peaked late on that day (19:00 PDT, Fig. 4) and the measurements took place earlier. In contrast, strong plume injections above the PBL were seen on 7 and 8 August. The h plume reached 7 km a.s.l. on 7 August, and even stronger injections were triggered on 8 August by the thermodynamic convection related to the active burning. The flight track sampled through multiple pyrocumulonimbus (pyroCb) pulses on that day, which was generated by the convection along with the abundant heat and moisture released during the burning. Consequently, the emissions were significantly elevated, with the h plume getting as high as 10 km a.s.l.
Based on the DIAL-HSRL data, the predicted vertical plume structures are evaluated. Figure 16 shows the comparison for the transect D7T3 sampled on 7 August, and similar results along the other transects are provided in the Supplement (Figs. S30-S39). Overall, there is a large spread of the modeled plume heights. ARQI, HRRR-Smoke, UCLA WRF-Chem, WISC WRF-Chem, FireWork, and NCAR WRF-Chem tend to show plume injections above the PBL, while AIRPACT, CAMS, GEOS-FP, and RAQMS show smoke generally well mixed within the PBL. For UIOWA WRF-Chem, although the fire emissions seem to be allocated mostly close to the land surface, a lofted plume exists over the downwind area above the PBL, which corresponds to injected smoke that occurred earlier (see Fig. S40).
The median h plume values for each of the 11 fresh plume transects are evaluated by statistical metrics that have already been used in previous sections for sAOD and sPM 2.5 . Note that in the following evaluation, all the heights are converted to above ground level (a.g.l.). The statistics are presented in Table 6, and the observed median h plume and modelobservation difference are shown in Fig. 17a. The total number of points incorporated varies between 8 and 11, since for some models the fire had not been active yet in the forecasts on 3 August. Overall, multiple models have high linear correlation (eight models with r over 0.7), indicating models following the observed trend of injections getting deeper as the days went by. However, all three global models which tend to inject their emission in the mixed layer are within this group, and thus the correlation might reflect the concurrent increasing trend in the daytime PBL heights. This means that the  Table 5 for the details of the selected transects. The red dots represent the h plume determined from the in-plume profiles; the black dots are the h PBL determined from profiles out of plume. The solid black line shows the aircraft altitude.  Table 5) through the smoke plume from the Williams Flats fire. The observed backscatter coefficient at 532 nm (Obs panel) and PM 2.5 mass concentrations forecasted by different models (model panels) are shown. The red and black dots are plume heights and mixed layer heights determined by using the observed backscatter or modeled PM 2.5 profiles. The pink dots are PBL heights derived from model diagnosis or forecasted virtual potential temperature (for ARQI, UCLA WRF-Chem, and UIOWA WRF-Chem, and their diagnosed PBL heights by PBL parameterization schemes are denoted by the grey dots). The black line shows the aircraft altitude. high correlation coefficients might not be a good indicator of smoke injection performance. Meanwhile, only four models (ARQI, HRRR-Smoke, UCLA WRF-Chem, and FireWork) show biases <1 km with NMB < 20 %, and all models have biases <2 km with NMB < 40 % and NME of 30 %-50 %.
For the day-to-day variations, the observed h plume presents considerable variability associated with plume injection behavior, with the medians along each transect ranging from about 2 to 9 km a.g.l. (Fig. 17a). While most models show overpredicted h plume on 3 and 6 August and underpredictions on 7 and 8 August, the range of predicted h plume is smaller than observed for all the models (Fig. 17c), which means that the day-to-day variability in plume injection behaviors on these days was not captured by any of the models. The underestimation of temporal variation in plume injection heights is consistent with a previous study (Val Martin et al., 2012).
Overall, there is not a single model that performs the best all the time. For instance, the models that tend to put emissions within the PBL (e.g., GEOS-FP, CAMS and RAQMS) performed better on 3 and 6 August, while the models with more intermediate injections performed better on 7 August (HRRR-Smoke and UCLA WRF-Chem). The performance of the global models may also be limited by the coarser vertical resolutions compared to high-resolution models, as they have limited representation of the fine-scale vertical smoke structures. The models with deeper injections (e.g., NCAR WRF-Chem) performed the best on 8 August for the pyroCb (Fig. 17a). This is also confirmed by comparing plume injection magnitude represented by the difference between the medians of plume heights and PBL heights (median h plume − median h PBL ) for each transect. As shown in Fig. 17c, for ARQI, HRRR-Smoke, UCLA WRF-Chem, FireWork, and NCAR WRF-Chem, the cases with stronger injections than observations, i.e., the data points above the 1 : 1 line, are mostly associated with the overpredictions of plume heights and vice versa. Some exceptions exist when the models give a higher injection magnitude but still underpredicted the plume height, which can be attributed to the underprediction of PBL height. By contrast, X. Ye et al.: Evaluation and intercomparison of wildfire smoke forecasts for GEOS-FP, CAMS, and RAQMS, the differences (median h plume − median h PBL ) are around zero for most cases. Thus, obvious underestimations in h plume are seen when strong injections are present. Another interesting result is that while there are multiple models using the same plume rise parameterization scheme (all WRF-Chem-based configurations except HRRR-Smoke), they present large variations in their performance (e.g., D7T3; Fig. 16). Although these models used the same injection parameterization, the burned area is specified differently, which, together with differences in meteorological fields and grid resolutions, can account for the large spread of the predicted plume heights. Additionally, the two systems using satellite FRP in their plume injection estimation (HRRR-Smoke and CAMS) also show very different behavior, which can be explained by the different plume rise parameterizations used in these models. Future sensitivity analysis is needed to determine the key factors contributing to their performance.

Evaluation for aged smoke plumes
Comparison of plume heights along the transects that sampled through aged smoke plumes on 7 and 8 August shows features consistent with the transects through the fresh plumes the day before. Figure 18 presents comparisons of the aged plume over northwest Montana on 7 August (D7T1). The observed smoke plume is mostly well mixed within the PBL, corresponding to the emissions injected within the PBL as observed in the fresh plume on 6 August (see Figs. S33-S35). The model forecasts have captured the plume heights as shown by the observations, with the smoke being within the PBL and reaching the ground surface. Similar features can be found for D7T2 (Fig. S41). While multiple models predicted injections into the free troposphere on 6 August (ARQI, HRRR-Smoke, UCLA WRF-Chem, FireWork, and NCAR WRF-Chem), these lofted plumes are not represented along this flight transect, likely because they were advected faster and in a different direction than the plume in the PBL. Moreover, although the vertical location of the aged smoke is captured by the models, the spatial variability is not well represented, possibly owing to errors in the temporal profile of emissions (Fig. 4).
In contrast, significant injection into the free troposphere that happened on 7 August resulted in a large portion of the aged smoke not mixing down to the surface on the next day, as suggested by the lidar data (Fig. 19, D8T2). Although the lofted smoke is partially represented by some models showing stronger injections on 7 August (Fig. 16), the observed lofted smoke covered a much larger area, with the core of the observed plume (∼ 00:25 UTC) not being captured by any model. This is likely due to the earlier diminishing injections and moderate burning activity in the late afternoon by the models, as can be confirmed by the comparisons of vertical plume structures (Figs. S36 and S37 for D7T4 and D7T5), as well as the diurnal emission evolution profiles (Fig. 4).
Additionally, there is a tendency of the forecasts to show a larger proportion of smoke mixed within the PBL than indicated by the observations. This could be responsible for the discrepancies between sPM 2.5 and sAOD performance (i.e., although there is an evident underestimation of sAOD for the transported smoke plume on 8 August, the predicted sPM 2.5 shows overestimations for some models). These results highlight again the significance of resolving both temporal and vertical representations of fire emissions in models to improve forecasts of transported smoke plumes. An important parameter of plume injection parameterizations is the percentage of emissions that are injected into the free troposphere. This parameter is generally assumed as a constant depending on the fuel category (Freitas et al., 2007), which can have a large impact on the surface smoke aerosol concentrations. Thus, more detailed evaluations of vertical partition of emissions are needed. Another possible reason for the discrepancy is the potentially enhanced evaporation of organic aerosol near the surface compared to lofted plumes , which is a process not included by any of the forecasts evaluated here.

Synergetic evaluation of surface PM 2.5 and AOD and their ratio
The ratio between surface PM 2.5 and AOD has been widely considered in evaluations of model performance (e.g., Lennartson et al., 2018) and is also critical for studies on estimating surface PM 2.5 based on satellite AOD retrievals. As an intensive performance metric, this ratio is less dependent on mass concentrations and emission than PM 2.5 or AOD, often referred to as extensive parameters and dependent on mass concentrations. This ratio is dependent on the vertical allocation of smoke aerosols and aerosol optical properties and thus can be used to evaluate models in terms of these aspects. This is especially important for models performing assimilations of AOD data, as misrepresentations of these ratios can lead to erroneous PM 2.5 concentrations (Saide et al., 2020). In this section, the forecasts of surface PM 2.5 / AOD ratio are evaluated for the 12 models. General examples of the ratios under different typical mixing and layering situations of smoke are demonstrated using the DIAL-HSRL data and surface PM 2.5 measurements. However, considering the sparse coincidence of DC-8 flight measurements and surface PM 2.5 data, statistical evaluation of the ratio relied on MAIAC AOD retrievals. Two sets of evaluations are presented for the model representation of the ratio regarding probability distribution and day-to-day evolution.

Observations of surface PM 2.5 to AOD ratio
Examples of the ratios observed under typical conditions of smoke aerosol profiles are shown in Fig. 20. The ratios are derived using surface PM 2.5 and AOD calculated using Table 6. Statistics of the forecasted median plume heights for the 11 transects that sampled through fresh plumes compared against DIAL-HSRL observations. The best four members for each of the columns have been highlighted in bold.  Figure 18. Similar to Fig. 16 but for the transect D7T1 that sampled through the aged plume. The annotations for the colored dots are the same as in Fig. 16, except that the blue dots represent PBL heights estimated by virtual potential temperature for the labeled hour; the pink dots are PBL heights for models at 16:00 PDT.
aerosol extinction from the DIAL-HSRL collocated within 5 km of distance. A condition commonly occurring is that of smoke aerosols well mixed within the PBL under a clean free troposphere (Fig. 20a), yielding ratios in the 70-90 range.
In contrast, Fig. 20b and c provide typical results for nearsurface smoke and lofted smoke above the PBL, with the ratios becoming much higher (402.2) or lower (38.9) compared to the general situation. A mix of these two conditions can also yield ratios similar to the well-mixed ones (e.g., Fig. 20d). Therefore, the surface PM 2.5 to AOD ratio is applicable to suggest vertical placement of smoke aerosols. One situation that could obscure the relationship of ratio and vertical smoke layering is the clean or non-smoke cases, for which an example is presented in Fig. 20e. In this case, the enhanced backscatter above the PBL is attributable to scattered clouds, and the filtered extinction profile shows well-mixed PBL aerosols with an AOD of 0.11 and a ratio of 126.9, which also tends to be larger than the general value for smoke mixed within PBL. Thus, for a clear indication, an AOD filter is applied in the following analysis to extract data pairs representing columns more likely impacted by smoke aerosols. Meanwhile, cloudy conditions where aerosol hygroscopic growth could complicate the analysis are excluded  by focusing on the period of 4 to 8 August when clear-sky conditions prevailed.

Evaluation of forecasted ratio
Before the evaluation was performed, tests and visual examining were conducted to find the best criteria to filter observations and forecasts over smoke-plume-affected areas, and the filters based on sAOD yielded more appropriate results than when using AOD. A filter of sAOD > 0.05 is employed for observations, while for model forecasts, due to the considerable range of magnitude of the forecasted fire emissions among the models, there is not a single sAOD threshold that can be appropriate for all the forecasts. Consequently, the threshold is chosen as 0.05 for most models, and reduced values of 0.02 (for HRRR-Smoke, WISC WRF-Chem, and FireWork) and 0.01 (for NCAR WRF-Chem, AIRPACT, ARQI, and NAQFC) are chosen to account for the lower sAOD magnitudes in these forecasts. Figure 21 shows an example of filtered observations and forecasts. Similar to the analysis in the previous subsection, for most models the high ratios correspond to the areas adjacent to the fire emission hotspot and transported smoke mixed down to the surface. It should be noted that for models with relatively low emissions, the areas impacted by smoke plumes cannot be well distinguished from background and other sources (e.g., Fig. 7); thus the distribution of the ratio is high biased due to inclusion of low background AOD values. Reduced ratios are shown in Fig. 21 over northwest Montana for some models (e.g., HRRR-Smoke, UCLA WRF-Chem, UIOWA WRF-Chem, WISC WRF-Chem, FireWork, NAQFC, NCAR WRF-Chem), indicating an aged smoke plume located above the PBL. Figure 22 shows the comparison of the probability distributions of the ratios for observations and model forecasts on 4-8 August, with the parameters of the lognormal fitting curves given Table 7. It should be mentioned that for models that had relatively low fire emissions, as noted earlier (AIR-PACT, ARQI, NAQFC, and NCAR WRF-Chem), the distributions cannot unambiguously represent smoke plumes from the Williams Flats fire and are likely driven by other sources and background aerosols, so the results are not exactly comparable with the observations. These models tend to overpredict the ratios, which is consistent with the larger ratios obtained for non-smoke cases, as shown in the previous subsection. Overall, the results suggest no single model performing the best simultaneously in terms of the mean and standard deviation of the fitted distribution. Slight to large overpredictions of the ratios are shown for most models, except for UIOWA WRF-Chem which shows a negative bias. This tendency is consistent with other studies that show discrepancies in the AOD and PM 2.5 performance for biomass burning smoke (Mangold et al., 2011;Reddington et al., 2016Reddington et al., , 2019. For the models with prominent smoke AOD impacts, a shift in the distribution of ratio compared to observations could be explained by issues in assumptions for aerosol optical properties (e.g., too high a mass extinction efficiency can bias the ratio distribution towards the lower end) or biases in the PBL heights (e.g., shallower PBL can lead to positively biased ratios).
Meanwhile, most models display a narrower distribution of the ratio than observed, except for HRRR-Smoke. The wider distribution is expected since HRRR-Smoke is a smoke tracer model. The background PM 2.5 and AOD due to anthropogenic and other non-biomass pollution sources were not represented, which generates more extreme ratios. For the other models, the narrower distribution may suggest a smaller probability of extreme conditions (e.g., smoke aerosols lofted above the PBL or confined near the surface). In other words, the narrow distribution tends to suggest smoke plumes mostly getting mixed in the PBL. Therefore, the misrepresentation of the distribution width can be improved by fixing issues with regards to the vertical allocation of fire emissions that is estimated by parameterizations of plume injection. Further work is necessary to evaluate the contributions of relevant factors independently, e.g., fire size, fuel type, and thermodynamic stratifications.
As the fire activity changes drastically from day to day, the surface PM 2.5 to AOD ratios also show temporal variations, which can be found in the spread of the ratios over the hours of comparison (Fig. 23). The observations show a decreasing trend, especially for the 10th percentile, which is likely associated with deeper PBL and/or lofted smoke owing to stronger plume injections on 7 and 8 August compared to the previous days. However, the models rarely captured this feature or show flatter decreasing trends than observed, which can be associated with the fewer plume injections in models as days went by, consistent with the evaluation results against the DIAL-HSRL data in Sect. 3.4.2.
The model representation of the ratios also suggests discrepancies between model performance for AOD and PM 2.5 . Figure 24 shows how the model performance of PM 2.5 compares to that of AOD. Ideally, the target would be for the dots to fall to the 1 : 1 line, meaning that the PM 2.5 and AOD are biased by the same amount, and thus if emissions were corrected, the forecast could achieve a close-to-zero bias in both quantities simultaneously. However, the dots often fall far from the 1 : 1 to line, and the further they are from the 1 : 1 line, the stronger the biases generally seen in the PM 2.5 to AOD ratios (see Fig. S43 for the ideal relationship between NMBs of the ratio, AOD and PM 2.5 ), showing that the surface PM 2.5 to AOD ratio can be a good indicator of the discrepancies. While multiple forecasting systems show nearly unbiased ratios for some cases (i.e., the dots close to the 1 : 1 line), discrepancy in the AOD and surface PM 2.5 performance occurred for all the models. This means that the modeling systems cannot be fully improved by only revising the smoke emissions. Changes in structural configuration needs to be explored, including better representations of the aerosol optical properties, vertical allocation of the emis-  sions, and timing of plume injections, as well as the meteorological fields and thermodynamic processes in the lower troposphere (e.g., PBL heights and evolution).

Conclusions and recommendations to improve wildfire smoke forecasts
Predictions of wildfire smoke impacts on local to regional air quality by numerical forecasting systems have been a crucial tool in the decision-making and understanding of large wildfire events. However, the wildfire smoke forecasts relating to biomass burning emissions still bear a large uncertainty.
In this paper, we present an intercomparison and evaluation of the wildfire smoke predictions produced by 12 state-ofthe-art forecasting systems under the same framework. These forecast models are drastically different from each other with respect to the gas/aerosol emissions, complexity of chemical processes, and use of AOD data assimilation, etc. Fo-  cusing on the active burning period of the Williams Flats fire (3-9 August 2019), the evaluation is carried out to reveal model performance in multiple dimensions, including fire emissions, total column loading of smoke aerosols, surface PM 2.5 concentrations, and plume injection. The major findings and recommendations for improved wildfire smoke simulation and forecast are summarized as follows.

Wildfire smoke emissions
The intercomparison of predicted smoke emissions suggests a substantial uncertainty in forecasted emission inventories. We find an overall large spread in daily total BB OC emissions, with the factor between the maximum and the minimum being about 20 to 50 on 5-9 August due to different methodologies used for the emission estimates. Over- Figure 24. Scatterplots of normalized mean bias (NMB) of AOD and surface PM 2.5 for different models evaluated by each hour of coincident PM 2.5 and MODIS AOD observations. The scattered circles are color-coded by the NMB of PM 2.5 / AOD ratio in that hour.
all, the FRP-based fire emissions are relatively higher than the satellite-fire-detection-based emissions. The large spread is likely driven by dry-matter differences, and not emission factors, as reported by Carter et al. (2020). Additionally, the diurnal fire activity observed by the geostationary FRP data shows substantial day-to-day variations, while this cannot be well represented by the fixed diurnal patterns used by the models. Discrepancies are shown in terms of the magnitude of diurnal variation and timing of the peak, as well as the nighttime fire activity. The limited representation of complex fire emission evolution in models greatly affected and challenged the performance on characterizing the impact of smoke on air quality.
4.2 Total column smoke loading and surface impacts: sAOD and sPM 2.5 Statistics for sAOD and surface sPM 2.5 show well-predicted magnitudes for a few models, while none of the models managed to realistically describe their spatial distributions. Nearly unbiased predictions are present for sAOD (UIOWA WRF-Chem and CAMS) and sPM 2.5 (FireWork, NAQFC, and RAQMS), and as expected, the high-resolution regional models show relatively better capability in depicturing the fine-scale plume structures. However, low correlation coefficients with r<0.55 (0.36) and large errors with NME > 60 % (70 %) are found for all models for sAOD (sPM 2.5 ), which indicates inconsistencies in the spatiotemporal variations of smoke plumes between the forecasts and observations. The FRP-based fire emissions and assimilation of satellite AOD tend to yield better model performance in terms of sAOD. In accordance with the emission magnitudes, the models driven by FRP-based fire emissions (CAMS, GEOS-FP, HRRR-Smoke, UCLA WRF-Chem, UIOWA WRF-Chem) produce larger sAOD and outperform those driven by satellite-fire-detection-based emissions, showing less un-derestimation and better agreement for smoke plume areas. FRP-based emission inventories generally use AOD observations to tune their conversion of FRP to emissions, which could explain this advantage. Assimilating satellite AOD data in initial condition and forecasts (CAMS, RAQMS, GEOS-FP) or as constraints of fire emissions (UCLA WRF-Chem) also helped offer better performance. Other factors such as complexity of chemical mechanisms, chemical LBCs, horizontal resolution, initial time of forecast, and dynamic core used to drive the meteorological dispersion and transport do not seem to be determining for these metrics.
For sPM 2.5 , however, the two operational air quality models for Canada and the US (FireWork and NAQFC) perform the best, which often use surface monitoring station data as the primary metric of their evaluation. In contrast, all models using FRP-based emissions tend to exhibit remarkable overestimations, especially for the stations that are closely impacted by fresh smoke plumes (MB > 6.0 µg m −3 , NMB > 40 %). The inconsistent biases in sAOD and sPM 2.5 demonstrate discrepancies in the performance for total column and surface air pollution levels, which is partly attributable to errors in vertical allocation of emissions, as confirmed by the evaluation against DIAL-HSRL observations. The percentage of emissions injected into the free troposphere in models is generally assumed as a constant depending on fuel category, while, compared to the lidar data, most models show larger proportions of smoke within the PBL rather than being injected into the free troposphere, which leads to a higher amount of smoke aerosols close to the land surface. In addition, model representation of PBL evolution, assumptions to diagnose optical properties of smoke aerosols, and missing chemical processes, e.g., enhanced evaporations of organic aerosols near the surface, are also potential factors associated with this discrepancy.
The diurnal cycles of sPM 2.5 suggest additional inconsistencies in certain processes within models and observations. Overall, the overestimations of sPM 2.5 proved to be even higher during the late-afternoon and nighttime hours, thus producing much stronger diurnal variations of sPM 2.5 than observed. The enlarged positive biases are presumably due to the overestimated proportion of emissions within the PBL. Earlier collapse of daytime PBL and lower PBL heights in the late afternoon and nighttime can also be a reason, associated with the disregarded sensible heat released from wildfire burning process and the PBL parameterizations used in models. An evaluation of surface PM 2.5 forecasts has suggested an inconsistency in the nocturnal PBL mixing within WRF-Chem (McKeen et al., 2007) that very little turbulent exchange is present during stable nighttime conditions, leading pollutants to build up unreasonably in the lowest model level. Further investigation is needed to understand their relative contributions more.
The day-to-day variation of model performance for sAOD illustrates the significant limitation of the assumption of persistence used for predicting fire emissions within the fore-casting period. Remarkable underestimations of sAOD are shown on 8 August for all models, which appears to be mainly due to the drastic expansion of the burned area on the previous day not being captured by the models assuming persistence of fire activity. In contrast, the impact of persistence on the performance for sPM 2.5 on 8 August is not as significant as for sAOD, which is likely due to a cancellation of errors with overestimated proportions of smoke emissions within the PBL. Grouping sPM 2.5 sites by affected by fresh plume or not further confirms this behavior.

Plume injections
The vertical plume structure and plume injections are further quantitatively evaluated against the DIAL-HSRL observations acquired during FIREX-AQ. While the observations show a considerable day-to-day variation in the plume heights, all the models have smaller spreads. Overall, there is a large inter-model difference in the predicted plume heights. For the flight transects presenting injections within or around the PBL heights (3 and 6 August), most models show overestimated plume heights, and the models that usually put emissions below the PBL height perform better (e.g., CAMS, RAQMS, and GEOS-FP). Physical parameterizations of plume injections (e.g., Freitas et al., 2007Freitas et al., , 2010, based on convective energy) yield slight overestimations on 3 and 6 August and managed to depict enhanced injections on 7 August with stronger free-tropospheric injections, although with underpredictions. Additionally, insufficient representations are found for the strong injections owing to deep convection of the pyroCb on 8 August. More work is needed to improve their inferring method from input variables (e.g., fire intensity, fire size, meteorological fields) to plume injections to accurately depict different scenarios. It is also noteworthy that even for the models using the same plume injection scheme (e.g., NCAR WRF-Chem, UIOWA WRF-Chem, UCLA WRF-Chem, and WISC WRF-Chem), they often show substantially different results which might relate to uncertainties in meteorological fields, inputs to the plume rise parameterization, and grid resolution that need to be investigated further. The assessments for transported plumes (older than a day) show consistency with the injection performance as revealed for the fresh plume on the day before. It confirms the comparison results of models showing overestimated sPM 2.5 , which tend to be associated with the earlier decay of plume injections and overestimated proportion of smoke loading with the PBL in the late afternoon. This result further emphasizes how the errors in timing and vertical allocation of emissions can propagate into model skill over transported plumes.

Discrepancy in performance for surface PM 2.5 and AOD
Surface PM 2.5 / AOD is suggested as a measure to assess the vertical distribution of smoke as well as the discrepancy in model performance between the two individual terms. Note that the ratios can be dominated by the background or other emission sources when the fire emissions are low. The evaluation for probability distributions of the PM 2.5 to AOD ratios emphasizes two aspects of model improvements. First, most models show positive biases in the means of the (lognormal) distributions, which suggests misrepresented aerosol optical properties (e.g., relatively lower mass extinction coefficients), and/or shallower mixing volume (e.g., relating to the lower PBL). Second, the narrower distributions indicate underpredicted possibility of cases with smoke plumes reaching the land surface or lofted above the PBL. The analysis of AOD, surface PM 2.5 , and their ratios for coincident samples further confirms discrepancies in the model performance for AOD and PM 2.5 . The biases in AOD and PM 2.5 can be effectively reduced simultaneously by adjusting the fire emissions for the models showing fewer discrepancies in their ratio, while large discrepancies in the ratios point to the need of taking other factors into consideration, including the representation of aerosol optical properties, vertical allocation of smoke aerosols, and PBL evolution as well.

Recommendations for future improvements
Model evaluations of smoke emissions and sAOD suggest the advantage of using FRP-based fire emissions and data assimilation in providing less biased forecasts for total column smoke aerosol loading, compared to other differences in model features. The fact that all estimated fire emissions exhibit a large spread in magnitude demonstrate a need of future work to close the gap between these estimates and reduce their uncertainty. Leveraging FRP detections from geostationary satellites could provide beneficial information in improving the representation of temporal variation of fire emissions and to overcome the limitation of fixed diurnal patterns. This would be important, especially for severe wildfires with unusual diurnal activity. As the forecasted air quality impacts still show limitations due to the persistence assumption, methodologies to describe and predict evolution of fire burning need to be developed. A relevant system is available in Europe, adopting a modeling strategy of hourly sequential warm start runs (Solomos et al., 2015(Solomos et al., , 2019, with the emissions updated every hour using geostationary satellite detections. For the US, hourly emissions derived by blending GOES Advanced Baseline Imager (ABI) and polarorbiting satellite VIIRS fire products at 3 km spatial resolution will be incorporated into operational fire smoke models, namely NAQFC and HRRR-Smoke. These studies provide an efficient way of removing the minor or extinguished fires and at the same time of enhancing emissions from actual burning fires, thus tracking the diurnal cycle of biomass burning. However, challenges still exist in making assumptions about the fire intensity and spread over the next hours/days. Also, compared to Europe, the fire intensities and their durations are on much larger scales in North America. The large spatial variability of fuels, complex topography, and different ecosystems in the US add to the complexity. Forecast performance for sAOD and surface sPM 2.5 and their discrepancies highlight key modeling processes to be improved. The proportion of emissions getting injected above the PBL appears to evolve with fire burning intensity, as indicated by the lidar observations, while all models tend to show underestimated free-troposphere smoke emissions and plume heights towards the days with enhanced burning. This illustrates a key need to improve plume rise parameterizations and vertical partition of fire emissions that is closely relevant to accurate representation of smoke plumes and the performance discrepancies. Modeled aerosol components and late-afternoon and nighttime evolution of PBL structures are also closely relevant. In addition, although not evaluated specifically in this work, model assumptions of aerosol optical properties are important for both the AOD forecast performance and the discrepancies. Most models use optical properties that come embedded in the model version, and these are usually out of date. Thus, it is necessary to more proactively update optical property modules that reflect all that we have learned from field campaigns and satellite observations and make them easy to be incorporated into the community models. Given these various factors, future sensitivity and retrospective runs and analysis on these processes would help to identify the determinant factor(s) and their relative contributions for improving smoke forecasting.
Author contributions. XY and PES designed the study and conducted UCLA WRF-Chem forecasts. XY analyzed the data with help from all the co-authors. RA, EJ, and GAG provided HRRR-Smoke forecast data and emissions. BP and AK provided RAQMS and WISC WRF-Chem data. PM and JC provided ARQI data. DD provided FireWork data. GRC and GF provided UIOWA WRF-Chem data. JM and JH provided NAQFC data. RK and LE provided NCAR WRF-Chem data. FLHT provided AIRPACT data. MP, RE, and VHP provided CAMS data. AS, EG, and EW developed Fuel2Fire emission data. JWH, MF, and TS conducted DIAL-HSRL observations. AL and YW developed MODIS MAIAC AOD data. SK developed satellite products (fire emissions and AOD from VIIRS) used by some models, either as input or for verification. BH and DMG conducted AERONET and DRAGON observations and helped with analysis. XY wrote the paper with contributions from co-authors.
Competing interests. The authors declare that they have no conflict of interest.
Disclaimer. Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.