Recommendations on benchmarks for numerical air quality model applications in China: Part I – PM2.5 and chemical species

10 Numerical air quality models (AQMs) are being applied more frequently over the past decade to address diverse scientific and regulatory issues associated with deteriorated air quality in China. Thorough evaluation of a model‟s ability to replicate monitored conditions (i.e. a model performance evaluation or MPE) helps to illuminate the robustness and reliability of the baseline modelling results and subsequent analyses. However, with numerous input data requirements, diverse model configurations, and the scientific evolution of the models themselves, no two AQM applications are the same and their 15 performance results should be expected to differ. MPE procedures have been developed for Europe and North America but there is currently no uniform set of MPE procedures and associated benchmarks for China. Here we present an extensive review of model performance for fine particulate matter (PM2.5) AQM applications to China and, from this context, propose a set of statistical benchmarks that can be used to objectively evaluate model performance for PM2.5 AQM applications in China. We compiled MPE results from 307 peer-reviewed articles published between 2006 and 2019, which applied five of 20 the most frequently used AQMs in China. We analyse influences on the range of reported statistics from different model configurations, including modelling regions and seasons, spatial resolution of modelling grids, temporal resolution of the MPE, etc. Analysis using a Random Forest method shows that the choices of emission inventory, grid resolution, and aerosol and gas-phase chemistry are the top three factors affecting model performance for PM2.5. We propose benchmarks for six frequently used evaluation metrics for AQM applications in China, including two tiers – “goals” and “criteria” – where 25 “goals” represent the best model performance that a model is currently expected to achieve and “criteria” represent the model performance that the majority of studies can meet. Our results formed a benchmark framework for the modelling performance of PM2.5 and its chemical species in China. For instance, in order to meet the goal and criteria, the normalized mean bias (NMB) for total PM2.5 should be within 10% and 20% while the normalized mean error (NME) should be within 35% and 45%, respectively. The goal and criteria values of correlation coefficients for evaluating hourly and daily PM2.5 are 30 0.70 and 0.60, respectively; corresponding values are higher when the index of agreement (IOA) is used (0.80 for goal and 0.70 for criteria). Results from this study will support the ever-growing modelling community in China by providing a more objective assessment and context for how well their results compare with previous studies, and to better demonstrate the credibility and robustness of their AQM applications prior to subsequent regulatory assessments.

with previous studies, and to better demonstrate the credibility and robustness of their AQM applications prior to subsequent regulatory assessments.

Data compilation
A total of five photochemical modelsthe Community Multiscale Air Quality (CMAQ, Foley et al., 2010), the 5 Comprehensive Air Quality Model with Extensions (CAMx, Ramboll Environment and Health, 2018), the Goddard Earth Observing System (GEOS)-Chem (http://geos-chem.org), the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem, Grell et al., 2005), and the Nested Air Quality Prediction Modelling System (NAQPMS, Z. Wang et al. 2006)are included in this compilation. While the former four models are developed by institutes and/or companies outside China, the NAQPMS is developed by the Institute of Atmospheric Physics of Chinese Academy of Sciences and has 10 mostly been utilized for applications in China. GEOS-Chem is a global chemical transport model with coarser resolution (only 20% of complied GEOS-Chem studies have grid resolution less than 50 km), as opposed to the other four regional models that are applied with finer spatial resolution at regional or local scale (for example, less than 10 km).
Our investigation started by searching for combinations of three key words on the Web of Science: model name, "air quality", and "China", and limited the timespan between 2006 and 2019. This initial search gave 446 (CMAQ), 84 (CAMx), 15 256 (WRF-Chem), 117 (NAQPMS), and 58 (GEOS-Chem) records (a total of 961). Duplicated records were excluded. We then excluded records that were listed as conference papers or not published in English-language journals (for example, Chinese and Korean-language journals) due to narrower audiences. This resulted in 826 records published in 61 journals. We further reduced the number of journals considered by excluding those that had less than ten publications during 2006-2019, since most of the excluded journals are not air quality-related journals, which results in 464 studies. Table S1 shows the list 20 of journals that were included in this study, which is believed to cover the mainstream journals in atmospheric research, especially in applications of air quality models.
The next filtering stage needed substantial manual effort. The 464 records were downloaded and manually checked to exclude (1) studies that were accidentally included in the search but did not apply any of the models in their study; (2) studies that were intended for other purposes (for example, evaluating meteorological simulations); (3) studies that were not 25 focused on China (for example, the target region was Korea, Japan, etc.); (4) studies that did not provide any air quality model performance evaluation or the evaluation results were referred to previous studies; (5) studies that did conduct model performance evaluation but no numerical values were given (for example, only graphical plots were given). The final selection included a total of 307 papers (see a complete list in Table S2). We defined ten regions of China as shown in Figure  4 the term of "correlation coefficient" (or "correlative coefficient") is frequently used in many studies but calculated using different mathematical formulas. In some studies the "correlation coefficient" refers to the Pearson correlation coefficient (R), which indicates the strength of linear relationship between observations and predictions; while in other studies it refers to the coefficient of determination (R 2 ) that represents the fractions of predicted variations explained by observations. In these two cases, R 2 is simply the square of R, yet the square root of R 2 may be +/-R, where negative values are possible 5 (indicating anti-correlation) and thus lead to ambiguity. In two studies (X. Wang et al., 2018;, the term "correlation coefficient" is used but formulated as the root mean square error (RMSE). To make things even more complicated, the correlation coefficient is used to indicate a model"s ability to capture temporal variations in most of the studies but also spatial variations in some cases (e.g. Ge et al., 2014). For temporal variations, correlation coefficient is calculated based on temporally (hourly or daily) matched observation and modelled results at a single monitoring site (or 10 averages across multiple monitoring sites in many cases). For spatial variations, correlation coefficient is calculated based on pairs of observations and modelled results at multiple sites and its value is used to demonstrate spatial performance. For better comparability among studies, we converted R 2 values to R (assuming always positive values). "Index of Agreement" (IOA) is another example where caution must be taken since the definition of IOA is not unique among these studies. Most of the studies use the definition of IOA (d) shown in Table 1 yet one study used the formula in Table 3. The use of IOA is 15 discussed more in section 3.4 and we dropped the second formula for developing IOA benchmarks.

Derivation of benchmarks
In this study, the method established by Simon et al. (2012) and Emery et al. (2017) was mostly adopted. The quartile distribution for each statistical metric (depending on the data availability) is first presented and the influences of several model key inputs on these metrics are discussed. Rank-ordered distributions for selected metrics were then used to pick out 20 the 33 rd and 67 th percentiles. According to Emery et al. (2017), the 33 rd and 67 th percentile separate the whole distribution into three performance ranges: studies that fall within the 33 rd percentile successfully meet the goals that the best performing models are currently expected to achieve; studies that fall between 33 rd and 67 th quantiles successfully meet the criteria that the majority of modelling studies achieve; studies that fall outside the 67 th quantile indicate relatively poor performance for that specific metric. A summary table with values of 33 rd and 67 th quantile values for recommended statistical metrics is 25 provided at the end of this work and is compared with U.S. benchmarks proposed by Emery et al. (2017).

Feature importance based on Random Forest
Random Forest is a machine learning method suitable for classification and regression (Liu et al., 2012). It is a collection of a series of decision trees and each tree is generated from a bootstrap sample. Both continuous and categorical input variables are allowed. It can provide the order of feature importance (FI) so that we can determine and rank which parameter choices 30 most influence the simulation results. We reviewed the model configurations for studies that reported correlation coefficient, IOA, MB, NMB, mean error (ME), normalized mean error (NME), fractional bias (FB), and fraction error (FE) for PM 2.5 (a total of 176 studies). Model configurations include the meteorological data that are used to drive air quality simulations (e.g. from WRF, MM5, or GEOS), the emission inventory (e.g. public available dataset vs. locally developed), gas-phase chemistry (for example, 35 carbon bond vs. Statewide Air Pollution Research Center (SAPRC)), aerosol chemistry (including inorganic aqueous chemistry, inorganic gas-particle partitioning, organic gas-particle partitioning and oxidation), boundary conditions (e.g. model default values vs. results generated from global model), grid resolution and the temporal resolution (Table S7). We ignored the study region and period for FI selection because these two options are more restricted by the user"s specific needs and focus (i.e., more subjective/uncontrollable and less objective/controllable). We ranked each statistical metric from 40 good to poor performance. For example, values of R and IOA that are close to 1 represent good performance and values close to 0 represent poor performance. For MB and NMB, we used absolute values so that deviations from zero represent the performance level. These results were classified into three tiers with breaks at 33% and 67% of the ranked values so that each tier includes the top one third, the middle one third, and the bottom one third of the reported performance results. The random forest model was performed using the "sklearn" module in Python to obtain the FI metric.

General overview of air quality modelling studies in China
A total of 307 articles with AQM applications published between 2006 and 2019 were compiled in this work. Figure 2a shows the number of articles published in each year during the past 14 years. Prior to 2013, the number of studies that utilized AQMs in China was generally limited. A noticeable increase in the number of studies was apparent in 2013, with a doubling or tripling each year during 2016-2019. This sharp increase coincides with the infamous record-breaking haze 10 event in January and December 2013 that attracted numerous attentions to air pollution issues in China. Since then, a series of air pollution related actions were carried out due to increasing funding that became available for the research community.
Of the 307 articles included in this work, CMAQ was the most frequently used AQM (used in 124 studies), followed by WRF-Chem (111 studies), CAMx (36 studies), GEOS-Chem (20 studies), and NAQPMS (18 studies). Several studies evaluated model performance for multiple models (e.g. Q. Wu et al. 2012;Zhang et al., 2016;Wang et al., 2017). In terms of 15 regions, BTH (122 studies), YRD (84 studies), and PRD (65 studies) are the top three most evaluated regions ( Figure 1) (note that we excluded studies that cover the entirety of China for this count).
Meteorological data are needed to drive air quality simulations and the performance of meteorological modelling is a key source of uncertainty for air quality modelling performance. Meteorological data were mostly simulated by the Weather Research Forecasting (WRF) model (Skamarock et al., 2005) in our compiled studies; the Fifth Generation Penn 20 State/NCAR Mesoscale Model (MM5) (Grell et al., 1994) and the Regional Atmospheric Modelling system (RAMS) were used in a few studies. Model performance of meteorological results should be evaluated in addition to air quality simulation results. However, several studies did not report any results with respect to their meteorological simulations. The performance of meteorological results used to drive air quality simulations and how it could affect the air quality simulations is beyond the scope of the current work and will need to be discussed as a future work. NO 2 and CO) are routinely measured at the national monitoring sites, model validation for speciated PM 2.5 , ammonia, volatile organic compounds (VOCs) species (e.g. isoprene, formaldehyde, etc.) can only be based on measurements obtained from local monitoring sites or field observations conducted by individual research groups. Figure 2b shows the reporting frequency for each statistical metric compiled in this study. Table 1 shows the formula of metrics that have been used in more than 20 studies. Same as Simon et al. (2012), the top three most frequently used metrics 40 is correlation coefficient (R, 223 studies), normalized mean bias (NMB, 170 studies), and mean bias (MB, 132 studies).
Other frequently used (>20 studies) metrics include root mean square error (RMSE, 118 studies), normalized mean error (NME, 111 studies), fractional bias (FB, 66 studies), fractional error (FE, 62 studies), index of agreement (IOA, 57 studies) and mean error (ME, 27 studies). MNB) and MNE were only used in 15 and 10 studies, respectively, since as mentioned in Simon et al. (2012), these two metrics tend to give more weight to data at low values. About 65% of articles included in this work used at least three statistical metrics for model performance evaluation ( Figure 2c); 16% of studies reported numerical values for only one metric; less than 10% of studies included more than five MPE metrics; four studies (X. Li et al., 2015;5 Kim et al., 2017;X. Li et al., 2018;Z. Zhang et al., 2017) used eight statistical metrics. In terms of number of pollutants evaluated in each study (Figure 2d), 132 studies (43%) evaluated only one pollutant and 223 studies (73%) evaluated less than or equal to three pollutants; one study (Ying et al., 2018) evaluated 17 pollutants (including elemental PM 2.5 components). Figure 3 shows the number of studies broken down by pairs of pollutants and AQM models and pairs of pollutants and 10 metrics. As expected, PM 2.5 is the most frequently evaluated pollutant, followed by ozone, NO 2 , SO 2 and PM 10 , all of which are criteria pollutants included in China"s National Ambient Air Quality Standards (NAAQS). Evaluation of speciated PM species, including nitrate, sulfate, ammonium and organic carbon (OC) is about 25% less frequent as total PM 2.5 and was only covered in applications for certain regions due to limited observations.  Table S5). For total PM 2.5 , slightly more negative values of MB, NMB, and FB were reported. Absolute bias for PM 2.5 ranged from as low as -50 μg/m 3 to over 50 μg/m 3 (outliers excluded) with median values around 3 μg/m 3 . The bias range for speciated components was much smaller (within 20 μg/m 3 ) because the absolute magnitude of speciated components was much smaller. In terms of the normalized bias (i.e. NMB and FB), the 20 range of PM 2.5 was comparable or smaller than speciated components, partly due to the compensating errors from speciated components. Speciated PM 2.5 tended to be dominantly under-estimated except for nitrate (in terms of NMB) and EC (in terms of FB). The widest range of normalized bias was reported for nitrate, suggesting substantial uncertainties in simulated nitrate concentrations. Some early reported large negative NMB values of nitrate were partly due to missing formation of coarse-mode nitrate implemented in early version of the model (Kwok et al., 2010). As the model evolved over time and 25 emission inventories improved, the large negative bias of nitrate disappeared (see Figure S3). Speciated PM components can be both emitted directory from sources (i.e. primary) and formed via chemical reactions of precursors (i.e. secondary). Model under-estimates of secondary species (organic and inorganic) have been reported in numerous studies with explanations ranging from missing formation mechanisms, to uncertainties with the emission inventory, and to meteorological errors.

Quantile distributions of PM 2.5 and speciated components 15
Elemental carbon (EC) is solely emitted from sources thus the performance of simulated EC concentrations is strictly 30 associated with the accuracy of the emission inventory and meteorological mixing and transport.
For error metrics, total PM 2.5 performed better than speciated components in terms of NME, with a median NME value around 40%. For FE, median values for total PM 2.5 and carbonaceous species were within 40~60% while inorganic secondary species had relatively large (>60%) median FE.
R and IOA are used to indicate how well the model could capture the variations of observed values and both values are 35 within the range of 0~1. We converted R 2 values to R for better comparability. For total PM 2.5 , the median IOA was 0.76 while median R was 0.69 (R 2 =0.4761). The minimum IOA value reported for total PM 2.5 was 0.4 while the minimum R value could be negative. Eleven studies reported both R and IOA values, which enabled inter-comparisons of the two metrics based on identical sets of data points. It was found that IOA values always tend to be higher than R values (38 out of 40 data pairs). Compared to total PM 2.5 , secondary inorganic aerosols (i.e. sulfate, nitrate, and ammonium) demonstrated better 40 performance in terms of R values but slightly poorer performance in terms of IOA values. OM and EC show lower values for both R and IOA compared to total PM 2.5 .

Impact of season
There are numerous factors that could affect model performance results, such as the study region and period, uncertainty of emission inventory, model grid resolution, the temporal resolution of paired observations and modelling results used for model evaluation, etc. We first looked at NMB results for total PM 2.5 and selected species (due to availability of data points) by season ( Figure 5). For total PM 2.5 , the number of data points reported for winter is significantly higher than those reported 5 for other seasons as heavy haze episodes generally occur in winter. Reported NMB for total PM 2.5 was dominantly negative except for winter where positive NMB was also reported. Most of the large positive NMB values (>50%) reported for winter were from one single study (Zhang et al., 2017), for which the author explained that the over-estimation may be associated with the inconsistency of emissions between the base year and the modeling year. Sulfate tended to be overwhelmingly

Impact of region
We also considered whether there are any regional differences in these statistical metrics. Constrained by number of data points, we only compared results of R and NMB for total PM 2.5 and secondary inorganic species over three key regions in 20 China (BTH, YRD, and PRD; Figure 6). These three regions represent the most populated, economically developed and urbanized city clusters in China. With respect to total PM 2.5 , R and NMB values for the three regions did not exhibit substantial differences. All three regions exhibited more negative NMB values with median NMB around -10%. Reported R values for PRD were slightly lower compared with the other two regions. For sulfate and ammonium, reported R values for YRD were significantly lower than the other two regions. Underestimation of sulfate and ammonium were more severe in 25 YRD and PRD. For nitrate, PRD showed the lowest R values and model bias shifted from positive to negative as the target region got warmer.

Impact of temporal and spatial resolution
Although AQMs are usually set to output data at hourly time steps, validation of modelling results is not always performed against hourly data, depending on the temporal resolution of observational data as well as the purpose of the application. 30 Daily, monthly and annually averaged pairs of model results and observations were used for model evaluation. Of the 307 studies compiled in this work, 183 (60%) studies used hourly data for model validation, followed by 90 (29%) using daily, 31 (10%) using monthly, and 12 (4%) using annual data. Due to the coarse temporal resolution of GEOS-Chem output in general, the finest GEOS-Chem validation was conducted using daily data. Figure 7 shows the quantile distribution of eight statistical metrics for total PM 2.5 presented by the temporal resolution used for model validation (plots for speciated 35 components are shown in Figure S1; results for annual are not shown due to insufficient data). Model performance evaluated using daily-average values were similar or slightly better than hourly values but exhibited large improvements when monthly-average values were used. For instance, reported R values did not show much difference at hourly and daily scale (median values around 0.7) but exhibited a substantial improvement at monthly scale (median value around 0.85). A similar trend was also observed for reported error statistics (NME and FE), which showed slight improvement as the validation 40 resolution increased from hourly to daily, but large improvement from daily to monthly. One study (Matsui et al., 2009) provided two sets of R values based on hourly and daily-averaged data, and R values for daily averages were always higher (12 out of 14 values). Spatial resolution is key for AQM applications. For applications at local or urban scale, AQMs are usually configured with two or three nested domains, with downscaling from a coarser outer domain to finer inner domains. Among the 307 articles compiled in this study, a total of 43 grid resolutions were used (for nested grids, we used the grid solution from the finest grid), ranging from over 200 km (used by GEOS-Chem) to 1 km depending on the target region and the purpose of the application. GEOS-Chem was more often used with coarse resolution (>50 km). We classified these different grid 5 resolutions into five categories: 0-5 km, 5-10 km, 10-25 km, 25-50 km, and 50-100 km. Figure 8 shows the distribution of eight statistical metrics for total PM 2.5 by these four categories (plots for speciated components are shown in Figure S2). It appears that finer spatial resolution did not necessarily improve model performances results. For example, the R values for the finest resolution category ranged from as low as 0.47 to as high as 0.85 while for the coarsest category they ranged from 0.33 to 0.96. MB appeared to move from underestimation to overestimation from fine to coarse resolution but no clear trend 10 was observed for FB and NMB. Reported NME and FE values also appeared to increase with coarser grid resolution. As Fine resolution simulations have been conducted with the intention of improving model performance. With finer grid resolution, the spatial allocation of certain features in emission patterns is significantly improved, which is especially important for air quality simulations at local scale (Tan et al., 2015;Liu et al., 2020). Additionally, meteorological simulations could also be improved at finer resolution given more detailed land cover and structures in topography (Tao et al., 2020), which in turn improves the subsequent air quality simulations. Estimation of PM 2.5 related health impacts are 25 reported to be biased high/low at coarse spatial resolution (Li et al., 2017;Thompson and Selin, 2012). Lin et al. (2020) developed a new online regional atmospheric chemistry model -WRF-GC (v1.0), that integrates the WRF meteorology model and GEOS-Chem chemistry model. This new WRF-GC model has been configured with a spatial resolution of 27km and successfully applied to quantify the changes of NOx emissions due to COVID-19 for Eastern China (Zhang et al., 2020), illustrating the potential applications of GEOS-Chem at finer spatial scale. However, not all fine resolution simulations lead 30 to improved model performance, especially when the input data are not available with the same high resolution (Jiang and Yoo, 2018; Tao et al., 2020). Therefore, grid resolution should be determined depending on the purpose of the study and the availability of input data.

Trends over the past decade
In an attempt to assess whether model performance results have evolved over the past decades, we present time series of 35 selected statistical metrics for total PM 2.5 in Figure 9 (plots for inorganic species are shown in Figure S3). Results published prior to 2013 were aggregated into one group because there were a limited number of studies prior to 2013. For total PM 2.5 , reported R values have remained relatively consistent over the past decade with the median fluctuating within 0.6~0.8. The ranges of reported RMSE and MB become narrower in recent years even though the number of studies has increased substantially. Reported IOA and RMSE values fluctuated upward and downward over the period. On the other hand, there 40 seems to be an improving trend in terms of FB, FE, and NME as the reported values for these three metrics shift towards zero. For instance, the median value of reported FE decreased from 56.9% prior to 2013 to around 33% in 2019. However, it is important not to over-interpret these results as the number of studies published each year could affect the results.

Recommended metrics and benchmarks
We present diagrams to develop metrics and benchmarks for model evaluation. Figure 10 shows the rank-ordered distribution of R, IOA, NMB, NME, FB, and FE results for total PM 2.5 and speciated components from all studies compiled in this work. Results of R for total PM 2.5 are further split into hourly (h), daily (d) and monthly (m) resolution since R increases as temporal resolution changes from hourly to monthly. The top 33 rd percentile value increases from around 0.76 5 for hourly and daily to 0.92 for monthly results; the top 67 th percentile increases from 0.60 to 0.70 as the total PM 2.5 is evaluated with coarser resolution. Secondary inorganic species (sulfate, nitrate and ammonium) show similar range (0.65 ~ 0.75) over the 33 rd -67 th percentile interval. For OC/OM and EC, the 33 rd and 67 th percentile R value is lower compared to inorganic species; the 33 rd to 67 th percentile for OC/OM is 0.56~0.69 for OC/OM and 0.48~0.65 for EC. In terms of IOA, the 33 rd -67 th percentile interval ranges from 0.73 to 0.83 for total PM 2.5 and lower for speciated components. Values for EC 10 were not shown due to insufficient data. For bias and error, total PM 2.5 exhibits smaller values compared with speciated components, due to potential compensating effects from different components. The 33 rd percentile of absolute NMB for total PM 2.5 is less than 10% while the 67 th percentiles is less than 20%. Among the three secondary inorganic species, the bias and error of nitrate exhibits largest variability (NMB ranges from 17.0% to 55.2% and NME from 47.0% to 71.2% for 33 rd to 67 th percentile interval). The 33 rd to 67 th range of NMB for EC (16.0% to 32.4%) is much lower than that for OC/OM (34.7% 15 to 55.0%) while NME for OC/OM and EC is similar, ranging from ~40% to 55%. Number of FB and FE data is considerably less than NMB and NME for speciated components and nitrate exhibits largest variability in terms of FB and FE.
Based on our analysis as well as previous conclusions from Emery et al. (2017), we propose recommended statistical metrics and associated benchmarks for total PM 2.5 and speciated component as shown in Table 2. Shaded values indicate that less than 20 data points were available to develop the benchmarks. Values for "goal" indicate that roughly the top one third of 20 studies represent the best that models are currently expected to achieve. Values for "criteria" indicate that roughly the top two thirds of studies represent results from the majority of studies. Based on our results, NMB for total PM 2.5 should be within 10% and 20% if the goal and criteria benchmark is to be met; corresponding values for NME are 35% (goal) and 45% (criteria). In terms of R, our recommended benchmarks range from 0.60 to 0.70 for hourly and daily PM 2.5 and 0.70 to 0.90 for monthly PM 2.5 . Recommended benchmark value for IOA is 0.80 for goal and 0.70 for criteria. Our table differs from 25 Emery et al. (2017) in three aspects. First, we add benchmarks for IOA in addition to the correlation coefficient. We found a general increasing trend in using IOA for model performance evaluation since 2013 (prior to 2013, only six of our compiled studies used IOA; after 2013, 51 studies used IOA). Second, we present benchmarks for different temporal resolution for total PM 2.5 when possible. As mentioned above, reported R values for total PM 2.5 improve with coarser temporal resolution while no strong trend is observed for other metrics. Third, Emery et al. (2017) did not present benchmarks for the correlation 30 coefficient of speciated PM components due to large uncertainties and insufficient data. Here we present benchmarks for R and IOA for speciated PM components (except for EC IOA), but caution should be taken comparing to these benchmarks.
For example, less than twenty data points were used to develop the IOA benchmarks for ammonium and sulfate. For those compounds, we do not observe sudden shifts in the rank-order distribution as observed in Emery et al. (2017). Thus, we keep these values for future reference. For bias and error metrics, we do observe sharp changes in rank-order values, for example, 35 in the NMB/FB for nitrate, and FB for EC. Therefore, we do not give benchmarks for these. We also present benchmarks for FB and FE.
We further compare our results with benchmarks for U.S. proposed by Emery et al. (2017). Values with an asterisk in Table   2 indicate that our benchmarks are stricter than corresponding values in Emery et al. (2017), which means it would be more difficult for results from a particular study to attain our recommended 33 rd (or 67 th ) percentiles. For total PM 2.5 , our proposed 40 benchmarks are generally stricter than that in Emery et al. (2017). For example, our NMB (NME) "criteria" value for PM 2.5 is 20% (45%) as opposed to 30% (50%) in Emery"s study; the "criteria" value for R benchmark is also higher (0.60) than those based on U.S. studies (0.40). This might partially reflect the systematic improvements in model applications (e.g. incorporation of newly discovered mechanisms) during the past several years since the latest study included in Emery et al.
(2017) was published in 2015. Our "goal" values for NMB, NME and R benchmarks are the same as that proposed by Emery et al. (2017). For speciated components, NME benchmarks for nitrate are lower (i.e. stricter) than Emery"s study while the opposite is true for sulfate and ammonium. For correlation coefficient, our criteria benchmarks for sulfate and ammonium (0.65) are much higher (i.e. more strict) than those in Emery"s study (0.40).

5
As mentioned earlier, AQM applications involve numerous driving inputs as well as diverse model configurations, which lead to an abundant database from which to assess their relative influences on model performance. The similarities between the benchmarks derived in this study and Emery"s study suggest that important model input data (e.g. emission inventories) have comparable accuracy for China and North America and model formulations (e.g. algorithms such as chemistry, deposition, transport) seem to be equally applicable to China and North America. In additional to the need for model 10 performance benchmarks, there also is a need for more studies that quantify contributions to model uncertainty, such as the recent study by Dunker et al. (2020), which quantifies contributions of chemistry, boundary concentrations, deposition and emissions to uncertainty in simulated ozone results. In this study, we applied the Random Forest method for pattern recognition to identify and rank model attributes (inputs, grid resolutions, etc.) that have important influences on PM 2.5 model performance. The choice of emission inventory is shown to affect the model performances most, followed by grid 15 resolution, aerosol and gas chemistry ( Figure 11). Meteorological input and the choice of model itself is of least importance.

Benchmarks for European modeling community -FAIRMODE
The air quality model benchmarking practice for AQM applications by the FAIRMODE community is somehow different from the U.S. benchmarks. The main modeling performance indicator is called the modeling quality indicator (MQI), which 20 is calculated based on RMSE and measurement uncertainties (function of mean value and standard deviation of observations) (Janssen et al., 2017). The modeling quality objective (MQO) is the criteria value for MQI and is said to be met if MQI is less than or equal to one. In addition to the main MQI, three statistical indicators that describe certain aspects of the differences between observed and modeled resultsnamely bias, correlation, and standard deviation are proposed as the modelling performance indicators (MPI). For each MPI, the model performance criterion (MPC) that individual MPI is 25 expected to meet is also given. However, unlike fixed values given in this study and Emery et al. (2017), MPC is dependent on observation uncertainties. Therefore, it is not directly comparable between MPC and the benchmarks proposed in this study or those in Emery et al. (2017).

The use of "index of agreement"
The concept of "index of agreement" was originally proposed by Willmott in the 1980s and has since then been widely used 30 to "reflect the degree to which the observed variate is accurately estimated by the simulated variate" (Willmott, 1981) in a variety of fields. IOA has gone through several modifications (together referred as Willmott indices) since it was proposed in the original formula (Willmott 1982;Willmott et al., 1985Willmott et al., , 2012. The formula of the original form (d) is shown in Table 2 (presented again in Table 3) and the other three (d 1 , d 1 " and d r ) shown in Table 3. The first version of IOA is proposed over the correlation coefficient for its ability to "discern differences in proportionality and/or constant additive differences 35 between the two variables" (Willmott, 1981) and this version is also the most widely used version in our compiled studies.
Compared with R 2 values, the original IOA results systematically higher values (Valbuena et al., 2019) thus is being adopted in an increasing number of studies partially because it makes results appear "better". However, the original and most widelyused formula is problematic in that too much weight is given to the large errors when squared (Willmott et al., 2012) and relatively high IOA values could be obtained even when a model is performing poorly (Willmott et al., 1985;Pereira et al., 40 2018). Newer versions as later proposed by Willmott overcome this problem by removing the squaring and are recommended over the original one (Willmott et al., 1985(Willmott et al., , 2012. Nearly 20% (57 studies) of our compiled studies used the original IOA formulation for MPE but only one study (Y. Peng et al. 2011) used the second formula (d 1 ). Since there is an increasing trend of using the original IOA formula as a model performance indicator for AQM applications in China (prior to 2013 only 6 study vs. 51 studies after 2013), we decided to keep IOA for future reference, but caution should be taken when using and interpreting IOA values. It should be noted that IOA alone does not necessarily indicate how well the model performs. 5

Additional recommendations
Other than the recommended metrics and associated benchmarks listed in Table 2, we list additional recommendations for validation practices that would enable a complete and comprehensive picture of model performance.
(1) Provide explicit mathematical formulas for statistical metrics being used to avoid any confusion. As mentioned earlier, many studies did not give explicit formulas, which caused ambiguity when a common name (for example, correlation 10 coefficient, or index of agreement) was used but calculated in numerous ways.
(2) Provide as much detail as possible with respect to how observation and modelling results are used to obtain the statistical results. (3) Present meteorological model performance results for good practice, usually including but not limited to temperature, humidity, wind speed, and wind direction. Performance results from meteorological modelling help to explain potential causes of unsatisfactory AQM results.
(4) Include two types of statistical metrics for model evaluation, one to assess errors in magnitude (e.g. MB, NMB or FB) 20 and one to assess agreement in variation (e.g. R or IOA). Caution is needed when presenting values for normalized metrics, for example NMB/NME and FB/FE. Always express these in percentage to avoid ambiguity in whether normalized metrics are in decimal or percent formats. performance for composite species such as total PM 2.5 .
(6) In addition to listing statistical results, graphs/plots are strongly recommended to further support model validation. To give a few examples, visualizing data via time series and regression plots of modelled versus observed data help to illustrate periods and concentration ranges, respectively, with better or poorer performance. Spatial plots that include modelling results as background isopleths and observation data overlaid help illustrate how the model performs spatially. 30

Conclusions
With the increasing number of AQM applications in China over the past decade, a review of model performance is needed to help understand how well these models are currently performing compared with observations and how reliable future model applications are compared with existing studies. Following an established method used in the U.S., a total of 307 peerreviewed studies that applied AQMs in China were compiled in this work and key information, including model applied, 35 study region, grid resolution, evaluated metrics, and etc., were tabulated. Operational MPE results for total PM 2.5 and speciated components reported in the compiled literature are presented in this study. Quantile distributions of common statistical metrics reported in the literature were presented and the impacts of different model configurations, including study region, study period, spatial and temporal resolutions on performance results are discussed. With the concept of "goals" and "criteria", we proposed benchmarks for six commonly used metrics -NMB, NME, FB, FE, R and IOA based on the method 40 employed by Emery et al. (2017). For total PM 2.5 , we provided R benchmarks with different temporal resolutions; for component species, we did not split results by temporal resolution due to an insufficient number of data points. We included results for IOA while recognizing that this specific metric should be used and interpreted with caution. Additional recommendations on good evaluation practices are provided. Results from this study help the ever-growing modelling community in China to put their model performance in context relative to previous studies and to guide modellers to conduct model evaluation in a more consistent fashion.

5
Date availability. All data is available upon request from the corresponding author.
Competing interest. The authors declare that they have no conflict of interest.
Special issue statement. This article is part of the special issue "Regional assessment of air pollution and climate change over 10 East and Southeast Asia: results from MICS-Asia Phase III". It is not associated with a conference. benchmarks to assess photochemical model performance, Journal of the Air & Waste Management Association, 67, 582-598, https://doi.org/10.1080/10962247.2016.1265027, 2017 Table S2      -ordered distributions of R, IOA, NMB, NME, FB, and FE for total PM 2.5 and speciated components. The number of data points and the 33 rd , 50 th , and 67 th percentile values are also listed. For instance, one third of reported R value for predicted hourly PM 2.5 concentration is higher than 0.76; half is higher than 0.69; and two thirds higher than 0.60. Figure 11: Ranking of key model inputs in terms of feature importance (FI)