Recommendations on benchmarks for photochemical grid model applications in China: Part I – PM2.5 and chemical species

Photochemical grid models (PGMs) have been applied with increasing frequency over the past decade to address diverse scientific and regulatory issues associated with deteriorated air quality in China. Thorough evaluation of model performance helps to illuminate the robustness and reliability of the baseline modelling results and of subsequent scenarios built upon them. The model performance evaluation (MPE) is thus a critical step of any PGM application. However, with numerous input data requirements, diverse model configurations, and the continuing scientific evolution of the models themselves, no two PGM applications are the same and their model performance results can differ significantly. Currently, a uniform set of MPE procedures and associated benchmarks is lacking in China, where air pollution problems have attracted extensive attention and the modelling community has grown substantially. Here we present a comprehensive review of PGM applications in China with the aim of proposing a set of statistical benchmarks for the evaluation of simulated fine particulate matter (PM2.5) concentrations in China. A total of 307 peer-reviewed articles published between 2006 and 2019, which applied five of the most frequently used PGMs in China, are compiled to summarize operational model performance results. Quantile distributions of common statistical metrics are presented for total PM2.5 and speciated components. We discuss influences on the range of reported statistics from different model configurations, including modelling regions and seasons, the spatial resolution of modelling grids, and the temporal resolution of the MPE. This is the first study to propose benchmarks of six frequently used evaluation metrics for PGM applications in China, organized in two tiers – "goals" and "criteria" – where "goals" represent the best model performance that a model is currently expected to achieve and "criteria" represent the model performance that the majority of studies can meet.
Results from this study will support the ever-growing modelling community in China by providing a more objective assessment of, and context for, how well their results compare with previous studies, and by helping to demonstrate the credibility and robustness of their PGM applications prior to subsequent regulatory assessments.


Data compilation
A total of five photochemical models – the Community Multiscale Air Quality model (CMAQ, Foley et al., 2010), the Comprehensive Air Quality Model with Extensions (CAMx, Ramboll Environment and Health, 2018), the Goddard Earth Observing System (GEOS)-Chem model (http://geos-chem.org), the Weather Research and Forecasting model coupled with Chemistry (WRF-Chem, Grell et al., 2005), and the Nested Air Quality Prediction Modelling System (NAQPMS, Z. Wang et al., 2006) – are included in this compilation. While the former four models are developed by institutes and/or companies outside China, NAQPMS is developed by the Institute of Atmospheric Physics of the Chinese Academy of Sciences and has mostly been utilized for applications in China. GEOS-Chem is a global chemical transport model with coarser resolution (only 20% of the compiled GEOS-Chem studies have a grid resolution finer than 50 km), as opposed to the other four regional models that are applied with finer spatial resolution at the local scale (for example, less than 10 km). Our investigation started by searching for combinations of three key words on the Web of Science – the model name, "air quality", and "China" – with the timespan limited to 2006-2019. This initial search gave 446 (CMAQ), 84 (CAMx), 256 (WRF-Chem), 117 (NAQPMS), and 58 (GEOS-Chem) records (961 in total). Duplicated records were excluded. We then excluded records that were listed as conference papers or not published in English-language journals (for example, Chinese- and Korean-language journals). This resulted in 826 records published in 61 journals. We further reduced the number of journals considered by excluding those with fewer than ten publications during 2006-2019, which left 464 studies. Table S1 lists the journals included in this study, which are believed to cover the mainstream journals in atmospheric research, especially in applications of air quality models. The next filtering stage required substantial manual effort.
The 464 records were downloaded and manually checked to exclude (1) studies that were accidentally captured by the search but did not apply any of the five models; (2) studies that were intended for other purposes (for example, evaluating meteorological simulations); (3) studies that were not focused on China (for example, the target region was Korea, Japan, etc.); (4) studies that did not provide any air quality model performance evaluation or that referred the evaluation results to previous studies; and (5) studies that did conduct a model performance evaluation but gave no numerical values (for example, only graphical plots). The final selection comprised a total of 307 papers (see the complete list in Table S2). We defined ten regions of China as shown in Figure 1, namely the Beijing-Tianjin-Hebei (BTH) region, Yangtze River Delta (YRD) region, Pearl River Delta (PRD) region, Sichuan Basin (SCB), North China Plain (NCP), Central, Northwest, Northeast, Southeast, and Southwest (see Table S3 for the provinces covered in each region).

Metrics evaluated
A total of 25 performance metrics were used in the 307 articles compiled in this study (see Supplemental Table S4 for a complete list). In general, these statistical metrics fall into two types. The first type indicates how well the model captures the magnitude of the observations; examples include mean bias (MB), normalized mean bias (NMB), and fractional bias (FB). The second type indicates how well the model captures the variations of the observations; the most commonly used metrics of this type are the "correlation coefficient" and the "index of agreement". While some of the compiled studies explicitly provide the mathematical formulas of the MPE metrics used in their paper, many did not. This causes ambiguity when a common terminology or abbreviation is used but no explicit formula is provided.
For example, the term "correlation coefficient" (or "correlative coefficient") is frequently used but turned out to be calculated with different mathematical formulas in different studies. In some studies, the "correlation coefficient" refers to the Pearson correlation coefficient (R), which indicates the strength of the linear relationship between observations and predictions, while in others it refers to the coefficient of determination (R²), which represents the fraction of predicted variations explained by observations. In these two cases, the R² value is simply the square of the R value. In two studies, the term "correlation coefficient" was used but formulated as the root mean square error (RMSE). To make things even more complicated, this correlation coefficient is used to indicate the model's capability of capturing temporal variations in most of the studies but spatial variations in some cases (e.g. Ge et al., 2014). For temporal variations, the "correlation coefficient" is calculated from temporally (hourly or daily) matched observations and modelled results at a single monitoring site (or, in many cases, averages across multiple monitoring sites). For spatial variations, it is calculated from pairs of observations and modelled results at multiple sites, and its value is used to demonstrate spatial performance. For better comparability among studies, we converted R² values to R. The "index of agreement" (IOA) is another example where caution must be taken when collecting data, since its definition is not unique among these studies: most studies use the definition of IOA (d) shown in Table 1, and only one study used the formula in Table 3. The use of IOA is discussed further in Section 3.4; we dropped the second formula when developing the IOA benchmarks.
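The distinction above can be made concrete with a minimal Python sketch (function and variable names are ours and purely illustrative), which computes the Pearson R for a paired observation-model series and shows the R²-to-R conversion applied in this compilation:

```python
import numpy as np

def pearson_r(obs, mod):
    """Pearson correlation coefficient between observations and model."""
    o, m = np.asarray(obs, float), np.asarray(mod, float)
    return np.corrcoef(o, m)[0, 1]

# Illustrative paired series (e.g. daily PM2.5 at one site, in ug/m3).
obs = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mod = np.array([12.0, 18.0, 33.0, 41.0, 47.0])

r = pearson_r(obs, mod)
r2 = r ** 2
# Converting a reported R^2 back to R, as done in this compilation.
# Note: this assumes a positive correlation, since sqrt(R^2) = |R|.
r_from_r2 = np.sqrt(r2)
```

A reported "correlation coefficient" of 0.5 therefore means very different skill depending on whether it is R or R², which is why the explicit formula matters.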

Derivation of benchmarks
In this study, we largely adopted the method established by Simon et al. (2012) and Emery et al. (2017). Quartile distributions for each statistical metric (depending on data availability) are first presented, and the influences of several key model inputs on these metrics are discussed. Rank-ordered distributions of selected metrics are then used to extract the 33rd and 67th percentiles. Following Emery et al. (2017), these two percentiles separate the whole distribution into three performance ranges: studies that fall within the 33rd percentile can be considered to meet the "goals", i.e. the best a model is currently expected to achieve; studies that fall between the 33rd and 67th percentiles meet the "criteria" that the majority of studies can achieve; and studies that fall outside the 67th percentile indicate relatively poor performance for that specific metric. A summary table with the 33rd and 67th percentile values of the recommended statistical metrics is provided at the end of this work and is compared with the U.S. benchmarks proposed by Emery et al. (2017).

The severe haze episode in January 2013 attracted extensive attention to air pollution issues in China. Since then, a series of air pollution control actions has been carried out, and increasing funding became available for the research community to perform various studies related to air pollution. Of the 307 articles included in this work, CMAQ was the most frequently used PGM. Emission inventory is another critical input for PGM applications, and the accuracy of the emission inventory being used undoubtedly affects model performance directly. The most frequently used emission inventories for anthropogenic sources include the MEIC developed by Tsinghua University (http://www.meicmodel.org) and the Regional Emission Inventory in Asia (REAS).

Table 1 shows the formulas of the metrics that have been used in more than 20 studies. As in Simon et al. (2012), the top three most frequently used metrics are the correlation coefficient (R, 223 studies), normalized mean bias (NMB, 170 studies), and mean bias (MB, 132 studies). Other frequently used (>20 studies) metrics include root mean square error (RMSE, 118 studies), normalized mean error (NME, 111 studies), fractional bias (FB, 66 studies), fractional error (FE, 62 studies), index of agreement (IOA, 57 studies), and mean error (ME, 27 studies). Mean normalized bias (MNB) and mean normalized error (MNE) were used in only 15 and 10 studies, respectively; as noted by Simon et al. (2012), these two metrics tend to give more weight to data at low values. About 65% of the articles included in this work used at least three statistical metrics for model performance evaluation, and 223 studies (73%) evaluated three or fewer pollutants; one study (Ying et al., 2018) evaluated 17 pollutants (including elemental PM2.5 components). Figure 3 shows the number of studies broken down by pairs of pollutants and PGMs and by pairs of pollutants and metrics. As expected, PM2.5 is the most frequently evaluated pollutant, followed by ozone, NO2, SO2 and PM10, all of which are criteria pollutants included in China's National Ambient Air Quality Standards (NAAQS). Speciated PM2.5 components, including nitrate, sulfate, ammonium and organic carbon (OC), were evaluated about one fourth as frequently as total PM2.5 and only in applications for certain regions due to limited observations.

For total PM2.5, slightly more negative values of MB, NMB, and FB were reported (see Table S5). Absolute bias for PM2.5 ranges from as low as -50 μg/m3 to over 50 μg/m3 (outliers excluded), with median values around 3 μg/m3. The bias range for speciated components is much smaller (within 20 μg/m3) because their absolute magnitudes are much smaller. In terms of normalized bias (i.e. NMB and FB), the range for PM2.5 is comparable to or smaller than that for speciated components, partly due to compensating errors among components. Speciated PM2.5 tends to be predominantly underestimated, except for nitrate (in terms of NMB) and EC (in terms of FB).
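The percentile-based derivation of two-tier benchmarks described above can be sketched in a few lines of Python; the compiled |NMB| sample here is randomly generated and purely illustrative, standing in for the values collected from the literature:

```python
import numpy as np

# Hypothetical compiled |NMB| values (%) from many studies; illustrative only.
rng = np.random.default_rng(0)
abs_nmb = rng.uniform(0, 60, size=100)

# 33rd percentile -> "goal", 67th percentile -> "criteria"
goal, criteria = np.percentile(abs_nmb, [33, 67])

def performance_tier(value, goal, criteria):
    """Classify one study's metric value against the two-tier benchmarks."""
    if value <= goal:
        return "goal"       # best ~one third of studies
    if value <= criteria:
        return "criteria"   # achievable by ~two thirds of studies
    return "outside"        # relatively poor performance
```

The same rank-ordering applies to any of the six recommended metrics; only the direction changes (lower is better for bias/error, higher is better for R and IOA).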

Quantile distributions of PM2.5 and speciated components
The widest range of normalized bias was reported for nitrate, suggesting substantial uncertainties in simulated nitrate concentrations. Some early reported large negative NMB values for nitrate were partly due to the missing formation of coarse-mode nitrate in early versions of the model (Kwok et al., 2010). As the model evolved over time, the large negative bias of nitrate disappeared (see Figure S3). Unlike other speciated components that can be both emitted directly from sources (i.e. primary) and formed via chemical reactions of precursors (i.e. secondary), elemental carbon (EC) is solely emitted directly from sources; thus the performance of simulated EC concentrations is more closely associated with the accuracy of the emission inventory. Model underestimation of secondary species (organic and inorganic) has been reported in numerous studies, with explanations including missing formation mechanisms, uncertainties in the emission inventory, and carried-over meteorological errors. For the error metrics, total PM2.5 performs better than speciated components in terms of NME, with a median NME value around 40%.

Impact of season
Numerous factors can affect model performance results, for example the study region and period, the source of the emission inventory, the model grid resolution, and the temporal resolution of the paired observations and modelling results used for model evaluation. We first look at NMB results of total PM2.5 and selected species (subject to the availability of data points) by season (Figure 5). For total PM2.5, the number of data points reported for winter is significantly larger than for other seasons, as heavy haze episodes generally occur in winter. Reported NMB for total PM2.5 was predominantly negative except in winter, where positive NMB values were also reported. Most of the large positive NMB values (>50%) reported for winter are from a single study, for which the authors explained that the overestimation may be associated with the inconsistency of emissions between the base year and the modelling year. Sulfate tends to be overwhelmingly underestimated regardless of season, which is commonly reported in the literature, with missing formation mechanisms (e.g. heterogeneous reactions; Ye et al., 2018; Shao et al., 2019; Chen et al., 2019) as potential causes. However, a few large NMB values (>50%) were reported in fall (Cheng et al., 2019). Nitrate and ammonium exhibit comparable over- and underestimation in all seasons, and differences among seasons are minor. OM also tends to be underestimated, especially in summer and fall. The underestimation of organic components, especially secondary organic aerosols (SOA), is well documented in many studies (e.g. Jimenez et al., 2009; Q. Chen et al., 2017; B. Zhao et al., 2016). The two positive NMB values reported for winter are from one study (Li et al., 2018), which attributed the model bias to uncertainties in anthropogenic emissions.

Impact of region
We also examined whether there are any regional differences in these statistical metrics. Constrained by the number of data points, we only compared R and NMB results for total PM2.5 and secondary inorganic species over three key regions in China, namely the Beijing-Tianjin-Hebei (BTH) region in north China, the Yangtze River Delta (YRD) region in eastern China, and the Pearl River Delta (PRD) region in south China (Figure 6). These three regions represent the most populated, economically developed, and urbanized city clusters in China. With respect to total PM2.5, R and NMB values for the three regions do not exhibit substantial differences. All three regions show predominantly negative NMB values for PM2.5, with median NMB around -10%. Reported R values for PRD are slightly lower than for the other two regions. For sulfate and ammonium, reported R values for YRD are significantly lower than for the other two regions, and the underestimation of sulfate and ammonium is more severe in YRD and PRD. For nitrate, PRD shows the lowest R values, and the model bias shifts from positive to negative as the target region gets warmer.

Impact of temporal and spatial resolution
Although PGM simulations are usually conducted at an hourly time step, validation of modelling results is not always performed with pairs of hourly data; this depends on the temporal resolution of the observational data as well as the purpose of the application. Daily, monthly, and even annually averaged pairs of modelling results and observations have been used for model evaluation. Of the 307 studies compiled in this work, 183 (60%) used hourly data for model validation, followed by 90 (29%) using daily, 31 (10%) using monthly, and only 12 (4%) using annual data. Because of the generally coarse resolution of GEOS-Chem studies, the finest temporal resolution used for validation in GEOS-Chem studies is daily. Figure 7 shows the quantile distributions of eight statistical metrics for total PM2.5 by the temporal resolution used for model validation (plots for speciated components are shown in Figure S1; annual results are not shown due to limited data). Model performance evaluated using daily-average values is similar to or slightly better than that using hourly values, but exhibits large improvements when monthly-average values are used. For instance, reported R values do not differ much between the hourly and daily scales (median values around 0.7) but exhibit a substantial improvement at the monthly scale (median value around 0.85). A similar trend is observed for the reported error statistics (NME and FE), which show slight improvement as the validation resolution increases from hourly to daily but a large improvement from daily to monthly. One study (Matsui et al., 2009) provided two sets of R values calculated from hourly and daily-averaged data, respectively, and the R values for daily averages are almost always (12 out of 14) higher.
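The effect of temporal averaging on R can be demonstrated with fully synthetic data (all numbers below are illustrative, not drawn from the compiled studies): when observations and model share a common signal but have independent errors, averaging to coarser resolution suppresses the uncorrelated noise and inflates the correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
hours = 24 * 60  # 60 days of hourly values
t = np.arange(hours)

# Shared "true" signal: a synoptic (weekly) cycle plus a diurnal cycle.
truth = 35 + 15 * np.sin(2 * np.pi * t / (24 * 7)) + 5 * np.sin(2 * np.pi * t / 24)
obs = truth + rng.normal(0, 8, hours)  # noisy observations
mod = truth + rng.normal(0, 8, hours)  # model with independent errors

def r_at(obs, mod, window):
    """Pearson R after block-averaging both series over `window` hours."""
    o = obs.reshape(-1, window).mean(axis=1)
    m = mod.reshape(-1, window).mean(axis=1)
    return np.corrcoef(o, m)[0, 1]

r_hourly = r_at(obs, mod, 1)
r_daily = r_at(obs, mod, 24)   # noticeably higher than r_hourly
```

This is why reporting the pairing resolution alongside R, as recommended later in this work, is essential for comparing studies.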
Spatial resolution is a key setting for PGM applications. For applications at the local or urban scale, a PGM is usually configured with two or three nested domains that downscale from a coarser outer domain to a finer inner domain. Among the 307 articles compiled in this study, a total of 43 grid resolutions were used (for nested grids, we used the resolution of the finest grid), ranging from as coarse as over 200 km (used by GEOS-Chem) to as fine as 1 km, depending on the target region and the purpose of the application. GEOS-Chem is more often used with coarse resolution (>50 km) and its modelling grids are usually rectangular, which we converted to the side length of a square with equivalent grid area. For simplicity, we classified these grid resolutions into five categories: (0, 5 km], (5 km, 10 km], (10 km, 25 km], (25 km, 50 km], and (50 km, 100 km]. Figure 8 shows the distribution of eight statistical metrics of total PM2.5 by these categories (plots for speciated components are shown in Figure S2). It appears that finer spatial resolution does not necessarily improve model performance. For example, the R values for the finest category range from as low as 0.47 to as high as 0.85, while those for the coarsest category range from 0.33 to 0.96. MB seems to shift from underestimation to overestimation as the grid resolution gets coarser, whereas no clear trend is observed for FB and NMB. Reported NME and FE values seem to increase as the grid resolution gets coarser. As mentioned above, many factors can affect model performance, so it is difficult to isolate whether there is a systematic improvement as the modelling resolution gets finer. One compiled study reported results at 12 km and 36 km resolutions simultaneously: the model over-predicted PM2.5 at 12 km resolution (positive values of MB, NMB, and FB) but under-predicted PM2.5 at 36 km resolution (negative values of MB, NMB, and FB). This is likely due to the dilution effect, which lowers model results in the 36 km domain.

Trends over the past decade
In an attempt to assess whether model performance results have evolved over the past decade, we present time series of selected statistical metrics for total PM2.5 in Figure 9 (plots for inorganic species are shown in Figure S3). Results published prior to 2013 were aggregated into one group because of the limited number of studies before 2013. For total PM2.5, reported R values have remained relatively consistent over the past decade, with the median fluctuating within 0.6~0.8. The ranges of reported RMSE and MB have become narrower in recent years even though the number of studies has increased substantially, while reported IOA and RMSE values fluctuated up and down over the period. On the other hand, there seems to be an improving trend in FB, FE, and NME, as the reported values of these three metrics shift towards zero; for instance, the median reported FE decreased from 56.9% prior to 2013 to around 33% in 2019. However, it is important not to over-interpret these results, as the number of studies published each year can affect them.

Recommended metrics and benchmarks
We present diagrams similar to those in Emery et al. (2017) to develop metrics and benchmarks for model evaluation. Figure 10 shows the rank-ordered distributions of R, IOA, NMB, NME, FB, and FE results for total PM2.5 and speciated components from all studies compiled in this work. Results of R for total PM2.5 are further split by hourly (h), daily (d), and monthly (m) resolution, since R increases as the temporal resolution changes from hourly to monthly. The top 33rd percentile value increases from around 0.76 for hourly and daily results to 0.92 for monthly results; the top 67th percentile increases from 0.60 to 0.70 as total PM2.5 is evaluated at coarser resolution. Secondary inorganic species (sulfate, nitrate, and ammonium) show a similar range (0.65~0.75) over the 33rd-67th percentile interval. For OC/OM and EC, the 33rd and 67th percentile R values are lower than for the inorganic species; the 33rd to 67th percentile interval is 0.56~0.69 for OC/OM and 0.48~0.65 for EC. In terms of IOA, the 33rd-67th percentile interval ranges from 0.73 to 0.83 for total PM2.5 and is lower for speciated components.
Values for EC are not shown due to limited data. For bias and error, total PM2.5 exhibits smaller values compared with speciated components, due to potential compensating effects among components. The 33rd percentile of absolute NMB for total PM2.5 is less than 10%, while the 67th percentile is less than 20%. Among the three secondary inorganic species, the bias and error of nitrate exhibit the largest variability (NMB ranges from 17.4% to 55.2% and NME from 47.0% to 71.2% over the 33rd to 67th percentile interval). The 33rd to 67th range of NMB for EC (16.0% to 32.4%) is much lower than that for OC/OM (34.7% to 55.0%), while NME for OC/OM and EC is similar, ranging from ~40% to 55%. The number of FB and FE data points is considerably smaller than that of NMB and NME for speciated components, and nitrate exhibits the largest variability in terms of FB and FE.
Based on our analysis above as well as previous conclusions from Emery et al. (2017), we propose recommended statistical metrics and associated benchmarks for total PM2.5 and speciated components, as shown in Table 2. Shaded values indicate that fewer than 20 data points were available to develop the benchmarks. Values for "goal" indicate that roughly the top one third of studies meet the benchmarks and represent the best that a model is currently expected to achieve. Values for "criteria" indicate that roughly the top two thirds of studies meet the benchmarks and represent results from the majority of studies. Our table differs from Emery et al. (2017) in three aspects. Firstly, we added benchmarks for IOA in addition to the correlation coefficient. We found a general increase in the use of IOA for model performance evaluation since 2013 (prior to 2013, only six of our compiled studies used IOA; after 2013, 51 studies used it), so we added IOA for future reference. Secondly, we present benchmarks for different temporal resolutions of total PM2.5 where possible. As mentioned above, reported R values for total PM2.5 improve as the temporal resolution gets coarser, while no strong trend is observed for other metrics; therefore, different benchmarks are developed for R. Thirdly, Emery et al. (2017) did not present benchmarks for the correlation coefficient of speciated PM components due to large uncertainties. Here we present benchmarks for R and IOA of speciated PM components (except IOA for EC, which is not available), but caution should be taken when comparing against these benchmarks; for example, fewer than twenty data points were used to develop the IOA benchmarks for ammonium and sulfate. For sulfate and ammonium, we do not observe sudden changes in the rank-ordered distribution as observed in Emery et al. (2017), so we keep these values for future reference.
For the bias and error metrics, we do observe sharp changes in rank-ordered values in some cases, for example NMB/FB for nitrate and FB for EC; in these cases we do not give benchmarks. We also present benchmarks for FB and FE.
We further compared our results with the benchmarks proposed by Emery et al. (2017). Values with an asterisk in Table 2 indicate that our benchmarks are stricter than the corresponding values in Emery et al. (2017), meaning that a study's results would be more difficult to place within the 33rd (or 67th) percentile if our benchmarks were used. For total PM2.5, our proposed benchmarks are generally stricter than those in Emery et al. (2017). For example, our NMB (NME) "criteria" value for PM2.5 is 20% (40%) as opposed to 30% (50%) in Emery's study; the "criteria" value for the R benchmark is also higher (0.55) than that based on U.S. studies (0.40). This might partially reflect systematic improvements in model applications (e.g. incorporation of newly discovered mechanisms) during the past several years, since the latest study included in Emery et al. (2017) was published in 2015. Our "goal" values for the NMB, NME, and R benchmarks are the same as those proposed by Emery et al. (2017). For speciated components, the NME benchmarks for nitrate are lower (i.e. stricter) than Emery's, while the opposite is true for sulfate and ammonium. For the correlation coefficient, our criteria benchmarks for sulfate and ammonium (0.60) are much higher (i.e. stricter) than those in Emery's study (0.40).
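As a usage illustration, the two-tier comparison against the total PM2.5 NMB values quoted above (goal: |NMB| ≤ 10%; criteria: |NMB| ≤ 20%) can be coded as a small helper; the function name is ours, not part of any published tool:

```python
def check_pm25_nmb(nmb_percent):
    """Two-tier check of a study's total PM2.5 NMB (in percent) against
    the benchmark values quoted in the text:
    goal |NMB| <= 10%, criteria |NMB| <= 20%."""
    a = abs(nmb_percent)
    if a <= 10:
        return "meets goal"
    if a <= 20:
        return "meets criteria"
    return "outside criteria"
```

For example, a study reporting NMB = -8% would meet the goal, while NMB = +25% would fall outside the criteria and warrant further diagnosis.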
As mentioned earlier, PGM applications involve numerous driving inputs as well as diverse model configurations, which provide an abundant database from which to assess their relative influences on model performance. A preliminary analysis based on the random forest method (Liu et al., 2012), a machine learning method suitable for classification and regression, suggests that the emission inventory, grid resolution, and boundary conditions are the top three factors affecting model performance (see details in the Supplemental information). The similarities between the benchmarks derived in this study and Emery's study suggest that important model input data (e.g. emission inventories) have comparable accuracy for China and North America, and that model formulations (e.g. algorithms for chemistry, deposition, and transport) seem to be equally applicable to both regions. In addition to the need for model performance benchmarks, there is also a need for more studies that quantify contributions to model uncertainty, such as the recent study by Dunker et al. (2020), which quantifies the contributions of chemistry, boundary concentrations, deposition, and emissions to uncertainty in simulated ozone results.

Benchmarks for the European modelling community – FAIRMODE
The air quality model benchmarking practice for PGM applications in the FAIRMODE community is somewhat different from the U.S. benchmarks. The main performance indicator is the modelling quality indicator (MQI), which is calculated from the RMSE and the measurement uncertainty (a function of the mean value and standard deviation of the observations) (Janssen et al., 2017). The modelling quality objective (MQO) is the criterion for the MQI and is said to be met if the MQI is less than or equal to one. In addition to the main MQI, three statistical indicators that describe certain aspects of the differences between observed and modelled results – namely bias, correlation, and standard deviation – are proposed as modelling performance indicators (MPI). For each MPI, the model performance criterion (MPC) that the individual MPI is expected to meet is also given. However, unlike the fixed values given in this study and in Emery et al. (2017), the MPC depends on the observation uncertainties. Therefore, the MPC values are not directly comparable with the benchmarks proposed in this study or in Emery et al. (2017).
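The general shape of the MQI calculation can be sketched as follows. This is our simplified reading, not an implementation of the official FAIRMODE tool: the uncertainty parameters (u_rv, rv, alpha, beta) below are placeholders chosen for illustration, and readers should consult Janssen et al. (2017) for the exact pollutant-specific formulation and parameter values.

```python
import numpy as np

def measurement_uncertainty(obs, u_rv, rv, alpha):
    """FAIRMODE-style expanded measurement uncertainty U(O_i).
    u_rv, rv, alpha are illustrative placeholders, not official values."""
    o = np.asarray(obs, float)
    return u_rv * np.sqrt((1 - alpha**2) * o**2 + alpha**2 * rv**2)

def mqi(obs, mod, u_rv=0.28, rv=25.0, alpha=0.25, beta=2.0):
    """Modelling quality indicator: RMSE normalised by beta times the
    root-mean-square of U(O_i). The MQO is met when MQI <= 1."""
    o, m = np.asarray(obs, float), np.asarray(mod, float)
    rmse = np.sqrt(np.mean((m - o) ** 2))
    rms_u = np.sqrt(np.mean(measurement_uncertainty(o, u_rv, rv, alpha) ** 2))
    return rmse / (beta * rms_u)
```

The key contrast with the percentile benchmarks of this study is visible in the code: the acceptance threshold scales with the observation uncertainty rather than being a fixed number.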

The use of "index of agreement"
The concept of the "index of agreement" was originally proposed by Willmott in the 1980s and has since been widely used to "reflect the degree to which the observed variate is accurately estimated by the simulated variate" (Willmott, 1981) in a variety of fields. The IOA has gone through several modifications (together referred to as the Willmott indices) since it was proposed in its original form (Willmott, 1982; Willmott et al., 1985, 2012). The formula of the original version (d) is shown in Table 1 (presented again in Table 3) and the other three (d1, d1' and dr) are shown in Table 3. The first version of the IOA was proposed as an alternative to the correlation coefficient for its ability to "discern differences in proportionality and/or constant additive differences between the two variables" (Willmott, 1981), and this version is also the most widely used in our compiled studies.
Compared with R² values, the original IOA yields systematically higher values (Valbuena et al., 2019) and thus is being adopted in an increasing number of studies, partially because it makes results appear "better". However, the original, and most widely used, IOA is problematic in that too much weight is given to large errors when they are squared (Willmott et al., 2012), and relatively high IOA values can be obtained even when a model is performing poorly (Willmott et al., 1985; Pereira et al., 2018). The newer versions later proposed by Willmott overcome this problem by removing the squaring and are recommended over the original (Willmott et al., 1985, 2012). Valbuena et al. (2019) suggested using d2 instead of d, at least for estimating forest biomass based on remote sensing, to facilitate comparison with studies using the correlation coefficient.
Nearly 20% (57 studies) of our compiled studies used the "index of agreement" for MPE, but only one study (Y. Peng et al., 2011) used the second formula (d1), while the remaining studies all used the original formula. Given the increasing trend in the use of the IOA (original formula) as a model performance indicator for PGM applications in China (only 6 studies prior to 2013 vs. 51 studies after 2013), we decided to keep the IOA-based results and discussion in this work for future reference, but caution should be taken when using and interpreting IOA values. It should be noted that the value of the IOA alone does not necessarily indicate how good the modelling results are.
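The squaring issue discussed above is easy to see side by side. The sketch below implements the original d (Willmott, 1981) and the refined dr (Willmott et al., 2012) from their standard published formulas; function names are ours:

```python
import numpy as np

def ioa_original(obs, mod):
    """Willmott's original index of agreement d (1981). Squared terms
    give extra weight to large errors and push values close to 1."""
    o, p = np.asarray(obs, float), np.asarray(mod, float)
    ob = o.mean()
    return 1 - np.sum((p - o) ** 2) / np.sum((np.abs(p - ob) + np.abs(o - ob)) ** 2)

def ioa_refined(obs, mod):
    """Willmott's refined index dr (2012): bounded in [-1, 1] and based
    on absolute rather than squared differences."""
    o, p = np.asarray(obs, float), np.asarray(mod, float)
    ob = o.mean()
    num = np.sum(np.abs(p - o))
    den = 2 * np.sum(np.abs(o - ob))
    return 1 - num / den if num <= den else den / num - 1
```

For the same paired series, dr is typically noticeably lower than d, which is precisely why values of the original IOA should not be read as evidence of good performance on their own.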

Additional recommendations
Beyond the recommended metrics and associated benchmarks listed in Table 2, we offer additional recommendations for validation practices that would enable a complete and comprehensive picture of model performance.
(1) Provide the explicit mathematical formula of each statistical metric being used, to avoid any confusion. As mentioned earlier, many studies did not give explicit formulae for the metrics used. This can cause ambiguity when a common name (for example, correlation coefficient or index of agreement) is used but calculated with different formulae.
(2) Provide as much detail as possible on how observations and modelling results are combined to obtain the statistical results. For example: how are observed data and modelled results paired in space and time? Is any averaging performed prior to calculating the statistical metrics? Specify the number of observation sites and the number of available data points used. This would enable further comparison of model performance based on the amount of available data. It should be noted that heavier averaging of observed and modelled results usually yields better statistics but does not convey any additional meaning.
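The effect of averaging noted in point (2) is easy to demonstrate. The following synthetic sketch (pure Python; the data, seed, and names are ours and purely illustrative) pairs hourly "observations" with a noisy "model" and computes Pearson R before and after daily averaging:

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def daily_means(hourly):
    """Average consecutive 24-hour blocks into daily values."""
    return [sum(hourly[d * 24:(d + 1) * 24]) / 24
            for d in range(len(hourly) // 24)]

random.seed(0)
# 60 days of synthetic hourly PM2.5: a shared multi-day signal plus
# independent hourly noise in the "model".
obs = [50 + 30 * math.sin(2 * math.pi * h / (24 * 7)) + random.gauss(0, 15)
       for h in range(60 * 24)]
mod = [o + random.gauss(0, 25) for o in obs]

r_hourly = pearson_r(obs, mod)
r_daily = pearson_r(daily_means(obs), daily_means(mod))
# Daily averaging smooths out the hourly noise, so r_daily exceeds
# r_hourly even though the underlying "model" is unchanged.
```

This is why statistics computed at different temporal aggregations are not directly comparable, and why the aggregation should always be reported.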
(3) It is always good practice to present model performance results for the meteorological fields, usually including but not limited to temperature, humidity, wind speed, and wind direction. Performance results of the meteorological model can also help explain potential causes of unsatisfactory PGM results.
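Wind direction in particular requires circular statistics: a naive difference between 350° and 10° is 340°, not the physically meaningful 20°. A small sketch of one common convention (function names are ours):

```python
def wind_dir_error(obs_deg, mod_deg):
    """Smallest signed angular difference (model minus observation),
    wrapped into the interval (-180, 180] degrees."""
    diff = (mod_deg - obs_deg) % 360.0
    return diff - 360.0 if diff > 180.0 else diff

def wind_dir_bias(obs, mod):
    """Mean of the wrapped per-pair direction differences (degrees)."""
    errs = [wind_dir_error(o, m) for o, m in zip(obs, mod)]
    return sum(errs) / len(errs)
```

Applying linear bias formulas to raw wind-direction values instead can produce large spurious errors around north.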
(4) The metrics used should always include two types of statistical metric: one for magnitude evaluation (e.g. MB, NMB or FB) and one for variation evaluation (e.g. R or IOA). According to Simon et al. (2012), a minimum set of MPE statistical metrics should include "mean observation, mean prediction, MB, ME (or RMSE) and a normalized bias and error (NMB/NME or FB/FE)". Caution is needed when presenting values of fractional metrics such as NMB/NME and FB/FE: double-check whether the values presented are before or after multiplication by 100%.
We did find studies presenting extremely small NMB values (<1%) that, judging from the other evaluation metrics reported, should have been multiplied by 100.
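The minimum metric set recommended above can be written out explicitly; the following sketch uses the standard definitions of these metrics (returned as fractions, so the ×100% convention must be applied explicitly when reporting percentages):

```python
import math

def mb(obs, mod):
    """Mean bias, in concentration units."""
    return sum(m - o for o, m in zip(obs, mod)) / len(obs)

def me(obs, mod):
    """Mean (gross) error, in concentration units."""
    return sum(abs(m - o) for o, m in zip(obs, mod)) / len(obs)

def rmse(obs, mod):
    """Root mean square error."""
    return math.sqrt(sum((m - o) ** 2 for o, m in zip(obs, mod)) / len(obs))

def nmb(obs, mod):
    """Normalized mean bias, as a FRACTION (multiply by 100 for %)."""
    return sum(m - o for o, m in zip(obs, mod)) / sum(obs)

def nme(obs, mod):
    """Normalized mean error, as a fraction."""
    return sum(abs(m - o) for o, m in zip(obs, mod)) / sum(obs)

def fb(obs, mod):
    """Fractional bias, bounded on [-2, 2]."""
    return sum(2 * (m - o) / (m + o) for o, m in zip(obs, mod)) / len(obs)

def fe(obs, mod):
    """Fractional error, bounded on [0, 2]."""
    return sum(2 * abs(m - o) / (m + o) for o, m in zip(obs, mod)) / len(obs)
```

Reporting NMB = 0.08 and NMB = 8% are both common conventions; stating explicitly which one is used avoids the <1% ambiguity noted above.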
(5) Try to evaluate multiple pollutants even if the study focuses on a single pollutant. Opposite biases in speciated PM components can compensate for each other and falsely lead to an apparently good performance for total PM2.5.

Conclusions

With the increasing number of PGM applications in China over the past decade, a review of model performance is needed to help understand how well these models currently perform compared with observations and how reliable future model applications are relative to existing studies. Following an established method used in the U.S., a total of 307 peer-reviewed studies that applied PGMs in China were compiled in this work, and key information, including the model applied, study region, grid resolution, and evaluated metrics, was collected. As an initial attempt, operational MPE results for total PM2.5 and speciated components reported in the compiled studies are presented in this study; results for other pollutants and meteorological simulations will be discussed in follow-up studies. Quantile distributions of common statistical metrics used in the literature were presented, and the impacts of different model configurations, including study region, study period, and spatial and temporal resolutions, on performance results were discussed. Using the concept of "goals" and "criteria", we proposed benchmarks for six commonly used metrics (NMB, NME, FB, FE, R and IOA) based on the method employed by Emery et al. (2017). For total PM2.5, we provided R benchmarks for different temporal resolutions; for component species, we did not split results by temporal resolution due to the limited number of data points. We kept results for the index of agreement while recognizing that it should be used and interpreted with caution. Additional recommendations on good evaluation practices are provided at the end.
Results from this study could help the ever-growing modelling community in China to better understand how their model performance compares with existing studies and also help modellers conduct model evaluation in a more consistent fashion, which would in turn improve comparability among different studies.

Notes to Table 2: (a) For instance, one third of the reported R values for predicted hourly PM2.5 concentrations are higher than 0.76; half are higher than 0.69; and two thirds are higher than 0.60. Marked values in Table 2 indicate that our benchmarks are stricter than the corresponding values in Emery et al. (2017). (b) Shaded values indicate that fewer than 20 data points were available to develop the benchmarks.