As one common precursor for both

Tropospheric ozone (

To address that challenge, studies investigating the relationship among
the responses of

To address the issue in a regression-based RSM, this study aims to develop a
polynomial family of functions in the RSM to represent the responsive
behavior of

The data used in this study were obtained from a recent regression-based RSM
study conducted in the Beijing–Tianjin–Hebei (BTH) region in China. One
baseline scenario and 1100 brute-force controlled scenarios were
performed using the Community Multiscale Air Quality Modeling System (CMAQ)
(version 5.0.1) in a 12

The responses of

In the regression-based RSM developed previously, the system supports the
investigation of different emission changes for five precursors in five regions (i.e.,
extended RSM, ERSM described in Zhao et al., 2015a and Xing et al., 2017a).
In this study, for simplicity, the pf-RSM was built on simultaneous
changes in one or all regions (i.e., separate controls in an individual
region, or joint controls in all five regions with the same control ratio).
However, the pf-RSM can be extended to pf-ERSM following the same structure
as the regression-based ERSM but using polynomial functions for

Schematic plot of comparison between the traditional RSM (regression-based) and the RSM with a polynomial function (denoted as “pf-RSM”, fitting-based).

In general, tropospheric

In this study, prior knowledge of pollutant responses to emissions, obtained
from the previously developed regression-based RSM, was characterized as a
series of polynomial functions. The accuracy of the regression-based RSM in
representing the nonlinearity in pollutant response to emissions has been
examined thoroughly using different methods, including cross validation,
out-of-sample validation and isopleth validation, in previous studies (Xing
et al., 2011, 2017a; Wang et al., 2011; Zhao et al., 2015, 2017). The
relationships among pollutant responses to emissions, which follow basic
chemical and physical laws, are implicitly represented in the
regression-based RSM. In this
study, however, we adopted a linear combination of polynomial bases (i.e., 1,
x, x

First, the degrees of the five variables were determined individually by
fitting the responsive function with a polynomial of a single indeterminate
plot (Fig. 3). The

Degree of variables in the polynomial function of response to emission changes.

Flow scheme of pf-RSM development.

Fitting the

Figure 3a presents

The correlation among variables (i.e., product term) was determined in pairs
by fitting the responsive function with a polynomial of a two-indeterminate
isopleth, expressed as follows:

The product term

Term selections for

An example is shown in Supplement Fig. S1, which presents

Training samples were generated to fit the polynomial function for each pollutant. Because each training sample requires one CTM simulation, the number of training samples should be as small as possible while still exceeding the number of terms (i.e., unknown coefficients) in the polynomial function. Our previous study (Xing et al., 2011) suggested that uniform sampling methods, such as Latin hypercube sampling (LHS) and the Hammersley quasi-random sequence sample (HSS), provide even distributions for individual sources. However, additional marginal processing is recommended because it improves prediction performance at the margins.
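As an illustration of uniform sampling, the following is a minimal sketch of a plain Hammersley point set in the unit hypercube, together with a check that the number of samples exceeds the number of polynomial terms. It is not the sampler used in this study: the function names, the choice of five dimensions, the term count of 15 and the use of the first prime numbers as bases are assumptions made for illustration, and the marginal processing step is not included.

```python
def radical_inverse(index, base):
    """Mirror the base-`base` digits of `index` around the radix point."""
    inverse, fraction = 0.0, 1.0 / base
    while index > 0:
        inverse += (index % base) * fraction
        index //= base
        fraction /= base
    return inverse


def hammersley(n_samples, n_dims):
    """Return n_samples points in [0, 1)^n_dims (plain Hammersley set).

    The first coordinate is i / n_samples; the remaining coordinates are
    radical inverses of i in successive prime bases.
    """
    primes = [2, 3, 5, 7, 11, 13]  # enough bases for up to 7 dimensions
    if n_dims - 1 > len(primes):
        raise ValueError("add more prime bases for this many dimensions")
    return [
        [i / n_samples] + [radical_inverse(i, primes[d]) for d in range(n_dims - 1)]
        for i in range(n_samples)
    ]


def has_enough_samples(n_samples, n_terms):
    """The fit is only determined if samples outnumber unknown coefficients."""
    return n_samples > n_terms


# Hypothetical setup: 40 samples of 5 emission-ratio variables, 15 terms.
samples = hammersley(40, 5)
```

Each row of `samples` can then be rescaled to the range of emission ratios of interest before being passed to the model that supplies the training responses.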

A sensitivity analysis of the number and distribution of training samples was
conducted in this study. Groups of 20, 30, 40 or 50 training samples were
drawn using uniformly distributed HSS. Additional marginal processing was
conducted using a power function (

Out-of-sample dataset for validation.

The responses for the training samples were predicted using the regression-based RSM and subsequently used to fit the polynomial function for all pollutants. We selected two out-of-sample datasets to validate the fitted polynomial function, i.e., joint controls in five regions (denoted as “OOS100”) and single regional controls (denoted as “OOS15”) (see Table 2). The control matrices of these two datasets are provided in the Supplement (Table S1). Leave-one-out cross validation (LOOCV) was used to examine whether the statistical polynomial regression was overfitted. In LOOCV, a single sample from the original dataset is used as the validation data and the remaining samples are used as the training data to build the pf-RSM; this is repeated for each sample.
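The LOOCV procedure can be sketched as follows. This is an illustrative example under stated assumptions, not the pf-RSM code: the two-variable term set (1, x, x^2, y, xy), the synthetic coefficients and all function names are hypothetical, and the least-squares fit is solved through the normal equations in pure Python for self-containment (a numerical library would normally be used).

```python
import random


def design_row(x, y):
    """Hypothetical term set: 1, x, x^2, y, x*y (not the study's selection)."""
    return [1.0, x, x * x, y, x * y]


def solve_linear(matrix, rhs):
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def fit(rows, targets):
    """Ordinary least squares via the normal equations (fine for few terms)."""
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear(ata, atb)


def predict(coefs, row):
    return sum(c * v for c, v in zip(coefs, row))


def loocv_mean_normalized_error(rows, targets):
    """Hold out each sample once, refit on the rest, average |error| / |truth|."""
    errors = []
    for i in range(len(rows)):
        coefs = fit(rows[:i] + rows[i + 1:], targets[:i] + targets[i + 1:])
        errors.append(abs(predict(coefs, rows[i]) - targets[i]) / abs(targets[i]))
    return sum(errors) / len(errors)


rng = random.Random(0)
points = [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)) for _ in range(40)]
rows = [design_row(x, y) for x, y in points]
# Synthetic responses built from the same term set, so the held-out error
# should be close to zero; real responses would come from the RSM or CTM.
true_coefs = [50.0, 8.0, -3.0, 5.0, 2.0]
targets = [predict(true_coefs, row) for row in rows]
loocv_error = loocv_mean_normalized_error(rows, targets)
```

A held-out error that is much larger than the in-sample error would point to overfitting, which is the situation this check is meant to detect.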

The predictive performance of the pf-RSM was evaluated using five
statistical indices, namely the mean normalized error (MeanNE), maximal
normalized error (MaxNE), mean fractional error (MeanFE), maximal fractional
error (MaxFE) and correlation coefficient (

In our previous RSM studies, indicators representing the nonlinearity of

Definition of peak ratio (PR) and suggested VOC

For

The previous calculations for the PR were performed through a looping
procedure in the RSM statistical system, which is not straightforward. One
advantage of the pf-RSM is that the PR can be directly calculated from the
polynomial function as follows:

In addition, we can further quantify how much simultaneous control of VOC is
required to avoid increasing

For

Definition of flex ratio (FR) and extra benefit from simultaneous
reduction of

Similarly, the FR can be directly calculated from the polynomial function as
follows:

Further, we can quantify the extra benefit in

The PR and FR are the results of

Table 3 summarizes the performance of the pf-RSM with different training
samples for predicting

Additional marginal processing improved the performance of

To meet the criteria of MeanNE within 2 % and MaxNE within 10 % (i.e., the uncertainty of the pf-RSM), which are comparable to the performance of previous regression-based RSMs, we recommend using 40 training samples with marginal processing (which improves predictions near the margins).
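The acceptance check described above can be sketched as follows. The normalized error is assumed here to be the absolute difference between the pf-RSM prediction and the reference value divided by the reference value; this definition, the function names and the sample values are assumptions for illustration rather than the formulas or data of this study.

```python
def normalized_errors(predicted, reference):
    """Assumed definition: |prediction - reference| / |reference| per sample."""
    return [abs(p - r) / abs(r) for p, r in zip(predicted, reference)]


def check_criteria(predicted, reference, mean_limit=0.02, max_limit=0.10):
    """Return (MeanNE, MaxNE, passed) for the 2 % / 10 % thresholds."""
    errors = normalized_errors(predicted, reference)
    mean_ne = sum(errors) / len(errors)
    max_ne = max(errors)
    return mean_ne, max_ne, (mean_ne <= mean_limit and max_ne <= max_limit)


# Hypothetical reference responses and two sets of predictions.
reference = [60.0, 55.0, 48.0, 40.0]
good_prediction = [60.6, 54.5, 48.5, 41.0]
poor_prediction = [60.6, 54.5, 48.5, 43.0]

good_mean, good_max, good_ok = check_criteria(good_prediction, reference)
poor_mean, poor_max, poor_ok = check_criteria(poor_prediction, reference)
```

In the second set, a single large error stays below the MaxNE limit but still pushes MeanNE above 2 %, which shows why both criteria are checked together.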

Similar results were found in the cross validation (i.e., LOOCV): the performance of the pf-RSM improves as the number of samples increases. In general, the cross-validation statistics are of the same order as those of the out-of-sample validations (OOS100 and OOS15), except for the case of 20 training samples with marginal processing, which performs worse because of underfitting. Interestingly, the pf-RSM with marginal processing performs worse in cross validation than the pf-RSM with even sampling. This is because samples with marginal processing lie closer to the margins, where prediction is more difficult (Xing et al., 2011). This also implies that samples with marginal processing better represent the variability. Nevertheless, the validation results suggest that the pf-RSM with the current number of samples is not overfitted, and we recommend fitting the system with 40 training samples with marginal processing.

A visual comparison (i.e., isopleth validation) of the pf-RSM with different training samples was also conducted; details are shown in the Supplement (Figs. S2–S9). With fewer than 40 training samples, the pf-RSM exhibited a noticeable discrepancy (i.e., in the spatial pattern of the response under the controls) compared with the regression-based RSM. This discrepancy is caused by underfitting, implying that the number of training samples is too small to capture the nonlinearity in the model system; it can be addressed by adding more training samples. With 40 training samples, the pf-RSM agreed well with the predictions of the regression-based RSM. The sampling method is also important for reducing biases: additional marginal processing also improved the performance of the pf-RSM.

First, we applied the pf-RSM in each grid cell in the simulated domain. The
base case and 40 controlled scenarios simulated by the CMAQ model (41
training samples in total) were used to fit the function of each grid cell.
Two out-of-sample CMAQ cases (i.e., Case 1: moderate control with

Spatial distribution of CMAQ-simulated and
pf-RSM-predicted PM

Spatial distribution of CMAQ-simulated and
pf-RSM-predicted O

Figures 7 and 8 present the spatial distribution of CMAQ-simulated and
pf-RSM-predicted

The performance of

Second, we applied the pf-RSM to each day in the two simulated months (i.e., January and July 2014). The same 41 training samples and two additional CMAQ cases were used to fit and validate the pf-RSM on each day.

Daily series of CMAQ-simulated and pf-RSM-predicted daily
averaged PM

The daily series of the CMAQ-simulated and pf-RSM-predicted 24 h averaged

Performance of

The performance of

Spatial distribution of the indicators for
PM

The nonlinearity in the pollution response to emissions leads to either
enhanced or reduced effectiveness of emission controls. In previous studies,
the concept of

Table 4 summarizes the indicators for urban areas of prefecture-level cities
in the five target regions. In both January and July, most of the urban
areas (except strong

Estimation of indicators that represent the nonlinear control
effectiveness for reducing

The PR values for

The PR values calculated in this study also indicate that the control of
NO

Quantifying the effectiveness of air pollution control through emission
mitigation requires an accurate representation of the nonlinear responses of
ambient

After prior knowledge of the pollutant responsiveness to emissions was
applied in the RSM system, the number of cases required to develop a single
regional pf-RSM decreased substantially to 40 samples, compared with the
previous requirement of over 100 samples. This implies that the fitting-based
RSM (i.e., pf-RSM) is 3 times faster than previous regression-based RSMs
(i.e., the number of CTM simulations needed for the pf-RSM is 60 % lower than
that required by previous regression-based RSMs). The pf-RSM system in this
study operates rapidly and can thus quickly generate responses with high
spatial and temporal resolutions, thereby facilitating cost-benefit
optimization and enabling further assessment studies (e.g., the air
pollution control cost-benefit and attainment assessment system, ABaCAS,
described by Xing et al., 2017b). The polynomial functions developed in this
study were successfully applied in all grid cells across the simulated
domain and on all days across the simulated periods for both January and July,
indicating that the combination of terms selected in this study is spatially
and temporally independent because it mainly depends on the nonlinearity in
the atmospheric processes. This means that only the coefficients of the terms
need to be fitted with training samples in another case (Step 3 in Fig. 2), as
seen in Table S4, which provides the coefficients of 15 terms for

Based on the pf-RSM, a series of indicators were calculated from the
polynomial function to represent the nonlinearity in control effectiveness
for reducing

The model outputs and the pf-RSM code package are available upon request from the corresponding author.

The supplement related to this article is available online at:

This work was supported in part by the National Key R&D Program of China (2016YFC0207601), the National Research Program for Key Issues in Air Pollution Control (DQGG0301), the National Science Foundation of China (21625701 and 21521064) and the Shanghai Environmental Protection Bureau (2016-12). This work was completed on the “Explorer 100” cluster system of Tsinghua National Laboratory for Information Science and Technology. The authors also acknowledge the contributions of Xiaoyue Niu, Qi Li, Kui Hua and Nayang Shan from the Center for Statistical Science at Tsinghua University.

Edited by: Bryan N. Duncan

Reviewed by: two anonymous referees