The vertical distribution of the aerosol extinction
coefficient (EC) measured by lidar systems has been used to retrieve the
profile of particulate matter with a diameter

Aerosol is a suspension of fine solid particles or liquid droplets in air
(Hinds, 1999; Chen et al., 2014; Fan et al., 2019; Huang et al., 2019). In
recent decades, as anthropogenic aerosol emissions have increased in China,
the concentration of fine particulate matter with a diameter of less than 2.5

Until now, surface in situ PM

In recent years, transport flux (TF) has been proposed as a measure of the horizontal transport
of pollutants; it is determined by the horizontal wind
speed and PM

A series of studies have been conducted to estimate the PM

Given the abovementioned problems and building on previous work,
surface in situ PM

Geographic location of the observation site and the observation instruments used in this study. The photo of the particulate matter detector is provided by GRIMM Aerosol Technik (© GRIMM).

The observational station is at the State Key Laboratory of Information
Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), located at
Luoyu Road, Wuhan (39.98

Surface aerosol EC was measured by combining a nephelometer (Model 3563, TSI, USA) and an Aethalometer (Model AE31, Magee Scientific, USA). The
nephelometer measures the aerosol scattering coefficients (SCs)
simultaneously at 450, 550 and 700 nm, and the error of its data products
is less than 7 % (Gong et al., 2015). The aerosol SC at the lidar wavelength of 532 nm can
be calculated from the SCs measured at 450, 550 and 700 nm (Yan et al., 2017;
Liu et al., 2018a). Moreover, the aerosol absorption coefficients (ACs) were
deduced from the black carbon concentrations measured by the Aethalometer
(Xu et al., 2012). The Aethalometer measures the black carbon
concentration at the seven wavelengths of 370, 470, 520, 590, 660, 880 and
950 nm. Previous studies indicated that aerosol AC at 532 nm and black
carbon concentrations at 880 nm are strongly correlated, and the
determination coefficient (
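The text does not give the formula used to convert the nephelometer SCs to 532 nm. One common approach, shown here purely as an assumption, is a power-law (Ångström exponent) scaling between two neighbouring nephelometer wavelengths. The function names and the sample readings below are hypothetical.

```python
import math

def angstrom_exponent(sc1, sc2, wl1, wl2):
    """Scattering Angstrom exponent from SCs at two wavelengths (nm)."""
    return -math.log(sc1 / sc2) / math.log(wl1 / wl2)

def sc_at_wavelength(sc_ref, wl_ref, wl_target, alpha):
    """Scale an SC from wl_ref to wl_target assuming SC ~ wavelength**(-alpha)."""
    return sc_ref * (wl_target / wl_ref) ** (-alpha)

# Hypothetical nephelometer readings in Mm^-1 (not values from the study).
sc_450, sc_550 = 300.0, 240.0

alpha = angstrom_exponent(sc_450, sc_550, 450.0, 550.0)
sc_532 = sc_at_wavelength(sc_550, 550.0, 532.0, alpha)
print(f"alpha = {alpha:.3f}, SC(532 nm) = {sc_532:.1f} Mm^-1")
```

Because 532 nm lies between 450 and 550 nm, the scaled value falls between the two measured SCs whenever the exponent is positive.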

During this observation period, the particulate matter monitor (Grimm EDM
180, Germany) was used to measure the surface PM

A Mie lidar system with an operating wavelength of 532 nm was used to measure the aerosol EC profile. The temporal and spatial resolutions of the measurements are 1 min and 3.75 m, respectively. The overlap of this system is 200 m. More detailed descriptions are given in a previous study (Liu et al., 2017). This lidar system directly measures the scattering intensity of aerosols, and the aerosol EC is retrieved by the Fernald inversion method (Fernald, 1984). The lidar ratio in the Wuhan area is estimated to be 50 sr (Gong et al., 2010; Liu et al., 2021). Note that the standard deviation of the assumed lidar ratio is about 20 %, and the uncertainty of the EC derived from the lidar is about 10 %–20 % (Liu et al., 2017). The lidar dataset covers observations from January 2017 to December 2019. After removing cloudy and rainy days, a total of 2304 hourly averaged profiles were obtained.

To calculate the TF of PM

In this section, the statistical methods used to assess model performance are introduced first. The establishment of a traditional linear model and four ML models is then described and discussed. Finally, the calculation method of TF is presented.

In this study, the mean absolute error (MAE), root mean square error (RMSE)
and correlation coefficient (
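As a minimal sketch of the three evaluation metrics named above, the following uses their standard textbook definitions. It assumes that the correlation coefficient is the Pearson coefficient, which the text does not state explicitly. The sample values are hypothetical.

```python
import math

def mae(obs, est):
    """Mean absolute error between observed and estimated values."""
    return sum(abs(o - e) for o, e in zip(obs, est)) / len(obs)

def rmse(obs, est):
    """Root mean square error between observed and estimated values."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs))

def pearson_r(obs, est):
    """Pearson correlation coefficient (assumed definition of R)."""
    n = len(obs)
    mean_o, mean_e = sum(obs) / n, sum(est) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(obs, est))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_e = sum((e - mean_e) ** 2 for e in est)
    return cov / math.sqrt(var_o * var_e)

# Hypothetical observed and estimated concentrations (ug m^-3).
observed = [10.0, 20.0, 30.0]
estimated = [12.0, 18.0, 33.0]
print(mae(observed, estimated), rmse(observed, estimated),
      pearson_r(observed, estimated))
```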

The linear regression relationship between observed
PM

Traditional linear models (LMs) have been used to retrieve the PM

Here, the surface EC and PM
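A minimal sketch of such a linear model is given below. It assumes an ordinary least-squares fit of the form PM = a · EC + b on paired surface measurements, which is then applied to an EC profile. The exact functional form and coefficients used in the study are not reproduced here, and all numbers are hypothetical.

```python
def fit_linear(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical paired surface samples: EC (km^-1) and PM (ug m^-3).
surface_ec = [0.1, 0.2, 0.3, 0.4]
surface_pm = [25.0, 40.0, 55.0, 70.0]
slope, intercept = fit_linear(surface_ec, surface_pm)

# Apply the fitted relation to a hypothetical lidar EC profile.
ec_profile = [0.35, 0.25, 0.10]
pm_profile = [slope * ec + intercept for ec in ec_profile]
print(slope, intercept, pm_profile)
```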

In this study, four classical ML algorithms were used to train a PM

Probability distribution functions of all sample datasets
(orange line), training dataset (red line) and testing (blue line) for
observed

The RF model is an ensemble learning method that uses multiple decision trees to train on and predict samples; it was first proposed by Breiman (2001). The decision trees in the forest are built independently of one another, and the final output of the model is determined jointly by all of the trees. The RF model can handle multiple input features and provide the best outcomes by considering different features. Owing to its high degree of generalization and fast training speed, the RF model is widely used in atmospheric remote sensing to solve nonlinear fitting problems (Wei et al., 2019).

Mean absolute errors (MAE) between observed PM

Summary of tuning parameters and their dynamic ranges of four different machine learning algorithms.

Here, the RF model was used to predict the PM

KNN is an ML algorithm that can be used for both classification and
regression (Altman, 1992; Coomans and Massart, 1982). Its principle is to
find the
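The general idea of KNN regression can be sketched as follows: find the k training samples closest to a query point and average their target values. This is a generic illustration, not the configuration used in the study. The Euclidean distance, the unweighted mean and the sample data are all assumptions.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Predict a target as the mean of the k nearest training targets."""
    # Pair each training target with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical one-feature training set (feature tuple -> target).
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 10.0, 20.0, 100.0]
print(knn_predict(train_x, train_y, (1.0,), k=3))
```

With several input variables of different units, the features would normally be scaled before distances are computed, since otherwise the variable with the largest numeric range dominates the neighbour search.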

SVM is a two-class classification model, which was first proposed by Cortes and Vapnik (1995). Its basic idea is to find a linear classifier whose separating hyperplane has the maximum margin in the feature space. Based on limited sample information, it seeks the best compromise between the complexity of the model (the learning accuracy on specific training samples) and the learning ability (the ability to identify any sample without error) in order to obtain the best generalization ability (Drucker et al., 1997). It shows many unique advantages in solving small-sample, nonlinear and high-dimensional pattern recognition problems and can be extended to other machine learning problems such as function fitting (Cao, 2003).

For the SVM model, the penalty parameter (

Mean absolute errors (MAE) between observed PM

Variations of the estimated PM

The XGB algorithm is an improved version of the gradient boosting decision tree (GBDT) algorithm. The GBDT algorithm is an additive model that minimizes the objective function value by gradually adding decision trees (Friedman, 2002). However, its objective function does not have a regularization term; it is just the sum of the loss function values, which may easily cause overfitting. The XGB algorithm adds a regularization term to the cost function of the GBDT algorithm and applies a second-order Taylor approximation to the objective function. An exact or approximate method is then used to search for the split point with the highest score, after which the next split is performed and the leaf nodes are expanded (Chen et al., 2015). This ensures that, in the process of minimizing the loss function, the tree structure does not become so complex that it causes overfitting. In addition, it speeds up the calculation.
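For reference, the regularized objective and its second-order approximation described above are usually written in the following standard form. The notation is the common textbook one and is an assumption here, not taken from this paper: l is the loss function, f_k is the k-th tree, T is the number of leaves, w is the vector of leaf weights, and γ and λ are regularization coefficients.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right)
             + \sum_{k=1}^{K} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}

\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{n}
    \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right]
    + \Omega\left(f_t\right),
\qquad
g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} l\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}}
```

The first line adds the complexity penalty Ω to the summed loss. The second line is the per-iteration objective after the constant term is dropped, which is what the split search evaluates.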

To achieve the optimal performance of the XGB model, five
parameters need to be tuned: subsample, number of trees (estimator num), maximum
depth of the tree (max depth), learning rate and gamma (Table 1). The tuning
process for these parameters is shown in Fig. 5. The subsample
was set to 0.1, 0.2, 0.5 and 1, respectively. The results show that
subsample
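A parameter search of this kind can be sketched as a loop over candidate values that keeps the combination with the lowest error. The text does not say whether the parameters were tuned jointly or one at a time; the exhaustive grid below is one possible approach. Only the subsample values come from the text. The other candidate values and the `toy_mae` scoring function, which stands in for training a model and computing its test MAE, are hypothetical.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Return the parameter combination with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

param_grid = {
    "subsample": [0.1, 0.2, 0.5, 1.0],    # values listed in the text
    "max_depth": [3, 5, 7],               # hypothetical candidates
    "learning_rate": [0.05, 0.1, 0.3],    # hypothetical candidates
}

def toy_mae(params):
    """Placeholder score; a real run would train a model and return its MAE."""
    return (abs(params["subsample"] - 0.5)
            + abs(params["max_depth"] - 5)
            + abs(params["learning_rate"] - 0.1))

best_params, best_score = grid_search(param_grid, toy_mae)
print(best_params, best_score)
```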

TF is an important parameter to measure the horizontal transmission of
pollutants (Y. Liu et al., 2019; Shi et al., 2020). In this study, the TF is
determined by the WS and the PM
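As a minimal sketch, and assuming that TF at each height is simply the product of the horizontal wind speed and the PM mass concentration, the calculation for a vertical profile could look like the following. The function name and the sample values are hypothetical; with WS in m s⁻¹ and concentration in µg m⁻³, the result is in µg m⁻² s⁻¹.

```python
def transport_flux(wind_speed, pm_concentration):
    """Element-wise product of wind speed and PM concentration per height."""
    if len(wind_speed) != len(pm_concentration):
        raise ValueError("profiles must have the same number of levels")
    return [ws * pm for ws, pm in zip(wind_speed, pm_concentration)]

# Hypothetical profiles on the same height grid.
ws_profile = [2.0, 4.0, 6.0]       # m s^-1
pm_profile = [50.0, 30.0, 10.0]    # ug m^-3
print(transport_flux(ws_profile, pm_profile))
```

Note that the largest flux need not occur where the concentration peaks, since stronger winds aloft can outweigh a lower concentration.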

Correlation coefficients between observed PM

In this section, the estimated PM

Ranking histograms of the input environment variable for

Difference of observed PM

Figure 7 shows the correlation between the observed PM

The results in the previous section show that the ML algorithms, which take
meteorological variables into account, perform better than the LM
algorithm. An input variable importance analysis was performed to
investigate the influence of meteorological factors, as shown in Fig. 8. For
these four models, the importance ranking of the input variables (EC, WD, WS,

Hourly variations of vertical distribution of

Figure 9 shows the difference between estimated and observed PM

Seasonal and annual profiles of

In this section, the diurnal and seasonal variations of TF and PM

Figure 10 shows the diurnal variation of the EC, WS, PM

Figure 11 shows the seasonal variation of the PM

This study presents a comprehensive analysis to explore the conversion of
aerosol extinction coefficient to PM

After using the traditional LM and the four ML algorithms to predict the
PM

Our work comprehensively compares the performance of LM, RF, KNN, SVM and
XGB models. From the perspective of correlation and deviation between
observed and estimated PM

The experimental data used in this paper can be provided for non-commercial
research purposes upon request (Boming Liu: liuboming@whu.edu.cn). The
ERA5 wind data can be downloaded from

The study was completed with close cooperation between all authors. YM and BL conceived of the idea for the paper. YM and BL conducted the data analyses and co-wrote the paper. YaZ, HL, SJ, WG, YiZ and RF discussed the experimental results, and all coauthors helped review the paper.

The contact author has declared that neither they nor their co-authors have any competing interests.

Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We are very grateful to the lidar team of Wuhan University for the operation and maintenance of the lidar and ground-based instruments.

This research has been supported by the National Natural Science Foundation of China (grant no. 42001291) and the China Postdoctoral Science Foundation (grant no. 2020M682485).

This paper was edited by Jianping Huang and reviewed by two anonymous referees.