Himawari-8-derived diurnal variations of ground-level PM2.5 pollution across China using a fast space-time Light Gradient Boosting Machine

PM2.5 has been used as an important atmospheric environmental parameter primarily due to its impact on human health. PM2.5 is affected by both natural and anthropogenic factors that usually have strong diurnal variations. Monitoring it helps us not only to understand the causes of air pollution but also to adapt to it. Most existing PM2.5 products have been derived from polar-orbiting satellites. This study exploits the next-generation geostationary meteorological satellite Himawari-8/AHI to reveal the diurnal variations of PM2.5. Given the huge volume of satellite data, a highly efficient tree-based Light Gradient Boosting Machine (LightGBM) learning approach, which is based on the idea of gradient boosting, is applied and extended with the spatiotemporal characteristics of air pollution to form the space-time LightGBM (STLG) model. An hourly PM2.5 data set for China (i.e., ChinaHighPM2.5) at a 5 km spatial resolution is derived from the Himawari-8/AHI aerosol products together with other variables. The hourly PM2.5 estimates (N = 1,415,188) are well correlated with ground measurements (R = 0.85), with an RMSE and MAE of 13.62 and 8.49 μg/m³, respectively, in China. Our model can capture the PM2.5 diurnal variations well, where the pollution increases gradually in the morning, and

https://doi.org/10.5194/acp-2020-1277 Preprint. Discussion started: 6 January 2021. © Author(s) 2021. CC BY 4.0 License.

Over the years, Wang et al. (2017) used the linear mixed-effect (LME) model, and Sun et al. (2019) applied the geographically weighted regression (GWR) and support vector regression (SVR) models to estimate hourly PM2.5 data from the Himawari-8 aerosol optical depth (AOD) products in the Beijing-Tianjin-Hebei (BTH) region. An improved LME model was also developed, and Xue et al. (2020) proposed an improved geographically and temporally weighted regression (IGTWR) model to derive hourly PM2.5 maps based on Himawari-8 AOD products over central and eastern China. In addition to the traditional statistical regression models, several artificial intelligence models, including the random forest (RF), eXtreme Gradient Boosting (XGBoost), and deep neural network (DNN), have recently been applied successfully to Himawari-8 data to obtain hourly PM2.5 concentrations from local regions to the whole of China (Chen et al., 2019; Liu et al., 2019; Sun et al., 2019). Nevertheless, due to their poor data mining ability, the traditional statistical regression methods usually suffer from large uncertainties. While artificial intelligence methods can achieve higher accuracy, they are often highly demanding on computational power and are thus often slow. In addition, the spatiotemporal variations of PM2.5 were often neglected in previous studies (Chen et al., 2019; Liu et al., 2019; Sun et al., 2019; Wang et al., 2017), resulting in relatively low accuracy. Therefore, we develop here a new high-efficiency and high-precision method for near-surface PM2.5 estimation that relies on the tree-based light gradient boosting machine (LightGBM) learning approach and includes spatial and temporal information, namely, the space-time LightGBM (STLG) model. This results in a high-quality, high-temporal-resolution (hourly) PM2.5 dataset over eastern China (5 km) derived from Himawari-8/AHI hourly AOD products.
Section 2 describes the natural and human data used and introduces the development of the STLG model. Section 3 validates the hourly PM2.5 estimates, shows the diurnal PM2.5 variations across China, and compares the results with traditional models and previous studies. Section 4 summarizes the study.

PM2.5 and AOD data
Individual-site PM2.5 hourly measurements are collected at a total of 1583 monitoring stations across China during 2018 (Figure S1). Corresponding Himawari-8 hourly 5 km AOD products (500 nm) across mainland China are also collected. They are synthesized from the Level 2 instantaneous AOD products at 10-minute intervals, which are generated through a newly defined Lambert-assumed aerosol retrieval algorithm (Yoshida et al., 2018; Letu et al., 2018). The Himawari-8 AOD retrievals have been preliminarily evaluated against the in-situ AOD observations provided by the Aerosol Robotic (Giles et al., 2019) and Sun-Sky Radiometer (Li et al., 2018) Observation Networks; the two agree well (R = 0.75), with an average RMSE and MAE of 0.39 and 0.21, respectively (Wei et al., 2019b). Here, only the high-quality AOD retrievals (500 nm) are selected for PM2.5 estimation.

PM2.5 can be significantly affected by meteorological conditions (Su et al., 2018), but most of the currently available reanalysis meteorological products have low temporal resolutions (~3-6 hours). Recently, the 5th generation ECMWF global atmospheric reanalysis (ERA5) was released at a horizontal resolution of 0.25° × 0.25° (14 June 2018), followed by its land version at a horizontal resolution of 0.1° × 0.1° (12 July 2019), both at an hourly time scale (1979 to present). Here we use seven ERA5 hourly meteorological parameters, including the 2-m temperature, total evaporation, relative humidity, 10-m u- and v-components of winds, surface pressure, and boundary layer height,

Human influence
Human activity is a key factor affecting PM2.5 pollution. The global annual LandScan™ product at 1 km spatial resolution in 2018 is selected to represent the population distribution (Dobson et al., 2000). In addition, MEIC monthly anthropogenic source emission data are also employed (Li et al., 2017; Zhang et al., 2007). They cover agriculture, industry, power, residential, and transportation sectors at more than 700 anthropogenic sources, including a total of 10 atmospheric pollutants and greenhouse gases. Here, four main precursors including the NH3, NOx, SO2, and VOCs, and the direct emission to PM are

Ancillary data
Two additional ancillary data sets, the MODIS monthly NDVI at a horizontal resolution of 0.05° × 0.05° and the SRTM 90 m DEM product, are selected to reflect the land cover and change and the topographical conditions in China, respectively. All the selected variables (Table S1) with potential impact on PM2.5 concentrations were resampled to 0.05° × 0.05° to match the Himawari-8 aerosol product.

Space-Time LightGBM model
The Light Gradient Boosting Machine (LightGBM) model, a tree-based machine learning approach, was introduced in 2017 (Ke et al., 2017). It builds decision trees within the gradient boosting framework and can address both regression and classification tasks. LightGBM is a fast, distributed, and highly efficient method that tackles the main challenge faced by traditional tree-based machine learning approaches, i.e., computational complexity, by reducing the number of data samples (M) and features (N). The LightGBM model includes three main steps when constructing the decision tree:
1) Histogram-based algorithm. The continuous features are first converted to discrete bins, which are used to construct the feature histograms without the need to sort during training. All the data bins are then traversed to find the best split point from the feature histograms, which significantly reduces the computation cost of the split gain. The overall complexity is O(M × N).
2) Gradient-based One-Side Sampling. The data samples are first sorted in descending order according to their absolute gradients, and the top a% of them are selected as a subset with large gradients. Then b% of the samples are randomly chosen from the remaining data as a subset with small gradients. The sampled data with small gradients are multiplied by a weight coefficient, (1 − a)/b, with a and b expressed as fractions. Last, a new classifier is learned from the sampled data, and the procedure is repeated until convergence.
3) Exclusive Feature Bundling. A graph with weighted edges is first constructed, where each weight corresponds to the total number of conflicts between two features. Then the features are sorted in descending order according to the degree of each feature (the greater the degree, the greater the conflict with other points). Last, each feature is checked in the sorted sequence and is either assigned to an existing combination with small conflicts or used to create a new combination.
Besides the main technologies mentioned above, there are other optimization details, such as the leaf-wise tree growth strategy with depth restriction (Shi, 2007), histogram difference acceleration, sequential access gradient, and support for categorical features and parallel learning. Together, these methodologies allow LightGBM to reach high accuracy and efficiency (Ke et al., 2017).
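As an illustration of step 2, the following is a minimal sketch of Gradient-based One-Side Sampling in plain Python. It is not the LightGBM implementation: the function name `goss_sample` and its arguments are hypothetical, and it assumes that a and b are given as fractions and that the random subset is sized as a share b of all samples, following the description in Ke et al. (2017).

```python
import random


def goss_sample(gradients, a, b, seed=0):
    """Pick sample indices and weights by Gradient-based One-Side Sampling.

    a, b are fractions in (0, 1): the top `a` share of samples by absolute
    gradient is always kept; a further `b` share (of all samples) is drawn
    at random from the rest and up-weighted by (1 - a) / b.
    """
    n = len(gradients)
    # Sort indices by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(a * n)
    n_rand = round(b * n)
    top = order[:n_top]      # large-gradient samples, kept as they are
    rest = order[n_top:]     # small-gradient samples, to be subsampled
    rng = random.Random(seed)
    sampled = rng.sample(rest, min(n_rand, len(rest)))
    # The weight compensates for the small-gradient samples that were dropped.
    weight = (1.0 - a) / b
    indices = top + sampled
    weights = [1.0] * len(top) + [weight] * len(sampled)
    return indices, weights


if __name__ == "__main__":
    grads = [0.1, -5.0, 0.2, 3.0, -0.3, 0.05, 4.0, -0.01, 0.4, 0.02]
    idx, w = goss_sample(grads, a=0.3, b=0.2)
    print(idx, w)
```

With a = 0.3 and b = 0.2, the three samples with the largest absolute gradients are always retained, and two of the remaining seven are drawn at random and weighted by (1 − 0.3)/0.2 = 3.5, so that the contribution of the small-gradient samples to the split gain stays approximately unchanged.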
It is well known that air pollution has spatiotemporal heterogeneities, leading to large differences in PM2.5 concentrations from place to place and from day to day. However, such characteristics have been ignored in most of the traditional statistical regression or artificial intelligence methods used in previous studies. To tackle this issue, we integrated the spatiotemporal information into the LightGBM model, and a new space-time LightGBM (i.e., STLG) model is developed in this study. The spatial feature is represented by the geographical distances of one pixel to other points in the circumscribed rectangle of the study region (Behrens et al., 2018; Baez-Villanueva et al., 2020). The distance is calculated using the Haversine method (Equation 1) to reflect the spherical distance between two points on the sphere (Wei et al., 2021). The temporal feature is represented by the day of the year (DOY), which is used to distinguish the data records of different days throughout the year during the model training.
$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos(\varphi_1)\cos(\varphi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \quad (1)$$
where $\varphi$ and $\lambda$ represent the latitude and longitude of a point on the sphere, and r denotes the mean earth radius (≈ 6371 km). Figure 1 illustrates the flowchart of the new STLG model. Two independent tenfold cross-validation methods (10-CV, Rodriguez et al., 2010), based on all the data samples (i.e., out-of-sample) and on the PM2.5 monitoring stations (i.e., out-of-station), are selected to validate the model performance and the spatial prediction ability, respectively.
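The spatial and temporal features described above can be sketched in a few lines of Python. The Haversine function follows Equation 1 with r = 6371 km; the helper `space_time_features`, the choice of reference points (four corners and the center of a bounding box), and the bounding-box coordinates are illustrative assumptions, not the configuration used in this study.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean earth radius used in Equation 1


def haversine_km(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * r * math.asin(math.sqrt(min(1.0, h)))


def space_time_features(lat, lon, day, ref_points):
    """Distances (km) to each reference point, followed by the day of year."""
    dists = [haversine_km(lat, lon, p_lat, p_lon) for p_lat, p_lon in ref_points]
    return dists + [day.timetuple().tm_yday]


if __name__ == "__main__":
    # Hypothetical bounding box (lat_min, lat_max, lon_min, lon_max).
    lat_min, lat_max, lon_min, lon_max = 15.0, 55.0, 70.0, 140.0
    refs = [
        (lat_min, lon_min), (lat_min, lon_max),
        (lat_max, lon_min), (lat_max, lon_max),
        ((lat_min + lat_max) / 2, (lon_min + lon_max) / 2),
    ]
    print(space_time_features(39.9, 116.4, date(2018, 3, 1), refs))
```

Each record then carries a small vector of distances plus the DOY, which can be appended to the other predictors before training a tree-based model.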
[Please insert Figure 1 here]
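The difference between the out-of-sample and out-of-station validation schemes lies in how records are assigned to folds. The sketch below shows one plausible way to form the folds; the function names and the random assignment are assumptions for illustration, since the exact fold construction is not spelled out here.

```python
import random


def sample_folds(n_samples, k=10, seed=0):
    """Out-of-sample CV: shuffle all records and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    fold_of = [0] * n_samples
    for pos, i in enumerate(idx):
        fold_of[i] = pos % k
    return fold_of


def station_folds(station_ids, k=10, seed=0):
    """Out-of-station CV: every record of a station lands in the same fold."""
    stations = sorted(set(station_ids))
    random.Random(seed).shuffle(stations)
    fold_by_station = {s: pos % k for pos, s in enumerate(stations)}
    return [fold_by_station[s] for s in station_ids]


def split_for_fold(fold_of, fold):
    """Return (train_indices, test_indices) for one held-out fold."""
    test = [i for i, f in enumerate(fold_of) if f == fold]
    train = [i for i, f in enumerate(fold_of) if f != fold]
    return train, test


if __name__ == "__main__":
    # Hypothetical data: 20 stations with 10 hourly records each.
    station_ids = ["s%02d" % (i % 20) for i in range(200)]
    folds = station_folds(station_ids, k=10)
    train, test = split_for_fold(folds, 0)
    print(len(train), len(test))
```

In the out-of-station scheme, no station contributes records to both the training and the test set of a fold, so the score reflects how well the model predicts at locations without monitors.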

Model fitting and validation

Spatial-scale performance
The STLG model can also largely minimize the overfitting issue and shows a strong data mining ability (Figure S2), which can more accurately establish the relationships between hourly PM2.5 observations and their influencing variables (i.e., R² = 0.97-0.98, RMSE = 4.18-7.31 μg/m³). Figure  which is a common issue in fine particle remote sensing (Wei et al., 2020). This can be explained by the large aerosol retrieval uncertainty as well as the small number of data samples under highly polluted conditions (Wei et al., 2019b, c).
In addition, the regional performance of the STLG model for hourly PM2.5 estimates has also been evaluated (Figure 3).  55, 9.63, 11.83, and 17.57 μg/m³ for four seasons, respectively (Figure S4). In general, the overall uncertainty of PM2.5 estimates increases at the beginning and end of the year, which may be caused by a harsher environment and more intense human activities in spring and winter.
[Please insert Figure 5 here]
We have also evaluated the temporally synthesized PM2.5 data from the hourly data samples at each
[Please insert Figure 7 here]
μg/m³. In addition, PM2.5 pollution always remained at a high level (> 42 μg/m³) before 12:00, yet dropped to a low level (< 29 μg/m³) after 16:00. This is closely related to people's daily life and production activities and to the life cycle of PM2.5 within a day. Similar conclusions can be obtained in the YRD region. In general, the PRD region is less polluted in the morning but more severely polluted in the afternoon than the BTH region. By contrast, PM2.5 pollution is much lower and shows a smaller diurnal difference in the PRD region than in the other two key regions, with hourly PM2.5 values ranging from 29.49±5.97 μg/m³ (11:00) to 36.36±5.76 μg/m³ (08:00). This is mainly attributed to its better natural conditions and fewer pollutant emissions (Su et al., 2018).

Zhejiang and Fujian) areas of China benefiting from sparse human distributions and activities or
superior meteorological conditions for strong pollutant dispersion (Su et al., 2018).

Comparison with traditional models
First, we compared the STLG model with five widely used statistical regression models for estimating PM2.5 in China with the same input data set (Table S3). The multivariate linear regression (MLR) model performs worst according to the statistical indicators because of the complex nonlinear PM2.5-AOD relationships, while the GWR model performs better because it takes into account the spatial characteristics of PM2.5. The generalized additive model (GAM) and the LME model show overall improved performance with decreasing estimation uncertainties because of their nonlinear characteristics and stronger data regression abilities. The two-stage model performs better still, with higher CV-R² values and smaller estimation uncertainties, by combining the advantages of the GWR and LME models. Nevertheless, our model outperforms all of the above traditional statistical regression models, mainly due to its stronger data mining ability.
Table 2 compares the accuracy and efficiency of six tree-based machine learning models for hourly PM2.5 estimates in China using the same input data set. Decision Tree (DT, Quinlan, 1986) is a traditional supervised learning classification method that has been used frequently; it shows the worst performance because it relies on a simple single classifier, although its training speed is the fastest and its memory consumption is the lowest. The model performance of ensemble learning approaches, i.e., Gradient Boosting Decision Tree (GBDT, Friedman, 2001), RF (Breiman, 2001), Extremely randomized trees (ERT, Geurts et al., 2006), However, the model efficiency is different among these models due to the large differences in the algorithm design frameworks. These tree-based machine learning models can be divided into two categories: one is "Bagging", including the DT, RF, and ERT models, which synthesizes multiple independent and unrelated weak classifiers into a strong classifier. These models can work in parallel, which can save a lot of time but may need more computer memory. The other is "Boosting", including the GBDT, XGBoost, and LightGBM models, which synthesizes multiple interdependent and related weak classifiers into a strong classifier. These models can only work in serial, which may take a lot of time but requires less memory. In general, the STGB model is the most time-consuming, while the STET model is the most memory-consuming approach. By contrast, the LightGBM model runs very fast and consumes very little computer memory, benefiting from a series of algorithm optimizations (Ke et al., 2017).
Furthermore, after considering the spatiotemporal variation characteristics, all the newly defined space-time tree-based machine learning approaches (i.e., STDT, STGB, STXB, STRF, STET, and STLG) show significant improvements over their original models in both overall estimation accuracy and spatial prediction ability for hourly PM2.5 concentrations. This further illustrates the importance of introducing spatiotemporal information when constructing the PM2.5-AOD relationships. More importantly, the training speed of these models did not decrease much, and the memory consumption did not increase much either. In general, the STLG model shows the best performance with high efficiency (i.e., training speed = 46 s, memory usage = 0.60 GB) among all the space-time tree-based machine learning models. Therefore, our new STLG model is highly valuable for accurate and fast air pollution monitoring, in particular for our future extension of this study to the global scale.
[Please insert Table 2 here]

Last, we also compared our results with previous related studies of Himawari-8-based hourly PM2.5 estimates at regional and national scales in China (Table 3). The comparison shows that the local hourly PM2.5 concentrations retrieved from our national-scale model are more accurate than those derived from models developed separately for local areas, e.g., the LME model (Wang et al., 2017) and the GWR, SVR, RF, and DNN models in the BTH region (Sun et al., 2019), and the two-stage RF and DNN models in the YRD region (Fan et al., 2020; Tang et al., 2019). In addition, our model outperforms most of the statistical regression and machine learning models covering the whole of China, e.g., the I-LME, IGTWR, RF, Adaboost, XGBoost, and their stacked models (Chen et al., 2019; Liu et al., 2019; Xue et al., 2020). The main reasons include the stronger data mining ability of our model, the consideration of the key spatial and temporal information of air pollution that was ignored in previous studies, and the introduction of more comprehensive factors (e.g., emission inventory) that affect PM2.5 pollution.
[Please insert Table 3