Using machine learning to derive cloud condensation nuclei number concentrations from commonly available measurements

Abstract. Cloud condensation nuclei (CCN) number concentrations are an important aspect of aerosol–cloud interactions and the subsequent climate effects; however, their measurements are very limited. We use a machine learning tool, random decision forests, to develop a Random Forest Regression Model (RFRM) to derive CCN at 0.4% supersaturation ([CCN0.4]) from commonly available measurements. The RFRM is trained on the long-term simulations in a global size-resolved particle microphysics model. Using atmospheric state and composition variables as predictors, through associations of their variabilities, the RFRM is able to learn the underlying dependence of [CCN0.4] on these predictors, which are: 8 fractions of PM2.5 (NH4, SO4, NO3, secondary organic aerosol (SOA), black carbon (BC), primary organic carbon (POC), dust, and salt), 7 gaseous species (NOx, NH3, O3, SO2, OH, isoprene, and monoterpene), and 4 meteorological variables (temperature (T), relative humidity (RH), precipitation, and solar radiation). The RFRM is highly robust: median mean fractional bias (MFB) of 4.4% with ∼ 96.33% of the derived [CCN0.4] within a good agreement range of −60% < MFB < +60% and strong correlation of Kendall’s τ coefficient ≈ 0.88. The RFRM demonstrates its robustness over 4 orders of magnitude of [CCN0.4] over varying spatial (such as continental to oceanic, clean to polluted, and near surface to upper troposphere) and temporal (from the hourly to the decadal) scales. At the Atmospheric Radiation Measurement Southern Great Plains observatory (ARM SGP) in Lamont, Oklahoma, United States, long-term measurements for PM2.5 speciation (NH4, SO4, NO3, and organic carbon (OC)), NOx, O3, SO2, T, and RH, as well as [CCN0.4] are available. We modify, optimise, and retrain the developed RFRM to make predictions from 19 → 9 of these available predictors.
This retrained RFRM (RFRM-ShortVars) shows a reduction in performance due to the unavailability and sparsity of measurements (predictors); nevertheless, it captures the [CCN0.4] variability and magnitude at SGP, with ∼ 67.02% of the derived values in the good agreement range. This work shows the potential of using the more commonly available measurements of PM2.5 speciation to alleviate the sparsity of CCN number concentration measurements.

Pearson product-moment correlation coefficient (r) and Kendall rank correlation coefficient (τ):

Quality System (AQS) made publicly available at https://www.epa.gov/air-data from monitors in the vicinity (< 100 km) of the SGP site.
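As a minimal sketch of the two correlation measures named above (r and τ), the following computes both for paired derived and reference values. The expressions used in this work are not reproduced in this section, so the tie-corrected "tau-b" form of Kendall's coefficient is an assumption here, as are the function names and the synthetic numbers.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b variant; the variant is an assumption)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from both factors
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs_untied_in_x = concordant + discordant + tied_y_only
    pairs_untied_in_y = concordant + discordant + tied_x_only
    return (concordant - discordant) / math.sqrt(pairs_untied_in_x * pairs_untied_in_y)


# Synthetic, purely illustrative values spanning several orders of magnitude.
reference = [10.0, 100.0, 1000.0, 10000.0]
derived = [12.0, 90.0, 2000.0, 8000.0]

r = pearson_r(derived, reference)
tau = kendall_tau_b(derived, reference)
```

Because τ depends only on the ordering of the pairs, it is not dominated by the largest concentrations when values span orders of magnitude, whereas r is. The pairwise loop is quadratic in the number of samples, so a library routine would be preferable for large data sets.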
Real-time aerosol mass loadings and their chemical composition have been measured from 2010 to the present using an Aerodyne Aerosol Chemical Speciation Monitor (ACSM; Ng et al., 2011), with technical details in Watson (2017).

Once the data have been selected to train the RFRM, we tune the parameters of the model. The default implementation of Wright and Ziegler (2017) comes with reasonable choices (balancing speed and accuracy) for the number of trees in the forest (numtrees), the number of variables to consider for each split (mtry), and the minimum node size (min.node.size), i.e. the minimum size of homogeneous data to prevent overfitting. These defaults are: numtrees = 500, mtry = rounded-down square root of the number of variables, and min.node.size = 5. We check whether these hyperparameter choices are optimal by performing a grid search: we train multiple Random Forest models and examine their performance not only on the training set but also on the test set. Figs. 4 & 5 show the results of this exercise.

Wright and Ziegler (2017) for this specific work. It must be kept in mind that a reasonable cutoff, beyond which there is imperceptible gain in performance at increased computational cost, should be considered. The results shown in Fig. 4 motivate a zoomed-in hyperparameter grid search to choose the optimal (accurate and fast) RFRM. Fig. 5 shows this for the best performing Random Forest models with numtrees ranging from 600 to 1400, mtry

the results in Fig. 6. Fig. 6 shows the out-of-bag increase in mean square error upon permutation of a specific predictor. We modify the approach in Wright and Ziegler (2017)
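The hyperparameter grid search and the cost cutoff described in this section can be sketched as follows. This is a sketch under stated assumptions and not the code used in this work: the grid values, the tolerance, and in particular `toy_fit_and_score` (a synthetic stand-in for training a forest and scoring it on the test set) are hypothetical. Only the default rule for mtry, the rounded-down square root of the number of predictors, is taken from the text.

```python
import math
from itertools import product


def default_mtry(n_predictors):
    """Default mtry: rounded-down square root of the number of predictors."""
    return math.isqrt(n_predictors)


def build_grid(numtrees_values, mtry_values, min_node_size_values):
    """Enumerate every combination of the three hyperparameters."""
    return [
        {"numtrees": t, "mtry": m, "min_node_size": s}
        for t, m, s in product(numtrees_values, mtry_values, min_node_size_values)
    ]


def grid_search(grid, fit_and_score):
    """Call fit_and_score(params) -> (train_error, test_error) for each grid point."""
    results = []
    for params in grid:
        train_error, test_error = fit_and_score(params)
        results.append(
            {"params": params, "train_error": train_error, "test_error": test_error}
        )
    return results


def pick_with_cutoff(results, tolerance):
    """Cheapest model (fewest trees) whose test error is within `tolerance` of the best."""
    best_error = min(r["test_error"] for r in results)
    candidates = [r for r in results if r["test_error"] <= best_error + tolerance]
    return min(candidates, key=lambda r: (r["params"]["numtrees"], r["test_error"]))


def toy_fit_and_score(params):
    """HYPOTHETICAL placeholder: a synthetic error surface, not a trained forest."""
    test_error = (
        1.0 / params["numtrees"]
        + 0.01 * abs(params["mtry"] - 4)
        + 0.001 * params["min_node_size"]
    )
    train_error = 0.5 * test_error
    return train_error, test_error


grid = build_grid([100, 500, 1000], [2, 4, 6], [3, 5])
results = grid_search(grid, toy_fit_and_score)
chosen = pick_with_cutoff(results, tolerance=0.002)
```

In practice `toy_fit_and_score` would be replaced by a function that trains a random forest with the given settings and returns its errors on the training and test sets. The tolerance encodes the point beyond which additional trees bring no perceptible gain for the added computational cost.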

Comparison with GCAPM [CCN0.4]
We examine the RFRM in further detail with the subset of data excluded for testing, i.e. for the sites that RF-AllVars has not been exposed to during its training. Fig. 8

Bins with low counts (< 1% of maximum count: ∼ 3% of the data) are shaded grey. The lines indicate MFB of 0 (black; perfect agreement), +1 (darker red), −1 (darker blue), +0.6 (lighter red), and −0.6 (lighter blue).

ShortVars. Dotted lines show the good agreement range of |MFB| < 0.6.
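The agreement measures referred to above (fractional bias and the share of values with |MFB| < 0.6) can be sketched as follows. The formula is an assumption: this section does not reproduce the expression used, so the commonly used per-sample fractional bias, 2(M − O)/(M + O), is adopted, with M the derived and O the reference value. Function names and numbers are illustrative only.

```python
from statistics import median


def fractional_bias(derived, reference):
    """Per-sample fractional bias 2(M - O)/(M + O); bounded between -2 and +2."""
    return [2.0 * (m - o) / (m + o) for m, o in zip(derived, reference)]


def fraction_within(biases, limit=0.6):
    """Share of samples whose fractional bias lies strictly inside (-limit, +limit)."""
    inside = sum(1 for b in biases if abs(b) < limit)
    return inside / len(biases)


# Synthetic, purely illustrative concentrations (cm^-3).
reference = [100.0, 100.0, 100.0, 1000.0]
derived = [110.0, 150.0, 40.0, 1000.0]

biases = fractional_bias(derived, reference)
median_bias = median(biases)
good_fraction = fraction_within(biases, limit=0.6)
```

Under this assumed definition, a fractional bias of +1 corresponds to a derived value three times the reference, and +0.6 to a ratio of about 1.86. The measure is symmetric, so −1 and −0.6 correspond to the inverse ratios.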

Comparisons for the SGP site
We also examine the temporal trends of [CCN0.4] for the SGP site (the surface 2° × 2.5° model gridbox), which was completely excluded from the RFRM training. Fig. 10

Table 2) would have aided in this analysis. However, due to the sparsity and absence of measurements of certain predictors, we are limited to the available factors listed in Table 4. These are shown in Fig. 11, with the speciated PM2.5 predictors in Fig. 11(a), the trace gas measurements in Fig. 11(b), and the meteorological variables in Fig. 11(c).

Thus, with a reduction from 19 → 9 predictor variables, we retrain the RFRM to use only these as inputs to derive [CCN0.4].
The RFRM optimisation is carried out as described in Sec. 2.2; the resulting RFRM parameters are numtrees = 1000, mtry = 3, and min.node.size = 3, and the model uses the 9 predictors listed in Table 4.

Further discussion and outlook
There are a number of limitations in the application of the present study to augment empirical measurements of [CCN]. These are as follows:
-The RFRM is trained on GCAPM, with the assumption that the physical and chemical processes that relate the 'pre- and detection limits. Further sources of error are the utilisation of 'predictor' measurements from nearby (but not colocated) monitors to fill in significant gaps in the required data for the SGP site.
-Random Forest was the machine learning tool of choice due to its parallelisability and high degree of accuracy. There are, however, other tools such as XGBoost (a choice for many winners of machine learning competitions) or regression study, however, showed no significant improvement at the cost of much higher computational expense.
-To overcome some of the gaps in the [CCN0.4] measurement data from the SGP site, we proposed and implemented a derivation of [CCN0.4] from [CCN] measurements at other supersaturations using the Random Forest technique. While this derivation was confirmed to be exceptionally good, it remains an approximation.
-Development of an observation-based RFRM is presented in this study. However, it can be significantly improved with more observations for the SGP site, and made generalisable if trained with observations from numerous other sites. This was presently not possible.
Despite the caveats associated with this work, this proof-of-concept shows promise for wide-ranging development and deployment. This machine learning approach can provide improved representation of cloud condensation nuclei numbers:
-in locations where their direct measurements are limited, but measurements of other atmospheric state and composition variables are available
-in earth system models, by providing a more accurate alternative to bulk microphysical parameterisations
-in earth system models, by providing a computationally less intensive alternative to explicit bin-resolving microphysics models
This work is an initial step towards fast and accurate derivation of [CCN0.4], in the absence of their measurements, constrained by empirical data for other measurements of atmospheric state and composition. This work demonstrates the possible applications of machine learning tools in handling the complex, non-linear, ordinal, and large amounts of data in the atmospheric sciences.