Predicting Gas-Particle Partitioning Coefficients of Atmospheric Molecules with Machine Learning

The formation, properties and lifetime of secondary organic aerosols in the atmosphere are largely determined by gas-particle partitioning coefficients of the participating organic vapours. Since these coefficients are often difficult to measure or compute, we developed a machine learning (ML) model to predict them given molecular structure as input. Our data-driven approach is based on the dataset by Wang et al. (Atmos. Chem. Phys., 17, 7529 (2017)), who computed the partitioning coefficients and saturation vapour pressures of 3414 atmospheric oxidation products from the master chemical mechanism using the COSMOtherm program. We train a kernel ridge regression (KRR) ML model on the saturation vapour pressure ($P_{sat}$), and on two equilibrium partitioning coefficients: between a water-insoluble organic matter phase and the gas phase ($K_{WIOM/G}$), and between an infinitely dilute solution with pure water and the gas phase ($K_{W/G}$). For the input representation of the atomic structure of each organic molecule to the machine, we test different descriptors. Our best ML model predicts $P_{sat}$ and $K_{WIOM/G}$ to within 0.3 and $K_{W/G}$ to within 0.4 logarithmic units of the original COSMOtherm calculations. This is equal to or better than the typical accuracy of COSMOtherm predictions compared to experimental data. We then apply our ML model to a dataset of 35,383 molecules that we generated based on a ten-carbon (C10) backbone functionalized with 0 to 6 carboxyl, carbonyl or hydroxyl groups to evaluate its performance for polyfunctional compounds with potentially low $P_{sat}$. The resulting $P_{sat}$ and partitioning coefficient distributions were physico-chemically reasonable, and the volatility predictions for the most highly oxidized compounds were in qualitative agreement with experimentally inferred volatilities of atmospheric oxidation products with similar elemental composition.

significant computational effort, especially due to the conformational complexity of these species, for example in terms of different potential hydrogen bonding patterns. Overall, there could be up to $10^4$–$10^7$ different organic compounds in the atmosphere (not even counting most oxidation intermediates), which makes the computation of saturation vapour pressures and partitioning coefficients a daunting task (Shrivastava et al., 2019; Ye et al., 2016).
Here, we take a different approach compared to previous parametrization studies, and consider a data-science perspective. Instead of assuming chemical or physical relations, we let the data speak for itself. We develop and train a machine learning model to extract patterns from available data and predict saturation vapour pressures as well as partitioning coefficients.
Machine learning has only recently spread into atmospheric science (Cervone et al., 2008; Toms et al., 2018; Barnes et al., 2019; Nourani et al., 2019; Huntingford et al., 2019; Masuda et al., 2019). Prominent applications include the identification of forced climate patterns (Barnes et al., 2019), precipitation prediction (Nourani et al., 2019), climate analysis (Huntingford et al., 2019), pattern discovery (Toms et al., 2018), risk assessment of atmospheric emissions (Cervone et al., 2008), and the estimation of cloud optical thicknesses (Masuda et al., 2019). In molecular and materials science, machine learning is more established and now frequently complements theoretical or experimental methods (Müller et al., 2016; Ma et al., 2015; Shandiz and Gauvin, 2016; Gómez-Bombarelli et al., 2016; Bartók et al., 2017; Rupp et al., 2018; Goldsmith et al., 2018; Meyer et al., 2018; Zunger, 2018; Gu et al., 2019; Schmidt et al., 2019; Jensen et al.; Coley et al.). Here we build on our experience in atomistic, molecular machine learning (Todorović et al., 2019; Stuke et al., 2019; Himanen et al., 2020) to train a regression model that maps molecular structures onto saturation vapour pressures and partitioning coefficients. Once trained, the machine learning model can make saturation vapour pressure and partitioning predictions at COSMOtherm accuracy for hundreds of thousands of new molecules at no further computational cost. When experimental training data becomes available, the machine learning model can easily be extended to make predictions for experimental pressures and coefficients.
Due to the above-mentioned lack of comprehensive experimental databases for saturation vapour pressures or gas-liquid partitioning coefficients of polyfunctional atmospherically relevant molecules, our machine-learning model is based on the computational data by Wang et al. (2017). They computed the partitioning coefficients and saturation vapour pressures for 3414 atmospheric secondary oxidation products, obtained from the Master Chemical Mechanism (Jenkin et al., 1997; Saunders et al., 2003), using a combination of quantum chemistry and statistical thermodynamics as implemented in the COSMOtherm approach.
We transform the molecular structures in Wang's dataset into atomistic descriptors more suitable for machine learning than the atomic coordinates or the commonly used simplified molecular-input line-entry system (SMILES) strings. Optimal descriptor choices have been the subject of increasing research in recent years (Langer et al., 2020; Rossi and Cumby, 2020; Himanen et al., 2020). We here test several descriptor choices: the many-body tensor representation (Huo and Rupp, 2017), the Coulomb matrix (Rupp et al., 2012), the Molecular ACCess System (MACCS) structural key (Durant et al., 2002), a topological fingerprint developed by RDKit (Landrum et al., 2006) based on the Daylight fingerprint (James et al., 1995), and the Morgan fingerprint (Morgan, 1965).
In this work, we address the following objectives. 1) With a view to future machine learning applications in atmospheric science, we assess the predictive capability of different structural descriptors for machine learning the chosen target properties.
2) We quantify the predictive power of our KRR machine learning model for Wang's dataset to ascertain if the dataset size is sufficient for accurate machine learning predictions. 3) We then apply our validated machine learning model to a new molecular dataset to gain chemical insight into SOA condensation processes.
The paper is organized as follows. We describe our machine learning methodology in section 2, then present the machine learning results in section 3. Section 4 demonstrates how we employed the trained model for fast prediction of molecular properties. We discuss our findings and present a summary in section 5.

Our machine learning approach has five components as illustrated in Fig. 1. We start off with the raw data, which we present and analyse in section 2.1. The raw data is then transformed into a suitable representation for machine learning (step 2). We introduce five different representations in section 2.2, which we test in our machine learning model (cf. section 3). Next we choose our machine learning method. Here we use kernel ridge regression (KRR), which is introduced in section 2.3. We analyse the learning success of our machine learning approach in step 4. The results of this process are shown in section 3. In this step we also make adjustments to the representation and the parameters of the model to improve the learning. Finally, we use the best machine learning model to make predictions as shown in section 4.

Dataset
In this work we are interested in the equilibrium partitioning coefficients of a molecule between a water-insoluble organic matter (WIOM) phase and the gas phase ($K_{WIOM/G}$), as well as between the gas phase and an infinitely dilute water solution ($K_{W/G}$). These coefficients are defined as

$$ K_{WIOM/G} = \frac{C_{WIOM}}{C_G}, \qquad K_{W/G} = \frac{C_W}{C_G}, $$

where $C_{WIOM}$, $C_W$, and $C_G$ are the equilibrium concentrations of the molecule in the WIOM, water, and gas phase, respectively, at the limit of infinite dilution. In the framework of COSMOtherm calculations, gas-liquid partitioning coefficients can be converted into saturation vapour pressures, or vice versa, using the activity coefficients $\gamma_W$ or $\gamma_{WIOM}$ in the corresponding liquid (which can also be computed by COSMOtherm). Specifically, if for example $K_{W/G}$ is expressed in units of m$^3$ g$^{-1}$, then $P_{sat} = RT/(M \gamma_W K_W)$, where $R$ is the gas constant, $T$ the temperature, $M$ the molar mass of the compound, and $K_W$ and $\gamma_W$ are the partitioning and activity coefficients in water (Arp and Goss, 2009). We caution, however, that many different conventions exist, e.g. for the dimensions of the partitioning coefficients as well as the reference states for activity coefficients; the relation given above applies only for the particular conventions used by COSMOtherm. Wang et al. (2017) used the conductor-like screening model for real solvents (COSMO-RS) theory implemented in COSMOtherm to calculate the two partitioning coefficients $K_{WIOM/G}$ and $K_{W/G}$ for 3414 molecules. These molecules were generated from 143 parent volatile organic compounds with the Master Chemical Mechanism (MCM) (Jenkin et al., 1997; Saunders et al., 2003) through photolysis and reactions with ozone, hydroxyl and nitrate radicals. The 3414 non-radical species obtained from the MCM range in size from 4 to 48 atoms, which translates into 1 to 24 non-hydrogen atoms per molecule. The distribution peaks at 10 non-hydrogen atoms and is skewed towards larger molecules.
Panel b) illustrates how many molecules contain at least one atom of the indicated element. All molecules contain carbon (100% C), 3410 contain hydrogen (H; 99.88%) and 3333 also oxygen (O; 97.63%). Nitrogen (N) is the next most abundant element (30.17%), followed by chlorine (Cl; 3.05%), sulphur (S; 0.44%) and bromine (Br; 0.32%). Lastly, panel c) presents the distribution of functional groups. It peaks at 2 to 3 functional groups per molecule, with relatively few molecules having 0, 5 or 6 functional groups.
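The unit convention quoted above can be made concrete with a short numerical sketch. The function below converts a water/gas partitioning coefficient (in m$^3$ g$^{-1}$) and a water activity coefficient into a saturation vapour pressure via $P_{sat} = RT/(M \gamma_W K_W)$; the numerical input values are illustrative assumptions, not data from the Wang et al. dataset.

```python
R = 8.314  # gas constant, J mol^-1 K^-1 (equivalently Pa m^3 mol^-1 K^-1)

def p_sat_from_kw(k_w, gamma_w, molar_mass, temperature=298.15):
    """Saturation vapour pressure in Pa from K_W/G in m^3/g.

    molar_mass is in g/mol. Valid only for the COSMOtherm unit
    conventions quoted in the text (Arp and Goss, 2009).
    """
    return R * temperature / (molar_mass * gamma_w * k_w)

# Illustrative call for a hypothetical compound (M = 150 g/mol):
p = p_sat_from_kw(k_w=1e-3, gamma_w=10.0, molar_mass=150.0)
```

Note that the result is sensitive to the chosen reference states; with other conventions for $K$ or $\gamma$ the prefactors change.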

Representations
The molecular representation for machine learning should fulfil certain requirements. It should be invariant with respect to translation and rotation of the molecule and permutations of atomic indices. Furthermore, it should be continuous, unique, compact and efficient to compute (Faber et al., 2015;Huo and Rupp, 2017;Langer et al., 2020;Himanen et al., 2020).
In this work we employ two classes of representations for the molecular structure, also known as descriptors: physical and cheminformatics descriptors. Physical descriptors encode physical distances and angles between atoms in the material or molecule. Such descriptors generally exhibit good performance for many different system types. Meanwhile, decades of research in cheminformatics have produced topological descriptors that encode the qualitative aspects of molecules in a compact representation. These descriptors are typically bitvectors, in which molecular features are encoded (hashed) into binary fingerprints, which are joined into long binary vectors. In this work, we use two physical descriptors, the Coulomb Matrix and the many-body tensor, and three cheminformatics descriptors: the MACCS structural key, the topological fingerprint and the Morgan fingerprint.
In Wang's dataset the molecular structure is encoded in SMILES (Simplified Molecular Input Line Entry Specification) strings. We convert these SMILES strings into structural descriptors using Open Babel (O'Boyle et al., 2011) and the DScribe library (Himanen et al., 2020), or into cheminformatics descriptors using RDKit (Landrum et al., 2006).

Coulomb Matrix
The Coulomb matrix (CM) descriptor is inspired by an electrostatic representation of a molecule (Rupp et al., 2012). It encodes the Cartesian coordinates of a molecule in a simple matrix of the form

$$ M_{ij} = \begin{cases} 0.5\, Z_i^{2.4} & \text{for } i = j \\ \dfrac{Z_i Z_j}{|R_i - R_j|} & \text{for } i \neq j, \end{cases} $$

where $R_i$ is the coordinate of atom $i$ with atomic charge $Z_i$. The diagonal provides element-specific information; the coefficient and the exponent have been fitted to the total energies of isolated atoms (Rupp et al., 2012). Off-diagonal elements encode inverse distances between the atoms of the molecule by means of a Coulomb-repulsion-like term.
The dimension of the Coulomb matrix is chosen to fit the largest molecule in the data set, i.e. it corresponds to the number of atoms of the largest molecule. The "empty" rows of Coulomb matrices for smaller molecules are padded with zeroes.
Invariance with respect to the permutation of atoms in the molecule is enforced by simultaneously sorting rows and columns of each Coulomb matrix in descending order according to their $\ell^2$-norms. An example of a Coulomb matrix for 2-hydroxy-2-methylpropanoic acid is shown in Fig. 4b.
The CM is easily understandable, simple and relatively small as a descriptor. However, it performs best with Laplacian kernels in the machine-learning model (see Section 2.3), while other descriptors work better with the more standard choice of a Gaussian kernel.
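The construction above (diagonal fit, inverse-distance off-diagonals, zero padding, norm sorting) can be sketched in a few lines of numpy. The two-atom "molecule" and its coordinates below are toy values chosen only to exercise the code.

```python
import numpy as np

def coulomb_matrix(Z, R, n_max):
    """Sorted, zero-padded Coulomb matrix for atomic numbers Z and
    coordinates R (angstrom), following Rupp et al. (2012)."""
    n = len(Z)
    M = np.zeros((n_max, n_max))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4          # fitted diagonal term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(
                    np.array(R[i]) - np.array(R[j]))  # Coulomb-like term
    # Permutation invariance: sort rows/columns by descending row norm;
    # padded zero rows automatically end up last.
    order = np.argsort(-np.linalg.norm(M, axis=1))
    return M[order][:, order]

# Toy 2-atom example (O-H pair), padded to n_max = 3.
cm = coulomb_matrix(Z=[8, 1], R=[(0.0, 0.0, 0.0), (0.97, 0.0, 0.0)], n_max=3)
```

In practice $n_{max}$ is set by the largest molecule in the dataset, as described above.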

Many-body tensor representation
The many-body tensor representation (MBTR) follows the Coulomb matrix philosophy of encoding the internal coordinates of a molecule. We will here describe the MBTR only qualitatively. Detailed equations can be found in the original publication (Huo and Rupp, 2017), our previous work (Himanen et al., 2020;Stuke et al., 2020) or Appendix A.
Unlike the Coulomb matrix, the many-body tensor is continuous and it distinguishes between different types of internal coordinates. At many-body level 1, the MBTR records the presence of all atomic species in a molecule by placing a Gaussian at the atomic number on an axis from 1 to the number of elements in the periodic table. The weight of the Gaussian is equal to the number of times the species is present in the molecule. At many-body level 2, inverse distances between every pair of atoms (bonded and non-bonded) are recorded in the same fashion. Many-body level 3 adds angular information between any triple of atoms. Higher levels (e.g. dihedral angles) would in principle be straightforward to add, but are not implemented in the current MBTR versions (Huo and Rupp, 2017;Himanen et al., 2020). Figure 4c shows selected MBTR elements for 2-hydroxy-2-methylpropanoic acid.
The MBTR is a continuous descriptor, which is advantageous for machine learning. However, MBTR is by far the largest descriptor out of the five we tested, and this can pose restrictions on memory and computational cost. Furthermore, the MBTR is more difficult to interpret than the CM.
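Many-body level 1 can be illustrated with a minimal sketch: each element present in the molecule contributes a Gaussian centred at its atomic number, weighted by its multiplicity. The grid range and broadening below are arbitrary assumptions; the real MBTR (e.g. in DScribe) also includes levels 2 and 3 and configurable weighting.

```python
import numpy as np

def mbtr_k1(atomic_numbers, sigma=0.5, x=np.linspace(0, 20, 401)):
    """MBTR level-1 sketch: Gaussians on the atomic-number axis,
    weighted by how often each species occurs."""
    out = np.zeros_like(x)
    for z in set(atomic_numbers):
        weight = atomic_numbers.count(z)
        out += weight * np.exp(-((x - z) ** 2) / (2 * sigma ** 2))
    return out

# Toy molecule: CH4 -> carbon (Z=6) once, hydrogen (Z=1) four times.
spectrum = mbtr_k1([6, 1, 1, 1, 1])
```

The resulting vector has a peak of height 4 at $Z = 1$ and of height 1 at $Z = 6$, which is the "weighted Gaussian" encoding described above.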

MACCS Structural Key
The Molecular ACCess System (MACCS) structural key is a dictionary-based descriptor (Durant et al., 2002). It is represented as a bitvector of Boolean values that encode answers to a set of predefined questions. The MACCS structural key we used is a 166 bit long set of answers to 166 questions such as "Is there a S-S bond" or "Does it contain Iodine?" (Landrum et al., 2006;James et al., 1995).
MACCS is the smallest out of the five descriptors and extremely fast to use. Its accuracy critically depends on how well the 166 questions encapsulate the chemical detail of the molecules. It is likely to reach moderate accuracy with low computational cost and memory usage, and could be beneficial for fast testing of a machine learning model.
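The dictionary-based idea can be demonstrated with a toy key. For brevity the "questions" below are naive substring tests on a SMILES string, not the SMARTS patterns used by RDKit's actual MACCS implementation, so this is only a conceptual sketch.

```python
# Toy dictionary-based key: a fixed list of yes/no questions answered
# from the molecular structure, producing a bitvector.
QUESTIONS = [
    ("contains sulphur",  lambda s: "S" in s),
    ("contains iodine",   lambda s: "I" in s),
    ("contains carbonyl", lambda s: "C(=O)" in s or "C=O" in s),
    ("contains nitrogen", lambda s: "N" in s),
]

def toy_key(smiles):
    """Answer each predefined question with 0/1, like a MACCS-style key."""
    return [int(q(smiles)) for _, q in QUESTIONS]

# 2-hydroxy-2-methylpropanoic acid:
bits = toy_key("CC(C)(O)C(=O)O")   # -> [0, 0, 1, 0]
```

The real 166-bit key works the same way, only with chemically rigorous substructure queries.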

Topological Fingerprint
The topological fingerprint (TopFP) is RDKit's original fingerprint (Landrum et al., 2006), inspired by the Daylight fingerprint (James et al., 1995). TopFP hashes the atomic structure. This means that the structure is divided into smaller substructures and each substructure is converted into a unique binary ID called a hash. The hashes are concatenated into a long bitvector representing the entire molecule. In TopFP, the molecule is hashed along topological paths, or along bonds, as illustrated in Figure 4d. The path starts from one atom in a molecule and travels along bonds until $k$ bonds have been traversed, and that completes a hash. The length of the bitvector, the maximum and minimum possible path lengths $k_{max}$ and $k_{min}$, and the length of one hash can be optimized.
Topology is an informative molecular feature. We therefore expect TopFP to balance good accuracy with reasonable computational cost. However, this binary fingerprint is difficult to visualize and analyse for chemical insight.
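The path-hashing idea can be sketched as follows: enumerate bond paths up to $k_{max}$ bonds in a small molecular graph, hash each path's atom sequence, and set the corresponding bit of a fixed-length bitvector. The real RDKit TopFP differs in detail (bond invariants, several bits per hash), so this is an illustrative toy only.

```python
import hashlib

def paths_up_to(adjacency, k_max):
    """All simple bond paths of 1..k_max bonds, direction-independent."""
    paths = set()
    def walk(path):
        if 1 <= len(path) - 1 <= k_max:
            paths.add(min(tuple(path), tuple(reversed(path))))
        if len(path) - 1 == k_max:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])
    for start in adjacency:
        walk([start])
    return paths

def toy_topfp(adjacency, labels, k_max=3, n_bits=64):
    """Hash each path's atom-label sequence into a bitvector."""
    bits = [0] * n_bits
    for path in paths_up_to(adjacency, k_max):
        key = "-".join(labels[i] for i in path)
        bits[int(hashlib.md5(key.encode()).hexdigest(), 16) % n_bits] = 1
    return bits

# Ethanol as a heavy-atom graph: C0-C1-O2
fp = toy_topfp({0: [1], 1: [0, 2], 2: [1]}, labels={0: "C", 1: "C", 2: "O"})
```

Even this toy shows why TopFP is hard to interpret: a set bit only tells you that *some* path hashed there, not which one.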

Morgan Fingerprint
The Morgan fingerprint is also a bitvector constructed by hashing the molecular structure. In contrast to the topological fingerprint, the Morgan fingerprint is hashed along circular or spherical paths around a central atom, as illustrated in Figure 4e. Each substructure for a hash is constructed by first numbering the atoms in a molecule with unique integers by applying the Morgan algorithm. Each uniquely numbered atom then becomes a cluster center, around which we iteratively increase a spherical radius to include the neighbouring bonded atoms (Rogers and Hahn, 2010). Each radius increment extends the neighbour list by another molecular bond. The "circular" substructures found by the algorithm described above, excluding duplicates, are then hashed into a fingerprint (James et al., 1995; Landrum et al., 2006). The length of the fingerprint and the maximum radius can be optimized.
The Morgan fingerprint is quite similar to the TopFP in size and type of information encoded, so we expect similar performance. It does not lend itself to easy chemical interpretation.
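The circular-substructure idea can also be sketched compactly: for each atom, collect the atom labels within a bond radius $r$ and hash that neighbourhood. The real Morgan/ECFP algorithm iteratively refines atom identifiers rather than hashing raw label multisets, so this is again only a conceptual toy.

```python
import hashlib

def neighbourhood(adjacency, center, radius):
    """Atoms within `radius` bonds of `center` (breadth-first expansion)."""
    seen = {center}
    frontier = {center}
    for _ in range(radius):
        frontier = {n for a in frontier for n in adjacency[a]} - seen
        seen |= frontier
    return seen

def toy_morgan_fp(adjacency, labels, max_radius=2, n_bits=64):
    """Hash each (center, radius) neighbourhood into a bitvector."""
    bits = [0] * n_bits
    for center in adjacency:
        for r in range(max_radius + 1):
            shell = sorted(labels[i] for i in neighbourhood(adjacency, center, r))
            key = labels[center] + ":" + ",".join(shell)
            bits[int(hashlib.md5(key.encode()).hexdigest(), 16) % n_bits] = 1
    return bits

# Ethanol heavy-atom graph again: C0-C1-O2
fp = toy_morgan_fp({0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "C", 2: "O"})
```

As with TopFP, each radius increment grows the neighbourhood by one bond, matching the description above.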

Kernel Ridge Regression
In this work, we apply the kernel ridge regression (KRR) method. KRR is based on ridge regression, in which a penalty for overfitting is added to an ordinary least squares fit (Friedman et al., 2001). In KRR, unlike ridge regression, a nonlinear kernel is applied, which maps the molecular structure to our target properties in a high-dimensional space (Rupp, 2015).
The target values $f$ are a linear expansion in kernel elements,

$$ f(x) = \sum_i \alpha_i\, k(x, x_i), $$

where the sum runs over all training molecules. In this work, we use two different kernels, the Gaussian kernel

$$ k(x, x') = \exp\!\left(-\gamma \, \| x - x' \|_2^2\right) $$

and the Laplacian kernel

$$ k(x, x') = \exp\!\left(-\gamma \, \| x - x' \|_1\right). $$

The kernel width $\gamma$ is a hyperparameter of the KRR model.
The regression coefficients $\alpha_i$ can be solved by minimizing the error

$$ \sum_i \left( f(x_i) - y_i \right)^2 + \lambda\, \alpha^{T} K \alpha, $$

where $y_i$ are reference target values for molecules in the training data. The second term is the regularization term, whose size is controlled by the hyperparameter $\lambda$. $K$ is the kernel matrix of training inputs $k(x_i, x_j)$.
This minimization problem can be solved analytically for the expansion coefficients,

$$ \alpha = (K + \lambda I)^{-1} y. $$

The hyperparameters $\gamma$ and $\lambda$ need to be optimised separately.
We implemented KRR in Python using scikit-learn (Pedregosa et al., 2011). Our implementation has been described in previous work.
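The KRR equations above can be condensed into a minimal numpy sketch: build a Gaussian kernel matrix, solve the closed-form system for $\alpha$, and predict for new inputs. The data here is synthetic; in the actual model, molecular descriptors take the place of $x$ and COSMOtherm values the place of $y$.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    # k(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 3))
y_train = np.sin(X_train).sum(axis=1)      # smooth synthetic target

gamma, lam = 1.0, 1e-3
K = gaussian_kernel(X_train, X_train, gamma)
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)  # alpha = (K + lam I)^-1 y

X_test = rng.uniform(-1, 1, size=(20, 3))
y_pred = gaussian_kernel(X_test, X_train, gamma) @ alpha    # f(x) = sum_i alpha_i k(x, x_i)
mae = np.mean(np.abs(y_pred - np.sin(X_test).sum(axis=1)))
```

For the production model we use scikit-learn's `KernelRidge` rather than this hand-rolled solver.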

Computational Execution
Data used for supervised machine learning is typically divided into two sets, a large training set and a small test set. We train the model on training sets of increasing size, up to 2000, 2500 and 3000 molecules, so that a smaller training set is always a subset of the larger one. Training the model on a sequence of such training sets allows us to compute a learning curve, which facilitates the assessment of learning success with increasing training data size. We quantify the accuracy of our KRR model by computing the mean absolute error (MAE) for the test set.
To get statistically meaningful results, we repeat the training procedure 10 times. In each run, we shuffle the dataset before selecting the training and test sets so that the KRR model is trained and tested on different data each time. Each point on the learning curves is computed as the average over 10 results, and the spread serves as the standard deviation of the datapoint.
Model training proceeds by computing the KRR regression coefficients $\alpha_i$, obtained by minimizing equation 7. The KRR hyperparameters $\gamma$ and $\lambda$ are typically optimized via grid search, and average optimal solutions are obtained by cross-validating the procedure. In cross-validation we split off a validation set from the training data before training the KRR model. KRR is then trained for all possible combinations of discretised hyperparameters (grid search) and evaluated on the validation set. This is done several times, so that the molecules in the validation set change each time, and the hyperparameter combination with the minimum average cross-validation error is chosen. Our implementation of cross-validated grid search is also based on scikit-learn (Pedregosa et al., 2011). In the grid search, we varied both $\gamma$ and $\lambda$ over ten values between $10^{-1}$ and $10^{10}$. In addition, we used two different kernels, Laplacian and Gaussian. We compared the performance of the two kernels, averaged over 5 runs for each training size, and the better-performing kernel was chosen. In cases in which both kernels performed equally well, e.g. for the fingerprints, we chose the Gaussian kernel for its lower computational cost.
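A cross-validated grid search of this kind can be set up with scikit-learn, which our implementation is based on. The synthetic data and the (much coarser) grid below are placeholders for the real descriptors and the ten-point $10^{-1}$–$10^{10}$ grids described above.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 4))
y = np.cos(X).sum(axis=1)                 # smooth synthetic target

grid = {
    "alpha": np.logspace(-8, 0, 5),       # regularization (lambda in the text)
    "gamma": np.logspace(-3, 1, 5),       # kernel width
}
search = GridSearchCV(
    KernelRidge(kernel="rbf"),            # Gaussian kernel; "laplacian" also works
    grid,
    cv=5,                                 # 5 validation splits, as in cross-validation
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
best = search.best_params_
```

`GridSearchCV` performs exactly the procedure in the text: it refits the model for every hyperparameter combination on each training fold, scores on the held-out fold, and keeps the combination with the best average validation score.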
MBTR hyperparameters and TopFP hyperparameters were optimized by grid search for several training set sizes (MBTR for sizes 500, 1500 and 3000 and TopFP for size 1000 and 1500 ) and the average of two runs for each training size was taken.
We did not extend the descriptor hyperparameter search to larger training set sizes, since we found that the hyperparameters were insensitive to the training set size. The MBTR weighting parameters were optimized in 8 steps between 0 (no weighting) and 1.4, and the broadening parameters in 6 steps between $10^{-1}$ and $10^{-6}$. The length of TopFP was varied between 1024 and 8192 (sizes vary in powers of two, $2^n$), the range for the maximum path length extended from 5 to 11, and the bits per hash were varied between 3 and 16.
3 Results

Figure 5. The learning curves for equilibrium partitioning coefficients $K_{WIOM/G}$, $K_{W/G}$ and saturation vapour pressure $P_{sat}$ for predictions made with all five descriptors.
In Figure 5 we present the learning curves for our target properties $K_{WIOM/G}$, $K_{W/G}$ and $P_{sat}$. Shown is the mean absolute error (MAE) as a function of the training set size for all three target properties and with all five molecular descriptors. As expected, the MAE decreases as the training size increases. For all target properties, the lowest errors are achieved with MBTR and the worst performing descriptor is CM. TopFP approaches the accuracy of MBTR as the training size increases and appears likely to outperform MBTR beyond the largest training size of 3000 molecules. The lowest MAEs, obtained with the MBTR descriptor, are listed in Table 2. The highest accuracy is obtained for the partitioning coefficient $K_{WIOM/G}$, with an MAE of 0.278, i.e. only 1.9% of the entire $K_{WIOM/G}$ range. The second best accuracy is obtained for the saturation vapour pressure $P_{sat}$, with an MAE of 0.298 (or 2.0% of the range of pressure values). The lowest accuracy is obtained for $K_{W/G}$, with an MAE of 0.428. However, the range for $K_{W/G}$ is also the largest, as seen in Figure 5, so this amounts to only 2.7% of the entire range of values. Our best machine learning MAEs are of the order of the COSMOtherm prediction accuracy, which lies at around a few tenths of log units (Stenzel et al., 2014; Schröder et al., 2016; van der Spoel et al., 2019). The predictions of $K_{WIOM/G}$ are the most accurate. This is expected, because the MAE in Table 2 is lowest for this property. The largest scatter is observed for $K_{W/G}$, which had the highest MAE in Table 2.

Predictions
In the previous section we showed that our KRR model trained on the Wang et al. dataset produces low prediction errors for molecular partitioning coefficients and can now be employed as a fast predictor. When shown new molecular structures, it can make instant predictions for the molecular properties of interest. We demonstrate this application potential on an example dataset generated to imitate organic molecules typically found in the atmosphere.

Figure 6. The scatter plots of predictions for the partitioning coefficients of a molecule between a water-insoluble organic matter and gas phase $K_{WIOM/G}$, water and gas phase $K_{W/G}$, and the saturation vapour pressure $P_{sat}$ for the test set of 414 molecules using MBTR (top) and TopFP (bottom). The prediction with the lowest mean absolute error was chosen for each scatter plot.
Atmospheric oxidation reaction mechanisms can be generally classified into two main types: fragmentation and functionalization. For SOA formation, functionalization is more relevant, as it leads to products with intact carbon backbones and added polar (and volatility-lowering) functional groups. Many of the most interesting molecules from a SOA-forming point of view, e.g. monoterpenes, have around 10 carbon atoms. These compounds simultaneously have high enough emissions or concentrations to produce appreciable amounts of condensable products, while being large enough for those products to have low volatility.
We thus generate a dataset of molecules with a backbone of ten carbon (C10) atoms. For simplicity, we use a linear alkane chain. In analogy with Wang's dataset, we then decorate this backbone with 0 to 6 functional groups at different locations.
We limit ourselves to the typical groups formed in "functionalizing" oxidation of VOC by both of the main day-time oxidants OH and O$_3$: carboxyl (-COOH), carbonyl (=O) and hydroxyl (-OH) (Seinfeld and Pandis, 2016). The (-COOH) group can only be added to the ends of the C10 molecule, while (=O) and (-OH) can be added to any carbon atom in the chain. We then generate all possible combinations and filter out duplicates resulting from symmetric combinations of functional groups. In total we obtain 35,383 unique molecules. Example molecules are depicted in Figure 9.
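The combinatorial generation can be sketched as follows: decorate a linear carbon chain with carboxyl groups (ends only), carbonyl and hydroxyl groups (any position), keep at most six groups, and merge mirror-image duplicates. This toy omits the valence bookkeeping needed for real molecule generation (e.g. a carbon cannot simultaneously carry =O and -OH without becoming a carboxyl), so it does not reproduce the 35,383-molecule count; it only illustrates the enumerate-filter-deduplicate logic.

```python
from itertools import product

def generate(chain_length=10, max_groups=6):
    """Enumerate decorated chains, filter by group count, and
    deduplicate mirror images via a canonical (lexicographic-min) form."""
    end_opts = ["H", "OH", "=O", "COOH"]   # -COOH allowed only at chain ends
    mid_opts = ["H", "OH", "=O"]
    opts = [end_opts] + [mid_opts] * (chain_length - 2) + [end_opts]
    seen = set()
    for combo in product(*opts):
        if sum(g != "H" for g in combo) > max_groups:
            continue                        # more than 6 functional groups
        canonical = min(combo, combo[::-1]) # merge mirror-image duplicates
        seen.add(canonical)
    return seen

molecules = generate(chain_length=4, max_groups=2)   # small toy run
```

For the actual dataset, each unique decoration is additionally translated into a SMILES string for the descriptor pipeline.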
For each of the 35,383 molecules we generated a SMILES string that serves as input for the TopFP fingerprint. We did not relax the geometry of the molecules with force fields or density-functional theory. We then predicted $P_{sat}$, $K_{WIOM/G}$ and $K_{W/G}$ with the TopFP-KRR model. We chose TopFP as descriptor because its accuracy is close to that of the best performing MBTR KRR model, while being significantly cheaper to evaluate. Figure 8 shows that the averages of all three quantities agree well for different numbers of functional groups, which illustrates that both datasets are similar. A certain degree of similarity is required to ensure predictive power, since machine learning models do not extrapolate well to data that lies outside the training range. Figure 7 illustrates that the saturation vapour pressure $P_{sat}$ decreases with increasing number of functional groups, as expected, whereas $K_{WIOM/G}$ and $K_{W/G}$ increase. This is consistent with Wang's dataset, as shown in Fig. 8, where we compare the averages for different numbers of functional groups between our C10 dataset (in blue) and Wang's dataset (in red). The magnitude of the decrease (increase) amounts to approximately 1 or 2 orders of magnitude per functional group and is consistent with existing structure-activity relationships based on experimental data (e.g. Pankow and Asher (2008); Compernolle et al. (2011); Nannoolal et al. (2008)). The region of low $P_{sat}$ is most relevant for atmospheric SOA formation. Figure 9b shows histograms of only those molecules with 7 or 8 oxygen atoms, compared to the full dataset. Since the "8 O atoms" set is a subset of the "7 or 8 O atoms" set, which in turn is a subset of all molecules, the lengths of the bars in a given bin reflect the percentages of molecules with 7 or 8 O atoms. We observe that below $10^{-10}$ kPa, almost all C10 molecules contain 7 or 8 O atoms, as there is little grey visible in that part of the histogram.
In the context of atmospheric chemistry, the least-volatile fraction of our C10 dataset corresponds to LVOC ("low-volatility organic compounds"), which are capable of condensing onto small aerosol particles, but not of actually forming them. Our results are thus in qualitative agreement with recent experimental results by Peräkylä et al. (2020), who concluded that the highly oxidized C10 products of α-pinene oxidation are mostly LVOC. However, we note that the compounds measured by Peräkylä et al. are likely to contain functional groups not included in our C10 dataset, as well as structural features such as branching and rings. Figures 9a and 9c show the molecular structures of the lowest-volatility compounds, as well as the highest-volatility compounds with 7 or 8 O atoms, respectively. (Note that the latter set inevitably contains at least one carboxylic acid group, as we have restricted the number of functional groups to six or less, and only the acid groups contain two oxygen atoms.) Comparing the two sets, we can see that the lowest-volatility compounds contain more hydroxyl groups and fewer ketone groups, while the highest-volatility compounds with 7 or 8 oxygen atoms contain almost no hydroxyl groups. This is expected: for example, according to the SIMPOL model (Pankow and Asher, 2008), a hydroxyl group lowers the saturation vapour pressure by over a factor of 100 at 298 K, while the effect of a ketone group is a bit less than a factor of 10. However, even the lowest-volatility compounds (Figure 9a) contain a few ketone groups, such that the numbers of hydrogen-bond donor and acceptor groups are roughly similar. This result demonstrates that, unlike the simplest group-contribution models, both the original COSMOtherm predictions and the machine learning model based on them are capable of accounting for hydrogen-bonding interactions between functional groups.

Conclusions
In this study, we set out to evaluate the potential of the KRR machine learning method to map molecular structures to their atmospheric partitioning behaviour, and to establish which molecular descriptor has the best predictive capability.
KRR is a relatively simple kernel-based machine-learning technique that is straightforward to implement and fast to train.
Given the model's simplicity, the quality of learning depends strongly on the information content of the molecular descriptor. More specifically, it hinges on how well each format encapsulates the structural features relevant to the atmospheric behaviour.
The exhaustive approach of the MBTR descriptor to documenting molecular features has led to very good predictive accuracy in machine learning of molecular properties (Langer et al., 2020; Rossi and Cumby, 2020; Himanen et al., 2020), and this work is no exception. The lightweight CM descriptor does not perform nearly as well, but these two representations from the physical sciences provide us with an upper and lower limit on predictive accuracy.
Descriptors from cheminformatics that were developed specifically for molecules have variable performance. Among them, the topological fingerprint leads to the best learning quality, approaching MBTR accuracy in the limit of larger training set sizes. This is a notable finding, not least because the relatively small TopFP data structures in comparison to MBTR reduce the computational time and memory required for machine learning. MBTR encoding requires knowledge of the 3-dimensional molecular structure, which raises the issue of conformer search. It is unclear which molecular conformers are relevant for atmospheric condensation behaviour, and COSMOtherm calculations on different conformers can produce values that are orders of magnitude apart. TopFP requires only connectivity information and can be built from SMILES strings, eliminating any conformer considerations (albeit at the cost of possibly losing some information on e.g. intramolecular hydrogen bonds). All this makes TopFP the most promising descriptor for future machine learning studies in atmospheric science that we have identified in this work.
Our results show that KRR can be used to train a model to predict COSMOtherm saturation vapour pressures, with error margins smaller than those of the original COSMOtherm predictions. In the future, we propose to extend our training set to encompass especially atmospheric autoxidation products (Bianchi et al., 2019), which are not included in existing saturation vapour pressure datasets, and for which existing prediction methods are highly uncertain. While COSMOtherm predictions for such molecules also have large uncertainties, a fast and efficient "COSMOtherm-level" KRR predictor would still be immensely useful, for example in evaluating whether a given compound is likely to have extremely low volatility or not. Experimental data for such compounds will become available, either through indirect inference methods such as Peräkylä et al. (2020) or, for example, thermal desorption methods (Li et al., 2020). These data could then be used to constrain and anchor the model, and ultimately yield also quantitatively reliable volatility predictions.

Appendix A: Many-body tensor representation
In this appendix we provide the mathematical structure of the MBTR as it is implemented in the DScribe library (Himanen et al., 2020). The many-body levels in the MBTR are denoted $k$. For $k = 1, 2, 3$, geometry functions encode the different features:

$$ g_1(Z_l) = Z_l \quad \text{(atomic number)}, $$
$$ g_2(R_l, R_m) = |R_l - R_m| \quad \text{(distance)} \quad \text{or} \quad g_2(R_l, R_m) = \frac{1}{|R_l - R_m|} \quad \text{(inverse distance)}, $$
$$ g_3(R_l, R_m, R_n) = \cos\!\left( \angle (R_l - R_m,\, R_n - R_m) \right) \quad \text{(cosine of angle)}. $$
The sums over $l$, $m$, and $n$ run over all atoms with atomic numbers $Z_1$, $Z_2$ and $Z_3$. The $w_k$ are weighting functions that balance the relative importance of different $k$-terms and/or limit the range of inter-atomic interactions. For $k = 1$, usually no weighting is used ($w_1 = 1$). For $k = 2$ and $k = 3$, exponential decay functions of the form $w_k = e^{-s_k d}$, with $d$ the relevant inter-atomic distance, are implemented in DScribe. The parameter $s_k$ effectively tunes the cutoff distance. The functions $\mathrm{MBTR}_k(x)$ are then discretized with $n_k$ points in the respective intervals $[x^k_{min}, x^k_{max}]$.
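The geometry and weighting functions above translate directly into code. The sketch below evaluates $g_2$ (inverse distance), $g_3$ (cosine of the angle at the middle atom) and an exponential decay weight for an arbitrary toy geometry; the value of $s_k$ is an assumption for illustration.

```python
import numpy as np

def g2(Rl, Rm):
    """Inverse-distance geometry function."""
    return 1.0 / np.linalg.norm(Rl - Rm)

def g3(Rl, Rm, Rn):
    """Cosine of the angle at atom m, between bonds m->l and m->n."""
    u, v = Rl - Rm, Rn - Rm
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def w2(Rl, Rm, s=0.5):
    """Exponential decay weighting w_2 = exp(-s * |Rl - Rm|)."""
    return np.exp(-s * np.linalg.norm(Rl - Rm))

# Toy right-angle geometry: b sits at the corner between a and c.
a = np.array([0.0, 0.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([1.0, 1.0, 0.0])
```

In the full MBTR, each such scalar is broadened with a Gaussian on the corresponding $x$-axis and accumulated per element (or element pair/triple), as described in the main text.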