Predicting Gas-Particle Partitioning Coefficients of Atmospheric Molecules with Machine Learning

The formation, properties and lifetime of secondary organic aerosols in the atmosphere are largely determined by gas-particle partitioning coefficients of the participating organic vapours. Since these coefficients are often difficult to measure and to compute, we developed a machine learning model to predict them given molecular structure as input. Our data-driven approach is based on the dataset by Wang et al. (Atmos. Chem. Phys., 17, 7529 (2017)), who computed the partitioning coefficients and saturation vapour pressures of 3414 atmospheric oxidation products from the Master Chemical Mechanism using the COSMOtherm program. We trained a kernel ridge regression (KRR) machine learning model on the saturation vapour pressure (Psat) and on two equilibrium partitioning coefficients: between a water-insoluble organic matter phase and the gas phase (KWIOM/G), and between an infinitely dilute solution with pure water and the gas phase (KW/G). For the input representation of the atomic structure of each organic molecule to the machine, we tested different descriptors. We find that the many-body tensor representation (MBTR) works best for our application, but the topological fingerprint (TopFP) approach is almost as good and computationally cheaper to evaluate. Our best machine learning model (KRR with a Gaussian kernel + MBTR) predicts Psat and KWIOM/G to within 0.3 logarithmic units and KW/G to within 0.4 logarithmic units of the original COSMOtherm calculations. This is equal to or better than the typical accuracy of COSMOtherm predictions compared to experimental data (where available). We then applied our machine learning model to a dataset of 35,383 molecules that we generated based on a carbon-10 backbone functionalized with 0 to 6 carboxyl, carbonyl or hydroxyl groups, to evaluate its performance for polyfunctional compounds with potentially low Psat. The resulting saturation vapour pressure and partitioning coefficient distributions were physico-chemically reasonable, for example, in terms of the average effects of the addition of single functional groups. The volatility predictions for the most highly oxidized compounds were in qualitative agreement with experimentally inferred volatilities of, for example, alpha-pinene oxidation products with as-yet unknown structures but similar elemental compositions.

COSMO-RS calculations are based on quantum chemistry; they are therefore considered computationally tractable compared to high-level methods such as coupled cluster theory.
Nevertheless, the application of COSMO-RS to complex polyfunctional organic molecules still entails a significant computational effort, especially because the conformational complexity of these species needs to be taken into account appropriately.
Overall, there could be up to $10^4$–$10^7$ different organic compounds in the atmosphere (not even counting most oxidation intermediates), which makes the computation of saturation vapour pressures and partitioning coefficients a daunting task (Shri…).

Here, we take a different approach compared to previous parametrization studies and consider a data-science perspective. Instead of assuming chemical or physical relations, we let the data speak for itself. We develop and train a machine learning model to extract patterns from available data and predict saturation vapour pressures as well as partitioning coefficients.

Machine learning has only recently spread into atmospheric science (Cervone et al., 2008; Toms et al., 2018; Barnes et al., 2019; Nourani et al., 2019; Huntingford et al., 2019; Masuda et al., 2019). Prominent applications include the identification of forced climate patterns (Barnes et al., 2019), precipitation prediction (Nourani et al., 2019), climate analysis (Huntingford et al., 2019), pattern discovery (Toms et al., 2018), risk assessment of atmospheric emissions (Cervone et al., 2008), and the estimation of cloud optical thicknesses (Masuda et al., 2019). In molecular and materials science, machine learning is more established and now frequently complements theoretical or experimental methods (Müller et al., 2016; Ma et al., 2015; Shandiz and Gauvin, 2016; Gómez-Bombarelli et al., 2016; Bartók et al., 2017; Rupp et al., 2018; Goldsmith et al., 2018; Meyer et al., 2018; Zunger, 2018; Gu et al., 2019; Schmidt et al., 2019; Jensen et al.; Coley et al.). Here we build on our experience in atomistic, molecular machine learning (Todorović et al., 2019; Stuke et al., 2019; Himanen et al., 2020; Fang et al., 2021) to train a regression model that maps molecular structures onto saturation vapour pressures and partitioning coefficients. Once trained, the machine learning model can make saturation vapour pressure and partitioning predictions at COSMOtherm accuracy for hundreds of thousands of new molecules at no further computational cost. When experimental training data becomes available, the machine learning model could easily be extended to encompass predictions for experimental pressures and coefficients.
Due to the above-mentioned lack of comprehensive experimental databases for saturation vapour pressures or gas-liquid partitioning coefficients of polyfunctional atmospherically relevant molecules, our machine learning model is based on the computational data by Wang et al. (2017). They computed the partitioning coefficients and saturation vapour pressures for 3414 atmospheric secondary oxidation products, obtained from the Master Chemical Mechanism (Jenkin et al., 1997; Saunders et al., 2003), using a combination of quantum chemistry and statistical thermodynamics as implemented in the COSMOtherm approach. The parent VOCs for the MCM dataset include most of the atmospherically relevant small alkanes (methane, ethane, propane, etc.), alcohols, aldehydes, alkenes, ketones and aromatics, as well as chloro- and hydrochlorocarbons, esters, ethers, and a few representative larger VOCs such as three monoterpenes and one sesquiterpene. Some inorganics are also included. For technical details on the COSMOtherm calculations performed by Wang et al., we refer to the COSMOtherm documentation (Klamt, 2011) and a recent study by Hyttinen et al. (2020), where the conventions, definitions and notations used in COSMOtherm are connected to those more commonly employed in atmospheric physical chemistry. We note especially that the saturation vapor pressures computed by COSMOtherm correspond to the subcooled liquid state, and that the partitioning coefficients correspond to partitioning between two flat bulk surfaces in contact with each other. Actual partitioning between, e.g., aerosol particles and the gas phase will depend on further thermodynamic and kinetic parameters, which are not included here.
We transform the molecular structures in Wang's dataset into atomistic descriptors more suitable for machine learning than the atomic coordinates or the commonly used simplified molecular-input line-entry system (SMILES) strings. Optimal descriptor choices have been the subject of increased research in recent years (Langer et al., 2020; Rossi and Cumby, 2020; Himanen et al., 2020). We here test several descriptor choices: the many-body tensor representation (Huo and Rupp, 2017), the Coulomb matrix (Rupp et al., 2012), the Molecular ACCess System (MACCS) structural key (Durant et al., 2002), a topological fingerprint implemented in RDKit (Landrum et al., 2006) based on the Daylight fingerprint (James et al., 1995), and the Morgan fingerprint (Morgan, 1965).
Our work addresses the following objectives: 1) With a view to future machine learning applications in atmospheric science, we assess the predictive capability of different structural descriptors for machine learning the chosen target properties. 2) We quantify the predictive power of our machine learning model for Wang's dataset to ascertain whether the dataset size is sufficient for accurate machine learning predictions. 3) We then apply our validated machine learning model to a new molecular dataset to gain chemical insight into SOA condensation processes.
The paper is organized as follows. We describe our machine learning methodology in section 2, then present the machine learning results in section 3. Section 4 demonstrates how we employed the trained model for fast prediction of molecular properties. We discuss our findings and present a summary in section 5.

Figure 1. Schematic of our machine learning workflow: The raw input data is converted into molecular representations (referred to as features in this figure). We then set up and train a machine learning method. After evaluating its performance in step 5, we may adjust the features. Once the machine learning model is calibrated and trained, we make predictions on new data.

Methods
Our machine learning approach has six components, as illustrated in Fig. 1. We start off with the raw data, which we present and analyse in section 2.1. The raw data is then transformed into a suitable representation for machine learning (step 2). We introduce five different representations in section 2.2, which we test in our machine learning model (cf. section 3). Next we choose our machine learning method. Here we use kernel ridge regression (KRR), which is introduced in section 2.3. After the machine learning model is trained in step 4, we analyse its learning success in step 5. The results of this process are shown in section 3. In this step we also make adjustments to the representation and the parameters of the model to improve the learning (see the arrow from Checks to Features in Fig. 1). Finally, we use the best machine learning model to make predictions, as shown in section 4.

Dataset
In this work we are interested in the equilibrium partitioning coefficients of a molecule between a water-insoluble organic matter (WIOM) phase and its gas phase ($K_{\mathrm{WIOM/G}}$), as well as between an infinitely dilute aqueous solution and the gas phase ($K_{\mathrm{W/G}}$).
These coefficients are defined as

$$K_{\mathrm{WIOM/G}} = \frac{C_{\mathrm{WIOM}}}{C_G}, \qquad K_{\mathrm{W/G}} = \frac{C_W}{C_G},$$

where $C_{\mathrm{WIOM}}$, $C_W$, and $C_G$ are the equilibrium concentrations of the molecule in the WIOM, water, and gas phase, respectively, at the limit of infinite dilution. In the framework of COSMOtherm calculations, gas-liquid partitioning coefficients can be converted into saturation vapour pressures, or vice versa, using the activity coefficients $\gamma_W$ or $\gamma_{\mathrm{WIOM}}$ in the corresponding liquid (which can also be computed by COSMOtherm). Specifically, if for example $K_{\mathrm{W/G}}$ is expressed in units of m$^3$ g$^{-1}$, then

$$K_{\mathrm{W/G}} = \frac{RT}{M \gamma_W P_{\mathrm{sat}}},$$

where $R$ is the gas constant, $T$ the temperature, $M$ the molar mass of the compound, and $\gamma_W$ the activity coefficient in water (Arp and Goss, 2009). This illustrates that unlike the saturation vapor pressure $P_{\mathrm{sat}}$, which is a pure-compound property, the partitioning coefficient also depends on the activity of the molecule in the chosen liquid solvent, in this case water. We caution, however, that many different conventions exist, e.g. for the dimensions of the partitioning coefficients as well as the reference states for activity coefficients; the relation given above applies only to the particular conventions used by COSMOtherm. We refer to Hyttinen et al. (2020) for a discussion on the connection between the conventions and notation used by COSMOtherm and those commonly employed in atmospheric physical chemistry.

Wang et al. (2017) used the conductor-like screening model for real solvents (COSMO-RS) theory (Klamt, 2011) implemented in COSMOtherm to calculate the two partitioning coefficients $K_{\mathrm{WIOM/G}}$ and $K_{\mathrm{W/G}}$ for 3414 molecules. These molecules were generated from 143 parent volatile organic compounds with the Master Chemical Mechanism (MCM) (Jenkin et al., 1997; Saunders et al., 2003) through photolysis and reactions with ozone, hydroxyl radicals and nitrate radicals.

Here, we analyse the composition of the publicly available dataset by Wang et al. in preparation for machine learning. Figure 2 illustrates key dataset statistics. Panel a) shows the size distribution of molecules as measured in the number of non-hydrogen atoms. The 3414 non-radical species obtained from the MCM range in size from 4 to 48 atoms, which translates into 2 to 24 non-hydrogen atoms per molecule. The distribution peaks at 10 non-hydrogen atoms and is skewed towards larger molecules. Panel b) illustrates how many molecules contain at least one atom of the indicated element. All molecules contain carbon (100% C), 3410 contain hydrogen (H; 99.88%) and 3333 also oxygen (O; 97.63%). Nitrogen (N) is the next most abundant element (30.17%), followed by chlorine (Cl; 3.05%), sulphur (S; 0.44%) and bromine (Br; 0.32%). Lastly, panel c) presents the distribution of functional groups. It peaks at 2 (34%) to 3 (31%) functional groups per molecule, with relatively few molecules having 0 (2%), 5 (3%) or 6 (2%) functional groups. The percentages for 1 and 4 functional groups are 11% and 17%, respectively. The equilibrium partitioning coefficient $K_{\mathrm{WIOM/G}}$ distribution is skewed slightly towards larger coefficients, in contrast to the saturation vapour pressure $P_{\mathrm{sat}}$ distribution, which exhibits an asymmetry towards molecules with lower pressures. All three target properties cover approximately 15 logarithmic units and are approximately Gaussian distributed. Such peaked distributions are often not ideal for machine learning, since they over-represent molecules near the peak of the distribution and under-represent molecules at the edges.
The data peak does supply enough similarity to ensure good quality learning, but properties of the under-represented molecular types might be harder to learn.
We found 11 duplicate entries in Wang's dataset; these are documented in Tab. A1 in Section A. The entries have the same SMILES strings and chemical formulas but differ in their Master Chemical Mechanism ID. The three target properties also differ slightly. These duplicates did not affect the learning quality, so we did not remove them from the dataset.
Wang's dataset of 3414 molecules is relatively small for machine learning, which often requires hundreds of thousands to millions of training samples (Pyzer-Knapp et al., 2015; Smith et al., 2017; Stuke et al., 2019; Ghosh et al., 2019). A slightly larger set of Henry's law constants, which are related to $K_{\mathrm{W/G}}$, was reported by Sander (2015) for 4632 organic species.
Sander's database is a collection of 17350 Henry's law constant values gathered from 689 references and is therefore not as internally consistent as Wang's dataset. For example, the Sander dataset contains several molecules with multiple entries for the same property, sometimes spanning many orders of magnitude. We are not aware of a larger dataset that reports partitioning coefficients. For this reason, we rely exclusively on Wang's dataset and show that we can develop machine learning methods that are just as accurate as the underlying calculations and thus suitable for predictions.

Molecular Representations
The molecular representation for machine learning should fulfil certain requirements. It should be invariant with respect to translation and rotation of the molecule and permutations of atomic indices. Furthermore, it should be continuous, unique, compact and efficient to compute (Faber et al., 2015;Huo and Rupp, 2017;Langer et al., 2020;Himanen et al., 2020).
In this work we employ two classes of representations for the molecular structure, also known as descriptors: physical and cheminformatics descriptors. Physical descriptors encode physical distances and angles between atoms in the material or molecule. Meanwhile, decades of research in cheminformatics have produced topological descriptors that encode the qualitative aspects of molecules in a compact representation. These descriptors are typically bitvectors, in which molecular features are encoded (hashed) into binary fingerprints, which are joined into long binary vectors. In this work, we use two physical descriptors, the Coulomb matrix and the many-body tensor representation, and three cheminformatics descriptors: the MACCS structural key, the topological fingerprint and the Morgan fingerprint.

In Wang's dataset the molecular structure is encoded in SMILES (Simplified Molecular Input Line Entry Specification) strings. We convert these SMILES strings into structural descriptors using Open Babel (O'Boyle et al., 2011) and the DScribe library (Himanen et al., 2020), or into cheminformatics descriptors using RDKit (Landrum et al., 2006).
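To make this step concrete, the following minimal sketch parses a SMILES string with RDKit, embeds a single 3D conformer, and computes a sorted Coulomb matrix with DScribe. This is our own illustration, not the original pipeline (which used Open Babel for the 3D geometries); the example molecule and all parameter values are placeholders.

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from ase import Atoms
from dscribe.descriptors import CoulombMatrix

def smiles_to_coulomb_matrix(smiles, n_atoms_max=48):
    # Parse the SMILES string and add explicit hydrogens.
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    # Generate and relax a single 3D conformer (no conformer search,
    # mirroring the single-geometry approach described in the text).
    AllChem.EmbedMolecule(mol, randomSeed=42)
    AllChem.MMFFOptimizeMolecule(mol)
    conf = mol.GetConformer()
    # Convert to an ASE Atoms object, which DScribe consumes.
    symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
    atoms = Atoms(symbols=symbols, positions=conf.GetPositions())
    # Sorted Coulomb matrix, zero-padded up to the largest molecule.
    cm = CoulombMatrix(n_atoms_max=n_atoms_max, permutation="sorted_l2")
    return cm.create(atoms)

# Example: 2-hydroxy-2-methylpropanoic acid, the molecule shown in Fig. 4.
descriptor = smiles_to_coulomb_matrix("CC(C)(O)C(=O)O")
print(descriptor.shape)  # flattened vector of length n_atoms_max**2
```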

Coulomb Matrix
The Coulomb matrix (CM) descriptor is inspired by an electrostatic representation of a molecule (Rupp et al., 2012). It encodes the Cartesian coordinates of a molecule in a simple matrix of the form

$$M_{ij} = \begin{cases} 0.5\, Z_i^{2.4} & \text{for } i = j \\[4pt] \dfrac{Z_i Z_j}{\lvert R_i - R_j \rvert} & \text{for } i \neq j, \end{cases}$$

where $R_i$ is the coordinate of atom $i$ with atomic charge $Z_i$. The diagonal provides element-specific information; the coefficient and the exponent have been fitted to the total energies of isolated atoms (Rupp et al., 2012). Off-diagonal elements encode inverse distances between the atoms of the molecule by means of a Coulomb-repulsion-like term.

The dimension of the Coulomb matrix is chosen to fit the largest molecule in the data set, i.e. it corresponds to the number of atoms of the largest molecule. The "empty" rows of Coulomb matrices for smaller molecules are padded with zeroes.
Invariance with respect to the permutation of atoms in the molecule is enforced by simultaneously sorting the rows and columns of each Coulomb matrix in descending order according to their $\ell_2$-norms. An example of a Coulomb matrix for 2-hydroxy-2-methylpropanoic acid is shown in Fig. 4b.
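For illustration, here is a self-contained sketch of the construction just described (our own minimal implementation of the published formula, not DScribe's):

```python
import numpy as np

def coulomb_matrix(Z, R, n_atoms_max):
    """Sorted, zero-padded Coulomb matrix.

    Z: atomic numbers, shape (n,)
    R: Cartesian coordinates in Angstrom, shape (n, 3)
    """
    n = len(Z)
    M = np.zeros((n_atoms_max, n_atoms_max))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4  # fitted diagonal term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    # Sort rows and columns together by descending row norm to enforce
    # invariance under atom-index permutations; padded rows sort last.
    order = np.argsort(-np.linalg.norm(M, axis=1))
    return M[np.ix_(order, order)]

# Toy example: a water molecule padded to a 5-atom matrix.
Z = np.array([8, 1, 1])
R = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
print(coulomb_matrix(Z, R, n_atoms_max=5))
```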

The CM is easily understandable, simple and relatively small as a descriptor. However, it performs best with Laplacian kernels in the machine-learning model (see Section 2.3), while other descriptors work better with the more standard choice of a Gaussian kernel.

Many-body tensor representation
The many-body tensor representation (MBTR) follows the Coulomb matrix philosophy of encoding the internal coordinates of a molecule. We will here describe the MBTR only qualitatively; detailed equations can be found in the original publication (Huo and Rupp, 2017), our previous work (Himanen et al., 2020; Stuke et al., 2020) or Appendix B. At many-body level 1, each chemical element present in the molecule is recorded as a broadened peak whose weight corresponds to the number of times the species is present in the molecule. At many-body level 2, inverse distances between every pair of atoms (bonded and non-bonded) are recorded in the same fashion. Many-body level 3 adds angular information between any triple of atoms. Higher levels (e.g. dihedral angles) would in principle be straightforward to add, but are not implemented in the current MBTR versions (Huo and Rupp, 2017; Himanen et al., 2020). Figure 4c shows selected MBTR elements for 2-hydroxy-2-methylpropanoic acid.

The MBTR is a continuous descriptor, which is advantageous for machine learning. However, MBTR is by far the largest descriptor out of the five we tested, and this can pose restrictions on memory and computational cost. Furthermore, the MBTR is more difficult to interpret than the CM.
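As a sketch of how the levels described above are configured in practice (assuming the DScribe 1.x interface; the argument layout was reorganized in DScribe 2.x), an MBTR with levels 1 and 2 might be set up as follows. All grid and weighting values here are illustrative placeholders, not the optimized hyperparameters reported in the appendix.

```python
from dscribe.descriptors import MBTR

# Chemical elements present in Wang's dataset.
species = ["H", "C", "N", "O", "S", "Cl", "Br"]

mbtr = MBTR(
    species=species,
    # Level 1: broadened distribution of atomic numbers.
    k1={
        "geometry": {"function": "atomic_number"},
        "grid": {"min": 0, "max": 36, "n": 100, "sigma": 0.1},
    },
    # Level 2: broadened distribution of inverse pair distances,
    # exponentially down-weighted for distant pairs.
    k2={
        "geometry": {"function": "inverse_distance"},
        "grid": {"min": 0, "max": 1, "n": 100, "sigma": 0.01},
        "weighting": {"function": "exp", "scale": 0.5, "threshold": 1e-3},
    },
    periodic=False,
    normalization="l2_each",
)

# `atoms` would be an ASE Atoms object, e.g. from the SMILES pipeline
# sketched earlier in this section.
# vector = mbtr.create(atoms)
```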

MACCS Structural Key
The Molecular ACCess System (MACCS) structural key is a dictionary-based descriptor (Durant et al., 2002). It is represented as a bitvector of Boolean values that encode answers to a set of predefined questions. The MACCS structural key we used is a 166-bit-long set of answers to 166 questions such as "Is there an S-S bond?" or "Does it contain iodine?" (Landrum et al., 2006; James et al., 1995).
MACCS is the smallest of the five descriptors and extremely fast to use. Its accuracy critically depends on how well the 166 questions encapsulate the chemical detail of the molecules. It is likely to reach a moderate accuracy with low computational cost and memory usage and could be beneficial for fast testing of a machine learning model.
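In RDKit, generating the MACCS key takes a single call; this short example (ours, for illustration) computes the key for the same example molecule as above:

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CC(C)(O)C(=O)O")  # 2-hydroxy-2-methylpropanoic acid
key = MACCSkeys.GenMACCSKeys(mol)  # ExplicitBitVect; bit 0 is unused padding
print(key.GetNumBits())            # 167 (166 questions + one padding bit)
print(list(key.GetOnBits())[:10])  # indices of the first few set bits
```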

Topological Fingerprint
The topological fingerprint (TopFP) is RDKit's original fingerprint (Landrum et al., 2006), inspired by the Daylight fingerprint (James et al., 1995). TopFP first extracts all topological paths up to a certain length. The paths start from one atom in a molecule and travel along bonds until k bonds have been traversed, as illustrated in Fig. 4d. The path depicted in the figure would be OCCO. The list of patterns produced is exhaustive: every pattern in the molecule, up to the path-length limit, is generated.
Each pattern then serves as a seed to a pseudo-random number generator (it is "hashed"), the output of which is a set of bits (typically 4 or 5 bits per pattern). The set of bits is added (with a logical OR) to the fingerprint. The length of the bitvector, the maximum and minimum possible path lengths $k_{\max}$ and $k_{\min}$, and the length of one hash can be optimized.
Topology is an informative molecular feature. We therefore expect TopFP to balance good accuracy with reasonable com-240 putational cost. However, this binary fingerprint is difficult to visualize and analyse for chemical insight.
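In RDKit, the fingerprint described above is exposed as RDKFingerprint; the parameter names below map directly onto the quantities in the text (path lengths, bitvector length, bits per hash), with illustrative rather than optimized values:

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(C)(O)C(=O)O")
fp = Chem.RDKFingerprint(
    mol,
    minPath=1,       # k_min: shortest path length in bonds
    maxPath=7,       # k_max: longest path length in bonds
    fpSize=2048,     # length of the bitvector
    nBitsPerHash=4,  # bits set per hashed path
)
print(fp.GetNumOnBits(), "of", fp.GetNumBits(), "bits set")
```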

Morgan Fingerprint
The Morgan fingerprint is also a bitvector constructed by hashing the molecular structure. In contrast to the topological fingerprint, the Morgan fingerprint is hashed along circular or spherical paths around a central atom, as illustrated in Figure 4e. Each substructure for a hash is constructed by first numbering the atoms in a molecule with unique integers by applying the Morgan algorithm. Each uniquely numbered atom then becomes a cluster center, around which we iteratively increase a spherical radius to include the neighbouring bonded atoms (Rogers and Hahn, 2010). Each radius increment extends the neighbour list by another molecular bond. The "circular" substructures found by the algorithm described above, excluding duplicates, are then hashed into a fingerprint (James et al., 1995; Landrum et al., 2006). The length of the fingerprint and the maximum radius can be optimized.

The Morgan fingerprint is quite similar to the TopFP in size and type of information encoded, so we expect similar performance. It also does not lend itself to easy chemical interpretation.
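An RDKit sketch with illustrative parameters (the radius and fingerprint length are the two quantities noted above as optimizable):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(C)(O)C(=O)O")
# radius=2 collects substructures up to two bonds from each atom center.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
print(fp.GetNumOnBits(), "of", fp.GetNumBits(), "bits set")
```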

Kernel Ridge Regression
In this work, we apply the kernel ridge regression (KRR) machine learning method. KRR is an example of supervised learning. It is based on ridge regression, in which a penalty for overfitting is added to an ordinary least squares fit (Friedman et al., 2001). In KRR, unlike ridge regression, a nonlinear kernel is applied, which maps the molecular structure to our target properties in a high-dimensional space (Rupp, 2015).
The target values $f$ are a linear expansion in kernel elements,

$$f(x) = \sum_{i=1}^{N} \alpha_i \, k(x, x_i),$$

where the sum runs over all training molecules. In this work, we use two different kernels, the Gaussian kernel

$$k(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert_2^2\right)$$

and the Laplacian kernel

$$k(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert_1\right).$$

The kernel width $\gamma$ is a hyperparameter of the KRR model.
The regression coefficients $\alpha_i$ can be solved by minimizing the error

$$\sum_{i=1}^{N} \left( f(x_i) - y_i \right)^2 + \lambda \, \alpha^{T} K \alpha,$$

where $y_i$ are reference target values for the molecules in the training data. The second term is the regularization term, whose size is controlled by the hyperparameter $\lambda$; $K$ is the kernel matrix of training inputs, $K_{ij} = k(x_i, x_j)$.

The hyperparameters γ and λ need to be optimised separately.
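Minimizing this expression has the closed-form solution $\alpha = (K + \lambda I)^{-1} y$. The following self-contained numpy sketch (a minimal illustration of the method; our production workflow used scikit-learn, see the next section) implements training and prediction:

```python
import numpy as np
from scipy.spatial.distance import cdist

def train_krr(X, y, gamma, lam, kernel="gaussian"):
    """Fit KRR and return the regression coefficients alpha."""
    if kernel == "gaussian":
        K = np.exp(-gamma * cdist(X, X, "sqeuclidean"))
    else:  # Laplacian kernel on the L1 distance
        K = np.exp(-gamma * cdist(X, X, "cityblock"))
    # Closed-form solution of the regularized least-squares problem.
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict_krr(X_train, X_new, alpha, gamma, kernel="gaussian"):
    metric = "sqeuclidean" if kernel == "gaussian" else "cityblock"
    K_new = np.exp(-gamma * cdist(X_new, X_train, metric))
    return K_new @ alpha

# Toy usage with random vectors standing in for molecular descriptors.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 16)), rng.normal(size=100)
alpha = train_krr(X, y, gamma=0.1, lam=1e-2)
print(predict_krr(X, X[:5], alpha, gamma=0.1))
```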

Computational Execution
Data used for supervised machine learning is typically divided into two sets, a large training set and a small test set. The hyperparameters γ and λ are typically optimized via grid search, and average optimal solutions are obtained by cross-validating the procedure. In cross-validation we split off a validation set from the training data before training the KRR model. KRR is then trained for all possible combinations of discretised hyperparameters (grid search) and evaluated on the validation set. This is done several times, so that the molecules in the validation set change each time. Then the hyperparameter combination with the minimum average cross-validation error is chosen. Our implementation of cross-validated grid search is based on Scikit-learn (Pedregosa et al., 2011). The optimized values for γ and λ are listed in Tabs. B1 and B2. In the grid search, we varied both γ and λ over ten values between $10^{-10}$ and $10^{1}$. In addition, we used two different kernels, Laplacian and Gaussian. We compared the performance of the two kernels for the average of 5 runs for each training size, and the better-performing kernel was chosen. In cases in which both kernels performed equally well, e.g., for the fingerprints, we chose the Gaussian kernel for its lower computational cost.
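A minimal version of this cross-validated grid search with scikit-learn might look as follows (a sketch with illustrative grids and random placeholder data, not the exact ranges above; note that scikit-learn calls the regularization strength `alpha`, corresponding to our λ):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV, train_test_split

# X: descriptor vectors, y: log10 target values (e.g. log10 Psat).
rng = np.random.default_rng(1)
X, y = rng.normal(size=(500, 64)), rng.normal(size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

grid = {
    "alpha": np.logspace(-10, 1, 10),   # regularization (our lambda)
    "gamma": np.logspace(-10, 1, 10),   # kernel width
    "kernel": ["rbf", "laplacian"],     # Gaussian and Laplacian kernels
}
search = GridSearchCV(KernelRidge(), grid, cv=5,
                      scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)
print(search.best_params_)
print("test MAE:", np.mean(np.abs(search.predict(X_test) - y_test)))
```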
To compute the MBTR and CM descriptors, we employed the Open Babel software to convert the SMILES strings provided in the Wang et al. dataset into 3-dimensional molecular structures. We did not perform any conformer search. MBTR and TopFP hyperparameters were optimized by grid search for several training set sizes (MBTR for sizes 500, 1500 and 3000, and TopFP for sizes 1000 and 1500), and the average of two runs for each training size was taken. We did not extend the descriptor hyperparameter search to larger training set sizes, since we found that the hyperparameters were insensitive to the training set size. The MBTR weighting parameters were optimized in 8 steps between 0 (no weighting) and 1.4, and the broadening parameters in 6 steps between $10^{-1}$ and $10^{-6}$. The length of TopFP was varied between 1024 and 8192 (the size can only be varied in powers of two, $2^n$), the range for the maximum path length extended from 5 to 11, and the bits per hash were varied between 3 and 16.

Results
In Figure 5 we present the learning curves for our target properties $K_{\mathrm{WIOM/G}}$, $K_{\mathrm{W/G}}$ and $P_{\mathrm{sat}}$. Shown is the mean absolute error (MAE) as a function of the training set size for all three target properties and for all five molecular descriptors. As expected, the MAE decreases as the training size increases. For all target properties, the lowest errors are achieved with MBTR and TopFP. Figure 6 shows the correlation between the KRR predictions and the COSMOtherm reference values for these two best descriptors. For all three target properties, the predictions hug the diagonal quite closely and we observe only a few outliers that are further away from the diagonal. The predictions of the partitioning coefficient $K_{\mathrm{WIOM/G}}$ are the most accurate. This is expected, because the MAE in Table 2 is lowest for this property. The largest scatter is observed for the partitioning coefficient $K_{\mathrm{W/G}}$, which had the highest MAE in Table 2.

Figure 6. KRR predictions versus COSMOtherm reference values for MBTR (top) and TopFP (bottom). The prediction with the lowest mean absolute error was chosen for each scatter plot.

Predictions
In the previous section we showed that our KRR model trained on the Wang et al. dataset produces low prediction errors for molecular partitioning coefficients and saturation vapour pressures and can now be employed as a fast predictor. When shown further molecular structures, it can make instant predictions for the molecular properties of interest. We demonstrate this application potential on an example dataset generated to imitate organic molecules typically found in the atmosphere.
Atmospheric oxidation reaction mechanisms can be generally classified into two main types: fragmentation and functionalization (Kroll et al., 2009; Seinfeld and Pandis, 2016). For SOA formation, functionalization is more relevant, as it leads to products with intact carbon backbones and added polar (and volatility-lowering) functional groups. Many of the most interesting molecules from an SOA-forming point of view, e.g. monoterpenes, have around 10 carbon atoms (Zhang et al., 2018). These compounds simultaneously have high enough emissions or concentrations to produce appreciable amounts of condensable products, while being large enough for those products to have low volatility.
We thus generated a dataset of molecules with a backbone of ten carbon (C10) atoms. For simplicity, we used a linear alkane chain. In analogy with Wang's dataset, we then decorated this backbone with 0 to 6 carboxyl, carbonyl or hydroxyl groups at different locations. For each of the 35,383 resulting molecules, we generated a SMILES string that serves as input for the TopFP fingerprint. We did not relax the geometry of the molecules with force fields or density-functional theory. We chose TopFP as the descriptor because its accuracy is close to that of the best performing MBTR KRR model, but it is significantly cheaper to evaluate. TopFP is also invariant to conformer choices, since the fingerprint is the same for all conformers of a molecule. We then predicted $P_{\mathrm{sat}}$, $K_{\mathrm{WIOM/G}}$ and $K_{\mathrm{W/G}}$ with the TopFP-KRR model. A sketch of how such a combinatorial dataset could be enumerated is shown after this paragraph.

Figures 7 and 8 show the predictions of our TopFP-KRR model for the C10 dataset. For comparison with Wang's dataset, we broke the histograms and analysis down by the number of functional groups. For a given number of functional groups, the partitioning coefficients for our C10 dataset are somewhat higher, and the saturation vapour pressures correspondingly somewhat lower, than in Wang's dataset. This follows from the fact that our C10 molecules are on average larger than those contained in Wang's dataset (Figure 2). However, as seen from Figure 8, the averages of all three quantities (for a given number of functional groups) are not substantially different, illustrating the similarity of both datasets. A certain degree of similarity is required to ensure predictive power, since machine learning models do not extrapolate well to data that lies outside the training range.
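The following sketch illustrates one possible enumeration scheme under our own simplifying assumptions (at most one substituent per backbone carbon, the carboxyl carbon not counted toward the backbone, and RDKit canonicalization to remove mirror-image duplicates). It does not reproduce the exact count of 35,383 molecules, so the constraints used for the actual dataset evidently differed; it is meant only to show the combinatorial construction.

```python
import itertools
from rdkit import Chem

# SMILES branch for each substituent on a backbone carbon.
GROUPS = {None: "", "hydroxyl": "(O)", "carbonyl": "(=O)", "carboxyl": "(C(=O)O)"}

def decorate(assignment):
    """SMILES for a linear C10 chain with one substituent (or None)
    assigned to each backbone carbon."""
    return "".join("C" + GROUPS[g] for g in assignment)

unique = set()
# Brute-force enumeration (4**10 candidates); fine for a one-off script.
for assignment in itertools.product(GROUPS, repeat=10):
    if sum(g is not None for g in assignment) > 6:
        continue  # at most 6 functional groups
    mol = Chem.MolFromSmiles(decorate(assignment))
    if mol is not None:
        # Canonical SMILES collapses mirror-image duplicates of the chain.
        unique.add(Chem.MolToSmiles(mol))

print(len(unique), "distinct molecules generated")
```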
The variation in the studied parameters is larger in Wang's dataset for molecules with 4 or fewer functional groups, but similar or smaller for molecules with 5 or 6 functional groups. This is likely because Wang's dataset contains relatively few compounds with more than four functional groups. The variation in the studied parameters (for each number of functional groups) predicted for the C10 dataset is in line with the individual group contributions predicted based on fits to experimental data, for example, by the SIMPOL model (Pankow and Asher, 2008) for saturation vapor pressures. According to SIMPOL, a carboxylic acid group decreases the saturation vapor pressure at room temperature by almost a factor of 4000, while a ketone group reduces it by less than a factor of 9. Accordingly, if interactions between functional groups are ignored, a dicarboxylic acid, for example, should have a saturation vapor pressure more than 100,000 times lower than a diketone with the same carbon backbone. This is remarkably consistent with Figure 8, where the variation of saturation vapor pressures for compounds with two functional groups in our C10 dataset is slightly more than 5 orders of magnitude.

Figure 7 illustrates that the saturation vapour pressure $P_{\mathrm{sat}}$ decreases with increasing number of functional groups, as expected, whereas $K_{\mathrm{WIOM/G}}$ and $K_{\mathrm{W/G}}$ increase. This is consistent with Wang's dataset, as shown in Fig. 8, where we compare averages between the two datasets. The magnitude of the decrease (increase) amounts to approximately 1 or 2 orders of magnitude per functional group and is, again, consistent with existing structure-activity relationships based on experimental data (e.g. Pankow and Asher (2008); Compernolle et al. (2011); Nannoolal et al. (2008)).
The region of low $P_{\mathrm{sat}}$ is most relevant for atmospheric SOA formation. However, we caution that COSMOtherm predictions have not yet been properly validated against experiments for this pressure regime. As discussed above, we can hope for order-of-magnitude accuracy at best. Figure 9b shows histograms of only those molecules with 7 or 8 oxygen atoms, compared to the full dataset. Since the "8 O atoms" set is a subset of the "7 or 8 O atoms" set, which in turn is a subset of "all molecules", the lengths of the bars in a given bin reflect the percentages of molecules with 7 or 8 O atoms. We observe that below $10^{-10}$ kPa, almost all C10 molecules contain 7 or 8 O atoms, as there is little grey visible in that part of the histogram. In the context of atmospheric chemistry, the least-volatile fraction of our C10 dataset corresponds to LVOC ("low-volatility organic compounds"), which are capable of condensing onto small aerosol particles, but not of actually forming them. Our results are thus in qualitative agreement with recent experimental results by Peräkylä et al. (2020), who concluded that the highly oxidized C10 products of α-pinene oxidation are mostly LVOC. However, we note that the compounds measured by Peräkylä et al. are likely to contain functional groups not included in our C10 dataset, as well as structural features such as branching and rings.

Figures 9a and 9c show the molecular structures of the lowest-volatility compounds, as well as the highest-volatility compounds with 7 or 8 O atoms, respectively. The 6 highest-volatility compounds shown inevitably contain at least one carboxylic acid group, as we have restricted the number of functional groups to six or less, and only the acid groups contain two oxygen atoms. Comparing the two sets, we see that the lowest-volatility compounds contain more hydroxyl groups and fewer ketone groups, while the highest-volatility compounds with 7 or 8 oxygen atoms contain almost no hydroxyl groups. This is expected, since, e.g., a hydroxyl group lowers the saturation vapor pressure by over a factor of 100 at 298 K, while the effect of a ketone group is, as previously noted, less than a factor of 9 according to the SIMPOL model (Pankow and Asher, 2008). However, even the lowest-volatility compounds (Figure 9a) contain a few ketone groups, such that the numbers of hydrogen-bond donor and acceptor groups are roughly similar. This result demonstrates that unlike the simplest group-contribution models such as SIMPOL, which would invariably predict that the lowest-volatility compounds in our C10 dataset should be the tetrahydroxydicarboxylic acids, both the original COSMOtherm predictions and the machine learning model based on them are capable of accounting for hydrogen-bonding interactions between functional groups. As we did not include conformational information of our C10 molecules in the machine learning predictions, this is most likely due to structural similarities between the C10 compounds and hydrogen-bonding molecules in the training dataset.
Lastly, we consider the issue of non-unique descriptors. Although the cheminformatics descriptors are fast to compute and use, a duplicate check revealed that it is possible to obtain identical descriptors for different molecular structures, even for this relatively small pool of molecules. The MACCS fingerprint in particular produced over 500 duplicates (about 15% of the dataset), because its query list is not sufficiently descriptive of this molecule class. Some duplicates were also observed for the other fingerprints; these did not noticeably affect our key models, so we did not purge them.

Conclusions
In this study, we set out to evaluate the potential of the KRR machine learning method to map molecular structures to their atmospheric partitioning behaviour, and to establish which molecular descriptor has the best predictive capability.
KRR is a relatively simple kernel-based machine-learning technique that is straightforward to implement and fast to train.

Given the model's simplicity, the quality of learning depends strongly on the information content of the molecular descriptor. More specifically, it hinges on how well each format encapsulates the structural features relevant to atmospheric behaviour. The exhaustive approach of the MBTR descriptor to documenting molecular features has led to very good predictive accuracy in machine learning of molecular properties (Langer et al., 2020; Rossi and Cumby, 2020; Himanen et al., 2020), and this work is no exception. The lightweight CM descriptor does not perform nearly as well, but these two representations from the physical sciences provide us with an upper and lower limit on predictive accuracy.
Descriptors from cheminformatics that were developed specifically for molecules have variable performance. Among them, the topological fingerprint leads to the best learning quality, approaching MBTR accuracy in the limit of larger training set sizes. This is a notable finding, not least because TopFP's data structures are small in comparison to MBTR, which reduces the computational time and memory required for machine learning. MBTR encoding requires knowledge of the 3-dimensional molecular structure, which raises the issue of conformer search. It is unclear which molecular conformers are relevant for atmospheric condensation behaviour, and COSMOtherm calculations on different conformers can produce values that are orders of magnitude apart. TopFP requires only connectivity information and can be built from SMILES strings, eliminating any conformer considerations (albeit at the cost of possibly losing some information on, e.g., intramolecular hydrogen bonds). All this makes TopFP the most promising descriptor for future machine learning studies in atmospheric science that we have identified in this work.
Our results show that KRR can be used to train a model to predict COSMOtherm saturation vapor pressures, with error margins smaller than those of the original COSMOtherm predictions. In the future, we will extend our training set to encompass especially atmospheric autoxidation products (Bianchi et al., 2019), which are not included in existing saturation vapour pressure datasets, and for which existing prediction methods are highly uncertain. We also intend to extend the machine learning model to predict a larger set of parameters computed by COSMOtherm, such as vaporization enthalpies, internal energies of phase transfer, and activity coefficients in representative phases. While COSMOtherm predictions for complex molecules such as autoxidation products also have large uncertainties, a fast and efficient "COSMOtherm-level" KRR predictor would still be immensely useful, for example, for assessing whether a given compound is likely to have extremely low volatility or not. Experimental volatility data for such compounds is also gradually becoming available, either through indirect inference methods such as Peräkylä et al. (2020), or, for example, from thermal desorption measurements (Li et al., 2020). These can then be used to constrain and anchor the model, and ultimately yield quantitatively reliable volatility predictions.
Code and data availability. The Wang dataset (Wang et al., 2017) …

Appendix B

In this appendix we provide the mathematical structure of the MBTR as implemented in the DScribe library (Himanen et al., 2020). The many-body levels in the MBTR are denoted $k$. For $k = 1, 2, 3$, geometry functions $g_k$ encode the different features:

$$g_1(Z_l) = Z_l \ \text{(atomic number)},$$
$$g_2(R_l, R_m) = \lvert R_l - R_m \rvert \ \text{(distance)} \quad \text{or} \quad g_2(R_l, R_m) = \frac{1}{\lvert R_l - R_m \rvert} \ \text{(inverse distance)},$$
$$g_3(R_l, R_m, R_n) = \cos\left(\angle(R_l - R_m,\, R_n - R_m)\right) \ \text{(cosine of angle)}.$$
The scalar values returned by the geometry functions $g_k$ are Gaussian broadened into continuous representations $D_k$:

$$D_k(x) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left( -\frac{(x - g_k)^2}{2\sigma_k^2} \right),$$

where the $\sigma_k$ are the feature widths for the different $k$-levels and $x$ runs over a predefined range $[x_k^{\min}, x_k^{\max}]$ of possible values for the geometry functions $g_k$.
Finally, a weighted sum of distributions $D_k$ is generated for each possible combination of chemical elements present in the molecule:

$$\mathrm{MBTR}_1^{Z_1}(x) = \sum_{l} w_1^{l}\, D_1^{l}(x), \qquad \mathrm{MBTR}_2^{Z_1 Z_2}(x) = \sum_{l,m} w_2^{l,m}\, D_2^{l,m}(x), \qquad \mathrm{MBTR}_3^{Z_1 Z_2 Z_3}(x) = \sum_{l,m,n} w_3^{l,m,n}\, D_3^{l,m,n}(x).$$

The sums for $l$, $m$, and $n$ run over all atoms with atomic numbers $Z_1$, $Z_2$ and $Z_3$. The $w_k$ are weighting functions that balance the relative importance of different $k$-terms and/or limit the range of inter-atomic interactions. For $k = 1$, usually no weighting is used ($w_1^{l} = 1$). For $k = 2$ and $k = 3$, the following exponential decay functions are implemented in DScribe:

$$w_2^{l,m} = e^{-s_k \lvert R_l - R_m \rvert}, \qquad w_3^{l,m,n} = e^{-s_k \left( \lvert R_l - R_m \rvert + \lvert R_m - R_n \rvert + \lvert R_l - R_n \rvert \right)}.$$

Table B1. Values of the optimal KRR hyperparameter $\lambda$ obtained by cross-validation as a function of descriptor type and training set size. The procedure was repeated 10 times with re-shuffled data. Average values ($\bar{\lambda}$) were used in further KRR models. We also report the statistical standard deviation $\Delta\lambda$. […]

Table B2. Values of the optimal KRR hyperparameter $\gamma$ obtained by cross-validation as a function of descriptor type and training set size. The procedure was repeated 10 times with re-shuffled data. Average values ($\bar{\gamma}$) were used in further KRR models. We also report the statistical standard deviation $\Delta\gamma$. […]