Predicting gas–particle partitioning coefficients of atmospheric molecules with machine learning

Lumiaro, Emma; Todorović, Milica; Kurten, Theo; Vehkamäki, Hanna; Rinke, Patrick

doi:https://doi.org/10.5194/acp-21-13227-2021

Articles | Volume 21, issue 17

© Author(s) 2021. This work is distributed under
the Creative Commons Attribution 4.0 License.

Research article | Highlight paper | 06 Sep 2021

Final revised paper (published on 06 Sep 2021)
Preprint (discussion started on 26 Jan 2021)

Interactive discussion

Status: closed

RC1:
'Comment on acp-2020-1258', Frank Wania, 13 Feb 2021

When we first judged COSMOtherm as “the most promising approach” for predicting the equilibrium partitioning properties of SOA-related compounds (Wania et al., 2014), we qualified it as being “in the long term”, because of the considerable computational effort required. Also, much less computationally costly approaches yielded predictions of the partition coefficient between a water-insoluble organic matter phase and the gas phase (K_WIOM/G) that were very similar to those obtained by COSMOtherm. However, for multifunctional compounds those simpler approaches failed to predict partition coefficients between an infinitely dilute solution in pure water and the gas phase (K_W/G) that are similar to those obtained by COSMOtherm, because only the latter accounts for the influence of conformation on intra-molecular interactions (Wang et al., 2017). We therefore argued that “the expertise and time required to perform quantum-chemical calculations for atmospherically relevant molecules should constitute but a minor impediment to a wider adoption” (Wang et al., 2017). I am therefore very pleased to see that with their work, Lumiaro et al. have now obliterated even this minor impediment. While it would have been possible to make COSMOtherm-based predictions for datasets much larger than the 3414 molecules in Wang et al. (2017) using “brute force” and high-performance computing resources, Lumiaro et al. demonstrate convincingly that this can be achieved with much less computational effort using machine learning approaches.

The paper is very well written and, apart from some parts of the Methods section, easily accessible to those who are not familiar with computational chemistry and machine learning approaches. I have only a few questions and suggestions for improvement:

The compounds to which the trained algorithm was applied have very limited structural diversity (only normal decanes functionalized with up to six functional groups of only three types). Why was this relatively simple dataset of molecules generated, instead of using existing molecular datasets of atmospherically relevant species? For example, Valorso et al. (2011) generated > 200,000 oxidation products of α-pinene, i.e. one of the monoterpenes judged to be among “the most interesting molecules from a SOA-forming point of view” (line 307). A recent study generated datasets of ~200,000, ~550,000 and ~750,000 atmospheric oxidation products of decane, toluene and α-pinene (Isaacman-VanWertz and Aumont, 2020).

Can the authors explain in more detail how a machine-learning model that is not fed with information on the conformations of a molecule is “capable of accounting for hydrogen-bonding interactions between functional groups” (line 366)? Is this merely by structural similarity with molecules within the training set that also have such capabilities?

In this context, it is stated on line 380: “MBTR encoding requires knowledge of the 3-dimensional molecular structure, which raises the issue of conformer search”, but section 2.2.2 does not spell out how that issue was resolved in the current study.

Can the authors propose how, in the future, the atmospheric community will be able to obtain predictions for atmospherically relevant molecules, i.e. how a trained machine learning algorithm or its predictions could be made available for use by others? The authors still intend to improve this algorithm by extending the “training set to encompass especially atmospheric autoxidation products” (line 388), i.e. may not yet want to make the existing version accessible to others. However, it may be instructive to hear what this could look like eventually. Is it conceivable to create easy-to-use software or a webpage that is fed batches of SMILES and generates K_W/G, K_WIOM/G and P_Sat as calculated by the algorithm? Or would that take the form of a searchable database that has such algorithm-generated values stored for the “10⁴ −10⁷ different organic compounds” (line 60) of atmospheric interest?

Many atmospheric applications require knowledge of phase partitioning at variable temperatures. COSMOtherm can also calculate the enthalpy of vaporization and the internal energies of phase transfer between the gas phase and water or WIOM. It would probably be advisable to eventually also train a machine learning algorithm to predict those thermodynamic properties.
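To illustrate why such thermodynamic properties matter, the following is a minimal sketch of how a partition coefficient known at one temperature could be shifted to another with the integrated van 't Hoff relation. It assumes an internal energy of phase transfer that is constant over the temperature range; the function name, its parameters and the example value of -50 kJ/mol are illustrative assumptions and are not taken from the manuscript or from COSMOtherm.

```python
import math

R = 8.314462618  # molar gas constant, J mol^-1 K^-1


def adjust_log10_k(log10_k_ref, delta_u_j_per_mol, t_ref_k, t_k):
    """Shift log10 K from t_ref_k to t_k (both in kelvin).

    Integrated van 't Hoff relation, assuming delta_u_j_per_mol (the
    internal energy of transfer from the gas phase into the condensed
    phase, in J/mol) does not change with temperature:
        ln K(T) = ln K(T_ref) - (dU / R) * (1/T - 1/T_ref)
    """
    ln_shift = -(delta_u_j_per_mol / R) * (1.0 / t_k - 1.0 / t_ref_k)
    return log10_k_ref + ln_shift / math.log(10.0)


# Hypothetical compound: log10 K = 5.0 at 298.15 K, dU = -50 kJ/mol.
log10_k_cold = adjust_log10_k(5.0, -50_000.0, 298.15, 278.15)
```

With a negative (exothermic) transfer energy, cooling increases the partition coefficient, i.e. the compound shifts towards the condensed phase.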

I find Figure 2 not particularly useful. While it could be beneficial to have a representation of the machine learning workflow, it should look less generic than what is depicted here. For example, “representations” make no appearance in that diagram, but are obviously an important part of the process. Also, the training and testing of the machine learning algorithm is presumably a key element of the workflow.

Minor things:

Footnote on page 2: While it is indeed quite common to estimate the K_O/G by dividing K_O/W by K_G/W (e.g. Meylan and Howard, 2005), this is only an approximation. Whereas the octanol phase in a K_O/W measurement is saturated with water and the aqueous phase is saturated with octanol, the solvents in a K_W/G and K_O/G measurement are typically pure. This can lead to a failure of the thermodynamic triangle to correctly estimate K_O/G for hydrophobic substances (Beyer et al. 2002).

Line 96. The abbreviation KRR is used here for the first time, but is only introduced on line 106.

Line 134: bromine not bromide

Line 146: The Pyzer-Knapp et al. reference is missing the year “2015” (also in the reference list)

Line 154: What does it mean if a molecular representation is “continuous”?

Line 320: Explain the meaning of “cheaper to evaluate”.

Line 331-332: I find this sentence very confusing and I wonder whether “or less” at the end of line 331 should be deleted.

Line 336: “by almost a factor of 4000”.

Line 397 and 398: If “Zenodo, 2020” and “Gitlab, 2020” are references, they are missing from the reference list. Wouldn’t it be better to provide complete links to those datasets?
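For reference, the thermodynamic triangle discussed in the comment on the page 2 footnote amounts to a single subtraction on the logarithmic scale. The sketch below is only an illustration of that approximation; the function name and the input values are hypothetical, and the result inherits the mutual-saturation caveat described in that comment.

```python
def log10_k_og_from_triangle(log10_k_ow, log10_k_gw):
    """Estimate log10 K_O/G from the thermodynamic triangle.

    K_O/G ~ K_O/W / K_G/W, so on the log scale the two inputs subtract.
    This is an approximation: K_O/W refers to mutually saturated octanol
    and water, whereas K_O/G and K_G/W usually refer to pure solvents.
    """
    return log10_k_ow - log10_k_gw


# Hypothetical values: log10 K_O/W = 3.0 and log10 K_G/W = -2.0
estimate = log10_k_og_from_triangle(3.0, -2.0)
```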

References

Beyer, A., Wania, F., Gouin, T., Mackay, D., and Matthies, M.: Selecting internally consistent physicochemical properties of organic compounds. Environ. Toxicol. Chem. 21, 941-953, 2002.

Isaacman-VanWertz, G., and Aumont, B.: Impact of structure on the estimation of atmospherically relevant physicochemical parameters. https://doi.org/10.5194/acp-2020-1038

Meylan, W.M. and Howard, P. H.: Estimating octanol–air partition coefficients with octanol–water partition coefficients and Henry’s law constants. Chemosphere, 61, 5, 640-644, 2005.

Valorso, R., Aumont, B., Camredon, M., Raventos-Duran, T., Mouchel-Vallon, C., Ng, N. L., Seinfeld, J. H., Lee-Taylor, J., and Madronich, S.: Explicit modelling of SOA formation from α-pinene photooxidation: sensitivity to vapour pressure estimation, Atmospheric Chem. Phys., 11, 6895–6910, 2011.

Wang, C., Yuan, T., Wood, S. A., Goss, K.-U., Li, J., Ying, Q. and Wania, F.: Uncertain Henry’s law constants compromise equilibrium partitioning calculations of atmospheric oxidation products. Atmos. Chem. Phys., 17, 7529-7540, 2017.

Wania, F., Lei, Y. D., Wang, C., Abbatt, J. P. D., and Goss, K.U.: Novel methods for predicting gas-particle partitioning during the formation of secondary organic aerosol. Atmos. Chem. Phys. 14, 13189–13204, 2014.

Citation: https://doi.org/10.5194/acp-2020-1258-RC1
- AC1: 'Reply on RC1', Emma Lumiaro, 05 May 2021
  
  The comment was uploaded in the form of a supplement: https://acp.copernicus.org/preprints/acp-2020-1258/acp-2020-1258-AC1-supplement.pdf
  
  Citation: https://doi.org/10.5194/acp-2020-1258-AC1
RC2:
'Comment on acp-2020-1258', Anonymous Referee #2, 16 Feb 2021

Review of Predicting Gas-Particle Partitioning Coefficients of Atmospheric Molecules with Machine Learning by Lumiaro et al.

It was interesting to read this manuscript. The topic of the manuscript is the prediction of saturation vapor pressures and of partitioning coefficients between the gas phase and, respectively, an aqueous phase and an organic phase, all of which are relevant in atmospheric science. There is a lack of experimental data on such properties and, given the overwhelming number of different molecules in the atmosphere, reliable computational methods that can predict such properties for a large number of molecules are valuable. In this work, the authors explore the use of a machine learning method to predict selected thermodynamic properties for a large number of molecules, which seems very promising and timely.

Having said this, I also need to say that I find the manuscript difficult to read in some aspects, and that the link to atmospheric chemistry and physics could be better explained and discussed. I am not an expert on machine learning, so many of my suggestions below are suggestions for improving the manuscript in relation to atmospheric relevance, which I hope the authors will find useful. To improve the manuscript, I also find that the authors should give a clearer description of the thermodynamic framework for what they are calculating and discuss the limitations and relevance.

Major comments

References – I do not find that there are enough references to the literature throughout the introduction. As an example, statements like “They scatter and absorb solar radiation and form cloud droplets in the atmosphere, affect visibility and human health and are responsible for large uncertainties in the study of climate change.” and “Most aerosol particles are secondary organic aerosols” should be accompanied by one or more literature references. Likewise, in section 4 on prediction, I miss examples and references for the statements on, for example, functionalization and fragmentation.

The thermodynamic basis – vapor pressures and partitioning coefficients.

I expect several of the low-volatility species will be solids at room temperature and likely exist in the subcooled liquid state in the atmosphere. There can be a large difference between the vapor pressure of the solid and that of the subcooled liquid. I assume the vapor pressures calculated are for the subcooled liquid state. This should be specified. Likewise, it should be better explained to the reader what the physical meaning of the partitioning coefficients is. Do they represent partitioning over a flat surface? It says they are infinite dilutions – does this mean the activity coefficients are one? What values are assumed for the activity coefficients? Partitioning in the atmosphere depends on many things, including particle size, amount of condensed material and accommodation coefficients – I suggest this is recognized and addressed.

Where does the formula for calculation of saturation vapor pressure come from? Please give a derivation or a reference. The saturation vapor pressure is a property of the pure component – but here it seems to depend on the activity in a mixture and a partitioning coefficient? The equilibrium vapor pressure over a mixture depends on the activity?

What is meant by the statement “Saturation vapor pressure describes the interaction of a compound with itself” (page 2 line 29/30), and by “partitioning coefficients (K) for the interaction of the compound with representative other species.”? I would say that it is the activity coefficients that account for interactions between molecules in the condensed phase. In the gas phase – do the authors consider molecular interactions?

General comments

Some sentences are unclear, e.g. “For relatively simple organic compounds, efficient empirical parametrizations have been developed to predict their condensation-relevant properties.” The authors should help the reader here with clearer definitions: what is a “relatively simple organic compound”, what are the exact condensation-relevant properties, and which efficient empirical parameterizations are the authors referring to here (references should be given)?

To help the reader, I also suggest restructuring the manuscript a bit and defining the coefficients that are modelled already in the introduction.

How were vapor pressures obtained/calculated from COSMOtherm? This is unclear from the manuscript and should be specified.

Could the authors reflect on why the MBTR method performs so much better than the other methods?

Accuracy and performance: It should be stated explicitly what the COSMOtherm accuracy is, both on the predicted saturation vapor pressures and on the partitioning coefficients.

Page 7 line 158 – what is “good performance”?

Atmospheric context

I miss a short description of which parent VOCs were considered for the basis set used.

Regarding the prediction section. As the authors write monoterpenes are relevant molecules and as I understand the choice of 10 carbon atoms is based on monoterpenes. The choice of a linear alkane chain is motivated by simplicity – but is it relevant in the atmosphere from monoterpene oxidation? Are all the molecules studied in the master chemical mechanism? – I would have expected at least some molecules with a ring structure included.

The authors discuss the formation of particles several times. Is there a reference for some sort of threshold vapor pressure value? For example, on page 2, line 50, a threshold value of 10^-12 Pa for nucleation is given.

In the abstract it says: “The resulting saturation vapor pressure and partitioning coefficient distributions were physico-chemically reasonable, and the volatility predictions for the most highly oxidized compounds were in qualitative agreement with experimentally inferred volatilities of atmospheric oxidation products with similar elemental composition.”

I do not see justification for this in the manuscript. I miss examples (optimally for all the compounds) where the authors give the experimental vapor pressure, the vapor pressure obtained from a state-of-the-art group contribution method, the COSMOtherm vapor pressure and the vapor pressure obtained using the machine learning code, and discuss differences and similarities. For the lowest vapor pressures, experimental data are not available. The authors should give the range of vapor pressures where the model can be compared with experimental data. It is not clear what is meant by elemental composition – normally the molecular formula or even the structural formula is needed to predict a vapor pressure.

Other

Page 2 line 3: Several experimental techniques are capable of measuring saturation vapor pressures of 10^-5 Pa. It would be appropriate to cite literature providing experimental vapor pressures. What is the definition of non-volatile that the authors use?

Page 3 line 63: “Here, we take a different approach compared to previous parametrization studies, and consider a data-science perspective (Himanen et al., 2019). Instead of assuming chemical or physical relations, we let the data speak for itself.” – what is meant by letting the data speak for itself?

Figure 9 b: what is on the y-axis - is it a percentage? or an absolute number?

Page 16: “This result demonstrates that unlike the simplest group-contribution models (which would invariably predict that the lowest-volatility compounds in our C10 dataset should be the tetrahydroxydicarboxylic acids), both the original COSMOTherm predictions, and the machine-learning model based on them, are capable of accounting for hydrogen-bonding interactions between functional groups.”

I am not sure this statement is quite fair -- to my knowledge state of the art group contribution methods (e.g. those on the UMAN Sysprop webpage) include interactions – which simple group contribution methods are the authors referring to and are such simple methods being used in atmospheric simulations?

Citation: https://doi.org/10.5194/acp-2020-1258-RC2
- AC2: 'Reply on RC2', Emma Lumiaro, 05 May 2021
  
  The comment was uploaded in the form of a supplement: https://acp.copernicus.org/preprints/acp-2020-1258/acp-2020-1258-AC2-supplement.pdf
  
  Citation: https://doi.org/10.5194/acp-2020-1258-AC2
RC3:
'Comment on acp-2020-1258', Anonymous Referee #3, 23 Mar 2021

The authors utilize machine learning to predict saturation vapor pressure and two equilibrium-partitioning coefficients for gas-particle partitioning. For training and validating the machine learning model they use a dataset obtained by COSMOtherm calculations of these observables for atmospheric oxidation product molecules.

The paper is well written, the topic timely and of great interest for the readers of ACP and I recommend publishing but ask the authors to take the following comments and suggestions into account.

I have one very general concern, which does not relate to the machine learning approach presented here, but to the underlying COSMOtherm data set. The authors write (e.g. line 49 page 2) that the COSMOtherm predictions have an order of magnitude accuracy. However, for a number of compounds at low saturation vapor pressures there have been studies comparing experimental saturation vapor pressures with COSMOtherm predictions and finding much larger deviations (e.g. Bannan et al., 2017, Krieger et al. 2018). It should be pointed out that the COSMOtherm model has been “calibrated” with a parametrization dataset of known compounds, which are potentially biased to high saturation vapor pressures (Klamt et al. 1998). Therefore, the accuracy of the underlying reference data may be only several orders of magnitude for low saturation vapor pressure components.

For gas-particle partitioning, the saturation vapor pressure range from about 10^-11 kPa to about 10^-3 kPa is relevant (e.g. Valorso et al. 2011, or the discussion starting in the last paragraph of page 2). However, Fig. 3c shows that there are hardly any molecules in the dataset below 10^-8 kPa. Actually about half of the dataset contains compounds, which will be entirely in the gas phase under atmospheric conditions. Does this pose a problem?

Related: the last paragraph on page 6 states that Wang’s dataset is rather small for machine learning but internally consistent. I intuitively understand that this helps the machine-learning model to succeed in predicting well. However, the authors write that Sander’s dataset of 17350 Henry’s law constants is not internally consistent (unlike Wang’s dataset). But what if Sander’s data are the correct ones? What if the real world is more complex than what is predicted by COSMOtherm? Would the machine learning approaches fail because there are no easy “rules” the machine-learning algorithm can pick out of the dataset? Would the output of a model trained with these data just produce random partitioning coefficients within the range of the dataset? These questions are probably impossible to answer without doing the experiment. It would have been very interesting to see how the machine-learning model performs on the dataset of Sander, but this is clearly beyond the work presented here.

I find section 2.2.4 rather brief. As I am not familiar with the topic, I cannot follow it despite Fig. 4d. Could it be extended a bit?

Discussion on page 16: Related to my comments above, without experimental vapor pressures for the C10 compounds being available, this discussion is interesting, but there may be surprises if experimental vapor pressures become available. I feel the authors should clearly state that the COSMOtherm predictions are not validated in this pressure regime at all.

Technical comment:

Page 12, line 292: Figure 5 should be Fig. 3, correct?

References:

Bannan, T. J. et al.: Measured Saturation Vapor Pressures of Phenolic and Nitro-aromatic Compounds, Environ. Sci. Technol. 2017, 51, 3922−3928.

Klamt, A.; et al.: Refinement and Parametrization of COSMO-RS, J. Phys. Chem. A 1998, 102, 5074-5085.

Krieger, U. K. et al.: A reference data set for validating vapor pressure measurement techniques: homologous series of polyethylene glycols, Atmos. Meas. Tech., 11, 49–63, 2018.

Valorso, R. et al.: Explicit modelling of SOA formation from α-pinene photooxidation: sensitivity to vapour pressure estimation, Atmos. Chem. Phys., 11, 6895–6910, 2011.

Citation: https://doi.org/10.5194/acp-2020-1258-RC3
- AC3: 'Reply on RC3', Emma Lumiaro, 05 May 2021
  
  The comment was uploaded in the form of a supplement: https://acp.copernicus.org/preprints/acp-2020-1258/acp-2020-1258-AC3-supplement.pdf
  
  Citation: https://doi.org/10.5194/acp-2020-1258-AC3

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

AR by Emma Lumiaro on behalf of the Authors (26 May 2021)

ED: Publish as is (12 Jul 2021) by Gordon McFiggans

AR by Emma Lumiaro on behalf of the Authors (21 Jul 2021)

Short summary

The study of climate change relies on climate models, which require an understanding of aerosol formation. We train a machine-learning model to predict the partitioning coefficients of atmospheric molecules, which govern condensation into aerosols. The model can make instant predictions based on molecular structures with accuracy surpassing that of standard computational methods. This will allow the screening of low-volatility molecules that contribute most to aerosol formation.