HUB: A method to model and extract the distribution of ice nucleation temperatures from drop-freezing experiments

. The heterogeneous nucleation of ice is an important atmospheric process facilitated by a wide range of aerosols. Drop-freezing experiments are key for the determination of the ice nucleation activity of biotic and abiotic ice nucleators 10 (INs). The results of these experiments are reported as the fraction of frozen droplets 𝑓 !"# ( 𝑇 ) as a function of decreasing temperature, and the corresponding cumulative freezing spectra 𝑁 ! ( 𝑇 ) computed using Vali’s methodology. The differential freezing spectrum 𝑛 ! ( 𝑇 ) is an approximant to the underlying distribution of heterogeneous ice nucleation temperatures 𝑃 ! ( 𝑇 ) that represents the characteristic freezing temperatures of all IN in the sample. However, 𝑁 ! ( 𝑇 ) can be noisy, resulting in a differential form 𝑛 ! 𝑇 that is challenging to interpret. Furthermore, there is no rigorous statistical analysis of 15 how many droplets and dilutions are needed to obtain a well-converged 𝑛 ! ( 𝑇 ) that represents the underlying distribution 𝑃 ! ( 𝑇 ) . Here, we present the HUB method and associated Python codes that model (HUB-forward code) and interpret (HUB-backward code) the results of drop-freezing experiments. HUB-forward predicts 𝑓 !"# ( 𝑇 ) and 𝑁 ! ( 𝑇 ) from a proposed distribution 𝑃 ! ( 𝑇 ) of IN temperatures, allowing its users to test hypotheses regarding the role of subpopulations of nuclei in freezing spectra, and providing a guide for a more efficient collection of freezing data. HUB-backward uses a stochastic 20 optimization method to compute 𝑛 ! ( 𝑇 ) from either 𝑁 ! ( 𝑇 ) or 𝑓 !"# ( 𝑇 ) . The differential spectrum computed with HUB-backward is an analytical function that can be used to reveal and characterize the underlying number of IN subpopulations of complex biological samples (e.g. ice nucleating bacteria, fungi, pollen), and quantify the dependence of these subpopulations on environmental variables. By delivering a way to compute the differential spectrum from drop freezing data, and vice-versa, the HUB-forward and HUB-backward codes provide a hub to connect experiments and interpretative physical 25 quantities that can be analysed with kinetic models and nucleation theory.


Introduction
Ice nucleators (INs) of biological and abiotic origins present in aerosols are responsible for facilitating the heterogeneous freezing of atmospheric water droplets above the homogeneous nucleation temperature Demott et al., 2016;Demott et al., 2003). The potential of these aerosols as ice nuclei has significant implications for cloud properties and 30 precipitation patterns (Gettelman et al., 2012;Mülmenstädt et al., 2015;Froyd et al., 2022). Freezing experiments are key sources of information to determine the range of temperatures over which INs promote ice nucleation. The most common method to characterize INs is through immersion freezing experiments, for which a wide range of assays and instruments have been developed. A comprehensive report of various drop freezing techniques can be found in (Miller et al., 2021). The assays are typically performed by placing uniformly sized water droplets with a known IN concentration or area on a 35 substrate or in a multiwall plate that is gradually cooled from a temperature above 0 o C until all droplets are frozen (Kunert et al., 2018;Budke and Koop, 2015). Droplet freezing is detected visually or through measurement of the latent heat release (Stratmann et al., 2004;Budke and Koop, 2015;Kunert et al., 2018;Reicher et al., 2018), allowing the assignment of a heterogeneous nucleation temperature to each droplet. Drop-freezing experiments record the fraction of frozen droplets, !"# , as a function of decreasing temperature; for soluble or dispersible INs !"# curves are typically collected at 40 various ten-fold dilutions of the IN sample.
Historically, there have been two interpretations of the dispersion of nucleation temperatures in heterogeneous freezing experiments. The first approach suggests that the stochastic nature of the nucleation process dominates the variability in freezing temperatures (Bigg, 1953;Carte, 1956), while the second approach assumes that the dispersion in temperatures mostly arises from a distribution of nucleation sites (Fletcher, 1969), each with a deterministic, singular 45 nucleation temperature (Levine, 1950;Vali and Stansbury, 1966). Variability in the temperature, volume, and amount of icenucleating particles per droplet can also contribute to the dispersion of freezing temperatures (Vali, 2019;Knopf et al., 2020). There is consensus now that both stochastic effects and sample heterogeneities contribute to the distribution of freezing temperatures, and both approaches are used for the modelling of drop-freezing experiments (Vali, 1971;Niedermeier et al., 2011;Murray et al., 2011;Broadley et al., 2012;Herbert et al., 2014;50 Harrison et al., 2016;Alpert and Knopf, 2016;Vali, 2019;Fahy et al., 2022a). Stochastic modelling of the freezing curves is based on predicting the survival probability of liquid water containing IN as a function of supercooling, and requires a model for the temperature dependence of the nucleation rate of the IN components. These models have been solved numerically or evolved with Monte Carlo simulations to interpret or resolve the distribution of ice nucleation properties of minerals Murray et al., 2011;Broadley et al., 2012;Herbert et al., 2014;Harrison et 55 al., 2016) and organics Alpert and Knopf, 2016) and to perform parametric bootstrapping of experimental data Harrison et al., 2016). The advantage of the stochastic modelling approach is that it enables a direct link to microscopic properties of the nuclei and can account for the cooling rate dependence of the !"# data. These approaches require the use of analytical models for the freezing rates and their distribution in the sample.
The modelling of freezing experiments based on the singular approach is based on the framework proposed by Vali 60 (Vali, 1971). He assumed that each particular IN has a characteristic ice nucleation temperature that is independent of the cooling history. This implies that the IN with the highest characteristic nucleation temperature in a droplet is responsible for its freezing. Given a total number of droplets ! , the number of frozen droplets ! ( ) at a temperature T gives the range of characteristic freezing temperatures that determines the ice nucleation activity and is used to produce the cumulative freezing spectrum (Vali, 1971;Vali, 2014Vali, , 2019, 65 where ! = ! − ! ( ) is the number of unfrozen droplets, !"# = ! ( )/ ! is the fraction of frozen droplets at temperature T, and is a normalization factor per unit volume of water, unit mass, or surface of the INs (Vali, 2019). For soluble INs, the normalization factor is commonly defined by the mass of the ice nucleating material = ( !"#$ ), where is the density of the initial solution, !"#$ is the droplet volume and is the dilution factor (Kunert et al., 2018). The IN surface area per drop, = !"#$ , is sometimes used as normalization factor for insoluble INs (e.g., dust, crystals), 70 resulting in a cumulative spectrum per area denoted as ! . However, it is challenging to measure the total IN surface area accurately (Knopf et al., 2020). We note that Eq. (1a) can be used even when the absolute concentrations or areas of the IN are unknown, provided that the user knows the relative concentration of the dilution series derived from a parent sample. The differential freezing spectrum ! is obtained by differentiation of the cumulative spectrum, (Vali, 1971) The differential spectrum identifies the density of IN active at each temperature, and was identified by Vali as the central 75 quantity that can be derived and interpreted from drop-freezing experiments (Vali, 1971;Vali, 2019).
The determination of the differential spectrum from the cumulative one by finite differentiation is subject to significant noise, requiring a careful selection of the temperature intervals and extensive sampling (Vali, 2019). As stochastic effects are not considered in the singular temperature formalism, the cumulative and differential spectra should -in principledepend on the cooling rate (Vali, 1994). The stochastic nature of ice nucleation, combined with the uncertainties associated 80 with the experimental measurements (e.g., different droplet volumes, inhomogeneous samples, different detection efficiencies), can produce significant variations in the cumulative freezing spectra, that result in large uncertainties in ! .
Parametric and nonparametric bootstrapping based on the singular approximation and Monte Carlo simulations have been used to estimate confidence intervals in freezing spectra measurements (Vali, 2019;Fahy et al., 2022a;Fahy et al., 2022b).
A central assumption of the singular freezing approximation is that the freezing of a droplet containing multiple INs 85 is promoted by the IN with the highest nucleation temperature (Levine, 1950). The extreme-value sampling is apparent in the concentration dependence of !"# in experiments Budke and Koop, 2015;Kunert et al., 2018;Lukas et al., 2022). Using probability theory, Levine demonstrated that if the distribution of ice nucleation temperatures of the IN population follows an exponential distribution, then the sampling of droplet freezing temperatures corresponds to a Gumbel distribution, and the median freezing temperature T MED of the droplets scales with the logarithm of the number (or 90 total nucleating area) of IN per droplet (Levine, 1950). Sear more recently demonstrated that Levine's approach is a particular solution for a generalized extreme-value problem, and used modern extreme value statistics to derive the scaling of T MED with the number of IN sites per droplet for the three generalized extreme value distributions (GEV): Gumbel that would arise from an underlying IN distributions with exponential tails, Frechet from those with power law tails, and Weibull from those with an upper cutoff in the freezing temperature of the IN (Sear, 2013). However, there are limitations for the use 95 of the analytical approaches of Sear and Levine for the interpretation of actual drop freezing data. First, the extreme value sampling results in one of the three GEV only in the limit of extremely large number of IN per droplet, while in experiments the sampling is typically performed over dilutions down to a few IN per droplet. There is no analytical formulation for the dependence of the extreme value distribution in the low to intermediate concentration regime. Second, the analytical theory assumes that the sampling is complete (i.e. the number of droplets is extremely large), while experiments are typically 100 performed with tens to hundreds of droplets. Third, Sear notes that there is no general analytical theory to predict the GEV from a mixture of populations of nuclei with different temperature dependences (Sear, 2013). In this study we overcome these three limitations through a numerical implementation of extreme-value statistics for the modelling of drop-freezing experiments.
A consequence of extreme-value sampling is that the differential spectrum ! represents the underlying 105 distribution of ice nucleation temperatures of all INs in the sample, which we denote as ! ( ), only when the sampling of IN in the drop-freezing experiments is complete. The underlying distribution ! ( ) is akin to a hub that connects the experimental freezing temperatures to physical analysis based on nucleation theory or kinetic and equilibrium models that can elucidate the mechanisms and origins of the distributions of INs (Fig. 1). We here call the cumulative spectrum ! obtained through Eq. (1a) in this complete sampling limit the intrinsic cumulative spectrum of the system, ! (Fig. 1). 110 While there is consensus that the quality of the freezing spectrum increases with the number of droplets, a rigorous analysis of how many droplets and IN dilutions should be measured to provide accurate freezing spectra is still lacking. The first goal of the present study is to provide a strategy to optimize the sampling of drop-freezing experiments to derive interpretable differential spectra that is a good approximant of the underlying distribution of heterogeneous ice nucleation temperatures of the sample. 115

120
The existence of subpopulations or classes in the population of INs (e.g. different classes of bacterial INs, different ice nucleating sites on complex materials like dust) (Turner et al., 1990) is common in atmospheric aerosols. While several studies have broadly defined populations from the cumulative spectra by the range of nucleation temperatures they encompass (Turner et al., 1990;Creamean et al., 2019) or the origin of the sample (Steinke et al., 2020), there is currently no 125 simple procedure to identify and quantify subpopulations or classes from cumulative freezing spectra ! . The second aim of our study is to map the cumulative freezing spectrum ! ( ) into the differential spectrum ! ( ), in terms of subpopulations that may correspond to different physical nucleation sites in the sample.
To reach the aims above, we develop a method we name HUB (for Heterogeneous Underlying-Based) to model and interpret the results of drop-freezing experiments and provide its associated Python code and user manual 130 (https://github.com/Molinero-Group/underlying-distribution). Our method relies on the singular interpretation of freezing experiments: we assume that each individual IN has a characteristic nucleation temperature independent of its cooling history, and that the freezing of a droplet containing multiple INs is promoted by the IN with the highest nucleation temperature. This second assumption allows the use of extreme value statistics (Castillo, 2005;David and Nagaraja, 2004;Gumbel, 2012;De Haan and Ferreira, 2006) to model and interpret the data. 135 We present two implementations of the HUB analysis code. The HUB-forward code allows the user to postulate an underlying distribution of heterogeneous nucleation temperatures ! ( ) in the system of interest. The HUB-forward code uses the singular approximation and extreme-value statistics to generate an artificial IN dilution series similar to those obtained in experiments, from which it computes the fraction of frozen droplets !"# ( ) and from these derive ! ( ) using Vali's equation (Fig. 1). The HUB-backward code works in reverse, extracting the differential spectrum ! ( ) from a given 140 cumulative ! ( ) using a stochastic optimization procedure (Fig. 1). HUB-backward allows the decomposition the total population from ! ( ) into subpopulations. The combination of HUB-forward and HUB-backward allows for an analysis of the sensitivity of ! ( ) to the number of droplets and dilutions, and the impact of the sampling on the closeness of the differential spectrum ! ( ) to the underlying distribution ! ( ). The determination of distributions obtained from the HUBbackward code could further enable the interpretation of the experimental ice nucleation spectra with size and structure of 145 INs using nucleation theory, kinetic models, and molecular simulations. For example, (Schwidetzky et al., 2023) illustrates the use of the distribution of freezing temperatures obtained with HUB-backward together with classical nucleation theory for finite surfaces to interpret the size of the IN of Fusarium acuminatum. This paper is organized as follows: Section 2 presents the methodology: Section 2.1 discusses the details on the implementation of HUB-forward, while Section 2.2 describes the HUB-backward procedure to find the differential spectrum 150 ! ( ) and discusses how to determine whether or not ! ( ) has converged to the underlying distribution ! ( ). Section 3 presents examples of applications of both HUB-forward and HUB-backward codes and their capabilities. Section 3.1 analyses the effect of the number of droplets sampled on the cumulative freezing spectrum ! . Section 3.2 uses HUBbackward to compute the differential spectra ! ( ) of various biological INs with increasing grade of complexity in their cumulative freezing spectra. Section 3.3 demonstrates how to extract ! ( ) from the experimental fraction of ice !"# ( ) of 155 insoluble INs and the impact of the cooling rate on ! ( ). We end in Section 4 with a discussion of the main conclusions and outlook.

2.1
HUB-forward method to compute the fraction of frozen droplets ( ) and cumulative freezing spectrum ( ) from a known underlying distribution ( ). 160 In the HUB-forward analysis we know or assume an underlying distribution ! ( ) of ice nucleation temperatures for the IN in the sample, and generate from it an artificial IN dilution series similar to those obtained in experiments, from which we compute the cumulative freezing spectrum ! (T) using Vali's equation (Eq. (1a)). Using this approach, we investigate the relationship between ! ( ) and ! ( ) (Fig. 1) and the sensitivity of ! ( ) with respect to the number of droplets and dilutions. For generality, we represent ! ( ) as a linear combination of normalized continuous distributions ! that 165 represent subpopulations of freezing temperatures: where is the total number of subpopulations, ! , ! , . . . , ! are normalized distribution functions, and ! , ! , . . . , ! are their weights such that where each subpopulation ! is further characterized by its most likely temperature of freezing !"#$,! and spread of distribution of freezing temperatures ! . We also provide in the HUB code the option for the user to use the log-normal 175 distribution, which has a tail towards higher temperatures, or the left-tailed Gumbel distribution, which has a tail towards lower temperatures. In our model, we assume that the underlying distribution of ice nucleating temperatures ! ( ) does not change with the concentration of INs. This last condition is violated when IN are involved in chemical, aggregation, or solubility equilibria that alter the proportionality between their concentration and the dilution factor of the sample, resulting in a lack of overlap of the pieces of the cumulative spectra ! ( ) obtained from different dilutions  Dedekind, 2020) The number of INs in each droplet is then given by the Poisson distribution: where is the actual number of INs in each droplet and represents the average number of INs among all droplets of the corresponding dilution. Fig. 2A shows the probability mass function (PMF) for = 1, 5, and 10, computed according to Eq. (4) and sampling over ! = 10 ! droplets using the "SciPy Stats" Python framework (Virtanen et al., 2020). As 185 increases, the probability that any droplet nucleates homogeneously rapidly approaches zero (inset of Fig. 2A). When there is one IN on average per droplet ( = 1) ~37 % of the droplets do not have any IN, i.e., they are "empty" droplets that would nucleate at the homogeneous nucleation temperature. We note that by performing dilutions until a sizeable fraction of droplets nucleate homogeneously, it is possible to calibrate the absolute concentration of ice nuclei in the original, undiluted,

sample. 190
To illustrate how the heterogeneous ice nucleation temperatures recorded in drop-freezing experiments depend on the number of INs in the droplets, we start from two examples with ! ( ) represented by one or two Gaussian subpopulations, shown with black dashed lines in Fig. 2B and C, respectively. We assign a temperature to each IN contained in droplets from a 10-fold dilution series of 5 solutions with = 1, 10, 10 ! , 10 ! , and 10 ! average number of IN per droplet.
If the droplet volume is constant, is proportional to the concentration of INs in the droplets. We sample ! = 10 ! droplets 195 for each concentration. This ! is much higher than the ~100 droplets usually sampled in laboratory experiments; we address the effect of sampling in Section 3.1 below.
To sample independent random values for each IN, the number of random variates, which are drawn from ! ( ), is the total number of INs among ! droplets. Thereby, each droplet has a set of temperatures !
the droplet index and is the IN index. Since we assume that freezing occurs at the characteristic temperature of the IN with 200 the highest freezing temperature, the nucleation temperature for each droplet is defined as the maximum i.e., the extreme upper value, of several independent freezing temperatures !!",!
for different values of , namely !"# ! ( ). Therefore, ! ( ) represents the underlying probability of heterogeneous ice nucleation temperatures independent of the concentration of INs, while !"# ! ( ) represents the concentration-dependent distribution, and has the same units as ! ( ) and the differential spectrum. According to the Fisher-205 Tippett-Gnedenko theorem, the distribution of extreme upper values of the Gaussian distribution is the right-skewed Gumbel distribution (Castillo, 2005;David and Nagaraja, 2004;Gumbel, 2012;De Haan and Ferreira, 2006), which has a fatter tail on the high-temperature side of its maximum. The shift of !!" ! ( ) curves in HUB-forward computes the fraction of frozen droplets and cumulative spectra from a proposed underlying distribution of freezing temperatures, using extreme-value statistics. The fraction of frozen droplets !"# ! can be calculated as a function of the concentration-dependent distribution, 225 where the integration is from the ice melting temperature ! to the temperature . ! ! is the total number of droplets that freeze heterogeneously and ! is the total number of droplets. We note that the approach taken in this work differs from that of previous studies that either start from a microscopic model for the nucleation sites and nucleation theory to predict the fraction of frozen droplets using Monte Carlo simulations, and also from previous modelling using the singular approximation which do not account for the statistics of extreme sampling. 230 To use the HUB-Forward code, the user must define the total number of droplets "ndroplets" that serves as the total number of each concentration and the number of subpopulations "nsubpop". If "nsubpop"=1, the user must provide the temperature of maximum likelihood !"#$,! and the spread ! . If "nsubpop"=2, the user must provide !"#$,! , ! , !"#$,! , ! and ! . If "nsubpop"=3, the user has to provide !"#$,! , ! , !"#$,! , ! , ! , !"#$,! , ! , ! . To generate the cumulative freezing spectrum ! , the user needs to define the total number of concentrations "nconc", the concentration of the 235 parent suspension is defined in "density", and the droplet volume in "volumedrop". The output is composed of different data plots and files: the normalized ! ( ) and !"# ! , the artificially generated !"# ! ( ), and ! built from the 10-fold dilution series.  The ability of HUB-forward to generate the cumulative freezing spectrum ! ( ) from the underlying distribution 255 ! ( ) allows for an analysis of the sensitivity of ! ( ) and ! ( ) to the number of droplets and dilutions, as seen in the comparison of ! generated from the same underlying distributions using 100 and 10 4 droplets in Fig. 3. In Section 3.1 we show that the sampling with 100 droplets for only four dilutions of a system with two subpopulations of INs results in distortions of the freezing temperatures and the proportions of these populations in the differential spectrum.
The knee point in ! ( ) corresponds to the point of maximum curvature (Satopaa et al., 2011) and has been used 260 to characterize the nucleation temperature of a particular subpopulation (Hartmann et al., 2022). Similar to Hartmann et al., we have identified in Fig. 3C-D the knee points (magenta dotted line) of the artificially generated ! ( ) by using a Python function named "kneed". The Python function "kneed" using S=1, curve="concave" and direction="decreasing". The knee points !"## are very close to the temperatures of maximum likelihood !"#$ (dashed black lines) of the corresponding underlying distribution ! , because under these conditions the differential freezing spectrum ! ( ) is a very good 265 approximant for ! ( ). However, we find that removal of the more dilute solutions (that eliminate the plateau in ! ( )) results in poor estimation of the modes of ! ( ) from the knee of ! ( ).

HUB-Backward method to recover the differential freezing spectrum
( ) from the cumulative freezing spectrum by a stochastic optimization procedure.
The HUB-backward code implements a stochastic optimization procedure to extract the differential spectrum ! ( ) from a 270 given cumulative spectrum ! ( ) or from an experimental !"# curve. The later is useful when data is available for a single concentration. One possibility to obtain ! ( ) from ! ( ) would be to follow the following steps: i) propose a trial function ! !"#$% ( ), ii) use HUB-forward to predict the concentration-dependent distributions !"# !,!"#$% for various IN concentrations, iii) use these in Eq. (5) to predict the freezing fractions !"# !,!"#$% , iii) compute ! !"#$% ( ) from the freezing fractions using Eq. (1a), iv) evaluate the difference between that trial and the target (experimental) value, 275 and then v) evolve the parameters that determine ! !"#$% ( ) until the difference is minimized. However, the use of HUB-forward in steps ii) and iii) to generate and evaluate hundreds of droplets containing up to tens of millions of IN would require significant computations that render this optimization process inefficient.
The HUB-backward optimization procedure, sketched in Fig. 4, uses a shortcut for steps ii) and iii) above to directly predict ! !"#$% ( ) from ! !"#$% ( ) with fast convergence. The shortcut is based on the understanding that in the 280 asymptotic limit in which the sample is extremely dilute (i.e., ⟶ 0) each droplet that nucleates heterogeneously contains a single IN. In such case, sampling an infinitely large number of droplets with !"#

290
With this insight and considering that the intrinsic cumulative spectrum, define the cumulative integral of the differential spectrum as where the integration is from the ice melting temperature ! to the temperature , and is and adjustable scaling factor to be obtained from the optimization. Likewise, a similar estimate can be made for a single fraction of ice curve !"# !"#$% = ! !"#$% using Eq. (7) and proceed directly to evaluate the mean squared error (Fig. 4). When the target is a cumulative 295 freezing spectrum, HUB-forward uses ! !"#$% to predict a trial cumulative freezing spectrum (Fig. 4), where 1 corresponds to the maximum of the cumulative in the target distribution, we obtain an ! !"#$% that we compare with the target using Eq. (6) (Fig. 4). To do the comparison, HUB-backward uses a spline fit to interpolate the experimental ! !"#$%! ( ) in order to have equally spaced temperature points to then compare with the estimates in ! !"#$% ( ). We use the "interp1d" algorithm, which is available in the Python SciPy library (Virtanen et al.,300 2020) with a linear interpolation to construct new equally spaced data points within the range of the lowest and highest temperature values in the freezing spectrum. The cost function for the optimization is the mean squared error (MSE), computed from the difference in Eq. (6), where represents the total number of equally spaced points in .
We use a stochastic global optimization technique based on a simulated annealing algorithm to find the set of 305 parameters of ! !!"#$ (Eq. (2) and Eq. (3)), and (Eq. (7)) that globally minimizes the MSE. We use the simulated annealing (SA) algorithm "dual annealing" that is part of the SciPy minimize library (Virtanen et al., 2020) with its default arguments predefined, except for the parameters "maxfun" that sets the maximum number of the evaluation of the objective function (we select "maxfun" = 1,000,000 in the examples below), and the seed for the generation of random numbers (a new random integer is automatically generated every time the HUB-backward code is run). We show below that the optimized differential 310 spectra, !
where is the number of subpopulations. 315 We now turn our focus to how to select the input parameters required by HUB-backward to start the search for the underlying distribution, using the experimental ! !"#$%! ( ) or !"# !"#$%! ( ) as a guide. The code requires the user to define the number of distinct Gaussian subpopulations ! ( ) that comprise the underlying distribution (Eq. (2)) and to provide upper and lower bounds for the weighs ! , their modes !"#$,! , and spreads ! of each of these populations. In general, we find that defining the minimum and maximum values for the weighs ! !"# = 1 and ! !"# = 0 (see constrain in Eq. (2)), !"#$,! !"# and 320 !"#$,! !"# to be between the homogeneous nucleation temperature (about -30 o C) and the melting temperature (0 o C), and the bounds for the spreads ! !"# = 10 o C and ! !"# = 0.1 o C work well. However, these bounds can be tuned in order to better fit the data (as we find to be the case to fit the results for pollen in Section 3.2 below). If the existing experimental ! !"#$%! ( ) data is very noisy, it can be interpolated in HUB-backward using the method "interp1d" with "npoints"=100, and then smoothed with a Savitzky-Golay filter by changing the parameters "window_length" that is the length of the filter 325 window and "polyorder" that is the order of the polynomial used to fit the samples ("filter" in Fig. 4). The default values are 3 and 1, respectively. HUB-backward generates a plot that compares the original and the interpolated target data.
To identify the minimum number of subpopulations needed to represent a given freezing spectrum, we consider that every time a population is accumulated in ! ( ) or !"# ( ), these functions display a sharp increase. We note that assuming a large number of subpopulations may challenge the interpretability of the optimized differential spectrum ! !"#$%$&'( ( ). 330 We apply the HUB-backward procedure to the ! obtained in Fig. 3C-D by sampling four 10-fold dilutions with 100 droplets, i.e., only a total of 500 droplets. Fig. 5 shows the comparison between the predicted (solid magenta lines) and the target (black dashed lines) ! (panels A and B) and ! (panels C and D). Table 1 shows the predicted parameters and the precision of the optimization procedure to recover the known underlying distribution ! ( ). The MRE between the underlying distribution ! ( ) and the optimized differential spectrum ! !"#$%$&'( ( ) is 2% for the system with 335 one subpopulation and 13% for the one with two, despite the low number of droplets used to sample the cumulative freezing spectra in the computer-generated freezing experiments.   Table 1.
We conclude that the HUB-backward code gives a good estimate of the mode, spread and weights of the populations of INs in a sample, and it can be applied in a situation where ! ( ) is unknown. In Section 3.1 we discuss how 350 is the accuracy of the underlying distribution recovered with HUB-backward impacted by various schemes of sampling of number of droplets and dilutions to construct ! . In Section 3.2, we apply the HUB-backward procedure to obtain

Using the HUB code to optimize and analyse drop-freezing experiments 355
3.1 Effect of the number of droplets and dilutions on the temperature range of the cumulative freezing spectrum ( ) Fig. 3D-F shows ! ( ) generated with HUB-forward using five dilutions from λ = 10 4 to 1 of a solution with ! ( ) containing two populations in a ratio of 9 to 1. The ! ( ) are different when the number of droplets per dilution is 100 (Fig.   3F) or 10 4 (Fig. 3D). As shown in the previous section, the freezing spectrum obtained with 100 droplets and five dilutions 360 has enough sampling to recover this ! ( ) with good accuracy (Fig. 5C-D). We test different number of droplets and concentrations, defined by the average number of INs per droplet , to test the sensitivity of ! ( ) to the number of droplets and dilutions when the underlying distribution is known ! ( ). We use HUB-forward to build ! ( ) based on a combination of different number of droplets and concentrations, similar to the case shown in Fig. 3F. Then, we use the HUB-backward to obtain ! !"#$%$&'( ( ), compare it to ! and test the accuracy of each prediction through its mean 365 relative error (MRE) as defined in Eq. (10).
The left panels of Fig. 6 show ! ( ) generated with HUB-forward based on a combination of different concentrations using 100 droplets each. The magenta lines are based on the data fitting provided by the HUB-backward code. The right panels of Fig. 6 compare ! !"#$%$&'( ( ) in magenta and the known underlying distribution ! in black. In this example, ! ( ) is very close to ! if both subpopulations are sampled enough. However, if the most dilute solution 370 with = 1 is not included in ! ( ) (second panel), the estimate of the underlying distribution is very poor. Thus, to improve the sampling of the lower tail of ! , we recommend ending the dilution series always in the immediacy of = 1, which can be gleaned from the temperature range for which ! becomes flat and a sizeable fraction of droplets of the more diluted sample nucleates homogeneously (inset of Fig. 2A). We emphasize that reaching this limit allows for an absolute calibration of the number of INs in the initial sample. Moreover, sampling to concentrations down to about one 375 nucleant per droplet is essential to recover a proper weight of the poorly nucleating IN populations.  Table S1.

385
The relative weights of class A and C populations in Pseudomonas syringae is approximately 1 to 1000 (Section 3.2), while the ratio is 9 to 1 in the two-population system example of Fig. 6. To understand the impact of highly imbalanced populations on the sampling of the cumulative spectrum and recovery of the underlying distribution, we show in Fig. 7 the analysis of an example where the subpopulation of highly efficient INs is three orders of magnitude less likely to occur than the subpopulation at lower temperatures, mimicking the one of P. syringae. Our analysis confirms that it is important to end 390 the dilution series in the immediacy of = 1 to fully represent the contribution of the poorer INs (Fig. 7B-F). Furthermore, we find it is important to sample a high enough concentration to account for the rare INs that nucleate at the highest temperatures ( Fig. 7D-H).
If only 25 droplets per dilution, instead of 100, are used to construct the cumulative spectrum, the impact of insufficient sampling at the higher concentrations is more pronounced: compare Fig. 8C and Fig. 7D obtained with the same 395 underlying distribution ! with 1000 to 1 subpopulation ratios and number of dilutions.
We conclude that increase in the accuracy in the account of the subpopulations requires a higher number of dilutions and the checking of the predictions with the addition of each successive concentration, to ensure convergence of ! !"#$%$&'( ( ). Measuring fewer droplets or fewer dilutions lead to poor statistics and results in incompleteness or the misrepresentation of the underlying distribution in samples with multiple subpopulations. In principle, increasing the number 400 of droplets of the most concentrated solutions, or adding more ten-fold concentrated ones until there are no changes in the cumulative spectrum is recommended to ensure complete sampling. When that limiting scenario is not attainable, the use of HUB-forward to produce synthetic data from a proposed underlying distribution, followed by the recovery of the differential spectrum from these data sets, allows for an estimation of the errors that may be incurred for putative, proposed underlying distributions with the sampling scheme available in the laboratory. 405

410
= 10 3 (yellow x) and = 10 4 (red triangles). The sampling was done using 100 droplets for each concentration. Right panels (E-H) represent the differential freezing spectra compared to the known underlying distribution ( ), shown by the magenta and dashed black lines, respectively. The mean relative error (MRE) was computed using Eq. (10) and the parameters of and ( ) are shown in Table S2.  Table S3. 3.2 Obtaining the differential freezing spectrum ( ) from the experimental cumulative freezing spectrum ( )

of biological INs using the HUB-backward code
In this section we use the HUB-backward code to obtain the differential freezing spectrum ! ( ) from the cumulative freezing spectra ! ( ) of the fungi Fusarium acuminatum strain 3-68 (Kunert et al., 2019), the bacterium P. syringae 425 (Schwidetzky et al., 2021), and birch pollen (Dreischmeier, 2019). We select these systems because they are important biological INs and show increasing complexity in terms of the apparent number of underlying distributions that define their freezing spectra.
The experimental ! ( ) obtained for F. acuminatum (black squares in Fig. 9A) was obtained by sampling six 10fold dilutions, each with 96 droplets (Kunert et al., 2019). Fig. 9A shows the cumulative spectra optimized assuming one 430 (green curve) and two (cyan curve) subpopulations; Fig. 9B shows the corresponding optimized differential freezing spectra.
The ! !"#$%$&'( ( ) with a single subpopulation that peaks at -5.9 o C is unable to represent the cumulative density of the most potent nuclei and misses the inflection around -5.9 o C in the experimental data, resulting in a mean squared error MSE = 0.05.
The ! !"#$%$&'( ( ) with two subpopulations has a lower MSE = 0.003 and a better fit that suggests a population that peaks at -7.3 o C and another at -5.5 o C, in comparable amounts ( Table 2). Most notably, the two subpopulations do not overlap in the 435 differential freezing spectrum, supporting that they may indeed correspond to different physical entities. The improvement in the fit becomes apparent in the inset of Fig. 9A, which shows ! ( ) on a linear scale. The significant slope of ! ( ) even at the lowest temperatures indicates that sampling of more diluted solutions is needed to capture the contribution of the less active INs. An attempt to represent F. acuminatum nucleation data with three different subpopulations resulted in two of them being almost identical. We conclude that adding a third subpopulation is unnecessary to reproduce the experimental 440 cumulative freezing spectrum of F. acuminatum. We refer the reader to (Schwidetzky et al., 2023) for an interpretation of the size of the ice nucleating surface of F. acuminatum based on its differential spectrum and nucleation theory. 450 Table 2: Mean squared error (MSE) and parameters of the differential freezing spectra obtained using the HUB-backward code and experimental data as input. The values shown here were calculated based on the average of n = 3 independent runs. The error bars, shown in parentheses, were calculated by dividing the standard deviation of the values in these runs by 3 1/2 . Next, we apply the HUB-backward code to analyse the experimental freezing spectrum of Snomax®, i.e., inactivated P. syringae. The cumulative spectrum suggests the presence of two distinct subpopulations, usually called class A (at warmer temperatures) and class C (at colder ones). We first assume the differential freezing spectrum ! ( ) of Ps.
syringae is a combination of two Gaussian populations. The parameters of the optimized differential spectrum with two subpopulations are listed in Table 2, and the curve is shown with in Fig. 9D with a cyan line. Note that we use a logarithmic 465 scale to represent this ! !"#$%$&'( ( ) because the population corresponding to class A accounts for less than 0.1% of the total ( Table 2). While the fit with two subpopulations results in a good overall account of the target data, we note that there is some difference in the region between classes A and C (Fig. 9C). The fitting for P. syringae achieves an excellent agreement between optimized and target cumulative spectra (Fig. 9C), through the prediction of an additional peak located between classes A and C (the elusive class B), with a population comparable to class A ( Table 2 and red curve in Fig. 9D). However, 470 more measurements and analyses are needed to establish whether this "class B" peak at -5.2 o C is reproducible and truly distinct from the one of class A at -3.7 o C to warrant a physical interpretation. Overall, both the analyses with two and three subpopulations agree with previous ones (Govindarajan and Lindow, 1988;Warren, 1987)  To further test the methodology, we model the cumulative freezing spectrum of birch pollen. Given that the original ! ( ) data for pollen in Fig. 3.1 of (Dreischmeier, 2019) consists of multiple independent curves, we took one of the many presented in this graph as target ! !"#$%! ( ) (black curve in Fig. 9E) and present some of the additional data -not used in the optimization-with gray circles in Fig. 9E. Supp. Section S4 shows that the differential spectrum optimized from the whole 480 data set and its sparse sampling are almost identical, because HUB-forward interpolates and smooths the input data to produce an equispaced data set. the ! !"#$%! ( ) seems to contain three quite separated subpopulations; which is confirmed by the accuracy of the optimized cumulative spectrum in Fig. 9E. The parameters of the optimized differential freezing spectrum ! !"#$%$&'( ( ) and the MSE are shown in Table 2. Our analysis indicates that the two subpopulations that nucleate ice above -16 o C constitute less than 0.01% of the active nucleating sites in pollen (Fig. 9E), consistent with drop-freezing 485 assays that only measured solutions with low concentrations of birch pollen and did not observe freezing at higher temperatures (Augustin et al., 2013;Pummer et al., 2012;Felgitsch et al., 2018), where the more extensive data of (Dreischmeier, 2019) reveals two more active subpopulations of IN.
To further illustrate the use of HUB-backward, Fig. 10 shows the effect of pH on the subpopulations in the modes, spread and weighs that contribute to the nucleation spectrum of P. syringae (Snomax®), using data from (Lukas et al., 490 2020). Freezing in the temperature range of class A drops about 3 orders of magnitude when the pH is lowered from 6.2 to 4.4, (Fig. 10B). However, we note that the cumulative number of IN is preserved in the experimental cumulative freezing spectrum (Lukas et al., 2020), indicating that the change in pH did not impact the number of nucleants.  3.3. Obtaining the differential freezing spectrum ( ) from the experimental fraction of ice ( ) of insoluble ice nucleators using the HUB-backward code Sections 3.1 and 3.2 discuss how to obtain the differential spectrum from a target cumulative one. However, there are many cases where the results are presented as fraction of frozen droplets as a function of temperature, !"# ( ). In these cases, the HUB-backward code can be used to obtain the optimized differential freezing spectrum ! !"#$%$&'( ( ) directly 510 from !"# !"#$%! ( ). Supp. Section S5 illustrates this approach for the analysis of droplet freezing data for a sample of lignin (Bogler and Borduas-Dedekind, 2020) in which the IN participate in aggregation equilibria. Here, we exemplify the optimization the differential spectrum of cholesterol from experimental freezing data obtained at two cooling rates with droplets sampled from a single dilution.
The triangles and squares in Fig. 11A display the experimental !"# ( ) obtained by sampling the freezing of 515 hundreds of 120 droplets pipetted from a suspension of cholesterol monohydrate crystals in contact with Teflon walls cooled at 0.18 K/min (triangles) and 0.06 K/min (squares) (Zhang and Maeda, 2022). The tripling of the cooling rate has a significant effect on the freezing of the droplets. In the analysis of drop-freezing experiments, it is assumed that each IN has a singular freezing temperature, independent of the cooling rate. However, ice nucleation is a stochastic process, and the underlying distribution of freezing temperatures ! strictly depends on both temperature and cooling rate, as slower rates 520 give more time for the system to cross the nucleation barrier at warmer temperatures.  Table S5.
Our analysis of the freezing data of cholesterol monohydrate shows that even a three-fold change in the cooling rate can have significant impact on the differential spectrum (Fig. 11B). As expected, the modes of the three populations move towards warmer temperatures upon decreasing the cooling rate. We note, however, that the shift of the peaks is not uniform; 530 the middle one seems to be more sensitive to the cooling rate. Different sensitivity of the freezing rate of subpopulations to temperature has been also reported in simulations of nucleation data of minerals using the stochastic and modified singular frameworks (Herbert et al., 2014;Murray et al., 2011). The modified singular model proposes an empirical correction the relation between !"# ( ) and ! ( ) to account for the effect of the cooling rate on the shift of these quantities (Vali, 1994).
That analysis could be extended to the analysis of the subpopulations of IN obtained with HUB-backward. Moreover, it 535 would be interesting in future studies to use the rate dependence of the mode of the subpopulations to extract the steepness of the nucleation barrier with temperature using nucleation theory (Budke and Koop, 2015), and to investigate the relationship between the cooling rate dependence of the differential spectrum obtained in the singular approximation with the interpretation of the same data modelled with the stochastic framework, such as in Herbert et al., 2014).

Conclusions 540
In this study, we present the HUB method and associated Python codes that model (HUB-forward code) and interpret (HUB-backward code) the results of droplet freezing experiments under the assumptions that each ice nucleating site in the sample has a characteristic nucleation temperature that is time-independent. The use of the singular approximation is the same as used by Vali (Vali, 1971;Vali, 2014Vali, , 2019 in his derivation of the ice nucleation spectra from data of fraction of frozen droplets. Different to previous implementations of the singular model, HUB accounts for the distribution of the 545 number of IN in droplets at a given concentration, and uses extreme-value statistics to represent the effect of dilutions in the frozen fraction and freezing spectra. Our method and codes allow users to obtain an analytical differential freezing spectrum ! ( ) from the experimental distribution of freezing temperatures, and vice versa. The differential freezing spectrum ! ( ) is an approximant to the underlying distribution of ice nucleating temperatures ! , which provides a hub to connect the experimental freezing temperatures with interpretative physical analyses using kinetic models or nucleation theory that can 550 be used to elucidate the mechanisms of nucleation and origins of these distributions.
HUB-forward predicts the cumulative ice nucleation spectrum ! ( ) and fractions of frozen droplets !"# ( ) from a known (or assumed) underlying distribution ! ( ) of nucleation temperatures for the IN in the sample. The HUB-forward code can be used to investigate the effect of the number of droplets and dilutions on the temperature range of the cumulative freezing spectrum ! ( ). Our analysis shows that the differential freezing spectrum ! ( ) is identical to the underlying 555 distribution of heterogeneous ice nucleation temperatures ! ( ) only when sampling is complete. Measuring fewer droplets or fewer dilutions can result in a biased representation of the differential and cumulative spectra. HUB-forward predicts !"# ( ) and ! ( ) from a proposed distribution of IN temperatures, allowing its users to test hypotheses regarding the role of subpopulations of nuclei in the freezing spectra and providing a guide for a more efficient collection of freezing data.
HUB-backward uses a non-linear optimization method to find the differential freezing spectrum ! ( ) that best 560 represents the experimental target cumulative freezing spectrum ! ( ) or fraction of frozen droplets !"# ( ) in the experiments. The analytical form of the differential freezing spectrum ! ( ) obtained from HUB-backward offers an interpretable physical basis. The interpretability of the results in terms of subpopulations provides an advantage over polynomial fitting and differentiation of ! ( ). Indeed, we show that the HUB-backward code can be used to reveal and characterize the underlying number of IN subpopulations of complex biological samples (Snomax®, fungi Fusarium 565 acuminatum, and birch pollen) and quantify the dependence of their subpopulations on environmental variables.
Interestingly, our analysis evinces subpopulations that are not obvious to the eye and have not previously been identified in these samples. The robustness of the signals that correspond to these populations and their physical nature require further investigation.
We illustrate the use of HUB-backward to obtain the differential freezing spectrum ! ( ) from the fraction of 570 frozen droplets !"# ( ) collected at a single concentration. We apply that analysis to demonstrate that ! ( ) depends on the cooling rate. The shift of the peaks of the subpopulations to higher temperatures upon decreasing the cooling rate is not unexpected, as longer waiting times allow for the surmount of the same nucleation barrier at warmer temperatures. By providing the temperature dependence of the mode spread and weight of the subpopulation peaks, HUB-backward can be combined with nucleation theory and other theoretical analyses to extract the steepness, and may even the distribution, of 575 nucleation barriers that control the freezing process.

Code availability
All codes, a user manual, and input files used in this project can be accessed at https://github.com/Molinero-Group/underlying-distribution

Data availability 580
All data used in this project can be accessed by request to the authors.

Author contribution
V. M., I. de A. R. and K. M. designed the project. I. de A. R. developed the model code and performed the simulations. I. de A. R. and V.M. prepared the manuscript with contributions from K.M.

Competing interests 585
The authors declare that they have no conflict of interest.