Long- and Short-Term Temporal Variability in Cloud Condensation Nuclei Spectra in the Southern Great Plains

Abstract. When aerosol particles seed formation of liquid water droplets in the atmosphere, they are called cloud condensation nuclei (CCN). Different aerosols will act as CCN under different degrees of water supersaturation (relative humidity above 100%) depending on their size and composition. In this work we build and analyze a best-estimate CCN spectrum product, tabulated at ~45 min resolution, generated using high quality data from eight independent instruments at the US Department of Energy Atmospheric Radiation Measurement (ARM) Southern Great Plains site. The data product spans a large supersaturation range, from 0.0001 to ~30%, and a five-year time period, from 2009 to 2013, and is available on the ARM data archive. We leverage this added statistical power to examine relationships that are unclear in smaller datasets. Probability distributions of many aerosol and CCN metrics are found to exhibit skewed log-normal distribution shapes. Clustering analyses of CCN spectra reveal that the primary drivers of CCN differences are aerosol number size distributions, rather than hygroscopicity or composition, especially at supersaturations above 0.2%, while also allowing for simplified understanding of seasonal and diurnal variations in CCN behaviour. The predictive ability of using limited hygroscopicity data with accurate number size distributions to estimate CCN spectra is investigated and uncertainties of this approach are estimated. Finally, the dynamics of CCN spectral clusters and concentrations are examined with cross-correlation and autocorrelation analyses, which assist in determining the time scales of changing CCN concentrations at different supersaturations and are important for cloud modelling studies.


Introduction
https://doi.org/10.5194/acp-2021-1008 Preprint. Discussion started: 13 December 2021 © Author(s) 2021. CC BY 4.0 License.
The interactions between atmospheric aerosol particles and ambient water vapor are key drivers of the formation of haze and clouds. Particles are the nuclei upon which liquid water first condenses to form haze and cloud droplets, thereby affecting visibility, cloud microphysical properties, and precipitation. Water uptake by particles depends upon their size and chemical composition as well as on ambient environmental conditions and their rates of change. Particles that are identified as cloud condensation nuclei (CCN) are typically those that are predicted to form cloud drops at water supersaturations (SSw) of 1% or lower, conditions believed typical of most clouds that are formed in weak to moderate updrafts. Based on average aerosol characteristics, and typical atmospheric aerosol number size distributions, number concentrations of particles active as CCN are thus generally assumed to correspond to those particles in the 80-300 nm dry diameter size range. Instruments designed to directly measure number concentrations of activated particles at fixed SSw are also typically limited to 2% SSw as an upper bound on the measurement range (Uin, 2016). However, supersaturations and particle sizes outside of these ranges are also of atmospheric interest. In deep convection with intense updrafts, and in regions of very low existing particle or droplet surface area concentrations, SSw can build rapidly to high levels because the condensation sink rates are so low relative to the rate of supersaturation generation (Pinsky et al., 2012). Those high SSw may be sufficient to allow particles smaller than 40 nm to serve as cloud condensation nuclei.
Thus, despite their relatively short lifetimes in the atmosphere compared with larger accumulation-mode particles, high concentrations of small particles can potentially influence cloud microphysical processes leading to precipitation formation or evaporation. It is now recognized that nucleation of new particles from the gas phase, generating particles on the order of 10 nm in diameter which subsequently grow, occurs in many regions of the troposphere and is an important control on global atmospheric aerosol number concentrations (Hodshire et al., 2016; Bianchi et al., 2016; Venzac et al., 2008; Pierce et al., 2014; Nieminen et al., 2018). At the other end of atmospherically-relevant supersaturations, droplet formation may occur at very low SSw where CCN concentrations are very difficult to probe. For example, the slow cooling rates in radiation fogs allow vapor scavenging to effectively compete with the generation of supersaturation, and thus maximum SSw conditions reached in fogs can be below 0.05% (Gerber, 1991; Low, 1975; Shen et al., 2018), suggesting that only larger and more hygroscopic particles can participate in fog droplet formation. The cloud physics community has had a longstanding interest in elucidating the microphysical roles of "giant CCN" (GCCN), that is, relatively large particles that activate at very low SSw, ~0.01% or less. Specifically, GCCN are hypothesized to control the initiation of drizzle and precipitation in shallow clouds (Cohard et al., 1998; Johnson, 1982; Feingold et al., 1999; Cheng et al., 2009; Hudson et al., 2011; Posselt and Lohmann, 2008; Levin and Cotton, 2009; Gantt et al., 2014; Jung et al., 2015).
Their potentially controlling roles in fog, cloud, and precipitation formation have motivated interest in direct measurements of the number concentrations of CCN active over a range of atmospherically-relevant SSw, with modern instrumentation making possible long-term, unattended monitoring. Here, we analyse observations of CCN spectra from the United States Department of Energy's Atmospheric Radiation Measurement (ARM) Southern Great Plains (SGP) site, located in north central Oklahoma, for the five-year period from 2009 to 2013. The CCN measurement instrumentation deployed at this site is typically limited to stable operation over the 0.1 to 1% SSw range. As described further below, we extend those observations to a broader supersaturation range of interest using ancillary aerosol observations, creating a CCN estimate relevant for clouds ranging from fog through to intense deep convective updrafts. These observations are especially important for modelling studies of aerosol impacts on clouds in this region that use CCN number concentrations as a basis for determining their aerosol initial conditions (Saleeby et al., 2016; Marinescu et al., 2017; Glenn et al., 2020).

Methods
This work builds upon a prior study of the same 2009-2013 time period, in which aerosol data from three instruments deployed at the SGP site were averaged over ~45-minute intervals and merged into dry aerosol number size distributions, n(Dp), spanning particle dry diameters 7 nm < Dp < 14 µm. We combine those merged size distributions with information on aerosol hygroscopicity, κ (Petters and Kreidenweis, 2007), taken over the same ~45-minute intervals. κ values were obtained via measurements of diameter growth factors (GF) at 90% relative humidity made by a humidified Tandem Differential Mobility Analyzer (H-TDMA; Collins, 2010; Mahish and Collins, 2017). The procedure for integrating these data into a CCN spectrum is described more fully in Appendix A, but is reviewed briefly here.
The critical saturation ratio, Sc, at which a particle with dry diameter Dp can be activated into a cloud drop is determined by finding the maximum of the following equation (Petters and Kreidenweis, 2007):
$$S(D) = \frac{D^3 - D_p^3}{D^3 - D_p^3 (1 - \kappa)} \exp\left(\frac{4 \sigma_{s/a} M_w}{R T \rho_w D}\right)$$
where D is the droplet diameter, ρw is the density of water, Mw is the molecular weight of water, σs/a is the surface tension of the solution/air interface, R is the universal gas constant, and T is temperature. Saturation ratio, S, and water supersaturation percent, SSw, are related by:
$$SS_w = (S - 1) \times 100\%$$
The functional relationship between Dp, κ and Sc can be more readily illustrated by the approximate relationship that is valid for κ > 0.2 (Petters and Kreidenweis, 2007), where SSw,c is the critical water supersaturation percent:
$$\ln\left(1 + \frac{SS_{w,c}}{100}\right) \approx \sqrt{\frac{4 A^3}{27 D_p^3 \kappa}}, \qquad A = \frac{4 \sigma_{s/a} M_w}{R T \rho_w} \qquad \text{(Eq. 4)}$$
From Eq. 4 it can be seen that information on the aerosol number size distribution, n(Dp), and the variation of κ with dry diameter can be used to compute the critical SSw corresponding to each selected dry size. It should be noted that the approximation in Eq. 4 is only used for demonstration here, with full methods described in Appendix A. In practice, the size distribution is discretized to obtain total number concentrations in each selected dry diameter bin (222 logarithmically spaced bins in this work), and a constant κ is assumed across each selected bin. To produce the size-dependent κ distribution, measurements of aerosol hygroscopic growth were made for seven different sizes (Figure A2), which were subsequently processed to obtain a single weighted κ value for each size. In this study, these κ values are then interpolated linearly between measured values, with invariant κ beyond the largest and smallest sizes (below ~10 nm and above ~600 nm). The cumulative spectrum of CCN concentration, CCN(SSw), can then be constructed, which defines the total number of particles that can be activated at a particular SSw.
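The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Appendix A procedure: the physical constants, the wet-diameter search grid, and the 20-bin example inputs are all assumptions chosen for demonstration, and the function names are hypothetical.

```python
import numpy as np

# Assumed constants for water near 298 K; the text does not state the values used.
SIGMA = 0.072    # surface tension of the solution/air interface, J m^-2
M_W = 0.018015   # molecular weight of water, kg mol^-1
RHO_W = 997.0    # density of water, kg m^-3
R_GAS = 8.314    # universal gas constant, J mol^-1 K^-1


def kelvin_length(temp=298.15):
    """A = 4 sigma M_w / (R T rho_w), in metres."""
    return 4.0 * SIGMA * M_W / (R_GAS * temp * RHO_W)


def critical_ss(dp, kappa, temp=298.15):
    """Critical supersaturation (%) from the maximum of S(D) over wet diameter D."""
    a = kelvin_length(temp)
    # Hypothetical search grid: wet diameters from just above dp to 1000 * dp.
    d = dp * np.logspace(np.log10(1.0001), 3.0, 20000)
    s = (d**3 - dp**3) / (d**3 - dp**3 * (1.0 - kappa)) * np.exp(a / d)
    return (s.max() - 1.0) * 100.0


def critical_ss_approx(dp, kappa, temp=298.15):
    """Approximate form, ln(Sc) = sqrt(4 A^3 / (27 Dp^3 kappa)), for kappa > 0.2."""
    a = kelvin_length(temp)
    return np.expm1(np.sqrt(4.0 * a**3 / (27.0 * dp**3 * kappa))) * 100.0


def ccn_spectrum(dp_bins, n_bins, kappa_bins, ss_grid):
    """Cumulative CCN(SSw): total number in bins whose critical SSw <= each SSw."""
    ss_crit = np.array([critical_ss(d, k) for d, k in zip(dp_bins, kappa_bins)])
    return np.array([n_bins[ss_crit <= ss].sum() for ss in ss_grid])


# Illustrative inputs only (not the archived data): 20 bins, constant kappa.
dp_bins = np.logspace(-8, -6, 20)           # bin-centre dry diameters, m
n_bins = np.full(20, 10.0)                  # number concentration per bin, cm^-3
kappa_bins = np.full(20, 0.3)
ss_grid = np.array([0.01, 0.1, 1.0, 30.0])  # supersaturation, %
spectrum = ccn_spectrum(dp_bins, n_bins, kappa_bins, ss_grid)
```

Because each bin is assigned a single critical SSw, the resulting spectrum is a non-decreasing step function of SSw that levels off once every bin has activated.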
Figure 1a shows an example of the measured n(Dp) and κ(Dp); the resulting CCN spectrum is also shown in Figure 1. Available data from the measurement suite include direct, concurrent, and collocated observations of cumulative CCN number concentrations at selected supersaturations, which can be used to check the accuracy of the reconstructed spectra. Two additional instruments provide separate, independent, continuous observations that were also used to assess the reconstructed spectra: a nephelometer measured total particle scattering coefficients, and an Aerosol Chemical Speciation Monitor (ACSM) measured nonrefractory, speciated submicron mass concentrations. Both of these observations emphasize larger particle sizes (generally > 300 nm) and thus served as constraints on the particles contributing strongly at the lowest supersaturations, for which no direct observations exist. At the other end of the size distribution, the smallest particles are expected to require the highest supersaturations for activation, but neither the size distribution nor κ is well constrained observationally. Reasonable assumptions are applied to extrapolate the CCN spectrum beyond SSw = 1%, the approximate upper limit of the CCN counter. In total, data from six instruments are merged to produce a best-estimate CCN spectrum for each ~45-minute interval, as described more fully in Appendix A. Resulting size distributions range from 0.0068 to 13.8 µm (bin centers), with corresponding CCN spectra generated for SSw from 0.0001 to ~30% (100.0001 to 130% RH). The SSw range is chosen to span the entire range of particle activations: the largest and most hygroscopic measured particles activate at 0.0001% SSw, while the smallest and least hygroscopic activate at ~30% SSw. As noted in the Data Availability statement below, the final merged data, with estimates of uncertainty, are available in the DOE ARM archive.

Distribution Characteristics
While we expect the final merged data over the entire range to be useful, not all observations are equally reliable, for several reasons. In the size distributions, the lowest size bins are generated using a previously described fitting procedure. This fitting procedure is constrained by instrument data and produces good agreement with direct observations of the number concentrations of particles in the ultrafine mode, but the shape of the aerosol spectrum at the smallest particle sizes, especially below 12 nm, is more uncertain than at larger particle sizes. κ values are assumed to be invariant outside of the measured range (below ~10 nm and above ~600 nm) and are thus more uncertain in those ranges. Additionally, at larger aerosol sizes, growth factor distributions can be bi-modal (Figure A2), which is not captured in this approach and results in additional uncertainty. Because the CCN spectra are generated by combining size and κ distributions, uncertainties in each are inherited in certain regions. The high SSw region of the CCN spectrum is dominated by smaller particles, and uncertainty in this region is increased due to uncertainties in aerosol distribution shape.
Beyond a certain SSw that is sufficient to activate all particles regardless of size and composition, a CCN spectrum must level off. In cases where there are particles present in the smallest size bins, this region has increased uncertainty due to the effects of very small particles (below ~8 nm) that were not measured. This occurs at SSw greater than 10%. Interestingly, uncertainties in κ in this region are largely irrelevant, as high SSw is required for activation regardless of hygroscopicity (see discussion in Section 3.1.2, Figure 4). The region of lowest SSw, less than 0.1%, is the other region subject to additional uncertainties, coming from uncertainties in κ and number size distributions. Particles >13.8 µm were not measured and are not included in our size distributions, and these particles would contribute to this lowest SSw region. While the abundances of large particles which activate in this low SSw regime are quite low, they can be important in controlling further activation under some conditions. Ultimately, the regions of increased uncertainty are under SSw conditions where few measurements of CCN concentrations exist, which adds value to these data despite these uncertainties.

Probability Fitting
We present statistical descriptions for key parameters related to the aerosol and CCN populations to describe their variability over the study period. The frequency distributions of observations fit exceptionally well to skewed log-normal distributions (described in Appendix B), with low degrees of skew such that log-normal distributions remain a fair approximation in many cases. Aerosol data are seldom fit in this manner; studies typically report median and percentile bounds instead. On the other hand, log-normal distributions have been noted and used for Aerosol Optical Depth (AOD), either for spatial or temporal variations (Alexandrov et al., 2004, 2016; Anderson et al., 2003; Sayer and Knobelspiesse, 2019). While the methods used for AOD treatments have not been widely adopted for aerosol distributions or CCN spectra, they could be, and we hope that supplying parameterizations in the same form makes further work focused on the impacts of variability more accessible. Additionally, better fits for probability distribution functions could be incorporated into microphysical modelling studies or other efforts interested in the likelihood of given aerosol conditions occurring. When using these fit data, it is important to keep in mind that neighbouring size bins are statistically correlated with each other: the probability of finding 100 particles in the 10 nm size bin is not independent of the probability of finding 100 particles in the 15 nm size bin. Because of this, the simplest way to calculate combined or correlated quantities (for example, the number concentrations of all particles between 10 and 20 nm) is through our archived distributions across time points of interest, rather than utilizing our fit parameters. A similar skewed log-normal distribution could be fit for the combined data if desired.
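The caution about correlated neighbouring bins can be illustrated with synthetic data. The sketch below assumes two hypothetical bins whose log-concentrations have a correlation of 0.8 (an arbitrary value, not taken from the data product) and compares the spread of their sum when concurrent time points are used against the spread obtained when the pairing between bins is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
n_times = 5000

# Hypothetical log10 concentrations in two neighbouring size bins,
# drawn with a correlation of 0.8 (an assumed value for illustration).
sd, rho = 0.3, 0.8
cov = [[sd**2, rho * sd**2], [rho * sd**2, sd**2]]
log_n = rng.multivariate_normal([2.0, 2.0], cov, size=n_times)
n_10nm, n_15nm = 10.0 ** log_n[:, 0], 10.0 ** log_n[:, 1]

# Combined quantity computed from concurrent time points (the recommended route).
total_series = n_10nm + n_15nm

# Same marginal distributions, but with the pairing between bins destroyed,
# which is what combining per-bin fit parameters would implicitly assume.
total_independent = n_10nm + rng.permutation(n_15nm)

# Width of the 5th-95th percentile range of the combined concentration.
width_series = np.diff(np.percentile(total_series, [5, 95]))[0]
width_independent = np.diff(np.percentile(total_independent, [5, 95]))[0]
```

In this synthetic case the independent treatment understates the spread of the combined concentration, which is why summing the archived distributions at each time point is the safer route.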
It should also be noted that distribution shapes may not be well conserved across all time or length scales. Variations are most likely to occur at small time (less than 2 hours) or length (less than 0.5 km) scales based on the analysis discussed in Section 3.2 and previous works (Alexandrov et al., 2004;Anderson et al., 2003).

Clustering Analysis
Clustering analysis is used to simplify and seek relationships in the rather large and complicated data set. K-means clustering is performed using a vector-based distance metric. Details of cluster analysis can be found in Appendix C. From the K-means clustering applied to the CCN spectra, three distinct clusters are identified that achieve good separation in both CCN and number size distribution characteristics, as shown in Figure 3. Clustering is carried out based on CCN spectra; that is, each spectrum was assigned to a cluster based on its shape and magnitude. Even though the clustering procedure had no direct information about particle number size distributions, the size distributions associated with each cluster are well resolved (Fig. 3b). On the other hand, hygroscopicity parameter distributions are similar for all three clusters (Fig. 3d). This indicates that particle number size distributions have a greater influence on the resulting CCN spectra than do κ distributions, consistent with other analyses (Patel and Jiang, 2021). As a result, estimates of CCN spectra using size distribution data and either estimated or median κ values are expected to be reasonable approximations, although deviations of approximate CCN spectra from observed CCN spectra can still be quite large for any given timepoint. We estimate the error introduced by using a "median κ" to compute CCN spectra as follows. For this "median κ" estimate, we calculated a median κ value based on the entire data set, then used this median value in combination with all individual size distributions to generate estimated CCN spectra. These are then compared to the CCN spectrum products (using concurrent κ and size distribution data) to calculate error estimates, with the results shown in Figure 4. Estimates are generally least reliable for lower supersaturations, with estimates below 0.2% SSw having a 95% confidence interval broader than ±50% of the estimated value.
Therefore, caution should be taken interpreting estimated CCN spectra in this low SSw region. This result additionally highlights the uncertainty in our data product below about 0.03% SSw. In this region, particles significantly larger than 600 nm are expected to activate but we do not have accurate κ measurements for these sizes, as discussed in Section 2.
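The "median κ" error estimate described above can be sketched as follows. This is an illustration on synthetic data, not a reproduction of Figure 4: it assumes a single size-independent κ per time interval, uses the approximate activation relation (strictly valid only for κ > 0.2) with an assumed Kelvin length, and generates hypothetical log-normal size distributions.

```python
import numpy as np

A = 2.1e-9  # Kelvin length near 298 K in metres (assumed constant)


def activation_diameter(ss_percent, kappa):
    """Smallest dry diameter (m) activating at ss_percent, approximate form."""
    ln_s = np.log1p(ss_percent / 100.0)
    return (4.0 * A**3 / (27.0 * kappa * ln_s**2)) ** (1.0 / 3.0)


def ccn_number(dp, n, ss_percent, kappa):
    """Cumulative CCN: sum of bins at or above the activation diameter."""
    return n[dp >= activation_diameter(ss_percent, kappa)].sum()


# Synthetic stand-in for the time series (not the archived data).
rng = np.random.default_rng(1)
dp = np.logspace(np.log10(7e-9), np.log10(14e-6), 222)   # 222 log-spaced bins
n_times = 500
kappa_t = rng.lognormal(np.log(0.3), 0.3, n_times)       # one kappa per interval
mode_t = rng.lognormal(np.log(80e-9), 0.4, n_times)      # modal diameter per interval
n_t = 1000.0 * np.exp(-0.5 * (np.log(dp / mode_t[:, None]) / np.log(1.8)) ** 2)
kappa_median = np.median(kappa_t)


def relative_error_interval(ss_percent):
    """2.5th and 97.5th percentiles of (estimate - reference) / reference."""
    ref = np.array([ccn_number(dp, n_t[i], ss_percent, kappa_t[i])
                    for i in range(n_times)])
    est = np.array([ccn_number(dp, n_t[i], ss_percent, kappa_median)
                    for i in range(n_times)])
    return np.percentile((est - ref) / ref, [2.5, 97.5])


interval_low_ss = relative_error_interval(0.1)
interval_high_ss = relative_error_interval(1.0)
```

In this synthetic setup the interval at 0.1% SSw is wider than at 1% SSw, because the activation diameter at low SSw falls on the steep upper tail of the size distribution, where a small shift in κ changes the cumulative count by a large fraction.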
Above 0.2% SSw, the "median κ" estimate works quite well, with uncertainties decreasing as SSw increases. This region of the spectrum is likely a good candidate for generation of CCN spectra from observations of particle distributions where high-quality κ measurements are available for only a limited time period. It is important to note that this approach will only work for accurate number size distribution data extending to diameters larger than 500 nm. For distributions ending at 500 nm, many CCN activating at or below ~0.2% SSw will not be directly counted, and due to the cumulative nature of the distributions, this gap can introduce large errors for all SSw values. We expect this median κ estimation method to be especially applicable for the SGP site and similar environments, but it may apply elsewhere as well. The clustering of CCN spectra into distinct groups highlights the contributions from particle number size distributions as the distinguishing factor. This analysis does not suggest that κ distributions for small particles are invariant, random, or unimportant, only that their contribution to a final CCN spectrum is small compared to the contribution of the particle number size distribution. Mahish and Collins (2017) provide a more complete analysis of the κ measurements at SGP during this time period, which is consistent with the data we use here.
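For readers unfamiliar with K-means, a minimal sketch of clustering spectra by shape and magnitude is given below. The details in Appendix C are not reproduced here: the Euclidean distance on log10 concentrations, the deterministic starting centers, and the three synthetic groups of spectra are all assumptions made for illustration.

```python
import numpy as np


def kmeans(x, k, n_iter=100):
    """Plain Lloyd's k-means with Euclidean distance between row vectors."""
    # Deterministic start (an assumption): rows spread evenly through the data
    # after sorting by their mean value.
    order = np.argsort(x.mean(axis=1))
    centers = x[order[((np.arange(k) + 0.5) * len(x) / k).astype(int)]]
    for _ in range(n_iter):
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        updated = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    # Final assignment against the final centers.
    dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1), centers


# Synthetic log10 CCN spectra: three groups differing in magnitude and shape.
rng = np.random.default_rng(2)
ss = np.logspace(-2, 1, 20)                   # supersaturation grid, %
shape = np.log10(ss) - np.log10(ss).min()     # rises with supersaturation
groups = [2.0 + 0.2 * shape, 3.0 + 0.3 * shape, 4.0 + 0.1 * shape]
spectra = np.vstack([g + rng.normal(0.0, 0.1, (50, 20)) for g in groups])

labels, centers = kmeans(spectra, k=3)
```

Each row of `spectra` is one spectrum, so the clustering sees only CCN concentrations; any separation in the associated size distributions has to be checked afterwards by grouping them with the returned labels.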
Because of the factors discussed above, the different CCN clusters represent different characteristic particle number size distributions. Cluster 1 represents cases where the nucleation mode particles, associated with NPF events, dominate the number size distributions. Cluster 1 also has lower absolute number concentrations of accumulation mode particles than found in the other clusters. Cluster 2 represents the opposite case: the absence of small particles and higher accumulation mode number concentrations combined with a shift of the accumulation mode to larger sizes. Cluster 3 is the intermediate case, with some nucleation mode particles and a substantial accumulation mode. The three clusters represent approximately equal portions of the total number of observations. Because these clusters represent three different scenarios quite well, we will use them to simplify further discussion. Cluster 1 will subsequently be referred to as the nucleation cluster, cluster 2 the accumulation cluster, and cluster 3 the intermediate cluster.

Seasonal and Hourly Trends
The clustered data are examined for seasonal variations in particle and CCN characteristics. Cluster prevalence shows some seasonal dependence, although all clusters are still found a significant portion of the time for all seasons (Fig. 5). Summer (JJA) and winter (DJF) seasons show the highest prevalence of accumulation clusters, but significant differences in fractions of the intermediate and nucleation clusters. Summer has the highest prevalence of the intermediate clusters, while winter has the highest prevalence of nucleation clusters. This suggests that during the summer significant particle concentrations are more likely to coexist in both the accumulation and nucleation modes, or perhaps that growth of nucleation mode particles to larger sizes (i.e., transfer of particles to the accumulation mode) is more likely to occur. These trends are not obvious from looking at seasonal particle data alone.
An important consideration for reconciling seasonal particle data and seasonal cluster trends is the fact that the distributions within a cluster will have seasonal dependence as well, as shown in Figure C1. Figure 7 shows no hourly dependence in median values at 1% SSw or lower. At 10% SSw, the hourly trend in CCN is similar to the hourly trend in the nucleation cluster. These data combined suggest that the hourly changes that occur are due to addition of nucleation mode particles, rather than changes in the particle concentrations of other modes. Seasonal variability in the nucleation mode time-of-day dependence is shown in Figure C1 and reflects the same overall time dependence within a day, alongside seasonal changes shown in Figure 5. Another way to examine seasonal changes is through comparison of occurrence probabilities at a single supersaturation, as shown in Figure 8. Figure 8 shows this information in two similar ways: the skewed log-normal fits from Section 3.1.1 and Appendix B, and more traditional box plots. Both methods of parameterizing the data require the same number of parameters (3 coefficients, a zero fraction and a correlation coefficient for the fit; 5 points for the box and whiskers), but the fits convey more information. Seasonal differences are somewhat obscured by the box plot, but for cumulative CCN active at 1%

Time Evolution of Clusters
Our large data set allows for additional statistical analysis to examine evolution of CCN spectra over time. This gives insight into underlying processes that are obscured when examining single cases or shorter data periods.
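One simple way to summarize how cluster membership evolves from one ~45-minute interval to the next is a matrix of transition frequencies. The sketch below is an assumed approach for illustration, with a hypothetical label sequence; it does not account for gaps in the record, which would need to be excluded so that non-consecutive intervals are not counted as transitions.

```python
import numpy as np


def transition_matrix(labels, k):
    """Row-normalized counts of transitions from label a (row) to label b (column)."""
    counts = np.zeros((k, k))
    for a, b in zip(labels[:-1], labels[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions are left as zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Hypothetical sequence of cluster labels (0, 1, 2) for consecutive intervals.
example_labels = [0, 2, 1, 2, 0, 0, 2, 1]
probabilities = transition_matrix(example_labels, k=3)
```

Entry `[a, b]` is the fraction of intervals in cluster `a` that are followed by an interval in cluster `b`, so each populated row sums to one.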

There is some periodicity observed in the cluster evolution, which we examine more closely alongside fluctuations in CCN number concentrations using autocorrelations. Autocorrelation coefficients are calculated for several different SSw conditions, shown in Figure 10, using the methods described in Appendix D. Autocorrelation coefficients can be interpreted similarly to other correlation coefficients: they describe that portion of the variance that can be explained by the observation at a previous timepoint. The higher the value of an autocorrelation coefficient, the more stable that quantity is over time, so that an autocorrelation coefficient of 1 implies no change in state at a specified time lag, while 0 implies that a previous data point (separated by the specified time lag) has no influence on a current one. From Figure 10, a great deal of variability is observed in the first several hours of the computed time lags, which is an unexpected finding. Autocorrelation coefficients are expected to be highest for the first several timepoints, but the oscillating nature of these points implies aerosol processes with some periodicity in the 2-3 hour range. Natural processes that might produce such variability throughout the day and over all seasons seem unlikely, so the oscillation may be an artefact, for example, introduced by sampling schedules. We apply a bi-exponential fit as an approximate way to smooth the data for time lags of up to twelve hours, removing the effect of these oscillations. Single exponential fits are poor approximations of the shape of the autocorrelation functions for 0.1 and 1% SSw cases. Because a single decay pathway is expected to produce relatively consistent decay rates, the appearance of multiexponential decays suggests multiple decay pathways.
In this case decay pathways for autocorrelation can be interpreted as pathways for changes in CCN number concentrations. Autocorrelation decays much more quickly for larger (10%) SSw, with greatly increased values appearing at 24-hour intervals. At higher SSw, the CCN number concentration is often dominated by the smallest particles, associated with the nucleation mode. This interpretation fits well with the autocorrelation data, which indicate short-lived events tied to diurnal cycles. At moderate (1%) and low (0.1%) SSw, there is a pronounced fast initial decay in autocorrelation, followed by a period of slower decay. Fit constants are described in Appendix D and shown in Table D1. The fast initial decay rate is comparable (within the large uncertainties, Table D1) for low and moderate SSw cases, but the slow decay rate is significantly slower for the low SSw case. The diurnal peak (at 24-hour time lag) is also significantly weaker for the lowest SSw case, suggesting that there is less variability in these lowest SSw CCN observations, as compared to higher SSw. This is consistent with the fact that number concentrations at larger particle sizes are much less variable than those at smaller particle sizes. Previous work on AOD spatial and temporal autocorrelation at the SGP site (Alexandrov et al., 2004) and elsewhere (Anderson et al., 2003) suggests that the fast decay can be attributed to 3D micro-scale turbulent fluctuations, while the slow decay is due to 2D large-scale turbulence. The aerosol data used here are obtained only at the surface, so the influence of the three-dimensional nature of the atmosphere may be present but cannot be distinguished.
Given the role of NPF events, there is also potentially a chemical (non-turbulence-driven) source of variability for CCN, contributing to autocorrelation decay for CCN concentrations at high SSw. Ultimately these data illustrate that CCN spectra change rapidly over 1-3 hour timescales, with some conservation at longer timescales for the lower end of the supersaturation range. The granularity of our data (in ~45-minute increments) makes it somewhat difficult to resolve the exact timescales, but it is clear that the period of rapid change is in the 1-3 hour range. The role of variability in CCN concentration is something that should be considered in modelling studies that focus on the impacts of aerosol, especially those that use fixed concentrations of aerosol particles or those that do not capture the comprehensive processes that cause aerosol concentrations and properties to evolve. For example, using fixed CCN concentrations for a given short-term (<2 hour) simulation of shallow clouds (i.e., lower supersaturations) is more justifiable than for a longer-term simulation of the development of deep convective clouds (i.e., higher supersaturations), based on the faster autocorrelation decay rates of CCN at higher supersaturations. The autocorrelation results can also help define the time scales for data assimilation to ensure models are updated frequently enough to allow for accurate simulations.
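The autocorrelation and bi-exponential smoothing discussed in this section can be sketched as below. The methods of Appendix D are not reproduced here; the lagged Pearson correlation, the grid search over decay rates with linear least squares for the amplitudes, and the synthetic test signals are all assumptions made for illustration.

```python
import numpy as np


def autocorrelation(x, max_lag):
    """Pearson correlation between x(t) and x(t + lag), skipping NaN pairs."""
    x = np.asarray(x, dtype=float)
    acf = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        a, b = x[: len(x) - lag], x[lag:]
        ok = np.isfinite(a) & np.isfinite(b)
        acf[lag] = np.corrcoef(a[ok], b[ok])[0, 1]
    return acf


def fit_biexponential(t, y, rates):
    """Fit y ~ a_fast*exp(-k_fast*t) + a_slow*exp(-k_slow*t).

    Rates are searched on a grid (sorted ascending); amplitudes come from
    linear least squares for each rate pair.
    """
    best = None
    for i, k_fast in enumerate(rates):
        for k_slow in rates[:i]:
            basis = np.column_stack([np.exp(-k_fast * t), np.exp(-k_slow * t)])
            amp = np.linalg.lstsq(basis, y, rcond=None)[0]
            sse = float(np.sum((basis @ amp - y) ** 2))
            if best is None or sse < best[0]:
                best = (sse, amp[0], k_fast, amp[1], k_slow)
    return best[1:]


# Synthetic check: a 24 h cycle sampled every 45 min (32 samples per day).
step_h = 0.75
series = np.sin(2.0 * np.pi * np.arange(3200) / 32.0)
acf = autocorrelation(series, max_lag=32)

# Synthetic check of the fit on a known two-rate decay over 12 h of lags.
lags_h = np.arange(0.0, 12.01, step_h)
decay = 0.6 * np.exp(-1.0 * lags_h) + 0.4 * np.exp(-0.05 * lags_h)
a_fast, k_fast, a_slow, k_slow = fit_biexponential(
    lags_h, decay, np.logspace(-2, 1, 61))
```

With rates in inverse hours, `1 / k_fast` and `1 / k_slow` give the e-folding times of the fast and slow components; for a purely diurnal signal the autocorrelation returns to a peak at the 24-hour lag, analogous to the diurnal peaks described above.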

Conclusions
We have developed, described and examined a long-term CCN spectrum data product for the SGP site in Oklahoma. The data product builds on merged number size distributions and hygroscopicity measurements (Mahish and Collins, 2017) to create a best estimate of CCN spectra across a wide supersaturation range from ~0.0001% to 30%. It has been generated and verified by combining high quality data from eight different instruments. It has ~45-minute temporal resolution across five years of data, from 2009 to 2013, which has allowed for analyses not normally possible for smaller data sets.
We have determined that skewed log-normal distributions provide excellent fits to occurrence probabilities of CCN concentrations at any given supersaturation, as well as to occurrence probabilities of a wide range of other aerosol quantities.
These types of distributions have been observed for AOD measurements previously, but have not been widely used. They provide more information than traditional box plots, while requiring a comparable number of parameters. For established occurrence distribution shapes, shorter timescale measurements could likely take advantage of these fit parameters to fill in data gaps to estimate data over longer periods. They also serve as useful inputs to models that include the expected variability in input parameters in model predictions.
CCN spectra are controlled primarily by particle number size distributions, especially at larger SSw values (above ~0.2%).
In this high SSw region of the spectrum, it appears possible to generate estimated CCN spectra using only median κ values, rather than concurrent measurements of κ and size distribution. However, this estimation relies on accurate size distribution data that extend beyond 500 nm. Approximations of uncertainties introduced by this "median κ" estimation have been investigated and should hold for data from the SGP site during different time periods. This estimation method is also likely applicable for other sites, especially in similar environments, but testing this is beyond the scope of this analysis.
Clustering analysis also highlights number size distributions as the driving force behind changes in CCN spectra. Three distinct clusters are found for cases dominated by nucleation mode particles, accumulation mode particles, or similar amounts of each. These are analyzed seasonally and hourly, finding all clusters in significant quantities across all seasons and times. Intermediate clusters are more likely during summer months, while accumulation clusters are abundant in the winter. Fall and spring appear similar in this view, falling between summer and winter. Nucleation mode clusters are most likely during daylight hours, corresponding with decreased intermediate clusters but nearly invariant accumulation clusters.
Time evolutions are examined in this data set to try to understand the dynamics of CCN spectra. Analysis of transitions between clusters reveals that the most likely path is for nucleation and accumulation mode clusters to transition to an intermediate cluster first, rather than direct transitions occurring between the two. Autocorrelation analyses probe the evolution of a given CCN SSw bin over time. A relatively quick decay is found for all SSw values, with the bulk of correlation decaying in several hours, indicating that relatively large changes in CCN spectra can be expected over that time period. An additional slow decay is observed for smaller SSw values, indicating that the CCN number is better conserved at longer timescales (>2-3 hours) in lower SSw regions of the CCN spectrum, corresponding to particles in the coarse mode.
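The cluster-transition analysis summarized above can be illustrated with a minimal sketch. It assumes each ~45-minute spectrum has already been assigned an integer cluster label; the label coding (0 = nucleation, 1 = intermediate, 2 = accumulation), the function name, and the example sequence are hypothetical and are not taken from the data product.

```python
import numpy as np

def transition_matrix(labels, n_clusters):
    """Fraction of cluster changes going from row cluster to column cluster.

    Only timepoints where the label changes are counted, so the diagonal
    is zero by construction (an assumption of this sketch).
    """
    labels = np.asarray(labels)
    counts = np.zeros((n_clusters, n_clusters))
    for a, b in zip(labels[:-1], labels[1:]):
        if a != b:
            counts[a, b] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no observed transitions are left as zeros.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

# Hypothetical label sequence: 0 = nucleation, 1 = intermediate, 2 = accumulation.
example_labels = [0, 0, 1, 1, 2, 2, 1, 0, 1, 2]
P = transition_matrix(example_labels, 3)
```

In this made-up sequence, every transition out of the nucleation or accumulation cluster passes through the intermediate cluster, which is the qualitative pattern described in the text.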

Appendix A: Merging of Distributions and CCN Spectra
The initial number size distributions used in this work are generated from a combination of scanning mobility particle sizer (SMPS), aerodynamic particle sizer (APS), and condensation particle counter (CPC) data, as described previously, and are available in the DOE ARM archive. This initial dataset, here referred to as size distribution (SD) 1, is processed to take into account additional data from humidified tandem differential mobility analyzer (HTDMA), CCN counter (CCNC), nephelometer, and aerosol chemical speciation monitor (ACSM) instruments, as outlined in Figure A1. SD1 data are available at ~45-minute intervals, whereas the additional instrument data are available at higher time resolution and are subsequently averaged over the time period of the SD1 data for comparison. In order to compare to CCNC measurements, SD1 must be converted to CCN spectra using the hygroscopicity parameter (κ) that is derived from HTDMA measurements.

Size-resolved aerosol hygroscopicity was measured with an HTDMA (Collins, 2010), which first selected dried, monodisperse aerosol at seven diameters and then exposed them to a humidified (relative humidity, RH ~ 90%) growth region. The humidified aerosol number size distribution was then measured, and the ratio of the resulting humidified size (Dw) to the selected dry particle diameter (Dp) is termed the growth factor (GF = Dw/Dp). An example of HTDMA measured growth factor distributions is shown in Figure A2. The orange lines indicate the selected dry size for each GF measurement. The top part of the figure shows the number size distribution measured by the SMPS at the same time.

Figure A2: Example aerosol number distribution (solid line, upper plot) and growth factor (at 90% RH) distributions (lower plots) measured by an HTDMA. The dashed line in the upper plot is the corresponding volume distribution.
These GF data at a given RH (properly written e.g. GF(90%), but here abbreviated to GF for convenience) can be used to calculate the hygroscopicity of the particles as expressed via the hygroscopicity parameter, κ (Carrico et al., 2010; Petters and Kreidenweis, 2007):

$$\kappa = \left(GF^{3} - 1\right)\left[\frac{\exp\left(\dfrac{A}{D_d\,GF}\right)}{RH} - 1\right]$$

where RH is expressed as a fraction, Dd is the selected dry diameter, and

$$A = \frac{4\,\sigma_w M_w}{R\,T\,\rho_w} \qquad \text{(Eq. 3)}$$

and σw, Mw and ρw are the surface tension, molecular weight, and density of water, respectively, T is the absolute temperature, and R is the ideal gas constant. After calculating κ distributions from each measured GF distribution at the diameters selected by the HTDMA, we averaged κ for each selected Dd and linearly interpolated between selected sizes to generate a continuous distribution of aerosol hygroscopicity across the entire number size distribution.

CCN spectra were generated using these κ values, derived from the HTDMA growth factor data described above, using either the SD1 (initial) or SD3 (final) size distributions (described below). For each time period with concurrent κ and size distribution data, critical SSw was calculated for each size bin. This was accomplished using Eq. 1 assuming the following constant conditions: temperature of 25 °C, water density of 1 g/mL, and surface tension of 72 mJ/m². This calculation is accomplished numerically, by calculating water SSw for a logarithmically spaced array of wet diameters, with the largest SSw chosen as the critical value. Errors from this numerical method were less than 1% of the calculated values (e.g. a 1% error in a 0.01% SSw being ±0.0001% SSw). A CCN spectrum was then generated by adding up activated particle populations at each SSw value, making no assumptions about the order in which particles activated; smaller particles with higher κ could activate before larger particles with low κ if appropriate.
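The numerical procedure described above can be sketched as follows. This is a minimal illustration, not the processing code used for the data product: it assumes the standard κ-Köhler relations of Petters and Kreidenweis (2007) for both the GF-to-κ conversion and the saturation ratio, uses the fixed conditions stated in the text, and all function names, the grid size, and the 1000x upper limit on wet diameter are choices made here for illustration.

```python
import numpy as np

# Fixed conditions stated in the text.
SIGMA_W = 0.072     # surface tension of water, J/m^2 (72 mJ/m^2)
RHO_W = 1000.0      # density of water, kg/m^3 (1 g/mL)
T_K = 298.15        # 25 degrees C in kelvin
M_W = 0.018015      # molecular weight of water, kg/mol
R_GAS = 8.314462    # ideal gas constant, J/(mol K)

A_KELVIN = 4.0 * SIGMA_W * M_W / (R_GAS * T_K * RHO_W)  # Kelvin length, m

def kappa_from_gf(gf, dry_diam_m, rh_frac=0.90):
    """Kappa from a growth factor measured at fractional RH (kappa-Koehler form)."""
    wet = gf * dry_diam_m
    return (gf**3 - 1.0) * (np.exp(A_KELVIN / wet) / rh_frac - 1.0)

def critical_ss_percent(dry_diam_m, kappa, n_grid=4000):
    """Critical supersaturation (%) as the maximum over log-spaced wet diameters."""
    # Wet diameters from just above the dry size to 1000x the dry size (assumed range).
    wet = dry_diam_m * np.logspace(np.log10(1.0001), 3.0, n_grid)
    a_w = (wet**3 - dry_diam_m**3) / (wet**3 - dry_diam_m**3 * (1.0 - kappa))
    sat_ratio = a_w * np.exp(A_KELVIN / wet)
    return (sat_ratio.max() - 1.0) * 100.0

def ccn_spectrum(bin_diams_m, bin_counts, bin_kappas, ss_grid_percent):
    """Sum the counts of every bin whose critical SS is at or below each SS value."""
    counts = np.asarray(bin_counts, dtype=float)
    crit = np.array([critical_ss_percent(d, k)
                     for d, k in zip(bin_diams_m, bin_kappas)])
    # No ordering by size is assumed: each bin activates on its own critical SS.
    return np.array([counts[crit <= ss].sum() for ss in ss_grid_percent])

# Hypothetical three-bin distribution with a uniform kappa of 0.3.
spectrum = ccn_spectrum([50e-9, 100e-9, 200e-9], [100.0, 50.0, 10.0],
                        [0.3, 0.3, 0.3], [0.01, 0.1, 0.3, 1.0])
```

Because each bin is tested against its own critical SSw, a small bin with high κ can enter the spectrum before a larger bin with low κ, as noted in the text.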
In order to compare CCN spectrum 1 with measured CCNC values averaged across the same ~45-minute time intervals, interpolation of the CCNC data across the calculated SSw bins was performed using MATLAB's built-in Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) function for each CCNC spectrum. These data were used to create a distribution corrected for the CCNC data using a method similar to that described for the CPC corrections in SD1. The only difference between the algorithm described in Appendix A of the SD1 reference and the one used for generation of SD2 here occurs in step 3. We calculated a two-week rolling median percent difference between the CPC and SMPS+APS distribution and used this as a scaling factor across the entire distribution in this step. Times between 1200 and 1800 local time are excluded from the rolling median since new particle formation events are common during those times and large differences between the SMPS+APS integrated number concentration and the CPC number concentration are expected.
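The interpolation step can be sketched in Python with SciPy's `PchipInterpolator`, which implements the same shape-preserving piecewise cubic Hermite scheme as MATLAB's `pchip`. The supersaturation setpoints, concentrations, and target bins below are hypothetical values chosen only for illustration, and interpolating in linear SSw is an assumption of this sketch.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Hypothetical CCNC setpoints (% SSw) and averaged CCN concentrations (cm^-3).
ccnc_ss = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
ccnc_n = np.array([150.0, 420.0, 900.0, 1300.0, 1550.0, 1700.0])

# PCHIP preserves monotonicity, so an increasing spectrum stays increasing
# between setpoints and does not overshoot the measured values.
interp = PchipInterpolator(ccnc_ss, ccnc_n)

# Hypothetical calculated SSw bins that fall inside the measured range.
target_ss = np.array([0.15, 0.3, 0.5, 0.9])
ccn_on_bins = interp(target_ss)
```

Only SSw bins inside the measured CCNC range are evaluated here; extrapolation beyond the setpoints would need separate justification.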
To generate SD2, the average difference between CPC and CCNC (in total particles and SSw-specific CCN numbers, respectively) is used instead of solely CPC data. For example, if the comparison with the CPC suggested there should be 25% more particles in the SMPS+APS number size distributions and the comparison with the CCNC suggested there should be 15% fewer particles in the SMPS+APS number size distributions, then the SMPS+APS number size distribution data are scaled up by 5% (the average of +25% and -15%). The remaining steps described in the SD1 reference are performed unchanged on this distribution. The remaining steps in the algorithm can change the shape of the size distribution, so SD1 and SD2 are not simply scaled versions of each other. If no quality CCNC data are available for a given timepoint, the SD1 and SD2 spectra are identical.
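The scaling arithmetic in the example above can be written as a short sketch. The function name and argument names are hypothetical; only the averaging rule and the fallback to the CPC-only correction when CCNC data are missing come from the text.

```python
def combined_scale_factor(cpc_pct_diff, ccnc_pct_diff=None):
    """Multiplicative scale for the SMPS+APS distribution.

    cpc_pct_diff:  percent difference suggested by the CPC comparison
    ccnc_pct_diff: percent difference suggested by the CCNC comparison,
                   or None when no quality CCNC data are available
    """
    if ccnc_pct_diff is None:
        # Without CCNC data the correction reduces to the CPC-only (SD1) case.
        return 1.0 + cpc_pct_diff / 100.0
    return 1.0 + 0.5 * (cpc_pct_diff + ccnc_pct_diff) / 100.0

# Worked example from the text: +25% (CPC) and -15% (CCNC) average to +5%.
scale = combined_scale_factor(25.0, -15.0)
```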
The resulting SD2 was then compared to ACSM and nephelometer data to examine whether the CCNC correction was warranted. ACSM comparisons were accomplished by generating total particle mass concentration for each distribution assuming spherical particles and a density of 1.77 g/mL, which is that of ammonium sulfate. Additionally, the ACSM cutoff of 10 µm and the volume equivalent diameter (DeCarlo et al., 2004) were accounted for to produce calculated aerosol mass for comparison. The density chosen is within the region of best agreement between the ACSM and distribution data for both SD1 and SD2 and is chosen for consistency with the nephelometer comparison. The nephelometer comparison was accomplished by generating single-particle scattering cross-sections for all size bins in the distributions assuming the optical properties of ammonium sulfate, which was again within the region of best agreement in Figure A3. Both of these comparisons produce excellent agreement for many time points, as shown in Figure A4. There was evidence of systematic bias for some time periods, but the bias was relatively low for periods outside of 10-Mar-2011 through 1-Nov-2011 for the nephelometer data, which were not used in distribution selection, below. SD2 generally produced better agreement with the nephelometer and ACSM than SD1 did, although this was not true for all time periods.

In order to construct a final data set including the nephelometer and ACSM comparisons, the ratios between distribution-calculated values and measured values were used to select between SD1 and SD2 at each point where data were available. Given the better general agreement for SD2, it was used as default if there were no ACSM or nephelometer data available, or if there was disagreement between the two instruments.
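A minimal sketch of the mass calculation and of one possible selection rule is given below. The mass calculation follows the stated assumptions (spherical particles, ammonium sulfate density); the cutoff handling is a simple hard threshold, and the selection logic (a per-instrument vote based on which ratio is closer to 1 in log space, with a single available instrument allowed to decide) is an assumption made here, since the exact rule is not spelled out in the text. All names are hypothetical.

```python
import numpy as np

RHO_AS = 1.77  # density of ammonium sulfate, g/cm^3

def mass_conc_ug_m3(diam_nm, number_cm3, rho=RHO_AS, cutoff_nm=None):
    """Total mass concentration (ug/m^3) of spherical particles in each size bin."""
    d = np.asarray(diam_nm, dtype=float)
    n = np.asarray(number_cm3, dtype=float)
    if cutoff_nm is not None:
        # Hard upper size cutoff; a real inlet has a gradual transmission curve.
        n = np.where(d <= cutoff_nm, n, 0.0)
    volume_cm3 = np.pi / 6.0 * (d * 1e-7) ** 3      # nm -> cm
    grams_per_cm3_air = np.sum(n * volume_cm3 * rho)
    return grams_per_cm3_air * 1e12                  # g/cm^3 -> ug/m^3

def select_distribution(sd1_ratios, sd2_ratios):
    """Choose 'SD1' or 'SD2' from calculated/measured ratios per instrument.

    Each argument is a sequence with one ratio per instrument (for example
    ACSM, nephelometer); None marks missing data. SD2 is the default when no
    instrument data exist or when the instruments disagree.
    """
    votes = []
    for r1, r2 in zip(sd1_ratios, sd2_ratios):
        if r1 is None or r2 is None:
            continue
        votes.append("SD1" if abs(np.log(r1)) < abs(np.log(r2)) else "SD2")
    if votes and all(v == votes[0] for v in votes):
        return votes[0]
    return "SD2"

# 1000 particles/cm^3 at 200 nm correspond to roughly 7.4 ug/m^3.
example_mass = mass_conc_ug_m3([200.0], [1000.0])
```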
Through this process, 4711 distributions were selected from SD1, and 16203 from SD2, with 19407 points defaulting to SD2. The resulting distribution, SD3, compared with ACSM and nephelometer measurements in Figure A3, is considered the final product distribution, and is analyzed throughout the manuscript alongside the CCN spectra generated from it.

Appendix B: Skewed Log-Normal Fits
The log-normal probability density function is defined as:

$$f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\left[-\frac{(\log x - \mu)^{2}}{2\sigma^{2}}\right]$$

where x is a number concentration (bin), µ is the median value of log(x), and σ² is the variance in log(x). The log-normal cumulative distribution function is defined as:

$$F(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{\log x - \mu}{\sigma\sqrt{2}}\right)\right]$$

where erf is the error function. A skewed log-normal probability density function can subsequently be defined as:

$$f_{s}(x) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\left[-\frac{(\log x - \mu)^{2}}{2\sigma^{2}}\right]\left[1 + \mathrm{erf}\left(\frac{\alpha\,(\log x - \mu)}{\sigma\sqrt{2}}\right)\right]$$

where α is a parameter representing the degree of skew, such that when α = 0, the log-normal distribution is recovered. When these functions are used to fit data, µ, σ, and α are used as the fit parameters.

Two issues arise using skewed log-normal fits. The first is that there is no closed-form expression for the median value of a skewed log-normal distribution. The median can of course still be evaluated numerically. The second is that x values of zero cannot be represented within the distribution due to the logarithm. In this work, we address these issues by simply reporting median values and fractions of the data where zeros occur, alongside fit parameters. It should be noted that µ is no longer the median value of log(x) for the skewed log-normal distribution. This can be observed in the data in Supporting File CCN_fit_coeffs.txt, where µ decreases at high SSw while the median value increases monotonically, as expected for a cumulative distribution.

Fits are generally very good, as shown in Figure B1 for several different aerosol quantities. At very low SSw, or for very large particle bins, the quality of the fits degrades due to the large amount of noise in the data. This noise occurs largely due to detection limits of the instruments involved at low particle concentrations (for very large particles). If concentrations are so low that particle detections are not guaranteed in the sampling period (45 minutes), a large amount of shot noise is introduced.
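Because the median of a skewed log-normal distribution has no closed form, it has to be found numerically. The sketch below shows one way to do this, assuming the common skew form in which the log-normal density is multiplied by 1 + erf(α z/√2), with z = (ln x − µ)/σ and natural logarithms; this form, the function names, and the integration grid are assumptions of the sketch rather than details confirmed by the text.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # element-wise error function using only the stdlib

def lognormal_pdf(x, mu, sigma):
    """Log-normal density with natural-log parameters mu and sigma."""
    x = np.asarray(x, dtype=float)
    return np.exp(-(np.log(x) - mu) ** 2 / (2.0 * sigma**2)) / (
        x * sigma * np.sqrt(2.0 * np.pi))

def skewed_lognormal_pdf(x, mu, sigma, alpha):
    """Skewed log-normal density; alpha = 0 recovers the log-normal density."""
    x = np.asarray(x, dtype=float)
    z = (np.log(x) - mu) / sigma
    return lognormal_pdf(x, mu, sigma) * (1.0 + _erf(alpha * z / np.sqrt(2.0)))

def numerical_median(mu, sigma, alpha, n_grid=20001):
    """Median from a trapezoid-rule cumulative integral taken in u = ln(x)."""
    u = np.linspace(mu - 10.0 * sigma, mu + 10.0 * sigma, n_grid)
    x = np.exp(u)
    dens_u = x * skewed_lognormal_pdf(x, mu, sigma, alpha)  # density per unit ln(x)
    steps = 0.5 * (dens_u[1:] + dens_u[:-1]) * np.diff(u)
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    return float(np.exp(np.interp(0.5, cdf, u)))

median_unskewed = numerical_median(5.0, 1.0, 0.0)   # close to exp(5)
median_skewed = numerical_median(5.0, 1.0, 2.0)     # shifted above exp(5)
```

With α = 0 the numerical median coincides with exp(µ), while a positive α moves the median above exp(µ), which illustrates why µ alone should not be read as the median for skewed fits.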
Figure B1: Examples of skewed log-normal fits to occurrence probabilities for aerosol metrics, as labelled: total particle number, total particle volume, dN/dLogDp function at 100 nm, dN/dLogDp function at 1000 nm, particle number concentrations active as CCN at 0.01%, 0.1%, 2.8%, and 5.3% SSw.

$$d(x, c) = 1 - \frac{x\,c'}{\sqrt{(x\,x')(c\,c')}} \qquad \text{(Eq. 9)}$$

where x is input data, and c is a cluster. Both x and c are arrays, with subscripts indicating a single array element and apostrophes indicating a transpose operation. It was found that metrics 1 and 2 produced separation of spectra based solely on total particle number, whether clustering was applied to aerosol number size distributions or to CCN spectra. Distance metric 3, however, produced well-resolved clusters based on distribution shape, and was selected for final cluster designations.
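The shape sensitivity of distance metric 3 can be seen in a short sketch, assuming it is the cosine distance (one minus the cosine of the angle between a spectrum and a cluster centroid). The function names and example arrays are hypothetical.

```python
import numpy as np

def cosine_distance(x, c):
    """One minus the cosine of the angle between spectrum x and centroid c."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    return 1.0 - np.dot(x, c) / np.sqrt(np.dot(x, x) * np.dot(c, c))

def nearest_cluster(x, centroids):
    """Index of the centroid with the smallest cosine distance to x."""
    return int(np.argmin([cosine_distance(x, c) for c in centroids]))

# Two spectra with the same shape but a tenfold difference in total number.
spectrum_low = np.array([1.0, 2.0, 3.0])
spectrum_high = 10.0 * spectrum_low
same_shape_distance = cosine_distance(spectrum_low, spectrum_high)
```

Scaling a spectrum by a constant leaves the cosine distance unchanged, so clusters separate by spectral shape rather than by total particle number, unlike distance measures that depend on absolute magnitude.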
Clustering in CCN space also produced well resolved clusters in size distribution space, and vice versa. CCN space was ultimately chosen for clustering, due to the focus of the current work, but differences from the alternative are expected to be very minor.
Next, the optimal number of clusters was explored. While this is often accomplished somewhat arbitrarily, intuitively, or based on external models for a given process, statistical methods have been developed to guide the process. We chose to use the "gap statistic" (Tibshirani et al., 2001), a built-in MATLAB functionality available through the "evalclusters" function. This method was too computationally intensive to use on the entire data set, so a subset of 500 randomly selected spectra was used instead. Three clusters were suggested to be the optimal number to use, based on this approach. These clusters all appeared physically distinct, as discussed in Section 3.1.2, and addition of a fourth cluster simply resulted in the splitting of two adjacent clusters.
Three clusters were thus chosen for use in further analysis.
Clusters are generally similar year round, but there is some seasonal dependence within a given cluster, as shown in Figure C1.

For evenly spaced timepoints, the autocorrelation coefficient is:

$$r_k = \frac{\sum_{i=1}^{N-k} \left(Y_i - \bar{Y}\right)\left(Y_{i+k} - \bar{Y}\right)}{\sum_{i=1}^{N} \left(Y_i - \bar{Y}\right)^{2}} \qquad \text{(Eq. 10)}$$

where rk is the autocorrelation coefficient at time lag k, N is the total number of timepoints, Yi is the measurement value (in our case CCN number concentration) at timepoint i, and Ȳ is the mean measurement value. For timepoints that are not evenly spaced, the same coefficient can be produced with a few extra steps. The way we have accomplished this is to 1) calculate differences between all adjacent timepoints, for a fixed number of integer time lags, before 2) sorting all of these data into time bins based on how much time elapsed between any given set of measurements.
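Both versions of the autocorrelation calculation can be sketched as follows. The evenly spaced function uses the standard lag-k estimator; the binned version is one plausible reading of the two-step procedure described above (pair up points at integer index lags, then sort the pairs by elapsed time), and it normalizes each bin by the mean product rather than by the full-series sum, which is an assumption of this sketch. Function names, bin edges, and the test series are hypothetical.

```python
import numpy as np

def autocorr(y, k):
    """Lag-k autocorrelation coefficient for evenly spaced data."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    return np.sum(dev[: len(y) - k] * dev[k:]) / np.sum(dev**2)

def binned_autocorr(t, y, max_lag_steps, bin_edges):
    """Autocorrelation for unevenly spaced data, binned by elapsed time.

    t and y are times and values; pairs separated by 1..max_lag_steps index
    steps are formed, and their products are averaged within each time bin.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    variance = np.mean(dev**2)
    lags, products = [], []
    for k in range(1, max_lag_steps + 1):
        lags.append(t[k:] - t[:-k])          # elapsed time for each pair
        products.append(dev[k:] * dev[:-k])  # product of deviations for each pair
    lags = np.concatenate(lags)
    products = np.concatenate(products)
    idx = np.digitize(lags, bin_edges) - 1   # bin index for each pair
    n_bins = len(bin_edges) - 1
    out = np.full(n_bins, np.nan)            # NaN marks bins with no pairs
    for b in range(n_bins):
        sel = idx == b
        if sel.any():
            out[b] = products[sel].mean() / variance
    return out

# Alternating test series sampled every 0.75 h (45 minutes).
y_alt = np.tile([1.0, -1.0], 50)
t_alt = np.arange(100) * 0.75
r1 = autocorr(y_alt, 1)
r_binned = binned_autocorr(t_alt, y_alt, 2, [0.5, 1.0, 1.75])
```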
Autocorrelation functions were subsequently fit to bi-exponential decays for the first 12 hours of time lag data, using the form given in Eq. 11. This is accomplished using the "fit" function in MATLAB, which provides 95% confidence interval information. Best fit parameters and 95% confidence intervals are reported in Table D1 below.
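A comparable fit can be sketched in Python with `scipy.optimize.curve_fit`. The parameterization below (two amplitudes and two time constants, with no offset) is an assumption, since the exact form of Eq. 11 is not reproduced here, and the confidence half-widths use a normal approximation (1.96 standard errors) rather than MATLAB's calculation. The synthetic autocorrelation values are made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def biexp(t, a1, tau1, a2, tau2):
    """Sum of a fast and a slow exponential decay (assumed parameterization)."""
    return a1 * np.exp(-t / tau1) + a2 * np.exp(-t / tau2)

# Synthetic, noise-free autocorrelation values over the first 12 hours of lag,
# sampled every 0.75 h; the "true" parameters are hypothetical.
lag_h = np.arange(0.0, 12.01, 0.75)
acf = biexp(lag_h, 0.6, 1.0, 0.4, 10.0)

popt, pcov = curve_fit(biexp, lag_h, acf,
                       p0=[0.5, 0.5, 0.5, 5.0],
                       bounds=(0.0, np.inf), maxfev=10000)

# Approximate 95% confidence half-widths from the parameter covariance.
ci_half_width = 1.96 * np.sqrt(np.diag(pcov))
tau_fast, tau_slow = sorted([popt[1], popt[3]])
```

With real autocorrelation data, a slow time constant comparable to or longer than the 12-hour fitting window would be poorly constrained, which would show up as wide confidence intervals.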

Data Availability
All data are publicly available via the U.S. Department of Energy's Atmospheric Radiation Measurement (ARM) user facility data archive (Atmospheric Radiation Measurement (ARM), 1995, 2001, 2007, 2011

Supplement link
A supplemental text document containing skewed log-normal fit coefficients for all CCN data, named CCN_fit_coeffs.txt, is available at XXX.

Author Contribution
RJP performed the analyses presented. All authors assisted in interpretation of raw instrument data and construction of the merged CCN product. RJP and SMK prepared the manuscript with contributions from all co-authors.