An unsupervised machine-learning-based classification of aerosol microphysical properties over 10 years at Cabo Verde

The Cape Verde Atmospheric Observatory (CVAO), which is influenced by both marine and desert dust air masses, was used for long-term measurements of different properties of the atmospheric aerosol from 2008 to 2017. These properties include particle number size distributions (PNSD), light absorbing carbon (LAC) and concentrations of cloud condensation nuclei (CCN) together with their hygroscopicity. Here we summarize the results obtained for these properties and use an unsupervised machine learning algorithm for the classification of aerosol types. Five types of aerosols, i.e., marine, freshly-formed, mixture, moderate dust and heavy dust, were classified. Air masses during marine periods are from the Atlantic Ocean and during dust periods are from the Sahara. Heavy dust was more frequently present during wintertime, whereas the clean marine periods were more frequently present during springtime. It was observed that during the dust periods CCN number concentrations at a supersaturation of 0.30% are roughly 2.5 times higher than during marine periods, but the hygroscopicity (κ) of particles in the size range from ∼30 to ∼175 nm during marine and dust periods is comparable. The long-term data presented here, together with the aerosol classification, can be used as a basis to improve our understanding of annual cycles of the atmospheric aerosol in the eastern tropical Atlantic and of aerosol-cloud interactions, and for driving, evaluating and constraining atmospheric model simulations.

It is challenging to analyze and interpret the large data sets resulting from long-term measurements. Machine learning algorithms can build powerful models that can quickly and automatically analyze bigger, more complex data sets and deliver faster, more accurate results. The choice of unsupervised or supervised machine learning algorithms depends on the scientific questions to be answered. Each method has its advantages and disadvantages. The different machine learning algorithms can be very sensitive to user-defined parameters and implementation. A number of freely available, open-source and easy-to-use packages in popular programming languages, such as Python, have been developed in academia and industry. Machine learning approaches have been increasingly used for particle classification. For example, Christopoulos et al. (2018) implemented a supervised decision tree for aerosol classification for on-line single-particle mass spectrometry. Atwood et al. (2019) classified aerosol population types using an unsupervised K-means cluster analysis. In the present study, the unsupervised machine learning algorithm K-means is used to help us better understand the particle microphysical properties at CVAO.

2 Experiment and Methods

Sampling location and measurement setup
The measurement station (CVAO, 16°51′49″ N, 24°52′02″ W) is located at the northeastern shore of the São Vicente island. It is about 10 m a.s.l. and 70 m from the coastline. The São Vicente island itself is in the North Atlantic Ocean, ∼900 km off the African coast. This region constantly experiences northeasterly winds, with an average wind speed of about 7.3 m s⁻¹. The average annual temperature at São Vicente is about 23.6 ± 4.0 °C. Cabo Verde has very little precipitation, with about 3 to 10 rain events and 24-350 mm of rainfall per year, mainly between August and October (Fomba et al., 2013).
An aerosol PM10 inlet, installed on top of a 32 m tower, was deployed to remove particles larger than 10 µm. The purpose of installing an aerosol inlet on such a high tower is to minimize the influence of sea spray aerosol generated in the surf zone at the coastline. Downstream of the PM10 inlet, a vertical stainless-steel sampling pipe (1/2 in. outer diameter), together with a 1-meter long diffusion dryer, was placed on top of a measurement container. All microphysical instruments were placed inside this air-conditioned container. The microphysical instruments included a TROPOS-type MPSS (mobility particle size spectrometer), an APS (aerodynamic particle sizer, model 3321, TSI Inc.), a MAAP (multi-angle absorption photometer, type 5012, Thermo Scientific Inc.) and a CCNC (cloud condensation nuclei counter, Droplet Measurement Technologies). More details concerning the CVAO station setup can be found in Gong et al. (2020b).

Particle size
PNSDs in the size range from 20 nm to 10 µm were measured by using a TROPOS-type MPSS and an APS. The electrical mobility distribution (measured by the MPSS in the size range from 20 to ∼800 nm) is converted to a PNSD by applying an inversion algorithm to correct for multiply charged aerosol particles (Wiedensohler, 1988) and diffusional losses (Wiedensohler et al., 2012). In this study, APS data, i.e., PNSDs for sizes above ∼700 nm, were included to enable a multiple charge correction of the MPSS data when applying the inversion algorithm (for more explanation see Pfeifer et al., 2014). Prior to using APS data in the inversion, the aerodynamic diameters as measured by the APS had to be converted to geometric diameters, for which a dry dynamic shape factor and particle density are needed. The dry dynamic shape factor (χ) and density (ρ) of sodium chloride are 1.08 and 2160 kg m⁻³ (Kelly and McMurry, 1992; Gysel et al., 2002), respectively, whereas the dynamic shape factor and density of mineral dust are 1.25 (Kaaden et al., 2009) and in a range of 2450-2700 kg m⁻³ (Haywood et al., 2001), respectively.
An averaged shape factor and density of 1.17 and 2000 kg m⁻³ were used in this study (Schladitz et al., 2011). Overall, the combined PNSD is given based on volume equivalent particle diameters in this study. Size-dependent particle losses within the inlet were corrected (refer to Appendix A). Based on the combined PNSDs, total particle number concentration (N_total) and particle number concentrations (PNC) in different size ranges can be calculated. The MPSS and APS were calibrated regularly during the measurements.
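The conversion from aerodynamic to volume equivalent diameter described above can be illustrated with a minimal sketch. It assumes the commonly used relation d_ve = d_aer · sqrt(χ ρ_0 / ρ) with a reference density ρ_0 of 1000 kg m⁻³ and neglects the slip correction, which is a reasonable simplification only for larger particles; the function name and the example diameters are hypothetical and not taken from the study.

```python
import math

RHO_0 = 1000.0  # reference (unit) density in kg m^-3


def aerodynamic_to_volume_equivalent(d_aer_nm, shape_factor=1.17, density=2000.0):
    """Convert an aerodynamic diameter (nm) to a volume equivalent diameter (nm).

    Assumption: slip-correction factors are close to 1 and are neglected.
    Defaults are the averaged shape factor and density quoted in the text.
    """
    return d_aer_nm * math.sqrt(shape_factor * RHO_0 / density)


# Illustrative APS diameters only
for d_aer in (700.0, 1000.0, 5000.0):
    print(d_aer, round(aerodynamic_to_volume_equivalent(d_aer), 1))
```

With the averaged values of 1.17 and 2000 kg m⁻³, the conversion factor is about 0.76, so the APS size axis shifts noticeably towards smaller diameters.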

Light absorbing carbon
Light absorbing carbon (LAC) concentrations were derived from measurements with a multi-angle absorption photometer (MAAP, Petzold and Schönlinner, 2004). The MAAP measures the optical transmission and reflection at a wavelength of 637 nm of a filter while it is loaded with particles. An internal algorithm calculates an equivalent black carbon concentration (eBC, Petzold et al., 2013) using a mass specific absorption coefficient (MAC) of 6.6 m² g⁻¹. The MAC is a conversion factor between the light absorption coefficient (σ_abs), which is the primary measured parameter, and a corresponding eBC, which are related by:

$$\mathrm{eBC} = \frac{\sigma_{\mathrm{abs}}}{\mathrm{MAC}}$$

In , it was shown that the values for eBC and σ_abs had to be corrected by up to a factor of 1.05 to correct for an originally wrong assumption of the wavelength. Furthermore, filter-based absorption photometers are affected by an artifact due to light scattering particles. While the multi-angle approach of the MAAP compensates for this to a large extent, it was shown in  that σ_abs is still overestimated by about 1% of the value of the light scattering coefficient (σ_sca) at the wavelength of 637 nm. The calculation of the fully corrected light absorption coefficient (σ_abs,corr) is shown in the following equation: Since σ_sca was not measured directly, it was derived from PNSDs using Mie scattering theory (Bohren and Huffman, 2008). The required complex refractive index was assumed to be 1.49 − 0.01i. This value is a reasonable assumption for less absorbing marine aerosols (Hess et al., 1998). A comparison of probability density functions (PDFs) between σ_abs and σ_abs,corr is shown in Appendix B.
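As a small illustration of the MAC conversion described above, the following sketch converts a light absorption coefficient into eBC. Only the MAC value of 6.6 m² g⁻¹ comes from the text; the function name and the unit convention (σ_abs in Mm⁻¹, eBC in µg m⁻³) are assumptions made for this example, and the wavelength and scattering corrections are deliberately left out.

```python
MAC = 6.6  # mass specific absorption coefficient in m^2 g^-1 (from the text)


def ebc_from_absorption(sigma_abs_Mm):
    """Return eBC in ug m^-3 for a light absorption coefficient given in Mm^-1.

    1 Mm^-1 = 1e-6 m^-1 and 1 g = 1e6 ug, so the two unit factors cancel.
    """
    return sigma_abs_Mm / MAC


# Example: 0.5 Mm^-1 corresponds to roughly 0.076 ug m^-3 of eBC
print(round(ebc_from_absorption(0.5), 3))
```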

Cloud condensation nuclei and particle hygroscopicity
A cloud condensation nuclei counter (CCNC, Roberts and Nenes, 2005) was used to measure CCN number concentration (N_CCN). The main part of the CCNC is a cylindrical continuous-flow thermal-gradient diffusion chamber. A constant stream-wise temperature gradient is established to adjust a quasi-constant centerline supersaturation. During the measurement, the sampled aerosol particles are guided through this chamber within a sheath flow along the centerline. Depending on the supersaturation adjusted in the CCNC and the particle's hygroscopic properties, particles can become activated to droplets.
In this study, the calibrated supersaturations were 0.15%, 0.20%, 0.30%, 0.50% and 0.70%. The supersaturation is switched every 12 minutes. The measurement data during the first 5 minutes and the last 30 seconds at each supersaturation were excluded from the data analysis to assure a stable column temperature when switching the supersaturation, and the remaining data points were averaged. To ensure the accuracy of the results, a supersaturation calibration was done repeatedly during the long-term deployment of the instrument at the site (more details concerning the calibration method in Gysel and Stratmann, 2013).
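The data screening described above (12-minute supersaturation steps, discarding the first 5 minutes and the last 30 seconds, then averaging) could be sketched as follows. The data layout, the function name and the assumption that the steps are aligned to t = 0 are hypothetical; the actual processing chain of the study is not given in the text.

```python
STEP_S = 12 * 60        # duration of one supersaturation step in seconds
SKIP_START_S = 5 * 60   # discarded at the beginning of each step
SKIP_END_S = 30         # discarded at the end of each step


def average_steps(samples):
    """Average N_CCN over the retained part of each supersaturation step.

    samples: iterable of (t_seconds, n_ccn), with steps assumed aligned to t = 0.
    Returns {step_index: mean N_CCN}.
    """
    kept = {}
    for t, n_ccn in samples:
        step, offset = divmod(t, STEP_S)
        # Keep only data after the settling time and before the final 30 s
        if SKIP_START_S <= offset < STEP_S - SKIP_END_S:
            kept.setdefault(int(step), []).append(n_ccn)
    return {step: sum(values) / len(values) for step, values in kept.items()}
```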
Whether or not a particle can be activated to a droplet depends on its chemical composition, its dry size and the maximum supersaturation it encounters (according to Köhler theory, Köhler, 1936). A single parameter κ can be used to describe the relationship between particle hygroscopicity and the dry diameter at which the particle will be activated (Petters and Kreidenweis, 2007). Assuming the surface tension of the examined solution droplets (σ_s/a) to be that of pure water, in this study κ was calculated as follows:

$$\kappa = \frac{4A^{3}}{27\, d_{\mathrm{crit}}^{3}\, \ln^{2}\left(1 + \frac{SS}{100}\right)}, \qquad A = \frac{4\, \sigma_{s/a}\, M_{w}}{R\, T\, \rho_{w}}$$

where SS and d_crit are the supersaturation in % and the critical diameter above which all particles activate into droplets at this given supersaturation, respectively. M_w is the molar mass of water; R = 8.314 J (K mol)⁻¹ is the ideal gas constant; T is the absolute temperature; ρ_w is the density of water.
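A minimal numerical sketch of this calculation is given below. It uses the single-parameter relation of Petters and Kreidenweis (2007) in its usual approximate form; the temperature of 298.15 K and the water properties are assumed values chosen for illustration, not values reported in the study, and the function name is hypothetical.

```python
import math


def kappa_from_dcrit(d_crit_nm, ss_percent, temperature_k=298.15):
    """Estimate kappa from a critical dry diameter (nm) and supersaturation (%).

    Uses kappa = 4 A^3 / (27 d_crit^3 ln^2 S) with A = 4 sigma M_w / (R T rho_w)
    and S = 1 + SS/100. Surface tension is that of pure water (assumed 0.072 N m^-1).
    """
    sigma = 0.072      # N m^-1, surface tension of pure water (assumed)
    m_w = 0.018015     # kg mol^-1, molar mass of water
    rho_w = 997.0      # kg m^-3, density of water (assumed)
    r_gas = 8.314      # J (K mol)^-1, ideal gas constant
    a = 4 * sigma * m_w / (r_gas * temperature_k * rho_w)  # Kelvin term in m
    d = d_crit_nm * 1e-9                                   # nm -> m
    ln_s = math.log(1 + ss_percent / 100)
    return 4 * a ** 3 / (27 * d ** 3 * ln_s ** 2)


# A critical diameter of 80 nm at SS = 0.30% gives kappa of roughly 0.3
print(round(kappa_from_dcrit(80.0, 0.30), 2))
```

Because κ scales with d_crit⁻³, small uncertainties in the derived critical diameter translate into comparatively large uncertainties in κ.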
In this method, aerosol particles are assumed to be internally-mixed. All particles in the neighborhood of a given size are assumed to have a similar κ. A specific particle can be activated to a droplet if its dry diameter is larger than or equal to d_crit at a fixed supersaturation. Therefore, d_crit is the diameter at which the cumulative particle number concentration from the largest diameter to d_crit is equal to N_CCN. In this study, the measured N_CCN and PNSDs are both used to derive d_crit.
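The derivation of d_crit from N_CCN and a PNSD can be sketched as a cumulative sum from the largest size bin downwards. The binning format (ascending bin edges with a number concentration per bin), the log-linear interpolation inside a bin and the function name are assumptions made for this illustration.

```python
import math


def critical_diameter(edges_nm, bin_counts, n_ccn):
    """Find d_crit (nm) such that the number of particles larger than d_crit equals N_CCN.

    edges_nm: ascending bin edges, len(edges_nm) == len(bin_counts) + 1
    bin_counts: particle number concentration per bin (cm^-3)
    Returns None if N_CCN exceeds the total particle number concentration.
    """
    cumulative = 0.0
    for i in range(len(bin_counts) - 1, -1, -1):
        if bin_counts[i] > 0 and cumulative + bin_counts[i] >= n_ccn:
            # Fraction of this bin, counted from its upper edge, needed to reach N_CCN.
            # Particles are assumed evenly spread in log-diameter within the bin.
            frac = (n_ccn - cumulative) / bin_counts[i]
            log_hi, log_lo = math.log(edges_nm[i + 1]), math.log(edges_nm[i])
            return math.exp(log_hi - frac * (log_hi - log_lo))
        cumulative += bin_counts[i]
    return None
```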

Backward trajectories
A backward trajectory calculation was conducted with the HYbrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) Model (Rolph, 2003; Stein et al., 2015). The backward trajectories were calculated for 6 days with a resolution of 6 hours, arriving at 200 m above the measurement site.

Unsupervised machine learning algorithm
An unsupervised K-means algorithm was used in this study to classify aerosol particle types. The K-means model clusters samples by trying to separate data into k groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. Given a set of observations (x_1, x_2, ..., x_n), where each observation is a d-dimensional real vector, K-means clustering aims to partition the n observations into k (≤ n) sets C = {C_1, C_2, ..., C_k} so as to minimize the within-cluster sum-of-squares. Formally, the objective is to find:

$$\underset{C}{\arg\min} \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - u_j \rVert^{2}$$

where u_j is the mean of points in C_j.
It is important to start the analysis by determining the optimal number of clusters, k. In this study, we used an elbow plot and the Calinski-Harabasz score (Caliński and Harabasz, 1974) to determine the optimal k, which is explained in more detail in Appendix C.

submicron size range and a coarse mode in the supermicron size range are observed. The Aitken mode ranges from ∼20 to ∼80 nm, and the accumulation mode ranges from ∼80 to ∼1000 nm. The number concentration of supermicron particles shows a high variation, from 1 to above 100, with a median value of 3.8 cm⁻³.
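To make the objective concrete, the following is a minimal pure-Python sketch of the standard Lloyd iteration for K-means, returning the labels, the cluster means and the inertia. It is an illustration of the general algorithm only; the study does not state which implementation was used, and the random initialization, the function names and the handling of empty clusters are choices made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j])) for p in points]


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (labels, centers, inertia)."""
    rng = random.Random(seed)
    centers = [tuple(p) for p in rng.sample(points, k)]  # random initial centers
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: cluster means no longer change
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, inertia
```

Running such a routine for a range of k values and plotting the inertia against k gives the kind of elbow plot used to choose the number of clusters.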

Seasonal variation of light absorbing carbon
The monthly variation of σ_abs and σ_abs,corr, averaged over all years, is shown in Fig. 2. For σ_abs,corr, values were mostly below 0.5 Mm⁻¹. However, higher values of up to 1 Mm⁻¹ were observed in October, December and January. The light absorbing aerosols usually include black carbon, brown carbon and dust. The CVAO station is free of local anthropogenic pollution because it is located at the northeastern shore and experiences constant northeasterly winds. Therefore it is highly unlikely that black carbon and brown carbon alone could explain the observed variation of the σ_abs values. The higher values of σ_abs coincide with months during which high dust concentrations were generally observed in this study (Sect. 3.2.2).

of 0.30 (N_CCN,0.30%) varied from ∼10 to above 1000, with a mean value of 336 cm⁻³. The high variability of N_CCN also can be seen in the wide-spread probability density function of CCN number concentrations (PDF-N_CCN), shown in Fig. 4 (a).
The PDF-N_CCN shows multiple modes at supersaturations of 0.15%, 0.20%, 0.30%, 0.50% and 0.70%, which indicates that particles acting as CCN arriving at CVAO originate from different sources. The boxplot of N_CCN at different supersaturations is shown in Fig. 5 (a). It is clear that N_CCN values increase towards higher supersaturations, because increasingly smaller particles can be activated to droplets at increasingly higher supersaturations. The mean/median values of N_CCN also exhibit a large variability, increasing from a mean of 233 to 485 cm⁻³ for a variation in supersaturations from 0.15% to 0.70%.
Time series of d_crit values at all supersaturations (0.15%, 0.20%, 0.30%, 0.50% and 0.70%) are shown in Fig. 3

derived κ at a given supersaturation represents the particle averaged hygroscopicity around the corresponding d_crit. Therefore, hygroscopicities derived at supersaturations of 0.70% and 0.50% can be assumed to be representative of the Aitken mode, whereas hygroscopicities derived at a supersaturation of 0.15% can be assumed to be representative of the accumulation mode. d_crit values at supersaturations of 0.20% and 0.30% range from about 70 to 100 nm. In this size range, marine aerosol particle size distributions typically show a minimum between Aitken and accumulation mode (Modini et al., 2015; Gong et al., 2020b), also known as the Hoppel minimum (Hoppel et al., 1986). The hygroscopicity derived at supersaturations of 0.20% to 0.30% represents a mixture of particles from both the accumulation and Aitken mode.
Figure 3(c) shows the time series of the particle hygroscopicity parameter κ. The κ values at each supersaturation show a larger variability during the whole campaign compared to a previous study of a one-month intensive measurement campaign at CVAO (Gong et al., 2020b), with one standard deviation varying from 0.10 (SS=0.15%) to 0.16 (SS=0.20%, 0.30%). This also can be seen in the wide-spread PDF of κ at different supersaturations (Fig. 4 (c)). The PDF-κ values at each supersaturation show unimodal distributions, except for the supersaturation of 0.70%. The PDF-κ at a supersaturation of 0.70% is bimodal, featuring a small mode with a peak at 0.16 and a large mode with a peak at 0.36. At this supersaturation of 0.70%, the larger κ values are present when the CCN number concentration is low, and vice versa. Figure 5 (c) shows the overall κ as a function of supersaturation. A slightly increasing trend of κ was observed from supersaturations of 0.15% to 0.20%, whereas a slightly decreasing trend of κ was observed from 0.20% and above. However, there is no strongly pronounced difference between κ at different supersaturations.
Figure 5 (d) shows κ as a function of d_crit, and the plotted error bars indicate standard and geometric standard deviations for d_crit and κ, respectively. In the size range from 50 to 100 nm (Aitken mode), a slightly increasing trend of κ with increasing d_crit is observed, indicating that the sulfate produced from DMS oxidation rather than organics might be a major species for particle condensational growth. In the accumulation mode, κ decreases. However, the overall uncertainty in these derived values is rather large, so no further interpretation of the κ values at different sizes will be given. The overall average κ value is 0.32.
However, this value cannot be taken as an annual average, as no data exist for the summer months. In Barbados, both modeling (Pringle et al., 2010) and measurements (Wex et al., 2016)

The Hoppel minimum divides particles that have not yet been activated to cloud droplets from larger ones that have.
Therefore, the supersaturation of 0.30% can be seen to roughly represent the averaged maximum supersaturations in the clouds along the path of the sampled air masses. On the other hand, we can use the integrated particle number concentration for particle

concentration can be extended from the 9 months during which N_CCN was measured with a CCNC to a range of 10 years for which N_>HM values can be derived based on measured PNSDs. If the Hoppel minimum is not present, which is the case during the moderate and heavy dust periods (discussed in Sect. 3.2.1), we used the cumulative particle number concentration above 100 nm to estimate the CCN number concentration. The 100 nm threshold was chosen because d_crit was about 100 nm at a supersaturation of 0.30% during dust periods (discussed in Sect. 3.3), and it also coincides with a boundary between two modes in the PNSD of the moderate dust.
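The idea of estimating the CCN number concentration from the PNSD alone can be sketched as follows: locate a Hoppel minimum if one exists, otherwise fall back to 100 nm, and integrate the size distribution above that diameter. The search window of 50 to 150 nm, the local-minimum criterion, the uniform logarithmic bin width and the function name are assumptions made for this example; only the 100 nm fallback is taken from the text.

```python
def n_above_hoppel_minimum(diam_nm, dn_dlogdp, dlogdp,
                           window=(50.0, 150.0), fallback_nm=100.0):
    """Return (threshold diameter in nm, number concentration above it in cm^-3).

    diam_nm: ascending bin mid-point diameters
    dn_dlogdp: dN/dlogDp per bin (cm^-3)
    dlogdp: logarithmic bin width, assumed identical for all bins
    """
    threshold = fallback_nm
    # Candidate Hoppel minima: local minima of dN/dlogDp inside the search window
    candidates = [
        i for i in range(1, len(diam_nm) - 1)
        if window[0] <= diam_nm[i] <= window[1]
        and dn_dlogdp[i] < dn_dlogdp[i - 1]
        and dn_dlogdp[i] < dn_dlogdp[i + 1]
    ]
    if candidates:
        threshold = diam_nm[min(candidates, key=lambda i: dn_dlogdp[i])]
    # Integrate the size distribution from the threshold upwards (threshold bin included)
    n_above = sum(v * dlogdp for d, v in zip(diam_nm, dn_dlogdp) if d >= threshold)
    return threshold, n_above
```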

Air mass analysis
To understand the air mass origin and transport, we analyzed the backward trajectories at CVAO. From previous studies at CVAO, we learned that the marine boundary layer (MBL) was typically well mixed, with boundary layer heights from about 550 to 1100 m. An occasionally decoupled layer around 400 to 600 m was also observed (Gong et al., 2020b). Therefore, we used backward trajectories starting at 200 m altitude to be sure to represent aerosol collected at CVAO. During air mass transport, the aerosol in an air mass might be altered due to precipitation events. We excluded backward trajectories from our analysis (∼4.73% of total backward trajectories) when total precipitation exceeded 20 mm in the past 6 days.

direction. A subgroup of backward trajectories originated in the Sahara, spent a few days above it and arrived at CVAO from a northwest direction. The frequency plot also reveals that backward trajectories featured paths over southern Europe and the Mediterranean region so that anthropogenically emitted particles from these regions might also be carried to CVAO.

3.2 Particle classification, origins and its monthly variation

3.2.1 Particle classification, origins
From the overview of the data sets introduced above, it can be seen that PNSDs, PNC, N_CCN and air mass origin show large variability. In the following, we examine how different air mass origins and histories are related to different aerosols being present at CVAO. To this end, we implement a K-means algorithm to classify aerosol types according to the shape of PNSDs.
The input parameter x_i for K-means was defined as the normalized PNC in different size ranges. We used PNC in the size ranges from 20 to 40 nm (N_20_40nm), 40 to 80 nm (N_40_80nm), 80 to 600 nm (N_80_600nm), 600 to 1000 nm (N_600_1000nm) and larger than 1000 nm (N_>1000nm). As the absolute numbers in different modes are very different, we normalized PNC in different size ranges to make sure that all size ranges are assessed equally. The normalization method can be found in Appendix C.
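The construction of the K-means input can be sketched as follows: per-bin number concentrations are summed into the five size ranges, and each feature is then normalized across all samples. The normalization shown here is a column-wise z-score, which is only an assumption made for illustration, since the method actually used is described in Appendix C and not reproduced here. The function names and the handling of the range boundaries are likewise hypothetical.

```python
import statistics

# Size ranges in nm: 20-40, 40-80, 80-600, 600-1000 and > 1000
SIZE_RANGES_NM = [(20, 40), (40, 80), (80, 600), (600, 1000), (1000, float("inf"))]


def pnc_features(diam_nm, bin_counts):
    """Sum per-bin number concentrations (cm^-3) into the five size ranges."""
    return [
        sum(n for d, n in zip(diam_nm, bin_counts) if lo <= d < hi)
        for lo, hi in SIZE_RANGES_NM
    ]


def standardize(rows):
    """Column-wise z-score; an assumed normalization, not necessarily the paper's."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # avoid division by zero
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

Without such a normalization, the accumulation mode, which carries by far the largest absolute numbers, would dominate the Euclidean distances used by K-means.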
These size ranges were chosen based on the assumption that particles in different size ranges feature different formation routes and behaviors. For the data presented here, higher N_20_40nm values indicated the presence of relatively freshly formed particles. The boundary between Aitken and accumulation mode is roughly at 80 nm at CVAO. Therefore, N_40_80nm and N_80_600nm represented the Aitken and accumulation mode particle number concentrations, respectively. N_600_1000nm can be attributed to sea salt with a possible contribution from dust particles (Schladitz et al., 2011). The largest observed particles (N_>1000nm) are directly linked to heavy dust plumes at CVAO with a small contribution from large sea spray aerosols (Gong et al., 2020b). A summary of input parameters and the corresponding aerosol mode and source is shown in Tab. 1.

Table 1. Input parameters and the corresponding main aerosol mode.

Input parameter    Main aerosol mode
N_40_80nm          Aitken mode particles
N_80_600nm         Accumulation mode particles
N_600_1000nm       Mainly sea spray aerosol with possible dust fraction
N_>1000nm          Mainly dust, but also large sea spray aerosol

As mentioned above and described in detail in Appendix C, in this study we determined k=5 for the number of clusters. The resulting classification of PNSDs and particle volume size distributions (PVSDs) is shown in Fig. 7. The solid lines show the median of PNSDs and PVSDs of the 5 different aerosol clusters, with a linear scaling on the y-axis in Fig. 7 (a), (c) and a logarithmic scaling in Fig. 7 (b), (d). The error bars indicate the range between 75th and 25th percentiles. Additionally, Figure 8 shows the backward trajectory path frequency and backward trajectory height frequency which are connected to these 5 different clusters of PNSDs. Error bars show the range between 75th and 25th percentiles.
The marine type PNSD features three modes, i.e., Aitken, accumulation and sea spray aerosol, shown as the blue line in Fig. 7. The marine type featured the lowest N_40_80nm, N_80_600nm, N_600_1000nm and N_>1000nm values. The Hoppel minimum of marine type PNSDs is at roughly 70 nm. During the marine period, the sea spray aerosol mode can be assumed to be primary marine aerosols, as discussed in Modini et al. (2015) and Gong et al. (2020b). The PNSD representing the freshly-formed cluster is shown as a purple line in Fig. 7. This cluster featured the highest N_20_40nm. While we did not measure PNSDs below 20 nm, it can still be safely assumed that such high concentrations of small particles indicate new particle formation events which must have happened recently in the respective air masses. It is interesting to note that this is related to a generally low aerosol particle volume concentration (see Fig. 7 (c) and (d)), as well as mass concentration, as only comparably small numbers of larger particles are present, similar to the marine type described above. New particle formation therefore seems to happen predominantly in marine air masses with low particle number concentrations, an observation which has also been made for Barbados in Wex et al. (2016).
The frequency of backward trajectory paths observed during times when PNSDs from the marine and freshly-formed clusters were present shows a similar pattern, as shown in Fig. 8 (a1) and (b1). The majority of these backward trajectories spent the whole 6-day period over the North or western North Atlantic Ocean and arrived at CVAO from the northeast direction. Some of these backward trajectories touched southwest Europe and crossed the coastal areas in northwest Africa.
No clearly visible Hoppel minimum is present for the moderate dust cluster, shown as a brown line in Fig. 7 (a) and (b). The moderate dust cluster featured the second highest N_coarse. We also observed a very high concentration of Aitken and accumulation mode particles for the moderate dust cluster. The heavy dust cluster is shown as the black line in Fig. 7. No clearly visible Hoppel minimum is present for this cluster, either. The heavy dust cluster featured the highest N_coarse. It is worth mentioning that similar PNSDs were found in previous field measurements in the Sahara (Weinzierl et al., 2009; Kandler et al., 2009; Kaaden et al., 2009). The highest N_coarse also means that the highest volume and mass concentrations (Fig. 7 (c) and (d), black line) are present in the heavy dust cluster, which was already reported in a previous study (Fomba et al., 2014).
Finally, the mixture cluster PNSD is shown as green lines in Fig. 7, with three clearly distinguished Aitken, accumulation and coarse modes. N_>1000nm lies between the values of the marine and dust clusters. The Hoppel minimum of the mixture cluster is at ∼90 nm.
This may indicate that for this cluster the maximum supersaturation reached in the trade wind clouds is lower than for the marine and freshly-formed clusters, for which, as said above, the Hoppel minimum was at 70 nm. These two clusters have lower number concentrations for larger particles. It is therefore not surprising that PVSDs of the mixture cluster are between those of the marine and dust clusters.
In the presence of higher particle concentrations for larger particles, an overall higher amount of water condenses onto particles at cloud base, which can be expected to lead to an overall lower supersaturation (Reutter et al., 2009), enabling only larger particles to be activated to cloud droplets, as observed in our data sets.
The frequency of backward trajectories during the moderate dust cluster featured a similar pattern as during the mixture period (as shown in Fig. 8 (c1) and (d1)). There are two major paths during mixture and dust periods. One class of backward trajectories came from the North Atlantic Ocean and crossed the coastal areas in northwest Africa. Another class of backward trajectories spent some days over the Sahara. So overall, for PNSDs which were characterized as belonging to the mixture or moderate dust clusters, sea spray particles also contributed, likely also to the particles at larger sizes. The backward trajectories during the heavy dust period were mainly from the Sahara, which justifies assuming that the particles making up this highest N_coarse were Saharan dust particles.
The relative frequency of backward trajectory heights is shown in the right panels of Fig. 8. Gong et al. (2020b) found that the marine boundary layer height at CVAO varied roughly from 550 to 1100 m. From Fig. 8 it can be seen that a large fraction of air masses came from or had contact with the free troposphere. In this context, it is worth noting that new particle formation in the marine environment has been described to often take place in the free troposphere, followed by growth of these particles into the CCN size range (Korhonen et al., 2008; Merikanto et al., 2009).

Monthly variation
In order to understand the large temporal variability of aerosol types on a monthly basis, the monthly variation of the relative frequency of occurrence, number fraction and volume fraction of the five different aerosol clusters are shown in Fig. 9 (a), (b) and (c), respectively. It can clearly be seen that relative frequency of occurrence, number fraction and volume fraction feature different behaviors. For example, the heavy dust cluster represented 14.6% of the cases encountered in January and contributed only 11.7% to particle number, but about 40.4% to particle volume/mass. This needs to be considered when relating number-based properties such as CCN and INP concentrations to, e.g., the mass-based chemical composition.
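The difference between the three quantities can be made explicit with a short sketch that computes, for each cluster, the relative frequency of occurrence, the number fraction and the volume fraction from labelled samples. The data layout, the function name and the example numbers are hypothetical and only serve to illustrate why a cluster can be rare in number yet dominant in volume.

```python
from collections import defaultdict


def cluster_shares(labels, numbers, volumes):
    """Return {label: (frequency of occurrence, number fraction, volume fraction)}.

    labels: cluster label per sample
    numbers: total particle number concentration per sample
    volumes: total particle volume concentration per sample
    """
    count = defaultdict(int)
    num = defaultdict(float)
    vol = defaultdict(float)
    for lab, n, v in zip(labels, numbers, volumes):
        count[lab] += 1
        num[lab] += n
        vol[lab] += v
    total_count, total_num, total_vol = len(labels), sum(numbers), sum(volumes)
    return {
        lab: (count[lab] / total_count, num[lab] / total_num, vol[lab] / total_vol)
        for lab in count
    }


# Illustrative values only: one dust sample with few but large particles
shares = cluster_shares(
    ["marine", "marine", "marine", "heavy dust"],
    [300.0, 300.0, 300.0, 100.0],
    [1.0, 1.0, 1.0, 7.0],
)
print(shares["heavy dust"])
```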
The heavy dust cluster is mainly present during wintertime. This is because, during wintertime, dust is transported westward from the Sahara in the lower troposphere. The deposition of dust particles takes place over the eastern tropical Atlantic and Cabo Verde region. During springtime, the marine aerosol is more often observed. This is because, during springtime, the CVAO mostly receives North Atlantic marine air parcels along with the NNE trade winds.

The marine, mixture and moderate dust aerosol clusters discussed above were already identified by Gong et al. (2020b), who characterized the sources of the particles at CVAO during September and October 2017. The freshly-formed and heavy dust clusters did not show up during this previous study and are described here for the first time for CVAO. It is worth mentioning that even in the relatively simple dust-marine intersect location at Cabo Verde, five different aerosol clusters were classified, and these 5 clusters featured different characteristics such as size distributions and origins.

Evaluate marine and dust contribution to CCN
One important purpose of this study is to evaluate the CCN population and particle hygroscopicity for marine or dust aerosol dominated environments. Therefore, we compared κ and N_CCN during the marine and the dust periods, where for the latter, moderate and heavy dust were combined. Figure 10 (a) shows a boxplot of N_CCN at different supersaturations during the 9 months with CCNC measurements. N_CCN at a supersaturation of 0.30% (as a proxy for the average maximum supersaturation encountered in clouds) during the dust period is about 2.5 times higher than that during marine periods. N_HM can be used to estimate atmospheric N_CCN at a supersaturation of 0.30%. Therefore, to extend atmospheric N_CCN from 9 months to 10 years, we also derived N_HM for marine and dust periods from 2008 to 2018, and also show these values in Fig. 10 (b). N_CCN at a supersaturation of 0.30% in Fig. 10 (a), derived from only 9 months of measurements, and N_HM in Fig. 10 (b), derived for 10 years, agree well, giving confidence in the extension of the CCN data set. Also, it is clear that the atmospheric N_HM (i.e., N_CCN,0.30%), meaning also particles being able to act as CCN, is about 2.5 times higher during dust periods than during marine periods at CVAO.
Besides, N_CCN during dust periods is generally higher than that during marine periods, increasingly so with increasing during dust periods is four times higher than during marine periods. This is interesting, as in the size range around 50 nm dust particles are not expected to be present in significant numbers. An enhanced Aitken mode particle number concentration was observed during moderate dust periods, showing up clearly separated from one large mode observed for particles > 100 nm.
We interpret these particles to stem from new particle formation in air masses associated with the dust, a phenomenon which has been observed in a previous field study (Dupart et al., 2012). However, this is highly speculative and proving or disproving this hypothesis is beyond the scope of this paper.

until March. They are comparably small and in agreement with the modeling study by Pringle et al. (2010). The majority of particles in marine air masses may originate from new particle formation, not from sea spray (Korhonen et al., 2008; Merikanto et al., 2009), explaining the derived κ which is lower than for sea salt particles. Dust particles, on the other hand, were reported to show hygroscopic behavior in the past (Karydis et al., 2011) and may also get more hygroscopic during aging. However, for dust type aerosols the nature of particles in the size range for which κ was determined, particularly for the size range < 100 nm, is not fully clear, as mentioned above. But it did become clear here that the observed particles do not differ in hygroscopicity from those observed in marine type aerosols, which strengthens our suggestion that these particles also originate from new particle formation.

Synopsis
In the following, we summarize aerosol related results from this study and expand on some of them. As mentioned before, long-term data sets such as this one can yield valuable insights for our general understanding but also for improving the performance of models (Reddington et al., 2017). This section is meant to present a broader picture of processes connected to the atmospheric aerosol without going into too much detail.
In this study, after classification based on machine learning, we found five different aerosol clusters for the 10 year-long data set obtained at CVAO: marine, freshly-formed, mixture, moderate dust and heavy dust.

When examining monthly variations, we showed that heavy dust was mainly observed during wintertime, when it was present roughly 10% of the time and contributed, on a monthly basis, roughly 10% of particle number but up to > 40% of the particle volume and mass. Moderate dust events, however, occurred throughout the year, typically for more than 10% of the time during each month, contributing up to more than 20% of all particles in number. Fall and winter months were also those during which the aerosol absorption was highest, and in the absence of local anthropogenic pollution sources upwind of CVAO, the observed absorption can be assumed to originate from dust particles.
A special feature of the moderate dust type aerosols, which, as said above, occurred frequently, was a pronounced Aitken mode. While the determination of the nature of these particles is beyond the scope of this study, we speculate that new particle formation events may also occur in moderately dust laden air masses, based on not yet known processes, and propose this topic for further research.

Clean marine periods were most frequently present during springtime, occurring roughly 50% of the time in March and April. Periods during which the freshly-formed type aerosol prevailed had an air mass origin similar to that of the marine aerosol, indicating that the typical and well-known way of new particle formation mainly occurs in marine aerosols with low particle number concentration and low particulate surface area. Together, clean marine and freshly-formed air masses occurred for more than 50% of the time from October until May, while the mixed type aerosol was the most frequent type in the remaining months.
Dust and marine type aerosols featured clearly different PNSDs. Enhanced particle number concentrations and CCN number concentrations were observed when air masses came from the Sahara. For air masses originating from the Atlantic Ocean, PNSDs featured the lowest number concentrations in the Aitken, accumulation and sea spray modes. The CCN number concentration at a supersaturation of 0.30% (a proxy for the average maximum supersaturation encountered in clouds) during the marine period was about 2.5 times lower than during the dust period. However, the CCN-derived particle hygroscopicity (in the size range from 40 to 140 nm) for marine and dust periods showed no significant difference, independent of the air mass. This latter result, however, was based solely on data from the months of October until March. Averaging all κ values obtained in this study yielded a value of 0.32, again for the months from October until March, omitting small size-dependent differences, all of which were within uncertainty.
The Hoppel minimum, a dip between the Aitken and accumulation modes indicating the division between particles that had been activated to cloud droplets and those that had not, was found to be at roughly 80 nm. More precisely, the Hoppel minimum was at 70 nm for marine and freshly-formed type aerosols and at 90 nm for the mixture type aerosol, where the latter had higher number concentrations for larger particles. This larger size for the Hoppel minimum is likely indicative of these particles being a larger sink for water vapor during cloud droplet formation, lowering the effective supersaturation for air masses with mixture type aerosol.

Conclusions and future work
In this study, we summarized measurements of the aerosol microphysical properties in a representative environment where marine and dust aerosols intersect, the Cape Verde Atmospheric Observatory (CVAO). In-situ aerosol measurements included light absorbing carbon (LAC), particle number size distribution (PNSD) and cloud condensation nuclei (CCN), complemented by corresponding backward trajectories. An unsupervised machine learning algorithm, K-means, was implemented to classify aerosol clusters, which then formed the basis for the classification of particle sources according to the shape of the particle number size distributions. Based on the classification, we studied the air mass origins for different aerosol clusters and separated the marine and dust contributions to CCN and their hygroscopicity.
We showed that the unsupervised machine learning algorithm K-means can be a good tool for classifying aerosol types. It benefits from long-term measurements with extremely large data sets. We refrain from repeating the key results here, as they were summarized in the previous section. However, we point out that the summarized data sets can also help to shape new research questions for future studies at CVAO. The freshly-formed particles and the corresponding backward trajectories indicated one type of new particle formation, which occurred at CVAO mainly in air masses of marine origin. The height of the backward trajectories indicated that new particle formation might have occurred in the free troposphere and that particles were then transported down to the marine boundary layer. While new particle formation in the marine environment has previously been assumed to occur in the free troposphere, we still lack knowledge of the overall mechanism, such as the precursors, the growth rate, and the temperature and humidity effects of new particle formation, and also of the fraction of CCN which originates from new particle formation. Further, the moderate dust period featured high concentrations of Aitken mode particles, while the heavy dust period did not. More investigations are required to understand the reason behind that.
The data sets collected at CVAO are invaluable as, due to their long duration, they can help to better understand the aerosol microphysical properties, which are at the base of aerosol-cloud interactions. Furthermore, the data sets represent highly valuable information for driving, evaluating and constraining atmospheric model simulations. Long-term measurements of aerosol size, light absorbing carbon and CCN will remain important in the future because the real world exhibits large spatial heterogeneity and temporal variability.

Appendix A: Particle loss
We used the Particle Loss Calculator (written in IGOR Pro, von der Weiden et al., 2009) to calculate the particle losses within the inlet tube system. We considered size-dependent particle losses due to sedimentation, diffusion, turbulent inertial deposition, inertial deposition in a contraction and in a bend. Note that the losses inside the diffusion dryer in this study are equivalent to the losses in a 15 m long tube (Wiedensohler et al., 2012). Figure A1 shows the resulting particle losses (in %) as a function of particle size.

Appendix B: MAAP correction
Figure B1 shows the PDF of σ abs (red line) and σ abs,corr (black line). Although corrections for the scattering artifact for filter-based absorption photometers are highly uncertain, the results clearly underline the importance of the correction.

Appendix C: Input parameter normalization and determination of the number of clusters
To make sure that PNC in all size ranges are assessed equally when applying the K-means algorithm, we normalized PNC in the different size ranges separately based on a min-max normalization:

$$x_{i,\mathrm{norm}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}},$$

where $x_i$ is the $i$th PNC value in a specific size range, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum PNC values in that size range. For PNC in a specific size range, the minimum PNC value of that size range is thus transformed into 0, the maximum PNC value is transformed into 1, and every other value is transformed into a decimal between 0 and 1.
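The min-max normalization described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name `min_max_normalize`, the dictionary layout and the PNC values are hypothetical and serve only to show that each size range is scaled separately.

```python
def min_max_normalize(values):
    """Scale a list of values linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant series carries no
        # information, so it is mapped to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical PNC values (cm^-3) in two size ranges; not measured data.
pnc_by_range = {
    "aitken": [120.0, 300.0, 480.0],
    "accumulation": [50.0, 75.0, 150.0],
}

# Each size range is normalized on its own, so that ranges with high
# absolute concentrations do not dominate the distances used by K-means.
normalized = {name: min_max_normalize(vals) for name, vals in pnc_by_range.items()}

for name, vals in normalized.items():
    print(name, vals)
```

Normalizing each size range separately, rather than over all ranges at once, is what gives every input the same weight in the Euclidean distances that K-means minimizes.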

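As a minimal illustration of how the number of clusters k can be chosen for K-means, the following Python sketch computes the distortion (within-cluster sum of squared errors) and the Calinski-Harabasz score for several values of k. It is a sketch under stated assumptions, not the implementation used in this study: the data are synthetic points forming three tight groups (not CVAO measurements, for which five clusters were found), K-means is written out in plain Python, and the deterministic farthest-first initialization is chosen here only for reproducibility. All function and variable names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first start."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster ends up empty
                centers[j] = centroid(members)
    return labels, centers


def distortion(points, labels, centers):
    """Within-cluster sum of squared errors."""
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def calinski_harabasz(points, labels, centers):
    """Ratio of between-cluster to within-cluster dispersion."""
    n, k = len(points), len(centers)
    overall = centroid(points)
    between = sum(labels.count(j) * sq_dist(centers[j], overall) for j in range(k))
    within = distortion(points, labels, centers)
    return (between / (k - 1)) / (within / (n - k))


# Synthetic, already normalized two-feature data: three tight groups.
offsets = [(0.02, 0.0), (-0.02, 0.0), (0.0, 0.02), (0.0, -0.02)]
group_centres = [(0.1, 0.1), (0.5, 0.9), (0.9, 0.3)]
points = [(cx + dx, cy + dy) for cx, cy in group_centres for dx, dy in offsets]

distortions, ch_scores = {}, {}
for k in range(2, 6):
    labels, centers = kmeans(points, k)
    distortions[k] = distortion(points, labels, centers)
    ch_scores[k] = calinski_harabasz(points, labels, centers)

best_k = max(ch_scores, key=ch_scores.get)
for k in sorted(distortions):
    print(k, distortions[k], ch_scores[k])
print("k with highest Calinski-Harabasz score:", best_k)
```

For these synthetic data the distortion drops sharply up to k=3 and changes little afterwards, and the Calinski-Harabasz score peaks at the same k, so both criteria agree. With real data the elbow is usually less sharp, which is one reason for consulting a second criterion.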
The elbow method is a useful graphical tool to estimate the optimal number of clusters k for a given task. In general, as k increases, the within-cluster sum of squared errors (also called distortion) decreases, because the samples are closer to the centroids they are assigned to. In the elbow method, the distortion at first decreases rapidly with increasing k and then switches to a slower decrease at a certain k value. Figure C1 shows the elbow plot as a blue line. The distortion started decreasing in a linear fashion after k=5. Therefore, k=5 was taken as the optimal number of clusters. In addition, we used the Calinski-Harabasz score (Caliński and Harabasz, 1974), which computes the ratio of dispersion between and within clusters. The vertical black dashed line indicates the highest Calinski-Harabasz value, which also occurred at five clusters. Overall, the optimal number of clusters was therefore determined to be five, which is the value we used within this study.

Author contributions. XG conceived the study and wrote the manuscript with contributions from HW, TM and FS. TM performed particle number size distribution and light absorbing carbon measurements. SH performed cloud condensation nuclei measurements. XG evaluated and analyzed the data and developed analysis methods. XG, HW, TM, SH, JV, AW and FS discussed the results. All of the co-authors proofread and commented on the manuscript.
Competing interests. The authors declare that they have no conflict of interest.