Identifying source regions of air masses sampled at the tropical high-altitude site of Chacaltaya using WRF-FLEXPART and cluster analysis

Observations of aerosol and trace gases in the remote troposphere are vital to quantify background concentrations and identify long term trends in atmospheric composition on large spatial scales. Measurements made at high altitude are often used to study free tropospheric air however such high-altitude sites can be influenced by boundary layer air masses. Thus, accurate information on air mass origin and transport pathways to high altitude sites is required. Here we present a new method, based 5 on the source-receptor relationship (SRR) obtained from backwards WRF-FLEXPART simulations and a k-means clustering approach, to identify source regions of air masses arriving at measurement sites. Our method is tailored to areas of complex terrain and to stations influenced by both local and long-range sources. We have applied this method to the Chacaltaya (CHC) GAW station (5240 m a.s.l., 16.35° S, 68.13° W) for the 6-month duration of the “Southern hemisphere high altitude experiment on particle nucleation and growth” (SALTENA) to identify where sampled air masses originate and to quantify the influence 10 of the boundary layer ::::: surface : and the free troposphere. A key aspect of our method is that it is probabilistic and for each observation time, more than one air mass (cluster) can influence the station and the percentage influence of each air mass can be quantified. This is in contrast to binary methods, which label each observation time as influenced either by boundary layer or free troposphere air masses. ::: Air ::::::: sampled :: at ::::: CHC :: is : a :::: mix :: of :::::::: different :::::::::: provenance. : We find that on average , 9% of the airsampled at CHC, at any given observation time, has been in contact with the surface within 4 days prior to arriving 15 at CHC : . ::::::::::: Furthermore, 24% of the air was located below :: has ::::: been :::::: located :::::: within ::: the :::: first : 1.5 km above ground level and consequently :::::: (surface ::::::::: included). :::::::::::: Consequently, 76% of the measured air masses at CHC represent free tropospheric air :: air


Introduction
Traditionally, high-altitude measurement sites are used to study the remote atmosphere and the interactions between the free troposphere (FT) and the planetary boundary layer (PBL). These sites provide the opportunity to have long-term in situ observations of the free troposphere with high time resolution (Collaud Coen et al., 2018) as opposed to the short duration and inherent transient nature of airborne measurement campaigns. Observations of aerosol and trace gases in the FT are of great 30 scientific value in terms of understanding long-range transport and atmospheric chemistry, quantifying background concentrations, and observing long term changes in the composition of the atmosphere (Laj et al., 2009). Nevertheless, it is well known that high-altitude mountain sites can be influenced by boundary-layer air.
The planetary boundary layer is the lowest part of the atmosphere and is in direct contact with the Earth's surface. The majority of natural, and especially anthropogenic, aerosols and pollutants are emitted at the surface and thus directly released 35 in ::: into : the PBL. The concentrations of gases and aerosols and their residence times depends on the structure of the PBL.
Over land the PBL has a pronounced diurnal cycle with deep, well-mixed boundary layers typically observed during the day and shallow, stable boundary layers occurring at night. In complex terrain, thermally driven winds develop (e.g. slope and valley winds, De Wekker and Kossmann (2015) :::::::::::::::::::::::::: De Wekker and Kossmann, 2015) during the day and can transport aerosols and pollutants from valley bottoms to high-altitude sites. Additionally, complex mountain meteorological processes such as 40 orographic lifting can also transport PBL air to high altitude. Therefore, high-altitude sites can be influenced by local boundary-layer air and free tropospheric air that may , ::::: where ::: the :::: latter :::: may :::: have ::::::::: undergone ::::::::: long-range :::::::: transport : due to stronger upper-level winds, have undergone long-range transport. This means that observations must be carefully screened and analysed in synergy with many parameters to understand the dynamics and diurnal cycle of individual sites. Consequently, understanding the history of air masses sampled at mountain-top sites and 45 related chemical composition is not an easy exercise. Since the chemical and physical composition of the sampled air masses is, in general, inherently related to its path through the atmosphere (Fleming et al., 2012), having good methods to describe the history of the sampled air masses increases the value of measurements. Under these circumstances, the observations can then be treated as samples from different parts of the atmosphere both in the vertical (PBL vs FT) and the horizontal (short-range vs long-range contributions) domain.
Cluster analysis is a multivariate technique used to classify elements into groups in a way that maximizes the similarity (by a predefined metric) within members of a group while also maximizing the dissimilarity across groups. Clustering 95 has been extensively used in studies that aim to classify air mass history. In the case of single trajectory studies, the goal is to group trajectories into ensembles that follow a similar pathway (Kassomenos et al. (2010) :::::::::::::::::::: Kassomenos et al., 2010 and references therein). In dispersion models, clustering analysis has been applied both to classify the retroplume of the particles (e.g. Stohl et al. (2002) :::::::::::::: Stohl et al., 2002) and also to classify the regional footprint of the particles (e.g. Sturm et al. (2013); Paris et al. (2010 ::::::::::::: Paris et al., 2010). However, these studies mostly assume that high-altitude sites are also background sites, and therefore are 100 mostly interested in the contribution of long-range transport sources. This means that the meteorological data used to drive the dispersion models is only required at ::: may ::::: have low spatial resolution since the local sources are assumed to be negligible.
However, there is still a lack of classification in the vertical dimension and accountability for the influence of short-and longrange transport simultaneously which is of special relevance for locations where short-range sources are equally as relevant as :::::: relevant :: to : more distant sources. , 68.13 • W) located at a.s.l. and :°:::: W) :::::: located : 20 km from the metropolitan area of La Paz / El Alto but ∼1.6 km higher in altitude than the centre of La Paz. For a detailed description of the site see Chauvigné et al. (2019) and Wiedensohler et al. (2018). Measurements of reactive and greenhouse gases as well as aerosol optical, chemical and physical properties are routinely monitored at the station following the GAW recommended procedures (Laj et al., 2020). 110 At this station, in the context of the SALTENA (Southern hemisphere high altitude experiment on particle nucleation and growth) campaign (Bianchi et al., 2021a) :::::::::::::::::: (Bianchi et al., 2021b), state-of-the-art instruments that measure aerosol chemical and physical properties were deployed to complement on-going long-term observations. The intensive measurements took place between December 2017 and June 2018 (covering both wet and dry seasons). The unique location of the station in the under-sampled southern hemisphere enabled us to study a mixture of pristine air masses from the Amazon Basin loaded with 115 biogenic emissions, regional background air masses from the Altiplano perturbed by volcanic activity, and, marine air masses from the Pacific Ocean. In addition, strong anthropogenic influence from the La Paz / El Alto metropolitan area was sampled.
This wide range of potential source areas, along with complex mountain meteorology, and state-of-the-art, highly detailed observations of the physical and chemical properties of aerosol and trace gases, means that a comprehensive meteorological analysis, beyond what is the typical performed for aerosol measurement campaigns, is required.

120
The overall objective of this study is to develop, :::: and ::::: apply :: to ::::: CHC, a new method to identify air mass source regions which is valid for high-altitude stations that are influenced both by local and long-range sources and where the vertical classification of sources is as relevant as horizontal segregationand then to apply this method to CHC. An . ::: An :::::: outline :::: and overview of the method is given in Fig. 1. The first aim of this study is , therefore, to use a regional meteorological model (Weather Research and Forecasting model -WRF ::::::::::: model-WRF) in combination with a Lagrangian dispersion model (FLEXible PARTicle disper-125 sion model -FLEXPART ::::::::::::::::: model-FLEXPART) to create a high-resolution data set of source areas for CHC at hourly resolution (steps 1 and 2 in Fig. 1). The second aim is to develop a new method, based on cluster analysis, to transform the complex, very large ::: and ::::::::::::::: multi-dimensional : dataset into a user-friendly dataset of air mass source regions (steps 3 to 7in Fig. 1). The third aim is to document the characteristics of the identified source areas (clusters) which will enable the dataset produced here to be applied in forthcoming studies on the chemical composition measurements made during the unique high-altitude SALTENA 130 campaign. The fourth and final aim is to demonstrate the strength and simplicity of the classification results from our method which we do by using them to confirm a well-known source of sulfate emissions that were measured at the CHC station.
The initial and boundary conditions were taken from the NCEP Climate Forecast System reanalysis Version 2 (Saha et al., 2014) :::::::: (Saha et a temporal resolution of 6 hours, a nearly 0.5°horizontal resolution and 64 sigma-pressure hybrid layers. The model topography was obtained from the Global multi-resolution terrain elevation data model(GMTED2010) (Danielson and Gesch, 2011) 155 with a resolution of :: ∼1 km. We use the following parameterizations: microphysics is parameterized by the Goddard Scheme; cumulus convection is parameterized by the Grell-Freitas Ensemble Scheme in D01 and D02 and no cumulus parameterization is used for D03 and D04; the short and long-wave radiation is parameterized by the New Goddard Shortwave and Longwave Schemes; the Planetary Boundary Layer (PBL) Physics are represented by the Mellor-Yamada-Janjić Scheme (MYJ); and the land surface model is the Unified Noah Land Surface Model.
FLEXPART also contains options , controlled via namelist options, for how turbulence and convection are included in the simulations. We take the values of the PBL height, surface sensible heat flux, and the friction velocity directly from the WRF simulation (SCF_OPTION=1). Turbulence is parameterized using the Hanna scheme (Hanna, 1984) as used in FLEXPART-ECMWF/GFS ( TURB_OPTION=1) and we assume skewed rather than Gaussian turbulence in the convective boundary layer 205 (CBL=1). Deep convection is also parameterized (LCONVECTION=1).
In order to decide which elements are not beneficial based on the above definition, we define a threshold T in the following way. First, we sum all the values for each of the elements over the time period. Then, we sort the element based on their total value and compute the cumulative values. Finally, we split this dataset at the point where of the total value is reached and discard the remaining . This procedure leaves us with cells out of the total in the grid. Out of the excluded cells, had a zero 290 total value and the total median for the non-zero left out cells is . The total median for the included 8580 cells is .
The global study by von Engeln and Teixeira (2013), based on reanalysis data, shows that PBL heights are somewhat deeper near CHC than in the Amazon but PBL heights ::: and : typically range between 500 m and 1.5 km. Therefore, the real PBL will usually be similar in depth or shallower than our value of 1.5 km. This means using the pseudo BL depth will likely 360 over-estimate the influence of PBL air masses ::: and :::::::::::: underestimate ::: the ::: FT ::::::: influence.
When referring to the SRR percentage influence of a cluster, for simplicity we use where the SRR total is equal to the theoretical total residence time of the simulation expressed in seconds (4 days = 345 600 seconds).
This means that for every time step of the FLEXPART simulation, the sum of all the SRR cluster values adds up to SRR total .

365
In practice this theoretical value is not always achieved since inevitably a very small fraction of the particles leave the outer domain (D01) before the end of the 4-day simulation. This also implies that for some time periods, the sum of all the cluster might not add up to 100%.
In order to understand diurnal cycles in the from different vertical levels and regional sources, we generated detrended diurnal cycle timeseries by decomposing the raw signal into a trend (SRR trend ) and a diurnal cycle (SRR diurnal cycle ) in the following 370 fashion: The SRR trend is obtained by applying a running mean with a window of to the raw signal.
Furthermore, when analysing the diurnal variation of the for each of the 18 clusters we first objectively identify the clusters that present a diurnal cycle by transforming their time series into the frequency domain by applying a fast Fourier transform 375 and plotting the resulting frequency spectrum (Fig. S4). In this domain, we visually identify those clusters that present a local peak around the frequency . Manual inspection of the clusters' timeseries indicated that not every day included in each cluster identified by the fast Fourier transform process to have a peak at had a pronounced diurnal cycle. Therefore, the clusters are then subjected to a dynamic time warp (DTW) grouping procedure (Tavenard et al., 2020) that classifies the days with and without a diurnal peak i.e. influence from the boundary layer. The results of this classification are shown in Figs. S5 to S8. (EXR), namely the Lake Titicaca and the Southern Atlantic ocean regions, that we add to fulfil the domain. In total, 37 different ecoregions exist in the area covered by the largest WRF model domain. The advantage of using these regions is that 385 they are well defined and described in the literature, do not depend on arbitrary political borders and are defined considering regions that have similar ecosystems and therefore similar potential emission signatures. Furthermore, the regions are also nested within biomes that provide a more general picture. Within our outermost domain (D01), 13 different biomes are present.
Sulfate mass concentrations measured at the CHC station from March to May 2018 are used in this study to illustrate an example application of our clustering methodology. The aim of this example is to show that our new method can identify the 390 main source of the high sulfate concentrations measured at CHC. A very likely source of this is degassing from nearby volcanoes which was observed at the same time. The sulfate dataset is obtained using the quadrupole aerosol chemical speciation monitor (Q-ACSM, Aerodyne Research Inc.) which is able to routinely characterize non-refractory submicron aerosol species such as organics, nitrate, sulfate, ammonium, and chloride (Ng et al., 2011). Because of the low atmospheric pressure, a 130 µm diameter critical orifice was used in order to retain the normal sample mass flow rate (Fröhlich et al., 2013). In addition, 395 inlet flow and mass calibration (using ammonium nitrate and ammonium sulfate as standards) were accomplished to guarantee optimal instrumental performance and mass quantification. The instrument's time resolution was 30 minutes. This sulfate timeseries was then correlated to the SRR timeseries of both the 18 clusters and the 6 main pathways.

400
:: In :::::: section ::: 5.2 we quantify the contribution of the surface, "pseudo " :::::: pseudo PBL, and FT sources to the air masses measured at CHC. In section 5.3 the characteristics of the 6 main pathways are presented before the more detailed analysis of the 18 clusters is shown in in section 5.4. Finally, in section 5.5 we show one example of how our clustering results can be combined with measurements to identify source areas.

430
The median influence of the surface is 9% meaning that on average 9% of the air sampled at CHC has been in contact with the surface in the last 4 days. In terms of the pseudo PBL, on average 24% of the sampled air masses represent PBL air. Indirectly this means that approximately 76% of the air sampled at CHC can be considered representative of the FT. Note that this does not mean 76% of observation times are representative of the FT; it should be interpreted as, on average, at any given time ∼ :: an :::::: average :::::::::: simulation :::: hour : 24% of the measured air mass represents the PBL and the remainder the FT. This

The six main pathways
In this section we describe the results from clustering the SRR log-polar cells into 6 groups. We call these 6 groups the main pathways (PW) since they tend to start near the station and reach far away from it as opposed to the 18 clusters that occupy 450 more localised regions. :::: Also, ::: the ::::: limits ::: of ::::: many :: of ::::: these ::: are :::::::: delimited ::: by ::: the ::::::: Andean ::::::: plateau. Furthermore, we label each cluster based on their 'clock direction' from CHC and append them with the acronym PW to distinguish them from the 18 clusters. For example, cluster label '03_PW' refers to the cluster whose centroid position is located east from the station.
In Fig. : 6a and b we show the 6 main pathways (PW) along with the 18 cluster centroids. The pathways are the shaded coloured regions and contain 2 to 4 clusters from the 18-cluster grouping. The 03_PW is located geographically in the lowlands 455 to the east of the station occupying the biomes 1, 2, 6, and 9 which in general are tropical forests and grasslands ( Fig. 6e and   f).
Cluster 05_PW originates from the south of the station in the altiplanic (montane grass and shrubland, biome 4) and lowland (subtropical dry broadleaf, biome 6) regions. Horizontally, it follows the Altiplano plateau and its eastern slopes to the station.
In Fig. 7a to d, we respectively describe the pathways' centroids in terms of their height above ground; height above sea level; surface influence (S SRR , Eq. 2); and SRR percentage influence. In general, the farther the centroid is from CHC, the higher above ground level its centroid location is. The same pattern is observed for their height above sea level, however, if the location of the centroid is not too far away and above a location where the ground height is considerably lower than CHC, 470 then the centroid location is located below the linear trend, e.g. 03_PW and 12_PW. In terms of their S SRR , a decreasing trend is observed, in other words, the farther the centroid, the lower the influence from the surface. Therefore, 12_PW is highly influenced by its contact with ground (62 %, Table :::::: Figs. 7 ::: and : S2) while 07_PW is almost unperturbed by the surface (8 %).
Finally, the SRR influence of each pathway seems to be uncorrelated with the distance from CHC. Pathway 03_PW has the highest influence over CHC with a share of 29 %. This is in agreement with previous studies at CHC (Chauvigné et al., 2019) 475 where air masses from the Amazon were identified as the major contributors during the wet season DJFM (our modelling period involves ::::: covers DJFMAM).
Finally, in Fig. : 8, we show the temporal influence of each of the pathways. We quantify the influence in percentage of each pathway by dividing the pathway's SRR values by the theoretical total residence time of the simulation, 96 hours × 3600 seconds. Note :::: again that the sum of the influence for all 6 pathways shown in Fig. : 8 does not always sum to exactly 100 which 480 is due to particles leaving the domain. One of the strengths of our method is that at any given time, we can identify more that one cluster influencing the station. Time series of the %influence for each of the 6 main pathways (PW) from 2017-12-06 until 2018-05-31.
A clear change in the influence pattern at the beginning of May is seen: On one hand, the influence of pathways 03_PW and 05_PW and to a lesser extent 12_PWbecomes ::::::: become almost negligible. On the other hand, pathways 07_PW and 08_PW 485 increase their influence. This is consistent with Chauvigné et al. (2019) where it was shown that during the wet season (DJFMA) air masses from the lowlands and the east-southeast tend to have a bigger influence on CHC. In our case, 12_PW and 03_PW are clearly lowland pathways while 05_PW has a mixture of Altiplano and lowland influence that, nonetheless, comes from the southwest and therefore is mostly favoured during the wet season. The pathway 11_PW does not present a clear change during the 6 month period. Finally, visual inspection shows that the influence of each pathway happens in ::::: varies :: on :: a :::::::: timescale 490 :: of 1-to-2 week timespans ::::: weeks. We will further develop this point when focusing on the detailed 18 clusters (section 5.4).

510
In Fig. 10a to d, we respectively describe the clusters' centroids in terms of their height above ground; height above sea level; surface influence (S SRR , Eq. 2); and SRR percentage influence. In general, they follow the same patterns that we described for the 6 pathways with the exception that there are more clusters below the CHC station's height a.s.l. (Fig. 10b). These clusters, namely 02_SR, 12_SM and 03_SM are located close to CHC in the lowlands located north and northeast :::::::: north-east : from the station.

515
Specifically, all of the short-range clusters' centroids are below 2 km a.g.l. and below 5.2 km a.s.l. (Fig. 10a and b) which is below the CHC station height (5.2 km a.s.l.). Furthermore, these clusters are in close contact with the surface since their S SRR (Eq. 2) is greater than 50 % (Fig. 10c).
The short-medium range clusters' centroids are between 2.4 km and 2.6 km a.g.l. However, their height a.s.l. varies from 4.1 to 6.1 km. This is due to clusters coming from both the Altiplano to the southwest and the lowlands to the northeast 520 :::::::: north-east of the station. In general, only one third of these clusters' air masses are below 1.5 km (S SRR ) and thus these clusters include notably more influence from the free troposphere than the SR clusters.
The medium-range centroids are between 3.2 km and 6.8 km a.g.l. This variance is mostly proportional to the distance from CHC ( Fig. 10a) with clusters 04_MR and 05_MR being slightly below the rest due to their location in the lowlands. Their height above sea level varies from 6.1 to 7.1 km (Fig. 10b). In terms of their S SRR , these clusters vary from 7 to 20 %. These 525 values are approximately inversely proportional to their centroid distance from the station (Fig. 10c).
Finally, for the long-range subgroup, their clusters' centroid far distance from CHC is reflected in their mean height a.g.l.
of 8.0 km, mean height a.s.l. of 8.4 km, and their low mean S SRR of 0.7 %. Furthermore, due to this high altitude and low influence from the surface, air masses arriving from these clusters are likely to present free troposphere characteristics. These clusters are all located west of the station (clock direction 07-10).

530
The temporal evolution of the clusters is shown in Figs. 11 and 12. All clusters within the short-range subgroup show a high degree of temporal variability ( Fig. 11a and b). Cluster 02_SR presents a high frequency variability which upon further analysis (via Fourier transform, Fig. S4) is shown to be a clear diurnal pattern. However, this pattern does not happen everyday ( Fig. S5) but in 81 out of the 176 modelled days (46 %). During these days, the peak happens in the early afternoon (13h local time). The cluster 04_SR is highly variable as well however it does not present a diurnal pattern (Fig. : S4) and its variation is similar to that 535 of 05_MR, probably due to the fact that both clusters belong to the same pathway. Cluster 07_SR does present a sharp diurnal pattern in 41 (23 %) out of 176 modelled days (Fig. : S6) peaking at 11h local time. We conjecture that the peak time difference between 02_SR (13h) and 07_SR (11h) is due to the different land type; 02_SR originates from the high humidity biome 1 (Tropical and subtropical forest, Fig. : 6f) whereas 07_SR originates from the less humid biome 4 (Montane grasslands). This difference would entail different thermal inertia, different diurnal cycles and partitioning of the sensible and latent heat fluxes 540 and thus a different boundary layer evolution. Cluster 07_SR is of particular interest since intense anthropogenic emission sampled at CHC would most likely be generated in this highly populated area. Furthermore, the close contact of cluster 07_SR with the surface and its diurnal variability favour transport of emissions to the station during the day when PBL air from La Paz and El Alto is advected upslope by thermally driven winds. The cluster 10_SR, which originates close to Lake Titicaca, presents a diurnal pattern in 63 (36 %) out of 176 days peaking at 8h local time (Fig. S7). We conjecture that this early morning 545 peak is related to the lake breeze circulation that develops due to the temperature difference between land and the lake. The average surface temperature of the lake, obtained from Pillco Zolá et al. (2019), is 10°C and does not have a diurnal cycle.
At 8h local time, the lake's surface temperature is higher than the surroundings favouring a land breeze (airflow from land to the lake) near the surface and ascent over the lake. The return flow, near the top of the BL, potentially advects air masses from the lake to CHC. The cluster 11_SR does not present a diurnal variation (Fig. S4) and its influence seems to be mostly driven 550 by the medium-range cluster 11_MR (Fig. 12). Cluster 12_SR does present a diurnal pattern (Fig. : S8) in 48 days out of the to its close location to CHC (the closest one, 14 km ) so that early morning irradiation would create favourable conditions for up-slope winds.
The short-medium (SM), medium (MR) and long (LR) range clusters' temporal variability is shown in Fig. 11c and d, and 555 Fig. 12a to d. The power spectra of the SRR intensity for these clusters is shown in Fig. S4 and is between 1 and 2 weeks except for 08_SM and 12_SM which, in addition, also show a small diurnal variability (Fig. S4). : .
The prevalence of some short-medium range clusters changes during the 6-month period. Clusters 03_SM, 12_SM, 02_MR, 04_MR and 05_MR occur regularly from December (wet season) to April (transition season) but cease influencing in May (dry season). In contrast, the influence of clusters 06_SM, 11_MR and 10_LR do not change substantially during the 6 months 560 whereas the influence of clusters 08_SM increases 09_MR, 07_LR and 08_LR in May.
Finally, in Table ::: Fig. 13 we present a quantitative summary of the 18 clusters centroid properties along with their average SRR influence. Furthermore, we also link each of the 18 clusters to their main pathway. On average, there are 3 clusters per pathway and at least two distinct distance ranges. Furthermore, the clusters within each pathway are heterogeneous in position, surface influence and age which supports the idea that further insight into the air mass transport patterns can be attained by 565 further subdividing the pathways.

An example application: sulfate from volcanic degassing
This section presents a proof of concept for our newly developed method to identify sources regions of air sampled at CHC. To determine if this is the case, we calculated the 'Pearson' correlation coefficients for all available measurements of sulfate from the ACSM :::::::: Q-ACSM to the SRR time series of each of the 18 clusters and all of the 6 pathways which are shown in Fig. 14a.
The cluster and pathway with the highest coefficients are 09_MR (0.40 ) and 08_PW (0.42 ) and both correlations have a 580 p value < 0.001. The horizontal locations of this cluster and pathway are shown in Fig. 14b and their corresponding timeseries are shown in Fig. : 14d. The timeseries for the sulfate measurements is shown in Fig. 14c. The correlation in combination with either the simplified (6 pathways) and the specific (18 clusters) clustering scheme correctly assigns the source region to the location of the degassing Sabancaya volcano.
The coefficients from the pathways are quite clear; the highest value of 0.42 is at least twice as high as the next candidate, the highest coefficient value correctly to 09_MR. However, 10_SR, 11_SR and 08_LR all have similarly high correlation coefficients. Therefore, while the results from 18 clusters are better at pinpointing the source region of sulfate, other regions also appear as plausible candidates so there is a risk of overfitting when too many regions are considered. This may be a result of this source regions being entangled in terms of their temporal influence over the station. For example, in this particular case, 590 the high correlation of 10_SR, 11_SR and 08_LR could be attributed to the fact that air masses that travel through the region of the volcano also have a time residence in the regions where 10_SR, 11_SR and 08_LR are located.
This example shows that the clustering scheme is successful in identifying regions in its simplified version (6 pathways) and also in its more detailed version (18 clusters) albeit care must be taken when drawing conclusions so that we do not overfit to the source regions. Finally, it should be noted that only one example is presented here. Future work will include the extension 595 to additional case studies, for example, comparing the clusters to measurements of black carbon.

Discussion and recommendations
Previously, we described the characteristics and location of the 6 pathways and 18 clusters and related this to the surface type (biomes). It is also relevant to consider how the pathway positions and time series relate to typical meteorological patterns.
During the austral summer, the Intertropical Convergence Zone (ITCZ) migrates south, coinciding with the decrease of the 600 meridional gradient temperature and the associated southward shift of the westerly subtropical jet stream. At the same time, deep convection starts to develop especially in the central part of the continent. This favours the expansion of the equatorial easterly winds and thus a weak mean east-to-west flow in the middle and upper troposphere is established (Garreaud et al., 2003). This east-to-west flow is well captured by 03_PW (03_SM, 02_MR and 04_MR) which is strong in DJFMA.
At the same time, the expansion of the trade winds also generates the South American Low Level Jet (SALLJ) which brings 605 moisture from the Atlantic and the Amazon basin first to the eastern slopes of the Andes and then turning in a north-to-south fashion towards the southern part of the continent. It is known that the SALLJ is also responsible for bringing moisture to the Altiplano plateau since part of this flow is channelled into/goes over the plateau (Insel et al., 2010). This pattern correlates well with 12_PW (12_SR, 02_SR and 12_SM) which brings low altitude air masses from the eastern slopes of the station. The pathway also is strong during DJFMA and diminishes its influence in May. It is important to note that 03_PW and 12_PW do 610 not necessarily influence the station synchronously.
During the austral winter, the ITCZ migrates north and the subtropical westerly jet stream moves north reaching up to 20 degrees south Garreaud et al. (2003) :::::::::::::::::: (Garreaud et al., 2003). This creates an upper-level, large-scale westerly flow that favours air masses from the Pacific/Altiplano region. This coincides with the increment of the influence of both 07_PW (06_SM, 07_LR and 08_LR) and 08_PW (specially 08_SM and 09_MR). The first brings long-range subsiding air masses from high 615 up in the troposphere (7.7 km a.s.l.) while the second advects dry air from Altiplano at low-level (2.5 km a.g.l.). Both of these clusters are present during the dry season and reach maximum intensity in May.
The connections of pathways 05 and 11 to the general atmospheric dynamics is not immediately evident. Pathway 05_PW, which comes from the south , ::: and is strongest between December to ::: and : March and has a low centroid height a.g.l. : (3 km), : . :::: This could be linked to the development of a strong South Atlantic Convergence Zone starting in the eastern slopes of the 620 Altiplano at latitudes of around 20°that would pull surface air from the south. On the other hand, 11_PW which consists of 3 clusters, originates from both from the Amazon and the Pacific (Fig. : 9). These multiple sources areas, spanning both east and west, indicate that 11_PW is a "hybrid" pathway and therefore likely occurs in both the wet and dry seasons.
This study builds upon a previous source region analysis performed by Chauvigné et al. (2019) at the CHC station where a similar WRF setup was used in combination with back trajectories. Our methodological approach is notably different from 625 this earlier study primarily as we use a Lagrangian dispersion transport model rather than a back trajectory model meaning that turbulent mixing and convection processes that air parcels experience during transport are better represented in this study.
Additionally, our WRF simulations and hence meteorological data have a higher vertical resolution (61 levels compared to 28) and our 18 clusters provide more detail than the 6 clusters presented by Chauvigné et al. (2019). Our pathway results are largely in agreement with this previous study: similar source regions are observed for similar seasons and both studies show 630 that source regions from the west start influencing the station in the transition month of May. However, key additions of this study are (1) the vertical distribution of the air masses sources is more accurately captured, (2) the influence of the surface and the pseudo PBL on air sampled at CHC is more accurately quantified and (3) the diurnal cycle is captured by the analysis.
The type of analysis performed here is applicable to many other stations worldwide both in mountainous regions but also for stations in non-mountainous areas which are equally influenced by local and remote sources. Therefore, based on the 635 experience and knowledge gain here, we make the following recommendations for future studies: -For source identification in regions with complex terrain we strongly recommend the use of Lagrangian dispersion models over simple, limited number back trajectories based approaches. This has been previously noted also by Stohl et al. (2002).
-The accuracy of the meteorological input data is crucial for reliable results and therefore should be verified before 640 performing the FLEXPART simulations. In our case, this step revealed large biases in the Titicaca lake temperature which affected local wind patterns.
-Due to computational limitations, and given that the campaign only lasted for six months, we only simulated six months.
For a more complete physical understanding, if computational resources allow, we recommend that a full annual cycle is always simulated even if the observational campaigns the model simulations support are shorter in duration.
In this study we successfully developed a new method to identify air mass source regions in sites of complex topography. We then applied this methodology to the GAW station CHC, located near La Paz / El Alto at 5240 m.a.s.l. In order to accomplish this, we started with a WRF simulation in combination with FLEXPART to create a high-resolution data set of source areas for CHC. Then we applied our new method, based on cluster analysis, to transform the complex and large output dataset into 655 a user-friendly timeseries dataset of air mass source regions. We documented the characteristics of the identified source areas and demonstrated the strength and simplicity of the method's classification results by applying our method to confirm that the Sabancaya volcano is the source of sulfate measurements at the CHC station. The main conclusions of our analysis are: -On average, 9% of the air sampled at CHC has been in contact with the surface, and 24% with the "pseudo " :::::: pseudo : PBL, within the previous 4 days. Therefore we can conclude that on average, at any given time ∼76% of the measured air mass 660 at CHC represents free tropospheric air. Thus, the air masses sampled at CHC are very rarely purely free tropospheric air masses . ::::::: (Fig. 5a).
-The surface influence has a clear diurnal cycle, with low contributions during the night and higher contributions starting at 10 am local time and continuing during the day. The duration of the high surface influence during daytime is longer in the dry season (May) compared to the wet season (December -March). :::::::::::::: December-March, ::::::: Figs. 5b :::: and :: c). :

665
-Air masses arriving at CHC have a wide range of sources covering many different biomes and altitudes and it is common for any one specific sample time to have more than one source region . ::::: (Figs. :: 6, :: 8, ::: 11, :: 12 ::: and :::: S3). : -The most dominant pathway to emerge in our 6-month study is 03_PW which is responsible for 29% of the SRR and originates in the Amazon. However, as we detected that this PW does not occur in May, we hypothesize, based on Chauvigné et al. (2019), that if our analysis extended over all of the dry season (May-August ::::::::::: May-August), the overall 670 prevalence of this PW would decrease and others (e.g. : 07_PW and 08_PW) would increase . :::::: (Figs. 7 :::: and :::: S2).
-For the clusters' centroid positions, a linear relationship exists between the horizontal distance from CHC and the height above ground, with those farther away also being located higher in the atmosphere . :::::: (Figs. 7 :::: and ::: 10). : -Clusters located closest to CHC have the highest pseudo PBL influence and, rather than a linear decrease, the influence of the pseudo PBL decreases almost exponentially with increasing distance from CHC . ::::::: (Fig. 10).

680
To conclude, firstly, the method developed here can be applied to many other long term monitoring stations. Secondly, the data sets produced here, that provide detailed information about the sources of air masses sampled at CHC, will be applied in forthcoming studies on the chemical composition measurements made at CHC during the SALTENA campaign.
Disclaimer. The authors declare no conflict of interest.
-We acknowledge CSC -IT Center for Science Ltd. for the generous allocation of computational resources.
-DA and FB are funded by ERC (project CHAPAs no. 850614) and Finnish Centre of Excellence as well as the Academy of Finland (project no. 311932, 315203 and 337549).
-This study was supported by UMSA (Universidad Mayor de San Andrés) through the Laboratory for Atmospheric Physics (LFA) part of the Institute for Physics Research. The LFA provided scientific, administrative and logistical support.
-We also acknowledge the financial, administrative and logistical support from IRD (Institut de Recherche pour le De´veloppement) representation in La Paz Bolivia, through funding from by Labex OSUG@2020.
-DA wishes to thank the many scientists, technicians and personnel involved in the Chacaltaya station whose outstanding commitment enables high-quality atmospheric records in a challenging environment.
The digits also refer to the clockwise direction. The last column adds up the SRR [%] of the cluster belonging to each main pathway. The colours in the first column correspond to the 6 pathways and associated colours shown in Figs. 6a and b, 7, and 8. Figure 14. In panel a), we show the Pearson correlation coefficient between sulfate concentrations sampled at CHC and, both, the 18 clusters and the 6 pathways (PW). In panel b), the regions covered by the pathway 08_PW and the clusters 09_MR, 08_SM, 10_SR and 07_SR are shown along with Sabancaya and Ubinas volcanoes (both are located within 09_MR). All of the aforementioned clusters are contained by 08_PW. Panel c) and d show the timeseries of sulfate sampled at CHC and the normalized SRR timeseries of 08_PW and 09_MR respectively. Schematic :: of ::: the :::::::: rectangular :: to ::::::: log-polar ::::::::: regridding. ::: See ::::: section ::: A2 :: for :: a ::: full :::::::: description. :