Aerosol type classification analysis using EARLINET multiwavelength and depolarization lidar observations

We introduce an automated aerosol type classification method, called Source Classification ANalysis (SCAN). SCAN is based on predefined and characterized aerosol source regions, the time that the air parcel spends above each geographical region and a number of additional criteria. The output of SCAN is compared with two independent aerosol classification methods, which use the intensive optical parameters from lidar data: (1) “Mahalanobis distance automatic aerosol type classification” (MD) and (2) “Neural Network Aerosol Typing Algorithm” (NATALI). In this paper, data from the European Aerosol Research Lidar Network (EARLINET) have been used. A total of 97 free tropospheric (FT) aerosol layers from 4 typical EARLINET stations (i.e., Bucharest, Kuopio, Leipzig and Potenza) in the period 2014-2018 were classified based on a 3β+2α+1δ lidar configuration. We found that SCAN, being independent of optical properties, is not affected by the overlapping optical values of different aerosol types. Furthermore, SCAN has no limitations concerning its ability to classify different aerosol mixtures. Additionally, it is a valuable tool for classifying aerosol layers based even on single (elastic) lidar signals, in the case of lidar stations that cannot provide a full data set (3β+2α+1δ) of aerosol optical properties; it can therefore work independently of the capabilities of a lidar system. Finally, our results show that NATALI has the lowest percentage of unclassified layers (4%), while MD has the highest percentage of unclassified layers (50%) and the lowest percentage of cases classified as aerosol mixtures (5%).

perform the typing when the particle depolarization is not available, knowing that the mixtures cannot be resolved, so only the predominant aerosol type is retrieved.

"Mahalanobis distance aerosol classification algorithm"
The automatic classification algorithm described in Papagiannopoulos et al. (2018) makes use of the Mahalanobis distance function that relates an unclassified measurement to a predefined aerosol type. The method compares the observations to model distributions that comprise EARLINET pre-classified data. The Mahalanobis distance of an observation from each aerosol type is estimated, and the aerosol type with the minimum distance is assigned. Prior to the classification, two screening criteria are applied to ensure correct classification. The algorithm is able to classify an observation into a maximum of 8 (dust, volcanic, mixed dust, polluted dust, clean continental, mixed-marine, polluted continental, and smoke) and a minimum of 4 (dust + volcanic + mixed dust + polluted dust, mixed-marine, smoke + polluted continental, and clean continental) aerosol classes depending on the lidar configuration.
In this study, we used four aerosol intensive properties: the backscatter-related Ångström exponent at 355 and 1064 nm, the aerosol lidar ratio at 532 nm, the color ratio of the lidar ratios and the aerosol particle linear depolarization ratio at 532 nm, while the minimum accepted distance was set to 4.3. If the distance from one aerosol class is below the threshold and the remaining distances are above it, the observation is assigned to that aerosol class. If more than one distance is below the threshold, the normalized probability of a class needs to exceed 50% for that class to be assigned.
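The decision rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of Papagiannopoulos et al. (2018): the class means, the inverse covariance matrices and all function names are hypothetical. Because the text does not define the normalized probability, the sketch assumes it is proportional to exp(-d²/2) and normalized over all classes.

```python
import math

THRESHOLD = 4.3  # minimum accepted distance quoted in the text


def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance of observation x from a class described by its
    mean vector and the inverse of its covariance matrix."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    n = len(diff)
    # d^2 = diff^T * inv_cov * diff
    d2 = sum(diff[i] * inv_cov[i][j] * diff[j] for i in range(n) for j in range(n))
    return math.sqrt(d2)


def classify(x, classes, threshold=THRESHOLD):
    """Return (label, distances). `classes` maps a class name to a
    (mean, inv_cov) pair; both are hypothetical inputs in this sketch."""
    dist = {name: mahalanobis(x, m, ic) for name, (m, ic) in classes.items()}
    close = [name for name, d in dist.items() if d < threshold]
    if not close:
        return "unknown", dist   # no class is near enough
    if len(close) == 1:
        return close[0], dist    # exactly one class is below the threshold
    # ASSUMPTION: normalized probability ~ exp(-d^2 / 2), normalized over all classes.
    weights = {name: math.exp(-0.5 * d * d) for name, d in dist.items()}
    total = sum(weights.values())
    best = max(close, key=lambda name: weights[name])
    if weights[best] / total > 0.5:
        return best, dist
    return "unknown", dist
```

With this rule an observation lying midway between two nearby classes stays "unknown", which is consistent with the large share of unclassified layers reported for MD.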
This objective multi-dimensional classification scheme has found wide applicability and has been used with lidar (e.g., Burton et al., 2012; Papagiannopoulos et al., 2018), space-based polarimetry (Russell et al., 2014), and spectral photometry (e.g., Hamill et al., 2020; Siomos et al., 2020) data. Kavoudou et al. (2019) compared NATALI and the Mahalanobis distance classification algorithm for the EARLINET station of Thessaloniki. Their study used a 3+2 lidar configuration and 4 aerosol classes (i.e., dust, maritime, polluted smoke, and clean continental) for each automatic algorithm. In general, they found fair agreement between MD and NATALI, and the differences were attributed to the class definition and the range of the class intensive properties.

"Source classification analysis"
SCAN is an automated aerosol layer classification process that is independent of optical properties and was developed in the IDL programming language in the frame of this study. For each identified aerosol layer, an X-hour HYSPLIT backward trajectory (Draxler et al., 1998), arriving over a lidar station at the specific date and height at which the aerosol layer is observed, is used to calculate the amount of time the air mass travelled above predefined aerosol source regions. X is the number of hours of the backward trajectory, which can be chosen by the user at the beginning of the process. SCAN assumes specific regions (Fig. 1, coloured squares, hereafter referred to as domains) in terms of aerosol sources (Penning de Vries et al., 2015).
Taking into account the information (latitude, longitude and height) from each HYSPLIT air-mass backward trajectory, SCAN implements a number of criteria: (i) if the geographical coordinates for the specific hour of the backward trajectory are within the boundaries of the marine domains and if the height of this trajectory over this domain is below 1 km (Wu et al.; Ho et al., 2015), SCAN assigns this hour to the "marine" aerosol type. (ii) If the geographical coordinates for the specific hour of the backward trajectory are within the boundaries of the clean continental, polluted continental or dust domains and if the height of this backward trajectory is below 2 km over the domain, SCAN assigns the specific hour to the "clean continental", "polluted continental" or "dust" aerosol type, respectively. (iii) For an hour to be assigned to the "smoke" aerosol type, apart from the coordinates of the backward trajectory at this specific hour, which should be within the boundaries of the "clean continental" or "polluted continental" domains, the height of the trajectory at this specific hour should be below 3 km (Amiridis et al., 2010).
https://doi.org/10.5194/acp-2020-865 Preprint. Discussion started: 27 October 2020 © Author(s) 2020. CC BY 4.0 License.
Figure 1: SCAN's aerosol source type classification map. Different color squares represent different aerosol sources: orange squares correspond to dust, blue squares to marine, brown squares to clean continental and black squares to continental polluted.

the backward trajectory's path, a mixture of more than one aerosol type is assumed. In case the aforementioned criteria (domain and height limitations) are not satisfied, the aerosol type is considered 'unknown'.
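The hourly criteria and their aggregation into a layer label can be sketched as follows. This is an illustrative Python sketch, not the authors' IDL code. The rectangular domain boundaries are invented placeholders (the real ones are those of Fig. 1), and the `fire_nearby` flag is an assumption: the text does not say how a "smoke" hour is distinguished from a continental hour, so the sketch lets the caller supply that information. Only the five types whose criteria are stated in the text are covered.

```python
from collections import Counter

# HYPOTHETICAL rectangular domains: (type, lat_min, lat_max, lon_min, lon_max).
DOMAINS = [
    ("marine", 30.0, 45.0, -30.0, -10.0),
    ("dust", 15.0, 30.0, -10.0, 30.0),
    ("polluted continental", 45.0, 55.0, 0.0, 30.0),
    ("clean continental", 55.0, 70.0, 5.0, 40.0),
]
# Height limits quoted in the text (km).
MAX_HEIGHT_KM = {
    "marine": 1.0,
    "dust": 2.0,
    "polluted continental": 2.0,
    "clean continental": 2.0,
}
SMOKE_MAX_HEIGHT_KM = 3.0


def domain_of(lat, lon):
    """Return the type of the first domain containing the point, else None."""
    for name, lat0, lat1, lon0, lon1 in DOMAINS:
        if lat0 <= lat <= lat1 and lon0 <= lon <= lon1:
            return name
    return None


def classify_hour(lat, lon, height_km, fire_nearby=False):
    """Assign one trajectory hour to an aerosol type, or None if no criterion holds."""
    name = domain_of(lat, lon)
    if name is None:
        return None
    continental = name in ("clean continental", "polluted continental")
    # ASSUMPTION: smoke requires an externally supplied fire flag.
    if continental and fire_nearby and height_km < SMOKE_MAX_HEIGHT_KM:
        return "smoke"
    if height_km < MAX_HEIGHT_KM[name]:
        return name
    return None


def classify_layer(trajectory):
    """Count hours per aerosol type along a trajectory and build the layer label.

    `trajectory` is a list of (lat, lon, height_km) or
    (lat, lon, height_km, fire_nearby) tuples, one per hour."""
    hours = Counter()
    for point in trajectory:
        label = classify_hour(*point)
        if label is not None:
            hours[label] += 1
    if not hours:
        return "unknown", hours
    return "+".join(sorted(hours)), hours
```

Hours that satisfy no criterion are simply not counted, which mirrors the way SCAN leaves part of a trajectory unattributed.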
The maximum number of pure aerosol types that SCAN can assign to a layer is 6, and their combination gives SCAN the capability to identify aerosols from different sources in a specific layer. In Table 1

EARLINET lidar stations and data
The lidar station selection was based on the availability, in the EARLINET database during the period 2014-2018, of vertical profiles of a full set (3β+2α+1δ) of aerosol optical properties at several wavelengths: backscatter coefficient ($b_{355}$, $b_{532}$, $b_{1064}$), extinction coefficient ($e_{355}$, $e_{532}$), lidar ratio ($lr_{355}$, $lr_{532}$), Ångström exponent ($A_{e,355/532}$, $A_{b,355/532}$, $A_{b,532/1064}$) and particle linear depolarization ratio ($pldr_{532}$). The selected lidar stations are therefore: Bucharest, Romania; Kuopio, Finland; Leipzig, Germany; and Potenza, Italy (Table 2). In total, 48 full sets (3β+2α+1δ) of aerosol optical properties of lidar observations from the aforementioned stations have been studied. For each data set, and each aerosol layer, the geometrical layer boundaries (bottom, top) have been calculated according to Belegante et al. (2014). In total, 97 FT aerosol layers were obtained and their mean aerosol optical properties (intensive and extensive) were calculated.
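As a reminder of how the intensive properties relate to the extensive ones, the following Python sketch uses the standard definitions of the lidar ratio (extinction divided by backscatter) and of the Ångström exponent between two wavelengths. The function names and the simple layer averaging are assumptions made for illustration; the paper does not describe its own averaging code.

```python
import math


def layer_mean(values, heights_m, bottom_m, top_m):
    """Mean of a profile between the layer bottom and top (inclusive)."""
    inside = [v for v, h in zip(values, heights_m) if bottom_m <= h <= top_m]
    if not inside:
        raise ValueError("no profile points inside the layer")
    return sum(inside) / len(inside)


def lidar_ratio(extinction, backscatter):
    """Lidar ratio in sr: extinction (m^-1) over backscatter (m^-1 sr^-1)."""
    return extinction / backscatter


def angstrom_exponent(q1, q2, wavelength1_nm, wavelength2_nm):
    """Angstrom exponent of a quantity q (extinction or backscatter)
    measured at two wavelengths: A = -ln(q1 / q2) / ln(wl1 / wl2)."""
    return -math.log(q1 / q2) / math.log(wavelength1_nm / wavelength2_nm)
```

For example, a quantity that halves when the wavelength doubles gives an Ångström exponent of 1.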

Case studies
In this section, 4 selected atmospheric layers involving different types of probed aerosols are presented. The performance of the 3 automated aerosol typing algorithms in classifying the aerosols is discussed in detail.

In Fig. 5 we present the classification of the aerosol layers under study by (a) NATALI, (b) MD and (c) SCAN. For NATALI, the resulting "aerosol type" is given. For MD, the probabilities of each aerosol type considered by MD are also given. Finally, for SCAN, the time (in hours) during which the air mass circulated over the specified domains is also provided. Note that different colours refer to different aerosol types or aerosol mixtures.
Thus, in Fig. 5 we observe that NATALI classified the aerosol layer observed over Kuopio as "Continental Polluted". MD also gave the highest probability (65%) for the same layer to be of "Continental Polluted" origin, and typed it as such, while the second highest probability was 11% for "Smoke". Finally, SCAN attributed 24 hours of that air mass's backward trajectory to "Continental Polluted", 17 hours to "Clean Continental" and 2 hours to "Marine", out of the 144 hours of the air mass backward trajectory, so the layer was finally classified as a cp+cc+m mixture (Fig. 5i, (a)-(b)-(c)). For the 101 remaining hours of the air mass backward trajectory that SCAN did not take into account, it was assumed that the air mass travelled without being affected by any aerosol source, as a result of the combination of the height and domain limitations that SCAN applies during its classification process. Taking into account these criteria and the air mass back-trajectory provided by HYSPLIT, we would have expected SCAN to attribute more hours of that air mass to "Marine". The discrepancy arises because the domains of possible aerosol sources are predefined on the map, which reduces the spatial accuracy of the classification method (Fig. 1). Overall, the final results of the 3 methods (Fig. 5i, (a)-(b)-(c)) are in good agreement concerning the layer observed over Kuopio, although MD and SCAN can provide additional information on the constituents of the aerosol layer.

origin, as they are not affected by the sea spray at these heights (> 2.5 km amsl.). These erroneous classification results must have been affected by the low lr values (22-35±5 sr).
In this case MD gave the aforementioned lower aerosol layer a 41% probability of being of "Continental Polluted" origin, while there was also a 22% probability for the layer to be "Clean Continental". MD also classified the middle aerosol layer as "Marine" with a 63% probability, while it attributed a 16% probability to the "Continental Polluted" aerosol type. Again, the classification results for the lower and the middle layers have probably been affected by the low lr values.
Concerning the upper aerosol layer observed over Potenza, MD failed to characterize it, as it gave nearly equal probabilities for its possible aerosol types.
Taking now into account the results provided by SCAN concerning the lower aerosol layer observed above Potenza, SCAN counted 78 hours as "Dust" and 11 hours as "Continental Polluted" and finally classified the aerosols as a mixture of "Continental Polluted+Dust". Concerning the middle aerosol layer, SCAN counted 64 hours as "Continental Polluted" aerosols (out of 144 total hours), while one would expect a contribution of dust aerosols as well, according to both the $pldr_{532}$ values and the origin of the air masses based on the air mass backward trajectory analysis. This indicates, again, that the predefined domains on the map of possible aerosol sources reduce the spatial accuracy of the classification method, especially when it comes to the Southern Mediterranean Sea in the vicinity of the Saharan desert. In these cases, an atmospheric dust model (e.g. BSC Dream model) should be used synergistically. Finally, concerning the upper aerosol layer observed above Potenza, SCAN counted 66 hours as "Continental Polluted" aerosols and 44 hours as "Marine" aerosols, the latter being again highly unlikely at that height (~5 km amsl.), as previously explained. Here too, it becomes obvious that an atmospheric dust model should be used synergistically with the SCAN results when dealing with air mass backward trajectories passing over the Southern Mediterranean Sea, due to the vicinity of the Saharan desert.

Results
So far, 97 FT aerosol layers have been classified by the aforementioned three classification algorithms. The results have been separated into aerosol layers according to aerosol types, as follows: The "1 type" (Fig. 6, blue) and "mixture" (Fig. 6, cyan) categories represent the aerosol layers that consist of one and of two or more aerosol types, respectively. The "other" category consists of cases that NATALI marked as 'aerosol type/cloud contaminated' (i.e. marine/cc) and, finally, the "unknown" category (Fig. 6, yellow) consists of the cases in which the method was unable to identify the source of the observed aerosol layers. All the aerosol types and mixtures considered inside each algorithm are listed in Table 1.

It can be concluded that NATALI (Fig. 6a) is able to classify the highest number of cases (94 cases), while MD (Fig. 6b) failed to classify the highest number of cases (46%) and showed the lowest percentage of layers classified as "mixture" types, which is a reasonable outcome considering that the MD scheme considers only 2 aerosol mixtures, while NATALI and SCAN consider many more. Finally, the SCAN algorithm (Fig. 6c) classified 37% of the aerosol layers (36 layers) as "1 type", 30% (29 layers) as aerosol "mixtures", and 32% (32 layers) as "unknown" types.

Comparison of aerosol classification codes
In Fig. 7 we present the comparison of the classification results obtained using the pairs of (a) NATALI and MD, (b) MD and SCAN and (c) SCAN and NATALI. The number of aerosol layers classified as indicated by the row 'i' and the column 'j' is given inside each '(i,j)' square. For example, the number of aerosol layers classified as being of "1 type" by MD, but as a "mixture" of aerosols by SCAN, is shown inside the ("1 type"-"mixture") square (16 such cases for this example in Fig. 7b).
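The '(i,j)' squares described above amount to a cross-tabulation of the category that two methods assign to the same layer. A minimal Python sketch of such a table follows; the function names are hypothetical and the labels in the usage example are made up for illustration, not taken from the study.

```python
from collections import Counter

CATEGORIES = ["1 type", "mixture", "other", "unknown"]


def cross_tabulate(row_labels, col_labels):
    """Count layers for every (row category, column category) pair.

    Both lists hold one category per aerosol layer, in the same layer order."""
    if len(row_labels) != len(col_labels):
        raise ValueError("both methods must label the same layers")
    return Counter(zip(row_labels, col_labels))


def as_matrix(counts, categories=CATEGORIES):
    """Arrange the pair counts as a square matrix, rows i and columns j."""
    return [[counts.get((r, c), 0) for c in categories] for r in categories]


# Made-up labels for four layers, for illustration only.
md_labels = ["1 type", "1 type", "unknown", "mixture"]
scan_labels = ["mixture", "mixture", "1 type", "mixture"]
table = as_matrix(cross_tabulate(md_labels, scan_labels))
```

Here `table[0][1]` holds the number of layers that the row method labels "1 type" and the column method labels "mixture".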

Figure 7: Comparison between classified aerosol layers (a) NATALI and MD, (b) MD and SCAN and (c) SCAN and NATALI. The number of classified aerosol layers as indicated by the 'i' row and the 'j' column is given inside each '(i,j)' square.
In Fig. 7a it can be seen that the 45 cases classified as "unknown" by MD were classified as "1 type" (20 cases), "mixture" (5 cases) or "other" (17 cases) by NATALI, and as "1 type" (12 cases) or "mixture" (13 cases) by SCAN (Fig. 7b). Furthermore, it seems that MD is unable to discriminate the different aerosol types inside the "other" layers according to NATALI (Fig. 7a) and the "mixture" layers according to SCAN, labelling them as "1 type" (Fig. 7b), which is probably because only two aerosol mixtures are considered inside MD. Concerning the 29 "unknown" cases classified by SCAN, 20 of them were also identified as "unknown" by MD (Fig. 7b), and they were almost equally split between "1 type" (12 cases) and "other" (16 cases) by NATALI (Fig. 7c).

Figure 8a presents the comparison between the cases that MD classified as "1 type" or "mixture" against those classified as "other" by NATALI, while Fig. 8b shows the number of aerosol layers that MD classified as "unknown" and NATALI as "1 type". Finally, in Fig. 8c we present the number of aerosol layers that MD classified as "unknown" and NATALI as "mixture" or "other". The pie charts above each bar (Fig. 8b) and each stem (Fig. 8c) reveal the mean frequencies of each aerosol type as calculated by MD.

The inability of MD to classify the aerosol layers according to NATALI's classification (Fig. 8a) can be attributed to characteristics of the aerosol layers that are not well modelled by the algorithm, meaning that the intensive parameters are not within the accepted "borders" of the pre-defined classes of the algorithm. Concerning the "1 type" aerosol layers according to NATALI (Fig. 8b), MD would have predicted them correctly if it had labeled these layers according to the aerosol type with the highest percentage, as the pie charts above each bar indicate (Fig. 8b). This does not seem to be the case for the "Dust" type labeled by NATALI, to which MD gave a 40% probability of being "Continental Polluted", 30% "Smoke" and only 30% "Dust" and "Mixed Dust" (Fig. 8b). Concerning the "mixture" and "other" layers according to NATALI (Fig. 8c), it seems that MD found a high contribution of dust and mixtures of dust aerosols (approximately 50%) inside these layers (yellow, orange and dark red aerosol types according to the pie charts above the stems).

MD versus SCAN
Figure 9a presents the comparison between the cases that were classified by MD as "1 type" and as "mixture" by SCAN, while Fig. 9b shows the number of aerosol layers that were classified by MD as "unknown" and as "1 type" by SCAN. Finally, Fig. 9c presents the number of aerosol layers classified by MD as "unknown", and as "mixture" or "other" by SCAN. The pie charts above each bar (Fig. 9b) and each stem (Fig. 9c) reveal the mean frequencies of each aerosol type as calculated by MD.

Figure 9: (a) Comparison between the cases that were classified by MD as "1 type" and as "mixture" by SCAN. (b) Number of aerosol layers that were classified by MD as "unknown" and as "1 type" by SCAN. (c) Number of aerosol layers that were classified by MD as "unknown" and as "mixture" or "other" by SCAN. The pie charts above each bar and each stem reveal the mean frequencies of each aerosol type as calculated by MD.

The misclassification of an aerosol layer by MD compared to its classification as a "Continental Polluted and Smoke" layer by SCAN (Fig. 9a) could, again, be attributed to the location of the observation relative to the location of the predefined aerosol types of the MD classification algorithm, which depends on the aerosol optical properties of the studied layers.

Figure 10: Comparison between the cases classified by (a) SCAN as "mixture" and as "1 type" by NATALI, (b) SCAN as "1 type" and as "other" by NATALI, and (c) SCAN as "mixture" and as "other" by NATALI.
From Fig. 10 it can be concluded that, of the 12 cases that SCAN classified as "Continental Polluted and Smoke" (cp+s), NATALI classified 6 as "Clean Continental" (cc), 4 as "Continental Polluted" (cp) (Fig. 10a) and 2 as "Continental Polluted and Marine/Cloud Contaminated" (cp+m/cc) (Fig. 10c). Moreover, of the 6 cases that SCAN classified as "Continental Polluted" (cp), NATALI classified 3 as "Continental Polluted and Marine or Cloud Contaminated" (cp+m/cc), 2 as "Marine/Cloud Contaminated" (m/cc) and only 1 as "Continental Polluted, Smoke and Marine or Cloud Contaminated" (cp+s+m/cc) (Fig. 10b). It seems that the aerosol optical properties of the cp+s mixture are attributed either to the "Clean Continental" or to the "Continental Polluted" aerosol type by the NATALI classification.

Aerosol Optical Properties
The mean values of the aerosol optical properties derived from the NATALI, MD and SCAN classification for each aerosol type are presented in Table 2. The correspondence between the aerosol types and the terminology defined inside the classification methods is presented in Table 1.

Conclusions
In this study, we compared three independent aerosol classification methods: the "Neural Network Aerosol Typing Algorithm", the "Mahalanobis distance automatic aerosol type classification" and the "Source Classification Analysis" using 97 free tropospheric aerosol layers from 4 EARLINET stations (Bucharest, Kuopio, Leipzig and Potenza) from 2014-2018.
NATALI is an automated aerosol layer classification neural network, depending on the aerosol optical properties (3β+2α+1δ) directly obtained from the EARLINET database. MD is an automated aerosol layer classification algorithm, depending on the mean values of the aerosol optical properties ($A_{e,355/1064}$, $lr_{532}$, $lr_{532}/lr_{355}$, $pldr_{532}$ and $A_{b,1064/532}$) of the probed atmospheric layers. SCAN, introduced for the first time in this study, is based on the automation of the typical classification method, while its classification procedure is based on the amount of time that an air parcel spends over specific pre-characterized aerosol source regions and a number of additional criteria, as presented in detail.
We concluded that NATALI showed the lowest percentage (4%) of unclassified layers. When compared to MD, NATALI's "'X' or Cloud Contaminated" aerosol layers (where 'X' is either an aerosol type or a mixture) are classified by MD as "Clean Continental" layers, except when 'X' is a mixture of dust aerosols. When compared to SCAN, SCAN's "Continental Polluted and Smoke" layers are classified by NATALI as either "Clean Continental" or "Continental Polluted".

Furthermore, we found that MD was unable to classify almost 50% of the layers under study. Compared to NATALI, these layers belong either to one single aerosol type or to aerosol mixtures. Concerning MD's "unknown" category and NATALI's single aerosol types, we showed that MD's mean percentages predict the aerosol type of each layer quite well, even though this aerosol type is not chosen by the classification process of MD. Concerning MD's "unknown" and NATALI's "mixtures" categories, the MD algorithm revealed an increased contribution of dust and dust aerosol mixtures (approximately 50%) inside the studied aerosol layers. Compared to SCAN, MD's "unknown" layers are mainly either "Continental Polluted" or "Continental Polluted and Smoke". Finally, SCAN's "Continental Polluted and Smoke" layers are classified by MD either as "Clean Continental" or "Continental Polluted".
We found that the SCAN code successfully managed to classify more than 50% of the layers studied, either as a single aerosol type or as mixtures of different aerosols. Being independent of aerosol optical properties, SCAN has the advantage that its classification process is not affected by the overlapping values of the optical properties representing more than one aerosol type (e.g., clean continental, continental polluted, smoke). Furthermore, it has no limitations concerning its ability to classify aerosol mixtures, an advantage that arises from the air mass trajectory analysis and the relevant aerosol sources at ground. Finally, it can be useful for all types of lidar systems (independently of the number of channels used), as well as for other network-based systems (radar profilers, sunphotometers etc.).