Constructing a data-driven receptor model for organic and inorganic aerosol &#8211; a synthesis analysis of eight mass spectrometric data sets from a boreal forest site

Abstract. The interactions between organic and inorganic aerosol chemical components are integral to understanding and modelling climate and health-relevant aerosol physicochemical properties, such as volatility, hygroscopicity, light scattering and toxicity. This study presents a synthesis analysis for eight data sets, of non-refractory aerosol composition, measured at a boreal forest site. The measurements, performed with an aerosol mass spectrometer, cover in total around 9 months over the course of 3 years. In our statistical analysis, we use the complete organic and inorganic unit-resolution mass spectra, as opposed to the more common approach of only including the organic fraction. The analysis is based on iterative, combined use of (1) data reduction, (2) classification and (3) scaling tools, producing a data-driven chemical mass balance type of model capable of describing site-specific aerosol composition. The receptor model we constructed was able to explain 83 ± 8 % of variation in data, increased to 96 ± 3 % when signals from low signal-to-noise variables were not considered. The resulting interpretation of an extensive set of aerosol mass spectrometric data infers seven distinct aerosol chemical components for a rural boreal forest site: ammonium sulphate (35 % of mass), low and semi-volatile oxidised organic aerosols (27 and 12 %), biomass burning organic aerosol (11 %), a nitrate containing organic aerosol type (7 %), ammonium nitrate (5 %), and hydrocarbon-like organic aerosol (3 %). Some of the additionally observed, rare outlier aerosol types likely emerge due to surface ionisation effects, and likely represent amine compounds from an unknown source and alkaline metals from emissions of a nearby district heating plant. Compared to traditional, simplistic inorganics apportionment methods for aerosol mass spectrometer data, our statistics-based method provides an improved, more robust approach, yielding readily useful information for the modelling of submicron atmospheric aerosols physical and chemical properties. The results also shed light on the division between organic and inorganic aerosol types and dynamics of salt formation in aerosol. Equally importantly, the combined methodology exemplifies an iterative analysis, using consequent analysis steps by a combination of statistical methods. Such an approach offers new ways to home in on physicochemically sensible solutions with minimal need for a priori information or analyst interference. We therefore suggest that similar statistics-based approaches offer significant potential for un/semi supervised machine-learning applications in future analyses of aerosol mass spectrometric data.



Introduction
Along with particle size, aerosol chemical composition is fundamental in understanding aerosol physicochemical effects such as hygroscopicity, volatility, optics and toxicity (Bilde et al., 2015;Swietlicki et al., 2008;Zimmermann, 2015). In the past decade aerosol mass spectrometry has provided a way to quantitatively resolve basic chemical composition of aerosol in almost real-time. This not only enables basic chemical speciation into organic and common inorganic ion species, but also produces 5 a wealth of complex mass spectrometric data. It has since become clear that these data sets, although superficially hard-tointerpret, are rich in chemical information and their statistical analysis yields considerable new knowledge. However, tapping into this information source requires use of advanced statistical tools and chemometric methods. Consequently, advanced statistical methods for data reduction have quickly gained traction in aerosol mass spectrometry, and are presently widely used for deconvolution of complex, organic mass spectra into their underlying components. Specifically, the Positive Matrix 10 Factorization algorithm (PMF, Paatero and Tapper, 1994) has achieved a predominant status as the state-of-the-art analysis tool for deconvolving aerosol mass spectrometric data. Factorisation methods such as PMF notably allow for the condensation of information found in high-dimension data matrices into a manageable number of factors, corresponding to e.g. aerosol chemical species, sources or processes. Data reduction often additionally allows for improved visualisation, aiding in interpretation of the underlying aerosol chemical phenomena . 15 In exploratory factor analysis, the principal difficulties often relate to deciding the optimal number of factors, choosing between multiple solutions of mathematically similar quality, and estimating the reliability and uncertainty of the results. Lacking robust but easy-to-use mathematical tools, the selection and interpretation of factorisation solutions remains prone to subjective bias by the analyst. Evaluation and verification of a factorisation solution thus generally requires meticulous study and understanding of e.g. correlations with auxiliary data, temporal changes and cycles and spectral references. While statistics-20 driven methods for spectra comparison and classification as of yet remain marginal in aerosol mass spectrometry, they do show promise in their capability to automatically group similar spectra based on their chemically relevant features, producing comparable classifications to those performed manually by expert analysts Rebotier and Prather, 2007;Freutel et al., 2013).
The overwhelming majority of PMF analyses to date from AMS have been performed on the organic fraction alone (Zhang et 25 al., 2011). Contrary to popular belief, there exists no tenable reasons to limiting chemometric analysis to organic signals, as exemplified by the analyses of Sun et al. (2012) and Hao et al. (2014). Although it requires some additional data preparation and processing, inclusion of inorganics provides additional insight into e.g. salt formation in aerosol. In this work, we apply data reduction and classification methods for analysing organic and inorganic aerosol mass spectral data from several measurement campaigns in the boreal forest. We then derive a comprehensive receptor model resolving the dominant aerosol 30 categories at the site. In addition, by presenting an example of a semi-supervised, statistics-driven analysis of large mass spectral data sets, we hope to pave the way for machine learning based data analysis approaches, reducing the need for expert analyst input and subjective judgement at each step.

Data sets
In this study, the aerosol composition was monitored by an AMS between to 2008 and 2011, during several short measurement campaigns. Notable larger, intensive campaigns at the time were the EUCAARI project Kulmala et al., 2009; and HUMPPA-COPEC (2010;Williams et al., 2011;. The sets of data used along with their timeframes are shown in Table 1. 5

The aerosol mass spectrometer (AMS) instrument and basic data processing
The mass spectrometric data for this study was acquired with a Time-of-Flight Aerosol Mass Spectrometer (ToF-AMS), 10 developed by Aerodyne Research Inc. (Billerica, MA, U.S.). AMS instruments in general have been described by Canagaratna et al. (2007), and the compact ToF analyser version (CToF) used in this study by Drewnick et al. (2005). Additional, more specific details related to the specific instrument we used are available in our previous study (Äijälä et al., 2017).
In brief, the AMS instrument sucks sample aerosol from atmospheric pressure to vacuum conditions through an inlet system consisting of a critical orifice and a particle concentrating aerodynamic lens (Liu et al., 2007). The sample aerosol beam is 15 directed at a vaporizer operated at 600 degrees Celsius, whereby flash vapourisation of non-refractory aerosol components occurs. The resulting vapour is ionized using 70 eV electron impact ionisationa well-characterised hard ionisation technique, resulting in rather universal and predictable but highly fragmenting ionisation. Finally, the ions are led to an orthogonal extraction reflectron time-of-flight mass analyser, where the ions' mass-to-charge (m/z) ratios are measured.
The per-amu (atomic mass unit) analyser signal is subsequently quantified based on instrument response calibrations and 20 corrections (among others the correction for relative ionisation efficiency between the species; RIE; Allan et al., 2004); supplementary information Sect S.4). Individual, unit-mass-resolution amu signals are then chemically speciated, based on Atmos. Chem. Phys. Discuss., https://doi.org /10.5194/acp-2018-1069 Manuscript under review for journal Atmos. Chem. Phys. Discussion started: 19 October 2018 c Author(s) 2018. CC BY 4.0 License. chemical information on fragmentation and air composition (see Allan et al., 2003b), for details). Additional, specific minor modifications to our instrument have been discussed in our previous work (Äijälä et al., 2017).

Data preparation and down-weighting
After basic processing, the data was further prepared, to serve as input for factorisation (described in following Section, 2.3).
The organic and inorganic data and related uncertainties were extracted, and down-weighting of signals performed. The 5 procedure for extraction and preparation of AMS organic signal and related error matrices has been described by Allan et al. (2003b) and Ulbrich and co-workers (2009).
In short, measurement points or variables with missing data were omitted and error matrices calculated, based on a function accounting for both counting statistics induced uncertainty as well as background noise from the detector and electronics. The signals were then down-weighted by multiplying the error matrix conveyed uncertainty values for low signal-to-noise ratio 10 (SNR) variables with a scalar: "weak" variables (SNR < 3) were down-weighted by a factor of 2 and "bad" variables (SNR < 1) by 10. The procedure for inorganics (SO4, NO3, NH4, Chl; i.e. sulphates, nitrates, ammonia and chloride species) was similar to that used for the organics ("org"), including for the down-weighting of signals derived from fragmentation calculations.
Analogous to the basic procedure of down-weighting "duplicate information" organic signals, e.g. those derived from m/z 44 Th (mainly CO2 + ), similarly derived inorganic signal weights were normalized so that their weight of the original plus 15 "duplicate" signals equalled that of the original signal. Finally, the matrices for all the ion species (org,SO4,NO3,NH4,Chl; in nitrate equivalent mass), were combined to form the final input matrices for factorisation, while retaining speciation information in the ion indexing.

Positive matrix factorixation 20
For factorisation, we used the Positive Matrix Factorisation (PMF) model developed by P. Paatero and colleagues (Paatero, 1997(Paatero, , 1999Paatero and Tapper, 1994), and widely used for analysis of AMS data since 2007 (Lanz et al., 2007b;. In brief, PMF is a statistical model, typically resolving a bilinear linear combination of factor profiles (G) and time series (F) best describing the measured data matrix (X; Equation 1). The residual matrix E then denotes the portion of data left unexplained by the model (i.e. residual). PMF model is thus formulated: 25 (1) The brackets indicate matrix dimensions, with v denoting number of variables, t the number of time points, and f the number of factors. As shown in Equation 1, the model can be solved for any f (< v, t), requiring it to be selected by the analyst. 30 Atmos. Chem. Phys. Discuss., https://doi.org /10.5194/acp-2018-1069 Manuscript under review for journal Atmos. Chem. Phys. Discussion started: 19 October 2018 c Author(s) 2018. CC BY 4.0 License.
The main features setting PMF apart of other similar factorisation models, and making it particularly suitable for atmospheric aerosol models, are on one hand the limitation of factor profiles and time series to positive values and hence drastically reducing the amount of rotational ambiguity. On the other hand, the improved error model where the quantity to minimise is the weighted (typically the measurement uncertainty) residual, resulting in higher weight for the variables with better SNR. In PMF, the minimum weighted residual is solved using one of the related algorithms, i.e. PMF2 or Multilinear Engine 2 (ME2; 5 Paatero, 1999). Of the two algorithms, ME2 can take in additional equations defined by the user, i.e. constraints the solutions need to adhere to. For running the PMF and ME2 algorithms, we used the Igor Pro (Wavemetrics Inc.) based SoFi (v. 4.8) user interface developed by F. Canonaco and co-workers at Paul Scherrer Institute (PSI). The interface allows input of the preprocessed data and user selected parameters, and calls on the solver algorithms (PMF2 or ME2, depending on assignment) to return a solution to be displayed and analysed in SoFi (Canonaco et al., 2013). 10 When PMF is used as a standalone method for source attribution, the selection of solution needs to be carefully validated.
Sensitivities towards different number of factors, rotations and initialisation seeds are meticulously analysed, and correlations with auxiliary data computed. A case is then made for why the selection is the best possible. Contrarily, in our analysis approach, we do not claim to arrive at optimum solutions for individual PMF/ME-2 runs. Instead, we rely on multitude of data de-convolution runs to uncover the main structures in the ensemble of all data sets, and use statistical classification methods 15 to evaluate the general outlook and commonalities between PMF/ME-2 factors at each analysis phase. As discussed in Section 2.5, this trade-off instead enables us to concentrate on best modelling the entirety of all data sets.

Relaxed chemical mass balance model
To harmonise the description of aerosol components, we constructed a constrained receptor model, where all the profile components were constrained. For this purpose we applied a ME-2 based, chemical mass balance (CMB) type of model. CMB 20 models are typically used as receptor models for cases where source profiles are known, and only the mass loading information needs resolving (Friedlander, 1973;Gordon, 1988;Hopke, 1991;Miller et al., 1972;Hopke, 2016). In such mass conservationbased models, the observed loadings are modelled as a sum of multiple individual sources. Although CMB is mathematically often presented as sum of loadings (supplementary information, hereafter also S.I; Sect S.1, Eq. S.1), it can also be thought of as a special case of the bi-linear model described in Equation 1. Only now the profile matrix (F) is assumed fixed, simplifying 25 the problem to resolving the loading matrix (G) which minimises the residual (E). CMB can be run using the SoFi interface, using the same ME-2 solver as for PMF and ME-2 applications (Canonaco et al., 2013).
In this work, we use a relaxed CMB-like bilinear model (henceforth abbreviated as r-CMB), where all the source profiles are constrained, but allowed to vary within narrow limits. In strict technical terms this approach could be labelled "an extremely constrained ME-2 model", but we choose to use the term "relaxed CMB" to differentiate between the typical use of ME-2 or 30 constraining only part of the profiles, which allows the model considerably more freedom. We regard our use of the model is much closer to the idea of constraining all profiles than (semi-)exploratory factorisation typical for ME-2. The naming also serves to better highlight the conceptual differences between models in the different analysis phases. Generally, the biggest problems of the CMB models relate to the selection of source profiles, typically from spectral libraries, and handling of their uncertainty. In our use, the anchor spectra as well as the limits for their allowed variabilities are experimentally derived from data, alleviating some of these typical concerns.

k-means clustering
For spectra classification, we selected the k-means algorithm, specifically because in our previous tests it was successful in 5 classifying similar spectral data. The earlier tests additionally yielded of useful information on selection of the dissimilarity metric, as well as algorithm initialisation types and data weighting (Äijälä et al., 2017). K-means (e.g. Ball and Hall, 1965;MacQueen, 1967;Steinhaus, 1956;Jain, 2010)) is a rather simple, iterative algorithm that partitions a group of objects to a predesignated number of groups or 'clusters' based on their relative distances (i.e. dissimilarities). For each iteration, the algorithm assigns all objects to their closest centroids, which are then re-calculated from the mean variable values of the objects 10 in the updated cluster. The aim is to minimize the within-cluster sum of distance (variance) (J) between the objects' (Cn) locations (xi) and the cluster centroid µn they are assigned to (Eq. 2): (2)

15
The k-means algorithm iteratively converges on (any) minimum of total J (C) obtained by summing over all objects Cn. To increase chances of finding a global minimum, repetitions using different initialisations are used. Specifically, we used the improved stepwise initialisation 'kmeans++' (Arthur and Vassilvitskii, 2007); available in e.g. Matlab v. 2017a; Math Works Inc., Natick, MA. U.S.).

Spectral similarity and mass scaling 20
Based on our earlier metric comparison (Äijälä et al., 2017), we used (Pearson) correlation as a metric for spectral dissimilarity (or "distance", d;Fortier and Solomon, 1966;Mcquitty, 1966): 25 where u and v are the spectra in vector form, with m/z variables as vector components. ̅ and ̅ are the arithmetic mean values of u and v.
In clustering mass spectra, data weighting is often applied. Based on previous tests (Äijälä et al., 2017), we applied mass scaling of variables, advocated by Stein and Scott and others (Stein and Scott, 1994;Kim et al., 2012;Horai et al., 2010), giving additional emphasis to higher mass signals. This common practice is based on the idea that higher mass fragment ions 30 are more indicative of their parent ions, and thus the original characteristic composition, while smaller fragments can be produced from a wider variety of molecular fragmentation events. In mass scaling the weighted variables (̂) are calculated by multiplying the original variables (x) by mass-to-charge-specific weights (w), as presented in Equation 4.

5
where the scaling factor sm was optimised for each classification separately (SI; Sect. S.2).

Silhouette metric and post-weighting
The optimisation of mass scaling was based on silhouette metric (later also abbreviated as "silh"; (Rousseeuw, 1987), ranging between -1 to 1 and providing a straightforward, quantitative way to evaluate performance of the classification algorithm. The object-specific silhouette value si, defined as 10 where ai corresponds to the mean distance to other objects in the same cluster, and bi similarly to the mean distance to objects in the nearest neighbouring cluster. A silhouette value close to unity indicates the object is well clustered, while a value close to zero indicates the classification is uncertain, and the point is likely situated in-between two possible centroids. A negative 15 cluster value is indicatory of possible misclassification. Silhouette values can be calculated for any single cluster as the arithmetic mean of the cluster members' silhouettes, or similarly as a mean over all objects, to evaluate the quality of the clustering solution as a whole.
In order to mitigate the k-means algorithm's known sensitivity to outliers, and to improve handling of between-cluster samples, we applied a simple post-processing to all cluster centroids and variability calculations: the centroid spectra and variabilities 20 were calculated as weighted averages μ, and weighted standard deviations (̂2; Eq. 6) respectively, instead of the normal unweighted values (similar to Äijälä et al., 2017). As weights, we used the object specific silhouette values si > 0 (Eq. 5):

25
where vi are the cluster member objects (spectra) This procedure down-weights likely misclassified objects (silhouette < 0) to zero, and penalises the more uncertain or questionable assignations (low silhouette) compared to the decidedly well-clustered objects (silhouette close to unity). Singleton clusters were omitted from this calculation, and their variability thus left undefined.

Ion balance model for inorganics
Aerosol inorganic chemical speciation is better understood than the organic speciation, due to much lower diversity of the chemical compounds involved. There exist a variety of aerosol inorganic equilibrium models, typically used as modules in atmospheric meteorological and air quality models. However, performing thermodynamic equilibrium calculations is 5 computationally demanding (e.g Fountoukis and Nenes, 2007), and requires a good deal of auxiliary data on thermodynamic conditions and chemical activities. Due to the complexity of the models and increased data needs, simpler approximations are often used in connection to AMS inorganic speciation. In the following ion-balance-scheme description, we denote the respective AMS ion species molar concentrations in square brackets (e.g. A typical salt formation approximation used for AMS results is the Hong et al. (2017) ion pairing scheme, used in e.g. aerosol 10 volatility and light scattering models (Hong et al., 2017;Zieger et al., 2015). The Hong et al (2017) scheme is based on similar approximation of Gysel et al. (2007), which in turn is a simplification of the more extensive model by Reilly and Wood (1969).
We modified the Hong et al. scheme to additionally allow organonitrate (orgNO3) and speciate any leftover [NH4 + ] as its own class ("excess NH4 + "). The full scheme is available in supplementary material (Section S.3), and a schematic description is presented in Figure 1. 15 Briefly, in the scheme we apply, NH4 + is first combined with SO4 2to form ammonium bisulphate and/or ammonium sulphate "excess" and assigned to a separate class. For comparability with other models, any nitrate not in NH4NO3 is labelled organic.
Despite the label, we note this class not only encompasses organonitrates, but also any NO + fragment signal from amines, N-20 containing organics and may even contain influences of other inorganic nitrate species, such as KNO3, which are not considered separately in this simple model. Finally, since chloride loadings at the measurement site are generally negligible, neutralisation of hydrochloric acid (H2O:HCl) was not included to keep this scheme rather simple. We note ion balance schemes depending on relative ion abundances, such as the one described here, can be sensitive to measurement uncertainties (e.g. errors in RIE values) of these ratios. The topic is further discussed in supplementary information (Sect S.4) 25 Atmos. Chem. Phys. Discuss., https://doi.org /10.5194/acp-2018-1069 Manuscript under review for journal Atmos. Chem. Phys.

Constructing a data-driven r-CMB receptor model
As stated in the Introduction, one of the aims of our work was to derive a robust, harmonised receptor model for the 5 measurement site via explorative analysis. Considering the large amount of campaigns during different seasons, resulting in changing aerosol source contributions and mass spectral profiles, factorisation needed to be performed on a per-campaign (data set) basis. However, instead of performing traditional PMF complete with correlation analysis, source validation and the various sensitivity analyses separately, which would be an arduous task even for a single measurement set, we used the large amount of data sets to our advantage. Instead of optimising individual factorisations, we constructed an r-CMB model 10 applicable to all data sets. A similar task of constructing a semi-exploratory synthesis aerosol model, albeit applying a different methodology, was undertaken and reported by Sofowote et al. (2015).
To derive the anchors and constraints for a synthesis r-CMB model, we analysed the data in three phases (P-I to P-III; Figure   2), each consisting of factorisation, classification and silhouette-based post-weighting of anchor spectra and their allowed variabilities. In Phases I and II, a fixed number of 10 factors were resolved. This amount of factors was semi arbitrarily chosen, 15 and in our case likely to be somewhat above the optimal amount for most data sets, leading to over-resolved factor solutions.
However, unlike in traditional PMF analysis, we can use additional statistical diagnostics and post-processing options available to deal with potential fallout of unrealistic factor splitting (i.e. classification for identifying outliers and post-processing downweighting or nullifying their influence. Sensitivity to initialisation seed was examined by performing all runs using 10 initialisation seeds, and generally selecting the solution with lowest normalised residual. In rare cases of a physically unrealistic 20 solutions as the one with lowest residual (e.g. only NH4 species in a factor), a higher residual solution was chosen instead. We conclude the solutions were generally insensitive to seed selection, especially for the factors with non-negligible mass contribution.

Phase I: anthropogenic aerosols
In phase I, we performed unconstrained factorisation for all the 8 data sets. With 10 factors this resulted in a total of 80 factor 25 mass spectra. We then determined the dominant spectra classes using k-means clustering. To that purpose, we applied optimised mass scaling for improved data structure, and used silhouette diagnostics to evaluate the optimal number of clusters.
We identified the known, common anthropogenic aerosol classes from the silhouette-weighted cluster centroids. This is also an approach advocated by Crippa et al. (2014) in their similar work on a synthesis analyses of several data sets.
For a cluster centroid to qualify as an anchor for further phases of our analysis, we applied the following two criteria: (1) The 30 spectra forming the cluster were present in multiple (≥ 3) the data sets, (2) The spectra were interpretable chemically, and had adequate support from previous studies in form of literature and/or calibrations. We note that defining what constitutes as "interpretable" or "adequate support" is inevitably an analyst (subjective) decision, so we endeavour to make our reasoning transparent in the respective discussion sections. Adhering to criterion (1) also means that factors showing up only for 1 to 2 campaigns, due to special conditions (emission, meteorology etc.), are omitted from the final r-CMB model. We will briefly cover some of the more interesting "outlier observations" in Section 3.4. At the end of Phase I, a number of constrained anchor spectra and within-cluster-variabilities were obtained. In this case, these corresponded to four anthropogenic classes, which 5 will be discussed in more details in the results section.

Phase II: biogenic, secondary organic aerosols
Using the anchors and within-cluster-variabilities, we re-ran factorisation as in P-I, except now partly constrained (ME2; 4 of 10 factors constrained using anchors from P-I). In phase II, we focused on analysing the remaining free factors, likely corresponding to the biogenic, and assumedly more variable factors (Canonaco et al., 2015;Crippa et al., 2014). The procedure 10 for classification, and the selection criteria for the (assumedly) biogenic SOA in this phase was same as in phase I.
Due to the data-driven analysis approach, specifically the constrained factors being selected from phase I, we do not expect major changes between phase I and phase II results. While arguably the methodology could be further developed to constrain the r-CMB components directly from phase I result, phase II of our analysis currently serves several purposes: 1) it should narrow down the solution space for improved description of the various SOA types, by constraining the anthropogenic, 15 assumedly primary aerosols. 2) Compared to P-I, the allowed solutions are more similar for all data sets in P-II, which reduces the scatter of the factorisation solutions. This reduces the spectral variability (uncertainty) arising from the analysis process itself, allowing us to iteratively converge on more realistic limit values for the constraints. Ultimately, the limits should reflect the actual, natural chemical variabilities within the aerosol types. 3) Similarity of results between successive, un/semiconstrained phases allows evaluation of stability, reliability and repeatability of the method, so that it is not e.g. overly sensitive 20 to rotational ambiguity or initialisation parameters of algorithms. This is important since the method described here is new, and its robustness needs to be demonstrated, but less so in potential later use.

Phase III: final, constrained receptor model
In phase III, we constructed the r-CMB receptor model. In this phase, all the factors were constrained using anchors and variabilities from the previous phase result. The number of components in the final r-CMB model, in our case 7, was equal to 25 the total number of selected aerosol types in phase II. With these model constraints, we performed runs for each of the 8 data sets separately. Using the resulting 8×7 factor profiles, we determined the likely range of variability for the aerosol types, and calculated final, silhouette weighted reference spectra for the components by performing a final round of clustering.
Atmos. Chem. Phys. Discuss., https://doi.org /10.5194/acp-2018-1069 Manuscript under review for journal Atmos. Chem. Phys. Discussion started: 19 October 2018 c Author(s) 2018. CC BY 4.0 License. Figure 2. A flowchart illustrating the analysis using combined methodology. After initial data collection and preparation, statistical analysis is performed in three phases (P-I to P-III). Each phase limits the freedom given to factorisation from completely free (PMF) to nearly fully constrained (r-CMB). Finally, we evaluate and interpret the r-CMB model from an aerosol chemical perspective.

Results and discussion 5
In Section 3.1, we briefly describe the results from analysis phases I to III (P-I to P-III; corresponding to Sections 3.1.1 to 3.1.3), but concentrate more on the receptor model results and their interpretation (Sections 3.2). Finally we will compare our results with reference methods (Section 3.3). Comparison results are available in literature for organic aerosol components (Sect 3.3.1), and in Sect. 3.2 we will compare inorganic speciation with the alternative inorganic attribution methods, described in Methods (Sect 2.4). Finally, we briefly describe some of the outlier observations which contain potentially interesting 10 chemical information (Section 3.4).
When interpreting and identifying aerosol components, we evaluate spectral similarity using the same similarity metric (mass scaled correlation) as for the clustering (Equations 3 and 4). We thus report mass scaled squared correlation coefficients (rs 2 ) between reference spectra and our corresponding final spectrum for the class (P-III silhouette-weighted centroids; sm=1.81).
For easier comparability, all ratios and fractions of signals presented in the following sections are similarly calculated from 15 the corresponding final spectra (P-III).

Phase I: identification of anthropogenic aerosol components
In phase I, we performed unconstrained PMF runs using 10 factors for all 8 datasets separately. The resulting 80 factor spectra were subsequently clustered. Maximal data structure (silhouette 0.56) was achieved at mass scaling sm = 2.12 for 17 clusters (for details on silhouette analysis, see supplementary material, Sect. S.2 ). The eight clusters with largest population for the 5 phase I solution are shown in Figure 3, and the rest in Section 3.4, where outlier observations are further discussed. Generally, the solutions agreed closely on the largest clusters, lending credibility to the robustness of the approach. The solutions differed mainly regarding outlier classification, which is of secondary importance for our r-CMB model, since outliers are discarded from the model.  Unsurprisingly, the classification returns two large clusters of organic resembling the ubiquitous low-volatile oxidised organic aerosols (#1; "LV-OOA) and semi-volatile oxidized organic aerosol ("SV-OOA"; e.g. Aiken et al., 2007;Zhang et al., 2011). Comparing to library spectra, the m/z 44 Th (CO2 + ) dominated aerosol type (#1) best matches with LV-OOA and OOA-I (Oxidised Organic Aerosol; a historical label corresponding to LV-OOA; (Aiken et al., 2008;Zhang et al., 2011) spectra from Paris (rs 2 = 0.97; Crippa et al., 2013), Zurich (0.96; Lanz et al., 2007a;Crippa et al., 2013)  The solution also contains a large cluster (#2) with spectra dominated by ammonium and sulphate ion species. This is in agreement with ammonium sulphate being a main component of ambient aerosols. Although it contains also trace amounts of other species, we name the (NH4)2SO4 -dominated aerosol class (#2) ammonium sulphate ("AS") for brevity. 15 The main nitrate-containing spectra are divided into two clusters (#6 and #8). The divisive feature seems to be the ratio of m/z 46 to 30 Th signals (i.e. Rmeasured in Equation 7), which is much higher in cluster type #8 (0.44 ± 0.11) versus for #6 (0.08 ± 0.07; P-III values; see S.I. Sect S.5 for error estimate). Based on literature we interpret the split to correspond to the division between nitrogen in form of inorganic (ammonium) nitrate (AN), and organic nitrogen, matching with previous AMS observations (Hao et al., 2014;Farmer et al., 2010;Kiendler-Scharr et al., 2016). The interpretation of cluster #8 as AN is 20 additionally corroborated by its similarity to spectra from pure ammonium nitrate calibration for the instrument, available in closely matches, among others, HOA reported by Zhang et al. (Zhang et al., 2005) for Pittsburgh (rs 2 = 0.91) and the average of de-convolved 15 HOA spectra reported by Ng et al (2010; rs 2 = 0.89). The spectra also exhibits high similarity with traffic emission spectra of diesel bus exhaust (0.86), lubricating oil (0.82) and fuel (0.75), reported by Canagaratna et al. (2004).
Cluster (#5) features high signals for ions typical of biomass burning organic aerosol ("BBOA"; e.g. Alfarra et al., 2007) and cooking organic aerosol ("COA"; e.g. Mohr et al., 2012). The spectra features the marker signals of levoglucosan (Cubison et 5 al., 2011;Schneider et al., 2006)  The differentiation between HOA versus BBOA or COA can often be resolved even from unit resolution spectra, using the f55 to f57 ratio (Mohr et al., 2012), and the differences in mass spectral fingerprints higher up on the m/z axis (resolvable using mass scaling; Äijälä et al., 2017). However, the distinction between COA and BBOA aerosol types is much more delicate due to very high UMR spectral similarity also for higher m/z variables, (e.g. rs 2 = 0.79 for COA and BBOA reported by Mohr et 15 al., (2012). The main difference between the COA and BBOA aerosol types is the absolute level signals from levoglocosan fragments, the quantitative interpretation of which is difficult due to (1) levoglucosan production being determined by combustion temperature (Shafizadeh, 1984), (2) levoglucosan originating both from BBOA and COA (Mohr et al., 2012) and (3) levoglucosan sinks may be considerable in the atmosphere (Hoffmann et al., 2009), which affects especially transported aerosol. Due to the remote location of the measurement site and general prevalence of BBOA over COA in urban aerosol 20 loadings (e.g. Daellenbach et al., 2017) we conclude BBOA is more likely the dominant component for this aerosol type, so we will use the class label "BBOA" for brevity. Due to high spectral similarity, we find it extremely likely any COA contribution would be apportioned to this class, but without the benefit of high mass resolution data, the convolution seems insolvable at this time. al. (2017), but to our knowledge no precedent exists for mixed or inorganic aerosols. Generally, the more 'unique' the spectra of a group and the higher within-cluster-cohesion, the higher the silhouette.

Phase II: classification of biogenic, secondary organic aerosols
In the second phase of our analysis, ME-2 factorisations were run for ten factors for all the data sets. We constrained 4 out of the 10 factors with the anchors and variabilities for anthropogenic aerosol types, derived from previous phase (AS, AN, HOA, 5 BBOA). The resulting 80 factor profiles were again extracted and classified. The classification solutions featured generally higher silhouette value than in the first phase, which is at least partly explained by constrained spectra being forced to conform to their set limits. The highest total silhouette (0.66) was obtained for 15 clusters (at sm = 2.41). Again, the inter-solution variability for the solutions inspected was low for the main classes. The phase II solution is available in S.I ( Figure S.4).
Overall, the solution very closely resembles the result from phase I (Figure 3). 10 The expected LV-OOA (#1; n = 14; silh 0.64) and SV-OOA (#3; n = 9; silh 0.44) aerosol types again rank among the most typical classifications. Their moderate silhouettes reflect higher variability within these classes, corresponding to results from earlier studies (e.g. Canonaco et al., 2015), and/or closer proximity to neighbouring aerosol types, than for the AN, AS or HOA types. The result may suggest seasonal or other dataset-specific variability for SOA, which supports partitioning the data on a per-campaign basis. In accordance with typical AMS organic aerosol classification conventions laid out by e.g. Aiken et al. 15 (2008), we opt for two classes of oxidised aerosols. We thus select clusters #1 and #3 (P-II) to represent LV-OOA and SV-OOA (Aiken et al., 2008; respectively. For P-III of our analysis, we additionally fix the organic nitrogen class, (ON, P-II cluster #8). Irrespective of the exact chemical composition and label of this aerosol component, we assess there is enough literature support (among others Kiendler-Scharr et al., 2016;Farmer et al., 2010;Drewnick et al., 2015;Murphy et al., 2007;Hao et al., 2014) for inclusion of nitrogen 20 containing aerosol types other than AN, to warrant the inclusion of this class. In any case, the classification of nitrate signal at m/z 30 Th to a distinct class seems statistically robust, as exhibited by its emergence as a free factor in both P-I and P-II solutions. Due to the importance of nitrogen containing species in SOA composition and formation (e.g. (Kiendler-Scharr et al., 2016;Berkemeier et al., 2016) we find it an important aerosol class to include, examine and further interpret. The mixed cluster #7 also emerges for 4 data sets, but with notably low silhouette (0.18), suggestive of low within-cluster cohesion. As 25 we still lack a distinct chemical interpretation for this class, beyond the hypothesis of incomplete resolution of aged aerosol species in factorisation, we will not include the mixed class (#7) in our final receptor model.

Phase III: Final r-CMB receptor model
In the final phase (P-III) of constructing our r-CMB receptor model, we used 7 factors which were all constrained with the profiles and allowed variabilities from the previous phase (P-II, AS, LV-OOA, SV-OOA, BBOA, ON, HOA, AN). The ME-30 2 algorithm was tasked to resolve the factors' temporal behaviour. To derive final characteristic spectra for the model components, as well as to study the variability of spectra in the solutions, we once more applied the same clustering procedure and silhouette analysis as for previous phases. The maximal structure (silh 0.85) was achieved for the seven cluster solution (sm = 1.81), which was to be expected considering ME-2 was run with 7 rather strictly constrained factors in this phase. With silhouette weighting applied, we obtain the final spectra and variabilities ( Figure 4). We note this final clustering and weighting step mainly serves to provide an estimate of variability within each 5 aerosol type, but also yields final spectra to be used as library references for the outcome of this work. Details of the solution of the r-CMB model are discussed in following sections, from the perspective of mass attribution (Section 3.2.1) and spectral characteristics (Section 3.2.2).

Mass attribution and "default" AMS chemical speciation for r-CMB components
Tabulation of final explained variations (EV; Paatero, 2000;Canonaco et al., 2013) for the r-CMB model are shown in Table   2. The seven-component r-CMB model, explains 83 ± 8 % of the variation in loadings, when variation from low SNR variables 15 is included, and 97 ± 3 % when only residuals of variables with SNR > 2 are considered. The components with lowest loadings Model results for campaign VIII, especially regarding BBOA, are very different from other data sets, including the other cold season results available in e.g. set III. Upon closer examination, we attribute the VIII anomaly at least partly to pronounced surface ionisation effects, discussed more in Section 3.4. While we consider the r-CMB results for campaign VIII too unreliable 5 for use in models or further studies, we decided not to omit data set VIII, since other AMS data is likely also affected by the same processes, albeit to a lesser degree. The attribution of anomalies to exact processes is very difficult, and surface ionisation effects remain hard to quantify. We hope that reporting our results in full also furthers the discussion on surface ionisation in the AMS, and potentially helps other AMS users observing similar observations. The composition of our r-CMB components is shown in Figure 5

Spectral characteristics of organic components
As discussed above, despite the mixing observed, the inorganic aerosol classes generally seem separate from organic aerosols.
The scaled correlation values between inorganic and organic spectra are extremely low (S.I Sect S.8, Tables S.1 and S.2), indicating near-zero similarity and clear-cut separation between the inorganic and organic aerosol types by the clustering 15 algorithm. For inter-correlations between the organics-dominated aerosol classes, the picture is somewhat more complex.
To understand the drivers for the separation of the organic aerosol types, we visualised the phase I (unconstrained PMF) and phase III (r-CMB) classification results with a projection of the clustering solutions onto a plane defined by an axis corresponding to estimated oxidation level and another connected to source type (P-III in f55:f57 ratio is typically used for differentiation between HOA and COA/BBOA (Mohr et al., 2012), but equally seems to set apart the biogenic SOA types from the anthropogenic aerosols (Äijälä et al., 2017). This is due to the low signal of m/z 57 Th, a typical anthropogenic spectral marker, originating from C4H9 + and C3H5O + compounds (Mohr et al., 2012;Zhang et al., 2005). The LV-OOA aerosol type, characterised by the dominant m/z 44 and 28 Th signals is usually considered a highly oxidised aerosol type that results from oxidation of SV-OOA and various fresh emissions (among others Canonaco et al., 2015)). The f55:f57 ratio of LV-OOA is considerably lower than for SV-OOA in both solutions, indicating inclusion of other sources beyond the f57-poor biogenic SOA contribution. SV-OOA, on the other hand, has the highest f55:f57 ratio of the classes, 15 hinting to the predominantly biogenic origin of the SV-OOA at the site. The difference is further amplified for phase II and III solutions compared to the unconstrained PMF. We hypothesise this change can result from improved differentiation between SV-OOA and the BBOA species (in P-II), as these aerosol types may be difficult to separate initially due to similar oxidation level and features of the spectra (rs 2 = 0.34; Table S We also projected the P-I and P-III solutions to the (f44, f43) plane (P-III in Figure 7; P-I in S.I, Fig S.6), to produce a result comparable to the triangle plot by Ng et al. (2010). The result indicates a clear separation between the low and semi volatile aerosol types, as well as the primary combustion aerosols (HOA, BBOA), and the spectral shifts from phase 1 "bulk PMF" results to the final r-CMB model.
As stated in Section 3.1., the spectra of BBOA and HOA aerosol types match the previously published observations. The HOA 5 spectrum is characterised by the ion series CnH2n+1 (m/z 29,43,57,71,85,99 Th etc.) and 55,69,83,97 Th etc.) resulting from alkanes and aromatics from traffic emissions (diesel exhaust, lubricating oil; Chirico et al., 2010;Mohr et al., 2009;Canagaratna et al., 2004). The biomass burning organic aerosol levoglucosan marker signals at m/z 60 (C2H4O2 + ) and 73 Th (C3H5O2 + ) (Cubison et al., 2011;Schneider et al., 2006;Elsasser et al., 2012) are clearly identifiable in the BBOA spectra ( Figure 3; Figure 4), and set this class apart from HOA and SV-OOA with some similar features. The contribution of 10 often biogenic signal at m/z 53 Th is also lower for BBOA than for the biogenic, semi-volatile SOA. The pronounced signal from aromatic ring (tropyllium cation C7H7 + ) at m/z 91 Th is a typical result of fragmentation of aromatic hydrocarbon compounds (Lindon et al., 2016). As stated previously, we presume the BBOA class also encompasses any COA contributions, which are likely unresolvable as a separate class due to high spectral similarity (0.79; Sect 3.1.1).
In terms of spectral characteristics, the organic contributions of AS and AN classes fall somewhere between the distinct organic 15 classes and offer little in terms of significant organic markers. Notably, the organics in the ON class exhibit some of the characteristics of LV-OOA and feature generally high f44. This may indicate high degree of oxidation of the organics for this aerosol type (Aiken et al., 2008). However, alternative plausible interpretations exist: AMS response from oxidation products of amine compounds and amine-nitrate salts feature similarly high f44 (Murphy et al., 2007) as does a typical amine fragment ion C2H6N + (McLafferty and Turecek, 1993). Furthermore, as discussed in Section 3.3.2, an equally plausible explanation 20 would be inorganic nitrate salts such as KNO3 contributing to this class in form of the Pieber et al (2016) thermal decomposition artefact. The contribution of m/z 55 and 57 Th signals to the ON species are both low and the ratio 1.37 of f55:f57 is much lower than for the biogenic aerosol species. Without more detailed analysis, and due to the uncertainties surrounding the origins of this aerosol type (Section 3.3.2), it is difficult to say with any certainty if this is due to anthropogenic nature of this aerosol, or e.g. due to fragmentation pattern of characteristic organic compounds in this aerosol type. 25

Comparison with "traditional" ME-2 analysis for aerosol organic component
In order to evaluate the performance of the source apportionment approach presented in this study for organic aerosol, we compare our results to results only relying on the organic mass spectral fingerprints. Specifically, two data sets covered in this study (data sets II and III; Error! Reference source not found.) were also included in the Crippa et al. (2014) analysis, which 30 allows us to compared factorisation results directly. We chose to compare the Crippa et al. results to ours from data set II. We note that while there are minor differences in the pre-processing and corrections for data covered in Crippa et al (2014), the Atmos. Chem. Phys. Discuss., https://doi.org /10.5194/acp-2018-1069 Manuscript under review for journal Atmos. Chem. Phys. Discussion started: 19 October 2018 c Author(s) 2018. CC BY 4.0 License. factorisation input is very similar in both cases. The ME2 model used by Crippa and co-workers included only the organic spectra and apportioned its mass to four factors: LV-OOA, SV-OOA, BBOA and HOA. The latter two components were constrained using a HOA profile from an urban aerosol study in Paris  and an average BBOA of those extracted for Mexico City, Mexico, and Houston, U.S . The allowed variability around these anchors for all variables (m/z), were 5 % (HOA) and 30 % (BBOA). 5 We compared the solutions for Crippa et al. factorisation to our r-CMB model solution data set II, both for loadings ( Figure 8) and profiles (Figure 9). Generally the solutions correlated highlythe loadings (F) and profiles (G) for LV-OOA (F: r 2 =0.92; G: rs 2 = 0.96) and SV-OOA (F: 0.94; G: 0.99) agreed the closest, whilst the HOA also had high similarities (F: 0.85; G: 88).
The BBOA factor / component correlated markedly less (F: 0.63; G: 0.42), which we hypothesise to be due to differences in the anchors used, COA likely attributed to this class, high spectral similarity between SV-OOA and BBOA and the generally 10 low loadings of BBOA observed at SMEAR II.
The discrepancy in distribution of absolute mass for the LV-OOA and SV-OOA components, indicated by the sub-unity slope, suggests the r-CMB model attributes a part of the organic mass from the SOA factors into BBOA, AS, AN and ON components, while HOA is represented rather identically in both models. A difference in mass distribution between the results is to be    and total signal subsequently re-normalised to unity. Spectra similarity is evaluated using Pearson's squared correlation coefficients; unscaled (r 2 ) and with mass scaling (rs 2 ).

Comparison of inorganic salt and organic nitrogen results with reference methods
To evaluate the inorganic mass apportionment result, we compared the loadings from the r-CMB solution against the result from the inorganics apportionment scheme (Section 2.4.1). The comparison, again performed for data set II, is presented in 10 The loadings for the (r-CMB) AS component compare well with the combined NH4HSO4 + (NH4)2SO4 + H2SO4 loading, indicating ammonium(bi)sulphate is described similarly by both models (r 2 =0.92). We assume the r-CMB AS component to comprise of both NH4HSO4 and (NH4)2SO4, which would very likely be classified together in PMF / clustering due to their 15 high spectral similarity. For ammonium nitrate the correlation between loadings is very low (r 2 =0.16). Looking at the time series, the reason seems to be that the speciation scheme-based model often predicts total absence of AN, due to high amount of sulphate in aerosol. While the r-CMB model also generally estimates loadings to be low, they are clearly non-zero in the r-Atmos. Chem. Phys. Discuss., https://doi.org/10.5194/acp-2018-1069 Manuscript under review for journal Atmos. Chem. Phys. Discussion started: 19 October 2018 c Author(s) 2018. CC BY 4.0 License. CMB model. We take the result to reflect the assumption of complete and instantaneous, internal and external mixing of aerosol in the speciation scheme (Section 2.4.1).
The loading prediction for organic nitrogen by the speciation scheme model is similarly event-driven and the model results do not correlate. This is caused by the nitrate assignment to organonitrate class when not explained by NH4NO3. Same can be said for the excess NH4 class, which corresponds to the NH4 species in the other, mostly organic r-CMB factors, principally the 5 LV-OOA; the ion balance scheme predicts zero concentration for many of the data points, an estimate not matching with the r-CMB-based result.
On these differences between the models, we note that the ion balance-based apportionment scheme is sensitive towards small changes in NH4 concentrations, especially for data with generally low NH4 concentrations, such as ours. A simple sensitivity estimate, available in supporting material (Sect. S.4) was performed for data set III. The result indicates that a 33 % change in 10 RIENH4 changes the component mass concentrations on average 5% for AS; 56 % for AN, 66 % for orgNO3 and 164 % for excess_NH4 components. On the other hand, the r-CMB model is rather insensitive to error in RIE estimate, since (1) the spectra in factorisation and clustering have the variables' signals in "NO3 equivalent mass concentration" units, which is not (yet) corrected for RIE of different species, (2) mass scaling causes low mass signals such as NH4 fragments (m/z 15 to 17 Th) to weight less (relative to higher m/z variables) for determining the solution, and (3) NH4 seem not to be an unique marker of 15 any of the classes. We therefore suggest a factorisation-based model such as the r-CMB model presented here is much more robust for resolving speciation of inorganic aerosol components. The sensitivity test (S.I, Sect S.4) also indicates that the temporal differences between the ion balance scheme and r-CMB are not explained by a difference in RIENH4. Thus, the reasons for the discrepancies are more likely related to the unrealistic assumptions of the inorganics apportionment model. . Chem. Phys. Discuss., https://doi.org/10.5194/acp-2018-1069 Manuscript under review for journal Atmos. Chem. Phys. In addition to deriving organic nitrogen mass from the ion balance scheme, we compared the r-CMB-derived ON loading with 5 the Kiendler-Scharr method for estimating the orgNO3 mass loading (Equation 6). The comparison, shown in Figure 11, indicates that the two methods produce a very similar result for organic nitrogen mass (r 2 = 0.94). The discrepancy in absolute i) the primarily daytime reactions of organic peroxy radicals with NO (Orlando and Tyndall, 2012), and ii) the NO3-radical 5 initiated oxidation of unsaturated compounds during night-time (Peräkylä et al., 2014). While the nitrate functionality in all these reactions are identical, the organic part can be vastly different, as peroxy radicals are formed in almost all atmospheric oxidation reactions, irrespective of oxidant (e.g. OH or ozone) or VOC (biogenic or anthropogenic). Therefore, it is not to be expected that a specific organic spectrum should be linked to the organic nitrate functionality. Secondly, as described by e.g. Lee et al. (2016), the particle phase lifetime of organonitrates is of the order of hours with respect to hydrolysis. This reaction 10 will convert the nitrate functionality to nitric acid, while the organic part remains intact, except for the conversion of the -ONO2 group to -OH. This conversion will only have a small impact on the volatility of the organic molecule (e.g. Kroll and Seinfeld, 2008), while the nitric acid may well evaporate in the fairly low-ammonia boreal forest environment. Taken together, the diverse formation pathways as well as the atmospheric processing are likely to cause ON spectra retrieved from ambient air factorisations to look different from e.g. freshly formed organic aerosol from organonitrate standards, such as those used 15 by Farmer et al. (2010). We therefore avoid putting too much emphasis on the organic parts observed in our ON factor.

Outlier observations
During the course of our analysis we encountered some anomalous observations likely stemming from surface ionisation effects, i.e. molecules being thermally ionised at the heater surface rather than at the ionisation region by electron impact. A thorough review and discussion on AMS related surface ionisation effects was recently published by Drewnick and colleagues 20 (2015). Drewnick et al. emphasise that the division between refractory and non-refractory aerosol is not binary, and there exist a number of semi-refractory compounds, that the AMS can measure, albeit non-quantitatively.
Our observations on extracted "outlier" PMF factors from the different phases of analysis match well with the finding and calculations of Drewnick et al (2015), as well as other similar AMS observations published. In Figure 12, we present the outlier clusters from phase I classification solution that were excluded from further analysis due to low number of occurrences or/and 25 questionable interpretability. The emergence of most of these spectra are likely attributable to over-resolution or questionable separation of main PMF factors, due to setting the number of PMF factors to 10. Despite their questionable value for the main analysis, we find they contain many potentially interesting mass spectral features, and seem not to emerge by chance. Below we will present some hypotheses on their possible interpretation.

Surface ionisation and data correction artefacts
well-behaving signals such as m/z 44 Th or 48 Th. We hypothesise the strong signals at m/z 41 Th observable in many of the outlier spectra (cluster #10, #15, and #17) may be due to insufficient accuracy of the 41 K isotope correction.
A similar data processing/correction artefact is likely seen in cluster #12 with a lone, dominant signal at m/z 29 Th. This massto-charge is a problematic one for lower-resolution AMS data due to contribution of 29 N2 isotopic peak, and location on the slope of the enormous N2 peak at m/z 28 Th. Although the signal at m/z 29 Th is corrected for the (measured) isotope contribution, even a slight mismatch in the correction results in notable error in the estimation of the organic signal fraction at m/z 29 Th. We attribute this problem specifically to the scarce availability of filters for the earliest sets of data.

Alkali metals
The prominent signals at m/z 85 and 87 Th for cluster #10 correspond to Rubidium alkali metal ions, and their respective ratios 5  Haynes, 2014)). Unlike for the potassium signal, the temporal behaviour of the factors corresponding to cluster #10 is highly plume-like. Preliminary analysis of wind direction shows the plume direction to correspond to the arrival direction from the district heating plant (co-located with a sawmill and a pellet factory) at Juupajoki, 5 km due south-east (Supplementary information, Sect S.12). Similar observations of Rubidium from coal burning are 10 previously published by Irei et al. (2014). It seems likely that this aerosol class would originate from the heating plant.

Organic nitrogen and sulphur
As for the signals often attributed to amines at 86 and 100 Th, (Mclafferty, 1959), featured in cluster #11, in absence of alternative explanation for the 100 and 86 signals, we are inclined to believe they actually represent atmospheric amines. The cluster spectrum corresponds also to the spectra of pollution plumes, extracted for data sets I to III in our previous study on 15 pollution events (Äijälä et al., 2017). We note amines are also reported to be prone to surface ionisation, and e.g. trimethylamine is thermally ionised above temperatures 300°C, with high thermal ionisation efficiency at 600°C (50% of the maximum efficiency observed at around 350°C; Rasulev and Zandberg, 1988)). It thus seems plausible surface ionisation effects could contribute to the amine observations as well. In our earlier work (Äijälä et al., 2017), we also attributed similar spectral signal at m/z 58 Th to amines (C3H8N + ). However, in light of the recent results of Drewnick et al (2015) on surface 20 ionisation of NaCl, and the detachment of m/z 58 Th signal from the class of other amine-attributed signals at 86 and 100 Th, another plausible explanation for the m/z 58 Th signal observed in clusters #9, #11, #16 and #17 exists. Namely, we find it plausible such a spectrum would arise from surface ionisation of sodium chloride and thus represent atmospheric NaCl + .
Clusters #13, #15 and 16 are interesting from the viewpoint of organonitrates and sulphates. Nitrate signal in clusters #15 and #16 is composed mostly of m/z 30 Th signal, with negligible m/z 46 Th contribution. With the high organic contribution, this 25 would make these classes potential candidates for containing organonitrates. However, an equally plausible explanation is the surface ionisation of KNO3, discussed previously. The pronounced signals at m/z 80 and/or 81 Th featured in cluster #13, #14 and #17 are likely explained by humidity-induced fragmentation changes in ionisation of sulphate species, (particularly H2SO4 and SO3; Drewnick et al., 2015). We do note organosulphur-containing samples characterised by Farmer at al. (2010) also feature increased ratio of m/z 80 and 81 Th signals compared to (NH4)2SO4, so we can not rule out organic sulphate 30 contribution.

Cycloalkanes
Finally, we wish to draw attention to the ion series of cluster #16, with prominent organic signals at (69), 79, 81, 95, 107 and 109 Th, which have been connected to cycloalkanes (McLafferty and Turecek, 1993;Alfarra et al., 2004). Cycloalkanes are common in e.g. lubricating oils (Liang et al., 2018), which are an important, even dominant, component in traffic emissions (Worton et al., 2014). The closest literature match on ambient observations we found was the study of Takami and colleagues 5 (2007), where they observed similar high concentrations of mass-to-charge 95, 107, 109 Th, as well as 58 and 85 Th, but were unable to attribute the observation to a specific source.

Conclusions
We performed a synthesis analysis on eight AMS data sets from a boreal forest site, and constructed a data-driven chemical mass balance type of receptor model, with relaxed constraints on the component profiles (r-CMB). Notably, the data comprised 10 both inorganic and organic aerosol components. The resulting 7-component model explained 83 ± 8 % of variability in data (96 ± 3 % with low-SNR variables excluded). The model components for the SMEAR II boreal forest site were, in order of average aerosol mass contribution: Ammonium sulphate (35 % mass fraction), LV-OOA (27 %), SV-OOA (12 %), BBOA (11 %) Organic nitrogen (7 %) Ammonium nitrate (5 %) and HOA (3 %).
Remarkably, organic nitrogen seems a larger component than ammonium nitrate for the site. However, ambiguity remains on 15 the interpretation of the organic nitrogen class as organonitrate, prompting caution against casual use of NO2 + :NO + fragmentation ratio as a sole organonitrate proxy. COA was not resolved separately, presumably due to high spectral similarity with BBOA and low mass contribution to SMEAR II aerosol, and is most likely included to the BBOA component. Other, minor aerosol groups that were not included in the model feature characteristics potentially indicative of amine-dominated aerosols, coal combustion aerosol with alkali metals (Rubidium, Cesium), as well as hints of cycloalkanes and organosulphates. 20 We presume many of these observations may arise from by surface ionisation processes, and as such they may not be currently quantifiable in mass. Their corroboration, quantification, and connection to emission sources or thermal ionisation effects require for further study.
We suggest inorganics should be routinely included in factorisation of AMS data due to the high demand of such data in aerosol models. We wish specifically to point out that adding the inorganic information is easy and only requires applying the 25 same tried-and-tested data processing and uses the same error model as for organics. While inclusion of inorganics does diminish the relative weight organics carry in the analysis, and thus may hinder extraction organic factors comprising very low fraction (<5 %) of total mass (Ulbrich et al., 2009), we argue that the added information value of inorganic speciation makes up for this. Compared to organics only analyses, inclusion of inorganic data increases direct usability of AMS data for physicochemical aerosol models. We also demonstrate factorisation-based speciation provide a more realistic and robust, less 30 assumption-dependant and calibration-sensitive, speciation than simplistic ion balance schemes.
The classification methods presented here for evaluating factor analysis output can also have direct use in applications that produce large quantities of discrete aerosol spectral data, such as deriving factorisation error estimates via bootstrapping analysis (Osborne et al., 2014;Brown et al., 2015). With further development, we find it likely a two-step analysis (exploratory factorisation + classification → r-CMB) would be a viable option for increasingly unsupervised and less analyst biased AMS data analysis. 5 We would also encourage further development of combined statistical methods for improved mass spectral feature extraction and parametrisations for mass spectra, as they will enable future machine learning applications for data analysis. Drawing from the comprehensive information available in current size-resolved aerosol mass spectrometric data, it seems likely that advanced machine learning methods (such as data reduction combined with predictive neural networking (e.g. Burns and Whitesides, 1993;Gasteiger and Zupan, 1993) will likely provide new, improved ways to model aerosol physicochemical properties like 10 hygroscopicity, volatility and optics in the near future.

Data availability
Data used in this study is available from the contact author. r-CMB component profiles will be made available in the AMS spectral database (http://cires1.colorado.edu/jimenez-group/AMSsd/) upon publication.

Author contribution 15
Conceptualisation: MÄ, ME formulated the study. Investigation and data curation: MÄ, HJ, ME collected and curated the experimental data. PA*, ES*, JL*, HL* TP* provided technical support for the experiments. Formal analysis, methodology, visualisation: MÄ, supported by KD, designed and performed the statistical analysis and data visualisations. Validation: MÄ and KD reviewed the data quality and reproducibility. Software, methodology: FC designed and supported the SoFi analysis software. DS*, LW* provided support for Squirrel software. Writing: MÄ wrote the original draft, which was reviewed, 20 commented and edited by all the authors. Funding acquisition, resources, project administration, supervision: ME, MK, AP, TP and DW * supported and supervised the research.