Some air pollution datasets contain multiple variables with a
range of measurement units, and combined analysis using positive matrix
factorization (PMF) can be problematic but can offer benefits through the
greater information content. In this work, a novel method is devised and the
source apportionment of a mixed unit dataset (PM

Worldwide, the scientific scope of air quality research is expanding, through both the increasing number of observatories and the refinement of information mined from increasingly sophisticated measurements, often incorporated in campaign work. The number of metrics being measured has grown beyond simple measurements of particulate matter (PM) mass and gas concentrations, and we can now probe the composition of the PM mass and the size distributions with mass spectrometers, mobility analysers and optical devices.

Studies using positive matrix factorization (PMF) as a tool for source apportionment of particle mass using multicomponent chemical analysis data are published frequently using datasets from around the world. However, they do not always provide consistent outcomes (Pant and Harrison, 2012), and one means by which source resolution and identification can be improved is by inclusion of auxiliary data, such as gaseous pollutants (Thimmaiah et al., 2009), particle number count (Masiol et al., 2017) or particle size distribution (Beddows et al., 2015; Ogulei et al., 2006; Leoni et al., 2018).

Harrison et al. (2011) analysed number size
distribution (NSD) data (merged Scanning Mobility Particle Sizer (SMPS) and
Aerodynamic Particle Sizer (APS) data) with PMF using auxiliary data
(meteorology, gas concentration, traffic counts and speed). The study used
particle size distribution data collected at the Marylebone Road supersite
in London in the autumn of 2007 and put forward a 10-factor solution
comprising roadside and background particle source factors. Sowlat et al. (2016)
carried out a similar analysis on number size distribution (13 nm–10

However, while combining, for example, particle chemical composition and size distribution data in a single PMF analysis may assist source resolution, difficulties arise if the two datasets have different and/or ambiguous rotations (discussed in Sect. 2). This tends to result in factors with either mass contributions and small number contributions or number contributions and small mass contributions and rarely a meaningful contribution from both data types. Experimental design can of course circumvent this problem, for instance, by using chemical data which are already size-segregated, measured using a cascade impactor (Contini et al., 2014). Such an approach is attractive because there is no question as to whether both datasets sufficiently overlap across the size bins. However, cascade impactors do not offer the high time resolution of particle counting instruments, with individual measurements lasting hours or days. Even so, where two or more instruments are available in a campaign to measure two or more different metrics, e.g. PM mass and particle number (PN), a combined data analysis is useful. Emami and Hopke (2017) have shown that the effect of adding variables as auxiliary data (with potentially different units) to an NSD dataset is to decrease the rotational ambiguity of a solution from a one-step PMF analysis.

In this study, we present a method for analysing simultaneously collected
PM

Venn diagram showing the summary of the findings of Beddows et
al. (2015), applying PMF to PM

With a population of 8.5 million in 2014 (ONS, 2017), the UK city of London
is the focus of study in this work; the North Kensington (NK) site (lat.

As alluded to, this work is a continuation of the study carried out by
Beddows et al. (2015), which analysed NSD and PM

Accompanying the NSD data from the study of Beddows et al. (2015) was the
PMF output from the analysis of PM^{™} 2500 QAT-UP filters. No cation measurements
were available from these filters, and this resulted in a lower data capture
for the cations. Again, all missing data were replaced using a value
calculated using the method of Polissar et al. (1998). A woodsmoke metric,
CWOD, was also included. This was derived as PM woodsmoke from the
methodology of Sandradewi et al. (2008) utilizing aethalometer and
^{™} 2500
QAT-UP) and analysis using a Sunset Laboratory thermal–optical analyser
according to the QUARTZ protocol (which gives results very similar to EUSAAR
2: Cavalli et al., 2010) (NPL, 2013). We refer to CWOD, EC and OC as
PM

This aforementioned PM

Positive matrix factorization (PMF) is a well-established multivariate data
analysis method used in the field of aerosol science. PMF can be described
as a least-squares formulation of factor analysis developed by Paatero
(Paatero and Tapper, 1994). It assumes that the ambient aerosol
concentration

Using the data and uncertainty matrices for the model, Eq. (1) is
optimized in the PMF algorithm by minimizing the

PMF is a weighted technique, and the value of

To assess the PMF model, the

With regard to the final output from PMF, a scaling has to be applied in
order to achieve quantitative results. This is done by scaling either

Given a PMF input data matrix

The resulting value for the PM

If the analysis of

In the ideal case, if the individually computed factors for both datasets
result in

The method proposed in this work separates the analysis of the two datasets

Positive matrix factorization was carried out in this work using the DOS-based executable file PMF2 v4.2 compiled by Pentti Paatero and released on 11 February 2010 (available from Dr. Pentti Paatero). This is used by the author in preference to a GUI version of PMF (e.g. US EPA PMF 5.0; Norris et al., 2014) because of the ease with which it can be incorporated into a CRAN R procedure script using shell commands, thus facilitating automation of the analysis and any optimization. An R script can be written to manipulate and organize input data for PMF2, run PMF2, collect the output and present it for consideration as text, table or plot. The main strength of this approach is that it improves the repeatability of the method and its transfer between practitioners within our group.
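The write-run-collect pattern described above can be illustrated with a minimal sketch. It is written in Python rather than R purely for illustration and is not the authors' script. The file names `input.txt` and `output.txt`, the function names and the stand-in command are all hypothetical; they do not reflect the files or arguments that PMF2 actually uses, which are not described here.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def write_matrix(path, rows):
    """Write a numeric matrix as whitespace-separated text, one row per line."""
    with open(path, "w") as handle:
        for row in rows:
            handle.write(" ".join(f"{value:.6g}" for value in row) + "\n")


def read_matrix(path):
    """Read a whitespace-separated numeric matrix back into lists of floats."""
    with open(path) as handle:
        return [[float(tok) for tok in line.split()]
                for line in handle if line.strip()]


def run_step(command, data, workdir):
    """Write the input, run an external program in workdir, collect its output.

    `command` is any argument list. The file names are hypothetical and are
    not the files that PMF2 itself reads or writes.
    """
    workdir = Path(workdir)
    write_matrix(workdir / "input.txt", data)
    subprocess.run(command, cwd=workdir, check=True)
    return read_matrix(workdir / "output.txt")


# Stand-in for the external executable: a one-line program that copies
# input.txt to output.txt, so the sketch runs without PMF2 installed.
STAND_IN = [sys.executable, "-c",
            "open('output.txt', 'w').write(open('input.txt').read())"]

with tempfile.TemporaryDirectory() as tmp:
    result = run_step(STAND_IN, [[1.0, 2.5], [3.0, 4.0]], tmp)
```

Keeping the external call behind a single function means that the same script can loop over numbers of factors or other settings without manual intervention, which is the repeatability benefit noted above.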

The two-step method is shown schematically in Fig. 2. Matrix

Flow diagram showing the flow of data through the two-step PMF–PMF
analysis. The PMF analyses of a single dataset

To extend the analysis from six factors to seven factors, an extra row was added to

As an alternative to using PMF in the second step, a regression was carried
out. Each column of data for each of the 51 size bins

If it is assumed that the factors derived from the daily NSD data are the
same as those present in the hourly data, i.e. the factors are conserved
when averaging the data from hourly to daily data before PMF analysis, then
daily NSD profiles can be fitted to the hourly NSD spectra to recover a
diurnal cycle for the factors. However, it is worth noting that the process
of aggregating hourly data to daily NSD data may cause loss of information,
implying that minor factors (e.g. due to event episodes) might well be
averaged out of the data. For the elements of the

The R non-linear minimization function nlm (R Core Team,
2018) was used to minimize the value of

The resulting scaled-fitted values were then used to calculate the PN
concentration for each of the regression source profiles (Eq. 7), allowing subsequent plotting of the seven diurnal cycles.
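As an illustration of this fitting step, the sketch below fits fixed source profiles to a single hourly spectrum and then sums each scaled profile over the size bins. It is a sketch under stated assumptions and not the authors' procedure: it swaps the nlm minimizer for simple multiplicative updates, it imposes non-negative scaling factors, and it assumes that each profile holds number concentrations per size bin (profiles expressed as dN/dlogDp would first need multiplying by the bin width). It does not reproduce Eq. (7). All function names and numbers are hypothetical.

```python
def fit_nonnegative(profiles, spectrum, iterations=500):
    """Non-negative least-squares fit of one spectrum as a sum of profiles.

    profiles: list of p profiles, each a list of m non-negative bin values.
    spectrum: list of m non-negative bin values for one hour.
    Returns p non-negative scaling factors (multiplicative updates).
    """
    p, m = len(profiles), len(spectrum)
    # These two products stay fixed while the scaling factors are updated.
    fx = [sum(profiles[k][j] * spectrum[j] for j in range(m)) for k in range(p)]
    fft = [[sum(profiles[k][j] * profiles[l][j] for j in range(m))
            for l in range(p)] for k in range(p)]
    g = [1.0] * p
    for _ in range(iterations):
        denom = [sum(fft[k][l] * g[l] for l in range(p)) for k in range(p)]
        g = [g[k] * fx[k] / denom[k] if denom[k] > 0 else 0.0 for k in range(p)]
    return g


def factor_number_concentration(profiles, scales):
    """Sum each scaled profile over the size bins (assumes counts per bin)."""
    return [scales[k] * sum(profiles[k]) for k in range(len(profiles))]


def diurnal_cycle(hours, values):
    """Mean value for each hour of day present in the data."""
    sums, counts = {}, {}
    for hour, value in zip(hours, values):
        sums[hour] = sums.get(hour, 0.0) + value
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: sums[hour] / counts[hour] for hour in sorted(sums)}


profiles = [[4.0, 2.0, 1.0, 0.0],   # hypothetical factor 1 (four size bins)
            [0.0, 1.0, 3.0, 2.0]]   # hypothetical factor 2
spectrum = [8.0, 7.0, 11.0, 6.0]    # equals 2 * factor 1 + 3 * factor 2
scales = fit_nonnegative(profiles, spectrum)
pn = factor_number_concentration(profiles, scales)
cycle = diurnal_cycle([0, 1, 0, 1], [1.0, 3.0, 3.0, 5.0])
```

Applying `fit_nonnegative` to every hourly spectrum gives a time series of scaling factors per factor, and `diurnal_cycle` then averages the resulting PN concentrations by hour of day.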

Identification of the sources responsible for the factors outputted from PMF
can be assisted by meteorological data. Time series of the

The aim of this work has been to show how a given PMF result can be
complemented with concurrently measured auxiliary data. We exemplify this
using PM

Furthermore, because of the nature of any factor analysis, we also have to make the assumption that each source chemical profile and size distribution not only remains unchanged between source and receptor but also remains constant throughout the measurement campaign. This of course limits our capacity to fully understand the aerosol within the atmosphere we are considering. Chemical reactions during the transit of the air masses will of course modify the chemical composition. It might be assumed that a fully aged aerosol remains unchanged and is identified as a background component, but, for example, we would expect progressive chlorine depletion within a fresh marine aerosol passing over a city. Likewise, we also have to appreciate that different particle sizes will have different atmospheric transit efficiencies, with large particles settling out of the air mass before smaller ones. Similarly, particles nucleate and grow from 1 nm up to 20–30 nm over a short period of time. It is these finer details which are missed when making an overall assessment of the chemical and physical composition of an air mass measured over a long (e.g. 2-year) dataset using PMF.

Figure 3 presents the profiles

Source profiles

The time series

When inspecting Fig. 3 it is notable that the source profiles are
surprisingly similar to those calculated for the NSD-only and
PM

Figure S2 shows the results of the linear regression of the NSD data and the PM

This PMF–LR analysis was carried out using daily averaged data, and to obtain
hourly information – and thus obtain the diurnal patterns (Fig. S2) – the
resulting regression source profiles were refitted to the original NSD
data. On inspection of these source profiles and diurnal plots, the negative
values make interpretation difficult, reinforcing the importance of non-negativity, one of the four conditions
(Hopke, 1991) that the analysis must satisfy if it is to make physical sense. We
can, however, fit non-negative gradients using non-negative regression.
The surprising consequence of applying this constraint is that the
same profiles are derived, but they are clipped so that all negative values
are replaced by zero values – hence, information is lost. One
interpretation of the negative values is that these are particle sinks, but
this contradicts the PMF–PMF findings, and hence it is concluded that the
PMF–LR analysis only serves as an indication of how the PM

Returning to the PMF–PMF analysis and extending the analysis from six factors
to seven factors, an extra row in the

The same six-factor solution is obtained with the additional seventh factor (Figs. 4 and S3), and, as expected, this seventh factor was a nucleation factor. It was suspected that in the six-factor solution, the nucleation factor was combined with the fuel oil factor. This does not suggest any link between the nucleation and fuel oil factor other than that there was an insufficient number of factors within the model for the two to factorize out of the data, giving the fuel oil NSD profile a more reasonable modal peak between 50 and 60 nm rather than 20, 30 and 60 nm.

Nucleation and fuel oil
factors derived when extending the second
PMF analysis from the six factors (shown in Fig. 3) to seven factors. Source
profiles

Beddows et al. (2015) applied a one-step analysis to three different
datasets: PM

Another interesting observation is that although only four factors were derived
from the PMF analysis of NSD-alone (diffuse urban; secondary; traffic and
nucleation), when extra information is included from the PMF analysis of the
PM

The original PMF was carried out on daily PM

Diurnal cycles of the derived

The hourly contributions are aggregated into daily values and plotted as bivariate plots in Fig. 5 to assist comparison with the daily plots in Beddows et al. (2015). In that work, the same PMF analysis of the NSD data yielded four factors which are named identically to those in the bivariate plots. The similarity of both of the polar and annular plots for each of the six factors supports our previous factor identification. The secondary and diffuse urban factors are background sources with strongest contributions in the evening and morning. Traffic is strongest for all wind speeds from the east, which makes sense since North Kensington is to the west of the city centre of London where traffic is expected to be most dense. Nucleation is also seen to be strongest for wind from the west, which is expected to be cleaner and have a lower condensation sink. NET and crustal and fuel oil factors are similar to the diffuse urban factor, suggesting a similar predominant source location in the centre of London. The marine factor is observed to be strongest for elevated wind speeds for all wind directions, which is consistent with the expected strong contribution for all high wind speeds from the south-west, as observed in the daily polar plots in Beddows et al. (2015).
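The aggregation of hourly contributions into daily values mentioned at the start of this paragraph can be sketched as follows. This is a minimal illustration, assuming a simple arithmetic mean over whatever hours are available for each calendar day. Any data-capture threshold that may have been applied is not stated here and is not modelled, and the timestamps and values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_means(timestamps, values):
    """Average hourly values into one mean per calendar day."""
    buckets = defaultdict(list)
    for stamp, value in zip(timestamps, values):
        buckets[stamp.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Hypothetical hourly contributions of one factor over two days.
hourly_times = [datetime(2012, 1, 1, 0), datetime(2012, 1, 1, 12),
                datetime(2012, 1, 2, 0), datetime(2012, 1, 2, 12)]
hourly_contrib = [10.0, 30.0, 20.0, 40.0]
daily = daily_means(hourly_times, hourly_contrib)
```

The daily means can then be paired with daily wind speed and direction to build the bivariate plots.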

The nucleation factor was extracted from the two-step PMF–PMF analysis, which
included pulling the

Ideally, the chemical data would be limited to the composition of the
particles in the same size range as the SMPS data. However, since we are
using the PM

The nucleation factor, with its peak

A two-step PMF analysis method is presented, whereby existing PMF profiles
can be extended to incorporate auxiliary data concurrently measured and with different units. This is exemplified using PM

When analysing PM

Hence, the process starts with a dataset which produces a solution which is
sensitive to mass, but the factors more sensitive to number can be accessed
using a second step. Furthermore, by exploring a higher number of factors,
NSD factors which are insensitive to PM

Data supporting this publication are openly available from the UBIRA eData
repository at

The supplement related to this article is available online at:

DCSB conceived the two-step method, ran the PMF and other data analyses and wrote the first draft of the paper. RMH provided constructive criticism, contributed to the data interpretation and wrote sections of the final paper.

The authors declare that they have no conflict of interest.

The collection of data used for this paper was funded by the Natural Environment Research Council Clean Air for London (ClearFlo) Programme through grant number NE/H003142/1. The contributions of Gary Fuller and David Green (King's College, London) are gratefully acknowledged. The National Centre for Atmospheric Science is funded by the UK Natural Environment Research Council (grant number R8/H12/83/011). Figures were produced using CRAN R and Openair (R Core Team, 2016; Carslaw and Ropkins, 2012).

The authors are grateful to three anonymous reviewers and to Pentti Paatero for their detailed and constructive comments on the first version of this paper.

Edited by: Ari Laaksonen
Reviewed by: Pentti Paatero and three anonymous referees