Robust observational constraint of uncertain aerosol processes and emissions in a climate model and the effect on aerosol radiative forcing

Abstract. The effect of observational constraint on the ranges of
uncertain physical and chemical process parameters was explored in a global
aerosol–climate model. The study uses 1 million variants of the Hadley Centre General Environment Model version 3
(HadGEM3) that sample 26 sources of uncertainty, together with over 9000
monthly aggregated grid-box measurements of aerosol optical depth, PM2.5,
particle number concentrations, sulfate and organic mass concentrations.
Despite many compensating effects in the model, the procedure constrains the
probability distributions of parameters related to secondary organic
aerosol, anthropogenic SO2 emissions, residential emissions, sea spray
emissions, dry deposition rates of SO2 and aerosols, new particle
formation, cloud droplet pH and the diameter of primary combustion
particles. Observational constraint rules out nearly 98 % of the model
variants. On constraint, the ±1σ (standard deviation) range
of global annual mean direct radiative forcing (RFari) is reduced by
33 % to −0.14 to −0.26 W m−2, and the 95 % credible interval (CI)
is reduced by 34 % to −0.1 to −0.32 W m−2. For the global annual
mean aerosol–cloud radiative forcing, RFaci, the ±1σ
range is reduced by 7 % to −1.66 to −2.48 W m−2, and the 95 % CI by
6 % to −1.28 to −2.88 W m−2. The tightness of the constraint is
limited by parameter cancellation effects (model equifinality) as well as
the large and poorly defined “representativeness error” associated with
comparing point measurements with a global model. The constraint could also
be narrowed if model structural errors that prevent simultaneous agreement
with different measurement types in multiple locations and seasons could be
improved. For example, constraints using either sulfate or PM2.5
measurements individually result in RFari±1σ ranges
that only just overlap, which shows that emergent constraints based on one
measurement type may be overconfident.


PM2.5 measurements individually result in RFari ±1σ ranges that only just overlap, which shows that emergent constraints based on one measurement type may be over-confident.

Introduction
Global model simulations of aerosols and their climatic effects are very uncertain. Different global aerosol models have large spread in their simulations of aerosol microphysics, radiation and forcing Shindell et 5 al., 2013;Tsigaridis et al., 2014). This multi-model spread can be due to different model structures, missing processes, parameter settings, algorithms or coding errors. Individual climate models are also very uncertain because the values of parameters related to physical processes and emissions are often poorly defined Lee et al., 2011bLee et al., , 2013Regayre et al., 2018). The uncertainty in the aerosol effective radiative forcing (ERF) over the industrial period caused by aerosol processes, physical atmosphere model processes and emissions could be as large as the multi-model spread (Johnson 10 et al., 2018;Regayre et al., 2014Regayre et al., , 2018. There are two main methods to reduce model uncertainty, often called bottom-up and top-down approaches. The bottom-up approach involves improving the representation of model processes and refining estimates of the associated parameter values through experiment and theory. This approach is necessary to improve model fidelity, but it does not provide an estimate of 15 the model uncertainty, and the uncertainty may grow if the increase in model complexity requires a large number of new and poorly defined parameters. To reduce model uncertainty, bottom-up model development needs to be combined with top-down approaches in which numerous uncertain process-related parameters and emissions are adjusted to improve the agreement of models with measurements. 20 The difficulty with top-down model adjustments (in its simplest form, model tuning) is that the uncertainty stems from large combinations of uncertain model input parameters. This means that the adjustment of small sets of parameters to improve model agreement with measurements will not produce robust results . For example, a model simulation of particle concentrations could be improved by adjusting particle formation rates, but many other combinations of parameters related to emissions, chemistry or deposition might be able to achieve similar model skill (Carslaw et al., 2013b). Models that 25 are narrowly tuned in this way can therefore produce a wide range of results when used to make predictions outside the range of conditions under which they were tuned. This is likely to be a cause of the large uncertainty in aerosol radiative forcing, which is a predicted rather than observable quantity.
If other aerosol-climate models are comparable with our own HadGEM-UKCA model, then they contain at least 20 important 30 uncertain parameters related to emissions and processes, although fewer than about 10 parameters will dominate the uncertainty in a particular model variable in any one environment and time of year (Lee et al., 2016;Regayre et al., 2014Regayre et al., , https://doi.org/10.5194/acp-2019 Preprint. Discussion started: 21 November 2019 c Author(s) 2019. CC BY 4.0 License. 2018). Therefore, to define and reduce the model uncertainty it is necessary to find from within 10 dimensions of parameter space all the parameter combinations that produce plausible agreement with different aerosol properties observed across all seasons and global environments. A single well-configured version of a model produced by parameter tuning tells us nothing about the combinations of parameter values that can achieve consistency with measurements within their uncertainty range, nor does it tell us anything about the model output uncertainty . 5 In this paper we address the question: To what extent do extensive and diverse aerosol measurements enable the plausible range of model parameters to be constrained if the full range of their compensating effects is accounted for? By 'constrain' we mean a narrowing of the probability distribution of a parameter (and potentially the absolute range) compared to the uncertainty range that was assumed when the model was built. We also quantify how the identification of observationally plausible 10 parameter ranges feeds through to a reduction in the uncertainty in predictions of aerosol radiative forcing over the industrial period. The study focuses on model constraint using measurements of aerosol properties rather than cloud properties, therefore we emphasise the effect on aerosol-radiation interaction forcing rather than aerosol-cloud interaction.

configuration of the Hadley Centre General Environment
Model version 3 (HadGEM3; Hewitt et al., 2011), which incorporates the UK Chemistry and Aerosol (UKCA) model at version 8.4 of the UK Met Office's Unified Model (UM). UKCA simulates trace gas chemistry and the evolution of the aerosol particle size distribution and chemical composition using the GLObal Model of Aerosol Processes (GLOMAP-mode;Mann et al., 2010) and a whole-atmosphere chemistry scheme (Morgenstern et al., 2009;O'Connor et al., 2014). The model has a 20 horizontal resolution of 1.25x1.875 degrees and 85 vertical levels.
The aerosol size distribution is defined by seven log-normal modes: one soluble nucleation mode as well as soluble and insoluble Aitken, accumulation and coarse modes. The aerosol chemical components are sulphate, sea salt, black carbon (BC), organic carbon (OC) and dust. The model does not include any representation of nitrate aerosols. Secondary organic aerosol 25 (SOA) material is produced from the first stage oxidation products of biogenic monoterpenes under the assumption of zero vapour pressure. SOA is combined with primary particulate organic matter after kinetic condensation.
GLOMAP simulates new particle formation, coagulation, gas-to-particle transfer, cloud processing and deposition of gases and aerosols. The activation of aerosols into cloud droplets is calculated using globally prescribed distributions of sub-grid 30 vertical velocities (West et al., 2014)  precipitation collocation (Boutle et al., 2014;Lebsock et al., 2013). Aerosol water uptake efficiency is determined by kappa-Kohler theory (Petters and Kreidenweis, 2007) using composition-dependent hygroscopicity factors.
Anthropogenic emission scenarios prepared for the Atmospheric Chemistry and Climate Model Inter-comparison Project (ACCMIP) and prescribed in some of the CMIP Phase 5 experiments are used here. Biomass burning emissions for recent 5 decades were prescribed using a ten year average of 2002 to 2011 monthly mean data from the Global Fire and Emissions Database (GFED3; van der Werf et al., 2010) and according to Lamarque et al. (2010) for 1850. Volcanic SO2 emissions are prescribed in the model by combining emissions from the Andres and Kasgnoc (1998) dataset for continuously erupting and sporadically erupting volcanoes and the Halmer et al. (2002) dataset for explosive volcanoes.

10
A full description of the set-up for our model simulations can be found in Yoshioka et al. (2019), which we summarise here.
The base model simulation was subject to a multi-year spin-up period. Parameter perturbations were then applied distinctly to individual ensemble members (which branch from the base model) and spun-up for a further month. We then ran each simulation for a further 12 months to produce the data used here. Horizontal winds and temperatures in the simulations are nudged towards European Centre for Medium-Range Weather Forecasts (ECMWF) ERA-Interim reanalyses for 2008 between 15 approximately 1.2 and 80 km using a 6-hour relaxation timescale. Nudging means that pairs of simulations have identical synoptic-scale features, which enables the effects of perturbations to aerosol and chemical processes to be quantified using single-year simulations, although the magnitude of forcing will vary with the chosen year (Fiedler et al., 2019;Yoshioka et al., 2019).

Creation of perturbed parameter model variants 20
Our method to determine observational constraint on the model parameters and radiative forcings involves producing a very large set of 'model variants', each with a different combination of parameter values, and then ruling out model variants for which a set of model outputs are judged to be implausible against measurements (see section 2.4). The model variants were generated using a perturbed parameter ensemble (PPE) of 235 model simulations of HadGEM3-UKCA (the 'AER PPE' detailed in Yoshioka et al., 2019) that samples 26 sources of uncertainty in the aerosol model (Carslaw et al., 2017;Yoshioka 25 et al., 2019)see Table A1 in Appendix A.
A set of 235 simulations alone is much too small to allow statistical analysis of model performance across 26 dimensions of parameter space. We therefore built Gaussian Process emulators (surrogate models) using the PPE simulations as training data (Lee et al., 2011b), which define how the model outputs vary continuously over the 26-dimensional parameter space and enable dense sampling over parameter uncertainty. Separate emulators were built describing the monthly mean value of each model 30 output in each model grid cell. We then used Monte Carlo sampling from these emulators to produce output for a set of 1 million model variants (parameter input combinations). Uniform distributions were assumed for each parameter in this https://doi.org /10.5194/acp-2019-834 Preprint.  sampling. The emulator is not a perfect representation of a model output, but its uncertainty can be estimated and accounted for in the model-measurement comparison. In the rest of this paper we refer to the emulator-derived values of model outputs at each sampled 26-d input combination as a 'model variant'.
The AER PPE samples only uncertainties in the aerosol component of the model and the radiative forcing does not account for atmospheric and cloud adjustmentsi.e., it is a radiative forcing (RF) rather than an effective radiative forcing, which we 5 analysed in previous papers Regayre et al., 2018). The prior (unconstrained) 95% credible interval of global mean aerosol RF is -2.23 ± 0.94 W m -2 . However, because of the way that multiple parameters compensate (Lee et al., 2016;Regayre et al., 2018), the forcing uncertainty in this PPE is similar to the AER-ATM PPE in which additional physical atmosphere model parameters were perturbed and cloud adjustments accounted for (Yoshioka et al., 2019). Because the AER PPE analysed here samples only aerosol uncertainties, we restrict the constraints to measurements of aerosol properties. In 10 future work we will extend the analysis to radiation, precipitation and cloud measurements that are relevant to the wider range of parameters in the AER-ATM PPE.
The choice of the 26 perturbed parameters and their uncertainty ranges were defined using expert elicitation (Yoshioka et al., 2019). The parameters (Table A1; full descriptions given in Yoshioka et al., 2019) relate to natural and anthropogenic emission fluxes of aerosol precursor gases and primary particles, the properties of primary particles (size), aerosol processes, aerosol 15 hygroscopicity, removal rates and cloud droplet formation (updraft speed). The list of parameters is not exhaustive, but oneat-a-time parameter perturbation tests were used to show that any other parameters have a smaller effect regionally and globally in our model than the set we chose. Finally, we note that the evaluated uncertainty in global annual mean RF in this study differs from that shown in Yoshioka et al. (2019) as we have used uniform parameter distributions when sampling over the parameter uncertainty space, while elicited parameter distributions were used in Yoshioka et al. (2019). Our choice to use 20 uniform distributions here means that the constraint can be fully attributed to the model-measurement comparison.

Measurements
We use aerosol measurements from ground stations, ship campaigns and aircraft campaigns covering the following aerosol properties: aerosol optical depth (AOD), PM2.5 concentrations, sulphate mass concentrations, organic carbon mass concentrations, and number concentrations of particles larger than 3 nm dry diameter (N3) and 50 nm dry diameter (N50)see 25 Appendix B and Table S1 in the supplementary data file. All measurements used are from within the boundary layer, which we define to be at an atmospheric pressure greater than 800hPa. We do not attempt to constrain aerosol properties above the boundary layer.
The measurements were all made at specific locations and times (i.e. they are 'point measurements') in the period from October an error term in the constraint process, see Section 2.4). The measurements were aggregated to monthly mean values in grid cells of size 2.50° longitude by 3.75° latitude (4 model grid boxes of the N96 model grid). In cases where there is more than one measurement in a model grid cell, the observed values were averaged. This processing resulted in 9464 monthlyaggregated grid-box measurements (over 6 aerosol properties and 12 months). Figure 1 shows the global spatial coverage of the gridded measurements for each aerosol property, along with the monthly temporal coverage for each measurement, which 5 is indicated by the colour-scale. Table 1 shows the breakdown of the number of grid-box measurements by variable and month.
The AOD data are level 2.0 (quality assured) monthly-mean data at 440 nm wavelength from the AERONET (Aerosol Robotic Network) network (Giles et al., 2019;Holben et al., 1998). Our dataset includes an average of 312 aggregated grid-box measurements for comparison in each month. Figure 1 shows that the measurements are well distributed across all continental 10 regions except Antarctica. The coverage at high northern latitudes is relatively sparse, and there are only a small number of island measurement that are representative of marine aerosol environments. The temporal coverage is very good, with the majority of stations providing measurements in all months of the year. Africa, as well as sulphate and PM2.5 data recorded at the Station for Observing Regional Processes of the Earth System 20 (SORPES) in Nanjing, East China. The PM2.5 data (except for the SORPES site) were obtained, processed and gridded to the N96 model grid as described in Browse et al. (2019). Figure 1 shows that these PM2.5 and sulphate measurements are highly clustered over polluted land areas of the Northern hemisphere, mostly in North America and Europe with limited coverage elsewhere, especially in remote and marine areas. Nearly all stations in these data sets have full temporal coverage, leading to approx. 150 and 170 aggregated grid-box measurements for comparison in each month for PM2.5 and sulphate respectively 25 (Table 1).
For N50 particle concentrations and OC concentrations we have a mixture of measurements from a small number of land-based ground stations along with measurements taken over marine environments from ship and aircraft campaigns (see Table A2).
The N50 concentration data were mainly derived from size distribution measurements and gridded to the N96 model grid as described in Browse et al. (2019). The amount of campaign data, and hence global spatial coverage in the gridded data, is 30 greater for N50 than for OC (Figure 1), and the number of aggregated grid-box measurements is variable between months (Table 1). Due to the nature of field campaigns, the temporal coverage is much sparser for these variables, with each campaign only measuring for 1-3 months of the year, shown by the blue colours for the data of these variables in Figure 1. The measurement data for N3 particle concentration has the smallest number of grid-box measurements over the year and spatially is the sparsest data set included here. The data for this aerosol property comes from only 12/13 ground stations (ACTRIS; Asmi et al., 2013), which are mostly located in Europe, with one in the Arctic, one in Antarctica and one in Northern India. The N3 concentrations at each site were derived directly by integrating size distribution measurements. This data was then averaged over multiple years for each month and location by the authors. 5

Constraint methodology
We apply the statistical methodology of history matching, which has been applied to complex models in a range of fields, 5 including epidemiological modelling of virus transmission (Andrianakis et al., 2017), risk assessment for oil field developments (Craig et al., 1997), modelling galaxy formation (Rodrigues et al., 2017) and climate modelling (Edwards et al., 2011;McNeall et al., 2016;Williamson et al., 2013). The methodology is described in detail in previous papers Regayre et al., 2018), which built upon our earlier study (Lee et al., 2016). We therefore describe the overall methodology only briefly here, but present a full description of the new aspects related to using real measurements rather than 10 'synthetic' measurements .
In the comparison of the model and measurements we account for emulator uncertainty, measurement uncertainty (instrument error), representativeness uncertainties (caused by spatial and temporal mismatches in resolution and sampling between model and measurements), and potential structural model uncertainty. The model-measurement difference together with these measures of uncertainty is incorporated into an 'implausibility measure' and our model constraint procedure in order to identify 15 implausible parts of parameter space (model variants).

Implausibility measure
The implausibility metric ( ) is calculated for each of the 1 million model variants , for each gridded measurement. ( ) weights the difference between the model and measurements by the uncertainties associated with the comparison (Craig et al., 1996;Williamson et al., 2013): where is the estimate of model output calculated using the emulator and is the observed value (the measurement). In the 5 denominator ( ) is the variance in the model estimate (associated with replacing the model with the emulator), ( ) is the variance in the measurement (i.e., instrument or retrieval uncertainty), ( ) is the variance associated with the comparison of the model with the measurements, called the representativeness error (Schutgens et al., 2017(Schutgens et al., , 2016a(Schutgens et al., , 2016b, and ( ) is a model structural error term.
A low value of the implausibility metric indicates either the model-measurement difference is small (i.e., the model is skilful) 10 or that the uncertainty in the denominator is large (i.e., we cannot tell whether the model is skilful because the uncertainties are too large). Therefore, the implausibility metric allows model variants to be ruled out if the model-measurement difference is large and we can be confident that it is large.
The representativeness error ( ) has three components.
( ) (sp = spatial) accounts for uncertainty associated with spatial variability below the grid scale of the model, which means that a point measurement may not be representative of the 15 grid-box mean (Schutgens et al., 2016b).
( ) (temp = temporal) accounts for the temporal sampling of a measurement, which may not match the temporal sampling of the model (e.g. a ship track through the grid-box over a short time period which is compared with a monthly-mean model value (Schutgens et al., 2016a). ( ) (iav = inter-annual variability) accounts for the fact that we sometimes match measurements and the model for the correct calendar month but not for the correct year.
This is necessary in cases where we use measurements from years for which we have not run the model. We assume that 20 The magnitude of these errors is discussed in section 2.4.2.
The structural error term ( ) has been included in previous studies using the implausibility metric. It is intended to represent an estimate of the potential structural error in the model. Practically, however, we have no way to estimate this term for all variables at all times and geographical locations. We therefore set it to zero, and instead use very large values of implausibility to point us towards potential structural errors in the model-observation comparison and constraint procedure, as 25

Estimation of uncertainty terms
Our estimates of the uncertainty terms in Equation 1 are preliminary and are designed to test our approach. We discuss in the conclusions the need to refine our understanding of these uncertainty terms.
For all aerosol properties we assume an instrument uncertainty of 10%, a spatial co-location uncertainty of 20%, and a temporal 5 sampling uncertainty of 10% on the measured value. The spatial sampling uncertainty for monthly mean aerosol properties is estimated based on Schutgens et al. (2017Schutgens et al. ( , 2016b. These studies examined a typical spatially heterogeneous continental environment where the sampling error is dominated mainly by local aerosol sources that are not resolved by the global model. The magnitude of uncertainty is likely to vary globally (especially between land and ocean), with surface measurements typically having larger errors than column measurements and the magnitude of error also depending on the location of a ground 10 site with respect to the grid-box centre, but we do not account for these variations. We base our estimate of the temporal sampling uncertainty on Schutgens et al. (2016a) who quantified the error associated with the different temporal sampling of models and measurements (e.g., daily measurements or temporally sporadic measurements versus monthly mean model, etc.).
The emulator uncertainty is taken from the Gaussian error on the emulator mean prediction, which is known for every parameter combination (i.e., each of the 1 million model variants). 15 The inter-annual uncertainty was defined to be the standard deviation of monthly mean aerosol properties in each grid cell over a 30-year period. We take information from an analysis of the trend and variation of gridded aerosol properties in a HadGEM3-UKCA hindcast simulation over the period of 1980(Turnock et al., 2015. For each month and grid-box, the monthly mean output of the aerosol variable of interest for each year of the simulation was obtained. These values were de-20 trended using linear regression and the resulting residuals were then analysed. We use a relative measure of monthly mean uncertainty defined by the standard deviation of these residuals divided by the de-trended mean. As an example, Figure 2 shows the relative standard deviation for the surface-level N50 concentration in July.

Methodology for ruling out observationally implausible parts of parameter space
There is an element of subjectivity involved in comparing a model with point measurements and reaching a conclusion about 5 the fidelity of the model. The comparison may indicate either: a) the model seems to be structurally adequate, but the parameters need to be adjusted to optimise agreement, or b) the model is structurally deficient (i.e., there are missing or incorrect process representations in the model). Structural deficiencies may be apparent, for example, because the model skill is particularly poor in one region or at one time of year, or it is not possible to obtain good skill across multiple variables simultaneously. 10 Our use of 1 million model variants and more than 9000 monthly aggregated grid-box measurements means that we need to automate the model-measurement comparison processes and detection of potential structural errors while also using the measurements to rule out implausible parts of parameter space. The difficulties for us in detecting structural errors are: a) we cannot inspect each of the 1 million model variants individually, so we need to rely on summary statistics; b) many of the aerosol point measurements are spatially and temporally sparse, so we cannot easily detect spatial and temporal changes in 15 model skill that might indicate structural error; c) the measurements do not have the same spatial distribution in all months (because of brief, localised field campaigns) so spatial-temporal biases are hard to detect; d) the uncertainty in each measurement (particularly the representativeness error, section 2.4.2) is spatially and temporally heterogeneous and often very poorly defined.
Our approach is summarised in Figure 3. It is designed to rule out implausible parts of parameter space while avoiding doing 20 so in cases where the biases shared by many model variants could be caused by structural errors in the model.  The steps are: 1) The implausibility is quantified for each of the 1 million variants across all measurements of a single type in a particular 5 month. Figure 4 shows an example for the measurements of N50 in July. For each measurement (numbered on the horizontal axis, left plot), the distribution of the implausibility over the variants is shown by the bar representing the 95% credible interval.
2) Measurements are identified for which 97.5% of the model variants have an implausibility I > 1. These measurements are excluded from the constraint procedure (shown in red in Figure 4). We assume that this large implausibility for the significant majority of variants indicates either there is a structural error in the model or that the model is unable to represent these point 10 measurements because of its low spatial and temporal resolution (section 2.4.1). We flag these measurements for further investigation of potential structural errors or underestimated error terms (these are not examined further in this study).
3) Using all other measurements, (where more model variants have lower implausibility, shown in blue in Figure 4), we use the implausibility metric values to decide whether to rule each variant out as implausible, or retain it as plausible. If we ruled out all model variants with high implausibility for each measurement in turn (treating the measurements independently, as in 15 many emergent constraint studies), we could end up ruling out all parts of parameter space. Our criterion is therefore to rule out a model variant if more than a defined fraction (or number) of the measurements (tolerance T) exceeds a defined implausibility threshold (). For example, we might rule out a model variant if more than 20% of measurements exceed an implausibility of 3.5 (i.e., bias is 3.5 times the expected error).
We apply this approach to the set of measurements for each variable (measurement type) in each month and then combine the 20 constraints to a joint constraint over months and/or over variables such that if a variant is ruled out for any single month/variable combination, then it is also ruled out in the joint constraint. This method allows us to identify the set of model variants that capably represent measurements of a range of variables and across multiple locations and seasons. We extensively explored  shown by the outer crosses, the bar corresponds to the 95% credible interval (2.5% to 97.5% empirical quantiles), the horizontal markers through the bar show the inter-quartile range, and the square point is the median implausibility. Here we assume no structural error term in the implausibility calculations and use the implausibility distribution to identify potential structural errors. Measurements coloured red are ruled out as potential structural error cases (as the lower 95% credible interval bound is >1), and those coloured blue are retained and used in our constraint procedure. Right: Corresponding map to show the locations of the rejected July N50 measurements (red) and those 10 retained for constraint (blue), over the North Pacific and North America region (outside this region, all measurements were retained). We hypothesise that the red points over the Pacific correspond to ports with localised pollution sources while the red points over Canada correspond to localised fire emissions that are not represented at the resolution of the model. Our choices of the threshold and tolerance for each measurement type are given in Table A2 in Appendix A. A wide range of values were tested in each case, starting with a set threshold of  = 3.5 and iterating through increasing tolerances T up to a 15 maximum of T = 33% (1/3 of the measurements), before further increasing  by 0.5 (to a maximum of  = 4.5) and re-iterating over T in order to retain (approximately) a chosen percentage of model variants. Approximately the same percentage of variants was attained for all months of a variable type and combined for an 'all months' constraint. Our final choices for each variable type on its own (left column in Table A2) were relaxed for the joint all-variables-months constraint (retaining a larger percentage of variants in each month for each variable, so a weaker constraint; right column in Table A2), in order to retain a 20 reasonable number of model variants and avoid over-constraining on any one observational type.
Our assumption of zero structural error ( indicating a large model-measurement discrepancy, e.g. Figure 4), or when bringing together the constraint effects of different sets / types of measurements (where very few, if any, model variants that lead to plausible model output in all cases/measurement types simultaneously can be identified and retained). Even though we do not directly account for structural errors in the implausibility measure itself, our constraint approach offsets the effects of such errors on the achieved constraint as much as possible. This is accomplished by screening out observations with large model-measurement discrepancies from 5 the constraint process (step 2; Figure 4) and by relaxing the constraint criteria for the joint all-variables-months constraint.
Through this approach we are able to produce as robust a constraint as possible, given the limitations we have in specifying structural and representational errors.

Interpretation of constrained parameter probability distributions
Observationally plausible parts of parameter space exist in 26 dimensions. We show the results as 1-dimensional marginal 10 probability distributions, which are 1-dimensional projections of the 26-dimensional parameter probability distribution. Figure   5 shows an idealised representation for a 2-dimensional parameter constraint. The white parts of the joint distribution are ruled out, leaving the shaded region of joint parameter space as observationally plausible. The effect on the marginal probability distribution of parameter 1 is to entirely rule out the lowest and highest values (i.e., there is no combination of these values of parameter 1 with parameter 2 that produces an observationally plausible model). Where some values of parameter 1 are ruled 15 out over the range of parameter 2, the likelihood of parameter 1 having those values is reduced.
In the results below, the parameter probability distributions therefore reflect the relative likelihood of the parameter having particular values, with lower probabilities indicating that there are fewer ways in which the parameter can be combined with the other 25 parameters to produce a plausible model. For conciseness in the results section we say, for example, that "a measurement constrains the parameter to low values", which means that we retain a larger proportion of model variants with 20 low values. Figure 5 also shows the separate and joint effects of two observational constraints. We show this conceptually because it arises in the results. Measurement 1 rules out the lowest values of parameter 1 and suggests that parameter 1 is likely to be at the high end of the sampled range. Conversely, measurement 2 suggests that parameter 1 is likely to be at the low end of the range.
However, the correct interpretation of this situation is that intermediate values of the parameter are consistent with both 25 measurements (measurement 1 is consistent with the model for all but the lowest values of the parameter and measurement 2 is consistent for all but the highest values). Only in cases where the two separate constrained parameter pdfs do not overlap can we conclude with certainty that there is likely to be a structural deficiency in the model. However, to obtain multivariate constraint we prevent this happening by screening out measurements with large model-measurement discrepancies and relaxing the constraint criteria with each measurement type.   AOD measurements constrain aerosol and precursor emissions to low values and removal rates to high values. These constraints imply that the PPE produces generally too-high AODs across the sampled parameter space, which is the case (section 3.4). In particular, sea spray emissions higher than about 3.6 times the baseline emissions are ruled out, but emissions down to as low as 0.125 times the baseline emissions are plausible. For anthropogenic SO2 emissions, the likelihood of the 10 emissions scale factor being below 1 (corresponding to the default value from the inventory) increases from 55% to 70% on constraint. For BVOC emissions, the likelihood of the emissions (or effectively the production of SOA) being more than 3 times the default emission of 46 Tg/y (=138 Tg/y) is reduced from 31% to 13%, but all lower values from the default emission down to our lower bound of 37 Tg/y are equally plausible.

Constraint using individual measurement types
The AOD measurements also constrain the model to low values of other parameters: more variants with higher cloud droplet 15 pH values are ruled out (judged implausible) and as a result cloud droplet pH is nearly 3 times as likely to be below the central value of its range of 5.8 as above it, which is consistent with a higher likelihood of slower production of sulphate aerosol from in-cloud SO2 oxidation. The hygroscopicity of OC in the particles (OC) is also weakly constrained to low values, which reduces the water content of aerosol and reduces AOD. The rate of aerosol scavenging by precipitating raindrops (the Rain_Frac parameter) is weakly constrained to high values. Sulphate measurements strongly constrain SO2 emissions to low values, which is consistent with the AOD constraint. Given this constraint, the SO2 emissions have a 78% likelihood of being below the default value from the inventory and the median emission is reduced to 0.78 times the default. Also consistent with the AOD constraint, the deposition rate of accumulation mode particles is constrained strongly to high values, with an 87% likelihood of the rate being above the default value.
Likewise, the SO2 dry deposition rate is constrained fairly weakly to higher values, with a 60% likelihood of the scaled value 5 being above the default value. Each of these constraints is consistent with too-high sulphate concentrations in many of the sampled model variants across the parameter uncertainty space (section 3.4).
PM2.5 measurements have a similar effect to AOD on some parameters, but for others there are differences. Emissions of sea spray and BVOC emissions are constrained similarly (to low values). However, SO2 emissions and cloud droplet pH are weakly constrained to higher values and the dry deposition rate of accumulation mode particles is weakly constrained to low 10 values, opposite to the AOD and sulphate constraints for these parameters. The PM2.5 measurements also weakly constrain the residential combustion emissions to high values. PM2.5 and AOD are strongly correlated in the PPE , so differences in the constrained parameters most likely reflect differences in the spatial distribution of the measurements ( Figure 1) and how that maps on to the spatial variations in sensitive parameters. As described in section 2.5, these apparently opposing constraints are not necessarily inconsistent: for AOD and PM2.5 there may be other parameter settings that can be 15 combined with low SO2 emissions to achieve agreement with the measurements (so the space is not ruled out).
Organic carbon (OC) measurements strongly constrain the scaled magnitude of residential carbonaceous emissions to a narrower credible interval of about 0.3-1.8 centred near the default value specified in the emissions. Emissions above 2.0 times the default value are effectively ruled out and there is only a 13% likelihood of the emissions being below half the default value. Fossil fuel emissions have a 70% likelihood of being above the default emission value. The OC measurements also 20 constrain the scaled BVOC emissions in a similar way to PM2.5 and AOD, with scaled emissions above about 2.1 times (97 Tg/y) having only a 31% likelihood (compared to 50% prior to constraint). OC measurements also constrain the lowest values of BVOC emissions, which was not achieved with PM2.5 and AOD. The likelihood of the scaled emissions being below 1 (46 Tg/y) is 6% (compared with 11%). The dry deposition rate of Aitken mode particles is constrained to the low part of the range, which will tend to increase OC concentrations in the atmosphere consistent with the constraint of fossil fuel emissions to high 25 values. There is also a weak constraint of the ageing rate towards higher values, which has a 55% likelihood of being in the upper half of the range. The rate of aerosol scavenging by precipitating raindrops (Rain_Frac parameter) is constrained similarly, but to lower values. Again, although weak, these two constraints imply slower ageing, slower removal rates, longer  choices of threshold θ and tolerance T that were applied in the constraint process to generate these results are given in Table A2 (left column for each individual measurement type; right column for the joint measurement-types constraint), along with the percentage of model variants that is retained in the constrained sample in each case. See section 2.5 for a definition of marginal parameter distributions.
Particle concentration (N3 and N50) measurements constrain a wider range of parameters than the measurements of massrelated properties. The rate of boundary layer nucleation is strongly constrained to the low part of the sampled range by the N3 10 measurements (a 77% likelihood of being below the default rate), suggesting N3 concentrations are generally too high across the PPE. N3 also weakly constrains the dry deposition of Aitken and accumulation mode particles to low values. Low deposition rates of accumulation mode particles (hence higher atmospheric concentrations) will result in a higher condensation sink and more removal of sulphuric acid that participates in particle nucleation, so this is consistent with the constraint of nucleation rates to low values. The constraint of Aitken mode deposition to low values is less obvious. Aitken mode particles can contribute substantially to N3, so low deposition rates would enhance N3 (opposite to the constraint on nucleation rates).
However, nucleation rates are constrained to very low values, so in such a situation Aitken particles can begin to act as a sink term for nucleation by affecting the condensation sink and by growing into accumulation mode particles. BVOC emissions 5 are not constrained by N3 measurements, even though SOA enters into the nucleation rate expression. This is most likely because high BVOC emissions also enhance total SOA, which acts as a condensation sink for nucleation, so the two effects cancel (Carslaw et al., 2013b).
For N50, the constraints are consistent with shifting the N50 concentrations in the ensemble towards lower values (section 3.4).
N50 has very little effect on the range of boundary layer nucleation rate. In contrast, a previous study found that boundary layer 10 nucleation made a statistically significant difference to model skill at about half of the ground sites they analysed (Reddington et al., 2011)although that study tested the effect of including or not including boundary layer nucleation rather than perturbing the rate as we do here. Without boundary layer particle formation the model was structurally deficient and had poor skill at around half the sites analysed. However, our results show that uncertainty in the parameter value itself is unimportant when other parameter uncertainties are considered. This parameter is unconstrained by N50 measurements because there are 15 many alternative ways of achieving model-measurement agreement.
N50 measurements also tend to constrain primary particle emissions to the lower end of the range (fossil fuel and primary sulphate emissions), albeit weakly. Residential particle emissions are not constrained, but the measurements we used are not well located to achieve this. It also constrains the emitted particle diameters to the high end of their ranges (fossil fuel, primary sulphate), which is again consistent with low number concentrations (since we perturb emission diameter independently of the 20 mass, so number concentration is affected). The constraint of particle emission sizes is consistent with a previous study that showed CCN concentrations are sensitive to the assumed size (Reddington et al., 2011). Our results show that N50 measurements allow the emission size to be constrained, even though there are many other compensating factors that can affect CCN concentrations. N50 weakly constrains cloud pH to higher values, consistent with greater production of sulphate aerosol and a higher sink for nucleation. BVOC emissions are constrained to the low end, which is consistent with reduced growth of 25 nucleation mode particles into the Aitken and accumulation modes. N50 also constrains depositions rates: accumulation mode deposition is constrained to low values and Aitken mode deposition to high values, suggesting a shift in the aerosol size distribution towards larger aerosols is consistent with N50 measurements.

Seasonal variations in constraint
Many of the parameter constraints vary seasonally, which can be linked to seasonal variations in emissions and parameter For PM2.5, the seasonality of constraint is very similar to AOD with one notable exception. The dry deposition rate of accumulation mode particles is constrained to high values in summer (consistent with AOD and sulphate), but to low values in the winter (Figure 7c). This may occur just because of the way in which the combinations of parameters control PM2.5 -10 for example BVOCs can account for PM2.5 in summer so high dry deposition rates cannot be ruled out. However, it may also indicate a structural deficiency, with the low deposition rates in winter implying that PM2.5 has missing sources in winter but not in the summer, such as nitrate.
For N3, we find that the boundary layer nucleation rate is constrained only in summer when photochemical production of the nucleating vapours is fast (Figure 7d). This is consistent with previous studies that have examined the seasonal cycle of organic-15 mediated nucleation (Riccobono et al., 2014). Similarly, N3 measurements constrain SO2 emissions and cloud droplet pH in summer when nucleation is most active. This is in contrast to the AOD and sulphate measurements, which constrained these two parameters in winter when their relative contribution to aerosol mass is greater. For N50 we find that parameter constraints do not vary smoothly through the year (not shown). This is because the N50 measurements we have used are primarily from campaigns, which move around the globe, resulting in constraint of regionally important parameters. This is one indication that we need to add long-term network measurements of N50 to the dataset.

Constraint using all measurement types
The multi-variate constraint is shown as the right-hand column of pdfs in Figure 6 and Table 2 shows corresponding parameter 5 distribution statistics from this constraint. For each individual variable / month constraint that feeds into this multi-variate constraint, the implausibility threshold and tolerance criteria (θ and T), were relaxed from the individual measurement constraints to retain approximately 75% of the 1 million model variants (Table A2). This relaxed criteria leads to measurements that provide stronger constraint being down-weighted and individual parameter constraints becoming weaker, but it means that we are able to avoid over-constraining on any one measurement type. Using all measurement types together leads to retention 10 of only 2.1% of the original 1 million model variants as plausible models (nearly 98% rejected; Table A2). In most cases the marginal parameter distributions from this constraint can be understood in terms of the combination of individual constraints described above.
Boundary layer nucleation rates are constrained to the low end of the range, which can be attributed almost entirely to the N3 measurements. However, the constraint is slightly weaker than when just N3 measurements are used because of the need to 15 relax the tolerances and thresholds applied when ruling out model variants using multiple measurement types (section 2.4.3).
The nucleation rate is constrained such that the likelihood of it being in the lower half of the range (0.1-1 times the default value) is 70% -more than twice the likelihood of it being in the upper half of the range (1-10 times the default value).
The pH of cloud droplets, which controls aqueous-phase oxidation of SO2 to form sulphate aerosol, is constrained to be more likely in the middle of our elicited range. This results from a combination of AOD and sulphate measurements constraining it 20 to the lower end of the range and PM2.5 measurements constraining it to the higher end. Observational constraint is unable to rule out any of the pH values between 4.6 and 7.0, although there is reduction of 0.13 in the 95% credible interval to 4.69-6.84 (from 4.66-6.94 before constraint) and a larger reduction of 0.32 in the interquartile range to 5.24-6.12 (from 5.2-6.4 before constraint).
Biomass burning emissions are weakly constrained. The likelihood of emissions being more than a factor 2 above the default 25 value is reduced to 14% (from 25%), but all values below this down to 0.25 times the default value are still equally likely, as they were before constraint.  (Table 2).
The diameter of fossil fuel particles is constrained mainly through the N50 measurements towards larger diameters, with a likelihood of being in the upper half of our elicited range (60-90 nm diameter) of 61% and the median of this parameter distribution shifting to a larger diameter on constraint, increasing from 60nm to 65.63nm 5 Sea spray emissions are constrained through a combination of AOD and PM2.5 measurements, and to a lesser extent by N50.
The multivariate constraint is slightly weaker than was achieved by AOD and PM2.5 individually, although we are still able The dry deposition rate of Aitken mode particles is weakly constrained to low values, which comes mainly from the OC and N3 observational constraint. The likelihood of the deposition rate being in the range 0.5-1.0 times the default value (1.0) is increased from 50% to 60% on constraint.
The dry deposition rate of accumulation mode particles is constrained to the middle of the range. This is likely because sulphate measurements constrain the deposition rate to be towards the high end while the other measurements constrain it 5 towards the low end. The multivariate constraint is weaker than when individual measurement types are used (AOD, sulphate, PM2.5, N50, N3), which results from relaxing the individual constraints in order to retain a reasonable number of model variants when multiple variables do not agree on the best value of the deposition rate.
The dry deposition rate of SO2 is constrained to the upper part of the elicited range, with the likelihood of it being in the range 1-5 times the default value (i.e., an increase in SO2 emissions) now 62% after constraint. 10  We show the results when single measurement types are used for constraint (blue) and when all measurement types are used together (red). The constraint procedure clearly rules out wide ranges of model outputs that are inconsistent with the measured values, shown by the vertical green lines. For example, the unconstrained distribution of mean global sulphate concentration 10 (at the measurement sites) extends up to about 6 g m -3 in January, but the tail of the distribution is limited to 3 g m -3 after constraint.

Parameter
The constrained model distribution sometimes agrees much better with the measurements when only a single measurement type is used compared to when all measurements are used. The weaker multi-variate constraint is because we relax the constraint on individual variables so as not to rule out all model variants. This effect is most apparent for sulphate and PM2. 5. 15 The mean of the constrained PM2.5 distribution using all measurements is about 40% lower than the mean of the measurements in January but the mean of the sulphate distribution is about 50% higher than the mean of the measurements. This is likely to indicate a structural error in the model that prevents good model-measurement agreement with both quantities in the same parts of model parameter space. One explanation could be that the model is missing sources of PM2.5 mass (e.g. nitrate aerosols in winter), which forces a compromise in which the constraint methodology rules in sulphate concentrations that are at the 20 upper end of the uncertainty range to minimise the error for PM2.5. Although relaxing our constraint criteria offsets many effects of such structural errors, the shifting of these all-measurement constraint distributions away from the measurements indicates some structural error is still not fully accounted for. It is possible that our constraint would adjust better to account for this structural deficiency if we could directly specify a structural error term in the implausibility measure through ( ).

Unconstrained parameters
Several parameters are barely constrained or not constrained at all using all the measurements. Unconstrained microphysical 10 processes or assumptions are the ageing rate of insoluble into soluble particles, the width of the lognormal accumulation mode, the hygroscopicity of organic material (OC), the updraft speed and wet deposition rates. Among the emissions, unconstrained parameters are the emission rates of fossil fuel particles, degassing volcanic SO2, DMS and dust emissions.
There are several potential reasons for the lack of constraint. It is possible that parts of the joint parameter space are ruled out, but with a negligible effect on the marginal parameter distribution (i.e., the ruled out parameter space is uniform across the 15 parameter of interest). For example, wet deposition rates are directly compensated by emission rates and the ageing rate affects the wet removal rate. Another reason is that we did not include measurements in regions where the six variables are sensitive to these parameters. This is likely to be the case for DMS, volcanic and dust emissions given the relative lack of measurements over remote ocean regions and downwind of dust sources, which means these regions are not strongly weighted in the overall constraint process. Furthermore, some parameters may be more related to other aerosol properties that we have not used for 20 constraint. For example, aging rates in the model are not constrained, likely because the ageing process predominantly affects the black carbon concentration which is not included as a measurement type in this study.  Figure 9 shows the nine most-important parameters for the uncertainty in global mean aerosol forcing in the PPE in terms of the forcing uncertainty they account for (Yoshioka et al., 2019). Some of these parameters are fairly strongly constrained by the measurements, but others are unconstrained. Within the joint parameter space of just these nine parameters there is considerable potential for model variants that compensate, thereby reducing the effectiveness of the constrained parameters on 5 the forcing. It also needs to be borne in mind that global mean forcing is the sum of regional forcings, and in each region a different set of parameters is being constrained and may be constraining the same parameters to different parts of their range (Lee et al., 2016;Regayre et al., 2015).  parameter uncertainty space, while we use uniform distributions for the sampling here). Table 3 shows the corresponding parameter distribution statistics (median, inter-quartile range, ±1σ range (on mean value) and 95% credible interval) for these forcing constraints.  The net RF is dominated by RFaci, which is only weakly constrained (Figure 10b) by 6%, in line with the net RF (Figure 10a).

Implications for constraint of aerosol forcing
This occurs because our constraint uses measurements of aerosol properties rather than cloud properties. Although the overall 15 reduction in the RFaci uncertainty is weak, the pdfs in Figure 10b  of the strength of RFaci being weaker than -1.5 W m -2 (less negative) is reduced by 38% and the likelihood of it being stronger (more negative) than -2.5 W m -2 is increased by 20%. In general, although the anthropogenic emissions were constrained to lower values (which should weaken the forcing), the sea spray emissions were constrained to lower values, which acts to strengthen the forcing (Carslaw et al., 2013a;Regayre et al., 2014). The 95% credible interval for the net RF is reduced by 8%. 5 The 95% credible interval of direct forcing, RFari, is reduced by 34% (Figure 10c) and the ±1σ range is reduced by 33%. RFari is constrained most strongly by the PM2.5, AOD and sulphate measurements (not shown) but insignificantly by the OC, N3 and N50 measurements. The inconsistency between the constraints on PM2.5, AOD and sulphate ( Figure 8) leads to inconsistency in the constraint on RFari. In particular, using just sulphate measurements results in an RFari ±1 range of -0.10 to -0.22 W m -2 , but using just PM2.5 measurements results in a range -0.20 to -0.36 W m -2 . This highlights the importance of 10 detecting and fixing structural deficiencies in the model as well as the limitations of using single-variable emergent constraints.
It is important to note that the probability distribution of net aerosol RF includes values that, with current knowledge, would produce a net negative (greenhouse gas plus aerosol) forcing over the industrial period. The forcing is dominated by aerosolcloud processes, which we have not attempted to constrain here. Nevertheless, the lack of constraint shows, as in Lee et al. (2016), a well-configured global aerosol model has little bearing on the uncertainty in RFaci. In contrast, the constraint of RFari, 15 which has not been attempted in our previous studies, is significant and encouraging.

Conclusions
We have used extensive point measurements of aerosol optical depth (AOD), PM2.5, sulphate mass, organic carbon mass, and the concentrations of particles larger than 50 nm dry diameter (N50) and 3 nm (N3) from surface sites, aircraft and ships to constrain uncertain aerosol parameters in a global aerosol-climate model. Twenty-six parameters related to aerosol emissions 20 and processes were varied in a perturbed parameter ensemble and statistical emulators were used to generate a set of 1 million model variants that represent the model outputs for combinations of parameter values across the 26-dimensional uncertainty space. The plausibility of each model variant was tested against each measurement type in turn and then in combination using a history matching procedure based on an implausibility metric (Craig et al., 1996;Williamson et al., 2013). The resulting probability distributions of aerosol forcings can be considered as the "observationally plausible" ranges for the HadGEM3-25 UKCA model.
Observational constraint ruled out almost 98% of the 26-dimensional parameter space and the probability distributions of many parameter values were effectively constrained. Fourteen of the parameters were constrained to some extent, despite the fact that there are many ways in which parameter values can be combined to produce plausible results within the uncertainties.
Constraint of a parameter means that the probability distribution of a parameter (and potentially its absolute range) is narrowed, 30 and hence the likelihood of the parameter taking a particular range of values within its absolute range is increased; i.e., there are more ways to combine these parameter values with values of the other 25 parameters to produce a plausible model. For two parameters, some of the individual prior elicited parameter ranges were ruled out entirely: very high sea spray emissions and very high residential carbonaceous emissions. The very highest BVOC emissions were nearly ruled out. However, for the remaining parameters it was not possible to entirely rule out any part of the prior range.
Parameter constraints are mostly, but not always, consistent across multiple measurement types, even though the different 5 measurement types were made at very different locations and sometimes at different times of the year. For example, we often found consistent constraint of parameters related to the production of aerosol mass or number (emissions, nucleation, secondary aerosol mass) and parameters related to removal (condensation sink, deposition rates of gases and aerosols). There is also a very clear seasonal variation in the magnitude of constraint related to variations in the dominant processes.
The multivariate constraint has the following effect on the parameter probability distributions, which were assumed before 10 constraint to be equally plausible between lower and upper bounds defined by expert elicitation: 1. Boundary layer nucleation rates based on a sulphuric acid-organic mechanism (Metzger et al., 2010) are constrained to the low end of the elicited range, mainly from the N3 measurements.
2. The pH of cloud droplets, which controls aqueous-phase oxidation of SO2 to form sulphate aerosol, is constrained to be more likely in the middle of our elicited range but we were unable to rule out any of the pH values between 4.6 15 and 7.0. The constraint led to a reduction of 0.32 in the interquartile range to 5.24-6.12 (from 5.2-6.4 before constraint) and a reduction of 0.13 in the 95% credible interval to 4.69-6.84 (from 4.66-6.94 before constraint).
3. Biomass burning emissions are weakly constrained by the PM2.5, AOD and OC measurements, with a reduced likelihood of the emissions exceeding a factor 2 above the default value. All lower emissions down to 0.25 times the default value are equally likely as they were before constraint. 8. BVOC emissions are constrained by AOD, PM2.5, OC, N50 and N3 measurements. The likelihood of either high or low emissions is reduced. On constraint, our median estimate corresponds to 86.5 Tg/y SOA production, with a 95% credible interval of 40-172 Tg/y. However, no values in the range 37-250 Tg/y are ruled out entirely.
9. The dry deposition rate of Aitken mode particles is weakly constrained to low values using OC and N3 measurements and the dry deposition rate of accumulation mode particles is constrained to the middle of the range by AOD, sulphate, 5 PM2.5, N50 and N3 measurements. The dry deposition rate of SO2 is constrained to the upper part of the elicited range, with a 62% likelihood of it being above the default value.
Several parameters of importance to aerosol forcing were not well constrained, in particular parameters related to microphysical processes (primary sulphate particle diameter, the diameter of biomass burning particles, the width of the accumulation mode and the ageing rate). Dimethyl sulphide emissions were also not strongly constrained. 10 The prior (unconstrained) uncertainty (95% credible interval) in the pre-industrial to present-day net aerosol RF is reduced by 8%. The radiative forcing in the ensemble accounts for direct and indirect (cloud albedo) effects, but not cloud adjustments.
RFari uncertainty (95% credible interval) is reduced by 34%, but the net RF uncertainty is dominated by the RFaci uncertainty, which is reduced by only 6%. The recent assessment of aerosol forcing (Bellouin et al., 2019) adopted ±1σ ranges to define the uncertainty. Our equivalent ±1σ ranges are -0.14 to -0.26 W m -2 for RFari (reduction of 33% due to constraint) and -1.66 to 15 -2.48 W m -2 for RFaci (reduction of 7% due to constraint). The reduction in uncertainty is much larger for RFari than RFaci because our constraints focus on aerosol properties rather than cloud properties.
Our results highlight the importance of using multiple measurement types to constrain aerosol-climate models. We have shown that use of a single measurement type, as is done in emergent constraint studies, would lead to an over-confident constraint. This is because potential structural deficiencies in our model prevent consistently good constraint across several measurement 20 types. In particular, we showed that constraint using PM2.5 or sulphate aerosol measurements lead to probability distributions of RFari that barely overlap. The final multivariate constraint on forcing is therefore a compromise that achieves reasonable agreement with all observations rather than being over-confidently constrained by one metric.
In terms of future directions and requirements to achieve better constraint, we make the following recommendations: 1. We need to understand and quantify model-measurement representativeness errors. The biggest challenge (and the 25 factor that most limits the constraint, other than model structural error) is quantification of the representativeness error associated with comparing point measurements with a global model (Reddington et al., 2017;Schutgens et al., 2017Schutgens et al., , 2016aSchutgens et al., , 2016b. The ambiguity in deciding whether a model-measurement bias is related to structural error in the model or underestimation of the uncertainty terms was the main limiting factor in our constraint procedure. The representativeness uncertainties have been estimated based on model simulations at a few locations (Schutgens et al., 30 2016b), but they have not been measured, and we have no information about these uncertainties at other locations. Small campaigns to characterise the space-time variability around existing sites and for potential new sites should be considered.
2. We need to understand and quantify structural deficiencies in the model. Our current approach, to assume zero structural error in our implausibility calculations, enables some structural errors in the model to be detected by comparing single and multi-measurement constraints. Ideally, to produce the most robust constraint possible, these 5 structural errors should be accounted for directly using ( ) in the implausibility measureor, better still, corrected in the model.
3. We should prioritise expansion of measurements to cover more long-term well-characterised sites. Such measurements will have much lower (or at least well-defined) representativeness error when used for model evaluation and constraint. Sites should be characterised in terms of how they represent a typical model grid box over 10 a long period of time. These sites should be in diverse locations so that they help to constrain a wide range of model uncertainties (Reddington et al., 2017). 4. We should aim to dedicate part of field campaigns to routine, unbiased (or effectively random) sampling of aerosols across the scale of model grid boxes. Such measurements will also have much lower representativeness error than measurements that target specific processes or aerosol environments. Many field campaigns (particularly using 15 aircraft) often prioritise measurements to explore aerosol processes, or to characterise particular aerosol environments (e.g., pollution plumes). Others, like the Atmospheric Tomography Mission emphasised the effective sampling of air masses in an unbiased way. A greater emphasis on such tomography missions, even just as part of a larger field campaign, would benefit model uncertainty reduction. 5. Additional aerosol measurements could be used to further constrain the parameters and forcing. We expect the 20 following (already available) measurements would provide further constraint: i) black carbon measurements to help constrain the aerosol absorption component of radiative forcing, but also as a measure of aerosol removal rates; ii) SO2 concentrations to avoid ambiguity between sulphate and PM2.5 constraints; iii) measurements from biomass burning regions to help constrain both the emissions and the size distribution of the particles; iv) vertical profiles of aerosol (Watson-Parris et al., 2019); v) more particle number size distribution information instead of just N3 and N50, 25 which have not constrained the size distribution well enough. 6. Measured or derived process rates would be very useful because they would help to constrain model parameters directly, rather than relying on indirect constraint through measured state variables. For example, dry and wet deposition rates (Emerson et al., 2018) are required on the scale of model grid boxes (10's to 100's km). Similarly, direct estimates of particle formation and growth rates (Kerminen et al., 2018) would be useful, rather than just relying 30 on integral particle number concentrations averaged over long periods.
It is very challenging to constrain model uncertainty using a large set of perturbed parameter model variants and extensive measurements of different types. However, the resulting ranges of model parameters and outputs (like radiative forcings) https://doi.org /10.5194/acp-2019-834 Preprint. Discussion started: 21 November 2019 c Author(s) 2019. CC BY 4.0 License. estimated in this way are much more robust than those based on a very small number of models. Although our study is incomplete (not all parameters were perturbed and not all measurements were used) the outcome is an estimate of the "observationally plausible" range of aerosol forcings for the HadGEM3-UKCA aerosol-climate model. The ranges might be wider if we accounted for more sources of uncertainty and directly accounted for structural errors in the implausibility calculations, but they could also be narrowed if we improved model structural deficiencies, reduced model-measurement-5 representativeness errors and used a wider set of measurement types. Final % variants retain for the joint all-variables-months constraint: 2.085 Table A2. The choices of the number of measurements (tolerance, T) allowed to exceed a threshold implausibility (θ) for constraint on each of the monthly observed aerosol properties, along with the corresponding percentage of the large sample of 1 million model variants (that covers the PPE parametric uncertainty) that is retained on constraint for each choice. The choices on the left correspond to those used for the all-months constraints shown on the left of Figure 6. The choices on the right correspond to those used for the joint all-variables-months constraint shown on the right in Figure 6. The percentage of variants retained is also shown for the combined all-months constraints for each 5 variable (ALL) as well as for the joint all-variables-months constraint (final row).

Appendix B: Description of the data sets used in this study
For the AERONET Aerosol Optical Depth (AOD) data used in this study, we thank the PIs of the AERONET sites used for maintaining their instruments and providing their data to the community. We also acknowledge AERONET for their continuous efforts in providing high-quality measurements and derivative products. All data used in this work can be 10 accessed through the AERONET web page: http://aeronet.gsfc.nasa.gov/.

Data Availability
Measurement data can be sourced from the databases and data providers as outlined in Appendix B and Table S1 in the supplementary data file. Model data and analysis code can be made available upon request from the corresponding author. The authors welcome use of the perturbed parameter ensemble for advancing climate research.

Competing Interests
The authors declare that they have no conflict of interest.