Model evaluation of short-lived climate forcers for the Arctic Monitoring and Assessment Programme: a multi-species, multi-model study

While carbon dioxide is the main cause of global warming, modeling short-lived climate forcers (SLCFs) such as methane, ozone, and particles in the Arctic allows us to simulate near-term climate and health impacts for a sensitive, pristine region that is warming at 3 times the global rate. Atmospheric modeling is critical for understanding the long-range transport of pollutants to the Arctic, as well as the abundance and distribution of SLCFs throughout the Arctic atmosphere. Modeling is also used as a tool to determine SLCF impacts on climate and health in the present and in future emissions scenarios. In this study, we evaluate 18 state-of-the-art atmospheric and Earth system models by assessing their representation of Arctic and Northern Hemisphere atmospheric SLCF distributions, considering a wide range of different chemical species (methane, tropospheric ozone and its precursors, black carbon, sulfate, organic aerosol, and particulate matter) and multiple observational datasets. Model simulations over 4 years (2008–2009 and 2014–2015) conducted for the 2022 Arctic Monitoring and Assessment Programme (AMAP) SLCF assessment report are thoroughly evaluated against satellite, ground, ship, and aircraft-based observations. The annual means, seasonal cycles, and 3-D distributions of SLCFs were evaluated using several metrics, such as absolute and percent model biases and correlation coefficients. The results show a large range in model performance, with no one particular model or model type performing well for all regions and all SLCF species. The multi-model mean (mmm) was able to represent the general features of SLCFs in the Arctic and had the best overall performance. For the SLCFs with the greatest radiative impact (CH 4 , O 3 , BC, and SO 2− 4 ), the mmm was within ±25 % of the measurements across the Northern Hemisphere. Therefore, we recommend a multi-model ensemble be used for simulating climate and health impacts of SLCFs.


Introduction
The Arctic atmosphere is warming 3 times more quickly than the global average (Bush and Lemmen, 2019; NOAA, 2020; AMAP, 2021; IPCC, 2021). Arctic warming is a manifestation of global warming, and the main driver for this is the increasing carbon dioxide (CO 2 ) radiative forcing (IPCC, 2021). Arctic warming is amplified by sea ice and snow feedbacks and affected by local radiative forcings in the Arctic, including radiative forcings by short-lived climate forcers (SLCFs), such as methane, black carbon, and tropospheric ozone (AMAP, 2015a, b, 2022). The remote pristine Arctic environment is sensitive to the long-range transport of atmospheric pollutants and deposition (Schmale et al., 2021). At the same time, it is difficult to carry out in situ measurements (Freud et al., 2017) and satellite observations over the Arctic. The majority of the Arctic surface is ocean covered with sea ice that is usually adrift for most of the year. The Arctic environment is also harsh. These aspects have historically kept surface-based measurements sparse. The overwhelming majority of the satellite observations either depend on the visible spectrum, are limited by the presence of clouds, or have very low sensitivity in the lower troposphere where the atmospheric processes mainly determine the fate of the pollutants. Many satellite measurements also do not have good coverage in the Arctic, given their orbital parameters or problems measuring areas with high albedo (Beer, 2006).
Modeling the Arctic atmosphere comes with its own challenges due to extreme meteorological conditions, its great distance from major global pollution sources, poorly known local emissions, high gradients in physical and chemical fields, and a singularity in some model grids at the pole. Models have been improving in the last 2 decades, but many models still have inaccurate results in the Arctic (Shindell et al., 2008; Eckhardt et al., 2015; Emmons et al., 2015; Sand et al., 2017; Marelle et al., 2018). That said, there have recently been a number of improvements in numerous models that have allowed for better representation of certain processes (Morgenstern et al., 2017; Emmons et al., 2020a; Swart et al., 2019; Holopainen et al., 2020; Im et al., 2021). In this study, model simulations for the 2021 Arctic Monitoring and Assessment Programme (AMAP) SLCF assessment report (AMAP, 2022) have been thoroughly evaluated by comparison to several freely available observational datasets in the Northern Hemisphere and assessed in more detail in the Arctic. In order to support the integrated assessment of climate and human health for AMAP, 6 SLCF species (methane, CH 4 ; ozone, O 3 ; black carbon, BC; sulfate, SO 2− 4 ; organic aerosol, OA; and fine particulate matter, PM 2.5 ) and 2 O 3 precursors (carbon monoxide, CO; nitrogen dioxide, NO 2 ) from 18 atmospheric or Earth system models are compared to numerous observational datasets (from three satellite instruments, seven monitoring networks, and nine measurement campaigns) for 4 years (2008-2009 and 2014-2015), with the goal of answering the following questions.
1. How well do the AMAP SLCF models perform in the context of measurements and their associated uncertainty?
-What do the best-performing models have in common?
-Are there regional patterns in the model biases?
-Are there patterns in the model biases between SLCF species?
2. How does the model performance impact model applications, such as simulated climate and health impacts?
3. What processes should be improved or studied further for better model performance?
Out of scope of this study are any sensitivity tests by the models to assess different components of model errors. Also out of scope are the models' simulations of aerosol optical properties and cloud properties (e.g., cloud fraction, cloud droplet number concentration, cloud scavenging), though those parameters do have a large impact on climate and a tight relationship with some SLCFs. Their initial evaluation can be found in AMAP (2022) (chap. 7). Estimates of effective radiative forcings of SLCFs in the Arctic by the AMAP participating models are also provided elsewhere.
The next section summarizes the models used in this study, with more information in the Appendix. Section 3 summarizes the measurements used for model evaluation. Section 4 presents our model evaluation for each SLCF species, followed by a summary of all SLCFs. Finally, Sect. 5 is the conclusion where the questions posed above are answered.

Models
In this section we briefly describe the models used for the AMAP SLCF study and refer the reader to Appendix A for individual model descriptions and further information. All models were run globally with the same anthropogenic emissions dataset (see Sect. 2.1), and most were run for the years 2008-2009 (as was done for the 2015 AMAP assessment report) and 2014-2015 (to evaluate more recent model results) inclusive for this evaluation, as these were years with numerous Arctic measurements. Unless otherwise indicated, all model output was monthly-averaged.
The models used for this study are summarized in Table 1. As is shown in the table, not all models provided all SLCF species, and not all models provided all 4 years. There were eight chemical transport models (CTMs), two chemistry-climate models (CCMs), three global climate models (GCMs), and five Earth system models (ESMs). Many models used specified or nudged meteorology, which allows the day-to-day variability of the model meteorology to be more closely aligned with the historical evolution of the atmosphere than occurs in a free-running model. The ERA-Interim reanalysis was the most commonly used meteorology (in 7 out of 18 models), but some were free-running (simulating their own meteorology) and some used other reanalysis products (Table 1).
In comparison to the CMIP6 emissions (Hoesly et al., 2018), ECLIPSEv6b emissions have additionally taken into account the recent declines in emissions from Asia of SO 2 , BC, and NO x due to recent control measures, whereas those declines in the CMIP6 emissions were unrealistically small (Wang et al., 2021). The inclusion of emissions from the flaring sector in Russia was a significant improvement, which was not present in the previous version of ECLIPSE emissions that was used in the AMAP (2015a) report.
For non-agricultural fire emissions, many models utilized the CMIP6 fire emissions, which are based on monthly GFED (Global Fire Emissions Database) v4.1 (van Marle et al., 2017). About half of the models included volcanic emissions or stratospheric aerosol concentrations from the CMIP6 dataset (Thomason et al., 2018) or other sources, and the other half did not include volcanic emissions, which mainly impact SO 2 and thus modeled SO 2− 4 . The emissions from the October to December 2014 Holuhraun volcano eruption (Gíslason et al., 2015; Twigg et al., 2016; Ilyinskaya et al., 2017) were included by six models in a separate set of simulations. Similar differences in biogenic and agricultural waste emissions appear in these model simulations, and all are summarized in Table 2.

Chemistry
This section contains a summary of models' chemistry schemes, and we refer the reader to Appendix A and references therein for more details.

Methane
All participating models that provided CH 4 output prescribed CH 4 concentrations based on box model results from Olivié et al. (2021) for 2015 and from Meinshausen et al. (2017) for years prior to 2015. The former utilized the ECLIPSE v6B anthropogenic CH 4 emissions (Sect. 2.1), along with assumptions for the natural emissions (Olivié et al., 2021; Prather et al., 2012), to provide as input to models' surface or boundary layer CH 4 concentrations. Models then allow CH 4 to take part in photochemical processes, such as the production of tropospheric O 3 .
[...] are involved in chemical reactions interacting with the gas-phase chemistry. Aerosol size distributions are either prescribed or discretized into lognormal modes or size sections. How the aerosol size distribution varies in space and time depends on many different processes, including emission, aerosol microphysics, aerosol-cloud interactions, and removal. How these processes are parameterized depends on the model, and we refer the reader to the Appendix and the references therein for more detail.

Measurements
We have utilized many freely available observational datasets of SLCFs to evaluate the models. General descriptions are given below under the broad headings of surface monitoring, satellite, and campaign datasets, and there is some additional information in Appendix B.
[...] (Biraud, 2011). For NO x , the instrument exploits the characteristic chemiluminescence produced by the reaction between NO and O 3 , the intensity of which is proportional to the NO concentration. NO 2 measurements are approximated using its thermal reduction to NO by a heated (350 °C) molybdenum converter (Bauguitte, 2014). Note that this method has an estimated bias of about 5 %-20 % because of sensitivity to other oxidized nitrogen species, and this has not been corrected for. The bias is at the lower end of this range under high-NO x conditions, whereas in the low-NO x Arctic the uncertainty can be up to 100 %.

BC and OA
There are various BC measurement methods exploiting different properties of BC and thus measuring different quantities (Petzold et al., 2013): elemental carbon (EC) determined by thermal and/or thermal-optical methods, equivalent BC (eBC) by optical absorption methods, and refractory BC (rBC) by incandescence methods. Table B1 in Appendix B lists the different measurement techniques and instruments that the different monitoring networks and individual Arctic monitoring stations use. As BC emission inventories, including ECLIPSEv6b, are mainly based on emission factors derived from thermal and/or thermal-optical methods, modeled BC is consequently representative of EC. The different types of BC measurements (EC, eBC, and rBC) usually agree with each other within a factor of 2 (AMAP, 2022; Pileci et al., 2021). However, it has been shown that, as the aerosol ages, the complex state of mixing of BC particles causes eBC to increase relative to EC (Zanatta et al., 2018). The absorption and scattering cross sections of coated BC particles vary by more than a factor of 2 due to different coating structures. He et al. (2015) found an increase of 20 %-250 % in absorption during aging, significantly depending on coating morphology and aging stages. Thus, this complexity impacts model-measurement comparisons at remote Arctic locations where one would expect eBC to have a high, positive uncertainty.
We obtained BC from the Canadian Aerosol Baseline Measurement (CABM) network for Canada, the Interagency Monitoring of Protected Visual Environments (IMPROVE) network for the US, the EMEP network for Europe, and individual Arctic locations. To our knowledge, there were no other freely accessible BC measurements. The major observing networks EMEP, CABM, and IMPROVE measure EC with approximately 10 % uncertainty. However, given the complexities in different BC measurement types, as mentioned above, the overall uncertainty is about 200 %. Another complexity with model evaluation of BC is that some of the eBC measurements that models are compared to were made from collected particulate matter with different maximum diameters (e.g., PM 1 , PM 2.5 , and PM 10 ). These are included in Table B1 for each of the measurement locations. From the models we use BC from PM 2.5 , as most of the BC is expected to be in the submicron mode. Organic carbon (OC) is also measured via thermal and/or thermal-optical methods (Chow et al., 1993, 2001, 2004; Cavalli et al., 2010; Chan et al., 2019; Huang et al., 2021) using the same instrumentation as for EC detection in the IMPROVE, CABM, NAPS, and EMEP measurement networks. These OC measurements have approximately 20 % or less uncertainty (Chan et al., 2019). Models output organic aerosol (OA), which includes OC and organic matter and is related to OC via a factor of 1.4 (Russell, 2003; Tsigaridis et al., 2014), though this factor has been reported as a range from 1.4 to 2.1 in the literature, depending on the source of OC and OA (Tsigaridis et al., 2014). Nevertheless, we applied a conversion factor of 1.4 to the OC measurements before comparing them to the modeled OA.
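The OC-to-OA conversion described above can be sketched as follows. This is a minimal illustration, not the processing code used in the study; the function name and the sample OC values are hypothetical, and only the factor of 1.4 and the 1.4-2.1 literature range come from the text.

```python
import numpy as np

# OA/OC mass ratio applied in the text; the quoted literature range is 1.4-2.1.
OA_TO_OC_RATIO = 1.4

def oc_to_oa(oc, ratio=OA_TO_OC_RATIO):
    """Convert measured organic carbon (OC) mass to an OA-equivalent mass."""
    return np.asarray(oc, dtype=float) * ratio

# Hypothetical OC measurements in ug m-3 (illustrative values only)
oc_obs = np.array([0.5, 1.0, 2.0])
oa_equiv = oc_to_oa(oc_obs)             # factor used for the comparison
oa_upper = oc_to_oa(oc_obs, ratio=2.1)  # upper end of the literature range
```

Comparing `oa_equiv` with `oa_upper` gives a rough sense of how much of an apparent model OA bias could be attributed to the choice of conversion factor alone.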
Arctic BC measurement locations are shown in Fig. 1, and many of these Arctic aerosol measurements were discussed in Schmale et al. (2022). We also evaluated modeled BC deposition by comparing it to BC deposition derived from ice core measurements (D4, ACT2: McConnell and Edwards, 2008;Humboldt: Bauer et al., 2013;Summit: Keegan et al., 2014;NGT_B19, ACT11D: McConnell et al., 2019). All of the ice core locations are also shown in Fig. 1. Deposition fluxes are not a measured value but are derived from the EC concentrations in ice and precipitation estimates.
Surface in situ SO 2− 4 measurements in the major observing networks typically use ion chromatography methods, which have an approximately 3 % uncertainty range (Solomon et al., 2014). However, SO 2− 4 measurements have been shown to have up to 20 % analytical uncertainty (AMAP, 2022). SO 2− 4 datasets were obtained from IMPROVE, EMEP, and CABM networks, often via the EBAS database. SO 2− 4 deposition was also derived from the same ice core measurements mentioned above for BC deposition (D4, ACT2: McConnell and Edwards, 2008;Humboldt: Bauer et al., 2013;Summit: Maselli et al., 2017;NGT_B19, ACT11D: McConnell et al., 2019). The Arctic SO 2− 4 measurement locations are shown in Fig. 1.

PM 2.5
Surface in situ PM 2.5 measurements are usually made via gravimetric analysis of particulate matter collected on a filter (e.g., Teflon substrate), which has around a 1 %-6 % uncertainty range (Malm et al., 2011). These data were obtained from Beijing Air Quality and US Embassy data for China, NAPS for Canada (Dabek-Zlotorzynska et al., 2011), IMPROVE for the US, and EMEP and/or EBAS for Europe.

Satellite datasets
Satellite observations are useful for evaluating models on larger horizontal spatial scales and for evaluating the three-dimensional atmosphere rather than the surface concentrations. Observations from three satellite instruments were used to evaluate model trace gas distributions in the free troposphere and, when appropriate, the lower stratosphere. These were the Tropospheric Emission Spectrometer version 7 (TES; NASA Atmospheric Science Data Centre, 2018; Gluck, 2004a, b), the Atmospheric Chemistry Experiment-Fourier Transform Spectrometer version 4.1 (ACE-FTS; Bernath et al., 2005; Sheese and Walker, 2020), and the Measurements of Pollution in the Troposphere version 8 (MOPITT; Ziskin, 2000; Deeter et al., 2019). The vertical profiles of trace gas volume mixing ratios are derived or retrieved from the satellite-measured emission or absorption spectra, with varying degrees of vertical sensitivity. These remote techniques typically have about a 15 % uncertainty in the measurements (e.g., Verstraeten et al., 2013), though this depends on the specific instrument and the species retrieved (e.g., Sheese et al., 2017).
Note that while TES and MOPITT have global spatial coverage, their coverage does not extend up into the high Arctic. The TES instrument on NASA's Aura satellite measures vertical profiles of trace gases such as O 3 , CH 4 , NO 2 , CO, and HNO 3 . After interpolating all models and TES results to a 1° × 1° horizontal grid, the monthly mean CH 4 and O 3 from the TES lite products were matched in space and time with models. Models were smoothed with the TES monthly mean averaging kernels prior to comparisons with satellite data. TES measurements started in 2004, stopped in late 2015, and had poorer coverage in the last few years.
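The averaging-kernel smoothing step can be illustrated with a short sketch. The text does not give the formula, so this assumes the commonly used form in which the model profile is first interpolated to the retrieval levels and then smoothed in log volume mixing ratio, ln(x_s) = ln(x_a) + A (ln(x_m) − ln(x_a)), where x_a is the retrieval a priori and A is the averaging kernel matrix. The function name and the sample profiles are hypothetical.

```python
import numpy as np

def apply_averaging_kernel(model_vmr, prior_vmr, avg_kernel):
    """Smooth a model profile with a retrieval averaging kernel.

    Assumes all inputs are on the same retrieval levels and that the
    kernel acts on ln(VMR): ln(x_s) = ln(x_a) + A (ln(x_m) - ln(x_a)).
    """
    ln_m = np.log(np.asarray(model_vmr, dtype=float))
    ln_a = np.log(np.asarray(prior_vmr, dtype=float))
    A = np.asarray(avg_kernel, dtype=float)
    return np.exp(ln_a + A @ (ln_m - ln_a))

# Hypothetical three-level CH4 profiles in ppbv (illustrative values only)
prior = np.array([1800.0, 1780.0, 1700.0])
model = np.array([1850.0, 1800.0, 1650.0])

# Limiting cases: perfect sensitivity returns the model profile,
# zero sensitivity returns the a priori profile.
smoothed_identity = apply_averaging_kernel(model, prior, np.eye(3))
smoothed_zero = apply_averaging_kernel(model, prior, np.zeros((3, 3)))
```

The two limiting cases show why smoothing matters: at levels where the instrument has little sensitivity, the smoothed model is pulled toward the a priori, so model and retrieval are compared on an equal footing.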
A similar comparison method was used for MOPITT data. The MOPITT instrument on NASA's Terra satellite measures CO from 2000 to the present.
The ACE-FTS instrument on CSA's SCISAT satellite has measured the trace gases O 3 , CO, NO, NO 2 , and CH 4 , among over 30 others, from 2004 to the present. SCISAT has a high-inclination orbit, giving its instruments better coverage in the Arctic. ACE-FTS is a limb-sounding instrument that measures solar absorption spectra, from which the concentrations of dozens of trace gases are retrieved from the upper troposphere to the thermosphere. This gives us the opportunity to evaluate the 3-D model output in a region of the atmosphere where the radiative forcing of ozone is at its highest. Evaluating models with ACE-FTS measurements also provides insight into models' transport and upper-tropospheric chemistry. As was shown in Kolonjari et al. (2018), 3-hourly model output (rather than monthly mean output) is required for accurate comparisons to ACE-FTS data; thus, only models that provided output at this time frequency were compared to ACE-FTS measurements. The model output was sampled to match the times and locations of ACE-FTS measurements. We used an updated version of the advanced method in Kolonjari et al. (2018). Instead of taking the model output at the closest time to the ACE-FTS measurement time, the model output was linearly interpolated onto the ACE-FTS time. This reduces the bias introduced by diurnal cycles, which can cause certain volume mixing ratios (VMRs; e.g., that of NO and NO 2 ) to vary significantly between model output times. As in Kolonjari et al. (2018), the model output is also interpolated vertically in log pressure space and bilinearly in latitude and longitude to account for spatial variation between model grid points.
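The time and vertical sampling described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of Kolonjari et al. (2018) itself: it covers only the linear interpolation in time and the interpolation in log pressure, omits the bilinear horizontal step, and uses hypothetical function names and values.

```python
import numpy as np

def interp_time(t_obs, t0, t1, prof0, prof1):
    """Linearly interpolate two model profiles (valid at t0 < t1) to t_obs."""
    w = (t_obs - t0) / (t1 - t0)
    return (1.0 - w) * np.asarray(prof0, float) + w * np.asarray(prof1, float)

def interp_log_pressure(p_obs, p_model, prof):
    """Interpolate a model profile to pressure p_obs, linearly in log(p)."""
    logp = np.log(np.asarray(p_model, float))
    order = np.argsort(logp)  # np.interp requires increasing x values
    return np.interp(np.log(p_obs), logp[order], np.asarray(prof, float)[order])

# Hypothetical model output at 0 h and 3 h on three pressure levels (hPa)
p_levels = np.array([1000.0, 100.0, 10.0])
prof_0h = np.array([1.0, 2.0, 3.0])
prof_3h = np.array([3.0, 4.0, 5.0])

# Measurement taken at 1.5 h, halfway (in log pressure) between 1000 and 100 hPa
prof_at_obs_time = interp_time(1.5, 0.0, 3.0, prof_0h, prof_3h)
p_mid = np.sqrt(1000.0 * 100.0)
value_at_obs = interp_log_pressure(p_mid, p_levels, prof_at_obs_time)
```

Interpolating in time before sampling avoids the step changes that a nearest-time match would introduce for species with strong diurnal cycles, which is the motivation given in the text for NO and NO 2 .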

Measurement campaigns
Finally, there were air-and ship-based measurement campaigns of black carbon that were used for model evaluation. Aircraft campaigns allow vertical profiles of chemical species to be evaluated, and ship campaigns allow for in situ measurements in the remote Arctic seas.
Output from the AMAP models that submitted 3-hourly BC was linearly interpolated onto the aircraft locations in space and time using the Community Intercomparison Suite (CIS; Watson-Parris et al., 2016) in order to provide representative comparisons and robust evaluation.
[...] (Fig. 17). Models that provided 3-hourly BC output were compared to these observations. The Russian measurements of aerosol eBC concentrations were determined continuously using an Aethalometer purposely designed by MSU/CAO (Popovicheva et al., 2017). Light attenuation caused by the particles depositing on a quartz fiber filter was measured, and the light attenuation coefficient of the collected aerosol was calculated. eBC concentrations were determined continuously by converting the time-resolved light attenuation to the eBC mass corresponding to the same attenuation and characterized by a specific mean mass attenuation coefficient, in calibration with AE33 (Magee Scientific).

Ship campaigns
The Japanese measurements provide rBC (refractory BC). Pileci et al. (2021) showed that rBC and eBC are linearly related; thus, in order to compare the observations to models, we converted rBC to eBC via a factor of 1.8 (eBC = 1.8 × rBC; Zanatta et al., 2018;Pileci et al., 2021).

Model-measurement comparisons
In this section, we evaluate modeled SLCFs from the 18 participating models with a focus on performance in the Northern Hemisphere midlatitudes (defined for our purposes as 30-60° N) and the Arctic (defined here as > 60° N for simplicity). Unless otherwise noted, the observations are compared to the model grid box that they are located in, and when more than one observation location occurs in the same model grid box, those observations are averaged first before the comparison. We look at spatial patterns in the model biases, as well as the vertical distribution and the seasonal cycles for each species, but first we start by providing a multi-species summary of the annual mean model biases in the surface air.
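The matching of observations to model grid boxes can be sketched as below. This is a minimal illustration assuming a regular latitude-longitude grid whose boxes start at the south-west corner (90° S, 180° W); real model grids differ, and the function names, grid spacing, and station values are hypothetical.

```python
import numpy as np

def grid_index(lat, lon, dlat, dlon):
    """Return the (row, col) of the regular grid box containing a point."""
    n_lat = int(round(180.0 / dlat))
    i = min(int(np.floor((lat + 90.0) / dlat)), n_lat - 1)  # keep 90 N in the top row
    j = int(np.floor(((lon + 180.0) % 360.0) / dlon))
    return i, j

def average_obs_per_box(lats, lons, values, dlat=2.0, dlon=2.5):
    """Average all observations that fall in the same model grid box."""
    sums, counts = {}, {}
    for lat, lon, v in zip(lats, lons, values):
        key = grid_index(lat, lon, dlat, dlon)
        sums[key] = sums.get(key, 0.0) + v
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

# Three hypothetical stations; the first two share a grid box
site_lats = [70.1, 70.9, 45.0]
site_lons = [20.3, 21.0, -75.0]
site_vals = [30.0, 40.0, 50.0]
box_means = average_obs_per_box(site_lats, site_lons, site_vals)
```

Averaging co-located stations first prevents densely instrumented grid boxes from being counted several times against a single model value.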

Multi-species summary
The 2014-2015 average modeled percent biases for surface concentrations of SLCFs are shown in Fig. 3 for each model and the multi-model mean (mmm). This figure is based on the model comparisons at the surface observation locations that will be shown in subsequent sections (Figs. 1, 5, 7, 10, 11, 13, 18, 21, and 23 and additional American observations from the IMPROVE network for BC, SO 2− 4 , and OA). Figure 3b shows that, for surface Arctic concentrations, no one model performs best for all species but that the mmm performs particularly well. It also shows that the model biases vary quite a bit among SLCF species for both the midlatitudes and the Arctic. It is important to note that there are many more measurement locations at midlatitudes compared to in the Arctic. BC, CH 4 , O 3 , and PM 2.5 have the smallest model biases out of the SLCFs of this study, whereas OA, CO, and NO 2 have larger model biases.
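The summary metrics can be illustrated with a short sketch. The exact definitions used for Fig. 3 are not given in this section, so the code assumes the percent bias is the mean model-minus-observation difference divided by the mean observation, and that the mmm is the average over the models that provided output. All function names and values are hypothetical.

```python
import numpy as np

def percent_bias(model, obs):
    """Mean percent bias over locations where both values are available."""
    m = np.asarray(model, float)
    o = np.asarray(obs, float)
    ok = np.isfinite(m) & np.isfinite(o)
    return 100.0 * np.mean(m[ok] - o[ok]) / np.mean(o[ok])

def multi_model_mean(model_values):
    """Average over the model axis (axis 0), skipping missing output (NaN)."""
    return np.nanmean(np.asarray(model_values, float), axis=0)

def correlation(model, obs):
    """Pearson correlation coefficient between model and observations."""
    return np.corrcoef(model, obs)[0, 1]

# Hypothetical observations at four sites and output from three models;
# the third model did not provide a value at the second site.
obs = np.array([1.0, 2.0, 3.0, 4.0])
models = np.array([
    [1.5, 2.5, 3.5, 4.5],     # biased high
    [0.5, 1.5, 2.5, 3.5],     # biased low
    [1.0, np.nan, 3.0, 4.0],  # partial output
])
mmm = multi_model_mean(models)
bias_high = percent_bias(models[0], obs)
bias_low = percent_bias(models[1], obs)
bias_mmm = percent_bias(mmm, obs)
r_mmm = correlation(mmm, obs)
```

In this constructed case the high and low biases of the individual models cancel in the mmm, which is one reason an ensemble mean can outperform any single member.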
We find for half of the SLCF species that the mmm percent bias decreases with latitude (Fig. 4). O 3 , NO 2 , BC, and SO 2− 4 have a negative slope in the bias vs. latitude figure. Thus, where the mmm bias is high at the midlatitudes, it is close to zero in the Arctic (O 3 ), and where the mmm bias is near zero at midlatitudes, it is negative in the Arctic (NO 2 , BC, SO 2− 4 ). This implies that there is not enough long-range transport from the midlatitude source regions to the Arctic. That said, the mmm CH 4 bias stays relatively constant with latitude, and we will see in Sect. 4.2 that this result is model-dependent. The CO, PM 2.5 , and OA mmm biases have an increasing trend with latitude. However, both CO and PM 2.5 have no observation locations in the high Arctic, so those results cannot represent long-range transport. OA only has one observation location in the high Arctic, and its bias is very large overall, so issues other than long-range transport are at play, as we will see in the following discussion (Sect. 4.7).
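A slope of bias against latitude, as discussed above, could be estimated with an ordinary least-squares fit. The text does not state how the slopes in Fig. 4 were derived, so the following is only an assumed approach with a hypothetical function name and synthetic values.

```python
import numpy as np

def bias_latitude_slope(lats, percent_biases):
    """Least-squares slope (% per degree) and intercept of bias vs. latitude."""
    slope, intercept = np.polyfit(lats, percent_biases, 1)
    return slope, intercept

# Synthetic example: a positive midlatitude bias that shrinks toward the pole
lats = np.array([35.0, 45.0, 55.0, 65.0, 75.0, 82.0])
biases = 30.0 - 0.5 * (lats - 30.0)
slope, intercept = bias_latitude_slope(lats, biases)
```

A negative slope combined with a positive midlatitude bias corresponds to the O 3 case in the text, while a negative slope with a near-zero midlatitude bias corresponds to the NO 2 , BC, and SO 2− 4 case.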
Of course, there are spatial, temporal, and model differences in the results, so we will now explore model performance for each SLCF in more detail in the next subsections.

Methane
Measured annual mean surface methane is shown in the top left panel of Fig. 5, along with model biases in the rest of the panels. Recall that unlike the rest of the SLCF species in this study, CH 4 concentrations were prescribed in these models from the same CH 4 dataset (Olivié et al., 2021). That said, the different decisions by modelers on how those global CH 4 concentrations are distributed lead to differences in how these models compare to measurements. The mean model biases are small and mainly positive; in the midlatitudes, the multi-model mean bias is +145 ppbv (or +8.5 %), and in the Arctic, the mmm bias is 24 ppbv (or 1.3 %), which means that the models simulate the magnitude of surface CH 4 well, though still outside the < 1 % measurement uncertainty range. There is a gradient in CH 4 VMRs (higher in the Northern Hemisphere and lower in the Southern Hemisphere) that is seen in the measurements (Fig. 5, top left) and reported in the literature (e.g., Dlugokencky et al., 1994), though it is not well captured by the CMAM, MRI-ESM2, and UKESM1 models, which are all biased low in the Northern Hemisphere and biased higher towards the south. That is because of the simplifications made in these models' distributions of CH 4 . For example, CMAM used a single global average CH 4 concentration that is interpolated linearly in time from once-yearly values. Figure 5 also shows that observed annual mean surface CH 4 ranges geographically from about 1500 to 2100 ppbv depending on location; however, the models have a much smaller range because they prescribe CH 4 concentrations as a lower boundary input. For example, CMAM CH 4 volume mixing ratios only span about ±3 ppbv around 1836 ppbv. The span of MRI-ESM2 surface CH 4 is even smaller. GEOS-Chem, GISS-E2.1, and OsloCTM have a more realistic range of 1700-2000 ppbv, though they still do not capture the full variability that is seen in surface CH 4 mixing ratios close to CH 4 sources.
However, in the free troposphere (above the boundary layer), we have TES satellite measurements of CH 4 that show that CH 4 is much more smoothly distributed aloft. Thus, the simplification of prescribing CH 4 concentrations in the models is more realistic there (Fig. 6, showing the 600 hPa level in the mid-troposphere). Additionally, Fig. 6 better illustrates the latitudinal gradient in CH 4 over the globe and its lack in some models, which have more negative biases in the Northern Hemisphere and more positive biases in the Southern Hemisphere. Other models, such as GISS-E2.1, do a good job of capturing the global distribution of CH 4 .
In the Arctic, the vertical cross section of CH 4 VMRs over time as measured by the ACE-FTS in the middle to upper troposphere and in the stratosphere is shown in Fig. S1. There is a large decrease in CH 4 above the tropopause at around 300-100 hPa. The models are all biased low around 300 hPa and high around 100 hPa. This pattern is true for midlatitudes as well as in the Arctic and may imply that the altitude of the modeled tropopause is too low. This same conclusion was also found in Whaley et al. (2022a) via comparisons of these models' simulations to ozonesonde measurements and in our satellite O 3 comparison in the next section. The CH 4 model-measurement correlation coefficients for ACE-FTS are relatively high (e.g., R = 0.48 to 0.86 depending on the model).
Therefore, the general model evaluation for CH 4 indicates that because models do not explicitly model CH 4 emissions, they do not simulate the surface-level variability of CH 4 VMRs. Models differ in their global distribution of CH 4 ; thus, only some contain the north-south CH 4 gradient. The models that lack this gradient have the largest underestimations of Arctic tropospheric CH 4 . The CH 4 evaluation also implies that the modeled tropopause height may be too low.

Ozone
Tropospheric O 3 is the third most important greenhouse gas (IPCC, 2021), and it is a regional pollutant that causes damage to human health and ecosystems. O 3 is a secondary pollutant formed in the troposphere via photochemical oxidation of volatile organic compounds in the presence of nitrogen oxides (NO x = NO + NO 2 ). As such, models must simultaneously simulate the meteorological conditions, precursor species distributions, and photochemistry correctly in order to accurately simulate O 3 . That said, since surface O 3 is an important contributor to poor air quality, there is significant pressure for models to simulate it accurately, particularly in the heavily populated midlatitudes (e.g., for air quality forecasting). Only models with prognostic O 3 are included in this section. Figure 7 shows the in situ summertime mean O 3 measurements (top left panel) and the model biases (remaining panels); the same comparison for the annual mean is shown in the Supplement (Fig. S2). These comparisons use O 3 averaged from hourly observations (day and night) and from 3-hourly or monthly modeled O 3 , depending on which was available for each model. Surface O 3 is overpredicted by most models, which has been documented previously (Makar et al., 2017). It has been shown that models can have problems producing low O 3 overnight (Brown et al., 2006; Lin et al., 2008). In the Arctic, simulated surface O 3 has more mixed results. Annual mean concentrations are of the order of 40 ppbv, and individual model biases range from −20 % to +52 % globally on average for 2014-2015. The multi-model mean has a bias of +11 % for the Arctic, but this is not uniformly spatially distributed. All models overestimated surface O 3 in Alaska (mainly due to the overestimation of summertime concentrations, discussed below), and most models have too little O 3 at the Greenland location and in northern Europe. The models were all able to represent the summertime peak in the midlatitude seasonal cycle (not shown).
In contrast to the more polluted midlatitudes, where surface O 3 peaks in the summertime due to photochemical production being at a maximum, Arctic O 3 is more influenced by the Brewer-Dobson circulation, bringing a maximum of tropospheric O 3 in the springtime due to photochemical production (Wespes et al., 2012), descent from the stratosphere, and more long-range transport of O 3 to the Arctic. Figure 8 shows this springtime peak in both the western (a, longitude < 0°) and eastern (b, longitude > 0°) Arctic in the measurements. However, the models only capture that seasonal cycle in the eastern Arctic (Fig. 8b), implying that the models represent large-scale circulation and possibly stratosphere to troposphere exchange well. But it is interesting to note that the models that have sophisticated representation of stratosphere-troposphere exchange (such as CMAM, MRI-ESM2, UKESM1) do not particularly stand out as better performers in Fig. 8 compared to models that do not simulate the stratosphere (such as DEHM, MATCH, MATCH-SALSA). Thus, its impact on surface O 3 may be very small.
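The split into western and eastern Arctic seasonal cycles can be sketched as follows, using the longitude sign and the > 60° N Arctic definition given in the text. The function name and the synthetic station data are hypothetical and do not reproduce the stations behind Fig. 8.

```python
import numpy as np

def regional_seasonal_cycle(monthly, lons, lats, lat_min=60.0):
    """Mean seasonal cycle of Arctic sites, split by longitude sign.

    monthly: array of shape (n_sites, 12) with monthly mean values.
    Returns (west_cycle, east_cycle), each of length 12.
    """
    monthly = np.asarray(monthly, float)
    lons = np.asarray(lons, float)
    lats = np.asarray(lats, float)
    arctic = lats > lat_min
    west = arctic & (lons < 0.0)
    east = arctic & (lons > 0.0)
    return np.nanmean(monthly[west], axis=0), np.nanmean(monthly[east], axis=0)

# Synthetic monthly O3 (ppbv): two Arctic sites peaking in April (index 3)
# and one midlatitude site peaking in July (index 6), which is excluded.
months = np.arange(12)
spring_peak = 35.0 + 5.0 * np.cos(2.0 * np.pi * (months - 3) / 12.0)
summer_peak = 35.0 + 5.0 * np.cos(2.0 * np.pi * (months - 6) / 12.0)
monthly = np.vstack([spring_peak, spring_peak + 2.0, summer_peak])
site_lons = [-150.0, 20.0, -75.0]
site_lats = [71.0, 78.0, 45.0]
west_cycle, east_cycle = regional_seasonal_cycle(monthly, site_lons, site_lats)
west_peak_month = int(np.argmax(west_cycle))
east_peak_month = int(np.argmax(east_cycle))
```

Applying the same function to observed and modeled monthly means at the same sites would allow the timing of the peak to be compared region by region.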
In the western Arctic (Alaska mainly, Fig. 8a), models overestimate summertime Arctic O 3 , likely due to overpredicting the impact of wildfire emissions on tropospheric O 3 concentrations, which is a research topic with high uncertainty (van der Werf et al., 2010;Monks et al., 2015;Arnold et al., 2015). Another possibility is that modeled O 3 dry deposition over boreal vegetation is underestimated (Stjernberg et al., 2012;Thorp et al., 2021).
Some Arctic locations are more inclined to get springtime surface O 3 depletion due to bromine explosions and halogen chemistry (Bottenheim et al., 1986; Barrie et al., 1988; Simpson et al., 2007). None of the model simulations in this study contain the necessary tropospheric halogen chemistry to simulate those events, which partly explains why some models in Fig. 8 (bottom) overestimate springtime O 3 concentrations. That particular feature is explored further on a site-to-site basis in Whaley et al. (2022a).
The next subsection shows that both the O 3 precursors CO and NO 2 are underestimated compared to measurements at all global locations. This has implications for simulated tropospheric O 3 chemistry.
Free-tropospheric O 3 - satellite comparisons. Aircraft-based measurements and ozonesondes can provide insight into the vertical distribution of O 3 , and these have been well documented (e.g., Tarasick et al., 2019; Whaley et al., 2022a). However, model grid boxes may not be representative of those fine-spatial-scale measurements. In this study, we examine how the model biases change in the vertical when compared to satellite measurements, which have a larger, "smoothed out" spatial sensitivity due to their viewing geometry and retrieval methods. Specifically, we compare modeled O 3 to TES and ACE-FTS satellite-based retrievals. These satellite instruments also have better global coverage than aircraft and sonde-based measurements.
The model fractional biases compared to TES measurements from near the surface up to 100 hPa are shown in Fig. 9 for the Arctic (left) and midlatitudes (right). All models' simulated fractional biases have similar vertical profiles for both the Arctic and midlatitudes, with greater negative values at lower levels and a more positive "bulge" of about 10 % around 300 hPa in the Arctic and about 5 % around 200 hPa at midlatitudes. That bulge in model biases at 300 hPa was also seen to a greater degree (50 %-70 %) when comparing these model simulations to Arctic ozonesonde measurements in Whaley et al. (2022a). Compared to TES, which has much lower vertical resolution, the results are not as striking but are consistent with ACE-FTS measurements. On average, the models have a negative bias at all vertical levels in the Arctic region and in the lower troposphere in the midlatitude region, whereas a positive bias is seen in the upper troposphere below 60° N. This is consistent for the two time periods (2008-2009 and 2014-2015).
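As an illustrative sketch of the metric shown in Fig. 9 (not the analysis code used in this study), the fractional bias can be computed level by level, assuming it is defined as (model − observation)/observation. The function name, the `reference_scale` argument, and all numbers below are hypothetical; the optional scale factor shows how a known multiplicative bias in the reference retrieval would shift the result.

```python
def fractional_bias_profile(model, observed, reference_scale=1.0):
    """Return (model - obs) / obs for each level, in percent.

    reference_scale > 1 means the reference data are assumed to be biased
    high by that factor, so they are divided by it before the comparison.
    """
    if len(model) != len(observed):
        raise ValueError("profiles must be defined on the same levels")
    biases = []
    for m, o in zip(model, observed):
        corrected = o / reference_scale
        biases.append(100.0 * (m - corrected) / corrected)
    return biases


# Hypothetical O3 volume mixing ratios (ppbv) on four pressure levels.
levels_hpa = [800, 600, 400, 300]
satellite = [50.0, 60.0, 80.0, 100.0]
model = [43.5, 52.2, 69.6, 110.0]

raw = fractional_bias_profile(model, satellite)
# Illustrative assumption: the reference is biased high by 13 %.
adjusted = fractional_bias_profile(model, satellite, reference_scale=1.13)

for p, r, a in zip(levels_hpa, raw, adjusted):
    print(f"{p} hPa: raw {r:+.1f} %, adjusted {a:+.1f} %")
```

With these made-up numbers, a model that appears 13 % low against a reference that is itself 13 % high ends up within about 2 % of the corrected reference, while an apparent positive bias aloft grows larger.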
Given that Fig. 7 shows positive biases near the midlatitudes, while Fig. 9 shows lower O 3 in the free troposphere, these results imply that there is not enough vertical lifting and/or mixing of tropospheric O 3 in most of the models. However, the TES measurements have been shown to be biased high by approximately 13 % throughout the troposphere (Verstraeten et al., 2013), which is the same amount that the mmm is low. Similarly, ACE-FTS O 3 has an uncertainty range of +5 %-10 % when compared to O 3 from other satellite limb-view observations (Sheese et al., 2017). The ACE-FTS comparison for O 3 can be found in the Supplement (Fig. S3), showing higher model biases around 300-100 hPa (except for GEOS-Chem) and good agreement below that. Overall, participating models therefore simulate the free-tropospheric O 3 reasonably well and within the uncertainty limits of the observations.
Therefore, the general model evaluation for O 3 indicates that all models overestimate surface O 3 at midlatitudes, and that, combined with a lack of O 3 transport to the Arctic, results in modeled Arctic O 3 VMRs having relatively little bias (the right answer for the wrong reason). The summertime evaluation implies that models overestimate the O 3 produced and transported by wildfires in the western Arctic. The O 3 evaluation also implies that the modeled tropopause height may be too low.

O 3 precursors: carbon monoxide and nitrogen oxides
The annual mean surface CO underestimation is mainly dominated by the wintertime (e.g., the mmm bias in DJF is −92 %), when it has been reported that CO emissions from combustion are too low (e.g., Kasibhatla et al., 2002; Pétron et al., 2002). All the models exhibit a large negative bias over China, which is consistent with the study by Quennehen et al. (2016) and is attributed to the enhanced destruction of CO by OH radicals, but it was also found in Kasibhatla et al. (2002) and Pétron et al. (2002) that bottom-up CO emission inventories in Asia are greatly underestimated.
In the free troposphere, we compare modeled CO to that measured by MOPITT. Figure 12 shows these comparisons for the summertime (JJA) mean at the 600 hPa level, and Fig. S6 in the Supplement shows the same for the springtime (MAM) mean. Unlike in the winter and spring when all models are biased low at midlatitudes, there is more variability in the summertime biases, with about half the models overestimating free-tropospheric CO and the other half underestimating it. In the Arctic region, all models are biased high in the summer but low in the springtime. In the winter, MOPITT does not have coverage in the Arctic. Monks et al. (2015) discussed the fact that models had high biases in the tropospheric column of CO compared to MOPITT measurements in the outflow from Asia and low biases north of there due to lack of transport. The Quennehen et al. (2016) study also suggested that summertime CO transport out of Asia is too zonal when comparing models to IASI CO columns. At 600 hPa, where CO concentrations are lower and the atmosphere is more mixed, these features do not appear.
In the upper troposphere and stratosphere, modeled CO and NO x monthly time series are compared to measurements from the ACE-FTS satellite instrument (where NO x = NO + NO 2 , which are measured separately), and those results are shown as Taylor diagrams in Fig. S7, along with O 3 and CH 4 at 150 hPa, which is in the upper troposphere-lower stratosphere (UTLS) region. The contours show the model's overall skill as defined in Hegglin et al. (2010). Only the models that simulate the stratosphere were included, and the results show that there is a range in model performance by SLCF species, with no one model performing best for all. Comparison statistics for CO were poorer than those for O 3 , CH 4 , and NO x .
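For readers unfamiliar with Taylor diagrams, the three quantities they display can be sketched as follows. This is a minimal illustration that assumes population (1/N) statistics and normalization by the observed standard deviation; the function name and the monthly values are hypothetical, and the skill contours of Hegglin et al. (2010) are not reproduced here.

```python
import math


def taylor_statistics(model, observed):
    """Return (correlation, normalized std. dev., normalized centered RMSE).

    Both series must have the same length and non-zero variance.
    """
    n = len(model)
    if n != len(observed) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_m = sum(model) / n
    mean_o = sum(observed) / n
    anom_m = [m - mean_m for m in model]
    anom_o = [o - mean_o for o in observed]
    std_m = math.sqrt(sum(a * a for a in anom_m) / n)
    std_o = math.sqrt(sum(a * a for a in anom_o) / n)
    cov = sum(a * b for a, b in zip(anom_m, anom_o)) / n
    corr = cov / (std_m * std_o)
    # Centered RMSE: the mean bias is removed before differencing.
    crmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(anom_m, anom_o)) / n)
    return corr, std_m / std_o, crmse / std_o


# Hypothetical monthly CO time series (ppbv) at one pressure level.
obs_co = [100, 105, 110, 120, 115, 105, 95, 90, 88, 90, 95, 98]
model_co = [90, 96, 104, 110, 108, 100, 92, 85, 80, 84, 88, 92]

r, sigma_norm, crmse_norm = taylor_statistics(model_co, obs_co)
print(f"R = {r:.2f}, sigma_model/sigma_obs = {sigma_norm:.2f}, "
      f"centered RMSE/sigma_obs = {crmse_norm:.2f}")
```

The three values are linked by the law of cosines, E'^2 = 1 + s^2 − 2sR (with s the normalized standard deviation and E' the normalized centered RMSE), which is what allows them to be drawn on a single two-dimensional diagram.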

Black carbon
In this section, we examine the spatial and seasonal distributions of BC using ground-based measurements, which are primarily available in North America, Europe, and several locations in the Arctic, as well as measurements from two ship-based campaigns and several aircraft campaigns. Given the limited global data available for both BC and SO 2− 4 (e.g., we could find none freely available for Asia), we focus the plots on the Arctic region here, and given that the magnitude of BC and SO 2− 4 does not span a wide range throughout the Arctic, we show model biases as percent rather than absolute differences as was done in previous sections for trace gas species shown globally. We also analyze the BC model-measurement comparisons, keeping in mind that because there are various definitions and measurement types for BC, we consider agreement within a factor of 2 to be within the uncertainty range (Sect. 3.1.3). Figure 13 (top left panel) shows annual mean surface-level concentrations of black carbon (BC) at nine Arctic observation stations and (remaining panels) the model percent biases there. The annual mean BC concentrations are of the order of less than 1 µg m −3 , and most models tend to underestimate BC in the high Arctic while overestimating in Alaska and Scandinavia. This result could be partially explained by the discrepancy caused by the use of EC and eBC data, which are not the same (Sect. 3.1.3). As aerosols age during transport to high-Arctic locations, their eBC (based on absorption converted to mass) gets more and more of a positive bias compared to EC. As models are more representative of EC, they will not be able to agree with eBC measurements in aged air at high-Arctic remote stations, such as Gruvebadet, Zeppelin, Alert, and Utqiaġvik. This is in contrast to the Alaskan and European stations, which are closer to sources where BC is more fresh; thus, the eBC measurements have lower uncertainty.
That said, a few models (CanAM5-PAM, DEHM, and FLEXPART) overestimate BC concentrations in the high Arctic. Overall, individual model biases range within ±100 % at individual sites.
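The factor-of-2 agreement criterion used above can be made concrete with a short sketch. The function names, site labels, and concentrations are hypothetical and serve only to show how the criterion relates to percent bias.

```python
def percent_bias(model, observed):
    """Model bias relative to the observation, in percent."""
    return 100.0 * (model - observed) / observed


def within_factor(model, observed, factor=2.0):
    """True if observed / factor <= model <= observed * factor."""
    if observed <= 0 or model < 0:
        raise ValueError("concentrations must be positive")
    return observed / factor <= model <= observed * factor


# Hypothetical annual mean BC concentrations (ug m-3): (observed, modeled).
sites = {
    "site_A": (0.10, 0.03),
    "site_B": (0.10, 0.07),
    "site_C": (0.10, 0.19),
    "site_D": (0.10, 0.25),
}

for name, (obs, mod) in sites.items():
    print(f"{name}: bias {percent_bias(mod, obs):+.0f} %, "
          f"within factor of 2: {within_factor(mod, obs)}")
```

Note that a factor of 2 is asymmetric in percent terms: it corresponds to biases between −50 % and +100 %, so an underestimate of 70 % fails the criterion while an overestimate of 90 % passes.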
The underestimation of high-Arctic atmospheric BC concentrations may be related to excessive BC deposition further south; however, there are very few BC deposition measurements. In the Arctic, we can evaluate total (wet + dry) modeled deposition via derived ice core measurements. There were six ice cores on Greenland and one in the European Arctic in Spitsbergen (Lomonosovfonna). Figure 14 shows that all models overestimate BC deposition fluxes at the ice core locations. While the ice cores contain BC data starting in the year 1750, only data after 1990 have been used to match the modeled time period (1990-2015, 1995-2015, 2008-2009, and 2014-2015, depending on the model).
The measured BC deposition flux values on Greenland vary with elevation (lower fluxes at higher elevation). Summit (3177 m a.s.l.) has an average of 285 µg m −2 yr −1 in contrast to ACT2 (2461 m a.s.l.) with 676 µg m −2 yr −1 . BC deposition is highest in the European Arctic at Spitsbergen with 856 µg m −2 yr −1 . For all seven ice cores used in this comparison, the averaged model mean is 3 times as high as the observations. At D4 (2728 m a.s.l.) the modeled mean corresponds best to the observation, with a mean bias of +83 %. At ACT11 (2296 m a.s.l.) the models have 4 times the deposition flux compared to the measurements. Generally though, the model mean is skewed higher by FLEXPART and DEHM (Fig. 14), which also had higher atmospheric BC concentrations. A few models simulated less BC deposition than observed at these sites, and these models also underestimated BC atmospheric concentrations. Thus, it is difficult to conclude that deposition biases are a cause for atmospheric biases when the two are inter-related parameters.
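Because the deposition results above mix two ways of expressing model-measurement differences (a ratio such as "3 times as high" and a percent bias such as +83 %), the conversion between them is sketched below. The helper names are hypothetical; the numbers are those quoted in the text.

```python
def ratio_to_percent_bias(ratio):
    """Convert a model/observation ratio to a percent bias."""
    return (ratio - 1.0) * 100.0


def percent_bias_to_ratio(bias_percent):
    """Convert a percent bias to a model/observation ratio."""
    return 1.0 + bias_percent / 100.0


# Values quoted in the text for the ice core comparison.
print(ratio_to_percent_bias(3.0))   # all-core average: 3 times the observations
print(ratio_to_percent_bias(4.0))   # ACT11: 4 times the observations
print(percent_bias_to_ratio(83.0))  # D4: mean bias of +83 %
```

Expressed on a common scale, the all-core average corresponds to a bias of +200 %, ACT11 to +300 %, and the best-matching site (D4) to a ratio of about 1.8.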
The seasonal cycles of surface-level atmospheric BC concentrations at several Arctic locations are shown in Fig. 15. As seen in Eckhardt et al. (2015), some models still underestimate wintertime BC, but many models now show similar seasonality as the observations. The multi-model mean also captures the monthly variations well, including the summertime peak at some Alaskan sites caused by fire emissions. The multi-model mean Arctic BC is underestimated in winter (−24 %) and overestimated in the summer (+32 %), though overall, this is an improvement in model performance in simulating Arctic BC since the 2015 AMAP assessment of black carbon and ozone as climate forcers, which had a −59 % winter bias and a +88 % summertime bias (AMAP, 2015a; Eckhardt et al., 2015). However, it is difficult to make direct comparisons to that report as those values were for a smaller set of Arctic locations, different observation periods, and with a different set of models (though many overlapping). The model improvement may be due to the improved anthropogenic emissions of BC, particularly from northern Russia, where flaring emission factors were increased in ECLIPSEv6B compared to those used for the 2015 AMAP report.
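Seasonal biases such as the winter and summer values quoted above can be derived from monthly climatologies. The sketch below assumes that the bias is computed from unweighted three-month means (DJF, MAM, JJA, SON) rather than by averaging monthly biases; this choice, the function name, and the monthly values are illustrative assumptions, not the procedure or data of this study.

```python
SEASONS = {
    "DJF": (12, 1, 2),
    "MAM": (3, 4, 5),
    "JJA": (6, 7, 8),
    "SON": (9, 10, 11),
}


def seasonal_percent_bias(model_monthly, obs_monthly):
    """Percent bias of seasonal means from 12 monthly means (Jan..Dec)."""
    if len(model_monthly) != 12 or len(obs_monthly) != 12:
        raise ValueError("expected 12 monthly values")
    result = {}
    for name, months in SEASONS.items():
        model_mean = sum(model_monthly[m - 1] for m in months) / 3.0
        obs_mean = sum(obs_monthly[m - 1] for m in months) / 3.0
        result[name] = 100.0 * (model_mean - obs_mean) / obs_mean
    return result


# Hypothetical monthly mean BC (ng m-3) with a winter-spring maximum.
obs = [60, 70, 65, 50, 30, 15, 10, 12, 15, 25, 40, 55]
mmm = [45, 52, 50, 42, 28, 18, 14, 15, 16, 24, 35, 42]

biases = seasonal_percent_bias(mmm, obs)
for season, value in biases.items():
    print(f"{season}: {value:+.0f} %")
```

In this made-up example the winter bias is negative and the summer bias positive, the same sign pattern as reported for the multi-model mean, even though the absolute summer differences are only a few ng m −3 because the observed summer concentrations are low.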
Most models have reasonable spatial correlation with the measurements across the Arctic in that they correctly simulate the range of BC concentrations that appear across the Arctic (e.g., higher concentrations at Hurdal, lower concentrations at Zeppelin), as shown in Fig. S8. However, there are positive model biases at midlatitudes (in North America and Europe; not shown) for surface-level BC. The vertical analysis of BC from the aircraft campaigns (below) provides further insight and support for the suggestion that models do not have adequate long-range transport of the pollutants from their sources in the midlatitudes; thus, they do not simulate enough pollution in the Arctic.
Vertical profiles of BC - aircraft campaigns. Gridded BC output at 3-hourly intervals was provided by 11 of the participating models and was compared to aircraft campaign measurements of BC. The interpolation of model output to flight track coordinates was carried out by tools from the Community Intercomparison Suite (CIS; Watson-Parris et al., 2016), which co-located the extracted model tracks with their corresponding observational values. Figure 16 shows the median vertical profiles of BC concentrations from the aircraft measurements and from the models. At midlatitudes, from 0-2 km, all of the models agree well with the measurements. However, BC concentrations decline steeply in a few models (e.g., MATCH-SALSA, EMEP-MSC-W, and GEOS-Chem) above 2 km. It would appear that they do not have enough vertical lifting of BC and/or perhaps too short a BC lifetime. Indeed, one of these is EMEP-MSC-W, for which the short BC lifetime was previously reported in Gliß et al. (2021). That said, in Lund et al. (2018b), the OsloCTM and ECHAM models were shown to overestimate the BC lifetime. In our case, OsloCTM is not shown in Fig. 16 because it did not provide BC at 3-hourly timescales. But ECHAM-SALSA results are consistent with the Lund et al. (2018b) study in that they particularly overestimate BC in the upper altitudes in both the midlatitudes and the Arctic, implying too long a lifetime and too much long-range transport into the upper Arctic atmosphere. The measured BC profile at midlatitudes drops off more quickly around the tropopause at 11-12 km, and, except for CanAM5-PAM, the models do not reproduce this drop.
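The co-location and profile-averaging steps described above can be sketched in plain Python. This is not the CIS API and not the code used in this study: it substitutes nearest-neighbour sampling for interpolation, ignores longitude wrap-around, and uses hypothetical function names, grid axes, and BC values throughout.

```python
from statistics import median


def nearest_index(axis, value):
    """Index of the coordinate on `axis` that is closest to `value`."""
    return min(range(len(axis)), key=lambda i: abs(axis[i] - value))


def colocate(track, axes, field):
    """Sample a gridded field at each flight-track point (nearest neighbour).

    track: list of (time_h, lat, lon, alt_km, observed_bc) tuples
    axes:  dict with 'time', 'alt', 'lat', 'lon' coordinate lists
    field: nested list indexed as field[time][alt][lat][lon]
    Returns a list of (alt_km, observed_bc, model_bc) tuples.
    """
    pairs = []
    for t, lat, lon, alt, obs in track:
        it = nearest_index(axes["time"], t)
        ia = nearest_index(axes["alt"], alt)
        iy = nearest_index(axes["lat"], lat)
        ix = nearest_index(axes["lon"], lon)
        pairs.append((alt, obs, field[it][ia][iy][ix]))
    return pairs


def median_profiles(pairs, bin_edges_km):
    """Median observed and modeled value in each altitude bin [lo, hi)."""
    profile = []
    for lo, hi in zip(bin_edges_km[:-1], bin_edges_km[1:]):
        in_bin = [(o, m) for alt, o, m in pairs if lo <= alt < hi]
        if in_bin:  # skip bins without any samples
            profile.append((lo, hi,
                            median(o for o, _ in in_bin),
                            median(m for _, m in in_bin)))
    return profile


# Hypothetical 3-hourly model grid whose BC (ng m-3) varies only with altitude.
axes = {"time": [0, 3, 6], "alt": [1, 3, 5], "lat": [60, 70], "lon": [0, 10]}
bc_by_level = [40.0, 20.0, 10.0]
field = [[[[bc_by_level[ia] for _ in axes["lon"]] for _ in axes["lat"]]
          for ia in range(len(axes["alt"]))] for _ in axes["time"]]

# Hypothetical flight track: (time_h, lat, lon, alt_km, observed BC).
track = [(0.5, 61, 1, 0.8, 35.0), (2.0, 62, 2, 1.5, 45.0),
         (4.0, 68, 8, 3.2, 15.0), (5.5, 69, 9, 4.6, 12.0)]

profile = median_profiles(colocate(track, axes, field), [0, 2, 4, 6])
for lo, hi, obs_med, mod_med in profile:
    print(f"{lo}-{hi} km: observed {obs_med}, model {mod_med}")
```

Sampling the model only where and when the aircraft flew, and then taking medians in altitude bins, keeps the comparison from being distorted by regions or times the campaign never measured.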
In the Arctic profiles (Fig. 16), the modeled and observed profiles do not decline with altitude throughout the troposphere, but the observed median BC concentration does drop sharply around 9 km, again near the Arctic tropopause, and again, the only model to capture that change is CanAM5-PAM. In the Arctic comparisons, the models that did not simulate enough BC aloft at the midlatitudes stand out as having larger underestimates of BC in the Arctic. For example, MATCH-SALSA and EMEP-MSC-W have very low BC throughout the Arctic vertical profile. These results are consistent with the surface BC underestimation in Fig. 15. Therefore, the underestimation seen in the Arctic for those two models is likely due to a lack of long-range transport from the midlatitudes, as well as errors in BC deposition mentioned above. In addition, the Zhao et al. (2021) study showed that different parts of the Arctic BC vertical profile are sensitive to BC transported from different areas of the world. For example, the lower-tropospheric BC is influenced by emissions transported from North America, Russia-Belarus-Ukraine, Europe, and East Asia, whereas upper-tropospheric Arctic BC is mainly influenced by transport from South Asia. Thus, the differences in the model results could be related to differences in how they simulate these transport pathways. Mahmood et al. (2016) found that, overall, considerable differences in wet deposition efficiencies in the models exist and are a leading cause of differences in simulated BC burdens. Results from their model sensitivity experiments indicated that convective scavenging outside the Arctic reduces the mean altitude of BC residing in the Arctic, making it more susceptible to scavenging by stratiform (layer) clouds in the Arctic.
Consequently, scavenging of BC in convective clouds outside the Arctic acts to substantially increase the overall efficiency of BC wet deposition in the Arctic, which leads to low BC burdens compared to simulations without convective BC scavenging. Oshima et al. (2013) also found that convective scavenging at middle and subtropical latitudes removes a significant fraction of BC. In contrast, BC concentrations in the upper troposphere are only weakly influenced by wet deposition in stratiform clouds, whereas lower-tropospheric concentrations are highly sensitive (Mahmood et al., 2016). These findings are consistent with the results of this study, wherein the multi-model median is too high above about 9 km and too low from 0-9 km. Indeed, the MATCH and MATCH-SALSA models, for example, assume reduced scavenging of aerosol in mixed-phase clouds following Liu et al. (2011), which increases long-range transport to the Arctic. It is odd then that MATCH is one of the better-performing models in Fig. 16 and MATCH-SALSA is not. Despite the large range in modeled vertical BC concentrations, the multi-model median is close to the observed throughout the troposphere at both midlatitudes and in the Arctic.
Arctic seas analysis - ship campaigns. From the ship-based measurements, we see that there is a consistent model overestimate of BC in the Pacific region (Japan cruise) where measured concentrations are very low (Fig. 17). Indeed, Taketani et al. (2016) report that BC concentrations were in the range 0-66 ng m −3 , with an overall mean value of just 1.0 ± 1.2 ng m −3 . The models, possibly due to their coarse resolutions, were not able to simulate such low background BC concentrations. However, even the model with the highest resolution (GEM-MACH at 15 km resolution) overestimated BC in the Pacific, though that limited-area model in that region near the boundary would have been heavily influenced by the upwind, coarser-resolution boundary conditions that were assumed (1° × 1° MOZART4 chemical boundary conditions). The high bias in the Pacific may be due to all models overestimating the amount of BC that gets transported off the Asian continent. That model bias is consistent with our low-altitude comparisons of the models to the HIPPO aircraft campaign measurements, which were taken over the northwestern Pacific (Fig. 2). The BC overestimate over the Pacific was also found in the Schwarz et al. (2013) study looking at simulated BC from the AeroCom global model intercomparison initiative compared to HIPPO measurements.
Conversely, the modeled BC concentrations generally agree with measurements in the Russian Arctic Ocean, though they are biased slightly low for the most part. Popovicheva et al. (2017) attribute the higher BC concentrations measured near the Kara Strait (north of 70° N) to gas-flaring emissions, and near Arkhangelsk (White Sea), important sources were midlatitude biomass burning, transportation, and combustion (residential and commercial). Since models were able to simulate this well, their improvement is likely due to improved Russian anthropogenic emissions in ECLIPSE v6B (Sect. 2.1, AMAP, 2022) compared to previous emissions datasets, which did not include enough Russian flaring emissions. The best model results were from ECHAM-SALSA and MATCH when compared to all of the ship campaign data.
Therefore, the general model evaluation for BC indicates that while there is a large variability in model results, they tend to overestimate surface BC at midlatitudes (including over the Pacific Ocean) and underestimate surface BC in the Arctic. Again, these results point to a lack of transport to the Arctic and, in this case, too much BC deposition along the way. While we were only able to evaluate BC deposition in the Arctic in this study, those results support the hypothesis of some models having too much BC deposition. The BC vertical profile evaluation also implies that the modeled tropopause height may be too low.

Sulfate
We used monthly mean surface-level observations of SO 2− 4 from 18 Arctic sites to evaluate the models. Figure 18 shows that, similar to BC, the SO 2− 4 concentrations in the high Arctic are underestimated by most of the models. A few models overestimate SO 2− 4 in Scandinavia and Alaska. The model underestimations of SO 2− 4 could be mainly due to higher efficiencies of models in removing aerosol during the long-range transport to the high Arctic. This is consistent with a previous study based on AMAP 2015 model simulations that found that the convective wet deposition outside the Arctic region may have led to different seasonal cycles of aerosol concentrations in the Arctic (Mahmood et al., 2016). Dimethylsulfide (DMS), a naturally occurring source of sulfur from marine algae emissions, could also be misrepresented in models. However, this source would be more pronounced in the summer when there is less sea ice in the high Arctic, and it does not appear as though models are underestimating only in the summertime (Fig. 19). Rather, some models appear too low in the winter and spring, which points towards underestimating local Arctic sources and to a lack of transport from midlatitudes as being the key issues. Despite the individual model differences in representing the seasonal cycle, the multi-model mean compares well with observations at most locations. However, the multi-model mean SO 2− 4 is significantly underestimated at the Alert and Irafoss sites. Mean model biases for all Arctic sites range from −65 % to +80 % among different models, and correlation coefficients are typically around 0.5 (Fig. S9).
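The contrast noted above, between a wide spread of individual model biases and a multi-model mean that compares well with observations, can be illustrated with a small sketch. It assumes an equally weighted mean over whichever models provide a value for a given month; the weighting, the model labels, and all concentrations are hypothetical.

```python
def multi_model_mean(model_series):
    """Equal-weight mean across models per time step, skipping None values."""
    n_steps = len(next(iter(model_series.values())))
    mmm = []
    for step in range(n_steps):
        values = [series[step] for series in model_series.values()
                  if series[step] is not None]
        mmm.append(sum(values) / len(values) if values else None)
    return mmm


def mean_percent_bias(model, observed):
    """Percent bias of the time-mean, using steps where both values exist."""
    pairs = [(m, o) for m, o in zip(model, observed)
             if m is not None and o is not None]
    model_mean = sum(m for m, _ in pairs) / len(pairs)
    obs_mean = sum(o for _, o in pairs) / len(pairs)
    return 100.0 * (model_mean - obs_mean) / obs_mean


# Hypothetical seasonal mean sulfate (ug m-3) at one site.
observed = [0.40, 0.50, 0.60, 0.30]
models = {
    "model_A": [0.20, 0.25, 0.30, 0.15],   # consistently low
    "model_B": [0.60, 0.75, 0.90, 0.45],   # consistently high
    "model_C": [0.40, None, 0.66, 0.30],   # one missing value
}

mmm = multi_model_mean(models)
for name, series in models.items():
    print(f"{name}: {mean_percent_bias(series, observed):+.0f} %")
print(f"mmm: {mean_percent_bias(mmm, observed):+.1f} %")
```

Here the individual biases span −50 % to +50 % while the mean is within about 1 % of the observations, which shows how compensating errors among models can make the ensemble mean look better than any single member.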
The high GEOS-Chem bias in the summertime seen in Fig. 19 was first reported in Breider et al. (2014) and found to be due to problems with cloud pH and cloud liquid water in the summer over the Arctic. In Breider et al. (2017), the summertime Arctic surface SO 2− 4 concentrations are reduced by a factor of 2 by reducing the cloud liquid water content to a uniform value of 1 × 10 −7 g m −3 north of 65° N in the model. The version of GEOS-Chem in our study is more recent and uses the offline cloud liquid water content from both GEOS-FP and MERRA2. We did not scale this variable down, which may be a reason for the high GEOS-Chem sulfate bias in Fig. 19.
The seasonal cycle for observations grouped together is shown in Fig. S10, showing a consistent seasonal cycle for 2008-2009 as seen in the observations. Most models (e.g., CanAM5-PAM, DEHM, MATCH, OsloCTM) are able to capture the seasonal cycle well. However, several models (e.g., CESM, CIESM-MAM7, ECHAM-SALSA, and EMEP-MSC-W) strongly underestimate observed springtime peak values. Conversely, the models and the measurements showed a weaker seasonal cycle during the 2014-2015 time period (Fig. S11). This may be partly due to local pollution sources in the Arctic during wintertime (e.g., Fairbanks). Those highly localized pollution events, caused by local emissions getting trapped in a stable boundary layer, occur on scales that are smaller than the model resolutions employed here can represent. Many models are also missing chemical formation processes for SO 2− 4 in the absence of sunlight, which may explain underestimations seen in winter (e.g., Moch et al., 2018; Alexander et al., 2009). An evaluation of SO 2 could help with our understanding but was beyond the scope of this study. A lack of dark chemistry may also affect organic aerosol, as discussed in the next section. From October to December 2014, the Holuhraun volcanic eruption may have elevated SO 2− 4 concentrations at some locations in the Arctic. However, in our model-measurement comparisons, there does not appear to be a large underestimate during those months, which implies that model performance was not impeded by not including those volcanic emissions.
As mentioned above, the uncertainty in wet deposition could be a significant factor in atmospheric SO 2− 4 model biases. Previous studies have shown that models have too much washout in winter and not enough wet deposition in summer, leading to a "flatter" seasonal cycle than observed (e.g., Fig. 19; Browse et al., 2012; Mahmood et al., 2016). As with BC in the previous section, the SO 2− 4 deposition was evaluated here in the same manner. The average measured SO 2− 4 deposition fluxes from ice cores for all locations (only Greenland was available here) are 18 mg m −2 yr −1 . The lowest observed fluxes are found at D4 (12 mg m −2 yr −1 ) and the highest at ACT11D (30 mg m −2 yr −1 ). The model average for all locations is overestimated by around 20 % compared to measured fluxes. This is similar to the model biases in atmospheric SO 2− 4 concentrations.
Therefore, the general model evaluation for SO 2− 4 indicates that while there is a large variability in model results, as with BC, models underestimate surface SO 2− 4 in the Arctic. The evaluation of SO 2− 4 deposition in the Arctic is similar to BC, with both overestimating deposition.

Organic aerosol
Unfortunately, there is only one high-Arctic station with OA measurements (Alert, NU, Canada); however, there are a few additional stations measuring OA in the sub-Arctic (still > 60° N). These are all shown in Fig. 21. The seasonal cycles are shown in Fig. 22, and the model vs. measurement scatter plot along with some comparison statistics are presented in Fig. S12.
Model biases have a large range of ±200 % at the different locations, but the multi-model mean for the region is +65 %. At midlatitudes (30-60° N), measurements are conducted mainly in the US, where the multi-model mean bias is +83 % for the 2014-2015 average (not shown).
Several models (CanAM5-PAM, DEHM, CIESM-MAM7, ECHAM-SALSA, GEOS-Chem, MRI-ESM2, NorESM, OsloCTM, and UKESM1) are able to simulate the summertime peak in Arctic OA concentrations; however, the other seven models in Fig. 22 simulate a seasonal cycle that is too flat or peaks at the wrong time (e.g., the CESM seasonal cycle peaks too early in the year).
Figures 21 and S11 both show that most models consistently overestimate Alaskan OA and underestimate European OA, consistent with our assessment of other species showing that models are likely overestimating wildfire influence in summertime Alaska. All models, except EMEP-MSC-W, used the GFED fire emissions inventory. EMEP-MSC-W used the FINN fire emissions inventory, which for BC+OC has been shown to be significantly lower than the GFED emissions. As a result, EMEP-MSC-W model biases in Alaska are lower. However, they are not the lowest. MATCH-SALSA has the lowest OA model biases in Alaska, despite that model using GFED emissions. The comparison statistics in Fig. S12 show highly varying results.

Fine particulate matter
PM 2.5 is partly connected to direct and indirect climate effects via its interactions with clouds. It is mainly composed of BC, SO 2− 4 , OA, NO − 3 , and NH + 4 , as well as crustal material (dust), sea salt, and water, though the water component is often dried off during measurements. Model biases of those species will contribute to the total PM 2.5 biases.
While BC, SO 2− 4 , and OA were discussed above, it is beyond the scope of this project to evaluate the other major PM species, which, aside from water, have a smaller radiative impact. Note that the analysis in this section is focused on sub-Arctic and midlatitude sites closer to human populations rather than remote high-Arctic sites due to a lack of data. PM 2.5 is not a typical parameter included in the longer-term remote Arctic observations. Since PM 2.5 has important health impacts, it is well measured at air quality monitoring networks.
The model PM 2.5 biases at several locations in the United States, Europe, and China are within the 60 %-80 % range. However, some models (CanAM5-PAM, CIESM-MAM7, GEOS-Chem, GEM-MACH, and OsloCTM) show biases larger than 200 %, especially in the western US and Alaska. The large inter-model differences in Fig. 23 are likely due to uncertainty in mineral dust. CanAM5-PAM has a particularly large contribution of dust to PM 2.5 . The Turnock et al. (2020) study showed that dust regions were globally one of the largest areas of diversity in PM 2.5 concentrations between different CMIP6 models. The EMEP-MSC-W results are consistent with EMEP annual evaluations for Europe, where the model underestimates PM 2.5 by 10 %-25 %, including a few Arctic sites in Norway and Finland (Gauss et al., 2020, and annual model evaluation reports at https://www.emep.int/publ/common_publications.html, last access: 14 April 2022).
The simulated surface-level SLCFs were quite sensitive to the different meteorological conditions such as boundary layer stability and levels of photochemistry, which differed between the two time periods chosen in this study. 2014-2015 was also a time period with more wildfires compared to 2008-2009. For example, according to the CMIP6 emission data that were used in most of the models, emissions of BC from Canadian wildfires in 2014-2015 were 340 % higher than in 2008-2009, whereas the emissions from the USA and Russia were similar for these years. Given the very intense wildfire emissions and low anthropogenic emissions in northern Canada in 2014-2015, differences in simulated PM 2.5 concentrations over Canada and Alaska can be partly attributed to differences in simulations of wildfire aerosols in the models.
Some models (CanAM5-PAM, CESM, CIESM-MAM7, GEOS-Chem, and WRF-Chem) simulate higher PM 2.5 and more variable PM 2.5 in the summertime (e.g., Fig. S13 in the Supplement). While this is seen to some extent in the observations, this may be due to the way fire emissions and sea salt emissions are treated in these models. Fire emissions, fire plume injection height, plume rise, and wet deposition of fire pollutants are all highly uncertain model processes and a subject of ongoing research (e.g., Urbanski, 2014; Heilman et al., 2014; Paugam et al., 2016). Indeed, the individual model PM 2.5 Arctic biases are more tightly clustered for 2008-2009 when there were fewer fires. Mölders and Kramm (2018) showed that Arctic PM 2.5 seasonal pollution is mainly due to local air pollution in the winter and due to fires in the summer. Figure 24 shows that the annual mean simulated PM 2.5 concentrations compare well with observations and the correlation coefficients are relatively high (R = 0.8 or higher for all models). The high concentrations in China and low concentrations in the US and Europe are captured by the models, providing confidence for health impact assessments that utilize these model results.

Conclusions
In this study, we evaluated the SLCF simulation capabilities of 18 models that were used in the 2022 AMAP SLCF assessment report. Our conclusions are grouped into the questions we aimed to answer in the Introduction.

How well do the AMAP SLCF models perform in the context of measurements and their associated uncertainty?
Recall that the in situ SLCF measurements had the following reported uncertainties: CH 4 1 %, O 3 3 %, CO 5 %, NO 2 5 %-100 %, BC 200 %, SO 2− 4 20 %, OA 20 %, and PM 2.5 1 %-6 %. However, since the variability in measurements from different techniques was only really taken into account for the BC uncertainty, and since we are comparing annual mean results to each other, it is not a fair comparison to say that models and measurements agree with each other if model biases are within the reported measurement uncertainty range. However, we do use those numbers as a rough guideline for "good" model performance in the absence of other quantitative criteria.
Some model annual mean biases were within those uncertainty ranges. For example, CMAM, MRI-ESM2, and UKESM1 simulate Arctic CH 4 to within 1 %, thus agreeing with the CH 4 measurements. However, at midlatitudes, they are all out of range at around +6 %-10 %. MATCH and WRF-Chem simulated midlatitude O 3 to within 2 %, but only MATCH-SALSA was within 3 % in the Arctic region. The Arctic NO 2 measurements are highly uncertain at around 100 %, so all of the models agreed with Arctic NO 2 measurements. However, in the higher-NO 2 midlatitude environment, NO 2 measurement uncertainty is at the smaller end of the range. OsloCTM and WRF-Chem midlatitude NO 2 biases were within 10 %. All models agree with BC measurements in both midlatitudes and the Arctic, as all biases are less than 80 %. CESM, CIESM-MAM7, DEHM, MATCH, UKESM1, and WRF-Chem all simulate midlatitude SO 2− 4 to within 20 %. But only CanAM5-PAM and MRI-ESM2 do the same for the Arctic region. OA had some of the largest model biases (Figs. 3 and 4), though ECHAM-SALSA, EMEP-MSC-W, GISS-E2.1, MATCH, and OsloCTM are all within 20 % in the Arctic but none at midlatitudes. Finally, with the small uncertainty in PM 2.5 , only CIESM-MAM7 in the midlatitudes and GISS-E2.1 in the Arctic agree within 2 %.
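The rough guideline used above, whether a model's annual mean bias falls within the reported measurement uncertainty, can be written as a simple check. The uncertainty values are those recalled in the text; where the text gives a range, both bounds are kept and the caller chooses which one applies. The function name and the example biases are hypothetical.

```python
# Reported in situ measurement uncertainties (percent) as (low, high) bounds.
UNCERTAINTY_PERCENT = {
    "CH4": (1, 1),
    "O3": (3, 3),
    "CO": (5, 5),
    "NO2": (5, 100),
    "BC": (200, 200),
    "SO4": (20, 20),
    "OA": (20, 20),
    "PM2.5": (1, 6),
}


def agrees(species, bias_percent, use_upper_bound=True):
    """True if |bias| is within the chosen measurement uncertainty bound."""
    low, high = UNCERTAINTY_PERCENT[species]
    limit = high if use_upper_bound else low
    return abs(bias_percent) <= limit


# Hypothetical annual mean model biases (percent) for one model and region.
example_biases = {"CH4": 0.8, "O3": -7.0, "NO2": -40.0, "BC": -75.0, "PM2.5": 4.0}

for species, bias in example_biases.items():
    print(f"{species}: bias {bias:+.1f} %, within uncertainty: "
          f"{agrees(species, bias)}")
```

The NO 2 case shows why the choice of bound matters: a −40 % bias passes against the large uncertainty of the low-concentration Arctic environment but fails against the smaller uncertainty that applies in higher-NO 2 midlatitude conditions.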
To summarize the mmm annual mean performance, it "matches" surface observations in the Arctic for CH 4 , NO 2 , BC, SO 2− 4 , OA, and PM 2.5 ; as such, the mmm has the best overall performance for the Arctic. In the midlatitudes the mmm matches surface observations for BC and SO 2− 4 only. Regarding the comparisons of trace gases to the TES, MOPITT, and ACE-FTS satellite measurements (which have roughly 5 %-20 % uncertainty), models agree well. Free-tropospheric distributions of trace gases are somewhat easier to simulate, as common problems like a boundary layer that is too stable or too much deposition do not negatively impact the free-tropospheric SLCF distributions. The variability in the free troposphere is smaller compared to at the surface as well. The better agreement is also due to the previously noted fact that the spatial range covered by remote measurements is more akin to the spatial scale of model grid boxes than that of point measurements from in situ observations.

What do the best-performing models have in common?
There were no models that performed best for all SLCF species and for all regions, highlighting the fact that it is difficult for any one model to bring together numerous complex processes and get results comparable to observations for all SLCFs. This would involve simulating aerosols and chemistry together with the right transport processes, meteorology, and clouds, which is difficult, especially for a remote region like the Arctic where parameterizations might have been built on datasets that are not always applicable there.
In addition, studies like that of Tsigaridis et al. (2014) have shown that there was no clear change in model skill (in that case for OA) with increasing model complexity.
However, several models such as CanAM5-PAM, DEHM, NorESM, and MATCH have better representation of the vertical distribution of BC. DEHM and MATCH also had relatively small biases throughout the O 3 tropospheric profile (CanAM5-PAM and NorESM did not simulate gas-phase SLCFs). MATCH in particular has the smallest surface O 3 bias at midlatitudes, which may be related to its high vertical resolution in the boundary layer (the lowest two layers are 20 m thick and the four lowest layers are below 150 m). These models are a mix of air quality and climate-focused models; thus, it is important to note that there is no obvious difference between climate and air quality model biases for annual mean SLCFs. Despite the lack of complex tropospheric chemistry, CMAM had some of the lowest O 3 biases at both midlatitudes and the Arctic. This may imply that the more complex chemistry is not needed in the context of climatological tropospheric O 3 for climate studies (though of course, O 3 on shorter timescales would need more complete tropospheric O 3 chemistry). In the lower stratosphere, however, models with simplified or climatological O 3 schemes did not perform as well as models that had full stratospheric chemistry included. Generally speaking, the Arctic model biases skew more negative than those at midlatitudes (Fig. 4), implying a lack of long-range transport to the Arctic. The best Arctic results for BC throughout the tropospheric profile were from models (CanAM5-PAM, DEHM, MATCH, WRF-Chem) that simulated the vertical mixing of BC well at midlatitudes. The first three of the four models were nudged to the ERA-Interim analysis, and WRF-Chem was nudged to the NCEP FNL analysis. The Arnold et al. (2015) study showed that a key determinant in model differences for peroxyacetyl nitrate (PAN) export relative to CO was the meteorology used in the models.
Their results implied that the ERA-Interim models had more efficient vertical transport and mixing in midlatitude source regions compared to GEOS-driven models. In the current study, Arctic BC was greatly underestimated throughout the Arctic profile by MATCH-SALSA, EMEP MSC-W, and GEM-MACH, which had BC concentrations at midlatitudes dropping much too low in the free troposphere. Those models had different sources of meteorology (Table 2), and some may have insufficient convection schemes. For example, GEM-MACH is missing subgrid-scale deep and shallow convection, which is important for the exchange between the planetary boundary layer and free troposphere and consequently for transport at middle and high latitudes. This subject could be studied further via sensitivity tests with and without nudged meteorology, while keeping the aerosol physics the same.
The summertime evaluations of surface O 3 , BC, and OA all imply that models overestimate the amount of these pollutants coming from wildfires in the western Arctic. This could be due to uncertainties in wildfire emissions, fire plume transport, or, in the case of O 3 and secondary OA, the plume chemistry. It is also likely that wet deposition of fire pollutants is underestimated if fire aerosol size is too small and due to climate models lacking pyrocumulus clouds and precipitation. The model overestimations of SLCFs in the summer could be due to a combination of all of these uncertainties.
Our analysis of the ACE-FTS, TES, and aircraft datasets shows that the CH 4 , O 3 , and BC model biases all imply that the modeled tropopause height is likely too low. Tropospheric species like CH 4 and BC should drop rapidly above the tropopause, but model biases increase sharply at that point. Stratospheric species like O 3 should increase rapidly above the tropopause, but model biases decline sharply there.

Are there patterns in the model biases between SLCF species?
Some patterns one might expect between SLCF species were not demonstrated in this study's results. For example, both O 3 precursors, CO and NO 2 , are too low in models, though that underestimation is worse in the winter. Despite that, surface O 3 tends to be overestimated, though that overestimation is mainly in the summer. Also, the west-east pattern in Arctic surface O 3 was overestimated in Alaska and underestimated in Scandinavia. However, for CO and NO 2 those skewed biases are reversed. At midlatitudes SO 2− 4 , BC, and OA are biased high in most models, yet despite that, PM 2.5 is underestimated. Thus, the PM 2.5 biases must be significantly influenced by the other, mainly natural, aerosol components.
In addition to expected patterns, there were no other discernible patterns in model biases between SLCF species.

How does the model performance impact model applications, such as simulated climate and health impacts?
In the AMAP (2022) report, these models go on to be used to simulate future emission scenarios, and from those results, the future temperature changes due to these SLCFs are predicted. They are further used to determine future changes to human health due to the changes in SLCFs. Given the model evaluation of this study, we have determined that using the mmm to predict SLCF climate impacts is generally robust. Considering the SLCFs with the greatest radiative impact (CH 4 , O 3 , BC, and SO 2− 4 ), the mmm was within ±25 % of the measurements across the Northern Hemisphere. The mmm also performed very well for PM 2.5 , which is a main component considered for human health impacts. Thus, for climate studies, wherein relatively large regions and time periods are considered, the model performance is sufficient.
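The multi-model mean and the ±25 % criterion mentioned above can be illustrated with a short Python sketch. It assumes that models which did not simulate a given species (for example, aerosol-only models for gas-phase SLCFs) are simply skipped when averaging; the function names and this handling of missing values are assumptions made for illustration, not a description of the processing used for the AMAP assessment.

```python
import math


def multi_model_mean(values):
    """Average across models, skipping models with no value (None or NaN)."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    if not valid:
        raise ValueError("no model provided a value for this species")
    return sum(valid) / len(valid)


def within_tolerance(mmm, obs, tol_pct=25.0):
    """Return True when the multi-model mean is within +/- tol_pct of the observation."""
    return abs(mmm - obs) <= (tol_pct / 100.0) * abs(obs)
```

With hypothetical surface values of 30 and 50 from two models and no value from two others, the mmm is 40, which lies within ±25 % of a hypothetical observed value of 35.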

What processes should be improved or studied further for better model performance?
The model evaluation in this study brought about results that have been reported in previous publications, and several no-
-There are a number of indications that simulated boundary layers are too stable (not enough vertical lifting of SLCFs, too much O 3 titration, too much BC and SO 4 deposition). Therefore, increased convection at midlatitudes may be needed. However, this hypothesis is opposite to that found in Allen and Landuyt (2014), which found that excessive tropical convection caused CMIP5 models to overestimate BC aloft. It is therefore important to evaluate models specifically for export and long-range transport events driven by different mechanisms (e.g., frontal export, convective lifting), which is a focus within the PACES initiative (Arnold et al., 2016).
-The O 3 , BC, and PM 2.5 model biases were all high in the Alaskan summertime, implying that many models may simulate too much pollution from wildfires there. Models need improved wildfire parameters for emissions, plume height, plume chemistry, and aerosol-cloud processes. For example, the fire emissions inventories GFED4, GFASv1.2, and FINNv1.5 vary by up to a factor of 3 for BC emissions (AMAP, 2022); light and temperature attenuation under smoke plumes means less O 3 is produced than precursor concentrations may imply, and plume rise and injection height need to be accurate for long-range transport.
-Modeled deposition is highly uncertain, and there is evidence here that some models have too much deposition of BC at midlatitudes. However, deposition measurements are scarce, even at midlatitudes, and more of those measurements are needed to constrain models.
-Additional, preferably long-term, OA and PM 2.5 measurements are needed in the high Arctic. Both are expected to be important to Arctic conditions in the future, with increasing wildfires and shipping influencing the Arctic atmosphere, and the lack of those measurements is problematic for constraining models.
-An evaluation of SO 2 would help to determine if the model biases in SO 2− 4 are due to transport, emission uncertainty, or if it can be explained by the uncertainty in chemistry. The removal of particles represents a large uncertainty, but without SO 2 (and DMS) it cannot be concluded that the removal is too fast.
Therefore, we conclude that sensitivity tests for the abovementioned model processes will be important for further understanding and improving model performance for SLCFs. But just as important is having additional Arctic measurements and the continuation of existing Arctic measurements in order to assess the model improvements.

Appendix A: Model descriptions
The 18 models used in this study are described in each subsection below. Table A1 contains further information summarizing the models' setup for the AMAP SLCF simulations.

A1 CanAM5-PAM
The Canadian Atmospheric Model version 5 (CanAM5), with the Piecewise lognormal approximation Aerosol Model (PAM), was used. CanAM5-PAM is an improved version of CanAM4 (von Salzen et al., 2013). The improvements include a higher vertical resolution, improved parameterizations for land surface and snow processes, DMS emissions, and clear-sky radiative transfer. CanAM5-PAM has 49 vertical levels extending up to 1 hPa with a resolution of approximately 100 m near the surface. Model simulations are performed using a spectral resolution of T63, which is equivalent to a horizontal resolution of approximately 2.8° × 2.8°. The model uses separate parameterizations for layer and convective clouds. Aerosol microphysical processes are based on the piecewise lognormal approximation (Ma et al., 2008; Peng et al., 2012; Mahmood et al., 2016, 2019; AMAP, 2015a). The model simulates binary homogeneous nucleation of sulfuric acid and water vapor. Newly formed particles grow by condensation and coagulation.
A detailed description of parameterizations of ocean DMS flux to atmosphere, oxidation, and removal processes is provided in Tesdal et al. (2016). In-cloud production of sulfate requires O 3 and hydrogen peroxide (H 2 O 2 ) as oxidants (von Salzen et al., 2000), with oxidant (OH, NO − 3 , H 2 O 2 , O 3 ) concentrations specified as climatological results from CMAM. Dry deposition of aerosol depends on concentrations of aerosols in the near-surface model layer (Zhang et al., 2001). Wet deposition includes in-cloud scavenging in both convective clouds and layer clouds, as well as below-cloud scavenging.
Cloud droplet number concentrations are calculated based on the assumption of a parcel of air which ascends from the subcloud layer into the cloud layer with a characteristic vertical velocity (Peng et al., 2005); the standard deviation of the subgrid-scale cloud vertical velocity probability distribution is parameterized using the approach by Ghan et al. (1997). Aerosol particles that are suspended in the parcel of air may activate and grow into cloud droplets by condensation of water vapor. A numerically efficient solution of the condensational droplet growth equation (e.g., Seinfeld and Pandis (2006)) is employed for this purpose. In grid cells that are affected by clouds, CanAM5-PAM accounts for cloud albedo and lifetime effects (first and second aerosol indirect effects) as well as semi-direct effects.

A2 CESM
The Community Earth System Model version 2 (Danabasoglu et al., 2020) is an ESM that can be configured in many different ways. The configuration applied for this assessment utilized the Community Atmosphere Model (CAM) version 6 and Modal Aerosol Model (MAM4) with four mixed-species aerosol modes (Liu et al., 2016). CAM6 employs a spectral element dynamical core (Lauritzen et al., 2018). Type 0 and Type 1 CESM runs were conducted at 1.9° × 2.5° horizontal resolution, while Type 3 runs were conducted at 0.9° × 1.25°, all with 32 vertical layers. For Type 0 and Type 1 simulations, CESM version 2.0 was used with CAM6-chem representations of chemical reactions (Emmons et al., 2020b), enabling prognostic simulation of tropospheric ozone concentrations, along with a volatility basis set (VBS) parameterization for the formation of secondary organic aerosol (SOA) (Tilmes et al., 2019) and stratospheric chemistry. CAM6-chem is coupled to the interactive Community Land Model (CLM5), which provides biogenic emissions, calculated online using the MEGANv2.1 algorithm (Guenther et al., 2012), and handles dry deposition. Tracked aerosol species simulated by MAM4 include sulfate, primary and aged black carbon and organic matter, dust, sea salt, and secondary organic aerosols. Both sea salt and dust emissions are calculated online and are highly sensitive to the surface wind speed (Mahowald et al., 2006a, b). These runs were also forced with prescribed sea surface temperatures (SSTs) and sea ice concentrations, created from merged Reynolds-HADISST products as in Hurrell et al. (2008). Type 3 transient runs utilized CESM version 2.1.1 without atmospheric chemistry and with fully coupled atmosphere, ocean, land, and sea ice components (component set BSSP245cmip6), as applied to simulate future scenarios for CMIP6. All CESM runs specified global mean mixing ratios of methane and carbon dioxide.

A3 CIESM-MAM7
CIESM-MAM7 is the Community Integrated Earth System Model (CIESM) (Lin et al., 2020) using the Modal Aerosol Model (MAM7) with seven mixed-species aerosol modes. The current CIESM version 1.1 (see Table 1 of Lin et al., 2020) is based on the NCAR Community Earth System Model (CESM version 1.2.1) with several novel developments and modifications aiming to overcome some persistent systematic biases, such as the double Intertropical Convergence Zone (ITCZ) problem and underestimated marine boundary layer clouds. CIESM-MAM7 employs a finite-volume dynamical core with a horizontal resolution of 0.9° × 1.25° and 31 vertical layers. The large-scale meteorology (horizontal wind field) is nudged towards ERA-Interim reanalysis data and the relaxation time is set to 6 h. In CIESM-MAM7, the primary emissions of black carbon (BC), organic carbon (OC), ammonia (NH 3 ), volatile organic compounds (VOCs), sulfur dioxide (SO 2 ), and oxidizing gases (H 2 O 2 , O 3 , OH) are prescribed by the input data uniformly provided by the AMAP-SLCF group. The emission amounts of dust (DU) and sea salt (SS) are calculated online. Aerosol size distributions in CIESM-MAM7 are described by seven overlapping lognormal distributions: the Aitken, accumulation, primary carbon, fine dust, fine sea salt, coarse dust, and coarse sea salt modes. The geometric standard deviation of each mode is prescribed (see Table 1 of Liu et al., 2012). A simplified gas- and liquid-phase chemistry is included in CIESM-MAM7. SO 2 and dimethyl sulfide (DMS) can be oxidized to sulfuric acid gas (H 2 SO 4 ), which then condenses to form sulfate aerosols, while the evolution of oxidizing gases is not considered. Primary organic matter (POM) and BC are emitted to the primary carbon mode, then aged and transferred to the accumulation mode by condensation of H 2 SO 4 , NH 3 , and semivolatile organics and by coagulation with Aitken and accumulation modes.
The effect of stratospheric sulfate aerosol from volcanic emission on radiative forcing is considered by following the CMIP6 procedure (Thomason, 2012). No specific stratospheric chemistry is included in CIESM-MAM7.

A4 CMAM
The Canadian Middle Atmosphere Model (CMAM) is based on the third-generation CanAM model, with the model lid raised to approximately 95 km and the necessary radiative processes for the mesosphere included (Scinocca et al., 2008). A representation of gas-phase chemistry has also been included that contains a relatively complete description of the HO x , NO x , Cl x , and Br x chemistry that controls stratospheric ozone along with the longer-lived source gases such as CH 4 , N 2 O, and CFCs (Jonsson et al., 2004). For the troposphere the chemical mechanism can be considered as methane-NO x chemistry as it does not include the chemistry of larger volatile organic compounds. The model does, however, include a description of associated tropospheric chemical processes such as wet and dry deposition, interactive NO x emissions from lightning, corrections of clear-sky photolysis rates for clouds, and N 2 O 5 hydrolysis on prescribed sulfate aerosol distribution using the reaction probabilities of Davis et al. (2008). The simulation analyzed here used a "specified dynamics" setup, wherein the model horizontal winds and temperature are nudged towards a meteorological reanalysis dataset that represents the observed historical evolution of the atmosphere. In this way the day-to-day variability of the model meteorology is much more closely aligned with the historical evolution of the atmosphere than would be possible in a free-running model. Here CMAM was nudged to 6-hourly fields from the ERA-Interim reanalysis (Dee et al., 2011) on all model levels below 1 hPa and with a relaxation time constant of 24 h.
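Nudging of this kind is commonly formulated as a Newtonian relaxation term, dX/dt = F(X) − (X − X_ref)/τ, where X_ref is the reanalysis value and τ is the relaxation time constant. The Python sketch below applies only the relaxation term with an implicit (backward Euler) step; it is a generic illustration under that assumption and not the CMAM implementation, and the function and argument names are hypothetical.

```python
def nudge(x_model, x_reanalysis, dt_hours, tau_hours=24.0):
    """Apply one implicit Newtonian-relaxation step toward the reanalysis value.

    Solves x_new = x_model - (dt / tau) * (x_new - x_reanalysis) for x_new,
    which stays stable for any non-negative time step.
    """
    if tau_hours <= 0:
        raise ValueError("relaxation time must be positive")
    if dt_hours < 0:
        raise ValueError("time step must be non-negative")
    weight = dt_hours / (tau_hours + dt_hours)
    return x_model + weight * (x_reanalysis - x_model)
```

A shorter relaxation time (such as the 6 h used by some of the other models described in this appendix) pulls the model state towards the reanalysis more strongly per step than the 24 h used here.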

A5 DEHM
The Danish Eulerian Hemispheric Model (DEHM) (Christensen, 1997; Brandt et al., 2012; Massling et al., 2015) is a 3-D Eulerian atmospheric chemistry transport model developed at the Department of Environmental Science at Aarhus University in Denmark. The model domain covers the Northern Hemisphere using a polar stereographic projection with a grid resolution of 150 km × 150 km. It includes nesting capabilities to make simulations with a higher grid resolution in a limited area of the domain, and in this work an Arctic sub-domain with 50 km × 50 km has been applied covering the Arctic area down to about 40-54° N. The model has 29 vertical levels in sigma coordinates; the lowest 15 levels are below 2000 m above the surface. The lowest model levels are 22 m thick, and the top of the model domain is at 100 hPa, i.e., the whole troposphere and very lowest part of the stratosphere. DEHM includes a SO x -NO x -VOC-ozone chemistry, with 71 components including secondary organic aerosol (SOA), the use of the VBS mechanism, and nine particulates including hydrophobic and hygroscopic BC, primary organic aerosols, primary anthropogenic dust, PM 2.5 fraction, and coarse fraction of PM 10 of sea salt and Pb. CH 4 is a prognostic species for which the boundary conditions have a large influence. The model is driven by meteorological data with 1 h resolution from the WRF numerical weather prediction model (Skamarock et al., 2008) version 3.9. The WRF model system is driven by nudging to the ERA-Interim reanalysis data produced by ECMWF.

A6 ECHAM-SALSA
ECHAM-SALSA is the general aerosol-climate model ECHAM-HAMMOZ (ECHAM6.3-HAM2.3-MOZ1.0) (Tegen et al., 2019; Schultz et al., 2018) using the Sectional Aerosol module for Large Scale Applications (SALSA) to solve the aerosol microphysics. ECHAM6 (Stevens et al., 2013) computes the atmospheric circulation and fluxes using a semi-Lagrangian transport scheme. In the setup used here, the large-scale meteorology (vorticity, divergence, and surface pressure; relaxation times of 24, 6, and 48 h, respectively) was nudged towards ERA-Interim reanalysis data (Berrisford et al., 2011). In SALSA the aerosol size distribution is modeled using 10 size sections (or bins), which span particle sizes between 3 nm and 10 µm. The size distribution is further divided into a soluble and an insoluble sub-population, which are treated as externally mixed. Within one size bin of one sub-population, all aerosol particles are considered internally mixed. In its standard setup, SALSA describes the aerosol compounds BC, organic carbon (OC), SO 2− 4 , SS, and mineral dust (DU). In the model, BC, OC, SS, and DU are emitted as primary particles, while SO 2− 4 is emitted as either SO 2 or as DMS, which are oxidized using a simplified chemistry (Stier et al., 2005) to form H 2 SO 4 , which then either nucleates or condenses onto existing particles. BC, OC, and SO 2 emissions are prescribed using input files, while SS and DU emissions are computed online. All greenhouse gas concentrations are fixed to pre-defined concentrations. The model resolution for the simulations performed here was T63 (roughly 2° by 2°), further using 47 hybrid sigma-pressure levels.

A7 EMEP MSC-W
The EMEP MSC-W model is a 3-D Eulerian chemistry transport model developed at the Norwegian Meteorological Institute within the framework of the UN Convention on Long-range Transboundary Air Pollution. It is described in detail in Simpson et al. (2012). Although the model has traditionally been aimed at simulations of acidification, eutrophication, and air quality over Europe, global modeling has been performed and evaluated against observations for many years (Jonson et al., 2010; Wild et al., 2012). The model uses 20 vertical levels defined as eta-hybrid coordinates. The 10 lowest levels are within the planetary boundary layer (with the bottom layer being 92 m thick), and the top of the model domain is at 100 hPa. Model updates since Simpson et al. (2012), resulting in EMEP model version rv4.33 as used here, have been described in Simpson et al. (2019) and references cited therein. The main revisions were made to the parameterizations of coarse NO − 3 formation on sea salt and dust aerosols, N 2 O 5 hydrolysis on aerosols, and additional gas-aerosol loss processes for O 3 , HNO 3 , and HO 2 . The EMEP model, including a user guide, is publicly available as open-source code at https://github.com/metno/emep-ctm (last access: 14 April 2022). EMEP-modeled PM 2.5 and PM 10 include primary and secondary aerosols, both anthropogenic and natural. Secondary aerosol consists of inorganic sulfate, nitrate, ammonium, and SOA; the latter is formed from both anthropogenic and biogenic emissions using the VBS scheme detailed in Bergström et al. (2012) and Simpson et al. (2012). The model also calculates sea salt aerosols and windblown dust particles from soil erosion. Aerosol optical depth (AOD) is calculated based on the mass concentrations of individual aerosols multiplied by corresponding mass extinction coefficients.
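The AOD calculation described in the last sentence above can be sketched as a sum over aerosol components of column mass burden times mass extinction coefficient. The Python example below is a minimal illustration under that reading; the function name, the dictionary layout, and the units (g m−2 and m2 g−1) are assumptions, and any dependence on relative humidity or wavelength is omitted.

```python
def aerosol_optical_depth(column_burden_g_m2, mass_ext_coeff_m2_g):
    """Sum burden * mass extinction coefficient over all aerosol components.

    Both arguments are dicts keyed by component name; every component with a
    burden must also have a mass extinction coefficient.
    """
    missing = set(column_burden_g_m2) - set(mass_ext_coeff_m2_g)
    if missing:
        raise KeyError(f"no extinction coefficient for: {sorted(missing)}")
    return sum(
        burden * mass_ext_coeff_m2_g[name]
        for name, burden in column_burden_g_m2.items()
    )
```

With hypothetical burdens of 0.01 g m−2 of sulfate and 0.001 g m−2 of BC and hypothetical coefficients of 8 and 10 m2 g−1, the result is an AOD of 0.09.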
In these simulations, we did not use the BC and OC emissions from EclipseV6b directly, but applied EclipseV6b PM 2.5 and coarse PM emissions instead, which were split into elemental carbon (EC), organic matter (OM) (here assumed inert), and the remaining inorganic dust. The EC and OM emissions in the fine and coarse fractions were further divided into fossil fuel and wood-burning compounds for each country and source sector. The split applied to the PM emissions is the same as used in EMEP operational runs (IIASA, personal communication). A total of 80 % of emitted EC is assumed to be hydrophobic, aging to become hydrophilic within 1 to 1.5 d. As in Bergström et al. (2012), the organic matter to organic carbon ratio of emissions by mass is assumed to be 1.3 for fossil fuel sources and 1.7 for wood-burning sources. Note that different wildfire emissions were used here, i.e., from FINN (the Fire INventory from NCAR version 1.5). The EMEP model runs were driven by 3-hourly meteorological data from the ECMWF IFS model at 0.5° × 0.5° resolution.

A8 FLEXPART
The Lagrangian particle dispersion model FLEXPART version 10.4 (Pisso et al., 2019) releases computational particles that are simulated forward in time following 3-hourly ECMWF meteorological fields with 137 vertical layers and a spatial resolution of 1° × 1°. For each year around 330 million particles were released to calculate turbulent diffusion (Cassiani et al., 2014), unresolved mesoscale motions (Stohl et al., 2005), and convection (Forster et al., 2007). A recently updated wet deposition scheme taking into account in-cloud and below-cloud removal was used. Gravitational settling for spherical BC particles with an aerosol mean diameter of 0.25 µm, a normalized standard deviation of 3.3, and a particle density of 1500 kg m −3 (Long et al., 2013) is used in the calculation of dry deposition. The surface concentration and deposition fields were retrieved on a monthly basis on a resolution of 0.5° × 0.5°.

A9 GEM-MACH
The Type 0 simulation was conducted for the year 2015 over a limited-area model (LAM) domain on a rotated lat-long grid at 0.1375° × 0.1375° (or 15 km) resolution covering the Arctic (> 60° N) and extending to the southern US-Canada border. Some of the model upgrades for the Arctic simulation are described in Gong et al. (2018). Anthropogenic emissions used are based on a combination of North American emission inventories (specifically, the 2016 US National Emission Inventories and 2015 Canadian national Air Pollution Emission Inventories) and global ECLIPSE v6b 2015 baseline emissions. North American wildfire emissions are processed using the Canadian Forest Fire Emission Prediction System (CFFEPS) from satellite-detected fire hotspot data (MODIS, AVHRR, and VIIRS). CFFEPS consists of a fire growth model, a fire emissions model, and a thermodynamic-based model to predict the vertical penetration height of a smoke plume from fire energy (see Chen et al., 2019, for details). Biogenic emissions are calculated online in GEM-MACH based on the algorithm from BEIS version 3.09 with BELD3-format vegetation land cover. Sea salt emissions are computed based on .
The chemical lateral boundary conditions were from MOZART-4/GEOS-5. The meteorology was initialized daily (at 00:00 UTC) using the Canadian Meteorological Centre's global objective analyses.

A10 GEOS-Chem
GEOS-Chem is a global three-dimensional chemical transport model driven by assimilated meteorological observations from the Goddard Earth Observing System (GEOS) of the NASA Data Assimilation Office (DAO), which was first introduced in 2001. GEOS-Chem is a grid-independent model which operates on a 1-D column with default or user-specified horizontal grid points, vertical grid points, and time step. GEOS-Chem Classic can use archived GEOS meteorological data on a rectilinear latitude-longitude grid to compute horizontal and vertical transport and uses OpenMP for parallelization. Two types of assimilated meteorological data from the NASA Global Modeling and Assimilation Office (GMAO) can be used to drive the offline mode of GEOS-Chem. The first type is the operational GEOS Forward Processing product (GEOS-FP; Lucchesi, 2013), available from 2012, with a native resolution of 0.25° × 0.3125°. The second type is the consistent Modern-Era Retrospective Analysis for Research and Applications version 2 (MERRA-2; Randles et al., 2017), covering 1979 to present, with a native resolution of 0.5° × 0.625°. Both types of meteorological data have 72 hybrid sigma-pressure levels with the top at 0.01 hPa, 3-hourly temporal resolution for 3-D fields, and 1 h resolution for 2-D fields. GEOS-Chem uses the TPCORE advection scheme (Lin and Rood, 1996) on the latitude-longitude grid, while the convective transport uses the convective mass flux described by Wu et al. (2007). The wet deposition scheme in GEOS-Chem is based on Liu et al. (2001) for water-soluble aerosols and Amos et al. (2012) for gases. The dry deposition is based on the resistance-in-series scheme of Wesely (1989). Aerosol deposition is from Zhang et al. (2001). Emissions of dust aerosol, lightning NO x , biogenic VOCs, soil NO x , and sea salt aerosol are dependent on the local meteorological conditions.
The CEDS global inventory provides the default anthropogenic emissions, while EDGAR v4.3.2 (M. et al., 2018) is also available as an alternate option to CEDS. Future anthropogenic emissions following the RCP scenarios, aircraft emissions (Stettler et al., 2011), ship emissions (from CEDS), and lightning NO x emissions (Murray et al., 2012) are also included and configured at run time using the HEMCO module (Keller et al., 2014). Biogenic VOC emissions in GEOS-Chem are from the MEGAN v2.1 inventory (Guenther et al., 2012). The chemical solver in the standard GEOS-Chem simulation uses the Kinetic Pre-Processor (KPP) (Damian et al., 2002) as implemented in GEOS-Chem. The gas phase in the troposphere in GEOS-Chem includes a detailed HO x -NO x -VOC-ozone-halogen-aerosol tropospheric chemistry mechanism, which generally follows JPL/IUPAC recommendations including PAN (Fischer et al., 2014), isoprene (Fisher et al., 2016), halogens (Sherwen et al., 2016; Chen et al., 2017), and Criegees (Millet et al., 2015). A linearized stratospheric chemistry scheme has been implemented since GEOS-Chem v9.0. The model reads archived 3-D monthly mean production rates and loss frequencies for each species at the beginning of each month. The Linoz chemistry (McLinden et al., 2000) is also applied as a recommended option for the stratospheric ozone layer. The original sulfate-nitrate-ammonium aerosol simulation in GEOS-Chem is coupled to gas-phase chemistry (Park et al., 2004). The black carbon simulation, organic aerosol (Pai et al., 2020), complex SOA (Pye et al., 2010), the aqueous-phase isoprene SOA scheme, and the dust simulation (Duncan Fairlie et al., 2007) are also implemented in GEOS-Chem. The dust size distributions are from Zhang et al. (2013). GEOS-Chem v12.3.2 with uniform 2° × 2.5° MERRA-2 meteorological data for 2008-2009, GEOS-FP meteorological data for 2014-2015, and ECLIPSEv6b emissions was used in this study.

A11 GISS-E2.1
The NASA Goddard Institute of Space Studies (GISS) Earth system model (ESM), GISS-E2.1, is a fully coupled ESM. A full description of GISS-E2.1 as well as evaluation of its coupled climatology during the satellite era and the historical ensemble simulation of the atmosphere and ocean component models (1850-2014) are described in Kelley et al. (2020). GISS-E2.1 has a horizontal resolution of 2° in latitude by 2.5° in longitude and 40 vertical layers extending from the surface to 0.1 hPa in the lower mesosphere. The tropospheric chemistry scheme used in GISS-E2.1 (Shindell et al., 2001, 2003) includes inorganic chemistry of O x , NO x , HO x , CO, and organic chemistry of CH 4 and higher hydrocarbons using the CBM4 scheme (Gery et al., 1989), as well as the stratospheric chemistry scheme (Shindell et al., 2006), which includes chlorine and bromine chemistry together with polar stratospheric clouds. The meteorology was nudged to the NCEP reanalysis.
In the present work, we used the OMA (the one-moment aerosol scheme) (Bauer et al., 2007a, b;Bauer and Koch, 2005;Tsigaridis et al., 2013;Bauer et al., 2020). OMA is a mass-based scheme in which aerosols are assumed to remain externally mixed and have a prescribed and constant size distribution, with the exception of sea salt that has two distinct size classes and dust that is described by a sectional model with an option from four to six bins. The OMA scheme treats sulfate, nitrate, ammonium, carbonaceous aerosols (black carbon and organic carbon, including the NO x -dependent formation of SOA and methanesulfonic acid formation), dust, and sea salt. The model includes secondary organic aerosol production, as described by Tsigaridis and Kanakidou (2007). The default dust configuration that is used in this work includes five bins, including a clay and four silt ones, from submicron to 16 µm in size. The first three dust size bins can be coated by sulfate and nitrate aerosols (Bauer and Koch, 2005). OMA only includes the first aerosol indirect effect. The aerosol number concentration that impacts clouds is obtained from the aerosol mass as described in Menon and Rotstayn (2006).
The natural emissions of sea salt, DMS, isoprene, and dust are calculated interactively. Anthropogenic dust sources are not represented in ModelE2.1. Dust emissions vary spatially and temporally only with the evolution of climate variables like wind speed and soil moisture. The version of the model we use in this work uses prescribed sea surface temperature (SST) as well as sea ice thickness and extent during the historical period (Rayner et al., 2003).

A12 MATCH
MATCH (Multiscale Atmospheric Transport and Chemistry; Robertson et al., 1999) is an offline, Eulerian, 3-D chemistry transport model developed at the Swedish Meteorological and Hydrological Institute. MATCH can be run on global to urban domains to study a range of atmospheric chemistry and air quality problems, but for this study model runs were performed for the 20-90° N region, focusing on long-range transport to the Arctic. ERA-Interim reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF) were used as meteorological input to the model. The 6-hourly data (3-hourly for precipitation) were extracted from the ECMWF archives on a 0.75° × 0.75° rotated latitude-longitude grid. The original data had 60 levels, but only the 38 lowest levels, reaching about 16 km in the Arctic, were used in the model.
The scheme for gas-phase tropospheric chemistry and bulk aerosols as described in Andersson et al. (2007) was used. Methane concentrations were prescribed. Boundary conditions at the top of the model and at the lateral boundaries for a range of species including ozone were based on monthly mean values from the Copernicus Atmosphere Monitoring Service. The aerosol scheme was extended with BC and OC simulated as three fractions: fresh, hydrophobic and aged, and hydrophilic. A total of 80 % of anthropogenic emissions from all sectors were emitted into the hydrophobic fraction and 20 % into the hydrophilic fraction, except for fire-biomass combustion, for which 100 % was emitted into the hydrophilic component following Genberg et al. (2013). Scavenging and aging were parameterized following Liu et al. (2011); i.e., aging is proportional to OH and scavenging in mixed-phase clouds is reduced. The hydrophobic fraction is assumed to be 5 % activated in the scavenging scheme, while the hydrophilic fraction is 100 % activated. If the clouds are mixed-phase, then the scavenging efficiency is scaled by the ratio of cloud ice water content to total cloud water content assuming zero scavenging for 100 % ice clouds.
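The emission partitioning and mixed-phase scaling described above can be illustrated with a minimal Python sketch. It is not the MATCH code: the function names are hypothetical, and the linear scaling by the liquid share of cloud water (one minus the ice ratio) is an assumption chosen so that scavenging reaches zero for 100 % ice clouds, as the text states.

```python
# Activated fractions quoted in the text for the scavenging scheme.
ACTIVATED = {"hydrophobic": 0.05, "hydrophilic": 1.00}


def split_emission(mass, from_fire=False):
    """Return (hydrophobic, hydrophilic) parts of an emitted BC or OC mass."""
    if from_fire:
        # Fire-biomass combustion: 100 % emitted as hydrophilic.
        return 0.0, mass
    # All other anthropogenic sectors: 80 % hydrophobic, 20 % hydrophilic.
    return 0.8 * mass, 0.2 * mass


def scavenging_efficiency(fraction, ice_water, liquid_water):
    """Activated fraction scaled for mixed-phase clouds.

    Assumption (not confirmed by the source): the efficiency falls
    linearly with the ice ratio and is zero for a pure ice cloud.
    """
    total_water = ice_water + liquid_water
    if total_water <= 0.0:
        return 0.0  # no cloud water, so no in-cloud scavenging
    ice_ratio = ice_water / total_water
    return ACTIVATED[fraction] * (1.0 - ice_ratio)
```

For example, under these assumptions a half-ice, half-liquid cloud would halve the efficiency of the hydrophilic fraction, while a pure ice cloud would remove nothing.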

A13 MATCH-SALSA-RCA4
The chemistry transport model MATCH (Robertson et al., 1999; Andersson et al., 2007), described above, is online-coupled to the aerosol dynamics model SALSA (Kokkola et al., 2008). SALSA describes the whole chain from nucleation to the growth and deposition of particles and computes the size distribution, number concentration, and chemical composition of the aerosol species. A sectional representation of the aerosol size distribution is used with three main size ranges (a: 3-50 nm, b: 50-700 nm, and c: > 700 nm), and each range is further subdivided into smaller bins and into soluble and insoluble bins, adding up to a total of 20 bins. The seasonally varying emissions are based on the sector-wise ECLIPSE inventory. Isoprene emissions are modeled online depending on the meteorology based on the methodology by Simpson et al. (1995). The terpene emissions (α-pinene) are taken from the fields modeled by the EMEP model. Sea salt is parameterized following the scheme of Foltescu et al. (2005) but modified for varying particle sizes, wherein the Mårtensson et al. (2003) scheme is used if the particle diameter is ≤ 1 µm and the Monahan et al. (1986) scheme is used otherwise. The coupling of MATCH with SALSA and the evaluation of this model setup are described in detail in Andersson et al. (2015). A cloud activation model that computes 3-D cloud droplet number concentrations (CDNCs) based on the prognostic parameterization scheme of Abdul-Razzak and Ghan (2002), specifically designed for aerosol representation with sectional bins, is embedded in the MATCH-SALSA model. This scheme simulates the efficiency with which an aerosol particle is converted to a cloud droplet depending on the number concentration and chemical composition of the particles, given the updraft velocity and supersaturation of the air parcel. The updraft velocity is computed as the sum of the grid mean vertical velocity and a turbulent kinetic energy (TKE) term for stratiform clouds (Lohmann et al., 1999).
These CDNCs are then offline-coupled to a regional climate model, RCA4 (Samuelsson et al., 2011), that provides us with information on cloud properties such as cloud cover, cloud droplet radii, cloud liquid water path, and radiative fluxes. RCA4 is run with 6-hourly ERA-Interim meteorology, and the 3-hourly RCA4 meteorological fields along with the fields needed to calculate updraft velocity are used to drive the MATCH-SALSA cloud activation model. The CDNCs are then used to rerun the RCA4 model to obtain the cloud properties and radiative effects. The validation and more details of this model setup are described in Thomas et al. (2015).
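The updraft velocity that drives the cloud activation model combines a grid-mean vertical velocity with a contribution from TKE. A minimal sketch follows, assuming the turbulent contribution takes the square-root form w = w_mean + c * sqrt(TKE), which is common in this family of parameterizations. Both that functional form and the coefficient `tke_coeff` are assumptions for illustration and are not taken from the MATCH-SALSA code.

```python
import math


def updraft_velocity(w_grid_mean, tke, tke_coeff=1.0):
    """Effective updraft velocity (m/s) for stratiform cloud activation.

    w_grid_mean : grid-mean vertical velocity in m/s
    tke         : turbulent kinetic energy in m^2/s^2
    tke_coeff   : hypothetical dimensionless scaling coefficient
    """
    if tke < 0.0:
        raise ValueError("TKE must be non-negative")
    # sqrt(TKE) has units of m/s, so the two terms can be added.
    return w_grid_mean + tke_coeff * math.sqrt(tke)
```

Taking the square root keeps the sum dimensionally consistent, since TKE has units of velocity squared.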

A14 MRI-ESM2
MRI-ESM2 (Meteorological Research Institute (MRI) Earth System Model version 2.0, developed by the MRI of the Japan Meteorological Agency) consists of four major component models: an atmospheric general circulation model (MRI-AGCM3.5) with land processes, an ocean-sea ice general circulation model (MRI.COMv4), and aerosol and atmospheric chemistry models (Kawai et al., 2019; Oshima et al., 2020). However, the ocean general circulation model was not coupled in the simulations for this study. MRI-ESM2 uses different horizontal resolutions but employs the same vertical resolution in each atmospheric component model as follows: TL159 (approximately 120 km), TL95 (approximately 180 km), and T42 (approximately 280 km) in MRI-AGCM3.5, the aerosol model, and the atmospheric chemistry model, respectively, all with 80 vertical layers (from the surface to a model top of 0.01 hPa) in a hybrid sigma-pressure coordinate system. The component models are interactively coupled by a coupler, which enables an explicit representation of the effects of gases and aerosols on the climate system. The atmospheric chemistry component model in MRI-ESM2 is the MRI Chemistry-Climate Model version 2.1 (MRI-CCM2.1), which calculates the evolution and distribution of ozone and other trace gases in the troposphere and in the middle atmosphere. The model calculates a total of 90 gas-phase chemical species and 259 chemical reactions in the atmosphere. The aerosol component model in MRI-ESM2 is the Model of Aerosol Species in the Global Atmosphere mark-2 revision 4-climate (MASINGAR mk-2r4c), which calculates atmospheric aerosol physical and chemical processes and treats the following species: non-sea-salt sulfate, BC, OC, sea salt, mineral dust, and aerosol precursor gases (e.g., sulfur dioxide and dimethyl sulfide). The size distributions of sea salt and mineral dust are divided into 10 discrete bins, and the sizes of the other aerosols are represented by lognormal size distributions.
The model assumes external mixing for all aerosol species; however, in the radiation process in MRI-AGCM3.5, hydrophilic BC is assumed to be internally mixed with sulfate with a shell-to-core volume ratio of 2; the optical properties of hydrophilic BC are calculated based on Mie theory with a core-shell aerosol treatment, in which a concentric BC core is surrounded by a uniform coating shell composed of other aerosol compounds (Oshima et al., 2009a, b). MRI-ESM2 employs a BC aging parameterization that calculates the variable conversion rate of BC from hydrophobic BC to hydrophilic BC, which generally depends on the production rate of condensable materials such as sulfate. In the radiation and cloud processes in MRI-ESM2, sulfate is assumed to be (NH4)2SO4 and OC is assumed to be organic matter (OM) by lumping OC species using an OM-to-OC factor of 1.4. MRI-ESM2 represents the activation of aerosols into cloud droplets based on the parameterizations, and detailed descriptions and evaluations of the cloud processes and cloud representations in MRI-ESM2 are given by Kawai et al. (2019). Evaluations of the effective radiative forcing (ERF) of anthropogenic gases and aerosols in present-day conditions relative to preindustrial conditions globally and in the Arctic using MRI-ESM2 are given by Oshima et al. (2020). The simulations in this study were performed from January 2008 (or January 1990) to December 2015 after a 1-year spin-up run using the prescribed SST and sea ice data (provided by the AMIP experiment in CMIP6, https://www.wcrp-climate.org/modelling-wgcm-mip-catalogue/modelling-wgcm-mips-2/240-modelling-wgcm-catalogue-amip, last access: 14 April 2022). The horizontal wind fields were nudged toward the 6-hourly Japanese 55-year Reanalysis (JRA55) data (Kobayashi et al., 2015) (https://jra.kishou.go.jp/JRA-55/index_en.html, last access: 14 April 2022) in the simulation.
We used the monthly anthropogenic emissions from the ECLIPSE V6B emission dataset and the monthly biomass burning emissions from CMIP6 in the simulations. Major volcanic aerosols are given by the stratospheric aerosol dataset used in the CMIP6 experiments (Thomason et al., 2018). A second simulation with volcanic SO2 emissions, including the Holuhraun eruption, was also performed for 2014-2015.
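Two of the fixed factors quoted for the radiation and cloud processes, the OM-to-OC factor of 1.4 and the shell-to-core volume ratio of 2, translate into simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the ratio is interpreted as shell volume divided by core volume for concentric spheres, which is an assumption about the definition.

```python
OM_TO_OC = 1.4               # organic matter mass per unit organic carbon mass
SHELL_TO_CORE_VOLUME = 2.0   # assumed to mean V_shell / V_core


def organic_matter_mass(oc_mass):
    """Convert an organic carbon mass to organic matter mass."""
    return OM_TO_OC * oc_mass


def coated_particle_radius(core_radius):
    """Outer radius of a coated BC particle for a given BC core radius.

    Total volume = core + shell = (1 + ratio) * core volume, and the
    radius of a sphere scales with the cube root of its volume.
    """
    return core_radius * (1.0 + SHELL_TO_CORE_VOLUME) ** (1.0 / 3.0)
```

Under this interpretation the coated particle has three times the volume of its BC core, so its radius is larger by the cube root of 3.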

A15 NorESM1
NorESM1 (Iversen et al., 2013) is based on the fourth version of the Community Climate System Model (CCSM4) (Gent et al., 2011), with coupled models for the atmosphere, ocean, land, and sea ice. Here, we have used a 1° horizontal resolution in the atmosphere (0.95° latitude by 1.25° longitude, version NorESM1-Happi). The model has 26 vertical levels on a hybrid sigma-pressure coordinate up to the model top at 2.194 hPa. The model calculates the life cycles of a range of natural and anthropogenic aerosol components from emissions and physicochemical processing in air and cloud droplets. The only prescribed aerosol concentrations are stratospheric sulfate from explosive volcanoes. The direct and indirect aerosol effects on climate are calculated by parameterization of aerosol interactions with schemes for radiation and warm cloud microphysics. The model uses a prognostic calculation of cloud droplet numbers, allowing for competition effects between aerosols of different hygroscopic properties and sizes.

A16 OsloCTM
The Oslo CTM3 is an offline global three-dimensional chemistry transport model driven by 3-hourly meteorological forecast data from the Integrated Forecast System (IFS) model at the European Centre for Medium-Range Weather Forecasts (ECMWF). The Oslo CTM3 consists of a tropospheric and stratospheric chemistry scheme (Søvde et al., 2012) as well as aerosol modules for sulfate, nitrate, black carbon, primary organic carbon, secondary organic aerosols, mineral dust, and sea salt (Lund et al., 2018a).

A17 UKESM1
UKESM1 (United Kingdom Earth System Model) is a fully coupled Earth system model (Sellar et al., 2019) with a coupled atmosphere-ocean physical climate model (HadGEM3-GC3.1) at its core (Kuhlbrodt et al., 2018; Williams et al., 2018). For UKESM1 various Earth system components are incorporated with the physical climate model including ocean biogeochemistry, an interactive stratosphere-troposphere chemistry and aerosol scheme, and terrestrial carbon and nitrogen cycles coupled to interactive vegetation. The model has a horizontal resolution of 135 km at the midlatitudes (1.875° × 1.25°), with 85 levels on a terrain-following hybrid height coordinate system, ranging in height from the surface to a model top of 85 km. The combined stratosphere-troposphere United Kingdom Chemistry and Aerosol (UKCA) scheme is used within UKESM1 and is fully described and evaluated in Archibald et al. (2020).
The chemical scheme in UKCA is built upon the scheme described for the stratosphere in Morgenstern et al. (2009) and that for the troposphere described in O'Connor et al. (2014). Chemical reactions are included within UKCA for odd oxygen (Ox), nitrogen (NOy), hydrogen (HOx = OH + HO2), CO, CH4, and short-chain nonmethane volatile organic compounds (NMVOCs), including isoprene. Reactions involving NMVOCs are simulated as discrete species. UKCA includes an interactive photolysis scheme, as well as representations of both wet and dry deposition, for gas and aerosol species. Additional chemical reactions for DMS, SO2, and monoterpenes (C10H16) are included to enable coupling to the aerosol scheme within UKCA. A two-moment aerosol microphysical scheme, GLOMAP (Global Model of Aerosol Processes; Mann et al., 2010, 2012), is used to simulate four aerosol components (sulfate, BC, organic matter, sea salt) across five lognormal modes, ranging from submicron to supermicron sizes. Mineral dust is simulated separately using a six-bin mass-only scheme, ranging in size from 0.6 to 60 µm in diameter (Woodward, 2001). Ammonium nitrate is not currently included within the UKCA aerosol scheme. The formation of secondary organic aerosol (SOA) is included based on a fixed yield rate of 26 % from the products of monoterpene oxidation. The higher fixed yield value accounts for the underlying uncertainty in SOA formation and the absence of anthropogenic, marine, and isoprene sources.
Precursor emission fluxes are either prescribed using specified input files or calculated interactively using online meteorological variables within UKESM1. Methane is represented by using prescribed global concentrations. Interactive emission fluxes are calculated online for sea salt, DMS, dust, lightning NOx, and biogenic volatile organic compounds (BVOCs). Emissions of isoprene and monoterpenes from the natural environment are calculated online by coupling to the land surface scheme within UKESM1. Simulations provided by UKESM1 and used in the AMAP assessment have been undertaken using different configurations. For this study, the experiment with UKESM1 has been set up using an atmosphere-only configuration that is nudged to ECMWF reanalysis (ERA-Interim) of temperature and wind fields above the boundary layer. Prescribed values of sea surface temperatures and sea ice are used for each year of simulation based on historical simulations conducted as part of CMIP6 using the fully coupled atmosphere-ocean configuration of UKESM1. For other ancillary inputs a multi-year climatology was used, equivalent to an AMIP-type simulation.

A18 WRF-Chem
WRF-Chem (Weather Research and Forecasting model with online coupled chemistry) is used to simulate the transport and chemical transformation of trace gases and aerosols simultaneously with the meteorology. The model dynamics (WRF) are non-hydrostatic. The model version used for AMAP is WRF-Chem version 3.8.1, including the updates reported in Marelle et al. (2017) and Marelle et al. (2018). The simulation was performed on a polar stereographic projection with a horizontal resolution of 100 km and 50 hybrid terrain-following vertical levels based on hydrostatic pressure. The center of the domain is placed at the North Pole, and the latitude at the domain's outside boundary varies from 7° S to 7° N. The WRF-Chem chemical lateral boundary conditions are from MOZART-4/GEOS-5 (Emmons et al., 2010). Pressure at the model top is set to 50 hPa with stratospheric concentrations (e.g., ozone) taken from climatologies. The model was run with the Morrison double-moment microphysics scheme, longwave and shortwave radiative effects treated by the RRTMG scheme, and the Kain-Fritsch cumulus potential (KF-CuP) cumulus parameterization scheme. The model was run with the SAPRC-99 chemical scheme providing gas-phase tropospheric reactions including VOCs and NOx, coupled with the MOSAIC eight-bin sectional scheme including VBS treatments for SOA. Methane concentrations are prescribed. Neither stratospheric nor tropospheric halogen chemistry is included. The model was run using anthropogenic emissions from ECLIPSE v6b and the GFED fire emissions. Boundary and initial meteorological conditions were given by the global NCEP Final Analysis (FNL), which was also used to nudge the temperature, relative humidity, and winds at every dynamical time step above the planetary boundary layer.
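Nudging above the planetary boundary layer, as described here for WRF-Chem, can be pictured as a Newtonian relaxation applied at each time step. The sketch below is a simplified single-column illustration, not WRF-Chem code: the function name, the relaxation timescale `tau_s`, and the sharp cutoff at the boundary layer top are all assumptions.

```python
def nudge_above_pbl(model_values, analysis_values, level_heights_m,
                    pbl_height_m, dt_s, tau_s):
    """Relax one model column toward analysis values above the PBL.

    Each level above pbl_height_m moves a fraction dt_s / tau_s of the
    way toward the analysis; levels inside the PBL are left unchanged.
    tau_s is a hypothetical relaxation timescale in seconds.
    """
    if not 0.0 < dt_s <= tau_s:
        raise ValueError("require 0 < dt_s <= tau_s")
    weight = dt_s / tau_s
    nudged = []
    for value, target, height in zip(model_values, analysis_values,
                                     level_heights_m):
        if height > pbl_height_m:
            value = value + weight * (target - value)
        nudged.append(value)
    return nudged
```

Leaving the boundary layer free lets the model's own surface and turbulence schemes determine near-surface conditions while the large-scale flow stays close to the analysis.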

Appendix B: Observational datasets
The following datasets were used to evaluate the models in this study. As BC measurements vary by instrument, Table B1 summarizes the different Arctic BC datasets used in this study. For SLCFs other than Arctic BC, Table B2 summarizes some information about the observation networks.
Competing interests. At least one of the (co-)authors is a member of the editorial board of Atmospheric Chemistry and Physics. The peer-review process was guided by an independent editor, and the authors also have no other competing interests to declare.
Disclaimer. PM 2.5 observation from the US embassy in China belongs to the US Department of State, is not fully verified or validated, and could be subject to changes, corrections, or errors.
Publisher's note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Special issue statement.
This article is part of the special issue "Arctic climate, air quality, and health impacts from short-lived climate forcers (SLCFs): contributions from the AMAP Expert Group (ACP/BG inter-journal SI)". It is not associated with a conference.
Acknowledgements. The work reflected in this publication was produced with the financial support of the Arctic Monitoring and Assessment Programme (AMAP). The authors would like to thank all operators and technicians at the Arctic stations for the collection of observational data. Thanks are also due to the following people for providing us with data: Mauro Mazzola, Stefania Gilardoni, and Angelo Lupi from the Institute of Polar Sciences for eBC measurements at Gruvebadet lab; and Mirko Severi from the University of Florence for SO 2− 4 measurements at Gruvebadet. Fairbanks aerosol measurements came from William Simpson. For Villum data, we acknowledge the Aarhus University Department of Environmental Science (ENVS). NOAA/ESRL/GMD, EMEP (http://ebas.nilu.no, last access: 14 April 2022), and the WMO GAW network are acknowledged for Barrow and Zeppelin observational datasets. The ACE-FTS is a Canadian-led mission mainly supported by the Canadian Space Agency (CSA). IMPROVE is a collaborative association of state, tribal, and federal agencies, as well as international partners. The US Environmental Protection Agency is the primary funding source, with contracting and research support from the National Park Service. The Air Quality Group at the University of California, Davis, is the central analytical laboratory, with ion analysis provided by the Research Triangle Institute and carbon analysis provided by the Desert Research Institute. Julia Schmale holds the Ingvar Kamprad Chair for Extreme Environments Research sponsored by Ferring Pharmaceuticals. The ECHAM-HAMMOZ model is developed by a consortium composed of ETH Zurich, the Max-Planck-Institut für Meteorologie, Forschungszentrum Jülich, the University of Oxford, the Finnish Meteorological Institute, and the Leibniz Institute for Tropospheric Research; it is managed by the Center for Climate Systems Modeling (C2SM) at ETH Zurich. 
Publisher's note: the article processing charges for this publication were not paid by a Russian or Belarusian institution.
Review statement. This paper was edited by Frank Dentener and reviewed by two anonymous referees.