Exploration of oxidative chemistry and secondary organic aerosol formation in the Amazon during the wet season: explicit modeling of the Manaus urban plume with GECKO-A

The GoAmazon 2014/5 field campaign took place in Manaus, Brazil, and allowed the investigation of the interaction between background-level biogenic air masses and anthropogenic plumes. We present in this work a box model built to simulate the impact of urban chemistry on biogenic secondary organic aerosol (SOA) formation and composition. An organic chemistry mechanism is generated with the Generator for Explicit Chemistry and Kinetics of Organics in the Atmosphere (GECKO-A) to simulate the explicit oxidation of biogenic and anthropogenic compounds. A parameterization is also included to account for the reactive uptake of isoprene oxidation products on aqueous particles. The biogenic emissions estimated from existing emission inventories had to be reduced to match measurements. The model is able to reproduce ozone and NOx for clean and polluted situations. The explicit model is able to reproduce background case SOA mass concentrations but does not capture the enhancement observed in the urban plume. The oxidation of biogenic compounds is the major contributor to SOA mass. A volatility basis set (VBS) parameterization applied to the same cases obtains better results than GECKO-A for predicting SOA mass in the box model. The explicit mechanism may be missing SOA-formation processes related to the oxidation of monoterpenes that could be implicitly accounted for in the VBS parameterization.

Published by Copernicus Publications on behalf of the European Geosciences Union.

C. Mouchel-Vallon et al.: GECKO-A explicit modeling of the Manaus urban plume

Map of the GoAmazon field campaign instrumented sites. Measurements used in this work came from the T3 site. ©Geocover

to model atmospheric organic chemistry with a detailed molecular view. GECKO-A is an automated chemical mechanism generator built to write the explicit chemistry of given precursors by following a prescribed set of systematic rules. This set of rules relies on experimental data when available and on structure-activity relationships (SARs) to determine unknown kinetic or thermodynamic constants. It has previously been run in box models to evaluate processes like secondary organic aerosol formation (Valorso et al., 2011; Aumont et al., 2012; Camredon and Aumont, 2006; Camredon et al., 2007) and dissolution of organic compounds. It was also applied to simulate chamber experiments (Valorso et al., 2011; La et al., 2016) and urban and biogenic plumes (Lee-Taylor et al., 2011).
In this work, a box model is run to simulate the evolution of an Amazonian air mass intercepting Manaus emissions during the wet season. Emissions of anthropogenic and biogenic primary VOCs are estimated with available data. The chemical scheme describing the explicit oxidation of these primary compounds is generated with GECKO-A. The resulting detailed simulation is then used to explore the impact of Manaus emissions on the Amazonian biogenic chemistry. Comparisons with aerosol mass spectrometer data and the VBS parameterization are carried out to identify important processes involved in biogenic SOA formation that may not be accounted for in GECKO-A. Finally, the potential for reduction of the explicit mechanism is estimated.

2 Experimental Data
The main instrumented site (referred to as T3 hereafter) of the GoAmazon 2014/5 field campaign was situated 70 km west of Manaus (Fig. 1). Two aircraft were also deployed: a G-159 Gulfstream I (G-1) (Schmid et al., 2014) that flew at low altitude and mostly sampled the boundary layer, and a Gulfstream G550 (HALO) that flew at higher altitudes and sampled the free troposphere (Wendisch et al., 2016). The flight tracks are depicted in Martin et al. (2016) and Wendisch et al. (2016). The G-1 airplane mainly flew daytime transects of the Manaus plume between the city and the T3 site.
The detailed instrumentation deployed at T3 and in the airplanes has been described elsewhere. For this study we mainly relied on the ground-deployed instruments briefly described here.
Ozone concentration measurements made with a Thermo Fisher Model 49i Ozone Analyzer were obtained from the Mobile Aerosol Observing System-Chemistry (MAOS-C).

Due to some issues with the NOx analyzer deployed at T3 by the MAOS-C during the wet season, the NOx data reported here are of limited reliability. The values reported here are only qualitative indications of NOx levels in the studied period.
OH radical concentrations were provided by an OH chemical ionization mass spectrometer (OH-CIMS; Sinha et al., 2008).
Organic compounds in the gas phase were measured with a selected reagent ion proton transfer reaction time-of-flight mass spectrometer (SRI-PTR-ToFMS, Jordan et al., 2009a, b). Aerosol composition was monitored by a high-resolution time-of-flight aerosol mass spectrometer (HR-ToF-AMS) (DeCarlo et al., 2006; de Sá et al., 2019).
For the purpose of comparisons with the model, we need to be able to separate time periods representing clean and polluted episodes. Using a fuzzy c-means clustering algorithm (Bezdek, 1981; Bezdek et al., 1984) applied to T3 measurements, de Sá et al. (2018) were able to identify four different clusters corresponding to (i) fresh or (ii) aged (2+ days) biogenic production, and air masses influenced by the (iii) northern or (iv) southern parts of Manaus. Using the time series of the contributions of these clusters, we labeled as background the air masses that were identified as being composed of at least 50% of any clean cluster (i or ii). Conversely, air masses that were identified as being composed of at least 50% of any polluted cluster (iii or iv) were labeled as polluted. The clustering method constrained the classification to include only wet season afternoon air masses that were not exposed to rain in the previous day. These limitations match our model restrictions: the model includes neither cloud chemistry nor the fire emissions that would be important during the dry season.
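The labeling rule described above can be illustrated with a minimal sketch. The cluster names, the dictionary layout and the function `label_air_mass` are hypothetical and are not taken from de Sá et al. (2018); the sketch also reads "at least 50% of any cluster" literally, so that a sample whose two clean memberships only sum to more than 50% stays unlabeled.

```python
from typing import Dict, Optional

# Hypothetical names for the four clusters (i)-(iv) described in the text.
CLEAN_CLUSTERS = ("fresh_biogenic", "aged_biogenic")
POLLUTED_CLUSTERS = ("manaus_north", "manaus_south")


def label_air_mass(membership: Dict[str, float],
                   threshold: float = 0.5) -> Optional[str]:
    """Label one sample from its fuzzy cluster memberships (fractions summing to 1).

    Returns "background", "polluted", or None when no single cluster
    reaches the threshold. An exact 0.5/0.5 tie between a clean and a
    polluted cluster is resolved as "background" (arbitrary choice made here).
    """
    if any(membership.get(name, 0.0) >= threshold for name in CLEAN_CLUSTERS):
        return "background"
    if any(membership.get(name, 0.0) >= threshold for name in POLLUTED_CLUSTERS):
        return "polluted"
    return None


if __name__ == "__main__":
    sample = {"fresh_biogenic": 0.2, "aged_biogenic": 0.55,
              "manaus_north": 0.15, "manaus_south": 0.10}
    print(label_air_mass(sample))
```

Samples labeled this way would then be averaged hourly per label before comparison with the model.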

For comparison with the model, experimental data were hourly averaged for each cluster.

Model Setup
A Lagrangian box model was built to simulate chemistry in the planetary boundary layer and the residual layer for an air parcel traveling over the Amazonian forest and Manaus. Because experimental data compared to the model only contain air masses that were not exposed to rain in the previous day (see Sect. 2 and de Sá et al., 2018), the model simulates biogenic conditions for one day, assuming the air mass was washed out by rain prior to that day. After the one-day spin-up, biogenic emissions are replaced by urban emissions for one hour during the second day to represent the interaction of the air mass with the Manaus urban area. After the simulated encounter with Manaus, the model inputs return to biogenic emissions until the end of the second day. This simulation is defined hereafter as the "polluted" case. Another simulation is run where the box is only subjected to biogenic emissions for two days, without any exposure to urban emissions, to simulate a background case.

This simulation is defined hereafter as the "clean" case. This section describes the box model setup, how the emissions were defined and the chemical mechanism used for this study.

Box model
This study relies on a box model described in this section. It includes emissions from the forest and the city, deposition and chemical evolution of the trace gases. Daytime growth of the planetary boundary layer is also simulated, with mixing with the residual layer.

Boundary Layer
The model includes two boxes on top of each other separated by a moving boundary representing the height of the boundary layer. The bottom box extends from the surface to the top of the planetary boundary layer (PBL). The top box extends from the top of the PBL to 850 m and represents the residual layer (RL) (see Fig. 2). The daytime PBL height evolution is parameterized according to the Tennekes (1973) approach and was calculated using the Second-Order Model for Conserved and Reactive Unsteady Scalars (SOMCRUS, Lenschow et al., 2016) (see Fig. 2). At sunset, stratification is assumed to quickly shrink the PBL to 50 m, which results in the contents of the PBL being reallocated to the RL. During the night, the PBL is constrained to grow linearly to reach the next morning level. The PBL height evolution is the same for each of the two simulated days. During the day, the PBL therefore slowly incorporates residual chemicals resulting from the previous day and night chemistry. Thalman et al. (2017) report PBL heights estimated from ceilometer measurements during the wet season in the central Amazonian forest, for polluted and background conditions. The measurements reach a maximum of 800 m at around 3 pm local time. This value was used to further constrain the PBL height evolution by scaling the SOMCRUS output to reach this measured PBL height maximum. The growth and shrinking of the PBL dilute the expanding box and transfer gases from the shrinking box to the expanding box. This is parameterized according to Eqs. 1 and 2: well-mixed concentrations. The residual layer is also slowly mixed with the free troposphere. The free troposphere is assumed to be a fixed reservoir of CO (80 ppb) and ozone (15 ppb, e.g. Browell et al., 1990; Gregory et al., 1990; Kirchhoff et al., 1990).
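The transfer between the two boxes described above can be sketched as a mass-conserving update of two well-mixed concentrations. This is an illustration only: it is not the paper's Eqs. 1 and 2, it ignores the slow exchange between the RL and the free troposphere, and the function name `move_boundary` and its arguments are hypothetical.

```python
def move_boundary(c_pbl: float, c_rl: float,
                  h_old: float, h_new: float,
                  h_top: float = 850.0) -> tuple:
    """Update well-mixed concentrations when the PBL top moves from h_old to h_new (m).

    c_pbl, c_rl: concentrations in the PBL box and the RL box (same units).
    h_top: top of the residual layer (850 m in the text).
    The expanding box is diluted with air from the shrinking box; the
    shrinking box keeps its concentration.
    """
    dh = h_new - h_old
    if dh > 0.0:
        # PBL grows: it entrains a slab of RL air of thickness dh.
        c_pbl = (c_pbl * h_old + c_rl * dh) / h_new
    elif dh < 0.0:
        # PBL shrinks: the air left behind is merged into the RL.
        c_rl = (c_rl * (h_top - h_old) + c_pbl * (-dh)) / (h_top - h_new)
    return c_pbl, c_rl


if __name__ == "__main__":
    # Sunset collapse from 800 m to 50 m: most PBL contents end up in the RL.
    print(move_boundary(c_pbl=10.0, c_rl=2.0, h_old=800.0, h_new=50.0))
```

In this sketch the column burden c_pbl * h + c_rl * (h_top - h) is unchanged by the update, which is the property one would expect from any such reallocation scheme.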

Temperature is assumed to follow a sinusoidal daily variation, with an average of 27°C, an amplitude of 4°C and a maximum at 6 pm local time. Relative humidity is initially set at 75% at 6 am (23°C) and is free to evolve with temperature changes assuming water vapor concentration is constant.
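As an illustration of these meteorological constraints, the following sketch evaluates the sinusoidal temperature cycle and the resulting relative humidity. The Magnus-type saturation vapor pressure formula and its coefficients are our assumption (the text does not state which formula the model uses), and "constant water vapor concentration" is interpreted here as constant number concentration, so that the water partial pressure scales with absolute temperature. All function names are hypothetical.

```python
import math


def temperature_c(hour: float, mean: float = 27.0, amplitude: float = 4.0,
                  hour_of_max: float = 18.0) -> float:
    """Sinusoidal daily temperature (deg C) with its maximum at hour_of_max."""
    return mean + amplitude * math.cos(2.0 * math.pi * (hour - hour_of_max) / 24.0)


def saturation_vapor_pressure_hpa(t_c: float) -> float:
    """Magnus-type approximation over liquid water (assumed formula)."""
    return 6.112 * math.exp(17.62 * t_c / (243.12 + t_c))


def relative_humidity(hour: float, rh0: float = 75.0, hour0: float = 6.0) -> float:
    """RH (%) at a given hour, with RH fixed to rh0 at hour0 and constant
    water vapor number concentration afterwards."""
    t0 = temperature_c(hour0)
    t = temperature_c(hour)
    e0 = rh0 / 100.0 * saturation_vapor_pressure_hpa(t0)
    # Ideal gas: partial pressure scales with absolute temperature at fixed n.
    e = e0 * (t + 273.15) / (t0 + 273.15)
    return 100.0 * e / saturation_vapor_pressure_hpa(t)


if __name__ == "__main__":
    for h in (6, 12, 18):
        print(h, round(temperature_c(h), 2), round(relative_humidity(h), 1))
```

With these settings the cycle gives 23°C at 6 am and 31°C at 6 pm, consistent with the values quoted above, and the relative humidity decreases through the day as the air warms.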

VOC emissions from the rainforest were estimated with the Model of Emissions of Gases and Aerosols from Nature (MEGAN v2.1, Guenther et al., 2012). Biogenic emissions on 13 March 2014 (the golden day of the GoAmazon field campaign, see de Sá et al., 2017) in a domain situated in the forest around Manaus were averaged to obtain total isoprene and monoterpene hourly averaged emissions for a day typical of the wet season, without any recorded rain event. Monoterpenes were then speciated to match concentrations measured by Jardine et al. (2015) at the top of an Amazonian rainforest canopy with a thermal desorption-gas chromatograph-mass spectrometer (TD-GC-MS). Based on this emission inventory, we then simultaneously optimized isoprene and total monoterpene emissions to match the model with isoprene and total monoterpenes measured at T3 under clean conditions. This resulted in the need to reduce isoprene emissions by a factor of 7. Using measurements from a similar site in Amazonia, Alves et al. (2016)

[Figure legend: Isoprene/10, Monoterpenes]

Manaus Emissions
The emissions used to represent the influence of Manaus are shown in Fig. 4a and were calculated following the methodology described in Abou Rafee et al. (2017) and Medeiros et al. (2017). Traffic emissions have been estimated from vehicle use intensity and emission factors for CO, NOx, SO2 and VOCs depending on the type of fuel used in Manaus (Abou Rafee et al., 2017). VOC speciation is assumed to be similar to the average speciation of the vehicle fleet emissions of São Paulo, Brazil, in 2004 (Martins et al., 2006). The hourly distribution of the traffic emissions is considered to be similar to the hourly traffic distribution in São Paulo (Andrade et al., 2015). In the past decades, Brazil has become known for pioneering the large-scale use of ethanol-based biofuels. However, because Manaus is isolated and distant from the biofuel-producing regions of southern Brazil, its traffic does not involve the consumption of significant amounts of ethanol-based fuel.
The difference in the fuel blend used in São Paulo and Manaus can introduce errors in the traffic emissions VOC speciation. For instance, a recent study by Yang et al. (2019) showed that the combustion of fuels with higher ethanol content emits significantly less carbon monoxide and more acetaldehyde. Schifter et al. (2020) showed similar results, and also suggested that ethanol blends emit smaller amounts of simple aromatic compounds (e.g. benzene, toluene). This speciation uncertainty can especially have an impact on oxidant concentrations. Schifter et al. (2020) reported for instance that fuels containing ethanol would potentially produce less ozone after the oxidation of emitted organic species than fuels without ethanol. Moreover, the lifetime of OH is likely to change depending on the speciation of emitted VOCs due to varying reactivities with respect to OH. In the same way that the potential for ozone formation could depend on the use of ethanol fuel blends, it is also possible that the potential for SOA formation would depend on these fuel blends too.
This traffic emission estimate does not include intermediate-volatility organic compounds (IVOCs), which would mainly be produced by diesel vehicle emissions (Gentner et al., 2012, 2017). Zhao et al. (2015, 2016) showed that the IVOC to VOC emissions ratio lies between 4% for gasoline vehicles and 65% for diesel vehicles. Knowing that diesel vehicles account for ca. 45% of the total driven distance in Manaus (Abou Rafee et al., 2017), we therefore assume that total IVOC emissions are approximately equal to 30% of total VOC emissions. To estimate the distribution of species resulting from IVOC emissions, we assumed that the distribution in volatility is similar to the distribution used to simulate traffic emissions in Mexico City in Lee-Taylor et al. (2011), with n-alkanes from C12 to C25 acting as surrogates for these heavier emitted organic compounds.
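The 30% figure can be recovered with a simple weighted average of the two IVOC to VOC ratios. The sketch below assumes that the share of driven distance can be used directly as the weight of each vehicle class, which is a simplification on our part; the function name `ivoc_to_voc_ratio` is hypothetical.

```python
def ivoc_to_voc_ratio(diesel_share: float,
                      ratio_diesel: float = 0.65,
                      ratio_gasoline: float = 0.04) -> float:
    """Fleet-average IVOC/VOC emission ratio, weighted by the diesel share.

    diesel_share: fraction of driven distance covered by diesel vehicles.
    ratio_diesel, ratio_gasoline: IVOC/VOC ratios quoted in the text.
    """
    return diesel_share * ratio_diesel + (1.0 - diesel_share) * ratio_gasoline


if __name__ == "__main__":
    ratio = ivoc_to_voc_ratio(0.45)
    # 0.45 * 0.65 + 0.55 * 0.04 = 0.2925 + 0.022 = 0.3145, i.e. roughly 30%.
    print(f"{ratio:.2f}")
```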
The resulting distribution of urban organic emissions at noon as a function of the number of carbon atoms is presented in of importance above C12, whereas diesel vehicles emit species from C10 to C25, with a peak at C12. These features are present in the emissions estimated in this work, with the gasoline peak around C6-7 and the diesel maximum at C13. Gentner et al. (2017) also report that half of the gasoline VOC emissions are composed of linear and branched alkanes, the other half consisting of aromatics and cycloalkanes. In our estimates of gasoline emissions (C<12) the proportion of branched alkanes is smaller, alkenes constitute a more important fraction of emitted C4-6 species, branched cycloalkanes are missing, and aromatics constitute the majority of emissions of C7-10 compounds. These differences could represent differing sources of fuels or different distributions of vehicle brands and ages. In the case of diesel emissions, Gentner et al. (2017) report that they are approximately equally distributed between aromatics, branched cycloalkanes, bicycloalkanes and branched alkanes, whereas our method leads to diesel emissions consisting only of n-alkanes, which are used here as surrogate species for the entire mixture.
Choosing alkanes as surrogates for emitted IVOCs is likely to introduce uncertainties in the SOA produced from their oxidation. Lim and Ziemann (2009) carried out multiple chamber experiments that investigated the impact of branching and rings on alkane SOA yields. For instance they showed that SOA yields range from a few percent for branched alkanes with 12 carbon atoms to 80% for cyclododecane, while n-dodecane has an SOA yield of ≈ 32%. La et al. (2016) simulated these experiments with GECKO-A and were able to reproduce this experimentally observed behavior. This means that without a detailed inventory of emitted IVOCs, the uncertainty on the SOA yield from IVOCs is high in our version of the model. It should be noted that the range of measured SOA yields for structurally different compounds with the same number of carbon atoms seems to peak for C10 to C13 alkanes. The range of observed SOA yields in Lim and Ziemann (2009) decreases after this peak. For instance, SOA yields for C15 alkanes of various structures range from 45% to 90%. We can therefore expect the IVOC SOA yield to be highly sensitive to the speciation of compounds ranging from C12 to C14, but this sensitivity should decrease for heavier molecular weight species.

Additionally, emissions from 11 local thermal power plants (TPP) and one oil refinery located in the vicinity of Manaus were obtained from the data presented in Medeiros et al. (2017). Based on monthly statistics of fuel use in each of the TPP and the oil refinery, combined with emission factors of CO and NO x for each type of fuel (diesel, fuel oil, natural gas), we calculated

GECKO-A
All emitted organic compounds were used as inputs to GECKO-A to automatically generate the chemical scheme used in this study. The GECKO-A protocol has been described in detail in Aumont et al. (2005) and updated in Camredon et al. (2007), Valorso et al. (2011), Aumont et al. (2013), and La et al. (2016). Partitioning of low volatility compounds to the aerosol phase is described dynamically as in La et al. (2016). Vapor pressures are estimated with the Nannoolal et al. (2008) structure-activity relationship. As the first oxidation steps of isoprene have been widely studied in the literature, there is no need to generate them automatically with GECKO-A. The first two generations of isoprene oxidation were therefore taken from the Master Chemical Mechanism 3.3.1 (MCM; e.g. Jenkin et al., 1997; Saunders et al., 2003; Jenkin et al., 2015). With 12 biogenic and 53 anthropogenic precursors ranging from C2 to C25, some reductions are carried out to limit the size of the generated mechanisms. Species with an estimated vapor pressure below 10^-13 atm are assumed to partition entirely to the aerosol phase so quickly that a description of their gas phase oxidation is not needed (Valorso et al., 2011). Furthermore, lower yield, longer chain species are lumped with chemically similar compounds according to a hierarchical decision tree based on molecular structure (Valorso et al., 2011). The resulting chemical scheme contains 23 million reactions involving 4.4 million species, of which 780000 can partition into the aerosol phase. The time integration in the two-box model setup takes approximately 0.5 computing hours per simulated hour on 16 cores (Computational and Information Systems Laboratory, 2017).
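To give a sense of why the 10^-13 atm threshold is a reasonable cut-off, the sketch below converts a vapor pressure into a saturation concentration and an equilibrium particle-phase fraction using absorptive partitioning under Raoult's law. It is an equilibrium illustration only, not the dynamic treatment used in GECKO-A. The mean molar mass of the organic phase (200 g mol^-1), the unit activity coefficient, the temperature and the organic aerosol loading of 1 µg m^-3 are all illustrative assumptions, and the function names are hypothetical.

```python
R_ATM = 8.2057e-5  # gas constant in m3 atm mol-1 K-1


def saturation_concentration(p_vap_atm: float, mw_om: float = 200.0,
                             temp_k: float = 298.0, gamma: float = 1.0) -> float:
    """Saturation mass concentration C* (ug m-3) from a vapor pressure (atm).

    mw_om: assumed mean molar mass of the absorbing organic phase (g mol-1).
    gamma: assumed activity coefficient (1 = ideal mixture).
    """
    return 1.0e6 * mw_om * gamma * p_vap_atm / (R_ATM * temp_k)


def particle_fraction(p_vap_atm: float, c_oa: float, **kwargs) -> float:
    """Equilibrium fraction of a species in the particle phase for an
    organic aerosol mass concentration c_oa (ug m-3)."""
    c_star = saturation_concentration(p_vap_atm, **kwargs)
    return 1.0 / (1.0 + c_star / c_oa)


if __name__ == "__main__":
    for p_vap in (1e-8, 1e-10, 1e-13):
        print(p_vap, particle_fraction(p_vap, c_oa=1.0))
```

Under these assumptions a species at 10^-13 atm is more than 99.9% in the particle phase at equilibrium, whereas a species at 10^-8 atm remains almost entirely in the gas phase.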

Isoprene SOA formation
GECKO-A treats SOA formation through a dynamic approach that converges towards the equilibrium defined by the Pankow formulation of Raoult's law (Pankow, 1994). However, it is likely that isoprene SOA (ISOPSOA) formation is not only controlled by vapor pressure (Paulot et al., 2009). Among the factors that have been identified to play a role in ISOPSOA formation are: aqueous phase oxidation in deliquescent aerosol (e.g. Blando and Turpin, 2000; Ervens et al., 2011; Daumit et al., 2016), organic sulfate/nitrate formation via interaction with the inorganic component of the aerosol (e.g. McNeill et al., 2012; Pratt et al., 2013; Wang et al., 2018; Glasius et al., 2018; Jo et al., 2019), and accretion reactions in the bulk aerosol (e.g. oligomerization, dimerization, Altieri et al., 2006; Liu et al., 2012; Renard et al., 2015). None of these processes is currently implemented in the their epoxide ring via acid-catalyzed reactions. These reactions are followed by the nucleophilic addition of either (i) H2O to form methyltetrols or (ii) sulfate and nitrate ions to form organosulfates and organonitrates. The uptake of epoxides therefore depends on the acidity of particles, as well as their sulfate and nitrate content. These parameters had to be constrained in the model and were deduced from the T3 AMS measurements and literature data (see Table 1). On the other hand, isoprene oxidation products containing nitrate moieties (dihydroxydinitrates and isoprene nitrate) hydrolyze and form polyols and nitric acid.

Dry Deposition
Dry deposition is treated following the Wesely (1989) parameterization. This parameterization is a resistance model that allows calculating dry deposition velocities based on multiple resistances defined as properties of the surfaces. The city and the forest were respectively attributed the properties of surfaces defined as urban and deciduous forest in the Wesely (1989)

In the clean situations, measured isoprene mixing ratios range from 2-3 ppb at noon to 5-6 ppb at the end of the afternoon. The sum of all monoterpenes follows a similar increasing trend in the afternoon, from 0.1 to 0.3 ppb. After adjusting biogenic emission rates (see Sect. 3.2.1), the model is able to reproduce these mixing ratios, with isoprene and monoterpenes being

Pristine forest conditions are characterized in the model by low NOx emissions from the soil (8.3 × 10^9 molec cm^-2 s^-1 ≈ 1.5 × 10^-5 g m^-2 h^-1, see Table 1). The model predicts NOx mixing ratios around 50 ppt in the afternoon. In the polluted case, the background air mass is exposed to emissions of a complex mixture of anthropogenic compounds as well as NOx emissions that are three orders of magnitude higher (≈ 1 × 10^-2 g m^-2 h^-1, see Fig. 4). This leads to modeled NOx around 1 ppb in the afternoon, after a 48 ppb peak in the city in the morning. The increase in NOx is not as large in the experimental data, but these NOx measurements are highly uncertain, which could explain the discrepancies with the model.
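The equivalence between the two units quoted for the soil NOx flux can be checked with a short conversion. The quoted mass flux is reproduced when the molar mass of NO (about 30 g mol^-1) is used, which suggests, without confirming, that NOx mass is expressed as NO; this choice is our assumption. The function name is hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole
M_NO = 30.01              # g mol-1, assumes NOx mass is expressed as NO


def molec_cm2_s_to_g_m2_h(flux: float, molar_mass: float) -> float:
    """Convert a flux from molec cm-2 s-1 to g m-2 h-1."""
    cm2_per_m2 = 1.0e4
    s_per_h = 3600.0
    return flux * cm2_per_m2 * s_per_h / AVOGADRO * molar_mass


if __name__ == "__main__":
    soil = molec_cm2_s_to_g_m2_h(8.3e9, M_NO)
    urban = 1.0e-2  # g m-2 h-1, urban value quoted in the text
    print(f"soil flux: {soil:.2e} g m-2 h-1, urban/soil ratio: {urban / soil:.0f}")
```

With this conversion the urban to soil emission ratio is several hundred, in line with the "three orders of magnitude" stated above when read as an order-of-magnitude statement.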
Daytime ozone mixing ratios are modeled around 9 ppb in the clean situation, in the lower range of measured values. The higher NOx levels result in strong ozone production in the polluted plume, characterized by mixing ratios of 15 ppb at noon and up to 51 ppb at the end of the afternoon. During this increase of ozone production, the model matches T3 measurements around 23 ppb at 1 pm. On average, measured ozone in the polluted case is a factor of 2 higher than in the clean case, while the model sees an increase by a factor of 2 to 4 between noon and 6 pm. It should also be noted that the model completely separates clean and polluted situations, which increases the contrast for all variables compared to the classification of the measurements, which always includes some degree of mixing (see Sect. 2). The nighttime decay of ozone can be explained by dry deposition to the forest surface.
Furthermore, VOCs in the plume are exposed to high OH concentrations, with modeled concentrations reaching 1.9 × 10^7 molec cm^-3 in the afternoon. In the clean background, OH concentrations only reach 2 × 10^6 molec cm^-3. These clean values are in the lower range of reported measurements at T3. Unlike the model, OH measurements averaged at T3 and identified as clean and polluted did not exhibit any difference between the two situations (Fig. 6). In that case, there could be issues with the OH OH concentrations from isoprene and its oxidation products measurement. They showed that noontime OH concentrations vary between 5 × 10^5 molec cm^-3 in clean situations to 1.

At the measurement site, SOA mass concentrations measured by AMS range from 0.6 to 2.5 µg m^-3 in clean conditions. In polluted conditions, SOA mass concentrations range from 1.9 to 2.9 µg m^-3 (Fig. 7). In the clean case, the modeled SOA mass is within the range of T3 measurements, increasing from 0.6 µg m^-3 at sunrise to 2.16 µg m^-3 at the end of the afternoon.
In the polluted situation, modeled SOA mass concentration is very similar to the clean simulation, with only a 20 minute delay in the start of SOA production. The maximum concentration is 2.23 µg m^-3, only a 3.5% increase compared to the clean simulation, while experimentally this increase averaged around 56%. Because the model is unable to reproduce the observed urban SOA enhancement, in the polluted situation the model underestimates SOA mass by 10 to 45%.

Glasius et al. (2018). This is consistent with Glasius et al. (2018), who reported that the main source of the measured organosulfates is IEPOX heterogeneous uptake, which is the only pathway represented in this model. Furthermore, this shows that the combination of the MCM 3.3.1 isoprene oxidation mechanism to produce IEPOX and the reactive uptake parameterization from Marais et al. (2016) is able to predict realistic levels of organosulfates, assuming that aerosol properties are also realistic (hygroscopicity, inorganic sulfates and pH).

Figure 9 depicts the distribution of organic functional groups in the particle phase. In the clean case scenario, total functionalization, defined as the number of functional groups per carbon atom, is constant at approximately 0.5. As expected for a low-NOx situation, approximately 40% of these functional groups are hydroxy moieties and 30% are hydroperoxides. The remaining functional groups are dominated by carbonyls and, to a lesser extent, nitrates. Manaus pollution has the direct effect of reducing total functionalization by 10% because of the contribution of long-chain primary hydrocarbons to SOA formation in the plume. Oxidation of organics in the higher NOx environment also leads to an increase of the contribution of nitrate moieties at the expense of hydroxy and hydroperoxide moieties.

conditions while H/C ratios range from 1.2 to 1.4.
Additionally, airborne measurements above the T3 site report O/C ratios ranging from 0.35 to 0.9 and H/C ratios ranging from 1.5 to 1.9. Compiling AMS measurements from multiple field campaigns, Chen et al. (2015b) reported van Krevelen diagram slopes (H/C vs O/C) ranging from -1 to -0.7. A linear regression over the data points from both airborne and ground measurements (dotted line on Fig. 10) gives a slope of -1.3, close to values reported in Chen et al. (2015b). This means that T3 air masses may be sampled at a later stage of oxidation than the airborne samples, possibly because they were exposed to higher levels of oxidants than the higher altitude air masses.

As a test, we generalized this estimation to all C10 in the aerosol phase: we replaced each C10 by the corresponding C20 and halved its concentration. In this way, we can calculate what the H/C and O/C ratios would be in the aerosol phase if aging processes only dimerized C10 compounds. The resulting modeled van Krevelen diagram is reported on Fig. 10 (labeled w/ dimer.). The impact of C10 dimerization is relatively strong on the O/C ratio, which ranges from 0.66 to 0.78 and remains in the range of O/C ratios measured at the T3 site and in the aircraft. H/C ratios are only reduced to 1.88-1.94, still 50% higher than H/C measured at the T3 site and 20% higher than airborne data.

Modeled vs Measured Atomic Ratios
Conversely, GECKO-A could be missing processes that would fragment the aforementioned two C10 compounds. Fragmenting C10H18O7 into a C4H6O4 (H/C = 1.5; O/C = 1) and a C6H10O5 (H/C = 1.7; O/C = 0.8) species would bring the average H/C ratio down from 1.8 to 1.6. This possibility of missing fragmentation processes means that either the modeled gas phase chemistry does not compete enough with condensation to fragment these species, or these C10 species should be fragmented by heterogeneous or condensed phase processes in the particles themselves, which are not accounted for by the model. It should be noted that because the fragmented compounds are lighter, they would exhibit higher volatility. However, this does not necessarily mean that the SOA mass would decrease, because these shorter species are still oxygenated, maybe enough to contribute to SOA mass through solubility-controlled processes in the same fashion as what is known about isoprene oxidation products.
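The atomic ratios quoted in this fragmentation example can be verified by counting atoms over the mixture. The sketch below computes mole-weighted bulk H/C and O/C ratios from molecular formulas; the function `bulk_ratios` and its input layout are hypothetical helpers, not part of GECKO-A.

```python
def bulk_ratios(species):
    """Return (H/C, O/C) for a mixture.

    species: iterable of (n_carbon, n_hydrogen, n_oxygen, moles) tuples.
    Ratios are computed from total atom counts, i.e. weighted by moles.
    """
    total_c = sum(n_c * moles for n_c, _, _, moles in species)
    total_h = sum(n_h * moles for _, n_h, _, moles in species)
    total_o = sum(n_o * moles for _, _, n_o, moles in species)
    return total_h / total_c, total_o / total_c


if __name__ == "__main__":
    parent = [(10, 18, 7, 1.0)]                    # C10H18O7
    fragments = [(4, 6, 4, 1.0), (6, 10, 5, 1.0)]  # C4H6O4 + C6H10O5
    print("parent   :", bulk_ratios(parent))
    print("fragments:", bulk_ratios(fragments))
```

For the parent compound this gives H/C = 1.8 and O/C = 0.7, and for the two fragments together H/C = 1.6 and O/C = 0.9, so the example implies a net loss of two hydrogen atoms and a gain of two oxygen atoms per fragmented molecule.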

As another test, we also estimated what the O/C and H/C ratios would be if all C10 fragmented in the aerosol phase. The resulting modeled van Krevelen diagram is reported on Fig. 10 (labeled w/ frag.). In this case, modeled O/C ratios increase to a range of 0.88 to 0.96 and remain in the higher end of ratios measured at the T3 site. H/C ratios are reduced further than in the dimerization test and sit at the higher end of airborne measured H/C ratios, but they are still 45% higher than H/C ratios measured at the T3 site.
Even if they apparently cannot account for the discrepancy between modeled and measured H/C ratios, the two tests presented here on C10 compounds in the aerosol phase show the potential importance of adding these missing processes in GECKO-A. These simple tests are however simplifications that overlook important factors in the potential impact on SOA composition: (i) not all C10 compounds would be affected by these processes, (ii) compounds other than C10 could react in a similar way, (iii) trimerization, tetramerization and other accretion processes could also occur in the aerosol phase, (iv) missing fragmentation processes could also happen in the gas phase.

Shrivastava et al. (2019) modeled this same field campaign with WRF-Chem, a regional chemistry transport model (Grell et al., 2005). Similarly to this work, they based their primary organic compound emissions on the MEGAN inventory (Guenther et al., 2012) for biogenic compounds, and on the methodology described in Andrade et al. (2015) and data from Medeiros et al. (2017) for anthropogenic emissions. Using a volatility basis set (VBS) approach to account for condensation of low volatility species, and considering ISOPSOA separately with an approach similar to this work, they modeled airborne SOA mass to within 15% of airborne measurements. The VBS parameterization described in Shrivastava et al. (2019) represents the formation of SOA as four surrogate species differing by their volatility (C* = 0.1, 1, 10 and 100 µg m^-3). For biogenic SOA, isoprene and monoterpenes produce these four surrogates from the oxidation by OH, ozone and NO3, with yields depending on NOx.

Comparison with VBS approach
Moreover, multigenerational aging is accounted for by assigning fragmentation (i.e. increasing volatility) and functionalization (i.e. decreasing volatility) pathways to the surrogate species. This aging is parameterized as a reaction of each of the SOA surrogate species VBS_n with OH as follows: The reaction rate is k_R1 = 2 × 10^-11 cm^3 molec^-1 s^-1. The branching ratio for fragmentation α_frag is determined as the ratio of the reaction rate of peroxy radicals with NO to the sum of all peroxy radical reaction rates; it has an upper limit of 75%.
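The two ingredients of this aging scheme stated above, the OH rate constant and the capped fragmentation branching ratio, can be illustrated with a short sketch. The function names and the way peroxy radical loss rates are passed in are hypothetical; the OH concentrations in the usage example are round values of the order discussed in this paper and are used only to show how the aging time scale shortens at high OH.

```python
K_R1 = 2.0e-11  # cm3 molec-1 s-1, rate constant of VBS surrogates with OH (from the text)


def alpha_frag(rate_ro2_no: float, rate_ro2_other: float, cap: float = 0.75) -> float:
    """Fragmentation branching ratio: RO2 + NO rate over all RO2 loss rates,
    limited to an upper value of 75%.

    rate_ro2_no: loss rate of peroxy radicals through reaction with NO.
    rate_ro2_other: summed loss rate through all other peroxy radical reactions.
    """
    total = rate_ro2_no + rate_ro2_other
    if total <= 0.0:
        return 0.0
    return min(rate_ro2_no / total, cap)


def aging_lifetime_hours(oh: float) -> float:
    """e-folding time (h) of a VBS surrogate against reaction with OH (molec cm-3)."""
    return 1.0 / (K_R1 * oh) / 3600.0


if __name__ == "__main__":
    for oh in (2.0e6, 1.6e7):
        print(f"OH = {oh:.1e} molec cm-3 -> lifetime = {aging_lifetime_hours(oh):.2f} h")
    print("alpha_frag at NO-dominated RO2 loss:", alpha_frag(9.0, 1.0))
```

At 2 × 10^6 molec cm^-3 of OH the surrogate lifetime is close to 7 hours, whereas at 1.6 × 10^7 molec cm^-3 it drops below one hour, so several aging generations can occur within a single afternoon.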

The yields used in this VBS approach were fitted over a variety of low OA loading atmospheric chamber studies of biogenic oxidation under high and low NOx concentrations (Shrivastava et al., 2019). More details about this VBS approach can be found in Shrivastava et al. (2013, 2015, 2019).

In case scenario, the SOA mass concentration only decreases by approximately 10% when SOA aging is removed. This means that SOA aging becomes more important in the ground case scenario when the air mass is exposed to high OH concentrations that were not seen by the model run by Shrivastava et al. (2019): their maximum OH concentrations reach 2 × 10^6 molec cm^-3 while our maximum OH concentrations reach 1.6 × 10^7 molec cm^-3.

Figure 11 and Table 2 attribute sources of SOA according to the GECKO-A explicit simulation and the VBS approach. In the clean case scenario, GECKO-A attributes most of the SOA mass to monoterpene oxidation products (65% at 2 pm). The remainder is attributed to isoprene SOA, with condensation of low volatility compounds contributing in the same proportion as reactive uptake (17% and 18% respectively). In Shrivastava et al. (2019), monoterpene oxidation products account for 45% of SOA sources in the airborne plume. With their VBS applied to the ground situation, 28% of SOA is attributed to monoterpenes at 2 pm, approximately half of the proportion predicted by the GECKO-A explicit approach. As in the 3D model calculation, the VBS in the box model attributes the remainder of background SOA mass mostly to reactive uptake of isoprene oxidation products (53% of total SOA).
In the polluted case, the explicit model predicts a slight decrease of 6% in total SOA at 2 pm, while measurements exhibit an increase of 33% on average. The urban effect is stronger in the VBS case than in the explicit approach, with a 380% increase [...] cause the same large increase in biogenic SOA formation (up to 400%), and that this increase is due to enhanced monoterpene oxidation. With GECKO-A at the ground site, SOA mass remains nearly constant because the contribution of anthropogenic compounds compensates for the decrease (by 32%) in the contribution from the condensation of isoprene and monoterpene oxidation products. This loss is also slightly compensated for by an increase in the production of SOA via reactive uptake of isoprene oxidation products (15% increase), because the plume favors these processes with a higher sulfate load and lower pH (see Table 1). Overall, biogenic SOA decreases by 23% with respect to the clean case. In the VBS test case, SOA mass formed from condensation of low volatility oxidation products of isoprene and monoterpenes is enhanced in the polluted case by factors of 7 and 3, respectively. This enhancement is notably inhibited when the aging parameterization is removed from the VBS approach, with mass increases due to condensation of low volatility products of isoprene and monoterpenes of 100% and 21%, respectively. This highlights the importance of modeling the aging of low volatility oxidation products to explain the enhanced production of SOA in the urban plume.

Potential for Reduction of the Explicit GECKO-A mechanism
It is obvious that the chemical mechanisms generated with GECKO-A are too large to be implemented in 3D models. The GECKO-A mechanisms need to be reduced to sizes manageable by 3D models, typically a few hundred species and reactions.
The VBS parameterization used for comparison in this work is fitted for low OA loading, biogenic-dominated situations, but it is unclear whether it should be applied to other situations.
In this section, we are not proposing a much needed new approach to reducing explicit mechanisms with the goal of predicting SOA mass concentrations, but we explore here the potential for reduction of the chemical mechanism that was generated for this study. In other words, what is the theoretical lower limit to the number of species that should be used in a reduced scheme to still be able to model the same SOA mass concentration time profile as the explicit model?

To answer this, two metrics are presented in Fig. 12. The first one, $N_{90\%}$, is the smallest number of species needed in the explicit model to capture 90% of the total SOA mass at each timestep. After sorting species by decreasing concentration, this number is calculated by adding up these concentrations until 90% of the total modeled SOA mass is reached. The operation is repeated at each timestep. The second one, calculated independently, is the particle diversity $D$ in the explicitly modeled SOA, as defined for instance in Riemer and West (2013):

$$D = e^{S},$$

where $S$ is the first order generalized entropy (also known as Shannon entropy):

$$S = -\sum_{i=1}^{N} p_i \ln p_i,$$

where $p_i$ is the mass fraction of species $i$ in the organic particle phase and $N$ is the total number of species in the organic particle phase. As stated in Riemer and West (2013), the diversity is a measure of the effective number of species with the same concentration in the organic fraction of the aerosol phase. If $D = 1$, the organic fraction is pure, as it is composed of a single species. Therefore, a value $D \ll N$ means that of all the species contributing to modeled organic aerosol, only a few contribute significantly to its composition. Conversely, $D = N$ is the maximum value reachable by $D$ and is obtained when the organic fraction is composed of $N$ equally distributed species. In the case where $D$ is close to $N$, only a few species are negligible. For more details and better explanations, we refer the reader to Riemer and West (2013, esp. Fig. 1). We make the hypothesis here that $D$ can be interpreted as an effective number of species derived from the informational entropy of the modeled particle phase.
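The two metrics can be sketched in a few lines of Python for a single timestep. This is a minimal illustration under stated assumptions, not the analysis code used for Fig. 12: the function names `n90` and `diversity` and the example concentrations are hypothetical, and the input is simply a list of particle-phase mass concentrations, one per species.

```python
import math


def n90(masses, fraction=0.9):
    """Smallest number of species whose summed mass reaches `fraction` of the total."""
    total = sum(masses)
    if total <= 0.0:
        return 0
    running = 0.0
    # Add species from the most to the least abundant until the threshold is reached.
    for count, mass in enumerate(sorted(masses, reverse=True), start=1):
        running += mass
        if running >= fraction * total:
            return count
    return len(masses)


def diversity(masses):
    """Particle diversity D = exp(S), with S the Shannon entropy of the mass fractions."""
    total = sum(masses)
    if total <= 0.0:
        return 0.0
    entropy = 0.0
    for mass in masses:
        if mass > 0.0:  # species with zero mass contribute nothing to the entropy
            p = mass / total
            entropy -= p * math.log(p)
    return math.exp(entropy)


if __name__ == "__main__":
    # Hypothetical particle-phase concentrations for four species (arbitrary units).
    example = [5.0, 50.0, 15.0, 30.0]
    print(n90(example), diversity(example))
```

Applying both functions to the particle-phase concentrations at every output timestep would give time series comparable in spirit to those discussed here. Note that `n90` returns an integer count of actual species, whereas `diversity` returns a real-valued effective number.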
In the clean situation both metrics behave similarly, with a morning increase of the number of species until 10 am, after which during daytime by about a factor of 9, reaching about 2500. The calculated diversity only increases up to approximately 260 effective species. These increases in the species numbers for the polluted case are logical as the variety of precursors, and hence secondary species that could potentially contribute to SOA, is increased by urban emissions.
The number of species needed to represent most of the modeled SOA mass in all cases seems too high to be used in 3D model applications. Furthermore, there is no guarantee that the most important species at a given timestep would still be the most important species at the following timestep. This suggests that reductions should not come from simply selecting species identified as important to represent the variety of species that could arise in the interaction of biogenic air and an urban plume.
The statistical diversity calculation seems like a better approach to estimate the minimum number of species needed to model SOA mass. As this number is directly derived from informational entropy, we suggest that the diversity represents the number of species that would be needed to reproduce the same informational content regarding the time evolution of SOA mass in the explicit model. Even if the effective species numbers fall in the higher range of what would be acceptable in a 3D model chemical mechanism, the practical construction of the mechanism remains to be explored. For instance, in the polluted scenario, $D$ is a factor of 9 lower than $N_{90\%}$. This should mean that $D$ cannot represent a subset of the individual species from the original mechanism; otherwise it would be expected to be equal to or higher than $N_{90\%}$ if it is supposed to reproduce the informational content regarding SOA mass. It is therefore likely, which makes this problem more complex, that each of these effective species is a (non)linear combination of explicit individual species.
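As a hedged illustration of how $D$ can fall far below $N_{90\%}$, consider a hypothetical particle phase (not taken from the model output) in which one species carries half of the organic mass and the other half is shared equally among $M$ species, so that $N = M + 1$. With the Shannon entropy $S = -\sum_i p_i \ln p_i$ and $D = e^{S}$:

$$S = -\tfrac{1}{2}\ln\tfrac{1}{2} - M \cdot \tfrac{1}{2M}\ln\tfrac{1}{2M} = \tfrac{1}{2}\ln 2 + \tfrac{1}{2}\ln(2M) = \tfrac{1}{2}\ln(4M), \qquad D = e^{S} = 2\sqrt{M}.$$

Reaching 90% of the mass requires the dominant species plus enough minor species to supply a further 40% of the total, i.e. a fraction 0.8 of the $M$ minor species:

$$N_{90\%} = 1 + \lceil 0.8\,M \rceil.$$

For $M = 1000$, this gives $D \approx 63$ but $N_{90\%} = 801$, a ratio of about 13. In this idealized case the effective number of species is much smaller than any subset of real species capturing 90% of the mass, which is consistent with the interpretation that the effective species are combinations of explicit species rather than a selection of them.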
Finally, we used in this section an entropy calculation for SOA mass: it is based only on mass fractions of the species composing the modeled organic particles. The effective number of species displayed in Fig. 12 is therefore only meaningful for SOA mass and properties directly linked to it. If the goal is to predict other properties, e.g. hygroscopicity, toxicity or optical properties, assuming we find a way to calculate these with GECKO-A, the diversity defined here would not necessarily be meaningful. For instance, hygroscopicity or toxicity could be driven by a handful of oxygenated species that do not matter for the informational content regarding SOA mass. We did not explore further down this path, as this is not the subject of this paper, but it may be possible to generalize this definition of informational diversity to properties other than mass.

Conclusions
An explicit chemical mechanism generated with GECKO-A was used in a box model to simulate a situation similar to the one studied in Manaus during the GoAmazon 2014/5 field campaign. After scaling down the emissions generated from the MEGAN biogenic emissions model and estimating urban emissions in Manaus, the model was able to reproduce realistic primary organic compound mixing ratios as well as NO$_x$, ozone and OH concentrations.
The model is able to reproduce background SOA mass concentrations but is not able to reproduce the observed enhancement in the polluted plume. When running a Volatility Basis Set approach that was previously applied to the Manaus case (Shri- , only a handful of monoterpenes have been studied to the extent that we can be as confident in model predictions of SOA formation from monoterpenes as from isoprene. Detailed mechanistic studies of monoterpene oxidation are therefore needed for further incorporation in explicit models to better understand the nature and the magnitude of the contribution of monoterpenes to SOA formation, as well as their response to the interaction with urban pollution (e.g. Claflin and Ziemann, 2018).

Although a parameterization was implemented in GECKO-A to address the formation of isoprene SOA via aqueous phase processes (Marais et al., 2016), treating these processes explicitly and in a more general way will require future GECKO-A developments for mechanism generation to include the following: (i) aerosol thermodynamics, for instance via coupling with a model like MOSAIC (Zaveri et al., 2008) or ISORROPIA (Nenes et al., 1998), (ii) aqueous phase processes including explicit dissolution (e.g. Mouchel-Vallon et al., 2013), oxidation (e.g. Mouchel-Vallon et al., 2017), accretion reactions (e.g. Renard et al., 2015), and interaction with dissolved inorganic ions, (iii) explicit treatment of the fate of newly formed species like dimers or organo-sulfates.
One could be tempted to think that since the VBS parameterization is behaving particularly well in this GoAmazon 2014/5 case, it could be the answer to predict SOA mass in larger scale 3D models. However, this approach is limited by the fact that it was fitted for low biogenic OA loading situations and was run in a limited domain regional model (Shrivastava et al., 2019). One possible way of building reduced mechanisms is to reduce existing detailed chemical mechanisms to sizes manageable by 3D models (e.g. Szopa et al., 2005; Kaduwela et al., 2015). Using an information theory based approach, we provide here a lower limit to the size of these reduced mechanisms, assuming the goal is to produce the same informational content as the explicit mechanism. This lower limit of a few hundred species is four orders of magnitude lower than the number of species actually accounted for in the explicit mechanism ($4 \times 10^{6}$) and shows the potential for progress in future mechanism reduction endeavors. Even if a direct application of this statistical approach to create a reduced mechanism would likely require some atmospheric chemistry breakthrough, it could at least currently be used as a statistical indicator for comparing reduced mechanisms with reference explicit mechanisms.
Code and data availability. The GoAmazon 2014/5 experimental data is available from the ARM website: https://www.arm.gov/research/campaigns/amf2014goamazon

The chemical mechanism generated for this study is available upon request from CMV in text or netcdf format.