Source-receptor relationships for airborne measurements of CO2, CO and O3 above Siberia: a cluster-based approach

Abstract. We analysed results of three intensive aircraft campaigns above Siberia (April and September 2006, August 2007) with a total of ~70 h of continuous CO2, CO and O3 measurements. The flight route consists of consecutive ascents and descents between Novosibirsk (55° N, 82° E) and Yakutsk (62° N, 129° E). We performed retroplume calculations with the Lagrangian particle dispersion model FLEXPART for many short segments along the flight tracks. To reduce the extremely rich information on source regions provided by the model calculation into a small number of distinct cases, we performed a statistical clustering (to our knowledge for the first time) into potential source regions of the footprint emission sensitivities obtained from the model calculations. This technique not only worked well to separate source region influences but also resulted in clearly distinct tracer concentrations for the various clusters obtained. High CO and O3 concentrations were found associated with agricultural fire plumes originating from Kazakhstan in September 2006. A statistical analysis indicates that summer uptake of CO2 is largely explained (~50% of variance) by air mass exposure to uptake by Siberian and sub-arctic ecosystems. This resulted in an average 5 to 10 ppm difference with overlying air masses. Stratosphere-troposphere exchange is found to strongly influence the observed O3 mixing ratios in spring, but not in summer. European emissions contributed to high O3 concentrations above Siberia in April 2006 and August 2007, while emissions from North Eastern China also contributed to higher O3 mixing ratios in summer, but tended to lower mixing ratios in spring, when the airmass aerosol burden is important. In the lower troposphere, large-scale deposition processes in the boreal and sub-arctic boundary layer are a large O3 sink, resulting in a ~20 ppb difference with overlying air masses.
Lagrangian footprint clustering is very promising and could also be advantageously applied to the interpretation of ground based measurements including calculation of tracers' sources and sinks.


Introduction
Atmospheric transport of pollutants can occur both at low altitude in the boundary layer and in the free troposphere, where it can be much faster. Export of North-East Asian and North American emissions is associated with strong uplift in the warm conveyor belts of mid-latitude cyclones (e.g. Cooper et al., 2001; Liang et al., 2004; Owen et al., 2006). In contrast, European emissions tend to remain in the lower troposphere and are most frequently channelled to the Arctic or to Siberia (Wild et al., 2004; Stohl et al., 2002, 2007a; Duncan and Bey, 2004; Law and Stohl, 2007; Liu et al., 2002). Faster zonal advection occurs in winter and early spring in association with the Siberian High (Newell and Evans, 2000; Wild et al., 2004; Liu et al., 2003). In April, European pollutants can contribute significantly to tracer concentrations in the Asian outflow to the Pacific (Liu et al., 2003; Liang et al., 2004). With respect to ozone production, European export increases O3 concentrations over Siberia by 2 to 4 ppb in spring and 2 to 6 ppb in summer (Wild et al., 2004; Duncan and Bey, 2004).
Published by Copernicus Publications on behalf of the European Geosciences Union.
Pollutant export to or across Siberia, however, has been studied almost exclusively through modelling studies and lacks measurement-based assessment. Pochanart et al. (2003) conducted ground-based measurements supported by transport model analysis, showing that European emissions lead to a 1 to 4 ppb increase in O3 near Lake Baikal. Eneroth et al. (2003), based on backtrajectory analysis of transport to the West-Siberian site of Zotino, showed that stagnant flow conditions were responsible for a respiration build-up leading to high CO2 concentrations (and ruled out European emissions). Measurements of the distribution of CO2 and pollutants are nevertheless still required over Siberia, along with the possibility to relate them to remote emission sources, i.e. source-receptor relationships (SRR).
In contrast to single backtrajectories, the SRR obtained from a Lagrangian particle dispersion model (LPDM) accounts for atmospheric turbulence and convection (Stohl et al., 1998). Han et al. (2005) compared the results of regional source apportionment using single backtrajectories and backward dispersion. Their backward dispersion model added an individual turbulence-related stochastic component to a large number of HYSPLIT backtrajectories. They demonstrated the better ability of the LPDM to identify regional point sources of reactive mercury. Use of an LPDM has been proposed (Gerbig et al., 2003; Lin et al., 2003) and eventually demonstrated (Lauvaux et al., 2008) for meso- to regional-scale inversion of CO2 surface fluxes.
Clustering of backtrajectories has been widely used for the analysis of atmospheric composition measurements at fixed observatories and has been found to be an efficient method for separating air masses with different properties (Moody and Galloway, 1988; Dorling et al., 1992; Sirois and Bottenheim, 1995; Eneroth et al., 2003). Moody and Galloway (1988) were the first to attempt clustering of backtrajectories in order to assess the wet deposition of acid compounds over Bermuda. The same technique has been used at the continental scale to derive SRRs for O3 concentrations in central Siberia and supported the conclusion of elevated CO in air masses influenced by European emissions (Pochanart et al., 2003). Traub et al. (2003) applied a simple backtrajectory partitioning technique to the extensive analysis of measurements obtained from an aircraft. To our knowledge, no attempt has yet been made to use a clustering technique based on LPDM footprints.
Aircraft measurements only provide a snapshot of the atmosphere at a particular time, but they have the potential to explore transport processes and impact across transported plumes (Takegawa et al., 2004; Fehsenfeld et al., 2006; Stohl et al., 2007b; Methven et al., 2006; Lelieveld et al., 2002) or in the outflow of massively emitting regions (Jacob et al., 2003). Here we investigate data from three recent YAK-AEROSIB (Airborne Extensive Regional Observations in Siberia) intensive campaigns that sampled the Siberian troposphere at different times of the year (Paris et al., 2008). Measurement programmes of different types dedicated to CO2, CO, O3 and other tracers have been implemented in various parts of Siberia using aircraft campaigns equipped with flask samplers (Nakazawa et al., 1997), aircraft vertical profiles (Lloyd et al., 2002; Levin et al., 2002; Ramonet et al., 2002), the Transsiberian railroad (Oberlander et al., 2002) or tall towers (Kozlova et al., 2008).
Three airborne campaigns were made across Siberia in April 2006, September 2006 and August 2007 respectively. Numerous vertical profiles of continuous measurements of CO, CO2, O3 and aerosols were collected, extensively sampling the regional troposphere at different times of the year. After describing the data set, we analyse the tracer concentrations using a clustering technique based on the partitioning of footprints obtained with an LPDM.
More specifically, we focus on the extent to which clusters (groups with similar distributions) of potential emission sensitivity (PES) maps for receptor positions relate to the chemical composition of the air masses at these receptors. The PES is expressed here as a residence time. The clusters are used to generalize SRR throughout the aircraft trajectory through the Siberian air shed. We try to answer the following questions: (1) Can we identify in our data the contribution to CO2 and CO concentrations of European or other remote anthropogenic sources predicted by models? (2) Is it possible to identify contributions from forest fires or from other types of biomass burning? (3) To what extent can we explain the seasonal variability in CO2, CO and O3 concentrations by variations in atmospheric transport patterns through intensive campaigns? (4) Is there a consistent signal about the regional carbon source/sink distribution emerging through model analysis of CO2 concentrations?
Section 2 describes the campaigns, instruments, model and statistical tools. Section 3 describes the variability of CO2, CO and O3 observed above Siberia during the three campaigns. In Sect. 4 we discuss, using cluster analysis, the tracer concentrations observed in selected flights. Section 5 investigates the "seasonal" (inter-campaign) variation of the connection between source regions and trace gas concentrations over Siberia.

Campaigns overview and instruments
The airborne platform is an Antonov-30 dubbed "Optik-E" chartered by Tomsk's Institute of Atmospheric Optics. It was equipped in collaboration between French and Russian laboratories for the measurement of CO2, CO, O3, aerosols and meteorological parameters. Intensive campaigns took place over Siberia between 11 and 14 April 2006, between 7 and 10 September 2006 and between 17 and 20 August 2007. The flight route consists of a large, continental-scale loop from Novosibirsk in central Siberia to Yakutsk in eastern Siberia (Fig. 1) that is completed in four days. Each campaign
Atmos. Chem. Phys., 10, 1671-1687, 2010, www.atmos-chem-phys.net/10/1671/2010/
The flight route is repeated with minor modifications for each campaign. Flights were conducted in all weather conditions, usually between 09:00 and 17:00 LT (Paris et al., 2008). Profiles are collected as often as possible, as the flight tracks consist mainly of ascents up to 7 km altitude and descents. Thick cloud decks were flown over instead of conducting normal ascents and descents through them.
The April campaign was dominated by stagnant flow conditions and low surface temperatures (Paris et al., 2008), with most of the ground covered by snow. In contrast, the September 2006 and August 2007 campaigns were characterized by westerly flow, warmer and highly variable temperatures, and significant frontal activity in the flight area.
Detailed instrument descriptions can be found in Paris et al. (2008); we give here only a short description. CO2 is measured by a modified non-dispersive infrared analyser based on a commercial Li-Cor 6262, with accuracy and precision of 0.15 ppm, obtained by periodic in-flight calibration against WMO-referenced gases bracketing atmospheric concentrations, and by regulation of pressure, temperature and flow in the cells. CO is measured by IR absorption gas correlation with an accuracy of 5 ppb or 5%. The instrument is based on a commercial infrared absorption correlation gas analyser (Model 48C, TEI Thermo Environment Instruments, USA; Nédélec et al., 2003). O3 is measured by a modified commercial fast-response UV ozone analyser (Thermo Instruments Model 49) with a precision of 2 ppb, 2% for an integration time of 4 s.

Atmospheric backward transport model
Atmospheric transport was investigated using the FLEXPART v6.2 LPDM. FLEXPART calculates the trajectories of tracer particles using the mean winds interpolated from the analysis fields plus random motions representing turbulence (Stohl and Thomson, 1999). Results presented here use ECMWF analysis fields, although both ECMWF and GFS (NOAA/NCEP) fields were used for a prior assessment of transport error (see http://zardoz.nilu.no/~andreas/YAK/). For moist convective transport, FLEXPART uses the scheme of Emanuel and Zivkovic-Rothman (1999), as described and tested by Forster et al. (2007). A backward simulation mode is available, described in more detail by Stohl et al. (2005) and Seibert and Frank (2004). Here the backward method is used to analyse transport pathways from potential flux regions to the receptor position. Each simulation consists of 40 000 particles released whenever the aircraft has moved 0.15° in latitude or longitude, or 10 hPa in altitude. This corresponds to about one minute of sampling or to a layer ~100 m thick during ascent or descent. Released particles were followed 10 days backward in time. Gridded PESs in three vertical levels (0-300 m, 300-3000 m, and 3000-50 000 m) are stored at 1°×1° resolution every 24 h. The fraction of stratospheric air is recorded as the percentage of all particles originating from the stratosphere (i.e. particles that have resided in the stratosphere at any time within the last 10 days).
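The release criterion described above (a new release each time the aircraft has moved 0.15° in latitude or longitude, or 10 hPa in pressure) can be illustrated with a minimal sketch. The function name `release_indices` and the list-based track format are hypothetical and are not taken from FLEXPART or from the campaign processing chain; only the two thresholds come from the text.

```python
def release_indices(lat, lon, pres_hpa, d_deg=0.15, d_hpa=10.0):
    """Return the indices of track samples at which a new release starts.

    Hypothetical helper: a new release begins whenever the aircraft has moved
    at least d_deg in latitude or longitude, or d_hpa in pressure, relative to
    the start of the current release.
    """
    if not lat:
        return []
    starts = [0]
    ref = 0  # index of the sample that opened the current release
    for i in range(1, len(lat)):
        moved = (
            abs(lat[i] - lat[ref]) >= d_deg
            or abs(lon[i] - lon[ref]) >= d_deg
            or abs(pres_hpa[i] - pres_hpa[ref]) >= d_hpa
        )
        if moved:
            starts.append(i)
            ref = i
    return starts


# Synthetic track samples (not campaign data)
lat = [55.0, 55.05, 55.1, 55.16, 55.2]
lon = [82.0, 82.0, 82.0, 82.0, 82.0]
pres = [900.0, 898.0, 896.0, 894.0, 880.0]
starts = release_indices(lat, lon, pres)
```

With these synthetic samples, a release opens at the first sample, at the fourth (latitude change above 0.15°) and at the fifth (pressure change above 10 hPa).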

Cluster analysis
Airborne campaigns can deliver detailed information on the atmospheric state at a particular time. To generalize this information and to organize the YAK-AEROSIB data set, a method is sought that groups the data according to common transport properties. Cluster analysis is such an exploratory tool, which sorts multivariate data into groups that are as dissimilar as possible but whose properties are not known a priori. We seek to investigate to what extent footprints can explain the air mass chemical composition in CO2, CO and O3. Footprints are implemented as the 10-day summed spatial distribution of PES obtained from FLEXPART, complemented by 10-day averaged relative contributions from the stratosphere, defined as the region with potential vorticity >2 PVU. To facilitate analysis we have reduced the number of variables from the gridded PES by further summing these PES over large regions of interest for our study (Fig. 2). A vector x_j of M variables is associated with each consecutive receptor position j. This time series x_j will constitute our set of j realizations of M variables to cluster. Defining j over a single campaign improves the separation capability of the algorithm, whereas defining j over the three campaigns allows inter-campaign comparison.
The M boxes are identified a priori as regions with specific sources or sinks for the species relevant to this study. Such identification is based either on case studies from the April and September 2006 campaigns (Paris et al., 2008), or on the questions to be addressed with this dataset, such as the transport of European emissions to Siberia. The regions identified are:
- Europe: European pollutant emissions and outflow;
- Stratosphere: an extra variable is added which averages the 10-day proportion of stratospheric particles.
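As an illustration of this data reduction, the sketch below sums a gridded PES over a set of boxes and appends the stratospheric fraction as an extra variable. The box bounds, the dictionary-based grid and the function name `region_vector` are hypothetical; the actual boxes are those of Fig. 2 and are not reproduced here.

```python
def region_vector(pes, boxes, strat_fraction):
    """Build one receptor vector from a gridded PES.

    pes: dict mapping the (lat, lon) corner of a 1x1 degree cell to a
         residence time in seconds.
    boxes: dict mapping a region name to (lat_min, lat_max, lon_min, lon_max);
           these bounds are illustrative assumptions, not the boxes of Fig. 2.
    strat_fraction: 10-day averaged stratospheric contribution (percent).
    """
    vector = []
    for lat_min, lat_max, lon_min, lon_max in boxes.values():
        total = sum(
            t for (lat, lon), t in pes.items()
            if lat_min <= lat < lat_max and lon_min <= lon < lon_max
        )
        vector.append(total)
    vector.append(strat_fraction)  # extra, non-geographical variable
    return vector


# Synthetic example (not campaign data)
boxes = {"Europe": (40, 60, 0, 30), "Arctic": (70, 90, 0, 180)}
pes = {(50, 10): 100.0, (55, 20): 50.0, (75, 100): 30.0, (10, 10): 999.0}
x_j = region_vector(pes, boxes, 12.5)
```

Cells outside every box, such as the last one in this example, do not contribute to any variable, which mirrors the fact that the boxes do not tile the whole domain.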
The boxes' geographical extent is the result of a trade-off between maximizing differences between transport from the various boxes, and an insufficient number of particles if boxes are chosen to be too small. There are large regions not covered by any box. Theoretically, this could mean that a footprint is located entirely outside all of the boxes. However, the footprints normally cover quite large regions, so they always overlap with one or more of the boxes. By separating the boxes from each other, differences between the various footprints can be maximized by the clustering algorithm. Choosing boxes that are contiguous makes the separation more sensitive to horizontal transport error, and the associated PES less contrasted. More remote regions are less connected by transport to the measurements (few or no particles reaching the region). As a result, the average footprint distributions of remote regions are skewed toward zero. To account for this skewness and for the different region sizes, normalization is applied to the regions' average footprint time series x according to x_N = (x − p_0.05)/ x where p_0.05 is the 5th percentile of x and was found to optimally separate clusters. The robustness of data reduction into regions and data normalization was tested by running the clustering algorithm for varying region sizes, locations and number of regions (bootstrapping), and various normalization functions.
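A minimal sketch of such a normalization is given below. The denominator of the formula is not legible in this text; the sketch assumes it is the standard deviation of x, which is suggested by the "std dev-normalized coordinates" mentioned for the centroids, but this is an assumption. The percentile interpolation rule and the function names are also illustrative choices.

```python
import statistics


def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 1]."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def normalize(series, q=0.05):
    """Shift a region's series by its q-th percentile and scale it.

    Assumption: the scale is the population standard deviation of the series.
    """
    p = percentile(series, q)
    scale = statistics.pstdev(series)
    if scale == 0:
        return [0.0 for _ in series]  # constant series carries no contrast
    return [(x - p) / scale for x in series]


# Synthetic, zero-skewed series typical of a remote region (not campaign data)
x_n = normalize([0.0, 0.0, 0.0, 0.0, 10.0])
```

Subtracting a low percentile rather than the mean keeps the many near-zero values of a remote region close to zero after normalization, so that the rare non-zero exposures stand out.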
The K-means algorithm implemented in the MATLAB software's Statistics toolbox is used for clustering. K-means is a classic partitioning (non-hierarchical) algorithm (see e.g. Wilks, 2006) which attempts to separate observations into a fixed number of groups. (1) It defines K centroids with initial random position vectors of size M=7 (number of regions); then (2) associates each point x_j with the nearest (in the Euclidean metric) cluster centroid, (3) moves the centroid to the centre of the cluster and (4) repeats steps 2 and 3 until convergence is achieved. In the process, any empty cluster is discarded. In order to maintain the number of clusters, a point having maximum distance to its centroid is singled out and declared centroid of a new cluster.
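The four steps above, together with the empty-cluster rule, can be sketched in a few lines of plain Python. This is not the MATLAB implementation used in the study: the function names are hypothetical, and the initial centroids are drawn from the data points themselves, which is one common choice and an assumption here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: K initial centroids (assumption: picked among the data points)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: attach every point to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Empty cluster: the point farthest from its centroid seeds a new one
        for c in range(k):
            if c not in new_labels:
                far = max(
                    range(len(points)),
                    key=lambda i: sq_dist(points[i], centroids[new_labels[i]]),
                )
                new_labels[far] = c
                centroids[c] = list(points[far])
        if new_labels == labels:
            break  # Step 4: assignments no longer change
        labels = new_labels
        # Step 3: move each centroid to the centre of its cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two well-separated synthetic groups (not campaign data)
pts = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
       [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(pts, 2)
```

In the study each point would be a normalized vector of M=7 variables; two-dimensional points are used here only to keep the example readable.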
Figure 3 shows the "silhouette" index (S ∈ [−1, 1]), a measure of the separation capability of the clustering, as a function of the chosen number of clusters. The S index is defined from N, the number of points, and, for each point i, from its distance to the nearest cluster d_nearest and its distance to its associated centroid d_centroid. The S index informs about how close a point is to its cluster's centroid, relative to the other nearest cluster centroid (Matlab R2006b documentation, Statistics toolbox, http://www.mathworks.com/access/helpdesk/help/toolbox/stats/silhouette.html). Using the S index, the minimum number k of clusters was found to be optimally set to 6 in the global clustering option, and 4 in the campaign-by-campaign option (Fig. 3).
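The defining equation of the S index is not reproduced in this text. As an assumption consistent with the quantities named above and with the stated range [−1, 1], one possible centroid-based form is the mean over the N points of (d_nearest − d_centroid)/max(d_nearest, d_centroid). The sketch below implements that simplified form; it is not necessarily the authors' exact formula, and MATLAB's `silhouette` function works with average point-to-point distances rather than distances to centroids.

```python
import math


def silhouette_centroid(points, labels, centroids):
    """Mean centroid-based silhouette (illustrative form, see assumptions).

    For each point: d_centroid is the distance to its own centroid and
    d_nearest the distance to the closest other centroid. Needs >= 2 clusters.
    """
    scores = []
    for p, lab in zip(points, labels):
        d_centroid = math.dist(p, centroids[lab])
        d_nearest = min(
            math.dist(p, c) for j, c in enumerate(centroids) if j != lab
        )
        denom = max(d_centroid, d_nearest)
        scores.append(0.0 if denom == 0 else (d_nearest - d_centroid) / denom)
    return sum(scores) / len(scores)


# Synthetic one-dimensional-like example (not campaign data)
pts = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
cents = [[1.0, 0.0], [11.0, 0.0]]
s_good = silhouette_centroid(pts, [0, 0, 1, 1], cents)
s_bad = silhouette_centroid(pts, [1, 1, 0, 0], cents)
```

Values close to 1 indicate points that sit near their own centroid and far from any other, while negative values indicate points that would fit better in another cluster; computing the index for several values of k gives a curve of the kind shown in Fig. 3.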
As the clustering algorithm is sensitive to the a priori (random) position of the centroids, the process is repeated 20 times and only the best result (optimizing the separation between centroids) is used in the analysis. This also ensures a very high confidence in the reproducibility of the cluster centroid determination. The normalization and the metrics were also found to have a strong impact on the result. In this respect, cluster analysis is not an objective classification technique, as the criteria set for the clustering have to be determined subjectively.

CO2 measurements
Figures 4a, d and g show the observed CO2 concentrations averaged for each of the 4 flights of each campaign. Among our surveys in 2006 and 2007, the minimum BL CO2 concentrations were observed in August 2007 (average 366 ppm below 1 km during Flight 12). While comparing campaigns it should be noted that atmospheric CO2 increases by ~1.4 ppm yr⁻¹ due to global anthropogenic emissions. The flight where this CO2 minimum was measured occurred mostly over a forested region, with sparse agricultural landscapes and industrial centres (Kemerovo, Novosibirsk) nearby. This CO2 minimum was coincident with a very high BL top (up to ~3.5 km a.g.l.), as deduced from humidity and CO2 gradients. At the Zotino tall tower, average daytime CO2 mixing ratios measured in the boundary layer were 390 ppm and 382 ppm in April and September 2006, respectively (Kozlova et al., 2008). Kozlova et al.'s September value is significantly higher than the mixing ratio observed in the lower part of our vertical profiles (Fig. 4d; values between 371 and 377 ppm for Flights 5 and 8, the closest to the Zotino site), possibly due to regional variability in sources and sinks of CO2. In the free troposphere, CO2 mixing ratios were
CO2 mixing ratios in April 2006 were as high as 392 ppm in polluted filaments encountered between 5 and 6 km altitude (Paris et al., 2008).

CO and O3 measurements
Figure 4b shows the average CO profiles for each flight during the April 2006 campaign. Throughout this campaign the average CO mixing ratio was 175 ppb. A number of plumes with higher CO mixing ratios were repeatedly observed. A plume with CO up to 220 ppb was found to be associated with uplift of a polluted air mass from NE China by a warm conveyor belt (Paris et al., 2008). During this campaign, O3 shows a relatively flat vertical profile (from ~60 ppb at 6 km to ~45 ppb at 1 km, Fig. 4c) compared to the summer campaigns.
During the September 2006 campaign the average CO concentration was about 100 ppb, with a large variability (Fig. 4e), due to the presence of air masses with very distinct origins. For example, during Flight 5 the aircraft crossed a warm front across which strong composition gradients were found, with CO mixing ratios up to 150 ppb immediately above the front. Except for local emissions in the lower troposphere, the August 2007 campaign showed low CO mixing ratios (background 100±5 ppb) with little variation, although satellite fire detection showed numerous fires in temperate and boreal Eurasia. High ozone values (up to 90 ppb) were found near the flight ceiling during the August 2007 campaign (Fig. 4i), and the O3 vertical gradient was the steepest of the three campaigns.

Cluster-based SRR relationships: case studies
In this section we apply the clustering algorithm and discuss the results to examine the measurements for a selected flight from each of the three campaigns (Flight 1 on 11 April 2006, Flight 5 on 7 September 2006 and Flight 9 on 17 August 2007). All three flights go from Novosibirsk to Myrni (green track in Fig. 1). The three flights took off at 12:00, 09:00 and 09:30 LT, and landed at 21:00, 19:00 and 18:15 LT respectively. The clustering is applied on a single-campaign basis. Because the clustering is independent of the observed concentrations, statistical separation of the concentrations across different clusters is interpreted as a validation of the SRR.
Footprints classified as Cluster C (Fig. 5c) dominate the flight (73%). This cluster is characterized by a long residence time in the free troposphere, and exhibits the highest stratospheric influence compared to the other clusters (Fig. 6f, green line). Potential emission sensitivity is typically stronger over South East and East Siberia. A marginal number (1%) of footprints are associated with Cluster A (Fig. 5a), where air masses are sensitive to European emissions. Cluster B (20%; Fig. 5b) gathers a significant number of observations from air masses having resided over northern Siberia and the Arctic. Footprints in Cluster D (5%; Fig. 5d) are exposed to surface exchange in southern Siberia and Kazakhstan. The four well-differentiated footprint maps illustrate the validity of cluster-based partitioning.
Figure 6 shows the distribution (quartiles and median) of CO2, CO, O3 and water vapour measurements associated with each cluster. The associated altitude is given in Fig. 6e, and the centroid position vector of the four clusters (A-D), in std dev-normalized coordinates, is given in Fig. 6f. Each cluster's centroid position vector can be related to the cluster average footprint map in Fig. 5.
The data associated with Cluster D (sensitive to NE China emissions) are encountered between 5 and 6.5 km altitude (Fig. 6e) and have a median CO concentration of 180 ppb (Fig. 6b), well differentiated from other clusters. In contrast, CO2 values (median 388.6 ppm; Fig. 6a) are not distinguishably higher. This cluster identifies the transport of North Eastern China emissions to eastern Siberia on 12 April 2006. The ability of the LPDM to resolve thin layers associated with synoptic features is a useful advantage over Eulerian models.
Cluster C is ubiquitous throughout Flight 1 and has a higher stratospheric influence, reflected in its high O3 concentrations (up to 85 ppb). Cluster B has high CO concentrations (median 174 ppb) corresponding to local emission sensitivity of the lower troposphere.

Flight 5: fires and CO2 uptake
Figure 7 shows the average footprint for each cluster for the 7 September 2006 flight. As in the previous section, clustering was performed on the whole campaign (here, the September 2006 campaign). Data within Cluster B (43%; Fig. 7b) were mostly associated with FT zonal flow, with limited sensitivity to potential surface emissions over European Russia and the Black Sea area. This cluster has the highest stratospheric signature of the dataset (Fig. 8f). Cluster D is also ubiquitous in this flight (39%; Fig. 7d). It gathers footprints with the highest sensitivity over Kazakhstan, between the Caspian and Aral Seas. Emissions from a large number of fires detected by ATSR fire count in this region, probably of agricultural origin, were injected into the FT as the region was swept by a front (Paris et al., 2008). Cluster A (14%; Fig. 7a) is clearly associated with an Arctic air mass channelled southward in the BL. Cluster C (4%; Fig. 7c) reflects advection of air masses in contact with potential European emissions.
The European (Cluster C) air mass has a median CO concentration of 110 ppb (Fig. 8b) at a median altitude of 5 km, to be compared to 105 ppb for the more ubiquitous Cluster B. Although this is only a single case, it provides an example of eastward European pollution export in the mid-troposphere, suggesting that this export does not always follow the low-level advection pathway described by modelling studies (Wild et al., 2003; Stohl et al., 2002; Duncan and Bey, 2004). Other characteristics of this air mass include relatively high CO2 (median 376 ppm) and an O3 median concentration of 53 ppb. Figure 8b shows that Cluster D has the most elevated CO concentration, up to 158 ppb (median 121 ppb). It is also associated with high O3 (median 55 ppb), but CO2 mixing ratios are not particularly high (375 ppm). High CO and O3 are likely to reflect the sensitivity of Cluster D (Fig. 7d) to the biomass burning detected in the Caspian region (especially in Northern Kazakhstan at ~50° N, between 50 and 70° E). To illustrate our analysis we now focus on individual biomass burning plumes found during this flight. The plume with the highest CO concentration (144±14 ppb) exhibits a high CO-O3 correlation (R² = 0.68, regression slope 0.24 mol mol⁻¹). This plume was determined to be a biomass burning plume using FLEXPART and satellite-detected fire hot spots, with the fire hot spots located over Kazakhstan (http://zardoz.nilu.no/~andreas/YAK/), and hence is illustrative of Cluster D.
The positive regression slope is usually associated with net O3 production within the plume from combustion-generated precursors like CO (Pfister et al., 2006; Val Martin et al., 2006). In another biomass burning plume the regression slope was found to be negative (−0.10 mol mol⁻¹). Both of these plumes are encountered at 2-2.5 km altitude. The O3-CO regression slopes are comparable to those found in the literature for boreal forest fire plumes (range 0.05-0.30 mol mol⁻¹; Val Martin et al., 2006; Real et al., 2007 and references therein).
Both of these plumes have a high CO-CO2 correlation (R² = 0.80 and 0.78 respectively) with negative CO-CO2 regression slopes of −0.06 and −0.09 mol mol⁻¹ respectively. These negative slopes reflect competing loss processes such as CO oxidation within the plume and CO2 assimilation by intact vegetation, and illustrate the difficulty of retrieving fire emission ratios when the air mass integrates sources and sinks upstream of the burning area.
Cluster A gathers observations affected by Arctic air (Fig. 8f) with low CO2, CO and O3 concentrations (median values 373 ppm CO2, 103 ppb CO and 34 ppb O3). It is typically encountered at low altitude (1 km) within the BL and consists of moist air (Fig. 8d). Advection of Arctic air in the campaign domain sampled during September 2006 occurred in the boundary layer. FLEXPART results indicate that total PES over the Arctic was 5 days or more in the last 10 days for the majority of data in Cluster A. The low CO and CO2 concentrations suggest a strong isolation from pollution sources in the air mass history. During the five days of meridional transport from the Arctic, exposure to uptake by high-latitude vegetation with a permanent daylight period and to surface deposition must also have contributed to low CO2 and low O3 values respectively.
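The plume statistics quoted above (regression slope and R² between two tracers within a plume) correspond to an ordinary least-squares fit, which can be sketched as follows. The text does not state which regression method was used, so ordinary least squares with O3 regressed on CO is an assumption; the function name and the data are illustrative and are not plume measurements.

```python
def linregress(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return slope, intercept, r2


# Synthetic in-plume mixing ratios in ppb (not campaign data)
co = [100.0, 120.0, 140.0, 160.0]
o3 = [40.0, 45.0, 50.0, 55.0]
slope, intercept, r2 = linregress(co, o3)
```

Because both tracers are expressed in the same mole-fraction unit, the slope is dimensionless (mol mol⁻¹); its sign separates net O3 production (positive) from net loss (negative) within the sampled plume.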

Flight 9: CO2 uptake, forest fire and stratospheric input
Figure 9 shows the average footprint for each of the 4 clusters for the 17 August 2007 flight. Cluster B indicates the dominant footprint pattern for this flight (65%; Fig. 9b). It has a strong stratospheric component (Fig. 10f) and the weak surface footprint density is spread over Western Siberia. Cluster C (17%; Fig. 9c) has a high sensitivity north of Lake Baikal.
Cluster A (9%; Fig. 9a) is centred over north European Russia and Scandinavia, while Cluster D (9%; Fig. 9d) reflects air of Arctic origin, comparable to Cluster A in the September 2006 case. Figure 10b-c shows that Cluster C has the highest CO values (median 122 ppb, upper quartile 191 ppb). Such high CO values point to regional fire influence, as seen in the ATSR fire atlas for the period of the campaign (not shown). The altitude range is very large (1-6 km). The highest CO concentrations within this cluster (300 ppb) were observed in two distinct layers in the lower FT (see Fig. 11).
Comparing CO2 concentrations in the low-altitude Arctic clusters of September 2006 (Cluster A) and August 2007 (Cluster D), the latter (median 369 ppm) is lower by ~4 ppm (despite the anthropogenic inter-annual increase of ~1.4 ppm yr⁻¹). The same is found for O3, with 25 ppb in August 2007 vs. 34 ppb in September 2006. In Siberia, FT CO2 concentrations are at a minimum in August (Ramonet et al., 2002), and September concentrations increase by about 2 ppm in the FT and 5 ppm in the BL relative to August.
Cluster B shows the highest stratospheric component for this flight. The highest O3 values (up to 91 ppb) associated with it are consistent with the "stratospheric" classification. High O3 concentrations (up to 90 ppb; Fig. 11) are associated with more than 50% of freshly (2 days) exported stratospheric particles.
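The remark that the difference holds "despite" the inter-annual increase can be made explicit with a small worked example. The sketch assumes a purely linear trend of 1.4 ppm yr⁻¹ and a separation of about 11 months between the two flights; both the helper name and the 11-month figure are illustrative assumptions.

```python
def detrended_difference(c_early_ppm, c_late_ppm, years_apart,
                         trend_ppm_per_yr=1.4):
    """Later minus earlier value, after removing a linear trend from the later one."""
    return (c_late_ppm - trend_ppm_per_yr * years_apart) - c_early_ppm


# Cluster medians quoted in the text: 373 ppm (Sept 2006), 369 ppm (Aug 2007)
raw_diff = detrended_difference(373.0, 369.0, 0.0)
diff = detrended_difference(373.0, 369.0, 11.0 / 12.0)
```

Under these assumptions the August 2007 median is about 5.3 ppm lower than the September 2006 median once the trend is removed, compared with 4 ppm before correction.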

Statistical relationships between source regions and species across campaigns
In order to compare between campaigns and seasons, and to discuss the cluster analysis, we now resort to pair-wise statistics (correlations and regression slopes) directly associating observed trace gas mixing ratios with individual retroplume residence times over each region (Fig. 2). Pearson correlation coefficients were calculated after excluding data associated with a residence time <100 s. This data selection rejects less than 1% of the data range. We also report the Kendall τ parameter, which is a "rank" correlation coefficient and therefore better describes increasing or decreasing but non-linear relationships (Wilks, 2006).
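The pair-wise statistics described above can be sketched as follows: data with a residence time below 100 s are excluded, then the Pearson R and a Kendall τ are computed. The Kendall coefficient is implemented here in its simplest form (tau-a, without tie correction), which may differ from the variant used in the study; the function names and the data are illustrative.

```python
import math


def filter_min_residence(res_time_s, conc, t_min=100.0):
    """Keep only the pairs whose residence time is at least t_min seconds."""
    kept = [(t, c) for t, c in zip(res_time_s, conc) if t >= t_min]
    return [t for t, _ in kept], [c for _, c in kept]


def pearson_r(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Synthetic residence times (s) and CO2 mixing ratios (ppm), not campaign data
res = [50.0, 100.0, 1000.0, 10000.0, 100000.0]
co2 = [380.0, 379.0, 378.0, 370.0, 369.0]
t, c = filter_min_residence(res, co2)
r = pearson_r(t, c)
tau = kendall_tau(t, c)
```

In this synthetic case CO2 decreases monotonically but not linearly with residence time, so τ reaches −1 while R remains well above −1, which is the situation in which the rank statistic complements the linear one.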

Variations of CO 2 associated to source regions
Figure 12 shows the influence that airmasses' residence time in the lowest 300 m in the Arctic and Siberian regions exerts on CO 2 measurements.Since CO 2 fluxes are highly variable at the sub-day timescale, whereas simulated exposure to fluxes is averaged over 10 days, the relations between CO 2 concentrations and PES over source or sink regions are not expected to be strongly linear, making the Kendall τ a useful complement to the Pearson correlation coefficient to describe the relationship between residence time and mixing ratios.Linear regression slopes and correlation coefficients are reported in Table 1.CO 2 mixing ratios are significantly (p <0.01) anti-correlated with PES over local ecosystems both in September 2006 (R=−0.20,τ =−0.37,Table 1) and August 2007 (R=−0.41,τ =−0.43).The regression slope is steeper in August than in September (−3.9×10 −4 ppm s −1 and −1.0×10 −4 ppm s −1 respectively), documenting a stronger uptake in August.Similarly, for the Arctic region in summer, significant correlation is observed between CO 2 concentrations and the PES, also with stronger correlations in August 2007 than in September 2006 (Fig. 12).Therefore, the more air travelled through the Arctic region before reaching our aircraft, the more it has been depleted in CO 2 by sub-arctic ecosystems.As discussed in Sect.4, air having resided in the Arctic surface layer and eventually reaching our aircraft is typically associated to advection in the BL, and therefore integrates between 1 and 10 days of exposure to uptake by Siberian sub-arctic ecosystems.Together, these correlation coefficients indicate that the 10-day cumulated exposure to local and Arctic fluxes in August 2007 explains ∼50% of the total CO 2 variance, when boreal and sub-arctic ecosystem uptake is at its annual maximum.
In August 2007 North Eastern China is an influential source of CO2, as indicated by the strong positive correlation (R=0.54, slope=5.0×10⁻² ppm s⁻¹; Table 1). We note that the clustering analysis did not identify a cluster with a dominant NE China signal (Sect. 4.3, Fig. 10). The strong correlation combined with this absence of a distinct cluster suggests that NE Chinese emissions enhanced CO2 concentrations over Siberia in a ubiquitous and diffuse way. In contrast, in April 2006 both a highly polluted cluster with a dominant NE China footprint and a consistent positive relationship (τ=0.31) between CO2 and NE China PES were found, indicating that the spring contribution from NE China to CO2 over Siberia occurred in specific plumes through specific transport events, rather than contributing to the background concentration.
www.atmos-chem-phys.net/10/1671/2010/ Atmos. Chem. Phys., 10, 1671-1687, 2010

Variations of CO and O3 associated with source regions
Tables 2 and 3 show the pair-wise statistics associating potential source regions with CO and O3 concentrations, respectively. The CO burden over Siberia is partly due to emissions from remote anthropogenic sources such as the NE China region in April 2006 (τ=0.32; Table 2), where uplift of polluted air and advection toward Siberia was documented. Western Kazakhstan sources in September 2006 (τ=0.29) and August 2007 (τ=0.26) also contributed to the CO burden. In the case of Kazakhstan, CO emissions are most likely due to biomass burning occurring in this region. More local sources also have an impact throughout all campaigns.
The absence of a strong correlation between O3 mixing ratio and air mass residence time over biomass-burning-affected areas of Kazakhstan does not support a strong average O3 production during advection (Table 3), in contrast to our observations of CO-O3 correlation in individual plumes (Sect. 4.2). O3 statistics exhibit significant correlations with the simulated stratospheric influence in April 2006. This strong correlation (R=0.43) suggests that stratosphere-troposphere exchange is a strong source term in the free tropospheric O3 spring budget. However, stratospheric Pearson correlation coefficients are weaker and not significant in the summer campaigns, indicating that other processes, such as local photochemistry and surface deposition, dominate the summer tropospheric O3 budget.
Exposure to surface influence in the Siberian BL consistently reduces O3 concentrations in all campaigns, with patterns similar to those of CO2. This effect, attributed to deposition on forests, is stronger in summer months (   ). The negative spring influence of NE China emissions on O3 might be due to heterogeneous chemistry in the presence of a heavy aerosol load in the air mass following the dust storm that occurred in the region prior to and during the campaign (Paris et al., 2008).

Conclusion
We analysed three intensive campaigns above Siberia resulting in a total of ∼70 h of CO2, CO and O3 measurements. The sampling strategy was to perform consecutive ascents and descents to obtain a tomography of the atmosphere above Siberia. Our data analysis used a novel technique based on the clustering of LPDM footprints and correlation analysis. Summer (August 2007, September 2006) tropospheric variability of CO2 was found to be largely explained (∼50%) by uptake due to exposure to boreal and sub-arctic ecosystems. It is likely that permanent sunlight exposure of vegetation above the Arctic Circle contributes to a more constant uptake in this region, therefore strongly depleting CO2 in slowly transiting air masses.
A discernible influence of air exposure to European emissions was observed, associated with higher O3 concentrations. O3 was also affected by large sink processes in the Siberian BL and by a constant stratospheric contribution. North Eastern Chinese emissions contribute to higher O3 mixing ratios in summer, but tend to decrease them in spring.
Airborne measurements across Siberia during three YAK campaigns were able to detect concentrations originating from sources located in remote regions, thus validating the sampling strategy. The cluster-based method proposed here was successful at separating air masses with different chemical compositions, even though the classification scheme is based only on simulated transport properties and is matched against chemical composition measurements only afterwards and independently. Another advantage of our clustering method is that it is independent of our limited knowledge of sources and sinks, especially in the case of Siberia. Single-flight cluster analysis exhibited a variety of air mass origins, related to a large variability in CO2, CO and O3. The technique used here, although it has sufficient resolving power, would benefit from further refinement. The cluster-based method should be able to ensure comparability between different campaigns at different locations and/or periods (provided there is no strong concentration trend between campaigns). Another possible application of this technique, to surface observatories or regular aircraft profiles, is to contribute to the definition of background conditions and to provide a systematic classification of air mass origins.

Fig. 1. Horizontal (top panel) and 3-D (bottom panel) flight tracks. Each campaign followed a similar path. Flights are numbered in the bottom panel. The different flights of each campaign (four each, typically one per day) are shown in different colors.

Fig. 2. Regions chosen for data reduction prior to the clustering analysis. The black line shows the flight track during the campaigns. Anthropogenic CO emissions from the EDGAR database are also shown in logarithmic grey scale, in g(CO) m⁻² yr⁻¹, to illustrate differences between regions.

Fig. 3. Silhouette index of the clustering algorithm separation capability as a function of the number of clusters for all data (thick line) and separated campaign by campaign (thin line with error bars showing ±1 std dev).

Fig. 4. Profiles of median CO2, CO and O3 mixing ratios for each flight over the 3 campaigns (top row: April 2006, middle: September 2006, bottom: August 2007). The horizontal lines join the 10th and 90th percentiles of each altitude level.

Fig. 5. Average footprint maps for data belonging to clusters A-D throughout Flight 1 (11 April 2006). Logarithmic color scale shows the log₁₀ of residence time, i.e. the PES below 300 m. The number of elements N in a particular cluster is given above each map.

Fig. 6. (a) Box plots of median and inter-quartile range (IQR) for CO2 concentrations measured during Flight 1 (11 April 2006) associated with each cluster (clusters A-D on the abscissa, same denomination as in Fig. 5). Outliers within the IQR are included in the whiskers; outliers beyond the IQR are shown as dots. (b) Same for CO. (c) Same for O3. (d) Same for water vapour mixing ratio. (e) Same for altitude. (f) Cluster centroid position vector for each cluster. Gray bars indicate the population of each cluster (number of samples).

Fig. 12. Scatter plot of CO2 measurements against PES in the Arctic (circles) and Siberia (triangles) regions, with robust linear regression fit for each of the two regions and each of the three campaigns. YAK-1: April 2006 (red and orange for Siberia and Arctic, respectively), YAK-2: September 2006 (blue and purple), YAK-3: August 2007 (light green and green).
The clustering technique was found to be able to separate tracer concentrations although it was based solely on atmospheric transport modelling. CO concentrations of ∼109 ppb were observed in clusters associated with a dominant European emissions sensitivity in September 2006 and ∼105 ppb in August 2007. Dominant NE China PES was associated with concentrations of 180 ppb in April 2006. High CO and O3 concentrations (median values 121 ppb and 55 ppb, respectively) were found in clusters associated with fires in Kazakhstan in September 2006. High correlation (R²=0.68) and robust linear relationships with regression slopes ranging between −0.10 and 0.24 ppb ppb⁻¹ were found in individual plumes.

Table 1. Regional PES-concentration pair-wise statistics for CO2.
a Kendall's tau. Values are given between parentheses if not statistically robust (p>0.01).
b Pearson correlation coefficient. Values are given between parentheses if not statistically robust (p>0.01).
c Robust linear regression slope (ppm s⁻¹). No value is given if both Pearson and Kendall correlations are not robust, or if only Pearson correlation is not robust with |R|<0.2.

Time series of measured CO2 (black), CO (red) and O3 (blue) concentrations as well as altitude (grey) during Flight 9 on 17 August 2007. Aircraft longitude is also given every half hour.

Table 2. Regional PES-concentration pair-wise statistics for CO.
a Kendall's tau. Values are given between parentheses if not statistically robust (p>0.01).
b Pearson correlation coefficient. Values are given between parentheses if not statistically robust (p>0.01).
c Robust linear regression slope (ppm s⁻¹). No value is given if both Pearson and Kendall correlations are not robust, or if only Pearson correlation is not robust with |R|<0.2.

Table 3. Regional PES-concentration pair-wise statistics for O3.
a Kendall's tau. Values are given between parentheses if not statistically robust (p>0.01).
b Pearson correlation coefficient. Values are given between parentheses if not statistically robust (p>0.01).
c Robust linear regression slope (ppm s⁻¹). No value is given if both Pearson and Kendall correlations are not robust, or if only Pearson correlation is not robust with |R|<0.2.
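
The reporting convention stated in the notes to Tables 1 to 3 can be written out as a short sketch. This is one possible reading of those notes, not the authors' code: the function names `fmt_coefficient` and `report_slope` are hypothetical, and the p-values in the usage lines are invented for illustration.

```python
P_THRESHOLD = 0.01  # a coefficient is treated as robust when p <= 0.01
MIN_ABS_R = 0.2     # |R| limit used in the slope-reporting rule


def fmt_coefficient(value, p_value):
    """Format a correlation coefficient; parenthesise it if not robust."""
    text = f"{value:.2f}"
    return text if p_value <= P_THRESHOLD else f"({text})"


def report_slope(pearson_r, p_pearson, p_kendall):
    """Return True if the regression slope would be reported in the table.

    Assumed reading of the table notes: the slope is omitted when neither
    correlation is robust, or when Kendall's tau is robust but the Pearson
    correlation is not robust and |R| < 0.2.
    """
    pearson_ok = p_pearson <= P_THRESHOLD
    kendall_ok = p_kendall <= P_THRESHOLD
    if not pearson_ok and not kendall_ok:
        return False
    if kendall_ok and not pearson_ok and abs(pearson_r) < MIN_ABS_R:
        return False
    return True


# Illustrative inputs only (p-values are made up).
print(fmt_coefficient(-0.41, 0.001))      # robust: printed as-is
print(fmt_coefficient(0.12, 0.20))        # not robust: parenthesised
print(report_slope(0.15, 0.05, 0.001))    # weak, non-robust Pearson: omitted
print(report_slope(0.25, 0.05, 0.001))    # |R| >= 0.2 with robust tau: kept
```

Under this reading, a slope is still reported when only the rank correlation is robust, provided the linear correlation is not negligible.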