An evaluation of the efficacy of very high resolution air-quality modelling over the Athabasca oil sands region, Alberta, Canada

We examine the potential benefits of very high resolution for air-quality forecast simulations using a nested system of the Global Environmental Multiscale – Modelling Air-quality and Chemistry chemical transport model. We focus on simulations at 1 and 2.5 km grid-cell spacing for the same time period and domain (the industrial emissions region of the Athabasca oil sands). Standard grid cell to observation station pair analyses show no benefit to the higher-resolution simulation (and a degradation of performance for most metrics using this standard form of evaluation). However, when the evaluation methodology is modified, to include a search over equivalent representative regions surrounding the observation locations for the closest fit to the observations, the model simulation with the smaller grid-cell size had the better performance. While other sources of model error thus dominate net performance at these two resolutions, obscuring the potential benefits of higher-resolution modelling for forecasting purposes, the higher-resolution simulation shows promise in terms of better aiding localized chemical analysis of pollutant plumes, through better representation of plume maxima.


Introduction
Numerical modelling of the atmosphere in an Eulerian framework relies on discretization of the computational domain into a numerical grid. The horizontal grid-cell size of atmospheric simulations can range from hundreds of kilometres to the metre-scale of large eddy simulation models. Air-quality model grid-cell size typically follows the grid-cell sizes used in weather forecasting models, which in turn have followed a gradual progression towards finer discretization where more explicit representation of cloud formation and local radiative transfer effects may be represented. The most recent weather forecasting applications (e.g. Leroyer et al., 2014) have reached grid-cell sizes as small as 250 m over limited domains such as individual cities, and have shown promising results in terms of being able to resolve some aspects of local circulation. In addition, as grid resolution reaches the 3 to 4 km scale, explicit cloud microphysics packages may be used, allowing potentially better performance, particularly with regards to feedbacks between meteorology and chemistry (Yu et al., 2014;Gong et al., 2015). However, while these models promise better physical representation of local chemistry, their performance may be limited by the quantity and availability of initialization and boundary condition meteorological data; these data may be used in a data assimilation context to improve their initial state. The accuracy of broader-scale meteorological predictions may thus influence local model accuracy, despite the ongoing decrease in meteorological model (and consequently air-quality model) grid-cell size. Some recent air-quality model simulation studies with grid-cell sizes on the order of 1 to 4 km include Thompson and Selin (2012), Li et al. (2014), Joe et al. (2014), Kheirbek et al. (2014, 2016), and Pan et al. (2017. For the purposes of this study, very high resolution (VHR) modelling refers to the current higher-resolution limits of chemical transport models (CTMs), employing a horizontal grid-cell spacing of 1 km or less. It is in this regime that the photochemical processes may be forecasted with resolved microphysics (e.g. Milbrandt and Yau, 2005a, b) and detailed particle and gas-phase chemistry, using currently available computer technology. VHR modelling is very computationally expensive, and also introduces its own set of challenges, such as the availability of surface boundary condition fields as the model grid-cell size decreases. Moreover, it is not currently clear whether decreases in model grid-cell size leads to more accurate results when compared to observations. The motivation behind VHR modelling in CTMs is to reduce the impact of diluting chemical concentrations -especially from averaging emission plumes into large grid cells -in order to better capture inhomogeneities in emission profiles, to better simulate local transport processes associated with terrain that would otherwise be smoothed by the use of a coarse grid, and to reduce truncation errors and hence achieve better numerical accuracy (Jacobson, 1999).
We note here that while the terms "grid-cell size" and "resolution" tend to be used interchangeably in the literature, this is not true in a precise mathematical sense; more formally, the ability to resolve features of size 2 x requires a gridcell spacing of size x, and the highest spatial frequency which can be reconstructed from a discrete sampling of the latter grid-cell spacing will be 1 2 x , the Nyquist wavenumber of the grid-cell size discretization. Furthermore, atmospheric models may make use of energy dissipation techniques that broaden the size of resolvable wavelengths to 3 to 4 x (Grasso, 2000;Pielke, 2001). Model resolution is thus a function of, but not equivalent to, grid-cell size. Here, we define "resolution" as the ability of a model to clearly distinguish components of a predicted atmospheric variable, as a function of grid-cell size.
The issue of a model to distinguish these features is also compounded by uncertainties in model inputs. For example, in a large rural setting, a large model grid cell will represent an area containing many roads, whose emissions will be averaged into one value per species per time. As the grid-cell size decreases, however, this averaging effect will be reduced, giving each road's emissions more impact on the resulting concentrations in the grid cell containing it. However, the smaller grid-cell size will also result in steeper concentration gradients in the model between adjacent grid cells, which can in turn result in numerical instabilities that contaminate predictions (Salvador et al., 1999). At the same time, a reduction in grid-cell size can be shown formally to reduce inaccuracies in the discretization of the governing equations for atmospheric motion (Coiffier, 2011). Previous efforts to address these issues through variable grid size or structure in air quality modelling have not received sustained attention, and therefore most current air quality models use a uniform (albeit nested) grid-cell size in applications (Garcia-Menendez et al., 2010;Kumar et al., 1997).
As resolution increases further, the presence of local topographical features (e.g. buildings and street canyons) becomes more important. Both the increased topographic complexity and potential numerical instabilities can lead to differences in meteorological forcing as resolution increases (Wolke et al., 2012;Gego et al., 2005). The contribution of meteorological uncertainties due to resolution becomes more significant, especially for secondary pollutants such as ozone (Valari and Menut, 2008) or secondary particulate matter (PM). For example, Markakis et al. (2015) in their analysis of 4 km CHIMERE simulations for the relatively flat terrain of Paris, France, suggested that model meteorological grid-cell size does not significantly impact forecast accuracy. That may not have been the case, had their terrain been more complex. In contrast, Queen and Zhang (2008) observed considerable meteorological sensitivity to the more complex terrain in their 4 km resolution Community Multiscale Air Quality (CMAQ; EPA, 1999) model simulations over the Appalachian Mountains in the eastern United States, as did Salvador et al. (1999) for meteorological model simulations.
A number of studies have tried to evaluate the benefits of higher-resolution simulations and to quantify the impact of sub-grid variability by using different model grid-cell sizes (Vardoulakis et al., 2003;Ching et al., 2006;Pepe et al., 2016). These studies have often demonstrated that failure to account for higher-resolution features may result in mischaracterization of concentrations or health impacts (Isakov et al., 2007), although the capability of current models to provide this information with sufficient accuracy is unclear. One study found that increasing resolution did not change predicted health outcomes and concluded that "resolution requirements should be assessed on a case-by-case basis" (Thompson and Selin, 2012), while others (e.g. Kheirbek et al., 2014Kheirbek et al., , 2016 have employed 1 km resolution without discussing the impacts of resolution on predicted health outcomes. Population exposure studies using air pollution models may be affected by resolution in a more complex fashion, given that both the predicted field (a pollutant with a known health impact) and the data to which the predicted field is to be linked (the human population) both have resolution dependencies. The health studies carried out to date highlight the need for better understanding of the underlying controlling factors for model accuracy with decreasing gridcell size.
Terrain and meteorology are not the only factors that contribute to greater uncertainties as horizontal grid-cell size is reduced -for example, the ability of the model to locally resolve emission fluxes may also become a factor. This may result in improved or deteriorated model performance as the size of the grid cells decreases. Gridded model emissions may have an intrinsic resolution dependence due to the underlying spatial disaggregation fields, and this can contribute to uncertainties and errors in emissions as grid-cell size is decreased. For instance, Valari and Menut (2008) found that the discrepancy between their modelled and observed concentrations grew rather than shrank, in response to decreases in grid-cell size from 48 to 6 km, and they associated these results with changes in the resulting local emission fluxes. They showed that in their model setup, with regard to ozone, a grid-cell size was reached (12 km × 12 km) where errors in inputs (errors in the emission inventory, wind direction, etc.) outweighed the importance of other sources of model error such as grid-cell size. The authors, however, noted that Paris' ozone photochemistry very often resides on the transition between a NO − x -sensitive and a VOC-sensitive regime (Sillman et al., 2003). These are chemical conditions which can alternatively produce or titrate ozone and hence have a degree of sensitivity to precursor emissions, and therefore, also, to any errors in those emissions. Conversely, in a 3-level nested MM5-CMAQ simulation with grid-cell sizes going from 9 to 3 to 1 km over Osaka, Japan, Shrestha et al. (2009) found that ozone comparisons to observations improved as the grid resolution increased. This was also the case for a 3-level nested MM5-CMAQ simulation going from 36 to 12 to 4 km over Houston, USA (Ching et al., 2006), where the ozone forecast improvement associated with higher resolution was attributed to the ability of the finer grid-cell size model nests to adequately resolve high concentrations of freshly emitted NO x and hence allow for more local ozone titration. The latter process might not take effect until the grid-cell size is sufficiently fine to resolve the NO x source patterns (i.e. a level where traffic and industrial sources can be identified.) This titration was not seen until they decreased their grid-cell sizes to 2 km and smaller. Stroud et al. (2011) noted a similar grid-cell-size-dependent chemical impact on model performance, where secondary organic aerosol formation maxima were better simulated with a 2.5 km grid-cell size model than a 10 km grid-cell size model. In general, the impact of resolution on model performance appears to depend on a number of factors, such as the terrain, spatial distribution of sources, pollutant of concern, season, etc. (Arunachalam et al., 2006;Queen and Zhang, 2008;Dore et al., 2012). Salvador et al. (1999) studied the prediction accuracy impacts of meteorological model grid-cell size in a region with a complex domain and found that 2 km or smaller grid-cell sizes were required to resolve local-scale complex terrain flow features, and that daytime vertical advection and predictions of turbulent kinetic energy and potential temperature were influenced by grid-cell size. Dore et al. (2012) evaluated air quality model NO 2 simulations employing 1, 5, and 50 km grid-cell sizes against observations and found the best performance for the 1 km simulation, with more physically realistic distributions of reactive nitrogen, attributing this performance gain to more realistically precipitation simulations and emissions inputs for the smallest grid-cell size. The availability of high-resolution emissions information may be a limiting factor in improved simulations as gridcell size decreases. Valari and Menut (2008) noted that emissions inaccuracy was the principal cause of noise in small grid-cell size simulations conducted for the Paris area, and proposed the use of statistical downscaling in favour of predictive modelling at scales at or below 1 km grid-cell size. The current state of model science is typically evaluated through multi-model intercomparisons (e.g. Im et al., 2015), and the meta-analysis of these studies can be used to provide useful benchmarks to assess current model performance for specific model species and observations (Emery et al., 2017). However, such studies do not identify the causes for good or poor performance relative to the benchmarks -diagnostic studies "in which chemical and physical processes within the model are analyzed individually and collectively" (Emery et al., 2017) are required for this purpose. Examinations of the impact of model grid-cell size on performance are an example of such a diagnostic evaluation.
The benefits for model performance with increased spatial resolution are unclear, based on the above literature. However, most papers converge towards the following qualitative conclusions: -The impact of terrain topology on meteorological forcing as grid-cell size decreases can dwarf the impact of a more accurate spatial apportionment of the corresponding emissions.
-Decreases in grid-cell size result in a more realistic spatial distribution of chemical species, whether or not model performance is improved.
-Uncertainties in spatial and temporal emissions allocation have an increasing influence on overall model uncertainty as model grid-cell size decreases.
The 1980s saw several studies in which the potential impacts of wind direction errors on dispersion model performance were examined. Fox (1981) noted that pairing of model output at observation station locations could be done as a function of both time and space: as a function of time (by combining the data across all stations), as a function of space (by combining all times, at each station location), or without any pairing (observations and data were compared as cumulative frequency distributions). The accuracy of regulatory dispersion models in the early 1980s was such that Fox (1984) concluded that model and observation values paired in time and space exhibited "little to no correlation" and discussed potential errors associated with transport. Poor correlations were also noted in the report on the first generation of reactivetransport models by Hanha (1988), who stated "wind direction errors are the major cause of the poor agreement in hourly predictions of concentrations at short distances downwind of point sources," and described metrics for air-quality model evaluation. Hanha (1988) also noted that model predictions could be offset in space and time relative to observations, leading to poor performance statistics, despite a greater degree of similarity of behaviour if the offsets are taken into account. Errors in wind-field modelling were described as the main source of error in simulations of plumes by Carhart et al. (1989), who again showed how better agreement resulted when model and observations were unpaired in time and/or space and noted that other metrics such as maximum plume width might better represent model performance. Lee (1987) found that small perturbations in space and time could result in poor correlations, despite similar histogram distributions of both model and observations. More recently, Kang et al. (2007) examined the concept of using the area of the limiting resolution of the model (2 to 3 x, where x is the horizontal grid-cell size) to weight or spatially average model evaluation metrics for a single grid-cell size, noting how the model's rated ability to capture high-concentration events ("hits") was increased when the limiting resolution of the model was incorporated into the performance metrics. However, the use of averaging may mask the potential for a model with a small grid-cell size to contain both the desired plume magnitude and much lower concentrations, within the same larger representative area, in turn masking the potential impact of the reduction in grid-cell size.
We expand on this concept to evaluate the impact of model grid-cell size in the context of an equivalent area about a given observation location. We examine area-weighted metrics in the form of averages over roughly equivalent areas for different model grid-cell sizes, and also use the a priori knowledge of the observations to determine whether the closest match to observations may be found within an equivalent area. We show that the latter metric demonstrates a positive impact of model grid-cell size on simulation results, while more simple paired comparisons, and averages over similar areas, mask these benefits.
We examine the impact of grid-cell size on model performance in a region of intense petrochemical extraction and upgrading, the Athabasca oil sands region (AOSR). The AOSR refers to the northernmost of three large bitumen deposits located the northern part of the province of Alberta in Canada: the Athabasca, Peace River, and Cold Lake areas. Together these areas cover 142 200 km 2 in total and constitute the third largest oil reserves in the world (Government of Alberta, 2016), as shown in Fig. 1. The oil sands sector is the second largest source of SO 2 and the third largest source of industrial NO x in the province of Alberta. This sector is also a significant source of industrial PM, CO, and volatile organic compound (VOC) emissions , from a variety of source types and industrial processes (e.g. open pit mine tailings ponds, large diesel fleets, bitumen upgrading facilities). As is described below, very high resolution emissions data are available for these sources, and emissions take place in a region with significant topography, hence the region provides a good test case for the relative impact of grid-cell size on air-quality model prediction results.
Next we describe our model, the simulation domains and forecasting setup, the emissions data, our evaluation methodology, and the results of our analysis.

GEM-MACH
The air-quality model used in this work is Environment and Climate Change Canada's (ECCC) Global Environmental Multiscale -Modelling Air-quality and Chemistry (GEM-MACH) model, which has been in use as Canada's operational air-quality forecast model since 2009 (Moran et al., 2010). GEM-MACH is an on-line model; that is, both meteorological and chemistry processes are handled within a single model. The chemical processes reside within the physics module of the Global Environmental Multiscale meteorological forecast model (Côté et al., 1998a, b), originate with Environment Canada's earlier off-line model (A Unified Regional Air-quality Modelling System, AURAMS; Gong et al., 2006), and include process representation for particle microphysics (Gong et al., 2003a, b), inorganic heterogeneous chemistry (Makar et al., 2003), aqueous phase chemistry, incloud and below-cloud scavenging (Gong et al., 2006), and secondary organic aerosol formation (Stroud et al., 2011). GEM-MACH employs a sectional approach to represent the size distribution of atmospheric particles, with 12-bin (Makar et al., 2015a, b;Gong et al., 2015) or 2-bin configurations (Moran et al., 2010). The latter configuration is designed for maximum computational efficiency, with re-binning to the 12-bin distribution for key particle microphysics processes, in order to improve accuracy. Here, the 2-bin version of the model has been used, the main focus of the work being the impact of horizontal grid-cell size on model results. Eight aerosol chemical components are resolved in GEM-MACH (sulfate, nitrate, ammonium, elemental carbon, primary organic aerosol, secondary organic aerosol, sea salt, and crustal material). In the present study, we make use of GEM-MACH v.1.5.1, described in more detail in Makar et al. (2015a, b), employing 80 levels in a hybrid vertical coordinate system extending up to 0.1 hPa (∼ 30 km). Both model grid-cell size simulations compared here (2.5 and 1 km grid-cell sizes; see below) make use of the Milbrandt-Yau double-moment explicit microphysics scheme; that is, cloud processes are resolved explicitly at these scales (Milbrandt and Yau, 2005a, b).

Grid nesting
Four levels of nesting have been employed in our simulations, shown in Fig. 2a. This version of GEM-MACH operates on a rotated latitude-longitude coordinate system wherein the position of the coordinate system poles is set by the user, allowing rotations of the grid with decreasing grid-cell size during nesting. The outermost nested grid corresponds to the westernmost two-thirds of the operational GEM-MACH forecasting domain, with a 10 km grid-cell size, and employ a combination of the Kain-Fritsch subgridscale convective cloud scheme (Kain and Fritsch, 1990;Kain, 2004) and a Sunqvist (1988) scheme for cloud parameterizations. Within that outer grid is nested a 10 km gridcell size western Canada domain (yellow region, Fig. 2a) which has been rotated to match the horizontal orientation of the Rocky Mountains, and which makes use of a doublemoment microphysics scheme (Milbrandt and Yau, 2005a, b) in place of the Sundqvist (1988) parameterization. The intention of this intermediate local 10 km simulation domain was to provide initial hydrometeors for the two innermost domains, to reduce the "spin-up" time required for the inner domains' meteorology to reach an equilibrium with respect to cloud formation. The latter two domains (2.5 and 1 km grid-cell sizes) resolve the cloud microphysics explicitly using the double-moment scheme alone and no convective parameterization (Milbrandt and Yau, 2005a, b). The third nested grid inset (green region, Fig. 2a) is the 2.5 km grid-cell size domain, which covers most of the Canadian provinces of Alberta and Saskatchewan. This grid will hereafter be referred to as the OS2.5km domain. The fourth and final nested grid (blue square, Fig. 2a) is a 1 km grid-cell size domain, roughly centered over and covering the immediate environs of the Athabasca oil sands, and is referred to here-after as the OS1km model. This last nest also shows the region within which 22 instrumented aircraft flights were conducted during August and September of 2013, providing a unique measurement data set for our evaluation of the OS2.5 and OS1km model output for the same time period. Table 1 provides details on the horizontal dimensions of each of these nested domains and the duration of the simulations on each grid. All four model nests make use of the same vertical coordinate and levels. Figure 2b shows the topography of the 1 km domain in detail; the region to be modelled is situated in a broad river valley, with a local vertical relief of 750 m. Significant wind shears and frequent inversions are observed in the region, and part of our interest in 1 km grid-cell size simulations is to determine the extent to which these local features may influence model prediction accuracy.

Simulation cycling strategy
Model simulations mimic an operational forecasting system, starting from the use of archived, data-assimilated meteorological analyses as meteorological input and boundary conditions every 36 h. The use of analysis fields is a standard meteorological forecasting practice to prevent the chaotic drift of the model results from observed meteorology over time. The outermost 10 km domain uses initial and boundary conditions from the output of a meteorological simulation, that is itself driven by an analysis field. The outermost domain model then carries out a 36 h forecast, of which the first 6 h is discarded as spin-up; the final 30 h is used as initial and boundary conditions for the rotated 10 km grid-cell size domain (the OS10 km domain). An OS10 km simulation of 30 h is then carried out, with the first 6 h being discarded as spinup, and the latter 24 h forming the initial and boundary conditions for the 2.5 km grid-cell size OS2.5km simulation. The OS2.5km simulation is of 24 h duration. The OS1km simulation covers the same 24 h (and hence both 2.5 and 1 km simulations start from the same OS10 km initial conditions for every 24 h forecast), with the 2.5 km simulation providing boundary conditions thereafter to the OS1km model. Continuity between 24 h forecasts is thus maintained at the level of the outermost nest. The outermost domain is cycled every 12 h starting at 00:00 and 12:00 UT; however, we have selected the set of contiguous OS2.5 and OS1km 24 h simulations starting from the 12:00 UT continental domain for our comparison.
Meteorological boundary conditions for the lowestresolution GEM-MACH simulations are taken from operational GEM forecasts, in turn driven by data assimilation analyses performed at the Canadian Meteorological Centre.

Model emissions
All emissions data used in this work are described in Zhang et al. (2018). These emissions data include  a. direct observations of stack-specific hourly emissions measured by continuous emission monitoring systems (CEMS); b. regional emissions inventory data from the Cumulative Environmental Management Association (CEMA) -which had the most detailed stack and process level emission data for the AOSR facilities, including emissions from mine faces, tailings ponds, and the off-road mining feet); c. the 2010 Canadian Air Pollutant Emissions Inventory (APEI) -which is the most comprehensive national emissions inventory, and which has the largest spatial coverage for area sources outside the AOSR; and d. the 2013 National Pollutant Release Inventory (NPRI) (a subset of the APEI) that is based on emissions reports from large industrial facilities.
These emissions data sets primarily describe emissions of pollutants known as criteria air contaminants (NO x , VOCs, SO 2 , NH 3 , CO, PM 2.5 , and PM 10 ) for major-point sources (i.e. large emission stacks) and area sources. Area emissions sources typically consist of multiple small mobile sources spread over a large area (e.g. off-road vehicles), large flux sources such as mine tailings settling ponds or mine faces, and/or large numbers of small stacks for which no stack characteristic data (volume flow rates, temperatures of emissions, stack diameters), needed to estimate plume-rise heights, are available.
Major-point sources are represented by a single geographical (latitude, longitude) pair of coordinates, and are assigned to the grid cell in which the point is located. These sources are likely to be the most impacted by model horizontal gridcell size, as even a large major-point source plume, which in reality may only occupy an emissions horizontal area on the order of 100 m 2 , is represented by a flux spread over an entire grid cell. A plume from a major point source within a 2.5 km grid cell will thus be immediately diluted to a size of 6.25 km 2 upon emission, whereas the same source with a 1 km grid cell will have a cross-sectional horizontal extent of 1 km 2 . At the same time, higher resolution may require a much more accurate representation of model winds close to the sources to maintain accuracy in evaluation metrics dependant on plume position such as correlation -a wider plume being more likely to at least partially intersect a monitoring station location than a narrower plume.
Area sources that are large compared to both model gridcell sizes (2.5 and 1 km) can be expected to be approximated by model grid cells of both resolutions and are thus expected to be less impacted by model resolution than emissions from point sources. However, smaller area sources (i.e. areas intermediate between 2.5 and 1 km to the side) may be better resolved, and hence have less dilution and higher downwind concentrations, when higher spatial resolution is employed.
In the AOSR, approximately 95 % of the SO 2 emissions originate in major-point sources, while NO 2 is apportioned ∼ 40 % to major-point sources and ∼ 60 % to area sources . Consequently our a priori expectation is that the impact of the resolution change will be strongest for species like SO 2 , and less strong for species like NO 2 that are emitted in part by point sources, but may also be apparent for other species and secondary products, such as O 3 .

Model evaluation methodology and metrics
Comparisons between air-quality models and observations usually take the approach of comparing observation and model-generated values paired in time and space, from the observation location and corresponding model grid-cell respectively. We refer to this approach hereafter as our "standard" evaluation, for both 2.5 and 1 km simulations. However, we note additional factors aside from grid-cell size may influence the outcome of air-quality model evaluations. For example, the relative skill of the meteorological component of the air-quality model will depend in part on the density of meteorological observation data, incorporated into the model via data assimilation, for the construction of the model's initial meteorological state. This in turn will influence the local skill of the model's predicted wind directions and hence the skill of its plume transport. The simulations carried out here focus on the Fort McMurray area, where the nearest available upper air meteorological sounding site is located at the ECCC Stony Plain station, located approximately 500 km south-west of the study area. The advantage of higher-resolution simulations (e.g. reduced numerical error associated with the discretization of transport operators, and better treatment of local topographic influences) may thus be offset by errors in the predicted large-scale flow.
While meteorological model synoptic-scale forecast errors may manifest themselves locally as errors in the direction of winds driving local plume transport, other advantages may result from the use of higher-resolution air-quality models. Since lower resolution models de facto instantaneously redistribute plumes emitted from large stack sources over a larger area, such artificial diffusion will reduce the model's ability to accurately simulate concentration maxima, and the resulting chemistry, within simulated model plumes. However, the spatial extent of a plume in a model employing a large horizontal grid-cell size may be such that its existence may be captured at discrete observing sites. In contrast, forecast plumes in models with smaller horizontal grid-cell sizes may correctly capture plume magnitude and chemical behaviour, but may be more subject to errors in the larger-scale wind direction. To illustrate this point, Fig. 3 shows a conceptual diagram of an actual plume, a large grid-cell size model plume, and a small grid-cell size model plume, where the latter two simulated plumes are both subject to the same synoptic-scale error in wind forecast direction (indicated by large red arrows; the smaller red arrow in Fig. 3c indicates the impact of local forcing predicted for the second model). Observation station "+A" is located downwind, and records the presence of the actual plume (Fig. 3a). The coarse grid-cell size simulated plume (Fig. 3b), despite the error in the forecast wind direction, captures part of the observed plume in the resulting time series at the observation station location. In contrast, the small grid-cell size plume (Fig. 3c), despite resolving the plume shape (and plume internal chemistry) to a greater degree than the coarse grid-cell size simulated plume, fails to record the presence of the plume at the observation location. A simple paired observation-model time series evaluation would thus suggest that the former model has superior performance to the latter model in this example, despite the latter model having created a more "realistic" plume in terms of the maximum concentration reached, albeit in the wrong location, due to synoptic-scale forecast wind direction error. In this particular instance, the magnitude of the smaller gridcell size simulated plume is more realistic than that of the coarse grid-cell size plume, but this improvement will not be captured in a standard evaluation analysis. Shifts in plume location across individual grid cells away from the location of an in situ observation are more likely grid-cell size decreases. In this example, a standard analysis would impose a more stringent expectation on the smaller grid-cell size simulation to correctly identify plume locations.
In addition to the standard analysis, we perform additional analyses that examine the model's ability to resolve plumes in the vicinity of the observation station, in order to attempt to evaluate the potential for higher-resolution simulations to provide benefits which may be masked by synoptic-scale forcing errors. This strategy is illustrated in Fig. 4. Figure 4a shows an observation station enclosing the nine nearest-neighbour model grid cells for a 2.5 km grid-cell size, while Fig. 4b shows the corresponding 1 km grid-cell size map, with the 9 nearest-neighbour model grid cells shown in light grey and the 49 nearest grid cells shown in the region enclosed in dark grey. Figure 4a encloses a region of 56.25 km 2 (7.5 km × 7.5 km), while the light grey region in Fig. 4b encloses 9 km 2 , and the darker grey region encloses 49 km 2 . As noted above, in a formal mathematical sense, the smallest region resolvable by an Eulerian grid model is twice the size of the model grid-cell size (relating to the Nyquist frequency of the model); hence the smallest resolvable feature spans two model grid cells in each direction. However, in a practical sense, a total of nine grid cells centered on the observation station must be used to allow a boundary of two grid cells in any direction. Sampling any or all of the nine grid cells in Fig. 4a may thus be said to be representative of the model's ability to simulate events occurring at discrete location "+A". The closest corresponding sampling region available to the 1 km model (Fig. 4b) is shown in dark grey. The light grey region of Fig. 4b represents the closest 1 km grid-cell size region that corresponds to the single 2.5 km grid cell in which the observation station is located in Fig. 4a. We attempt to ascertain model performance in these approximately equivalent regions around each observation station, in the analysis that follows.
Our approach follows two steps: 1. From the 2.5 km simulation, in addition to the predicted model value at the grid cell containing the observation location, we determine the model grid-cell value in the nine grid cells surrounding the observation station location which has the closest value to that observed at the station. This represents the model's "best estimate" of the value at the observation station location itself and represents the model's ability to resolve features at 2.5 km grid-cell size.
2. From the 1 km simulation, in addition to the model value at the grid-cell location, we select the closest value to the observation value from a. the nearest nine grid cells to the observation station location, and b. the nearest 49 grid cells to the observation station location.
The former represents the model's "best estimate" of the value at the observation station location itself, while the latter represents the 1 km model's best estimate in the closest equivalent region to the limiting resolution of the 2.5 km model.
Comparing the resulting statistical measures of each of these selected values with observations, in addition to the standard analysis, thus evaluates the model's best attempt to resolve features for the specified grid-cell size and allows cross-comparison of model performance within nearly equivalent areas. Cross-comparing the statistical values for the different regions described above shows the model's ability to resolve features such as plumes from the standpoint of the region represented at the different grid-cell sizes. If synopticscale transport direction errors create situations similar to that depicted in Fig. 3a, a standard comparison of error would be expected to show little benefit to higher resolution. However, the "best model estimate" comparisons would capture the ability of the higher-resolution model to more accurately simulate the magnitude of the plume, if not its spatial location. Each of these selection procedures will be employed in the surface concentration comparisons which follow. We evaluate our model simulations against observations made at surface monitoring networks in the vicinity of the Athabasca oil sands, and aboard an instrumented aircraft, the National Research Council of Canada Convair. For the surface monitoring data, hourly time series of model output were matched to station time series using the different strategies described above. For the aircraft observations, we extract model values through temporal and spatial interpolation to the aircraft's position during the flights and only perform the standard analysis, as well as examining the behaviour of the two simulations along cross sections corresponding to the flight paths.
Our statistical metrics for evaluation are common to many other air-quality applications and were computed using the "modstat" function from the OpenAir R package (Carslaw and Ropkins, 2012). Further discussion of different metrics for model evaluation may also be found in Yu et al. (2006). The statistics calculated here include mean bias (MB; perfect score: zero), mean absolute gross error (MGE; perfect score: zero), normalized mean bias (NMB; perfect score: zero), normalized mean gross error (NMGE: perfect score: zero), root mean squared error (RMSE; perfect score: zero), correlation coefficient (r, perfect score: unity), coefficient of efficiency (COE: a perfect score is unity, a zero or negative score means the model is equivalent or less predictive, respectively, than the mean of the observations), and the index of agreement (IoA; perfect agreement is unity, and −1 indicates no agreement or little variability).

Model-to-model comparisons and averages
We begin a comparison of 2.5 and 1 km grid-cell size for specific events, and for averages across the 1 km domain, in order to provide a qualitative comparison of the differences in simulations for the two simulations and then con-tinue with the quantitative comparison. Figure 5 compares OS2.5km (left column) and OS1km (right column) simulation results for a cross section located 0.2 km from a major SO 2 emissions source at 00:00, 12:00, and 24:00 UT into a given simulation day.
The model results are identical at hour 0 due to both the OS2.5 and OS1km models being initialized from the OS10 km data at this time (small differences between Fig. 5a and b are due to slight mismatches in the cross section locations). Subsequent cross sections show the OS1km model is capable of resolving both higher absolute mixing ratio values, and sharper gradients, within 12 h of simulation time (Fig. 5c, d). Multiple plumes are resolved by 12 h of simulation time in the 1 km grid-cell size simulation, along with markedly different plume heights, plume structure, and an increase of a factor of 2 in the magnitude of plume mixing ratios relative to the lower grid-cell size simulation, and these differences persist into the 24th simulation hour (Fig. 5e, f). Mixing ratio differences of these magnitudes are to be expected given the increase in resolution, but Fig. 5 shows that other important aspects of the predicted plumes have changed. The plume heights are a function of predicted local stability conditions in the grid square containing the source, and the variation shown here represents a substantial change in the predicted local stability for the origin sources of these plumes, resulting from the change in model horizontal gridcell size. Figure 6 compares the maximum surface SO 2 during the entire period for each simulation, as well as the difference in maximum SO 2 between the simulations, along with a scatter plot of OS2.5 vs. OS1km simulation results. In the latter two panels, OS2.5km values were assigned to the corresponding OS1km grid-cell locations using the nearest-neighbour approach.
The maximum surface concentrations tend to show more elongated structures at the smaller grid-cell size, comparing Fig. 6a and b, particularly for plumes in the western (left) half of the OS1km domain. The difference plot (Fig. 6c) shows that local maximum concentration differences of up to −45 ppbv occur, due to changes in the placement and maximum concentration of high-concentration plumes. The scatter plot of Fig. 6d shows that OS1km model has a demonstrated ability to achieve higher concentrations than the OS2.5km model, with a slope of 1.22, and a noticeable clustering of values along the 1 : 2 line. While these results are not unexpected since approximately 95 % of the SO 2 emissions in the domain originate in large stack, or point, sources, and hence initial concentrations at source would be expected to 6.25× higher in the OS1km simulation, they also suggest that a substantial improvement in the OS1km model's ability to capture SO 2 concentrations should be possible. That is, the results of the two models are substantially different, and given the reduction in numerical error expected with employing a smaller grid-cell size, the latter might be expected to outperform a larger grid-cell size model. However, as we shall demonstrate in the next section, plume placement errors such as those depicted in Fig. 3 play a substantial role in model performance as grid-cell size decreases.

Surface observation comparison
The locations of the local network of 10 surface monitoring stations located near the sources of emissions in the region (oil sands facilities) are shown in Fig. 7. As noted in Sect. 2.4, we carry out several analyses: 1. The standard evaluation. The model values are extracted from the model grid cells containing the observation stations, at both grid-cell sizes.
2. Equal areas of representativeness, 1 and 2.5 km gridcell sizes. The nearest nine OS1km grid cells are compared to the OS2.5km single cell evaluation in two ways: a. averaging of the OS1km results across the nine grid cells prior to evaluation (to determine whether the mean value is better represented by the smaller grid-cell size, similar to the approach taken in Kang et al., 2007); b. selection of the best of the nine grid cells (closest to the observation value), to determine the extent to which the OS1km model is capable of better representing the concentrations somewhere within the corresponding OS2.5km model grid cell, if not at the OS1km cell closest to the observation location -higher scores for the 1 km grid-cell size simulation in this case would indicate that while errors in plume positioning (for example due to errors in the synoptic-scale flow) negate some of the advantages of the OS1km simulation, the plume may be better represented by the OS1km simulation within the 2.5 km grid cell's area.
3. Equal areas of representativeness and equal regions of variability. The nearest 9 2.5 km cells are compared to the nearest 49 1 km cells. Here we make the assumption that the 2.5 km grid-cell size model's ability to resolve features is limited to the surrounding three grid cells in each horizontal dimension and make use of the closestin-size block of corresponding 1 km cells (a 7 × 7 grid centered on the cell containing the observation point). In both cases, the model value closest to the observations is chosen prior to evaluation.
While evaluations (Figs. 2b and 3) deliberately select the "best" value, they also provide a quantitative estimate of the extent to which each model is capable of achieving the correct answer within roughly equal representative areas centered on the observation station locations. These comparisons are intended to evaluate a. the extent to which the 1 km grid-cell size is capable of improving simulation results despite, for example, the larger-scale flow resulting in errors in the plume placement, and b. whether the 1 km grid-cell size model is capable of outperforming the 2.5 km grid-cell size model over equivalent regions.
In the last test, we place both models on an equal footing with regards to the region being represented, as well as with regards to allowing cell-to-cell variability and the selection of a closest match to observations. Our evaluation is presented as tables of statistical metrics. The comparisons employing the nearest-neighbour approach are described with a "Bn" superscript suffix, denoting that the "best" sample within a square centered on the observation point containing a total of n grid cells (e.g. the OS1km B9 label denotes a comparison between observed data and the simulation grid cell within a 3 × 3 grid-cell square centered about the observation point). Similarly, an An superscript de- scribes a comparison between the observations and the average of the n square of grid cells centered on the observation point.
Comparisons to surface concentrations were performed using publicly available data collected by the Wood Buffalo Environmental Association (WBEA), which operates the airquality monitoring network residing within the OS1km domain. The monitoring station locations are shown in Fig. 7. The statistical performance of the models, calculated using the procedure outlined above, are given in Tables 2 through  5, for SO 2 , NO x , O 3 , and PM 2.5 , respectively. In the standard model grid cell to observation measurement comparison for SO 2 and NO x (first two columns, Tables 2 and 3), the OS1km simulation had worse scores for all the metrics considered here. For O 3 , the OS1km model had the better score for the correlation coefficient and root mean square error, and worse scores for all remaining model evaluation metrics. For PM 2.5 , the OS1km model showed a better performance for the correlation coefficient and biases, while the OS2.5km model outperforms the OS1km model for all other metrics examined here. Based on a standard analysis, the OS1km model thus performs poorly compared to the OS2.5km model; the expected advantages associated with reduced numerical error in transport at smaller grid-cell sizes are being offset by other factors controlling the net model error.
When the standard evaluation is compared to the average of the nearest nine 1 km simulation grid cells surrounding the observation point (third column of the tables), an intermediate result appears. For SO 2 ( Table 2) the nine-cell OS1km average has the best performance for correlation coefficient -indicating a better time distribution of events may be achieved by a nine-cell average at 1 km grid-cell size. The other metrics for the A9 simulation are intermediate between the two standard evaluations for each simulation, indicating that some of the performance loss resulting from the use of 1 km grid-cell size is reduced through averaging results to approximately the same size regions as the OS2.5km grid-cell size. The latter result holds for all metrics for NO x (including R, see Table 3). For ozone (Table 4), averaging the nine near- est OS1km grid cells prior to measurement gives the best performance for R and RMSE, and worse performance for the other metrics. For PM 2.5 (Table 5), all metrics for the OS1km nine grid-cell average aside from the bias fall mid-way between the two standard methodology evaluations. Averaging the smaller grid-cell size model results thus shows a marginal improvement, depending on the species, but overall does not compensate for the decrease in performance resulting from going to the smaller grid-cell size.
We next ask the question, "does a more accurate simulation value exist within the same region of the 1 km model as is encompassed by a 2.5 km grid cell?" (fourth column of Tables 2-5), by selecting the model value in the nearest nine 1 km grid cells with the closest match to observations and comparing to the corresponding single 2.5 km grid cell. A dramatic improvement in the relative OS1km performance metric scores can be seen. For each of Tables 2 through 5, this "best of nine" 1 km comparison outperforms the previous 3 comparisons (columns 1 through 3), for all metrics. These improvements are sometimes dramatic (e.g. a doubling of correlation coefficient along with a reduction in mean bias by a factor of 3, a reduction of NO x mean bias values by a factor of 3, a shift of coefficient of error from negative to positive values for O 3 , and a reduction in the coefficient of error for PM 2.5 by a factor of 2.5 compared to the nearest competing value from the previous evaluations. The coefficients of efficiency for SO 2 and O 3 make the transition from negative to positive values when the "best-of-nine" methodology is used, indicating that the model is able to predict the observations better than the observed mean, somewhere within an equivalent area. This evaluation suggests that the OS1km model does contain a better result within the same approximate region encompassed by a 2.5 km grid cell. However, the location of that better result may be subject to positioning error, such as those described in Fig. 3. A valid argument could be made that the methodology employed in this fourth evaluation is subject to selection bias, in that the selection of a best value in the case of the nearest nine 1 km simulation places that model simulation at an advantage relative to the 2.5 km model. To address this last issue, the final two additional methodologies for evaluation were employed, still maintaining the same approximate area of representativeness for a grid cell, namely choosing the best value out of the nearest 9 2.5 km grid cells (the limiting resolution of this model simulation), and the best value out of the nearest 49 1 km grid cells (fifth and sixth columns, respectively, of Tables 2 through 5). That is, we attempt to place the two models on an equal basis with regards to selection bias within a given region containing an observation station.
Two important results can be seen from this final evaluation. First, as was the case for the "best of 9" for the OS1km simulation compared to the standard OS1km evaluation, the best of 9 for the OS2.5km simulation has a considerably better performance than the standard OS2.5km evaluation (compare fifth and first columns, Tables 2 through 5). That is, the OS2.5km model may also be subject to location errors in transported species representation which influence model performance. However, when performance within the 56.25 km 2 area surrounding each measurement point in the OS2.5km best of 9 evaluation is compared to the 49 km 2 area surrounding the measurement points in the OS1km best of 49 simulation (i.e. compare columns five and six in Tables 2 through 5), it can be seen that the OS1km model outperforms the OS2.5km model for all metrics for O 3 and PM 2.5 , and all metrics aside from bias for SO 2 and NO x . That is, despite the OS1km model having a slight disadvantage in the relative size of the representative area containing the measurement station location, and both models being allowed a similar selection strategy, the OS1km model is capable of generating values closer to the observations than the OS2.5km model within an equivalent sub-region, across most of the metrics and chemical species considered here. This final result is strongly suggestive of the presence of issues such as those illustrated in Fig. 3. These may include errors in the larger-scale synoptic wind flow, combined with the reduced size of plumes as grid-cell size is reduced, leading to more "misses" than "hits" for a given recorded event at a measurement station compared to the coarse grid-cell size model. There may be multiple additional causes for such errors (examples include poor observation density in the region for model initialization, underlying lower-resolution boundary condition fields such as topography not improving with the reduction in grid-cell size, inaccuracies in land use fields used in meteorological modelling due to rapid development, and errors in other aspects of the reaction transport modelling system aside from horizontal resolution). The expected advantages of the small grid-cell size, such as better representation of the concentrations of species within plumes and hence better representation of their reactive chemistry (see for example Lonsdale et al., 2012), may be lost in a standard performance analysis due to these other issues.
Our analysis suggests that a practical limit in the benefits of increasing model accuracy may be reached when resolution exceeds some threshold, as a result of other errors inherent in the modelling system. However, the analysis also suggests that if these non-resolution-related errors are corrected, the benefits of adopting a smaller grid-cell size may be substantial. For example, meteorological data assimilation employing a dense monitoring network for a specific area of interest would be expected to show a greater impact for smaller rather than larger grid-cell sizes, due to the greater ability of the former to take advantage of the observation density in correcting the initial meteorological state. We note that recent work applying land use data assimilation (Carrera et al., 2015) to regional 2.5 km grid-cell size weather simulations (Milbrandt et al., 2016) have suggested that such data assimilation may indeed improve forecast skill at the very local scale.
The surface observation data were also analyzed by the time of day, with both observations and simulations split into daytime (09:00 to 18:00 LT) and nighttime (19:00 to 08:00 LT) data pairs (Appendix A, Tables A1 through A8; Carslaw and Ropkins, 2012). Within each of these diurnally segregated time periods, the broad aspects of the comparison were the same as for the "all data" Tables 2 to 5 above: the OS1km simulations tended to have reduced performance in a standard analysis, averaging improved but did not completely ameliorate the performance of the OS1km simulation, a methodology employing the best of nine OS1km grid cells had superior performance to the two standard comparisons, and comparison of the "best of" methodologies for equal areas showed better performance for the OS1km compared to the OS2.5km simulation. We also noted substantial differences in the day and night performance of both models across the methodologies. For example, daytime SO 2 and NO x performance within a given model and comparison methodology was usually better than nighttime performance for IOA, R, NMGE, COE, and NMB, while worse for RMSE, while nighttime O 3 performance was better for IOA, r, NMGE, and COE. Daytime PM 2.5 performance was better than nighttime for IOA, r, COE, and NMB. The study area is located in a broad river valley with frequent slope-defined anabatic/akatabic and drainage flow events. These often have a diurnal nature, and may explain part of the day-night differences. Example sources of these differences may include the relative ability of the driving meteorological model to capture daytime versus nighttime mixed layer turbulence and the planetary boundary layer height.

Comparisons to aircraft observations
A total of 22 aircraft observation flights were carried out during the study simulation period -here we present statistical comparisons using the standard approach only (model grid cell containing the observation point to observation data at the aircraft location). Model values were linearly interpolated in time and space to the aircraft observation locations and times (aircraft observations were on a 10 s interval.) We begin with a composite comparison across all observation times, in Table 6.
The results are in general similar to the surface analysis, in that the OS1km simulation tended to have worse performance than the OS2.5km simulation (exceptions being the biases for both SO 2 and NO 2 , and the slightly better OS1km correlation coefficient for SO 2 ). One striking difference between the first two columns of Tables 2 and 3, compared  to Table 6, is the magnitude of the differences between the simulations. Aloft (Table 6), the differences in performance metric magnitudes between OS2.5 and OS1km simulations are much smaller than at the surface (Tables 3 and 4). The biases are negative aloft and positive at the surface, indicating that both models may be lofting plumes to insufficient distances; one of the possible (non-horizontal grid-cell-sizedependent) causes of model error may be in the extent of vertical transport. This possibility is examined in more detail in Akingunola et al. (2018) and Gordon et al. (2018). An example of this behaviour is shown in Fig. 8; both plumes fumigate to the surface, while the observed plume resides largely aloft. The OS1km model captures the higher concentrations to a better degree, but the impact of excessive fumigation more than offsets this improvement, as is shown by the performance evaluation of Table 7, where both models have negative biases aloft. In this particular case, the tendency of the model to overestimate the extent of fumigation has a bigger impact on performance than grid-cell size. Garcia-Menendez et al. (2014) have noted similar results for forest fire plume prediction. Figure 8a and c provide a further example of the kind of situation referenced in Fig. 3; surface monitoring station locations are depicted as grey circles, one of which is identified with a pink arrow. This station lies within the plume at 2.5 km resolution (Fig. 8a), and outside of the plume at 1 km resolution (Fig. 8c). While the plume direction is the same at both scales, that is, the large-scale wind field controls the positioning of the plume axis, the smaller grid-cell size simulation places a stronger constraint on the accuracy of the wind field. For example, if the simulated large-scale flow direction was inaccurately predicted by only a few degrees, the plume would not appear in the 1 km simulation time series at this location but would register as present in the 2.5 km simulation. Nevertheless, the plume maximum concentration is better captured by the smaller grid-cell size simulation (compare maximum values in observed aircraft SO 2 ; Fig. 8b, d). The higher-resolution simulation may thus more accurately simulate the plume maximum concentration -but not its placement in space, as was hypothesized in Fig. 3.
Meanwhile other flights show a clear advantage of the OS1km model. One example is given by the NO 2 performance evaluation of Table 8 and depicted in Fig. 9, for Flight 17 (a similar flight plan carried out around the same facility as Flight 8). While the correlation coefficient degraded slightly in the OS1km resolution simulation, all other performance measures were improved with the decrease in gridcell size. Two time-versus-height-profile cross sections for Flight 17 are shown in Fig. 9. In Fig. 9a and b, the OS2.5km and OS1km simulations, respectively, are compared for the portion of the overall flight track circling the given facility. This comparison clearly shows that the OS1km model does a better job of capturing the width of the high-concentration region of the plume -however, the location of the model plume lags the observations. During this portion of the flight alone, the OS2.5km model statistics, particularly the correlation coefficient, outperform the OS1km model, due to this issue of plume location mismatching. Figure 9a and b may be compared to Fig. 3a and b -the same situation is depicted in both figures. Figure 9c and d show the OS2.5km simulation (Fig. 10c) and OS1km simulation results in another portion of the flight -here the OS1km performance for most statistics was better than the OS2.5km model performance. The OS1km model (Fig. 9d) captures the existence of a lower-concentration layer aloft in the right-hand side of the cross section, and the existence of low-concentration intervening layers, as well as the overall lower concentrations of SO 2 , while the OS2.5km model does not resolve these fine-scale and lower-concentration features. We note here that IoA, COE, and the other error measures capture the visual impression that the OS1km model outperforms the OS2.5km model for this flight, while the correlation coef- ficient is highly dependent on the placement of the plume maximum in the upper two panels. These and the snap-shot comparisons described in Sect. 3.1 show that the higher-resolution model is having a significant impact on predictions -however, other aspects of the overall model performance are preventing the potential benefits of higher resolution from influencing the standard performance evaluation.

Discussion
A key result of our current work is that 1 km grid-cell size simulations resulted in improved prediction of plume concentration maxima relative to 2.5 km grid-cell size simulations, despite having no improvement using standard scoring methodologies. We also have described a scoring approach wherein these potential advantages of higher resolution may be quantified. We believe that flow field effects such as those described in Fig. 3 are a general result of increasing grid resolution, but note important caveats, which include the following: 1. The availability of meteorological observation and high-resolution emissions data to provide model driving information, and the resolution and proximity of this information to the simulation location. Both will influence the relative importance of grid-cell size on model results. If this information is available in a higher resolution than the lower of two grid-cell size simulations being compared, and/or is used via data assimilation to improve model initial meteorological conditions, our expectation is that the smaller grid-cell size model may outscore the larger grid-cell size model, even for more standard metrics.
2. The extent to which local, versus synoptic, weather conditions drive flow in a given region. For example, in the urban heat island meteorological simulations of Leroyer et al. (2014), the accuracy of local flow predictions was shown to be extremely dependent on the representation of the urban heat island, and the accuracy of the latter was critically dependent on the grid-cell size (which in this example went down to 250 m). In this respect, for meteorological conditions wherein local factors can dominate the flow, and where those conditions may be adequately modelled only at very high resolution, we would again expect the smaller grid-cell size simulation to provide better performance, for standard metrics.
3. The results apply as grid-cell size decreases, but are not necessarily true for grid-cell size increases. Conversely, model performance using standard metrics should not be expected to increase with successively larger and larger grid sizes; the accuracy of even the synoptic flow field will not be captured as model resolution decreases.
Given these considerations, we recommend that modellers should attempt successively smaller grid-cell sizes to determine the following: first, the point at which, for their particu-   . Flight 17 comparison for NO 2 (ppbv) for portions of the net flight track circling the CNRL facility for OS2.5km (a) and OS1km (b) simulations, and for a later section of the same flight path for the OS2.5km (c) and OS1km (d) simulations.
lar system and simulation location, subsequent grid-cell size reductions fail to improve performance, and second, to make use of still higher resolutions for studies wherein the pointto-point comparison is less important, and other factors such as accurately capturing the plume chemistry are more crucial.

Summary and conclusions
Our work suggests the following: decreasing air-quality model horizontal grid-cell size will not necessarily result in improvements to model performance in standard performance evaluations, in which the model values at the grid cells encompassing measurement location stations are used in a pairwise comparison to observations. Other considerations, such as the accuracy of the larger-scale wind direction and speed forecast, and the accuracy of the plume rise parameterization used within the model may play a greater role in the overall performance of the model, and reduce the benefits of the smaller grid-cell size. In the context of a standard model performance evaluation, there may be fixed limits to the benefits of decreasing model grid-cell size. Despite this difficulty, our results also show that the use of smaller grid-cell sizes have some potential benefits, in that these models do a better job of resolving specific air pollution features, like high-concentration maxima within plumes. Both coarse and fine grid-cell size plumes may be misplaced in both time and space, with the net result that the latter model has a worse performance in a standard comparison, but is nevertheless more likely to capture the correct in-plume concentrations, and hence the chemistry, of the actual plume, in the neighbourhood of the observation location. When the evaluation is broadened to find the closest fit to observations in the vicinity of the observation station, with models confined to a similar representative area around the observation station, these potential benefits of the smaller grid-cell size become apparent.
Our results should not be taken as an indication that the standard metrics for model comparison are in some way flawed -they provide the most rigorous method for evaluating the performance of a model at specific monitoring locations and specific times. However, the ancillary performance assessment methodology presented here shows that models with very small grid sizes, which may have standard performance metric scores that have not improved or have even degraded relative to larger grid-cell size models, nevertheless have scientific value, in terms of being better able to capture plume concentrations and hence plume chemistry, if not plume position. The work also suggests that the prediction accuracy of very local transport conditions may be a large factor in preventing the smaller grid-cell size models from achieving improved performance in standard performance analyses.
These findings suggest that at the current state of development, VHR air-quality models are of benefit for the specific purpose of chemical process studies, in which the main aim of the work is to accurately simulate plume chemistry -and in which accurate forecasting of the position of the plume in time and space is a secondary concern. Our work also suggests that efforts to improve other aspects of the overall modelling framework which improve the large-scale flow (for example, the use of data assimilation of local meteorology to improve wind direction predictions) may result in greater benefits as smaller grid-cell sizes are employed.
Data availability. Surface station air monitoring data used in this study were obtained and are publicly available from the Wood Buffalo Environmental Association (https://wbea.org/network-and-data/monitoring-stations/, last access: 1 April 2019). The 2013 aircraft data used in this study are available from the Oil Sands Monitoring site (http://donnees.ec.gc.ca/data/air/monitor/ambient-air-qualityoil-sands-region/pollutant-transformation-summer-2013-aircraftintensive-multi-parameters-oil-sands-region/?lang=en, last access: 1 April 2019). GEM-MACH model output data generated for this study are available on email request to Paul A. Makar (paul.makar@canada.ca).
The model output generated from the study requires a large amount of memory space (several gigabytes per day of simulation) in a binary format specific to Environment and Climate Change Canada's modelling systems. The size of the files precludes maintenance on a public site, but conversion to other formats and arrangements for uploading may be made on request to Environment and Climate Change Canada at paul.makar@canada.ca.
M. Russell et al.: An evaluation of the efficacy of very high resolution air-quality modelling 4411 Appendix A: Model evaluation statistics Table A1. Model comparison statistics.

Metric and formula
Range Ideal score Index of agreement (IOA) = [    Author contributions. MR conducted computer simulations and analysis and created graphical outputs and the initial manuscript draft. AH supervised MR, provided research advice and infrastructure, contributed to manuscript writing, and provided comments on manuscript drafts. PAM co-supervised MR, provided research advice and infrastructure, contributed to manuscript writing, and was the lead for revisions and responses to referees. AA provided model code assistance and setup, and the scoring package for modelobservation comparisons. JZ provided 2.5 and 1 km resolution emissions files. MDM provided 2.5 and 1 km resolution emissions files and commented on manuscript drafts. QZ contributed to the provision of 2.5 and 1 km resolution emissions files.
Competing interests. The authors declare that they have no conflict of interest.