Improving the prediction of an atmospheric chemistry transport model using gradient boosted regression trees

Predictions from process-based models of environmental systems are biased, due to uncertainties in their inputs and parameterisations, reducing their utility. We develop a predictor for the bias in tropospheric ozone (O3, a key pollutant) calculated by an atmospheric chemistry transport model (GEOS-Chem), based on outputs from the model and observations of ozone from both the surface (EPA, EMEP and GAW) and the ozone-sonde networks. We train a gradient-boosted decision tree algorithm (XGBoost) to predict model bias, with model and observational data for 2010-2015, and then test the approach using the years 2016-2017. We show that the bias-corrected model performs significantly better than the uncorrected model. The root mean square error is reduced from 16.21 ppb to 7.48 ppb, the normalised mean bias is reduced from 0.28 to -0.04, and the Pearson's R is increased from 0.479 to 0.841. Comparisons with observations from the NASA ATom flights (which were not included in the training) also show improvements, but to a smaller extent, reducing the RMSE from 12.11 ppb to 10.50 ppb, the NMB from 0.08 to 0.06 and increasing the Pearson's R from 0.761 to 0.792. We attribute the smaller improvements to the lack of routine observational constraints of the remote troposphere. We explore the choice of predictor (bias prediction versus direct prediction) and conclude both may have utility. We show that the method is robust to variations in the volume of training data, with approximately a year of data needed to produce useful performance. Data denial experiments (removing observational sites from the algorithm training) show that information from one location (for example Europe) can reduce the model bias over other locations (for example North America), which might provide insights into the processes controlling the model bias.
We conclude that combining machine learning approaches with process-based models may provide a useful tool for improving the performance of air quality forecasts or to provide enhanced assessments of the impact of pollutants on human and ecosystem health, and may have utility in other environmental applications.

Machine learning has shown utility in the field of atmospheric science; examples include leveraging computationally burdensome short-term cloud simulations for use in climate models (Rasp et al., 2018), quantifying the sea-surface iodine distribution (Sherwen et al., 2019) and high-resolution mapping of precipitation from lower-resolution model output (Anderson and Lucas, 2018). More specifically for atmospheric O3, machine learning has been used for improving parameterizations in climate models (Nowack et al., 2018), creating ensemble weightings for forecasts (Mallet et al., 2009) and predicting exposure during forest fire events (Watson et al., 2019). For bias correction applications, machine learning has been used to correct observational bias in dust prior to use in data assimilation (Jin et al., 2019).
We describe here the GEOS-Chem model used as our model (Sect. 2), the observations of O3 from four observational networks (Sect. 3) and our method (Sect. 4) to produce an algorithm to predict the bias in the model. We explore its performance (Sect. 5) and how it performs under a number of situations, and analyse its resilience to a reduction in training data (Sect. 6) and training locations (Sect. 7). Finally, we explore the choice of predictor in Sect. 8 and discuss the applicability and future of such a methodology in Sect. 9.

GEOS-Chem model
For this analysis we use GEOS-Chem Version V11-01 (Bey et al., 2001), an open-access, community, offline chemistry transport model (http://www.geos-chem.org). In this proof of concept work, we run the model at a coarse resolution of 4° x 5° for numerical expediency, using MERRA2 meteorology from the NASA Global Modelling and Assimilation Office (https://gmao.gsfc.nasa.gov/reanalysis/MERRA-2/). The model has 47 vertical levels extending from the surface to approximately 80 km in altitude. We use the "tropchem" configuration, which has a differential equation representation of the chemistry of the troposphere, and a linearized version in the stratosphere (Eastham et al., 2014). The standard emissions configuration is used, with EDGAR (Crippa et al., 2018) and RETRO (Hu et al., 2015) inventories for global anthropogenic emissions, which are overwritten by regional inventories where available: NEI (USA) (Travis et al., 2016), CAC (Canada) (van Donkelaar et al., 2008), BRAVO (Mexico) (Kuhns et al., 2005), EMEP (Europe) (van Donkelaar et al., 2008) and MIX (East Asia) (Li et al., 2017). GFED4 (Giglio et al., 2013) and MEGAN (Guenther et al., 2012) are used for biomass burning and biogenic emissions.
Details of the other emissions used and other details of the model can be found online (http://www.geos-chem.org, http://wiki.seas.harvard.edu/geos-chem/index.php/HEMCO_data_directories).

To produce the dataset to train the algorithm, the model is run from January 1st 2010 to December 31st 2015 outputting the local model state for each hourly observation (see Sect. 3 and 4). For the testing we run the model from January 1st 2016 to December 31st 2017 outputting the local model state hourly for every grid box within the troposphere.

Observational dataset
The locations of all of the observations used in this study are shown in Figure 1. Ground observations of O3 from the European Monitoring and Evaluation Programme (EMEP) (https://www.emep.int), the United States Environmental Protection Agency (EPA) (https://www.epa.gov/outdoor-air-quality-data) and the Global Atmosphere Watch (GAW) (https://public.wmo.int/) are compiled between 2010 and 2018 (see Sofen et al. (2016) for data cleaning). Due to the coarse spatial resolution of this study (4° x 5°), we removed all sites flagged as "urban", as these would not be representative at this model resolution. Similarly, all mountain sites (observations made at a pressure <850 hPa) were removed due to the difficulty in representing the complex topography typical of mountain locations within the large grid boxes. Ozone-sonde data from the World Ozone and Ultraviolet Radiation Data Centre was also used (https://woudc.org). Ozone-sonde observations above 100 ppb of O3 were excluded as they are considered to be in the stratosphere (Pan et al., 2004). For both surface and sonde observations, when multiple observations were found in the same hourly model grid box (both in the horizontal and the vertical), they were averaged (mean) together to create a single "meta-site". There are 13,118,334 surface meta-site observations in the training period between 1/1/2010 and 31/12/2015, and 3,783,303 in the testing period between 1/1/2016 and 31/12/2017. There are 250,533 ozone-sonde meta-site observations in the training period and 78,451 in the testing.
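The meta-site averaging described above can be sketched as a grouping operation: observations sharing an hourly grid box (in the horizontal and vertical) are averaged into one value. The grid indices and values below are illustrative, not taken from the actual networks.

```python
import numpy as np
from collections import defaultdict

# Sketch of "meta-site" construction: observations in the same hourly model
# grid box are averaged (mean) into a single value. Keys and values here are
# illustrative placeholders.

# (hour, lat_index, lon_index, level_index) -> O3 observation (ppb)
obs = [
    ((0, 23, 41, 0), 31.2),   # two surface sites in the same 4x5 box, same hour
    ((0, 23, 41, 0), 29.8),
    ((0, 23, 42, 0), 45.0),   # a neighbouring box: kept as a separate meta-site
]

groups = defaultdict(list)
for gridbox, value in obs:
    groups[gridbox].append(value)

# One averaged value per occupied grid box per hour
meta_sites = {gridbox: float(np.mean(vals)) for gridbox, vals in groups.items()}
```

In the paper the same reduction is applied to surface, sonde and ATom data, giving the meta-site counts quoted above.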
Observations of O3 from the NASA Atmospheric Tomography Mission (ATom) flights (Wofsy et al., 2018) were used as an independent testing dataset. ATom flew over the Pacific and the Atlantic from the northern mid-latitudes to the southern and back, from the surface to 15 km, measuring the concentration of many compounds including O3 (Figure 1). It flew for each of the four seasons between July 2016 and May 2018, but only the first three (summer, spring and winter) are used due to availability at the time of writing. Given the oceanic nature of the flights and their sampling through the lowermost stratosphere, observations greater than 100 ppb were removed, and data were averaged onto the model grid resolution (mean) to give hourly model resolution "meta" sites. Once averaged, there are 10,518 meta observations used for the algorithm testing (see Table 1).

Once each O3 observation has a corresponding model prediction, we can develop a function to predict the model bias given the values of the model local state as input. Several potential "machine learning" methodologies exist for making this prediction, including neural nets (Gardner and Dorling, 1998) and decision trees (Breiman, 2001). We have tended to favour decision tree methods due to their increased level of explicability over neural nets.
As with other machine learning approaches, decision tree techniques (Blockeel and De Raedt, 1998) make a prediction for the value of a function based on a number of input variables (features), given previous values of the function and the associated values of the features. It is essentially non-linear multivariate regression. A single decision tree is a series of decision-nodes that ask whether the value of a particular feature is higher than a specific value. If the value is higher, progress is made to one decision-node; if it is not, progress is made to a different decision-node. Ultimately, this series of decisions reaches a leaf-node which gives the prediction of the function. The depth of a tree (the number of decisions needed to get to a leaf-node) is an important aspect of tuning decision trees. If the tree is too shallow it will miss key relationships in the data. Conversely, if a tree is too deep it will over-fit to the specific dataset and will not generalise well. The training of the system relies upon deciding which features should be used by each decision-node and the specific value to be tested. The use of a single decision tree leads to over-fitting (Geurts et al., 2009), so the field progressed to random forest regression (Breiman, 2001), where a number of decision trees are constructed with differing sampling of the input data. The mean prediction of all of the decision trees (the forest) is then used as the prediction of the function. More recently, gradient boosting regression (Friedman, 2002) relies on building a tree with a relatively shallow depth and then fitting a subsequent tree to the residuals. This is then repeated until an adequate level of complexity is reached.
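The residual-fitting idea behind gradient boosting can be illustrated with a minimal Python sketch using depth-1 trees (stumps) on synthetic data. This is illustrative only; it is not the XGBoost implementation used in the paper, and all data and settings here are made up.

```python
import numpy as np

def fit_stump(x, y):
    """Find the single split on x that minimises squared error (a depth-1 tree)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = None
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2, left.mean(), right.mean())
    _, thresh, lval, rval = best
    return thresh, lval, rval

def predict_stump(stump, x):
    thresh, lval, rval = stump
    return np.where(x <= thresh, lval, rval)

def boost(x, y, n_trees=50, lr=0.3):
    """Fit stumps sequentially, each one to the residual of the current ensemble."""
    pred = np.full_like(y, y.mean())
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)          # fit to current residuals
        pred = pred + lr * predict_stump(stump, x)
        stumps.append(stump)
    return y.mean(), stumps

def boost_predict(model, x, lr=0.3):
    base, stumps = model
    pred = np.full_like(x, base, dtype=float)
    for s in stumps:
        pred += lr * predict_stump(s, x)
    return pred

# Synthetic 1-D regression problem
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

model = boost(x, y)
rmse = np.sqrt(np.mean((boost_predict(model, x) - y) ** 2))
```

XGBoost adds regularisation, deeper trees and second-order gradient information on top of this basic scheme, but the sequential residual fitting is the same.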
The gradient boosted regression technique suited our needs for a variety of reasons: it is able to capture the non-linear relationships which underlie atmospheric chemistry (Gardner and Dorling, 2000); the decision tree-based machine learning technique is more interpretable than neural net-based models (Kingsford and Salzberg, 2008); it has a relatively quick training time, allowing efficient cross validation for tuning of hyper-parameters; and it is highly scalable, meaning we are able to test on small subsets of the data before increasing to much longer training runs (Torlay et al., 2017). For the work described here we use the XGBoost (Chen and Guestrin, 2016; Frery et al., 2017) algorithm.
Hyper-parameters are parameters set before training that represent the required complexity of the system being learnt (Bergstra and Bengio, 2012). Tuning of these parameters was achieved by five-fold cross validation, whereby the training data is broken into five subsets, with the training data organised by date. The model was then trained on four of these subsets and tested on the remaining subset. Training and testing are repeated on each of the five subsets to identify the optimum hyper-parameters, attempting to balance complexity without over-fitting (Cawley and Talbot, 2010).
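The five-fold scheme above can be sketched as follows, with the data kept in date order before splitting into contiguous folds. A trivial mean predictor stands in for the XGBoost model; the target values are synthetic.

```python
import numpy as np

def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds (data assumed date-ordered)."""
    bounds = np.linspace(0, n, k + 1, dtype=int)
    return [(np.r_[0:bounds[i], bounds[i + 1]:n],      # training indices
             np.arange(bounds[i], bounds[i + 1]))      # validation indices
            for i in range(k)]

rng = np.random.default_rng(1)
y = rng.normal(loc=30.0, scale=10.0, size=1000)        # synthetic "bias" target

fold_rmse = []
for train_idx, val_idx in kfold_indices(len(y)):
    pred = y[train_idx].mean()                         # stand-in for a trained model
    fold_rmse.append(np.sqrt(np.mean((y[val_idx] - pred) ** 2)))

cv_rmse = float(np.mean(fold_rmse))                    # score for one hyper-parameter set
```

In practice the loop would be repeated for each candidate combination of tree number and depth, keeping the combination with the lowest cross-validated error.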
The key hyper-parameters tuned were the number of trees and the depth of the trees. Similar results could be found with 12 to 18 layers of tree depth, with a reduction in the number of trees needed at greater depth. It was found that the algorithm achieved the majority of its predictive power early on, with the bulk of the trees producing small gains in root mean square error. As a compromise between training time and predictive strength, 150 trees with a depth of 12 were chosen. This took 1 hour to train on a 40 core CPU node, consisting of two Intel Xeon Gold 6138 CPUs. Mean squared error was the loss function used for training.
MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²

where y is the observed value, ŷ is the predicted value and N is the number of samples.

Application
With the bias predictor now trained, we can apply it to the model output and evaluate performance. We do this for a different period (1/1/2016-31/12/2017) to that used in the training (1/1/2010-31/12/2015). We first look at the mean daily (diurnal) cycles calculated by the model for nine globally distributed sites (Figure 2, with statistics given in Table 2). The base model (blue) shows significant differences with the observations (black) for most sites. The subtraction of the bias prediction brings the model into much closer agreement with the observations. An evaluation against data not used in the training is also needed, and we use the ATom dataset to provide this independent evaluation. Figure 5 (with statistical data in Table 4) shows the comparison between the model prediction of the ATom observations with and without the bias corrector. Although the inclusion of the bias correction improves the performance of the model, this improvement is significantly smaller than that seen for the surface data. The RMSE is reduced by only 13% for the ATom data compared to 54% for the surface observations. Similarly, the Pearson's R only marginally improves with the use of the bias corrector. Much of the improvement in the model's performance for the ATom data will be coming from the observations collected by the sonde network. There are significantly fewer observations (40:1) collected by that network than by the surface network. Thus, for the bias corrector approach to work well, it appears that there must be significant volumes of observations to constrain the bias under sufficiently diverse conditions. It would appear that the sonde network may not provide that level of information to the degree that the surface network does.
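The evaluation metrics used throughout (RMSE, normalised mean bias and Pearson's R) and the application of the correction (model minus predicted bias) can be sketched as below. The data are synthetic placeholders; only the metric definitions and the correction step are the point.

```python
import numpy as np

def rmse(obs, mod):
    """Root mean square error."""
    return float(np.sqrt(np.mean((mod - obs) ** 2)))

def nmb(obs, mod):
    """Normalised mean bias: sum(model - obs) / sum(obs)."""
    return float(np.sum(mod - obs) / np.sum(obs))

def pearson_r(obs, mod):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(obs, mod)[0, 1])

# Synthetic illustration: a model with a positive bias, and an imperfect
# bias predictor applied as a subtraction.
rng = np.random.default_rng(2)
obs = 30 + 10 * rng.random(500)                        # "observed" O3 (ppb)
model = obs + 8 + 4 * rng.normal(size=500)             # biased model output
predicted_bias = model - obs + rng.normal(size=500)    # imperfect bias prediction
corrected = model - predicted_bias                     # bias-corrected model

before = rmse(obs, model)
after = rmse(obs, corrected)
```

Applied to the real 2016-2017 testing data, the same three metrics give the improvements quoted above (RMSE 16.21 to 7.48 ppb, NMB 0.28 to -0.04, R 0.479 to 0.841).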

Applying the bias corrector to all of the grid points within the model shows the global magnitude of the predicted bias (Figure 6). Similar to the analysis of the nine individual sites, the base model is predicted to be biased high over much of the continental USA, with smaller biases over Europe and the tropical ocean regions. Over the Southern Ocean the model is predicted to be biased low. However, the bias is also predicted for regions without observations (see Figure 1). For example, over China, the model is predicted to be biased high by ~15 ppbv. This is higher than, but not dissimilar to, the biases previously found for the model in China by Hu et al. (2018), who found a positive bias of 4-9 ppbv, albeit using a different model configuration (higher resolution) and a different model assessment (MDA8 vs annual mean). Similar questions as to the accuracy of the prediction arise from the large biases predicted for central Africa and South America. Future evaluation of the bias corrector methodology should look more closely at the impact on these regions and, where possible, extend the training dataset to use observations from these regions if they are available.

The gain (the loss reduction gained from splits using that feature (Chen and Guestrin, 2016)) is shown in Figure 7. This provides a diagnostic of the importance of different input variables in the decision trees used for making predictions. Surprisingly, the most important feature from this analysis is the concentration of NO3 (the nitrate radical). This has a high concentration in polluted night-time environments and a low concentration in clean regions or during daytime (Winer et al., 1984). This fea- (Wittrock et al., 2006). Future work should explore these explanatory capabilities to understand why the bias correction is performing as it is. This may also allow for a scientific understanding of why the model is biased rather than just how much the model is biased.
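The "gain" of a single split can be written down directly: it is the reduction in squared-error loss achieved by dividing the samples at a threshold. The sketch below reimplements that idea for one feature on synthetic data; XGBoost accumulates such gains per feature across all trees to produce the ranking shown in Figure 7 (its `importance_type="gain"` diagnostic).

```python
import numpy as np

def split_gain(x, y, threshold):
    """Loss reduction from splitting the samples on x <= threshold."""
    def sse(v):
        return float(((v - v.mean()) ** 2).sum()) if len(v) else 0.0
    left, right = y[x <= threshold], y[x > threshold]
    return sse(y) - (sse(left) + sse(right))

# Synthetic data: one feature genuinely controls the target, one is noise.
rng = np.random.default_rng(3)
informative = rng.uniform(0, 1, 300)
noise = rng.uniform(0, 1, 300)
target = np.where(informative > 0.5, 10.0, 0.0) + rng.normal(size=300)

gain_informative = split_gain(informative, target, 0.5)
gain_noise = split_gain(noise, target, 0.5)
```

A feature that actually explains the target produces a far larger gain than an uninformative one, which is why the high gain for NO3 suggests night-time conditions matter for the model bias.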
We have shown that the bias corrector method provides an enhancement of the base-model prediction under the situations explored. We now perform some experiments with the system to explore its robustness to the size of the dataset used for training, both spatially and temporally.
6 Size of training dataset

The bias predictor was trained using six years of data (2010-2015). This provides a challenge for incorporating other observational datasets. For some critical locations such as China or India the observational record is not that long, and for high resolution model data (e.g. 12.5 km (Hu et al., 2018)) managing and processing 73 parameters for six years could be computationally burdensome. Being able to reduce the number of years of data whilst maintaining the utility of the approach would therefore be useful. Figure 8 shows the improvement in the global performance of the model metrics (same as for Table 4) as a function of the amount of training data used by the bias corrector. For the other six sites around the world, the influence of removing North and South America is minimal. It appears surprising that the corrections applied for North America are so good even though the North American data is not included within the training. This suggests that at least some of the reasons for the biases in the model are common between, say, North America and Europe, indicating a common global source of some of the bias. This may be due to errors in the model's chemistry or meteorology, which would be a global rather than a local source of bias. Removing the Western hemisphere reduces the number of unique environments the algorithm has to learn from, resulting in significant changes in the prediction.
These types of data denial experiments may in the future provide an ability to explain model failings, which could be used to help improve the process level representation within models.
8 Nature of the prediction

The bias correction method described here attempts to predict the bias in the model. An alternative approach would be to directly predict the O3 concentration. An algorithm to do this, given the same model local state information, is trained on the standard six years of training data (2010-2015). Table 4 shows a statistical analysis of the performance of the model coupled to both the bias predictor and the direct predictor. For the testing years (2016 to 2017) the direct prediction of surface O3 performs marginally better than the bias correction for some metrics (RMSE of 7.11 ppb versus 7.48 ppb, NMB of 0.00 versus -0.04, and R of 0.850 versus 0.841), but for some metrics the performance is less good (slope of best fit of 0.84 versus 0.89 and a y-intercept of 4.96 ppb versus 2.07 ppb). However, for the ATom dataset, the bias predictor performs better (Table 4). We interpret this to mean that for locations where observations are included in the training (surface sites and sondes), directly using those observations has benefits. However, for sites where no observations are used, it is better to use the bias corrector approach.
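The two prediction targets can be contrasted in a minimal sketch: train one model on the bias (observation minus model) and apply it as a correction, and train another directly on the observed concentration. Ordinary linear least squares stands in for XGBoost here, and the data are synthetic, so this only illustrates the two formulations, not the paper's results.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
features = rng.normal(size=(n, 3))                         # stand-in model local state

# Synthetic truth: observations depend linearly on the features; the model
# output carries an additional feature-dependent bias plus an offset.
obs = 30 + features @ np.array([5.0, -3.0, 2.0]) + rng.normal(size=n)
model = obs + features @ np.array([2.0, 0.0, 1.0]) + 5 + rng.normal(size=n)

X = np.column_stack([features, np.ones(n)])                # add an intercept column

# Formulation 1 -- bias predictor: learn (model - obs), subtract the prediction.
w_bias, *_ = np.linalg.lstsq(X, model - obs, rcond=None)
corrected_bias = model - X @ w_bias

# Formulation 2 -- direct predictor: learn the observed concentration itself.
w_direct, *_ = np.linalg.lstsq(X, obs, rcond=None)
corrected_direct = X @ w_direct

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

base_rmse = rmse(obs, model)
bias_rmse = rmse(obs, corrected_bias)
direct_rmse = rmse(obs, corrected_direct)
```

In this idealised linear setting both formulations recover the observations well; the paper's finding is that with real data their relative merits depend on whether observations from the evaluated region were present in training.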
Further work is necessary to advance our understanding of the form of the prediction that is necessary to best provide a useful enhancement of the system.

Discussion
We have shown that the bias in the O3 concentration calculated by a chemistry transport model can be reduced through the use of a machine learning algorithm, with the results appearing robust to data denial and training length experiments. For activities such as air quality forecasting for sites with a long observational record, this appears to offer a route to significant improvements in the fidelity of the forecasts without having to improve process level understanding. This work offers some practical advantages over data assimilation. The observations don't necessarily need to be available in real time, as the training of the bias predictor can be made using past observations and applied to a forecast without the latest observations being available. The approach may also be applied to regions where observational data is not available. Although this necessitates care, the temporary lack of availability of data is much less of a problem for this approach than for data assimilation.
Significantly more future work is needed to understand the approach than has been shown in this proof of concept work.
Exploring the number and nature of the variables used would thus be advantageous. The complete set of model tracers and some physical variables were used here, but their choice was somewhat arbitrary. A more systematic exploration of which variables need to be included is necessary. Are all the variables needed? Are important physical variables missing? Similarly, only one machine learning algorithm has been used, with one set of hyper-parameters chosen. Algorithm development is occurring very quickly, and we have not explored other approaches, such as neural nets, that may offer improved performance. The ability to predict the bias for regions without observations is also a potentially useful tool for better constraining the global system.
Observations of surface O3 exist for China (Li et al., 2019) but have not been included here for expediency. It would be scientifically interesting to see how they compare to those predicted by the bias corrector, and how the bias corrector changes if they are included in the training. It seems possible that the approach developed here could be used to explore methods to extract information about why the model is biased rather than just quantifying that bias. Some hint of that is given by the importance of the nitrate radical (NO3) in the decision trees, which highlights the night time as being a large factor in the model bias. Finally, the method could readily be extended to other model products such as PM2.5.

More generally, machine learning algorithms appear to offer significant opportunities to understand the large, multivariate and non-linear datasets typical of atmospheric science and the wider environmental sciences. They offer new tools to understand these scientifically interesting, computationally demanding and socially relevant problems. However, they must also be well characterised and evaluated before they are routinely used to make forecasts and predictions.
Code and data availability.

The GEOS-Chem model code is available from https://github.com/geoschem/geos-chem and the XGBoost code used is available from https://xgboost.readthedocs.io. Licensing agreements mean that we are unable to redistribute the observational data; however, it is all publicly available.