Technical note: Uncertainties in eddy covariance CO2 fluxes in a semi-arid sagebrush ecosystem caused by gap-filling approaches

Gap-filling eddy covariance CO2 fluxes is challenging at dryland sites due to small CO2 fluxes. Here, four 10 machine learning (ML) algorithms including artificial neural network (ANN), k-nearest neighbours (KNN), random forest (RF), and support vector machine (SVM) are employed and evaluated for gap-filling CO2 fluxes over a semi-arid sagebrush ecosystem with different lengths of artificial gaps. The ANN and RF algorithms outperform the KNN and SVM in filling gaps ranging from hours to days, with the RF being more time efficient than the ANN. Performances of the ANN and RF are largely degraded for extremely long gaps of two months. In addition, our results suggest that there is no need to fill the 15 daytime and nighttime NEE gaps separately when using the ANN and RF. With the ANN and RF, the gap-filling induced uncertainties in the annual NEE at this site are estimated to be within 16 g C m-2, whereas the uncertainties by the KNN and SVM can be as large as 27 g C m-2. To better fill extremely long gaps of a few months, we test a two-layer gap-filling framework based on the RF. With this framework, the model performance is improved significantly, especially for the nighttime data. Therefore, this approach provides an alternative in filling extremely long gaps to characterize annual carbon 20 budgets and interannual variability in dryland ecosystems.


Introduction
The eddy covariance (EC) technique has been widely applied for monitoring energy and water fluxes as well as net ecosystem exchanges of carbon dioxide (NEE) and other trace gases between lands and the atmosphere (Baldocchi, 2003;25 Oncley et al., 2007;Stoy et al., 2013). However, due to multiple factors including power outages, instrument malfunctions and maintenance, and data quality checks, there exist gaps with approximately 20-60% of half-hourly data points annually at many long-term EC sites (Dragoni et al., 2007;Falge et al., 2001;Ma et al., 2007;Missik et al., 2019;Moffat et al., 2007;https://doi.org/10.5194/acp-2021-631 Preprint. Discussion started: 28 July 2021 c Author(s) 2021. CC BY 4.0 License. Pastorello et al., 2020;Soloway et al., 2017;Wutzler et al., 2018). An average gap fraction of 30% in an annual dataset leads to an uncertainty of ±25 g C m -2 year -1 for the annual NEE at forest sites (Moffat et al., 2007), while some EC sites report 30 much greater uncertainties (Soloway et al., 2017). Therefore, gap-filling usually accounts for one large source of uncertainties in the annual NEE (Soloway et al., 2017), together with other sources of uncertainties such as measurement errors and bias related to non-closure of the surface energy balance Wilson et al., 2002).
Robust NEE gap-filling approaches are critical for quantifying the annual and interannual variability of carbon budgets (Falge et al., 2001;Moffat et al., 2007;Pastorello et al., 2020;Richardson and Hollinger, 2007;Soloway et al., 2017;35 Wutzler et al., 2018). Previous studies have developed and evaluated a number of NEE gap-filling approaches including nonlinear regressions (NLR), look-up tables (e.g., marginal distribution sampling (MDS)), machine learning (ML) algorithms (e.g., artificial neural networks), and process-based models (Falge et al., 2001;Huang and Hsieh, 2020;Moffat et al., 2007;Wutzler et al., 2018). NLR fills NEE gaps based on regression analyses between NEE and meteorological variables such as temperature (e.g., air or soil temperature) and light (e.g., photosynthetically active radiation), whereas MDS is based on 40 look-up tables for similar meteorological conditions (i.e., global radiation, air temperature, and vapor pressure deficit) (Falge et al., 2001;Moffat et al., 2007;Reichstein et al., 2005). MDS has been widely used in gap-filling by virtue of an easy-to-use R package (Wutzler et al., 2018), although it cannot effectively fill the gaps of longer than 12 days (Moffat et al., 2007). MLbased methods are trained by presenting them with numerous meteorological variables as inputs and NEE as output data, which have the potential to fill long gaps (Kim et al., 2020;Moffat et al., 2007). Artificial neural network (ANN) was 45 applied to fill daytime and nighttime gaps separately (Knox et al., 2016;Moffat et al., 2007;Papale and Valentini, 2003).
However, the performance of these ML-based algorithms has not been evaluated in filling gaps in EC fluxes for dryland ecosystems with low NEE. The motivation of this gap-filling practice was driven by the fact that dryland ecosystems are 50 very sensitive to water availability, functioning as carbon sinks in wet years and carbon sources in dry years (Biederman et al., 2017;Scott et al., 2015), and bias in gap-filled NEE may alter conclusions in sources or sinks of dryland ecosystems in case of relatively long gaps for eddy covariance data. In addition, different ML algorithms have distinctive internal structures that account for the underlying dependencies of outputs (i.e., NEE) on the inputs (i.e., meteorological variables) in different ways, and uncertainties associated with different ML-based methods can also be assessed. 55 In this study, we evaluate the performance of four commonly used ML algorithms (ANN, KNN, RF, and SVM) in filling the gaps in the NEE data collected at an EC site over a semi-arid sagebrush ecosystem in the central Washington, USA, from 2016 to 2019, and assess the uncertainties in the annual NEE introduced by gap-filling methods.

Site description 60
The eddy covariance flux tower is located in the Hanford Area in the U.S. State of Washington (AmeriFlux site: US-Hn1; 46°24¢32² N, 119°16¢30² W), and it started to collect data in December 2015 (Gao et al., , 2020a(Gao et al., , 2020bMissik et al., 2019). This semi-arid site is predominantly covered by scattered shrubs and short grasses. Shrub species include Artemisia tridentata and Chrysothamnus viscidiflorus, and grasses include invasive weedy species (i.e., Bromus tectorum and Salsola kali) and native grasses (i.e., Poa secunda, Pseudoroegneria spicata, and Stipa comate) . The long-term 65  mean annual precipitation was 197 mm, most of which occurred late in the fall and early in winter . The soil texture in the top layer of 30 cm is loamy sand with small rocks and gravel interspersed (Gao et al., 2017;Missik et al., 2019). In this study, the 4-year data from 2016 to 2019 are analyzed.

Eddy covariance and meteorological measurements
The EC system included a three-dimensional sonic anemometer (CSAT3,Campbell Scientific,Inc.) and an open path gas 70 analyzer (LI-7500A, LI-COR, Inc.). In addition, a variety of microclimate data were measured, including four-component radiation, air temperature and relative humidity, wind speed and direction, precipitation, soil heat flux, and soil temperature and volumetric water content (Gao et al., 2017(Gao et al., , 2020aMissik et al., 2019). These data were sampled at a rate of 1 Hz and stored as 30-min averages. Further, 15-min meteorological data from two weather stations close to the tower site were obtained from the Washington State University AgWeatherNet (AWN; https://weather.wsu.edu/). The two AWN stations are 75 located within 8 km from the tower. The 15-min data were averaged to half-hourly values to fill gaps in the tower meteorological data of solar radiation ( ! ), air temperature ( "#$ ) and relative humidity (RH), vapor pressure deficit (VPD), wind speed (WS), precipitation (P), and soil temperature ( %&#' ). Thus, these half-hourly meteorological data for the study period are gap-free.

EC data processing, quality control, and gap identification 80
Raw 10 Hz EC data were processed using the EddyPro ® software (version 7.06, LI-COR Biosciences, USA) to calculate the 30-min average fluxes of CO2 (NEE) and latent (LE) and sensible (H) heat. The data were despiked and filtered for physically impossible values and abnormal diagnostic values of the sonic anemometer and the gas analyzer. The double rotation method was applied to the sonic anemometer data. Block averaging was used to determine the turbulent fluctuations for each 30-min intervals. The fluxes were corrected for the effects of high-and low-pass filtering (Massman, 2000(Massman, , 200185 Moncrieff et al., 2004) and air density fluctuations (Webb et al., 1980), respectively. The corrected fluxes were quality checked according to Mauder and Foken (2004). After quality checking, the "REddyProc" R package (Wutzler et al., 2018) https://doi.org/10.5194/acp-2021-631 Preprint. Discussion started: 28 July 2021 c Author(s) 2021. CC BY 4.0 License. was used to determine the friction velocity ( * ) threshold and NEE data with low turbulence conditions were removed from the dataset.
For simplicity, a score of 2 was assigned to gaps due to field operations (e.g., instrument maintenance), electrical and/or 90 instrument failures, a score of 1 to gaps due to low data quality (i.e., quality control and * filtering), and a score of 0 to flux data with good quality. Only the data with the score of 0 were used to train/test the gap-filling algorithms.

Machine learning algorithms
Four ML algorithms including the ANN, KNN, RF, and SVM were employed and evaluated for filling NEE gaps. In the following sections, we briefly describe the characteristics and implementation of each ML algorithm. The required 95 parameters in each algorithm are optimized using the "caret" R package (Kuhn et al., 2020) with a 10-fold cross-validation repeated ten times.

Artificial neural network (ANN)
The ANN algorithm has been successfully applied for filling NEE gaps in various ecosystems (Baldocchi & Sturtevant, 2015;Knox et al., 2016;Moffat et al., 2007;Papale & Valentini, 2003;Tramontana et al., 2016). In this study, we employed 100 the "neuralnet" R package (Günther and Fritsch, 2010), the resilient backpropagation algorithm that has proven to be capable of gap-filling flux data (Dengel et al., 2013;Jammet et al., 2015;Kim et al., 2020;Knox et al., 2016). The required parameters for the ANN algorithm include the number of hidden layers and the number of nodes in each layer. Here, based on the parameter tuning results, we use two hidden layers with 12 and 10 nodes in the first and second hidden layers. We train the neural network 1000 times, and the mean prediction results of the top 20 runs based on their training and testing R 2 105 values are used to fill NEE gaps (Baldocchi & Sturtevant, 2015;Knox et al., 2016).

K-nearest neighbours (KNN)
The KNN algorithm (Fix and Hodges, 1951) is a non-parametric ML approach and has been used in many applications. For example, Chen et al. (2012) applied the KNN algorithm for filling latent heat flux gaps, which fills the data gaps based on a certain attribution of k neighbours in the feature space. In this study, we use the "caret" R package (Kuhn et al., 2020) to 110 build the KNN where a suitable k value needs to be determined. Here, the optimized k value is 9.

Random forest (RF)
The RF algorithm (Breiman, 2001) has been applied for upscaling flux data to regional (Xu et al., 2018) and global (Jung et al., 2017;Zeng et al., 2020) scales and recently for gap-filling flux data (Huang and Hsieh, 2020;Kim et al., 2020). The RF algorithm uses bootstrap aggregation and feature randomness when generating each individual tree to try to create many 115 independent decision trees that operate as an ensemble of the prediction results. In this study, we create 500 regression trees https://doi.org/10.5194/acp-2021-631 Preprint. Discussion started: 28 July 2021 c Author(s) 2021. CC BY 4.0 License. for each case using the "randomForest" R package (Liaw and Wiener, 2002), in which the tuning parameter is the number of randomly selected predictors (i.e., mtry equals 7).

Support vector machine (SVM)
The SVM algorithm (Cortes and Vapnik, 1995) has also been applied for gap-filling (Huang and Hsieh, 2020;Kim et al., 120 2020) and upscaling (Xu et al., 2018) flux data. The SVM algorithm can convert nonlinear regressions into linear regressions by projecting the original finite-dimensional space into the much higher-dimensional space with a predefined kernel function. In this study, we use the radial basic kernel function and the "kernlab" R package (Karatzoglou et al., 2004) where the tuning parameters include the inverse kernel width (i.e., sigma = 0.13) and the cost regularization parameter (i.e., C = 27). 125

Input variables
Besides the above mentioned meteorological variables, the input variables for the ML algorithms also include the normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI) from the Moderate Resolution Imaging Spectroradiometer (MODIS), and three fuzzy variables (i.e., decimal day of year, and sine and cosine functions to represent seasonal changes) following Moffat et al. (2007). We obtained the NDVI and EVI data around the flux tower location from 130 the MOD13Q1 version 6 data product (https://lpdaac.usgs.gov/products/mod13q1v006/) at 16-day temporal and 250-m spatial resolutions (Didan, 2015). The 16-day NDVI and EVI data were resampled to 30 minutes using cubic spline interpolation.

Artificial gap scenarios and performance evaluation
In order to evaluate the gap-filling algorithms, artificial gaps with different lengths were randomly generated in the original 135 flux data, accounting for 10% of the total data length. Following Moffat et al. (2007) and Kim et al. (2020), we considered four artificial gap lengths: one hour (1-H), one day (1-D), one week (1-W), and two months (2-M). Each gap length scenario was permuted 10 times, resulting in 40 distinct artificial gap scenarios plus the real gap scenario to be filled.
For each artificial gap scenario, we trained the models separately with only daytime or nighttime data, and with both daytime and nighttime data. The performance of gap-filling algorithms was evaluated by comparing the estimated values ( # ) with the 140 measured values ( # ) for the artificial gaps. Five commonly used performance metrics are used including the coefficient of determination (R 2 ), the absolute root mean square error (ARMSE) and relative root mean square error (RRMSE), mean absolute error (MAE), and the bias error (BE): (2) 145 where the overbar denotes the mean value.

NEE data gap evaluation
At the US-Hn1 site, different lengths of gaps were found in NEE data during the four years from 2016 to 2019 (Fig. 1). Gaps with short to medium lengths were usually caused by low data quality (i.e., gap score of 1), whereas gaps with medium to extremely long lengths were mostly due to electrical and/or instruments failures (i.e., gap score of 2). Data gaps with a score 155 of 1 frequently occurred in nighttime because * filtering mainly removes nighttime data. On average, gaps with scores of 1 and 2 accounted for about 28.4% and 30.2% of half-hour NEE data, respectively (Table 1). There were more extremely long gaps in 2018 and 2019 than 2016 and 2017 due to power failures in winter and early spring. Therefore, it is worth to examining the performances of gap-filling algorithms in filling different lengths of gaps. Overall, about 43.7% of NEE data were available to calibrate and validate the gap-filling methods. In addition, for the data gaps with a score of 1, 160 meteorological data from the flux tower were still available for NEE gap-filling; whereas for the data gaps with a score of 2, meteorological data from the flux tower also had gaps, and the data obtained from the two nearby AWN stations are thus used for gap filling.

Figure 1
Distribution of NEE data gaps by day and hour from 2016 to 2019. NEE data gaps were classified as non-gaps (gap score: 0), 165 gaps due to low data quality (gap score: 1), and gaps due to electrical and/or instrument failures (gap score: 2). https://doi.org/10.5194/acp-2021-631 Preprint. Discussion started: 28 July 2021 c Author(s) 2021. CC BY 4.0 License. Table 1 Percentage of NEE data with different gap scores (0/1/2) for daytime and nighttime from 2016 to 2019. NEE gaps are classified as non-gaps (gap score: 0), gaps due to low data quality (gap score: 1), and gaps due to electrical and/or instrument failures (gap score: 2).

Daytime
Nighttime

Performance of ML gap-filling algorithms with different gap lengths 170
We first train and evaluate the ML algorithms without separating the daytime and nighttime data (i.e., using all the data). The overall performance of each algorithm degrades as the gap length increases, while the RF slightly outperforms the other algorithms for all the gap scenarios ( Fig. 2 and Table 2). For the gap length of one hour (1-H), all the four ML algorithms have the highest R 2 and the lowest ARMSE, RRMSE, and MAE. As the gap length increases, R 2 for all ML algorithms decreases and ARMSE (RRMSE and MAE) increases. For 1-H, the RF has R 2 , ARMSE, and RRMSE of 0.77±0.02, 175 0.66±0.03 µmol m -2 s -1 , and 0.47±0.02, respectively, whereas for the gap length of two months (2-M), the R 2 decreases to 0.60±0.22, and ARMSE and RRMSE increase to 0.91±0.07 µmol m -2 s -1 and 0.62±0.17, respectively. The magnitudes of BE for all ML algorithms also increase with the increasing gap lengths. The performance of these ML algorithms in NEE gapfilling at this semi-arid sagebrush site is comparable to that at some grassland sites (Huang & Hsieh, 2020), but lower than that at forest and cropland sites (Huang & Hsieh, 2020;Moffat et al., 2007). 180 Following Moffat et al. (2007), we also perform the algorithm training and evaluation separately for daytime and nighttime data. For the daytime data, the performance of each algorithm is similar to that for all the data, whereas the algorithm performance for the nighttime data is degraded with R 2 of 0.1-0.2 and RRMSE of 0.7-0.8, similar to the results for some forest sites in Moffat et al. (2007). The poor performance of the gap-filling algorithms for the nighttime data is primarily attributed to the shortage of available nighttime data for training the models ( Fig. 1 and Table 1). However, the change of BE 185 with the increased gap length at night is relatively small compared to that for the daytime data, especially for the ANN and RF. In addition, BE for long gaps (e.g., 2-M) has opposite signs for the daytime and nighttime data, resulting in a smaller BE for all the data.     Figure 3 shows the comparison of the probability density functions (PDFs) between the measured and estimated NEE of the gap-filling algorithms. The estimated NEE by the gap-filling algorithms has similar shape of PDFs, which is also quite similar to that of the measured NEE with some difference in amplitudes. At this site, the measured half-hourly NEE ranges from -6 to 4 µmol m -2 s -1 , while the estimated NEE varies from approximately -5 to 2 µmol m -2 s -1 . For all the data, the PDFs of the estimated NEE present two peaks at around -0.2 and 0.2 µmol m -2 s -1 , whereas the PDF of the measured NEE only has 205 one peak at around 0.2 µmol m -2 s -1 . That means that the PDFs of the estimated NEE have higher amplitude than the measured NEE in the range of -0.6 to 0.0 µmol m -2 s -1 . However, in the range of -2.0 to -0.8 µmol m -2 s -1 , the amplitude of the PDFs of the estimated NEE is lower than that of the measured NEE. The PDFs of the estimated NEE for the daytime data show one peak at around -0.2 µmol m -2 s -1 with their shapes similar to those for all the data in the range with the negative NEE values; for the nighttime data, the estimated NEE has a narrower shape of PDF than that of the measured NEE. These 210 results suggest that the gap-filling algorithms underestimate the magnitudes of NEE in the range of -2.0 to 0.0 µmol m -2 s -1 and the magnitudes of the peak values of NEE.

Figure 3
Comparison of probability density functions (PDFs) for the measured and estimated NEE of the gap-filling algorithms using daytime and nighttime data together and separately. The bin width is 0.2 µmol m -2 s -1 and bins with less than 10 data points are excluded. Figure 4 shows the absolute errors as a function of bin-averaged NEE values for the gap-filling algorithms with the estimated random measurement errors (Finkelstein and Sims, 2001) as a reference. For all the data, the binned absolute errors for the ML algorithms are quite close to each other, and they are also close to the random errors in the range of negative NEE values but slightly deviated from the random errors in the range of positive NEE values. Note that the large deviations from the 220 random errors at the edges are most likely due to the small number of data points as illustrated in Fig. 3. The binned absolute https://doi.org/10.5194/acp-2021-631 Preprint. Discussion started: 28 July 2021 c Author(s) 2021. CC BY 4.0 License. errors for the daytime data are close to the random errors in the NEE range of about -4.0 to 1.0 µmol m -2 s -1 ; whereas the binned absolute errors for the nighttime data are consistently higher than the random errors. For all the data, the mean value of the random errors is 0.56 µmol m -2 s -1 , while the MAE of the ML algorithms is 0.55, 0.59, 0.53, and 0.58 µmol m -2 s -1 for the ANN, KNN, RF, and SVM, respectively. For the daytime data, the MAE of the ML algorithms (0.59, 0.62, 0.55, and 225 0.62 µmol m -2 s -1 ) is all smaller than the mean value of the random error (0.66 µmol m -2 s -1 ); whereas for the nighttime data, the MAE of the ML algorithms (0.46, 0.48, 0.47, and 0.49 µmol m -2 s -1 ) is all larger than the mean value of the random error (0.32 µmol m -2 s -1 ). Overall, the RF and ANN have better performance in filling NEE gaps at this semi-arid site, especially for the daytime data, although the RF is more time efficient than the ANN. In addition, all the four ML algorithms have low performance in gap-filling the nighttime data, though the BE for the nighttime data is relatively small compared to the 230 daytime data mostly due to the low magnitudes of the nighttime NEE.

Figure 4
Comparison of NEE measurement and model uncertainties of the gap-filling algorithms using the data and the daytime and nighttime data. The random measurement error is estimated by Finkelstein and Sims (2001). The bin width is 0.2 µmol m -2 s -1 and bins with less than 10 data points are excluded.

Uncertainties in carbon budgets caused by gap-filling
We now examine uncertainties in carbon budgets caused by gap filling with different training dataset and different methods.
With the 40 distinct artificial gap scenarios, we obtain 40 gap-filled NEE time series for each method. By replacing the 240 artificial gaps with the observed data, these time series datasets allow for an evaluation of the model consistency and reliability in filling the actual gaps. Figure 5 compares the monthly accumulative NEE of the gap-filled data during 2016-2019. The subscript A denotes that the daytime and nighttime NEE data are gap-filled together using the ML algorithms trained with all the data, and DN denotes that the daytime and nighttime data are gap-filled separately using the trained ML algorithms and then combined together to determine the monthly accumulative NEE. The error bar denotes one standard 245 deviation of the monthly accumulative NEE of the 40 gap-filled time series. Here the gap-filled NEE using the MDS method is also included for comparison.
For months with gaps less than 7 days ( Fig. 1; e.g., February to October in 2016), all the four ML algorithms are quite consistent and reliable with the mean standard deviations of the monthly accumulative NEE ranging from 0.4 (ANN) to 1.0 (SVM) g C m -2 . Both the RF and KNN have mean standard deviations of approximately 0.7 g C m -2 during these months. For 250 months with long gaps (i.e., > 7 days), the mean standard deviations of the monthly accumulative NEE are 1.2, 1.3, 1.5, and 3.3 g C m -2 for the ANN, KNN, RF, and SVM, respectively. From this perspective, the ANN is the most reliable method in gap-filling because the predicted values are averages of the best 20 runs (Section 2.4.1). In other words, most of the ML algorithms are quite consistent in filling the gaps, and the differences in the monthly accumulative NEE caused by changes in training dataset are less than 1.5 g C m -2 except for the SVM. 255 The uncertainties in the monthly accumulative NEE as a result of the differences in the methods are now assessed. The difference among the methods ranges from 0.2 to 1.3 g C m -2 for months with gaps less than 7 days, and changes from 0.8 to 10.2 g C m -2 for months with long gaps, which is 0.8 to 4.8 g C m -2 without including the SVM. For the ANN and RF, the differences in the monthly accumulative NEE between A and DN are usually quite small for all the months, whereas the KNN presents opposite signs for A and DN in months with long gaps, which means that the KNN is unable to handle the 260 daytime and nighttime data together. Overall, for months with short to medium gaps (i.e., < 7 days), there is no significant difference in the monthly accumulative NEE among the methods including the MDS method. For months with long gaps (i.e., > 7 days), the MDS method usually fails, and the ANN and RF have the best performance, and they have the potential to handle the daytime and nighttime data gap-filling together.
265 Figure 5 Monthly total net ecosystem exchange (NEE, g C m -2 ) at the US-Hn1 in (a-d) 2016-2019. The subscript A denotes that the daytime and nighttime NEE data are gap-filled together using the ML algorithms trained with all the data, and DN denotes that the daytime and nighttime data are gap-filled separately and then combined to determine the monthly total values. RFs denotes the proposed two-layer RF based gap-filling framework, and MDS is the marginal distribution sampling algorithm. The error bar denotes one standard deviation of the monthly total NEE of the 40 gap-filled time series.

270
The uncertainties in the annual total NEE are estimated by comparing the annual NEE obtained by each ML algorithm with their ensemble means (Fig. 6) trials for the annual NEE by ANN range from 2.5 to 4.3 g C m -2 , the mean standard deviations by the RF vary from 3.4 to 7.1 g C m -2 , and the mean standard deviations by the KNN and SVM vary from 5.1 to 13.9 g C m -2 . The differences between the 275 ANN and RF are within ±8.4 g C m -2 ; and the differences between A and DN are less than 1.7 and 4.5 g C m -2 for the ANN and RF, respectively. The overall uncertainties in the annual NEE caused by the ANN and RF are usually less than 15.5 g C m -2 , while the uncertainties can be as large as 27.2 g C m -2 if including the KNN and SVM. Therefore, it is recommended to use the ensemble mean of the ANN and RF as the best estimate of the annual NEE at the semi-arid sagebrush site. The annual mean NEE by the ANN and RF is -15.4±4.7, -50.0±9.1, -31.4±7.2, and -40.3±8.7 g C m -2 for 2016, 2017, 2018, and 280 2019, respectively. In addition, the annual total NEE by the MDS is about 5.6~15.6 g C m -2 larger than that by the ANN and RF.

Figure 6
Comparison of the annual mean NEE from the ML algorithms and their ensemble mean. The subscript A denotes that daytime and nighttime NEE data are gap-filled together using the ML algorithms trained with all the data, and DN denotes that the daytime and 285 nighttime data are gap-filled separately and then combined to determine the monthly total values.

A two-layer RF based gap-filling framework for extremely long gaps
We then propose and test a two-layer RF-based gap-filling framework (RFs) because 1) the performance of the ML algorithms is only reliable for short to medium gaps; 2) there is no obvious difference in monthly and annual NEE between training the model using all the data or the separated daytime and nighttime data; and 3) the RF is more time efficient than 290 the ANN. The procedures include: 1) train the RF model using the half-hourly data and fill the NEE gaps shorter than 7 days; 2) calculate daily means of the input variables and the partially filled NEE data; and 3) train the RF model using the daily data and fill the gaps in the daily NEE data. Following the same procedures, the performance of this framework in filling the gaps in the daytime and nighttime daily means are also accessed. Using a 10-fold cross-validation repeated ten times, the R 2 of the second layer RF model is 0.78, 0.85, and 0.77 for all the data, the daytime data, and the nighttime data, respectively. 295 The mean BE for this framework is 0.04 g C m -2 day -1 , which is slightly smaller than that of the original RF with the halfhourly NEE data (0.06 g C m -2 day -1 ).
As shown in Figs. 5 and 6, the monthly and annual NEE by the RFs are quite close to those by the RF. For A (i.e., trained by all the data), the difference in the annual mean NEE ranges from 0.0 to 2.2 g C m -2 , whereas for DN (i.e., trained by the separated daytime and nighttime data), the difference varies from 0.8 to 3.3 g C m -2 . This test suggests that it is not necessary 300 to fill all the gaps in the half-hourly NEE data if the focus is on assessing the uncertainties in cumulative annual mean NEE and interannual variability. In addition, the performance of the different methods is quite consistent when filling short to medium gaps (e.g., < 7 days), and thus using the ensemble mean of different methods as the input of the second RF layer has the potential to lower the uncertainties in the gap-filled data.

Conclusions
The performance of the four ML algorithms in filling the NEE data gaps is evaluated at a semi-arid sagebrush ecosystem site. Due to the relatively small range of NEE variations, the overall performance of these gap-filling algorithms at this site is lower than that at other forest sites, but comparable to that at other grassland sites. The RF algorithm outperforms the other algorithms in terms of the overall performance. It is not necessary to train the model separately for daytime and nighttime 310 data when using the ANN and RF algorithms. The uncertainties in the monthly and annual NEE due to the gap-filling approaches are evaluated by the standard deviations of monthly NEE of multiple trials for each method and also accessed by the difference in the monthly NEE by the methods. With the ANN and RF, the uncertainties in annual NEE are usually within 16 g C m -2 at this semi-arid sagebrush site. Extremely long gaps in half-hourly NEE data due to power failures cannot be confidently filled by either of the methods because of the high uncertainties in R 2 and RRMSE, and thus we propose and 315 test a two-layer RF based gap-filling framework. With this framework, the improvement in model performance is significant, especially for the nighttime data. Therefore, it is recommended that the two-layer RF based framework should be used if there are extremely long gaps existed in the NEE dataset and if there is a need to investigate its annual and interannual variability.

Code availability
The code used in this study are available from J.Y. and J.H. upon request.

Data availability
The data used in this study are deposited in a public domain repository at https://doi.org/10.6084/m9.figshare.14747952.