Predicting wildfire burned area in South Central US using integrated machine learning techniques

. Occurrences of devastating wildfires have been on the rise in the United States for the past decades. While the environmental controls, including weather, climate, and fuels, are known to play important roles in controlling wildfires, the interrelationships between fires and the environmental controls are highly complex and may not be well represented by traditional parametric regressions. Here we develop a model integrating multiple machine learning algorithms to predict gridded monthly wildfire burned area during 2002-2015 over the South Central United States and identify the relative 10 importance of the environmental drivers on the burned area for both the winter-spring and summer fire seasons of that region. The developed model is able to alleviate the issue of unevenly-distributed burned area data and achieve a cross-validation (CV) R 2 value of 0.42 and 0.40 for the two fire seasons. For the total burned area over the study domain, the model can explain 50% and 79% of interannual total burned area for the winter-spring and summer fire season, respectively. The prediction model ranks relative humidity (RH) anomalies and preceding months’ drought severity as the top two most important predictors on 15 the gridded burned area for both fire seasons. Sensitivity experiments with the model show that the effect of climate change represented by a group of climate-anomaly variables contributes the most to the burned area for both fire seasons. Antecedent fuel amount and conditions are found to outweigh weather effects for the burned area in the winter-spring fire season, while the current-month fire weather is more important for the summer fire season likely due to the controlling effect of weather on fuel moisture in this season. This developed model allows us to predict gridded burned area and to access specific fire 20 management strategies for different fire mechanisms in the two seasons. to model the monthly burned area for the phytoclimatic zones in Spain. Amatulli et al. (2013) estimated the monthly burned area in five countries in Europe using two machine learning 50 approaches, including Random Forest (RF) and MARS. In these studies, the machine learning methods were used to estimate total burned area aggregated over a large-scale domain, e.g. on an ecoregion or a country scale. However, fewer studies have explored the utility of machine-learning methods in resolving the within-domain and grid-level relationships between fires and the environmental drivers. A particular challenge in predicting burned area of fires at the grid level across a broad region relates to the uneven distribution of burned area. The vast majority of wildlands are not burned and thus the relative portion of 55 burned wildlands is extremely low. Among the burned wildlands, the majority of fires are small ones but the total acreage burned is usually contributed by only a few large fires. For example, Steel et al. (2015) showed that for fires in California, small fires (< 25 ha each) contributed to 87% of the total number of grids burned but only 17% of the total burned area. By contrast, large fires (> 150 ha each) accounted for only 3% of the total number of burned grids but made up 64% of the total burned area. Thus, at the grid level the majority class is non-burn wildlands, or wildlands of small fires if considering only 60 burned wildlands alone, while the minority class is large fires. As most data-driven regression algorithms, parametric or non-parametric, would favor the majority class, large fires will be underpredicted for grid-level predictions.


Introduction
Wildfire is an important process maintaining the balance of terrestrial ecosystems. Wildfire occurrence is controlled by a complex interaction among fuel, weather, and climate (Bowman et al., 2009;Pausas and Keeley, 2009). In recent decades, many regions of the world have seen increasing frequency and intensity of wildfires, possibly connected to changes in regional 25 climate (Barbero et al., 2015;Westerling et al., 2006;Westerling, 2016). More intense and more frequent wildfire activities not only heighten ecosystem vulnerability but also cause poor air quality (Jaffe et al., 2008;Pellegrini et al., 2017;Wang et al., 2018;Yue et al., 2015). Thus, it is imperative to understand how wildfires would respond to changes in environmental factors in a warming climate. Previous studies revealed the importance of several environmental factors on wildfires. Fuel availability and 30 composition across regions can affect fire developments such as fire likelihood and spread efficiency (Nunes et al., 2005;Parks et al., 2012). Weather influences fuel moisture through changing precipitation and humidity and controls fire spread through winds. Long-term climate change can alter both fuel and weather conditions, for example by changing vegetation distributions and the frequency of fire-favorable atmospheric conditions Keyser and Westerling, 2017;Morgan et al., 2008), therefore changing fire regimes. Past studies also highlighted that the complex interplay between fuel, weather, 35 climate, and wildfires can change by spatial scale, fire size, region, and season. For instance, the relationships between fire activity and the environmental controls can exhibit complex nonlinearities across the spatial scale gradient (Peters et al., 2004).
Fuel and topography mainly regulate fires at a local scale, while weather and climate control fires at a broad spatial scale (Parks et al., 2012). In terms of fire size, it was found that the major controlling factors could shift from fuel and topography to weather as fire size increases in boreal forests (Liu et al., 2013;Fang et al., 2015). For the western Mediterranean Basin 40 where is characterized by large land heterogeneity, fuel influences can outweigh climate-weather influences on large fires (Fernandes et al., 2016). Therefore, it is challenging to examine the relative importance of the environmental drivers on wildfires due to the complex interrelationships among them.
One common method to explain the relationships between fire regimes (e.g. fire sizes or fire occurrences) and environmental factors is regression. This method is also used to evaluate the relative importance of different environmental 45 controls (Littell et al., 2009;Slocum et al., 2010;Parisien et al., 2011;Yue et al., 2013;Liu & Wimberly, 2015;Fernandes et al., 2016). Among a wide range of regression techniques used, non-parametric machine learning algorithms have emerged as an important tool to predict wildfires because they rely on fewer pre-assumptions about the data. Bedia et al. (2013) used nonparametric multivariate adaptive regression splines (MARS) to model the monthly burned area for the phytoclimatic zones in Spain. Amatulli et al. (2013) estimated the monthly burned area in five countries in Europe using two machine learning 50 approaches, including Random Forest (RF) and MARS. In these studies, the machine learning methods were used to estimate total burned area aggregated over a large-scale domain, e.g. on an ecoregion or a country scale. However, fewer studies have explored the utility of machine-learning methods in resolving the within-domain and grid-level relationships between fires and the environmental drivers. A particular challenge in predicting burned area of fires at the grid level across a broad region relates to the uneven distribution of burned area. The vast majority of wildlands are not burned and thus the relative portion of 55 burned wildlands is extremely low. Among the burned wildlands, the majority of fires are small ones but the total acreage burned is usually contributed by only a few large fires. For example, Steel et al. (2015) showed that for fires in California, small fires (< 25 ha each) contributed to 87% of the total number of grids burned but only 17% of the total burned area. By contrast, large fires (> 150 ha each) accounted for only 3% of the total number of burned grids but made up 64% of the total burned area. Thus, at the grid level the majority class is non-burn wildlands, or wildlands of small fires if considering only 60 burned wildlands alone, while the minority class is large fires. As most data-driven regression algorithms, parametric or nonparametric, would favor the majority class, large fires will be underpredicted for grid-level predictions. In this study, we develop a model integrating multiple machine learning techniques to predict wildfire burned area at the grid level over South Central United States (US) and to identify the relative importance of the environmental drivers on the burned area for that region. The integrated machine learning model aims at mitigating the problem of uneven burned area 65 and improving the accuracy of predicting wildfire burned area at a grid-scale of 50 km x 50 km. While the 50 km-resolution is still coarse to capture individual fires, it is comparable to the resolution of historical and future climate data generated by most climate models (~10 km to 100 km), thus allowing for a practical integration with climate models for potential applications of studying climate-driven changes of wildfire behaviors. The South Central US, encompassing four states --Texas, Oklahoma, Louisiana, and Arkansas -has experienced periodically large wildfires in recent years, such as the 2011 70 Texas fires. This region is projected to experience the highest risk of wildfires in 2031-2050 across the continental United States Fann et al., 2018). We chose the vegetation-rich thus fire-prone part of the South Central US, as shown by the red box in Figure 1. The study period is from 2002 to 2015. For each year, we predict gridded wildfire burned area at the monthly scale for the typical bimodal wildfire seasons over the region: the winter-spring fire season from January to April and summer fire season from July to September (Zhang et al., 2014). 75 The rest of the paper is organized as follows: Section 2 describes the developed model and introduces data incorporated into the model. Section 3 presents the procedures of model validation and evaluation. In section 4, we analyze the relative importance of individual variables and the environmental controls at different spatial scales.

Model description 80
One major challenge in wildfire prediction is the highly uneven distribution of burned area. For the study region (red box in Figure 1), grids without any fire occurrence or with only small fires (< 10 ha) in combination take up 70% of the total number of the grids but correspond to only 1% of the total burned area. By contrast, grids with large burned area (>100 ha) account for only 9% of the total number of grids but make up 88% of the total burned area. For uneven data like wildfire burned area, standard machine learning methods usually have a bias in favor of the majority class (i.e. non-burned or small 85 fires), leading to low prediction accuracy of fire burned area (Krawczyk, 2016). To alleviate the bias toward large fires, we developed a model consisting of multiple steps that address the uneven data issue. Figure 2 demonstrates the structures and processes of our model, which has four steps consisting of three machine learning algorithms. First, for each data grid, given its environmental conditions, we use the quantile regression forest (QRF) to predict a distribution of burned area at the targeted quantiles (in this step the quantiles are chosen at {0.45, 0.55, 0.65, 0.85, 90 0.95, 0.99}). Second, we predict if a grid is a non-burn grid or not using the logistic regression model and a variety of environmental factors. Third, for the grids that are predicted to burn, instead of predicting burned area directly, we predict the relative position of the burned area within the training set using a random forest (RF) model, which is presented by quantiles.
In the last step we divided burned area data into six even-sized sub-groups according to the predicted quantiles. The six sub- groups correspond to the six quantiles predicted in the first step, given the six quantiles are the middle points between the 95 lower and upper bounds for each sub-group, except the top three largest quantiles. For example, in the subgroup (0.39, 0.49), the corresponding quantile is 0.45. For the grids in the subgroups, they are assigned to the value at corresponding quantile as determined by the predicted distribution generated in the first step. The values represent the predicted burned areas.
Our approach alleviates the unevenness data issue for two reasons. First, the majority of zero-burn grids are first separated (i.e. step 2). Second, for the grids predicted to burn, we predict the relative position (i.e. quantiles) of the burned area 100 based on the training set. The distribution of quantiles is less skewed compared to the burned area distribution. Thus, the unevenness of the burned area is less severe when predicting the relative position than predicting the burned area directly. We assume the probability distribution of predicted burned area for most grids is similar to those in the training set but for the grids with large burned area their probability distributions will be more right-shifted, which allows us to transfer the predicted relative position to the predicted burned area of a grid. Specifics of the machine learning algorithms and technical details of 105 the prediction model are described below.

Random forest regression
Random forest (RF) is an ensemble-learning algorithm built on decision trees. Each tree is built using the best split for each node among a subset of predictors randomly selected at the node (Liaw and Wiener, 2002). The best split criterion is 110 based on selecting the variables at the nodes with lowest Gini Index (GI), which is defined as GI ( " ( % ) ) = 1- 2 , where ( " ( % ), ) is the proportion of samples with the value xi belonging to leave j as node t. Two parameters can be adjusted to optimize the RF model, including the number of trees grown (ntree) and the number of predictors sampled for splitting at each node (mtry). The RF regression model would first draw ntree bootstrap samples from the original dataset. For each sample, at each node of a tree, mtry predictors are randomly chosen from all predictors and then the best split 115 from among the predictors is determined at each node according to GI. In this study, we had ntree of 1200 and mtry of 8 for the winter-spring fire season and ntree of 1500 and mtry of 7 for summer fire season to obtain the best prediction accuracy. The predicted value of an observation is the average of the observed values of the variable response belongs to the leaf of ntree trees.
Here, we used RF model to predict quantiles of burned area for the grids that are predicted to burn.
After all the predicted-burn grids obtain their predicted quantiles of burned area by RF, the whole dataset is divided 120 into six subsets according to their predicted quantiles: {(0.39,0.49), (0.50,0.59), (0.60,0.69), (0.70,0.79), (0.80, 0.89), (>=0.90)}. The grids within each subset will be assigned with the burned area predicted from quantile regression forest (QRF; described in 2.1.2) for a chosen quantile that lies at the middle of the quantile range, which is according to the values of the quantiles at 0.45, 0.55, 0.65, 0.85, 0.95, 0.99, respectively, for the quantile subsets described above. Note that the predicted burn area quantiles belong to the subsets of (0.70, 0.79), (0.80, 0.89), and (>=0.90) are assigned to values of quantiles that are 125 larger than their upper bounds, which is based on the assumption that grids with larger burned area will have more right-shifted burned area distribution than the distributions of the training set. The benefit of applying the RF model is that it can provide the variable importance that measures the strength of individual predictors. The variable importance is measured by the increase in MSE (%IncMSE) and the increase in node purities (IncNodePurity). The %IncMSE is calculated by comparing the mean square error with and without permuting 130 variables for each tree, and the variables with greater values of %IncMSE are more important. As for the IncNodePurity, at each split, we can derive the changes of residual sum of square (RSS) before and after the split, and the final IncNodePurity of a variable is summed over the RSS of all the splits including the variable over all trees. Thus, a larger IncNodePurity represents higher variable importance. 135

Quantile regression forests
Quantile regression forests (QRF) are an extension of the RF. QRF develops trees in the same way as RF, but for the final leaf of each tree, instead of calculating the mean of predicted values, the QRF estimates the conditional distribution of a target variable. The conditional distribution is calculated by averaging the conditional distributions from all the trees and the predictive quantiles are derived from the final empirical distribution function. Here we chose to predict quantiles at 0.45, 0.55, 140 0.65, 0.75, 0.85, 0.95, and 0.99. These quantiles were selected because they can represent the full spectrum of fire sizes ranging from small to extremely large ones. The quantiles less than 0.45 are typically zero-burn while the quantile of 0.45 is the lowest quantile that can possibly record both zero-burn and very small burned area for each grid.

Logistic regression model 145
Logistic regression is used to estimate the probability of fire occurrences in a grid cell by the statistical relationships between fire occurrences and the predictor variables. Logistic regression is defined as % = .
.12 345 and % = 8 + . %. + ; %; + ⋯ + = %= , where Pi represents the probability of an occurrence of wildfire in a grid cell i; hi is the linear combination of the predictor variables weighted by their regression coefficients (b); xij is the value of the predictor variable j of the grid i, and 8 is the constant. The logit function can be expressed as log ( > .?> ) = % @ , where % @ is the vector of the predictor variables 150 and b is the vector of the parameters. Values of P larger than 0.4 are considered to be a fire occurrence and those with equal to or less than 0.4 are interpreted as non-occurrence of wildfires. If a grid is classified not to burn, the predicted burned area would be zero and that grid will not be processed further. On the other hand, if a grid is classified to burn, it would be analyzed by the RF model to predict the burned area quantiles.

Wildfire burned area
We chose wildfire burned area as our target variable, as burned area is a widely-used parameter for quantitative assessment of fire danger and fire impact (Amatulli et al., 2013;Balshi et al., 2009;Yue et al., 2013). Wildfire information over the study period (2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015) was obtained from the Fire Program Analysis Fire-Occurrence Database (FPA-FOD). The 160 FPA-FOD collects daily wildfire reports from federal, state, tribal, and local governments. The dataset includes wildfire burned area, fire location in longitude and latitude, and fire discovery date from 1992 to 2015 (Short, 2017). A known caveat of this database is that it does not include some small fires that occur on private lands and thus our model will not be able to predict those fires as such information is not in the training dataset.
The FPA-FOD wildfire data is point data of daily time step. As the prediction model deals with monthly total burned 165 area at a 50 km spatial resolution, we aggregated the daily point burned area into 0.5°× 0.5° grids and summed the burned area in each grid by month. The resulting dataset of monthly burned area has nearly 70% of the grids with burned area less than 10 ha or non-burned. To reduce skewness and improve data symmetry, we applied the log transformation function ln(x+1), where x is the gridded monthly total burned area. The log-transformed burned area was the target variable by our model.

Predictor variables 170
Based on previously published studies, we collected a number of predictor variables that are thought to influence wildfire burned area and grouped them into four categories of environmental controls (Table 1): weather, climate, fuel, and fixed-geospatial variables. We chose ten weather variables to represent the fire-weather conditions of month t, including monthly mean and maximum temperature, monthly mean zonal (U) and meridional (V) components of wind at 10 m, monthly mean of daily precipitation and monthly total accumulated precipitation, monthly mean and minimum relative humidity (RH), 175 number of consecutive days without rainfall in a month, and Standard Precipitation and Evaporation index (SPEI). Besides current month's weather, weather conditions in the preceding months also have impacts on fire development. For example, increasing precipitation in the preceding months can promote biomass growth and can lead to widespread of larger wildfires (Fréjaville and Curt, 2017;Littell et al., 2009). To consider such lagged effects, for a given month t, we calculated the average of chosen weather variables from the months of t-1 to t-12. We then checked if the variables have correlation coefficients (r) 180 larger than 0.5 with fire burned area of month t but not are strongly correlated with the same variables of month t (r < 0.5).
The antecedent variables that pass this criterion for the winter-spring fire season are the monthly mean of daily precipitation of months of t-1 and the average SPEI for the months of t-1 to t-6. For the summer fire season, the selected variables are the average of monthly mean temperature for months of t-1 and t-2, monthly mean of daily precipitation for months t-1 to t-3, and SPEI for months t-1 to t-3. 185 Climate variables include climate anomalies of monthly mean temperature, monthly mean of daily precipitation, monthly mean relative humidity, monthly maximum temperature, monthly minimum relative humidity, and monthly total accumulated precipitation. Climate anomalies indicate the effects of changing climate on burned area, by presenting differences between monthly mean of the meteorological variables and their long-term average over 1979-2000. In addition to climate anomaly, we also included the 22-years (1979-2000) means and standard deviations of monthly mean temperature, 190 monthly mean of daily accumulated precipitation, monthly maximum temperature, and monthly total accumulated precipitation. The long-term average of meteorology characterizes the spatial distribution of vegetation, and the standard deviation suggests climate variability that can further affect fire occurrence and size (Keyser and Westerling, 2017). Note that climate normals and standard deviations were considered as fix-geospatial variables in this study because their values do not vary with time. 195 To estimate the fuel effect on burned area, we included monthly mean of Leaf Area Index (LAI), sum of neighboring LAI, and soil moisture. The LAI is the ratio of the total one-sided area of green leaf area per unit ground surface area, which has been widely used to describe the structural property of a plant canopy. Additionally, LAI is correlated with important metrics of canopy fuel loads, such as canopy bulk density (Keane et al., 2005;Steele-Feldman et al., 2006). Besides local LAI values, to capture the effects of spatial autocorrelations, we consider each grid cell as the center of a 3-by-3 grid matrix and 200 compute the summation of the LAI from the center grid's eight neighboring grids. This sum is referred to as the 'sum of neighboring LAI' and included as a predictor variable. Fuel moisture is a critical property for evaluating fire danger, but there is limited accessible data. It has been known that soil moisture is strongly correlated with fuel moisture, thus soil moisture can be approximated as an indicator of fuel moisture (Krueger et al., 2016). Here, we used the monthly surface soil moisture (0-10 cm) to represent the influence of fuel moisture. Considering the lagged effect of fuel buildup in the preceding months, we 205 chose the averages of LAI and sum of neighboring LAI for the months of t-5 for the winter-spring fire season, as they passed the same criterion for selecting antecedent weather variables described before. No antecedent fuel variables were included in the model of summer fire season because they did not pass the selection criteria.
Lastly, three fixed-geospatial variables were used as predictors, including land cover types, ecoregion types, and population. Land cover types and ecoregion types were chosen to capture the effects of land use and ecosystem similarity on 210 fire burned area. Population density data in the year 2010 was used to estimate the influence of humans on wildfires (i.e., high suppression efficiency close to high population density area). All variables used in the model, their sources, spatial and temporal resolutions, and categories are listed in Table 1. Note that all variables were regridded to the same spatial resolution of 0.5°× 0.5°.

Validation method
The developed model incorporates weather, climate, fuel, and fixed-geospatial variables to predict monthly burned area for each 0.5°× 0.5° grid cell over the South Central US for the two fire seasons during each year of 2002-2015. We applied a 10-fold cross-validation (CV) technique to evaluate the model performance and to avoid overfitting. The entire dataset was 220 randomly divided into 10 equal-sized splits. For each round of CV, the model was trained with nine splits of the data and the trained model was then used to predict burned area at the remaining split. Classification of burned or non-burn grids was evaluated by the accuracy and the area under the curve (AUC) statistics. Burned area predictions were evaluated using statistical indicators such as the coefficient of determination (R 2 ), mean absolute error (MAE), and root mean squared error (RMSE) between the predicted and observed wildfire burned areas and the evaluation was done for the winter-spring fire 225 season and summer fire season separately. We also quantified the prediction performance by evaluating the model ability in reproducing temporal variation of burned area for each grid and spatial patterns of burned area across all the grids of the study domain. Details of calculating the spatial and temporal correlations are described in the Supporting Information.

Validation results 230
The validation results are presented at both the grid-scale and the large-domain scale. The large-domain scale is defined by our study domain (red box in Figure 1) with a horizontal scale of around 700 x 700 km 2 . At this large-domain scale, we will compare our model performance with prior studies which predicted total burned area of an ecoregion or a country. Table 2 shows the model performance at the grid-scale for the winter-spring fire season and summer fire season.  Table S1.
To our best knowledge, our model outperforms previously published models in predicting gridded burned area and the better performance results from the advantages of the approach. First, we predicted gridded burned area at 0.5° x 0.5° grid cells and on a monthly basis, while most prior studies predicted monthly or annual total burned area on an ecoregion or a country scale. Only a few studies targeted burned area at the grid level but with a spatial resolution of 1° x 1° and a temporal 250 resolution of one year. One study used ocean climate indices to estimate annual gridded burned area and achieved correlation coefficient (r) around 0.55 over SUS (Chen et al., 2016). That study excluded zero-burn grids from the data and thus did not confront the severe unevenness issue. Compared with their results, our model performs better despite predicting at finer temporal and spatial scale, with r of 0.60 and 0.67 for the two fire seasons, respectively. Another study (Liu and Wimberly, 2015) obtained a higher correlation R 2 of around 0.76 between climate variables and burned area over the western US using 255 boosted regression trees, but their investigation included only extremely large fires (> 405 ha) and was at a 1° x 1° resolution and annual timestep. As discussed above, the spatial and temporal resolution of our model is finer than the prior studies and the model is capable of predicting burned area for all the grids within any large domain. Such setting makes it more practical for our prediction model to integrate with climate models. Through the integration of multiple machine learning algorithms, we are able to address the unevenness issue of burned area, which significantly improves the model ability to predict gridded 260 burned area across the whole study domain.
We also evaluated the model performance in reproducing the spatial patterns of the burned area. We first assessed the model ability reproducing the spatial patterns of climatological burned area for the two fire seasons ( Figure S1). The model reproduces the 14-year mean burned area, with a correlation coefficient between mean observed and predicted burned area of 0.82 and 0.80 for the winter-spring and summer fire season, respectively. For the whole study domain, the spatial patterns of 265 predicted burned area (including zero burns) have correlation coefficients higher than 0.5 with the observed pattern for more than 60% of the study months. It is noteworthy that such performance is achieved without introducing any coordinate variables like longitude or latitude as predictors. This indicates the model we developed is not hardwired to geographical features of the study domain and thus can be easily adopted for other regions, a unique advantage allowing for practical incorporation into climate models. Temporally, the predicted burned area time series at 70% of the grids (combined the two fire seasons) has a 270 correlation higher than 0.5 with the observed burned area ( Figure S2). These results demonstrate the model has the certain ability in predicting both spatial and temporal variation of the burned area at the grid-scale across the study domain.
Besides the grid-scale statistics, we evaluated the model performance at the large-domain scale by adding up all the grid-level predictions to obtain the total burned area of the study domain by month. Figure 4 shows the time series of the predicted total burned area over South Central US in comparison to the observed ones for the two fire seasons. The domain-275 scale prediction explains 50% and 79% of the month-to-month variability of burned area for winter-spring and summer fire season, respectively. MAE of the monthly burned area across the whole domain is 251.3 km 2 for the winter-spring fire season and 100.7 km 2 for the summer fire season. As listed in Table S2, the prediction accuracy of our model in terms of R 2 is equivalent to or better than most of the published studies on the ecoregion scale or country scale.

Individual variable importance at grid scale
Before discussing the environmental controls on fire burned area across our study domain, it is useful to understand the dominant factors controlling the burned area at the grid-scale. One advantage of the random forest approach is that it provides the variable importance metrics that can measure the power of predictor variables in the overall prediction. We show 285 in Figure 5 the top 14 predictors ranked by the variable importance metrics to illustrate the intricate relationships among fires, weather, climate, and fuel. As Figure 5 shows, for both fire seasons, RH anomaly is the predictor variable with the highest rank in predicting burned area at the grid-scale. This finding broadly supports the work of other studies that highlighted the importance of RH on burned area (Riley et al., 2013;Ruthrof et al., 2016). Yet our approach particularly reveals the response of fire burned area to changes in RH anomaly, which is a climate variable as opposed to weather variable. RH anomaly indicates 290 the changes of current RH relative to its historical climatology and it ranks higher than current RH in the variable importance metrics. While RH anomaly is identified as the top factor in both fire seasons, the two fire seasons exhibit a notable difference in that temperature anomaly and maximum temperature anomaly are included in the top 14 variables only for the summer fire season. In addition, RH anomaly and temperature anomaly have a stronger negative correlation in the summer fire season (r= -0.7) than in the spring fire season (r= -0.2). This highlights the importance of combined effects of RH and temperature on 295 burned area during summer.
For the winter-spring fire season specifically, the long-term average of precipitation (apcp_avg and asum avg) is identified as the key climate variable (Figure 5a). Precipitation normals indicate the amount of available moisture that could affect fuel distributions and tendency of fire activities (Westerling and Bryant 2008;Keyer et al 2017). The average of SPEI for month t-4 is the highest-ranked weather variables, exceeding the current-month SPEI in terms of variable importance, and 300 the 3-5 months' time lag coincidentally corresponds to the interval between the two fire seasons. Our results imply that prefire-season drought condition is influential on fire burned area, and it is in agreement with prior studies (Scott and Burgan., 2005; Riley et al., 2013;Turco et al., 2017). Interestingly, the average of LAI and sum of neighboring LAI for months of t-6 are the only top fuel variables being selected in the winter-spring fire season (Figure 5), indicating the importance of antecedent fuel abundance. From the ranking of variable importance, we can infer that antecedent fuel abundance together with pre-fire-305 season drought conditions together determines the amount of dry fuel, which likely exerts the primary controls of the burned area during the winter-spring fire season.
For the summer fire season, temperature anomalies (temp_anomaly and tmax_anomaly) are shown as the major climate factors influencing fire burned area (Figure 5b). This indicates temperature variations under a changing climate will have significant effects on the burned area in that season. Top-ranked weather variables include the average of monthly 310 accumulated precipitation and SPEI of preceding month t-1, t-2, and t-3. These variables are known to affect fire burned area through changing fuel moisture. Consistently, fuel moisture as represented by soil moisture is identified as the only fuel variable among the top 14 variables in the summer fire season. These results suggest that fuel drying driven by both increasing temperature and pre-fire season drought conditions is the pivotal process determining fire burned area in the summer fire season. Similar to our findings, rising summer temperature under climate change was found to cause fast fuel dryness, which 315 increased fire activity in the western US (Williams et al., 2013;Holden et al., 2018).
Overall, the analysis of variable importance reveals some important differences in the fire mechanisms between the two fire seasons and shows semi-quantitatively that drought conditions in the preceding months (3-5 months for the spring fire season and 1-3 months for the summer fire season) may be more important than current-month conditions on the burned area. Furthermore, we demonstrate that the effect of climate change on fire burned area is consequential, even more influential than 320 current fire weather. This aspect has not been well documented or quantified due to a lack of long-term observations of wildfires over South Central US.

Method to decompose the relative influence of environmental controls
The variable importance metrics presented in the previous section reveal the relative importance of individual 325 predictors. Recall these predictors were purposely selected from four broadly defined categories of environmental controls on wildfire burned area, namely climate, weather, fuel, and fixed-geospatial. As variables within the same category may work in conjunction to create conditions conducive for wildfires, in this section we examine the composite influence of predictors by category and identify which category would exert the largest contribution controlling the variability of wildfire burned area.
To do so, we designed a set of sensitivity experiments using the prediction model developed to decompose the effect of 330 different environmental controls across our study domain by perturbing variables belonging to one category at a time. The environmental control categories to be perturbed include weather, climate, and fuel; the variables of each category are listed in Table 1. The fix-geospatial factors remain unchanged in each sensitivity experiment. In addition, as the relative importance of environmental controls can vary by spatial scale (Parks et al., 2012), the results were analyzed at both the grid-scale and large-domain scale. The large-domain scale here is defined as our study domain. For instance, to examine the influence of 335 weather, for each grid, we assigned the values of individual weather variables to their 15-year means while keeping the variation of other variables (hereafter refer to as the "weather-avg run"). The same procedure was applied to the variables in the climate and fuel category, resulting in the climate-avg run and fuel-avg run respectively. The original model with all the variables of each grid varying by time is called the full-model run. The gridded burned area predicted from each run is summed over all the grids across the study domain. The differences in resulting total burned area between the full-model run and 340 weather-avg run represent the impact of weather control (hereafter called "weather effect"), and the same procedure was applied to derive the climate effect and fuel effect on the burned area. We also conducted the fixed run, in which for each grid, its weather, climate anomaly, and fuel variables are all fixed to their long-term mean, and the predicted burned area from this run represents the influence of geospatial variables and climate normals on the burned area (hereafter named "fix effect").
Although the calculations of deriving the effect of a given environmental category are made by assuming linearity, the 345 machine-learning-based prediction model does not assume linearity. Thus, the summation of burned area prediction from the weather, climate, fuel, and fixed run is not necessarily equal to the burned area predicted by the full model. This difference is considered as the interaction effect among these environmental controls.
After deriving the effects of the environmental controls on the burned area, we then calculated such effects of environmental controls in the scaled absolute percentage. We first normalized the effect of an environmental control category 350 by the number of variables in that category because the numbers of variables are different by environmental control and the , where E is the influence of the environmental controls in burned area, N indicates the number of variables in the category, Nt is the total number of variables, and the subscript w, fu, c, fi, and, i represent weather, fuel, climate, fixed, and interaction, 360 respectively. Figure S3 shows the time series of the burned area contributed by different environmental controls for the two fire seasons. According to the results, weather, fuel, climate, and fixed effect all tend to increase burned area while the interaction 365 effect acts to reduce burned area, in particular for the large burn events (e.g. July 2011 in the summer fire season). Considering the number of variables in each environmental control category is different, it is better to compare the effects in the scaled absolute percentage that represents the average contribution of the variables in one environmental category (Section 4.2). Figure S4 presents the time series of environmental effects on the burned area in terms of the scaled absolute percentage. For both fire seasons, on average, the climate and fixed variables have larger contributions to the burned area than other types of 370 variables, but their relative importance varies by time. To further compare the mean effect of the environmental controls between the two fire seasons, we averaged the scaled absolute percentage of each environmental controls over the whole study periods and results are shown in Table S3 and Figure 6. From Figure 6, one can clearly see the climate variable on average has a larger contribution to burned area than other variables for both fire seasons, with the mean scaled absolute contribution of 33% and 35% for the winter-spring and summer fire season, respectively. The result suggests changing climate is a significant 375 factor that can explain wildfire burned area over our study domain. In accordance with our results, previous studies also demonstrated the significant contribution of changing climate to the total burned area of ecoregions or countries (Littell et al., 2009;Swetnam and Anderson, 2008;Yue et al., 2013). For example, increasing temperature and earlier spring snowmelt due to climate change are highly associated with increased large wildfire activity in the western US (Westerling et al., 2006).

Relative importance of environmental controls at large scale
Another study showed that fire-year climate variables such as average spring temperature are predictive variables that could 380 improve the predicting probability of high severity fires in the western US (Keyser and Westerling, 2017). Additionally, the fixed effect that comprises the geospatial variables and past climatology is ranked as the second most important control ( Figure  6). This is consistent with the findings of Keyser et al. (2017), which revealed the importance of long-term climate normals in controlling large fire occurrences in the western US.
Comparing effects of the environmental controls between the two fire seasons, the fuel effect is significantly more 385 important in the winter-spring fire season, while weather and climate effects are more substantial in the summer fire season.
This can probably be explained by different characteristics of the two fire seasons. Biomass growth is relatively limited in the winter-spring fire season and the wildfire fuel in this season is often provided by vegetation accumulated since the last growing season. Thus, the effect of fuel (mainly from the preceding fuel amount) is comparatively dominant in the winter-spring fire season. On the other hand, vegetation growth is relatively sufficient during the summer growing fire season and fuel abundance 390 would not be a constraint of wildfires. Yet, fire weather that determines fuel moisture is a substantial factor in the summer fire season (Figure 6).
The above analysis focuses on the relative importance of the environmental controls at the large-domain scale. At the grid-scale, we calculated the average variable importance from RF in %IncMSE (section 2.1.1) of each category which represents the relative importance at the grid-scale and the results are presented in Table S4. Climate variables on average have 395 the largest importance in controlling burned area for the two fire seasons, with the mean %IncMSE of 12.09 and 19.18 for the winter-spring and summer fire season, respectively. This is consistent with the results on the large-domain scale. Fuel effect outweighs weather effect on the grid level in winter-spring fire season, while weather effect is more important in summer fire season, both consistent with analyses based on the large-scale domain (Table S4). However, the fixed effect estimated at the grid-scale is not as important as at the large-scale domain (Table S4) and this is partly due to the way these variables are coded 400 in the model. Fixed variables consist of past climatology and geospatial variables (i.e. land use, ecoregion, and population).
The geospatial variables except the population were encoded as categorical variables at the grid level. For example, forest ecoregion is coded as 0 or 1 for a given grid, with 0 representing non-forest while 1 forest. For such encoding method, each categorical variable (e.g. forest v.s. non-forest) tends to have a smaller relative importance score, compared to the relative importance score of the entire variable encoded by continuous values (e.g. forest, desert, prairie, and other ecoregion types in 405 a large domain). As RF measures the effect of a specific split on the improvement in model performance and aggregates the improvement of all the splits with a specific variable, the fragmented scores for each category are likely smaller than the scores reflecting all of the categories. Therefore, for the relative importance at the grid level measured by RF, the effect of a single geospatial variable such as a land type on the burned area is trivial. When we average the relative importance of all the fixed variables including many small scores, the resulting average importance becomes still a small value. 410

Conclusion
We presented a model integrating multiple machine learning methods to predict monthly burned area over South Central US at 0.5° x 0.5° grid cells. The developed model is able to alleviate the issue of unevenly-distributed burned area and consequently improves the model capability of predicting large burned area at a finer spatial and temporal scale. The predicted 415 burned area showed a good agreement with the observed burned area at both the grid and large-domain scale. At the grid-scale, the model achieves a CV-R 2 value of 0.42 and 0.40 for the winter-spring and summer fire season respectively. In terms of predicting the spatial patterns of the burned area, the model has the correlation coefficient exceeding 0.5 over 60% of the study months. Across the study domain, the temporal variation of predicted burned area has a correlation coefficient larger than 0.5 with the observed burned area over around 70% of the grids. At the large-domain scale, the prediction model can explain 50% 420 and 79% of interannual burned area for the winter-spring and summer fire season, respectively. The validation results demonstrate that the model has certain skills in predicting monthly burned area at both grid-scale and large-domain scale.
The individual variable importance was analyzed and discussed. For both fire seasons, RH anomaly followed by drought condition in the preceding months (3-5 months for the winter-spring fire season and 1-3 months for the summer fire seasons) are the two top variables in predicting burned area at the grid scale. For the winter-spring fire season specifically, the 425 average of LAI and sum of neighboring LAI for the previous six months are the only fuel variables being identified as top important variables. The finding suggests that the antecedent fuel abundance together with the pre-fire season drought conditions regulate the abundance of dry fuel, which is the primary control of fire burned area during the winter-spring seasons.
For the summer fire season, temperature anomalies, the average of monthly accumulated precipitation of the preceding three months, and fire season soil moisture are important variables in predicting burned area. Therefore, the fire burned area in the 430 summer fire season is controlled by rising temperature and pre-fire season drought that speeds up fuel drying. The model highlights the effect of changing climate on burned area as well as the common and different critical factors controlling burned area for the two fire seasons.
Besides the relative importance of individual predictors, we also analyzed the relative importance of four environmental categories -climate, weather, fuel, and fixed-geospatial -at both the grid-scale and large-domain scale. The 435 relative importance of these factors is generally consistent at the two scales. The climate variable on average has the largest contribution to burned area for both fire seasons, with the mean scaled absolute contribution of 33% and 35 % to the burned area at the large-domain scale for the winter-spring and summer fire season, respectively. For the winter-spring fire season, the fuel variable on average has larger importance compared to the weather variable; while for the summer fire season, the weather variable is more dominant than the fuel variable. The difference in the relative importance of the environmental 440 controls between the large-domain scale and grid scale mainly lies in the predominance of the fixed effect. The fixed effect is ranked as the second most important controls at the large-domain scale, but it is not as important at the grid-scale.
The presented study mainly utilizes environmental factors as input variables to build the prediction model and focuses on the effects of environmental controls on burned area. The factors examined herein are not exclusive. For example, we do not examine the effects of human factors on burned area, such as anthropogenic influences that affect wildfires through 445 ignition, suppression, or modifying fuel distribution (Syphard et al., 2007;Bowman et al., 2011;Mann et al., 2016). Future work is needed to better understand the role of human activity engaged with climate change and its implications for wildfire control.   Climate anomalies of monthly total precipitation asum_anomaly climate

Fire season one
The average of antecedent monthly mean of daily precipitation for the month of t-1 apcp_mean1m weather monthly 32 km North American Regional

Reanalysis (NARR)
The average of antecedent monthly mean of LAI and sum of neighboring LAI for the month of t-5 LAI_mean5m, convLAI_mean5 m fuel monthly 500 m MODerate resolution Imaging Spectroradiometer (MODIS)

Fire season two
The average of antecedent monthly mean of daily precipitation for the month of t-1, t-2, and t-3 apcp_mean1m weather monthly 32 km North American Regional

Reanalysis (NARR)
The average of antecedent monthly mean of temperature for the month of t-1 and t-2 temp_mean1m weather monthly 32 km North American Regional

Reanalysis (NARR)
The average of antecedent SPEI for the month of t-1, t-2, and t-3