Modelling Stand Variables of Beech Coppice Forest Using Spectral Sentinel-2A Data and the Machine Learning Approach

Background and Purpose: Coppice forests have a particular socio-economic and ecological role in forestry and environmental management. Their production sustainability and spatial stability become imperative for forestry sector as well as for local and global communities. Recently, integrated forest inventory and remotely sensed data analysed with non-parametrical statistical methods have enabled more detailed insight into forest structural characteristics. The aim of this research was to estimate forest attributes of beech coppice forest stands in the Sarajevo Canton through the integration of inventory and Sentinel S2A satellite data using machine learning methods. Materials and Methods: Basal area, mean stand diameter, growing stock and total volume data were determined from the forest inventory designed for represented stands of coppice forests. Spectral data were collected from bands of Sentinel S2A satellite image, vegetation indices (difference, normalized difference and ratio vegetation index) and biophysical variables (fraction of absorbed photosynthetically active radiation, leaf area index, fraction of vegetation cover, chlorophyll content in the leaf and canopy water content). Machine learning rule-based M5 model tree (M5P) and random forest (RF) methods were used for forest attribute estimation. Predictor subset selection was based on wrapping assuming M5P and RF learning schemes. Models were developed on training data subsets (402 sample plots) and evaluations were performed on validation data subsets (207 sample plots). Performance of the models was evaluated by the percentage of the root mean squared error over the mean value (rRMSE) and the square of the correlation coefficient between the observed and estimated stand variables. Results and Conclusions: Predictor subset selection resulted in a varied number of predictors for forest attributes and methods with their larger contribution in RF (between 8 and 11). Spectral biophysical variables dominated in subsets. The RF resulted in smaller errors for training sets for all attributes than M5P, while both methods delivered very high errors for validation sets (rRMSE above 50%). The lowest rRMSE of 50% was obtained for stand basal area. The observed variability explained by the M5P and RF models in training subsets was about 30% and 95% respectively, but those values were lower in test subsets (below 12%) but still significant. Differences of the sample and modelled forest attribute means were not significant, while modelled variability for all forest attributes was significantly lower (p<0.01). It seems that additional information is needed to increase prediction accuracy, so stand information (management classes, site class, soil type, canopy closure and others), new sampling strategy and new spectral products could be integrated and examined in further more complex modelling of forest attributes.


INTRODUCTION
Coppice forests have multiple roles related to their production in forest management, as well as their social, ecological and economic importance for local communities. Their contribution is recognized and emphasized in rural livelihoods, low-carbon bio-economy, in protective functions, sharing economy, provision and enrichment [1].
Several studies were conducted analyzing structure, functions, silvicultural measures and other aspects in coppice forests in Europe and the Balkan region [1][2][3][4][5][6][7][8]. Authors [3][4][5][6][7][8] from South Eastern Europe (SEE) concluded that degradation and inappropriate treatments in high forests in the 20 th century resulted in degradation and appearance of coppice forests on larger areas. Stajić et al. [3] described past and recent coppice forest management in some regions of SEE in relation to their characteristics. Višnjić et al. [8] investigated ecological and silvicultural characteristics of coppice forests in Bosnia and Herzegovina (B&H). In B&H coppice forests occupy around 23% of the forested area according to data from the second national forest inventory. Different silvicultural treatments (conversion, thinning, reforestation and others) for the improvement of their production and other forest functions were examined and analyzed [9,10]. Recent intensive studies of coppice forests were conducted in the Sarajevo Canton [11][12][13][14]. Balić [11] presented the research on productivity, structural characteristics and models of growth and increment of coppice beech forests based on forest inventory data using statistical parametrical approach in the Sarajevo Canton.
For management planning purposes it is important to estimate stand productivity variables (basal area, stand diameter, wood volume, growing stock and others) and their spatial distributions, especially where different management regimes are recommended. Therefore, apart from forest inventory data, forest management planning should consider all available information about the forest status and stand conditions. Available remote sensing data from different satellite programs compiled with forest inventory have been used as a source for additional research about forest characteristics since the middle of the 20 th century. Landsat and Sentinel satellite images have been used most frequently for forest type classification [15][16][17], as well as for the estimation of forest productivity attributes [18,19]. Rapid information technology development resulted in continuous improvements of remote sensing capabilities (satellite and aerial imagery, lidar), offering innovative possibilities of research on forest vegetation [19][20][21]. Then statistical classification and estimation methods supported with information technology development become more efficient and promising in spatial characterization of forest attributes on the forested area [22][23][24]. Recently, high forests and artificial stands were analyzed frequently using machine learning rule-based approach. Therefore research focus was re-directed on coppice forests where wide interest for further coppice forest characterization was obtained.
The aim of this paper is to evaluate beech coppice forest stand variable estimates based on machine learning rule-based methods: M5 model tree and random forest regression using inventory and Sentinel S2A spectral data.

Study Area
The study was conducted in the Sarajevo Canton (about 1277 km 2 ), which is bounded by the southern geographical latitudes 43°53' -43°47' and the eastern geographical longitudes 18°16'-18°27' in central Bosnia and Herzegovina ( Figure 1). Forest stands of state-owned beech (Fagus silvatica L.) coppice forests surrounding the capital city of Sarajevo were selected as study areas. The selected beech coppice stands are situated on plane and hilly positions at altitude range of 550 to 1700 meters, but mostly below 1000 meters (about 60%). About 80% of forest stands are situated on humid expositions with deeper and moist soils. More than 65% of forest stands are located on a position with an inclination above 20o, while less than 15% is on planes. The study area is influenced by moderate continental climate with subalpine character at higher altitudes.

Field Data
Field measurements were acquired for geo-referenced field plots located at the intersection of 200×200 m grid. Trees with diameter of the breast height of 7 cm were selected in circular plots with different radii based on the probability proportional to size [25]. The most important forest stand attributes including the basal area, stand mean diameter, total volume and growing stock were calculated and used in this research (Table 1). Tree volume for individual trees was calculated using regression models [26] and then scaled to a per unit area basis (m 3 ·ha -1 ). In this research 609 sample plots in 185 stands were used for modelling. Descriptive statistics of forest attributes and rank correlations with predictor variables were calculated for the sample dataset.

Sentinel S2A Data
One cloud-free Sentinel-2 scene acquired on 17 th October 2018 was used in this study. The spectral data were obtained from the Copernicus Open Access Hub [27] as Level-1C data with Top of Atmosphere (TOA) reflectance. Characteristics of the spectral bands of Sentinel-2 MSI (Multi-Spectral Instrument) sensor and subset of used bands are presented in Table 2 [28]. The atmospheric correction of Level-1C input data was performed using the Sen2Cor plug-in for Sentinel-2 Toolbox and SNAP software provided by ESA (version 6.0.0, Brockmann Consult, Geesthacht, Germany). Corrected data were resampled on 20 m resolution, and vegetation indices and biophysical variables were calculated.
Then, three spectral vegetation indices were calculated: difference vegetation index (DVI), ratio vegetation index (RVI) [29], and normalized difference vegetation index (NDVI) [30]. In addition, the biophysical variables were calculated in SNAP from its biophysical processor, which uses a neural network algorithm based on the PROSPECT+SAIL (PROSAIL) radiative transfer model [31]. Five biophysical variables were determined: fraction of absorbed photosynthetically active radiation (fapar), leaf area index (LAI), fraction of vegetation cover (FCOVER), chlorophyll content in the leaf (CHC), and canopy water content (CWC).

Machine Learning Algorithms
Machine learning approach refers to analytical model building automatically learning from data itself. Here two different machine learning-based rules algorithms for regression were applied: M5P and RF. M5P is a machine learning technique introduced as reconstruction of Quinlan's M5 algorithm for tree-based regression modelling [32]. It creates decision tree with linear regression function at the nodes using splitting criterion that minimizes the intrasubset variation. The RF regression model is an ensemble of tree predictors constructed from bootstrapping training data. For both algorithms parameters tuning is related to the number of regression trees and the number of features (explanatory variables). Here default rules for the number of trees in Weka software were applied [33]. Important influence on the results of the applied rule-based algorithms has the feature selection. Here the "wrapper method was used, which selects a set of features most suitable for a particular algorithm. Datasets were separated in reference (66%) and validation (33%) subsets randomly. Accuracy assessment was evaluated using the mean square error (MAE), root mean square error (RMSE) and relative RMSE (RMSE%) calculated using the following equations: where y i is observed forest attribute of the data i, ŷ i is estimated forest attribute of i, n is the number of validation data and y ⁻ is the mean of the observed forest attribute. Then, determination was used to examine relationships between observed and estimated values.
The finalized machine learning models were used to make predictions for measured and non-measured geopositions on pixel level in the study area. Input data were extracted from raster layers for each pixel geo-positioned on determined x and y coordinates.
Described method was applied for forest attributes estimates based on inventory and Sentinel S2A spectral data in similar studies [19, 22,23].

Correlations
Spearman's rank correlation between forest attributes, spectral data, vegetation indices, biophysical variables and altitude is shown in Table 3. All forest attributes are correlated significantly to the most auxiliary variables. All forest variables are correlated significantly to B2 (blue), B3 (green), B5, B6, B7 (three vegetation red edges), B8 (near infrared), B8A (narrow near-infrared band), all vegetation indices (DVI, NDVI and RVI) and a set of four biophysical variables (LAI, fapar, FCOVER and CWC). Shortwave infra-red bands B11 and B12 have low but significant correlation with the total volume and growing stock only.
Growing stock was correlated significantly to all auxiliary variables achieving highest correlation with vegetation red edge B6 (-0.24). All correlations were very low, pointing out to weak correlations in general. Astola et al. [18] reported higher correlations between V 0 , D g , BA in boreal broadleaved forests and Sentinel S2A digital numbers (-0.74, -0.75 and -0.69, respectively).

Feature Selection
Predictor selection based on wrapping method resulted in subsets presented in Figure 2. The number of selected predictors varied between four and ten per forest attributes. Vegetation indices and biophysical variables were selected more frequently then the original spectral data.
Original spectral bands participated in smaller numbers than in similar research related to regression tree modelling for boreal broadleaved forests [19].

Model Evaluation
The differences of sample and modelled forest attribute means were not significant, while modelled variability for all forest attributes was significantly lower (p<0.01). Model evaluations for reference and validation subsets are presented in Table 4. Relative RMSEs for M5P in reference sets ranged from 47.4% for G to 63.1% for GS, while values for RF varied from 17.9% to 23.6% for the same forest attributes respectively. Higher relative RMSEs for both algorithms were obtained for validations sets and ranged between 51% and 68% approximately. Higher relative RMSEs related to regression tree modelling were found for G, D g and V 0 in boreal broadleaved forests [19].
The RF performed better in a reference set for all forest attributes related to correlations between the observed and predicted values, while correlations in validation subsets were higher for M5P for all attributes (Figure 2).
The presence of systematic errors was obtained for both algorithms for all attributes in a consistent way. Figure  3 presents the relationships between the observed and predicted values in reference and validation sets for basal area that are similar for other forest attributes. The better performance of RF predictions is visible for RF in the reference set, while predictions in validation sets were biased and weak  for both algorithms in the range of observed basal area values. The over-fitting in reference sets was obtained in all models with weak adjustment in validation sets. The RF predictions on the measured location achieved reliable values, while surrounding pixel-based estimates deviated from ground truth. It seems that the chosen feature selection method and the algorithm's specifications express low performances in validation sets, so further research related to reliable estimates on the pixel-level is needed.

Mapping of Basal Area for Beech Coppice Stands
The mapping of forest attributes has become a contributing part for forest management on all forested areas [23,34]. Recent research of coppice forests [12][13][14]35] pointed out the importance of spatial distribution of forest production attributes for the planning of ameliorative measures (restoration, reforestation) especially in beech stands.
We found that the applied machine learning-based estimation mapping could give insight into spatial distribution of forest attributes with better preservation of ranges of ground and RF estimated values.
Here is a visualized spatial distribution modelled by RF for basal area as an example and two details with stands from Google Satellite (above) and estimated basal area (below) (Figure 4).
Estimate of spatial distribution of highly correlated forest attributes is consistent over the forested area. This consistency could contribute to the coppice forests' function analysis considering their productive and protective roles. Related to productivity, RF spatial estimates better indicate areas with low values of forest attributes pointing out to the adequacy of stand potentials usage. Also, machine learning-based estimates near stand boundaries indicate forest quantity coverage related to the preservation of forest soil and protection from erosion and drying. We found that these indications could contribute to forest planning considering management and silvicultural measures aiming to improve coppice forest quantity and quality potentials.      were better than M5P estimates (RMSE% below 63%). In both modelling processes over fitting in reference sets were obtained, while estimates achieved high relative RMSEs in validation sets. • The machine learning approach compiled with Sentinel S2A spectral data is promising for the estimation and mapping of spatial distribution of forest attributes in beech coppice stands. Further research is needed related to machine learning algorithm specifications, more intensive and representative ground sample, spatial correlations and other scientific and technical possibilities.