Choosing a Method for Biomass Model Error Verification

blocks requiring high reliability (Brown, 1997; IPCC, 2003; Chave, 2005, 2014; Basuki et al., 2009; Huy et al., 2016a,b,c).

For local and site-specific models, felling 100 to 166 sample trees (Picard et al., 2012; Dutcă et al., 2020) is appropriate and reliable for modeling.

For mixed-age forests, the number of sample trees to be cut should be proportional to the diameter distribution of the stand and to the density of the species (Basuki et al., 2009).

3.6.1.2 Choosing a method to evaluate biomass model errors

It is recommended to apply K-Fold cross-validation with K = 10, because with this method every observation is used both to fit the model and to calculate the error. The error estimate therefore represents all data across all research and data collection areas. In addition, among the three cross-validation methods compared, K-Fold with K = 10 gave stable errors and is convenient to implement in R.
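The following minimal sketch illustrates why every observation contributes to both fitting and error calculation under 10-fold cross-validation. The data frame `dat`, its `id` column and the sample size are hypothetical and serve only as an illustration.

```r
# Minimal sketch of 10-fold assignment on a hypothetical data frame.
set.seed(1)                                    # reproducible shuffle (assumption)
n <- 100                                       # hypothetical number of sample trees
dat <- data.frame(id = seq_len(n))
dat <- dat[sample(nrow(dat)), , drop = FALSE]  # shuffle the rows
folds <- cut(seq_len(nrow(dat)), breaks = 10, labels = FALSE)

used_for_validation <- integer(0)
for (i in 1:10) {
  test_idx <- which(folds == i)
  train <- dat[-test_idx, , drop = FALSE]      # rows that would fit the model
  valid <- dat[test_idx, , drop = FALSE]       # rows that would give the error
  used_for_validation <- c(used_for_validation, valid$id)
}

# Each tree is validated exactly once and used for fitting in the other 9 folds
all_once <- all(sort(used_for_validation) == seq_len(n))
fold_sizes <- table(folds)
print(all_once)
print(fold_sizes)
```

In this sketch each fold holds one tenth of the trees, so each model is fitted on 90% of the data and validated on the remaining 10%.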

3.6.1.3 Select input variables for the biomass modeling system

Three variables are commonly used in forest tree biomass models: tree diameter at breast height (D, cm), tree height (H, m) and wood density (WD, g/cm³). D represents tree size, H reflects site conditions, and WD represents the capacity of a species to accumulate biomass and carbon.

The results of this study show that:

- For the general biomass model across species or for a plant family, stem biomass (Bst) and total aboveground biomass (AGB) require the three variables D, H and WD, while the biomass of branches (Bbr), leaves (Ble) and bark (Bba) requires only one input variable, D.

- For biomass models established at the plant genus or species level, only one input variable, D, is needed for all biomass components and AGB. At this level, the genus or species already reflects the morphological characteristics of the plant and its carbon accumulation capacity, so the additional variables H and WD are not needed.

3.6.1.4 Select the form of the forest tree biomass function

According to Picard et al. (2015), the Power function does not give the highest reliability compared with more complex functions. However, because it is simple and its reliability differs little from that of complex functions, the Power function is the form most widely used worldwide to model forest tree biomass. The results of this study also show that the Power function is suitable for biomass models of tree components and of total AGB. It is therefore recommended to use the Power function to model biomass. The general model is as follows (Huy et al., 2016a, b, c; Kralicek et al., 2017):

(3.12)

(3.13)

In which Bst, Bba, Bbr, Ble and AGB (kg) are the biomass values of the jth tree; the model parameters are the coefficient and the exponents of the Power function; the predictors are the variables D (cm), H (m) and WD (g/cm³), or the combined variable representing tree volume, D²H, or the combined variable representing biomass, D²HWD, for the jth tree; and the last term is the random error for the jth tree.
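One plausible way to write a Power model that matches the definitions above, and the R model AGB = a × D^b × H^c × WD^d used later in this section, is sketched below. The symbols α, β_k, X_jk, ε_j and σ_j are notation chosen here for illustration and may differ from the notation of equations (3.12) and (3.13).

```latex
\begin{align}
  Y_j &= \alpha \prod_{k} X_{jk}^{\beta_k} + \varepsilon_j \\
  \varepsilon_j &\sim \mathcal{N}\left(0, \sigma_j^{2}\right)
\end{align}
```

Here Y_j stands for Bst, Bba, Bbr, Ble or AGB of tree j, X_jk for the kth predictor (D, H, WD, D²H or D²HWD), and the error variance σ_j² is allowed to differ between trees.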

3.6.1.5 Choosing the Power function estimation method

For the Power function, the conventional estimation method is to linearize the model by taking logarithms and then apply least squares. This study used Furnival's Index (FI) (Jayaraman, 1999) to compare the two approaches and found that the weighted nonlinear model fit by maximum likelihood (Weighted Non-linear Models fit by Maximum Likelihood) achieves much higher reliability than log linearization. Therefore, it is recommended to use this method to estimate Power-type biomass models.

3.6.1.6 Techniques for establishing independent biomass models

The variance of component biomass and AGB increases strongly with tree size (heteroscedasticity) (Davidian and Giltinan, 1995; Picard et al., 2012; Huy et al., 2016a,b,c; Kralicek et al., 2017), so the model must be fitted with weights, where Weight = 1/X^δ (Picard et al., 2012), X is the variable D, H, D²H or D²HWD, depending on which variables are important in the model, and δ is the variance function coefficient.
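The effect of the weight 1/X^δ can be illustrated with simulated errors. The sketch below assumes X = D and uses hypothetical values for δ, the diameters and the error scale. It follows the `varPower` convention of the nlme package, in which the residual standard deviation is proportional to |X|^δ, so dividing a residual by X^δ puts all trees on a common scale.

```r
# Minimal sketch with simulated data: heteroscedastic errors and their rescaling.
set.seed(42)
n <- 2000
delta <- 1.5                                    # hypothetical variance function coefficient
D <- runif(n, 5, 80)                            # hypothetical diameters (cm)
eps <- rnorm(n, mean = 0, sd = 0.01 * D^delta)  # error sd grows as D^delta
res_weighted <- eps / D^delta                   # residuals rescaled by the weight 1/D^delta

# Compare the spread of residuals for large and small trees
big <- D > median(D)
ratio_raw <- sd(eps[big]) / sd(eps[!big])
ratio_weighted <- sd(res_weighted[big]) / sd(res_weighted[!big])
print(c(raw = ratio_raw, weighted = ratio_weighted))
```

The raw residuals of large trees are several times more variable than those of small trees, while the weighted residuals have a similar spread in both groups, which is the behavior the weighting is meant to achieve.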

Apply Weighted Non-linear Fixed Models fit by Maximum Likelihood together with K-Fold cross validation. The R code below illustrates the procedure.

R code to model biomass AGB = a × D^b × H^c × WD^d using the Weighted Non-linear Fixed Models fit by Maximum Likelihood method and K-Fold cross validation (K = 10), and then to estimate the model parameters from the entire data set

1) Model setup and cross validation

# Erase memory
rm(list = ls())
# Clean plot window (only if a graphics device is open)
if (!is.null(dev.list())) dev.off()
# Directory path
setwd("C:/Users/baohu/OneDrive/1 - Article Dip Forest/Data/Data for use")
# Enter dataset t (tab-separated text file)
t <- read.table("tAll.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Load packages (install once if needed, e.g. install.packages("ggplot2"))
library(ggplot2)
library(nlme)
library(cowplot)
library(gridExtra)
# Randomly shuffle the data
t <- t[sample(nrow(t)), ]

# Create 10 equal size folds
folds <- cut(seq(1, nrow(t)), breaks = 10, labels = FALSE)
AIC <- rep(0, 10)
R2adj <- rep(0, 10)
Bias <- rep(0, 10)
RMSE <- rep(0, 10)
MAPE <- rep(0, 10)
# Perform 10 fold cross validation:
for (i in 1:10) {
  # Segment the data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  n_va <- t[testIndexes, ]   # validation data
  t_eq <- t[-testIndexes, ]  # data used to fit the model

  # Develop Model: starting values come from the log-linear fit
  start <- coefficients(lm(log(AGB) ~ log(D) + log(H) + log(WD), data = t_eq))
  names(start) <- c("a", "b", "c", "d")
  start[1] <- exp(start[1])
  Max_like <- nlme(AGB ~ a * D^b * H^c * WD^d, data = cbind(t_eq, g = "a"),
                   fixed = a + b + c + d ~ 1, start = start, groups = ~g,
                   weights = varPower(form = ~D))

  # Estimated values and Predicted:
  k <- summary(Max_like)$modelStruct$varStruct[1]
  t_eq$Max_like.fit <- fitted.values(Max_like)
  t_eq$Max_like.res <- residuals(Max_like)
  t_eq$Max_like.res.weigh <- residuals(Max_like) / t_eq$D^k
  # Calculation of AIC, R2
  AIC[i] <- AIC(Max_like)
  R2 <- 1 - sum((t_eq$AGB - t_eq$Max_like.fit)^2) / sum((t_eq$AGB - mean(t_eq$AGB))^2)
  R2.adjusted <- 1 - (1 - R2) * (length(t_eq$D) - 1) / (length(t_eq$D) - 5 - 1)
  R2adj[i] <- R2.adjusted

  # Prediction of the model for validation
  n_va$Pred <- predict(Max_like, newdata = cbind(n_va, g = "a"))
  # Calculation of RMSE, Bias, MAPE% each time:
  Bias[i] <- 100 * mean((n_va$AGB - n_va$Pred) / n_va$AGB)
  RMSE[i] <- 100 * sqrt(mean(((n_va$AGB - n_va$Pred) / n_va$AGB)^2))
  MAPE[i] <- 100 * mean(abs(n_va$AGB - n_va$Pred) / n_va$AGB)

}

# Mean of AIC, R2 adj.:
mean(AIC)
mean(R2adj)
# Mean of RMSE, Bias, MAPE%:
mean(Bias)
mean(RMSE)
mean(MAPE)
# Output last model:
summary(Max_like)

# Plot of Validation vs. Prediction (data of the last fold)
p <- ggplot(n_va)
p <- p + geom_point(aes(x = Pred, y = AGB), cex = 3.5)
p <- p + geom_abline(intercept = 0, slope = 1, col = "black", cex = 1.5)
p <- p + xlab("Predicted AGB (kg)") + ylab("Validation AGB (kg)") + theme_bw()
p <- p + labs(title = "")
p <- p + theme(axis.title.y = element_text(size = rel(1.7)))
p <- p + theme(axis.title.x = element_text(size = rel(1.7)))
p <- p + theme(plot.title = element_text(size = rel(1.7)))
p <- p + theme(axis.text.x = element_text(size = 15))
p <- p + theme(axis.text.y = element_text(size = 15))
p <- p + ylim(0, 1500)
p <- p + xlim(0, 1500)
p

2) Estimate model parameters with all data

# Erase memory
rm(list = ls())
# Clean plot window (only if a graphics device is open)
if (!is.null(dev.list())) dev.off()
# Directory path
setwd("C:/Users/baohu/OneDrive/PhD Master/PhD Tinh/1. Instructions for Final Calculation/Data")
# Enter data set t (tab-separated text file)
t <- read.table("tAll.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Load packages (install once if needed, e.g. install.packages("ggplot2"))
library(ggplot2)
library(nlme)
library(cowplot)
library(gridExtra)

# Developing model: starting values come from the log-linear fit
start <- coefficients(lm(log(AGB) ~ log(D) + log(H) + log(WD), data = t))
names(start) <- c("a", "b", "c", "d")
start[1] <- exp(start[1])
Max_like <- nlme(AGB ~ a * D^b * H^c * WD^d, data = cbind(t, g = "a"),
                 fixed = a + b + c + d ~ 1, start = start, groups = ~g,
                 weights = varPower(form = ~D))
# Model summary:
summary(Max_like)

# The end
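The three validation statistics computed inside the cross-validation loop above are all relative errors expressed in percent. As a small worked illustration, the sketch below wraps the same formulas in a helper and applies them to two hypothetical trees. The function name `cv_errors` and the observed and predicted values are assumptions made for this example only.

```r
# Relative (percent) validation errors, using the same formulas as the listing above.
cv_errors <- function(obs, pred) {
  rel <- (obs - pred) / obs            # relative error of each validation tree
  c(Bias = 100 * mean(rel),
    RMSE = 100 * sqrt(mean(rel^2)),
    MAPE = 100 * mean(abs(rel)))
}

obs  <- c(100, 200)   # hypothetical observed AGB (kg)
pred <- c(90, 220)    # hypothetical predicted AGB (kg)
err <- cv_errors(obs, pred)
print(round(err, 2))
```

The two relative errors are +10% and -10%, so they cancel in Bias (0%) but not in RMSE or MAPE (both 10%). This is why Bias is reported alongside the other two statistics rather than instead of them.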

3.6.1.7 Techniques for establishing a biomass model system under the influence of forest ecological environmental factors

Forest biomass accumulation is affected by forest ecological environmental factors, so in addition to establishing the relationship between biomass and forest tree variables, it is necessary to consider the influence of environmental factors on the model to increase the accuracy of biomass and carbon estimates of the model.

There are two approaches to establishing biomass models that include environmental ecological factors.

i) Method of considering the influence of each forest ecological environment factor on biomass model

Environmental and forest ecological factors including climate, soil, terrain, and forest characteristics with fluctuations in different sample plots were studied for their random effect on the biomass model.

Apply the general Power model format as follows (Huy et al., 2016a, b, c; Kralicek et al., 2017):

(3.14)

(3.15)

In which Bst, Bba, Bbr, Ble and AGB (kg) are the biomass values of the jth tree in level i of the environmental or ecological factor treated as a random effect; the fixed parameters of the model are the coefficient and the exponents of the Power function, and each may vary by a random amount according to level i; the predictors are the variables D (cm), H (m) and WD (g/cm³), or the combined variable representing tree volume, D²H, or the combined variable representing biomass, D²HWD, for tree j in factor level i; the last term is the random error for the jth tree in factor level i; and the weight variable is 1/X^δ, where X = D, D²H or D²HWD and δ is the coefficient of the variance function.
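A mixed-effects Power model matching this description can be sketched as follows. The symbols α, β_k, a_i, b_ik, X_ijk and ε_ij are notation chosen here for illustration and may differ from the notation of equations (3.14) and (3.15). The variance form follows the `varPower` convention of the nlme package, and the normality of the random effects is the usual assumption for such models.

```latex
\begin{align}
  Y_{ij} &= (\alpha + a_i) \prod_{k} X_{ijk}^{\,\beta_k + b_{ik}} + \varepsilon_{ij} \\
  \varepsilon_{ij} &\sim \mathcal{N}\left(0, \sigma^{2} X_{ij}^{2\delta}\right), \qquad
  a_i \sim \mathcal{N}\left(0, \sigma_a^{2}\right)
\end{align}
```

In the R listing that follows, only the coefficient a is given a random effect (`random = a ~ 1`), which corresponds to setting every b_ik to zero in this sketch.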

Use the weighted non-linear method, with the random effects of each factor on the model estimated by Maximum Likelihood (Weighted Non-Linear Mixed Models with random effects fit by Maximum Likelihood), together with K-Fold cross-validation, to assess whether each factor influences the biomass model. The R code below illustrates the procedure.

R code to model biomass AGB = a × D^b × H^c × WD^d using the Weighted Non-linear Mixed Effects with random effect fit by Maximum Likelihood method and K-Fold cross validation (K = 10), and then to estimate the model parameters from the entire data set

1) K-Fold setup and cross validation

# Erase memory
rm(list = ls())
# Clean plot window (only if a graphics device is open)
if (!is.null(dev.list())) dev.off()
# Define the working directory
setwd("C:/Users/baohu/OneDrive/PhD Master/PhD Tinh/1. Instructions for Final Calculation/Data")
# Import data (tab-separated text file)
t <- read.table("tAll.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# Load packages (install once if needed, e.g. install.packages("ggplot2"))
library(ggplot2)
library(nlme)
library(cowplot)
library(gridExtra)

# K-Fold Cross validation
# Randomly shuffle the data, then create 10 equal size folds
t <- t[sample(nrow(t)), ]
folds <- cut(seq(1, nrow(t)), breaks = 10, labels = FALSE)
AIC <- rep(0, 10)
R2adj <- rep(0, 10)
Bias <- rep(0, 10)
RMSE <- rep(0, 10)
MAPE <- rep(0, 10)
# Perform 10 fold cross validation:
for (i in 1:10) {
  # Segment the data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  n_va <- t[testIndexes, ]   # validation data
  t_eq <- t[-testIndexes, ]  # data used to fit the model

  # Random Effects Modeling: starting values come from the log-linear fit
  start <- coefficients(lm(log(AGB) ~ log(D) + log(H) + log(WD), data = t_eq))
  names(start) <- c("a", "b", "c", "d")
  start[1] <- exp(start[1])
  Max_like <- nlme(AGB ~ a * D^b * H^c * WD^d, data = t_eq,
                   fixed = a + b + c + d ~ 1, random = a ~ 1, start = start,
                   groups = ~Region_simple, weights = varPower(form = ~D))

  # Estimated values and Predicted:
  k <- summary(Max_like)$modelStruct$varStruct[1]
  t_eq$Max_like.fit <- fitted.values(Max_like)
  t_eq$Max_like.res <- residuals(Max_like)
  t_eq$Max_like.res.weigh <- residuals(Max_like) / t_eq$D^k
  # Calculation of AIC, R2
  AIC[i] <- AIC(Max_like)
  R2 <- 1 - sum((t_eq$AGB - t_eq$Max_like.fit)^2) / sum((t_eq$AGB - mean(t_eq$AGB))^2)
  R2.adjusted <- 1 - (1 - R2) * (length(t_eq$AGB) - 1) / (length(t_eq$AGB) - 5 - 1)
  R2adj[i] <- R2.adjusted
  # Prediction of the model for validation
  n_va$Pred <- predict(Max_like, newdata = n_va)

  # Calculation of RMSE, Bias, MAPE% each time:
  Bias[i] <- 100 * mean((n_va$AGB - n_va$Pred) / n_va$AGB)
  RMSE[i] <- 100 * sqrt(mean(((n_va$AGB - n_va$Pred) / n_va$AGB)^2))
  MAPE[i] <- 100 * mean(abs(n_va$AGB - n_va$Pred) / n_va$AGB)

}

# Fixed-effect parameters, random effects by level, and per-level coefficients
fixef(Max_like)
ranef(Max_like)
coef(Max_like)
coef(summary(Max_like))