Libraries required for RF regression analysis

library(randomForest)
library(dplyr)
library(ggplot2)

Introducing variables that can be used as modifiers of the input file names to run different models, and reading the formatted dataset

This step comes in handy when multiple RF models are run in the same session for comparison.

x= "_593"
y= ""
z= ""
rf_data1 <- read.csv(paste0("RF_input",x,y,z,".csv"), header=TRUE, row.name=1)
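
For example, several models can be run in one session by looping over the suffix modifiers. The sketch below is illustrative only; the "_600" suffix is a hypothetical placeholder for a second input file.

for (x in c("_593", "_600")) { #"_600" is a hypothetical suffix; replace with your own file suffixes
  rf_data1 <- read.csv(paste0("RF_input",x,y,z,".csv"), header=TRUE, row.names=1)
  #...the filtering, training, and prediction steps below would go here
}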

Removing columns whose column sums are 0

Removing columns that contain no values (all zeros) is important, since these variables can be incorporated into trees and reduce the accuracy of the models.

a = rf_data1[,-1] #dropping the first column (the dependent variable, Sulfide)

b = a[, which(colSums(a) != 0)] #keeping only predictor columns with non-zero column sums

Sulfide = rf_data1[,1] #the dependent variable

rf_data = cbind(Sulfide,b) #reassembling the filtered dataset

write.csv(rf_data, paste0("RF_input",x,y,z,"_removed_column0.csv")) #saving the filtered dataset
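
As a quick sanity check, one can confirm that no zero-sum column survived the filter (a minimal sketch):

stopifnot(all(colSums(rf_data[,-1]) != 0)) #throws an error if any all-zero predictor column remains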

Data partition in a 70:30 ratio (70% of the data for training and 30% for validation)

To analyze predictive performance, 30% of the observations are randomly withheld and used only for model validation. Setting a seed value is important to make the random selection reproducible.

set.seed(123)
ind <- sample(2, nrow(rf_data), replace = TRUE, prob = c(0.7, 0.3)) #assigning each row to group 1 (~70%) or group 2 (~30%)
train <- rf_data[ind==1,] #training set
test <- rf_data[ind==2,] #validation set
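
Because each row is assigned to a group independently, the realized split is only approximately 70:30; it can be checked as follows:

prop.table(table(ind)) #fraction of rows assigned to training (1) and validation (2)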

Running Random Forest (RF)

Setting a seed value is important to make the random selection reproducible. The mtry parameter (number of variables tried at each split) is left at its default, which for regression is 1/3 of the number of independent variables. The mean of squared residuals gives an estimate of the error (the lower the better), whereas % Var explained can be used as a pseudo-R2 value (the higher the better). However, one should keep an eye out for overfitting.

set.seed(4444)
rf <- randomForest(Sulfide~., data=train, #here Sulfide is the dependent variable and the rest of the variables are independent variables
                   ntree = 300, #number of trees will depend on the error plot
                   importance = TRUE, #evaluates importance of a predictor
                   proximity = TRUE) #calculates the proximity measure among the rows
rf
## 
## Call:
##  randomForest(formula = Sulfide ~ ., data = train, ntree = 300,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 365
## 
##           Mean of squared residuals: 0.6725344
##                     % Var explained: 75.51
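
If you want to choose mtry explicitly instead of relying on the default, the randomForest package provides tuneRF. The sketch below is a minimal example; the stepFactor and improve values are illustrative assumptions, not tuned recommendations.

set.seed(4444)
tuneRF(x = train[,-1], y = train$Sulfide, #predictors and response from the training set
       ntreeTry = 300,   #trees grown per candidate mtry, matching the model above
       stepFactor = 1.5, #factor by which mtry is scaled at each step (assumed value)
       improve = 0.01)   #minimum relative OOB-error improvement required to continue (assumed value)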

Plotting error profile

Error rate vs. number of trees is an important plot for understanding whether adding extra trees to the forest will enhance accuracy significantly. If the curve flattens, adding more trees will not significantly improve accuracy.

plot(rf)
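
For a regression forest, the out-of-bag mean squared error for every forest size is stored in rf$mse, so the flattening point can also be read off numerically (a minimal sketch):

which.min(rf$mse) #number of trees giving the lowest OOB error
min(rf$mse)       #the corresponding OOB mean squared error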

Predicting the dependent variable in the training set using the generated RF model

Using the generated RF model to predict the response variable for the training dataset (the 70% randomly selected earlier) and saving the predictions

p1 <- predict(rf, train)
write.csv(p1,paste0("training",x,y,z,".csv"))
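
Training error can also be summarized directly as RMSE; note that predictions on the training set reuse all trees, so this estimate is optimistic compared to validation.

sqrt(mean((p1 - train$Sulfide)^2)) #training RMSE (the lower the better)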

Predicting the dependent variable in the validation set using the generated RF model (this is the step where an external dataset can be used for prediction)

Using the generated RF model to predict the response variable for the validation dataset (the 30% randomly withheld earlier) and saving the predictions. If you want to predict the response variable of an external dataset, store the external dataset in the test data frame, as sketched below. For an external dataset, splitting in a 70:30 ratio is not always necessary; training on all of the available data is usually best.

p2 <- predict(rf, test)
write.csv(p2,paste0("validation",x,y,z,".csv"))
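
As a sketch of the external-dataset case, suppose the new samples are stored in a file named RF_external.csv (a hypothetical name) with the same predictor columns as the training data:

external <- read.csv("RF_external.csv", header=TRUE, row.names=1) #hypothetical external dataset
external <- external[, colnames(rf_data)[-1]] #keeping/ordering the predictor columns used in training
p_ext <- predict(rf, external)
write.csv(p_ext, paste0("external_prediction",x,y,z,".csv"))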

Finding training accuracy (on the 70% of the data used for training) using linear models

Three steps are involved here:

  1. comparison of actual vs. predicted values
  2. plotting actual vs. predicted values
  3. evaluation of the plot using linear models
#Step 1:
data <- read.csv(paste0("RF_input",x,y,z,".csv"), header=TRUE) #re-reading without row.names so the sample IDs stay in column 1

df2 <- data %>%  select(1, Sulfide) #keeping the sample ID column and the actual Sulfide values
colnames(df2) <- c("SampleID", "Actual_value")

#for training accuracy

tr <- read.csv(paste0("training",x,y,z,".csv"))

#Changing the column names 
colnames(tr) <- c("SampleID", "Predicted_value")

total1 <- merge(tr,df2,by="SampleID")
write.csv(total1,"predicted_vs_actual_in_training.csv", row.names = FALSE) #saving the output of Step 1 for further comparison and analyses

# Step 2: plotting predicted vs actual value

ggplot(total1, aes(y=Actual_value, x=Predicted_value)) +  geom_point() + geom_smooth(method=lm)+
  ggtitle(paste0("Model RM",x,y,z," training plot")) +
  xlab("Predicted value") + ylab("Actual value") 

#Step 3: running linear model for training
fit <- lm(Actual_value ~ Predicted_value, data = total1)

summary(fit)
## 
## Call:
## lm(formula = Actual_value ~ Predicted_value, data = total1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54771 -0.16551 -0.00861  0.16709  1.46969 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.270263   0.029988  -9.013   <2e-16 ***
## Predicted_value  1.099739   0.009513 115.608   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.289 on 417 degrees of freedom
## Multiple R-squared:  0.9697, Adjusted R-squared:  0.9697 
## F-statistic: 1.337e+04 on 1 and 417 DF,  p-value: < 2.2e-16
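
The same training accuracy can be summarized without fitting a linear model, e.g., as squared correlation and RMSE (a minimal sketch):

cor(total1$Actual_value, total1$Predicted_value)^2           #squared correlation, comparable to R-squared
sqrt(mean((total1$Actual_value - total1$Predicted_value)^2)) #RMSE between actual and predicted values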

Finding validation accuracy (on the 30% of the data withheld earlier) using linear models (this is the main evaluation of predictive performance)

Three steps are involved here:

  1. comparison of actual vs. predicted values
  2. plotting actual vs. predicted values
  3. evaluation of the plot using linear models
#Step 1:
pr <- read.csv(paste0("validation",x,y,z,".csv"))

#Changing the column names 
colnames(pr) <- c("SampleID", "Predicted_value")

total2 <- merge(pr,df2,by="SampleID")
write.csv(total2,"predicted_vs_actual_in_validation.csv", row.names = FALSE) #saving the output of Step 1 for further comparison and analyses

#Step 2: plotting predicted vs actual value


ggplot(total2, aes(y=Actual_value, x=Predicted_value)) +  geom_point() + geom_smooth(method=lm) +
  ggtitle(paste0("Model RM",x,y,z," validation plot")) +
  xlab("Predicted value") + ylab("Actual value") 

#Step 3: running linear model for validation
fit <- lm(Actual_value ~ Predicted_value, data = total2)

summary(fit)
## 
## Call:
## lm(formula = Actual_value ~ Predicted_value, data = total2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04816 -0.47534  0.00856  0.43593  2.55673 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.49231    0.12861  -3.828 0.000181 ***
## Predicted_value  1.14815    0.04137  27.755  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7585 on 172 degrees of freedom
## Multiple R-squared:  0.8175, Adjusted R-squared:  0.8164 
## F-statistic: 770.3 on 1 and 172 DF,  p-value: < 2.2e-16
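
Comparing the training and validation fits gives a quick check for the overfitting mentioned earlier; a large gap between the two R-squared values suggests the model memorized the training data (a minimal sketch):

r2_train <- cor(total1$Actual_value, total1$Predicted_value)^2
r2_valid <- cor(total2$Actual_value, total2$Predicted_value)^2
r2_train - r2_valid #a large positive gap indicates overfitting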

Assessing important variables

Variables that are important for predicting the response/dependent variable can be evaluated. %IncMSE is an informative measure of variable importance: it indicates the increase in mean squared error when the given independent variable is randomly permuted. The higher the value, the more important the variable is.

#plotting the top 10 variables that are most important for decision making in the trees

varImpPlot(rf,
           sort = T,
           n.var = 10,
           main = "Top 10 - Variable Importance")

#writing all of the variable importance values
write.csv(importance(rf), paste0("variable_importance_",x,y,z,".csv"))
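
The top variables can also be extracted programmatically rather than read off the plot (a minimal sketch):

imp <- importance(rf) #matrix with %IncMSE and IncNodePurity columns
head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 10) #ten most important variables by %IncMSE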