Libraries required for RF regression analysis

library(randomForest)
library(dplyr)
library(ggplot2)

Introducing variables that can be used as modifiers of the input file names to run different models, and reading the formatted dataset

This step comes in handy when multiple RF models are run in the same session for comparison.

x= "_593"
y= ""
z= ""
rf_data1 <- read.csv(paste0("RF_input",x,y,z,".csv"), header=TRUE, row.name=1)
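
For example, several models can be run in one session by looping over the suffix modifiers. The sketch below is illustrative only; the "_600" suffix is a hypothetical placeholder for a second input file.

for (x in c("_593", "_600")) { #"_600" is a hypothetical suffix; replace with your own file suffixes
  rf_data1 <- read.csv(paste0("RF_input",x,y,z,".csv"), header=TRUE, row.names=1)
  #...the filtering, training, and prediction steps below would go here
}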

Removing columns whose column sums are 0

Removing columns that contain no values (all zeros) is important, since these variables can be incorporated into trees and reduce the accuracy of the models.

a = rf_data1[,-1] #dropping the first column (the dependent variable, Sulfide)

b = a[, which(colSums(a) != 0)] #keeping only predictor columns with non-zero column sums

Sulfide = rf_data1[,1] #the dependent variable

rf_data = cbind(Sulfide,b) #reassembling the filtered dataset

write.csv(rf_data, paste0("RF_input",x,y,z,"_removed_column0.csv")) #saving the filtered dataset
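
As a quick sanity check, one can confirm that no zero-sum column survived the filter (a minimal sketch):

stopifnot(all(colSums(rf_data[,-1]) != 0)) #throws an error if any all-zero predictor column remains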

Data partition in a 70:30 ratio (70% of the data for training and 30% for validation)

To analyze predictive performance, 30% of the observations are randomly withheld and used only for model validation. Setting a seed value is important to make the random selection reproducible.

set.seed(123)
ind <- sample(2, nrow(rf_data), replace = TRUE, prob = c(0.7, 0.3)) #assigning each row to group 1 (~70%) or group 2 (~30%)
train <- rf_data[ind==1,] #training set
test <- rf_data[ind==2,] #validation set
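
Because each row is assigned to a group independently, the realized split is only approximately 70:30; it can be checked as follows:

prop.table(table(ind)) #fraction of rows assigned to training (1) and validation (2)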

Running Random Forest (RF)

Setting a seed value is important to make the random selection reproducible. The mtry parameter (number of variables tried at each split) is left at its default, which for regression is 1/3 of the number of independent variables. The mean of squared residuals gives an estimate of the error (the lower the better), whereas % Var explained can be used as a pseudo-R2 value (the higher the better). However, one should keep an eye out for overfitting.

set.seed(4444)
rf <- randomForest(Sulfide~., data=train, #here Sulfide is the dependent variable and the rest of the variables are independent variables
                   ntree = 300, #number of trees will depend on the error plot
                   importance = TRUE, #evaluates importance of a predictor
                   proximity = TRUE) #calculates the proximity measure among the rows
rf
## 
## Call:
##  randomForest(formula = Sulfide ~ ., data = train, ntree = 300,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 300
## No. of variables tried at each split: 365
## 
##           Mean of squared residuals: 0.6725344
##                     % Var explained: 75.51
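
If you want to choose mtry explicitly instead of relying on the default, the randomForest package provides tuneRF. The sketch below is a minimal example; the stepFactor and improve values are illustrative assumptions, not tuned recommendations.

set.seed(4444)
tuneRF(x = train[,-1], y = train$Sulfide, #predictors and response from the training set
       ntreeTry = 300,   #trees grown per candidate mtry, matching the model above
       stepFactor = 1.5, #factor by which mtry is scaled at each step (assumed value)
       improve = 0.01)   #minimum relative OOB-error improvement required to continue (assumed value)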

Plotting error profile

Error rate vs. number of trees is an important plot for understanding whether adding extra trees to the forest will enhance accuracy significantly. If the curve flattens, adding more trees will not significantly improve accuracy.

plot(rf)
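
For a regression forest, the out-of-bag mean squared error for every forest size is stored in rf$mse, so the flattening point can also be read off numerically (a minimal sketch):

which.min(rf$mse) #number of trees giving the lowest OOB error
min(rf$mse)       #the corresponding OOB mean squared error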

Predicting the dependent variable in the training set using the generated RF model

Using the generated RF model to predict the response variable for the training dataset (the 70% randomly selected earlier) and saving the predictions

p1 <- predict(rf, train)
write.csv(p1,paste0("training",x,y,z,".csv"))
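
Training error can also be summarized directly as RMSE; note that predictions on the training set reuse all trees, so this estimate is optimistic compared to validation.

sqrt(mean((p1 - train$Sulfide)^2)) #training RMSE (the lower the better)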

Predicting the dependent variable in the validation set using the generated RF model (this is the step where an external dataset can be used for prediction)

Using the generated RF model to predict the response variable for the validation dataset (the 30% randomly withheld earlier) and saving the predictions. If you want to predict the response variable of an external dataset, store the external dataset in the test data frame, as sketched below. For an external dataset, splitting in a 70:30 ratio is not always necessary; training on all of the available data is usually best.

p2 <- predict(rf, test)
write.csv(p2,paste0("validation",x,y,z,".csv"))
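
As a sketch of the external-dataset case, suppose the new samples are stored in a file named RF_external.csv (a hypothetical name) with the same predictor columns as the training data:

external <- read.csv("RF_external.csv", header=TRUE, row.names=1) #hypothetical external dataset
external <- external[, colnames(rf_data)[-1]] #keeping/ordering the predictor columns used in training
p_ext <- predict(rf, external)
write.csv(p_ext, paste0("external_prediction",x,y,z,".csv"))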

Finding training accuracy (on the 70% of the data used for training) using linear models

Three steps are involved here:

  1. comparison of actual vs. predicted values
  2. plotting actual vs. predicted values
  3. evaluation of the plot using linear models
#Step 1:
data <- read.csv(paste0("RF_input",x,y,z,".csv"), header=TRUE) #re-reading without row.names so the sample IDs stay in column 1

df2 <- data %>%  select(1, Sulfide) #keeping the sample ID column and the actual Sulfide values
colnames(df2) <- c("SampleID", "Actual_value")

#for training accuracy

tr <- read.csv(paste0("training",x,y,z,".csv"))

#Changing the column names 
colnames(tr) <- c("SampleID", "Predicted_value")

total1 <- merge(tr,df2,by="SampleID")
write.csv(total1,"predicted_vs_actual_in_training.csv", row.names = FALSE) #saving the output of Step 1 for further comparison and analyses

# Step 2: plotting predicted vs actual value

ggplot(total1, aes(y=Actual_value, x=Predicted_value)) +  geom_point() + geom_smooth(method=lm)+
  ggtitle(paste0("Model RM",x,y,z," training plot")) +
  xlab("Predicted value") + ylab("Actual value") 

#Step 3: running linear model for training
fit <- lm(Actual_value ~ Predicted_value, data = total1)

summary(fit)
## 
## Call:
## lm(formula = Actual_value ~ Predicted_value, data = total1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54771 -0.16551 -0.00861  0.16709  1.46969 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.270263   0.029988  -9.013   <2e-16 ***
## Predicted_value  1.099739   0.009513 115.608   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.289 on 417 degrees of freedom
## Multiple R-squared:  0.9697, Adjusted R-squared:  0.9697 
## F-statistic: 1.337e+04 on 1 and 417 DF,  p-value: < 2.2e-16
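
The same training accuracy can be summarized without fitting a linear model, e.g., as squared correlation and RMSE (a minimal sketch):

cor(total1$Actual_value, total1$Predicted_value)^2           #squared correlation, comparable to R-squared
sqrt(mean((total1$Actual_value - total1$Predicted_value)^2)) #RMSE between actual and predicted values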

Finding validation accuracy (on the 30% of the data withheld earlier) using linear models (this is the main evaluation of predictive performance)

Three steps are involved here:

  1. comparison of actual vs. predicted values
  2. plotting actual vs. predicted values
  3. evaluation of the plot using linear models
#Step 1:
pr <- read.csv(paste0("validation",x,y,z,".csv"))

#Changing the column names 
colnames(pr) <- c("SampleID", "Predicted_value")

total2 <- merge(pr,df2,by="SampleID")
write.csv(total2,"predicted_vs_actual_in_validation.csv", row.names = FALSE) #saving the output of Step 1 for further comparison and analyses

#Step 2: plotting predicted vs actual value


ggplot(total2, aes(y=Actual_value, x=Predicted_value)) +  geom_point() + geom_smooth(method=lm) +
  ggtitle(paste0("Model RM",x,y,z," validation plot")) +
  xlab("Predicted value") + ylab("Actual value") 

#Step 3: running linear model for validation
fit <- lm(Actual_value ~ Predicted_value, data = total2)

summary(fit)
## 
## Call:
## lm(formula = Actual_value ~ Predicted_value, data = total2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04816 -0.47534  0.00856  0.43593  2.55673 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.49231    0.12861  -3.828 0.000181 ***
## Predicted_value  1.14815    0.04137  27.755  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7585 on 172 degrees of freedom
## Multiple R-squared:  0.8175, Adjusted R-squared:  0.8164 
## F-statistic: 770.3 on 1 and 172 DF,  p-value: < 2.2e-16
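
Comparing the training and validation fits gives a quick check for the overfitting mentioned earlier; a large gap between the two R-squared values suggests the model memorized the training data (a minimal sketch):

r2_train <- cor(total1$Actual_value, total1$Predicted_value)^2
r2_valid <- cor(total2$Actual_value, total2$Predicted_value)^2
r2_train - r2_valid #a large positive gap indicates overfitting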

Assessing important variables

Variables that are important for predicting the response/dependent variable can be evaluated. %IncMSE is an informative measure of variable importance: it indicates the increase in mean squared error when the given independent variable is randomly permuted. The higher the value, the more important the variable is.

#plotting the top 10 variables that are most important for decision making in the trees

varImpPlot(rf,
           sort = T,
           n.var = 10,
           main = "Top 10 - Variable Importance")

#writing all of the variable importance values
write.csv(importance(rf), paste0("variable_importance_",x,y,z,".csv"))
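
The top variables can also be extracted programmatically rather than read off the plot (a minimal sketch):

imp <- importance(rf) #matrix with %IncMSE and IncNodePurity columns
head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 10) #ten most important variables by %IncMSE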