Libraries required for RF classification analysis

library(randomForest)
library(caret)
library(e1071)
library(dplyr)

Reading the input file

rf_data1 <- read.csv("RF_input_674_phase.csv", header=TRUE, row.names=1)

Removing columns with column sum = 0

Removing columns that contain no values (all zeros) is important, since these uninformative variables can still be drawn into trees and reduce the accuracy of the models.

a = rf_data1[,-1] #dropping the response column (phase)

b = a[, which(colSums(a) != 0)] #keeping only columns with a non-zero sum

phase = rf_data1[,1] #extracting the response column

rf_data = cbind(phase,b) #recombining the response and the filtered predictors

#saving the filtered dataset
write.csv(rf_data, "RF_input_674_phase_colsum0.csv") 
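As a quick sanity check (optional, not part of the original workflow), the number of dropped columns and the dimensions of the filtered dataset can be inspected:

ncol(a) - ncol(b) #number of zero-sum columns removed
dim(rf_data) #rows and columns of the filtered dataset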

Changing the dependent/response variable to a factor since the variable is categorical

This step is essential for classification and should always be done when taking a classification approach; otherwise randomForest will fit a regression model or fail. In this study, phase is the dependent variable.

data <- transform(rf_data,phase=as.factor(phase))
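Optionally, the conversion and the class balance can be verified before modeling; a minimal check:

class(data$phase) #should print "factor"
table(data$phase) #sample counts per class (M, R, S and TM in this study)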

Data partition in a 70:30 ratio (70% of the data for training and 30% for validation)

To assess predictive performance, 30% of the observations can be randomly withheld for model validation. Setting a seed value is important to make the random selection reproducible (a stratified alternative is sketched after the code below).

set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
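Note that sample() does not stratify by class, so rare classes (such as R and TM here) can end up under-represented in the training split. Since caret is already loaded, a stratified split that preserves class proportions can be done with createDataPartition(); a minimal sketch, where train_strat and test_strat are hypothetical names not used in the rest of this workflow:

set.seed(123)
idx <- createDataPartition(data$phase, p = 0.7, list = FALSE) #stratified 70% sample of row indices
train_strat <- data[idx, ]
test_strat <- data[-idx, ]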

Running Random Forest (RF)

Setting a seed value is important to make the random selection reproducible. The mtry parameter (number of variables tried at each split) is left at its default, which for classification is the square root of the number of independent variables. The OOB (Out of Bag) estimate of the error rate is an internal evaluation mechanism for RF models based on the concept of bagging/bootstrapped datasets (the lower the value, the higher the accuracy), but one should still keep an eye out for overfitting.

set.seed(4444)
rf <- randomForest(phase~., #predicting phase based on all the columns, so a dot (.) is given after the tilde (~)
                   data=train, #train data is used to train
                   ntree = 300, #number of trees to run
                   importance = TRUE, #evaluates importance of a predictor
                   proximity = TRUE) #calculates the proximity measure among the rows
rf
## 
## Call:
##  randomForest(formula = phase ~ ., data = train, ntree = 300,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 300
## No. of variables tried at each split: 33
## 
##         OOB estimate of  error rate: 8.96%
## Confusion matrix:
##      M R   S TM class.error
## M  102 0   2  0 0.019230769
## R   10 8   1  0 0.578947368
## S    2 0 313  0 0.006349206
## TM   0 0  27  4 0.870967742
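The default mtry is not always optimal. If you want to tune it, the randomForest package provides tuneRF(); a minimal sketch, assuming the train dataframe from above (phase is column 1):

set.seed(4444)
tuned <- tuneRF(x = train[, -1], #predictor columns only
                y = train$phase, #response
                ntreeTry = 300, #trees grown for each tried mtry value
                stepFactor = 2, #factor by which mtry is inflated/deflated at each step
                improve = 0.01) #minimum relative OOB improvement required to keep searching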

Plotting error profile

The plot of OOB (Out of Bag) error rate vs. number of trees is important for understanding whether adding extra trees to the forest will significantly enhance accuracy. If the curve flattens, adding more trees will not significantly improve accuracy.

plot(rf)
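plot(rf) draws one error curve per class plus the overall OOB curve, but adds no legend. A minimal sketch for labeling the curves from the err.rate matrix stored in the model (colors and line types follow the matplot defaults, so this may need adjusting):

legend("topright",
       legend = colnames(rf$err.rate), #"OOB" plus one entry per class
       col = 1:ncol(rf$err.rate),
       lty = 1:ncol(rf$err.rate),
       cex = 0.8)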

Predicting the dependent variable in the training set using generated RF model

Using the generated RF model to predict the response variable for the training dataset (the 70% randomly selected earlier) and saving the predictions for the training set. Note that this evaluation is optimistic; an out-of-bag alternative is sketched after the code below.

p1 <- predict(rf, train)
write.csv(p1,"training_phase.csv")
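Predictions on the training rows are nearly perfect by construction (see the confusion matrix further below). For a less biased internal estimate, calling predict() without newdata returns the out-of-bag predictions; a minimal sketch:

p1_oob <- predict(rf) #OOB predictions for the training rows
mean(p1_oob == train$phase) #OOB accuracy; should be close to 1 minus the OOB error rate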

Predicting the dependent variable in the validation set using the generated RF model (this is the step where an external dataset can be used for prediction)

Using the generated RF model to predict the response variable for the validation dataset (the 30% randomly withheld earlier) and saving the predictions. If you want to predict the response variable of an external dataset, store that dataset in the test dataframe. For an external dataset, splitting in a 70:30 ratio is not always necessary; rather, training on all of the available data is best (see the sketch after the code below).

p2 <- predict(rf, test)
write.csv(p2,"validation_phase.csv")
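For a truly external dataset, as noted above, a model trained on all available data can be used instead of the 70% split. A minimal sketch, where external_data.csv is a hypothetical file with the same predictor columns as the training data:

set.seed(4444)
rf_full <- randomForest(phase~., data = data, ntree = 300, importance = TRUE) #train on the full filtered dataset

external <- read.csv("external_data.csv", header = TRUE, row.names = 1) #hypothetical external file
p_ext <- predict(rf_full, external)
write.csv(p_ext, "external_phase_predictions.csv")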

Finding training accuracies (on the 70% training data) using confusion matrix

Two steps are present here:

  1. Comparison of actual vs. predicted values
  2. Construction of the confusion matrix and evaluation of performance (for the calculation and evaluation measures, please read https://cran.r-project.org/web/packages/caret/caret.pdf)
#Step 1:
data <- read.csv("RF_input_674_phase.csv", header=TRUE) #re-reading the raw file; this overwrites the earlier data object, which is no longer needed downstream

df2 <- data %>% select(1, phase) #keeping the sample ID (column 1) and the actual phase
colnames(df2) <- c("SampleID", "Actual_phase")

#for training accuracy

tr <- read.csv("training_phase.csv")

#Changing the column names 
colnames(tr) <- c("SampleID", "Predicted_phase")

total1 <- merge(tr,df2,by="SampleID")
write.csv(total1,"predicted_vs_actual_in_training.csv", row.names = FALSE) #saving the output of Step 1

#Step 2: constructing confusion matrix and evaluating accuracies

confusionMatrix(p1, train$phase)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   M   R   S  TM
##         M  104   0   0   0
##         R    0  19   0   0
##         S    0   0 315   0
##         TM   0   0   0  31
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9922, 1)
##     No Information Rate : 0.6716     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: M Class: R Class: S Class: TM
## Sensitivity            1.0000  1.00000   1.0000    1.0000
## Specificity            1.0000  1.00000   1.0000    1.0000
## Pos Pred Value         1.0000  1.00000   1.0000    1.0000
## Neg Pred Value         1.0000  1.00000   1.0000    1.0000
## Prevalence             0.2217  0.04051   0.6716    0.0661
## Detection Rate         0.2217  0.04051   0.6716    0.0661
## Detection Prevalence   0.2217  0.04051   0.6716    0.0661
## Balanced Accuracy      1.0000  1.00000   1.0000    1.0000

Finding validation accuracies (on the 30% withheld data) using confusion matrix (this is the main evaluation of predictive performance)

Two steps are present here:

  1. Comparison of actual vs. predicted values
  2. Construction of the confusion matrix and evaluation of performance
#Step 1:
pr <- read.csv("validation_phase.csv")

#Changing the column names 
colnames(pr) <- c("SampleID", "Predicted_value")

total2 <- merge(pr,df2,by="SampleID")
write.csv(total2,"predicted_vs_actual_in_validation.csv", row.names = FALSE) #saving the output of Step 1
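As a quick cross-check on the merged table (optional), the overall validation accuracy can be computed directly:

mean(total2$Predicted_value == total2$Actual_phase) #fraction of validation samples predicted correctly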

#Step 2: constructing confusion matrix and evaluating performance

confusionMatrix(p2, test$phase)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   M   R   S  TM
##         M   44   2   0   0
##         R    0  14   0   0
##         S    1   1 131  11
##         TM   1   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.922           
##                  95% CI : (0.8763, 0.9547)
##     No Information Rate : 0.639           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8423          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: M Class: R Class: S Class: TM
## Sensitivity            0.9565  0.82353   1.0000  0.000000
## Specificity            0.9874  1.00000   0.8243  0.994845
## Pos Pred Value         0.9565  1.00000   0.9097  0.000000
## Neg Pred Value         0.9874  0.98429   1.0000  0.946078
## Prevalence             0.2244  0.08293   0.6390  0.053659
## Detection Rate         0.2146  0.06829   0.6390  0.000000
## Detection Prevalence   0.2244  0.06829   0.7024  0.004878
## Balanced Accuracy      0.9720  0.91176   0.9122  0.497423
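The object returned by confusionMatrix() also allows the statistics to be extracted programmatically, which is convenient for saving; a minimal sketch (the output file name is illustrative):

cm <- confusionMatrix(p2, test$phase)
cm$overall["Accuracy"] #overall accuracy
cm$overall["Kappa"] #Cohen's kappa
write.csv(cm$byClass, "validation_stats_by_class.csv") #per-class sensitivity, specificity, etc.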

Assessing important variables for prediction

Variables that are important for predicting the response/dependent variable can be identified. Mean Decrease in Accuracy (MDA) is an informative parameter for determining the importance of a particular variable: it indicates the decrease in the model's accuracy when that variable is excluded. The higher the value, the more important the variable.

#plotting the top 10 variables most important for decision making in the trees

varImpPlot(rf,
           sort = T,
           n.var = 10,
           main = "Top 10 - Variable Importance")

#writing all the variable importance values
write.csv(importance(rf), "variable_importance_phase.csv")
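To rank variables by Mean Decrease in Accuracy alone, the type argument of importance() can be used; a minimal sketch listing the top 10:

mda <- importance(rf, type = 1) #MeanDecreaseAccuracy column only
head(mda[order(mda, decreasing = TRUE), , drop = FALSE], 10) #top 10 variables by MDA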