library(randomForest)
library(caret)
library(e1071)
library(dplyr)
rf_data1 <- read.csv("RF_input_674_phase.csv", header = TRUE, row.names = 1)
Removing columns that contain no values (all zeros) is important, since such variables can still be drawn into trees and reduce the accuracy of the models.
a = rf_data1[,-1]
b = a[, which(colSums(a) != 0)]
phase = rf_data1[,1]
rf_data = cbind(phase,b)
#saving the filtered dataset
write.csv(rf_data, "RF_input_674_phase_colsum0.csv")
Converting the dependent variable to a factor is essential for the classification approach; otherwise randomForest will fit a regression model instead of a classifier. In this study, phase is the dependent variable.
data <- transform(rf_data,phase=as.factor(phase))
To assess predictive performance, 30% of the observations are randomly withheld as a validation set. Setting a seed value makes the random split reproducible.
set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind==1,]
test <- data[ind==2,]
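The `sample()` split above gives an approximately 70:30 split but does not guarantee exact proportions or class balance. As an alternative sketch (not part of the original workflow), caret's `createDataPartition()` samples within each level of the response, so rare classes keep the same proportion in both subsets. The example below uses the built-in `iris` data purely for illustration:

```r
library(caret)

set.seed(123)
# createDataPartition samples within each class, so the 70/30 split
# preserves the class proportions of the response (here, iris$Species)
idx   <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_demo <- iris[idx, ]
test_demo  <- iris[-idx, ]
table(train_demo$Species)  # each class contributes ~70% of its rows
```

For imbalanced classes such as R and TM in this dataset, a stratified split of this kind can make the validation-set class errors more stable.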
Setting a seed value again makes model training reproducible. The mtry parameter (number of variables tried at each split) is left at its default, which for classification is the square root of the number of independent variables. The OOB (out-of-bag) estimate of the error rate is an internal evaluation mechanism for RF models based on bagging/bootstrapped datasets: the lower the value, the higher the accuracy. Even so, keep an eye out for overfitting.
set.seed(4444)
rf <- randomForest(phase~., #predicting phase from all other columns, so a dot (.) follows the tilde (~)
data=train, #train data is used to train
ntree = 300, #number of trees to run
importance = TRUE, #evaluates importance of a predictor
proximity = TRUE) #calculates the proximity measure among the rows
rf
##
## Call:
## randomForest(formula = phase ~ ., data = train, ntree = 300, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 33
##
## OOB estimate of error rate: 8.96%
## Confusion matrix:
## M R S TM class.error
## M 102 0 2 0 0.019230769
## R 10 8 1 0 0.578947368
## S 2 0 313 0 0.006349206
## TM 0 0 27 4 0.870967742
The plot of OOB (out-of-bag) error rate vs. number of trees shows whether adding more trees to the forest would improve accuracy significantly. Once the curve flattens, extra trees yield no meaningful gain.
plot(rf)
Using the generated RF model to predict the response variable for the training dataset (the 70% randomly selected earlier) and saving the predictions.
p1 <- predict(rf, train)
write.csv(p1,"training_phase.csv")
Using the generated RF model to predict the response variable for the validation dataset (the 30% withheld earlier) and saving the predictions. To predict the response variable of an external dataset instead, store that dataset in the test data frame. For an external dataset, a 70:30 split is not necessary; training on all available labelled data is usually best.
p2 <- predict(rf, test)
write.csv(p2,"validation_phase.csv")
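The external-dataset case mentioned above can be sketched as follows. This is a self-contained illustration on `iris`, with the "external" samples simulated by drawing unlabelled rows; in practice they would come from `read.csv()` and must contain the same predictor columns (same names and factor levels) as the training data:

```r
library(randomForest)

set.seed(42)
# No 70:30 split for external scoring: train on all labelled data,
# then predict on the new, unlabelled samples
rf_full  <- randomForest(Species ~ ., data = iris, ntree = 300)
external <- iris[sample(nrow(iris), 5), 1:4]  # predictors only, no label
predict(rf_full, external)
```

Predictions for external data can then be written out with `write.csv()` exactly as for the validation set above.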
Two steps are performed here: first the predicted labels are merged with the actual labels, then a confusion matrix is built to evaluate training accuracy.
#Step 1:
data <- read.csv("RF_input_674_phase.csv", header=TRUE)
df2 <- data %>% select(1, phase)
colnames(df2) <- c("SampleID", "Actual_phase")
#for training accuracy
tr <- read.csv("training_phase.csv")
#Changing the column names
colnames(tr) <- c("SampleID", "Predicted_phase")
total1 <- merge(tr,df2,by="SampleID")
write.csv(total1,"predicted_vs_actual_in_training.csv", row.names = FALSE) #saving the output of Step 1
#Step 2: constructing the confusion matrix and evaluating accuracy
confusionMatrix(p1, train$phase)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R S TM
## M 104 0 0 0
## R 0 19 0 0
## S 0 0 315 0
## TM 0 0 0 31
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9922, 1)
## No Information Rate : 0.6716
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: M Class: R Class: S Class: TM
## Sensitivity 1.0000 1.00000 1.0000 1.0000
## Specificity 1.0000 1.00000 1.0000 1.0000
## Pos Pred Value 1.0000 1.00000 1.0000 1.0000
## Neg Pred Value 1.0000 1.00000 1.0000 1.0000
## Prevalence 0.2217 0.04051 0.6716 0.0661
## Detection Rate 0.2217 0.04051 0.6716 0.0661
## Detection Prevalence 0.2217 0.04051 0.6716 0.0661
## Balanced Accuracy 1.0000 1.00000 1.0000 1.0000
Two steps are performed here: first the predicted labels are merged with the actual labels, then a confusion matrix is built to evaluate validation performance.
#Step 1:
pr <- read.csv("validation_phase.csv")
#Changing the column names
colnames(pr) <- c("SampleID", "Predicted_value")
total2 <- merge(pr,df2,by="SampleID")
write.csv(total2,"predicted_vs_actual_in_validation.csv", row.names = FALSE) #saving the output of Step 1
#Step 2: constructing confusion matrix and evaluating performance
confusionMatrix(p2, test$phase)
## Confusion Matrix and Statistics
##
## Reference
## Prediction M R S TM
## M 44 2 0 0
## R 0 14 0 0
## S 1 1 131 11
## TM 1 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.922
## 95% CI : (0.8763, 0.9547)
## No Information Rate : 0.639
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8423
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: M Class: R Class: S Class: TM
## Sensitivity 0.9565 0.82353 1.0000 0.000000
## Specificity 0.9874 1.00000 0.8243 0.994845
## Pos Pred Value 0.9565 1.00000 0.9097 0.000000
## Neg Pred Value 0.9874 0.98429 1.0000 0.946078
## Prevalence 0.2244 0.08293 0.6390 0.053659
## Detection Rate 0.2146 0.06829 0.6390 0.000000
## Detection Prevalence 0.2244 0.06829 0.7024 0.004878
## Balanced Accuracy 0.9720 0.91176 0.9122 0.497423
Variables that are important for predicting the response/dependent variable can now be evaluated. Mean Decrease in Accuracy (MDA) is an informative measure of variable importance: it indicates how much the model's accuracy decreases when the values of a particular variable are permuted. The higher the value, the more important the variable.
#plotting the top 10 variables most important for the model's decisions
varImpPlot(rf,
sort = T,
n.var = 10,
main = "Top 10 - Variable Importance")
#writing all variable importance scores to a file
write.csv(importance(rf), "variable_importance_phase.csv")
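If a ranked list is preferred over the plot, the importance matrix can be sorted programmatically by the MDA column before saving. A minimal self-contained sketch on `iris` (the same idiom applies to the `rf` model above):

```r
library(randomForest)

set.seed(7)
# importance = TRUE is required for MeanDecreaseAccuracy to be computed
rf_demo <- randomForest(Species ~ ., data = iris, importance = TRUE)
imp <- importance(rf_demo)  # matrix: one row per predictor
# order predictors by Mean Decrease in Accuracy, highest (most important) first
top <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ,
           drop = FALSE]
head(top, 10)
```

Writing `top` with `write.csv()` then produces an importance table already sorted for inspection.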