library(randomForest)
library(dplyr)
library(ggplot2)
This step comes in handy when multiple RF models are run simultaneously for comparison: the suffix variables defined below are pasted into the input and output file names so each model's files stay distinct.
x= "_593"
y= ""
z= ""
rf_data1 <- read.csv(paste0("RF_input",x,y,z,".csv"), header=TRUE, row.names=1)
Removing columns that contain no values (all zeros) is important, since such variables can still be drawn into the trees and reduce model accuracy.
a = rf_data1[,-1]
b = a[, which(colSums(a) != 0)]
Sulfide = rf_data1[,1]
rf_data = cbind(Sulfide,b)
write.csv(rf_data, paste0("RF_input",x,y,z,"_removed_column0.csv")) #saving the filtered dataset
To analyze predictive performance, 30% of the observations are randomly withheld for model validation. Setting a seed value is important to make the random selection reproducible.
set.seed(123)
ind <- sample(2, nrow(rf_data), replace = TRUE, prob = c(0.7, 0.3))
train <- rf_data[ind==1,]
test <- rf_data[ind==2,]
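As an optional sanity check (a sketch, not part of the original workflow), the sizes of the resulting subsets can be confirmed before modeling.
table(ind) #counts of samples assigned to training (1) and validation (2)
nrow(train)
nrow(test)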
Setting a seed value again makes the model run reproducible. The mtry parameter (number of variables tried at each split) is left at its default, which for regression is one third of the number of independent variables. The mean of squared residuals gives an estimate of the error (lower is better), whereas % Var explained can be used as a pseudo-R2 value (higher is better). One should still keep an eye out for overfitting; a sketch of tuning mtry is included after the model summary below.
set.seed(4444)
rf <- randomForest(Sulfide~., data=train, #here Sulfide is the dependent variable and the rest of the variables are independent variables
ntree = 300, #number of trees will depend on the error plot
importance = TRUE, #evaluates importance of a predictor
proximity = TRUE) #calculates the proximity measure among the rows
rf
##
## Call:
## randomForest(formula = Sulfide ~ ., data = train, ntree = 300, importance = TRUE, proximity = TRUE)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 365
##
## Mean of squared residuals: 0.6725344
## % Var explained: 75.51
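If overfitting is a concern, mtry does not have to stay at its default; it can be tuned on the training set. A minimal sketch using tuneRF() from the randomForest package follows, with illustrative parameter values that are assumptions rather than part of the original workflow.
set.seed(4444)
tuned <- tuneRF(x = train[,-1], y = train$Sulfide, #predictors and response from the training set
                ntreeTry = 300, #trees grown for each candidate mtry
                stepFactor = 1.5, #factor by which mtry is changed at each step
                improve = 0.01, #minimum relative improvement in OOB error required to keep searching
                trace = TRUE, plot = TRUE)
tuned #matrix of candidate mtry values and their OOB errors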
Error rate vs. number of trees is an important plot for judging whether adding extra trees to the forest will enhance accuracy significantly. If the curve flattens, adding extra trees will not yield a significant improvement in accuracy.
plot(rf)
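The out-of-bag error behind this plot is also stored in the fitted object, so the flattening point can be inspected numerically (a small optional sketch; for regression forests rf$mse holds the cumulative OOB mean squared error after each tree).
which.min(rf$mse) #number of trees at which the OOB error is lowest
min(rf$mse) #the corresponding OOB mean squared error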
Using the generated RF model to predict the response variable for the training dataset (the 70% randomly picked earlier) and saving the predictions for the training set.
p1 <- predict(rf, train)
write.csv(p1,paste0("training",x,y,z,".csv"))
Using the generated RF model to predict the response variable for the validation dataset (the 30% randomly withheld earlier) and saving the predictions. If you want to predict the response variable of an external dataset, store the external dataset in the test data frame; for an external dataset, splitting in a 70:30 ratio is not always necessary, and training with all of the available data is usually best. A sketch for scoring an external file follows the code below.
p2 <- predict(rf, test)
write.csv(p2,paste0("validation",x,y,z,".csv"))
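For an external dataset, the prediction step could look like the following minimal sketch; the file name is hypothetical, and the external table must contain the same predictor columns that were used for training.
external <- read.csv("RF_external_input.csv", header=TRUE, row.names=1) #hypothetical file name
external <- external[, colnames(rf_data)[-1], drop=FALSE] #keep only the training predictors, in the same order
p3 <- predict(rf, external)
write.csv(p3, paste0("external_prediction",x,y,z,".csv"))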
Three steps follow for the training set: merging the predictions with the actual values, plotting predicted vs. actual values, and fitting a linear model to quantify the agreement.
#Step 1:
data <- read.csv(paste0("RF_input",x,y,z,".csv"), header=TRUE)
df2 <- data %>% select(1, Sulfide)
colnames(df2) <- c("SampleID", "Actual_value")
#for training accuracy
tr <- read.csv(paste0("training",x,y,z,".csv"))
#Changing the column names
colnames(tr) <- c("SampleID", "Predicted_value")
total1 <- merge(tr,df2,by="SampleID")
write.csv(total1,"predicted_vs_actual_in_training.csv", row.names = FALSE) #saving the output of Step 1 for further comparison and analyses
# Step 2: plotting predicted vs actual value
ggplot(total1, aes(y=Actual_value, x=Predicted_value)) + geom_point() + geom_smooth(method=lm)+
ggtitle(paste0("Model RM",x,y,z," training plot")) +
xlab("Predicted value") + ylab("Actual value")
#Step 3: running linear model for training
fit <- lm(Actual_value ~ Predicted_value, data = total1)
summary(fit)
##
## Call:
## lm(formula = Actual_value ~ Predicted_value, data = total1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54771 -0.16551 -0.00861 0.16709 1.46969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.270263 0.029988 -9.013 <2e-16 ***
## Predicted_value 1.099739 0.009513 115.608 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.289 on 417 degrees of freedom
## Multiple R-squared: 0.9697, Adjusted R-squared: 0.9697
## F-statistic: 1.337e+04 on 1 and 417 DF, p-value: < 2.2e-16
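In addition to the linear model, the fit can be summarized directly with simple error metrics (an optional sketch using the merged table from Step 1; the same lines work for the validation set by replacing total1 with total2).
rmse_tr <- sqrt(mean((total1$Actual_value - total1$Predicted_value)^2)) #root mean squared error
r2_tr <- cor(total1$Actual_value, total1$Predicted_value)^2 #squared correlation between predicted and actual values
c(RMSE = rmse_tr, R2 = r2_tr)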
The same three steps are repeated for the validation set: merging, plotting, and fitting a linear model.
#Step 1:
pr <- read.csv(paste0("validation",x,y,z,".csv"))
#Changing the column names
colnames(pr) <- c("SampleID", "Predicted_value")
total2 <- merge(pr,df2,by="SampleID")
write.csv(total2,"predicted_vs_actual_in_validation.csv", row.names = FALSE) #saving the output of Step 1 for further comparison and analyses
#Step 2: plotting predicted vs actual value
ggplot(total2, aes(y=Actual_value, x=Predicted_value)) + geom_point() + geom_smooth(method=lm) +
ggtitle(paste0("Model RM",x,y,z," validation plot")) +
xlab("Predicted value") + ylab("Actual value")
#Step 3: running linear model for validation
fit <- lm(Actual_value ~ Predicted_value, data = total2)
summary(fit)
##
## Call:
## lm(formula = Actual_value ~ Predicted_value, data = total2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04816 -0.47534 0.00856 0.43593 2.55673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.49231 0.12861 -3.828 0.000181 ***
## Predicted_value 1.14815 0.04137 27.755 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7585 on 172 degrees of freedom
## Multiple R-squared: 0.8175, Adjusted R-squared: 0.8164
## F-statistic: 770.3 on 1 and 172 DF, p-value: < 2.2e-16
Variables that are important for predicting the response/dependent variable can now be evaluated. %IncMSE is an informative measure of variable importance: it indicates the increase in mean squared error when the given independent variable is randomly permuted. The higher the value, the more important the variable.
#plotting the top 10 variables that are most important in the decision trees
varImpPlot(rf,
sort = T,
n.var = 10,
main = "Top 10 - Variable Importance")
#writing all the variable importance
write.csv(importance(rf), paste0("variable_importance_",x,y,z,".csv"))
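Beyond the plot, the same ranking can be extracted as a table (a small optional sketch).
imp <- importance(rf)
head(imp[order(imp[, "%IncMSE"], decreasing = TRUE), ], 10) #top 10 predictors ranked by %IncMSE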