I want to compute an unsupervised random forest classification from a raster stack in R. The stack covers the same extent in several spectral bands, and I want an unsupervised classification of it.
I am having problems with my code because my data set is very large. Is it okay to just convert the stack into a data frame in order to run the random forest algorithm, like this:
stack_median <- stack(b1_mosaic_median, b2_mosaic_median, b3_mosaic_median, b4_mosaic_median, b5_mosaic_median, b7_mosaic_median)
stack_median_df <- as.data.frame(stack_median)
Here is the data as a csv file (https://www.dropbox.com/s/gkaryusnet46f0i/stack_median_df.csv?dl=0) - and you can read it in via:
stack_median_df <- read.csv(file = "stack_median_df.csv")
stack_median_df <- stack_median_df[, -1]        # drop the row-index column
stack_median_df_na <- na.omit(stack_median_df)  # remove pixels with missing values
My next step would be the unsupervised classification:
# Omitting the response y is what puts randomForest into unsupervised mode;
# type and forest are not valid arguments
median_rf <- randomForest(stack_median_df_na, importance = TRUE,
                          proximity = FALSE, ntree = 500)
Due to my huge data set, a proximity matrix can't be calculated (it would need around 6000 GB). Do you know how I can have a look at the classification? predict(median_rf) and plot(median_rf) don't return anything useful.
I am happy for every suggestion, improvement, or code snippet for an unsupervised random forest classification with its accuracy measures.
Thanks a lot!
I think you could run the unsupervised classification on a large sample, then create a supervised classification model that predicts those classes from the raw data (it should have a very good fit), and apply that model to the entire data set. A sketch of the idea is below.
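A minimal sketch of that workflow, using the stack_median_df_na data frame from the question; the sample size, the number of classes k, and clustering the RF proximities with hclust are illustrative assumptions, not the only valid choices:
library(randomForest)
# Draw a manageable sample so the proximity matrix fits in memory
set.seed(1)
idx <- sample(nrow(stack_median_df_na), 5000)
sub <- stack_median_df_na[idx, ]
# Unsupervised mode: no response y; proximity is feasible at this size
urf <- randomForest(sub, ntree = 500, proximity = TRUE)
# Derive k classes by clustering the proximity-based distances
k <- 5
hc <- hclust(as.dist(1 - urf$proximity), method = "ward.D2")
classes <- factor(cutree(hc, k = k))
# Supervised RF that learns the classes from the raw bands,
# then classifies the whole data set
srf <- randomForest(sub, classes, ntree = 500)
all_classes <- predict(srf, stack_median_df_na)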
I am very new to machine learning. I am trying to explore fitting random forests with the ranger library in R. My dependent variable is continuous, so this is regression rather than classification. While trying out the functions, I noticed a discrepancy between the predictions stored on a ranger object and those returned by predict(). The following lines give different values in results and results_alternative:
rf_reg <- ranger(formula = y ~ ., data = training_df)
results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions
Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!
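For what it's worth, this looks like the same out-of-bag effect explained in the answer further down: ranger documents the stored $predictions as being based on out-of-bag samples, whereas predict() on the training data lets every tree vote. A small illustration on synthetic data (purely for demonstration):
library(ranger)
# Synthetic regression data, for illustration only
set.seed(42)
training_df <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))
rf_reg <- ranger(formula = y ~ ., data = training_df)
oob_preds  <- rf_reg$predictions                               # out-of-bag only
full_preds <- predict(rf_reg, data = training_df)$predictions  # all trees
all.equal(oob_preds, full_preds)  # differ for regression
For classification the two can still coincide on easy data like iris, because the majority vote often lands on the same class with or without the in-bag trees.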
Due to computational limitations with my GIS software, I am trying to implement random forests in R for image classification purposes. My input is a multi-band TIFF image, and the model is trained on points from an ArcGIS shapefile (target values 0 and 1). The code technically works and produces a valid output. When I view the confusion matrix I get the following:
   0  1 class.error
0 11  3 0.214285714
1  1 13 0.071428571
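(class.error is the misclassified share of each row: 3/14 ≈ 0.214 for class 0 and 1/14 ≈ 0.071 for class 1.)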
This is sensible for my data. However, when I view the output of the image classification in my GIS software (the binary reclassified TIFF with values 0 and 1), it predicts the training data with a 100% success rate. In other words, there is no classification error in the output image. How can this be when the confusion matrix indicates there are classification errors?
Am I missing something really obvious here? Code snippet below.
rf.mdl <- randomForest(x = samples@data[, names(PredMaps)], y = samples@data[, ValueFld],
                       ntree = 501, proximity = TRUE, importance = TRUE,
                       keep.forest = TRUE, keep.inbag = TRUE)
ConfMat = rf.mdl$confusion
write.csv(ConfMat,file = "ConfMat1.csv")
predict(PredMaps, rf.mdl, filename=classifiedPath, type="response", na.rm=T, overwrite=T, progress="text")
I expected the output classified image to misclassify 1 of the Value=1 training points and misclassify 3 of the Value=0 training points based on what is indicated in the confusion matrix.
The Random Forest algorithm is a bagging method. This means it creates numerous weak classifiers, then has each weak classifier "vote" to create the end prediction. In RF, each weak classifier is one decision tree that is trained on a random sample of observations in the training set. Think of the random samples each decision tree is trained on as a "bag" of data.
What is being shown in the confusion matrix is something called "out-of-bag error" (OOB error). This OOB error is an accurate estimate of how your model would generalize to data it has never seen before (this estimate is usually achieved by testing your model on a withheld testing set). Since each decision tree is trained on only one bag from your training data, the rest of the data (data that's "outside the bag") can stand in for this withheld data.
OOB error is calculated by making a prediction for each observation in the training set. However, when predicting each individual observation, only decision trees whose bags did not include that observation are allowed to participate in the voting process. The result is the confusion matrix available after training a RF model.
When you predict the observations in the training set using the complete model, decision trees whose bags did include each observation are now involved in the voting process. Since these decision trees "remember" the observation they were trained on, they skew the prediction toward the correct answer. This is why you achieve 100% accuracy.
Essentially, you should trust the confusion matrix that uses OOB error. It's a robust estimate of how the model will generalize to unseen data.
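You can see both numbers side by side on any fitted forest; a quick illustration on iris (an arbitrary example data set):
library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 501)
oob_pred  <- fit$predicted       # out-of-bag votes only (this is what fit$confusion summarizes)
full_pred <- predict(fit, iris)  # every tree votes, including the in-bag trees
mean(oob_pred  == iris$Species)  # realistic accuracy estimate
mean(full_pred == iris$Species)  # optimistically biased, typically ~100%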
Dear neuralnet experts,
I am studying ANN with a book and R package.
One of the examples is to train a neuralnet (R package) on a simple set of the squares of the numbers 1 to 10. It was quite quick and easy to fit them with 1 hidden layer of 10 neurons.
But for the larger set of 1 to 30, the algorithm does not converge. I think some parameters have to be changed to train the neuralnet. At first I increased the number of neurons and hidden layers, e.g. c(20,10), but that failed too.
Could somebody please guide me on how to get neuralnet to train on this data set?
My code in R is given as,
library("neuralnet")
#Read the input file
mydata50=read.csv('Squares50.csv',sep=",",header=TRUE)
mydata30 <- mydata50[1:30,]
attach(mydata30)
names(mydata30)
mydata30
#Train the model based on output from input
model30=neuralnet(formula = Output~Input,
data = mydata30,
hidden=c(20,10),
threshold=0.01 )
print(model30)
#Lets plot and see the layers
plot(model30)
Best regards,
Dong-Ho
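One thing worth trying before adding more layers: neuralnet tends to converge far more reliably when inputs and outputs are rescaled to a small range such as [0, 1], since raw targets up to 900 make gradient descent struggle. A sketch under that assumption, rebuilding the data instead of reading Squares50.csv:
library(neuralnet)
# Rebuild the 1..30 squares data (stands in for Squares50.csv)
x <- 1:30
mydata30 <- data.frame(Input = x, Output = x^2)
# Rescale both columns to [0, 1]
mins <- apply(mydata30, 2, min)
maxs <- apply(mydata30, 2, max)
scaled <- as.data.frame(scale(mydata30, center = mins, scale = maxs - mins))
set.seed(1)
model30 <- neuralnet(Output ~ Input, data = scaled,
                     hidden = 10, threshold = 0.01,
                     linear.output = TRUE)
# Predict and map back to the original scale
net_out <- compute(model30, scaled[, "Input", drop = FALSE])$net.result
pred <- net_out * (maxs["Output"] - mins["Output"]) + mins["Output"]
head(cbind(x, x^2, round(pred)))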
I want to extract the per-tree predictions for each observation from an rfsrc object. In other words, for i trees and j observations, I want an [i, j] matrix of predictions. My goal is to calculate prediction confidence intervals using the R code found at https://github.com/swager/randomForestCI. My analysis requires a competing-risks random forest; otherwise I would have used the randomForest package, which makes the per-tree predictions more obvious to extract.
I appreciate any assistance.
EDIT: I am attempting to follow the procedure outlined here: http://blog.revolutionanalytics.com/2016/03/confidence-intervals-for-random-forest.html
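In case it is useful: recent versions of randomForestSRC let predict.rfsrc restrict the ensemble to chosen trees via its get.tree argument, so looping over single trees yields the [i, j] matrix. A sketch on a plain regression forest (a competing-risks forest returns CIF arrays rather than one number per observation, so you would extract the quantity you need from each single-tree prediction):
library(randomForestSRC)
# Toy regression forest; stands in for the competing-risks fit
dat <- na.omit(airquality)
obj <- rfsrc(Ozone ~ ., data = dat, ntree = 100)
# get.tree = i restricts the ensemble to tree i alone
per_tree <- t(sapply(seq_len(obj$ntree), function(i) {
  predict(obj, newdata = dat, get.tree = i)$predicted
}))
dim(per_tree)  # ntree x n, i.e. the [i, j] matrix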
I am a newbie in R and I am trying to do my best to create my first model. I am working on a two-class random forest project and so far I have programmed the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data = training1, importance = TRUE, ntree = 2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get the overall prediction for my model. I must be missing something in my code. I get the OOB error and the per-case predictions for the test set, but not an overall performance measure for the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual but I cannot fully understand it. It would be very helpful if anyone could help with the code or post a link to a good practical example.
Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[, 2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")@y.values)
Your problem is that predict on a randomForest object with type='prob' returns a two-column matrix: each column contains the probability of belonging to one of the classes (for binary prediction).
You have to decide which of these columns to use to build the ROC curve. Fortunately, for binary classification the two give identical AUCs (the ROC curves are just mirrored):
auc1 <- roc(test$goodkit, prediction[, 1])
print(auc1)
auc2 <- roc(test$goodkit, prediction[, 2])
print(auc2)
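A self-contained illustration on synthetic data (all names and sizes here are arbitrary) showing that both columns yield the same AUC, because pROC detects the comparison direction automatically:
library(randomForest)
library(pROC)
set.seed(1)
n  <- 200
df <- data.frame(goodkit = factor(rbinom(n, 1, 0.5)),
                 x1 = rnorm(n), x2 = rnorm(n))
train <- df[1:150, ]
test  <- df[151:200, ]
rf   <- randomForest(goodkit ~ ., data = train, ntree = 500)
prob <- predict(rf, test, type = "prob")
auc(roc(test$goodkit, prob[, 1]))
auc(roc(test$goodkit, prob[, 2]))  # same value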