Random Forest test data - r

I am trying to use a random forest model on my data set, which has 4679 observations and 13 variables.
I am using the random forest model to predict whether a part will fail or not.
Of the 4679 observations, 66 have the target variable as NA. I want to predict whether these 66 parts will fail or not.
So, I decided to use the first 4613 rows as my training data and the remaining 66 rows as my test data.
train<- Imputed_data[1:4613, ]
test <- Imputed_data[4614:4679, ]
I then used the code below for my random forest:
fit<- randomForest(claim.Qty.Accepted~., data=train, na.action=na.exclude)
The training confusion matrix I received looked fine.
I then tried the same approach to predict on my test set with the following piece of code:
#Prediction for test set
p2 <- predict(fit, test)
head(p2)
head(test$claim.Qty.Accepted)
caret::confusionMatrix(p2, test$claim.Qty.Accepted)
The confusion matrix came back with zeros for both classes, Yes and No.
I later saved the predicted values p2 as a data frame, as below; in that table I could see that all 66 entries have Yes or No classes.
t2<- data.frame(p2)
I am confused about why the confusion matrix didn't show me the prediction results. Also, is this the right approach for predicting my test set? Any lead would be helpful, since I am new to the field.
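One thing worth checking here: the 66 test rows are, by construction, exactly the rows whose claim.Qty.Accepted is NA, so confusionMatrix has no reference labels to compare the predictions against and the cell counts come out as zero. A minimal sketch for inspecting the predictions themselves (assuming p2 is the factor of predicted classes from above):
# Tabulate the predicted classes for the 66 held-out parts
table(p2)
# confusionMatrix(p2, test$claim.Qty.Accepted) has nothing to count,
# because claim.Qty.Accepted is NA for every row of the test set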

Related

Difference between fitted values and cross validation values from pls model in r

I only have a small dataset of 30 samples, so I have a training set but no test set, and I want to use cross-validation to assess the model. I have run pls models in R using cross-validation and LOO. The mvr output has the fitted values and the validation$preds values, and these are different. For the final R2 and RMSE results on the training set, should I be using the fitted values or the validation$preds values?
Short answer: if you want to know how good the model is at predicting, use validation$preds, because those predictions are made on data held out during fitting. The values under $fitted.values are obtained by fitting the final model on all your training data, meaning the same data is used to construct the model and to generate the predictions. Metrics computed from this final fit will therefore overstate how well your model performs on unseen data.
Cross-validation is used to find the best hyperparameter, in this case the number of components for the model.
During cross-validation, one part of the data is held out from fitting and serves as a test set. This gives a rough estimate of how the model will perform on unseen data; the scikit-learn documentation has a good illustration of how CV works.
LOO works in a similar way. After finding the best parameter you obtain a final model to be used on the test set. In this case, mvr fits models with 2-6 PCs, but $fitted.values comes from a model trained on all the training data.
You can also see below how different they are. First I fit a model:
library(pls)
library(mlbench)
data(BostonHousing)
set.seed(1010)
idx = sample(nrow(BostonHousing),400)
trainData = BostonHousing[idx,]
testData = BostonHousing[-idx,]
mdl <- mvr(medv ~ ., 4, data = trainData, validation = "CV",
           method = "oscorespls")
Then we calculate the mean squared error in CV, on the full training model, and on the test data, using 4 PCs:
calc_RMSE = function(pred, actual){ mean((pred - actual)^2)}  # note: returns MSE; take sqrt() for RMSE
# error in CV
calc_RMSE(mdl$validation$pred[,,4],trainData$medv)
[1] 43.98548
# error on full training model , not very useful
calc_RMSE(mdl$fitted.values[,,4],trainData$medv)
[1] 40.99985
# error on test data
calc_RMSE(predict(mdl,testData,ncomp=4),testData$medv)
[1] 42.14615
You can see that the error estimated by cross-validation is closer to what you get on actual test data. Again, this really depends on your data.
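As a side note, the pls package also reports these cross-validated metrics directly, so they do not have to be computed by hand. A quick sketch on the same mdl object fitted above:
# Cross-validated RMSEP and R2, per number of components, as computed by pls
RMSEP(mdl, estimate = "CV")
R2(mdl, estimate = "CV")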

Questions about the predict function in the randomForestSRC package

As is common with other machine learning methods, I divided my original data set into a training set and a test set (70:30).
Here is my code.
install.packages("randomForestSRC")
library(randomForestSRC)
data(pbc, package="randomForestSRC")
data <- na.omit(pbc)
train <- sample(1:nrow(data), round(nrow(data) * 0.70))
data.grow <- rfsrc(Surv(days, status) ~ .,
                   data[train, ],
                   ntree = 100,
                   tree.err = TRUE,
                   importance = TRUE,
                   nsplit = 1,
                   proximity = TRUE)
data.pred <- predict(data.grow,
                     data[-train, ],
                     importance = TRUE,
                     tree.err = TRUE)
My question is about the predict function in this code.
Originally, I wanted to construct a prediction model based on a random survival forest to predict disease development.
For example, after I build the prediction model with the training data set, I want to estimate the probability of disease development for test data that has no information about disease incidence for each individual, because I would like to predict disease development from a subject's general characteristics such as age, BMI, sex, and so on.
However, contrary to my intention of building such a prediction model, the predict function in this package didn't seem to work on data that has no status information (event/censored).
The predict function seemed to require outcome information (event/censored).
Therefore, I cannot understand what the predict function is actually doing.
If the predict function works only with outcome information, then how can I make a prediction of disease development based on a subject's general characteristics in the future?
In addition, if the prediction in this model is constructed with the outcome information, what does "predict" mean in the random survival forest model?
Please let me know what the predict function in this package means.
Thank you for reading my long question.
The predict method for this type of model, i.e. predict.rfsrc, works much like you'd expect it to if you've used predict with glm, lm, RRF or other models.
The predict statement does not require you to know the outcome for the prediction data set. I am trying to understand why you thought that it did.
Your example rfsrc statement does not work because it refers to columns that are not in the example data set.
I think the best plan is that I will show you using a reproducible example, below. If you have further questions you can ask me in a comment.
# Train a RFSRC model
mtcars.mreg <- rfsrc(Surv(mpg, cyl) ~ ., data = mtcars[1:30, ],
                     tree.err = TRUE, importance = TRUE)
# Simulate new data
new_data <- mtcars[31:32, ]
# predict
predicted <- predict(mtcars.mreg, new_data)
predicted
Sample size of test (predict) data: 2
Number of grow trees: 1000
Average no. of grow terminal nodes: 4.898
Total no. of grow variables: 9
Analysis: RSF
Family: surv-CR
Test set error rate: NA
predicted$predicted
event.1 event.2 event.3
[1,] 0.4781338 2.399299 14.71493
[2,] 3.2185606 4.720809 2.15895
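To make the point about the outcome explicit: the columns used in Surv() do not need to be present in the new data at all. A minimal sketch continuing the same mtcars example (assuming, as the package documentation describes, that outcomes may be absent from the test data; the only change here is dropping mpg and cyl from the rows being predicted):
# New data without the outcome columns; mpg and cyl are removed
new_data2 <- mtcars[31:32, setdiff(names(mtcars), c("mpg", "cyl"))]
predicted2 <- predict(mtcars.mreg, new_data2)
predicted2$predicted  # predicted values are still returned; no error rate can be computed without the outcome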

Training and Test Data Predictions

So I was posed this question:
(a) The training set contains 1000 observations on 7 covariates with the last (the 8th) column containing a continuous response variable. Predict the response variable from the covariates.
(b) The test set contains a further 500 observations on the 7 covariates. Provide predictions of the response using the model you chose in part (a).
I'm not sure if I'm doing this correctly. I've read in the .csv files and done some regression. Here's what I've been trying:
train.lm<-lm(y~., data=train)
summary(train.lm)
predict(train.lm, train)
predict(train.lm, test)
Am I even on the right track?
Any help is greatly appreciated.
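For what it's worth, the general shape of that workflow (assuming the response column is named y, as in the lm call above, and that the test file has the same 7 covariate columns) would be:
# Fit on the 1000 training rows and inspect the fit
train.lm <- lm(y ~ ., data = train)
summary(train.lm)
# Predictions: in-sample for the training set, out-of-sample for the 500 test rows
train.pred <- predict(train.lm, newdata = train)
test.pred <- predict(train.lm, newdata = test)
# Training RMSE as a rough check of fit quality
sqrt(mean((train$y - train.pred)^2))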

Why is random forest performing worse than decision tree

I have a data set with 1962 observations and 46 columns. Column 46 is the target. 6 of the other columns are nominal variables and the rest are ordinal variables. I have preprocessed them as follows:
for (i in c(1:4, 6, 9, 46)){
  cw_alldata_known[, i] <- as.factor(cw_alldata_known[, i])
}
for (i in c(5, 7, 8, 10:45)){
  cw_alldata_known[, i] <- as.ordered(cw_alldata_known[, i])
}
Then I divide them 50/50 into training and test sets.
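(For reference, a 50/50 split of that kind can be made along these lines; the index variable and the cw.test name are just illustrative, not the original code.)
# Split cw_alldata_known into equally sized training and test sets
set.seed(42)
idx <- sample(nrow(cw_alldata_known), nrow(cw_alldata_known) %/% 2)
cw.train <- cw_alldata_known[idx, ]
cw.test <- cw_alldata_known[-idx, ]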
I fitted a decision tree using the party package in R:
cw.ctree <- ctree(cr~.,data = cw.train)
Then I also fitted a random forest model:
cw.forest <- randomForest(credit.rating ~ ., data=cw.train,ntree=107)
I have tried other ntree values but 107 seems to be the best.
The accuracy of the decision tree on the test set is around 61%, while the random forest reaches only 56%. I read that random forest is usually more robust and reliable. Why doesn't it perform better than the decision tree in this case?

Linear Discriminant Analysis in R - Training and validation samples

I am working with the lda command to analyze a 2-column, 234-row dataset (x): column X1 contains the predictor variable (metric) and column X2 the grouping variable (categorical, 4 categories). I would like to build a linear discriminant model using 150 observations and then use the other 84 observations for validation. After a random partitioning of the data I get x.build and x.validation with 150 and 84 observations, respectively. I run the following:
fit = lda(x.build$X2~x.build$X1, data=x.build, na.action="na.omit")
Then I run the predict command like this:
pred = predict(fit, newdata=x.validation)
From reading the command's description I thought that pred$class would give me the classification of the validation data according to the model built, but I get classifications for 150 observations instead of the 84 I intended to use as validation data. I don't really know what is happening; can someone please give me an example of how I should be conducting this analysis?
Thank you very much in advance.
Try this instead:
fit = lda(X2~X1, data=x.build, na.action="na.omit")
pred = predict(fit, newdata=x.validation)
If you use the formula x.build$X2~x.build$X1 when you build the model, predict expects a column called x.build$X1 in the validation data. Obviously there isn't one, so you get predictions for the training data instead.
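With the corrected formula, pred$class will have one entry per validation row, and you can compare it with the true classes, for example:
# 84 predicted classes, one per validation observation
length(pred$class)
# Simple confusion table of predicted vs. actual groups
table(predicted = pred$class, actual = x.validation$X2)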

Resources