Sorry if this was already asked, but I couldn't find it in half an hour of looking, so I would appreciate it if you could point me in the right direction.
I am having trouble with a missing object in the model. I don't actually use this object when building the model; it's just present in the dataset (as you can see in the example below).
This is a problem because I have already trained some random forest models, and I am loading them into the environment and reusing them as they are. The test dataset doesn't contain some variables that were present in the dataset the model was built on, even though those variables are not used in the model itself!
library(randomForest)
data(iris)
smp_size <- floor(0.75*nrow(iris))
set.seed(123)
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
test <- iris[-train_ind, ]
test$Sepal.Length <- NULL # for the sake of example I drop this column
rf_model <- randomForest(Species ~ . - Sepal.Length, # I don't use the column in training model
data = train)
rf_prediction <- predict(rf_model, newdata = test)
When I try to predict on the test dataset, I get an error:
Error in eval(expr, envir, enclos) : object 'Sepal.Length' not found
What I hope to achieve is to use the models I have already built, as rebuilding them without the missing variables would be costly.
Thanks for advice!
As your models are already built, you will want to add the missing columns back onto the test set before running the model. Just add the missing columns with a value of 0, as in the following example.
library(randomForest)
data(iris)
smp_size <- floor(0.75*nrow(iris))
set.seed(123)
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
test <- iris[-train_ind, ]
test$Sepal.Length <- NULL
rf_model <- randomForest(Species ~ . - Sepal.Length,
data = train)
# adding the missing columns to your test set, filled with 0
missingColumns <- setdiff(colnames(train), colnames(test))
test[, missingColumns] <- 0
rf_prediction <- predict(rf_model, newdata = test)
rf_prediction
# showing this produces the same results
train2 <- iris[train_ind, ]
test2 <- iris[-train_ind, ]
test2$Sepal.Length <- NULL
train2$Sepal.Length <- NULL
rf_model2 <- randomForest(Species ~ .,
data = train2)
rf_prediction2 <- predict(rf_model2, newdata = test2)
rf_prediction2 == rf_prediction
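If you have several stored models to score this way, the same padding generalizes to a small helper. This is just a sketch, and pad_missing_columns is a name made up here; it is only safe because the padded columns are excluded from the model formula, so their values never reach the trees.
pad_missing_columns <- function(newdata, train_columns) {
  # add any columns the training data had but newdata lacks, filled with 0
  missing <- setdiff(train_columns, colnames(newdata))
  if (length(missing) > 0) newdata[, missing] <- 0
  newdata
}
test_padded <- pad_missing_columns(test, colnames(train))
rf_prediction3 <- predict(rf_model, newdata = test_padded)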
I am trying to run a randomForest model on the iris data without the variable Petal.Length. The code gives me errors on prediction. How can I code this properly? Thanks for your help.
Richard
data(iris)
attach(iris)
iris$id <- 1:nrow(iris)
library(dplyr)
train <- iris %>%
  sample_frac(0.8)
test <- iris %>%
  anti_join(train, by = "id")
library(randomForest)
library(caret)
fit <- randomForest(Species ~ Sepal.Length + Sepal.Width + Petal.Width,
                    data = train)
prediction <- predict (fit, test [1:2 , 4])
confusionMatrix (test$Species,prediction)
Your subsetting of the test dataset is wrong. Just use
prediction <- predict (fit, newdata = test)
in place of
predict (fit, test [1:2 , 4])
It will automatically take the required independent variables.
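To see why this works, here is a minimal sketch on iris (fit2 and newdata are names made up here): predict() looks up the model's formula variables by name in newdata, so any data frame containing those columns is enough.
library(randomForest)
data(iris)
set.seed(42)
fit2 <- randomForest(Species ~ Sepal.Length + Sepal.Width + Petal.Width,
                     data = iris)
# predict() matches the formula's variables by name, so a data frame holding
# just these three columns (in any order) is sufficient
newdata <- iris[1:5, c("Petal.Width", "Sepal.Width", "Sepal.Length")]
predict(fit2, newdata = newdata)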
Or you can use something like
prediction <- predict (fit, subset(test, select = -c(Petal.Length)))
In the prediction function, you have to supply all the numeric data used for training. Try this instead
prediction <- predict (fit, test[ , c(1:4)])
As an example, let's use the iris data set.
library(randomForest)
data(iris)
smp_size <- floor(0.75 * nrow(iris))
train_ind <- sample(seq_len(nrow(iris)), size = smp_size)
train <- iris[train_ind, ]
test <- iris[-train_ind, ]
model <- randomForest(Species~., data = train, ntree=10)
If I use the getTree() function from the randomForest package, I can extract, for example, the third tree without any problem.
treefit <- getTree(model, 3)
But how can I use that (i.e. treefit) to make predictions on the test set? Is there a function out there, like predict(), that can do this directly?
Thank you in advance
You can use the predict function in the randomForest package directly by setting the predict.all argument to TRUE.
See the following reproducible code for how to use this; also see the help page for predict.randomForest.
library(randomForest)
set.seed(1212)
x <- rnorm(100)
y <- rnorm(100, x, 10)
df_train <- data.frame(x=x, y=y)
x_test <- rnorm(20)
y_test <- rnorm(20, x_test, 10)
df_test <- data.frame(x = x_test, y = y_test)
rf_fit <- randomForest(y ~ x, data = df_train, ntree = 500)
# You get a list with the overall predictions and individual tree predictions
rf_pred <- predict(rf_fit, df_test, predict.all = TRUE)
rf_pred$individual[, 3] # Obtains the 3rd tree's predictions on the test data
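As a quick sanity check (reusing rf_fit and rf_pred from above): the individual predictions form one column per tree, and for a regression forest the aggregate prediction is simply the per-row mean of those columns.
dim(rf_pred$individual) # 20 test rows x 500 trees
# should be TRUE: the ensemble prediction averages the per-tree predictions
all.equal(unname(rf_pred$aggregate), unname(rowMeans(rf_pred$individual)))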
In fact, there is a similar question and answer, but it does not work for me; see below. The trick lies in rewriting the fit function of lmFuncs.
"Error in { : task 1 failed - "Results do not have equal lengths", many warning:glm.fit: fitted probabilities numerically 0 or 1 occurred"
Where is the fault?
lmFuncs$fit <- function(x, y, first, last, ...)
{
  tmp <- as.data.frame(x)
  tmp$y <- y
  glm(y ~ ., data = tmp, family = binomial(link = 'logit'))
}
ctrl <- rfeControl(functions = lmFuncs, method = 'cv', number = 10)
fit.rfe <- rfe(df.preds, df.depend, rfeControl = ctrl)
And in the rfeControl help, it is said that the parameter 'functions' can be used with caret's train function (caretFuncs). What does that really mean?
Any details and example? Thanks
I was having a similar issue with customising lmFuncs.
For logistic regression, make sure you use lrFuncs and set sizes equal to the number of predictor variables. This leads to no issues.
Example (for functionality purposes only)
library(caret)
#Reproducible data
set.seed(1)
x <- data.frame(replicate(28, runif(10))) # 28 random predictor columns, 10 rows
x$dpen <- sample(c(0,1), replace=TRUE, size=10)
x$dpen <- factor(x$dpen)
#Splitting the data into two parts based on outcome: 80% and 20%
index <- createDataPartition(x$dpen, p=0.8, list=FALSE)
trainSet <- x[ index,]
testSet <- x[-index,]
control <- rfeControl(functions = lrFuncs,
                      method = "cv",   # cross-validation
                      verbose = FALSE) # prevents copious amounts of output from being produced
##RFE
rfe_fit <- rfe(trainSet[, 1:28], # predictor variables
               trainSet[, 29],   # outcome variable (dpen)
               sizes = 1:28,     # subset sizes of predictors to evaluate
               rfeControl = control)
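Once that runs, caret's accessors summarize the result (rfe_fit is the name introduced in the call above):
rfe_fit             # prints resampled performance for each subset size
predictors(rfe_fit) # the predictor names retained by the best subset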
I'm trying to do an experiment in which I run several GLM models in R using the same variables but different training samples.
Here is some simulated data:
resp <- sample(0:1,100,TRUE)
x1 <- c(rep(5,20),rep(0,15), rep(2.5,40),rep(17,25))
x2 <- c(rep(23,10),rep(5,10), rep(15,40),rep(1,25), rep(2, 15))
dat <- data.frame(resp,x1, x2)
This is the loop I'm trying to use:
n <- 5
for (i in 1:n)
{
### Create training and testing data
## 80% of the sample size
# Note that I didn't use seed so that random split is performed every iteration.
smp_sizelogis <- floor(0.8 * nrow(dat))
train_indlogis <- sample(seq_len(nrow(dat)), size = smp_sizelogis)
trainlogis <- dat[train_indlogis, ]
testlogis <- dat[-train_indlogis, ]
InitLOogModel[i] <- glm(resp ~ ., data =trainlogis, family=binomial)
}
But unfortunately, I'm getting this error:
Error in InitLOogModel[i] <- glm(resp ~ ., data = trainlogis, family = binomial) :
object 'InitLOogModel' not found
Any thoughts?
I'd suggest using caret for what you're trying to do. It takes some time to learn, but it incorporates many 'best practices'. Once you've learned the basics, you'll be able to quickly try models other than a glm and easily compare the models to each other. Here's modified code from your example to get you started.
## caret
library(caret)
# your data
resp <- sample(0:1,100,TRUE)
x1 <- c(rep(5,20),rep(0,15), rep(2.5,40),rep(17,25))
x2 <- c(rep(23,10),rep(5,10), rep(15,40),rep(1,25), rep(2, 15))
dat <- data.frame(resp,x1, x2)
# so caret knows you're trying to do classification, otherwise will give you an error at the train step
dat$resp <- as.factor(dat$resp)
# create a hold-out set to use after your model fitting
# not really necessary for your example, but showing for completeness
train_index <- createDataPartition(dat$resp, p = 0.8,
list = FALSE,
times = 1)
# create your train and test data
train_dat <- dat[train_index, ]
test_dat <- dat[-train_index, ]
# repeated cross validation, repeated 5 times
# this is like your 5 loops, taking 80% of the data each time
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5)
# fit the glm!
glm_fit <- train(resp ~ ., data = train_dat,
method = "glm",
family = "binomial",
trControl = fitControl)
# summary
glm_fit
# best model
glm_fit$finalModel
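As an aside, the error itself just means InitLOogModel was never created before the loop tried to assign into it. If you do want the plain-loop version, a minimal sketch of the fix (reusing dat from above) is:
n <- 5
# create the container before the loop starts; this was the missing step
InitLOogModel <- vector("list", n)
for (i in 1:n)
{
  smp_sizelogis <- floor(0.8 * nrow(dat))
  train_indlogis <- sample(seq_len(nrow(dat)), size = smp_sizelogis)
  trainlogis <- dat[train_indlogis, ]
  # use [[i]] so each list element stores a complete glm object
  InitLOogModel[[i]] <- glm(resp ~ ., data = trainlogis, family = binomial)
}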
How can I use the result of a randomForest call in R to predict labels on some unlabeled data (e.g. real-world input to be classified)?
Code:
library(randomForest)
train_data = read.csv("train.csv", sep = ";") # the sample files use ";" separators
input_data = read.csv("input.csv", sep = ";")
result_forest = randomForest(Label ~ ., data=train_data)
labeled_input = result_forest.predict(input_data) # I need something like this
train.csv:
a;b;c;label;
1;1;1;a;
2;2;2;b;
1;2;1;c;
input.csv:
a;b;c;
1;1;1;
2;1;2;
I need to get something like this
a;b;c;label;
1;1;1;a;
2;1;2;b;
Let me know if this is what you are getting at.
You train your randomforest with your training data:
# Training dataset (note the ";" separator in the sample files)
train_data <- read.csv("train.csv", sep = ";")
# Train the randomForest
forest_model <- randomForest(label ~ ., data = train_data)
Now that the random forest is trained, you want to give it new data so it can predict what the labels are.
input_data$predictedlabel <- predict(forest_model, newdata=input_data)
The above code adds a new column to your input_data showing the predicted label.
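If you also need the result written back out in the same semicolon-separated layout as the input files, something like this would do it (the output file name is made up):
# write the labeled rows back out with ";" separators, matching input.csv
write.table(input_data, "labeled_input.csv",
            sep = ";", row.names = FALSE, quote = FALSE)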
You can use the predict function. For example:
library(randomForest)
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])
This is from http://ugrad.stat.ubc.ca/R/library/randomForest/html/predict.randomForest.html
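To check those predictions against the held-out labels (reusing ind and iris.pred from above), a simple cross-tabulation works:
table(observed = iris[ind == 2, "Species"], predicted = iris.pred)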