Subscript out of bound error in predict function of randomforest - r

I am using random forest for prediction, and the predict(fit, test_feature) line gives the following error. Can someone help me overcome this? I followed the same steps with another dataset and got no error, but I get one here.
Error: Error in x[, vname, drop = FALSE] : subscript out of bounds
training_index <- createDataPartition(shufflled[,487], p = 0.8, times = 1)
training_index <- unlist(training_index)
train_set <- shufflled[training_index,]
test_set <- shufflled[-training_index,]
accuracies <- c()
k <- 10
n <- floor(nrow(train_set) / k)
for (i in 1:k) {
  sub1 <- ((i - 1) * n + 1)
  sub2 <- (i * n)
  subset <- sub1:sub2
  train <- train_set[-subset, ]
  test <- train_set[subset, ]
  test_feature <- test[, -487]
  True_Label <- as.factor(test[, 487])
  fit <- randomForest(x = train[, -487], y = as.factor(train[, 487]))
  prediction <- predict(fit, test_feature) # The error line
  correctlabel <- prediction == True_Label
  t <- table(prediction, True_Label)
}

I had a similar problem a few weeks ago.
To get around it, you can do this:
df$label <- factor(df$label)
Instead of as.factor, try the factor function. Also, try naming your label variable first.

Are there identical column names in your training and validation x?
I had the same error message and solved it by renaming my column names because my data was a matrix and their colnames were all empty, i.e. "".
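A minimal sketch of the renaming fix described above, using only base R. The matrices train_x and test_x and the "V1", "V2" names are hypothetical stand-ins, not the asker's data; the point is only that both objects end up with the same non-empty column names.

```r
# Hypothetical unnamed matrices standing in for the training and test predictors.
train_x <- matrix(1:6, nrow = 3)
test_x  <- matrix(7:12, nrow = 3)

# Build one set of explicit names and assign it to both matrices,
# so that a lookup by column name finds the same columns in each.
feature_names <- paste0("V", seq_len(ncol(train_x)))
colnames(train_x) <- feature_names
colnames(test_x)  <- feature_names

names_match <- identical(colnames(train_x), colnames(test_x))
print(names_match)
```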

Your question is not very clear, but I'll try to help.
First, check your data to see how the levels of your predictors and outcome are distributed.
You may find that some predictor or outcome levels are highly skewed or very rare. I got that error when trying to predict a very rare outcome with a heavily tuned random forest: some of the predictor levels were not actually in the training data. A factor level then appears in the test data that the model treats as out of bounds.
Alternatively, check the names of your variables before calling predict() to make sure they match.
Without your data files, it's hard to tell why your first example worked.
For example, you can try:
names(test) <- names(train)
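Before overwriting the names, it may help to see which ones actually differ. The sketch below uses two small hypothetical data frames (not the asker's data) and base R's setdiff(). Note that names(test) <- names(train) only makes sense when both objects hold the same columns in the same order.

```r
# Hypothetical data frames; the second column is named "b" in one and "B" in the other.
train <- data.frame(a = 1:3, b = 4:6, label = c("x", "y", "x"))
test  <- data.frame(a = 7:8, B = 9:10, label = c("y", "x"))

# Columns the model would look for but the test data does not have.
missing_in_test <- setdiff(names(train), names(test))

# Columns present only in the test data.
extra_in_test <- setdiff(names(test), names(train))

print(missing_in_test)
print(extra_in_test)
```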

Add the expression
dimnames(test_feature) <- NULL
before
prediction <- predict(fit, test_feature)

Related

KNN in R -- All arguments must have the same length, test.X is empty

I'm trying to perform KNN in R on a dataframe, following 3-way classification for vehicle types (car, boat, plane), using columns such as mpg, cost as features.
To start, when I run:
knn.pred=knn(train.X,test.X,train.VehicleType,k=3)
then
knn.pred
returns
factor(0) Levels: car boat plane
And
table(knn.pred,VehicleType.All)
returns
Error in table(knn.pred, VehicleType.All) :
all arguments must have the same length
I think my problem is that I can successfully load train.X with cbind() but when I try the same for test.X it remains an empty matrix. My code looks like this:
train=(DATA$Values<=200) # to train for all 200 entries including cars, boats and planes
train.X = cbind(DATA$mpg,DATA$cost)[train,]
summary(train.X)
Here, summary(train.X) returns correctly, but when I try the same for test.X:
test.X = cbind(DATA$mpg,DATA$cost)[!train,]
When I try and print test.X it returns an empty matrix like so:
[,1] [,2]
Apologies for such a long question and I'm probably not including all relevant info. If anyone has any idea what's going wrong here or why my test.X isn't loading through any data I'd appreciate it!
Without any info on your data, it is hard to guess where the problem is. You should post a minimal reproducible example, or at least dput your data or part of it. However, here I show two methods for training a knn model, using two different packages (class and caret), with the built-in mtcars dataset.
with class
library(class)
data("mtcars")
str(mtcars)
mtcars$gear <- as.factor(mtcars$gear)
ind <- sample(1:nrow(mtcars), 20)
train.VehicleType <- mtcars[ind, "gear"]
VehicleType.All <- mtcars[-ind, "gear"]
# keep the label out of the predictors, otherwise knn uses gear to predict gear
train.X <- mtcars[ind, names(mtcars) != "gear"]
test.X <- mtcars[-ind, names(mtcars) != "gear"]
knn.pred <- knn(train.X, test.X, train.VehicleType, k = 3)
table(knn.pred, VehicleType.All)
with caret
library(caret)
ind <- createDataPartition(mtcars$gear,p=0.60,list=F)
train.X <- mtcars[ind,]
test.X <- mtcars[-ind,]
control <-trainControl(method = "cv",number = 10)
grid <- expand.grid(k=2:10)
knn.pred <- train(gear~.,data=train.X,method="knn",tuneGrid=grid)
pred <- predict(knn.pred,test.X[,-10])
cm <- confusionMatrix(pred,test.X$gear)
The caret package makes it straightforward to tune parameters by resampling during model fitting. By default, train uses 25 bootstrap resamples to find the best value of k among the values supplied in the grid object. To use the 10-fold cross-validation defined in control instead, pass trControl = control to train.
From your example, it seems that your test object is empty, so the result of knn is a 0-length vector. The problem is probably in how the data is read. In any case, a better way to subset your DATA is this:
# instead of
train.X = cbind(DATA$mpg,DATA$cost)[train,]
# you should do:
train.X <- DATA[train,c("mpg","cost")]
# train is a logical vector, so negate it with !, not with -
test.X <- DATA[!train,c("mpg","cost")]
However, I do not understand what the variable DATA$Values is. At first I thought it was the outcome, but this line confused me:
train=(DATA$Values<=200)
If you can't post an example that reproduces your situation, you can work through these examples to find the error on your own.
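One possible cause of the empty test.X, which the question does not confirm, is that the condition used to build train is TRUE for every row, so !train selects nothing. The sketch below uses a small made-up DATA frame (the values are hypothetical) to show how a logical index splits rows and what happens when the condition matches all of them.

```r
# Hypothetical data; only the column names follow the question.
DATA <- data.frame(Values = c(50, 150, 250, 350),
                   mpg    = c(30, 25, 5, 2),
                   cost   = c(1, 2, 30, 40))

# A logical index: TRUE for rows that go into the training set.
train <- DATA$Values <= 200

train.X <- DATA[train,  c("mpg", "cost")]
test.X  <- DATA[!train, c("mpg", "cost")]   # negate with !, not with -

# If the condition holds for every row, the complement is empty.
all_rows  <- DATA$Values <= 1000
empty.X   <- DATA[!all_rows, c("mpg", "cost")]

print(nrow(train.X))
print(nrow(test.X))
print(nrow(empty.X))
```

Checking table(train) or sum(!train) before calling knn is a quick way to see whether any rows are left for the test set.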

Error with RandomForest in R because of "too many categories"

I'm trying to train an RF model in R, but when I try to define the model:
rf <- randomForest(labs ~ .,data=as.matrix(dd.train))
It gives me the error:
Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 53 categories.
Any idea what it could be?
And no, before you say "You have some categorical variable with more than 53 categories": all variables except labs are numeric.
Tim Biegeleisen: Read the last line of my question and you will see why it is not the same as the one you linked!
Edited to address followup from OP
I believe using as.matrix in this case implicitly creates factors. It is also not necessary for this package. You can keep the data as a data frame, but you will need to make sure that any unused factor levels are dropped by using droplevels (or something similar). There are many reasons an unused factor level may be in your data set, but a common one is a dropped observation.
Below is a quick example that reproduces your error:
library('randomForest')
#making a toy data frame
x <- data.frame('one' = c(1,1,1,1,1,seq(50) ),
                'two' = c(seq(54),NA),
                'three' = seq(55),
                'four' = seq(55) )
x$one <- as.factor(x$one)
x <- na.omit(x) #getting rid of an NA. Note this removes the whole row.
randomForest(one ~., data = as.matrix(x)) #your first error
randomForest(one ~., data = x) #your second error
x <- droplevels(x)
randomForest(one ~., data = x) #OK
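The effect of droplevels can be seen without fitting any model. This is a small base R sketch with a made-up three-row data frame (hypothetical values): removing a row leaves its factor level behind until droplevels is called.

```r
# Hypothetical toy data: the only row with level "c" also contains an NA.
x <- data.frame(one = factor(c("a", "b", "c")),
                two = c(1, 2, NA))

x <- na.omit(x)                    # drops the third row entirely
levels_before <- nlevels(x$one)    # level "c" is still recorded, with zero rows

x <- droplevels(x)                 # remove levels that no longer occur
levels_after <- nlevels(x$one)

print(levels_before)
print(levels_after)
```

table(x$one) before the droplevels call is another way to spot levels with a count of zero.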

R: factor as new level when I predict with test data

I am getting an error with my dataset; the code I posted below follows the same logic and reproduces it. I have tried increasing the amount of training data, but that didn't solve it. I have already excluded all NA values.
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
factor y has new levels L, X
set.seed(234)
d <- data.frame(w=abs(rnorm(50)*1000),
x=rnorm(50),
y=sample(LETTERS[1:26], 50, replace=TRUE))
train_idx <- sample(1:nrow(d), floor(0.8*nrow(d)))
train <- d[train_idx,]
test <- d[-train_idx,]
fit <- lm(w ~x + y, data=train)
predict(fit, test)
As @jdobres has already explained why this error occurs, I'll jump straight to the solution:
Try the line of code below just before the predict statement:
#add all levels of 'y' in 'test' dataset to fit$xlevels[["y"]] in the fit object
fit$xlevels[["y"]] <- union(fit$xlevels[["y"]], levels(test[["y"]]))
Hope this would resolve your problem!
Factor and character data are treated as categorical variables. As such, models cannot form predictions for category labels they've never seen before. If you built a model to predict things about "poodle" and "pit bull", the model would fail if you gave it "golden retriever".
More specific to your example, the error is telling you that labels "L" and "X", which are in your test set, do not appear in your training set. Since they weren't in the training set, the model doesn't know what to do when it encounters these in the test.
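To see which labels are the problem before calling predict, the two sets of labels can be compared directly. The sketch below uses small hypothetical train and test frames (not the question's random data) and simply leaves out the test rows whose label was never seen in training. That is one conservative option among several, not the approach either answer proposes.

```r
# Hypothetical data: label "L" occurs only in the test set.
train <- data.frame(w = c(1, 2, 3, 4, 5, 6),
                    x = c(0.1, 0.4, 0.2, 0.9, 0.5, 0.7),
                    y = c("A", "A", "B", "B", "C", "C"))
test  <- data.frame(w = c(2, 3, 4),
                    x = c(0.3, 0.6, 0.8),
                    y = c("A", "L", "C"))

# Labels present in the test set but absent from the training set.
unseen <- setdiff(unique(test$y), unique(train$y))

# Keep only test rows whose label the model has seen.
test_known <- test[!(test$y %in% unseen), ]

fit  <- lm(w ~ x + y, data = train)
pred <- predict(fit, test_known)

print(unseen)
print(length(pred))
```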
Thanks Prem, and if you have many variables you can loop the line of code like this:
for(k in vars){
  if(is.factor(shop_data[,k])){
    ols_fit$xlevels[[k]] <- union(ols_fit$xlevels[[k]],levels(shop_data[[k]]))
  }
}
vars are the variables used in the model, shop_data is the main dataset which is split into train and test

Error in predict.randomForest

I was hoping someone would be able to help me out with an issue I am having with the prediction function of the randomForest package in R. I keep getting the same error when I try to predict my test data:
Here's my code so far:
extractFeatures <- function(RCdata) {
  features <- c(4, 9:13, 17:20)
  fea <- RCdata[, features]
  fea$Week <- as.factor(fea$Week)
  fea$Age_Range <- as.factor(fea$Age_Range)
  fea$Race <- as.factor(fea$Race)
  fea$Referral_Source <- as.factor(fea$Referral_Source)
  fea$Referral_Source_Category <- as.factor(fea$Referral_Source_Category)
  fea$Rehire <- as.factor(fea$Rehire)
  fea$CLFPR_.HS <- as.factor(fea$CLFPR_.HS)
  fea$CLFPR_HS <- as.factor(fea$CLFPR_HS)
  fea$Job_Openings <- as.factor(fea$Job_Openings)
  fea$Turnover <- as.factor(fea$Turnover)
  return(fea)
}
gp <- runif(nrow(RCdata))
RCdata <- RCdata[order(gp), ]
train <- RCdata[1:4600, ]
test <- RCdata[4601:6149, ]
rf <- randomForest(extractFeatures(train), suppressWarnings(as.factor(train$disposition_category)), ntree=100, importance=TRUE)
testpredict <- predict(rf, extractFeatures(test))
"Error in predict.randomForest(rf, extractFeatures(test)) :
Type of predictors in new data do not match that of the training data."
I have tried adding in the following line to the code, and still receive the same error:
testpredict <- predict(rf, extractFeatures(test), type="prob")
I found the source of the error being the fact that the training data has a level or two that is not found in the test data. So when I tried another suggestion I found online to adjust the levels of the test data to that of the training data, I keep getting NULL values in the fields I am using in both the training and test sets.
levels(test$Referral)
NULL
I can see the levels when I use the function, however.
levels(as.factor(test$Referral))
So then I tried the same suggestion I found online with adjusting the levels of the test to equal that of the training data using the following function and received an error:
levels(as.factor(test$Referral)) -> levels(as.factor(train$Referral))
Error in `levels<-.factor`(`*tmp*`, value = c(... :
number of levels differs
I am sure there is something simple I am missing (I am still very new to R), so any insight you can provide would be unbelievably helpful. Thanks!
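A hedged sketch of one common way to line up the levels, using base R only. The vectors train_ref and test_ref are hypothetical stand-ins for a column such as Referral. Two things are illustrated: levels() returns NULL on a character vector, which would explain the NULL above if the column is stored as character, and factor(..., levels = ...) builds the test factor on the training levels instead of assigning levels afterwards.

```r
# Hypothetical character columns; the test set lacks the "agency" category.
train_ref <- c("web", "agency", "friend", "web")
test_ref  <- c("web", "friend")

# levels() on a plain character vector is NULL.
raw_levels <- levels(test_ref)

# Build the training factor first, then reuse its levels for the test data.
train_f <- factor(train_ref)
test_f  <- factor(test_ref, levels = levels(train_f))

same_levels <- identical(levels(train_f), levels(test_f))

print(raw_levels)
print(levels(test_f))
print(same_levels)
```

Any test value that does not occur in the training levels becomes NA under this approach, so it is worth checking anyNA(test_f) afterwards.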

Use of randomforest() for classification in R?

I originally had a data frame composed of 12 columns in N rows. The last column is my class (0 or 1). I had to convert my entire data frame to numeric with
training <- sapply(training.temp,as.numeric)
But then I thought I needed the class column to be a factor column to use the randomforest() tool as a classifier, so I did
training[,"Class"] <- factor(training[,ncol(training)])
I proceed to creating the tree with
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
But I'm getting two errors:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
I would appreciate it if someone could point out the formatting mistake I'm making.
Thanks!
So the issue is actually quite simple. It turns out my training data was an atomic vector. So it first had to be converted as a data frame. So I needed to add the following line:
training <- as.data.frame(training)
Problem solved!
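The reason the conversion matters can be checked in base R: sapply over the columns of a data frame returns a numeric matrix, and a matrix cannot hold a factor column. A minimal sketch, with a made-up training.temp (hypothetical values):

```r
# Hypothetical input whose columns are stored as text.
training.temp <- data.frame(f1    = c("1.5", "2.5", "3.5"),
                            Class = c("0", "1", "0"))

training <- sapply(training.temp, as.numeric)
was_matrix <- is.matrix(training)       # sapply simplified the result to a matrix

# Convert back to a data frame so that one column can be a factor.
training <- as.data.frame(training)
training$Class <- factor(training$Class)

print(was_matrix)
print(is.factor(training$Class))
print(levels(training$Class))
```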
First, your coercion to a factor is not working because of syntax errors. Second, you should always use indexing when specifying an RF model. Here are changes to your code that should make it work.
# sapply returns a matrix, which cannot hold a factor column, so convert to a data frame
training <- as.data.frame(sapply(training.temp,as.numeric))
training[,"Class"] <- as.factor(training[,"Class"])
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=training[,"Class"],
                            importance=TRUE, do.trace=100)
# You can also coerce to a factor directly in the model statement
training_rf <- randomForest(x=training[,1:(ncol(training)-1)], y=as.factor(training[,"Class"]),
                            importance=TRUE, do.trace=100)
