This question already has an answer here:
R error "sum not meaningful for factors"
(1 answer)
Closed 3 years ago.
I am trying to perform cross validation for my data set using random forest.
My response variable is of datatype factor with 2 levels (1, 2).
I am using this function below for my cross validation technique
k = 10
Imputed_data$id <- sample(1:k , nrow(Imputed_data), replace = TRUE)
list <- 1:k
prediction <- data.frame()
testsetcopy <- data.frame()
progress.bar <- create_progress_bar("text")
progress.bar$init(k)
for (i in 1:k){
trainingset <- subset(Imputed_data,id %in% list[-i])
testset <- subset(Imputed_data, id %in% c(i))
# run a random forest model
mymodel <- randomForest(trainingset$Accepted~ ., data = trainingset)
temp <- as.data.frame(predict(mymodel, testset[,-13]))
prediction <- rbind(prediction, temp)
testsetcopy <- rbind(testsetcopy, as.data.frame(testset[,13]))
progress.bar$step()
}
result <- cbind(prediction, testsetcopy[,1])
names(result) <- c("Predicted", "Actual")
result$Difference <-abs(result$Actual-result$Predicted)
summary(result$Difference)
I am getting a error in the line
result$Difference <-abs(result$Actual-result$Predicted)
In Ops.factor(result$Actual, result$Predicted) : ‘-’ not meaningful
for factors
I could understand that abs cant be used for factors and - is also not used.
I am new to R, and i am unsure how i could then calculate my result. Any lead will be helpful.
You can't subtract factors, nor can you use abs for factors. That was clear.
The best way to show your results is in a cross table, try e.g.,
table(result$predicted, result$Actual)
Or use caret's function:
confusionMatrix(result$predicted, result$Actual)
Related
We created a table in R with values from the S&P500 and added rows like the simple 10 Days Moving Average. We set the NA-values to 0. Example:
myStartDate <- '2020-01-01'
myEndDate <- Sys.Date()
Dataset$SMA10 <- SMA(Dataset[,"Close"], 10)
Dataset$SMA10 <- as.numeric(Dataset$SMA10)
Dataset$SMA10[is.na(Dataset$SMA10)] <- 0
Our goal is to create a random forest model. Therefore we split the data into a train and a valid data:
set.seed(100)
train <- sample(nrow(Dataset), 0.5*nrow(Dataset), replace = FALSE)
TrainSet <- Dataset [train,]
ValidSet <- Dataset [-train,]
Now if we want to generate the model with following code;
model1 <- randomForest(SMA10~.,data=TrainSet, mtry=5, importance=TRUE,ntree=500)
print(model1)
we get this error message:
Error in x[, i] <- frame[[i]] : number of items to replace is not a multiple of replacement length
By looking up this error in the forum, we found that this is related with NA-Values. Therefore we are a little confused, because we have no NA-Values in our table. Can you tell us what we are doing wrong? Thank you very much in advance.
I'm modeling burrito prices in San Diego to determine whether some burritos are over/under priced (according to the model). I'm attempting to use regsubsets() to determine the best linear model, using the BIC, on a data frame of 76 observations of 14 variables. However, I keep getting an error saying that variable lengths differ, and thus a linear model doesn't work.
I've tried rounding all the observations in the data frame to one decimal place, I've used the length() function on each variable in the data frame to make sure they're all the same length, and before I made the model I used na.omit() on the data frame to make sure no NAs were present. By the way, the original dataset can be found here: https://www.kaggle.com/srcole/burritos-in-san-diego. I cleaned it up a bit in Excel first, removing all the categorical variables that appeared after the "overall" column.
burritos <- read.csv("/Users/Jack/Desktop/R/STOR 565 R Projects/Burritos.csv")
burritos <- burritos[ ,-c(1,2,5)]
burritos <- na.exclude(burritos)
burritos <- round(burritos, 1)
library(leaps)
library(MASS)
yelp <- burritos$Yelp
google <- burritos$Google
cost <- burritos$Cost
hunger <- burritos$Hunger
tortilla <- burritos$Tortilla
temp <- burritos$Temp
meat <- burritos$Meat
filling <- burritos$Meat.filling
uniformity <- burritos$Uniformity
salsa <- burritos$Salsa
synergy <- burritos$Synergy
wrap <- burritos$Wrap
overall <- burritos$overall
variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]
test <- burritos[-variable, ]
null <- lm(cost ~ 1, data = train)
full <- regsubsets(cost ~ ., data = train) #This is where error occurs
I am trying to predict new observations after multiple imputation. Both the newdata and the model to use are list objects. The correctness of the approach is not the issue but how to use the predict function after multiple imputation we I have a new data that is a list. Below are my code.
library(betareg)
library(mice)
library(mgcv)
data(GasolineYield)
dat1 <- GasolineYield
dat1 <- GasolineYield
dat1$yield <- with(dat1,
ifelse(yield > 0.40 | yield < 0.17,NA,yield)) # created missing values
datim <- mice(dat1,m=30) #imputing missing values
mod1 <- with(datim,gam(yield ~ batch + emp,family=betar(link="logit"))) #fit models using gam
creating data set to be used for prediction
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
the code below just testing out the predict without newdata. The problem I observed was that tp is saved as 1 by 32 matrix instead of 30 by 32 matrix. But the print option prints out a 30 by 32 but then I couldn't save it as such.
tot <- 0
for(i in 1:30){
tot <- mod1$analyses[[i]]
tp <- predict.gam(tot,type = "response")
print(tp)
}
the code below is me trying to predict new observation using newdata. Here I am just lost I am not sure how to go about it.
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
tot <- 0
for(i in 1:30){
tot <- mod1$analyses[[i]]
tp <- predict.gam(tot,newdata=datsplit[[i]], type = "response")
print(tp)
}
Can someone help me out on how best to go about it?
I finally find solved the problem. Here is the solution:
datnew <- complete(datim,"long")# stack all the imputation data
though I have to point out that this should be your new dataset
I am assuming that this is not used in building the model. My aim of opening this #thread was to address the question of how to predict observations using new data after multiple imputation/using model built with multiple imputation dataset.
datsplit <- split(datnew,datnew$.imp)
tot <- list()
tot_ <- list()
for(i in 1:30){
for(j in 1:30){
tot[[j]] <- predict.gam(mod1$analyses[[i]],newdata=datsplit[[j]])
}
tot_[[i]] <- tot
}
# flatten the lists within lists
totfl <- tot_ %>% flatten()
#nrow is the number of observations to be predicted as contained in the
#newdata set (datsplit)
totn <- matrix(unlist(totfl),nrow=32)
apply(totn,1,mean) #takes the means of prediction across the 30 data set
I hope this helps those with similar questions. I once came across a question on how to predict newdata after multiple imputation, I guess this will answer some of the questions contained in that thread.
I am using random forest for prediction and in the predict(fit, test_feature) line, I get the following error. Can someone help me to overcome this. I did the same steps with another dataset and had no error. but I get error here.
Error: Error in x[, vname, drop = FALSE] : subscript out of bounds
training_index <- createDataPartition(shufflled[,487], p = 0.8, times = 1)
training_index <- unlist(training_index)
train_set <- shufflled[training_index,]
test_set <- shufflled[-training_index,]
accuracies<- c()
k=10
n= floor(nrow(train_set)/k)
for(i in 1:k){
sub1<- ((i-1)*n+1)
sub2<- (i*n)
subset<- sub1:sub2
train<- train_set[-subset, ]
test<- train_set[subset, ]
test_feature<- test[ ,-487]
True_Label<- as.factor(test[ ,487])
fit<- randomForest(x= train[ ,-487], y= as.factor(train[ ,487]))
prediction<- predict(fit, test_feature) #The error line
correctlabel<- prediction == True_Label
t<- table(prediction, True_Label)
}
I had similar problem few weeks ago.
To go around the problem, you can do this:
df$label <- factor(df$label)
Instead of as.factor try just factor generic function. Also, try first naming your label variable.
Are there identical column names in your training and validation x?
I had the same error message and solved it by renaming my column names because my data was a matrix and their colnames were all empty, i.e. "".
Your question is not very clear, anyway I try to help you.
First of all check your data to see the distribution in levels of your various predictors and outcomes.
You may find that some of your predictor levels or outcome levels are very highly skewed, or some outcomes or predictor levels are very rare. I got that error when I was trying to predict a very rare outcome with a heavily tuned random forest, and so some of the predictor levels were not actually in the training data. Thus a factor level appears in the test data that the training data thinks is out of bounds.
Alternatively, check the names of your variables.
Before calling predict() to make sure that the variable names match.
Without your data files, it's hard to tell why your first example worked.
For example You can try:
names(test) <- names(train)
Add the expression
dimnames(test_feature) <- NULL
before
prediction <- predict(fit, test_feature)
I use a lars model and apply it to a large data set (75 features) with numerical data and factors.
I train the model by
mm <- model.matrix(target~0+.,data=data)
larsMod <- lars(mm,data$target,intercept=FALSE)
which gives a nice in-sample fit. If I apply it to testdata by
mm.test <- model.matrix(target~0+.,,data=test.data)
predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))
then I get the error message
Error in scale.default(newx, object$meanx, FALSE) :
length of 'center' must equal the number of columns of 'x'
I assume that it has todo with the fact that factor levels differ in the data sets. However
which(! colnames(mm.test) %in% colnames(mm) )
gives an empty result
while
which(! colnames(mm) %in% colnames(mm.test) )
gives 3 indizes.
Thus 3 factor levels do appear in the training set but not in the test set.
Why does this cause a problem? How can I solve this?
The code blow illustrates this with a toy example. In the test dataset the factor does not have the level "l3".
require(lars)
data.train = data.frame( target = c(0,1,0,1,1,1,1,0,0,0), f1 = rep(c("l1","l2","l1","l2","l3"),2), n1 = rep(c(1,2,3,4,5),2))
test.data = data.frame(f1 = rep(c("l1","l2","l1","l2","l2"),2),n1 = rep(c(7,4,3,4,5),2) )
mm <- model.matrix(target~0+f1+n1,data = data.train)
colnames(mm)
length(colnames(mm))
larsMod <- lars(mm,data.train$target,intercept=FALSE)
mm.test <- model.matrix(~0+f1+n1,data=test.data)
colnames(mm.test)
length( colnames(mm.test) )
which(! colnames(mm.test) %in% colnames(mm) )
which(! colnames(mm) %in% colnames(mm.test) )
predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))
I might be very much off here, but in my field predict doesn't work if it can't find a variable it expects. So I tried what happened if I forced the model matrix to 0 for the factor (f1l3) that was not in the test data.
Note1: I created a target variable in the testdata, because I couldn't get your code to run otherwise
set.seed(123)
test.data$target <- rbinom(nrow(test.data),1,0.2)
#proof of concept:
mm.test <- model.matrix(target~0+f1+n1,data=test.data)
mm.test1 <- cbind(f1l3=0,mm.test)
predict(larsMod,mm.test1[,colnames(mm)],type="fit",s=length(larsMod$arc.length)) #runs
#runs!
Now generalize to allow for creation of a 'complete' model matrix
when factors are missing in testdata.
#missing columns
mis_col <- setdiff(colnames(mm), colnames(mm.test))
#matrix of missing levels
mis_mat <- matrix(0,ncol=length(mis_col),nrow=nrow(mm.test))
colnames(mis_mat) <- mis_col
#bind together
mm.test2 <- cbind(mm.test,mis_mat)[,colnames(mm)] #to ensure ordering, yielded different results in my testing
predict(larsMod,mm.test2,type="fit",s=length(larsMod$arc.length)) #runs
Note2: I don't know what happens if the problem is the other way around (factors present in testdata that were not in train data)