Error message when using predict with LARS model on testdata - r

I use a lars model and apply it to a large data set (75 features) with numerical data and factors.
I train the model by
mm <- model.matrix(target~0+.,data=data)
larsMod <- lars(mm,data$target,intercept=FALSE)
which gives a nice in-sample fit. If I apply it to the test data by
mm.test <- model.matrix(target~0+.,data=test.data)
predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))
then I get the error message
Error in scale.default(newx, object$meanx, FALSE) :
length of 'center' must equal the number of columns of 'x'
I assume that it has to do with the fact that the factor levels differ between the data sets. However,
which(! colnames(mm.test) %in% colnames(mm) )
gives an empty result
while
which(! colnames(mm) %in% colnames(mm.test) )
gives 3 indices.
Thus 3 factor levels do appear in the training set but not in the test set.
Why does this cause a problem? How can I solve this?
The code below illustrates this with a toy example. In the test data set the factor does not have the level "l3".
require(lars)
data.train = data.frame( target = c(0,1,0,1,1,1,1,0,0,0), f1 = rep(c("l1","l2","l1","l2","l3"),2), n1 = rep(c(1,2,3,4,5),2))
test.data = data.frame(f1 = rep(c("l1","l2","l1","l2","l2"),2),n1 = rep(c(7,4,3,4,5),2) )
mm <- model.matrix(target~0+f1+n1,data = data.train)
colnames(mm)
length(colnames(mm))
larsMod <- lars(mm,data.train$target,intercept=FALSE)
mm.test <- model.matrix(~0+f1+n1,data=test.data)
colnames(mm.test)
length( colnames(mm.test) )
which(! colnames(mm.test) %in% colnames(mm) )
which(! colnames(mm) %in% colnames(mm.test) )
predict(larsMod,mm.test,type="fit",s=length(larsMod$arc.length))

I might be very much off here, but in my field predict doesn't work if it can't find a variable it expects. So I tried what happens if I force the model-matrix column to 0 for the factor level (f1l3) that is missing from the test data.
Note 1: I created a target variable in the test data, because I couldn't get your code to run otherwise.
set.seed(123)
test.data$target <- rbinom(nrow(test.data),1,0.2)
#proof of concept:
mm.test <- model.matrix(target~0+f1+n1,data=test.data)
mm.test1 <- cbind(f1l3=0,mm.test)
predict(larsMod,mm.test1[,colnames(mm)],type="fit",s=length(larsMod$arc.length)) #runs!
Now let's generalize this to build a 'complete' model matrix when factor levels are missing from the test data.
#missing columns
mis_col <- setdiff(colnames(mm), colnames(mm.test))
#matrix of missing levels
mis_mat <- matrix(0,ncol=length(mis_col),nrow=nrow(mm.test))
colnames(mis_mat) <- mis_col
#bind together
mm.test2 <- cbind(mm.test,mis_mat)[,colnames(mm)] #reorder to match mm; a different column order yielded different results in my testing
predict(larsMod,mm.test2,type="fit",s=length(larsMod$arc.length)) #runs
Note 2: I don't know what happens in the reverse situation (factor levels present in the test data that were not in the training data).
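For that reverse case, a plausible approach (untested here) is to drop the unseen columns from the test model matrix, since the model has no coefficients for them, and then pad and reorder as above; predictions for rows carrying those unseen levels should be treated with caution:
#columns in the test matrix that the model has never seen
extra_col <- setdiff(colnames(mm.test), colnames(mm))
mm.test3 <- mm.test[, setdiff(colnames(mm.test), extra_col), drop = FALSE]
#then pad missing training columns with 0 and reorder, as for mm.test2 above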

Related

Column changes from "WinorLoss" to "Class"

I am working on constructing a logistic model in R (I am a beginner and am following a tutorial on building logistic models). I have done the following; everything works, but when I run the downSample function the column named "WinorLoss" changes to "Class", and I am sure this will cause issues further on.
Could anyone please let me know if what I am doing makes sense or is there big errors I am making?
my_data <- read.csv('C:/Users/Magician/Desktop/R files/Fnaticfirstround.csv', header=TRUE)
my_data
str(my_data)
library(mlbench)
glm(Map ~ WinorLoss, family="binomial", data=my_data)
table(my_data$Map)
table(my_data$WinorLoss)
my_data$WinorLoss <- ifelse(my_data$WinorLoss == "W", 1,0)
my_data$WinorLoss <- factor(my_data$WinorLoss, levels = c(0,1))
my_data
table(my_data$WinorLoss)
library(caret)
'%ni%' <- Negate('%in%')
options(scipen=999)
set.seed(100)
trainDataIndex <- createDataPartition(my_data$WinorLoss, p=0.7, list=F)
trainData <- my_data[trainDataIndex, ]
testData <- my_data[-trainDataIndex, ]
trainData
testData
table(trainData$WinorLoss)
table(testData$WinorLoss)
set.seed(100)
down_train <- downSample(x = trainData[, colnames(trainData) %ni% "WinorLoss"],
y = trainData$WinorLoss)
down_train
When running trainData the columns returned are Date, Event, opponent, Map, Score, WinorLoss, winner, but when I run down_train the columns become Date, Event, opponent, Map, Score, winner, Class.
Help Please!
Yep, downSample and some of the other caret functions do that by default, unless specified otherwise.
If you have a question about a particular function, try its help page first.
?downSample
If you do this, you will see all of the arguments:
downSample(x, y, list = FALSE, yname = "Class")
So by default the function renames the y column to "Class", which is what you are seeing.
Thus, to get your desired output:
down_train <- downSample(x = trainData[, colnames(trainData) %ni% "WinorLoss"],
y = trainData$WinorLoss,
yname = "WinorLoss")

"Error in model.frame.default(data = train, formula = cost ~ .) : variable lengths differ", but all variables are length 76?

I'm modeling burrito prices in San Diego to determine whether some burritos are over- or under-priced (according to the model). I'm attempting to use regsubsets() to select the best linear model, by BIC, on a data frame of 76 observations of 14 variables. However, I keep getting an error saying that variable lengths differ, and thus a linear model can't be fit.
I've tried rounding all the observations in the data frame to one decimal place, I've used the length() function on each variable in the data frame to make sure they're all the same length, and before I made the model I used na.omit() on the data frame to make sure no NAs were present. By the way, the original dataset can be found here: https://www.kaggle.com/srcole/burritos-in-san-diego. I cleaned it up a bit in Excel first, removing all the categorical variables that appeared after the "overall" column.
burritos <- read.csv("/Users/Jack/Desktop/R/STOR 565 R Projects/Burritos.csv")
burritos <- burritos[ ,-c(1,2,5)]
burritos <- na.exclude(burritos)
burritos <- round(burritos, 1)
library(leaps)
library(MASS)
yelp <- burritos$Yelp
google <- burritos$Google
cost <- burritos$Cost
hunger <- burritos$Hunger
tortilla <- burritos$Tortilla
temp <- burritos$Temp
meat <- burritos$Meat
filling <- burritos$Meat.filling
uniformity <- burritos$Uniformity
salsa <- burritos$Salsa
synergy <- burritos$Synergy
wrap <- burritos$Wrap
overall <- burritos$overall
variable <- sample(1:nrow(burritos), 50)
train <- burritos[variable, ]
test <- burritos[-variable, ]
null <- lm(cost ~ 1, data = train)
full <- regsubsets(cost ~ ., data = train) #This is where error occurs
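A likely cause, judging from the error text: in cost ~ ., model.frame() looks for cost inside train first; since the column there is named Cost (capital C), R falls back to the lower-case cost vector created above from the full 76-row data frame, while . expands to the 50-row columns of train, hence the length mismatch. A minimal sketch of the fix under that assumption:
#refer to the Cost column inside train rather than the 76-row global vector
full <- regsubsets(Cost ~ ., data = train)
summary(full)$bic #BIC of the best model of each size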

How to run a loop inside a loop for a gam object

I am trying to predict new observations after multiple imputation. Both the newdata and the model to use are list objects. The correctness of the approach is not the issue; the question is how to use the predict function after multiple imputation when the new data is a list. Below is my code.
library(betareg)
library(mice)
library(mgcv)
data(GasolineYield)
dat1 <- GasolineYield
dat1$yield <- with(dat1,
ifelse(yield > 0.40 | yield < 0.17,NA,yield)) # created missing values
datim <- mice(dat1,m=30) #imputing missing values
mod1 <- with(datim,gam(yield ~ batch + emp,family=betar(link="logit"))) #fit models using gam
Creating the data set to be used for prediction:
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
The code below just tests out predict without newdata. The problem I observed was that tp ends up as a 1-by-32 vector instead of a 30-by-32 matrix: print shows all 30 sets of predictions, but I couldn't save them as such.
tot <- 0
for(i in 1:30){
tot <- mod1$analyses[[i]]
tp <- predict.gam(tot,type = "response")
print(tp)
}
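(A side note on why tp ends up 1 by 32: it is overwritten on every pass, so only the last model's fitted values survive the loop. A minimal sketch that keeps all 30 rows, assuming the 32 observations of GasolineYield:)
tp <- matrix(NA_real_, nrow = 30, ncol = 32) #one row per imputed model
for(i in 1:30){
tp[i,] <- predict.gam(mod1$analyses[[i]], type = "response")
}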
The code below is my attempt to predict new observations using newdata. Here I am just lost; I am not sure how to go about it.
datnew <- complete(datim,"long")
datsplit <- split(datnew,datnew$.imp)
tot <- 0
for(i in 1:30){
tot <- mod1$analyses[[i]]
tp <- predict.gam(tot,newdata=datsplit[[i]], type = "response")
print(tp)
}
Can someone help me out on how best to go about it?
I finally solved the problem. Here is the solution:
datnew <- complete(datim,"long") #stack all the imputed data sets
Though I have to point out that this should be your new data set; I am assuming it is not used in building the model. My aim in opening this thread was to address the question of how to predict observations using new data after multiple imputation, i.e. with a model built on a multiply imputed data set.
datsplit <- split(datnew,datnew$.imp)
tot <- list()
tot_ <- list()
for(i in 1:30){
for(j in 1:30){
tot[[j]] <- predict.gam(mod1$analyses[[i]],newdata=datsplit[[j]])
}
tot_[[i]] <- tot
}
# flatten the lists within lists (flatten() comes from purrr, which also re-exports the %>% pipe)
library(purrr)
totfl <- tot_ %>% flatten()
#nrow is the number of observations to be predicted as contained in the
#newdata set (datsplit)
totn <- matrix(unlist(totfl),nrow=32)
apply(totn,1,mean) #takes the means of prediction across the 30 data set
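Equivalently, rowMeans(totn) gives the same averaged predictions with less typing:
rowMeans(totn) #same result as apply(totn,1,mean)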
I hope this helps those with similar questions. I once came across a question on how to predict on newdata after multiple imputation; I guess this will answer some of the questions contained in that thread.

Error with "-" not meaningful for factors [duplicate]

This question already has an answer here: R error "sum not meaningful for factors" (1 answer). Closed 3 years ago.
I am trying to perform cross validation for my data set using random forest.
My response variable is of datatype factor with 2 levels (1, 2).
I am using this function below for my cross validation technique
library(randomForest) #for randomForest()
library(plyr) #for create_progress_bar()
k = 10
Imputed_data$id <- sample(1:k , nrow(Imputed_data), replace = TRUE)
list <- 1:k
prediction <- data.frame()
testsetcopy <- data.frame()
progress.bar <- create_progress_bar("text")
progress.bar$init(k)
for (i in 1:k){
trainingset <- subset(Imputed_data,id %in% list[-i])
testset <- subset(Imputed_data, id %in% c(i))
# run a random forest model
mymodel <- randomForest(trainingset$Accepted~ ., data = trainingset)
temp <- as.data.frame(predict(mymodel, testset[,-13]))
prediction <- rbind(prediction, temp)
testsetcopy <- rbind(testsetcopy, as.data.frame(testset[,13]))
progress.bar$step()
}
result <- cbind(prediction, testsetcopy[,1])
names(result) <- c("Predicted", "Actual")
result$Difference <-abs(result$Actual-result$Predicted)
summary(result$Difference)
I am getting an error on the line
result$Difference <-abs(result$Actual-result$Predicted)
In Ops.factor(result$Actual, result$Predicted) : '-' not meaningful for factors
I understand that abs() can't be used on factors and that - doesn't work on them either.
I am new to R and am unsure how I should then evaluate my results. Any lead will be helpful.
You can't subtract factors, nor can you use abs() on them; that much is clear.
The best way to show your results is in a cross table, try e.g.,
table(result$Predicted, result$Actual)
Or use caret's function:
confusionMatrix(result$Predicted, result$Actual)
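If you really do want a numeric difference, convert the factors to numbers first; a minimal sketch (the as.character() step matters because factors are stored internally as integer level codes, and your levels are "1" and "2"):
actual <- as.numeric(as.character(result$Actual))
predicted <- as.numeric(as.character(result$Predicted))
result$Difference <- abs(actual - predicted)
summary(result$Difference)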

QDA | lengths of training and test data sets | How to split data in training and test data?

In QDA (Quadratic Discriminant Analysis), do I need to keep the lengths of the training and test data sets exactly the same? If not, how do you find a confusion matrix in such cases?
Here's pseudo data.
Because if I keep the training and test data sets at different lengths, it gives an error (using RStudio):
"Error in table(pred, true) : all arguments must have the same length".
I tried removing NAs using na.omit() on both data sets as well as on pred and true, and using na.action = na.exclude for qda(), but it didn't work.
After dividing the data set exactly in half, half as training and half as test, it worked perfectly after na.omit() on pred and true.
Following is the code used for both approaches. In approach 2, with the data split into equal halves, it worked fine.
library(MASS) #for qda()
#Approach 1: divide data age-wise
train <- vif_data$Age < 30
# there are around 400 values passing (TRUE) above condition and around 50 failing (FALSE)
train_vif <- vif_data[train,]
test_vif <- vif_data[!train,]
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(pred) # result: 399
length(true) # result: 47
#that's where it throws the error: "Error in table(zone_pred$class, train_vif) : all arguments must have the same length"
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
###############################
#Approach 2: divide data into random halves
train <- splitSample(dataset = vif_data, div = 2, path = "./", type = "csv")
train_data <- read.csv("splitSample_s1.csv")
test_data <- read.csv("splitSample_s2.csv")
#taking QDA
zone_qda <- qda(train_vif$Awareness~train_vif$Zone, na.action = na.exclude)
#compare QDA against test data
zone_pred <- predict(zone_qda, test_vif)
#omitting nulls
pred <- na.omit(zone_pred$class)
true <- na.omit(test_vif$Awareness)
length(train_vif)
# this works fine
zone_aware <- table(zone_pred$class, train_vif)
# OR
zone_aware <- table(pred, true)
accur <- mean(zone_pred$class==test_vif$Awareness)
I want to know if there is any method by which we can get a confusion matrix when the data set is unequally divided into training and test sets.
Thanks!
Are you plugging in your training inputs instead of your test-set inputs to predict? Notice how this yields the same error message:
table(c(1,2),c(1,2,3))
If pred isn't the right length, then you're probably predicting incorrectly. Since pred has length 399 (roughly the size of your training set) rather than 47, predict() is almost certainly returning fitted values for the training rows: qda(train_vif$Awareness ~ train_vif$Zone, ...) hard-codes the training vectors into the formula, so predict(zone_qda, test_vif) cannot map the columns of test_vif onto them. There is no reason you shouldn't be able to get a confusion matrix using test data of a different size than your training data.
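A minimal sketch of the usual fix, assuming vif_data has columns Awareness and Zone: fit through the formula-plus-data interface so that predict() knows how to use new rows:
library(MASS)
zone_qda <- qda(Awareness ~ Zone, data = train_vif, na.action = na.exclude)
zone_pred <- predict(zone_qda, newdata = test_vif) #one prediction per test row
table(pred = zone_pred$class, true = test_vif$Awareness)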
