R - GAM error - Missing value where TRUE/FALSE needed

I'm trying to train a binomial GAM model with the mgcv package, and running into this error:
Error in if (length(grad) > 0 && sum(uconv.ind) > 0) { : missing value where TRUE/FALSE needed
There are no columns in my data frame (of the columns which are included in the model) that have NAs in them. When I look at the unique values of the response column, it shows [1] 0 1 as expected.
Here is the code used to train the model:
mgcv::bam(formula = formula, family = binomial, data = df, select = T, discrete = T, method = 'fREML', nthreads = 32, drop.unused.levels = FALSE)
Any help would be greatly appreciated!
As requested, here is a screenshot of a random sample of the data. The data is related to my company so I can't give too much information away:
The final column is the response, and it is a numeric column. When I type df[!complete.cases(df), ], the result has 0 rows.
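A minimal diagnostic sketch in base R, not a confirmed cause of this error: complete.cases() only flags NA and NaN, so infinite values and constant columns pass it unnoticed. The toy data frame and its column names (x1, x2, y) are made up for illustration; the same checks could be run on the real df.

```r
# Toy data (hypothetical): x1 contains Inf, x2 is constant, y is the 0/1 response
df <- data.frame(x1 = c(0.2, 1.5, Inf, 3.1),
                 x2 = c(1, 1, 1, 1),
                 y  = c(0, 1, 0, 1))

# complete.cases() reports no incomplete rows, because Inf is not NA
n_incomplete <- sum(!complete.cases(df))

# Count non-finite values (NA, NaN, Inf, -Inf) in each numeric column
num_cols <- vapply(df, is.numeric, logical(1))
n_nonfinite <- vapply(df[num_cols], function(col) sum(!is.finite(col)), integer(1))

# Count distinct values per column; a count of 1 means the column is constant
n_distinct <- vapply(df, function(col) length(unique(col)), integer(1))

print(n_incomplete)
print(n_nonfinite)
print(n_distinct)
```

If either check turns something up, cleaning those columns before refitting is a cheap first experiment; if not, the problem lies elsewhere.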

Related

Polynomial Regression - Unused Arguments

I have a dataset with 269 rows and only two variables (A: measurements, B: the time-point at which it was registered, goes from 1 to 280).
I already removed all NaN values after removing outliers with a Hampel filter.
I am trying to model my data with a Polynomial Regression. I used the following command:
model <- lm(A~ poly(B, 15, raw = TRUE), data = data_for_model)
However, I get the following error:
Error in poly(B, 15, raw = TRUE) : unused arguments (15, raw = TRUE)
Can anyone help me in this?
Thank you in advance
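One possible explanation, offered as an assumption because the question does not show the session: an "unused arguments" error from poly() typically means that some other function called poly (user-defined or from an attached package) is masking stats::poly. The sketch below uses simulated data and a deliberately masking definition, both hypothetical, and a lower degree than the question to keep the toy fit well conditioned.

```r
set.seed(42)

# Simulated stand-in for data_for_model (the real data is not available)
data_for_model <- data.frame(B = 1:280)
data_for_model$A <- sin(data_for_model$B / 40) + rnorm(280, sd = 0.1)

# Hypothetical masking definition: a one-argument function named poly
poly <- function(x) x

# Calling the masked poly with extra arguments raises the "unused arguments" error
masked_error <- tryCatch(
  lm(A ~ poly(B, 3, raw = TRUE), data = data_for_model),
  error = function(e) conditionMessage(e)
)
print(masked_error)

# Naming the namespace explicitly bypasses the masking function
model <- lm(A ~ stats::poly(B, 3, raw = TRUE), data = data_for_model)
print(length(coef(model)))

# Removing the masking object restores the usual lookup of poly
rm(poly)
```

Typing poly at the console (without parentheses) shows which definition R currently finds, which helps confirm or rule out this explanation.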

Cross validation help: Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')

So I have a specific error that I can't figure out. By searching I am finding that the model and the cross validation set do not have the data with the same levels to fit the model. I am trying to understand completely with my use case. Basically I am building a QDA model to predict vehicle country based on numeric values. This code will run for anyone since it is a public google sheets document. For those of you who follow Doug Demuro on YouTube you may find this a tad bit interesting.
#load dataset into r
library(gsheet)
url = 'https://docs.google.com/spreadsheets/d/1KTArYwDWrn52fnc7B12KvjRb6nmcEaU6gXYehWfsZSo/edit'
doug_df = read.csv(text=gsheet2text(url, format='csv'), stringsAsFactors=FALSE,header=FALSE)
#begin cleanup. remove first blank rows of data
doug_df = doug_df[-c(1,2,3), ]
attach(doug_df)
#name columns appropriately
names(doug_df) = c("year","make","model","styling","acceleration","handling","fun factor","cool factor","total weekend score","features","comfort","quality","practicality","value","total daily score","dougscore","video duration","filming city","filming state","vehicle country")
#remove categorical columns and columns not used for the discriminant analysis, including the totals columns
library(dplyr)
doug_df = doug_df %>% dplyr::select (-c(make,model,`total weekend score`,`total daily score`,dougscore,`video duration`,`filming city`,`filming state`))
#convert from character to numeric
num.cols <- c("year","styling","acceleration","handling","fun factor","cool factor","features","comfort","quality","practicality","value")
doug_df[num.cols] <- sapply(doug_df[num.cols], as.numeric)
`vehicle country` = as.factor(`vehicle country`)
#create a new column to reflect groupings for response variable
doug_df$country.group=ifelse(`vehicle country`=='Germany','Germany',
ifelse(`vehicle country`=='Italy','Italy',
ifelse(`vehicle country`=='Japan','Japan',
ifelse(`vehicle country`=='UK','UK',
ifelse(`vehicle country`=='USA','USA','Other')))))
#remove the initial country column
doug_df = doug_df %>% dplyr::select (-c(`vehicle country`))
#QDA with multiple predictors
library(MASS)
qdafit1 = qda(country.group~styling+acceleration+handling+`fun factor`+`cool factor`+features+comfort+quality+value,data=doug_df)
#predict using model and compute error
n=dim(doug_df)[1]
fittedclass = predict(qdafit1,data=doug_df)$class
table(doug_df$country.group,fittedclass)
Error = sum(doug_df$country.group != fittedclass)/n; Error
#conduct k 10 fold cross validation
allpredictedCV1 = rep("NA",n)
cvk = 10
groups = c(rep(1:cvk,floor(n/cvk)))
set.seed(4)
cvgroups = sample(groups,n,replace=TRUE)
for (i in 1:cvk) {
qdafit1 = qda(country.group~styling+acceleration+handling+`fun factor`+`cool factor`+features+comfort+quality+value,data=doug_df,subset=(cvgroups!=i))
newdata1i = data.frame(doug_df[cvgroups==i,])
allpredictedCV1[cvgroups==i] = as.character(predict(qdafit1,newdata1i)$class)
}
table(doug_df$country.group,allpredictedCV1)
CVmodel1 = sum(allpredictedCV1!=doug_df$country.group)/n; CVmodel1
This is throwing the error for the last part of the code w/ the cross validation:
Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')
Can someone help explain it a bit more in depth to me what is happening? I think that the variable fun factor doesn't have the same levels in each fold of the cross validation as it did the model. Now I need to know my options to fix it. Thanks in advance!
EDIT
In addition to the above, I am getting a very similar error for when I try to predict a dummy car review.
#build a dummy review and predict it using multiple models
dummy_review = data.frame(year=2014,styling=8,acceleration=6,handling=6,`fun factor`=8,`cool factor`=8,features=4,comfort=4,quality=6,practicality=3,value=5)
#predict vehicle country for dummy data using model 1
predict(qdafit1,dummy_review)$class
This returns the following error:
Error in model.frame.default(as.formula(delete.response(Terms)), newdata, : variable lengths differ (found for 'fun factor')
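One thing worth checking, offered as a sketch rather than a confirmed diagnosis: data.frame() runs column names through make.names() by default, so a name with a space such as `fun factor` becomes fun.factor. Both the dummy review and the data.frame(doug_df[cvgroups==i,]) call in the cross-validation loop go through data.frame(), so the new data may not contain a column called `fun factor` at all. The values below are copied from the dummy review; the object names are hypothetical.

```r
# Default behaviour: non-syntactic names are rewritten
with_default <- data.frame(`fun factor` = 8, `cool factor` = 8)
print(names(with_default))   # "fun.factor" "cool.factor"

# check.names = FALSE keeps the names exactly as written
kept <- data.frame(`fun factor` = 8, `cool factor` = 8, check.names = FALSE)
print(names(kept))           # "fun factor" "cool factor"

# Wrapping an existing data frame in data.frame() rewrites the names again
rewrapped <- data.frame(kept)
print(names(rewrapped))      # "fun.factor" "cool.factor"
```

Under that assumption, two options are to pass check.names = FALSE wherever new data is built, or to rename the columns once to syntactic names (for example fun_factor) and use those in the formula.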

R: randomForest error message

Trying to run Random Forest on a data set that has ~400 samples and about 360 variables in data frame df:
I'm trying to use the variables (s10, s100, etc.) to predict the Genotype. This is the code I'm using:
rf <-randomForest(Genotype ~ ., data = df, importance = TRUE, proximity = TRUE)
but I keep getting the error message:
Error in if (n == 0) stop("data (x) has 0 rows") :
argument is of length zero
What am I doing wrong?
First, don't name your objects the same as R functions (i.e., "df"). Second, try the non-formula interface to randomForest. Let's see where that gets you.
( rf <-randomForest(y=my.df[,"Genotype"], x=my.df[,2:ncol(my.df)], importance = TRUE, proximity = TRUE) )

How to use predict from a model stored in a list in R?

I have a dataframe dfab that contains 2 columns that I used as argument to generate a series of linear models as following:
models = list()
for (i in 1:10){
models[[i]] = lm(fc_ab10 ~ (poly(nUs_ab, i)), data = dfab)
}
dfab has 32 observations and I want to predict fc_ab10 for only 1 value.
I thought of doing so:
newdf = data.frame(newdf = nUs_ab)
newdf[] = 0
newdf[1,1] = 56
prediction = predict(models[[1]], newdata = newdf)
First I tried writing newdf as a dataframe with only one position, but since there are 32 in the dataset on which the model was built, I thought I had to provide at least 32 points as well. I don't think this is necessary though.
Every time I run that piece of code I am given the following error:
Error: variable 'poly(nUs_ab, i)' was fitted with type "nmatrix.1" but type "numeric" was supplied.
In addition: Warning message:
In Z/rep(sqrt(norm2[-1L]), each = length(x)) :
longer object length is not a multiple of shorter object length
I thought all I need to use predict was a LM model, predictors (the number 56) given in a column-named dataframe. Obviously, I am mistaken.
How can I fix this issue?
Thanks.
newdf should be a data.frame with column name nUs_ab, otherwise R won't be able to know which column to operate upon (i.e., generate the prediction design matrix). So the following code should work
newdf = data.frame(nUs_ab = 56)
prediction = predict(models[[1]], newdata = newdf)
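A small self-contained sketch of the same idea on simulated data (the real dfab is not available, so the values are made up). As an extra precaution that is not part of the answer above, the degree is written literally into each formula with sprintf(), so no model depends on the value of the loop variable i at prediction time.

```r
set.seed(1)

# Simulated stand-in for dfab: 32 observations, same column names as the question
dfab <- data.frame(nUs_ab = seq(10, 72, by = 2))
dfab$fc_ab10 <- 0.5 * dfab$nUs_ab + rnorm(nrow(dfab))

# Fit polynomial models of degree 1 to 3, embedding the degree in the formula text
models <- lapply(1:3, function(d) {
  f <- as.formula(sprintf("fc_ab10 ~ poly(nUs_ab, %d)", d))
  lm(f, data = dfab)
})

# New data: one row, with the same column name the model was fitted on
newdf <- data.frame(nUs_ab = 56)
prediction <- predict(models[[1]], newdata = newdf)
print(prediction)
```

A single-row newdata is enough; predict() returns one value per row of newdata, regardless of how many rows the model was trained on.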

Getting Error in R when Performing predict function

I am getting the following error whenever I try to predict data against my linear model.
Warning message: 'newdata' had 101 rows but variables found have 296 rows
The following is the code snippet
trainingFrame = data.frame(weeksTrainingConv,bugsTraining)
validateFrame = data.frame(weekTestConv,bugsTest)
model <- lm(totWeekConv ~ totBugs,trainingFrame)
myPrediction <- predict(model,validateFrame)
The calculations for the dataframe and their components are written in a separate sheet. Here is the snippet. I have commented out the blocks according to the nature of the code. The first block represents my training dataset, the second is the dataset which I will use to test my model. Finally the last block is the total dataset.
library(lubridate)
#training DataSet
weeksTraining = as.Date(c("2003-12-28","2004-01-04","2004-01-11","2004-01-18","2004-01-25","2004-02-01","2004-02-08","2004-02-15","2004-02-22","2004-02-29","2004-03-07","2004-03-14","2004-03-21","2004-03-28","2004-04-04","2004-04-11","2004-04-18","2004-04-25","2004-05-02","2004-05-09","2004-05-16","2004-05-23","2004-05-30","2004-06-06","2004-06-13","2004-06-20","2004-06-27","2004-07-04","2004-07-11","2004-07-18","2004-07-25","2004-08-01","2004-08-08","2004-08-15","2004-08-22","2004-08-29","2004-09-05","2004-09-12","2004-09-19","2004-09-26","2004-10-03","2004-10-10","2004-10-17","2004-10-24","2004-10-31","2004-11-07","2004-11-14","2004-11-21","2004-11-28","2004-12-05","2004-12-12","2004-12-19","2004-12-26","2005-01-02","2005-01-09","2005-01-16","2005-01-23","2005-01-30","2005-02-06","2005-02-13","2005-02-20","2005-02-27","2005-03-06","2005-03-13","2005-03-20","2005-03-27","2005-04-03","2005-04-10","2005-04-17","2005-04-24","2005-05-01","2005-05-08","2005-05-15","2005-05-22","2005-05-29","2005-06-05","2005-06-12","2005-06-19","2005-06-26","2005-07-03","2005-07-10","2005-07-17","2005-07-24","2005-07-31","2005-08-07","2005-08-14","2005-08-21","2005-08-28","2005-09-04","2005-09-11","2005-09-18","2005-09-25","2005-10-02","2005-10-09","2005-10-16","2005-10-23","2005-10-30","2005-11-06","2005-11-13","2005-11-20","2005-11-27","2005-12-04","2005-12-11","2005-12-18","2005-12-25","2006-01-01","2006-01-08","2006-01-15","2006-01-22","2006-01-29","2006-02-05","2006-02-12","2006-02-19","2006-02-26","2006-03-05","2006-03-12","2006-03-19","2006-03-26","2006-04-02","2006-04-09","2006-04-16","2006-04-23","2006-04-30","2006-05-07","2006-05-14","2006-05-21","2006-05-28","2006-06-04","2006-06-11","2006-06-18","2006-06-25","2006-07-02","2006-07-09","2006-07-16","2006-07-23","2006-07-30","2006-08-06","2006-08-13","2006-08-20","2006-08-27","2006-09-03","2006-09-10","2006-09-17","2006-09-24","2006-10-01","2006-10-08","2006-10-15","2006-10-22","2006-10-29","2006-11-05","2006-11-12","2006-11-19","2006-11-26","2006-12-03","2006-12-10","2006-12-17","2006-12-24","2006-12-31","2007-01-07","2007-01-14","2007-01-21","2007-01-28","2007-02-04","2007-02-11","2007-02-18","2007-02-25","2007-03-04","2007-03-11","2007-03-18","2007-03-25","2007-04-01","2007-04-08","2007-04-15","2007-04-22","2007-04-29","2007-05-06","2007-05-13","2007-05-20","2007-05-27","2007-06-03","2007-06-10","2007-06-17","2007-06-24","2007-07-01","2007-07-08","2007-07-15","2007-07-22","2007-07-29","2007-08-05","2007-08-12","2007-08-19","2007-08-26","2007-09-02","2007-09-09","2007-09-16"))
bugsTraining = c(3,18,14,25,21,13,17,25,21,18,20,11,17,19,23,9,7,18,13,17,16,15,16,18,20,12,14,16,19,23,18,10,24,23,11,14,16,19,22,20,15,21,14,9,19,12,18,12,20,10,20,16,14,12,16,11,10,18,20,17,17,20,16,15,20,19,9,11,11,17,10,14,10,16,7,14,11,9,10,9,14,7,13,13,13,16,17,7,17,8,11,11,10,16,9,20,9,13,13,6,11,21,8,10,7,14,16,13,12,9,13,12,17,13,10,12,15,14,8,8,9,13,9,9,18,9,6,10,14,11,5,6,7,4,9,9,9,6,4,5,7,10,12,7,4,13,11,9,6,6,2,8,10,2,7,7,4,1,5,5,10,11,5,11,9,14,5,9,2,6,6,4,4,2,5,7,13,6,4,3,1,5,4,4,2,6,3,5,2,5,5,3,1,5,2)
weeksTrainingConv = numeric();
#converting Dates to numerical Value
for(i in 1:length(weeksTraining)){
val = ymd(weeksTraining[i])
val = as.numeric(val)
weeksTrainingConv[i] = c(val)
print(weeksTrainingConv[i])
}
#end Training DataSet
#test DataSet
weekTest = as.Date(c("2007-09-23","2007-09-30","2007-10-07","2007-10-14","2007-10-21","2007-10-28","2007-11-04","2007-11-11","2007-11-18","2007-11-25","2007-12-02","2007-12-09","2007-12-16","2007-12-30","2008-01-06","2008-01-13","2008-01-20","2008-01-27","2008-02-03","2008-02-10","2008-02-17","2008-02-24","2008-03-02","2008-03-09","2008-03-16","2008-03-23","2008-03-30","2008-04-06","2008-04-13","2008-04-20","2008-04-27","2008-05-04","2008-05-11","2008-05-18","2008-05-25","2008-06-01","2008-06-08","2008-06-15","2008-06-22","2008-06-29","2008-07-06","2008-07-20","2008-07-27","2008-08-03","2008-08-10","2008-08-17","2008-08-24","2008-08-31","2008-09-07","2008-09-14","2008-09-21","2008-09-28","2008-10-05","2008-10-12","2008-10-19","2008-10-26","2008-11-02","2008-11-09","2008-11-16","2008-11-30","2008-12-07","2008-12-14","2009-01-04","2009-01-11","2009-01-18","2009-01-25","2009-02-01","2009-02-15","2009-02-22","2009-03-15","2009-03-22","2009-03-29","2009-04-05","2009-04-12","2009-04-19","2009-04-26","2009-05-10","2009-05-17","2009-05-24","2009-05-31","2009-06-21","2009-06-28","2009-07-05","2009-07-12","2009-07-19","2009-07-26","2009-08-02","2009-08-09","2009-08-16","2009-08-23","2009-09-06","2009-09-20","2009-09-27","2009-10-04","2009-10-11","2009-10-25","2009-11-01","2009-11-08","2009-11-15","2009-11-29","2009-12-06"));
bugsTest = c(2,4,5,1,4,4,2,4,1,7,2,2,4,1,2,3,1,2,3,1,4,2,10,1,1,6,3,5,1,4,2,3,2,4,2,1,5,6,3,1,1,2,2,5,1,1,2,1,2,3,3,4,4,3,2,3,1,2,6,1,1,1,2,2,2,3,1,1,2,1,3,4,2,3,1,3,1,2,2,1,1,2,2,1,1,1,2,2,2,1,4,3,2,2,6,2,4,3,2,2,1)
weekTestConv = numeric()
#converting Dates to numerical Value
for(i in 1:length(weekTest)){
val = ymd(weekTest[i])
val = as.numeric(val)
weekTestConv[i] = c(val)
}
#end Test DataSet
#total DataSet
totWeek = as.Date(c("2003-12-28","2004-01-04","2004-01-11","2004-01-18","2004-01-25","2004-02-01","2004-02-08","2004-02-15","2004-02-22","2004-02-29","2004-03-07","2004-03-14","2004-03-21","2004-03-28","2004-04-04","2004-04-11","2004-04-18","2004-04-25","2004-05-02","2004-05-09","2004-05-16","2004-05-23","2004-05-30","2004-06-06","2004-06-13","2004-06-20","2004-06-27","2004-07-04","2004-07-11","2004-07-18","2004-07-25","2004-08-01","2004-08-08","2004-08-15","2004-08-22","2004-08-29","2004-09-05","2004-09-12","2004-09-19","2004-09-26","2004-10-03","2004-10-10","2004-10-17","2004-10-24","2004-10-31","2004-11-07","2004-11-14","2004-11-21","2004-11-28","2004-12-05","2004-12-12","2004-12-19","2004-12-26","2005-01-02","2005-01-09","2005-01-16","2005-01-23","2005-01-30","2005-02-06","2005-02-13","2005-02-20","2005-02-27","2005-03-06","2005-03-13","2005-03-20","2005-03-27","2005-04-03","2005-04-10","2005-04-17","2005-04-24","2005-05-01","2005-05-08","2005-05-15","2005-05-22","2005-05-29","2005-06-05","2005-06-12","2005-06-19","2005-06-26","2005-07-03","2005-07-10","2005-07-17","2005-07-24","2005-07-31","2005-08-07","2005-08-14","2005-08-21","2005-08-28","2005-09-04","2005-09-11","2005-09-18","2005-09-25","2005-10-02","2005-10-09","2005-10-16","2005-10-23","2005-10-30","2005-11-06","2005-11-13","2005-11-20","2005-11-27","2005-12-04","2005-12-11","2005-12-18","2005-12-25","2006-01-01","2006-01-08","2006-01-15","2006-01-22","2006-01-29","2006-02-05","2006-02-12","2006-02-19","2006-02-26","2006-03-05","2006-03-12","2006-03-19","2006-03-26","2006-04-02","2006-04-09","2006-04-16","2006-04-23","2006-04-30","2006-05-07","2006-05-14","2006-05-21","2006-05-28","2006-06-04","2006-06-11","2006-06-18","2006-06-25","2006-07-02","2006-07-09","2006-07-16","2006-07-23","2006-07-30","2006-08-06","2006-08-13","2006-08-20","2006-08-27","2006-09-03","2006-09-10","2006-09-17","2006-09-24","2006-10-01","2006-10-08","2006-10-15","2006-10-22","2006-10-29","2006-11-05","2006-11-12","2006-11-19","2006-11-26","2006-12-03","2006-12-10","2006-12-17","2006-12-24","2006-12-31","2007-01-07","2007-01-14","2007-01-21","2007-01-28","2007-02-04","2007-02-11","2007-02-18","2007-02-25","2007-03-04","2007-03-11","2007-03-18","2007-03-25","2007-04-01","2007-04-08","2007-04-15","2007-04-22","2007-04-29","2007-05-06","2007-05-13","2007-05-20","2007-05-27","2007-06-03","2007-06-10","2007-06-17","2007-06-24","2007-07-01","2007-07-08","2007-07-15","2007-07-22","2007-07-29","2007-08-05","2007-08-12","2007-08-19","2007-08-26","2007-09-02","2007-09-09","2007-09-16","2007-09-23","2007-09-30","2007-10-07","2007-10-14","2007-10-21","2007-10-28","2007-11-04","2007-11-11","2007-11-18","2007-11-25","2007-12-02","2007-12-09","2007-12-16","2007-12-30","2008-01-06","2008-01-13","2008-01-20","2008-01-27","2008-02-03","2008-02-10","2008-02-17","2008-02-24","2008-03-02","2008-03-09","2008-03-16","2008-03-23","2008-03-30","2008-04-06","2008-04-13","2008-04-20","2008-04-27","2008-05-04","2008-05-11","2008-05-18","2008-05-25","2008-06-01","2008-06-08","2008-06-15","2008-06-22","2008-06-29","2008-07-06","2008-07-20","2008-07-27","2008-08-03","2008-08-10","2008-08-17","2008-08-24","2008-08-31","2008-09-07","2008-09-14","2008-09-21","2008-09-28","2008-10-05","2008-10-12","2008-10-19","2008-10-26","2008-11-02","2008-11-09","2008-11-16","2008-11-30","2008-12-07","2008-12-14","2009-01-04","2009-01-11","2009-01-18","2009-01-25","2009-02-01","2009-02-15","2009-02-22","2009-03-15","2009-03-22","2009-03-29","2009-04-05","2009-04-12","2009-04-19","2009-04-26","2009-05-10","2009-05-17","2009-05-24","2009-05-31","2009-06-21","2009-06-28","2009-07-05","2009-07-12","2009-07-19","2009-07-26","2009-08-02","2009-08-09","2009-08-16","2009-08-23","2009-09-06","2009-09-20","2009-09-27","2009-10-04","2009-10-11","2009-10-25","2009-11-01","2009-11-08","2009-11-15","2009-11-29","2009-12-06"))
totBugs = c(3,18,14,25,21,13,17,25,21,18,20,11,17,19,23,9,7,18,13,17,16,15,16,18,20,12,14,16,19,23,18,10,24,23,11,14,16,19,22,20,15,21,14,9,19,12,18,12,20,10,20,16,14,12,16,11,10,18,20,17,17,20,16,15,20,19,9,11,11,17,10,14,10,16,7,14,11,9,10,9,14,7,13,13,13,16,17,7,17,8,11,11,10,16,9,20,9,13,13,6,11,21,8,10,7,14,16,13,12,9,13,12,17,13,10,12,15,14,8,8,9,13,9,9,18,9,6,10,14,11,5,6,7,4,9,9,9,6,4,5,7,10,12,7,4,13,11,9,6,6,2,8,10,2,7,7,4,1,5,5,10,11,5,11,9,14,5,9,2,6,6,4,4,2,5,7,13,6,4,3,1,5,4,4,2,6,3,5,2,5,5,3,1,5,2,2,4,5,1,4,4,2,4,1,7,2,2,4,1,2,3,1,2,3,1,4,2,10,1,1,6,3,5,1,4,2,3,2,4,2,1,5,6,3,1,1,2,2,5,1,1,2,1,2,3,3,4,4,3,2,3,1,2,6,1,1,1,2,2,2,3,1,1,2,1,3,4,2,3,1,3,1,2,2,1,1,2,2,1,1,1,2,2,2,1,4,3,2,2,6,2,4,3,2,2,1)
totWeekConv = numeric();
#converting Dates to numerical Value
for(i in 1:length(totWeek)){
val = ymd(totWeek[i])
val = as.numeric(val)
totWeekConv[i] = c(val)
}
#end Total DataSet
I wanted to create a linear model and establish a relationship between weeks vs bugs. I converted the week Dates into a numerical format for easier calculation.
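As an aside, the date-to-number conversion does not need a loop, since as.numeric() is vectorized over Date objects. A minimal sketch using only the first three training dates from above; loopConv is a hypothetical name for the loop-based result, and base R's as.numeric() is used in place of the lubridate call.

```r
# First three training dates from the question
weeksTraining <- as.Date(c("2003-12-28", "2004-01-04", "2004-01-11"))

# Vectorized conversion: days since 1970-01-01 for every element at once
weeksTrainingConv <- as.numeric(weeksTraining)

# Loop-based conversion, as in the question, for comparison
loopConv <- numeric()
for (i in seq_along(weeksTraining)) {
  loopConv[i] <- as.numeric(weeksTraining[i])
}

# Both approaches give the same numbers
print(all.equal(weeksTrainingConv, loopConv))

# Consecutive weekly dates are 7 days apart
print(diff(weeksTrainingConv))
```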
I can create the model using the lm() command and I provided it with a training dataset as shown in the 1st code snippet. Whenever I want to predict against the model using testing data set which in this case is a dataframe named "validateFrame", the program gives me an error stating
Warning message: 'newdata' had 101 rows but variables found have 296 rows
I am new to R. I have already googled this, but the solutions I found don't seem to work for me.
The problem is in your initial code snippet.
trainingFrame = data.frame(weeksTrainingConv,bugsTraining)
validateFrame = data.frame(weekTestConv,bugsTest)
model <- lm(totWeekConv ~ totBugs, trainingFrame)
myPrediction <- predict(model,validateFrame)
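First, you create the model using totWeekConv and totBugs from trainingFrame. But trainingFrame does not have variables with those names; it has columns named weeksTrainingConv and bugsTraining. Because lm() cannot find them in the data frame, it falls back to the 296-element totWeekConv and totBugs vectors in your workspace, so the model is actually fitted on the total dataset. Then you try to evaluate the model on validateFrame, where the variables have yet different names, weekTestConv and bugsTest. predict() cannot find totBugs in the 101-row validateFrame either, falls back to the same 296-element vector, and issues the warning. You must use the same variable names throughout.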
I am not quite sure how you meant to use totWeekConv and totBugs but I believe that what you wanted was:
trainingFrame = data.frame(weeksConv = weeksTrainingConv,bugs = bugsTraining)
validateFrame = data.frame(weeksConv = weekTestConv,bugs = bugsTest)
model <- lm(weeksConv ~ bugs,trainingFrame)
myPrediction <- predict(model,validateFrame)
Here, you are training on the training data and testing on the test data but the column names are the same in both places.
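The effect can be reproduced on a toy scale. In this sketch all numbers are made up and much smaller than the real data: the first model silently picks up workspace vectors because the data frame has no columns with those names, while the second uses matching column names and returns one prediction per row of the validation frame.

```r
# Workspace vectors playing the role of totWeekConv and totBugs (toy values)
totWeekConv <- c(1, 2, 3, 4, 5, 6, 7, 8)
totBugs     <- c(9, 7, 8, 5, 6, 3, 4, 2)

# Frames whose column names match neither each other nor the formula
trainingFrame <- data.frame(weeksTrainingConv = 1:5, bugsTraining = c(9, 7, 8, 5, 6))
validateFrame <- data.frame(weekTestConv = 6:8, bugsTest = c(3, 4, 2))

# Mismatched names: lm() and predict() both fall back to the workspace vectors
badModel <- lm(totWeekConv ~ totBugs, trainingFrame)
badPrediction <- suppressWarnings(predict(badModel, validateFrame))
print(length(badPrediction))    # length of totBugs, not nrow(validateFrame)

# Matching names in both frames, as suggested above
trainingFrame2 <- data.frame(weeksConv = 1:5, bugs = c(9, 7, 8, 5, 6))
validateFrame2 <- data.frame(weeksConv = 6:8, bugs = c(3, 4, 2))
goodModel <- lm(weeksConv ~ bugs, trainingFrame2)
goodPrediction <- predict(goodModel, validateFrame2)
print(length(goodPrediction))   # one prediction per validation row
```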
