Difficulty getting caret GLM with repeated CV to execute in R

I have been fitting 10×10-fold CV logistic models for a long time using homebrew code, but recently figured it might be nice to let caret handle the messy stuff for me.
Unfortunately, I seem to be missing some of the nuances that caret needs to be able to function.
Specifically, I keep getting this error:
>Error in { : task 1 failed - "argument is not interpretable as logical"
Please see if you can pick up what I am doing wrong...
Thanks in advance!
Data set is located here.
dataset <- read.csv("Sample Data.csv")
library(caret)
my_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10,
  savePredictions = "final",
  classProbs = TRUE
)
This next block of code was put in there to make caret happy. My original dependent variable was a binary that I had turned into a factor, but caret had issues with the factor levels being "0" and "1". Not sure why.
dataset$Temp <- "Yes"
dataset$Temp[which(dataset$Dep.Var=="0")] <- "No"
dataset$Temp <- as.factor(dataset$Temp)
Now I try to get caret to run the 10×10-fold GLM model for me...
testmodel <- train(Temp ~ Param.A + Param.G + Param.J + Param.O,
                   data = dataset,
                   method = "glm",
                   trControl = my_control,
                   metric = "Kappa")
testmodel
> Error in { : task 1 failed - "argument is not interpretable as logical"

Though you already found a fix by updating R and caret, I'd like to point out there is (was) a bug in your code which caused the error, and which I can reproduce here with an older version of R and caret:
The savePredictions argument of trainControl was meant to be set to either TRUE or FALSE, not "final". It seems you mixed it up with the returnResamp parameter, which does take exactly this value.
BTW: R and caret have restrictions on the level names of factors, which is why caret complained when you handed it "0" and "1" as level names for the dependent variable. A simple dataset$Dep.Var <- factor(paste0('class', dataset$Dep.Var)) should do the trick in such cases.
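Putting both fixes together, the corrected setup might look like this (a sketch against the older caret version, reusing the column names from the question):
library(caret)

dataset <- read.csv("Sample Data.csv")

# Give the dependent variable valid R level names instead of "0"/"1"
dataset$Dep.Var <- factor(paste0("class", dataset$Dep.Var))

my_control <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10,
  savePredictions = TRUE,  # logical in older caret, not "final"
  classProbs = TRUE
)

testmodel <- train(Dep.Var ~ Param.A + Param.G + Param.J + Param.O,
                   data = dataset,
                   method = "glm",
                   trControl = my_control,
                   metric = "Kappa")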

I don't have enough reputation to comment, so I am posting this as an answer. I ran your exact code, and it worked fine for me, twice. I did get this warning:
glm.fit: fitted probabilities numerically 0 or 1 occurred
As per the author, this error had something to do with the savePredictions parameter. Have a look at this issue:
https://github.com/topepo/caret/issues/304

Thanks to @Sumedh, I figured that the problem might not be with my code, and I updated all my packages.
Surprise! Now it works. So I wasn't crazy after all.
Sorry all for the fire drill.

Related

error when fitting random effects model using bam() rather than gam() function in mgcv package, R

I am fitting a model with many random effects using the bam() function within the mgcv package for R. My basic model structure looks like:
fit <- bam(y ~ s(x1) + s(x2) + s(xn) + s(plot, bs = 're'), data = dat)
This function works for 4 subsets of my data, but not the fifth, which is surprising. Instead, it throws this error:
Error in qr.qty(qrx, f) :
right-hand side should have 14195 not 14196 rows
This error goes away if I switch to the gam() rather than the bam() function. It also goes away if I drop the random effect from the model. I am really unsure what's causing this error, or what to do about it. Unfortunately, generating a reproducible example would require passing along a very large dataset, as it's not clear why this error is thrown on this particular dataset compared to 4 other datasets fitting the exact same model.
Any idea why this error is being thrown, and how to overcome it, would be greatly appreciated.
I had the same question, and I found this r-help mail, which tries to solve the same problem:
[R] bam (mgcv) not using the specified number of cores
After reading the mail, I deleted all the code related to the cluster, such as the cluster argument of the bam() function. Then the error message went away.
I don't know the details, but I hope this trick helps you.
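Concretely, the change looked something like this (a sketch; the cluster setup shown is representative, not the asker's exact code):
library(mgcv)
# Before: a cluster created with parallel::makeCluster() was passed to bam()
# cl <- parallel::makeCluster(4)
# fit <- bam(y ~ s(x1) + s(x2) + s(xn) + s(plot, bs = "re"),
#            data = dat, cluster = cl)
# After: drop the cluster argument and let bam() run serially
fit <- bam(y ~ s(x1) + s(x2) + s(xn) + s(plot, bs = "re"), data = dat)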
One possible cause of
Error in qr.qty(qrx, f) :
right-hand side should have 14195 not 14196 rows
is running out of RAM. This may explain why you have seen the error for some datasets but not others. This is especially common when using a large cluster size.
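If RAM is the culprit, a first thing to try is simply using fewer workers, since each worker needs its own share of memory (a sketch; the worker count is illustrative):
library(mgcv)
library(parallel)
cl <- makeCluster(2)  # fewer workers -> lower peak memory use
fit <- bam(y ~ s(x1) + s(x2) + s(xn) + s(plot, bs = "re"),
           data = dat, cluster = cl)
stopCluster(cl)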

R: gls error "false convergence (8)" and glsControl function

I've seen that a common error when running a generalized least squares (gls) model from the nlme package in R is "false convergence (8)". I am trying to run gls models to account for the spatial dependence of my residuals, but I got stuck with the same problem. For example:
library(nlme)
set.seed(2)
samp.sz<-400
lat<-runif(samp.sz,-4,4)
lon<-runif(samp.sz,-4,4)
exp1<-rnorm(samp.sz)
exp2<-rnorm(samp.sz)
resp<-1+4*exp1-3*exp2-2*lat+rnorm(samp.sz)
mod.cor<-gls(resp~exp1+exp2,correlation=corGaus(form=~lat,nugget=TRUE))
Error in gls(resp ~ exp1 + exp2, correlation = corGaus(form = ~lat, nugget = TRUE)) :
false convergence (8)
(the above data simulation was copied from here because it yields the same problem I am facing).
Then I read that the function glsControl has some parameters (maxIter, msMaxIter, returnObject) that can be set prior to running the analysis, which might resolve this error. As an attempt to understand what was going on, I set the three parameters above to 500, 2000, and TRUE, and ran the same code again, but the error still shows up. I think glsControl didn't take effect at all, because no result was shown even though I asked for one.
glsControl(maxIter = 500, msMaxIter=2000, returnObject = TRUE)
mod.cor<-gls(resp~exp1+exp2,correlation=corGaus(form=~lat,nugget=TRUE))
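(A side note on why nothing seemed to happen: glsControl() called on its own just returns a list of settings; they only take effect when passed to gls() via its control argument, e.g.:)
mod.cor <- gls(resp ~ exp1 + exp2,
               correlation = corGaus(form = ~lat, nugget = TRUE),
               control = glsControl(maxIter = 500, msMaxIter = 2000,
                                    returnObject = TRUE))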
For comparison, if I run different models with the same variables, it works fine and no error is shown.
For example, models containing only one explanatory variable.
mod.cor2<-gls(resp~exp1,correlation=corGaus(form=~lat,nugget=TRUE))
mod.cor3<-gls(resp~exp2,correlation=corGaus(form=~lat,nugget=TRUE))
I really dug into several sites, forums, and books in a desperate search to solve this, and I've come to realize that "false convergence" is a recurrent error that many users have faced. However, none of the previous posts seem to solve it for me. I really thought glsControl could provide an alternative, but it didn't. Do you have a clue how I can solve this?
I really appreciate any help. Thanks in advance.
The problem is that the nugget effect is very small. Provide better starting values:
mod.cor <- gls(resp ~ exp1 + exp2,
               correlation = corGaus(c(200, 0.1), form = ~lat, nugget = TRUE))
summary(mod.cor)
#<snip>
#Correlation Structure: Gaussian spatial correlation
# Formula: ~lat
# Parameter estimate(s):
# range nugget
#2.947163e+02 5.209379e-06
#</snip>
Note that this model may be sensitive to starting values even if there is no error or warning.
I would like to add a quote from library(lme4); help("convergence"):
The lme4 package uses general-purpose nonlinear optimizers (e.g.
Nelder-Mead or Powell's BOBYQA method) to estimate the
variance-covariance matrices of the random effects. Assessing reliably
whether such algorithms have converged is difficult.
I believe something similar applies here. This model is clearly problematic, and you should be grateful for getting this error. You should at least check how the fit changes with different starting values, and try increasing the number of iterations or decreasing the tolerance. In the end, I would suggest looking for a model that better fits the data (we know that this would be an OLS model including lat as a linear predictor here).
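For instance, a quick sensitivity check could refit the model over a few starting values and compare log-likelihoods (a sketch; the c(range, nugget) pairs are illustrative):
# Refit with several starting values; failed fits are caught by try()
starts <- list(c(200, 0.1), c(50, 0.2), c(500, 0.05))
fits <- lapply(starts, function(s)
  try(gls(resp ~ exp1 + exp2,
          correlation = corGaus(s, form = ~lat, nugget = TRUE)),
      silent = TRUE))
sapply(fits, function(f) if (inherits(f, "gls")) as.numeric(logLik(f)) else NA)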
PS: A good coding style uses blanks where appropriate.

feature selection function in caret package

I am posting this because the earlier post "feature selection in caret" hasn't helped with my issue, and I have two questions regarding the feature selection functions in the caret package.
When I run the code below on my gene expression matrix allsamplecombat, with 5 classes supplied via the y argument:
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
results <- rfe(t(allsamplecombat[filter,]), y = factor(info$clust), sizes=c(300,400,500,600,700,800,1000,1200), rfeControl=control)
The code runs, but I want to know if I can extract the top features for each class, because predictors(results) just gives me the selected features without indicating their importance for each class.
My second problem is that when I try to change the rfeControl functions to treebagFuncs and run the parRF method:
control <- rfeControl(functions=treebagFuncs, method="cv", number=5)
results <- rfe(t(allsamplecombat[filter,]), y = factor(info$clust), sizes=c(400,500,600,700,800), rfeControl=control, method="parRF")
I get the error Error in { : task 1 failed - "subscript out of bounds".
What is wrong with my code?
For the importances, there is a sub-object called variables that contains this information for each step of the elimination.
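For example (a sketch, assuming results is the rfe object from the question; the exact importance columns depend on the functions used):
# Importance scores recorded at each step of the elimination
head(results$variables)
# Scores recorded at the optimal subset size
imp_best <- subset(results$variables, Variables == results$optsize)
head(imp_best[order(imp_best$Overall, decreasing = TRUE), ])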
treebagFuncs is designed to work with ipred's bagging function and isn't related to random forest.
You would probably use caretFuncs and pass method to that, as in the sketch below. However, if you are going to parallelize something, do it at the resampling loop and not in the model function; this is generally more efficient. Note that if you do both with M workers, you might actually end up with M^3 workers (one level for rfe, one for train, and one for parRF). There are options in rfe and train to turn their parallelism off.
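A sketch of that approach, reusing the objects from the question (allsamplecombat, filter, info) and keeping parallelism at the resampling loop only:
control <- rfeControl(functions = caretFuncs, method = "cv", number = 5,
                      allowParallel = TRUE)   # parallelize the rfe loop
results <- rfe(t(allsamplecombat[filter, ]), y = factor(info$clust),
               sizes = c(400, 500, 600, 700, 800),
               rfeControl = control,
               # the following are passed through to train():
               method = "parRF",
               trControl = trainControl(allowParallel = FALSE))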

model averaged coefficients of linear mixed models in glmulti? Fix no longer works

I'm using the glmulti package to do variable selection on the fixed effects of a mixed model in lme4. I had the same problem retrieving coefficients and confidence intervals that was solved by the author of the package in this thread; namely, using coef or coef.multi gives a check.names error, and the coefficients are listed as NULL when calling the predict method. So I tried the solution listed in the thread linked above:
setMethod('getfit', 'merMod', function(object, ...) {
  summ <- summary(object)$coef
  summ1 <- summ[, 1:2]
  if (length(dimnames(summ)[[1]]) == 1) {
    summ1 <- matrix(summ1, nrow = 1,
                    dimnames = list(c("(Intercept)"),
                                    c("Estimate", "Std. Error")))
  }
  cbind(summ1, df = rep(10000, length(fixef(object))))
})
I fixed the missing " in the original post and the code ran. But now, instead of getting
Error in data.frame(..., check.names = FALSE) :arguments imply
differing number of rows: 1, 0
I get this error for every single model...
Error in calculation of the Satterthwaite's approximation. The output
of lme4 package is returned summary from lme4 is returned some
computational error has occurred in lmerTest
I'm using lmerTest, and it doesn't surprise me that it would fail if glmulti can't pull the correct info from the model. So really it's the first two lines of the error that should probably be the focus.
A description of the original fix is on the developer's website here. Clearly the package hasn't been updated in a while, and yes, I should probably learn a new package... but until then I'm hoping for a fix. I'll contact the developer directly through his website. But in the meantime, has anyone tried this and found a fix?
lme4, glmulti, rJava, and other related packages have all been updated to the latest version.

Error when using AUC package in R

We use the AUC package in R to evaluate model predictions.
Sometimes we encounter an error like the one below:
> plot(roc(pred, yTEST))
> auc(roc(pred, yTEST))
Error in rank(prob) : argument "prob" is missing, with no default
Could anyone let us know where the error comes from? Note that the problem does not occur consistently: for example, we ran 10 models and it happened for 3-4 of them.
It sounds like you might have loaded glmnet after AUC (glmnet also has an auc function). Just use AUC:: before the function.
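If that's the case, namespace-qualifying the calls avoids the masking (a sketch, using pred and yTEST from the question):
# Explicitly call the AUC package's roc()/auc(), even if glmnet masks them
r <- AUC::roc(pred, yTEST)
plot(r)
AUC::auc(r)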
