Errors while performing caret tuning in R

I am building a predictive model with caret/R and I am running into the following problems:
When trying to execute the training/tuning, I get this error:
Error in if (tmps < .Machine$double.eps^0.5) 0 else tmpm/tmps :
missing value where TRUE/FALSE needed
After some research, it appears that this error occurs when there are missing values in the data, which is not the case here (I confirmed that the data set has no NAs). However, I also read that missing values may be introduced during caret's resampling routine, which I suspect is what's happening.
In an attempt to solve problem 1, I tried "pre-processing" the data during resampling in caret by removing zero-variance and near-zero-variance predictors and imputing missing values with caret's k-nearest-neighbors imputation method, preProcess(c('zv','nzv','knnImpute')), but now I get the following error:
Error: Matrices or data frames are required for preprocessing
Needless to say, I checked and confirmed that the input data sets are indeed matrices, so I don't understand why I get this second error.
The code follows:
x.train <- predict(dummyVars(class ~ ., data = train.transformed), train.transformed)
y.train <- as.matrix(select(train.transformed, class))
vbmp.grid <- expand.grid(estimateTheta = c(TRUE, FALSE))
adaptive_trctrl <- trainControl(method = 'adaptive_cv',
                                number = 10,
                                repeats = 3,
                                search = 'random',
                                adaptive = list(min = 5, alpha = 0.05,
                                                method = "gls", complete = TRUE),
                                allowParallel = TRUE)
fit.vbmp.01 <- train(x = x.train,
                     y = y.train,
                     method = 'vbmpRadial',
                     trControl = adaptive_trctrl,
                     preProcess(c('zv','nzv','knnImpute')),
                     tuneGrid = vbmp.grid)
The only difference between the code for problem (1) and (2) is that in (1), the pre-processing line in the train statement is commented out.
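A side note on the second error: train() takes its pre-processing specification as a named argument (preProcess = c(...)). A bare preProcess(...) call is evaluated on its own, and preProcess() itself expects a matrix or data frame as its first argument, which looks like exactly what that error message is complaining about. A minimal sketch of the named-argument form, reusing the objects above (fit.vbmp.02 is just an illustrative name):
# Sketch only: pass the pre-processing methods via the 'preProcess'
# argument of train() instead of calling preProcess() directly
fit.vbmp.02 <- train(x = x.train,
                     y = y.train,
                     method = 'vbmpRadial',
                     trControl = adaptive_trctrl,
                     preProcess = c('zv', 'nzv', 'knnImpute'),
                     tuneGrid = vbmp.grid)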
In summary:
- There are no missing values in the data.
- Both x.train and y.train are definitely matrices.
- I tried using a standard 'repeatedcv' method instead of 'adaptive_cv' in trainControl, with the same exact outcome.
- I forgot to mention that the outcome class has 3 levels.
Does anyone have any suggestions as to what may be going wrong?
As always, thanks in advance
reyemarr

I had the same problem with my data. After some digging I found that I had some Inf (infinite) values in one of the columns.
After taking them out (df <- df %>% filter(!is.infinite(variable))) the computation ran without error.
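If you are not sure which column is affected, a quick base-R check (a sketch assuming an all-numeric data frame df) is:
# Count non-finite entries (NA, NaN, +/-Inf) in each column
colSums(!sapply(df, is.finite))
# Keep only the rows where every value is finite
df_clean <- df[rowSums(!sapply(df, is.finite)) == 0, ]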

Related

Error while using the weights option in nlme in R

Sorry, this is cross-posted from https://stats.stackexchange.com/questions/593717/nlme-regression-with-weights-syntax-in-r, but I thought it might be more appropriate to post it here.
I am trying to fit a power curve with nlme to model some observations. However, I know some observations to be less reliable than others (the reliability of each OBSID is reflected in WEIV in the dummy data), relatively independently of the variance; I quantified this beforehand and wish to include it as weights in my model. Moreover, I know that part of my variance is correlated with my independent variable, so I cannot use the variance directly as weights.
This is my model:
coeffs_start = lm(log(DEPV) ~ log(INDV), filter(testdummy10, DEPV != 0))$coefficients
nlme_fit <- nlme(DEPV ~ a * INDV^b,
                 data = testdummy10,
                 fixed = a + b ~ 1,
                 random = a ~ 1,
                 groups = ~ PARTID,
                 start = c(a = exp(coeffs_start[1]), b = coeffs_start[2]),
                 verbose = F,
                 method = "REML",
                 weights = varFixed(~WEIV))
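For context on what that weights specification does: varFixed(~WEIV) fixes the error variance at Var(e_i) = sigma^2 * WEIV_i, which corresponds to the weighting weights = 1/WEIV in lm(). A minimal sketch of that correspondence on simulated data (all names are illustrative):
library(nlme)
set.seed(1)
dat <- data.frame(x = 1:50, w = runif(50, 0.5, 2))
dat$y <- 2 * dat$x + rnorm(50, sd = sqrt(dat$w))  # error variance proportional to w
gls_fit <- gls(y ~ x, data = dat, weights = varFixed(~w))
lm_fit <- lm(y ~ x, data = dat, weights = 1/w)
coef(gls_fit)  # should match coef(lm_fit)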
This is some sample dummy data (I know it is not a great fit, but it's fake data anyway): https://github.com/FlorianLeprevost/dummydata/blob/main/testdummy10.csv
This runs well without the weights argument, but when I add it I get the error below, and I am not sure why, because I believe the syntax is correct:
Error in recalc.varFunc(object[[i]], conLin) :
dims [product 52] do not match the length of object [220]
In addition: Warning message:
In conLin$Xy * varWeights(object) :
longer object length is not a multiple of shorter object length
Thanks in advance!
This looks like a very long-standing bug in nlme. I have a patched version on Github, which you can install via remotes::install_github() as below ...
remotes::install_github("bbolker/nlme")
testdummy10 <- read.csv("testdummy10.csv") |> subset(DEPV > 0 & INDV > 0)
coeffs_start <- coef(lm(log(DEPV) ~ log(INDV), testdummy10))
library(nlme)
nlme_fit <- nlme(DEPV ~ a * INDV^b,
                 data = testdummy10,
                 fixed = a + b ~ 1,
                 random = a ~ 1,
                 groups = ~ PARTID,
                 start = c(a = exp(coeffs_start[1]),
                           b = coeffs_start[2]),
                 verbose = FALSE,
                 method = "REML",
                 weights = varFixed(~WEIV))
packageVersion("nlme") ## 3.1.160.9000

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to Stack Overflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm<-svyglm(ElUnder~SO2, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in eval(extras, data, env) :
object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG<-ifelse(Under_narm$SO2=="Heterosexual", 0, 1)
Under_narm$ElUnderREG<-ifelse(Under_narm$ElUnder=="No", 0, 1)
But then I get a different issue:
El_under_glm<-svyglm(ElUnderREG~SO2REG, design=SAMPdesign, subset=NULL, family=binomial(link="logit"), rescale=FALSE, correlation=TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign=svydesign(data=Under_narm, id= ~NHISPID, weight= ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data, I was able to get your model to run by setting rescale = TRUE. The documentation states:
"Rescaling of weights, to improve numerical stability. The default rescales weights to sum to the sample size. Use FALSE to not rescale weights."
So one solution may be simply to set rescale = TRUE.
library(survey)
# sample data
Under_narm <- data.frame(SO2 = factor(rep(1:2, 1000)),
                         ElUnder = sample(0:1, 1000, replace = TRUE),
                         NHISPID = paste0("id", 1:1000),
                         SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign = svydesign(ids = ~NHISPID,
                       data = Under_narm,
                       weights = ~SAMPWEIGHT)
El_under_glm <- svyglm(formula = ElUnder ~ SO2,
                       design = SAMPdesign,
                       family = quasibinomial(), # this family avoids warnings
                       rescale = TRUE) # weights rescaled to sum to the sample size
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking at the code for this function's method with survey:::svyglm.survey.design, it seems like there may be a bug. I could be wrong, but by my reading, when rescale is FALSE, .survey.prob.weights does not appear to get assigned a value.
if (is.null(g$weights))
    g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
    data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
There may be a workaround if you assign a vector of numeric values to .survey.prob.weights in the global environment. I have no idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs to be double the length of the data.)
SAMPdesign = svydesign(ids = ~NHISPID,
                       data = Under_narm,
                       weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm <- svyglm(formula = ElUnder ~ SO2,
                       design = SAMPdesign,
                       family = quasibinomial(),
                       rescale = FALSE)
summary(El_under_glm, correlation = TRUE)

"promise already under evaluation" error in R caret's rfe function

I have a matrix X and a vector Y which I use as arguments to the rfe function from the caret package. It's as simple as calling rfe(X, Y).
I get a weird error which I can't decipher:
promise already under evaluation: recursive default argument reference or earlier problems?
EDIT:
Here is a reproducible example for the first 5 rows of my data:
library(caret)
X_values = c(29.04,96.57,4.57,94.23,66.81,26.71,69.01,77.06,49.52,97.59,47.57,64.07,24.25,11.27,77.30,90.99,44.05,30.96,96.32,16.04)
X = matrix(X_values, nrow = 5, ncol=4)
Y = c(5608.11,2916.61,5093.05,3949.35,2482.52)
rfe(X, Y)
My R version is 3.2.3; the caret package version is 6.0-76.
Does anybody know what this is?
There are two problems with your code:
1. You need to specify the function/algorithm that you want to fit. (This is what causes the error message you get. I am unsure why rfe throws such a cryptic message; it makes debugging difficult, indeed.)
2. You need to name the columns of your input data.
The following works:
library(caret)
X_values = c(29.04,96.57,4.57,94.23,66.81,26.71,69.01,77.06,49.52,97.59,47.57,64.07,24.25,11.27,77.30,90.99,44.05,30.96,96.32,16.04)
X = matrix(X_values, nrow = 5, ncol=4)
Y = c(5608.11,2916.61,5093.05,3949.35,2482.52)
ctrl <- rfeControl(functions = lmFuncs)
colnames(X) <- letters[1:ncol(X)]
set.seed(123)
rfe(X, Y, rfeControl = ctrl)
I chose a linear model for the rfe.
The reason for the warning messages is the low number of observations in your data during cross-validation. You probably also want to set the sizes argument to get a meaningful feature elimination, as in the sketch below.
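A minimal sketch continuing the example above (the subset sizes are arbitrary):
# Evaluate predictor subsets of sizes 1 through 4 instead of the default
set.seed(123)
rfe(X, Y, sizes = 1:4, rfeControl = ctrl)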

Grid tuning xgboost with missing data

It seems like the expected method of grid tuning the xgboost model is using the caret package, as clearly displayed here: https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees
However, I struggle to make sense of the case with missing data. When creating the model without using caret, I set the missing to NA.
dtrain = xgb.DMatrix(data = data.matrix(train$data), label = train$label, missing = NA)
That allows me to create the model like so:
bst = xgboost(data = dtrain, depth = 4, eta = .3, nthread = 2,
              nround = 43, print.every.n = 5,
              objective = "binary:logistic", eval_metric = "auc",
              verbose = TRUE)
This works very nicely; however, caret does not accept this kind of object.
This is what I'm trying:
xgbtrain = train(x = train$data, y = as.factor(make.names(train$label)),
                 trControl = trControl, tuneGrid = my_grid, method = "xgbTree")
But for every iteration it is telling me this:
Error in xgb.DMatrix(as.matrix(x), label = y) : can not open file "NA"
That's the same error message I was getting before in plain xgboost when I didn't set missing to NA. The xgb.DMatrix is not a subsettable object I could take the data from, and it is also not possible to convert it to a data frame. How do I get around this?
EDIT
Figured it out. In the end, it had nothing to do with missing data, but with having factors in the dataset. Instead of using xgboost's function to convert to a sparse matrix, I used regular model.matrix() and was able to successfully plug the new matrix into caret's train function.
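For anyone wanting to reproduce that workaround, here is a minimal sketch (train_df, label, trControl, and my_grid are placeholders for your own objects): expand the factor columns into numeric dummy columns with model.matrix(), then pass the resulting matrix to train():
# Build a numeric design matrix from a data frame with factor columns;
# '- 1' drops the intercept so every factor level gets its own column
x_mat <- model.matrix(~ . - 1, data = subset(train_df, select = -label))
y_fac <- as.factor(make.names(train_df$label))
xgbtrain <- train(x = x_mat, y = y_fac,
                  trControl = trControl, tuneGrid = my_grid,
                  method = "xgbTree")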

adabag boosting function throws error when giving mfinal>10

I have a strange issue: whenever I try increasing the mfinal argument of the boosting function in the adabag package beyond 10, I get an error; even with mfinal = 9 I get warnings.
My training data has a seven-class dependent variable and 100 independent variables, with around 22,000 samples (one class SMOTEd using DMwR). The dependent variable is the last column of the training dataset.
library(adabag)
gc()
exp_recog_boo <- boosting(V1 ~ ., data = train_dataS, boos = TRUE, mfinal = 9)
Error in 1:nrow(object$splits) : argument of length 0
In addition: Warning messages:
1: In acum + acum1 :
longer object length is not a multiple of shorter object length
Thanks in advance.
My mistake was that I didn't set the target as a factor beforehand.
Try this:
train$target <- as.factor(train$target)
and check by doing:
str(train$target)
This worked for me:
modelADA <- boosting(lettr ~ ., data = trainAll, boos = TRUE, mfinal = 10, control = rpart.control(minsplit = 0))
Essentially I just told rpart to require a minimum split size of zero to generate a tree; that eliminated the error. I haven't tested this extensively, so I can't guarantee it's a valid solution (what does a tree with a zero-length leaf actually mean?), but it does prevent the error from being thrown.
I think I hit the problem. If you configure your control with cp = 0, this won't happen: I think that if the first node of a tree makes no improvement (or at least none better than the cp), the tree stays with 0 nodes, so you get an empty tree and that makes the algorithm fail.
EDIT: The problem is that rpart generates trees with only one leaf (node), and the boosting method uses the statement k <- varImp(arboles[[m]], surrogates = FALSE, competes = FALSE); when arboles[[m]] is a tree with only one node, this throws the error.
To solve this, you can modify the boosting method: run fix(boosting) and add the lines marked with **.
if (boos == TRUE) {
**  k <- 1
**  while (k == 1) {
        boostrap <- sample(1:n, replace = TRUE, prob = pesos)
        fit <- rpart(formula, data = data[boostrap, -1],
                     control = control)
**      k <- length(fit$frame$var)
**  }
    flearn <- predict(fit, newdata = data[, -1], type = "class")
    ind <- as.numeric(vardep != flearn)
    err <- sum(pesos * ind)
}
This will prevent the algorithm from accepting one-leaf trees, but you have to set cp = 0 in the control parameter to avoid an endless loop.
I just ran into the same problem, and setting the complexity parameter to -1 or the minimum split to 0 both work for me with rpart.control, e.g.
library(adabag)
r1 <- boosting(Y ~ ., data = data, boos = TRUE,
               mfinal = 10, control = rpart.control(cp = -1))
r2 <- boosting(Y ~ ., data = data, boos = TRUE,
               mfinal = 10, control = rpart.control(minsplit = 0))
I also ran into this same problem recently, and this example R script solved it completely!
The main idea is that you need to set the control for rpart (which adabag uses for creating trees; see rpart.control) appropriately, so that at least a split is attempted in every tree.
I'm not totally sure, but it appears that your "argument of length 0" error may be the result of an empty tree, which can happen because there is a default "complexity" parameter that tells the function not to attempt a split if the decrease in homogeneity/lack of fit is below a certain threshold.
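To see that mechanism directly, here is a deterministic toy example (names are illustrative): no split on x can improve the fit, so rpart returns a root-only tree whose $splits component is NULL, and an expression like 1:nrow(fit$splits) then fails with exactly this "argument of length 0" error:
library(rpart)
# Every candidate split on x leaves both children split 50/50 between
# the classes, so the default cp rejects all splits
toy <- data.frame(y = factor(rep(c("a", "b"), 25)),
                  x = rep(1:25, each = 2))
fit <- rpart(y ~ x, data = toy)  # root-only tree under the default cp
is.null(fit$splits)              # TRUE: no splits were stored
# 1:nrow(fit$splits)             # Error: argument of length 0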
Use str() to see the attributes of your data frame. For me, I just converted my class variable to a factor, and then everything ran.
