caret with NAs and Factors - r

I feel like I'm stuck in a bit of a circular error here.
I have some columns with NA (still trialling whether to impute or omit) and a few categorical/factor columns too.
If I use the formula method I can run my model, but then I run into issues when trying to predict, because the factors get dummified.
train(sales ~ .,
      data = df,
      method = "glmnet",
      preProcess = c('center', 'scale', 'zv'),
      trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
      na.action = na.omit)
This answer suggests using the non-formula method:
https://stackoverflow.com/a/30169022/10291291
train(x = model.frame(formula(sales ~ .), df)[, -1],
      y = model.frame(formula(sales ~ .), df)[, 1],
      method = "glmnet",
      preProcess = c('center', 'scale', 'zv'),
      trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2),
      na.action = na.omit)
However, when I try that I get issues with the NAs, and this post suggests going back to the formula interface:
https://stackoverflow.com/a/48230658/10291291
For reference, I'll likely be sticking with xgboost and glmnet.
So I'm a little lost, but I can't imagine this situation is that unusual, so I'm hoping I've just missed something obvious.
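One way out of the circle (a sketch of my own, not from either linked answer) is to handle the NAs up front, build the dummy-coded design matrix yourself with model.matrix(), and hand caret plain numeric data, so that exactly the same columns exist at training and prediction time:
library(caret)

df_cc <- na.omit(df)                               # or impute, since you're still deciding
X <- model.matrix(sales ~ ., data = df_cc)[, -1]   # dummy-codes the factor columns, drops the intercept
y <- df_cc$sales

fit <- train(x = X, y = y,
             method = "glmnet",
             preProcess = c('center', 'scale', 'zv'),
             trControl = trainControl(method = "repeatedcv", number = 5, repeats = 2))

# New data must go through the same transformation (with matching factor levels)
# before predict(), e.g. X_new <- model.matrix(~ ., data = new_df)[, -1]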

Related

Strange glmulti results: Why are interaction variables from the candidate model dropped/not included?

I have been using glmulti to obtain model averaged estimates and relative importance values for my variables of interest. In running glmulti I specified a candidate model for which all variables and interactions were included based on a priori knowledge (see code below).
After running the glmutli model I studied the results by using the functions summary() and weightable(). There seem to be a number of strange things going on with the results which I do not understand.
First of all, when I run my candidate model with the lme4 glmer() function I obtain an AIC value of 2086. In the glmulti output this candidate model (with exactly the same formula) has a higher AIC value (2107), as a result of which it appears at position 8 out of 26 in the list of all potential models (as obtained through the weightable() function).
What seems to be causing this problem is that the logArea:Habitat interaction is dropped from the candidate model, despite level=2 being specified. The function summary(output_new@objects[[8]]) reports a different formula (without the logArea:Habitat interaction) compared to the formula provided through weightable(). This explains why the candidate model's AIC value is not the same as the one obtained through lme4, but I do not understand why the interaction term logArea:Habitat is missing from the formula. The same is happening for other possible models: it seems that for all models with 2 or more interactions, one interaction is dropped.
Does anyone have an explanation for what is going on? Any help would be much appreciated!
Best,
Robert
Note: I have created a subset of my data (https://drive.google.com/open?id=1rc0Gkp7TPdnhW6Bw87FskL5SSNp21qxl) and simplified the candidate model by removing variables in order to decrease model run time. (The problem remains the same)
newdat <- Data_ommited2[, c("Presabs", "logBodymass", "logIsolation", "Matrix", "logArea",
                            "Protection", "Migration", "Habitat", "Guild", "Study",
                            "Species", "SpeciesStudy")]

glmer.glmulti <- function(formula, data, random, ...) {
  glmer(paste(deparse(formula), random),
        data = data,
        family = binomial(link = "logit"),
        contrasts = list(Matrix = contr.sum, Habitat = contr.treatment,
                         Protection = contr.treatment, Guild = contr.sum),
        glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)))
}
output_new <- glmulti(y = Presabs ~ Matrix + logArea * Protection + logArea * Habitat,
                      data = sampledata,
                      random = '+(1|Study)+(1|Species)+(1|SpeciesStudy)',
                      family = binomial,
                      method = 'h',
                      level = 2,
                      marginality = TRUE,
                      crit = 'aic',
                      fitfunc = glmer.glmulti,
                      confsetsize = 26)
print(output_new)
summary(output_new)
weightable(output_new)
I found a post (https://stats.stackexchange.com/questions/341356/glmulti-package-in-r-reporting-incorrect-aicc-values) from someone who encountered the same problem, and it appears that the problem was caused by this line of code:
glmer.glmulti <- function(formula, data, random, ...) {
  glmer(paste(deparse(formula), random), data = data, family = binomial(link = "logit"))
}
The likely cause is that deparse() splits a long formula across several strings (its default width.cutoff is 60 characters), so paste(deparse(formula), random) silently drops the trailing terms, which is why one interaction goes missing from the longer models. Changing this part of the code to the following solved the problem:
glmer.glmulti <- function(formula, data, random, ...) {
  newf <- formula
  # splice the random-effects terms onto the right-hand side without deparsing
  newf[[3]] <- substitute(f + r,
                          list(f = newf[[3]],
                               r = reformulate(random)[[2]]))
  glmer(newf, data = data,
        family = binomial(link = "logit"))
}
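To sanity-check what the corrected wrapper feeds to glmer(), you can build the formula outside of glmulti (a quick sketch; the formula and random-effects string below simply mirror the call in the question):
f <- Presabs ~ Matrix + logArea * Protection + logArea * Habitat
r <- '+(1|Study)+(1|Species)+(1|SpeciesStudy)'

newf <- f
newf[[3]] <- substitute(f + r,
                        list(f = newf[[3]],
                             r = reformulate(r)[[2]]))
newf  # fixed effects followed by the random-effects terms, no deparse/paste involved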

Using the caret::train function for calculating prediction error (MdAE) of GLMMs with beta-binomial errors

The question is more or less as the title indicates. I would like to use the caret::train function with beta-binomial models made with the glmmTMB package (although I am not opposed to other functions capable of fitting beta-binomial models) to calculate median absolute error (MdAE) estimates through jack-knife (leave-one-out) cross-validation. glmmTMB already estimates the dispersion parameter, but I was hoping to retain this information somehow as well, or possibly have caret do the calculation?
The dataset I am working with looks like this:
df <- data.frame(Effect = rep(seq(from = 0.05, to = 1, by = 0.05), each = 5), Time = rep(seq(1:20), each = 5))
Ideally I would be able to pass glmmTMB to train like so:
BB.glmm1 <- train(Time ~ Effect,
                  data = df,
                  method = "glmmTMB",
                  metric = "MAD")
The output would be as per the examples in ?train, although possibly with estimates for the dispersion parameter as well.
I am in no way opposed to workarounds. Thank you in advance!
I am unsure how to perform the required operation with caret without creating a custom method, but it is fairly easy to implement with a for/lapply loop.
In the example I will use the sleepstudy data set, since your example data throws a bunch of warnings.
library(glmmTMB)
To perform LOOCV, for every row create a model without that row and predict on that row:
data(sleepstudy, package = "lme4")

LOOCV <- lapply(1:nrow(sleepstudy), function(x) {
  m1 <- glmmTMB(Reaction ~ Days + (Days | Subject),
                data = sleepstudy[-x, ])
  return(predict(m1, sleepstudy[x, ], type = "response"))
})
Get the median of the absolute residuals (I think this is MdAE; if not, post a comment on how it's calculated):
median(abs(unlist(LOOCV) - sleepstudy$Reaction))
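If you do want caret itself to report MdAE, one route (a sketch only, and only for models caret already supports, since as far as I know "glmmTMB" is not a registered caret method) is a custom summaryFunction; mdaeSummary below is my own name for it. For the dispersion estimate the question asks about, calling sigma() on each glmmTMB fit inside the lapply loop is one way to keep it per fold.
# custom metric: median absolute error, computed from caret's obs/pred columns
mdaeSummary <- function(data, lev = NULL, model = NULL) {
  c(MdAE = median(abs(data$obs - data$pred)))
}

ctrl <- trainControl(method = "LOOCV", summaryFunction = mdaeSummary)

# illustration with a built-in method on the question's df
fit <- train(Time ~ Effect, data = df, method = "lm",
             trControl = ctrl, metric = "MdAE", maximize = FALSE)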

Prediction Warnings when predicting a fitted caret SVM model

I'm hoping to get some pointers as to why I'm getting:
Warning message:
In method$predict(modelFit = modelFit, newdata = newdata, submodels = param) :
  kernlab class prediction calculations failed; returning NAs
When I print out the prediction:
svmRadial_Predict
[1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>....
The code I wrote to perform the SVM fitting:
# 10-fold cross validation in 3 repetitions
control = trainControl(seeds = s, method = "repeatedcv", number = 10,
                       repeats = 3, savePredictions = TRUE, classProbs = TRUE)
The SVM model for the fitting is like this:
svmRadial_model = train(y = modelTrain$Emotion,
                        x = modelTrain[c(2:4)],
                        method = 'svmRadial',
                        trControl = control,
                        data = modelTrain,
                        tuneLength = 3)
And the code I wrote to perform the prediction looks like this:
svmRadial_Predict <- predict(svmRadial_model,
newdata = modelTest[c(2:4)], probability = TRUE )
I've checked the data, and there are no NA values in the training or testing set. The y value is a factor and the x values are numeric, if that makes a difference. Any tips to debug this would be very much appreciated!
As the model trains I can see warnings like this:
line search fails -1.407938 -0.1710936 2.039448e-05
which I had assumed was just the model being unable to fit a hyperplane for particular observations in the data. I'm using the svmRadial kernel.
The data I'm trying to fit was already centred and scaled using the R scale() function.
Further work leads me to believe it's something to do with the
classProbs = TRUE flag. If I leave it out, no warnings are printed.
I've kicked off another run of my code; SVM seems to take ages to complete on my laptop for this task, but I'll report the results as soon as it completes.
As a final edit: the model fitting completed without error, and I can use that model just fine for prediction, calculating the confusion matrix, etc. I don't understand why including classProbs = TRUE breaks it, but maybe it's related to the cross-validation kernlab runs internally to build the class probability model combining with the cross-validation I had requested in my trainControl.
Your problem stems from the peculiarities of the caret package.
There are two potential reasons why your prediction fails with kernlab SVM methods called through caret:
The x, y interface returns a caret::train object which the predict function cannot use.
Solution: simply switch to the formula interface:
train(form = Emotion ~ . , data = modelTrain, ...
The iterative search within the support vector machine algorithm doesn't converge.
Solution 2a) Set different seeds before the train() call until it converges:
set.seed(xxx)
train(form = Emotion ~ . , data = modelTrain, ...)
Solution 2b) Decrease the parameter minstep, as suggested by @catastrophic-failure here. For this solution there is no corresponding argument in the ksvm function, so you would need to change the source code at line 2982 (minstep <- 1e-10) to a lower value and compile the source yourself. No guarantee it will help, though.
Try Solution 1 first, as it is the most likely fix!
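Putting Solution 1 together with the control object from the question (a sketch only; svmData is my own helper name, and I keep just the response and the three predictor columns used in the question):
set.seed(825)  # arbitrary seed, per Solution 2a

# keep only the response and the three predictors from the question
svmData <- data.frame(Emotion = modelTrain$Emotion, modelTrain[c(2:4)])

svmRadial_model <- train(Emotion ~ .,
                         data = svmData,
                         method = "svmRadial",
                         trControl = control,
                         tuneLength = 3)

svmRadial_Predict <- predict(svmRadial_model, newdata = modelTest[c(2:4)])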
My solution to this was just to leave out the classProbs = TRUE parameter of the trainControl function. Once I did that, everything worked. I'd guess it's related to what's happening with cross validation under the hood but I'm not certain of that.

r caret predict returns fewer outputs than inputs

I used caret to train an rpart model below.
trainIndex <- createDataPartition(d$Happiness, p = .8, list = FALSE)
dtrain <- d[trainIndex, ]
dtest  <- d[-trainIndex, ]

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)  ## 10-fold CV, 10 repeats

fitRpart <- train(Happiness ~ ., data = dtrain, method = "rpart",
                  trControl = fitControl)

testRpart <- predict(fitRpart, newdata = dtest)
dtest contains 1296 observations, so I expected testRpart to produce a vector of length 1296. Instead it's 1077 long, i.e. 219 short.
When I ran the prediction on just the first 220 rows of dtest, I got a predicted result of length 1, so it is consistently 219 short.
Any explanation for why this is so, and what I can do to get output consistent with the input?
Edit: d can be loaded from here to reproduce the above.
I downloaded your data and found what explains the discrepancy.
If you simply remove the missing values from your dataset, the length of the outputs match:
testRpart <- predict(fitRpart, newdata = na.omit(dtest))
Note nrow(na.omit(dtest)) is 1103, and length(testRpart) is 1103. So you need a strategy to address missing values. See ?predict.rpart and the options for the na.action parameter to choose what you want.
Similar to what Josh mentioned, if you need to generate predictions using predict.train from caret, simply pass na.action = na.pass:
testRpart <- predict(fitRpart, newdata = dtest, na.action = na.pass)
Note: moving this to a separate answer based on Ricky's comment on Josh's answer above for visibility.
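If you want to see exactly which test rows cause the mismatch before choosing between na.omit and na.pass, a quick check (my own addition, not from the answers above):
# rows of the test set with at least one missing value; these are the
# observations that predict() silently drops under na.action = na.omit
missing_rows <- which(!complete.cases(dtest))
length(missing_rows)
head(dtest[missing_rows, ])  # inspect them before deciding how to handle the NAs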
I had a similar issue using "newx" instead of "newdata" in the predict function. Using "newdata" (or nothing) solved my problem; I hope it helps someone else who used newx and has the same issue.

Issues with the predict function when building a CART model via cross-validation using the train command

I am trying to build a CART model via cross validation using the train function of "caret" package.
My data is a 4500 x 110 data frame, where all the predictor variables (except the first two, UserId and YOB (Year of Birth), which I am not using for model building) are factors with 2 levels, while the dependent variable is of type integer (although it has only two values, 1 and 0). Gender is one of the independent variables.
When I ran the rpart command to get a CART model (using the package "rpart"), I didn't have any problem with the predict function. However, I wanted to improve the model via cross-validation, and so used the train function from the "caret" package with the following command:
tr = train(y ~ ., data = subImpTrain, method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
This built the model with the following warning:
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
  There were missing values in resampled performance measures.
But it did give me a final model (best.tree). However, when I try to run the predict function on the test data with the following command:
best.tree.pred = predict(best.tree, newdata = subImpTest)
I get the following error:
Error in eval(expr, envir, enclos) : object 'GenderMale' not found
The Gender variable has two values: Female, Male
Can anybody help me understand this error?
As @lorelai suggested, caret dummy-codes your variables if you supply it a formula. An alternative is to provide the variables themselves, like so:
tr = train(y = subImpTrain$y, x = subImpTrain[, names(subImpTrain) != "y"],
           method = "rpart", trControl = tr.control, tuneGrid = cp.grid)
More importantly, however, you shouldn't use predict.rpart; use predict.train instead, like so:
predict(tr, subImpTest)
In which case it would work just fine with the formula interface.
I have had a similar problem in the past, although concerning another algorithm.
Basically, some algorithms transform the factor variables into dummy variables and rename them accordingly.
My solution was to create my own dummies and leave them in numerical format.
I read that decision trees manage to work properly even so.
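For illustration (my own sketch, not the answerer's code), caret's dummyVars() is one way to build those numeric dummies explicitly, so that the same column names exist at training and prediction time; here "y" stands in for the dependent variable, as in the earlier answer:
library(caret)

# dummy-coding recipe built from the predictors only
dv <- dummyVars(~ ., data = subImpTrain[, names(subImpTrain) != "y"])

trainDummy <- data.frame(y = subImpTrain$y, predict(dv, newdata = subImpTrain))
testDummy  <- data.frame(predict(dv, newdata = subImpTest))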
