How to do recursive feature elimination with logistic regression in R

There is a similar question with an answer, but it does not work for me (see below). The trick supposedly lies in rewriting the fit function of lmFuncs, but I get this error:
Error in { : task 1 failed - "Results do not have equal lengths"
along with many warnings of the form "glm.fit: fitted probabilities numerically 0 or 1 occurred".
Where is the fault?
lmFuncs$fit=function (x, y, first, last, ...)
{
tmp <- as.data.frame(x)
tmp$y <- y
glm(y ~ ., data = tmp, family=binomial(link='logit'))
}
ctrl <- rfeControl(functions = lmFuncs,method = 'cv',number=10)
fit.rfe=rfe(df.preds,df.depend, rfeControl=ctrl)
Also, the rfeControl help says that the 'functions' parameter can take a set of functions that work with caret's train function (caretFuncs). What does that really mean?
Any details or an example? Thanks.

I was having a similar issue when customising lmFuncs.
For logistic regression, make sure you use lrFuncs and set sizes according to the number of predictor variables. This leads to no issues.
Example (for functionality purposes only):
library(caret)
#Reproducible data
set.seed(1)
x <- as.data.frame(replicate(28, runif(10))) # 28 random predictors, 10 rows (columns V1..V28)
x$dpen <- sample(c(0,1), replace=TRUE, size=10)
x$dpen <- factor(x$dpen)
#Splitting the data into two parts based on outcome: 80% and 20%
index <- createDataPartition(x$dpen, p=0.8, list=FALSE)
trainSet <- x[ index,]
testSet <- x[-index,]
control <- rfeControl(functions = lrFuncs,
                      method = "cv",   # cross validation
                      verbose = FALSE) # prevents copious amounts of output from being produced
##RFE
rfe(trainSet[,1:28],  # predictor variables
    trainSet[,29],    # outcome (dpen is column 29)
    sizes = c(1:28),  # subset sizes to evaluate
    rfeControl = control)
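On the question about caretFuncs: as far as I understand the caret documentation, caretFuncs is a ready-made set of helper functions whose fit step calls train(), so a model that train() supports can be used inside rfe() by passing method (and other train() arguments) through rfe's "..." argument. A minimal sketch with simulated data; the data, the variable names and the inner 3-fold resampling are assumptions made for illustration and are not part of the original question:

```r
library(caret)

set.seed(1)
# Simulated data (hypothetical): 100 rows, 5 numeric predictors, binary outcome
preds <- as.data.frame(replicate(5, rnorm(100)))
outcome <- factor(ifelse(preds$V1 + rnorm(100) > 0, "yes", "no"))

# Outer resampling for the feature-elimination loop
ctrl <- rfeControl(functions = caretFuncs, method = "cv", number = 5)

# Arguments after rfeControl are forwarded to train(): here a binomial glm
profile <- rfe(preds, outcome,
               sizes = 1:4,
               rfeControl = ctrl,
               method = "glm",
               family = binomial,
               trControl = trainControl(method = "cv", number = 3))

predictors(profile)  # variables kept in the selected subset
```

Because every candidate subset is fitted through train(), this is slower than lrFuncs, but it lets the same call switch to another model type by changing method.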

Related

Cannot generate predictions in mgcv when using discretization (discrete=T)

I am fitting a generalized additive model with a random site-level effect, using the mgcv package for R. I had been doing this with the function gam(); however, to speed things up I need to shift to the bam() framework, which is basically the same as gam(), but faster. I further sped up fitting by passing the options bam(nthreads = N, discrete=T), where nthreads is the number of cores on my machine. However, when I use the discretization option and then try to make predictions with my model on new data, while ignoring the random effect, I consistently get an error.
Here is code to generate example data and reproduce the error.
library(mgcv)
#generate data.
N <- 10000
x <- runif(N,0,1)
y <- (0.5*x / (x + 0.2)) + rnorm(N)*0.1 #non-linear relationship between x and y.
#uninformative random effect.
random.x <- as.factor(do.call(paste0, replicate(2, sample(LETTERS, N, TRUE), FALSE)))
#fit models.
fit1 <- gam(y ~ s(x) + s(random.x, bs = 're')) #this one takes ~1 minute to fit, rest faster.
fit2 <- bam(y ~ s(x) + s(random.x, bs = 're'))
fit3 <- bam(y ~ s(x) + s(random.x, bs = 're'), discrete = T, nthreads = 2)
#make predictions on new data.
newdat <- data.frame(runif(200, 0, 1))
colnames(newdat) <- 'x'
test1 <- predict(fit1, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test2 <- predict(fit2, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
test3 <- predict(fit3, newdata=newdat, exclude = c("s(random.x)"), newdata.guaranteed = T)
Making predictions with the third model which uses discretization throws this error (which the other two do not):
Error in model.frame.default(object$dinfo$gp$fake.formula[-2], newdata) :
variable lengths differ (found for 'random.x')
In addition: Warning message:
'newdata' had 200 rows but variables found have 10000 rows
How can I go about making predictions for a new dataset using the model fit with discretization?
newdata.guaranteed doesn't seem to be working for bam() models with discrete = TRUE. You could email the author and maintainer of mgcv and send him the reproducible example so he can take a look. See ?bug.reports.mgcv.
You probably want
names(newdat) <- "x"
as data frames have names.
But the workaround is just to pass in something for random.x
newdat <- data.frame(x = runif(200, 0, 1), random.x = random.x[[1]])
and then do your call to generate test3 and it will work.
The warning message and error are the result of you not specifying random.x in the newdata, so mgcv goes looking for random.x and finds it in the global environment. You should really gather those variables into a data frame and use the data argument when you are fitting your models, and try not to leave similarly named objects lying around in your global environment.
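A minimal sketch of that advice, assuming a smaller simulated data set than in the question: the data frame dat and the column name site are made up for illustration. All variables live in one data frame, the model is fitted with the data argument, and newdata carries a placeholder level for the random effect that is then excluded from the prediction:

```r
library(mgcv)

set.seed(1)
N <- 2000
# Keep response, covariate and grouping factor together in one data frame
dat <- data.frame(x = runif(N))
dat$y <- 0.5 * dat$x / (dat$x + 0.2) + rnorm(N) * 0.1
dat$site <- factor(sample(LETTERS, N, replace = TRUE))  # uninformative random effect

fit <- bam(y ~ s(x) + s(site, bs = "re"), data = dat, discrete = TRUE)

# newdata must contain every variable in the formula; the site value is only
# a placeholder (an existing level), because s(site) is excluded below
newdat <- data.frame(x = runif(200), site = dat$site[1])
pred <- predict(fit, newdata = newdat, exclude = "s(site)")
length(pred)
```

With the random-effect term excluded, the placeholder level should not affect the predicted values; it only has to be a level the model has seen.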

Caret - Factor issue in multi class classification

I want to perform a multi-class classification with the caret package. Below is a minimal example.
library(caret)
library(randomForest)
x <- data.frame("A"=seq(1,100), "B"=seq(1,100), "C"="class1")
x[,"C"] <- as.character(x[,"C"])
x[1,"C"] <- "class2"
x[2,"C"] <- "class3"
x[3,"C"] <- "class4"
x[4,"C"] <- "class5"
x[5,"C"] <- "class6"
x[6,"C"] <- "class7"
x[7,"C"] <- "class8"
x[8,"C"] <- "class9"
x[9,"C"] <- "class10"
x[10,"C"] <- "class11"
x[11,"C"] <- "class12"
x[,"C"] <- as.factor(x[,"C"])
control <- trainControl(method="repeatedcv", number=10, repeats=1, search="grid")
set.seed(5)
metric <- "Accuracy" # not defined in the original post; caret's default metric for classification
tunegrid <- expand.grid(.mtry=c(1:2))
fit <- train(x=x[,1:2], y=x$C, method="rf", metric=metric, tuneGrid=tunegrid, trControl=control)
print(fit)
plot(fit)
When running the code I get warnings such as:
1: model fit failed for Fold2.Rep1: mtry=1 Error in randomForest.default(x, y, mtry = param$mtry, ...) :
Can't have empty classes in y.
Related posts suggest that this is due to factor levels of the response variable that have no observations in a given resample, which the resampling does not account for. Typically, one runs into the problem when there are many classes to be predicted and few observations per class.
Is there any workaround to make caret remove the missing factor levels in the resampling methods (e.g., by droplevels())?
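The cause described above can be reproduced without caret. The following is only a sketch of the mechanism, not a caret workaround: it uses a hand-made, deterministic 10-fold assignment (an assumption for illustration, not how caret builds its folds) on a response with the same class layout as the example, and shows that subsetting a factor keeps levels that no longer have any observations:

```r
# Same class layout as the example: 11 classes with one observation each,
# and one class with 89 observations
y <- factor(c(paste0("class", 2:12), rep("class1", 89)))

# Hypothetical deterministic fold assignment: observations 1, 11, 21, ... form fold 1
fold_id <- rep(1:10, length.out = length(y))

# Training part when fold 1 is held out
train_y <- y[fold_id != 1]

# Classes that are still levels of the factor but have no observations left
names(which(table(train_y) == 0))

nlevels(train_y)              # unchanged: subsetting keeps unused levels
nlevels(droplevels(train_y))  # smaller: empty levels removed
```

Here the singleton classes held out in fold 1 remain as empty levels of train_y, which is the situation randomForest complains about. Dropping the levels removes the error condition, but the model then cannot predict the held-out classes at all.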

How do I use predict() on new data for lme4::glmer model?

I have been trying to establish predictive performance (AUC ROC) for a glmer model. When I use the predict() function on a test data set, the output has the length of my training data set.
folds = 10;
glmerperf=rep(0,folds); glmperf=glmerperf;
TB_Train.glmer.subset <- TB_Train.glmer %>% select(one_of(subset.vars), IDNO)
TB_Train.glmer.fs <- TB_Train.glmer.subset[,c(1:7, 22)]
TB_Train.glmer.ns <- TB_Train.glmer.subset[, 8:21]
TB_Train.glmer.cns <- TB_Train.glmer.ns %>% scale(center=TRUE, scale=TRUE) %>% cbind(TB_Train.glmer.fs)
foldsamples = caret::createFolds(TB_Train.glmer.cns$Case.Status, k = folds, list = TRUE, returnTrain = FALSE)
for (n in 1:folds)
{
testdata = TB_Train.glmer.cns[foldsamples[[n]],]
traindata = TB_Train.glmer.cns[-foldsamples[[n]],]
GLMER <- lme4::glmer(Case.Status ~ . + (1 | IDNO), data = traindata, family="binomial", control=glmerControl(optimizer="bobyqa", optCtrl=list(maxfun=1000000)))
glmer.probs <- predict(GLMER, newdata=testdata$Non.TB.Case, type="response")
glmer.ROC <- roc(predictor=glmer.probs, response=testdata$Case.Status, levels=rev(levels(testdata$Case.Status)))
glmerperf[n] <- glmer.ROC$auc
}
prob <- predict(GLMER, newdata=TB_Test.glmer$Non.TB.Case, type="response", re.form=~(1|IDNO))
print(sprintf('Mean AUC ROC of model on test set for GLMER %f', mean(glmerperf)))
Both the prob and glmer.probs objects are the length of the traindata object, despite specifying the newdata argument. I have noticed issues with the predict function in the past, but none as specific as this one.
Also, when the model is run, I get several errors about needing to scale my data (which I already have) and that the model fails to converge. Any ideas on how to fix this? I have already bumped up the iterations and selected a new optimizer.
I figured out that the error was arising from using the "." shortcut to specify all predictors in the model formula.
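One way to avoid the "." shortcut is to build the formula from an explicit vector of predictor names. This is a sketch only: the predictor names below are hypothetical placeholders, since the real column names are not shown in the question, while Case.Status and IDNO are taken from the code above:

```r
# Hypothetical predictor names; replace with the actual fixed-effect columns
predictor_names <- c("age", "bmi")

# Build "Case.Status ~ age + bmi + (1 | IDNO)" without using "."
fixed_part <- paste(predictor_names, collapse = " + ")
glmer_formula <- as.formula(paste("Case.Status ~", fixed_part, "+ (1 | IDNO)"))

glmer_formula
all.vars(glmer_formula)  # every variable the model will look for in data/newdata
```

The resulting formula can be passed to lme4::glmer() in place of Case.Status ~ . + (1 | IDNO). Note that predict() also expects newdata to be a data frame containing these variables, not a single column.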

Run several GLM models using for loop in R

I'm trying to run an experiment in which I fit several GLMs in R using the same variables but different training samples.
Here is some simulated data:
resp <- sample(0:1,100,TRUE)
x1 <- c(rep(5,20),rep(0,15), rep(2.5,40),rep(17,25))
x2 <- c(rep(23,10),rep(5,10), rep(15,40),rep(1,25), rep(2, 15))
dat <- data.frame(resp,x1, x2)
This is the loop I'm trying to use:
n <- 5
for (i in 1:n)
{
### Create training and testing data
## 80% of the sample size
# Note that I didn't use seed so that random split is performed every iteration.
smp_sizelogis <- floor(0.8 * nrow(dat))
train_indlogis <- sample(seq_len(nrow(dat)), size = smp_sizelogis)
trainlogis <- dat[train_indlogis, ]
testlogis <- dat[-train_indlogis, ]
InitLOogModel[i] <- glm(resp ~ ., data =trainlogis, family=binomial)
}
But unfortunately, I'm getting this error:
Error in InitLOogModel[i] <- glm(resp ~ ., data = trainlogis, family = binomial) :
object 'InitLOogModel' not found
Any thoughts.
I'd suggest using caret for what you're trying to do. It takes some time to learn, but incorporates many 'best practices'. Once you've learned the basics you'll be able to quickly try models other than a glm, and easily compare the models to each other. Here's modified code from your example to get you started.
## caret
library(caret)
# your data
resp <- sample(0:1,100,TRUE)
x1 <- c(rep(5,20),rep(0,15), rep(2.5,40),rep(17,25))
x2 <- c(rep(23,10),rep(5,10), rep(15,40),rep(1,25), rep(2, 15))
dat <- data.frame(resp,x1, x2)
# so caret knows you're trying to do classification, otherwise will give you an error at the train step
dat$resp <- as.factor(dat$resp)
# create a hold-out set to use after your model fitting
# not really necessary for your example, but showing for completeness
train_index <- createDataPartition(dat$resp, p = 0.8,
list = FALSE,
times = 1)
# create your train and test data
train_dat <- dat[train_index, ]
test_dat <- dat[-train_index, ]
# repeated cross validation, repeated 5 times
# this is like your 5 loops, taking 80% of the data each time
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5)
# fit the glm!
glm_fit <- train(resp ~ ., data = train_dat,
method = "glm",
family = "binomial",
trControl = fitControl)
# summary
glm_fit
# best model
glm_fit$finalModel
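As for the loop in the question itself: the error occurs because InitLOogModel is assigned into before it has been created, and a fitted glm object needs a list slot ([[i]]) rather than a vector element ([i]). A minimal sketch that keeps the original approach; the name models and the seed are choices made here for illustration:

```r
set.seed(42)  # only so the simulated data are reproducible
resp <- sample(0:1, 100, TRUE)
x1 <- c(rep(5, 20), rep(0, 15), rep(2.5, 40), rep(17, 25))
x2 <- c(rep(23, 10), rep(5, 10), rep(15, 40), rep(1, 25), rep(2, 15))
dat <- data.frame(resp, x1, x2)

n <- 5
models <- vector("list", n)  # create the container before the loop

for (i in seq_len(n)) {
  # fresh random 80% training sample on every iteration
  train_ind <- sample(seq_len(nrow(dat)), size = floor(0.8 * nrow(dat)))
  models[[i]] <- glm(resp ~ ., data = dat[train_ind, ], family = binomial)
}

length(models)     # one fitted model per iteration
coef(models[[1]])  # coefficients of the first fit
```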

svm {e1071} predict creates larger array of predicted values than expected

I'm using a Support Vector Machine (SVM, package e1071) in R to build a classification model and to predict a 7-level class out of sample.
The problem is that, when using the predict function, I obtain an array much larger than the number of rows in the validation set. See code and results below.
Any suggestions about what goes wrong? Do I misinterpret the predict function in the SVM package?
install.packages(c("e1071","caret")) # package names must be passed as one character vector
library(e1071)
library(caret)
data <- data.frame(replicate(10,sample(0:6,1000,rep=TRUE)))
trainIndex <- createDataPartition(data[,1], p = 0.8,
list = FALSE,
times = 1)
trainset <- data[trainIndex,2:10]
validationset <- data[-trainIndex,2:10]
trainlabel <- data[trainIndex,1]
validationlabel <- data[-trainIndex,1]
svmModel <- svm(x = trainset,
y = trainlabel,
type = "C-classification",
kernel = "radial")
# Predict
svmPred <- predict(svmModel, x = validationset)
length(svmPred)
# 800, expected 200 since validationset has nrow = 200.
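It's because predict has no argument named x for svm models; the new data argument is called newdata. Your x = validationset is therefore not used as new data, and you get the fitted values for the 800 training rows instead.
Try: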
svmPred <- predict(svmModel, validationset)
length(svmPred)
