I am using the rfe function in the caret package to do feature selection. I frequently get the following error:
Error in { :
  task 1 failed - "error in evaluating the argument 'x' in selecting a method for function 'as.data.frame': Error in `[.data.frame`(x, , retained, drop = FALSE) :
  undefined columns selected"
I'm doing 100 resamples and it runs through about 60 of them before producing the error. I'm executing the following:
folds <- 100
validmethod <- "boot"
subsets <- c(5, 10, 15, 20, 25)

ctrl <- rfeControl(functions = funcs,
                   method = validmethod,
                   rerank = TRUE,
                   saveDetails = TRUE,
                   verbose = TRUE,
                   returnResamp = "all",
                   number = folds)

rfe(df.preds, df.depend, metric = smetric, sizes = subsets, rfeControl = ctrl)
Can someone help me understand the types of things which would cause this error?
MWE:
library(caret)

df <- cbind(rbinom(100, 1, 0.5), rnorm(100, 0, 1),
            rnorm(100, 5, 5), rnorm(100, 12, 4), rnorm(100, 100, 0.1))
colnames(df) <- c("response", "f1", "f2", "f3", "f4")

rfe(x = df[, -1], y = as.factor(df[, 1]), sizes = 1:3,
    rfeControl = rfeControl(functions = caretFuncs,
                            number = 2, method = "cv"),
    method = "svmRadial")
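For the real data, a few hedged things are worth checking first (a diagnostic, not a definitive cause): the traceback shows rfe() subsetting the predictor data frame by the names of the columns it decided to retain, so anything that lets those names drift away from colnames(x) between resamples can produce "undefined columns selected", and bootstrap resampling makes the failure stochastic, which would be consistent with it appearing only after many resamples have run. The object names below are the ones from the question:

library(caret)

# Are all predictor names syntactically valid? Some fitting functions silently
# convert them, so the retained names no longer match colnames(df.preds).
all(colnames(df.preds) == make.names(colnames(df.preds)))

# Any factor predictors? Some models expand these into dummy-variable names
# that do not exist as columns in the original data frame.
names(Filter(is.factor, as.data.frame(df.preds)))

# Any predictors that could become constant (and get dropped) inside a
# bootstrap resample?
nearZeroVar(df.preds, saveMetrics = TRUE)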
How to choose sizes?
Related
I am currently trying to run a parallelized Data Envelopment Analysis (DEA) using the "deaR" and "snow" packages.
The libraries that I use are:
library(deaR)
library(snow)
library(dplyr)
I generate a random data frame as follows:
n <- 1000
df <- data.frame("DMU"     = paste0("Unit_", c(1:n)),
                 "input1"  = round(runif(n, min = 0.001, max = 1), digits = 3),
                 "input2"  = round(runif(n, min = 0.001, max = 1), digits = 3),
                 "input3"  = round(runif(n, min = 0.001, max = 1), digits = 3),
                 "input4"  = round(runif(n, min = 0.001, max = 1), digits = 3),
                 "output1" = round(runif(n, min = 0.001, max = 1), digits = 3),
                 "output2" = round(runif(n, min = 0.001, max = 1), digits = 3),
                 "output3" = round(runif(n, min = 0.001, max = 1), digits = 3))
Then, I run a standard DEA as follows:
dea <- function(x){
  model <- read_data(x,
                     dmus = 1,
                     inputs = 2:5,
                     outputs = 6:8)
  results <- model_basic(model,
                         orientation = "io",
                         rts = "vrs")
  return(results)
}
results <- dea(df)
eff <- efficiencies(results)
Finally, I try the same but with parallelization:
cluster <- makeCluster(16)
clusterEvalQ(cluster, library(deaR))

dea <- function(x){
  model <- read_data(x,
                     dmus = 1,
                     inputs = 2:5,
                     outputs = 6:8)
  results <- model_basic(model,
                         orientation = "io",
                         rts = "vrs")
  return(results)
}

results <- clusterApply(cl = cluster, x = df, fun = dea)
stopCluster(cluster)
eff <- efficiencies(results)
In this case, the outcome is an error:
Error in checkForRemoteErrors(val) :
8 nodes produced errors; first error: Invalid data datadea (should be a data frame)!
When I introduce something like "x <- as.data.frame(x)", the outcome is:
Error in checkForRemoteErrors(val) :
8 nodes produced errors; first error: undefined columns selected
Every time I work around one error (which shouldn't be happening in the first place), a new one appears, even though the same process works perfectly without parallelization.
What can I do? How can I make this DEA run in parallel?
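For what it's worth, here is a minimal sketch of one way to get this running, under the assumption that the goal is to apply the same DEA model to several complete datasets. clusterApply() iterates over the elements of x, and a data frame is internally a list of its columns, so passing the data frame directly sends one column to each worker, which is what the "should be a data frame" message is complaining about (and once a single column is coerced with as.data.frame(), asking read_data() for columns 2:5 produces the follow-up "undefined columns selected"). Wrapping each complete data frame in a list element keeps the DMU/input/output structure intact on every node. The two-element datasets list below is only a placeholder:

library(snow)
library(deaR)

datasets <- list(df, df)              # placeholder: each element is one complete DEA dataset

cluster <- makeCluster(2)
clusterEvalQ(cluster, library(deaR))

dea <- function(x) {
  model <- read_data(as.data.frame(x), dmus = 1, inputs = 2:5, outputs = 6:8)
  model_basic(model, orientation = "io", rts = "vrs")
}

results <- clusterApply(cl = cluster, x = datasets, fun = dea)
stopCluster(cluster)

eff <- lapply(results, efficiencies)  # one vector of efficiencies per dataset

Note that splitting the rows of a single dataset across workers would not be equivalent: each DMU's efficiency is measured against the whole frontier, so the unit of parallel work here has to be an entire dataset (or an entire model specification), not a chunk of DMUs.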
I'm new to R and trying to isolate the best performing features from a data set of 247 columns (246 variables + 1 outcome), and 800 or so rows (where each row is one person's data) to create a predictive model.
I'm using caret to do RFE with lmFuncs; I need to use linear regression since the target variable is continuous.
I use the following to split into test/training data (this hasn't thrown any errors):
inTrain <- createDataPartition(data$targetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
The resulting test and train data frames are consistent, i.e. x and y contain the same number of samples and all columns are the same length.
My control parameters are as follows (this also runs without error):
control = rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
But when I run RFE I get an error message saying
Error in rfe.default(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control) :
there should be the same number of samples in x and y
My code for RFE is as follows, with the target variable in the first column:
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
I've looked through various forums, but nothing seems to work.
This Google Groups thread suggests using an older version of caret, which I tried, but I got the same x/y error: https://groups.google.com/g/rregrs/c/qwcP0VGn4ag?pli=1
Others suggest converting the target variable to a factor or matrix. This hasn't helped, and produces
Warning message:
In createDataPartition(data$EBI_SUM, p = 0.8, list = F) :
Some classes have a single record
when partitioning the data into test/train, and the same X/Y sample error if you try to carry out RFE.
Mega thanks in advance :)
P.S. Here's the dput for the target variable (EBI_SUM) and a couple of the variables:
data <- structure(list(TargetVar = c(243, 243, 243, 243, 355, 355),
                       Dosing = c(2, 2, 2, 2, 2, 2),
                       `QIDS_1 ` = c(1, 1, 3, 1, 1, 1),
                       `QIDS_2 ` = c(3, 3, 2, 3, 3, 3),
                       `QIDS_3 ` = c(1, 2, 1, 1, 1, 2)),
                  row.names = c(NA, -6L),
                  class = c("tbl_df", "tbl", "data.frame"))
Your data object's column names should not contain spaces:
library(caret)
data <- data.frame(
TargetVar = c(243, 243, 243, 243, 355, 355),
Dosing = c(2, 2, 2, 2, 2, 2),
QIDS_1 = c(1, 1, 3, 1, 1, 1),
QIDS_2 = c(3, 3, 2, 3, 3, 3),
QIDS_3 = c(1, 2, 1, 1, 1, 2)
)
inTrain <- createDataPartition(data$TargetVar, p = .8, list = F)
train <- data[inTrain, ]
test <- data[-inTrain, ]
control <- rfeControl(functions = lmFuncs, method = "repeatedcv", repeats = 5, verbose = F, returnResamp = "all")
rfe_lm_profile <- rfe(train[, -1], train[, 1], sizes = c(10, 15, 20, 25, 30), rfeControl = control)
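As a small, hedged addition: if the real dataset already has those trailing spaces baked into its column names (as the dput suggests), they can be cleaned in place before partitioning instead of retyping the data frame:

# Strip surrounding whitespace and force syntactically valid, unique column names
names(data) <- make.names(trimws(names(data)), unique = TRUE)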
I am getting an error when I try to run glmulti on different datasets in parallel using furrr::future_map. It works when future_map is called sequentially.
# Load packages
library(furrr)
library(future)
library(tidyverse)
library(glmulti)

# This doesn't work
plan(multisession, workers = 2)  # set the number of parallel sessions (can't use multicore on Windows)
mods <- list(tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5)),
             tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5))) %>%
  future_map(~ glmulti(
    y = "outcome",
    xr = "exposure",
    data = .x,
    level = 1,
    method = "g",            # genetic algorithm
    fitfunction = "glm",
    family = binomial,
    confsetsize = 2,         # maximum number of candidate models, so it doesn't run indefinitely
    plotty = F, report = F   # to simplify the outputs
  ))
Here is the error message this gives:
Error in get(as.character(FUN), mode = "function", envir = envir) : object 'aic' of mode 'function' was not found
It runs fine when done sequentially:
# This works
plan(multiprocess, workers = 1)  # 1 worker, so normal map behaviour
mods <- list(tibble(exposure = rnorm(1000, 0:1000), outcome = rbinom(1000, 1, 0.5)),
             tibble(exposure = rnorm(1000, 0:1000), outcome = rbinom(1000, 1, 0.5))) %>%
  future_map(~ glmulti(
    y = "outcome",
    xr = "exposure",
    data = .x,
    level = 1,
    method = "g",        # genetic algorithm
    fitfunction = "glm",
    family = binomial,
    confsetsize = 2,
    plotty = F, report = F
  ))
Is there any way to fix this, or is it a problem with one of the two packages? If so, is the issue more likely to be with furrr or with glmulti?
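One thing that may be worth trying, offered as a sketch rather than a confirmed fix: the message suggests that a glmulti helper (the aic() criterion function) is being looked up by name on a worker where the package is not attached, and future's automatic dependency detection cannot see a lookup that happens via a character string. Attaching glmulti inside the mapped function, and additionally telling furrr to load it on the workers, would look like this (fit_one is a helper name introduced only for this sketch):

library(furrr)
library(glmulti)
library(tidyverse)

plan(multisession, workers = 2)

fit_one <- function(dat) {
  library(glmulti)  # attach inside the worker so functions such as aic() resolve by name
  glmulti(y = "outcome", xr = "exposure", data = dat,
          level = 1, method = "g", fitfunction = "glm",
          family = binomial, confsetsize = 2,
          plotty = FALSE, report = FALSE)
}

mods <- list(tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5)),
             tibble(exposure = rnorm(100, 0:100), outcome = rbinom(100, 1, 0.5))) %>%
  future_map(fit_one, .options = furrr_options(packages = "glmulti"))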
library(rjags)
library(pcnetmeta)

data(smoke)
set.seed(1234)
hom.eqcor.out <- nma.ab.bin(s.id, t.id, r, n, data = smoke,
                            param = c("AR"), model = "het_cor", prior.type = "chol",
                            c = 10, higher.better = TRUE,
                            n.adapt = 5000, n.iter = 100000, n.chains = 2)
I'm using the nma.ab.bin function in pcnetmeta. However, I get the following error:
"Error in update.default(jags.m, n.iter = n.burnin) :
need an object with call component"
I've tried playing around with n.adapt and n.iter: using n.adapt = 400 and n.iter = 1000 runs without any problem, but increasing those values seems to lead to the above error. Why would that be?
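As a hedged way to narrow this down rather than a fix: since the small settings work and the large ones fail, stepping n.adapt and n.iter up with the same seed and wrapping the call in try() at least shows which settings first trigger the error and whether it is reproducible:

# Sketch: probe increasing sampler settings with the same call as above
for (na in c(400, 1000, 2000, 5000)) {
  set.seed(1234)
  out <- try(nma.ab.bin(s.id, t.id, r, n, data = smoke,
                        param = c("AR"), model = "het_cor", prior.type = "chol",
                        c = 10, higher.better = TRUE,
                        n.adapt = na, n.iter = 20 * na, n.chains = 2))
  cat("n.adapt =", na, "n.iter =", 20 * na, "->",
      if (inherits(out, "try-error")) "error" else "ok", "\n")
}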
Even though I define target <- factor(train$target, levels = c(0, 1)), the code below produces this error:
Error in cut.default(y, unique(quantile(y, probs = seq(0, 1, length = cuts))), :
  invalid number of intervals
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) :
  cannot compute class probabilities for regression
What does it mean, and how can I fix it?
gbmGrid <- expand.grid(n.trees = (1:30)*10,
interaction.depth = c(1, 5, 9),
shrinkage = 0.1)
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = FALSE,
returnResamp = "all",
classProbs = TRUE)
target <- factor(train$target, levels = c(0, 1))
gbm <- caret::train(target ~ .,
data = train,
#distribution="gaussian",
method = "gbm",
trControl = fitControl,
tuneGrid = gbmGrid)
prob <- predict(gbm, newdata = testing, type = "prob")[, 2]
First, don't do this:
target <- factor(train$target, levels = c(0, 1))
You will get a warning:
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
Second, you created a standalone object called target. Using the formula method means that train() will use the column called target inside the data frame train, and those are different data. Modify the column in the data frame itself.
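A minimal sketch of what that fix could look like, assuming the outcome in train is coded 0/1 (the "no"/"yes" labels are just one choice of valid R names, and the fitted object is renamed gbm_fit so it does not shadow the gbm package):

# Recode the outcome column itself, with levels that are valid R variable names
train$target <- factor(train$target, levels = c(0, 1), labels = c("no", "yes"))

gbm_fit <- caret::train(target ~ .,
                        data = train,
                        method = "gbm",
                        trControl = fitControl,
                        tuneGrid = gbmGrid,
                        verbose = FALSE)

prob <- predict(gbm_fit, newdata = testing, type = "prob")[, "yes"]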