Since I use cross-validation with many different algorithms I decided to build myself the following function:
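library(caret)
library(doSNOW)  # also attaches snow, which provides makeCluster()/stopCluster()

crossFun <- function(myseed, vars, labels, par, tim, algo, len) {
  set.seed(myseed)
  # Pre-compute the resampling indices so every algorithm sees the same folds
  multiFolds <- createMultiFolds(labels, k = par, times = tim)
  cv_ctrl <- trainControl(method = 'repeatedcv', number = par, repeats = tim, index = multiFolds)
  # Start a 3-worker socket cluster for parallel resampling
  cl <- makeCluster(3, type = 'SOCK')
  registerDoSNOW(cl)
  result <- train(x = vars, y = labels, method = algo, tuneLength = len, trControl = cv_ctrl)
  stopCluster(cl)
  return(result)
}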
The function does work, but with the result, I get the following warning:
Warning message: Setting row names on a tibble is deprecated.
I couldn't find any clear explanation to it. I tried writing the function in different ways but nothing seems to get rid of that message.
Any ideas what it means?
The tidyverse way encourages not using rownames.
You can always coerce back to a base data frame with as.data.frame().
The encouraged way is to use tibble::rownames_to_column() to make rownames a new variable.
See this issue.
A similar question to mine was asked more than six years ago and has not been solved (R -- Can I apply the train function in caret to a list of data frames?)
This is why I am bringing up this topic again.
I'm writing my own functions for a big R project at the moment, and I'm wondering whether I can wrap the model-training function train() from the caret package so that it works for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
  model <- train(predictor ~., data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center","scale"))
  return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I can tell, an R formula cannot take a variable as input.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean, and it would save work for sure.
By writing predictor_iris <- "Species", you are saving a string in predictor_iris. Thus, when you run lda_ex, I guess you get an error from the formula in train(), since you are trying to predict a string using vectors of covariates.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
  model <- train(formula, data,
                 method = "lda",
                 trControl = trainControl(method = "none"),
                 preProc = c("center","scale"))
  return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Remember to load the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
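If you would still rather pass the response name as a string, one option is to build the formula from that string with base R's reformulate(). This is only a sketch and not part of the answer above: the helper name make_formula is hypothetical, and lm() stands in for train() so that the example runs without caret.

```r
# Build `response ~ .` from a response name held in a string.
# `make_formula` is a hypothetical helper name used for illustration.
make_formula <- function(response) {
  reformulate(".", response = response)
}

f <- make_formula("Species")
inherits(f, "formula")

# Same idea with lm() standing in for caret::train():
fit_string  <- lm(make_formula("mpg"), data = mtcars)
fit_literal <- lm(mpg ~ ., data = mtcars)

# Both fits should give the same coefficients
all.equal(coef(fit_string), coef(fit_literal))
```

Inside lda_ex, the same call could replace the hard-coded formula, so that the function accepts "Species" directly.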
I have a matrix X and vector Y which I use as arguments into the rfe function from the caret package. It's as simple as:
I get a weird error which I can't decipher:
promise already under evaluation: recursive default argument reference or earlier problems?
EDIT:
Here is a reproducible example for the first 5 rows of my data:
library(caret)
X_values = c(29.04,96.57,4.57,94.23,66.81,26.71,69.01,77.06,49.52,97.59,47.57,64.07,24.25,11.27,77.30,90.99,44.05,30.96,96.32,16.04)
X = matrix(X_values, nrow = 5, ncol=4)
Y = c(5608.11,2916.61,5093.05,3949.35,2482.52)
rfe(X, Y)
My R version is 3.2.3. Caret package is 6.0-76.
Does anybody know what this is?
There are two problems with your code.
You need to specify the function/algorithm that you want to fit. (this is what causes the error message you get. I am unsure why rfe throws such a cryptic error message; it makes it difficult to debug, indeed.)
You need to name your columns in the input data.
The following works:
library(caret)
X_values = c(29.04,96.57,4.57,94.23,66.81,26.71,69.01,77.06,49.52,97.59,47.57,64.07,24.25,11.27,77.30,90.99,44.05,30.96,96.32,16.04)
X = matrix(X_values, nrow = 5, ncol=4)
Y = c(5608.11,2916.61,5093.05,3949.35,2482.52)
ctrl <- rfeControl(functions = lmFuncs)
colnames(X) <- letters[1:ncol(X)]
set.seed(123)
rfe(X, Y, rfeControl = ctrl)
I chose a linear model for the rfe.
The reason for the warning messages is the low number of observations in your data during cross validation. You probably also want to set the sizes argument to get a meaningful feature elimination.
I have a classification problem with a very skewed class to predict (e.g. 90% / 10% unbalanced binary variable to predict).
In order to deal with that issue, I want to use the SMOTE method to oversample this class variable. However, as I read here (http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation) it is best practice to use SMOTE inside the k-fold loop to avoid overfitting.
As I'm using the caret package to perform my analysis, I'm referring to this link (http://topepo.github.io/caret/sampling.html). I understand everything except the last part, which explains how to change the SMOTE parameters:
smotest <- list(name = "SMOTE with more neighbors!",
                func = function (x, y) {
                  library(DMwR)
                  dat <- if (is.data.frame(x)) x else as.data.frame(x)
                  dat$.y <- y
                  dat <- SMOTE(.y ~ ., data = dat, k = 10)
                  list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
                       y = dat$.y)
                },
                first = TRUE)
I simply don't understand this. Someone care to explain? Let's say I want to include the SMOTE parameters perc.over, k and perc.under, how would I do that?
Thank you very much.
EDIT:
Actually I realized I could probably just add these parameters inside the "SMOTE" expression in the above function, this would for instance give something like:
smotest <- list(name = "SMOTE with more neighbors!",
                func = function (x, y) {
                  library(DMwR)
                  dat <- if (is.data.frame(x)) x else as.data.frame(x)
                  dat$.y <- y
                  dat <- SMOTE(.y ~ ., data = dat, k = 10, perc.over = 1200, perc.under = 100)
                  list(x = dat[, !grepl(".y", colnames(dat), fixed = TRUE)],
                       y = dat$.y)
                },
                first = TRUE)
I am not sure I have understood what is unclear to you, but here is an attempt to clarify what this piece of code does.
The smotest object is created as a list because that is the form the sampling argument of the trainControl function expects. The first element of this list is a name used only for display purposes. The second, func, is the actual sampling function. The third, first, is a logical value indicating whether sampling is done before or after the pre-processing step.
The element func is simply a wrapper around the SMOTE function. In this wrapper, line 3 is there because only a data.frame can be passed to SMOTE. Line 4 is added because SMOTE takes a formula together with a data.frame rather than an x, y pair. Line 6 ensures that the result is returned in the format caret expects.
And to answer your last question: yes, you can set additional SMOTE parameters the way you proposed.
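To make the three-element structure concrete, here is a sketch of a sampler with the same shape (name, func, first) that needs no extra package. It swaps SMOTE for plain random oversampling of the smaller classes, so it is not equivalent to SMOTE; the object name upsample_sketch is hypothetical, and the sketch assumes every class has at least one observation.

```r
# Same list layout as `smotest`, but with simple random oversampling
# instead of SMOTE (illustration only; `upsample_sketch` is a made-up name).
upsample_sketch <- list(
  name = "simple random oversampling (illustration only)",
  func = function(x, y) {
    dat <- if (is.data.frame(x)) x else as.data.frame(x)
    y <- droplevels(as.factor(y))
    n_max <- max(table(y))  # size of the largest class
    idx <- unlist(lapply(levels(y), function(lev) {
      rows <- which(y == lev)
      extra <- n_max - length(rows)
      if (extra > 0) {
        # resample row positions with replacement to reach n_max
        c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
      } else {
        rows
      }
    }))
    # Return predictors and outcome separately, as in the SMOTE wrapper
    list(x = dat[idx, , drop = FALSE], y = y[idx])
  },
  first = TRUE
)

# Calling func directly shows what it returns
set.seed(1)
x_demo <- data.frame(a = rnorm(20), b = rnorm(20))
y_demo <- factor(c(rep("neg", 18), rep("pos", 2)))
out <- upsample_sketch$func(x_demo, y_demo)
table(out$y)
```

Such a list would then be passed in the same way as smotest, that is, as the sampling argument of trainControl.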
Here is my code:
train_points <- read.table("kaggle_train_points.txt", sep="\t")
train_labels <- read.table("kaggle_train_labels.txt", sep="\t")
test_points <- read.table("kaggle_test_points.txt", sep="\t")
#uses package 'class'
library(class)
knn(train_points, test_points, train_labels, k = 5);
dim(train_points) is 42000 x 784
dim(train_labels) is 42000 x 1
I don't see the issue, but I'm getting the error :
Error in knn(train_points, test_points, train_labels, k = 5) :
'train' and 'class' have different lengths.
What's the problem?
Without access to the data, it's really hard to help. However, I suspect that train_labels should be a vector. So try
cl = train_labels[,1]
knn(train_points, test_points, cl, k = 5)
Also double check:
dim(train_points)
dim(test_points)
length(cl)
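The length check that knn() performs can be reproduced with built-in data. The sketch below uses iris as a stand-in for the Kaggle files (which are not available here), so the split and the variable contents are assumptions made for illustration only.

```r
library(class)

set.seed(42)
idx <- sample(nrow(iris), 100)
train_points <- iris[idx, 1:4]
test_points  <- iris[-idx, 1:4]

# One-column data frame, like the result of read.table()
train_labels <- iris[idx, "Species", drop = FALSE]

# length() of a data frame is its number of columns, not rows
length(train_labels)

# Passing the data frame as `cl` therefore fails the length check
res_df <- try(knn(train_points, test_points, train_labels, k = 5), silent = TRUE)
inherits(res_df, "try-error")

# Extracting the column as a vector makes the lengths agree
cl <- train_labels[, 1]
pred <- knn(train_points, test_points, cl, k = 5)
length(pred)
```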
I had the same issue when applying knn to the Wisconsin breast cancer diagnosis dataset. The cause was that the cl argument needs to be a factor vector. My mistake was to write cl = labels, thinking this was the vector to be predicted, when it was in fact a one-column data frame. The solution was to use the following syntax: knn(train, test, cl = labels$diagnosis, k = 21), where diagnosis is the header of the single column in the labels data frame. After that it worked well.
Hope this helps!
I have recently encountered a very similar issue.
I wanted to give only a single column as a predictor. In such cases, selecting a column, you have to remember about drop argument and set it to FALSE. The knn() function accepts only matrices or data frames as train and test arguments. Not vectors.
knn(train = trainSet[, 2, drop = FALSE], test = testSet[, 2, drop = FALSE], cl = trainSet$Direction, k = 5)
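The effect of drop can be checked on any base data frame. The following sketch again uses iris as stand-in data (an assumption, since trainSet and testSet are not shown) and a single predictor column.

```r
library(class)

set.seed(7)
idx <- sample(nrow(iris), 100)
trainSet <- iris[idx, ]
testSet  <- iris[-idx, ]

# Default behaviour on a base data frame: one column collapses to a vector
one_col_vec <- trainSet[, 2]
is.null(dim(one_col_vec))

# drop = FALSE keeps the result a one-column data frame
one_col_df <- trainSet[, 2, drop = FALSE]
dim(one_col_df)

# knn() with a single predictor kept as a data frame
pred <- knn(train = trainSet[, 2, drop = FALSE],
            test  = testSet[, 2, drop = FALSE],
            cl    = trainSet$Species,
            k     = 5)
length(pred)
```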
Try converting the data into a data frame using as.data.frame(). I was having the same problem, and afterwards it worked fine:
train_pointsdf <- as.data.frame(train_points)
train_labelsdf <- as.data.frame(train_labels)
test_pointsdf <- as.data.frame(test_points)
Simply set drop = TRUE when extracting cl from the data frame; this drops the redundant dimension of a one-column result, so you get a vector:
cl = train_labels[,1, drop = TRUE]
knn(train_points, test_points, cl, k = 5)
I had a similar error when I was reading to a tibble (read_csv) and when I switched to read.csv the code worked.
I followed the code as given in the book, but it throws an error because of mismatched lengths (one argument is a data frame, the other is the vector returned by knn). Nothing here worked exactly for me, but the answers gave me the idea that vectors are needed for the comparison.
This throws error
gmodels::CrossTable(x = wbcd_test_labels, # actuals
y = wbcd_test_pred, # predicted
prop.chisq = FALSE)
The following works :
gmodels::CrossTable(x = wbcd_test_labels$diagnosis, # actuals
y = wbcd_test_pred, # predicted
prop.chisq = FALSE)
Using $ for x extracts a vector, so the lengths match.
Additionally, when running knn:
The cl parameter should also be a vector. Save the labels in a vector, or use labelDF$Class_label; otherwise there will be a length mismatch.
wbcd_test_pred <- knn(train = wbcd_train,
test = wbcd_test,
cl =wbcd_train_labels$diagnosis, #note this
k = 21)
Hope this helps beginners like me.
Uninstall previous versions of R and install an R version newer than 4.0. It will work.
I have a weird problem with R that I can't seem to work out.
I've tried to write a function that performs K-fold cross validation for a model chosen by the stepwise procedure in R. (I'm aware of the issues with stepwise procedures, it's purely for comparison purposes) :)
Now the issue is, that if I define the function parameters (linmod,k,direction) and run the contents of the function, it works flawlessly. BUT, if I run it as a function, I get an error saying the datas.train object can't be found.
I've tried stepping through the function with debug() and the object clearly exists, but R says it doesn't when I actually run the function. If I just fit a model using lm() it works fine, so I believe it's a problem with the step function in the loop, while inside a function. (try commenting out the step command, and set the predictions to those from the ordinary linear model.)
#CREATE A LINEAR MODEL TO TEST FUNCTION
lm.cars <- lm(mpg~.,data=mtcars,x=TRUE,y=TRUE)
#THE FUNCTION
cv.step <- function(linmod,k=10,direction="both"){
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  form <- formula(linmod$call)
  # generate indices for cross validation
  rar <- n/k
  xval.idx <- list()
  s <- sample(1:n, n) # permutation of 1:n
  for (i in 1:k) {
    xval.idx[[i]] <- s[(ceiling(rar*(i-1))+1):(ceiling(rar*i))]
  }
  #error calculation
  errors <- R2 <- 0
  for (j in 1:k){
    datas.test <- datas[xval.idx[[j]],]
    datas.train <- datas[-xval.idx[[j]],]
    test.idx <- xval.idx[[j]]
    #THE MODELS+
    lm.1 <- lm(form,data= datas.train)
    lm.step <- step(lm.1,direction=direction,trace=0)
    step.pred <- predict(lm.step,newdata= datas.test)
    step.error <- sum((step.pred-response[test.idx])^2)
    errors[j] <- step.error/length(response[test.idx])
    SS.tot <- sum((response[test.idx] - mean(response[test.idx]))^2)
    R2[j] <- 1 - step.error/SS.tot
  }
  CVerror <- sum(errors)/k
  CV.R2 <- sum(R2)/k
  res <- list()
  res$CV.error <- CVerror
  res$CV.R2 <- CV.R2
  return(res)
}
#TESTING OUT THE FUNCTION
cv.step(lm.cars)
Any thoughts?
When you created your model, lm.cars, its formula was assigned its own environment. This environment stays with the formula unless you explicitly change it. So when you extract the formula with the formula function, the original environment of the model is included.
I don't know if I'm using the correct terminology here, but I think you need to explicitly change the environment for the formula inside your function:
cv.step <- function(linmod,k=10,direction="both"){
  response <- linmod$y
  dmatrix <- linmod$x
  n <- length(response)
  datas <- linmod$model
  .env <- environment() ## identify the environment of cv.step
  ## extract the formula in the environment of cv.step
  form <- as.formula(linmod$call, env = .env)
  ## The rest of your function follows
Another problem that can cause this is passing a character string to lm instead of a formula. Character vectors have no environment, so when lm converts the string to a formula, the formula apparently is not assigned the local environment of the calling function either. If one then uses an object as weights that is not in the data frame passed as data, but is a local variable or argument of the function, one gets a not found error. This behavior is not very easy to understand. It is probably a bug.
Here's a minimal reproducible example. This function takes a data.frame, two variable names and a vector of weights to use.
residualizer = function(data, x, y, wtds) {
  #the formula to use
  f = "x ~ y"
  #residualize
  resid(lm(formula = f, data = data, weights = wtds))
}
residualizer2 = function(data, x, y, wtds) {
  #the formula to use
  f = as.formula("x ~ y")
  #residualize
  resid(lm(formula = f, data = data, weights = wtds))
}
d_example = data.frame(x = rnorm(10), y = rnorm(10))
weightsvar = runif(10)
And test:
> residualizer(data = d_example, x = "x", y = "y", wtds = weightsvar)
Error in eval(expr, envir, enclos) : object 'wtds' not found
> residualizer2(data = d_example, x = "x", y = "y", wtds = weightsvar)
1 2 3 4 5 6 7 8 9 10
0.8986584 -1.1218003 0.6215950 -0.1106144 0.1042559 0.9997725 -1.1634717 0.4540855 -0.4207622 -0.8774290
It is a very subtle bug. If one goes into the function environment with browser, one can see the weights vector just fine, but it somehow is not found in the lm call!
The bug becomes even harder to debug if one uses the name weights for the weights variable. In this case, since lm can't find the weights object, it falls back to the function weights() from the stats package, thus throwing an even stranger error:
Error in model.frame.default(formula = f, data = data, weights = weights, :
invalid type (closure) for variable '(weights)'
Don't ask me how many hours it took me to figure this out.
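The lookup rule behind these errors can be seen in a few lines of base R: lm() evaluates weights first in data and then in the environment attached to the formula, not in the function that happens to call lm(). The sketch below is only an illustration with made-up names (fit_fails, fit_works, w_local), and it assumes no object called w_local exists in the calling workspace.

```r
# `w_local` exists only inside each function: it is neither a column of
# `data` nor visible from the environment where the formula was created.
fit_fails <- function(form, data) {
  w_local <- rep(1, nrow(data))
  lm(form, data = data, weights = w_local)
}

fit_works <- function(form, data) {
  w_local <- rep(1, nrow(data))
  # Point the formula at this function's frame so lm() can find `w_local`
  environment(form) <- environment()
  lm(form, data = data, weights = w_local)
}

# The formula is created here, outside both functions
res_bad  <- try(fit_fails(mpg ~ wt, mtcars), silent = TRUE)
res_good <- fit_works(mpg ~ wt, mtcars)

inherits(res_bad, "try-error")  # the weights object is not found
inherits(res_good, "lm")        # the fit succeeds
```

Resetting the environment of the formula inside the function is the same idea as passing env = .env to as.formula() in the earlier answer.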