R crashes when training using caret and method = gamLoess

When I run the code below, R crashes. If I comment out the tuneGrid line in the call to train, there is no crash. I've tried this with another dataset, and it still crashes R. The crash message is:
R Session Aborted
R encountered a fatal error
The session was terminated
Start new session.
The code is:
library(splines)
library(foreach)
library(gam)
library(lattice)
library(ggplot2)
library(caret)
# crashes unless the tuneGrid = tune_grid line below is commented out
Set_seed_seed <- 100
data_set <- diamonds[, c(1, 5, 6, 7, 8, 9, 10)]
data_set <- data_set[1:1000,]
formula <- price ~ carat + depth + table + x + y + z
training_control <- trainControl(method = "cv", allowParallel = FALSE)
tune_grid <- expand.grid(span = seq(0.1, 0.9, length = 9), degree = seq(1, 2, length = 2))
set.seed(Set_seed_seed)
GAM_model <- train(formula,
data = data_set,
method = "gamLoess",
tuneGrid = tune_grid,
trControl = training_control
)
This occurred in R 3.2.1 and 3.2.2 using RStudio. It also crashes in the R GUI.

It is a bug in the gam package. I alerted Trevor Hastie to it on March 3, 2014:
library(gam)
set.seed(1)
x <- rnorm(1000)
y <- x^2+0.1*rnorm(1000)
tdat <- data.frame(y = y, x = x)
m1 <- gam(y ~ lo(x, span = .5, degree = 2), data = tdat)
That works fine, but as I fit multiple models a segfault occurs (only with loess and degree = 2).
This will produce it for me:
for(i in 1:10) m1 <- gam(y ~ lo(x, span = .5, degree = 2), data = tdat)
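Until the package is fixed, a possible workaround for the caret example (a sketch based only on the observation above that the crash needs degree = 2) is to drop degree = 2 from the tuning grid:
# Hedged workaround sketch: keep degree fixed at 1 so the crashing
# lo(..., degree = 2) path in gam is never hit.
tune_grid_d1 <- expand.grid(span = seq(0.1, 0.9, length = 9), degree = 1)
set.seed(Set_seed_seed)
GAM_model_d1 <- train(formula,
                      data = data_set,
                      method = "gamLoess",
                      tuneGrid = tune_grid_d1,
                      trControl = training_control)
This obviously stops caret from tuning over degree, so it is a stopgap rather than a fix.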

I verified that the problem exists. I debugged the program and found that it gets stuck as shown below. This is a bug with the foreach package:
train(formula, data = data_set, ...)
  UseMethod("train")                                     # train(); namespace:caret
  train(x, y, weight = w, ...)                           # train.formula(); namespace:caret
  UseMethod("train")                                     # train(); namespace:caret
  nominalTrainWorkflow(x = x, ...)                       # train.default(); namespace:caret
  result <- foreach(iter = , ...)                        # nominalTrainWorkflow(); namespace:caret
  e <- getDoSeq()                                        # %op%; namespace:foreach
  list(fun = doSeq, data = NULL)                         # getDoSeq(); namespace:foreach
  e$fun(obj, substitute(ex), parent.frame(), e$data)     # %op%; namespace:foreach
  tryCatch(accumulator(list(r), i), ...)                 # e$fun; namespace:foreach

Related

Error in -train : invalid argument to unary operator

I am using RStudio and trying to knit a file. The code chunk below runs when executed as a chunk, but throws an error when I try to knit the file.
tree.corolla <- rpart(Price ~ ., data = toyota.corolla.df, control = rpart.control(maxdepth = 5), method = "anova")
The error I am getting is:
Error in -train : invalid argument to unary operator
Calls: ... eval -> predict -> predict.rpart -> [ -> [.data.frame
I am using the ToyotaCorolla.csv dataset that is available here:
https://pitt.box.com/s/e0rhjtba8az85epqus9xu85e4q6zxuts
The entire code chunk is below:
#install.packages("rpart")
#install.packages("rpart.plot")
#install.packages("gbm")
#install.packages("randomForest")
#install.packages("dummies")
library(randomForest)
library(gbm)
library(rpart)
library(rpart.plot)
library(tree)
library(ISLR)
library(dummies)
library(adabag)
library(rpart)
library(caret)
toyota.corolla.df <- read.csv("ToyotaCorolla.csv")
#View(toyota.corolla.df)
# randomly generate training and validation sets
toyota.corolla.df <- toyota.corolla.df[ , -c(1, 2, 5, 6)]
toyota.corolla.df <- cbind(toyota.corolla.df, dummy(toyota.corolla.df$Fuel_Type, sep = "_"))
toyota.corolla.df <- cbind(toyota.corolla.df, dummy(toyota.corolla.df$Color, sep = "_"))
toyota.corolla.df <- toyota.corolla.df[ , -c(4, 7)]
set.seed(123)
inTraining <- createDataPartition(toyota.corolla.df$Price, p = .60, list = FALSE)
training <- toyota.corolla.df[ inTraining,]
testing <- toyota.corolla.df[-inTraining,]
tree.corolla <- rpart(Price ~ ., data = toyota.corolla.df, control = rpart.control(maxdepth = 5), method = "anova")
summary(tree.corolla)
plot(tree.corolla)
text(tree.corolla,pretty=0)
cv.corolla=trainControl(method = "repeatedcv", number = 10, repeats = 10)
prp(tree.corolla, type = 1, extra = 1, split.font = 1, varlen = -10)
yhat=predict(tree.corolla,newdata=toyota.corolla.df[-train,])
corolla.test=toyota.corolla.df[-train,"Price"]
plot(yhat,corolla.test)
abline(0,1)
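A note on where this likely comes from (a guess, since it is not addressed in the post): the chunk never defines a vector called train, so in toyota.corolla.df[-train,] the name train resolves to caret's train() function, and applying unary - to a function gives exactly "invalid argument to unary operator". It probably runs interactively because a leftover train vector exists in the workspace, which a fresh knit session does not have. A minimal sketch of a likely fix, assuming the intent was to use the rows excluded by inTraining (the same rows as the testing data frame):
# Hedged sketch of a likely fix (not from the original post): use the
# inTraining index the chunk actually creates, rather than `train`.
yhat <- predict(tree.corolla, newdata = toyota.corolla.df[-inTraining, ])
corolla.test <- toyota.corolla.df[-inTraining, "Price"]
plot(yhat, corolla.test)
abline(0, 1)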

What is RMSE in the results object in the R package caret?

Why is result1 different from result2? Intuitively I would think that truc$results$RMSE is the root mean squared error of the forecasts, but I guess it is not.
library(caret)
x <- data.frame(x = rnorm(15))
y <- x$x + rnorm(15)
myTimeControl <- trainControl(method = "timeslice",initialWindow = 10, horizon = 1, fixedWindow = FALSE, savePredictions=TRUE)
truc <- train(x,y,method = "lm",metric= "RMSE",trControl =myTimeControl,preProc = c("center", "scale"))
result1 <- sqrt(mean((truc$pred$pred-truc$pred$obs)^2))
result2 <- truc$results$RMSE
result1
result2
If you swap mean and sqrt, you get the same result... Something seems weird with caret's formula... Actually, you made an interesting observation:
result1 <- mean(sqrt((truc$pred$pred-truc$pred$obs)^2))
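What seems to be happening (an assumption about caret's aggregation, not stated in the thread): caret computes RMSE within each resample and then averages those values, and with horizon = 1 every resample holds a single prediction, so each per-resample RMSE is just an absolute error. A small sketch that reproduces truc$results$RMSE under that assumption:
# Hedged sketch (assumption about caret's aggregation): RMSE per resample,
# then the mean of those values. With horizon = 1 each resample has a single
# prediction, so its RMSE collapses to the absolute error.
errs <- split(truc$pred$pred - truc$pred$obs, truc$pred$Resample)
per_resample_rmse <- sapply(errs, function(e) sqrt(mean(e^2)))
mean(per_resample_rmse)   # should match truc$results$RMSE
result1, by contrast, pools all holdout predictions first and then takes a single RMSE, which is why the two numbers differ.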

Error with prediction - ROCR package (using probabilities)

I have used "rfe" function with svm to create a model with reduced features. Then I use "predict" on test data which outputs class labels (binary), 0 class probabilities, 1 class probabilities. I then tried using prediction function, in ROCR package, on predicted probabilities and true class labels but get the following error and am not sure why as the lengths of the 2 arrays are equal:
> pred_svm <- prediction(pred_svm_2class[,2], as.numeric(as.character(y)))
Error in prediction(pred_svm_2class[, 2], as.numeric(as.character(y))) :
Number of predictions in each run must be equal to the number of labels for each run.
I have the code below, and the input data is linked from the post. It is a small dataset with binary classification, so the code runs fast.
library("caret")
library("ROCR")
sensor6data_2class <- read.csv("/home/sensei/clustering/svm_2labels.csv")
sensor6data_2class <- within(sensor6data_2class, Class <- as.factor(Class))
set.seed("1298356")
inTrain_svm_2class <- createDataPartition(y = sensor6data_2class$Class, p = .75, list = FALSE)
training_svm_2class <- sensor6data_2class[inTrain_svm_2class,]
testing_svm_2class <- sensor6data_2class[-inTrain_svm_2class,]
trainX <- training_svm_2class[,1:20]
y <- training_svm_2class[,21]
ctrl_svm_2class <- rfeControl(functions = rfFuncs , method = "repeatedcv", number = 5, repeats = 2, allowParallel = TRUE)
model_train_svm_2class <- rfe(x = trainX, y = y, data = training_svm_2class, sizes = c(1:20), metric = "Accuracy", rfeControl = ctrl_svm_2class, method="svmRadial")
pred_svm_2class = predict(model_train_svm_2class, newdata=testing_svm_2class)
pred_svm <- prediction(pred_svm_2class[,2], y)
Thanks, I appreciate your help.
This is because in the line
pred_svm <- prediction(pred_svm_2class[,2], y)
pred_svm_2class[,2] contains the predictions on the test data, while y holds the labels for the training data, so their lengths differ. Just generate the test labels in a separate variable like this:
y_test <- testing_svm_2class[,21]
And now if you do
pred_svm <- prediction(pred_svm_2class[,2], y_test)
there will be no error. Full code below:
# install.packages("caret")
# install.packages("ROCR")
# install.packages("e1071")
# install.packages("randomForest")
library("caret")
library("ROCR")
sensor6data_2class <- read.csv("svm_2labels.csv")
sensor6data_2class <- within(sensor6data_2class, Class <- as.factor(Class))
set.seed("1298356")
inTrain_svm_2class <- createDataPartition(y = sensor6data_2class$Class, p = .75, list = FALSE)
training_svm_2class <- sensor6data_2class[inTrain_svm_2class,]
testing_svm_2class <- sensor6data_2class[-inTrain_svm_2class,]
trainX <- training_svm_2class[,1:20]
y <- training_svm_2class[,21]
y_test <- testing_svm_2class[,21]
ctrl_svm_2class <- rfeControl(functions = rfFuncs , method = "repeatedcv", number = 5, repeats = 2, allowParallel = TRUE)
model_train_svm_2class <- rfe(x = trainX, y = y, data = training_svm_2class, sizes = c(1:20), metric = "Accuracy", rfeControl = ctrl_svm_2class, method="svmRadial")
pred_svm_2class = predict(model_train_svm_2class, newdata=testing_svm_2class)
pred_svm <- prediction(pred_svm_2class[,2], y_test)
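As a possible next step (not part of the original answer), the prediction object can be fed to ROCR's performance() for an ROC curve and AUC; which class counts as positive depends on the ordering of the levels in y_test.
# Hedged follow-up sketch: ROC curve and AUC from the ROCR prediction object.
# Which class is treated as "positive" depends on the level ordering of y_test.
perf_roc <- performance(pred_svm, measure = "tpr", x.measure = "fpr")
plot(perf_roc)
auc_value <- performance(pred_svm, measure = "auc")@y.values[[1]]
auc_value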

Error in confusionMatrix.rfe: resampled confusion matrices are not availible

I ran into the error "resampled confusion matrices are not available" when trying to extract a confusion matrix from an rfe object. Is the confusionMatrix.rfe function of the caret package not working, or am I missing something here?
Below is an example using simulated data from
http://topepo.github.io/caret/rfe.html
Documentation for the confusionMatrix.rfe function is here:
http://www.inside-r.org/packages/cran/caret/docs/confusionMatrix.train
library(caret)
library(mlbench)
library(Hmisc)
library(randomForest)
n <- 100
p <- 40
sigma <- 1
set.seed(1)
sim <- mlbench.friedman1(n, sd = sigma)
colnames(sim$x) <- c(paste("real", 1:5, sep = ""),
paste("bogus", 1:5, sep = ""))
bogus <- matrix(rnorm(n * p), nrow = n)
colnames(bogus) <- paste("bogus", 5+(1:ncol(bogus)), sep = "")
x <- cbind(sim$x, bogus)
y <- sim$y
normalization <- preProcess(x)
x <- predict(normalization, x)
x <- as.data.frame(x)
subsets <- c(1:5, 10, 15, 20, 25)
set.seed(10)
ctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
lmProfile <- rfe(x, y,
sizes = subsets,
rfeControl = ctrl)
lmProfile
confusionMatrix(lmProfile)
Error in confusionMatrix.rfe(lmProfile) :
  resampled confusion matrices are not availible
Thanks!
mlbench.friedman1 is a regression problem, not a classification problem. If you check the data, you can see that your y variable is continuous, so confusionMatrix has no use in this case.
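If the aim is still to judge the fitted model, here is a short sketch of what can be looked at instead for a regression rfe object (standard caret calls, but not part of the original answer):
# Hedged sketch: regression rfe objects carry RMSE/Rsquared rather than
# confusion matrices, so inspect those (or score predictions) instead.
str(y)                                   # numeric response -> regression
lmProfile$results                        # resampled performance by subset size
postResample(predict(lmProfile, x), y)   # RMSE / Rsquared / MAE on the training data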

R foreach error when using formula notation in randomForest

I have an issue running randomForest in parallel using foreach.
In this example, I create some data and then a formula.
The formula works with randomForest by itself,
but fails when used in a foreach parallel loop...?
# rf on big training set
# use parallel foreach
library(foreach)
library(doMC)
registerDoMC(4) # change 4 to your number of CPU cores
# info on the parallel backend
getDoParName()
getDoParWorkers()
# bogus data
set.seed(123)
ssize <- 100000
x1 <- sample( LETTERS[1:9], ssize, replace=TRUE, prob=c(0.1, 0.2, 0.15, 0.05,0.1, 0.2, 0.05, 0.05,0.1) )
x2 <- rlnorm(ssize,0,0.25)
x3 <- rlnorm(ssize,0,0.5)
y <- sample( c("Y","N"), ssize, replace=TRUE, prob=c(0.05, 0.95))
df <- data.frame(x1,x2,x3,y)
df$p_y <- as.numeric(df$y)-1
# use strata to sample whole dataset
library(sampling)
s1 = strata(df,stratanames = "y", size = c(2500,2500))
s2 = strata(df,stratanames = "y", size = c(2500,2500))
s3 = strata(df,stratanames = "y", size = c(2500,2500))
s4 = strata(df,stratanames = "y", size = c(2500,2500))
s_list <- list(s1$ID_unit, s2$ID_unit, s3$ID_unit, s4$ID_unit)
# model function
rf.formula <- as.formula(paste("y","~",paste("x1","x2",sep="+")))
library(randomForest)
# simple stuff works but takes some time
model.rf <-randomForest(y ~ x1 + x2, df, ntree=100, nodesize = 50)
# build rf with dopar on explicit formula works and is quick
model.rf.dopar <- foreach(subset=s_list, .combine=combine, .packages='randomForest') %dopar%
randomForest(y ~ x1 + x2, df, ntree=100, nodesize = 50, subset=subset)
# build rf with dopar on rf.formula fails
model.rf.s.b2 <- foreach(subset=s_list, .combine=combine, .packages='randomForest') %dopar%
randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=subset)
The error:
model.rf.s.b2 <- foreach(subset=s_list, .combine=combine, .packages='randomForest') %dopar%
+ randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=subset)
Error in randomForest(rf.formula, df, ntree = 100, nodesize = 50, subset = subset) :
task 1 failed - "invalid subscript type 'closure'"
Any suggestions?
Thanks!
The problem seems to be due to an indexing operation going wrong deep down in the model.frame.default function, which is indirectly called by randomForest.formula. I'm not at all sure what is triggering the problem because there are a lot of tricky evals happening in model.frame.default, but modifying the environment of the formula seems to fix the problem:
r <- foreach(subset=s_list, .combine='combine', .multicombine=TRUE,
.packages='randomForest') %dopar% {
environment(rf.formula) <- environment()
randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=subset)
}
In particular, this causes subset to be evaluated correctly, otherwise it evaluates to the subset function. I tried renaming the iteration variable, but it didn't help.
Note that I also set .multicombine to TRUE since the randomForest combine function accepts multiple objects, and that can improve performance significantly.
Update
The problem can be reproduced with:
fun <- function(subset) {
randomForest(rf.formula, df, ntree=100, nodesize = 50, subset=subset)
}
fun(s_list[[1]])
If the variable subset is changed to s, for example, it also fails, but with a less misleading error message:
> fun <- function(s) {
+   randomForest(rf.formula, df, ntree = 100, nodesize = 50, subset = s)
+ }
> fun(s_list[[1]])
Error in eval(expr, envir, enclos) : object 's' not found
Calls: fun ... eval -> model.frame -> model.frame.default -> eval -> eval
Execution halted
As with the foreach example, resetting the environment of the formula seems to work around the problem.
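An alternative sketch (not from the original answer) that sidesteps the formula environment entirely is to subset the data inside the loop and call randomForest's x/y interface, assuming only x1 and x2 are needed as predictors:
# Hedged alternative sketch: subset the data frame directly and use the x/y
# interface, so no formula (and no formula environment) is involved.
r2 <- foreach(idx = s_list, .combine = 'combine', .multicombine = TRUE,
              .packages = 'randomForest') %dopar% {
  d <- df[idx, ]
  d$y <- factor(d$y)   # make sure the response is a factor for classification
  randomForest(x = d[, c("x1", "x2")], y = d$y, ntree = 100, nodesize = 50)
}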

Resources