Error in -train : invalid argument to unary operator - r

I am using RStudio and trying to knit a file. The code chunk below runs fine when executed as a chunk, but it throws an error when I try to knit the file.
tree.corolla <- rpart(Price ~ ., data = toyota.corolla.df, control = rpart.control(maxdepth = 5), method = "anova")
The error I am getting is:
Error in -train : invalid argument to unary operator
Calls: ... eval -> predict -> predict.rpart -> [ -> [.data.frame
I am using the ToyotaCorolla.csv dataset that is available here:
https://pitt.box.com/s/e0rhjtba8az85epqus9xu85e4q6zxuts
The entire code chunk is below:
#install.packages("rpart")
#install.packages("rpart.plot")
#install.packages("gbm")
#install.packages("randomForest")
#install.packages("dummies")
library(randomForest)
library(gbm)
library(rpart)
library(rpart.plot)
library(tree)
library(ISLR)
library(dummies)
library(adabag)
library(rpart)
library(caret)
toyota.corolla.df <- read.csv("ToyotaCorolla.csv")
#View(toyota.corolla.df)
# randomly generate training and validation sets
toyota.corolla.df <- toyota.corolla.df[ , -c(1, 2, 5, 6)]
toyota.corolla.df <- cbind(toyota.corolla.df, dummy(toyota.corolla.df$Fuel_Type, sep = "_"))
toyota.corolla.df <- cbind(toyota.corolla.df, dummy(toyota.corolla.df$Color, sep = "_"))
toyota.corolla.df <- toyota.corolla.df[ , -c(4, 7)]
set.seed(123)
inTraining <- createDataPartition(toyota.corolla.df$Price, p = .60, list = FALSE)
training <- toyota.corolla.df[ inTraining,]
testing <- toyota.corolla.df[-inTraining,]
tree.corolla <- rpart(Price ~ ., data = toyota.corolla.df, control = rpart.control(maxdepth = 5), method = "anova")
summary(tree.corolla)
plot(tree.corolla)
text(tree.corolla,pretty=0)
cv.corolla=trainControl(method = "repeatedcv", number = 10, repeats = 10)
prp(tree.corolla, type = 1, extra = 1, split.font = 1, varlen = -10)
yhat=predict(tree.corolla,newdata=toyota.corolla.df[-train,])
corolla.test=toyota.corolla.df[-train,"Price"]
plot(yhat,corolla.test)
abline(0,1)
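A possible explanation, based only on the code shown: the chunk never defines an index vector called `train`, so during knitting (which starts from a clean session) `train` resolves to the `caret::train` function, and negating a function fails. Interactively, a leftover `train` object in the workspace could hide the problem. The sketch below reproduces the message with base R only; the `train` function and the toy data frame are hypothetical stand-ins, not the asker's objects.

```r
# Hypothetical stand-in for caret::train: any function named `train` will do
train <- function(...) NULL

# Toy data frame standing in for toyota.corolla.df
df <- data.frame(Price = c(10, 20, 30, 40), Age = c(1, 2, 3, 4))

# Negating a function is what triggers the error seen when knitting
res <- tryCatch(df[-train, ], error = function(e) conditionMessage(e))
print(res)

# Indexing with the partition vector that was actually created works
inTraining <- c(1, 3)            # stands in for createDataPartition() output
testing <- df[-inTraining, ]
print(testing)
```

Under that assumption, replacing `-train` with `-inTraining` in the `predict()` and `corolla.test` lines would be the corresponding change in the original chunk.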

Related

Recipe fails with caret::train

When using caret with recipes I get an error stating:
Error in { : task 1 failed - "$ operator is invalid for atomic vectors"
I managed to narrow it down to a problem with the recipe, but I am not sure what I'm doing wrong. Has anyone seen this before? The only relevant information I found was here, and it stated:
This happens when the model object fails and caused no recipe to be
available
The code I use is below. I cannot share the data, but the error appears when using mtcars as well.
library(caret)
library(tidymodels)
library(embed)
library(doParallel)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
df <- mtcars %>%
as_tibble() %>%
mutate(cyl = factor(cyl)) # to have one nominal variable
set.seed(123)
cv_split <- initial_split(df)
df_train <- training(cv_split)
set.seed(123)
cv_folds <- vfold_cv(df_train, v = 10, repeats = 10)
cv_ind <- rsample2caret(cv_folds)
rec <-
recipe(mpg ~ ., data = df_train) %>%
step_nzv(all_predictors()) %>%
step_lencode_mixed(all_nominal(), outcome = vars(mpg))
ctrl <-
trainControl(
method = "repeatedcv",
repeats = 10,
index = cv_ind$index,
indexOut = cv_ind$indexOut,
allowParallel = TRUE)
train(rec,
data = df_train,
method = "glmnet",
tuneLength = 20,
trControl = ctrl)

R: object not found when knitting

I am trying to build a knn model using caret with my dataset, where True (real sales), DOW (day of the week), and D1 to D10 (historic sales) are available.
library(caret)
library(reshape2)
library(dplyr)
library(tibble)
library(dummies)
#data
rm = matrix(rnorm(100*10, 10, 5), nrow = 100) %>% as.data.frame()
wide = cbind(
rnorm(100, 100, 1),
weekdays(seq(as.Date('2019/1/1'), by='day', length.out = 100)),
rm
)
colnames(wide) = c('true', 'DOW', paste0('D',1:10))
#preprocessing for knn
train.true = train[,1]
dow.tr = dummy(train$DOW, sep='.')
dow.te = dummy(test$DOW, sep='.')
k.train = cbind(train[, -c(2, nearZeroVar(train))], dow.tr)
k.test = cbind(test, dow.te)[,-2]
seq.knn.pre1 = rep(0, nrow(test))
for (i in 1:10){
this.train = k.train[, c((i+1):ncol(k.train))]
this.test = k.test[i, c((i+1):ncol(k.test))]
train.control = trainControl(method='repeatedcv', number=10, repeats = 1)
k = train(train.true~., method='knn', tuneLength = 8,
trControl=train.control, preProcess='scale',
data=data.frame(train.true, this.train))
seq.knn.pre1[i] = predict(k, this.test)
}
seq.knn.pre1 = cbind(true = test[,1], k.pred1 = seq.knn.pre1) %>% data.frame()
However, when I knit the file, I get this error:
object 'X.Rachel.Documents.Research.file.Rmd.Friday' not found
Calls: <Anonymous> ... predict.train -> model.frame -> model.frame.default -> eval -> eval
Execution halted
I am guessing the problem comes from the DOW dummy variables. When my simulated dataset does not include categorical variables, the code knits fine. Is there any way I can fix it there?
Any suggestion is highly appreciated!
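The object name in the error looks like a dummy column whose prefix came from the Rmd file path rather than from `DOW`, which would be consistent with the guess above. Assuming that naming is the cause, one way to sidestep it is to build the dummy columns with base R's `model.matrix`, which names columns from the variable itself. This is a minimal sketch on made-up dates, not a drop-in replacement for the full pipeline.

```r
# Two weeks of weekday names, mirroring how DOW is simulated in the question
dow <- weekdays(seq(as.Date("2019/1/1"), by = "day", length.out = 14))
df <- data.frame(DOW = factor(dow))

# One indicator column per weekday; "- 1" drops the intercept column
dummies <- model.matrix(~ DOW - 1, data = df)

# Rename "DOWFriday" to "DOW.Friday" to match the sep = '.' convention
colnames(dummies) <- sub("^DOW", "DOW.", colnames(dummies))
print(colnames(dummies))
```

Because the names depend only on the factor levels, the same columns should be produced whether the code runs interactively or during knitting, provided the train and test sets share the same factor levels.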

R: LIME returns error on different feature numbers when it's not the case

I'm building a text classifier of Clinton & Trump tweets (data can be found on Kaggle).
I'm doing EDA and modelling using quanteda package:
library(dplyr)
library(readr) # read_csv()
library(lubridate) # as_date(), hour(), hms()
library(stringr)
library(quanteda)
library(lime)
#data prep
tweet_csv <- read_csv("tweets.csv")
tweet_data <- tweet_csv %>%
select(author = handle,
text,
retweet_count,
favorite_count,
source_url,
timestamp = time) %>%
mutate(date = as_date(str_sub(timestamp, 1, 10)),
hour = hour(hms(str_sub(timestamp, 12, 19))),
tweet_num = row_number()) %>%
select(-timestamp)
# creating corpus and dfm
tweet_corpus <- corpus(tweet_data)
edited_dfm <- dfm(tweet_corpus, remove_url = TRUE, remove_punct = TRUE, remove = stopwords("english"))
set.seed(32984)
trainIndex <- sample.int(n = nrow(tweet_csv), size = floor(.8*nrow(tweet_csv)), replace = F)
train_dfm <- edited_dfm[as.vector(trainIndex), ]
train_raw <- tweet_data[as.vector(trainIndex), ]
train_label <- train_raw$author == "realDonaldTrump"
test_dfm <- edited_dfm[-as.vector(trainIndex), ]
test_raw <- tweet_data[-as.vector(trainIndex), ]
test_label <- test_raw$author == "realDonaldTrump"
# making sure train and test sets have the same features
test_dfm <- dfm_select(test_dfm, train_dfm)
# using quanteda's NB model
nb_model <- quanteda::textmodel_nb(train_dfm, train_label)
nb_preds <- predict(nb_model, test_dfm)
# defining textmodel_nb as classification model
class(nb_model)
model_type.textmodel_nb_fitted <- function(x, ...) {
return("classification")
}
# a wrapper-up function for data preprocessing
get_matrix <- function(df){
corpus <- corpus(df)
dfm <- dfm(corpus, remove_url = TRUE, remove_punct = TRUE, remove = stopwords("english"))
}
Then I define the explainer, with no problems here:
explainer <- lime(train_raw[1:5],
model = nb_model,
preprocess = get_matrix)
But when I run the explainer, even on exactly the same dataset that was used to build it, I get an error:
explanation <- lime::explain(train_raw[1:5],
explainer,
n_labels = 1,
n_features = 6,
cols = 2,
verbose = 0)
Error in predict.textmodel_nb_fitted(x, newdata = newdata, type = type, :
feature set in newdata different from that in training set
Does it have something to do with quanteda and dfms? I honestly don't see why this should happen. Any help will be great, thanks!
We can trace the error to predict_model, which calls predict.textmodel_nb_fitted (I used only the first 10 rows of train_raw to speed up computation):
traceback()
# 7: stop("feature set in newdata different from that in training set")
# 6: predict.textmodel_nb_fitted(x, newdata = newdata, type = type,
# ...)
# 5: predict(x, newdata = newdata, type = type, ...)
# 4: predict_model.default(explainer$model, case_perm, type = o_type)
# 3: predict_model(explainer$model, case_perm, type = o_type)
# 2: explain.data.frame(train_raw[1:10, 1:5], explainer, n_labels = 1,
# n_features = 5, cols = 2, verbose = 0)
# 1: lime::explain(train_raw[1:10, 1:5], explainer, n_labels = 1,
# n_features = 5, cols = 2, verbose = 0)
The problem is that predict.textmodel_nb_fitted expects a dfm, not a data frame. For example, predict(nb_model, test_raw[1:5]) gives you the same "feature set in newdata different from that in training set" error. However, explain takes a data frame as its x argument.
A solution is to write a custom textmodel_nb_fitted method for predict_model that does the necessary object conversions before calling predict.textmodel_nb_fitted:
predict_model.textmodel_nb_fitted <- function(x, newdata, type, ...) {
X <- corpus(newdata)
X <- dfm_select(dfm(X), x$data$x)
res <- predict(x, newdata = X, ...)
switch(
type,
raw = data.frame(Response = res$nb.predicted, stringsAsFactors = FALSE),
prob = as.data.frame(res$posterior.prob, check.names = FALSE)
)
}
This gives us
explanation <- lime::explain(train_raw[1:10, 1:5],
explainer,
n_labels = 1,
n_features = 5,
cols = 2,
verbose = 0)
explanation[1, 1:5]
# model_type case label label_prob model_r2
# 1 classification 1 FALSE 0.9999986 0.001693861
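The fix above relies on S3 dispatch: a class-specific method converts the incoming data frame into the form the model understands before predicting. The following base-R sketch shows only that pattern. The generic `predict_model_demo`, the class `toy_model`, and its fields are hypothetical names made up for illustration; they are not part of lime or quanteda.

```r
# Hypothetical generic standing in for lime::predict_model
predict_model_demo <- function(x, newdata, ...) UseMethod("predict_model_demo")

# Default method: refuses input it does not know how to convert
predict_model_demo.default <- function(x, newdata, ...) {
  stop("model cannot handle a raw data frame")
}

# Class-specific method: convert the data frame first, then predict
predict_model_demo.toy_model <- function(x, newdata, ...) {
  # Keep only the columns the model was trained on, as a numeric matrix
  features <- as.matrix(newdata[, x$feature_names, drop = FALSE])
  data.frame(Response = as.vector(features %*% x$weights))
}

toy <- structure(
  list(feature_names = c("a", "b"), weights = c(1, 2)),
  class = "toy_model"
)
out <- predict_model_demo(toy, data.frame(a = 1:2, b = 3:4, extra = 0))
print(out)
```

The caller keeps passing a plain data frame, and the conversion lives in one place, which is the same division of labour as in `predict_model.textmodel_nb_fitted`.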

Training mxnet:mx.mlp

I am trying to reproduce an example from ND Lewis: Neural Networks for time series forecasting with R. If I include the device argument I get the error:
Error in mx.opt.sgd(...) :
unused argument (device = list(device = "cpu", device_id = 0, device_typeid = 1))
In addition: Warning message:
In mx.model.select.layout.train(X, y) :
Auto detect layout of input matrix, use rowmajor..
If I remove this parameter, I still get this warning:
Warning message:
In mx.model.select.layout.train(X, y) :
Auto detect layout of input matrix, use rowmajor..
The code is:
library(zoo)
library(quantmod)
library(mxnet)
# data
data("ecoli", package = "tscount")
data <- ecoli$cases
data <- as.zoo(ts(data, start = c(2001, 1), end = c(2013, 20), frequency = 52))
xorig <- do.call(cbind, lapply((1:4), function(x) as.zoo(Lag(data, k = x))))
xorig <- cbind(xorig, data)
xorig <- xorig[-(1:4), ]
# normalization
range_data <- function(x) {
(x - min(x))/(max(x) - min(x))
}
xnorm <- data.matrix(xorig)
xnorm <- range_data(xnorm)
# test/train
y <- xnorm[, 5]
x <- xnorm[, -5]
n_train <- 600
x_train <- x[(1:n_train), ]
y_train <- y[(1:n_train)]
x_test <- x[-(1:n_train), ]
y_test <- y[-(1:n_train)]
# mxnet:
mx.set.seed(2018)
model1 <- mx.mlp(x_train,
y_train,
hidden_node = c(10, 2),
out_node = 1,
activation = "sigmoid",
out_activation = "rmse",
num.round = 100,
array.batch.size = 20,
learning.rate = 0.07,
momentum = 0.9
#, device = mx.cpu()
)
pred1_train <- predict(model1, x_train, ctx = mx.cpu())
How can I fix this?
Regarding the second warning message, MXNet is trying to detect whether the input is row-major or column-major based on the shape of your inputs: https://github.com/apache/incubator-mxnet/blob/424143ac47ab3a38ae8aedaeb3319379887de0bc/R-package/R/model.R#L329
For the unused argument device = mx.cpu(), should the argument name be ctx instead of device?
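The error message names `mx.opt.sgd(...)`, which suggests the unrecognised `device` argument was forwarded through `...` until it reached a function that does not accept it. The base-R sketch below reproduces that mechanism. `outer_fit` and `inner_optimizer` are hypothetical functions invented for illustration and do not mirror the real mxnet signatures.

```r
# Hypothetical inner function: accepts only these two named arguments
inner_optimizer <- function(learning.rate = 0.01, momentum = 0) {
  list(learning.rate = learning.rate, momentum = momentum)
}

# Hypothetical outer function: knows `ctx`, forwards everything else
outer_fit <- function(x, ctx = "cpu", ...) {
  inner_optimizer(...)
}

# `device` is not an argument of outer_fit, so it lands in `...`
# and is rejected by inner_optimizer
res <- tryCatch(
  outer_fit(1, device = "cpu"),
  error = function(e) conditionMessage(e)
)
print(res)

# `ctx` is consumed by outer_fit itself, so nothing stray is forwarded
ok <- outer_fit(1, ctx = "cpu", momentum = 0.9)
print(ok$momentum)
```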

R Crashes when training using caret and method = gamLoess

When I run the code below, R crashes. If I comment out the tuneGrid line in the call to train, there is no crash. I've tried this with another dataset, and R still crashes. The crash message is
R Session Aborted
R encountered a fatal error
The session was terminated
Start new session.
The code is:
library(splines)
library(foreach)
library(gam)
library(lattice)
library(ggplot2)
library(caret)
# crashes when the tuneGrid = tune_grid line is included
Set_seed_seed <- 100
data_set <- diamonds[, c(1, 5, 6, 7, 8, 9, 10)]
data_set <- data_set[1:1000,]
formula <- price ~ carat + depth + table + x + y + z
training_control <- trainControl(method = "cv", allowParallel = FALSE)
tune_grid <- expand.grid(span = seq(0.1, 0.9, length = 9), degree = seq(1, 2, length = 2))
set.seed(Set_seed_seed)
GAM_model <- train(formula,
data = data_set,
method = "gamLoess",
tuneGrid = tune_grid,
trControl = training_control
)
This occurred in R3.2.1 and 3.2.2 using R Studio.
The R GUI also crashes.
It is a bug in the gam package. I alerted Trevor Hastie on March 3, 2014 about it:
library(gam)
set.seed(1)
x <- rnorm(1000)
y <- x^2+0.1*rnorm(1000)
tdat <- data.frame(y = y, x = x)
m1 <- gam(y ~ lo(x, span = .5, degree = 2), data = tdat)
That works fine but as I fit multiple models a seg fault occurs (but only
with loess and degree = 2).
This will produce it for me:
for(i in 1:10) m1 <- gam(y ~ lo(x, span = .5, degree = 2), data = tdat)
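Since the seg fault is reported only for loess with degree = 2, one possible workaround is to restrict the tuning grid to degree = 1. This assumes first-degree loess is acceptable for the model and has not been verified against the crash itself. The sketch uses only base R to build the two grids.

```r
# Grid from the question: 9 span values crossed with 2 degree values
full_grid <- expand.grid(
  span = seq(0.1, 0.9, length = 9),
  degree = seq(1, 2, length = 2)
)
print(nrow(full_grid))

# Restricted grid: same spans, but only degree = 1
safe_grid <- expand.grid(
  span = seq(0.1, 0.9, length = 9),
  degree = 1
)
print(safe_grid)
```

`safe_grid` could then be passed as `tuneGrid` in the `train()` call from the question in place of `tune_grid`.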
I verified that the problem exists. I debugged the program and found that it gets stuck at the point shown below. This is a bug in the foreach package.
train(formula, data=data_set, ...)
useMethod("train") # train(); namespace:caret
train(x, y, weight = w, ...) # train.formula(); namespace:caret
useMethod("train") # train(); namespace:caret
nominalTrainWorkflow(x = x, ...) # train.default(); namespace:caret
result <- foreach(iter = , ...) # nominalTrainWorkflow(); namespace:caret
e <- getDoSeq() # %op%; namespace:foreach
list(fun = doSeq, data=NULL) # getDoSeq(); namespace:foreach
e$fun(obj, substitute(ex), parent.frame(), e$data) # %op%; namespace:foreach
tryCatch(accumulator(list(r), i) # e$fun; namespace:foreach

Resources