When I type the name of my tree model, I keep getting this error and I have no idea what it means.
Here is my code; OJ is a data set from the "ISLR" package:
library(tree)
library(ISLR)
library(tidyverse)
library(caTools)
library(randomForest)
library(MASS)
library(rpart)
library(gbm)
library(glmnet)
# 8.4.9
## a. split the data into train and test
OJ <- OJ
set.seed(2)
train2 <- sample.split(OJ$Purchase, SplitRatio = 800/1070)
OJ_train <- subset(OJ, train2 == T)
OJ_test <- subset(OJ, train2 == F)
## b. fit a tree model
OJ_train_tree <- tree(Purchase ~ ., data = OJ_train)
summary(OJ_train_tree)
## c. a closer look at the model detail (here is the error)
OJ_train_tree
And the error I got is
OJ_train_tree
Error in cat(x, ..., sep = sep) :
argument 1 (type 'list') cannot be handled by 'cat'
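To narrow this down, a minimal check (just a diagnostic sketch, not a fix) would be to refit in a fresh R session with only the two packages that parts a-c actually need attached, which would rule out one of the many other loaded packages masking something, and to inspect the tree with plot()/text() as an alternative to the default print method:
library(tree)
library(ISLR)
set.seed(2)
idx <- sample(seq_len(nrow(OJ)), 800)  # plain 800/270 split instead of caTools
OJ_train <- OJ[idx, ]
OJ_train_tree <- tree(Purchase ~ ., data = OJ_train)
summary(OJ_train_tree)
print(OJ_train_tree)   # the step that errors in the full session
plot(OJ_train_tree)    # alternative way to inspect the fitted tree
text(OJ_train_tree, pretty = 0)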
I am using the "lung capacity" data set to try to set up a linear model:
library(tidyverse)
library(rvest)
h <- "https://docs.google.com/spreadsheets/d/0BxQfpNgXuWoIWUdZV1ZTc2ZscnM/edit?resourcekey=0-gqXT7Re2eUS2JGt_w1y4vA#gid=1055321634"
t <- rvest::read_html(h)
Nodes <- t %>% html_nodes("table")
table <- html_table(Nodes[[1]])
colnames(table) <- table[1,]
table <- table[-1,]
table <- table %>% select(LungCap, Age, Height, Smoke, Gender, Caesarean)
Lung_Capacity <- table
Lung_Capacity$LungCap <- as.numeric(Lung_Capacity$LungCap)
Lung_Capacity$Age <- as.numeric(Lung_Capacity$Age)
Lung_Capacity$Height <- as.numeric(Lung_Capacity$Height)
Lung_Capacity$Smoke <- as.numeric(Lung_Capacity$Smoke == "yes")
Lung_Capacity$Gender <- as.numeric(Lung_Capacity$Gender == "male")
Lung_Capacity$Caesarean <- as.numeric(Lung_Capacity$Caesarean == "yes")
colnames(Lung_Capacity)[4] <- "Smoker_YN"
colnames(Lung_Capacity)[5] <- "Male_YN"
colnames(Lung_Capacity)[6] <- "Caesarean_YN"
head(Lung_Capacity)
Capacity <- Lung_Capacity
I am splitting the data into a training set and a validation set:
library(caret)
set.seed(1)
y <- Capacity$LungCap
testIndex <- caret::createDataPartition(y, times = 1, p = 0.2, list = FALSE)
train <- Capacity[-testIndex,]
test <- Capacity[testIndex,]
Cross-validating to obtain my final model:
set.seed(3)
control <- trainControl(method="cv", number = 5)
LinearModel <- train(LungCap ~ ., data = train, method = "lm", trControl = control)
LM <- LinearModel$finalModel
summary(LM)
And trying to run a prediction on the held-out test set:
lmPredictions <- predict(LM, newdata = test)
However, there is an error thrown that reads:
Error in eval(predvars, data, env) : object 'Smoker_YN1' not found
Looking through this site, I thought the column names of the test and train tables might have been off, but that is not the case; they are identical. The issue seems to be that training the model has renamed the factor predictors to "Smoker_YN1" instead of the intended column name "Smoker_YN". I tried renaming the column headers in the test set and I tried renaming the coefficient names. Neither approach was successful.
I've run out of research and experimental approaches. Can anyone please help with this issue?
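For what it's worth, one hedged workaround (it assumes the renaming comes from lm()'s dummy coding of factor/character predictors, which names coefficients as column name plus level) is to predict from the caret train object itself rather than its finalModel, so new data goes through the same encoding. A sketch using the objects defined above:
# predict from the caret object so newdata is encoded the same way as in training
lmPredictions <- predict(LinearModel, newdata = test)
head(lmPredictions)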
I am not sure; please go through this and tell me. My guess (and I am not an expert) is that LungCap being character and Lung being numeric interfere in this code:
h <- "https://docs.google.com/spreadsheets/d/0BxQfpNgXuWoIWUdZV1ZTc2ZscnM/edit?resourcekey=0-gqXT7Re2eUS2JGt_w1y4vA#gid=1055321634"
#install.packages("textreadr")
library(textreadr)
library(rvest)
t <- read_html(h)
t
Nodes <- t %>% html_nodes("table")
table <- html_table(Nodes[[1]])
colnames(table) <- table[1,]
table <- table[-1,]
table <- table %>% select(LungCap, Age, Height, Smoke, Gender, Caesarean)
Lung_Capacity <- table
# I changed Lung_Capacity$LungCap <- as.numeric(Lung_Capacity$LungCap) to
Lung_Capacity$Lung <- as.numeric(Lung_Capacity$LungCap)
Lung_Capacity$Age <- as.numeric(Lung_Capacity$Age)
Lung_Capacity$Height <- as.numeric(Lung_Capacity$Height)
Lung_Capacity$Smoke <- as.numeric(Lung_Capacity$Smoke == "yes")
Lung_Capacity$Gender <- as.numeric(Lung_Capacity$Gender == "male")
Lung_Capacity$Caesarean <- as.numeric(Lung_Capacity$Caesarean == "yes")
colnames(Lung_Capacity)[4] <- "Smoker_YN"
colnames(Lung_Capacity)[5] <- "Male_YN"
colnames(Lung_Capacity)[6] <- "Caesarean_YN"
head(Lung_Capacity)
# I changed to
Capacity <- Lung_Capacity
Capacity
library(caret)
set.seed(1)
# I changed y <- Capacity$LungCap to
y <- Capacity$Lung
testIndex <- caret::createDataPartition(y, times = 1, p = 0.2, list = FALSE)
train <- Capacity[-testIndex,]
test <- Capacity[testIndex,]
# I removed the original LungCap column, since Lung now holds the numeric copy
train$LungCap <- NULL
test$LungCap <- NULL
set.seed(3)
control <- trainControl(method="cv", number = 5)
# I changed LungCap to Lung
LinearModel <- train(Lung ~ ., data = train, method = "lm", trControl = control)
LM <- LinearModel$finalModel
summary(LM)
lmPredictions <- predict(LM, newdata = test)
lmPredictions
Output:
1 2 3 4 5 6 7
6.344355 10.231586 4.902900 7.500179 5.295711 9.434454 8.879997
8 9 10 11 12 13 14
12.227635 11.097691 7.775063 8.085810 6.399364 7.852107 9.480219
15 16 17 18 19 20
8.982051 10.115840 7.917863 12.089960 7.838881 9.653292
library(rio)           # import() below is assumed to come from rio
library(caTools)       # sample.split()
library(randomForest)
birth <- import("smoker_data1.xlsx")
## Splitting the dataset in test and train datasets
mysplit <- sample.split(birth, SplitRatio = 0.65)
train <- subset(birth, mysplit == T)
test <- subset(birth, mysplit == F)
## Build Random Forest model on the test set
mod1 <- randomForest(smoke~., train)
Error message: Error in y - ymean : non-numeric argument to binary operator
I think the best way is to check the data type of the smoke variable first.
If possible, convert the variable using as.factor().
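For example, a quick type check on the data from the question (column name taken from the code above) could be:
str(birth)           # shows the class of every column
class(birth$smoke)   # should be a factor for classification with randomForest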
library(readxl)
library(caTools)       # sample.split()
library(randomForest)
birth <- read_excel("smoker_data1.xlsx")
## Splitting the dataset in test and train datasets
mysplit <- sample.split(birth, SplitRatio = 0.65)
train <- subset(birth, mysplit == T)
test <- subset(birth, mysplit == F)
train$smoke <- as.factor(train$smoke)
## Build Random Forest model on the test set
mod1 <- randomForest(smoke~., train)
I already tried it with the data you gave; you just need to specify the data types correctly before fitting the randomForest function.
# set each column to the appropriate type before fitting
data1$baby_wt <- as.numeric(data1$baby_wt)
data1$income <- as.factor(data1$income)
data1$mother_a <- as.numeric(data1$mother_a)
data1$smoke <- as.factor(data1$smoke)
data1$gestation <- as.numeric(data1$gestation)
data1$mother_wt <- as.numeric(data1$mother_wt)
library(caret)
library(randomForest)
predictors <- names(data1)[!names(data1) %in% "smoke"]
inTrainingSet <- createDataPartition(data1$smoke, p=0.7, list=F)
train<- data1[inTrainingSet,]
test<- data1[-inTrainingSet,]
library(randomForest)
m.rf <- randomForest(smoke ~ ., data = train,
                     mtry = floor(sqrt(length(predictors))),  # 'x' was undefined here
                     ntree = 5000, importance = TRUE, proximity = TRUE)
m.rf
#############################################
# Test Performance
#############################################
m.pred <- predict(m.rf, test[-4], type = "response")  # type = "response" returns predicted classes
m.table <- table(m.pred, test$smoke)
library(caret)
confusionMatrix(m.table)
I'm trying to use the "lime" package to interpret a random forest model fitted on the "imports85" data set, but when I run the explain() call it returns an error:
library(lime)
library(caret)
data("imports85", package = "randomForest")
imp85 <- imports85[,-2]
imp85 <- imp85[complete.cases(imp85), ]
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[, drop=TRUE] else x)
stopifnot(require(randomForest))
NROW(imp85)*0.7
idx <- sample(1:NROW(imp85),135)
test <- imp85[idx, c(1:4, 6:25)]
train <- imp85[-idx, c(1:4, 6:25)]
resp <- imp85[[5]][-idx]
model <- train(train, resp, method = 'rf')
explainer <- lime(train, model)
explanation <- explain(test, explainer, n_labels = 1, n_features = 2)
Error in predict.randomForest(modelFit, newdata, type = "prob") :
Type of predictors in new data do not match that of the training data.
How can I solve it?
EDIT 1:
I tried to force the factor variable levels to be the same for both the train and test data sets (roughly along the lines of the sketch below), but it did not work.
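For reference, forcing the levels to match would look roughly like this sketch (looping over the training columns and re-leveling the matching test columns):
# re-level every factor column of test to the training levels
for (col in names(train)) {
  if (is.factor(train[[col]])) {
    test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
  }
}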
I would like to get the R2 between the predicted and actual values in the test data set. Why is the result from h2o.performance(m, test) different from caret::R2() or an lm model?
h2o.performance(m, test) gives 0.733401, while caret::R2(p, a) gives 0.7577784.
summary(lmm)$r.squared is the same as caret::R2(p, a).
Example code:
library(h2o)
h <- h2o.init()
data <- as.h2o(iris)
part <- h2o.splitFrame(data, 0.7, seed = 123)
train <- part[[1]]
test <- part[[2]]
m <- h2o.glm(x=2:5,y=1,train, nfolds = 10, seed = 123)
summary(m)
predictions <- h2o.predict(m, test)
p <- as.data.frame(predictions)
a <- as.data.frame(test[1])
caret::R2(p, a)
# 0.7577784
h2o.performance(m, test)
# the R^2 is 0.733401
df <- data.frame(p=p, a=a)
lmm <- lm(predict ~ Sepal.Length, data =df)
summary(lmm)$r.squared
# the r.squared is 0.7577784
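My rough understanding (which may be wrong) is that caret::R2() defaults to the squared correlation between predictions and observations, which is what summary(lmm)$r.squared reproduces, while h2o reports R2 as 1 - SSE/SST. Computed by hand on p and a (column names as produced by the code above), that would be:
# squared correlation: caret::R2() default, same as the lm r.squared
cor(p$predict, a$Sepal.Length)^2
# 1 - SSE/SST, which should be closer to what h2o.performance() reports
1 - sum((a$Sepal.Length - p$predict)^2) /
    sum((a$Sepal.Length - mean(a$Sepal.Length))^2)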
You can get training metrics as follows:
m <- h2o.glm(x = 2:5, y = 1, train, validation_frame = test)
# We would ideally use a validation set.
h2o.performance(m, test)
m@model$training_metrics
My testing data and my training data have different factor levels. I tried to merge the levels, but it does not work.
library(mgcv)
library(ff)
myData <- read.csv.ffdf(file = "myFile.csv")
myData$myVar <- as.factor(myData$myVar)
testData <- read.csv(file = "test.csv")
testData$myVar <- as.factor(testData$myVar)
form <- dependent ~ .
model <- gam(form, data=myData)
model$xlevels[["myVar"]] <- union(model$xlevels[["myVar"]], levels(testData$myVar))
predictedData <- predict(model, newdata=testData)
Then R gives me this error:
Error in predict.gam(model, newdata = testData) : 1001, 1213,1231 not in original fit
Calls: predict -> predict.gam
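One hedged alternative (a sketch only; it assumes the union() assignment above has not been run, and that NA predictions are acceptable for the unseen values such as 1001, 1213, 1231) would be to keep only the levels the model was fitted on and let everything else become NA before predicting:
# levels the model was actually fitted on (captured before touching xlevels)
fit_levels <- model$xlevels[["myVar"]]
# values of myVar never seen in training become NA; those rows get NA predictions
testData$myVar <- factor(as.character(testData$myVar), levels = fit_levels)
predictedData <- predict(model, newdata = testData)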