CHAID with r-caret error "x is not a factor"

I am using the CHAID method from caret and I am getting the same error described in this post, that x is not a factor, even though all of the x's are factors.
I'm using R 3.3.3 and caret_6.0-78.
Here is a toy example:
library(datasets)
library(caret)
library(CHAID)
testDat<-data.frame(HairEyeColor, stringsAsFactors=T)[,1:3]
str(testDat)
'data.frame': 32 obs. of 3 variables:
$ Hair: Factor w/ 4 levels "Black","Brown",..: 1 2 3 4 1 2 3 4 1 2 ...
$ Eye : Factor w/ 4 levels "Brown","Blue",..: 1 1 1 1 2 2 2 2 3 3 ...
$ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
control <- trainControl(method="repeatedcv", number=10, repeats=3,
                        savePredictions="final", summaryFunction=twoClassSummary, classProbs=TRUE)
fit.chaid <- train(Sex~Hair+Eye, data=testDat, method="chaid", metric="ROC", trControl=control)
Error: is.factor(x) is not TRUE
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Timing stopped at: 0.02 0 0.02
warnings()
Warning messages:
1: model fit failed for Fold01.Rep1: alpha2=0.05, alpha3=-1, alpha4=0.05 Error : is.factor(x) is not TRUE
...

I know this question is already old, but here is the answer I found by experimenting:
For CHAID modelling, use the x/y interface rather than the formula interface, like this:
fit.chaid <- train(x=testDat[,c(1,2)], #Hair and Eye Variable
y=testDat[,c(3)], #Sex Variable
method="chaid",
metric="ROC",
trControl=control)
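This works because, as far as I can tell, caret's formula interface runs the predictors through model.matrix(), which expands the factors into a numeric dummy-variable matrix before CHAID ever sees them, whereas the x/y interface passes the factor columns through untouched. A quick illustration (reusing testDat from above) of what the formula interface produces:
head(model.matrix(Sex ~ Hair + Eye, data = testDat))
# a numeric 0/1 matrix with dummy columns such as HairBrown and EyeBlue,
# not the factor columns that CHAID's is.factor(x) check expects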

Related

R - Caret train() "Error: Stopping" with "Not all variable names used in object found in newdata"

I am trying to build a simple Naive Bayes classifier for mushroom data. I want to use all of the variables as categorical predictors to predict whether a mushroom is edible.
I am using the caret package.
Here is my code in full:
##################################################################################
# Prepare R and R Studio environment
##################################################################################
# Clear the R studio console
cat("\014")
# Remove objects from environment
rm(list = ls())
# Install and load packages if necessary
if (!require(tidyverse)) {
install.packages("tidyverse")
library(tidyverse)
}
if (!require(caret)) {
install.packages("caret")
library(caret)
}
if (!require(klaR)) {
install.packages("klaR")
library(klaR)
}
#################################
mushrooms <- read.csv("agaricus-lepiota.data", stringsAsFactors = TRUE, header = FALSE)
na.omit(mushrooms)
names(mushrooms) <- c("edibility", "capShape", "capSurface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")
# convert bruises to a logical variable
mushrooms$bruises <- mushrooms$bruises == 't'
set.seed(1234)
split <- createDataPartition(mushrooms$edibility, p = 0.8, list = FALSE)
train <- mushrooms[split, ]
test <- mushrooms[-split, ]
predictors <- names(train)[2:20] #Create response and predictor data
x <- train[,predictors] #predictors
y <- train$edibility #response
train_control <- trainControl(method = "cv", number = 1) # Set up 1 fold cross validation
edibility_mod1 <- train( #train the model
x = x,
y = y,
method = "nb",
trControl = train_control
)
When executing the train() function I get the following output:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :2 NA's :2
Error: Stopping
In addition: Warning messages:
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in predict.NaiveBayes(modelFit, newdata) :
Not all variable names used in object found in newdata
2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in x[, 2] : subscript out of bounds
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
x and y after the script has run:
> str(x)
'data.frame': 6500 obs. of 19 variables:
$ capShape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ capSurface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ cap-color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ bruises : logi TRUE TRUE TRUE TRUE FALSE TRUE ...
$ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ gill-attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ gill-spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ gill-size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ gill-color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ stalk-shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ stalk-root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
$ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-color-above-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ stalk-color-below-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ veil-type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
$ veil-color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ ring-number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
$ ring-type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
> str(y)
Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
My environment is R version 4.0.3 (2020-10-10) "Bunny-Wunnies Freak Out" on x86_64-apple-darwin17.0, with RStudio 1.3.1093 "Apricot Nasturtium" running in desktop mode.
What you are trying to do is a bit tricky: most naive Bayes implementations, or at least the one you are using (from klaR, which is derived from e1071), assume a normal distribution. You can see this under the Details of the naiveBayes help page from e1071:
The standard naive Bayes classifier (at least this implementation)
assumes independence of the predictor variables, and Gaussian
distribution (given the target class) of metric predictors. For
attributes with missing values, the corresponding table entries are
omitted for prediction.
Your predictors are categorical, so this is problematic. You can set usekernel=TRUE with adjust=1 so that a kernel density estimate is used instead of the Gaussian, and avoid usekernel=FALSE, which throws the error.
Before that we remove the column with only one level and sort out the column names; in this case it is also easier to use the formula interface and avoid making dummy variables:
df = train
levels(df[["veil-type"]])
[1] "p"
df[["veil-type"]]=NULL
colnames(df) = gsub("-","_",colnames(df))
Grid = expand.grid(usekernel=TRUE,adjust=1,fL=c(0.2,0.5,0.8))
mod1 <- train(edibility~.,data=df,
method = "nb", trControl = trainControl(method="cv",number=5),
tuneGrid=Grid
)
mod1
Naive Bayes
6500 samples
21 predictor
2 classes: 'e', 'p'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 5200, 5200, 5200, 5200, 5200
Resampling results across tuning parameters:
fL Accuracy Kappa
0.2 0.9243077 0.8478624
0.5 0.9243077 0.8478624
0.8 0.9243077 0.8478624
Tuning parameter 'usekernel' was held constant at a value of TRUE
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0.2, usekernel = TRUE and
adjust = 1.
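As a side note, if you do not want to look for degenerate columns by hand, here is a minimal sketch (assuming the same train data frame as above) that drops every factor column with fewer than two observed levels before renaming:
df <- train
# keep non-factor columns, and factor columns with at least 2 observed levels
keep <- sapply(df, function(col) !is.factor(col) || nlevels(droplevels(col)) >= 2)
df <- df[, keep]
colnames(df) <- gsub("-", "_", colnames(df))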

"contrasts can be applied only to factors with 2 or more levels" Despite having multiple levels in each factor

I am working on a two-way mixed ANOVA using the data below, with one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the normality of the residuals of the dependent variable, I found that they are not normally distributed, but at this point I am still able to perform the two-way ANOVA. However, when I perform a log10 transformation and run the script again using the log-transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on Stack Overflow to this error mention that one or more of the factors used in the ANOVA has fewer than two levels, but as seen above that is not the case here.
Another explanation I have read is that it may be an issue of missing values (NAs). There are some:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows containing NAs, as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I can run the same script without the log transformation and it works, but with it I get the same error. The factors are the same either way and the missing values do not make a difference. Either I am making a crucial mistake or the issue lies in the log transformation lines below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not enough that the factors have 2 defined levels; those levels must also actually be present in the data. For example, below f has 2 levels but only 1 of them actually occurs.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
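If you want to check this programmatically in your own data, here is a minimal sketch (assuming your data frame is m_runjumpFREQ) that counts the levels of each factor actually present after subsetting or na.omit():
# number of observed (non-empty) levels per factor column
sapply(Filter(is.factor, m_runjumpFREQ), function(f) nlevels(droplevels(f)))
# any factor reporting fewer than 2 here will trigger the contrasts error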

Error in cross validation with factor value

I have this code:
# Define training control
set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(is_nocnv ~., data = mydata, method = "lm", trControl = train.control)
# Summarize the results
print(model)
When I execute this code I obtain this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
The field is_nocnv is a factor whose values are 'YES' and 'NO'.
str(mydata)
'data.frame': 8334 obs. of 7 variables:
$ chr : Factor w/ 1 level "chr1": 1 1 1 1 1 1 1 1 1 1 ...
$ start : int 3218610 154080441 154089408 61735 2069681 2074104 3135175 3137913 3214732 5901288 ...
$ stop : int 154074261 154081058 247813706 2061969 2071738 3130590 3136858 3212946 5900106 5902086 ...
$ strand : Factor w/ 1 level "*": 1 1 1 1 1 1 1 1 1 1 ...
$ num_probes : int 69643 3 59364 379 2 333 2 33 1943 3 ...
$ segment_mean: num -0.122 -13.462 -0.1 -0.326 -25.242 ...
$ is_nocnv : Factor w/ 2 levels "NO","YES": 2 2 2 1 1 1 1 1 1 1 ...
Here is a small part of my dataset (csv):
"chr","start","stop","strand","num_probes","segment_mean","is_nocnv"
chr1,3218610,154074261,*,69643,-0.122,YES
chr1,154080441,154081058,*,3,-13.462,YES
chr1,154089408,247813706,*,59364,-0.1003,YES
chr1,61735,2061969,*,379,-0.326,NO
chr1,2069681,2071738,*,2,-25.242,NO
chr1,2074104,3130590,*,333,-0.3957,NO
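Note that in the str() output above chr and strand each have only one level, which is exactly the situation described in the answer to the previous question. A minimal sketch (assuming the same mydata) of finding and dropping such columns before calling train():
# factors with a single observed level cannot be contrast-coded, hence the error
sapply(Filter(is.factor, mydata), nlevels)
mydata2 <- mydata[, sapply(mydata, function(col) !is.factor(col) || nlevels(col) > 1)]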

how to calculate GBM accuracy in r

I used the gbm() function to create the model and I want to get the accuracy. Here is my code:
df<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
str(df)
F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train<-df[intrain, ]
test<-df[-intrain, ]
install.packages("gbm")
library("gbm")
df_boosting<-gbm(Creditability~.,distribution = "bernoulli", n.trees=100, verbose=TRUE, interaction.depth=4,
shrinkage=0.01, data=train)
summary(df_boosting)
yhat.boost<-predict (df_boosting ,newdata =test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)
However, when using the summary function, an error appears. The error message is as follows.
Error in plot.window(xlim, ylim, log = log, ...) :
only finite values can be used for 'xlim'
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
Also, when measuring the MSE with the mean function, the following warning appears:
Warning message:
In Ops.factor(yhat.boost, test$Creditability) :
'-' not meaningful for factors
Do you know why these two errors appear? Thank you in advance.
In your code the problem is in the definition of the (binary) response variable Creditability: you declare it as a factor, but gbm needs a numeric response variable.
Here is the code:
df <- read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
F <- c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])
str(df)
Creditability is now a binary numeric variable:
'data.frame': 1000 obs. of 21 variables:
$ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
$ Account.Balance : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ...
$ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
$ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ...
$ Purpose : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ...
...
... and the remaining part of the code works nicely:
library(caret)
set.seed(1000)
intrain <- createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train <- df[intrain, ]
test <- df[-intrain, ]
library("gbm")
df_boosting <- gbm(Creditability~., distribution = "bernoulli",
n.trees=100, verbose=TRUE, interaction.depth=4,
shrinkage=0.01, data=train)
par(mar=c(3,14,1,1))
summary(df_boosting, las=2)
##########
var rel.inf
Account.Balance Account.Balance 36.8578980
Credit.Amount Credit.Amount 12.0691120
Duration.of.Credit..month. Duration.of.Credit..month. 10.5359895
Purpose Purpose 10.2691646
Payment.Status.of.Previous.Credit Payment.Status.of.Previous.Credit 9.1296524
Value.Savings.Stocks Value.Savings.Stocks 4.9620662
Instalment.per.cent Instalment.per.cent 3.3124252
...
##########
yhat.boost <- predict(df_boosting , newdata=test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)
[1] 0.2719788
Hope this helps.
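Since the question also asks how to calculate accuracy rather than MSE, here is a minimal sketch (assuming a 0.5 probability cutoff) of turning the boosted model into class predictions:
# predicted probabilities on the response scale
p.boost <- predict(df_boosting, newdata = test, n.trees = 100, type = "response")
# classify with a 0.5 cutoff and compare with the observed 0/1 labels
pred.class <- ifelse(p.boost > 0.5, 1, 0)
mean(pred.class == test$Creditability)  # proportion classified correctly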

R biglm predict searching for dependent variable

I'm using the biglm package to run a regression on a data set. The regression runs fine using the following code:
chunkStart <- seq(1,150000000,1000000)
chunkEnd <- seq(1000000,151000000,1000000)
ff <- price ~ factor(Var1) + factor(Var2)
#for(i in 1:length(chunkStart)){
for(i in 1:5){
startRow <- chunkStart[i]
endRow <- chunkEnd[i]
curchunk <- data.frame( price=x[startRow:endRow,1]
,Var1=factor( x[startRow:endRow,6], levels=1:3), Var2= factor( x[startRow:endRow,7], levels=1:3 ) )
if(i == 1){
a <- biglm(ff,curchunk )
}
if(i != 1){
a <- update(a,curchunk )
}
rm(curchunk )
gc()
print(paste(i, " | ",startRow ," | ",endRow ," | ", sep=""))
flush.console()
}
> summary(a)
Large data regression model: biglm(ff, curchunk)
Sample size = 5000000
Coef (95% CI) SE p
(Intercept) 0.0457 0.0454 0.0461 2e-04 0
factor(Var1)2 0.0189 0.0184 0.0194 2e-04 0
factor(Var1)3 0.0148 0.0142 0.0155 3e-04 0
factor(Var2)2 -0.0331 -0.0335 -0.0326 2e-04 0
factor(Var2)3 -0.0417 -0.0426 -0.0408 4e-04 0
The problems come when I try to predict using the biglm object, 'a'.
> df1 <- data.frame(y[1:1000,])
> pred1 <- predict(a, df1)
Error in eval(expr, envir, enclos) : object 'price' not found
Why is the predict function looking for the price (dependent) variable? Any suggestions?
EDIT:
> head(df1)
Var1 Var2
1 3 3
2 3 1
3 3 2
4 2 1
5 2 2
6 1 1
> str(df1)
'data.frame': 1000 obs. of 2 variables:
$ Var1: Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 2 1 2 1 ...
$ Var2: Factor w/ 3 levels "1","2","3": 3 1 2 1 2 1 1 1 2 1 ...
> pred1 <- predict(a, df1)
Error in eval(expr, envir, enclos) : object 'price' not found
The reason it is looking for the dependent variable is that the predict method uses a call to model.frame from the stats package, and that function requires all of the variables in the formula to be present in the new data. This is indicated on the model.frame help page, without explanation of the motivation behind it.
All you actually need to do about this is create a variable in your new data with the same name as the dependent variable and fill it with zeroes (or any non-missing value). So it should work if you run this:
df1$price <- 0
pred1 <- predict(a, df1)
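If you want to confirm that it is model.frame() itself that insists on the response, a quick sketch (reusing ff and df1 from above):
# model.frame() evaluates every variable named in the formula, including the
# response, so it fails whenever 'price' is absent from the supplied data
try(model.frame(ff, data = df1[, c("Var1", "Var2")]))  # object 'price' not found
head(model.frame(ff, data = df1))                      # succeeds once df1$price exists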

Resources