I used the gbm() function to create the model and I want to get the accuracy. Here is my code:
df<-read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
str(df)
F=c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])
library(caret)
set.seed(1000)
intrain<-createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train<-df[intrain, ]
test<-df[-intrain, ]
install.packages("gbm")
library("gbm")
df_boosting<-gbm(Creditability~.,distribution = "bernoulli", n.trees=100, verbose=TRUE, interaction.depth=4,
shrinkage=0.01, data=train)
summary(df_boosting)
yhat.boost<-predict (df_boosting ,newdata =test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)
However, when using the summary function, an error appears. The error message is as follows.
Error in plot.window(xlim, ylim, log = log, ...) :
only finite values can be used for 'xlim'
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
And when measuring the MSE with the mean function, the following warning also appears:
Warning message:
In Ops.factor(yhat.boost, test$Creditability) :
'-' not meaningful for factors
Do you know why these two errors appear? Thank you in advance.
In your code the problem is the definition of the (binary) response variable Creditability: you declare it as a factor, but gbm needs a numerical (0/1) response variable.
Here is the code:
df <- read.csv("http://freakonometrics.free.fr/german_credit.csv", header=TRUE)
F <- c(2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for(i in F) df[,i]=as.factor(df[,i])
str(df)
Creditability is now a binary numerical variable:
'data.frame': 1000 obs. of 21 variables:
$ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
$ Account.Balance : Factor w/ 4 levels "1","2","3","4": 1 1 2 1 1 1 1 1 4 2 ...
$ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
$ Payment.Status.of.Previous.Credit: Factor w/ 5 levels "0","1","2","3",..: 5 5 3 5 5 5 5 5 5 3 ...
$ Purpose : Factor w/ 10 levels "0","1","2","3",..: 3 1 9 1 1 1 1 1 4 4 ...
...
... and the remaining part of the code works nicely:
library(caret)
set.seed(1000)
intrain <- createDataPartition(y=df$Creditability, p=0.7, list=FALSE)
train <- df[intrain, ]
test <- df[-intrain, ]
library("gbm")
df_boosting <- gbm(Creditability~., distribution = "bernoulli",
n.trees=100, verbose=TRUE, interaction.depth=4,
shrinkage=0.01, data=train)
par(mar=c(3,14,1,1))
summary(df_boosting, las=2)
##########
var rel.inf
Account.Balance Account.Balance 36.8578980
Credit.Amount Credit.Amount 12.0691120
Duration.of.Credit..month. Duration.of.Credit..month. 10.5359895
Purpose Purpose 10.2691646
Payment.Status.of.Previous.Credit Payment.Status.of.Previous.Credit 9.1296524
Value.Savings.Stocks Value.Savings.Stocks 4.9620662
Instalment.per.cent Instalment.per.cent 3.3124252
...
##########
yhat.boost <- predict(df_boosting , newdata=test, n.trees=100)
mean((yhat.boost-test$Creditability)^2)
[1] 0.2719788
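A side note, since the original goal was accuracy rather than MSE: by default predict.gbm returns values on the log-odds (link) scale for distribution = "bernoulli". A minimal sketch to get predicted probabilities and a classification accuracy instead (the 0.5 cutoff is just an assumed threshold):
phat.boost <- predict(df_boosting, newdata=test, n.trees=100, type="response") # probabilities in [0,1]
class.boost <- ifelse(phat.boost > 0.5, 1, 0)                                  # classify at an assumed 0.5 cutoff
mean(class.boost == test$Creditability)                                        # proportion correctly classified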
Hope this can help you.
I am trying to build a simple Naive Bayes classifier for the mushroom data. I want to use all of the variables as categorical predictors to predict whether a mushroom is edible.
I am using caret package.
Here is my code in full:
##################################################################################
# Prepare R and R Studio environment
##################################################################################
# Clear the R studio console
cat("\014")
# Remove objects from environment
rm(list = ls())
# Install and load packages if necessary
if (!require(tidyverse)) {
install.packages("tidyverse")
library(tidyverse)
}
if (!require(caret)) {
install.packages("caret")
library(caret)
}
if (!require(klaR)) {
install.packages("klaR")
library(klaR)
}
#################################
mushrooms <- read.csv("agaricus-lepiota.data", stringsAsFactors = TRUE, header = FALSE)
na.omit(mushrooms)
names(mushrooms) <- c("edibility", "capShape", "capSurface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color", "ring-number", "ring-type", "spore-print-color", "population", "habitat")
# convert bruises to a logical variable
mushrooms$bruises <- mushrooms$bruises == 't'
set.seed(1234)
split <- createDataPartition(mushrooms$edibility, p = 0.8, list = FALSE)
train <- mushrooms[split, ]
test <- mushrooms[-split, ]
predictors <- names(train)[2:20] #Create response and predictor data
x <- train[,predictors] #predictors
y <- train$edibility #response
train_control <- trainControl(method = "cv", number = 1) # Set up 1 fold cross validation
edibility_mod1 <- train( #train the model
x = x,
y = y,
method = "nb",
trControl = train_control
)
When executing the train() function I get the following output:
Something is wrong; all the Accuracy metric values are missing:
Accuracy Kappa
Min. : NA Min. : NA
1st Qu.: NA 1st Qu.: NA
Median : NA Median : NA
Mean :NaN Mean :NaN
3rd Qu.: NA 3rd Qu.: NA
Max. : NA Max. : NA
NA's :2 NA's :2
Error: Stopping
In addition: Warning messages:
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in predict.NaiveBayes(modelFit, newdata) :
Not all variable names used in object found in newdata
2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in x[, 2] : subscript out of bounds
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
x and y after script run:
> str(x)
'data.frame': 6500 obs. of 19 variables:
$ capShape : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
$ capSurface : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
$ cap-color : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
$ bruises : logi TRUE TRUE TRUE TRUE FALSE TRUE ...
$ odor : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
$ gill-attachment : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
$ gill-spacing : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
$ gill-size : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
$ gill-color : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
$ stalk-shape : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
$ stalk-root : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
$ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
$ stalk-color-above-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ stalk-color-below-ring : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
$ veil-type : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
$ veil-color : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
$ ring-number : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
$ ring-type : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
> str(y)
Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
My environment is:
> R.version
_
platform x86_64-apple-darwin17.0
arch x86_64
os darwin17.0
system x86_64, darwin17.0
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
> RStudio.Version()
$citation
To cite RStudio in publications use:
RStudio Team (2020). RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
A BibTeX entry for LaTeX users is
#Manual{,
title = {RStudio: Integrated Development Environment for R},
author = {{RStudio Team}},
organization = {RStudio, PBC},
address = {Boston, MA},
year = {2020},
url = {http://www.rstudio.com/},
}
$mode
[1] "desktop"
$version
[1] ‘1.3.1093’
$release_name
[1] "Apricot Nasturtium"
What you are trying to do is a bit tricky: most naive Bayes implementations, or at least the one you are using (from klaR, which is derived from e1071), assume a normal distribution. You can see this under the details of the naiveBayes help page from e1071:
The standard naive Bayes classifier (at least this implementation)
assumes independence of the predictor variables, and Gaussian
distribution (given the target class) of metric predictors. For
attributes with missing values, the corresponding table entries are
omitted for prediction.
And your predictors are categorical, so this might be problematic. You can set usekernel=TRUE and adjust=1 so that a kernel density estimate is used instead of the Gaussian, and avoid usekernel=FALSE, which will throw the error.
Before that, we remove columns with only one level and sort out the column names; in this case it is also easier to use the formula interface and avoid making dummy variables:
df = train
levels(df[["veil-type"]])
[1] "p"
df[["veil-type"]]=NULL
colnames(df) = gsub("-","_",colnames(df))
Grid = expand.grid(usekernel=TRUE,adjust=1,fL=c(0.2,0.5,0.8))
mod1 <- train(edibility~.,data=df,
method = "nb", trControl = trainControl(method="cv",number=5),
tuneGrid=Grid
)
mod1
Naive Bayes
6500 samples
21 predictor
2 classes: 'e', 'p'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 5200, 5200, 5200, 5200, 5200
Resampling results across tuning parameters:
fL Accuracy Kappa
0.2 0.9243077 0.8478624
0.5 0.9243077 0.8478624
0.8 0.9243077 0.8478624
Tuning parameter 'usekernel' was held constant at a value of TRUE
Tuning parameter 'adjust' was held constant at a value of 1
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were fL = 0.2, usekernel = TRUE and
adjust = 1.
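To evaluate the tuned model on the held-out test set, the same cleanup (dropping veil-type and replacing the hyphens in the column names) has to be applied to test before predicting. A rough, untested sketch of that follow-up (object names are only illustrative):
df_test <- test
df_test[["veil-type"]] <- NULL
colnames(df_test) <- gsub("-", "_", colnames(df_test))
pred <- predict(mod1, newdata=df_test)     # class predictions from the tuned NB model
confusionMatrix(pred, df_test$edibility)   # accuracy, kappa and the confusion table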
I am working on a two-way mixed ANOVA using the data below, with one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the normality of the residuals of the dependent variable, I found that they are not normally distributed, but at that point I was still able to perform the two-way ANOVA. However, when I apply a log10 transformation and run the script again using the log-transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on Stack Overflow to this error mention that one or more of the factors used in the ANOVA has fewer than two levels, but as seen above that is not the case here.
Another explanation I have read is that it may be an issue of missing values (NA's), and there are some:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows containing NA's, as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I can run the same script without the log transformation and it works, but with it I get the same error. The factors are the same either way, and the missing values make no difference. Either I am making a crucial mistake or the issue is in the log-transformation lines below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not enough that the factors are defined with 2 levels; in addition, those levels must actually be present in the data. For example, below f has 2 levels but only 1 is actually present.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
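In your case, a quick way to check which factor columns actually have fewer than two values present after na.omit(), and to drop the unused levels, could look like this (a sketch, using the data frame from your question):
# how many distinct values does each factor column actually contain?
sapply(m_runjumpFREQ, function(z) if (is.factor(z)) length(unique(z)) else NA)
# drop levels that no longer occur anywhere in the data
m_runjumpFREQ <- droplevels(m_runjumpFREQ)
Any factor that still ends up with a single present level cannot enter the model; it has to be removed from the formula, or you need data covering the missing level.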
I am using the CHAID method from caret. I am getting the same error as in this post, that x is not a factor, even though all of the x's are factors.
I'm using R 3.3.3 and caret_6.0-78
Here is a toy example:
library(datasets)
library(caret)
library(CHAID)
testDat<-data.frame(HairEyeColor, stringsAsFactors=T)[,1:3]
str(testDat)
'data.frame': 32 obs. of 3 variables:
$ Hair: Factor w/ 4 levels "Black","Brown",..: 1 2 3 4 1 2 3 4 1 2 ...
$ Eye : Factor w/ 4 levels "Brown","Blue",..: 1 1 1 1 2 2 2 2 3 3 ...
$ Sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
control <- trainControl(method="repeatedcv", number=10, repeats=3,
                        savePredictions="final", summaryFunction=twoClassSummary, classProbs=TRUE)
fit.chaid <- train(Sex~Hair+Eye, data=testDat, method="chaid", metric="ROC", trControl=control)
Error: is.factor(x) is not TRUE
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Timing stopped at: 0.02 0 0.02
warnings()
Warning messages:
1: model fit failed for Fold01.Rep1: alpha2=0.05, alpha3=-1, alpha4=0.05 Error : is.factor(x) is not TRUE
.
.
I know that this question is already old, but here is the answer I found for it by experimenting:
For CHAID modelling, use the x/y interface rather than the formula interface, like this:
fit.chaid <- train(x=testDat[,c(1,2)], #Hair and Eye Variable
y=testDat[,c(3)], #Sex Variable
method="chaid",
metric="ROC",
trControl=control)
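As far as I can tell, the reason is that train()'s formula interface runs the predictors through model.matrix(), which expands the factors into numeric dummy columns, so CHAID no longer sees factors, while the x/y interface passes the factors through untouched. A small, purely illustrative check:
# the formula interface turns the factors into a numeric design matrix ...
head(model.matrix(Sex ~ Hair + Eye, data=testDat))
# ... whereas the x/y interface keeps them as factors, which CHAID requires
str(testDat[, c(1, 2)])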
I am attempting to run a classification algorithm for a dataset with no missing values. Here is the dataset description:
'data.frame': 59977 obs. of 6 variables:
$ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 1 2 2 ...
$ age : num 35.7 35.7 35.7 35.7 35.7 ...
$ code : Factor w/ 492 levels "ADN105","AXN16B",..: 128 128 128 363 363 363 104 104 221 221 ...
$ totalflags : num 4 4 4 4 4 4 3 3 2 2 ...
$ measure2 : num 30 30 30 1 1 1 23 23 22 22 ...
$ outcome : num 1 1 1 0 0 0 1 1 1 1 ...
- attr(*, "na.action")=Class 'omit' Named int [1:138] 3718 3719 5493 5494 5495 5496 7302 7303 8415 8416 ...
.. ..- attr(*, "names")= chr [1:138] "4929" "4930" "7384" "7385" ...
When I run the following commands
x <- Mydataset[,1:5]
y <- Mydataset[,6]
fit <- glmnet(x, y, family="binomial", alpha=0.5, lambda=0.001)
I get
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
NAs introduced by coercion
Before running the glm model, I did this:
Mydataset <- na.omit(Mydataset)
And checked to make sure no NA's exist:
sapply(Mydataset, function(y) sum(length(which(is.na(y)))))
and I got:
gender age code totalflags measure2 outcome
0 0 0 0 0 0
I looked at other questions but couldn't find anything relevant. I appreciate any thoughts and help on this.
EDIT: ANSWER
I did a little digging and decided to change the data frame to a numeric matrix, and the model ran without complaining. This is the code that helped me:
x <- data.matrix(Mydataset[,1:5])
y <- data.matrix(Mydataset[,6])
The most likely cause is small or zero counts within one or more levels of your factor variables. Try this first:
Mydataset [ c('gender', 'code') ] <-
lapply( Mydataset [ c('gender', 'code') ], factor)
If that's not effective, then you should show the actual code used, along with a better description and the names of all the objects involved. At the moment we don't even know what x and y are.
EDIT: The glmnet function does not have a formula interface and is not set up to handle data.frames and factors the way typical R regression functions are. After looking at the structure of x (still a list/data.frame), reviewing the help page ?glmnet, and doing a bit of searching for the correct way to handle factors when a numeric matrix is the expected input, I suggest converting your factors to dummy variables with model.matrix. It will also be easier to interpret the results if you change the default contrast scheme away from treatment contrasts (see https://stats.stackexchange.com/questions/69804/group-categorical-variables-in-glmnet):
contr.Dummy <- function(contrasts, ...){
conT <- contr.treatment(contrasts=FALSE, ...)
conT
}
options(contrasts=c(ordered='contr.Dummy', unordered='contr.Dummy'))
x.m <- model.matrix( ~.-1, x)
fit <- glmnet(x=x.m, y, family="binomial", alpha=0.5, lambda=0.001)
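When predicting on new observations, the same model.matrix() call (with the same contrasts options in effect) has to be applied so that the dummy columns line up with the training matrix; otherwise glmnet will complain about a dimension mismatch. A minimal sketch, with newdata standing in for your new data frame:
newx <- model.matrix(~ . - 1, newdata)             # same dummy coding as for the training data
pred <- predict(fit, newx=newx, type="response")   # predicted probabilities from the binomial fit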
I'm using the biglm package to run a regression on a data set. The regression runs fine using the following code:
chunkStart <- seq(1,150000000,1000000)
chunkEnd <- seq(1000000,151000000,1000000)
ff <- price ~ factor(Var1) + factor(Var2)
#for(i in 1:length(chunkStart)){
for(i in 1:5){
startRow <- chunkStart[i]
endRow <- chunkEnd[i]
curchunk <- data.frame( price=x[startRow:endRow,1]
,Var1=factor( x[startRow:endRow,6], levels=1:3), Var2= factor( x[startRow:endRow,7], levels=1:3 ) )
if(i == 1){
a <- biglm(ff,curchunk )
}
if(i != 1){
a <- update(a,curchunk )
}
rm(curchunk )
gc()
print(paste(i, " | ",startRow ," | ",endRow ," | ", sep=""))
flush.console()
}
> summary(a)
Large data regression model: biglm(ff, curchunk)
Sample size = 5000000
Coef (95% CI) SE p
(Intercept) 0.0457 0.0454 0.0461 2e-04 0
factor(Var1)2 0.0189 0.0184 0.0194 2e-04 0
factor(Var1)3 0.0148 0.0142 0.0155 3e-04 0
factor(Var2)2 -0.0331 -0.0335 -0.0326 2e-04 0
factor(Var2)3 -0.0417 -0.0426 -0.0408 4e-04 0
The problems come when I try to predict using the biglm object, 'a'.
> df1 <- data.frame(y[1:1000,])
> pred1 <- predict(a, df1)
Error in eval(expr, envir, enclos) : object 'price' not found
Why is the predict function looking for the price/ dependent variable? Any suggestions?
EDIT:
> head(df1)
Var1 Var2
1 3 3
2 3 1
3 3 2
4 2 1
5 2 2
6 1 1
> str(df1)
'data.frame': 1000 obs. of 2 variables:
$ Var1: Factor w/ 3 levels "1","2","3": 3 3 3 2 2 1 2 1 2 1 ...
$ Var2: Factor w/ 3 levels "1","2","3": 3 1 2 1 2 1 1 1 2 1 ...
> pred1 <- predict(a, df1)
Error in eval(expr, envir, enclos) : object 'price' not found
The reason it is looking for the dependent variable is that the predict method uses a call to model.frame from the stats package, and that function requires all of the variables in the formula, including the response, to be present in the new data. This is stated on the model.frame help page, although no motivation is given for it.
All you actually need to do about this is create a variable in your new data that has the same name as the dependent variable, then fill it with zeroes (or any non-missing value). So it should work if you run this:
df1$price <- 0
pred1 <- predict(a, df1)
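As a quick, purely illustrative sanity check, the value used to fill the placeholder column does not influence the predictions; it only has to exist and be non-missing:
df1$price <- 999           # any other non-missing filler value
pred2 <- predict(a, df1)
all.equal(pred1, pred2)    # should be TRUE: the filler value does not change the predictions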