R error: all arguments must have the same length

I got an error while fitting a naive Bayes model in R. Here are my code and the error:
library(e1071)
#data
train_data <- read.csv('https://raw.githubusercontent.com/JonnyyJ/data/master/train.csv',header=T)
test_data <- read.csv('https://raw.githubusercontent.com/JonnyyJ/data/master/test.csv',header=T)
efit <- naiveBayes(y~job+marital+education+default+contact+month+day_of_week+
poutcome+age+pdays+previous+cons.price.idx+cons.conf.idx+euribor3m
,train_data)
pre <- predict(efit, test_data)
bayes_table <- table(pre, test_data[,ncol(test_data)])
accuracy_test_bayes <- sum(diag(bayes_table))/sum(bayes_table)
list('predict matrix'=bayes_table, 'accuracy'=accuracy_test_bayes)
ERROR:
bayes_table <- table(pre, test_data[,ncol(test_data)])
Error in table(pre, test_data[, ncol(test_data)]) :
all arguments must have the same length
accuracy_test_bayes <- sum(diag(bayes_table))/sum(bayes_table)
Error in diag(bayes_table) : object 'bayes_table' not found
list('predict matrix'=bayes_table, 'accuracy'=accuracy_test_bayes)
Error: object 'bayes_table' not found
I really don't understand what's going on, because I'm new to R.

For some reason, the default predict(efit, test_data, type = "class") doesn't work in this case (probably because your model predicts 0 for all observations in the test dataset). You also need to build the table from your outcome column: test_data[,ncol(test_data)] returns euribor3m, not y. The following should work:
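library(dplyr)  # for %>%, mutate(), if_else(), pull()
pre <- predict(efit, test_data, type = "raw") %>%
  as.data.frame() %>%
  # note: 0 < 1 compares two numeric literals, so every row is labelled 0
  mutate(prediction = if_else(0 < 1, 0, 1)) %>%
  pull(prediction)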
bayes_table <- table(pre, test_data$y)
accuracy_test_bayes <- sum(diag(bayes_table)) / sum(bayes_table)
list('predict matrix' = bayes_table, 'accuracy' = accuracy_test_bayes)
# $`predict matrix`
#
# pre    0    1
#   0 7282  956
#
# $accuracy
# [1] 0.8839524
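One caveat about the snippet above: the condition 0 < 1 compares two numeric literals, which is always TRUE, so every observation is labelled 0 (consistent with the single-row table shown). Below is a minimal base R sketch of comparing the two probability columns instead. The matrix raw_probs is a made-up stand-in for the output of predict(efit, test_data, type = "raw"), and I am assuming its columns are named "0" and "1":

```r
# Hypothetical stand-in for predict(efit, test_data, type = "raw")
raw_probs <- matrix(c(0.9, 0.1,
                      0.3, 0.7,
                      0.6, 0.4),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("0", "1")))

probs <- as.data.frame(raw_probs)

# Comparing literals: always TRUE, so the result is a single 0
constant_label <- ifelse(0 < 1, 0, 1)

# Comparing the columns: 1 where class "1" is more probable than class "0"
prediction <- ifelse(probs[["1"]] > probs[["0"]], 1, 0)
print(prediction)
```

In a dplyr pipeline the equivalent condition would probably refer to the columns with backticks, for example if_else(`1` > `0`, 1, 0). The accuracy would then need to be recomputed, since the figure quoted above comes from the all-zero predictions.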

Related

Getting variables names from glmnet lasso into a data.frame

I'm working with a phyloseq object ps.scale and trying to get the most important variables/features that can predict health status sample_data(ps.scale)$group.
Code is as follows:
library(glmnet)
metadata <- factor(sample_data(ps.scale)$group)
otu_tab <- otu_table(ps.scale)
otu_tab <- apply(otu_tab, 2, function(x) x+1/sum(x+1))
otu_tab <- t(log10(otu_tab))
y <- metadata
x <- otu_tab
lasso <- cv.glmnet(x, y, family="multinomial", alpha=1)
print(lasso)
plot(lasso)
So I get the results and a plot here.
#Call: cv.glmnet(x = x, y = y, family = "multinomial", alpha = 1)
#Measure: Multinomial Deviance
# Lambda Index Measure SE Nonzero
#min 0.03473 36 1.704 0.05392 68
#1se 0.05529 26 1.751 0.05474 16
Now I want to extract the important variables/features (i.e., OTUs). Below are some code snippets I gathered from the internet:
Code 1
all_1se <- coef(lasso, s = "lambda.1se")
chosen_1se <- all_1se[all_1se > 0, ]
chosen_1se
#Error: 'list' object cannot be coerced to type 'double'
Code 2
tmp_coeffs <- coef(lasso, s = "lambda.1se")
data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], coefficient = tmp_coeffs@x)
#Error in data.frame(name = tmp_coeffs@Dimnames[[1]][tmp_coeffs@i + 1], :
# trying to get slot "Dimnames" from an object of a basic class ("list") with no slots
Code 3
myCoefs <- coef(lasso, s="lambda.min");
myCoefs[which(myCoefs != 0 ) ]
myCoefs@Dimnames[[1]][which(myCoefs != 0 ) ] #feature names: intercept included
## Assemble into a data.frame
myResults <- data.frame(
features = myCoefs@Dimnames[[1]][ which(myCoefs != 0 ) ], #intercept included
coefs = myCoefs [ which(myCoefs != 0 ) ] #intercept included
)
myResults
#Error in h(simpleError(msg, call)) :
# error in evaluating the argument 'x' in selecting a method for function 'which': 'list' object cannot be coerced to type 'double'
#3.h(simpleError(msg, call))
#2..handleSimpleError(function (cond)
# .Internal(C_tryCatchHelper(addr, 1L, cond)), "'list' object cannot be coerced to type 'double'",
# base::quote(which(myCoefs != 0)))
#1.which(myCoefs != 0)
I need help fixing the above errors, mainly 'list' object cannot be coerced to type 'double'.
Thank you in advance.
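A possible reading of the errors above: with family="multinomial", coef() on a glmnet fit returns a list with one sparse coefficient matrix per class, so snippets written for a single matrix fail on the list. Below is a minimal sketch of looping over that list. The helper nonzero_coefs and the object fake are hypothetical; fake is a hand-built stand-in for coef(lasso, s = "lambda.1se") with invented class and OTU names:

```r
library(Matrix)  # sparse matrix classes, as used for glmnet coefficients

# Hypothetical helper: collect the nonzero entries of each per-class
# one-column coefficient matrix into one long data frame.
nonzero_coefs <- function(coef_list) {
  per_class <- lapply(names(coef_list), function(cls) {
    m <- as.matrix(coef_list[[cls]])   # dense p x 1 matrix
    keep <- m[, 1] != 0
    data.frame(class = rep(cls, sum(keep)),
               feature = rownames(m)[keep],
               coefficient = unname(m[keep, 1]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, per_class)
}

# Made-up stand-in for coef(lasso, s = "lambda.1se")
feat <- c("(Intercept)", "OTU1", "OTU2", "OTU3")
fake <- list(
  healthy = Matrix(c(0.5, 0, -1.2, 0), ncol = 1, sparse = TRUE,
                   dimnames = list(feat, "1")),
  sick    = Matrix(c(-0.5, 0.8, 0, 0), ncol = 1, sparse = TRUE,
                   dimnames = list(feat, "1"))
)

res <- nonzero_coefs(fake)
print(res)
```

On the real fit, the call would presumably be nonzero_coefs(coef(lasso, s = "lambda.1se")). The comparison uses != 0 rather than > 0 so that negative coefficients are kept as well.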

Neural Network Confusion Matrix Error: All arguments must have the same length?

I am trying to get a confusion matrix for my neural network, but I am getting an error: "Error in table(data, reference, dnn = dnn, ...) : all arguments must have the same length". There shouldn't be any null values, as I removed them before training the neural network. I can see that my prediction.net1 has a length of 82 using length(prediction.net1). My df.valid1 has a length of 27 using length(df.valid1). How can I fix this error so that I can view the confusion matrix?
attribute$Number.of.Dining.and.Beverage.Options[is.na(attribute$Number.of.Dining.and.Beverage.Options)] <- 0
attribute$Number.of.Highways[is.na(attribute$Number.of.Highways)] <- 0
attribute$Number.of.Historical.Properties[is.na(attribute$Number.of.Historical.Properties)] <- 0
attribute$Number.of.Entertainment.Options[is.na(attribute$Number.of.Entertainment.Options)] <- 0
attribute$Number.of.Medical.Facilities[is.na(attribute$Number.of.Medical.Facilities)] <- 0
attribute$Number.of.Offices[is.na(attribute$Number.of.Offices)] <- 0
attribute$Number.of.Parks[is.na(attribute$Number.of.Parks)] <- 0
attribute$Number.of.Rail.Roads[is.na(attribute$Number.of.Rail.Roads)] <- 0
attribute$Number.of.Educational.Institutions[is.na(attribute$Number.of.Educational.Institutions)] <- 0
attribute$Number.of.Shops[is.na(attribute$Number.of.Shops)] <- 0
attribute$Number.of.Bus.Stops[is.na(attribute$Number.of.Bus.Stops)] <- 0
str(attribute)
#Creating new data frames for each property type
One_Bed_Property <- attribute %>% select(c(Number.of.Dining.and.Beverage.Options,
Number.of.Highways,
Number.of.Historical.Properties,
Number.of.Entertainment.Options,
Number.of.Medical.Facilities,
Number.of.Offices,
Number.of.Parks,
Number.of.Rail.Roads,
Number.of.Educational.Institutions,
Number.of.Shops,
Number.of.Bus.Stops,
Proximity_to_Uptown,
Proximity_to_Light_Rail,
`1_Bed_Target`))
#Dropping NA's from Target Variable
One_Bed_Property1 <- na.omit(One_Bed_Property)
str(One_Bed_Property1)
#Select numeric variables
one_bed_processed <- One_Bed_Property1[,c(0:13)]
#Scale numeric variables
maxs1 <- apply(one_bed_processed,2,max)
mins1 <- apply(one_bed_processed,2,min)
one_bed_processed1 <- as.data.frame(scale(one_bed_processed,
center = mins1,
scale=maxs1 - mins1))
one_bed_processed2 <- cbind(one_bed_processed1, One_Bed_Property1)
#Data partition
trainIndex1 <- createDataPartition(one_bed_processed2$`1_Bed_Target`,
p=0.7,
list=FALSE,
times=1)
df.train1 <- one_bed_processed2[trainIndex1,]
df.valid1 <- one_bed_processed2[-trainIndex1,]
#Train neural net
nn1 <- neuralnet(`1_Bed_Target`~ Number.of.Dining.and.Beverage.Options
+ Number.of.Highways
+ Number.of.Historical.Properties
+ Number.of.Entertainment.Options
+ Number.of.Medical.Facilities
+ Number.of.Offices
+ Number.of.Parks
+ Number.of.Rail.Roads
+ Number.of.Educational.Institutions
+ Number.of.Shops
+ Number.of.Bus.Stops
+ Proximity_to_Uptown
+ Proximity_to_Light_Rail, data = df.train1, hidden=c(5,2), linear.output = FALSE)
plot(nn1)
#Model evaluation
#Changing DV data types
df.valid1$`1_Bed_Target` <- as.factor(df.valid1$`1_Bed_Target`)
prediction.net1 <- predict(nn1, newdata = df.valid1)
prediction.net1 <- ifelse(prediction.net1>0.5,1,0)
confusionMatrix(as.factor(prediction.net1),df.valid1$`1_Bed_Target`)
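A note on the lengths quoted in the question: length() on a data frame counts columns, not rows, so 27 is the number of columns of df.valid1 and cannot be compared directly with the 82 predictions. Here is a small sketch with made-up data showing the difference, along with the check I would run before building a confusion matrix. The one-column matrix pred is only assumed to mirror the shape of the network's predictions:

```r
# Toy data frame: 5 rows, 3 columns (made-up values)
valid <- data.frame(a = 1:5, b = 6:10, target = c(0, 1, 0, 1, 1))

n_cols <- length(valid)  # number of columns
n_rows <- nrow(valid)    # number of rows

# Hypothetical predictions, one per row, stored as a one-column matrix
pred <- matrix(c(0.2, 0.8, 0.4, 0.9, 0.1), ncol = 1)
pred_class <- ifelse(pred[, 1] > 0.5, 1, 0)

# table() needs one entry per observation in both arguments
stopifnot(length(pred_class) == nrow(valid))
conf_tab <- table(predicted = pred_class, actual = valid$target)
print(conf_tab)
```

If length(prediction.net1) and nrow(df.valid1) still disagree on the real data, that mismatch is the thing to track down. This sketch does not diagnose where it comes from.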

Get an error: incorrect number of dimensions when I'm using R

When I use ROCR to evaluate the naive Bayes model, I get the error "incorrect number of dimensions". I really don't know how to debug it. Here's my error log:
$`predict matrix`
pre 0 1
0 7282 956
$accuracy
[1] 0.8839524
> #ROCR
> library(ROCR)
> pred<-prediction(predictions =pre[,2],labels =test_data$y)
Error in pre[, 2] : incorrect number of dimensions
And this is my R script.
library(e1071)
library(rvest)
library(dplyr)
#data
train_data <- read.csv('/Users/jonnyy/Desktop/IS/S2/IS688/project/train.csv')
test_data <- read.csv('/Users/jonnyy/Desktop/IS/S2/IS688/project/test.csv')
#construct naiveBayes model
efit <- naiveBayes(y~job+marital+education+default+contact+month+day_of_week+
poutcome+age+pdays+previous+cons.price.idx+cons.conf.idx+euribor3m
,train_data)
#using predict function on test data to classified prediction
pre <- predict(efit, test_data, type = "raw") %>%
as.data.frame() %>%
mutate(prediction = if_else(0 < 1, 0, 1)) %>%
pull(prediction)
#predict matrix and accuracy
bayes_table <- table(pre, test_data$y)
accuracy_test_bayes <- sum(diag(bayes_table)) / sum(bayes_table)
list('predict matrix' = bayes_table, 'accuracy' = accuracy_test_bayes)
#ROCR
library(ROCR)
pred<-prediction(predictions =pre[,2],labels =test_data$y)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf,main="ROC curve",col="blue",lwd=3)
abline(a=0,b=1,lwd=2,lty=2)
The thing is, my predict matrix is 2x1, so it seems to have only one dimension? But when I change pre[,2] to pre[,1], it still doesn't work.
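A likely explanation for the error: pull(prediction) returns a plain vector, and a vector has no second dimension to index, so both pre[,2] and pre[,1] fail. An ROC curve also needs a continuous score rather than hard 0/1 labels. Here is a minimal base R sketch; raw_probs is a made-up stand-in for predict(efit, test_data, type = "raw"), assuming columns named "0" and "1":

```r
# Hypothetical stand-in for predict(efit, test_data, type = "raw")
raw_probs <- matrix(c(0.9, 0.1,
                      0.3, 0.7,
                      0.6, 0.4),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(NULL, c("0", "1")))

# Hard labels collapse the matrix to a plain vector ...
pre <- ifelse(raw_probs[, "1"] > raw_probs[, "0"], 1, 0)

# ... so two-index subsetting fails; capture the message instead of stopping
msg <- tryCatch(pre[, 2], error = function(e) conditionMessage(e))
print(msg)

# Keep the class-"1" probability from the raw matrix as the ROC score
score <- raw_probs[, "1"]
print(score)
```

With such a score vector, the ROCR calls would be prediction(predictions = score, labels = test_data$y) followed by performance(pred, measure = "tpr", x.measure = "fpr").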

Error using glm in R

I am trying to apply a simple binomial logistic regression to JSON data I downloaded from Kaggle:
https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data
I changed the values of the interest_level column to 1 if the value is "high" and 0 otherwise.
This is my first time using glm, so any help is welcome.
library(rjson)
library(dplyr)
library(purrr)
library(nnet)
json.data <- fromJSON(file = "train.json")
json.data = as.data.frame(t(do.call(rbind, json.data)))
#head(json.data)
#colnames(json.data)
x <- json.data$interest_level
for (i in 1:length(x)){
if (json.data$interest_level[i] =="high"){
json.data$interest_level[i] <- 1
}else {json.data$interest_level[i] <- 0}
}
indexes = sample(1:nrow(json.data), size=0.5*nrow(json.data))
train.data <- json.data[indexes,]
test.data <- json.data[-indexes,]
model <- glm(train.data~interest_level,family=binomial(link='logit'),data=train.data)
I'm getting this error message:
Error in model.frame.default(formula = train.data ~ interest_level, data = train.data, : invalid type (list) for variable 'train.data'
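The error message points at the formula: in glm() the response goes on the left of ~ and the predictors on the right, and each has to be an ordinary column, not a list or a whole data frame. Below is a minimal sketch with simulated data. The columns price, bedrooms and level and all the values are made up for illustration, not read from the Kaggle file:

```r
set.seed(1)

# Simulated stand-in for the listings data (hypothetical columns and values)
toy <- data.frame(
  price    = runif(200, 1000, 5000),
  bedrooms = sample(0:4, 200, replace = TRUE),
  level    = sample(c("low", "medium", "high"), 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Vectorised recode: 1 for "high", 0 otherwise (no loop needed)
toy$interest_level <- as.integer(toy$level == "high")

# Response on the left, predictors on the right
model <- glm(interest_level ~ price + bedrooms,
             family = binomial(link = "logit"),
             data = toy)
print(coef(model))
```

If the columns come out of do.call(rbind, ...) as lists, something like unlist(json.data$interest_level) may be needed before modelling. I have not checked this against the actual file.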

formula error inside function

I want to use survfit() and basehaz() inside a function, but they do not work. Could you take a look at this problem? Thanks for your help. The following code leads to the error:
library(survival)
n <- 50 # total sample size
nclust <- 5 # number of clusters
clusters <- rep(1:nclust,each=n/nclust)
beta0 <- c(1,2)
set.seed(13)
#generate phmm data set
Z <- cbind(Z1=sample(0:1,n,replace=TRUE),
Z2=sample(0:1,n,replace=TRUE),
Z3=sample(0:1,n,replace=TRUE))
b <- cbind(rep(rnorm(nclust),each=n/nclust),rep(rnorm(nclust),each=n/nclust))
Wb <- matrix(0,n,2)
for( j in 1:2) Wb[,j] <- Z[,j]*b[,j]
Wb <- apply(Wb,1,sum)
T <- -log(runif(n,0,1))*exp(-Z[,c('Z1','Z2')]%*%beta0-Wb)
C <- runif(n,0,1)
time <- ifelse(T<C,T,C)
event <- ifelse(T<=C,1,0)
mean(event)
phmmd <- data.frame(Z)
phmmd$cluster <- clusters
phmmd$time <- time
phmmd$event <- event
fmla <- as.formula("Surv(time, event) ~ Z1 + Z2")
BaseFun <- function(x){
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
}
BaseFun(fmla)
Error in formula.default(object, env = baseenv()) : invalid formula
But the following code, run outside a function, works:
fit <- coxph(fmla, phmmd)
basehaz(fit)
It is a problem of scoping.
Notice that the environment of basehaz is:
environment(basehaz)
<environment: namespace:survival>
meanwhile:
environment(BaseFun)
<environment: R_GlobalEnv>
That is why basehaz cannot find the local variable x defined inside BaseFun.
A possible solution is to send x to the top using assign:
BaseFun <- function(x){
assign('x',x,pos=.GlobalEnv)
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
rm(x)
}
BaseFun(fmla)
Other solutions may involve dealing with the environments more directly.
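One such alternative, sketched under assumptions: build the call with do.call(), so that the stored call contains the formula and data themselves instead of the local names, and nothing has to be written to the global environment. The helper base_haz_from and the toy data are hypothetical, and I have only reasoned this through for a plain coxph fit like the one in the question:

```r
library(survival)

# Hypothetical helper: fit a Cox model inside a function and return the
# baseline cumulative hazard without touching the global environment.
base_haz_from <- function(fmla, dat) {
  # do.call() evaluates its arguments first, so the fitted object's call
  # holds the formula and data values rather than the names fmla and dat.
  fit <- do.call(coxph, list(formula = fmla, data = dat, model = TRUE))
  basehaz(fit)
}

# Made-up survival data for illustration
set.seed(13)
toy <- data.frame(time  = rexp(50),
                  event = rbinom(50, 1, 0.7),
                  Z1    = rbinom(50, 1, 0.5),
                  Z2    = rbinom(50, 1, 0.5))

bh <- base_haz_from(Surv(time, event) ~ Z1 + Z2, toy)
print(head(bh))
```

The trade-off is that the stored call carries a full copy of the data, which makes printed calls long. For large data sets the assign() approach or model = TRUE alone may be preferable.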
I'm following up on @moli's comment to @aatrujillob's answer. They were helpful, so I thought I would explain how it solved things for me and a similar problem with the rpart and partykit packages.
Some toy data:
N <- 200
data <- data.frame(X = rnorm(N),W = rbinom(N,1,0.5))
data <- within( data, expr = {
trtprob <- 0.4 + 0.08*X + 0.2*W -0.05*X*W
Trt <- rbinom(N, 1, trtprob)
outprob <- 0.55 + 0.03*X -0.1*W - 0.3*Trt
Outcome <- rbinom(N,1,outprob)
rm(outprob, trtprob)
})
I want to split the data to training (train_data) and testing sets, and train the classification tree on train_data.
Here's the formula I want to use, and the issue with the following example. When I define this formula, the train_data object does not yet exist.
my_formula <- Trt~W+X
exists("train_data")
# [1] FALSE
exists("train_data", envir = environment(my_formula))
# [1] FALSE
Here's my function, which is similar to the original function. Again,
badFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
Trying to run this function throws an error similar to OP's.
library(rpart)
library(partykit)
bad_out <- badFunc(data=data, my_formula = my_formula)
# Error in is.data.frame(data) : object 'train_data' not found
# 10. is.data.frame(data)
# 9. model.frame.default(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 8. stats::model.frame(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 7. eval(expr, envir, enclos)
# 6. eval(mf, env)
# 5. model.frame.rpart(obj)
# 4. model.frame(obj)
# 3. as.party.rpart(ct_train)
# 2. partykit::as.party(ct_train)
# 1. badFunc(data = data, my_formula = my_formula)
print(bad_out)
# Error in print(bad_out) : object 'bad_out' not found
Luckily, rpart() is like coxph() in that you can specify the argument model=TRUE to solve these issues. Here it is again, with that extra argument.
goodFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
## This solved it for me
model=TRUE,
##
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
good_out <- goodFunc(data=data, my_formula = my_formula)
print(good_out)
# Model formula:
# Trt ~ W + X
#
# Fitted party:
# [1] root
# | [2] X >= 1.59791: 0.143 (n = 7, err = 0.9)
##### etc
Documentation for the model argument in rpart():
model:
if logical: keep a copy of the model frame in the result? If
the input value for model is a model frame (likely from an earlier
call to the rpart function), then this frame is used rather than
constructing new data.
Formulas can be tricky as they use lexical scoping and environments in a way that is not always natural (to me). Thank goodness Terry Therneau has made our lives easier with model=TRUE in these two packages!
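To make the point about environments concrete, here is a minimal base R sketch (the function make_formula and its variables are invented for illustration). A formula remembers the environment in which it was created, and model.frame() looks variables up there when no data argument is given:

```r
# Hypothetical example: the formula is created inside a function,
# so it carries that function's environment with it.
make_formula <- function() {
  x <- c(1, 2, 3)
  y <- c(2, 4, 6)
  y ~ x
}

f <- make_formula()

# x and y live only in the environment attached to the formula
found_x <- exists("x", envir = environment(f), inherits = FALSE)

# With no data argument, model.frame() searches environment(f)
mf <- model.frame(f)
print(mf)
```

The reverse situation is what bites in the examples above: the formula was created at top level, so a data set that exists only inside a function is not visible from the formula's environment.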
