Error using glm in R - r

I am trying to fit a simple binomial logistic regression to JSON data I downloaded from Kaggle:
https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data
I recoded the interest_level column to 1 if the value is "high" and 0 otherwise.
This is my first time using glm so any help is welcome.
library(rjson)
library(dplyr)
library(purrr)
library(nnet)
json.data <- fromJSON(file = "train.json")
json.data = as.data.frame(t(do.call(rbind, json.data)))
#head(json.data)
#colnames(json.data)
x <- json.data$interest_level
for (i in 1:length(x)){
if (json.data$interest_level[i] =="high"){
json.data$interest_level[i] <- 1
}else {json.data$interest_level[i] <- 0}
}
indexes = sample(1:nrow(json.data), size=0.5*nrow(json.data))
train.data <- json.data[indexes,]
test.data <- json.data[-indexes,]
model <- glm(train.data~interest_level,family=binomial(link='logit'),data=train.data)
I'm getting this error message:
Error in model.frame.default(formula = train.data ~ interest_level, data = train.data, : invalid type (list) for variable 'train.data'
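Judging from the error text, two things are likely going on: the left-hand side of the formula is the whole data frame train.data rather than the response column, and the columns built with do.call(rbind, ...) from nested lists are themselves lists. Below is a minimal sketch under those assumptions. The data frame is made up, and price and bedrooms are stand-in predictors, not necessarily the ones you want:

```r
set.seed(1)
n <- 200
# Hypothetical stand-in for the listing data
toy <- data.frame(
  price    = runif(n, 1000, 6000),
  bedrooms = sample(0:4, n, replace = TRUE)
)
# A list column, similar to what do.call(rbind, <list of lists>) produces
toy$interest_level <- as.list(sample(c("low", "medium", "high"), n, replace = TRUE))

# Flatten the list column to a plain vector, then recode without a loop
toy$interest_level <- unlist(toy$interest_level)
toy$interest_level <- ifelse(toy$interest_level == "high", 1, 0)

# The response goes on the left of ~, the predictors on the right
model <- glm(interest_level ~ price + bedrooms,
             family = binomial(link = "logit"), data = toy)
summary(model)
```

With the real data, each column used in the formula would need the same unlist() treatment before the train/test split.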

Related

Error while running randomForest in R: "Error in y - ymean : non-numeric argument to binary operator"

birth <- import("smoker_data1.xlsx")
## Splitting the dataset in test and train datasets
mysplit <- sample.split(birth, SplitRatio = 0.65)
train <- subset(birth, mysplit == T)
test <- subset(birth, mysplit == F)
## Build Random Forest model on the training set
mod1 <- randomForest(smoke~., train)
Error message: Error in y - ymean : non-numeric argument to binary operator
First check the data type of the smoke variable. randomForest fits a regression whenever the response is not a factor, and a character response then fails at y - ymean.
If smoke is a character column, convert it with as.factor().
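To make that check concrete, here is a small sketch on a made-up data frame (the values are invented; only the column name smoke comes from the question). A character response, which Excel imports often produce, has to become a factor before a classifier is fit:

```r
# Hypothetical toy data; the real file will have more columns
toy <- data.frame(
  smoke   = c("yes", "no", "no", "yes"),
  baby_wt = c(3.1, 3.6, 3.4, 2.9),
  stringsAsFactors = FALSE
)

class(toy$smoke)                    # "character"
toy$smoke <- as.factor(toy$smoke)   # convert the response
class(toy$smoke)                    # "factor"
levels(toy$smoke)                   # "no" "yes"
str(toy)                            # inspect every column type at once
```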
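library(readxl)
library(caTools) # provides sample.split()
library(randomForest)
birth <- read_excel("smoker_data1.xlsx")
## Convert the response to a factor so randomForest fits a classifier
birth$smoke <- as.factor(birth$smoke)
## Splitting the dataset in test and train datasets
## (sample.split expects the vector of labels, not the whole data frame)
mysplit <- sample.split(birth$smoke, SplitRatio = 0.65)
train <- subset(birth, mysplit == TRUE)
test <- subset(birth, mysplit == FALSE)
## Build Random Forest model on the training set
mod1 <- randomForest(smoke ~ ., data = train)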
I tried this with the data you gave. You just need to set each column's type correctly before calling randomForest.
data1$baby_wt <- as.numeric(data1$baby_wt)
data1$income <- as.factor(data1$income)
data1$mother_a <- as.numeric(data1$mother_a)
data1$smoke <- as.factor(data1$smoke)
data1$gestation <- as.numeric(data1$gestation)
data1$mother_wt <- as.numeric(data1$mother_wt)
library(caret)
library(randomForest)
predictors <- names(data1)[!names(data1) %in% "smoke"]
inTrainingSet <- createDataPartition(data1$smoke, p = 0.7, list = FALSE)
train <- data1[inTrainingSet, ]
test <- data1[-inTrainingSet, ]
# mtry is based on the number of predictors (the original referenced an undefined x)
m.rf <- randomForest(smoke ~ ., data = train, mtry = floor(sqrt(length(predictors))), ntree = 5000,
importance = TRUE, proximity = TRUE)
m.rf
#############################################
# Test Performance
#############################################
# Select predictors by name rather than by column position
m.pred <- predict(m.rf, test[, predictors], type = "response")
m.table <- table(m.pred, test$smoke)
confusionMatrix(m.table)

I am trying to run XGBoost in R but am facing some issues

I have a dataset of 25 variables and 248 rows.
There are 8 factor variables and the rest are integer or numeric.
I am trying to run XGBoost.
I have written the following code:
# Partition Data
set.seed(1234)
ind <- sample(2, nrow(mission), replace = T, prob = c(0.7,0.3))
train <- mission[ind == 1,]
test <- mission[ind == 2,]
# Create matrix - One-Hot Encoding for Factor variables
trainm <- sparse.model.matrix(GRL ~ .-1, data = train)
head(trainm)
train_label <- train[,"GRL"]
train_matrix <- xgb.DMatrix(data = as.matrix(trainm), label = train_label)
testm <- sparse.model.matrix(GRL~.-1, data = test)
test_label <- test[,"GRL"]
test_matrix <- xgb.DMatrix(data = as.matrix(testm),label = test_label)
The response variable here is "GRL", and I extract it with test_label <- test[,"GRL"].
That line runs, but when I pass the result to xgb.DMatrix I get the following error:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
I have partitioned the data into 70:30.
Here test[,"GRL"] returns a data.frame (as it does when test is a tibble), and XGBoost needs the label to be a vector.
Just use test$GRL or test[["GRL"]] instead. You also need to do the same for the training dataset.
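A short base-R sketch of why the lengths disagree. The data frame is made up, and drop = FALSE is used only to imitate the one-column data frame that a tibble would return. Presumably the label length is compared with the number of rows, and a one-column data frame has length 1:

```r
# Hypothetical stand-in for the test partition
test <- data.frame(GRL = c(0, 1, 1, 0, 1), x = 1:5)

label_df  <- test[, "GRL", drop = FALSE]  # still a data.frame
label_vec <- test[["GRL"]]                # plain numeric vector

length(label_df)   # 1: length of a data.frame is its number of columns
length(label_vec)  # 5: one label per row
nrow(test)         # 5
```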

Logistic Regression Error: 'x' is NULL so the result will be NULL; Error in ans[test & ok]

I'm new to R and I'm trying to run a logistic regression model. I've created a cross-validation function and a regular model using glm. The regular model runs fine, but the function throws an error.
I've tried passing the formula and naming the Y variable, but this fails:
er_log=mycv.logistic(data = train_data, glmfit=payment~., yname="payment", K=3, seed=123)
Error in terms.formula(formula, data = data) : argument is not a valid model
I've also tried using the glm model that originally worked in the function but this gives me a different error
glmfit1=glm(payment~., data=train_data, family=binomial)
er_log=mycv.logistic(data = train_data, glmfit=glmfit1, yname="payment", K=3, seed=123)
Error in ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok] : replacement has length zero
This is the function I'm trying to use.
mycv.logistic <-
function (data, glmfit, yname, K, seed=1) {
  n <- nrow(data)
  set.seed(seed)
  datay=data[,yname] #response variable
  #partition the data into K subsets
  f <- ceiling(n/K)
  s <- sample(rep(1:K, f), n)
  CV=NULL; O.P=NULL
  for (i in 1:K) { #i=1
    j.out <- seq_len(n)[(s == i)] #test data
    j.in <- seq_len(n)[(s != i)] #training data
    #model with training data
    log.fit=glm(glmfit$call, data=data[j.in,],family = 'binomial')
    #observed test set y
    testy <- datay[j.out]
    #predicted test set y
    log.predy=predict(log.fit, data[j.out,],type='response')
    le=levels(datay)
    class.p = ifelse(log.predy > 0.5,le[2], le[1] )
    #observed - predicted on test data
    error= mean(testy!=class.p)
    ovsp <- cbind(pred=class.p,obs=testy) #pred vs obs vector
    CV <- c(CV,error)
    O.P <- rbind(O.P,ovsp)
    #error rates
  }
  #Output
  list(call = glmfit$call, K = K,
       error = mean(CV), ConfusianMatrix=table(O.P[,1],O.P[,2]),
       seed = seed)
}
I expect this to output the confusion matrix for the training data so I can ultimately use the model on my testing data.
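Just figured it out. My response variable held 0 and 1 as numeric values and needed to be converted to a factor. With a numeric response, levels(datay) is NULL, so le[1] and le[2] are NULL and ifelse() fails with "replacement has length zero".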
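The difference can be seen on a few made-up values. This is only a sketch of the relevant lines of the function, with y_num, y_fac and pred as hypothetical names:

```r
y_num <- c(0, 1, 1, 0)           # numeric response, as in the failing case
levels(y_num)                    # NULL: numeric vectors have no levels

y_fac <- factor(y_num)           # the fix: convert the response to a factor
le <- levels(y_fac)              # "0" "1"

pred <- c(0.2, 0.7, 0.9, 0.4)    # invented predicted probabilities
class_p <- ifelse(pred > 0.5, le[2], le[1])
class_p                          # "0" "1" "1" "0"

error <- mean(y_fac != class_p)  # misclassification rate on these values
error                            # 0
```

In the question's setup the conversion would be applied to train_data$payment before calling glm and mycv.logistic.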

Get an error: incorrect number of dimensions when I'm using R

When I use ROCR to evaluate a naive Bayes model, I get the error "incorrect number of dimensions" and I don't know how to debug it. Here is my error log.
$`predict matrix`
pre 0 1
0 7282 956
$accuracy
[1] 0.8839524
> #ROCR
> library(ROCR)
> pred<-prediction(predictions =pre[,2],labels =test_data$y)
Error in pre[, 2] : incorrect number of dimensions
And this is my R script.
library(e1071)
library(rvest)
library(dplyr)
#data
train_data <- read.csv('/Users/jonnyy/Desktop/IS/S2/IS688/project/train.csv')
test_data <- read.csv('/Users/jonnyy/Desktop/IS/S2/IS688/project/test.csv')
#construct naiveBayes model
efit <- naiveBayes(y~job+marital+education+default+contact+month+day_of_week+
poutcome+age+pdays+previous+cons.price.idx+cons.conf.idx+euribor3m
,train_data)
#using predict function on test data to classified prediction
pre <- predict(efit, test_data, type = "raw") %>%
as.data.frame() %>%
mutate(prediction = if_else(0 < 1, 0, 1)) %>%
pull(prediction)
#predict matrix and accuracy
bayes_table <- table(pre, test_data$y)
accuracy_test_bayes <- sum(diag(bayes_table)) / sum(bayes_table)
list('predict matrix' = bayes_table, 'accuracy' = accuracy_test_bayes)
#ROCR
library(ROCR)
pred<-prediction(predictions =pre[,2],labels =test_data$y)
perf<-performance(pred,measure ="tpr",x.measure="fpr")
plot(perf,main="ROC curve",col="blue",lwd=3)
abline(a=0,b=1,lwd=2,lty=2)
The thing is, my predict matrix is 2x1, so it seems to have only one dimension. But when I change pre[,2] to pre[,1], it still doesn't work.
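One reading of the script: after pull(prediction), pre is a plain vector of predicted classes, so it has no columns and pre[, k] fails for any k. ROCR's prediction() works on scores, so a possible approach is to keep the matrix returned by predict(..., type = "raw") in its own object and pass its second column. The sketch below simulates that matrix; prob, p1 and labels are hypothetical names, and the ROCR part only runs if the package is installed:

```r
set.seed(1)
n <- 100
labels <- rbinom(n, 1, 0.3)                # invented 0/1 outcomes

# Simulated stand-in for predict(efit, test_data, type = "raw")
p1 <- plogis(2 * labels - 1 + rnorm(n))
prob <- cbind(`0` = 1 - p1, `1` = p1)      # n x 2 matrix of class probabilities

pre <- ifelse(prob[, "1"] > 0.5, 1, 0)     # predicted classes: a plain vector
is.null(dim(pre))                          # TRUE, so pre[, 2] cannot work
dim(prob)                                  # 100 2

if (requireNamespace("ROCR", quietly = TRUE)) {
  pred <- ROCR::prediction(predictions = prob[, "1"], labels = labels)
  perf <- ROCR::performance(pred, measure = "tpr", x.measure = "fpr")
  plot(perf, main = "ROC curve", col = "blue", lwd = 3)
  abline(a = 0, b = 1, lwd = 2, lty = 2)
}
```

The class vector pre is still the right input for the confusion table; only the ROC step needs the probabilities.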

R error: all arguments must have the same length

I got an error when I'm doing naive Bayes by R, here's my code and error
library(e1071)
#data
train_data <- read.csv('https://raw.githubusercontent.com/JonnyyJ/data/master/train.csv',header=T)
test_data <- read.csv('https://raw.githubusercontent.com/JonnyyJ/data/master/test.csv',header=T)
efit <- naiveBayes(y~job+marital+education+default+contact+month+day_of_week+
poutcome+age+pdays+previous+cons.price.idx+cons.conf.idx+euribor3m
,train_data)
pre <- predict(efit, test_data)
bayes_table <- table(pre, test_data[,ncol(test_data)])
accuracy_test_bayes <- sum(diag(bayes_table))/sum(bayes_table)
list('predict matrix'=bayes_table, 'accuracy'=accuracy_test_bayes)
ERROR:
bayes_table <- table(pre, test_data[,ncol(test_data)])
Error in table(pre, test_data[, ncol(test_data)]) :
all arguments must have the same length
accuracy_test_bayes <- sum(diag(bayes_table))/sum(bayes_table)
Error in diag(bayes_table) : object 'bayes_table' not found
list('predict matrix'=bayes_table, 'accuracy'=accuracy_test_bayes)
Error: object 'bayes_table' not found
I really don't understand what's going on, because I'm new to R.
For some reason, the default predict(efit, test_data, type = "class") doesn't work in this case (probably because your model predicts 0 for all observations in the test dataset). You also need to build the table from your outcome column: test_data[,ncol(test_data)] returns euribor3m, not y. The following should work:
library(dplyr) # for %>%, mutate(), if_else() and pull()
pre <- predict(efit, test_data, type = "raw") %>%
as.data.frame() %>%
mutate(prediction = if_else(0 < 1, 0, 1)) %>%
pull(prediction)
bayes_table <- table(pre, test_data$y)
accuracy_test_bayes <- sum(diag(bayes_table)) / sum(bayes_table)
list('predict matrix' = bayes_table, 'accuracy' = accuracy_test_bayes)
# $`predict matrix`
#
# pre 0 1
# 0 7282 956
#
# $accuracy
# [1] 0.8839524
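As a general check, table() raises "all arguments must have the same length" whenever its inputs differ in length, so comparing length(pre) with nrow(test_data) before building the table narrows things down. A sketch with made-up vectors; the empty prediction vector is only one possible cause, assumed here for illustration:

```r
# Hypothetical inputs: an empty prediction vector and four observed labels
pre <- factor(character(0), levels = c("0", "1"))
obs <- c(0, 1, 0, 0)

# Reproduce the error and capture its message instead of stopping
msg <- tryCatch(table(pre, obs), error = function(e) conditionMessage(e))
msg

# Guard to run before building the confusion table
same_length <- length(pre) == length(obs)
same_length
```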
