Logistic regression - eval(family$initialize) : y values must be 0 <= y <= 1 - r

I am trying to perform logistic regression in R on the dataset provided here: http://archive.ics.uci.edu/ml/machine-learning-databases/00451/
It is about breast cancer. The dataset contains a column Classification that takes only the values 1 (patient doesn't have cancer) or 2 (patient has cancer).
library(ISLR)
dataCancer <- read.csv("~/Desktop/Isep/Machine Leaning/TD/Project_Cancer/dataR2.csv")
attach(dataCancer)
#Step : Split data into training and testing data
training = (BMI>25)
testing = !training
training_data = dataCancer[training,]
testing_data = dataCancer[testing,]
Classification_testing = Classification[testing]
#Step : Fit a logistic regression model using training data
classification_model = glm(Classification ~ ., data =
training_data,family = binomial )
When running my script I get :
> classification_model = glm(Classification ~ ., data = training_data,family = binomial )
Error in eval(family$initialize) : y values must be 0 <= y <= 1
> summary(classification_model)
Error in summary(classification_model) : object 'classification_model' not found .
I added as.factor(dataCancer$Classification) as seen in other posts but it has not solved my problem.
Can you suggest a way to get Classification values between 0 and 1 when the column contains only 1 and 2?

Adding as.factor(dataCancer$Classification) to the script does not change the dataset: even with dataCancer attached, that command does not convert the Classification column into a factor. It only returns a factor, which is printed to the console and then discarded.
Since you want to fit the model on the training dataset, you either specify
training_data$Classification <- as.factor(training_data$Classification)
classification_model <- glm(Classification ~ ., data =
training_data, family = binomial)
or call as.factor directly in the glm formula
classification_model <- glm(as.factor(Classification) ~ ., data =
training_data, family = binomial)
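A minimal sketch of the assignment point, using a made-up data frame (toy and its values are hypothetical, not the cancer data): calling as.factor() without assigning the result leaves the column numeric.

```r
# Hypothetical toy data standing in for dataCancer
toy <- data.frame(Classification = c(1, 2, 2, 1, 2, 1),
                  BMI = c(22.1, 30.4, 27.8, 24.0, 31.2, 23.5))

# Without assignment, as.factor() only returns a factor; toy is unchanged
invisible(as.factor(toy$Classification))
class_before <- class(toy$Classification)

# Assigning the result back is what actually changes the column
toy$Classification <- as.factor(toy$Classification)
class_after <- class(toy$Classification)

print(c(before = class_before, after = class_after))
```

The same applies to training_data: the conversion has to be assigned to the data frame that is passed to glm.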

classification_model = glm(Classification ~ ., data = training_data,family = binomial )
Error in eval(family$initialize) : y values must be 0 <= y <= 1
This is because Classification holds numeric values, not a factor. Make sure you assigned the result back to the column:
dataCancer$Classification <- as.factor(dataCancer$Classification)
Ideally, 1,0 or 1,2 will not matter as long as it's a factor. But if the above still doesn't help, you can try converting 1,2 to 1,0 and then running the same code.
The second error occurs simply because the glm call failed, so the object classification_model was never created.
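To illustrate why the coding does not matter once the response is a factor, here is a small sketch on simulated data (x, y12 and d are made-up names, not columns of the cancer dataset). With a factor response, glm treats the first level as failure and the other level as success, so a 1/2 factor and an explicit 0/1 recode should give the same fit.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
# Hypothetical response coded 1/2, like Classification
y12 <- rbinom(n, size = 1, prob = plogis(0.5 * x)) + 1
d <- data.frame(x = x, y12 = y12)

# Numeric 1/2 response: the binomial family rejects it
err <- tryCatch(glm(y12 ~ x, data = d, family = binomial),
                error = function(e) conditionMessage(e))

# Factor response: level "1" is failure, level "2" is success
fit_factor <- glm(factor(y12) ~ x, data = d, family = binomial)

# Explicit 0/1 recode of the same response
fit_01 <- glm(I(y12 - 1) ~ x, data = d, family = binomial)

# Compare the coefficients of the two fits
same <- all.equal(unname(coef(fit_factor)), unname(coef(fit_01)))
print(err)
print(same)
```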

You need to recode the dependent variable as 0/1, for example with recode from the car package:
library(car)
dataCancer$Classification <- recode(dataCancer$Classification, "1=0; 2=1")
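If you would rather not load car, the same recoding can be sketched in base R. The vector below is a hypothetical stand-in for dataCancer$Classification.

```r
# Hypothetical vector coded like Classification (1 = no cancer, 2 = cancer)
Classification <- c(1, 2, 2, 1, 2)

# Two base-R ways to map 1 -> 0 and 2 -> 1
recoded_a <- Classification - 1
recoded_b <- ifelse(Classification == 2, 1, 0)

print(recoded_a)
print(identical(recoded_a, recoded_b))
```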

Related

Is there an alternative way to fit Poisson model in R besides using glm?

I get an error when I try to fit a Poisson model to a given dataset in R, and I am struggling to understand the cause of the error.
library(COUNT) # Titanic dataset
data("titanic")
library(tidyverse)
# Number of missing values
titanic %>%
map_int(~sum(is.na(.)))
# Fit the Poisson regression model
poifit <- glm(survived ~ class, family = poisson, data = titanic)
titanic2 <- titanic %>%
mutate(across(.cols = everything(), ~as.factor(.)))
poifit2 <- glm(survived ~ class, family = poisson, data = titanic2)
I get the error:
Warning in Ops.factor(y, 0) : ‘<’ not meaningful for factors
Error in if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family") :
missing value where TRUE/FALSE needed
You might be confused. You can't fit a Poisson to a categorical response. You could fit a Poisson to binary data after converting survival "yes"/"no" to 0/1, but it doesn't really make sense:
glm(as.numeric(survived=="no") ~ class, family = poisson, data = titanic)
The sensible thing to do (probably) is to cross-tabulate and model the resulting counts, e.g.
cc <- as.data.frame(table(titanic))
glm(Freq ~ ., data = cc, family = poisson)
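A self-contained sketch of the cross-tabulation step, using a tiny made-up passenger table instead of the COUNT titanic data (the passengers data frame and its values are assumptions for illustration only). It shows what as.data.frame(table(...)) produces and that the Poisson model is then fitted to the cell counts in Freq.

```r
# Hypothetical passenger-level data; column names mirror the question
passengers <- data.frame(
  survived = c("yes", "no", "no", "yes", "no", "no", "yes", "no"),
  class    = c("1st", "1st", "2nd", "2nd", "3rd", "3rd", "1st", "3rd")
)

# One row per survived x class cell, with the cell count in Freq
cc <- as.data.frame(table(passengers))
print(cc)

# Poisson model for the counts (main effects only)
fit <- glm(Freq ~ survived + class, data = cc, family = poisson)
print(coef(fit))
```

Here the response is a non-negative count, which is what the poisson family expects.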

Issues with logit regression in r

I am trying to run a logit regression and I tried two approaches:
m.logit <- glm(p4 ~ scale(log(gdp,orthodox,swb)),
data = happiness,
family = binomial("logit"))
summary(m.logit)
Throws: Error in summary(m.logit) : object 'm.logit' not found
While
m1.logit <- glm(p4 ~ gdp + orthodox + swb, family = binomial(link = "logit"), data = happiness)
Throws: Error in eval(family$initialize) : y values must be 0 <= y <= 1
I kind of understood the errors (in the former case m.logit is not found, and in the latter, I need to transform the variables I think...) but don't know how to solve it...
Any help?

Error in model.frame.default(formula = mldata.knn$Gender ~ ., data = mldata_train, : variable lengths differ (found for 'Age')

set.seed(400)
random <- createDataPartition(mldata.knn$Transport, p=0.70,list = F)
mldata_train <- mldata.knn[random,]
mldata_test <- mldata.knn[-random,]
print(table(mldata.knn$Transport))
print(table(mldata_train$Transport))
library(e1071)
NB_model = naiveBayes(mldata.knn$Gender ~., data = mldata_train)
print(NB_model)
Error in model.frame.default(formula = mldata.knn$Gender ~ ., data = mldata_train, : variable lengths differ (found for 'Age')
The lengths of the data are different when making the Naive Bayes classifier for the data. Whatever the length of your mldata.knn is, your training data set mldata_train will contain 70% of the data based on your 70/30 split from createDataPartition.
So, in the statement:
NB_model = naiveBayes(mldata.knn$Gender ~., data = mldata_train)
You have mldata.knn$Gender which has the length of the original data set (mldata.knn), and also whatever variables were in there from dot (.) in the formula - but these variables are taken from mldata_train as the statement includes data = mldata_train. These other variables would have a different length (only 70% of original data).
Perhaps you intend to just use the training data to make your NB classifier:
NB_model = naiveBayes(Gender ~., data = mldata_train)
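The length mismatch can be reproduced without e1071 or caret. The sketch below uses base R's model.frame (the function named in the error message) on a made-up data frame; full, train and the plain sample() split are assumptions standing in for mldata.knn, mldata_train and createDataPartition.

```r
set.seed(400)
# Hypothetical stand-in for mldata.knn
full <- data.frame(Age = round(rnorm(100, mean = 40, sd = 10)),
                   Gender = factor(sample(c("F", "M"), 100, replace = TRUE)))

# Simple 70/30 split (not createDataPartition)
idx <- sample(nrow(full), 70)
train <- full[idx, ]

# Response taken from the full data (100 rows), predictors from train (70 rows)
bad <- tryCatch(model.frame(full$Gender ~ ., data = train),
                error = function(e) conditionMessage(e))

# Response and predictors both taken from train
good <- model.frame(Gender ~ ., data = train)

print(bad)
print(dim(good))
```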

Multilevel moderated mediation with continuous variables

I am a beginner in R, so please forgive me if my question reflects insufficient background.
I am trying to run a moderated mediation model using the mediation and lme4 libraries.
All of my variables are continuous. My data have a nested structure with individuals nested in branches (Branch).
In the model I'm trying to test, my predictor/independent variable (abranch) is at the branch level. My mediator (bmed) and outcome (cout) are at the individual level. And the effect of the mediator is moderated by another individual level variable (dmod). So in my model I have abranch predicting bmed, and bmed*dmod are predicting cout.
This is the syntax I've used:
med.fit <- glmer(
bmed ~ abranch + (1|Branch),
family = binomial(link = "logit"),
data = Dataset
)
out.fit <- glmer(
cout ~ dmod*bmed + (1+bmed|Branch),
family = binomial(link = "logit"),
data = Dataset
)
I was then thinking of using:
med.out <- mediate(med.fit, out.fit, treat = "abranch", mediator = "bmed",
                   sims = 100)
summary(med.out)
But even before getting to the last two lines, I get the following error:
Error in eval(family$initialize, rho) : y values must be 0 <= y <= 1
I now realize that this is because I'm using the "binomial"/logit family whereas my DV is continuous and not between 0 and 1. What can I do, given the nature of my variables?
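The family mismatch described above can be seen in a small base-R sketch with simulated data (d, x and y are made-up names, and plain glm is used here instead of glmer so the example needs no extra packages). A continuous outcome is rejected by the binomial family but accepted by the gaussian family.

```r
set.seed(42)
# Hypothetical continuous outcome and predictor
d <- data.frame(x = rnorm(50))
d$y <- 3 + 2 * d$x + rnorm(50)

# Binomial family: fails because y is not within [0, 1]
binom_err <- tryCatch(glm(y ~ x, data = d, family = binomial),
                      error = function(e) conditionMessage(e))

# Gaussian family: appropriate for a continuous outcome
gauss_fit <- glm(y ~ x, data = d, family = gaussian)

print(binom_err)
print(coef(gauss_fit))
```

In the mixed-model setting, the Gaussian counterpart of glmer() in lme4 is lmer(), which takes the same random-effects formula syntax without a family argument. Whether that is the right specification for this mediation analysis is a separate modelling question.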

Adding a vector of dummy variables in logistic regression

I am currently trying to conduct logistic regression where one of the variables is a vector of 32 dummy variables. Each dummy represents a type of crime. For example:
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "NARCOTICS", 1,0)
Then the vector is created:
crime.type <- c(narcotics, theft, other.offense, burglary, motor.vehicle.theft, battery, robbery, assault, criminal.damage, deceptive.practice, kidnapping, etc.)
The logistic model is as follows:
logit.mod.train <- lm(street1 ~ BEAT+WARD+X.COORDINATE+Y.COORDINATE+LATITUDE+LONGITUDE+crime.type, data = train, family = "binomial")
It's important to note that street1 is actually a dummy variable for the location of the crime being on the street. So the column is LOCATION.DESCRIPTION and the element is street.
street1 <- ifelse(train$LOCATION.DESCRIPTION == "STREET", 1,0).
It yields this error:
Error in model.frame.default(formula = street1 ~ BEAT + WARD + X.COORDINATE + :
variable lengths differ (found for 'crime.type')
I thought this would work because they are derived from the same data set and the dummies represent each unique element of one of the columns. When I input each dummy variable separately it's successful but I want to condense the regression and make it more efficient.
Thank you in advance
If you intend for each type of crime to be its own predictor, you'll need to bind them to train, and then specify the variables in your lm formula. (Actually for logit it should be glm().)
For a more compact formula, subset train in the data= argument of glm() to include only your response variable and your intended design matrix. Then use street1 ~ . as your formula.
# street1 must also be a column of train for the subset below to work
train <- cbind(train, narcotics, theft, street1)
model.vars <- c("narcotics", "theft", "street1")
logit.mod.train <- glm(street1 ~ ., data = train[,model.vars], family = "binomial")
More explanation:
Using ifelse as you've done produces a 1 or 0 for every element in train.
When you define crime.type as narcotics (which has the length of train) plus any additional elements, crime.type is longer than the number of rows in train.
Then you're asking lm() to process a lopsided design matrix, where one predictor (crime.type) has more elements in it than the other predictors. That's why you're getting the error.
Here's a replication of the issue:
N <- 100
train <- data.frame(PRIMARY.DESCRIPTION=sample(c("A","B"), replace = T, size = N),
response = rbinom(n=N, prob=0.7, size=1))
dim(train) # 100 2
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0)
length(narcotics) # 100
theft <- ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)
length(theft) # 100
crime.type <- c(narcotics, theft)
length(crime.type) # 200
logit.mod.train <- glm(response ~ PRIMARY.DESCRIPTION+crime.type, data = train, family = "binomial")
Error in model.frame.default(formula = response ~ PRIMARY.DESCRIPTION + :
variable lengths differ (found for 'crime.type')
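For contrast with the replication, here is a sketch of two ways that should avoid the length mismatch, on simulated data with a third category "C" added so there is a reference level. The first follows the advice above and stores each dummy as a column of train. The second is an alternative not mentioned in the answer: passing the original column as a factor and letting glm build the dummies itself.

```r
set.seed(1)
N <- 100
# Hypothetical data, similar in shape to the replication
train <- data.frame(
  PRIMARY.DESCRIPTION = sample(c("A", "B", "C"), size = N, replace = TRUE),
  response = rbinom(n = N, size = 1, prob = 0.7)
)

# Option 1: each dummy is its own column, so every predictor has N rows
train$narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0)
train$theft     <- ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)
fit_dummies <- glm(response ~ narcotics + theft, data = train, family = binomial)

# Option 2: let glm expand the factor into dummy variables
fit_factor <- glm(response ~ factor(PRIMARY.DESCRIPTION), data = train,
                  family = binomial)

# Same model with a different reference level, so the deviances should agree
print(c(deviance(fit_dummies), deviance(fit_factor)))
```

With 32 crime types, the factor form keeps the formula short while still estimating one coefficient per non-reference category.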
