Running a linear model in R with spreadsheet data - r

I have a dataset of 106 individuals of two types, a and b, with various variables such as age and gender. I want to fit a linear model that predicts whether each individual is of type a or type b from the covariates.
I read in the values for age, gender and the type label for each individual using:
data = read.xlsx("spreadsheet.xlsx", 2, as.is = TRUE)
age = data$age
gender = data$gender
type = data$type
where each is of the form:
age = [28, 30, 19, 23 etc]
gender = [male, male, female, male etc]
type = [a b b b]
Then I try to set up the model using:
model1 = lm(type ~ age + gender)
but I get this error message:
Warning messages:
1: In model.response(mf, "numeric") :
using type="numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors
I've tried changing the format of type, age and gender using:
age = as.numeric(as.character(age))
gender = as.character(gender)
type = as.character(type)
But this doesn't work!

You can't use a linear regression model with a factor as your response variable, which is what you are attempting to do here (type is your response variable). lm() requires a numeric response. For a two-class outcome like this one, you should look at classification models instead.
As Roland points out, you may wish to start by restating your "type" variable as a logical, binomial variable. Rather than a factor called "type" with two levels "a" and "b", you might create a new variable called "is.type.a", which would contain TRUE or FALSE.
You could then try a logistic regression based on a binomial distribution
model <- glm(is.type.a ~ age + gender,data=data,family="binomial")
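A minimal sketch of that recoding step, using made-up data in place of the spreadsheet. The data frame `df` and its values are hypothetical; only the column names follow the question.

```r
# Hypothetical stand-in for the spreadsheet data
set.seed(1)
df <- data.frame(
  age    = sample(18:60, 106, replace = TRUE),
  gender = sample(c("male", "female"), 106, replace = TRUE),
  type   = sample(c("a", "b"), 106, replace = TRUE)
)

# Logical response: TRUE for type "a", FALSE for type "b"
df$is.type.a <- df$type == "a"

# Logistic regression on the logical response
model <- glm(is.type.a ~ age + gender, data = df, family = binomial)

# Fitted probabilities that each individual is of type "a"
p.type.a <- predict(model, type = "response")
```

The fitted values are probabilities, so a cut-off (0.5 is a common starting point) is still needed to turn them into a predicted type.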

Related

Cannot fit multilevel ordinal logit model using clmm

I'm trying to fit a multilevel (random effects) ordered logit model using the ordinal package, but I keep running into this error:
Error in region:country1 : NA/NaN argument
Here's my simplified model. I'm regressing an indicator of happiness on a number of variables, including class, gender, age, etc. There are two nested levels: regions within countries.
library(ordinal)
# Set as factor
data$happiness <- as.factor(data$happiness)
# Remove NA
missing_country1 <- is.na(data$country1)
data <- data[!missing_country1, ]
missing_region <- is.na(data$region)
data <- data[!missing_region, ]
# Model
model1 <- clmm(happiness ~ age + gender + class + (1 | country1 / region),
data = data,
na.action = na.omit
)
I have removed all NA and NaN from both country1 and region.
Thanks,
Figured it out: it was because ordinal doesn't automatically convert the grouping variables to factor, so you need to do it manually.
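For anyone hitting the same error, the conversion might look like the sketch below. The toy data frame is hypothetical; only the column names come from the question, and the clmm call is left commented out because it needs the ordinal package and the full data.

```r
# Hypothetical stand-in for the survey data
data <- data.frame(
  happiness = c(1, 2, 3, 2, 1, 3),
  country1  = c("FR", "FR", "DE", "DE", "IT", "IT"),
  region    = c(11, 12, 21, 22, 31, 32)
)

data$happiness <- as.factor(data$happiness)

# Convert both grouping variables to factors explicitly
data$country1 <- factor(data$country1)
data$region   <- factor(data$region)

# With the grouping variables stored as factors, the original call can be retried:
# model1 <- clmm(happiness ~ age + gender + class + (1 | country1 / region), data = data)
```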

Logistic regression - eval(family$initialize) : y values must be 0 <= y <= 1

I am trying to perform logistic regression using R in a dataset provided here : http://archive.ics.uci.edu/ml/machine-learning-databases/00451/
It is about breast cancer. This dataset contains a column Classification which contains only 1 (if patient doesn't have cancer) or 2 (if patient has cancer)
library(ISLR)
dataCancer <- read.csv("~/Desktop/Isep/Machine Leaning/TD/Project_Cancer/dataR2.csv")
attach(dataCancer)
#Step : Split data into training and testing data
training = (BMI>25)
testing = !training
training_data = dataCancer[training,]
testing_data = dataCancer[testing,]
Classification_testing = Classification[testing]
#Step : Fit a logistic regression model using training data
classification_model = glm(Classification ~ ., data =
training_data,family = binomial )
When running my script I get :
> classification_model = glm(Classification ~ ., data = training_data,family = binomial )
Error in eval(family$initialize) : y values must be 0 <= y <= 1
> summary(classification_model)
Error in summary(classification_model) : object 'classification_model' not found .
I added as.factor(dataCancer$Classification) as seen in other posts but it has not solved my problem.
Can you suggest a way to get Classification values between 0 and 1, given that the column only contains 1 and 2?
You added the as.factor(dataCancer$Classification) in the script, but even if the dataset dataCancer is attached, a command like the one above does not transform the dataset variable Classification into a factor. It only returns a factor on the console.
Since you want to fit the model on the training dataset, you either specify
training_data$Classification <- as.factor(training_data$Classification)
classification_model <- glm(Classification ~ ., data =
training_data, family = binomial)
or use the as.factor function in the glm line code
classification_model <- glm(as.factor(Classification) ~ ., data =
training_data, family = binomial)
classification_model = glm(Classification ~ ., data = training_data,family = binomial )
Error in eval(family$initialize) : y values must be 0 <= y <= 1
This is because your data contains numeric values, not factor values. I hope you did
dataCancer$Classification <- as.factor(dataCancer$Classification)
Ideally, 1,0 or 1,2 will not matter as long as it's a factor. But, if doing the above also doesn't help, then you can try converting 1,2 to 1,0 and then trying the same code.
Of course, the second error occurs because the glm call failed, so the object classification_model was never created.
You need to recode the Dependent variable as 0,1 so use the below code.
library(car)
dataCancer$Classification <- recode(dataCancer$Classification, "1=0; 2=1")
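The same recoding can also be done in base R without loading car. This is a sketch on a hypothetical six-row stand-in for the dataset, assuming, as the question states, that 1 means no cancer and 2 means cancer.

```r
# Hypothetical stand-in for the Classification column (1 = no cancer, 2 = cancer)
dataCancer <- data.frame(
  BMI            = c(22.1, 27.4, 30.2, 24.8, 26.0, 29.3),
  Classification = c(1, 2, 2, 1, 2, 1)
)

# Map 2 -> 1 (cancer) and 1 -> 0 (no cancer)
dataCancer$Classification <- as.integer(dataCancer$Classification == 2)

# The response now lies in [0, 1], so the binomial family accepts it
fit <- glm(Classification ~ BMI, data = dataCancer, family = binomial)
```

Recode before splitting into training and testing sets, so both subsets carry the 0/1 coding.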

Adding a vector of dummy variables in logistic regression

I am currently trying to conduct logistic regression where one of the variables is a vector of 32 dummy variables. Each dummy represents a type of crime. For example:
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "NARCOTICS", 1,0)
Then the vector is created:
crime.type <- c(narcotics, theft, other.offense, burglary, motor.vehicle.theft, battery, robbery, assault, criminal.damage, deceptive.practice, kidnapping, etc.)
The logistic model is as follows:
logit.mod.train <- lm(street1 ~ BEAT+WARD+X.COORDINATE+Y.COORDINATE+LATITUDE+LONGITUDE+crime.type, data = train, family = "binomial")
It's important to note that street1 is actually a dummy variable for the location of the crime being on the street. So the column is LOCATION.DESCRIPTION and the element is street.
street1 <- ifelse(train$LOCATION.DESCRIPTION == "STREET", 1,0).
It yields this error:
Error in model.frame.default(formula = street1 ~ BEAT + WARD + X.COORDINATE + :
variable lengths differ (found for 'crime.type')
I thought this would work because they are derived from the same data set and the dummies represent each unique element of one of the columns. When I input each dummy variable separately it's successful but I want to condense the regression and make it more efficient.
Thank you in advance
If you intend for each type of crime to be its own predictor, you'll need to bind them to train, and then specify the variables in your lm formula. (Actually for logit it should be glm().)
For a more compact formula, subset train in the data= argument of glm() to include only your response variable and your intended design matrix. Then use street1 ~ . as your formula.
train <- cbind(train, narcotics, theft, street1)
model.vars <- c("narcotics", "theft", "street1")
logit.mod.train <- glm(street1 ~ ., data = train[,model.vars], family = "binomial")
More explanation:
Using ifelse as you've done produces a 1 or 0 for every element in train.
When you define crime.type as narcotics (which has one element per row of train) concatenated with any additional dummies, crime.type becomes longer than the number of rows in train.
Then you're asking lm() to process a lopsided design matrix, where one predictor (crime.type) has more elements in it than the other predictors. That's why you're getting the error.
Here's a replication of the issue:
N <- 100
train <- data.frame(PRIMARY.DESCRIPTION=sample(c("A","B"), replace = T, size = N),
response = rbinom(n=N, prob=0.7, size=1))
dim(train) # 100 2
narcotics <- ifelse(train$PRIMARY.DESCRIPTION == "A", 1, 0)
length(narcotics) # 100
theft <- ifelse(train$PRIMARY.DESCRIPTION == "B", 1, 0)
length(theft) # 100
crime.type <- c(narcotics, theft)
length(crime.type) # 200
logit.mod.train <- glm(response ~ PRIMARY.DESCRIPTION+crime.type, data = train, family = "binomial")
Error in model.frame.default(formula = response ~ PRIMARY.DESCRIPTION + :
variable lengths differ (found for 'crime.type')
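An alternative that avoids hand-built dummies altogether is to pass the crime column as a single factor and let glm() expand it. The sketch below uses simulated data with three hypothetical crime types; the column names follow the question, everything else is made up.

```r
set.seed(1)
N <- 100
train <- data.frame(
  PRIMARY.DESCRIPTION  = sample(c("NARCOTICS", "THEFT", "BURGLARY"), N, replace = TRUE),
  LOCATION.DESCRIPTION = sample(c("STREET", "RESIDENCE"), N, replace = TRUE)
)

# Response and predictor both live inside train, so their lengths always match
train$street1    <- ifelse(train$LOCATION.DESCRIPTION == "STREET", 1, 0)
train$crime.type <- factor(train$PRIMARY.DESCRIPTION)

logit.mod.train <- glm(street1 ~ crime.type, data = train, family = binomial)

# One indicator column per non-reference level, built automatically
dummy.cols <- colnames(model.matrix(logit.mod.train))
```

With 32 crime types this yields 31 indicator columns plus the intercept, with the first level acting as the reference category.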

predict.glm() with three new categories in the test data (r)(error)

I have a data set called data which has 481 092 rows.
I split data into two equal halves:
The first half (rows 1 : 240 546) is called train and was used for the glm();
the second half (rows 240 547 : 481 092) is called test and should be used to validate the model;
Then I started the regression:
testreg <- glm(train$returnShipment ~ train$size + train$color + train$price +
train$manufacturerID + train$salutation + train$state +
train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
Now the prediction:
prediction <- predict.glm(testreg, newdata=test, type="response")
gives me an Error:
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
Now I know that these levels were omitted in the regression because it doesn't show any coefficients for these levels.
I have tried this: predict.lm() with an unknown factor level in test data . But it somehow doesn't work for me or I maybe just don't get how to implement it. I want to predict the dependent binary variable but of course only with the existing coefficients. The link above suggests to tell R that rows with new levels should just be called /or treated as NA.
How can I proceed?
Edit-Suggested approach by Z. Li
I ran into a problem at the first step:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
but mID125 is NULL! What have I done wrong?
It is impossible to estimate effects for new factor levels in fixed-effects modelling, which includes linear models and generalized linear models. glm (like lm) records which factor levels were present and used during model fitting; they can be found in testreg$xlevels.
Your model formula for model estimation is:
returnShipment ~ size + color + price + manufacturerID + salutation +
state + age + deliverytime
predict then complains about new factor levels 125, 136, 137 for manufacturerID. These levels are not in testreg$xlevels$manufacturerID, so they have no associated coefficient for prediction. In this case, we have to drop this factor variable and use a prediction formula:
returnShipment ~ size + color + price + salutation +
state + age + deliverytime
However, the standard predict routine cannot take a customized prediction formula. There are two common solutions:
1. extract the model matrix and model coefficients from testreg, and manually predict the model terms we want by matrix-vector multiplication. This is what the link in your post suggests;
2. reset the factor levels in test to any one level that appears in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. That way we can still use the standard predict for prediction.
Now, let's first pick up a factor level used for model fitting
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to your prediction data:
replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement
And we are ready to predict:
pred <- predict(testreg, test, type = "link") ## don't use type = "response" here!!
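In the end, we adjust this linear predictor by subtracting the estimate for that factor level. If the chosen level is the reference level (as xlevels[1] is under the default treatment contrasts), it has no coefficient of its own, so the adjustment is zero:
est <- coef(testreg)[paste0("manufacturerID", mID125)]
if (is.na(est)) est <- 0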
pred <- pred - est
Finally, if you want prediction on the original scale, you apply the inverse of link function:
testreg$family$linkinv(pred)
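The steps above can be tried end to end on toy data. This is only a sketch: the data frames, the variable names y, x and id, and the unseen level "z" are all hypothetical, and a non-reference level is picked deliberately so that a coefficient exists to subtract.

```r
set.seed(42)
train <- data.frame(
  y  = rbinom(60, 1, 0.5),
  x  = rnorm(60),
  id = factor(rep(c("a", "b", "c"), 20))
)
test <- data.frame(x = c(0.2, -1), id = factor(c("a", "z")))  # "z" is unseen

fit <- glm(y ~ x + id, family = binomial, data = train)

# Pick a level that was used in fitting and assign it to every test row
xlevels <- fit$xlevels$id
lev     <- xlevels[2]  # non-reference level, so coef "idb" exists
test$id <- factor(rep(lev, nrow(test)), levels = xlevels)

# Predict on the link scale, then remove the contribution of that level
eta  <- predict(fit, newdata = test, type = "link")
eta  <- eta - coef(fit)[paste0("id", lev)]

# Back-transform to probabilities
prob <- fit$family$linkinv(eta)
```

What remains in eta is the intercept plus the x term, so the prediction corresponds to the reference level of id rather than to a model refitted without that factor.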
update:
You reported running into various problems when trying the above solutions. Here is why.
Your code:
testreg <- glm(train$returnShipment~ train$size + train$color +
train$price + train$manufacturerID + train$salutation +
train$state + train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
is a poor way to specify your model formula. Writing train$returnShipment, etc., hard-codes the data frame train into the formula, so the variables are always looked up in train, and you will have trouble predicting later with other data sets, like test.
As a simple example for such drawback, we simulate some toy data and fit a GLM:
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo) ## bad style
> toy$formula
foo$y ~ foo$a
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"
Now, we see everything comes with a prefix foo$. During prediction:
newdata <- foo[1:2, ] ## take first 2 rows of "foo" as "newdata"
rm(foo) ## remove "foo" from R session
predict(toy, newdata)
we get an error:
Error in eval(expr, envir, enclos) : object 'foo' not found
The better style is to use bare variable names in the formula and let the data argument tell the function where to find them:
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)
then foo$ goes away.
> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
This would explain two things:
1. You told me in the comments that testreg$xlevels$manufacturerID returns NULL (the levels are stored under the name train$manufacturerID instead);
2. The prediction error you posted
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
complains about train$manufacturerID rather than test$manufacturerID.
Because you divided your train and test samples by row number, some factor levels of your variables are not represented in both samples.
You need to do stratified sampling to ensure that both train and test samples contain every factor level. Use stratified from the splitstackshape package.
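The idea behind a stratified split can also be sketched in base R. This is not the splitstackshape API, just an illustration on simulated data; the level frequencies and the price column are made up.

```r
set.seed(123)
data <- data.frame(
  manufacturerID = factor(sample(c("101", "125", "136", "137"), 200,
                                 replace = TRUE, prob = c(0.85, 0.05, 0.05, 0.05))),
  price = runif(200, 10, 100)
)

# Within each manufacturerID level, send about half of the rows to train
rows.by.level <- split(seq_len(nrow(data)), data$manufacturerID)
train.idx <- unlist(lapply(rows.by.level, function(idx) {
  idx[sample.int(length(idx), ceiling(length(idx) / 2))]
}), use.names = FALSE)

train <- data[train.idx, ]
test  <- data[-train.idx, ]

# Levels that occur in test but not in train; should be empty
unseen <- setdiff(unique(as.character(test$manufacturerID)),
                  unique(as.character(train$manufacturerID)))
```

Rounding up with ceiling() means a level with a single row ends up in train, so predict() never meets a level the model has not seen.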

Identifying Fitted Y, Given X Values

I am fitting a Poisson GLM and want to predict y values given specific levels of the explanatory variables. My code is:
poisson.fit<-glm(y ~ age + gender, family= "poisson", data = data)
I want poisson.fit$y for a hypothetical observation of age = 50 and gender = "male". How do I produce this statistic?
Use the predict function.
predict(poisson.fit, newdata=data.frame(age=50, gender="male"))
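By default, predict returns values on the link scale, which for a Poisson model is the log of the expected count. Pass type = "response" to get the expected count itself; type = "terms" is also available. See ?predict.glm for complete options and documentation.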
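A short sketch of the difference between the two scales, on simulated data. The data frame and the coefficients used to generate y are hypothetical.

```r
set.seed(7)
data <- data.frame(
  age    = sample(20:70, 200, replace = TRUE),
  gender = sample(c("male", "female"), 200, replace = TRUE)
)
data$y <- rpois(200, lambda = exp(0.5 + 0.01 * data$age))

poisson.fit <- glm(y ~ age + gender, family = "poisson", data = data)
new.obs     <- data.frame(age = 50, gender = "male")

# Default type = "link": log of the expected count
log.mean <- predict(poisson.fit, newdata = new.obs)

# type = "response": the expected count itself
fitted.count <- predict(poisson.fit, newdata = new.obs, type = "response")
```

Here exp(log.mean) and fitted.count agree, since the Poisson family uses a log link by default.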
