I am carrying out a split-plot experiment in microbiology. I set up 3 blocks (A, B, C); each block contains 2 replicates, and each replicate contains 2 species. I want to test whether the density ratio of these 2 species changes with time.
I wrote this code:
y19 <- cbind(data19$density.E, data19$density.P)
model19 <- glmer(y19 ~ time + (1|block), binomial, data = data19)
summary(model19)
It works, but shows these warnings:
boundary (singular) fit: see help('isSingular')
Warning messages:
1: Some predictor variables are on very different scales: consider rescaling
2: In eval(family$initialize, rho) : non-integer counts in a binomial glm!
I learned that the glm function can also be used, but I don't know how to express a split-plot design in glm. I can only write:
model19 <- glm(y19 ~ time * block, binomial, data = data19)
What is the right glm formula for my design?
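For reference, a minimal sketch of a fixed-block glm, assuming the densities can be replaced with integer counts (the binomial family requires integer counts, which is what warning 2 is about; count.E and count.P are hypothetical column names):
# two-column integer response: (successes, failures) per observation
y19 <- cbind(data19$count.E, data19$count.P)
# fixed-effects analogue of the glmer fit: block as a fixed factor instead of (1|block)
model19 <- glm(y19 ~ time + block, family = binomial, data = data19)
summary(model19)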
I am trying to fit a model that uses an interaction term and to generate out-of-sample predictions from it.
My training sample has 3 variables and 11 rows.
My test sample has 3 variables and 1 row.
My code is the following.
inter.model <- lm(Y.train ~ Y.lag.train + X.1.train + X.1.train:X.2.train)
However, I am not quite sure how R handles the interaction terms.
I have coded the predictions using the coefficients from the model and the test data.
inter.prediction <- inter.model$coef[1] + inter.model$coef[2]*Y.lag.test +
inter.model$coef[3]*X.1.test + (inter.model$coef[4]*X.1.test*X.2.test)
I wanted to make sure that these predictions were correctly coded, so I tried to reproduce them with R's predict function.
inter.pred.function <- predict(inter.model, newdata=test_data)
However, I am getting an error message:
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
variable lengths differ (found for 'X.2.train')
In addition: Warning message:
'newdata' had 1 row but variables found have 11 rows
names(test_data)
[1] "Y.lag.test" "X.1.test" "X.1.test:X.2.test"
So, my question is, how do you code and make linear regression predictions with interaction terms in R?
You won't need "X.1.test:X.2.test" in your new data; the interaction is created automatically in stats:::predict.lm via the model.matrix. Note also that predict() looks up variables by the names used in the model formula, so fit the model with a data argument and give the test data the same column names; your model was fit on Y.lag.train etc. while the test columns are named Y.lag.test etc., which is why predict() falls back to the 11-row training vectors and complains that variable lengths differ.
fit <- lm(mpg ~ hp*am, mtcars[1:10, ])
test <- mtcars[-(1:10), c('mpg', 'hp', 'am')]
as.numeric(predict(fit, newdata=test))
# [1] 20.220513 17.430053 17.430053 17.430053 16.206167 15.716612 14.982281 25.658824 27.141176 25.764706
# [11] 21.493355 18.898716 18.898716 14.247949 17.674830 25.658824 23.011765 20.682353 4.694118 14.117647
# [21] -2.823529 21.105882
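Applied to your setup, a minimal sketch, assuming your training and test vectors are collected into data frames (train_df and test_df are hypothetical names; the point is that both use the same column names as the model formula):
# fit with a data argument and reuse the same column names at predict time
train_df <- data.frame(Y = Y.train, Y.lag = Y.lag.train,
                       X.1 = X.1.train, X.2 = X.2.train)
test_df  <- data.frame(Y.lag = Y.lag.test, X.1 = X.1.test, X.2 = X.2.test)
inter.model <- lm(Y ~ Y.lag + X.1 + X.1:X.2, data = train_df)
inter.prediction <- predict(inter.model, newdata = test_df)  # one row in, one prediction out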
I am trying to build a logistic regression model with diagnosis as the response (a 2-level factor: B, M).
I am getting an Error on building a logistic regression model:
Error in model.matrix.default(mt, mf, contrasts) :
variable 1 has no levels
I am not able to figure out how to solve this issue.
R Code:
Cancer <- read.csv("Breast_Cancer.csv")
## Logistic Regression Model
lm.fit <- glm(diagnosis~.-id-X, data = Cancer, family = binomial)
summary(lm.fit)
Dataset Reference: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Your problem is similar to the one reported here on the randomForest classifier.
Apparently glm checks through the variables in your data and throws an error because X contains only NA values.
You can fix that error
either by dropping X completely from your dataset, setting Cancer$X <- NULL before handing it to glm and leaving X out of the formula (glm(diagnosis ~ . - id, data = Cancer, family = binomial));
or by adding na.action = na.pass to the glm call (which essentially instructs it to ignore the NA warning) while still excluding X in the formula itself (glm(diagnosis ~ . - id - X, data = Cancer, family = binomial, na.action = na.pass)).
Note, however, that you still have to provide the diagnosis variable in a form digestible by glm, meaning either a numeric vector with values 0 and 1, a logical vector, or a factor.
"For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)" - from the glm-doc
Just define Cancer$diagnosis <- as.factor(Cancer$diagnosis).
On my end this still leaves some warnings, but I think those come from the data or your feature selection. It clears the blocking errors :)
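Putting the pieces together, a minimal sketch (assuming the Kaggle CSV, where read.csv leaves a trailing all-NA column X):
Cancer <- read.csv("Breast_Cancer.csv")
Cancer$X <- NULL                                  # drop the all-NA column
Cancer$diagnosis <- as.factor(Cancer$diagnosis)   # first level (B) = failure, M = success
lm.fit <- glm(diagnosis ~ . - id, data = Cancer, family = binomial)
summary(lm.fit)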
I first fit a Cox model in R:
test1<- test[1:20,]
model.1 <- coxph(Surv(test1$days,test1$status==1) ~ test1$MTT+test1$ADC,data=test1)
and when I tried to predict the next patient's survival like this:
covs1 <- data.frame(test[21,]$MTT,test[21,]$ADC)
summary(survfit(model.1, newdata= covs1, type ="aalen"))
it gave me too many survival results, and the warning is
"'newdata' had 1 row but variables found have 20 rows "
FYI, there are 20 events, and the output contains 20 survival curves.
The column names in the data frame given as the basis for a prediction must match the variable names on the RHS of the model formula. I don't think yours will qualify unless you do something like this:
test1<- test[1:20,]
model.1 <- coxph( Surv(days, status==1) ~ MTT + ADC, data=test1)
covs1 <- test[21, c("MTT", "ADC")]
# then do your prediction
You should not use $ to supply arguments to Surv. It is important that the model be constructed in the environment of the data frame.
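With the model refit that way, your original prediction call should return a single survival curve for the new patient:
library(survival)
# one curve for patient 21 rather than 20
summary(survfit(model.1, newdata = covs1, type = "aalen"))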
Is it possible to run a GLM with a Poisson distribution when the response combines two columns in R?
I am looking at the effects of different species, the cage density and the day that eggs are laid on how many eggs were laid and how many hatched, so I have linked the hatched and unhatched columns. My data are count data. The code works ok with family = binomial but I want to test if poisson is a better model.
My code is as follows:
attach(EggV)
density <- as.factor(Density)
day <- as.factor(Day)
Y <- cbind(Hatched, Unhatched)
model.pois <- glm(Y ~ Species + density + day, data = EggV, family = poisson)
But once I run the code it gives me an error:
Error in x[good, , drop = FALSE] : (subscript) logical subscript too long
If I run the same code with only "Hatched" or "Unhatched" as the response it works, but this is not sufficient for my data analysis.
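For reference, a hedged sketch of the distinction: a two-column cbind response is specific to the binomial family, while poisson expects a single vector of counts, so the two models need different responses (column names follow your code):
# binomial: response is (successes, failures) per row
model.bin <- glm(cbind(Hatched, Unhatched) ~ Species + density + day,
                 data = EggV, family = binomial)
# poisson: response is a single count, e.g. the number hatched
model.pois <- glm(Hatched ~ Species + density + day,
                  data = EggV, family = poisson)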
Let me illustrate my confusion with an example:
#making datasets
x1<-iris[,1]
x2<-iris[,2]
x3<-iris[,3]
x4<-iris[,4]
dat<-data.frame(x1,x2,x3)
dat2<-dat[1:120,]
dat3<-dat[121:150,]
#Using a linear model to fit x4 using x1, x2 and x3 where training set is first 120 obs.
model<-lm(x4[1:120]~x1[1:120]+x2[1:120]+x3[1:120])
#Using the coefficient values from summary(model), prediction is done for the next 30 obs.
-.17947-.18538*x1[121:150]+.18243*x2[121:150]+.49998*x3[121:150]
#Same prediction is done using the function "predict"
predict(model,dat3)
My confusion is: the two sets of predictions for the last 30 values differ, maybe only slightly, but they do differ. Why is that? Shouldn't they be exactly the same?
The difference is really small, and I think it is just due to the precision of the coefficients you are using (e.g. the real value of the intercept is -0.17947075338464965610..., not simply -.17947).
In fact, if you take the full coefficient values and apply the formula, the result is equal to that of predict():
intercept <- model$coefficients[1]
x1Coeff <- model$coefficients[2]
x2Coeff <- model$coefficients[3]
x3Coeff <- model$coefficients[4]
intercept + x1Coeff*x1[121:150] + x2Coeff*x2[121:150] + x3Coeff*x3[121:150]
You can clean your code a bit. To create your training and test datasets you can use the following code:
# create training and test datasets
train.df <- iris[1:120, 1:4]
test.df <- iris[-(1:120), 1:4]
# fit a linear model to predict Petal.Width using all predictors
fit <- lm(Petal.Width ~ ., data = train.df)
summary(fit)
# predict Petal.Width in test test using the linear model
predictions <- predict(fit, test.df)
# create a function mse() to calculate the Mean Squared Error
mse <- function(predictions, obs) {
  sum((obs - predictions) ^ 2) / length(predictions)
}
# measure the quality of fit
mse(predictions, test.df$Petal.Width)
The reason your predictions differ is that predict() uses the coefficients at full precision, whereas in your "manual" calculation you used only five decimal places. The summary() function doesn't display the complete values of your coefficients but rounds them to make the output more readable.
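If you want to see the full-precision coefficients that predict() uses, print them with more digits:
print(coef(model), digits = 15)  # or use model$coefficients directly, as above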