How can I compare 3 binary variables in R?

I'm looking at debris ingestion in gulls. Each gull is listed by row. Columns contain the sex (0=male, 1=female), whether they ate debris (0=no, 1=yes), and whether I found any of a number of other items in their stomach. For this problem I'd like to see if sex and presence of debris influence the number of birds with shells in their stomach (0=no shells, 1=shells). Debris prevalence is likely overdispersed and zero-inflated, but I'm not sure that matters if I'm using it as a factor to evaluate shell prevalence. Shell prevalence might be overdispersed and zero-inflated as well.
I've plotted the data and want to test whether the differences seen in the plot are significant.
But when trying to run a zero-inflated negative binomial model I get many different errors depending on how I set it up.
library (aod)
library(MASS)
library (ggplot2)
library(gridExtra)
library(pscl)
library(boot)
library(reshape2)
mydata1 <- read.csv('D:/mp paper/analysis wkshts/stats files/FOdata.csv')
mydata1 <- within(mydata1, {
debris <- factor(debris)
sex <- factor(sex)
Shell_frags <- factor(Shell_frags)
})
summary(mydata1)
ggplot(mydata1, aes(Shell_frags, fill=debris)) +
stat_count() +
facet_grid(debris ~ sex, margins=TRUE, scales="free_y")
m1 <- zeroinfl((Shell_frags ~ sex + debris), data = mydata1, dist = "negbin", EM = TRUE)
summary(m1)
Error message:
Error in if (all(Y > 0)) stop("invalid dependent variable, minimum count is not zero") :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(Y, 0) : ‘>’ not meaningful for factors
> summary(m1)
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': object 'm1'
not found
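The warnings above say that zeroinfl() was handed a factor response, while it expects non-negative integer counts. Because Shell_frags is a 0/1 indicator rather than a count, one possible alternative is an ordinary logistic regression. The following is only a minimal sketch on simulated data: the data frame gulls and its values are made up for illustration, and whether a binomial GLM is the right model for the real data is an assumption.

```r
# Simulated stand-in for the real data (hypothetical values, not FOdata.csv)
set.seed(2)
n <- 120
gulls <- data.frame(
  sex    = factor(rbinom(n, 1, 0.5)),   # 0 = male, 1 = female
  debris = factor(rbinom(n, 1, 0.3))    # 0 = no debris, 1 = debris
)
# Keep the response numeric 0/1 (do not convert it to a factor for this fit)
gulls$Shell_frags <- rbinom(n, 1, plogis(-0.5 + 0.8 * (gulls$debris == "1")))

# Logistic regression: does sex or debris presence change the odds of shells?
m_logit <- glm(Shell_frags ~ sex + debris,
               family = binomial(link = "logit"), data = gulls)
summary(m_logit)      # Wald tests for each term
exp(coef(m_logit))    # odds ratios
```

With the real data, the same call would use mydata1 in place of gulls, leaving Shell_frags as 0/1.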

Related

Issues performing Hosmer-Lemeshow test

Please help!
I am trying to perform a HL test to assess the goodness of fit for my model but I keep getting the above error!
I have these packages installed:
library(jtools)
library(skimr)
library(epiR)
library(rms)
library(epimisc)
library(DescTools)
library(car)
library(readxl)
library(summarytools)
library(survival)
library(ggplot2)
library(survminer)
library(PredictABEL)
This is my code:
# Final model
l_final <- glm(pro$treatment ~ pro$age,family = binomial(link="logit"),data = pro, x=TRUE)
summ(l_final)
# Assessing the fit of the model
# predict probabilities
pro$l_final_pred <- predict(l_final, type = "response")
# Hosmer-Lemeshow test
HosmerLemeshowTest(pro$l_final_pred, pro$treatment, X = l_final_pred$x)
This is the output:
> # Assessing the fit of the model
> pro$l_final_pred <- predict(l_final, type = "response")
Error:
! Assigned data `predict(l_final, type = "response")` must be compatible with existing data.
✖ Existing data has 866 rows.
✖ Assigned data has 864 rows.
ℹ Only vectors of size 1 are recycled.
Backtrace:
1. base::`$<-`(`*tmp*`, l_final_pred, value = `<dbl>`)
12. tibble (local) `<fn>`(`<vctrs___>`)
> # Hosmer-Lemeshow test
> HosmerLemeshowTest(pro$l_final_pred, pro$treatment, X = l_final_pred$x)
Error in cut.default(fit, breaks = brks, include.lowest = TRUE) :
'x' must be numeric
In addition: Warning message:
Unknown or uninitialised column: `l_final_pred`.
Thank you!!
I know it has something to do with the lengths of the data sets (two values are missing in treatment, so it has 864 compared to 866 for age), but I can't quite work out how to fix it.
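One way to handle the length mismatch is to fit the model with na.action = na.exclude, so that predict() pads the rows dropped for missing values with NA and the result has one entry per original row. This is a minimal sketch on simulated data: the toy pro data frame and its values are assumptions, only the column names come from the question.

```r
# Toy data with two missing outcomes, mimicking the 866 vs 864 mismatch
set.seed(1)
pro <- data.frame(age = rnorm(20, mean = 50, sd = 10),
                  treatment = rbinom(20, 1, 0.5))
pro$treatment[c(3, 7)] <- NA

# Use bare column names plus data =, and keep track of the dropped rows
l_final <- glm(treatment ~ age, family = binomial(link = "logit"),
               data = pro, na.action = na.exclude)

# Predictions now have nrow(pro) entries, with NA where treatment was missing
pro$l_final_pred <- predict(l_final, type = "response")

# Complete rows only, e.g. as input to a goodness-of-fit test
ok <- !is.na(pro$l_final_pred)
fitted_ok <- pro$l_final_pred[ok]
observed_ok <- pro$treatment[ok]
```

The vectors fitted_ok and observed_ok have matching lengths and could then be passed to the goodness-of-fit function. Note also that the call in the question refers to l_final_pred$x, an object that does not exist; the design matrix stored by x = TRUE lives in l_final$x.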

Cannot calculate marginal effects in logit model

I am running the following regression:
Model <- glm(emp ~ industry + nat_status + region + state + age + educ7 + religion + caste,
family=binomial(link="logit"), data=IHDS)
However when I use the margins command, I get the following error:
There were 50 or more warnings (use warnings() to see the first 50)
Warning messages: 1: In predict.lm(object, newdata, se.fit, scale = 1,
type = if (type == ... : prediction from a rank-deficient fit may
be misleading
Based on this warning, I know that collinearity might exist. However, I do not know how to find it or how to deal with it.
(I have tried adding each control individually)
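A rank-deficient fit shows up as NA coefficients: glm() cannot estimate a column of the design matrix that is a linear combination of others. The sketch below uses simulated data in which region is fully determined by state; the data frame IHDS_toy and its values are hypothetical, and whether the same kind of nesting is the cause in the real IHDS data is an assumption.

```r
# Toy data: every state belongs to exactly one region, so region adds
# no information once state is in the model
set.seed(42)
IHDS_toy <- data.frame(state = factor(rep(c("A", "B", "C", "D"), each = 50)))
IHDS_toy$region <- factor(ifelse(IHDS_toy$state %in% c("A", "B"),
                                 "north", "south"))
IHDS_toy$emp <- rbinom(nrow(IHDS_toy), 1, 0.5)

fit <- glm(emp ~ state + region, family = binomial(link = "logit"),
           data = IHDS_toy)

# Coefficients that could not be estimated are reported as NA
aliased <- names(coef(fit))[is.na(coef(fit))]
aliased

# The model rank is smaller than the number of design-matrix columns
X <- model.matrix(fit)
c(rank = fit$rank, columns = ncol(X))
```

Dropping or recoding the variables behind the NA coefficients removes the rank deficiency, after which the prediction warning should no longer appear.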

predict.glm() with three new categories in the test data (r)(error)

I have a data set called data which has 481 092 rows.
I split data into two equal halves:
The first half (rows 1 : 240 546) is called train and was used for the glm();
the second half (rows 240 547 : 481 092) is called test and should be used to validate the model.
Then I started the regression:
testreg <- glm(train$returnShipment ~ train$size + train$color + train$price +
train$manufacturerID + train$salutation + train$state +
train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
Now the prediction:
prediction <- predict.glm(testreg, newdata=test, type="response")
gives me an Error:
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
Now I know that these levels were omitted in the regression because it doesn't show any coefficients for these levels.
I have tried this: "predict.lm() with an unknown factor level in test data". But it somehow doesn't work for me, or maybe I just don't understand how to implement it. I want to predict the dependent binary variable, but of course only with the existing coefficients. The link above suggests telling R that rows with new levels should simply be treated as NA.
How can I proceed?
Edit: Suggested approach by Z. Li
I ran into a problem in the first step:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
but mID125 is NULL! What have I done wrong?
It is impossible to estimate effects for new factor levels in fixed-effect modelling, including linear models and generalized linear models. glm (as well as lm) keeps a record of which factor levels were present and used during model fitting; these can be found in testreg$xlevels.
Your model formula for model estimation is:
returnShipment ~ size + color + price + manufacturerID + salutation +
state + age + deliverytime
then predict complains about new factor levels 125, 136, 137 for manufacturerID. This means these levels are not in testreg$xlevels$manufacturerID and therefore have no associated coefficient for prediction. In this case, we have to drop this factor variable and use a prediction formula:
returnShipment ~ size + color + price + salutation +
state + age + deliverytime
However, the standard predict routine cannot take a customized prediction formula. There are commonly two solutions:
1. Extract the model matrix and model coefficients from testreg, and manually predict the model terms we want by matrix-vector multiplication. This is what the link given in your post suggests;
2. Reset the factor levels in test to any one level that appears in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. That way, we can still use the standard predict for prediction.
Now, let's first pick up a factor level used for model fitting
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to your prediction data:
replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement
And we are ready to predict:
pred <- predict(testreg, test, type = "link") ## don't use type = "response" here!!
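In the end, we adjust this linear predictor by subtracting the estimate for that factor level. Note that under R's default treatment contrasts the first level is the baseline and has no coefficient of its own, so the lookup below returns NA for it and no adjustment is needed:
est <- coef(testreg)[paste0("manufacturerID", mID125)] ## the variable name must be a quoted string
if (!is.na(est)) pred <- pred - est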
Finally, if you want prediction on the original scale, you apply the inverse of link function:
testreg$family$linkinv(pred)
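The steps above can be tried end to end on toy data. This is only a sketch under stated assumptions: the data frames, the levels "101" to "103" and the predictor price are simulated for illustration, and the model is fitted with bare column names plus data = so that xlevels is named manufacturerID.

```r
# Simulated training data with three known manufacturer levels
set.seed(0)
train <- data.frame(
  y = rbinom(90, 1, 0.5),
  manufacturerID = factor(rep(c("101", "102", "103"), each = 30)),
  price = runif(90)
)
# Test data containing levels that never occurred in training
test <- data.frame(
  manufacturerID = factor(c("101", "125", "136")),
  price = c(0.2, 0.5, 0.8)
)

fit <- glm(y ~ manufacturerID + price,
           family = binomial(link = "logit"), data = train)

# Replace every test level by one level known to the model
xlevels <- fit$xlevels$manufacturerID
base_level <- xlevels[1]
test$manufacturerID <- factor(rep(base_level, nrow(test)), levels = xlevels)

# Predict on the link scale, then remove that level's effect if it has one
eta <- predict(fit, newdata = test, type = "link")
est <- coef(fit)[paste0("manufacturerID", base_level)]  # NA for the baseline
if (!is.na(est)) eta <- eta - est

# Back-transform to probabilities
prob <- fit$family$linkinv(eta)
```

The resulting probabilities ignore the manufacturer effect entirely, which is the price of predicting for levels the model has never seen.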
update:
You complained that you met various troubles in trying the above solutions. Here is why.
Your code:
testreg <- glm(train$returnShipment~ train$size + train$color +
train$price + train$manufacturerID + train$salutation +
train$state + train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
is a very bad way to specify your model formula. train$returnShipment, etc., ties the variable lookup strictly to the data frame train, and you will have trouble in later prediction with other data sets, like test.
As a simple example of this drawback, we simulate some toy data and fit a GLM:
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo) ## bad style
> toy$formula
foo$y ~ foo$a
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"
Now we see that everything comes with the prefix foo$. During prediction:
newdata <- foo[1:2, ] ## take first 2 rows of "foo" as "newdata"
rm(foo) ## remove "foo" from R session
predict(toy, newdata)
we get an error:
Error in eval(expr, envir, enclos) : object 'foo' not found
The good style is to use bare variable names in the formula and let the function look them up through its data argument:
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)
then foo$ goes away.
> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
This would explain two things:
1. You told me in the comments that testreg$xlevels$manufacturerID gives NULL;
2. The prediction error you posted
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
complains about train$manufacturerID instead of test$manufacturerID.
As you have divided your train and test samples based on row numbers, some factor levels of your variables are not represented in both samples.
You need to do stratified sampling to ensure that both the train and test samples contain every factor level. Use stratified from the splitstackshape package.
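For readers who prefer not to add a package, the same idea can be sketched in base R by sampling row indices within each factor level. This is not the splitstackshape implementation, just an illustration on made-up data; the data frame dat and its level counts are hypothetical.

```r
# Toy data with unbalanced manufacturer levels
set.seed(123)
dat <- data.frame(
  id = 1:40,
  manufacturerID = factor(rep(c("101", "125", "136", "137"),
                              times = c(25, 5, 6, 4)))
)

# Within each level, send about half of the rows to the training set
rows_by_level <- split(seq_len(nrow(dat)), dat$manufacturerID)
train_idx <- unlist(lapply(rows_by_level, function(idx) {
  idx[sample.int(length(idx), size = ceiling(length(idx) / 2))]
}), use.names = FALSE)

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

table(train$manufacturerID)
table(test$manufacturerID)
```

A level that occurs only once in the whole data set can still end up in just one of the two samples, so very rare levels may need to be merged or handled separately.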

ANOVA Error in levels(x)[x]

I am attempting to run an ANOVA on some data, but it gives me the following error:
Call:
aov(formula = speaker ~ CoG * skewness * kurtosis, data = total)
Error in levels(x)[x] : only 0's may be mixed with negative subscripts
In addition: Warning messages:
1: In model.response(mf, "numeric") :
using type="numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : - not meaningful for factors
I'm trying to see how well the three variables CoG, skewness and kurtosis can predict the speaker, and whether they differ significantly between speakers. A copy of my data can be found here:
https://www.dropbox.com/s/blzpb12bemv6kuc/All.csv
Can anyone help interpret what the error is saying and where it is occurring?
Here is the answer I gave on stats.stackexchange.com
Sounds like you are trying to do Multinomial regression. Perhaps look up information on that.
Here is a great start:
http://www.ats.ucla.edu/stat/r/dae/mlogit.htm
e.g.
install.packages('nnet')
library(nnet)
test<-multinom(formula = as.factor(speaker) ~ CoG * skewness * kurtosis, data = total)
z <- summary(test)$coefficients/summary(test)$standard.errors
# 2-tailed z test
p <- (1 - pnorm(abs(z), 0, 1)) * 2
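Since the Dropbox data are not reproduced here, the following is a self-contained version of the same idea on simulated data. The values are made up and only the column names are borrowed from the question; a main-effects formula is used to keep the toy example small.

```r
library(nnet)  # recommended package that ships with R

# Simulated stand-in: three speakers whose CoG distributions overlap
set.seed(10)
n <- 150
total <- data.frame(
  speaker  = factor(rep(c("s1", "s2", "s3"), each = 50)),
  CoG      = rnorm(n, mean = rep(c(0, 0.5, 1), each = 50)),
  skewness = rnorm(n)
)

# Multinomial logistic regression with speaker as the categorical response
fit <- multinom(speaker ~ CoG + skewness, data = total, trace = FALSE)

# Wald z statistics and two-tailed p-values, one row per non-reference speaker
s <- summary(fit)
z <- s$coefficients / s$standard.errors
p <- 2 * (1 - pnorm(abs(z)))
p
```

Each row of p compares one speaker against the reference level (the first factor level), which is why there are two rows for three speakers.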

Day-ahead using GLM model in R

I have the following code to get a day-ahead prediction of load consumption at 15-minute intervals using outside air temperature and TOD (time of day, a categorical variable with 96 levels). When I run the code below, I get the following errors.
i = 97:192
formula = as.formula(load[i] ~ load[i-96] + oat[i])
model = glm(formula, data = train.set, family=Gamma(link=vlog()))
I get the following error after the last line using glm(),
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
And the following error shows up after the last line using predict(),
Warning messages:
1: In if (!se.fit) { :
the condition has length > 1 and only the first element will be used
2: 'newdata' had 96 rows but variable(s) found have 1 rows
3: In predict.lm(object, newdata, se.fit, scale = residual.scale, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
4: In if (se.fit) list(fit = predictor, se.fit = se, df = df, residual.scale = sqrt(res.var)) else predictor :
the condition has length > 1 and only the first element will be used
You're doing things in a rather roundabout fashion, and one that doesn't translate well to making out-of-sample predictions. If you want to model on a subset of rows, then either subset the data argument directly, or use the subset argument.
train.set$load_lag <- c(rep(NA, 96), head(train.set$load, -96)) # load 96 intervals (one day) earlier; works for any number of rows
mod <- glm(load ~ load_lag*TOD, data=train.set[97:192, ], ...)
You also need to rethink exactly what you're doing with TOD. If it has 96 levels, then you're fitting (at least) 96 degrees of freedom on 96 observations which won't give you a sensible outcome.
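A minimal end-to-end sketch of this approach, on simulated data, might look as follows. Everything here is an assumption made for illustration: the series are generated, the standard log link stands in for the custom vlog() link from the question, and TOD is left out so that the toy model is not saturated.

```r
# Three simulated days of 15-minute data (96 intervals per day)
set.seed(1)
n <- 96 * 3
oat <- 15 + 5 * sin(2 * pi * seq_len(n) / 96) + rnorm(n, sd = 0.5)
dat <- data.frame(oat = oat,
                  load = exp(1 + 0.03 * oat + rnorm(n, sd = 0.05)))

# Lagged load: the value observed 96 intervals (one day) earlier
dat$load_lag <- c(rep(NA, 96), head(dat$load, -96))

train  <- dat[97:192, ]    # day 2, which has a complete lag from day 1
newday <- dat[193:288, ]   # day 3, predicted from day 2 loads and its own oat

# Plain column names in the formula, subsetting done through data =
mod <- glm(load ~ load_lag + oat, family = Gamma(link = "log"), data = train)

# Because the formula uses column names, newdata is matched correctly
pred <- predict(mod, newdata = newday, type = "response")
head(pred)
```

Because the formula refers to columns rather than indexed vectors such as load[i], predict() returns one value per row of newdata instead of warning about mismatched row counts.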
