Zero-inflated GAM prediction with newdata in R

In a zero-inflated GAM (ziplss), I'm getting a warning when 1) I use new data and 2) the count portion has categorical variables that are NOT in the zero-inflation portion. There's a warning for every categorical variable not represented in the zero-inflation part.
Here's a reproducible example:
library(mgcv)
library(glmmTMB)
data(Salamanders)
Salamanders$x <- rnorm(nrow(Salamanders), 0, 10)
zipgam <- gam(list(count ~ spp * mined + s(x) + s(site, bs = "re"),
                   ~ spp),
              data = Salamanders, family = ziplss, method = "REML")
preds.response <- data.frame(Predict = predict(zipgam, type = "response"))
nd <- data.frame(x = 0, spp = "GP", mined = "yes", site = Salamanders$site[1])
nd$pred <- predict(zipgam, newdata = nd, exclude = "s(site)")
I haven't seen this mentioned anywhere, which is odd and tells me that I'm likely doing something wrong (otherwise this would be available in search results). Would appreciate any insight.

I think this is just an infelicity in the implementation. The warning I am seeing is:
Warning message:
In model.matrix.default(Terms[[i]], mf, contrasts = object$contrasts) :
variable 'mined' is absent, its contrast will be ignored
This is harmless (at least in this case; I haven't checked other cases). It is generated because there is only a single object$contrasts, and it contains details about mined. That variable is not present in the second linear predictor, so R warns that it is going to ignore the contrasts for mined, but this only happens when building the model matrix for the zero-inflation part of the model. The count part correctly uses the mined variable and the correct contrasts.
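A quick way to reassure yourself (a sketch using the objects from the question; predictions made through newdata should agree with the stored-data predictions despite the warning):
p_fit     <- predict(zipgam, type = "response")                 # predictions from the stored data
p_newdata <- suppressWarnings(
  predict(zipgam, newdata = Salamanders, type = "response")     # triggers the contrasts warning
)
all.equal(unname(p_fit), unname(p_newdata))                     # should be TRUE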
You could argue that having $contrasts be a list, with one element per linear predictor, would be a better design; the model matrix would then be created using:
model.matrix.default(Terms[[i]], mf, contrasts = object$contrasts[[i]])
but I have no idea if this would break everything else in mgcv.
Currently $contrasts for this model is just:
> zipgam$contrasts
$spp
[1] "contr.treatment"
$mined
[1] "contr.treatment"
$spp
[1] "contr.treatment"
which already shows some redundancy.

Related

Trouble in GAM model in R software

I am trying to run the following code on R:
m <- gam(Flp_pop ~ s(Flp_CO, bs = "cr", k = 30), data = data, family = poisson, method = "REML")
My dataset is shown in a screenshot (not reproduced here).
But when I try to execute, I get this error message:
"Error in if (abs(old.score - score) > score.scale * conv.tol) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)"
I am very new to R, maybe it is a very basic question. But does anyone know why this is happening?
Thanks!
The Poisson distribution has support on the non-negative integers and you are passing a continuous variable as the response. Here's an example with simulated data
library("mgcv")
library("gratia")
library("dplyr")
df <- data_sim("eg1", seed = 2) %>%  # simulate Gaussian response
  mutate(yabs = abs(y))              # make y non-negative
mp <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = poisson, method = "REML")
# fails
which reproduces the error you saw
Error in if (abs(old.score - score) > score.scale * conv.tol) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The warnings are of the form:
> warnings()[1]
Warning message:
In dpois(y, y, log = TRUE) : non-integer x = 7.384012
This indicates the problem: the model evaluates the probability mass of your response data under the estimated model, and evaluating it at the indicated non-integer value returns a mass of 0 plus the warning.
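The warning can be reproduced directly (a minimal illustration):
dpois(7.384012, lambda = 7.384012)  # warns "non-integer x = 7.384012" and returns 0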
If we'd passed the original Gaussian variable as the response, which includes negative values, the function would have errored out earlier:
mp <- gam(y ~ s(x2, bs = "cr"), data = df,
          family = poisson, method = "REML")
which raises this error:
Error in eval(family$initialize) :
  negative values not allowed for the 'Poisson' family
An immediate but not necessarily advisable solution is just to use the quasipoisson family
mq <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = quasipoisson, method = "REML")
which uses the same mean-variance relationship as the Poisson distribution but not the actual distribution, so we can get away with abusing it.
A better approach would be to ask yourself why you are trying to fit a model that is ostensibly for counts to a response that is a continuous (non-negative) variable.
If the answer is you had a count but then normalised it in some way (say by dividing by some measure of effort like area surveyed or length of observation time) then you should use an offset of the form + offset(log(effort_var)) added to the model formula, and use the original non-normalised integer variable as the response.
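For example (a sketch; count_var and effort_var are hypothetical column names standing in for your raw integer count and your measure of effort):
## count_var: raw (integer) count; effort_var: effort used to normalise it (both hypothetical names)
m_off <- gam(count_var ~ s(Flp_CO, bs = "cr", k = 30) + offset(log(effort_var)),
             data = data, family = poisson, method = "REML")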
If you really have a continuous response and the Poisson was an oversight, try fitting with family = Gamma(link = "log") or family = tw().
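For instance, with the simulated yabs from above (a sketch; the Gamma family needs a strictly positive response, while tw() also tolerates exact zeros):
mg <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = Gamma(link = "log"), method = "REML")
mt <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = tw(), method = "REML")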
If it's something else, you should edit your question to include that information; perhaps we can help here, or the question could be migrated to Cross Validated if the issue is more statistical in nature.

Error in brglm model with Backward elimination with Interaction: error in do.call("glm.control", control) : second argument must be a list

After fitting a model with glm I got this as a result:
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
After some research on Google, I tried with the brglm package. When I try to apply backward elimination on the model, I get the following error:
Error in do.call("glm.control", control) : second argument must be a list.
I searched on Google but I didn't find anything.
Here is my code with brglm:
library(mlbench)
#require(Amelia)
library(caTools)
library(mlr)
library(ciTools)
library(brglm)
data("BreastCancer")
data_bc <- BreastCancer
data_bc
head(data_bc)
dim(data_bc)
#Delete id column
data_bc<- data_bc[,-1]
data_bc
dim(data_bc)
str(data_bc)
# convert all factors columns to be numeric except class.
for (i in 1:9) {
  data_bc[, i] <- as.numeric(as.character(data_bc[, i]))
}
str(data_bc)
#convert class: benign and malignant to binary 0 and 1:
data_bc$Class<-ifelse(data_bc$Class=="malignant",1,0)
# now convert class to factor
data_bc$Class<- factor(data_bc$Class, levels = c(0,1))
str(data_bc)
model <- brglm(formula = Class ~ .^2, data = data_bc, family = "binomial",
               na.action = na.exclude)
summary(model)
#Backward Elimination:
final <- step(model, direction = "backward")
You can work around this by using the brglm2 package, which supersedes the brglm package anyway:
library(brglm2)
model <- glm(formula = Class ~ .^2, data = na.omit(data_bc), family = "binomial",
             na.action = na.fail, method = "brglmFit")
final <- step(model, direction = "backward")
length(coef(model)) ## 46
length(coef(final)) ## 42
setdiff(names(coef(model)), names(coef(final)))
## [1] "Cl.thickness:Epith.c.size" "Cell.size:Marg.adhesion"
## [3] "Cell.shape:Bl.cromatin" "Bl.cromatin:Mitoses"
Some general concerns about your approach:
stepwise reduction is one of the worst forms of model reduction (cf. lasso, ridge, elasticnet ...)
in the presence of missing data, model comparison (e.g. by AIC) is questionable, as different models will be fitted to different subsets of the data. Given that you would only lose a small fraction of your data by using na.omit() (compare nrow(data_bc) with sum(complete.cases(data_bc)); a quick check is sketched after this list), I would strongly recommend dropping observations with NA values from the data set before starting
it's also not clear to me that comparing penalized models via AIC is statistically appropriate (see here)
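For the second point, a quick check (sketch) of how many rows na.omit() would actually drop:
nrow(data_bc)                                  # total number of rows
sum(complete.cases(data_bc))                   # rows with no missing values
nrow(data_bc) - sum(complete.cases(data_bc))   # rows that na.omit() would remove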

Fit binomial GLM on probabilities (i.e. using logistic regression for regression not classification)

I want to use a logistic regression to actually perform regression and not classification.
My response variable is numeric between 0 and 1 and not categorical. This response variable is not related to any kind of binomial process. In particular, there is no "success", no "number of trials", etc. It is simply a real variable taking values between 0 and 1 depending on circumstances.
Here is a minimal example to illustrate what I want to achieve
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10))
fit <- glm(formula = resp ~ a + b,
           family = "binomial",
           data = dummy_data)
This code gives a warning then fails because I am trying to fit the "wrong kind" of data:
In eval(family$initialize) : non-integer #successes in a binomial glm!
Yet I think there must be a way since the help of family says:
For the binomial and quasibinomial families the response can be
specified in one of three ways: [...] (2) As a numerical vector with
values between 0 and 1, interpreted as the proportion of successful
cases (with the total number of cases given by the weights).
Somehow the same code works using "quasibinomial" as the family which makes me think there may be a way to make it work with a binomial glm.
I understand the likelihood is derived with the assumption that $y_i$ is in $\{0, 1\}$ but, looking at the maths, it seems like the log-likelihood still makes sense with $y_i$ in $[0, 1]$. Am I wrong?
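For concreteness, the per-observation log-likelihood I have in mind is $\ell_i = y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$, which remains well defined for any $y_i \in [0, 1]$, not only for $y_i \in \{0, 1\}$.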
This is because you are using the binomial family but supplying the wrong kind of response. With the binomial family, the response has to be either 0 or 1, not a probability value.
This code works fine, because the response is either 0 or 1.
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = sample(c(0, 1), 10, replace = TRUE, prob = c(.5, .5)))
fit <- glm(formula = resp ~ a + b,
           family = binomial(),
           data = dummy_data)
If you want to model the probability directly you should include an additional column with the total number of cases. In this case the probability you want to model is interpreted as the success rate given the number of cases in the weights column.
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10),
                         w = round(runif(10, 1, 11)))
fit <- glm(formula = resp ~ a + b,
           family = binomial(),
           data = dummy_data, weights = w)
You will still get the warning message, but you can ignore it, given these conditions:
resp is the proportion of 1's in n trials.
for each value in resp, the corresponding value in w is the number of trials.
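An equivalent way to pass the same information (a sketch; exact equivalence only holds when resp * w is already a whole number of successes, so round() is used here just to make the example runnable):
dummy_data$succ <- round(dummy_data$resp * dummy_data$w)  # whole-number successes
dummy_data$fail <- dummy_data$w - dummy_data$succ         # remaining trials are failures
fit_cbind <- glm(cbind(succ, fail) ~ a + b,
                 family = binomial(), data = dummy_data)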
From the discussion at Warning: non-integer #successes in a binomial glm! (survey packages), I think this can also be handled with another family, quasibinomial().
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10),
                         w = round(runif(10, 1, 11)))
fit2 <- glm(formula = resp ~ a + b,
            family = quasibinomial(),
            data = dummy_data, weights = w)

hurdle model prediction - count vs response

I'm working on a hurdle model and ran into a question I can't quite figure out. It was my understanding that the overall response prediction of the hurdle is the multiplication of the count prediction by the probability prediction. I.e., the overall response has to be smaller or equal to the count prediction. However, in my data, the response prediction is higher than the count prediction, and I can't figure out why.
Here's a similar result for a toy model (code adapted from here):
library("pscl")
data("RecreationDemand", package = "AER")
## model
m <- hurdle(trips ~ quality | ski, data = RecreationDemand, dist = "negbin")
nd <- data.frame(quality = 0:5, ski = "no")
predict(m, newdata = nd, type = "count")
predict(m, newdata = nd, type = "response")
Why is it that the counts are higher than the responses?
Edit: added a comparison to glm.nb
Also - I was under the impression that the count part of the hurdle model should give identical predictions to a count-model of only positive values. When I try that, I get completely different values. What am I missing??
library(MASS)
m.nb <- glm.nb(trips ~ quality, data = RecreationDemand[RecreationDemand$trips > 0,])
predict(m, newdata = nd, type = "count") ## hurdle
predict(m.nb, newdata = nd, type = "response") ## positive counts only
The last question is the easiest to answer. The "count" part of the hurdle model is not simply a standard count model (including a positive probability for zeros) but a zero-truncated count model (where zeros cannot occur).
Using the countreg package from R-Forge you can fit the model you attempted to fit with glm.nb in your example. (Alternatively, VGAM or gamlss could also be used to fit the same model.)
library("countreg")
m.truncnb <- zerotrunc(trips ~ quality, data = RecreationDemand,
                       subset = trips > 0, dist = "negbin")
cbind(hurdle = coef(m, model = "count"), zerotrunc = coef(m.truncnb), negbin = coef(m.nb))
## hurdle zerotrunc negbin
## (Intercept) 0.08676189 0.08674119 1.75391028
## quality 0.02482553 0.02483015 0.01671314
Up to small numerical differences the first two models are exactly equivalent. The non-truncated model, however, has to compensate for the lack of zeros by increasing the intercept and dampening the slope parameter, which is clearly not appropriate here.
As for the predictions, one can distinguish three quantities:
The expectation from the untruncated count part, i.e., simply exp(x'b).
The conditional/truncated expectation from the count part, i.e., accounting for the zero truncation: exp(x'b)/(1 - f(0)) where f(0) is the probability for 0 in that count part.
The overall expectation for the complete hurdle model, i.e., the probability for crossing the hurdle times the conditional expectation from 2.: exp(x'b)/(1 - f(0)) * (1 - g(0)) where g(0) is the probability for 0 in the zero hurdle part of the model.
See also Section 2.2 and Appendix C in vignette("countreg", package = "pscl") for more details and formulas. predict(..., type = "count") computes item 1 from above, while predict(..., type = "response") computes item 3 for a hurdle model and item 2 for a zerotrunc model.
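One way to check this numerically (a sketch; in pscl, predict(..., type = "zero") returns the multiplicative factor (1 - g(0))/(1 - f(0)), so the response prediction is the product of the "count" and "zero" predictions and exceeds the count prediction whenever that factor is greater than 1):
p_count <- predict(m, newdata = nd, type = "count")     # item 1: exp(x'b)
p_zero  <- predict(m, newdata = nd, type = "zero")      # (1 - g(0)) / (1 - f(0))
p_resp  <- predict(m, newdata = nd, type = "response")  # item 3: overall mean
all.equal(unname(p_resp), unname(p_count * p_zero))     # should be TRUE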

predict.glm() with three new categories in the test data (R)

I have a data set called data which has 481 092 rows.
I split data into two equal halves:
The first half (rows 1 to 240 546) is called train and was used for the glm();
the second half (rows 240 547 to 481 092) is called test and should be used to validate the model.
Then I started the regression:
testreg <- glm(train$returnShipment ~ train$size + train$color + train$price +
                 train$manufacturerID + train$salutation + train$state +
                 train$age + train$deliverytime,
               family = binomial(link = "logit"), data = train)
Now the prediction:
prediction <- predict.glm(testreg, newdata=test, type="response")
gives me an Error:
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
Now I know that these levels were omitted in the regression because it doesn't show any coefficients for these levels.
I have tried this: predict.lm() with an unknown factor level in test data . But it somehow doesn't work for me or I maybe just don't get how to implement it. I want to predict the dependent binary variable but of course only with the existing coefficients. The link above suggests to tell R that rows with new levels should just be called /or treated as NA.
How can I proceed?
Edit: suggested approach by Z. Li
I ran into a problem at the first step:
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
but mID125 is NULL! What have I done wrong?
It is impossible to obtain estimates for new factor levels in fixed-effect modelling, including linear models and generalized linear models. glm (as well as lm) keeps a record of which factor levels were present and used during model fitting; these can be found in testreg$xlevels.
Your model formula for model estimation is:
returnShipment ~ size + color + price + manufacturerID + salutation +
state + age + deliverytime
then predict complains about new factor levels 125, 136, 137 for manufacturerID. This means these levels are not in testreg$xlevels$manufacturerID and therefore have no associated coefficients for prediction. In this case, we have to drop this factor variable and use a prediction formula:
returnShipment ~ size + color + price + salutation +
state + age + deliverytime
However, the standard predict routine cannot take your customized prediction formula. There are commonly two solutions:
extract the model matrix and model coefficients from testreg, and manually predict the model terms we want by matrix-vector multiplication. This is what the link given in your post suggests doing;
reset the factor levels in test to any one of the levels that appeared in testreg$xlevels$manufacturerID, for example testreg$xlevels$manufacturerID[1]. That way, we can still use the standard predict for prediction.
Now, let's first pick up a factor level used for model fitting
xlevels <- testreg$xlevels$manufacturerID
mID125 <- xlevels[1]
Then we assign this level to your prediction data:
replacement <- factor(rep(mID125, length = nrow(test)), levels = xlevels)
test$manufacturerID <- replacement
And we are ready to predict:
pred <- predict(testreg, test, type = "link") ## don't use type = "response" here!!
In the end, we adjust this linear predictor by subtracting the factor estimate:
est <- coef(testreg)[paste0("manufacturerID", mID125)]
pred <- pred - est
(If mID125 happens to be the reference level of manufacturerID, it has no coefficient of its own and its contribution is already zero, so this adjustment can be skipped.)
Finally, if you want predictions on the original scale, apply the inverse of the link function:
testreg$family$linkinv(pred)
Update:
You mentioned that you ran into various troubles when trying the above solutions. Here is why.
Your code:
testreg <- glm(train$returnShipment~ train$size + train$color +
train$price + train$manufacturerID + train$salutation +
train$state + train$age + train$deliverytime,
family=binomial(link="logit"), data=train)
is a very bad way to specify your model formula. Using train$returnShipment, etc., ties the variables strictly to the data frame train, and you will have trouble in later prediction with other data sets, like test.
As a simple example for such drawback, we simulate some toy data and fit a GLM:
set.seed(0); y <- rnorm(50, 0, 1)
set.seed(0); a <- sample(letters[1:4], 50, replace = TRUE)
foo <- data.frame(y = y, a = factor(a))
toy <- glm(foo$y ~ foo$a, data = foo) ## bad style
> toy$formula
foo$y ~ foo$a
> toy$xlevels
$`foo$a`
[1] "a" "b" "c" "d"
Now, we see everything comes with a prefix foo$. During prediction:
newdata <- foo[1:2, ] ## take first 2 rows of "foo" as "newdata"
rm(foo) ## remove "foo" from R session
predict(toy, newdata)
we get an error:
Error in eval(expr, envir, enclos) : object 'foo' not found
The good style is to let the data argument of the function supply the variables:
foo <- data.frame(y = y, a = factor(a))
toy <- glm(y ~ a, data = foo)
then foo$ goes away.
> toy$formula
y ~ a
> toy$xlevels
$a
[1] "a" "b" "c" "d"
This would explain two things:
You mentioned in the comments that testreg$xlevels$manufacturerID gives you NULL;
The prediction error you posted
Error in model.frame.default(Terms, newdata, na.action=na.action, xlev=object$xlevels):
Factor 'train$manufacturerID' has new levels 125, 136, 137
complains train$manufacturerID instead of test$manufacturerID.
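For completeness, a sketch of the same model with a clean formula (the new-levels issue still needs one of the workarounds above, but xlevels will now be populated as expected):
testreg <- glm(returnShipment ~ size + color + price + manufacturerID +
                 salutation + state + age + deliverytime,
               family = binomial(link = "logit"), data = train)
testreg$xlevels$manufacturerID  # now non-NULL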
As you have divided your train and test samples based on row numbers, some factor levels of your variables are not equally represented in both the train and test samples.
You need to do stratified sampling to ensure that both the train and test samples have all factor levels represented. Use stratified from the splitstackshape package.
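A sketch using splitstackshape (this assumes data is the full data set and that manufacturerID is the factor that must appear in both halves; stratified() returns data.tables):
library(splitstackshape)
parts <- stratified(data, group = "manufacturerID", size = 0.5, bothSets = TRUE)
train <- parts[[1]]  # the stratified 50% sample
test  <- parts[[2]]  # the remaining rows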

Resources