Trouble in GAM model in R software - r

I am trying to run the following code in R:
m <- gam(Flp_pop ~ s(Flp_CO, bs = "cr", k = 30), data = data, family = poisson, method = "REML")
My dataset looks like this:
[screenshot of the dataset, not reproduced here]
But when I run the code, I get this error message:
"Error in if (abs(old.score - score) > score.scale * conv.tol) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)"
I am very new to R, so this may be a very basic question. Does anyone know why this is happening?
Thanks!

The Poisson distribution has support on the non-negative integers, but you are passing a continuous variable as the response. Here's an example with simulated data:
library("mgcv")
library("gratia")
library("dplyr")
df <- data_sim("eg1", seed = 2) %>% # simulate Gaussian response
  mutate(yabs = abs(y))             # make y non-negative
mp <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = poisson, method = "REML")
# fails
which reproduces the error you saw:
Error in if (abs(old.score - score) > score.scale * conv.tol) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)
The warnings are of the form:
r$> warnings()[1]
Warning message:
In dpois(y, y, log = TRUE) : non-integer x = 7.384012
This indicates the problem: the model evaluates the probability mass of your response data given the estimated model, and here it does so at the indicated non-integer value, which returns a mass of 0 (that is, -Inf on the log scale, since log = TRUE) plus the warning.
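The behaviour described above can be checked directly with dpois(). This is a minimal sketch that reuses the value from the warning; the object names p and lp are only for illustration.

```r
# Poisson mass at a non-integer value: R warns and returns a zero mass
p  <- suppressWarnings(dpois(7.384012, lambda = 7.384012))
lp <- suppressWarnings(dpois(7.384012, lambda = 7.384012, log = TRUE))
p   # 0
lp  # -Inf, the log of a zero mass

# At an integer value the mass is strictly positive
dpois(7, lambda = 7.384012) > 0  # TRUE
```

A log-mass of -Inf for every observation is presumably how the convergence check ends up comparing missing values, although that step is not traced here.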
If we'd passed the original Gaussian variable as the response, which includes negative values, the function would have errored out earlier:
mp <- gam(y ~ s(x2, bs = "cr"), data = df,
          family = poisson, method = "REML")
which raises this error:
r$> mp <- gam(y ~ s(x2, bs = "cr"), data = df,
              family = poisson, method = "REML")
Error in eval(family$initialize) :
  negative values not allowed for the 'Poisson' family
An immediate but not necessarily advisable solution is just to use the quasipoisson family:
mq <- gam(yabs ~ s(x2, bs = "cr"), data = df,
          family = quasipoisson, method = "REML")
which uses the same mean-variance relationship as the Poisson distribution but not the actual distribution, so we can get away with abusing it.
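As a small check of the shared mean-variance relationship, the variance functions stored in the two base R family objects can be compared. The vector mu is just an arbitrary set of means chosen for illustration.

```r
mu <- c(0.5, 2, 10)

v_pois  <- poisson()$variance(mu)       # variance equals the mean
v_quasi <- quasipoisson()$variance(mu)  # same variance function

v_pois
v_quasi
identical(v_pois, v_quasi)  # TRUE
```

The difference is that quasipoisson estimates a dispersion parameter and does not evaluate the Poisson mass function, which is why the non-integer response does not trip it up.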
A better approach is to ask yourself why you were trying to fit a model that is ostensibly for counts to a continuous (non-negative) response.
If the answer is that you had a count but then normalised it in some way (say, by dividing by some measure of effort such as area surveyed or length of observation time), then you should add an offset of the form + offset(log(effort_var)) to the model formula and use the original, non-normalised integer variable as the response.
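A minimal sketch of the offset approach, using simulated data rather than the data from the question. The column names count, effort and x, and the shape of the true effect, are assumptions made for the example.

```r
library("mgcv")

set.seed(1)
n <- 200
sim <- data.frame(x = runif(n), effort = runif(n, 1, 10))

# Raw integer counts: the expected count scales with the effort
sim$count <- rpois(n, lambda = sim$effort * exp(1 + sin(2 * pi * sim$x)))

# Model the raw counts and account for effort through a log offset,
# instead of modelling the non-integer rate count / effort
m_off <- gam(count ~ s(x, bs = "cr") + offset(log(effort)),
             data = sim, family = poisson, method = "REML")
summary(m_off)
```

With the log link, the offset fixes the coefficient of log(effort) at 1, so the smooth describes the rate per unit of effort while the response stays an integer count.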
If you really have a continuous response and the Poisson was an oversight, try fitting with family = Gamma(link = "log") or family = tw(). Note that the Gamma family needs strictly positive values, whereas tw() also allows exact zeros.
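A sketch of those two alternatives on a simulated, strictly positive continuous response. The data frame pos and its columns are made up for illustration, and which family suits real data is a modelling decision this example does not settle.

```r
library("mgcv")

set.seed(42)
n  <- 300
x  <- runif(n)
mu <- exp(1 + sin(2 * pi * x))

# Gamma-distributed response with mean mu (shape 2, so rate = 2 / mu)
pos <- data.frame(x = x, y = rgamma(n, shape = 2, rate = 2 / mu))

# Gamma with a log link
m_gamma <- gam(y ~ s(x, bs = "cr"), data = pos,
               family = Gamma(link = "log"), method = "REML")

# Tweedie with the power parameter estimated as part of fitting
m_tw <- gam(y ~ s(x, bs = "cr"), data = pos,
            family = tw(), method = "REML")

AIC(m_gamma, m_tw)
```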
If it's something else, you should edit your question to include that information; perhaps we can help here, or the question could be migrated to Cross Validated if the issue is more statistical in nature.

Related

Error when adjusting a GLM: Error in eval(family$initialize)

I am trying to fit the generalized linear model defined below.
Note that the response variable Var1 and the regressor Var2 both contain zeros, so a constant has been added to each to avoid problems when taking the log.
model = glm(Var1+2 ~ log(Var2+2) + offset(log(Var3/Var4)),
            family = gaussian(link = "log"), data = data2)
However, I get an error when producing the half-normal plot for the diagnostic analysis with the hnp function:
library(hnp)
hnp(model)
Gaussian model (glm object)
Error in eval(family$initialize) :
cannot find valid starting values: please specify some
To get around this, I tried supplying the diagnostic, simulation and fitting functions manually before building the plot, but the error message is still there.
dfun <- function(obj) resid(obj)
sfun <- function(n, obj) simulate(obj)[[1]]
ffun <- function(resp) glm(resp ~ log(Var2+2) + offset(log(Var3/Var4)),
family = gaussian(link = "log"), data = data2)
hnp(model, newclass = TRUE, diagfun = dfun, simfun = sfun, fitfun = ffun)
Error in eval(family$initialize) :
cannot find valid starting values: please specify some
Following some guidance I found, I also tried supplying starting values to initialise the estimation algorithm, both for the linear predictor and for the means, but this was not enough to solve the problem. The code is below:
fit = lm(Var1+2 ~ log(Var2+2) + offset(log(Var3/Var4)), data=data2)
coefficients(fit)
(Intercept) log(Var2+2)
32.961103 -8.283306
model = glm(Var1+2 ~ log(Var2+2) + offset(log(Var3/Var4)),
family = gaussian(link = "log"), start = c(32.96, -8.28), data = data2)
hnp(model)
Error in eval(family$initialize) :
cannot find valid starting values: please specify some
Note that the error persists even when implementing the half-normal plot manually:
dfun <- function(obj) resid(obj)
sfun <- function(n, obj) simulate(obj)[[1]]
ffun <- function(resp) glm(resp ~ log(Var2+2) + offset(log(Var3/Var4)),
family = gaussian(link = "log"), data = data2, start = c(32.96, -8.28))
hnp(model, newclass = TRUE, diagfun = dfun, simfun = sfun, fitfun = ffun)
Error in eval(family$initialize) :
cannot find valid starting values: please specify some
I also tried refitting the model after removing the zeros from the data, but the problem persists.
I suspect what you meant to fit is a log-transformed response variable against your predictors. It is worth reading up on the difference between a log-link GLM and a log-transformed response variable. Essentially, with a log link you model the log of the mean and assume the errors are additive on the original (exponentiated) scale, whereas log-transforming the response assumes additive errors on the log scale. I am not so familiar with hnp, but my guess is that there are problems simulating the response variable.
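One way the message in the question can arise, offered as an illustration rather than a diagnosis of hnp itself: with family = gaussian(link = "log") and no starting values, glm() rejects a response that contains non-positive values, and responses simulated from a Gaussian model can be negative. The toy vectors below are made up for the example.

```r
# Toy response with one negative value, as a Gaussian simulation could produce
y_sim <- c(-0.5, 1.2, 2.3, 3.1, 4.8)
x <- 1:5

# Capture the error message instead of stopping the script
msg <- tryCatch(
  {
    glm(y_sim ~ x, family = gaussian(link = "log"))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Whether this is exactly what happens inside hnp for the data in the question is an assumption; it only shows that the family initialisation raises this error for such responses.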
If I run your regression like this using the data provided, it looks ok
data2$Y = with(data2, log((Var1 + 2) / (Var3 / Var4)))  # divide by the same exposure, Var3/Var4, used in the offset
model = glm(Y ~ log(Var2+2), data = data2)
hnp(model)
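To make the distinction between a log link and a log-transformed response concrete, here is a small sketch on simulated data. The data-generating model, with multiplicative errors, and all object names are assumptions for illustration and are unrelated to data2.

```r
set.seed(123)
n <- 200
x <- runif(n, 1, 5)

# Multiplicative errors: additive and Gaussian on the log scale
y <- exp(0.5 + 0.3 * x + rnorm(n, sd = 0.2))
sim <- data.frame(x = x, y = y)

# Log link: log(E[y]) is linear in x, errors are additive on the scale of y
fit_link <- glm(y ~ x, family = gaussian(link = "log"), data = sim)

# Log-transformed response: E[log(y)] is linear in x, errors additive on the log scale
fit_trans <- lm(log(y) ~ x, data = sim)

rbind(log_link = coef(fit_link), log_response = coef(fit_trans))
```

The slopes are usually close in a setting like this, but the two models make different assumptions about the error structure, so their residual diagnostics can look quite different.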

zero-inflated GAM prediction with newdata

In a zero-inflated GAM (ziplss), I'm getting a warning when 1) I use new data and 2) the count portion has categorical variables that are NOT in the zero-inflation portion. There's a warning for every categorical variable not represented in the zero-inflation part.
Here's a reproducible example:
library(mgcv)
library(glmmTMB)
data(Salamanders)
Salamanders$x <- rnorm(nrow(Salamanders), 0, 10)
zipgam <- gam(list(count ~ spp * mined + s(x) + s(site, bs = "re"),
                   ~ spp),
              data = Salamanders, family = ziplss, method = "REML")
preds.response <- data.frame(Predict = predict(zipgam, type = "response"))
nd <- data.frame(x = 0, spp = "GP", mined = "yes", site = Salamanders$site[1])
nd$pred <- predict(zipgam, newdata = nd, exclude="site")
I haven't seen this mentioned anywhere, which is odd and tells me that I'm likely doing something wrong (otherwise this would be available in search results). Would appreciate any insight.
I think this is just an infelicity in the implementation. The warning I am seeing is:
Warning message:
In model.matrix.default(Terms[[i]], mf, contrasts = object$contrasts) :
variable 'mined' is absent, its contrast will be ignored
This is harmless (at least in this case; I haven't checked other cases). It arises because there is only a single object$contrasts, which contains details about mined, but this variable is not present in the second linear predictor. R therefore warns that it is going to ignore the contrasts for mined, but only when building the model matrix for the zero-inflation part of the model. The count part correctly uses the mined variable and the correct contrasts.
You could argue that having $contrasts be a list, one per linear predictor would be a better design and then the model matrix would be created using:
model.matrix.default(Terms[[i]], mf, contrasts = object$contrasts[[i]])
but I have no idea if this would break everything else in mgcv.
Currently $contrasts for this model is just:
> zipgam$contrasts
$spp
[1] "contr.treatment"
$mined
[1] "contr.treatment"
$spp
[1] "contr.treatment"
which already shows some redundancy.
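The same warning can be reproduced outside mgcv with base R's model.matrix(), which may help show that it is only about an unused entry in the contrasts list. The tiny data frame below is invented for the example.

```r
d <- data.frame(
  spp   = factor(c("GP", "PR", "GP", "PR")),
  mined = factor(c("yes", "no", "no", "yes"))
)
ctr <- list(spp = "contr.treatment", mined = "contr.treatment")

# The formula uses only spp, but the contrasts list also names mined
msg <- tryCatch(
  {
    model.matrix(~ spp, data = d, contrasts.arg = ctr)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
msg
```

Apart from the warning, the model matrix is built as usual for the variables that are present.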

Fit binomial GLM on probabilities (i.e. using logistic regression for regression not classification)

I want to use a logistic regression to actually perform regression and not classification.
My response variable is numeric between 0 and 1 and not categorical. This response variable is not related to any kind of binomial process. In particular, there is no "success", no "number of trials", etc. It is simply a real variable taking values between 0 and 1 depending on circumstances.
Here is a minimal example to illustrate what I want to achieve
dummy_data <- data.frame(a=1:10,
b=factor(letters[1:10]),
resp = runif(10))
fit <- glm(formula = resp ~ a + b,
family = "binomial",
data = dummy_data)
This code gives a warning then fails because I am trying to fit the "wrong kind" of data:
In eval(family$initialize) : non-integer #successes in a binomial glm!
Yet I think there must be a way since the help of family says:
For the binomial and quasibinomial families the response can be
specified in one of three ways: [...] (2) As a numerical vector with
values between 0 and 1, interpreted as the proportion of successful
cases (with the total number of cases given by the weights).
Somehow the same code works using "quasibinomial" as the family which makes me think there may be a way to make it work with a binomial glm.
I understand the likelihood is derived with the assumption that $y_i$ is in $\{0, 1\}$ but, looking at the maths, it seems like the log-likelihood still makes sense with $y_i$ in $[0, 1]$. Am I wrong?
This is because you are using the binomial family with the wrong kind of response. With the binomial family, the outcome has to be either 0 or 1, not a probability value.
This code works fine, because the response is either 0 or 1.
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = sample(c(0, 1), 10, replace = TRUE, prob = c(.5, .5)))
fit <- glm(formula = resp ~ a + b,
           family = binomial(),
           data = dummy_data)
If you want to model the probability directly, you should include an additional column with the total number of cases. The probability you want to model is then interpreted as the success rate given the number of cases in the weights column.
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10), w = round(runif(10, 1, 11)))
fit <- glm(formula = resp ~ a + b,
           family = binomial(),
           data = dummy_data, weights = w)
You will still get the warning message, but you can ignore it, given these conditions:
resp is the proportion of 1's in n trials.
for each value in resp, the corresponding value in w is the number of trials.
From the discussion at "Warning: non-integer #successes in a binomial glm! (survey packages)", I think we can solve it with another family function, quasibinomial() (see ?quasibinomial).
dummy_data <- data.frame(a = 1:10,
                         b = factor(letters[1:10]),
                         resp = runif(10), w = round(runif(10, 1, 11)))
fit2 <- glm(formula = resp ~ a + b,
            family = quasibinomial(),
            data = dummy_data, weights = w)
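As a sketch of what the switch to quasibinomial changes in practice, the example below fits both families to a simulated fractional response. The data are made up and use a single continuous predictor so that every coefficient is estimable.

```r
set.seed(1)
n <- 50
d <- data.frame(x = runif(n))

# Fractional response strictly inside (0, 1), not derived from counts
d$resp <- plogis(-1 + 2 * d$x + rnorm(n, sd = 0.5))

# binomial() warns about non-integer successes, so the warning is silenced here
fit_b <- suppressWarnings(glm(resp ~ x, family = binomial(), data = d))

# quasibinomial() fits the same mean model without that warning
fit_q <- glm(resp ~ x, family = quasibinomial(), data = d)

all.equal(coef(fit_b), coef(fit_q))  # TRUE: identical point estimates
summary(fit_q)$dispersion            # estimated rather than fixed at 1
```

The point estimates agree because both families share the same link and variance function. The standard errors differ because quasibinomial estimates the dispersion from the data.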

error in gamma link specification for lmer

I'm trying to fit a mixed-effects model with a gamma distribution. The most basic model has one fixed predictor and one random effect. No matter which link I specify (I've tried log, identity and inverse), I obtain the following error. My real data has zeros in Y, but even when I use simulated data with only positive Y, as below, it throws the same error.
mockdf = data.frame(y = rnorm(100,77,6.5), x1 = sample(letters,100,replace = T), x2 = seq(1900,1999,1))
mod = lmer(y ~ (1|x1) + x2, family = gamma(link = 'identity'), na.action = na.exclude, data = mockdf)
Error in gamma(link = "identity") :
supplied argument name 'link' does not match 'x'
I searched through SO and couldn't find another person who ran into this error. Is my syntax incorrect?
Thanks for your help.

Using CARET together with GAM ("gamSpline" method) in R Poisson Regression

I am trying to use caret package to tune 'df' parameter of a gam model for my cohort analysis.
With the following data:
cohort = 1:60
age = 1:26
grid = data.frame(expand.grid(age = age, cohort = cohort))
size = data.frame(cohort = cohort, N = sample(100:150,length(cohort), replace = TRUE))
df = merge(grid, size, by = "cohort")
log_k = -3 + log(df$N) - 0.5*log(df$age) + df$cohort*(df$cohort-30)*(df$cohort-50)/20000 + runif(nrow(df),min = 0, max = 0.5)
df$conversion = rpois(nrow(df),exp(log_k))
Explanation of the data: cohort is the time of arrival of the potential customers. N is the number of potential customers that arrived at that time. conversion is the number of those potential customers who 'converted' (bought something). age is the age (time since arrival) of the cohort when the conversion took place. For a given cohort there are fewer conversions as age grows; this effect follows a power law.
But the total conversion rate of each cohort can also change slowly over time (cohort number). Thus I want a smoothing spline of the time variable in my model.
I can fit a gam model from package gam:
library(gam)
fit = gam(conversion ~ log(N) + log(age) + s(cohort, df = 4), data = df, family = poisson)
fit
> Call:
> gam(formula = conversion ~ log(N) + log(age) + s(cohort, df = 4),
> family = poisson, data = df)
> Degrees of Freedom: 1559 total; 1553 Residual
> Residual Deviance: 1869.943
But if I try to train the model using the caret package:
library(caret)
fitControl = trainControl(verboseIter = TRUE)
fit.crt = train(conversion ~ log(N) + log(age) + s(cohort,df),
data = df, method = "gamSpline",
trControl = fitControl, tune.length = 3, family = poisson)
I get this error :
+ Resample01: df=1
model fit failed for Resample01: df=1 Error in as.matrix(x) : object 'N' not found
- Resample01: df=1
+ Resample01: df=2
model fit failed for Resample01: df=2 Error in as.matrix(x) : object 'N' not found .....
Does anyone know what I'm doing wrong here?
Thanks
There are two things wrong with your code.
1. The train function can be a bit tedious depending on the method you use (as you have noticed). With method = "gamSpline", train adds a smooth term to every independent term in the formula. So it converts your terms to s(log(N), df), s(log(age), df) and s(s(cohort, df), df). But s(s(cohort, df), df) does not really make sense, so you must change s(cohort, df) to cohort.
2. I am not sure why, but train with method = "gamSpline" does not like functions (e.g. log) in the formula. I think this is because the method already applies s() to your variables. You can solve this by applying the log to your variables beforehand, for example df$N <- log(df$N), or logN <- log(df$N) and then using logN as the variable. And of course, do the same for age.
My guess, based on the code you provided, is that you don't want this method to apply a smoothing term to all your independent variables. I am not sure whether that is possible or, if it is, how to do it.
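The pre-transformation suggested above could look like the sketch below, which rebuilds the question's data and stores the logs as ordinary columns. The column names logN and logAge are hypothetical, and the caret call itself is deliberately left out because it has not been verified here.

```r
set.seed(1)

# Rebuild the data from the question
grid <- expand.grid(age = 1:26, cohort = 1:60)
size <- data.frame(cohort = 1:60, N = sample(100:150, 60, replace = TRUE))
df <- merge(grid, size, by = "cohort")
log_k <- -3 + log(df$N) - 0.5 * log(df$age) +
  df$cohort * (df$cohort - 30) * (df$cohort - 50) / 20000 +
  runif(nrow(df), min = 0, max = 0.5)
df$conversion <- rpois(nrow(df), exp(log_k))

# Apply the log transforms up front and keep them as plain columns
df$logN   <- log(df$N)
df$logAge <- log(df$age)

# Formula with bare column names only: no log() and no s() calls
form <- conversion ~ logN + logAge + cohort
all.vars(form)
```

The formula form and the data frame df could then be handed to train() with method = "gamSpline" in place of the original formula.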
Hope this helps.
EDIT: If you want a more elegant solution than the one I provided in point 2, make sure to read the comment by @topepo. If I understand it correctly, that suggestion also allows you to apply the s() function only to the variables you want.

Resources