I am trying to model count data of birds in forest fragments of different size. As the plots in which the surveys were conducted also differ in size among fragments, I would like to add survey plot size as an offset term to convert counts to densities.
As I understand from previous questions on this site, this is usually done for Poisson models because they have a log link. However, the GAM (mgcv package) I am running with family ziP has link="identity". As far as I understand, with an identity link the offset is simply subtracted from the response rather than giving the desired response/offset rate.
However, when I run the model with the offset term and plot the results, it seems to give the result I want (I compared the plot from a Poisson model with that from the ziP model).
This is the model I used, where Guild reflects different feeding guilds, logArea is the log of fragment size, and Study is my random effect (the data come from several studies).
gam1 <- gam(Count ~ Guild + s(logArea, by = Guild) + s(Study, bs = "re"), offset = lnTotalPlotsize, family = ziP(), data = Data_ommited2)
Can someone explain how GAM handles offset terms in this case (ziP model with identity link)? Is it really resulting in the desired response/offset rate or is it doing something else?
Thanks for your help!
Regards,
Robert
Whilst only the identity link is provided, the linear predictor models the log of the expected count. The linear predictor is therefore on the log scale, so your use of the offset is OK.
Basically, the model is parameterized in terms of the log of the response rather than the response itself, hence the identity link. The same applies to the ziplss() family.
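One practical note: if I recall the mgcv documentation correctly, an offset supplied through the offset argument is ignored by predict.gam(), whereas an offset() term inside the formula is honoured at prediction time. A minimal sketch of the equivalent fit with the offset moved into the formula, reusing your variable names:
library(mgcv)
## same model as above, but with the offset inside the formula so that
## predict() accounts for it (variable names taken from your call)
gam1 <- gam(Count ~ Guild + s(logArea, by = Guild) + s(Study, bs = "re") +
              offset(lnTotalPlotsize),
            family = ziP(), data = Data_ommited2)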
Related
I'm trying to run GAMs to analyze some temperature data. I have remote cameras and external temperature loggers, and I'm trying to model the difference in the temperatures recorded by them (camera temperature minus logger temperature). Most of the time the cameras record higher temperatures, but sometimes the logger returns the higher temperature, in which case the difference is negative. The direction of the difference is something I care about, so I need to allow negative values in the response. My explanatory variables are percent canopy cover (quantitative), direct and diffuse radiation (quantitative), and camera direction (ordered factor) as fixed effects, as well as the camera/logger pair (factor) as a random effect.
I had mostly been using the gam() function in mgcv to run my models. I'm using a scat distribution since my data is heavy-tailed. My model code is as follows:
gam(f1, family = scat(link = "identity"), data = d)
I wanted to try bam() since I have 60,000 data points (one temperature observation per hour of the day for several months). The gam() models run fine, though they take a while. But the exact same model formulas run through bam() return negative deviance-explained values. I also get 50+ warning messages that all say:
In y - mu : longer object length is not a multiple of shorter object length
Running gam.check() on the fitted models returns identical residual plots. The parametric coefficients, smooth terms, and R-squared values are also almost identical. The only thing that has noticeably changed is the deviance explained, and it has changed to something completely nonsensical: the bam() models range from -61% to -101% deviance explained.
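For reference, here is a minimal sketch of the comparison I am describing (mg and mb are placeholder names for the gam() and bam() fits):
## mg = gam() fit, mb = bam() fit of the same formula (placeholder names)
c(gam = summary(mg)$dev.expl, bam = summary(mb)$dev.expl)  # deviance explained
1 - deviance(mb) / mb$null.deviance                        # recomputed by hand
gam.check(mg)                                              # residual diagnostics
gam.check(mb)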
I'll admit that I'm brand new to GAMs. I know just enough to know that the residual plots matter more than the deviance-explained values, and the residual plots look good (far better than with a Gaussian distribution, at least). More than anything, I'm curious about what's going on within bam() specifically that causes it to issue that warning and return a negative deviance explained. Is there some extra argument I can set in bam(), or some further manipulation of my data, to prevent this? Or can I ignore it and move forward, since the residual plots look good and the outputs are mostly the same?
Thanks in advance for any help.
I am trying to investigate the relationship between some Google Trends data and stock prices.
I performed the augmented Dickey-Fuller (ADF) test and the KPSS test to make sure that both time series are integrated of the same order, I(1).
However, after I took first differences, the ACF plot was completely insignificant (apart from the value of 1 at lag zero, of course), which told me that the differenced series behave like white noise.
Nevertheless, I estimated a VAR model, the output of which is attached.
As you can see, only one constant is significant. I have already read that because Stocks.ts.l1 is not significant in the equation for GoogleTrends and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there is no dynamic relationship between the two time series, and each can also be modelled independently with an AR(p) model.
I checked the residuals of the model. They roughly fulfill the assumptions (the residuals are not perfectly normally distributed but acceptable, there is homoscedasticity, the model is stable, and there is no autocorrelation).
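For reference, a sketch of how I ran the estimation and these checks with the vars package (the series names are placeholders):
## sketch of the estimation and residual checks described above
## (GoogleTrends.ts and Stocks.ts are placeholder names for the raw series)
library(vars)
d1  <- data.frame(GoogleTrends = diff(GoogleTrends.ts), Stocks = diff(Stocks.ts))
p   <- VARselect(d1, lag.max = 10, type = "const")$selection["AIC(n)"]
fit <- VAR(d1, p = p, type = "const")
serial.test(fit)     # residual autocorrelation
normality.test(fit)  # multivariate normality
arch.test(fit)       # heteroscedasticity
roots(fit)           # stability: all moduli should be < 1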
But what does it mean if no coefficient is significant, as in the Stocks equation? Is the model simply inappropriate for the data because the data don't follow an AR process? Or is the model so poor that a constant would describe the data better than the model? Or a combination of both? Any suggestions on how I could proceed with my analysis?
Thanks in advance
I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality; the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year * Treatment, random = ~ 1 | Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data, start = coef(lm(Log ~ Year * Treatment)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Now, do you transform the data in the Excel file or in the R code (Number1 <- log(Number)) before running this model? Does link = "log" imply that the data are already log-transformed, or that the model will transform them?
If you have data with zeros, is it acceptable to add 1 to all observations so they are greater than zero before log-transforming: Number1 <- log(Number + 1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms above ...). In general you shouldn't include the same term in both the random and the fixed effects, although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects, so this might be correct.
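For concreteness, a sketch of what that exception could look like (assuming Year can sensibly be treated as numeric):
## sketch: Year as a continuous fixed-effect covariate *and* a random-effect
## grouping factor; as.numeric(Year) treats year as a linear trend
## (assumes Year is numeric or its levels are ordered sensibly)
library(lme4)
model <- glmer(Number ~ as.numeric(Year) * Treatment + (1 | Year/Treatment),
               data = data, family = poisson)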
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
Interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable - but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
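As a sketch, that comparison for two nested glmer fits would look like this (model names are placeholders; the structure follows the sketch above):
## sketch: likelihood ratio test between two nested glmer fits
## (the reduced model drops the interaction; names are placeholders)
m_full    <- glmer(Number ~ as.numeric(Year) * Treatment + (1 | Year/Treatment),
                   data = data, family = poisson)
m_reduced <- glmer(Number ~ as.numeric(Year) + Treatment + (1 | Year/Treatment),
                   data = data, family = poisson)
anova(m_reduced, m_full)  # chi-squared likelihood ratio test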
I was tasked with developing a regression model looking at student enrollment in different programs. This is a very nice, clean data set where the enrollment counts follow a Poisson distribution well. I fit a model in R (using both a GLM and a zero-inflated Poisson). The resulting residuals seemed reasonable.
However, I was then instructed to change the count of students to a "rate", calculated as students / school_population (each school has its own population). This is no longer a count variable but a proportion between 0 and 1, considered the "proportion of enrollment" in a program.
This "rate" (students/population) is no longer Poisson, but is certainly not normal either. So, I'm a bit lost as to the appropriate distribution, and subsequent model to represent it.
A log-normal distribution seems to fit this rate well; however, I have many zero values, so it won't actually fit.
Any suggestions on the best form of distribution for this new parameter, and how to model it in R?
Thanks!
As suggested in the comments, you could keep the Poisson model and do it with an offset:
glm(response ~ predictor1 + predictor2 + predictor3 + ... + offset(log(population)),
    family = poisson, data = ...)
Or you could use a binomial GLM, either
glm(cbind(response, pop_size - response) ~ predictor1 + ... , family = binomial,
    data = ...)
or
glm(response/pop_size ~ predictor1 + ... , family = binomial,
    weights = pop_size,
    data = ...)
The latter form is sometimes more convenient, although less widely used.
Be aware that in general switching from Poisson to binomial will change the link function from log to logit, although you can use family = binomial(link = "log") if you prefer.
Zero-inflation might be easier to model with the Poisson + offset combination (I'm not sure if the pscl package, the most common approach to ZIP, handles offsets, but I think it does), which will be more commonly available than a zero-inflated binomial model.
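A sketch of that combination with pscl (untested here; the offset goes in the count part of the formula, and "| 1" gives an intercept-only zero-inflation part):
## sketch: zero-inflated Poisson with a log-population offset via pscl
library(pscl)
zeroinfl(response ~ predictor1 + predictor2 + offset(log(population)) | 1,
         dist = "poisson", data = ...)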
I think glmmADMB will do a zero-inflated binomial model, but I haven't tested it.
So I am using survreg, and I expect my predicted results to obey a lower bound of 0, but they frequently come out negative. I think it is somehow estimating a linear result instead of the survival model I'm trying to create. Here's what I've done:
linear.first.stage<-lm(y ~ x, data=clip)
First, I estimated a linear model to obtain starting values and speed up the estimation; it fails to converge without this first stage. I then create a survival object, following the code from ?survreg, which provides an explicit example of a tobit regression; I duplicated this below for x and y. In my data set, y can only be observed at non-negative values, but when it is positive it tends to be distributed roughly normally around 200 or so with an SD of about 20. x may take any value and isn't theoretically bounded by any particular number that immediately comes to mind.
surv_y<-Surv(clip$y, clip$y>0,type="left")
first.stage <- survreg(surv_y ~ x, init = coef(linear.first.stage), dist = "gaussian", data = clip)
I run the survival regression, which should be equivalent to a tobit. To confirm that my interpretation of events was the same, I ran the following:
test<-tobit(y~x, left=0, right=Inf, dist="gaussian", data=clip)
p_test<-predict(test)
p<-predict(first.stage)
plot(p_test-p)
The plot shows a flat line at zero, so upon visual inspection these commands are identical, as they should be. However, in both cases values below 0 are predicted. This is problematic because I have stated that the lower bound of the observable data is 0. My expectation is that all predicted values should be at least 0.
I have tried predicting with types "link", "response", and "linear", but to no avail. I assume the predict command is producing the outcomes as if the censoring were not occurring. How do I produce predictions that obey the lower bound of 0?
References:
Running predict() after tobit() in package AER
https://stats.stackexchange.com/questions/11440/standardized-residuals-of-a-tobit-model-in-r
You probably need to scale the prediction up in the sense that is described here by one of the authors of the package.
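A sketch of what that scaling looks like for a Gaussian tobit left-censored at zero, using the objects from the question (E[max(0, y)] = mu*pnorm(mu/sigma) + sigma*dnorm(mu/sigma)):
## sketch: expected value of the censored outcome for a gaussian tobit
## left-censored at 0 (test and clip as defined in the question)
mu    <- predict(test)                      # latent (uncensored) mean
sigma <- test$scale                         # estimated scale
z     <- mu / sigma
e_y   <- pnorm(z) * mu + sigma * dnorm(z)   # always >= 0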
Answer: Tobit is not the right regression type for my problem. Tobit predicts what the result would be in the absence of the censoring.
Clarification: I restructured my estimation process to reflect a zero-inflated or hurdle model. Tobit is for censored data: it says there exists a non-zero result, but we only observe 0 because the information is hidden somehow. For example, women's wages are a classic Tobit case, because married women who choose not to work still have a reservation wage and still have some (invisible) return to effort doing unpaid labor of whatever type. Zero-inflated or hurdle models indicate that the result is truly zero: no crimes occurred, no widgets produced. They more accurately reflect my situation.
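For what it's worth, a sketch of the two-part (hurdle-style) structure I moved to, for a continuous outcome with true zeros (y and x in clip, as above):
## sketch: two-part ("hurdle") model for a continuous outcome with true zeros
zero_part <- glm(I(y > 0) ~ x, family = binomial, data = clip)   # does anything occur?
pos_part  <- lm(y ~ x, data = subset(clip, y > 0))               # how much, given y > 0
e_y       <- predict(zero_part, type = "response") *
             predict(pos_part, newdata = clip)                   # combined expectation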