Poisson regression on gravity model - r

For a university project, I am trying to fit a regression model relating demand to a number of independent variables. I tried to include a small example, but it didn't work as a figure (as I am new to this). Instead, see the following link to view a sample of the dataset that I am using:
In this table, the first column indicates country pairs, columns 2 to 6 are independent variables, and the final column is the dependent variable. What I would like to do is perform regression analysis, assuming that this data can be described by a gravity equation.
I know that people frequently use log-linearisation to solve this. However, as I am dealing with zeros in my data (and don't want to distort it by adding small constants), and as I assume there is heteroskedasticity in the data, I would like to use a different method. Based on what Santos Silva and Tenreyro (2006) describe in their paper "The Log of Gravity", I would like to use Poisson pseudo-maximum-likelihood estimation.
However, I am fairly new to R (and to statistical software in general), and I can't figure out how to do this. Can anybody help me with this? The only thing I've found so far is that the poisson and quasipoisson families of the glm function could be used (https://stat.ethz.ch/pipermail/r-help/2010-September/252476.html).
I've searched through the documentation on glm, but I don't understand how to use the glm function to fit this gravity model. How do I specify that I want the model in the form:
DEM = RP^alpha1 * GDPC_O^alpha2 * GDPC_D^alpha3 * POP_O^alpha4.... and then use the regression to solve for the different alphas?

Hard to say precisely without more detail, but
glm(DEM ~ log(RP) + log(GDPC_O) + log(GDPC_D) + log(POP_O),
    data = your_data,
    family = quasipoisson(link = "log"))
should work reasonably well. The intercept will be the log of the proportionality constant; the other coefficients will be the exponents of the respective terms. (This works because the log link says that log(DEM) = beta_0 + beta_1*log(RP) + ...; if you exponentiate both sides you get DEM = exp(beta_0) * exp(log(RP)*beta_1) * ..., i.e. DEM = C * RP^beta_1 * ....)
PS it is not necessary, but it may be helpful for interpretation to scale and center your predictor variables.
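To make this concrete, here is a minimal, self-contained sketch with simulated data; the variable names (DEM, RP, GDPC_O, GDPC_D, POP_O) follow the question, while the sample size and true exponents are made up:
## Sketch only: simulated gravity-type data to show that the quasi-Poisson
## fit recovers the exponents; names follow the question, values are made up.
set.seed(42)
n <- 200
your_data <- data.frame(RP     = runif(n, 0.5, 2),
                        GDPC_O = runif(n, 1, 50),
                        GDPC_D = runif(n, 1, 50),
                        POP_O  = runif(n, 1, 100))
## true model: DEM = C * RP^a1 * GDPC_O^a2 * GDPC_D^a3 * POP_O^a4
mu <- with(your_data, 2 * RP^-1.5 * GDPC_O^0.8 * GDPC_D^0.6 * POP_O^0.3)
your_data$DEM <- rpois(n, mu)   # counts; zeros are allowed

fit <- glm(DEM ~ log(RP) + log(GDPC_O) + log(GDPC_D) + log(POP_O),
           data = your_data,
           family = quasipoisson(link = "log"))
summary(fit)       # slopes estimate alpha1..alpha4
exp(coef(fit)[1])  # back-transformed intercept: the proportionality constant C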

Related

Syntax of how to code a nonlinear mixed effects model in R with nlme package

I am having difficulty with programming a nonlinear mixed effects model in R using the nlme package. I am not concerned with fitting to data at this point, rather I just want to make sure I have the model coded correctly. The model I am trying to code looks like this in equation form:
y = (a + X)(1 - exp(-bZ))^(c+dX) + e
where X is the random effect varying across a single grouping level we can call g1, Z is a variable in the data, and e is described by N(0, LZ) where L is a parameter estimated from the data.
I am struggling to understand the correct coding syntax, specifically surrounding the (c+dX) and the error structure.
From my best understanding, I can write the code for a simpler equation if we remove the d parameter, leaving the equation as:
y = (a + X)(1 - exp(-bZ))^(c+X) + e
m1 <- nlme(y ~ a*(1 - exp(-b*Z))^c, fixed = a + b + c ~ 1, random = a + c ~ 1 | g1, start = c(a = 1, b = 1, c = 1))
However, I'm not sure how to actually program the equation I want with c+dX. Another concern I have is that the X random effect should be the same random effect, just used in two places in the equation. Is this what will give me that? And lastly, for the error structure, I am completely lost on how to include that in the code, and I become even more confused when reading the R documentation. Any and all help would be greatly appreciated, as I'm certain this is just a lack of understanding on my part. Thank you!
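This does not address sharing the same random effect across two parts of the equation, but as a rough sketch of the error structure alone: if e is meant to have variance proportional to Z, nlme's varFixed variance function passed through the weights argument can express that. The model below is the simplified one above, and dat is a placeholder name for the data frame:
library(nlme)
## Sketch only: the simplified model plus a variance structure in which
## Var(e) is proportional to Z, i.e. e ~ N(0, L*Z).
m2 <- nlme(y ~ a*(1 - exp(-b*Z))^c,
           fixed   = a + b + c ~ 1,
           random  = a + c ~ 1 | g1,
           weights = varFixed(~Z),
           start   = c(a = 1, b = 1, c = 1),
           data    = dat)   # 'dat' is a placeholder for your data frame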

Using offset in GAM zero inflated poisson (ziP) model

I am trying to model count data of birds in forest fragments of different sizes. As the plots in which the surveys were conducted also differ in size among fragments, I would like to add survey plot size as an offset term to convert counts to densities.
As I understand from previous questions on this site, this is generally done for Poisson models, as these have a log link. The GAM (mgcv package) I am running with family ziP has link="identity". As far as I understand, in such cases the offset term will be subtracted from the response, rather than giving the desired response/offset rate.
However, when I run the model with the offset term and plot the results it seems to be giving the result I want (I compared the plot for a poisson model with the ziP model).
This is the model I used, where Guild reflects different feeding guilds, logArea is the log of fragment size, and Study is my random effect (the data come from several studies).
gam1 <- gam(Count ~ Guild + s(logArea, by = Guild) + s(Study, bs = "re"),
            offset = lnTotalPlotsize, family = ziP(), data = Data_ommited2)
Can someone explain how GAM handles offset terms in this case (ziP model with identity link)? Is it really resulting in the desired response/offset rate or is it doing something else?
Thanks for your help!
Regards,
Robert
Whilst only the identity link is provided, the linear predictor returns the log of the expected count. As such the linear predictor is on the log scale and your use of the offset is OK.
Basically, the model is parameterized for the log of the response rather than the response itself, hence identity link functions are used. This is the same as for the ziplss() family.
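For reference, here is a sketch of the same model with the offset written inside the formula rather than passed as the offset argument; all names are taken from the question:
library(mgcv)
## Sketch only: the question's model with the offset moved into the formula,
## so it is added to the (log-scale) linear predictor.
gam1 <- gam(Count ~ Guild + s(logArea, by = Guild) + s(Study, bs = "re") +
              offset(lnTotalPlotsize),
            family = ziP(), data = Data_ommited2)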

Regression: Generalized Additive Model

I have started to work with GAM in R and I’ve acquired Simon Wood’s excellent book on the topic. Based on one of his examples, I am looking at the following:
library(mgcv)
data(trees)
ct1 <- gam(log(Volume) ~ Height + s(Girth), data = trees)
I have two general questions to this example:
How does one decide when a variable in the model should enter parametrically (such as Height) and when it should enter as a smooth (such as Girth)? Does one hold an advantage over the other, and is there a way to determine what the optimal type for a variable is? If anybody knows of any literature on this topic, I'd be happy to hear of it.
Say I want to look more closely at the weights of ct1 via ct1$coefficients. Can I use them as the gam procedure outputs them, or do I have to transform them before analyzing them, given that I am fitting to log(Volume)? In the latter case, I guess I would have to use exp(ct1$coefficients).
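One informal way to explore both questions is sketched below, using the trees example above; treating AIC and the effective degrees of freedom as a heuristic for the linear-versus-smooth choice is an assumption on my part, not a rule from the book:
library(mgcv)
data(trees)
## Sketch: fit Height as a linear term and as a smooth, then compare.
ct_lin <- gam(log(Volume) ~ Height + s(Girth), data = trees)
ct_smo <- gam(log(Volume) ~ s(Height) + s(Girth), data = trees)
AIC(ct_lin, ct_smo)
summary(ct_smo)   # an edf close to 1 for s(Height) suggests a linear term is enough
## The model is for log(Volume): exponentiate fitted values (not the raw
## coefficients of the smooths) to get back to the Volume scale.
head(exp(fitted(ct_lin)))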

How do you correctly perform a glmmPQL on non-normal data?

I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality and the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year*Treatment, ~1 | Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data, start = coef(lm(Log ~ Year*Treatment)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Now, do you transform the data in the Excel file or in the R code (Number1 <- log(Number)) before running this model? Does link = "log" imply that the data are already log-transformed, or that the model will apply the transformation?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me; you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms above ...). In general you shouldn't include the same term in both the random and the fixed effects, although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects, so this might be correct.
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
Interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable, but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
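For reference, a sketch of the glmer call with the missing + restored (whether Year and Treatment belong in the random effects at all is a separate question, per the first point above); names are taken from the question:
library(lme4)
## Sketch only: the question's model with "+" added before the random-effects term.
model <- glmer(Number ~ Year*Treatment + (1 | Year/Treatment),
               data = data, family = poisson)
summary(model)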

How to fit a multiple linear regression model on 1664 explanatory variables in R

I have one response variable, and I'm trying to find a way of fitting a multiple linear regression model using 1664 different explanatory variables. I'm quite new to R and was taught to do this by writing each of the explanatory variables into the formula. However, as I have 1664 variables, that would take far too long. Is there a quicker way of doing this?
Thank you!
I think you want to select from the 1664 variables a valid model, i.e. a model that explains as much of the variability in the data as possible with as few explanatory variables as possible. There are several ways of doing this:
Using expert knowledge to select variables that are known to be relevant, either because other studies have found this or because some underlying process that you know of makes them relevant.
Using some kind of stepwise regression approach, which selects the variables based on how well they explain the data. Do note that this method has some serious downsides. Have a look at stepAIC for a way of doing this using the Akaike Information Criterion.
Correlating 1664 variables with the data will yield around 83 significant correlations purely by chance if you test at the 5% level (0.05 * 1664 ≈ 83). So tread carefully with automatic variable selection. Cutting down the number of variables using expert knowledge or some decorrelation technique (e.g. principal component analysis) would help.
For a code example, you first need to include an example of your own (data + code) on which I can build.
I'll answer the programming question, but note that a regression with that many variables will often need some sort of variable selection procedure (e.g. @PaulHiemstra's suggestions).
You can construct a data.frame containing only the variables you want to use, then apply the formula shortcut form <- y ~ ., where the dot stands for all variables not yet mentioned.
You could instead construct the formula manually, for instance: form <- as.formula(paste("y ~", paste(myVars, collapse = " + ")))
Then run your regression:
lm(form, data = dat)
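A small self-contained sketch of both approaches, with made-up data standing in for the 1664 predictors, plus the stepAIC route mentioned in the other answer:
## Sketch with made-up data: a response y and ten stand-in predictors.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(100 * 11), nrow = 100))
names(dat) <- c("y", paste0("x", 1:10))

## 1) the dot shortcut: regress y on every other column of 'dat'
fit1 <- lm(y ~ ., data = dat)

## 2) build the formula from a character vector of variable names
myVars <- paste0("x", 1:10)
form   <- as.formula(paste("y ~", paste(myVars, collapse = " + ")))
fit2   <- lm(form, data = dat)
all.equal(coef(fit1), coef(fit2))   # both give the same fit

## (optional) stepwise selection by AIC, as mentioned in the other answer
library(MASS)
fit_step <- stepAIC(fit1, trace = FALSE)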
