Resurrecting coefficients from simulated data in Poisson regression - r

I am trying to understand how to resurrect model estimates from simulated data in a Poisson regression. There are other similar posts on interpreting coefficients on StackExchange/CrossValidated (https://stats.stackexchange.com/questions/11096/how-to-interpret-coefficients-in-a-poisson-regression, https://stats.stackexchange.com/questions/128926/how-to-interpret-parameter-estimates-in-poisson-glm-results), but I think my question is different (although admittedly related). I am trying to resurrect known relationships in order to understand what is happening with the model. I am posting here instead of CrossValidated because I think it is less a question of statistical interpretation and more about how to get a known/simulated relationship back via code.
Here are some simulated predictors x and y with a known relationship to a response resp:
set.seed(707)
x<-rnorm(10000,mean=5,sd=1)
y<-rnorm(10000,mean=5,sd=1)
resp<-(0.5*x+0.7*y-0.1*x*y) # where I define some relationships
With a linear regression, it is very straightforward:
summary(lm(resp~y+x+y:x))
The output shows the exact linear relationship between x, y, and the interaction.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.592e-14 1.927e-15 8.260e+00 <2e-16 ***
y 7.000e-01 3.795e-16 1.845e+15 <2e-16 ***
x 5.000e-01 3.800e-16 1.316e+15 <2e-16 ***
y:x -1.000e-01 7.489e-17 -1.335e+15 <2e-16 ***
Now, if I am interested in a Poisson regression, I need integer responses, so I just round, keeping the relationship between predictors and response:
resp<-round((0.5*x+0.7*y-0.1*x*y),0)
glm1<-glm(resp~y+x+y:x,family=poisson())
summary(glm1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.419925 0.138906 3.023 0.0025 **
y 0.163919 0.026646 6.152 7.66e-10 ***
x 0.056689 0.027375 2.071 0.0384 *
y:x -0.011020 0.005261 -2.095 0.0362 *
It is my understanding that one needs to exponentiate the results to understand them, because of the link function. But here, neither exp(estimate) nor exp(intercept + estimate) gets me back to the original values.
> exp(0.419925+0.163919)
[1] 1.792917
> exp(0.163919)
[1] 1.178119
How do I interpret these values as related to the original 0.7*y relationship?
Now, if I put that same linear equation inside the exponential function before rounding, I get the original values back directly, with no need to use exp() on the estimates:
resp<-round(exp(0.5*x+0.7*y-0.1*x*y),0)
summary(glm(resp~y+x+y:x,family=poisson()))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.002970 0.045422 0.065 0.948
y 0.699539 0.008542 81.894 <2e-16 ***
x 0.499476 0.008912 56.047 <2e-16 ***
y:x -0.099922 0.001690 -59.121 <2e-16 ***
Can someone explain to me what I am misinterpreting here, and how I might find the original values of a known relationship without first using the exp() function, as above?

You're neglecting the fact that the Poisson GLM uses a log link (exponential inverse-link) by default (or rather, you're not using that information consistently). You should either generate your 'data' with an exponential inverse-link:
resp <- round(exp(0.5*x+0.7*y-0.1*x*y))
or fit the model with an identity link (family=poisson(link="identity")). (I wouldn't recommend the latter, as it is rarely a sensible model.)
For what it's worth, it's harder to simulate Poisson data that will exactly match a specified set of parameters, because (unlike the Gaussian where you can reduce the variance to arbitrarily small values) you can't generate real Poisson data with arbitrarily little noise. (Your round() statement produces integers, but not Poisson-distributed outcomes.)
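As a concrete illustration (not part of the original answer), a minimal sketch of that approach: keep the same linear predictor on the log scale and draw counts with rpois(), so the GLM recovers the coefficients up to sampling noise. The names eta and resp_pois are just illustrative.
set.seed(707)
x <- rnorm(10000, mean = 5, sd = 1)
y <- rnorm(10000, mean = 5, sd = 1)
eta <- 0.5*x + 0.7*y - 0.1*x*y                 # linear predictor on the log scale
resp_pois <- rpois(10000, lambda = exp(eta))   # genuinely Poisson-distributed counts
summary(glm(resp_pois ~ y + x + y:x, family = poisson()))
The estimates will be close to 0.5, 0.7 and -0.1, but not exact, because the Poisson noise cannot be made arbitrarily small.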

Related

Logit regression : glmer vs bife

I am working on a panel dataset and trying to run a logit regression with fixed effects.
I found that glmer models from the lme4 package and the bife package are suited for this kind of work.
However, when I run a regression with each model, I do not get the same results (estimates, standard errors, etc.).
Here is the code and results for the glmer model with an intercept:
glmer_1 <- glmer(CVC_dummy~at_log + (1|year), data=own, family=binomial(link="logit"))
summary(glmer_1)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.43327 0.09635 -66.77 <2e-16 ***
at_log 0.46335 0.01101 42.09 <2e-16 ***
Without an intercept:
glmer_2 <- glmer(CVC_dummy~at_log + (1|year)-1, data=own, family=binomial(link="logit"))
summary(glmer_2)
Estimate Std. Error z value Pr(>|z|)
at_log 0.46554 0.01099 42.36 <2e-16 ***
And with the bife package:
bife_1 <- bife(CVC_dummy~at_log | year, data=own, model="logit")
summary(bife_1)
Estimate Std. error t-value Pr(> t)
at_log 0.4679 0.0110 42.54 <2e-16 ***
Why are the estimated coefficients of at_log different between the two packages?
Which package should I use?
There is quite a lot of confusion about the terms fixed effects and random effects. From your first sentence, I guess that you intend to estimate a fixed-effects model.
However, while bife fits fixed-effects models, glmer fits random-effects models (mixed-effects models).
The two often get confused because random-effects models distinguish between fixed effects (your usual coefficients, the independent variables you are interested in) and random effects (the variances/standard deviations of your random intercepts and/or random slopes).
Fixed-effects models, on the other hand, are called that because they cancel out individual differences by including a dummy (indicator) variable for each group (hence the -1 that drops the common intercept), i.e., by including a fixed effect for each group.
However, not all fixed-effects estimators work by including indicator variables: bife works with pseudo-demeaning. The results are equivalent, and it is still called a fixed-effects model.
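To make the distinction concrete, here is a minimal sketch (using the question's variable names CVC_dummy, at_log, year and data own, so it cannot be run here) of the three specifications side by side; the factor(year) dummy version is the textbook fixed-effects approach that bife estimates via pseudo-demeaning:
library(lme4)
library(bife)
# random-effects (mixed) model: year enters as a random intercept
m_re <- glmer(CVC_dummy ~ at_log + (1 | year), data = own, family = binomial(link = "logit"))
# classical fixed-effects model: one dummy per year, no common intercept
m_fe_dummy <- glm(CVC_dummy ~ at_log + factor(year) - 1, data = own, family = binomial(link = "logit"))
# fixed-effects model via bife's pseudo-demeaning
m_fe_bife <- bife(CVC_dummy ~ at_log | year, data = own, model = "logit")
The at_log coefficients from the two fixed-effects fits should agree closely with each other, but not with the random-effects fit, because the models answer different questions.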

Adding more coefficients in R regression to exponential equation Y=a+b*(1-exp(-y/c))

I am running into a problem with R where I cannot get enough coefficients to fit the exponential equation
Y=a+b*(1-exp(-X/c))
where X and Y are data variables and a, b, c are the fitting coefficients of the exponential equation. My current formula in R is:
results = lm(Y~I(1-(exp(-X))),data=Data)
This gives:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2815 0.0689 4.086 0.002734 **
I(1 - (exp(I(-Y^2)))) 0.9592 0.1894 5.065 0.000677 ***
I am pretty new to R, but could not find any answers in the manuals on how to implement the c coefficient. Help is much appreciated.
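Since the model is nonlinear in c, lm() cannot estimate it; one option (not from the original post) is nonlinear least squares with nls(). A sketch, where the starting values for a, b and c are placeholders that would need to be tuned to the data:
# fit Y = a + b*(1 - exp(-X/c)) directly by nonlinear least squares
results <- nls(Y ~ a + b * (1 - exp(-X / c)), data = Data,
               start = list(a = 0.3, b = 1, c = 1))  # starting values are assumptions
summary(results)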

creating a simple linear regression model and get the result 1.367e+02

I'm calculating a simple linear regression using the lm() function and get the following output when calling it:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.367e+02 4.901e+00 27.90 <2e-16 ***
LCF2010$grosswk 4.855e-01 8.589e-03 56.53 <2e-16 ***
I'm interested in the Estimate column right now, but I can't understand what the values are, considering they have the e character in the middle of them...
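The e is just scientific notation: 1.367e+02 means 1.367 * 10^2 = 136.7. A quick check in R:
1.367e+02 == 136.7
## [1] TRUE
format(1.367e+02, scientific = FALSE)
## [1] "136.7"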

poly() in lm(): difference between raw vs. orthogonal

I have
library(ISLR)
attach(Wage)
# Polynomial Regression and Step Functions
fit=lm(wage~poly(age,4),data=Wage)
coef(summary(fit))
fit2=lm(wage~poly(age,4,raw=T),data=Wage)
coef(summary(fit2))
plot(age, wage)
lines(20:350, predict(fit, newdata = data.frame(age=20:350)), lwd=3, col="darkred")
lines(20:350, predict(fit2, newdata = data.frame(age=20:350)), lwd=3, col="darkred")
The prediction lines seem to be the same; however, why are the coefficients so different? How do you interpret them with raw=T and raw=F?
I see that the coefficients produced with poly(...,raw=T) match the ones with ~age+I(age^2)+I(age^3)+I(age^4).
If I want to use the coefficients to get the prediction "manually" (without using the predict() function), is there something I should pay attention to? How should I interpret the coefficients of the orthogonal polynomials from poly()?
By default, with raw = FALSE, poly() computes an orthogonal polynomial. It internally sets up the model matrix with the raw coding x, x^2, x^3, ... first and then scales the columns so that each column is orthogonal to the previous ones. This does not change the fitted values but has the advantage that you can see whether a certain order in the polynomial significantly improves the regression over the lower orders.
Consider the simple cars data with response stopping distance and driving speed. Physically, this should have a quadratic relationship but in this (old) dataset the quadratic term is not significant:
m1 <- lm(dist ~ poly(speed, 2), data = cars)
m2 <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)
In the orthogonal coding you get the following coefficients in summary(m1):
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.980 2.146 20.026 < 2e-16 ***
poly(speed, 2)1 145.552 15.176 9.591 1.21e-12 ***
poly(speed, 2)2 22.996 15.176 1.515 0.136
This shows that there is a highly significant linear effect while the second order is not significant. The latter p-value (i.e., the one of the highest order in the polynomial) is the same as in the raw coding:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.47014 14.81716 0.167 0.868
poly(speed, 2, raw = TRUE)1 0.91329 2.03422 0.449 0.656
poly(speed, 2, raw = TRUE)2 0.09996 0.06597 1.515 0.136
but the lower order p-values change dramatically. The reason is that in model m1 the regressors are orthogonal while they are highly correlated in m2:
cor(model.matrix(m1)[, 2], model.matrix(m1)[, 3])
## [1] 4.686464e-17
cor(model.matrix(m2)[, 2], model.matrix(m2)[, 3])
## [1] 0.9794765
Thus, in the raw coding you can only interpret the p-value of speed if speed^2 remains in the model. And as both regressors are highly correlated one of them can be dropped. However, in the orthogonal coding speed^2 only captures the quadratic part that has not been captured by the linear term. And then it becomes clear that the linear part is significant while the quadratic part has no additional significance.
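As for predicting "manually": with raw = TRUE the coefficients multiply the plain powers of the predictor, while the orthogonal basis from poly() can be re-evaluated at new values with predict() on the poly object. A sketch using the Wage example above (age_new is a hypothetical vector of new ages, not from the post):
co_raw <- coef(fit2)                          # coefficients from the raw coding
age_new <- c(25, 40, 60)                      # hypothetical new ages
X_raw <- cbind(1, age_new, age_new^2, age_new^3, age_new^4)
drop(X_raw %*% co_raw)                        # matches predict(fit2, data.frame(age = age_new))
# orthogonal coding: rebuild the basis at the new values
Z <- predict(poly(age, 4), newdata = age_new)
drop(cbind(1, Z) %*% coef(fit))               # matches predict(fit, data.frame(age = age_new))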
I believe the way polynomial regression would be run with raw=T is that one would look at the highest-power term and assess its significance based on the p-value of that coefficient.
If it is found not significant (large p-value), the regression would be re-run without that power (i.e., at the next lower degree), and this would be carried out one step at a time, reducing the degree as long as the highest term is not significant.
If at any point the highest degree is significant, the process would stop and that degree would be taken as the appropriate one.
