poly() in lm(): difference between raw vs. orthogonal - r

I have
library(ISLR)
attach(Wage)
# Polynomial Regression and Step Functions
fit=lm(wage~poly(age,4),data=Wage)
coef(summary(fit))
fit2=lm(wage~poly(age,4,raw=T),data=Wage)
coef(summary(fit2))
plot(age, wage)
lines(20:350, predict(fit, newdata = data.frame(age=20:350)), lwd=3, col="darkred")
lines(20:350, predict(fit2, newdata = data.frame(age=20:350)), lwd=3, col="darkred")
The prediction lines seem to be the same, so why are the coefficients so different? How do you interpret them with raw=T and raw=F?
I see that the coefficients produced with poly(...,raw=T) match the ones with ~age+I(age^2)+I(age^3)+I(age^4).
If I want to use the coefficients to get the prediction "manually" (without using the predict() function), is there something I should pay attention to? How should I interpret the coefficients of the orthogonal polynomials in poly()?

By default, with raw = FALSE, poly() computes an orthogonal polynomial. It internally sets up the model matrix with the raw coding x, x^2, x^3, ... first and then scales the columns so that each column is orthogonal to the previous ones. This does not change the fitted values but has the advantage that you can see whether a certain order in the polynomial significantly improves the regression over the lower orders.
Consider the simple cars data with response stopping distance and driving speed. Physically, this should have a quadratic relationship but in this (old) dataset the quadratic term is not significant:
m1 <- lm(dist ~ poly(speed, 2), data = cars)
m2 <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)
In the orthogonal coding you get the following coefficients in summary(m1):
                Estimate Std. Error t value Pr(>|t|)
(Intercept)       42.980      2.146  20.026  < 2e-16 ***
poly(speed, 2)1  145.552     15.176   9.591 1.21e-12 ***
poly(speed, 2)2   22.996     15.176   1.515    0.136
This shows that there is a highly significant linear effect while the second order is not significant. The latter p-value (i.e., the one of the highest order in the polynomial) is the same as in the raw coding:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)                  2.47014   14.81716   0.167    0.868
poly(speed, 2, raw = TRUE)1  0.91329    2.03422   0.449    0.656
poly(speed, 2, raw = TRUE)2  0.09996    0.06597   1.515    0.136
but the lower order p-values change dramatically. The reason is that in model m1 the regressors are orthogonal while they are highly correlated in m2:
cor(model.matrix(m1)[, 2], model.matrix(m1)[, 3])
## [1] 4.686464e-17
cor(model.matrix(m2)[, 2], model.matrix(m2)[, 3])
## [1] 0.9794765
Thus, in the raw coding you can only interpret the p-value of speed if speed^2 remains in the model. And as both regressors are highly correlated one of them can be dropped. However, in the orthogonal coding speed^2 only captures the quadratic part that has not been captured by the linear term. And then it becomes clear that the linear part is significant while the quadratic part has no additional significance.
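On the "manual" prediction part of the question, here is a minimal sketch using the cars models from above. With raw = TRUE the coefficients multiply speed and speed^2 directly. With the orthogonal coding they multiply the orthogonal basis columns, so new values first have to be mapped through the same basis; calling predict() on the stored poly() object does that. The names new_speed, basis, X_raw and X_orth are only illustrative.

```r
m1 <- lm(dist ~ poly(speed, 2), data = cars)
m2 <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)

new_speed <- c(5, 10, 20)  # illustrative new values

# Raw coding: the design matrix is simply 1, x, x^2
X_raw <- cbind(1, new_speed, new_speed^2)
manual_raw <- drop(X_raw %*% coef(m2))

# Orthogonal coding: evaluate the basis fitted on the original data
# at the new values, then multiply by the coefficients of m1
basis <- poly(cars$speed, 2)
X_orth <- cbind(1, predict(basis, newdata = new_speed))
manual_orth <- drop(X_orth %*% coef(m1))

# Both should agree with predict()
reference <- predict(m1, newdata = data.frame(speed = new_speed))
all.equal(manual_raw, unname(reference))
all.equal(manual_orth, unname(reference))
```

Computing poly(new_speed, 2) from scratch would not work here, because it would build a different basis from the new values alone.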

I believe the way a polynomial regression would be run with raw=T is that one looks at the highest power term and assesses its significance based on the p-value of that coefficient.
If it is not significant (large p-value), the regression is re-run without that power (i.e., at the next lower degree), and this is repeated one step at a time for as long as the highest term is not significant.
As soon as the highest degree is significant, the process stops and that degree is taken as the appropriate one.
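The step-down procedure described here could be sketched as follows, using the cars data from the answer above. The starting degree of 4 and the 0.05 cutoff are assumptions made for illustration; the procedure itself does not prescribe them.

```r
alpha  <- 0.05  # assumed significance cutoff
degree <- 4     # assumed starting degree

while (degree > 1) {
  fit <- lm(dist ~ poly(speed, degree, raw = TRUE), data = cars)
  # p-value of the highest-order term (last row of the coefficient table)
  p_top <- coef(summary(fit))[degree + 1, "Pr(>|t|)"]
  if (p_top < alpha) break  # highest power is significant: stop here
  degree <- degree - 1      # otherwise drop it and try the next lower degree
}

# Refit at the selected degree
final_fit <- lm(dist ~ poly(speed, degree, raw = TRUE), data = cars)
degree
coef(summary(final_fit))
```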

Related

Regression model produces low coefficient but high R-squared?

I'm doing research on right-wing radicalization factors. Since I'm using survey data and have to account for oversampling, I'm using the survey package. Furthermore, I log-transformed the dependent variable in order to adjust for its right-skewed distribution. These are the relevant code lines:
dat_wght <- svydesign(ids= ~1, data=dat, weights =~wghtpew)
mod1 <-svyglm(log(right) ~ religiosity, design = dat_wght)
For model 1, I get this regression output:
Call:
svyglm(formula = log(right) ~ religiosity,
design = dat_wght)
Survey design:
svydesign(ids = ~1, data = dat, weights = ~wghtpew)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.493398   0.016111 154.763  < 2e-16 ***
religiosity 0.016750   0.004091   4.094 4.43e-05 ***
Since it is a log-model, I interpreted the coefficient as follows: if the independent variable religiosity increases by 1 unit, the dependent variable right increases by (exp(0.01675)-1)*100 = 1.69%. This would mean that there is indeed some kind of correlation, but the effect is very, very low. Is that correct so far?
Furthermore, I want to calculate R-squared. The svyglm model doesn't provide R-squared directly. However, you guys kindly pointed out that I can calculate it myself with:
total_var <-svyvar(~right, dat_wght)
resid_var_mod1 <- summary(mod1)$dispersion
rsq_mod1 <- 1-resid_var_mod1/total_var
rsq_mod1
However, the result I get is:
      variance     SE
right  0.99407 0.0028
How can that be? If the effect is very, very low, my model apparently isn't suited to explain the variation in the dependent variable. Therefore R-squared should also be very low and much closer to zero than to 1, shouldn't it? Why is it so high then? Did I interpret my coefficients wrong? Are there any mistakes that I made along the way?
I'm really grateful for every kind of advice! Thanks :)
Ok, it's simpler than everyone thought. Your R^2 calculation is just wrong.
You compute the residual variance of log(right) and divide it by the population variance of right. You need to use the population variance of log(right) as well.
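A small base-R illustration of that scale mismatch, on simulated data (the variables x and y and all numbers are made up for the example, and no survey weights are involved). The residual variation of a log-model has to be compared with the total variation of log(y), not of y. With the survey design from the question, the analogous change would presumably be to compute the total variance as svyvar(~log(right), dat_wght).

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- exp(2.5 + 0.02 * x + rnorm(n, sd = 0.1))  # small effect, right-skewed y
fit <- lm(log(y) ~ x)

rss     <- sum(resid(fit)^2)                 # residual SS, on the log scale
tss_log <- sum((log(y) - mean(log(y)))^2)    # total SS on the same (log) scale
tss_raw <- sum((y - mean(y))^2)              # total SS on the wrong (raw) scale

rsq_right <- 1 - rss / tss_log  # matches summary(fit)$r.squared
rsq_wrong <- 1 - rss / tss_raw  # mixes scales, so it is not an R-squared

c(correct = rsq_right, mismatched = rsq_wrong, lm = summary(fit)$r.squared)
```

The mismatched version comes out close to 1 simply because the variance of y is far larger than the variance of log(y), not because the model explains anything.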

Calculating logLik by hand from a logistic regression

I ran a mixed model logistic regression adjusting my model with genetic relationship matrix using an R package known as GMMAT (function: glmmkin()).
My output from the model includes (taken from the user manual):
theta: the dispersion parameter estimate [1] and the variance component parameter estimate [2]
coefficients: fixed effects parameter estimates (including the intercept).
linear.predictors: the linear predictors.
fitted.values: fitted mean values on the original scale.
Y: a vector of length equal to the sample size for the final working vector.
P: the projection matrix with dimensions equal to the sample size.
residuals: residuals on the original scale. NOT rescaled by the dispersion parameter.
cov: covariance matrix for the fixed effects (including the intercept).
converged: a logical indicator for convergence.
I am trying to obtain the log-likelihood in order to compute variance explained. My first instinct was to pull apart the logLik.glm function in order to compute this "by hand" and I got stuck at trying to compute AIC. I used the answer from here.
I did a sanity check with a logistic regression run with stats::glm() where the model1$aic is 4013.232 but using the Stack Overflow answer I found, I obtained 30613.03.
My question is -- does anyone know how to compute log likelihood from a logistic regression by hand using the output that I have listed above in R?
No statistical insight here, just the solution I see from looking at glm.fit. This only works if you did not specify weights while fitting the models (or if you did, you would need to include those weights in the model object)
get_logLik <- function(s_model, family = binomial(logit)) {
  n <- length(s_model$y)
  wt <- rep(1, n) # or s_model$prior_weights if field exists
  deviance <- sum(family$dev.resids(s_model$y, s_model$fitted.values, wt))
  mod_rank <- sum(!is.na(s_model$coefficients)) # or s_model$rank if field exists
  aic <- family$aic(s_model$y, rep(1, n), s_model$fitted.values, wt, deviance) + 2 * mod_rank
  log_lik <- mod_rank - aic/2
  return(log_lik)
}
For example...
model <- glm(vs ~ mpg, mtcars, family = binomial(logit))
logLik(model)
# 'log Lik.' -12.76667 (df=2)
sparse_model <- model[c("theta", "coefficients", "linear.predictors", "fitted.values", "y", "P", "residuals", "cov", "converged")]
get_logLik(sparse_model)
#[1] -12.76667
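For an unweighted binary (0/1) response, the same number can also be written out directly as the Bernoulli log-likelihood, which needs only the response and the fitted probabilities. A sketch on the mtcars example above; whether this is the appropriate likelihood for a mixed model fitted with glmmkin() is a separate question that this does not settle.

```r
model <- glm(vs ~ mpg, data = mtcars, family = binomial(logit))

y  <- model$y              # observed 0/1 response
mu <- model$fitted.values  # fitted probabilities

# Bernoulli log-likelihood: sum of y*log(mu) + (1 - y)*log(1 - mu)
ll_hand <- sum(y * log(mu) + (1 - y) * log(1 - mu))
all.equal(ll_hand, as.numeric(logLik(model)))

# AIC follows as -2 * logLik + 2 * (number of estimated coefficients)
aic_hand <- -2 * ll_hand + 2 * length(coef(model))
all.equal(aic_hand, AIC(model))
```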

Opposite directions of exponential hazard model coefficients ( with survreg and glm poisson)

I want to estimate an exponential hazards model with one predictor in R. For some reason, I am getting coefficients with opposite signs when I estimate it using a glm poisson with offset log t and when I just use the survreg function from the survival package. I am sure the explanation is perfectly obvious but I can not figure it out.
Example
t <- c(89,74,23,74,53,3,177,44,28,43,25,24,31,111,57,20,19,137,45,48,9,17,4,59,7,26,180,56,36,51,6,71,23,6,13,28,16,180,16,25,6,25,4,5,32,94,106,1,69,63,31)
d <- c(0,1,1,0,1,1,0,1,1,0,1,1,1,1,0,0,1,0,1,1,1,0,1,0,1,1,0,0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1)
p <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,1,1)
df <- data.frame(d,t,p)
# exponential hazards model using poisson with offset log(t)
summary(glm(d ~ offset(log(t)) + p, data = df, family = "poisson"))
Produces:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.3868     0.7070  -7.619 2.56e-14 ***
p             1.3932     0.7264   1.918   0.0551 .
Compared to
# exponential hazards model using survreg exponential
require(survival)
summary(survreg(Surv(t,d) ~ p, data = df, dist = "exponential"))
Produces:
            Value Std. Error     z        p
(Intercept)  5.39      0.707  7.62 2.58e-14
p           -1.39      0.726 -1.92 5.51e-02
Why are the coefficients in opposite directions and how would I interpret the results as they stand?
Thanks!
In the second model (survreg) the coefficients are on the scale of log expected survival time, so an increased value of p is associated with a decreased expected survival. In the first model (Poisson with offset log(t)) the coefficients are on the scale of the log hazard, so the same increase in p shows up as a higher risk. Variations in risk and mean survival time of necessity go in opposite directions. The fact that the absolute values are the same comes from the mathematical identity log(1/x) = -log(x). The risk is (exactly) inversely proportional to mean lifetime in exponential models.
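A quick check of this on simulated data (the data-generating values below are arbitrary assumptions, not the data from the question). Both fits maximize the same exponential likelihood, one parameterized by the log hazard and the other by the log mean survival time, so the estimates should agree up to sign.

```r
library(survival)

set.seed(42)
n <- 300
p <- rbinom(n, 1, 0.5)
event_time <- rexp(n, rate = exp(-4 + 1 * p))  # assumed true hazard
cens_time  <- rexp(n, rate = exp(-5))          # assumed independent censoring
time <- pmin(event_time, cens_time)
d <- as.integer(event_time <= cens_time)       # 1 = event, 0 = censored
sim <- data.frame(time, d, p)

# Log-hazard parameterization
fit_pois <- glm(d ~ offset(log(time)) + p, data = sim, family = poisson)
# Log-mean-survival-time parameterization
fit_surv <- survreg(Surv(time, d) ~ p, data = sim, dist = "exponential")

cbind(poisson = coef(fit_pois), survreg = coef(fit_surv))
all.equal(unname(coef(fit_pois)), -unname(coef(fit_surv)), tolerance = 1e-4)
```

So exp(coef) from the Poisson fit is a hazard ratio, while exp(coef) from survreg is the corresponding ratio of mean survival times, which is its reciprocal.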

How to get an estimate and confidence interval for a contrast in R with offset

I've got a Poisson GLM model fitted in R, looking something like this:
glm(Outcome~Exposure + Var1 + offset(log(persontime)),family=poisson,data=G)
Where Outcome will end up being a rate, the Exposure is a continuous variable, and Var1 is a factor with three levels.
It's easy enough from the output of that:
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.6998     0.1963 -29.029  < 2e-16
Exposure      4.7482     1.0793   4.399 1.09e-05
Var1Thing1   -0.2930     0.2008  -1.459 0.144524
Var1Thin      1.0395     0.2037   5.103 3.34e-07
Var1Thing3    0.7722     0.2201   3.508 0.000451
To get the estimate of a one-unit increase in Exposure. But a one-unit increase isn't actually particularly meaningful. An increase of 0.025 is actually far more likely. Getting an estimate for that isn't particularly difficult either, but I'd like a confidence interval along with the estimate. My intuition is that I need to use the contrast package, but the following generated an error:
diff <- contrast(Fit,list(Exposure=0.030,Var1="Thing1"),list(Exposure=0.005,Type="Thing1"))
"Error in offset(log(persontime)) : object 'persontime' not found"
Any idea what I'm doing wrong?
You want to use the confint function (which in this case will call the MASS:::confint.glm method), as in:
confint(Fit)
Since the standard errors in the model scale linearly with linear changes in the scale of the variable 'Exposure', you can simply multiply the confidence interval by the difference in scale to get the confidence interval for a smaller 'unit' change.
Dumb example:
Let's say you want to test the hypothesis that people fall down more often when they've had more alcohol. You test this by randomly serving individuals varying amounts of alcohol (which you measure in ml) and counting the number of times each person falls down. Your model is:
Fit <- glm(falls ~ alcohol_ml,data=myData, family=poisson)
and the coef table is
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.6998     0.1963 -29.029  < 2e-16
alcohol_ml    4.7482     1.0793   4.399 1.09e-05
and the confidence interval for alcohol is 4-6 (just to keep it simple). Now a colleague asks you to give the confidence interval in ounces. All you have to do is scale the confidence interval by the conversion factor (29.5735 ml per ounce), as in:
c(4,6) * 29.5735 # effect per ounce alcohol [notice multiplication is used to rescale here]
Alternatively, you could re-scale your data and re-fit the model:
myData$alcohol_oz <- myData$alcohol_ml / 29.5735 #[notice division is used to rescale here]
Fit <- glm(falls ~ alcohol_oz,data=myData, family=poisson)
or you could re-scale your data right in the model:
#[again notice that division is used here]
Fit <- glm(falls ~ I(alcohol_ml/29.5735),data=myData, family=poisson)
Either way, you will get the same confidence intervals on the new scale.
Back to your example: if your units of Exposure are so large that you are unlikely to observe such a change within an individual and a smaller change is more easily interpreted, just re-scale your variable 'Exposure' (as in myData$Exposure_newScale = myData$Exposure / 0.030, so that Exposure_newScale is in multiples of 0.030) or rescale the confidence intervals using either of these methods.
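A sketch of this on simulated data, including the offset from the question. The data frame G and its values are made up; the 0.025 increment is the one mentioned in the question. It uses Wald intervals from confint.default() so that it runs without profiling, which is an assumption of this example and not what confint(Fit) itself returns; the rescaling argument is the same.

```r
set.seed(7)
n <- 400
G <- data.frame(
  Exposure   = runif(n, 0, 0.1),
  Var1       = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  persontime = runif(n, 500, 2000)
)
G$Outcome <- rpois(n, G$persontime * exp(-5.7 + 4.7 * G$Exposure))

Fit <- glm(Outcome ~ Exposure + Var1 + offset(log(persontime)),
           family = poisson, data = G)

delta <- 0.025  # the increment of interest

# Option 1: rescale the estimate and interval from the original fit
est_scaled <- coef(Fit)["Exposure"] * delta
ci_scaled  <- confint.default(Fit)["Exposure", ] * delta

# Option 2: rescale the variable inside the model and refit
Fit2 <- glm(Outcome ~ I(Exposure / delta) + Var1 + offset(log(persontime)),
            family = poisson, data = G)
ci_refit <- confint.default(Fit2)[2, ]  # row 2 is the rescaled Exposure term

all.equal(unname(ci_scaled), unname(ci_refit), tolerance = 1e-6)

# Rate ratio for a 0.025 increase in Exposure, with its interval
exp(c(estimate = unname(est_scaled), ci_scaled))
```

The multiplication is done on the log (coefficient) scale and the result exponentiated afterwards; multiplying an already exponentiated rate ratio by 0.025 would not give the right answer.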
