Overriding default polynomial contrasts with ordered factors - r

Using an ordered factor as a predictor in a regression by default produces a linear (.L) and quadratic (.Q) polynomial contrast. Is there a way to omit the quadratic contrast? Here's some clumsy example code I rigged up:
xvar <- rnorm(100)
yvar <- xvar + rnorm(100)
xfac <- factor(rep(c(1, 2, 3), length.out = 100))
dat <- data.frame(xvar, yvar, xfac)
dat$xfac <- ordered(dat$xfac)
summary(lm(yvar ~ xvar + xfac, data = dat))
Am I correct in assuming that the quadratic contrast being included as a predictor might result in some multicollinearity issues? I looked around but couldn't find any other posts about only including the linear component. Thank you!

No, you are not correct. You would be correct if you had done this:
lm( yvar ~ xvar + as.numeric(xfac) +I(as.numeric(xfac)^2), data=dat)
But that is not the same as what R does when it encounters an ordered factor. Whether the quadratic term "weakens" the linear estimate really depends on the data: if a quadratic fit reduces the deviations of the fit from the data, the linear estimate might get "weakened", but not necessarily.
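One way to see why (a quick check, not part of the original answer): the polynomial contrast columns R builds for an ordered factor are orthogonal to one another by construction, so .L and .Q cannot be collinear with each other (they can still correlate with other predictors such as xvar):
round(crossprod(contr.poly(3)), 10)   # 2 x 2 identity: the .L and .Q columns are orthonormal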
If you do want only the linear contrasts, you could do this (which is often called a "test of trend" for xfac):
lm( yvar ~ xvar + as.numeric(xfac), data=dat)
If you have an ordered factor with several levels and you want only the linear and quadratic contrasts, then you can do this:
> fac <- factor(c("E","VG","G","F","P"),
levels=c("E","VG","G","F","P"), ordered=TRUE)
> sfac <- sample(fac, 30, rep=TRUE)
> outcome <- 5*as.numeric(sfac) +rnorm(30) # linear outcome effect
> lm(outcome ~ sfac)
#-----------
Call:
lm(formula = outcome ~ sfac)
Coefficients:
(Intercept)      sfac.L      sfac.Q      sfac.C      sfac^4
   14.97297    15.49134     0.10634    -0.03287     0.40144
#---------
> contrasts(sfac, 2) <- contr.poly(5)[, 1:2]
> lm(outcome ~ sfac)
Call:
lm(formula = outcome ~ sfac)
Coefficients:
(Intercept)      sfac.L      sfac.Q
   14.97078    15.50680     0.07977
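For this example, an equivalent shortcut (a sketch, not from the original answer) is to ask for just two polynomial contrasts directly in the formula via C(); it relies on the same contrasts<- mechanism used above:
lm(outcome ~ C(sfac, contr.poly, 2))   # keeps only the linear and quadratic contrasts for sfac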

Related

Converting logistic regression output from log odds to probability

I initially made this model for a class. Looking back at it, I found that when I tried to convert my logistic regression output to probabilities, I got values greater than 1. I am using the following dataset: https://stats.idre.ucla.edu/stat/data/binary.csv
My code to set up the model:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank<- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data=mydata, family="binomial")
summary(mylogit)
Now I exponentiate these coefficients to get the odds (stored in odds):
odds <- exp(coef(mylogit))
and convert the odds to probability:
odds/(1 + odds)
# (Intercept)         gre         gpa       rank2       rank3       rank4
#  0.01816406  0.50056611  0.69083749  0.33727915  0.20747653  0.17487497
This output does not make sense: a probability must be less than 1, yet if GRE is 300, GPA is 3, and rank2 is true (all reasonable values), combining these per-coefficient values would give a "probability" far greater than 1.
What is my mistake here? What would be the correct way to convert this to probability?
Your formula odds/(1 + odds) converts odds to a probability, but you applied it to each exponentiated coefficient separately; you need the sigmoid function applied to the whole linear predictor.
You need to sum all of the variable terms before applying the sigmoid function.
You also need to multiply the model coefficients by actual predictor values; otherwise you are implicitly assuming all the x's are equal to 1.
Here is an example using the mtcars data set:
mod <- glm(vs ~ mpg + cyl + disp, mtcars, family="binomial")
z <- coef(mod)[1] + sum(coef(mod)[-1]*mtcars[1, c("mpg", "cyl", "disp")])
1/(1 + exp(-z))
# 0.3810432
which we can verify using
predict(mod, mtcars[1, c("mpg", "cyl", "disp")], type="response")
# 0.3810432
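Applying the same recipe to the admissions data from the question (a sketch, assuming the mydata and mylogit objects defined above) for an applicant with GRE 300, GPA 3, and rank 2:
newobs <- data.frame(gre = 300, gpa = 3, rank = factor(2, levels = 1:4))
z <- coef(mylogit)["(Intercept)"] + coef(mylogit)["gre"] * 300 +
  coef(mylogit)["gpa"] * 3 + coef(mylogit)["rank2"]
1/(1 + exp(-z))                              # probability from the full linear predictor
predict(mylogit, newobs, type = "response")  # same probability via predict()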

Validating a model and introducing a new predictor in glm

I am hitting my head against the computer...
I have a prediction model in R that goes like this
m.final.glm <- glm(binary_outcome ~ rcs(PredictorA, parms=kn.a) + rcs(PredictorB, parms=kn.b) + PredictorC , family = "binomial", data = train_data)
I want to validate this model on test_data2, first by updating the linear predictor (lp):
train_data$lp <- predict(m.final.glm, train_data)
test_data2$lp <- predict(m.final.glm, test_data2)
lp2 <- predict(m.final.glm, test_data2)
m.update2.lp <- glm(binary_outcome ~ 1, family="binomial", offset=lp2, data=test_data2)
m.update2.lp$coefficients[1]
m.final.update2.lp <- m.final.glm
m.final.update2.lp$coefficients[1] <- m.final.update2.lp$coefficients[1] + m.update2.lp$coefficients[1]
m.final.update2.lp$coefficients[1]
p2.update.lp <- predict(m.final.update2.lp, test_data2, type="response")
This gets me to the point where I have updated the linear predictor, i.e. in the summary of the model only the intercept is different, while the coefficients of each predictor are the same.
Next, I want to include a new predictor (it is categorical, if that matters), PredictorD, in the updated model. This means the model has to keep the updated linear predictor and the same coefficients for Predictors A, B and C, but it also has to contain PredictorD and estimate its significance.
How do I do this? I will be very grateful if you could help me with this. Thanks!!!
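One way to sketch this (not from this thread; it reuses the offset idea shown further down this page) is to put the frozen part of the linear predictor in as an offset, so only the intercept and PredictorD are estimated. The model name m.update2.D below is just a placeholder.
# Sketch: test_data2$lp holds the linear predictor from m.final.glm (computed above);
# offset() freezes the contribution of Predictors A, B and C
m.update2.D <- glm(binary_outcome ~ PredictorD + offset(lp),
                   family = "binomial", data = test_data2)
summary(m.update2.D)   # intercept recalibrates; the PredictorD rows give its estimate and p-value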

Categorical Regression with Centered Levels

R's standard way of doing regression on categorical variables is to select one factor level as a reference level and constrain the effect of that level to be zero. Instead of constraining a single level's effect to be zero, I'd like to constrain the sum of the coefficients to be zero.
I can hack together coefficient estimates for this manually after fitting the model the standard way:
x <- lm(data = mtcars, mpg ~ factor(cyl))
z <- c(coef(x), "factor(cyl)4" = 0)
y <- mean(z[-1])
z[-1] <- z[-1] - y
z[1] <- z[1] + y
z
##  (Intercept) factor(cyl)6 factor(cyl)8 factor(cyl)4
##   20.5021645   -0.7593074   -5.4021645    6.1614719
But that leaves me without standard error estimates for the former reference level that I just added as an explicit effect, and I need to have those as well.
I did some searching and found the contrasts functions, and tried
lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
but this still produces only two effect estimates. Is there a way to properly change which constraint R uses for linear regression on categorical variables?
Think I've figured it out. Using contrasts actually is the right way to go about it, you just need to do a little work to get the results into a convenient looking form. Here's the fit:
fit <- lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
Then the contrast matrix cs <- contr.sum(levels(factor(mtcars$cyl))) is used to recover the effect estimates and their standard errors.
The effect estimates just come from multiplying the contrast matrix by the effect estimates lm spits out, like so:
cs %*% coef(fit)[-1]
The standard errors come from the contrast matrix and the variance-covariance matrix of the coefficients; take the square root of the diagonal to turn the variances into standard errors:
sqrt(diag(cs %*% vcov(fit)[-1, -1] %*% t(cs)))
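Putting the pieces together (a sketch of the steps above; the cbind at the end is just to print the results side by side):
fit <- lm(mpg ~ C(factor(cyl), contr = contr.sum), data = mtcars)
cs  <- contr.sum(levels(factor(mtcars$cyl)))            # 3 x 2 contrast matrix, rows named by level
est <- drop(cs %*% coef(fit)[-1])                       # centered effect for every level
se  <- sqrt(diag(cs %*% vcov(fit)[-1, -1] %*% t(cs)))   # matching standard errors
cbind(estimate = est, std.error = se)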

Simple slopes for interaction in Negative Binomial regression

I am looking to obtain parameter estimates for one predictor while constraining another predictor to specific values in a negative binomial glm, in order to better explain an interaction effect.
My model is something like this:
model <- glm.nb(outcome ~ IV * moderator + covariate1 + covariate2)
Because the IV:moderator term is significant, I would like to obtain parameter estimates for IV at specific values of moderator (i.e., at +1 and -1 SD). I can obtain slope estimates for IV at various levels of moderator using the visreg package, but I don't know how to estimate SEs and test statistics. moderator is a continuous variable, so I can't use the multcomp package, and other packages designed for finding simple slopes (e.g., pequod and QuantPsyc) are incompatible with negative binomial regression. Thanks!
If you want to constrain one of the coefficients in your regression, consider taking that variable out of the model and adding it back in as an offset. For example, with this sample data:
dd <- data.frame(
  x1 = runif(50),
  x2 = runif(50)
)
dd <- transform(dd,
  y = 5 * x1 - 2 * x2 + 3 + rnorm(50)
)
We can run a model with both x1 and x2 as predictors:
lm(y ~ x1 + x2,dd)
# Call:
# lm(formula = y ~ x1 + x2, data = dd)
#
# Coefficients:
# (Intercept)          x1          x2
#    3.438438    4.135162   -2.154770
Or say we know that the coefficient of x2 is -2. Then, rather than estimating x2, we put that term in as an offset:
lm(y ~ x1 + offset(-2*x2), dd)
# Call:
# lm(formula = y ~ x1 + offset(-2 * x2), data = dd)
#
# Coefficients:
# (Intercept)          x1
#    3.347531    4.153594
The offset() option basically just creates a covariate whose coefficient is fixed at 1. Even though I've demonstrated this with lm, the same method should work for glm.nb and many other regression models.
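For completeness, here is a sketch of the same trick with glm.nb; the simulated count data below are purely for illustration:
library(MASS)
set.seed(1)
dd2 <- data.frame(x1 = runif(200), x2 = runif(200))
dd2$y <- rnbinom(200, mu = exp(1 + 2 * dd2$x1 - dd2$x2), size = 2)
glm.nb(y ~ x1 + x2, data = dd2)               # both coefficients estimated
glm.nb(y ~ x1 + offset(-1 * x2), data = dd2)  # coefficient of x2 fixed at -1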

How to estimate the best fitting function to a scatter plot in R?

I have a scatterplot of two variables, for instance this:
x<-c(0.108,0.111,0.113,0.116,0.118,0.121,0.123,0.126,0.128,0.131,0.133,0.136)
y<-c(-6.908,-6.620,-5.681,-5.165,-4.690,-4.646,-3.979,-3.755,-3.564,-3.558,-3.272,-3.073)
and I would like to find the function that best fits the relation between these two variables.
To be precise, I would like to compare the fit of three models: linear, exponential, and logarithmic.
I was thinking about fitting each function to my values, calculating the likelihood in each case, and comparing the AIC values.
But I don't really know how or where to start. Any help with this would be greatly appreciated.
Thank you very much in advance.
Tina.
I would begin with some exploratory plots, something like this:
x <- c(0.108,0.111,0.113,0.116,0.118,0.121,0.123,0.126,0.128,0.131,0.133,0.136)
y <- c(-6.908,-6.620,-5.681,-5.165,-4.690,-4.646,-3.979,-3.755,-3.564,-3.558,-3.272,-3.073)
dat <- data.frame(y = y, x = x)
library(latticeExtra)
library(grid)
xyplot(y ~ x, data = dat, par.settings = ggplot2like(),
       panel = function(x, y, ...) {
         panel.xyplot(x, y, ...)
       }) +
  layer(panel.smoother(y ~ x, method = "lm"), style = 1) +          ## linear
  layer(panel.smoother(y ~ poly(x, 3), method = "lm"), style = 2) + ## cubic
  layer(panel.smoother(y ~ x, span = 0.9), style = 3) +             ## loess
  layer(panel.smoother(y ~ log(x), method = "lm"), style = 4)       ## log
It looks like you need a cubic model.
summary(lm(y~poly(x,3),data=dat))
Residual standard error: 0.1966 on 8 degrees of freedom
Multiple R-squared: 0.9831, Adjusted R-squared: 0.9767
F-statistic: 154.8 on 3 and 8 DF, p-value: 2.013e-07
Here is an example of comparing five models. Due to the form of the first two models we can use lm to get good starting values. (Note that models using different transforms of y should not be compared directly, so lm1 and lm2 are used only for starting values, not as comparison models.) We then run an nls fit for each of the first two models. After those, we try polynomials of various degrees in x. Fortunately, lm and nls use consistent AIC definitions (although that is not necessarily true of other R model-fitting functions), so we can just use lm for the polynomials. Finally, we plot the data and the fits of the first two models.
The lower the AIC the better, so nls1 is best, followed by lm3.2, followed by nls2.
lm1 <- lm(1/y ~ x)
nls1 <- nls(y ~ 1/(a + b*x), start = setNames(coef(lm1), c("a", "b")))
AIC(nls1) # -2.390924
lm2 <- lm(1/y ~ log(x))
nls2 <- nls(y ~ 1/(a + b*log(x)), start = setNames(coef(lm2), c("a", "b")))
AIC(nls2) # -1.29101
lm3.1 <- lm(y ~ x)
AIC(lm3.1) # 13.43161
lm3.2 <- lm(y ~ poly(x, 2))
AIC(lm3.2) # -1.525982
lm3.3 <- lm(y ~ poly(x, 3))
AIC(lm3.3) # 0.1498972
plot(y ~ x)
lines(fitted(nls1) ~ x, lty = 1) # solid line
lines(fitted(nls2) ~ x, lty = 2) # dashed line
ADDED: a few more models, which were subsequently fixed up and had their notation changed. Also, to follow up on Ben Bolker's comment, we can replace AIC everywhere above with AICc from the AICcmodavg package.
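For instance, a minimal sketch of that swap (assuming AICcmodavg is installed and that its AICc generic handles both the lm and nls fits above):
library(AICcmodavg)
sapply(list(nls1 = nls1, nls2 = nls2, lm3.1 = lm3.1,
            lm3.2 = lm3.2, lm3.3 = lm3.3), AICc)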
You could start by reading the classic paper by Box and Cox on transformations. They discuss how to compare transformations and how to find meaningful transformations within a set or family of potential transforms. The log transform and linear model are special cases of the Box-Cox family.
And as @agstudy said, always plot the data as well.
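If you want to try the Box-Cox route on this data, here is a minimal sketch; note that boxcox() needs a strictly positive response, so the shift applied to y below is purely an illustration device:
library(MASS)
y_pos <- y - min(y) + 1                           # shift y above zero so Box-Cox applies
boxcox(lm(y_pos ~ x), lambda = seq(-2, 2, 0.1))   # profile log-likelihood over the Box-Cox family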
