Setting constants for different levels of categorical variables in R

Would anyone be able to explain how to set constants for different levels of categorical variables in R?
I have read the following: How to set the Coefficient Value in Regression; R. It does a good job of explaining how to set a constant for a categorical variable as a whole, but I would like to know how to set one for each level.
As an example, let us look at the mtcars dataset:
df <- as.data.frame(mtcars)
df$cyl <- as.factor(df$cyl)
set.seed(1)
glm(mpg ~ cyl + hp + gear, data = df)
This gives me the following output:
Call: glm(formula = mpg ~ cyl + hp + gear, data = df)
Coefficients:
(Intercept) cyl6 cyl8 hp gear
19.80268 -4.07000 -2.29798 -0.05541 2.79645
Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
Null Deviance: 1126
Residual Deviance: 219.5 AIC: 164.4
If I wanted to set cyl6 to -0.34 and cyl8 to -1.4, and then rerun the model to see how that affects the other variables, how would I do that?

I think this is what you can do:
df$mpgCyl <- df$mpg
df$mpgCyl[df$cyl == 6] <- df$mpgCyl[df$cyl == 6] - 0.34
df$mpgCyl[df$cyl == 8] <- df$mpgCyl[df$cyl == 8] - 1.4
model2 <- glm(mpgCyl ~ hp + gear, data = df)
> model2
Call: glm(formula = mpgCyl ~ hp + gear, data = df)
Coefficients:
(Intercept) hp gear
16.86483 -0.07146 3.53128
UPDATE with comments:
cyl is a factor, so by default each of its levels shifts the intercept rather than the slope. The reference level cyl==4 is 'hidden' but still present in the glm (it is absorbed into the intercept). So your first glm says:
1) for cyl==4: mpg = 19.8 - 0.055*hp + 2.79*gear
2) for cyl==6: mpg = (19.8 - 4.07) - 0.055*hp + 2.79*gear
3) for cyl==8: mpg = (19.8 - 2.29) - 0.055*hp + 2.79*gear
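An alternative that leaves mpg untouched is to put the fixed level effects into an offset() term: with a Gaussian glm (identity link) the offset enters the linear predictor with its coefficient fixed at 1, so the cyl effects are held at exactly the values you supply. Mind the sign convention: an offset of -0.34 for cyl==6 holds that effect at -0.34, whereas subtracting 0.34 from mpg, as above, holds it at +0.34. A sketch, reusing df from the question (cyl_fix is just an illustrative column name):
# fixed contributions per level; 0 for the reference level cyl==4
df$cyl_fix <- c("4" = 0, "6" = -0.34, "8" = -1.4)[as.character(df$cyl)]
# offset() terms get a coefficient of exactly 1, so these values act as the cyl effects
model3 <- glm(mpg ~ hp + gear + offset(cyl_fix), data = df)
model3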
You might also check https://stats.stackexchange.com/questions/213710/level-of-factor-taken-as-intercept and the related question "Is there any way to fit a `glm()` so that all levels are included (i.e. no reference level)?"
Hope this helps

Related

R: Prevent "~." in a linear mixed effect model from running an independent variable as both a fixed and random effect

I seem to be running into some issues when I run the code below:
library(lme4)
columns <- c("disp", "hp", "wt", "qsec", "vs")
X <- mtcars[,c(columns, 'cyl', 'carb', 'gear')] # data
rf <- lmer(cyl~ (carb | gear) +., data = X)
rf
### The output that I don't want (lists 'carb' and 'gear' as fixed variables):
Linear mixed model fit by REML ['lmerMod']
Formula: cyl ~ (carb | gear) + .
Data: X
REML criterion at convergence: 76.9662
Random effects:
Groups Name Std.Dev. Corr
gear (Intercept) 0.2887
carb 0.2039 -1.00
Residual 0.5202
Number of obs: 32, groups: gear, 3
Fixed Effects:
(Intercept) carb gear disp hp wt
10.179140 0.025990 -0.873174 0.003883 0.008190 0.089656
qsec vs
-0.159582 -0.779400
As you can see, it is treating 'carb' and 'gear' as fixed effects when I only want them used in the random-effects term.
My goal is to keep the code in a similar format while running the model without 'carb' and 'gear' being included as fixed effects (only as random effects).
How can I prevent "~." in the first model from selecting 'carb' and 'gear' as fixed variables so that it may produce the same output as the second model below?
The output that I need: (ONLY 'carb' and 'gear' listed as random effects):
> el <- lmer(cyl~ disp + hp + wt + qsec + vs + (carb | gear), data = mtcars)
> el
Linear mixed model fit by REML ['lmerMod']
Formula: cyl ~ disp + hp + wt + qsec + vs + (carb | gear)
Data: mtcars
REML criterion at convergence: 79.7548
Random effects:
Groups Name Std.Dev. Corr
gear (Intercept) 0.9932
carb 0.1688 -0.82
Residual 0.5263
Number of obs: 32, groups: gear, 3
Fixed Effects:
(Intercept) disp hp wt qsec vs
6.848103 0.004024 0.006929 0.172789 -0.169145 -0.785878
Any help at all is greatly appreciated!
The . shortcut in R formula notation means "every variable in data except the variable(s) on the left-hand side of the formula." It is a convenient shorthand originally intended for functions like lm(), and, as you have found, it is not very flexible outside that context. An lmer() formula contains the dependent variable, the fixed effects, and the random effects, so the . shortcut sweeps the random-effects variables in as fixed effects too. To avoid hard-coding the variable names, you can build the formula by pasting strings, like this:
columns <- c("disp", "hp", "wt", "qsec", "vs")
fixed_effects <- paste(columns, collapse = '+')
model_formula <- as.formula(paste('cyl ~ (carb|gear) +', fixed_effects))
model_formula
# cyl ~ (carb | gear) + disp + hp + wt + qsec + vs
rf <- lmer(model_formula, data = mtcars)
Here we use paste() with the collapse argument to concatenate all elements of columns with a + between each one. Then we paste() again to combine the left-hand side and the random-effects term with the fixed effects. That gives us the full model formula as a string, which as.formula() converts into a formula object.
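An alternative sketch: reformulate() pastes its term labels together with +, so the random-effects term can be passed in as one more label (this relies on that paste-based construction rather than on any lme4-specific interface):
library(lme4)
columns <- c("disp", "hp", "wt", "qsec", "vs")
# build cyl ~ disp + hp + wt + qsec + vs + (carb | gear) in one call
model_formula2 <- reformulate(c(columns, "(carb | gear)"), response = "cyl")
rf2 <- lmer(model_formula2, data = mtcars)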

Using the "at" argument of the margins function in R for a logit model

I want to analyze the marginal effects of continuous and binary variables in a logit model. I am hoping R can report the marginal effect of hp at a specified value (200 in this example) while also finding the marginal effect of the vs variable equaling 1. I would also like the output table to include the SE, z score, and p value. I am having trouble producing that table, and when I have gotten it to run it does not evaluate the two variables independently. Here is an MRE below. Thank you!
mod2 <- glm(am ~ hp + factor(vs), data=mtcars, family=binomial)
margins(mod2)
#> Average marginal effects
#> glm(formula = am ~ hp + factor(vs), family = binomial, data = mtcars)
#> hp vs1
#> -0.00203 -0.03154
#code where I am trying to evaluate at the desired values.
margins(mod2, at=list(hp=200, vs=1))
This is because you've changed vs to a factor.
Consider the following
library(margins)
mod3 <- glm(am ~ hp + vs, data=mtcars, family=binomial)
margins(mod3, at=list(hp=200, vs=1))
# Average marginal effects at specified values
# glm(formula = am ~ hp + vs, family = binomial, data = mtcars)
#
# at(hp) at(vs) hp vs
# 200 1 -0.001783 -0.02803
There is no real reason to turn vs into a factor here; it's dichotomous.
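Since the question also asks for the SE, z score, and p value: calling summary() on the margins object should give that table (a sketch; the exact columns depend on the margins package version):
library(margins)
mod3 <- glm(am ~ hp + vs, data = mtcars, family = binomial)
mfx <- margins(mod3, at = list(hp = 200, vs = 1))
# summary() on a "margins" object adds standard errors, z statistics,
# p values and confidence intervals to the average marginal effects
summary(mfx)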

How to extract the intraclass correlation (rho) in R?

I am trying to use the between- and within-effects model specifications. My question is: how do I extract the intraclass correlation from these models? In Stata this is reported as rho, the proportion of variance due to differences across panels. Below is what I have using the mtcars dataset. (Hopefully the between- and within-effects models are correctly specified.)
between <- lmer(mpg ~ disp + hp + (1|cyl), mtcars)
summary(between)
within <- felm(mpg ~ disp + hp | factor(cyl), data = mtcars)
summary(within)
I think I answered my own question regarding the lmer model specification. Running
between.null <- lmer(mpg ~ 1 + (1|cyl), mtcars)
summary(between.null)
reports both the variance of cyl and the residual variance. Rho should then just be variance of cyl / (variance of cyl + residual variance).
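To pull those numbers out programmatically rather than reading them off the summary, here is a small sketch using lme4's VarCorr(); as.data.frame(VarCorr(...)) returns one row per variance component with the variance in the vcov column:
library(lme4)
between.null <- lmer(mpg ~ 1 + (1 | cyl), mtcars)
vc <- as.data.frame(VarCorr(between.null))      # rows: cyl (Intercept) and Residual
rho <- vc$vcov[vc$grp == "cyl"] / sum(vc$vcov)  # cyl variance / total variance
rho
If you prefer not to compute it by hand, the icc() function in the performance package reports essentially the same quantity.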

Using column name of dataframe as predictor variable in linear regression

I'm trying to loop through all the column names of my data.frame and use them as predictor variables in a linear regression.
What I currently have is:
for (i in 1:11) {
  for (j in 1:11) {
    if (i != j) {
      var1 = names(newData)[i]
      var2 = names(newData)[j]
      glm.fit = glm(re78 ~ as.name(var1):as.name(var2), data = newData)
      summary(glm.fit)
      cv.glm(newData, glm.fit, K = 10)$delta[1]
    }
  }
}
Where newData is my data.frame and there are 11 columns in total. This code gives me the following error:
Error in model.frame.default(formula = re78 ~ as.name(var1), data = newData, :
invalid type (symbol) for variable 'as.name(var1)'
How can I fix this, and make it work?
It looks like you want models that use all combinations of two variables. Here's another way to do that using the built-in mtcars data frame for illustration and using mpg as the outcome variable.
We get all combinations of two variables (excluding the outcome variable, mpg in this case) using combn. combn returns a list where each list element is a vector containing the names of a pair of variables. Then we use map (from the purrr package) to create models for each pair of variables and store the results in a list.
We use reformulate to construct the model formula. .x refers back to the vectors of variables names (each element of vars). If you run, for example, reformulate(paste(c("cyl", "disp"),collapse="*"), "mpg"), you can see what reformulate is doing.
library(purrr)
# Get all combinations of two variables
vars = combn(names(mtcars)[-grep("mpg", names(mtcars))], 2, simplify=FALSE)
Now we want to run regression models on all pairs of variables and store results in a list:
# No interaction
models = map(vars, ~ glm(reformulate(.x, "mpg"), data=mtcars))
# Interaction only (no main effects)
models = map(vars, ~ glm(reformulate(paste(.x, collapse=":"), "mpg"), data=mtcars))
# Interaction and main effects
models = map(vars, ~ glm(reformulate(paste(.x, collapse="*"), "mpg"), data=mtcars))
Name each list element with the formula for that model:
names(models) = map(models, ~ .x[["terms"]])
To create the model formulas using paste instead of reformulate you could do (change + to : or *, depending on what combination of interactions and main effects you want to include):
models = map(vars, ~ glm(paste("mpg ~", paste(.x, collapse=" + ")), data=mtcars))
To see how paste is being used here, you can run:
paste("mpg ~", paste(c("cyl", "disp"), collapse=" * "))
Here's what the first two models look like when the models include both main effects and the interaction:
models[1:2]
$`mpg ~ cyl * disp`
Call: glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"),
data = mtcars)
Coefficients:
(Intercept) cyl disp cyl:disp
49.03721 -3.40524 -0.14553 0.01585
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 1126
Residual Deviance: 198.1 AIC: 159.1
$`mpg ~ cyl * hp`
Call: glm(formula = reformulate(paste(.x, collapse = "*"), "mpg"),
data = mtcars)
Coefficients:
(Intercept) cyl hp cyl:hp
50.75121 -4.11914 -0.17068 0.01974
Degrees of Freedom: 31 Total (i.e. Null); 28 Residual
Null Deviance: 1126
Residual Deviance: 247.6 AIC: 166.3
To assess model output, you can use functions from the broom package. The code below returns data frames with, respectively, the coefficients and performance statistics for each model.
library(broom)
model_coefs = map_df(models, tidy, .id="Model")
model_performance = map_df(models, glance, .id="Model")
Here are what the results look like for models with both main effects and the interaction:
head(model_coefs, 8)
Model term estimate std.error statistic p.value
1 mpg ~ cyl * disp (Intercept) 49.03721186 5.004636297 9.798357 1.506091e-10
2 mpg ~ cyl * disp cyl -3.40524372 0.840189015 -4.052950 3.645320e-04
3 mpg ~ cyl * disp disp -0.14552575 0.040002465 -3.637919 1.099280e-03
4 mpg ~ cyl * disp cyl:disp 0.01585388 0.004947824 3.204212 3.369023e-03
5 mpg ~ cyl * hp (Intercept) 50.75120716 6.511685614 7.793866 1.724224e-08
6 mpg ~ cyl * hp cyl -4.11913952 0.988229081 -4.168203 2.672495e-04
7 mpg ~ cyl * hp hp -0.17068010 0.069101555 -2.469989 1.987035e-02
8 mpg ~ cyl * hp cyl:hp 0.01973741 0.008810871 2.240120 3.320219e-02
You can use fit <- glm(as.formula(paste0("re78 ~ ", var1)), data = newData) as #akrun suggests. Also, you probably do not want to call your object glm.fit, since there is a function with the same name.
Caveat: I do not know why you have the double loop and the :. Do you not want a regression with a single covariate? Otherwise I am not sure what you are trying to achieve.
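For reference, a hedged sketch of what the original double loop could look like once the formula is built with as.formula(). Since newData and re78 are not available, it uses mtcars with mpg as the outcome (swap those names back for the real data) and keeps the interaction-only : from the question:
library(boot)  # for cv.glm()
outcome <- "mpg"                         # stand-in for re78
vars <- setdiff(names(mtcars), outcome)  # stand-in for names(newData)
cv_errors <- list()
for (i in seq_along(vars)) {
  for (j in seq_along(vars)) {
    if (i != j) {
      f <- as.formula(paste(outcome, "~", vars[i], ":", vars[j]))
      fit <- glm(f, data = mtcars)
      # store the 10-fold cross-validation error for this pair
      cv_errors[[paste(vars[i], vars[j], sep = ":")]] <-
        cv.glm(mtcars, fit, K = 10)$delta[1]
    }
  }
}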

Map function in R for multiple regression

My goal is to run a multiple regression on each dependent variable in a list, using all of the independent variables in another list. I then would like to store the best model for each dependent variable by AIC.
I have written the below function guided by this post. However, instead of employing each independent variable individually, I'd like to run the model against the entire list as a multiple regression.
Any tips on how to build this function?
dep<-list("mpg~","cyl~","disp~") # list of unique dependent variables with ~
indep<-list("hp","drat","wt") # list of first unique independent variables
models <- Map(function(x, y)
  step(lm(as.formula(paste(x, paste(y), collapse = "+")), data = mtcars),
       direction = "backward"),
  dep, indep)
Start: AIC=88.43
mpg ~ hp
Df Sum of Sq RSS AIC
<none> 447.67 88.427
- hp 1 678.37 1126.05 115.943
Start: AIC=18.56
cyl ~ drat
Df Sum of Sq RSS AIC
<none> 50.435 18.558
- drat 1 48.44 98.875 38.100
Start: AIC=261.74
disp ~ wt
Df Sum of Sq RSS AIC
<none> 100709 261.74
- wt 1 375476 476185 309.45
[[1]]
Call:
lm(formula = mpg ~ hp, data = mtcars)
Coefficients:
(Intercept) hp
30.09886 -0.06823
[[2]]
Call:
lm(formula = cyl ~ drat, data = mtcars)
Coefficients:
(Intercept) drat
14.596 -2.338
[[3]]
Call:
lm(formula = disp ~ wt, data = mtcars)
Coefficients:
(Intercept) wt
-131.1 112.5
The y needs to be collapsed with + before being pasted to the x, and y needs to be passed as a single vector (rather than element by element) for each value of x:
models <- lapply(dep, function(x, y)
step(lm(as.formula(paste(x, paste(y, collapse="+"))), data=mtcars),
direction="backward"), y = indep)
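For comparison, a sketch of the same idea with lapply() and reformulate(), using plain response names without the "~" (the object names here are just illustrative); trace = 0 only silences the step() output:
dep2 <- c("mpg", "cyl", "disp")   # responses, no "~" needed
indep2 <- c("hp", "drat", "wt")
models2 <- lapply(dep2, function(y)
  step(lm(reformulate(indep2, response = y), data = mtcars),
       direction = "backward", trace = 0))
names(models2) <- dep2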
