How to fix coefficients in R for categorical variables

I would like to know how to put offsets (or fixed coefficients) on each level of a categorical variable in a model and see how that affects the other variables. I'm not sure exactly how to code that.
library(tidyverse)
mtcars <- as_tibble(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
model1 <- glm(mpg ~ cyl + hp, data = mtcars)
summary(model1)
This gives the following:
Call:
glm(formula = mpg ~ cyl + hp, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.818 -1.959 0.080 1.627 6.812
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.65012 1.58779 18.044 < 2e-16 ***
cyl6 -5.96766 1.63928 -3.640 0.00109 **
cyl8 -8.52085 2.32607 -3.663 0.00103 **
hp -0.02404 0.01541 -1.560 0.12995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 9.898847)
Null deviance: 1126.05 on 31 degrees of freedom
Residual deviance: 277.17 on 28 degrees of freedom
AIC: 169.9
Number of Fisher Scoring iterations: 2
I would like to set the cylinders to different offsets, say 6 cylinders to -4 and 8 cylinders to -9, so I can see what that does to horsepower. I tried this in the code below but get an error, so I'm not sure of the correct way to fix one level of a categorical variable, much less more than one.
model2 <- glm(mpg ~ offset(I(-4 * cyl[6]))+ hp, data = mtcars)
Would anyone help me figure out how to correctly do this?

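In a fresh R session, build the offset from one logical comparison per level; the level that is left out (4 cylinders) acts as the baseline with an offset of 0: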
glm(mpg ~ offset(I(-4 * (cyl == 6) + -9 * (cyl == 8))) + hp, data = mtcars)
# Call: glm(formula = mpg ~ offset(I(-4 * (cyl == 6) + -9 * (cyl == 8))) +
# hp, data = mtcars)
#
# Coefficients:
# (Intercept) hp
# 27.66881 -0.01885
#
# Degrees of Freedom: 31 Total (i.e. Null); 30 Residual
# Null Deviance: 353.8
# Residual Deviance: 302 AIC: 168.6
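The same offset can also be built from a lookup of per-level values, which may be easier to maintain when the factor has many levels. The sketch below is only an illustration: the names fixed, cars, and cyl_offset are made up here, and the values -4 and -9 are the ones assumed in the question, with 4 cylinders left as the baseline at 0.

```r
# Assumed fixed coefficients per cylinder level (4 cylinders = baseline)
fixed <- c("4" = 0, "6" = -4, "8" = -9)

cars <- mtcars
# Look up the fixed value for each row by its cyl level
cars$cyl_offset <- unname(fixed[as.character(cars$cyl)])

# Model with cyl estimated freely vs. cyl held at the fixed values
fit_free  <- glm(mpg ~ factor(cyl) + hp, data = cars)
fit_fixed <- glm(mpg ~ hp + offset(cyl_offset), data = cars)

# Compare the intercept and hp slope between the two fits
rbind(free  = coef(fit_free)[c("(Intercept)", "hp")],
      fixed = coef(fit_fixed))
```

Because the offset enters the linear predictor with a coefficient of exactly 1, only the intercept and hp are estimated in fit_fixed.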

Related

How do I change predictors in linear regression in loop in R?

Below is an example along with the error. Can someone please fix it.
# sample data (the mpg data set ships with ggplot2)
library(ggplot2)
mpg <- mpg
str(mpg)
# array of predictors
predictors <- c("hwy", "cty")
# loop over predictors
for (predictor in predictors)
{
# fit linear regression
model <- lm(formula = predictor ~ displ + cyl,
data = mpg)
# summary of model
summary(model)
}
Error
Error in model.frame.default(formula = predictor ~ displ + cyl, data = mpg, :
variable lengths differ (found for 'displ')
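The loop fails because predictor is a character string, so lm() sees a length-one character vector as the response rather than the column it names. We may use paste or reformulate to build the formula from the string. Also, as it is a for loop, create an object to store the output from summary: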
sumry_model <- vector('list', length(predictors))
names(sumry_model) <- predictors
for (predictor in predictors) {
# fit linear regression
model <- lm(reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
# with paste
# model <- lm(formula = paste0(predictor, "~ displ + cyl"), data = mpg)
# summary of model
sumry_model[[predictor]] <- summary(model)
}
Output:
> sumry_model
$hwy
Call:
lm(formula = reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
Residuals:
Min 1Q Median 3Q Max
-7.5098 -2.1953 -0.2049 1.9023 14.9223
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.2162 1.0481 36.461 < 2e-16 ***
displ -1.9599 0.5194 -3.773 0.000205 ***
cyl -1.3537 0.4164 -3.251 0.001323 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.759 on 231 degrees of freedom
Multiple R-squared: 0.6049, Adjusted R-squared: 0.6014
F-statistic: 176.8 on 2 and 231 DF, p-value: < 2.2e-16
$cty
Call:
lm(formula = reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
Residuals:
Min 1Q Median 3Q Max
-5.9276 -1.4750 -0.0891 1.0686 13.9261
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.2885 0.6876 41.139 < 2e-16 ***
displ -1.1979 0.3408 -3.515 0.000529 ***
cyl -1.2347 0.2732 -4.519 9.91e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.466 on 231 degrees of freedom
Multiple R-squared: 0.6671, Adjusted R-squared: 0.6642
F-statistic: 231.4 on 2 and 231 DF, p-value: < 2.2e-16
This may also be done with a multivariate response:
summary(lm(cbind(hwy, cty) ~ displ + cyl, data = mpg))
Or, to reuse the predictors vector (which here actually holds the response names):
summary(lm(as.matrix(mpg[predictors]) ~ displ + cyl, data = mpg))
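To see what reformulate() builds, and as an alternative to the explicit for loop, here is a small sketch with lapply(). It uses the built-in mtcars data so it runs without ggplot2; the names responses and fits are chosen for this example only.

```r
# reformulate() turns character vectors into a formula object
f <- reformulate(c("displ", "cyl"), response = "hwy")
print(f)

# Fit one model per response and keep the fits in a named list
responses <- c("mpg", "qsec")
fits <- lapply(setNames(responses, responses), function(resp) {
  lm(reformulate(c("disp", "cyl"), response = resp), data = mtcars)
})

# One column of coefficients per response
sapply(fits, coef)
```

Keeping the fitted models in a named list means summary(fits$mpg) or any other extractor can be applied later without refitting.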

Contrast between variables in glmmTMB

As a reproducible example, let's use the following nonsensical model:
> library(glmmTMB)
> summary(glmmTMB(am ~ disp + hp + (1|carb), data = mtcars))
Family: gaussian ( identity )
Formula: am ~ disp + hp + (1 | carb)
Data: mtcars
AIC BIC logLik deviance df.resid
34.1 41.5 -12.1 24.1 27
Random effects:
Conditional model:
Groups Name Variance Std.Dev.
carb (Intercept) 2.011e-11 4.485e-06
Residual 1.244e-01 3.528e-01
Number of obs: 32, groups: carb, 6
Dispersion estimate for gaussian family (sigma^2): 0.124
Conditional model:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.7559286 0.1502385 5.032 4.87e-07 ***
disp -0.0042892 0.0008355 -5.134 2.84e-07 ***
hp 0.0043626 0.0015103 2.889 0.00387 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Actually, my real model family is nbinom2. I want to test a contrast between disp and hp, so I tried:
> glht(glmmTMB(am ~ disp + hp + (1|carb), data = mtcars), linfct = matrix(c(0,1,-1)))
Error in glht.matrix(glmmTMB(am ~ disp + hp + (1 | carb), data = mtcars), :
‘ncol(linfct)’ is not equal to ‘length(coef(model))’
How can I avoid this error?
Thank you!
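The problem is actually fairly simple: linfct needs to be a matrix with the number of columns equal to the number of parameters. You specified matrix(c(0,1,-1)) without specifying the number of rows or columns, so R made a column matrix (3 rows, 1 column) by default. Adding nrow=1 seems to work. The modelparm.glmmTMB method defined below tells multcomp how to pull the conditional-model fixed effects and their covariance matrix out of a glmmTMB fit, since fixef() and vcov() return lists with one element per model component.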
library(glmmTMB)
library(multcomp)
m1<- glmmTMB(am ~ disp + hp + (1|carb), data = mtcars)
modelparm.glmmTMB <- function (model, coef. = function(x) fixef(x)[[component]],
vcov. = function(x) vcov(x)[[component]],
df = NULL, component="cond", ...) {
multcomp:::modelparm.default(model, coef. = coef., vcov. = vcov.,
df = df, ...)
}
glht(m1, linfct = matrix(c(0,1,-1),nrow=1))
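The shape issue can be checked in base R, and the contrast itself can be computed by hand as a Wald test. The sketch below is an illustration under assumptions: it uses a plain lm() fit as a stand-in for the glmmTMB model so it runs without extra packages, and the names K_col, K_row, est, se, and z are chosen here.

```r
# Without nrow, matrix() builds a column: 3 rows, 1 column
K_col <- matrix(c(0, 1, -1))
# With nrow = 1, there is one contrast row and one column per coefficient
K_row <- matrix(c(0, 1, -1), nrow = 1)
dim(K_col)
dim(K_row)

# Stand-in model with the same fixed effects: (Intercept), disp, hp
fit <- lm(am ~ disp + hp, data = mtcars)

# Contrast estimate (disp - hp) and its standard error
est <- drop(K_row %*% coef(fit))
se  <- sqrt(drop(K_row %*% vcov(fit) %*% t(K_row)))
z   <- est / se
c(estimate = est, std.error = se, statistic = z)
```

glht() performs the same kind of calculation, which is why linfct must have exactly one column per model coefficient.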

Running a regression

Background: my data set has 52 rows and 12 columns (assume column names are A - L) and the name of my data set is foo
I am told to run a regression where foo$L is the dependent variable, and all other variables are independent except for foo$K.
The way I was doing it is
fit <- lm(foo$L ~ foo$a + ... +foo$J)
then calling
summary(fit)
Is this a good way to run a regression and find the intercept and coefficients?
Use the data argument to lm so you don't have to use the foo$ syntax for each predictor. Use dependent ~ . as the formula to have the dependent variable predicted by all other variables. Then you can use - K to exclude K:
data_mat = matrix(rnorm(52 * 12), nrow = 52)
df = as.data.frame(data_mat)
colnames(df) = LETTERS[1:12]
lm(L ~ . - K, data = df)
You can first remove the column K, and then do fit <- lm(L ~ ., data = foo). This will treat the L column as the dependent variable and all the other columns as the independent variables. You don't have to specify each column name in the formula.
Here is an example using mtcars, fitting a multiple regression model for mpg on all the other variables except carb.
mtcars2 <- mtcars[, !names(mtcars) %in% "carb"]
fit <- lm(mpg ~ ., data = mtcars2)
summary(fit)
# Call:
# lm(formula = mpg ~ ., data = mtcars2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.3038 -1.6964 -0.1796 1.1802 4.7245
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 12.83084 18.18671 0.706 0.48790
# cyl -0.16881 0.99544 -0.170 0.86689
# disp 0.01623 0.01290 1.259 0.22137
# hp -0.02424 0.01811 -1.339 0.19428
# drat 0.70590 1.56553 0.451 0.65647
# wt -4.03214 1.33252 -3.026 0.00621 **
# qsec 0.86829 0.68874 1.261 0.22063
# vs 0.36470 2.05009 0.178 0.86043
# am 2.55093 2.00826 1.270 0.21728
# gear 0.50294 1.32287 0.380 0.70745
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 2.593 on 22 degrees of freedom
# Multiple R-squared: 0.8687, Adjusted R-squared: 0.8149
# F-statistic: 16.17 on 9 and 22 DF, p-value: 9.244e-08
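The two approaches above (excluding a variable in the formula, or dropping the column first) should give the same fit. A quick check on mtcars, with fit_minus and fit_drop as names chosen for this sketch:

```r
# Exclude carb inside the formula
fit_minus <- lm(mpg ~ . - carb, data = mtcars)

# Drop the carb column before fitting
fit_drop <- lm(mpg ~ ., data = mtcars[, names(mtcars) != "carb"])

# Compare the estimated coefficients
all.equal(coef(fit_minus), coef(fit_drop))

# Extract the intercept and the slopes without printing the full summary
coef(fit_minus)["(Intercept)"]
coef(fit_minus)[-1]
```

The formula version leaves the original data frame untouched, which can be convenient if the excluded column is needed later.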

Confidence interval for sigma in a purely fixed effect model

Is there a standard way to estimate a confidence interval for the variance parameter of a linear model with only fixed effects? E.g. given:
reg=lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
How can I get the confidence interval for the variance parameter? confint only reports the fixed effects, and lmer from lme4 does not accept a model without a level-2 random effect, which is my case here.
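Unfortunately, you have to implement it yourself, like so:
reg <- lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
alpha <- 0.05
df_res <- df.residual(reg)   # n - p: 32 observations minus 5 estimated coefficients
s2 <- summary(reg)$sigma^2   # estimated residual variance
# confidence interval for the variance sigma^2 (take sqrt() of both ends for sigma itself)
c(lower = df_res * s2 / qchisq(1 - alpha/2, df = df_res),
  upper = df_res * s2 / qchisq(alpha/2, df = df_res))
It comes from the relation (n - p) * s^2 / sigma^2 ~ chi-squared with n - p degrees of freedom, which holds under normally distributed errors; n is the number of observations and p the number of estimated coefficients.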
I assume you are looking for the summary() function.
The code shows the following:
data(mtcars)
reg<-lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
summary(reg)
# Call:
# lm(formula = 100/mpg ~ disp + hp + wt + am, data = mtcars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.6923 -0.3901 0.0579 0.3649 1.2608
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.740648 0.738594 1.003 0.32487
# disp 0.002703 0.002715 0.996 0.32832
# hp 0.005275 0.003253 1.621 0.11657
# wt 1.001303 0.302761 3.307 0.00267 **
# am 0.155815 0.375515 0.415 0.68147
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.6754 on 27 degrees of freedom
# Multiple R-squared: 0.8527, Adjusted R-squared: 0.8309
# F-statistic: 39.08 on 4 and 27 DF, p-value: 7.369e-11
To pull out the estimates, store the summary in a variable and select the coefficients.
summa<-summary(reg)
summa$coefficients
From that table you can take the standard error of the coefficient you want and build the confidence interval at the level of interest.
R does this automatically with confint(object, parm, level).
In your case, confint(reg, level = 0.95). Note that this covers the regression coefficients only, not the residual variance.
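To make the difference concrete, the sketch below shows that confint() has no row for sigma and wraps the chi-squared interval in a small helper. The function sigma_ci is hypothetical (it is not part of base R or any package), and the interval assumes independent, normally distributed errors.

```r
reg <- lm(100/mpg ~ disp + hp + wt + am, data = mtcars)

# Hypothetical helper: chi-squared interval for the residual standard deviation
sigma_ci <- function(fit, level = 0.95) {
  alpha  <- 1 - level
  df_res <- df.residual(fit)          # n - p
  s2     <- summary(fit)$sigma^2      # estimated residual variance
  # Interval for the variance, then square root for the standard deviation
  var_ci <- df_res * s2 / qchisq(c(1 - alpha / 2, alpha / 2), df = df_res)
  c(lower = sqrt(var_ci[1]), upper = sqrt(var_ci[2]))
}

rownames(confint(reg))   # coefficients only; sigma is not listed
sigma_ci(reg)            # interval for the residual standard deviation
```

Squaring both ends of the result gives the corresponding interval for the variance.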

Extract data from Partial least square regression on R

I want to use the partial least squares regression to find the most representative variables to predict my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
summary(lm(potion1)) gives me this output:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I refit the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I can see the regression line on the plot, but I cannot find its equation or R². Is that possible?
Is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
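To write out the fitted equation, the intercept can be requested along with the slopes, and the R² of the predicted-vs-measured relationship can be computed from the fitted values. This is a sketch on mtcars rather than the potion data, and the names fit, b, pred, and r2 are chosen here; it assumes the pls package is installed.

```r
library(pls)

# Two predictors, two components, leave-one-out cross-validation
fit <- plsr(mpg ~ cyl + disp, data = mtcars, ncomp = 2, validation = "LOO")

# Regression equation for the 2-component model, intercept included
b <- drop(coef(fit, ncomp = 2, intercept = TRUE))
b

# Fitted values on the training data and their squared correlation
# with the measured response
pred <- drop(predict(fit, ncomp = 2))
r2 <- cor(pred, mtcars$mpg)^2
r2
```

With as many components as predictors, the PLS fit coincides with ordinary least squares, which partly answers the second question: in that case the coefficients match those from lm().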
