Remove coefficients from lm output in R

I would like to remove all of the as.factor() terms from the output of an ordinary least squares lm() model in R. For example (the last line doesn't work):
frame <- data.frame(y = rnorm(100), x = rnorm(100), block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
mod <- lm(y ~ x + as.factor(block), data = frame)
summary(mod)
summary(mod)$coefficients[3:5,] <- NULL
Is there a way to remove all of these elements so that the saved `lm` object no longer has them? Thanks.
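Note: the assignment on the last line has no effect because summary(mod) builds a fresh summary object on every call. A minimal sketch of printing just the non-factor rows instead, assuming the default as.factor(block) row names:
s <- summary(mod)
# keep only rows whose names do not start with "as.factor"; mod itself is unchanged
s$coefficients[!startsWith(rownames(s$coefficients), "as.factor"), , drop = FALSE]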

One option is to use the felm() function from the lfe package.
As stated in the package documentation:
The package is intended for linear models with multiple group fixed effects, i.e. with 2 or more factors with a large number of levels. It performs similar functions as lm, but it uses a special method for projecting out multiple group fixed effects from the normal equations, hence it is faster.
set.seed(123)
frame <- data.frame(y = rnorm(100), x = rnorm(100), block = sample(c("A", "B", "C", "D"), 100, replace = TRUE))
id <- as.factor(frame$block)
mod <- lm(y ~ x + id, data = frame)  # plain lm, with the factor in the coefficient table
summary(mod)
Call:
lm(formula = y ~ x + id, data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.53394 -0.68372 0.04072 0.67805 2.00777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18115 0.17201 1.053 0.2950
x -0.08310 0.09604 -0.865 0.3891
idB 0.04834 0.24645 0.196 0.8449
idC -0.51265 0.25052 -2.046 0.0435 *
idD 0.04905 0.26073 0.188 0.8512
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9002 on 95 degrees of freedom
Multiple R-squared: 0.06677, Adjusted R-squared: 0.02747
F-statistic: 1.699 on 4 and 95 DF, p-value: 0.1566
library(lfe)
est <- felm(y ~ x | id, data = frame)
summary(est)
Call:
felm(formula = y ~ x | id, data = frame)
Residuals:
Min 1Q Median 3Q Max
-2.53394 -0.68372 0.04072 0.67805 2.00777
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.08310 0.09604 -0.865 0.389
Residual standard error: 0.9002 on 95 degrees of freedom
Multiple R-squared(full model): 0.06677 Adjusted R-squared: 0.02747
Multiple R-squared(proj model): 0.00782 Adjusted R-squared: -0.03396
F-statistic(full model):1.699 on 4 and 95 DF, p-value: 0.1566
F-statistic(proj model): 0.7487 on 1 and 95 DF, p-value: 0.3891
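If you later need the block effects that felm projected out, lfe can recover them on demand with getfe(), so they stay out of the coefficient table until you ask for them. A small sketch:
fe <- getfe(est)  # recovers the absorbed group (id) effects
fe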
P.S. A similar program for Stata is reghdfe.

Perform multiple linear regression analysis including interaction terms, interpret results using summary() and diagnostic plots using lm()

I tried to perform a multiple linear regression analysis with code like the following, but with no success. I used the lm() function, and I think there is a problem with the 'x1*x2' term.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1 + x2 + x1*x2)
summary(model)
plot(model)
It shows me an error.
What should I do?
The error did not occur because of your interaction term; when I tested it, that part worked perfectly. You forgot to specify the data: lm() needs to know which data frame your variables come from. In the code below I also shortened the formula, because x1*x2 is already sufficient. R detects that you have an interaction term, so you don't have to repeat the same variable names.
data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100))
model <- lm(y ~ x1*x2, data = data)
summary(model)
#>
#> Call:
#> lm(formula = y ~ x1 * x2, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.21772 -0.77564 0.06347 0.56901 2.15324
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.05853 0.09914 -0.590 0.5564
#> x1 0.17384 0.09466 1.836 0.0694 .
#> x2 -0.02830 0.08646 -0.327 0.7442
#> x1:x2 -0.00836 0.07846 -0.107 0.9154
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.9792 on 96 degrees of freedom
#> Multiple R-squared: 0.03423, Adjusted R-squared: 0.004055
#> F-statistic: 1.134 on 3 and 96 DF, p-value: 0.3392
Created on 2023-01-14 with reprex v2.0.2
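For completeness, x1*x2 is shorthand for the full specification, so the call below fits the identical model; all.equal() confirms the coefficients match:
# x1*x2 expands to x1 + x2 + x1:x2
model2 <- lm(y ~ x1 + x2 + x1:x2, data = data)
all.equal(coef(model), coef(model2))  # TRUE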

How to formulate time period dummy variable in lm()

I am analysing whether the effects of x_t on y_t differ during and after a specific time period.
I am trying to regress the following model in R using lm():
$y_t = b_0 + [b_1(1 - D_t) + b_2 D_t]x_t$
where $D_t$ is a dummy variable with the value 1 over the time period and 0 otherwise.
Is it possible to use lm() for this formula?
observationNumber <- 1:80
obsFactor <- cut(observationNumber, breaks = c(0,55,81), right =F)
fit <- lm(y ~ x * obsFactor)
For example:
y = runif(80)
x = rnorm(80) + c(rep(0,54), rep(1, 26))
fit <- lm(y ~ x * obsFactor)
summary(fit)
Call:
lm(formula = y ~ x * obsFactor)
Residuals:
Min 1Q Median 3Q Max
-0.48375 -0.29655 0.05957 0.22797 0.49617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.50959 0.04253 11.983 <2e-16 ***
x -0.02492 0.04194 -0.594 0.554
obsFactor[55,81) -0.06357 0.09593 -0.663 0.510
x:obsFactor[55,81) 0.07120 0.07371 0.966 0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3116 on 76 degrees of freedom
Multiple R-squared: 0.01303, Adjusted R-squared: -0.02593
F-statistic: 0.3345 on 3 and 76 DF, p-value: 0.8004
obsFactor[55,81) is zero if observationNumber < 55 and one if it is greater or equal; its coefficient shifts the intercept $b_0$ during the period. x:obsFactor[55,81) is the product of the dummy and the variable $x_t$. The coefficient on x is your $b_1$, and since the model implies the interaction coefficient equals $b_2 - b_1$, your $b_2$ is the sum of the x and x:obsFactor[55,81) coefficients.
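Given that parametrization, a quick sketch of recovering $b_1$ and $b_2$ from the fitted object:
b1 <- coef(fit)["x"]                        # effect of x_t when D_t = 0
b2 <- b1 + coef(fit)["x:obsFactor[55,81)"]  # effect of x_t when D_t = 1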

A linear model matrix where each level of a categorical is contrasted with the mean

I have xy data, where y is a continuous response and x is a categorical variable:
set.seed(1)
df <- data.frame(y = rnorm(27), group = c(rep("A",9),rep("B",9),rep("C",9)), stringsAsFactors = F)
I would like to fit the linear model: y ~ group to it, in which each of the levels in df$group is contrasted with the mean.
I thought that using Deviation Coding does that:
lm(y ~ group,contrasts = "contr.sum",data=df)
But it skips contrasting group A with the mean:
> summary(lm(y ~ group,contrasts = "contr.sum",data=df))
Call:
lm(formula = y ~ group, data = df, contrasts = "contr.sum")
Residuals:
Min 1Q Median 3Q Max
-1.6445 -0.6946 -0.1304 0.6593 1.9165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2651 0.3457 -0.767 0.451
groupB 0.2057 0.4888 0.421 0.678
groupC 0.3985 0.4888 0.815 0.423
Residual standard error: 1.037 on 24 degrees of freedom
Multiple R-squared: 0.02695, Adjusted R-squared: -0.05414
F-statistic: 0.3324 on 2 and 24 DF, p-value: 0.7205
Is there any function that builds a model matrix to get each of the levels of df$group contrasted with the mean in the summary?
All I can think of is manually adding a "mean" level to df$group and setting it as the baseline with dummy coding:
library(dplyr)  # for %>%
df <- df %>% rbind(data.frame(y = mean(df$y), group = "mean"))
df$group <- factor(df$group, levels = c("mean","A","B","C"))
summary(lm(y ~ group,contrasts = "contr.treatment",data=df))
Call:
lm(formula = y ~ group, data = df, contrasts = "contr.treatment")
Residuals:
Min 1Q Median 3Q Max
-2.30003 -0.34864 0.07575 0.56896 1.42645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.14832 0.95210 0.156 0.878
groupA 0.03250 1.00360 0.032 0.974
groupB -0.06300 1.00360 -0.063 0.950
groupC 0.03049 1.00360 0.030 0.976
Residual standard error: 0.9521 on 24 degrees of freedom
Multiple R-squared: 0.002457, Adjusted R-squared: -0.1222
F-statistic: 0.01971 on 3 and 24 DF, p-value: 0.9961
Similarly, suppose I have data with two categorical variables:
set.seed(1)
df <- data.frame(y = rnorm(18),
group = c(rep("A",9),rep("B",9)),
class = as.character(rep(c(rep(1,3),rep(2,3),rep(3,3)),2)))
and I would like to estimate the interaction effect for each level (i.e., class1:groupB, class2:groupB, and class3:groupB) for:
lm(y ~ class*group,contrasts = c("contr.sum","contr.treatment"),data=df)
How would I obtain it?
Use +0 in the lm() formula to omit the intercept; then you should get the expected contrast coding:
summary(lm(y ~ 0 + group, contrasts = "contr.sum", data=df))
Result:
Call:
lm(formula = y ~ 0 + group, data = df, contrasts = "contr.sum")
Residuals:
Min 1Q Median 3Q Max
-2.3000 -0.3627 0.1487 0.5804 1.4264
Coefficients:
Estimate Std. Error t value Pr(>|t|)
groupA 0.18082 0.31737 0.570 0.574
groupB 0.08533 0.31737 0.269 0.790
groupC 0.17882 0.31737 0.563 0.578
Residual standard error: 0.9521 on 24 degrees of freedom
Multiple R-squared: 0.02891, Adjusted R-squared: -0.09248
F-statistic: 0.2381 on 3 and 24 DF, p-value: 0.8689
If you want to do this for an interaction, here's one way:
lm(y ~ 0 + class:group,
contrasts = c("contr.sum","contr.treatment"),
data=df)
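As an aside, lm() applies contrasts only when they are supplied as a named list; a bare string such as contrasts = "contr.sum" is not applied, which is why the first summary above still shows treatment coding (groupB, groupC). A sketch using the original 27-row df: with deviation coding the printed coefficients are the deviations of all but the last level from the grand mean, and the omitted level's deviation is minus their sum:
fit <- lm(y ~ group, data = df, contrasts = list(group = contr.sum))
dev <- coef(fit)[-1]   # deviations of levels A and B from the grand mean
c(dev, C = -sum(dev))  # level C's deviation is minus the sum of the others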

Extract data from partial least squares regression in R

I want to use partial least squares regression to find the most representative variables for predicting my data.
Here is my code:
library(pls)
potion <- read.table("potion-insomnie.txt", header = TRUE)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
Calling summary(lm(potion1)) gives me this output:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I refit the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I can see the regression line on the plot, but I cannot find its equation and R². Is that possible?
Also, is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
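To put an equation and R² on the predicted-versus-measured relationship from the plot, one option is to regress measured on predicted yourself; a sketch using the mtcars variables above:
pred <- drop(predict(potion1, ncomp = 2))  # predictions using 2 components
meas <- potionTrain$mpg
line_fit <- lm(meas ~ pred)        # line through the predicted-vs-measured points
coef(line_fit)                     # intercept and slope of that line
summary(line_fit)$r.squared        # its R^2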

Need help modeling data with a log() function

I have some data that Excel will fit pretty nicely with a logarithmic trend. I want to pass the same data into R and have it tell me the coefficients and intercept. What form should the data be in, and what function should I call to have it figure out the coefficients? Ultimately, I want to do this thousands of times so that I can project into the future.
Passing Excel these values produces this trendline function: y = -0.099ln(x) + 0.7521
Data:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813,
0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
For context, the data points represent % of our user base that are retained on a given day.
The question omitted the values of x, but working backwards it seems you were using 1, 2, 3, ..., so try the following:
x <- 1:11
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647,
0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076,
0.514708368)
fm <- lm(y ~ log(x))
giving:
> coef(fm)
(Intercept) log(x)
0.7521 -0.0990
and
plot(y ~ x, log = "x")
lines(fitted(fm) ~ x, col = "red")
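Since the stated goal is projecting into the future, the fitted model extends directly with predict(); the day range below is a hypothetical horizon:
future <- data.frame(x = 12:30)  # hypothetical future days
predict(fm, newdata = future)    # projected retention from y = a + b*log(x)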
You can get the same results by:
y <- c(0.7521, 0.683478429, 0.643337383, 0.614856858, 0.592765647, 0.574715813, 0.559454895, 0.546235287, 0.534574767, 0.524144076, 0.514708368)
t <- seq(along=y)
> summary(lm(y~log(t)))
Call:
lm(formula = y ~ log(t))
Residuals:
Min 1Q Median 3Q Max
-3.894e-10 -2.288e-10 -2.891e-11 1.620e-10 4.609e-10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.521e-01 2.198e-10 3421942411 <2e-16 ***
log(t) -9.900e-02 1.261e-10 -784892428 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.972e-10 on 9 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.161e+17 on 1 and 9 DF, p-value: < 2.2e-16
For larger projects I recommend encapsulating the data in a data frame, like
df <- data.frame(y, t)
lm(formula = y ~ log(t), data=df)
