R run linear model by group in dataset [duplicate]

This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 2 years ago.
My dataset looks like this:
df = data.frame(site=c(rep('A',95),rep('B',110),rep('C',250)),
nps_score=c(floor(runif(455, min=0, max=10))),
service_score=c(floor(runif(455, min=0, max=10))),
food_score=c(floor(runif(455, min=0, max=10))),
clean_score=c(floor(runif(455, min=0, max=10))))
I'd like to run a linear model on each group (i.e. for each site) and collect the coefficients for each group in a data frame, along with the significance level of each variable.
I am trying to group_by the site variable and then run the model for each site, but it doesn't seem to be working. I've looked at some existing solutions on Stack Overflow but cannot adapt the code to my problem.
# Trying to run this by group, and output the resulting coefficients per site in a separate df with their significance levels.
library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))
Any help on this would be greatly appreciated

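library(MASS)       # load MASS first so that MASS::select() does not mask dplyr::select()
library(tidyverse)
library(broom)
# We first create a formula object from every column except site and the response
my_formula <- as.formula(paste("nps_score ~", paste(df %>% dplyr::select(-site, -nps_score) %>% names(), collapse = " + ")))
# Now we can group by site and use the formula object within the pipe.
# Inside do(), the . placeholder is the data for the current site only.
results <- df %>%
  group_by(site) %>%
  do(tidy(rlm(my_formula, data = .)))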
which gives:
# A tibble: 12 x 5
# Groups: site [3]
site term estimate std.error statistic
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 5.16 0.961 5.37
2 A service_score -0.0656 0.110 -0.596
3 A food_score -0.0213 0.102 -0.209
4 A clean_score -0.0588 0.110 -0.536
5 B (Intercept) 2.22 0.852 2.60
6 B service_score 0.221 0.103 2.14
7 B food_score 0.163 0.104 1.56
8 B clean_score -0.0383 0.0928 -0.413
9 C (Intercept) 5.47 0.609 8.97
10 C service_score -0.0367 0.0721 -0.509
11 C food_score -0.0585 0.0724 -0.808
12 C clean_score -0.0922 0.0691 -1.33
Note: I'm not familiar with the rlm function and don't know whether it provides p-values in the first place. At least the tidy function doesn't return p-values for rlm. If a simple linear regression would suit your needs, you could replace the rlm function with lm, in which case a sixth column with p-values would be added.
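As a rough sketch of the lm variant mentioned in the note, the same pipeline can be run with lm in place of rlm. The data frame is the one from the question; set.seed(1) and the name results_lm are additions made here for illustration, and the formula is written out by hand rather than built with paste.

```r
library(dplyr)
library(broom)

# Same simulated data as in the question; the seed is an assumption added
# here so the sketch is reproducible.
set.seed(1)
df <- data.frame(site = c(rep("A", 95), rep("B", 110), rep("C", 250)),
                 nps_score = floor(runif(455, min = 0, max = 10)),
                 service_score = floor(runif(455, min = 0, max = 10)),
                 food_score = floor(runif(455, min = 0, max = 10)),
                 clean_score = floor(runif(455, min = 0, max = 10)))

# One ordinary least squares fit per site; tidy() on an lm object
# includes a p.value column.
results_lm <- df %>%
  group_by(site) %>%
  do(tidy(lm(nps_score ~ service_score + food_score + clean_score, data = .)))

print(results_lm)
```

The result has one row per site and term, so filtering on p.value gives the terms that are significant at whatever level you choose.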

Related

R loop over linear regression

I have looked over the forum but couldn't find what I am looking for.
I want to run a simple linear regression a couple of times. Each time using a different column as my independent variable, the dependent variable stays the same. After running it I want to be able to extract the R squared from each of the regressions. My thought process was to use a simple for loop. However, I cannot make it work.
Assume I work with the following data:
  num value person1 person2 person3
0   1   229      29      81       0
1   2   203      17      75       0
2   3   244      62       0      55
and that I want to run the regression on the value using three variables: person1, person2 and person3. Note that this is a minimal working example but I hope to generalize the idea.
And so my initial attempt was to:
column <- names(df)[-2]
for(i in 3:5){
temp <- df[,c("value", column[i])]
lm.test <- lm(value ~ ., data = temp)
i + 1
}
However, when I run summary(lm.test) I only get a summary of the last regression, i.e. lm(value ~ person3) which I think makes sense but when trying to rewrite it as: lm.test[i] <- lm(value ~ ., data = temp) I get the following error:
debug at #3: temp <- df[,c("value", column[i])]
suggesting that there's something wrong with line 3?
If possible I'd like to be able to capture the summary for each regression but what I am really after is the R squared for each one of the regressions.

You can build the formula inside a loop and then run lm on it. For instance, to regress mpg in mtcars on each of cyl, wt and hp in turn, I can use the following:
vars <- c("cyl", "wt", "hp")
lm_results <- lapply(vars, function(col){
lm_formula <- as.formula(paste0("mpg ~ ", col))
lm(lm_formula, data = mtcars)
})
You can then again iterate over lm_results to get the r.squared:
lapply(lm_results, function(x) summary(x)$r.squared)
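If a named numeric vector is more convenient than a list, a small variation is to name the list of models and use sapply. This is only a sketch on mtcars; reformulate() is used here in place of paste0() plus as.formula(), and the name r2 is introduced for illustration.

```r
vars <- c("cyl", "wt", "hp")

# Fit one simple regression per predictor and keep the models in a named list.
# Storing models in a list (rather than assigning into lm.test[i]) is what
# lets each fit be kept instead of overwritten.
lm_results <- lapply(vars, function(col) {
  lm(reformulate(col, response = "mpg"), data = mtcars)
})
names(lm_results) <- vars

# sapply() simplifies the per-model R squared values to a named vector.
r2 <- sapply(lm_results, function(m) summary(m)$r.squared)
print(r2)
```

Individual values can then be looked up by predictor name, e.g. r2[["wt"]].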

Here’s an approach using broom::glance() and purrr::map_dfr() to collect model summary stats into a tidy tibble:
library(broom)
library(purrr)
lm.test <- map_dfr(
set_names(names(df)[-2]),
~ glance(lm(
as.formula(paste("value ~", .x)),
data = df
)),
.id = "predictor"
)
Result:
# A tibble: 4 x 13
predictor r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 num 0.131 -0.739 27.4 0.150 0.765 1 -12.5 31.1
2 person1 0.836 0.672 11.9 5.10 0.265 1 -10.0 26.1
3 person2 0.542 0.0831 19.9 1.18 0.474 1 -11.6 29.2
4 person3 0.607 0.215 18.4 1.55 0.431 1 -11.3 28.7
# ... with 4 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>,
# nobs <int>
NB, you can capture model coefficients with a similar approach using broom::tidy() instead of glance().
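A minimal sketch of that tidy() variant, assuming the data frame is the three rows shown in the question (retyped here by hand) and using coef_table as an illustrative name:

```r
library(broom)
library(purrr)

# The three example rows from the question, retyped for a self-contained run.
df <- data.frame(num = 1:3,
                 value = c(229, 203, 244),
                 person1 = c(29, 17, 62),
                 person2 = c(81, 75, 0),
                 person3 = c(0, 0, 55))

# One regression of value on each other column; tidy() returns one row per
# coefficient, and .id records which predictor the rows belong to.
coef_table <- map_dfr(
  set_names(names(df)[-2]),
  ~ tidy(lm(as.formula(paste("value ~", .x)), data = df)),
  .id = "predictor"
)
print(coef_table)
```

Each predictor contributes two rows, the intercept and the slope, so the slopes can be isolated with a filter on term.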

Mapping broom::tidy to nested list of {fixest} models and keep name of list element

I want to apply broom::tidy() to models nested in a fixest_multi object and extract the names of each list level as data frame columns. Here's an example of what I mean.
library(fixest)
library(tidyverse)
library(broom)
multiple_est <- feols(c(Ozone, Solar.R) ~ Wind + Temp, airquality, fsplit = ~Month)
This command estimates two models, one for each dep. var. (Ozone and Solar.R), on the subset for each Month as well as on the full sample. Here's what the resulting object looks like:
> names(multiple_est)
[1] "Full sample" "5" "6" "7" "8" "9"
> names(multiple_est$`Full sample`)
[1] "Ozone" "Solar.R"
I now want to tidy each model object, but keep the information of the Month / Dep.var. combination as columns in the tidied data frame.
I can run map_dfr from the purrr package, giving me this result:
> map_dfr(multiple_est, tidy, .id ="Month") %>% head(9)
# A tibble: 9 x 6
Month term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Full sample (Intercept) -71.0 23.6 -3.01 3.20e- 3
2 Full sample Wind -3.06 0.663 -4.61 1.08e- 5
3 Full sample Temp 1.84 0.250 7.36 3.15e-11
4 5 (Intercept) -76.4 82.0 -0.931 3.53e- 1
5 5 Wind 2.21 2.31 0.958 3.40e- 1
6 5 Temp 3.07 0.878 3.50 6.15e- 4
7 6 (Intercept) -70.6 46.8 -1.51 1.45e- 1
8 6 Wind -1.34 1.13 -1.18 2.50e- 1
9 6 Temp 1.64 0.609 2.70 1.29e- 2
But this tidies only the first model of each Month, the model with the Ozone outcome.
My desired output would look something like this:
Month outcome term estimate more columns from tidy
Full sample Ozone (Intercept) -71.0
Full sample Ozone Wind -3.06
Full sample Ozone Temp 1.84
Full sample Solar.R (Intercept) some value
Full sample Solar.R Wind some value
Full sample Solar.R Temp some value
... rows repeated for each month 5, 6, 7, 8, 9
How can I apply tidy to all models and add another column that indicates the outcome of the model (which is stored in the name of the model object)?

So, fixest_multi has a pretty strange setup, as I found when I delved deeper. As you noticed, mapping across it or using apply only accesses part of the models. In fact, it isn't just the models for "Ozone", but actually just the first 6 models (both outcomes for c("Full sample", "5", "6")).
If you convert it to a list, it accesses the data attribute, which is a sequential list of all 12 models, but drops the names you're looking for. So, as a workaround, you could use pmap() with the names (found in the attributes of the object) to tidy() each model and then use mutate() to add your desired columns.
library(fixest)
library(tidyverse)
library(broom)
multiple_est <- feols(c(Ozone, Solar.R) ~ Wind + Temp, airquality, fsplit = ~Month)
nms <- attr(multiple_est, "meta")$all_names
pmap_dfr(
list(
data = as.list(multiple_est),
month = rep(nms$sample, each = length(nms$lhs)),
outcome = rep(nms$lhs, length(nms$sample))
),
~ tidy(..1) %>%
mutate(
Month = ..2,
outcome = ..3,
.before = 1
)
)
#> # A tibble: 36 × 7
#> Month outcome term estimate std.error statistic p.value
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Full sample Ozone (Intercept) -71.0 23.6 -3.01 3.20e- 3
#> 2 Full sample Ozone Wind -3.06 0.663 -4.61 1.08e- 5
#> 3 Full sample Ozone Temp 1.84 0.250 7.36 3.15e-11
#> 4 Full sample Solar.R (Intercept) -76.4 82.0 -0.931 3.53e- 1
#> 5 Full sample Solar.R Wind 2.21 2.31 0.958 3.40e- 1
#> 6 Full sample Solar.R Temp 3.07 0.878 3.50 6.15e- 4
#> 7 5 Ozone (Intercept) -70.6 46.8 -1.51 1.45e- 1
#> 8 5 Ozone Wind -1.34 1.13 -1.18 2.50e- 1
#> 9 5 Ozone Temp 1.64 0.609 2.70 1.29e- 2
#> 10 5 Solar.R (Intercept) -284. 262. -1.08 2.89e- 1
#> # … with 26 more rows

How do I export coefficients from a lm() object containing multiple lm()?

I have an object (S3; lm) that contains the linear regression outputs of 471 different models. I am trying to extract the standard error of a specific variable in each model but I'm unsure how to do so, can anyone help? Specifically, I want to extract the standard error for the variable "p" for EACH of the 471 models saved in the "fit" object.
varnames = names(merged1)[2036:2507]
fit <- lapply(varnames,
FUN=function(p) lm(formula(paste("Dx ~ x + y + z + q +", p)),data=merged1))
names(fit) <- varnames
Thank you so much!
Note
Edited to reflect the anonymous function p, rather than x, as stated previously.

Using fit, which is computed reproducibly in the Note at the end, invoke map_dfr on it with tidy. This gives a data frame containing the coefficients and associated statistics for every model. We then keep only the rows where the term is the variable that was added to that model.
library(broom) # tidy
library(dplyr)
library(purrr) # map_dfr
fit %>%
map_dfr(tidy, .id = "variable") %>%
filter(term == variable)
giving:
# A tibble: 8 x 6
variable term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 hp hp -0.0147 0.0147 -1.00 0.325
2 drat drat 1.21 1.50 0.812 0.424
3 wt wt -3.64 1.04 -3.50 0.00160
4 qsec qsec -0.243 0.402 -0.604 0.551
5 vs vs -0.634 1.90 -0.334 0.741
6 am am 1.93 1.34 1.44 0.161
7 gear gear 0.158 0.910 0.174 0.863
8 carb carb -0.737 0.393 -1.88 0.0711
Note
We compute fit reproducibly using mtcars which is built into R.
data <- mtcars
resp <- "mpg" # response
fixed <- c("cyl", "disp") # always include these
varnames <- setdiff(names(data), c(resp, fixed)) # incl one at a time
fit <- Map(function(v) {
fo <- reformulate(c(fixed, v), resp)
lm(fo, data)
}, varnames)
Updated
Significantly revised.

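# p is not defined outside the lapply() call, so iterate over the model names instead
sapply(names(fit), function(p) summary(fit[[p]])$coefficients[p, "Std. Error"], simplify = FALSE)
The "Std. Error" column (the 2nd column of the coefficient matrix) holds the standard error for a variable.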
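For a version that runs without the original merged1 data, here is a base R sketch on mtcars, mirroring the reproducible setup used in the other answer (cyl and disp always included, one extra variable at a time). The name se is introduced here for illustration.

```r
data <- mtcars
resp <- "mpg"                                      # response
fixed <- c("cyl", "disp")                          # always included
varnames <- setdiff(names(data), c(resp, fixed))   # added one at a time

# One model per extra variable, stored in a list named after that variable.
fit <- lapply(varnames, function(v) {
  lm(reformulate(c(fixed, v), resp), data = data)
})
names(fit) <- varnames

# For each model, pick the coefficient row of its own extra variable and
# the "Std. Error" column of the summary's coefficient matrix.
se <- vapply(names(fit), function(p) {
  summary(fit[[p]])$coefficients[p, "Std. Error"]
}, numeric(1))
print(se)
```

This relies on the row name in the coefficient matrix being identical to the variable name, which holds for numeric predictors but not for factors, whose rows are named after their levels.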

Multiple LM model returning the same coefficients

Hello Stack Community,
I am trying to model wage growth across US territories using linear models to forecast into the future. I want to try and create a model for each state/ territory (DC, VI, and PR), however, when I look at the coefficients for my models, they are the same for each state.
I have used a combination of plyr, dplyr, and broom thus far to create and sort my data frame (named stuben_dat) for this project.
#Wage Growth
state_data = stuben_dat %>% group_by(st) %>%
do (state_wg= lm(wage_growth ~ us_wage_growth + lag_wage_growth + dum1
+dum2 +dum3,
data= stuben_dat, subset=yr>= (current_year - 5)))
#The dummy variables adjust for seasonality (q1 vs q2 vs q3 vs q4)
#The current_year = whatever year I last updated the program
#The current_year-5 value lets me change the look back period
#This look back period can be used to exclude recessions or outliers
Here is just a snapshot of my output. As you can see, the beta coefficients and regression statistics are exactly the same for each state (just AK and AL are shown here). However, I want to build a different model for each state.
# A tibble: 318 x 6
# Groups: st [53]
st term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AK (Intercept) -1.75 0.294 -5.97 3.28e- 9
2 AK us_wage_growth 996. 23.6 42.2 1.82e-228
3 AK lag_wage_growth 0.191 0.0205 9.34 5.58e- 20
4 AK dum1 -0.245 0.304 -0.806 4.21e- 1
5 AK dum2 -0.321 0.304 -1.06 2.90e- 1
6 AK dum3 0.0947 0.303 0.312 7.55e- 1
7 AL (Intercept) -1.75 0.294 -5.97 3.28e- 9
8 AL us_wage_growth 996. 23.6 42.2 1.82e-228
9 AL lag_wage_growth 0.191 0.0205 9.34 5.58e- 20
10 AL dum1 -0.245 0.304 -0.806 4.21e- 1
# ... with 308 more rows

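It is because you pass the full data set (data = stuben_dat) inside your do() call, so every group is fitted on the same rows. Use the . placeholder, which refers to the current group's data: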
state_data = stuben_dat %>%
group_by(st) %>%
do(state_wg = lm(wage_growth ~ us_wage_growth + lag_wage_growth +
dum1 + dum2 + dum3,
data = ., subset = (yr >= (current_year - 5))))
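The difference can be seen on a built-in data set. The following sketch uses mtcars grouped by cyl as a stand-in for stuben_dat grouped by st; the names same_fit and per_group_fit are made up for this illustration.

```r
library(dplyr)
library(broom)

# Passing the full data frame: every group refits the identical model.
same_fit <- mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ wt, data = mtcars)))

# Passing the . placeholder: each group is fitted on its own rows.
per_group_fit <- mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ wt, data = .)))

print(same_fit)
print(per_group_fit)
```

In same_fit the intercept and slope repeat for every value of cyl, which is the symptom described in the question, while per_group_fit has a separate pair of estimates per group.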

Extracting Coefficient Values from Several Regressions and Storing in New Matrix [closed]

I have run a series of 86 regressions (numbered 83-168) and stored them as "reg_83", "reg_84", et cetera. I am now trying to extract the coefficient values for each variable and input them into a new dataframe for analysis to see how the coefficient values change over time. I have a new matrix ("gencoef") with 12 columns and 86 rows. I have a column dedicated to each independent variable. I am trying to run a loop which will store the coefficient value from each regression in the appropriate cell in the variable's column. I have used the following code but to no avail. I am not particularly skilled at loops so it might be a relatively straight-forward solution:
for(i in c(83:168)){
for(j in c(1:86)){
eval(parse(text=paste(
"gencoef[",j,",2] <- summary(reg_",i,")$coefficients[1,1]"),sep==""))
}
}
For whatever reason it is currently creating a space between "reg_" and the number, so it seems to think I am running "reg_ 83" which of course does not work. However, I have a sep=="" command in the loop, so I do not understand where the issue is coming from. Perhaps someone can enlighten me?

There are probably many ways of doing this but here is one I came up with very quickly. It uses the broom package as commented above.
First let's make a list of models:
# make a response variable and a matrix of predictors
set.seed(111)
response <- rnorm(10)
predictors <- matrix(rnorm(100), nrow = 10)
# model response using each predictor to give a list of 10 model outputs
mods <- apply(predictors, 2, function(x) lm(response ~ x))
Now to tidy up the output with broom and bind together the resulting data frames.
library(broom)
l <- lapply(mods, tidy)
do.call(rbind, l)
Or using purrr you can eliminate both lapply and do.call.
library(purrr)
map_df(mods, tidy)
Gives the same result.
# A tibble: 20 x 5
# term estimate std.error statistic p.value
# * <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 0.0643 0.564 0.114 0.912
# 2 x 0.0851 0.454 0.187 0.856
# 3 (Intercept) 0.0256 0.511 0.0501 0.961
# 4 x -0.0763 0.567 -0.135 0.896
# 5 (Intercept) 0.113 0.514 0.220 0.832
# 6 x -0.310 0.458 -0.677 0.518
# 7 (Intercept) -0.448 0.562 -0.797 0.448
# etc
Oh, and you could give each model an .id:
map_dfr(mods, tidy, .id = "model")
# A tibble: 20 x 6
# model term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) -0.672 0.263 -2.56 0.0338
# 2 1 x -0.0655 0.284 -0.230 0.824
# 3 2 (Intercept) -0.688 0.260 -2.65 0.0293
# 4 2 x 0.133 0.225 0.589 0.572
# etc
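If the models already exist as separate objects named reg_83, reg_84 and so on, as in the question, one possible way to get them into a list first is mget() with paste0(). paste0() concatenates without a separator, whereas paste() inserts a space by default, and in the question's loop sep=="" is a comparison that sits outside the paste() call, so it never takes effect. The sketch below uses three stand-in models fitted to simulated data; the data, the variable names x1 and x2, and the range 83:85 are assumptions made for illustration.

```r
set.seed(1)
d <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))

# Hypothetical stand-ins for the existing objects reg_83, reg_84, ...
for (i in 83:85) {
  assign(paste0("reg_", i), lm(y ~ x1 + x2, data = d[sample(20, 15), ]))
}

# Collect the objects by name into a named list; no eval(parse()) needed.
mods <- mget(paste0("reg_", 83:85))

# One row per regression, one column per coefficient.
gencoef <- t(sapply(mods, coef))
print(gencoef)
```

The resulting matrix can be passed to as.data.frame() if a data frame is preferred, or the list mods can be fed to map_dfr(mods, tidy, .id = "model") as in the answer above.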