Tidy approach to regression models, ideally with dplyr - r

Reading the documentation for do() in dplyr, I've been impressed by the ability to create regression models for groups of data and was wondering whether it would be possible to replicate it using different independent variables rather than groups of data.
So far I've tried
require(dplyr)
data(mtcars)
models <- data.frame(var = c("cyl", "hp", "wt"))
models <- models %>% do(mod = lm(mpg ~ as.name(var), data = mtcars))
Error in as.vector(x, "symbol") :
cannot coerce type 'closure' to vector of type 'symbol'
models <- models %>% do(mod = lm(substitute(mpg ~ i, as.name(.$var)), data = mtcars))
Error in substitute(mpg ~ i, as.name(.$var)) :
invalid environment specified
The desired final output would be something like
var slope standard_error_slope
1 cyl -2.87 0.32
2 hp -0.07 0.01
3 wt -5.34 0.56
I'm aware that something similar is possible using a lapply approach, but find the apply family largely inscrutable. Is there a dplyr solution?

There's nothing particularly complicated about the approach in the linked page. The use of substitute and as.name is a bit arcane, but that's easily rectified.
varlist <- names(mtcars)[-1]
models <- lapply(varlist, function(x) {
form <- formula(paste("mpg ~", x))
lm(form, data=mtcars)
})
dplyr is not the be-all and end-all of R programming. I'd suggest getting familiar with the *apply functions as they'll be of use in many situations where dplyr doesn't work.

This isn't pure "dplyr", but rather, "dplyr" + "tidyr" + "data.table". Still, I think it should be pretty easily readable.
library(data.table)
library(dplyr)
library(tidyr)
mtcars %>%
gather(var, val, cyl:carb) %>%
as.data.table %>%
.[, as.list(summary(lm(mpg ~ val))$coefficients[2, 1:2]), by = var]
# var Estimate Std. Error
# 1: cyl -2.87579014 0.322408883
# 2: disp -0.04121512 0.004711833
# 3: hp -0.06822828 0.010119304
# 4: drat 7.67823260 1.506705108
# 5: wt -5.34447157 0.559101045
# 6: qsec 1.41212484 0.559210130
# 7: vs 7.94047619 1.632370025
# 8: am 7.24493927 1.764421632
# 9: gear 3.92333333 1.308130699
# 10: carb -2.05571870 0.568545640
If you really just wanted a few variables, start with a vector, not a data.frame.
models <- c("cyl", "hp", "wt")
mtcars %>%
select_(.dots = c("mpg", models)) %>%
gather(var, val, -mpg) %>%
as.data.table %>%
.[, as.list(summary(lm(mpg ~ val))$coefficients[2, 1:2]), by = var]
# var Estimate Std. Error
# 1: cyl -2.87579014 0.3224089
# 2: hp -0.06822828 0.0101193
# 3: wt -5.34447157 0.5591010

Related

Mapping pipes to multiple columns in tidyverse

I'm working with a table for which I need to count the number of rows satisfying some criterion and I ended up with basically multiple repetitions of the same pipe differing only in the variable name.
Say I want to know how many cars are better than Valiant in mtcars on each of the variables there. An example of the code with two variables is below:
library(tidyverse)
reference <- mtcars %>%
slice(6)
mpg <- mtcars %>%
filter(mpg > reference$mpg) %>%
count() %>%
pull()
cyl <- mtcars %>%
filter(cyl > reference$cyl) %>%
count() %>%
pull()
tibble(mpg, cyl)
Except, suppose I need to do it for like 100 variables so there must be a more optimal way to just repeat the process.
What would be the way to rewrite the code above in an optimal way (maybe, using map() or anything else that works with pipes nicely so that the result would be a tibble with the counts for all the variables in mtcars?
I feel the solution should be very easy but I'm stuck.
Thank you!
Or:
library(tidyverse)
map_dfc(mtcars, ~sum(.x[6] < .x))
map2_dfc(mtcars, reference, ~sum(.y < .x))
You could use summarise + across to count observations greater than a certain value in each column.
library(dplyr)
mtcars %>%
summarise(across(everything(), ~ sum(. > .[6])))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 1 18 14 15 22 30 11 1 0 13 17 25
base solution:
# (1)
colSums(mtcars > mtcars[rep(6, nrow(mtcars)), ])
# (2)
colSums(sweep(as.matrix(mtcars), 2, mtcars[6, ], ">"))
# mpg cyl disp hp drat wt qsec vs am gear carb
# 18 14 15 22 30 11 1 0 13 17 25
You can do it in a loop for example. Like this:
library(tidyverse)
reference <- mtcars %>%
slice(6)
# Empty list to save outcome
list_outcome <- list()
# Get the columnnames to loop over
loop_var <- colnames(reference)
for(i in loop_var){
nr <- mtcars %>%
filter(mtcars[, i] > reference[, i]) %>%
count() %>%
pull()
# Save every iteration in the loop as the ith element of the list
list_outcome[[i]] <- data.frame(Variable = i, Value = nr)
}
# combine all the data frames in the list to one final data frame
df_result <- do.call(rbind, list_outcome)

Extract restricted regression coefficients without using tidy

I'm using the restriktor package to perform restricted regressions; however, at the same time I'm doing the restricted regressions by group using the dplyr. In order to extract the coefficients and have them formatted into a nice panel format, I use tidy and broom but the tidy packaged doesn't work on the restriktor so I'm not sure how to go about extracting the coefficients:
library(restriktor)
library(dplyr)
reg =
mtcars %>%
group_by(cyl) %>%
do(model = restriktor(lm(mpg ~ wt + hp, data =.), constraints = ' wt < -4 '))
I would like to have the b.restr which is the restricted model coefficients to be extracted for each group and formatted together into a panel normally I would use the following:
reg =
mtcars %>%
group_by(cyl) %>%
do({model = restriktor(lm(mpg ~ wt + hp, data =.), constraints = ' wt < -4 ') # create your model
data.frame(tidy(model), # get coefficient info
glance(model))})
But I get the following error:
Error: No tidy method for objects of class restriktor
All I want is to extract the following elements from the lists and put them altogether with their group identifier in one panel format:
reg[[2]][[1]][["b.restr"]]
Use group_modify (which is preferred over do now) with coef/as.list/as_tibble.
library(dplyr)
library(restriktor)
# coefficients and R2's or NAs if too few rows for restriktor
co <- function(fo, data) {
fm <- lm(fo, data)
coef_lm <- coef(fm)
min_rows <- length(coef_lm)
if (nrow(data) <= min_rows) NA * c(coef_lm, R2.org = NA, R2.reduced = NA)
else {
r <- restriktor(fm, constraints = ' wt < -4 ')
c(coef(r), R2.org = r$R2.org, R2.reduced = r$R2.reduced)
}
}
mtcars %>%
group_by(cyl) %>%
group_modify(~ {
.x %>%
co(mpg ~ wt + hp, .) %>%
as.list %>%
as_tibble
}) %>%
ungroup
giving:
tibble: 3 x 6
cyl `(Intercept)` wt hp R2.org R2.reduced
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 45.8 -5.12 -0.0905 0.681 0.681
2 6 35.3 -4 -0.0256 0.589 0.667
3 8 33.9 -4 -0.0132 0.497 0.652

Feeding vector into summarise_at

I'm sure it is something obvious since I'm an R novice, but I cannot figure out why the first approach is working while the second is not. Something is wrong with my use of "paste".
library(dplyr)
data(mtcars)
characteristics <- c('disp', 'hp')
summarise_at(df, .vars = vars(characteristics), mean)
characteristics <- paste('disp hp', collapse = ",")
summarise_at(df, .vars = vars(characteristics), mean)
If you want to summarise over disp and hp of mtcars why not use a simpler and more straigthforward approach, like so?
mtcars %>%
summarise(across(c('disp', 'hp'), mean))
disp hp
1 230.7219 146.6875
Of yourse, you can also 'feed' your vector into the across operation:
characteristics <- c('disp', 'hp')
mtcars %>%
summarise(across(characteristics, mean))
disp hp
1 230.7219 146.6875
Using summarise(across...)would also take into account that so-called scoped dplyr verbs have now essentially been superseded by across()
With help from a friend, I found the answer.
library('dplyr')
data(mtcars)
characteristics <- unlist(str_split('disp hp', ' '))
# the line above replaced characteristics <- paste('disp hp', collapse = ",")
summarise_at(mtcars, .vars = vars(characteristics), mean)

Extracting the coefficients of each model by using a series of map functions

I am creating the following model:
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
Based on the results you get from that, I am trying to extract the coefficients by using a series of map functions.
The results should look like this:
4 6 8
-5.647025 -2.780106 -2.192438
I am pulling my hair out trying to figure this out. Any help is appreciated.
You can use map_dbl with the coef function to pick out the "wt" coefficients:
coefs <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df)) %>%
map_dbl(~coef(.)[["wt"]])
It looks like
coefs <- (mtcars
%>% split(.$cyl)
%>% map(lm, formula = mpg~wt)
%>% map_dbl(~coef(.)[["wt"]])
)
should do what you want? If you want to get more information, ending with map_dfr(broom::tidy) instead of the map_dbl will be helpful (you can use the .id= argument too, although this is less useful when the list doesn't have named arguments).
This is very similar to #henryn's answer, although the map syntax (using the named formula argument means that the data get substituted as the next argument implicitly, so you don't have to use an anonymous function function(df) lm(mpg ~ wt, data = df) or (with R >= 4.1.0) \(df) lm(mpg ~ wt, data = df): I think the usual way of doing this, ~ lm(mpg ~ wt, data = .) might get messed up by the tilde in the formula, but I'm nto sure ...
Does this work:
mtcars %>% split(.$cyl) %>% map(function(x) {
c = lm(mpg ~ wt, data = x)
c$coefficients[2]
}) %>% unlist
4.wt 6.wt 8.wt
-5.647025 -2.780106 -2.192438
1) This could be done in straight dplyr:
mtcars %>%
group_by(cyl) %>%
summarize(wt = coef(lm(mpg ~ wt))[[2]], .groups = "drop")
giving:
# A tibble: 3 x 2
cyl wt
<dbl> <dbl>
1 4 -5.65
2 6 -2.78
3 8 -2.19
2) This variation also works:
mtcars %>%
group_by(cyl) %>%
summarize(wt = cov(mpg, wt) / var(wt), .groups = "drop")
3) Also consider this -- omit the [2] to get both coefficients.
library(nlme)
coef(lmList(mpg ~ wt | cyl, mtcars))[2]
giving:
wt
4 -5.647025
6 -2.780106
8 -2.192438

Creating data frame with repeat rows

I want to create a data frame with rows that repeat.
Here is my original dataset:
> mtcars_columns_a
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 disp mtcars mtcars$disp 230.72188
3 hp mtcars mtcars$hp 146.68750
Here is my desire dataset
> mtcars_columns_b
variables_interest data_set data_set_and_variables_interest mean
1 mpg mtcars mtcars$mpg 20.09062
2 mpg mtcars mtcars$mpg 20.09062
3 disp mtcars mtcars$disp 230.72188
4 disp mtcars mtcars$disp 230.72188
5 hp mtcars mtcars$hp 146.68750
6 hp mtcars mtcars$hp 146.68750
I know how to do this the long way manually, but this is time consuming and rigid. Is there a quicker way to do this that is more automated and flexible?
Here is the code I used to create the dataset:
# mtcars data
## displays data
mtcars
## 3 row data set
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_a <-
data.frame(
c(
"mpg",
"disp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_a)[names(mtcars_columns_a) == 'c..mpg....disp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_a$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_a$data_set_and_variables_interest <-
paste(mtcars_columns_a$data_set,mtcars_columns_a$variables_interest,sep = "$")
### creates mean column
mtcars_columns_a$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$hp)
)
## 6 row data set., the long way
### lists columns of interest
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: lists variables of interest
mtcars_columns_b <-
data.frame(
c(
"mpg",
"mpg",
"disp",
"disp",
"hp",
"hp"
)
)
# ---- NOTE: REQUIRES MANUAL INPUT
# ---- NOTE: adds colnames
names(mtcars_columns_b)[names(mtcars_columns_b) == 'c..mpg....mpg....disp....disp....hp....hp..'] <- 'variables_interest'
### adds data set info
mtcars_columns_b$data_set <-
c("mtcars")
### creates data_set_and_variables_interest column
mtcars_columns_b$data_set_and_variables_interest <-
paste(mtcars_columns_b$data_set,mtcars_columns_b$variables_interest,sep = "$")
### creates mean column
mtcars_columns_b$mean <-
c(
mean(mtcars$mpg),
mean(mtcars$mpg),
mean(mtcars$disp),
mean(mtcars$disp),
mean(mtcars$hp),
mean(mtcars$hp)
)
You can try rep like below
mtcars_columns_a[rep(seq(nrow(mtcars_columns_a)), each = 2),]
Another option is uncount
library(dplyr)
library(tidyr)
mtcars_columns_a %>%
uncount(2)
Based on your expected output is this the sort of thing you were after?
The selection of required variables is made with the select function and the mean calculated using the summarise function following group_by variables.
The duplication of data and adding of additional variables (not really sure if these are necessary) is carried out using mutate.
You can edit variable names using the dplyr::rename function.
library(dplyr)
library(tidyr)
df <-
mtcars %>%
select(mpg, disp, hp) %>%
pivot_longer(everything()) %>%
group_by(name) %>%
summarise(mean = mean(value))
df1 <-
bind_rows(df, df) %>%
arrange(name) %>%
mutate(dataset = "mtcars",
variable = paste(dataset, name, sep = "$"))
df1
#> # A tibble: 6 x 4
#> name mean dataset variable
#> <chr> <dbl> <chr> <chr>
#> 1 disp 231. mtcars mtcars$disp
#> 2 disp 231. mtcars mtcars$disp
#> 3 hp 147. mtcars mtcars$hp
#> 4 hp 147. mtcars mtcars$hp
#> 5 mpg 20.1 mtcars mtcars$mpg
#> 6 mpg 20.1 mtcars mtcars$mpg
Created on 2021-04-06 by the reprex package (v1.0.0)
The order of records in a data.frame object is usually not meaningful, so you could just do:
rbind(mtcars_columns_a, mtcars_columns_a)
If you need it to be in the order you showed, this is also simple:
mtcars_columns_b <- rbind(mtcars_columns_a, mtcars_columns_a)
mtcars_columns_b[order(mtcars_columns_b, mtcars_columns_b$name),]

Resources