Retain nesting variable when using select on nested tibble - r

I am using the code from this question (below) to save columns of nested tibble into a new list of tibbles (each column being a tibble in the list). However, when using selected on the nested tibble, the nested variable is lost. Which I'd like to retain, it keeps the grouping variable with the results.
e.g., results %>% unnest(tidied) keeps "carb", but 'results %>% select(tidied) %>% map(~bind_rows(.))' does not.
How can I keep the nested variable with the selected columns?
library(tidyverse)
library(broom)
data(mtcars)
df <- mtcars
nest.df <- df %>% nest(-carb)
results <- nest.df %>%
mutate(fit = map(data, ~ lm(mpg ~ wt, data=.x)),
tidied = map(fit, tidy),
glanced = map(fit, glance),
augmented = map(fit, augment))
final <- results %>% select(glanced, tidied, augmented ) %>%
map(~bind_rows(.))

We can do a mutate_at before the select step (not clear about the expected output though). Here mutate_at in looping through each column, but these columns are also tibble, so inside the function (list(~), we use map2 to pass the column and the 'carb' column, then create a new column with the list tibble column by mutateing with new column 'carb'
results %>%
mutate_at(vars(glanced, tidied, augmented),
list(~ map2(.,carb, ~ .x %>% mutate(carb = .y)))) %>%
select(glanced, tidied, augmented) %>%
map(~ bind_rows(.x))
$glanced
# A tibble: 6 x 12
# r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
#1 0.696 0.658 2.29 18.3 0.00270 2 -21.4 48.7 49.6 41.9 8 4
#2 0.654 0.585 3.87 9.44 0.0277 2 -18.2 42.4 42.3 74.8 5 1
#3 0.802 0.777 2.59 32.3 0.000462 2 -22.6 51.1 52.1 53.5 8 2
#4 0.00295 -0.994 1.49 0.00296 0.965 2 -3.80 13.6 10.9 2.21 1 3
#5 0 0 NaN NA NA 1 Inf -Inf -Inf 0 0 6
#6 0 0 NaN NA NA 1 Inf -Inf -Inf 0 0 8
#$tidied
# A tibble: 10 x 6
# term estimate std.error statistic p.value carb
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 27.9 2.91 9.56 0.0000118 4
# 2 wt -3.10 0.724 -4.28 0.00270 4
#...
#...

Related

Using mutate(across(...)) with purrr::map

I'm having trouble figuring out how to use purrr::map() with mutate(across(...)).
I want to do a linear model and pull out the estimate for the slope of multiple columns as predicted by a single column.
Here is what I'm attempting with an example data set:
mtcars %>%
mutate(across(-mpg),
map(.x, lst(slope = ~lm(.x ~ mpg, data = .x) %>%
tidy() %>%
filter(term != "(Intercept") %>%
pull(estimate)
)))
The output I'm looking for would be new columns for each non-mpg column with _slope appended to the name, ie cyl_slope
In my actual data, I'll be grouping by another variable as well in case that matters, as I need the slope for each group for each predicted variable. I have this working in a standard mutate doing one variable at a time as follows:
df %>%
group_by(unitid) %>%
nest() %>%
mutate(tuition_and_fees_as_pct_total_rev_slope = map_dbl(data, ~lm(tuition_and_fees_as_pct_total_rev ~ year, data = .x) %>%
tidy() %>%
filter(term == "year") %>%
pull(estimate)
))
So:
I think my issue is how to pass the column name being predicted into the lm
I don't know if the solution requires nesting or not, so it would be appreciated if in the mtcars example that is considered.
If we wanted to do lm on all other columns with independent variable as 'mpg', one option is to loop over the column names of the 'mtcars' except the 'mpg', create the formula with reformulate, apply the lm, convert to a tidy format, filter out the 'Intercept' and select the 'estimate' column
library(dplyr)
library(tidyr)
library(broom)
map_dfc(setdiff(names(mtcars), 'mpg'), ~
lm(reformulate('mpg', response = .x), data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(estimate))
-output
# A tibble: 1 x 10
# estimate...1 estimate...2 estimate...3 estimate...4 estimate...5 estimate...6 estimate...7 estimate...8 estimate...9 estimate...10
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
Or this can be done more easily with a matrix as dependent
library(stringr)
lm(as.matrix(mtcars[setdiff(names(mtcars), "mpg")]) ~ mpg,
data = mtcars) %>%
tidy %>%
filter(term != "(Intercept)") %>%
select(response, estimate) %>%
mutate(response = str_c(response, '_slope'))
-output
# A tibble: 10 x 2
# response estimate
# <chr> <dbl>
# 1 cyl_slope -0.253
# 2 disp_slope -17.4
# 3 hp_slope -8.83
# 4 drat_slope 0.0604
# 5 wt_slope -0.141
# 6 qsec_slope 0.124
# 7 vs_slope 0.0555
# 8 am_slope 0.0497
# 9 gear_slope 0.0588
#10 carb_slope -0.148
Or another option is summarise with across
mtcars %>%
summarise(across(-mpg, ~ list(lm(reformulate('mpg',
response = cur_column())) %>%
tidy %>%
filter(term != "(Intercept)") %>%
pull(estimate)), .names = "{.col}_slope")) %>%
unnest(everything())
# A tibble: 1 x 10
# cyl_slope disp_slope hp_slope drat_slope wt_slope qsec_slope vs_slope am_slope gear_slope carb_slope
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.253 -17.4 -8.83 0.0604 -0.141 0.124 0.0555 0.0497 0.0588 -0.148
One option could be:
map_dfr(.x = names(select(mtcars, -c(mpg, vs))),
~ mtcars %>%
group_by(vs) %>%
nest() %>%
mutate(variable = .x,
estimate = map_dbl(data, function(y) lm(!!sym(.x) ~ mpg, data = y) %>%
tidy() %>%
filter(term != "(Intercept)") %>%
pull(estimate))) %>%
select(-data))
vs variable estimate
<dbl> <chr> <dbl>
1 0 cyl -0.242
2 1 cyl -0.116
3 0 disp -22.5
4 1 disp -8.01
5 0 hp -10.1
6 1 hp -3.26
7 0 drat 0.0748
8 1 drat 0.0529
9 0 wt -0.192
10 1 wt -0.113
11 0 qsec -0.0357
12 1 qsec -0.0432
13 0 am 0.0742
14 1 am 0.0710
15 0 gear 0.114
16 1 gear 0.0492
17 0 carb -0.0883
18 1 carb -0.0790

dot notation in magrittr pipe not be

I thought I understood that in conjunction with the magrittr pipe, the dot-notation indicates where the dataset that is piped into a function should go for evaluation. When I was starting to work with purrr/broom to generate some nested dataframes with the linear models I was generating by group I ran into a problem. When using the dot notation it seems that my prior group_by command was being ignored. Took me a while to figure out that I should simply omit the dot-notation and it works like expected, but I would like to understand why it is not working.
Here is the sample code that I expected to generate identical data, but only the first example is generating linear models by group, while the second generates the model for the whole dataset, but then still stores it at the group level.
#// library and data prep
library(tidyverse)
library(broom)
data <- as_tibble(mtcars)
#// generates lm fit for the model by group
data %>%
#// group by factor
group_by(carb) %>%
#// summary for the grouped dataset
summarize(new = list( tidy( lm(formula = drat ~ mpg)))) %>%
#// unnest
unnest(cols = new)
#> Warning in summary.lm(x): essentially perfect fit: summary may be unreliable
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 6
#> carb term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 1.72e+ 0 5.85e- 1 2.94e+ 0 3.24e- 2
#> 2 1 mpg 7.75e- 2 2.26e- 2 3.44e+ 0 1.85e- 2
#> 3 2 (Intercept) 1.44e+ 0 5.87e- 1 2.46e+ 0 3.95e- 2
#> 4 2 mpg 1.01e- 1 2.55e- 2 3.95e+ 0 4.26e- 3
#> 5 3 (Intercept) 3.07e+ 0 6.86e-15 4.48e+14 1.42e-15
#> 6 3 mpg 3.46e-17 4.20e-16 8.25e- 2 9.48e- 1
#> 7 4 (Intercept) 2.18e+ 0 4.29e- 1 5.07e+ 0 9.65e- 4
#> 8 4 mpg 8.99e- 2 2.65e- 2 3.39e+ 0 9.43e- 3
#> 9 6 (Intercept) 3.62e+ 0 NaN NaN NaN
#> 10 6 mpg NA NA NA NA
#> 11 8 (Intercept) 3.54e+ 0 NaN NaN NaN
#> 12 8 mpg NA NA NA NA
#// generates lm fit for the whole model
data %>%
#// group by factor
group_by(carb) %>%
#// summary for the whole dataset
summarize(new = list( tidy( lm(formula = drat ~ mpg, data = .)))) %>%
#// unnest
unnest(cols = new)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 6
#> carb term estimate std.error statistic p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 2 1 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 3 2 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 4 2 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 5 3 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 6 3 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 7 4 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 8 4 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 9 6 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 10 6 mpg 0.0604 0.0119 5.10 1.78e- 5
#> 11 8 (Intercept) 2.38 0.248 9.59 1.20e-10
#> 12 8 mpg 0.0604 0.0119 5.10 1.78e- 5
Created on 2021-01-04 by the reprex package (v0.3.0)
. in this case refers to data which is present in the previous step which is (data %>% group_by(carb)). Although the data is grouped it is still complete data. If you are on dplyr > 1.0.0 you could use cur_data() to refer to the data in the group.
library(dplyr)
library(broom)
library(tidyr)
data %>%
group_by(carb) %>%
summarize(new = list(tidy(lm(formula = drat ~ mpg, data = cur_data())))) %>%
unnest(cols = new)
This gives the same output as your first example.
Note that you can use . to refer to the grouped data with group_modify instead of summarise:
data %>%
group_by(carb) %>%
group_modify(~lm(formula = drat ~ mpg, data = .) %>% tidy)
* Just an alternative - I think list-columns + unnest-variants are considered the better approach now.

conditionally mutating column values using `dplyr`

I am using WRS2 to carry out robust pairwise comparisons. But one problem is that it removes the group level names from the output dataframes and saves it in a different object.
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# comparisons
x$comp
#> Group Group psihat ci.lower ci.upper p.value
#> [1,] 1 2 -1.0 -3.440879 1.44087853 0.25984505
#> [2,] 1 3 -2.8 -5.536161 -0.06383861 0.04914871
#> [3,] 2 3 -1.8 -4.536161 0.93616139 0.17288911
# vector with group level names
x$fnames
#> [1] "placebo" "low" "high"
I can convert it to a tibble:
# converting to tibble
suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 -1 -3.44 1.44 0.260
#> 2 1 3 -2.8 -5.54 -0.0638 0.0491
#> 3 2 3 -1.8 -4.54 0.936 0.173
I would then like to replace the group column numeric values with actual names included in fnames (so map fnames[1] -> 1, fnames[2] -> 2, and so on).
So the final dataframe should look something like the following-
#> # A tibble: 3 x 6
#> group1 group2 psihat ci.lower ci.upper p.value
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
In this case, it was easy to just copy-paste the three values, but I want to have a generalizable approach where no matter the number of levels, it works. How can I do this using dplyr?
Using a named vector to match with tidyverse. This matches by value and not by the sequence of index i.e. if the value in 'Group' columns are not in a sequence or character, this would still work
library(dplyr)
as_tibble(x$comp, .name_repair = 'unique') %>%
mutate(across(starts_with("Group"),
~ setNames(x$fnames, seq_along(x$fnames))[as.character(.)]))
Does this fullfil your needs :
names <- c("A","B","C")
df = data.frame(group=c(1,2,3))
library(dplyr)
df %>% mutate(group = names[group])
group
1 A
2 B
3 C
Here's an approach using the recode function, with the recoding vector built programmatically from the data:
# Setup
set.seed(123)
library(WRS2)
library(tidyverse)
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
# Create recoding vector
recode.vec = x$fnames %>% set_names(1:length(x$fnames))
# Recode columns
x.comp = x$comp %>%
as_tibble(.name_repair=make.unique) %>%
mutate(across(starts_with("Group"), ~recode(., !!!recode.vec)))
Output:
x.comp
#> # A tibble: 3 x 6
#> Group Group.1 psihat ci.lower ci.upper p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 placebo low -1 -3.44 1.44 0.260
#> 2 placebo high -2.8 -5.54 -0.0638 0.0491
#> 3 low high -1.8 -4.54 0.936 0.173
Try this tidyverse approach formating data to long after extracting the objects as tibbles. You can use left_join() to get your groups as you want. Here the code to get something close to what you want:
# setup
set.seed(123)
library(WRS2)
library(tidyverse)
# robust pairwise comparisons
x <- lincon(libido ~ dose, data = viagra, tr = 0.1)
#Transform to tibble
df1 <- suppressMessages(as_tibble(x$comp, .name_repair = "unique")) %>%
dplyr::rename(group1 = Group...1, group2 = Group...2)
#Extract labels
df2 <- tibble(treat=x$fnames) %>% mutate(value=1:n())
#Format to long df1
df1 <- df1 %>%
mutate(id=1:n()) %>%
pivot_longer(cols = c(group1,group2)) %>%
rename(group=name) %>% left_join(df2) %>% select(-value) %>%
pivot_wider(names_from = group,values_from=treat) %>% select(-id)
Output:
# A tibble: 3 x 6
psihat ci.lower ci.upper p.value group1 group2
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 -1 -3.44 1.44 0.260 placebo low
2 -2.8 -5.54 -0.0638 0.0491 placebo high
3 -1.8 -4.54 0.936 0.173 low high

summarize across -- is it order dependent?

I came across something weird with dplyr and across, or at least something I do not understand.
If we use the across function to compute the mean and standard error of the mean across multiple columns, I am tempted to use the following command:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}"),
across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}")) %>% head()
Which results in
gear mpg cyl se_mpg se_cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 16.1 7.47 NA NA
2 4 24.5 4.67 NA NA
3 5 21.4 6 NA NA
However, if I switch the order of the individual across commands, I get the following:
mtcars %>% group_by(gear) %>% select(mpg,cyl) %>%
summarize(across(everything(), ~sd(.x, na.rm=T)/sqrt(sum(!is.na(.x))), .names="se_{col}"),
across(everything(), ~mean(.x, na.rm = TRUE), .names = "{col}")) %>% head()
# A tibble: 3 x 5
gear se_mpg se_cyl mpg cyl
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3 0.871 0.307 16.1 7.47
2 4 1.52 0.284 24.5 4.67
3 5 2.98 0.894 21.4 6
Why is this the case? Does it have something to do with my usage of everything()? In my situation I'd like the mean and the standard error of the mean calculated across every variable in my dataset.
I have no idea why summarize behaves like that, it's probably due to an underlying interaction of the two across functions (although it seems weird to me). Anyway, I suggest you to write a single across statement and use a list of lambda functions as suggested by the across documentation.
In this way it doesn't matter if the mean or the standard deviation is specified as first function, you will get no NAs.
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
mean = ~mean(.x, na.rm = TRUE),
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x)))
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear mean_mpg se_mpg mean_cyl se_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 16.1 0.871 7.47 0.307
# 2 4 24.5 1.52 4.67 0.284
# 3 5 21.4 2.98 6 0.894
mtcars %>%
group_by(gear) %>%
select(mpg, cyl) %>%
summarize(across(everything(), list(
se = ~sd(.x, na.rm = TRUE)/sqrt(sum(!is.na(.x))),
mean = ~mean(.x, na.rm = TRUE)
), .names = "{fn}_{col}"))
# A tibble: 3 x 5
# gear se_mpg mean_mpg se_cyl mean_cyl
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 0.871 16.1 0.307 7.47
# 2 4 1.52 24.5 0.284 4.67
# 3 5 2.98 21.4 0.894 6

r studio: simulate my code 1000 times and pick the things which p value<0.05

Here is my original code:
x = rbinom(1000,1,0.5)
z = log(1.3)*x
pr = 1/(1+exp(-z))
y = rbinom(1000,1,pr)
k=glm(y~x,family="binomial")$coef
t=exp(k)
How can I simulate it 1000 times and pick the one with a p-value<0.05?
This is a perfect application for the tidyverse and it's list columns. Please see explanation in the inline comments.
library(tidyverse)
library(broom)
# create a tibble with an id column for each simulation and x wrapped in list()
sim <- tibble(id = 1:1000,
x = list(rbinom(1000,1,0.5))) %>%
# to generate z, pr, y, k use map and map2 from the purrr package to loop over the list column x
# `~ ... ` is similar to `function(.x) {...}`
# `.x` represents the variable you are using map on
mutate(z = map(x, ~ log(1.3) * .x),
pr = map(z, ~ 1 / (1 + exp(-.x))),
y = map(pr, ~ rbinom(1000, 1, .x)),
k = map2(x, y, ~ glm(.y ~ .x, family="binomial")),
# use broom::tidy to get the model summary in form of a tibble
sum = map(k, broom::tidy)) %>%
# select id and sum and unnest the tibbles
select(id, sum) %>%
unnest(cols = c(sum)) %>%
# drop the intercepts and every .x with a p < 0.05
filter(term !="(Intercept)",
p.value < 0.05)
sim
#> # A tibble: 545 x 6
#> id term estimate std.error statistic p.value
#> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 3 .x 0.301 0.127 2.37 0.0176
#> 2 7 .x 0.263 0.127 2.06 0.0392
#> 3 8 .x 0.293 0.127 2.31 0.0211
#> 4 11 .x 0.377 0.128 2.96 0.00312
#> 5 12 .x 0.265 0.127 2.08 0.0373
#> 6 13 .x 0.366 0.127 2.88 0.00403
#> 7 14 .x 0.461 0.128 3.61 0.000305
#> 8 17 .x 0.274 0.127 2.16 0.0309
#> 9 18 .x 0.394 0.127 3.09 0.00200
#> 10 19 .x 0.371 0.127 2.92 0.00354
#> # … with 535 more rows
Created on 2020-05-18 by the reprex package (v0.3.0)
I won't do this for you, but these are the steps you'll probably want to go through:
Write your code as a function that returns the value you're interested in (presumably t)
Use something like replicate to run this function many times and record all the answers
Use something like quantile to extract the percentile you're interested in

Resources