use of other columns as arguments to function in summarize_at() - r

This works great:
> mtcars %>% group_by(cyl) %>% summarize_at(vars(disp, hp), weighted.mean)
# A tibble: 3 x 3
cyl disp hp
<dbl> <dbl> <dbl>
1 4.00 105 82.6
2 6.00 183 122
3 8.00 353 209
But now I want to use one of the columns from mtcars as the w argument to weighted.mean. Sadly, the obvious attempt fails:
> mtcars %>% group_by(cyl) %>% summarize_at(vars(disp, hp), weighted.mean, w = wt)
Error in dots_list(...) : object 'wt' not found
Even though wt is, indeed, part of mtcars. How can I use other columns as arguments to a function within summarize_at()?

You could try the funs() syntax (note that funs() has been deprecated since dplyr 0.8.0):
mtcars %>% group_by(cyl) %>%
summarize_at(vars(disp, hp), funs(weighted.mean(.,wt)))
# cyl disp hp
# <dbl> <dbl> <dbl>
#1 4.00 110 83.4
#2 6.00 185 122
#3 8.00 362 209
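For newer dplyr versions, the same idea can probably be written with across() and a lambda. This is a minimal sketch that assumes dplyr >= 1.0.0; the name weighted_means is only illustrative. Because the lambda is evaluated inside summarise(), wt refers to the weights of the current group.

```r
library(dplyr)

# Weighted mean of disp and hp within each cyl group,
# using the wt column of the same group as weights.
weighted_means <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(disp, hp), ~ weighted.mean(.x, w = wt)))

weighted_means
```

The formula ~ weighted.mean(.x, w = wt) plays the role of funs(weighted.mean(., wt)): .x is the column being summarised, and wt is looked up in the grouped data.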

Related

Summarize different Columns with different Functions with dplyr in r

I have seen this question (Summarize different Columns with different Functions),
but in my situation I want to use sum() on mpg, disp and hp, and mean() on drat, wt and qsec.
All of the functions should be applied within groups defined by the variable cyl.
Like this:
result.1 = mtcars %>% group_by(cyl) %>% summarise(across(.cols = c(mpg, disp, hp),
.fns = sum))
result.2 = mtcars %>% group_by(cyl) %>% summarise(across(.cols = c(drat:qsec),
.fns = mean))
final.result = full_join(result.1, result.2)
Is it possible to get final.result using summarise() only once?
Any help will be highly appreciated!
You can use across() twice in the same summarise() call:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(across(.cols = c(mpg, disp, hp),.fns = sum),
across(.cols = c(drat:qsec),.fns = mean))
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 293. 1156. 909 4.07 2.29 19.1
#2 6 138. 1283. 856 3.59 3.12 18.0
#3 8 211. 4943. 2929 3.23 4.00 16.8
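As a quick sanity check, the single-call result can be compared with the two-step join from the question. This is only a sketch, assuming dplyr >= 1.0.0; the names sum_part, mean_part, joined and single are illustrative.

```r
library(dplyr)

# Two-step version from the question: summarise separately, then join on cyl.
sum_part <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp, hp), sum))
mean_part <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(drat:qsec, mean))
joined <- full_join(sum_part, mean_part, by = "cyl")

# One-step version: two across() calls inside a single summarise().
single <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp, hp), sum),
            across(drat:qsec, mean))

# Compare as plain data frames to sidestep tibble-specific comparison methods.
same <- isTRUE(all.equal(as.data.frame(joined), as.data.frame(single)))
same
```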

Add Another Column Info to Results of groupby r

Can someone help me please?
I have columns A, B and C. I want to get the top value of column C, grouped by A, but also keep the value of B that belongs to each of those top values.
Max <-X %>% select(A,B,C) %>% group_by(A) %>% summarise(top = max(C))
But this code only shows me the top value for each unique value of A, so I don't know which B value goes with it. (Important: using group_by(A,B) doesn't work, because it doesn't give the top value for each unique A value; it returns the same rows as the data frame X.)
This could be achieved via dplyr::top_n or dplyr::slice_max like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)
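One thing to keep in mind is that slice_max() keeps all tied rows by default, so a group can return more than one row. The sketch below uses a small made-up data frame X (the values are hypothetical, not from the question) and assumes dplyr >= 1.0.0, where slice_max() has a with_ties argument.

```r
library(dplyr)

# Hypothetical data: group "b" has two rows tied for the maximum of C.
X <- tibble(
  A = c("a", "a", "b", "b", "b"),
  B = c("p", "q", "r", "s", "t"),
  C = c(1, 5, 3, 3, 2)
)

# Default behaviour: ties are kept, so group "b" contributes two rows.
with_ties <- X %>% group_by(A) %>% slice_max(C) %>% ungroup()

# with_ties = FALSE returns exactly one row per group.
one_per_group <- X %>%
  group_by(A) %>%
  slice_max(C, n = 1, with_ties = FALSE) %>%
  ungroup()
```

If exactly one row per A is needed, with_ties = FALSE is probably the safer choice; otherwise the default makes ties visible instead of silently dropping them.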

Losing group_by information when using dplyr::do for the second time

I am running multiple models on multiple sections of my data set, similar to (but with many more models)
library(tidyverse)
d1 <- mtcars %>%
group_by(cyl) %>%
do(mod_linear = lm(mpg ~ disp + hp, data = ., x = TRUE))
d1
# Source: local data frame [3 x 3]
# Groups: <by row>
#
# # A tibble: 3 x 3
# cyl mod_linear
# * <dbl> <list>
# 1 4. <S3: lm>
# 2 6. <S3: lm>
# 3 8. <S3: lm>
I then tidy this tibble and save my parameter estimates using tidy() in the broom package.
I also want to calculate the standard deviation of the predictors (stored in models above as I set x = TRUE) to create and then compare re-scaled parameters. I can do the former of these using
d1 %>%
# group_by(cyl) %>%
do(term = colnames(.$mod$x),
pred_sd = apply(X = .$mod$x, MARGIN = 2, FUN = sd)) %>%
unnest()
# # A tibble: 9 x 2
# term pred_sd
# <chr> <dbl>
# 1 (Intercept) 0.00000
# 2 disp 26.87159
# 3 hp 20.93453
# 4 (Intercept) 0.00000
# 5 disp 41.56246
# 6 hp 24.26049
# 7 (Intercept) 0.00000
# 8 disp 67.77132
# 9 hp 50.97689
However, the result is not a grouped tibble, so I end up losing the cyl column that tells me which terms belong to which model. How can I avoid this loss? Adding group_by again seems to throw an error.
n.b. I want to avoid using purrr, at least for the first part (fitting the models), as I run different types of models and then need to reshape the results (d1), and I like the progress bar with do.
n.b. I want to work with the $x component of the models rather than the raw data, as it holds the data on the correct scale (I am experimenting with different transformations of the predictors).
We can do this by nesting the data first and then unnesting the results:
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_linear = map(data, ~ lm(mpg ~ disp + hp, data = .x, x = TRUE)),
term = map(mod_linear, ~ names(coef(.x))),
pred = map(mod_linear, ~ .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist )) %>%
select(-data, -mod_linear) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
Or, instead of calling map multiple times, this can be made more compact with:
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_contents = map(data, ~ {
mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
term <- names(coef(mod))
pred <- mod$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
tibble(term, pred)
}
)) %>%
select(-data) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
If we start from 'd1' (based on the OP's code):
d1 %>%
ungroup %>%
mutate(mod_contents = map(mod_linear, ~ {
pred <- .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
term <- .x %>%
coef %>%
names
tibble(term, pred)
})) %>%
select(-mod_linear) %>%
unnest
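With more recent tidyr releases, nest() and unnest() expect the columns to be named explicitly, so the calls above may produce warnings. The following is a hedged sketch of the same approach, assuming tidyr >= 1.0.0, dplyr and purrr; the names pred_by_cyl and pred_sd are illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

pred_by_cyl <- mtcars %>%
  nest(data = -cyl) %>%                      # one row per cyl, remaining columns in `data`
  mutate(mod_contents = map(data, ~ {
    mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
    tibble(
      term    = colnames(mod$x),             # columns of the stored model matrix
      pred_sd = unname(apply(mod$x, 2, sd))  # sd of each model-matrix column
    )
  })) %>%
  select(-data) %>%
  unnest(mod_contents)                       # cyl is kept alongside term and pred_sd

pred_by_cyl
```

Because cyl stays in the outer tibble throughout, every term remains labelled with the model it came from.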

Why does as_tibble() round floats to the nearest integer?

When using as_tibble in dplyr 0.7.4 and R 3.4.1 I get the following outputs
mtcars %>% aggregate(disp ~ cyl, data=., mean) %>% as_tibble()
which outputs
# A tibble: 3 x 2
cyl disp
<dbl> <dbl>
1 4.00 105
2 6.00 183
3 8.00 353
while
mtcars %>% aggregate(disp ~ cyl, data=., mean)
outputs
cyl disp
1 4 105.1364
2 6 183.3143
3 8 353.1000
Not really surprisingly, the following
mtcars %>% group_by(cyl) %>% summarise(disp=mean(disp))
gives again
# A tibble: 3 x 2
cyl disp
<dbl> <dbl>
1 4.00 105
2 6.00 183
3 8.00 353
Why is this rounding happening and how can I avoid it?
This is not rounding; it is only how {tibble} displays the data when printing. The stored values keep their full precision:
> mtcars %>%
+ aggregate(disp ~ cyl, data=., mean) %>%
+ as_tibble() %>%
+ pull(disp)
[1] 105.1364 183.3143 353.1000
If you want to see more digits, you can print it as a data.frame:
> mtcars %>%
+ aggregate(disp ~ cyl, data=., mean) %>%
+ as_tibble() %>%
+ as.data.frame()
cyl disp
1 4 105.1364
2 6 183.3143
3 8 353.1000
(and yes, the last two lines of that pipeline cancel each other out and are redundant)
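Another option, sketched here under the assumption that a reasonably recent tibble/pillar is installed, is to raise the number of significant digits used for printing through the pillar.sigfig option. The name disp_means is illustrative.

```r
library(dplyr)

disp_means <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp))

# The underlying values are not rounded, only their printed form is.
disp_means$disp

# Temporarily ask pillar to show up to 7 significant digits when printing tibbles.
old_opts <- options(pillar.sigfig = 7)
print(disp_means)
options(old_opts)  # restore the previous setting
```

This keeps the object a tibble and only changes how it is displayed.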

dplyr group by colnames described as vector of strings

I'm trying to group_by multiple columns in my data frame. I can't write out every single column name in the group_by call, so I want to pass the column names as a character vector, like so:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
mtcars %>% filter(disp < 160) %>% group_by(cols) %>% summarise(n = n())
This returns an error:
Error in mutate_impl(.data, dots) :
Column `mtcars[colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]]` must be length 12 (the number of rows) or one, not 7
I definitely want to use a dplyr function to do this, but can't figure this one out.
Update
group_by_at() has been superseded; see https://dplyr.tidyverse.org/reference/group_by_all.html. Refer to Harrison Jones' answer for the current recommended approach.
Retaining the below approach for posterity
You can use group_by_at, where you can pass a character vector of column names as group variables:
mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
Or you can move the column selection inside group_by_at using vars and column select helper functions:
mtcars %>%
filter(disp < 160) %>%
group_by_at(vars(matches('[a-z]{3,}$'))) %>%
summarise(n = n())
# A tibble: 12 x 8
# Groups: mpg, cyl, disp, drat, qsec, gear [?]
# mpg cyl disp drat qsec gear carb n
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 19.7 6 145.0 3.62 15.50 5 6 1
# 2 21.4 4 121.0 4.11 18.60 4 2 1
# 3 21.5 4 120.1 3.70 20.01 3 1 1
# 4 22.8 4 108.0 3.85 18.61 4 1 1
# ...
I believe group_by_at has now been superseded by a combination of group_by and across. In addition, summarise has an experimental .groups argument that lets you choose how the grouping is handled after summarising. Here is an alternative to consider:
cols <- grep("[a-z]{3,}$", colnames(mtcars), value = TRUE)
original <- mtcars %>%
filter(disp < 160) %>%
group_by_at(cols) %>%
summarise(n = n())
superseded <- mtcars %>%
filter(disp < 160) %>%
group_by(across(all_of(cols))) %>%
summarise(n = n(), .groups = 'drop_last')
all.equal(original, superseded)
Here is a blog post that goes into more detail about using the across function:
https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
