Summarize different columns with different functions with dplyr in R

I have seen this (Summarize different Columns with different Functions)
But in my situation, I want to use sum() with mpg, disp and hp, and mean() with drat, wt and qsec.
All the functions should be applied within the grouping variable cyl.
Like this:
result.1 = mtcars %>% group_by(cyl) %>% summarise(across(.cols = c(mpg, disp, hp),
.fns = sum))
result.2 = mtcars %>% group_by(cyl) %>% summarise(across(.cols = c(drat:qsec),
.fns = mean))
final.result = full_join(result.1, result.2)
Is it possible to get final.result using summarise() only once?
Any help will be highly appreciated!

You can use across() twice in the same summarise() call:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(across(.cols = c(mpg, disp, hp),.fns = sum),
across(.cols = c(drat:qsec),.fns = mean))
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 293. 1156. 909 4.07 2.29 19.1
#2 6 138. 1283. 856 3.59 3.12 18.0
#3 8 211. 4943. 2929 3.23 4.00 16.8
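If you also want the output columns to say which function was applied, each across() call accepts a .names argument. The following is a minimal sketch of that idea; the "_sum" and "_mean" suffixes are my own choice for illustration, not something the question asked for.

```r
library(dplyr)

# Same single summarise() call as above, but each across() labels
# its output columns with a suffix (suffixes are illustrative).
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(c(mpg, disp, hp), sum,  .names = "{.col}_sum"),
    across(drat:qsec,        mean, .names = "{.col}_mean")
  )

names(res)
```

The result should have one row per cyl value and seven columns: cyl, three sums and three means.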

Related

R split/map from purrr combined with tidy from broom to get linear model statistics by group

I found this code online at tidyverse.org at this link:
mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_dbl("r.squared")
The code works as expected. I'm now practicing with this same structure but using a long dataframe. You can see the code; it's mostly the same. First I convert to a tibble, add rownames for cars, select numeric variables, and make the dataframe a long data frame.
mtcars <- as_tibble(mtcars, rownames = 'car')
mtcars_numeric <- mtcars %>%
select(car, mpg, disp, hp, drat, wt, qsec)
mtcars_long_numeric <- pivot_longer(mtcars_numeric, names_to = 'names', values_to = 'values', 3:7)
mtcars_long_numeric %>%
split(.$names) %>%
map(~ lm(mpg ~ values, data = .)) %>%
map(summary) %>%
map_df("r.squared") %>%
pivot_longer(., names_to = 'explanatory_variable_to_mpg', values_to = 'r_squared', 1:5) %>%
arrange(desc(r_squared))
But what about other model statistics like p-value? How do I extract that? If I just change "r.squared" to "p.value" it doesn't work. I've tried other variations like "p_value" and "pvalue" and it doesn't work. I also don't know how to find the right names for these objects.
I can create a linear model object and look at the r.squared in the summary and get the right value.
mtcars_linear_model <- lm(mpg ~ wt, mtcars)
summary(mtcars_linear_model)$r.squared
...But outside of this vignette I don't know how I would have known that r.squared existed in the summary of linear model. If I just type the dollar sign after the summary(lm) I get values that don't exist. (Is this a bug?)
Then I tried a different tactic. I can see that if I use broom and tidy the linear model object I have other statistics:
broom::tidy(mtcars_linear_model)
Is there any way to add the broom::tidy function to these data frames involving purrr:map? The purpose would be to figure out how to extract other model statistics like p-value. Also, how do I find a comprehensive list of items I can extract from the summary of a linear model object summary(lm)$'?'
The following code doesn't work. I tried a few variations like %>% tidy() or else to wrap tidy around map(summary) like this: tidy(map(summary)) but it doesn't work.
mtcars_long_numeric %>%
split(.$names) %>%
map(~ lm(mpg ~ values, data = .)) %>%
map(summary) %>%
tidy() %>% #### ????????
map_df("r.squared") %>%
pivot_longer(., names_to = 'explanatory_variable_to_mpg', values_to = 'r_squared', 1:5) %>%
arrange(desc(r_squared))
Is this what you are after? You need to use glance() from broom instead of tidy() for model-level statistics.
mtcars_long_numeric %>%
nest_by(names) %>%
mutate(model = list(lm(mpg ~ values, data = data))) %>%
summarise(glance(model))
`summarise()` has grouped output by 'names'. You can override using the `.groups` argument.
# A tibble: 5 × 13
# Groups: names [5]
names r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 disp 0.718 0.709 3.25 76.5 9.38e-10 1 -82.1 170. 175. 317. 30 32
2 drat 0.464 0.446 4.49 26.0 1.78e- 5 1 -92.4 191. 195. 604. 30 32
3 hp 0.602 0.589 3.86 45.5 1.79e- 7 1 -87.6 181. 186. 448. 30 32
4 qsec 0.175 0.148 5.56 6.38 1.71e- 2 1 -99.3 205. 209. 929. 30 32
5 wt 0.753 0.745 3.05 91.4 1.29e-10 1 -80.0 166. 170. 278. 30 32
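On the side question of how to discover what a summary(lm) object contains: it is an ordinary list, so names() shows its components. Below is a small base R sketch using the single model from the question; the object names fit, fit_summary and slope_p are only illustrative.

```r
# Fit the model from the question and inspect its summary object.
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)

# A summary.lm object is a list; names() lists what can be pulled out with $.
names(fit_summary)

# There is no "p.value" element. Per-coefficient p-values live in the
# coefficient matrix, in the column named "Pr(>|t|)".
coef(fit_summary)
slope_p <- coef(fit_summary)["wt", "Pr(>|t|)"]
slope_p
```

This is why map_dbl("p.value") fails on the list of summaries: the extractor can only pull out elements that exist under that name. For a regression with a single predictor, the slope p-value should agree with the overall p.value column that glance() reports.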

How do I compare a particular group mean to each separate group?

I have a large dataset and there are many different columns that I am trying to group the data by. I am trying to create a new column using dplyr and mutate which is the mean for each individual group. I then want to see the difference between these means and the mean of just one single category.
This question can be illustrated with the mtcars dataset. How would I group the mtcars data by "cyl" and "gear" and then take the mean of "mpg" for each group? I then want to see the difference between every group's mean "mpg" and the mean "mpg" of the cars with "gear" == 5, for each value of "cyl".
I apologize if I'm asking the same question as others have, but I have not been able to find this specific question.
df <- mtcars
df2 <- df %>% group_by(cyl, gear) %>% mutate(mean_mpg = mean(mpg))
This is pretty brute force, but it should give you what you want. I computed the mean mpg for each cyl and gear combination, then the mean mpg by gear ignoring cyl, and then the mean mpg by cyl ignoring gear.
mtcars %>%
group_by(cyl,gear) %>%
mutate(mean_mpg_both = mean(mpg)) %>%
ungroup %>%
group_by(gear) %>%
mutate(mean_gear_mpg = mean(mpg)) %>%
ungroup %>%
group_by(cyl) %>%
mutate(mean_cyl_mpg = mean(mpg)) %>%
select(mpg,cyl,gear,mean_mpg_both,mean_gear_mpg, mean_cyl_mpg) %>%
group_by(cyl,gear) %>%
filter(row_number()==1)
Another option is to summarise first and then compare within each cyl:
df2 <- df %>%
group_by(cyl, gear) %>%
summarise(mean_mpg = mean(mpg)) %>%
mutate(comparison_mpg = mean_mpg[which(gear == 5)],
mpg_diff = mean_mpg - comparison_mpg)
Result
# A tibble: 8 x 5
# Groups: cyl [3]
cyl gear mean_mpg comparison_mpg mpg_diff
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4. 3. 21.5 28.2 -6.70
2 4. 4. 26.9 28.2 -1.27
3 4. 5. 28.2 28.2 0.
4 6. 3. 19.8 19.7 0.0500
5 6. 4. 19.8 19.7 0.0500
6 6. 5. 19.7 19.7 0.
7 8. 3. 15.0 15.4 -0.350
8 8. 5. 15.4 15.4 0.
Going from your comment, I think this is what you are after:
mtcars %>% group_by(cyl) %>%
summarize(mean_by_cyl = mean(mpg),
mean_gear5_by_cyl = mean(mpg[gear == 5]),
mean_diff_from_gear5 = mean_by_cyl - mean_gear5_by_cyl)
# # A tibble: 3 x 4
# cyl mean_by_cyl mean_gear5_by_cyl mean_diff_from_gear5
# <dbl> <dbl> <dbl> <dbl>
# 1 4 26.66364 28.2 -1.53636364
# 2 6 19.74286 19.7 0.04285714
# 3 8 15.10000 15.4 -0.30000000

Losing group_by information when using dplyr::do for the second time

I am running multiple models on multiple sections of my data set, similar to (but with many more models)
library(tidyverse)
d1 <- mtcars %>%
group_by(cyl) %>%
do(mod_linear = lm(mpg ~ disp + hp, data = ., x = TRUE))
d1
# Source: local data frame [3 x 3]
# Groups: <by row>
#
# # A tibble: 3 x 3
# cyl mod_linear
# * <dbl> <list>
# 1 4. <S3: lm>
# 2 6. <S3: lm>
# 3 8. <S3: lm>
I then tidy this tibble and save my parameter estimates using tidy() in the broom package.
I also want to calculate the standard deviation of the predictors (stored in models above as I set x = TRUE) to create and then compare re-scaled parameters. I can do the former of these using
d1 %>%
# group_by(cyl) %>%
do(term = colnames(.$mod$x),
pred_sd = apply(X = .$mod$x, MARGIN = 2, FUN = sd)) %>%
unnest()
# # A tibble: 9 x 2
# term pred_sd
# <chr> <dbl>
# 1 (Intercept) 0.00000
# 2 disp 26.87159
# 3 hp 20.93453
# 4 (Intercept) 0.00000
# 5 disp 41.56246
# 6 hp 24.26049
# 7 (Intercept) 0.00000
# 8 disp 67.77132
# 9 hp 50.97689
However, the result is not a grouped tibble, so I end up losing the cyl column that tells me which terms belong to which model. How can I avoid this loss? Adding in group_by again seems to throw an error.
n.b. I want to avoid using purrr, at least for the first part (fitting the models), as I run different types of models and then need to reshape the results (d1), and I like the progress bar with do.
n.b. I want to work with the $x component of the models rather than the raw data as they have the data on correct scale (I am experimenting with different transformations of the predictors)
We can do this by nesting the data first and then unnesting at the end:
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_linear = map(data, ~ lm(mpg ~ disp + hp, data = .x, x = TRUE)),
term = map(mod_linear, ~ names(coef(.x))),
pred = map(mod_linear, ~ .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist )) %>%
select(-data, -mod_linear) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
Or, instead of calling map multiple times, this can be made more compact with
mtcars %>%
group_by(cyl) %>%
nest(-cyl) %>%
mutate(mod_contents = map(data, ~ {
mod <- lm(mpg ~ disp + hp, data = .x, x = TRUE)
term <- names(coef(mod))
pred <- mod$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
tibble(term, pred)
}
)) %>%
select(-data) %>%
unnest
# A tibble: 9 x 3
# cyl term pred
# <dbl> <chr> <dbl>
#1 6.00 (Intercept) 0
#2 6.00 disp 41.6
#3 6.00 hp 24.3
#4 4.00 (Intercept) 0
#5 4.00 disp 26.9
#6 4.00 hp 20.9
#7 8.00 (Intercept) 0
#8 8.00 disp 67.8
#9 8.00 hp 51.0
If we start from 'd1' (based on the OP's code)
d1 %>%
ungroup %>%
mutate(mod_contents = map(mod_linear, ~ {
pred <- .x$x %>%
as_tibble %>%
summarise_all(sd) %>%
unlist
term <- .x %>%
coef %>%
names
tibble(term, pred)
})) %>%
select(-mod_linear) %>%
unnest
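The snippets above use the older nest(-cyl) and bare unnest forms, which newer tidyr releases flag as deprecated. Here is a hedged sketch of the same idea with the explicit nest(data = ...) and unnest(cols = ...) arguments, assuming tidyr 1.0 or later; the names pred_sds and pred_sd are illustrative, and apply(..., sd) is used as in the question rather than summarise_all().

```r
library(dplyr)
library(tidyr)
library(purrr)

pred_sds <- mtcars %>%
  nest(data = -cyl) %>%                       # one row per cyl, data in a list-column
  mutate(mod_contents = map(data, function(d) {
    mod <- lm(mpg ~ disp + hp, data = d, x = TRUE)
    # SD of each column of the stored model matrix, keyed by term name
    tibble(term = colnames(mod$x),
           pred_sd = apply(mod$x, 2, sd))
  })) %>%
  select(-data) %>%
  unnest(cols = mod_contents)                 # cyl is kept alongside term and pred_sd

pred_sds
```

Because cyl stays as a regular column throughout, the unnested result keeps the link between each term and its model.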

use of other columns as arguments to function in summarize_at()

This works great:
> mtcars %>% group_by(cyl) %>% summarize_at(vars(disp, hp), weighted.mean)
# A tibble: 3 x 3
cyl disp hp
<dbl> <dbl> <dbl>
1 4.00 105 82.6
2 6.00 183 122
3 8.00 353 209
But now I want to use one of the columns from mtcars as the w argument to weighted.mean. Sadly, the obvious attempt fails:
> mtcars %>% group_by(cyl) %>% summarize_at(vars(disp, hp), weighted.mean, w = wt)
Error in dots_list(...) : object 'wt' not found
Even though wt is, indeed, part of mtcars. How can I use other columns as arguments to a function within summarize_at()?
You could try the funs() syntax:
mtcars %>% group_by(cyl) %>%
summarize_at(vars(disp, hp), funs(weighted.mean(.,wt)))
# cyl disp hp
# <dbl> <dbl> <dbl>
#1 4.00 110 83.4
#2 6.00 185 122
#3 8.00 362 209
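Note that funs() has been deprecated since dplyr 0.8.0. Assuming dplyr 1.0.0 or later, a rough equivalent is a lambda inside across(), where other columns such as wt can be referenced directly; the name wm is only illustrative.

```r
library(dplyr)

# Weighted mean of disp and hp within each cyl group, weighted by wt.
# Inside the lambda, .x is the current column and wt is the group's wt column.
wm <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(disp, hp), ~ weighted.mean(.x, w = wt)))

wm
```

The lambda is evaluated per group, so wt has the same length as the column being summarised.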

How can I manipulate dataframe columns with different values from an external vector (with dplyr)

In R, I would like to manipulate (say multiply) data.frame columns with appropriately named values stored in a vector (or data.frame, if that's easier).
Let's say, I want to first summarise the variables disp, hp, and wt from the mtcars dataset.
vars <- c("disp", "hp", "wt")
mtcars %>%
summarise_at(vars, funs(sum(.)))
(throw a group_by(cyl) into the mix, or use mutate_at if you'd like to have more rows)
Now I'd like to multiply each of the resulting columns with a particular value, given by
multiplier <- c("disp" = 2, "hp" = 3, "wt" = 4)
Is it possible to refer to these within the summarise_at function?
The result should look like this (and I don't want to have to refer to the variable names directly while getting there):
disp hp wt
14766.2 14082 411.808
UPDATE:
Maybe my MWE was too minimal. Let's say I want to do the same operation with a data.frame grouped by cyl
mtcars %>%
group_by(cyl) %>%
summarise_at(vars, sum)
The result should thus be:
cyl disp hp wt
1 4 2313.0 2727 100.572
2 6 2566.4 2568 87.280
3 8 9886.8 8787 223.956
UPDATE 2:
Maybe I was not explicit enough here either, but the columns in the data.frame should be multiplied by the respective values in the vector (and only those columns mentioned in the vector), so e.g. disp should be multiplied by 2, hp by 3 and wt by 4, all other variables (e.g. cyl) should remain untouched by the multiplication.
We could also do this with map function from purrr
library(purrr)
mtcars %>%
summarise_at(vars, sum) %>%
map2_df(multiplier, `*`)
# disp hp wt
# <dbl> <dbl> <dbl>
# 1 14766.2 14082 411.808
For the updated question
d1 <- mtcars %>%
group_by(cyl) %>%
summarise_at(vars, sum)
d1 %>%
select(one_of(vars)) %>%
map2_df(multiplier[vars], ~ .x * .y) %>%
bind_cols(d1 %>% select(-one_of(vars)), .)
# cyl disp hp wt
# <dbl> <dbl> <dbl> <dbl>
#1 4 2313.0 2727 100.572
#2 6 2566.4 2568 87.280
#3 8 9886.8 8787 223.956
Or we can use gather/spread
library(tidyr)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars, sum) %>%
gather(var, val, -cyl) %>%
mutate(val = val*multiplier[match(var, names(multiplier))]) %>%
spread(var, val)
# cyl disp hp wt
# <dbl> <dbl> <dbl> <dbl>
#1 4 2313.0 2727 100.572
#2 6 2566.4 2568 87.280
#3 8 9886.8 8787 223.956
I am not sure if you can do this within the summarise_at function, but this is a close alternative:
library(dplyr)
library(purrr)
vars <- c("disp", "hp", "wt")
multiplier <- c("disp" = 2, "hp" = 3, "wt" = 4)
mtcars %>%
summarise_at(vars, sum) %>%
do(. * multiplier)
disp hp wt
1 14766.2 14082 411.808
****REDUX****
Include the grouping variable cyl in the multiplier and set it equal to 1. @akrun's map2_df does the real work here.
vars <- c("disp", "hp", "wt")
multiplier <- c("cyl" = 1, "disp" = 2, "hp" = 3, "wt" = 4)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars, sum) %>%
map2_df(multiplier, ~ .x * .y)
cyl disp hp wt
<dbl> <dbl> <dbl> <dbl>
1 4 2313.0 2727 100.572
2 6 2566.4 2568 87.280
3 8 9886.8 8787 223.956
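For the original question of whether the multiplier can be applied inside the summarising call itself, one possible sketch, assuming dplyr 1.0.0 or later, looks up each column's multiplier by name with cur_column() inside across(). The names sum_vars and scaled are illustrative; sum_vars is used instead of vars to avoid shadowing dplyr::vars().

```r
library(dplyr)

sum_vars   <- c("disp", "hp", "wt")
multiplier <- c("disp" = 2, "hp" = 3, "wt" = 4)

scaled <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(
    all_of(sum_vars),
    # cur_column() returns the name of the column being processed,
    # which is used to look up the matching multiplier.
    ~ sum(.x) * multiplier[[cur_column()]]
  ))

scaled
```

Only the columns named in sum_vars are touched, so cyl does not need a dummy multiplier of 1.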
