With dplyr and R, you can use group_by and summarize to aggregate data.
For instance:
mpg_cyl_carb <- mtcars %>%
group_by(cyl, carb) %>%
summarise(var1 = mean(mpg))
head(mpg_cyl_carb, 3)
# A tibble: 3 x 3
# Groups: cyl [2]
cyl carb var1
<dbl> <dbl> <dbl>
1 4 1 27.6
2 4 2 25.9
3 6 1 19.8
This means that when cyl = 4 and carb = 1, the mean of mpg is 27.6; when cyl = 6 and carb = 1, the mean is 19.8; and so on.
I would like to add those aggregate results to the original data frame. Currently I am joining the two tables to do this:
> mtcars %>%
+ left_join(mpg_cyl_carb, by = c("cyl", "carb")) %>%
+ head(3) %>%
+ select(mpg, cyl, carb, var1)
mpg cyl carb var1
1 21.0 6 4 19.75
2 21.0 6 4 19.75
3 22.8 4 1 27.58
But is there an easier way? A single command for mutate, like:
> mtcars %>%
+ mutate(. . .)
I would rather not use if_else, as it would add complexity.
Use group_by before mutate to create the mean column by group, instead of creating a summarised dataset and then joining it to the original data:
library(dplyr)
mtcars %>%
group_by(cyl, carb) %>%
mutate(var1 = mean(mpg)) %>%
ungroup %>%
head
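As a quick check that the grouped mutate produces the same column as the summarise-and-join approach from the question, here is a minimal sketch. The names joined, mutated and base_var1 are only illustrative, and the base R ave() call is an extra alternative that the answer above does not mention.

```r
library(dplyr)

# Approach from the question: summarise per group, then join back
mpg_cyl_carb <- mtcars %>%
  group_by(cyl, carb) %>%
  summarise(var1 = mean(mpg), .groups = "drop")

joined <- mtcars %>%
  left_join(mpg_cyl_carb, by = c("cyl", "carb"))

# Approach from the answer: grouped mutate keeps all 32 rows
mutated <- mtcars %>%
  group_by(cyl, carb) %>%
  mutate(var1 = mean(mpg)) %>%
  ungroup()

# Base R alternative: ave() applies mean() within each cyl/carb combination
base_var1 <- ave(mtcars$mpg, mtcars$cyl, mtcars$carb)

all.equal(joined$var1, mutated$var1)
all.equal(joined$var1, base_var1)
```

Both comparisons should come back TRUE, since all three routes compute the same per-group mean and keep the original row order.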
Bear with me... I am using R/RStudio with the mtcars data and dplyr's mutate and summarise commands. I also tried group_by.
I want to center the values of mtcars$mpg, then display a summary of the number of cylinders vs. the centered mtcars$mpg.
So far...
mtcars %>% mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>% summarise(centered_mpg, cyl)
The above produces:
centered_mpg cyl
0.909375     6
0.909375     6
2.709375     4
1.309375     6
...          ...
INSTEAD, I WANT:
centered_mpg cyl
x1           4
x2           6
x3           8
Are you looking for this?
with(mtcars, aggregate(list(centered_mpg=scale(mpg, scale=FALSE)), list(cyl=cyl), mean))
# cyl centered_mpg
# 1 4 6.5730114
# 2 6 -0.3477679
# 3 8 -4.9906250
It looks like you want to center each individual car's mpg by subtracting the global mean(mpg). This gives a centered_mpg for every car - and the code you have looks fine for this.
Then you want to calculate some sort of "summary" of the centered mpg values by cylinder group, so we need to group_by(cyl) and then define whatever summary function you want - here I use mean() but you can use median, sum, or whatever else you'd like.
mtcars %>%
mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>%
group_by(cyl) %>%
summarise(mean_centered_mpg = mean(centered_mpg))
# # A tibble: 3 x 2
# cyl mean_centered_mpg
# <dbl> <dbl>
# 1 4 6.57
# 2 6 -0.348
# 3 8 -4.99
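Because the mean is linear, averaging the centered values within a group gives the group mean minus the overall mean. A small sketch to confirm this, where by_group, shortcut and overall_mean are illustrative names rather than anything from the answers above:

```r
library(dplyr)

# Center first, then average within each cyl group (as in the answer)
by_group <- mtcars %>%
  mutate(centered_mpg = mpg - mean(mpg, na.rm = TRUE)) %>%
  group_by(cyl) %>%
  summarise(mean_centered_mpg = mean(centered_mpg))

# Shortcut: group mean minus the overall mean
overall_mean <- mean(mtcars$mpg)
shortcut <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_centered_mpg = mean(mpg) - overall_mean)

all.equal(by_group$mean_centered_mpg, shortcut$mean_centered_mpg)
```

This shortcut only holds for the mean; with median or another summary function you would still need to center first.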
The function sample_n() from the dplyr package lets you randomly keep a specific number of rows. Combined with group_by(), you can, for instance, keep 2 observations per group:
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(2)
# A tibble: 4 x 2
# Groups: vs [2]
vs drat
<dbl> <dbl>
1 0 3.07
2 0 3.9
3 1 4.22
4 1 3.08
Question: is there an easy way to select a different number of observations per group? For instance, if I want to keep 2 observations for the first group, and 3 for the second one. If I give a vector to the function sample_n(), it only uses the first value (result is the same as above).
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(c(2,3))
Thanks in advance.
Create a list-column holding each group's data using group_nest(), add a column with the number of samples you want in each group, then map these two columns to the sample_n() function:
library(tidyverse)
mtcars %>%
select(vs, drat) %>%
group_nest(vs, keep= TRUE) %>%
add_column(mysamples = c(2,3)) %>%
mutate(sampled = map2(data , mysamples, ~ sample_n(.x, .y))) %>%
.$sampled %>%
bind_rows()
# A tibble: 5 x 2
vs drat
<dbl> <dbl>
1 0 3.15
2 0 4.22
3 1 3.7
4 1 4.93
5 1 3.08
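A possible variation on the same idea, sketched under the assumption that dplyr 1.0.0 or later is installed (for slice_sample()): look the sample size up by group key inside group_modify(). The sizes vector and the sampled name are hypothetical and not part of the answer above.

```r
library(dplyr)

# Hypothetical lookup: desired sample size for each value of vs
sizes <- c("0" = 2, "1" = 3)

sampled <- mtcars %>%
  select(vs, drat) %>%
  group_by(vs) %>%
  # .x is the data for one group, .y is a one-row tibble holding its key
  group_modify(~ slice_sample(.x, n = sizes[[as.character(.y$vs)]])) %>%
  ungroup()

count(sampled, vs)
```

Keying the sizes by name rather than by position means the result does not depend on the order in which the groups happen to appear.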
I have a large dataset and there are many different columns that I am trying to group the data by. I am trying to create a new column using dplyr and mutate which is the mean for each individual group. I then want to see the difference between these means and the mean of just one single category.
This question can be illustrated with the mtcars dataset. How would I group the mtcars data by "cyl" and "gear" and then take the mean of "mpg" for each group? I then want to see the difference between every group's mean "mpg" and that of the cars with "gear" == 5, with "cyl" still varying.
I apologize if I'm asking the same question as others have, but I have not been able to find this specific question.
df <- mtcars
df2 <- df %>% group_by(cyl, gear) %>% mutate(mean_mpg = mean(mpg))
This is pretty brute force, but it should give you what you want. I got the mean mpg by both cyl and gear, then by gear alone (ignoring cyl), and then by cyl alone (ignoring gear).
mtcars %>%
group_by(cyl,gear) %>%
mutate(mean_mpg_both = mean(mpg)) %>%
ungroup %>%
group_by(gear) %>%
mutate(mean_gear_mpg = mean(mpg)) %>%
ungroup %>%
group_by(cyl) %>%
mutate(mean_cyl_mpg = mean(mpg)) %>%
select(mpg,cyl,gear,mean_mpg_both,mean_gear_mpg, mean_cyl_mpg) %>%
group_by(cyl,gear) %>%
filter(row_number()==1)
df2 <- df %>%
group_by(cyl, gear) %>%
summarise(mean_mpg = mean(mpg)) %>%
mutate(comparison_mpg = mean_mpg[which(gear == 5)],
mpg_diff = mean_mpg - comparison_mpg)
Result
# A tibble: 8 x 5
# Groups: cyl [3]
cyl gear mean_mpg comparison_mpg mpg_diff
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4. 3. 21.5 28.2 -6.70
2 4. 4. 26.9 28.2 -1.27
3 4. 5. 28.2 28.2 0.
4 6. 3. 19.8 19.7 0.0500
5 6. 4. 19.8 19.7 0.0500
6 6. 5. 19.7 19.7 0.
7 8. 3. 15.0 15.4 -0.350
8 8. 5. 15.4 15.4 0.
Going from your comment, I think this is what you are after:
mtcars %>% group_by(cyl) %>%
summarize(mean_by_cyl = mean(mpg),
mean_gear5_by_cyl = mean(mpg[gear == 5]),
mean_diff_from_gear5 = mean_by_cyl - mean_gear5_by_cyl)
# # A tibble: 3 x 4
# cyl mean_by_cyl mean_gear5_by_cyl mean_diff_from_gear5
# <dbl> <dbl> <dbl> <dbl>
# 1 4 26.66364 28.2 -1.53636364
# 2 6 19.74286 19.7 0.04285714
# 3 8 15.10000 15.4 -0.30000000
In R, I would like to manipulate (say multiply) data.frame columns with appropriately named values stored in a vector (or data.frame, if that's easier).
Let's say, I want to first summarise the variables disp, hp, and wt from the mtcars dataset.
vars <- c("disp", "hp", "wt")
mtcars %>%
summarise_at(vars, funs(sum(.)))
(throw a group_by(cyl) into the mix, or use mutate_at if you'd like to have more rows)
Now I'd like to multiply each of the resulting columns with a particular value, given by
multiplier <- c("disp" = 2, "hp" = 3, "wt" = 4)
Is it possible to refer to these within the summarise_at function?
The result should look like this (and I don't want to have to refer to the variable names directly while getting there):
disp hp wt
14766.2 14082 411.808
UPDATE:
Maybe my MWE was too minimal. Let's say I want to do the same operation with a data.frame grouped by cyl
mtcars %>%
group_by(cyl) %>%
summarise_at(vars, sum)
The result should thus be:
cyl disp hp wt
1 4 2313.0 2727 100.572
2 6 2566.4 2568 87.280
3 8 9886.8 8787 223.956
UPDATE 2:
Maybe I was not explicit enough here either, but the columns in the data.frame should be multiplied by the respective values in the vector (and only those columns mentioned in the vector), so e.g. disp should be multiplied by 2, hp by 3 and wt by 4, all other variables (e.g. cyl) should remain untouched by the multiplication.
We could also do this with map function from purrr
library(purrr)
mtcars %>%
summarise_at(vars, sum) %>%
map2_df(multiplier, `*`)
# disp hp wt
# <dbl> <dbl> <dbl>
# 1 14766.2 14082 411.808
For the updated question
d1 <- mtcars %>%
group_by(cyl) %>%
summarise_at(vars, sum)
d1 %>%
select(one_of(vars)) %>%
map2_df(multiplier[vars], ~ .x * .y) %>%
bind_cols(d1 %>% select(-one_of(vars)), .)
# cyl disp hp wt
# <dbl> <dbl> <dbl> <dbl>
#1 4 2313.0 2727 100.572
#2 6 2566.4 2568 87.280
#3 8 9886.8 8787 223.956
Or we can use gather/spread
library(tidyr)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars, sum) %>%
gather(var, val, -cyl) %>%
mutate(val = val*multiplier[match(var, names(multiplier))]) %>%
spread(var, val)
# cyl disp hp wt
# <dbl> <dbl> <dbl> <dbl>
#1 4 2313.0 2727 100.572
#2 6 2566.4 2568 87.280
#3 8 9886.8 8787 223.956
I am not sure if you can do this in the summarise_at function, but this is a close alternative...
library(dplyr)
library(purrr)
vars <- c("disp", "hp", "wt")
multiplier <- c("disp" = 2, "hp" = 3, "wt" = 4)
mtcars %>%
summarise_at(vars, sum) %>%
do(. * multiplier)
disp hp wt
1 14766.2 14082 411.808
****REDUX****
Include the grouping variable cyl in the multiplier and set it equal to 1. @akrun's map2_df does the real work here.
vars <- c("disp", "hp", "wt")
multiplier <- c("cyl" = 1, "disp" = 2, "hp" = 3, "wt" = 4)
mtcars %>%
group_by(cyl) %>%
summarise_at(vars, sum) %>%
map2_df(multiplier, ~ .x * .y)
cyl disp hp wt
<dbl> <dbl> <dbl> <dbl>
1 4 2313.0 2727 100.572
2 6 2566.4 2568 87.280
3 8 9886.8 8787 223.956
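The map2_df approaches above pair columns with multipliers by position. As a hedged alternative, assuming dplyr 1.0.0 or later, across() together with cur_column() can look each multiplier up by column name inside the summarise call itself. This is only a sketch; the result name is illustrative.

```r
library(dplyr)

vars <- c("disp", "hp", "wt")
multiplier <- c("disp" = 2, "hp" = 3, "wt" = 4)

result <- mtcars %>%
  group_by(cyl) %>%
  # cur_column() returns the name of the column currently being processed,
  # so each sum is scaled by the matching entry in multiplier
  summarise(across(all_of(vars), ~ sum(.x) * multiplier[[cur_column()]]))

result
```

Since the grouping column cyl is never passed through across(), it stays untouched without needing a dummy multiplier of 1.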
I expected the code below to output a data frame with three rows, each row representing the cumulative mean value of mpg after calculating the mean for each group of cyl:
library(dplyr)
mtcars %>%
arrange(cyl) %>%
group_by(cyl) %>%
summarise(running.mean.mpg = cummean(mpg))
This is what I expected to happen:
mean_cyl_4 <- mtcars %>%
filter(cyl == 4) %>%
summarise(mean(mpg))
mean_cyl_4_6 <- mtcars %>%
filter(cyl == 4 | cyl == 6) %>%
summarise(mean(mpg))
mean_cyl_4_6_8 <- mtcars %>%
filter(cyl == 4 | cyl == 6 | cyl == 8) %>%
summarise(mean(mpg))
data.frame(cyl = c(4,6,8), running.mean.mpg = c(mean_cyl_4[1,1], mean_cyl_4_6[1,1], mean_cyl_4_6_8[1,1]))
cyl running.mean.mpg
1 4 26.66364
2 6 23.97222
3 8 20.09062
How come dplyr seems to ignore group_by(cyl)?
require("dplyr")
mtcars %>%
arrange(cyl) %>%
group_by(cyl) %>%
mutate(running.mean.mpg = cummean(mpg)) %>%
select(cyl, running.mean.mpg)
# Source: local data frame [32 x 2]
# Groups: cyl
#
# # cyl running.mean.mpg
# # 1 4 22.80000
# # 2 4 23.60000
# # 3 4 23.33333
# # 4 4 25.60000
# # 5 4 26.56000
# # 6 4 27.78333
# # 7 4 26.88571
# # 8 4 26.93750
For the sake of experimentation, this would also work with data.table.
Note that you still have to load dplyr to have cummean() available.
require("data.table")
DT <- as.data.table(mtcars)
DT[,j=list(
running.mean.mpg = cummean(mpg)
), by="cyl"]
Use mutate rather than summarise.
This works as you want.
mtcars %>%
arrange(cyl) %>%
mutate(running.mean.mpg = cummean(mpg)) %>%
select(cyl, running.mean.mpg)%>%
group_by(cyl)%>%
summarize(target=last(running.mean.mpg))
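Another way to get the expected three-row result, sketched here as an assumption about what the question is after: summarise each cyl group to a sum and a count, then divide the cumulative sum by the cumulative count. The column names total and count and the object name running are illustrative.

```r
library(dplyr)

running <- mtcars %>%
  group_by(cyl) %>%
  # One row per cyl (sorted by cyl), holding that group's sum and size
  summarise(total = sum(mpg), count = n()) %>%
  # Cumulative sum over cumulative count gives the mean of all rows so far
  mutate(running.mean.mpg = cumsum(total) / cumsum(count)) %>%
  select(cyl, running.mean.mpg)

running
```

This avoids computing a 32-row running mean only to keep the last value per group, and it does not rely on the rows being arranged by cyl beforehand.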