I have a large dataset and there are many different columns that I am trying to group the data by. I am trying to create a new column using dplyr and mutate which is the mean for each individual group. I then want to see the difference between these means and the mean of just one single category.
This question can be illustrated with the mtcars dataset. How would I group the mtcars data by "cyl" and "gear" and then take the mean of "mpg" for each group? I then want to see how every group's mean "mpg" differs from the mean for the cars with "gear" == 5 specifically, with "cyl" allowed to vary.
I apologize if I'm asking the same question as others have, but I have not been able to find this specific question.
df <- mtcars
df2 <- df %>% group_by(cyl, gear) %>% mutate(mean_mpg = mean(mpg))
This is pretty brute force, but it should give you what you want. I computed the mean mpg for each combination of cyl and gear, then for gear alone (ignoring cyl), and then for cyl alone (ignoring gear).
mtcars %>%
  group_by(cyl, gear) %>%
  mutate(mean_mpg_both = mean(mpg)) %>%
  ungroup() %>%
  group_by(gear) %>%
  mutate(mean_gear_mpg = mean(mpg)) %>%
  ungroup() %>%
  group_by(cyl) %>%
  mutate(mean_cyl_mpg = mean(mpg)) %>%
  select(mpg, cyl, gear, mean_mpg_both, mean_gear_mpg, mean_cyl_mpg) %>%
  group_by(cyl, gear) %>%
  filter(row_number() == 1)
library(dplyr)
df2 <- df %>%
  group_by(cyl, gear) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  # summarise() drops the last grouping level, so the result is still grouped by cyl
  mutate(comparison_mpg = mean_mpg[which(gear == 5)],
         mpg_diff = mean_mpg - comparison_mpg)
Result
# A tibble: 8 x 5
# Groups: cyl [3]
cyl gear mean_mpg comparison_mpg mpg_diff
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4. 3. 21.5 28.2 -6.70
2 4. 4. 26.9 28.2 -1.27
3 4. 5. 28.2 28.2 0.
4 6. 3. 19.8 19.7 0.0500
5 6. 4. 19.8 19.7 0.0500
6 6. 5. 19.7 19.7 0.
7 8. 3. 15.0 15.4 -0.350
8 8. 5. 15.4 15.4 0.
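Since the question asked for mutate rather than summarise, here is a minimal sketch of the same comparison that keeps all 32 original rows. The name gear5_diff is just an illustrative choice, and this assumes "the gear 5 mean" is meant within each cyl, as in the answer above.

```r
library(dplyr)

# Attach the group means to every row instead of collapsing to one row per group.
gear5_diff <- mtcars %>%
  group_by(cyl, gear) %>%
  mutate(mean_mpg = mean(mpg)) %>%                # mean per cyl/gear combination
  group_by(cyl) %>%                               # regroup by cyl only
  mutate(comparison_mpg = mean(mpg[gear == 5]),   # gear-5 mean within each cyl
         mpg_diff = mean_mpg - comparison_mpg) %>%
  ungroup()

head(gear5_diff[, c("mpg", "cyl", "gear", "mean_mpg", "comparison_mpg", "mpg_diff")])
```

Every row with gear == 5 should end up with mpg_diff equal to zero, which is a quick sanity check.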
Going from your comment, I think this is what you are after:
mtcars %>% group_by(cyl) %>%
summarize(mean_by_cyl = mean(mpg),
mean_gear5_by_cyl = mean(mpg[gear == 5]),
mean_diff_from_gear5 = mean_by_cyl - mean_gear5_by_cyl)
# # A tibble: 3 x 4
# cyl mean_by_cyl mean_gear5_by_cyl mean_diff_from_gear5
# <dbl> <dbl> <dbl> <dbl>
# 1 4 26.66364 28.2 -1.53636364
# 2 6 19.74286 19.7 0.04285714
# 3 8 15.10000 15.4 -0.30000000
With dplyr and R, you can use group_by and summarize to aggregate data.
For instance:
mpg_cyl_carb <- mtcars %>%
group_by(cyl, carb) %>%
summarise(var1 = mean(mpg))
head(mpg_cyl_carb, 3)
# A tibble: 3 x 3
# Groups: cyl [2]
cyl carb var1
<dbl> <dbl> <dbl>
1 4 1 27.6
2 4 2 25.9
3 6 1 19.8
It means that when cyl = 4 and carb = 1, the mean for mpg is 27.6. When cyl = 6 and carb = 1, the mean is 19.8, and so on.
I would like to attach those aggregate results to the original data frame. Currently I am joining two tables to do this:
> mtcars %>%
+ left_join(mpg_cyl_carb, by = c("cyl", "carb")) %>%
+ head(3) %>%
+ select(mpg, cyl, carb, var1)
mpg cyl carb var1
1 21.0 6 4 19.75
2 21.0 6 4 19.75
3 22.8 4 1 27.58
But is there an easier way? A single command for mutate, like:
> mtcars %>%
+ mutate(. . .)
I would prefer not to use if_else, as it would add complexity.
Use group_by before the mutate to create the mean column by group, instead of creating a summarised dataset and then joining it to the original data:
library(dplyr)
mtcars %>%
group_by(cyl, carb) %>%
mutate(var1 = mean(mpg)) %>%
ungroup %>%
head
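If you want something closer to a single mutate call, a possible sketch uses the `.by` argument. This assumes dplyr 1.1.0 or later, where `.by` was introduced; the name with_by is only for illustration.

```r
library(dplyr)

# Per-operation grouping: no group_by()/ungroup() pair needed.
# Assumes dplyr >= 1.1.0, which added the .by argument to mutate().
with_by <- mtcars %>%
  mutate(var1 = mean(mpg), .by = c(cyl, carb))

head(with_by[, c("mpg", "cyl", "carb", "var1")], 3)
```

The result is ungrouped, so there is nothing to clean up afterwards.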
I have seen this question (Summarize different Columns with different Functions).
But in my situation, I want to use sum() on mpg, disp and hp, and mean() on drat, wt and qsec.
All the functions should be applied with cyl as the grouping variable.
Like this:
result.1 = mtcars %>% group_by(cyl) %>% summarise(across(.cols = c(mpg, disp, hp),
.fns = sum))
result.2 = mtcars %>% group_by(cyl) %>% summarise(across(.cols = c(drat:qsec),
.fns = mean))
final.result = full_join(result.1, result.2)
Is it possible to get final.result using summarise() only once?
Any help will be highly appreciated!
You can use across twice in the same summarise call:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(across(.cols = c(mpg, disp, hp),.fns = sum),
across(.cols = c(drat:qsec),.fns = mean))
# cyl mpg disp hp drat wt qsec
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 4 293. 1156. 909 4.07 2.29 19.1
#2 6 138. 1283. 856 3.59 3.12 18.0
#3 8 211. 4943. 2929 3.23 4.00 16.8
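As a small variation, here is a sketch that uses the `.names` argument of across so the output columns say which function was applied. The suffixes _sum and _mean and the name cyl_summary are my own choices, not something from the question.

```r
library(dplyr)

cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    # {.col} is replaced by the name of each selected column
    across(c(mpg, disp, hp), sum, .names = "{.col}_sum"),
    across(c(drat, wt, qsec), mean, .names = "{.col}_mean")
  )

names(cyl_summary)
```

This can make a wide summary easier to read when sums and means sit side by side.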
Can someone help me, please?
I have columns A, B and C. I want to get the top value of column C, grouped by A, but also keep the value of B that goes with each of those top values.
Max <-X %>% select(A,B,C) %>% group_by(A) %>% summarise(top = max(C))
But this code only shows me the top value for each unique value of A, so I don't know which B value belongs to it. (Important: using group_by(A,B) doesn't work, because it doesn't give the top value for each unique A; it returns the same rows as the original data X.)
This could be achieved via dplyr::top_n (now superseded) or dplyr::slice_max, like so:
library(dplyr)
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% top_n(1, hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
mtcars %>% select(cyl, mpg, hp) %>% group_by(cyl) %>% slice_max(hp)
#> # A tibble: 3 x 3
#> # Groups: cyl [3]
#> cyl mpg hp
#> <dbl> <dbl> <dbl>
#> 1 4 30.4 113
#> 2 6 19.7 175
#> 3 8 15 335
So, in your case it should be:
Max <-X %>% select(A,B,C) %>% group_by(A) %>% slice_max(C)
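One thing to watch with slice_max is ties: by default it keeps every row that shares the maximum. Below is a small sketch on a made-up data frame X (the values are purely hypothetical, chosen to contain a tie in group "a") showing the `with_ties` argument.

```r
library(dplyr)

# Hypothetical data: group "a" has two rows tied at C == 3.
X <- data.frame(
  A = c("a", "a", "a", "b", "b"),
  B = c("p", "q", "r", "s", "t"),
  C = c(1, 3, 3, 2, 5)
)

# Default: tied rows are all kept, so group "a" contributes two rows.
ties_kept <- X %>% group_by(A) %>% slice_max(C, n = 1) %>% ungroup()

# with_ties = FALSE: exactly one row per group.
one_per_group <- X %>%
  group_by(A) %>%
  slice_max(C, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(ties_kept)
nrow(one_per_group)
```

Whether you want the tied rows depends on your data, so it is worth deciding explicitly.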
The function sample_n() from the dplyr package lets you randomly keep a specific number of rows. Combined with group_by(), you can, for instance, keep 2 observations per group:
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(2)
# A tibble: 4 x 2
# Groups: vs [2]
vs drat
<dbl> <dbl>
1 0 3.07
2 0 3.9
3 1 4.22
4 1 3.08
Question: is there an easy way to select a different number of observations per group? For instance, if I want to keep 2 observations for the first group, and 3 for the second one. If I give a vector to the function sample_n(), it only uses the first value (result is the same as above).
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(c(2,3))
Thanks in advance.
Create a list-column holding each group's rows using group_nest(), add a column with the number of samples you want from each group, then map these two columns to the sample_n() function:
library(tidyverse)
mtcars %>%
select(vs, drat) %>%
group_nest(vs, keep= TRUE) %>%
add_column(mysamples = c(2,3)) %>%
mutate(sampled = map2(data , mysamples, ~ sample_n(.x, .y))) %>%
.$sampled %>%
bind_rows()
# A tibble: 5 x 2
vs drat
<dbl> <dbl>
1 0 3.15
2 0 4.22
3 1 3.7
4 1 4.93
5 1 3.08
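An alternative sketch, assuming a named lookup vector is acceptable: keep the per-group sample sizes in a vector keyed by the group value and apply slice_sample inside group_modify. The names n_per_group and sampled are hypothetical, and the seed is only there to make the run repeatable.

```r
library(dplyr)

# Hypothetical lookup: 2 rows for vs == 0, 3 rows for vs == 1.
n_per_group <- c("0" = 2, "1" = 3)

set.seed(1)
sampled <- mtcars %>%
  select(vs, drat) %>%
  group_by(vs) %>%
  group_modify(function(rows, key) {
    # key is a one-row tibble holding this group's value of vs
    n_rows <- n_per_group[[as.character(key$vs)]]
    slice_sample(rows, n = n_rows)
  }) %>%
  ungroup()

count(sampled, vs)
```

This avoids nesting, at the cost of having to keep the lookup vector's names in sync with the group values.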
I want to use filter or a similar function inside summarise from the dplyr package. I have a data frame (e.g. mtcars) that I need to group by a factor (e.g. cyl) and then calculate some statistics, including each cyl type's percentage of the total wt (wt.pc).
The question is: how can I subset/filter the wt column inside summarise to get that percentage without the last 10 rows?
I've tried this code, but it returns NA:
mtcars %>%
group_by(cyl) %>%
summarise(wt = round(sum(wt)),
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[1:22]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 NA 5
2 6 22 21.4 NA 4
3 8 56 54.4 NA 4
wt.pc.short is the % of sum(wt) for every cyl in the shorter data frame mtcars[1:22,].
Something like this?
mtcars %>%
mutate(id = row_number()) %>%
group_by(cyl) %>%
summarise(wt_new = round(sum(wt)), # note the change in name here!
wt.pc = sum(wt) * 100 / sum(mtcars[, 6]),
wt.pc.short = sum(wt[id<23]) * 100 / sum(mtcars[1:22, 6]),
drat.max = round(max(drat)))
# A tibble: 3 x 5
cyl wt_new wt.pc wt.pc.short drat.max
<dbl> <dbl> <dbl> <dbl> <dbl>
1 4 25 24.3 22.7 5
2 6 22 21.4 25.8 4
3 8 56 54.4 51.6 4
The important part here is that once you assign wt in the call to summarise, all subsequent references to wt within that call use the newly assigned wt (here a single rounded sum per group), not the original column. A statement such as wt[1:22] therefore indexes past the end of a length-one vector and yields NA values, so its sum is NA. You can see the same effect here:
mean(mtcars[,"mpg"])
# [1] 20.09062
var(mtcars[,"mpg"])
# [1] 36.3241
mtcars %>% summarise(var_before = var(mpg),
mpg = mean(mpg),
var_after = var(mpg))
# var_before mpg var_after
# 1 36.3241 20.09062 NA
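Besides renaming, another way around this (a sketch, not the only option) is to order the expressions so that anything needing the original column is computed before the column is overwritten. The name wt_share is just for illustration.

```r
library(dplyr)

wt_share <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    wt.pc = sum(wt) * 100 / sum(mtcars$wt),  # still sees the original wt column
    wt    = round(sum(wt))                   # overwrite wt last
  )

wt_share
```

Because wt is reassigned last, wt.pc is computed from the unrounded per-group sums.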
I think you can do it like this. First we calculate the row number within each group. If max(ID) > 10, the group has enough observations to drop its last 10 rows, so we keep only the rows with ID < max(ID) - 9 (i.e. everything except the last 10). Otherwise ID == ID is TRUE for every row and nothing is removed.
mtcars %>% group_by(cyl) %>%
mutate(ID = row_number()) %>%
filter(if (max(ID) > 10) ID < (max(ID) - 9) else ID == ID)
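The same rule can also be sketched without the helper ID column, using n() and row_number() directly inside filter. This is intended to behave the same as the code above; the name trimmed is hypothetical.

```r
library(dplyr)

trimmed <- mtcars %>%
  group_by(cyl) %>%
  # Keep every row of small groups (10 rows or fewer);
  # otherwise keep all but the last 10 rows of the group.
  filter(n() <= 10 | row_number() <= n() - 10) %>%
  ungroup()

count(trimmed, cyl)
```

In mtcars the cyl groups have 11, 7 and 14 rows, so only the 4- and 8-cylinder groups are trimmed.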