Calculate mean by groups in R with two group variables [duplicate] - r

I want to start using dplyr in place of ddply but I can't get a handle on how it works (I've read the documentation).
For example, why when I try to mutate() something does the "group_by" function not work as it's supposed to?
Looking at mtcars:
library(dplyr)
Say I make a data.frame which is a summary of mtcars, grouped by "cyl" and "gear":
df1 <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(
    newvar = sum(wt)
  )
Then say I want to further summarise this data frame. With ddply, it'd be straightforward, but when I try to do it with dplyr, it's not actually "grouping by":
df2 <- df1 %>%
  group_by(cyl) %>%
  mutate(
    newvar2 = newvar + 5
  )
Still yields an ungrouped output:
cyl gear newvar newvar2
1 6 3 6.675 11.675
2 4 4 19.025 24.025
3 6 4 12.375 17.375
4 6 5 2.770 7.770
5 4 3 2.465 7.465
6 8 3 49.249 54.249
7 4 5 3.653 8.653
8 8 5 6.740 11.740
Am I doing something wrong with the syntax?
Edit:
If I were to do this with plyr and ddply:
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
and then to get the second df:
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
But that same approach, with sum(newvar) + 5 in the summarise() function doesn't work with dplyr...

I had a similar problem. I found that simply detaching plyr solved it:
detach(package:plyr)
library(dplyr)

Taking Dickoa's answer one step further: as Hadley says, "summarise peels off a single layer of grouping". It peels off the grouping variables in the reverse of the order in which you applied them, so you can just use
mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt)) %>%
summarise(newvar2 = sum(newvar) + 5)
Note that this will give a different answer if you use group_by(gear, cyl) in the second line.
And to get your first attempt working:
df1 <- mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt))
df2 <- df1 %>%
group_by(cyl) %>%
summarise(newvar2 = sum(newvar)+5)
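
The "one layer per summarise" behaviour can be checked directly. The following is a minimal sketch, assuming dplyr 1.0.0 or later, where `.groups = "drop_last"` spells out the default peeling and `group_vars()` reports the grouping that remains. The names `by_cyl_gear` and `by_cyl` are only illustrative.

```r
library(dplyr)

# First summarise: collapses the (cyl, gear) groups and drops the last
# grouping variable (gear), so the result is still grouped by cyl.
by_cyl_gear <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop_last")

group_vars(by_cyl_gear)  # "cyl"

# Second summarise: works within the remaining cyl groups.
by_cyl <- by_cyl_gear %>%
  summarise(newvar2 = sum(newvar) + 5)

by_cyl
```

If the grouping is not what you expect at some step, printing `group_vars()` after each verb is usually the quickest way to find out where it changed.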

If you translate your plyr code into dplyr using summarise instead of mutate you get the same results.
library(plyr)
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
df2
## cyl newvar2
## 1 4 30.143
## 2 6 26.820
## 3 8 60.989
detach(package:plyr)
library(dplyr)
mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt)) %>%
  group_by(cyl) %>%
  summarise(newvar2 = sum(newvar) + 5)
## cyl newvar2
## 1 4 30.143
## 2 8 60.989
## 3 6 26.820
EDIT
Since summarise drops the last group (gear), you can skip the second group_by (see @hadley's comment below):
library(dplyr)
mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt)) %>%
  summarise(newvar2 = sum(newvar) + 5)
## cyl newvar2
## 1 4 30.143
## 2 8 60.989
## 3 6 26.820

Detaching plyr is one way to solve the problem so you can use dplyr functions as desired... but what if you need other functions from plyr to complete other tasks in your code?
(In this example, I've got both dplyr and plyr libraries loaded)
Suppose we have a simple data.frame and we want to compute the groupwise sum of the variable value, when grouped by different levels of gname
> dx<-data.frame(gname=c(1,1,1,2,2,2,3,3,3), value = c(2,2,2,4,4,4,5,6,7))
> dx
gname value
1 1 2
2 1 2
3 1 2
4 2 4
5 2 4
6 2 4
7 3 5
8 3 6
9 3 7
But when we try to use what we believe will produce a dplyr grouped sum, here's what happens:
dx %>% group_by(gname) %>% mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname
gname value mysum
1 1 2 36
2 1 2 36
3 1 2 36
4 2 4 36
5 2 4 36
6 2 4 36
7 3 5 36
8 3 6 36
9 3 7 36
It doesn't give us the desired answer, most likely because plyr was attached after dplyr, so plyr's mutate() masks dplyr's and ignores the grouping. We could detach plyr, but another way is to call the dplyr versions of group_by and mutate explicitly:
dx %>% dplyr::group_by(gname) %>% dplyr::mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname
gname value mysum
1 1 2 6
2 1 2 6
3 1 2 6
4 2 4 12
5 2 4 12
6 2 4 12
7 3 5 18
8 3 6 18
9 3 7 18
Now we see that this works as expected.
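
To see which package a bare call resolves to in a given session, something like the following may help. It is only a sketch: it uses base R's `find()` and `environmentName()` for the diagnosis and `ave()` as an independent cross-check of the grouped sum. The names `expected` and `result` are illustrative.

```r
library(dplyr)

# Which package does a bare mutate() call resolve to right now?
# "plyr" here would mean plyr's version is masking dplyr's.
environmentName(environment(mutate))

# Every attached package that defines mutate, in search-path order;
# the first entry is the one a bare call uses.
find("mutate")

dx <- data.frame(gname = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 value = c(2, 2, 2, 4, 4, 4, 5, 6, 7))

# Base R cross-check: the sum of value within each gname, one per row
expected <- ave(dx$value, dx$gname, FUN = sum)

# Namespace-qualified calls are unaffected by masking
result <- dx %>%
  dplyr::group_by(gname) %>%
  dplyr::mutate(mysum = sum(value))

all.equal(result$mysum, expected)
```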

dplyr is working as you should expect in your example. Mutate, as you specified it, will just add 5 to each value of newvar as it creates newvar2. This would look the same if you group or not. If, however, you specify something that differs by group you will get something different. For example:
df1 %>%
  group_by(cyl) %>%
  mutate(
    newvar2 = newvar + mean(cyl)
  )
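
The same point can be shown side by side. This is a small sketch, assuming dplyr 1.0.0 or later (for the `.groups` argument). `ungrouped`, `grouped` and `with_totals` are illustrative names, and `cyl_total` is a column invented for the example.

```r
library(dplyr)

df1 <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop")

# Adding a constant is a row-wise operation, so grouping makes no difference
ungrouped <- df1 %>% mutate(newvar2 = newvar + 5)
grouped <- df1 %>%
  group_by(cyl) %>%
  mutate(newvar2 = newvar + 5) %>%
  ungroup()
same <- isTRUE(all.equal(ungrouped$newvar2, grouped$newvar2))
same

# An aggregate such as sum() is computed per group, so here grouping matters:
# every row receives the total of newvar for its own cyl
with_totals <- df1 %>%
  group_by(cyl) %>%
  mutate(cyl_total = sum(newvar)) %>%
  ungroup()
with_totals
```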

Related

Is there a way to "summarize_by_group" without having to group_by the whole data each time?

I have a data frame with numerous variables I can group by.
I write a new chunk every time:
df %>% group_by(variable) %>% summarize()
Yet when I make a boxplot, I do not have to do this. I can simply add the groups in the function:
boxplot(df$numericvariable ~ df$variable_I_want_to_group_by, data=df)
This allows me in Rmarkdown to write all the different group_by's in the same chunk and view all the plots created next to each other.
I would like to find the same "group_by" as an integral part of a function for summarize (or another function that does the same from a different package).
Expanding on the idea of writing a custom function so that you can quickly try lots of groupings, use the ... dots.
f <- function(...){
mtcars %>%
group_by(...) %>%
summarise(mean = mean(disp), n =n())
}
f(cyl)
f(cyl, gear)
You may use base R aggregate with a similar formula interface to boxplot,
aggregate(disp ~ cyl, mtcars, \(x) c(mean=mean(x), n=length(x)))
# cyl disp.mean disp.n
# 1 4 105.1364 11.0000
# 2 6 183.3143 7.0000
# 3 8 353.1000 14.0000
which will give you the same as dplyr.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(mean = mean(disp), n =n())
# # A tibble: 3 × 3
# cyl mean n
# <dbl> <dbl> <int>
# 1 4 105. 11
# 2 6 183. 7
# 3 8 353. 14
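
Another option, if your dplyr is recent enough, is per-operation grouping. The sketch below assumes dplyr 1.1.0 or later, where `summarise()` accepts a `.by` argument. The grouping applies to that one call only, so there is no separate group_by step and the result comes back ungrouped.

```r
library(dplyr)

# Group inside the summarise call itself
by_cyl <- mtcars %>%
  summarise(mean = mean(disp), n = n(), .by = cyl)

# Several grouping columns are passed as a vector
by_cyl_gear <- mtcars %>%
  summarise(mean = mean(disp), n = n(), .by = c(cyl, gear))

by_cyl
by_cyl_gear
```

With `.by`, groups appear in order of first occurrence in the data rather than sorted, so add `arrange()` if the order matters.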

How to count() each variable automatically

I am cleaning some data and like to use the count() function in dplyr to look at unique values of every variable.
Is there a way to do this automatically? Right now I am using this method:
df %>% count(variable1)
df %>% count(variable2)
df %>% count(variable3)
...
I would like something that returns all of them without me having to repeat the line of code and type in each variable. I thought about trying to have R recognize all the column names and automatically fill them in but I'm not sure where to start. If I just add variables together, say
df %>% count(variable1, variable2)
I get counts by both of those variables when I want individual tables for each variable.
Assume that you want to count am, gear, and carb from mtcars. You can apply the function table() on each variable by map(), which returns a list object.
library(dplyr)
library(purrr)
mtcars %>%
select(am, gear, carb) %>%
map(table)
# $am
# 0 1
# 19 13
#
# $gear
# 3 4 5
# 15 12 5
#
# $carb
# 1 2 3 4 6 8
# 7 10 3 10 1 1
Base R version:
lapply(mtcars[c("am", "gear", "carb")], table)
In addition, you can use summary(), which counts factor variables.
mtcars %>%
select(am, gear, carb) %>%
mutate(across(everything(), as.factor)) %>%
summary()
# am gear carb
# 0:19 3:15 1: 7
# 1:13 4:12 2:10
# 5: 5 3: 3
# 4:10
# 6: 1
# 8: 1
It looks like a tidyverse approach can solve your issue. You want the counts for each variable in your dataset (next time, please add a sample of df). You can get something close to what you want by reshaping the data to long format. Here is an example with the mtcars data, using variables that take only a few distinct values so that they can be summarised with counts. Here is the code:
library(tidyverse)
#Data
data("mtcars")
I will select some categorical variables with the following code, then reshape to long format. Finally, I will use summarise() and n() (used for counting) with group_by() to determine the counts:
#Code
mtcars %>% select(cyl,vs,am,gear,carb) %>%
#Format to long
pivot_longer(cols = everything()) %>%
#Group and summarise
group_by(name,value) %>%
summarise(N=n())
Output:
# A tibble: 16 x 3
# Groups: name [5]
name value N
<chr> <dbl> <int>
1 am 0 19
2 am 1 13
3 carb 1 7
4 carb 2 10
5 carb 3 3
6 carb 4 10
7 carb 6 1
8 carb 8 1
9 cyl 4 11
10 cyl 6 7
11 cyl 8 14
12 gear 3 15
13 gear 4 12
14 gear 5 5
15 vs 0 18
16 vs 1 14
As you can see, all the variables are shown with their respective values and counts.
A simple solution would be to use sapply or lapply with table:
sapply(df,table)
This will return a list of count tables, one for each column of df. You can always pass in a subsetted data frame to get the counts for your variables of interest.
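
If you would rather keep count() itself and get one data frame per variable, you can loop over the column names. This is a minimal sketch that uses the `.data[[ ]]` pronoun to refer to a column by a string. The variable names `vars` and `counts` are illustrative, and mtcars stands in for df.

```r
library(dplyr)

vars <- c("am", "gear", "carb")

# One count() table per column name, collected in a named list
counts <- lapply(vars, function(v) count(mtcars, .data[[v]]))
names(counts) <- vars

counts$gear
```

Use `names(df)` instead of a hand-written vector to cover every column.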

Writing own function using dplyr and group_by - how to continue with changed column names

I would like to make tables for publication that give the number of observations, grouped by two variables. The code for this works fine. However, I have run into problems when trying to turn this into a function.
I am using dplyr_0.7.2
Example using mtcars:
Code for table outside of function: this works
library(tidyverse)
tab1 <- mtcars %>% count(cyl) %>% rename(Total = n)
tab2 <- mtcars %>%
group_by(cyl, gear) %>% count %>%
spread(gear, n)
tab <- full_join(tab1, tab2, by = "cyl")
tab
# This is the output (which is what I want)
A tibble: 3 x 5
cyl Total `3` `4` `5`
<dbl> <int> <int> <int> <int>
1 4 11 1 8 2
2 6 7 2 4 1
3 8 14 12 NA 2
Trying to put this into a function
Function for tab1: this works
count_by_two_groups_A <- function(df, var1){
var1 <- enquo(var1)
tab1 <- df %>% count(!!var1) %>% rename(Total = n)
tab1
}
count_by_two_groups_A(mtcars, cyl)
A tibble: 3 x 2
cyl Total
<dbl> <int>
1 4 11
2 6 7
3 8 14
Function for first part of tab2: it works up to this point, but...
count_by_two_groups_B <- function(df, var1, var2){
var1 <- enquo(var1)
var2 <- enquo(var2)
tab2 <- df %>% group_by((!!var1), (!!var2)) %>% count
tab2
}
count_by_two_groups_B(mtcars, cyl, gear)
A tibble: 8 x 3
Groups: (cyl), (gear) [8]
`(cyl)` `(gear)` n
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
The column names have changed to (cyl) and (gear). I can't seem to figure out how to carry on with spread() and full_join() (or anything else using the new column names) now that the column names have changed. I.e. I can't figure out how to specify the new column names in the tidyeval way, to be able to carry on. I have tried various things, without success.
The usual way of setting names in a tidyeval context is to use the definition operator :=. It would look like this:
df %>%
group_by(
!! nm1 := !! var1,
!! nm2 := !! var2
) %>%
count()
For this you need to extract nm1 from var1. Unfortunately I don't have an easy way of stripping the enclosing parentheses yet. I think it would make sense to do it in the forthcoming function ensym() (it captures symbols instead of quosures and issues an error if you supply a call). I have submitted a ticket here: https://github.com/tidyverse/rlang/issues/223
Fortunately we have two easy solutions here. First note that you don't need the enclosing parentheses. They are only needed when other operators are involved in the captured expression. E.g. in these situations:
(!! var) / avg
(!! var) < value
In this case, if you omitted the parentheses, !! would try to unquote the whole expression instead of just the one symbol. On the other hand, in your function there is no operator, so you can safely unquote without enclosing parentheses:
count_by_two_groups_B <- function(df, var1, var2) {
var1 <- enquo(var1)
var2 <- enquo(var2)
df %>%
group_by(!! var1, !! var2) %>%
count()
}
Finally, you could make your function more general by allowing a variable number of arguments. This is even easier to implement because dots are forwarded so there is no need to capture and unquote. Just pass them down to group_by():
count_by <- function(df, ...) {
df %>%
group_by(...) %>%
count()
}
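
For completeness, here is one way the whole table from the question might be wrapped in a single function with newer tooling. It is a sketch under assumptions, not the approach described above: it relies on the `{{ }}` embrace operator (rlang 0.4.0 or later), `count(name = )` (dplyr 0.8.0 or later) and `pivot_wider()` (tidyr 1.0.0 or later) in place of spread(). The function name `count_by_two_groups` and the helper variable `key` are illustrative.

```r
library(dplyr)
library(tidyr)

count_by_two_groups <- function(df, var1, var2) {
  # Name of the first grouping column as a string, used as the join key
  key <- rlang::as_name(rlang::ensym(var1))

  # Totals per level of var1
  totals <- df %>% count({{ var1 }}, name = "Total")

  # Counts per (var1, var2) pair, with one column per level of var2
  by_both <- df %>%
    count({{ var1 }}, {{ var2 }}) %>%
    pivot_wider(names_from = {{ var2 }}, values_from = n)

  full_join(totals, by_both, by = key)
}

tab <- count_by_two_groups(mtcars, cyl, gear)
tab
```

Combinations that never occur (8 cylinders with 4 gears) come out as NA, as in the original table; `pivot_wider(values_fill = 0)` would turn them into zeros.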
I can make it work with SE (standard evaluation), passing the column names as strings. I could not do it with the tidyverse as I did not have that installed and did not bother installing it.
Here is working code:
library(dplyr)
library(tidyr)
count_by_two_groups_B <- function(df, var1, var2){
# var1 <- enquo(var1)
# var2 <- enquo(var2)
tab2 <- df %>% group_by_(var1, var2) %>% summarise(n = n() ) %>%spread(gear, n)
tab2
}
count_by_two_groups_B(mtcars, 'cyl', 'gear')
Result:
# A tibble: 3 x 4
# Groups: cyl [3]
cyl `3` `4` `5`
* <dbl> <int> <int> <int>
1 4 1 8 2
2 6 2 4 1
3 8 12 NA 2
This is one of those situations where reaching for dplyr or tidyverse seems excessive. There are base functions to do this: table and, to get the results in long form, as.data.frame:
as.data.frame( with(mtcars, table(cyl,gear)) , responseName="Total")
#--------
cyl gear Total
1 4 3 1
2 6 3 2
3 8 3 12
4 4 4 8
5 6 4 4
6 8 4 0
7 4 5 2
8 6 5 1
9 8 5 2
This would be one dplyr approach:
mtcars %>% group_by(cyl,gear) %>% summarise(Total=n())
#----
# A tibble: 8 x 3
# Groups: cyl [?]
cyl gear Total
<dbl> <dbl> <int>
1 4 3 1
2 4 4 8
3 4 5 2
4 6 3 2
5 6 4 4
6 6 5 1
7 8 3 12
8 8 5 2
And if the question was how to get this as a table object (thinking that might have been your goal with spread), then just:
with(mtcars, table(cyl,gear))
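
If the Total column from the question is wanted as well, the base R table can be extended with its row sums. A small sketch follows; `tab` and `wide` are illustrative names.

```r
tab <- with(mtcars, table(cyl, gear))

# Row sums give the total per cyl; bind them in front of the gear columns
wide <- cbind(Total = rowSums(tab), tab)
wide
```

Unlike the spread() version, the empty combination (8 cylinders, 4 gears) shows up as 0 rather than NA.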

Summary of proportions by group [duplicate]

What would be the best tool/package to use to calculate proportions by subgroups? I thought I could try something like this:
data(mtcars)
library(plyr)
ddply(mtcars, .(cyl), transform, Pct = gear/length(gear))
But the output is not what I want, as I would want the number of rows to equal the number of cyl groups. Even if I change it to summarise, I still get the same problem.
I am open to other packages, but I thought plyr would be best as I would eventually like to build a function around this. Any ideas?
I'd appreciate any help just solving a basic problem like this.
library(dplyr)
mtcars %>%
count(cyl, gear) %>%
mutate(prop = prop.table(n))
See ?count. Basically, count is a wrapper for summarise with n(), but it does the group_by for you. Look at the output of just mtcars %>% count(cyl, gear). Then we use mutate to add a variable named prop, which is the result of calling prop.table() on the n variable created by count(cyl, gear).
You could create this as a function using the SE versions of count(), that is count_(). Look at the vignette for Non-Standard Evaluation in the dplyr package.
Here's a nice github gist addressing lots of cross-tabulation variants with dplyr and other packages.
To get frequency within a group:
library(dplyr)
mtcars %>% count(cyl, gear) %>% mutate(Freq = n/sum(n))
# Source: local data frame [8 x 4]
# Groups: cyl [3]
#
# cyl gear n Freq
# (dbl) (dbl) (int) (dbl)
# 1 4 3 1 0.09090909
# 2 4 4 8 0.72727273
# 3 4 5 2 0.18181818
# 4 6 3 2 0.28571429
# 5 6 4 4 0.57142857
# 6 6 5 1 0.14285714
# 7 8 3 12 0.85714286
# 8 8 5 2 0.14285714
or equivalently,
mtcars %>% group_by(cyl, gear) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
Careful of what the grouping is at each stage, or your numbers will be off.
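
One way to avoid depending on whatever grouping count() leaves behind is to state it explicitly. The output shown above comes from a dplyr version in which count(cyl, gear) returned a result still grouped by cyl. As far as I know, current versions return an ungrouped result for ungrouped input, in which case `n / sum(n)` would be a share of all rows. A sketch with an explicit group_by (the name `props` is illustrative):

```r
library(dplyr)

props <- mtcars %>%
  count(cyl, gear) %>%
  group_by(cyl) %>%               # proportions are taken within each cyl
  mutate(Freq = n / sum(n)) %>%
  ungroup()

props
```

Within each cyl the Freq values sum to 1. Dropping the group_by line gives proportions of the whole data set instead.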

dplyr: group mean centering (mutate + summarize)

What is the efficient/preferred way to do group mean centering with dplyr, that is take each element of a group (mutate) and perform an operation on it and a summary stat (summarize) for that group. Here's how one might do group mean centering on mtcars using base R:
do.call(rbind, lapply(split(mtcars, mtcars$cyl), function(x){
x[["cent"]] <- x$mpg - mean(x$mpg)
x
}))
You can try
library(dplyr)
mtcars %>%
add_rownames()%>% #if the rownames are needed as a column
group_by(cyl) %>%
mutate(cent= mpg-mean(mpg))
It appears that the above code uses the global mean to center mpg. What should I do if I want to center at the within-group mean, i.e. when the mean differs across the levels of cyl?
> mtcars %>%
+ add_rownames()%>% #if the rownames are needed as a column
+ group_by(cyl) %>%
+ mutate(cent= mpg-mean(mpg))%>%
+ dplyr ::select(cent)
Adding missing grouping variables: `cyl`
# A tibble: 32 x 2
# Groups: cyl [3]
cyl cent
<dbl> <dbl>
1 6 0.909
2 6 0.909
3 4 2.71
4 6 1.31
5 8 -1.39
6 6 -1.99
7 8 -5.79
8 4 4.31
9 4 2.71
10 6 -0.891
# … with 22 more rows
Warning message:
Deprecated, use tibble::rownames_to_column() instead.
> mtcars$mpg[1:5]-mean(mtcars$mpg)
[1] 0.909375 0.909375 2.709375 1.309375 -1.390625
You can try this instead (although the name of the new variable displayed is different):
mtcars %>%
group_by(cyl) %>%
mutate(gpcent = scale(mpg, scale = FALSE))
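
If the centered values match `mpg - mean(mtcars$mpg)`, as in the output above, a likely cause is that another package's mutate() (plyr's, for example) is masking dplyr's. The sketch below qualifies the calls with `dplyr::` and cross-checks the result against base R's `ave()`, which returns the group mean for every row by default. `centered` and `base_cent` are illustrative names.

```r
library(dplyr)

# Namespace-qualified verbs, so masking by other packages cannot interfere
centered <- mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::mutate(cent = mpg - mean(mpg)) %>%
  dplyr::ungroup()

# Base R cross-check: subtract the per-cyl mean from each mpg value
base_cent <- mtcars$mpg - ave(mtcars$mpg, mtcars$cyl)

all.equal(centered$cent, base_cent)

# Each group's centered values should average to (numerically) zero
tapply(centered$cent, centered$cyl, mean)
```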
