What would be the best tool/package to use to calculate proportions by subgroups? I thought I could try something like this:
data(mtcars)
library(plyr)
ddply(mtcars, .(cyl), transform, Pct = gear/length(gear))
But the output is not what I want: I would like a result whose number of rows equals the number of cyl groups. Even if I change it to summarise, I still get the same problem.
I am open to other packages, but I thought plyr would be best as I would eventually like to build a function around this. Any ideas?
I'd appreciate any help just solving a basic problem like this.
library(dplyr)
mtcars %>%
count(cyl, gear) %>%
mutate(prop = prop.table(n))
See ?count. Basically, count() is a wrapper for summarise() with n() that does the group_by() for you. Look at the output of just mtcars %>% count(cyl, gear). Then we use mutate() to add a variable named prop, which is the result of calling prop.table() on the n variable created by count(cyl, gear).
You could turn this into a function using the SE (standard evaluation) version of count(), that is, count_(). Look at the vignette for Non-Standard Evaluation in the dplyr package. Note that in more recent dplyr releases the underscored verbs such as count_() are deprecated in favour of tidy evaluation.
Here's a nice github gist addressing lots of cross-tabulation variants with dplyr and other packages.
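As a rough sketch of the "wrap it in a function" idea, assuming a recent dplyr release that supports the {{ }} (embrace) operator: the helper name prop_by and its arguments are made up for illustration and are not part of dplyr. The grouping is set explicitly so that the proportions are computed within each level of the first variable, whatever grouping count() returns.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): proportion of each `var` level
# within each `group` level.
prop_by <- function(data, group, var) {
  data %>%
    count({{ group }}, {{ var }}) %>%  # one row per group/var combination
    group_by({{ group }}) %>%          # make the denominator explicit
    mutate(prop = n / sum(n)) %>%
    ungroup()
}

res <- prop_by(mtcars, cyl, gear)
res
```

Calling prop_by(mtcars, gear, cyl) instead would give proportions within each gear level.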
To get frequency within a group:
library(dplyr)
mtcars %>% count(cyl, gear) %>% mutate(Freq = n/sum(n))
# Source: local data frame [8 x 4]
# Groups: cyl [3]
#
# cyl gear n Freq
# (dbl) (dbl) (int) (dbl)
# 1 4 3 1 0.09090909
# 2 4 4 8 0.72727273
# 3 4 5 2 0.18181818
# 4 6 3 2 0.28571429
# 5 6 4 4 0.57142857
# 6 6 5 1 0.14285714
# 7 8 3 12 0.85714286
# 8 8 5 2 0.14285714
or equivalently,
mtcars %>% group_by(cyl, gear) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
Be careful about what the grouping is at each stage, or your numbers will be off.
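One way to check the grouping at each stage is dplyr's group_vars(). The sketch below assumes a reasonably recent dplyr and uses the group_by/summarise form; the intermediate object names are only for illustration.

```r
library(dplyr)

by_both <- mtcars %>% group_by(cyl, gear)
group_vars(by_both)   # both grouping variables are active here

counts <- by_both %>% summarise(n = n())
group_vars(counts)    # summarise() dropped the last level, leaving cyl

# Because the data are still grouped by cyl, sum(n) is the per-cyl total
freq <- counts %>% mutate(Freq = n / sum(n))
```

If group_vars() shows no grouping at the point where you divide by sum(n), the result is a share of the whole table rather than a share within cyl.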
I have a data frame with numerous variables I can group by.
I write a new chunk every time:
df %>% group_by(variable) %>% summarize()
Yet when I make a boxplot, I do not have to do this. I can simply add the groups in the function:
boxplot(df$numericvariable ~ df$variable_I_want_to_group_by, data=df)
This allows me in Rmarkdown to write all the different group_by's in the same chunk and view all the plots created next to each other.
I would like to have the same kind of grouping as an integral part of a call to summarize (or another function from a different package that does the same).
Expanding on the idea of writing a custom function so that you can quickly try lots of groupings, use the ... dots.
f <- function(...) {
  mtcars %>%
    group_by(...) %>%
    summarise(mean = mean(disp), n = n())
}
f(cyl)
f(cyl, gear)
You may use base R aggregate with a similar formula interface to boxplot,
aggregate(disp ~ cyl, mtcars, \(x) c(mean=mean(x), n=length(x)))
# cyl disp.mean disp.n
# 1 4 105.1364 11.0000
# 2 6 183.3143 7.0000
# 3 8 353.1000 14.0000
which will give you the same as dplyr.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(mean = mean(disp), n =n())
# # A tibble: 3 × 3
# cyl mean n
# <dbl> <dbl> <int>
# 1 4 105. 11
# 2 6 183. 7
# 3 8 353. 14
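The formula interface also accepts more than one grouping variable on the right-hand side, which may help when trying several groupings in the same chunk. A small base R sketch (the object name res is arbitrary):

```r
# Mean of disp for every cyl/gear combination that occurs in the data
res <- aggregate(disp ~ cyl + gear, data = mtcars, FUN = mean)
res
```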
I am cleaning some data and like to use the count() function in dplyr to look at unique values of every variable.
Is there a way to do this automatically? Right now I am using this method:
df %>% count(variable1)
df %>% count(variable2)
df %>% count(variable3)
...
I would like something that returns all of them without me having to repeat the line of code and type in each variable. I thought about trying to have R recognize all the column names and automatically fill them in but I'm not sure where to start. If I just add variables together, say
df %>% count(variable1, variable2)
I get counts by both of those variables when I want individual tables for each variable.
Assume that you want to count am, gear, and carb from mtcars. You can apply the function table() on each variable by map(), which returns a list object.
library(dplyr)
library(purrr)
mtcars %>%
select(am, gear, carb) %>%
map(table)
# $am
# 0 1
# 19 13
#
# $gear
# 3 4 5
# 15 12 5
#
# $carb
# 1 2 3 4 6 8
# 7 10 3 10 1 1
Base R version:
lapply(mtcars[c("am", "gear", "carb")], table)
In addition, you can use summary(), which counts factor variables.
mtcars %>%
select(am, gear, carb) %>%
mutate(across(everything(), as.factor)) %>%
summary()
# am gear carb
# 0:19 3:15 1: 7
# 1:13 4:12 2:10
# 5: 5 3: 3
# 4:10
# 6: 1
# 8: 1
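If you would rather keep using count() itself, one option is to loop over the column names and refer to each column through the .data pronoun. This is only a sketch: the vars vector is chosen for illustration, and the result is a named list with one count table per column.

```r
library(dplyr)

vars <- c("am", "gear", "carb")  # columns chosen for illustration

# One count() table per column, returned as a named list
count_tables <- lapply(
  setNames(vars, vars),
  function(v) count(mtcars, .data[[v]])
)

count_tables$gear
```

Replacing vars with names(df) would cover every column of a data frame df.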
It looks like a tidyverse approach can solve your issue. You want the counts for each variable in your dataset (next time, please add a sample of df). You can get close to what you want by reshaping the data to long format. Here is an example with the mtcars data, using some variables that take only a few distinct values so that they can be summarised with counts. Here is the code:
library(tidyverse)
#Data
data("mtcars")
I will select some categorical variables with the following code, then reshape to long format. Finally, I will use summarise() and n() (used for counting) with group_by() to determine the counts:
#Code
mtcars %>% select(cyl,vs,am,gear,carb) %>%
#Format to long
pivot_longer(cols = everything()) %>%
#Group and summarise
group_by(name,value) %>%
summarise(N=n())
Output:
# A tibble: 16 x 3
# Groups: name [5]
name value N
<chr> <dbl> <int>
1 am 0 19
2 am 1 13
3 carb 1 7
4 carb 2 10
5 carb 3 3
6 carb 4 10
7 carb 6 1
8 carb 8 1
9 cyl 4 11
10 cyl 6 7
11 cyl 8 14
12 gear 3 15
13 gear 4 12
14 gear 5 5
15 vs 0 18
16 vs 1 14
As you can see, all the variables are shown with their respective groups and counts.
A simple solution would be to use sapply or lapply with table:
sapply(df,table)
This returns a list of count tables, one for each column of df. You can always pass in a subsetted data frame to get the counts for just your variables of interest.
I want to start using dplyr in place of ddply but I can't get a handle on how it works (I've read the documentation).
For example, why when I try to mutate() something does the "group_by" function not work as it's supposed to?
Looking at mtcars:
library(car)
Say I make a data.frame which is a summary of mtcars, grouped by "cyl" and "gear":
df1 <- mtcars %.%
group_by(cyl, gear) %.%
summarise(
newvar = sum(wt)
)
Then say I want to further summarise this data frame. With ddply, it'd be straightforward, but when I try to do it with dplyr, it's not actually "grouping by":
df2 <- df1 %.%
group_by(cyl) %.%
mutate(
newvar2 = newvar + 5
)
Still yields an ungrouped output:
cyl gear newvar newvar2
1 6 3 6.675 11.675
2 4 4 19.025 24.025
3 6 4 12.375 17.375
4 6 5 2.770 7.770
5 4 3 2.465 7.465
6 8 3 49.249 54.249
7 4 5 3.653 8.653
8 8 5 6.740 11.740
Am I doing something wrong with the syntax?
Edit:
If I were to do this with plyr and ddply:
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
and then to get the second df:
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
But that same approach, with sum(newvar) + 5 in the summarise() function doesn't work with dplyr...
I had a similar problem. I found that simply detaching plyr solved it:
detach(package:plyr)
library(dplyr)
Taking Dickoa's answer one step further: as Hadley says, "summarise peels off a single layer of grouping". It peels off the grouping in the reverse of the order in which you applied it, so you can just use
mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt)) %>%
summarise(newvar2 = sum(newvar) + 5)
Note that this will give a different answer if you use group_by(gear, cyl) in the second line.
And to get your first attempt working:
df1 <- mtcars %>%
group_by(cyl, gear) %>%
summarise(newvar = sum(wt))
df2 <- df1 %>%
group_by(cyl) %>%
summarise(newvar2 = sum(newvar)+5)
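To see why the order inside group_by() matters, here is a short sketch comparing the two orderings (the object names a and b are arbitrary). Each chain ends up with one row per level of whichever variable was listed first, because the variable listed last is the one that gets peeled off.

```r
library(dplyr)

a <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt)) %>%        # peels off gear, still grouped by cyl
  summarise(newvar2 = sum(newvar) + 5)

b <- mtcars %>%
  group_by(gear, cyl) %>%
  summarise(newvar = sum(wt)) %>%        # peels off cyl, still grouped by gear
  summarise(newvar2 = sum(newvar) + 5)

names(a)  # cyl and newvar2
names(b)  # gear and newvar2
```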
If you translate your plyr code into dplyr using summarise instead of mutate you get the same results.
library(plyr)
df1 <- ddply(mtcars, .(cyl, gear), summarise, newvar = sum(wt))
df2 <- ddply(df1, .(cyl), summarise, newvar2 = sum(newvar) + 5)
df2
## cyl newvar2
## 1 4 30.143
## 2 6 26.820
## 3 8 60.989
detach(package:plyr)
library(dplyr)
mtcars %.%
group_by(cyl, gear) %.%
summarise(newvar = sum(wt)) %.%
group_by(cyl) %.%
summarise(newvar2 = sum(newvar) + 5)
## cyl newvar2
## 1 4 30.143
## 2 8 60.989
## 3 6 26.820
EDIT
Since summarise drops the last group (gear), you can skip the second group_by (see @hadley's comment below)
library(dplyr)
mtcars %.%
group_by(cyl, gear) %.%
summarise(newvar = sum(wt)) %.%
summarise(newvar2 = sum(newvar) + 5)
## cyl newvar2
## 1 4 30.143
## 2 8 60.989
## 3 6 26.820
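The %.% operator used above comes from early dplyr releases; later releases use %>% instead. A sketch of the same chain for a current installation, assuming dplyr 1.0 or later for the .groups argument, which states the peeling behaviour explicitly rather than relying on the default:

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(newvar = sum(wt), .groups = "drop_last") %>%  # still grouped by cyl
  summarise(newvar2 = sum(newvar) + 5)

res
```

The values should match the plyr result above, although the rows may be printed in a different order than in the older output.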
Detaching plyr is one way to solve the problem so you can use dplyr functions as desired... but what if you need other functions from plyr to complete other tasks in your code?
(In this example, I've got both dplyr and plyr libraries loaded)
Suppose we have a simple data.frame and we want to compute the groupwise sum of the variable value, when grouped by different levels of gname
dx <- data.frame(gname = c(1,1,1,2,2,2,3,3,3), value = c(2,2,2,4,4,4,5,6,7))
dx
gname value
1 1 2
2 1 2
3 1 2
4 2 4
5 2 4
6 2 4
7 3 5
8 3 6
9 3 7
But when we try to use what we believe will produce a dplyr grouped sum, here's what happens:
dx %>% group_by(gname) %>% mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname
gname value mysum
1 1 2 36
2 1 2 36
3 1 2 36
4 2 4 36
5 2 4 36
6 2 4 36
7 3 5 36
8 3 6 36
9 3 7 36
It doesn't give us the desired answer, probably because of masking between dplyr and plyr: plyr also exports a mutate() function, and when plyr is attached after dplyr an unqualified mutate() call resolves to the plyr version, which ignores the grouping. We could detach plyr, but another way is to call the dplyr versions of group_by and mutate explicitly:
dx %>% dplyr::group_by(gname) %>% dplyr::mutate(mysum=sum(value))
Source: local data frame [9 x 3]
Groups: gname
gname value mysum
1 1 2 6
2 1 2 6
3 1 2 6
4 2 4 12
5 2 4 12
6 2 4 12
7 3 5 18
8 3 6 18
9 3 7 18
Now we see that this works as expected.
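If you are unsure which package an unqualified call will use, base R's find() lists every attached package that defines a name, in search-path order. A minimal sketch, assuming only dplyr is attached here; with plyr attached afterwards, "package:plyr" would be expected to appear first.

```r
library(dplyr)

# Every attached package that defines `mutate`, in search-path order;
# the first entry is the one an unqualified mutate() call would pick up.
where_mutate <- find("mutate")
where_mutate
```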
dplyr is working as you should expect in your example. Mutate, as you specified it, will just add 5 to each value of newvar as it creates newvar2. This would look the same if you group or not. If, however, you specify something that differs by group you will get something different. For example:
df1 %.%
group_by(cyl) %.%
mutate(
newvar2 = newvar + mean(cyl)
)
Is there any way to add series lines to stacked bar charts created by ggplot in R? I have looked around in the documentation - to no avail.
If I understand what you are asking correctly, here is an example of how you might do something like that using the mtcars data set. This uses dplyr and ggplot2
library(dplyr)
library(ggplot2)
## Calculate 'y' for each cyl/vs pair
(mtcars_summary <-
mtcars %>%
arrange(cyl, vs) %>%
group_by(cyl, vs) %>%
summarise(count = n()) %>% # Count per cyl/vs pair
group_by(cyl) %>% # Here to be explicit, but can be left out.
mutate(count = cumsum(count)) %>% # Calculate 'y' for each cyl/vs pair
ungroup)
## cyl vs count
## (dbl) (dbl) (int)
## 1 4 0 1
## 2 4 1 11
## 3 6 0 3
## 4 6 1 7
## 5 8 0 14
ggplot(mtcars, aes(cyl, fill = factor(vs))) +
geom_bar() +
geom_line(aes(cyl, count), data = mtcars_summary) +
ggtitle('Data: mtcars')
I must be missing something with how group_by levels in dplyr get peeled off. In the example below, I group by 2 columns, summarize values into a single variable, then sort by that new variable:
mtcars %>% group_by( cyl, gear ) %>%
summarize( hp_range = max(hp) - min(mpg)) %>%
arrange( desc(hp_range) )
# Source: local data frame [8 x 3]
# Groups: cyl [3]
#
# cyl gear hp_range
# (dbl) (dbl) (dbl)
#1 4 4 87.6
#2 4 5 87.0
#3 4 3 75.5
#4 6 5 155.3
#5 6 4 105.2
#6 6 3 91.9
#7 8 5 320.0
#8 8 3 234.6
Obviously this is not sorted by hp_range as intended. What am I missing?
EDIT: The example works as expected without the call to desc in arrange. Still unclear why?
Ok, just got to the bottom of this:
The call to desc was not the cause; it was only by chance that the example appeared to work without it, because the ascending order within each group happens to coincide with the overall ascending order.
The key is that when you group_by multiple columns, it seems that results are automatically sorted by the groups. In the example above, the result is sorted by cyl first. To get the intended sort of the entire data table, you must first ungroup and then arrange:
mtcars %>% group_by( cyl, gear ) %>%
summarize( hp_range = max(hp) - min(mpg)) %>%
ungroup() %>%
arrange( hp_range )
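For the descending sort that the question originally asked for, a sketch along the same lines, assuming dplyr 1.0 or later so that .groups = "drop" can remove the grouping in the summarise step itself:

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(hp_range = max(hp) - min(mpg), .groups = "drop") %>%  # ungrouped result
  arrange(desc(hp_range))                                         # sort the whole table

res
```

Dropping the groups before arrange() means the sort applies to the whole table regardless of how a given dplyr version treats grouped data in arrange().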