Accessing grouped subset in dplyr

I have the feeling this has already been asked several times, but I cannot make it work in my case. I don't know why.
I group_by my data frame and calculate a mean of the values. Additionally, I have marked one specific row per group, and I want to calculate the ratio of the freshly calculated mean to the value of the highlighted row of that subset.
library(dplyr)

df <- data.frame(int = c(5:1, 4:1),
                 highlight = c(T, F, F, F, F, F, T, F, F),
                 exp = c('a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'))

df %>%
  group_by(exp) %>%
  summarise(mean = mean(int),
            l1 = nrow(.),
            ratio_mean = .[.$highlight, 'int'] / mean)
But for some reason, . is not the subset of group_by but the complete input. Am I missing something here?
My expected output would be:
  exp    mean ratio_mean
  <fct> <dbl>      <dbl>
1 a       3         1.67
2 b       2.5       1.2

This works:
df %>%
  group_by(exp) %>%
  summarise(mean = mean(int),
            l1 = n(),
            ratio_mean = int[highlight] / mean)
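For reference, run against the df above this returns the expected values (on R >= 4.0, where exp prints as <chr> rather than <fct>):
# A tibble: 2 x 4
#   exp    mean    l1 ratio_mean
#   <chr> <dbl> <int>      <dbl>
# 1 a       3       5       1.67
# 2 b       2.5     4       1.2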
But what's going wrong with your solution?
nrow(.) counts the number of rows of your whole input dataframe, whereas n() counts only the rows per group.
.[.$highlight, 'int']/mean again uses the whole input dataframe and subsets it by the highlight column, but it gets divided by the correct group mean. You are actually returning two values here, as two rows of your original df have highlight = TRUE. This also causes a nasty NA column name.
To save it, we could use do() as suggested by @MikkoMarttila, but this gets a little bit clunky:
df %>%
  group_by(exp) %>%
  do(summarise(., mean = mean(.$int),
               l1 = nrow(.),
               ratio_mean = .$int[.$highlight] / mean))
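To see the nrow(.) vs. n() difference directly, here is a quick check (a sketch; the column names are just illustrative):
df %>%
  group_by(exp) %>%
  summarise(rows_piped = nrow(.),  # 9 in both groups: `.` is the entire piped data frame
            rows_group = n())      # 5 for "a", 4 for "b": n() is the current group size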
Original output
df %>%
  group_by(exp) %>%
  summarise(mean = mean(int),
            l1 = nrow(.),
            ratio_mean = .[.$highlight, 'int'] / mean)
# A tibble: 2 x 4
#   exp    mean    l1 ratio_mean$   NA
#   <fct> <dbl> <int>       <dbl> <dbl>
# 1 a       3       9        1.67   2
# 2 b       2.5     9        1      1.2


Standard deviation of average events per ID in R

Background
I've got this dataset d:
d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors = FALSE)
It's got 2 people (IDs) in it, and they each have some events.
The problem
I'm trying to get an average number (count) of events per person, along with a standard deviation for that average, all in one result (it can be a dataframe or not, doesn't matter).
In other words I'm looking for something like this:
| Mean | SD |
|------|------|
| 4.00 | 2.83 |
What I've tried
I'm not far off, I don't think -- it's just that I've got 2 separate pieces of code doing these calculations. Here's the mean:
d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%
  summarise(ratio = mean(event))
# A tibble: 1 x 1
#   ratio
#   <dbl>
# 1     4
And here's the SD:
d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%
  summarise(sd = sd(event))
# A tibble: 1 x 1
#      sd
#   <dbl>
# 1  2.83
But when I try to pipe them together like so...
d %>%
  group_by(ID) %>%
  summarise(event = length(event)) %>%
  summarise(ratio = mean(event)) %>%
  summarise(sd = sd(event))
... I get an error:
Error in `h()`:
! Problem with `summarise()` column `sd`.
ℹ `sd = sd(event)`.
✖ object 'event' not found
Any insight?
You have to combine the last two summarise() calls into a single call. After summarise(), the only remaining columns are the ones you named and the grouping columns, so after your second summarise() the event column no longer exists.
library(dplyr)

d <- data.frame(ID = c("a","a","a","a","a","a","b","b"),
                event = c("G12","R2","O99","B4","B4","A24","L5","J15"),
                stringsAsFactors = FALSE)

d %>%
  group_by(ID) %>%
  # the next summarise will be within ID
  summarise(event = length(event)) %>%
  # this summarise is overall
  summarise(sd = sd(event),
            ratio = mean(event))
#> # A tibble: 1 × 2
#>      sd ratio
#>   <dbl> <dbl>
#> 1  2.83     4
The code is a bit confusing because you are renaming the event variable, and you are doing the first summarise() within groups and the second without grouping. This code is a little easier to read and gets the same result:
d %>%
  count(ID) %>%
  summarise(sd = sd(n),
            ratio = mean(n))
Created on 2022-05-25 by the reprex package (v2.0.1)
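If you later need more summary statistics, the same pattern extends with across() and a named list of functions (a sketch; the .names spec simply drops the default n_ prefix):
d %>%
  count(ID) %>%
  summarise(across(n, list(mean = mean, sd = sd), .names = "{.fn}"))
#> # A tibble: 1 × 2
#>    mean    sd
#>   <dbl> <dbl>
#> 1     4  2.83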

Fairly new to R; can anyone tell me the difference between these queries?

penguins %>%
  group_by(island, species) %>%
  drop_na() %>%
  summarise(meaxbill = max(penguins$bill_length_mm))

penguins %>%
  group_by(island, species) %>%
  drop_na() %>%
  summarise(meaxbill = max(bill_length_mm))
I'll word it a little more strongly: when using the pipe operator %>% and the dplyr package, you should not use the dataframe name together with the column names ($-indexing). While it works sometimes, if anything in the pipeline removes, adds, or reorders the rows, then your subsequent calculations will be wrong. It isn't just that you don't need the dataframe name; if you do use it, you are likely corrupting your data. The first query is broken; do not trust it. (Whether the result is truly corrupted or not may be contextual; I don't know if it corrupts things here.)
Let me demonstrate. If we want to know the max bill length (mm) of all of the penguins, by sex, we should do something like this:
library(dplyr)
data("penguins", package = "palmerpenguins")

penguins %>%
  drop_na() %>%
  group_by(sex) %>%
  summarize(maxbill = max(bill_length_mm))
# # A tibble: 2 x 2
#   sex    maxbill
#   <fct>    <dbl>
# 1 female    58
# 2 male      59.6
If for some reason we instead use penguins$bill_length_mm, then we'll see this:
penguins %>%
  drop_na() %>%
  group_by(sex) %>%
  summarize(maxbill = max(penguins$bill_length_mm))
# # A tibble: 2 x 2
#   sex    maxbill
#   <fct>    <dbl>
# 1 female      NA
# 2 male        NA
which will likely encourage us to add na.rm=TRUE, and then we'll get a seemingly valid-ish number:
penguins %>%
  drop_na() %>%
  group_by(sex) %>%
  summarize(maxbill = max(penguins$bill_length_mm, na.rm = TRUE))
# # A tibble: 2 x 2
#   sex    maxbill
#   <fct>    <dbl>
# 1 female    59.6
# 2 male      59.6
but the problem is that max(.) is being passed all of penguins$bill_length_mm, not just the values within each group.
In this case, the use of penguins$ is not a syntax error, it is a logical error, and there is no way for dplyr or anything else in R to know that what you are doing is not what you really need. It works, because max(.) sees a vector and it returns a single number; then summarize(.) sees a single number and assigns it to a new variable.
And in this case, our results are corrupted.
The only time it may be valid to use penguins$ here is if we truly need to bring in a number or object from outside the current "view" of the data. Realize that the data summarize(.) sees is not the data that started in the pipe: it has been filtered (by drop_na()), and it might have been changed (if we had mutated some columns) or reordered (if we had arranged the data).
But if we need to find out the percentage of the max bill length with respect to the max of the original data, we might do this:
penguins %>%
  drop_na() %>%
  group_by(sex) %>%
  summarize(
    maxbill = max(bill_length_mm),
    maxbill_ratio = max(bill_length_mm) / max(penguins$bill_length_mm, na.rm = TRUE)
  )
# # A tibble: 2 x 3
#   sex    maxbill maxbill_ratio
#   <fct>    <dbl>         <dbl>
# 1 female    58           0.973
# 2 male      59.6         1
(Recall that we needed to add na.rm=TRUE in that call because one of the rows has an NA ... and the data we see in that last max has not been filtered/cleaned by the drop_na() call.)
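A related tip: if you want to be explicit about where a column comes from without $-indexing the original frame, dplyr's .data pronoun refers to the current (already filtered and grouped) slice of the data, not to the original penguins:
penguins %>%
  drop_na() %>%
  group_by(sex) %>%
  summarize(maxbill = max(.data$bill_length_mm))  # .data is the group's data, so this matches max(bill_length_mm)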

dplyr::summarise with filtering inside

Inside dplyr::summarise, how can I apply filters based on a different column than the one I'm summarising?
Example:
t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  z = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

t %>%
  dplyr::group_by(x) %>%
  dplyr::summarise(
    mall = mean(y), # this should include all rows in each group
    ma = mean(y),   # this should only include rows where z == 1
    mb = mean(y)    # this should only include rows where z == 2
  )
So, the problem here is to apply a summary function to one column, while filtering based on another, all within summarise.
One idea was double-grouping, i.e. applying group_by on both x and z, but I don't want all summary columns to be based on double-grouping; some (like mall in the example above) should be based on single-grouping only.
One quick option would be to use ifelse to keep the rows you need, set the rest to missing, and use the na.rm = TRUE argument to ignore the missing values, like the example below.
t %>%
  dplyr::group_by(x) %>%
  dplyr::summarise(
    mall = mean(y),                                 # includes all rows in each group
    ma = mean(ifelse(z == 1, y, NA), na.rm = TRUE), # only rows where z == 1
    mb = mean(ifelse(z == 2, y, NA), na.rm = TRUE)  # only rows where z == 2
  )
# A tibble: 3 x 4
#       x  mall    ma    mb
#   <dbl> <dbl> <dbl> <dbl>
# 1     1   2.5     2     3
# 2     2   6.5     6     7
# 3     3  10.5    10    11
While the answer by @Colin H is certainly the way to go for this specific example, a more flexible way to approach this is to work within the subsets of the first grouping operation. This could be implemented with dplyr::group_split plus a subsequent purrr::map_dfr, but there is also dplyr::group_modify to do it in one step.
Note this relevant sentence from the documentation of dplyr::group_modify:
Use group_modify() when summarize() is too limited, in terms of what you need to do and return for each group.
So here is a solution for the example provided above:
t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  z = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

t %>%
  dplyr::group_by(x) %>%
  dplyr::group_modify(function(x, ...) {
    x %>%
      dplyr::mutate(mall = mean(y)) %>%
      dplyr::group_by(z, mall) %>%
      dplyr::summarise(m = mean(y), .groups = "drop")
  }) %>%
  dplyr::ungroup()
# A tibble: 6 x 4
#       x     z  mall     m
#   <dbl> <dbl> <dbl> <dbl>
# 1     1     1   2.5     2
# 2     1     2   2.5     3
# 3     2     1   6.5     6
# 4     2     2   6.5     7
# 5     3     1  10.5    10
# 6     3     2  10.5    11
group_modify applies a function to each subset tibble after grouping by x. This function has two arguments:
- the subset of the data for the group, exposed as .x, and
- the key, a tibble with exactly one row and one column for each grouping variable, exposed as .y.
Within our function we use mutate to cover the requested mall case first. We do not need any further grouping for that, because it is already covered by the wrapping group_modify. Then we apply another group_by + summarise to cover the different values of z. Note that this solution is independent of the number of cases of z we want to consider: while the two cases in this example could easily be handled manually, that changes when there are more.
If the wide output format with individual columns for the cases in z is required, then you can further modify the output of my code with tidyr::pivot_wider.
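For example, a sketch building on the code above (the m1/m2 column names come from names_prefix and are only illustrative):
t %>%
  dplyr::group_by(x) %>%
  dplyr::group_modify(function(x, ...) {
    x %>%
      dplyr::mutate(mall = mean(y)) %>%
      dplyr::group_by(z, mall) %>%
      dplyr::summarise(m = mean(y), .groups = "drop")
  }) %>%
  dplyr::ungroup() %>%
  # spread the m column into one column per value of z
  tidyr::pivot_wider(names_from = z, values_from = m, names_prefix = "m")
# A tibble: 3 x 4
#       x  mall    m1    m2
#   <dbl> <dbl> <dbl> <dbl>
# 1     1   2.5     2     3
# 2     2   6.5     6     7
# 3     3  10.5    10    11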
Another option, and perhaps a little more concise, is via subsetting:
t %>%
  group_by(x) %>%
  summarise(mall = mean(y),
            ma = mean(y[z == 1]),
            mb = mean(y[z == 2]))
# A tibble: 3 x 4
#       x  mall    ma    mb
# * <dbl> <dbl> <dbl> <dbl>
# 1     1   2.5     2     3
# 2     2   6.5     6     7
# 3     3  10.5    10    11
Here is another generic way (just like group_modify) to perform custom filtering on a group's data while summarising. It uses dplyr's context-dependent expression cur_data(), which makes the current group's data available inside dplyr verbs like mutate/summarise:
t %>%
  dplyr::group_by(x) %>%
  dplyr::summarize(
    mall = mean(y),
    ma = mean(cur_data() %>% as.data.frame() %>% filter(z == 1) %>% pull(y)),
    mb = mean(cur_data() %>% as.data.frame() %>% filter(z == 2) %>% pull(y))
  )
The benefit of using cur_data() is that you can perform any complex filtering or munging before returning the final summary. For more information refer to: https://dplyr.tidyverse.org/reference/context.html
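One caveat: cur_data() was deprecated in dplyr 1.1.0 in favour of pick(). Assuming that version, an equivalent sketch would be:
t %>%
  dplyr::group_by(x) %>%
  dplyr::summarize(
    mall = mean(y),
    # pick(y, z) returns a tibble of those columns for the current group
    ma = mean(dplyr::pick(y, z) %>% dplyr::filter(z == 1) %>% dplyr::pull(y)),
    mb = mean(dplyr::pick(y, z) %>% dplyr::filter(z == 2) %>% dplyr::pull(y))
  )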

Summarise using multiple functions with dplyr across()

I have data where an id variable should identify a unique observation. However, some ids are repeated. I want to get an idea of which measurements are driving this repetition by grouping by id and then calculating the proportion of inconsistent responses for each variable.
Below is an example of what I mean:
require(tidyverse)
df <- tibble(id = c(1, 1, 2, 3, 4, 4, 4),
             col1 = c('a','a','b','b','c','c','c'), # perfectly consistent
             col2 = c('a','b','b','b','c','c','c'), # id 1 is inconsistent - proportion inconsistent = 0.25
             col3 = c('a','a','b','b','a','b','c'), # id 4 is inconsistent - proportion inconsistent = 0.25
             col4 = c('a','b','b','b','b','b','c')  # ids 1 and 4 are inconsistent - proportion inconsistent = 0.5
)
I can test for inconsistent responses within ids by using group_by(), across(), and n_distinct() as per the below:
# count the number of distinct responses for each id in each column;
# if the value is equal to 1, it means that all responses were consistent
df <- df %>%
  group_by(id) %>%
  mutate(across(.cols = c(col1:col4), ~n_distinct(.), .names = '{.col}_distinct')) %>%
  ungroup()
For simplicity I can now take one row for each id:
# take one row for each test (so we aren't counting duplicates twice)
df <- distinct(df, across(c(id, contains('distinct'))))
Now I would like to calculate the proportion of ids that contained an inconsistent response for each variable. I would like to do something like the following:
consistency <- df %>%
  summarise(across(contains('distinct'), ~sum(. > 1) / n(.)))
But this gives the following error, which I am having trouble interpreting:
Error: Problem with `summarise()` input `..1`.
x unused argument (.)
ℹ Input `..1` is `across(contains("distinct"), ~sum(. > 1)/n(.))`.
I can get the answer I want by doing the following:
# calculate consistency for each column by finding the number of distinct
# values greater than 1 and dividing by total rows

# first get the number of distinct values
n_inconsistent <- df %>%
  summarise(across(.cols = contains('distinct'), ~sum(. > 1)))

# next get the number of rows
n_total <- nrow(df)

# calculate the proportion of tests that have more than one value for each column
consistency <- n_inconsistent %>%
  mutate(across(contains('distinct'), ~. / n_total))
But this involves intermediate variables and feels inelegant.
You can do it in the following way:
library(dplyr)

df %>%
  group_by(id) %>%
  summarise(across(starts_with('col'), n_distinct)) %>%
  summarise(across(starts_with('col'), ~mean(. > 1), .names = '{.col}_distinct'))
#   col1_distinct col2_distinct col3_distinct col4_distinct
#           <dbl>         <dbl>         <dbl>         <dbl>
# 1             0          0.25          0.25           0.5
First we count the number of unique values per id in each column, and then we calculate the proportion of ids for which that count is above 1 in each column.
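As for the error in the original attempt: n() takes no arguments (it returns the current group size, or the row count when ungrouped), so n(.) fails with "unused argument (.)". Dropping the stray dot makes the original one-liner work as intended, using the df that already contains the *_distinct columns:
consistency <- df %>%
  summarise(across(contains('distinct'), ~sum(. > 1) / n()))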

Creating a frequency table with dplyr to count factor levels and missing values and report them

Some questions are similar to this topic (here or here, for example), and I know one solution that works, but I want a more elegant answer.
I work in epidemiology and I have variables coded 1 and 0 (or NA). Example:
Does the patient have cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the variables with "1". It's a classic frequency table, but dplyr is making things more complicated than I could have imagined at first glance.
My code is working:
dataset %>%
  select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
         ComDis, ASD, HealthImpair, DevDelays) %>% # adapt to your needs
  summarise_all(funs(sum(1 - is.na(.))))
And you can reproduce this code here:
library(tidyverse)

dataset <- data.frame(var1 = rep(c(NA, 1), 100), var2 = rep(c(NA, 1), 100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1 - is.na(.))))
But I really want to select all the variables I need, count how many 0s (or NAs) and how many 1s I have, and report that in a single output.
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0s and load the necessary libraries.
library(tidyr)
library(dplyr)

dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100), var2 = rep(c(NA, 1, 0), 100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
  select(var1, var2) %>%
  gather(var, val) %>%
  mutate(val = factor(val)) %>%
  group_by(var, val) %>%
  count()
# A tibble: 6 x 3
# Groups:   var, val [6]
#   var   val       n
#   <chr> <fct> <int>
# 1 var1  0       100
# 2 var1  1       100
# 3 var1  NA      100
# 4 var2  0       100
# 5 var2  1       100
# 6 var2  NA      100
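If you would rather report one row per variable, the same counts can be spread into columns (a sketch using tidyr::pivot_wider; the NA column holds the missing-value counts):
dataset %>%
  select(var1, var2) %>%
  gather(var, val) %>%
  mutate(val = factor(val)) %>%
  count(var, val) %>%
  tidyr::pivot_wider(names_from = val, values_from = n)
# A tibble: 2 x 4
#   var     `0`   `1`  `NA`
#   <chr> <int> <int> <int>
# 1 var1    100   100   100
# 2 var2    100   100   100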
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 <- as.factor(dataset$var1)
dataset$var2 <- as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() tells you the number of occurrences of each level of the factor, including NAs.
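For the edited sample data above (300 rows per column, 100 of each value), this prints roughly:
summary(dataset$var1)
#>    0    1 NA's
#>  100  100  100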
