custom grouped dplyr function (sample_n) - r

I am trying to apply a sampling function in a grouped fashion to a data frame, where it should sample n samples from each group, or all group members if the group size is smaller than n.
Using dplyr, I first tried
library(dplyr)
mtcars %>% group_by(cyl) %>% sample_n(2)
This works when n is smaller than all the group sizes but does not take the full group when I choose n larger than the group size (note that there are 7 cars in one of the cyl groups):
mtcars %>% group_by(cyl) %>% sample_n(8)
Error: `size` must be less or equal than 7 (size of data),
set `replace` = TRUE to use sampling with replacement
I tried to solve this by creating an adapted sample_n function like so:
sample_n_or_all <- function(tbl, n) {
  if (nrow(tbl) < n) return(tbl)
  sample_n(tbl, n)
}
but using my custom function (mtcars %>% group_by(cyl) %>% sample_n_or_all(8)) generates the same error.
Any suggestions how I can adapt my function so I can apply it to each of the groups? Or another solution to the problem?

We could check the number of rows in the group and pass the value to sample_n accordingly.
library(dplyr)
n <- 8
temp <- mtcars %>% group_by(cyl) %>% sample_n(if(n() < n) n() else n)
temp
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 4 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 7 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 8 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 9 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#10 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
# … with 13 more rows
We can then check the number of rows in each group and compare it with the original group sizes.
table(temp$cyl)
#4 6 8
#8 7 8
table(mtcars$cyl)
# 4 6 8
#11 7 14
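The custom function in the question fails because it receives the whole grouped data frame: nrow(tbl) is then 32 (all rows of mtcars), which is never smaller than 8, so the call falls through to sample_n(tbl, 8) and hits the same error. A minimal sketch of one way to keep the helper, assuming a dplyr version that provides group_modify(), is to apply it to each group separately (the `<=` comparison is a small change from the original `<`):

```r
library(dplyr)

# Return the whole table when it has n rows or fewer, otherwise sample n rows.
sample_n_or_all <- function(tbl, n) {
  if (nrow(tbl) <= n) return(tbl)
  sample_n(tbl, n)
}

# group_modify() calls the helper once per group, so nrow() is the group size.
res <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ sample_n_or_all(.x, 8))

table(res$cyl)
```

With n = 8 this should keep 8, 7 and 8 rows for cyl 4, 6 and 8 respectively.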

We can do this without a logical condition by using pmin:
library(dplyr)
n <- 8
tmp <- mtcars %>%
  group_by(cyl) %>%
  sample_n(pmin(n(), n))
tmp
# A tibble: 23 x 11
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 5 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# 6 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
# 7 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 8 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 9 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#10 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# … with 13 more rows
Checking the group sizes:
table(tmp$cyl)
# 4 6 8
# 8 7 8
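As a further option, assuming dplyr 1.0.0 or later, slice_sample() can replace sample_n() here. When n is larger than a group, slice_sample() truncates the result to the group size instead of raising an error, so no explicit size check is needed. A short sketch:

```r
library(dplyr)

# slice_sample() silently caps n at the size of each group.
res <- mtcars %>%
  group_by(cyl) %>%
  slice_sample(n = 8)

table(res$cyl)
```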

Related

Add dynamic value to row by group

I'd like to add a row for each group, where the entry for a particular column is the mean of the values of that column for that group. It's easy to add a constant value:
library(dplyr)
mtcars %>% group_by(cyl) %>% group_modify(~add_row(.x, .before=0, carb=2))
# A tibble: 35 x 11
# Groups: cyl [3]
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 NA NA NA NA NA NA NA NA NA 2
2 4 22.8 108 93 3.85 2.32 18.6 1 1 4 1
3 4 24.4 147. 62 3.69 3.19 20 1 0 4 2
4 4 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
But when I try to dynamically add e.g. the mean of all carbs for that group, it doesn't recognise carb as a column:
mtcars %>% group_by(cyl) %>% group_modify(~add_row(.x, .before=0, carb=mean(carb)))
Error in mean(carb) : object 'carb' not found
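The error arises because add_row() evaluates its arguments as ordinary R expressions, so a bare carb is looked up as a variable rather than as a column. Inside the group_modify() formula the current group's data is available as .x, so one possible fix is to reference the column through it. A minimal sketch (it uses .before = 1 rather than the .before = 0 of the question to place the new row first):

```r
library(dplyr)

# .x is the data for the current group, so .x$carb is that group's carb column.
res <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ tibble::add_row(.x, .before = 1, carb = mean(.x$carb)))

head(res, 3)
```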
Alternatively, compute the group means with summarise() and bind them back onto the original data:
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(carb = mean(carb)) %>%
bind_rows(mtcars) %>%
arrange(cyl)
#> # A tibble: 35 x 11
#> cyl carb mpg disp hp drat wt qsec vs am gear
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 1.55 NA NA NA NA NA NA NA NA NA
#> 2 4 1 22.8 108 93 3.85 2.32 18.6 1 1 4
#> 3 4 2 24.4 147. 62 3.69 3.19 20 1 0 4
#> 4 4 2 22.8 141. 95 3.92 3.15 22.9 1 0 4
#> 5 4 1 32.4 78.7 66 4.08 2.2 19.5 1 1 4
#> 6 4 2 30.4 75.7 52 4.93 1.62 18.5 1 1 4
#> 7 4 1 33.9 71.1 65 4.22 1.84 19.9 1 1 4
#> 8 4 1 21.5 120. 97 3.7 2.46 20.0 1 0 3
#> 9 4 1 27.3 79 66 4.08 1.94 18.9 1 1 4
#> 10 4 2 26 120. 91 4.43 2.14 16.7 0 1 5
#> # ... with 25 more rows

How to replace duplicate row values by appending indexes in R using dplyr? [duplicate]

I would like to replace duplicate row values in a given column by appending an underscore with an index based on their incidence. For example
old_df_col new_df_col
object object_1
object object_2
object object_3
object object_4
Most other questions focus on deleting duplicate values or replacing them with NA, so I wasn't able to find an implementation using R and dplyr.
Here's what I've worked out so far:
# count duplicates
mtcars %>% group_by(carb) %>% summarize(n=n())
# filter duplicates
mtcars %>% group_by(carb) %>% filter(n()>1)
You can group by the target variable and use row_number() to create the sequence.
You may want to sort the data first (using arrange()) so that the sequence is meaningful for your data, but this is not strictly necessary.
library(dplyr)
mtcars %>% group_by(carb) %>%
arrange(carb, cyl, mpg, hp) %>%
mutate(
carb_seq = paste("carb", carb, "seq", row_number(), sep = "_")
)
# A tibble: 32 x 12
# Groups: carb [6]
mpg cyl disp hp drat wt qsec vs am gear carb carb_seq
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 carb_1_seq_1
2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 carb_1_seq_2
3 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 carb_1_seq_3
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 carb_1_seq_4
5 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 carb_1_seq_5
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 carb_1_seq_6
7 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 carb_1_seq_7
8 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 carb_2_seq_1
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 carb_2_seq_2
10 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 carb_2_seq_3
# … with 22 more rows
Created on 2021-07-11 by the reprex package (v2.0.0)
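The same idea can be applied to the column layout shown in the question. The sketch below uses a small made-up data frame (the "item" value is an assumption added for illustration, to show that the index restarts for each distinct value):

```r
library(dplyr)

# Hypothetical data in the shape described in the question.
df <- data.frame(
  old_df_col = c("object", "object", "item", "object", "item"),
  stringsAsFactors = FALSE
)

# Number the rows within each distinct value and append the index.
out <- df %>%
  group_by(old_df_col) %>%
  mutate(new_df_col = paste(old_df_col, row_number(), sep = "_")) %>%
  ungroup()

out$new_df_col
# "object_1" "object_2" "item_1" "object_3" "item_2"
```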

Can I group_by columns with starts_with?

I'm dealing with a big dataframe that has a number of columns I want to group by. I'd like to do something like this:
output <- df %>%
group_by(starts_with("GEN", ignore.case=TRUE),x,y) %>%
summarize(total=n()) %>%
arrange(desc(total))
Is there a way to do this? Maybe with group_by_at or some other similar function?
To use starts_with() in group_by(), you need to wrap it in across(). Here is an example using a built-in dataset.
library(dplyr)
mtcars %>%
group_by(across(starts_with("c"))) %>%
summarize(total = n()) %>%
arrange(-total)
# A tibble: 9 x 3
# Groups: cyl [3]
cyl carb total
<dbl> <dbl> <int>
1 4 2 6
2 8 4 6
3 4 1 5
4 6 4 4
5 8 2 4
6 8 3 3
7 6 1 2
8 6 6 1
9 8 8 1
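To get closer to the call in the question, which mixes a case-insensitive starts_with() with explicitly named columns, the selections can be combined with c() inside across(). A sketch assuming dplyr 1.0.0 or later, with gear standing in for the x and y columns of the question:

```r
library(dplyr)

# Group by every column starting with "c" (case-insensitive) plus gear.
res <- mtcars %>%
  group_by(across(c(starts_with("C", ignore.case = TRUE), gear))) %>%
  summarise(total = n(), .groups = "drop") %>%
  arrange(desc(total))

res
```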
Yes, there is. You could use the group_by_at function:
mtcars %>% group_by_at(vars(starts_with("c"), gear))
This groups by all columns whose names start with "c" and by the column gear.
Output:
# A tibble: 32 x 11
# Groups: cyl, carb, gear [12]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 22 more rows

Select top rows in R using add_tally and top_n functions

I would like to select the top n rows in a data frame for which I
calculated a column n that represents the sum of a variable. For example,
using the mtcars data, I would like to filter to keep only the two cyl
with the greatest sum of mpg. In the following example, I was expecting
to select all rows where cyl == 4 or cyl == 8. It must be simple, but
I cannot figure out my mistake.
library(tidyverse)
mtcars %>%
group_by(cyl) %>%
summarise(sum(mpg))
#> # A tibble: 3 x 2
#> cyl `sum(mpg)`
#> <dbl> <dbl>
#> 1 4 293.
#> 2 6 138.
#> 3 8 211.
mtcars %>%
group_by(cyl) %>% # Calculate the sum of mpg for each cyl
add_tally(mpg, sort = TRUE) %>%
ungroup() %>%
top_n(2, n)
#> # A tibble: 11 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb n
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 293.
#> 2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 293.
#> 3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 293.
#> 4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 293.
#> 5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 293.
#> 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 293.
#> 7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 293.
#> 8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 293.
#> 9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 293.
#> 10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2 293.
#> 11 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 293.
Created on 2019-07-26 by the reprex package (v0.3.0)
top_n() ranks rows, not distinct values of n, and keeps every row that ties. Here all 11 rows with cyl == 4 share the largest n and tie for rank 1, while the cyl == 8 rows only start at rank 12, so top_n(2, n) returns just those 11 rows.
From the documentation:
Usage
top_n(x, n, wt)
Arguments
x: a tbl() to filter
n: number of rows to return. If x is grouped,
this is the number of rows per group. Will include more than n rows if
there are ties. If n is positive, selects the top n rows. If negative,
selects the bottom n rows.
Instead, as suggested by @tmfmnk, rank the distinct values with dense_rank():
mtcars %>%
group_by(cyl) %>%
add_tally(mpg, sort = TRUE) %>%
ungroup() %>%
filter(dense_rank(desc(n)) < 3)
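Another way to express the same selection, sketched here under the assumption of dplyr 1.0.0 or later (for slice_max()), is to pick the top groups from the summary table first and then keep only the matching rows of the original data. The names top_cyl and total_mpg are chosen for this example only:

```r
library(dplyr)

# Step 1: one row per cyl, keep the two with the largest total mpg.
top_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(total_mpg = sum(mpg)) %>%
  slice_max(total_mpg, n = 2)

# Step 2: keep the rows of mtcars whose cyl is among those groups.
res <- mtcars %>%
  semi_join(top_cyl, by = "cyl")

table(res$cyl)
```

This keeps the 11 rows with cyl == 4 and the 14 rows with cyl == 8.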

sampling by group in R

I have a large sample of healthcare data called oct:
Providers ID date ICD
Billy 4504 9/11 f.11
Billy 5090 9/10 r.05
Max 4430 9/01 k.11
Mindy 0812 9/30 f.11
etc.
I want a random sample of ID numbers for each provider. I have tried:
review <- oct %>% group_by(Providers) %>% do (sample(oct$ID, size = 5, replace= FALSE, prob = NULL))
Here is an example using dplyr::sample_n:
library(dplyr)
set.seed(1)
mtcars %>% group_by(cyl) %>% sample_n(3)
# A tibble: 9 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
4 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
5 21 6 160 110 3.9 2.88 17.0 0 1 4 4
6 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
7 15 8 301 335 3.54 3.57 14.6 0 1 5 8
8 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
9 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
If you'd like to just select a specific variable (ID in your question):
set.seed(1)
mtcars %>%
group_by(cyl) %>%
sample_n(3) %>%
pull(mpg)
[1] 22.8 32.4 33.9 19.7 21.0 19.2 15.0 15.5 14.7
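Applied to the data from the question, the same pattern might look like the sketch below. It rebuilds only the four rows shown there (the real oct data is larger) and assumes dplyr 1.0.0 or later, using slice_sample() so that providers with fewer than 5 rows do not cause an error:

```r
library(dplyr)

# The four rows shown in the question, with ID kept as character
# to preserve the leading zero.
oct <- data.frame(
  Providers = c("Billy", "Billy", "Max", "Mindy"),
  ID        = c("4504", "5090", "4430", "0812"),
  date      = c("9/11", "9/10", "9/01", "9/30"),
  ICD       = c("f.11", "r.05", "k.11", "f.11"),
  stringsAsFactors = FALSE
)

set.seed(1)
review <- oct %>%
  group_by(Providers) %>%
  slice_sample(n = 5) %>%  # capped at the group size for small groups
  ungroup() %>%
  select(Providers, ID)

review
```

With only these four rows every provider has fewer than 5 entries, so all rows are returned; on the full data each provider would contribute at most 5 IDs.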
