R - keep random rows per group, but different numbers per group


The function sample_n() from the dplyr package lets you randomly keep a specific number of rows. Combined with group_by(), you can for instance keep 2 observations per group:
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(2)
# A tibble: 4 x 2
# Groups: vs [2]
vs drat
<dbl> <dbl>
1 0 3.07
2 0 3.9
3 1 4.22
4 1 3.08
Question: is there an easy way to select a different number of observations per group? For instance, if I want to keep 2 observations for the first group, and 3 for the second one. If I give a vector to the function sample_n(), it only uses the first value (result is the same as above).
mtcars %>%
select(vs, drat) %>%
group_by(vs) %>%
sample_n(c(2,3))
Thanks in advance.

Create a list-column for each group using group_nest(), add a column with the number of samples you want in each group, then map these two columns to the sample_n() function:
library(tidyverse)
mtcars %>%
select(vs, drat) %>%
group_nest(vs, keep= TRUE) %>%
add_column(mysamples = c(2,3)) %>%
mutate(sampled = map2(data , mysamples, ~ sample_n(.x, .y))) %>%
.$sampled %>%
bind_rows()
# A tibble: 5 x 2
vs drat
<dbl> <dbl>
1 0 3.15
2 0 4.22
3 1 3.7
4 1 4.93
5 1 3.08
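The same idea can be sketched without the intermediate list-column. This is a minimal sketch, assuming dplyr >= 1.0.0 (for slice_sample(), which supersedes sample_n()) and assuming the sample sizes are listed in the sorted order of the groups; the name n_per_group is only illustrative.

```r
library(dplyr)
library(purrr)

set.seed(42)  # only to make the sketch reproducible

# Illustrative sample sizes: 2 rows for vs == 0, 3 rows for vs == 1.
# group_split() returns the groups in sorted key order, so the
# sizes must be listed in that same order.
n_per_group <- c(2, 3)

sampled <- mtcars %>%
  select(vs, drat) %>%
  group_split(vs) %>%                              # one data frame per group
  map2(n_per_group, ~ slice_sample(.x, n = .y)) %>% # sample each group with its own n
  bind_rows()

print(sampled)
```

The result has 5 rows in total; which rows are drawn depends on the seed.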

Related

group_by() level disappear after filter()/mutate()/count() without using ungroup

This problem has bothered me for the entire day and I don't know why it happens.
The issue is that the group_by() level disappears after a single call such as filter(), mutate(), or count(). To keep that level, I need to add group_by() again after each of these calls.
An example is attached below.
As you can see, if I add group_by() after filter(), it works fine.
data("mtcars")
> mtcars %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
However, if I use group_by before filter and count the value, it will lose the group by level
data("mtcars")
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
mpg n
1 21.0 2
2 21.4 1
To make it work, I need to change the code to
> mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ group_by(cyl) %>%
+ count(mpg)
cyl mpg n
1 6 21.0 2
2 6 21.4 1
This method also doesn't work:
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
mpg n
1 21.0 2
2 21.4 1
When I run the same code on another PC, it works as expected:
data("mtcars")
mtcars %>%
+ group_by(cyl) %>%
+ filter(hp == 110) %>%
+ count(mpg)
# A tibble: 2 x 3
# Groups: cyl [1]
cyl mpg n
<dbl> <dbl> <int>
1 6 21 2
2 6 21.4 1
I have reinstalled the dplyr package many times and this keeps happening. I am using dplyr version 1.0.2.
I would really appreciate it if someone could help me with this issue!
Edit:
The problem was solved after I updated R to version 4.0.2 (my previous version was 3.6.3). I am not sure why dplyr doesn't work properly under 3.6.3, but at least the problem is solved for now.

Try this:
data("mtcars")
> mtcars %>%
+ dplyr::group_by(cyl) %>%
+ dplyr::filter(hp == 110) %>%
+ dplyr::count(mpg)
There may be a masking problem: a function named filter exists in both the dplyr and stats packages. The same issue was discussed here. A similar problem occurs with the select function.
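One way to check for masking is to ask R which package the bare name filter currently resolves to. This is a minimal sketch, assuming a session where only dplyr has been attached on top of the default packages; the variable names are illustrative.

```r
library(dplyr)

# Which namespace does the bare name `filter` resolve to right now?
# "dplyr" means dplyr's version is found first on the search path;
# "stats" would mean stats::filter is the one being called.
filter_source <- environmentName(environment(filter))
print(filter_source)

# Namespacing each call sidesteps masking regardless of load order.
res <- mtcars %>%
  dplyr::group_by(cyl) %>%
  dplyr::filter(hp == 110) %>%
  dplyr::count(mpg)
print(res)
```

conflicts(detail = TRUE) from base R lists every name that is defined in more than one attached package, which can help spot other masked functions.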

Also note in that context the difference between:
data("mtcars")
mtcars %>%
group_by(cyl,gear) %>%
summarize(
n=n()
) %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl [3]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 11
2 4 4 8 11
3 4 5 2 11
4 6 3 2 7
5 6 4 4 7
6 6 5 1 7
mtcars %>%
group_by(cyl,gear) %>%
count() %>%
mutate(mysum = sum(n))
# A tibble: 8 x 4
# Groups: cyl, gear [8]
cyl gear n mysum
<dbl> <dbl> <int> <int>
1 4 3 1 1
2 4 4 8 8
3 4 5 2 2
4 6 3 2 2
5 6 4 4 4
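summarise() drops the last grouping variable by default (.groups = "drop_last"), whereas count() on already-grouped data keeps the existing groups, as the "Groups:" lines above show. And for a funny reason :)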
https://twitter.com/hadleywickham/status/1254802700589555715
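The grouping left behind by summarise() can be controlled explicitly with its .groups argument. This is a minimal sketch, assuming dplyr >= 1.0.0; the names kept and dropped are illustrative.

```r
library(dplyr)

# .groups = "keep": the result stays grouped by cyl and gear,
# so sum(n) is computed within each (cyl, gear) group.
kept <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "keep") %>%
  mutate(mysum = sum(n))

# .groups = "drop": the result is ungrouped,
# so sum(n) is the total number of rows in mtcars.
dropped <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "drop") %>%
  mutate(mysum = sum(n))

print(group_vars(kept))
print(group_vars(dropped))
```

Setting .groups explicitly also silences the message summarise() prints about the grouping of its output.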

Get top values of multiple group bys

I've tried a few ways to achieve this (do, row_number) but I'm still stuck.
I have 3 groups: month, city, and gender.
I would like to get only the top 5 count of these 3 group bys.
This code works fine only with 2 groups:
df_top5_2grp <- df %>%
group_by(month, city) %>%
tally() %>%
top_n(n = 5, wt = n) %>%
arrange(month, desc(n))
However, it won't return the top 5 count if I add an additional group:
df_top5_3grp <- df %>%
group_by(month, city, gender) %>%
tally() %>%
top_n(n = 5, wt = n) %>%
arrange(month, gender, desc(n))
It returns all rows instead. The only difference is I added gender.
Any help is appreciated. Thanks!

You probably need an ungroup() in there.
In the first example below, all the rows are returned: after tally() the data is still grouped by cyl and vs, which gives 5 groups with at most two rows each. So taking the top 5 within each of those groups returns every row.
mtcars %>%
group_by(cyl, vs, am) %>% # grouping across three variables
tally() %>% # tally is a summarization that removes the last grouping
top_n(n = 5, wt = n)
# A tibble: 7 x 4
# Groups: cyl, vs [5] # NOTE! This reminds us the data is still grouped
cyl vs am n
<dbl> <dbl> <dbl> <int>
1 4 0 1 1
2 4 1 0 3
3 4 1 1 7
4 6 0 1 3
5 6 1 0 4
6 8 0 0 12
7 8 0 1 2
Adding ungroup makes it so the top 5 filtering happens across all the summarized groups, not within each group.
mtcars %>%
group_by(cyl, vs, am) %>%
tally() %>%
ungroup() %>%
top_n(n = 5, wt = n)
# A tibble: 5 x 4
cyl vs am n
<dbl> <dbl> <dbl> <int>
1 4 1 0 3
2 4 1 1 7
3 6 0 1 3
4 6 1 0 4
5 8 0 0 12
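In dplyr 1.0.0 and later, top_n() is superseded by slice_max(). A minimal sketch of the same ungrouped top-5 selection, assuming dplyr >= 1.0.0 (the name top5 is illustrative):

```r
library(dplyr)

top5 <- mtcars %>%
  group_by(cyl, vs, am) %>%
  tally() %>%
  ungroup() %>%          # rank across all summarized rows, not within groups
  slice_max(n, n = 5)    # first n is the column to rank by, second is how many rows

print(top5)
```

Unlike top_n(), slice_max() returns the rows sorted by the ranking column in descending order. Ties are kept by default (with_ties = TRUE), so more than 5 rows can come back on other data.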

how to calculate proportion by another variable (not by frequency) in dplyr in R

Using mtcars data, I want to calculate proportion of mpg for each group of cyl and am. How to calc it?
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n(mpg)) %>%
mutate(mpg.gr = mpg / sum(mpg))
Thanks in advance!

If I understand you correctly, you want the proportion of records for each combination of cyl and am. If so, then I believe your code isn't working because n() doesn't accept an argument. You also need to ungroup() before calculating your proportions.
You could simply do:
mtcars %>%
group_by(cyl, am) %>%
summarise(mpg = n()) %>%
ungroup() %>%
mutate(mpg.gr = mpg / sum(mpg))
#> # A tibble: 6 x 4
#> cyl am mpg mpg.gr
#> <dbl> <dbl> <int> <dbl>
#> 1 4 0 3 0.0938
#> 2 4 1 8 0.25
#> 3 6 0 4 0.125
#> 4 6 1 3 0.0938
#> 5 8 0 12 0.375
#> 6 8 1 2 0.0625
Note that thanks to ungroup(), the proportions are calculated using the counts of all records, not just those within the cyl group, as before.
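If the goal is instead each group's share of the total mpg (as the question title, "not by frequency", may suggest), the same pattern works with sum(mpg) in place of n(). This is a sketch under that assumed reading of the question; the column names mpg_total and mpg_share are illustrative.

```r
library(dplyr)

shares <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mpg_total = sum(mpg), .groups = "drop") %>%  # total mpg per (cyl, am)
  mutate(mpg_share = mpg_total / sum(mpg_total))         # share of the overall total

print(shares)
```

Here .groups = "drop" plays the role of the explicit ungroup() call, so the shares sum to 1 across all six groups.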

For loop with dplyr package

I want to make this for loop for each colname in my dataframe but I have an error with group_by method :
Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"
My code :
for(i in colnames(creditDF)){
distribution <- creditDF %>%
group_by(i) %>%
summarise(value = n()) %>%
select(label = i, value)
print(distribution)
}
How can I fix this error?
Thanks for your help.

I offer a more tidy alternative that creates a frequency table by column and binds them in a single data frame.
library(dplyr)
library(purrr)
mtcars %>%
map(~table(.x)) %>%
lapply(as_tibble) %>%
bind_rows(.id = "var")
# # A tibble: 171 x 3
# var .x n
# <chr> <chr> <int>
# 1 mpg 10.4 2
# 2 mpg 13.3 1
# 3 mpg 14.3 1
# 4 mpg 14.7 1
# 5 mpg 15 1
# 6 mpg 15.2 2
# 7 mpg 15.5 1
# 8 mpg 15.8 1
# 9 mpg 16.4 1
# 10 mpg 17.3 1
# # ... with 161 more rows

If I'm understanding your code correctly, you want to count the unique items in each column of your data frame and print each table to the console:
for(i in colnames(creditDF)){
distribution <- creditDF %>%
group_by_at(.vars = i) %>%
summarise(value = n())
print(distribution)
}
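In dplyr 1.0.0 and later, the scoped verb group_by_at() is superseded, and the same loop can be sketched with across() and all_of(). This is a minimal sketch, assuming dplyr >= 1.0.0; creditDF is not available here, so two columns of mtcars stand in for it.

```r
library(dplyr)

# Stand-in for creditDF, which is not shown in the question.
creditDF <- mtcars[, c("cyl", "gear")]

for (i in colnames(creditDF)) {
  distribution <- creditDF %>%
    group_by(across(all_of(i))) %>%           # group by the column named in i
    summarise(value = n(), .groups = "drop")  # count rows per distinct value
  print(distribution)
}
```

all_of() tells tidyselect to treat i as a character vector of column names rather than as a column called i, which is what the original group_by(i) tripped over.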

Solution with base R.
for(i in creditDF) print(as.data.frame(table(i)))

R - dplyr Summarize and Retain Other Columns

I am grouping data and then summarizing it, but would also like to retain another column. I do not need to do any evaluations of that column's content as it will always be the same as the group_by column. I can add it to the group_by statement but that does not seem "right". I want to retain State.Full.Name after grouping by State. Thanks
TDAAtest <- data.frame(State=sample(state.abb,1000,replace=TRUE))
TDAAtest$State.Full.Name <- state.name[match(TDAAtest$State,state.abb)]
TDAA.states <- TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
summarize(n=n()) %>%
ungroup() %>%
arrange(State)

Perhaps we need
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
summarise(State.Full.Name = first(State.Full.Name), n = n())
Or use mutate() to create the column and then apply distinct():
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
mutate(n= n()) %>%
distinct(State, .keep_all=TRUE)

To retain all columns, you can include across() as a summarize argument, as explained in the documentation for dplyr::do().
by_cyl <- head(mtcars) %>%
group_by(cyl)
by_cyl %>%
summarise(m_mpg = mean(mpg), across())
cyl m_mpg mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 22.8 22.8 108 93 3.85 2.32 18.6 1 1 4 1
2 6 20.4 21 160 110 3.9 2.62 16.5 0 1 4 4
3 6 20.4 21 160 110 3.9 2.88 17.0 0 1 4 4
4 6 20.4 21.4 258 110 3.08 3.22 19.4 1 0 3 1
5 6 20.4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
6 8 18.7 18.7 360 175 3.15 3.44 17.0 0 0 3 2
To retain only a subset of unaltered columns, you can select them within across using tidyselect semantics.
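An alternative that gives the same shape of result without summarise() is a grouped mutate() followed by select(). This is a swapped-in technique, not the across() approach described above, and the choice of disp and hp as the retained columns is only illustrative.

```r
library(dplyr)

retained <- head(mtcars) %>%
  group_by(cyl) %>%
  mutate(m_mpg = mean(mpg)) %>%      # group mean repeated on every row of the group
  ungroup() %>%
  select(cyl, m_mpg, disp, hp)       # keep only the columns of interest

print(retained)
```

Because mutate() never changes the number of rows, every original row is kept alongside its group mean.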

I believe there are more accurate answers than the accepted one, especially when the other columns are not unique within each group (e.g. when you want the max, the min, or the top n items based on one particular column).
The accepted answer works for this question, but suppose, for instance, that you would like to find the county with the maximum population for each state. (You need to have county and population columns.)
We have the following options:
1. dplyr version
Following this link, it takes three extra operations (mutate, ungroup and filter) to achieve that:
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
mutate(maxPopulation = max(Population)) %>%
ungroup() %>%
filter(maxPopulation == Population)
2. Function version
This one gives you as much flexibility as you want and you can apply any kind of operation to each group:
maxFUN = function(x) {
# order population in a descending order
x = x[with(x, order(-Population)), ]
x[1, ]
}
TDAAtest %>%
filter(!is.na(State)) %>%
group_by(State) %>%
do(maxFUN(.))
This one is highly recommended for more complex operations. For instance, you can return the top n (topN) counties per state by returning x[1:topN, ] from maxFUN.
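For the specific top-n case, dplyr 1.0.0 and later also offers slice_max(), which avoids do(). This is a minimal sketch, assuming dplyr >= 1.0.0; the counties data frame and its values are made up for illustration, since the question's data has no County or Population columns.

```r
library(dplyr)

# Hypothetical data: invented states, counties, and populations.
counties <- data.frame(
  State      = c("A", "A", "A", "B", "B"),
  County     = c("a1", "a2", "a3", "b1", "b2"),
  Population = c(10, 30, 20, 5, 7),
  stringsAsFactors = FALSE
)

topN <- 2

top_counties <- counties %>%
  group_by(State) %>%
  slice_max(Population, n = topN) %>%  # top rows per state, largest first
  ungroup()

print(top_counties)
```

With topN set to 1 this returns the single most populous county per state, like maxFUN above, except that tied counties are all kept by default.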
