For some reason, I could not find a solution using the summarise_all function for the following problem:
df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,10,9))
desired results:
df %>%
group_by(A) %>%
summarise(B = B[which.min(D)],
C = C[which.min(D)],
D = D[which.min(D)])
# A tibble: 4 x 4
A B C D
<dbl> <int> <int> <dbl>
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 8 1 9
What I tried:
df %>%
group_by(A) %>%
summarise_all(.[which.min(D)])
In words, I want to group by a variable and find for each column the value that belongs to the minimum value of another column. I could not find a solution for this using summarise_all. I am searching for a dplyr approach.
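You can just filter down to the row that has the minimum value of D for each level of A. The code below always returns exactly one row per group, so it assumes there is only one minimum row in each group (if there are ties, only one of the tied rows is kept).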
df %>%
group_by(A) %>%
arrange(D) %>%
slice(1)
A B C D
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 8 1 9
If there can be multiple rows with minimum D, then:
df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,9,9))
df %>%
group_by(A) %>%
filter(D == min(D))
A B C D
1 1 1 8 1
2 2 2 7 2
3 3 4 5 1
4 4 7 2 9
5 4 8 1 9
You need filter: any time you're trying to drop some rows and keep others, that's the verb you want.
df %>% group_by(A) %>% filter(D == min(D))
#> # A tibble: 4 x 4
#> # Groups: A [4]
#> A B C D
#> <dbl> <int> <int> <dbl>
#> 1 1 1 8 1
#> 2 2 2 7 2
#> 3 3 4 5 1
#> 4 4 8 1 9
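For reference, on dplyr 1.0.0 or later the same row selection can probably be written with slice_min(). This is only a minimal sketch using the df from the question; res is just an illustrative name. With with_ties = FALSE a single row is kept per group, while the default with_ties = TRUE keeps all tied rows, much like filter(D == min(D)).

```r
library(dplyr)

df <- data.frame(A = c(1,2,2,3,3,3,4,4), B = 1:8, C = 8:1, D = c(1,2,3,1,2,5,10,9))

# Keep, for each level of A, the row with the smallest D.
# with_ties = FALSE guarantees a single row per group even if D has ties.
res <- df %>%
  group_by(A) %>%
  slice_min(D, n = 1, with_ties = FALSE) %>%
  ungroup()

res
```

Whether to keep or drop ties is a design choice: use with_ties = TRUE if every row at the minimum matters.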
I need to assign an index value when a value is repeated.
Here is a sample dataset.
df <- data.frame(id = c("A","A","B","C","D","D","D"))
> df
id
1 A
2 A
3 B
4 C
5 D
6 D
7 D
How can I get that indexing column as below:
> df1
id index
1 A 1
2 A 2
3 B 1
4 C 1
5 D 1
6 D 2
7 D 3
base R:
df$index <- ave(rep(1L, nrow(df)), df$id, FUN = seq_along)
df
# id index
# 1 A 1
# 2 A 2
# 3 B 1
# 4 C 1
# 5 D 1
# 6 D 2
# 7 D 3
Another option using n() like this:
library(dplyr)
df %>%
group_by(id) %>%
mutate(index = 1:n()) %>%
ungroup()
#> # A tibble: 7 × 2
#> id index
#> <chr> <int>
#> 1 A 1
#> 2 A 2
#> 3 B 1
#> 4 C 1
#> 5 D 1
#> 6 D 2
#> 7 D 3
Created on 2022-09-23 with reprex v2.0.2
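A small variation, in case it helps: row_number() should give the same within-group index without building the sequence by hand. The sketch below reuses the sample df from the question; res is an illustrative name.

```r
library(dplyr)

df <- data.frame(id = c("A","A","B","C","D","D","D"))

# row_number() with no arguments numbers the rows within each group
res <- df %>%
  group_by(id) %>%
  mutate(index = row_number()) %>%
  ungroup()

res
```

One reason to prefer it over 1:n() is that 1:n() would produce c(1, 0) if n() were ever 0, whereas row_number() or seq_len(n()) would return an empty vector.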
I am trying to use dplyr to count elements grouped by multiple conditions (columns) in a data frame. In the example below, the data frame output is at the top and the R code is underneath; I manually inserted the two right-most columns to explain what I am trying to do. I am trying to count the joint groupings of the Element and Group columns. My attempt at grouping on multiple conditions is eleGrpCnt. Any recommendations for the correct way to do this in dplyr? I thought that group_by on the combined (Element, Group) would work.
Element Group origOrder eleCnt eleGrpCnt eleGrpCnt_desired explanation
<chr> <dbl> <int> <int> <int> <comment> <comment>
1 B 0 1 1 1 1 1st grouping of B where Group = 0
2 R 0 2 1 1 1 1st grouping of R where Group = 0
3 R 1 3 2 1 2 2nd grouping of R where Group = 1
4 R 1 4 3 2 2 2nd grouping of R where Group = 1
5 B 0 5 2 2 1 1st grouping of B where Group = 0
6 X 2 6 1 1 1 1st grouping of X where Group = 2
7 X 2 7 2 2 1 1st grouping of X where Group = 2
8 X 0 8 3 1 2 2nd grouping of X where Group = 0
9 X 0 9 4 2 2 2nd grouping of X where Group = 0
10 X -1 10 5 1 3 3rd grouping of X where Group = -1
library(dplyr)
myData6 <-
data.frame(
Element = c("B","R","R","R","B","X","X","X","X","X"),
Group = c(0,0,1,1,0,2,2,0,0,-1)
)
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
group_by(Element, Group) %>%
mutate(eleGrpCnt = row_number())%>%
ungroup()
If you group by element then the numbers you are looking for are simply the matches of Group against the unique values of Group:
library(dplyr)
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
group_by(Element) %>%
mutate(eleGrpCnt = match(Group, unique(Group)))
#> # A tibble: 10 x 5
#> # Groups: Element [3]
#> Element Group origOrder eleCnt eleGrpCnt
#> <chr> <dbl> <int> <int> <dbl>
#> 1 B 0 1 1 1
#> 2 R 0 2 1 1
#> 3 R 1 3 2 2
#> 4 R 1 4 3 2
#> 5 B 0 5 2 1
#> 6 X 2 6 1 1
#> 7 X 2 7 2 1
#> 8 X 0 8 3 2
#> 9 X 0 9 4 2
#> 10 X -1 10 5 3
Created on 2022-09-11 with reprex v2.0.2
Here's one approach; I'm sorting by Group value but if you want to change the order to match original appearance order we could add a step.
myData6 %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(eleCnt = row_number()) %>%
ungroup() %>%
arrange(Element, Group) %>%
group_by(Element) %>%
mutate(eleGrpCnt = cumsum(Group != lag(Group, default = -999))) %>%
ungroup() %>%
arrange(origOrder)
# A tibble: 10 × 5
Element Group origOrder eleCnt eleGrpCnt
<chr> <dbl> <int> <int> <int>
1 B 0 1 1 1
2 R 0 2 1 1
3 R 1 3 2 2
4 R 1 4 3 2
5 B 0 5 2 1
6 X 2 6 1 3
7 X 2 7 2 3
8 X 0 8 3 2
9 X 0 9 4 2
10 X -1 10 5 1
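If the numbering should follow the order of appearance rather than the sorted Group values, one possible sketch is consecutive_id(), assuming dplyr 1.1.0 or later is available. It numbers consecutive runs of Group within each Element, so no sorting step or sentinel default is needed; res is an illustrative name.

```r
library(dplyr)

myData6 <- data.frame(
  Element = c("B","R","R","R","B","X","X","X","X","X"),
  Group = c(0,0,1,1,0,2,2,0,0,-1)
)

# Within each Element, give every consecutive run of Group values its own id
res <- myData6 %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number(),
         eleGrpCnt = consecutive_id(Group)) %>%
  ungroup()

res
```

Note the difference from match(Group, unique(Group)): if a Group value reappears later within the same Element after a different value, consecutive_id() starts a new number while match() reuses the old one. For the sample data both give the same result.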
I know this might be a simple operation, but I can't find a solution. I know it should be some form of group_by and sum or cumsum, but I can't figure out how. I want to plot a cumulative count of something by group over time. I have multiple rows per group and time that need to be counted (and some missing data).
My dataset looks somewhat like this
df <- data.frame(group = c("A","A","A","A","B","B","B","C","C","C","C","C"),
time = c(1,1,2,3,1,2,2,1,2,2,3,3))
and I want this result:
group time count
A 1 2
A 2 3
A 3 4
B 1 1
B 2 3
C 1 1
C 2 3
C 3 5
I usually use dplyr, but I am also happy with base R.
How do I do that?
You can use the following solution:
library(dplyr)
df %>%
group_by(group, time) %>%
add_count() %>%
distinct() %>%
group_by(group) %>%
mutate(n = cumsum(n))
# A tibble: 8 x 3
# Groups: group [3]
group time n
<chr> <dbl> <int>
1 A 1 2
2 A 2 3
3 A 3 4
4 B 1 1
5 B 2 3
6 C 1 1
7 C 2 3
8 C 3 5
We can use summarise with group_by
library(dplyr)
df %>%
group_by(group, time) %>%
summarise(count = n()) %>%
group_by(group) %>%
mutate(count = cumsum(count)) %>%
ungroup
-output
# A tibble: 8 x 3
group time count
<chr> <dbl> <int>
1 A 1 2
2 A 2 3
3 A 3 4
4 B 1 1
5 B 2 3
6 C 1 1
7 C 2 3
8 C 3 5
You can use count and cumsum -
library(dplyr)
df %>%
count(group, time, name = 'count') %>%
group_by(group) %>%
mutate(count = cumsum(count)) %>%
ungroup
# group time count
# <chr> <dbl> <int>
#1 A 1 2
#2 A 2 3
#3 A 3 4
#4 B 1 1
#5 B 2 3
#6 C 1 1
#7 C 2 3
#8 C 3 5
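Since base R was also acceptable, here is a minimal base R sketch of the same idea: count rows per group and time with aggregate(), then take a running total within each group with ave(). The helper column n and the name counts are only illustrative.

```r
df <- data.frame(group = c("A","A","A","A","B","B","B","C","C","C","C","C"),
                 time = c(1,1,2,3,1,2,2,1,2,2,3,3))

# Count rows per (group, time) by summing a column of ones
counts <- aggregate(n ~ group + time, data = transform(df, n = 1L), FUN = sum)

# Sort so the running total follows time within each group
counts <- counts[order(counts$group, counts$time), ]

# Cumulative count within each group
counts$count <- ave(counts$n, counts$group, FUN = cumsum)
counts$n <- NULL
rownames(counts) <- NULL

counts
```

Combinations of group and time that never occur (such as B at time 3) simply do not appear in the result, which matches the desired output in the question.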
I have data where each row represents one observation from one person. For example:
library(dplyr)
dat <- tibble(ID = rep(sample(1111:9999, 3), each = 3),
X = 1:9)
# A tibble: 9 x 2
ID X
<int> <int>
1 9573 1
2 9573 2
3 9573 3
4 7224 4
5 7224 5
6 7224 6
7 7917 7
8 7917 8
9 7917 9
I want to replace these IDs with a different value. It can be anything, but the easiest (and preferred) solution is just to replace them with 1:n groups. So the desired solution would be:
# A tibble: 9 x 2
ID X
<int> <int>
1 1 1
2 1 2
3 1 3
4 2 4
5 2 5
6 2 6
7 3 7
8 3 8
9 3 9
Probably something that starts with:
dat %>%
group_by(ID) %>%
???
A fast option would be match
library(dplyr)
dat %>%
mutate(ID = match(ID, unique(ID)))
-output
# A tibble: 9 x 2
# ID X
# <int> <int>
#1 1 1
#2 1 2
#3 1 3
#4 2 4
#5 2 5
#6 2 6
#7 3 7
#8 3 8
#9 3 9
Or use as.integer on a factor
dat %>%
mutate(ID = as.integer(factor(ID, levels = unique(ID))))
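In tidyverse, we can also use cur_group_id. The grouping column gets a temporary name here (grp) because mutate cannot overwrite a grouping variable.
dat %>%
group_by(grp = factor(ID, levels = unique(ID))) %>%
mutate(ID = cur_group_id()) %>%
ungroup() %>%
select(-grp)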
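To illustrate why the levels = unique(ID) part matters, here is a small sketch comparing numbering by order of appearance with numbering by sorted ID. The IDs are hard-coded from the printed example instead of being sampled, and by_appearance, by_sorted, and new_ID are illustrative names.

```r
library(dplyr)

# IDs copied from the printed example so the result is reproducible
dat <- tibble(ID = rep(c(9573L, 7224L, 7917L), each = 3),
              X = 1:9)

# Numbered in order of first appearance: 9573 -> 1, 7224 -> 2, 7917 -> 3
by_appearance <- dat %>%
  mutate(ID = match(ID, unique(ID)))

# Grouping on the raw ID numbers the groups in sorted order instead:
# 7224 -> 1, 7917 -> 2, 9573 -> 3
by_sorted <- dat %>%
  group_by(ID) %>%
  mutate(new_ID = cur_group_id()) %>%
  ungroup()

by_appearance
by_sorted
```

Either numbering anonymises the IDs; which one to use depends on whether the original row order should be reflected in the new values.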
I want to count unique combinations in a dataframe using dplyr
I tried the following:
require(dplyr)
set.seed(314)
dat <- data.frame(a = sample(1:3, 100, replace = T),
b = sample(1:2, 100, replace = T),
c = sample(1:2, 100, replace = T))
dat %>% group_by(a,b,c) %>% summarise(n = n())
But to make this generic (unrelated to the names of the columns) I tried:
dat %>% group_by(everything()) %>% summarise(n = n())
The first call, with the columns named explicitly, results in:
a b c n
<int> <int> <int> <int>
1 1 1 1 6
2 1 1 2 8
3 1 2 1 13
4 1 2 2 8
5 2 1 1 7
6 2 1 2 12
7 2 2 1 14
8 2 2 2 10
9 3 1 1 3
10 3 1 2 4
11 3 2 1 7
12 3 2 2 8
whereas the group_by(everything()) call gives the error
Error in mutate_impl(.data, dots) : `c(...)` must be a character vector
I fiddled around with different things but cannot get it to work. I know I could use names(dat), but the columns in the dataframe that need to be in the group_by() depend on previous steps in the dplyr chain.
There is a function called group_by_all() (and, in the same sense, group_by_at() and group_by_if()) which does exactly that.
library(dplyr)
dat %>%
group_by_all() %>%
summarise(n = n())
which gives the same result,
# A tibble: 12 x 4
# Groups: a, b [?]
a b c n
<int> <int> <int> <int>
1 1 1 1 6
2 1 1 2 8
3 1 2 1 13
4 1 2 2 8
5 2 1 1 7
6 2 1 2 12
7 2 2 1 14
8 2 2 2 10
9 3 1 1 3
10 3 1 2 4
11 3 2 1 7
12 3 2 2 8
PS
packageVersion('dplyr')
#[1] ‘0.7.2’
We can use .dots
dat %>%
group_by(.dots = names(.)) %>%
summarise(n = n())
# A tibble: 12 x 4
# Groups: a, b [?]
# a b c n
# <int> <int> <int> <int>
#1 1 1 1 6
#2 1 1 2 8
#3 1 2 1 13
#4 1 2 2 8
#5 2 1 1 7
#6 2 1 2 12
#7 2 2 1 14
#8 2 2 2 10
#9 3 1 1 3
#10 3 1 2 4
#11 3 2 1 7
#12 3 2 2 8
Another option would be to use the unquote, sym approach
dat %>%
group_by(!!! rlang::syms(names(.))) %>%
summarise(n = n())
In dplyr version 1.0.0 and later, you would now use across().
library(dplyr)
dat %>%
group_by(across(everything())) %>%
summarise(n = n())
Package version:
> packageVersion("dplyr")
[1] ‘1.0.5’
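As a shorter variant of the across() approach, count() can likely do the grouping and counting in one step. This is a sketch assuming dplyr 1.0.0 or later, with the data from the question; res_group and res_count are illustrative names, and .groups = "drop" is added only so the first result comes back ungrouped.

```r
library(dplyr)

set.seed(314)
dat <- data.frame(a = sample(1:3, 100, replace = TRUE),
                  b = sample(1:2, 100, replace = TRUE),
                  c = sample(1:2, 100, replace = TRUE))

# Explicit version: group by every column, then count rows per group
res_group <- dat %>%
  group_by(across(everything())) %>%
  summarise(n = n(), .groups = "drop")

# Shortcut: count() groups by the selected columns and adds n
res_count <- dat %>%
  count(across(everything()))

res_count
```

Both versions select the columns at the point in the pipeline where they run, so they pick up whatever columns earlier steps produced.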