Which groups have exactly the same rows - r

If I have a data frame like the following:
group1 group2 col1 col2
A      1      ABC  5
A      1      DEF  2
B      1      AB   1
C      1      ABC  5
C      1      DEF  2
A      2      BC   8
B      2      AB   1
We can see that the (A, 1) and (C, 1) groups have the same rows (since col1 and col2 are the same within these groups). The same is true for (B, 1) and (B, 2).
So really we are left with 3 distinct "larger groups" (call them categories) in this data frame, namely:
category group1 group2
1        A      1
1        C      1
2        B      1
2        B      2
3        A      2
I am wondering how I can return the above data frame in R, given a data frame like the first. The order of the "category" column doesn't matter here; for example, (A, 2) could be category 1 instead of {(A, 1), (C, 1)}, as long as each set of identical groups gets a distinct category index.
I have tried a few very long and inefficient ways of doing this in dplyr, but I'm sure there must be a more efficient way. Thanks

You can use pivot_wider first so that each group occupies a single row; groups that span multiple rows can then be compared directly.
library(tidyverse)
df %>%
  group_by(group1, group2) %>%
  mutate(id = row_number()) %>%                                 # position of each row within its group
  pivot_wider(names_from = id, values_from = c(col1, col2)) %>% # one wide row per group
  group_by(across(-c(group1, group2))) %>%                      # group by the widened contents
  mutate(category = cur_group_id()) %>%
  ungroup() %>%
  select(category, group1, group2) %>%
  arrange(category)
category group1 group2
<int> <chr> <int>
1 1 B 1
2 1 B 2
3 2 A 1
4 2 C 1
5 3 A 2
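The pivot_wider approach numbers rows by their position within each group, so it appears to assume that identical groups list their rows in the same order. As an order-independent alternative, here is a minimal base R sketch (not taken from the answers in this thread): it builds one sorted "signature" string per (group1, group2) pair and gives groups with equal signatures the same category. The names row_key, sig and signature are made up for illustration, and the sketch assumes the separators "|" and ";" never occur in the data.

```r
# Example data from the question
df <- data.frame(
  group1 = c("A", "A", "B", "C", "C", "A", "B"),
  group2 = c(1, 1, 1, 1, 1, 2, 2),
  col1   = c("ABC", "DEF", "AB", "ABC", "DEF", "BC", "AB"),
  col2   = c(5, 2, 1, 5, 2, 8, 1)
)

# One string per row (assumes "|" does not appear in col1 or col2)
row_key <- paste(df$col1, df$col2, sep = "|")

# One sorted, collapsed signature per (group1, group2) pair
sig <- aggregate(
  row_key,
  by  = list(group1 = df$group1, group2 = df$group2),
  FUN = function(k) paste(sort(k), collapse = ";")
)
names(sig)[3] <- "signature"

# Groups with equal signatures share a category index
sig$category <- match(sig$signature, unique(sig$signature))
result <- sig[order(sig$category), c("category", "group1", "group2")]
print(result)
```

Sorting the row strings before collapsing them is what makes the comparison independent of row order. The category numbers themselves are arbitrary, which the question allows.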

You could first group_by "col1" and "col2" and select the duplicated rows. Next, you can create a unique ID using cur_group_id like this:
library(dplyr)
library(tidyr)
df %>%
  group_by(col1, col2) %>%
  filter(n() != 1) %>%
  mutate(ID = cur_group_id()) %>%
  ungroup() %>%
  select(-starts_with("col"))
#> # A tibble: 6 × 3
#> group1 group2 ID
#> <chr> <int> <int>
#> 1 A 1 2
#> 2 A 1 3
#> 3 B 1 1
#> 4 C 1 2
#> 5 C 1 3
#> 6 B 2 1
Created on 2022-08-12 by the reprex package (v2.0.1)

Related

R - Summarize dataframe to avoid NAs

Having a dataframe like:
id = c(1,1,1)
A = c(3,NA,NA)
B = c(NA,5,NA)
C= c(NA,NA,2)
df = data.frame(id,A,B,C)
id A B C
1 1 3 NA NA
2 1 NA 5 NA
3 1 NA NA 2
I want to summarize the whole data frame into one row that contains no NAs. It should look like:
id A B C
1 1 3 5 2
It should also work when the data frame is bigger and contains more ids, following the same logic.
I didn't find the right function for that and tried some variations of summarise().
You can group_by id and use max with na.rm = TRUE:
library(dplyr)
df %>%
  group_by(id) %>%
  # a lambda is used because passing na.rm through across() is deprecated since dplyr 1.1.0
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)))
id A B C
1 1 3 5 2
If an id has several non-missing values in the same column, max may not be what you want; you can use sum instead.
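Another option is to keep the first non-missing value per column instead of the maximum or the sum. The following is a minimal base R sketch of that idea, not one of the answers in this thread. The helper first_non_na and the second id in the example data are made up for illustration. Unlike max(..., na.rm = TRUE), which returns -Inf with a warning when every value is missing, this helper returns NA in that case.

```r
# Example data: id 1 is from the question, id 2 is a made-up extra group
df <- data.frame(
  id = c(1, 1, 1, 2, 2),
  A  = c(3, NA, NA, NA, 7),
  B  = c(NA, 5, NA, 4, NA),
  C  = c(NA, NA, 2, NA, NA)
)

# Hypothetical helper: first non-missing value, or NA if all are missing
first_non_na <- function(x) x[!is.na(x)][1]

# na.action = na.pass keeps rows that contain NAs, so the helper sees them
out <- aggregate(. ~ id, data = df, FUN = first_non_na, na.action = na.pass)
print(out)
```

Without na.action = na.pass, the formula interface of aggregate() would drop every row that contains an NA before FUN is applied, which here would remove all rows.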
Using fmax from collapse
library(collapse)
fmax(df[-1], df$id)
A B C
1 3 5 2
Alternatively, fill the missing values within each id in both directions and keep the first row:
library(dplyr)
library(tidyr)
data.frame(id, A, B, C) %>% group_by(id) %>% fill(c(A, B, C), .direction = 'downup') %>%
  slice_head(n = 1)
# A tibble: 1 × 4
# Groups: id [1]
id A B C
<dbl> <dbl> <dbl> <dbl>
1 1 3 5 2
Created on 2023-02-03 with reprex v2.0.2

Is there a way to do a group by and do a full count as well as a count based on filter in same table?

I have a dataset that looks like this
ID|Filter|
1 Y
1 N
1 Y
1 Y
2 N
2 N
2 N
2 Y
2 Y
3 N
3 Y
3 Y
I would like the final result to look like this: a summary with the total count per ID and the count of rows where Filter is "Y".
ID|All Count|Filter Yes
1 4 3
2 5 2
3 3 2
If I do it like this, I only get the full count, but I also want the filter count as the next column:
df <- df %>%
  group_by(ID) %>%
  summarise(`All Count` = n())
df %>%
  group_by(ID) %>%
  summarise(`All Count` = n(),
            `Count Yes` = sum(Filter == "Y"))
# A tibble: 3 × 3
ID `All Count` `Count Yes`
<chr> <int> <int>
1 1 4 3
2 2 5 2
3 3 3 2
We can use:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(`All Count` = n(), `Filter Yes` = sum(Filter == 'Y', na.rm = TRUE))
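Both answers rely on the fact that summing a logical vector counts its TRUE values. For comparison, here is a minimal base R sketch of the same computation using tapply(); it is an illustration rather than part of the answers above, and the names all_count, yes_count and out are made up.

```r
# Example data from the question
df <- data.frame(
  ID     = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  Filter = c("Y", "N", "Y", "Y", "N", "N", "N", "Y", "Y", "N", "Y", "Y")
)

# Total number of rows per ID
all_count <- tapply(df$Filter, df$ID, length)

# Number of rows per ID where Filter is "Y" (TRUE counts as 1 in sum)
yes_count <- tapply(df$Filter == "Y", df$ID, sum)

out <- data.frame(
  ID           = names(all_count),
  `All Count`  = as.vector(all_count),
  `Filter Yes` = as.vector(yes_count),
  check.names  = FALSE  # keep the column names that contain spaces
)
print(out)
```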

Subsetting first Observation per id and date in r

I want to subset the first date per observation per id. For example, just get the rows for the first date in which observations A and B appeared. If we have the following dataset:
df =
id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A
the outcome should look like this:
df =
id date Observation
1 3 A
1 2 B
2 5 B
2 3 A
thanks
If you don't mind the order being different, it can be accomplished using dplyr by grouping then slicing:
library(tidyverse)
df <- read_table("id date Observation
1 3 A
1 2 B
1 8 B
2 5 B
2 3 A
2 9 A")
df %>%
  group_by(id, Observation) %>%
  slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, Observation [4]
#> id date Observation
#> <dbl> <dbl> <chr>
#> 1 1 3 A
#> 2 1 2 B
#> 3 2 3 A
#> 4 2 5 B
Created on 2021-04-12 by the reprex package (v1.0.0)
library(dplyr)
df %>%
  group_by(id, Observation) %>%
  slice(1) %>%
  ungroup()
# OR
df %>%
  group_by(id, Observation) %>%
  filter(row_number() == 1) %>%
  ungroup()
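If the original row order should be preserved, a base R sketch along the following lines may help; it is an illustration, not one of the answers above. It uses duplicated() to keep the first row seen for each (id, Observation) pair. The second part is an assumption about intent: if "first" means the earliest date rather than the first row in the data, the rows need to be sorted by date first. The names first_rows, sorted and earliest are made up.

```r
# Example data from the question
df <- data.frame(
  id          = c(1, 1, 1, 2, 2, 2),
  date        = c(3, 2, 8, 5, 3, 9),
  Observation = c("A", "B", "B", "B", "A", "A")
)

# Keep the first row encountered for each (id, Observation) pair,
# leaving the remaining rows in their original order
first_rows <- df[!duplicated(df[c("id", "Observation")]), ]
print(first_rows)

# If "first" should mean the earliest date, sort by date within each pair first
sorted   <- df[order(df$id, df$Observation, df$date), ]
earliest <- sorted[!duplicated(sorted[c("id", "Observation")]), ]
print(earliest)
```

For the example data both variants select the same four rows; only their order differs.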

Cumulative sum for each row of data for the same ID

I have this data frame:
df=data.frame(id=c(1,1,2,2,2,5,NA),var=c("a","a","b","b","b","e","f"),value=c(1,1,0,1,0,0,1),cs=c(2,2,3,3,3,3,NA))
I want to calculate the sum of value for each group (id, var) and then the cumulative sum, but I would like the cumulative sum to be displayed for each row of data, i.e., I don't want a summarized view of the data. The cs column above shows what my output should look like. This is what I have tried so far:
df %>% arrange(id, var) %>% group_by(id, var) %>% mutate(cs = cumsum(value))
Any suggestions?
Here is an approach that I think meets your expectations.
I would group by id and calculate the sum of value for each id via summarise.
You can then add your cumulative sum column with mutate. Based on your comments, I included an ifelse so that if id was NA, it would not provide a cumulative sum, but instead be given NA.
Finally, to combine your cumulative sum data with your original dataset, you would need to join the two tables.
library(tidyverse)
df %>%
  arrange(id) %>%
  group_by(id) %>%
  summarise(sum = sum(value)) %>%
  mutate(cs = ifelse(is.na(id), NA, cumsum(sum))) %>%
  left_join(df)
Output
# A tibble: 7 x 5
id sum cs var value
<dbl> <dbl> <dbl> <fct> <dbl>
1 1 2 2 a 1
2 1 2 2 a 1
3 2 1 3 b 0
4 2 1 3 b 1
5 2 1 3 b 0
6 5 0 3 e 0
7 NA 1 NA f 1
Calculate the cumulative sum over all values, even if id is NA, then set the final cs to NA where id is NA:
df %>%
  arrange(id, var) %>%
  mutate(cs = cumsum(value)) %>%
  group_by(id, var) %>%
  mutate(cs = max(ifelse(!is.na(id), cs, NA))) %>%
  ungroup()
Or exclude rows where id is NA when calculating the cumulative sum:
df %>%
  arrange(id, var) %>%
  mutate(cs = cumsum(ifelse(!is.na(id), value, 0))) %>%
  group_by(id, var) %>%
  mutate(cs = max(ifelse(!is.na(id), cs, NA))) %>%
  ungroup()
For your data, both return a similar result:
# A tibble: 7 x 4
# id var value cs
# <dbl> <fct> <dbl> <dbl>
# 1 1 a 1 2
# 2 1 a 1 2
# 3 2 b 0 3
# 4 2 b 1 3
# 5 2 b 0 3
# 6 5 e 0 3
# 7 NA f 1 4
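The same idea (sum per id, running total across ids, then map the total back to every row) can be sketched in base R without a join. This is an illustration under stated assumptions, not one of the answers above: it groups by id only, as the first answer does, and it relies on tapply() dropping rows whose id is NA. The names totals and running are made up.

```r
# Example data from the question, without the desired cs column
df <- data.frame(
  id    = c(1, 1, 2, 2, 2, 5, NA),
  var   = c("a", "a", "b", "b", "b", "e", "f"),
  value = c(1, 1, 0, 1, 0, 0, 1)
)

# Sum of value per id; rows with id = NA are dropped by tapply()
totals <- tapply(df$value, df$id, sum)

# Running total across ids, in increasing id order; names are the ids
running <- cumsum(totals)

# Look up each row's id by name; an NA id yields NA
df$cs <- unname(running[as.character(df$id)])
print(df)
```

For the example data this gives cs values of 2, 2, 3, 3, 3, 3 and NA, which matches the cs column in the question.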

R group by | count distinct values grouping by another column

How can I count the number of distinct visit_ids per pagename?
visit_id post_pagename
1 A
1 B
1 C
1 D
2 A
2 A
3 A
3 B
Result should be:
post_pagename distinct_visit_ids
A 3
B 2
C 1
D 1
I tried it with:
test_df<-data.frame(cbind(c(1,1,1,1,2,2,3,3),c("A","B","C","D","A","A","A","B")))
colnames(test_df)<-c("visit_id","post_pagename")
test_df
test_df %>%
  group_by(post_pagename) %>%
  summarize(vis_count = n_distinct(visit_id))
But this only gives me the number of distinct visit_id values in my whole data set.
One way:
test_df |>
  distinct() |>
  count(post_pagename)
# post_pagename n
# <fct> <int>
# 1 A 3
# 2 B 2
# 3 C 1
# 4 D 1
Or another:
test_df |>
  group_by(post_pagename) |>
  summarise(distinct_visit_ids = n_distinct(visit_id))
# A tibble: 4 x 2
# post_pagename distinct_visit_ids
# <fct> <int>
#1 A 3
#2 B 2
#3 C 1
#4 D 1
*D has one visit, so it must be counted*
The function n_distinct() counts the distinct values it is given. Alternatively, since your data contain duplicated rows (two rows are "2 A"), you can drop the duplicates with unique() first and then use n(), which counts how many rows remain for each value of your grouping variable.
test_df<-data.frame(cbind(c(1,1,1,1,2,2,3,3),c("A","B","C","D","A","A","A","B")))
colnames(test_df)<-c("visit_id","post_pagename")
test_df
test_df %>%
  unique() %>%
  group_by(post_pagename) %>%
  summarize(vis_count = n())
This should work fine.
Hope it helps :)
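For reference, the count of distinct visit_id values per page can also be sketched in base R with tapply(). This is an illustration, not one of the answers above, and the name counts is made up.

```r
# Example data from the question
test_df <- data.frame(
  visit_id      = c(1, 1, 1, 1, 2, 2, 3, 3),
  post_pagename = c("A", "B", "C", "D", "A", "A", "A", "B")
)

# For each page, count how many different visit ids occur
counts <- tapply(
  test_df$visit_id,
  test_df$post_pagename,
  function(v) length(unique(v))
)
print(counts)
```

The repeated "2 A" row does not inflate the count for page A because unique() removes the duplicate visit id before length() is applied.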
