R group by | count distinct values grouping by another column - r

How can I count the number of distinct visit_ids per pagename?
visit_id post_pagename
1 A
1 B
1 C
1 D
2 A
2 A
3 A
3 B
Result should be:
post_pagename distinct_visit_ids
A 3
B 2
C 1
D 1
I tried it with
test_df<-data.frame(cbind(c(1,1,1,1,2,2,3,3),c("A","B","C","D","A","A","A","B")))
colnames(test_df)<-c("visit_id","post_pagename")
test_df
test_df %>%
group_by(post_pagename) %>%
summarize(vis_count = n_distinct(visit_id))
But this gives me only the number of distinct visit_id values in my whole data set, not the number per page.

One way
test_df |>
distinct() |>
count(post_pagename)
# post_pagename n
# <fct> <int>
# 1 A 3
# 2 B 2
# 3 C 1
# 4 D 1
Or another
test_df |>
group_by(post_pagename) |>
summarise(distinct_visit_ids = n_distinct(visit_id))
# A tibble: 4 x 2
# post_pagename distinct_visit_ids
# <fct> <int>
#1 A 3
#2 B 2
#3 C 1
#4 D 1
*D has one visit, so it must be counted*
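If the grouped summary in the question collapses to a single overall count, one possible cause (an assumption here, since the question does not show which packages are attached) is that another package such as plyr was loaded after dplyr and masks summarize(). A minimal sketch that sidesteps masking by namespace-qualifying each call; the object name `res` is mine:

```r
library(dplyr)

# Rebuild the example data with proper column types
test_df <- data.frame(
  visit_id = c(1, 1, 1, 1, 2, 2, 3, 3),
  post_pagename = c("A", "B", "C", "D", "A", "A", "A", "B")
)

# Qualify each verb with dplyr:: so a masking package cannot intercept it
res <- test_df %>%
  dplyr::group_by(post_pagename) %>%
  dplyr::summarise(distinct_visit_ids = dplyr::n_distinct(visit_id))

res
```

If this returns one row per page while the unqualified version does not, masking is the likely culprit.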

n_distinct() counts the number of distinct values. Because two of your rows are identical ("2 A"), you can instead drop duplicate rows with unique() first and then use n(), which counts the rows in each group.
library(dplyr)
test_df<-data.frame(cbind(c(1,1,1,1,2,2,3,3),c("A","B","C","D","A","A","A","B")))
colnames(test_df)<-c("visit_id","post_pagename")
test_df
test_df %>%
unique() %>%
group_by(post_pagename) %>%
summarize(vis_count = n())
This should work fine.
Hope it helps :)

Related

Is there a way to do a group by and do a full count as well as a count based on filter in same table?

I have a dataset that looks like this
ID|Filter|
1 Y
1 N
1 Y
1 Y
2 N
2 N
2 N
2 Y
2 Y
3 N
3 Y
3 Y
I would like the final result to look like this: the total count per ID, plus the count of rows where Filter is "Y".
ID|All Count|Filter Yes
1 4 3
2 5 2
3 3 2
If I do it like this, I only get the full count, but I also want the filter count as the next column.
df<- df %>%
group_by(ID)%>%
summarise(`All Count`=n())
df %>%
group_by(ID) %>%
summarise(`All Count` = n(),
`Count Yes` = sum(Filter == "Y"))
# A tibble: 3 × 3
ID `All Count` `Count Yes`
<chr> <int> <int>
1 1 4 3
2 2 5 2
3 3 3 2
We can use
library(dplyr)
df %>%
group_by(ID)%>%
summarise(`All Count`=n(), `Filter Yes` = sum(Filter == 'Y', na.rm = TRUE))
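Both answers rely on the fact that a logical vector sums as 0/1, so sum(Filter == "Y") counts the matching rows in each group. A self-contained sketch, with the data frame rebuilt from the question's table (the object name `res` is mine):

```r
library(dplyr)

# Data rebuilt from the table in the question
df <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3),
  Filter = c("Y", "N", "Y", "Y", "N", "N", "N", "Y", "Y", "N", "Y", "Y")
)

res <- df %>%
  group_by(ID) %>%
  summarise(
    `All Count` = n(),                               # rows per ID
    `Filter Yes` = sum(Filter == "Y", na.rm = TRUE)  # TRUE counts as 1
  )

res
```

The na.rm = TRUE argument only matters if Filter can contain NA; without it, a single NA would make that group's count NA.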

Which groups have exactly the same rows

If I have a data frame like the following
group1 group2 col1 col2
A 1 ABC 5
A 1 DEF 2
B 1 AB 1
C 1 ABC 5
C 1 DEF 2
A 2 BC 8
B 2 AB 1
We can see that the (A, 1) and (C, 1) groups have the same rows (since col1 and col2 are the same within these groups). The same is true for (B, 1) and (B, 2).
So really we are left with 3 distinct "larger groups" (call them categories) in this data frame, namely:
category group1 group2
1 A 1
1 C 1
2 B 1
2 B 2
3 A 2
And I am wondering how can I return the above data frame in R given a data frame like the first? The order of the "category" column doesn't matter here, for example (A,2) could be group 1 instead of {(A,1), (C,1)}, as long as these have a distinct category index.
I have tried a few very long/inefficient ways of doing this in dplyr, but I'm sure there must be a more efficient way to do this. Thanks.
You can use pivot_wider first to handle identical groups over multiple rows.
library(tidyverse)
df %>%
group_by(group1, group2) %>%
mutate(id = row_number()) %>%
pivot_wider(names_from = id, values_from = c(col1, col2)) %>%
group_by(across(-c(group1, group2))) %>%
mutate(category = cur_group_id()) %>%
ungroup() %>%
select(category, group1, group2) %>%
arrange(category)
category group1 group2
<int> <chr> <int>
1 1 B 1
2 1 B 2
3 2 A 1
4 2 C 1
5 3 A 2
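An alternative sketch, not taken from the answers here: collapse each group's rows into a single signature string and number the distinct signatures. It assumes that the separators ":" and "|" never occur in col1 or col2; the names `signature` and `out` are mine.

```r
library(dplyr)

# Data rebuilt from the table in the question
df <- data.frame(
  group1 = c("A", "A", "B", "C", "C", "A", "B"),
  group2 = c(1, 1, 1, 1, 1, 2, 2),
  col1 = c("ABC", "DEF", "AB", "ABC", "DEF", "BC", "AB"),
  col2 = c(5, 2, 1, 5, 2, 8, 1)
)

out <- df %>%
  # Sort rows so identical groups produce identical signatures
  arrange(group1, group2, col1, col2) %>%
  group_by(group1, group2) %>%
  # One string per group describing all of its (col1, col2) rows
  summarise(
    signature = paste(col1, col2, sep = ":", collapse = "|"),
    .groups = "drop"
  ) %>%
  # Number the signatures in order of first appearance
  mutate(category = match(signature, unique(signature))) %>%
  select(category, group1, group2) %>%
  arrange(category)

out
```

The category numbers differ from those in the answer above, which the question explicitly allows; what matters is that groups with identical rows share a category.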
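You could first group_by "col1" and "col2" and keep the duplicated rows. Next, you can create a unique ID using cur_group_id. Note that the ID is assigned per (col1, col2) combination, so a group with several rows, such as (A, 1), appears once per row, and a group with no duplicated rows, such as (A, 2), is dropped: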
library(dplyr)
library(tidyr)
df %>%
group_by(col1, col2) %>%
filter(n() != 1) %>%
mutate(ID = cur_group_id()) %>%
ungroup() %>%
select(-starts_with("col"))
#> # A tibble: 6 × 3
#> group1 group2 ID
#> <chr> <int> <int>
#> 1 A 1 2
#> 2 A 1 3
#> 3 B 1 1
#> 4 C 1 2
#> 5 C 1 3
#> 6 B 2 1
Created on 2022-08-12 by the reprex package (v2.0.1)

How to reduce factor levels depending on other attribute?

I have a dataframe of two columns id and result, and I want to assign factor levels to result depending on id. So that for id "1", result c("a","b","c","d") will have factor levels 1,2,3,4.
For id "2", result c("22","23","24") will have factor levels 1,2,3.
id <- c(1,1,1,1,2,2,2)
result <- c("a","b","c","d","22","23","24")
I tried to group them by split, but they will be converted to a list instead of a data frame, which causes a length problem for modeling. Can you help please?
Though the question was closed as a duplicate by user @Ronak Shah, I don't believe it is the same question.
After numbering the rows by group, the new column must be coerced to class "factor".
library(dplyr)
id <- c(1,1,1,1,2,2,2)
result <- c("a","b","c","d","22","23","24")
df <- data.frame(id, result)
df %>%
group_by(id) %>%
mutate(fac = row_number()) %>%
ungroup() %>%
mutate(fac = factor(fac))
# A tibble: 7 x 3
# id result fac
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 23 2
#7 2 24 3
Edit.
If there are repeated values in result, use as.integer(factor(result)) within each group to get numbers, then coerce those numbers to factor.
id2 <- c(1,1,1,1,2,2,2,2)
result2 <- c("a","b","c","d","22", "22","23","24")
df2 <- data.frame(id = id2, result = result2)
df2 %>%
group_by(id) %>%
mutate(fac = as.integer(factor(result))) %>%
ungroup() %>%
mutate(fac = factor(fac))
# A tibble: 8 x 3
# id result fac
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 22 1
#7 2 23 2
#8 2 24 3
After grouping by id, we can use match with unique to assign a unique number to each result. Using @Rui Barradas' dataframe df2:
library(dplyr)
df2 %>%
group_by(id) %>%
mutate(ans = match(result, unique(result))) %>%
ungroup() %>%
mutate(ans = factor(ans))
# id result ans
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 22 1
#7 2 23 2
#8 2 24 3
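The two approaches above agree on df2, but they can number values differently when the values are not already in sorted order: as.integer(factor(x)) follows the sorted factor levels, while match(x, unique(x)) follows the order of first appearance. A small base R sketch on a made-up vector (the vector and variable names are mine, chosen only for illustration):

```r
# Hypothetical values that are not in sorted order
result <- c("b", "a", "b")

# Factor levels are sorted, so "a" is level 1 and "b" is level 2
by_level <- as.integer(factor(result))

# unique() keeps first-appearance order, so "b" is 1 and "a" is 2
by_first_seen <- match(result, unique(result))

by_level
by_first_seen
```

Which numbering is appropriate depends on whether the level numbers should reflect sort order or row order within each id.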

count distinct levels of a data frame for groups based on a condition

I have the following DF
x = data.frame('grp' = c(1,1,1,2,2,2),'a' = c(1,2,1,1,2,1), 'b'= c(6,5,6,6,2,6), 'c' = c(0.1,0.2,0.4,-1, 0.9,0.7))
grp a b c
1 1 1 6 0.1
2 1 2 5 0.2
3 1 1 6 0.4
4 2 1 6 -1.0
5 2 2 2 0.9
6 2 1 6 0.7
I want to count distinct levels of (a,b) for each group where c >= 0.1
I have tried using dplyr for this using group_by & summarise but not getting the desired result
x %>% group_by(grp) %>% summarise(count = n_distinct(c(a,b)[c >= 0.1]))
For the above case I would expect the following result
grp count
<dbl> <int>
1 1 2
2 2 2
However using the above query I am getting the following result
grp count
<dbl> <int>
1 1 4
2 2 3
Logically, the above output seems to count all unique values of the concatenated vector c(a, b), which is not what I need.
Any pointers would be really appreciated.
Here's another way using dplyr. It sounds like you want to filter based on c, so we do that. Instead of using c(a, b) in n_distinct, we can write it as n_distinct(a, b).
x %>%
filter(c >= 0.1) %>%
group_by(grp) %>%
summarise(cnt_d = n_distinct(a, b))
# grp cnt_d
# <dbl> <int>
# 1 1 2
# 2 2 2
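To see why the original attempt returned 3 for grp 2, it helps to replay that group by hand in base R. c(a, b) stacks the two columns into one vector of length 6, and the length-3 logical index is recycled across it, so single values are counted rather than (a, b) pairs. The variable names below are mine:

```r
# Values of group 2 from the question
a <- c(1, 2, 1)
b <- c(6, 2, 6)
c_val <- c(-1.0, 0.9, 0.7)

keep <- c_val >= 0.1              # FALSE TRUE TRUE

# What the original attempt did: stack a and b, then recycle the index
stacked <- c(a, b)[keep]          # picks elements 2, 3, 5, 6
n_stacked <- length(unique(stacked))

# What was intended: distinct (a, b) pairs among the kept rows
pairs <- unique(data.frame(a, b)[keep, ])
n_pairs <- nrow(pairs)

n_stacked
n_pairs
```

Filtering first and passing the columns separately, as in n_distinct(a, b), counts row-wise combinations and so avoids the stacking problem.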
We can paste a and b columns and count distinct values in each group.
library(dplyr)
x %>%
mutate(col = paste(a, b, sep = "_")) %>%
group_by(grp) %>%
summarise(count = n_distinct(col[c >= 0.1]))
# grp count
# <dbl> <int>
#1 1 2
#2 2 2
An option using data.table
library(data.table)
setDT(x)[c >= 0.1, .(cnt_d = uniqueN(paste(a, b))), .(grp)]
# grp cnt_d
#1: 1 2
#2: 2 2

summarise and group_by using two different columns consecutively

I have a dataframe df with three columns a,b,c.
df <- data.frame(a = c('a','b','c','d','e','f','g','e','f','g'),
b = c('X','Y','Z','X','Y','Z','X','X','Y','Z'),
c = c('cat','dog','cat','dog','cat','cat','dog','cat','cat','dog'))
df
# output
a b c
1 a X cat
2 b Y dog
3 c Z cat
4 d X dog
5 e Y cat
6 f Z cat
7 g X dog
8 e X cat
9 f Y cat
10 g Z dog
I have to group_by using the column b followed by summarise using the column c with counts of available values in it.
df %>% group_by(b) %>%
summarise(nCat = sum(c == 'cat'),
nDog = sum(c == 'dog'))
#output
# A tibble: 3 × 3
b nCat nDog
<fctr> <int> <int>
1 X 2 2
2 Y 2 1
3 Z 2 1
However, before doing the above task, I should remove the rows belonging to a value in a which has more than one value in b.
df %>% group_by(a) %>% summarise(count = n())
#output
# A tibble: 7 × 2
a count
<fctr> <int>
1 a 1
2 b 1
3 c 1
4 d 1
5 e 2
6 f 2
7 g 2
For example, in this data frame, all rows with the value e (b values: Y, X), f (b values: Z, Y), or g (b values: X, Z) in column a should be removed.
# Expected output
# A tibble: 3 × 3
b nCat nDog
<fctr> <int> <int>
1 X 1 1
2 Y 0 1
3 Z 1 0
We can group by 'a' and use filter with n_distinct to keep only the 'a' groups that have a single unique value in 'b'. Then, grouped by 'b', we do the summarise:
df %>%
group_by(a) %>%
filter(n_distinct(b)==1) %>%
group_by(b) %>%
summarise(nCat =sum(c=='cat'), nDog = sum(c=='dog'), Total = n())
# A tibble: 3 × 4
# b nCat nDog Total
# <fctr> <int> <int> <int>
#1 X 1 1 2
#2 Y 0 1 1
#3 Z 1 0 1
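The same filtering can be split into two explicit steps, which makes it easier to inspect which values of a are dropped. This is a sketch of an equivalent route, not the answer's own code; the names `multi_b` and `res` are mine.

```r
library(dplyr)

df <- data.frame(
  a = c("a", "b", "c", "d", "e", "f", "g", "e", "f", "g"),
  b = c("X", "Y", "Z", "X", "Y", "Z", "X", "X", "Y", "Z"),
  c = c("cat", "dog", "cat", "dog", "cat", "cat", "dog", "cat", "cat", "dog")
)

# Step 1: find the values of a that occur with more than one value of b
multi_b <- df %>%
  group_by(a) %>%
  summarise(n_b = n_distinct(b)) %>%
  filter(n_b > 1)

# Step 2: drop those rows, then count cats and dogs per b
res <- df %>%
  anti_join(multi_b, by = "a") %>%
  group_by(b) %>%
  summarise(nCat = sum(c == "cat"), nDog = sum(c == "dog"))

multi_b
res
```

The single grouped filter in the answer is more compact; the two-step version is mainly useful when the list of excluded values is needed on its own.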
