summarise and group_by using two different columns consecutively - r

I have a dataframe df with three columns a,b,c.
df <- data.frame(a = c('a','b','c','d','e','f','g','e','f','g'),
b = c('X','Y','Z','X','Y','Z','X','X','Y','Z'),
c = c('cat','dog','cat','dog','cat','cat','dog','cat','cat','dog'))
df
# output
a b c
1 a X cat
2 b Y dog
3 c Z cat
4 d X dog
5 e Y cat
6 f Z cat
7 g X dog
8 e X cat
9 f Y cat
10 g Z dog
I need to group by column b and then summarise column c, counting how many times each of its values occurs.
df %>% group_by(b) %>%
summarise(nCat = sum(c == 'cat'),
nDog = sum(c == 'dog'))
#output
# A tibble: 3 × 3
b nCat nDog
<fctr> <int> <int>
1 X 2 2
2 Y 2 1
3 Z 2 1
However, before doing this, I need to remove the rows for any value of a that is associated with more than one value of b.
df %>% group_by(a) %>% summarise(count = n())
#output
# A tibble: 7 × 2
a count
<fctr> <int>
1 a 1
2 b 1
3 c 1
4 d 1
5 e 2
6 f 2
7 g 2
For example, in this data frame, all rows with e (values: Y, X), f (values: Z, Y) or g (values: X, Z) in column a should be removed.
# Expected output
# A tibble: 3 × 3
b nCat nDog
<fctr> <int> <int>
1 X 1 1
2 Y 0 1
3 Z 1 0

We can use filter with n_distinct to keep only the 'a' groups that have a single unique value in 'b'; then, grouped by 'b', we do the summarise.
df %>%
group_by(a) %>%
filter(n_distinct(b)==1) %>%
group_by(b) %>%
summarise(nCat =sum(c=='cat'), nDog = sum(c=='dog'), Total = n())
# A tibble: 3 × 4
# b nCat nDog Total
# <fctr> <int> <int> <int>
#1 X 1 1 2
#2 Y 0 1 1
#3 Z 1 0 1
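To see why n_distinct(b) rather than n() is the relevant test, here is a minimal sketch, assuming dplyr is installed. The names b_per_a and kept are made up for this illustration. In this data the row count and the distinct count happen to coincide, but n() would also flag a value of a that repeats with the same b.

```r
library(dplyr)

df <- data.frame(a = c('a','b','c','d','e','f','g','e','f','g'),
                 b = c('X','Y','Z','X','Y','Z','X','X','Y','Z'),
                 c = c('cat','dog','cat','dog','cat','cat','dog','cat','cat','dog'),
                 stringsAsFactors = FALSE)

# Number of rows and number of distinct b values for each a
b_per_a <- df %>%
  group_by(a) %>%
  summarise(rows = n(), n_b = n_distinct(b))

# Rows that survive the filter: only a values tied to a single b
kept <- df %>%
  group_by(a) %>%
  filter(n_distinct(b) == 1) %>%
  ungroup()

print(b_per_a)
print(kept)
```

Only the rows for a, b, c and d remain, which is what the final summarise by b then counts.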

Related

R data imputation from group_by table based on count

group = c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep = c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test = data.frame(group, animal, sleep)
print(test)
group_animal = test %>% group_by(`group`, `animal`) %>% count(sleep)
print(group_animal)
I would like to replace the NA values in the test df's sleep column by the highest count of sleep answer based on group and animal.
Such that Group 1, Animal a with NAs in the sleep column should have a sleep value of 'y' because that is the value with the highest count among Group 1 Animal a.
Group 4 animal c with NAs for sleep should have 'n' as the sleep value as well.
Another option is replacing the NAs with the Mode. You can use the Mode function from this post in the na.aggregate function from zoo to replace these NAs like this:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
group = c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep = c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test = data.frame(group, animal, sleep)
library(dplyr)
library(zoo)
test %>%
group_by(group, animal) %>%
mutate(sleep = na.aggregate(sleep , FUN=Mode)) %>%
ungroup()
#> # A tibble: 24 × 3
#> group animal sleep
#> <dbl> <chr> <chr>
#> 1 1 a y
#> 2 1 b n
#> 3 4 c y
#> 4 4 c y
#> 5 4 d y
#> 6 5 a n
#> 7 5 b n
#> 8 6 c y
#> 9 1 b n
#> 10 4 d y
#> # … with 14 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Created on 2022-07-26 by the reprex package (v2.0.1)
Here is tail of output:
> tail(test)
# A tibble: 6 × 3
group animal sleep
<dbl> <chr> <chr>
1 4 c n
2 4 c n
3 1 a y
4 4 c n
5 5 a n
6 6 c y
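If adding zoo is not desirable, the same mode-based fill can probably be written with dplyr alone. The sketch below is an assumption-laden variant, not the answer's own code: it reuses the Mode helper, and it assumes every group that contains an NA also has at least one non-missing value (true for this data). The name filled is hypothetical.

```r
library(dplyr)

# Most frequent value; ties go to the value that appears first
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

group  <- c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal <- c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep  <- c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test <- data.frame(group, animal, sleep, stringsAsFactors = FALSE)

filled <- test %>%
  group_by(group, animal) %>%
  # Replace NAs with the mode of the non-missing values in the same group
  mutate(sleep = replace(sleep, is.na(sleep), Mode(sleep[!is.na(sleep)]))) %>%
  ungroup()

tail(filled)
```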
Update: now with group_by(group, animal), thanks @Quinten; removed prior answer:
Group by group and animal.
Use replace_na with the replace argument as sleep[n == max(n)].
New: in case of ties, like in group 5, add !is.na(sleep) to avoid conflicts:
library(dplyr)
library(tidyr)
group_animal %>%
group_by(group, animal) %>%
arrange(desc(sleep), .by_group = TRUE) %>%
mutate(sleep = replace_na(sleep, sleep[n==max(n) & !is.na(sleep)]))
group animal sleep n
<dbl> <chr> <chr> <int>
1 1 a y 3
2 1 a n 1
3 1 a m 1
4 1 a y 1
5 1 b n 2
6 4 c y 2
7 4 c n 4
8 4 c n 1
9 4 d y 2
10 5 a n 1
11 5 a n 1
12 5 b n 1
13 6 c y 2
14 6 c n 1
15 6 c y 1
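The result above is still the count table, not the original test data. One possible way to carry the idea back to test is to pick the top answer per group from the counts and join it on. This is a sketch under assumptions: the names top_answer, top_sleep and filled are made up, and slice_max with with_ties = FALSE (dplyr 1.0 or later) breaks ties by taking the first row of the count table.

```r
library(dplyr)

group  <- c(1,1,4,4,4,5,5,6,1,4,6,1,1,1,1,6,4,4,4,4,1,4,5,6)
animal <- c('a','b','c','c','d','a','b','c','b','d','c','a','a','a','a','c','c','c','c','c','a','c','a','c')
sleep  <- c('y','n','y','y','y','n','n','y','n','y','n','y','y','n','m','y','n','n','n','n',NA, NA, NA, NA)
test <- data.frame(group, animal, sleep, stringsAsFactors = FALSE)

# Most frequent non-missing answer per (group, animal)
top_answer <- test %>%
  filter(!is.na(sleep)) %>%
  count(group, animal, sleep) %>%
  group_by(group, animal) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(group, animal, top_sleep = sleep)

# Join it back and fill only the missing entries
filled <- test %>%
  left_join(top_answer, by = c("group", "animal")) %>%
  mutate(sleep = coalesce(sleep, top_sleep)) %>%
  select(-top_sleep)

tail(filled)
```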
Try this.
This method essentially creates a custom value to coalesce with sleep: it subsets sleep at the position of the highest count obtained from str_count (which.max returns that position and skips NA counts).
library(dplyr)
test |>
group_by(group, animal) |>
mutate(sleep = coalesce(sleep, sleep[which.max(stringr::str_count(paste(sleep, collapse = ""), pattern = sleep))])) |>
ungroup()
group animal sleep
1 1 a y
2 1 b n
3 4 c y
4 4 c y
5 4 d y
6 5 a n
7 5 b n
8 6 c y
9 1 b n
10 4 d y
11 6 c n
12 1 a y
13 1 a y
14 1 a n
15 1 a m
16 6 c y
17 4 c n
18 4 c n
19 4 c n
20 4 c n
21 1 a y
22 4 c n
23 5 a n
24 6 c y

Re-Framing datasets

Hi all, I have two datasets below. dataset1 is derived from dataset2: it is the count of user entries per app in dataset2. From these two datasets, can we build a third dataset (the expected output)?
dataset1
Apps # User Entries
A 3
B 4
C 6
dataset2
Apps Users
A X
A Y
A Z
B Y
B Y
B Z
B A
C X
C X
C X
C X
C X
C X
Expected output
Apps Entries X Y Z A
A    3       1 1 1
B    4         2 1 1
C    6       6
We can first count by Apps and Users, reshape the data to wide format, and then join with the table of counts per Apps (df here is dataset2).
library(dplyr)
df %>%
count(Apps, Users) %>%
tidyr::pivot_wider(names_from = Users, values_from = n,
values_fill = list(n = 0)) %>%
left_join(df %>% count(Apps), by = 'Apps')
# Apps X Y Z A n
# <chr> <int> <int> <int> <int> <int>
#1 A 1 1 1 0 3
#2 B 0 2 1 1 4
#3 C 6 0 0 0 6
If showing 0 is no problem and a different column order is acceptable, you can use table and rowSums to produce the expected output.
x <- table(dataset2)
cbind(Entries=rowSums(x), x)
# Entries A X Y Z
#A 3 0 1 1 1
#B 4 1 0 2 1
#C 6 0 6 0 0
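Since dataset2 is only shown as printed text, here is a self-contained base R sketch of the table approach. The data frame construction is an assumption based on the printed rows, and the final column reordering (to match the expected output) and the name res are additions for illustration.

```r
# dataset2 rebuilt from the printed rows in the question (assumed layout)
dataset2 <- data.frame(
  Apps  = c("A","A","A","B","B","B","B","C","C","C","C","C","C"),
  Users = c("X","Y","Z","Y","Y","Z","A","X","X","X","X","X","X"),
  stringsAsFactors = FALSE
)

# Cross-tabulate Apps by Users, then prepend the row totals
x <- table(dataset2)
res <- cbind(Entries = rowSums(x), x)

# Optional: put the user columns in the order of the expected output
res <- res[, c("Entries", "X", "Y", "Z", "A")]
print(res)
```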
A solution where you do not have to calculate Total separately and do joins.
This solution uses purrr::pmap_int and dplyr::mutate to calculate Total dynamically.
library(tidyverse) # dplyr, tidyr, purrr
df %>% count(Apps, Users) %>%
pivot_wider(id_cols = Apps, names_from = Users, values_from = n, values_fill = list(n = 0)) %>%
mutate(Total = pmap_int(.l = select_if(., is.numeric),
.f = sum))
which gives the output you need
# A tibble: 3 x 6
Apps X Y Z A Total
<chr> <int> <int> <int> <int> <int>
1 A 1 1 1 0 3
2 B 0 2 1 1 4
3 C 6 0 0 0 6
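A row-wise total can likely also be computed without purrr, using rowSums over the numeric columns. The sketch below assumes dplyr 1.0 or later (for across and where) and tidyr 1.1 or later (for a scalar values_fill). The dataset2 construction is rebuilt from the printed rows, and the name wide is hypothetical.

```r
library(dplyr)
library(tidyr)

# dataset2 rebuilt from the printed rows in the question (assumed layout)
dataset2 <- data.frame(
  Apps  = c("A","A","A","B","B","B","B","C","C","C","C","C","C"),
  Users = c("X","Y","Z","Y","Y","Z","A","X","X","X","X","X","X"),
  stringsAsFactors = FALSE
)

wide <- dataset2 %>%
  count(Apps, Users) %>%
  pivot_wider(names_from = Users, values_from = n, values_fill = 0) %>%
  # Sum every numeric column in each row; Total does not exist yet, so it is not included
  mutate(Total = rowSums(across(where(is.numeric))))

print(wide)
```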

Add original values for columns after group by

For the data frame below I want to add the original values for VAR_X after a group_by on ID and event and a max() on quest, but I cannot get my code right. Any suggestions? By the way, in my original data frame more than one column needs to be added.
df <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,3,3,3),
quest = c(1,1,2,2,3,3,1,2,3,1,2,3),
event = c("A","B","A","B","A",NA,"C","D","C","D","D",NA),
VAR_X = c(2,4,3,6,3,NA,6,4,5,7,5,NA))
Code:
df %>%
group_by(ID,event) %>%
summarise(quest = max(quest))
Desired output:
ID quest event VAR_X
1 1 2 B 6
2 1 3 A 3
3 2 2 D 4
4 2 3 C 5
5 3 2 D 5
Start by omitting the NA values, and at the end do an inner_join with the original data set.
df %>%
na.omit() %>%
group_by(ID, event) %>%
summarise(quest = max(quest)) %>%
inner_join(df, by = c("ID", "event", "quest"))
## A tibble: 5 x 4
## Groups: ID [3]
# ID event quest VAR_X
# <dbl> <fct> <dbl> <dbl>
#1 1 A 3 3
#2 1 B 2 6
#3 2 C 3 5
#4 2 D 2 4
#5 3 D 2 5
df %>%
drop_na() %>% # remove if necessary ..
group_by(ID, event) %>%
filter(quest == max(quest)) %>%
ungroup()
# A tibble: 5 x 4
# ID quest event VAR_X
#<dbl> <dbl> <chr> <dbl>
# 1 1 2 B 6
# 2 1 3 A 3
# 3 2 2 D 4
# 4 2 3 C 5
# 5 3 2 D 5
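Because the filter approach keeps whole rows, any number of extra columns comes along for free. A comparable sketch with slice_max is shown below, assuming dplyr 1.0 or later. The NA rows are dropped with a plain filter instead of drop_na, the final arrange is only there to make the row order explicit, and the name res is made up.

```r
library(dplyr)

df <- data.frame(ID = c(1,1,1,1,1,1,2,2,2,3,3,3),
                 quest = c(1,1,2,2,3,3,1,2,3,1,2,3),
                 event = c("A","B","A","B","A",NA,"C","D","C","D","D",NA),
                 VAR_X = c(2,4,3,6,3,NA,6,4,5,7,5,NA),
                 stringsAsFactors = FALSE)

res <- df %>%
  filter(!is.na(event)) %>%          # drop rows without an event
  group_by(ID, event) %>%
  slice_max(quest, n = 1) %>%        # keep the row(s) with the largest quest per group
  ungroup() %>%
  arrange(ID, event)

print(res)
```

Like filter(quest == max(quest)), slice_max keeps tied rows by default.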

dplyr mutate: create column using first occurrence of another column

I was wondering if there's a more elegant way of taking a dataframe, grouping by x to see how many x's occur in the dataset, then mutating to find the first occurrence of every x (y)
test <- data.frame(x = c("a", "b", "c", "d",
"c", "b", "e", "f", "g"),
y = c(1,1,1,1,2,2,2,2,2))
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 c 2
6 b 2
7 e 2
8 f 2
9 g 2
Current Output
output <- test %>%
group_by(x) %>%
summarise(count = n())
x count
<fct> <int>
1 a 1
2 b 2
3 c 2
4 d 1
5 e 1
6 f 1
7 g 1
Desired Output
x count first_seen
<fct> <int> <dbl>
1 a 1 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
6 f 1 2
7 g 1 2
I can filter the test dataframe for the first occurrences then use a left_join but was hoping there's a more elegant solution using mutate?
# filter for first occurrences of y
right <- test %>%
group_by(x) %>%
filter(y == min(y)) %>%
slice(1) %>%
ungroup()
# bind to the output dataframe
left_join(output, right, by = "x")
We can use first after grouping by 'x' to create a new column, use that also in group_by and get the count with n()
library(dplyr)
test %>%
group_by(x) %>%
group_by(first_seen = first(y), .add = TRUE) %>%  # .add replaces the deprecated add argument (dplyr >= 1.0)
summarise(count = n())
# A tibble: 7 x 3
# Groups: x [7]
# x first_seen count
# <fct> <dbl> <int>
#1 a 1 1
#2 b 1 2
#3 c 1 2
#4 d 1 1
#5 e 2 1
#6 f 2 1
#7 g 2 1
I have a question: why not keep it simple? For example:
test %>%
group_by(x) %>%
summarise(
count = n(),
first_seen = first(y)
)
#> # A tibble: 7 x 3
#> x count first_seen
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 b 2 1
#> 3 c 2 1
#> 4 d 1 1
#> 5 e 1 2
#> 6 f 1 2
#> 7 g 1 2
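One caveat worth illustrating: first(y) depends on row order, whereas min(y) (which the question's own join code uses) does not. The sketch below reverses the rows to show the difference. The names rev_test, by_first and by_min are hypothetical.

```r
library(dplyr)

test <- data.frame(x = c("a", "b", "c", "d", "c", "b", "e", "f", "g"),
                   y = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                   stringsAsFactors = FALSE)

# Same data with the rows in reverse order
rev_test <- test[nrow(test):1, ]

# first() takes whatever row comes first within each group
by_first <- rev_test %>%
  group_by(x) %>%
  summarise(count = n(), first_seen = first(y))

# min() gives the smallest y regardless of row order
by_min <- rev_test %>%
  group_by(x) %>%
  summarise(count = n(), first_seen = min(y))

print(by_first)
print(by_min)
```

If the data are not guaranteed to be sorted by y, min(y) or an explicit arrange before first(y) is the safer choice.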

How to append a sequential count of a column into a new column from a grouped column using dplyr

I have the following data frame:
library(tidyverse)
dat <- data.frame(foo=c(1, 1, 2, 3, 3, 3), bar=c('a', 'a', 'b', 'b', 'c', 'd'))
dat
#> foo bar
#> 1 1 a
#> 2 1 a
#> 3 2 b
#> 4 3 b
#> 5 3 c
#> 6 3 d
What I want to do is create a new column in which each bar value is tagged with its sequential count within its group, resulting in:
foo bar new_column
1 a a.sample.1
1 a a.sample.2
2 b b.sample.1
3 b b.sample.2
3 c c.sample.1
3 d d.sample.1
I'm stuck with this code:
> dat %>% group_by(bar) %>% summarise(n=n())
# A tibble: 4 x 2
bar n
<fctr> <int>
1 a 2
2 b 2
3 c 1
4 d 1
You can use group_by %>% mutate:
dat %>% group_by(bar) %>% mutate(new_column = paste(bar, 'sample', 1:n(), sep = "."))
# A tibble: 6 x 3
# Groups: bar [4]
# foo bar new_column
# <dbl> <fctr> <chr>
#1 1 a a.sample.1
#2 1 a a.sample.2
#3 2 b b.sample.1
#4 3 b b.sample.2
#5 3 c c.sample.1
#6 3 d d.sample.1
dat %>% group_by(bar) %>% mutate(new_column = paste0(bar, '.', 'sample.', row_number()))
# A tibble: 6 x 3
# Groups: bar [4]
foo bar new_column
<dbl> <fctr> <chr>
1 1 a a.sample.1
2 1 a a.sample.2
3 2 b b.sample.1
4 3 b b.sample.2
5 3 c c.sample.1
6 3 d d.sample.1
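For comparison, the same running count can be sketched in base R with ave, which applies a function within each group and returns a vector aligned with the original rows. This is an alternative illustration, not part of the answers above.

```r
dat <- data.frame(foo = c(1, 1, 2, 3, 3, 3),
                  bar = c('a', 'a', 'b', 'b', 'c', 'd'),
                  stringsAsFactors = FALSE)

# Running index within each bar group, in original row order
idx <- ave(seq_along(dat$bar), dat$bar, FUN = seq_along)

dat$new_column <- paste(dat$bar, "sample", idx, sep = ".")
print(dat)
```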

Resources