distinct cases for two variables by grouping and counting - r

We can use the following data frame as an example:
Case <- c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul", "Hannah", "Herbert")
Procedure <- c("1", "1", "2", "3", "3", "4", "1", "1")
Location <- c("a", "a", "b", "a", "a", "b", "c", "a")
(df <- data.frame(Case, Procedure, Location))
Case Procedure Location
1 Siddhartha 1 a
2 Siddhartha 1 a
3 Siddhartha 2 b
4 Paul 3 a
5 Paul 3 a
6 Paul 4 b
7 Hannah 1 c
8 Herbert 1 a
Now I do the following:
df %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
which gives me:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 3 0 1
2 3 2 0 0
3 2 0 1 0
4 4 0 1 0
This is not exactly what I want, though. What I want is the following data frame:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 2 0 1
2 3 1 0 0
3 2 0 1 0
4 4 0 1 0
Notice the difference in Procedure 1 and 3.
So what I would like is a function that counts the number of DISTINCT cases for each procedure AND each location. The function should also work on varying data frames that contain different (unknown) cases and procedures.
For the original data frame
df %>%
distinct() %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
does not work, since it ignores the "distinct". What works (also for the original data frame!) is the following:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case))
That gives me the following though:
# A tibble: 5 x 3
# Groups: Procedure [4]
Procedure Location Anzahl
<fct> <fct> <int>
1 1 a 2
2 1 c 1
3 2 a 1
4 3 b 1
5 4 b 1
But how do I apply the "pivot_wider" function so that the result is also sorted by location? If I try to add it, I get the following error:
"Error: This tidyselect interface doesn't support predicates yet.
i Contact the package author and suggest using eval_select()."
It is also very confusing to me why Ronak's solution works for the example data frame but not for the original. I can't spot any important differences between the two data frames.
Regards

You can do it with a single call to pivot_wider by taking advantage of the values_fn argument, which applies a function to the values:
df %>%
pivot_wider(names_from = Location,
values_from = Case,
values_fn = list(Case = n_distinct),
values_fill = list(Case = 0))
which gives,
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 2 0 1
2 2 0 1 0
3 3 1 0 0
4 4 0 1 0

A simple fix is to apply distinct() (or unique()) before counting:
library(dplyr)
library(tidyr)
df %>%
distinct() %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
# A tibble: 4 x 4
# Procedure a b c
# <chr> <int> <int> <int>
#1 1 2 0 1
#2 3 1 0 0
#3 2 0 1 0
#4 4 0 1 0
For the OP's data, they need:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
pivot_wider(names_from = Location, values_from = Anzahl,
values_fill = list(Anzahl = 0))
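One plausible reason why a bare distinct() has no effect on the original data is that it compares whole rows, so any extra column that differs between repeated entries keeps them apart. The sketch below restricts the de-duplication to the three relevant columns. The function name count_distinct_wide and the extra Visit column are hypothetical and only serve as an illustration; the column names Case, Procedure and Location are assumed to exist in the data.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: count distinct cases per Procedure and Location,
# with one column per Location. Assumes columns Case, Procedure, Location.
count_distinct_wide <- function(data) {
  data %>%
    distinct(Case, Procedure, Location) %>%   # ignore all other columns
    count(Location, Procedure) %>%
    pivot_wider(names_from = Location, values_from = n,
                values_fill = list(n = 0)) %>%
    arrange(Procedure)
}

# Example data from the question plus an invented Visit column that makes
# every row unique, so distinct() on whole rows would remove nothing
df_extra <- data.frame(
  Case = c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul",
           "Hannah", "Herbert"),
  Procedure = c("1", "1", "2", "3", "3", "4", "1", "1"),
  Location = c("a", "a", "b", "a", "a", "b", "c", "a"),
  Visit = 1:8
)

res <- count_distinct_wide(df_extra)
print(res)
```

Because the helper selects its columns by name, it should behave the same on data frames with other cases and procedures, as long as those three columns are present.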

Related

Why does this dplyr group function give strange results?

When I run the below reproducible code I get the desired grouping results in the GroupRank column shown immediately beneath:
library(dplyr)
myData <-
data.frame(
Element = c("A","A","B","A","C","C"),
Group = c(0,0,0,0,1,1)
)
myDataGroups <- myData %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(ElementCnt = row_number()) %>%
ungroup() %>%
mutate(Group = factor(Group, unique(Group))) %>%
arrange(Group) %>%
mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
group_by(Group) %>%
mutate(GroupRank = ElementCnt - max(0L,groupCt),
GroupRank = if_else(as.character(Group) == "0", ElementCnt, min(GroupRank))
)%>%
ungroup() %>%
arrange(origOrder)
myDataGroups
> myDataGroups
# A tibble: 6 x 6
Element Group origOrder ElementCnt groupCt GroupRank
<chr> <fct> <int> <int> <int> <int>
1 A 0 1 1 -1 1
2 A 0 2 2 -1 2
3 B 0 3 1 -1 1
4 A 0 4 3 -1 3
5 C 1 5 1 0 1
6 C 1 6 2 0 1
However, when I take the line GroupRank = if_else(as.character(Group) == "0", ElementCnt, min(GroupRank)) from the above code and simply wrap it in a max function, like this: GroupRank = max(1L, if_else(as.character(Group) == "0", ElementCnt, min(GroupRank))) (I ran it with both 1 and 1L and got the same results), I get the strange output shown below. GroupRank shouldn't have changed from the above output:
Element Group origOrder ElementCnt groupCt GroupRank
<chr> <fct> <int> <int> <int> <int>
1 A 0 1 1 -1 3
2 A 0 2 2 -1 3
3 B 0 3 1 -1 3
4 A 0 4 3 -1 3
5 C 1 5 1 0 1
6 C 1 6 2 0 1
What am I doing wrong here? Am I using max() incorrectly?
Note the difference between max() and pmax().
max(1:5, 5:1)
#> [1] 5
pmax(1:5, 5:1)
#> [1] 5 4 3 4 5
max() returns a scalar, which is why you get a constant value per group. pmax() does what you apparently expect, which is to return the element-wise (parallel) maximum as a vector.
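To see where the constant 3 comes from, it can help to apply both functions to the ElementCnt values of Group 0 from the question (1, 2, 1, 3). This is a minimal base R sketch; the variable names are illustrative and not part of the original code.

```r
# ElementCnt values for the rows with Group == "0" in the question
element_cnt <- c(1L, 2L, 1L, 3L)

# max() pools all of its arguments and returns a single number,
# which mutate() then recycles across every row of the group
collapsed <- max(1L, element_cnt)

# pmax() compares position by position, recycling the scalar 1L
elementwise <- pmax(1L, element_cnt)

print(collapsed)
print(elementwise)
```

In the question's pipeline, the corresponding change would presumably be GroupRank = pmax(1L, if_else(...)), which keeps one value per row.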

tidyverse do a pivot_wider with two different reshaping strategies (creating categorical and binary columns)

Using the following data:
df <- data.frame(id = c("A", "B", "C", "A", "B", "A"),
value = c(1, 2, 3, 4, 5, 6))
I want to pivot_wider this data so that the reshaping creates two different sets of columns:
One set where I create a bunch of binary columns that take the column names from the value columns (e.g. bin_1, bin_2 and so on) and that are coded as 0/1.
An additional set where I create as many necessary columns to store the values in a "categorical" way. Here, id "A" has three values, so I want to create three columns cat_1, cat_2, cat_3 and for IDs B and C I want to fill them up with NAs if there's no value.
Now, I know how to create these two things separately from each other and merge them afterwards via a left_join.
However, my question is: can it be done in one pipeline, where I do two subsequent pivot_widers? I tried, but it doesn't work (obviously because my way of copying the value column and then trying to use one copy for the binary reshape and one for the categorical reshape is wrong).
Any ideas?
Code so far that works:
df1 <- df %>%
group_by(id) %>%
mutate(group_id = 1:n()) %>%
ungroup() %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value)
df2 <- df %>%
mutate(dummy = 1) %>%
arrange(value) %>%
pivot_wider(names_from = value,
names_prefix = "bin_",
values_from = dummy,
values_fill = list(dummy = 0),
values_fn = list(dummy = length))
df <- df1 %>%
left_join(., df2, by = "id")
Expected output:
# A tibble: 3 x 10
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0
With the addition of purrr, you could do:
map(.x = reduce(range(df$value), `:`),
~ df %>%
group_by(id) %>%
mutate(!!paste0("bin_", .x) := as.numeric(.x %in% value))) %>%
reduce(full_join) %>%
mutate(cats = paste0("cat_", row_number())) %>%
pivot_wider(names_from = "cats",
values_from = "value")
id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 1 0 0 1 0 1 1 4 6
2 B 0 1 0 0 1 0 2 5 NA
3 C 0 0 1 0 0 0 3 NA NA
In base you can try:
tt <- unstack(df[2:1])
x <- cbind(t(sapply(tt, "[", seq_len(max(lengths(tt))))),
t(+sapply(names(tt), "%in%", x=df$id)))
colnames(x) <- c(paste0("cat_", seq_len(max(lengths(tt)))),
paste0("bin_", seq_len(nrow(df))))
x
# cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
#A 1 4 6 1 0 0 1 0 1
#B 2 5 NA 0 1 0 0 1 0
#C 3 NA NA 0 0 1 0 0 0
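The base approach above names the bin_ columns by row position, which coincides with the values only because value happens to be 1 to 6 here. The following is a hedged variant that keys the binary columns on the distinct values instead, built step by step with split(). The intermediate names (vals, cat_part, bin_part) are illustrative, and this is a sketch rather than a drop-in replacement.

```r
df <- data.frame(id = c("A", "B", "C", "A", "B", "A"),
                 value = c(1, 2, 3, 4, 5, 6))

# One vector of values per id: A = 1 4 6, B = 2 5, C = 3
vals <- split(df$value, df$id)

# "Categorical" part: pad each vector to the longest length;
# indexing past the end of a vector yields NA
n_max <- max(lengths(vals))
cat_part <- t(sapply(vals, `[`, seq_len(n_max)))
colnames(cat_part) <- paste0("cat_", seq_len(n_max))

# Binary part: one 0/1 column per distinct value
all_values <- sort(unique(df$value))
bin_part <- t(sapply(vals, function(v) as.integer(all_values %in% v)))
colnames(bin_part) <- paste0("bin_", all_values)

x <- cbind(cat_part, bin_part)
print(x)
```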
This slightly modifies your approach by shortening the df2 code and putting everything in one pipe, using the list and . trick, which lets you work on two versions of df in the same call.
It's not much of an improvement on what you have done, but it is now all in one call. I can't think of a way to do it without a merge/join.
library(tidyverse)
df %>%
list(
pivot_wider(., id_cols = id,
names_from = value,
names_prefix = "bin_") %>%
mutate_if(is.numeric, ~ +(!is.na(.))), #convert to binary
group_by(., id) %>%
mutate(group_id = 1:n()) %>%
ungroup() %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value)
) %>%
.[c(2:3)] %>%
reduce(left_join)
# id bin_1 bin_2 bin_3 bin_4 bin_5 bin_6 cat_1 cat_2 cat_3
# <chr> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
# 1 A 1 0 0 1 0 1 1 4 6
# 2 B 0 1 0 0 1 0 2 5 NA
# 3 C 0 0 1 0 0 0 3 NA NA
You can even combine both of your snippets into one pipeline without creating any intermediate objects:
df %>%
group_by(id) %>%
mutate(group_id = row_number()) %>%
pivot_wider(names_from = group_id,
names_prefix = "cat_",
values_from = value) %>%
left_join(df %>%
mutate(dummy = 1) %>%
arrange(value) %>%
pivot_wider(names_from = value,
names_prefix = "bin_",
values_from = dummy,
values_fill = list(dummy = 0),
values_fn = list(dummy = length)),
by = "id")
# A tibble: 3 x 10
# Groups: id [3]
id cat_1 cat_2 cat_3 bin_1 bin_2 bin_3 bin_4 bin_5 bin_6
<chr> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 A 1 4 6 1 0 0 1 0 1
2 B 2 5 NA 0 1 0 0 1 0
3 C 3 NA NA 0 0 1 0 0 0

add columns to data frames for values in existing column [duplicate]

This question already has answers here:
distinct cases for two variables by grouping and counting
(2 answers)
Closed 2 years ago.
The data frame:
Case <- c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul", "Hannah", "Herbert", "Herbert")
Procedure <- c("1", "1", "2", "3", "3", "4", "1", "1", "1")
Location <- c("a", "a", "a", "b", "b", "b", "c", "a", "a")
(df <- data.frame(Case, Procedure, Location))
Case Procedure Location
1 Siddhartha 1 a
2 Siddhartha 1 a
3 Siddhartha 2 a
4 Paul 3 b
5 Paul 3 b
6 Paul 4 b
7 Hannah 1 c
8 Herbert 1 a
9 Herbert 1 a
The function:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
arrange(desc(Anzahl))
The result:
Procedure Location Anzahl
<fct> <fct> <int>
1 1 a 2
2 1 c 1
3 2 a 1
4 3 b 1
5 4 b 1
What I need:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 2 0 1
2 2 1 0 0
3 3 0 1 0
4 4 0 1 0
So I want to sort the data frame by procedures AND locations. This is what I tried:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
But: Error: This tidyselect interface doesn't support predicates yet.
i Contact the package author and suggest using eval_select().
I tried to solve this problem in other questions I asked before (it almost feels like spamming at this point), but I can't apply the solutions to the original data frame. The function shown above (group_by, summarise) is the one that also works for the original. The only thing is that it doesn't sort by location.
Regards
This should work:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
arrange(Location, desc(Anzahl)) %>%
pivot_wider(names_from = Location, values_from = Anzahl, values_fill = list(Anzahl = 0))
Which gives us:
Procedure a b c
<chr> <int> <int> <int>
1 1 2 0 1
2 2 1 0 0
3 3 0 1 0
4 4 0 1 0
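A likely explanation for the error in the question: the summarised data has no column called n (the count column is Anzahl), so values_from = n appears to resolve to the function dplyr::n rather than to a column, and tidyselect rejects it as a predicate. As an independent cross-check of the expected counts, here is a base R sketch that drops repeated rows and cross-tabulates the rest. It assumes the data frame contains exactly the three columns shown in the question.

```r
Case <- c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul",
          "Hannah", "Herbert", "Herbert")
Procedure <- c("1", "1", "2", "3", "3", "4", "1", "1", "1")
Location <- c("a", "a", "a", "b", "b", "b", "c", "a", "a")
df <- data.frame(Case, Procedure, Location)

# Drop repeated (Case, Procedure, Location) rows, then count the
# remaining rows per Procedure and Location
distinct_counts <- with(unique(df), table(Procedure, Location))
print(distinct_counts)
```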

Re-Framing datasets

Hi all, I have two datasets, shown below. dataset1 is derived from dataset2: it holds the number of user entries per app in dataset2. From these two datasets, can we build a third dataset (the expected output)?
dataset1
Apps # User Entries
A 3
B 4
C 6
dataset2
Apps Users
A X
A Y
A Z
B Y
B Y
B Z
B A
C X
C X
C X
C X
C X
C X
Expected output
Apps Entries X Y Z A
A    3       1 1 1
B    4         2 1 1
C    6       6
We can first count by Apps and Users, reshape the data to wide format, and then join it with the table of counts per App.
library(dplyr)
df %>%
count(Apps, Users) %>%
tidyr::pivot_wider(names_from = Users, values_from = n,
values_fill = list(n = 0)) %>%
left_join(df %>% count(Apps), by = 'Apps')
# Apps X Y Z A n
# <chr> <int> <int> <int> <int> <int>
#1 A 1 1 1 0 3
#2 B 0 2 1 1 4
#3 C 6 0 0 0 6
If showing 0 is no problem and a different column order is acceptable, you can use table and rowSums to produce the expected output.
x <- table(dataset2)
cbind(Entries=rowSums(x), x)
# Entries A X Y Z
#A 3 0 1 1 1
#B 4 1 0 2 1
#C 6 0 6 0 0
A solution where you need not have to calculate Total separately and do joins...
This solution uses purrr::pmap and dplyr::mutate for dynamically calculating Total.
library(tidyverse) # dplyr, tidyr, purrr
df %>% count(Apps, Users) %>%
pivot_wider(id_cols = Apps, names_from = Users, values_from = n, values_fill = list(n = 0)) %>%
mutate(Total = pmap_int(.l = select_if(., is.numeric),
.f = sum))
which produces the output you need:
# A tibble: 3 x 6
Apps X Y Z A Total
<chr> <int> <int> <int> <int> <int>
1 A 1 1 1 0 3
2 B 0 2 1 1 4
3 C 6 0 0 0 6
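The answers above assume that dataset2 already exists as a data frame (called df in two of them). For a reproducible check, the sketch below rebuilds it from the listing in the question and applies the table/rowSums idea, with the user columns reordered to match the expected output. The variable names are illustrative.

```r
# dataset2 as listed in the question
dataset2 <- data.frame(
  Apps  = c("A", "A", "A", "B", "B", "B", "B", rep("C", 6)),
  Users = c("X", "Y", "Z", "Y", "Y", "Z", "A", rep("X", 6))
)

# Cross-tabulate Apps by Users; the row sums reproduce dataset1
x <- table(dataset2)
result <- cbind(Entries = rowSums(x), x[, c("X", "Y", "Z", "A")])
print(result)
```

The zeros could be blanked out afterwards for display, but keeping them numeric is usually more convenient for further processing.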

pivot_wider when there's no value column

I'm trying to reshape a dataset from long to wide. The following code works, but I'm curious if there's a way not to provide a value column and still use pivot_wider. In the following example, I have to create a temporary column "val" to use pivot_wider, but is there a way I can do it without it?
a <- data.frame(name = c("sam", "rob", "tom"),
type = c("a", "b", "c"))
a
name type
1 sam a
2 rob b
3 tom c
I want to convert it as the following.
name a b c
1 sam 1 0 0
2 rob 0 1 0
3 tom 0 0 1
This can be done by the following code, but can I do it without creating "val" column (and still using tidyverse language)?
a <- data.frame(name = c("sam", "rob", "tom"),
type = c("a", "b", "c"),
val = rep(1, 3)) %>%
pivot_wider(names_from = type, values_from = val, values_fill = list(val = 0))
You can use the values_fn argument to assign 1 and values_fill to assign 0:
library(tidyr)
pivot_wider(a, names_from = type, values_from = type, values_fn = ~1, values_fill = 0)
# A tibble: 3 x 4
name a b c
<fct> <dbl> <dbl> <dbl>
1 sam 1 0 0
2 rob 0 1 0
3 tom 0 0 1
We can mutate with a column of 1s and use that in pivot_wider
library(dplyr)
library(tidyr)
a %>%
mutate(n = 1) %>%
pivot_wider(names_from = type, values_from = n, values_fill = list(n = 0))
# A tibble: 3 x 4
# name a b c
# <fct> <dbl> <dbl> <dbl>
#1 sam 1 0 0
#2 rob 0 1 0
#3 tom 0 0 1
In base R, it would be easier:
table(a)
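table(a) returns a contingency table rather than a data frame. If a data frame with a name column is needed, one possible follow-up is sketched below. Note that the rows come out sorted by name, not in the original order, and the variable names tab and wide are illustrative.

```r
a <- data.frame(name = c("sam", "rob", "tom"),
                type = c("a", "b", "c"))

# Cross-tabulate name by type, then turn the table into a data frame
tab <- table(a)
wide <- as.data.frame.matrix(tab)

# Move the row names into a regular column and put it first
wide$name <- rownames(wide)
rownames(wide) <- NULL
wide <- wide[c("name", "a", "b", "c")]
print(wide)
```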
Going older school, reshape2::dcast, or the thriving data.table::dcast, let you do this by specifying an aggregate function:
reshape2::dcast(a, name ~ type, fun.aggregate = length)
# name a b c
# 1 rob 0 1 0
# 2 sam 1 0 0
# 3 tom 0 0 1
data.table::dcast(data.table::setDT(a), name ~ type, fun.aggregate = length)
# name a b c
# 1: rob 0 1 0
# 2: sam 1 0 0
# 3: tom 0 0 1

Resources