add columns to data frames for values in existing column [duplicate]

This question already has answers here:
distinct cases for two variables by grouping and counting
(2 answers)
Closed 2 years ago.
The data frame:
Case <- c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul", "Hannah", "Herbert", "Herbert")
Procedure <- c("1", "1", "2", "3", "3", "4", "1", "1", "1")
Location <- c("a", "a", "a", "b", "b", "b", "c", "a", "a")
(df <- data.frame(Case, Procedure, Location))
Case Procedure Location
1 Siddhartha 1 a
2 Siddhartha 1 a
3 Siddhartha 2 a
4 Paul 3 b
5 Paul 3 b
6 Paul 4 b
7 Hannah 1 c
8 Herbert 1 a
9 Herbert 1 a
The function:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
arrange(desc(Anzahl))
The result:
Procedure Location Anzahl
<fct> <fct> <int>
1 1 a 2
2 1 c 1
3 2 a 1
4 3 b 1
5 4 b 1
What I need:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 2 0 1
2 2 1 0 0
3 3 0 1 0
4 4 0 1 0
So I want the counts broken down by procedure AND location, with the locations as columns. This is what I tried:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
But: Error: This tidyselect interface doesn't support predicates yet.
i Contact the package author and suggest using eval_select().
I tried to solve this problem in other questions I asked before (it almost feels like spamming at this point), but I can't apply the solutions to the original data frame. The function shown above (group_by, summarise) also works for the original; the only thing is that it doesn't spread the counts across the locations.
Regards

This should work. Your attempt failed because the summarised column is called Anzahl, not n, so values_from = n ended up referring to the function n() rather than a column:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
arrange(Location, desc(Anzahl)) %>%
pivot_wider(names_from = Location, values_from = Anzahl, values_fill = list(Anzahl = 0))
Which gives us:
Procedure a b c
<chr> <int> <int> <int>
1 1 2 0 1
2 2 1 0 0
3 3 0 1 0
4 4 0 1 0
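As an aside, the intermediate summarise() can be skipped entirely: pivot_wider() has a values_fn argument that applies a function to the values landing in each cell (this is the approach one of the answers to the duplicate question below takes; it needs tidyr >= 1.0):
df %>%
  pivot_wider(names_from = Location,
              values_from = Case,
              # n_distinct() is computed per Procedure x Location cell;
              # combinations with no cases are filled with 0
              values_fn = list(Case = n_distinct),
              values_fill = list(Case = 0))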

Related

Why does this dplyr group function give strange results?

When I run the reproducible code below, I get the desired grouping results in the GroupRank column, shown immediately beneath:
library(dplyr)
myData <-
data.frame(
Element = c("A","A","B","A","C","C"),
Group = c(0,0,0,0,1,1)
)
myDataGroups <- myData %>%
mutate(origOrder = row_number()) %>%
group_by(Element) %>%
mutate(ElementCnt = row_number()) %>%
ungroup() %>%
mutate(Group = factor(Group, unique(Group))) %>%
arrange(Group) %>%
mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
group_by(Group) %>%
mutate(GroupRank = ElementCnt - max(0L,groupCt),
GroupRank = if_else(as.character(Group) == "0", ElementCnt, min(GroupRank))
)%>%
ungroup() %>%
arrange(origOrder)
myDataGroups
> myDataGroups
# A tibble: 6 x 6
Element Group origOrder ElementCnt groupCt GroupRank
<chr> <fct> <int> <int> <int> <int>
1 A 0 1 1 -1 1
2 A 0 2 2 -1 2
3 B 0 3 1 -1 1
4 A 0 4 3 -1 3
5 C 1 5 1 0 1
6 C 1 6 2 0 1
However, when I take the line GroupRank = if_else(as.character(Group) == "0", ElementCnt, min(GroupRank)) from the above code and simply wrap it in a max function, like this: GroupRank = max(1L, if_else(as.character(Group) == "0", ElementCnt, min(GroupRank))) (run with both 1 and 1L, with the same results either way), I get the strange output shown below. GroupRank shouldn't have changed from the above output:
Element Group origOrder ElementCnt groupCt GroupRank
<chr> <fct> <int> <int> <int> <int>
1 A 0 1 1 -1 3
2 A 0 2 2 -1 3
3 B 0 3 1 -1 3
4 A 0 4 3 -1 3
5 C 1 5 1 0 1
6 C 1 6 2 0 1
What am I doing wrong here? Am I using max() incorrectly?
Note the difference between max() and pmax().
max(1:5, 5:1)
#> [1] 5
pmax(1:5, 5:1)
#> [1] 5 4 3 4 5
max() returns a single scalar, the maximum over everything passed to it, which is why you get a constant value per group. pmax() does what you apparently expect: it returns a vector of elementwise (rowwise) maxima.
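As a minimal sketch of the fix, here is the pipeline from the question with pmax() swapped in for max(); everything else is unchanged:
library(dplyr)
myDataGroups <- myData %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(ElementCnt = row_number()) %>%
  ungroup() %>%
  mutate(Group = factor(Group, unique(Group))) %>%
  arrange(Group) %>%
  mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
  group_by(Group) %>%
  mutate(GroupRank = ElementCnt - max(0L, groupCt),
         # pmax() compares 1L against each element, so every row keeps its
         # own value instead of the group collapsing to one scalar maximum
         GroupRank = pmax(1L, if_else(as.character(Group) == "0",
                                      ElementCnt, min(GroupRank)))) %>%
  ungroup() %>%
  arrange(origOrder)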

Estimating the percentage of common set members over time in a panel

I have a time-series panel dataset that is structured in the following way: there are two funds, each owning different stocks in each time period.
df <- data.frame(
fund_id = c(1,1,1,1,1,1,1,1, 1, 2,2,2,2),
time_Q = c(1,1,1,2,2,2,2,3, 3, 1,1,2,2),
stock_id = c("A", "B", "C", "A", "C", "D", "E", "D", "E", "A", "B", "B", "C")
)
> df
fund_id time_Q stock_id
1 1 1 A
2 1 1 B
3 1 1 C
4 1 2 A
5 1 2 C
6 1 2 D
7 1 2 E
8 1 3 D
9 1 3 E
10 2 1 A
11 2 1 B
12 2 2 B
13 2 2 C
For each fund, I would like to calculate the percentage of stocks held in the current time_Q that were also held in the previous one to two quarters. So basically, for every fund and every time_Q, I would like two columns, past_1Q and past_2Q, showing what percentage of the stocks held at that time were also present in each of those past time_Qs.
Here is what the result should look like:
result <- data.frame(
fund_id = c(1,1,1,2,2),
time_Q = c(1,2,3,1,2),
past_1Q = c("NA",0.5,1,"NA",0.5),
past_2Q = c("NA","NA",0,"NA","NA")
)
> result
fund_id time_Q past_1Q past_2Q
1 1 1 NA NA
2 1 2 0.5 NA
3 1 3 1 0
4 2 1 NA NA
5 2 2 0.5 NA
I'm currently thinking about using either the setdiff or the intersect function, but I'm not sure how to apply them to the panel structure. I'm looking for a scalable dplyr or data.table solution that can cover multiple funds, stocks and time periods and also look at common elements in up to 12 lagged time periods. I would appreciate any help, as I've been stuck on this problem for quite a while.
We can use dplyr and purrr to programmatically build up a lagged ownership variable and then summarize() across all of them using across(). First, we just need a dummy variable for ownership and group our data by fund and stock.
library(dplyr)
library(purrr)
df_grouped <- df %>%
mutate(owned = TRUE) %>%
group_by(fund_id, stock_id)
Then we can generate lagged ownership for each stock based on time_Q, join all of them together, and for each fund and time_Q calculate the proportion of ownership.
map(
1:2,
~df_grouped %>%
mutate(
"past_{.x}Q" := lag(owned, n = .x, order_by = time_Q)
)
) %>%
reduce(left_join, by = c("fund_id", "stock_id", "time_Q", "owned")) %>%
group_by(fund_id, time_Q) %>%
summarize(
across(
starts_with("past"),
~if (all(is.na(.x))) NA else sum(.x, na.rm = T) / n()
)
)
#> # A tibble: 5 × 4
#> fund_id time_Q past_1Q past_2Q
#> <dbl> <dbl> <dbl> <lgl>
#> 1 1 1 NA NA
#> 2 1 2 0.5 NA
#> 3 1 3 1 NA
#> 4 2 1 NA NA
#> 5 2 2 0.5 NA
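Since the question mentions looking back up to 12 quarters, the only change needed is the range given to map(); the join and summary steps stay exactly the same:
map(
  1:12,  # one lagged ownership column per quarter looked back
  ~df_grouped %>%
    mutate(
      "past_{.x}Q" := lag(owned, n = .x, order_by = time_Q)
    )
) %>%
  reduce(left_join, by = c("fund_id", "stock_id", "time_Q", "owned")) %>%
  group_by(fund_id, time_Q) %>%
  summarize(
    across(
      starts_with("past"),
      ~if (all(is.na(.x))) NA else sum(.x, na.rm = TRUE) / n()
    )
  )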
Here's a dplyr-only solution:
library(dplyr)
df %>%
group_by(fund_id, time_Q) %>%
summarise(new = list(stock_id)) %>%
mutate(past_1Q = lag(new, 1),
past_2Q = lag(new, 2)) %>%
rowwise() %>%
transmute(time_Q,
across(past_1Q:past_2Q, ~ length(intersect(new, .x)) / length(new)))
Output:
fund_id time_Q past_1Q past_2Q
<dbl> <dbl> <dbl> <dbl>
1 1 1 0 0
2 1 2 0.5 0
3 1 3 1 0
4 2 1 0 0
5 2 2 0.5 0
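If you want NA rather than 0 where a lagged quarter does not exist (as in the desired result), a guarded version of the same pipeline works. This is a sketch that assumes lag() pads missing list elements with NULL; the is.na() check is a fallback in case the padding is NA instead:
df %>%
  group_by(fund_id, time_Q) %>%
  summarise(new = list(stock_id)) %>%
  mutate(past_1Q = lag(new, 1),
         past_2Q = lag(new, 2)) %>%
  rowwise() %>%
  transmute(time_Q,
            across(past_1Q:past_2Q,
                   # no lagged holdings available -> NA rather than 0
                   ~ if (is.null(.x) || all(is.na(.x))) NA_real_
                     else length(intersect(new, .x)) / length(new)))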

How can I filter by subjects who have all levels of a factor?

I am trying to filter a data set to only include subjects who have data in all conditions (levels of a factor).
I have tried to filter by calculating the number of levels for each subject, but that does not work.
library(tidyverse)
Data <- data.frame(
Subject = factor(c(rep(1, 3),
rep(2, 3),
rep(3, 1))),
Condition = factor(c("A", "B", "C",
"A", "B", "C",
"A")),
Val = c(1, 0, 1,
0, 0, 1,
1)
)
Data %>%
semi_join(
.,
Data %>%
group_by(Subject) %>%
summarize(Num_Cond = length(levels(Condition))) %>%
filter(Num_Cond == 3),
by = "Subject"
)
This attempt yields the following (every subject is kept, because levels(Condition) returns all of the factor's levels whether or not they occur within the group):
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
7 3 A 1
Desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1
I want to filter subject 3 out because they only have data for one condition.
Is there a dplyr/tidyverse approach for this problem?
We can create a condition with all() and levels():
library(dplyr)
Data %>%
group_by(Subject) %>%
filter(all(levels(Condition) %in% Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Or with n_distinct() and nlevels():
Data %>%
group_by(Subject) %>%
filter(nlevels(Condition) == n_distinct(Condition))
# A tibble: 6 x 3
# Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Here is a solution testing whether the number of rows of each group is equal to the number of levels of Condition.
Data %>%
group_by(Subject) %>%
filter(n() == nlevels(Condition))
## A tibble: 6 x 3
## Groups: Subject [2]
# Subject Condition Val
# <fct> <fct> <dbl>
#1 1 A 1
#2 1 B 0
#3 1 C 1
#4 2 A 0
#5 2 B 0
#6 2 C 1
Edit
Following the comment by user @akrun, I tested with a data set having duplicated rows, and the code above does fail.
bind_rows(Data, Data) %>%
group_by(Subject) %>%
#distinct() %>%
filter(n() == nlevels(Condition))
## A tibble: 0 x 3
## Groups: Subject [0]
## ... with 3 variables: Subject <fct>, Condition <fct>, Val <dbl>
Uncommenting the distinct() line solves the problem.
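For illustration, here is the pipeline with the distinct() line active, run on the duplicated data; distinct() collapses the repeated rows before the group sizes are compared:
bind_rows(Data, Data) %>%
  group_by(Subject) %>%
  distinct() %>%
  filter(n() == nlevels(Condition))
# back to the 6-row result for Subjects 1 and 2, as in the original output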
I found a relatively simple solution by subsetting on Subject:
Data %>%
semi_join(
.,
Data %>%
group_by(Subject) %>%
droplevels() %>%
summarize(Num_Cond = length(levels(Condition)[Subject])) %>%
filter(Num_Cond == 3),
by = "Subject"
)
This gives the desired output:
Subject Condition Val
1 1 A 1
2 1 B 0
3 1 C 1
4 2 A 0
5 2 B 0
6 2 C 1

distinct cases for two variables by grouping and counting

We can use the following data frame as an example:
Case <- c("Siddhartha", "Siddhartha", "Siddhartha", "Paul", "Paul", "Paul", "Hannah", "Herbert")
Procedure <- c("1", "1", "2", "3", "3", "4", "1", "1")
Location <- c("a", "a", "b", "a", "a", "b", "c", "a")
(df <- data.frame(Case, Procedure, Location))
Case Procedure Location
1 Siddhartha 1 a
2 Siddhartha 1 a
3 Siddhartha 2 b
4 Paul 3 a
5 Paul 3 a
6 Paul 4 b
7 Hannah 1 c
8 Herbert 1 a
Now I do the following:
df %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
which gives me:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 3 0 1
2 3 2 0 0
3 2 0 1 0
4 4 0 1 0
This is not exactly what I want, though. What I want is the following data frame:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 2 0 1
2 3 1 0 0
3 2 0 1 0
4 4 0 1 0
Notice the difference in Procedures 1 and 3.
So what I would like is a function that counts the number of DISTINCT cases for each procedure AND each location. That function should also work on varying data frames, where there are different (unknown) cases and procedures.
For the original data frame,
df %>%
distinct() %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
does not work, since it seems to ignore the distinct(). What works (also for the original data frame!) is the following:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case))
That gives me the following though:
# A tibble: 5 x 3
# Groups: Procedure [4]
Procedure Location Anzahl
<fct> <fct> <int>
1 1 a 2
2 1 c 1
3 2 a 1
4 3 b 1
5 4 b 1
But how do I add the pivot_wider step, so the counts are also spread by location? If I try to add it, I get the following error:
"Error: This tidyselect interface doesn't support predicates yet.
i Contact the package author and suggest using eval_select()."
Also, it is very confusing to me why Ronak's solution works for the example data frame but not for the original. I can't spot any important differences between these two data frames.
Regards
You can do it with a single call to pivot_wider() by taking advantage of the values_fn argument, which applies a function to the values in each cell:
df %>%
pivot_wider(names_from = Location,
values_from = Case,
values_fn = list(Case = n_distinct),
values_fill = list(Case = 0))
which gives:
# A tibble: 4 x 4
Procedure a b c
<fct> <int> <int> <int>
1 1 2 0 1
2 2 0 1 0
3 3 1 0 0
4 4 0 1 0
A simple fix is to add distinct() (or unique()) before counting:
library(dplyr)
library(tidyr)
df %>%
distinct() %>%
count(Location, Procedure) %>%
pivot_wider(names_from = Location, values_from = n, values_fill = list(n = 0))
# A tibble: 4 x 4
# Procedure a b c
# <chr> <int> <int> <int>
#1 1 2 0 1
#2 3 1 0 0
#3 2 0 1 0
#4 4 0 1 0
For the OP's data they need:
df %>%
group_by(Procedure, Location) %>%
summarise(Anzahl = n_distinct(Case)) %>%
pivot_wider(names_from = Location, values_from = Anzahl,
values_fill = list(Anzahl = 0))

pivot_wider when there's no value column

I'm trying to reshape a dataset from long to wide. The following code works, but I'm curious whether there's a way to use pivot_wider without providing a value column. In the example below, I have to create a temporary column val to use pivot_wider; is there a way to do it without that?
a <- data.frame(name = c("sam", "rob", "tom"),
type = c("a", "b", "c"))
a
name type
1 sam a
2 rob b
3 tom c
I want to convert it to the following:
name a b c
1 sam 1 0 0
2 rob 0 1 0
3 tom 0 0 1
This can be done with the following code, but can I do it without creating the val column (and still using tidyverse idioms)?
a <- data.frame(name = c("sam", "rob", "tom"),
type = c("a", "b", "c"),
val = rep(1, 3)) %>%
pivot_wider(names_from = type, values_from = val, values_fill = list(val = 0))
You can use the values_fn argument to assign 1 and values_fill to assign 0:
library(tidyr)
pivot_wider(a, names_from = type, values_from = type, values_fn = ~1, values_fill = 0)
# A tibble: 3 x 4
name a b c
<fct> <dbl> <dbl> <dbl>
1 sam 1 0 0
2 rob 0 1 0
3 tom 0 0 1
We can mutate with a column of 1s and use that in pivot_wider:
library(dplyr)
library(tidyr)
a %>%
mutate(n = 1) %>%
pivot_wider(names_from = type, values_from = n, values_fill = list(n = 0))
# A tibble: 3 x 4
# name a b c
# <fct> <dbl> <dbl> <dbl>
#1 sam 1 0 0
#2 rob 0 1 0
#3 tom 0 0 1
In base R, it would be easier:
table(a)
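And if a plain data frame like the desired output is needed rather than a table object, the contingency table converts back with as.data.frame.matrix() (a small sketch, assuming a is still the original two-column data frame; the name column is recovered from the row names):
counts <- as.data.frame.matrix(table(a$name, a$type))  # contingency table -> data frame
counts$name <- rownames(counts)                        # recover name from the row names
counts[, c("name", "a", "b", "c")]
#     name a b c
# rob  rob 0 1 0
# sam  sam 1 0 0
# tom  tom 0 0 1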
Going older school, reshape2::dcast, or the thriving data.table::dcast, let you do this by specifying an aggregate function:
reshape2::dcast(a, name ~ type, fun.aggregate = length)
# name a b c
# 1 rob 0 1 0
# 2 sam 1 0 0
# 3 tom 0 0 1
data.table::dcast(setDT(a), name ~ type, fun.aggregate = length)
# name a b c
# 1: rob 0 1 0
# 2: sam 1 0 0
# 3: tom 0 0 1
