Ignore value conditionally within group_by in dplyr

Please consider the following.
Background
In a data.frame I have patient IDs (id), the day on which patients are admitted to a hospital (day), a code for the diagnostic activity they received that day (code), a price for that activity (price), and a frequency for that activity (freq).
Activities with codes b and c are registered at the same time but mean more or less the same thing, so they should not be double-counted.
Problem
What I want is: if code "b" and "c" are registered for the same day, code "b" should be ignored.
The example data.frame looks like this:
x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
                day = c(1, 1, 1, 2, 1, 2, 3),
                price = c(500, 10, 100, rep(10, 3), 100),
                code = c("a", "b", "c", rep("b", 3), "c"),
                freq = c(rep(1, 5), rep(2, 2)))
> x
id day price code freq
1 a 1 500 a 1
2 a 1 10 b 1
3 a 1 100 c 1
4 a 2 10 b 1
5 b 1 10 b 1
6 b 2 10 b 2
7 b 3 100 c 2
So the costs for patient "a" on day 1 would be 600, not the 610 that I get with the following:
x %>%
  group_by(id, day) %>%
  summarise(res = sum(price * freq))
# A tibble: 5 x 3
# Groups: id [?]
id day res
<fct> <dbl> <dbl>
1 a 1. 610.
2 a 2. 10.
3 b 1. 10.
4 b 2. 20.
5 b 3. 200.
Possible approaches
Either I delete the code "b" observation when "c" is present on the same day, or I set the freq of code "b" to 0 when code "c" is present on the same day.
All my attempts with ifelse and mutate have failed so far.
Any help is much appreciated. Thank you very much in advance!

You can add a filter line to remove the offending b values like this...
x %>%
  group_by(id, day) %>%
  filter(!(code == "b" & "c" %in% code)) %>%
  summarise(res = sum(price * freq))
id day res
<fct> <dbl> <dbl>
1 a 1. 600.
2 a 2. 10.
3 b 1. 10.
4 b 2. 20.
5 b 3. 200.
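Alternatively, here is a sketch of the question's second approach (setting the freq of "b" to 0 when "c" is present on the same day, instead of dropping rows); it yields the same result:
x %>%
  group_by(id, day) %>%
  # zero out freq on "b" rows whenever a "c" is registered on the same day
  mutate(freq = ifelse(code == "b" & "c" %in% code, 0, freq)) %>%
  summarise(res = sum(price * freq))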

You could create a new column like this:
mutate(code_day = paste0(ifelse(code %in% c("b", "c"), "z", code), day))
Then all your Bs and Cs will become Zs (without losing the original code column that helps you tell them apart). You can then arrange by code descending and remove duplicate values in the code_day column:
arrange(desc(code)) %>% # Bs will come after Cs
distinct(code_day, .keep_all = TRUE)
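Put together, the whole pipeline might look like this (a sketch; note that id is pasted into code_day here as well, an addition to avoid collisions between different patients who share the same day number):
x %>%
  mutate(code_day = paste0(id, ifelse(code %in% c("b", "c"), "z", code), day)) %>%
  arrange(desc(code)) %>%                  # "c" rows sort before "b" rows
  distinct(code_day, .keep_all = TRUE) %>% # keeps the "c" row of each b/c pair
  group_by(id, day) %>%
  summarise(res = sum(price * freq))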

Related

Conditional cumulative sum from two columns

I can't get my head around the following problem.
Assuming the following data:
library(tidyverse)
df <- tibble(source = c("A", "A", "B", "B", "B", "C"),
             value = c(5, 10, NA, NA, NA, 20),
             add = c(1, 1, 1, 2, 3, 4))
What I want to do is: for all cases where source == "B", I want to calculate the cumulative sum of the previous row's value and the current row's add. Of course, for the first "B" row, I need to provide a starting value for value. Note: in this case, it would be fine if we just take the value from the last "A" row.
So for row 3, the result would be 10 + 1 = 11.
For row 4, the result would be 11 + 2 = 13.
For row 5, the results would be 13 + 3 = 16.
I tried to use purrr::accumulate, but I failed in many different ways, e.g. I thought I could do:
df %>%
  mutate(test = accumulate(add, .init = 10, ~ .x + .y))
But this leads to error:
Error: Problem with `mutate()` column `test`.
i `test = accumulate(add, .init = 10, ~.x + .y)`.
i `test` must be size 6 or 1, not 7.
Same if I use .init = value
And I also didn't manage to do the job only on group B (although this is probably not an issue; I think I can perform it on the full data frame and then just replace the values for all non-B rows).
Expected output:
# A tibble: 6 x 4
source value add test
<chr> <dbl> <dbl> <dbl>
1 A 5 1 NA
2 A 10 1 NA
3 B NA 1 11
4 B NA 2 13
5 B NA 3 16
6 C 20 4 NA
You were essentially on the right track. Since you provide an .init value to accumulate, the resulting vector has size n + 1, with the first value being .init. You have to remove the first value to get a vector that fits your column size.
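For example (a quick illustration of the n + 1 length):
purrr::accumulate(c(1, 2, 3), `+`, .init = 10)
#> [1] 10 11 13 16  # length 4 for a length-3 input; drop the first with [-1]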
Then, if you want NAs on the remaining values, here's a way to do it. Also, since the "starting row" is the third, .init has to be set to 8: accumulate runs over the whole add column, so rows 1 and 2 each contribute their add value of 1 before row 3 is reached, and 8 + 1 + 1 + 1 = 11 lands the desired value on row 3.
df %>%
  mutate(test = ifelse(source == "B",
                       accumulate(add, .init = 8, ~ .x + .y)[-1],
                       NA))
# A tibble: 6 x 4
source value add test
<chr> <dbl> <dbl> <dbl>
1 A 5 1 NA
2 A 10 1 NA
3 B NA 1 11
4 B NA 2 13
5 B NA 3 16
6 C 20 4 NA
@tmfmnk provided an awesome answer and they deserve full credit (NOT ME).
Below is essentially the code from their comment (for more visibility, while also setting an explicit initial value; masking the non-"B" rows at the end is my addition so the output matches the expected NAs):
init_value <- 10
df <- df %>%
  group_by(source) %>%
  mutate(test = init_value + cumsum(add)) %>%
  ungroup() %>%
  # the cumulative sum is only meaningful for the "B" group; mask the rest
  mutate(test = ifelse(source == "B", test, NA))
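Another route, closer to the original accumulate idea, is to run the scan over the "B" rows only, seeded with the last "A" value (a sketch, assuming the original df from the question):
b_rows <- df$source == "B"
start <- tail(df$value[df$source == "A"], 1)  # 10, the last "A" value
df$test <- NA
# accumulate returns `start` as its first element; drop it with tail(, -1)
df$test[b_rows] <- tail(purrr::accumulate(df$add[b_rows], `+`, .init = start), -1)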

Creating a table based on criteria from another table in r

I have a three-column table (Table_1) and I would like to create another table based on it. The table has a personal ID and work start and end days.
Table_1 <- data.frame(ID = c("A", "B", "C"), Start_Day = c(1, 20, 38), End_Day = c(14, 29, 42))
The new table I would like to create will have two columns, namely ID and Week. The number of rows for each ID is equal to the number of week bins spanned by Start_Day and End_Day. For example, ID A will have 2 week bins: 1 (days 1-7) and 2 (days 8-14); ID B will have 3 week bins: 3 (days 15-21), 4 (days 22-28) and 5 (days 29-35).
The expected outcome is:
Table_2 <- data.frame(ID = c("A", "A", "B", "B", "B", "C"), Week = c(1, 2, 3, 4, 5, 6))
One way would be to divide Start_Day and End_Day by 7 (taking the ceiling), create a sequence between them using map2, and bring the data into long format using unnest.
library(dplyr)
Table_1 %>%
  mutate_at(-1, ~ ceiling(. / 7)) %>%
  mutate(Week = purrr::map2(Start_Day, End_Day, seq)) %>%
  tidyr::unnest(Week) %>%
  select(ID, Week)
# A tibble: 6 x 2
# ID Week
# <fct> <int>
#1 A 1
#2 A 2
#3 B 3
#4 B 4
#5 B 5
#6 C 6
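For comparison, a base R sketch of the same idea (assuming, as above, that week bins are ceiling(day / 7)):
# one vector of week numbers per ID, then repeat each ID to match
weeks <- Map(function(s, e) seq(ceiling(s / 7), ceiling(e / 7)),
             Table_1$Start_Day, Table_1$End_Day)
Table_2 <- data.frame(ID = rep(Table_1$ID, lengths(weeks)),
                      Week = unlist(weeks))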

How to group or subset a data frame by two conditions in R

I need to create subsets or groups of my data based on two different conditions. This is a sample of how the data is structured:
df <- data.frame(id = c("a", "a", "a", "b", "d", "b", "b", "c", "d", "e"),
                 kpi = c("rev", "rev", "rev", "rev", "rev", "fte", "fte", "fte", "fte", "fte"),
                 value = c(100, 150, 200, 50, 70, 3, 5, 8, 9, 3))
id kpi value
1 a rev 100
2 a rev 150
3 a rev 200
4 b rev 50
5 d rev 70
6 b fte 3
7 b fte 5
8 c fte 8
9 d fte 9
10 e fte 3
The first column is filled with IDs for companies. There can be multiple rows for each ID, as a company might have data for multiple months (the months column is not included in the sample data) and data for both rev (revenue) and fte (full-time equivalents).
I want to select every company for which the fte average falls within a certain range, here 1-5.
For example, company b should be included as it has an average fte of 4 (3 in one month, 5 in another); company d should be excluded as it has a higher fte (9).
For the included companies I want all rows to remain in the data frame, including the rows with rev data. The goal is to calculate an average revenue for cohorts of companies with specific fte numbers.
The new.data frame with the mentioned conditions should look like this for the sample data:
df <- data.frame(id = c("b", "b", "b", "e"),
                 kpi = c("rev", "fte", "fte", "fte"),
                 value = c(50, 3, 5, 3))
id kpi value
1 b rev 50
2 b fte 3
3 b fte 5
4 e fte 3
It would be applied to a data.frame of about 40,000 rows.
I already did some research and found a lot on creating subsets with multiple conditions, but nothing I could apply to my specific problem. I am sorry if this is an obvious question; I am an R rookie and could really use some help!
If I didn't specify the problem clearly enough, feel free to ask and I will try to explain it more clearly!
Thank you all in advance!
Group on id and then filter those satisfying the condition:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(between(mean(value[kpi == "fte"]), 1, 5)) %>%
  ungroup()
giving:
# A tibble: 4 x 3
id kpi value
<fct> <fct> <dbl>
1 b rev 50.
2 b fte 3.
3 b fte 5.
4 e fte 3.
In base R you can use ave to create a temporary variable and then use that variable.
a <- ave(df$value, df$id, df$kpi, FUN = mean)
new <- df[1 <= a & a <= 5, ]
new
# id kpi value
#6 b fte 3
#7 b fte 5
#10 e fte 3
Now remove what you no longer need.
rm(a) # clean up
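Note that this version tests each row against the mean of its own id/kpi group, so company b's rev row (group mean 50) is dropped, unlike the expected output above. A hedged base R variant that keys the test to the fte average per id might look like this:
# mean fte per company, then keep all rows of qualifying companies
fte_mean <- with(df[df$kpi == "fte", ], tapply(value, id, mean))
keep_ids <- names(fte_mean)[which(fte_mean >= 1 & fte_mean <= 5)]
df[df$id %in% keep_ids, ]
#    id kpi value
# 4   b rev    50
# 6   b fte     3
# 7   b fte     5
# 10  e fte     3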
You can try a tidyverse solution:
library(tidyverse)
df %>%
  group_by(id, kpi) %>%
  mutate(Mean = mean(value)) %>%
  mutate(gr = between(Mean, 1, 5)) %>%
  group_by(id) %>%
  mutate(gr2 = ifelse(any(gr) & kpi == "rev", TRUE, FALSE)) %>%
  filter(gr | gr2) %>%
  select(1:3)
# A tibble: 4 x 3
# Groups: id [2]
id kpi value
<fct> <fct> <dbl>
1 b rev 50.
2 b fte 3.
3 b fte 5.
4 e fte 3.
I included each step to illustrate the idea:
1. Calculate the mean for each id and kpi combination.
2. Flag with TRUE where the mean is between 1 and 5.
3. Group by id again to flag the corresponding rev rows.
4. Filter on either flag.
5. Select the right columns.
Here is a solution with data.table. The inner expression builds a table of the ids whose fte mean lies in [1, 5]; joining it back onto df keeps all rows of those companies:
library("data.table")
setDT(df)
df[df[kpi=="fte", if (between(mean(value), 1, 5)) id, id], on="id"][, -c("V1")]
# id kpi value
# 1: b rev 50
# 2: b fte 3
# 3: b fte 5
# 4: e fte 3
or
df[df[kpi=="fte", if (between(mean(value), 1, 5)) id, id][,-2], on="id"][]

Extracting corresponding dataframe values from multiple records using a function

I have a dataframe (df1) containing many records. Each record has up to three trials, and each trial can be repeated up to five times. Below is an example of some data I have:
Record Trial Start End Speed Number
1 2 1 4 12 9
1 2 4 6 11 10
1 3 1 3 10 17
2 1 1 5 14 5
I have the following code that calculates the longest 'Distance' and 'Maximum Number' for each Record:
getInfo <- function(race_df) {
  race_distance <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(max.distance = max(End - Start)))
  race_max_number <- as.data.frame(race_df %>% group_by(Record, Trial) %>% summarise(max.N = max(Number)))
  rd_rmn_merge <- as.data.frame(merge(x = race_distance, y = race_max_number))
  total_summary <- as.data.frame(rd_rmn_merge[order(rd_rmn_merge$Trial), ])
  return(list(race_distance, race_max_number, total_summary))
}
list_summary <- getInfo(race_df)
total_summary <- list_summary[[3]]
list_summary gives me an output like this:
[[1]]
Record Trial Max.Distance
1 2 3
1 3 2
2 1 4
[[2]]
Record Trial Max.Number
1 2 10
1 3 17
2 1 5
[[3]]
Record Trial Max.Distance Max.Number
1 2 3 10
1 3 2 17
2 1 4 5
I am now trying to find the longest distance together with its corresponding 'Number', regardless of whether that Number is the maximum. So Record 1, Trial 2 would instead look like this:
Record Trial Max.Distance Corresponding Number
1 2 3 9
Eventually I would like to create a function that takes 'Record' and 'Trial' as arguments and searches the 'race_df' dataframe, to make looking up a specific record and trial's longest distance easier.
Any help on this would be much appreciated.
The data (in case anyone else wants to offer their solution):
df <- data.frame(Record = c(1, 1, 1, 2),
                 Trial = c(2, 2, 3, 1),
                 Start = c(1, 4, 1, 1),
                 End = c(4, 6, 3, 5),
                 Speed = c(12, 11, 10, 14),
                 Number = c(9, 10, 17, 5))
Here's a tidyverse solution:
library(tidyverse)
df %>%
  mutate(Max.Distance = End - Start) %>%
  select(-Start, -End, -Speed) %>%
  group_by(Record) %>%
  nest() %>%
  mutate(data = map(data, ~ filter(.x, Max.Distance == max(Max.Distance)))) %>%
  unnest()
The output:
Record Trial Number Max.Distance
<dbl> <dbl> <dbl> <dbl>
1 1 2 9 3
2 2 1 5 4
Note: if you want to keep all of your columns in the final data frame, just remove the select step.
I hope I get right what your function is supposed to do. In the end it should take a record and a trial and put out the row(s) with the maximum distance, right?
So it boils down to two filters:
1. Filter the rows for the record and trial.
2. Within that subset, filter the row(s) that have the maximum distance.
Between those two filters we have to calculate the distance, although I suggest you move that outside the function (see the sketch after the example), because it is basically a one-time operation.
race_df <- data.frame(Record = c(1, 1, 1, 2), Trial = c(2, 2, 3, 1),
                      Start = c(1, 4, 1, 1), End = c(4, 6, 3, 5),
                      Speed = c(12, 11, 10, 14), Number = c(9, 10, 17, 5))
library(dplyr)

get_longest <- function(df, record, trial){
  df %>%
    filter(Record == record & Trial == trial) %>%
    mutate(Distance = End - Start) %>%
    filter(Distance == max(Distance)) %>%
    select(Number, Distance)
}
get_longest(race_df, 1, 2)
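As mentioned, the distance only needs to be computed once, so a sketch of the precomputed variant would be:
race_df <- race_df %>% mutate(Distance = End - Start)

get_longest <- function(df, record, trial){
  df %>%
    filter(Record == record & Trial == trial) %>%
    filter(Distance == max(Distance)) %>%
    select(Number, Distance)
}

get_longest(race_df, 1, 2)
#   Number Distance
# 1      9        3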

R - Find a sequence of row elements based on time constraints in a dataframe

Consider the following dataframe (ordered by id and time):
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
                 time = c(1, 3, 6, 12, 24, 30, 32, 1, 2, 6, 17, 24))
df
id event time
1 1 a 1
2 1 b 3
3 1 b 6
4 1 b 12
5 1 a 24
6 1 b 30
7 1 a 42
8 2 a 1
9 2 a 2
10 2 b 6
11 2 a 17
12 2 a 24
I want to count how many times a given sequence of events appears in each "id" group. Consider the following sequence with time constraints:
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
It means that event "a" can start at any time; event "b" must start no earlier than 2 and no later than 8 after event "a"; another event "a" must start no earlier than 12 and no later than 18 after event "b".
Some rules for creating sequences:
Events don't need to be consecutive with respect to the "time" column. For example, the sequence can be constructed from rows 1, 3, and 5.
To be counted, sequences must have different first events. For example, if the sequence from rows 8, 10, and 11 is counted, then the sequence from rows 8, 10, and 12 must not be counted.
Events may be included in many constructed sequences as long as they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.
The expected result:
df1
id count
1 1 2
2 2 2
There are some related questions in R - Identify a sequence of row elements by groups in a dataframe and Finding rows in R dataframe where a column value follows a sequence.
Is there a way to solve the problem using "dplyr"?
I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question where you have a 32 instead of a 42 when you define the time column in df. I say this is a typo because it doesn't match your output immediately below the definition of df. I changed the 32 to a 42 in the code below.
library(dplyr)
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a", "b", "b", "b", "a", "b", "a", "a", "a", "b", "a", "a"),
                 time = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
df %>%
  full_join(df, by = 'id', suffix = c('1', '2')) %>%
  full_join(df, by = 'id') %>%
  rename(event3 = event, time3 = time) %>%
  filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
  filter(time1 %>% between(time_LB[1], time_UB[1])) %>%
  filter((time2 - time1) %>% between(time_LB[2], time_UB[2])) %>%
  filter((time3 - time2) %>% between(time_LB[3], time_UB[3])) %>%
  group_by(id, time1) %>%
  slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
  group_by(id) %>%
  count()
Here's the output:
# A tibble: 2 x 2
id n
<dbl> <int>
1 1 2
2 2 2
Also, if you omit the last 2 parts of the dplyr pipe that do the counting (to see the sequences it is matching), you get the following sequences:
Source: local data frame [4 x 7]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3
<dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1 1 a 1 b 6 a 24
2 1 a 24 b 30 a 42
3 2 a 1 b 6 a 24
4 2 a 2 b 6 a 24
EDIT IN RESPONSE TO COMMENT REGARDING GENERALIZING THIS: Yes, it is possible to generalize this to arbitrary-length sequences, but it requires some R voodoo. Most notable is the use of Reduce, which allows you to apply a common function over a list of objects, and foreach, which I'm borrowing from the foreach package to do some arbitrary looping. Here's the code:
library(dplyr)
library(foreach)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
multi_full_join = function(df1,df2) {full_join(df1,df2,by='id')}
df_list = foreach(i=1:length(seq)) %do% {df}
df2 = Reduce(multi_full_join,df_list)
names(df2)[grep('event',names(df2))] = paste0('event',seq_along(seq))
names(df2)[grep('time',names(df2))] = paste0('time',seq_along(seq))
df2 = df2 %>% mutate_if(is.factor,as.character)
df2 = df2 %>%
  mutate(seq_string = Reduce(paste0, df2 %>% select(grep('event', names(df2))) %>% as.list)) %>%
  filter(seq_string == paste0(seq, collapse = ''))
time_diff = df2 %>% select(grep('time', names(df2))) %>%
  t %>%
  as.data.frame() %>%
  lapply(diff) %>%
  unlist %>%
  matrix(ncol = length(seq) - 1, byrow = TRUE) %>% # one diff column per consecutive pair of events
  as.data.frame
foreach(i = seq_along(time_diff), .combine = data.frame) %do%
{
  time_diff[[i]] %>% between(time_LB[i + 1], time_UB[i + 1])
} %>%
  Reduce(`&`, .) %>%
  which %>%
  slice(df2, .) %>%
  filter(time1 %>% between(time_LB[1], time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
  group_by(id, time1) %>%
  slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)
This outputs the following:
Source: local data frame [4 x 8]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3 seq_string
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 1 a 1 b 6 a 24 aba
2 1 a 24 b 30 a 42 aba
3 2 a 1 b 6 a 24 aba
4 2 a 2 b 6 a 24 aba
If you want just the counts, you can group_by(id) then count() as in the original code snippet.
Perhaps it's easier to represent event sequences as strings and use regex:
df.str = lapply(split(df, df$id), function(d) {
  z = rep('-', tail(d, 1)$time)
  z[d$time] = as.character(d$event)
  z
})
df.str = lapply(df.str, paste, collapse = '')
# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"
# the lookahead (?=...) lets matches overlap; the quantifier bounds are the
# time constraints minus one, since the event characters themselves each
# occupy one position (a gap of 2-8 time units leaves 1-7 characters between)
df1 = lapply(df.str, function(s) length(gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl = TRUE)[[1]]))
data.frame(id = names(df1), count = unlist(df1))
# id count
# 1 1 2
# 2 2 2
