I have a three-column table (Table_1) and I would like to create another table based on it. Table_1 holds a personal ID plus the work start and end days.
Table_1 <- data.frame(ID = c("A", "B", "C"), Start_Day = c(1, 20, 38), End_Day = c(14, 29, 42))
The new table I would like to create will have two columns, ID and Week. The number of rows for each ID equals the number of 7-day bins (weeks) spanned from Start_Day to End_Day. For example, ID A will have 2 week bins, 1 (days 1-7) and 2 (days 8-14); ID B will have 3 week bins, 3 (days 15-21), 4 (days 22-28) and 5 (days 29-35).
The expected outcome is:
Table_2 <- data.frame(ID = c("A", "A", "B", "B", "B", "C"), Week = c(1, 2, 3, 4, 5, 6))
One way would be to divide Start_Day and End_Day by 7, round up with ceiling to get week numbers, create a sequence between the two week numbers with map2, and bring the data into long format with unnest.
library(dplyr)
Table_1 %>%
mutate_at(-1, ~ceiling(./7)) %>%
mutate(Week = purrr::map2(Start_Day, End_Day, seq)) %>%
tidyr::unnest(Week) %>%
select(ID, Week)
# A tibble: 6 x 2
# ID Week
# <fct> <int>
#1 A 1
#2 A 2
#3 B 3
#4 B 4
#5 B 5
#6 C 6
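The same idea can be sketched in base R, in case purrr and tidyr are not available. This is only a minimal sketch: first_week, last_week and weeks are illustrative names that do not appear in the original post, and it assumes days are counted from 1.

```r
Table_1 <- data.frame(ID = c("A", "B", "C"),
                      Start_Day = c(1, 20, 38),
                      End_Day = c(14, 29, 42))

# Convert day numbers to week numbers (days 1-7 -> week 1, days 8-14 -> week 2, ...)
first_week <- ceiling(Table_1$Start_Day / 7)
last_week  <- ceiling(Table_1$End_Day / 7)

# One sequence of week numbers per ID
weeks <- Map(seq, first_week, last_week)

# Repeat each ID once per week and flatten the sequences
Table_2 <- data.frame(ID = rep(Table_1$ID, lengths(weeks)),
                      Week = unlist(weeks))
Table_2
```

Map() plays the role of map2() here, and rep() with lengths() does the job of unnest().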
I have a table that looks like this:
Id  Types
1   A
1   A
1   A
1   B
2   A
2   B
3   A
3   B
4   A
4   B
4   B
What I would like to do is 1. count for every ID the amount of A's and B's it has. 2. Compute the distribution of every combination of the amounts of A and B.
So at the end of step 2 I should have the table:
Amount of A  Amount of B  Number of Different IDs
1            1            2
1            2            1
3            1            1
How can this be achieved?
Thank you.
Here's a solution with dplyr and tidyr:
library(dplyr)
library(tidyr)
# ...
# Code to generate your original table: "your_table".
# ...
result <- your_table %>%
# Count the amount of each type for each Id.
group_by(Id) %>% count(Types) %>% ungroup() %>%
# "Pivot" the Types column, such that each type (here "A" and "B") gets its
# own column (here "Amount of A" and "Amount of B") to hold its amount (as
# calculated right above).
pivot_wider(id_cols = Id,
names_from = Types, names_prefix = "Amount of ",
values_from = n) %>%
# For each combination of amounts among those pivoted columns (ie. all the
# columns except "Id"), count how many distinct IDs there are.
group_by(across(-c(Id))) %>%
summarize("Number of Different IDs" = n_distinct(Id)) %>% ungroup()
# Print the result.
result
Given the example of your_table that you provided
your_table <- tibble::tribble(
~Id, ~Types,
1, "A",
1, "A",
1, "A",
1, "B",
2, "A",
2, "B",
3, "A",
3, "B",
4, "A",
4, "B",
4, "B"
)
you should get the following result:
# A tibble: 3 x 3
`Amount of A` `Amount of B` `Number of Different IDs`
<int> <int> <int>
1 1 1 2
2 1 2 1
3 3 1 1
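For comparison, the two steps (count per Id, then count the combinations) can also be sketched in base R. This is a hedged sketch rather than a replacement for the dplyr answer: counts, combos and n_ids are illustrative names, and the column names A and B simply come from the values of Types.

```r
your_table <- data.frame(
  Id    = c(1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4),
  Types = c("A", "A", "A", "B", "A", "B", "A", "B", "A", "B", "B")
)

# Step 1: one row per Id, one column per type, holding the counts
counts <- as.data.frame.matrix(table(your_table$Id, your_table$Types))

# Step 2: for every (A, B) combination, count how many Ids have it
combos <- aggregate(data.frame(n_ids = rep(1, nrow(counts))),
                    by = list(A = counts$A, B = counts$B),
                    FUN = sum)
combos
```

Each row of counts corresponds to exactly one Id, so summing a column of ones per combination gives the number of different Ids.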
I am trying to define the start of an interval based on the known end_time of that interval in R using dplyr::mutate() with an ifelse() statement.
I can easily define the start_time for the first interval using the minimum time value, but I am stuck on the other start times. I've tried ranking them with dense_rank(), but I do not know the syntax for extracting the end_time of the previous ranked value. The start_time for ranked > 1 should equal the end_time of the previous ranked value plus 1.
library(dplyr)
blks <- data.frame(Group = c(rep("A", 3), rep("B", 4)),
end_time = c(4, 8, 20, 5, 11, 15, 20))
expand.grid(time = 0:20,
Group = c("A","B")) %>%
left_join(mutate(blks, time = end_time), by = c("Group", "time")) %>%
group_by(Group) %>%
mutate(ranked = dense_rank(end_time),
start_time = ifelse(ranked == 1, min(time), "WHERE I NEED HELP"))
# else = the end_time from the previous ranked + 1
# end_time[ranked == ranked-1] + 1))
Desired result is:
mutate(blks, start_time = c(0, 5, 9, 0, 6, 12, 16))
We can try dplyr::lag with default = -1 and then add 1:
library(dplyr)
blks %>% group_by(Group) %>% mutate(start_time = lag(end_time,default=-1)+1)
# A tibble: 7 x 3
# Groups: Group [2]
Group end_time start_time
<fct> <dbl> <dbl>
1 A 4 0
2 A 8 5
3 A 20 9
4 B 5 0
5 B 11 6
6 B 15 12
7 B 20 16
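The lag trick can also be sketched without dplyr. The version below is a minimal base R sketch that assumes, as in the example, that the rows are already sorted by end_time within each Group and that the first interval of every group starts at 0.

```r
blks <- data.frame(Group = c(rep("A", 3), rep("B", 4)),
                   end_time = c(4, 8, 20, 5, 11, 15, 20))

# Within each group: the first start is 0, every later start is the
# previous end_time plus 1 (head(e, -1) drops the last end_time)
blks$start_time <- ave(blks$end_time, blks$Group,
                       FUN = function(e) c(0, head(e, -1) + 1))
blks
```

Shifting the vector by one position with head(e, -1) is what lag() does; the leading 0 corresponds to default = -1 followed by + 1.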
Please consider the following.
Background
In a data.frame I have patient IDs (id), the day at which patients are admitted to a hospital (day), a code for the diagnostic activity they received that day (code), a price for that activity (price) and a frequency for that activity (freq).
Activities with code b and c are registered at the same time but mean more or less the same thing and should not be double counted.
Problem
What I want is: if code "b" and "c" are registered for the same day, code "b" should be ignored.
The example data.frame looks like this:
x <- data.frame(id = c(rep("a", 4), rep("b", 3)),
day = c(1, 1, 1, 2, 1, 2, 3),
price = c(500, 10, 100, rep(10, 3), 100),
code = c("a", "b", "c", rep("b", 3), "c"),
freq = c(rep(1, 5), rep(2, 2)))
> x
id day price code freq
1 a 1 500 a 1
2 a 1 10 b 1
3 a 1 100 c 1
4 a 2 10 b 1
5 b 1 10 b 1
6 b 2 10 b 2
7 b 3 100 c 2
So the costs for patient "a" for day 1 would be 600 and not 610 as I can compute with the following:
x %>%
group_by(id, day) %>%
summarise(res = sum(price * freq))
# A tibble: 5 x 3
# Groups: id [?]
id day res
<fct> <dbl> <dbl>
1 a 1. 610.
2 a 2. 10.
3 b 1. 10.
4 b 2. 20.
5 b 3. 200.
Possible approaches
Either I delete observation code "b" when "c" is present on that same day or I set freq of code "b" to 0 in case code "c" is present on the same day.
All my attempts with ifelse and mutate have failed so far.
Any help is much appreciated. Thank you very much in advance!
You can add a filter line to remove the offending b values like this...
x %>%
group_by(id, day) %>%
filter(!(code=="b" & "c" %in% code)) %>%
summarise(res = sum(price * freq))
id day res
<fct> <dbl> <dbl>
1 a 1. 600.
2 a 2. 10.
3 b 1. 10.
4 b 2. 20.
5 b 3. 200.
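The other approach mentioned in the question, setting freq of code "b" to 0 when "c" is present on the same day, can be sketched in base R as follows. This is a hedged sketch: has_c, freq_adj and cost are illustrative names, and stringsAsFactors = FALSE is set so that code is compared as plain text.

```r
x <- data.frame(id    = c(rep("a", 4), rep("b", 3)),
                day   = c(1, 1, 1, 2, 1, 2, 3),
                price = c(500, 10, 100, rep(10, 3), 100),
                code  = c("a", "b", "c", rep("b", 3), "c"),
                freq  = c(rep(1, 5), rep(2, 2)),
                stringsAsFactors = FALSE)

# 1 if any row of the same patient and day has code "c", otherwise 0
has_c <- ave(as.numeric(x$code == "c"), paste(x$id, x$day), FUN = max)

# Zero out the frequency of "b" rows that share a patient-day with a "c" row
x$freq_adj <- ifelse(x$code == "b" & has_c == 1, 0, x$freq)
x$cost <- x$price * x$freq_adj

res <- aggregate(cost ~ id + day, data = x, FUN = sum)
res
```

Unlike the filter version, this keeps every original row in x, which can be handy if the ignored "b" rows need to be inspected later.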
You could create a new column like this:
mutate(code_day = paste0(ifelse(code %in% c("b", "c"), "z", as.character(code)), day))
Then all your Bs and Cs become Zs (without losing the original code column that tells them apart). You can then arrange by code descending and keep only the first row per patient and code_day, so that a C always wins over a B on the same day:
arrange(desc(code)) %>% # Bs will come after Cs
distinct(id, code_day, .keep_all = TRUE)
I need to create subsets or groups of my data based on two different conditions. This is a sample of how the data is structured:
df <- data.frame(id = c("a", "a", "a", "b", "d", "b", "b", "c", "d", "e"),
kpi = c ("rev", "rev", "rev", "rev", "rev", "fte", "fte", "fte", "fte", "fte"),
value = c(100, 150, 200, 50, 70, 3, 5, 8, 9, 3))
id kpi value
1 a rev 100
2 a rev 150
3 a rev 200
4 b rev 50
5 d rev 70
6 b fte 3
7 b fte 5
8 c fte 8
9 d fte 9
10 e fte 3
The first column holds company IDs. There can be multiple rows per ID, since a company might have data for multiple months (the month column is not included in the sample data) and for both rev (revenue) and fte (full-time equivalent).
I want to select every company whose average fte falls within a certain range, here 1-5.
For example, company b should be included because its average fte is 4 (3 in one month, 5 in another); company d should be excluded because its fte is higher.
For the included companies I want all rows to remain in the data frame, including the rows with rev data. The goal is to calculate the average revenue for cohorts of companies with specific fte numbers.
The new data frame under these conditions should look like this for the sample data:
df <- data.frame(id = c("b", "b", "b", "e"),
kpi = c("rev", "fte", "fte", "fte"), value = c(50, 3, 5, 3))
id kpi value
1 b rev 50
2 b fte 3
3 b fte 5
4 e fte 3
It would be applied to a data.frame of about 40,000 rows.
I already did some research and found a lot on creating subsets with multiple conditions, but nothing I could apply to my specific problem. I am sorry if this is an obvious question; I am an R rookie and could really use some help!
If I didn't specify the problem clearly enough, feel free to ask and I will try to explain it better!
Thank you all in advance!
Group on id and then filter those satisfying the condition:
library(dplyr)
df %>%
group_by(id) %>%
filter(between(mean(value[kpi == "fte"]), 1, 5)) %>%
ungroup
giving:
# A tibble: 4 x 3
id kpi value
<fct> <fct> <dbl>
1 b rev 50.
2 b fte 3.
3 b fte 5.
4 e fte 3.
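In base R you can use ave to create a temporary variable holding the mean per id and kpi, and then subset on that variable. Note that the result below contains only the fte rows, not the matching rev rows.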
a <- ave(df$value, df$id, df$kpi, FUN = mean)
new <- df[1 <= a & a <= 5, ]
new
# id kpi value
#6 b fte 3
#7 b fte 5
#10 e fte 3
Now remove what you no longer need.
rm(a) # clean up
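To keep the rev rows of the selected companies as well, one option in base R is to work out the qualifying ids from the fte rows first and then subset on id. A minimal sketch, where fte, fte_mean and keep_ids are illustrative names and the bounds 1 and 5 are taken from the question:

```r
df <- data.frame(id = c("a", "a", "a", "b", "d", "b", "b", "c", "d", "e"),
                 kpi = c("rev", "rev", "rev", "rev", "rev",
                         "fte", "fte", "fte", "fte", "fte"),
                 value = c(100, 150, 200, 50, 70, 3, 5, 8, 9, 3),
                 stringsAsFactors = FALSE)

# Average fte per company, computed on the fte rows only
fte <- df[df$kpi == "fte", ]
fte_mean <- tapply(fte$value, fte$id, mean)

# Companies whose average fte lies in the range 1-5
keep_ids <- names(fte_mean)[fte_mean >= 1 & fte_mean <= 5]

# Keep every row (rev and fte) of those companies
new <- df[df$id %in% keep_ids, ]
new
```

Companies without any fte rows (such as a) never appear in fte_mean, so they are dropped automatically.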
You can try a tidyverse solution:
library(tidyverse)
df %>%
group_by(id,kpi) %>%
mutate(Mean=mean(value)) %>%
mutate(gr= between(Mean, 1, 5)) %>%
group_by(id) %>%
mutate(gr2 = ifelse(any(gr) & kpi == "rev",T, F)) %>%
filter(gr | gr2) %>%
select(1:3)
# A tibble: 4 x 3
# Groups: id [2]
id kpi value
<fct> <fct> <dbl>
1 b rev 50.
2 b fte 3.
3 b fte 5.
4 e fte 3.
I included each step to illustrate the idea:
1. First calculate the mean for each combination of id and kpi.
2. Flag a row as TRUE (gr) if that mean is between 1 and 5.
3. Group by id again and flag the rev rows (gr2) of companies that have any TRUE in gr.
4. Filter on the two flags.
5. Select the right columns.
Here is a solution with data.table:
library("data.table")
setDT(df)
df[df[kpi=="fte", if (between(mean(value), 1, 5)) id, id], on="id"][, -c("V1")]
# > df[df[kpi=="fte", if (between(mean(value), 1, 5)) id, id], on="id"][, -c("V1")]
# id kpi value
# 1: b rev 50
# 2: b fte 3
# 3: b fte 5
# 4: e fte 3
or
df[df[kpi=="fte", if (between(mean(value), 1, 5)) id, id][,-2], on="id"][]
Consider the following dataframe (ordered by id and time):
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,32,1,2,6,17,24))
df
id event time
1 1 a 1
2 1 b 3
3 1 b 6
4 1 b 12
5 1 a 24
6 1 b 30
7 1 a 42
8 2 a 1
9 2 a 2
10 2 b 6
11 2 a 17
12 2 a 24
I want to count how many times a given sequence of events appears in each "id" group. Consider the following sequence with time constraints:
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
It means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", another event "a" must start no earlier than 12 and no later than 18 after event "b".
Some rules for creating sequences:
Events don't need to be consecutive with respect to "time" column. For example, seq can be constructed from rows 1, 3, and 5.
To be counted, sequences must have different first event. For example, if seq = rows 8, 10, and 11 was counted, then seq = rows 8, 10, and 12 must not be counted.
The events may be included in many constructed sequences if they do not violate the second rule. For example, we count both sequences: rows 1, 3, 5 and rows 5, 6, 7.
The expected result:
df1
id count
1 1 2
2 2 2
There are some related questions in R - Identify a sequence of row elements by groups in a dataframe and Finding rows in R dataframe where a column value follows a sequence.
Is there a way to solve the problem using "dplyr"?
I believe this is what you're looking for. It gives you the desired output. Note that there is a typo in your original question where you have a 32 instead of a 42 when you define the time column in df. I say this is a typo because it doesn't match your output immediately below the definition of df. I changed the 32 to a 42 in the code below.
library(dplyr)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
df %>%
full_join(df,by='id',suffix=c('1','2')) %>%
full_join(df,by='id') %>%
rename(event3 = event, time3 = time) %>%
filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
filter(time1 %>% between(time_LB[1],time_UB[1])) %>%
filter((time2-time1) %>% between(time_LB[2],time_UB[2])) %>%
filter((time3-time2) %>% between(time_LB[3],time_UB[3])) %>%
group_by(id,time1) %>%
slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
group_by(id) %>%
count()
Here's the output:
# A tibble: 2 x 2
id n
<dbl> <int>
1 1 2
2 2 2
Also, if you omit the last 2 parts of the dplyr pipe that do the counting (to see the sequences it is matching), you get the following sequences:
Source: local data frame [4 x 7]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3
<dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1 1 a 1 b 6 a 24
2 1 a 24 b 30 a 42
3 2 a 1 b 6 a 24
4 2 a 2 b 6 a 24
EDIT IN RESPONSE TO COMMENT REGARDING GENERALIZING THIS: Yes it is possible to generalize this to arbitrary length sequences but requires some R voodoo. Most notably, note the use of Reduce, which allows you to apply a common function on a list of objects as well as foreach, which I'm borrowing from the foreach package to do some arbitrary looping. Here's the code:
library(dplyr)
library(foreach)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
multi_full_join = function(df1,df2) {full_join(df1,df2,by='id')}
df_list = foreach(i=1:length(seq)) %do% {df}
df2 = Reduce(multi_full_join,df_list)
names(df2)[grep('event',names(df2))] = paste0('event',seq_along(seq))
names(df2)[grep('time',names(df2))] = paste0('time',seq_along(seq))
df2 = df2 %>% mutate_if(is.factor,as.character)
df2 = df2 %>%
mutate(seq_string = Reduce(paste0,df2 %>% select(grep('event',names(df2))) %>% as.list)) %>%
filter(seq_string == paste0(seq,collapse=''))
time_diff = df2 %>% select(grep('time',names(df2))) %>%
t %>%
as.data.frame() %>%
lapply(diff) %>%
unlist %>% matrix(ncol=length(seq)-1,byrow=TRUE) %>% # one column per gap between consecutive events
as.data.frame
foreach(i=seq_along(time_diff),.combine=data.frame) %do%
{
time_diff[[i]] %>% between(time_LB[i+1],time_UB[i+1])
} %>%
Reduce(`&`,.) %>%
which %>%
slice(df2,.) %>%
filter(time1 %>% between(time_LB[1],time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
group_by(id,time1) %>%
slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)
This outputs the following:
Source: local data frame [4 x 8]
Groups: id, time1 [4]
id event1 time1 event2 time2 event3 time3 seq_string
<dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr>
1 1 a 1 b 6 a 24 aba
2 1 a 24 b 30 a 42 aba
3 2 a 1 b 6 a 24 aba
4 2 a 2 b 6 a 24 aba
If you want just the counts, you can group_by(id) then count() as in the original code snippet.
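As a hedged alternative for arbitrary-length sequences, the search can also be written as a short recursion in base R, without building the full self-join. This is only a sketch under stated assumptions: count_seq, reachable and event_seq are illustrative names (event_seq is used instead of seq to avoid masking the base function), the corrected time value 42 is used, and each row with the first event counts as one distinct starting point.

```r
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a", "b", "b", "b", "a", "b", "a",
                           "a", "a", "b", "a", "a"),
                 time = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24),
                 stringsAsFactors = FALSE)
event_seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# Count, for the rows d of a single id, how many first events start a valid sequence
count_seq <- function(d, events, lb, ub) {
  # Can the sequence be completed from time t, given that event k comes next?
  reachable <- function(t, k) {
    if (k > length(events)) return(TRUE)
    gap <- d$time - t
    nxt <- d$time[d$event == events[k] & gap >= lb[k] & gap <= ub[k]]
    any(vapply(nxt, reachable, logical(1), k = k + 1))
  }
  # Candidate first events, restricted by the first pair of bounds
  starts <- d$time[d$event == events[1] & d$time >= lb[1] & d$time <= ub[1]]
  sum(vapply(starts, reachable, logical(1), k = 2))
}

counts <- sapply(split(df, df$id), count_seq,
                 events = event_seq, lb = time_LB, ub = time_UB)
data.frame(id = names(counts), count = unname(counts))
```

Because every starting row is counted at most once, the rule that sequences must have different first events is satisfied without a separate de-duplication step.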
Perhaps it's easier to represent event sequences as strings and use regex:
df.str = lapply(split(df, df$id), function(d) {
z = rep('-', tail(d,1)$time); z[d$time] = as.character(d$event); z })
df.str = lapply(df.str, paste, collapse='')
# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"
df1 = lapply(df.str, function(s) sum(gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl=TRUE)[[1]] > 0)) # gregexpr returns -1 when nothing matches, so count positive positions
> data.frame(id=names(df1), count=unlist(df1))
# id count
# 1 1 2
# 2 2 2
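The quantifiers in the pattern follow from the time bounds: a gap of 2 to 8 time units between two events leaves 1 to 7 characters in between, hence .{1,7}, and a gap of 12 to 18 gives .{11,17}. A minimal sketch of building the pattern from the bounds, assuming integer times, at most one event per time point, and no restriction on the first event; event_seq, gaps, pattern and count_matches are illustrative names:

```r
event_seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# A gap of g time units leaves g - 1 characters between the two events
gaps <- sprintf(".{%d,%d}", time_LB[-1] - 1, time_UB[-1] - 1)

# The lookahead (?= ... ) lets matches overlap, so every start is tested
pattern <- paste0("(?=", event_seq[1],
                  paste0(gaps, event_seq[-1], collapse = ""), ")")
pattern

# gregexpr returns -1 when there is no match, so count positive positions
count_matches <- function(s, pattern) {
  sum(gregexpr(pattern, s, perl = TRUE)[[1]] > 0)
}

count_matches("a-b--b-----b-----------a-----b-----------a", pattern)
count_matches("aa---b----------a------a", pattern)
```

The two strings are the ones printed for ids 1 and 2 above; the bounds on the first event (time_LB[1], time_UB[1]) are ignored in this sketch.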