I have a dataset of drugs being administered over time. I want to create a group for each consecutive block of the same drug. I figured out a simple way to do this with a for-loop that I can apply to each patient's set of drugs.
But I am curious whether there is a simple way to do this within the tidyverse.
Not that it matters much; I am mainly curious whether a ready-made function already exists for this problem.
## Set-up
library(tibble)
library(lubridate) # for today()
have <- tibble(
patinet = c(1),
date = seq(today(), today()+11,1),
drug = c(rep("a",3), rep("b",3), rep("c",3), rep("a",3))
)
## Want
want <- tibble(
patinet = c(1),
date = seq(today(), today()+11,1),
drug = c(rep("a",3), rep("b",3), rep("c",3), rep("a",3)),
grp = sort(rep(1:4,3))
)
> have
# A tibble: 12 × 3
patinet date drug
<dbl> <date> <chr>
1 1 2022-03-16 a
2 1 2022-03-17 a
3 1 2022-03-18 a
4 1 2022-03-19 b
5 1 2022-03-20 b
6 1 2022-03-21 b
7 1 2022-03-22 c
8 1 2022-03-23 c
9 1 2022-03-24 c
10 1 2022-03-25 a
11 1 2022-03-26 a
12 1 2022-03-27 a
> want
# A tibble: 12 × 4
patinet date drug grp
<dbl> <date> <chr> <int>
1 1 2022-03-16 a 1
2 1 2022-03-17 a 1
3 1 2022-03-18 a 1
4 1 2022-03-19 b 2
5 1 2022-03-20 b 2
6 1 2022-03-21 b 2
7 1 2022-03-22 c 3
8 1 2022-03-23 c 3
9 1 2022-03-24 c 3
10 1 2022-03-25 a 4
11 1 2022-03-26 a 4
12 1 2022-03-27 a 4
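You can use data.table::rleid(), which numbers each run of consecutive identical values. With several patients, add group_by(patinet) before the mutate() so the numbering restarts for each patient.
library(dplyr)
have %>% mutate(group = data.table::rleid(drug))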
# A tibble: 12 x 4
patinet date drug group
<dbl> <date> <chr> <int>
1 1 2022-03-16 a 1
2 1 2022-03-17 a 1
3 1 2022-03-18 a 1
4 1 2022-03-19 b 2
5 1 2022-03-20 b 2
6 1 2022-03-21 b 2
7 1 2022-03-22 c 3
8 1 2022-03-23 c 3
9 1 2022-03-24 c 3
10 1 2022-03-25 a 4
11 1 2022-03-26 a 4
12 1 2022-03-27 a 4
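If pulling in data.table only for rleid() is undesirable, the same run numbering can be sketched in base R with rle(). This is only an illustration: the two-patient data frame `have2` and the helper `run_id()` are hypothetical names, not part of the question or the answer above. Recent dplyr versions (1.1.0 and later) also provide consecutive_id(), which should serve the same purpose inside mutate().

```r
# Hypothetical two-patient example; column names follow the question.
have2 <- data.frame(
  patinet = rep(1:2, each = 6),
  drug    = c("a", "a", "b", "b", "a", "a",
              "c", "c", "c", "a", "a", "b")
)

# Hypothetical helper: number each run of identical consecutive values.
run_id <- function(x) {
  r <- rle(as.character(x))
  rep(seq_along(r$lengths), r$lengths)
}

# ave() hands each patient's row indices to FUN, so the numbering
# restarts at 1 for every patient.
have2$grp <- ave(seq_along(have2$drug), have2$patinet,
                 FUN = function(i) run_id(have2$drug[i]))

print(have2)
```

Note that drug "a" appears twice for patient 1 and receives two different group numbers, which is the behaviour the question asks for.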
I have a time-series dataset of daily consumption which looks like the following:
consumption <- data.frame(
date = as.Date(c('2020-06-01','2020-06-02','2020-06-03','2020-06-03',
'2020-06-03','2020-06-04','2020-06-05','2020-06-05')),
val = c(10,20,31,32,33,40,51,52)
)
consumption <- consumption %>%
group_by(date) %>%
mutate(n = n(), record = row_number()) %>%
ungroup()
consumption
# A tibble: 8 × 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-03 32 3 2
5 2020-06-03 33 3 3
6 2020-06-04 40 1 1
7 2020-06-05 51 2 1
8 2020-06-05 52 2 2
Some days have more than one row in the dataset. I would like to split this into groups covering all possible combinations, with one row per day in each group, such as:
Group 1:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 31 1
4 2020-06-04 40 1
5 2020-06-05 51 1
Group 2:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 31 1
4 2020-06-04 40 1
5 2020-06-05 52 2
Group 3:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 32 2
4 2020-06-04 40 1
5 2020-06-05 51 1
Group 4:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 32 2
4 2020-06-04 40 1
5 2020-06-05 52 2
Group 5:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 33 3
4 2020-06-04 40 1
5 2020-06-05 51 1
Group 6:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 33 3
4 2020-06-04 40 1
5 2020-06-05 52 2
I've tried the following solution, but it does not produce the desired results.
library(dplyr)
library(purrr)
out <- consumption %>%
filter(n > 1) %>%
group_split(date, rn = row_number()) %>%
map(~ bind_rows(consumption %>%
filter(n == 1), .x %>%
select(-rn)) %>%
arrange(date))
Any help with this would be much appreciated.
Many thanks,
We could filter the rows where 'n' is greater than 1, group_split by 'date' and the row number, then bind each piece with the rows of the original data where 'n' is 1
library(dplyr)
library(purrr)
out <- consumption %>%
filter(n > 1) %>%
group_split(date, rn = row_number()) %>%
map(~ bind_rows(consumption %>%
filter(n == 1), .x %>%
select(-rn)) %>%
arrange(date))
-output
> out
[[1]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
[[2]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
[[3]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
With the updated data, we create the row_number(), then split it by the 'date' column (as in @ThomasIsCoding's solution), use crossing (from tidyr) to expand the combinations, loop over the rows with pmap, and slice the rows of the original data based on the row index
library(tidyr)
library(tibble)
consumption %>%
transmute(date, rn = row_number()) %>%
deframe %>%
split(names(.)) %>%
invoke(crossing, .) %>%
pmap(~ consumption %>%
slice(c(...))) %>%
unname
-output
[[1]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[2]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[3]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[4]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[5]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[6]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
Maybe you can try the code below
with(
consumption,
apply(
expand.grid(
split(seq_along(date), date)
),
1,
function(k) consumption[k, ]
)
)
which gives
[[1]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[2]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[3]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[4]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[5]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[6]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
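The idea shared by the last two answers is: collect the row indices for each date, take the Cartesian product of those index sets, and subset the original data once per combination. A minimal base R sketch of that idea follows; the names `idx`, `grid` and `groups` are made up for illustration, and the data is rebuilt without the helper columns `n` and `record`.

```r
consumption <- data.frame(
  date = as.Date(c("2020-06-01", "2020-06-02", "2020-06-03", "2020-06-03",
                   "2020-06-03", "2020-06-04", "2020-06-05", "2020-06-05")),
  val  = c(10, 20, 31, 32, 33, 40, 51, 52)
)

# Row indices belonging to each date.
idx <- split(seq_len(nrow(consumption)), consumption$date)

# One row per combination; the number of combinations is the product
# of the per-date row counts (1 * 1 * 3 * 1 * 2 here).
grid <- expand.grid(idx)
stopifnot(nrow(grid) == prod(lengths(idx)))

# Subset the original data once per combination.
groups <- lapply(seq_len(nrow(grid)),
                 function(i) consumption[unlist(grid[i, ]), ])

length(groups)
groups[[1]]
```

expand.grid() varies the earliest factor fastest, while tidyr's crossing() sorts its output, which is why the two answers return the same six groups in a different order.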
Here's an approach using some basic dplyr and tidyr functions.
First, complete the data for every date / record combination. Then, within each date, fill the missing values with the prior value, and reshape wide.
library(tidyverse)
consumption %>%
complete(date, record) %>%
group_by(date) %>% fill(val) %>% ungroup() %>%
pivot_wider(id_cols = -n, names_from = record, values_from = val) # name id_cols explicitly; recent tidyr versions no longer accept it by position
# A tibble: 5 x 4
date `1` `2` `3`
<date> <dbl> <dbl> <dbl>
1 2020-06-01 10 10 10
2 2020-06-02 20 20 20
3 2020-06-03 31 32 33
4 2020-06-04 40 40 40
5 2020-06-05 51 52 52
I would like to create a new data frame from the df data frame shown below, with only one row per day. For example, instead of 4 rows for 2021-07-01 there would be just 1, with the values in the numeric columns for that day summed.
df <- structure(
list(Id=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
date1 = c("2021-07-01","2021-07-01","2021-07-01","2021-07-01","2021-04-02",
"2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-02","2021-04-03",
"2021-04-03","2021-04-03","2021-04-03","2021-04-03","2021-04-08","2021-04-08",
"2021-04-07","2021-04-09","2021-04-10","2021-04-10"),
Week= c("Thursday","Thursday","Thursday","Thursday","Friday","Friday","Friday","Friday",
"Friday","Friday","Saturday","Saturday","Saturday","Saturday","Saturday","Thursday",
"Thursday","Friday","Friday","Saturday","Saturday"),
DTPE = c("Ho","Ho","Ho","Ho","","","","","","","","","","","","","","","","Ho","Ho"),
D1 = c(8,1,9, 3,5,4,7,6,3,8,2,3,4,6,7,8,4,2,6,2,3), DR01 = c(4,1,4,3,3,4,3,6,3,7,2,3,4,6,7,8,4,2,6,7,3),
DR02 = c(8,1,4,3,3,4,1,6,3,7,2,3,4,6,7,8,4,2,6,2,3), DR03 = c(7,5,4,3,3,4,1,5,3,3,2,3,4,6,7,8,4,2,6,4,3),
DR04= c(4,5,6,7,3,2,7,4,2,1,2,3,4,6,7,8,4,2,6,4,3),DR05 = c(9,5,4,3,3,2,1,5,3,7,2,3,4,7,7,8,4,2,6,4,3)),
class = "data.frame", row.names = c(NA, -21L))
> df
Id date1 Week DTPE D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-07-01 Thursday Ho 8 4 8 7 4 9
2 1 2021-07-01 Thursday Ho 1 1 1 5 5 5
3 1 2021-07-01 Thursday Ho 9 4 4 4 6 4
4 1 2021-07-01 Thursday Ho 3 3 3 3 7 3
5 1 2021-04-02 Friday 5 3 3 3 3 3
6 1 2021-04-02 Friday 4 4 4 4 2 2
7 1 2021-04-02 Friday 7 3 1 1 7 1
8 1 2021-04-02 Friday 6 6 6 5 4 5
9 1 2021-04-02 Friday 3 3 3 3 2 3
10 1 2021-04-02 Friday 8 7 7 3 1 7
11 1 2021-04-03 Saturday 2 2 2 2 2 2
12 1 2021-04-03 Saturday 3 3 3 3 3 3
13 1 2021-04-03 Saturday 4 4 4 4 4 4
14 1 2021-04-03 Saturday 6 6 6 6 6 7
15 1 2021-04-03 Saturday 7 7 7 7 7 7
16 1 2021-04-08 Thursday 8 8 8 8 8 8
17 1 2021-04-08 Thursday 4 4 4 4 4 4
18 1 2021-04-07 Friday 2 2 2 2 2 2
19 1 2021-04-09 Friday 6 6 6 6 6 6
20 1 2021-04-10 Saturday Ho 2 7 2 4 4 4
21 1 2021-04-10 Saturday Ho 3 3 3 3 3 3
We may group by 'Id', 'date1' and 'Week', then summarise the numeric columns by taking their sum inside across():
library(dplyr)
df %>% group_by(Id, date1, Week) %>%
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)), .groups = 'drop')
You can perform this using the following code:
library(dplyr)
df %>%
group_by(Id, date1, Week) %>%
select(D1:DR05) %>%
summarise_all(sum)
# A tibble: 7 × 9
# Groups: Id, date1 [7]
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-03 Saturday 22 22 22 22 22 23
3 1 2021-04-07 Friday 2 2 2 2 2 2
4 1 2021-04-08 Thursday 12 12 12 12 12 12
5 1 2021-04-09 Friday 6 6 6 6 6 6
6 1 2021-04-10 Saturday 5 10 5 7 7 7
7 1 2021-07-01 Thursday 21 12 16 19 22 21
You might also want to convert the date1 column to a Date object, for example with lubridate's ymd() inside a mutate().
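As a rough sketch of that conversion step, base R's as.Date() is enough here because the strings are already in ISO "YYYY-MM-DD" form. The reduced data frame `df_small` and the result name `daily` are hypothetical and only stand in for the full df from the question.

```r
# Hypothetical subset of the question's data: two days, two value columns.
df_small <- data.frame(
  Id    = 1,
  date1 = c("2021-07-01", "2021-07-01", "2021-04-02", "2021-04-02"),
  D1    = c(8, 1, 5, 4),
  DR01  = c(4, 1, 3, 4)
)

# Convert the character column to Date before aggregating.
df_small$date1 <- as.Date(df_small$date1)

# One row per Id and day, with the value columns summed.
daily <- aggregate(cbind(D1, DR01) ~ Id + date1, data = df_small, FUN = sum)
daily <- daily[order(daily$date1), ]

print(daily)
```

Converting first means the result sorts chronologically and can be used directly in date arithmetic later on.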
Base R with aggregate:
aggregate(cbind(D1, DR01, DR02, DR03, DR04, DR05) ~ Id+date1+Week, df, sum)
Output:
Id date1 Week D1 DR01 DR02 DR03 DR04 DR05
1 1 2021-04-02 Friday 33 26 24 19 19 21
2 1 2021-04-07 Friday 2 2 2 2 2 2
3 1 2021-04-09 Friday 6 6 6 6 6 6
4 1 2021-04-03 Saturday 22 22 22 22 22 23
5 1 2021-04-10 Saturday 5 10 5 7 7 7
6 1 2021-04-08 Thursday 12 12 12 12 12 12
7 1 2021-07-01 Thursday 21 12 16 19 22 21
I have a dataset with 110 participants who answered the same questionnaire in multiple sessions within three timeframes. The number of sessions per timeframe differs within and between participants.
I need a new variable that numbers the sessions in which participant X completed the questionnaire within timeframe Y, running from 1 to the number of sessions that participant completed within that timeframe.
Example: I have
participant timeframe date
1 1 2021-04-30 09:12:00
1 1 2021-04-30 10:03:00
1 1 2021-05-02 09:20:00
2 1 2021-04-30 13:00:00
2 1 2021-05-02 12:13:00
1 2 2021-05-05 08:34:00
1 2 2021-05-06 14:15:00
2 2 2021-05-05 07:12:00
2 2 2021-05-05 14:13:00
2 2 2021-05-08 15:22:00
I need:
participant timeframe date session per timeframe
1 1 2021-04-30 09:12:00 1
1 1 2021-04-30 10:03:00 2
1 1 2021-05-02 09:20:00 3
2 1 2021-04-30 13:00:00 1
2 1 2021-05-02 12:13:00 2
1 2 2021-05-05 08:34:00 1
1 2 2021-05-06 14:15:00 2
2 2 2021-05-05 07:12:00 1
2 2 2021-05-05 14:13:00 2
2 2 2021-05-08 15:22:00 3
Hope that somebody can help! Thank you so much in advance.
Here is a tidyverse approach using row_number():
library(dplyr)
library(tibble)
dat <- tribble(~participant, ~timeframe, ~date,
1, 1, "2021-04-30 09:12:00",
1, 1, "2021-04-30 10:03:00",
1, 1, "2021-05-02 09:20:00",
2, 1, "2021-04-30 13:00:00",
2, 1, "2021-05-02 12:13:00",
1, 2, "2021-05-05 08:34:00",
1, 2, "2021-05-06 14:15:00",
2, 2, "2021-05-05 07:12:00",
2, 2, "2021-05-05 14:13:00",
2, 2, "2021-05-08 15:22:00") %>%
mutate(date = as.POSIXct(date))
dat %>%
group_by(participant, timeframe) %>%
mutate(session = row_number())
#> # A tibble: 10 x 4
#> # Groups: participant, timeframe [4]
#> participant timeframe date session
#> <dbl> <dbl> <dttm> <int>
#> 1 1 1 2021-04-30 09:12:00 1
#> 2 1 1 2021-04-30 10:03:00 2
#> 3 1 1 2021-05-02 09:20:00 3
#> 4 2 1 2021-04-30 13:00:00 1
#> 5 2 1 2021-05-02 12:13:00 2
#> 6 1 2 2021-05-05 08:34:00 1
#> 7 1 2 2021-05-06 14:15:00 2
#> 8 2 2 2021-05-05 07:12:00 1
#> 9 2 2 2021-05-05 14:13:00 2
#> 10 2 2 2021-05-08 15:22:00 3
Created on 2021-04-30 by the reprex package (v0.3.0)
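Alternatively, use rleid. It numbers runs of identical consecutive values, so it matches row_number() only when no two consecutive timestamps within a group are equal:
library(dplyr)
library(data.table)
dat %>%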
group_by(participant, timeframe) %>%
mutate(session_per_timeframe = rleid(date))
# A tibble: 10 x 4
# Groups: participant, timeframe [4]
participant timeframe date session_per_timeframe
<dbl> <dbl> <dttm> <int>
1 1 1 2021-04-30 09:12:00 1
2 1 1 2021-04-30 10:03:00 2
3 1 1 2021-05-02 09:20:00 3
4 2 1 2021-04-30 13:00:00 1
5 2 1 2021-05-02 12:13:00 2
6 1 2 2021-05-05 08:34:00 1
7 1 2 2021-05-06 14:15:00 2
8 2 2 2021-05-05 07:12:00 1
9 2 2 2021-05-05 14:13:00 2
10 2 2 2021-05-08 15:22:00 3
My answer, which relies on the row order of the example (each change of participant starts a new block):
data %>% group_by(grp = data.table::rleid(participant)) %>%
mutate(session = row_number())
# A tibble: 10 x 5
# Groups: grp [4]
participant timeframe date grp session
<int> <int> <chr> <int> <int>
1 1 1 2021-04-30 09:12:00 1 1
2 1 1 2021-04-30 10:03:00 1 2
3 1 1 2021-05-02 09:20:00 1 3
4 2 1 2021-04-30 13:00:00 2 1
5 2 1 2021-05-02 12:13:00 2 2
6 1 2 2021-05-05 08:34:00 3 1
7 1 2 2021-05-06 14:15:00 3 2
8 2 2 2021-05-05 07:12:00 4 1
9 2 2 2021-05-05 14:13:00 4 2
10 2 2 2021-05-08 15:22:00 4 3
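For comparison, the same per-group counter can be sketched in base R with ave(). The data is rebuilt from the question; sorting by date first and fixing the time zone to UTC are assumptions added so that the numbering follows chronological order regardless of how the rows arrive.

```r
dat <- data.frame(
  participant = c(1, 1, 1, 2, 2, 1, 1, 2, 2, 2),
  timeframe   = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  date = as.POSIXct(c("2021-04-30 09:12:00", "2021-04-30 10:03:00",
                      "2021-05-02 09:20:00", "2021-04-30 13:00:00",
                      "2021-05-02 12:13:00", "2021-05-05 08:34:00",
                      "2021-05-06 14:15:00", "2021-05-05 07:12:00",
                      "2021-05-05 14:13:00", "2021-05-08 15:22:00"),
                    tz = "UTC")
)

# Sort so that sessions are counted in chronological order within each group.
dat <- dat[order(dat$participant, dat$timeframe, dat$date), ]

# seq_along() restarts at 1 for every participant/timeframe combination.
dat$session <- ave(seq_along(dat$date), dat$participant, dat$timeframe,
                   FUN = seq_along)

print(dat)
```

Unlike the rleid(participant) variant, this does not depend on how the participant blocks happen to be ordered in the input.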
I'm looking for a way to calculate the number of days a participant (id) spent in a study.
An example dataset looks like this:
data <- data.frame(date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02",
"2020-12-04", "2020-12-05", "2020-12-08",
"2020-11-22", "2020-11-21", "2020-11-24",
"2020-11-25", "2020-11-30", "2020-11-29",
"2021-01-29", "2021-01-20", "2021-01-30",
"2021-02-01", "2021-02-04", "2021-02-04")),
id = rep(1:3, each = 6))
data <- dplyr::arrange(data, id, date)
data
date id
1 2020-11-29 1
2 2020-11-30 1
3 2020-12-02 1
4 2020-12-04 1
5 2020-12-05 1
6 2020-12-08 1
7 2020-11-21 2
8 2020-11-22 2
9 2020-11-24 2
10 2020-11-25 2
11 2020-11-29 2
12 2020-11-30 2
13 2021-01-20 3
14 2021-01-29 3
15 2021-01-30 3
16 2021-02-01 3
17 2021-02-04 3
18 2021-02-04 3
What I'd like to have is a new column, days_from_start, that takes the first day for every id and sets it to 0, then computes the number of days elapsed since that first day for every other row within each id. Something like this:
data$days_from_start <- c(0, 1, 3, 5, 6, 9,
0, 1, 3, 4, 8, 9,
0, 9, 10, 12, 15, 15)
data
date id days_from_start
1 2020-11-29 1 0
2 2020-11-30 1 1
3 2020-12-02 1 3
4 2020-12-04 1 5
5 2020-12-05 1 6
6 2020-12-08 1 9
7 2020-11-21 2 0
8 2020-11-22 2 1
9 2020-11-24 2 3
10 2020-11-25 2 4
11 2020-11-29 2 8
12 2020-11-30 2 9
13 2021-01-20 3 0
14 2021-01-29 3 9
15 2021-01-30 3 10
16 2021-02-01 3 12
17 2021-02-04 3 15
18 2021-02-04 3 15
Any ideas?
Thank you
Simply group the data by id, work out the earliest date for each id, and then calculate the differences.
library(dplyr)
data <- dplyr::arrange(data, id, date)
data %>%
group_by(id) %>%
mutate(
start_date=min(date),
days_from_start=as.numeric(date-start_date)
) %>%
ungroup() %>%
select(-start_date)
# A tibble: 18 x 3
date id days_from_start
<date> <int> <dbl>
1 2020-11-29 1 0
2 2020-11-30 1 1
3 2020-12-02 1 3
4 2020-12-04 1 5
5 2020-12-05 1 6
6 2020-12-08 1 9
7 2020-11-21 2 0
8 2020-11-22 2 1
9 2020-11-24 2 3
10 2020-11-25 2 4
11 2020-11-29 2 8
12 2020-11-30 2 9
13 2021-01-20 3 0
14 2021-01-29 3 9
15 2021-01-30 3 10
16 2021-02-01 3 12
17 2021-02-04 3 15
18 2021-02-04 3 15
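The same calculation can be sketched in base R, assuming the example data from the question. Working on the numeric form of the dates (days since 1970-01-01) keeps ave() on plain numbers; the intermediate name `start` is made up for illustration.

```r
data <- data.frame(
  date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02",
                   "2020-12-04", "2020-12-05", "2020-12-08",
                   "2020-11-22", "2020-11-21", "2020-11-24",
                   "2020-11-25", "2020-11-30", "2020-11-29",
                   "2021-01-29", "2021-01-20", "2021-01-30",
                   "2021-02-01", "2021-02-04", "2021-02-04")),
  id = rep(1:3, each = 6)
)
data <- data[order(data$id, data$date), ]

# Earliest date per id, expressed as days since 1970-01-01.
start <- ave(as.numeric(data$date), data$id, FUN = min)

# Days elapsed since each id's first date.
data$days_from_start <- as.numeric(data$date) - start

print(data)
```

Because min() is taken per id, the result does not depend on the rows being sorted; the ordering step only makes the printed output easier to read.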