I have a dataset of drugs administered over time. I want to create a group for each consecutive block of the same drug being administered. I figured out a simple way to do this with a for-loop that I can apply to each patient's set of drugs.
BUT I am curious whether there is a simple way to do this within the realm of the tidyverse?
Not that it matters much; I am mostly curious whether a simple method already exists for this problem.
Set-up
have <- tibble(
  patient = c(1),
  date = seq(today(), today() + 11, 1),
  drug = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("a", 3))
)
## Want
want <- tibble(
  patient = c(1),
  date = seq(today(), today() + 11, 1),
  drug = c(rep("a", 3), rep("b", 3), rep("c", 3), rep("a", 3)),
  grp = sort(rep(1:4, 3))
)
> have
# A tibble: 12 × 3
patient date drug
<dbl> <date> <chr>
1 1 2022-03-16 a
2 1 2022-03-17 a
3 1 2022-03-18 a
4 1 2022-03-19 b
5 1 2022-03-20 b
6 1 2022-03-21 b
7 1 2022-03-22 c
8 1 2022-03-23 c
9 1 2022-03-24 c
10 1 2022-03-25 a
11 1 2022-03-26 a
12 1 2022-03-27 a
> want
# A tibble: 12 × 4
patient date drug grp
<dbl> <date> <chr> <int>
1 1 2022-03-16 a 1
2 1 2022-03-17 a 1
3 1 2022-03-18 a 1
4 1 2022-03-19 b 2
5 1 2022-03-20 b 2
6 1 2022-03-21 b 2
7 1 2022-03-22 c 3
8 1 2022-03-23 c 3
9 1 2022-03-24 c 3
10 1 2022-03-25 a 4
11 1 2022-03-26 a 4
12 1 2022-03-27 a 4
You can use data.table::rleid:
have %>% mutate(group = data.table::rleid(drug))
# A tibble: 12 x 4
patient date drug group
<dbl> <date> <chr> <int>
1 1 2022-03-16 a 1
2 1 2022-03-17 a 1
3 1 2022-03-18 a 1
4 1 2022-03-19 b 2
5 1 2022-03-20 b 2
6 1 2022-03-21 b 2
7 1 2022-03-22 c 3
8 1 2022-03-23 c 3
9 1 2022-03-24 c 3
10 1 2022-03-25 a 4
11 1 2022-03-26 a 4
12 1 2022-03-27 a 4
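If you want to stay entirely within the tidyverse, a minimal sketch (assuming dplyr is attached and the rows are already ordered by date within each patient) is to start a new group whenever drug changes; with dplyr >= 1.1.0 the same run-length id is also available directly as consecutive_id():

library(dplyr)

have %>%
  group_by(patient) %>%
  mutate(grp = cumsum(drug != lag(drug, default = first(drug))) + 1) %>%  # new group at every change of drug
  ungroup()

# dplyr >= 1.1.0
have %>%
  mutate(grp = consecutive_id(drug), .by = patient)

With a single patient the grouping is not strictly needed, but it keeps the run numbering from spilling across patients in the full data.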
Related
I want to use the column grp to create suffixes when reshaping with pivot_longer()/pivot_wider() from the tidyverse.
Say I have data like this:
dta <- tibble(grp = rep(c('one', 'two'), each = 3),
              date = rep(c('2022-12-31', '2021-12-31'), 3),
              a = 1:6, b = 12:7)
dta
# A tibble: 6 x 4
grp date a b
<chr> <chr> <int> <int>
1 one 2022-12-31 1 12
2 one 2021-12-31 2 11
3 one 2022-12-31 3 10
4 two 2021-12-31 4 9
5 two 2022-12-31 5 8
6 two 2021-12-31 6 7
and want to get to something like this:
# A tibble: 6 x 5
date names.one values.one names.two values.two
<chr> <chr> <int> <chr> <int>
1 2022-12-31 a 1 a 4
2 2022-12-31 b 12 b 9
3 2021-12-31 a 2 a 5
4 2021-12-31 b 11 b 8
5 2022-12-31 a 3 a 6
6 2022-12-31 b 10 b 7
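One possible route (a sketch, assuming the i-th row of grp "one" should be paired with the i-th row of grp "two", and that the date column in the result should come from "one") is to index the rows within each grp, pivot longer, and then pivot wider with grp supplying the suffixes:

library(dplyr)
library(tidyr)

dta %>%
  group_by(grp) %>%
  mutate(row = row_number()) %>%    # pair rows across the two groups by position
  ungroup() %>%
  pivot_longer(c(a, b), names_to = "names", values_to = "values") %>%
  group_by(grp, row) %>%
  mutate(item = row_number()) %>%   # 1 = a, 2 = b within each original row
  ungroup() %>%
  pivot_wider(id_cols = c(row, item),
              names_from = grp,
              values_from = c(date, names, values),
              names_sep = ".") %>%
  select(date = date.one, names.one, values.one, names.two, values.two)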
This question already has answers here: How to create a consecutive group number.
Let's say we have the following data frame:
date         x   counter
2021-09-30   a   1
2021-09-30   b   2
2021-09-30   c   3
2021-12-31   e   1
2021-12-31   f   2
2021-12-31   g   3
2022-03-31   t   1
2022-03-31   u   2
2022-03-31   z   3
I need to create a new, monotonically increasing ID based on the date variable.
For instance, the new data frame should appear as follows:
date         x   counter   new counter
2021-09-30   a   1         1
2021-09-30   b   2         1
2021-09-30   c   3         1
2021-12-31   e   1         2
2021-12-31   f   2         2
2021-12-31   g   3         2
2022-03-31   t   1         3
2022-03-31   u   2         3
2022-03-31   z   3         3
I'm running R version 3.6.3; I hope my question is clear enough.
You can use dplyr::cur_group_id() to do the job.
library(dplyr)
df %>%
group_by(date) %>%
mutate(new_counter = cur_group_id())
# A tibble: 9 × 4
# Groups: date [3]
date x counter new_counter
<chr> <chr> <int> <int>
1 2021-09-30 a 1 1
2 2021-09-30 b 2 1
3 2021-09-30 c 3 1
4 2021-12-31 e 1 2
5 2021-12-31 f 2 2
6 2021-12-31 g 3 2
7 2022-03-31 t 1 3
8 2022-03-31 u 2 3
9 2022-03-31 z 3 3
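If your dplyr version predates cur_group_id() (it was introduced in dplyr 1.0.0), a couple of equivalent one-liners are sketched below, using the same df as above; match() numbers the dates by order of first appearance, while dense_rank() numbers them by sorted order, which coincide here because the dates are already sorted:

library(dplyr)

df %>% mutate(new_counter = match(date, unique(date)))
# or
df %>% mutate(new_counter = dense_rank(date))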
I have a time-series dataset of daily consumption which looks like the following:
library(dplyr)

consumption <- data.frame(
  date = as.Date(c('2020-06-01', '2020-06-02', '2020-06-03', '2020-06-03',
                   '2020-06-03', '2020-06-04', '2020-06-05', '2020-06-05')),
  val = c(10, 20, 31, 32, 33, 40, 51, 52)
)
consumption <- consumption %>%
  group_by(date) %>%
  mutate(n = n(), record = row_number()) %>%
  ungroup()
consumption
# A tibble: 8 × 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-03 32 3 2
5 2020-06-03 33 3 3
6 2020-06-04 40 1 1
7 2020-06-05 51 2 1
8 2020-06-05 52 2 2
Some days have more than one row in the dataset. I would like to split this into groups covering all possible combinations, such as:
Group 1:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 31 1
4 2020-06-04 40 1
5 2020-06-05 51 1
Group 2:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 31 1
4 2020-06-04 40 1
5 2020-06-05 52 2
Group 3:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 32 2
4 2020-06-04 40 1
5 2020-06-05 51 1
Group 4:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 32 2
4 2020-06-04 40 1
5 2020-06-05 52 2
Group 5:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 33 3
4 2020-06-04 40 1
5 2020-06-05 51 1
Group 6:
date val record
1 2020-06-01 10 1
2 2020-06-02 20 1
3 2020-06-03 33 3
4 2020-06-04 40 1
5 2020-06-05 52 2
I've tried the following solution, but it does not produce the desired results.
library(dplyr)
library(purrr)
out <- consumption %>%
  filter(n > 1) %>%
  group_split(date, rn = row_number()) %>%
  map(~ bind_rows(consumption %>% filter(n == 1),
                  .x %>% select(-rn)) %>%
        arrange(date))
Your help getting around this would be much appreciated.
Many thanks!
We could filter the rows where 'n' is greater than 1, group_split by 'date' and the row number, then bind each piece with the rows where 'n' is 1:
library(dplyr)
library(purrr)
out <- consumption %>%
  filter(n > 1) %>%
  group_split(date, rn = row_number()) %>%
  map(~ bind_rows(consumption %>% filter(n == 1),
                  .x %>% select(-rn)) %>%
        arrange(date))
-output
> out
[[1]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
[[2]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
[[3]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
With the updated data, we create a row number, split it by the 'date' column (as in @ThomasIsCoding's solution), use crossing() (from tidyr) to expand to all combinations, and loop over the rows with pmap() (from purrr), slicing the rows of the original data based on the row indices:
library(tidyr)
library(tibble)
consumption %>%
  transmute(date, rn = row_number()) %>%   # keep only the date and a row index
  deframe %>%                              # named vector: names = date, values = row index
  split(names(.)) %>%                      # list of row indices, one element per date
  invoke(crossing, .) %>%                  # all combinations of one row index per date
  pmap(~ consumption %>%
         slice(c(...))) %>%                # extract the matching rows for each combination
  unname
-output
[[1]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[2]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[3]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[4]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[5]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[6]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
Maybe you can try the base R code below, which builds all combinations of one row index per date with expand.grid() and then extracts the corresponding rows:
with(
  consumption,
  apply(
    expand.grid(
      split(seq_along(date), date)
    ),
    1,
    function(k) consumption[k, ]
  )
)
which gives
[[1]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[2]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[3]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[4]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[5]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[6]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
Here's an approach using some basic dplyr and tidyr functions.
First, complete() the data for every date/record combination. Then fill the missing values within each date with the prior value, and reshape wide.
library(tidyverse)
consumption %>%
  complete(date, record) %>%
  group_by(date) %>% fill(val) %>% ungroup() %>%
  pivot_wider(-n, names_from = record, values_from = val)
# A tibble: 5 x 4
date `1` `2` `3`
<date> <dbl> <dbl> <dbl>
1 2020-06-01 10 10 10
2 2020-06-02 20 20 20
3 2020-06-03 31 32 33
4 2020-06-04 40 40 40
5 2020-06-05 51 52 52
I have a dataset with 110 participants who answered the same questionnaire in multiple sessions within three timeframes. The number of sessions per timeframe differs within and between participants.
I need a new variable that numbers the sessions participant X completed within timeframe Y consecutively, from 1 up to the number of sessions that participant completed in that timeframe.
Example: I have
participant timeframe date
1 1 2021-04-30 09:12:00
1 1 2021-04-30 10:03:00
1 1 2021-05-02 09:20:00
2 1 2021-04-30 13:00:00
2 1 2021-05-02 12:13:00
1 2 2021-05-05 08:34:00
1 2 2021-05-06 14:15:00
2 2 2021-05-05 07:12:00
2 2 2021-05-05 14:13:00
2 2 2021-05-08 15:22:00
I need:
participant timeframe date session per timeframe
1 1 2021-04-30 09:12:00 1
1 1 2021-04-30 10:03:00 2
1 1 2021-05-02 09:20:00 3
2 1 2021-04-30 13:00:00 1
2 1 2021-05-02 12:13:00 2
1 2 2021-05-05 08:34:00 1
1 2 2021-05-06 14:15:00 2
2 2 2021-05-05 07:12:00 1
2 2 2021-05-05 14:13:00 2
2 2 2021-05-08 15:22:00 3
Hope that somebody can help! Thank you so much in advance.
Here is a tidyverse approach using row_number():
library(dplyr)
library(tibble)
dat <- tribble(~participant, ~timeframe, ~date,
               1, 1, "2021-04-30 09:12:00",
               1, 1, "2021-04-30 10:03:00",
               1, 1, "2021-05-02 09:20:00",
               2, 1, "2021-04-30 13:00:00",
               2, 1, "2021-05-02 12:13:00",
               1, 2, "2021-05-05 08:34:00",
               1, 2, "2021-05-06 14:15:00",
               2, 2, "2021-05-05 07:12:00",
               2, 2, "2021-05-05 14:13:00",
               2, 2, "2021-05-08 15:22:00") %>%
  mutate(date = as.POSIXct(date))
dat %>%
  group_by(participant, timeframe) %>%
  mutate(session = row_number())
#> # A tibble: 10 x 4
#> # Groups: participant, timeframe [4]
#> participant timeframe date session
#> <dbl> <dbl> <dttm> <int>
#> 1 1 1 2021-04-30 09:12:00 1
#> 2 1 1 2021-04-30 10:03:00 2
#> 3 1 1 2021-05-02 09:20:00 3
#> 4 2 1 2021-04-30 13:00:00 1
#> 5 2 1 2021-05-02 12:13:00 2
#> 6 1 2 2021-05-05 08:34:00 1
#> 7 1 2 2021-05-06 14:15:00 2
#> 8 2 2 2021-05-05 07:12:00 1
#> 9 2 2 2021-05-05 14:13:00 2
#> 10 2 2 2021-05-08 15:22:00 3
Created on 2021-04-30 by the reprex package (v0.3.0)
Alternatively, use data.table::rleid(). Note that this matches row_number() here only because consecutive timestamps within each group are all distinct; rleid() gives a run of identical values a single id.
library(data.table)
dat %>%
  group_by(participant, timeframe) %>%
  mutate(session_per_timeframe = rleid(date))
# A tibble: 10 x 4
# Groups: participant, timeframe [4]
participant timeframe date session_per_timeframe
<dbl> <dbl> <dttm> <int>
1 1 1 2021-04-30 09:12:00 1
2 1 1 2021-04-30 10:03:00 2
3 1 1 2021-05-02 09:20:00 3
4 2 1 2021-04-30 13:00:00 1
5 2 1 2021-05-02 12:13:00 2
6 1 2 2021-05-05 08:34:00 1
7 1 2 2021-05-06 14:15:00 2
8 2 2 2021-05-05 07:12:00 1
9 2 2 2021-05-05 14:13:00 2
10 2 2 2021-05-08 15:22:00 3
My answer: group by runs of participant with data.table::rleid() and number the rows within each run. This relies on the rows already being ordered so that each participant/timeframe block is contiguous.
data %>%
  group_by(grp = data.table::rleid(participant)) %>%
  mutate(session = row_number())
# A tibble: 10 x 5
# Groups: grp [4]
participant timeframe date grp session
<int> <int> <chr> <int> <int>
1 1 1 2021-04-30 09:12:00 1 1
2 1 1 2021-04-30 10:03:00 1 2
3 1 1 2021-05-02 09:20:00 1 3
4 2 1 2021-04-30 13:00:00 2 1
5 2 1 2021-05-02 12:13:00 2 2
6 1 2 2021-05-05 08:34:00 3 1
7 1 2 2021-05-06 14:15:00 3 2
8 2 2 2021-05-05 07:12:00 4 1
9 2 2 2021-05-05 14:13:00 4 2
10 2 2 2021-05-08 15:22:00 4 3
I have two datasets: one is a baseline and the other is a follow-up dataset.
df1 is the baseline (cross-sectional) data with id, date, score1, score2, baseline level, and baseline grade.
df2 has id, date, score1, and score2 in long format, with multiple rows per id.
# note: cbind() builds a character matrix here, so every column ends up as
# character (or factor on R < 4.0) after as.data.frame()
df1 <- as.data.frame(cbind(id = c(1, 2, 3),
                           date = c("2020-06-03", "2020-07-02", "2020-06-11"),
                           score1 = c(6, 8, 5),
                           score2 = c(1, 1, 6),
                           baselevel = c(2, 2, 2),
                           basegrade = c("B", "B", "A")))
df2 <- as.data.frame(cbind(id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
                           date = c("2020-06-10", "2020-06-17", "2020-06-24",
                                    "2020-07-01", "2020-07-03", "2020-07-10",
                                    "2020-07-17", "2020-06-14", "2020-06-22",
                                    "2020-06-29"),
                           score1 = c(3, 1, 7, 8, 8, 6, 5, 5, 3, 5),
                           score2 = c(1, 4, 5, 4, 1, 1, 2, 6, 7, 1)))
This is what I want as a result of merging the two data frames.
id date score1 score2 baselevel basegrade
1 2020-06-03 6 1 2 "B"
1 2020-06-10 3 1 2 "B"
1 2020-06-17 1 4 2 "B"
1 2020-06-24 7 5 2 "B"
1 2020-07-01 8 4 2 "B"
2 2020-07-02 8 1 2 "B"
2 2020-07-03 8 1 2 "B"
2 2020-07-10 6 1 2 "B"
2 2020-07-17 5 2 2 "B"
3 2020-06-11 5 6 2 "A"
3 2020-06-14 5 6 2 "A"
3 2020-06-22 3 7 2 "A"
3 2020-06-29 5 1 2 "A"
I tried the two merge() calls below, but I still get NAs. What am I missing here?
Any help would be appreciated!
dfcombined1 <- merge(df1, df2, by=c("id","date"), all= TRUE)
dfcombined2 <- merge(df1, df2, by=intersect(names(df1), names(df2)), all= TRUE)
You can use bind_rows() in dplyr. (merge() with by = c("id", "date") gives NAs because the baseline dates in df1 never appear in df2, so nothing matches on the date key; stacking the rows and filling the baseline columns avoids the problem.)
library(dplyr)
library(tidyr)
bind_rows(df1, df2) %>%
  group_by(id) %>%
  fill(starts_with("base"), .direction = "updown") %>%
  arrange(date, .by_group = T)
# # A tibble: 13 x 6
# # Groups: id [3]
# id date score1 score2 baselevel basegrade
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2020-06-03 6 1 2 B
# 2 1 2020-06-10 3 1 2 B
# 3 1 2020-06-17 1 4 2 B
# 4 1 2020-06-24 7 5 2 B
# 5 1 2020-07-01 8 4 2 B
# 6 2 2020-07-02 8 1 2 B
# 7 2 2020-07-03 8 1 2 B
# 8 2 2020-07-10 6 1 2 B
# 9 2 2020-07-17 5 2 2 B
# 10 3 2020-06-11 5 6 2 A
# 11 3 2020-06-14 5 6 2 A
# 12 3 2020-06-22 3 7 2 A
# 13 3 2020-06-29 5 1 2 A
I think you are looking for this:
library(tidyverse)
#Code
df1 %>% bind_rows(df2) %>% arrange(id) %>% group_by(id) %>% fill(baselevel) %>% fill(basegrade)
Output:
# A tibble: 13 x 6
# Groups: id [3]
id date score1 score2 baselevel basegrade
<fct> <fct> <fct> <fct> <fct> <fct>
1 1 2020-06-03 6 1 2 B
2 1 2020-06-10 3 1 2 B
3 1 2020-06-17 1 4 2 B
4 1 2020-06-24 7 5 2 B
5 1 2020-07-01 8 4 2 B
6 2 2020-07-02 8 1 2 B
7 2 2020-07-03 8 1 2 B
8 2 2020-07-10 6 1 2 B
9 2 2020-07-17 5 2 2 B
10 3 2020-06-11 5 6 2 A
11 3 2020-06-14 5 6 2 A
12 3 2020-06-22 3 7 2 A
13 3 2020-06-29 5 1 2 A
A double merge, maybe (note that this keeps only the follow-up rows from df2 and drops the baseline rows themselves):
merge(merge(df1[-(3:4)], df2, all.y=TRUE)[-(3:4)], df1[-(2:4)], all.x=TRUE)
# id date score1 score2 baselevel basegrade
# 1 1 2020-06-10 3 1 2 B
# 2 1 2020-06-17 1 4 2 B
# 3 1 2020-06-24 7 5 2 B
# 4 1 2020-07-01 8 4 2 B
# 5 2 2020-07-03 8 1 2 B
# 6 2 2020-07-10 6 1 2 B
# 7 2 2020-07-17 5 2 2 B
# 8 3 2020-06-14 5 6 2 A
# 9 3 2020-06-22 3 7 2 A
# 10 3 2020-06-29 5 1 2 A
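For completeness, a join-based sketch (assuming the baseline columns should simply be carried from df1 onto every follow-up row with the same id, and that the baseline rows themselves should also appear in the result):

library(dplyr)

df2 %>%
  left_join(df1 %>% select(id, baselevel, basegrade), by = "id") %>%  # attach baseline info by id only
  bind_rows(df1) %>%                                                  # keep the baseline rows as well
  arrange(id, date)

Joining by id alone is the key point: merging on both id and date only matches rows that share both values, which is why the earlier merge() attempts produced NAs.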