I have a dataset of drugs administered over time. I want to assign a group number to each consecutive block of the same drug. I figured out a simple way to do this with a for-loop that I can apply to each patient's set of drugs.
But I am curious whether there is a simple way to do this within the tidyverse.
Not that it matters much; I mostly want to know whether a ready-made function already exists for this problem.
Set-up
library(tibble)     # tibble()
library(lubridate)  # today()
have <- tibble(
patinet = c(1),
date = seq(today(), today()+11,1),
drug = c(rep("a",3), rep("b",3), rep("c",3), rep("a",3))
)
## Want
want <- tibble(
patinet = c(1),
date = seq(today(), today()+11,1),
drug = c(rep("a",3), rep("b",3), rep("c",3), rep("a",3)),
grp = sort(rep(1:4,3))
)
> have
# A tibble: 12 × 3
patinet date drug
<dbl> <date> <chr>
1 1 2022-03-16 a
2 1 2022-03-17 a
3 1 2022-03-18 a
4 1 2022-03-19 b
5 1 2022-03-20 b
6 1 2022-03-21 b
7 1 2022-03-22 c
8 1 2022-03-23 c
9 1 2022-03-24 c
10 1 2022-03-25 a
11 1 2022-03-26 a
12 1 2022-03-27 a
> want
# A tibble: 12 × 4
patinet date drug grp
<dbl> <date> <chr> <int>
1 1 2022-03-16 a 1
2 1 2022-03-17 a 1
3 1 2022-03-18 a 1
4 1 2022-03-19 b 2
5 1 2022-03-20 b 2
6 1 2022-03-21 b 2
7 1 2022-03-22 c 3
8 1 2022-03-23 c 3
9 1 2022-03-24 c 3
10 1 2022-03-25 a 4
11 1 2022-03-26 a 4
12 1 2022-03-27 a 4
You can use data.table::rleid, which starts a new id every time the value changes:
library(dplyr)
have %>% mutate(group = data.table::rleid(drug))
# A tibble: 12 x 4
patinet date drug group
<dbl> <date> <chr> <int>
1 1 2022-03-16 a 1
2 1 2022-03-17 a 1
3 1 2022-03-18 a 1
4 1 2022-03-19 b 2
5 1 2022-03-20 b 2
6 1 2022-03-21 b 2
7 1 2022-03-22 c 3
8 1 2022-03-23 c 3
9 1 2022-03-24 c 3
10 1 2022-03-25 a 4
11 1 2022-03-26 a 4
12 1 2022-03-27 a 4
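If you would rather not depend on data.table, the same run ids can be built in base R with rle(). This is only a sketch under stated assumptions: the two-patient data frame and the helper name run_id are made up for illustration, and the numbering restarts for each patient. Recent dplyr versions (1.1.0 and later) also provide consecutive_id(), which may be the closest tidyverse equivalent.

```r
# Hypothetical example data: two patients, each with several drug blocks
have <- data.frame(
  patient = c(rep(1, 6), rep(2, 4)),
  drug    = c("a", "a", "b", "b", "a", "a", "c", "c", "c", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number consecutive runs of identical values (1, 2, 3, ...)
run_id <- function(x) {
  r <- rle(as.character(x))
  rep(seq_along(r$lengths), times = r$lengths)
}

# Compute the run ids separately for each patient, then put them back in row order
have$grp <- unsplit(lapply(split(have$drug, have$patient), run_id), have$patient)

print(have)
```

Note that drug "a" gets a new group number when it reappears after "b", which is the behaviour asked for in the question.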
I am doing some coding in R. I am trying to use the doBy package to get a sum total score for a variable (x) by both date (date) and by id (id). The doBy command works fine and I get this output.
data
id date x
1 01/01/2021 1
1 01/02/2021 2
1 01/03/2021 3
2 02/01/2021 2
2 02/02/2021 3
2 02/02/2021 4
3 03/11/2021 3
3 03/12/2021 3
3 03/13/2021 2
I want to recode the date so that each person's chronologically first date is 1, the second is 2, the third is 3, and so on. I want my data to look something like this.
data2
id daycount x
1 1 1
1 2 2
1 3 3
2 1 2
2 2 3
2 3 4
3 1 3
3 2 3
3 3 2
I was able to order the days using order(), but I am not sure how to number the dates within each id. I think I need some kind of sequence or autonumber. Also, participants may have different numbers of days: some may have 1 day and others may have 10.
1) doBy. Assuming that the dates are already sorted within id:
library(doBy)
transform_by(data, ~ id, countdays = seq_along(id))
giving:
id date x countdays
1 1 01/01/2021 1 1
2 1 01/02/2021 2 2
3 1 01/03/2021 3 3
4 2 02/01/2021 2 1
5 2 02/02/2021 3 2
6 2 02/02/2021 4 3
7 3 03/11/2021 3 1
8 3 03/12/2021 3 2
9 3 03/13/2021 2 3
2) Base R. It could also be done using transform and ave in base R.
transform(data, daycount = ave(id, id, FUN = seq_along))
giving:
id date x daycount
1 1 01/01/2021 1 1
2 1 01/02/2021 2 2
3 1 01/03/2021 3 3
4 2 02/01/2021 2 1
5 2 02/02/2021 3 2
6 2 02/02/2021 4 3
7 3 03/11/2021 3 1
8 3 03/12/2021 3 2
9 3 03/13/2021 2 3
Note
data in reproducible form:
Lines <- "id date x
1 01/01/2021 1
1 01/02/2021 2
1 01/03/2021 3
2 02/01/2021 2
2 02/02/2021 3
2 02/02/2021 4
3 03/11/2021 3
3 03/12/2021 3
3 03/13/2021 2"
data <- read.table(text = Lines, header = TRUE)
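Both solutions above assume the rows are already in date order within each id. If that is not guaranteed, one option is to parse the dates and sort first. The following is a minimal sketch, assuming the dates are month/day/year strings as in the question; the shuffled input and the column name date_parsed are made up for illustration.

```r
# Hypothetical input: same layout as the question, but rows deliberately out of order
Lines <- "id date x
1 01/03/2021 3
1 01/01/2021 1
1 01/02/2021 2
2 02/02/2021 3
2 02/01/2021 2"
data <- read.table(text = Lines, header = TRUE)

# Parse the month/day/year strings so that sorting is chronological, not alphabetical
data$date_parsed <- as.Date(data$date, format = "%m/%d/%Y")

# Sort by id and date, then number the rows within each id
data <- data[order(data$id, data$date_parsed), ]
data$daycount <- ave(data$id, data$id, FUN = seq_along)

print(data)
```

Because seq_along() simply counts rows, ids with different numbers of days are handled without any extra work.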
You may want to group_by id and then create a new column using rank or dense_rank (note that the two handle duplicate dates differently).
To recreate your data, I used:
# recreate data frame
id_vec <- rep(c(1,2,3), each = 3)
date_vec <- c(
'01/01/2021',
'01/02/2021',
'01/03/2021',
'02/01/2021',
'02/02/2021',
'02/02/2021',
'03/11/2021',
'03/12/2021',
'03/13/2021'
)
x_vec <- rep(c(1,2,3), times = 3)
data <- data.frame(id = id_vec, date = date_vec, x = x_vec)
I also converted the date column to an actual Date object for your convenience:
# convert string to date object
library(lubridate)
library(dplyr)
data <- data %>% mutate(date_formatted = mdy(date))
Creating a column with rank:
data %>%
group_by(id) %>%
mutate(day_count = rank(date_formatted, ties.method = "first")) %>%
ungroup()
# # A tibble: 9 x 5
# id date x date_formatted day_count
# <dbl> <chr> <dbl> <date> <int>
# 1 1 01/01/2021 1 2021-01-01 1
# 2 1 01/02/2021 2 2021-01-02 2
# 3 1 01/03/2021 3 2021-01-03 3
# 4 2 02/01/2021 1 2021-02-01 1
# 5 2 02/02/2021 2 2021-02-02 2
# 6 2 02/02/2021 3 2021-02-02 3
# 7 3 03/11/2021 1 2021-03-11 1
# 8 3 03/12/2021 2 2021-03-12 2
# 9 3 03/13/2021 3 2021-03-13 3
Creating new column with dense_rank:
data %>%
group_by(id) %>%
mutate(day_count = dense_rank(date_formatted)) %>%
ungroup()
# # A tibble: 9 x 5
# id date x date_formatted day_count
# <dbl> <chr> <dbl> <date> <int>
# 1 1 01/01/2021 1 2021-01-01 1
# 2 1 01/02/2021 2 2021-01-02 2
# 3 1 01/03/2021 3 2021-01-03 3
# 4 2 02/01/2021 1 2021-02-01 1
# 5 2 02/02/2021 2 2021-02-02 2
# 6 2 02/02/2021 3 2021-02-02 2
# 7 3 03/11/2021 1 2021-03-11 1
# 8 3 03/12/2021 2 2021-03-12 2
# 9 3 03/13/2021 3 2021-03-13 3
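To make the difference in tie handling easier to see, here is a small base R sketch on a single made-up vector of dates. The match() line is a base R stand-in for dense_rank, not dplyr's own implementation, and the variable names are only illustrative.

```r
# Hypothetical dates with one duplicate (the 2nd and 3rd values are equal)
dates <- as.Date(c("2021-02-01", "2021-02-02", "2021-02-02", "2021-02-05"))
days  <- as.numeric(dates)

# Ties broken by order of appearance: every row gets its own number
rank_first <- rank(days, ties.method = "first")

# Ties share the lowest rank, and the next rank is skipped
rank_min <- rank(days, ties.method = "min")

# Dense ranking: ties share a rank and no rank is skipped
rank_dense <- match(days, sort(unique(days)))

print(data.frame(dates, rank_first, rank_min, rank_dense))
```

If two records on the same date should count as the same day, the dense variant is probably what you want; if every row should get its own day number, use the "first" variant.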
I have data that looks like this:
The sample data can be created with the following code:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
I want to build a variable "Event" to capture all events. The final results will look like this:
What should I do? I would like to know as many ways as possible. Thanks.
One option is to use apply(), like this. The suggestion from @AllanCameron is also a great choice. Here is the code:
#Vectors
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
#Data
Sample.data <- data.frame(ID, Days, Event_P, Event_N, Event_C, stringsAsFactors = FALSE)
#Option 1
index <- which(grepl('Event',names(Sample.data)))
Sample.data$Event <- apply(Sample.data[,index],1,function(x) paste0(x[x!=''],collapse='/'))
Output:
   ID Days Event_P Event_N Event_C Event
1   1   -5                       C     C
2   1    1
3   1   18       P               C   P/C
4   1   30
5   2    1               N             N
6   2    8
7   2   16       P       N       C P/N/C
8   3    1
9   3    8
10  3    6       P       N       C P/N/C
11  4   -6               N             N
12  4    1
13  4    7       P       N           P/N
14  4   15       P       N           P/N
Duck's answer is very good, but you mentioned you want as many ways as possible so here are two more ways:
You could also use tidyverse's mutate with paste, or base R's interaction, to combine the columns, and then use gsub to strip the leftover separators:
ID<-c(1,1,1,1,2,2,2,3,3,3,4,4,4,4)
Days<-c(-5,1,18,30,1,8,16,1,8,6,-6,1,7,15)
Event_P<-c("","","P","","","","P","","","P","","","P","P")
Event_N<-c("","","","","N","","N","","","N","N","","N","N")
Event_C<-c("C","","C","","","","C","","","C","","","","")
Sample.data <- data.frame(ID, Days, Event_P, Event_N,Event_C)
library(tidyverse)
Sample.data %>%
mutate(Event = paste(Event_P, Event_N, Event_C, sep='/'),
Event = gsub('^/|^//|/$|//$', '', Event),
Event = gsub('//', '/', Event))
#>    ID Days Event_P Event_N Event_C Event
#> 1   1   -5                       C     C
#> 2   1    1
#> 3   1   18       P               C   P/C
#> 4   1   30
#> 5   2    1               N             N
#> 6   2    8
#> 7   2   16       P       N       C P/N/C
#> 8   3    1
#> 9   3    8
#> 10  3    6       P       N       C P/N/C
#> 11  4   -6               N             N
#> 12  4    1
#> 13  4    7       P       N           P/N
#> 14  4   15       P       N           P/N
Sample.data$Event <-
interaction(Sample.data$Event_P, Sample.data$Event_N, Sample.data$Event_C, sep = '/') %>%
gsub('^/|^//|/$|//$', '', .) %>%
gsub('//', '/', .)
Sample.data
#>    ID Days Event_P Event_N Event_C Event
#> 1   1   -5                       C     C
#> 2   1    1
#> 3   1   18       P               C   P/C
#> 4   1   30
#> 5   2    1               N             N
#> 6   2    8
#> 7   2   16       P       N       C P/N/C
#> 8   3    1
#> 9   3    8
#> 10  3    6       P       N       C P/N/C
#> 11  4   -6               N             N
#> 12  4    1
#> 13  4    7       P       N           P/N
#> 14  4   15       P       N           P/N
Created on 2020-09-18 by the reprex package (v0.3.0)
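What the pattern inside gsub('^/|^//|/$|//$', '', Event) does:
^/|^//: removes a / or // at the start of the string
/$|//$: removes a / or // at the end of the string
The second gsub('//', '/', Event) then collapses any // left in the middle into a single /.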
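The same clean-up can be written with a quantifier so that it does not depend on how many columns are pasted together. This is a minimal sketch on made-up vectors, not a replacement for the answers above: it first collapses any run of slashes to one, then strips a single leading or trailing slash.

```r
# Hypothetical event columns covering the interesting cases,
# including a row where every column is empty
Event_P <- c("",  "P", "P", "",  "")
Event_N <- c("",  "",  "N", "N", "")
Event_C <- c("C", "C", "C", "",  "")

# Paste the columns with "/" between them, e.g. "//C", "P//C", "/N/"
combined <- paste(Event_P, Event_N, Event_C, sep = "/")

# Collapse runs of "/" into one, then drop a leading or trailing "/"
combined <- gsub("/+", "/", combined)
combined <- gsub("^/|/$", "", combined)

print(combined)
```

Rows with no events end up as empty strings, matching the output of the other approaches.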
I have two datasets: one is a baseline dataset and the other is a follow-up dataset.
df1 is the baseline (cross-sectional) data with id, date, score1, score2, baselevel, and basegrade.
df2 has id, date, score1, and score2 in long format, with multiple rows per id.
df1 <- as.data.frame(cbind(id = c(1,2,3),
date = c("2020-06-03","2020-07-02","2020-06-11"),
score1 =c(6,8,5),
score2=c(1,1,6),
baselevel=c(2,2,2),
basegrade=c("B","B","A")))
df2 <- as.data.frame(cbind(id =c(1,1,1,1,2,2,2,3,3,3),
date = c("2020-06-10","2020-06-17","2020-06-24",
"2020-07-01", "2020-07-03", "2020-07-10","2020-07-17", "2020-06-14",
"2020-06-22", "2020-06-29"),
score1 = c(3,1,7,8,8,6,5,5,3,5),
score2 = c(1,4,5,4,1,1,2,6,7,1)) )
This is what I want as a result of merging the two dfs.
id date score1 score2 baselevel basegrade
1 2020-06-03 6 1 2 "B"
1 2020-06-10 3 1 2 "B"
1 2020-06-17 1 4 2 "B"
1 2020-06-24 7 5 2 "B"
1 2020-07-01 8 4 2 "B"
2 2020-07-02 8 1 2 "B"
2 2020-07-03 8 1 2 "B"
2 2020-07-10 6 1 2 "B"
2 2020-07-17 5 2 2 "B"
3 2020-06-11 5 6 2 "A"
3 2020-06-14 5 6 2 "A"
3 2020-06-22 3 7 2 "A"
3 2020-06-29 5 1 2 "A"
I tried the two merge calls below, but I still get NAs. What am I missing here?
Any help would be appreciated!!
dfcombined1 <- merge(df1, df2, by=c("id","date"), all= TRUE)
dfcombined2 <- merge(df1, df2, by=intersect(names(df1), names(df2)), all= TRUE)
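You can use bind_rows() from dplyr to stack the two data frames, then fill() from tidyr to copy the baseline columns to the follow-up rows within each id.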
library(dplyr)
library(tidyr)
bind_rows(df1, df2) %>%
group_by(id) %>%
fill(starts_with("base"), .direction = "updown") %>%
arrange(date, .by_group = TRUE)
# # A tibble: 13 x 6
# # Groups: id [3]
# id date score1 score2 baselevel basegrade
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2020-06-03 6 1 2 B
# 2 1 2020-06-10 3 1 2 B
# 3 1 2020-06-17 1 4 2 B
# 4 1 2020-06-24 7 5 2 B
# 5 1 2020-07-01 8 4 2 B
# 6 2 2020-07-02 8 1 2 B
# 7 2 2020-07-03 8 1 2 B
# 8 2 2020-07-10 6 1 2 B
# 9 2 2020-07-17 5 2 2 B
# 10 3 2020-06-11 5 6 2 A
# 11 3 2020-06-14 5 6 2 A
# 12 3 2020-06-22 3 7 2 A
# 13 3 2020-06-29 5 1 2 A
I think you are looking for this:
library(tidyverse)
#Code
df1 %>% bind_rows(df2) %>% arrange(id) %>% group_by(id) %>% fill(baselevel) %>% fill(basegrade)
Output:
# A tibble: 13 x 6
# Groups: id [3]
id date score1 score2 baselevel basegrade
<fct> <fct> <fct> <fct> <fct> <fct>
1 1 2020-06-03 6 1 2 B
2 1 2020-06-10 3 1 2 B
3 1 2020-06-17 1 4 2 B
4 1 2020-06-24 7 5 2 B
5 1 2020-07-01 8 4 2 B
6 2 2020-07-02 8 1 2 B
7 2 2020-07-03 8 1 2 B
8 2 2020-07-10 6 1 2 B
9 2 2020-07-17 5 2 2 B
10 3 2020-06-11 5 6 2 A
11 3 2020-06-14 5 6 2 A
12 3 2020-06-22 3 7 2 A
13 3 2020-06-29 5 1 2 A
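A double merge, maybe. Note that the result contains only the follow-up rows; the baseline rows from df1 are not included.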
merge(merge(df1[-(3:4)], df2, all.y=TRUE)[-(3:4)], df1[-(2:4)], all.x=TRUE)
# id date score1 score2 baselevel basegrade
# 1 1 2020-06-10 3 1 2 B
# 2 1 2020-06-17 1 4 2 B
# 3 1 2020-06-24 7 5 2 B
# 4 1 2020-07-01 8 4 2 B
# 5 2 2020-07-03 8 1 2 B
# 6 2 2020-07-10 6 1 2 B
# 7 2 2020-07-17 5 2 2 B
# 8 3 2020-06-14 5 6 2 A
# 9 3 2020-06-22 3 7 2 A
# 10 3 2020-06-29 5 1 2 A
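As for why the original merge calls produce NAs: they match on both id and date, and no baseline date in df1 coincides with a follow-up date in df2, so with all = TRUE every row is kept unmatched. A base R sketch of one way around this is to merge the baseline columns by id only and then stack the baseline rows on top. The data below restates the question's values with proper column types; the names follow and combined are only illustrative.

```r
# Baseline data (one row per id), built with typed columns rather than cbind()
df1 <- data.frame(
  id = c(1, 2, 3),
  date = as.Date(c("2020-06-03", "2020-07-02", "2020-06-11")),
  score1 = c(6, 8, 5),
  score2 = c(1, 1, 6),
  baselevel = c(2, 2, 2),
  basegrade = c("B", "B", "A"),
  stringsAsFactors = FALSE
)

# Follow-up data (several rows per id)
df2 <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  date = as.Date(c("2020-06-10", "2020-06-17", "2020-06-24", "2020-07-01",
                   "2020-07-03", "2020-07-10", "2020-07-17",
                   "2020-06-14", "2020-06-22", "2020-06-29")),
  score1 = c(3, 1, 7, 8, 8, 6, 5, 5, 3, 5),
  score2 = c(1, 4, 5, 4, 1, 1, 2, 6, 7, 1)
)

# Attach the baseline-only columns to every follow-up row, matching on id alone
follow <- merge(df2, df1[c("id", "baselevel", "basegrade")], by = "id")

# Stack baseline and follow-up rows, then sort by id and date
combined <- rbind(df1, follow[names(df1)])
combined <- combined[order(combined$id, combined$date), ]
rownames(combined) <- NULL

print(combined)
```

Keeping date as a Date and the scores as numbers also avoids the all-character columns that as.data.frame(cbind(...)) produces.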