I need to reshape my data, to get it in a proper format for Survival Analysis.
My current Dataset looks like this:
Product_Number Date Status
A 2018-01-01 0
A 2018-01-02 1
A 2018-01-03 0
B 2018-01-01 0
B 2018-01-02 0
B 2018-01-03 0
B 2018-01-04 1
C 2018-01-01 0
C 2018-01-02 0
I need to reshape my data based on the columns Product_Number, Date and Status (I want to count the number of days, per product, until the status shifts to 1. If the status is 0, the process should start over again).
So the data should look like this:
Product_Number Number_of_Days Status
A 2 1 #Two days til status = 1
A 1 0 #One day, status = 0 (no end date yet)
B 4 1 #Four days til status = 1
C 2 0 #Two days, status is still 0 (no end date yet)
What have I tried so far?
I ordered my data by ProductNumber and Date. I love the DPLYR way, so I used:
df <- df %>% group_by(Product_Number, Date) # note: my data is now in the form as in the example above.
Then I tried to use the diff() function, to see the differences in dates (count the number of days). But I was unable to "stop" the count, when status switched (from 0 to 1, and vice versa).
I hope that I clearly explained the problem. Please let me know if you need some additional information.
You could do:
library(dplyr)
df %>%
group_by(Product_Number) %>%
mutate(Date = as.Date(Date),
group = cumsum(coalesce(as.numeric(lag(Status) == 1 & Status == 0), 1))) %>%
group_by(Product_Number, group) %>%
mutate(Number_of_Days = (last(Date) - first(Date)) + 1) %>%
slice(n()) %>% ungroup() %>%
select(-group, -Date)
Output:
# A tibble: 4 x 3
Product_Number Status Number_of_Days
<chr> <int> <time>
1 A 1 2
2 A 0 1
3 B 1 4
4 C 0 2
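To see what the grouping step is doing, here is a minimal base-R sketch of the same idea, assuming the sample data from the question. The names `spell` and `pieces` are made up for illustration, and a new spell is assumed to start on the row after any Status of 1, which is slightly broader than the `lag(Status) == 1 & Status == 0` test above.

```r
# Sample data from the question (assumed to be sorted by product and date).
df <- data.frame(
  Product_Number = c("A", "A", "A", "B", "B", "B", "B", "C", "C"),
  Date = as.Date(c("2018-01-01", "2018-01-02", "2018-01-03",
                   "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04",
                   "2018-01-01", "2018-01-02")),
  Status = c(0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L)
)

# Previous status within each product (0 for the first row of a product).
prev_status <- ave(df$Status, df$Product_Number,
                   FUN = function(s) c(0L, head(s, -1)))

# Spell counter per product: it increases on the row after a Status of 1.
df$spell <- ave(prev_status, df$Product_Number, FUN = cumsum) + 1L

# One output row per product and spell: day span plus the last status seen.
pieces <- split(df, list(df$Product_Number, df$spell), drop = TRUE)
result <- do.call(rbind, lapply(pieces, function(p) {
  data.frame(
    Product_Number = p$Product_Number[1],
    spell = p$spell[1],
    Number_of_Days = as.integer(max(p$Date) - min(p$Date)) + 1L,
    Status = p$Status[nrow(p)]
  )
}))
result <- result[order(result$Product_Number, result$spell), ]
rownames(result) <- NULL
result
```

Printing `df` after the `spell` step shows which rows end up in the same spell, which is handy when checking the logic against real data.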
This might be what you're looking for, if I got your question right.
library(dplyr)
df %>%
mutate(Number_of_Days=1) %>%
select(-Date) %>%
group_by(Product_Number, Status) %>%
summarise_all(sum,na.rm=T)
Product_Number Status Number_of_Days
1 A 0 2
2 A 1 1
3 B 0 3
4 B 1 1
5 C 0 2
I have a dataset that I can reduce to three columns: CustomerID, EnterDate and ExitDate. I would like to add a fourth column which states whether, when a CustomerID appears more than once in the dataset, the second 'EnterDate' is within 30 days of the first 'ExitDate' (and the third within 30 days of the second, and so on, if there are multiple entries for a single CustomerID).
So to turn a table like this:
CustomerID EnterDate ExitDate
1 14/09/2021 15/09/2021
1 03/10/2021 11/10/2021
2 03/10/2021 01/10/2021
2 17/10/2021 11/11/2021
3 03/10/2021 11/10/2021
3 30/12/2021 31/12/2021
4 03/10/2021 09/07/2022
Into this, where '1' is entered in the new column 'ResaleWithin30' if the CustomerID matches and 'EnterDate' is within 30 days of the previous 'ExitDate':
CustomerID EnterDate ExitDate ResaleWithin30
1 14/09/2021 15/09/2021 0
1 03/10/2021 11/10/2021 1
2 03/10/2021 01/10/2021 0
2 17/10/2021 11/11/2021 1
3 03/10/2021 11/10/2021 0
3 30/12/2021 31/12/2021 0
4 03/10/2021 09/07/2022 0
The code below works for comparing EnterDate to the previous EnterDate, but I'd like to compare EnterDate to the previous ExitDate instead. I assume I need to amend the mutate statement so that it applies to both EnterDate and ExitDate, and then change the lag so that it compares EnterDate to ExitDate. However, I am running into various errors trying to get this done, so any amendment/help would be very much appreciated. Thank you!
library(dplyr)
library(tidyr) # replace_na() comes from tidyr
df %>%
group_by(CustomerID) %>%
mutate(EnterDate = as.Date(EnterDate, tryFormats = "%d/%m/%Y"),
ResaleWithin30 = as.integer(EnterDate - lag(EnterDate) <= 30),
ResaleWithin30 = replace_na(ResaleWithin30, 0))
I'm afraid I don't see what the problem is. If I've missed something, please explain.
You can mutate more than one column in a single call. Just separate the mutations with commas. You can even mutate the same column more than once, as you have done yourself. Or you can chain several calls to mutate in the same pipe.
Create some data
library(tidyverse)
# I prefer the consistency and additional functionality of lubridate to base R.
# Base R will suffice in this case.
library(lubridate)
df <-
read.table(textConnection("CustomerID EnterDate ExitDate
1 14/09/2021 15/09/2021
1 03/10/2021 11/10/2021
2 03/10/2021 01/10/2021
2 17/10/2021 11/11/2021
3 03/10/2021 11/10/2021
3 30/12/2021 31/12/2021
4 03/10/2021 09/07/2022"), header=TRUE)
Solve the problem.
df %>%
group_by(CustomerID) %>%
mutate(
EnterDate=dmy(EnterDate),
ExitDate=dmy(ExitDate),
ResaleWithin30 = ifelse(EnterDate - lag(ExitDate) <= 30, 1, 0),
ResaleWithin30 = replace_na(ResaleWithin30, 0)
) %>%
ungroup()
# A tibble: 7 × 4
CustomerID EnterDate ExitDate ResaleWithin30
<int> <date> <date> <dbl>
1 1 2021-09-14 2021-09-15 0
2 1 2021-10-03 2021-10-11 1
3 2 2021-10-03 2021-10-01 0
4 2 2021-10-17 2021-11-11 1
5 3 2021-10-03 2021-10-11 0
6 3 2021-12-30 2021-12-31 0
7 4 2021-10-03 2022-07-09 0
Note that lag() returns NA for the first row of each CustomerID, so the comparison is NA there; the call to replace_na() is what turns those rows into 0.
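As a rough illustration of how the first row of a customer behaves, here is a small base-R sketch using only customer 1 from the test data. The variable names (`prev_exit`, `gap`, `flag_raw`) are made up for this example, and `c(NA, head(x, -1))` is used as a stand-in for what `lag()` does within one group.

```r
# Customer 1 from the test data.
enter_date <- as.Date(c("2021-09-14", "2021-10-03"))
exit_date  <- as.Date(c("2021-09-15", "2021-10-11"))

# Previous exit date: there is none for the first row, so it is NA.
prev_exit <- c(as.Date(NA), head(exit_date, -1))

# Days between this entry and the previous exit.
gap <- enter_date - prev_exit

# The comparison is NA wherever the gap is NA.
flag_raw <- as.integer(gap <= 30)

# Treat "no previous visit" as 0.
flag <- flag_raw
flag[is.na(flag)] <- 0L
flag
```

Here `flag_raw` is NA for the first row and 1 for the second (an 18-day gap), so the NA has to be replaced explicitly to get a clean 0/1 column.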
I think the below might work.
df %>%
group_by(CustomerID) %>%
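mutate(EnterDate = as.Date(EnterDate, tryFormats = "%d/%m/%Y"),
ExitDate = as.Date(ExitDate, tryFormats = "%d/%m/%Y"), # ExitDate must be a Date as well
ResaleWithin30 = as.integer(EnterDate - lag(ExitDate) <= 30),
ResaleWithin30 = replace_na(ResaleWithin30, 0)) # replace_na() is from tidyr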
It is easier to help if you provide a reprex
Background
I've got an R dataframe d:
d <- data.frame(ID = c("a","a","b","b", "c","c","c"),
event = c(1,1,0,0,1,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
stringsAsFactors=FALSE)
As you can see, it's got 3 distinct people in the ID column, and they've either had or not had an event, along with a date their event status was recorded (event_date).
The Problem
I'd like to create a new variable / column, event_within_interval, which assigns 1 to all the cells of a given ID if that ID has 2 or more event=1 within 180 days of their first event=1.
Let me explain further: both ID=a and ID=c have 2 or more events each, but only ID=c has their second event within 180 days of their first (so here, the 4/7/2013 - 3/14/2013 = 24 days for ID=c).
The problem is that I'm not sure how to tell R this idea of "if the second happens within 180 days of the first event=1".
What I'd like
Here's what I'm looking for:
want <- data.frame(ID = c("a","a","b","b","c","c","c"),
event = c(1,1,0,0,1,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07","2014-07-14")),
event_within_interval = c(0,0,0,0,1,1,1),
stringsAsFactors=FALSE)
What I've tried
I've only got the beginnings of an attempt thus far:
d <- d %>%
mutate(event_within_interval = ID %in% if_else(d$event == 1, 1, 0))
But this doesn't give me what I'd like, as you can tell if you run the code.
I've set the thing up as an if_else, but I'm not sure where to go from here.
UPDATE: I've edited both reproducible examples (what I've got and what I want) to emphasize the fact that the desired date interval needs to be between the first event and the second event, not the first event and the last event. (A couple of users submitted examples using last, which worked for the previous iteration of the reproducible example but wouldn't have worked on the real dataset.)
What about using the lubridate and data.table packages?
library(data.table)
library(lubridate)
d <- data.frame(ID = c("a","a","b","b", "c","c"),
event = c(1,1,0,0,1,1),
event_date = as.Date(c("2011-01-01","2012-08-21","2011-12-23","2011-12-31","2013-03-14","2013-04-07")),
stringsAsFactors=FALSE)
d <- data.table(d)
d <- d[, event_within_interval := 0]
timeInterval <- interval(start = "2013-03-14", end = "2013-04-07")
d <- d[event == 1 & event_date %within% timeInterval, event_within_interval := 1]
d
# ID event event_date event_within_interval
# 1: a 1 2011-01-01 0
# 2: a 1 2012-08-21 0
# 3: b 0 2011-12-23 0
# 4: b 0 2011-12-31 0
# 5: c 1 2013-03-14 1
# 6: c 1 2013-04-07 1
This is good fun.
Scenario 1
My approach would be to
group events by ID
check the span of days between the current date and the initial date
check that the sum of events is greater than or equal to two: sum(event) >= 2
return 1 for the event only if both conditions are met
For readability, I've returned values of conditions in the data as test_* variables.
d %>%
group_by(ID) %>%
mutate(test_interval = event_date - min(event_date) < 180,
test_sum_events = sum(event) >= 2,
event_within_interval = if_else(test_interval & test_sum_events,
1, 0)) %>%
ungroup()
Scenario 2
In this scenario, the data is sorted by event_date within ID and the difference between the first event and second event has to be at most 180 days. The remaining events are ignored.
d %>%
group_by(ID) %>%
arrange(event_date) %>%
mutate(
# Check the difference between first event: min(event_date) and
# second event: event_date[2]
test_interval_first_two = event_date[2] - min(event_date) <= 180,
test_sum_events = sum(event) >= 2,
event_within_interval = if_else(
test_interval_first_two & test_sum_events, 1, 0)
) %>%
ungroup()
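One thing to watch in Scenario 2 is that `event_date[2]` is the second row of the group, whether or not that row has event = 1. A possible base-R sketch that looks only at rows with event = 1 is shown here, assuming the 7-row `d` from the question; `flag_id` and `per_id` are hypothetical helper names, not part of any package.

```r
d <- data.frame(
  ID = c("a", "a", "b", "b", "c", "c", "c"),
  event = c(1, 1, 0, 0, 1, 1, 1),
  event_date = as.Date(c("2011-01-01", "2012-08-21", "2011-12-23",
                         "2011-12-31", "2013-03-14", "2013-04-07",
                         "2014-07-14")),
  stringsAsFactors = FALSE
)

# 1 if the ID has at least two event == 1 rows and the second of them
# falls within `window` days of the first; otherwise 0.
flag_id <- function(event, event_date, window = 180) {
  dates <- sort(event_date[event == 1])
  as.integer(length(dates) >= 2 &&
               as.numeric(dates[2] - dates[1]) <= window)
}

# One flag per ID, then copied back to every row of that ID.
per_id <- sapply(split(d, d$ID),
                 function(g) flag_id(g$event, g$event_date))
d$event_within_interval <- unname(per_id[d$ID])
d
```

Because the dates are filtered before they are compared, later events (such as the third one for ID c) do not affect the result.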
You can first group_by the ID column, so that the day difference is calculated within each ID. Then, in the if_else condition, require both sum(event) > 1 and a day difference <= 180.
Here I assume there's only two "events" or rows per ID.
library(dplyr)
d %>%
group_by(ID) %>%
mutate(event_within_interval = if_else(sum(event) > 1 & last(event_date) - first(event_date) <= 180, 1L, 0L))
# A tibble: 6 x 4
# Groups: ID [3]
ID event event_date event_within_interval
<chr> <dbl> <date> <int>
1 a 1 2011-01-01 0
2 a 1 2012-08-21 0
3 b 0 2011-12-23 0
4 b 0 2011-12-31 0
5 c 1 2013-03-14 1
6 c 1 2013-04-07 1
Here is how we could do it. This example adds an extra column, interval, to show the interval, and then uses an ifelse statement.
library(dplyr)
d %>%
group_by(ID) %>%
mutate(interval = last(event_date)- first(event_date),
event_within_interval = ifelse(event == 1 &
interval < 180, 1, 0))
ID event event_date interval event_within_interval
<chr> <dbl> <date> <drtn> <dbl>
1 a 1 2011-01-01 598 days 0
2 a 1 2012-08-21 598 days 0
3 b 0 2011-12-23 8 days 0
4 b 0 2011-12-31 8 days 0
5 c 1 2013-03-14 24 days 1
6 c 1 2013-04-07 24 days 1
I am currently working on a dataset which consists of multiple participants. Some participants have taken part in all follow-ups, whereas others have skipped some.
For example, in the dataset below, participant 2 only took part in the 3rd follow-up, and participant 3 only in the 2nd and 3rd. You can also see that some participants have more than one row because they have several follow-ups.
The original dataset only has the 1st and the 2nd column. Since I am aiming to create a progress chart like this
I have tried to create extra columns for each visit by using the code below:
participant <- c(1,1,1,2,3,3,4,5,5,5 )
visit <- c(1,2,3,3,2,3,1,1,2,3)
df <- data.frame(participant, visit)
df[,3] <- as.integer(df$visit=="1")
df[,4] <- as.integer(df$visit=="2")
df[,5] <- as.integer(df$visit=="3")
colnames(df)[colnames(df) %in% c("V3","V4","V5")] <- c(
"Visit1","Visit2","Visit3")
However, I still experience a hard time combining rows of the same participant, and hence I could not proceed to making the chart (which I also have no clue about). I have tried the 'reshape' function but it did not work out. group_by function also did not work out and still showed the original dataset
df1 <- df[,-2]
df1 %>%
group_by(participant)
Which functions should I use in this case for:
combining rows of the same participant?
producing the progress chart?
Thank you in advance for your help!
Based on your df you could produce the chart with
library(ggplot2)
library(dplyr)
df %>%
ggplot(aes(x = as.factor(visit),
y = as.factor(participant),
fill = as.factor(visit))) +
geom_tile(aes(width = 0.7, height = 0.7), color = "black") +
scale_fill_grey() +
xlab("Visit") +
ylab("Participants") +
guides(fill = "none")
If you need your data.frame in a wide format (similar to the image shown but with only one row per participant), use
library(tidyr)
library(dplyr)
df %>%
mutate(value = 1) %>%
pivot_wider(
names_from = visit,
values_from = value,
names_glue = "Visit{visit}",
values_fill = 0)
to get
# A tibble: 5 x 4
participant Visit1 Visit2 Visit3
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
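If you would rather avoid extra packages for the wide format, a base-R sketch along these lines should give the same 0/1 layout. It assumes the `df` from the question; `counts`, `present` and `wide` are just illustrative names.

```r
participant <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5)
visit <- c(1, 2, 3, 3, 2, 3, 1, 1, 2, 3)
df <- data.frame(participant, visit)

# Cross-tabulate participants against visits (plain integer matrix).
counts <- unclass(table(df$participant, df$visit))

# Turn counts into 0/1 presence indicators.
present <- (counts > 0) * 1L
colnames(present) <- paste0("Visit", colnames(present))

# One row per participant, one column per visit.
wide <- data.frame(participant = as.numeric(rownames(present)), present)
rownames(wide) <- NULL
wide
```

Using `counts > 0` rather than the raw counts keeps the columns at 0/1 even if a participant-visit pair happens to appear more than once.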
I think you are looking for a way to dummify a variable.
There are several ways to do that.
I like the fastDummies package. You can use dummy_cols, with remove_selected_columns=TRUE.
df %>% fastDummies::dummy_cols(select_columns = 'visit',
remove_selected_columns = TRUE)
participant visit_1 visit_2 visit_3
1 1 1 0 0
2 1 0 1 0
3 1 0 0 1
4 2 0 0 1
5 3 0 1 0
6 3 0 0 1
7 4 1 0 0
8 5 1 0 0
9 5 0 1 0
10 5 0 0 1
You may want to pipe in a summarise operation to make the table even cleaner, as in:
df %>% fastDummies::dummy_cols(select_columns = 'visit', remove_selected_columns = TRUE)%>%
group_by(participant)%>%
summarise(across(starts_with('visit'), max))
# A tibble: 5 x 4
participant visit_1 visit_2 visit_3
<dbl> <int> <int> <int>
1 1 1 1 1
2 2 0 0 1
3 3 0 1 1
4 4 1 0 0
5 5 1 1 1
In a certain way, this looks a bit like a pivoting operation too.
You may be interested in using tidyr::pivot_wider here too.
EDIT: @MartinGal had just given a similar answer, so I removed my very similar pivot_wider version.
I have clinical data that records a patient at three time points with a disease outcome indicated by a binary variable. It looks something like this
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome)
Data
I want to create an onset variable, so for each patient it would code a 1 for the time which the patient first got the disease, but would then be a 0 for any time period before or a time period after (even if that patient still had the disease). For the example data it should now look like this.
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,1,1,0,0,1,1,1,0)
outcome_onset <- c(0,1,0,0,0,1,1,0,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome,
outcome_onset=outcome_onset)
Data
Therefore I would like some code/ some help automating the creation of the outcome_onset variable.
Here is an option with cumsum to create a logical vector after grouping by the 'patientid'
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = +(outcome == 1 & cumsum(outcome) == 1))
Or use match and %in%
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = +(row_number() %in% match(1, outcome)))
We can use which.max to get the index of 1st one in outcome variable and make that row as 1 and rest of them as 0.
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(outcome_onset = as.integer(row_number() %in% which.max(outcome)),
outcome_onset = replace(outcome_onset, is.na(outcome), NA))
# patientid time outcome outcome_onset
# <dbl> <dbl> <dbl> <int>
#1 100 1 0 0
#2 100 2 1 1
#3 100 3 1 0
#4 101 1 0 0
#5 101 2 0 0
#6 101 3 1 1
#7 102 1 1 1
#8 102 2 1 0
#9 102 3 0 0
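One edge case worth checking is a patient who never has the disease: `which.max()` on an all-zero vector returns 1, so the first row would be flagged. A minimal base-R sketch that avoids this, assuming the `Data` from the question, could look like the following; `first_onset` is a hypothetical helper name.

```r
patientid <- c(100, 100, 100, 101, 101, 101, 102, 102, 102)
time <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
outcome <- c(0, 1, 1, 0, 0, 1, 1, 1, 0)
Data <- data.frame(patientid = patientid, time = time, outcome = outcome)

# Flag only the first position where outcome is 1; all zeros if there is none.
first_onset <- function(outcome) {
  flag <- integer(length(outcome))
  idx <- match(1, outcome)  # NA when the patient never has the outcome
  if (!is.na(idx)) flag[idx] <- 1L
  flag
}

# Apply the helper within each patient (rows assumed to be sorted by time).
Data$outcome_onset <- ave(Data$outcome, Data$patientid, FUN = first_onset)
Data
```

The same helper returns all zeros for `c(0, 0, 0)` and flags only the first position for a relapse pattern such as `c(1, 0, 1)`.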
I have a large data.frame containing these values:
ID_Path Conversion Lead Path Week
32342 A25177 1 JEFD 2015-25
32528 A25177 1 EUFD 2015-25
25485 A3 1 DTFE 2015-25
32528 Null 0 DDFE 2015-25
23452 A25177 1 JDDD 2015-26
54454 A25177 1 FDFF 2015-27
56848 A2323 1 HDG 2015-27
I want to be able to create a frequency table that displays a table like this:
Week Total A25177 A3 A2323
2015-25 3 2 1 0
2015-26 1 1 0 0
2015-27 2 1 0 1
Every unique Conversion should get its own column. The rows where Conversion is Null are the same rows where Lead is 0.
In this example there are 3 unique conversions, but sometimes there is 1 and sometimes there are 5 or more, so it should not be limited to only 3.
I have created a new DF containing only Conversion that are not Null
I have tried using data.table with this code:
DF[,list(Week=Week,by=Conversion]
with no luck.
I have tried using plyr with this code:
ddply(DF,~Conversion,summarise,week=week)
with no luck.
I would recommend dropping unnecessary levels in order to not mess the output, and then run a simple table and addmargins combination
DF <- droplevels(DF[DF$Conversion != "Null",])
addmargins(table(DF[c("Week", "Conversion")]), 2)
# Conversion
# Week A2323 A25177 A3 Sum
# 2015-25 0 2 1 3
# 2015-26 0 1 0 1
# 2015-27 1 1 0 2
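Since the question has no reproducible data, here is a self-contained sketch that rebuilds the sample as a data frame (with character columns, which is an assumption about the real data) and puts the total first, as in the desired layout. The object names `tab` and `out` are illustrative only.

```r
DF <- data.frame(
  ID_Path = c(32342, 32528, 25485, 32528, 23452, 54454, 56848),
  Conversion = c("A25177", "A25177", "A3", "Null", "A25177", "A25177", "A2323"),
  Lead = c(1, 1, 1, 0, 1, 1, 1),
  Path = c("JEFD", "EUFD", "DTFE", "DDFE", "JDDD", "FDFF", "HDG"),
  Week = c("2015-25", "2015-25", "2015-25", "2015-25",
           "2015-26", "2015-27", "2015-27"),
  stringsAsFactors = FALSE
)

# Drop the rows without a conversion.
DF <- DF[DF$Conversion != "Null", ]

# Count conversions per week, then prepend the row totals.
tab <- table(Week = DF$Week, Conversion = DF$Conversion)
out <- cbind(Total = rowSums(tab), unclass(tab))
out
```

Because the columns come from `table()`, any number of distinct Conversion values is handled without listing them by hand.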
Alternatively, you could do the same with reshape2 while specifying the margins parameter
library(reshape2)
dcast(DF, Week ~ Conversion, value.var = "Conversion", length, margins = "Conversion")
# Week A2323 A25177 A3 (all)
# 1 2015-25 0 2 1 3
# 2 2015-26 0 1 0 1
# 3 2015-27 1 1 0 2
An alternative solution using dplyr and tidyr:
library(tidyr)
library(dplyr)
dt = data.frame(Conversion = c("A1","Null","A1","A3"),
Lead = c(1,0,1,1),
Week = c("2015-25","2015-25","2015-25","2015-26"))
dt %>%
filter(Conversion != "Null") %>%
group_by(Week, Conversion) %>%
summarise(Lead = sum(Lead)) %>%
ungroup() %>%
spread(Conversion,Lead,fill=0) %>%
group_by(Week) %>%
do(data.frame(.,
Total = sum(.[,-1]))) %>%
ungroup()
# Week A1 A3 Total
# 1 2015-25 2 0 2
# 2 2015-26 0 1 1