R dplyr: select row of minimum date difference between two separate events

I am working with a data set that includes roughly 400 unique subjects; for this example I will only be working with two. You can generate sample data with this code:
set.seed(100)
library(tidyr)
library(dplyr)
Subject<-c("A","A","A","A","A","A","B","B","B","B")
Event1<-c("01/01/2001","01/01/2001","01/01/2001","01/01/2001","09/09/2001","09/09/2001","09/09/2009","09/09/2009","09/09/2009","09/09/2009")
random.dates <- function(N, sd = "2001-01-01", ed = "2010-01-01") {
  sd <- as.Date(sd, "%Y-%m-%d")
  ed <- as.Date(ed, "%Y-%m-%d")
  dt <- as.numeric(difftime(ed, sd)) # span in days
  ev <- sort(runif(N, 0, dt))        # N sorted random day offsets
  sd + ev                            # return the dates
}
Event1<-as.Date(Event1,"%m/%d/%Y")
Event1
Event2 <- random.dates(10)
Event2
df<-data.frame(Subject,Event1,Event2)
df
and produces something close to this output:
Subject Event1 Event2
1 A 2001-01-01 2001-05-04
2 A 2001-01-01 2001-09-24
3 A 2001-01-01 2002-10-22
4 A 2001-01-01 2003-02-25
5 A 2001-09-09 2007-07-16
6 A 2001-09-09 2008-04-06
7 B 2009-09-09 2008-07-12
8 B 2009-09-09 2008-07-24
9 B 2009-09-09 2009-04-01
10 B 2009-09-09 2009-09-11
In this case I am interested in first grouping unique Subjects with unique Event1 values, which I can do easily. From there I need to select the Event2 that falls closest to Event1 for that unique Subject-Event1 combination, which I really need help with. For this example these data should decompose to 3 different records:
Subject Event1 Event2
1 A 2001-01-01 2001-05-04
2 A 2001-09-09 2008-04-06
3 B 2009-09-09 2009-09-11
I've jerry-rigged a solution to produce the 3 records of Subject-Event1 combinations:
df2<-df
df2$SubEv<-paste(df2$Subject,df2$Event1)
df2$Event1<-NULL
df2$Subject<-NULL
df2$Event2<-NULL
df2<-unique(df2)
df2<-separate(df2,SubEv,c("Subject","Event1"),sep=" ")
From here I'm just lost as to how to make R select from df the date of Event2 that is closest to Event1.
I already know my code is super inefficient and sloppy (probably because of my approach from the get-go). I'd like to know how to do this at all, honestly, and if there's a way to do it in fewer than 10 lines of code, that would be pretty boss.

With dplyr:
library(dplyr)
df %>%
  group_by(Subject, Event1) %>%
  slice(which.min(abs(Event1 - Event2)))
# Subject Event1 Event2
# (chr) (date) (date)
# 1 A 2001-01-01 2001-07-05
# 2 A 2001-09-09 2004-05-02
# 3 B 2009-09-09 2008-04-24
Comments:
group_by can work with multiple columns.
slice selects row numbers within a group. Alternately...
... %>% filter( row_number() == which.min(abs(Event1 - Event2)) )
For a tie, which.min will return the first min. See ?which.min for details.
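On newer dplyr (1.0+), slice_min() expresses the same selection directly; a minimal sketch, assuming the df built above:
df %>%
  group_by(Subject, Event1) %>%
  slice_min(abs(as.numeric(Event1 - Event2)), n = 1, with_ties = FALSE) %>%
  ungroup()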
Data: When I run the OP's code, I get df looking like
Subject Event1 Event2
1 A 2001-01-01 2001-07-05
2 A 2001-01-01 2002-07-14
3 A 2001-01-01 2003-04-27
4 A 2001-01-01 2003-10-09
5 A 2001-09-09 2004-05-02
6 A 2001-09-09 2005-03-21
7 B 2009-09-09 2005-05-10
8 B 2009-09-09 2005-12-02
9 B 2009-09-09 2005-12-21
10 B 2009-09-09 2008-04-24
which explains why my result doesn't match exactly the OP's expected result.

Related

Calculating Average Time between Dates

I'm having difficulty calculating the average time between the payment dates in my csv. I have tried multiple methods that I have seen online (changing to data.table, using ddply) with no success.
WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
This is an example of my dataset. I wanted to calculate the average time between the PaymentDates (in number of days), in the simplest way possible, grouped by WorkerID.
Thank you!
This is a perfect job for aggregate(). It groups PaymentDate by WorkerID and applies the function mean(diff(.)) to each group.
tt <- read.table(text="
WorkerID PaymentDate
1 2015-06-18
1 2015-07-18
1 2015-08-18
2 2015-09-18
3 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
4 2015-12-16", header=TRUE)
tt$PaymentDate <- as.Date(tt$PaymentDate)
aggregate(PaymentDate ~ WorkerID, data=tt, FUN=function(x) mean(diff(x)))
# WorkerID PaymentDate
# 1 1 30.5
# 2 2 NaN
# 3 3 31.0
# 4 4 29.5
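The NaN for WorkerID 2 falls out of diff(): on a single date it returns a zero-length vector, and the mean of that is NaN. A quick illustration:
x <- as.Date("2015-09-18")
diff(x)       # difftime of length 0
mean(diff(x)) # NaN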
As an alternative to AkselA's answer, one can use the data.table package if one prefers it over base R.
This is similar to using aggregate, but may sometimes give a speed boost. In my example below I've handled workers with a single payment by setting the difference to 0, to illustrate how this can be achieved.
library(data.table)
df <- fread("WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18")
df[, PaymentDate := as.Date(PaymentDate)]
df[, {
  if (length(PaymentDate) > 1) {
    mean(diff(as.numeric(PaymentDate)))
  } else {
    0
  }
}, by = WorkerID]
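For completeness, the same per-worker computation in dplyr; a minimal sketch, assuming the tt data frame from the aggregate() answer above (here single payments give NA rather than 0):
library(dplyr)

tt %>%
  group_by(WorkerID) %>%
  summarise(mean_gap = if (n() > 1) mean(diff(as.numeric(PaymentDate))) else NA_real_)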

R Function involving two for loops - baseball data

For those into sports, I am working on a function that adds a column with the pitch count for a game in a given season for a pitcher.
For example's sake, the data used is a data frame called pitcher that contains a game_date and sv_id (date/timestamp of the pitch). My goal is to order the sv_id in ascending order for each unique game_date and then add a column that numbers this order. So, for example, if for game_date = 9/9/2018 there were 3 pitches thrown with sv_id's equal to 090918_031456, 090918_031613, and 090918_031534, I would first want to sort this data into chronological order (090918_031456, 090918_031534, 090918_031613) and then have a new column with the values 1, 2, 3 respectively to act as a pitch count. Below is my function so far. I originally thought I would make a list of lists, but now I am not sure that is the right way to go about this. Please help! This is also my first time posting on here, so any advice is appreciated. Thank you!!!
pitchCount <- function(game_date, sv_id) {
  gameUnique <- unique(pitcher$game_date)
  PC <- list()
  for (j in 1:length(gameUnique)) {
    PCLocal <- filter(pitcher, game_date == gameUnique[j]) # dplyr::filter
    PCLocal <- PCLocal[order(PCLocal$sv_id), ] # keep the sorted result
    for (i in 1:length(PCLocal$sv_id)) {
      PCLocal$PC[i] <- i
    }
    PC[[j]] <- PCLocal$PC # [[ ]] to store the whole vector
  }
  return(PC)
}
pitch.Count <- pitchCount(pitcher$game_date, pitcher$sv_id)
pitcher$PC <- pitch.Count
So you want to count pitches as they come in order, right? There should be no need for a loop; in R, loops are rarely needed.
Check if this is what you want. A tidyverse/dplyr solution.
The sv_id variable is in a format that can be converted to POSIXct (a date-time class), which makes it simple to sort in order.
library(tidyverse)
# Create the example data frame
pitcher <- tibble(game_date = as.Date(c("2018-09-09", "2018-09-09", "2018-09-09")),
                  sv_id = c("090918_031456", "090918_031613", "090918_031534"))
# First, convert the sv_id strings to POSIXct; this could be done inside the
# pipeline below, but a separate step makes it clearer. Note "%m%d%y", matching
# the month-day-year prefix of sv_id (the game was on 2018-09-09):
pitcher$sv_id <- as.POSIXct(pitcher$sv_id, format = "%m%d%y_%H%M%S", tz = "GMT")
# Create pitch count
pitcher %>%
  arrange(sv_id) %>%
  mutate(Count = 1, pitchcount = cumsum(Count), Count = NULL)
# A tibble: 3 x 3
game_date sv_id pitchcount
<date> <dttm> <dbl>
1 2018-09-09 2018-09-09 03:14:56 1
2 2018-09-09 2018-09-09 03:15:34 2
3 2018-09-09 2018-09-09 03:16:13 3
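If counts are needed per game across a whole season, a hedged variant (assuming sv_id sorts chronologically within each game_date, which holds for POSIXct values or zero-padded strings):
library(dplyr)

pitcher %>%
  group_by(game_date) %>%
  arrange(sv_id, .by_group = TRUE) %>%
  mutate(pitchcount = row_number()) %>%
  ungroup()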
Try using data.table.
library(data.table)
pitcher_dt <- data.table(pitcher)
> pitcher_dt
game_date sv_id
1: 2018-01-02 090918_031456
2: 2018-01-02 090918_031613
3: 2018-01-02 090918_031534
We can add a Count column by reference with :=, filled with the positions of 'sv_id' given by order(sv_id).
pitcher_dt[, Count := order(sv_id)]
> pitcher_dt
game_date sv_id Count
1: 2018-01-02 090918_031456 1
2: 2018-01-02 090918_031613 3
3: 2018-01-02 090918_031534 2
Since Count only records the position of 'sv_id', in this case (1, 3, 2), we can sort by either 'Count' or 'sv_id' in ascending order:
pitcher_dt[order(Count)] or pitcher_dt[order(sv_id)]
> pitcher_dt[order(Count)]
game_date sv_id Count
1: 2018-01-02 090918_031456 1
2: 2018-01-02 090918_031534 2
3: 2018-01-02 090918_031613 3
For me, it is easier to manipulate data with data.table, but you can also use dplyr.
Introduction to data.table is a good place to start learning about it.
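Note that order(sv_id) above ranks across the whole table; if pitcher_dt held several games, a per-game count would need a by clause. A minimal sketch under that assumption, using data.table's frank():
library(data.table)

pitcher_dt[, Count := frank(sv_id), by = game_date]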
I am not sure what your data looks like, but I assume the following from your description:
> df
# A tibble: 9 x 2
game_date sv_id
<date> <chr>
1 2018-09-09 090918_031456
2 2018-09-09 090918_031613
3 2018-09-09 090918_031534
4 2018-05-17 090918_031156
5 2018-05-17 090918_031213
6 2018-06-30 090918_031177
7 2018-06-30 090918_031211
8 2018-06-30 090918_031144
9 2018-06-30 090918_031203
Then you can use dplyr to generate your target:
library(dplyr)
df <- df %>%
  group_by(game_date) %>%
  mutate(count = n_distinct(sv_id)) %>% # count sv_ids within each game_date
  arrange(sv_id)
The output is:
# A tibble: 9 x 3
# Groups: game_date [3]
game_date sv_id count
<date> <chr> <int>
1 2018-06-30 090918_031144 4
2 2018-05-17 090918_031156 2
3 2018-06-30 090918_031177 4
4 2018-06-30 090918_031203 4
5 2018-06-30 090918_031211 4
6 2018-05-17 090918_031213 2
7 2018-09-09 090918_031456 3
8 2018-09-09 090918_031534 3
9 2018-09-09 090918_031613 3
I hope this helps.

Mark each row in a large dataframe via two variables

I have a dataframe like this (the real one is much larger):
time<-c(as.POSIXct('2011-11-11 06:00:00'),as.POSIXct('2011-11-11 06:05:00'),as.POSIXct('2011-11-11 07:05:00'),
as.POSIXct('2011-11-11 07:10:00'),as.POSIXct('2011-11-11 07:13:00'),as.POSIXct('2011-11-11 07:33:00'),
as.POSIXct('2011-11-11 05:05:00'),as.POSIXct('2011-11-11 06:05:00'),as.POSIXct('2011-11-11 06:20:00'),
as.POSIXct('2011-11-11 09:05:00'))
plate<-c('a','a','a','b','c','d','e','e','e','e')
df<-data.frame(time,plate)
The time variable represents the time that the vehicle was identified by the video device. The plate variable represents the vehicle's plate. The data frame has been ordered first by plate and then by time.
Given this, I want to divide each vehicle's trips by marking the rows. Different vehicles (plates) certainly represent different trips. For one vehicle, the identified time difference within one trip should be shorter than 30 minutes; if not, the rows belong to different trips.
Currently, I do this with the following code:
trip<-vector()
trip[1]<-1
time_diff<-as.POSIXct('2011-11-11 07:00:00')-as.POSIXct('2011-11-11 06:30:00')
for (x in 2:nrow(df)) {
  if (!df$plate[x] == df$plate[x-1]) {
    trip[x] <- trip[x-1] + 1
  } else if (df$time[x] - df$time[x-1] < time_diff) {
    trip[x] <- trip[x-1]
  } else {
    trip[x] <- trip[x-1] + 1
  }
}
df<-cbind(df,trip)
However, my df contains more than seven million rows thus my method will be very slow. So I'm asking if there are some more efficient ways to do this.
I'll suggest using dplyr for this, though with 7M rows you might consider a data.table solution if this doesn't work well for you.
library(dplyr)
time_diff<-as.POSIXct('2011-11-11 07:00:00')-as.POSIXct('2011-11-11 06:30:00')
df %>%
  arrange(time) %>% # it's important, so I reinforce it here
  group_by(plate) %>%
  mutate(
    trip = cumsum(c(TRUE, diff(time) > time_diff))
  ) %>%
  ungroup()
# # A tibble: 10 × 3
# time plate trip
# <dttm> <fctr> <int>
# 1 2011-11-11 06:00:00 a 1
# 2 2011-11-11 06:05:00 a 1
# 3 2011-11-11 07:05:00 a 2
# 4 2011-11-11 07:10:00 b 1
# 5 2011-11-11 07:13:00 c 1
# 6 2011-11-11 07:33:00 d 1
# 7 2011-11-11 05:05:00 e 1
# 8 2011-11-11 06:05:00 e 2
# 9 2011-11-11 06:20:00 e 2
# 10 2011-11-11 09:05:00 e 3
I much prefer the above solution using group_by, but if you want the trip to be unique across plates, one technique is to handle the grouping yourself (requiring strict ordering):
df %>%
  arrange(plate, time) %>%
  mutate(
    trip = cumsum(plate != lag(plate, default = plate[1]) | c(TRUE, diff(time) > time_diff))
  )
# time plate trip
# 1 2011-11-11 06:00:00 a 1
# 2 2011-11-11 06:05:00 a 1
# 3 2011-11-11 07:05:00 a 2
# 4 2011-11-11 07:10:00 b 3
# 5 2011-11-11 07:13:00 c 4
# 6 2011-11-11 07:33:00 d 5
# 7 2011-11-11 05:05:00 e 6
# 8 2011-11-11 06:05:00 e 7
# 9 2011-11-11 06:20:00 e 7
# 10 2011-11-11 09:05:00 e 8
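For the 7-million-row case, a data.table sketch of the same per-plate cumsum logic (an assumption on my part, not benchmarked here; comparing in explicit seconds also sidesteps difftime unit surprises):
library(data.table)

dt <- as.data.table(df)
setorder(dt, plate, time)
dt[, trip := cumsum(c(TRUE, diff(as.numeric(time)) > 30 * 60)), by = plate] # 30 minutes in seconds
As in the first dplyr solution, trip restarts at 1 within each plate.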

R programming - Split up a group of time series indexed by ID with irregular observation periods into regular monthly observations

I have a data.frame in R containing the amounts of something that users with unique IDs used between two dates.
ID start date end date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
....
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000
The time series for each ID starts and ends at inconsistent times and contains an inconsistent number of observations. However, they are all formatted in the above manner; the start and end dates are Date objects.
I would like to standardize the breakdowns for each ID to a monthly time series, with data points at the start of each month, weighting the observed amount numbers that straddle two or more months accordingly.
In other words, I would like to turn this series into something like
ID start date end date amount
1 1-1-2012 2-1-2012 3096 = 6000 * 16/31
1 2-1-2012 3-1-2012 4339 = 6000*15/31+4000*14/39
1 3-1-2012 4-1-2012 etc
....
1 6-1-2012 7-1-2012 etc
2 1-1-2012 2-1-2012 etc
2 2-1-2012 3-1-2012 etc
2 3-1-2012 4-1-2012 etc
2 4-1-2012 5-1-2012 etc
2 5-1-2012 6-1-2012 etc
....
10000 1-1-2012 2-1-2012 etc
....
10000 6-1-2012 7-1-2012 etc
Where the value for ID 1 between 2/1/12 and 3/1/12 is calculated by weighting the amount from the 1-15-2012 to 2-15-2012 observation by the share of its days that land in February (15 days / 31 days, applied to 6000), plus the amount from the 2-15 to 3-25 span weighted by the share of its days that fall in February (14 days / 39 days, as 2012 was a leap year, applied to 4000), yielding 6000*15/31 + 4000*14/39 = 4339. This should be done for each ID's time series. We do not consider the case where an observation period fits entirely within one month; but if one spreads over more than two months, it should be split over that number of months with the appropriate weightings.
I'm rather new to R and could certainly use some help on this!
Here is a solution using base R:
#The data
df=read.table(text='ID start_date end_date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000',
header=T,row.names = NULL,stringsAsFactors =FALSE)
df[, 2] <- as.Date(df[, 2], "%m-%d-%Y")
df[, 3] <- as.Date(df[, 3], "%m-%d-%Y")
df1 <- data.frame(n = 1:length(df$ID), ID = df$ID)
df1$startm <- as.Date(levels(cut(df[, 2], "month"))[cut(df[, 2], "month")], "%Y-%m-%d")
df1$endm <- as.Date(levels(cut(df[, 3], "month"))[cut(df[, 3], "month")], "%Y-%m-%d")
df1 <- df1[, -1]
# compute days in month and total days
df$dayin <- as.numeric((df1$endm - 1) - df$start_date)
df$daytot <- as.numeric(df$end_date - df$start_date)
# separate amount this month and next month
df$ammt <- df$amount * df$dayin / df$daytot
df$ammt.1 <- df$amount * (df$daytot - df$dayin) / df$daytot
# using by, compute the new amount
df1$amount <- do.call(c,
  by(df[, c("ammt", "ammt.1")], df$ID, function(d) d[, 1] + c(0, d[-nrow(d), 2]))
)
df1
> df1
ID startm endm amount
1 1 2012-01-01 2012-02-01 3096.774
2 1 2012-02-01 2012-03-01 4339.123
3 1 2012-03-01 2012-05-01 4306.038
4 1 2012-05-01 2012-06-01 1535.842
5 2 2012-01-01 2012-02-01 2500.000
6 2 2012-02-01 2012-03-01 4700.000
7 2 2012-03-01 2012-05-01 3754.545
8 10000 2012-01-01 2012-02-01 5302.326
9 10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000
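As a quick arithmetic check, row 2 above matches the OP's worked example for ID 1 in February:
6000 * 15/31 + 4000 * 14/39
# [1] 4339.123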
To solve this I think the easiest way is to break it down into two problems:
1. How can I get a daily breakdown of the figures I'm interested in? (This is my assumption based on the information you provided above.)
2. How do I group by a date range and summarise to what I'm interested in?
For the following example, I will use the data set which I created using the code below:
df <- data.frame(
id=c(1,1,1,1,2,2,2),
start_date=as.Date(c("1-15-2012",
"2-15-2012",
"3-25-2012",
"5-26-2012",
"1-16-2012",
"2-27-2012",
"3-18-2012"), "%m-%d-%Y"),
end_date=as.Date(c("2-15-2012",
"3-25-2012",
"5-26-2012",
"6-13-2012",
"2-27-2012",
"3-18-2012",
"5-23-2012"), "%m-%d-%Y"),
amount=c(6000,
4000,
3000,
1000,
7000,
2000,
3000)
)
1. Provide daily figures
To provide the daily figures, firstly we get the daily contribution:
df$daily_contribution = df$amount/as.numeric(df$end_date - df$start_date)
Then we will expand the date range using the start and end dates. There are a couple of ways to do it, but seeing that you applied the dplyr tag, the dplyr way is:
library(dplyr)
df <- df %>%
  rowwise() %>%
  do(data.frame(id = .$id,
                date = as.Date(seq(from = .$start_date, to = .$end_date, by = "day")),
                daily_contribution = .$daily_contribution))
which has some output which looks like this:
Source: local data frame [285 x 3]
Groups: <by row>
id date daily_contribution
1 1 2012-01-15 193.5484
2 1 2012-01-16 193.5484
3 1 2012-01-17 193.5484
4 1 2012-01-18 193.5484
5 1 2012-01-19 193.5484
6 1 2012-01-20 193.5484
7 1 2012-01-21 193.5484
8 1 2012-01-22 193.5484
9 1 2012-01-23 193.5484
10 1 2012-01-24 193.5484
.. .. ... ...
2. Create a grouping variable
Next we create a grouping variable that we're interested in. I've used lubridate to get the month and year of the dates easily:
library(lubridate)
df$mnth=month(df$date)
df$yr=year(df$date)
Now with all of this we can easily use dplyr to summarise our information by the dates as required.
df %>%
  group_by(id, mnth, yr) %>%
  summarise(amount = sum(daily_contribution))
with output:
Source: local data frame [11 x 4]
Groups: id, mnth
id mnth yr amount
1 1 1 2012 3290.3226
2 1 2 2012 4441.6873
3 1 3 2012 2902.8122
4 1 4 2012 1451.6129
5 1 5 2012 1591.3978
6 1 6 2012 722.2222
7 2 1 2012 2666.6667
8 2 2 2012 4800.0000
9 2 3 2012 2436.3636
10 2 4 2012 1363.6364
11 2 5 2012 1045.4545
To get it precisely in the format you specified:
df %>%
  rowwise() %>%
  mutate(start_date = as.Date(ISOdate(yr, mnth, 1)),
         end_date = as.Date(ISOdate(yr, mnth + 1, 1))) %>% # note: mnth + 1 would give NA past December
  select(id, start_date, end_date, amount)
with output:
Source: local data frame [11 x 4]
Groups: <by row>
id start_date end_date amount
1 1 2012-01-01 2012-02-01 3290.3226
2 1 2012-02-01 2012-03-01 4441.6873
3 1 2012-03-01 2012-04-01 2902.8122
4 1 2012-04-01 2012-05-01 1451.6129
5 1 2012-05-01 2012-06-01 1591.3978
6 1 2012-06-01 2012-07-01 722.2222
7 2 2012-01-01 2012-02-01 2666.6667
8 2 2012-02-01 2012-03-01 4800.0000
9 2 2012-03-01 2012-04-01 2436.3636
10 2 2012-04-01 2012-05-01 1363.6364
11 2 2012-05-01 2012-06-01 1045.4545
as needed.
note: I can see from your example that you have 3096 = 6000 * 16/31 and 4339 = 6000*15/31 + 4000*14/39, but for the first one, as an example, you have 15 Jan to 31 Jan, which is 17 days if the date range is inclusive. You can trivially alter this if required.
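If the half-open convention [start, end) is preferred instead, so that each day is counted in exactly one span, here is a base R sketch under that assumption, reusing the original seven-row df from the start of this answer (before the pipeline above overwrote it):
daily <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  days <- seq(df$start_date[i], df$end_date[i] - 1, by = "day")
  data.frame(id = df$id[i], date = days,
             contrib = df$amount[i] / length(days)) # equal daily share of the span
}))
daily$month <- format(daily$date, "%Y-%m")
aggregate(contrib ~ id + month, data = daily, FUN = sum)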
Here's a solution using plyr and reshape. The numbers aren't the same as what you provided, so I may have misunderstood your intent though this seems to meet your stated goal (weighted average of amount by month).
df$index <- 1:nrow(df) #Create a unique index number
#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")
library(plyr); library(reshape) #Load the libraries
#dlaply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
# ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) {
  ID <- X$ID # Keep the ID value
  n.days <- as.numeric(difftime(X$end.date, X$start.date)) # Time difference in days, as a number
  day <- seq(X$start.date, X$end.date, by = "days") # Sequence of days
  amount.per.day <- X$amount / n.days # Amount for each day
  data.frame(ID, day, amount.per.day) # Last line is the output
})
#Change list back into data.frame
df3 <- ldply(df2, data.frame) #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <- as.numeric(format(df3$day, "%m")) #Assign a month to all dates
#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))
# ID mon amount
# 1 1 1 3290.3226
# 2 1 2 4441.6873
# 3 1 3 2902.8122
# 4 1 4 1451.6129
# 5 1 5 1591.3978
# 6 1 6 722.2222
# 7 2 1 2666.6667
# 8 2 2 4800.0000
# 9 2 3 2436.3636
# 10 2 4 1363.6364
# 11 2 5 1045.4545
Incidentally, for future posts, you can get faster answers if you provide the code to replicate your data. If your data is somewhat complicated, you can use dput(yourdata).
HTH!

Calculating elapsed time for different interview dates in R

So my data looks like this
dat <- data.frame(
  subjid = c("a","a","a","b","b","c","c","d","e"),
  type = c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
  date = c("2013-02-07","2013-02-27","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
i.e.:
subjid type date
1 a baseline 2013-02-07
2 a first 2013-02-27
3 a second 2013-04-30
4 b baseline 2013-03-03
5 b first 2013-05-23
6 c baseline 2013-01-02
7 c first 2013-07-23
8 d baseline 2013-03-29
9 e baseline 2013-06-03
and I'm trying to make a variable "elapsedtime" that denotes the time elapsed from the baseline date to the first and second round interview dates (so that elapsedtime = 0 for baselines). Note that it varies individually whether subjects have taken further interviews.
I tried to reshape the data so that I could subtract the dates, but my brain isn't really functioning today. Or is there another way?
Please help and thank you.
Screaming out for ave:
I'll throw an NA value in there just for good measure:
dat <- data.frame(
  subjid = c("a","a","a","b","b","c","c","d","e"),
  type = c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
  date = c("2013-02-07","NA","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
And you should probably sort the data to be on the safe side:
dat$type <- ordered(dat$type,levels=c("baseline","first","second","third") )
dat <- dat[order(dat$subjid,dat$type),]
Turn your date into a proper Date object:
dat$date <- as.Date(dat$date)
Then calculate the differences:
dat$elapsed <- ave(as.numeric(dat$date), dat$subjid, FUN = function(x) x - x[1])
# subjid type date elapsed
#1 a baseline 2013-02-07 0
#2 a first <NA> NA
#3 a second 2013-04-30 82
#4 b baseline 2013-03-03 0
#5 b first 2013-05-23 81
#6 c baseline 2013-01-02 0
#7 c first 2013-07-23 202
#8 d baseline 2013-03-29 0
#9 e baseline 2013-06-03 0
This makes no assumption that baseline is always at position 1:
dat$date <- as.Date(dat$date)
dat$elapsed <- unlist(by(dat, dat$subjid, FUN = function(x) {
  as.numeric(x$date - x[x$type == "baseline", ]$date)
}))
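For comparison, a dplyr sketch of the same idea, assuming dat$date has already been converted with as.Date() and each subject has exactly one baseline row:
library(dplyr)

dat %>%
  group_by(subjid) %>%
  mutate(elapsedtime = as.numeric(date - date[type == "baseline"][1])) %>%
  ungroup()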
