Calculating elapsed time for different interview dates in R

So my data looks like this:
dat <- data.frame(
  subjid = c("a","a","a","b","b","c","c","d","e"),
  type = c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
  date = c("2013-02-07","2013-02-27","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
i.e.
subjid type date
1 a baseline 2013-02-07
2 a first 2013-02-27
3 a second 2013-04-30
4 b baseline 2013-03-03
5 b first 2013-05-23
6 c baseline 2013-01-02
7 c first 2013-07-23
8 d baseline 2013-03-29
9 e baseline 2013-06-03
and I'm trying to make a variable "elapsedtime" that denotes the time elapsed from the baseline date to the first- and second-round interview dates (so that elapsedtime = 0 for baselines). Note that it varies by individual whether they have taken further interviews.
I tried to reshape the data so that I could subtract the dates, but my brain isn't really functioning today -- or is there another way?
Please help and thank you.

Screaming out for ave:
I'll throw an NA value in there just for good measure:
dat <- data.frame(
  subjid = c("a","a","a","b","b","c","c","d","e"),
  type = c("baseline","first","second","baseline","first","baseline","first","baseline","baseline"),
  date = c("2013-02-07","NA","2013-04-30","2013-03-03","2013-05-23","2013-01-02","2013-07-23","2013-03-29","2013-06-03"))
And you should probably sort the data to be on the safe side:
dat$type <- ordered(dat$type, levels = c("baseline","first","second","third"))
dat <- dat[order(dat$subjid, dat$type), ]
Turn your date into a proper Date object:
dat$date <- as.Date(dat$date)
Then calculate the differences:
dat$elapsed <- ave(as.numeric(dat$date), dat$subjid, FUN = function(x) x - x[1])  # days since each subject's first (baseline) row
# subjid type date elapsed
#1 a baseline 2013-02-07 0
#2 a first <NA> NA
#3 a second 2013-04-30 82
#4 b baseline 2013-03-03 0
#5 b first 2013-05-23 81
#6 c baseline 2013-01-02 0
#7 c first 2013-07-23 202
#8 d baseline 2013-03-29 0
#9 e baseline 2013-06-03 0
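If ave() looks opaque: it splits its first argument into chunks by the grouping variable, applies FUN to each chunk, and returns the results in the original row order. Here is roughly the same thing spelled out in plain base R (a sketch of mine, not ave()'s actual internals; elapsed2 is just an illustrative name):
g <- split(as.numeric(dat$date), dat$subjid)   # one numeric vector per subject
res <- lapply(g, function(x) x - x[1])         # days since each subject's first row
dat$elapsed2 <- unsplit(res, dat$subjid)       # reassemble in the original row order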

This makes no assumption that baseline is always at position 1:
dat$date <- as.Date(dat$date)
dat$elapsed <- unlist(by(dat, dat$subjid, FUN = function(x) {
  as.numeric(x$date - x[x$type == "baseline", ]$date)
}))
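For completeness, the same per-subject lookup reads quite naturally in dplyr as well (a sketch, assuming the dat built above with dat$date already converted to Date):
library(dplyr)
dat %>%
  group_by(subjid) %>%
  mutate(elapsed = as.numeric(date - date[type == "baseline"])) %>%
  ungroup()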


Calculating Average Time between Dates

I'm having difficulty calculating the average time between the payment dates in my CSV. I have tried multiple methods that I have seen online (changing to data.table, using ddply) with no success.
WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
This is an example of my dataset. I want to calculate the average time between the PaymentDates (in number of days), in the simplest way possible, grouped by WorkerID.
Thank you!
This is a perfect job for aggregate(). It groups PaymentDate by WorkerID and applies the function mean(diff(.)) to each group.
tt <- read.table(text="
WorkerID PaymentDate
1 2015-06-18
1 2015-07-18
1 2015-08-18
2 2015-09-18
3 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18
4 2015-12-16", header=TRUE)
tt$PaymentDate <- as.Date(tt$PaymentDate)
aggregate(PaymentDate ~ WorkerID, data=tt, FUN=function(x) mean(diff(x)))
# WorkerID PaymentDate
# 1 1 30.5
# 2 2 NaN
# 3 3 31.0
# 4 4 29.5
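To see what mean(diff(.)) is doing for one group, take worker 1's three payment dates by hand (a quick illustration, not part of the original answer):
x <- as.Date(c("2015-06-18", "2015-07-18", "2015-08-18"))
diff(x)        # gaps of 30 and 31 days
mean(diff(x))  # 30.5 days, the value reported for WorkerID 1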
As an alternative to AkselA's answer, one can use the data.table package if one prefers it over base R.
This is similar to using aggregate, but may sometimes give a speed boost. In the example below I've handled workers with a single payment by setting the difference to 0, to illustrate how this can be achieved.
library(data.table)
df <- fread("WorkerID PaymentDate
1 2015-07-18
1 2015-08-18
3 2015-09-18
4 2015-10-18
4 2015-11-18")
df[, PaymentDate := as.Date(PaymentDate)]
df[, {
  if (length(PaymentDate) > 1) {
    mean(diff(as.numeric(PaymentDate)))
  } else {
    0
  }
}, by = WorkerID]
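One small follow-up (my own sketch, not part of the answer): the anonymous {...} block above leaves the result column named V1. Naming it explicitly and using data.table's built-in .N row counter reads a little cleaner:
df[, .(avg_gap = if (.N > 1) mean(diff(as.numeric(PaymentDate))) else 0), by = WorkerID]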

Look up observation date, between two dates in reference data frame, and report another reference value for all observations in R

I have dates associated with c. 2000 observations, and would like to add a new factor, 'season', to my observation data. Each date falls within a season, bounded by a start and end date. I have the season, start date and end date in a small reference data frame, and would like to look up each observation date and deduce which season it belongs to, by which start and end dates it falls between.
I've tried various combinations of the within and interval functions but cannot get them to report back the season value.
https://rdrr.io/cran/lubridate/man/within-interval.html
Similar questions have been asked on Stack Exchange, but not worded clearly enough for an applicable answer to help me.
My reference data are as follows:
Season <- c("A","B","C","D","E","F","G","H","I","J","K")
Start <- c("29-Apr-12","19-Oct-12",
           "29-Apr-13","19-Oct-13",
           "29-Apr-14","19-Oct-14",
           "29-Apr-15","19-Oct-15",
           "29-Apr-16","19-Oct-16",
           "29-Apr-17")
Start <- as.Date(Start, "%d-%b-%y")
End <- c("18-Oct-12","28-Apr-13",
         "18-Oct-13","28-Apr-14",
         "18-Oct-14","28-Apr-15",
         "18-Oct-15","28-Apr-16",
         "18-Oct-16","28-Apr-17",
         "18-Oct-17")
End <- as.Date(End, "%d-%b-%y")
Reference.df <- data.frame(Season, Start, End)
> Reference.df
   Season      Start        End
1       A 2012-04-29 2012-10-18
2       B 2012-10-19 2013-04-28
3       C 2013-04-29 2013-10-18
4       D 2013-10-19 2014-04-28
5       E 2014-04-29 2014-10-18
6       F 2014-10-19 2015-04-28
7       G 2015-04-29 2015-10-18
8       H 2015-10-19 2016-04-28
9       I 2016-04-29 2016-10-18
10      J 2016-10-19 2017-04-28
11      K 2017-04-29 2017-10-18
And my observation data are as follows:
Date <- c("25-Apr-14","03-May-14","24-Nov-15","16-Feb-14","02-May-14",
          "21-Apr-17","27-Apr-15","27-Apr-13","12-Aug-16","16-Apr-14")
Date <- as.Date(Date, "%d-%b-%y")
Observation <- seq(1,10)
Data.df <- data.frame(Observation,Date)
> Data.df
   Observation       Date
1            1 2014-04-25
2            2 2014-05-03
3            3 2015-11-24
4            4 2014-02-16
5            5 2014-05-02
6            6 2017-04-21
7            7 2015-04-27
8            8 2013-04-27
9            9 2016-08-12
10          10 2014-04-16
My desired output is as follows:
> Data.df
   Observation       Date Season
1            1 2014-04-25      D
2            2 2014-05-03      E
3            3 2015-11-24      H
4            4 2014-02-16      D
5            5 2014-05-02      E
6            6 2017-04-21      J
7            7 2015-04-27      F
8            8 2013-04-27      B
9            9 2016-08-12      I
10          10 2014-04-16      D
I think this has been asked before, but I could not find the relevant post at the moment.
Anyway, to answer your question: we can use sapply to check where each Date falls between the Start and End dates in Reference.df and extract the corresponding Season.
Data.df$Season <- sapply(Data.df$Date, function(x)
  Reference.df$Season[x >= Reference.df$Start & x <= Reference.df$End])
Data.df
# Observation Date Season
#1 1 2014-04-25 D
#2 2 2014-05-03 E
#3 3 2015-11-24 H
#4 4 2014-02-16 D
#5 5 2014-05-02 E
#6 6 2017-04-21 J
#7 7 2015-04-27 F
#8 8 2013-04-27 B
#9 9 2016-08-12 I
#10 10 2014-04-16 D
This assumes a unique Start and End date for each season, with no overlap. If there is an overlap, we can use which.max to get the first Season the date falls into.
sapply(Data.df$Date, function(x)
  Reference.df$Season[which.max(x >= Reference.df$Start & x <= Reference.df$End)])
Or, a better option, use cut, assuming the Seasons run continuously one after another:
cut(as.numeric(Data.df$Date),
    breaks = c(-Inf, Reference.df$Start[-1], Inf), labels = Reference.df$Season)
#[1] D E H D E J F B I D
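In the same spirit, findInterval() maps each date to the last season start at or before it (my sketch; it assumes Reference.df is sorted by Start and the seasons are contiguous, as they are here):
Data.df$Season <- Reference.df$Season[findInterval(as.numeric(Data.df$Date), as.numeric(Reference.df$Start))]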
Using dplyr and magrittr doesn't save you as much typing as sapply, but it is a bit easier to follow in case you're sharing this with not-so-expeRienced colleagues (note the inclusive >=/<= comparisons, so that boundary dates land in the right season):
library(dplyr)
library(magrittr)
Data.df <- Data.df %>%
  mutate(Season = case_when(
    Date >= as.Date("2012-04-29") & Date <= as.Date("2012-10-18") ~ "A",
    Date >= as.Date("2012-10-19") & Date <= as.Date("2013-04-28") ~ "B",
    Date >= as.Date("2013-04-29") & Date <= as.Date("2013-10-18") ~ "C",
    Date >= as.Date("2013-10-19") & Date <= as.Date("2014-04-28") ~ "D",
    Date >= as.Date("2014-04-29") & Date <= as.Date("2014-10-18") ~ "E",
    Date >= as.Date("2014-10-19") & Date <= as.Date("2015-04-28") ~ "F",
    Date >= as.Date("2015-04-29") & Date <= as.Date("2015-10-18") ~ "G",
    Date >= as.Date("2015-10-19") & Date <= as.Date("2016-04-28") ~ "H",
    Date >= as.Date("2016-04-29") & Date <= as.Date("2016-10-18") ~ "I",
    Date >= as.Date("2016-10-19") & Date <= as.Date("2017-04-28") ~ "J",
    Date >= as.Date("2017-04-29") & Date <= as.Date("2017-10-18") ~ "K"))
Data.df
#   Observation       Date Season
#1            1 2014-04-25      D
#2            2 2014-05-03      E
#3            3 2015-11-24      H
#4            4 2014-02-16      D
#5            5 2014-05-02      E
#6            6 2017-04-21      J
#7            7 2015-04-27      F
#8            8 2013-04-27      B
#9            9 2016-08-12      I
#10          10 2014-04-16      D
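If you'd rather not hand-type every season, a data.table non-equi join can do the interval lookup directly from Reference.df (a sketch of mine, not part of the original answers):
library(data.table)
setDT(Data.df); setDT(Reference.df)
Data.df[Reference.df, Season := i.Season, on = .(Date >= Start, Date <= End)]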

R dplyr select row of minimum date difference between two separate events

I am working with a data set that includes roughly 400 unique subjects. For this example I will only be working with two, however. You can generate sample data with this code:
set.seed(100)
library(tidyr)
library(dplyr)
Subject <- c("A","A","A","A","A","A","B","B","B","B")
Event1 <- c("01/01/2001","01/01/2001","01/01/2001","01/01/2001","09/09/2001","09/09/2001","09/09/2009","09/09/2009","09/09/2009","09/09/2009")
random.dates <- function(N, sd = "2001-01-01", ed = "2010-01-01") {
  sd <- as.Date(sd, "%Y-%m-%d")
  ed <- as.Date(ed, "%Y-%m-%d")
  dt <- as.numeric(difftime(ed, sd))
  ev <- sort(runif(N, 0, dt))
  sd + ev
}
Event1 <- as.Date(Event1, "%m/%d/%Y")
Event1
Event2 <- random.dates(10)
df <- data.frame(Subject, Event1, Event2)
df
and produces something close to this output:
Subject Event1 Event2
1 A 2001-01-01 2001-05-04
2 A 2001-01-01 2001-09-24
3 A 2001-01-01 2002-10-22
4 A 2001-01-01 2003-02-25
5 A 2001-09-09 2007-07-16
6 A 2001-09-09 2008-04-06
7 B 2009-09-09 2008-07-12
8 B 2009-09-09 2008-07-24
9 B 2009-09-09 2009-04-01
10 B 2009-09-09 2009-09-11
In this case I am interested in first grouping unique Subjects with unique Event1's, which I can do easily. From there I need to select the Event2 that falls closest to Event1 for that unique Subject-Event1 combination, which is what I really need help with. For this example, these data should collapse to 3 records:
Subject Event1 Event2
1 A 2001-01-01 2001-05-04
2 A 2001-09-09 2008-04-06
3 B 2009-09-09 2009-09-11
I've jerry-rigged a solution to produce the 3 records of Subject-Event1 combinations:
df2 <- df
df2$SubEv <- paste(df2$Subject, df2$Event1)
df2$Event1 <- NULL
df2$Subject <- NULL
df2$Event2 <- NULL
df2 <- unique(df2)
df2 <- separate(df2, SubEv, c("Subject","Event1"), sep = " ")
From here I'm just lost as to how to make R select from df the date of Event2 that is closest to Event1.
I already know that my code is super inefficient and sloppy (probably because of my approach from the get-go). I'd like to know how to do this (at all, honestly), and if there's a way to do it in fewer than 10 lines of code, that would be pretty boss.
With dplyr:
library(dplyr)
df %>%
  group_by(Subject, Event1) %>%
  slice(which.min(abs(Event1 - Event2)))
# Subject Event1 Event2
# (chr) (date) (date)
# 1 A 2001-01-01 2001-07-05
# 2 A 2001-09-09 2004-05-02
# 3 B 2009-09-09 2008-04-24
Comments:
group_by can work with multiple columns.
slice selects row numbers within a group. Alternately...
... %>% filter( row_number() == which.min(abs(Event1 - Event2)) )
For a tie, which.min will return the first min. See ?which.min for details.
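A quick illustration of that tie behaviour (my own example):
which.min(c(3, 1, 1))  # returns 2, the position of the first minimum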
Data: When I run the OP's code, I get df looking like
Subject Event1 Event2
1 A 2001-01-01 2001-07-05
2 A 2001-01-01 2002-07-14
3 A 2001-01-01 2003-04-27
4 A 2001-01-01 2003-10-09
5 A 2001-09-09 2004-05-02
6 A 2001-09-09 2005-03-21
7 B 2009-09-09 2005-05-10
8 B 2009-09-09 2005-12-02
9 B 2009-09-09 2005-12-21
10 B 2009-09-09 2008-04-24
which explains why my result doesn't match exactly the OP's expected result.

Fastest way for filling-in missing dates for data.table

I am loading a data.table from a CSV file that has date, orders, amount, etc. fields.
The input file occasionally does not have data for all dates. For example, as shown below:
> NADayWiseOrders
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-04 1 18.81 0
4: 2013-01-05 2 77.62 0
5: 2013-01-07 2 35.82 2
In the above, 03-Jan and 06-Jan have no entries.
I would like to fill the missing entries with default values (say, zero for orders, amount etc.), or carry the last value forward (e.g., 03-Jan reuses 02-Jan's values, 06-Jan reuses 05-Jan's values, and so on).
What is the best/optimal way to fill such gaps in the date data with default values?
The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays - that may work for weekdays (since there are just 7 of them), but I'm not sure it is the right way to go for dates, especially when dealing with multi-year data.
The idiomatic data.table way (using rolling joins) is this:
setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"),
                 to = as.Date("2013-01-07"),
                 by = "days")
NADayWiseOrders[J(all_dates), roll = Inf]
date orders amount guests
1: 2013-01-01 50 2272.55 149
2: 2013-01-02 3 64.04 4
3: 2013-01-03 3 64.04 4
4: 2013-01-04 1 18.81 0
5: 2013-01-05 2 77.62 0
6: 2013-01-06 2 77.62 0
7: 2013-01-07 2 35.82 2
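If you want the zero-defaults option from the question instead of carrying values forward, one sketch (using the same keyed table and all_dates as above; setnafill needs data.table >= 1.12.4):
filled <- NADayWiseOrders[J(all_dates)]  # plain (non-rolling) join: missing dates appear as NA rows
setnafill(filled, fill = 0, cols = c("orders", "amount", "guests"))
filled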
Here is how you fill in the gaps within each subgroup:
# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[, date := as.Date(date)]
# the desired dates by group
indx <- dt[, .(date = seq(min(date), max(date), "months")), group]
# key the tables and join them using a rolling join
setkey(dt, group, date)
setkey(indx, group, date)
dt[indx, roll = TRUE]
#> group date x
#> 1: a 2017-01-01 1
#> 2: a 2017-02-01 2
#> 3: a 2017-03-01 2
#> 4: a 2017-04-01 2
#> 5: a 2017-05-01 3
#> 6: b 2017-02-01 4
#> 7: b 2017-03-01 4
#> 8: b 2017-04-01 5
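As an aside (not from the original answer), roll accepts other fill directions too:
dt[indx, roll = -Inf]       # roll backwards: take the next observation instead
dt[indx, roll = "nearest"]  # take whichever observation is closest in time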
Not sure if it's the fastest, but it'll work if there are no NAs in the data:
# just in case these aren't Dates.
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date = seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by = "day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by = "date", all = TRUE)
# now carry forward last observation (alternatively, set NAs to 0)
require(zoo)  # na.locf() lives in zoo (xts imports it)
na.locf(dt)

Extract rows from matrix based on if condition applied to each row in R

Could you help me figure out why the following doesn't work? I have a 2528x3 matrix uniqueitems which looks like this:
Number Created Customer
=========== =================== ============
31464686486 2013-10-25 10:00:00 john#john.de
...
What I'd like to do: Go through every row, check if Created is more recent than a given time and, if so, write the row into a new table newerthantable. Here's my code:
library(lubridate)
newerthan <- function(x) {
  times <- ymd_hms(uniqueitems[, 2])
  newerthantable <- matrix(data = NA, ncol = 3, nrow = 1)
  i <- 1
  while (i <= nrow(uniqueitems)) {
    if (x < times[i]) {
      newerthantable <- rbind(newerthantable, uniqueitems[i, ])
    }
    i <- i + 1
  }
}
But newerthan("2013-10-24 14:00:00") doesn't have the desired effect :(, nothing is written in newerthantable. Why?
In R, loops are rarely needed; you can achieve the same result using vectorized operations or subsetting, as in this case. (Incidentally, the reason nothing is written in your version: newerthantable is created inside newerthan(), so it is local to the function, and the function never returns it.)
Set up a sample data frame:
number <- c(1:10)
created <- seq(as.POSIXct("2013-01-01 10:01"), length.out=10, by="26 hours")
customer <- letters[c(1:10)]
df <- data.frame(number, created, customer)
head(df, 10)
number created customer
1 1 2013-01-01 10:01:00 a
2 2 2013-01-02 12:01:00 b
3 3 2013-01-03 14:01:00 c
4 4 2013-01-04 16:01:00 d
5 5 2013-01-05 18:01:00 e
6 6 2013-01-06 20:01:00 f
7 7 2013-01-07 22:01:00 g
8 8 2013-01-09 00:01:00 h
9 9 2013-01-10 02:01:00 i
10 10 2013-01-11 04:01:00 j
Select rows newer than a given date:
newerthantable <- df[df$created > as.POSIXct("2013-01-05 18:01:00"), ]
head(newerthantable,10)
number created customer
6 6 2013-01-06 20:01:00 f
7 7 2013-01-07 22:01:00 g
8 8 2013-01-09 00:01:00 h
9 9 2013-01-10 02:01:00 i
10 10 2013-01-11 04:01:00 j
The square brackets select the rows matching our criteria (created column later than the given date) and all columns (no column specification after the comma). Read more about subsetting operations here: http://www.ats.ucla.edu/stat/r/modules/subsetting.htm
If you want to wrap it up as a function, it will look like this:
new_entries <- function(data, rows_since) {
  data[data$created > as.POSIXct(rows_since), ]
}
new_entries(df, "2013-01-05 18:01:00")
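And if you do want to keep something shaped like your original function, the minimal fix is to return the subset rather than assigning it to a local variable (a sketch against the OP's uniqueitems matrix):
newerthan <- function(x) {
  times <- ymd_hms(uniqueitems[, 2])               # parse the Created column
  uniqueitems[ymd_hms(x) < times, , drop = FALSE]  # return the subset instead of assigning it locally
}
newerthantable <- newerthan("2013-10-24 14:00:00")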
