r data.table : lagging a date variable [duplicate] - r

This question already has answers here:
How to create a lag variable within each group?
(5 answers)
Closed 2 years ago.
I have data that looks similar to the following except with hundreds of IDs and thousands of observations:
ID date measles
1 2008-09-12 1
1 2008-10-25 NA
1 2009-01-12 1
1 2009-03-12 NA
1 2009-05-12 1
2 2010-05-17 NA
2 2010-06-12 NA
2 2010-07-02 1
2 2010-08-13 NA
I want to create a variable that will store the previous date for each pid like the following:
ID date measles previous_date
1 2008-09-12 1 NA
1 2008-10-25 NA 2008-09-12
1 2009-01-12 1 2008-10-25
1 2009-03-12 NA 2009-01-12
1 2009-05-12 1 2009-03-12
2 2010-05-17 NA NA
2 2010-06-12 NA 2010-05-17
2 2010-07-02 1 2010-06-12
2 2010-08-13 NA 2010-07-02
This should be an extremely easy task, but I have been unsuccessful at getting a lag variable to work properly. I have tried a few methods, such as the following:
dt[, previous_date:=c(NA, current_date[-.N]), by=c("ID")]
dt[,previous_date:=current_date-shift(current_date,1,type="lag"),by=ID]
The code samples above either produce sporadic numbers in the previous_date variable or produce all NAs. I'm not sure why this is? Is it because I'm using a date variable as opposed to an integer?
Is there a better way to accomplish this task that would work for a date variable?

We can just use shift on the 'date' column grouped by 'ID'. By default the type is lag
library(data.table)
dt[, previous_date := shift(date), ID]
dt
# ID date measles previous_date
#1: 1 2008-09-12 1 <NA>
#2: 1 2008-10-25 NA 2008-09-12
#3: 1 2009-01-12 1 2008-10-25
#4: 1 2009-03-12 NA 2009-01-12
#5: 1 2009-05-12 1 2009-03-12
#6: 2 2010-05-17 NA <NA>
#7: 2 2010-06-12 NA 2010-05-17
#8: 2 2010-07-02 1 2010-06-12
#9: 2 2010-08-13 NA 2010-07-02

Related

Reconstruct dataframe with dates as date intervals in R

I have a dataset basically looks like that, giving which campaigns are active for each household with given start and end dates of respective campaigns:
campaign_id household_id campaign_type start_date end_date
1 26 1 Type B 2016-12-28 2017-02-19
2 8 1 Type A 2017-05-08 2017-06-25
3 12 1 Type B 2017-07-12 2017-08-13
4 13 1 Type A 2017-08-08 2017-09-24
5 18 1 Type A 2017-10-30 2017-12-24
6 20 1 Type C 2017-11-27 2018-02-05
7 22 1 Type B 2017-12-06 2018-01-07
8 23 1 Type B 2017-12-28 2018-02-04
And I create a new dataframe with given structure, which will show which campaigns are active for given household in a given time (having all the campaign numbers as columns, i have omitted the rest while putting here):
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 NA NA NA NA
2 1 2016-12-06 NA NA NA NA
3 1 2016-12-28 NA NA NA NA
4 1 2017-02-08 NA NA NA NA
5 1 2017-03-03 NA NA NA NA
6 1 2017-03-08 NA NA NA NA
7 1 2017-03-13 NA NA NA NA
8 1 2017-03-29 NA NA NA NA
9 1 2017-04-03 NA NA NA NA
10 1 2017-04-19 NA NA NA NA
What I want to do is assigning the active promotions in the given dates as rows in the second dataframe. For example if household_id 1 is having campaign 2 running in 2016-11-14 but no other campaigns, then it will look like this:
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 0 1 0 0
How can i manage this construction, should I use for loops in the initial dataframe and assign to second one in each loop, or there is a better and faster way? Thanks in advance.

In R: how to sum a variable by group between two dates

I have two data frames (DF1 and DF2):
(1) DF1 contains information on individual-level, i.e. on 10.000 individuals nested in 30 units across 11 years (2000-2011). It contains four variables:
"individual" (numeric id for each individual; ranging from 1-10.000)
"unit" (numeric id for each unit; ranging from 1-30)
"date1" (a date in date format, i.e. 2000-01-01, etc; ranging from 2000-01-01 to 2010-12-31)
"date2" ("Date1" + 1 year)
(2) DF2 contains information on unit-level, i.e. on the same 30 units as in DF1 across the same time period (2000-2011) and further contains a numeric variable ("x"):
"unit" (numeric id for each unit; ranging from 1-30)
"date" (a date in date format, i.e. 2000-01-01, etc; ranging from 2000-01-01 to 2011-12-31)
"x" (a numeric variable, ranging from 0 to 200)
I would like to create new variable ("newvar") that gives me for each "individual" per "unit" the sum of "x" (DF2) counting from "date1" (DF1) to "date2" (DF2). This means that I would like to add this new variable to DF1.
For instance, if "individual"=1 in "unit"=1 has "date1"=2000-01-01 and "date2"=2001-01-01, and in DF2 "unit"=1 has three observations in the time period "date1" to "date2" (i.e. 2000-01-01 to 2001-01-01) with "x"=1, "x"=2 and "x"=3, then I would like add a new variable that gives for "individual"=1 in "unit"=1 "newvar"=6.
I assume that I need to use a for loop in R and have been using the following code:
for(i in length(DF1)){
DF1$newvar[i] <-sum(DF2$x[which(DF1$date == DF1$date1[i] &
DF1$date == DF1P$date1[i] &
DF2$unit == DF1P$unit[i]),])
}
but get the error message:
Error in DF2$x[which(DF2$date == : incorrect number of dimensions
Any ideas of how to create this variable would be tremendously appreciated!
Here is a small example as well as the expected output, using one unit for the sake of simplicity:
Assume DF1 looks as follows:
individual unit date1 date2
1 1 2000-01-01 2001-01-01
2 1 2000-02-02 2001-02-02
3 1 2000-03-03 2000-03-03
4 1 2000-04-04 2000-04-04
5 1 2000-12-31 2001-12-31
(...)
996 1 2010-01-01 2011-01-01
997 1 2010-02-15 2011-02-15
998 1 2010-03-05 2011-03-05
999 1 2010-04-10 2011-04-10
1000 1 2010-12-27 2011-12-27
1001 2 2000-01-01 2001-01-01
1002 2 2000-02-02 2001-02-02
1003 2 2000-03-03 2000-03-03
1004 2 2000-04-04 2000-04-04
1005 2 2000-12-31 2001-12-31
(...)
1996 2 2010-01-01 2011-01-01
1997 2 2010-02-15 2011-02-15
1998 2 2010-03-05 2011-03-05
1999 2 2010-04-10 2011-04-10
2000 2 2010-12-027 2011-12-27
(...)
3000 34 2000-02-02 2002-02-02
3001 34 2000-05-05 2001-05-05
3002 34 2000-06-06 2001-06-06
3003 34 2000-07-07 2001-07-07
3004 34 2000-11-11 2001-11-11
(...)
9996 34 2010-02-06 2011-02-06
9997 34 2010-05-05 2011-05-05
9998 34 2010-09-09 2011-09-09
9999 34 2010-09-25 2011-09-25
10000 34 2010-10-15 2011-10-15
Assume DF2 looks as follows:
unit date x
1 2000-01-01 1
1 2000-05-01 2
1 2000-12-01 3
1 2001-01-02 10
1 2001-07-05 20
1 2001-12-31 30
(...)
2 2010-05-05 1
2 2010-07-01 1
2 2010-08-09 1
3 (...)
This is what I would like DF1 to look like after running the code:
individual unit date1 date2 newvar
1 1 2000-01-01 2001-01-01 6
2 1 2000-02-02 2001-02-02 16
3 1 2000-03-03 2001-03-03 15
4 1 2000-04-04 2001-04-04 15
5 1 2000-12-31 2001-12-31 60
(...)
996 1 2010-01-01 2011-01-01 3
997 1 2010-02-15 2011-02-15 2
998 1 2010-03-05 2011-03-05 2
999 1 2010-04-10 2011-04-10 2
1000 1 2010-12-27 2011-12-27 0
(...)
However, I cannot simply aggregate: Imagine that in DF1 each "unit" has several hundreds of individuals for each year between 2000 and 2011. And DF2 has many observations for each unit across the years 2000-2011.
We can use data.table
library(data.table)
setDT(DF1)
setDT(DF2)
DF1[DF2[, .(newvar = sum(x)), .(unit, individual = cumsum(date %in% DF1$date1))],
newvar := newvar, on = .(individual, unit)]
DF1
# individual unit date1 date2 newvar
#1: 1 1 2000-01-01 2001-01-01 6
#2: 2 1 2001-01-02 2002-01-02 60
Or we can use a non-equi join
DF1[DF2[DF1, sum(x), on = .(unit, date >= date1, date <= date2),
by = .EACHI], newvar := V1, on = .(unit, date1=date)]
DF1
# individual unit date1 date2 newvar
#1: 1 1 2000-01-01 2001-01-01 6
#2: 2 1 2001-01-02 2002-01-02 60
You were almost there, I just modified slightly your for loop, and also made sure that the date variables are considered as such:
DF1$date1 = as.Date(DF1$date1,"%Y-%m-%d")
DF1$date2 = as.Date(DF1$date2,"%Y-%m-%d")
DF2$date = as.Date(DF2$date,"%Y-%m-%d")
for(i in 1:nrow(DF1)){
DF1$newvar[i] <-sum(DF2$x[which(DF2$unit == DF1$unit[i] &
DF2$date>= DF1$date1[i] &
DF2$date<= DF1$date2[i])])
}
The problem was, that you were asking DF2$date to be simultaneously == DF1$date1 & DF1$date2.
And also, length(DF1) gives you the number of columns. To have the number of rows you can either use nrow(DF1), or dim(DF1)[1].

Day difference between rows related to other row that is not NA

Let I have such a data frame(df):
Date x
20.01.2016 34
21.01.2016 28
22.01.2016 NA
23.01.2016 NA
24.01.2016 56
25.01.2016 NA
26.01.2016 28
I want to add such a column(z) to this data frame
Date x z
20.01.2016 34 -
21.01.2016 28 1
22.01.2016 NA NA
23.01.2016 NA NA
24.01.2016 56 3
25.01.2016 NA NA
26.01.2016 28 2
where z shows the day difference between the related row's date and closest previous date (where x is not NA).
For example for the date 24.01.2016 the closest previous date is 21.01.2016 where x is not NA. So the day difference of these two dates is 3.
How can I do this using R?
I will be very glad for any help. Thanks a lot.
Cinsidering that your date variable is as.Date,(i.e. df$Date <- as.Date(df$Date, format = '%d.%m.%Y')) then,
df$z[!is.na(df$x)] <- c(NA, diff.difftime(df$Date[!is.na(df$x)]))
df
# Date x z
#1 2016-01-20 34 NA
#2 2016-01-21 28 1
#3 2016-01-22 NA NA
#4 2016-01-23 NA NA
#5 2016-01-24 56 3
#6 2016-01-25 NA NA
#7 2016-01-26 28 2
We can use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%d.%m.%Y")][!is.na(x), z := Date - shift(Date)]
df
# Date x z
#1: 2016-01-20 34 NA
#2: 2016-01-21 28 1
#3: 2016-01-22 NA NA
#4: 2016-01-23 NA NA
#5: 2016-01-24 56 3
#6: 2016-01-25 NA NA
#7: 2016-01-26 28 2

R - Calculate Time Elapsed Since Last Event with Multiple Event Types

I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000"
,"03/01/2001","17/03/2001","23/04/2001",
"26/05/2001","01/06/2001",
"30/06/2001","02/07/2001","15/07/2001"
,"21/12/2001"), "%d/%m/%Y"),
event_type=c(0,4,1,2,4,1,0,2,3,3,4,3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers from these two previous posts but have not been able to address my specific problem in R; multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count=ave(df$event_type==df$event_type,
df$event_type, FUN=cumsum))))
df <- rename(df, c("count" = "last_event_index"))
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' after grouping by 'event_type'. Here, I am using data.table approach by converting the 'data.frame' to 'data.table' (setDT(df)), grouped by 'event_type', we get the diff of 'date'.
library(data.table)
setDT(df)[,days_since_last_event :=c(NA,diff(date)) , by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or as #Frank mentioned in the comments, we can also use shift (from version v1.9.5+ onwards) to get the lag (by default, the type='lag') of 'date' and subtract from the 'date'.
setDT(df)[, days_since_last_event := as.numeric(date-shift(date,type="lag")),
by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
> do.call(rbind,
lapply(
split(df, df$event_type),
function(d) {
d$dsle <- c(NA, diff(d$date)); d
}
)
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
Above, #akrun has posted the data.tables approach, the parallel dplyr approach would be straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event=date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days

Adding row for missing value in data.table

My question is somehow related to Fastest way to add rows for missing values in a data.frame? but a bit tougher I think. And I can't figure out how to adapt this solution to my problem.
Here is what my data.table looks like :
ida idb value date
1: A 2 26600 2004-12-31
2: A 3 19600 2005-03-31
3: B 3 18200 2005-06-30
4: B 4 1230 2005-09-30
5: C 2 8700 2005-12-31
The difference is that every 'ida' has his own dates and there is at least one row where 'ida' appears with each date but not necessarily for all 'idb'. I want to insert every missing ('ida','idb') couple missing with the corresponding date and 0 as a value.
Moreover, there is no periodicity for the dates.
How would you do this ?
Desired output :
ida idb value date
1: A 2 26600 2004-12-31
1: A 2 0 2005-03-31
2: A 3 19600 2005-03-31
2: A 3 0 2004-12-31
3: B 3 18200 2005-06-30
4: B 3 0 2005-09-30
5: B 4 1230 2005-09-30
4: B 4 0 2005-06-30
6: C 2 8700 2005-12-31
The order doesn't matter. Every date missing is filled with a 0 value.
You just do the same thing as in your linked question by each ida:
setkey(dt, idb, date)
dt[, .SD[CJ(unique(idb), unique(date))], by = ida][is.na(value), value := 0][]
# ida idb value date
#1: A 2 26600 2004-12-31
#2: A 2 0 2005-03-31
#3: A 3 0 2004-12-31
#4: A 3 19600 2005-03-31
#5: C 2 8700 2005-12-31
#6: B 3 18200 2005-06-30
#7: B 3 0 2005-09-30
#8: B 4 0 2005-06-30
#9: B 4 1230 2005-09-30

Resources