I am trying to write a bit of code that will grab a portion of a data frame based on start date, where each start date is different for each user.
Suppose I have the following data.frames (in reality my dataset is several orders of magnitude larger, but this will suffice as a sample set)
df1:
> df
name start.date
1 Allison 2013-03-16
2 Andrew 2013-03-16
3 Carl 2013-03-16
4 Dora 2013-03-17
5 Hilary 2013-03-17
6 Louis 2013-03-18
7 Mary 2013-03-19
8 Mickey 2013-03-20
And df2:
> df2
names X03.16.2013 X03.17.2013 X03.18.2013 X03.19.2013
2001 Allison 5 5 0 0
2002 Andrew 2 0 0 0
2003 Carl 10 8 11 10
2004 Dora 0 4 0 0
2005 Hilary 0 3 5 0
2006 Louis 0 0 8 3
2007 Mary 0 0 0 7
2008 Mickey 0 0 0 0
I merged these two data frames into one called tmp:
>tmp
name start.date X03.16.2013 X03.17.2013 X03.18.2013 X03.19.2013
1 Allison 2013-03-16 5 5 0 0
2 Andrew 2013-03-16 2 0 0 0
3 Carl 2013-03-16 10 8 11 10
4 Dora 2013-03-17 0 4 0 0
5 Hilary 2013-03-17 0 3 5 0
6 Louis 2013-03-18 0 0 8 3
7 Mary 2013-03-19 0 0 0 7
8 Mickey 2013-03-20 0 0 0 0
I also have a list of the column names of df2 converted to dates:
>dts
[1] "2014-03-16" "2014-03-17" "2014-03-18" "2014-03-19"
I thought one way to approach this problem is to change all the zero entries in df2 that occur before each user's start date to NA using the following nested loops:
for (i in 1:dim(tmp)[1]){
for (j in 1:length(dts)){
for (z in 4:dim(tmp)[2]){
if (dts[j]< tmp$Date.of.Sign.Up[i]){
tmp[i,z]<-NA
} else {tmp[i,z]<-tmp[i,z]}
}
}
}
The trouble with this loop is that (1) it will run infinitely and (2) it doesn't work: it changes all the values in tmp[,3:end] to zero, regardless of start date. Ideally I would end up with something like this:
name start.date X03.16.2013 X03.17.2013 X03.18.2013 X03.19.2013
Allison 2013-03-16 5 5 0 0
Andrew 2013-03-16 2 0 0 0
Carl 2013-03-16 10 8 11 10
Dora 2013-03-17 NA 4 0 0
Hilary 2013-03-17 NA 3 5 0
Louis 2013-03-18 NA NA 8 3
Mary 2013-03-19 NA NA NA 7
Mickey 2013-03-20 NA NA NA NA
Any suggestions? Thank you in advance!
You may reshape 'tmp' to long format, convert the former headers to dates, compare them with the start dates, and insert NA when start dates occur after the 'header dates':
library(reshape2)
# melt data from wide to long format
df3 <- melt(tmp, id.vars = c("name", "start.date"))
# convert 'variable' to class Date
df3$variable <- as.Date(df3$variable, format = "X%m.%d.%Y")
# compare start dates with 'variable dates' and insert NA
df3$value[df3$start.date > df3$variable] <- NA
# reshape back to wide
dcast(df3, name + start.date ~ variable)
# name start.date 2013-03-16 2013-03-17 2013-03-18 2013-03-19
# 1 Allison 2013-03-16 5 5 0 0
# 2 Andrew 2013-03-16 2 0 0 0
# 3 Carl 2013-03-16 10 8 11 10
# 4 Dora 2013-03-17 NA 4 0 0
# 5 Hilary 2013-03-17 NA 3 5 0
# 6 Louis 2013-03-18 NA NA 8 3
# 7 Mary 2013-03-19 NA NA NA 7
# 8 Mickey 2013-03-20 NA NA NA NA
Another possibility where we loop over the 'date column names' in 'tmp':
dates <- names(tmp)[-c(1, 2)]
tmp[ , -c(1, 2)] <- sapply(dates, function(x){
date <- as.Date(x, format = "X%m.%d.%Y")
tmp[ , x][tmp$start.date > date] <- NA
tmp[ , x]
})
tmp
# name start.date X03.16.2013 X03.17.2013 X03.18.2013 X03.19.2013
# 1 Allison 2013-03-16 5 5 0 0
# 2 Andrew 2013-03-16 2 0 0 0
# 3 Carl 2013-03-16 10 8 11 10
# 4 Dora 2013-03-17 NA 4 0 0
# 5 Hilary 2013-03-17 NA 3 5 0
# 6 Louis 2013-03-18 NA NA 8 3
# 7 Mary 2013-03-19 NA NA NA 7
# 8 Mickey 2013-03-20 NA NA NA NA
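The same masking can also be written without reshaping or looping over columns, by comparing every start date with every column date in one step. This is only a sketch under assumptions: it rebuilds part of 'tmp' from the sample above, assumes start.date is of class Date, and the names date_cols, col_dates, mask and masked are made up for illustration.

```r
# Rebuild part of 'tmp' from the sample data (Carl to Mickey).
tmp <- data.frame(
  name = c("Carl", "Dora", "Hilary", "Louis", "Mary", "Mickey"),
  start.date = as.Date(c("2013-03-16", "2013-03-17", "2013-03-17",
                         "2013-03-18", "2013-03-19", "2013-03-20")),
  X03.16.2013 = c(10, 0, 0, 0, 0, 0),
  X03.17.2013 = c(8, 4, 3, 0, 0, 0),
  X03.18.2013 = c(11, 0, 5, 8, 0, 0),
  X03.19.2013 = c(10, 0, 0, 3, 7, 0)
)

# Column names that hold dates, and the dates they stand for.
date_cols <- grep("^X", names(tmp), value = TRUE)
col_dates <- as.Date(date_cols, format = "X%m.%d.%Y")

# mask[i, j] is TRUE when column j's date falls before user i's start date.
mask <- outer(as.numeric(tmp$start.date), as.numeric(col_dates), ">")

# Blank out the masked cells in a copy, leaving 'tmp' untouched.
masked <- tmp
vals <- as.matrix(masked[date_cols])
vals[mask] <- NA
masked[, date_cols] <- vals
masked
```

Because the comparison is a single matrix operation, this should scale better than nested loops, though the matrix of values has to fit in memory.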
These are subsets of two dataframes.
df1:
plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245
(cont.)
df2:
plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17
(cont.)
df2 has some dates that match df1, but sometimes the dates are off by one or a couple of days (highlighted in bold).
I would like to join both dfs on both 'date' and 'plot', keeping all of df2, without losing the 'gdd' data from df1.
That data is lost if, for example, I inner_join both dfs, because the dates will not match exactly.
So if a date in df1 is one to three days earlier or later than the closest date in df2, that's fine because the dates are relatively close. This is tricky because I want this replacement only if there is no data available in df1 for that date range.
My goal is to have something like this:
plot date flowers gdd
1 2019-07-12 2 60
1 2019-07-13 9 60
1 2019-07-14 3 60
1 2019-07-15 3 60
2 2019-05-12 10 100
2 2019-05-13 10 100
2 2019-05-14 14 100
2 2019-05-15 17 100
Is it possible to do?
I greatly appreciate any help!
Thanks!
I think a 'rolling join' from the data.table package can handle this:
library(data.table)
setDT(df1)
setDT(df2)
df1[, mean_first_flower_date := as.Date(mean_first_flower_date)]
df2[, date := as.Date(date)]
df1[df2, on=c("plot","mean_first_flower_date==date"), roll=3, rollends=TRUE]
# plot mean_first_flower_date gdd flowers
#1: 1 2019-07-12 60 2
#2: 1 2019-07-13 60 9
#3: 1 2019-07-14 60 3
#4: 1 2019-07-15 60 3
#5: 2 2019-05-12 100 10
#6: 2 2019-05-13 100 10
#7: 2 2019-05-14 100 14
#8: 2 2019-05-15 100 17
Using this data:
df1 <- read.table(text="plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245", header=TRUE)
df2 <- read.table(text="plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17", header=TRUE)
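For readers without data.table, the matching rule can be spelled out in base R. This is a different technique from the rolling join (a plain nearest-date lookup within each plot), and only a sketch: the 3-day tolerance comes from the question, while max_days and the loop structure are my own assumptions. It compares every df2 row against all df1 rows of the same plot, so it is likely to be slow on large data.

```r
df1 <- data.frame(
  plot = c(1, 1, 1, 2, 2, 2),
  mean_first_flower_date = as.Date(c("2019-07-15", "2019-07-21", "2019-07-23",
                                     "2019-05-13", "2019-05-22", "2019-05-25")),
  gdd = c(60, 50, 78, 100, 173, 245)
)
df2 <- data.frame(
  plot = c(1, 1, 1, 1, 2, 2, 2, 2),
  date = as.Date(c("2019-07-12", "2019-07-13", "2019-07-14", "2019-07-15",
                   "2019-05-12", "2019-05-13", "2019-05-14", "2019-05-15")),
  flowers = c(2, 9, 3, 3, 10, 10, 14, 17)
)

max_days <- 3  # assumed tolerance, taken from the question's "one to three days"

# For each df2 row, take gdd from the closest df1 date in the same plot,
# or NA when the closest date is more than max_days away.
df2$gdd <- vapply(seq_len(nrow(df2)), function(i) {
  cand <- df1[df1$plot == df2$plot[i], ]
  gap <- abs(as.numeric(cand$mean_first_flower_date - df2$date[i]))
  if (length(gap) == 0 || min(gap) > max_days) {
    NA_real_
  } else {
    cand$gdd[which.min(gap)]
  }
}, numeric(1))
df2
```

On the sample data this gives 60 for every row of plot 1 and 100 for every row of plot 2, the same as the desired output in the question.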
Try fill from tidyr (together with left_join from dplyr), using this syntax:
df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
fill(gdd, .direction = "up")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 NA
8 2 2019-05-15 17 NA
As you can see, there are two NAs in the last two rows. They shouldn't appear when you join your actual df2, where these rows will be filled with 173 because there will be a match for 2019-05-22. Still, if you want to fill any remaining NA rows at the end, you can use fill again with .direction = "down":
df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
fill(gdd, .direction = "up") %>% fill(gdd, .direction = "down")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 100
8 2 2019-05-15 17 100
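One thing to watch with the fill approach is that it is not grouped, so a value could be carried from one plot into the next. Below is a sketch of the same "fill up, then down" idea restricted to each plot. It deliberately swaps in base R (merge and ave) for the dplyr/tidyr calls, and fill_updown is a hypothetical helper written for this example, not a tidyr function.

```r
df1 <- data.frame(
  plot = c(1, 1, 1, 2, 2, 2),
  mean_first_flower_date = as.Date(c("2019-07-15", "2019-07-21", "2019-07-23",
                                     "2019-05-13", "2019-05-22", "2019-05-25")),
  gdd = c(60, 50, 78, 100, 173, 245)
)
df2 <- data.frame(
  plot = c(1, 1, 1, 1, 2, 2, 2, 2),
  date = as.Date(c("2019-07-12", "2019-07-13", "2019-07-14", "2019-07-15",
                   "2019-05-12", "2019-05-13", "2019-05-14", "2019-05-15")),
  flowers = c(2, 9, 3, 3, 10, 10, 14, 17)
)

# Hypothetical helper: fill NAs from the next value, then from the previous one.
fill_updown <- function(x) {
  n <- length(x)
  if (n < 2) return(x)
  for (i in (n - 1):1) if (is.na(x[i])) x[i] <- x[i + 1]  # fill upwards
  for (i in 2:n) if (is.na(x[i])) x[i] <- x[i - 1]        # then downwards
  x
}

# Left join on exact plot/date matches, keeping every row of df2.
joined <- merge(df2, df1,
                by.x = c("plot", "date"),
                by.y = c("plot", "mean_first_flower_date"),
                all.x = TRUE)
joined <- joined[order(joined$plot, joined$date), ]

# Fill within each plot only, so values never cross plot boundaries.
joined$gdd <- ave(joined$gdd, joined$plot, FUN = fill_updown)
joined
```

Unlike a rolling join, this fill has no limit on how far a value is carried, so it suits data where every gap is known to be short.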
I would like to calculate the number of days that have passed since the first event. There are different groups, so each group's starting date for an event is different, and I want to calculate each group's number of days passed since its own first event.
names = c('Ben',"Ben","Ben","Ben","Ben","Ben" ,'Dan',"Dan","Dan","Dan", 'Peter',"Peter","Peter","Peter","Peter","Peter","Peter",'Betty',"Betty","Betty",'Betty', "Betty")
dates = c('2000-02-01','2000-02-02',"2000-02-03","2000-02-04",'2000-02-05','2000-02-05', '2000-01-11','2000-01-12',"2000-01-13",'2000-01-14',
'2000-09-10','2000-09-11',"2000-09-12",'2000-09-13','2000-09-14','2000-09-15','2000-09-16','2000-11-13','2000-11-14', "2000-11-15",'2000-11-16','2000-11-17')
events = c(0,0,1,4,5,11,0,0,2,6,0,0,1,2,3,4,5,0,0,1,2,3)
newd = data.frame(names,dates,events)
newd
so the data frame looks like this:
> newd
names dates events
1 Ben 2000-02-01 0
2 Ben 2000-02-02 0
3 Ben 2000-02-03 1
4 Ben 2000-02-04 4
5 Ben 2000-02-05 5
6 Ben 2000-02-05 11
7 Dan 2000-01-11 0
8 Dan 2000-01-12 0
9 Dan 2000-01-13 2
10 Dan 2000-01-14 6
11 Peter 2000-09-10 0
12 Peter 2000-09-11 0
13 Peter 2000-09-12 1
14 Peter 2000-09-13 2
15 Peter 2000-09-14 3
16 Peter 2000-09-15 4
17 Peter 2000-09-16 5
18 Betty 2000-11-13 0
19 Betty 2000-11-14 0
20 Betty 2000-11-15 1
21 Betty 2000-11-16 2
22 Betty 2000-11-17 3
This is just an example I am using; the 'events' are not in a specific order and are totally random, and there are also many other dates with an event of 0. So I would like to start counting days only where event > 0.
So if there's a 0 at 'event', then there should also be 0 days counted.
Convert the dates to the Date class; you can then subtract the minimum date within each name.
newd$dates <- as.Date(newd$dates)
library(dplyr)
newd %>% group_by(names) %>% mutate(events = as.integer(dates - min(dates)))
# names dates events
# <chr> <date> <int>
# 1 Ben 2000-02-02 0
# 2 Ben 2000-02-03 1
# 3 Ben 2000-02-04 2
# 4 Ben 2000-02-05 3
# 5 Ben 2000-02-05 3
# 6 Dan 2000-01-12 0
# 7 Dan 2000-01-13 1
# 8 Dan 2000-01-14 2
# 9 Peter 2000-09-11 0
#10 Peter 2000-09-12 1
#11 Peter 2000-09-13 2
#12 Peter 2000-09-14 3
#13 Peter 2000-09-15 4
#14 Peter 2000-09-16 5
#15 Betty 2000-11-14 0
#16 Betty 2000-11-15 1
#17 Betty 2000-11-16 2
#18 Betty 2000-11-17 3
In base R :
newd$events <- with(newd, dates - ave(dates, names, FUN = min))
and data.table :
library(data.table)
setDT(newd)[, events := dates - min(dates), names]
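The question also asks for counting to start only once events > 0. One possible reading is sketched below in base R: day 0 is the first date on which a name has a positive event, and every earlier row gets 0. That interpretation, the column name days_since_first_event, and the choice to use only the Ben and Dan rows are my assumptions.

```r
newd <- data.frame(
  names = rep(c("Ben", "Dan"), times = c(6, 4)),
  dates = as.Date(c("2000-02-01", "2000-02-02", "2000-02-03", "2000-02-04",
                    "2000-02-05", "2000-02-05",
                    "2000-01-11", "2000-01-12", "2000-01-13", "2000-01-14")),
  events = c(0, 0, 1, 4, 5, 11, 0, 0, 2, 6)
)

# Numeric date for rows with a positive event, NA for all other rows.
pos_date <- ifelse(newd$events > 0, as.numeric(newd$dates), NA_real_)

# Earliest positive-event date within each name (NA if a name has none).
first_pos <- ave(pos_date, newd$names, FUN = function(x) {
  if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
})

# Days elapsed since that date; rows before it (or names without one) get 0.
days <- as.numeric(newd$dates) - first_pos
days[is.na(days) | days < 0] <- 0
newd$days_since_first_event <- days
newd
```

If rows with events == 0 that occur after the first positive event should also show 0, one more line such as days[newd$events == 0] <- 0 would cover that reading.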
I need to merge two datasets, but the rows have to merge if the date of the one dataset is between two dates of the other one. The first dataset data looks like this:
Date Weight diff Loc.nr
2013-01-24 1040 7 2
2013-01-31 1000 7 2
2013-02-07 1185 7 2
2013-02-14 915 7 2
2013-02-21 1090 7 2
2013-03-01 1065 9 2
2013-01-19 500 4 9
2013-01-23 1040 3 9
2013-01-28 415 5 9
2013-01-31 650 3 9
2013-02-04 725 4 9
2013-02-07 450 3 9
2013-02-11 550 4 9
The other data set, "matches", looks like this:
Date winning
2013-01-20 1
2013-01-27 0
2013-02-03 1
2013-02-10 0
2013-02-17 1
2013-02-24 0
I wrote some code to attach the winning column from "matches" to the data set "data":
data$winning <- NA
for(i in 1:nrow(data)) {
for(j in 1:nrow(matches)) {
if((data$Date[i]-data$diff[i]) < matches$Date[j] & data$Date[i] > matches$Date[j]) {
data$winning[i] <- matches$winning[j]
}
}
}
This code takes 3 days to run. Is there a faster way to do this?
My expected output is:
Date Weight diff Loc.nr winning
2013-01-24 1040 7 2 1
2013-01-31 1000 7 2 0
2013-02-07 1185 7 2 1
2013-02-14 915 7 2 0
2013-02-21 1090 7 2 1
2013-03-01 1065 9 2 0
2013-01-19 500 4 9 NA
2013-01-23 1040 3 9 NA
2013-01-28 415 5 9 0
2013-01-31 650 3 9 NA
2013-02-04 725 4 9 1
2013-02-07 450 3 9 NA
2013-02-11 550 4 9 0
With a non-equi join, as suggested by Gregor, you can try something along these lines:
library(data.table)
setDT(data)[, winning := setDT(matches)[data[, .(upper = Date, lower = Date - diff)],
on = .(Date < upper, Date > lower)]$winning][]
Date Weight diff Loc.nr winning
1: 2013-01-24 1040 7 2 1
2: 2013-01-31 1000 7 2 0
3: 2013-02-07 1185 7 2 1
4: 2013-02-14 915 7 2 0
5: 2013-02-21 1090 7 2 1
6: 2013-03-01 1065 9 2 0
7: 2013-01-19 500 4 9 NA
8: 2013-01-23 1040 3 9 NA
9: 2013-01-28 415 5 9 0
10: 2013-01-31 650 3 9 NA
11: 2013-02-04 725 4 9 1
12: 2013-02-07 450 3 9 NA
13: 2013-02-11 550 4 9 0
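The window test in the original loop can also be vectorised in base R with findInterval. This is a sketch rather than a replacement for the non-equi join: it uses a subset of the sample rows, assumes 'matches' can be sorted by Date, and keeps only the latest match that falls before each Date (which is what the loop's repeated overwriting ends up doing when matches are in date order). The names idx and hit are illustrative.

```r
data <- data.frame(
  Date = as.Date(c("2013-01-24", "2013-01-31", "2013-01-19",
                   "2013-01-23", "2013-01-28")),
  diff = c(7, 7, 4, 3, 5)
)
matches <- data.frame(
  Date = as.Date(c("2013-01-20", "2013-01-27", "2013-02-03")),
  winning = c(1, 0, 1)
)
matches <- matches[order(matches$Date), ]  # findInterval needs sorted breaks

# Index of the latest match strictly before each data$Date (0 if there is none).
idx <- findInterval(as.numeric(data$Date), as.numeric(matches$Date),
                    left.open = TRUE)
idx[idx == 0] <- NA

# Keep the match only if it also lies after the lower bound, Date - diff.
hit <- !is.na(idx) & matches$Date[idx] > data$Date - data$diff
data$winning <- ifelse(hit, matches$winning[idx], NA)
data
```

For the five rows used here the result is 1, 0, NA, NA, 0, which agrees with the corresponding rows of the expected output in the question.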
I have a dataframe that contains the dates of multiple types of events.
df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000"
,"03/01/2001","17/03/2001","23/04/2001",
"26/05/2001","01/06/2001",
"30/06/2001","02/07/2001","15/07/2001"
,"21/12/2001"), "%d/%m/%Y"),
event_type=c(0,4,1,2,4,1,0,2,3,3,4,3))
date event_type
---------------- ----------
1 2000-07-06 0
2 2000-09-15 4
3 2000-10-15 1
4 2001-01-03 2
5 2001-03-17 4
6 2001-04-23 1
7 2001-05-26 0
8 2001-06-01 2
9 2001-06-30 3
10 2001-07-02 3
11 2001-07-15 4
12 2001-12-21 3
I am trying to calculate the days between each event type so the output looks like the below:
date event_type days_since_last_event
---------------- ---------- ---------------------
1 2000-07-06 0 NA
2 2000-09-15 4 NA
3 2000-10-15 1 NA
4 2001-01-03 2 NA
5 2001-03-17 4 183
6 2001-04-23 1 190
7 2001-05-26 0 324
8 2001-06-01 2 149
9 2001-06-30 3 NA
10 2001-07-02 3 2
11 2001-07-15 4 120
12 2001-12-21 3 172
I have benefited from the answers to these two previous posts but have not been able to address my specific problem in R: multiple event types.
Calculate elapsed time since last event
Calculate days since last event in R
Below is as far as I have gotten. I have not been able to leverage the last event index to calculate the last event date.
df <- cbind(df, as.vector(data.frame(count=ave(df$event_type==df$event_type,
df$event_type, FUN=cumsum))))
df <- rename(df, c("count" = "last_event_index"))
date event_type last_event_index
--------------- ------------- ----------------
1 2000-07-06 0 1
2 2000-09-15 4 1
3 2000-10-15 1 1
4 2001-01-03 2 1
5 2001-03-17 4 2
6 2001-04-23 1 2
7 2001-05-26 0 2
8 2001-06-01 2 2
9 2001-06-30 3 1
10 2001-07-02 3 2
11 2001-07-15 4 3
12 2001-12-21 3 3
We can use diff to get the difference between adjacent 'date' values after grouping by 'event_type'. Here, I am using the data.table approach: convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by 'event_type', take the diff of 'date'.
library(data.table)
setDT(df)[,days_since_last_event :=c(NA,diff(date)) , by = event_type]
df
# date event_type days_since_last_event
# 1: 2000-07-06 0 NA
# 2: 2000-09-15 4 NA
# 3: 2000-10-15 1 NA
# 4: 2001-01-03 2 NA
# 5: 2001-03-17 4 183
# 6: 2001-04-23 1 190
# 7: 2001-05-26 0 324
# 8: 2001-06-01 2 149
# 9: 2001-06-30 3 NA
#10: 2001-07-02 3 2
#11: 2001-07-15 4 120
#12: 2001-12-21 3 172
Or, as @Frank mentioned in the comments, we can also use shift (from version v1.9.5+ onwards) to get the lag (by default, type='lag') of 'date' and subtract it from 'date'.
setDT(df)[, days_since_last_event := as.numeric(date-shift(date,type="lag")),
by = event_type]
The base R version of this is to use split/lapply/rbind to generate the new column.
> do.call(rbind,
lapply(
split(df, df$event_type),
function(d) {
d$dsle <- c(NA, diff(d$date)); d
}
)
)
date event_type dsle
0.1 2000-07-06 0 NA
0.7 2001-05-26 0 324
1.3 2000-10-15 1 NA
1.6 2001-04-23 1 190
2.4 2001-01-03 2 NA
2.8 2001-06-01 2 149
3.9 2001-06-30 3 NA
3.10 2001-07-02 3 2
3.12 2001-12-21 3 172
4.2 2000-09-15 4 NA
4.5 2001-03-17 4 183
4.11 2001-07-15 4 120
Note that this returns the data in a different order than provided; you can re-sort by date or save the original indices if you want to preserve that order.
Above, @akrun has posted the data.table approach; the parallel dplyr approach is straightforward as well:
library(dplyr)
df %>% group_by(event_type) %>% mutate(days_since_last_event=date - lag(date, 1))
Source: local data frame [12 x 3]
Groups: event_type [5]
date event_type days_since_last_event
(date) (dbl) (dfft)
1 2000-07-06 0 NA days
2 2000-09-15 4 NA days
3 2000-10-15 1 NA days
4 2001-01-03 2 NA days
5 2001-03-17 4 183 days
6 2001-04-23 1 190 days
7 2001-05-26 0 324 days
8 2001-06-01 2 149 days
9 2001-06-30 3 NA days
10 2001-07-02 3 2 days
11 2001-07-15 4 120 days
12 2001-12-21 3 172 days
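If the original row order matters and no extra packages are wanted, base R's ave can apply the same grouped diff in place. A minimal sketch, assuming the rows are already sorted by date as in the question:

```r
df <- data.frame(
  date = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000", "03/01/2001",
                   "17/03/2001", "23/04/2001", "26/05/2001", "01/06/2001",
                   "30/06/2001", "02/07/2001", "15/07/2001", "21/12/2001"),
                 "%d/%m/%Y"),
  event_type = c(0, 4, 1, 2, 4, 1, 0, 2, 3, 3, 4, 3)
)

# Within each event_type, subtract the previous date; the first row of each
# type has no predecessor and gets NA. ave() keeps the original row order.
df$days_since_last_event <- ave(as.numeric(df$date), df$event_type,
                                FUN = function(x) c(NA, diff(x)))
df
```

The result is a plain numeric column of day counts rather than a difftime.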
I would like to check that an individual does not have any gaps in their eligibility status. I define a gap as a date_of_claim that occurs more than 30 days after the last elig_end_date. Therefore, what I would like to do is check that each date_of_claim is no later than the elig_end_date + 30 days in the immediately preceding row. Ideally I would like an indicator, per person, that says 0 for no gap and 1 if there is a gap, showing where the gap occurs. Here is a sample df with the solution built in as 'gaps'.
names date_of_claim elig_end_date obs gaps
1 tom 2010-01-01 2010-07-01 1 NA
2 tom 2010-05-04 2010-07-01 1 0
3 tom 2010-06-01 2014-01-01 2 0
4 tom 2010-10-10 2014-01-01 2 0
5 mary 2010-03-01 2014-06-14 1 NA
6 mary 2010-05-01 2014-06-14 1 0
7 mary 2010-08-01 2014-06-14 1 0
8 mary 2010-11-01 2014-06-14 1 0
9 mary 2011-01-01 2014-06-14 1 0
10 john 2010-03-27 2011-03-01 1 NA
11 john 2010-07-01 2011-03-01 1 0
12 john 2010-11-01 2011-03-01 1 0
13 john 2011-02-01 2011-03-01 1 0
14 sue 2010-02-01 2010-04-30 1 NA
15 sue 2010-02-27 2010-04-30 1 0
16 sue 2010-03-13 2010-05-31 2 0
17 sue 2010-04-27 2010-06-30 3 0
18 sue 2010-04-27 2010-06-30 3 0
19 sue 2010-05-06 2010-08-31 4 0
20 sue 2010-06-08 2010-09-30 5 0
21 mike 2010-05-01 2010-07-30 1 NA
22 mike 2010-06-01 2010-07-30 1 0
23 mike 2010-11-12 2011-07-30 2 1
I have found this post quite useful: How can I compare a value in a column to the previous one using R? However, I feel that I can't use a loop, as my df has 4 million rows and I have already had a lot of difficulty trying to run a loop on it.
To this end, I think the code I need is something like this:
df$gaps<-ifelse(df$date_of_claim>=df$elig_end_date+30,1,0) ##this doesn't use the preceeding row.
I've made a clumsy attempt using this:
df$gaps<-df$date_of_claim>=df$elig_end_date[-1,]
but I get an error saying I have an incorrect number of dimensions.
All help greatly appreciated! Thank you.
With four million observations I would use data.table:
DF <- read.table(text="names date_of_claim elig_end_date obs gaps
1 tom 2010-01-01 2010-07-01 1 NA
2 tom 2010-05-04 2010-07-01 1 0
3 tom 2010-06-01 2014-01-01 2 0
4 tom 2010-10-10 2014-01-01 2 0
5 mary 2010-03-01 2014-06-14 1 NA
6 mary 2010-05-01 2014-06-14 1 0
7 mary 2010-08-01 2014-06-14 1 0
8 mary 2010-11-01 2014-06-14 1 0
9 mary 2011-01-01 2014-06-14 1 0
10 john 2010-03-27 2011-03-01 1 NA
11 john 2010-07-01 2011-03-01 1 0
12 john 2010-11-01 2011-03-01 1 0
13 john 2011-02-01 2011-03-01 1 0
14 sue 2010-02-01 2010-04-30 1 NA
15 sue 2010-02-27 2010-04-30 1 0
16 sue 2010-03-13 2010-05-31 2 0
17 sue 2010-04-27 2010-06-30 3 0
18 sue 2010-04-27 2010-06-30 3 0
19 sue 2010-05-06 2010-08-31 4 0
20 sue 2010-06-08 2010-09-30 5 0
21 mike 2010-05-01 2010-07-30 1 NA
22 mike 2010-06-01 2010-07-30 1 0
23 mike 2010-11-12 2011-07-30 2 1", header=TRUE)
library(data.table)
DT <- data.table(DF)
DT[, c("date_of_claim", "elig_end_date") := list(as.Date(date_of_claim), as.Date(elig_end_date))]
DT[, gaps2:= c(NA, date_of_claim[-1] > head(elig_end_date, -1)+30), by=names]
# names date_of_claim elig_end_date obs gaps gaps2
# 1: tom 2010-01-01 2010-07-01 1 NA NA
# 2: tom 2010-05-04 2010-07-01 1 0 FALSE
# 3: tom 2010-06-01 2014-01-01 2 0 FALSE
# 4: tom 2010-10-10 2014-01-01 2 0 FALSE
# 5: mary 2010-03-01 2014-06-14 1 NA NA
# 6: mary 2010-05-01 2014-06-14 1 0 FALSE
# 7: mary 2010-08-01 2014-06-14 1 0 FALSE
# 8: mary 2010-11-01 2014-06-14 1 0 FALSE
# 9: mary 2011-01-01 2014-06-14 1 0 FALSE
# 10: john 2010-03-27 2011-03-01 1 NA NA
# 11: john 2010-07-01 2011-03-01 1 0 FALSE
# 12: john 2010-11-01 2011-03-01 1 0 FALSE
# 13: john 2011-02-01 2011-03-01 1 0 FALSE
# 14: sue 2010-02-01 2010-04-30 1 NA NA
# 15: sue 2010-02-27 2010-04-30 1 0 FALSE
# 16: sue 2010-03-13 2010-05-31 2 0 FALSE
# 17: sue 2010-04-27 2010-06-30 3 0 FALSE
# 18: sue 2010-04-27 2010-06-30 3 0 FALSE
# 19: sue 2010-05-06 2010-08-31 4 0 FALSE
# 20: sue 2010-06-08 2010-09-30 5 0 FALSE
# 21: mike 2010-05-01 2010-07-30 1 NA NA
# 22: mike 2010-06-01 2010-07-30 1 0 FALSE
# 23: mike 2010-11-12 2011-07-30 2 1 TRUE
# names date_of_claim elig_end_date obs gaps gaps2
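The question asked for a 0/1 indicator rather than TRUE/FALSE. A base R sketch of the same previous-row comparison that returns integers is shown below. It uses only the tom and mike rows, assumes rows are in chronological order within each person, and the column name gap is made up for the example.

```r
df <- data.frame(
  names = rep(c("tom", "mike"), times = c(4, 3)),
  date_of_claim = as.Date(c("2010-01-01", "2010-05-04", "2010-06-01",
                            "2010-10-10", "2010-05-01", "2010-06-01",
                            "2010-11-12")),
  elig_end_date = as.Date(c("2010-07-01", "2010-07-01", "2014-01-01",
                            "2014-01-01", "2010-07-30", "2010-07-30",
                            "2011-07-30"))
)

# ave() passes each person's row numbers to FUN; inside, every claim after the
# first is compared with the previous row's elig_end_date plus 30 days.
df$gap <- ave(seq_len(nrow(df)), df$names, FUN = function(i) {
  claim <- df$date_of_claim[i]
  prev_end <- df$elig_end_date[i]
  c(NA, as.integer(claim[-1] > prev_end[-length(i)] + 30))
})
df
```

With four million rows the data.table version is likely to be faster; the point here is only the integer coding of the indicator.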