I have the following data.frame:
df <- data.frame(id=c(1,2,3),
first.date=as.Date(c("2014-01-01", "2014-03-01", "2014-06-01")),
second.date=as.Date(c("2015-01-01", "2015-03-01", "2015-06-01")),
third.date=as.Date(c("2016-01-01", "2017-03-01", "2018-06-01")),
fourth.date=as.Date(c("2017-01-01", "2018-03-01", "2019-06-01")))
> df
id first.date second.date third.date fourth.date
1 1 2014-01-01 2015-01-01 2016-01-01 2017-01-01
2 2 2014-03-01 2015-03-01 2017-03-01 2018-03-01
3 3 2014-06-01 2015-06-01 2018-06-01 2019-06-01
Each row represents three timespans; i.e. the time spans between first.date and second.date, second.date and third.date, and third.date and fourth.date respectively.
I would like to, for lack of a better word, unnest the data frame to obtain this instead:
id StartDate EndDate
1 1 2014-01-01 2015-01-01
2 1 2015-01-01 2016-01-01
3 1 2016-01-01 2017-01-01
4 2 2014-03-01 2015-03-01
5 2 2015-03-01 2017-03-01
6 2 2017-03-01 2018-03-01
7 3 2014-06-01 2015-06-01
8 3 2015-06-01 2018-06-01
9 3 2018-06-01 2019-06-01
I have been playing around with the unnest function from the tidyr package, but I have concluded that it is not what I'm looking for.
Any suggestions?
You can try tidyr/dplyr as follows:
library(tidyr)
library(dplyr)
df %>% gather(DateType, StartDate, -id) %>% select(-DateType) %>% arrange(id) %>% group_by(id) %>% mutate(EndDate = lead(StartDate))
To drop the last row in each id group (the one whose EndDate is NA), append this to the pipeline above:
%>% slice(-4)
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), then melt the dataset to long format, use shift with type='lead' grouped by 'id' and then remove the NA elements.
library(data.table)
na.omit(melt(setDT(df), id.var='id')[, shift(value,0:1, type='lead') , id])
# id V1 V2
#1: 1 2014-01-01 2015-01-01
#2: 1 2015-01-01 2016-01-01
#3: 1 2016-01-01 2017-01-01
#4: 2 2014-03-01 2015-03-01
#5: 2 2015-03-01 2017-03-01
#6: 2 2017-03-01 2018-03-01
#7: 3 2014-06-01 2015-06-01
#8: 3 2015-06-01 2018-06-01
#9: 3 2018-06-01 2019-06-01
The column names can be changed by using either setnames or earlier in the shift step.
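For comparison, here is a minimal base R sketch of the same reshaping. It assumes that every column other than id is a date column and that the columns are already in chronological order from left to right. The names date_cols, spans and long are illustrative and do not come from the answers above.

```r
df <- data.frame(
  id = c(1, 2, 3),
  first.date  = as.Date(c("2014-01-01", "2014-03-01", "2014-06-01")),
  second.date = as.Date(c("2015-01-01", "2015-03-01", "2015-06-01")),
  third.date  = as.Date(c("2016-01-01", "2017-03-01", "2018-06-01")),
  fourth.date = as.Date(c("2017-01-01", "2018-03-01", "2019-06-01"))
)

# Assumption: every column except id holds dates, ordered left to right
date_cols <- setdiff(names(df), "id")

# Pair column k with column k + 1 to form one time span per id
spans <- lapply(seq_len(length(date_cols) - 1), function(k) {
  data.frame(id = df$id,
             StartDate = df[[date_cols[k]]],
             EndDate = df[[date_cols[k + 1]]])
})
long <- do.call(rbind, spans)

# Sort so that each id's spans sit together, in time order
long <- long[order(long$id, long$StartDate), ]
rownames(long) <- NULL
long
```

Because each span is built from two adjacent columns, no NA rows are created and nothing has to be dropped afterwards.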
Consider the following data frame (df):
"id" "date_start" "date_end"
a 2012-03-11 2012-03-27
a 2012-05-17 2012-07-21
a 2012-06-09 2012-08-18
b 2015-06-21 2015-07-12
b 2015-06-27 2015-08-04
b 2015-07-02 2015-08-01
c 2017-10-11 2017-11-08
c 2017-11-27 2017-12-15
c 2018-01-02 2018-02-03
I am trying to create a new data frame with sequences of monthly dates, starting one month prior to the minimum value of "date_start" for each group in "id". The sequence should only include the first day of each month and should end at the maximum value of "date_end" for each group in "id".
This is a reproducible example for my data frame:
library(lubridate)
id <- c("a","a","a","b","b","b","c","c","c")
df <- data.frame(id)
df$date_start <- as.Date(c("2012-03-11", "2012-05-17","2012-06-09", "2015-06-21", "2015-06-27","2015-07-02", "2017-10-11", "2017-11-27","2018-01-02"))
df$date_end <- as.Date(c("2012-03-27", "2012-07-21","2012-08-18", "2015-07-12", "2015-08-04","2015-08-01", "2017-11-08", "2017-12-15","2018-02-03"))
What I have tried to do:
library(dplyr)
library(DescTools)
library(timeDate)
df2 <- df %>%
group_by(id) %>%
summarize(start= floor_date(AddMonths(min(date_start),-1), "month"),end=max(date_end)) %>%
do(data.frame(id=.$id, date=seq(.$start,.$end,by="1 month")))
The code works perfectly fine for an ungrouped data frame. Somehow, with the grouping by "id" it throws an error message:
Error in seq.default(.$date_start, .$date_end, by = "1 month") :
'from' must be of length 1
This is what the desired output looks like for the data frame given above:
"id" "date"
a 2012-02-01
a 2012-03-01
a 2012-04-01
a 2012-05-01
a 2012-06-01
a 2012-07-01
a 2012-08-01
b 2015-05-01
b 2015-06-01
b 2015-07-01
b 2015-08-01
c 2017-09-01
c 2017-10-01
c 2017-11-01
c 2017-12-01
c 2018-01-01
c 2018-02-01
Is there a way to alter the code to function with a grouped data frame? Is there an altogether different approach for this operation?
Another option using dplyr and lubridate is to first summarise a list of Date objects for each id and then unnest them to expand them into different rows.
library(dplyr)
library(lubridate)
df %>%
group_by(id) %>%
summarise(date = list(seq(floor_date(min(date_start),unit = "month") - months(1),
floor_date(max(date_end), unit = "month"), by = "month"))) %>%
tidyr::unnest(date)
# id date
# <fct> <date>
# 1 a 2012-02-01
# 2 a 2012-03-01
# 3 a 2012-04-01
# 4 a 2012-05-01
# 5 a 2012-06-01
# 6 a 2012-07-01
# 7 a 2012-08-01
# 8 b 2015-05-01
# 9 b 2015-06-01
#10 b 2015-07-01
#11 b 2015-08-01
#12 c 2017-09-01
#13 c 2017-10-01
#14 c 2017-11-01
#15 c 2017-12-01
#16 c 2018-01-01
#17 c 2018-02-01
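In your code, summarize() drops the grouping by id, so do() receives all three summary rows at once and seq() is passed a 'from' of length 3. Grouping by row_number() before do() processes one row at a time and gives the same result as above: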
df %>%
group_by(id) %>%
summarize(start= floor_date(AddMonths(min(date_start),-1), "month"),end=max(date_end)) %>%
group_by(rn=row_number()) %>%
do(data.frame(id=.$id, date=seq(.$start, .$end, by="1 month"))) %>%
ungroup() %>%
select(-rn)
# A tibble: 17 x 2
id date
<fct> <date>
1 a 2012-02-01
2 a 2012-03-01
3 a 2012-04-01
4 a 2012-05-01
5 a 2012-06-01
6 a 2012-07-01
7 a 2012-08-01
8 b 2015-05-01
9 b 2015-06-01
10 b 2015-07-01
11 b 2015-08-01
12 c 2017-09-01
13 c 2017-10-01
14 c 2017-11-01
15 c 2017-12-01
16 c 2018-01-01
17 c 2018-02-01
Use as.yearmon to convert to year/month. Note that yearmon objects are represented internally as year + fraction where fraction is 0 for January, 1/12 for February, 2/12 for March and so on. Then use as.Date to convert that to Date class. do allows the group to change size.
library(dplyr)
library(zoo)
df %>%
group_by(id) %>%
do( data.frame(month = as.Date(seq(as.yearmon(min(.$date_start)) - 1/12,
as.yearmon(max(.$date_end)),
1/12) ))) %>%
ungroup
giving:
# A tibble: 17 x 2
id month
<fct> <date>
1 a 2012-02-01
2 a 2012-03-01
3 a 2012-04-01
4 a 2012-05-01
5 a 2012-06-01
6 a 2012-07-01
7 a 2012-08-01
8 b 2015-05-01
9 b 2015-06-01
10 b 2015-07-01
11 b 2015-08-01
12 c 2017-09-01
13 c 2017-10-01
14 c 2017-11-01
15 c 2017-12-01
16 c 2018-01-01
17 c 2018-02-01
This could also be written like this using the same library statements as above:
Seq <- function(st, en) as.Date(seq(as.yearmon(st) - 1/12, as.yearmon(en), 1/12))
df %>%
group_by(id) %>%
do( data.frame(month = Seq(min(.$date_start), max(.$date_end))) ) %>%
ungroup
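For comparison, the same monthly expansion can be sketched in base R without any packages. This is only an illustration under the question's assumptions; first_of_month is a hypothetical helper, and monthly and out are illustrative names.

```r
df <- data.frame(
  id = rep(c("a", "b", "c"), each = 3),
  date_start = as.Date(c("2012-03-11", "2012-05-17", "2012-06-09",
                         "2015-06-21", "2015-06-27", "2015-07-02",
                         "2017-10-11", "2017-11-27", "2018-01-02")),
  date_end = as.Date(c("2012-03-27", "2012-07-21", "2012-08-18",
                       "2015-07-12", "2015-08-04", "2015-08-01",
                       "2017-11-08", "2017-12-15", "2018-02-03")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first day of the month that contains d
first_of_month <- function(d) as.Date(format(d, "%Y-%m-01"))

monthly <- lapply(split(df, df$id), function(g) {
  # Step back one month from the first day of the earliest start month
  from <- seq(first_of_month(min(g$date_start)),
              by = "-1 month", length.out = 2)[2]
  # Monthly sequence up to (and not beyond) the latest end date
  data.frame(id = g$id[1],
             date = seq(from, max(g$date_end), by = "month"),
             stringsAsFactors = FALSE)
})
out <- do.call(rbind, monthly)
rownames(out) <- NULL
out
```

split() hands each id to the function separately, so seq() always receives a single 'from' and a single 'to', which avoids the length error from the question.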
I have a date in one data frame (dataframe 1) and want to use it to look up a value in another data frame (dataframe 2), picking the row of dataframe 2 whose start date and end date enclose my date. How can I do this?
For example, I have two dataframes. The first one contains only dates; we will call it "dates".
library(lubridate)
date <- ymd(c("2017-06-01", "2013-01-01", "2014-03-01", "2008-01-01","2011-03-01","2009-03-01","2012-03-01","2015-08-01","2008-08-01"))
date <- as.data.frame(date)
> date
date
1 2017-06-01
2 2013-01-01
3 2014-03-01
4 2008-01-01
5 2011-03-01
6 2009-03-01
7 2012-03-01
8 2015-08-01
9 2008-08-01
My other dataframe, "df2", contains the start and end dates and a value that should be assigned to a row of "dates" whenever date$date falls between the start date and the end date of a row in "df2".
start_date <- dmy(c("1/6/2001","1/6/2002","1/6/2003","1/10/2011","1/11/2015","1/1/2016","1/1/2017","1/1/2018"))
end_date <-dmy(c("1/5/2002","1/5/2003","1/9/2011","1/10/2015","1/12/2015","1/12/2016","1/12/2017","1/12/2018"))
value <- c(2400,3600,4800,7000,7350,7717.5,8103.38,8508.54)
df2 <- data.frame(start_date, end_date, value)
> df2
start_date end_date value
1 2001-06-01 2002-05-01 2400.00
2 2002-06-01 2003-05-01 3600.00
3 2003-06-01 2011-09-01 4800.00
4 2011-10-01 2015-10-01 7000.00
5 2015-11-01 2015-12-01 7350.00
6 2016-01-01 2016-12-01 7717.50
7 2017-01-01 2017-12-01 8103.38
8 2018-01-01 2018-12-01 8508.54
In the end I would like to have this result:
date value
1 2017-06-01 8103.38
2 2013-01-01 7000.00
3 2014-03-01 7000.00
4 2008-01-01 4800.00
5 2011-03-01 4800.00
6 2009-03-01 4800.00
7 2012-03-01 7000.00
8 2015-08-01 7000.00
9 2008-08-01 4800.00
Using data.table, you can specify the join condition on the fly:
library(data.table)
setDT(date1) # date1 is the "date" data frame from the question
setDT(df2)
date1[df2, on = .(date >= start_date, date <= end_date), value := i.value]
print(date1)
date value
1: 2008-01-01 4800.00
2: 2008-08-01 4800.00
3: 2009-03-01 4800.00
4: 2011-03-01 4800.00
5: 2012-03-01 7000.00
6: 2013-01-01 7000.00
7: 2014-03-01 7000.00
8: 2015-08-01 7000.00
9: 2017-06-01 8103.38
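As a package-free comparison, the lookup can also be sketched with findInterval(). This assumes that the intervals in df2 are sorted by start_date and do not overlap; the names dates, idx and inside are illustrative, and the data is rebuilt with as.Date() instead of lubridate.

```r
dates <- data.frame(date = as.Date(c(
  "2017-06-01", "2013-01-01", "2014-03-01", "2008-01-01", "2011-03-01",
  "2009-03-01", "2012-03-01", "2015-08-01", "2008-08-01")))

df2 <- data.frame(
  start_date = as.Date(c("2001-06-01", "2002-06-01", "2003-06-01", "2011-10-01",
                         "2015-11-01", "2016-01-01", "2017-01-01", "2018-01-01")),
  end_date = as.Date(c("2002-05-01", "2003-05-01", "2011-09-01", "2015-10-01",
                       "2015-12-01", "2016-12-01", "2017-12-01", "2018-12-01")),
  value = c(2400, 3600, 4800, 7000, 7350, 7717.5, 8103.38, 8508.54)
)

# Index of the last interval whose start_date is <= each date (0 if none)
idx <- findInterval(as.numeric(dates$date), as.numeric(df2$start_date))
idx[idx == 0] <- NA

# Accept the candidate only if the date is also on or before its end_date
inside <- !is.na(idx) & dates$date <= df2$end_date[idx]
dates$value <- ifelse(inside, df2$value[idx], NA)
dates
```

Dates that fall in a gap between two intervals, or before the first start_date, get NA rather than a value.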
I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. In this case I have the volume of 2 incident types at every given minute over two days. I want to create a column which, for each row, adds up all the incidents that occurred before and after it, by type. SUMIF from Excel comes to mind, but I'm not sure how to port that over to R:
EDIT: ADDED set.seed and easier numbers
I have the following data set:
library(data.table)
set.seed(42)
master_min =
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-2 23:00", tz="America/New_York"),
by="min"
))
)
incident1= round(runif(2821, min=0, max=10))
incident2= round(runif(2821, min=0, max=10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1s that occurred before that row's timestamp and all the incident2s that occurred after that row's timestamp? It would be great to get a data.table solution, or failing that a dplyr one, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
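Update:
The following two lines can also do the job. Note that sum1 here includes the current row's incident1, whereas the version above excludes it: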
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the question's example slightly to show a more complete structure:
library(data.table)
master_min <-
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-1 0:09", tz="America/New_York"),
by="min"
))
)
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0
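As a quick check, the variant that excludes the current row on both sides can be run against the five-row example from the question. This is a small sketch with the values typed in by hand; before1 and after2 are illustrative names.

```r
# Values copied from the question's five-row BEFORE table
incident1 <- c(9, 9, 3, 8, 6)
incident2 <- c(6, 5, 5, 6, 9)

# Sum of incident1 strictly before each row
before1 <- cumsum(incident1) - incident1

# Sum of incident2 strictly after each row
after2 <- sum(incident2) - cumsum(incident2)

new_column <- before1 + after2
new_column
# 25 29 33 30 29, matching the AFTER table in the question
```

Both pieces are single vectorised passes over the column, so the approach should carry over to a large table without a loop.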
I have a dataset in long form with start and end dates. For each id there can be multiple start and end dates.
I need to find the difference between the first end date and the second start date. I am not sure how to use two rows to calculate the difference. Any help is appreciated.
df=data.frame(c(1,2,2,2,3,4,4),
as.Date(c( "2010-10-01","2009-09-01","2014-01-01","2014-02-01","2009-01-01","2013-03-01","2014-03-01")),
as.Date(c("2016-04-30","2013-12-31","2014-01-31","2016-04-30","2014-02-28","2013-05-01","2014-08-31")));
names(df)=c('id','start','end')
My output would look like this:
df$diff=c(NA,1,1,NA,NA,304, NA)
Here's an attempt in base R that I think does what you want:
df$diff <- NA
split(df$diff, df$id) <- by(df, df$id, FUN=function(SD) c(SD$start[-1], NA) - SD$end)
df
# id start end diff
#1 1 2010-10-01 2016-04-30 NA
#2 2 2009-09-01 2013-12-31 1
#3 2 2014-01-01 2014-01-31 1
#4 2 2014-02-01 2016-04-30 NA
#5 3 2009-01-01 2014-02-28 NA
#6 4 2013-03-01 2013-05-01 304
#7 4 2014-03-01 2014-08-31 NA
Alternatively, in data.table it would be:
setDT(df)[, diff := shift(start,n=1,type="lead") - end, by=id]
Here's an alternative using the popular dplyr package:
library(dplyr)
df %>%
group_by(id) %>%
mutate(diff = difftime(lead(start), end, units = "days"))
# id start end diff
# (dbl) (date) (date) (dfft)
# 1 1 2010-10-01 2016-04-30 NA days
# 2 2 2009-09-01 2013-12-31 1 days
# 3 2 2014-01-01 2014-01-31 1 days
# 4 2 2014-02-01 2016-04-30 NA days
# 5 3 2009-01-01 2014-02-28 NA days
# 6 4 2013-03-01 2013-05-01 304 days
# 7 4 2014-03-01 2014-08-31 NA days
You can wrap diff in as.numeric if you want.
Again with base R, you can compute the length of each row's own span in days (end minus start, which is not the gap to the next row):
df$noofdays <- as.numeric(difftime(df$end, df$start, units = "days"))
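The gap to the next row can also be sketched in base R without splitting by group, assuming the rows are sorted by id and then by start. The helper vectors next_start and same_id are illustrative names, not part of the answers above.

```r
df <- data.frame(
  id = c(1, 2, 2, 2, 3, 4, 4),
  start = as.Date(c("2010-10-01", "2009-09-01", "2014-01-01", "2014-02-01",
                    "2009-01-01", "2013-03-01", "2014-03-01")),
  end = as.Date(c("2016-04-30", "2013-12-31", "2014-01-31", "2016-04-30",
                  "2014-02-28", "2013-05-01", "2014-08-31"))
)

# Make sure rows are ordered by id, then by start date
df <- df[order(df$id, df$start), ]

# Start date of the following row (NA for the very last row)
next_start <- c(df$start[-1], NA)

# TRUE when the following row belongs to the same id
same_id <- c(df$id[-1] == df$id[-nrow(df)], FALSE)

# Days between this row's end and the next row's start, within an id only
df$diff <- ifelse(same_id, as.numeric(next_start - df$end), NA)
df
```

Subtracting two Date vectors gives a difference in days, so as.numeric() yields plain day counts.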
I'm trying to aggregate two data frames (df1 and df2).
The first contains 3 variables: ID, Date1 and Date2.
df1
ID Date1 Date2
1 2016-03-01 2016-04-01
1 2016-04-01 2016-05-01
2 2016-03-14 2016-04-15
2 2016-04-15 2016-05-17
3 2016-05-01 2016-06-10
3 2016-06-10 2016-07-15
The second also contains 3 variables: ID, Date3 and Value.
df2
ID Date3 Value
1 2016-03-15 5
1 2016-04-04 7
1 2016-04-28 7
2 2016-03-18 3
2 2016-03-27 5
2 2016-04-08 9
2 2016-04-20 2
3 2016-05-05 6
3 2016-05-25 8
3 2016-06-13 3
The idea is to get, for each df1 row, the sum of df2$Value that have the same ID and for which Date3 is between Date1 and Date2:
ID Date1 Date2 SumValue
1 2016-03-01 2016-04-01 5
1 2016-04-01 2016-05-01 14
2 2016-03-14 2016-04-15 17
2 2016-04-15 2016-05-17 2
3 2016-05-01 2016-06-10 14
3 2016-06-10 2016-07-15 3
I know how to write a loop for this, but the data frames are huge! Does anyone have an efficient solution? I have been exploring data.table, plyr and dplyr but could not find one.
A couple of data.table solutions that should scale well (and a good stop-gap until non-equi joins are implemented):
Do the comparison in j using by=.EACHI.
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df1[ df2,
{
idx = Date1 <= i.Date3 & i.Date3 <= Date2
.(Date1 = Date1[idx],
Date2 = Date2[idx],
Date3 = i.Date3,
Value = i.Value)
},
on=c("ID"),
by=.EACHI][, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
foverlaps join (as suggested in the comments)
library(data.table)
setDT(df1)
setDT(df2)
df1[, `:=`(Date1 = as.Date(Date1), Date2 = as.Date(Date2))]
df2[, Date3 := as.Date(Date3)]
df2[, Date4 := Date3]
setkey(df1, ID, Date1, Date2)
foverlaps(df2,
df1,
by.x=c("ID", "Date3", "Date4"),
type="within")[, .(sumValue = sum(Value)), by=.(ID, Date1, Date2)]
# ID Date1 Date2 sumValue
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
Further reading
Rolling join on data.table with duplicate keys
foverlap joins in data.table
With the recently implemented non-equi joins feature in the current development version of data.table, v1.9.7, this can be done as follows (assuming dt1 and dt2 are df1 and df2 converted to data.tables):
dt2[dt1, .(sum = sum(Value)), on=.(ID, Date3>=Date1, Date3<=Date2), by=.EACHI]
# ID Date3 Date3 sum
# 1: 1 2016-03-01 2016-04-01 5
# 2: 1 2016-04-01 2016-05-01 14
# 3: 2 2016-03-14 2016-04-15 17
# 4: 2 2016-04-15 2016-05-17 2
# 5: 3 2016-05-01 2016-06-10 14
# 6: 3 2016-06-10 2016-07-15 3
The column names need some fixing; I will work on that later.
Here's a base R solution using sapply():
df1 <- data.frame(ID=c(1L,1L,2L,2L,3L,3L),Date1=as.Date(c('2016-03-01','2016-04-01','2016-03-14','2016-04-15','2016-05-01','2016-06-01')),Date2=as.Date(c('2016-04-01','2016-05-01','2016-04-15','2016-05-17','2016-06-15','2016-07-15')));
df2 <- data.frame(ID=c(1L,1L,1L,2L,2L,2L,2L,3L,3L,3L),Date3=as.Date(c('2016-03-15','2016-04-04','2016-04-28','2016-03-18','2016-03-27','2016-04-08','2016-04-20','2016-05-05','2016-05-25','2016-06-13')),Value=c(5L,7L,7L,3L,5L,9L,2L,6L,8L,3L));
cbind(df1,SumValue=sapply(seq_len(nrow(df1)),function(ri) sum(df2$Value[df1$ID[ri]==df2$ID & df1$Date1[ri]<=df2$Date3 & df1$Date2[ri]>df2$Date3])));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
Note that your df1 and expected output have slightly different dates in some cases; I used the df1 dates.
Here's another approach that attempts to be more vectorized: Precompute a cartesian product of indexes into the two frames, then perform a single vectorized conditional expression using the index vectors to get matching pairs of indexes, and finally use the matching indexes to aggregate the desired result:
cbind(df1,SumValue=with(expand.grid(i1=seq_len(nrow(df1)),i2=seq_len(nrow(df2))),{
x <- df1$ID[i1]==df2$ID[i2] & df1$Date1[i1]<=df2$Date3[i2] & df1$Date2[i1]>df2$Date3[i2];
tapply(df2$Value[i2[x]],i1[x],sum);
}));
## ID Date1 Date2 SumValue
## 1 1 2016-03-01 2016-04-01 5
## 2 1 2016-04-01 2016-05-01 14
## 3 2 2016-03-14 2016-04-15 17
## 4 2 2016-04-15 2016-05-17 2
## 5 3 2016-05-01 2016-06-15 17
## 6 3 2016-06-01 2016-07-15 3
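One more base R sketch, using merge() and aggregate() on the data exactly as posted in the question. Unlike the expand.grid() approach it only pairs rows that share an ID, but the merged table can still become large when each ID has many rows. Both bounds are treated as inclusive here, which is an assumption; a Date3 that equals a shared boundary between two periods would be counted in both. The names pairs and res are illustrative.

```r
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 2L, 3L, 3L),
  Date1 = as.Date(c("2016-03-01", "2016-04-01", "2016-03-14",
                    "2016-04-15", "2016-05-01", "2016-06-10")),
  Date2 = as.Date(c("2016-04-01", "2016-05-01", "2016-04-15",
                    "2016-05-17", "2016-06-10", "2016-07-15"))
)
df2 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  Date3 = as.Date(c("2016-03-15", "2016-04-04", "2016-04-28", "2016-03-18",
                    "2016-03-27", "2016-04-08", "2016-04-20", "2016-05-05",
                    "2016-05-25", "2016-06-13")),
  Value = c(5L, 7L, 7L, 3L, 5L, 9L, 2L, 6L, 8L, 3L)
)

# All df1/df2 row pairs that share an ID
pairs <- merge(df1, df2, by = "ID")

# Keep pairs where Date3 lies within [Date1, Date2] (inclusive, an assumption)
pairs <- pairs[pairs$Date3 >= pairs$Date1 & pairs$Date3 <= pairs$Date2, ]

# Sum Value per df1 period
res <- aggregate(Value ~ ID + Date1 + Date2, data = pairs, FUN = sum)
res <- res[order(res$ID, res$Date1), ]
rownames(res) <- NULL
names(res)[names(res) == "Value"] <- "SumValue"
res
```

Periods in df1 with no matching Date3 are dropped by this sketch; merging res back onto df1 with all.x = TRUE would be one way to keep them.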