I have a table:
Name| Start | Finish |
----|-----------|-----------|
A |2015-01-22 |2015-02-04 |
B |2015-01-02 |2015-01-10 |
A |2015-01-22 |2015-02-14 |
B |2015-01-02 |2015-02-10 |
I need to break periods by month. If a period starts in one month and ends in the next, I need to split it into two periods. If a period starts and ends in the same month, it should stay as it is. Let's assume a period cannot contain more than one 1st day of a month; in other words, each row can be split into at most two rows. Finish (the end of the period) is always later than Start.
That's what I want to get:
Name| Start | Finish |
----|-----------|-----------|
A |2015-01-22 |2015-01-31 |
A |2015-02-01 |2015-02-04 |
A |2015-01-22 |2015-01-31 |
A |2015-02-01 |2015-02-14 |
B |2015-01-02 |2015-01-10 |
B |2015-01-02 |2015-01-31 |
B |2015-02-01 |2015-02-10 |
The order of the output rows doesn't matter.
Here is the code for the table:
Name = c("A", "B", "A", "B")
Start = c(as.Date("2015-01-22"), as.Date("2015-01-02"), as.Date("2015-01-22"), as.Date("2015-01-02"))
Finish = c(as.Date("2015-02-04"), as.Date("2015-01-10"), as.Date("2015-02-14"), as.Date("2015-02-10"))
df = data.frame(Name, Start, Finish)
Any suggestions on how this can be done?
The question has been changed. Originally the Name column uniquely identified the row but the changed version of the question no longer has that. The answer here has been modified accordingly so that now we identify rows by row number, i.e. 1:nrow(df), rather than df$Name in the second argument to by. Otherwise, code is unchanged.
Use by to split the data frame by row giving single rows and operating on each one with the anonymous function. It calculates the end-of-month (eom) for the Start and if the Finish is greater outputs a two-row data frame and otherwise returns the same data frame. Put it all together with rbind.
library(zoo)
do.call("rbind", by(df, 1:nrow(df), function(x) with(x, {
eom <- as.Date(as.yearmon(Start), frac = 1)
if (eom < Finish)
data.frame(Name, Start = c(Start, eom+1), Finish = c(eom, Finish))
else x
})))
giving:
Name Start Finish
1.1 A 2015-01-22 2015-01-31
1.2 A 2015-02-01 2015-02-04
2 B 2015-01-02 2015-01-10
3.1 A 2015-01-22 2015-01-31
3.2 A 2015-02-01 2015-02-14
4.1 B 2015-01-02 2015-01-31
4.2 B 2015-02-01 2015-02-10
Here's another approach in base R:
idx <- with(df, format(Finish, "%Y-%m") > format(Start, "%Y-%m"))
rbind(df[!idx,],
transform(df[idx,], Finish = as.Date(paste0(format(Finish, "%Y-%m"), "-01"))-1),
transform(df[idx,], Start = as.Date(paste0(format(Finish, "%Y-%m"), "-01"))))
# Name Start Finish
#2 B 2015-01-02 2015-01-10
#1 A 2015-01-22 2015-01-31
#3 A 2015-01-22 2015-01-31
#4 B 2015-01-02 2015-01-31
#11 A 2015-02-01 2015-02-04
#31 A 2015-02-01 2015-02-14
#41 B 2015-02-01 2015-02-10
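Both answers above rely on the question's assumption that a period crosses at most one month boundary. If that assumption does not hold for your data, a minimal base R sketch could look like the following. The helper name split_by_month is hypothetical, and the sample data is the one from the question.

```r
# Sample data from the question.
df <- data.frame(
  Name = c("A", "B", "A", "B"),
  Start = as.Date(c("2015-01-22", "2015-01-02", "2015-01-22", "2015-01-02")),
  Finish = as.Date(c("2015-02-04", "2015-01-10", "2015-02-14", "2015-02-10"))
)

# Hypothetical helper: split one period at every month boundary it crosses.
split_by_month <- function(name, start, finish) {
  # First day of the month in which the period starts.
  first_of_month <- as.Date(format(start, "%Y-%m-01"))
  # First day of every month touched by the period.
  month_starts <- seq(first_of_month, finish, by = "month")
  # The first piece starts at the real Start, later pieces on the 1st.
  starts <- month_starts
  starts[1] <- start
  # Each piece ends the day before the next month starts; the last ends at Finish.
  finishes <- c(month_starts[-1] - 1, finish)
  data.frame(Name = name, Start = starts, Finish = finishes)
}

# Apply the helper row by row and stack the pieces.
pieces <- lapply(seq_len(nrow(df)), function(i)
  split_by_month(df$Name[i], df$Start[i], df$Finish[i]))
out <- do.call(rbind, pieces)
out
```

For the sample data this should yield the same seven rows as the answers above; a period running from January into March would be cut into three rows.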
Edit:
This answers the original question:
require(dplyr)
require(zoo)
df %>%
filter(Finish>as.Date(as.yearmon(Start),frac=1)) %>%
group_by(Name) %>%
do(rbind(.,c(.$Name,
paste(as.Date(as.yearmon(.$Start),frac=1)+1),
.$Finish))) %>%
mutate(Finish:=ifelse(as.Date(as.yearmon(Start),frac=1)<Finish,
paste(as.Date(as.yearmon(Start),frac=1)),Finish))
Output:
Name Start Finish
1 A 2015-01-22 2015-01-31
2 A 2015-02-01 2015-02-04
3 B 2015-03-02 2015-03-31
4 B 2015-04-01 2015-04-10
Sample data:
require(data.table)
df <- fread("Name Start Finish
A 2015-01-22 2015-02-01
B 2015-03-02 2015-04-01")
I am trying to format my data for the spatially-explicit capture-recapture model (secr), for which we need an occasion column. I am taking a 90-day period of the overall data set for each year, with each row being a separate record of an animal on one of the camera traps. Let's say that the first day is Feb 1st, 2019. It should get '1' in the 'occasion' column. The last day, May 1st, should get '90' in that column.
However, here's a catch: there wasn't a capture on every date of that time period, and some days there were multiple captures. So, the dates in the 'dt' column may go like this:
2019-02-01
2019-02-04
2019-02-05
2019-02-06
2019-02-07
2019-02-07
2019-02-07
I want to create an 'occasion' column so that my final table could have columns like this:
2019-02-01 | 1
2019-02-04 | 4
2019-02-05 | 5
2019-02-06 | 6
2019-02-07 | 7
2019-02-07 | 7
2019-02-07 | 7
I have gone two ways about this but neither was successful. Firstly, I tried this:
data_new = data_old %>%
arrange(dt) %>%
mutate(occasion = as.numeric(factor(dt)))
Which gave me a table that looked like this:
2019-02-01 | 1
2019-02-04 | 2
2019-02-05 | 3
2019-02-06 | 4
2019-02-07 | 5
2019-02-07 | 5
2019-02-07 | 5
So, the numbers for identical dates were identical, just how I wanted, but it didn't skip the number if the corresponding date was missing. I tried something more complicated:
First, I got the start dates for each of the 90-day periods per year.
mydttemp <- as.POSIXct("2014-02-01")
mydates = seq.POSIXt(from = mydttemp, length.out = 7, by = "1 year")
The final product for the list 'mydates' looked like this:
"2014-02-01 +11"
"2015-02-01 +10"
"2016-02-01 +10"
"2017-02-01 +10"
"2018-02-01 +10"
"2019-02-01 +10"
"2020-02-01 +10"
Second, I made the 90-day period for each year. I use a loop and the object fileNumber that goes through each year in the list (hence, the mydates[fileNumber] expression).
mydate = seq.POSIXt(from = mydates[fileNumber], length.out = 90, by = "1 day")
mydateseq = seq(as.character(mydate))
Finally, I feed these into the same part of my code, and it looks like this:
for (m in mydateseq) {
data_new = data_old %>%
arrange(dt) %>%
mutate(occasion = if_else(dt %in% mydate,
true = mydateseq[m],
false = NA_real_))
}
The idea was that if a date in the 'dt' column matched any of the dates in the created list, the corresponding number for that date would be put into the new column. But that just gave me a column full of NAs. Any ideas?
Thank you in advance.
Subtract the first date from the dates and add 1:
# test data
s <- c("2019-02-01", "2019-02-04", "2019-02-05", "2019-02-06",
"2019-02-07", "2019-02-07", "2019-02-07")
d <- as.Date(s)
as.integer(d - d[1] + 1L)
## [1] 1 4 5 6 7 7 7
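The snippet above uses the first capture as day 1. If the first capture of a season does not fall on the first day of the 90-day period, a fixed reference date may be safer. A minimal sketch, assuming the period start is known in advance (the name period_start and the sample dates are illustrative):

```r
# Assumed: the 90-day period starts on a known date, which need not
# coincide with the first capture.
period_start <- as.Date("2019-02-01")
d <- as.Date(c("2019-02-04", "2019-02-05", "2019-02-07", "2019-02-07"))

# Day 1 is period_start itself, so add 1 to the day difference.
occasion <- as.integer(d - period_start) + 1L

# Captures outside the 90-day window get NA instead of an occasion number.
occasion[occasion < 1L | occasion > 90L] <- NA_integer_
occasion
```

In the per-year loop from the question, period_start would presumably be as.Date(mydates[fileNumber]), so that both sides of the subtraction are Date objects rather than a mix of Date and POSIXct.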
I am trying to merge two dataframes based on a conditional relationship between several dates associated with unique identifiers but distributed across different observations (rows).
I have two large datasets with unique identifiers. One dataset has 'enter' and 'exit' dates (alongside some other variables).
> df1 <- data.frame(ID=c(1,1,1,2,2,3,4),
enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'),
+ exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
> dcis <- grep('date$',names(df1));
> df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
> df1;
ID enter.date exit.date
1 1 2015-05-07 2015-07-01
2 1 2015-07-10 2015-10-15
3 1 2017-08-25 2017-09-03
4 2 2016-09-01 2016-09-30
5 2 2018-01-05 2019-06-01
6 3 2016-05-01 2017-05-01
7 4 2017-04-08 2017-06-08
and the other has "eval" dates.
> df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
> df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
> df2;
ID eval.date
1 1 2015-10-30
2 2 2016-10-10
3 2 2019-09-10
4 3 2018-05-15
5 4 2015-01-19
I am trying to calculate the average interval of time from 'exit' to 'eval' for each individual in the dataset. However, I only want those 'evals' that come after a given individual's 'exit' and before the next 'enter' for that individual (there are no 'eval' observations between enter and exit for a given individual), if such an 'eval' exists.
In other words, I'm trying to get an output that looks like this from the two dataframes above.
> df3 <- data.frame(ID=c(1,2,2,3), enter.date=c('7/10/2015','9/1/2016','1/05/2018','5/01/2016'),
+ exit.date = c('10/15/2015', '9/30/2016', '6/01/2019', '5/01/2017'),
+ assess.date=c('10/30/2015', '10/10/2016', '9/10/2019', '5/15/2018'));
> dcis <- grep('date$',names(df3));
> df3[dcis] <- lapply(df3[dcis],as.Date,'%m/%d/%Y');
> df3$time.diff<-difftime(df3$exit.date, df3$assess.date)
> df3;
ID enter.date exit.date assess.date time.diff
1 1 2015-07-10 2015-10-15 2015-10-30 -15 days
2 2 2016-09-01 2016-09-30 2016-10-10 -10 days
3 2 2018-01-05 2019-06-01 2019-09-10 -101 days
4 3 2016-05-01 2017-05-01 2018-05-15 -379 days
Once I perform the merge finding the averages is easy enough with
> aggregate(df3[,5], list(df3$ID), mean)
Group.1 x
1 1 -15.0
2 2 -55.5
3 3 -379.0
but I'm really at a loss as to how to perform the merge. I've tried to use left_join and fuzzyjoin to perform the merge per the advice given here and here, but I'm inexperienced at R and couldn't figure it out. I would really appreciate it if someone could walk me through it - thanks!
A few other descriptive notes about the data: each ID may have some number of rows associated with it in each dataframe. df1 has enter dates which mark the beginning of a service delivery and exit dates that mark the end of a service delivery. All enters have one corresponding exit. df2 has eval dates. Eval dates can occur at any time when an individual is not receiving the service. There may be many evals between one period of service delivery and the next, or there may be no evals.
Just discovered the sqldf package. Assuming that for each ID the date ranges are in ascending order, you might use it like this:
df1 <- data.frame(ID=c(1,1,1,2,2,3,4), enter.date=c('5/07/2015','7/10/2015','8/25/2017','9/1/2016','1/05/2018','5/01/2016','4/08/2017'), exit.date = c('7/1/2015', '10/15/2015', '9/03/2017', '9/30/2016', '6/01/2019',
'5/01/2017', '6/08/2017'));
dcis <- grep('date$',names(df1));
df1[dcis] <- lapply(df1[dcis],as.Date,'%m/%d/%Y');
df1;
df2 <- data.frame(ID=c(1,2,2,3,4), eval.date=c('10/30/2015',
'10/10/2016','9/10/2019','5/15/2018','1/19/2015'));
df2$eval.date<-as.Date(df2$eval.date, '%m/%d/%Y')
df2;
library(sqldf)
df1 = unsplit(lapply(split(df1, df1$ID, drop=FALSE), function(df) {
df$next.date = as.Date('2100-12-31')
if (nrow(df) > 1)
df$next.date[1:(nrow(df) - 1)] = df$enter.date[2:nrow(df)]
df
}), df1$ID)
sqldf('
select df1.*, df2.*, df1."exit.date" - df2."eval.date" as "time.diff"
from df1, df2
where df1.ID == df2.ID
and df2."eval.date" between df1."exit.date"
and df1."next.date"')
ID enter.date exit.date next.date ID..5 eval.date time.diff
1 1 2015-07-10 2015-10-15 2017-08-25 1 2015-10-30 -15
2 2 2016-09-01 2016-09-30 2018-01-05 2 2016-10-10 -10
3 2 2018-01-05 2019-06-01 2100-12-31 2 2019-09-10 -101
4 3 2016-05-01 2017-05-01 2100-12-31 3 2018-05-15 -379
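If you would rather avoid SQL, the same join can be sketched in base R with merge() followed by a row filter. This is only an illustration under the same assumption as above (one exit per enter, and a far-future sentinel date for the last period of each ID); the dates are the question's data written in ISO format.

```r
df1 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 4),
  enter.date = as.Date(c("2015-05-07", "2015-07-10", "2017-08-25", "2016-09-01",
                         "2018-01-05", "2016-05-01", "2017-04-08")),
  exit.date = as.Date(c("2015-07-01", "2015-10-15", "2017-09-03", "2016-09-30",
                        "2019-06-01", "2017-05-01", "2017-06-08"))
)
df2 <- data.frame(
  ID = c(1, 2, 2, 3, 4),
  eval.date = as.Date(c("2015-10-30", "2016-10-10", "2019-09-10",
                        "2018-05-15", "2015-01-19"))
)

# Sort so that "next enter" is well defined within each ID.
df1 <- df1[order(df1$ID, df1$enter.date), ]

# next.date = enter date of the following row; the last row of each ID
# gets a far-future sentinel instead.
n <- nrow(df1)
last_of_id <- c(df1$ID[-1] != df1$ID[-n], TRUE)
df1$next.date <- c(df1$enter.date[-1], as.Date("2100-12-31"))
df1$next.date[last_of_id] <- as.Date("2100-12-31")

# All enter/exit x eval combinations per ID, then keep evals that fall
# between an exit and the next enter.
m <- merge(df1, df2, by = "ID")
m <- m[m$eval.date >= m$exit.date & m$eval.date <= m$next.date, ]
m$time.diff <- as.integer(m$exit.date - m$eval.date)
m[, c("ID", "enter.date", "exit.date", "eval.date", "time.diff")]
```

merge() builds every per-ID combination before filtering, so for large data a dedicated non-equi join may be preferable.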
I have a big data frame with dates and I need to check for the first date in a continuous way, as follows:
ID ID_2 END BEG
1 55 2017-06-30 2016-01-01
1 55 2015-12-31 2015-11-12 --> Gap (required date)
1 88 2008-07-26 2003-02-24
2 19 2014-09-30 2013-05-01
2 33 2013-04-30 2011-01-01 --> Not Gap (overlapping)
2 19 2012-12-31 2011-01-01
2 33 2010-12-31 2008-01-01
2 19 2007-12-31 2006-01-01
2 19 2005-12-31 1980-10-20 --> No actual Gap(required date)
As shown, not all the dates overlap, and I need to return by ID (not ID_2) the date when the first gap (going backwards in time) appears. I've tried using a for loop but it's extremely slow (the data frame has 150k rows). I've been messing around with dplyr and mutate as follows:
df <- df%>%
group_by(ID)%>%
mutate(END_lead = lead(END))
df$FLAG <- df$BEG - days(1) == df$END_lead
df <- df%>%
group_by(ID)%>%
filter(cumsum(cumsum(FLAG == FALSE))<=1)
But this set of instructions stops at the first overlap, filtering the wrong date. I've tried everything I could think of, ordering in decreasing or ascending order, and using min and max, but could not figure out a solution.
The actual result wanted would be:
ID ID_2 END BEG
1 55 2015-12-31 2015-11-12
2 19 2005-12-31 1980-10-20
Is there a way of doing this using dplyr,tidyr and lubridate?
A possible solution using dplyr:
library(dplyr)
df %>%
mutate_at(vars(END, BEG), as.Date) %>%
group_by(ID) %>%
slice(which.max(BEG > ( lead(END) + 1 ) | is.na(BEG > ( lead(END) + 1 ))))
With your last data, it gives:
# A tibble: 2 x 4
# Groups: ID [2]
ID ID_2 END BEG
<int> <int> <date> <date>
1 1 55 2015-12-31 2015-11-12
2 2 19 2005-12-31 1980-10-20
What the solution does is basically:
Changes the dates to Date format (no need for lubridate);
Groups by ID;
Selects the highest row that satisfies your criteria, i.e. the highest row which is either a gap (TRUE), or if there is no gap it is the first row (meaning it has a missing value when checking for a gap, this is why is.na(BEG > ( lead(END) + 1 ))).
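The steps above can also be sketched in base R, which makes the gap test explicit. This assumes the rows are already sorted from newest to oldest within each ID, as in the question; the helper name first_gap is hypothetical.

```r
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 2, 2),
  ID_2 = c(55, 55, 88, 19, 33, 19, 33, 19, 19),
  END = as.Date(c("2017-06-30", "2015-12-31", "2008-07-26", "2014-09-30",
                  "2013-04-30", "2012-12-31", "2010-12-31", "2007-12-31",
                  "2005-12-31")),
  BEG = as.Date(c("2016-01-01", "2015-11-12", "2003-02-24", "2013-05-01",
                  "2011-01-01", "2011-01-01", "2008-01-01", "2006-01-01",
                  "1980-10-20"))
)

# Hypothetical helper: first row (going back in time) that is preceded by a gap.
first_gap <- function(g) {
  # END of the next (older) row; NA for the oldest row of the group.
  older_end <- c(g$END[-1], as.Date(NA))
  # A gap exists when this period begins more than one day after the older one ends.
  is_gap <- g$BEG > older_end + 1
  # The oldest row has nothing before it, so it always qualifies.
  is_gap[is.na(is_gap)] <- TRUE
  g[which(is_gap)[1], ]
}

res <- do.call(rbind, lapply(split(df, df$ID), first_gap))
res
```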
I would use the xts package, first creating an xts object for each ID you have, then using the first() and last() functions on each object.
https://www.datacamp.com/community/blog/r-xts-cheat-sheet
I have the following dataframe:
Person A took 5 vacations; the first was from 2015-03-11 to 2015-03-15 and the last from 2016-02-04 to 2016-02-10.
Person fromDate toDate
A 2015-03-11 2015-03-15
A 2015-04-17 2015-06-16
A 2015-09-18 2015-10-12
A 2015-12-18 2016-01-02
A 2016-02-04 2016-02-10
B 2015-04-10 2016-04-16
B 2016-12-12 2016-12-20
C 2015-01-02 2015-02-04
C 2015-03-03 2015-03-05
C 2015-04-04 2015-04-07
C 2016-01-03 2016-01-10
C 2016-10-12 2016-10-15
C 2016-11-01 2016-11-05
I want to find all persons who took at least 5 vacations within 365 days.
In the example above, Person A went on vacation 5 times within 365 days. Person C went on 6 vacations, but not within 365 days.
The result should be a dataframe like
Person at_least_five_vacations_within_365_days
A TRUE
B FALSE
C FALSE
Your data:
library(data.table)
library(lubridate)
library(dplyr) # needed here for %>% and mutate()
df <- fread("Person\tfromDate\ttoDate
A\t2015-03-11\t2015-03-15
A\t2015-04-17\t2015-06-16
A\t2015-09-18\t2015-10-12
A\t2015-12-18\t2016-01-02
A\t2016-02-04\t2016-02-10
B\t2015-04-10\t2016-04-16
B\t2016-12-12\t2016-12-20
C\t2015-01-02\t2015-02-04
C\t2015-03-03\t2015-03-05
C\t2015-04-04\t2015-04-07
C\t2016-01-03\t2016-01-10
C\t2016-10-12\t2016-10-15
C\t2016-11-01\t2016-11-05",header="auto",sep="auto") %>%
as.data.frame() %>%
mutate(fromDate=ymd(fromDate), toDate=ymd(toDate))
Setting number of trips window:
numoftrips <- 5
Using dplyr and assuming your dates are already sorted within each Person
library(dplyr)
df1 <- df %>%
group_by(Person) %>%
mutate(toCompare=lead(toDate,(numoftrips-1))) %>% # Return date of the 4th trip after this one, i.e. the 5th trip of the window
mutate(within.year=(toCompare-fromDate)<=365) %>% # Check if the window spans at most 365 days
summarise(at_least_five_vacations_within_365_days=any(within.year,na.rm=TRUE)) # TRUE if any window of 5 trips fits within 365 days
Output
df1
Person at_least_five_vacations_within_365_days
1 A TRUE
2 B FALSE
3 C FALSE
This might work. But you should specify the expected output.
library(dplyr)
df %>% group_by(Person) %>%
mutate(diff = toDate - fromDate,instances = n())%>%
filter (instances >=5 & diff < 365)
df is just your dataset and instances is the number of visits for person
The accepted answer uses data.table to read the data but continues with a dplyr approach.
The approach below uses read_table2() from the readr package but achieves the desired result with a data.table "one-liner":
library(data.table) # CRAN version 1.10.4 used
n_trips <- 5L
n_days <- 365L
DT[order(Person, fromDate),
any(fromDate <= shift(toDate, n_trips - 1L, , "lag") + n_days, na.rm = TRUE),
by = Person][]
Person V1
1: A TRUE
2: B FALSE
3: C FALSE
Explanation
The approach is similar to the accepted answer: the toDate is lagged by the required number of trips of the person, and then it is checked whether the actual fromDate is within the given range of days. The any() function is used to determine if there is at least one occurrence for a particular person. The result of shift operations depends on the order of rows, so the data.table is ordered beforehand.
The OP has requested to find all persons which made within 365 days at least 5 times vacations but he hasn't specified exactly how to count the vacations (by start date, by end date, or by a mixture of both?). So, it has been deliberately chosen to check the end date of the 4th previous vacation vs the start date of the actual vacation.
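For comparison, the same check (start of a trip against the end of the 4th previous trip) can be sketched in base R without shift(). The data frame name trips and the helper has_cluster are hypothetical; the dates are the question's data.

```r
trips <- data.frame(
  Person = c(rep("A", 5), rep("B", 2), rep("C", 6)),
  fromDate = as.Date(c("2015-03-11", "2015-04-17", "2015-09-18", "2015-12-18",
                       "2016-02-04", "2015-04-10", "2016-12-12", "2015-01-02",
                       "2015-03-03", "2015-04-04", "2016-01-03", "2016-10-12",
                       "2016-11-01")),
  toDate = as.Date(c("2015-03-15", "2015-06-16", "2015-10-12", "2016-01-02",
                     "2016-02-10", "2016-04-16", "2016-12-20", "2015-02-04",
                     "2015-03-05", "2015-04-07", "2016-01-10", "2016-10-15",
                     "2016-11-05"))
)
n_trips <- 5L
n_days <- 365L

# Hypothetical helper: TRUE if some run of n_trips consecutive trips fits the window.
has_cluster <- function(from, to) {
  k <- length(from)
  if (k < n_trips) return(FALSE)
  # Order trips by start date so "previous trip" is well defined.
  o <- order(from)
  from <- from[o]
  to <- to[o]
  # Pair each trip from the n_trips-th onwards with the trip (n_trips - 1) before it.
  later_start <- from[n_trips:k]
  earlier_end <- to[1:(k - n_trips + 1L)]
  any(later_start <= earlier_end + n_days)
}

res <- sapply(split(trips, trips$Person),
              function(p) has_cluster(p$fromDate, p$toDate))
res
```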
Data
DT <- readr::read_table2(
"Person fromDate toDate
A 2015-03-11 2015-03-15
A 2015-04-17 2015-06-16
A 2015-09-18 2015-10-12
A 2015-12-18 2016-01-02
A 2016-02-04 2016-02-10
B 2015-04-10 2016-04-16
B 2016-12-12 2016-12-20
C 2015-01-02 2015-02-04
C 2015-03-03 2015-03-05
C 2015-04-04 2015-04-07
C 2016-01-03 2016-01-10
C 2016-10-12 2016-10-15
C 2016-11-01 2016-11-05"
)
library(data.table)
setDT(DT)
I have a data frame where each row has a unique ID. I need to replicate each one of these rows based on the number of days between the start date and the max of the end date and the approval date.
ID <- c(1,2)
Value <- c(10,20)
StartDate <- c(as.Date("01/01/2015", '%d/%m/%Y'),
as.Date("01/01/2015", '%d/%m/%Y'))
EndDate <- c(as.Date("31/01/2015", '%d/%m/%Y'),
as.Date("15/01/2015", '%d/%m/%Y'))
AppDate <- c(as.Date("15/01/2015", '%d/%m/%Y'),
as.Date("15/02/2015", '%d/%m/%Y'))
df <- data.frame(ID, Value, StartDate, EndDate, AppDate)
df <- df[rep(row.names(df), ifelse(as.numeric(df$AppDate) >
as.numeric(df$EndDate),as.numeric(df$AppDate-df$StartDate),
as.numeric(df$EndDate-df$StartDate)) + 1),]
I then need to add a sequential list of dates from the start date to the max of the end date or approval date.
I've done this via 2 loops. The outer loop loops through the data frame for each unique ID. The second loop then goes through the ID and adds the date. Once the second loop has finished it passes the row to the outer loop as the new start point.
IDs <- unique(df$ID)
df$Days <- rep(as.Date("01/01/1999",'%d/%m/%Y'), nrow(df))
counter <- 1
for (i in 1:length(IDs)) {
ref <- IDs[i]
start <- 1
while (df$ID[counter] == ref) {
ifelse(start == 1, df$Days[counter] <- df$StartDate[counter],
df$Days[counter] <- df$StartDate[counter] + start -1)
ifelse (counter > nrow(df), break, counter <- counter + 1)
ifelse (counter > nrow(df), break, start <- start + 1)
}
}
My actual data set has over 6,000 IDs and once I've replicated the rows it ends up being over 500,000 rows. The loop took over 15 minutes to run so it's obviously very inefficient.
So I guess I have 2 questions.
1). What is the most efficient way to do this in R
2). What would be the most efficient way of doing this in general i.e. in say something like C++
thanks
Here is one solution that is vectorized. Note: Your code does not match the concept of taking the maximum of EndDate and AppDate, which I tried to do, but if that is not what you want, you can modify the code accordingly.
library(dplyr)
df <- df %>% group_by(ID) %>% mutate(Days = rep(seq(min(StartDate), max(EndDate, AppDate), 'days'), ceiling(nrow(df) / n()))[1:n()])
Output will be as follows (just the first few rows):
head(df)
Source: local data frame [6 x 6]
Groups: ID [1]
ID Value StartDate EndDate AppDate Days
(dbl) (dbl) (date) (date) (date) (date)
1 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-01
2 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-02
3 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-03
4 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-04
5 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-05
6 1 10 2015-01-01 2015-01-31 2015-01-15 2015-01-06
tail(df)
Source: local data frame [6 x 6]
Groups: ID [1]
ID Value StartDate EndDate AppDate Days
(dbl) (dbl) (date) (date) (date) (date)
1 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-10
2 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-11
3 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-12
4 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-13
5 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-14
6 2 20 2015-01-01 2015-01-15 2015-02-15 2015-02-15
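Both the row replication and the date sequence can also be done in one vectorized pass in base R. A minimal sketch, starting from the original two-row data frame and taking the later of EndDate and AppDate per row (the names n_days and expanded are illustrative):

```r
df <- data.frame(
  ID = c(1, 2),
  Value = c(10, 20),
  StartDate = as.Date(c("2015-01-01", "2015-01-01")),
  EndDate = as.Date(c("2015-01-31", "2015-01-15")),
  AppDate = as.Date(c("2015-01-15", "2015-02-15"))
)

# Number of days per row, counting both the start and the later end date.
n_days <- pmax(as.integer(df$EndDate - df$StartDate),
               as.integer(df$AppDate - df$StartDate)) + 1L

# Repeat each row n_days times.
expanded <- df[rep(seq_len(nrow(df)), times = n_days), ]

# sequence() restarts at 1 for every original row, giving offsets 0, 1, 2, ...
expanded$Days <- expanded$StartDate + (sequence(n_days) - 1L)
rownames(expanded) <- NULL

head(expanded)
tail(expanded)
```

Since nothing here loops over rows or groups, it should scale to the 500,000-row case far better than the nested loops in the question.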
Normally, I would recommend a cross join SQL query that returns a Cartesian product (all combinations between two sets). However, you can replicate the cross join in R using merge() without any by arguments and with all=TRUE. From there, filter for the EndDate cut-off:
# CALCULATE CONDITIONAL END DATE
df$TrueEndDate <- as.Date(ifelse(df$AppDate > df$EndDate,
df$AppDate,
df$EndDate), origin="1970-01-01")
# CREATE A SEQUENTIAL DATES DATA FRAME (HERE IS 60 DAYS FROM 2015-01-01)
dates <- data.frame(Date=as.Date(unlist(lapply(0:60, function(x)
as.Date("2015-01-01") + x)),
origin="1970-01-01"))
# RUN CROSS JOIN MERGE, PULLING ONLY NEEDED FIELDS
mergedf <- merge(df[c('ID', 'StartDate', 'TrueEndDate')], dates, all=TRUE)
# FILTER OUT DATES PAST ROW'S TRUE END DATE
mergedf <- mergedf[(mergedf$Date <= mergedf$TrueEndDate),]
# CLEANUP
mergedf <- mergedf[with(mergedf, order(ID)), ] # ORDER BY ID
row.names(mergedf) <- 1:nrow(mergedf) # RESET ROW NAMES
Should you be curious about the equivalent cross join SQL (which you can have R call on an RDBMS engine and import as the final data frame, which may help with performance issues):
SELECT ID.ID, ID.Value, ID.StartDate,
CASE WHEN ID.AppDate > ID.EndDate
THEN ID.AppDate
ELSE ID.EndDate
END As TrueEndDate,
Dates.Dates
FROM ID, Dates
WHERE Dates.Dates <= CASE WHEN ID.AppDate > ID.EndDate
THEN ID.AppDate ELSE ID.EndDate
END
ORDER BY ID.ID, Dates.Dates