Related
I have a dataset with sales values for N products, from some yyyy-mm-dd to some yyyy-mm-dd, and I just want to filter the data to the last 12 months for each product in the dataset.
Eg:
Say I have values from 2016-01-01 to 2020-02-01.
So now I want to filter the sales values for the last 12 months, that is, from 2019-02-01 to 2020-02-01.
I cannot simply write filter(Month >= as.Date("2019-04-01") & Month <= as.Date("2020-04-01")) because the end date keeps changing as every month passes, so I need to automate the filter.
You can use:
library(dplyr)
library(lubridate)

data %>%
  group_by(Product) %>%
  filter(between(date, max(date) - years(1), max(date)))
  # equivalently: filter(date >= (max(date) - years(1)) & date <= max(date))
You can test whether the date is greater than or equal to the maximal date per product minus 365 days (note that 365 days is not exactly one calendar year when a leap day falls in the window):
library(dplyr)
df %>%
  group_by(Products) %>%
  filter(Date >= max(Date) - 365)
# A tibble: 6 x 2
# Groups: Products [3]
Products Date
<dbl> <date>
1 1 2002-01-21
2 1 2002-02-10
3 2 2002-02-24
4 2 2002-02-10
5 2 2001-07-01
6 3 2005-03-10
Data
df <- data.frame(
  Products = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  Date = as.Date(c("2000-02-01", "2002-01-21", "2002-02-10",
                   "2000-06-01", "2002-02-24", "2002-02-10",
                   "2001-07-01", "2003-01-02", "2005-03-10",
                   "2002-05-01")))
If your aim is just to capture entries from today back to the same day last year, then:
The function Sys.Date() returns the current date as an object of class Date. You can then convert that to POSIXlt form to adjust the year and get the start date. For example:
end.date <- Sys.Date()
end.date.lt <- as.POSIXlt(end.date)
start.date.lt <- end.date.lt
start.date.lt$year <- start.date.lt$year - 1
start.date <- as.POSIXct(start.date.lt)
Now this does have one potential failure case: if today is February 29th. One way to deal with that would be to write a "today.last.year" function that does the above conversion but gives an explicit treatment to leap years, possibly including an option to count "today last year" as either February 28th or March 1st, depending on which gives you the desired behaviour.
Alternatively, if you want to filter based on a start-of-month date, you can make your function also set start.date.lt$mday <- 1, and so forth if you need to adjust in different ways.
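A sketch of such a today.last.year helper, under the leap-day option described above (the feb29 argument is a hypothetical name for the choice, not an established API):

# Hypothetical helper: today's date shifted back one year, with explicit
# Feb 29 handling ("feb28" rolls back to Feb 28, "mar01" forward to Mar 1).
today.last.year <- function(date = Sys.Date(), feb29 = c("feb28", "mar01")) {
  feb29 <- match.arg(feb29)
  lt <- as.POSIXlt(date)
  if (lt$mon == 1 && lt$mday == 29) {  # POSIXlt months are 0-based, so 1 = February
    if (feb29 == "feb28") {
      lt$mday <- 28
    } else {
      lt$mon <- 2
      lt$mday <- 1
    }
  }
  lt$year <- lt$year - 1
  as.Date(lt)
}

today.last.year(as.Date("2020-02-29"))                   # "2019-02-28"
today.last.year(as.Date("2020-02-29"), feb29 = "mar01")  # "2019-03-01"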
Input:
product date
1: a 2017-01-01
2: b 2017-04-01
3: a 2017-07-01
4: b 2017-10-01
5: a 2018-01-01
6: b 2018-04-01
7: a 2018-07-01
8: b 2018-10-01
9: a 2019-01-01
10: b 2019-04-01
11: a 2019-07-01
12: b 2019-10-01
Code:
library(lubridate)
library(data.table)
DT <- data.table(
  product = rep(c("a", "b"), 6),
  date = seq(as.Date("2017-01-01"), as.Date("2019-12-31"), by = "quarter")
)
yearBefore <- function(x){
  year(x) <- year(x) - 1
  x
}
# last observed date per product
date_DT <- DT[, .(last_date = last(date)), by = product]
# shift it back one year
date_DT[, year_before := yearBefore(last_date)]
# non-equi join: keep each product's rows on or after its year-before date
result <- date_DT[DT, on = .(product, year_before <= date), nomatch = 0]
result[, last_date := NULL]
setnames(result, "year_before", "date")
Output:
product date
1: a 2018-07-01
2: b 2018-10-01
3: a 2019-01-01
4: b 2019-04-01
5: a 2019-07-01
6: b 2019-10-01
Is this what you are looking for?
I want to select rows of data with dates such that the dates have a minimum time difference of 3 months.
Here is an example:
patient numsermed date
1: 1 numser1 2020-01-08
2: 2 numser2 2015-01-02
3: 2 numser2 2019-12-12
4: 2 numser2 2020-01-05
5: 2 numser2 2020-01-08
6: 2 numser2 2020-01-20
7: 2 numser2 2020-03-15
8: 2 numser2 2020-03-18
9: 2 numser3 2020-03-13
10: 2 numser3 2020-03-18
11: 3 numser3 2020-01-22
12: 4 numser4 2018-01-02
I want, by patient and numsermed, to keep the dates that are at least 3 months apart. I cannot simply use successive differences. The expected result is:
patient numsermed date
1: 1 numser1 2020-01-08
2: 2 numser2 2015-01-02
3: 2 numser2 2019-12-12
4: 2 numser2 2020-03-15
5: 2 numser3 2020-03-13
6: 3 numser3 2020-01-22
7: 4 numser4 2018-01-02
Here, for numser2 and patient 2, after 2019-12-12 the next date at least 3 months later is 2020-03-15, which I keep; I thus remove 2020-01-05, 2020-01-08 and 2020-01-20.
I then remove 2020-03-18, which is only 3 days after 2020-03-15.
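To make the rule concrete, here is a minimal base-R sketch for a single patient/numsermed group (keep_spaced is a hypothetical helper; 90 days stands in for 3 months, as in the solutions below):

keep_spaced <- function(dates, gap = 90) {
  dates <- sort(dates)
  keep <- dates[1]
  for (d in dates[-1]) {
    # keep a date only if it is at least `gap` days after the last kept date
    if (d >= keep[length(keep)] + gap) keep <- c(keep, d)
  }
  keep
}
keep_spaced(as.Date(c("2015-01-02", "2019-12-12", "2020-01-05", "2020-01-08",
                      "2020-01-20", "2020-03-15", "2020-03-18")))
# [1] "2015-01-02" "2019-12-12" "2020-03-15"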
Here is my solution with data.table:
library(data.table)
library(lubridate)

setkeyv(test, c("numsermed", "patient", "date"))
test[, indx := .I - .I[1] + 1, by = .(numsermed, patient)]
test[, N := .N, by = .(numsermed, patient)]
Nmax <- max(test$N)
test[, supp := 0]
for (i in 1:Nmax) {
  test[N > 1, supp := ifelse(i < indx & date < date[i] + 90, 1, 0),
       by = .(numsermed, patient)]
  test <- test[supp != 1]
  test[, indx := .I - .I[1] + 1, by = .(numsermed, patient)]
}
The idea is, for each row, to test the condition and then perform the subset. It seems to work, but on a million-row table it is rather slow (a few hours). I am sure there is an efficient way with a non-equi join or rolling join in data.table, but I did not manage to write it. Could someone come up with a more efficient solution? dplyr solutions are of course welcome too.
The data:
library(data.table)
library(lubridate)
test <- setDT(list(patient = c(1:3, 2),
                   numsermed = c(paste0("numser", 1:3), "numser2"),
                   date = as_date(c("2020-01-08", "2020-01-20", "2020-01-22", "2019-12-12"))))
test<-rbind(test,data.table(patient=4,numsermed="numser4",date=as_date("2018-01-02")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2015-01-02")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-15")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-05")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-01-08")))
test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-13")))
test<-rbind(test,data.table(patient=2,numsermed="numser3",date=as_date("2020-03-18")))
test<-rbind(test,data.table(patient=2,numsermed="numser2",date=as_date("2020-03-18")))
Edit
I propose a comparison of the proposed solutions: @Ben's, @chinsoon12's, and @astrofunkswag's.
Here is the test data:
library(data.table)
library(lubridate)
library(magrittr)
set.seed(1234)
origin <- "1970-01-01"
dt <- data.table(numsermed = sample(paste0("numsermed",1:30),10000,replace = T))
dt[,patient := sample(1:10000,.N,replace = T),by = numsermed]
dt[,date := sample((dmy("01.01.2019") %>% as.numeric()):(dmy("01.01.2020") %>% as.numeric()),.N),by = .(patient)]
and here are the four functions, including mine:
ben = function(dt){
dt[, c("idx", "date2") := list(.I, date - 90L)]
dt_final <- unique(dt[dt, on = c(patient = "patient", numsermed = "numsermed", date = "date2"),
roll = -Inf][order(i.date)], by = "idx")
setorderv(dt_final, c("patient", "numsermed", "i.date"))
return(dt_final[,.(patient,numsermed,date = i.date)])
}
chinson = function(dt){
dt[, d := as.integer(date)]
setkey(dt,date)
return( dt[dt[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
.I[1L], .(patient, numsermed, g)]$V1][,.(patient,numsermed,date)])
}
sum_reset_at <- function(thresh) {
function(x) {
accumulate(x, ~if_else(.x>=thresh, .y, .x+.y))
}
}
mon_diff <- function(d1, d2){
12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}
library(tidyverse); library(zoo)
astrofun = function(dt){
return(
dt %>%
group_by(patient, numsermed) %>%
mutate(diff1 = mon_diff(date, lag(date)),
diff1 = if_else(is.na(diff1), 300, diff1)) %>%
mutate(diff2 = sum_reset_at(3)(diff1)) %>%
filter(diff2 >= 3) %>%
select(-contains('diff'))
)
}
denis = function(dt){
df <- copy(dt)
setkeyv(dt,c("numsermed","patient","date"))
df[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
df[,N := .N,by = .(numsermed,patient)]
Nmax <- max(df[,N])
df[,supp := 0]
for(i in 1:Nmax){
df[N>1 ,supp := ifelse(i < indx & date < date[i] + 90,1,0),
by = .(numsermed,patient)]
df <- df[supp != 1 ]
df[,indx := .I - .I[1]+1,by = .(numsermed,patient)]
}
return(df[,.(patient,numsermed,date)])
}
First, none of them produce the same result! denis(dt) outputs 9833 rows, ben(dt) 9928, chinson(dt) 9929, and @astrofunkswag's astrofun(dt) 9990 rows. I am not sure why they do not produce the same output, nor which solution is the correct one (I would say mine, just to be pretentious, but I am not even sure).
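One way to pin down where they disagree is an anti-join on the kept rows. A sketch, assuming each function returns a data.table with columns patient, numsermed and date (copy() guards against the by-reference columns some of the functions add to dt):

res_ben   <- ben(copy(dt))
res_denis <- denis(copy(dt))
# rows ben keeps but denis drops, and vice versa
res_ben[!res_denis, on = .(patient, numsermed, date)]
res_denis[!res_ben, on = .(patient, numsermed, date)]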
Then a benchmark to compare efficiency:
library(microbenchmark)
microbenchmark(ben(dt),
               chinson(dt),
               astrofun(dt),
               denis(dt),
               times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
ben(dt) 17.3841 19.8321 20.88349 20.9609 21.8815 23.5125 10
chinson(dt) 230.8868 232.6298 275.16637 236.8482 239.0144 544.2292 10
astrofun(dt) 4460.2159 4565.9120 4795.98600 4631.3251 5007.8055 5687.7717 10
denis(dt) 68.0480 68.4170 88.88490 80.9636 90.0514 142.9553 10
@Ben's solution with the rolling join is of course the fastest. Mine is not that bad, and @astrofunkswag's solution is super slow, because of the cumulative sum I guess.
With data.table you could try the following. This would involve creating a second date 90 days prior and then doing a rolling join.
library(data.table)
setDT(test)[, c("idx", "date2") := list(.I, date - 90L)]
test_final <- unique(test[test, on = c(patient = "patient", numsermed = "numsermed", date = "date2"),
                          roll = -Inf][order(i.date)], by = "idx")
setorderv(test_final, c("patient", "numsermed", "i.date"))
test_final
Output
(i.date has the final date desired)
patient numsermed date idx date2 i.date i.idx
1: 1 numser1 2019-10-10 1 2019-10-10 2020-01-08 1
2: 2 numser2 2014-10-04 6 2014-10-04 2015-01-02 6
3: 2 numser2 2019-09-13 4 2019-09-13 2019-12-12 4
4: 2 numser2 2019-12-16 8 2019-10-07 2020-03-15 7
5: 2 numser3 2019-12-14 10 2019-12-14 2020-03-13 10
6: 3 numser3 2019-10-24 3 2019-10-24 2020-01-22 3
7: 4 numser4 2017-10-04 5 2017-10-04 2018-01-02 5
Here is a solution with dplyr and purrr. I use 2 helper functions, one to calculate month difference and one to calculate a cumulative sum that resets when a threshold is reached, credit to this post.
I calculate the month difference with the lagged date value, but you want to include the first row, which will be NA. One weird part is that, to include NA, the easiest thing for me was to convert NA to some value 3 or greater; I arbitrarily made it 300. You could likely modify the sum_reset_at function to handle NA the way you want (a sketch of one such variant follows the output below). You might also want to condense the code, since I do multiple mutate calls and then deselect those columns, but I kept everything in separate steps to make it clearer what is happening. I think this functional-programming solution will be quicker than your current one, but I haven't tested it on a large dataset.
test <- test %>% arrange(patient, numsermed, date)
library(tidyverse); library(zoo)
mon_diff <- function(d1, d2){
12 * as.numeric((as.yearmon(d1) - as.yearmon(d2)))
}
sum_reset_at <- function(thresh) {
function(x) {
accumulate(x, ~if_else(.x>=thresh, .y, .x+.y))
}
}
test %>%
group_by(patient, numsermed) %>%
mutate(diff1 = mon_diff(date, lag(date)),
diff1 = if_else(is.na(diff1), 300, diff1)) %>%
mutate(diff2 = sum_reset_at(3)(diff1)) %>%
filter(diff2 >= 3) %>%
select(-contains('diff'))
test
patient numsermed date
<dbl> <chr> <date>
1 1 numser1 2020-01-08
2 2 numser2 2015-01-02
3 2 numser2 2019-12-12
4 2 numser2 2020-03-15
5 2 numser3 2020-03-13
6 3 numser3 2020-01-22
7 4 numser4 2018-01-02
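As for handling NA more directly, here is a hedged sketch of such a modification (sum_reset_at_na is a hypothetical name; it restarts the window after an NA, and the filter then keeps NA rows explicitly instead of recoding them to 300):

sum_reset_at_na <- function(thresh) {
  function(x) {
    # an NA accumulator means "no previous kept date": start a fresh window
    accumulate(x, ~ if (is.na(.x) || .x >= thresh) .y else .x + .y)
  }
}
# ...then: mutate(diff2 = sum_reset_at_na(3)(diff1)) %>%
#          filter(is.na(diff2) | diff2 >= 3)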
Another option using findInterval to group:
library(data.table)
DT[, d := as.integer(date)]
DT[DT[, g := findInterval(d, seq(d[1L], d[.N]+90L, by=90L)), .(patient, numsermed)][,
.I[1L], .(patient, numsermed, g)]$V1]
output:
patient numsermed date d g
1: 1 numser1 2020-01-08 18269 1
2: 2 numser2 2015-01-02 16437 1
3: 2 numser2 2019-12-12 18242 21
4: 2 numser2 2020-03-15 18336 22
5: 2 numser3 2020-03-13 18334 1
6: 3 numser3 2020-01-22 18283 1
7: 4 numser4 2018-01-02 17533 1
If you have many groups of patient and numsermed, Ben's solution using a rolling join will be faster. Here is another way of coding the rolling join, by chaining:
DT[, .(patient, numsermed, date=date+90L)][
DT, on=.NATURAL, roll=-Inf, .(patient, numsermed, x.date, i.date)][,
.(date=i.date[1L]), .(patient, numsermed, x.date)][,
x.date := NULL][]
Or more succinctly:
DT[, c("rn", "qtrago") := .(.I, date - 90L)]
DT[DT[DT, on=.(patient, numsermed, date=qtrago), roll=-Inf, unique(rn)]]
data:
library(data.table)
DT <- fread("patient numsermed date
1 numser1 2020-01-08
2 numser2 2015-01-02
2 numser2 2019-12-12
2 numser2 2020-01-05
2 numser2 2020-01-08
2 numser2 2020-01-20
2 numser2 2020-03-15
2 numser2 2020-03-18
2 numser3 2020-03-13
2 numser3 2020-03-18
3 numser3 2020-01-22
4 numser4 2018-01-02")
DT[, date := as.IDate(date, format="%Y-%m-%d")]
I have a table that looks like this:
user_id timestamp
aa 2018-01-01 12:01 UTC
ab 2018-01-01 05:01 UTC
bb 2018-06-01 09:01 UTC
bc 2018-03-03 23:01 UTC
cc 2018-01-02 11:01 UTC
I have another table that has every week in 2018.
week_id week_start week_end
1 2018-01-01 2018-01-07
2 2018-01-08 2018-01-15
3 2018-01-16 2018-01-23
4 2018-01-23 2018-01-30
... ... ...
Assume the week_start is a Monday and week_end is a Sunday.
I'd like to do two things. I'd first like to join the week_id to the first table and then I'd like to assign a day to each of the timestamps. My output would look like this:
user_id timestamp week_id day_of_week
aa 2018-01-01 12:01 UTC 1 Monday
ab 2018-01-02 05:01 UTC 1 Tuesday
bb 2018-01-13 09:01 UTC 2 Friday
bc 2018-01-28 23:01 UTC 4 Friday
cc 2018-01-06 11:01 UTC 1 Saturday
In Excel I could easily do this with a vlookup. My main interest is to learn how to join tables in cases like this. For that reason, I won't accept answers that use the weekday function.
Here are both of the tables in a more accessible format.
user_id <- c("aa", "ab", "bb", "bc", "cc")
timestamp <- c("2018-01-01 12:01", "2018-01-01 05:01", "2018-06-01 09:01", "2018-03-03 23:01", "2018-01-02 11:01")
week_id <- seq(1,52)
week_start <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), 7)
week_end <- week_start + 6
week_start <- week_start[1:52]
week_end <- week_end[1:52]
table1 <- data.frame(user_id, timestamp)
table2 <- data.frame(week_id, week_start, week_end)
Using SQL, one can join two tables on a range like this. This seems the most elegant solution, expressing our intent directly, but we also provide some alternatives further below.
library(sqldf)
DF1$date <- as.Date(DF1$timestamp)
sqldf("select *
from DF1 a
left join DF2 b on date between week_start and week_end")
giving:
user_id timestamp date week_id week_start week_end
1 aa 2018-01-01 12:01:00 2018-01-01 1 2018-01-01 2018-01-07
2 ab 2018-01-01 05:01:00 2018-01-01 1 2018-01-01 2018-01-07
3 bb 2018-06-01 09:01:00 2018-06-01 NA <NA> <NA>
4 bc 2018-03-03 23:01:00 2018-03-04 NA <NA> <NA>
5 cc 2018-01-02 11:01:00 2018-01-02 1 2018-01-01 2018-01-07
dplyr
In a comment the poster asked whether it could be done in dplyr. It can't be done directly, since dplyr does not support complex joins, but a workaround is to do a full cross join of the two data frames, which gives rise to an nrow(DF1) * nrow(DF2) intermediate result, and then filter it down. dplyr does not directly support cross joins, but we can simulate one by doing a full join on an identical dummy constant column appended to both data frames. Since we actually need a right join here to add back the unmatched rows, we do a final right join with the original DF1 data frame. Obviously this is entirely impractical for sufficiently large inputs, but for the small input here we can do it. If it were known that every row in DF1 has a match in DF2, the right_join at the end could be omitted.
DF1 %>%
mutate(date = as.Date(timestamp), dummy = 1) %>%
full_join(DF2 %>% mutate(dummy = 1)) %>%
filter(date >= week_start & date <= week_end) %>%
select(-dummy) %>%
right_join(DF1)
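A hedged aside: newer dplyr versions (1.1.0+) added non-equi joins via join_by, which makes this range join direct and avoids the cross-join workaround. A sketch, assuming dplyr >= 1.1.0:

library(dplyr)
# non-equi join: match each date to the week whose range contains it
DF1 %>%
  mutate(date = as.Date(timestamp)) %>%
  left_join(DF2, by = join_by(between(date, week_start, week_end)))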
R Base
findix finds the index in DF2 corresponding to a date d. We then sapply it over the dates corresponding to the rows of DF1 and put DF1 and the corresponding DF2 rows together.
findix <- function(d) c(which(d >= DF2$week_start & d <= DF2$week_end), NA)[1]
cbind(DF1, DF2[sapply(as.Date(DF1$timestamp), findix), ])
Note
The input data in reproducible form used is:
Lines1 <- "user_id timestamp
aa 2018-01-01 12:01 UTC
ab 2018-01-01 05:01 UTC
bb 2018-06-01 09:01 UTC
bc 2018-03-03 23:01 UTC
cc 2018-01-02 11:01 UTC"
DF1 <- read.csv(text = gsub(" +", ",", Lines1), strip.white = TRUE)
DF1$timestamp <- as.POSIXct(DF1$timestamp)
Lines2 <- "week_id week_start week_end
1 2018-01-01 2018-01-07
2 2018-01-08 2018-01-15
3 2018-01-16 2018-01-23
4 2018-01-23 2018-01-30"
DF2 <- read.table(text = Lines2, header = TRUE)
DF2$week_start <- as.Date(DF2$week_start)
DF2$week_end <- as.Date(DF2$week_end)
This is a case for the fuzzyjoin package. With the match_fun argument we can specify conditions for each column; in this case table1$date >= table2$week_start and table1$date <= table2$week_end.
library(fuzzyjoin)
library(lubridate)
table1$date <- as.Date(table1$timestamp)
fuzzy_left_join(table1, table2,
by = c("date" = "week_start", "date" = "week_end"),
match_fun = list(`>=`, `<=`)) %>%
mutate(day_of_week = wday(date, label = TRUE)) %>%
select(user_id, timestamp, week_id, day_of_week)
user_id timestamp week_id day_of_week
1 aa 2018-01-01 12:01 1 Mo
2 ab 2018-01-01 05:01 1 Mo
3 bb 2018-06-01 09:01 22 Fr
4 bc 2018-03-03 23:01 9 Sa
5 cc 2018-01-02 11:01 1 Di
I'm also a smartass because I didn't use the weekday function but wday from the lubridate package.
I have a cohort of data with multiple visits per person and want to group visits under a common ID based on person # and the time of the visit. The condition is: if an admission is within 24 hours of the previous discharge, then those visits should have the same ID.
Sample of what the data looks like:
dat <- data.frame(
  Person_ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"), format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"), format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
This is what I tried as a start:
dat1 <- dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))
dat1$ID <- as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))
This is almost right, except it does not give the correct ID for visit 7 (person #3): there are three visits, the second visit is entirely within the first, and the third starts within 24 hours of the first but not the second.
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(tidyr)
library(dplyr)      # for the pipe, mutate, lag, if_else
library(lubridate)  # for ddays()
dat1 <- dat %>%
# Gather into long format with event type in one column, timestamp in another
gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
# I want discharges to have an effect up to 24 hours later. Sort using that.
mutate(time_adj = if_else(event == "Discharge_Date_Time",
time + ddays(1),
time)) %>%
arrange(Person_ID, time_adj) %>%
# For each Person_ID, track cumulative admissions. 0 means a visit has completed.
# (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
group_by(Person_ID) %>%
mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
mutate(admissions_count = cumsum(admissions)) %>%
ungroup() %>%
# Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
# completed visit (ie admissions_count was zero).
mutate(Hosp_ID_chg = 1 *
(Person_ID != lag(Person_ID, default = 1) | # (a)
lag(admissions_count, default = 1) == 0), # (b)
Hosp_ID = cumsum(Hosp_ID_chg)) %>%
# Spread back into original format
select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
foverlaps( dat,
dat,
mult = "first",
type = "any",
nomatch = 0L
),
Visit_ID )[, list( Visit_ID = i.Visit_ID,
Hosp_ID = .GRP ),
by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00
I have a data frame with three columns: employee ID, start date (ydm) and end date (ydm). My objective is to create a second data frame with two columns, employee ID and date, built from the first such that it takes each ID from the first data frame and the date column holds all the months between that employee's start date and end date. In simple words, I want to expand the first data frame into months according to each employee's start and end dates.
I actually managed to write working code using a for loop. The problem is that it is very slow, and I have read that one should avoid loops in R. Is there a way to do the same thing much quicker?
An example of my data frame and code is below:
# Creating the data frame
library(lubridate)
a <- data.frame(employeeid = c('a', 'b', 'c'),
                StartDate = c('2018-1-1', '2018-1-5', '2018-11-2'),
                EndDate = c('2018-1-3', '2018-1-9', '2018-1-8'),
                stringsAsFactors = FALSE)
a$StartDate <- ydm(a$StartDate)
a$EndDate <- ydm(a$EndDate)
#second empty data frame
a1 <- a
a1 <- a1[0, 1:2]
#my code starts
r <- 1
r.1 <- 1
for (id in a$employeeid) {
  for (i in format(seq(a[r, 2], a[r, 3], by = "month"), "%Y-%m-%d")) {
    a1[r.1, 1] <- a[r, 1]
    a1[r.1, 2] <- i
    r.1 <- r.1 + 1
  }
  r <- r + 1
}
This produces the result I want; I just want the same result a bit quicker.
Almost a one-liner with tidyverse:
> result
# A tibble: 12 x 2
employeeid date
<chr> <date>
1 a 2018-01-01
2 a 2018-02-01
3 a 2018-03-01
4 b 2018-05-01
5 b 2018-06-01
6 b 2018-07-01
7 b 2018-08-01
8 b 2018-09-01
9 c 2018-11-01
10 c 2018-12-01
11 c 2019-01-01
12 c 2019-02-01
Code
result <- df %>%
group_by(employeeid) %>%
summarise(date = list(seq(StartDate,
EndDate,
by = "month"))) %>%
unnest()
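A small version note (my aside, not part of the original answer): tidyr 1.0+ expects the columns to unnest to be named explicitly, so there the last step would be written as:

unnest(cols = date)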
Data
library(tidyverse)
library(lubridate)
df <- data.frame(employeeid = c('a', 'b', 'c'),
StartDate = ymd(c('2018-1-1', '2018-5-1', '2018-11-1')),
EndDate = ymd(c('2018-3-1', '2018-9-1', '2019-02-1')),
stringsAsFactors = FALSE)
I'd try to solve this using apply and a custom function that calculates the difference between end and start.
I'm not sure exactly what your desired output looks like, but in the following example all months between start and end are pasted into one string.
library(lubridate)
# Creating the data frame
a <- data.frame(employeeid = c('a', 'b', 'c'),
                StartDate = c('2018-1-1', '2018-1-5', '2018-11-2'),
                EndDate = c('2018-2-3', '2019-1-9', '2020-1-8'),
                stringsAsFactors = FALSE)
a$StartDate <- ymd(a$StartDate)
a$EndDate <- ymd(a$EndDate)
# map month numbers to abbreviated month names
month_names = month.abb[1:12]
month_dif = function(dates) # expects a length-2 vector: c(start, end)
{
  start = dates[1] # first element is the start date
  end = dates[2]   # second element is the end date
  start_month = month(start)
  end_month = month(end)
  start_year = year(start)
  end_year = year(end)
  year_dif = end_year - start_year
  if (year_dif == 0) { # start and end in the same year: months run from start to end
    return(paste(month_names[start_month:end_month], collapse = ", "))
  } else { # years differ: start to December, any full years in between, then January to end
    paste(c(month_names[start_month:12],
            rep(month_names, year_dif - 1),
            month_names[1:end_month]), collapse = ", ")
  }
}
apply(a[2:3], 1, month_dif)
output:
> apply(a[2:3], 1, month_dif)
[1] "Jan, Feb"
[2] "Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
[3] "Nov, Dec, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec, Jan"
You can use a combination of apply and do.call:
out_apply_list <- apply(X=a, MARGIN=1,
FUN=function(x) {
data.frame(id= x[1],
date=seq(from = as.Date(x[2], "%Y-%d-%m"),
to = as.Date(x[3], "%Y-%d-%m"),
by = "month"),
row.names = NULL)
})
df <- do.call(what = rbind, args = out_apply_list)
which gives you the following output:
> df
id date
1 a 2018-01-01
2 a 2018-02-01
3 a 2018-03-01
4 b 2018-05-01
5 b 2018-06-01
6 b 2018-07-01
7 b 2018-08-01
8 b 2018-09-01
9 c 2018-02-11
10 c 2018-03-11
11 c 2018-04-11
12 c 2018-05-11
13 c 2018-06-11
14 c 2018-07-11
For the sake of completeness, here is a concise one-liner with data.table:
library(data.table)
setDT(a)[, .(StartDate = seq(StartDate, EndDate, by = "month")), by = employeeid]
employeeid StartDate
1: a 2018-01-01
2: a 2018-02-01
3: a 2018-03-01
4: b 2018-05-01
5: b 2018-06-01
6: b 2018-07-01
7: b 2018-08-01
8: b 2018-09-01
9: c 2018-02-11
10: c 2018-03-11
11: c 2018-04-11
12: c 2018-05-11
13: c 2018-06-11
14: c 2018-07-11