Group IDs by dates - r

I have a data table of three columns id, dtstart, dtend. For example:
id start end
1 01/01/2015 31/01/2015
1 02/02/2015 28/02/2015
1 01/07/2016 31/07/2016
1 01/08/2016 31/08/2016
2 01/03/2015 31/03/2015
2 01/04/2015 30/04/2015
2 01/02/2016 28/02/2016
2 01/03/2016 31/03/2016
...
I need to create another data table grouped by id with the same columns but the new start date is the minimum date in the original start date and the new end date is the maximum date in the original dtend.
When there is a break of more then one day between an end date and the next start date then it should be grouped separately.
For example for the above the new table would be:
id start end
1 01/01/2015 28/02/2015
1 01/07/2016 31/08/2016
2 01/03/2015 30/04/2016
2 01/02/2016 31/03/2016
...
Do I need a for loop or is there a more efficient way (data table grouping for example)? The table is over 20 million rows with 100k+ unique ids.
Cheers
Andrew

This can be done using dplyr
dt.new <- dt %>%
arrange(id, start, end) %>%
mutate(gr = cumsum(lag(id, default = min(id)) != id |
as.numeric(difftime(start, lag(end, default = first(start)), units = 'days')) > 1)) %>%
group_by(id, gr) %>%
summarise(start = first(start),
end = last(end))
The result is:
Source: local data frame [6 x 4]
Groups: id [?]
id gr start end
<int> <int> <dttm> <dttm>
1 1 0 2015-01-01 2015-01-31
2 1 1 2015-02-02 2015-02-28
3 1 2 2016-07-01 2016-08-31
4 2 3 2015-03-01 2015-04-30
5 2 4 2016-02-01 2016-02-28
6 2 5 2016-03-01 2016-03-31
This works and doesn't match your output because you requested a one day margin (if you want two day margins then switch from >1 to >2), and 2016 was a leap year, which is in R's internal calendar. So the margin between 2/28/2016 and 3/1/2016 is 2 days.

Thanks again #akash87
For example row 6 below is within a month so it should still return one row for id 1 from 1/02/2006 to 30/09/2006 but it breaks into two, the first from 01/02/2006 to 12/06/2006 and then from 01/07/2006 to 30/09/2016
id dtstart dtend
1 01/02/2006 28/02/2006
1 01/03/2006 31/03/2006
1 01/04/2006 30/04/2006
1 01/05/2006 31/05/2006
1 01/06/2006 30/06/2006
1 10/06/2006 12/06/2006
1 01/07/2006 31/07/2006
1 01/08/2006 31/08/2006
1 01/09/2006 30/09/2006
2 01/04/2006 30/04/2006
2 01/05/2006 31/05/2006
2 01/09/2006 30/09/2006
2 01/10/2006 31/10/2006
So instead of returning
id start end
1 01/02/2006 30/09/2006
2 01/04/2006 31/05/2006
2 01/09/2006 31/10/2006
We have
id start end
1 01/02/2006 12/06/2006
1 01/07/2006 30/09/2006
2 01/04/2006 31/05/2006
2 01/09/2006 31/10/2006
Andrew

Related

Break up rows representing long time intervals into multiple rows

I have a dataframe (tibble) with multiple rows, each row contains an IDNR, a start date, an end date and an exposure status. The IDNR is a character variable, the start and end date are date variables and the exposure status is a numerical variable. This is what the top 3 rows look like:
# A tibble: 48,266 x 4
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-07-01 0
2 2 2017-10-30 2018-07-01 0
3 3 2016-02-11 2016-12-03 1
# ... with 48,256 more rows
In order to do a time-varying cox regression, I want to split up the rows into 90 day parts, while maintaining the start and end date. Here is an example of what I would like to achieve. What happens, is that the new end date is start + 90 days, and a new row is created. This row has the start date which is the same as the end date from the previous row. If the time between start and end is now less than 90 days, this is fine (as for IDNR 1 and 3), however, for IDNR 2 the time is still exceeding 90 days. Therefore a third row needs to be added.
# A tibble: 48,266 x 4
# Groups: IDNR [33,240]
IDNR start end exposure
<chr> <date> <date> <dbl>
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-08-09 1
7 3 2016-08-09 2016-12-03 1
I'm relatively new to coding in R, but I've found dplyr to be very useful so far. So, if someone knows a solution using dplyr I would really appreciate that.
Thanks in advance!
Here you go:
Using df as your data frame:
df = data.frame(IDNR = 1:3,
start = c("2018-02-15","2017-10-30","2016-02-11"),
end = c("2018-07-01","2018-07-01","2016-12-03"),
exposure = c(0,0,1))
Do:
library(lubridate)
newDF = apply(df, 1, function(x){
newStart = seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)
newEnd = c(seq(from = ymd(x["start"]), to = ymd(x["end"]), by = 90)[-1], ymd(x["end"]))
d = data.frame(IDNR = rep(x["IDNR"], length(newStart)),
start = newStart,
end = newEnd,
exposure = rep(x["exposure"], length(newStart)))
})
newDF = do.call(rbind, newDF)
newDF = newDF[newDF$start != newDF$end,]
Result:
> newDF
IDNR start end exposure
1 1 2018-02-15 2018-05-16 0
2 1 2018-05-16 2018-07-01 0
3 2 2017-10-30 2018-01-28 0
4 2 2018-01-28 2018-04-28 0
5 2 2018-04-28 2018-07-01 0
6 3 2016-02-11 2016-05-11 1
7 3 2016-05-11 2016-08-09 1
8 3 2016-08-09 2016-11-07 1
9 3 2016-11-07 2016-12-03 1
What this does is create a sequence of days from start to end by 90 days and create a smaller data frame with them along with the IDNR and exposure. This apply will return a list of data frames that you can join together using do.call. The last line removes lines that have the same start and end date

"split" dataframe per month based on columns Start/End

I need to "split" a 15 million line df of the following form:
library(lubridate)
dateStart <- c(lubridate::ymd("2010-01-01"))
dateEnd <- c(lubridate::ymd("2010-03-06"))
length <- c(65)
Amt <- c(348.80)
df1 <- data.frame(dateStart, dateEnd, length, Amt)
df1
# dateStart dateEnd length Amt
# 1 2010-01-01 2010-03-06 65 348.8
into something like:
dateStart dateEnd length Amt
1 2010-01-01 2010-01-31 31 166.35
2 2010-02-01 2010-02-28 28 150.55
3 2010-03-01 2010-03-06 6 32.19
Where length is the number of days and Amt is the pro-rata amount for the number of days. Does anybody know how to do this? Someone mentioned the padr package to me but I do not know how to use it for this specific purpose.
Thank you in advance
I'm going to assume you have an some sort of unique id field in your data set so you have a unique record. Otherwise this is not going to work. I also added 1 extra record so we can see everything works on multiple records.
Data:
library(lubridate)
id <- c(1:2) # added id field needed for unique record and needed for grouping
dateStart <- c(lubridate::ymd("2010-01-01", "2011-01-09"))
dateEnd <- c(lubridate::ymd("2010-03-06", "2011-04-09"))
length <- c(65, 91)
Amt <- c(348.80, 468.70)
df1 <- data.frame(id , dateStart, dateEnd, length, Amt)
First create a data.frame which has the id and missing months. We need dplyr, tidyr and padr. Create groups per unique id, gather the dates so we have start and end date in 1 column. For padr to extend months we first need to thicken the data.frame. Get rid of not needed columns and fill in the missing months.
library(dplyr)
library(tidyr)
library(padr)
#create last_day function for later use
last_day <- function(date) {
ceiling_date(date, "month") - days(1)
}
dates <- df1 %>%
select(id, dateStart, dateEnd) %>%
group_by(id) %>%
gather(names, dates, -id) %>%
arrange(id, dates) %>%
thicken(interval = "month") %>% # need to thicken first for month interval
select(-c(names, dates)) %>%
pad(interval = "month")
dates
# A tibble: 7 x 2
# Groups: id [2]
id dates_month
<int> <date>
1 1 2010-01-01
2 1 2010-02-01
3 1 2010-03-01
4 2 2011-01-01
5 2 2011-02-01
6 2 2011-03-01
7 2 2011-04-01
Next join back the data to the original data.frame
df_extended <- inner_join(dates, df1, by = "id")
df_extended
# A tibble: 7 x 6
# Groups: id [2]
id dates_month dateStart dateEnd length Amt
<int> <date> <date> <date> <dbl> <dbl>
1 1 2010-01-01 2010-01-01 2010-03-06 65 349.
2 1 2010-02-01 2010-01-01 2010-03-06 65 349.
3 1 2010-03-01 2010-01-01 2010-03-06 65 349.
4 2 2011-01-01 2011-01-09 2011-04-09 91 469.
5 2 2011-02-01 2011-01-09 2011-04-09 91 469.
6 2 2011-03-01 2011-01-09 2011-04-09 91 469.
7 2 2011-04-01 2011-01-09 2011-04-09 91 469.
Now to get to the end result. need to use case_when, ifelse doesn't return the data in date format for some reason. The case_when replace set the correct start and end dates (I assume you need the exact start date, not the first of the month, otherwise adjust code to use dates_month instead.) I create an amount per day (amt_pd) variable to be able to multiply this with the number of days in the month to get the pro-rata amount for the number of days in the month.
df_end <- df_extended %>%
mutate(dateEnd = case_when(last_day(dates_month) <= dateEnd ~ last_day(dates_month),
TRUE ~ dateEnd),
dateStart = case_when(dates_month <= dateStart ~ dateStart,
TRUE ~ dates_month),
amt_pd = Amt / length,
length = dateEnd - dateStart + 1,
Amt = amt_pd * length) %>%
select(-c(dates_month, amt_pd)) # get rid of not needed columns
df_end
# A tibble: 7 x 5
# Groups: id [2]
id dateStart dateEnd length Amt
<int> <date> <date> <time> <time>
1 1 2010-01-01 2010-01-31 31 166.350769230769
2 1 2010-02-01 2010-02-28 28 150.252307692308
3 1 2010-03-01 2010-03-06 6 32.1969230769231
4 2 2011-01-09 2011-01-31 23 118.462637362637
5 2 2011-02-01 2011-02-28 28 144.215384615385
6 2 2011-03-01 2011-03-31 31 159.667032967033
7 2 2011-04-01 2011-04-09 9 46.354945054945
All of this could be done in one go. But if you have 15 million rows it might be better to see if the intermediate steps work. Also note that pad has a break_above option.
This is a numeric value that indicates the number of rows in millions
above which the function will break. Safety net for situations where
the interval is different than expected and padding yields a very
large dataframe, possibly overflowing memory.

R Sum rows by hourly rate

I'm getting started with R, so please bear with me
For example, I have this data.table (or data.frame) object :
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the "Time" column to hourly and sum the count_starts and count_ends based on the hourly time and station :
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE :
I converted the type of transactions$Time to POSIXct, so the xts package should be able to use the timeseries directly.
Using base R, we can still do the above. Only that the hour will be one less for all of them:
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
dat$Time=cut(strptime(dat$Time,"%m/%d/%Y %H:%M"),"hour")
aggregate(.~Time+Station,dat,sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
A solution using dplyr and lubridate. The key is to use ceiling_date to convert the date time column to hourly time-step, and then group and summarize the data.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Time = mdy_hm(Time)) %>%
mutate(Time = ceiling_date(Time, unit = "hour")) %>%
group_by(Time, Station) %>%
summarise(`sum(count_starts)` = sum(count_starts),
`sum(count_ends)` = sum(count_ends)) %>%
ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm is the function to convert the string to date-time class. It means "month-day-year hour-minute", which depends on the structure of the string. ceiling_date rounds a date-time object up based on the unit specified. group_by is to group the variable. summarise is to conduct summary operation.
There are basically two things required:
1) round of the Time to nearest 1 hour window:
library(data.table)
library(lubridate)
data=data.table(Time=c('01/01/2015 00:30','01/01/2015 00:40','01/01/2015 00:55','01/01/2015 01:17','01/01/2015 01:37'),Station=c('A','A','B','A','A'),count_starts=c(2,2,1,3,1),count_ends=c(3,1,1,1,1))
data[,Time_conv:=as.POSIXct(strptime(Time,'%d/%m/%Y %H:%M'))]
data[,Time_round:=floor_date(Time_conv,unit="1 hour")]
2) List the data table obtained above to get the desired result:
New_data=data[,list(count_starts_sum=sum(count_starts),count_ends_sum=sum(count_ends)),by='Time_round']

Nested for loops for date differences

I am new to R and I am trying to calculate date differences from a baseline for every subject. I know how to calculate the day differences using difftime but I am having trouble doing it in a loop for every subject. Any help would be greatly appreciated.
Basically I want to go from:
ID DATE
1 1.1.2015
1 1.1.2016
2 1.1.2017
3 1.1.2017
3 1.1.2016
3 1.1.2017
to:
ID DATE DATEDIFF
1 1.1.2015 0
1 1.1.2016 365
2 1.1.2017 0
3 1.1.2015 0
3 1.1.2016 365
3 1.1.2017 730
Use lubridate to parse the dates and dplyr to calculate the new column:
library(lubridate)
df <- data.frame(
id = c(1,1,2,3,3,3),
date = c('1.1.2015','1.1.2016','1.1.2017','1.1.2015','1.1.2016','1.1.2017'))
# parse dates as DayMonthYear
df$date <- dmy(df$date)
# calculate the difference to the oldest date in each group
# mutate is called once for each group, so you could use an
# arbitrary expression to calculate your new column only with
# the data for this group
df %>% group_by(id) %>% mutate(datediff = date-min(date))
Result:
id date datediff
1 1 2015-01-01 0 days
2 1 2016-01-01 365 days
3 2 2017-01-01 0 days
4 3 2015-01-01 0 days
5 3 2016-01-01 365 days
6 3 2017-01-01 731 days

Count event occurrence and assign it to individuals according to date and place of interview

I have two dataframes in R: A.df and B.df. The first contains N rows where each row is an event that happened in a certain date and place.
The second is a list of individuals that have been interviewed in a certain date and place.
For each individual, I would like to count the number of events that happened within a certain timeframe before the interview date in the same location of the individual's place of interview.
Let's say that the time frame is x days before the date of interview, and that I have computed that date and stored in the variable xdaysbefore.
Here below how the data frames look like
A.df
#Event Date Place
1 2015-05-01 1
2 2015-03-11 1
3 2015-07-04 2
4 2015-05-10 3
B.df
#Individual Date of Interview Place xdaysbefore
1 2016-07-11 1 2014-09-11
2 2016-05-07 3 2014-07-04
3 2016-08-09 2 2014-03-22
4 2016-01-10 3 2014-09-17
Note that Date, Date of Interview and xdaysbefore are all in Date R class
How can I count for each individual in B.df the events happened within the time frame Date of Interview - xdaysbefore according to the place in which the event has happened and the individual place of interview.
What I would expect in B.df would look like this:
B.df
#Individual Date of Interview Place xdaysbefore CountedEvents
1 2016-07-11 1 2014-09-11 2
2 2016-05-07 3 2014-07-04 1
3 2016-08-09 2 2014-03-22 1
4 2016-01-10 3 2014-09-17 1
where CountedEvents are the number of events happened in the time frame Date of Interview - xdaysbefore and in the same location where the individual i has been interviewed.
You can use apply on every row of B.df.
Take a subset of A.df where places are equal. Check if the Date in A.df is within the range of Date_of_Interview and xdaysbefore
B.df$CountedEvents <- apply(B.df, 1, function(x) {
temp = A.df[A.df$Place %in% x[3],]
length(temp$Date < as.Date(x[2]) & temp$Date > as.Date(x[4]))
})
B.df
# Individual Date_of_Interview Place xdaysbefore CountedEvents
#1 1 2016-07-11 1 2014-09-11 2
#2 2 2016-05-07 3 2014-07-04 1
#3 3 2016-08-09 2 2014-03-22 1
#4 4 2016-01-10 3 2014-09-17 1
EDIT
If you want to access columns with names instead of indexes, you can use
apply(B.df, 1, function(x) {
temp = A.df[A.df$Place %in% x["Place"],]
length(temp$Date < as.Date(x["Date_of_Interview"]) &
temp$Date > as.Date(x["xdaysbefore"]))
})
You can achieve that by using a combination of merge and aggregate:
# merge into a new dataset
AB <- merge(A, B, by = 'Place', all = TRUE)
# create a logical variable which indicates whether 'Date' falls within the range
AB$count <- AB$xdaysbefore < AB$Date & AB$Date_of_Interview > AB$Date
# aggregate into a count varaible
aggregate(count ~ Individual + Date_of_Interview + xdaysbefore, AB, sum)
which gives:
Individual Date_of_Interview xdaysbefore count
1 3 2016-08-09 2014-03-22 1
2 2 2016-05-07 2014-07-04 1
3 1 2016-07-11 2014-09-11 2
4 4 2016-01-10 2014-09-17 1
Alternatively you could use the new non-equi join possibility from the development version of the data.table package:
library(data.table)
# convert the dataframes to data.table's (which are enhanced dataframes)
setDT(A)
setDT(B)
# join and count
A[B, on = .(Place, Date < Date_of_Interview, Date > xdaysbefore)
][, .(count = .N), .(Individual, Place, Date_of_Interview = Date, xdaysbefore = Date.1)]
which gives:
Individual Place Date_of_Interview xdaysbefore count
1: 1 1 2016-07-11 2014-09-11 2
2: 2 3 2016-05-07 2014-07-04 1
3: 3 2 2016-08-09 2014-03-22 1
4: 4 3 2016-01-10 2014-09-17 1

Resources