Nested for loops for date differences - r

I am new to R and I am trying to calculate date differences from a baseline for every subject. I know how to calculate the day differences using difftime but I am having trouble doing it in a loop for every subject. Any help would be greatly appreciated.
Basically I want to go from:
ID DATE
1 1.1.2015
1 1.1.2016
2 1.1.2017
3 1.1.2015
3 1.1.2016
3 1.1.2017
to:
ID DATE DATEDIFF
1 1.1.2015 0
1 1.1.2016 365
2 1.1.2017 0
3 1.1.2015 0
3 1.1.2016 365
3 1.1.2017 731

Use lubridate to parse the dates and dplyr to calculate the new column:
library(dplyr)
library(lubridate)
df <- data.frame(
id = c(1,1,2,3,3,3),
date = c('1.1.2015','1.1.2016','1.1.2017','1.1.2015','1.1.2016','1.1.2017'))
# parse dates as DayMonthYear
df$date <- dmy(df$date)
# calculate the difference to the oldest date in each group
# mutate is called once for each group, so you could use an
# arbitrary expression to calculate your new column only with
# the data for this group
df %>% group_by(id) %>% mutate(datediff = date-min(date))
Result:
id date datediff
1 1 2015-01-01 0 days
2 1 2016-01-01 365 days
3 2 2017-01-01 0 days
4 3 2015-01-01 0 days
5 3 2016-01-01 365 days
6 3 2017-01-01 731 days
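If you would rather avoid extra packages, the same per-group baseline can be sketched in base R with ave(), which applies a function within each id and returns a vector aligned with the original rows. This is only an illustration on the sample data above; the datediff column name simply mirrors the dplyr answer.

```r
# Sample data from the question, parsed as day.month.year
df <- data.frame(
  id = c(1, 1, 2, 3, 3, 3),
  date = as.Date(c("1.1.2015", "1.1.2016", "1.1.2017",
                   "1.1.2015", "1.1.2016", "1.1.2017"),
                 format = "%d.%m.%Y")
)

# Dates are stored as days since 1970-01-01, so work on the numeric values.
# ave() returns the earliest date of each id, repeated for every row of that id.
baseline <- ave(as.numeric(df$date), df$id, FUN = min)
df$datediff <- as.numeric(df$date) - baseline
df
```

The last row comes out as 731 rather than 730 because 2016 was a leap year.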

Related

Elegant way to get no of days to prev and next year using R?

I have an R data frame like as shown below
test_df <- data.frame("subbject_id" = c(1,2,3,4,5),
"date_1" = c("01/01/2003","12/31/2007","12/30/2008","01/02/2007","01/01/2007"))
I would like to get the no of days to prev year and next year.
I was trying something like the below
library(lubridate)
test_df$current_yr = year(mdy(test_df$date_1))
prev_yr = test_df$current_yr - 1 #(subtract 1 to get the prev year)
next_yr = test_df$current_yr + 1 #(add 1 to get the next year)
days_to_prev_yr = days_in_year(current_yr) # this doesn't work
In Python, I know we have something called day of the year and offsets.YearEnd(0) etc., which I learned about from this post. But can someone help me with how to do this using R?
I expect my output to be like as shown below
You can use floor_date and ceiling_date from lubridate to get the first and last days of the year, and then take the difference between each of them and date_1 to get days_to_previous_year and days_to_next_year.
library(dplyr)
library(lubridate)
test_df %>%
mutate(date_1 = mdy(date_1),
previous_year = floor_date(date_1, 'year'),
next_year = ceiling_date(date_1, 'year') - 1,
days_to_previous_year = as.integer(date_1 - previous_year),
days_to_next_year = as.integer(next_year - date_1)) %>%
select(-previous_year, -next_year)
# subbject_id date_1 days_to_previous_year days_to_next_year
#1 1 2003-01-01 0 364
#2 2 2007-12-31 364 0
#3 3 2008-12-30 364 1
#4 4 2007-01-02 1 363
#5 5 2007-01-01 0 364
One dplyr and lubridate option could be:
test_df %>%
mutate(date_1 = mdy(date_1),
days_to_prev_year = date_1 - mdy(paste0("01-01-", year(date_1))),
days_to_next_year = mdy(paste0("12-31-", year(date_1))) - date_1)
subbject_id date_1 days_to_prev_year days_to_next_year
1 1 2003-01-01 0 days 364 days
2 2 2007-12-31 364 days 0 days
3 3 2008-12-30 364 days 1 days
4 4 2007-01-02 1 days 363 days
5 5 2007-01-01 0 days 364 days
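For comparison, here is a minimal base R sketch of the same idea, assuming the dates are in month/day/year order as in the question. It builds the first and last day of each date's year with format() instead of lubridate; the variable names year_start and year_end are just illustrative.

```r
# Dates from the question, parsed as month/day/year
date_1 <- as.Date(c("01/01/2003", "12/31/2007", "12/30/2008",
                    "01/02/2007", "01/01/2007"),
                  format = "%m/%d/%Y")

# format() keeps the year and substitutes a fixed month and day
year_start <- as.Date(format(date_1, "%Y-01-01"))
year_end   <- as.Date(format(date_1, "%Y-12-31"))

# Subtracting Dates gives a difftime in days; convert to plain integers
days_to_prev_year <- as.integer(date_1 - year_start)
days_to_next_year <- as.integer(year_end - date_1)

data.frame(date_1, days_to_prev_year, days_to_next_year)
```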

R Sum rows by hourly rate

I'm getting started with R, so please bear with me
For example, I have this data.table (or data.frame) object :
Time Station count_starts count_ends
01/01/2015 00:30 A 2 3
01/01/2015 00:40 A 2 1
01/01/2015 00:55 B 1 1
01/01/2015 01:17 A 3 1
01/01/2015 01:37 A 1 1
My end goal is to group the "Time" column to hourly and sum the count_starts and count_ends based on the hourly time and station :
Time Station sum(count_starts) sum(count_ends)
01/01/2015 01:00 A 4 4
01/01/2015 01:00 B 1 1
01/01/2015 02:00 A 4 2
I did some research and found out that I should use the xts library.
Thanks for helping me out
UPDATE :
I converted the type of transactions$Time to POSIXct, so the xts package should be able to use the timeseries directly.
Using base R, we can still do the above. Only that the hour will be one less for all of them:
dat=read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
dat$Time=cut(strptime(dat$Time,"%m/%d/%Y %H:%M"),"hour")
aggregate(.~Time+Station,dat,sum)
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
2 2015-01-01 01:00:00 A 4 2
3 2015-01-01 00:00:00 B 1 1
You can use the order function to rearrange the table or even the sort.POSIXlt function:
m=aggregate(.~Time+Station,dat,sum)
m[order(m[,1]),]
Time Station count_starts count_ends
1 2015-01-01 00:00:00 A 4 4
3 2015-01-01 00:00:00 B 1 1
2 2015-01-01 01:00:00 A 4 2
A solution using dplyr and lubridate. The key is to use ceiling_date to convert the date time column to hourly time-step, and then group and summarize the data.
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Time = mdy_hm(Time)) %>%
mutate(Time = ceiling_date(Time, unit = "hour")) %>%
group_by(Time, Station) %>%
summarise(`sum(count_starts)` = sum(count_starts),
`sum(count_ends)` = sum(count_ends)) %>%
ungroup()
dt2
# # A tibble: 3 x 4
# Time Station `sum(count_starts)` `sum(count_ends)`
# <dttm> <chr> <int> <int>
# 1 2015-01-01 01:00:00 A 4 4
# 2 2015-01-01 01:00:00 B 1 1
# 3 2015-01-01 02:00:00 A 4 2
DATA
dt <- read.table(text = "Time Station count_starts count_ends
'01/01/2015 00:30' A 2 3
'01/01/2015 00:40' A 2 1
'01/01/2015 00:55' B 1 1
'01/01/2015 01:17' A 3 1
'01/01/2015 01:37' A 1 1",
header = TRUE, stringsAsFactors = FALSE)
Explanation
mdy_hm is the function to convert the string to date-time class. It means "month-day-year hour-minute", which depends on the structure of the string. ceiling_date rounds a date-time object up based on the unit specified. group_by is to group the variable. summarise is to conduct summary operation.
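If lubridate is not available, the rounding-up step can be approximated in base R by doing the arithmetic on seconds since the epoch. The sketch below assumes month/day/year input and a UTC time zone (both assumptions, since the question does not state them), and the Hour column name is hypothetical.

```r
dt <- data.frame(
  Time = c("01/01/2015 00:30", "01/01/2015 00:40", "01/01/2015 00:55",
           "01/01/2015 01:17", "01/01/2015 01:37"),
  Station = c("A", "A", "B", "A", "A"),
  count_starts = c(2, 2, 1, 3, 1),
  count_ends = c(3, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Parse the strings, then round each timestamp up to the next full hour
time_ct <- as.POSIXct(dt$Time, format = "%m/%d/%Y %H:%M", tz = "UTC")
hour_ct <- as.POSIXct(ceiling(as.numeric(time_ct) / 3600) * 3600,
                      origin = "1970-01-01", tz = "UTC")

# Use a plain text label as the grouping key
dt$Hour <- format(hour_ct, "%Y-%m-%d %H:%M")

# Sum both count columns per hour and station, then sort by hour
res <- aggregate(cbind(count_starts, count_ends) ~ Hour + Station,
                 data = dt, FUN = sum)
res <- res[order(res$Hour, res$Station), ]
res
```

Replacing ceiling() with floor() gives the behaviour of the base R answer, where every hour label is one less.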
There are basically two things required:
1) Round the Time down to the start of its 1-hour window:
library(data.table)
library(lubridate)
data=data.table(Time=c('01/01/2015 00:30','01/01/2015 00:40','01/01/2015 00:55','01/01/2015 01:17','01/01/2015 01:37'),Station=c('A','A','B','A','A'),count_starts=c(2,2,1,3,1),count_ends=c(3,1,1,1,1))
data[,Time_conv:=as.POSIXct(strptime(Time,'%d/%m/%Y %H:%M'))]
data[,Time_round:=floor_date(Time_conv,unit="1 hour")]
2) Aggregate the data table obtained above by hour and station to get the desired result:
New_data=data[,list(count_starts_sum=sum(count_starts),count_ends_sum=sum(count_ends)),by=c('Time_round','Station')]

Group IDs by dates

I have a data table of three columns id, dtstart, dtend. For example:
id start end
1 01/01/2015 31/01/2015
1 02/02/2015 28/02/2015
1 01/07/2016 31/07/2016
1 01/08/2016 31/08/2016
2 01/03/2015 31/03/2015
2 01/04/2015 30/04/2015
2 01/02/2016 28/02/2016
2 01/03/2016 31/03/2016
...
I need to create another data table grouped by id with the same columns but the new start date is the minimum date in the original start date and the new end date is the maximum date in the original dtend.
When there is a break of more than one day between an end date and the next start date then it should be grouped separately.
For example for the above the new table would be:
id start end
1 01/01/2015 28/02/2015
1 01/07/2016 31/08/2016
2 01/03/2015 30/04/2015
2 01/02/2016 31/03/2016
...
Do I need a for loop or is there a more efficient way (data table grouping for example)? The table is over 20 million rows with 100k+ unique ids.
Cheers
Andrew
This can be done using dplyr:
library(dplyr)
dt.new <- dt %>%
arrange(id, start, end) %>%
mutate(gr = cumsum(lag(id, default = min(id)) != id |
as.numeric(difftime(start, lag(end, default = first(start)), units = 'days')) > 1)) %>%
group_by(id, gr) %>%
summarise(start = first(start),
end = last(end))
The result is:
Source: local data frame [6 x 4]
Groups: id [?]
id gr start end
<int> <int> <dttm> <dttm>
1 1 0 2015-01-01 2015-01-31
2 1 1 2015-02-02 2015-02-28
3 1 2 2016-07-01 2016-08-31
4 2 3 2015-03-01 2015-04-30
5 2 4 2016-02-01 2016-02-28
6 2 5 2016-03-01 2016-03-31
This works, but it doesn't match your expected output because you asked for a one-day margin (if you want a two-day margin, switch from > 1 to > 2). 2016 was a leap year, which R's calendar accounts for, so the gap between 2/28/2016 and 3/1/2016 is 2 days.
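The leap-year effect is easy to confirm directly with Date arithmetic. This small check is only an illustration; the variable names are made up for the example.

```r
# Gap between an end date of 28 February and a start date of 1 March
gap_2016 <- as.numeric(as.Date("2016-03-01") - as.Date("2016-02-28"))  # 2016 has a 29 February
gap_2015 <- as.numeric(as.Date("2015-03-01") - as.Date("2015-02-28"))  # 2015 does not

gap_2016  # 2
gap_2015  # 1
```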
Thanks again @akash87
For example, row 6 below falls inside a period that is already covered, so the result should still be a single row for id 1 from 01/02/2006 to 30/09/2006. Instead it breaks into two rows, the first from 01/02/2006 to 12/06/2006 and the second from 01/07/2006 to 30/09/2006.
id dtstart dtend
1 01/02/2006 28/02/2006
1 01/03/2006 31/03/2006
1 01/04/2006 30/04/2006
1 01/05/2006 31/05/2006
1 01/06/2006 30/06/2006
1 10/06/2006 12/06/2006
1 01/07/2006 31/07/2006
1 01/08/2006 31/08/2006
1 01/09/2006 30/09/2006
2 01/04/2006 30/04/2006
2 01/05/2006 31/05/2006
2 01/09/2006 30/09/2006
2 01/10/2006 31/10/2006
So instead of returning
id start end
1 01/02/2006 30/09/2006
2 01/04/2006 31/05/2006
2 01/09/2006 31/10/2006
We have
id start end
1 01/02/2006 12/06/2006
1 01/07/2006 30/09/2006
2 01/04/2006 31/05/2006
2 01/09/2006 31/10/2006
Andrew
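One possible way to handle a row that is nested inside an earlier period is to compare each start date with the latest end date seen so far for that id, rather than with the end date of the previous row only. The base R sketch below illustrates that idea on the sample data; merge_one_id is a hypothetical helper, and for 20 million rows the same logic would probably need to be ported to data.table or dplyr for speed.

```r
dt <- data.frame(
  id = c(rep(1, 9), rep(2, 4)),
  start = as.Date(c("01/02/2006", "01/03/2006", "01/04/2006", "01/05/2006",
                    "01/06/2006", "10/06/2006", "01/07/2006", "01/08/2006",
                    "01/09/2006", "01/04/2006", "01/05/2006", "01/09/2006",
                    "01/10/2006"), format = "%d/%m/%Y"),
  end = as.Date(c("28/02/2006", "31/03/2006", "30/04/2006", "31/05/2006",
                  "30/06/2006", "12/06/2006", "31/07/2006", "31/08/2006",
                  "30/09/2006", "30/04/2006", "31/05/2006", "30/09/2006",
                  "31/10/2006"), format = "%d/%m/%Y")
)
dt <- dt[order(dt$id, dt$start, dt$end), ]

# Hypothetical helper: collapse the rows of a single id into continuous periods
merge_one_id <- function(d) {
  s <- as.numeric(d$start)
  e <- as.numeric(d$end)
  # Latest end date among all earlier rows (-Inf for the first row)
  prev_max_end <- c(-Inf, cummax(e)[-length(e)])
  # A new period starts when the gap to that latest end exceeds one day
  block <- cumsum(s - prev_max_end > 1)
  data.frame(
    id = d$id[1],
    start = as.Date(as.numeric(tapply(s, block, min)), origin = "1970-01-01"),
    end = as.Date(as.numeric(tapply(e, block, max)), origin = "1970-01-01")
  )
}

res <- do.call(rbind, lapply(split(dt, dt$id), merge_one_id))
rownames(res) <- NULL
res
```

Using the running maximum means the short 10/06/2006 to 12/06/2006 row no longer resets the comparison point, so id 1 stays in one period.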

R programming - Split up a group of time series indexed by ID with irregular observation periods into regular monthly observations

I have a set of data, stored in a data.frame in R, recording the amount of something that users with unique IDs used between a start date and an end date.
ID start date end date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
....
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000
The time series for each ID starts and ends at inconsistent times, and contains an inconsistent number of observations. However, they are all formatted in the above manner; the start and end dates are Date objects.
I would like to standardize the breakdowns for each ID to a monthly time series, with data points at the start of each month, weighting the observed amounts that happen to straddle two or more months accordingly.
In other words, I would like to turn this series into something like
ID start date end date amount
1 1-1-2012 2-1-2012 3096 = 6000 * 16/31
1 2-1-2012 3-1-2012 4339 = 6000*15/31+4000*14/39
1 3-1-2012 4-1-2012 etc
....
1 6-1-2012 7-1-2012 etc
2 1-1-2012 2-1-2012 etc
2 2-1-2012 3-1-2012 etc
2 3-1-2012 4-1-2012 etc
2 4-1-2012 5-1-2012 etc
2 5-1-2012 6-1-2012 etc
....
10000 1-1-2012 2-1-2012 etc
....
10000 6-1-2012 7-1-2012 etc
Where the value for ID 1 between 2/1/12 and 3/1/12 is calculated by weighting the amount in the 1-15-2012 to 2-15-2012 observation (6000) by the share of its days that land in February (15 days / 31 days), and adding the amount in the 2-15 to 3-25 observation (4000) weighted by the share of its days that fall in February (14 days / 39 days, as 2012 was a leap year), yielding 6000*15/31+4000*14/39 = 4339. This should be done for each ID time series. We do not consider the case where the observation periods all fit into one month; but if they are spread out over more than two months they should be split up over that number of months with the appropriate weightings.
I'm rather new to r and could certainly use some help on this!
Here is a solution using base R:
#The data
df=read.table(text='ID start_date end_date amount
1 1-15-2012 2-15-2012 6000
1 2-15-2012 3-25-2012 4000
1 3-25-2012 5-26-2012 3000
1 5-26-2012 6-13-2012 1000
2 1-16-2012 2-27-2012 7000
2 2-27-2012 3-18-2012 2000
2 3-18-2012 5-23-2012 3000
10000 1-12-2012 2-24-2012 12000
10000 2-24-2012 3-11-2012 22000
10000 3-11-2012 5-27-2012 33000
10000 5-27-2012 6-10-2012 5000',
header=T,row.names = NULL,stringsAsFactors =FALSE)
df[,2]=as.Date(df[,2],"%m-%d-%Y")
df[,3]=as.Date(df[,3],"%m-%d-%Y")
df1=data.frame(n=1:length(df$ID),ID=df$ID)
df1$startm=as.Date(levels(cut(df[,2],"month"))[cut(df[,2],"month")],"%Y-%m-%d")
df1$endm=as.Date(levels(cut(df[,3],"month"))[cut(df[,3],"month")],"%Y-%m-%d")
df1=df1[,-1]
#compute days in month and total days
df$dayin=as.numeric((df1$endm-1)-df$start_date)
df$daytot=as.numeric(df$end_date-df$start_date)
#separate amount this month and next month
df$ammt=df$amount*df$dayin/df$daytot
df$ammt.1=df$amount*(df$daytot-df$dayin)/df$daytot
#using by compute new amount
df1$amount=do.call(c,
by(df[,c("ammt","ammt.1")],df$ID,function(d)d[,1]+c(0,d[-nrow(d),2]))
)
df1
> df1
ID startm endm amount
1 1 2012-01-01 2012-02-01 3096.774
2 1 2012-02-01 2012-03-01 4339.123
3 1 2012-03-01 2012-05-01 4306.038
4 1 2012-05-01 2012-06-01 1535.842
5 2 2012-01-01 2012-02-01 2500.000
6 2 2012-02-01 2012-03-01 4700.000
7 2 2012-03-01 2012-05-01 3754.545
8 10000 2012-01-01 2012-02-01 5302.326
9 10000 2012-02-01 2012-03-01 13572.674
10 10000 2012-03-01 2012-05-01 36553.571
11 10000 2012-05-01 2012-06-01 13000.000
To solve this I think the easiest way is to break it down into two problems.
How can I get a daily breakdown of the figures I'm interested in? This is my assumption based on the information you provided above.
How do I group by a date range and summarise to what I'm interested in?
For the following example, I will use the data set which I created using the code below:
df <- data.frame(
id=c(1,1,1,1,2,2,2),
start_date=as.Date(c("1-15-2012",
"2-15-2012",
"3-25-2012",
"5-26-2012",
"1-16-2012",
"2-27-2012",
"3-18-2012"), "%m-%d-%Y"),
end_date=as.Date(c("2-15-2012",
"3-25-2012",
"5-26-2012",
"6-13-2012",
"2-27-2012",
"3-18-2012",
"5-23-2012"), "%m-%d-%Y"),
amount=c(6000,
4000,
3000,
1000,
7000,
2000,
3000)
)
1. Provide daily figures
To provide the daily figures, firstly we get the daily contribution:
df$daily_contribution = df$amount/as.numeric(df$end_date - df$start_date)
Then, we will expand the date range using the start and end dates. There are a couple ways which you can do it, but seeing that you apply the dplyr tag, using the dplyr way we have:
library(dplyr)
df <- df %>%
rowwise() %>%
do(data.frame(id=.$id,
date=as.Date(seq(from=.$start_date, to=(.$end_date), by="day")),
daily_contribution=.$daily_contribution))
which has some output which looks like this:
Source: local data frame [285 x 3]
Groups: <by row>
id date daily_contribution
1 1 2012-01-15 193.5484
2 1 2012-01-16 193.5484
3 1 2012-01-17 193.5484
4 1 2012-01-18 193.5484
5 1 2012-01-19 193.5484
6 1 2012-01-20 193.5484
7 1 2012-01-21 193.5484
8 1 2012-01-22 193.5484
9 1 2012-01-23 193.5484
10 1 2012-01-24 193.5484
.. .. ... ...
2. Create a grouping variable
Next we create some kind of grouping variable that we're interested in. I've used lubridate for ease to get the month and year of the dates:
library(lubridate)
df$mnth=month(df$date)
df$yr=year(df$date)
Now with all of this we can easily use dplyr to summarise our information by the dates as required.
df <- df %>%
group_by(id, mnth, yr) %>%
summarise(amount=sum(daily_contribution))
df
with output:
Source: local data frame [11 x 4]
Groups: id, mnth
id mnth yr amount
1 1 1 2012 3290.3226
2 1 2 2012 4441.6873
3 1 3 2012 2902.8122
4 1 4 2012 1451.6129
5 1 5 2012 1591.3978
6 1 6 2012 722.2222
7 2 1 2012 2666.6667
8 2 2 2012 4800.0000
9 2 3 2012 2436.3636
10 2 4 2012 1363.6364
11 2 5 2012 1045.4545
To get it precisely in the format you specified:
df %>% rowwise() %>%
mutate(start_date=as.Date(ISOdate(yr, mnth, 1)),
end_date=as.Date(ISOdate(yr, mnth+1, 1))) %>%
select(id, start_date, end_date, amount)
with output:
Source: local data frame [11 x 4]
Groups: <by row>
id start_date end_date amount
1 1 2012-01-01 2012-02-01 3290.3226
2 1 2012-02-01 2012-03-01 4441.6873
3 1 2012-03-01 2012-04-01 2902.8122
4 1 2012-04-01 2012-05-01 1451.6129
5 1 2012-05-01 2012-06-01 1591.3978
6 1 2012-06-01 2012-07-01 722.2222
7 2 2012-01-01 2012-02-01 2666.6667
8 2 2012-02-01 2012-03-01 4800.0000
9 2 2012-03-01 2012-04-01 2436.3636
10 2 2012-04-01 2012-05-01 1363.6364
11 2 2012-05-01 2012-06-01 1045.4545
as needed.
note: I can see from your example, that you have, 3096 = 6000 * 16/31 and 4339 = 6000*15/31+4000*14/39, but for the first one, as an example, you have 15 of Jan to 31 of Jan which is 17 days if the date range is inclusive. You can trivially alter this information if required.
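As a rough illustration of that adjustment, the base R sketch below counts each observation's days from the day after its start date up to and including its end date. Under that assumption (which is inferred from the 16/31 and 15/31 fractions in the question, not stated there), the monthly totals for ID 1 reproduce the asker's figures and add up to the original amounts.

```r
# ID 1 from the question
df <- data.frame(
  id = c(1, 1, 1, 1),
  start_date = as.Date(c("2012-01-15", "2012-02-15", "2012-03-25", "2012-05-26")),
  end_date = as.Date(c("2012-02-15", "2012-03-25", "2012-05-26", "2012-06-13")),
  amount = c(6000, 4000, 3000, 1000)
)
n_days <- as.numeric(df$end_date - df$start_date)

# One row per day: start date excluded, end date included
daily <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
  data.frame(
    id = df$id[i],
    date = seq(df$start_date[i] + 1, df$end_date[i], by = "day"),
    contribution = df$amount[i] / n_days[i]
  )
}))

# Sum the daily contributions per id and calendar month
daily$month <- format(daily$date, "%Y-%m")
monthly <- aggregate(contribution ~ id + month, data = daily, FUN = sum)
monthly
```

January then receives 6000 * 16/31 and February receives 6000 * 15/31 + 4000 * 14/39, matching the question.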
Here's a solution using plyr and reshape. The numbers aren't the same as what you provided, so I may have misunderstood your intent though this seems to meet your stated goal (weighted average of amount by month).
df$index <- 1:nrow(df) #Create a unique index number
#Format the dates from factors to dates
df$start.date <- as.Date(df$start.date, format="%m/%d/%Y")
df$end.date <- as.Date(df$end.date, format="%m/%d/%Y")
library(plyr); library(reshape) #Load the libraries
#dlply = (d)ataframe to (l)ist using (ply)r
#Subset on dataframe by "index" and perform a function on each subset called "X"
#Create a list containing:
# ID, each day from start to end date, amount recorded over that day
df2 <- dlply(df, .(index), function(X) {
ID <- X$ID #Keep the ID value
n.days <- as.numeric(difftime( X$end.date, X$start.date )) #Calculate time difference in days, report the result as a number
day <- seq(X$start.date, X$end.date, by="days") #Sequence of days
amount.per.day <- X$amount/n.days #Amount for that day
data.frame(ID, day, amount.per.day) #Last line is the output
})
#Change list back into data.frame
df3 <- ldply(df2, data.frame) #ldply = (l)ist to (d)ataframe using (ply)r
df3$mon <- as.numeric(format(df3$day, "%m")) #Assign a month to all dates
#Summarize by each ID and month: add up the daily amounts
ddply(df3, .(ID, mon), summarise, amount = sum(amount.per.day))
# ID mon amount
# 1 1 1 3290.3226
# 2 1 2 4441.6873
# 3 1 3 2902.8122
# 4 1 4 1451.6129
# 5 1 5 1591.3978
# 6 1 6 722.2222
# 7 2 1 2666.6667
# 8 2 2 4800.0000
# 9 2 3 2436.3636
# 10 2 4 1363.6364
# 11 2 5 1045.4545
Incidentally, for future posts, you can get faster answers if you provide the code to replicate your data. If your code is somewhat complicated, you can use dput(yourdata).
HTH!

Extracting last date of the year from a date object

I have following data set:
>d
x date
1 1 1-3-2013
2 2 2-4-2010
3 3 2-5-2011
4 4 1-6-2012
I want:
> d
x date
1 1 31-12-2013
2 2 31-12-2010
3 3 31-12-2011
4 4 31-12-2012
i.e. Last day, last month and the year of the date object.
Please Help!
You can also just use the ceiling_date function in the lubridate package.
You can do something like -
library(lubridate)
last_date <- ceiling_date(date,"year") - days(1)
ceiling_date(date,"year") gives you the first date of the next year; to get the last date of the current year, subtract 1 or days(1) from it.
Hope this helps.
Another option using lubridate package:
## using d from Roland's answer
transform(d,last =dmy(paste0('3112',year(dmy(date)))))
x date last
1 1 1-3-2013 2013-12-31
2 2 2-4-2010 2010-12-31
3 3 2-5-2011 2011-12-31
4 4 1-6-2012 2012-12-31
d <- read.table(text="x date
1 1 1-3-2013
2 2 2-4-2010
3 3 2-5-2011
4 4 1-6-2012", header=TRUE)
d$date <- as.Date(d$date, "%d-%m-%Y")
d$date <- as.POSIXlt(d$date)
d$date$mon <- 11
d$date$mday <- 31
d$date <- as.Date(d$date)
# x date
#1 1 2013-12-31
#2 2 2010-12-31
#3 3 2011-12-31
#4 4 2012-12-31
1) cut.Date Define cut_year to give the first day of the year. Adding 366 gets us to the next year and then applying cut_year again gets us to the first day of the next year. Finally subtract 1 to get the last day of the year. The code uses base functionality only.
cut_year <- function(x) as.Date(cut(as.Date(x), "year"))
transform(d, date = cut_year(cut_year(date) + 366) - 1)
2) format
transform(d, date = as.Date(format(as.Date(date), "%Y-12-31")))
3) zoo A "yearmon" class variable stores the date as a year plus 0 for Jan, 1/12 for Feb, ..., 11/12 for Dec. Thus taking its floor and adding 11/12 gets one to Dec and as.Date.yearmon(..., frac = 1) uses the last of the month instead of the first.
library(zoo)
transform(d, date = as.Date(floor(as.yearmon(as.Date(date))) + 11 / 12, frac = 1))
Note: The inner as.Date in cut_year and in the other two solutions can be omitted if it is known that date is already of "Date" class.
ADDED additional solutions.
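A quick way to convince yourself that solutions (1) and (2) agree is to run both on the sample data. The sketch below assumes date has already been converted to "Date" class, so the inner as.Date is omitted as the note allows; via_cut and via_format are illustrative names.

```r
d <- data.frame(
  x = 1:4,
  date = as.Date(c("1-3-2013", "2-4-2010", "2-5-2011", "1-6-2012"),
                 format = "%d-%m-%Y")
)

# (1) cut.Date: first day of the year, jump into next year, cut again, step back one day
cut_year <- function(x) as.Date(cut(x, "year"))
via_cut <- cut_year(cut_year(d$date) + 366) - 1

# (2) format: keep the year and substitute 31 December
via_format <- as.Date(format(d$date, "%Y-12-31"))

data.frame(date = d$date, via_cut, via_format)
```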

Resources