Adding days to a date in R

My data looks like this:
date rmean
1/2/2004 6
1/5/2004 30
1/6/2004 27
1/7/2004 20
1/8/2004 10
1/9/2004 22
1/12/2004 21
1/13/2004 18
1/14/2004 19
1/15/2004 7
1/16/2004 9
1/19/2004 11
1/20/2004 18
1/21/2004 26
1/26/2004 8
1/27/2004 16
1/28/2004 19
1/29/2004 4
1/30/2004 1
2/3/2004 11
2/4/2004 9
2/5/2004 26
2/6/2004 16
2/9/2004 25
2/10/2004 2
2/11/2004 6
2/12/2004 2
2/13/2004 25
2/16/2004 17
2/17/2004 21
2/18/2004 26
2/19/2004 6
2/20/2004 14
2/23/2004 4
2/24/2004 7
2/25/2004 19
2/26/2004 10
2/27/2004 23
I want to find the rmean for the date that falls 20 days after the 15th of each month.
Note: if there isn't a value for rmean on that date in my data (some days are skipped), I want the rmean of the closest available day.
Something like this, but with (15th of each month + 20 days) instead of the 15th:
dt <- Dataframe[, list(day15=abs(mday(date)-15) == min(abs(mday(date)-15)),
date, rmean), by=list(year(date), month(date))]
dt[day15==TRUE]
Finale = dt[day15==TRUE , .SD[1,] ,by=list(month, year)]
The expected output for my example above:
date rmean
2/4/2004 9

Here's one way to do it with base R.
First, some dummy data:
d <- data.frame(date=as.Date('1/1/2004', '%d/%m/%Y') + sort(sample(364, 200)),
x=runif(200))
head(d)
# date x
# 1 2004-01-02 0.29818227
# 2 2004-01-03 0.12543617
# 3 2004-01-04 0.78145310
# 4 2004-01-05 0.30456904
# 5 2004-01-06 0.45228066
# 6 2004-01-07 0.07511554
Calculate arrival dates within the date range of the data:
arrival <-
seq(as.Date(sprintf('15/%s', format(min(d$date), '%m/%Y')), '%d/%m/%Y'),
as.Date(sprintf('15/%s', format(max(d$date), '%m/%Y')), '%d/%m/%Y'),
by='month') + 20
arrival
# [1] "2004-02-04" "2004-03-06" "2004-04-04" "2004-05-05" "2004-06-04" "2004-07-05"
# [7] "2004-08-04" "2004-09-04" "2004-10-05" "2004-11-04" "2004-12-05" "2005-01-04"
Find the closest date to each of the arrival dates (taking the one with the larger x value if two dates are equally close), and return a data.frame with the "arrival" dates, the closest dates to each of these arrival dates, and the corresponding values of x.
cbind(arrival, do.call(rbind, lapply(arrival, function(x) {
closest <- which(abs(d$date - x) == min(abs(d$date - x)))
d[closest[which.max(d$x[closest])], ]
})))
# arrival date x
# 25 2004-02-04 2004-02-03 0.78836413
# 45 2004-03-06 2004-03-06 0.61214949
# 63 2004-04-04 2004-04-04 0.49171847
# 79 2004-05-05 2004-05-05 0.02989788
# 93 2004-06-04 2004-06-04 0.25923715
# 109 2004-07-05 2004-07-05 0.90330331
# 120 2004-08-04 2004-08-04 0.48133237
# 139 2004-09-04 2004-09-03 0.12280267
# 151 2004-10-05 2004-10-03 0.46888891
# 169 2004-11-04 2004-11-04 0.40397949
# 186 2004-12-05 2004-12-04 0.18685615
# 200 2005-01-04 2004-12-30 0.97462347
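To tie this back to the question's own data, here is a minimal sketch using a small excerpt of the posted rows. It assumes the dates are month/day/year, and the names `target`, `gap`, `hit` and `d2` are made up for illustration. Unlike the answer above, ties are broken by taking the earlier date rather than the larger value.

```r
# A small excerpt of the question's data (dates assumed to be month/day/year)
d <- data.frame(
  date  = as.Date(c("1/30/2004", "2/3/2004", "2/4/2004", "2/5/2004", "2/6/2004"),
                  format = "%m/%d/%Y"),
  rmean = c(1, 11, 9, 26, 16)
)

# Target: the 15th of January 2004 plus 20 days (adding a number to a Date adds days)
target <- as.Date("2004-01-15") + 20

# Row whose date is nearest to the target; which.min keeps the first row on ties
gap <- abs(as.numeric(d$date - target))
hit <- d[which.min(gap), ]
hit

# If the target day were missing, the nearest remaining day is used instead
d2 <- d[d$date != target, ]
fallback <- d2[which.min(abs(as.numeric(d2$date - target))), ]
fallback
```

With the excerpt above, `hit` is the 2004-02-04 row, which matches the expected output in the question.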

Related

Count the number of active episodes per month from data with start and end dates

I am trying to get a count of active clients per month, using data that has a start and end date for each client's episode. With the code I am using, I can't work out how to count per month rather than per every n days.
Here is some sample data:
Start.Date <- as.Date(c("2014-01-01", "2014-01-02","2014-01-03","2014-01-03"))
End.Date<- as.Date(c("2014-01-04", "2014-01-03","2014-01-03","2014-01-04"))
Make sure the dates are dates:
Start.Date <- as.Date(Start.Date, "%d/%m/%Y")
End.Date <- as.Date(End.Date, "%d/%m/%Y")
Here is the code I am using, which currently counts the number per day:
library(plyr)
count(Reduce(c, Map(seq, Start.Date, End.Date, by = 1)))
which returns:
x freq
1 2014-01-01 1
2 2014-01-02 2
3 2014-01-03 4
4 2014-01-04 2
The "by" argument can be changed to be however many days I want, but problems arise because months have different lengths.
Would anyone be able to suggest how I can count per month?
Thanks a lot.
note: I now realize that for my example data I have only used dates in the same month, but my real data has dates spanning 3 years.
Here's a solution that seems to work. First, I set the seed so that the example is reproducible.
# Load the packages used below and set the seed for a reproducible example
library(dplyr)      # %>% and mutate
library(lubridate)  # day<-
set.seed(33550336)
Next, I create a dummy data frame.
# Test data
df <- data.frame(Start_date = as.Date(sample(seq(as.Date('2014/01/01'), as.Date('2015/01/01'), by="day"), 12))) %>%
mutate(End_date = as.Date(Start_date + sample(1:365, 12, replace = TRUE)))
which looks like,
# Start_date End_date
# 1 2014-11-13 2015-09-26
# 2 2014-05-09 2014-06-16
# 3 2014-07-11 2014-08-16
# 4 2014-01-25 2014-04-23
# 5 2014-05-16 2014-12-19
# 6 2014-11-29 2015-07-11
# 7 2014-09-21 2015-03-30
# 8 2014-09-15 2015-01-03
# 9 2014-09-17 2014-09-26
# 10 2014-12-03 2015-05-08
# 11 2014-08-03 2015-01-12
# 12 2014-01-16 2014-12-12
The function below takes a start date and end date and creates a sequence of months between these dates.
# Sequence of months
mon_seq <- function(start, end){
# Change each day to the first to aid month counting
day(start) <- 1
day(end) <- 1
# Create a sequence of months
seq(start, end, by = "month")
}
Right, this is the tricky bit. I apply my function mon_seq to all rows in the data frame using mapply. This gives the months between each start and end date. Then, I combine all these months together into a vector. I format this vector so that dates just contain months and years. Finally, I pipe (using dplyr's %>%) this into table which counts each occurrence of year-month and I cast as a data frame.
data.frame(format(do.call("c", mapply(mon_seq, df$Start_date, df$End_date, SIMPLIFY = FALSE)), "%Y-%m") %>% table)
This gives,
# . Freq
# 1 2014-01 2
# 2 2014-02 2
# 3 2014-03 2
# 4 2014-04 2
# 5 2014-05 3
# 6 2014-06 3
# 7 2014-07 3
# 8 2014-08 4
# 9 2014-09 6
# 10 2014-10 5
# 11 2014-11 7
# 12 2014-12 8
# 13 2015-01 6
# 14 2015-02 4
# 15 2015-03 4
# 16 2015-04 3
# 17 2015-05 3
# 18 2015-06 2
# 19 2015-07 2
# 20 2015-08 1
# 21 2015-09 1
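The same idea can be sketched in base R only, without lubridate or dplyr. The sample dates below are invented for illustration (they span several months, unlike the question's sample), and `first_of_month` is a hypothetical helper, not part of any package.

```r
# Invented episodes spanning several months
Start.Date <- as.Date(c("2014-01-01", "2014-01-20", "2014-02-03", "2014-03-03"))
End.Date   <- as.Date(c("2014-01-04", "2014-03-03", "2014-02-03", "2014-04-04"))

# Hypothetical helper: first day of the month containing each date
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

# For each episode, list every year-month it touches
months_active <- lapply(seq_along(Start.Date), function(i) {
  s <- first_of_month(Start.Date[i])
  e <- first_of_month(End.Date[i])
  format(seq(s, e, by = "month"), "%Y-%m")
})

# Count how many episodes are active in each year-month
counts <- table(unlist(months_active))
counts
```

Snapping both dates to the first of the month before calling `seq` avoids the varying month lengths that make a fixed `by` in days unreliable.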

Counting the number of months from a column in a dataframe to each month in a sequence for multiple rows

This is my first post, so I do apologize if I am not specific enough.
I have a sequence of months and a data frame with approximately 100 rows, each with a unique identifier. Each identifier is associated with a start up date. I am trying to calculate the number of months since start up for each of these unique identifiers at each month in the sequence. I have tried unsuccessfully to write a for loop to accomplish this.
Example Below:
# Build Example Data Frame #
x_example <- c("A","B","C","D","E")
y_example <- c("2013-10","2013-10","2014-04","2015-06","2014-01")
x_name <- "ID"
y_name <- "StartUp"
df_example <- data.frame(x_example,y_example)
names(df_example) <- c(x_name,y_name)
# Create Sequence of Months, Format to match Data Frame, Reverse for the For Loop #
base.date <- as.Date(c("2015-11-1"))
Months <- seq.Date(from = base.date , to = Sys.Date(), by = "month")
Months.1 <- format(Months, "%Y-%m")
Months.2 <- rev(Months.1)
# Create For Loop #
require(zoo)
for(i in seq_along(Months.2))
{
for(j in 1:length(summary(as.factor(df_example$ID), maxsum = 100000)))
{
Active.Months <- 12 * as.numeric((as.yearmon(Months.2 - i) - as.yearmon(df_example$StartUp)))
}
}
The idea behind the for loop was that for every record in the Months.2 sequence, there would be a calculation of the number of months to that record (month date) from the Start Up month for each of the unique identifiers. However, this has been kicking back the error:
Error in Months.2 - i : non-numeric argument to binary operator
I am not sure what the solution is, or if I am using the for loop properly for this.
Thanks in advance for any help with solving this problem!
Edit: This is what I am hoping my expected outcome would be (this is just a sample as there are more months in the sequence):
ID Start Up Month 2015-11 2015-12 2016-01 2016-02 2016-03
1 A 2013-10 25 26 27 28 29
2 B 2013-10 25 26 27 28 29
3 C 2014-04 19 20 21 22 23
4 D 2015-06 5 6 7 8 9
5 E 2014-01 22 23 24 25 26
One way to do it is to first convert the dates with as.yearmon from the zoo package. Then we simply iterate over the months and subtract each from the start-up dates in df_example, approximating a month as 30 days:
library(zoo)
df_example$StartUp <- as.Date(as.yearmon(df_example$StartUp))
Months.2 <- as.Date(as.yearmon(Months.2))
df <- as.data.frame(sapply(Months.2, function(i)
round(abs(difftime(df_example$StartUp, i, units = 'days')/30))))
names(df) <- Months.2
cbind(df_example, df)
# ID StartUp 2016-07 2016-06 2016-05 2016-04 2016-03 2016-02 2016-01 2015-12 2015-11
#1 A 2013-10 33 32 31 30 29 28 27 26 25
#2 B 2013-10 33 32 31 30 29 28 27 26 25
#3 C 2014-04 27 26 25 24 23 22 21 20 19
#4 D 2015-06 13 12 11 10 9 8 7 6 5
#5 E 2014-01 30 29 28 27 26 25 24 23 22
x_example <- c("A","B","C","D","E")
y_example <- c("2013-10","2013-10","2014-04","2015-06","2014-01")
y_example <- paste(y_example,"-01",sep = "")
# paste on the "-01" because I want the later function to work.
x_name <- "ID"
y_name <- "StartUp"
df_example <- data.frame(x_example,y_example)
names(df_example) <- c(x_name,y_name)
base.date <- as.Date(c("2015-11-01"))
Months <- seq.Date(from = base.date , to = Sys.Date(), by = "month")
Months.1 <- format(Months, "%Y-%m-%d")
Months.2 <- rev(Months.1)
monnb <- function(d) { lt <- as.POSIXlt(as.Date(d, origin="1900-01-01")); lt$year*12 + lt$mon }
mondf <- function(d1, d2) {monnb(d2) - monnb(d1)}
NumofMonths <- abs(mondf(df_example[,2],Sys.Date()))
n = max(NumofMonths)
# sequence along the number of months and get the month count.
monthcount <- (t(sapply(NumofMonths, function(x) pmax(seq((x-n+1),x, +1), 0) )))
monthcount <- data.frame(monthcount[,-(1:24)])
names(monthcount) <- Months.1
finalDataFrame <- cbind.data.frame(df_example,monthcount)
Here is your final data frame which is the desired output you indicated:
ID StartUp 2015-11-01 2015-12-01 2016-01-01 2016-02-01 2016-03-01 2016-04-01 2016-05-01 2016-06-01 2016-07-01
1 A 2013-10-01 25 26 27 28 29 30 31 32 33
2 B 2013-10-01 25 26 27 28 29 30 31 32 33
3 C 2014-04-01 19 20 21 22 23 24 25 26 27
4 D 2015-06-01 5 6 7 8 9 10 11 12 13
5 E 2014-01-01 22 23 24 25 26 27 28 29 30
The overall idea is that we calculate the number of months and use the sequence function to create a counter of the number of months until we reach the current month.
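As a possible alternative, the month differences can be computed exactly with integer arithmetic and `outer`, with no loop and no 30-day approximation. This is only a sketch: `month_index` is a hypothetical helper, and the month sequence is hard-coded to the five months shown in the question's expected output rather than derived from `Sys.Date()`.

```r
ids     <- c("A", "B", "C", "D", "E")
startup <- c("2013-10", "2013-10", "2014-04", "2015-06", "2014-01")
months  <- c("2015-11", "2015-12", "2016-01", "2016-02", "2016-03")

# Hypothetical helper: convert "YYYY-MM" to a running month number (year * 12 + month)
month_index <- function(ym) {
  as.integer(substr(ym, 1, 4)) * 12L + as.integer(substr(ym, 6, 7))
}

# All (ID, month) differences at once: rows are IDs, columns are months
elapsed <- outer(month_index(startup), month_index(months),
                 function(s, m) m - s)
dimnames(elapsed) <- list(ids, months)

result <- cbind(data.frame(ID = ids, StartUp = startup), elapsed)
result
```

For ID A the first column is 25, the same value as in the expected output in the question.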

Apply a function on multiple columns in each row

Below is the data frame I have. Column 2 is the days to expiration of the nearest contract, column 3 is the days to expiration of the next nearest contract. I'm trying to create a vector that gives me the percentage of column 2 needed to give me a weighted average days to expiration of 28 days for each row.
Date DaysXone DaysXtwo
1 2006-01-03 15 43 days
2 2006-01-04 14 42 days
3 2006-01-05 13 41 days
4 2006-01-06 12 40 days
5 2006-01-09 9 37 days
6 2006-01-10 8 36 days
I've tried:
f <- function(x){
DF$DaysXone*x + DF$DaysXtwo*(1-x) - 28}
and then I've tried a few things with uniroot(), but now I'm stuck.
Thanks!
Or go with data.table:
library(data.table)
df <- data.frame(DaysXone = c(15,14,13,12,9,8), DaysXtwo = c(43, 42,41,40, 37,36))
setDT(df)[,Perc := (28-DaysXtwo)/(DaysXone - DaysXtwo),]
DaysXone DaysXtwo Perc
1: 15 43 0.5357143
2: 14 42 0.5000000
3: 13 41 0.4642857
4: 12 40 0.4285714
5: 9 37 0.3214286
6: 8 36 0.2857143
Ok, I think this should do:
library(plyr)
ddply(df, "date", function(i) {res = (-i$DaysXtwo + 28) / (i$DaysXone - i$DaysXtwo)})
date V1
1 2006-01-03 0.5357143
2 2006-01-04 0.5000000
3 2006-01-05 0.4642857
library(data.table)
df <- data.frame(DaysXone = c(15,14,13,12,9,8), DaysXtwo = c(43, 42,41,40, 37,36))
perc_function = function(x,y) {
out = (28-y)/(x-y)
return(out)
}
df = cbind(df, perc = mapply(perc_function, df$DaysXone, df$DaysXtwo))
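The formula used in the answers above comes from solving w * DaysXone + (1 - w) * DaysXtwo = 28 for w, which gives w = (28 - DaysXtwo) / (DaysXone - DaysXtwo). A small base R sketch, with variable names chosen for illustration, computes this for every row and cross-checks the first row against `uniroot`, which is what the question originally attempted:

```r
days_one <- c(15, 14, 13, 12, 9, 8)
days_two <- c(43, 42, 41, 40, 37, 36)
target   <- 28

# Closed form, vectorised over all rows
weight <- (target - days_two) / (days_one - days_two)

# Sanity check: the weighted average should equal the target in every row
blended <- weight * days_one + (1 - weight) * days_two

# Numerical cross-check for row 1 only; uniroot solves one equation at a time
f <- function(w) days_one[1] * w + days_two[1] * (1 - w) - target
root <- uniroot(f, interval = c(0, 1))$root
```

Because the equation is linear in w, the closed form is all that is needed; `uniroot` would have to be called once per row.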

Difftime for workdays according to holidayNYSE in R

I'm trying to find the difftime for working days only, according to the holidayNYSE calendar. When I use the difftime function, weekends and holidays are included in the result. My dataset contains only data from working days, so I have to subtract the non-working days somehow.
A is a vector of 0s and 1s, and I want to find how many days each run of 0s or 1s lasts. The duration for run one is supposed to be 35, but I get 49 (working days from January 1990).
df <- data.frame(Date=(dates), A)
setDT(df)
DF1 <- df[, list(A = unique(A), duration = difftime(max(Date),min(Date), holidayNYSE
(year=setRmetricsOptions(start="1990-01-01", end="2015-31-12")))), by = run]
DF1
run A duration
1: 1 1 49 days
2: 2 0 22 days
3: 3 1 35 days
4: 4 0 27 days
5: 5 1 14 days
---
291: 291 1 6 days
292: 292 0 34 days
293: 293 1 10 days
294: 294 0 15 days
295: 295 1 29 days
An answer to my question without use of difftime:
df <- data.frame(Date=(dates), Value1=bull01)
setDT(df)
df[, run := cumsum(c(1, diff(Value1) !=0))]
duration <- rep(0)
for (i in 1:295){
ind <- which(df$run==i)
a <- df$Date[ind]
duration[i] <- length(a)
}
c <- rep(c(1,0),295)
c <- c[1:295]
df2 <- data.frame(duration, type=c)
> df2
run duration type
1 35 1
2 17 0
3 25 1
4 20 0
5 10 1
---
291 5 1
292 25 0
293 9 1
294 11 0
295 21 1
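Since the data contain only working days, counting rows per run already gives the duration in working days. A more compact sketch of that counting step uses base R's `rle()`; the vector below is invented for illustration and stands in for the 0/1 column.

```r
# Invented 0/1 indicator, one entry per working day
value <- c(1, 1, 1, 0, 0, 1, 1, 1, 1, 0)

# rle() returns the length and value of each run of identical entries
runs <- rle(value)

df2 <- data.frame(run      = seq_along(runs$lengths),
                  duration = runs$lengths,
                  type     = runs$values)
df2
```

This avoids hard-coding the number of runs (295 in the loop above) and does not assume the runs strictly alternate starting from 1.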

How to subset data.frame by weeks and then sum?

Let's say I have several years worth of data which look like the following
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
wday = wday(date),
wday.name = wday(date, label = TRUE, abbr = TRUE),
income = round(runif(21, 0, 100)),
week = format(date, format="%Y-%U"),
stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, in particular the bit that defines %U as the week number.
EDIT:
Given the modified data and requirement, one way is to divide the date by 7 to get a number indicating the week (or, more precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default). This assumes df$date is a date-time (POSIXct) measured in seconds; for a Date object, which counts days, divide by 7 instead.
In code:
df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(df$date)/(3600*24*7))
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
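One way the offset could be worked out, as a sketch: 1970-01-01 was a Thursday, so the first Sunday of the epoch is day 3 (1970-01-04). Subtracting 3 before taking the remainder modulo 7 therefore snaps each date back to its Sunday. The example assumes `date` is of class Date (days since the epoch) and copies the income values from the table in the question.

```r
date   <- seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day")
income <- c(91, 94, 29, 83, 64, 52, 74,
            13, 66, 71, 46, 72, 93, 26,
            46, 94, 98, 12, 47, 56, 90)

# Days since 1970-01-01; day 3 was a Sunday, so shift by 3 before the modulo
days       <- as.numeric(date)
week_start <- date - (days - 3) %% 7

# Sum income per Sunday-to-Saturday week, labelled by the week's Sunday
weekly <- tapply(income, format(week_start), sum)
weekly
```

Adding 6 to `week_start` would give the week-ending Saturday instead, if that label is preferred.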
This is now simple using dplyr. Also I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks.
library(dplyr)
df %>% group_by(week = cut(date, "week")) %>% mutate(weekly_income = sum(income))
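One thing to watch with `cut()` on dates: weeks start on Monday by default, while the question asks for Sunday through Saturday. A base R sketch using the `start.on.monday = FALSE` argument of `cut.Date`, with the income values copied from the table in the question:

```r
df <- data.frame(
  date   = seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day"),
  income = c(91, 94, 29, 83, 64, 52, 74,
             13, 66, 71, 46, 72, 93, 26,
             46, 94, 98, 12, 47, 56, 90)
)

# Label each date with the Sunday that starts its week
df$week <- cut(df$date, "week", start.on.monday = FALSE)

# One row per week with the summed income
weekly <- aggregate(income ~ week, data = df, FUN = sum)
weekly
```

The three sums agree with the manual totals given in the question (487, 387 and 443).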
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep track of both the week number and the year, so I modified the answers there to use format(date, format = "%y%U").
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by=df$date),
which(df$wday %in% 7)),
week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
This solution is influenced by @Andrie and @Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
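# Python/pandas rather than R: set the date column as the index, then resample by week
df.index = pd.to_datetime(df['date'])
df.resample('W-SAT')['income'].sum()  # sum per week ending on Saturday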
With dplyr:
df %>%
arrange(date) %>%
mutate(week = as.numeric(date - date[1])%/%7) %>%
group_by(week) %>%
summarise(weekincome= sum(income))
Instead of date[1] you can have any date from when you want to start your weekly study.

Resources