Finding the third Friday of a month and data table - r

I want to find the third Friday of each month, which is the delivery date of the futures. I used the solution here, getNthDayOfWeek from the RcppBDT package:
library(data.table)
library(RcppBDT)
data <- setDT(data.frame(mon=c(5:12, 1:12, 1:12, 1:4),
year=c(rep(2011,8), rep(2012,12), rep(2013,12), rep(2014,4))))
data[, third.friday:= getNthDayOfWeek(third, Fri, mon, year)]
However I get this message: Error: expecting a single value. What am I missing?

Since you did not specify a by clause, j is evaluated once with mon and year as whole columns, so getNthDayOfWeek receives vectors where it expects single values. Grouping so that each call sees a single mon and year avoids this.
This should work:
data[, third.friday := getNthDayOfWeek(third, Fri, mon, year),
by = "mon,year"]
data
# mon year third.friday
#1: 5 2011 2011-05-20
#2: 6 2011 2011-06-17
#3: 7 2011 2011-07-15
#4: 8 2011 2011-08-19
#5: 9 2011 2011-09-16
#6: 10 2011 2011-10-21
#7: 11 2011 2011-11-18
#8: 12 2011 2011-12-16
#9: 1 2012 2012-01-20
Or, more generally, in case you have duplicate mon,year tuples in your object:
data[, Idx := 1:.N][
, third.friday := getNthDayOfWeek(third, Fri, mon, year)
, by = "mon,year,Idx"
][, Idx := NULL][]
# mon year third.friday
#1: 5 2011 2011-05-20
#2: 6 2011 2011-06-17
#3: 7 2011 2011-07-15
#4: 8 2011 2011-08-19
#5: 9 2011 2011-09-16
#6: 10 2011 2011-10-21
#7: 11 2011 2011-11-18
#8: 12 2011 2011-12-16
#9: 1 2012 2012-01-20
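If installing RcppBDT is not an option, the same dates can be computed with base R date arithmetic. The sketch below is an alternative technique, not the RcppBDT implementation, and third_friday is a hypothetical helper name. It is vectorized over mon and year, so it needs no by clause.

```r
# Hypothetical helper (not part of RcppBDT): third Friday of a month in base R.
third_friday <- function(mon, year) {
  # First day of each month as a Date.
  first <- as.Date(sprintf("%d-%02d-01", as.integer(year), as.integer(mon)))
  # as.POSIXlt()$wday: 0 = Sunday, ..., 5 = Friday.
  wday <- as.POSIXlt(first)$wday
  # Days from the 1st to the first Friday of the month (0 to 6).
  offset <- (5 - wday) %% 7
  # The third Friday is two weeks after the first one.
  first + offset + 14
}

format(third_friday(c(5, 6, 1), c(2011, 2011, 2012)))
# "2011-05-20" "2011-06-17" "2012-01-20"
```

With the question's table this could be used as data[, third.friday := third_friday(mon, year)].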

Related

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to convert each date into a running day count (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the count keeps increasing across the 3 years of data.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result is wrong and the code is too long. It does not count February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, so look for tools that handle them for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
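To mirror the start-date arguments of the question's cday function, the Date arithmetic above can be wrapped in a small function. This is a minimal sketch; day_number and its start argument are hypothetical names, and the start date itself is counted as day 1. Leap years need no special handling because the Date class accounts for them.

```r
# Hypothetical helper: running day number counted from a start date.
day_number <- function(year, month, day, start = as.Date("2011-01-01")) {
  date <- as.Date(sprintf("%d-%02d-%02d",
                          as.integer(year), as.integer(month), as.integer(day)))
  # Date - Date gives a difference in days; add 1 so the start date is day 1.
  as.integer(date - start) + 1L
}

day_number(c(2011, 2012, 2012), c(1, 2, 3), c(5, 24, 3))
# 5 420 428
```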

R data.table merge drops rows (December only)

Solved with the help of @Uwe Block.
R data.table merge drops December observations by shifting the month-index back in one data set while trying to merge a monthly data set onto a set of daily observations. What's a good way to do this merge that works as expected?
Using merge per @Harry Daniels, merge(monthly, daily, by=c("year","month"), all=TRUE) instead of daily[monthly, on=c("year","month"), all=TRUE] retains all daily observations correctly, but the monthly data are still shifted so that January becomes month 0.
Problem: generating the month and year columns on the monthly dataset produced month values that were not exactly integers. That is, 1 was actually 0.999999999999091, so the merge took the floor internally and shifted every month down by one.
Example: `monthly[,month:=100*(Date%%1)]` where the date was stored as numeric 2016.01, 2016.02,...,2016.12.
See the following:
> monthly
year month CPI
1: 2016 1 236.916
2: 2016 2 237.111
3: 2016 3 238.132
4: 2016 4 239.261
5: 2016 5 240.229
6: 2016 6 241.018
7: 2016 7 240.628
8: 2016 8 240.849
9: 2016 9 241.428
10: 2016 10 241.729
11: 2016 11 241.353
12: 2016 12 241.432
> daily
date year month close
1: 2016-01-04 2016 1 2012.66
2: 2016-01-05 2016 1 2016.71
3: 2016-01-06 2016 1 1990.26
4: 2016-01-07 2016 1 1943.09
5: 2016-01-08 2016 1 1922.03
---
248: 2016-12-23 2016 12 2263.79
249: 2016-12-27 2016 12 2268.88
250: 2016-12-28 2016 12 2249.92
251: 2016-12-29 2016 12 2249.26
252: 2016-12-30 2016 12 2238.83
> daily[monthly, on=c("year","month")]
date year month close CPI
1: <NA> 2016 0 NA 236.916
2: 2016-01-04 2016 1 2012.66 237.111
3: 2016-01-05 2016 1 2016.71 237.111
4: 2016-01-06 2016 1 1990.26 237.111
5: 2016-01-07 2016 1 1943.09 237.111
---
228: 2016-11-23 2016 11 2204.72 241.432
229: 2016-11-25 2016 11 2213.35 241.432
230: 2016-11-28 2016 11 2201.72 241.432
231: 2016-11-29 2016 11 2204.66 241.432
232: 2016-11-30 2016 11 2198.81 241.432
> merge(monthly, daily, by=c("year","month"), all=TRUE)
year month CPI close
1: 2016 0 236.916 NA
2: 2016 1 237.111 2012.66
3: 2016 1 237.111 2016.71
4: 2016 1 237.111 1990.26
5: 2016 1 237.111 1943.09
---
249: 2016 12 NA 2263.79
250: 2016 12 NA 2268.88
251: 2016 12 NA 2249.92
252: 2016 12 NA 2249.26
253: 2016 12 NA 2238.83
This should suffice:
merge(monthly, daily , by = 'month', all = TRUE )
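The cause described in the question, month values that are only approximately integers, can be avoided by rounding to an integer before joining. The following is a minimal sketch, not the asker's actual code: the CPI values are made up, the tables are toy stand-ins, and it assumes the numeric Date encoding (2016.01 to 2016.12) given above.

```r
library(data.table)

# Toy stand-in for the monthly table: dates stored as numeric 2016.01 ... 2016.12.
monthly <- data.table(Date = as.numeric(sprintf("2016.%02d", 1:12)),
                      CPI  = 236 + 0:11)  # made-up values

# 100 * (Date %% 1) is only approximately an integer, so round it
# and store it as an integer before using it as a join key.
monthly[, year  := as.integer(floor(Date))]
monthly[, month := as.integer(round(100 * (Date %% 1)))]

# Toy stand-in for the daily table, with integer year and month keys.
daily <- data.table(date  = as.Date(c("2016-01-04", "2016-12-30")),
                    close = c(2012.66, 2238.83))
daily[, `:=`(year = year(date), month = month(date))]

# Every daily row now finds its monthly CPI, including December.
merged <- merge(daily, monthly[, .(year, month, CPI)],
                by = c("year", "month"), all.x = TRUE)
merged
```

Keeping both key columns in the merge matters once the data span more than one year, since merging on month alone would match January 2016 to January of every other year.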

How to add means to an existing column in R

I am manipulating a dataset but I can't make things right.
Here's an example for this, where df is the name of data frame.
year ID value
2013 1 10
2013 2 20
2013 3 10
2014 1 20
2014 2 20
2014 3 30
2015 1 20
2015 2 10
2015 3 30
So I tried to make another data frame df1 <- aggregate(value ~ year, df, mean, na.rm=TRUE)
And made this data frame df1:
year ID value
2013 avg 13.3
2014 avg 23.3
2015 avg 20
But I want to add each mean by year into each row of df.
The expected form is:
year ID value
2013 1 10
2013 2 20
2013 3 10
2013 avg 13.3
2014 1 20
2014 2 20
2014 3 30
2014 avg 23.3
2015 1 20
2015 2 10
2015 3 30
2015 avg 20
Here is an option with data.table: convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'year', take the mean of 'value' with 'ID' set to 'avg', then use rbindlist to bind the two datasets and order by 'year'.
library(data.table)
rbindlist(list(setDT(df), df[, .(ID = 'avg', value = mean(value)), year]))[order(year)]
# year ID value
# 1: 2013 1 10.00000
# 2: 2013 2 20.00000
# 3: 2013 3 10.00000
# 4: 2013 avg 13.33333
# 5: 2014 1 20.00000
# 6: 2014 2 20.00000
# 7: 2014 3 30.00000
# 8: 2014 avg 23.33333
# 9: 2015 1 20.00000
#10: 2015 2 10.00000
#11: 2015 3 30.00000
#12: 2015 avg 20.00000
Or using the OP's method, rbind both the datasets and then order
df2 <- rbind(df, transform(df1, ID = 'avg'))
df2 <- df2[order(df2$year),]
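For completeness, the base R route can be run end to end as follows. This is a sketch that rebuilds df from the question's table and recomputes df1 with aggregate, so the variable names match the question but the construction of df is an assumption.

```r
# Rebuild the example data from the question.
df <- data.frame(year  = rep(2013:2015, each = 3),
                 ID    = rep(1:3, times = 3),
                 value = c(10, 20, 10, 20, 20, 30, 20, 10, 30))

# Per-year means; na.rm = TRUE is passed through to mean().
df1 <- aggregate(value ~ year, df, mean, na.rm = TRUE)

# Label the mean rows, stack them under the original rows, then sort by year.
# order() leaves ties in their original order, so each "avg" row ends up
# last within its year.
df2 <- rbind(df, transform(df1, ID = "avg"))
df2 <- df2[order(df2$year), ]
df2
```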

How to compare values of a data frame to values in another data frame?

I posted this question several days ago, but I was told my description is too confusing. After clarifying my problem and adding an example, however, the question did not receive any further attention. Since I still need a solution, I deleted the old question and now post it in a hopefully better formulation.
The following example illustrates my problem.
I have two objects. First of these is a data frame that describes each individual's (id) group (group), a year when s(he) took an action (do.year) and the values of a variable (var) for every year between years 2010 and 2015 (var.year).
set.seed(1)
df <- data.frame(
id = rep(1:3, each = 6),
group = c(rep("a", 12), rep("b", 6)),
do.year = rep(sample(2011:2013), each = 6),
var = runif(18),
var.year = 2010:2015)
df
id group do.year var var.year
1 1 a 2011 0.90820779 2010
2 1 a 2011 0.20168193 2011
3 1 a 2011 0.89838968 2012
4 1 a 2011 0.94467527 2013
5 1 a 2011 0.66079779 2014
6 1 a 2011 0.62911404 2015
7 2 a 2013 0.06178627 2010
8 2 a 2013 0.20597457 2011
9 2 a 2013 0.17655675 2012
10 2 a 2013 0.68702285 2013
11 2 a 2013 0.38410372 2014
12 2 a 2013 0.76984142 2015
13 3 b 2012 0.49769924 2010
14 3 b 2012 0.71761851 2011
15 3 b 2012 0.99190609 2012
16 3 b 2012 0.38003518 2013
17 3 b 2012 0.77744522 2014
18 3 b 2012 0.93470523 2015
The second object consists of data frames for groups a and b and also contains values of a variable (var) for every year between years 2010 and 2015 (var.year), but these are the average values of group members. It's a list of data frames, but could also be converted into a single data frame if necessary.
avg <- list(
"a" = data.frame(var.year = 2010:2015, var = runif(6)),
"b" = data.frame(var.year = 2010:2015, var = runif(6)))
avg
$a
var.year var
1 2010 0.21214252
2 2011 0.65167377
3 2012 0.12555510
4 2013 0.26722067
5 2014 0.38611409
6 2015 0.01339033
$b
var.year var
1 2010 0.3823880
2 2011 0.8696908
3 2012 0.3403490
4 2013 0.4820801
5 2014 0.5995658
6 2015 0.4935413
My goal here is to compare each individual's result indicator to that of the respective comparison group's in a specific year (do.year). So, for each individual (id), I'd like to take the value of variable (var) in the year when an action was taken (do.year) and from that value subtract the group average value (var in avg) of the same year (var.year). The result for each individual would be stored in a new variable diff.var.
I have only a few weeks of experience with R, so my solution would be to just merge datasets for every group (and variable) and then do the calculations (below). However, since my original dataset involves 7 groups and 6 variables, it would result in about 1000 lines of code. I also tried looping, but was unable to properly define the loop variable everywhere.
df.a <- merge(df, avg[["a"]], by = "var.year")
df.a$diff.var[df.a$group == "a" & df.a$var.year == df.a$do.year] <-
df.a$var.x[df.a$group == "a" & df.a$var.year == df.a$do.year] -
df.a$var.y[df.a$group == "a" & df.a$var.year == df.a$do.year]
df.a
var.year id group do.year var.x var.y diff.var
1 2010 1 a 2011 0.90820779 0.21214252 NA
2 2010 2 a 2013 0.06178627 0.21214252 NA
3 2010 3 b 2012 0.49769924 0.21214252 NA
4 2011 1 a 2011 0.20168193 0.65167377 -0.4499918
5 2011 2 a 2013 0.20597457 0.65167377 NA
6 2011 3 b 2012 0.71761851 0.65167377 NA
7 2012 1 a 2011 0.89838968 0.12555510 NA
8 2012 2 a 2013 0.17655675 0.12555510 NA
9 2012 3 b 2012 0.99190609 0.12555510 NA
10 2013 1 a 2011 0.94467527 0.26722067 NA
11 2013 2 a 2013 0.68702285 0.26722067 0.4198022
12 2013 3 b 2012 0.38003518 0.26722067 NA
13 2014 1 a 2011 0.66079779 0.38611409 NA
14 2014 2 a 2013 0.38410372 0.38611409 NA
15 2014 3 b 2012 0.77744522 0.38611409 NA
16 2015 1 a 2011 0.62911404 0.01339033 NA
17 2015 2 a 2013 0.76984142 0.01339033 NA
18 2015 3 b 2012 0.93470523 0.01339033 NA
df.b <- merge(df, avg[["b"]], by = "var.year")
df.b$diff.var[df.b$group == "b" & df.b$var.year == df.b$do.year] <-
df.b$var.x[df.b$group == "b" & df.b$var.year == df.b$do.year] -
df.b$var.y[df.b$group == "b" & df.b$var.year == df.b$do.year]
df.b
var.year id group do.year var.x var.y diff.var
1 2010 1 a 2011 0.90820779 0.3823880 NA
2 2010 2 a 2013 0.06178627 0.3823880 NA
3 2010 3 b 2012 0.49769924 0.3823880 NA
4 2011 1 a 2011 0.20168193 0.8696908 NA
5 2011 2 a 2013 0.20597457 0.8696908 NA
6 2011 3 b 2012 0.71761851 0.8696908 NA
7 2012 1 a 2011 0.89838968 0.3403490 NA
8 2012 2 a 2013 0.17655675 0.3403490 NA
9 2012 3 b 2012 0.99190609 0.3403490 0.6515571
10 2013 1 a 2011 0.94467527 0.4820801 NA
11 2013 2 a 2013 0.68702285 0.4820801 NA
12 2013 3 b 2012 0.38003518 0.4820801 NA
13 2014 1 a 2011 0.66079779 0.5995658 NA
14 2014 2 a 2013 0.38410372 0.5995658 NA
15 2014 3 b 2012 0.77744522 0.5995658 NA
16 2015 1 a 2011 0.62911404 0.4935413 NA
17 2015 2 a 2013 0.76984142 0.4935413 NA
18 2015 3 b 2012 0.93470523 0.4935413 NA
How should this problem be solved in R? A base R or data.table solution would be preferred.
If you want a data.table solution, here's a possible one. I would suggest first converting your list to a data.table with a group column. Then just do a join on var.year and group while do.year == var.year and create diff.var on the fly. I'm also assuming that you are not really trying to create an identical data set for each group, but rather just the original data set joined with avg according to your rules. Something like the following:
library(data.table)
### Create a group column for each list and convert to a data.table
avg <- rbindlist(Map(cbind, avg, group = names(avg)))
### join by var.year and group while do.year == var.year and create diff.var on the fly
setDT(df)[do.year == var.year,
diff.var := var - avg[copy(.SD), var, on = c("var.year", "group")]]
df
# id group do.year var var.year diff.var
# 1: 1 a 2011 0.90820779 2010 NA
# 2: 1 a 2011 0.20168193 2011 -0.4499918
# 3: 1 a 2011 0.89838968 2012 NA
# 4: 1 a 2011 0.94467527 2013 NA
# 5: 1 a 2011 0.66079779 2014 NA
# 6: 1 a 2011 0.62911404 2015 NA
# 7: 2 a 2013 0.06178627 2010 NA
# 8: 2 a 2013 0.20597457 2011 NA
# 9: 2 a 2013 0.17655675 2012 NA
# 10: 2 a 2013 0.68702285 2013 0.4198022
# 11: 2 a 2013 0.38410372 2014 NA
# 12: 2 a 2013 0.76984142 2015 NA
# 13: 3 b 2012 0.49769924 2010 NA
# 14: 3 b 2012 0.71761851 2011 NA
# 15: 3 b 2012 0.99190609 2012 0.6515571
# 16: 3 b 2012 0.38003518 2013 NA
# 17: 3 b 2012 0.77744522 2014 NA
# 18: 3 b 2012 0.93470523 2015 NA
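Since a base R solution was also requested, the same idea (stack the list into one table with a group column, then match on group and year) can be sketched with merge. The names avg_df, group.avg, at_action and res are hypothetical, and unlike the data.table version this returns one row per individual rather than filling NA into the other years.

```r
# Recreate the example objects from the question.
set.seed(1)
df <- data.frame(
  id = rep(1:3, each = 6),
  group = c(rep("a", 12), rep("b", 6)),
  do.year = rep(sample(2011:2013), each = 6),
  var = runif(18),
  var.year = 2010:2015)
avg <- list(
  "a" = data.frame(var.year = 2010:2015, var = runif(6)),
  "b" = data.frame(var.year = 2010:2015, var = runif(6)))

# Stack the list into one data frame, tagging each block with its group name.
avg_df <- do.call(rbind, Map(cbind, avg, group = names(avg)))
names(avg_df)[names(avg_df) == "var"] <- "group.avg"

# Keep only each individual's action-year row, then attach the matching
# group average for that year.
at_action <- df[df$do.year == df$var.year, ]
res <- merge(at_action, avg_df, by = c("group", "var.year"))
res$diff.var <- res$var - res$group.avg
res[order(res$id), c("id", "group", "do.year", "diff.var")]
```

Because the group name is a column rather than a list index, the same few lines cover any number of groups; further variables could be handled by repeating the subtraction per column.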

Correct previous year by id within R

I have data something like this:
df <- data.frame(Id=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,9,9,9,9),Date=c("2013-04","2013-12","2013-01","2013-12","2013-11",
"2013-12","2012-04","2013-12","2012-08","2014-12","2013-08","2014-12","2013-08","2014-12","2011-01","2013-11","2013-12","2014-01","2014-04"))
To get the correct format:
df$Date <- paste0(df$Date,"-01")
I need to keep only the years, so that each Id has 2 consecutive years.
If I do something like this on the existing data:
require(lubridate)
df$Date <- year(as.Date(df$Date)-days(1))
I sometimes get the same year twice for a given Id.
The desired output for the column Date is this:
2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011 2013 2014
Please note that the last date for a given Id is always correct, so only the preceding year has to be corrected based on the last date. The dates have to be in a format that can be converted to years only, as shown.
EDIT Here is the case:
Id Date
1 2013-11-01
1 2013-12-01
1 2014-01-01
1 2014-04-01
Now I'm getting this: 2012,2013,2013,2013
I would need: 2012,2013,2013,2014
This is how I would solve this using the data.table package (though it looks overcomplicated to me):
library(data.table)
setDT(df)[, year := year(Date)][,
year := if(.N == 2) (year[2] - 1):year[2] else year,
Id][]
# Id Date year
# 1: 1 2013-04-01 2012
# 2: 1 2013-12-01 2013
# 3: 2 2013-01-01 2012
# 4: 2 2013-12-01 2013
# 5: 3 2013-11-01 2012
# 6: 3 2013-12-01 2013
# 7: 4 2012-04-01 2012
# 8: 4 2013-12-01 2013
# 9: 5 2012-08-01 2013
# 10: 5 2014-12-01 2014
# 11: 6 2013-08-01 2013
# 12: 6 2014-12-01 2014
# 13: 7 2013-08-01 2013
# 14: 7 2014-12-01 2014
# 15: 8 2011-01-01 2011
Or all in one step (thanks to @Arun for providing this):
setDT(df)[, year := {tmp = year(Date);
if (.N == 2L) (tmp[2]-1L):tmp[2] else tmp},
Id]
Edit:
Per the OP's new data, we can modify the code by adding an additional index:
setDT(df)[, indx := if(.N > 2) rep(seq_len(.N/2), each = 2) + 1L else .N, Id]
df[, year := {tmp = year(Date); if (.N > 1L) (tmp[2] - 1L):tmp[2] else tmp},
list(Id, indx)][]
# Id Date indx year
# 1: 1 2013-04-01 2 2012
# 2: 1 2013-12-01 2 2013
# 3: 2 2013-01-01 2 2012
# 4: 2 2013-12-01 2 2013
# 5: 3 2013-11-01 2 2012
# 6: 3 2013-12-01 2 2013
# 7: 4 2012-04-01 2 2012
# 8: 4 2013-12-01 2 2013
# 9: 5 2012-08-01 2 2013
# 10: 5 2014-12-01 2 2014
# 11: 6 2013-08-01 2 2013
# 12: 6 2014-12-01 2 2014
# 13: 7 2013-08-01 2 2013
# 14: 7 2014-12-01 2 2014
# 15: 8 2011-01-01 1 2011
# 16: 9 2013-11-01 2 2012
# 17: 9 2013-12-01 2 2013
# 18: 9 2014-01-01 3 2013
# 19: 9 2014-04-01 3 2014
Or another possible solution provided by @akrun:
setDT(df)[, `:=`(year = year(Date), indx = .N, indx2 = as.numeric(gl(.N,2, .N))), Id]
df[indx > 1, year:=(year[2]-1):year[2], list(Id, indx2)][]
Using dplyr, with an approach similar to @David Arenburg's:
library(dplyr)
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date)),
year=replace(year, n()>1, c(year[2]-1, year[2])))
# Id Date year
#1 1 2013-04 2012
#2 1 2013-12 2013
#3 2 2013-01 2012
#4 2 2013-12 2013
#5 3 2013-11 2012
#6 3 2013-12 2013
#7 4 2012-04 2012
#8 4 2013-12 2013
#9 5 2012-08 2013
#10 5 2014-12 2014
#11 6 2013-08 2013
#12 6 2014-12 2014
#13 7 2013-08 2013
#14 7 2014-12 2014
#15 8 2011-01 2011
Or using base R
with(df, ave(as.numeric(sub('-.*', '', Date)), Id,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
Update
You can try
df$indx <- with(df, ave(Id, Id, FUN=function(x) (seq_along(x)-1)%/%2+1))
with(df, ave(as.numeric(sub('-.*', '', Date)), Id, indx,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
#[16] 2012 2013 2013 2014
Or
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date))) %>%
# .add replaces the deprecated add argument (dplyr >= 1.0.0)
group_by(indx=cumsum(rep(c(TRUE,FALSE), length.out=n())), .add=TRUE) %>%
mutate(year=replace(year, n()>1, c(year[2]-1, year[2])))
Here's a dplyr solution. You can remove the intermediate fields last_year and year2, but I left them here for clarity:
library(stringr)
library(dplyr)
df %>%
group_by(Id) %>%
mutate(
last_year = last(as.integer(str_sub(Date, 1, 4))),
year2 = row_number() - n(),
year = last_year + year2
)
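The pairing idea shared by several of the answers above (consecutive rows of an Id form a pair, and the second year of each pair is trusted) can be checked end to end in base R. This is a sketch using the question's data; yr, pair and fix_pair are hypothetical names, and it assumes each Id has either a single row or an even number of rows.

```r
df <- data.frame(
  Id = c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,9,9,9,9),
  Date = c("2013-04","2013-12","2013-01","2013-12","2013-11",
           "2013-12","2012-04","2013-12","2012-08","2014-12",
           "2013-08","2014-12","2013-08","2014-12","2011-01",
           "2013-11","2013-12","2014-01","2014-04"))

# Year as an integer, taken from the first four characters of the date string.
yr <- as.integer(substr(as.character(df$Date), 1, 4))

# Pair index within each Id: rows 1-2 form pair 1, rows 3-4 form pair 2, ...
pair <- ave(df$Id, df$Id, FUN = function(x) (seq_along(x) - 1) %/% 2 + 1)

# Within a pair, trust the second year and set the first to the year before it.
fix_pair <- function(x) if (length(x) > 1) c(x[2] - 1L, x[2]) else x
df$year <- ave(yr, df$Id, pair, FUN = fix_pair)
df$year
#  [1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
# [16] 2012 2013 2013 2014
```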
