Apply a function on multiple columns in each row - r

Below is the data frame I have. Column 2 is the days to expiration of the nearest contract, column 3 is the days to expiration of the next nearest contract. I'm trying to create a vector that gives me the percentage of column 2 needed to give me a weighted average days to expiration of 28 days for each row.
Date DaysXone DaysXtwo
1 2006-01-03 15 43 days
2 2006-01-04 14 42 days
3 2006-01-05 13 41 days
4 2006-01-06 12 40 days
5 2006-01-09 9 37 days
6 2006-01-10 8 36 days
I've tried:
f <- function(x){
DF$DaysXone*(x) + DF$DaysXtwo*(1-(x)) - 28}
and then I've tried a few things with uniroot(), but now I'm stuck.
Thanks!

Or go with data.table:
library(data.table)
df <- data.frame(DaysXone = c(15,14,13,12,9,8), DaysXtwo = c(43, 42,41,40, 37,36))
setDT(df)[,Perc := (28-DaysXtwo)/(DaysXone - DaysXtwo),]
DaysXone DaysXtwo Perc
1: 15 43 0.5357143
2: 14 42 0.5000000
3: 13 41 0.4642857
4: 12 40 0.4285714
5: 9 37 0.3214286
6: 8 36 0.2857143
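The closed form used above can be checked by hand: the weight w on DaysXone must satisfy w * DaysXone + (1 - w) * DaysXtwo = 28, which rearranges to w = (28 - DaysXtwo) / (DaysXone - DaysXtwo). Here is a minimal base R sketch of that check; the names days_one, days_two, target and blended are made up for illustration.

```r
# Sample columns from the question
days_one <- c(15, 14, 13, 12, 9, 8)
days_two <- c(43, 42, 41, 40, 37, 36)
target   <- 28

# Solve w * days_one + (1 - w) * days_two = target for w (vectorised over rows)
w <- (target - days_two) / (days_one - days_two)

# Plug the weights back in; every row should blend to the target
blended <- w * days_one + (1 - w) * days_two
print(data.frame(days_one, days_two, w, blended))
```

Because the arithmetic is vectorised, no loop or apply call is needed for this particular formula.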

Ok, I think this should do:
library(plyr)
ddply(df, "date", function(i) {res = (-i$DaysXtwo + 28) / (i$DaysXone - i$DaysXtwo)})
date V1
1 2006-01-03 0.5357143
2 2006-01-04 0.5000000
3 2006-01-05 0.4642857

df <- data.frame(DaysXone = c(15,14,13,12,9,8), DaysXtwo = c(43, 42,41,40, 37,36))
# Weight on x such that x*w + y*(1-w) equals 28
perc_function = function(x,y) {
out = (28-y)/(x-y)
return(out)
}
# mapply() calls perc_function once per row, pairing DaysXone with DaysXtwo
df = cbind(df, perc = mapply(perc_function, df$DaysXone, df$DaysXtwo))
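Since the question mentions uniroot(), here is a hedged sketch of how the same weights could be found numerically, one row at a time. It assumes the 28-day target always lies between the two columns (true for the sample data), so the function changes sign on the interval [0, 1]. solve_weight is a hypothetical helper name, not something from the original posts.

```r
days_one <- c(15, 14, 13, 12, 9, 8)
days_two <- c(43, 42, 41, 40, 37, 36)

# Hypothetical helper: find the weight w in [0, 1] for a single row
solve_weight <- function(d1, d2, target = 28) {
  f <- function(w) d1 * w + d2 * (1 - w) - target
  # uniroot() needs f to have opposite signs at the interval ends
  uniroot(f, interval = c(0, 1))$root
}

# One uniroot() call per row
w_root <- mapply(solve_weight, days_one, days_two)

# Closed-form answer for comparison
w_exact <- (28 - days_two) / (days_one - days_two)
print(cbind(w_root, w_exact))
```

For a linear equation like this one the closed form is simpler and exact; a root finder is mainly worth it when the function cannot be rearranged by hand.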


Writing a function to split data into training and testing - weird error

Might be a newbie post, but I can't seem to figure out what is wrong with this function. Essentially, the idea was to write a single function that could output two new data frames (one with testing and one with training data). However, I ran into the problem that the function can only output one data frame at a time, so instead I attempted to use if statements.
This was what I came up with.
split <- function(df, tr.split, option) {
set.seed(1337)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(tr.split, 1-tr.split))
if(option=="TEST") {te.data <- df[ind==2,]}
if(option=="TEST") {te.data}
if(option=="TRAIN") {tr.data <- df[ind==1,]}
if(option=="TRAIN") {tr.data}}
Instead, when the function is run with my data set, it returns an empty dataset.
It was run like this.
te <- split(cleaned.data, 0.8, "TEST")
tr <- split(cleaned.data, 0.8, "TRAIN")
Any help would be greatly appreciated :)
This function returns a list of two elements: the first is the training set, the second the test set:
split <- function(df, tr.split) {
set.seed(1337)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(tr.split, 1-tr.split))
te.data <- df[ind==2,]
tr.data <- df[ind==1,]
return(list(tr.data,te.data))
}
An example:
df<-data.frame(x=c(1:20)+round(runif(20,min = 5, max=20),0),
date=seq(as.Date("2000/1/1"), by = "day", length.out = 20))
split(df,0.8)
[[1]]
x date
1 19 2000-01-01
2 18 2000-01-02
3 9 2000-01-03
4 11 2000-01-04
5 18 2000-01-05
6 26 2000-01-06
8 26 2000-01-08
9 21 2000-01-09
10 30 2000-01-10
14 22 2000-01-14
16 31 2000-01-16
19 28 2000-01-19
20 28 2000-01-20
[[2]]
x date
7 26 2000-01-07
11 29 2000-01-11
12 19 2000-01-12
13 29 2000-01-13
15 32 2000-01-15
17 37 2000-01-17
18 34 2000-01-18
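A possible reason the original version returned nothing: an R function returns the value of the last expression it evaluates. With option == "TEST", the last expression is the if on the "TRAIN" condition, and an if whose condition is FALSE (with no else) evaluates to NULL. The sketch below demonstrates that and shows a variant that returns a named list. split_data and last_if are hypothetical names; renaming also avoids masking base::split().

```r
# Demonstration: the value of a function is its last evaluated expression
last_if <- function(option) {
  if (option == "TEST") "test rows"
  if (option == "TRAIN") "train rows"   # evaluated last; NULL unless option is "TRAIN"
}
print(is.null(last_if("TEST")))

# Variant returning both pieces as a named list
split_data <- function(df, tr.split, seed = 1337) {
  set.seed(seed)
  ind <- sample(2, nrow(df), replace = TRUE, prob = c(tr.split, 1 - tr.split))
  list(train = df[ind == 1, , drop = FALSE],
       test  = df[ind == 2, , drop = FALSE])
}

df <- data.frame(x = 1:20)
parts <- split_data(df, 0.8)
tr <- parts$train
te <- parts$test
```

Named elements let callers write parts$train and parts$test instead of remembering positions in the list.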

Create a unique week variable NOT depending on the calendar in R

I have a daily revenue time series df from 01-01-2014 to 15-06-2017, and I want to aggregate the daily revenue to weekly revenue and make weekly predictions. Before I aggregate the revenue, I need to create a continuous week variable, which will NOT start from week 1 again when a new year starts. Since 01-01-2014 was not a Monday, I decided to start my first week from 06-01-2014.
My df now looks like this
date year month total
7 2014-01-06 2014 1 1857679.4
8 2014-01-07 2014 1 1735488.0
9 2014-01-08 2014 1 1477269.9
10 2014-01-09 2014 1 1329882.9
11 2014-01-10 2014 1 1195215.7
...
709 2017-06-14 2017 6 1677476.9
710 2017-06-15 2017 6 1533083.4
I want to create a unique week variable starting from 2014-01-06 until the last row of my dataset (1257 rows in total), which is 2017-06-15.
I wrote a loop:
week = c()
for (i in 1:179) {
week = rep(i,7)
print(week)
}
However, the result of this loop is not saved across iterations. When I type week, it just shows 179,179,179,179,179,179,179
Where is the problem, and how can I add 180, 180, 180, 180 after the repeat loop?
And if I add more data after 2017-06-15, how can I create the week variable automatically based on the last row (date)? (In other words, I don't want to have to count my daily observations, divide by 7, and then handle the leftover dates to build the week index.)
Thank you!
Does this work?
library(lubridate)
#DATA
x = data.frame(date = seq.Date(from = ymd("2014-01-06"),
to = ymd("2017-06-15"), length.out = 15))
#Add year and week for each date
x$week = year(x$date) + week(x$date)/100
#Convert the addition of year and week to factor and then to numeric
x$week_variable = as.numeric(as.factor(x$week))
#Another alternative
x$week_variable2 = floor(as.numeric(x$date - min(x$date))/7) + 1
x
# date week week_variable week_variable2
#1 2014-01-06 2014.01 1 1
#2 2014-04-05 2014.14 2 13
#3 2014-07-04 2014.27 3 26
#4 2014-10-02 2014.40 4 39
#5 2014-12-30 2014.52 5 52
#6 2015-03-30 2015.13 6 65
#7 2015-06-28 2015.26 7 77
#8 2015-09-26 2015.39 8 90
#9 2015-12-24 2015.52 9 103
#10 2016-03-23 2016.12 10 116
#11 2016-06-21 2016.25 11 129
#12 2016-09-18 2016.38 12 141
#13 2016-12-17 2016.51 13 154
#14 2017-03-17 2017.11 14 167
#15 2017-06-15 2017.24 15 180
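The date-difference idea (week_variable2) also works directly on daily data and keeps working when new rows are appended, because the index is computed from the dates rather than from a row count. Below is a minimal base R sketch; the daily data frame and its constant total column are placeholders for illustration, not the asker's real revenue.

```r
# Daily dates covering the range described in the question
dates <- seq(as.Date("2014-01-06"), as.Date("2017-06-15"), by = "day")

# Placeholder revenue of 1 per day, so weekly sums equal days per week
daily <- data.frame(date = dates, total = 1)

# Whole weeks elapsed since the first date, starting the count at 1
days_since_start <- as.numeric(difftime(daily$date, min(daily$date), units = "days"))
daily$week <- floor(days_since_start / 7) + 1

# Aggregate daily totals to weekly totals
weekly <- aggregate(total ~ week, data = daily, FUN = sum)
print(tail(weekly, 3))
```

With these dates there are 1257 daily rows, so the last week is a partial one.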
Here is the answer:
week = c()
for (i in 1:184) {
for (j in 1:7) {
week[j+(i-1)*7] = i
}
}
week = as.data.frame(week)
I created a week variable running from week 1 to week 184 (the end of my dataset). Each week number is repeated 7 times because there are 7 days in a week. Later I assigned the week variable to my data frame.
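The loop in the question kept only the last week because week = rep(i, 7) overwrites week on every pass instead of appending to it. A loop-free sketch of the same idea uses rep() with each and length.out, so a partial final week is handled automatically. n_days is a hypothetical variable; in practice it would be nrow() of the data frame. The value 1257 is the row count given in the question.

```r
n_days <- 1257  # row count stated in the question; use nrow(df) in practice

# Repeat each week number 7 times, then cut the result to exactly n_days values
week <- rep(seq_len(ceiling(n_days / 7)), each = 7, length.out = n_days)

print(head(week, 8))
print(tail(week, 5))
```

This assumes the rows are sorted by date with no missing days; if days can be missing, computing the index from the dates themselves is safer.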

How to calculate rolling correlation between rows in an xts?

I have an xts of yearly data. I am trying to get the rank correlation between each year. For example, this is a subset of my xts:
> yearlyRanks[16:20,45:55]
35881 35880 42261 33445 46087 31486 8981 7687 8203 8202 41383
2009-12-31 8 9 19 8 18 18 16 4 16 16 20
2010-12-31 4 3 20 6 19 2 17 17 17 17 21
2011-12-31 3 4 21 3 20 1 18 18 18 18 22
2012-12-31 6 6 22 5 21 19 19 19 19 19 4
2013-12-31 7 7 3 4 22 20 20 20 20 20 2
I would like to know the correlation between the ranks in each year with the preceding year. (Trying to tell how well this year's rank was predicted by last year's.)
I am trying to use this:
yearlyCors <- rollapplyr(coredata(yearlyRanks), width = 2, function(x) cor(x[1], x[2], use = 'n'))
But it takes FOREVER, and it doesn't seem to work. I think it is because I am passing it a set of 2 rows, so it wants to return 2 values, but I am only expecting 1. (Does that make sense?)
Any ideas on how I would do this?
EDIT:
Just to be clear, this is what i would want from that subset:
> test <- yearlyRanks[16:20,45:55]
> c(cor(test[1,], test[2,]), cor(test[2,], test[3,]), cor(test[3,], test[4,]), cor(test[4,], test[5,]))
[1] 0.4679246 0.9930253 0.4854528 0.7193598
EDIT:
What I want is the diag() + 1 of the correlation matrix. Here is the correlation matrix (of the transpose):
> cor(t(test))
2009-12-31 2010-12-31 2011-12-31 2012-12-31 2013-12-31
2009-12-31 1.00000000 *0.4679246* 0.4716995 0.3722922 0.08786426
2010-12-31 0.46792463 1.0000000 *0.9930253* 0.4654688 0.17192856
2011-12-31 0.47169948 0.9930253 1.0000000 *0.4854528* 0.20237689
2012-12-31 0.37229225 0.4654688 0.4854528 1.0000000 *0.71935975*
2013-12-31 0.08786426 0.1719286 0.2023769 0.7193598 1.00000000
You can see the starred values are the ones I want. Is there a way to access the diag + 1 (if you follow)?
This is one way you could get your desired result:
data <- "35881 35880 42261 33445 46087 31486 8981 7687 8203 8202 41383
2009-12-31 8 9 19 8 18 18 16 4 16 16 20
2010-12-31 4 3 20 6 19 2 17 17 17 17 21
2011-12-31 3 4 21 3 20 1 18 18 18 18 22
2012-12-31 6 6 22 5 21 19 19 19 19 19 4
2013-12-31 7 7 3 4 22 20 20 20 20 20 2"
library(xts)
dat <- read.table(text = data)
yearlyRanks <- xts(dat, order.by = as.POSIXct(row.names(dat)))
m_yearlyRanks <- t(coredata(yearlyRanks))
unlist(lapply(1:(NCOL(m_yearlyRanks) -1), function(i, x) cor(x[,i], x[, i + 1]), x = m_yearlyRanks))
# > unlist(lapply(1:(NCOL(m_yearlyRanks) -1), function(i, x) cor(x[,i], x[, i + 1]), x = m_yearlyRanks))
# [1] 0.4679246 0.9930253 0.4854528 0.7193598
That last line of code might be a bit tricky to read. It could be expressed more verbosely as (the result is identical):
res <- vector("numeric", length = NCOL(m_yearlyRanks) -1)
for (i in 1:(NCOL(m_yearlyRanks) -1)) {
res[i] <- cor(m_yearlyRanks[,i], m_yearlyRanks[, i + 1])
}
# > res
# [1] 0.4679246 0.9930253 0.4854528 0.7193598
Your error in this code:
yearlyCors <- rollapplyr(coredata(yearlyRanks), width = 2, function(x) cor(x[1], x[2], use = 'n'))
arises because rollapplyr works column by column by default, so x is a numeric vector holding one column of the window, and x[1] and x[2] are just its first and second elements. cor expects two vectors of data but gets two scalars each time the rolling function is called. Try debugging the function with browser() and the problem becomes obvious, e.g. try calling:
yearlyCors <- rollapplyr(coredata(yearlyRanks), width = 2, function(x) {
browser()
cor(x[1], x[2], use = 'n')
}
)
Use by.column=FALSE and be sure the function refers to the rows:
cor2 <- function(x) cor(x[1,], x[2,])
rollapplyr(coredata(yearlyRanks), 2, cor2, by.column = FALSE)
## [1] 0.4679246 0.9930253 0.4854528 0.7193598
We could also do this:
z <- rollapplyr(as.zoo(yearlyRanks), 2, cor2, by.column = FALSE)
as.xts(z)
giving:
[,1]
2010-12-31 0.4679246
2011-12-31 0.9930253
2012-12-31 0.4854528
2013-12-31 0.7193598
I think I figured it out. I just took the first column off the correlation matrix of the transpose, and then took the diag:
> test <- yearlyRanks[16:20,45:55]
> tester <- cor(t(test), use = 'p')
> tester
2009-12-31 2010-12-31 2011-12-31 2012-12-31 2013-12-31
2009-12-31 1.0000000 0.6309825 0.6167215 0.7106686 0.6076932
2010-12-31 0.6309825 1.0000000 0.9799418 0.4088352 0.2449624
2011-12-31 0.6167215 0.9799418 1.0000000 0.3973902 0.2471984
2012-12-31 0.7106686 0.4088352 0.3973902 1.0000000 0.7315524
2013-12-31 0.6076932 0.2449624 0.2471984 0.7315524 1.0000000
> xts(diag(tester[,-1]), order.by = as.Date(rownames(test))[-1])
[,1]
2010-12-31 0.6309825
2011-12-31 0.9799418
2012-12-31 0.3973902
2013-12-31 0.7315524
However, I do not believe this is the best way to do this, as it seems inefficient: I am calculating a BUNCH of correlations I do not need. It is plenty quick, but if anyone wants to post a more efficient solution, please do!
(Apologies the values changed. I had done something wrong before, but no bother! You all should get the gist!)
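On the "diag + 1" question: the elements just above the main diagonal can be picked out directly with a two-column index matrix of (row, column) pairs, with no need to drop a column first. Below is a minimal base R sketch using the subset printed in the question; ranks, adjacent and direct are names made up for this example.

```r
# The 5 x 11 subset shown in the question, one row per year
ranks <- matrix(c(
  8, 9, 19, 8, 18, 18, 16,  4, 16, 16, 20,
  4, 3, 20, 6, 19,  2, 17, 17, 17, 17, 21,
  3, 4, 21, 3, 20,  1, 18, 18, 18, 18, 22,
  6, 6, 22, 5, 21, 19, 19, 19, 19, 19,  4,
  7, 7,  3, 4, 22, 20, 20, 20, 20, 20,  2
), nrow = 5, byrow = TRUE,
dimnames = list(paste0(2009:2013, "-12-31"), NULL))

# Correlation between years (rows), via the transpose
cm <- cor(t(ranks))
n  <- nrow(cm)

# Superdiagonal: entries (1,2), (2,3), ..., (n-1,n)
adjacent <- cm[cbind(seq_len(n - 1), seq_len(n - 1) + 1)]
names(adjacent) <- rownames(cm)[-1]

# Same numbers computed pair by pair, without the full matrix
direct <- sapply(seq_len(n - 1), function(i) cor(ranks[i, ], ranks[i + 1, ]))
print(adjacent)
```

The pair-by-pair version computes only n - 1 correlations, which matters more as the number of years grows.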

Counting the number of months from a column in a dataframe to each month in a sequence for multiple rows

This is my first post, so I apologize if I am not specific enough.
I have a sequence of months and a data frame with approximately 100 rows, each with a unique identifier. Each identifier is associated with a start up date. I am trying to calculate the number of months since start up for each of these unique identifiers at each month in the sequence. I have tried unsuccessfully to write a for loop to accomplish this.
Example Below:
# Build Example Data Frame #
x_example <- c("A","B","C","D","E")
y_example <- c("2013-10","2013-10","2014-04","2015-06","2014-01")
x_name <- "ID"
y_name <- "StartUp"
df_example <- data.frame(x_example,y_example)
names(df_example) <- c(x_name,y_name)
# Create Sequence of Months, Format to match Data Frame, Reverse for the For Loop #
base.date <- as.Date(c("2015-11-1"))
Months <- seq.Date(from = base.date , to = Sys.Date(), by = "month")
Months.1 <- format(Months, "%Y-%m")
Months.2 <- rev(Months.1)
# Create For Loop #
require(zoo)
for(i in seq_along(Months.2))
{
for(j in 1:length(summary(as.factor(df_example$ID), maxsum = 100000)))
{
Active.Months <- 12 * as.numeric((as.yearmon(Months.2 - i) - as.yearmon(df_example$StartUp)))
}
}
The idea behind the for loop was that for every record in the Months.2 sequence, there would be a calculation of the number of months to that record (month date) from the Start Up month for each of the unique identifiers. However, this has been kicking back the error:
Error in Months.2 - i : non-numeric argument to binary operator
I am not sure what the solution is, or if I am using the for loop properly for this.
Thanks in advance for any help with solving this problem!
Edit: This is what I am hoping my expected outcome would be (this is just a sample as there are more months in the sequence):
ID Start Up Month 2015-11 2015-12 2016-01 2016-02 2016-03
1 A 2013-10 25 26 27 28 29
2 B 2013-10 25 26 27 28 29
3 C 2014-04 19 20 21 22 23
4 D 2015-06 5 6 7 8 9
5 E 2014-01 22 23 24 25 26
One way to do it is to first use as.yearmon from the zoo package to convert the dates. Then we simply iterate over the months and subtract each from the start-up dates in df_example:
library(zoo)
df_example$StartUp <- as.Date(as.yearmon(df_example$StartUp))
Months.2 <- as.Date(as.yearmon(Months.2))
df <- as.data.frame(sapply(Months.2, function(i)
round(abs(difftime(df_example$StartUp, i, units = 'days')/30))))
names(df) <- Months.2
cbind(df_example, df)
# ID StartUp 2016-07 2016-06 2016-05 2016-04 2016-03 2016-02 2016-01 2015-12 2015-11
#1 A 2013-10 33 32 31 30 29 28 27 26 25
#2 B 2013-10 33 32 31 30 29 28 27 26 25
#3 C 2014-04 27 26 25 24 23 22 21 20 19
#4 D 2015-06 13 12 11 10 9 8 7 6 5
#5 E 2014-01 30 29 28 27 26 25 24 23 22
x_example <- c("A","B","C","D","E")
y_example <- c("2013-10","2013-10","2014-04","2015-06","2014-01")
y_example <- paste(y_example,"-01",sep = "")
# Paste on the "-01" so that the later date functions work.
x_name <- "ID"
y_name <- "StartUp"
df_example <- data.frame(x_example,y_example)
names(df_example) <- c(x_name,y_name)
base.date <- as.Date(c("2015-11-01"))
Months <- seq.Date(from = base.date , to = Sys.Date(), by = "month")
Months.1 <- format(Months, "%Y-%m-%d")
Months.2 <- rev(Months.1)
monnb <- function(d) { lt <- as.POSIXlt(as.Date(d, origin="1900-01-01")); lt$year*12 + lt$mon }
mondf <- function(d1, d2) {monnb(d2) - monnb(d1)}
NumofMonths <- abs(mondf(df_example[,2],Sys.Date()))
n = max(NumofMonths)
# sequence along the number of months and get the month count.
monthcount <- (t(sapply(NumofMonths, function(x) pmax(seq((x-n+1),x, +1), 0) )))
monthcount <- data.frame(monthcount[,-(1:24)])
names(monthcount) <- Months.1
finalDataFrame <- cbind.data.frame(df_example,monthcount)
Here is your final data frame which is the desired output you indicated:
ID StartUp 2015-11-01 2015-12-01 2016-01-01 2016-02-01 2016-03-01 2016-04-01 2016-05-01 2016-06-01 2016-07-01
1 A 2013-10-01 25 26 27 28 29 30 31 32 33
2 B 2013-10-01 25 26 27 28 29 30 31 32 33
3 C 2014-04-01 19 20 21 22 23 24 25 26 27
4 D 2015-06-01 5 6 7 8 9 10 11 12 13
5 E 2014-01-01 22 23 24 25 26 27 28 29 30
The overall idea is that we calculate the number of months and use the sequence function to create a counter of the number of months until we get the current month.
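The year*12 + month counting used in monnb can also be combined with outer() to build the whole table in one step, without relying on Sys.Date() or a hard-coded column offset. The following is a minimal base R sketch under the assumption that all dates are "YYYY-MM" strings; month_index, starts, months and elapsed are hypothetical names, and the month range is fixed to the five columns shown in the question so the result is reproducible.

```r
# Hypothetical helper: map "YYYY-MM" strings to a running month number
month_index <- function(ym) {
  lt <- as.POSIXlt(as.Date(paste0(ym, "-01")))
  lt$year * 12 + lt$mon   # years since 1900 times 12, plus 0-based month
}

starts <- c(A = "2013-10", B = "2013-10", C = "2014-04", D = "2015-06", E = "2014-01")
months <- format(seq(as.Date("2015-11-01"), as.Date("2016-03-01"), by = "month"), "%Y-%m")

# Rows are IDs, columns are months; each cell is months elapsed since start-up
elapsed <- outer(month_index(starts), month_index(months), function(s, m) m - s)
dimnames(elapsed) <- list(names(starts), months)
print(elapsed)
```

Counting whole calendar months this way avoids the rounding involved in dividing a day difference by 30.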

adding days to a date in R

My data looks like this:
date rmean
1/2/2004 6
1/5/2004 30
1/6/2004 27
1/7/2004 20
1/8/2004 10
1/9/2004 22
1/12/2004 21
1/13/2004 18
1/14/2004 19
1/15/2004 7
1/16/2004 9
1/19/2004 11
1/20/2004 18
1/21/2004 26
1/26/2004 8
1/27/2004 16
1/28/2004 19
1/29/2004 4
1/30/2004 1
2/3/2004 11
2/4/2004 9
2/5/2004 26
2/6/2004 16
2/9/2004 25
2/10/2004 2
2/11/2004 6
2/12/2004 2
2/13/2004 25
2/16/2004 17
2/17/2004 21
2/18/2004 26
2/19/2004 6
2/20/2004 14
2/23/2004 4
2/24/2004 7
2/25/2004 19
2/26/2004 10
2/27/2004 23
I want to find the rmean of (20 days + 15th of each month).
Note: if there isn't a value for rmean on that date in my data (some days are skipped), I want it to find the rmean of closest day of the
Something like this, but (20 days + 15th of each month) instead of the 15th:
dt <- Dataframe[, list(day15=abs(mday(date)-15) == min(abs(mday(date)-15)),
date, rmean), by=list(year(date), month(date))]
dt[day15==TRUE]
Finale = dt[day15==TRUE , .SD[1,] ,by=list(month, year)]
The expected output for my example above:
date rmean
2/4/2004 9
Here's one way to do it with base R.
First, some dummy data:
d <- data.frame(date=as.Date('1/1/2004', '%d/%m/%Y') + sort(sample(364, 200)),
x=runif(200))
head(d)
# date x
# 1 2004-01-02 0.29818227
# 2 2004-01-03 0.12543617
# 3 2004-01-04 0.78145310
# 4 2004-01-05 0.30456904
# 5 2004-01-06 0.45228066
# 6 2004-01-07 0.07511554
Calculate arrival dates within the date range of the data:
arrival <-
seq(as.Date(sprintf('15/%s', format(min(d$date), '%m/%Y')), '%d/%m/%Y'),
as.Date(sprintf('15/%s', format(max(d$date), '%m/%Y')), '%d/%m/%Y'),
by='month') + 20
arrival
# [1] "2004-02-04" "2004-03-06" "2004-04-04" "2004-05-05" "2004-06-04" "2004-07-05"
# [7] "2004-08-04" "2004-09-04" "2004-10-05" "2004-11-04" "2004-12-05" "2005-01-04"
Find the closest date to each of the arrival dates (taking that with max x value if there are two closest dates), and return a data.frame with the "arrival" dates, the closest dates to each of these arrival dates, and the corresponding values of x.
cbind(arrival, do.call(rbind, lapply(arrival, function(x) {
closest <- which(abs(d$date - x) == min(abs(d$date - x)))
d[closest[which.max(d$x[closest])], ]
})))
# arrival date x
# 25 2004-02-04 2004-02-03 0.78836413
# 45 2004-03-06 2004-03-06 0.61214949
# 63 2004-04-04 2004-04-04 0.49171847
# 79 2004-05-05 2004-05-05 0.02989788
# 93 2004-06-04 2004-06-04 0.25923715
# 109 2004-07-05 2004-07-05 0.90330331
# 120 2004-08-04 2004-08-04 0.48133237
# 139 2004-09-04 2004-09-03 0.12280267
# 151 2004-10-05 2004-10-03 0.46888891
# 169 2004-11-04 2004-11-04 0.40397949
# 186 2004-12-05 2004-12-04 0.18685615
# 200 2005-01-04 2004-12-30 0.97462347
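Applied to the data from the question, the same idea can be sketched as follows: adding an integer to a Date adds that many days, and which.min() on the absolute day difference picks the closest available row (the first one if there is a tie). Only a few rows around the target are copied here to keep the example short; dat, target and closest are names made up for this sketch.

```r
# A few rows from the question, parsed as month/day/year
dat <- data.frame(
  date  = as.Date(c("1/29/2004", "1/30/2004", "2/3/2004",
                    "2/4/2004", "2/5/2004", "2/6/2004"), format = "%m/%d/%Y"),
  rmean = c(4, 1, 11, 9, 26, 16)
)

# 15th of January plus 20 days
target <- as.Date("2004-01-15") + 20

# Row whose date is nearest to the target
closest <- dat[which.min(abs(as.numeric(dat$date - target))), ]
print(closest)
```

Here the target date is present in the data, so the closest row is the target itself; when the date is missing, the nearest neighbouring day is returned instead.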
