Following R data.table Return calculation and set()
I would like to ask how I can use Delt() from library(quantmod) to find returns for a time series in a data.table.
So far, thanks to Frank, I have:
set.seed(42)
DT <- data.table(
  ticker = rep(letters, each = 5),
  priceA = runif(5 * 26^2),
  priceB = runif(5 * 26^2))
DT[, paste('returns', LETTERS[1:2], sep = '') := {
  lapply(.SD, function(x) {
    old <- head(x, -1)
    new <- tail(x, -1)
    c(NA, (new - old) / old)
  })
}, by = ticker, .SDcols = grep('^price', names(DT))]
The result (for this seed value) is:
ticker priceA priceB returnsA returnsB
1: a 0.9148060 0.7956245 NA NA
2: a 0.9370754 0.9314941 0.02434327 0.1707710
3: a 0.2861395 0.6269996 -0.69464620 -0.3268883
4: a 0.8304476 0.1666758 1.90224707 -0.7341691
5: a 0.6417455 0.6483800 -0.22722939 2.8900659
---
3376: z 0.2887293 0.3473923 -0.54132570 -0.3514041
3377: z 0.9013438 0.1788842 2.12176058 -0.4850656
3378: z 0.3126429 0.7648157 -0.65313686 3.2754788
3379: z 0.8791381 0.1300418 1.81195584 -0.8299698
3380: z 0.8160158 0.8159330 -0.07180019 5.2743905
How can I use Delt() (or a similar percent-change function) in place of this anonymous function?
function(x) {
  old <- head(x, -1)
  new <- tail(x, -1)
  c(NA, (new - old) / old)
}
Many many thanks!
You just need to convert the return value of Delt to a simple vector:
DT[, paste('returns', LETTERS[1:2], sep = '') := lapply(.SD, function(x) c(Delt(x))),
   by = ticker, .SDcols = grep('^price', names(DT))]
This produces:
ticker priceA priceB returnsA returnsB
1: a 0.9148060 0.7956245 NA NA
2: a 0.9370754 0.9314941 0.02434327 0.1707710
3: a 0.2861395 0.6269996 -0.69464620 -0.3268883
4: a 0.8304476 0.1666758 1.90224707 -0.7341691
5: a 0.6417455 0.6483800 -0.22722939 2.8900659
---
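The c() wrapper is what does the work here: Delt() returns its result as a one-column matrix rather than a plain vector, and c() strips the dimensions so that := stores a simple vector. A minimal sketch of the difference:

library(quantmod)
x <- c(1, 2, 4)
Delt(x)     # a one-column matrix: NA, 1, 1
c(Delt(x))  # the same values as a plain numeric vector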
Related
I have a time series and want to use the period.apply() function from the xts library to estimate the mean over 377-day windows.
A reproducible example follows:
library(xts)  # also loads zoo
zoo.data <- zoo(rnorm(5031) + 10, as.Date(13514:17744, origin = "1970-01-01"))
ep <- endpoints(zoo.data, 'days', k = 377)
period.apply(zoo.data, INDEX = ep, FUN = mean)
The output generated is
2007-05-28 2007-12-31 2008-10-05 2008-12-31 2009-02-02 2009-12-31
9.905663 9.800760 10.006344 10.052163 10.152453 10.032073
2010-06-13 2010-12-31 2011-10-22 2011-12-31 2012-02-18 2012-12-31
9.879439 10.038644 9.957582 9.977026 9.959094 10.004348
2013-06-29 2013-12-31 2014-11-07 2014-12-31 2015-03-06 2015-12-31
10.004620 10.086071 9.902875 9.843695 9.851306 10.072610
2016-07-14 2016-12-31 2017-11-23 2017-12-31 2018-03-22 2018-08-01
9.966911 10.199251 10.001628 10.263590 10.181235 10.059080
The output is unexpected, as the difference between consecutive dates is not 377 days. It shows that the windows stop at each year end (20xx-12-31) before moving on to the next endpoint.
I am not sure you can solve this using the endpoints function directly.
Here is one way to solve it using built-in functions. It is a slightly more general solution.
In the code below, you can uncomment the commented lines to print the number of observations in the last interval.
library(xts)
apply.fun <- function(data, variable = 1, fun = mean, k = 377) { # variable: name or column index
  data <- as.xts(data)
  variable <- data[, variable, drop = TRUE]
  idx <- index(data)
  byindex <- as.integer(idx - first(idx)) %/% k  # interval identifiers
  endates <- idx[!duplicated(byindex, fromLast = TRUE)]  # last date in each interval
  ans <- setNames(tapply(variable, byindex, fun), endates)
  # inter.end <- sum(byindex == last(byindex))
  # if (inter.end < k) cat(sprintf("Last interval has fewer observations: %d < k = %d\n\n", inter.end, k))
  return(as.xts(as.matrix(ans)))
}
set.seed(147)
zoo.data <- zoo(rnorm(5031)+10,as.Date(13514:17744,origin="1970-01-01"))
apply.fun(zoo.data, 1, mean)
# [,1]
# 2008-01-12 10.043735
# 2009-01-23 10.042741
# 2010-02-04 9.957842
# 2011-02-16 10.016998
# 2012-02-28 9.932871
# 2013-03-11 9.932731
# 2014-03-23 10.045344
# 2015-04-04 10.015821
# 2016-04-15 10.015023
# 2017-04-27 10.038887
# 2018-05-09 9.978744
# 2018-08-01 10.004074
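The heart of apply.fun is the integer division that assigns every observation to a fixed-width bucket counted from the first date, so the buckets never reset at year ends. A minimal sketch of that idea with k = 3:

idx <- as.Date("2007-01-01") + 0:9
as.integer(idx - idx[1]) %/% 3
# [1] 0 0 0 1 1 1 2 2 2 3   -- consecutive 3-day buckets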
I am trying to subset a data.table based on a function that I wrote.
I have found out that R only calls my function once (instead of once per row). However, I am having trouble debugging this, as I can only address Date in the context of the data.table.
For runtime purposes, I cannot use a for-loop to check each row individually.
If this problem could be solved without a self-defined function, I would be happy, too.
library("data.table")
library("zoo")
dt <- data.table(Date = as.Date(1:10),
Sales = c(rep(0, 5), rep(1, 5)))
anySalesLastWeek <- function(date) {
return(!empty(dt[as.integer(Date - as.Date(date)) %in% -7:-1 & Sales > 0, ]))
}
dt[anySalesLastWeek(Date), ]
I expected R to call the function anySalesLastWeek() once per row with the column value of Date in that row. Instead, the function seems to be called only once with dt$Date as the input.
The expected output would be the first six rows.
Not sure if I understood it correctly. Here is an approach using a non-equi join:
DT[, oneweekago := Date - 7]
DT[, anySalesLastWeek :=
     DT[DT, on = .(Date >= oneweekago, Date < Date), allow.cartesian = TRUE,
        sum(Sales, na.rm = TRUE) > 0, by = .EACHI]$V1
]
output:
Date Sales oneweekago anySalesLastWeek
1: 2019-10-04 0 2019-09-27 FALSE
2: 2019-10-05 0 2019-09-28 FALSE
3: 2019-10-06 0 2019-09-29 FALSE
4: 2019-10-07 0 2019-09-30 FALSE
5: 2019-10-08 0 2019-10-01 FALSE
6: 2019-10-09 1 2019-10-02 FALSE
7: 2019-10-10 1 2019-10-03 TRUE
8: 2019-10-11 1 2019-10-04 TRUE
9: 2019-10-12 1 2019-10-05 TRUE
10: 2019-10-13 1 2019-10-06 TRUE
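For each row of the inner DT, the join collects every row whose Date falls in the window [oneweekago, Date), and by = .EACHI evaluates sum(Sales, na.rm = TRUE) > 0 once per such group, yielding one logical per original row (extracted as $V1).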
data:
library("data.table")
DT <- data.table(Date=seq(Sys.Date(), by="1 day", length.out=10), Sales=c(rep(0, 5), rep(1, 5)))
# Date Sales
# 1: 2019-10-04 0
# 2: 2019-10-05 0
# 3: 2019-10-06 0
# 4: 2019-10-07 0
# 5: 2019-10-08 0
# 6: 2019-10-09 1
# 7: 2019-10-10 1
# 8: 2019-10-11 1
# 9: 2019-10-12 1
#10: 2019-10-13 1
Edit: regarding the OP's function, and following up on Rohit's comment, you can fix it as follows:
anySalesLastWeek <- function(date) {
  DT[as.integer(Date - as.Date(date)) %in% -7:-1 & Sales > 0, .N > 0]
}
DT[sapply(Date, anySalesLastWeek)]
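This works because sapply() calls the function once per element of Date, whereas data.table evaluates the i expression a single time with the whole column. A minimal sketch with a hypothetical tracing function f (using the DT above):

f <- function(x) { cat("f called with length", length(x), "\n"); rep(TRUE, length(x)) }
DT[f(Date)]          # prints once: i is evaluated one time with the full column
DT[sapply(Date, f)]  # prints ten times: one call per element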
I have a vector of dates:
dates <- seq(as.Date('2017-01-01'), as.Date('2017-12-31'), by = 'days')
I want to create a data frame where this vector is repeated for n rows. Can anyone tell me how I might be able to accomplish this? Any help is greatly appreciated.
Thanks for the suggestions so far. Unfortunately, I think my intention was unclear in my original question. I would like each of n rows in the data frame to contain the vector of dates so that the final data frame would look something like this:
1 2017-01-01 2017-01-02.....2017-12-31
2 2017-01-01 2017-01-02.....2017-12-31
3 2017-01-01 2017-01-02.....2017-12-31
.
.
.
n 2017-01-01 2017-01-02.....2017-12-31
You can use rep to repeat the vector and then coerce it to a data frame. For example, repeating 10 times:
num_repeat <- 10
dates <- data.frame(rep(
seq(as.Date('2017-01-01'), as.Date('2017-12-31'), by = 'days'),
times = num_repeat))
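Note that rep(times = ...) lays the copies end-to-end, so this produces one long column (here 365 x 10 = 3650 rows) rather than one row per repetition; a quick check:

dim(dates)
# [1] 3650    1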
As the question asker is hoping to fill n rows, wouldn't it make more sense to specify length.out rather than times?
set.seed(1)
dtf <- data.frame(A=letters[sample(1:27, 1000, TRUE)])
dtf$B <- rep(dates, length.out=nrow(dtf))
tail(dtf)
# A B
# 995 d 2017-09-22
# 996 u 2017-09-23
# 997 r 2017-09-24
# 998 h 2017-09-25
# 999 f 2017-09-26
# 1000 h 2017-09-27
We can use replicate to do this:
n <- 5
out <- do.call(rbind, replicate(n, as.data.frame(as.list(dates)),
simplify = FALSE))
names(out) <- paste0('V', seq_along(out))
dim(out)
#[1] 5 365
out[1:3, 1:3]
# V1 V2 V3
#1 2017-01-01 2017-01-02 2017-01-03
#2 2017-01-01 2017-01-02 2017-01-03
#3 2017-01-01 2017-01-02 2017-01-03
out[1:3, 362:365]
# V362 V363 V364 V365
#1 2017-12-28 2017-12-29 2017-12-30 2017-12-31
#2 2017-12-28 2017-12-29 2017-12-30 2017-12-31
#3 2017-12-28 2017-12-29 2017-12-30 2017-12-31
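Each as.data.frame(as.list(dates)) call turns the 365-element date vector into a one-row, 365-column data frame; replicate(..., simplify = FALSE) produces a list of n such copies, and do.call(rbind, ...) stacks them into the n x 365 result.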
I have a large panel of firms' stock returns and the value-weighted S&P 500 return. I want to apply a rolling-window regression, where I regress the firms' returns of the previous twelve months on the value-weighted S&P 500 return, and extract the standard deviation of the residuals.
My code looks as follows:
stdev <- matrix(NA, nrow = nrow(ReturnMatrix), ncol = 1)
pb <- winProgressBar(title = "", label = "", min = 1,
                     max = nrow(ReturnMatrix) - 11)
for (i in 1:(nrow(ReturnMatrix) - 11)) {
  VWRet <- ReturnMatrix$VWReturn[i:(i + 11)]
  Ret <- ReturnMatrix$Return[i:(i + 11)]
  if (sum(is.na(Ret)) >= 6) {
    stdev[i + 11] <- NA
  } else {
    Model <- glm(Ret ~ VWRet - 1)
    stdev[i + 11] <- sigma(Model)
  }
  setWinProgressBar(pb, value = i, title = paste0(round(100 *
    (i / (nrow(ReturnMatrix) - 11)), 2), " % Done"))
}
SD <- cbind.data.frame(ReturnMatrix, stdev)
The data frame ReturnMatrix is very large: it contains 3,239,065 rows. Its variables are PERMNO, a firm identifier; YearMonth, the date in YYYYMM format; Return, the firm's return in that month; and VWReturn, the value-weighted S&P 500 return.
Right now, it takes about 1 hour to run this for loop.
My question is: Is there any way to speed this process up a notch, I have tried using rollapply on zoo(ReturnMatrix), but this only slows it down even more.
Any help would be greatly appreciated.
Here's how to do that with data.table, which should be the fastest way to do what you want. You first need to build a sigma function and then use rollapplyr with .SD.
set.seed(1)
library(data.table)
library(zoo)  # for rollapplyr

dt <- data.table(PERMNO = rep(LETTERS[1:3], each = 13),
                 YearMonth = seq.Date(from = Sys.Date(), by = "month", length.out = 13),
                 Return = runif(39), VWReturn = runif(39))

# create sigma function: sd of the residuals from regressing column 1 on column 2
stdev <- function(x) sd(lm(x[, 1] ~ x[, 2])$residuals)

# create new column with rollapply
dt[, roll_sd := rollapplyr(.SD, 12, stdev, by.column = FALSE, fill = NA),
   by = .(PERMNO), .SDcols = c("Return", "VWReturn")]
PERMNO YearMonth Return VWReturn roll_sd
1: A 2017-11-19 0.26550866 0.41127443 NA
2: A 2017-12-19 0.37212390 0.82094629 NA
3: A 2018-01-19 0.57285336 0.64706019 NA
4: A 2018-02-19 0.90820779 0.78293276 NA
5: A 2018-03-19 0.20168193 0.55303631 NA
6: A 2018-04-19 0.89838968 0.52971958 NA
7: A 2018-05-19 0.94467527 0.78935623 NA
8: A 2018-06-19 0.66079779 0.02333120 NA
9: A 2018-07-19 0.62911404 0.47723007 NA
10: A 2018-08-19 0.06178627 0.73231374 NA
11: A 2018-09-19 0.20597457 0.69273156 NA
12: A 2018-10-19 0.17655675 0.47761962 0.3181427
13: A 2018-11-19 0.68702285 0.86120948 0.3141638
14: B 2017-11-19 0.38410372 0.43809711 NA
....
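Here by.column = FALSE makes rollapplyr() pass each 12-row window of .SD to stdev() as one matrix, so the regression sees Return and VWReturn together, and fill = NA pads the first 11 rows of each PERMNO, as visible above.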
I'm trying to do exactly what reshape from the stats package is designed for. I have a wide dataset with a series of variables of the form var_name.date. Unfortunately, reshape seems ill-equipped to deal with even medium-sized datasets, so I'm trying to use the data.table melt function.
My main problem is grouping the variables into separate value columns based on their long-form variable. Is this possible, or do I need to do each one separately and then cbind them?
Here is what I have:
widetable = data.table("id"=1:5,"A.2012-10"=runif(5),"A.2012-11"=runif(5),
"B.2012-10"=runif(5),"B.2012-11"=runif(5))
id A.2012-10 A.2012-11 B.2012-10 B.2012-11
1: 1 0.82982349 0.2257782 0.46390924 0.4448248
2: 2 0.46136746 0.2184797 0.05640388 0.4772663
3: 3 0.61723234 0.3950625 0.03252784 0.4006974
4: 4 0.19963437 0.7028052 0.06811452 0.3096969
5: 5 0.09575389 0.5510507 0.76059610 0.8630222
And here is the stats package's reshape mocking me with one-line awesomeness, doing exactly what I want but not scaling.
reshape(widetable, idvar="id", varying=colnames(widetable)[2:5],
sep=".", direction="long")
id time A B
1: 1 2012-10 0.82982349 0.46390924
2: 2 2012-10 0.46136746 0.05640388
3: 3 2012-10 0.61723234 0.03252784
4: 4 2012-10 0.19963437 0.06811452
5: 5 2012-10 0.09575389 0.76059610
6: 1 2012-11 0.22577823 0.44482478
7: 2 2012-11 0.21847969 0.47726629
8: 3 2012-11 0.39506249 0.40069737
9: 4 2012-11 0.70280519 0.30969695
10: 5 2012-11 0.55105075 0.86302220
This is just one of those times when reshape() is more straightforward to use.
The most direct approach using a combination of melt and dcast.data.table that I can think of is as follows:
library(data.table)
library(reshape2)

longtable <- melt(widetable, id.vars = "id")
vars <- do.call(rbind, strsplit(as.character(longtable$variable), ".", fixed = TRUE))
dcast.data.table(longtable[, c("V1", "V2") := lapply(1:2, function(x) vars[, x])],
                 id + V2 ~ V1, value.var = "value")
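Splitting on a literal "." separates each name like "A.2012-10" into its stub ("A") and its date ("2012-10"); those pieces become the V1 and V2 columns, and the dcast formula then spreads the stubs back out as separate value columns.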
An alternative is to use merged.stack from my "splitstackshape" package, specifically the development version.
# library(devtools)
# install_github("splitstackshape", "mrdwab", ref = "devel")
library(splitstackshape)
merged.stack(widetable, id.vars = "id", var.stubs = c("A", "B"), sep = "\\.")
# id .time_1 A B
# 1: 1 2012-10 0.26550866 0.2059746
# 2: 1 2012-11 0.89838968 0.4976992
# 3: 2 2012-10 0.37212390 0.1765568
# 4: 2 2012-11 0.94467527 0.7176185
# 5: 3 2012-10 0.57285336 0.6870228
# 6: 3 2012-11 0.66079779 0.9919061
# 7: 4 2012-10 0.90820779 0.3841037
# 8: 4 2012-11 0.62911404 0.3800352
# 9: 5 2012-10 0.20168193 0.7698414
# 10: 5 2012-11 0.06178627 0.7774452
The merged.stack function works differently from a simple melt because it starts by "stacking" different groups of columns in a list and then merging them together. This allows the function to:
Work with column groups where each column group might be of a different type (character, numeric, and so on).
Work with "unbalanced" column groups (where one group might have two measure columns and another might have three).
This answer is based on the following sample data:
set.seed(1) # Please use `set.seed()` when sharing an example with random numbers
widetable = data.table("id"=1:5,"A.2012-10"=runif(5),"A.2012-11"=runif(5),
"B.2012-10"=runif(5),"B.2012-11"=runif(5))
See also: What reshaping problems can melt/cast not solve in a single step?