How do I speed up rolling regressions?

I have a large panel of firms' stock returns together with the value-weighted S&P500 return. I want to apply a rolling-window regression in which I regress each firm's returns over the previous twelve months on the value-weighted S&P500 return, and extract the standard deviation of the residuals (the residual standard error).
My code looks as follows:
stdev <- matrix(NA, nrow = nrow(ReturnMatrix), ncol = 1)
pb <- winProgressBar(title = "", label = "", min = 1,
                     max = nrow(ReturnMatrix) - 11)
for (i in 1:(nrow(ReturnMatrix) - 11)) {
  VWRet <- ReturnMatrix$VWReturn[i:(i + 11)]
  Ret <- ReturnMatrix$Return[i:(i + 11)]
  if (sum(is.na(Ret)) >= 6) {
    stdev[i + 11] <- NA
  } else {
    Model <- glm(Ret ~ VWRet - 1)
    stdev[i + 11] <- sigma(Model)
  }
  setWinProgressBar(pb, value = i,
                    title = paste0(round(100 * (i / (nrow(ReturnMatrix) - 11)), 2), " % Done"))
}
SD <- cbind.data.frame(ReturnMatrix, stdev)
The dataframe ReturnMatrix is very large: it contains 3,239,065 rows. Its variables are PERMNO, a firm identifier; YearMonth, the date in YYYYMM format; Return, the firm's return for that month; and VWReturn, the value-weighted S&P500 return.
Right now this for loop takes about an hour to run.
My question is: is there any way to speed this process up a notch? I have tried using rollapply on zoo(ReturnMatrix), but that only slowed things down even more.
Any help would be greatly appreciated.

Here's how to do that with data.table, which should be the fastest way to do what you want. You first need to build a sigma function and then use rollapplyr (from the zoo package) with .SD.
set.seed(1)
library(data.table)
library(zoo)  # for rollapplyr
dt <- data.table(PERMNO = rep(LETTERS[1:3], each = 13),
                 YearMonth = seq.Date(from = Sys.Date(), by = "month", length.out = 13),
                 Return = runif(39), VWReturn = runif(39))
# create sigma function
stdev <- function(x) sd(lm(x[, 1] ~ x[, 2])$residuals)
# create new column with rollapply
dt[, roll_sd := rollapplyr(.SD, 12, stdev, by.column = FALSE, fill = NA),
   by = .(PERMNO), .SDcols = c("Return", "VWReturn")]
PERMNO YearMonth Return VWReturn roll_sd
1: A 2017-11-19 0.26550866 0.41127443 NA
2: A 2017-12-19 0.37212390 0.82094629 NA
3: A 2018-01-19 0.57285336 0.64706019 NA
4: A 2018-02-19 0.90820779 0.78293276 NA
5: A 2018-03-19 0.20168193 0.55303631 NA
6: A 2018-04-19 0.89838968 0.52971958 NA
7: A 2018-05-19 0.94467527 0.78935623 NA
8: A 2018-06-19 0.66079779 0.02333120 NA
9: A 2018-07-19 0.62911404 0.47723007 NA
10: A 2018-08-19 0.06178627 0.73231374 NA
11: A 2018-09-19 0.20597457 0.69273156 NA
12: A 2018-10-19 0.17655675 0.47761962 0.3181427
13: A 2018-11-19 0.68702285 0.86120948 0.3141638
14: B 2017-11-19 0.38410372 0.43809711 NA
....
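For this particular model you can also avoid fitting a regression in every window altogether. With a single regressor and no intercept (the question's Ret ~ VWRet - 1), beta = sum(xy)/sum(x^2) and RSS = sum(y^2) - beta*sum(xy), so the residual standard error sqrt(RSS/(n - 1)) can be assembled from three rolling sums. A minimal sketch, not part of the answer above, assuming data.table >= 1.12.0 for frollsum() and, for brevity, dropping the question's tolerance for windows with a few NAs (frollsum() returns NA for any window containing one):
# rolling sums per firm; frollsum() aligns right and fills the first 11 slots with NA
dt[, `:=`(sxy = frollsum(Return * VWReturn, 12),
          sxx = frollsum(VWReturn^2, 12),
          syy = frollsum(Return^2, 12)), by = PERMNO]
# sigma = sqrt(RSS / (n - p)) with n = 12 observations and p = 1 parameter;
# pmax() guards against tiny negative RSS values from floating-point error
dt[, roll_sd_fast := sqrt(pmax(syy - sxy^2 / sxx, 0) / 11)]
dt[, c("sxy", "sxx", "syy") := NULL]  # drop the helper columns
Because this is fully vectorized, it should scale to millions of rows far better than a per-window lm() or glm() fit.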

Related

How to create a data.frame from lists with different numbers of rows (RData)

I have a file in RData format: https://stepik.org/media/attachments/course/724/all_data.Rdata. This file contains 7 lists with patients' ids and temperatures.
I need to make one data.frame from these lists and then remove all rows with NA values.
id temp i.temp i.temp.1 i.temp.2 i.temp.3 i.temp.4 i.temp.5
1: 1 36.70378 36.73161 36.22944 36.05907 35.66014 37.32798 35.88121
2: 2 36.43545 35.96814 36.86782 37.20890 36.45172 36.82727 36.83450
3: 3 36.87599 36.38842 36.70508 37.44710 36.73362 37.09359 35.92993
4: 4 36.17120 35.95853 36.33405 36.45134 37.17186 36.87482 35.45489
5: 5 37.20341 37.04881 36.53252 36.22922 36.78106 36.89219 37.13207
6: 6 36.12201 36.53433 37.29784 35.96451 36.70838 36.58684 36.60122
7: 7 36.92314 36.16220 36.48154 37.05324 36.57829 36.24955 37.23835
8: 8 35.71390 37.26879 37.01673 36.65364 36.89143 36.46331 37.15398
9: 9 36.63558 37.03452 36.40129 37.53705 36.03568 36.78083 36.71873
10: 10 36.77329 36.07161 36.42992 36.20715 36.78880 36.79875 36.15004
11: 11 36.66199 36.74958 36.28661 36.72539 36.17700 37.47495 35.60980
12: 12 NA 36.97689 36.00473 36.64292 35.96789 36.73904 36.93957
13: 13 NA NA NA NA NA 36.63760 36.83916
14: 14 37.40307 35.89668 36.30619 36.64382 37.21882 35.87420 35.45550
15: 15 NA NA NA 37.03758 36.72512 36.45281 37.54388
16: 16 NA 36.44912 36.57126 36.20703 36.83076 36.48287 35.99391
17: 17 NA NA NA 36.39900 36.54043 36.75989 36.47079
18: 18 36.51696 37.09903 37.31166 36.51000 36.42414 36.87976 36.45736
19: 19 37.05117 37.42526 36.15820 36.11824 37.07024 36.60699 36.80168
20: 20 NA NA NA NA NA NA 36.74118
I wrote:
load("https://stepik.org/media/attachments/course/724/all_data.Rdata")
library(data.table)
day1<-as.data.table(all_data[1])
day2<-as.data.table(all_data[2])
day3<-as.data.table(all_data[3])
day4<-as.data.table(all_data[4])
day5<-as.data.table(all_data[5])
day6<-as.data.table(all_data[6])
day7<-as.data.table(all_data[7])
setkey(day1, id)
setkey(day2, id)
setkey(day3, id)
setkey(day4, id)
setkey(day5, id)
setkey(day6, id)
setkey(day7, id)
all_day<-day1[day2,][day3, ][day4,][day5,][day6,][day7,]
all_day<-na.omit(all_day)
But it takes too long. How can I make it faster?
Here is a data.table solution:
library(data.table)
# set names for all_data
names(all_data) <- paste0("day", seq_along(all_data))
# bind the lists into one long data.table
DT <- rbindlist(all_data, use.names = TRUE, fill = TRUE, idcol = "day")
# cast to wide
ans <- dcast(DT, id ~ day, value.var = "temp")
# only keep complete rows and print the output (using [] at the end)
ans[complete.cases(ans), ][]
# id day1 day2 day3 day4 day5 day6 day7
# 1: 1 36.70378 36.73161 36.22944 36.05907 35.66014 37.32798 35.88121
# 2: 2 36.43545 35.96814 36.86782 37.20890 36.45172 36.82727 36.83450
# 3: 3 36.87599 36.38842 36.70508 37.44710 36.73362 37.09359 35.92993
# 4: 4 36.17120 35.95853 36.33405 36.45134 37.17186 36.87482 35.45489
# 5: 5 37.20341 37.04881 36.53252 36.22922 36.78106 36.89219 37.13207
# 6: 6 36.12201 36.53433 37.29784 35.96451 36.70838 36.58684 36.60122
# 7: 7 36.92314 36.16220 36.48154 37.05324 36.57829 36.24955 37.23835
# 8: 8 35.71390 37.26879 37.01673 36.65364 36.89143 36.46331 37.15398
# 9: 9 36.63558 37.03452 36.40129 37.53705 36.03568 36.78083 36.71873
# 10: 10 36.77329 36.07161 36.42992 36.20715 36.78880 36.79875 36.15004
# 11: 11 36.66199 36.74958 36.28661 36.72539 36.17700 37.47495 35.60980
# 12: 14 37.40307 35.89668 36.30619 36.64382 37.21882 35.87420 35.45550
# 13: 18 36.51696 37.09903 37.31166 36.51000 36.42414 36.87976 36.45736
# 14: 19 37.05117 37.42526 36.15820 36.11824 37.07024 36.60699 36.80168
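If the goal is only to drop ids with incomplete records, a grouped filter on the long table works as well before casting. A sketch under the assumption that each id appears at most once per day:
# keep only ids with a non-missing temp on all 7 days, then cast to wide
complete_ids <- DT[, .(ok = sum(!is.na(temp)) == 7L), by = id][ok == TRUE, id]
dcast(DT[id %in% complete_ids], id ~ day, value.var = "temp")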

How to insert NA values in a ts object to fill the gap with another time series?

(I'm new to R.) I have two time series of different lengths, one starting in Jan 2011 (ts1) and the other in Jan 2016 (ts2).
How can I fill the interval from Jan 2011 to Dec 2015 in ts2 with NA values to align it with ts1?
Say you have two time-series data.tables of different lengths:
dt1 = data.table(
Date = seq(as.Date('2000-01-01'), as.Date('2000-01-10'), by = 1),
Return1 = rnorm(10)
)
dt2 = data.table(
Date = seq(as.Date('2000-01-05'), as.Date('2000-01-10'), by = 1),
Return2 = rnorm(6)
)
You can call merge() on the two data.tables and supply the variable you want to merge them by, in this case "Date". Furthermore, we pass the all = TRUE argument in order to keep rows which are not in the intersection of Date entries across dt1 and dt2.
dtmain = merge(dt1, dt2, by = 'Date', all = TRUE)
> dtmain
Date Return1 Return2
1: 2000-01-01 -2.9934945 NA
2: 2000-01-02 -0.6712139 NA
3: 2000-01-03 0.2146184 NA
4: 2000-01-04 1.2342134 NA
5: 2000-01-05 0.3276646 -2.35205416
6: 2000-01-06 1.1823349 0.39382064
7: 2000-01-07 -0.8771251 0.72213968
8: 2000-01-08 -0.8145120 -0.15433887
9: 2000-01-09 1.0455526 0.05794934
10: 2000-01-10 -1.2378961 -0.49929648
Consider now if you have three or more time-series data.table objects:
dt3 = data.table(
Date = seq(as.Date('2000-01-02'), as.Date('2000-01-08'), by = 1),
Return3 = rnorm(7)
)
If you want to merge them all, you can chain the merges with Reduce():
dtlist = list(dt1, dt2, dt3) # Put your TS objects in a list
by = 'Date' # Declare the variable you want to merge the tables on
dtmain = Reduce(function(...) merge(..., all = TRUE, by = by), dtlist)
> dtmain
Date Return1 Return2 Return3
1: 2000-01-01 0.45667875 NA NA
2: 2000-01-02 -0.84284705 NA 0.7747270
3: 2000-01-03 0.58849764 NA -0.4224948
4: 2000-01-04 -0.76110475 NA -0.7372464
5: 2000-01-05 0.72950287 -0.6800249 -0.6412878
6: 2000-01-06 1.65512675 -0.9477490 0.4073604
7: 2000-01-07 -0.56407002 0.9283520 0.3264292
8: 2000-01-08 0.05535025 1.7146754 0.7125701
9: 2000-01-09 0.06031502 1.2413374 NA
10: 2000-01-10 -0.23840704 0.3846532 NA
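Since the question mentions ts objects specifically, it is worth noting that base R's ts.union() does this alignment directly, padding the non-overlapping span with NA. A minimal sketch with made-up monthly series matching the question's dates:
# two monthly series starting in Jan 2011 and Jan 2016, as in the question
ts1 <- ts(rnorm(108), start = c(2011, 1), frequency = 12)
ts2 <- ts(rnorm(48),  start = c(2016, 1), frequency = 12)
aligned <- ts.union(ts1, ts2)  # the ts2 column is NA from Jan 2011 to Dec 2015
window(aligned, start = c(2015, 11), end = c(2016, 2))  # inspect the seam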
Welcome to StackOverflow! In the future, please include an example of your data so that we can test the code before providing an answer. In this case, any time series objects with different start dates would suffice; I have had to find my own data to answer your question.
First I load stock price data into R with the quantmod package. This returns objects of class xts, which is convenient here. I've loaded AAPL, which starts in 2011, and GOOG, which starts in 2016. The easiest way to achieve what you want is to create a new xts object covering 2011 to 2016, fill it with NAs, and then prepend it to the shorter series, in this case GOOG.
library(quantmod)
getSymbols('AAPL', from = "2011-01-01", to = "2019-09-30")
getSymbols("GOOG", from = '2016-01-01', to = "2019-09-30")
new_rows <- nrow(AAPL) - nrow(GOOG)
temp <- matrix(NA, nrow = new_rows, ncol = ncol(GOOG))
temp <- xts(temp, order.by = index(AAPL[1:new_rows, , drop = FALSE]))
column_names <- colnames(GOOG)
GOOG <- rbind(temp, GOOG)
colnames(GOOG) <- column_names
nrow(AAPL)==nrow(GOOG)
[1] TRUE
Now GOOG has the same start date as AAPL, with NAs from January 2011 through December 2015.
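An alternative sketch worth mentioning, starting again from the freshly downloaded AAPL and GOOG objects: merge.xts performs an outer join on the time index by default, so the NA padding can be had in one step (the standard quantmod column names are assumed, which do not collide between the two symbols):
# outer join on dates; rows where GOOG has no data are filled with NA
combined <- merge(AAPL, GOOG)
GOOG_padded <- combined[, colnames(GOOG)]
nrow(GOOG_padded) == nrow(AAPL)  # TRUE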

Automatically expanding a data frame with NA values across any number of columns for missing dates

I'm interested in expanding a data frame with missing values across any number of columns for the periods where data is missing, in the same unit as the data.
Example
The problem is easily illustrated with a simple example.
Data
The generated data contain some time-series observations, with dates missing at random.
# Data generation
# Seed
set.seed(1)
# Size
sizeDf <- 10
# Populate data frame
dta <- data.frame(
dates = seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
),
varA = runif(n = sizeDf),
varB = runif(n = sizeDf),
varC = runif(n = sizeDf)
)
# Delete rows
dta <-
dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)),]
Preview
>> dta
dates varA varB varC
1 2016-07-28 0.26550866 0.2059746 0.93470523
2 2016-07-29 0.37212390 0.1765568 0.21214252
3 2016-07-30 0.57285336 0.6870228 0.65167377
4 2016-07-31 0.90820779 0.3841037 0.12555510
7 2016-08-03 0.94467527 0.7176185 0.01339033
8 2016-08-04 0.66079779 0.9919061 0.38238796
9 2016-08-05 0.62911404 0.3800352 0.86969085
10 2016-08-06 0.06178627 0.7774452 0.34034900
Key characteristics
From the perspective of the proposed analysis, the key characteristics are:
The date unit: days, in this case
Randomly missing dates
Missing dates
fullDates <- seq(from = Sys.Date() - (sizeDf - 1), to = Sys.Date(), by = 1)
fullDates[!(fullDates %in% dta$dates)]
"2016-08-01" "2016-08-02"
Desired results
The newly created data frame should look like this:
>> dtaNew
dates varA varB varC
1 2016-07-28 0.3337749 0.32535215 0.8762692
2 2016-07-29 0.4763512 0.75708715 0.7789147
3 2016-07-30 0.8921983 0.20269226 0.7973088
4 2016-07-31 0.8643395 0.71112122 0.4552745
5 2016-08-01 NA NA NA
6 2016-08-02 NA NA NA
7 2016-08-03 0.9606180 0.14330438 0.6049333
8 2016-08-04 0.4346595 0.23962942 0.6547239
9 2016-08-05 0.7125147 0.05893438 0.3531973
10 2016-08-06 0.3999944 0.64228826 0.2702601
This is simply obtained with:
dtaNew[dtaNew$dates %in% missDates, 2:4] <- NA
where missDates holds the missing dates found above.
Attempts
Creating a vector with all the dates is simple:
allDates <- seq(from = min(dta$dates), to = max(dta$dates), by = 1)
but obviously I cannot just push it into the data frame:
>> dta$allDates <- allDates
Error in `$<-.data.frame`(`*tmp*`, "allDates", value = c(17010, 17011, :
replacement has 10 rows, data has 8
A possible solution would be a loop that pushes a row of NA values into the data frame for each date identified as missing, but this is grossly inefficient and messy.
To sum up, I'm interested in achieving the following:
Expanding the data frame with all the dates in the same unit: for missing daily data days are added, for missing quarterly data quarters are added.
Pushing NA values across all columns of the data frame wherever a missing date was found.
If I understand your question, you can use rbind.fill from the plyr package to get your desired output:
sizeDf <- 10
# Populate data frame
dta <- data.frame(
dates = seq(
from = Sys.Date() - (sizeDf - 1),
to = Sys.Date(),
by = 1
),
varA = runif(n = sizeDf),
varB = runif(n = sizeDf),
varC = runif(n = sizeDf)
)
# Delete rows
dta <- dta[-sample(1:sizeDf, replace = TRUE, size = round(sqrt(sizeDf), 0)), ]
# Get missing dates
allDates <- seq(from = min(dta$dates), to = max(dta$dates), by = 1)
missing_dates <- allDates[!(allDates %in% dta$dates)]
# Create the new dataset using plyr's rbind.fill function
dta_new <- plyr::rbind.fill(dta, data.frame(dates = missing_dates))
# Order the data by the dates column
dta_new <- dta_new[order(dta_new$dates), ]
# Print it
print(dta_new, row.names = FALSE, right = FALSE)
dates varA varB varC
2016-07-28 0.837859418 0.2966637 0.61245244
2016-07-29 0.144884547 0.9284294 0.11033990
2016-07-30 NA NA NA
2016-07-31 NA NA NA
2016-08-01 0.003167049 0.9096805 0.29239470
2016-08-02 0.574859760 0.1466993 0.69541969
2016-08-03 NA NA NA
2016-08-04 0.748639215 0.9602836 0.67681826
2016-08-05 0.983939562 0.4867804 0.35270309
2016-08-06 0.383366957 0.2241982 0.09244522
I hope this helps.
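For reference, the same expansion can also be done in base R with an outer join against the complete date sequence; a minimal sketch using the dta generated above:
# left-join dta onto the full daily sequence; absent dates get NA in all columns
allDates <- seq(from = min(dta$dates), to = max(dta$dates), by = 1)
dta_new2 <- merge(data.frame(dates = allDates), dta, by = "dates", all.x = TRUE)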

How to do a data.table rolling join?

I have two data tables that I'm trying to merge. One is data on company market values through time and the other is company dividend history through time. I'm trying to find out how much each company has paid each quarter and put that value next to the market value data through time.
library(magrittr)
library(data.table)
library(zoo)
library(lubridate)
set.seed(1337)
# data table of company market values
companies <-
data.table(companyID = 1:10,
Sedol = rep(c("91772E", "7A662B"), each = 5),
Date = (as.Date("2005-04-01") + months(seq(0, 12, 3))) - days(1),
MktCap = c(100 + cumsum(rnorm(5,5)),
50 + cumsum(rnorm(5,1,5)))) %>%
setkey(Sedol, Date)
# data table of dividends
dividends <-
data.table(DivID = 1:7,
Sedol = c(rep('91772E', each = 4), rep('7A662B', each = 3)),
Date = as.Date(c('2004-11-19', '2005-01-13', '2005-01-29',
'2005-10-01', '2005-06-29', '2005-06-30',
'2006-04-17')),
DivAmnt = rnorm(7, .8, .3)) %>%
setkey(Sedol, Date)
I believe this is a situation where you could use a data.table rolling join, something like:
dividends[companies, roll = "nearest"]
to try and get a dataset that looks like
DivID Sedol Date DivAmnt companyID MktCap
1: NA 7A662B <NA> NA 6 61.21061
2: 5 7A662B 2005-06-29 0.7772631 7 66.92951
3: 6 7A662B 2005-06-30 1.1815343 7 66.92951
4: NA 7A662B <NA> NA 8 78.33914
5: NA 7A662B <NA> NA 9 88.92473
6: NA 7A662B <NA> NA 10 87.85067
7: 2 91772E 2005-01-13 0.2964291 1 105.19249
8: 3 91772E 2005-01-29 0.8472649 1 105.19249
9: NA 91772E <NA> NA 2 108.74579
10: 4 91772E 2005-10-01 1.2467408 3 113.42261
11: NA 91772E <NA> NA 4 120.04491
12: NA 91772E <NA> NA 5 124.35588
(note that I've matched the dividends to the company market values by the exact quarter)
But I'm not exactly sure how to execute it. The CRAN pdf is rather vague about what the number should be when roll is given a value (Can you pass dates? Does the number quantify the days to carry forward? The number of observations?) and changing rollends around doesn't seem to get me what I want.
In the end, I mapped the dividend dates to their quarter ends and joined on that. A good solution, but not useful if I ever actually need a rolling join. In your answer, could you describe a situation where a rolling join is the only solution, and help me understand how to perform one?
Instead of a rolling join, you may want to use an overlap join with the foverlaps function of data.table:
# create an interval in the 'companies' datatable
companies[, `:=` (start = compDate - days(90), end = compDate + days(15))]
# create a second date in the 'dividends' datatable
dividends[, Date2 := divDate]
# set the keys for the two data.tables
setkey(companies, Sedol, start, end)
setkey(dividends, Sedol, divDate, Date2)
# create a vector of columnnames which can be removed afterwards
deletecols <- c("Date2","start","end")
# perform the overlap join and remove the helper columns
res <- foverlaps(companies, dividends)[, (deletecols) := NULL]
the result:
> res
Sedol DivID divDate DivAmnt companyID compDate MktCap
1: 7A662B NA <NA> NA 6 2005-03-31 61.21061
2: 7A662B 5 2005-06-29 0.7772631 7 2005-06-30 66.92951
3: 7A662B 6 2005-06-30 1.1815343 7 2005-06-30 66.92951
4: 7A662B NA <NA> NA 8 2005-09-30 78.33914
5: 7A662B NA <NA> NA 9 2005-12-31 88.92473
6: 7A662B NA <NA> NA 10 2006-03-31 87.85067
7: 91772E 2 2005-01-13 0.2964291 1 2005-03-31 105.19249
8: 91772E 3 2005-01-29 0.8472649 1 2005-03-31 105.19249
9: 91772E NA <NA> NA 2 2005-06-30 108.74579
10: 91772E 4 2005-10-01 1.2467408 3 2005-09-30 113.42261
11: 91772E NA <NA> NA 4 2005-12-31 120.04491
12: 91772E NA <NA> NA 5 2006-03-31 124.35588
In the meantime the data.table authors have introduced non-equi joins (v1.9.8). You can also use those to solve this problem. With a non-equi join you just need:
companies[, `:=` (start = compDate - days(90), end = compDate + days(15))]
dividends[companies, on = .(Sedol, divDate >= start, divDate <= end)]
to get the intended result.
Used data (the same as in the question, but without the creation of the keys):
set.seed(1337)
companies <- data.table(companyID = 1:10, Sedol = rep(c("91772E", "7A662B"), each = 5),
compDate = (as.Date("2005-04-01") + months(seq(0, 12, 3))) - days(1),
MktCap = c(100 + cumsum(rnorm(5,5)), 50 + cumsum(rnorm(5,1,5))))
dividends <- data.table(DivID = 1:7, Sedol = c(rep('91772E', each = 4), rep('7A662B', each = 3)),
divDate = as.Date(c('2004-11-19','2005-01-13','2005-01-29','2005-10-01','2005-06-29','2005-06-30','2006-04-17')),
DivAmnt = rnorm(7, .8, .3))
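To address the explicit question about how a rolling join itself is performed (the answer above deliberately avoids one): in x[i, roll = ...], data.table matches all key columns except the last exactly and rolls on the last one. A sketch using the data above; which table to roll onto which is an assumption about what is wanted:
setkey(dividends, Sedol, divDate)
setkey(companies, Sedol, compDate)
# for each company quarter-end, take the most recent dividend on or before
# that date (last observation carried forward within each Sedol):
dividends[companies, roll = TRUE]
# a number limits how far a value is carried: roll = 90 carries at most 90
# days; roll = -Inf rolls backwards (next observation carried back);
# roll = "nearest" takes whichever dividend date is closest.
dividends[companies, roll = 90]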

R: Using quantmod's Delt in a data.table

Following "R data.table Return calculation and set()",
I would like to ask how I can use Delt() from library(quantmod) to compute returns for a time series in a data.table.
So far, thanks to Frank, I have:
set.seed(42)
DT <- data.table(
ticker=rep(letters,each=5),
priceA=runif(5*26^2),
priceB=runif(5*26^2))
DT[,paste('returns',LETTERS[1:2],sep=''):={
lapply(.SD,function(x){
old <- head(x,-1)
new <- tail(x,-1)
c(NA,(new-old)/old)
})
},by=ticker,.SDcols=grep('^price',names(DT))]
The result (for this seed value) is:
ticker priceA priceB returnsA returnsB
1: a 0.9148060 0.7956245 NA NA
2: a 0.9370754 0.9314941 0.02434327 0.1707710
3: a 0.2861395 0.6269996 -0.69464620 -0.3268883
4: a 0.8304476 0.1666758 1.90224707 -0.7341691
5: a 0.6417455 0.6483800 -0.22722939 2.8900659
---
3376: z 0.2887293 0.3473923 -0.54132570 -0.3514041
3377: z 0.9013438 0.1788842 2.12176058 -0.4850656
3378: z 0.3126429 0.7648157 -0.65313686 3.2754788
3379: z 0.8791381 0.1300418 1.81195584 -0.8299698
3380: z 0.8160158 0.8159330 -0.07180019 5.2743905
How can I use Delt() (or a similar percent-change function) instead of
function(x){
  old <- head(x, -1)
  new <- tail(x, -1)
  c(NA, (new - old) / old)
}?
Many many thanks!
You just need to convert the return value of Delt, a one-column xts object, to a simple vector:
DT[,
paste('returns',LETTERS[1:2],sep=''):=lapply(.SD,function(x) c(Delt(x))),
by=ticker,.SDcols=grep('^price',names(DT))
]
This produces:
ticker priceA priceB returnsA returnsB
1: a 0.9148060 0.7956245 NA NA
2: a 0.9370754 0.9314941 0.02434327 0.1707710
3: a 0.2861395 0.6269996 -0.69464620 -0.3268883
4: a 0.8304476 0.1666758 1.90224707 -0.7341691
5: a 0.6417455 0.6483800 -0.22722939 2.8900659
---
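A short note on why the c() is needed, plus Delt's lag argument; the k = 3 call below is just an illustration, not part of the original answer:
# Delt() returns a one-column xts/zoo matrix; c() (or as.vector()) strips the
# time-series attributes so data.table can store it as a plain column.
# Delt's k argument sets the lag, e.g. 3-period percent changes:
DT[, paste0('returns3', LETTERS[1:2]) := lapply(.SD, function(x) c(Delt(x, k = 3))),
   by = ticker, .SDcols = grep('^price', names(DT))]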
