Rolling multi-regression in R data.table

Say I have an R data.table DT which has a list of returns:
Date Return
2016-01-01 -0.01
2016-01-02 0.022
2016-01-03 0.1111
2016-01-04 -0.006
...
I want to do a rolling multiple regression of the previous N observations of Return predicting the next Return, over some window K. E.g., over the last K = 120 days, regress on the last N = 14 observations to predict the next observation. Once I have this regression I want to use the predict function to get a prediction for each row based on that regression. In pseudocode it would be something like:
DT[, Prediction := predict(lm(Return[prev K - N -1] ~ Return[N observations prev for each observation]), Return[N observations previous for this observation])]
To be clear, I want to do a multiple regression, so if N were 3 it would be:
lm(Return ~ Return[-1] + Return[-2] + Return[-3]) ## where the negatives denote previous rows
How do I write this as efficiently as possible?
Thanks

If I understand correctly you want a quarterly auto-regression.
There's a related thread on time-series with data.table here.
You can set up a rolling date in data.table like this (see the link above for more context):
#Example for quarterly data
quarterly[, rollDate:=leftBound]
storeData[, rollDate:=date]
setkey(quarterly,"rollDate")
setkey(storeData,"rollDate")
Since you only provided a few rows of example data, I extended the series through 2019 and made up random return values.
First, get your data set up:
require(forecast)
require(xts)
DT <- read.table(file("clipboard"))   # paste the example data in via the clipboard
# dput(DT)  -- the dput output was too long to display here
DT[, 1] <- as.POSIXct(strptime(DT[, 1], "%m/%d/%Y"))
DT[, 2] <- as.double(DT[, 2])
dat <- xts(DT$V2, order.by = DT$V1)
x.ts <- to.quarterly(dat)  # aggregate to quarters; K = 120 days is roughly one quarter
dat.Open dat.High dat.Low dat.Close
2016 Q1 1292 1292 1 698
2016 Q2 138 1290 3 239
2016 Q3 451 1285 5 780
2016 Q4 355 1243 27 1193
2017 Q1 878 1279 4 687
2017 Q2 794 1283 12 411
2017 Q3 858 1256 9 1222
2017 Q4 219 1282 15 117
2018 Q1 554 1286 32 432
2018 Q2 630 1272 30 46
2018 Q3 310 1288 18 979
2019 Q1 143 1291 10 184
2019 Q2 250 1289 8 441
2019 Q3 110 1220 23 571
Then you can do a rolling ARIMA model with or without re-estimation like this:
# convert the quarterly closes to a plain ts so window() accepts fractional years
y <- ts(as.numeric(x.ts[, "dat.Close"]), frequency = 4, start = c(2016, 1))
h <- 1                              # one-step-ahead forecasts
n <- 4                              # number of rolling forecasts
fit <- auto.arima(y)
order <- arimaorder(fit)
fcmat <- matrix(0, nrow = n, ncol = h)
for(i in 1:n)
{
  x <- window(y, end = 2017.99 + (i - 1) / 4)
  # order[4:6] applies only if auto.arima chose a seasonal model
  refit <- Arima(x, order = order[1:3], seasonal = order[4:6])
  fcmat[i, ] <- forecast(refit, h = h)$mean
}
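For the without-re-estimation variant mentioned above, forecast::Arima can reapply the already-fitted model to each extended window instead of re-fitting it (see the Hyndman post linked below):
# inside the loop: reuse the existing model's coefficients on the new window
refit <- Arima(x, model = fit)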
Here's a good related resource with several examples of different ways you might construct this: http://robjhyndman.com/hyndsight/rolling-forecasts/

You have to have the lags in the columns anyway, so if I understand you correctly you can do something like this, say for a lag of 3:
setkey(DT, date)
lag_max <- 3
for (i in 1:lag_max) {
  # shift by i, not 1, so each column is a different lag
  set(DT, NULL, paste0("lag", i), shift(DT[["return"]], i, type = "lag"))
}
# na.exclude + fitted() pads the rows lost to leading NAs back in as NA
DT[, prediction := fitted(lm(return ~ lag1 + lag2 + lag3, na.action = na.exclude))]
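As a side note, shift() also accepts a vector of lag lengths, so the loop above can be collapsed into a single assignment:
# create lag1..lag3 in one call; shift() returns a list of shifted vectors
DT[, paste0("lag", 1:lag_max) := shift(return, 1:lag_max, type = "lag")]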

Related

Efficient data.table method to generate additional rows given random numbers

I have a large data.table for which I want to generate a random number (using two columns) and perform a calculation. Then I want to repeat this step 1,000 times. I am looking for a way to do this efficiently without a loop.
Example data:
> dt <- data.table(Group=c(rep("A",3),rep("B",3)),
+                  Year=rep(2020:2022,2),
+                  N=c(300,350,400,123,175,156),
+                  Count=c(25,30,35,3,6,8),
+                  Pop=c(1234,1543,1754,2500,2600,2400))
> dt
Group Year N Count Pop
1: A 2020 300 25 1234
2: A 2021 350 30 1543
3: A 2022 400 35 1754
4: B 2020 123 3 2500
5: B 2021 175 6 2600
6: B 2022 156 8 2400
> dt[, rate := rpois(.N, lambda=Count)/Pop*100000]
> dt[, value := N*(rate/100000)]
> dt
Group Year N Count Pop rate value
1: A 2020 300 25 1234 1944.8947 5.8346840
2: A 2021 350 30 1543 2009.0732 7.0317563
3: A 2022 400 35 1754 1938.4265 7.7537058
4: B 2020 123 3 2500 120.0000 0.1476000
5: B 2021 175 6 2600 115.3846 0.2019231
6: B 2022 156 8 2400 416.6667 0.6500000
I want to be able to run this rate/value calculation 1,000 times and keep all instances (with an indicator column, 1-1,000, showing which run each row belongs to) without using a loop. Any suggestions?
Maybe you can try replicate, as below:
n <- 1000
dt[, paste0(c("rate", "value"), rep(1:n, each = 2)) :=
     replicate(n, list(u <- rpois(.N, lambda = Count) / Pop * 100000,
                       N * (u / 100000)))]

R Programming Newbie! - Median Loop Function Broken

So the function below is using a really big data frame. Two of this DF's columns are the year the houses were built (Built) and the cost (COST06).
I want to input the column name, in this case ds$Built, as the function argument.
YearCount <- the number of unique values (years) in ds$Built
YearList <- a vector of the unique values (years) in ds$Built
Then I want a for loop of YearCount iterations that takes the median of the values in COST06, but only for rows where ds$Built equals YearList[i].
Take the resulting median and append it to the empty vector CostVec.
At the end, make a two-column data frame of YearList and CostVec.
But this does not work. As you can see below, it assigns the same median value (the value from the last iteration of the loop) to every year, rather than computing each year's own median.
I'm a newbie, so thank you very much for your patience.
Median.DF <- function(x)
{
  YearCount <- length(unique(x))
  YearList <- unique(x)
  CostVec <- c()
  for (i in YearCount) {
    imedian <- median(ds[x == YearList, "COST06"], na.rm = TRUE)
    CostVec <- append(CostVec, imedian)
  }
  MedianCost.data <- data.frame(YearList, CostVec)
  return(MedianCost.data)
}
YearList CostVec
1 2004 1629
2 2007 1629
3 2005 1629
4 1980 1629
5 1985 1629
6 2003 1629
7 2008 1629
8 1990 1629
9 1975 1629
10 1970 1629
11 1950 1629
12 1920 1629
13 1960 1629
14 1930 1629
15 1919 1629
16 1940 1629
17 1995 1629
18 2006 1629
19 2009 1629
20 2000 1629
21 2002 1629
22 2001 1629
23 2010 1629
24 2011 1629
25 2012 1629
26 2013 1629
Here is a better (cleaner, more R-like) way to do this. I'll leave it to you to turn it into a function.
df <- data.frame(YearList = sample(2000:2006, 30, replace = TRUE),
                 CostVec = 10^3 + 100 * runif(30, 0, 1))
dfSplit <- split(df, df$YearList)
medianByYears <- lapply(dfSplit, function(x) median(x$CostVec))
medianByYearsClean <- do.call(rbind, medianByYears)
data.frame(Years = rownames(medianByYearsClean),
           MedianPrices = medianByYearsClean, row.names = NULL)
A few other tips:
Before you write a function, test it line by line outside of the function. The main problem here is that
for (i in YearCount) { do stuff }
is analogous to
for (i in 5) { print(i) }
which does one thing, once. You want to do:
for (i in seq(YearCount)) { do stuff }
Also note that inside the loop you subset with YearList rather than YearList[i]; that is a second bug to fix.
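Putting both fixes together, a minimal corrected version of the original function could look like this (a sketch; it assumes the ds data frame and its COST06 column from the question are in scope):
Median.DF <- function(x) {
  YearList <- unique(x)
  CostVec <- numeric(length(YearList))
  for (i in seq_along(YearList)) {
    # subset on the i-th year, not the whole YearList vector
    CostVec[i] <- median(ds[x == YearList[i], "COST06"], na.rm = TRUE)
  }
  data.frame(YearList, CostVec)
}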

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame of hourly observational climate data spanning multiple years; I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,rep=TRUE)
WD <- sample(0:390,8761,rep=TRUE)
Temp <- sample(0:40,8761,rep=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or, in this example, by month) to find whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation; I have real NAs because it is observational climate data.
I have tried dplyr piping with %>% to filter by a new column "Month", and I have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something in a longer script with a looping function that goes through all my stations and all the years at each station, producing a wind rose whenever this criterion is met for that year / station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one appears quite instructive.
First, create a new variable denoting the month (and accounting for the year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It will work even if you have missing data.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
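To then apply the 75% cut-off from the question, you can filter on Obs.Rate (a short follow-on sketch using df2 from above):
# keep only the months meeting the 75% completeness criterion
valid_months <- df2 %>% filter(Obs.Rate >= 0.75) %>% pull(Month)
valid_months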

R - combine rows of a data frame to be unique by 3 columns

I have a data frame looking like this:
> head(temp)
VisitIDCode start stop Value_EVS hr heart rate NU EE0A Value_EVS temp celsius CAL 113C Value_EVS current weight kg CAL
23642 2008253059 695 696 <NA> 36.4 <NA>
24339 2008253059 695 696 132 <NA> <NA>
72450 2008953178 527 528 <NA> 38.6 <NA>
72957 2008953178 527 528 123 <NA> <NA>
73976 2008965669 527 528 <NA> 36.2 <NA>
74504 2008965669 527 528 116 <NA> <NA>
The first and second rows are both for the same patient (same VisitIDCode): in the first row I have the heart-rate value and in the second the temperature value, from time 2 to 3. I want to combine these rows so that the result is one row that looks like:
VisitIDCode start stop Value_EVS hr heart rate NU EE0A Value_EVS temp celsius CAL 113C Value_EVS current weight kg CAL
23642 2008253059 695 696 132 36.4 <NA>
In other words, I want my data frame to be unique by combination of VisitIDCode, start and stop. This is a large dataframe with more columns that need to be combined.
What is the best way of doing this, and if at all possible, avoiding a for loop?
Edit: I don't want to remove the NAs. If there are two rows, each of which has one value and two NAs, I want to combine them into one row with two values and one NA, like the example above.
nasim,
It's useful to create a reproducible example when posting questions. It makes it much easier to sort out how to help. I created a toy example here. Hopefully, that reproduces your issue:
> df <- data.frame(MRN = c(123,125,213,214),
+ VID = c(2008,2008,2011,2011),
+ start=c(695,695),
+ heart.rate = c(NA,112,NA,96),
+ temp = c(39.6,NA,37.4,NA))
> df
MRN VID start heart.rate temp
1 123 2008 695 NA 39.6
2 125 2008 695 112 NA
3 213 2011 695 NA 37.4
4 214 2011 695 96 NA
Here is a solution using dplyr:
> library(dplyr)
> df <- df %>%
+ group_by(VID) %>%
+ summarise(MRN = max(MRN,na.rm=T),
+ start=max(start,na.rm=T),
+ heart.rate=max(heart.rate,na.rm=T),
+ temp = max(temp,na.rm=T))
> df
# A tibble: 2 × 5
VID MRN start heart.rate temp
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2008 125 695 112 39.6
2 2011 214 695 96 37.4
After I made sure all the column classes are numeric (not factors), by defining the classes of the columns while reading the data in, this worked for me:
CompleteCoxObs <- aggregate(
  x = CompleteCoxObs[c("stop",
                       "Value_EVS current weight kg CAL",
                       "Value_EVS hr heart rate NU EE0A",
                       "Value_EVS temp celsius CAL 113C")],
  by = list(VisitIDCode = CompleteCoxObs$VisitIDCode,
            start = CompleteCoxObs$start),
  max, na.rm = FALSE)
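One caveat: max() with na.rm = FALSE returns NA as soon as a group contains any NA, and max() will not work on character columns. Here is a data.table sketch that instead takes the first non-NA value per group, keyed on all three columns as the question asks (column names assumed from the question):
library(data.table)
setDT(CompleteCoxObs)
first_non_na <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) > 0) v[1] else v[NA_integer_]  # stays NA when the whole group is NA
}
combined <- CompleteCoxObs[, lapply(.SD, first_non_na),
                           by = .(VisitIDCode, start, stop)]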

R ddply error: undefined columns selected

I have a time series data set like below:
age time income
16 to 24 2004 q1 400
16 to 24 2004 q2 500
… …
65 and over 2014 q3 800
It has 60 different quarters of income data for each age group. As the income data is seasonal, I am trying to apply a decomposition function to filter out the trend. What I have done so far is below, but R consistently throws errors at me (error message: undefined columns selected). Any idea how to go about it?
fun <- function(x) {
  ts <- ts(x, frequency = 4, start = c(2004, 1))
  ts.d <- decompose(ts, type = 'additive')
  as.vector(ts.d$trend)
}
trend.dt <- ddply(my.dat, .(age), transform, trend = fun(income))
The expected result is below (the NAs are because, after decomposition, the first and last observations have no trend value, but the rest should):
age time income trend
16 to 24 2004 q1 400 NA
16 to 24 2004 q2 500 489
… …
65 and over 2014 q3 800 760
65 and over 2014 q3 810 NA
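For reference, here is a base-R sketch of the same per-group decomposition (assumptions: my.dat has columns age, time and income, and each age group is a complete quarterly series starting 2004 Q1; ave() applies the function within each group and returns a vector of the same length):
trend_of <- function(x) {
  s <- ts(x, frequency = 4, start = c(2004, 1))
  as.vector(decompose(s, type = "additive")$trend)
}
my.dat$trend <- ave(my.dat$income, my.dat$age, FUN = trend_of)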
