How to create a rolling linear regression in R?

I am trying to create (as the title suggests) a rolling linear regression on a set of data: daily returns of two variables, 257 observations each, joined by date, with a rolling window of 100 observations. I have searched for rolling-regression packages but have not found one that works on my data. The two series are stored in one data frame.
Also, I am pretty new to programming, so any advice would help.
Some of the code I have used is below.
WeightedIMV_VIX_returns_combined_ID100896 <- left_join(ID100896_WeightedIMV_daily_returns, ID100896_VIX_daily_returns, by=c("Date"))
head(WeightedIMV_VIX_returns_combined_ID100896, n=20)
plot(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896) # the data seem correlated enough to run a regression; it doesn't matter which variable you put first
ID100896.lm <- lm(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896)
summary(ID100896.lm) #so the estimate Intercept is 1.2370, estimate Slope is 5.8266.
termplot(ID100896.lm)
Again, sorry if this code is poor, or if I am missing any information that some of you may need to help. This is my first time on here! Just let me know what I can do better. Thanks!
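One possible approach (an untested sketch, assuming the column names used above): zoo::rollapply() can refit lm() on each 100-observation window and collect the coefficients.
library(zoo)  # for rollapply()
# Sketch: one (Intercept, slope) pair per 100-row window.
dat <- WeightedIMV_VIX_returns_combined_ID100896
roll_coefs <- rollapply(
  as.matrix(dat[, c("WeightedIMV_returns", "VIX_returns")]),
  width = 100,        # rolling window of 100 observations
  FUN = function(w) coef(lm(WeightedIMV_returns ~ VIX_returns,
                            data = as.data.frame(w))),
  by.column = FALSE,  # pass the whole window to FUN, not one column at a time
  align = "right")    # label each fit by the window's last observation
head(roll_coefs)
With 257 observations and a width of 100, this produces 158 fits, one per window.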

Related

Using bootstrapping to compare full and sample datasets

This is a fairly complicated situation, so I'll try to succinctly explain but feel free to ask for clarification.
I have several datasets of biological data that vary significantly in sample size (e.g., 253-1221 observations/dataset). I need to estimate individual breeding parameters and compare them (for a different analysis), but because of the large sample size differences, I took a subset of data from each dataset so the sample sizes were equal for each comparison. For example, the smallest dataset had 253 observations, so for all the others I used the following code
AT_EABL_subset <- Atlantic_EABL[sample(1:nrow(Atlantic_EABL), 253,replace=FALSE),]
to take a subset of 253 observations from the full dataset (in this case AT_EABL originally had 1,221 observations).
It's now suggested that I use bootstrapping to check if the parameter estimates from my subsets are similar to the full dataset estimates. I'm looking for code that will run, say, 200 iterations of the above subset data and calculate the average of the coefficients so I can compare them to the coefficients from my model with the full dataset. I found a site that uses the sample function to achieve this (https://towardsdatascience.com/bootstrap-regression-in-r-98bfe4ff5007), but when I get to this portion of the code
sample_coef_intercept <-
  c(sample_coef_intercept, model_bootstrap$coefficients[1])
sample_coef_x1 <-
  c(sample_coef_x1, model_bootstrap$coefficients[2])
}
I get
Error: $ operator not defined for this S4 class
Below is the code I'm using. I don't know if I'm getting the above error because of the type of model I'm running (glmer vs. lm used in the link), or if there's a different function that will give me the data I need. Any advice is greatly appreciated.
sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:2) {
  boot.sample = AT_EABL_subset[sample(1:nrow(AT_EABL_subset), nrow(AT_EABL_subset), replace = FALSE), ]
  model_bootstrap <- glmer(cbind(YOUNG_HOST_TOTAL_ATLEAST, CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~ as.factor(YEAR) + (1 | LatLong), binomial, data = boot.sample)
  sample_coef_intercept <- c(sample_coef_intercept, model_bootstrap$coefficients[1])
  sample_coef_x1 <- c(sample_coef_x1, model_bootstrap$coefficients[2])
}
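The error arises because glmer() returns an S4 merMod object, so the $ operator (and $coefficients) from the lm-based tutorial does not apply; fixed-effect estimates are extracted with lme4::fixef(). A sketch of the loop with that substitution, and with replace = TRUE (sampling all rows without replacement only permutes them, so it is not a bootstrap) -- not a verified fix for this exact data:
library(lme4)  # for glmer() and fixef()
sample_coef_intercept <- NULL
sample_coef_x1 <- NULL
for (i in 1:200) {
  # resample WITH replacement for a true bootstrap
  boot.sample <- AT_EABL_subset[sample(nrow(AT_EABL_subset),
                                       nrow(AT_EABL_subset),
                                       replace = TRUE), ]
  model_bootstrap <- glmer(
    cbind(YOUNG_HOST_TOTAL_ATLEAST,
          CLUTCH_SIZE_HOST_ATLEAST - YOUNG_HOST_TOTAL_ATLEAST) ~
      as.factor(YEAR) + (1 | LatLong),
    family = binomial, data = boot.sample)
  fe <- fixef(model_bootstrap)  # named vector of fixed-effect estimates
  sample_coef_intercept <- c(sample_coef_intercept, fe[1])
  sample_coef_x1 <- c(sample_coef_x1, fe[2])
}
mean(sample_coef_intercept)  # compare against the full-data model's intercept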

Removing outliers 3 SDs from the mean of a monoexponential function in R

I have a large data set that analyzes exercising subjects' oxygen consumption over time (x= Time, y = VO2). This data fits a monoexponential function.
Here is a brief, sample data frame:
VO2 <- c(11.71,9.84,17.96,18.87,14.58,13.38,5.89,20.28,20.03,31.17,22.07,30.29,29.08,32.89,29.01,29.21,32.42,25.47,30.51,37.86,23.48,40.27,36.25,19.34,36.53,35.19,22.45,30.23,3,19.48,25.35,22.74)
Time <- c(0,2,27,29,31,33,39,77,80,94,99,131,133,134,135,149,167,170,177,178,183,184,192,222,239,241,244,245,250,251,255,256)
DF <- data.frame(VO2, Time)
[Plot: visual representation of the data. Note that this sample is much smaller than the full data set and therefore might not fit a function as well.]
I am somewhat new to R and very much not a mathematical expert. I would appreciate your help with the two goals of this data set.
1. Based on typical conventions of the laboratory I work in, this data should be fit to a monoexponential function. I would love some insight into fitting data to such a function. Note that I have many similar data sets (for different subjects) and need to fit a monoexponential function to each of them, so ideally the fit can be applied generically across my data sets.
2. Based on this monoexponential function, I would like to identify and remove any outlying points. Here I will define an outlier as any point more than 3 standard deviations from the mean of the monoexponential function.
So far, I have this (unsuccessful) code to fit a function to the above data. Not only does it fit poorly, but I am also unable to draw a smooth curve.
fit <- lm(VO2 ~ poly(Time, 2, raw = TRUE), data = DF)
xx <- seq(1, 250, length.out = 32)
plot(Time, VO2, pch = 19, ylim = c(0, 50))
lines(xx, predict(fit, data.frame(Time = xx)), col = "red")  # newdata must name the model's predictor, Time
Thank you to all the individuals who have commented and provided their valuable feedback. As I continue to learn and research, I will add to this post with successful/less successful attempts at the code for this process. Thank you for your knowledge, assistance and understanding.
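For reference, one possible route (a sketch under assumptions, not the laboratory's confirmed convention): nls() with the self-starting monoexponential SSasymp(), which fits Asym + (R0 - Asym) * exp(-exp(lrc) * Time) without hand-picked starting values, making it reusable across subjects. Residuals more than 3 SDs from the fitted curve are then dropped, which is one reading of "3 standard deviations from the mean of the function".
# Sketch: fit a self-starting monoexponential, draw a smooth curve on a fine
# grid, and flag points > 3 residual SDs from the curve. nls() can still fail
# to converge on noisy data, so this is illustrative rather than guaranteed.
fit_mono <- nls(VO2 ~ SSasymp(Time, Asym, R0, lrc), data = DF)
tt <- seq(min(DF$Time), max(DF$Time), length.out = 200)  # fine grid for a smooth line
plot(VO2 ~ Time, data = DF, pch = 19)
lines(tt, predict(fit_mono, newdata = data.frame(Time = tt)), col = "red")
res  <- resid(fit_mono)
keep <- abs(res) <= 3 * sd(res)  # one interpretation of the 3-SD rule
DF_clean <- DF[keep, ]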

Iterating over each Row of a large dataset R-Studio

Suppose I have a list of 1500000 states with given zip codes, and I want to run my predictor model (basdata) on that list to get predictions of Area. I did this with the help of one gentleman; here is my code:
pred <- sapply(1:nrow(first), function(row) { predict(basdata,first[row, ],estimator="BMA", interval = "predict", se.fit=TRUE)$Ybma })
basdata: My Model
first: My new data set for which I am predicting the area.
Now, the issue I am facing is that the code takes a long time to predict the values: it iterates over every row and calculates the area. There are 150000 rows in my data set, and I would appreciate any help optimizing the performance of this code.
I would like to thank onyambu for providing the solution, as I was just making the predict call more complex than necessary. The following single call predicts the values for every row of the data set using the model built:
predict(basdata,first,estimator="BMA", interval = "predict", se.fit=TRUE)$Ybma
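The speedup comes from calling predict() once on the whole data frame instead of once per row. A self-contained illustration with lm() as a stand-in for the BAS model (basdata and first are the asker's objects and are not reproduced here):
# Stand-in demo: one vectorised predict() call returns the same values as a
# row-by-row sapply() loop, without the per-row overhead.
fit  <- lm(mpg ~ wt + hp, data = mtcars)
newd <- mtcars[, c("wt", "hp")]
rowwise    <- sapply(1:nrow(newd), function(r) predict(fit, newd[r, , drop = FALSE]))
vectorised <- predict(fit, newd)
all.equal(unname(rowwise), unname(vectorised))  # TRUE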

Using the predict() function to make predictions past your existing data

I have a series of algorithms I am running on financial data. For the purposes of this question I have financial market data for a stock with 1226 rows of data.
I run the following code to fit and predict the model:
strat.fit <- glm(DirNDay ~ l_UUP.Close + l_FXE.Close + MA50 + MA10 + RSI06 + BIAS10 + BBands05, data = STCK.df, family = "binomial")
strat.probs <- predict(strat.fit, STCK.df,type="response")
I get probability predictions up to row 1226; I am interested in making a prediction for a new day, which would be row 1227. Attempting a prediction for day 1227 gives the following:
strat.probs[1227]
NA
Any help/suggestions would be appreciated.
The predict function is going to predict the value of DirNDay based on the value of the other variables for that day. If you want it to predict DirNDay for a new day, then you need to provide it with all the other relevant variables for that new day.
It sounds like that's not what you're trying to do, and you need to create a totally different model which uses time (or day) to predict the values. Then you can provide predict with a new time and it can use that to predict a new DirNDay.
There's a free online textbook about forecasting using R by Rob Hyndman if you don't know where to start: https://www.otexts.org/fpp
(But if I totally misunderstood that glm model, then never mind those last two paragraphs.)
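As a pointer only (not a drop-in for this binary DirNDay response), the general fpp pattern is to model a series as a function of time and forecast one step past the data, e.g. with the forecast package:
library(forecast)  # Hyndman's package accompanying the fpp book
# Generic one-step-ahead pattern (placeholder series, not the asker's data):
y   <- ts(rnorm(100))  # stand-in daily series
fit <- auto.arima(y)
forecast(fit, h = 1)   # forecast for the next, unobserved day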
In order to make a prediction for the 1227th day, you'll need to know the values of your explanatory variables (MA50, MA10, etc.) on that day. Store those as a new data frame (say STCK.df.new) and pass it to your predict function:
STCK.df.new <- data.frame(l_UUP.Close = .4, l_FXE.Close = 2, ... )
strat.probs <- predict(strat.fit, STCK.df.new, type = "response")

Evolution of linear regression coefficients over time

I would like to observe the evolution of linear regression coefficients over time. To be more precise, take a time frame of 2 years, where each regression uses a one-year window of data. After the first regression, we move one week forward (i.e. we add a new week, and one week is dropped from the beginning) and run the regression again, until we reach the final date: altogether there will be 52 regressions.
My problem is that there are holidays in the data set, so we cannot simply add 7 days as one might suggest. I would like a wrapper function that does the above for many other fitting functions from different packages, for example forecast.lm() from the forecast package: the objective in every case is to track the evolution of the regression parameters week by week.
I think you might get more answers if you edit/subdivide your question in a clear way. (1) how do I find holidays (it's not clear what your definition of holidays is)? (2) how do I slice up a data set accordingly? (3) how do I run a linear regression in each chunk?
(1) find holidays: can't really help here, as I don't know how they're defined/coded in your data set. library(sos); findFn("holiday") finds some options
(2) partition the data set according to inter-holiday/weekend intervals. The example below supposes holidays are coded as 1 and non-holidays are coded as zero.
(3) run the linear regression for each chunk and extract the coefficients.
d <- data.frame(holiday = c(0,0,0,1,1,0,0,0,0,1,0,0,0,0),
                x = runif(14), y = runif(14))
per <- cumsum(c(1, diff(d$holiday) == -1))  ## each inter-holiday stretch gets its own id; maybe use rle() instead
dd <- with(d, split(subset(d, !holiday), per[!holiday]))  ## list of non-holiday chunks
t(sapply(lapply(dd, lm, formula = y ~ x), coef))  ## one (intercept, slope) row per chunk
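To connect this back to the original week-by-week question: if the window is defined by dates rather than row counts, holidays need no special handling. A sketch of such a wrapper (hypothetical column name date; fit_fun can be lm or any function with a (formula, data) interface):
# Sketch: slide a window_days-wide window forward step_days at a time and
# collect the coefficients from each refit. Assumes df$date is of class Date.
roll_weekly <- function(df, formula, fit_fun = lm,
                        window_days = 365, step_days = 7) {
  starts <- seq(min(df$date), max(df$date) - window_days, by = step_days)
  lapply(starts, function(s) {
    chunk <- df[df$date >= s & df$date < s + window_days, ]
    coef(fit_fun(formula, data = chunk))
  })
}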
