Using the predict() function to make predictions past your existing data - r

I have a series of algorithms I am running on financial data. For the purposes of this question I have financial market data for a stock with 1226 rows of data.
I run the following code to fit the model and generate predictions:
strat.fit <- glm(DirNDay ~ l_UUP.Close + l_FXE.Close + MA50 + MA10 + RSI06 + BIAS10 + BBands05, data = STCK.df, family = "binomial")
strat.probs <- predict(strat.fit, STCK.df, type = "response")
This gives probability predictions up to row 1226. I am interested in making a prediction for a new day, which would be row 1227, but attempting that returns the following:
strat.probs[1227]
NA
Any help/suggestions would be appreciated

The predict function is going to predict the value of DirNDay based on the value of the other variables for that day. If you want it to predict DirNDay for a new day, then you need to provide it with all the other relevant variables for that new day.
It sounds like that's not what you're trying to do, and that you instead need a different model that uses time (or day) to predict the values. Then you can give predict a new time and it can use that to predict a new DirNDay.
There's a free online textbook about forecasting using R by Rob Hyndman if you don't know where to start: https://www.otexts.org/fpp
(But if I totally misunderstood that glm model then nevermind those last two paragraphs.)

In order to make a prediction for day 1227, you'll need to know what the values of your explanatory variables (MA50, MA10, etc.) will be on that day. Store those as a new data frame (say STCK.df.new) and pass that to your predict function:
STCK.df.new <- data.frame(l_UUP.Close = .4, l_FXE.Close = 2, ... )
strat.probs <- predict(strat.fit, STCK.df.new, type = "response")
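As a minimal, self-contained sketch of the same idea (with made-up predictors x1 and x2 standing in for the MA/RSI variables, since the real data frame isn't available here), the key point is that predict() needs a newdata row containing a value for every explanatory variable in the model formula:

```r
# synthetic stand-in for STCK.df: two predictors and a binary direction
set.seed(1)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$dir <- as.integer(df$x1 + df$x2 + rnorm(100) > 0)

fit <- glm(dir ~ x1 + x2, data = df, family = "binomial")

# the "new day" must supply a value for every predictor in the formula
new_day <- data.frame(x1 = 0.5, x2 = -0.2)
p <- predict(fit, newdata = new_day, type = "response")
p  # a single probability between 0 and 1
```

Indexing strat.probs[1227] returns NA simply because the fitted-probability vector only has 1226 elements; the new row has to go through predict() itself.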

Related

How to create a rolling linear regression in R?

As the title suggests, I am trying to create a rolling linear regression on a set of data: daily returns of two variables (257 observations each, linked by date), with a rolling window of 100 observations. I have searched for rolling-regression packages but have not found one that works on my data. The two series are stored in one data frame.
Also, I am pretty new to programming, so any advice would help.
Some of the code I have used is below.
WeightedIMV_VIX_returns_combined_ID100896 <- left_join(ID100896_WeightedIMV_daily_returns, ID100896_VIX_daily_returns, by=c("Date"))
head(WeightedIMV_VIX_returns_combined_ID100896, n=20)
plot(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896) # the data seem correlated enough to run a regression; it doesn't matter which variable you put first
ID100896.lm <- lm(WeightedIMV_returns ~ VIX_returns, data = WeightedIMV_VIX_returns_combined_ID100896)
summary(ID100896.lm) # the estimated intercept is 1.2370 and the estimated slope is 5.8266
termplot(ID100896.lm)
Again, sorry if this code is poor, or if I am missing any information that some of you may need to help. This is my first time on here! Just let me know what I can do better. Thanks!
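Since no packaged solution turned up, one way to sketch this in base R (on synthetic returns standing in for the real data frame, which isn't available here) is to loop over 100-observation windows and store each fit's coefficients:

```r
# rolling OLS with a fixed 100-observation window, base R only
set.seed(42)
n <- 257
dat <- data.frame(
  Date = seq.Date(as.Date("2020-01-01"), by = "day", length.out = n),
  VIX_returns = rnorm(n)
)
# synthetic response with intercept 1.2 and slope 5.8, plus noise
dat$WeightedIMV_returns <- 1.2 + 5.8 * dat$VIX_returns + rnorm(n)

window <- 100
n_fits <- n - window + 1  # 158 overlapping windows
coefs <- data.frame(
  end_date  = dat$Date[window:n],
  intercept = numeric(n_fits),
  slope     = numeric(n_fits)
)
for (i in seq_len(n_fits)) {
  rows <- i:(i + window - 1)
  fit <- lm(WeightedIMV_returns ~ VIX_returns, data = dat[rows, ])
  coefs[i, c("intercept", "slope")] <- coef(fit)
}
head(coefs)
```

Packages such as zoo (rollapply) or rollRegres offer the same idea without the explicit loop, if installing a package is an option.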

Syntax for survival analysis with late-entry

I am trying to fit a survival model with left-truncated data using the survival package; however, I am unsure of the correct syntax.
Let's say we are measuring the effect of age when hired (age) and job type (parttime) on duration of employment of doctors in public health clinics. Whether the doctor quit or was censored is indicated by the censor variable (0 for quitting, 1 for censoring). This behaviour was measured in an 18-month window. Time to either quitting or censoring is indicated by two variables, entry (start time) and exit (stop time), indicating how long, in years, the doctor was employed at the clinic. If doctors commenced employment after the window 'opened', their entry time is set to 0. If they commenced employment prior to the window 'opening', their entry time represents how long they had already been employed in that position when the window 'opened', and their exit time is how long from when they were initially hired until they either quit or were censored by the window 'closing'. We also postulate a two-way interaction between age and duration of employment (exit).
This is the toy data set. It is much smaller than a normal dataset would be, so the estimates themselves are less important than whether the syntax and variables included (using the survival package in R) are correct, given the structure of the data. The toy data has the exact same structure as a dataset discussed in Chapter 15 of Singer and Willett's Applied Longitudinal Data Analysis. I have tried to match the results they report, without success. There is not a lot of explicit information online about how to conduct survival analyses on left-truncated data in R, and the website that provides code for the book (here) does not provide R code for the chapter in question. The methods for modeling time-varying covariates and interaction effects are quite complex in R, and I just wonder if I am missing something important.
Here is the toy data
id <- 1:40
entry <- c(2.3,2.5,2.5,1.2,3.5,3.1,2.5,2.5,1.5,2.5,1.4,1.6,3.5,1.5,2.5,2.5,3.5,2.5,2.5,0.5,rep(0,20))
exit <- c(5.0,5.2,5.2,3.9,4.0,3.6,4.0,3.0,4.2,4.0,2.9,4.3,6.2,4.2,3.0,3.9,4.1,4.0,3.0,2.0,0.2,1.2,0.6,1.9,1.7,1.1,0.2,2.2,0.8,1.9,1.2,2.3,2.2,0.2,1.7,1.0,0.6,0.2,1.1,1.3)
censor <- c(1,1,1,1,0,0,0,0,1,0,0,1,1,1,0,0,0,0,0,0,rep(1,20))
parttime <- c(1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0)
age <- c(34,28,29,38,33,33,32,28,40,30,29,34,31,33,28,29,29,31,29,29,30,37,33,38,34,37,37,40,29,38 ,49,32,30,27,35,34,35,30,35,34)
doctors <- data.frame(id,entry,exit,censor,parttime,age)
Now for the model.
coxph(Surv(entry, exit, 1-censor) ~ parttime + age + age:exit, data = doctors)
Is this the correct way to specify the model given the structure of the data and what we want to know? An answer here suggests it is correct, but I am not sure whether, for example, the interaction variable is correctly specified.
As is often the case, it's not until I post a question about a problem on SO that I work out how to do it myself. If there is an interaction with a time predictor, we need to convert the dataset into a counting-process, person-period format (i.e. a long format). This is because each participant needs an interval that tracks their status with respect to the event at every time point at which the event occurred to anyone else in the data set, up to the point when they exited the study.
First let's make an event variable
doctors$event <- 1 - doctors$censor
Before we run the cox model we need to use the survSplit function in the survival package. To do this we need to make a vector of all the time points when an event occurred
cutPoints <- sort(unique(doctors$exit[doctors$event == 1]))
Now we can pass this into the survSplit function to create a new dataset...
docNew <- survSplit(Surv(entry, exit, event)~.,
data = doctors,
cut = cutPoints,
end = "exit")
... which we then run our model on
coxph(Surv(entry,exit,event) ~ parttime + age + age:exit, data = docNew)
Voila!
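Putting those steps together, here is a self-contained version of the same workflow on synthetic left-truncated data (synthetic so the toy data above isn't retyped; note sort() rather than order() for the cut points, since order() returns indices, not the event times themselves):

```r
library(survival)

# synthetic left-truncated employment data shaped like the toy set:
# 20 doctors already employed when the window opens, 20 hired afterwards
set.seed(1)
n <- 40
entry <- c(runif(20, 0.5, 3.5), rep(0, 20))
exit <- entry + runif(n, 0.2, 3.0)
event <- rbinom(n, 1, 0.5)            # 1 = quit, 0 = censored
parttime <- rbinom(n, 1, 0.2)
age <- sample(27:49, n, replace = TRUE)
doctors <- data.frame(entry, exit, event, parttime, age)

# one cut point per distinct event time
cutPoints <- sort(unique(doctors$exit[doctors$event == 1]))

# expand to person-period format: one row per (person, risk interval)
docNew <- survSplit(Surv(entry, exit, event) ~ ., data = doctors,
                    cut = cutPoints, end = "exit")

fit <- coxph(Surv(entry, exit, event) ~ parttime + age + age:exit,
             data = docNew)
nrow(docNew)  # many more rows than the original 40
```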

Time Series Forecasting using Support Vector Machine (SVM) in R

I've tried searching but couldn't find a specific answer to this question. So far I have been able to work out that time series forecasting is possible using SVM. I've gone through a few papers/articles that perform it, but they don't include any code; they only explain the algorithm (which I didn't quite understand). Some have done it in Python.
My problem here is that I have a company's sales data (say, univariate) from 2010 to 2017, and I need to forecast the sales value for 2018 using SVM in R.
Would you be kind enough to simply present and explain the R code to perform the same using a small example?
I really do appreciate your inputs and efforts!
Thanks!!!
Let's assume you have monthly data, for example derived from the AirPassengers data set. You don't need a time-series object, just a data frame containing time steps and values; let's name them x and y. Next you fit an svm model, specify the time steps you need to forecast, and use the predict function to compute the forecast for those time steps. That's it. However, a support vector machine is not commonly regarded as the best method for time series forecasting, especially for long series of data. It can perform well a few observations ahead, but I wouldn't expect good results when forecasting, e.g., daily data for a whole next year (though it obviously depends on the data). Simple R code for an SVM-based forecast:
# load e1071, which provides svm()
library(e1071)
# prepare sample data in the form of a data frame with columns of time steps (x) and values (y)
data(AirPassengers)
monthly_data <- unclass(AirPassengers)
months <- 1:144
DF <- data.frame(months, monthly_data)
colnames(DF) <- c("x", "y")
# train an svm model, consider further tuning parameters for lower MSE
svmodel <- svm(y ~ x,data=DF, type="eps-regression",kernel="radial",cost=10000, gamma=10)
# specify time steps for the forecast, e.g. the whole series + 12 months ahead
nd <- 1:156
#compute forecast for all the 156 months
prognoza <- predict(svmodel, newdata=data.frame(x=nd))
#plot the results
ylim <- c(min(DF$y), max(DF$y))
xlim <- c(min(nd),max(nd))
plot(DF$y, col="blue", ylim=ylim, xlim=xlim, type="l")
par(new=TRUE)
plot(prognoza, col="red", ylim=ylim, xlim=xlim)

Match "next day" using forecast() in R

I am working through the "Forecasting Using R" DataCamp course. I have completed the entire thing except for the last part of one particular exercise (link here, if you have an account), where I'm totally lost; the error hint it gives me isn't helping either. I'll list the parts of the task along with the code I'm using to solve them:
Produce time plots of only the daily demand and maximum temperatures with facetting.
autoplot(elec[, c("Demand", "Temperature")], facets = TRUE)
Index elec accordingly to set up the matrix of regressors to include MaxTemp for the maximum temperatures, MaxTempSq which represents the squared value of the maximum temperature, and Workday, in that order.
xreg <- cbind(MaxTemp = elec[, "Temperature"],
MaxTempSq = elec[, "Temperature"] ^2,
Workday = elec[,"Workday"])
Fit a dynamic regression model of the demand column with ARIMA errors and call this fit.
fit <- auto.arima(elec[,"Demand"], xreg = xreg)
If the next day is a working day (indicator is 1) with maximum temperature forecast to be 20°C, what is the forecast demand? Fill out the appropriate values in cbind() for the xreg argument in forecast().
This is where I'm stuck. The sample code they supply looks like this:
forecast(___, xreg = cbind(___, ___, ___))
I have managed to work out that the first blank is fit, so I'm trying code that looks like this:
forecast(fit, xreg = cbind(elec[,"Workday"]==1, elec[, "Temperature"]==20, elec[,"Demand"]))
But that is giving me the error hint "Make sure to forecast the next day using the inputs given in the instructions." Which... doesn't tell me anything useful. Any ideas what I should be doing instead?
When you are forecasting ahead of time, you use new data that was not included in elec (the data set you used to fit your model). The new data was given to you in the question (temperature 20°C and workday 1). Therefore, you do not need elec in your forecast call. Just use the new data to forecast ahead:
forecast(fit, xreg = cbind(20, 20^2, 1))

"Simulating" a large number of regressions with different predictor values

Let's say I have the following data and I'm interested in examining some counterfactuals. In particular, I want to examine whether there would be changes in predicted income given a change in one of the predictors. The best way I can think to do this is to write a loop that runs this regression n times. However, how do I also make adjustments to the data frame on each pass through the loop? I'm really hoping there is a base R function, or something in a package, that someone can point me to.
df = data.frame(year=c(2000,2001,2002,2003,2004,2005,2006,2007,2009,2010),
income=c(100,50,70,80,50,40,60,100,90,80),
age=c(26,30,35,30,28,29,31,34,20,35),
gpa=c(2.8,3.5,3.9,4.0,2.1,2.65,2.9,3.2,3.3,3.1))
df
mod = lm(income ~ age + gpa, data=df)
summary(mod)
Here are some counter factuals that may be worth considering when looking at the relationship between age, gpa, and income.
# What if everyone in the class had a lower/higher gpa?
df$gpa2 = df$gpa + 0.55
# what if one person had a lower/higher gpa?
df$gpa2[3] = 1.6
# what if the most recent employee/person had a lower/higher gpa?
df[10,4] = 4.0
With or without looping, what would be the best way to "simulate" a large (1000+) number of regression models in order to examine various counterfactuals, and then save those results in some data structure? Is there a "counterfactual analysis" package which could save me a bit of work?
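In the absence of a dedicated package, one base-R sketch (reusing the df from the question without the year column, and taking a hypothetical uniform gpa shift as the counterfactual) is to re-fit the model in a loop and collect the coefficients and a reference prediction in a pre-allocated data frame:

```r
# simulate many counterfactual fits by perturbing gpa,
# storing coefficients and a predicted income for a reference person
set.seed(123)
df <- data.frame(
  income = c(100, 50, 70, 80, 50, 40, 60, 100, 90, 80),
  age    = c(26, 30, 35, 30, 28, 29, 31, 34, 20, 35),
  gpa    = c(2.8, 3.5, 3.9, 4.0, 2.1, 2.65, 2.9, 3.2, 3.3, 3.1)
)

n_sims <- 1000
results <- data.frame(shift       = numeric(n_sims),
                      slope_gpa   = numeric(n_sims),
                      pred_income = numeric(n_sims))
for (i in seq_len(n_sims)) {
  df2 <- df
  shift <- runif(1, -0.5, 0.5)          # hypothetical counterfactual gpa shift
  df2$gpa <- pmin(df2$gpa + shift, 4.0) # cap at a 4.0 scale
  fit <- lm(income ~ age + gpa, data = df2)
  results[i, ] <- c(shift,
                    coef(fit)["gpa"],
                    predict(fit, newdata = data.frame(age = 30, gpa = 3.0)))
}
summary(results$pred_income)
```

Each row of results then records one counterfactual: the gpa shift applied, the fitted gpa coefficient, and the predicted income for a reference person (age 30, gpa 3.0, both hypothetical values).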
