I need to apply lm() to an expanding subset of my data frame dat, while making a prediction for the next observation each time. For example, I am doing:
fit model predict
---------- -------
dat[1:3, ] dat[4, ]
dat[1:4, ] dat[5, ]
. .
. .
dat[-nrow(dat), ] dat[nrow(dat), ]
I know what I should do for a particular subset (related to this question: predict() and newdata - How does this work?). For example to predict the last row, I do
dat1 = dat[1:(nrow(dat)-1), ]
dat2 = dat[nrow(dat), ]
fit = lm(log(clicks) ~ log(v1) + log(v12), data=dat1)
predict.fit = predict(fit, newdata=dat2, se.fit=TRUE)
How can I do this automatically for all subsets, and potentially extract what I want into a table?
From fit, I'd need the summary(fit)$adj.r.squared;
From predict.fit I'd need predict.fit$fit value.
Thanks.
(Efficient) solution
This is what you can do:
p <- 3 ## number of parameters in lm()
n <- nrow(dat) - 1
## a function to return what you desire for subset dat[1:x, ]
bundle <- function(x) {
fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ], se.fit = TRUE)
c(summary(fit)$adj.r.squared, pred$fit, pred$se.fit)
}
## rolling regression / prediction
result <- t(sapply(p:n, bundle))
colnames(result) <- c("adj.r2", "prediction", "se")
Note I have done several things inside the bundle function:
I have used subset argument for selecting a subset to fit
I have used model = FALSE so the model frame is not saved, which saves memory
Overall, there is no obvious loop, but sapply is used.
Fitting starts from p, the minimum number of data required to fit a model with p coefficients;
Fitting terminates at nrow(dat) - 1, as we need at least the final row for prediction.
Test
Example data (with 30 "observations")
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
v12 = runif(30, 1, 100))
Applying code above gives results (27 rows in total, truncated output for 5 rows)
adj.r2 prediction se
[1,] NaN 3.881068 NaN
[2,] 0.106592619 3.676821 0.7517040
[3,] 0.545993989 3.892931 0.2758347
[4,] 0.622612495 3.766101 0.1508270
[5,] 0.180462206 3.996344 0.2059014
The first column is the adjusted R-squared of the fitted model, the second column is the prediction, and the third is its standard error. The first value of adj.r2 is NaN because the first model has 3 coefficients fitted to 3 data points, so there are no residual degrees of freedom and no sensible statistic is available. The same happens to se: the fit passes through all 3 points with zero residuals, so the residual variance, and hence the prediction uncertainty, cannot be estimated.
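If you also want to see which row each prediction belongs to and how far off it was, the matrix can be turned into a data frame. The following is only a sketch under my own assumptions: the seed and the extra column names (predicted.row, actual, error) are not part of the code above.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                  v12 = runif(30, 1, 100))

p <- 3               # number of coefficients in the model
n <- nrow(dat) - 1   # last row that can be used for fitting

bundle <- function(x) {
  fit <- lm(log(clicks) ~ log(v1) + log(v12), data = dat, subset = 1:x, model = FALSE)
  pred <- predict(fit, newdata = dat[x + 1, ], se.fit = TRUE)
  # unname() drops the row label so the vector names stay clean
  c(adj.r2 = summary(fit)$adj.r.squared,
    prediction = unname(pred$fit),
    se = unname(pred$se.fit))
}

result <- as.data.frame(t(sapply(p:n, bundle)))

# Hypothetical bookkeeping columns: the row that was predicted,
# its observed value on the log scale, and the prediction error
result$predicted.row <- (p:n) + 1
result$actual <- log(dat$clicks[result$predicted.row])
result$error <- result$actual - result$prediction
head(result)
```

The last row of result corresponds to fitting on dat[1:29, ] and predicting dat[30, ], which is the manual case from the question.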
I just made up some random data to use for this example. I'm calling the object data because that was what it was called in the question at the time that I wrote this solution (call it anything you like).
(Efficient) Solution
data <- data.frame(v1=rnorm(100),v2=rnorm(100),clicks=rnorm(100))
data1 = data[1:(nrow(data)-1), ]
data2 = data[nrow(data), ]
for(i in 3:nrow(data)){
nam <- paste("predict", i, sep = "")
nam1 <- paste("fit", i, sep = "")
nam2 <- paste("summary_fit", i, sep = "")
fit = lm(clicks ~ v1 + v2, data=data[1:i,])
tmp <- predict(fit, newdata=data[i+1,], se.fit=TRUE) # predict the next row (NA once i reaches nrow(data))
tmp1 <- fit
tmp2 <- summary(fit)
assign(nam, tmp)
assign(nam1, tmp1)
assign(nam2, tmp2)
}
All of the results you want will be stored in the data objects this creates.
For example:
> summary_fit10$r.squared
[1] 0.3087432
You mentioned in the comments that you'd like a table of results. You can programmatically create tables of results from the 3 types of output objects like this:
rm(data,data1,data2,i,nam,nam1,nam2,fit,tmp,tmp1,tmp2)
frames <- ls()
frames.fit <- frames[1:98] #change index or use pattern matching as needed
frames.predict <- frames[99:196]
frames.sum <- frames[197:294]
fit.table <- data.frame(intercept=NA,v1=NA,v2=NA,sourcedf=NA)
for(i in 1:length(frames.fit)){
tmp <- get(frames.fit[i])
fit.table <- rbind(fit.table,c(tmp$coefficients[[1]],tmp$coefficients[[2]],tmp$coefficients[[3]],frames.fit[i]))
}
fit.table
> fit.table
intercept v1 v2 sourcedf
2 -0.0647017971121678 1.34929652763687 -0.300502017324518 fit10
3 -0.0401617893034109 -0.034750571912636 -0.0843076273486442 fit100
4 0.0132968863522573 1.31283604433593 -0.388846211083564 fit11
5 0.0315113918953643 1.31099122173898 -0.371130010135382 fit12
6 0.149582794027583 0.958692838785998 -0.299479715938493 fit13
7 0.00759688947362175 0.703525856001948 -0.297223988673322 fit14
8 0.219756240025917 0.631961979610744 -0.347851129205841 fit15
9 0.13389223748979 0.560583832333355 -0.276076134872669 fit16
10 0.147258022154645 0.581865844000838 -0.278212722024832 fit17
11 0.0592160359650468 0.469842498721747 -0.163187274356457 fit18
12 0.120640756525163 0.430051839741539 -0.201725012088506 fit19
13 0.101443924785995 0.34966728554219 -0.231560038360121 fit20
14 0.0416637001406594 0.472156988919337 -0.247684504074867 fit21
15 -0.0158319749710781 0.451944113682333 -0.171367482879835 fit22
16 -0.0337969739950376 0.423851304105399 -0.157905431162024 fit23
17 -0.109460218252207 0.32206642419212 -0.055331391802687 fit24
18 -0.100560410735971 0.335862465403716 -0.0609509815266072 fit25
19 -0.138175283219818 0.390418411384468 -0.0873106257144312 fit26
20 -0.106984355317733 0.391270279253722 -0.0560299858019556 fit27
21 -0.0740684978271464 0.385267011513678 -0.0548056844433894 fit28
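As a variation on the approach above, the fits can be kept in a named list instead of being written to the workspace with assign() and collected again with ls() and get(). This is a sketch under my own assumptions: the seed, the idx vector and the extra r.squared and prediction columns are illustrative, and the loop stops one row early so that a next row always exists.

```r
set.seed(42)  # assumed seed for reproducibility
data <- data.frame(v1 = rnorm(100), v2 = rnorm(100), clicks = rnorm(100))

idx <- 3:(nrow(data) - 1)   # sizes of the fitting subsets

# One model per subset size, stored in a named list
fits <- lapply(idx, function(i) lm(clicks ~ v1 + v2, data = data[1:i, ]))
names(fits) <- paste0("fit", idx)

# One row of coefficients per model; row names come from the list names
fit.table <- as.data.frame(do.call(rbind, lapply(fits, coef)))
names(fit.table) <- c("intercept", "v1", "v2")

fit.table$r.squared <- sapply(fits, function(f) summary(f)$r.squared)

# Prediction for the row that follows each fitting subset
fit.table$prediction <- mapply(
  function(f, i) unname(predict(f, newdata = data[i + 1, ])),
  fits, idx
)
head(fit.table)
```

Because the coefficients never pass through a character vector, the columns stay numeric, and the rows keep the natural order fit3, fit4, ... rather than the alphabetical order returned by ls().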
Related
For my logistic regression, I need to plot the data I have created, and then plot the line described by the coefficients of the model.
library("dplyr") # filter and reformat data frames
library("tidyr") # make data tidy
# Create a placeholder vector with 200 elements
df_vector <- c(1:200)
# For the first 100 variables,
for (i in 1: 100) {
# Set the normal distribution with mean -5 and sd 5
v1 <- rnorm(100, -5,5)
# For the second 100
for (i in 101:200)
# Set the normal distribution with mean 4 and sd 8
v2 <- rnorm(100,4,8)
}
# Input the values of v1 and v2 into a data frame
df_vector <- data.frame(v1,v2)
# Combine the 2 columns of data into one single vector with a length of 200
Uni_dataframe <- data.frame(d = unlist(df_vector, use.names = FALSE))
# Create another vector with first half of the values = 0 and second half = 1
resp_vector <- c(1:200)
# For the first 100 variables
for (i in 1:100) {
# Set the first 100 variables = 0
resp1 <-(rep(0, times = 100 ))
# For the second 100 variables
for (i in 101:200)
# Set the second 100 variables = 1
resp2 <- (rep(1, times = 100 ))
}
# Input both resp1 and resp 2 into a data frame
resp_vector <- data.frame(resp1, resp2)
# Combine the 2 columns of data into a single vector with length 200
Uni_resp_vector <- data.frame (d = unlist(resp_vector))
The 2 vectors I have created above are Uni_dataframe and Uni_resp_vector.
I am then tasked to run a logistic regression according to the following code and function.
log.likelihood(params, x, y)
logistic.fit(xtrain, ytrain)
logistic.predict(xtest, fit)
log.likelihood = function(params, x, y) {
x = cbind(rep(1, nrow(x)), x)
Bx.sum = params %*% t(x)
t1 = sum((1-y)*Bx.sum)
t2 = sum(log(1+sapply(-Bx.sum, exp, simplify=T)))
likelihood = -(t1+t2)
return(-likelihood)
}
model <- optim(params = Uni_dataframe, fn = Uni_resp_vector, log.likelihood, data, method='BFGS')
However, I am unsure what arguments to pass to optim(), and I am getting this error from optim: cannot coerce type 'closure' to vector of type 'double'. Is there an issue with the 'data' argument?
Any help producing a logistic regression model would be appreciated. Any suggestions on how it should be plotted? Thank you!
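For reference, optim() expects numeric starting values in par and the objective function in fn; anything else the objective needs is passed by name and forwarded to it. As far as I can tell, in the call above the named arguments params and fn leave log.likelihood and data to be matched positionally to par and gr, so par ends up being a function, which would explain the 'closure' error. Below is a minimal sketch, assuming a single predictor with an intercept; neg.log.lik and the seed are my own names and choices, not the functions from the assignment.

```r
set.seed(1)  # assumed seed for reproducibility
x <- c(rnorm(100, -5, 5), rnorm(100, 4, 8))  # 200 predictor values
y <- rep(c(0, 1), each = 100)                # first half 0, second half 1

# Negative log-likelihood of a logistic model; params = c(intercept, slope)
neg.log.lik <- function(params, x, y) {
  eta <- params[1] + params[2] * x
  # log(1 + exp(eta)) written in a numerically stable form
  sum(pmax(eta, 0) + log1p(exp(-abs(eta))) - y * eta)
}

# par: numeric starting values; fn: the function to minimise;
# x and y are forwarded to neg.log.lik through "..."
model <- optim(par = c(0, 0), fn = neg.log.lik, x = x, y = y, method = "BFGS")
model$par

# Cross-check against the built-in fit
coef(glm(y ~ x, family = binomial))
```

For the plot, one option is plot(x, y) followed by curve(plogis(model$par[1] + model$par[2] * x), add = TRUE), which overlays the fitted probability curve on the 0/1 responses.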
I have code that predicts the change in the sign of future returns.
library(quantmod)
library(PerformanceAnalytics)
library(forecast)
library(e1071)
library(caret)
library(kernlab)
library(dplyr)
library(roll)
# get data yahoo finance
getSymbols("^GSPC", from = "1990-01-01", to = "2017-12-01")
# take logreturns
rnull <- CalculateReturns(prices = GSPC$GSPC.Adjusted ,method ="log")
# lags 1, 2, 3, 4, 5 as features
feat <- merge(na.trim(lag(rnull,1)),na.trim(lag(rnull,2)),na.trim(lag(rnull,3)),na.trim(lag(rnull,4)),na.trim(lag(rnull,5)),all=FALSE)
# create dataset. 6th column is the actual return. The previous columns are lagged
dataset <- merge(feat,rnull,all=FALSE)
# set columns' names
colnames(dataset) = c("lag.1", "lag.2", "lag.3","lag.4","lag.5","TARGET")
# get signs and make a data.frame
x <- sign(dataset)%>%as.data.frame
# exclude 0 sign and assume that these values are positive
x[x==0] <- 1
# for svm purposes we need to set the dependent variable as a factor and label its levels for interpretation
x$TARGET <- as.factor(as.character(x$TARGET))
levels(x$TARGET) <- list(positive = "1", negative = "-1")
# divide sample to training and test subsamples
trainindex <- x[1:5792,]
testindex <- x[5792:7030,]
# run svm
svmFit <- ksvm(TARGET~.,data=trainindex,type="C-svc",kernel= "rbfdot")
# prediction
predsvm <- predict(svmFit, newdata=testindex)
# results
confusionMatrix(predsvm, testindex$TARGET)
The next thing I am going to do is add a rolling window (1 step forecast) to my model.
However, basic methods such as rollapply do not work with a data frame. Common methods of one-step forecasting for time series are also not valid for the data.frame used with the e1071 package.
I wrote the following function:
svm_next_day_prediction <- function(x){
svmFit <- svm(TARGET~., data=x)
prediction <- predict(object = svmFit, newdata = tail(x,1) )
return(prediction)
}
apl = rollapplyr(data = x, width = 180, FUN = svm_next_day_prediction, by.column = TRUE)
but received an error because rollapply does not understand data.frames:
Error in terms.formula(formula, data = data) : '.' in formula and
no 'data' argument
Can you please explain how to apply rolling window for svm classification model with dataframe?
A few points
rollapply works with data frames that can be coerced to a matrix so be sure that your input is entirely numeric -- not a mix of numeric and factor. For example, this works using the built-in data frame BOD which has two numeric columns. Note that x passed to pred is a matrix here.
pred <- function(x) predict(svm(demand ~ Time, x))
rollapplyr(BOD, 3, FUN = pred, by.column = FALSE)
giving
## 1 2 3
## [1,] 8.868888 10.86889 17.25474
## [2,] 11.661666 17.24870 16.00000
## [3,] 18.328435 16.18583 15.78583
## [4,] 16.230474 15.83247 19.56886
I can't reproduce the error you get; I get a different error.
The code in the question has by.column = TRUE (which is the default anyway), but that passes only a single column to the function at a time, which is not what you want. You want by.column = FALSE.
Try this:
x0 <- data.matrix(x)
rollapplyr(data = x0, width = 180, FUN = svm_next_day_prediction, by.column = FALSE)
You can create a list with the individual data frames and then apply your function. I rename x to df to avoid confusion:
df=x
rowwindow=179
dfList=lapply(1:(nrow(df)-rowwindow),function(x) df[x:(rowwindow+x),])
result=sapply(dfList,svm_next_day_prediction)
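If the goal is a true one-step-ahead forecast, each window can be used for training only and the row right after it for prediction. The sketch below shows just the indexing. To keep it free of package dependencies I swapped in a base R glm() classifier where svm() would go, and the simulated data, the seed and the helper name fit_and_predict are all my own assumptions.

```r
set.seed(1)  # assumed seed for reproducibility
n <- 300
df <- data.frame(lag.1 = rnorm(n), lag.2 = rnorm(n))
df$TARGET <- factor(ifelse(df$lag.1 + rnorm(n) > 0, "positive", "negative"))

width <- 180  # number of rows in each training window

# Hypothetical helper: fit on the window, classify one new row.
# Replace the glm() call with svm(TARGET ~ ., data = train) to use e1071.
fit_and_predict <- function(train, newrow) {
  fit <- glm(TARGET ~ ., data = train, family = binomial)
  prob <- predict(fit, newdata = newrow, type = "response")
  # glm models the probability of the second factor level
  unname(ifelse(prob > 0.5, levels(train$TARGET)[2], levels(train$TARGET)[1]))
}

starts <- 1:(nrow(df) - width)  # first row of each window
pred <- sapply(starts, function(s) {
  train  <- df[s:(s + width - 1), ]          # rows s .. s + width - 1
  newrow <- df[s + width, , drop = FALSE]    # the row right after the window
  fit_and_predict(train, newrow)
})

actual <- as.character(df$TARGET[starts + width])
table(predicted = pred, actual = actual)
```

Indexing rows directly avoids the matrix coercion that rollapply performs, so factor columns survive intact.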
I have been trying to apply logistic regression, or any other ML algorithm, to this simple data set, but I have failed miserably and got many errors. I am tr
dim(data)
[1] 11580 12
head(data)
ReturnJan ReturnFeb ReturnMar ReturnApr ReturnMay ReturnJune
1 0.08067797 0.06625000 0.03294118 0.18309859 0.130333952 -0.01764234
2 -0.01067989 0.10211539 0.14549595 -0.08442804 -0.327300392 -0.35926605
3 0.04774193 0.03598972 0.03970223 -0.16235294 -0.147426982 0.04858934
4 -0.07404022 -0.04816956 0.01821862 -0.02467917 -0.006036217 -0.02530364
5 -0.03104575 -0.21267723 0.09147609 0.18933823 -0.153846154 -0.10611511
6 0.57980016 0.33225225 -0.40546095 -0.06000000 0.060732113 -0.21536106
And the 12th column, the one I am trying to predict, looks like this
PositiveDec
0
0
0
1
1
1
Here is my attempt
new.data <- data[,-12] #Remove labels' column
index <- sample(1:nrow(new.data), size = 0.8*nrow(new.data))#Split data
train.data <- new.data[index,]
test.data <- new.data[-index,]
fit.glm <- glm(data[,12]~.,data = data, family = "binomial")
You are getting there, but have several syntactic errors and, as pointed out in comments, need to leave your outcome variable in. This should work:
index <- sample(1:nrow(data), size = 0.8 * nrow(data))
train.data <- data[index, ]
fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")
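To check the model on the held-out rows, predict() with type = "response" returns probabilities that can be thresholded. This is only a sketch on simulated data: the two-column data frame, the seed and the 0.5 cut-off are my assumptions, since the real data set is not available here.

```r
set.seed(1)  # assumed seed for reproducibility
n <- 500
# Simulated stand-in for the real data: two return columns and a 0/1 outcome
data <- data.frame(ReturnJan = rnorm(n, 0, 0.1), ReturnFeb = rnorm(n, 0, 0.1))
data$PositiveDec <- rbinom(n, 1, plogis(5 * data$ReturnJan))

index <- sample(1:nrow(data), size = 0.8 * nrow(data))
train.data <- data[index, ]
test.data  <- data[-index, ]   # outcome column stays in both parts

fit.glm <- glm(PositiveDec ~ ., data = train.data, family = "binomial")

# Predicted probabilities for the held-out rows, then class labels
prob <- predict(fit.glm, newdata = test.data, type = "response")
pred <- as.integer(prob > 0.5)

table(predicted = pred, actual = test.data$PositiveDec)  # confusion table
mean(pred == test.data$PositiveDec)                      # accuracy
```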
I currently have following code with two functions that calculate the model fit for two distinct models. The difference is in the lm function, where + log(v2) has been added in model 2.
R code
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100), v2 = runif(30, 1, 100))
p0 <- 1 # number of parameters in lm()
p1 <- 2 # number of parameters in lm()
n <- nrow(dat) - 1
## Model 1 Loop
model1 <- function(x) {
fit <- lm(log(clicks) ~ log(v1), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ])
c(summary(fit)$r.squared)
}
## Model 1 Regression
result_m1 <- t(sapply(p0:n, model1))
data.frame(result_m1)
## Model 2 Loop
model2 <- function(x) {
fit <- lm(log(clicks) ~ log(v1) + log(v2), data = dat, subset = 1:x, model = FALSE)
pred <- predict(fit, newdata = dat[x+1, ])
c(summary(fit)$r.squared)
}
## Model 2 Regression
result_m2 <- t(sapply(p1:n, model2))
data.frame(result_m2)
Question: Can I somehow create one function that loops over the different models, instead of repeating the calculation for every model?
I have something like this in mind but wasn't able to implement it: http://www.ats.ucla.edu/stat/r/pages/looping_strings.htm
I don't see a point in recreating a function that can be easily done with model-selection functions in available packages.
library(leaps)
library(dplyr)
b <- regsubsets(clicks ~ ., data=dat, nbest=10, nvmax=2) # carries out exhaustive model selection (10 best models; 2 maximum predictors)
coef(b, 1:3) # returns coefficient for the 3 models in this case
[[1]]
(Intercept) v1
60.8067570 -0.2665699
[[2]]
(Intercept) v2
49.96974177 -0.05227489
[[3]]
(Intercept) v1 v2
62.02323816 -0.26422966 -0.02676747
summary(b)$rsq #provide r.squared value for 3 models
[1] 0.067952759 0.002366681 0.068568059
To run prediction is a tad more complicated.
all.mods <- summary(b)$which[,-1] # gives logic output of predictors combination
all.mods
v1 v2
1 TRUE FALSE
1 FALSE TRUE
2 TRUE TRUE
RHS <- lapply(seq(nrow(all.mods)), function(m) summary(b)$which[m,-1] %>% which %>% names %>% paste(., collapse="+"))
RHS
[[1]]
[1] "v1"
[[2]]
[1] "v2"
[[3]]
[1] "v1+v2"
lm.form <- lapply(RHS, function(m)parse(text=paste("lm(clicks ~", m, ", data=dat)")))
lm.mods <- lapply(lm.form, eval) # return list of all lm.mods generated
The list of lm.mods can subsequently be used for predict with new.data.
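If you would rather stay with plain lm() and the rolling setup from the question, another option is to pass the formula as an argument and loop over a list of formulas. A sketch under my own assumptions: rolling_r2 and the list names m1 and m2 are hypothetical, and the starting size is taken from the number of variables in the formula, which for these two models equals the number of coefficients.

```r
set.seed(1)  # assumed seed for reproducibility
dat <- data.frame(clicks = runif(30, 1, 100), v1 = runif(30, 1, 100),
                  v2 = runif(30, 1, 100))

# The models to compare, stored as formulas
formulas <- list(m1 = log(clicks) ~ log(v1),
                 m2 = log(clicks) ~ log(v1) + log(v2))

# Hypothetical helper: rolling R-squared for one formula
rolling_r2 <- function(form, data) {
  p <- length(all.vars(form))  # intercept + predictors for these formulas
  n <- nrow(data) - 1
  sapply(p:n, function(x) {
    # Subset the rows directly; a subset = 1:x argument would be evaluated
    # in the formula's environment, where x is not visible
    fit <- lm(form, data = data[1:x, , drop = FALSE], model = FALSE)
    summary(fit)$r.squared
  })
}

results <- lapply(formulas, rolling_r2, data = dat)
str(results)
```

Adding a third model then only means adding one more formula to the list.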
I am trying to build a rolling regression function based on the example here, but in addition to returning the predicted values, I would like to return some rolling model diagnostics (i.e. coefficients, t-values, and maybe R^2). I would like the results to be returned in discrete objects based on the type of result. The example provided in the link above successfully creates the rolling predictions, but I need some assistance packaging and writing out the rolling model diagnostics:
In the end, I would like the function to return four (4) objects:
Predictions
Coefficients
T values
R^2
Below is the code:
require(zoo)
require(dynlm)
## Create Some Dummy Data
set.seed(12345)
x <- rnorm(mean=3,sd=2,100)
y <- rep(NA,100)
y[1] <- x[1]
for(i in 2:100) y[i]=1+x[i-1]+0.5*y[i-1]+rnorm(1,0,0.5)
int <- 1:100
dummydata <- data.frame(int=int,x=x,y=y)
zoodata <- as.zoo(dummydata)
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted <- predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted) <- NULL
c(predicted=predicted,square.res <-(predicted-zoodata[nextOb,'y'])^2)
# 2) Extract coefficients
#coefficients <- coef(mod)
# 3) Extract rolling coefficient t values
#tvalues <- ????(mod)
# 4) Extract rolling R^2
#rsq <-
}
}
rolling.window <- 20
results.z <- rollapply(zoodata, width=rolling.window, FUN=rolling.regression, by.column=F, align='right')
So after figuring out how to extract t-values from the model (i.e. mod), what do I need to do to make the function return three (3) separate objects (i.e. Predictions, Coefficients, and T-values)?
I am fairly new to R, really new to functions, and extremely new to zoo, and I'm stuck.
Any assistance would be greatly appreciated.
I hope I got you correctly, but here is a small edit of your function:
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted)<-NULL
#Solution 1; Quicker to write
# c(predicted=predicted,
# square.res=(predicted-zoodata[nextOb,'y'])^2,
# summary(mod)$coef[, 1],
# summary(mod)$coef[, 3],
# AdjR = summary(mod)$adj.r.squared)
#Solution 2; Get column names right
c(predicted=predicted,
square.res=(predicted-zoodata[nextOb,'y'])^2,
coef_intercept = summary(mod)$coef[1, 1],
coef_Ly = summary(mod)$coef[2, 1],
coef_Lx = summary(mod)$coef[3, 1],
tValue_intercept = summary(mod)$coef[1, 3],
tValue_Ly = summary(mod)$coef[2, 3],
tValue_Lx = summary(mod)$coef[3, 3],
AdjR = summary(mod)$adj.r.squared)
}
}
rolling.window <- 20
results.z <- rollapply(zoodata, width=rolling.window, FUN=rolling.regression, by.column=F, align='right')
head(results.z)
predicted square.res coef_intercept coef_Ly coef_Lx tValue_intercept tValue_Ly tValue_Lx AdjR
20 10.849344 0.721452 0.26596465 0.5798046 1.049594 0.38309211 7.977627 13.59831 0.9140886
21 12.978791 2.713053 0.26262820 0.5796883 1.039882 0.37741499 7.993014 13.80632 0.9190757
22 9.814676 11.719999 0.08050796 0.5964808 1.073941 0.12523824 8.888657 15.01353 0.9340732
23 5.616781 15.013297 0.05084124 0.5984748 1.077133 0.08964998 9.881614 16.48967 0.9509550
24 3.763645 6.976454 0.26466039 0.5788949 1.068493 0.51810115 11.558724 17.22875 0.9542983
25 9.433157 31.772658 0.38577698 0.5812665 1.034862 0.70969330 10.728395 16.88175 0.9511061
To see how it works, make a small example with a regression:
x <- rnorm(1000); y <- 2*x + rnorm(1000)
reg <- lm(y ~ x)
summary(reg)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.02694322 0.03035502 0.8876033 0.374968
x 1.97572544 0.03177346 62.1816310 0.000000
As you can see, calling summary first and then getting the coefficients of it (coef(summary(reg)) works as well) gives you a table with estimates, standard errors, and t-values. So estimates are saved in column 1 of that table, t-values in column 3. And that's how I obtain them in the updated rolling.regression function.
EDIT
I updated my solution; now it also contains the adjusted R2. If you just want the normal R2, drop the adj. prefix, i.e. use summary(mod)$r.squared.
EDIT 2
Quick and dirty hack for naming the columns:
rolling.regression <- function(series) {
mod <- dynlm(formula = y ~ L(y) + L(x), data = as.zoo(series)) # get model
nextOb <- max(series[,'int'])+1 # To get the first row that follows the window
if (nextOb<=nrow(zoodata)) { # You won't predict the last one
# 1) Make Predictions
predicted=predict(mod,newdata=data.frame(x=zoodata[nextOb,'x'],y=zoodata[nextOb,'y']))
attributes(predicted)<-NULL
#Get variable names
strVar <- c("Intercept", paste0("L", 1:(nrow(summary(mod)$coef)-1)))
vec <- c(predicted=predicted,
square.res=(predicted-zoodata[nextOb,'y'])^2,
AdjR = summary(mod)$adj.r.squared,
summary(mod)$coef[, 1],
summary(mod)$coef[, 3])
names(vec)[4:length(vec)] <- c(paste0("Coef_", strVar), paste0("tValue_", strVar))
vec
}
}
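If separate objects per result type are wanted rather than one wide matrix, one option is to keep the fitted models in a list and extract each piece afterwards. The sketch below is a base R variant, not a drop-in replacement for the dynlm/rollapply code: I build the lagged columns by hand and use lm(), so each window holds 20 complete rows. The names lagged, fits and results are my own.

```r
set.seed(12345)
x <- rnorm(100, mean = 3, sd = 2)
y <- numeric(100)
y[1] <- x[1]
for (i in 2:100) y[i] <- 1 + x[i - 1] + 0.5 * y[i - 1] + rnorm(1, 0, 0.5)

# Each row pairs y[t] with the lagged values y[t-1] and x[t-1]
lagged <- data.frame(y = y[-1], Ly = y[-100], Lx = x[-100])

width  <- 20
starts <- 1:(nrow(lagged) - width)  # leave one row after the last window

# One model per window
fits <- lapply(starts, function(s) {
  lm(y ~ Ly + Lx, data = lagged[s:(s + width - 1), ])
})

# Separate objects, one per type of result
results <- list(
  predictions  = mapply(function(f, s) unname(predict(f, newdata = lagged[s + width, ])),
                        fits, starts),
  coefficients = t(sapply(fits, coef)),
  t.values     = t(sapply(fits, function(f) coef(summary(f))[, "t value"])),
  r.squared    = sapply(fits, function(f) summary(f)$r.squared)
)
str(results)
```

Each element can then be used on its own, e.g. results$coefficients is a matrix with one row per window and results$predictions is a plain numeric vector.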