I have spent whole day today for resolving this.. please help me.
Although I just write very simple example here, my original data has a huge number of variables- about 2,000. Therefore, to run regression I need to pick certain variables.
I do need to develop many models, so I should do this procedure automatically.
I run stepwie.
I don't know how many variables are selected by stepwise.
after selecting variables, I run rolling regression for prediction.
library(car)
library(zoo)
# run regression
m <- lm(mpg~., data=mtcars)
# run stepwise
s<-step(m, direction="both")
# select variables
variable<- attr(s$terms,"term.labels")
b<-paste(dep,paste(s, collapse="+"),sep = "~")
rollapply(mtcars, width = 2,
FUN = function(z) coef(lm(b, data = as.data.frame(z))),
by.column = FALSE, align = "right")
# Here is the automatic model I developed..
models2 <- lapply(1:11, function(x) {
dep<-names(mtcars)[x]
ind<-mtcars[-x]
w<-names(ind)
indep<-paste(dep,paste(w, collapse="+"),sep = "~")
m<-lm(indep,data=mtcars)
s<-step(m, direction="both")
b<-paste(dep,paste(s, collapse="+"),sep = "~")
rollapply(mtcars, width = 2,
FUN = function(z) coef(lm(b, data = as.data.frame(z))),
by.column = FALSE, align = "right")})
I want to calculate prediction from rolling regression..
However, it is very hard to set up
data.frame without pre-knowldege about independent variables..
There is a similar one here, but in this model independent variables are known already.
You do not need to know the independent variables! If you provide a data.frame that contains all variables, the predict function will select the necessary ones. Similar to the post you have linked, you could to this:
mtcars[,"int"] <- seq(nrow(mtcars)) # add variable used to choose newdata
models2 <- lapply(1:11, function(x) {
dep <- names(mtcars)[x]
ind <- mtcars[-x]
w <- names(ind)
form <- paste(dep,paste(w, collapse="+"),sep = "~")
m <- lm(form, data=mtcars)
s <- step(m, direction="both", trace=0) # model selection (don't print trace)
b <- formula(s) # This is clearer than your version
rpl <- rollapply(mtcars, width = 20, # if you use width=2, your model will always be overdetermined
FUN = function(z) {
nextD <- max(z[,'int'])+1 # index of row for new data
fit <- lm(b, data = as.data.frame(z)) # fit the model
c(coef=coef(fit), # coefficients
predicted=predict(fit, newdata=mtcars[nextD,])) # predict using the next row
},
by.column = FALSE, align = "right")
rpl
})
Related
I'm working with the train() function from the caret package to fit multiple regression and ML models to test their fit. I'd like to write a function that iterates through all model types and enters the best fit into a dataframe. Biggest issue is that caret doesn't provide all the model fit statistics that I'd like so they need to be derived from the raw output. Based on my exploration there doesn't seem to be a standardized way caret outputs each models fit.
Another post (sorry don't have a link) created this function which pulls from fit$results and fit$bestTune to get pre calculated RMSE, R^2, etc.
get_best_result <- function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
One example of another fit statistic I need to calculate using raw output is BIC. The two functions below do that. The residuals (y_actual - y_predicted) are needed along with the number of x variables (k) and the number of rows used in the prediction (n). k and n must be derived from the output not the original dataset due to the models dropping x variables (feature selection) or rows (omitting NAs) based on its algorithm.
calculate_MSE <- function(residuals){
# residuals can be replaced with y_actual-y_predicted
mse <- mean(residuals^2)
return(mse)
}
calculate_BIC <- function(n, mse, k){
BIC <- n*log(mse)+k*log(n)
return(BIC)
}
The real question is is there a standardized output of caret::train() for x variables or either y_actual, y_predicted, or residuals?
I tried fit$finalModel$model and other methods but to no avail.
Here is a reproducible example along with the function I'm using. Please consider the functions above a part of this reproducible example.
library(rlist)
library(data.table)
# data
df <- data.frame(y1 = rnorm(50, 0, 1),
y2 = rnorm(50, .25, 1.5),
x1 = rnorm(50, .4, .9),
x2 = rnorm(50, 0, 1.1),
x3 = rnorm(50, 1, .75))
missing_index <- sample(1:50, 7, replace = F)
df[missing_index,] <- NA
# function to fit models and pull results
fitModels <- function(df, Ys, Xs, models){
# empty list
results <- list()
# number of for loops
loops_counter <- 0
# for every y
for(y in 1:length(Ys)){
# for every model
for(m in 1:length(models)){
# track loops
loops_counter <- loops_counter + 1
# fit the model
set.seed(1) # seed for reproducability
fit <- tryCatch(train(as.formula(paste(Ys[y], paste(Xs, collapse = ' + '),
sep = ' ~ ')),
data = df,
method = models[m],
na.action = na.omit,
tuneLength = 10),
error = function(e) {return(NA)})
# pull results
results[[loops_counter]] <- c(Y = Ys[y],
model = models[m],
sample_size = nrow(fit$finalModel$model),
RMSE = get_best_result(fit)[[2]],
R2 = get_best_result(fit)[[3]],
MAE = get_best_result(fit)[[4]],
BIC = calculate_BIC(n = length(fit$finalModel),
mse = calculate_MSE(fit$finalModel$residuals),
k = length(fit$finalModel$xNames)))
}
}
# list bind
results_df <- list.rbind(results)
return(results_df)
}
linear_models <- c('lm', 'glmnet', 'ridge', 'lars', 'enet')
fits <- fitModels(df, c(y1, y2), c(x1,x2,x3), linear_models)
I want to create a function which will perform panel regression with 3-level dummies included.
Let's consider within model with time effects :
library(plm)
fit_panel_lr <- function(y, x) {
x[, length(x) + 1] <- y
#adding dummies
mtx <- matrix(0, nrow = nrow(x), ncol = 3)
mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
colnames(mtx) <- paste0("dummy_", 1:3)
#converting to pdataframe and adding dummy variables
x <- pdata.frame(x)
x <- cbind(x, mtx)
#performing panel regression
varnames <- names(x)[3:(length(x))]
varnames <- varnames[!(varnames == names(y))]
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
form <- as.formula(paste0(names(y), "~", form,'-1'))
params <- list(
formula = form, data = x_copy, model = "within",
effect = "time"
)
pglm_env <- list2env(params, envir = new.env())
model_plm <- do.call("plm", params, envir = pglm_env)
model_plm
}
However, if I use data :
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
In output I will get :
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
How come that in formula dummies are considered and in coefficients are not ? Is there any rational explanation or I did something wrong ?
One point why you do not see the dummies on the output is because they are linear dependent to the other data after the fixed-effect time transformation. They are dropped so what is estimable is estimated and output.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep is included in package plm) has some more nice examples on linear dependence before and after FE transformation.
A suggestion:
As for constructing dummy variables, I suggest to use R's factor with three levels and not have the dummy matrix constructed yourself. Using a factor is typically more convinient and less error prone. It is converted to the binary dummies (treatment style) by your typical estimation function using the model.frame/model.matrix framework.
I have a problem when using replicate to repeat the function.
I tried to use the bootstrap to fit
a quadratic model using concentration as the predictor and Total_lignin as the response and going to report an estimate of the maximum with a corresponding standard error.
My idea is to create a function called bootFun that essentially did everything within one iteration of a for loop. bootFun took in only the data set the predictor, and the response to use (both variable names in quotes).
However, the SD is 0, not correct. I do not know where is the wrong place. Could you please help me with it?
# Load the libraries
library(dplyr)
library(tidyverse)
# Read the .csv and only use M.giganteus and S.ravennae.
dat <- read_csv('concentration.csv') %>%
filter(variety == 'M.giganteus' | variety == 'S.ravennae') %>%
arrange(variety)
# Check the data
head(dat)
# sample size
n <- nrow(dat)
# A function to do one iteration
bootFun <- function(dat, pred, resp){
# Draw the sample size from the dataset
sample <- sample_n(dat, n, replace = TRUE)
# A quadratic model fit
formula <- paste0('resp', '~', 'pred', '+', 'I(pred^2)')
fit <- lm(formula, data = sample)
# Derive the max of the value of concentration
max <- -fit$coefficients[2]/(2*fit$coefficients[3])
return(max)
}
max <- bootFun(dat = dat, pred = 'concentration', resp = 'Total_lignin' )
# Iterated times
N <- 5000
# Use 'replicate' function to do a loop
maxs <- replicate(N, max)
# An estimate of the max of predictor and corresponding SE
mean(maxs)
sd(maxs)
Base package boot, function boot, can ease the job of calling the bootstrap function repeatedly. The first argument must be the data set, the second argument is an indices argument, that the user does not set and other arguments can also be passed toit. In this case those other arguments are the predictor and the response names.
library(boot)
bootFun <- function(dat, indices, pred, resp){
# Draw the sample size from the dataset
dat.sample <- dat[indices, ]
# A quadratic model fit
formula <- paste0(resp, '~', pred, '+', 'I(', pred, '^2)')
formula <- as.formula(formula)
fit <- lm(formula, data = dat.sample)
# Derive the max of the value of concentration
max <- -fit$coefficients[2]/(2*fit$coefficients[3])
return(max)
}
N <- 5000
set.seed(1234) # Make the bootstrap results reproducible
results <- boot(dat, bootFun, R = N, pred = 'concentration', resp = 'Total_lignin')
results
#
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = dat, statistic = bootFun, R = N, pred = "concentration",
# resp = "Total_lignin")
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* -0.4629808 -0.0004433889 0.03014259
#
results$t0 # this is the statistic, not bootstrapped
#concentration
# -0.4629808
mean(results$t) # bootstrap value
#[1] -0.4633233
Note that to fit a polynomial, function poly is much simpler than to explicitly write down the polynomial terms one by one.
formula <- paste0(resp, '~ poly(', pred, ',2, raw = TRUE)')
Check the distribution of the bootstrapped statistic.
op <- par(mfrow = c(1, 2))
hist(results$t)
qqnorm(results$t)
qqline(results$t)
par(op)
Test data
set.seed(2020) # Make the results reproducible
x <- cumsum(rnorm(100))
y <- x + x^2 + rnorm(100)
dat <- data.frame(concentration = x, Total_lignin = y)
I have a code which predict the change in the sign of future returns.
library(quantmod)
library(PerformanceAnalytics)
library(forecast)
library(e1071)
library(caret)
library(kernlab)
library(dplyr)
library(roll)
# get data yahoo finance
getSymbols("^GSPC", from = "1990-01-01", to = "2017-12-01")
# take logreturns
rnull <- CalculateReturns(prices = GSPC$GSPC.Adjusted ,method ="log")
# lags 1, 2, 3, 4, 5 as features
feat <- merge(na.trim(lag(rnull,1)),na.trim(lag(rnull,2)),na.trim(lag(rnull,3)),na.trim(lag(rnull,4)),na.trim(lag(rnull,5)),all=FALSE)
# create dataset. 6th column is actural. Previous is lagged
dataset <- merge(feat,rnull,all=FALSE)
# set columns' names
colnames(dataset) = c("lag.1", "lag.2", "lag.3","lag.4","lag.5","TARGET")
# get signs and make a data.frame
x <- sign(dataset)%>%as.data.frame
# exclude 0 sign and assume that these values are positive
x[x==0] <- 1
# for svm purposes we need to set dependent variable as factor and make levels to interpretation
x$TARGET <- as.factor(as.character(x$TARGET))
levels(x$TARGET) <- list(positive = "1", negative = "-1")
# divide sample to training and test subsamples
trainindex <- x[1:5792,]
testindex <- x[5792:7030,]
# run svm
svmFit <- ksvm(TARGET~.,data=trainindex,type="C-svc",kernel= "rbfdot")
# prediction
predsvm <- predict(svmFit, newdata=testindex)
# results
confusionMatrix(predsvm, testindex$TARGET)
The next thing I am going to do is add a rolling window (1 step forecast) to my model.
However the basic methods as rollapply does not work with dataframe. Commom methods of one step forecast for time-series are also not valid for data.frame used in e1071 package.
I wrote the following function:
svm_next_day_prediction <- function(x){
svmFit <- svm(TARGET~., data=x)
prediction <- predict(object = svmFit, newdata = tail(x,1) )
return(prediction)
}
apl = rollapplyr(data = x, width = 180, FUN = svm_next_day_prediction, by.column = TRUE)
but recieved a error because rollapply does not understand data.frames:
Error in terms.formula(formula, data = data) : '.' in formula and
no 'data' argument
Can you please explain how to apply rolling window for svm classification model with dataframe?
A few points
rollapply works with data frames that can be coerced to a matrix so be sure that your input is entirely numeric -- not a mix of numeric and factor. For example, this works using the built-in data frame BOD which has two numeric columns. Note that x passed to pred is a matrix here.
pred <- function(x) predict(svm(demand ~ Time, x))
rollapplyr(BOD, 3, FUN = pred, by.column = FALSE)
giving
## 1 2 3
## [1,] 8.868888 10.86889 17.25474
## [2,] 11.661666 17.24870 16.00000
## [3,] 18.328435 16.18583 15.78583
## [4,] 16.230474 15.83247 19.56886
I can't reproduce the error you get. I get a different error.
the code in the question has by.column = TRUE (which is the default anyways)
but that has the result of passing only a single vector to the function which
is not what you want. You want by.column = FALSE.
Try this:
x0 <- data.matrix(x)
rollapplyr(data = x0, width = 180, FUN = svm_next_day_prediction, by.column = FALSE)
you can create a list with the individual data frames and then apply your function. I rename x to df to avoid confusion:
df=x
rowwindow=179
dfList=lapply(1:(nrow(df)-rowwindow),function(x) df[x:(rowwindow+x),])
result=sapply(dfList,svm_next_day_prediction)
Often, I want to run a cross validation on a dataset which contains some factor variables and after running for a while, the cross validation routine fails with the error: factor x has new levels Y.
For example, using package boot:
library(boot)
d <- data.frame(x=c('A', 'A', 'B', 'B', 'C', 'C'), y=c(1, 2, 3, 4, 5, 6))
m <- glm(y ~ x, data=d)
m.cv <- cv.glm(d, m, K=2) # Sometimes succeeds
m.cv <- cv.glm(d, m, K=2)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor x has new levels B
Update: This is a toy example. The same problem occurs with larger datasets as well, where there are several occurrences of level C but none of them is present in the training partition.
The function createDataPartition function from the package caret does stratified sampling for the outcome variables and correctly warns:
Also, for ‘createDataPartition’, very small class sizes (<= 3) the classes may not show up in both the training and test data.
There are two solutions which spring to mind:
First, create a subset of the data by selecting one random sample of each factor level first, starting from the rarest class (by frequency) and then greedily satisfying the next rare class and so on. Then using createDataPartition on the rest of the dataset and merging the results to create a new train dataset which contains all levels.
Using createDataPartitions and and doing rejection sampling.
So far, option 2 has worked for me because of the data sizes, but I cannot help but think that there must be a better solution than a hand rolled out one.
Ideally, I would want a solution which just works for creating partitions and fails early if there is no way to create such partitions.
Is there a fundamental theoretical reason why packages do not offer this? Do they offer it and I just haven't been able to spot them because of a blind spot? Is there a better way of doing this stratified sampling?
Please leave a comment if I should ask this question on stats.stackoverflow.com.
Update:
This is what my hand rolled out solution (2) looks like:
get.cv.idx <- function(train.data, folds, factor.cols = NA) {
if (is.na(factor.cols)) {
all.cols <- colnames(train.data)
factor.cols <- all.cols[laply(llply(train.data[1, ], class), function (x) 'factor' %in% x)]
}
n <- nrow(train.data)
test.n <- floor(1 / folds * n)
cond.met <- FALSE
n.tries <- 0
while (!cond.met) {
n.tries <- n.tries + 1
test.idx <- sample(nrow(train.data), test.n)
train.idx <- setdiff(1:nrow(train.data), test.idx)
cond.met <- TRUE
for(factor.col in factor.cols) {
train.levels <- train.data[ train.idx, factor.col ]
test.levels <- train.data[ test.idx , factor.col ]
if (length(unique(train.levels)) < length(unique(test.levels))) {
cat('Factor level: ', factor.col, ' violated constraint, retrying.\n')
cond.met <- FALSE
}
}
}
cat('Done in ', n.tries, ' trie(s).\n')
list( train.idx = train.idx
, test.idx = test.idx
)
}
Everyone agrees that there sure is an optimal solution. But personally, I would just try the cv.glm call until it works usingwhile.
m.cv<- try(cv.glm(d, m, K=2)) #First try
class(m.cv) #Sometimes error, sometimes list
while ( inherits(m.cv, "try-error") ) {
m.cv<- try(cv.glm(d, m, K=2))
}
class(m.cv) #always list
I've tried it with 100,000 rows in the data.fame and it only takes a few seconds.
library(boot)
n <-100000
d <- data.frame(x=c(rep('A',n), rep('B', n), 'C', 'C'), y=1:(n*2+2))
m <- glm(y ~ x, data=d)
m.cv<- try(cv.glm(d, m, K=2))
class(m.cv) #Sometimes error, sometimes list
while ( inherits(m.cv, "try-error") ) {
m.cv<- try(cv.glm(d, m, K=2))
}
class(m.cv) #always list
When I call traceback I get this:
> traceback()
9: stop(sprintf(ngettext(length(m), "factor %s has new level %s",
"factor %s has new levels %s"), nm, paste(nxl[m], collapse = ", ")),
domain = NA)
8: model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)
7: model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
6: predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==
"link", "response", type), terms = terms, na.action = na.action)
5: predict.glm(d.glm, data[j.out, , drop = FALSE], type = "response")
4: predict(d.glm, data[j.out, , drop = FALSE], type = "response")
3: mean((y - yhat)^2)
2: cost(glm.y[j.out], predict(d.glm, data[j.out, , drop = FALSE],
type = "response"))
1: cv.glm(d, m, K = 2)
And looking at the cv.glm function gives:
> cv.glm
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2),
K = n)
{
call <- match.call()
if (!exists(".Random.seed", envir = .GlobalEnv, inherits = FALSE))
runif(1)
seed <- get(".Random.seed", envir = .GlobalEnv, inherits = FALSE)
n <- nrow(data)
out <- NULL
if ((K > n) || (K <= 1))
stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0))
K <- kvals[temp == min(temp)][1L]
if (K != K.o)
warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)
n.s <- table(s)
glm.y <- glmfit$y
cost.0 <- cost(glm.y, fitted(glmfit))
ms <- max(s)
CV <- 0
Call <- glmfit$call
for (i in seq_len(ms)) {
j.out <- seq_len(n)[(s == i)]
j.in <- seq_len(n)[(s != i)]
Call$data <- data[j.in, , drop = FALSE]
d.glm <- eval.parent(Call)
p.alpha <- n.s[i]/n
cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out,
, drop = FALSE], type = "response"))
CV <- CV + p.alpha * cost.i
cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm,
data, type = "response"))
}
list(call = call, K = K, delta = as.numeric(c(CV, CV + cost.0)),
seed = seed)
}
It seems the problem has to do with your extremely small sample size and categorical effect (with values "A", "B", and "C"). You are fitting a glm with 2 effects: "B:A" and "C:A". In each CV iteration you bootstrap from the sample dataset and fit a new model d.glm. Given the size, the bootstrapped data are guaranteed to come up with 1 or more iteration in which the value "C" is not sampled, hence the error comes from calculating fitted probabilities from the bootstrap model from the training data in which validation data has a "C" level for x not observed in the training data.
Frank Harrell (often on stats.stackexchange.com) wrote in Regression Modelling Strategies that one ought to favor against split sample validation when sample size is small and/or some cell counts are small in categorical data analysis. Singularity (as you are seeing here) is one of many reasons why I think this is true.
Given the small sample size here, you should consider some split sample cross validation alternatives like a permutation test, or a parametric bootstrap. Another important consideration is exactly why you feel model based inference isn't correct. As Tukey said of the bootstrap, he'd like to call it a shotgun. It will blow the head off of any problem, as long as you're willing to reassemble the pieces.
There don't seem to be many simple solutions around the web so here's one I worked out that should be easy to generalize to as many factors as you need. It uses pre-installed packages and Caret but you could get away with just base R if you really wanted.
To use cross-validation when you have multiple factors follow a two-step process. Convert the factors to numerics and then multiply them together. Use this new variable as the target variable in a stratified sampling function. Be sure to remove it or keep it out of your training set after creating your folds.
If y is your DV and x is a factor then:
#Simulated factors (which are conveniently distributed for the example)
dataset <-data.frame(x=as.factor(rep(c(1,10),1000)),y=as.factor(rep(c(1,2,3,4),250)[sample(1000)]))
#Convert the factors to numerics and multiply together in new variable
dataset$cv.variable <-as.numeric(levels(dataset$x))[dataset$x]*as.numeric(levels(dataset$y))[dataset$y]
prop.table(table(dataset$y)) #One way to view distribution of levels
ftable(dataset$x,dataset$y) #A full table of all x and y combinations
folds <- caret::createFolds(dataset$cv.variable,k=10)
testIndexes <- folds[[k]]
testData <- as.data.frame(dataset[testIndexes, ])
trainData <- as.data.frame(dataset[-testIndexes, ])
prop.table(table(testData$y))
ftable(testData$x,testData$y) #evaluate distribution
which should produce a result that is close to balanced.
Note: In real life, if your sample lacks the requisite unique combinations of factors then your problem is harder overcome and might be impossible. You can either drop some levels from consideration before creating folds or employ some kind of over-sampling.