Coding a prediction interval from a generalized additive model with a very large dataset - r

I have a small dataset of people, thier locations, and whether or not they know each other. It is a subset of a dataset with 1000 people. Given that each person could know any other person, the number of potential links grows just under n^2. I want to fit a model with the small subset, to get probability of linkage as a function of distance, so that I can perform simulations with the wider dataset.
I've got two problems:
I'm not sure how to create a prediction interval from a fitted GAM object.
Generating a prediction interval using either posterior simulation or using this technique from R-sig-mixed is computationally prohibitive.
The following is an example of my problem, generating the interval using the technique from R-sig-mixed. Be warned that the last step will throw an error about not being able to allocate a huge vector, unless you're on a really impressive machine.
#Some fake location data
set.seed(13)
x = runif(50)*2
y = runif(50)*2
d = cbind(ID = 1:50,as.matrix(dist(data.frame(x,y))))
I want to model links as a function of distance. More fake data:
library(reshape)
mdata <- melt(as.data.frame(d), id=c("ID"),measure.vars = colnames(d)[2:ncol(d)],variable.name="distance")
mdata$popularity = rnorm(25,sd=.3)
colnames(mdata)[colnames(mdata)=="variable"] = "knows"
colnames(mdata)[colnames(mdata)=="value"] = "distance"
mdata = subset(mdata,ID!=knows)
a = exp(1/(mdata$distance/runif(nrow(mdata))^mdata$distance)+mdata$popularity+rnorm(nrow(mdata),sd=.001))
mdata$prlink = a/(1+a)
with(mdata,plot(distance,prlink))
mdata$link = runif(nrow(mdata))<mdata$prlink
mdata$ID = as.factor(mdata$ID)
mdata$knows = as.factor(mdata$knows)
mdata$dum=1 #this facilitates predicting from the population of the model, later
Now, I model the data:
library(mgcv)
mod = gam(link~s(distance)+s(ID,bs="re",by=dum)+s(knows,bs="re",by=dum),data=mdata,family=binomial(link="logit"))
plot(mod,pages=1)
summary(mod)
Now, I want to apply the fitted model to my main dataset:
x = runif(1000)*2
y = runif(1000)*2
d = cbind(ID = 1:1000,as.matrix(dist(data.frame(x,y))))
mdata <- melt(as.data.frame(d),id.vars = "ID")
colnames(mdata)[colnames(mdata)=="variable"] = "knows"
colnames(mdata)[colnames(mdata)=="value"] = "distance"
mdata = subset(mdata,ID!=knows)
mdata$dum=0; mdata$ID=1; mdata$knows=2 #These are needed for prediction, even though I am predicting from the population of the model, not one of the levels.
Some timekeeping tools...
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
type <- match.arg(type)
assign(".type", type, envir=baseenv())
if(gcFirst) gc(FALSE)
tic <- proc.time()[type]
assign(".tic", tic, envir=baseenv())
invisible(tic)
}
toc <- function()
{
type <- get(".type", envir=baseenv())
toc <- proc.time()[type]
tic <- get(".tic", envir=baseenv())
print(toc - tic)
invisible(toc)
}
tic()
p = predict(mod,newdata=mdata,type="response")
toc()
Just predicting the point estimates takes 31 seconds on my machine. Now to try to get the prediction intervals, first get the design matrix...
tic()
Designmat = predict(mod,newdata=mdata,type="lpmatrix")
toc()
That took me 47 seconds and froze my computer while it was working.
Now here is the technique for getting a prediction interval that I found on R-sig-mixed...
CAUTION: THE FOLLOWING CODE WILL ATTEMPT TO ALLOCATE A MASSIVE AMOUNT OF MEMORY AND MIGHT CRASH YOUR MACHINE.
tic()
predvar <- diag(Designmat %*% vcov(mod) %*% t(Designmat))
SE <- sqrt(predvar)
SE2 <- sqrt(predvar+mod$sig2)
tfrac <- qt(0.975, mod$df.residual)
interval = tfrac*SE2
toc()
>Error: cannot allocate vector of size 7435.7 Gb
Is there another way???

You need to avoid calculating Designmat %*% vcov(mod) %*% t(Designmat). You only need the diagonal after all. Try this:
tmp <- Designmat %*% vcov(mod)
library(compiler)
diagMult <- cmpfun(function(m1, m2) sapply(seq_len(nrow(m1)),
function(i) m1[i,] %*% m2[,i]))
predvar <- diagMult(tmp, t(Designmat))
(Not tested thoroughly. The function should be implemented with Rcpp to improve speed, if there is not already a compiled version in some package.)

Related

How to accomplish replicated calculation and plot in subset dataset?

I have a simulated data created like this:
average_vector = c(0,0,25)
sigma_matrix = matrix(c(4,1,0,1,8,0,0,0,9),nrow=3,ncol=3)
set.seed(12345)
data0 = as.data.frame(mvrnorm(n =20000, mu = average_vector, Sigma=sigma_matrix))
names(data0)=c("hard","smartness","age")
set.seed(13579)
data0$final=0.5*data0$hard+0.2*data0$smartness+(-0.1)*data0$age+rnorm(n=dim(data0)[1],mean=90,sd=6)
Now, I want to randomly sample 50 students 1,000 times (1,000 sets of 50 people), I used this code:
datsub<-(replicate(1000, sample(1:nrow(data0),50)))
After that step, I encountered a issue: I want to ask if I want to run a regression model with the 50 selected people (1,000 times), and record/store the point estimates of “hard” from model 4, where is given like this:
model4 = lm(formula = final ~ hard + smartness + age, data = data0), and plot the variation around the line of 0.5 (true value), is there any way I can achieve that? Thanks a lot!
I would highly suggest looking into either caret or the newer (and still maintained) TidyModels if you're just getting into R modelling. Either of these will make your life easier, once you get used to the dplyr-like syntax.
What you're trying to do is bootstrapping. Here is the manual approach using only base functions.
n <- nrow(data0)
k <- 1000
ns <- 50
samples <- replicate(k, sample(seq_len(n), ns))
params <- vector('list', k)
for(i in seq_len(n)){
params[[i]] <- coef( lm(formula = final ~ hard + smartness + age, data = data0[samples[, i],]) )
}
# merge params into columns
params <- do.call(rbind, params)
# Create plot from here.
plot(x = seq_len(n), y = params[, "hard"])
abline(h = 0.5)
Note the above may have a few typos as your example is not reproducible.

Plotting Forecast and Real values in one plot using a Rolling Window

I have a code which takes the input as the Yield Spread (dependent var.) and Forward Rates(independent var.) and operate an auto.arima to get the orders. Afterwards, I am forecasting the next 25 dates (forc.horizon). My training data are the first 600 (training). Then I am moving the time window 25 dates, meaning using the data from 26 to 625, estimating the auto.arima and then forecasting the data from 626 to 650 and so on. My data sets are 2298 rows (date) and 30 columns (maturity).
I want to store all of the forecasts and then plot the forecasted and real values in the same plot.
This is the code I have, but it doesn't store the forecasts in a way to plot later.
forecast.func <- function(NS.spread, ind.v, maturity, training, forc.horizon){
NS.spread <- NS.spread/100
forc <- c()
j <- 0
for(i in 1:floor((nrow(NS.spread)-training)/forc.horizon)){
# test data
y <- NS.spread[(1+j):(training+j) , maturity]
f <- ind.v[(1+j):(training+j) , maturity]
# auto- arima
c <- auto.arima(y, xreg = f, test= "adf")
# forecast
e <- ind.v[(training+j+1):(training+j+forc.horizon) , maturity]
h <- forecast(c, xreg = lagmatrix(e, -1))
forc <- c(forc, list(h))
j <- j + forc.horizon
}
return(forc)
}
a <- forecast.func(spread.NS.JPM, Forward.rate.JPM, 10, 600, 25)
lapply(a, plot)
Here's a link to my two datasets:
https://drive.google.com/drive/folders/1goCxllYHQo3QJ0IdidKbdmfR-DZgrezN?usp=sharing
LOOK AT THE END for a full functional example on how to handle AUTO.ARIMA MODEL with DAILY DATA using XREG and FOURIER SERIES with ROLLING STARTING TIMES and cross validated training and test.
Without a reproducible example no one can help you, because they can't run your code. You need to provide data. :-(
Even if it's not part of StackOverflow to discuss statistics matters, why don't you do an auto.arima with xreg instead of lm + auto.arima on residuals? Especially, considering how you forecast at the end, that training method looks really wrong. Consider using:
fit <- auto.arima(y, xreg = lagmatrix(f, -1))
h <- forecast(fit, xreg = lagmatrix(e, -1))
auto.arima will automatically calculate the best parameters by max likelihood.
On your coding question..
forc <- c() should be outside of the for loop, otherwise at every run you delete your previous results.
Same for j <- 0: at every run you're setting it back to 0. Put it outside if you need to change its value at every run.
The output of forecast is an object of class forecast, which is actually a type of list. Therefore, you can't use cbind effectively.
I'm my opinion, you should create forc in this way: forc <- list()
And create a list of your final results in this way:
forc <- c(forc, list(h)) # instead of forc <- cbind(forc, h)
This will create a list of objects of class forecast.
You can then plot them with a for loop by getting access at every object or with a lapply.
lapply(output_of_your_function, plot)
This is as far as I can go without a reproducible example.
FINAL EDIT
FULL FUNCTIONAL EXAMPLE
Here I try to sum up a conclusion out of the million comments we wrote.
With the data you provided, I built a code that can handle everything you need.
From training and test to model, till forecast and finally plotting which have the X axis with the time as required in one of your comments.
I removed the for loop. lapply is much better for your case.
You can leave the fourier series if you want to. That's how Professor Hyndman suggests to handle daily time series.
Functions and libraries needed:
# libraries ---------------------------
library(forecast)
library(lubridate)
# run model -------------------------------------
.daily_arima_forecast <- function(init, training, horizon, tt, ..., K = 10){
# create training and test
tt_trn <- window(tt, start = time(tt)[init] , end = time(tt)[init + training - 1])
tt_tst <- window(tt, start = time(tt)[init + training], end = time(tt)[init + training + horizon - 1])
# add fourier series [if you want to. Otherwise, cancel this part]
fr <- fourier(tt_trn[,1], K = K)
frf <- fourier(tt_trn[,1], K = K, h = horizon)
tsp(fr) <- tsp(tt_trn)
tsp(frf) <- tsp(tt_tst)
tt_trn <- ts.intersect(tt_trn, fr)
tt_tst <- ts.intersect(tt_tst, frf)
colnames(tt_tst) <- colnames(tt_trn) <- c("y", "s", paste0("k", seq_len(ncol(fr))))
# run model and forecast
aa <- auto.arima(tt_trn[,1], xreg = tt_trn[,-1])
fcst <- forecast(aa, xreg = tt_tst[,-1])
# add actual values to plot them later!
fcst$test.values <- tt_tst[,1]
# NOTE: since I modified the structure of the class forecast I should create a new class,
# but I didnt want to complicate your code
fcst
}
daily_arima_forecast <- function(y, x, training, horizon, ...){
# set up x and y together
tt <- ts.intersect(y, x)
# set up all starting point of the training set [give it a name to recognize them later]
inits <- setNames(nm = seq(1, length(y) - training, by = horizon))
# remove last one because you wouldnt have enough data in front of it
inits <- inits[-length(inits)]
# run model and return a list of all your models
lapply(inits, .daily_arima_forecast, training = training, horizon = horizon, tt = tt, ...)
}
# plot ------------------------------------------
plot_daily_forecast <- function(x){
autoplot(x) + autolayer(x$test.values)
}
Reproducible Example on how to use the previous functions
# create a sample data
tsp(EuStockMarkets) <- c(1991, 1991 + (1860-1)/365.25, 365.25)
# model
models <- daily_arima_forecast(y = EuStockMarkets[,1],
x = EuStockMarkets[,2],
training = 600,
horizon = 25,
K = 5)
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[1]]
Example for the author of the post
# your data
load("BVIS0157_Forward.rda")
load("BVIS0157_NS.spread.rda")
spread.NS.JPM <- spread.NS.JPM / 100
# pre-work [out of function!!!]
set_up_ts <- function(m){
start <- min(row.names(m))
end <- max(row.names(m))
# daily sequence
inds <- seq(as.Date(start), as.Date(end), by = "day")
ts(m, start = c(year(start), as.numeric(format(inds[1], "%j"))), frequency = 365.25)
}
mts_spread.NS.JPM <- set_up_ts(spread.NS.JPM)
mts_Forward.rate.JPM <- set_up_ts(Forward.rate.JPM)
# model
col <- 10
models <- daily_arima_forecast(y = mts_spread.NS.JPM[, col],
x = stats::lag(mts_Forward.rate.JPM[, col], -1),
training = 600,
horizon = 25,
K = 5) # notice that K falls between ... that goes directly to the inner function
# plot
plots <- lapply(models, plot_daily_forecast)
plots[[5]]

Perceptron in R not converging

I am trying to understand Neural Networks better so I am trying to implement a simple perceptron from scratch in R. I know that this is very inefficient as there are many libraries that do this extemely well optimized but my goal is to understand the basics of neural networks better and work my way forward to more complex models.
I have created some artificial test data with a very simple linear decision boundary and split this into a training set and a test set. I then ran a logistic regression on the training data and checked the predictions from the test-set and got +99% accuray, which was to be expected given the simple nature of the data. I then tried implementing a perceptron with 2 inputs, 1 neuron, 1000 iterations, a learning rate of 0.1 and a sigmoid activation function.
I would expect to get very similar accuracy to the logistic regression model but my results are a lot worse (around 70% correct classifications in the training set). so I definitly did something wrong. The predictions only seem to get better after the first couple of iterations and then just go back and forth around a specific value (I tried with many different learning rates, no success). I'm attaching my script and I#m thankful for any advice! I think the problem lies in the calculation of the error or the weight adjustment but I can't put my finger on it...
### Reproducible Example for StackOverflow
#### Setup
# loading libraries
library(data.table)
#remove scientifc notation
options(scipen = 999)
# setting seed for random number generation
seed <- 123
#### Selfmade Test Data
# input points
x1 <- runif(10000,-100,100)
x2 <- runif(10000,-100,100)
# setting decision boundary to create output
output <- vector()
output[0.5*x1 + -1.2*x2 >= 50] <- 0
output[0.5*x1 + -1.2*x2 < 50] <- 1
# combining to dataframe
points <- cbind.data.frame(x1,x2,output)
# plotting all data points
plot(points$x1,points$x2, col = as.factor(points$output), main = "Self-created data", xlab = "x1",ylab = "x2")
# split into test and training sets
trainsize = 0.2
set.seed(seed)
train_rows <- sample(1:dim(points)[1], size = trainsize * dim(points)[1])
train <- points[train_rows,]
test <- points[-c(train_rows),]
# plotting training set only
plot(train$x1,train$x2, col = as.factor(train$output), main = "Self-created data (training set)", xlab = "x1",ylab = "x2")
#### Approaching the problem with logistic regression
# building model
train_logit <- glm(output ~ x1 + x2, data = train, family = "binomial", maxit = 10000)
summary(train_logit)
# testing performance in training set
table(round(train_logit$fitted.values) == train$output)
# testing performance of train_logit model in test set
table(test$output == round(predict(train_logit,test[,c(1,2)], type = "response")))
# We get 100% accuracy in the training set and near 100% accuracy in the test set
#### Approaching Problem with a Perceptron from scratch
# setting inputs, outputs and weights
inputs <- as.matrix(train[,c(1,2)])
output <- as.matrix(train[,3])
set.seed(123456)
weights <- as.matrix(runif(dim(inputs)[2],-1,1))
## Defining activation function + derivative
# defining sigmoid and it's derivative
sigmoid <- function(x) {1 / (1 + exp(-x))}
sig_dir <- function(x){sigmoid(x)*(1 - sigmoid(x))}
## Perceptron nitial Settings
bias <- 1
# number of iterations
iterations <- 1000
# setting learning rate
alpha <- 0.1
## Perceptron
# creating vectors for saving results per iteration
weights_list <- list()
weights_list[[1]] <- weights
errors_vec <- vector()
outputs_vec <- vector()
# saving results across iterations
weights_list_all <- list()
outputs_list <- list()
errors_list <- list()
# looping through the backpropagation algorithm "iteration" # times
for (j in 1:iterations) {
# Loop for backpropagation with updating weights after every datapoint
for (i in 1:dim(train)[1]) {
# taking the weights from the last iteration of the outer loop as a starting point
if (j > 1) {
weights_list[[1]] <- weights
}
# Feed Forward (Should we really round this?!)
output_pred <- round(sigmoid(sum(inputs[i,] * as.numeric(weights)) + bias))
error <- output_pred - output[i]
# Backpropagation (Do I need the sigmoid derivative AND a learning rate? Or should I only take one of them?)
weight_adjustments <- inputs[i,] * (error * sig_dir(output_pred)) * alpha
weights <- weights - weight_adjustments
# saving progress for later plots
weights_list[[i + 1]] <- weights
errors_vec[i] <- error
outputs_vec[[i]] <- output_pred
}
# saving results for each iteration
weights_list_all[[j]] <- weights_list
outputs_list[[j]] <- outputs_vec
errors_list[[j]] <- errors_vec
}
#### Formatting Diagnostics for easier plotting
# implementing empty list to transform weightslist
WeightList <- list()
# collapsing individual weightslist into datafames
for (i in 1:iterations) {
WeightList[[i]] <- t(data.table::rbindlist(weights_list_all[i]))
}
# pasting dataframes together
WeightFrame <- do.call(rbind.data.frame, WeightList)
colnames(WeightFrame) <- paste("w",1:dim(WeightFrame)[2], sep = "")
# pasting dataframes together
ErrorFrame <- do.call(rbind.data.frame, errors_list)
OutputFrame <- do.call(rbind.data.frame, outputs_list)
##### Plotting Results
# Development of Mean Error per iteration
plot(rowMeans(abs(ErrorFrame)),
type = "l",
xlab = "Sum of absolute Error terms")
# Development of Weights over time
plot(WeightFrame$w1, type = "l",xlim = c(1,dim(train)[1]), ylim = c(min(WeightFrame),max(WeightFrame)), ylab = "Weights", xlab = "Iterations")
lines(WeightFrame$w2, col = "green")
# lines(WeightFrame$w3, col = "blue")
# lines(WeightFrame$w4, col = "red")
# lines(WeightFrame$w5, col = "orange")
# lines(WeightFrame$w6, col = "cyan")
# lines(WeightFrame$w7, col = "magenta")
# Empty vector for number of correct categorizations per iteration
NoCorr <- vector()
# Computing percentage of correct predictions per iteration
colnames(OutputFrame) <- paste("V",1:dim(OutputFrame)[2], sep = "")
Output_mat <- as.matrix(OutputFrame)
for (i in 1:iterations) {
NoCorr[i] <- sum(output == Output_mat[i,]) / nrow(train)
}
# plotting number of correct predictions per iteration
plot(NoCorr, type = "l")
# Performance in training set after last iteration
table(output,round(OutputFrame[iterations,]))
First of all, welcome to the world of Neural Networks :).
Secondly, I want to recommend a great article to you, which I personally used to get a better understanding of backtracking and the whole NN learning stuff: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/. Might be a bit rough to get through sometimes, and for the general implementation I think it is much easier to follow pseudocode from a NN book. However, to understand what is going on this is article is very nice!
Thirdly, I will hopefully solve your problem :)
You comment yourself already with whether you should really round that output_pred. Yes you should.. if you want to use that output_pred to make a prediction! However, if you want to use it for learning it is generally not good! The reason for this is that if you round it for learning, than an output which was rounded up from 0.51 to 1 with target output 1 will not learn anything as the output was the same as the target and thus is perfect. However, 0.99 would have been a lot better of a prediction than 0.51 and thus there is definitely something to learn!
I am not 100% sure if this solves all your problems (im not an R programmer) and gets your accuracy up to 99%, but it should solve some of it, and hopefully the intuition is also clear :)

Estimating an OLS model in R with million observations and thousands of variables

I am trying to estimate a big OLS regression with ~1 million observations and ~50,000 variables using biglm.
I am planning to run each estimation using chunks of approximately 100 observations each. I tested this strategy with a small sample and it worked fine.
However, with the real data I am getting an "Error: protect(): protection stack overflow" when trying to define the formula for the biglm function.
I've already tried:
starting R with --max-ppsize=50000
setting options(expressions = 50000)
but the error persists
I am working on Windows and using Rstudio
# create the sample data frame (In my true case, I simply select 100 lines from the original data that contains ~1,000,000 lines)
DF <- data.frame(matrix(nrow=100,ncol=50000))
DF[,] <- rnorm(100*50000)
colnames(DF) <- c("y", paste0("x", seq(1:49999)))
# get names of covariates
my_xvars <- colnames(DF)[2:( ncol(DF) )]
# define the formula to be used in biglm
# HERE IS WHERE I GET THE ERROR :
my_f <- as.formula(paste("y~", paste(my_xvars, collapse = " + ")))
EDIT 1:
The ultimate goal of my exercise is to estimate the average effect of all 50,000 variables. Therefore, simplifying the model selecting fewer variables is not the solution I am looking for now.
The first bottleneck (I can't guarantee there won't be others) is in the construction of the formula. R can't construct a formula that long from text (details are too ugly to explore right now). Below I show a hacked version of the biglm code that can take the model matrix X and response variable y directly, rather than using a formula to build them. However: the next bottleneck is that the internal function biglm:::bigqr.init(), which gets called inside biglm, tries to allocate a numeric vector of size choose(nc,2)=nc*(nc-1)/2 (where nc is the number of columns. When I try with 50000 columns I get
Error: cannot allocate vector of size 9.3 Gb
(2.3Gb are required when nc is 25000). The code below runs on my laptop when nc <- 10000.
I have a few caveats about this approach:
you won't be able to handle a probelm with 50000 columns unless you have at least 10G of memory, because of the issue described above.
the biglm:::update.biglm will have to be modified in a parallel way (this shouldn't be too hard)
I have no idea if the p>>n issue (which applies at the level of fitting the initial chunk) will bite you. When running my example below (with 10 rows, 10000 columns), all but 10 of the parameters are NA. I don't know if these NA values will contaminate the results so that successive updating fails. If so, I don't know if there's a way to work around the problem, or if it's fundamental (so that you would need nr>nc for at least the initial fit). (It would be straightforward to do some small experiments to see if there is a problem, but I've already spent too long on this ...)
don't forget that with this approach you have to explicitly add an intercept column to the model matrix (e.g. X <- cbind(1,X) if you want one.
Example (first save the code at the bottom as my_biglm.R):
nr <- 10
nc <- 10000
DF <- data.frame(matrix(rnorm(nr*nc),nrow=nr))
respvars <- paste0("x", seq(nc-1))
names(DF) <- c("y", respvars)
# illustrate formula problem: fails somewhere in 15000 < nc < 20000
try(reformulate(respvars,response="y"))
source("my_biglm.R")
rr <- my_biglm(y=DF[,1],X=as.matrix(DF[,-1]))
my_biglm <- function (formula, data, weights = NULL, sandwich = FALSE,
y=NULL, X=NULL, off=0) {
if (!is.null(weights)) {
if (!inherits(weights, "formula"))
stop("`weights' must be a formula")
w <- model.frame(weights, data)[[1]]
} else w <- NULL
if (is.null(X)) {
tt <- terms(formula)
mf <- model.frame(tt, data)
if (is.null(off <- model.offset(mf)))
off <- 0
mm <- model.matrix(tt, mf)
y <- model.response(mf) - off
} else {
## model matrix specified directly
if (is.null(y)) stop("both y and X must be specified")
mm <- X
tt <- NULL
}
qr <- biglm:::bigqr.init(NCOL(mm))
qr <- biglm:::update.bigqr(qr, mm, y, w)
rval <- list(call = sys.call(), qr = qr, assign = attr(mm,
"assign"), terms = tt, n = NROW(mm), names = colnames(mm),
weights = weights)
if (sandwich) {
p <- ncol(mm)
n <- nrow(mm)
xyqr <- bigqr.init(p * (p + 1))
xx <- matrix(nrow = n, ncol = p * (p + 1))
xx[, 1:p] <- mm * y
for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
xyqr <- update(xyqr, xx, rep(0, n), w * w)
rval$sandwich <- list(xy = xyqr)
}
rval$df.resid <- rval$n - length(qr$D)
class(rval) <- "biglm"
rval
}

Cross validation for glm() models

I'm trying to do a 10-fold cross validation for some glm models that I have built earlier in R. I'm a little confused about the cv.glm() function in the boot package, although I've read a lot of help files. When I provide the following formula:
library(boot)
cv.glm(data, glmfit, K=10)
Does the "data" argument here refer to the whole dataset or only to the test set?
The examples I have seen so far provide the "data" argument as the test set but that did not really make sense, such as why do 10-folds on the same test set? They are all going to give exactly the same result (I assume!).
Unfortunately ?cv.glm explains it in a foggy way:
data: A matrix or data frame containing the data. The rows should be
cases and the columns correspond to variables, one of which is the
response
My other question would be about the $delta[1] result. Is this the average prediction error over the 10 trials? What if I want to get the error for each fold?
Here's what my script looks like:
##data partitioning
sub <- sample(nrow(data), floor(nrow(x) * 0.9))
training <- data[sub, ]
testing <- data[-sub, ]
##model building
model <- glm(formula = groupcol ~ var1 + var2 + var3,
family = "binomial", data = training)
##cross-validation
cv.glm(testing, model, K=10)
I am always a little cautious about using various packages 10-fold cross validation methods. I have my own simple script to create the test and training partitions manually for any machine learning package:
#Randomly shuffle the data
yourData<-yourData[sample(nrow(yourData)),]
#Create 10 equally size folds
folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)
#Perform 10 fold cross validation
for(i in 1:10){
#Segement your data by fold using the which() function
testIndexes <- which(folds==i,arr.ind=TRUE)
testData <- yourData[testIndexes, ]
trainData <- yourData[-testIndexes, ]
#Use test and train data partitions however you desire...
}
#Roman provided some answers in his comments, however, the answer to your questions is provided by inspecting the code with cv.glm:
I believe this bit of code splits the data set up randomly into the K-folds, arranging rounding as necessary if K does not divide n:
if ((K > n) || (K <= 1))
stop("'K' outside allowable range")
K.o <- K
K <- round(K)
kvals <- unique(round(n/(1L:floor(n/2))))
temp <- abs(kvals - K)
if (!any(temp == 0))
K <- kvals[temp == min(temp)][1L]
if (K != K.o)
warning(gettextf("'K' has been set to %f", K), domain = NA)
f <- ceiling(n/K)
s <- sample0(rep(1L:K, f), n)
This bit here shows that the delta value is NOT the root mean square error. It is, as the helpfile says The default is the average squared error function. What does this mean? We can see this by inspecting the function declaration:
function (data, glmfit, cost = function(y, yhat) mean((y - yhat)^2),
K = n)
which shows that within each fold, we calculate the average of the error squared, where error is in the usual sense between predicted response vs actual response.
delta[1] is simply the weighted average of the SUM of all of these terms for each fold, see my inline comments in the code of cv.glm:
for (i in seq_len(ms)) {
j.out <- seq_len(n)[(s == i)]
j.in <- seq_len(n)[(s != i)]
Call$data <- data[j.in, , drop = FALSE]
d.glm <- eval.parent(Call)
p.alpha <- n.s[i]/n #create weighted average for later
cost.i <- cost(glm.y[j.out], predict(d.glm, data[j.out,
, drop = FALSE], type = "response"))
CV <- CV + p.alpha * cost.i # add weighted average error to running total
cost.0 <- cost.0 - p.alpha * cost(glm.y, predict(d.glm,
data, type = "response"))
}

Resources