I have 500K users and I need to compute a linear regression (with intercept) for each of them.
Each user has around 30 records.
I tried with dplyr and lm and this is way too slow.
Around 2 seconds per user.
df %>%
  group_by(user_id, add = FALSE) %>%
  do(lm = lm(Y ~ x, data = .)) %>%
  mutate(lm_b0 = summary(lm)$coeff[1],
         lm_b1 = summary(lm)$coeff[2]) %>%
  select(user_id, lm_b0, lm_b1) %>%
  ungroup()
I tried to use lm.fit, which is known to be faster, but it doesn't seem to be compatible with dplyr.
Is there a fast way to do a linear regression by group?
You can just use the basic formulas for calculating the slope and intercept. lm does a lot of unnecessary work if all you care about are those two numbers. Here I use data.table for the aggregation, but you could do it in base R as well (or dplyr):
system.time(
res <- DT[,
{
ux <- mean(x)
uy <- mean(y)
slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
list(slope=slope, intercept=uy - slope * ux)
}, by=user.id
]
)
Produces for 500K users ~30 obs each (in seconds):
user system elapsed
7.35 0.00 7.36
Or about 15 microseconds per user.
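For reference, here is a minimal dplyr sketch of the same closed-form computation (my own illustration, not part of the original answer; it assumes the DT object created in the Data section below and will be slower than the data.table version):
library(dplyr)
res_dplyr <- as.data.frame(DT) %>%
  group_by(user.id) %>%
  summarise(
    # same closed-form slope and intercept as in the data.table code above
    slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2),
    intercept = mean(y) - slope * mean(x)
  )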
Update: I ended up writing a bunch of blog posts that touch on this as well.
And to confirm this is working as expected:
> summary(DT[user.id==89663, lm(y ~ x)])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.1965844 0.2927617 0.6714826 0.5065868
x 0.2021210 0.5429594 0.3722580 0.7120808
> res[user.id == 89663]
user.id slope intercept
1: 89663 0.202121 0.1965844
Data:
library(data.table)
set.seed(1)
users <- 5e5
records <- 30
x <- runif(users * records)
DT <- data.table(
x=x, y=x + runif(users * records) * 4 - 2,
user.id=sample(users, users * records, replace=T)
)
If all you want is coefficients, I'd just use user_id as a factor in the regression. Using @miles2know's simulated data code (though renaming the object, since reusing the name of exp() for something else looks weird to me):
dat <- data.frame(id = rep(c("a","b","c"), each = 20),
x = rnorm(60,5,1.5),
y = rnorm(60,2,.2))
mod = lm(y ~ x:id + id + 0, data = dat)
We fit no global intercept (+ 0) so that the intercept for each id is the id coefficient, and no x by itself, so that the x:id interactions are the slopes for each id:
coef(mod)
# ida idb idc x:ida x:idb x:idc
# 1.779686 1.893582 1.946069 0.039625 0.033318 0.000353
So, for the a level of id, the ida coefficient, 1.78, is the intercept and the x:ida coefficient, 0.0396, is the slope.
I'll leave the gathering of these coefficients into appropriate columns of a data frame to you...
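For example, one possible way to collect them (a sketch only; it assumes the coefficient naming pattern shown above, so adjust it if your factor levels differ):
cf   <- coef(mod)
n_id <- length(cf) / 2                       # first half: intercepts, second half: slopes
coefs <- data.frame(
  id        = sub("^id", "", names(cf)[1:n_id]),
  intercept = unname(cf[1:n_id]),
  slope     = unname(cf[(n_id + 1):(2 * n_id)])
)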
This solution should be very fast because you're not having to deal with subsets of data frames. It could probably be sped up even more with fastLm or such.
Note on scalability:
I did just try this on @nrussell's simulated full-size data and ran into memory allocation issues. Depending on how much memory you have, it may not work in one go, but you could probably do it in batches of user ids. Some combination of his answer and mine might be the fastest overall, or nrussell's might just be faster: expanding the user id factor into thousands of dummy variables might not be computationally efficient, as I've been waiting more than a couple of minutes now for a run on just 5,000 user ids.
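A rough sketch of that batching idea (my own illustration rather than part of the original answer; it uses the toy dat from above, and the chunk size of 1000 ids is an arbitrary choice you would tune to your memory budget):
ids        <- unique(dat$id)
chunk_size <- 1000
chunks     <- split(ids, ceiling(seq_along(ids) / chunk_size))
batch_coefs <- lapply(chunks, function(id_set) {
  sub    <- dat[dat$id %in% id_set, ]
  sub$id <- factor(sub$id)                  # drop levels outside this batch
  coef(lm(y ~ x:id + id + 0, data = sub))
})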
Update:
As pointed out by Dirk, my original approach can be greatly improved upon by specifying x and Y directly rather than using the formula-based interface of fastLm, which incurs (a fairly significant) processing overhead. For comparison, using the original full size data set,
R> system.time({
dt[,c("lm_b0", "lm_b1") := as.list(
unname(fastLm(x, Y)$coefficients))
,by = "user_id"]
})
# user system elapsed
#55.364 0.014 55.401
##
R> system.time({
dt[,c("lm_b0","lm_b1") := as.list(
unname(fastLm(Y ~ x, data=.SD)$coefficients))
,by = "user_id"]
})
# user system elapsed
#356.604 0.047 356.820
this simple change yields roughly a 6.5x speedup.
[Original approach]
There is probably some room for improvement, but the following took about 25 minutes on a Linux VM (2.6 GHz processor), running 64-bit R:
library(data.table)
library(RcppArmadillo)
##
dt[
,c("lm_b0","lm_b1") := as.list(
unname(fastLm(Y ~ x, data=.SD)$coefficients)),
by=user_id]
##
R> dt[c(1:2, 31:32, 61:62),]
user_id x Y lm_b0 lm_b1
1: 1 1.0 1674.8316 -202.0066 744.6252
2: 1 1.5 369.8608 -202.0066 744.6252
3: 2 1.0 463.7460 -144.2961 374.1995
4: 2 1.5 412.7422 -144.2961 374.1995
5: 3 1.0 513.0996 217.6442 261.0022
6: 3 1.5 1140.2766 217.6442 261.0022
Data:
dt <- data.table(
user_id = rep(1:500000,each=30))
##
dt[, x := seq(1, by=.5, length.out=30), by = user_id]
dt[, Y := 1000*runif(1)*x, by = user_id]
dt[, Y := Y + rnorm(
30,
mean = sample(c(-.05,0,0.5)*mean(Y),1),
sd = mean(Y)*.25),
by = user_id]
You might give this a try using data.table. I've just created some toy data, but I'd imagine data.table would give some improvement; it's quite speedy. Since yours is quite a large data set, perhaps benchmark this approach on a smaller sample first to see whether the speed-up is worthwhile. Good luck.
library(data.table)
exp <- data.table(id = rep(c("a","b","c"), each = 20), x = rnorm(60,5,1.5), y = rnorm(60,2,.2))
# edit: it might also help to set a key on id with such a large data-set
# with the toy example it would make no diff of course
exp <- setkey(exp,id)
# the nuts and bolts of the data.table part of the answer
result <- exp[, as.list(coef(lm(y ~ x))), by=id]
result
id (Intercept) x
1: a 2.013548 -0.008175644
2: b 2.084167 -0.010023549
3: c 1.907410 0.015823088
An example using Rfast.
Assuming a single response and 500K predictor variables.
library(Rfast)
y <- rnorm(30)
x <- matrnorm(500 * 1000, 30)
system.time( Rfast::univglms(y, x,"normal") ) ## 0.70 seconds
Assuming 500K response variables and a single predictor variable.
system.time( Rfast::mvbetas(x,y) ) ## 0.60 seconds
Note: the above times will decrease in the near future.
Related
I am searching for an R equivalent of the extremely convenient Stata command simulate. The command basically allows you to declare a program (reg_simulation in the example below), invoke that program via simulate, and store the desired outputs.
Below is a Stata illustration of the usage of the simulate program, together with my attempt to replicate it using R.
Finally, my main question is: is this how R users would run a Monte Carlo simulation, or am I missing something in terms of structure or speed bottlenecks? Thanks a lot in advance.
Stata example
Defining reg_simulation program.
clear all
*Define "reg_simulation" to be used later on by "simulate" command
program reg_simulation, rclass
*Declaring Stata version
version 13
*Dropping all variables in memory
drop _all
*Set sample size (n=100)
set obs 100
*Simulate model
gen x1 = rnormal()
gen x2 = rnormal()
gen y = 1 + 0.5 * x1 + 1.5 *x2 + rnormal()
*Estimate OLS
reg y x1 x2
*Store coefficients
matrix B = e(b)
return matrix betas = B
end
Calling reg_simulation from simulate command:
*Set seed
set seed 1234
*Run the actual simulation 10 times using "reg_simulation"
simulate , reps(10) nodots: reg_simulation
Obtained result (stored data on memory)
_b_x1 _b_x2 _b_cons
.4470155 1.50748 1.043514
.4235979 1.60144 1.048863
.5006762 1.362679 .8828927
.5319981 1.494726 1.103693
.4926634 1.476443 .8611253
.5920001 1.557737 .8391003
.5893909 1.384571 1.312495
.4721891 1.37305 1.017576
.7109139 1.47294 1.055216
.4197589 1.442816 .9404677
R replication of the Stata program above.
Using R I have managed to get the following (I'm not an R expert, though). However, the part that worries me the most is the for-loop structure that loops over the number of repetitions nreps.
Defining reg_simulation function.
#Defining a function
reg_simulation<- function(obs = 1000){
data <- data.frame(
#Generate data
x1 <-rnorm(obs, 0 , 1) ,
x2 <-rnorm(obs, 0 , 1) ,
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1) )
#Estimate OLS
ols <- lm(y ~ x1 + x2, data=data)
return(ols$coefficients)
}
Calling reg_simulation 10 times using a for-loop structure:
#Generate list to store results from simulation
results_list <- list()
# N repetitions
nreps <- 10
for (i in 1:nreps) {
#Set seed internally (to get different values in each run)
set.seed(i)
#Save results into list
results_list[i] <- list(reg_simulation(obs=1000))
}
#unlist results
df_results<- data.frame(t(sapply(results_list,
function(x) x[1:max(lengths(results_list))])))
Obtained result: df_results.
#final results
df_results
# X.Intercept. x1 x2
# 1 1.0162384 0.5490488 1.522017
# 2 1.0663263 0.4989537 1.496758
# 3 0.9862365 0.5144083 1.462388
# 4 1.0137042 0.4767466 1.551139
# 5 0.9996164 0.5020535 1.489724
# 6 1.0351182 0.4372447 1.444495
# 7 0.9975050 0.4809259 1.525741
# 8 1.0286192 0.5253288 1.491966
# 9 1.0107962 0.4659812 1.505793
# 10 0.9765663 0.5317318 1.501162
You're on the right track. Couple of hints/corrections:
1. Don't use <- inside data.frame()
In R, we construct data frames using = for internal column assignment, i.e. data.frame(x = 1:10, y = 11:20) rather than data.frame(x <- 1:10, y <- 11:20).
(There's more to be said about <- vs =, but I don't want to distract from your main question.)
In your case, you don't actually even need to create a data frame since x1, x2 and y will all be recognized as "global" variables within the scope of the function. I'll post some code at the end of my answer demonstrating this.
2. Pre-allocate the list when growing it via a for loop
Always try to pre-allocate the list's length (and type) if you are going to grow it in a (long) for loop. Reason: that way, R knows how much memory to allocate to your object efficiently. In the case where you are only doing 10 reps, that would mean starting with something like:
results_list <- vector("list", 10)
3. Consider using lapply instead of for
for loops have a bit of a bad rep in R. (Somewhat unfairly, but that's a story for another day.) An alternative that many R users would consider is the functional-programming approach offered by lapply. I'll hold off on showing you the code for a second, but it will look very similar to a for loop. Just to note quickly, following on from point 2, one immediate benefit is that you don't need to pre-allocate the list with lapply.
4. Run large loops in parallel
A Monte Carlo simulation is an ideal candidate for running everything in parallel, since each iteration is supposed to be independent of the others. An easy way to go parallel in R is via the future.apply package.
Putting everything together, here's how I'd probably do your simulation. Note that this might be more "advanced" than you possibly need, but since I'm here...
library(data.table) ## optional, but what I'll use to coerce the list into a DT
library(future.apply) ## for parallel stuff
plan(multisession) ## use all available cores
obs <- 1e3
# Defining a function
reg_simulation <- function(...){
x1 <- rnorm(obs, 0 , 1)
x2 <- rnorm(obs, 0 , 1)
y <- 1 + 0.5* x1 + 1.5 * x2 + rnorm(obs, 0 , 1)
#Estimate OLS
ols <- lm(y ~ x1 + x2)
# return(ols$coefficients)
return(as.data.frame(t(ols$coefficients)))
}
# N repetitions
nreps <- 10
## Serial version
# results <- lapply(1:nreps, reg_simulation)
## Parallel version
results <- future_lapply(1:nreps, reg_simulation, future.seed = 1234L)
## Unlist / convert into a data.table
results <- rbindlist(results)
So, following up on the comments, you want to vary your independent variables (x) and also the error term and simulate the coefficients, but you also want to catch errors if any occur. The following would do the trick:
set.seed(42)
#Defining a function
reg_simulation <- function(obs = 1000){
#Generate data (using = / explicit assignment rather than <- inside data.frame())
x1 <- rnorm(obs, 0, 1)
x2 <- rnorm(obs, 0, 1)
y <- 1 + 0.5 * x1 + 1.5 * x2 + rnorm(obs, 0, 1)
data <- data.frame(x1 = x1, x2 = x2, y = y)
#Estimate OLS
tryCatch(
{
ols <- lm(y ~ x1 + x2, data=data)
return(ols$coefficients)
},
error = function(e){
return(c('(Intercept)'=NA, 'x1'=NA, 'x2'=NA))
}
)
}
output <- t(data.frame(replicate(10, reg_simulation())))
output
(Intercept) x1 x2
X1 0.9961328 0.4782010 1.481712
X2 1.0234698 0.4801982 1.556393
X3 1.0336289 0.5239380 1.435468
X4 0.9796523 0.5095907 1.493548
...
Here, tryCatch (see also failwith) catches the error and returns NA as the default value.
Note that you only need to set the seed once, because the seed changes automatically with every call to the random number generator in a deterministic fashion.
I am trying to run a regression model that includes fixed effects for cities in the United States. I have over 10,000,000 rows and 600 cities. The code below works, but it is really slow. When including a factor for a variable with lots of levels, is there any way to run the model faster?
x <- data.frame(
a = sample( 1:1000, 1000000 , replace=T),
cityfips = sample( 1:250, 1000000 , replace=T),
d = sample( 1:4, 1000000 , replace=T)
)
system.time(a1 <- lm( a~cityfips+d , x ) )
system.time(a2 <- lm( a~as.factor(cityfips) + d , x ) )
> system.time(a1 <- lm( a~cityfips+d , x ) )
user system elapsed
0.22 0.00 0.22
> system.time(a2 <- lm( a~as.factor(cityfips) + d , x ) )
user system elapsed
95.65 0.97 96.62
> system.time(a3 <- slm( a~as.factor(cityfips) + d , x ) )
user system elapsed
4.58 2.06 6.65
When you have that many factor levels, constructing the model matrix in lm() takes up most of the time. One way around this is to use a sparse matrix, as glmnet does; two packages, SparseM and MatrixModels, allow fitting lm-style models on a sparse matrix:
set.seed(111)
x <- data.frame(
a = sample( 1:1000, 1000000 , replace=T),
cityfips = sample( 1:250, 1000000 , replace=T),
d = sample( 1:4, 1000000 , replace=T)
)
library(SparseM)
library(MatrixModels)
library(Matrix)
system.time(f_lm <- lm( a~as.factor(cityfips) + d , x ) )
user system elapsed
75.720 2.494 79.365
system.time(f_sparseM <- slm(a~as.factor(cityfips) + d , x ))
user system elapsed
5.373 3.952 10.646
system.time(f_modelMatrix <- glm4(a~as.factor(cityfips) + d ,data=x,sparse=TRUE))
user system elapsed
1.878 0.335 2.219
The closest I can find is glm4 in MatrixModels, and you can see below that the coefficients are the same as those from lm:
all.equal(as.numeric(f_sparseM$coefficients),as.numeric(f_lm$coefficients))
[1] TRUE
all.equal(as.numeric(f_lm$coefficients),as.numeric(coefficients(f_modelMatrix)))
[1] TRUE
One other option besides glm4 in MatrixModels is to use lm.fit (as pointed out by @BenBolker):
lm.fit(x=Matrix::sparse.model.matrix(~as.factor(cityfips) + d,data=x),y=x$a)
This gives you a plain list, as lm.fit() normally does, so you cannot apply functions such as summary() to the result.
The authors of both packages warn that these functions are experimental, so there might still be some differences compared to stats::lm; take care to check.
Check out the lfe package. I've not dug into the details of the algorithm, but at least in my experience it has produced exactly the same results as lm() in a fraction of the time.
As a bonus, it makes it easy to cluster standard errors, so you don't need to do any clustering and/or sandwich-estimator business afterward, although the syntax for doing so is a little unusual.
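A minimal sketch on the question's simulated data (my own illustration; it assumes lfe's four-part felm formula of covariates | fixed effects | instruments | cluster variables):
library(lfe)
x$cityfips <- factor(x$cityfips)
# cityfips absorbed as a fixed effect; the fourth formula part clusters the SEs by cityfips
f1 <- felm(a ~ d | cityfips | 0 | cityfips, data = x)
summary(f1)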
Novice here. I am fitting a negative binomial model on count data where Y is the count of events, D is the treatment, and X is a logarithmic offset:
out <- glm.nb(y ~ d + offset(log(x)),data=d1)
I would like to bootstrap the confidence intervals of the first difference between D=1 and D=0. I've gotten this far, but not sure if it is the correct approach:
holder <- matrix(NA,1200,1)
out <- glm.nb(y ~ d + offset(log(x)),data=d1)
for (i in 1:1200){
q <- sample(1:nrow(d1), 1)
d2 <- d1[q,]
d1_1 <- d1_2 <- d2
d1_1$d <- 1
d1_2$d <- 0
d1pred <- predict(out,d1_1,type="response")
d2pred <- predict(out,d1_2,type="response")
holder[i,1] <- (d1pred[1] - d2pred[1])
}
mean(holder)
Is this the correct way to bootstrap the first difference?
Generally, your approach is OK, but you can do it in a more R-ish way. Firstly, if you are serious about bootstrapping you can employ the boot library and benefit from more compact code, no loops, and many other advanced options.
In your case it can look like:
## Data generation
library(MASS)   # provides glm.nb
library(boot)
N <- 100
set.seed(1)
d1 <- data.frame(y=rbinom(N, N, 0.5),
d=rbinom(N, 1, 0.5),
x=rnorm(N, 10, 3))
## Model
out <- glm.nb(y ~ d + offset(log(x)), data=d1)
## Statistic function (what we are bootstrapping)
## Returns difference between D=1 and D=0
diff <- function(x,i,model){
v1 <- v2 <- x[i,]
v1$d <- 1
v2$d <- 0
predict(model,v1,type="response") - predict(model,v2,type="response")
}
## Bootstrapping itself
b <- boot(d1, diff, R=5e3, model=out)
mean(b$t)
Now b$t holds bootstrapped values. See names(b) and/or ?boot for extra information.
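Since the original goal was a confidence interval for the first difference, a percentile interval is one common choice (a sketch; other interval types are available via the type argument of boot.ci):
boot.ci(b, type = "perc")
## or directly from the bootstrapped values:
quantile(b$t, c(0.025, 0.975))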
Bootstrapping is a time-consuming operation, and one of the obvious advantages of the boot library is its support for parallel operation. It's as easy as:
b <- boot(d1, diff, R=5e3, model=out, parallel="multicore", ncpus=2)
If you are on Windows use parallel="snow" instead.
I'm using either dyn or dynlm to predict time series using lagged variables.
However, the predict function in either case only evaluates one time step at a time, taking a constant time of 24 milliseconds per step on my computer, or about 1.8 hours for my dataset, which is super long, given that the entire regression takes about 10 seconds.
So, I'm thinking that perhaps the fastest thing might be just to evaluate the formula by hand?
So, is there some way of evaluating a formula given values in a data.frame or the current environment or similar?
I'm thinking of something along the lines of:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ) )
I suppose, as I write this, that we need to handle the coefficients somehow, something like:
evalMagic( load ~ temperature + time, data.frame( temperature = 10, time = 4 ), model$coefficients )
.... so this raises the questions of:
isn't this what predict is supposed to do?
why is predict so slow?
what options do I have to make the prediction a bit faster? After all, it's not inverting any matrices or something, it's just a bit of arithmetic!
I wrote my own lag implementation in the end. It's hacky and not beautiful, but it's a lot faster. It can process 1000 rows in 4 seconds on my crappy laptop.
# lags is a data.frame, eg:
# var amount
# y 1
# y 2
addLags <- function( dataset, lags ) {
N <- nrow(dataset)
print(lags)
if( nrow(lags) > 0 ) {
print(lags)
for( j in 1:nrow(lags) ) {
sourcename <- as.character( lags[j,"var"] )
k <- lags[j,"amount"]
cat("k",k,"sourcename",sourcename,"\n")
lagcolname <- sprintf("%s_%d",sourcename,k)
dataset[,lagcolname] <- c(rep(0,k), dataset[1:(N-k),sourcename])
}
}
dataset
}
lmLagged <- function( formula, train, lags ) {
# get largest lag, and skip that
N <- nrow(train)
skip <- 0
for( j in 1:nrow(lags) ) {
k <- lags[j,"amount"]
skip <- max(k,skip)
}
print(train)
train <- addLags( train, lags )
print(train)
lm( formula, train[(skip+1):N,] )
}
# pass in training data, test data,
# it will step through one by one
# need to give dependent var name
# lags is a data.frame, eg:
# var amount
# y 1
# y 2
predictLagged <- function( model, train, test, dependentvarname, lags ) {
Ntrain <- nrow(train)
Ntest <- nrow(test)
test[,dependentvarname] <- NA
testtraindata <- rbind( train, test )
testtraindata <- addLags( testtraindata, lags )
for( i in 1:Ntest ) {
thistestdata <- testtraindata[Ntrain + i,]
result <- predict(model,newdata=thistestdata)
for( j in 1:nrow(lags) ) {
sourcename <- lags[j,"var"]
k <- lags[j,"amount"]
lagcolname <- sprintf("%s_%d",sourcename,k)
testtraindata[Ntrain + i + k,lagcolname] <- result
}
testtraindata[Ntrain+i,dependentvarname] <- result
}
return( testtraindata[(Ntrain+1):(Ntrain + Ntest),dependentvarname] )
}
library("RUnit")
# size of training data
N <- 6
predictN <- 50
# create training data, which we can get exact fit on
set.seed(1)
x = sample( 100, N )
traindata <- numeric()
traindata[1] <- 1 + 1.1 * x[1]
traindata[2] <- 2 + 1.1 * x[2]
for( i in 3:N ) {
traindata[i] <- 0.5 + 0.3 * traindata[i-2] - 0.8 * traindata[i-1] + 1.1 * x[i]
}
train <- data.frame(x = x, y = traindata, foo = 1)
#train$x <- NULL
# create testing data, bunch of NAs
test <- data.frame( x = sample(100,predictN), y = rep(NA,predictN), foo = 1)
# specify which lags we need to handle
# one row per lag, with name of variable we are lagging, and the distance
# we can then use these in the formula, eg y_1, and y_2
# are y lagged by 1 and 2 respectively
# It's hacky but it kind of works...
lags <- data.frame( var = c("y","y"), amount = c(1,2) )
# fit a model
model <- lmLagged( y ~ x + y_1 + y_2, train, lags )
# look at the model, it's a perfect fit. Nice!
print(model)
print(system.time( test <- predictLagged( model, train, test, "y", lags ) ))
#checkEqualsNumeric( 69.10228, test[56-6], tolerance = 0.0001 )
#checkEquals( 2972.159, test$y[106-6] )
print(test)
# nice plot
plot(test, type='l')
Output:
> source("test/test.regressionlagged.r",echo=F)
Call:
lm(formula = formula, data = train[(skip + 1):N, ])
Coefficients:
(Intercept) x y_1 y_2
0.5 1.1 -0.8 0.3
user system elapsed
0.204 0.000 0.204
[1] -19.108620 131.494916 -42.228519 80.331290 -54.433588 86.846257
[7] -13.807082 77.199543 12.698241 64.101270 56.428457 72.487616
[13] -3.161555 99.575529 8.991110 44.079771 28.433517 3.077118
[19] 30.768361 12.008447 2.323751 36.343533 67.822299 -13.154779
[25] 72.070513 -11.602844 115.003429 -79.583596 164.667906 -102.309403
[31] 193.347894 -176.071136 254.361277 -225.010363 349.216673 -299.076448
[37] 400.626160 -371.223862 453.966938 -420.140709 560.802649 -542.284332
[43] 701.568260 -679.439907 839.222404 -773.509895 897.474637 -935.232679
[49] 1022.328534 -991.232631
There's about 12 hours work in those 91 lines of code. Ok, I confess I played Plants and Zombies for a bit. So, 10 hours. Plus lunch and dinner. Still, quite a lot of work anyway.
If we change predictN to 1000, I get about 4.1 seconds from the system.time call.
I think it's faster because:
we don't use timeseries; I suspect that speeds things up
we don't use dynamic lm libraries, just normal lm; I guess that's slightly faster
we only pass a single row of data into predict for each prediction, which I think is significantly faster; e.g. with dyn$lm or dynlm, if one has a lag of 30, one would need to pass 31 rows of data into predict, AFAIK
a lot less data.frame/matrix copying, since we just update the lag values in-place on each iteration
Edit: corrected minor buggette where predictLagged returned a multi-column data-frame instead of just a numeric vector
Edit2: corrected less minor bug where you couldn't add more than one variable. Also reconciled the comments and code for lags, and changed the lags structure to "var" and "amount" in place of "name" and "lags". Also, updated the test code to add a second variable.
Edit: there are tons of bugs in this version, which I know, because I've unit-tested it a bit more and fixed them, but copying and pasting is very time-consuming, so I will update this post in a few days, once my deadline is over.
Maybe you're looking for this:
fastlinpred <- function(formula, newdata, coefs) {
X <- model.matrix( formula, data=newdata)
X %*% coefs
}
coefs <- c(1,2,3)
dd <- data.frame( temperature = 10, time = 4 )
fastlinpred( ~ temperature + time,
dd , coefs )
This assumes that the formula has only an RHS (you can get rid of the LHS of a formula by doing form[-2]).
This certainly gets rid of a lot of the overhead of predict.lm, but I don't know if it is as fast as you want. model.matrix has a lot of internal machinery too.
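A quick usage sketch with a fitted model (hypothetical data, purely for illustration; it just shows dropping the LHS with form[-2] and plugging the fitted coefficients into fastlinpred above):
dtrain <- data.frame(load = rnorm(20), temperature = rnorm(20), time = 1:20)
fit    <- lm(load ~ temperature + time, data = dtrain)
form   <- formula(fit)[-2]            # ~ temperature + time
fastlinpred(form, data.frame(temperature = 10, time = 4), coef(fit))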
I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows.
I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of iterations, say 100.
Each firm can potentially be selected multiple times, so I need to include its data multiple times in each iteration's data set.
Using a loop and subset approach, like below, seems computationally burdensome.
Note that for my real data frame, n and the number of iterations are much larger than in the example below.
My thoughts initially are to break the existing data frame into a list by subject using the split() command. From there, use
sample(unique(df1$subject),n,replace=TRUE)
to get the new list, then perhaps implement quickdf from the plyr package to construct a new data frame.
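A hedged sketch of that split()-based idea (my own illustration; it reuses Grunfeld and n from the slow example below):
by_firm <- split(Grunfeld, Grunfeld$firm)
sampled <- sample(names(by_firm), n, replace = TRUE)   # firms, possibly repeated
newdata <- do.call(rbind, by_firm[sampled])            # stack all rows for the sampled firms
coef(lm(value ~ inv + capital, data = newdata))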
Example slow code:
require(plm)
data("Grunfeld", package="plm")
firms = unique(Grunfeld$firm)
n = 10
iterations = 100
mybootresults=list()
for(j in 1:iterations){
v = sample(length(firms),n,replace=TRUE)
newdata = NULL
for(i in 1:n){
newdata = rbind(newdata,subset(Grunfeld, firm == v[i]))
}
reg1 = lm(value ~ inv + capital, data = newdata)
mybootresults[[j]] = coefficients(reg1)
}
mybootresults = as.data.frame(t(matrix(unlist(mybootresults),ncol=iterations)))
names(mybootresults) = names(reg1$coefficients)
mybootresults
(Intercept) inv capital
1 373.8591 6.981309 -0.9801547
2 370.6743 6.633642 -1.4526338
3 528.8436 6.960226 -1.1597901
4 331.6979 6.239426 -1.0349230
5 507.7339 8.924227 -2.8661479
...
...
How about something like this:
myfit <- function(x, i) {
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
library(boot)
firms <- unique(Grunfeld$firm)
b0 <- boot(firms, myfit, 999)
You can also use the tsboot function in the boot package with fixed block resampling scheme.
require(plm)
require(boot)
data(Grunfeld)
### each firm is of length 20
table(Grunfeld$firm)
## 1 2 3 4 5 6 7 8 9 10
## 20 20 20 20 20 20 20 20 20 20
blockboot <- function(data)
{
coefficients(lm(value ~ inv + capital, data = data))
}
### fixed length (every 20 obs, so for each different firm) block bootstrap
set.seed(321)
boot.1 <- tsboot(Grunfeld, blockboot, R = 99, l = 20, sim = "fixed")
boot.1
## Bootstrap Statistics :
## original bias std. error
## t1* 410.81557 -25.785972 174.3766
## t2* 5.75981 0.451810 2.0261
## t3* -0.61527 0.065322 0.6330
dim(boot.1$t)
## [1] 99 3
head(boot.1$t)
## [,1] [,2] [,3]
## [1,] 522.11 7.2342 -1.453204
## [2,] 626.88 4.6283 0.031324
## [3,] 479.74 3.2531 0.637298
## [4,] 557.79 4.5284 0.161462
## [5,] 568.72 5.4613 -0.875126
## [6,] 379.04 7.0707 -1.092860
Here is a method that should typically be faster than the accepted answer, returns the same results and does not rely on additional packages (except boot). The key here is to use which and integer indexing to construct each data.frame replicate rather than split/subset and do.call/rbind.
# get function for boot
myIndex <- function(x, i) {
# select the observations to subset. Likely repeated observations
blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
# run regression for given replicate, return estimated coefficients
coefficients(lm(value~ inv + capital, data=Grunfeld[blockObs,]))
}
Now, bootstrap:
# get result
library(boot)
set.seed(1234)
b1 <- boot(firms, myIndex, 200)
Run the accepted answer
set.seed(1234)
b0 <- boot(firms, myfit, 200)
Let's eyeball a comparison
Using indexing:
b1
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myIndex, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
Original version
b0
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = firms, statistic = myfit, R = 200)
Bootstrap Statistics :
original bias std. error
t1* 410.8155650 -6.64885086 197.3147581
t2* 5.7598070 0.37922066 2.4966872
t3* -0.6152727 -0.04468225 0.8351341
These look pretty close. Now, a bit more checking
identical(b0$t, b1$t)
[1] TRUE
and
identical(summary(b0), summary(b1))
[1] TRUE
Finally, we'll do a quick benchmark
library(microbenchmark)
microbenchmark(index={b1 <- boot(firms, myIndex, 200)},
rbind={b0 <- boot(firms, myfit, 200)})
On my computer, this returns
Unit: milliseconds
expr min lq mean median uq max neval
index 292.5770 296.3426 303.5444 298.4836 301.1119 395.1866 100
rbind 712.1616 720.0428 729.6644 724.0777 731.0697 833.5759 100
So, direct indexing is more than 2 times faster at every level of the distribution.
Note on missing fixed effects
As with most of the answers, the issue of missing "fixed effects" may emerge. Commonly, fixed effects are used as controls and the researcher is interested in one or a couple of variables that will be included with every selected observation. In this common case, there is no (or very little) harm in restricting the returned result of the myIndex or myfit function to only the variables of interest.
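For instance, a sketch of that restriction (a hypothetical variant of myIndex that returns only the inv and capital coefficients):
myIndexSlim <- function(x, i) {
  # same integer-indexing construction as myIndex above
  blockObs <- unlist(lapply(i, function(n) which(x[n] == Grunfeld$firm)))
  # keep only the coefficients of interest
  coefficients(lm(value ~ inv + capital, data = Grunfeld[blockObs, ]))[c("inv", "capital")]
}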
The solution needs to be modified to manage fixed effects.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
## Get the Grunfeld firm data (10 firms, each for 20 years, 1935-1954)
data("Grunfeld", package="plm")
## Create dataframe with unique firm identifier (one line per firm)
firms <- data.frame(firm=unique(Grunfeld$firm),junk=1)
## for boot(), X is the firms dataframe; i index the sampled firms
myfit <- function(X, i) {
## join the sampled firms to their firm-year data
mydata <- left_join(X[i,], Grunfeld, by="firm")
## Distinguish between multiple resamples of the same firm
## Otherwise they have the same id in the fixed effects regression
## And trouble ensues
mydata <- mutate(group_by(mydata,firm,year),
firm_uniq4boot = paste(firm,"+",row_number())
)
## Run regression with and without firm fixed effects
c(coefficients(lm(value ~ inv + capital, data = mydata)),
coefficients(lm(value ~ inv + capital + factor(firm_uniq4boot), data = mydata)))
}
set.seed(1)
system.time(b <- boot(firms, myfit, 1000))
summary(b)
summary(lm(value ~ inv + capital, data=Grunfeld))
summary(lm(value ~ inv + capital + factor(firm), data=Grunfeld))
I found a method using dplyr::left_join that is a bit more concise, only takes about 60% as long, and gives the same results as in the answer by Sean. Here's a complete self-contained example.
library(boot) # for boot
library(plm) # for Grunfeld
library(dplyr) # for left_join
# First get the data
data("Grunfeld", package="plm")
firms <- unique(Grunfeld$firm)
myfit1 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- do.call("rbind", lapply(i, function(n) subset(Grunfeld, firm==x[n])))
coefficients(lm(value ~ inv + capital, data = mydata))
}
myfit2 <- function(x, i) {
# x is the vector of firms
# i are the indexes into x
mydata <- left_join(data.frame(firm=x[i]), Grunfeld, by="firm")
coefficients(lm(value ~ inv + capital, data = mydata))
}
# rbind method
set.seed(1)
system.time(b1 <- boot(firms, myfit1, 5000))
## user system elapsed
## 13.51 0.01 13.62
# left_join method
set.seed(1)
system.time(b2 <- boot(firms, myfit2, 5000))
## user system elapsed
## 8.16 0.02 8.26
b1
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191
b2
## original bias std. error
## t1* 410.8155650 9.2896499 198.6877889
## t2* 5.7598070 0.5748503 2.5725441
## t3* -0.6152727 -0.1200954 0.7829191