I would like to do a large analysis using bootstrapping. I saw that bootstrapping can be sped up with parallel computing, as in the following code:
Parallel computing
# detect number of cpu
library(parallel)
detectCores()
library(boot)
# boot function --> mean
bt.mean <- function(dat, d){
x <- dat[d]
m <- mean(x)
return(m)
}
# obtain confidence intervals
# use parallel computing with 4 cpus
x <- mtcars$mpg
bt <- boot(x, bt.mean, R = 1000, parallel = "snow", ncpus = 4)
quantile(bt$t, probs = c(0.025, 0.975))
However, since the total number of calculations in my case is large (10^6 regressions, each with 10,000 bootstrap samples), I read that there are ways to use GPU computing to increase the speed even more (link1, link2). You can easily use GPU computing with some functions, as in:
GPU computing
m <- matrix(rnorm(10^6), ncol = 1000)
gm <- gpuR::gpuMatrix(m)   # move the matrix to the GPU first
csm <- colSums(gm)         # dispatches to gpuR's GPU-based colSums method
But it seems that these packages can only handle some specific R functions, such as matrix operations, linear algebra, or cluster analysis (link3).
Another approach is to use CUDA/C/C++/Fortran to write your own functions (link4). But I am looking for a solution in R.
My question is therefore:
Is it possible to use GPU computing for bootstrapping using the boot package and other R packages (e.g. quantreg)?
I don't think you can get the power of GPU computing for free, without any additional programming, at the moment. But the gpuR package is a good starting point. As you point out, gpuR can only handle some specific R functions such as matrix operations and linear algebra; that is restrictive but still useful, because, for example, linear regression can easily be formulated as a linear algebra problem. Quantile regression is not as straightforward to translate into linear algebra as linear regression, but it can be done: you can use the Newton-Raphson algorithm or some other numerical optimization algorithm (it is not as hard as it sounds), and a Newton step is itself a linear algebra operation.
The gpuR package already hides a lot of the C++ and hardware details needed to use GPU computing power and provides a fairly easy-to-use programming style. As far as I can see, this is the way to achieve what you want with the least effort: rely on the gpuR package, formulate your problem in terms of matrix operations and linear algebra (Newton-Raphson, etc.), and do the programming yourself. Alternatively, you may find an existing Newton-Raphson implementation in R for quantile regression and make the small modifications necessary, for example using gpuMatrix instead of matrix. Hope it helps.
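To make that concrete, here is a minimal sketch of the linear regression case (my own illustration, not tested on GPU hardware; it assumes gpuR's gpuMatrix supports crossprod and that [] copies the result back to the host):

library(gpuR)

set.seed(1)
n <- 1e5; p <- 10
X <- cbind(1, matrix(rnorm(n * (p - 1)), ncol = p - 1))
y <- X %*% rnorm(p) + rnorm(n)

gX <- gpuMatrix(X, type = "double")  # push the design matrix to the GPU
gy <- gpuMatrix(y, type = "double")  # push the response to the GPU

XtX <- crossprod(gX)                 # X'X computed on the GPU
Xty <- crossprod(gX, gy)             # X'y computed on the GPU

beta_hat <- solve(XtX[], Xty[])      # the small p x p system is solved on the CPU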
Surprised I couldn't find any answer to this, or someone doing something similar.
Question: I am running a bootstrap on a GLM that takes a long time. However, I don't care about any of the output/SEs/tests or matrices that glm gives me, just the MLEs for the coefficients (because I am using the bootstrap for uncertainty). It seems it would be far faster and more memory efficient to just run the ML optimizer, store the MLE for each coefficient, rinse and repeat. Is there any way to do this besides writing an optim() script from scratch?
P.S. Q: Is there any way to tell glm() to use something besides IRWLS for the fitting algorithm, like Newton-Raphson, etc.?
The X matrix has about 100 million observations and 10-15 parameters. I haven't had any serious issues with convergence, just a big matrix to deal with. I have been using fastglm(, method = 3), which I found to be the fastest and to give answers comparable to glm().
It takes about 15-20 minutes to run the regression, so getting 200 or so bootstraps takes a few days. I would like to run them in parallel but lack the memory for it.
You should benchmark glm.fit(), which takes a model matrix X and a response y instead of doing all of the formula stuff. Also try the fastglm package.
You might be able to speed glm.fit up a little bit by giving it the predicted values from the initial fit as mustart (this should cut the number of iterations required to converge).
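For example, a hedged sketch of that warm-start idea applied to a single bootstrap replicate (my illustration; X, y, and the binomial family are assumed to be the objects from your problem):

fit0 <- glm.fit(X, y, family = binomial())           # full-data fit
idx  <- sample(nrow(X), replace = TRUE)              # one bootstrap resample
fitb <- glm.fit(X[idx, ], y[idx], family = binomial(),
                mustart = fit0$fitted.values[idx])   # warm start from the full fit
coef_b <- fitb$coefficients                          # keep only the coefficients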
glm() takes a method= argument; the default is glm.fit (IRLS), but you could swap in something else if you like. It's hard to know what you would want to substitute, though: at least for canonical link functions, IRLS is equivalent to N-R. Do you have a better/faster fitting algorithm in mind? As for "just run[ning] the ML optimize[r]": it will be pretty hard to improve on IRLS, I think.
Since you're already using fastglm, there's less scope for improvement. If one bootstrap replicate takes 15 minutes, then removing overhead (e.g. by using fastglmPure instead of fastglm, analogous to the difference between glm.fit and glm) will hardly do anything. However, experimenting with different values of method (for faster but slightly less numerically stable algorithms) may still be worthwhile (given that you're already using method = 3, I'm guessing you've tried that).
The only other performance trick I can think of, in the absence of sufficient RAM/computing resources to run the bootstrap replicates in parallel, is to experiment with optimized/parallelized BLAS libraries. These should not in general require much additional memory, and might speed up your linear algebra considerably.
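For instance (a hedged sketch; it assumes the RhpcBLASctl package is installed), you can check which BLAS/LAPACK your R is linked against and control the number of BLAS threads:

sessionInfo()            # the BLAS: and LAPACK: lines show the linked libraries
library(RhpcBLASctl)
blas_get_num_procs()     # how many threads the BLAS currently uses
blas_set_num_threads(4)  # e.g. let BLAS operations use 4 threads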
A little bit of benchmarking:
library(fastglm)
library(microbenchmark)
library(ggplot2)
set.seed(101)
n <- 1e6
dd <- data.frame(x = rnorm(n))
dd$y <- rbinom(n, size = 1, prob = plogis(2-3*dd$x))
X <- model.matrix(~x, data = dd)
b <- binomial()
m1 <- microbenchmark(
glm = glm(y ~ x, data = dd, family = b),
glm.fit = glm.fit(X, dd$y, family = b),
fastglm = fastglm(X, dd$y, family = b),
fpure = fastglmPure(X, dd$y, family = b),
fpure3 = fastglmPure(X, dd$y, family = b, method = 3),
times = 20)
autoplot(m1)
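And, for completeness, a hedged sketch (my own illustration, reusing the objects above) of a bootstrap loop that keeps only the coefficient vector from each fit, which is all you say you need:

R <- 200
boot_coefs <- matrix(NA_real_, nrow = R, ncol = ncol(X))
for (r in seq_len(R)) {
  idx <- sample(nrow(X), replace = TRUE)                   # resample rows
  fit <- fastglmPure(X[idx, , drop = FALSE], dd$y[idx],
                     family = b, method = 3)
  boot_coefs[r, ] <- fit$coefficients                      # store only the MLEs
}
apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))    # percentile CIs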
I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with the convergence.
I am using this algorithm because I was looking for a method to perform a weighted clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response to a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control@iter.max <- 500 ### 500 iterations
fc_control@verbose <- 1 # this will set the verbose to TRUE
fc_control@tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestions. Should I increase the number of iterations? Is this an indicator that something else is wrong, or did I make a mistake with the R commands?
Thanks in advance.
Two things answer my question (along with a comment on weighting variables for k-means, or rather for hard competitive learning):
The weights are for observations (= rows of x), not variables (= columns of x). So using the weights argument of hardcl to weight variables is wrong.
In hardcl or neural gas you need many more iterations than in standard k-means: one k-means iteration uses the complete data set to update the centroids, whereas hard competitive learning uses only a single observation per iteration. So, compared to k-means, multiply the number of iterations roughly by your sample size.
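If the goal is still to weight variables, one workaround (my own hedged sketch, not part of the answer above) is to rescale each column by the square root of its weight before clustering, since squared Euclidean distance then weights the variables accordingly; df and my_weights are the objects from the question:

df_weighted <- sweep(as.matrix(df), 2, sqrt(my_weights), `*`)  # scale columns by sqrt(weight)
set.seed(1908)
km <- kmeans(df_weighted, centers = 7, nstart = 100)           # or any Euclidean clusterer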
I've been able to minimize a non-linear objective with a linear constraint using quadprog; however, I haven't been able to do it the other way around...
require(quadprog)
min_var <- function(Obj, Rentabilidades, var_covar){
  b <- c(Obj, 1)                              # target return and full-investment constraint
  Betha <- var_covar                          # quadratic term: covariance matrix (Dmat)
  A <- t(matrix(rbind(Rentabilidades, c(1, 1)), nrow = 2))  # constraint matrix (Amat), one column per constraint
  Gamma <- matrix(0, nrow = 2)                # no linear term in the objective (dvec)
  solve.QP(Betha, Gamma, A, b, meq = 2)       # both constraints are equalities
}
Now I want to maximize what was previously the constraint, taking the former objective as the new constraint. Regrettably, solve.QP() only supports linear constraints. Does anyone know a package similar to quadprog that might help me?
A standard portfolio optimization model looks like:
min sum((i,j), x(i)*Q(i,j)*x(j))
sum(i,x(i)) = 1
sum(i,r(i)*x(i)) >= R
x(i) >= 0
This is a Quadratic Programming model, and can be solved with standard QP solvers.
If you turn this around (maximize return subject to a risk constraint), you can write:
max sum(i,r(i)*x(i))
sum((i,j), x(i)*Q(i,j)*x(j)) <= V
sum(i,x(i)) = 1
x(i) >= 0
This is now a quadratically constrained problem. Luckily it is convex, so you can use solvers like CPLEX, Gurobi, or MOSEK to solve it (they have R interfaces). An open-source candidate could be a solver like ECOSolveR, or even better a framework like CVXR.
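For example, a hedged sketch with CVXR (my own illustration; r is the vector of expected returns, Q the covariance matrix, and V the risk limit):

library(CVXR)

max_return <- function(r, Q, V) {
  x <- Variable(length(r))
  objective   <- Maximize(t(r) %*% x)          # maximize expected return
  constraints <- list(quad_form(x, Q) <= V,    # quadratic risk cap
                      sum(x) == 1,             # fully invested
                      x >= 0)                  # no short sales
  result <- solve(Problem(objective, constraints))
  result$getValue(x)                           # optimal weights
}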
I'm trying to perform a fixed effects regression for two factor variables in a CSV dataset containing over 4000000 rows. These variables can respectively assume about 140000 and 50000 different integer values.
I initially attempted to perform the regression using the biglm and ff packages for R, as follows, on a Linux machine with 8 GB of memory; however, it seems that this requires too much memory, because R complains about having to allocate a vector larger than the maximum allowed on my machine.
library(biglm)
library(ff)
d <- read.csv.ffdf(file='data.csv', header=TRUE)
model = y~factor(a)+factor(b)-1
out <- biglm(model, data=d)
Some research online revealed that since factors are loaded into memory by ff, the latter will not significantly improve memory usage if many factor values are present.
Is anyone aware of some other way to perform the aforementioned regression on a dataset of the magnitude I described without having to resort to a machine with significantly more memory?
You should try the package lfe; it has been designed for exactly this purpose:
library(lfe)
...
out <- felm(y ~ 0|a+b, data=d)
fe <- getfe(out)
A proof of the method can be found here: http://www.sciencedirect.com/science/article/pii/S0167947313001266
Here's an R-journal article about it: http://journal.r-project.org/archive/2013-2/gaure.pdf
You can get the same mathematical meaning as fixed effects if you demean the variables (by category). So, instead of estimating a constant per dummy, you demean. And demeaning will be very fast, as it can be vectorized.
Edit 1:
See Greene (2012), pp. 400-401, for the mathematical proof.
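A minimal sketch of the demeaning idea (my own illustration, assuming the data have been read into an ordinary data frame d and using a hypothetical continuous regressor x; for the two-factor case in the question the demeaning has to be iterated over a and b, which is essentially what lfe::felm() does internally):

d$y_dm <- d$y - ave(d$y, d$a)          # demean y within the levels of a
d$x_dm <- d$x - ave(d$x, d$a)          # demean the regressor the same way
fit <- lm(y_dm ~ x_dm - 1, data = d)   # within estimator for the one-factor case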
I'm experimenting with R and the randomForest package; I have some experience with SVMs and neural nets.
My first test is to try to regress sin(x) + Gaussian noise.
With neural nets and SVMs I obtain a "relatively" nice approximation of sin(x), so the noise is filtered out and the learning algorithm doesn't overfit (for decent parameters).
When doing the same with randomForest I get a completely overfitted solution.
I simply use (R 2.14.0, tried on 2.14.1 too, just in case):
library("randomForest")
x <- seq(-3.14, 3.14, by = 0.00628)
noise <- rnorm(1001)
y <- sin(x) + noise/4
mat <- matrix(c(x, y), ncol = 2, dimnames = list(NULL, c("X", "Y")))
plot(x, predict(randomForest(Y ~ ., data = mat), mat), col = "green")
points(x, y)
I guess there is a magic option in randomForest to make it work correctly; I tried a few but did not find the right lever to pull...
You can use maxnodes to limit the size of the trees, as in the examples in the manual.
r <- randomForest(Y~.,data=mat, maxnodes=10)
plot(x,predict(r,mat),col="green")
points(x,y)
You can do a lot better (RMSE ~ 0.04, $R^2$ > 0.99) by training individual trees on small samples, or "bites" as Breiman called them.
Since there is a significant amount of noise in the training data, this problem is really about smoothing rather than generalization. In general machine learning terms, this requires increasing regularization. For an ensemble learner, this means trading strength for diversity.
The diversity of random forests can be increased by reducing the number of candidate features per split (mtry in R) or the training set of each tree (sampsize in R). Since there is only 1 input dimension, mtry does not help, leaving sampsize. This leads to a 3.5x improvement in RMSE over the default settings and a >6x improvement over the noisy training data itself. Since increased diversity means increased variance in the predictions of the individual learners, we also need to increase the number of trees to stabilize the ensemble prediction.
small bags, more trees :: rmse = 0.04:
>sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
replace=FALSE, ntree=5000),
mat)
- sin(x))
[1] 0.03912643
default settings :: rmse=0.14:
> sd(predict(randomForest(Y~.,data=mat),mat) - sin(x))
[1] 0.1413018
error due to noise in training set :: rmse = 0.25
> sd(y - sin(x))
[1] 0.2548882
The error due to noise is of course evident from
noise<-rnorm(1001)
y<-sin(x)+noise/4
In the above, the evaluation is being done against the training set, as it is in the original question. Since the issue is smoothing rather than generalization, this is not as egregious as it may seem, but it is reassuring to see that out-of-bag evaluation shows similar accuracy:
> sd(predict(randomForest(Y~.,data=mat, sampsize=60, nodesize=2,
replace=FALSE, ntree=5000))
- sin(x))
[1] 0.04059679
My intuition is that:
if you had a simple decision tree to fit a 1-dimensional curve f(x), that would be equivalent to fitting with a staircase function (not necessarily with equally spaced jumps);
with random forests you will make a linear combination of staircase functions.
For a staircase function to be a good approximator of f(x), you want enough steps on the x axis, but each step should contain enough points so that their mean is a good approximation of f(x) and is less affected by noise.
So I suggest you tune the nodesize parameter. If you have 1 decision tree, N points, and nodesize = n, then your staircase function will have N/n steps. Too small an n leads to overfitting. I got nice results with n ~ 30 (RMSE ~ 0.07):
r <- randomForest(Y~.,data=mat, nodesize=30)
plot(x,predict(r,mat),col="green")
points(x,y)
Notice that RMSE gets smaller if you take N'=10*N and n'=10*n.