I am trying to run an optimization using lp.transport from the package lpSolve in R, using the generic form
lp.transport (cost, "min", row.signs, row.rhs, col.signs, col.rhs)
The cost matrix is large, 6791 x 15594. The rows correspond to food producers and the columns to consumers, and obviously the sum of all values of row.rhs is equal to that of col.rhs.
The optimization has been running about 12 hours now (using about 30 Mb of memory, in a 64-bits R). Is there any way to estimate the time it will take? Any advice on how to modify the inputs to eventually reduce computational time?
Related
I am applying the functions from the flexclust package for hard competitive learning clustering, and I am having trouble with the convergence.
I am using this algorithm because I was looking for a method to perform a weighed clustering, giving different weights to groups of variables. I chose hard competitive learning based on a response for a previous question (Weighted Kmeans R).
I am trying to find the optimal number of clusters, and to do so I am using the function stepFlexclust with the following code:
new("flexclustControl") ## check the default values
fc_control <- new("flexclustControl")
fc_control#iter.max <- 500 ### 500 iterations
fc_control#verbose <- 1 # this will set the verbose to TRUE
fc_control#tolerance <- 0.01
### I want to give more weight to the first 24 variables of the dataframe
my_weights <- rep(c(1, 0.064), c(24, 31))
set.seed(1908)
hardcl <- stepFlexclust(x=df, k=c(7:20), nrep=100, verbose=TRUE,
FUN = cclust, dist = "euclidean", method = "hardcl", weights=my_weights, #Parameters for hard competitive learning
control = fc_control,
multicore=TRUE)
However, the algorithm does not converge, even with 500 iterations. I would appreciate any suggestion. Should I increase the number of iterations? Is this an indicator that something else is not going well, or did I a mistake with the R commands?
Thanks in advance.
Two things that answer my question (as well as a comment on weighted variables for kmeans, or better said, with hard competitive learning):
The weights are for observations (=rows of x), not variables (=columns of x). so using hardcl for weighting variables is wrong.
In hardcl or neural gas you need much more iterations compared to standard k-means: In k-means one iteration uses the complete data set to change the centroids, hard competitive learning and uses only a single observation. In comparison to k-means multiply the number of iterations by your sample size.
I have a data set with 500,000 observations with events and a competing risk as well as a time-to-event variable (survival analysis).
I want to run a survival random forest.
The R-package randomForestSRC is great for it, however, it is impossible to use more than 100,000 rows due to memory limitation (100'000 uses 40GB of RAM) even though I limit my number of predictors to 15 to 20.
I have a hard time finding a solution. Does anyone have a recommendation?
I looked at h2o and spark mllib, both of which do not support survival random forests.
Ideally I am looking for a somewhat R-based solution but I am happy to explore anything else if anyone knows a way to use large data + competing risk random forest.
In general, the memory profile for an RF-SRC data set is n x p x 8 on a 64-bit machine. With n=500,000 and p=20, RAM usage is approximately 80MB. This is not large.
You also need to consider the size of the forest, $nativeArray. With the default nodesize = 3, you will have n / 3 = 166,667 terminal nodes. Assuming symmetric trees for convenience, the total number of interanal/external nodes will approximately be 2 * n / 3 = 333,333. With the default ntree = 1000, and assuming no factors, $nativeArray will be of dimensions [2 * n / 3 * ntree] x [5]. A simple example will show you why we have [5] columns in the $nativeArray to tag the split parameter, and split value. Memory usage for the forest will be thus be 2 * n / 3 * ntree * 5 * 8 = 1.67GB.
So now we are getting into some serious memory usage.
Next consider the ensembles. You haven't said how many events you have in your competing risk data set, but let's assume there are two for simplicity.
The big arrays here are the cause-specific hazard function (CSH) and the cause-specific cumulative incidence function (CIF). These are both of dimension [n] x [time.interest] x [2]. In a worst case scenario, if all your times are distinct, and there are no censored events, time.interest = n. So each of these outputs is n * n * 2 * 8 bytes. This will blow up most machines. It's time.interest that is your enemy. In big-n situations, you need to constrain the time.interest vector to a subset of the actual event times. This can be controlled with the parameter ntime.
From the documentation:
ntime: Integer value used for survival families to constrain ensemble calculations to a grid of time values of no more than ntime time points. Alternatively if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations (note that the constrained time points used will be the observed event times closest to the user supplied time points). If no value is specified, the default action is to use all observed event times.
My suggestion would be to start with a very small value of ntime, just to test whether the data set can be analyzed in its entirety without issue. Then increase it gradually and observe your RAM usage. Note that if you have missing data, then RAM usage will be much larger. Also note that I did not mention other arrays such as the terminal node statistics that also lead to heavy RAM usage. I only considered the ensembles, but the reality is that each terminal node will contain arrays of dimension [time.interest] x 2 for the node specific estimator of the CSH and CIF that is used in creating the forest ensemble.
In the future, we will be implementing a Big Data option that will suppress ensembles and optimize the memory profile of the package to better accommodate big-n scenarios. In the meantime, you will have to be diligent in using the existing options like ntree, nodesize, and ntime to reduce your RAM usage.
I would like to do a large analysis using bootstrapping. I saw that the speed of bootstrapping is increased using parallel computing as in the following code:
Parallel computing
# detect number of cpu
library(parallel)
detectCores()
library(boot)
# boot function --> mean
bt.mean <- function(dat, d){
x <- dat[d]
m <- mean(x)
return(m)
}
# obtain confidence intervals
# use parallel computing with 4 cpus
x <- mtcars$mpg
bt <- boot(x, bt.mean, R = 1000, parallel = "snow", ncpus = 4)
quantile(bt$t, probs = c(0.025, 0.975))
However, as the whole number of calculations is large in my case (10^6 regressions with 10,000 bootstrap samples), I read that there are ways to use GPU computing to increase the speed even more (link1, link2). You can easily use GPU computing with some functions like in:
GPU computing
m <- matrix(rnorm(10^6), ncol = 1000)
csm <- gpuR::colSums(m)
But it seems to me that the packages can only handle some specific R functions such as matrix operations, linear algebra or cluster analysis (link3).
Another approach is to use CUDA/C/C++/Fortran to create own functions (link4). But I am rather searching for a solution in R.
My question is therefore:
Is it possible to use GPU computing for bootstrapping using the boot package and other R packages (e.g. quantreg)?
I think it is not possible to gain the power of gpu computing freely without doing any additional programming now. But the gpuR package is a good starting point. As you point out, the gpuR can only handle some specific R functions such as matrix operations and linear algebra, it is restricted but useful, for example, linear regression can be easily formulated to a linear algebra problem. As to quantile regression, it is not that straightforward to translate it to a linear algebra as linear regression, but it can be done. For example, you can use Newton-Raphson algorithm or something other numerical optimization algorithm to deal with quantile regression (it is not that hard as it sounds like), and the Newton algorithm is in linear algebra form.
The gpuR package is already hiding a lot of c++ programming details and hardware details to utilize gpu computing power and provides a quite easy-to-use programming style, as long as I can think of, this is the way to achieve what you want with the least effort: to rely on the gpuR package, formulate your problem in matrix operations and linear algebra (Newton Raphson etc) and do the programming yourself, or maybe you can find some Newton Raphson implementation in R for quantile regression, and make some little modifications necessary, for example, use gpuMatrix instead of matrix, etc. Hope it helps.
I am running glmnet favoring lasso regression on a 16 core machine. I have some 800K rows with around 2K columns in a sparse matrix format that should be trained to predict probability in first column.
This process has become very slow. I want to know, is there a way to speed it up
either by parallelizing on nfolds or if I can select a smaller number of rows without affecting the accuracy. Is it possible? If so, what would be better?
The process can be expedited by using parallelization, which as explained in comment link above executing glmnet in parallel in R is done by setting parallel=TRUE option in cv.glmnet() function, once you specify the number of cores like this:
library(doParallel)
registerDoParallel(5)
m <- cv.glmnet(x, y, family="binomial", alpha=0.7, type.measure="auc",
grouped=FALSE, standardize=FALSE, parallel=TRUE)
Reducing the number of rows is more of a judgement call based on AUC value on test set. If it is above threshold, and reducing rows does not affect this, then it is certainly a good idea.
I am running random forest in R in parallel
library(doMC)
registerDoMC()
x <- matrix(runif(500), 100)
y <- gl(2, 50)
Parallel execution (took 73 sec)
rf <- foreach(ntree=rep(25000, 6), .combine=combine, .packages='randomForest') %dopar%
randomForest(x, y, ntree=ntree)
Sequential execution (took 82 sec)
rf <- foreach(ntree=rep(25000, 6), .combine=combine) %do%
randomForest(x, y, ntree=ntree)
In parallel execution, the tree generation is pretty quick like 3-7 sec, but the rest of the time is consumed in combining the results (combine option). So, its only worth to run parallel execution is the number of trees are really high. Is there any way I can tweak "combine" option to avoid any calculation at each node which I dont need and make it more faster
PS. Above is just an example of data. In real I have some 100 thousands features for some 100 observations.
Setting .multicombine to TRUE can make a significant difference:
rf <- foreach(ntree=rep(25000, 6), .combine=randomForest::combine,
.multicombine=TRUE, .packages='randomForest') %dopar% {
randomForest(x, y, ntree=ntree)
}
This causes combine to be called once rather than five times. On my desktop machine, this runs in 8 seconds rather than 19 seconds.
Are you aware that the caret package can do a lot of the hand-holding for parallel runs (as well as data prep, summaries, ...) for you?
Ultimately, of course, if there are some costly operations left in the random forest computation itself, there is little you can do as Andy spent quite a few years on improving it. I would expect few to no low-hanging fruits to be around for the picking...
H20 package can be used to solve your problem.
According to H20 documentation page H2O is "the open source
math engine for big data that computes parallel distributed
machine learning algorithms such as generalized linear models,
gradient boosting machines, random forests, and neural networks
(deep learning) within various cluster environments."
Random Forest implementation using H2O:
https://www.analyticsvidhya.com/blog/2016/05/h2o-data-table-build-models-large-data-sets/
I wonder if the parallelRandomForest code would be helpful to you?
According to the author it ran about 6 times faster on his data set with 16 times less memory consumption.
SPRINT also has a parallel implementation here.
Depending on your CPU, you could probably get 5%-30% speed-up choosing number of jobs to match your number of registered cores matching the number of system logical cores. (sometimes it is more efficient to match number of system physical cores).
If you have a generic Intel dual-core laptop with Hyper Threading(4 logical cores), then DoMC probably registered a cluster of 4 cores. Thus 2 cores will idle when iteration 5 and 6 are computed plus the extra time starting/stopping two extra jobs. It would more efficient to make only 2-4 jobs of more trees.