I am trying to take the dot product of a 331x23152 matrix and a 23152x23152 matrix.
In Python and Octave this is a trivial operation, but in R this seems to be incredibly slow.
N <- 331
M <- 23152
mat_1 = matrix( rnorm(N*M,mean=0,sd=1), N, M)
mat_2 = matrix( rnorm(M*M,mean=0,sd=1), M, M)
tm3 <- system.time({
mat_3 = mat_1%*%mat_2
})
print(tm3)
The output is
user system elapsed
101.95 0.04 101.99
In other words, this dot product takes over 100 seconds to execute.
I am running R-3.4.0 64-bit, with RStudio v1.0.143, on an i7-4790 with 16 GB RAM. As such, I did not expect this operation to take so long.
Am I overlooking something? I have started looking into the packages bigmemory and bigalgebra, but I can't help but think there's a solution without having to resort to packages.
EDIT
To give you an idea of time difference, here's a script for Octave:
n = 331;
m = 23152;
mat_1 = rand(n,m);
mat_2 = rand(m,m);
tic
mat_3 = mat_1*mat_2;
toc
The output is
Elapsed time is 3.81038 seconds.
And in Python:
import numpy as np
import time
n = 331
m = 23152
mat_1 = np.random.random((n,m))
mat_2 = np.random.random((m,m))
tm_1 = time.time()
mat_3 = np.dot(mat_1,mat_2)
tm_2 = time.time()
tm_3 = tm_2 - tm_1
print(tm_3)
The output is
2.781277894973755
As you can see, these numbers are not even in the same ballpark.
EDIT 2
At Zheyuan Li's request, here are toy examples for dot products.
In R:
mat_1 = matrix(c(1,2,1,2,1,2), nrow = 2, ncol = 3)
mat_2 = matrix(c(1,1,1,2,2,2,3,3,3), nrow = 3, ncol = 3)
mat_3 = mat_1 %*% mat_2
print(mat_3)
The output is:
[,1] [,2] [,3]
[1,] 3 6 9
[2,] 6 12 18
In Octave:
mat_1 = [1,1,1;2,2,2];
mat_2 = [1,2,3;1,2,3;1,2,3];
mat_3 = mat_1*mat_2
The output is:
mat_3 =
3 6 9
6 12 18
In Python:
import numpy as np
mat_1 = np.array([[1,1,1],[2,2,2]])
mat_2 = np.array([[1,2,3],[1,2,3],[1,2,3]])
mat_3 = np.dot(mat_1, mat_2)
print(mat_3)
The output is:
[[ 3 6 9]
[ 6 12 18]]
For more information on matrix dot products: https://en.wikipedia.org/wiki/Matrix_multiplication
EDIT 3
The output for sessionInfo() is:
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0
EDIT 4
I tried the bigalgebra package but this did not seem to speed things up:
library('bigalgebra')
N <- 331
M <- 23152
mat_1 = matrix( rnorm(N*M,mean=0,sd=1), N, M)
mat_1 <- as.big.matrix(mat_1)
mat_2 = matrix( rnorm(M*M,mean=0,sd=1), M, M)
tm3 <- system.time({
mat_3 = mat_1%*%mat_2
})
print(tm3)
The output is:
user system elapsed
101.79 0.00 101.81
EDIT 5
James suggested altering my randomly generated matrices:
N <- 331
M <- 23152
mat_1 = matrix( runif(N*M), N, M)
mat_2 = matrix( runif(M*M), M, M)
tm3 <- system.time({
mat_3 = mat_1%*%mat_2
})
print(tm3)
The output is:
user system elapsed
102.46 0.05 103.00
A trivial operation? Matrix multiplication is always an expensive operation in linear algebra computations.
Actually I think it is quite fast. A matrix multiplication at this size requires
2 * 23152 * 23152 * 331 ≈ 354.8 GFLOP
With 100 seconds, your performance is about 3.5 GFLOPS. Note that on most machines the performance is at most 0.8 to 2 GFLOPS, unless you have an optimized BLAS library.
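As a quick sanity check of that arithmetic in R (the constants are just the matrix dimensions from the question):
flops <- 2 * 331 * 23152 * 23152  # one multiply and one add per inner-dimension step
flops / 1e9                       # ~354.8 GFLOP in total
(flops / 1e9) / 100               # ~3.5 GFLOPS when the product takes 100 seconds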
If you think implementations elsewhere are faster, check for the possibility of an optimized BLAS or parallel computing being used. R is doing this with a standard BLAS and no parallelism.
Important
As of R 3.4.0, more tools are available for working with BLAS.
First of all, sessionInfo() now returns the full path of the linked BLAS library. Yes, this does not point to the symbolic link, but to the final shared object! The other answer here shows just this: it has OpenBLAS.
The timing result (in the other answer) implies that parallel computing (via multi-threading in OpenBLAS) is in place. It is hard for me to tell the number of threads used, but it looks like hyperthreading is on, as the "system" slot is quite large!
Second, the matrix multiplication method can now be set via options(matprod = ...). Although this was introduced to deal with NA / NaN, it offers a way to test performance, too!
"internal" is an implementation in non-optimized triple loop nest. This is written in C, and has equal performance to the standard (reference) BLAS written in F77;
"default", "blas" and "default.simd" mean using linked BLAS for computation, but the way for checking NA and NaN differs. If R is linked to standard BLAS, then as said, it has the same performance with "internal"; but otherwise we see significant boost. Also note that R team says that "default.simd" might be removed in future.
Based on the replies from knb and Zheyuan Li, I started investigating optimized BLAS packages. I came across GotoBLAS, OpenBLAS, and MKL, e.g. here.
My conclusion is that MKL should far outperform the default BLAS.
It seems R has to be built from source in order to incorporate MKL. Instead, I found Microsoft R Open, which has MKL (optionally) built in, so installation is a breeze.
With the following code:
N <- 331
M <- 23152
mat_1 = matrix( rnorm(N*M,mean=0,sd=1), N, M)
mat_2 = matrix( rnorm(M*M,mean=0,sd=1), M, M)
tm3 <- system.time({
mat_3 = mat_1%*%mat_2
})
print(tm3)
The output is:
user system elapsed
10.61 0.10 3.12
As such, one solution to this problem is to use MKL instead of default BLAS.
However, upon investigation, my real-life matrices turned out to be highly sparse. I was able to take advantage of that fact by using the Matrix package; in practice I used e.g. Matrix(x = mat_1, sparse = TRUE), where mat_1 is a highly sparse matrix. This brought the execution time down to around 3 seconds, as sketched below.
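A minimal sketch of the sparse route; rsparsematrix generates random sparse test matrices here, and the 0.1% density is an assumption standing in for the real data:
library(Matrix)
N <- 331
M <- 23152
# With real data you would instead convert an existing dense matrix via
# Matrix(x = mat_1, sparse = TRUE).
sp_1 <- rsparsematrix(N, M, density = 0.001)
sp_2 <- rsparsematrix(M, M, density = 0.001)
tm_sparse <- system.time({
  sp_3 <- sp_1 %*% sp_2  # sparse-sparse product via Matrix methods
})
print(tm_sparse)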
I have a similar machine: Linux PC, 16 GB RAM, Intel i7-4770K.
Relevant output from sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.15.1 clipr_0.3.2 tibble_1.3.0 colorout_1.1-2
loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0 Rcpp_0.12.10
On my machine, your code snippet takes ~5 seconds (started RStudio, created empty .R file, ran snippet, output):
user system elapsed
27.608 5.524 4.920
Snippet:
N <- 331
M <- 23152
mat_1 = matrix( rnorm(N*M,mean=0,sd=1), N, M)
mat_2 = matrix( rnorm(M*M,mean=0,sd=1), M, M)
tm3 <- system.time({
mat_3 = mat_1 %*% mat_2
})
print(tm3)
Related
I am creating a package that I hope to eventually put onto CRAN. I have coded much of the package in C++ with the help of Rcpp, and now I would like to enable parallelization of this C++ code. I am using the foreach package; however, I am open to switching to snow or a different library if that would work better.
I started by trying to parallelize a simple function:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
// [[Rcpp::export]]
arma::vec rNorm_c(int length) {
  return arma::vec(length, arma::fill::randn);
}
/*** R
n_workers <- parallel::detectCores(logical = F)
cl <- parallel::makeCluster(n_workers)
doParallel::registerDoParallel(cl)
n <- 10
library(foreach)
foreach(j = rep(n, n),
        .noexport = c("rNorm_c"),
        .packages = "Rcpp") %dopar% { rNorm_c(j) }
*/
I added the .noexport because without it, I get the error Error in { : task 1 failed - "NULL value passed as symbol address". This led me to this SO post, which suggested doing so.
However, I now receive the error Error in { : task 1 failed - "could not find function "rNorm_c"", presumably because I have not followed the top answer's instructions to load the function separately at each node. I am unsure of how to do this; a sketch of one possibility follows.
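For what it's worth, a hedged sketch of one common approach is to compile the source on every worker with clusterEvalQ; this assumes the C++ above is saved in a file rNorm.cpp (the file name is my own placeholder), and note each worker pays the compilation cost:
library(foreach)
n_workers <- parallel::detectCores(logical = FALSE)
cl <- parallel::makeCluster(n_workers)
doParallel::registerDoParallel(cl)
# Compile the C++ file on each worker so rNorm_c exists in every session;
# .noexport still keeps the master's (invalid) function pointer from being shipped.
parallel::clusterEvalQ(cl, Rcpp::sourceCpp("rNorm.cpp"))
res <- foreach(j = rep(10, 10), .noexport = "rNorm_c") %dopar% rNorm_c(j)
parallel::stopCluster(cl)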
This SO post demonstrates how to do this by writing the C++ code inline; however, since the C++ code for my package spans multiple functions, this is likely not the best solution. Another SO post advises creating a local package for the workers to load and call; however, since I am hoping to make this code available in a CRAN package, a local package does not seem possible unless I wanted to attempt to publish two CRAN packages.
Any suggestions for how to approach this or references to resources for parallelization of Rcpp code would be appreciated.
EDIT:
I used the above function to create a package called rnormParallelization. In this package, I also included a couple of R functions, one of which made use of the snow package to parallelize a for loop using the rNorm_c function:
rNorm_samples_for <- function(num_samples, length) {
  sample_mat <- matrix(NA, length, num_samples)
  for (j in 1:num_samples) {
    sample_mat[, j] <- rNorm_c(length)
  }
  return(sample_mat)
}
rNorm_samples_snow1 <- function(num_samples, length) {
  clus <- snow::makeCluster(3)
  snow::clusterExport(clus, "rNorm_c")
  out <- snow::parSapply(clus, rep(length, num_samples), rNorm_c)
  snow::stopCluster(clus)
  return(out)
}
Both functions work as expected:
> rNorm_samples_for(2, 3)
[,1] [,2]
[1,] -0.82040308 -0.3284849
[2,] -0.05169948 1.7402912
[3,] 0.32073516 0.5439799
> rNorm_samples_snow1(2, 3)
[,1] [,2]
[1,] -0.07483493 1.3028315
[2,] 1.28361663 -0.4360829
[3,] 1.09040771 -0.6469646
However, the parallelized version runs considerably slower:
> microbenchmark::microbenchmark(
+ rnormParallelization::rNorm_samples_for(1e3, 1e4),
+ rnormParallelization::rNorm_samples_snow1(1e3, 1e4)
+ )
Unit: milliseconds
                                                   expr       min        lq      mean    median        uq       max neval
  rnormParallelization::rNorm_samples_for(1000, 10000)  217.0871  249.3977  320.5456  285.9787  325.3447  802.7488   100
rnormParallelization::rNorm_samples_snow1(1000, 10000) 1242.8315 1397.7643 1527.0406 1482.5867 1563.0916 3411.5774   100
Here is my session info:
> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rnormParallelization_1.0
loaded via a namespace (and not attached):
[1] microbenchmark_1.4-7 compiler_4.1.1 snow_0.4-4
[4] parallel_4.1.1 tools_4.1.1 Rcpp_1.0.7
GitHub repo with both of these scripts
I am trying to run a reproducible example in parallel with the mlr R package, for which I found the solution of using parallelStartMulticore (link). The project also uses packrat.
The code runs properly on workstations and small servers, but on an HPC with the Torque batch system it runs into memory exhaustion. R processes seem to be spawned ad infinitum, unlike on regular Linux machines. I have tried switching to parallelStartSocket, which works fine, but then I cannot reproduce the results with RNG seeds.
Here is a minimal example:
library(mlr)
library(parallelMap)
M <- data.frame(x = runif(1e2), y = as.factor(rnorm(1e2) > 0))
# Example with random forest
parallelStartMulticore(parallel::detectCores())
plyr::l_ply(
  seq(100),
  function(x) {
    message("Iteration number: ", x)
    set.seed(1, "L'Ecuyer")
    tsk <- makeClassifTask(data = M, target = "y")
    num_ps <- makeParamSet(
      makeIntegerParam("ntree", lower = 10, upper = 50),
      makeIntegerParam("nodesize", lower = 1, upper = 5)
    )
    ctrl <- makeTuneControlGrid(resolution = 2L, tune.threshold = TRUE)
    # define learner
    lrn <- makeLearner("classif.randomForest", predict.type = "prob")
    rdesc <- makeResampleDesc("CV", iters = 2L, stratify = TRUE)
    # Grid search in parallel
    res <- tuneParams(
      lrn, task = tsk, resampling = rdesc, par.set = num_ps,
      measures = list(auc), control = ctrl)
    # Fit optimal params
    lrn.optim <- setHyperPars(lrn, par.vals = res$x)
    m <- train(lrn.optim, tsk)
    # Test set
    pred_rf <- predict(m, newdata = M)
    pred_rf
  }
)
parallelStop()
The HPC hardware is an HP Apollo 6000 System ProLiant XL230a Gen9 Server blade, 64-bit, with Intel Xeon E5-2683 processors. I do not know whether the issue comes from the Torque batch system, the hardware, or some flaw in the above code. The sessionInfo() of the HPC:
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /cm/shared/apps/intel/parallel_studio_xe/2017/compilers_and_libraries_2017.0.098/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] parallelMap_1.3 mlr_2.11 ParamHelpers_1.10 RLinuxModules_0.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.14 splines_3.4.0 munsell_0.4.3
[4] colorspace_1.3-2 lattice_0.20-35 rlang_0.1.1
[7] plyr_1.8.4 tools_3.4.0 parallel_3.4.0
[10] grid_3.4.0 packrat_0.4.8-1 checkmate_1.8.2
[13] data.table_1.10.4 gtable_0.2.0 randomForest_4.6-12
[16] survival_2.41-3 lazyeval_0.2.0 tibble_1.3.1
[19] Matrix_1.2-12 ggplot2_2.2.1 stringi_1.1.5
[22] compiler_3.4.0 BBmisc_1.11 scales_0.4.1
[25] backports_1.0.5
The "multicore" parallelMap backend uses parallel::mcmapply which should create a new fork()ed child process for every evaluation inside tuneParams and then quickly kill that process. Depending on what you use to count memory usage / active processes, it is possible that memory gets mis-reported and that child processes that are already dead (and were only alive for the fraction of a second) are shown, or that killing of finished processes for some reason does not happen.
Possible problems:
The batch system does not correctly track memory usage and counts the parent process's memory for every child separately. Does /usr/bin/free actually report that 30 GB are gone while the script is running? As an easier test case, consider the following (run in an empty R session):
xxx <- 1:1e9
parallel::mclapply(1:4, function(x) {
  Sys.sleep(60)
}, mc.cores = 4)
which should use about 4 GB of memory. If, during the 60 seconds in the child process, the reported memory usage is about 16 GB, it is this problem.
Memory reporting is accurate, but for some reason the memory space is changed a lot inside the child processes (triggering many copy-on-write faults), e.g. because of garbage collection. Does calling gc() before the tuneParams() call help (see the sketch after this list)?
Some setting on the machine prevents the "parallel" package from killing child processes. The following:
parallel::mclapply(1:4, function(x) {
  xxx <<- 1:1e9 ; NULL
}, mc.cores = 4)
Sys.sleep(60)
should grab about 16 GB of memory, but release it right away. If the memory remains used during the Sys.sleep (and the remaining R session), it might be this problem.
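Regarding the second point, the test is simply the following (a sketch reusing the objects from the question unchanged):
gc()  # force a full collection in the parent before the children are forked
res <- tuneParams(
  lrn, task = tsk, resampling = rdesc, par.set = num_ps,
  measures = list(auc), control = ctrl)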
With runjags, I am trying to monitor a very large number of values. The format for the monitor list is a string of values; in this case I am asking to monitor just three: Y[14], Y[15], and Y[3].
run.jags(model = "model.MC.txt", data = list(Y = Y.NA.Rep, sizes = sizesB, cumul = cumul),
         monitor = c("thetaj", "Y[14]", "Y[15]", "Y[3]"))
Suppose I wanted to monitor hundreds of values. I can create this string, but the console just returns the continuation prompt "+" and fails to run.
Is there some upper limit on the size of strings that can be created and passed in as arguments?
Is there a better way (non-string) to pass this list into run.jags?
The only way I have been able to get it to run is to paste the string literal into the function call; a variable containing the string does not work (a sketch of that attempt follows).
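For clarity, the variable-based attempt that fails for me is built roughly like this sketch (the index vector is abridged; the real one holds hundreds of entries):
idx <- c(14, 15, 18, 26, 41, 55)  # ...abridged
monitors <- c("thetaj", paste0("Y[", idx, "]"))
run.jags(model = "model.MC.txt",
         data = list(Y = Y.NA.Rep, sizes = sizesB, cumul = cumul),
         monitor = monitors)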
The longer run list looks something like this:
run.jags(model="model.MC.txt",data=list(Y=Y.NA.Rep,sizes=sizesB,cumul=cumul)
,monitor=c('Y[14]', 'Y[15]', 'Y[18]', 'Y[26]', 'Y[41]',
'Y[55]', 'Y[62]', 'Y[72]', 'Y[80]', 'Y[81]', 'Y[128]', 'Y[138]',
'Y[180]', 'Y[188]', 'Y[191]', 'Y[209]', 'Y[224]', 'Y[244]',
'Y[255]', 'Y[263]', 'Y[282]', 'Y[292]', 'Y[303]', 'Y[324]',
'Y[349]', 'Y[358]', 'Y[359]', 'Y[365]', 'Y[384]',
... many lines deleted
'Y[1882]', 'Y[1895]', 'Y[1899]', 'Y[1903]', 'Y[1918]', 'Y[1922]',
'Y[1929]', 'Y[1942]', 'Y[1953]', 'Y[1990]'))
I'm not sure that this is a problem with runjags - the following code has 1002 monitors and runs just fine:
model <- "model {
for(i in 1 : N){ #data# N
Y[i] ~ dnorm(true.y[i], precision) #data# Y
true.y[i] <- (m * X[i]) + c #data# X
}
m ~ dnorm(0, 10^-3)
c ~ dnorm(0, 10^-3)
precision ~ dgamma(10^-3, 10^-3)
}"
X <- 1:1000
Y <- rnorm(length(X), 2*X + 10, 1)
N <- length(X)
monitors <- c('m','c',paste0('Y[',1:1000,']'))
results <- run.jags(model, n.chains=2, monitor=monitors, sample=100, method='rjags')
results <- run.jags(model, n.chains=2, monitor=monitors, sample=100, method='inter')
I have also tried writing the string directly into the function call by using:
cat('monitor = c("'); cat(monitors, sep='", "'); cat('")\n')
...and copy/pasting the resulting text as the monitor argument; that still works for me in R.app, but when pasting into RStudio I get:
> results <- run.jags(model, n.chains=2, monitor = c("m", "c", "Y[1]", "Y[2]", "Y[3]", "Y[4]", "Y[5]", "Y[6]", "Y[7]", "Y[8]", "Y[9]", "Y[10]", "Y[11]", "Y[12]", "Y[13]", "Y[14]", "Y[15]", "Y[16]", "Y[17]", "Y[18]", "Y[19]", "Y[20]", "Y[21]", "Y[22]", "Y[23]", "Y[24]", "Y[25]", "Y[26]", "Y[27]", "Y[28]", "Y[29]", "Y[30]", "Y[31]", "Y[32]", "Y[33]", "Y[34]", "Y[35]", "Y[36]", "Y[37]", "Y[38]", "Y[39]", "Y[40]", "Y[41]", "Y[42]", "Y[43]", "Y[44]", "Y[45]", "Y[46]", "Y[47]", "Y[48]", "Y[49]", "Y[50]", "Y[51]", "Y[52]", "Y[53]", "Y[54]", "Y[55]", "Y[56]", "Y[57]", "Y[58]", "Y[59]", "Y[60]", "Y[61]", "Y[62]", "Y[63]", "Y[64]", "Y[65]", "Y[66]", "Y[67]", "Y[68]", "Y[69]", "Y[70]", "Y[71]", "Y[72]", "Y[73]", "Y[74]", "Y[75]", "Y[76]", "Y[77]", "Y[78]", "Y[79]", "Y[80]", "Y[81]", "Y[82]", "Y[83]", "Y[84]", "Y[85]", "Y[86]", "Y[87]", "Y[88]", "Y[89]", "Y[90]", "Y[91]", "Y[92]", "Y[93]", "Y[94]", "Y[95]", "Y[96]", "Y[97]", "Y[98]", "Y[99]", "Y[100]", "Y[101]", "Y[102]", "Y[103]", "Y[104]", "Y[105]... <truncated>
+
+
This is somewhat similar to your description, so I'm guessing that you are using RStudio and that the problem has to do with the maximum length of a line of code that RStudio can interpret.
If so, the fix is to simply hard-wrap the command so it is broken over multiple lines; I tried this with a 72-character width (100+ lines) and it works fine in RStudio. If my assumption is incorrect, please edit your question to give more details of how you are running R, and describe your system using:
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] runjags_2.0.4-2
loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0 parallel_3.4.0 coda_0.19-1 grid_3.4.0 rjags_4-6 lattice_0.20-35
Following up on some data.table parallelism threads (1) (2) (3), I'm trying to figure it out. What's wrong with this syntax?
library(data.table)
set.seed(1234)
dt <- data.table(id = factor(sample(1L:10000L, size = 1e6, replace = TRUE)),
                 val = rnorm(n = 1e6), key = "id")
foo <- function(l) sum(l)
dt2 <- dt[, foo(.SD), by= "id"]
library(parallel)
cl <- makeCluster(detectCores())
dt3 <- clusterApply(cl, x = parallel:::splitRows(dt, detectCores()),
                    fun = lapply, FUN = function(x, foo) {
                      x[, foo(data.table:::".SD"), by = "id"]
                    }, foo = foo)
stopCluster(cl)
# note that library(parallel) is annoying and you often have to do this kind of namespace qualification ("::", ":::") to make objects available to the parallel workers
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: incorrect number of dimensions
cl <- makeCluster(detectCores())
dt3 <- clusterApply(cl, x = parallel:::splitRows(dt, detectCores()),
                    fun = lapply, FUN = function(x, foo) {
                      x <- data.table::data.table(x)
                      x[, foo(data.table:::".SD"), by = "id"]
                    }, foo = foo)
stopCluster(cl)
Error in checkForRemoteErrors(val) :
4 nodes produced errors; first error: object 'id' not found
I've played around with the syntax quite a bit, and these two seem to be the closest I can get; obviously something's still not right.
My real problem is similarly structured but has many more rows, and I'm using a machine with 24 cores / 48 logical processors. Watching my computer use roughly 4% of its computing power (only 1 core) is really annoying; one direction is sketched below.
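For reference, a hedged sketch of a split/apply/combine variant that avoids the unexported splitRows helper (untested at scale; the final re-aggregation assumes foo is associative, as sum is, because a keyed id group can straddle a chunk boundary):
library(parallel)
library(data.table)
cl <- makeCluster(detectCores())
clusterEvalQ(cl, library(data.table))  # load data.table on every worker
clusterExport(cl, "foo")
idx <- splitIndices(nrow(dt), detectCores())  # contiguous row blocks, one per worker
chunks <- lapply(idx, function(i) as.data.frame(dt[i]))
res <- clusterApply(cl, chunks, function(x) {
  x <- as.data.table(x)
  x[, foo(.SD), by = "id"]
})
stopCluster(cl)
# Ids split across chunk boundaries appear in several partial results; re-aggregate.
dt3 <- rbindlist(res)[, .(V1 = foo(V1)), by = "id"]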
You may want to evaluate an Rserve-based solution for parallelism.
See the example below, built on Rserve, which uses 2 R nodes locally in parallel. It can also be distributed over remote instances.
library(data.table)
set.seed(1234)
dt <- data.table(id = factor(sample(1L:10000L, size = 1e6, replace = TRUE)),
                 val = rnorm(n = 1e6), key = "id")
foo <- function(l) sum(l)
library(big.data.table)
# start 2 R instances
library(Rserve)
port = 6311:6312
invisible(sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save"))))
# client side
rscl = rscl.connect(port = port, pkgs = "data.table") # connect and auto require packages
bdt = as.big.data.table(dt, rscl) # create big.data.table from local data.table and list of connections to R nodes
rscl.assign(rscl, "foo", foo) # assign `foo` function to nodes
bdt[, foo(.SD), by="id"][, foo(.SD), by="id"] # first query is run remotely, second locally
# id V1
# 1: 1 10.328998
# 2: 2 -8.448441
# 3: 3 21.475910
# 4: 4 -5.302411
# 5: 5 -11.929699
# ---
# 9996: 9996 -4.905192
# 9997: 9997 -4.293194
# 9998: 9998 -2.387100
# 9999: 9999 16.530731
#10000: 10000 -15.390543
# optionally with special care
# bdt[, foo(.SD), by= "id", outer.aggregate = TRUE]
session info:
R version 3.2.3 (2015-12-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Rserve_1.8-5 big.data.table_0.3.3 data.table_1.9.7
loaded via a namespace (and not attached):
[1] RSclient_0.7-3 tools_3.2.3
I am facing a strange problem:
I save ffdf data using
save.ffdf()
from the ffbase package, and when I load them in a new R session, doing
load.ffdf("data.f")
it gets loaded into RAM, occupying approximately 90% of the memory that the same data occupies as a data.frame object in R.
Given this issue, it does not make a lot of sense to use ffdf, does it?
I can't use ffsave because I am working on a server and do not have the zip utility on it.
packageVersion("ff")     # 2.2.10
packageVersion("ffbase") # 0.6.3
Any ideas?
[edit] Some example code to help clarify:
data <- read.csv.ffdf(file = fn, header = T, colClasses = classes)
# file fn is a csv database with 5 columns and 2.6 million rows,
# with some factor cols and some integer cols.
data.1 <- data
save.ffdf(data.1, dir = my.dir) # my.dir is a string pointing to the directory, e.g. "C:/data/R/test.f"
closing the R session... opening again:
load.ffdf(file.name) # file.name is a string pointing to the same directory.
# That gives me the object data, with class(data) = ffdf.
Then I have an ffdf[5] data object, and its memory size is almost as big as:
data.R <- data[,] # which is a data.frame.
[end of edit]
[SECOND EDIT :: FULL REPRODUCIBLE CODE]
As my question has not been answered yet and I still see the problem, here is a reproducible example:
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.limit(size=4000)
N = 1e7;
df <- data.frame(
  x = c(1:N),
  y = sample(letters, N, replace = T),
  z = sample(as.Date(sample(c(1:2000), N, replace = T), origin = "1970-01-01")),
  w = factor(sample(c(1:N/10), N, replace = T)))
df[1:10,]
dff <- as.ffdf(df)
head(dff)
#str(dff)
save.ffdf(dff, dir = "dframeffdf")
dim(dff)
# on disk, the directory "dframeffdf" is : 205 MB (215.706.264 bytes)
### resetting R :: fresh RStudio Session
dir1 <- 'P:/Projects/RLargeData';
setwd(dir1);
library(ff)
library(ffbase)
memory.size() # 15.63
load.ffdf(dir = "dframeffdf")
memory.size() # 384.42
gc()
memory.size() # 287
So we have 384 MB in memory, and after gc() there are 287 MB, which is around the size of the data on disk (also checked with the "Process Explorer" application for Windows).
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods base
other attached packages:
[1] ffbase_0.7-1 ff_2.2-10 bit_1.1-9
[END SECOND EDIT]
In ff, when you have factor columns, the factor levels are always in RAM. ff character columns currently don't exist; character columns are converted to factors in an ffdf.
Regarding your example: the 'w' column in 'dff' contains more than 6 million levels. These levels are all in RAM. If you didn't have columns with a lot of levels, you wouldn't see the RAM increase, as shown below using your example.
N = 1e7;
df <- data.frame(
  x = c(1:N),
  y = sample(letters, N, replace = T),
  z = sample(as.Date(sample(c(1:2000), N, replace = T), origin = "1970-01-01")),
  w = sample(c(1:N/10), N, replace = T))
dff <- as.ffdf(df)
save.ffdf(dff, dir = "dframeffdf")
### resetting R :: fresh RStudio Session
library(ff)
library(ffbase)
memory.size() # 14.67
load.ffdf(dir = "dframeffdf")
memory.size() # 14.78
The ff package(s) have mechanisms for segregating objects into "physical" and "virtual" storage. I suspect you are implicitly constructing items in physical memory, but since you offer no code for how this workspace was created, there's only so much guessing that is possible.
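A sketch of inspecting that segregation on a column of dff from the example above; physical, virtual and filename are ff accessors:
library(ff)
library(ffbase)
physical(dff$x)  # physical attributes (vmode, caching, link to the backing file)
virtual(dff$x)   # virtual attributes (length, windowing view)
filename(dff$x)  # path of the on-disk file backing this column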