Loading ffdf data takes a lot of memory - r

I am facing a strange problem:
I save ffdf data using save.ffdf() from the ffbase package, and when I load them in a new R session with load.ffdf(), they take up approximately 90% of the memory that the same data uses as a data.frame object in R.
Given this, it does not make much sense to use ffdf, does it?
I can't use ffsave because I am working on a server that does not have the zip application installed.
packageVersion("ff") # 2.2.10
packageVersion("ffbase") # 0.6.3
Any ideas?
[edit] A code example to help clarify:
data <- read.csv.ffdf(file = fn, header = T, colClasses = classes)
# file fn is a csv database with 5 columns and 2.6 million rows,
# with some factor cols and some integer cols.
data.1 <- data
save.ffdf(data.1 , dir = my.dir) # my.dir is a string pointing to the file. "C:/data/R/test.f" for example.
closing the R session... opening again:
load.ffdf(file.name) # file.name is a string pointing to the file.
#that gives me object data, with class(data) = ffdf.
Then I have a data object, ffdf[5], and its memory size is almost as big as:
data.R <- data[,] # which is a data.frame.
[end of edit]
As my question has not been answered yet and I still see the problem, here is a reproducible example:
dir1 <- 'P:/Projects/RLargeData';
N = 1e7;
df <- data.frame(
x = c(1:N),
y = sample(letters, N, replace =T),
z = sample( as.Date(sample(c(1:2000), N, replace=T), origin="1970-01-01")),
w = factor( sample(c(1:N/10) , N, replace=T)) )
dff <- as.ffdf(df)
save.ffdf(dff, dir = "dframeffdf")
# on disk, the directory "dframeffdf" is : 205 MB (215.706.264 bytes)
### resetting R :: fresh RStudio Session
dir1 <- 'P:/Projects/RLargeData';
memory.size() # 15.63
load.ffdf(dir = "dframeffdf")
memory.size() # 384.42
gc()
memory.size() # 287
So 384 MB are in memory, and after gc() there are 287 MB, which is around the size of the data on disk (also checked in the "Process Explorer" application for Windows).
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: i386-w64-mingw32/i386 (32-bit)
[1] LC_COLLATE=Danish_Denmark.1252 LC_CTYPE=Danish_Denmark.1252 LC_MONETARY=Danish_Denmark.1252 LC_NUMERIC=C LC_TIME=Danish_Denmark.1252
attached base packages:
[1] tools stats graphics grDevices utils datasets methods base
other attached packages:
[1] ffbase_0.7-1 ff_2.2-10 bit_1.1-9

In ff, when you have factor columns, the factor levels are always in RAM. ff character columns currently don't exist and character columns are converted to factors in an ffdf.
Regarding your example: your 'w' column in 'dff' contains more than 6 million levels. These levels are all in RAM. If you didn't have columns with a lot of levels, you wouldn't see the RAM increase, as shown below using your example.
N = 1e7;
df <- data.frame(
x = c(1:N),
y = sample(letters, N, replace =T),
z = sample( as.Date(sample(c(1:2000), N, replace=T), origin="1970-01-01")),
w = sample(c(1:N/10) , N, replace=T))
dff <- as.ffdf(df)
save.ffdf(dff, dir = "dframeffdf")
### resetting R :: fresh RStudio Session
memory.size() # 14.67
load.ffdf(dir = "dframeffdf")
memory.size() # 14.78
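The cost of the levels can be estimated without ff at all, since a factor's levels are an ordinary character vector. A minimal base R sketch, assuming a much smaller N than in the question so that it runs quickly (the variable names are illustrative only):

```r
set.seed(1)
N <- 1e5  # assumption: far smaller than the 1e7 in the question, to keep this quick

# Same construction as the 'w' column: many distinct values turned into a factor
w <- factor(sample(1:N / 10, N, replace = TRUE))

n_levels <- nlevels(w)                  # number of distinct levels
levels_bytes <- object.size(levels(w))  # memory held by the level labels alone

n_levels
format(levels_bytes, units = "MB")
```

The size of the level labels grows with the number of distinct values, which is why a high-cardinality factor column can dominate RAM usage even when the data itself stays on disk.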

The ffdf package(s) have mechanisms for segregating objects into 'physical' and 'virtual' storage. I suspect you are implicitly constructing items in physical memory, but since you offer no code showing how this workspace was created, there's only so much guessing that is possible.


arrangements package - permutations of order 128

I am trying to run a simple permutation of 'x' and 'y' across 128 spaces using the arrangements package on R.
I keep getting the following error message :
Error in permutations(test, k = 128, replace = TRUE) : too many results
The code that I ran was as follows:
test <- c('x','y')
permutations(test, k = 128, replace = TRUE)
sessionInfo() is as follows:
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.2.1
Is there a workaround I can use? I am also experimenting with the parallel package. Please advise.
That's way too many results, as @Rui points out in the comments:
npermutations(2, 128, replace=TRUE, bigz = TRUE)
Big Integer ('bigz') :
[1] 340282366920938463463374607431768211456
How about using the skip and nitem parameters? This allows a user to retrieve a handful of results at a time.
## First 100,000 results
system.time(res <- permutations(c('x', 'y'), 128, replace = TRUE, nitem = 1e5))
# user system elapsed
# 0.146 0.000 0.146
## process res
## Next 100,000 results
res <- permutations(c('x', 'y'), 128, replace = TRUE, skip = 1e5, nitem = 1e5)
## process res
## 100,000 results starting at lexicographical index 1e19 + 1
## n.b. need to use strings or bigz type
res <- permutations(c('x', 'y'), 128, replace = TRUE,
skip = "10000000000000000000", nitem = 1e5)
## process res
## etc.
This can easily be generalized to parallel processing if needed.
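To see why results can be produced at an arbitrary offset without generating everything before it, note that with two symbols and replacement, the sequence at a given lexicographic position is just the binary expansion of that position. A minimal base R sketch of the idea; nth_sequence is a hypothetical helper, not part of the arrangements package, and it is only exact for indices below 2^53 because it uses ordinary doubles:

```r
# Hypothetical helper: the idx-th (0-based) length-k sequence over `symbols`
# in lexicographic order, read off from the base-n digits of idx.
nth_sequence <- function(idx, k, symbols = c("x", "y")) {
  n <- length(symbols)
  digits <- numeric(k)
  for (pos in k:1) {          # fill from the least significant position
    digits[pos] <- idx %% n
    idx <- idx %/% n
  }
  symbols[digits + 1]
}

first <- nth_sequence(0, 4)   # all "x"
sixth <- nth_sequence(5, 4)   # 5 is 0101 in binary
first
sixth
```

Each worker in a parallel setup could be handed its own starting offset in the same way, which is what the skip and nitem arguments make possible.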

Slow dot product in R

I am trying to take the dot product of a 331x23152 and a 23152x23152 matrix.
In Python and Octave this is a trivial operation, but in R this seems to be incredibly slow.
N <- 331
M <- 23152
mat_1 = matrix( rnorm(N*M,mean=0,sd=1), N, M)
mat_2 = matrix( rnorm(N*M,mean=0,sd=1), M, M)
tm3 <- system.time({
mat_3 = mat_1%*%mat_2
})
print(tm3)
The output is
user system elapsed
101.95 0.04 101.99
In other words, this dot product takes over 100 seconds to execute.
I am running R-3.4.0 64-bit, with RStudio v1.0.143 on an i7-4790 with 16 GB RAM. As such, I did not expect this operation to take so long.
Am I overlooking something? I have started looking into the packages bigmemory and bigalgebra, but I can't help but think there's a solution without having to resort to packages.
To give you an idea of time difference, here's a script for Octave:
n = 331;
m = 23152;
mat_1 = rand(n,m);
mat_2 = rand(m,m);
mat_3 = mat_1*mat_2;
The output is
Elapsed time is 3.81038 seconds.
And in Python:
import numpy as np
import time
n = 331
m = 23152
mat_1 = np.random.random((n,m))
mat_2 = np.random.random((m,m))
tm_1 = time.time()
mat_3 = np.dot(mat_1,mat_2)
tm_2 = time.time()
tm_3 = tm_2 - tm_1
The output is
As you can see, these numbers are not even in the same ballpark.
At Zheyuan Li's request, here are toy examples for dot products.
In R:
mat_1 = matrix(c(1,2,1,2,1,2), nrow = 2, ncol = 3)
mat_2 = matrix(c(1,1,1,2,2,2,3,3,3), nrow = 3, ncol = 3)
mat_3 = mat_1 %*% mat_2
The output is:
[,1] [,2] [,3]
[1,] 3 6 9
[2,] 6 12 18
In Octave:
mat_1 = [1,1,1;2,2,2];
mat_2 = [1,2,3;1,2,3;1,2,3];
mat_3 = mat_1*mat_2
The output is:
mat_3 =
3 6 9
6 12 18
In Python:
import numpy as np
mat_1 = np.array([[1,1,1],[2,2,2]])
mat_2 = np.array([[1,2,3],[1,2,3],[1,2,3]])
mat_3 = np.dot(mat_1, mat_2)
The output is:
[[ 3 6 9]
[ 6 12 18]]
For more information on matrix dot products: https://en.wikipedia.org/wiki/Matrix_multiplication
The output for sessionInfo() is:
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
[1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252
[4] LC_NUMERIC=C LC_TIME=Dutch_Netherlands.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0
I tried the bigalgebra package but this did not seem to speed things up:
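library(bigmemory)   # provides as.big.matrix()
library(bigalgebra)  # provides %*% for big.matrix objects
N <- 331
M <- 23152
mat_1 = matrix( rnorm(N*M,mean=0,sd=1), N, M)
mat_1 <- as.big.matrix(mat_1)
mat_2 = matrix( rnorm(N*M,mean=0,sd=1), M, M)
tm3 <- system.time({
mat_3 = mat_1%*%mat_2
})
print(tm3)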
The output is:
user system elapsed
101.79 0.00 101.81
James suggested to alter my randomly generated matrix:
N <- 331
M <- 23152
mat_1 = matrix( runif(N*M), N, M)
mat_2 = matrix( runif(M*M), M, M)
tm3 <- system.time({
mat_3 = mat_1%*%mat_2
})
print(tm3)
The output is:
user system elapsed
102.46 0.05 103.00
This is a trivial operation?? Matrix multiplication is always an expensive operation in linear algebra computations.
Actually I think it is quite fast. A matrix multiplication at this size takes
2 * 23.152 * 23.152 * 0.331 = 354.8 GFLOP
At 100 seconds, your performance is about 3.5 GFLOP/s. Note that on most machines, the performance is at most 0.8 to 2 GFLOP/s unless you have an optimized BLAS library.
If you think an implementation elsewhere is faster, check whether it uses an optimized BLAS or parallel computing. R is doing this with a standard BLAS and no parallelism.
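The arithmetic behind that estimate can be reproduced directly. A small sketch, assuming the usual count of 2 floating-point operations (one multiply, one add) per inner-product term and using the elapsed time reported in the question:

```r
n <- 331      # rows of mat_1
m <- 23152    # columns of mat_1, and both dimensions of mat_2

# Each of the n * m output entries needs m multiplications and about m additions
flop <- 2 * n * m * m
gflop <- flop / 1e9

elapsed <- 101.99          # seconds, from the question's system.time() output
rate <- gflop / elapsed    # throughput in GFLOP per second

round(gflop, 1)
round(rate, 2)
```

Dividing the same operation count by the roughly 3 to 5 seconds reported for the optimized BLAS builds gives a throughput in the range of 70 to 120 GFLOP/s, which is the gap the thread is about.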
From R-3.4.0, more tools are available with BLAS.
First of all, sessionInfo() now returns the full path of the linked BLAS library. Yes, this does not point to the symbolic link, but the final shared object! The other answer here just shows this: it has OpenBLAS.
The timing result (in the other answer) implies that parallel computing (via multi-threading in OpenBLAS) is in place. It is hard for me to tell the number of threads used, but looks like hyperthreading is on, as the slot for "system" is quite big!
Second, options() can now set the matrix multiplication method via matprod. Although this was introduced to deal with NA / NaN, it also makes performance testing possible!
"internal" is an implementation as a non-optimized triple loop nest. It is written in C and performs the same as the standard (reference) BLAS written in F77;
"default", "blas" and "default.simd" use the linked BLAS for computation, but differ in how they check for NA and NaN. If R is linked to the standard BLAS then, as said, it performs the same as "internal"; otherwise we see a significant boost. Also note that the R team says "default.simd" might be removed in the future.
Based off the replies from knb and Zheyuan Li, I started investigating optimized BLAS packages. I came across GotoBlas, OpenBLAS, and MKL, e.g. here.
My conclusion is that MKL should outperform default BLAS by far.
It seems R has to be built from source in order to incorporate MKL. Instead, I found R Open. This has MKL (optionally) built-in, so installing is a breeze.
With the following code:
N <- 331
M <- 23152
mat_1 = matrix( rnorm(N*M,mean=0,sd=1), N, M)
mat_2 = matrix( rnorm(N*M,mean=0,sd=1), M, M)
tm3 <- system.time({
mat_3 = mat_1%*%mat_2
})
print(tm3)
The output is:
user system elapsed
10.61 0.10 3.12
As such, one solution to this problem is to use MKL instead of default BLAS.
However, upon investigation my real life matrices are highly sparse. I was able to take advantage of that fact by using the Matrix package. In practice I used it like e.g. Matrix(x = mat_1, sparse = TRUE), where mat_1 would be a highly sparse matrix. This brought down the execution time to around 3 seconds.
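For readers who want to try the sparse route, here is a minimal sketch using the Matrix package (a recommended package shipped with most R installations). The matrix sizes and the number of non-zero entries are made up for illustration, and the dense matrix is passed as the first argument of Matrix():

```r
library(Matrix)

set.seed(1)
dense <- matrix(0, 200, 300)
dense[sample(length(dense), 500)] <- rnorm(500)   # about 0.8% non-zero entries

sp <- Matrix(dense, sparse = TRUE)                # stored in compressed sparse form
other <- matrix(rnorm(300 * 50), 300, 50)

prod_sparse <- as.matrix(sp %*% other)            # sparse-aware product
prod_dense <- dense %*% other                     # ordinary dense product

all.equal(prod_sparse, prod_dense)
```

The sparse product only touches the stored non-zero entries, so the benefit depends entirely on how sparse the real data is.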
I have a similar machine: Linux PC, 16 GB RAM, Intel 4770K.
Relevant output from sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.15.1 clipr_0.3.2 tibble_1.3.0 colorout_1.1-2
loaded via a namespace (and not attached):
[1] compiler_3.4.0 tools_3.4.0 Rcpp_0.12.10
On my machine, your code snippet takes ~5 seconds (started RStudio, created empty .R file, ran snippet, output):
user system elapsed
27.608 5.524 4.920
N <- 331
M <- 23152
mat_1 = matrix( rnorm(N*M,mean=0,sd=1), N, M)
mat_2 = matrix( rnorm(N*M,mean=0,sd=1), M, M)
tm3 <- system.time({
mat_3 = mat_1 %*% mat_2
})
print(tm3)

Best way to count unique elements in a string in r

I'm still a beginner in R and I have a question!
I have a data frame of 222,000 observations and I'm interested in a specific column named id. The problem is that a single string can contain several ids separated by ',', and I want to count the unique elements in each string (that is, in each string of the original data frame).
For example:
id results
0000001,0000003 2
0000002,0000002 1
0010001,0001006,0010001 2
I used the function 'str_split_fixed' to separate all the ids in each string and put the result in a new data frame (so now I have only one id per cell, or an empty cell). The problem is that there can be as many as 68 ',' in a string, so the new data frame is huge, with 68 columns and 220,000 observations, and it takes a long time (maybe 15 seconds). After that I used an apply function to find the unique values.
Does someone know a more efficient way, or have an idea?
Finally, I used the following code:
sapply(id, function(x)
length( # count items
unique( # that are unique
scan( # when arguments are presented to scan as text
text=x, what="", sep =",", # when separated by ","
quiet=TRUE))) )
But I get an error message:
Error in textConnection(text, encoding = "UTF-8") :
argument 'text' incorrect
6 textConnection(text, encoding = "UTF-8")
5 scan(text = x, what = "", sep = ",", quiet = TRUE)
4 unique(scan(text = x, what = "", sep = ",", quiet = TRUE))
3 FUN(X[[i]], ...)
2 lapply(X = X, FUN = FUN, ...)
1 sapply(id, function(x) length(unique(scan(text = x,
what = "", sep = ",", quiet = TRUE))))
My R version is:
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)
[1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.0.0 plyr_1.8.3
loaded via a namespace (and not attached):
[1] magrittr_1.5 tools_3.2.2 Rcpp_0.12.2 stringi_1.0-1
I've tried this: Encoding(id) <- "UTF-8"
But the result is:
Error in `Encoding<-`(`*tmp*`, value = "UTF-8")
and the output of dput(id) looks like this:
[9987,] "2320212,2320230"
[9988,] "4530090,4530917"
[9989,] "8532412"
[9990,] "4560292"
[9991,] "4540375"
[9992,] "3311324"
[9993,] "4540030"
[9994,] "9010000"
[9995,] "2811810"
[9996,] "3311000"
[9997,] "4540030"
[9998,] "4540215"
[9999,] "1541201"
[10000,] "2423810"
[ reached getOption("max.print") -- omitted 90000 rows ]
The output is huge, so I post just the end and the first line:
[9002,] "9460000"
and for dput( head(data$id) ):
"9460000,9433000", "9460000,9436000", "9460000,9437000",
"9510000", "9510010", "9510030", "9510090", "9910000", "9910020",
"9910040", "9910090", "D", "FIELD_NOT_FOUND", "I"), class = "factor")
Thanks in advance, Jef
sapply(id, function(x)
length( # count items
unique( # that are unique
scan( # when arguments are presented to scan as text
text=x, what="", sep =",", # when separated by ","
quiet=TRUE))) )
# --- result: first typed line is 'names' of the items, not the results.
1 2,3,4 1,1
1 3 1
The argument text=x should allow scan to accept a character element of length 1 and break it into components at each occurrence of the separator. The elements get passed one by one to the anonymous function from the id vector (or row by row if it were coming from a data frame).
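The dput output in the question ends with class = "factor", so one plausible reason for the textConnection error is that scan(text = ) is being handed a factor rather than a character string. Under that assumption, a base R sketch that converts to character first and splits all strings in one vectorized call, using the three example rows from the question:

```r
id <- factor(c("0000001,0000003",
               "0000002,0000002",
               "0010001,0001006,0010001"))

# Convert the factor to character, then split every string on "," at once
parts <- strsplit(as.character(id), ",", fixed = TRUE)

# Count the distinct ids within each string
results <- vapply(parts, function(p) length(unique(p)), integer(1))
results
```

Because strsplit works on the whole vector at once, this avoids both the 68-column intermediate data frame and the per-element scan call.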

Get the most expressed genes from one .CEL file in R

In R, the limma package can give you a list of differentially expressed genes.
How can I simply get all the probesets with the highest signal intensity, subject to a threshold?
Can I get only the most expressed genes in a healthy experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group)?
If you run the following script, everything is fine: you have many .CEL files and it all works.
library(GEOquery)  # getGEOSuppFiles()
library(affy)      # ReadAffy()
library(gcrma)     # gcrma()
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "\\.gz$")  # only the gzipped files
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)
But if you manually delete all the .CEL files except one and execute the script from scratch, so that the celData object has only 1 sample:
> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
Then you'll get the error:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
How can I get the most expressed genes from 1 .CEL sample file?
I've found a library that could be useful for my purpose: the panp package.
But, if you run the following script:
if(!require(panp)) { biocLite("panp") }
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)
you'll get an error:
> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero
even though the platform of the GDS is the one expected by the library.
If you run pa.calls() with gcrma.ExpressionSet as the parameter, then everything works:
my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.
In summary, if you run the script you'll get an error while executing:
my_pa <- pa.calls(eset)
and not while executing
my_pa <- pa.calls(gcrma.ExpressionSet)
Why, if they are both ExpressionSet objects?
> is(gcrma.ExpressionSet)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
> is(eset)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette
also available on the Biobase landing page. In particular the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So
> eset = gcrma.ExpressionSet ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
row col
213477_x_at 22779 24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
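Once the expression values are available as a plain matrix, ranking probesets for a single sample or for a group is ordinary matrix work. A minimal sketch, assuming a small made-up matrix in place of exprs(eset); the probe names, sample names, values and threshold are all invented for illustration:

```r
# Stand-in for exprs(eset): rows are probesets, columns are samples
expr <- matrix(c(2.1, 9.5, 4.0, 12.3, 7.7,
                 3.0, 8.8, 4.2, 11.9, 6.5),
               nrow = 5,
               dimnames = list(paste0("probe_", 1:5),
                               c("sample_A.CEL", "sample_B.CEL")))

# One sample: probesets above a threshold, highest first
one_sample <- expr[, "sample_A.CEL"]
threshold <- 7
above <- sort(one_sample[one_sample > threshold], decreasing = TRUE)
names(above)

# A group of samples: rank probesets by their mean expression across the group
group_mean <- rowMeans(expr)
top2 <- names(sort(group_mean, decreasing = TRUE))[1:2]
top2
```

With a real ExpressionSet, the same indexing would be applied to the matrix returned by exprs().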
Use justGCRMA() rather than ReadAffy as a faster and more memory efficient way to get to an ExpressionSet.
Consider asking questions about Bioconductor packages on the Bioconductor support site where you'll get fast responses from knowledgeable members.

caught segfault error in R

I am getting a caught segfault error every time I try to run any plotting functions from the ggplot2 package (1.0.0). I have tried this with qplot, geom_dotplot, geom_histogram, etc. Data from the package (e.g. diamonds or economics) work just fine.
I am operating on Mac OS 10.9.4 (the latest version) and on R 3.1.1 (also the latest version). I get the same error with the standard R GUI, RStudio, and when using R from the command line. The command brings up the default graphic device (Quartz for R GUI and command line), but also the terminal error.
Running a plotting command gives me the error:
*** caught segfault ***
address 0x18, cause 'memory not mapped'
1: .Call("plyr_split_indices", PACKAGE = "plyr", group, n)
2: split_indices(scale_id, n)
3: scale_apply(layer_data, x_vars, scale_train, SCALE_X, panel$x_scales)
4: train_position(panel, data, scale_x(), scale_y())
5: ggplot_build(x)
6: print.ggplot(list(data = list(), layers = list(<environment>), scales = <S4 object of class "Scales">, mapping = list(x = 1:3), theme = list(), coordinates = list(limits = list(x = NULL, y = NULL)), facet = list(shrink = TRUE), plot_env = <environment>, labels = list(x = "1:3", y = "count")))
7: print(list(data = list(), layers = list(<environment>), scales = <S4 object of class "Scales">, mapping = list(x = 1:3), theme = list(), coordinates = list( limits = list(x = NULL, y = NULL)), facet = list(shrink = TRUE), plot_env = <environment>, labels = list(x = "1:3", y = "count")))
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Here is my session info:
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] graphics grDevices utils datasets stats methods base
other attached packages:
[1] ggplot2_1.0.0 marelac_2.1.3 seacarb_3.0 shape_1.4.1 beepr_1.1 birk_1.1
loaded via a namespace (and not attached):
[1] audio_0.1-5 colorspace_1.2-4 digest_0.6.4 grid_3.1.1 gtable_0.1.2
[6] MASS_7.3-34 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.2
[11] reshape2_1.4 scales_0.2.4 stringr_0.6.2 tools_3.1.1
I've gathered from others that this is a memory issue of some sort, but this error occurs even when I have over 2 GB of free RAM. I know this is a widely used package, so of course this doesn't happen for everyone, but why is it happening for me? Does anyone know what I can do to fix this problem?
In case anyone else has this problem or similar in the future, I sent a bug report to the package maintainer and he recommended uninstalling all installed packages and starting over. I took his advice and it worked!
I followed advice from this posting: http://r.789695.n4.nabble.com/Reset-R-s-library-to-base-packages-only-remove-all-installed-contributed-packages-td3596151.html
ip <- installed.packages()
pkgs.to.remove <- ip[!(ip[,"Priority"] %in% c("base", "recommended")), 1]
sapply(pkgs.to.remove, remove.packages)
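Since removing every contributed package is hard to undo, it may be worth inspecting the list before acting on it. A cautious sketch of the same selection, with the destructive call left commented out (is_core is an illustrative name):

```r
ip <- installed.packages()

# Packages whose Priority is "base" or "recommended" ship with R itself
is_core <- ip[, "Priority"] %in% c("base", "recommended")
pkgs.to.remove <- unname(ip[!is_core, "Package"])

length(pkgs.to.remove)   # how many packages would be removed
head(pkgs.to.remove)     # a quick look at the first few names

# Only run this once the list above looks right:
# remove.packages(pkgs.to.remove)
```

Saving pkgs.to.remove to a file first makes it easier to reinstall the same set afterwards.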
This is not an answer to this question but it might be helpful for someone. (Inspired by user1310503. Thanks!)
I am working on a data.frame df with three cols: col1, col2, col3.
df =data.frame(col1=character(),col2=numeric(),col3=numeric(),stringsAsFactors = F)
In the process, rbind is called many times, like:
aList<-list(col1="aaa", col2 = "123", col3 = "234")
dfNew <- as.data.frame(aList)
df <- rbind(df, dfNew)
Finally, df is written to a file via data.table::fwrite
data.table::fwrite(x = df, file = fileDF, append = FALSE, row.names = F, quote = F, showProgress = T)
df has 5973 rows and 3 cols. The "caught segfault" always occurs:
address 0x1, cause 'memory not mapped'. 
The solution to this problem is:
aList<-list(col1=as.character("aaa"), col2 = as.numeric("123"), col3 = as.numeric("234"))
dfNew <- as.data.frame(aList)
dfNew$col1 <- as.character(dfNew$col1)
dfNew$col2 <- as.numeric(dfNew$col2)
dfNew$col3 <- as.numeric(dfNew$col3)
df <- rbind(df, dfNew)
Then this problem is solved. A possible reason is that the column classes of df and dfNew were different.
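A class mismatch like this can be caught before rbind rather than after a crash. A small base R sketch of such a check, assuming the same three columns as above (same_classes is an illustrative name):

```r
df <- data.frame(col1 = character(), col2 = numeric(), col3 = numeric(),
                 stringsAsFactors = FALSE)

# Build the new row with explicit types instead of relying on conversion
dfNew <- data.frame(col1 = "aaa",
                    col2 = as.numeric("123"),
                    col3 = as.numeric("234"),
                    stringsAsFactors = FALSE)

# Compare the column classes of both data frames before binding
same_classes <- identical(sapply(df, class), sapply(dfNew, class))
stopifnot(same_classes)

df <- rbind(df, dfNew)
str(df)
```

If the check fails, printing sapply(dfNew, class) shows which column was silently converted to a factor or a character.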
This is not an answer to this question but it might be useful for someone. I had segfaults when I did pdf to create a PDF graphics device and then used plot. This happened with R 2.15.3, 3.2.4, and one or two other versions, running on Scientific Linux release 6.7. I tried many different things, but the only ways I could get it to work were (a) using png or tiff instead of pdf, or (b) saving large .RData files and then using a completely separate R program to create the graphics.
