Big data: store and analyze objects in R without loading them into RAM

I'm trying to get a dissimilarity matrix with the function beta.pair of the betapart package from a database with 11000 rows x 6000 columns.
However, when I execute the function R gives me the error: "cannot allocate a vector of size 10 GB".
I'm using a Linux system with 65 GB of RAM and 64-bit R, so I shouldn't have this limitation, and R should be able to use all the available memory of the machine, but it doesn't.
Do you know how I can use a package (e.g. bigmemory, ff) to store the dissimilarity matrix and later use it to perform a UPGMA analysis, keeping the result on disk rather than in RAM?
Exactly what I would like to do is:
library(betapart)
data_base  # database with 11000 rows and 6000 columns
distance <- beta.pair(data_base, index.family = "sorensen")  # object containing the dissimilarity matrices
beta_similarity <- distance$beta.sim  # select one of the dissimilarity matrices provided by beta.pair
library(cluster)
UPGMA <- agnes(beta_similarity)  # UPGMA analysis
Thanks in advance,
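
A minimal sketch of what a disk-backed matrix could look like with the ff package mentioned above (the dimensions and file name are illustrative, and this is not a drop-in replacement for beta.pair, which still builds its result in memory):
library(ff)
# Hypothetical sketch: an 11000 x 11000 double matrix backed by a file on disk,
# to be filled block by block so the full matrix never has to sit in RAM at once.
n <- 11000
beta_sim_ff <- ff(vmode = "double", dim = c(n, n), filename = "beta_sim.ff")
# ... fill beta_sim_ff[i, j] chunk-wise from pairwise dissimilarities ...
Note, however, that agnes() expects an in-memory dist object, so the clustering step itself still needs the full lower triangle in RAM; for 11000 sites that triangle is only about 0.5 GB of doubles, which suggests it is an intermediate object created by beta.pair, not the final dist, that triggers the 10 GB allocation.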

Related

Calculate Cosine Similarity between two documents in TermDocumentMatrix of tm Package in R

My task is to compare documents in a corpus by cosine similarity. I use the tm package and obtain the TermDocumentMatrix (in tf-idf form) tdm. The following task should be as simple as stated here:
d <- dist(tdm, method="cosine")
or
cosine_dist_mat <- 1 - crossprod_simple_triplet_matrix(tdm)/(sqrt(col_sums(tdm^2) %*% t(col_sums(tdm^2))))
However, the number of terms in my tdm is quite large, more than 120,000 (with around 50,000 documents). It is beyond the capability of R to handle such a matrix.
My RStudio crashed several times.
My questions are: 1) how can I handle such a large matrix and get the pair-wise (120,000 x 120,000) cosine similarity? 2) If that is impossible, how can I get the cosine similarity of just two documents at a time? Suppose I want the similarity between documents 10 and 21; then something like
sim10_21 <- cosine_similarity(tdm, d1 = 10, d2 = 21)
If tdm were a simple matrix, I could do the calculation on tdm[, c(10, 21)]. However, converting tdm to a matrix is exactly what I cannot handle. My question ultimately boils down to how to do matrix-like calculations on tdm.
A 120,000 x 120,000 matrix of doubles (8 bytes each) is 115.2 gigabytes. This isn't necessarily beyond the capability of R, but you do need at least that much memory, regardless of what language you use. Realistically, you'll probably want to write to disk, either using a database such as SQLite (e.g. the RSQLite package) or, if you plan to use only R in your analysis, the "ff" package for storing/accessing large matrices on disk.
You could also compute the similarities iteratively, in blocks, and multithread the computation to improve speed.
To find the distance between two documents, you can do something like this (this assumes the proxy package is loaded, since its dist() knows a "cosine" measure):
dist(t(as.matrix(tdm[, 1])), t(as.matrix(tdm[, 2])), method = "cosine")
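For question 2), a minimal sketch of a pair-wise helper (the function name cosine_pair is made up; it assumes tdm is a tm TermDocumentMatrix with documents in the columns and densifies only the two columns involved):
library(tm)
# Hypothetical helper: cosine similarity between two documents, computed
# without converting the whole TermDocumentMatrix to a dense matrix.
cosine_pair <- function(tdm, d1, d2) {
  x <- as.numeric(as.matrix(tdm[, d1]))  # densify only column d1
  y <- as.numeric(as.matrix(tdm[, d2]))  # densify only column d2
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}
# e.g. the similarity between documents 10 and 21:
# sim10_21 <- cosine_pair(tdm, 10, 21)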

Make For Loop and Spatial Computing Faster?

I am playing with a large dataset (~1.5m rows x 21 columns) which includes the longitude and latitude of each transaction. I am computing the distance of each transaction from a couple of target locations and appending these distances as new columns to the main dataset:
TargetLocation1 <- data.frame(Long = XX.XXX, Lat = XX.XXX, Name = "TargetLocation1", Size = ZZZZ)
TargetLocation2 <- data.frame(Long = XX.XXX, Lat = XX.XXX, Name = "TargetLocation2", Size = YYYY)
## MainData[6:7] are the long and lat columns
MainData$DistanceFromTarget1 <- distVincentyEllipsoid(MainData[6:7], TargetLocation1[1:2])
MainData$DistanceFromTarget2 <- distVincentyEllipsoid(MainData[6:7], TargetLocation2[1:2])
I am using the geosphere package's distVincentyEllipsoid function to compute the distances. As you can imagine, distVincentyEllipsoid is compute-intensive, but it is more accurate than the other distance functions of the same package (distHaversine(), distMeeus(), distRhumb(), distVincentySphere()).
Q1) It takes me about 5-10 minutes to compute the distances for each target location [I have 16 GB RAM and an i7-6600U 2.81 GHz Intel CPU], and I have multiple target locations. Is there any faster way to do this?
Q2) Then I create a new column with a categorical variable that marks each transaction that falls within the market definition of a target location, using a for loop with 2 if statements. Is there any other way to make this computation faster?
MainData$TransactionOrigin <- "Other"
for (x in 1:nrow(MainData)) {
  if (MainData$DistanceFromTarget1[x] <= 7000)
    MainData$TransactionOrigin[x] <- "Target1"
  if (MainData$DistanceFromTarget2[x] <= 4000)
    MainData$TransactionOrigin[x] <- "Target2"
}
Thanks
Regarding Q2
This will run much faster if you lose the loop.
MainData$TransactionOrigin <- "Other"
MainData$TransactionOrigin[which(MainData$DistanceFromTarget1 <= 7000)] <- "Target1"
MainData$TransactionOrigin[which(MainData$DistanceFromTarget2 <= 4000)] <- "Target2"
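Regarding Q1, one option that may be worth trying (assuming its accuracy is acceptable for your market definitions) is geosphere::distGeo, which is also ellipsoid-based but implemented in compiled code and vectorised over the rows of MainData:
library(geosphere)
# Same call pattern as distVincentyEllipsoid(), but typically much faster;
# verify on a subset that the distances agree closely enough for your cut-offs.
MainData$DistanceFromTarget1 <- distGeo(MainData[6:7], TargetLocation1[1:2])
MainData$DistanceFromTarget2 <- distGeo(MainData[6:7], TargetLocation2[1:2])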

Very slow raster::sampleRandom, what can I do as a workaround?

tl;dr: why is raster::sampleRandom taking so much time, e.g. to extract 3k of 30k cells (over 10k timesteps)? Is there anything I can do to improve the situation?
EDIT: workaround at bottom.
Consider an R script in which I have to read a big file (usually more than 2-3 GB) and perform quantile calculations over the data. I use the raster package to read the (netCDF) files. I'm using R 3.1.2 under 64-bit GNU/Linux with 4 GB of RAM, 3.5 GB available most of the time.
As the files are often too big to fit into memory (for some reason even a 2 GB file will NOT fit into 3 GB of available memory: "unable to allocate vector of size 2 GB"), I cannot always do the following, which is what I would do if I had 16 GB of RAM:
pr <- brick(filename[i], varname = var[i], na.rm = TRUE)
qs <- quantile(getValues(pr) * gain[i], probs = qprobs, na.rm = TRUE, type = 8, names = FALSE)
But instead I can sample a smaller number of cells in my files using the function sampleRandom() from the raster package, still getting good statistics.
e.g.:
pr <- brick(filename[i], varname = var[i], na.rm = TRUE)
qs <- quantile(sampleRandom(pr, cnsample) * gain[i], probs = qprobs, na.rm = TRUE, type = 8, names = FALSE)
I perform this over 6 different files (i goes from 1 to 6) which all have about 30k cells and 10k timesteps (so 300M values). Files are:
1.4GB, 1 variable, filesystem 1
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
1.2GB, 1 variable, filesystem 3
1.2GB, 1 variable, filesystem 3
Note that:
the files are on three different NFS filesystems, whose performance I'm not sure of; I cannot rule out that their performance varies greatly from one moment to the next.
RAM usage is 100% all of the time the script runs, but the system does not use all of its swap.
sampleRandom(dataset, N) takes N non-NA random cells from one layer (= one timestep), and reads their content. Does so for the same N cells for each layer. If you visualize the dataset as a 3D matrix, with Z as timesteps, the function takes N random non-NA columns. However, I guess the function does not know that all the layers have the NAs in the same positions, so it has to check that any column it chooses does not have NAs in it.
When using the same commands on files with 8393 cells (about 340 MB in total) and reading all of the cells, the computing time is a fraction of what it takes to read 1000 cells from a file with 30k cells.
The full script which produces the output below is here, with comments etc.
If I try to read all the 30k cells:
cannot allocate vector of size 2.6 Gb
If I read 1000 cells (times for the six files):
5 min
45 min
30 min
30 min
20 min
20 min
If I read 3000 cells (times for the six files):
15 min
18 min
35 min
34 min
60 min
60 min
If I try to read 5000 cells:
2.5 h
22 h
for the files beyond the second I had to stop after 18 h, as I needed the workstation for other tasks
With more tests, I've been able to find out that it's the sampleRandom() function that's taking most of the computing time, not the calculation of the quantile (which I can speed up using other quantile functions, such as kuantile()).
Why is sampleRandom() taking so long? Why does it perform so strangely, sometimes fast and sometimes very slow?
What is the best workaround? I guess I could manually generate N random cells for the 1st layer and then manually raster::extract for all timesteps.
EDIT:
Working workaround is to do:
cells <- sampleRandom(pr[[1]], cnsample, cells = TRUE)  # extract cnsample random cells from the first layer, excluding NAs
cells[, 1]  # the cell numbers are in the first column
prvals <- pr[cells[, 1]]  # read those cells from all layers
qs <- quantile(prvals, probs = qprobs, na.rm = TRUE, type = 8, names = FALSE)  # compute the quantiles
This works and is very fast because all layers have NAs in the same positions. I think this should be an option that sampleRandom() could implement.
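For reuse across the six files, the workaround can be wrapped in a small helper (the function name is mine; the arguments mirror the objects used above):
# Sample n random non-NA cells from the first layer only, then read those
# cells from every layer and compute quantiles on the sampled values.
sample_quantiles <- function(b, n, probs, gain = 1) {
  cells <- sampleRandom(b[[1]], n, cells = TRUE)[, "cell"]
  vals <- b[cells]
  quantile(vals * gain, probs = probs, na.rm = TRUE, type = 8, names = FALSE)
}
# e.g.: qs <- sample_quantiles(pr, cnsample, qprobs, gain[i])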

How can I remove invisible objects from an R workspace that are not removed by garbage collection?

I am using the blackboost function from the mboost package to estimate a model on an approximately 500 MB dataset on a Windows 7 64-bit machine with 8 GB of RAM. During the execution R uses up to virtually all available memory. After the calculation is done, over 4.5 GB stays allocated to R, even after calling the garbage collector with gc() or saving and reloading the workspace into a new R session. Using .ls.objects (see question 1358003) I found that the size of all visible objects is about 550 MB.
The output of gc() tells me that the bulk of data is in vector cells, although I'm not sure what that means:
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   2856967  152.6    4418719  236.0   3933533  210.1
Vcells 526859527 4019.7  610311178 4656.4 558577920 4261.7
This is what I'm doing:
> memory.size()
[1] 1443.99
> model <- blackboost(formula, data = mydata[mydata$var == 1, c(dv, ivs)], tree_control = ctree_control(maxdepth = 4))
...a bunch of packages are loaded...
> memory.size()
[1] 4431.85
> print(object.size(model),units="Mb")
25.7 Mb
> memory.profile()
        NULL      symbol    pairlist     closure environment     promise    language
           1       15895      826659       20395        4234       13694      248423
     special     builtin        char     logical     integer      double     complex
         174        1572     1197774       34286       84631       42071          28
   character         ...         any        list  expression    bytecode externalptr
      228592           1           0       79877           1       51276        2182
     weakref         raw          S4
         413         417        4385
mydata[mydata$var == 1,c(dv,ivs)] has 139593 rows and 75 columns with mostly factor variables and some logical or numerical variables. formula is a formula object of the type: "dv ~ var2 + var3 + .... + var73". dv is a variable name string and ivs is a string vector with all independent variables var2 ... var74.
Why is so much memory being allocated to R? How can I make R free up the extra memory? Any thoughts appreciated!
I have talked to one of the package authors, who told me that much of the data associated with the model object is saved in environments, which explains why object.size does not reflect the complete memory usage that blackboost induces in R. He also told me that the mboost package is not optimized for speed and memory efficiency but aims at flexibility, and that all trees (and thereby the data) are saved, which explains the large amount of memory used (I still find the dimensions remarkable...). He recommended using the gbm package (with which I could not replicate my results yet) or serializing, by doing something like this:
### first M_1 iterations
mod <- blackboost(...)[M_1]
f1 <- fitted(mod)
rm(mod)
### then M_2 additional iterations ...
mod <- blackboost(..., offset = f1)[M_2]
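As a toy illustration of the environments point above (not mboost-specific; everything here is made up for the example), object.size() does not count data that is only reachable through a function's enclosing environment:
make_model <- function() {
  big <- rnorm(5e6)  # ~40 MB, kept alive by the closure returned below
  list(predict = function(x) x + mean(big))
}
m <- make_model()
print(object.size(m), units = "Mb")  # reports a tiny size, yet ~40 MB stays reachable via the closure's environment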
From what I can gather, it's not gc() in R that's the problem, but the fact that the memory is not fully returned to the OS.
This thread doesn't provide an answer, but it sheds light on the nature of the issue.

Running clustering analysis on a cluster causes one of the nodes to crash

I am a user of a Rocks 4.3 cluster with 22 nodes. I am using it to run a clustering function - parPvclust - on a dataset of 2 million rows and 100 columns (it clusters the sample names in the columns). To run parPvclust, I am using a C-shell script in which I've embedded some R code. Using the R code below with the dataset of 2 million rows and 100 columns, I always crash one of the nodes.
library("Rmpi")
library("pvclust")
library("snow")
cl <- makeCluster()
load("dataset.RData") # dataset.m: 2 million rows x 100 columns
# subset.m <- dataset.m[1:200000,] # 200 000 rows x 100 columns
output <- parPvclust(cl, dataset.m, method.dist = "correlation", method.hclust = "ward", nboot = 500)
save(output, file = "clust.RData")
I know that the C-shell script works, and I know that the R code actually works with a smaller dataset, because if I use a subset of the dataset (commented out above) the code runs fine and I get an output. Likewise, the non-parallelized version (i.e. plain pvclust) also works fine, although running it defeats the speed gain of parallelization.
The parPvclust function requires the Rmpi and snow R packages (for parallelization) and the pvclust package.
The following can produce a reasonable approximation of the dataset I'm using:
dataset <- matrix(unlist(lapply(rnorm(n = 2000, 0, 1), rep, sample.int(1000, 1))), ncol = 100, nrow = 2000000)
Are there any ideas as to why I always crash a node with the larger dataset and not the smaller one?

Resources