R: How to perform lsa() with parallel processing format - r

I am trying to do some text analytics on tweets and want to use lsa() for dimensionality reduction (DR).
However, calculating the LSA space seems to be EXTREMELY memory intensive. I can only process up to 2.3k tweets before my computer dies.
While researching parallel processing online, I learned that even though my computer has 4 cores, R will only use 1 of them by default.
I've also read this post here, which is extremely helpful, but it seems that parallel processing can only be done:
on functions that can be used in apply() families
to replace for loops
I am trying to use parallel processing for lsa().
Here's my one line code:
lsa.train = lsa(tdm.train, dimcalc_share())
where the tdm.train is a TermDocumentMatrix with terms as rows and documents as columns.
My question is:
How can I change this lsa() call so that it runs in parallel instead of sequentially, using n cores instead of 1, where n is the number of cores defined by the user (me)?
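
For reference, the apply-style pattern described above looks roughly like the following sketch with the base parallel package. The toy matrix m and the per-column task are illustrative assumptions, not part of the original code. This does not parallelize lsa() itself; it only shows how a user-chosen number of cores n is put to work on independent pieces.

```r
library(parallel)

n <- 2                                   # number of cores chosen by the user
set.seed(1)
m <- matrix(rpois(50, 2), nrow = 10)     # toy stand-in for a term-document matrix

cl <- makeCluster(n)                     # start n worker processes
# Each column is an independent task, so it fits the apply family.
# The matrix is passed explicitly as an extra argument to the workers.
col_norms <- parSapply(cl, seq_len(ncol(m)),
                       function(j, mat) sqrt(sum(mat[, j]^2)), mat = m)
stopCluster(cl)                          # always release the workers
```

parallel::detectCores() reports how many cores the machine has, which can help when picking n.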

Related

Does parallellization in R copy all data in the parent process?

I have some large bioinformatics project where I want to run a small function on about a million markers, which takes a small tibble (22 rows, 2 columns) as well as an integer as input. The returned object is about 80KB each, and no large amount of data is created within the function, just some formatting and statistical testing. I've tried various approaches using the parallel, doParallel and doMC packages, all pretty canonical stuff (foreach, %dopar% etc.), on a machine with 182 cores, of which I am using 60.
However, no matter what I do, the memory requirement gets into the terabytes quickly and crashes the machine. The parent process holds many gigabytes of data in memory though, which makes me suspicious: Does all the memory content of the parent process get copied to the parallelized processes, even when it is not needed? If so, how can I prevent this?
Note: I'm not necessarily interested in a solution to my specific problem, hence no code example or the like. I'm having trouble understanding the details of how memory works in R parallelization.
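
As a rough illustration of the distinction being asked about: with socket (PSOCK) clusters from the parallel package, workers start as fresh R sessions and receive only what is sent to them, whereas fork-based backends (mclapply, doMC) start from the parent's memory and share it copy-on-write. The sketch below uses made-up objects (big, small_inputs, worker) to check that a large object in the parent is not present on socket workers unless it is exported.

```r
library(parallel)

big <- numeric(1e6)                      # hypothetical large object held by the parent
set.seed(42)
small_inputs <- lapply(1:8, function(i) list(id = i, x = rnorm(22)))

# The worker only touches its own argument, never `big`
worker <- function(inp) c(id = inp$id, mean = mean(inp$x))

cl <- makeCluster(2)                     # PSOCK cluster: fresh R processes
res <- parLapply(cl, small_inputs, worker)

# Ask each worker whether `big` exists in its global environment
has_big <- unlist(clusterEvalQ(cl, exists("big")))
stopCluster(cl)
```

One caveat worth checking in real code: a worker function defined inside another function carries its enclosing environment along when it is serialized, so large local objects can travel to the workers unintentionally.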

Working with large datasets in R (Sentinel 2)

I'm working with more than 500 gigabytes of rasters in RStudio.
My code is working fine, but the problem is that R writes all raster data into a temporary folder, which means the computation takes more than 4 days (even on an SSD). Is there a way to make the processing faster?
I'm working on a computer with 64 gigabytes of RAM and a 1.5 gigabyte SSD.
best regards
I don't know Sentinel 2, so it is hard to help with performance. Basically, you can (a) use parallel computation with the foreach and doParallel packages, (b) find a better-suited package, or (c) reduce the complexity of the problem, leaving aside non-answers like 'R is not suited for large datasets'.
A) One solution would be parallel computing, if your calculations can be divided (e.g., your problem consists of many calculations whose results you simply write out). For example, with the foreach and doParallel packages, analysing many temporal networks is much faster than with a 'normal' serial for-loop (e.g., foreach/doParallel are very useful for computing basic statistics for each member of a network and for the global network, whenever you need to repeat these computations over many 'sub-networks' or many 'networks at a time T' and .combine the results into one large dataset). That .combine argument will be useless for a single 500 GB network, so you will have to write the results one by one, and it will still take a long time (4 days would become several hours of parallel computation, assuming parallel computing is 6 or 7 times faster than your current computation).
B) Sometimes it is simply a matter of identifying a more suitable package, as with text-mining computations and the performance offered by the quanteda package. I prefer to do text mining in tidyverse style, but for large datasets, and before migrating from R to another language, quanteda is very powerful and fast, even on large text datasets. In this example, if quanteda is too slow for basic text mining on your dataset, you have to migrate to another technology, stop attempting 'death computing', and/or reduce the complexity of your problem, your solution, or the size of your datasets (e.g., quanteda is not yet fast at computing a GloVe model on a very large 500 GB dataset; at that point you are reaching the limits of what the package offers, so you should try a language other than R: libraries in Python or Java, like spaCy, will be better than R for fitting a GloVe model on a very large dataset, and it is not a very big step from R).
I would suggest trying the terra package, it has pretty much the same functions as raster, but it can be much faster.

R: String Operations on Large Data Set (How to speed up?)

I have a large data.frame (>4M rows) in which one column contains character strings. I want to perform several string operations/match regular expressions on each text field (e.g. gsub).
I'm wondering how I can speed up operations? Basically, I'm performing a bunch of
gsub(patternvector," [token] ",tweetDF$textcolumn)
gsub(patternvector," [token] ",tweetDF$textcolumn)
....
I'm running R on an 8 GB RAM Mac and tried moving it to the cloud (an Amazon EC2 large instance with ~64 GB RAM), but it's not going very fast.
I've heard of several packages (bigmemory, ff) and found an overview about High Performance/Parallel Computing for R here.
Does anyone have recommendations for a package most suitable for speeding up string operations? Or does anyone know of a source explaining how to apply the standard R string functions (gsub, ...) to the 'objects' created by these 'High Performance Computing packages'?
Thanks for your help!
mclapply, or any other function that allows for parallel processing, should speed up the task significantly. If you are not using parallel processing, you are using only 1 CPU, no matter how many CPUs your computer has available.
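
A minimal sketch of that suggestion, assuming a made-up character vector texts and a single URL pattern in place of the real tweetDF$textcolumn and patterns: the vector is split into contiguous chunks, each chunk is passed to gsub on its own core, and the pieces are put back together in order. mclapply relies on forking, so on Windows the sketch falls back to one core.

```r
library(parallel)

# Hypothetical stand-in for tweetDF$textcolumn
texts <- c("see http://a.b now", "no link here", "visit http://c.d", "plain")
pattern <- "http://\\S+"

# Forking is not available on Windows, so use a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split into contiguous chunks so the original order is preserved
chunk_size <- ceiling(length(texts) / n_cores)
chunks <- split(texts, ceiling(seq_along(texts) / chunk_size))

cleaned <- unlist(
  mclapply(chunks,
           function(ch) gsub(pattern, " [url] ", ch, perl = TRUE),
           mc.cores = n_cores),
  use.names = FALSE
)
```

Whether this helps depends on the chunks being large enough to outweigh the cost of starting the worker processes; on a few thousand short strings a plain gsub call is likely faster.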

Parallel computing for TraMineR

I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the commands seqtree and seqdist, which work fine when I use, for example, a subsample of 10,000 observations. The limit my computer can manage is around 20,000 observations.
I would like to use all the observations, and I do have access to a supercomputer that should be able to do just that. However, this doesn't help much, as the process runs on a single core only. Hence my question: is it possible to apply parallel computing techniques to the above-mentioned commands? Or are there other ways to speed up the process? Any help would be appreciated!
The internal seqdist function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize seqdist, you need to do it in C++. The loop is located in the source file "distancefunctions.cpp" and you need to look at the two loops located around line 300 in function "cstringdistance" (Sorry but all comments are in French). Unfortunately, the second important optimization is that the memory is shared between all computations. For this reason, I think that parallelization would be very complicated.
Apart from selecting a sample, you should consider the following optimizations:
Aggregation of identical sequences (see here: Problem with big data (?) during computation of sequence distances using TraMineR )
If relevant, you can try to reduce the time granularity. Distance computation time is highly dependent on sequence length (it grows quadratically, O(n^2), with the length n). See https://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence
Reducing time granularity may also increase the number of identical sequences, and hence, the impact of optimization one.
There is a hidden option in seqdist to use an optimized version of the optimal matching algorithm. It is still in testing phase (that's why it is hidden), but it should replace the actual algorithm in a future version. To use it, set method="OMopt", instead of method="OM". Depending on your sequences, it may reduce computation time.
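
The aggregation idea in the first point can be sketched in base R as follows. The toy sequences and the position-by-position mismatch count are assumptions standing in for real data and for the seqdist call; the point is only that distances are computed once per distinct sequence and then expanded back to all cases.

```r
# Toy state sequences, one row per case (hypothetical data)
seqs <- rbind(c("A", "B", "C"),
              c("A", "B", "C"),
              c("B", "B", "A"),
              c("A", "B", "C"))

key <- apply(seqs, 1, paste, collapse = "-")   # one string per sequence
uniq_idx <- !duplicated(key)
uniq_seqs <- seqs[uniq_idx, , drop = FALSE]    # distinct sequences only
map <- match(key, key[uniq_idx])               # case -> distinct sequence
weights <- tabulate(map)                       # how often each one occurs

# Stand-in distance (count of differing positions); in practice this is
# where the expensive distance computation on uniq_seqs would go
mismatch <- function(i, j) sum(uniq_seqs[i, ] != uniq_seqs[j, ])
idx <- seq_len(nrow(uniq_seqs))
d_uniq <- outer(idx, idx, Vectorize(mismatch))

# Expand back to one row and column per original case
d_full <- d_uniq[map, map]
```

Here four cases collapse to two distinct sequences, so the pairwise work shrinks from a 4 x 4 to a 2 x 2 problem; the saving grows with the share of duplicates in the data.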

How to implement a multicore processors when using regsubset exhaustive method in Rstudio server

How would I rewrite my code so that I can implement the use of multicore on an Rstudio server to run regsubsets from the leaps package using the "exhaustive" method? The data has 1200 variables and 9000 obs so the code has been shortened here:
model<-regsubsets(price~x + y + z + a + b + ...., data=sample,
nvmax=500, method=c("exhaustive"))
Our server is a quad core 7.5 gb ram, is that enough for an equation like this?
In regard to your second question I would say: just give it a try and see. In general, a 1200 by 9000 dataset is not particularly large, but whether or not it works also depends on what regsubsets does under the hood.
In general, the approach here would be to cut the problem up into pieces and run each piece on a core, in your case 4 cores. Parallelization works most effectively if each process that runs on a core takes some time (e.g. 10 min). If it takes very little time, the overhead of parallelization will only serve to increase the time the total analysis takes. Creating a cluster on your server is quite easy using e.g. SNOW (available on CRAN). The approach I often use is to create a cluster using the doSNOW package, and then use the functions from the plyr package. A blog post I wrote recently provides some background and example code.
In your specific case, if regsubsets does not support parallelization out of the box, you'll have to do some coding yourself. I think a variable selection method such as regsubsets requires the entire dataset to be used, so I think solving the parallelization by running several regsubsets calls in parallel is not feasible. So I think you'll have to adapt the function to include the parallelization. Somewhere inside the function the different variable selections are evaluated; there you could send those evaluations to different cores. Do take care: if such an evaluation takes only a little time, the parallelization overhead will only slow your analysis down. In that case you need to send a group of evaluations to each core in order to increase the time spent on each node, decreasing the amount of overhead from parallelization.
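
The grouping idea can be sketched with the base parallel package. This is not how regsubsets works internally; as a stand-in, it fits every two-variable lm on the built-in mtcars data and scores it by BIC, with the candidate models split into one group per worker so that each worker receives a batch rather than a single tiny task. The variable list and the helper eval_group are illustrative assumptions.

```r
library(parallel)

vars <- c("wt", "hp", "qsec", "drat")
subsets <- combn(vars, 2, simplify = FALSE)    # 6 candidate models

n_workers <- 2
# One group of candidate models per worker
groups <- split(subsets, rep(seq_len(n_workers), length.out = length(subsets)))

# Evaluate a whole group in one call to keep communication overhead low
eval_group <- function(group) {
  do.call(rbind, lapply(group, function(v) {
    fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
    data.frame(model = paste(v, collapse = " + "), bic = BIC(fit))
  }))
}

cl <- makeCluster(n_workers)
scores <- do.call(rbind, parLapply(cl, groups, eval_group))
stopCluster(cl)

best <- scores[which.min(scores$bic), ]        # lowest BIC wins
```

With only six small fits the parallel version will be slower than a serial loop, which is exactly the overhead effect described above; the pattern pays off only when each group takes a substantial amount of time.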
