Parallel processing a function applied with mapply to a large dataset - r

My problem is the following: I have a large dataset in R (which I run from VS Code), which I'll call full, of about 1.2 GB (28 million rows, 10 columns), and a subset of this dataset, which I'll call main (4.3 million rows, 10 columns). I am on Windows with an i7-10700K CPU (3.8 GHz, 8 cores) and 16 GB of RAM.
These datasets contain unique identifiers for products, which span multiple time periods and stores. For each product-store combination, I need to calculate summary statistics over similar products, excluding that store and product. For this reason, I essentially need the full dataset loaded, and I cannot split it.
I have created a function that takes a given product-store, filters the dataset to exclude that product-store, and then performs the summary statistics.
There are over 1 million product-stores, so an apply would take 1 million runs. Each run of the function takes about 0.5 seconds, which adds up to roughly six days of serial computation.
I then decided to use furrr's future_map2 along with plan(cluster, workers = 8) to try to parallelize the process.
One piece of advice that normally counts against parallelization is that, if a lot of data has to be moved to each worker, the transfer itself can take a long time. My understanding is that parallelization would move the large dataset to each worker once, and then perform the apply in parallel. This seems to imply that my process would still be more efficient under parallelization, even with a large dataset.
I wanted to know whether, overall, I am taking the most advisable approach to speeding up the function. I have already switched fully to data.table, so I don't believe there is much left to gain inside the function itself.
Tried parallelizing; worried about what's the smartest approach.
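For reference, here is a minimal sketch of the setup described above, assuming full and main are data.tables and using hypothetical column names (product, store_id, price) and a placeholder statistic:

    library(data.table)
    library(furrr)

    plan(cluster, workers = 8)  # one R session per core

    # Hypothetical function: drop the given product-store, then summarise
    summ_stats <- function(prod, store) {
      sub <- full[!(product == prod & store_id == store)]
      sub[, .(mean_price = mean(price), n_obs = .N)]
    }

    res <- future_map2(main$product, main$store_id, summ_stats)

future detects full as a global and ships it to each worker once per future (by default, one future per worker), not once per function call, which matches the understanding described above.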

Related

Does parallelization in R copy all data in the parent process?

I have a large bioinformatics project where I want to run a small function on about a million markers. The function takes a small tibble (22 rows, 2 columns) and an integer as input; the returned object is about 80 KB each, and no large amount of data is created within the function, just some formatting and statistical testing. I've tried various approaches using the parallel, doParallel and doMC packages, all pretty canonical stuff (foreach, %dopar%, etc.), on a machine with 182 cores, of which I am using 60.
However, no matter what I do, the memory requirement gets into the terabytes quickly and crashes the machine. The parent process holds many gigabytes of data in memory though, which makes me suspicious: Does all the memory content of the parent process get copied to the parallelized processes, even when it is not needed? If so, how can I prevent this?
Note: I'm not necessarily interested in a solution to my specific problem, hence no code example or the like. I'm having trouble understanding the details of how memory works in R parallelization.
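No answer is reproduced here, but the distinction that usually matters is the cluster type. A hedged sketch, with run_marker and small_tibble as hypothetical names:

    library(parallel)

    # PSOCK workers start as empty R sessions: they receive only what you
    # export (plus whatever the supplied function references), not the
    # whole parent workspace.
    cl <- makeCluster(60, type = "PSOCK")
    clusterExport(cl, c("run_marker", "small_tibble"))
    res <- parLapply(cl, seq_len(1e6), function(i) run_marker(small_tibble, i))
    stopCluster(cl)

    # FORK workers (Unix only; used by mclapply and doMC) share the parent's
    # memory copy-on-write, so even a large parent workspace is not
    # physically duplicated until a worker writes to a page.
    res <- mclapply(seq_len(1e6), function(i) run_marker(small_tibble, i),
                    mc.cores = 60)

Also note that the return values alone are substantial here: a million results of about 80 KB each is roughly 80 GB before any copying of inputs.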

I am running 1000 parallel simulations in R on a cluster. What is the best way to manage and recombine the results after?

I currently have a simulation that I need to loop over 1000 times before averaging the results. At the moment, I am putting the results in a list with 1000 entries. However, each simulation takes about 1 hour, so I have set up a scheme whereby I can send my code to be executed in parallel on a cluster.
Currently, it looks like I will end up with 1000 .RData files, which I will try to combine later. I have made sure that the variables of interest in each of the .RData files are numbered to avoid ambiguity.
I was wondering if anyone knew of a more efficient method than this? Is there a way to combine all the results efficiently without opening 1000 .RData files by hand? Thanks!
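One common pattern is to load each .RData file into its own environment so nothing clashes, then reduce the collected objects in a single pass. A sketch, assuming each file stores an object named sim_result:

    files <- list.files("results", pattern = "\\.RData$", full.names = TRUE)

    results <- lapply(files, function(f) {
      e <- new.env()                  # isolate each file's contents
      load(f, envir = e)
      get("sim_result", envir = e)    # assumed object name in each file
    })

    # Average across the simulations (assuming numeric results)
    avg <- Reduce(`+`, results) / length(results)

If you can change the job script, saveRDS()/readRDS() is often cleaner, since each .rds file holds exactly one object and readRDS() returns it directly.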

Parallel processing data analysis - Is there a benefit to having more splits than processor cores?

I am using a naive Bayesian classifier to predict some test data in R. The test data has over 1,000,000,000 records and takes far too long to process on one processor. The computer I am using has (only) four processors in total, three of which I can free up to run my task (I could use all four, but I prefer to keep one for other work I need to do).
Using the foreach and doSNOW packages, and following this tutorial, I have things set up and running. My question is:
I have the dataset split into three parts, one part per processor. Is there a benefit to splitting the dataset into, say, 6, 9, or 12 parts? In other words, what is the trade-off between more splits versus just having one big block of records for each processor core to run?
I haven't provided any data here because I think this question is more theoretical. But if data are needed, please let me know.
Broadly speaking, the advantage of splitting it up into more parts is that you can optimize your processor use.
If the dataset is split into 3 parts, one per processor, and they take the following time:
Split A - 10 min
Split B - 20 min
Split C - 12 min
You can see immediately that two of your processors are going to be idle for a significant fraction of the time needed to complete the full analysis.
Instead, if you have 12 splits, each taking between 3 and 6 minutes to run, then processor A can pick up another chunk of the job after it finishes the first one, instead of idling until the longest-running split finishes.
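In code, this corresponds to making more chunks than workers and letting a load-balancing scheduler hand them out. A sketch with base parallel, where model, test_data, and the predict call are placeholders for the classifier in the question:

    library(parallel)

    cl <- makeCluster(3)                        # three workers, as above
    clusterExport(cl, c("model", "test_data"))  # ship inputs to the workers

    # 12 chunks for 3 workers: a worker that finishes early immediately
    # pulls the next chunk instead of idling
    idx <- seq_len(nrow(test_data))
    chunks <- split(idx, cut(idx, 12, labels = FALSE))

    res <- parLapplyLB(cl, chunks, function(i) predict(model, test_data[i, ]))
    stopCluster(cl)

The trade-off is scheduling and communication overhead per chunk, so very small chunks eventually cost more than they save.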

What are the minimum system requirements for analysing large datasets (30 GB) in R?

I tried running the Apriori algorithm on a 30 GB CSV file in which each row is a basket with up to 34 items (columns). RStudio died just after execution started. I want to know the minimum system requirements, such as how much RAM and what CPU configuration I need, to run algorithms on large datasets.
This question cannot be answered as such; it depends heavily on what you want to do with the data.
Example
If you are able to process the lines one by one, you just need a tiny bit of RAM (for example if you want to count them; I believe this also holds for the most trivial use of Apriori).
If you want to calculate the distance between all points efficiently, you will want a ton of RAM, plus another few GB to store the output (and I believe even this is less intense than the most extreme use of Apriori).
Conclusion
As such, I would recommend:
Use whatever hardware you have to process a subset of the data. Check your memory and CPU usage as you increase the data size (or other parameters), and extrapolate the results to see what you will probably need.
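A sketch of that extrapolation loop, using data.table's fread and a placeholder file name:

    library(data.table)

    for (n in c(1e5, 5e5, 1e6, 5e6)) {
      gc(reset = TRUE)                    # reset the peak-memory counters
      elapsed <- system.time({
        dt <- fread("baskets.csv", nrows = n)
        # ... run the analysis step on dt here ...
      })["elapsed"]
      peak_mb <- sum(gc()[, 6])           # column 6 of gc(): max used (Mb)
      cat(sprintf("rows: %g  time: %.1f s  peak RAM: %.0f MB\n",
                  n, elapsed, peak_mb))
    }

If time and peak memory grow roughly linearly in n, you can extrapolate to the full file; a super-linear curve is an early warning that the full run will not fit.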

Parallel computing for TraMineR

I have a large dataset with more than 250,000 observations, and I would like to use the TraMineR package for my analysis. In particular, I would like to use the commands seqtree and seqdist, which work fine when I use, for example, a subsample of 10,000 observations. The limit my computer can manage is around 20,000 observations.
I would like to use all the observations, and I do have access to a supercomputer that should be able to do just that. However, this doesn't help much, as the process runs on a single core only. Hence my question: is it possible to apply parallel computing techniques to the above-mentioned commands? Or are there other ways to speed up the process? Any help would be appreciated!
The internal seqdist function is written in C++ and has numerous optimizations. For this reason, if you want to parallelize seqdist, you would need to do it in C++. The loop is located in the source file "distancefunctions.cpp"; look at the two loops around line 300 in the function "cstringdistance" (sorry, but all comments are in French). Unfortunately, the second important optimization is that memory is shared between all computations, and for this reason I think parallelization would be very complicated.
Apart from selecting a sample, you should consider the following optimizations:
Aggregation of identical sequences (see here: Problem with big data (?) during computation of sequence distances using TraMineR; a sketch follows this list).
If relevant, you can try to reduce the time granularity. Distance computation time grows quadratically with sequence length, i.e. O(n^2). See https://stats.stackexchange.com/questions/43601/modifying-the-time-granularity-of-a-state-sequence
Reducing the time granularity may also increase the number of identical sequences, and hence the impact of the first optimization.
There is a hidden option in seqdist to use an optimized version of the optimal matching algorithm. It is still in the testing phase (that's why it is hidden), but it should replace the current algorithm in a future version. To use it, set method = "OMopt" instead of method = "OM". Depending on your sequences, it may reduce computation time.
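As a hedged sketch of the first optimization, wcAggregateCases() from the WeightedCluster package (the approach suggested in the linked answer) groups identical cases so distances are computed once per unique sequence; seqdata and the cost settings below are placeholders:

    library(TraMineR)
    library(WeightedCluster)

    ac <- wcAggregateCases(seqdata)       # group identical cases
    unique_seq <- seqdef(seqdata[ac$aggIndex, ], weights = ac$aggWeights)

    # Distances over unique sequences only; costs are illustrative
    d <- seqdist(unique_seq, method = "OM", indel = 1, sm = "TRATE")

The object returned by wcAggregateCases() also carries the mapping back to the original rows, so results can be disaggregated afterwards.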
