R: String Operations on Large Data Set (How to speed up?) - r

I have a large data.frame (>4M rows) in which one column contains character strings. I want to perform several string operations/match regular expressions on each text field (e.g. gsub).
I'm wondering how I can speed up operations? Basically, I'm performing a bunch of
gsub(patternvector," [token] ",tweetDF$textcolumn)
gsub(patternvector," [token] ",tweetDF$textcolumn)
....
I'm running R on a 8GB RAM Mac and tried to move it to the cloud (Amazon EC2 large instance with ~64GB RAM), but it's not going very fast.
I've heard of the several packages (bigmemory, ff) and found an overview about High Performance/Parallel Computing for R here.
Does anyone have recommendations for a package most suitable for speeding up string operations? Or knows of a source explaining how apply the standard R string functions (gsub,..) to the 'objects' created by these 'High Performance Computing packages' ?
Thanks for your help!

mclapply or any other function that allows for parallel processing should speed up the task significantly. If you are not using parallel processing you are only using only 1 CPU, no matter how many CPUs your computer has available.

Related

Working with lage datasets in R (Sentinel 2)

I'm working with more than 500 Gigabyte Rasters in Rstudio.
My code is working fine but the problem is that R is writing all raster data into a temporal folder, that means the computation time is more than 4 days (even on SSD). Is there a way to make the processing faster?
I'm working on a Computer with 64Gigabyte RAM and 1.5 Gigabyte SSD.
best regards
I don't know Sentinel 2, so it's complicated to help you on performance. Basically, you have to try to (a) use some parallel computation with foreach and doparallel packages, (b) find better packages to working with, or (c) reducing the complexity, in addition to the bad-answers like 'R is not suited for large datasets'.
A) One of the solutions would be a parallel computing, if it is possible to divide your calculations (e.g., your problem consists in a lot of calculations but you simply write results). For example, with the foreach and doparallel packages, observing many temporal networks is much faster than with a 'normal' serial for-loop (e.g., foreach/doparallel are very useful to compute basic statistics for each member of the network and for the global network, as soon as you need to repeat these computations to many 'sub-networks' or many 'networks at a time T' and .combine the results in a maxi-dataset). This last .combine arg. will be useless for a single 500 gb networks, so you have to write the results one by one and it will be very long (4 days = several hours or parallel computation, assuming parallel computing will be 6 or 7 times fastest than your actual computation).
B) Sometimes, it is simply a matter of identifying a more suitable package, as in the case of text-mining computations, and the performance offered by the quanteda package. I prefer to compute text-mining with tidyverse style, but for large datasets and before migrating to another language than R, quanteda is very powerful and fast in the end, even on large datasets of texts. In this example, if Quanteda is too slow to compute a basic text-mining on your dataset, you have to migrate to another technology or stop deploying 'death computing' and/or reduce the complexity of your problem / solution / size of datasets (e.g., Quanteda is not - yet - fast to compute a GloVe model on a very large dataset of 500 gb and you are reaching the border of the methods offered by the package Quanteda, so you have to try another langage than R: librairies in Python or Java like SpaCy will be better than R for deploy GloVe model on very large dataset, and it's not a very big step from R).
I would suggest trying the terra package, it has pretty much the same functions as raster, but it can be much faster.

Handling huge simulations in R

I have written R program that generates a random vector of length 1 million. I need to simulate it 1 million times. Out of the 1 million simulations, I will be using 50K observed vectors (chosen in some random manner) as samples. So, 50K cross 1M is the sample size. Is there way to deal it in R?
There are few problems and some not so good solutions.
First R cannot store such huge matrix in my machine. It exceeds RAM memory. I looked into packages like bigmemory, ffbase etc that uses hard disk space. But such a huge data can have size in TB. I have 200GB hard disk available in my machine.
Even if storing is possible, there is a problem of running time. The code may take more than 100Hrs of running time!
Can anyone please suggest a way out! Thanks
This answer really stands as something in between a comment and an answer. The easy way out of your dilemma is to not work with such massive data sets. You can most likely take a reasonably-sized representative subset of that data (say requiring no more than a few hundred MB) and train your model this way.
If you have to use the model in production on actual data sets with millions of observations, then the problem would no longer be related to R.
If possible use sparse matrix techniques
If possible try leveraging storage memory and chunking the object into parts
If possible try to use Big Data tools such as H2O
Leverage multicore and HPC computing with pbdR, parallel, etc
Consider using a spot instance of a Big Data / HPC cloud VPS instance on AWS, Azure, DigitalOcean, etc. Most offer distributions with R preinstalled and with a high RAM multicore instance you can "spin up" (start) and down (stop) quickly and cheaply
Use sampling and statistical solutions when possible
Consider doing some of your simulations or pre-simulation steps in a relational database, or something like Spark + Scala; some have R integration nowadays, actually

R: How to perform lsa() with parallel processing format

I am trying to do some text analytic on tweets, and trying to use LSA() for DR.
However, seems like calculating lsa space is EXTREMELY memory intensive. I can only process up to 2.3k tweets or my computer will die.
As I researched through online resources for parallel processing, I learned that, even though my computer is 4 core, i'll only use 1 of them because that's the default setting in R.
I've also read this post here that is extremely helpful, but seems like that parallel processing can only be done:
on functions that can be used in apply() families
to replace for loops
I am trying to use parallel processing for lsa().
Here's my one line code:
lsa.train = lsa(tdm.train, dimcalc_share())
where the tdm.train is a TermDocumentMatrix with terms as rows and documents as columns.
my question is:
how can i change this line of code of lsa() so that it'll process in parallel format instead of sequential format? such that it'll use n cores instead of 1 core only, where n is number of cores defined by the user (me).

Sharing a data.table in memory for parallel computing

Following the post about data.table and parallel computing, I'm trying to find a way to get an operation on a data.table parallized.
I have a data.table with 4 million rows of 14 observations and would like to share it in a common memory so that operations on it can be parallelized by using the "parallel"-package with parLapply without having to copy the table for each node in the cluster (what parLapply does). At the moment the costs for moving the data.table around are bigger than the benefit of parallel computation.
I found the "bigmemory"-package as an answer for sharing memory, but it doesn't maintain the "data.table"-structure of the data. So does anyone know a way to:
1) put the data.table in shared memory
2) maintain the "data.table"-structure of the data by doing so
3) use parallel processing on this data.table?
Thanks in advance!
Old question, but here is an answer since nobody else has answered and it might be helpful. I assume the problem you are having is because you are on windows and having to use the PSOCK type of cluster. Unfortunately for windows this means you have to copy the data to each node. However, there is a work around. Get hold of docker and spin up an Rserve instance on the docker vm (e.g. stevenpollack/docker-rserve). Since this will be linux based you can create a FORK cluster on the docker vm. Then using your native R instance you can send over only once copy of the data to the Rserve instance (check out the RSclient library), do your parallelized job on the vm, and collect the results back into your native R.
The "complete" solution, shared read and write access from multiple processes, and their problems is discussed here: https://github.com/Rdatatable/data.table/issues/3104
As rookie mentioned, if you fork an R process (with parallel::makeCluster(type = "FORK") or future::plan(multicore) (note that this does not work reliably in RStudio), the operating system will reuse memory pages that are not modified by the child process. So, your workers will share the same memory as long as they don't modify it (Copy-on-write). But this works only if you have all parallel workers on the same machine and fork() has its own problems (although this might be going too far if you simply want to conduct some parallel analysis).
Meanwhile, you could find the packages feather and fst interesting. feather provides a file format that can be read both by R and python and if I understood the docs correctly, feather::feather() gives you a file-backed read-only data-frame, albeit no data.table. This allows for moving data between those two languages.
fst employs the Zstandard compression algorithm to achieve very fast reading and writing speeds to disk. You can read in a part of a fst file using the fst() function (instead of read_fst()). So, every worker could just read the part of your table that it needs. Concurrent writing to the fst file is not possible. You would need to save every result in its own file and concatenate them afterwards.
Alternatively, for concurrent reading and writing, you could switch to a database, albeit that is slower than data.table. See SO/SQLite concurrent access

How to let R use all the cores of the computer?

I have read that R uses only a single CPU. How can I let R use all the available cores to run statistical algorithms?
Yes, for starters, see the High Performance Computing Task View on CRAN. This lists details of packages that can be used in support of parallel computing on a single machine.
From R version 2.14.0, there is inbuilt support for parallel computing via the parallel package, which includes slightly modified versions of the existing snow and multicore packages. The parallel package has a vignette that you should read. You can view it using:
vignette(package="parallel", topic = "parallel")
There are other ways to exploit your multiple cores, for example via use of a multi-threaded BLAS for linear algebra computations.
Whether any of this will speed up the "statistics calculations" you want to do will depend on what those "statistics calculations" are. Spawning off multiple threads or workers entails an overhead cost to set them up, manage them and collect the results. Some operations see a benefit (some large, some small) of using multiple cores/threads, others are slowed down because of this extra overhead.
In short, do not expect to get an n times decrease in your compute time by using n cores instead of just 1.
If you happen to do few* iterations of the same thing (or same code for few* different parameters), the easiest way to go is to run several copies of R -- OS will allocate the work on different cores.
In the opposite case, go and learn how to use real parallel extensions.
For the sake of this answer, few means less or equal the number of cores.

Resources