Parallel cpu processing tm Dcorpus polarity - r

I am trying to analyse a database of roughly 80,000 txt documents in R by scoring the polarity of each sentence in the text.
My problem is that my computer (12 GB RAM, 8 CPUs, Windows 10) cannot transform the txt files into a corpus in a reasonable time: it takes more than two days.
I found out that the DCorpus function offers a way to use all CPUs in parallel. However, starting from the DCorpus, I don't know how to run the splitSentence function, the conversion to a data frame, and the scoring via the polarity function on all CPUs in parallel as well.
Moreover, I am not sure whether parallelizing the code would help with the RAM usage.
Thanks for your help in advance!

All of your problems arise from using the tm package, which is incredibly inefficient.
Try, for example, the text2vec package. I believe you will be able to perform your analysis in minutes and with very moderate RAM usage.
Disclosure - I'm the author of this package.

Related

How to speed up EDA and model running in R?

I am running an exploratory data analysis with the SmartEDA package (https://cran.r-project.org/web/packages/SmartEDA/SmartEDA.pdf), and one of its functions, "ExpReport", automatically creates an exploratory data analysis report in HTML format.
I have a dataset with 172 variables and 16487 rows, and this is taking a very long time to run! Is there a way to speed up R for every task we do?
I will also have to run some models on this data (and eventually more data), such as randomForest, logistic regression, etc., and would like a method to do this quickly.
I have heard about parallel processing but can't really understand how it works, or whether it works only with specific packages or functions...
Thank you all!
This picture shows how memory and CPU are affected just by running "ExpReport":
[Image: info about memory and CPU consumption]
The problem with large data sets in R is that R reads the entire data set into RAM at once, and R objects live entirely in memory.
The doMC package provides a parallel backend for foreach's %dopar% operator using the multicore functionality of the parallel package.
Secondly, packages like bigmemory, ff and data.table come in really handy.
Here is a vignette that will help you handle large datasets:
https://rpubs.com/msundar/large_data_analysis
Hope you find this helpful.
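To make the idea of parallel processing more concrete, here is a minimal sketch that uses only the base parallel package rather than doMC. The toy data frame, the helper summarise_column and the choice of 2 workers are assumptions for illustration; the point is only that independent pieces of work (here, one summary per column) are handed to separate worker processes and collected again.

```r
library(parallel)

# Hypothetical per-column summary; stands in for any independent unit of work.
summarise_column <- function(x) {
  c(mean = mean(x), sd = sd(x), n_missing = sum(is.na(x)))
}

# Toy data frame standing in for the real data set.
set.seed(1)
toy <- as.data.frame(matrix(rnorm(2000 * 20), ncol = 20))

# Start a small socket cluster of 2 workers (an arbitrary choice here).
cl <- makeCluster(2)

# Each worker receives a share of the columns and applies the function;
# the results come back as one list in the original column order.
par_result <- parLapply(cl, toy, summarise_column)

# Always release the workers when done.
stopCluster(cl)

# The serial equivalent, for comparison.
ser_result <- lapply(toy, summarise_column)
```

Parallelism of this kind only helps when the work can be split into independent pieces, and each worker holds its own copy of the data it is given, so it tends to raise rather than lower total memory use.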

Working with large datasets in R (Sentinel 2)

I'm working with more than 500 gigabytes of rasters in RStudio.
My code is working fine, but the problem is that R writes all raster data into a temporary folder, which means the computation takes more than 4 days (even on an SSD). Is there a way to make the processing faster?
I'm working on a computer with 64 gigabytes of RAM and a 1.5 Gigabyte SSD.
best regards
I don't know Sentinel 2, so it's hard to help you with performance. Basically, you can try to (a) use parallel computation with the foreach and doParallel packages, (b) find better packages to work with, or (c) reduce the complexity, in addition to unhelpful answers like 'R is not suited for large datasets'.
A) One solution would be parallel computing, if your calculations can be divided (e.g., your problem consists of a lot of calculations but you simply write out the results). For example, with the foreach and doParallel packages, analysing many temporal networks is much faster than with a 'normal' serial for-loop (e.g., foreach/doParallel are very useful for computing basic statistics for each member of the network and for the global network, whenever you need to repeat these computations over many 'sub-networks' or many 'networks at a time T' and .combine the results into one large dataset). That .combine argument will be useless for a single 500 GB network, so you have to write the results one by one, and it will take very long (4 days would become several hours of parallel computation, assuming parallel computing is 6 or 7 times faster than your current computation).
B) Sometimes it is simply a matter of identifying a more suitable package, as in the case of text mining and the performance offered by the quanteda package. I prefer to do text mining in tidyverse style, but for large datasets, and before migrating from R to another language, quanteda is very powerful and fast, even on large text datasets. In this example, if quanteda is too slow for basic text mining on your dataset, you have to migrate to another technology, stop deploying 'death computing', and/or reduce the complexity of your problem, solution, or dataset size (e.g., quanteda is not yet fast at computing a GloVe model on a very large 500 GB dataset; at that point you are reaching the limits of the methods the package offers, so you have to try a language other than R: Python or Java libraries such as spaCy will be better than R for fitting a GloVe model on a very large dataset, and it's not a very big step from R).
I would suggest trying the terra package, it has pretty much the same functions as raster, but it can be much faster.

Error in RStudio while running decision tree (mac)

I am running a CART decision tree on a training set which I've tokenized using quanteda for a routine text analysis task. The resulting DFM from tokenizing was turned into a dataframe and appended with the class attribute I am predicting for.
Like many DFMs, the table is very wide (33k columns), but only contains about 5,500 rows of documents. Calling rpart on my training set returns a stack overflow error.
If it matters, to help increase the speed of calculations, I am using the doSNOW library so I can run the model on 3 out of 4 of my cores in parallel.
I've looked at this answer but can't figure out how to do the equivalent on my mac workstation to see if the same solution would work for me. There is a chance that even if I increase the ppsize of RStudio, I may still run into this error.
So my question is how do I increase the maxppsize of RStudio on a mac, or more generally, how can I fix this stack overflow so I can run my model?
Thanks!
In the end, I found that macs don't have this same command line option since the mac version of RStudio uses all available memory by default.
So the way I fixed this is by decreasing the complexity of the task by reducing the sparsity. I cleaned the document-term matrix by removing all tokens that did not occur in at least 5% of the corpus. This was enough to take a matrix with 33k columns down to a much more manageable 3k columns while still leading to a highly representative DFM.
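As a rough illustration of that trimming step, the sketch below drops every term that occurs in fewer than 5% of documents. It uses a plain base R matrix with made-up counts rather than a real quanteda DFM, so the object names and data are assumptions; on an actual DFM, quanteda's dfm_trim() with a proportional minimum document frequency is probably the more direct route.

```r
# Toy document-term matrix: 40 documents (rows) by 6 terms (columns).
terms <- c("the", "data", "model", "rare1", "rare2", "tree")
dtm <- matrix(0L, nrow = 40, ncol = length(terms),
              dimnames = list(NULL, terms))
dtm[, "the"]      <- 1L  # occurs in every document
dtm[1:20, "data"] <- 2L  # occurs in 50% of documents
dtm[1:4, "model"] <- 1L  # occurs in 10% of documents
dtm[1, "rare1"]   <- 3L  # occurs in 2.5% of documents
dtm[2, "rare2"]   <- 1L  # occurs in 2.5% of documents
dtm[1:3, "tree"]  <- 1L  # occurs in 7.5% of documents

# Share of documents in which each term occurs at least once.
doc_share <- colMeans(dtm > 0)

# Keep only terms present in at least 5% of the documents.
keep <- doc_share >= 0.05
dtm_trimmed <- dtm[, keep, drop = FALSE]
```

The number of rows is unchanged; only the very sparse columns disappear, which is what shrinks the input handed to rpart.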

Handling huge simulations in R

I have written an R program that generates a random vector of length 1 million. I need to simulate it 1 million times. Out of the 1 million simulations, I will be using 50K observed vectors (chosen in some random manner) as samples. So the sample size is 50K by 1M. Is there a way to deal with this in R?
There are a few problems and some not-so-good solutions.
First, R cannot store such a huge matrix on my machine; it exceeds the available RAM. I looked into packages like bigmemory, ffbase, etc. that use hard disk space, but data this large can run to terabytes, and I have 200 GB of hard disk space available on my machine.
Even if storing it is possible, there is the problem of running time. The code may take more than 100 hours to run!
Can anyone please suggest a way out? Thanks
This answer really stands as something in between a comment and an answer. The easy way out of your dilemma is to not work with such massive data sets. You can most likely take a reasonably-sized representative subset of that data (say requiring no more than a few hundred MB) and train your model this way.
If you have to use the model in production on actual data sets with millions of observations, then the problem would no longer be related to R.
If possible use sparse matrix techniques
If possible try leveraging storage memory and chunking the object into parts
If possible try to use Big Data tools such as H2O
Leverage multicore and HPC computing with pbdR, parallel, etc
Consider using a spot instance of a Big Data / HPC cloud VPS instance on AWS, Azure, DigitalOcean, etc. Most offer distributions with R preinstalled and with a high RAM multicore instance you can "spin up" (start) and down (stop) quickly and cheaply
Use sampling and statistical solutions when possible
Consider doing some of your simulations or pre-simulation steps in a relational database, or something like Spark + Scala; some have R integration nowadays, actually
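The chunking and sampling suggestions above can be sketched as follows, at a much smaller scale. This assumes the choice of which simulations to keep does not depend on the simulated values, and that each vector can be reduced to a few summary numbers; simulate_one and summarise_one are hypothetical stand-ins for the real simulator and the real statistic of interest.

```r
n_sims  <- 1000L  # stands in for 1 million simulations
vec_len <- 500L   # stands in for vectors of length 1 million
n_keep  <- 50L    # stands in for the 50K sampled vectors

# Decide up front which simulation indices will be kept.
set.seed(123)
keep_idx <- sort(sample.int(n_sims, n_keep))

# Hypothetical simulator: seeded by the simulation index, so any single
# simulation can be regenerated on demand without running the others.
simulate_one <- function(i, len) {
  set.seed(i)
  rnorm(len)
}

# Hypothetical reduction of one vector to the few numbers actually needed.
summarise_one <- function(x) c(mean = mean(x), max = max(x))

# Only the kept simulations are generated, one at a time, and only their
# summaries are stored; no full vector outlives its own iteration.
results <- t(vapply(
  keep_idx,
  function(i) summarise_one(simulate_one(i, vec_len)),
  numeric(2)
))
rownames(results) <- keep_idx
```

If whole vectors really must be retained, the same loop could write each one to disk instead of summarising it, but the storage estimate from the question still applies in that case.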

R running out of memory for large data set

I am running my code on a PC and I don't think I have a problem with the RAM.
When I run this step:
dataset <- rbind(dataset_1, dataset_2,dataset_3,dataset_4,dataset_5)
I got the
Error: cannot allocate vector of size 261.0 Mb
dataset_1 through dataset_5 have around 5 million observations each.
Could anyone please advise how to solve this problem?
Thank you very much!
There are several packages available that may solve your problem under the High Performance Computing CRAN taskview. See "Large memory and out-of-memory data", the ff package, for example.
R, like MATLAB, loads all the data into memory, which means you can quickly run out of RAM (especially with big datasets). The only alternative I can see is to partition your data (i.e. load only part of the data), do the analysis on that part, and write the results to files before loading the next chunk.
In your case you might want to use Linux tools to merge the datasets.
Say you have two files dataset1.txt and dataset2.txt, you can merge them using the shell command join, cat or awk.
More generally, using Linux shell tools for parsing big datasets is usually much faster and requires much less memory.
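The partitioning approach described above can also be done from within R. The following is a minimal sketch using small temporary CSV files as stand-ins for dataset_1 to dataset_5; the file names, columns and running mean are assumptions for illustration. Each part is read, used, appended to one output file, and removed before the next is loaded, so the combined table never has to exist in memory.

```r
# Create five small CSV files standing in for dataset_1 ... dataset_5.
tmp_dir <- tempfile("chunks_")
dir.create(tmp_dir)
in_files <- file.path(tmp_dir, sprintf("dataset_%d.csv", 1:5))
for (i in seq_along(in_files)) {
  part <- data.frame(id = (i - 1) * 10 + 1:10, value = i)
  write.csv(part, in_files[i], row.names = FALSE)
}

out_file <- file.path(tmp_dir, "combined.csv")

running_sum <- 0
n_rows <- 0
for (i in seq_along(in_files)) {
  # Load only one part at a time.
  part <- read.csv(in_files[i])

  # Update whatever running statistics the analysis needs.
  running_sum <- running_sum + sum(part$value)
  n_rows <- n_rows + nrow(part)

  # Append the part to the combined file; write the header only once.
  write.table(part, out_file, sep = ",", row.names = FALSE,
              col.names = (i == 1), append = (i > 1))

  # Drop the part before reading the next one.
  rm(part)
}

overall_mean <- running_sum / n_rows
```

This only works for statistics that can be accumulated part by part (sums, counts, means); anything that needs all rows at once would call for the out-of-memory packages mentioned earlier.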

Resources