Using R with tidyquant and massiv data - r

While working with R I encountered a strange problem:
I am processing date in the follwing manner:
Reading data from a database into a dataframe, filling missing values, grouping and nesting the data to a combined primary key, creating a timeseries and forecastting it for every group, ungroup and clean the data, write it back into the DB.
Somehting like this:
https://cran.rstudio.com/web/packages/sweep/vignettes/SW01_Forecasting_Time_Series_Groups.html
For small data sets this works like a charm, but with lager ones (over about 100000 entries) I do get the "R Session Aborted" screen from R-Studio and the nativ R GUI just stops execution and implodes.
There is no information in every log file that I've look into. I suspect that it is some kind of (leaking) memory issue.
As a work around I'm processing the data in chunks with a for-loop. But no matter how small the chunk size is, I do get the "R Session Aborted" screen, which looks a lot like leaking memory.
The whole date consist of about 5 million rows.
I've looked a lot into packages like ff, the big-Family and matter basically everything from https://cran.r-project.org/web/views/HighPerformanceComputing.html
but this dose not seem to work well with tibbles and the tidyverse way of data processing.
So, how can I improve my scrip to work with massiv amounts of data?
How can I gather clues about why the R Session is Aborted?

Check out the article at:
datascience.la/dplyr-and-a-very-basic-benchmark
There is a table that shows runtime comparisons for some of the data wrangling tasks you are performing. From the table, it looks as though dplyr with data.table behind it is likely going to do much better than dplyr with a dataframe behind it.
There’s a link to the benchmarking code used to make the table, too.
In short, try adding a key, and try using data.table over dataframe.
To make x your key, and say your data.table is named dt, use setkey(dt,x).

While Pakes answer deals with the described problem I found a solution to the underlying problem. For Compatibility reason I used R in the 3.4.3 version. Now I'm using the newer 3.5.1 version which works quite fine.

Related

R running very slowly after loading large datasets > 8GB

I have been unable to work in R given how slow it is operating once my datasets are loaded. These datasets total around 8GB. I am running on a 8GB RAM and have adjusted memory.limit to exceed my RAM but nothing seems to be working. Also, I have used fread from the data.table package to read these files; simply because read.table would not run.
After seeing a similar post on the forum addressing the same issue, I have attempted to run gctorture(), but to no avail.
R is running so slowly that I cannot even check the length of the list of datasets I have uploaded, cannot View or do any basic operation once these datasets are uploaded.
I have tried uploading the datasets in 'pieces', so 1/3 of the total files over 3 times, which seemed to make things run more smoothly for the importing part, but has not changed anything with regards to how slow R runs after this.
Is there any way to get around this issue? Any help would be much appreciated.
Thank you all for your time.
The problem arises because R loads the full dataset into the RAM which mostly brings the system to a halt when you try to View your data.
If it's a really huge dataset, first make sure the data contains only the most important columns and rows. Valid columns can be identified through the domain and world knowledge you have about the problem. You can also try to eliminate rows with missing values.
Once this is done, depending on your size of the data, you can try different approaches. One is through the use of packages like bigmemory and ff. bigmemory for example, creates a pointer object using which you can read the data from disk without loading it to the memory.
Another approach is through parallelism (implicit or explicit). MapReduce is another package which is very useful for handling big datasets.
For more information on these, check out this blog post on rpubs and this old but gold post from SO.

Why is collect in SparkR so slow?

I have a 500K row spark DataFrame that lives in a parquet file. I'm using spark 2.0.0 and the SparkR package inside Spark (RStudio and R 3.3.1), all running on a local machine with 4 cores and 8gb of RAM.
To facilitate construction of a dataset I can work on in R, I use the collect() method to bring the spark DataFrame into R. Doing so takes about 3 minutes, which is far longer than it'd take to read an equivalently sized CSV file using the data.table package.
Admittedly, the parquet file is compressed and the time needed for decompression could be part of the issue, but I've found other comments on the internet about the collect method being particularly slow, and little in the way of explanation.
I've tried the same operation in sparklyr, and it's much faster. Unfortunately, sparklyr doesn't have the ability to do date path inside joins and filters as easily as SparkR, and so I'm stuck using SparkR. In addition, I don't believe I can use both packages at the same time (i.e. run queries using SparkR calls, and then access those spark objects using sparklyr).
Does anyone have a similar experience, an explanation for the relative slowness of SparkR's collect() method, and/or any solutions?
#Will
I don't know whether the following comment actually answers your question or not but Spark does lazy operations. All the transformations done in Spark (or SparkR) doesn't really create any data they just create a logical plan to follow.
When you run Actions like collect, it has to fetch data directly from source RDDs (assuming you haven't cached or persisted data).
If your data is not large enough and can be handled by local R easily then there is no need for going with SparkR. Other solution can be to cache your data for frequent uses.
Short: Serialization/deserialization is very slow.
See for example post on my blog http://dsnotes.com/articles/r-read-hdfs
However it should be equally slow in both sparkR and sparklyr.

Handling large objects in R

I'm trying to do a project on object recognition using the CIFAR image dataset using R, however there is one problem with handling the dataset, when I read the data into some data frame or matrix, the object size for the objects gets large around 98MB, and this causes my program to run very slow, my code is vectorized but still I face performance issues. Please can someone suggest how can I resolve this problem.
Thanks.
It looks like the binary form of those data would fit nicely into a flat-file type format, like that handled by bigmemory, ff, or rhdf5 packages.

How to use zoo or xts with large data?

How can I use the R packages zoo or xts with very large data sets? (100GB)
I know there are some packages such as bigrf, ff, bigmemory that can deal with this problem but you have to use their limited set of commands, they don't have the functions of zoo or xts and I don't know how to make zoo or xts to use them.
How can I use it?
I've seen that there are also some other things, related with databases, such as sqldf and hadoopstreaming, RHadoop, or some other used by Revolution R. What do you advise?, any other?
I just want to aggreagate series, cleanse, and perform some cointegrations and plots.
I wouldn't like to need to code and implement new functions for every command I need, using small pieces of data every time.
Added: I'm on Windows
I have had a similar problem (albeit I was only playing with 9-10 GBs). My experience is that there is no way R can handle so much data on its own, especially since your dataset appears to contain time series data.
If your dataset contains a lot of zeros, you may be able to handle it using sparse matrices - see Matrix package ( http://cran.r-project.org/web/packages/Matrix/index.html ); this manual may also come handy ( http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/ )
I used PostgreSQL - the relevant R package is RPostgreSQL ( http://cran.r-project.org/web/packages/RPostgreSQL/index.html ). It allows you to query your PostgreSQL database; it uses SQL syntax. Data is downloaded into R as a dataframe. It may be slow (depending on the complexity of your query), but it is robust and can be handy for data aggregation.
Drawback: you would need to upload data into the database first. Your raw data needs to be clean and saved in some readable format (txt/csv). This is likely to be the biggest issue if your data is not already in a sensible format. Yet uploading "well-behaved" data into the DB is easy ( see http://www.postgresql.org/docs/8.2/static/sql-copy.html and How to import CSV file data into a PostgreSQL table? )
I would recommend using PostgreSQL or any other relational database for your task. I did not try Hadoop, but using CouchDB nearly drove me round the bend. Stick with good old SQL

XTS size limitation

I have been working on large datasets lately (more than 400 thousands lines). So far, I have been using XTS format, which worked fine for "small" datasets of a few tenth of thousands elements.
Now that the project grows, R simply crashes when retrieving the data for the database and putting it into the XTS.
It is my understanding that R should be able to have vectors with size up to 2^32-1 elements (or 2^64-1 according the the version). Hence, I came to the conclusion that XTS might have some limitations but I could not find the answer in the doc. (maybe I was a bit overconfident about my understanding of theoretical possible vector size).
To sum up, I would like to know if:
XTS has indeed a size limitation
What do you think is the smartest way to handle large time series? (I was thinking about splitting the analysis into several smaller datasets).
I don't get an error message, R simply shuts down automatically. Is this a known behavior?
SOLUTION
The same as R and it depends on the kind of memory being used (64bits, 32 bits). It is anyway extremely large.
Chuncking data is indeed a good idea, but it is not needed.
This problem came from a bug in R 2.11.0 which has been solved in R 2.11.1. There was a problem with long dates vector (here the indexes of the XTS).
Regarding your two questions, my $0.02:
Yes, there is a limit of 2^32-1 elements for R vectors. This comes from the indexing logic, and that reportedly sits 'deep down' enough in R that it is unlikely to be replaced soon (as it would affect so much existing code). Google the r-devel list for details; this has come up before. The xts package does not impose an additional restriction.
Yes, splitting things into chunks that are manageable is the smartest approach. I used to do that on large data sets when I was working exclusively with 32-bit versions of R. I now use 64-bit R and no longer have this issue (and/or keep my data sets sane),
There are some 'out-of-memory' approaches, but I'd first try to rethink the problem and affirm that you really need all 400k rows at once.

Resources