Collecting a table using sparklyr in Databricks - r

I have a parquet table with approximately 5 billion rows. After all manipulations using sparklyr it is reduced to 1,880,573 rows and 629 columns. When I try to collect this for factor analysis using sdf_collect(), I get this memory error:
Error : org.apache.spark.sql.execution.OutOfMemorySparkException: Total memory usage during row decode exceeds spark.driver.maxResultSize (4.0 GB). The average row size was 5.0 KB
Is 1,880,573 rows x 629 columns too big for sparklyr to collect? Further, checking the number of rows using data %>% dplyr::count() took 9 minutes - how do I reduce this time?

Yes, 1,880,573 rows by 629 columns is too big. This isn't just a sparklyr problem: your local R instance will also have a lot of trouble holding that much data in memory.
As for count(), 9 minutes isn't THAT long when you're working with data of this size. One thing you could try is to reduce the count to only one variable: data %>% select(one_var) %>% count(). That said, I don't believe there is a way to dramatically speed this up other than increasing your Spark session resources (e.g. the number of executors).
I'd suggest doing the factor analysis in Spark if you can, or using a smaller sample of your data.
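If a sample is enough for the factor analysis, one way to get it locally is to sample on the cluster before collecting. A minimal sketch, assuming data is the reduced Spark dataframe and the 10% fraction is purely illustrative:

library(sparklyr)
library(dplyr)

# draw a reproducible sample on the cluster, then bring only that subset to the driver
data_sample <- data %>%
  sdf_sample(fraction = 0.1, replacement = FALSE, seed = 42) %>%
  collect()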

Export the spark dataframe to disk using sparklyr::spark_write_* and then read it into your R session.
Parquet is a good choice due to its fast and compact read/write capability.
Repartitioning the spark dataframe with sparklyr::sdf_repartition() into one partition before the write operation results in a single file, which is easier to read into R than multiple files followed by a row-binding step.
It's advisable not to collect a 'large' dataframe (what counts as large depends on your spark configuration and RAM) using the collect function, as it brings all the data to the driver node.
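A minimal sketch of this write-then-read workflow on Databricks, assuming data is the reduced Spark dataframe and the DBFS paths are illustrative:

library(sparklyr)
library(dplyr)
library(arrow)

# write the reduced table as a single parquet file
data %>%
  sdf_repartition(partitions = 1) %>%
  spark_write_parquet(path = "dbfs:/tmp/reduced_data", mode = "overwrite")

# read it back into the local R session (the /dbfs mount exposes DBFS as a local path)
part_file <- list.files("/dbfs/tmp/reduced_data", pattern = "\\.parquet$", full.names = TRUE)
local_df <- read_parquet(part_file[1])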

Related

R not releasing memory after filtering and reducing data frame size

I need to read a huge dataset, trim it down to a tiny one, and then use it in my program. After trimming, the memory is not released (regardless of calls to gc() and rm()). I am puzzled by this behavior.
I am on Linux, R 4.2.1. I read a huge .Rds file (>10 Gb) (both with the base function and the readr version). Memory usage shows 14.58 Gb. I do operations and decrease its size to 800 rows and 24.7 Mb. But memory usage stays the same within this session regardless of what I do. I tried:
Piping readRDS directly into trimming functions and only storing the trimmed result;
First reading rds into a variable and then replacing it with the trimmed version;
Reading rds into a variable, storing the trimmed data in a new variable, and then removing the big dataset with rm() followed by garbage collection gc().
I understand what the workaround should be: a bash script that first creates a temporary file with this reduced dataset and then runs a separate R session to work with that dataset. But it feels like this shouldn't be happening?
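For what it's worth, a minimal sketch of that workaround without leaving R, using the callr package to do the read-and-trim in a throwaway subprocess so the parent session never holds the full object; the file path and trimming step are hypothetical:

library(callr)

# run the expensive read + trim in a child R process; only the small result
# is returned to this session, and the child's memory is freed when it exits
trimmed <- callr::r(function(path) {
  big <- readRDS(path)
  big[big$keep_flag == TRUE, ]   # hypothetical trimming step
}, args = list(path = "huge_file.Rds"))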

Processing very large files in R

I have a dataset that is 188 million rows with 41 columns. It comes as a massive compressed fixed width file and I am currently reading it into R using the vroom package like this:
library(vroom)
d <- vroom_fwf('data.dat.gz',
               fwf_positions([41 column positions],
                             [41 column names]))
vroom does a wonderful job here in the sense that the data are actually read into an R session on a machine with 64Gb of memory. When I run object.size on d, it is a whopping 61Gb in size. But when I turn around to do anything with this data, I can't. All I get back is Error: cannot allocate vector of size {x} Gb, because there really isn't any memory left to do much of anything with that data. I have tried base R with [, dplyr::filter, and converting to a data.table via data.table::setDT, each with the same result.
So my question is: what are people's strategies for this type of thing? My main goal is to convert the compressed fixed width file to parquet format, but I would like to split it into smaller, more manageable files based on the values in one of the columns, and then write them to parquet (using arrow::write_parquet).
My idea at this point is to read a subset of columns, keeping the column that I want to subset by, write the parquet files, and then bind the columns/merge the two back together. This seems like a more error-prone solution, though, so I thought I would turn here and see what else is available.
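For what it's worth, a hedged sketch of that column-subset idea: vroom_fwf() can read only the columns needed per pass via col_select, and arrow::write_dataset() can write the result as parquet partitioned by the split column. Here positions stands in for the fwf_positions() spec above, and the column names are hypothetical:

library(vroom)
library(arrow)

# read only the split column plus one manageable batch of value columns
d_part <- vroom_fwf('data.dat.gz', positions,
                    col_select = c(split_col, starts_with("var")))

# write parquet files into one directory per value of split_col
write_dataset(d_part, path = "parquet_out", format = "parquet",
              partitioning = "split_col")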

Transferring a data frame from R Studio Cloud to GoogleSheets seems prohibitively slow

I have a data frame in RStudio Cloud that is 85 rows by 207 columns (ultimately it may be as wide as 366 columns). The actual data is created using the code in this thread:
Function to generate multiple rows of vectors in R
Now that I have the data frame set up, I'm trying to use the googlesheets library to transfer it to an existing sheet (and will periodically update it as I get more data). I'm using this code to edit the google sheet:
gs_edit_cells(sheet, ws = "Test", input = data, anchor = "A2")
I let it think for a long time (more than an hour) and it sat there with "Range affected by the update" and never went through. If I slice it into several segments (say 30, 60, or 90 columns at a time), it does transfer, but it is prohibitively slow. This isn't a huge dataset; is there something I could improve in my code? I have loaded the sheet in advance and it does recognize it.
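For what it's worth, the newer googlesheets4 package writes a whole data frame in a single API call, which tends to be far faster than cell-range edits; a minimal sketch, with the sheet URL as a placeholder:

library(googlesheets4)

# write the entire data frame to the worksheet "Test" in one call
# (this overwrites that worksheet; range_write() can target a specific anchor cell instead)
sheet_write(data, ss = "https://docs.google.com/spreadsheets/d/<sheet-id>", sheet = "Test")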

correlation matrix using large data sets in R when ff matrix memory allocation is not enough

I have a simple analysis to do: I just need to calculate the correlation of the columns (or rows, if transposed). Simple enough? Yet I have been unable to get the results for a whole week, and I have looked through most of the solutions here.
My laptop has 4GB of RAM. I do have access to a server with 32 nodes. My data cannot be posted here as it is huge (411k columns and 100 rows). If you need any other information, or maybe part of the data, I can try to put it up here, but the problem can be explained without really having to see the data. I simply need a correlation matrix of size 411k x 411k, which means I need to compute the correlation among the rows of my data.
Approaches I have tried to code (all of them either give me memory issues or run forever):
The simplest way: one row against all the others, writing the result out with append = TRUE. (Runs forever.)
bigcorPar.R by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using an ff matrix. (Unable to allocate memory for the corMAT matrix using ff() on my server.)
Splitting the data into sets (every 10,000 continuous rows is one set) and correlating each set against the others (same logic as bigcorPar), but I am unable to find a way to store them all together to finally generate the 411k x 411k matrix.
I am attempting this now: bigcorPar.R on 10,000 rows against 411k (so the 10,000 are divided into blocks), saving the results in separate csv files.
I am also attempting to run every 1,000 vs 411k on one node of my server; today is my 3rd day and I am still on row 71.
I am not an R pro, so I could attempt only this much. Either my code runs forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in binary format (economical compared to text) it would take 411,000 x 411,000 x 8 bytes = 1.3 TB. If this is what you want and you are OK with the storage required for that, I can provide my code for such calculations and storage.
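For reference, a minimal base-R sketch of the block-wise idea described in the question: compute correlations between blocks of 10,000 columns at a time and save each block to disk, so the full 411k x 411k matrix is never held in memory (x stands for the 100 x 411k data matrix and is not built here):

block_size <- 10000
blocks <- split(seq_len(ncol(x)), ceiling(seq_len(ncol(x)) / block_size))

for (i in seq_along(blocks)) {
  for (j in i:length(blocks)) {
    # correlation between column block i and column block j (at most 10000 x 10000)
    cm <- cor(x[, blocks[[i]]], x[, blocks[[j]]])
    saveRDS(cm, sprintf("cor_block_%03d_%03d.rds", i, j))
  }
}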

Sample A CSV File Too Large To Load Into R?

I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars to Revolution R so that you can load/analyze your data in one go, then sooner or later you need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data is in a consistent format, with each row being one record, you can do:
sort -R data | head -n 1000 >data.sample
This randomly sorts all the rows and puts the first 1000 into a separate file, data.sample.
(2) Database (if the data does not fit into memory):
Another option is to store the data in a database. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can draw a sample with:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also verify the mean or standard deviation of the whole dataset against your sample, taking advantage of the database's computing power, if you like.
These are, in my experience, the two most commonly used ways of dealing with 'big' data.
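A minimal sketch of pulling that sample query into R via DBI and RMySQL, with connection details as placeholders:

library(DBI)

# connect to the MySQL database holding the table (credentials are hypothetical)
con <- dbConnect(RMySQL::MySQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "password")

# let the database do the random sampling and return only 1000 rows to R
sample_df <- dbGetQuery(con, "SELECT * FROM tablename ORDER BY RAND() LIMIT 1000")

dbDisconnect(con)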
