How to extract a portion of a large dataset via unzipping - r

I have a very large species dataset from GBIF: 178 GB zipped, roughly 800 GB as a TSV once unzipped. My Mac only has 512 GB of storage and 8 GB of RAM, but I do not actually need all of this data.
Are there any approaches that would let me unzip the file without eating all of my memory, and extract only a portion of the dataset by filtering rows on a column? For example, it has occurrence records going back to 1600, but I only need data for the last two years, which I believe my machine can more than handle. Perhaps there is a library with a function that can filter rows while loading the data?
I am unsure how to unzip this properly. I have looked at unzipping libraries, and according to this article unzip() truncates data over 4 GB. My other worry is where I could possibly store 800 GB of data once it is unzipped.
Update:
It seems that all the packages I have come across stop at 4 GB after decompression. I am wondering if it is possible to write a function that decompresses up to the 4 GB mark, records that point (or what has already been retrieved), and then resumes decompression from there, repeating until the whole .zip file has been decompressed. It could store the decompressed files in a folder so they could be accessed with something like list.files(). Any ideas whether this can be done?
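One way around both the 4 GB limit and the storage problem is to never write the decompressed TSV to disk at all: stream it out of the archive and keep only the rows you need. Below is a minimal sketch assuming the zip contains a single TSV named occurrence.txt with a year column (both are guesses about the GBIF download layout; check with unzip -l and the file header) and that the command-line unzip tool is available, as it is on macOS.

    # Stream the zipped TSV in chunks, keep only recent rows, and never
    # hold the full 800 GB in memory or write it to disk.
    # "occurrence.txt", "year", and the cutoff are assumptions -- adjust them.
    library(readr)
    library(dplyr)

    keep_recent <- DataFrameCallback$new(function(chunk, pos) {
      filter(chunk, year >= 2021)            # roughly the last two years
    })

    recent <- read_tsv_chunked(
      pipe("unzip -p gbif_download.zip occurrence.txt"),  # decompress on the fly
      callback   = keep_recent,
      chunk_size = 100000,                   # rows per chunk; tune to your 8 GB of RAM
      col_types  = cols(.default = col_character(), year = col_integer())
    )

    write_tsv(recent, "gbif_recent.tsv")     # only the filtered subset ever hits the disk

Because only one chunk plus the accumulated matches are ever in memory, neither the 4 GB truncation in unzip() nor the 800 GB disk requirement comes into play.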

Related

R not releasing memory after filtering and reducing data frame size

I need to read a huge dataset, trim it down to a tiny one, and then use it in my program. After trimming, the memory is not released (regardless of calls to gc() and rm()), and I am puzzled by this behavior.
I am on Linux, R 4.2.1. I read a huge .Rds file (>10 GB), both with the base function and the readr version, and memory usage shows 14.58 GB. I then reduce it to 800 rows and 24.7 MB, but memory usage stays the same for the rest of the session regardless of what I do. I have tried:
Piping readRDS directly into trimming functions and only storing the trimmed result;
First reading rds into a variable and then replacing it with the trimmed version;
Reading rds into a variable, storing the trimmed data in a new variable, and then removing the big dataset with rm() followed by garbage collection gc().
I understand what the workaround should be: a bash script that first creates a temporary file with the reduced dataset and then starts a separate R session to work with it. But it feels like this shouldn't be happening.
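For what it's worth, the workaround does not need bash: the trimming can happen in a throwaway R process so the analysis session never pays the 14 GB cost. A sketch, with hypothetical file names and trimming logic:

    # trim_data.R -- run once with: Rscript trim_data.R
    big   <- readRDS("huge_dataset.Rds")       # ~14 GB, but only in this short-lived process
    small <- big[big$year >= 2020, ]           # whatever trimming gets it down to ~800 rows
    saveRDS(small, "trimmed_dataset.Rds")      # ~25 MB on disk

    # analysis session: only the trimmed file is ever loaded
    small <- readRDS("trimmed_dataset.Rds")

When the Rscript process exits, the operating system reclaims all of its memory, which sidesteps the question of whether R's allocator ever hands freed pages back.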

merging huge data files in R (used for searching)

I am working in R 3.5 and need to create a huge database of around 200 million rows, then look up a file of around 15 million rows in that database to find the reference values (and then cbind the two files: input file + matched file).
For smaller database files (~10 million rows) I used the merge() function to merge the input file with the database file, but that is practically impossible now.
I tried the RSQLite package, and although it did work, I did not like it.
pros
reference data file is not loaded at first
it does not need any installation (other than the RSQLite package)
cons
it is very, very slow (even after creating indexes on the tables)
the database file is huge (around 10 GB)
binding the input file to the found items is not simple (the row numbers may differ)
I don't want to use SQL Server or MySQL, because they both need installation and configuration, which is not suitable for every system and server.
Any suggestions or similar experiences with big-data matching?
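One commonly used option at this scale (not something tried in the question) is data.table: fread() for reading and a keyed join in place of merge(), which avoids most of the intermediate copying. A sketch with hypothetical file and column names; it still assumes the 200-million-row reference table fits in RAM:

    library(data.table)

    db    <- fread("reference_200m_rows.csv")   # ~200 million rows
    input <- fread("input_15m_rows.csv")        # ~15 million rows

    setkey(db, ref_id)                          # index the lookup column once

    # one row per input row (assuming ref_id is unique in db), with the
    # reference columns appended and the input row order preserved,
    # so no separate cbind step is needed
    matched <- db[input, on = "ref_id"]

    fwrite(matched, "input_with_reference.csv")

If the reference table does not fit in memory, a disk-backed route such as the arrow package, or staying with SQLite but batching the lookups, may be the practical compromise.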

High-scale signal processing in R

I have high-dimensional brain-signal data that I would like to explore using R.
As a data scientist I do not really work with Matlab, but with R and Python. Unfortunately, the team I am working with uses Matlab to record the signals. Therefore, I have several questions for those of you who are interested in data science.
The Matlab files of recorded data are single objects with the following dimensions:
1000*32*6000
1000: denotes the sampling rate of the signal.
32: denotes the number of channels.
6000: denotes the time in seconds, so that is 1 hour and 40 minutes long.
The questions/challenges I am facing:
I converted the "mat" files I have into CSV files, so I can use them in R.
However, CSV files are 2 dimensional files with the dimensions: 1000*192000.
the CSV files are rather large, about 1.3 gigabytes. Is there a
better way to convert "mat" files into something compatible with R,
and smaller in size? I have tried "R.matlab" with readMat, but it is
not compatible with the 7th version of Matlab; so I tried to save as V6 version, but it says "Error: cannot allocate vector of size 5.7 Gb"
the time it takes to read the CSV file is rather long! It takes
about 9 minutes to load the data. That is using "fread" since the
base R function read.csv takes forever. Is there a better way to
read files faster?
Once I read the data into R, it is 1000*192000; while it is actually
1000*32*6000. Is there a way to have multidimensional object in R,
where accessing signals and time frames at a given time becomes
easier. like dataset[1007,2], which would be the time frame of the
1007 second and channel 2. The reason I want to access it this way
is to compare time frames easily and plot them against each other.
Any answer to any question would be appreciated.
This is a good reference for reading large CSV files: https://rpubs.com/msundar/large_data_analysis. A key takeaway is to specify the data type of each column you are reading rather than letting the read function guess it from the content.
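A minimal sketch of that takeaway plus the reshaping question above, with an assumed column layout: fread() gets an explicit colClasses so it never has to guess types, and array() restores the 1000 x 32 x 6000 structure. This assumes the CSV's 192000 columns are ordered channels-within-seconds (channels 1-32 for second 1, then channels 1-32 for second 2, and so on); verify the Matlab export order before trusting the reshaped indices.

    library(data.table)

    flat <- fread("signals.csv",
                  header = FALSE,
                  colClasses = list(numeric = 1:192000))  # declare every column numeric up front

    signals <- array(as.matrix(flat),
                     dim = c(1000, 32, 6000))             # sample x channel x second

    # the dataset[1007, 2]-style access asked about above:
    # all 1000 samples of channel 2 during the 1007th second
    frame <- signals[, 2, 1007]

For a smaller on-disk format than CSV, saving the numeric array with saveRDS() (compressed binary) is usually both smaller and much faster to reload.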

R using waaay more memory than expected

I have an Rscript being called from a Java program. The purpose of the script is to automatically generate a bunch of graphs in ggplot and then splat them onto a PDF. It has grown somewhat large, with maybe 30 graphs, each of which is created from its own script.
The input is a tab-delimited file of 5-20 MB, but the R session sometimes climbs to 12 GB of RAM usage (on a Mac running 10.6.8, by the way, though this will be run on all platforms).
I have read about how to look at the memory size of objects, and nothing is ever over 25 MB; even if R deep-copied everything for every function and every filter step, it shouldn't get anywhere near that level.
I have also tried gc() to no avail. If I do gcinfo(TRUE) and then gc(), it tells me it is using something like 38 MB of RAM, but Activity Monitor goes up to 12 GB and things slow down, presumably due to paging to the hard disk.
I tried calling it via a bash script in which I set ulimit -v 800000, but no good.
What else can I do?
In the process of making assignments R will always make temporary copies, sometimes more than one or even two, and each temporary assignment requires contiguous memory for the full size of the allocated object. So the usual advice is to plan on having at least three times that amount of contiguous memory available. This means you also need to be aware of how many other non-R programs are competing for system resources, as well as of how your memory is being used by R. You could try restarting your computer, running only R, and seeing whether that succeeds.
An input file of 20 MB might expand quite a bit (8 bytes per double, and perhaps more per character element in your vectors), depending on the structure of the file. The PDF object will also take quite a bit of space if you are plotting every point of a large file.
My experience is not the same as the other commenters'. I do issue gc() before memory-intensive operations. You should post code and describe what you mean by "no good": are you getting errors, observing the use of virtual memory, or something else?
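A quick way to see the copying being described is base R's tracemem(), which prints a message each time an object is duplicated:

    x <- rnorm(1e6)
    tracemem(x)     # report whenever this vector gets copied
    y <- x          # no copy yet: R is copy-on-modify
    y[1] <- 0       # the full vector is duplicated here, and tracemem reports it

Watching where those messages appear inside the plotting functions is a reasonable way to find which steps are responsible for the blow-up.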
I apologize for not posting a more comprehensive description with code; it was fairly long, as was the input. But the responses I got here were still quite helpful. Here is how I mostly fixed my problem.
I had a variable number of columns which, with some outliers, got very numerous. I didn't need the extreme outliers, so I simply excluded them and cut off those extra columns. This alone decreased memory usage greatly; I hadn't looked at virtual memory before, but sometimes it was as high as 200 GB, and this change brought it down to at most 2 GB.
Each graph was created in its own function, so I rearranged the code so that each graph is generated, printed to the PDF, and then removed with rm(graphname).
Further, I had many loops in which I was creating new columns in data frames. Instead, I created vectors not attached to any data frame for these calculations, which also had the benefit of greatly simplifying some of the code.
After I stopped adding columns to the existing data frames and used standalone vectors instead, memory use dropped to about 400 MB. While this is still more than I would expect it to use, it is well within my restrictions; my users are all in my company, so I have some control over which computers it runs on.
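For anyone with the same problem, here is a sketch of the generate, print, then rm() pattern described above, with a made-up graph function and column name standing in for the real thirty:

    library(ggplot2)

    # stand-in for the ~30 real graph functions; "value" is a hypothetical column
    plot_density <- function(d) ggplot(d, aes(x = value)) + geom_density()

    dat <- read.delim("input.tsv")

    pdf("report.pdf", width = 8, height = 6)
    for (make_graph in list(plot_density)) {   # the real script would list all 30
      g <- make_graph(dat)   # build one ggplot object
      print(g)               # render it onto the current PDF page
      rm(g)                  # drop it before building the next graph
      gc()                   # reclaim the memory between graphs
    }
    dev.off()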

What is the ideal format to store large results generated by R?

I simulate reasonably sized datasets (10-20 MB) across a large number of parameter combinations (20-40k). Each dataset x parameter set is pushed through mclapply, and the result is a list in which each item contains the output data (list element 1) and the parameters used to generate that result (list element 2, where each element of that sub-list is one parameter).
I just ran through an 81k list (though I had to run it in 30k chunks) and the resulting lists are around 700 MB each. I've stored them as .rdata files but will probably resave them as .Rda. Each file takes forever to read back into R. Is there a best practice here, especially for long-term storage?
Ideally I would keep everything in one list, but mclapply throws an error about not being able to serialize vectors, and a job that large would take forever on the cluster (split three ways, it took 3 hours per job). On the other hand, having several results files (results1a.rdata, results2b.rdata, results3c.rdata) also seems inefficient.
It sounds like you have a couple of different questions there -- I'd recommend asking about optimizing your list format in a separate question.
Regarding reading and writing R data to disk, however, I'm not sure there is a better way than Rda files in terms of efficiency. That said, the level of compression can have a real effect on how long these files take to read and write, depending on the computational setup; I've typically found the best performance with no compression (save(x, file = "y.Rda", compress = FALSE)).
As a backup plan, you can try leaving the compression on, but varying the level of compression, as well.
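A quick sketch of both suggestions with a placeholder results list; compress and compression_level are standard arguments of base save(), and timing load() on one of the real 700 MB files is the simplest way to see which trade-off wins on a given machine:

    # placeholder standing in for one chunk of mclapply output
    results <- replicate(1000,
                         list(output = rnorm(1e4), params = list(a = 1, b = 2)),
                         simplify = FALSE)

    # fastest to read and write, largest on disk
    save(results, file = "results_uncompressed.Rda", compress = FALSE)

    # keep compression but lower the level (1 = fastest, 9 = smallest)
    save(results, file = "results_gzip1.Rda",
         compress = "gzip", compression_level = 1)

    system.time(load("results_uncompressed.Rda"))
    system.time(load("results_gzip1.Rda"))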
