Load a DataFrame too large for system memory from a JLD2 file - julia

I have a file "myfile.jld2" which contains a DataFrame, say mydata. Is it possible to load mydata in some way although it doesn't fit into system RAM?
I only want to load it to split it up into smaller pieces and dump those to disk.
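A minimal sketch of the split-and-dump step, assuming the DataFrame can be materialized at least once (for example on a machine with more memory, or with swap enabled); as far as I know, JLD2 does not stream a serialized DataFrame row by row. The function name split_jld2, the chunk size, and the chunk_<i>.jld2 output names are illustrative, not part of the question:

```julia
# Sketch: split a DataFrame stored in a JLD2 file into smaller JLD2 files.
# Assumes the full DataFrame can be materialized once; the chunk size and
# output file names below are placeholders.
using JLD2, DataFrames

function split_jld2(path::AbstractString, key::AbstractString; chunksize::Int = 500_000)
    # Reading the dataset materializes the whole DataFrame in memory.
    df = jldopen(path, "r") do f
        f[key]
    end
    # Write consecutive row ranges to separate, independently loadable files.
    for (i, rows) in enumerate(Iterators.partition(1:nrow(df), chunksize))
        jldsave("chunk_$(i).jld2"; mydata = df[rows, :])
    end
end

split_jld2("myfile.jld2", "mydata")
```

Each chunk file can then be opened and read on its own, so only one piece needs to sit in memory at a time.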

Related

R: generate a data.table partially, save it to file, and partially load it

Is it possible to partially generate a data.table bigger than RAM and partially save it to the hard drive? Then, in code, subsequently load parts of the file into RAM and do calculations on those parts.
Thanks

Fastest way to read a data.frame into a shiny app on load?

For a shiny app in a repository containing a single static data file, what is the optimal format for that flat file (and the corresponding function to read it) that minimises the time to read that flat file into a data.frame?
For example, suppose that when a shiny app starts it reads an .RDS file, but that takes ~30 seconds and we wish to decrease it. Are there any methods of saving the file, and functions for reading it, that can save time?
Here's what I know already:
I have been reading some speed-comparison articles, but none seem to comprehensively benchmark all methods in the context of a shiny app (and the possible cores/threading implications). Some offer sound advice, like trying to load less data.
I notice languages like Julia can sometimes be faster, but I'm not sure if reading a file using another language would help, since it would have to be converted to an object R recognises, and presumably that process would take longer than simply reading it as an R object in the first place.
I have noticed that identical files seem to be smaller when saved as .RDS compared to .csv; however, I'm not sure whether file size necessarily has an effect on read time.

How to read large data files in R allocating memory dynamically?

Is there a way to read large data files in R, allocating memory dynamically in the same way SAS does?
I'm referring to the import of a SAS data set. While SAS has no problem reading large files because it allocates memory dynamically, R is unusable from a certain size upward because it allocates the whole file in RAM.
I had a look at the ff package, as it keeps a pointer to the file on disk, but it doesn't have a read method for SAS data sets.
So, is there a way to read a file while allocating memory dynamically, in particular when importing a SAS data set?

Read only rows meeting a condition from a compressed RData file in a Shiny App?

I am trying to make a shiny app that can be hosted for free on shinyapps.io. Free hosting requires that all uploaded data/code be <1GB, and that the app use <1GB of memory at any time while running.
The data
The underlying data (that I'm uploading) is 1000 iterations of a network with ~3050 nodes. Each interaction between nodes (~415,000 interactions per network) has 9 characteristics (of the origin, the destination, and the interaction itself) that I need to keep track of. The app needs to read in data from all 1000 networks for user-selected node(s) meeting user-input criteria (those 9 characteristics) and summarize it (in a map & table). I can use 1000 one-per-network RData files (more on the format below) and the app works, but it takes ~10 minutes to load, and I'd like to speed that up.
A couple notes about what I've done/tried, but I'm not tied to any of this if you have better ideas.
The data is too large to store as CSVs (and stay under the 1GB upload limit), so I've been saving it as RData files of a data.frame with "xz" compression.
To further reduce size, I've turned the data into frequency tables of the 9 variables of interest.
In a desktop version, I created 10 summary files that each contained the data for 100 networks (~5 minutes to load), but these are too large to be read into memory in a free shiny app.
I tried making RData files for each node (instead of splitting by network), but they're too large for the 1GB upload limit.
I'm not sure there are better ways to package the data (but again, happy to hear ideas!), so I'm looking to optimize processing it.
Finally, a question
Is there a way to read only certain rows from a compressed RData file, based on some value (e.g. nodeID)? This post (quickly load a subset of rows from data.frame saved with `saveRDS()`) makes me think that might not be possible because the file is compressed. In looking at other options, awk keeps coming up, but I'm not sure if that would work with an RData file (I only seem to see data.frame/data.table/CSV implementations).

Read a reproducible sample of data from multiple CSVs

I'm working with several large CSV files, large enough that I can't efficiently load them into memory.
Instead, I would like to read a sample of data from each file. There have been other posts about this topic (such as Load a small random sample from a large csv file into R data frame), but my requirements are a little different, as I would like to read in the same rows from each file.
Using read.csv() with skip and nrows=1 would be very slow and tedious.
Does anyone have a suggestion for how to efficiently load the same N rows from several CSVs without reading them all into memory?

Resources