Is it possible to generate a data.table bigger than RAM in parts and save those parts to the hard drive? Then, in code, subsequently load parts of the file into RAM and do calculations on those parts.
Thanks
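For what it's worth, one pattern that gets part of the way there (a sketch only; the file name, chunk sizes, and chunk generator below are made up) is to write the table to disk in chunks with data.table::fwrite(append = TRUE) and later pull back only the rows you need with fread(skip =, nrows =):

library(data.table)

# Write a big table to disk piece by piece instead of holding it all in RAM
for (i in 1:100) {
  chunk <- data.table(id = ((i - 1) * 1e6 + 1):(i * 1e6), x = rnorm(1e6))
  fwrite(chunk, "big_table.csv", append = (i > 1), col.names = (i == 1))
}

# Later: load one slice back into RAM and compute on it
slice <- fread("big_table.csv", skip = 50e6 + 1, nrows = 1e6,
               header = FALSE, col.names = c("id", "x"))
slice[, mean(x)]

Packages such as disk.frame and arrow take this idea further and manage the chunking/partitioning for you, so they may be worth a look as well.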
Related
For a shiny app in a repository containing a single static data file, what is the optimal format for that flat file (and the corresponding function to read it) that minimises the time to read the file into a data.frame?
For example, suppose a shiny app reads an .RDS file when it starts, but that read takes ~30 seconds and we wish to decrease it. Are there any ways of saving the file, and corresponding read functions, that save time?
Here's what I know already:
I have been reading some speed-comparison articles, but none seem to comprehensively benchmark all methods in the context of a shiny app (with possible cores/threading implications). Some offer sound advice, like trying to load less data.
I notice languages like Julia can sometimes be faster, but I'm not sure reading a file with another language would help, since the result would have to be converted to an object R recognises, and presumably that conversion would take longer than simply reading it as an R object in the first place.
I have noticed that identical data seem to produce smaller files when saved as .RDS than as .csv; however, I'm not sure whether file size necessarily has an effect on read time.
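In case it helps, one way to settle this for a particular file is to benchmark a handful of formats directly on your own data. The sketch below assumes the data.table, fst, qs and arrow packages and uses placeholder file names (df stands for the data.frame the app needs), so treat it as a starting point rather than a recommendation:

library(microbenchmark)

# Write the same data.frame out once in each candidate format
saveRDS(df, "data.rds")
data.table::fwrite(df, "data.csv")
fst::write_fst(df, "data.fst")
qs::qsave(df, "data.qs")
arrow::write_parquet(df, "data.parquet")

# Time the reads; whichever wins on your data and server is the one to ship with the app
microbenchmark(
  rds     = readRDS("data.rds"),
  csv     = data.table::fread("data.csv"),
  fst     = fst::read_fst("data.fst"),
  qs      = qs::qread("data.qs"),
  parquet = arrow::read_parquet("data.parquet"),
  times = 5
)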
So, I saved a data frame in both CSV and RDS formats, but the RDS one weighs significantly more than the CSV alternative (40 GB vs. 10 GB). According to this blog:
[RDs format] creates a serialized version of the dataset and then saves it with gzip compression
So, if RDS data is compressed while the CSV is uncompressed, why is the RDS version so much heavier? I would understand the difference if the dataset were small, but it is 140,000 by 42,000, so there shouldn't be an issue with asymptotics kicking in.
What command did you use to save the file as RDS? If you used write_rds() then the RDS file is not compressed by default.
write_rds() does not compress by default as space is generally cheaper than time.
(https://readr.tidyverse.org/reference/read_rds.html)
From this article (https://waterdata.usgs.gov/blog/formats/) it seems that uncompressed RDS files are about 20 times bigger, so this could explain the difference in size that you see.
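A minimal illustration of the difference, assuming readr is installed (object and file names are placeholders):

# Base R compresses with gzip by default
saveRDS(df, "base.rds")                          # compress = TRUE is the default

# readr writes an uncompressed RDS unless you ask otherwise
readr::write_rds(df, "readr_uncompressed.rds")   # compress = "none" is the default
readr::write_rds(df, "readr_compressed.rds", compress = "gz")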
So, I believe this is an issue related to integer overflow in R when computing the indices of the new data frame. Although I could not find any reference in the documentation to overflow as a possible cause of such errors, I did run into similar issues in Python, whose docs do point to overflow as a possible cause. I couldn't find any other way of fixing this and had to reduce the size of my dataset, after which everything worked fine.
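To make the overflow suspicion concrete (my own illustration, not from the original error message): R's default integer type is 32-bit, and 140,000 x 42,000 cells is already well past that limit.

.Machine$integer.max   # 2147483647
140000 * 42000         # 5.88e9 as a double, i.e. bigger than the 32-bit integer range
140000L * 42000L       # NA, with a warning about integer overflow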
I recently ran a script that was meant to stack multiple large rasters and run a randomForest classification on the stack. I've done this numerous times with success, though it always takes up a tremendous amount of storage.
I'm aware of the ways to check and clear the temporary folder used by the raster package: rasterTmpFile(prefix = 'r_tmp_'), showTmpFiles(), removeTmpFiles(h = 24), tmpDir().
Typically, when the process is complete and I no longer need the temp files, I go to the folder and delete them. Last night the process ran and 140 GB of storage space were consumed, but there is no temp data (in the raster tmp folder, or anywhere else I've looked). These files were also not written to .tif.
I do not understand what is happening. Where is the data? How can I remove it?
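I can't say where last night's 140 GB went, but one way to keep future runs inspectable is to point raster at a temp directory you control and purge it explicitly afterwards (the path below is a placeholder):

library(raster)

# Send raster's temporary files somewhere you can watch
rasterOptions(tmpdir = "D:/raster_tmp")

# ... run the stack / classification workflow ...

# Inspect and purge temp files once the run is done
showTmpFiles()
removeTmpFiles(h = 0)   # h = 0 removes them regardless of age

# R's own session temp directory is another place large intermediates can end up
tempdir()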
I have a large data set; one of the files is 5 GB. Can someone suggest how to quickly read it into R (RStudio)? Thanks
If you only have 4 GB of RAM you cannot put 5 GB of data 'into R'. Alternatively, you can look at the 'Large memory and out-of-memory data' section of the High Performance Computing task view in R. Packages designed for out-of-memory processing, such as ff, may help you. Otherwise you can use Amazon AWS services to buy computing time on a larger machine.
My package filematrix is made for working with matrices while storing them in files in binary format. The function fm.create.from.text.file reads a matrix from a text file and stores it in a binary file without loading the whole matrix into memory. It can then be accessed by parts using the usual subscripting fm[1:4,1:3] or loaded quickly into memory as a whole with fm[].
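A minimal usage sketch based on the description above (file names are placeholders, and the argument names should be checked against ?fm.create.from.text.file):

library(filematrix)

# Convert the text file to an on-disk binary matrix without loading it all into RAM
fm <- fm.create.from.text.file(textfilename = "big_matrix.txt",
                               filenamebase = "big_matrix_fm")

# Access parts of it with ordinary subscripting, or the whole thing with fm[]
block <- fm[1:4, 1:3]
dim(fm)

close(fm)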
I need to use R to open an Excel file, which can have 1,000~10,000 rows and 5,000~20,000 columns. I would like to know whether there is any restriction on the size of this kind of Excel file in R?
Generally speaking, your limitation in using R will be how well the data set fits in memory, rather than specific limits on the size or dimension of a data set. The closer you are to filling up your available RAM (including everything else you're doing on your computer) the more likely you are to run into problems.
But keep in mind that having enough RAM to simply load the data set into memory is often a very different thing from having enough RAM to manipulate the data set, which by the very nature of R will often involve a lot of copying of objects. And this in turn leads to a whole collection of specialized R packages that allow for the manipulation of data in R with minimal (or zero) copying...
The most I can say about your specific situation, given the very limited amount of information you've provided, is that it seems likely your data will not exceed your physical RAM constraints, but it will be large enough that you will need to take some care to write smart code, as many naive approaches may end up being quite slow.
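A quick back-of-the-envelope check (assuming the cells end up as numeric doubles at 8 bytes each):

# Worst case from the question: 10,000 rows x 20,000 columns of doubles
rows <- 10000
cols <- 20000
rows * cols * 8 / 2^30   # ~1.5 GB just to hold one copy of the data, before any copies made during manipulation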
I do not see any barrier to this on the R side. Looks like a fairly modestly sized dataset. It could possibly depend on "how" you do this, but you have not described any code, so that remains an unknown.
The above answers correctly discuss the memory issue. I have recently been importing some large Excel files too. I highly recommend trying the XLConnect package to read in (and write) files.
options(java.parameters = "-Xmx1024m") # Increase the available memory for the JVM to 1 GB or more.
# This option should always be set before loading the XLConnect package.
library(XLConnect)
wb.read <- loadWorkbook("path.to.file")
data <- readWorksheet(wb.read, sheet = "sheet.name")
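If the Java heap still becomes the bottleneck, an alternative worth trying (my suggestion, not part of the answer above) is readxl, which reads xls/xlsx without a Java dependency; the path and sheet name are placeholders:

library(readxl)

# Reads a worksheet straight into a data frame (tibble)
data <- read_excel("path.to.file", sheet = "sheet.name")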