How to read large data files in R allocating memory dynamically?

Is there a way to read large data files in R allocating memory dynamically in the same way as SAS does?
I'm referring to the import of a SAS data set. While SAS has no problem reading large files because it allocates memory dynamically, R becomes unusable beyond a certain size because it loads the whole file into RAM.
I took a look at the ff package, since it keeps a pointer to the file on disk, but it doesn't have a read method for SAS data sets.
So, is there a way to read a file while allocating memory dynamically, in particular when importing a SAS data set?
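For illustration, a chunked-import sketch, assuming the haven package and its read_sas() arguments skip and n_max; the file name and chunk size are hypothetical, and each chunk is processed and released before the next one is read:

library(haven)

chunk_size <- 100000                              # hypothetical number of rows per chunk
offset <- 0
repeat {
  chunk <- read_sas("large_dataset.sas7bdat",     # hypothetical file name
                    skip = offset, n_max = chunk_size)
  if (nrow(chunk) == 0) break
  # process the chunk here, e.g. append it to an on-disk store (ffdf, database, CSV parts)
  offset <- offset + nrow(chunk)
}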

Related

RDS format weighs more than CSV for the same data frame

So, I saved a data frame in both CSV and RDS formats, but the RDS one weighs significantly more than the CSV alternative (40 GB vs. 10 GB). According to this blog:
[RDS format] creates a serialized version of the dataset and then saves it with gzip compression
So, if RDS data is compressed while the CSV is uncompressed, then why is the RDS version so much heavier? I would understand the difference if the dataset were small, but it is 140,000 by 42,000, so there shouldn't be an issue with asymptotics kicking in.
What command did you use to save the file as RDS? If you used readr::write_rds(), then the RDS file is not compressed by default.
write_rds() does not compress by default as space is generally cheaper than time.
(https://readr.tidyverse.org/reference/read_rds.html)
From this article (https://waterdata.usgs.gov/blog/formats/) it seems that uncompressed RDS files are about 20 times bigger, so this could explain the difference in size that you see.
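As a minimal sketch of the compression defaults (object and file names are hypothetical), assuming base saveRDS() and readr::write_rds():

df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))    # hypothetical data frame
saveRDS(df, "df_gzip.rds")                          # base R: gzip-compressed by default
readr::write_rds(df, "df_raw.rds")                  # readr: uncompressed by default
readr::write_rds(df, "df_gz.rds", compress = "gz")  # opt in to compression explicitly
file.size("df_raw.rds") / file.size("df_gzip.rds")  # ratio shows the effect of compression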
So, I believe this is an issue related to integer overflow in R when computing the indices of the new data frame. Although nowhere in the documentation could I find a reference to overflow as a possible cause of such errors, I did run into similar issues with Python, for which the docs indicate overflow as a possible cause. I couldn't find any other way of fixing this and had to reduce the size of my dataset, after which everything worked fine.

Load dataframe too large for system memory from jld2 file

I have a file "myfile.jld2" which contains a DataFrame, say mydata. Is it possible to load mydata in some way even though it doesn't fit into system RAM?
I only want to load it to split it up into smaller pieces and dump those to disk.

How to quickly read a large txt data file (5 GB) into R (RStudio) (Centrino 2 P8600, 4 GB RAM)

I have a large data set; one of the files is 5 GB. Can someone suggest how to quickly read it into R (RStudio)? Thanks.
If you only have 4 GB of RAM you cannot put 5 GB of data 'into R'. You can alternatively look at the 'Large memory and out-of-memory data' section of the High Performance Computing task view in R. Packages designed for out-of-memory processing, such as ff, may help you. Otherwise you can use Amazon AWS services to buy computing time on a larger computer.
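As a rough sketch of the out-of-memory approach, assuming the ff package's read.csv.ffdf() (the file name and chunk sizes are hypothetical):

library(ff)

big <- read.csv.ffdf(file = "big_file.csv", header = TRUE,
                     first.rows = 10000, next.rows = 100000)  # imported chunk by chunk
dim(big)   # the resulting ffdf keeps the data on disk, not in RAM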
My package filematrix is made for working with matrices while storing them in files in binary format. The function fm.create.from.text.file reads a matrix from a text file and stores it in a binary file without loading the whole matrix into memory. It can then be accessed by parts using the usual subscripting fm[1:4, 1:3] or loaded quickly into memory as a whole with fm[].
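A minimal sketch of that workflow, using the function names given above (file names and argument values are hypothetical):

library(filematrix)

fm <- fm.create.from.text.file(textfilename = "big_matrix.txt",
                               filenamebase = "big_matrix_fm")  # import without loading into RAM
fm[1:4, 1:3]   # read a small block from disk
whole <- fm[]  # or load the full matrix into memory when it fits
close(fm)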

How to tell whether the given file was saved by `saveRDS` or by `save` without loading it?

I work in an environment where we depend heavily on Excel for the statistical work. We have our own Excel workbooks that create reports and charts and compute the models. But sometimes Excel is not enough, so we would like to use R to augment the data processing.
I am developing a fairly universal, low-level Excel workbook that is capable of converting our data structures stored in an Excel workbook into R using rcom and RExcel macros. Because the data are large, porting them into R is lengthy (in terms of the time a user needs to wait after pressing F9 to recalculate the workbook), so I started to add caching capabilities to my Excel workbook.
Caching is achieved by embedding an extra attribute in the saved object(s): a function that checks whether the mtime of the Excel workbook holding the data structure has changed since the R object was created. Additionally, the template supports saving the objects to disk, so next time it is not mandatory to use the workbook and the original Excel data structures at all when doing calculations that mostly involve R.
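A minimal sketch of that caching idea (the helper, object, and file names are hypothetical, not the actual workbook code):

make_cached <- function(obj, workbook_path) {
  stamp <- file.mtime(workbook_path)   # record the workbook's mtime at creation time
  attr(obj, "is_current") <- function() file.mtime(workbook_path) == stamp
  obj
}

cached <- make_cached(my_data, "data/model_inputs.xlsx")
attr(cached, "is_current")()   # FALSE once the workbook has been saved again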
Although in most cases the user wouldn't care, internally it is sometimes more natural to save the data into one R object (like a data.frame), and sometimes saving a whole set of multiple R objects seems more intuitive.
When saving a single R object, saveRDS is more convenient, so I prefer it over save, which is oriented towards multiple objects. (I know that I can always turn multiple objects into one by combining them in a list.)
According to the manual, the file generated by save has its first 5 bytes equal to the ASCII representation of RDX2\n. Is there any ready-made function to test for that, or should I manually open the file as binary, read the 5 bytes, trap the corner case where the file doesn't even have 5 bytes, close the file, etc.?
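A minimal sketch of such a check (the helper name is hypothetical): both save() and saveRDS() gzip-compress by default, so the bytes are read through a gzfile() connection, which also handles uncompressed files transparently:

is_save_file <- function(path) {
  con <- gzfile(path, "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 5)
  # save() files start with RDX2\n or RDX3\n; saveRDS() streams start with X\n, A\n, or B\n
  length(magic) == 5 && rawToChar(magic[1:3]) == "RDX"
}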

Restriction on the size of an Excel file

I need to use R to open an Excel file, which can have 1000~10000 rows and 5000~20000 columns. I would like to know whether there is any restriction on the size of this kind of Excel file in R.
Generally speaking, your limitation in using R will be how well the data set fits in memory, rather than specific limits on the size or dimension of a data set. The closer you are to filling up your available RAM (including everything else you're doing on your computer) the more likely you are to run into problems.
But keep in mind that having enough RAM simply to load the data set into memory is often a very different thing than having enough RAM to manipulate it, which by the very nature of R will often involve a lot of copying of objects. And this in turn leads to a whole collection of specialized R packages that allow for the manipulation of data in R with minimal (or zero) copying...
The most I can say about your specific situation, given the very limited amount of information you've provided, is that it seems likely your data will not exceed your physical RAM constraints, but it will be large enough that you will need to take some care to write smart code, as many naive approaches may end up being quite slow.
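As a quick back-of-the-envelope check in R (dimensions taken from the question, assuming 8-byte numeric cells):

10000 * 20000 * 8 / 1024^3   # ~1.5 GB for the largest stated size, before any copies made during manipulation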
I do not see any barrier to this on the R side. Looks like a fairly modestly sized dataset. It could possibly depend on "how" you do this, but you have not described any code, so that remains an unknown.
The above answers correctly discuss the memory issue. I have recently been importing some large Excel files too. I highly recommend trying out the XLConnect package to read in (and write) files.
options(java.parameters = "-Xmx1024m") # Increase the memory available to the JVM to 1 GB or more.
# This option must always be set before loading the XLConnect package.
library(XLConnect)
wb.read <- loadWorkbook("path.to.file")
data <- readWorksheet(wb.read, sheet = "sheet.name")
