R running out of memory over a large file

I had a unique problem yesterday when trying to read a large .csv file into memory.
The file itself is 9 GB, with a bit more than 80 million rows and 10 columns.
It loaded perfectly and took up around 7 GB in memory on a remote machine with 128 GB of RAM.
My problem is that I want to work on the data on a local machine that only has 32 GB of RAM.
I tried reading it with data.table::fread, but R crashes when it uses up all of the machine's memory.
Is there a safer way of reading the data that won't crash R?
Is this a known issue? Could something be wrong with the machine?
Both machines are running Windows 7 Enterprise.
EDIT:
Saving and reading the data as an RDS file worked, but I still want to be able to use just one computer for the entire job.
Is there any other way to read the data directly from the .csv file?
I don't want to report a bug in data.table unless I am sure this is an issue with fread and not something local.
Any other ideas?
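One commonly suggested direction is to limit what fread pulls into memory in the first place. A rough sketch, with placeholder column names (select, colClasses, nrows and skip are standard fread arguments; whether the result fits in 32 GB depends on which columns are actually needed):

library(data.table)

# Read only the columns needed for the analysis, with their classes given up
# front so fread doesn't have to guess; "id", "date" and "value" are placeholder names.
dt <- fread("big_file.csv",
            select     = c("id", "date", "value"),
            colClasses = list(integer = "id", character = "date", numeric = "value"))

# For a true chunked pass over all of the rows, the same call can be repeated with
# skip = and nrows = , keeping only a per-chunk summary instead of the raw rows.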

Related

Problem loading large .RData file: Error reading from connection

I have an .RData file that is rather large. It takes up 1.10 GB on my hard drive and contains a data frame with 100 variables and 12 million observations. When I try to load it, I can open Task Manager and watch the memory usage climb all the way to 7450 MB, at which point my RAM is completely exhausted and I get "Error reading from connection." I'm pretty sure this memory shortage is the problem, but how can that be? Like I said, the .RData file is only 1.10 GB.
I'm using R x64 4.0.5. If it's any clue, when I open the 32-bit version of R (4.0.5) it tells me "Error: memory exhausted (limit reached?)", reinforcing my suspicion that this is a memory issue.
I am unable to access the data any other way; I have to make the .RData file work or it's gone. Why does R require more than 8 GB of RAM to load a 1 GB workspace?
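As a rough sanity check (assuming mostly numeric columns), the in-memory size of that data frame is far larger than the 1.10 GB file, because .RData files are compressed by default:

# 12 million observations x 100 variables x 8 bytes per double
12e6 * 100 * 8 / 2^30   # ~8.9 GiB of raw column storage, before any copies made during loading

So running out of 8 GB of RAM while loading a 1.10 GB compressed workspace is entirely consistent; the binding constraint is the uncompressed in-memory size, not the file size on disk.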

Efficient switching between 32bit and 64bit R versions

I am working with large datasets that are available in *.mdb (i.e. Access database) format. I am using the RODBC R package to extract data from the Access database. I found out that I have 32-bit Office installed on my machine, which seems to mean I can only use 32-bit R to connect to the database via RODBC. After reading the data in 32-bit R and doing some exploratory analysis (plotting, summaries, regression), I ran into memory issues that I didn't have when using 64-bit R.
Currently I am using RStudio to run all my code, and I can change the version of R it uses from Options >> Global Options >> R version:
However, I don't want to switch to 32-bit to read the Access database with RODBC and then go back into RStudio to revert to 64-bit for the analysis. Is there an automated solution that lets me specify 32-bit or 64-bit? Can this be done with a batch file? If anyone could shed some light, that would be great.
Write the code that extracts the data as one R script, and have that script save the output you need for your analysis to an .RData file.
Write the analysis code as a second script, to be run in 64-bit R. Using the answer found here, launch the extraction script under 32-bit R from within it; the next line can then read the data back in from the .RData file. If you need to allow time for the extraction to finish, use Sys.sleep to have the second script wait a few seconds before loading.
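A minimal sketch of that two-script pattern, assuming the standard Windows R layout with both architectures installed; the install path, database path and table name below are placeholders, not taken from the question:

## extract.R -- run under 32-bit R; pulls the table out of Access and saves it
library(RODBC)
ch  <- odbcConnectAccess("C:/data/mydb.mdb")    # hypothetical .mdb path
dat <- sqlFetch(ch, "MyTable")                  # hypothetical table name
odbcClose(ch)
save(dat, file = "C:/data/extract.RData")

## analysis.R -- run under 64-bit R (e.g. from RStudio)
rscript32 <- "C:/Program Files/R/R-3.3.1/bin/i386/Rscript.exe"  # hypothetical version number
system2(rscript32, args = "C:/data/extract.R", wait = TRUE)     # blocks until extraction finishes
load("C:/data/extract.RData")
# ... exploratory analysis, plots, regressions on dat ...

Because system2() waits for the child process to finish by default, the Sys.sleep step becomes unnecessary with this approach.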

Loading .dta data into R takes a long time

Some confidential data is stored on a server and accessible for researchers via remote access.
Researchers can log in via some (I think Cisco) remote client and share virtual machines on the same host.
There's 64-bit Windows running on the virtual machine.
The system appears to be optimized for Stata; I'm among the first to use the data with R. There is no RStudio installed on the client, just RGui 3.0.2.
And here's my problem: the data is saved in the Stata format (.dta), and I need to open it in R. At the moment I am doing
library(foreign)
read.dta(fileName, convert.factors = FALSE)[fields]
Loading a smaller file (around 200 MB) takes 1-2 minutes. However, loading the main file (3-4 GB) takes very long, longer than my patience lasted, and during that time the R GUI stops responding.
I can test my code on my own machine (OS X, RStudio) on a smaller data sample, and that all works fine. Is the difference because of OS X + RStudio, or only because of the size of the file?
A colleague is using Stata on a similar file in the same environment, and that works fine for him.
What can I do to improve the situation? Possible solutions I came up with are:
Load the data into R differently (perhaps there is a way that doesn't require all this memory; see the sketch after this list). I also have access to Stata, so if all else fails I could prepare the data there, for example slice it into smaller pieces and reassemble them in R.
Ask them to allocate more memory to my user of the VM (if that is indeed the issue).
Ask them to provide RStudio (even if that's not faster, perhaps it's less prone to crashes).
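As a sketch of the first option: if a newer Stata reader can be installed on the server (which is an assumption here), haven::read_dta can restrict both columns and rows at read time, instead of loading the whole file and subsetting afterwards. `fields` is the same character vector of variable names used above:

library(haven)

# Read only the variables in `fields`, and cap the row count while prototyping;
# col_select, n_max (and skip) are arguments of read_dta.
d <- read_dta(fileName, col_select = all_of(fields), n_max = 1e6)
d <- as.data.frame(d)   # drop the tibble/labelled attributes if plain data frames are preferred

# Dropping n_max reads the full file, still restricted to the selected columns.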
Certainly the size of the file is a prime factor, but the machine and configuration might be, too. It is hard to tell without more information. You need a 64-bit operating system and a 64-bit version of R.
I don't imagine that RStudio will help or hinder the process.
If the process scales linearly, your big-data case will take (120 seconds) * (4096 MB / 200 MB) = 2458 seconds, or about 41 minutes. Is that how long you waited? The process might well not be linear, though.
Was the process making progress? If you checked CPU and memory, was it still running? Was it doing a lot of page swapping?

What is a good way to get an in-memory cache with data.table?

Let's say I have a 4 GB dataset on a server with 32 GB of RAM.
I can read all of that into R, make a data.table global variable and have all of my functions use that global as a kind of in-memory database. However, when I exit R and restart, I have to read it from disk again. Even with smart disk caching strategies (save/load or R.cache) I get a delay of 10 seconds or so getting the data back in. Copying the data takes about 4 seconds.
Is there a good way to cache this in memory in a way that survives the exit of an R session?
A couple of things come to mind: Rserve, redis/rredis, memcached, multicore ...
Shiny Server and RStudio Server also seem to have ways of solving this problem.
But then again, it seems to me that perhaps data.table could provide this functionality, since it appears to move data outside of R's memory block anyway. That would be ideal, in that it wouldn't require any data copying or restructuring.
Update:
I ran some more detailed tests and I agree with the comment below that I probably don't have much to complain about.
But here are some numbers that others might find useful. I have a 32 GB server. I created a data.table of 4 GB. According to gc(), and also looking at top, it appeared to use about 15 GB of peak memory, and that includes making one copy of the data. That's pretty good, I think.
I wrote the object to disk with save(), deleted it and used load() to recreate it. This took 17 and 10 seconds respectively.
I did the same with the R.cache package, and that was actually slower: 23 and 14 seconds.
However, both of those reload times are quite fast. The load() method gave me a 357 MB/s transfer rate. By comparison, an in-memory copy took 4.6 seconds. This is a virtual server, and I'm not sure what kind of storage it has or how much that read speed is influenced by the cache.
Very true: data.table hasn't got to on-disk tables yet. In the meantime, some options are:
Don't exit R. Leave it running on a server and use svSocket's evalServer() to talk to it, as the video on the data.table homepage demonstrates. Or use the other, similar options you mentioned.
Use a database for persistence, such as SQL or any NoSQL database.
If you have large delimited files, then some people have recently reported that fread() appears (much) faster than load(). But also experiment with save(..., compress = FALSE). In addition, we've just pushed fwrite to the most current development version (1.9.7; use devtools::install_github("Rdatatable/data.table") to install), which has some reported write times on par with native save. See the sketch after this list.
There are also the ff, bigmemory and sqldf packages. See the HPC Task View, the "Large memory and out-of-memory data" section.
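A rough sketch of that comparison (the toy table and the file names below are placeholders; actual timings depend entirely on the storage behind the server):

library(data.table)
dt <- data.table(x = runif(5e7), g = sample(letters, 5e7, replace = TRUE))   # roughly 0.8 GB toy table

system.time(save(dt, file = "dt.RData", compress = FALSE))   # uncompressed .RData
system.time(load("dt.RData"))

system.time(fwrite(dt, "dt.csv"))          # fwrite: data.table 1.9.7+ (development version at the time)
system.time(dt2 <- fread("dt.csv"))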
In enterprises where data.table is being used, my guess is that it is currently mostly being fed with data from some other persistent database. Those enterprises probably:
use 64-bit R with, say, 16 GB, 64 GB or 128 GB of RAM. RAM is cheap these days. (But I realise this doesn't address persistence.)
The internals have been written with on-disk tables in mind. But don't hold your breath!
If you really need to exit R between computation sessions for some strange reason, and the server is not restarted, then just make a 4 GB ramdisk and store the data there. Loading data from RAM to RAM is much faster than from any SAS or SSD drive :)
This can be set up pretty easily on Linux by adding a line like this to /etc/fstab:
none /data tmpfs nodev,nosuid,noatime,size=5000M,mode=1777 0 0
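From R, using the ramdisk is then just a matter of pointing the save path at it; a small sketch (the /data mount point matches the fstab line above, the file name is a placeholder):

# Write uncompressed to the tmpfs mount and read it back in a later session.
saveRDS(dt, "/data/dt.rds", compress = FALSE)
dt <- readRDS("/data/dt.rds")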
Depending on what your dataset looks like, you might consider using the ff package. If you save your dataset as an ffdf, it is stored on disk but you can still access the data from R.
ff objects have a virtual part and a physical part. The physical part is the data on disk; the virtual part gives you information about the data.
To load this dataset in R, you only load the virtual part, which is a lot smaller, maybe only a few kB, depending on whether you have a lot of factor data. So this loads your data into R in a matter of milliseconds instead of seconds, while you still have access to the physical data for your processing.
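A small sketch of what that looks like in practice, assuming the data starts life as a csv that read.csv.ffdf can import (the file name and chunk size are placeholders):

library(ff)

# Import the csv into an ffdf: the physical data ends up in on-disk ff files,
# while only the small virtual metadata stays in RAM.
dat <- read.csv.ffdf(file = "big_file.csv", header = TRUE, next.rows = 500000)

dim(dat)        # answered from the virtual part, no big read needed
dat[1:10, ]     # materialises just those rows as an ordinary data.frame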

Where does R store temporary files

I am running some basic data manipulation on a MacBook Air (4 GB memory, 120 GB HD with 8 GB available). My input file is about 40 MB, and I don't write anything to disk until the end of the process. However, in the middle of the process, my Mac says there's no memory left to run. I checked the hard drive and found there's only about 500 MB left.
So here are my questions:
How is it possible that R filled up my disk so quickly? My understanding is that R stores everything in memory (unless I explicitly write something out to disk).
If R does write temporary files to disk, how can I find them and delete them?
Thanks a lot.
Update 1: error message I got:
Force Quit Applications: Your Mac OS X startup disk has no more space available for
application memory
Update 2: I checked tempdir() and it shows "var/folders/k_xxxxxxx/T//Rtmpdp9GCo", but I can't locate this directory from Finder.
Update 3: After unlink(tempdir(), recursive = TRUE) in R and restarting my computer, I got my disk space back. I still would like to know what R writes to my hard drive, so I can avoid similar situations in the future.
Update 4: My main object is about 1 GB. I used Activity Monitor to track the process; while memory usage is about 2 GB, disk activity is extremely high: data read 14 GB, data written 44 GB. I have no idea what R is writing.
R writes to a temporary per-session directory which it also cleans up at exit.
It follows convention and respects TMP and related environment variables.
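A quick way to see what the current session has put there, and to clean it up without restarting (all base R; the size line is just an illustration):

tempdir()                                   # this session's temporary directory
Sys.getenv(c("TMPDIR", "TMP", "TEMP"))      # where R was told to put it
f <- list.files(tempdir(), full.names = TRUE, recursive = TRUE)
sum(file.size(f)) / 2^20                    # MB currently used by this session
unlink(f, recursive = TRUE)                 # delete them; R recreates temp files as needed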
What makes you think that disk space has anything to do with this? R needs all objects held in memory, not on disk (by default; there are add-on packages that allow a subset of operations on on-disk files too big to fit into RAM).
One of the steps in the "process" is causing R to request a chunk of RAM from the OS so it can continue. The OS could not comply, and so R terminated the "process" you were running with the error message you failed to give us. [Hint: it would help if you showed the actual error, not your paraphrasing of it. Some inkling of the code you were running would also help. 40 MB on disk sounds like a reasonably large file; how many rows/columns, etc.? How big is the object within R, according to object.size()?]
